Checking and manipulating FASTQ files
Most modern sequencers produce FASTQ
files as output. FASTQ is a modified version of the
traditional FASTA
format. FASTQ files are ASCII text files that
encode both the nucleotide calls and 'quality
information', which describes the
confidence of each base call. The format uses 4
lines for each read produced by the sequencer. FASTQ
files are normally given the file extension ".fq" or ".fastq".
A typical file looks something like this:
@SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
TTGCCTGCCTATCATTTTAGTGCCTGTGAGGTGGAGATGTGAGGATCAGT
+SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
hhhhhhhhhhghhghhhhhfhhhhhfffffe`ee[`X]b[d[ed`[Y[^Y
@SRR566546.971 HWUSI-EAS1673_11067_FC7070M:4:1:2374:1108 length=50
GATTTGTATGAAAGTATACAACTAAAACTGCAGGTGGATCAGAGTAAGTC
+SRR566546.971 HWUSI-EAS1673_11067_FC7070M:4:1:2374:1108 length=50
hhhhgfhhcghghggfcffdhfehhhhcehdchhdhahehffffde`bVd
@SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50
TGCATGATCTTCAGTGCCAGGACCTTATCAAGCGGTTTGGTCCCTTTGTT
+SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50
dhhhgchhhghhhfhhhhhdhhhhehhghfhhhchfddffcffafhfghe
...
The example above encodes 3 reads (each uses 4 lines to
report information). Each read has:
1. Header line - This line must start with "@",
followed by the name of the read (which should be unique).
2. The nucleotide sequence.
3. A second header line - This line must start with
"+". Usually the rest of the line repeats the
first header line, but it can also be left blank (the "+" is
still required).
4. Quality information - For each nucleotide in the
sequence, an ASCII-encoded quality score is
reported. Higher quality scores
indicate that the base call is reliable, while lower quality
scores reflect uncertainty about the true identity of the
base.
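
To make the four-line layout concrete, here is a minimal Python sketch of a reader that follows the rules listed above. It assumes a strict 4-lines-per-read file (no wrapped sequence lines); the names FastqRecord and read_fastq, and the two short reads in the example, are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # header text without the leading "@"
    sequence: str  # nucleotide calls
    quality: str   # ASCII-encoded quality string, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a strict 4-lines-per-read FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"header line must start with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"third line must start with '+': {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical two-read example (not real sequencer output).
example = io.StringIO("@read1\nACGT\n+read1\nhhhg\n@read2\nGGCA\n+\nhfeh\n")
for record in read_fastq(example):
    print(record.name, record.sequence, record.quality)
```

Reading by line position rather than by looking for "@" matters because "@" is also a valid quality character, so a quality line can legitimately start with it.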
These files represent the primary data generated by the
sequencer, and will be requested by other researchers after
you publish your study. Do not lose or modify these
files!!! If you are using Life Tech/ABI SOLiD
sequencers, the data may be returned as a color space FASTQ
file (usually with a "*.csfastq" extension).
Duplicate and archive
The most important thing you can do once you get
primary data off the sequencer is to back it up. If
you lose a mapping file (sam/bam), it's not a big deal as
you can always remap your data, but if you lose the raw
sequencing file (i.e. FASTQ), you're in trouble, and may
have to repeat the experiment. Therefore, it is
critical that you back up your files to a secure server
that is [ideally] in a different physical location from
your primary copy. Storing a single copy on a RAID
device does NOT count...
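
A backup is only useful if it is an exact copy, so it can be worth comparing checksums after copying. The Python sketch below is one way to do that; the choice of SHA-256 and the function names sha256_of and backup_matches are assumptions made for this example, not part of any particular backup tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MB chunks so large FASTQ files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original: Path, backup: Path) -> bool:
    """True if both files have identical contents (compared by SHA-256)."""
    return sha256_of(original) == sha256_of(backup)
```

Storing the digest next to the archived file also lets you confirm later that the file has not been modified.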
The other underappreciated task is to annotate your
files. This means giving them proper names, and
including information about how the experiment was done,
cell type, etc. so that when you find the file later, you
know what the experiment was. I would strongly
recommend including in the file name a date, the cell
type, the treatment, the type of experiment, and the
lab/scientist who performed the experiment. For
example, consider naming them like this:
Lab-YYMMDD-Cell-Txn-expID.fastq
ChuckNorrisLab-120927-Bcell-LPS-0123.fastq
Also, it's probably a good idea to compress the files so that
they take up less space. For example:
gzip *.fq *.fastq
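
Compressed files do not need to be unzipped again just to inspect them. As a rough Python sketch, the hypothetical helper below (count_reads is a name made up here) counts the reads in either a plain or a gzip-compressed FASTQ file, assuming the strict 4-lines-per-read layout described above.

```python
import gzip
from pathlib import Path


def count_reads(path: Path) -> int:
    """Count reads in a plain or gzip-compressed FASTQ file (4 lines per read)."""
    # Pick the opener from the file extension; ".gz" means gzip-compressed.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    if line_count % 4 != 0:
        raise ValueError(f"{path} has {line_count} lines, not a multiple of 4")
    return line_count // 4
```

A line count that is not a multiple of 4 usually points to a truncated or damaged file, which is worth catching before mapping.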
Removing barcodes
Depending on your sequencing strategy, you may need to
remove parts of the sequence that are not
biologically meaningful. For example, if you
sequence short RNAs that are 15-40 nucleotides long
using 50 nucleotide reads, each read will run past the
end of the RNA and into the adapter used in
sequencing library construction.
Quality value encoding
Different versions of the Illumina pipeline (from
back in the day) can produce different encodings of
quality. A discussion of these differences can be
found here.
In general, recent implementations of the Illumina
pipeline output Sanger-style quality encoding, so you
should not have to worry much about it. Many programs,
such as bowtie for read mapping, have options to specify
which style of encoding is used.
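
The practical difference between the encodings is the ASCII offset: Sanger-style files store each score as the character with code (score + 33), while older Illumina pipelines used (score + 64). The Python sketch below decodes a quality string for a given offset and makes a rough guess at the offset from the characters seen. The function names and the cutoffs in guess_offset are assumptions chosen for illustration; the guess is only a heuristic and can be wrong for unusual data.

```python
from typing import List, Optional


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to integer quality scores."""
    return [ord(char) - offset for char in quality]


def guess_offset(quality_strings: List[str]) -> Optional[int]:
    """Heuristic guess of the ASCII offset (33 or 64), or None if unclear."""
    codes = [ord(char) for q in quality_strings for char in q]
    if min(codes) < 59:
        return 33  # characters this low do not occur in +64 files
    if max(codes) > 74:
        return 64  # above 'J' (score 41 at +33), assumed to be +64 data
    return None    # every character is plausible under either offset
```

For instance, the reads in the example at the top of this page are mostly the character 'h' (ASCII 104). With an offset of 64 that decodes to a score of 40, whereas an offset of 33 would give 71, so that file looks like the older +64 encoding.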
Performing quality controls on FASTQ files
It's a good idea to perform a general quality
control check on your sequence files - this can help
indicate if there were any major technical issues with
your sequencing. A nice tool for this is FASTQC,
developed by Simon Andrews at the Babraham
Institute. Go here
to download and install the appropriate version of
FASTQC. After unzipping it, add the main FASTQC
directory to your executable path for ease of use.
Usually, the easiest way to run FASTQC is on the command
line:
mkdir OutputDirectory/
fastqc -o OutputDirectory/ inputFile.fastq
Unfortunately, it will complain if you do not create the
output directory ahead of time. FASTQC will
produce several reports that help you
understand how your sequencing went.
Manipulating FASTQ files
Since most of the applications covered here
involve "re-sequencing" of a known genome, the quality
information about each base is not terribly important
(it's more important when trying to identify SNPs or for de
novo genome assembly). However, sometimes, as a read
is sequenced, errors start to appear and the reliability
of the sequence goes down. In these cases it's best
to trim or remove these sequences before mapping to improve
downstream analysis.
An excellent resource for the manipulation of FASTQ files
is the FASTX
program suite. These programs can be very
useful. In addition, some mapping tools (e.g.
bowtie) have options that perform on-the-fly FASTQ
manipulation, such as trimming from the 3' end.
Trimming adapter sequences from your fastq files
If you perform short RNA sequencing or another
type of experiment where the functional sequences you
are measuring might be shorter than the read length, it
is likely that the 3' end of the read will contain adapter
sequence from the Illumina library preparation rather than
relevant biological/genomic sequence. You MUST
MUST MUST remove this sequence before
trying to map or assemble the reads. The FASTX
program fastx_clipper can perform adapter
clipping, as can the HOMER utility homerTools.
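
To show the idea behind adapter clipping, here is a simplified Python sketch. It is not how fastx_clipper or homerTools work internally: it only finds exact matches (no mismatches or indels), and the function name clip_adapter and the min_overlap parameter are assumptions made for this example.

```python
from typing import Tuple


def clip_adapter(sequence: str, quality: str, adapter: str,
                 min_overlap: int = 5) -> Tuple[str, str]:
    """Remove a 3' adapter by exact matching; mismatches are not tolerated."""
    # Case 1: the whole adapter occurs somewhere in the read.
    position = sequence.find(adapter)
    if position == -1:
        # Case 2: the read ends partway through the adapter, so look for the
        # longest adapter prefix (at least min_overlap bases) at the read's end.
        longest = min(len(adapter) - 1, len(sequence))
        for length in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:length]):
                position = len(sequence) - length
                break
    if position == -1:
        return sequence, quality  # no adapter found, read is unchanged
    # Clip the sequence and its quality string at the same position.
    return sequence[:position], quality[:position]
```

The quality string has to be clipped together with the sequence, otherwise the record no longer has one quality character per base. Real tools also tolerate sequencing errors inside the adapter, which this sketch does not.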
Trimming sequences based on quality scores
If your reads are very long, you may want to trim
sequences where the quality scores took a dive.
This may be necessary for 100 bp reads if the last 20
bp are all random base calls. In this case the
read may be hard to map since the final 20 bp will be
largely wrong. The FASTX tool fastq_quality_trimmer
is useful for this purpose.
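
The basic idea of 3' quality trimming can be sketched in a few lines of Python. This is a simplified illustration rather than the actual fastq_quality_trimmer algorithm; the function name trim_low_quality_tail, the default threshold of 20 and the default offset of 33 are assumptions for this example.

```python
from typing import Tuple


def trim_low_quality_tail(sequence: str, quality: str,
                          threshold: int = 20,
                          offset: int = 33) -> Tuple[str, str]:
    """Drop bases from the 3' end until one meets the quality threshold."""
    keep = len(quality)
    # Walk backwards from the 3' end while the score is below the threshold.
    while keep > 0 and ord(quality[keep - 1]) - offset < threshold:
        keep -= 1
    return sequence[:keep], quality[:keep]
```

After trimming, it usually makes sense to discard reads that have become too short to map uniquely. The offset must match the quality encoding of the file (33 for Sanger-style, 64 for older Illumina files).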