Software for motif discovery and next-gen sequencing analysis

Checking and manipulating FASTQ files

Most modern sequencers produce FASTQ files as output, a format derived from the traditional FASTA format.  FASTQ files are ASCII text files that encode both the nucleotide calls and 'quality information', which describes the confidence of each nucleotide call.  The FASTQ format uses 4 lines for each read produced by the sequencer.  FASTQ files are normally given the file extension ".fq" or ".fastq".  A typical file looks something like this:

@SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
+SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
@SRR566546.971 HWUSI-EAS1673_11067_FC7070M:4:1:2374:1108 length=50
+SRR566546.971 HWUSI-EAS1673_11067_FC7070M:4:1:2374:1108 length=50
@SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50
+SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50

The example above encodes 3 reads (each uses 4 lines to report information).  Each read has:
1. Header line - This line must start with "@", followed by the name of the read (should be unique)
2. The nucleotide sequence
3. A second header line - This line must start with "+".  Usually, the information is the same as in the first header line, but it can also be blank (The "+" is still required though)
4. Quality Information - For each nucleotide in the sequence, an ASCII encoded quality score is reported.  The idea is that higher quality scores indicate the base is reliably reported, while poor quality scores reflect uncertainty about the true identity of the base.
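
The 4-line layout described above can be checked programmatically.  The following is a minimal Python sketch, not part of any particular tool: the names FastqRecord and read_fastq are hypothetical, and the two reads in the example are made up for illustration.  It assumes each read occupies exactly 4 lines (no wrapped sequences).

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # read name, without the leading "@"
    sequence: str  # nucleotide calls
    quality: str   # ASCII-encoded quality string, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per 4-line block, validating the layout as we go."""
    while True:
        first = handle.readline()
        if not first:
            return  # an empty string means end of file
        lines = [first] + [handle.readline() for _ in range(3)]
        header, sequence, plus, quality = (line.rstrip("\r\n") for line in lines)
        if not header.startswith("@"):
            raise ValueError(f"header line must start with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"third line must start with '+': {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up example data: two reads, the second with a blank "+" line.
example = "@read1 made-up\nACGTN\n+read1 made-up\nIIII#\n@read2\nTTGA\n+\nII#I\n"
records = list(read_fastq(io.StringIO(example)))
for record in records:
    print(record.name, len(record.sequence))
```

For real files, pass an open file handle instead of the io.StringIO object used here.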

These files represent the primary data generated by the sequencer, and will be requested by other researchers after you publish your study.  Do not lose or modify these files!!!  If you are using Life Tech/ABI SOLiD sequencers, the data may be returned as a color space FASTQ file (usually with a "*.csfastq" extension).

Duplicate and archive

The most important thing you can do once you get primary data off the sequencer is back it up.  If you lose a mapping file (sam/bam), it's not a big deal as you can always remap your data, but if you lose the raw sequencing file (i.e. FASTQ), you're in trouble, and may have to repeat the experiment.  Therefore, it is critical that you back up your files to a secure server that is [ideally] in a different physical location from your primary copy.  Storing a single copy on a RAID device does NOT count...
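
One way to confirm that a backup copy is intact is to compare checksums of the original and the copy.  The following is a minimal Python sketch under that assumption; the function names sha256_of and backup_matches are hypothetical, and the demo uses a tiny made-up file in a temporary directory rather than real data.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original, backup) -> bool:
    """True if both files have identical contents (same SHA-256 digest)."""
    return sha256_of(original) == sha256_of(backup)


# Demo with a made-up file: copy it, verify, then simulate a damaged copy.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "reads.fastq"
    backup = Path(tmp) / "reads.backup.fastq"
    original.write_text("@read1\nACGT\n+\nIIII\n")
    shutil.copy(original, backup)
    intact_copy_ok = backup_matches(original, backup)
    backup.write_text("@read1\nACGA\n+\nIIII\n")  # one base changed
    damaged_copy_ok = backup_matches(original, backup)

print(intact_copy_ok, damaged_copy_ok)
```

Reading in chunks keeps memory use small even for very large FASTQ files.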

The other underappreciated task is to annotate your files.  This means giving them proper names, and including information about how the experiment was done, cell type, etc. so that when you find the file later, you know what the experiment was.  I would strongly recommend that the file name include the date, the cell type, the treatment, the type of experiment, and the lab/scientist who performed the experiment.  For example, consider naming them like this:

Also, it's probably a good idea to compress the files so that they take up less space.  For example:
gzip *.fq *.fastq
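
Compressed files can usually be read without unzipping them first.  As a minimal sketch, the Python snippet below counts the reads in a gzip-compressed FASTQ file using the standard gzip module; the function name count_reads is hypothetical, the demo file is made up, and the count assumes exactly 4 lines per read.

```python
import gzip
import tempfile
from pathlib import Path


def count_reads(path) -> int:
    """Count reads in a gzip-compressed FASTQ file (assumes 4 lines per read)."""
    with gzip.open(path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    if line_count % 4 != 0:
        raise ValueError(f"{path}: line count {line_count} is not a multiple of 4")
    return line_count // 4


# Demo: write a small made-up compressed file, then count its reads.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.fastq.gz"
    with gzip.open(path, "wt") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nII#I\n")
    demo_count = count_reads(path)

print(demo_count)
```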

Removing barcodes

Depending on your sequencing strategy, you may need to remove certain parts of the sequence that are not biologically meaningful.  For example, if you sequence short RNAs that are 15-40 nt long using 50 nucleotide reads, the read will run past the end of the RNA sequence and into the adapter used in sequencing library construction.

Quality value encoding

Different versions of the Illumina pipeline (from back in the day) can produce different quality encodings.  A discussion of these differences can be found here.  In general, recent implementations of the Illumina pipeline output Sanger-style quality encoding, so you shouldn't have to worry much about it.  Many programs, such as bowtie for read mapping, have options to specify which style of encoding is used.
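
To make the encoding difference concrete: Sanger-style files store each quality score Q as the character with ASCII code Q + 33, while some older Illumina pipelines used Q + 64.  The Python sketch below decodes a quality string under either offset and converts a score to an error probability (p = 10^(-Q/10)).  The function names are hypothetical, and guess_offset is only a rough heuristic (characters below ASCII 59 cannot occur in the +64 encodings), not a definitive test.

```python
from typing import Iterable, List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to Phred scores (33 = Sanger-style)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def guess_offset(quality_strings: Iterable[str]) -> int:
    """Heuristic guess of the offset from the lowest character seen.

    Returns 33 if any character is below ASCII 59, otherwise 64.  A file
    containing only high-quality Sanger-style scores would be misjudged.
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    return 33 if lowest < 59 else 64


print(decode_quality("II#"))      # Sanger-style (+33) string
print(decode_quality("hhB", 64))  # older Illumina-style (+64) string
print(error_probability(20))
```

In practice, check enough reads before trusting the guess, and prefer the encoding stated by the sequencing facility when it is known.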

Performing quality controls on FASTQ files

It's a good idea to perform a general quality control check on your sequence files - this can help indicate if there were any major technical issues with your sequencing.  A nice tool for this is FASTQC, developed by Simon Andrews at the Babraham Institute.  Go here to download and install the appropriate version of FASTQC.  After unzipping it, add the main FASTQC directory to your executable path for ease of use.

Usually, the easiest way to run FASTQC is on the command line:
mkdir OutputDirectory/
fastqc -o OutputDirectory/ inputFile.fastq
Unfortunately, it will complain if you do not create the output directory ahead of time.  FASTQC produces several reports that help you understand how your sequencing went.

Manipulating FASTQ files

Since most of the applications covered here involve "re-sequencing" of a known genome, the quality information about each base is not terribly important (it's more important when trying to identify SNPs or for de novo genome assembly).  However, sometimes, as a read is sequenced, errors start to appear and the reliability of the sequence goes down.  In these cases it's best to trim or remove the unreliable sequence before mapping to improve downstream analysis.

An excellent resource for the manipulation of FASTQ files is the FASTX program suite.  In addition, some mapping tools (e.g. bowtie) have options that perform on-the-fly FASTQ manipulation, such as trimming from the 3' end.

Trimming adapter sequences from your fastq files

If you perform short RNA sequencing or another type of experiment where the functional sequences you are measuring might be shorter than the read length, it is likely that the 3' end of the read will contain adapter sequence from the Illumina library preparation, and not relevant biological/genomic sequence.  You MUST MUST MUST remove this sequence before trying to map or assemble the reads.  The FASTX program fastx_clipper can perform adapter clipping, as can the HOMER utility homerTools.
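
To illustrate what adapter clipping does, here is a minimal Python sketch.  It is not how fastx_clipper or homerTools are implemented: it only handles exact matches (real tools typically tolerate mismatches), the function name clip_adapter and the min_overlap parameter are hypothetical, and the adapter string is a placeholder to be replaced with the one from your library kit.

```python
from typing import Tuple

ADAPTER = "TGGAATTCTCGG"  # placeholder adapter, for illustration only


def clip_adapter(sequence: str, quality: str, adapter: str = ADAPTER,
                 min_overlap: int = 5) -> Tuple[str, str]:
    """Remove the adapter and everything after it from the 3' end of a read."""
    # Case 1: the complete adapter occurs somewhere in the read.
    position = sequence.find(adapter)
    if position == -1:
        # Case 2: the read ends partway through the adapter, so only a
        # prefix of the adapter is present. Try the longest prefix first.
        longest = min(len(adapter) - 1, len(sequence))
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                position = len(sequence) - overlap
                break
    if position == -1:
        return sequence, quality  # no adapter found, read left unchanged
    return sequence[:position], quality[:position]


insert = "ACGTACGTACGTACGTAC"  # made-up 18 nt "small RNA"
full = insert + ADAPTER        # read containing the whole adapter
partial = insert + ADAPTER[:7] # read that ends inside the adapter
print(clip_adapter(full, "I" * len(full)))
print(clip_adapter(partial, "I" * len(partial)))
```

The quality string is clipped at the same position so that the sequence and quality lines stay the same length.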

Trimming sequences based on quality scores

If your reads are very long, you may want to trim sequences where the quality scores took a dive.  This may be necessary for 100 bp reads if the last 20 bp are all random base calls.  In this case the read may be hard to map since the final 20 bp will be largely wrong.  The FASTX tool fastq_quality_trimmer is useful for this purpose.
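
The basic idea of 3' quality trimming can be sketched in a few lines of Python.  This is a simplified illustration rather than the actual fastq_quality_trimmer algorithm: the function name trim_low_quality_tail is hypothetical, the default threshold of 20 is an arbitrary choice, and Sanger-style (+33) encoding is assumed.

```python
from typing import Tuple


def trim_low_quality_tail(sequence: str, quality: str, threshold: int = 20,
                          offset: int = 33) -> Tuple[str, str]:
    """Drop bases from the 3' end until one with quality >= threshold is found."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    keep = len(sequence)
    while keep > 0 and ord(quality[keep - 1]) - offset < threshold:
        keep -= 1
    return sequence[:keep], quality[:keep]


# Made-up read: "I" decodes to Q40 and "#" to Q2 under the +33 offset.
print(trim_low_quality_tail("ACGTACGTAC", "IIIIIII###"))
```

Note that this simple version stops at the first good base it meets from the 3' end, so low-quality bases in the middle of a read are kept.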
