Software for motif discovery and next-gen sequencing analysis

Mapping reads to the genome

Once you have checked your FASTQ files and have removed all adapter sequences that might be present, you are ready to map them to a reference genome.  While tools like BLAST and BLAT are powerful methods, they are not specialized for the vast amount of data generated by next-generation sequencers.  It is highly recommended that you use a next-gen specific read alignment program.

Note: While BWA, Bowtie, and Tophat have received the most attention as short read alignment algorithms, new methods such as STAR are significantly faster and in some cases more accurate.  More like it are likely to come along soon (if not already available...)

Selecting a reference genome

Both the organism and the exact version (i.e. hg18, hg19) are very important when mapping sequencing reads.  Reads mapped to one version are NOT interchangeable with reads mapped to a different version. I would follow this recommendation list when choosing a genome (Obviously try to match species or sub species when selecting a genome):
    1. Do you have a favorite genome in the lab that already has a bunch of experiments mapped to it?  Use that one.
    2. Do any of your collaborators have a favorite genome/version?
    3. Use the latest stable release - I would recommend using genomes curated at UCSC so that you can easily visualize your data later using the UCSC Genome Browser.  (i.e. mm9, hg19)

Mapping to a normal genomic reference

You want to map your reads directly to the genome if you are performing:
  • ChIP-Seq
  • GRO-Seq
  • DNase-Seq
  • MNase-Seq
  • Hi-C

Popular short read aligners

Full List

Most Popular:
bowtie : fast, works well
bowtie2 : fast, can perform local alignments too
BWA - Fast, allows indels, commonly used for genome/exome resequencing
Subread - Very fast, (also does splice alignment)
STAR - Extremely fast (also does splice alignment, requires at least 30 Gb memory)

To be honest, I would probably recommend STAR for almost any application at this point if you have the memory (see below)

Example of alignment with bowtie2:

Step 1 - Build Index (takes a while, but only do this once):

After installing bowtie2, the reference genome must first be "indexed" so that reads may be quickly aligned.  You can download pre-made indices from the bowtie website (check for those here first).  Please be aware that bowtie2 indexes are different than "bowtie" indexes.  To perform make your own from FASTA files, do the following:
  1. Download FASTA files for the unmasked genome of interest if you haven't already (i.e. from UCSC).  Do NOT use masked sequences.  I also tend to remove the "*_random.fa" chromosomes.  These often contain part of the canonical chromosomes in addition to regions that cannot be placed in the assembly.  The problem with these regions is that the part shared with the canonical chromosome will be present twice, making it difficult to map the reads to a unique location.
  2. Concatenate FASTA files into a single file.  We can do this using the UNIX cat command, which merges files together
    cat *.fa > genome.fa
  3. From the directory containing the genome.fa file, run the "bowtie2-build" command.  The default options usually work well for most genomes.  For example, for hg19:
    /path-to-bowtie-programs/bowtie2-build genome.fa hg19
    This command will create 6 files with a *.bt2 file extension.  These will then be used by bowtie2 or newer versions of tophat to map data.

  4. Copy the *.bt2 files to the bowtie2 indexes directory (or somewhere you can store them) so that bowtie2 knows where to find them later:
    cp *.bt2 /path-to-bowtie-programs/indexes/

Step 2 - Align sequences with bowtie (perform for each experiment):

The most common output format for high-throughput sequencing is FASTQ format, which contains information about the sequence (A,C,G,Ts) and quality information which describes how certain the sequencer is of the base calls that were made.  In the case of Illumina sequencing, the output is usually a "s_1_sequence.txt" file.  More recently the Illumina pipeline will output a file that is debarcoded with your sample name such as "Experiment1.fastq".  In addition, much of the data available in the SRA, the primary archive of high-throughput sequencing data, is in this format. 

To use bowtie2 to map this data, run the following command:

/path-to-bowtie-programs/bowtie2 -p <# cpu> -x <genome index prefix> <fastq file>  > <output filename>

/programs/bowtie2 -p 8 -x hg19 Experiment1.fastq > Experiment1.sam
Where <genome index prefix> is the common prefix for the *.bt2 files that were created using the bowtie2-build command in step 1, or from a downloaded index.  If the *.bt2 files are stored int the "/path-to-bowtie2-program/indexes/" directory, you only need to specify the name of the index.  If the index files are located elsewhere, you can specify the full path names of the index files (in the examples above this would be "-x /programs/indexes/hg19").

In the example above, we use 8 processors/threads.  The bowtie2 program is very parallel in nature, with near linear speed up with additional processors.

The default output is a SAM file.  To learn more about SAM alignment files, go to the next section on SAM/BAM files.

There are many other options for bowtie2 that may be important for your study, but most ChIP-Seq data can be mapped using the default options.

NOTE: Usually, the process of removing duplicate reads or removing non-unique alignments is handled by the downstream analysis program.

Example of alignment with bwa:

Step 1 - Build Index (takes a while, but only do this once):

bwa index -a bwtsw genome.fa

Step 2 - Align reads to the index (perform for each experiment):

# where the genome.fa is in the same directory with your index from the first step

bwa mem -t #cpus genome.fa reads.fq > aln-se.sa

#paired end
bwa mem -t #cpus genome.fa reads1.fq reads2.fq > aln-pe.sam

Mapping to a genome while allowing splicing

Usually, any kind of RNA-seq method will benefit from looking for splicing junctions in addition to genomic mapping:
  • RNA-Seq (polyA+, total)
  • RIP-Seq
  • ChIRP-Seq
  • Ribo-Seq

Popular splice read aligners

Tophat (most popular)
Subread - Very fast, (also does splice alignment)
STAR - Extremely fast (also does splice alignment, requires at least 30 Gb memory)
lots of others...

I would probably recommend STAR for RNA-Seq is you have enough RAM (see below)

Example of aligning RNA-Seq data with STAR (very very fast)

STAR is one of a growing number of short read aligners that takes advantage of advances in computational power to optimize the short read mapping process (original publication: Dobin et al.)  The key limitation with STAR is computer RAM - STAR requires at least 30 Gb to align to the human or mouse genomes.

To install STAR, visit the website and follow their instructions.

Step 1 - Build a genome index

Like all aligners, you need to build the genome index first.  The STAR website has links to the hg19 genome index if you want to skip this step.  First, make a directory for the index (i.e. mm9-starIndex/).  Then copy the genome FASTA file it the directory and cd into it to make that directory your current directory.  Then, to build the index, the command is:
STAR  --runMode genomeGenerate --runThreadN <# cpus> --genomeDir <genome output directory> --genomeFastaFiles <input Genome FASTA file>

i.e. STAR  --runMode genomeGenerate --runThreadN 24 --genomeDir ./ --genomeFastaFiles mm9.fa
Note: For small genomes, you may need to add the following to create a proper index: "--genomeSAindexNbases 6"

Step 2 - Align RNA-Seq Reads to the genome with STAR

To align RNA-Seq reads with STAR, run the following command:
STAR --genomeDir <Directory with the Genome Index>  --runThreadN <# cpus> --readFilesIn <FASTQ file> --outFileNamePrefix <OutputPrefix>

i.e. STAR --genomeDir mm9-starIndex/  --runThreadN 24 --readFilesIn Experiment1.fastq --outFileNamePrefix Experiment1Star

# paried-end data:
STAR --genomeDir mm9-starIndex/  --runThreadN 24 --readFilesIn read1.fastq read2.fastq --outFileNamePrefix Experiment1Star

STAR will create several output files - the most important of which is the "*.Aligned.out.sam" - you will likely want to convert this to a bam file and sort it to use it with other programs.  The default output is a SAM file.  To learn more about SAM alignment files, go to the next section on SAM/BAM files.

Notes on STAR

STAR is very very fast - it will rip through 20 million reads in a matter of minutes if you have 20 cpus working for you.  In fact, the longest part of the program is loading the index into memory.  If you are aligning several experiments in a row, add the option "--genomeLoad LoadAndKeep" and STAR will load the genome index into shared memory so that it can use it for the next time you run the program.  Also, converting the sam file into a sorted bam file will take much longer than aligning the data in the first place.

Example of Alignment with Tophat (not recommended)

Tophat is basically a specialized wrapper for bowtie2 - it manipulates your reads and aligns them with bowtie2 in order to identify novel splice junctions.  It can also use given splice junctions/gene models to map your data across known splice junctions.

Step 1 - Build Index (takes a while, but only do this once):

This part is exactly the same as for bowtie2 - if you already made or downloaded an index for bowtie2, you can skip this step.

Step 2 - Align your RNA-seq data to the genome using Tophat

To use tophat to map this data, run the following command:

/path-to-bowtie-programs/tophat -o <output directory> -p <# cpu> </path-to-genome-index/prefix> <fastq file>   

For example:
/programs/tophat -o TophatOutput/ -p 8 /programs/indexes/hg19 Experiment1.fastq

Paired-end Example:
/programs/tophat -o TophatOutputPE/ -p 8 /programs/indexes/hg19 Experiment1.r1.fastq Experiment1.r2.fastq

In the example above, we use 8 processors/threads.  The tophat2 program contains a mix of serial and parallel routines, so more processors doesn't necessarily mean it will be finished much faster.  In particular, if you are performing coverage based searchers, it may take a long time (multiple processors will not help).  In general, if you have multiple samples to map, it's better to map them at the same time with separate commands rather than mapping them one at a time with mapping processors.

Tophat places several output files in an output directory.  The most important is the "accepted_hits.bam" file - this is the alignment file that will be used for future analysis (more info here).  There are additional files that can be useful, including a "junctions.bed" file with records all of the splice junctions discovered in the data.

Important Tophat Parameters:

--library-type <fr-unstranded | fr-firststrand | fr-secondstrand>

Describes which method was used to generate your RNA-seq library. If you used a method that doesn't produce strand specific signal (i.e. just sequencing a cDNA library), then select the fr-unstranded. If you use a stranded method that sequences the first DNA strand made (like a dUTP method), then use fr-firststrnad. If you use direct ligation methods, then fr-secondstrand.  Correctly specifying the library type will help Tophat identify splice junctions.
-G <GTF file>
Use this option to specify a known transcriptome to map the reads against.  By default, tophat will also search for de novo splice events, but this will help it known were the known ones are so that you don't miss them.  A GTF files are called Gene Transfer Files, and a description of the format can be found here.  To get a GTF file for your organism, you can usually get one from UCSC Table Browser:
UCSC Table browser GTF

In the output format, be sure to select GTF file - the file you download from here should work with tophat.

Tophat Mapping Strategy

If your goal is to identify novel transcripts and you have several separate experiments, I would recommend pooling all of your data together into a single expeirment/FASTQ file and mapping your data in one run.  One of the ways Tophat tries to identify novel junctions is by first identifying "exons" by mapping segments of reads to the genome using bowtie2.  The more segments it's able to map, the more confident it is about putative exons and the greater the chance it will identify unannotated splice sites.

You can then go back with your novel splice sites and remap the original experiments (not pooled) to get reads for each individual experiment using the "-j <raw junction file>".  This is most useful for short reads.  This general strategy is also useful if running cufflinks...

Mapping bisulfite-treated DNA

MethylC-Seq, BS-Seq, or RBBS-Seq data requires a special mapping strategy.  In these experiments the genomic DNA is bisulfite treated, causing all unmethylated cytosines to be converted to uracil, which will utimately show up as thymine after sequencing.  This is a clever way to figure out which cytosines are methylated in the genome, but requires a clever mapping strategy to avoid bias detection.  I'll try to include an example of mapping this type of data in the near future, for now consider this list of BS-aligners.

De novo Assembly

Sometimes it makes more sense to perform de novo assembly instead of mapping reads to a reference genome.  Assembly is well beyond the scope of this tutorial.

Genomics assembly from short reads: Velvet, SOAPdenovo
Transcript assembly: Trinity

Can't figure something out? Questions, comments, concerns, or other feedback: