Software for motif discovery and next-gen sequencing analysis

Analysis of RNA-Seq data with Cufflinks

There are several different tools available for RNA-Seq analysis.  One of the more popular tools is Cufflinks, a all-in-one tool for performing several aspects of RNA-Seq analysis.  One of Cufflinks' best features is that it can function as a reference-based de novo transcriptome assembler - that is, it can identify novel transcripts in your sequencing data by examining their alignments to the genome.  This is different from de novo transcript assemblers that work directly from the RNA sequence that do not require a genome (such as Trinity).  In theory, reference-based transcript assembly is more sensitive.

Input Files for Cufflinks

The primary input file for Cufflinks is a sorted BAM file.  Cufflinks is usually run after mapping reads to the genome with its sister tool Tophat.  A description of how to run Tophat can be found here.  The main input file used from Tophat will the be "accepted_hits.bam" file which contains all of the mapped reads as well as their splicing information.  You should have a single BAM file for each experiment you want to analyze.

In practice, Cufflinks can work with BAM files produced by any read alignment program.  However, you must make sure the alignment is converted into a sorted BAM file for it to work correctly with Cufflinks.  For more information on creating a sorted BAM file, see here.

Reference Transcriptome Files (GTF files)

While Cufflinks can function as a de novo transcriptome assembler, it can also take advantage of known transcripts in its analysis.  In a practical sense this is important when quantify protein coding genes and low expressed genes that may not have enough sequencing evidence to be "discovered" de novo but are known to exist.

Cufflinks works with GTF files (Gene Transfer Format), which are similar to GFF files.  To get a GTF files for your organism, check out the UCSC Table Browser.  There is a description about how to download GTF files on the mapping page (same GTF files used to assist with Tophat mapping).  Usually, the RefSeq (refGene) genes serve as a good reference genome as their identifiers (i.e. NM_0012355) are recognized by many downstream programs.

In summary, you should have a sorted BAM file and a GTF files containing the reference - something like this:
For gene expression quantification and de novo transcriptome assembly you only need a single BAM file.  If you want to do differential expression analysis you will need a separate BAM file for each experiment.

Installing Cufflinks

To install Cufflinks, visit their main page and check out the "Releases" panel along the right side of the webpage.  I would strongly recommend downloading the Mac or Linux Binary files since Cufflinks requires some special libraries (to install from source, follow the directions here).  To install Cufflinks from the binaries:
Download the latest version for your computer from here.
Unzip and untar the file: tar zxvf cufflinks-2.0.2.OSX_x86_64.tar.gz
The new directory should have all of the executable files - might want to add this directory to your executable PATH

Running Cufflinks

This page is not meant to replace the Cufflinks manual.  I would recommend reading it to understand the full set of options and tools available in the Cufflinks package.  This tutorial will try to distill some of the important points to help you get started.  For each of the following sections, cufflinks has some general parameters that are commonly used by nearly all of it's programs.

Common Parameters:

-o <output directory> : Most of the cufflinks programs generate several output files.  Use this option to specify what the output directory should be (i.e. "-o cufflinksOutput/").

-p <# CPUs>
: speed up the analysis by using additional processors if available on your system

--library-type <type> : where the library type describes how the sequencing library was made.  This helps cufflinks correctly take advantage of strand information, etc.  This is most important when analyzing paired-end libraries.  See the Cufflinks website for a full description of the data (see at the bottom of this page). The common different library types are:
fr-unstranded : Standard Illumina kit, analogous to sequencing a cDNA library.
fr-firststrand : Strand specific, such at UTP methods
fr-secondstrand : Strand specific, direction RNA ligation strategies

Quantifying Known Transcripts using Cufflinks

To quantify the expression level of protein coding genes, you can run the cufflinks program with a bam file and a GTF file like so:
cufflinks -o OutputDirectory/ -G refseq.gtf mappedReads.bam
Where "OutputDirectory/" is the name of the folder where the output files will be placed, "refseq.gtf" is the GTF file describing the reference genome, and mappedReads.bam is your output from Tophat.  Also, considering using some of the options mentioned in the section above (such as "--library-type", etc.) to improve cufflinks performance. 

Output Files are in the OutputDirectory

De novo Transcript Discovery using Cufflinks

One of Cufflinks' best features is that it can "identify" transcripts from your mapped RNA-Seq without any previous knowledge. This can be very useful if you are analyzing a rarely studied organism, or want to identify novel transcripts in a rare cell type.  To see how Cufflinks performs at transcript discovery, run the command above but omit the "-G refseq.gtf" option.  Without a reference, Cufflinks will attempt to assemble the reads into transcripts:
cufflinks -o OutputDirectory/  mappedReads.bam
The output file "transcripts.gtf" will be in the directory "OutputDirectory/".  Load it into the genome browser to see how it looks.

The key to successful transcript discovery is lots of data.  Cufflinks needs aligned transcripts that describe splice sites and exon coverage in able to build transcripts.  If regions are sparse, it may have trouble discovering where the transcripts are (or ends up identifying several smaller fragments).  For this reason it is often a good idea to combine as much data as you have together into a single cufflinks run to maximize your sensitivity. In contrast, if you are attempting to identify novel splice junctions or transcript isoforms, you should keep the data separate.

Differential Expression Analysis using Cuffdiff

To identify differentially expressed genes with the Cufflinks suite, you'll want to run cuffdiff.  The command is similar to the other ones
cuffdiff -o OutputDirectory/ refseq.gtf sample1.bam sample2.bam

Or if you have replicates:
cuffdiff -o OutputDirectory/ refseq.gtf sample1.rep1.bam,sample1.rep2.bam sample2.rep1.bam,sample2.rep2.bam
This will produce several files in the output directory (many of which may be empty).  For most people, the gene_exp.diff file is the most informative, and can be opened in Excel.  If you want to use cuffdiff to identify splice junctions and differential isoform usage, you might have to run cufflinks for de novo transcript identification separately on each sample first (combine the replicates), run cuffmerge to make a single transcripts.gtf file, and then run cuffdiff using the merged file.  For more on that, follow the cufflinks manual.

Note about Cufflinks vs. DESeq/EdgeR for Differential Expression calculations

Cufflinks models gene expression values and variance very differently than DESeq/EdgeR, which use simple read counts and then leverage properties of the negative binomial distribution to identify differences in expression.  As a result, the two types of methods may show lots of differences in their lists of differentially expressed genes.  Generally speaking, cufflinks is more conservative in their differential expression calls while DESeq/EdgeR are a little more aggressive (identify more genes).

Can't figure something out? Questions, comments, concerns, or other feedback: