|
Analysis of RNA-Seq data with Cufflinks
There are several different tools available for RNA-Seq
analysis. One of the more popular tools is Cufflinks, a
all-in-one tool for performing several aspects of RNA-Seq
analysis. One of Cufflinks' best features is that it
can function as a reference-based de novo
transcriptome assembler - that is, it can identify novel
transcripts in your sequencing data by examining their
alignments to the genome. This is different from de
novo transcript assemblers that work directly from the RNA
sequence that do not require a genome (such as Trinity).
In theory, reference-based transcript assembly is more
sensitive.
Input Files for Cufflinks
The primary input file for Cufflinks is a sorted
BAM file. Cufflinks is usually run after mapping
reads to the genome with its sister tool Tophat. A
description of how to run Tophat can be found here. The main input file
used from Tophat will the be "accepted_hits.bam" file
which contains all of the mapped reads as well as their
splicing information. You should have a single BAM
file for each experiment you want to analyze.
In practice, Cufflinks can work with BAM files produced by
any read alignment program. However, you must make
sure the alignment is converted into a sorted BAM file for
it to work correctly with Cufflinks. For more
information on creating a sorted BAM file, see here.
Reference Transcriptome Files (GTF files)
While Cufflinks can function as a de novo
transcriptome assembler, it can also take advantage of
known transcripts in its analysis. In a
practical sense this is important when quantify
protein coding genes and low expressed genes that may
not have enough sequencing evidence to be "discovered"
de novo but are known to exist.
Cufflinks works with GTF files (Gene Transfer
Format), which are similar to GFF files. To get
a GTF files for your organism, check out the UCSC
Table Browser. There is a description about how
to download GTF files on the mapping
page (same GTF files used to assist with Tophat
mapping). Usually, the RefSeq (refGene) genes
serve as a good reference genome as their identifiers
(i.e. NM_0012355) are recognized by many downstream
programs.
In summary, you should have a sorted BAM file and a GTF
files containing the reference - something like this:
mappedReads.bam
refseq.gtf
For gene expression quantification and de novo
transcriptome assembly you only need a single BAM
file. If you want to do differential expression
analysis you will need a separate BAM file for each
experiment.
Installing Cufflinks
To install Cufflinks, visit their main page and
check out the "Releases" panel along the right side of the
webpage. I would strongly recommend downloading the
Mac or Linux Binary files since Cufflinks requires some
special libraries (to install from source, follow the
directions here).
To install Cufflinks from the binaries:
Download the latest version for your computer
from here.
Unzip and untar the file: tar zxvf
cufflinks-2.0.2.OSX_x86_64.tar.gz
The new directory should have all of the executable
files - might want to add this directory to your
executable PATH
Running Cufflinks
This page is not meant to replace the Cufflinks
manual. I would recommend reading it to
understand the full set of options and tools available in
the Cufflinks package. This tutorial will try to
distill some of the important points to help you get
started. For each of the following sections,
cufflinks has some general parameters that are commonly
used by nearly all of it's programs.
Common Parameters:
-o <output directory> : Most of
the cufflinks programs generate several output
files. Use this option to specify what the output
directory should be (i.e. "-o cufflinksOutput/").
-p <# CPUs> : speed up the analysis by using
additional processors if available on your system
--library-type <type> : where the library
type describes how the sequencing library was
made. This helps cufflinks correctly take
advantage of strand information, etc. This is most
important when analyzing paired-end libraries. See
the Cufflinks website for a full description of the data
(see at the bottom of this page).
The common different library types are:
fr-unstranded : Standard Illumina kit,
analogous to sequencing a cDNA library.
fr-firststrand : Strand specific, such at UTP methods
fr-secondstrand : Strand specific, direction RNA
ligation strategies
Quantifying Known Transcripts using Cufflinks
To quantify the expression level of protein
coding genes, you can run the cufflinks program with a bam
file and a GTF file like so:
cufflinks -o OutputDirectory/ -G refseq.gtf
mappedReads.bam
Where "OutputDirectory/" is the name of the folder where
the output files will be placed, "refseq.gtf" is the GTF
file describing the reference genome, and mappedReads.bam
is your output from Tophat. Also, considering using
some of the options mentioned in the section above (such
as "--library-type", etc.) to improve cufflinks
performance.
Output Files are in the OutputDirectory
De novo Transcript Discovery using Cufflinks
One of Cufflinks' best features is that it can
"identify" transcripts from your mapped RNA-Seq without
any previous knowledge. This can be very useful if you are
analyzing a rarely studied organism, or want to identify
novel transcripts in a rare cell type. To see how
Cufflinks performs at transcript discovery, run the
command above but omit the "-G refseq.gtf" option.
Without a reference, Cufflinks will attempt to assemble
the reads into transcripts:
cufflinks -o OutputDirectory/
mappedReads.bam
The output file "transcripts.gtf" will be in the directory
"OutputDirectory/". Load it into the genome browser
to see how it looks.
The key to successful transcript discovery is lots of
data. Cufflinks needs aligned transcripts that
describe splice sites and exon coverage in able to build
transcripts. If regions are sparse, it may have
trouble discovering where the transcripts are (or ends up
identifying several smaller fragments). For this
reason it is often a good idea to combine as much data as
you have together into a single cufflinks run to maximize
your sensitivity. In contrast, if you are attempting to
identify novel splice junctions or transcript isoforms,
you should keep the data separate.
Differential Expression Analysis using Cuffdiff
To identify differentially expressed genes with
the Cufflinks suite, you'll want to run cuffdiff.
The command is similar to the other ones
cuffdiff -o OutputDirectory/ refseq.gtf
sample1.bam sample2.bam
Or if you have replicates:
cuffdiff -o OutputDirectory/ refseq.gtf
sample1.rep1.bam,sample1.rep2.bam
sample2.rep1.bam,sample2.rep2.bam
This will produce several files in the output directory
(many of which may be empty). For most people, the gene_exp.diff
file is the most informative, and can be opened in
Excel. If you want to use cuffdiff to identify
splice junctions and differential isoform usage, you might
have to run cufflinks for de novo transcript
identification separately on each sample first (combine
the replicates), run cuffmerge to make a single
transcripts.gtf file, and then run cuffdiff using the
merged file. For more on that, follow the cufflinks
manual.
Note about Cufflinks vs. DESeq/EdgeR for Differential
Expression calculations
Cufflinks models gene expression values and
variance very differently than DESeq/EdgeR, which use
simple read counts and then leverage properties of the
negative binomial distribution to identify differences
in expression. As a result, the two types of
methods may show lots of differences in their lists of
differentially expressed genes. Generally
speaking, cufflinks is more conservative in their
differential expression calls while DESeq/EdgeR are a
little more aggressive (identify more genes).
|