|
Guide to RNA-Seq with HOMER
The goal of this page is to provide a RNA-Seq focused
discussion of HOMER analysis. For those who would
rather 'roundhouse kick' their data without going into the
details, below is a quickstart:
RNA-Seq quick-start:
0. Align FASTQ reads using STAR or similar 'splicing
aware' genome alignment algorithm
1. Make tag
directories for each experiment
# works with sam or bam (samtools must be
installed for bam)
makeTagDirectory Exp1r1/ inputfile1r1.sam -format sam
makeTagDirectory Exp1r2/ inputfile1r2.sam -format
sam
makeTagDirectory Exp2r1/ inputfile2r2.sam -format sam
makeTagDirectory Exp2r2/ inputfile2r2.sam -format sam
If the experiment is strand specific paired end
sequencing, add "-sspe" to the end. If it's
unstranded paired-end sequencing, no extra options are
needed.
makeTagDirectory Exp1/ inputfile.sam -format sam -sspe
2. Make
bedGraph visualization files for each tag directory
# Add "-strand separate" for strand-specific sequencing
makeUCSCfile Exp1r1/ -fragLength given -o auto
(repeat for other tag directories)
3. Quantify
gene expression across all experiments for clustering
and reporting (-rpkm):
# May also wish to use "-condenseGenes" if you don't
want multiple isoforms per gene
analyzeRepeats.pl rna hg19 -strand both -count exons
-d Exp1r1/ Exp1r2 Exp2r1/ Exp2r2/ -rpkm > rpkm.txt
# Use this result for gene expression clustering, PCA,
etc.
4. Quantify gene
expression as integer counts for differential expression
(-noadj)
# May also wish to use "-condenseGenes" if you don't
want multiple isoforms per gene
analyzeRepeats.pl rna hg19 -strand both -count exons
-d Exp1r1/ Exp1r2 Exp2r1/ Exp2r2/ -noadj > raw.txt
5. Calculate
differential expression (requires R/Bioconductor and the
package edgeR to be installed!!)
# Name the samples in the same order as they were used
in the analyzeRepeats.pl command.
# Replicates should have the same name (i.e. cond1)
getDiffExpression.pl raw.txt cond1 cond1 cond2 cond2
> diffExp.output.txt
# Check out the results in Excel - there will be 4
columns per comparison: log2 fold, log2 intensity cpm,
p-value, and FDR/q-value.
HOMER contains two specialized programs to quantify RNA
reads in the genome (analyzeRNA.pl and analyzeRepeats.pl).
The goal of these tools is to make certain types of
RNA-Seq/GRO-Seq analysis easy and straightforward.
They are relatively simple - the idea is to count reads in
regions - not perform complex probabilistic inference of
true gene expression values (such as with cufflinks and some
other tools). The problem with some complex methods is
that the assumptions driving them may not always be valid in
all situations, and I've found I often run into more
problems than solutions with some of these software packages
(that's not to say there aren't good tools out there!).
analyzeRNA.pl and analyzeRepeats.pl
primarily
produce a gene expression matrix from "Tag Directories", and
work in a similar manner to how annotatedPeaks.pl works
for ChIP-Seq
data. There options to count reads in "genic" vs.
"exon" regions, normalization options, and other
features. Both programs work in a very similar manner
and most options are interchangeable between the two [Both
analyze mRNA or repeats even though their names are
different!!]. Generally speaking, at some point
in the future everything will be available in analyzeRepeats.pl.
Once a gene expression matrix is generated, it can be
treated a lot like microarray data and a series of other
analysis techniques can be applied to the data. HOMER
also provides a script to integrate with popular
R/Bioconductor differential expression callers (getDiffExpression.pl).
On that note, analyzeRNA.pl and analyzeRepeats.pl
DO NOT
produce differential expression or differential splicing
calls. It would be great if Chuck had the time to help
me with that, but for now I recommend sending the output
from analyzeRNA.pl to other utilities such as edgeR
or DESeq
or whatever your favorite next-gen differential signal
finder is. getDiffExpression.pl is provided as a tool
to help automate the process of using EdgeR or DESeq.
analyzeRNA.pl vs. analyzeRepeats.pl
Of all the tools in HOMER, analyzeRNA.pl is
probably the least efficient and resource hungry program
in terms of memory. It needs to be rewritten, and
more and more of it's functionality is being ported to analyzeRepeats.pl
(which is much more efficient). However, it does
perform useful calculations, particularly for GRO-Seq
applications such as proximal promoter pausing, that are
not available or not automated for other programs.
Use analyzeRNA.pl when:
- Calculating Pausing Ratios with GRO-Seq data
[Recommended] Use analyzeRepeats.pl when:
- Quantifying Gene Expression of exonic regions or
whole gene bodies from RNA-Seq, GRO-Seq, etc.
- Quantifying the expression of Repeat regions
Basic usage (same for
analyzeRNA.pl and analyzeRepeats.pl):
analyzeRepeats.pl <rna|repeats|gtf file>
<genome> [options] -count
[genes|exons|introns|cds|3utr|5utr] -d <Tag
Directory> [Tag Directory 2] ... > <output
file>
i.e. analyzeRepeats.pl
rna hg19 -count genes -d IMR90-GroSeq >
outputfile.txt
Specifying the Genes/Transcripts to Analyze
The first argument to analyzeRepeats.pl
specifies which "gene definition" to use for
analysis. There are three options:
- rna - This
will direct HOMER to analyze the default RefSeq
annotation for that genome.
- repeats -
HOMER will load repeat definitions from UCSC and
assess the expression levels of different repeat
classes. It is recommended to only use this
option with analyzeRepeats.pl.
- <GTF file>
- Specify your own custom genes. These can an
alternative annotation (i.e. Ensembl genes, UCSC
genes) or it can be custom, de novo genes you found
analyzing GRO-Seq or mRNAs defined using
"cufflinks". If you are looking for a GTF file
for you favorite annotation, check out the UCSC
Table
Browser. You can use this tool to download
GTF formmated files (with the correct genome version)
for many popular annotations.
- <peak/homer transcript file> - Specify
you own regions using a peak file.
Examples include "analyzeRepeats.pl rna mm9 ..." or
"analyzeRNA.pl myGenes.gtf mm9 ...".
Choosing Experiments to Analyze
As with all HOMER programs,
you need to create "Tag Directories" out of each
experiment you want to quantify first. To quantify
them with analyzeRNA.pl
simply add one or more directory after " -d". For
example:
analyzeRNA.pl rna hg18
-d
LNCaP-RNA-Seq-notx/ LNCaP-RNA-Seq-Dht/ >
outputfile.txt
This will produce an output file containing genes
expression values for two experiments,
"LNCaP-RNA-Seq-notx" and "LNCaP-RNA-Seq-Dht".
Measuring Gene Expression in Exons vs. Gene Bodies.
Depending on the type of
sequencing you are analyzing, you will want to quantify
RNA from different parts of the gene. The "-count
[...]" option controls which regions of the gene are used
for analysis (use like " -count
exons" or " -count
genes"). The options below only pertain to
'rna' or a custom GTF file:
- exons -
Counts tags in exons only. Use this for most
applications of RNA-Seq, such as polyA-RNA-seq or
other techniques that aim to measure mRNA.
- cds - Counts
tags in coding regions only. This could be
useful for quantifying ribosome coverage on coding
sequences with techniques such as Ribo-Seq
- introns -
Counts tags on introns only.
- 5utr or 3utr - Count tags
on 5' UTR and 3' UTR regions, respectively.
- genes (default)
- Counts tags on the full gene body (TSS to
TTS). This is useful for GRO-Seq where we expect
coverage across the entire transcript. Can also
be used to quantify H3K36me3 or PolII ChIP-Seq.
By default, analyzeRepeats.pl assumes your RNA is
strand-specific. If it is not, or you are using
analyzeRepeats.pl for other types of data such as
ChIP-Seq, you can specify:
"-strand +" - default, only measure +
strand (relative to gene orientation), for strand
specific RNA
"-strand -" -
only measure - strand (relative to gene orientation)
"-strand both" -
count reads on both strands, for non-strand specific RNA
or ChIP-Seq
HOMER counts reads by considering their 5' position.
It will also assign reads to each transcript they are
found within.
analyzeRNA.pl specific features
There are a couple options to slightly modify
how tags are counted to avoid TSS/TTS features. By
specifying "-start
<#>", you can adjust where (relative to
the TSS) HOMER starts counting reads. This is
useful for GRO-Seq when you might want to avoid the
promoter-proximal paused region by add "-start 500" to start
counting 500 bp downstream from the TSS. Negative
values would start counting 500 bp upstream. The
optoin "-end <#>"
is similar, but applies to the TTS. Negative
values will stop the tag counting upstream of the TTS ("-end -500"), while
positive values will count tags further downstream.
Normalization of Gene Expression Values
Normalization is probably
the trickiest part about RNA-Seq. There are lots of
papers on it. HOMER offers a couple different
normalization options depending on what the experiment is
and what your needs are.
By default, HOMER normalizes each experiment to 10 million
mapped reads, which is the same normalization strategy
used in annotatePeaks.pl
for ChIP-Seq data. However, RNA has the potential to
contain much more "contaminates" than ChIP-Seq. In
ChIP-Seq, the background is the genome, with random
fragments coming down in the immunoprecipitation
step. With RNA, instead of a two copies of the
genome per cell, you have 95%+ ribosomal RNA, along with a
host of other very very common short RNA species such as
tRNAs, snoRNAs, and other problematic things, and then a
small fraction of what you're actually interested
in. Most RNA-Seq protocols contain enrichment steps,
such as polyA selection, to isolate mRNAs from the rest of
the crap (my apologies to those studying rRNA and
tRNA). These enrichment steps can have different
efficiencies from sample to sample, with some samples
containing more rRNA, tRNA etc. than others. When
mapping to the genome, many of these RNA species will be
discarded if you only keep reads that map uniquely since
the genomic elements that create them are often repeated
in the genome. However, you're bound to still map
many of these reads. Differential contamination of
common RNA species can throw off the default normalization
- if 50% of your "mapped" reads are rRNA in one sample and
only 25% in another, you're likely going to have problems.
There are several things you can do to combat these
problems. One is to "pre-clear" common RNA species
computationally, i.e. map your reads to rRNA first to
remove them, then map the rest to the genome.
Another strategy is to normalize strictly with mRNA
species only instead of considering the total number of
mapped reads. In general, I would recommend
normalizing in this strategy (" -normMatrix <#>") - it's probably
the safest in most situations due to the contamination
problem. Below are the various normalization options
in analyzeRepeats.pl and
analyzeRNA.pl.
- -norm <#>
- Normalize the total number of mapped reads per
experiment to #, this is the default "-norm 1e7".
- -normMatrix
<#> - Normalize the total of number of
reads found in the gene expression matrix to # (i.e.
normalize total reads in mRNAs)
- -noadj -
Don't perform any normalization, just report the exact
number of reads found. This is useful when
sending the output of HOMER to another differential
expression program such as edgeR or similar that
requires the raw read counts.
- -quantile - [analyzeRepeats.pl only] perform
quantile normalization (might be the best option in
some cases)
The options above perform the actual normalization.
"Dress up" normalization/transformations options follow:
- -rpkm -
Report normalized values as reads per kilobase per
million mapped reads
- -log -
Report log normalized values with jitter
=log2(x+1+rand(0-1))
- -sqrt -
Report square root of the normalized value with jitter
=sqrt(x+rand(0-1))
Limiting the number of
Tags per Position ("-pc <#>")
Another important parameter
(that I almost made mandatory) is "-pc <#>", which
controls the maximum number of reads to consider per
position. In the case of some genes, there may be
high-abundance RNA species in introns, where large spikes
in RNA reads may occur. If quantifying expression
from gene bodies (i.e. "-count genes", GRO-Seq), these
reads will be counted toward the gene's total
abundance. Ideally these are removed. However,
as a quick and dirty solution to this problem, you can
limit the number of reads that HOMER counts from each
position. In many cases rRNA loci will have tens of
thousands of reads per bp. On the flip side,
limiting the number of reads per bp can kill your dynamic
range. Usually the middle gound such as "-pc 3" is a good way
to go - you effectively remove the super spikes but still
allow genes with high RPKM to "express themselves" in your
analysis.
analyzeRepeats.pl specific options and output
analyzeRepeats.pl leverages more of HOMER's C++
base to make read counting and processing much
faster. This also enables repeats to be efficiently
quantified across the genome.
Gene Expression Options and Output
analyzeRepeats.pl also has some useful
features with respect to gene annotations that help
control how the information is outputted when using
'rna' or your own GTF file:
- -condenseGenes (only reports one isoform
per locus - based on Entrez Gene ID)
- -noCondensing (do NOT combine expression
information if the same transcript ID is represented
twice. Lets say a gene shows up in two
locations in the genome. Normally HOMER will
add their expression information together.
Specifying this option will separate their
expression information)
- -noCondensingParts (report expression of
each 'fragment' of a transcript i.e. exons reported
separately. Normally each feature is counted
separately and added together - this will report
each part separately).
The output file will contain the following information
(by column):
- Transcript ID (RefSeq if 'rna' is used)
- Chromosome
- Start
- End
- Strand
- RNA length
- copies of transcript annotation in the genome
- annotation information
- Data: First Tag Directory Gene Expression
information
- [Data: Second Tag Directory Gene Expression
information]
- ...
If you want additional information, consider
running the command addGeneAnnotation.pl to beef
up the annotation output:
addGeneAnnotation.pl <data file>
<organism> [header: yes|no]
i.e. addGeneAnnotation.pl output.txt human yes
> output.ann.txt
Analyzing Repeat and non-mRNA Expression
mRNA is not the only RNA species present in
the cell. In fact, it makes up a very small
fraction of the total. analyzeRepeats.pl offers an option to
help quantify repeat RNAs, rRNA, etc. By
specifying "repeats",
HOMER will load the definition of all repeats in the
genome and quantify tags on them. Within the
repeat definition are tRNAs, rRNAs, etc. so you get more
than just transposable elements. Use it like this:
analyzeRNA.pl repeats
mm9 -d Macrophage-RNAseq > outputfile.txt
By default, all the repeats are treated like
"exons". Repeats from similar classes are combined
to give a single expression value.
To do repeat expression justice, you should carefully
consider how the data is mapped to the genome.
Normally for ChIP-Seq (or even RNA-Seq), you do not want
to consider reads that map to multiple locations in the
genome. However, in the case of RNA repeats, this
means that you will be discarding many of the reads
mapping to repeat regions. One trick that works
fairly well is to keep one random position from the
mapping, regardless if it maps uniquely to the genome
(but only do it for this type of analysis).
Typically, if a read maps to multiple locations, those
multiple locations are probably all the same type of
repeat element, so it will be added to the expression of
that repeat class regardless of where it is specifically
placed. HOMER will keep one read (the primary
alignment) if you add the "-keepOne" command when
creating Tag Directories. This is "approximate",
so use with care. For example, when making tag
directories:
makeTagDirectory Macrophage-RNAseq
alignment.sam -keepOne
The thing to note is that repeat/rRNA/tRNA expression is
compounded by the fact that most protocols try to get
rid of it, so depending on the efficiency of these
clearing steps experiment-to-experiment, you may be
measuring the difference in clearing efficiency rather
than the actual expression of the RNA.
Output Format (stdout):
* Since most repeats are found in hundreds or
thousands of locations, only the repeat with the
highest individual read density is reported.
- Repeat ID (name | family | class)
- * Chromosome
- * Start
- * End
- * Strand
- Average repeat length
- copies of repeat in the genome
- Average divergence of repeat from repbase model
- Data: First Tag Directory Gene Expression
information
- [Data: Second Tag Directory Gene Expression
information]
- ...
Controlling how repeats are reported
Since there are millions of repeat elements annotated
in the mouse or human genomes, it is important to be
able to control how they are processed and their gene
expression information reported.
Only analyze repeats from a specific name, family, or
class (you can look these up by clicking on repeats in
the genome browser). For example, by specifying
"-L2 LINE", only LINE elements will be analyzed.
Similarly, "-L3 L1" will restrict analysis to LINE L1
elements.
-L1 <name> (Specific repeat
names, such as L1HS, AluSx, etc.)
-L2 <family> (top level, SINE, LINE, LTR,
DNA, RNA, Satellite, Low_complexity, Simple_repeat)
-L3 <class> (2nd level, such as L1, L2,
Alu, etc.)
(default: report all repeats)
Normally, repeat expression is condensed such
that all reads found in a given repeat name (i.e. -condenseL1)
are reported as a single element. You can also
report reads condensed down to each class or
family. Alternatively, you can report each
instance of a repeat separately (-noCondensing) -
however, if you do this, it is recommended you filter
your repeats with one of the options above to avoid
outputting millions of different repeat elements:
-noCondensing (report each repeat
element separately)
-condenseL1 (default, condense
reads from each repeat name into one entry)
-condenseL2 (condense reads from each family
into a single entry)
-condenseL3 (condense reads from each class
into a single entry)
You may also find yourself only interested in repeat
elements that have minimal divergence from their
consensus repbase element. The idea is that the
less divergence, the more likely that they are the
result of recent insertion event. Divergence in
this case is measured as the fraction of mutations
relative to the original repeat element (0-100%).
You may also want to limit your analysis to repeats that
are full length. The following options allow you
to control which repeats are used in the analysis:
-maxdiv <#> (Maximum repeat
divergence to consider 0-1.00, default, no maximum)
-mindiv <#> (Minimum repeat divergense to
consider, range 0-1.00, default no minimum)
-maxLength <#> (Maximum length of repeat
to consider in bp, default no maximum)
-minLength <#> (Minimum length of repeat
to consider in bp, default no minimum)
-maxLengthP <#> (Maximum length of repeat
to consider as % of full length, range 0-1.00, default
no maximum)
-minLengthP <#> (Minimum length of repeat
to consideras % of full length, range 0-1.00, default
no minimum)
-noexon (Do not consider repeats in exons -
useful to avoid counting mRNA reads as repeat
reads...)
Checking for transcript read-through
One of the biggest challenges of quantifying
expression of repeats is that often the reads are not
from the repeat but instead represent signal from a
transcript that transcribes 'through' the repeat, such
as a SINE repeat in the 3' UTR of a protein coding
gene. To help identify instances where this may
be a problem, analyzeRepeats.pl will quantify the
reads immediate upstream or downstream of repeats if
you specify "-upstream <#>" or "-downstream
<#>". You can then specify "-L
<#>" to filter repeats that do not have a
read enrichment of at least # relative to the upstream
and downstream regions.
Reporting Repeat Coordinates
Lets say that for some reason you want to get
the coordinates of all Alu elements - maybe you want to
make a histogram of read density using annotatePeaks.pl!
If you do not specify any tag directories (-d "<tag
dir>") with analyzeRepeats.pl, the command
will simply output a list of positions instead of
calculating expression information. For example,
the following will output all Alu elements:
analyzeRepeats.pl repeats hg19 -L3 Alu
> output.txt
Useful parameters for this type of output are the
following:
-og (report coordinate based on full
length repeat element - many elements are truncated
insertions, this will line them up at a common
reference point)
-5p (report coordinates centered on 5' end of
repeat, useful when combined with -og)
-3p (report coordinates centered on 3' end of
repeat, useful when combined with -og)
analyzeRNA.pl specific options and output
The following are the steps
HOMER takes to produce a Gene Expression Matrix:
- Parses the RNA definition file to figure out where
all of the genes are in the genome.
- Queries each of the "Tag Directories" and determine
where each of the tags are within gene bodies.
- At that point it will rummage through the tag
positions and count only the tags found within the
desired regions (i.e. exons). Incidently, this
is also the part that makes it less efficient at the
momment - it's done with perl and keeps way too much
extra information around.
- Expression values are calculated and normalized
- Results are formatted and set to stdout
The output gene expression matrix with the
following columns:
- Transcript ID (RefSeq accession is the default with
"rna" option, custom name is used otherwise)
- chromosome
- start
- end
- strand
- mRNA length (e.g. exons only)
- gene length (TSS to TTS)
- Copies in genome (some genes map to the genome more
than once...)
- Symbol (e.g Gene Name)
- Alias (alternative gene names)
- Description
- Unigene
- Entrez Gene ID
- Ensembl
- Data: First Tag Directory Gene Expression
information
- [Data: Second Tag Directory Gene Expression
information]
- ...
A separate column is generated for each tag
directory. The head of these columns contains the
name, the region of the gene used for expression (i.e.
gene, exon, 5utr..), the total number of mapped tags in
the directory, and the normalization factor.
Due to the fact that some genes map to multiple places in
the genome (<1% of RefSeq), only one of their locations
will be reported. However, the gene length and mRNA
lengths will be averaged, and their read totals will be
averaged in the output.
If custom IDs are used, HOMER will attempt to link them to
known gene identifiers (As of now it does not annotate
their position, just treats their ID as an accession
number and tries to match it wil known genes). If
you give analyzeRNA.pl a custom GTF file from Cufflinks,
where the IDs are all "CUFF12345.1", most of the
annotation columns will be blank.
Below is an example of the output opened with EXCEL:
If you specify the "-log"
or "-sqrt"
options, you can make a nice X-Y scatter plot of the data
columns. Here is an example:
Quantifying Promoter-proximal RNA Polymerase Pausing
with analyzeRNA.pl
One of the neat
observations from GRO-Seq experiments is that an
actively engadged RNA Polymerases are present at the
promoters of many genes, but they do not seem to
elongate down the body of the gene. The idea is
that they are "paused", waiting for elongation signals
to give them the green light to make the gene.
HOMER can calculate the "pausing" ratio, which is the
ratio of tag density at the promoter relative to the
body of the gene. To use this option, specify " -pausing <#>",
where # the is the distance downstream of the TSS to
stop counting promoter tags and start counting gene body
tags. A safe value is probably 250-500.
This will queue the program to produce 3 columns of
output per experiment. First, it will report the
pausing ratio, and then it will report the promoter and
gene body read densities.
Finding Differentially Regulated Genes
As of now, HOMER doesn't have routines for
calculating differential gene expression. However,
it does help support analysis with edgeR
and DESeq,
two R/Bioconductor
packages designed for RNA-Seq differential expression
analysis. HOMER contains a script called getDiffExpression.pl
to help. The program requires the following to be
installed:
- R must be available in your executable PATH
- Bioconductor must be installed
- The EdgeR and/or DESeq packages must be installed
(for more info on installation see here)
Using getDiffExpression.pl is a two step
process. First, use analyzeRepeats.pl, analyzeRNA.pl
or annotatePeaks.pl (i.e. think ChIP-Seq) to
quantify experiments from two different conditions using
the "-noadj" option. This will result in
integer read counts for each transcript.
analyzeRepeats.pl rna hg19 -noadj -count
exons -d ES-RNAseq-rep1/ ES-RNAseq-rep2/
MEF-RNAseq-rep1/ MEF-RNAseq-rep2/ > output.txt
Then run the getDiffExpression.pl script with the
following:
getDiffExpression.pl <data file>
<group code...> [-rna | -repeats | -peaks]
[options]
i.e. getDiffExpression.pl output.txt es es mef
mef -repeats > diffOutput.txt
Key Paremeters for getDiffExpression.pl:
- The data file is the first argument to the program -
this is the output file from analyzeRepeats.pl,
analyzeRNA.pl or annotatePeaks.pl.
- Annotate each column with an argument (in the same
order given to the first command).
i.e. if you provide three tagDirs for macrophage
RNA-Seq and two tagDirs for Bcell RNA-seq, enter " mac
mac mac bcell bcell ". This tells the program
which columns go with which experiment.
- Use "-repeats", "-rna", or "-peaks"
depending on if you used analyzeRepeats.pl, analyzeRNA.pl
or annotatePeaks.pl to analyze the data.
By default the program calls on EdgeR to perform
the differential expression calculations. In the
experiment has no replication, you may want to set the
dispersion by using "-dispersion <#>"
(default is 0.05). To use DESeq instead of EdgeR,
specify "-DESeq".
If your samples are paired in anyway you may want to try
to account for batch effects. EdgeR allows you to
apply a generalized model to try to remove effects caused
by analyzing data on a different day or slightly different
batch. For example, lets say you did an experiment
with a control and a drug treatment, then did a second
experiment two weeks later with a different library
preparation protocol, etc. You can tell the program
that the samples are linked by specifying a batch code -
the code is like the experiment annotation in that it
applies to each experiment in the same order that they
were listed when preparing the gene expression matrix:
analyzeRepeats.pl rna hg19 -noadj -d
Week1-Ctrl/ Week1-Drug/ Week2-Ctrl/ Week2-Drug/ >
output.txt
getDiffExpression.pl output.txt -repeats ctrl drug
ctrl drug -batch 1 1 2 2 > diffOutput.txt
This is analogous to a paired t-test.
Output Format
This command will basically add 4 to 5 columns to the
end of your original file. Most important are
the log2 fold change and the p-value and FDR
values. The raw read counts will be normalized
in this file based on the total reads in each column
so that the values are more easily to compare when
looking through the results.
Other Analysis Ideas
Once your data is in a gene expression matrix, there
are countless ways to analyze your data. Check out
this link
for more ideas.
Command line options for analyzeRNA.pl
Usage:
analyzeRepeats.pl <input file> <genome>
[options] ...
Where <input
file> can be one of:
rna (quantify gene expression of RefSeq annotated genes)
repeats (just type "repeats" to load the default repeats
for genome)
<gtf file> (quantify gene expression from GTF file)
<transcript file> (Homer formatted peak/transcript
file i.e. hg19.repeats)
Available
Genomes (required argument): (name,org,directory,default
promoter set)
Specifying
distinct classes of repeats or filtering based on
parameters:
-L1 <repeat name> (level one 'repeat name', i.e.
AluSx3)
-L2 <repeat name> (level two 'repeat family', i.e.
SINE)
-L3 <repeat name> (level three 'repeat class', i.e.
Alu)
-maxdiv (max divergence, i.e. -div 0.10, default: 1.0)
-mindiv (min divergence, default: 0)
-minLength <#> (only return repeats at lest this
length, default: 0)
-maxLength <#> (only return repeats less than % of
full length, default: no max)
-minLengthP <#> (only return repeats at lest % of
full length, default: 0%)
-maxLengthP <#> (only return repeats less than % of
full length, default: 100%)
-noexon (do not consider repeats found within coding
exons)
-condenseL2, -condenseL3 (combine read counts for repeats
for same L2 or L3 annotation)
-rmsk (force filtering of repeats in case it thinks your
file is for RNA)
Adjusting
coordinates returned:
-5p (return peak files centered on 5' end of repeats)
-3p (return peak files centered on 3' end of repeats)
-og (return positions relative to full length repeats)
Expression/Read
Coverage Reporting Options:
-d <tag directory 1> [tag directory 2] ... (list of
experiment directories to show
tag counts for) NOTE: -dfile <file> where file is a
list of directories in first column
-count <genes|exons|introns|5utr|3utr|cds> (regions
to count reads in, default: genes)
-strand <+|-|both> (count tags on indicated strand,
default: +)
-pc <#> or -tbp <#> (maximum tags to count per
position, default: 0=no limit)
-log (output tag counts as randomized log2 values - for
scatter plots)
-sqrt (output tag counts as randomized sqrt values - for
scatter plots)
-condenseGenes (Only report one transcript from each gene
locus with highest expression)
-noCondensing (do not condense counts from entries will
same ID, default: do condense)
-noCondensingParts (i.e. report exons separately)
-min <#> (minimum expression value to print,
default: none)
Normalization:
-rpkm (Report results as reads per kb per million mapped)
-norm <#> (Normalize to total mapped tags: default
1e7)
-normMatrix <#> (Normalize to total tags in gene
expression matrix: not used)
-noadj (Don't normalize)
Advanced Normalization: (will adjust normalization from
above)
-quantile (quantile normalization on reported genes)
Checking
read-through expression: (example: "-upstream -1000 -L 2")
-upstream <#> (Distance upstream of each repeat to
check for reads, default: 0 [off])
-downstream <#> (Distance downstream to each repeat
to check for reads, default: 0 [off])
-L <#> (Only keep repeats with local enrichment
greater than #, default: keep all)
Annotation:
-gwasCatalog <gwasCatalog.txt> (Find overlapping
GWAS Catalog file from UCSC)
Command line options for analyzeRNA.pl
Usage:
analyzeRNA.pl <rna | repeats> <genome
version> [additional options...]
-or-:
analyzeRNA.pl <custom RNA/GTF file>
<organism|none> [additional options...]
Available Genomes
(required argument): (name,org,directory,default promoter
set)
Primary
Annotation Options:
-d <tag directory 1> [tag directory 2] ... (list of
experiment directories to show
tag counts for) NOTE: -dfile <file> where file is a
list of directories in first column
-rpkm (Report results as reads per kb per million mapped)
-norm <#> (Normalize to total mapped tags: default
1e7)
-normMatrix <#> (Normalize to total tags in gene
expression matrix: not used)
-noadj (Don't normalize)
-count <exons|introns|genes|5utr|3utr|cds> (Count tags
in introns, exons, etc., default: genes)
-noCondensing (do not condense counts from entries will same
ID, default: do condense)
-pc <#> (maximum tags to count per position, default:
0=no limit)
-strand <+|-|both> (count tags on indicated strand,
default: +)
-gene <data file> ... (Adds additional data to result
based on the closest gene.
This is useful for adding gene expression data. The
file must have a header,
and the first column must be a GeneID, Accession number,
etc. If the peak
cannot be mapped to data in the file then the entry will be
left empty.
-log (output tag counts as randomized log2 values - for
scatter plots)
-sqrt (output tag counts as randomized sqrt values - for
scatter plots)
-tss (estimate actual TSS in 1st exon and report as the
centered position in columns 3 & 4)
-start <#> (start counting tags relative # offset of
beginning of gene)
-end <#> (finish counting tags relative # offset to
end of the gene)
-maxLength <#> (Don't count tags past # bp from the
TSS, useful for GroSeq)
-pausing <#> (calculate ratio of pausing first [# bp
of transcript] to gene body)
Produces 3 columns - promoter rpk, body rpk, and ratio (add
-log for log versions)
Also sets "-count genes". Use "-strand both" when
analyzing Pol II ChIP-Seq
rpk is reads per kb - set -norm 1e6 or -normMatrix 1e6 to
get rpkm
Command line options for getDiffExpression.pl
This program uses
edgeR to find differential expression between sets of genes
(R must be
installed in the executable path, and the edgeR package must
be installed)
Step 1: Run analyzeRNA.pl, but use -noadj (or
analyzeRepeats.pl or annotatePeaks.pl)
Step 2: Run this program using that file (use
-rna/-repeats/-peaks to match program)
The output is
sent to stdout - appends columns to original file containing
data
Group/Batch names
below correspond to order of experiments given as tag
directories
Usage:
getDiffExpression.pl <data file> <group code...>
[options]
i.e. getDiffExpression.pl inputFile.txt groupName1
groupName1 groupName2 > output.txt
(for 2 replicates in first group and 1 in second group)
To normalize
batch effects, add "-batch <batch code 1> [batch
code2] ..." to the command
i.e. getDiffExpression.pl inputFile.txt gName1 gName1 gName2
gName2 -batch 1 2 1 2 > output.txt
(for 2 replicates in each group, where the 1st in each and
2nd in are paired)
Input File format
options:
-rna (for analyzeRNA.pl formatted input, default)
-repeats (for analyzeRepeats.pl formatted input file)
-peaks (for annotatePeaks.pl formatted input file)
Additional
Options:
-dispersion <#> (edgeR common dispersion to use if no
replicates, default: 0.05)
-DESeq (use DESeq instead of edgeR - doesn't support batch
mode yet...)
|