logo

HOMER

Software for motif discovery and next-gen sequencing analysis



Guide to RNA-Seq with HOMER

The goal of this page is to provide a RNA-Seq focused discussion of HOMER analysis.  For those who would rather 'roundhouse kick' their data without going into the details, below is a quickstart:

RNA-Seq quick-start:

0. Align FASTQ reads using STAR or similar 'splicing aware' genome alignment algorithm

1. Make tag directories for each experiment

# works with sam or bam (samtools must be installed for bam)
makeTagDirectory Exp1r1/ inputfile1r1.sam -format sam
makeTagDirectory Exp1r2/ inputfile1r2.sam -format sam
makeTagDirectory Exp2r1/ inputfile2r2.sam -format sam
makeTagDirectory Exp2r2/ inputfile2r2.sam -format sam

If the experiment is strand specific paired end sequencing, add "-sspe" to the end.  If it's unstranded paired-end sequencing, no extra options are needed.
makeTagDirectory Exp1/ inputfile.sam -format sam -sspe

2. Make bedGraph visualization files for each tag directory
# Add "-strand separate" for strand-specific sequencing
makeUCSCfile Exp1r1/ -fragLength given -o auto
(repeat for other tag directories)

3. Quantify gene expression across all experiments for clustering and reporting (-rpkm):
# May also wish to use "-condenseGenes" if you don't want multiple isoforms per gene
analyzeRepeats.pl rna hg19 -strand both -count exons -d Exp1r1/ Exp1r2 Exp2r1/ Exp2r2/ -rpkm > rpkm.txt

# Use this result for gene expression clustering, PCA, etc.


4. Quantify gene expression as integer counts for differential expression (-noadj)
# May also wish to use "-condenseGenes" if you don't want multiple isoforms per gene
analyzeRepeats.pl rna hg19 -strand both -count exons -d Exp1r1/ Exp1r2 Exp2r1/ Exp2r2/ -noadj > raw.txt

5. Calculate differential expression (requires R/Bioconductor and the package edgeR to be installed!!)
# Name the samples in the same order as they were used in the analyzeRepeats.pl command.
# Replicates should have the same name (i.e. cond1)
getDiffExpression.pl raw.txt cond1 cond1 cond2 cond2 > diffExp.output.txt

# Check out the results in Excel - there will be 4 columns per comparison: log2 fold, log2 intensity cpm, p-value, and FDR/q-value.











HOMER contains two specialized programs to quantify RNA reads in the genome (analyzeRNA.pl and analyzeRepeats.pl).  The goal of these tools is to make certain types of RNA-Seq/GRO-Seq analysis easy and straightforward.  They are relatively simple - the idea is to count reads in regions - not perform complex probabilistic inference of true gene expression values (such as with cufflinks and some other tools).  The problem with some complex methods is that the assumptions driving them may not always be valid in all situations, and I've found I often run into more problems than solutions with some of these software packages (that's not to say there aren't good tools out there!).

analyzeRNA.pl and analyzeRepeats.pl primarily produce a gene expression matrix from "Tag Directories", and work in a similar manner to how annotatedPeaks.pl works for ChIP-Seq data.  There options to count reads in "genic" vs. "exon" regions, normalization options, and other features.  Both programs work in a very similar manner and most options are interchangeable between the two [Both analyze mRNA or repeats even though their names are different!!].  Generally speaking, at some point in the future everything will be available in analyzeRepeats.pl.  Once a gene expression matrix is generated, it can be treated a lot like microarray data and a series of other analysis techniques can be applied to the data.  HOMER also provides a script to integrate with popular R/Bioconductor differential expression callers (getDiffExpression.pl).

On that note, analyzeRNA.pl and analyzeRepeats.pl DO NOT produce differential expression or differential splicing calls.  It would be great if Chuck had the time to help me with that, but for now I recommend sending the output from analyzeRNA.pl to other utilities such as edgeR or DESeq or whatever your favorite next-gen differential signal finder is.  getDiffExpression.pl is provided as a tool to help automate the process of using EdgeR or DESeq.

analyzeRNA.pl vs. analyzeRepeats.pl

Of all the tools in HOMER, analyzeRNA.pl is probably the least efficient and resource hungry program in terms of memory.  It needs to be rewritten, and more and more of it's functionality is being ported to analyzeRepeats.pl (which is much more efficient).  However, it does perform useful calculations, particularly for GRO-Seq applications such as proximal promoter pausing, that are not available or not automated for other programs.

Use analyzeRNA.pl when:
  • Calculating Pausing Ratios with GRO-Seq data
[Recommended] Use analyzeRepeats.pl when:
  • Quantifying Gene Expression of exonic regions or whole gene bodies from RNA-Seq, GRO-Seq, etc.
  • Quantifying the expression of Repeat regions

Basic usage (same for analyzeRNA.pl and analyzeRepeats.pl):

analyzeRepeats.pl <rna|repeats|gtf file> <genome> [options] -count [genes|exons|introns|cds|3utr|5utr] -d <Tag Directory> [Tag Directory 2] ... > <output file>

i.e. analyzeRepeats.pl rna hg19 -count genes -d IMR90-GroSeq > outputfile.txt

Specifying the Genes/Transcripts to Analyze

The first argument to analyzeRepeats.pl specifies which "gene definition" to use for analysis.  There are three options:
  1. rna - This will direct HOMER to analyze the default RefSeq annotation for that genome.
  2. repeats - HOMER will load repeat definitions from UCSC and assess the expression levels of different repeat classes.  It is recommended to only use this option with analyzeRepeats.pl.
  3. <GTF file> - Specify your own custom genes.  These can an alternative annotation (i.e. Ensembl genes, UCSC genes) or it can be custom, de novo genes you found analyzing GRO-Seq or mRNAs defined using "cufflinks".  If you are looking for a GTF file for you favorite annotation, check out the UCSC Table Browser.  You can use this tool to download GTF formmated files (with the correct genome version) for many popular annotations.
  4. <peak/homer transcript file> - Specify you own regions using a peak file.
Examples include "analyzeRepeats.pl rna mm9 ..." or "analyzeRNA.pl myGenes.gtf mm9 ...".

Choosing Experiments to Analyze

As with all HOMER programs, you need to create "Tag Directories" out of each experiment you want to quantify first.  To quantify them with analyzeRNA.pl simply add one or more directory after "-d".  For example:

analyzeRNA.pl rna hg18 -d LNCaP-RNA-Seq-notx/ LNCaP-RNA-Seq-Dht/ > outputfile.txt

This will produce an output file containing genes expression values for two experiments, "LNCaP-RNA-Seq-notx" and "LNCaP-RNA-Seq-Dht".

Measuring Gene Expression in Exons vs. Gene Bodies.

Depending on the type of sequencing you are analyzing, you will want to quantify RNA from different parts of the gene.  The "-count [...]" option controls which regions of the gene are used for analysis (use like "-count exons" or "-count genes").  The options below only pertain to 'rna' or a custom GTF file:
  • exons - Counts tags in exons only.  Use this for most applications of RNA-Seq, such as polyA-RNA-seq or other techniques that aim to measure mRNA.
  • cds - Counts tags in coding regions only.  This could be useful for quantifying ribosome coverage on coding sequences with techniques such as Ribo-Seq
  • introns - Counts tags on introns only.
  • 5utr or 3utr - Count tags on 5' UTR and 3' UTR regions, respectively.
  • genes (default) - Counts tags on the full gene body (TSS to TTS).  This is useful for GRO-Seq where we expect coverage across the entire transcript.  Can also be used to quantify H3K36me3 or PolII ChIP-Seq.
By default, analyzeRepeats.pl assumes your RNA is strand-specific.  If it is not, or you are using analyzeRepeats.pl for other types of data such as ChIP-Seq, you can specify:

"-strand +" - default, only measure + strand (relative to gene orientation), for strand specific RNA
"-strand -" - only measure - strand (relative to gene orientation)
"-strand both" - count reads on both strands, for non-strand specific RNA or ChIP-Seq

HOMER counts reads by considering their 5' position.  It will also assign reads to each transcript they are found within.

analyzeRNA.pl specific features

There are a couple options to slightly modify how tags are counted to avoid TSS/TTS features.  By specifying "-start <#>", you can adjust where (relative to the TSS) HOMER starts counting reads.  This is useful for GRO-Seq when you might want to avoid the promoter-proximal paused region by add "-start 500" to start counting 500 bp downstream from the TSS.  Negative values would start counting 500 bp upstream.  The optoin "-end <#>" is similar, but applies to the TTS.  Negative values will stop the tag counting upstream of the TTS ("-end -500"), while positive values will count tags further downstream.

Normalization of Gene Expression Values

Normalization is probably the trickiest part about RNA-Seq.  There are lots of papers on it.  HOMER offers a couple different normalization options depending on what the experiment is and what your needs are.

By default, HOMER normalizes each experiment to 10 million mapped reads, which is the same normalization strategy used in annotatePeaks.pl for ChIP-Seq data.  However, RNA has the potential to contain much more "contaminates" than ChIP-Seq.  In ChIP-Seq, the background is the genome, with random fragments coming down in the immunoprecipitation step.  With RNA, instead of a two copies of the genome per cell, you have 95%+ ribosomal RNA, along with a host of other very very common short RNA species such as tRNAs, snoRNAs, and other problematic things, and then a small fraction of what you're actually interested in.  Most RNA-Seq protocols contain enrichment steps, such as polyA selection, to isolate mRNAs from the rest of the crap (my apologies to those studying rRNA and tRNA).  These enrichment steps can have different efficiencies from sample to sample, with some samples containing more rRNA, tRNA etc. than others.  When mapping to the genome, many of these RNA species will be discarded if you only keep reads that map uniquely since the genomic elements that create them are often repeated in the genome.  However, you're bound to still map many of these reads.  Differential contamination of common RNA species can throw off the default normalization - if 50% of your "mapped" reads are rRNA in one sample and only 25% in another, you're likely going to have problems.

There are several things you can do to combat these problems.  One is to "pre-clear" common RNA species computationally, i.e. map your reads to rRNA first to remove them, then map the rest to the genome.  Another strategy is to normalize strictly with mRNA species only instead of considering the total number of mapped reads.  In general, I would recommend normalizing in this strategy ("-normMatrix <#>") - it's probably the safest in most situations due to the contamination problem.  Below are the various normalization options in analyzeRepeats.pl and analyzeRNA.pl.
  • -norm <#> - Normalize the total number of mapped reads per experiment to #, this is the default "-norm 1e7".
  • -normMatrix <#> - Normalize the total of number of reads found in the gene expression matrix to # (i.e. normalize total reads in mRNAs)
  • -noadj - Don't perform any normalization, just report the exact number of reads found.  This is useful when sending the output of HOMER to another differential expression program such as edgeR or similar that requires the raw read counts.
  • -quantile - [analyzeRepeats.pl only] perform quantile normalization (might be the best option in some cases)
The options above perform the actual normalization.  "Dress up" normalization/transformations options follow:
  • -rpkm - Report normalized values as reads per kilobase per million mapped reads
  • -log - Report log normalized values with jitter =log2(x+1+rand(0-1))
  • -sqrt - Report square root of the normalized value with jitter =sqrt(x+rand(0-1))

Limiting the number of Tags per Position ("-pc <#>")

Another important parameter (that I almost made mandatory) is "-pc <#>", which controls the maximum number of reads to consider per position.  In the case of some genes, there may be high-abundance RNA species in introns, where large spikes in RNA reads may occur.  If quantifying expression from gene bodies (i.e. "-count genes", GRO-Seq), these reads will be counted toward the gene's total abundance.  Ideally these are removed.  However, as a quick and dirty solution to this problem, you can limit the number of reads that HOMER counts from each position.  In many cases rRNA loci will have tens of thousands of reads per bp.  On the flip side, limiting the number of reads per bp can kill your dynamic range.  Usually the middle gound such as "-pc 3" is a good way to go - you effectively remove the super spikes but still allow genes with high RPKM to "express themselves" in your analysis.

analyzeRepeats.pl specific options and output

analyzeRepeats.pl leverages more of HOMER's C++ base to make read counting and processing much faster.  This also enables repeats to be efficiently quantified across the genome. 

Gene Expression Options and Output

analyzeRepeats.pl also has some useful features with respect to gene annotations that help control how the information is outputted when using 'rna' or your own GTF file:
  • -condenseGenes (only reports one isoform per locus - based on Entrez Gene ID)
  • -noCondensing (do NOT combine expression information if the same transcript ID is represented twice.  Lets say a gene shows up in two locations in the genome.  Normally HOMER will add their expression information together.  Specifying this option will separate their expression information)
  • -noCondensingParts (report expression of each 'fragment' of a transcript i.e. exons reported separately.  Normally each feature is counted separately and added together - this will report each part separately).
The output file will contain the following information (by column):
  1. Transcript ID (RefSeq if 'rna' is used)
  2. Chromosome
  3. Start
  4. End
  5. Strand
  6. RNA length
  7. copies of transcript annotation in the genome
  8. annotation information
  9. Data: First Tag Directory Gene Expression information
  10. [Data: Second Tag Directory Gene Expression information]
  11. ...
If you want additional information, consider running the command addGeneAnnotation.pl to beef up the annotation output:
addGeneAnnotation.pl <data file> <organism> [header: yes|no]
i.e. addGeneAnnotation.pl output.txt human yes > output.ann.txt

Analyzing Repeat and non-mRNA Expression

mRNA is not the only RNA species present in the cell.  In fact, it makes up a very small fraction of the total.  analyzeRepeats.pl offers an option to help quantify repeat RNAs, rRNA, etc.  By specifying "repeats", HOMER will load the definition of all repeats in the genome and quantify tags on them.  Within the repeat definition are tRNAs, rRNAs, etc. so you get more than just transposable elements.  Use it like this:

analyzeRNA.pl repeats mm9 -d Macrophage-RNAseq > outputfile.txt

By default, all the repeats are treated like "exons".  Repeats from similar classes are combined to give a single expression value.

To do repeat expression justice, you should carefully consider how the data is mapped to the genome.  Normally for ChIP-Seq (or even RNA-Seq), you do not want to consider reads that map to multiple locations in the genome.  However, in the case of RNA repeats, this means that you will be discarding many of the reads mapping to repeat regions.  One trick that works fairly well is to keep one random position from the mapping, regardless if it maps uniquely to the genome (but only do it for this type of analysis).  Typically, if a read maps to multiple locations, those multiple locations are probably all the same type of repeat element, so it will be added to the expression of that repeat class regardless of where it is specifically placed.  HOMER will keep one read (the primary alignment) if you add the "-keepOne" command when creating Tag Directories.  This is "approximate", so use with care.  For example, when making tag directories:
makeTagDirectory Macrophage-RNAseq alignment.sam -keepOne
The thing to note is that repeat/rRNA/tRNA expression is compounded by the fact that most protocols try to get rid of it, so depending on the efficiency of these clearing steps experiment-to-experiment, you may be measuring the difference in clearing efficiency rather than the actual expression of the RNA.

Output Format (stdout):

* Since most repeats are found in hundreds or thousands of locations, only the repeat with the highest individual read density is reported.

  1. Repeat ID (name | family | class)
  2. * Chromosome
  3. * Start
  4. * End
  5. * Strand
  6. Average repeat length
  7. copies of repeat in the genome
  8. Average divergence of repeat from repbase model
  9. Data: First Tag Directory Gene Expression information
  10. [Data: Second Tag Directory Gene Expression information]
  11. ...

Controlling how repeats are reported

Since there are millions of repeat elements annotated in the mouse or human genomes, it is important to be able to control how they are processed and their gene expression information reported.

Only analyze repeats from a specific name, family, or class (you can look these up by clicking on repeats in the genome browser).  For example, by specifying "-L2 LINE", only LINE elements will be analyzed.  Similarly, "-L3 L1" will restrict analysis to LINE L1 elements.

-L1 <name> (Specific repeat names, such as L1HS, AluSx, etc.)
-L2 <family> (top level, SINE, LINE, LTR, DNA, RNA, Satellite, Low_complexity, Simple_repeat)
-L3 <class> (2nd level, such as L1, L2, Alu, etc.)
(default: report all repeats)
Normally, repeat expression is condensed such that all reads found in a given repeat name (i.e. -condenseL1) are reported as a single element.  You can also report reads condensed down to each class or family.  Alternatively, you can report each instance of a repeat separately (-noCondensing) - however, if you do this, it is recommended you filter your repeats with one of the options above to avoid outputting millions of different repeat elements:
-noCondensing (report each repeat element separately)
-condenseL1 (default, condense reads from each repeat name into one entry)
-condenseL2 (condense reads from each family into a single entry)
-condenseL3 (condense reads from each class into a single entry)
You may also find yourself only interested in repeat elements that have minimal divergence from their consensus repbase element.  The idea is that the less divergence, the more likely that they are the result of recent insertion event.  Divergence in this case is measured as the fraction of mutations relative to the original repeat element (0-100%).  You may also want to limit your analysis to repeats that are full length.  The following options allow you to control which repeats are used in the analysis:
-maxdiv <#> (Maximum repeat divergence to consider 0-1.00, default, no maximum)
-mindiv <#> (Minimum repeat divergense to consider, range 0-1.00, default no minimum)
-maxLength <#> (Maximum length of repeat to consider in bp, default no maximum)
-minLength <#> (Minimum length of repeat to consider in bp, default no minimum)
-maxLengthP <#> (Maximum length of repeat to consider as % of full length, range 0-1.00, default no maximum)
-minLengthP <#> (Minimum length of repeat to consideras % of full length, range 0-1.00, default no minimum)
-noexon (Do not consider repeats in exons - useful to avoid counting mRNA reads as repeat reads...)

Checking for transcript read-through

One of the biggest challenges of quantifying expression of repeats is that often the reads are not from the repeat but instead represent signal from a transcript that transcribes 'through' the repeat, such as a SINE repeat in the 3' UTR of a protein coding gene.  To help identify instances where this may be a problem, analyzeRepeats.pl will quantify the reads immediate upstream or downstream of repeats if you specify "-upstream <#>" or "-downstream <#>".  You can then specify "-L <#>" to filter repeats that do not have a read enrichment of at least # relative to the upstream and downstream regions.

Reporting Repeat Coordinates

Lets say that for some reason you want to get the coordinates of all Alu elements - maybe you want to make a histogram of read density using annotatePeaks.pl!  If you do not specify any tag directories (-d "<tag dir>") with analyzeRepeats.pl, the command will simply output a list of positions instead of calculating expression information.  For example, the following will output all Alu elements:
analyzeRepeats.pl repeats hg19 -L3 Alu > output.txt
Useful parameters for this type of output are the following:
-og (report coordinate based on full length repeat element - many elements are truncated insertions, this will line them up at a common reference point)
-5p (report coordinates centered on 5' end of repeat, useful when combined with -og)
-3p (report coordinates centered on 3' end of repeat, useful when combined with -og)

analyzeRNA.pl specific options and output

The following are the steps HOMER takes to produce a Gene Expression Matrix:
  1. Parses the RNA definition file to figure out where all of the genes are in the genome.
  2. Queries each of the "Tag Directories" and determine where each of the tags are within gene bodies.
  3. At that point it will rummage through the tag positions and count only the tags found within the desired regions (i.e. exons).  Incidently, this is also the part that makes it less efficient at the momment - it's done with perl and keeps way too much extra information around.
  4. Expression values are calculated and normalized
  5. Results are formatted and set to stdout
The output gene expression matrix with the following columns:
  1. Transcript ID (RefSeq accession is the default with "rna" option, custom name is used otherwise)
  2. chromosome
  3. start
  4. end
  5. strand
  6. mRNA length (e.g. exons only)
  7. gene length (TSS to TTS)
  8. Copies in genome (some genes map to the genome more than once...)
  9. Symbol (e.g Gene Name)
  10. Alias (alternative gene names)
  11. Description
  12. Unigene
  13. Entrez Gene ID
  14. Ensembl
  15. Data: First Tag Directory Gene Expression information
  16. [Data: Second Tag Directory Gene Expression information]
  17. ...
A separate column is generated for each tag directory.  The head of these columns contains the name, the region of the gene used for expression (i.e. gene, exon, 5utr..), the total number of mapped tags in the directory, and the normalization factor.

Due to the fact that some genes map to multiple places in the genome (<1% of RefSeq), only one of their locations will be reported.  However, the gene length and mRNA lengths will be averaged, and their read totals will be averaged in the output.

If custom IDs are used, HOMER will attempt to link them to known gene identifiers (As of now it does not annotate their position, just treats their ID as an accession number and tries to match it wil known genes).  If you give analyzeRNA.pl a custom GTF file from Cufflinks, where the IDs are all "CUFF12345.1", most of the annotation columns will be blank.

Below is an example of the output opened with EXCEL:

example
                gene expression matrix RNA-seq


If you specify the "-log" or "-sqrt" options, you can make a nice X-Y scatter plot of the data columns.  Here is an example:

scatter plot RNA-seq

Quantifying Promoter-proximal RNA Polymerase Pausing with analyzeRNA.pl

One of the neat observations from GRO-Seq experiments is that an actively engadged RNA Polymerases are present at the promoters of many genes, but they do not seem to elongate down the body of the gene.  The idea is that they are "paused", waiting for elongation signals to give them the green light to make the gene.

Proximal
                  Promoter pausing GRO-Seq


HOMER can calculate the "pausing" ratio, which is the ratio of tag density at the promoter relative to the body of the gene.  To use this option, specify "-pausing <#>", where # the is the distance downstream of the TSS to stop counting promoter tags and start counting gene body tags.  A safe value is probably 250-500.

This will queue the program to produce 3 columns of output per experiment.  First, it will report the pausing ratio, and then it will report the promoter and gene body read densities.

Finding Differentially Regulated Genes

As of now, HOMER doesn't have routines for calculating differential gene expression.  However, it does help support analysis with edgeR and DESeq, two R/Bioconductor packages designed for RNA-Seq differential expression analysis.  HOMER contains a script called getDiffExpression.pl to help.  The program requires the following to be installed:
  • R must be available in your executable PATH
  • Bioconductor must be installed
  • The EdgeR and/or DESeq packages must be installed
    (for more info on installation see here)
Using getDiffExpression.pl is a two step process.  First, use analyzeRepeats.pl, analyzeRNA.pl or annotatePeaks.pl (i.e. think ChIP-Seq) to quantify experiments from two different conditions using the "-noadj" option.  This will result in integer read counts for each transcript.
analyzeRepeats.pl rna hg19 -noadj -count exons -d ES-RNAseq-rep1/ ES-RNAseq-rep2/ MEF-RNAseq-rep1/ MEF-RNAseq-rep2/ > output.txt
Then run the getDiffExpression.pl script with the following:
getDiffExpression.pl <data file> <group code...> [-rna | -repeats | -peaks] [options]

i.e. getDiffExpression.pl output.txt es es mef mef -repeats > diffOutput.txt

Key Paremeters for getDiffExpression.pl:

  1. The data file is the first argument to the program - this is the output file from analyzeRepeats.pl, analyzeRNA.pl or annotatePeaks.pl.
  2. Annotate each column with an argument (in the same order given to the first command).
    i.e. if you provide three tagDirs for macrophage RNA-Seq and two tagDirs for Bcell RNA-seq, enter " mac mac mac bcell bcell ".  This tells the program which columns go with which experiment.
  3. Use "-repeats", "-rna", or "-peaks" depending on if you used analyzeRepeats.pl, analyzeRNA.pl or annotatePeaks.pl to analyze the data.
By default the program calls on EdgeR to perform the differential expression calculations.  In the experiment has no replication, you may want to set the dispersion by using "-dispersion <#>" (default is 0.05).  To use DESeq instead of EdgeR, specify "-DESeq".

If your samples are paired in anyway you may want to try to account for batch effects.  EdgeR allows you to apply a generalized model to try to remove effects caused by analyzing data on a different day or slightly different batch.  For example, lets say you did an experiment with a control and a drug treatment, then did a second experiment two weeks later with a different library preparation protocol, etc.  You can tell the program that the samples are linked by specifying a batch code - the code is like the experiment annotation in that it applies to each experiment in the same order that they were listed when preparing the gene expression matrix:
analyzeRepeats.pl rna hg19 -noadj -d Week1-Ctrl/ Week1-Drug/ Week2-Ctrl/ Week2-Drug/ > output.txt
getDiffExpression.pl output.txt -repeats ctrl drug ctrl drug -batch 1 1 2 2 > diffOutput.txt

This is analogous to a paired t-test.

Output Format

This command will basically add 4 to 5 columns to the end of your original file.  Most important are the log2 fold change and the p-value and FDR values.  The raw read counts will be normalized in this file based on the total reads in each column so that the values are more easily to compare when looking through the results.

Other Analysis Ideas

Once your data is in a gene expression matrix, there are countless ways to analyze your data.  Check out this link for more ideas.



Command line options for analyzeRNA.pl

        Usage: analyzeRepeats.pl <input file> <genome> [options] ...

        Where <input file> can be one of:
                rna (quantify gene expression of RefSeq annotated genes)
                repeats (just type "repeats" to load the default repeats for genome)
                <gtf file> (quantify gene expression from GTF file)
                <transcript file> (Homer formatted peak/transcript file i.e. hg19.repeats)

        Available Genomes (required argument): (name,org,directory,default promoter set)

        Specifying distinct classes of repeats or filtering based on parameters:
                -L1 <repeat name> (level one 'repeat name', i.e. AluSx3)
                -L2 <repeat name> (level two 'repeat family', i.e. SINE)
                -L3 <repeat name> (level three 'repeat class', i.e. Alu)
                -maxdiv (max divergence, i.e. -div 0.10, default: 1.0)
                -mindiv (min divergence, default: 0)
                -minLength <#> (only return repeats at lest this length, default: 0)
                -maxLength <#> (only return repeats less than % of full length, default: no max)
                -minLengthP <#> (only return repeats at lest % of full length, default: 0%)
                -maxLengthP <#> (only return repeats less than % of full length, default: 100%)
                -noexon (do not consider repeats found within coding exons)
                -condenseL2, -condenseL3 (combine read counts for repeats for same L2 or L3 annotation)
                -rmsk (force filtering of repeats in case it thinks your file is for RNA)

        Adjusting coordinates returned:
                -5p (return peak files centered on 5' end of repeats)
                -3p (return peak files centered on 3' end of repeats)
                -og (return positions relative to full length repeats)

        Expression/Read Coverage Reporting Options:
                -d <tag directory 1> [tag directory 2] ... (list of experiment directories to show
                        tag counts for) NOTE: -dfile <file> where file is a list of directories in first column
                -count <genes|exons|introns|5utr|3utr|cds> (regions to count reads in, default: genes)
                -strand <+|-|both> (count tags on indicated strand, default: +)
                -pc <#> or -tbp <#> (maximum tags to count per position, default: 0=no limit)
                -log (output tag counts as randomized log2 values - for scatter plots)
                -sqrt (output tag counts as randomized sqrt values - for scatter plots)
                -condenseGenes (Only report one transcript from each gene locus with highest expression)
                -noCondensing (do not condense counts from entries will same ID, default: do condense)
                -noCondensingParts (i.e. report exons separately)
                -min <#> (minimum expression value to print, default: none)
                Normalization:
                        -rpkm (Report results as reads per kb per million mapped)
                        -norm <#> (Normalize to total mapped tags: default 1e7)
                        -normMatrix <#> (Normalize to total tags in gene expression matrix: not used)
                        -noadj (Don't normalize)
                Advanced Normalization: (will adjust normalization from above)
                        -quantile (quantile normalization on reported genes)

        Checking read-through expression: (example: "-upstream -1000 -L 2")
                -upstream <#> (Distance upstream of each repeat to check for reads, default: 0 [off])
                -downstream <#> (Distance downstream to each repeat to check for reads, default: 0 [off])
                -L <#> (Only keep repeats with local enrichment greater than #, default: keep all)

        Annotation:
                -gwasCatalog <gwasCatalog.txt> (Find overlapping GWAS Catalog file from UCSC)

Command line options for analyzeRNA.pl

        Usage: analyzeRNA.pl <rna | repeats> <genome version>  [additional options...]
         -or-: analyzeRNA.pl <custom RNA/GTF file> <organism|none>  [additional options...]

        Available Genomes (required argument): (name,org,directory,default promoter set)

        Primary Annotation Options:
                -d <tag directory 1> [tag directory 2] ... (list of experiment directories to show
                        tag counts for) NOTE: -dfile <file> where file is a list of directories in first column
                -rpkm (Report results as reads per kb per million mapped)
                -norm <#> (Normalize to total mapped tags: default 1e7)
                -normMatrix <#> (Normalize to total tags in gene expression matrix: not used)
                -noadj (Don't normalize)
                -count <exons|introns|genes|5utr|3utr|cds> (Count tags in introns, exons, etc., default: genes)
                -noCondensing (do not condense counts from entries will same ID, default: do condense)
                -pc <#> (maximum tags to count per position, default: 0=no limit)
                -strand <+|-|both> (count tags on indicated strand, default: +)
                -gene <data file> ... (Adds additional data to result based on the closest gene.
                        This is useful for adding gene expression data.  The file must have a header,
                        and the first column must be a GeneID, Accession number, etc.  If the peak
                        cannot be mapped to data in the file then the entry will be left empty.
                -log (output tag counts as randomized log2 values - for scatter plots)
                -sqrt (output tag counts as randomized sqrt values - for scatter plots)
                -tss (estimate actual TSS in 1st exon and report as the centered position in columns 3 & 4)
                -start <#> (start counting tags relative # offset of beginning of gene)
                -end <#> (finish counting tags relative # offset to end of the gene)
                -maxLength <#> (Don't count tags past # bp from the TSS, useful for GroSeq)
                -pausing <#> (calculate ratio of pausing first [# bp of transcript] to gene body)
                        Produces 3 columns - promoter rpk, body rpk, and ratio (add -log for log versions)
                        Also sets "-count genes".  Use "-strand both" when analyzing Pol II ChIP-Seq
                        rpk is reads per kb - set -norm 1e6 or -normMatrix 1e6 to get rpkm

Command line options for getDiffExpression.pl

        This program uses edgeR to find differential expression between sets of genes
        (R must be installed in the executable path, and the edgeR package must be installed)
           Step 1:  Run analyzeRNA.pl, but use -noadj (or analyzeRepeats.pl or annotatePeaks.pl)
           Step 2:  Run this program using that file (use -rna/-repeats/-peaks to match program)
        The output is sent to stdout - appends columns to original file containing data
        Group/Batch names below correspond to order of experiments given as tag directories

        Usage: getDiffExpression.pl <data file> <group code...> [options]

                i.e. getDiffExpression.pl inputFile.txt groupName1 groupName1 groupName2 > output.txt
                        (for 2 replicates in first group and 1 in second group)

        To normalize batch effects, add "-batch <batch code 1> [batch code2] ..." to the command
                i.e. getDiffExpression.pl inputFile.txt gName1 gName1 gName2 gName2 -batch 1 2 1 2 > output.txt
                        (for 2 replicates in each group, where the 1st in each and 2nd in are paired)

        Input File format options:
                -rna (for analyzeRNA.pl formatted input, default)
                -repeats (for analyzeRepeats.pl formatted input file)
                -peaks (for annotatePeaks.pl formatted input file)

        Additional Options:
                -dispersion <#> (edgeR common dispersion to use if no replicates, default: 0.05)
                -DESeq (use DESeq instead of edgeR - doesn't support batch mode yet...)



help
Can't figure something out? Questions, comments, concerns, or other feedback:
cbenner@ucsd.edu