HOMER

Software for motif discovery and ChIP-Seq analysis

Quantifying Differential Features (Enrichment/Expression)

If you boil down the analysis of most quantitative NGS data types and experiments, it all comes down to defining a set of features (whether they are peaks, transcripts, promoter regions, etc.) and then quantifying your set of experiments across those features to identify if any of them are differentially regulated. Within HOMER, feature identification is generally handled by findPeaks, while quantification is mostly performed with annotatePeaks.pl (peak/region based) or analyzeRepeats.pl (gene/repeat based). The analysis of differential regulation in handled by getDiffExpression.pl, which uses R/Bioconductor and DESeq2 or edgeR to perform the differential enrichment calculations. If you are trying to find differential peaks from scratch, consider using the getDifferentialPeaksReplicates.pl command instead, which automates running findPeaks/annotatePeaks.pl/getDiffExpression.pl for you.

The other operation that getDiffExpression.pl will do is normalize your input raw counts and return this in the output file along with the differential regulation statistics. By default the program will output rlog variance stabilized values calculated by DESeq2. These are useful when doing clustering, PCA analysis, etc. This can be modified with the options specified below.

Required 3rd party software to run getDiffExpression.pl:

HOMER requires R/Bioconductor to be installed with packages for DESeq2 for getDiffExpression.pl to work. The command "R" must be available in your executable PATH. For tips on how to install them, check out the installation page.

R statistical computing environment
Bioconductor (R packages for bioinformatics)
DESeq2
DESeq - packages for differential expression calculations (DESeq2 is recommended)
EdgeR

Step1: Preparing input files for getDiffExpression.pl

The getDiffExpression.pl program is essentially a wrapper for R/Bioconductor/DESeq2 to make running those programs easy using data generated by other HOMER programs. In general, we want to provide a data matrix containing raw read counts across several experiments that we want to compare (tab-delimited text format). These types of matrices are readily created using annotatePeaks.pl or analyzeRepeats.pl and using the "-raw" (or "-noadj") options to report the raw counts.

For example, lets say we have RNA-seq tag directories for WNT and Mock treated conditions in replicate. To quantify their expression (raw counts), we might run the following:

analyzeRepeats.pl rna hg38 -count exons -condenseGenes -raw -d RNAseq-Mock-rep1/ RNAseq-Mock-rep2 RNAseq-WNT-rep1/ RNAseq-WNT-rep2/ > countTable.rna.txt

For another example, lets consider quantifying the differences in H3K4me3 signal at promoters for the same conditions:

annotatePeaks.pl tss hg38 -raw -d H3K4me3-Mock-rep1/ H3K4me3-Mock-rep2/ H3K4me3-WNT-rep1/ H3K4me3-WNT-rep3/ > countTable.peaks.txt

You can also use analyzeRNA.pl to generate these matrices, but generally I would recommend using analyzeRepeats.pl. You can also create your own simple data matrix that contains a single column of IDs (i.e. gene/region IDs) and the top row for experiment names.

IMPORTANT: Make sure you remember the order that your experiments and replicates where entered in for generating these commands. Because experiment names can be cryptic, you will need to specify which experiments are which when running getDiffExpression.pl to assign replicates and conditions.

Some of you might be wondering if you can use FPKM, TPM, normalized read counts, or some other type of processed data. The simple answer is "yes you can" (getDiffExpression.pl will round everything to the nearest integer and it shouldn't 'crash'), but the better answer is "no you shouldn't". For a vast majority of NGS assays, the raw read counts have well-behaved variance structure as a function of intensity, which is what DESeq2 and EdgeR will be modeling when calculating the significance differences in the data. FPKM, TPM, etc. introduces normalization factors (i.e. for the length of the gene) that will obscure the intensity vs. variance relationship and undermine the assumptions used by the programs. There may be cases where you do want to provide processed values instead of raw counts, so the program will do its best to choke them down!

Step2: Running getDiffExpression.pl to find differentially regulated features

Once you have a raw count file containing a tab-delimited matrix of raw count files, you're ready to run getDiffExpression.pl. The getDiffExpression.pl program is executed with the following arguments:

getDiffExpression.pl <raw count file> <group code1> <group code2> [group code3...] [options] > diffOutput.txt

#example for the RNAseq file prepared from above
getDiffExpression.pl countTable.rna.txt Mock Mock WNT WNT > diffOutput.txt

Key arguments for getDiffExpression.pl:

[required] The raw count file is the first argument to the program - this is the output file from analyzeRepeats.pl, analyzeRNA.pl or annotatePeaks.pl. It can also be a simple tab-delimited text file with a header and a single column of IDs followed by the matrix of counts.

[required] Provide sample group annotation for each experiment with an argument on the command line (in the same order found in the file, i.e. the same order given to the annotatePeaks.pl/analyzeRepeats.pl command when preparing the raw count file).
i.e. if you provide three tag directories for macrophage RNA-Seq and two tag directories for Bcell RNA-seq, enter " mac mac mac bcell bcell ". This tells the program which columns go with which experiment, and which are replicates.

Differential Expression Calculation program - To use any of them they must already be installed on your local copy of R:

"-edgeR"

"-DESeq"

"-DESeq2" (default)

[Usually not needed] Use "-repeats", "-rna", "-peaks", or "-basic" depending on if you used analyzeRepeats.pl, analyzeRNA.pl or annotatePeaks.pl to generate the initial raw count table. The program will generally auto detect which program was used.

Other Parameters and Considerations

No replicates

If you do not have replicates, the R packages DESeq2 or DESeq will automatically attempt to model the variance and provide differential expression, although they will be VERY conservative and you'll be luck to find anything that is differentially expressed. edgeR, on the other hand, requires that you set the common dispersion when performing the calculation to estimate what you think the general variance in the sample is. You can set this dispersion explicitly "-dispersion <#>" (default is 0.05, only works with edgeR). The other potential problem is that DESeq2 might have a problem performing the rlog variance stabilization described below. Because of that you may also want to use the options "-simpleNorm" to normalize the output read counts so that they are directly comparable.

In general, if you don't have replicates, I would recommend running the program with the paramters "-edgeR -simpleNorm -dispersion 0.05".

Multiple Comparisons

By default, if 3 or more groups of samples are provided, the script will compare the first experiment against each of the others. If you instead want all pairwise comparisons to be performed (all vs. all), specify "-AvsA". For example, lets say you had 6 experiments across 3 different conditions. If you specify "cond1 cond1 cond2 cond2 cond3 cond3", the output would provide differential calls for "cond1 vs. cond2" and "cond1 vs. cond3". Adding the "-AvsA" option would include the "cond2 vs. cond3" as well.

Unbalanced Normalization

Normally RNA-seq analysis is carried out under the assumption that the distribution of gene expression values should be relatively similar between experiments. There are situations where this assumption breaks down, such as with certain types of ChIP-Seq analysis (i.e. IP vs. input control). If you would prefer that the differential enrichment calculation normalize to the total mapped reads in the tag directory instead of normalizing based on the matrix of gene expression values, specify "-norm2total". Unfortunately, this will only work with DESeq/DESeq2 (edgeR doesn't really support this feature). The "-norm2total" option is very useful for ChIP-Seq when you do not expect the signal from each experiment to be comparable, like when comparing target vs. IgG experiemnts. However, if you were comparing a type of experiment where you do expect similar signal (for example, H3K27ac levels might be assumed to be similar across conditions), it is not recommended that you use the "-norm2total" option.

Reporting Lists of Differentially Expressed Genes

By default, HOMER returns a full list of genes with log2 Fold change, p-values, and FDR calculations for each comparison. HOMER can create files with only differentially expressed genes/features if you specify the "-export <prefix>", where <prefix> is the start of each filename that will contain the differentially expressed genes. The "-log2fold <#>" and "-fdr <#>" parameters control the threshold for differential detection. If "-export <prefix>" is not included in the command, the program will still output the numbers of differentially expressed genes to stderr to give you an idea how the comparison turned out.

Accounting for Paired Samples/Experiment Design

If your samples are paired or have other relationships, you may want to try to account for batch effects. EdgeR and DESeq2 allow you to apply a generalized model to try to remove effects caused by analyzing data on a different day, from the same patient, or slightly different batch. For example, lets say you did an experiment with a control and a drug treatment, then did a second experiment two weeks later with a different library preparation protocol, etc. You can tell the program that the samples are linked by specifying a batch code - the code is like the experiment annotation in that it applies to each experiment in the same order that they were listed when preparing the raw count matrix:

analyzeRepeats.pl rna hg19 -noadj -d Week1-Ctrl/ Week1-Drug/ Week2-Ctrl/ Week2-Drug/ > output.txt
getDiffExpression.pl output.txt -repeats ctrl drug ctrl drug -batch 1 1 2 2 > diffOutput.txt

This type of analysis is analogous to a paired t-test. The design formula will also be applied to the variance stabilization transforms discussed in the next paragraph.

Variance Stabilization/Normalized Read counts in output file

By default, getDiffExpression.pl will now also perform a variance stabilization transform on your raw count data so that viewing the data is easier after the command is finished. The idea behind variance stabilization is to "transform" the data such that variance in the read counts is relatively constant as a function of intensity. This limits the "flare" of low expression variance that you normally see in log-transformed scatter plots when comparing one experiment versus another. By default the program will use DESeq2's rlog transform to create nicely normalized log2 read counts ("-rlog"). The program also supports DESeq2's VST transform (option "-vst"), which is faster and may be recommended if you are processing a lot of samples. In both cases the tranforms will be conducted with the design matrix specifying which samples are replicates (or paired in the batch definition). If you prefer read counts that are simply normalized to the total count in the raw count matrix, specify "-simpleNorm", or "-raw" to keep the data the same as the input data.

NOTE: The rlog and vst transforms may have problems if there are no replicates or too few samples (i.e. only 2), so in these cases it is recommended to use "-simpleNorm" or "-raw".

Output Format

This command will basically add 3 columns to the end of your original file per pairwise comparison. Most important are the log2 fold change and the p-value and FDR values. The raw read counts in the output file will be log2 transformed and variance stabilized (using DESeq2's rlog) unless specified otherwise (i.e. "-vst", "-raw", "-simpleNorm"). This is generally useful for other types of downstream analysis.

Command line options for getDiffExpression.pl

        This program uses DESeq2/edgeR to find differential expression between sets of genes
        (R must be installed in the executable path, and the DESeq2/edgeR package must be installed)
           Step 1: Run analyzeRepeats.pl, but use -raw (or analyzeRNA.pl or annotatePeaks.pl)
           Step 2: Run this program using that file (use -repeats/-rna/-peaks to match program)
        The output is sent to stdout - appends columns to original file containing data
        Group/Batch names below correspond to order of experiments given as tag directories

        Usage: getDiffExpression.pl <data file> <group code...> [options]

                i.e. getDiffExpression.pl inputFile.txt groupName1 groupName1 groupName2 > output.txt
                        (for 2 replicates in first group and 1 in second group)

        To normalize batch effects, add "-batch <batch code 1> [batch code2] ..." to the command
                For now batch normalization is only available with edgeR and DESeq2
                i.e. getDiffExpression.pl inputFile.txt gName1 gName1 gName2 gName2 -batch 1 2 1 2 > output.txt
                        (for 2 replicates in each group, where the 1st in each and 2nd in are paired)

                Update: Expression values are now variance stabalized with DESeq2/rlog by default

        Input File format options (will try to autodetect):
                -rna (for analyzeRNA.pl formatted input, default)
                -repeats (for analyzeRepeats.pl formatted input file)
                -peaks (for annotatePeaks.pl formatted input file)
                -basic (for simple file with one column of gene identifiers and then the count data)

        Additional Options:
                -dispersion <#> (edgeR common dispersion to use if no replicates, default: 0.05)
                -norm2total (normalize using tag directory totals, default: normalize to gene totals(i.e.table)
                -AvsA (compare each group vs. each group, default: compare 1st group vs. others)
                -showR (Show R status updates, command output)

        Differential Expression program selection:
                -DESeq2 (use DESeq2, now the default)
                -DESeq (use DESeq instead of DESeq2 - doesn't support batch mode)
                -edgeR (use edgeR, - doesn't support -norm2total for normalization to total mapped reads)

        DE List Reporting:
                -export <prefix> (output differential expression gene lists with filename prefix)
                -fdr <#> (FDR threshold for diff. expression reporting, default: 0.05)
                -log2fold <#> (Log2 Fold threshold for diff. expression reporting, default: 1.0)

        Expression value normalization/Variance Stabilization:
                -rlog (Use DESeq2's rlog transform to normalize output matrix, default)
                -vst (Use DESeq2's vst transform, quicker for large sample sets)
                -simpleNorm (Normalize to experiment totals, i.e. basic normalization)
                -raw (do not normalize read counts in output table, keep raw values)

        Current versions that work:
                R: v3.3.1
                Bioconductor: v3.2
                DESeq v1.24.0
                DESeq2 v1.12.4
                edgeR v3.14.0
                limma v3.29.0

Can't figure something out? Questions, comments, concerns, or other feedback:
cbenner@ucsd.edu