Software for motif discovery and ChIP-Seq analysis

Finding Peaks and Differential Peaks with Replicate Experiments

The following outlines HOMER's recommended approach to identifying peaks that are statistically enriched across replicates.  As you might expect, the use of replicate experiments is highly encouraged and can help reduce the number of false positives in your data. This page covers both the initial identification of peaks (i.e. IP vs. input) and also the identification of differential peaks (IP vs. IP).

There are two general (but related) approaches to identifying differential peaks. The first is to use the getDifferentialPeaksReplicates.pl command, which attempts to automate the steps described below into a single command to generate a peak file.  The second is to prepare your own regions and read counts and then use getDiffExpression.pl directly to calculate the differential enrichment.  That's covered in more detail (along with differential expression) in the next section, and is recommended if you want a little more fine control on how things are quantified. In both cases, HOMER uses R/Bioconductor and DESeq2 (by default) to perform the differential enrichment calculations.

Looking for overlapping peaks (i.e. Venn Diagram) or differential peaks without replicates:

The information below is catered to the analysis of peaks using replicate experiments. For more basic operations/analysis without replicates, check out the page covering the mergePeaks and getDifferentialPeaks programs.

Required 3rd party software:

HOMER requires R/Bioconductor to be installed with packages for DESeq2 installed.  For tips on how to install them, check out the installation page.
R statistical computing environment
Bioconductor (R packages for bioinformatics)
DESeq      - packages for differential expression calculations (DESeq2 is recommended)

Identifying Peaks from Replicates using getDifferentialPeaksReplicates.pl

To identify peaks from a series of ChIP-seq (or DNase/ATAC-seq) experiments with input/IgG control experiments, run the following command:
getDifferentialPeaksReplicates.pl -t <target tag directory 1> [target tag directory 2] ... -i <input tag directory 1> [input tag directory 2] [options] > outputPeaks.txt

getDifferentialPeaksReplicates.pl -t Oct4-r1/ Oct4-r2/ -i Input-r1/ Input-r2/ -genome hg19 > outputPeaks.txt
The output file is a HOMER-style peak file that contains several columns of annotation, normalized read counts, and differential enrichment statistics from DESeq2 for each peak.

The program will essentially perform 3 steps:
  1. First, it will pool the target tag directories and input directories separately into pooled experiments and perform an initial peak identification (using findPeaks).  Pooling the experiments is generally more sensitive than trying to merge the individual peak files coming from each experiment (although this can be done using the "-use <peaks.txt...>" option if each directory already has a peak file associated with it).
  2. Next, it will quantify the reads at the initial putative peaks across each of the target and input tag directories using annotatePeaks.pl
  3. Finally, it calls getDiffExpression.pl and ultimately passes these values to the R/Bioconductor package DESeq2 to calculate enrichment values for each peak, returning only those peaks that pass a given fold enrichment (default: 2-fold) and FDR cutoff (default 5%).
One key difference in the use of DESeq2 for finding significant peaks is that HOMER will force the normalization factors for each experiment to reflect the total number of mapped tags per experiment during the DESeq2 calculation.  By default, DESeq2 (and DESeq/edgeR) will normalize the read counts in the input table to be the same (i.e. think quantile normalization), but in this case "input" or IgG experiments would likely have significantly less reads in putative peaks than the target experiments, so it is important to adjust the normalization factors. Unfortunately, edgeR does not support this adjustment so it is not possible to use edgeR to calculate the statistics for differential peak finding (vs. input).

When identifying differential peaks between separate experiments, the program offers a way to include both the "background" ChIP-seq experiments as well as the input experiment by specifying them with either "-b" or "-i".  For example, if trying to find Oct4 peaks that have significantly more reads than Sox2 experiments, you could specify:
getDifferentialPeaksReplicates.pl -t Oct4-r1/ Oct4-r2/ -b Sox2-r1/ Sox2-r2/ -i Input-r1/ Input-r2/ > outputPeaks.txt
The key difference between "-i" and "-b" directories is that the input directories ("-i") are used during the initial peak finding as controls to limit the feature detection and during the differential calculations with DESeq2, while the "-b" directories are used only during the differential calculations with DESeq2.  If both are specified (both -b and -i), the input directories are used during feature selection and the background directories are used during the differential calculation.  In general, if you have input experiments, you should use them with "-i".

The other useful option is to explicitly specify the peaks regions that you want to interrogate. To specify a specific peak file to search for differential peaks in, use "-p <peakFile>".  For example, you might want to find which TSS regions are differentially enriched with H3K27ac data after WNT treatment:
#first get TSS regions using annotatePeaks.pl
annotatePeaks.pl tss hg19 > tss.txt

getDifferentialPeaksReplicates.pl -t H3K27ac-WNT-r1/ H3K27ac-WNT-r2/ -b H3K27ac-notx-r1 H3K27ac-notx-r2/ -p tss.txt > outputPeaks.txt

Common options:

-genome <version> : Genome version to use to annotate peak locations (default: none)

-DESeq2 or -DESeq or -edgeR : Choice of R package to perform differential enrichment calculations (default -DESeq2)

-f <#> : replicate fold change cutoff for peak identification (calculated by DESeq2, default: 2)

-q <#> : replicate FDR cutoff for peak identification (calculated by DESeq2, default: 0.05)

-fdr <#>, -F <#>, -L <#> : options passed to findPeaks during putative peak identification (findPeaks defaults)

-balanced : Do not force the use of normalization factors to match total mapped reads.  This can be useful when analyzing differential peaks between similar data (for example H3K27ac) where we expect similar levels in all experiments. Applying this allows the data to essentially be quantile normalized during the differential calculation.

-all : output all possible peaks with their differential regulation statistics (default: only output regulated peaks)

-style <factor | histone | tss> : style of peaks found by findPeaks during features selection (not applicable if -p is used, default: factor)

-use <peaks.txt|regions.txt|tss.txt>  : Uses the existing peak file found in the tag directories instead of merging the experiment and finding peaks

-p <peak/BED file> : peaks to use for calculating differentially enriched regions - will skip the peak finding step.

Other options are directly passed to findPeaks.

Can't figure something out? Questions, comments, concerns, or other feedback: