Software for motif discovery and next-gen sequencing analysis

Finding Enriched Peaks, Regions, and Transcripts

HOMER contains a program called findPeaks that performs all of the peak calling and transcript identification analysis.  (Not to be confused with another peak finding program called FindPeaks, which was also very creatively named).  Finding peaks is one of the central goals of any ChIP-Seq experiment, and the same basic principles apply to other types of sequencing such as DNase-Seq.  The basic idea is to identify regions in the genome where we find more sequencing reads than we would expect to see by chance.  There are number of different approaches one can use to find peaks, and correspondingly there are many different methods for identifying peaks from ChIP-Seq experiments.  It is not required that you use HOMER for peak finding to use the rest of the tools included in HOMER (see below).

findPeaks has 7 basic modes of operation:

Peak finding for single contact or focal ChIP-Seq experiments or DNase-Seq.  This type of analysis is useful for transcription factors, and aims to identify the precise location of DNA-protein contact.  This type of peak finding uses a FIXED width peak size, which is automatically estimated from the Tag Autocorrelation. (see below)

Peak finding for broad regions of enrichment found in ChIP-Seq experiments for various histone marks.  This analysis finds variable-width peaks. (see below)

Find Super Enhancers in your data (see below)

De novo transcript identification from strand specific GRO-Seq.  This attempts to identify transcripts from nascent RNA sequencing reads. (see below)

Identification of promoter/TSS from 5'RNA-Seq/CAGE or 5'GRO-Seq data.  More info in the TSS section.

Adjusted parameters for DNase-Seq peak finding.  (NOTE: this is only for tag-style DNase-Seq) More info in the DNase section.

DNA methylation analysis - documentation coming soon...

HOMER does not perform de novo transcript isoform detection from spliced RNA-Seq.  There is already a bunch of great work on this topic, and differential isoform detection, etc. with RNA-seq is quite difficult.  We recommend the Tophat/Cufflinks family of programs for RNA-Seq isoform detection.

Using findPeaks

To run findPeaks, you will normally type: 

findPeaks <tag directory> -style <factor | histone> -o auto -i <control tag directory>

i.e. findPeaks ERalpha-ChIP-Seq/ -style factor -o auto -i Control-ChIP-Seq/

Where the first argument must be the tag directory (required).  The other options are not required.  The "-style <...>" option can be either "factor", "histone", or one of a number of specialized types.  Use the "-i" option to specify a control experiment tag directory (good idea when doing ChIP-Seq).

Output files

Use the "-o <filename>" to specify where to send the resulting peak file.  If "-o" is not specified, the peak file will be written to stdout

If "-o auto" is specified, the peaks will be written to:
"<tag directory>/peaks.txt" (-style factor)
"<tag directory>/regions.txt" (-style histone)
"<tag directory>/transcripts.txt" and "<tag directory>/transcripts.gtf" (-style groseq)

The top portion of the peak file will contain parameters and various analysis information.  This output differs somewhat for GRO-Seq analysis, and is explained in more detail later. Some of the values are self explanatory.  Others are explained below:

# HOMER Peaks
# Peak finding parameters:
# tag directory = Sox2-ChIP-Seq
# total peaks = 10280
# peak size = 137
# peaks found using tags on both strands
# minimum distance between peaks = 342
# fragment length = 132
# genome size = 4000000000
# Total tags = 9908245.0
# Total tags in peaks = 156820.0
# Approximate IP efficiency = 1.58%
# tags per bp = 0.001907
# expected tags per peak = 0.523
# maximum tags considered per bp = 1.0
# effective number of tags used for normalization = 10000000.0
# Peaks have been centered at maximum tag pile-up
# FDR rate threshold = 0.001000
# FDR effective poisson threshold = 0.000000
# FDR tag threshold = 8.0
# number of putative peaks = 10800
# size of region used for local filtering = 10000
# Fold over local region required = 4.00
# Poisson p-value over local region required = 1.00e-04
# Putative peaks filtered by local signal = 484
# Maximum fold under expected unique positions for tags = 2.00
# Putative peaks filtered for being too clonal = 36
# cmd = findPeaks Sox2-ChIP-Seq -style factor -o auto
# Column Headers:

Genome size represents the total effective number of mappable bases in the genome (remember each base could be mapped in each direction)

Approximate IP effeciency describes the fraction of tags found in peaks versus. genomic background.  This provides an estimate of how well the ChIP worked.  Certain antibodies like H3K4me3, ERa, or PU.1 will yield very high IP efficiencies (>20%), while most rand in the 1-20% range.  Once this number dips below 1% it's a good sign the ChIP didn't work very well and should probably be optimized.

Below the header information are the peaks, listed in each row.  Columns contain information about each peak:
  • Column 1: PeakID - a unique name for each peak (very important that peaks have unique names...)
  • Column 2: chr - chromosome where peak is located
  • Column 3: starting position of peak
  • Column 4: ending position of peak
  • Column 5: Strand (+/-)
  • Column 6: Normalized Tag Counts - number of tags found at the peak, normalized to 10 million total mapped tags (or defined by the user)
  • Column 7: (-style factor): Focus Ratio - fraction of tags found appropriately upstream and downstream of the peak center. (see below)
  •                  (-style histone/-style groseq): Region Size - length of enriched region
  • Column 8: Peak score (position adjusted reads from initial peak region - reads per position may be limited)
  • Columns 9+: Statistics and Data from filtering
Two generic tools are available as part of HOMER to convert peak files to BED files and back.  This will allow you to upload your peak files to the UCSC Genome Browser, or convert peak files in BED format from another program into a peak file that can be used by HOMER.  These programs are named pos2bed.pl and bed2pos.pl, which can be used the following way:

pos2bed.pl peakfile.txt > peakfile.bed
bed2pos.pl peakfile.bed > peakfile.txt

Finding Transcription Factor Peaks with HOMER

To find peaks for a transcription factor use the findPeaks command:

findPeaks <tag directory> -style factor -o auto -i <input tag directory>

i.e. findPeaks ERalpha-ChIP-Seq/ -style factor -o auto -i MCF7-input-ChIP-Seq

Identification of Putative Peaks

If findPeaks is run in "factor" mode, a fixed peak size is selected based on estimates from the autocorrelation analysis performed during the makeTagDirectory command.  This type of analysis maximizes sensitivity for identifying locations where the factor makes a single contact with the DNA. Peak size can be set manually with "-size <#>".

findPeaks loads tags from each chromosome, adjusting them to the center of their fragments, or by half of the estimated fragment length in the 3' direction (this value is also automatically estimated from the autocorrelation analysis).  The fragment length can be specified manually using the "-fragLength <#>" option.  It then scans the entire genome looking for fixed width clusters with the highest density of tags.  As clusters are found, the regions immediately adjacent are excluded to ensure there are no "piggyback peaks" feed off the signal of large peaks.  By default, peaks must be greater than 2x the peak width apart from on another (set manually with "-minDist <#>").  This continues until all tags have been assigned to clusters.

After all clusters have been found, a tag threshold is established to correct for the fact that we may expect to see clusters simply by random chance.  Previously, to estimate the expected number of peaks for each tag threshold, HOMER would randomly assign tag positions and repeat the peak finding procedure.  HOMER now assumes the local density of tags follows a Poisson distribution, and uses this to estimate the expected peak numbers given the input parameters much more quickly.  Using the expected distribution of peaks, HOMER calculates the expected number of false positives in the data set for each tag threshold, setting the threshold that beats the desired False Discovery Rate specified by the user (default: 0.001, "-fdr <#>").

HOMER assumes the total number of mappable base pairs in the genome is 2,000,000,000 bp (** change from previous version.  here 2e9 assumes the actual number of mappable positions is actually 2x [think + and - strand]) , which is "close enough" for human and mouse calculations.  You can specify a different genome size using "-gsize <#>".  HOMER also uses the reads themselves to estimate the size of the genome (i.e. that highest tag position on each chromosome).  If this estimate is lower than the default, it will use that value to avoid using too large of a number on smaller genomes (For example, if you used findPeaks on drosophila data without specifying "-gsize").

It is important to note that this false discovery rate controls for the random distribution of tags along the genome, and not any other sources of experimental variation.  Alternatively, users can specify the threshold using "-poisson <#>" to calculate the tag threshold that yields a cumulative poisson p-value less than provided or "-tagThreshold <#>" to specify a specific number tags to use as the threshold.

Filtering Peaks

The initial step of peak finding is to find non-random clusters of tags, but in many cases these clusters may not be representative of true transcription factor binding events.  To increase the overall quality of peaks identified by HOMER, 3 separate filtering steps can be applied to the initial, putative peaks identified:

Using Input/IgG Sequencing as a Control

To use an Input or IgG sequencing run as a control (HIGHLY RECOMMENDED), you must first create a separate tag directory for the input experiment (see here).  Additionally, you can use other cleaver experiments as a control, such as a ChIP-Seq experiment for the same factor in another cell or in a knockout.  To find peaks using a control, type:

findPeaks <tag directory> -style factor -i <control tag directory> -o auto

i.e. findPeaks ERalpha-ChIP-Seq/ -style factor -i Input-ChIP-Seq/ -o auto

HOMER uses two parameters to filter peaks against a control experiment.  First, it uses a fold change (which is sequencing depth-independent), requiring each putative peak to have 4-fold more normalized tags in the target experiment than the control (or specify a different fold change with "-F <#>").  In the case where there are no input tags near the putative peak, HOMER automatically sets these regions to be set to the average input tag coverage to avoid dividing by zero.  HOMER also uses the poisson distribution to determine the chance that the differences in tag counts are statistically significant (sequencing-depth dependent), requiring a cumulative poisson p-value of 0.0001 (change with "-P <#>").  This effectively removes peaks with low tag counts for which there is a chance the differential enrichment is found simply due to sampling error.

One modification in recent versions of HOMER is the in the size of region used to compare experiment with control tags.  Since control experiments are not always performed the same way, e.g. different fragment lengths, it helps to enlarge the peak size for the purposes of comparing experiments to ensure control reads found immediately outside of the peak region are still considered.  By default, HOMER enlarges peaks by 2x to search the control experiment (change with "-inputSize <#>").  This may reduce specificity when trying to identify certain types of peaks.

Filtering Based on Local Signal

Our experience with peak finding is that often putative peaks are identified in regions of genomic duplication, or in regions where the reference genome likely differs from that of the genome being sequenced.  This produces large regions of high tag counts, and if no Input/IgG sample is available, it can be hard to exclude these regions.  Also, it may be advantageous to remove putative peaks that a spread out over larger regions as it may be difficult to pin-point the important regulatory regions within them.

To deal with this, HOMER will filter peaks based on the local tag counts (similar in principle to MACS).  Be default, HOMER requires the tag density at peaks to be 4-fold greater than in the surrounding 10 kb region.  This can be modified using "-L <#>" and "-localSize <#>" to change the fold threshold and size of the local region, respectively.  As with input filtering, the comparison must also pass a poisson p-value threshold of 0.0001, which can be set using "-LP <#>" option.

Filtering Based on Clonal Signal

When we first sifted through peaks identified in ChIP-Seq experiments we noticed there are many peaks near repeat elements that contain odd tag distributions.  These appear to arise from expanded repeats that result in peaks with high numbers of tags from only a small number of unique positions, even when many of the other positions withing the region may be "mappable".  To help remove these peaks, HOMER will compare the number of unique positions containing tags in a peak relative to the expected number of unique positions given the total number of tags in the peak.  If the ratio between the later and the former number gets to high, the peak is discarded.  The fold threshold can be set with the "-C <#>" option (default: "-C 2").  Homer uses the averageTagsPerPosition parameter in the tagInfo.txt file adjust this calculation as to not over-penalize ChIP-Seq experiments that are already highly "clonal".  If analyzing MNase or other restriction enzyme digestion experiments turn this option off ("-C 0");

Disabling Filtering

To disable Input, Local, or Clonal filtering set any combination of "-F 0 -L 0 -C 0".

Finding Variable Length Transcription Factor Peaks

For those interested in finding peaks that are more conforming to the regions of enrichment, add "-region" to the END of the command.  The command line is parsed sequentially, so putting it at the end makes sure other options will not overwrite it.  You can also get finer resolution by increasing "-regionRes <#>".

Peak Centering and Focus Ratios

If the option "-style factor" or "-center" is specified, findPeaks will calculate the position within the peak with the maximum ChIP-fragment overlap and calculate a focusRatio for the peak.  This is not always desired (such as with histone modifications).  The focus ratio is defined as the ratio of tags located 5' of the peak center on either strand relative to the total number of tags in the peak.  Peaks that contain tags in the ideal positions are more likely to be centered on a single binding site, and these peaks can be used to help determine what sequences are directly bound by a transcription factor.  Unfocused peaks, or peaks with low (i.e <80%) focusRatios may be the result of several closely spaced binding sites or large complexes that cross-link to multiple positions along the DNA.  Sometimes this means you have more than one binding site in close proximity (example below), but other times it means you cross-linked the $#@& out of your cells, or you have background in your sample.
focused peak schematic

To isolate focused peaks, you can use the getFocalPeaks.pl tool:

getFocalPeaks.pl <peak file> <focus % threshold> > focalPeaksOutput.txt

i.e. getFocalPeaks.pl ERpeaks.txt 0.90 > ERfocalPeaks.txt

Finding Enriched Regions of Variable Length

To find variable length peaks for histone marks, use the findPeaks command:

findPeaks <tag directory> -style histone -o auto -i <input tag directory>

i.e. findPeaks H3K4me1-ChIP-Seq/ -style histone -o auto -i Input-ChIP-Seq

If the option "-style histone" or "-region" is specified, findPeaks will stitch together enriched peaks into regions.  Note that local filtering is turned off when finding regions.  The most important parameters for region finding are the "-size" and "-minDist" (and of course the fragment length).  First of all, "-size" specifies the width of peaks that will form the basic building blocks for extending peaks into regions.  Smaller peak sizes offer better resolution, but  larger peak sizes are usually more sensitive.  By default, "-style histone" evokes a peak size of 500. 

The second parameter, "-minDist", is usually used to specify the minimum distance between adjacent peaks.  If "-region" is used, this parameter then specifies the maximum distance between putative peaks that is allowed if they are to be stitched together to form a region.  By default this is 2x the peak size.  If you think about histone modifications, the signal is never continuous in enriched regions, with reduced signal due to non-unique sequences (that can't be mapped to) and nucleosome depleted regions.  "-minDist" informs findPeaks how big of a gap in the signal will be tolerated for adjacent peaks to be considered part of the same region. (by default "-style histone" sets this to 1000).

One thing to note is that you may have to play around with these parameters to get the results you want.  If you look at the examples below, you could make arguments for using each of the tracks given what you're interested in and how you would define a "region".

For example: (in the example below, the default size comes from the autocorrelation estimate for the Macrophage-H3K4me1 dataset)

Default Parameters:
findPeaks Macrophage-H3K4me1/ -i Macrophage-IgG -size 150 -minDist 370 > output.txt (i.e. defaults)

Recommend Parameters for fixed width peaks (i.e. for motif finding):
findPeaks Macrophage-H3K4me1/ -i Macrophage-IgG -size 1000 -minDist 2500 > output.txt

Default Parameters for variable length peaks.
findPeaks Macrophage-H3K4me1/ -i Macrophage-IgG -region -size 150 -minDist 370 > output.txt

Effect on variable length peaks if we increase minDist to 1000.
findPeaks Macrophage-H3K4me1/ -i Macrophage-IgG -region -size 150 -minDist 1000 > output.txt

Recommend Parameters for variable length peaks (H3K4me1 at least).
findPeaks Macrophage-H3K4me1/ -i Macrophage-IgG -region -size 1000 -minDist 2500 > output.txt

Effect on variable peaks if we increase minDist to 10000 (H3K4me1 at least).
findPeaks Macrophage-H3K4me1/ -i Macrophage-IgG -region -size 1000 -minDist 10000 > output.txt

                length peaks example

Finding Histone Modification Peaks and Motif Finding

Finding peaks using histone modification data can be a little tricky - largely because we have very little idea what the histone marks actually do.  If you want to find peaks in histone modification data with the purpose of analyzing them for enriched motifs, read this section.  The problem with histone modification data (and some other types) is that the signal can spread over large distances.  Trying to analyze large, variable length regions for motif enrichment is very difficult and not recommended.  As such it is recommended sometimes a better idea to use fixed-size peak finding on histone marks (i.e. H3K4me1 enhancer mark) to improve motif analysis results.   Using fixed size peaks helps make sure the assumptions needed for Motif Finding are, let's say, less violated, when dealing with regions of constant size.

There are two important differences between finding fixed width peaks for transcription factors and histone modifications.  The first is the size of the peaks: Most histone modifications should be analyzed using a peak size in the range of 500-2000 usually (i.e. "-size 1000").  Below is an example of the histone mark distribution around transcription factor peaks (i.e. the things you're hoping Motif Finding will identify), which can be used to help estimate the parameters:

H3K4me1 and H3K4me2 MNase near
                  PU.1 Peaks

The other is that you should omit the "-center" option (i.e. do not specify "-style factor").  Since you are looking at a region, you do not necessarily want to center the peak on the specific position with the highest tag density, which may be at the edge of the region.  Besides, in the case of histone modifications at enhancers, the highest signal will usually be found on nucleosomes surrounding the center of the enhancer, which is where the functional sequences and transcription factor binding sites reside.  Consider H3K4me marks surrounding distal PU.1 transcription factor peaks.  Typically, adding the -center option moves peaks further away from the functional sequence in these scenarios.  An example for finding peaks:

findPeaks H3K4me1-ChIP-Seq/ -i Input-ChIP-Seq -size 1000 -o H3K4me1-ChIP-Seq/peaks.1kb.txt

One issue with finding histone modification peaks using the defaults in HOMER is that the Local filtering step removes several of the peaks due to the "spreading" nature of many histone modifications.  This can be good and bad.  If you are looking for nice concentrated regions of modified histones, the resulting peaks will be a nice set for further analysis such as motif finding.  However, if you are looking to identify every region in the cell that has an appreciable amount of modified histone, you may want to disable local filtering, or consider using the "-region" option below  e.g.:

findPeaks H3K4me1-ChIP-Seq/ -i Input-ChIP-Seq -size 1000 -L 0 -o H3K4me1-ChIP-Seq/peaks.1kb.txt

Also, if using MNase-treated chromatin (i.e. nucleosome mapping), you may want to use "-C 0" to avoid filtering peaks composed of highly clonal tag positions.  These can arise from well positioned nucleosomes (or sites that are nicely digested by MNase at least).

Nucleosome Free Regions (NFR) -nfr

Just like peak centering for transcription factors, findPeaks has an option to look for large "dips" in the histone modification signal to infer where the primary nucleosome free region (NFR) is in the region.  By adding "-nfr" to the command, HOMER will search for the location within the region that has the greatest differential in ChIP-signal and assign that location as the peak center.  In some ways this is trying to identify regions that simultaneously contain both high signal and a nice gap (this is why the absolute difference is used).  This does pretty well when applied to H3K4me2 or H3K27ac data.
NFR diagram


Works much better with MNase treated ChIP-Seq samples.  The output file will then be centered on the NFR - this is useful for motif finding.  By using NFRs instead of whole regions, you can narrow the search for regulatory elements down the +/- 100 bp instead of +/-500 or 1000 bp.  Below is an example of ChIP-Seq peak locations with respect to center of H3K4me2-ChIP-Seq regions generated with and without the "-nfr" flag in Macrophages.

nucleosome free region
                    detection motifs peaks

Peak finding and Sequencing Saturation

HOMER does not try to estimate sequencing saturation, which is the estimate of whether or not you have sequenced enough reads to identify all the peaks in a given experiment.  Generally speaking, if you sequence more, you will get more peaks since your sensitivity will increase.  The only real way to assess this is if you can somehow show that all of the "functional" or "real" peaks have high tag counts (i.e. are well above the threshold for identifying a peak), meaning that sequencing more is not likely to identify more "real" peaks.  This generally cannot be determined by simply re-sampling the data and repeating the peak finding procedure - you need some sort of outside information to assess peak quality, such as motif enrichment or something else.  Simply re-sampling will likely only tell you if you've gotten to the point where you are simply re-sequencing the same fragments again - i.e. becoming clonal.  In short, if there is a method that claims it will tell you if you've sequenced enough only by analyzing your sequencing data alone, it is probably full of crap.

Finding Super Enhancers

The concept of a 'super enhancer' has gained traction as a way to categorize regions of the genome that are contain several enhancers in close proximity.  The concept was pioneered by the Young lab (Whyte et al.).  These 'super enhancer' regions are normally located in the vicinity of important genes and can be useful for describing key components of the regulatory landscape of the cell.  findPeaks has a 'super enhancer' option built in to help identify these regions in a similar manner to how the Young Lab did it ('-style super').

findPeaks <tag directory> -i <input tag directory> -style super -o auto

findPeaks ES-Med1 -i ES-input -style super -o auto
(resulting super enhancers will be in the file ES-Med1/superEnhancers.txt)

How finding super enhancers works:

Super enhancer discovery in HOMER emulates the original strategy used by the Young lab.  First, peaks are found just like any other ChIP-Seq data set.  Then, peaks found within a given distance are 'stitched' together into larger regions (by default this is set at 12.5 kb).  The super enhancer signal of each of these regions is then determined by the total normalized number reads minus the number of normalized reads in the input.  These regions are then sorted by their score, normalized to the highest score and the number of putative enhancer regions, and then super enhancers are identified as regions past the point where the slope is greater than 1.  Example of a super enhancer plot:

HOMER Super enhancer plot

In the plot above, all of the peaks past 0.95 or so would be considered "super enhancers", while the one's below would be "typical" enhancers.  If the slope threshold of 1 seems arbitrary to you, well... it is!  This part is probably the 'weakest link' in the super enhancer definition.  However, the concept is still very useful.  Please keep in mind that most enhancers probably fall on a continuum between typical and super enhancer status, so don't bother fighting over the precise number of super enhancers in a given sample and instead look for useful trends in the data.

Super Enhancer Output

The output file is a peak file containing the super enhancers (if you use "-o auto" the peak file named 'superEnhancers.txt' will be placed in the tag directory).  If you want to get the "typical" enhancers, use the option "-typical <filename>" to output them to a separate peak file.

Making your own Super Enhancer Plot

Find super enhancers like you normally would, but add the option "-superSlope -1000" - the idea is to include ALL potential peaks as 'super enhancers' so that we can plot them together.  Open the resulting peak file in Excel.  The 6th column ("Normalized Tag Count") contains the super enhancer score for each region.  Simply ploting this column as a line plot will give you a sense of what your plot will look like.  To get an official "Young-lab style" plot you'll have to do some Excel algebra to normalize score by the total.

Key parameters for determining super enhancers:

-style super

Required to trigger findPeaks super enhancer routines.

-typical <filename> 

If you want findPeaks to output the typical (non-super) enhancers too, specify -typical followed by the desired output file name.

<Tag Directory>

The tag directory you use for super enhancer calculation is probably the most important step.  In theory, any data could be used.  Normally, you should use either ChIP-Seq for Transcription Factors (Master/Lineage Determining Factors preferably), ChIP-Seq for key co-factors (e.g. Mediator, p300, Brd4, etc.), or enhancer marks (e.g. H3K27ac, H3K4me1/2, etc.).  You can combine transcription factors into a single tag directory (makeTagDirectory Combined/ -d TF1/ TF2/ TF3/) too.

-minDist <#>
The maximum distance used to stitch peaks together.  By default it will be set to 12500.  Setting this larger or smaller will allow super enhancers to expand to larger regions or shrink.  If the data is very dense with peaks, it's probably best to keep this value small.  If on the other hand the data has very few initial peaks (i.e. less than 10,000), you may need to increase this value to identify more regions.
-L <#>   
The local fold change is used to identify initial peak by default for super enhancers.  However, when using histone marks with super enhancers, you may want to disable this set by setting it to 0 or at least 1.
-superSlope <#>
By default Super Enhancers are identified by finding the point in the super enhancer graph where the slope drops below 1.  Use this setting to change the slope threshold.
-superWindow <#>
To ensure the slope calculation is robust, the local slope is calculated by considering peaks across a 10 peak window.

Analyzing GRO-Seq: de novo transcript identification

To find transcripts directly from GRO-Seq, use the findPeaks command:

findPeaks <tag directory> -style groseq -o auto

i.e. findPeaks Macrophage-GroSeq -style groseq -o auto

GRO-Seq analysis does not make use of an control tag directory

Basic Idea behind GRO-Seq Transcript identification

Finding transcripts using strand-specific GRO-Seq data is not trivial.  GRO-Seq measures the production of nascent RNA, and is capable of revealing the loction of protein coding transcripts, promoter anti-sense transcripts, enhancer templated transcripts, long and short functional non-coding and miRNA transcripts, Pol III and Pol I transcripts, and whatever else is being transcribed in the cell nucleus.  Identification and quantifiction of these transcripts is important for downstream analysis.  Traditional RNA-Seq tools mainly focus on mRNA, which has different features than GRO-Seq, and are generally not useful for identifying GRO-Seq transcripts.

Important NOTE: Just as with ChIP-Seq, not all GRO-Seq data was created equally.  Data created by different labs can have features that make it difficult to have an single analysis technique that works perfectly for each one.  As such, there are many parameters to play with the help get the desired results.

A large number of assumptions go into the analysis and are covered in more detail in the GRO-Seq tutorial (coming soon).  In a nutshell, findPeaks tracks along each strand of each chromosome, searching for regions of continous GRO-Seq signal.  Once it encounters high numbers of GRO-Seq reads, it starts a transcript.  If the signal decreases significantly or disappears, the putative transcript is stopped.  If the signal increases significantly (and sustainably), then a new transcript is considered from that point on.  If the signal spikes, but overall does not increase over a large distance, it is considered an artifact or pause site and not considered in the analysis.  Below is a chart that helps explain how the transcript detection works:

Gro-seq groseq schematic

By default, new transcripts are created when the tssFold exceeds 4 and bodyFold exceed 3 ("-tssFold <#>", "-bodyFold <#>").  A small pseudo-count is added to the tag count from region a above to avoid dividing by zero and helps serve to set a minimum threshold for transcript detection ("-pseudoCount <#>", default: 1).  Most transcripts show robust signal at the start of the transcript, and the tssFold helps select for these regions with high accuracy.  The bodyFold is important for distinguishing between "spikes" in signal and real start sites; if a transcript is real, it's likely that increased levels of transcription follow behind the putative TSS.  If the signal is roughly equal before and after the putative TSS, it is more likely to be an artifact.

To increase senstivity, HOMER tries to adjust the size of the bodySize parameter above since it essentially defines the resolution of the detected transcript.  If there are a large number of GRO-Seq tags in a region, the bodySize can be small since there is adequate data to estimate the location of the transcript.  However, if the data is relatively sparse, the bodySize needs to be large to get a reliable estimate of the level of the transcript.  The minimum and maximum bodySizes are 600 and 10000 bp ("-minBodySize <#>", "-maxBodySize <#>").  HOMER uses the smallest bodySize that contains at least x number of tags, where x is determined as the number of tags where the chance of detecting a bodyFold change is less than 0.00001 assuming the read depth varies according to the poisson distribution (adjustable with "-confPvalue <#>", or directly with "-minReadDepth <#>").  The basic idea is that the threshold for tag counts must be high enough that we don't expect it to vary too much by chance.

Using uniquely mappable regions to improve results

Since some transcripts cover very large regions, there are many places where genomic repeats interrupt the GRO-Seq signal of continous transcripts.  To help deal with this problem, HOMER can take advantage of mappability information to help estimate transcript levels where uniquely mapping sequencing reads is not possible.  In general this information is not really that helpful for ChIP-Seq analysis, but in this case it can make an important difference.  For now, HOMER only take specially formatted binary files available below.  To use them, download the appropriate version and unzip the archive:

Human: hg18 hg19
Mouse: mm8 mm9 mm10
Fly: dm3

To use the uniq-map information, specify the location of the unzipped directory on the command line with "-uniqmap <directory>":

findPeak Macrophage-GroSeq/ -style groseq -o auto -uniqmap mm9-uniqmap/

GRO-Seq analysis output

Running findPeaks in groseq mode will produce a file much like the one produced for traditional peak finding, complete with a header section listing the parameters and statistics from the analysis.  HOMER can also produce a GTF (gene transfer format) file for use with various programs.  If "-o auto" is used to specify output, a "transcripts.gtf" file will be created in the tag directory.  Otherwise, you can specify the name of the GTF output file by use "-gtf <filename>".  The GTF file can also be easily uploaded to the UCSC Genome Browser to visualize your transcripts.

de novo transcripts
                    GRO-Seq LNCAP
The GRO-Seq transcript detection works pretty well, but is likely to get some face-lifts in the near future.

Command line options for findPeaks

        Usage: findPeaks <tag directory> [options]

        Finds peaks in the provided tag directory.  By default, peak list printed to stdout

        General analysis options:
                -o <filename|auto> (file name for to output peaks, default: stdout)
                        "-o auto" will send output to "<tag directory>/peaks.txt", ".../regions.txt",
                        or ".../transcripts.txt" depending on the "-style" option
                -style <option> (Specialized options for specific analysis strategies)
                        factor (transcription factor ChIP-Seq, uses -center, output: peaks.txt,  default)
                        histone (histone modification ChIP-Seq, region based, uses -region -size 500 -L 0, regions.txt)
                        groseq (de novo transcript identification from GroSeq data, transcripts.txt)
                        tss (TSS identification from 5' RNA sequencing, tss.txt)
                        dnase (Hypersensitivity [crawford style (nicking)], peaks.txt)
                        super (Super Enhancers, superEnhancers.txt)
                        mC (Cytosine methylation (BS-seq/methylC-seq), regions.txt)

        chipseq/histone options:
                -i <input tag directory> (Experiment to use as IgG/Input/Control)
                -size <#> (Peak size, default: auto)
                -minDist <#> (minimum distance between peaks, default: peak size x2)
                -gsize <#> (Set effective mappable genome size, default: 2e9)
                -fragLength <#|auto> (Approximate fragment length, default: auto)
                -inputFragLength <#|auto> (Approximate fragment length of input tags, default: auto)
                -tbp <#> (Maximum tags per bp to count, 0 = no limit, default: auto)
                -inputtbp <#> (Maximum tags per bp to count in input, 0 = no limit, default: auto)
                -strand <both|separate> (find peaks using tags on both strands or separate, default:both)
                -norm # (Tag count to normalize to, default 10000000)
                -region (extends start/stop coordinates to cover full region considered "enriched")
                -center (Centers peaks on maximum tag overlap and calculates focus ratios)
                -nfr (Centers peaks on most likely nucleosome free region [works best with mnase data])
                        (-center and -nfr can be performed later with "getPeakTags"

        Peak Filtering options: (set -F/-L/-C to 0 to skip)
                -F <#> (fold enrichment over input tag count, default: 4.0)
                  -P <#> (poisson p-value threshold relative to input tag count, default: 0.0001)
                -L <#> (fold enrichment over local tag count, default: 4.0)
                  -LP <#> (poisson p-value threshold relative to local tag count, default: 0.0001)
                -C <#> (fold enrichment limit of expected unique tag positions, default: 2.0)
                -localSize <#> (region to check for local tag enrichment, default: 10000)
                -inputSize <#> (Size of region to search for control tags, default: 2x peak size)
                -fdr <#> (False discovery rate, default = 0.001)
                -poisson <#> (Set poisson p-value cutoff, default: uses fdr)
                -tagThreshold <#> (Set # of tags to define a peak, default: 25)
                -ntagThreshold <#> (Set # of normalized tags to define a peak, by default uses 1e7 for norm)
                -minTagThreshold <#> (Absolute minimum tags per peak, default: expected tags per peak)

        SuperEnhancer Options: (Need to specify "-style super"):
                -superSlope <#> (Slope threshold to identify super vs. typical enh., default: 1.00)
                -superWindow <#> (moving window/number of peaks to use to calculate slope, default: 10)
                -typical <filename> (Output typical enhancers to this file, default: not used)

        MethylC-Seq/BS-Seq options (Need to specify "-style mC"):
                -unmethylC / -methylC (find unmethylated/methylated regions, default: -unmethyC)
                -mCthresh <#> (methylation threshold of regions, default: avg methylation/2)
                -minNumC <#> (Minimum number of cytosines per methylation peak, default: 6)

        GroSeq Options (Need to specify "-style groseq"):
                -tssSize <#> (size of region for initiation detection/artifact size, default: 250)
                -minBodySize <#> (size of regoin for transcript body detection, default: 1000)
                -tssFold <#> (fold enrichment for new initiation dectection, default: 4.0)
                -bodyFold <#> (fold enrichment for new transcript dectection, default: 4.0)
                -endFold <#> (end transcript when levels are this much less than the start, default: 10.0)
                -method <fold|level> (method used for identifying new transcripts, default: fold)
                -fragLength <#> (Approximate fragment length, default: 150)
                -uniqmap <directory> (directory of binary files specifying uniquely mappable locations)
                        Download from http://biowhat.ucsd.edu/homer/groseq/
                -confPvalue <#> (confidence p-value: 1.00e-05)
                -minReadDepth <#> (Minimum initial read depth for transcripts, default: auto)
                -pseudoCount <#> (Pseudo tag count, default: 2.0)
                -rev (reverse strand of reads - for first-strand rna-seq/gro-seq)
                -gtf <filename> (Output de novo transcripts in GTF format)
                        "-o auto" will produce <dir>/transcripts.txt and <dir>/transcripts.gtf

Links to alternative peak finding software

Lots of quality programs exist for finding peaks in ChIP-Seq data.  Most use slightly different assumptions and peak definitions that result in slightly different sets of peaks.  Many of these programs output peak files in BED format.  To covert these to HOMER peak file format, use the bed2pos.pl program to convert the file (as of now, most HOMER programs work with BED files, so this isn't really necessary):

bed2pos.pl peaks.bed > peaks.txt

Peak finding software links:

Can't figure something out? Questions, comments, concerns, or other feedback: