Finding Overlapping and Differentially Bound Peaks
HOMER provides a utility for comparing sets of peaks called
mergePeaks.
Its default behavior is to take two or more peak files and
return a single peak file containing the unique peak
positions from the original files. For example:
mergePeaks -d <maximum distance to merge> <peak file1> <peak file2> [peak file3] ... > newPeakFile.txt
The program will output a
new peak file containing the merged peaks to stdout. Peaks
within the distance in bp specified by "-d <#>" will be
reported as the average position between the peaks found
within the common region. The origin of the peaks is
specified in the 7th column of the new peak file.
Alternatively you can specify "-d given" to require a specific overlap
between the start and end coordinates of the peaks.
This is more useful if comparing large regions as opposed
to peaks. The program will also output the numbers
for creating a venn diagram, and these can be directed to
a specific file by specifying "-venn <filename>".
NOTE: As of v4.4, mergePeaks
looks for specific, literal overlaps by default (-d
given), which matches most users' expectations.
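To make the distance-based merging concrete, here is a minimal Python sketch. It is not HOMER's implementation: it assumes each peak is reduced to a (chromosome, center) pair and uses a simple left-to-right chaining rule, and the names `merge_by_distance` and `_summarize` are hypothetical.

```python
from collections import defaultdict

def _summarize(chrom, cluster):
    """Collapse one cluster into (chrom, average center, source labels, peak count)."""
    centers = [center for center, _ in cluster]
    labels = sorted({label for _, label in cluster})
    return (chrom, round(sum(centers) / len(centers)), labels, len(cluster))

def merge_by_distance(peak_sets, max_dist):
    """Chain-merge peak centers lying within max_dist bp of the previous center.

    peak_sets maps a label (for example a file name) to a list of
    (chrom, center) tuples. This chaining rule is an assumption made for
    illustration, not a description of HOMER's internals.
    """
    by_chrom = defaultdict(list)
    for label, peaks in peak_sets.items():
        for chrom, center in peaks:
            by_chrom[chrom].append((center, label))

    merged = []
    for chrom in sorted(by_chrom):
        cluster = []
        for center, label in sorted(by_chrom[chrom]):
            # Start a new cluster when the gap to the previous center is too large.
            if cluster and center - cluster[-1][0] > max_dist:
                merged.append(_summarize(chrom, cluster))
                cluster = []
            cluster.append((center, label))
        if cluster:
            merged.append(_summarize(chrom, cluster))
    return merged

# Made-up coordinates for illustration only.
example = {
    "pu1.peaks": [("chr1", 1000), ("chr1", 5000)],
    "cebp.peaks": [("chr1", 1060), ("chr2", 300)],
}
result = merge_by_distance(example, 100)
```

In this toy input the two chr1 peaks at 1000 and 1060 fall within 100 bp, so they are reported once at their average position with both source labels, while the remaining peaks stay specific to their own file.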
Separating Peaks into Unique and Overlapping sets
Merging peaks together into
a single file is very useful for certain types of
analysis, such as making scatter plots that compare the
tag-densities between peaks from separate experiments - in
this case you want to count tags at specific and common
regions. Alternatively, you may be interested in
separating the peaks into common and specific sets for
focused analysis. To do this use the "-prefix <filename>"
option - this will create separate files based on
overlapping peaks for each set of peaks. For example:
mergePeaks -d 100 pu1.peaks cebp.peaks -prefix mmm
This will create files named "mmm_pu1.peaks", "mmm_cebp.peaks", and "mmm_pu1.peaks_cebp.peaks".
The output file will contain the following columns:
1. Merged Peak name (will
start with "Merged-")
2. chromosome
3. start (average from merged peaks)
4. end (average from merged peaks)
5. strand
6. Average peak score (actually, the average of the
original values in column 6 of the peak files - or
column 5 of BED files)
7. Original peak files contributing to the merged peak
8. Total number of peaks merged (occasionally more than
one peak from a single file will be merged if the peaks
are within the specified distance, or if two or more peaks
from one file overlap with the same peak(s) from
another file)
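As a rough illustration of how the merged file can be used downstream, the following Python sketch tallies merged peaks by the contents of column 7. It assumes the file is tab-delimited with header lines starting with "#"; the function name `count_by_origin`, the demo rows, and the "fileA+fileB" label are made up for the example and do not show how HOMER actually writes multiple file names.

```python
import csv
import io
from collections import Counter

def count_by_origin(handle):
    """Tally merged peaks by the literal contents of column 7 (contributing files)."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        # Skip header/comment lines and rows lacking the 8 columns described above.
        if len(row) < 8 or row[0].startswith("#"):
            continue
        counts[row[6]] += 1
    return counts

# Placeholder rows in the 8-column layout (not real HOMER output).
demo = io.StringIO(
    "#name\tchr\tstart\tend\tstrand\tscore\tfiles\tn\n"
    "Merged-1\tchr1\t930\t1130\t+\t12.5\tfileA+fileB\t2\n"
    "Merged-2\tchr1\t4900\t5100\t+\t8.0\tfileA\t1\n"
    "Merged-3\tchr2\t200\t400\t+\t6.5\tfileB\t1\n"
    "Merged-4\tchr3\t700\t900\t+\t9.0\tfileA\t1\n"
)
origin_counts = count_by_origin(demo)
```

Grouping on the raw column value avoids guessing at the separator used when several files contribute to one merged peak.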
Peak Co-Occurrence Statistics
The mergePeaks program
can also calculate statistics of co-occurrence
between peaks in a pairwise fashion. If "-matrix <filename>"
is specified, HOMER will calculate statistics about the
pairwise overlap of peaks. Three separate pairwise
matrix files will be produced using the supplied
<filename> as a prefix:
filename.logPvalue.matrix.txt (natural log p-values for overlap using the hypergeometric distribution; positive values signify divergence)
filename.logRatio.matrix.txt (natural log of the ratio of observed overlapping peaks to the expected number of overlapping peaks)
filename.count.matrix.txt (raw counts of overlapping peaks)
The statistics depend on the effective size of the
genome, which can be specified using "-gsize <#>"
(default: 2,000,000,000).
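The Python sketch below shows one way such overlap statistics could be computed. It is a hedged illustration, not HOMER's code. It assumes the genome is divided into `gsize / window` equally sized positions (the `window` parameter is hypothetical) and that each peak occupies one position. It also reads "positive values signify divergence" as meaning the sign of the log p-value is flipped when fewer overlaps than expected are observed.

```python
import math

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def hypergeom_log_pmf(k, total, marked, draws):
    """ln P(X = k) when drawing `draws` items from `total`, `marked` of which count."""
    return (log_choose(marked, k)
            + log_choose(total - marked, draws - k)
            - log_choose(total, draws))

def log_sum_exp(values):
    """Numerically stable ln(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def overlap_stats(n1, n2, overlap, gsize=2e9, window=200):
    """Return (expected overlaps, ln(observed/expected), signed ln p-value).

    `window` (bp per position) is an assumption introduced for this sketch.
    """
    total = int(gsize // window)
    expected = n1 * n2 / total
    log_ratio = math.log(overlap / expected) if overlap > 0 else float("-inf")
    lowest, highest = max(0, n1 + n2 - total), min(n1, n2)
    if overlap >= expected:
        tail, sign = range(overlap, highest + 1), 1.0   # enrichment: upper tail
    else:
        tail, sign = range(lowest, overlap + 1), -1.0   # depletion: lower tail
    log_p = min(0.0, log_sum_exp([hypergeom_log_pmf(k, total, n1, n2) for k in tail]))
    return expected, log_ratio, sign * log_p

# Made-up peak counts: two sets of 10,000 peaks sharing 2,000 positions.
enriched = overlap_stats(10000, 10000, 2000)
# The same sets sharing only 1 position, fewer than the 10 expected by chance.
depleted = overlap_stats(10000, 10000, 1)
```

Under these assumptions the enriched pair yields a negative log p-value and a positive log ratio, while the depleted pair yields a positive log p-value and a negative log ratio.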
Co-Bound Peaks
Sometimes you just want to
know how many other peak sets overlap a set of reference
peaks. If "-cobound
<#>" is specified, mergePeaks counts how many of the
other peak files contain overlapping peaks with the peaks
found in the first peak file. It then outputs peak
files named "coboundBy0.txt", "coboundBy1.txt", etc. up to
the number specified.
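The counting idea can be sketched in Python as follows. This is only an illustration under stated assumptions: peaks are (chrom, start, end) tuples with half-open coordinates, any shared base counts as an overlap, and counts above the requested maximum are placed in the last group. The function names are hypothetical and the capping rule is a guess, not documented HOMER behavior.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def cobound_groups(reference, others, max_cobound):
    """Group reference peaks by how many of the other peak sets overlap them."""
    groups = {n: [] for n in range(max_cobound + 1)}
    for peak in reference:
        # Each other peak set contributes at most 1, however many of its peaks overlap.
        n = sum(any(overlaps(peak, q) for q in peaks) for peaks in others)
        groups[min(n, max_cobound)].append(peak)  # capping is an assumption
    return groups

# Made-up coordinates for illustration only.
reference = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 250)]
others = [
    [("chr1", 250, 450)],
    [("chr1", 200, 260), ("chr1", 1100, 1300)],
]
groups = cobound_groups(reference, others, 2)
```

Here the first reference peak is overlapped by both other sets, the second by one, and the third by none, which mirrors the split into separate "co-bound by N" files.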
Differentially Bound Peaks
To find peaks that are
differentially enriched between two experiments, there are
two basic options. First, you could run findPeaks using the second
experiment as the control sample. Alternatively, you
can use getDifferentialPeaks,
which will take a given list of peaks and quickly identify
which peaks contain significantly more tags in the target
experiment relative to the background experiment. To
use it, follow this syntax:
getDifferentialPeaks <peak/BED file> <target Tag directory> <background Tag directory> [options]
By default it looks for peaks that have 4-fold more tags
(sequencing-depth independent) and a cumulative Poisson
p-value less than 0.0001 (sequencing-depth
dependent). These parameters are adjustable with "-F <#>" and "-P <#>".
By specifying "-same",
peaks that are similar between the two tag directories
will be returned instead of differential peaks. One
caveat is that it is a good idea to set the size of the
region used to search for reads to be larger than the
actual peaks (e.g. +100 bp relative to the peak size) to
avoid problems that arise from experiments with different
fragment lengths, etc.
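The two default criteria can be illustrated with a short Python sketch. It is not HOMER's implementation. It assumes the p-value is the upper tail of a Poisson distribution whose mean is the background count. The function names are hypothetical, and so are the rounding of the target count, the handling of a zero background, and the choice of ">=" for the fold threshold.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed directly so tiny tails stay accurate."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total, i = 0.0, k
    # Add successive terms until they no longer change the sum.
    while term > 0 and term > total * 1e-16:
        total += term
        i += 1
        term *= lam / i
    return min(total, 1.0)

def is_differential(target, background, fold=4.0, pvalue=1e-4):
    """Apply the fold-change and Poisson p-value thresholds described above."""
    ratio = target / background if background > 0 else float("inf")
    p = poisson_upper_tail(round(target), background)  # rounding is an assumption
    return ratio >= fold and p < pvalue

# Made-up normalized read counts for three peaks.
strong = is_differential(40, 8)    # 5-fold with a very small p-value
weak_fold = is_differential(12, 8) # only 1.5-fold
low_depth = is_differential(8, 2)  # 4-fold, but too few reads for p < 0.0001
```

The third case shows why both thresholds matter: a peak with 8 versus 2 reads meets the fold requirement, yet its Poisson p-value is roughly 0.0011, so it would not be reported at the default settings under these assumptions.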
In the output file for getDifferentialPeaks,
columns 8-11 contain information about the differential
peak detection (columns 1-7 come from the original peak
file, including the 'score' and focus 'ratio/other'
columns):
Column 8: Total [normalized] reads in the target tag
directory
Column 9: Total [normalized] reads in the background
directory
Column 10: Fold change (Target/Background Total reads)
Column 11: Poisson-based p-value calculation for target
read total assuming background read total represents the
expected number of reads.
NOTE about normalization for this step: The read
counts are normalized to the total number of reads found
in their tag directories (no fancy normalization is
assumed for this analysis). However, to be
conservative with the Poisson calculations, the
experiment with the larger number of reads is
normalized such that it contains the same number of
reads as the smaller experiment. As a result, only
one of the directories - either the target directory
("Total Tags", column 8) or the background directory
("Background Tags", Column 9), will have integer counts
and the other is likely to contain fractional counts.
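A small worked example in Python may help with the arithmetic. It is a sketch of the scaling described in the note, with hypothetical names and made-up totals. It assumes simple proportional scaling of the larger experiment down to the size of the smaller one.

```python
def normalize_to_smaller(target_count, background_count, target_total, background_total):
    """Scale per-peak counts so both experiments match the smaller total read count."""
    smaller = min(target_total, background_total)
    return (
        target_count * smaller / target_total,
        background_count * smaller / background_total,
    )

# Made-up numbers: 10 million target reads, 25 million background reads.
# A peak with 50 target reads and 31 background reads:
normalized = normalize_to_smaller(50, 31, 10_000_000, 25_000_000)
```

In this example the target count keeps its integer value of 50, while the background count is scaled by 10/25 to 12.4, which matches the observation that only one of the two columns keeps integer counts.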
Command Line options for mergePeaks
Usage: mergePeaks [options] <primary peak file> [additional peak/annotation files...]
Merges and/or compares peak/position files (peak files listed twice are only considered once)
General Options:
-strand (Only merge/consider peaks on the same strand, default: either strand)
-d <#|given> (Maximum distance between peak centers to merge, default: 100)
    Using "-d given" looks for literal overlaps in peak regions
    Use "-d given" when features have vastly different sizes (i.e. peaks vs. introns)
-file <filename> (file listing peak files to compare - for lots of peak files)
-gsize <#> (Genome size for significance calculations, default: 2e9)
Merging Peaks Options (default):
-prefix <filename> (Generates separate files for overlapping and unique peaks)
    By default all peaks are sent to stdout
-matrix <filename> (Generates files with pairwise comparison statistics)
    filename.logPvalue.matrix.txt - ln p-values for overlap, +values for divergence
    filename.logRatio.matrix.txt - ln ratio of observed/expected overlaps
    filename.count.matrix.txt - peak overlap counts
-venn <filename> (output venn diagram numbers to file, default: to stderr)
-code (report peak membership as binary instead of by file names)
Classify peaks by how many are co-bound by other peak files vs. reference (1st file)
-cobound <#> (Maximum number of co-bound peaks to consider)
    Will output sets of peaks that are co-bound by various numbers of factors
    to files coBoundBy0.txt, coBoundBy1.txt, coBoundBy2.txt, ...
    Or <prefix>.coBoundBy0.txt, <prefix>.coBoundBy1.txt, ...
-matrix <filename> (generates similar files to above with pairwise overlap statistics)
Single peak file:
(If a single peak file is given, peaks within the maximum distance will be merged)
-filter chrN:XXX-YYY (only analyze peaks within range)
-coverage <output file> (returns the total bp covered by each peak file - use "-d given")
Command Line options for getDifferentialPeaks
Usage: getDifferentialPeaks <peak file> <target tag directory> <background tag directory> [options]
Extracts tags near each peak from the tag directories and counts them, outputting peaks with significantly different tag densities
General Options:
-F <#> (fold enrichment over background tag count, default: 4.0)
-P <#> (poisson enrichment p-value over background tag count, default: 0.0001)
-same (return similar peaks instead of different peaks)
-rev (return peaks with higher tag counts in background instead of target library)
-size <#> (size of region around peak to count tags, default: -fixed)
-fixed (Count tags relative to actual peak start and stop, default)
    "-size given" is the same as "-fixed"
Advanced Options:
-strand <both|+|-> (Strand [relative to peak] to count tags from, default: both)
-tagAdjust <#> (bp to shift tag positions to estimate fragment centers, default: auto)
    '-tagAdjust auto' uses half of the estimated tag fragment length
-tagAdjustBg <#> (bp to shift background tag positions to estimate fragment centers, default: auto)
    '-tagAdjustBg auto' uses half of the estimated tag fragment length
-tbp <#> (Maximum tags per bp to count, 0 = no limit, default: 0)
-tbpBg <#> (Maximum background tags per bp to count, 0 = no limit, default: 0)