|
Finding TADs and Loops with HOMER
HOMER has updated how it looks for loops and topological
associated domains (TADs) from previous versions to be
more robust, and to address how to identify confidently
changing loops between different conditions. Chromatin
loops (previously referred to as 'significant
interactions'), represent two regions that interact at a
high frequency with one another. On a Hi-C contact map,
these show up as dots/points. TADs, on the other hand, are
domains of chromatin that make interactions within
themselves at a higher rate than with their surrounding
chromatin, and look like 'triangles' on a Hi-C contact
map. The two are often related in the sense that a loop is
often at the apex of the TAD.
Quick Reference
#analyzing TADs and loops (i.e. specific
locations that interact, e.g. two CTCF sites interacting):
findTADsAndLoops.pl find HicExp1TagDir/ -cpu 10 -res
3000 -window 15000 -genome hg38 -p badRegions.bed
#This will create *.loop.2D.bed and *.tad.2D.bed files and
place them within the tag directory. Even better to
include a list of segmental duplications/blacklisted
regions with "-p <peak/BED file>" to filter out
likely false positives. The 2D.bed files can be visualized
with Juicebox.
#To analyze changes in TAD/Loops across
experiments, first merge features that you want to
analyze from each experiment to get the union of
features to analyze
merge2Dbed.pl exp1.loop.2D.bed exp2.loop.2D.bed -loop
> merged.loop.2D.bed
merge2Dbed.pl exp1.tad.2D.bed exp2.tad.2D.bed
-tad > merged.tad.2D.bed
#Once features have been merged, next quantify them
across all replicates/experimental conditions (computes
stats for loops and TADs at the same time)
findTADsAndLoops.pl score -tad merged.tad.2D.bed
-loop merged.loop.2D.bed -o outPrefix -d
HicExp1r1TagDir/ HicExp1r2TagDir/ HicExp2r1TagDir/
HicExp2r2TagDir/ -cpu 10 -res 3000 -window 15000
#Finally, identify features that are differentially
enriched (uses edgeR/limma):
getDiffExpression.pl outPrefix.loop.scores.txt exp1
exp1 exp2 exp2 -loop > output.loop.txt
getDiffExpression.pl outPrefix.tad.scores.txt
exp1 exp1 exp2 exp2 -tad > output.tad.txt
Finding TADs and Loops with HOMER
TADs and Loops are found simultaneously with
HOMER using the creatively named command findTADsAndLoops.pl.
This command works by generating relative contact matrices
for each chromosome and scanning them for locally dense
regions of contacts (i.e. loops) or triangle domains that
have a high degree of inter-domain interactions relative
to their surrounding region. Once these features have been
identified, the command can be used again to score these
regions across several experiments to identify
differential features (covered in the next section). This
process can take a while, but HOMER can speed it up by
analyzing each chromosome using a separate process (of
course, chr1 will take the longest still...).
The findTADsAndLoops.pl command has two modes. The first
one is "find" which is covered here, which is used to
identify features in the Hi-C experiment. The find command
runs 3 sub-programs to identify features:
- Generate relative contact maps for each chromosome
(using analyzeHiC)
- Analyzes the contact maps to identify TADs/Loops
(using findTADsAndLoopsFromRelMatrix)
- Filters the final TAD/Loop calls to remove likely
artifacts (low coverage regions and features near
segmental duplications, using filterTADsAndLoops.pl).
If there are regions you know might cause problems (i.e.
regions with known segmental duplications), it's a great
idea to pass these to findTADsAndLoops.pl with the "-p
<peak/BED file>" option. Segmental
duplications are often sites where read mapping anomalies
occur. To generate a file for this purpose, I would
recommend downloading the genomicSuperDups.txt.gz file
from the UCSC annotation database for the organism you're
studying. It's also worth downloading the gaps.txt.gz file
while you're at it:
wget -O gap.txt.gz
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/gap.txt.gz
wget -O dups.txt.gz
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/genomicSuperDups.txt.gz
zcat gap.txt.gz dups.txt.gz | cut -f2-4 >
badRegions.bed
Make sure to use the correct genome!!! (this example
uses hg38) Now you can use "-p badRegions.bed"
with the findTADsAndLoops.pl
To actually find Loops and TADs, the command takes the
following form:
findTADsAndLoops.pl find <HiC tag
directory> -res <#> -window <#> -genome
<version> -p <peak/BED file for bad
regions> -o <outputPrefix>
findTADsAndLoops.pl find HicExp1TagDir/ -cpu 10 -res
3000 -window 15000 -genome hg38 -p badRegions.bed
After completion, the program will generate two *.2D.bed
files containing the TADs and Loops. If no output
file prefix is given with "-o <outputPrefix>", the
output files will be placed inside the tag directory. The
program will also generate two bedGraph files that
describe the directionality index (i.e. preference for a
region to interact either upstream or downstream of its
current position) and the insulation score.
The 2D.bed files can be further analyzed in the context of
other experiments to identify differential structural
features (described below). They are also compatible with
other tools, such as Juicebox. You can load them
them as 2D annotations using Juicebox's annotation panel
(2D annotations-> load Loops/Domains). An example
of loading these annotations is shown below (note: this
example only displays the features on 'half' of the
contact map so that you can see the interaction density on
the other half):
Scoring Loops and TADs
Loops are scored by their Hi-C interaction
density, which is normalized to the read depth, but
otherwise will NOT be normalized by area or total read
counts. This is so that these values can be treated
later as read-count data for analysis with tools like
DESeq2/edgeR.
In contrast, TADs are scored by an inclusion ratio (IR),
which represents the ratio of intra-TAD interactions
relative to interactions from the TAD to the surrounding
region (both upstream and downstream of the TAD to
regions the same size as the TAD). The interactions rate
is averaged across distances, summed, and then the ratio
is calculated. High IR indicates a dense triangle that
makes few interactions to adjacent regions, while low IR
values indicate a lack of TAD. It is also important to
note that HOMER looks for TADs by optimizing IR values,
taking the optimum regions and discarding overlapping
TAD calls. It also uses an IR threshold when considering
TADs, and does not force all regions of the genome to be
part of a TAD.
Identifying Differentially regulated TADs and Loops
As with almost all differential analysis with
NGS data, finding differential structural features follows
a similar pattern:
- Identify features of interest (i.e. TADs/Loops)
across all experiments
- Merge features into a union of all features
- Quantify the features across all experiments
- Find features with statistically significant
differences across experiments
Once you find TADs/Loops across all of your experiments
(as describe in the section above), you can merge them
into a common set of features using merge2Dbed.pl (This
example assumes two experimental conditions with two
replicates each):
merge2Dbed.pl <2D bed file1> <2D
bed file2> [2D bed file3] ... > output.2D.bed
merge2Dbed.pl exp1r1.loop.2D.bed exp1r2.loop.2D.bed
exp21.loop.2D.bed exp2r2.loop.2D.bed -loop >
merged.loop.2D.bed
merge2Dbed.pl exp1r1.tad.2D.bed exp1r2.tad.2D.bed
exp21.tad.2D.bed exp2r2.tad.2D.bed -tad >
merged.tad.2D.bed
Now that we have merged files containing the loops and
TADs across all of the experiments, it's time to quantify
each of the features across all experiments. To score each
feature, rerun the findTADsAndLoops.pl command in
'score' mode (i.e. first argument is 'score'),
supplying the merged TAD and Loop files and the Hi-C Tag
Directories that you want to quantify:
findTADsAndLoops.pl score -tad
<tad.2D.bed> -loop <loop.2D.bed> -d
<tagDir1> [tagDir2] -cpu <#> -o
<outputPrefix>
findTADsAndLoops.pl score -tad merged.tad.2D.bed
-loop merged.loop.2D.bed -d HicExp1r1TagDir/
HicExp1r1TagDir/ HicExp1r1TagDir/ HicExp1r1TagDir/
-cpu 10 -o output
This command will produce two output files:
output.loop.scores.txt and output.tad.scores.txt. These
files are like 2D bed files (i.e. columns 2-9 compose the
primary columns of a 2D bed file), but contain a feature
ID in the first column and have extra columns for each
Hi-C tag directory experiment quantifying the feature. To
identify differentially regulated loops, these files can
be analyze with the getDiffExpression.pl script, which
acts as a wrapper for DESeq2/edgeR/limma to perform
differential calculations
getDiffExpression.pl output.loop.scores.txt
exp1 exp1 exp2 exp2 > output.diff.loop.txt
getDiffExpression.pl output.tad.scores.txt exp1
exp1 exp2 exp2 > output.diff.tad.txt
Command line options for findTADsAndLoops.pl
findTADsAndLoops.pl <cmd> [options]
Usage Modes:
findTADsAndLoops.pl find <tagDir> [options]
Finds TADs and Loops in the given Hi-C tag directory.
Output files
will have the same prefix, and include 2D bed files
specifying the locations of the features.
findTADsAndLoops.pl score -tad <tad.2D.bed> -loop
<loop.2D.bed> -d <tagDir1> [tagDir2] ...
[options]
Scores TADs and Loops cross several Hi-C tag directories
so that they can be compared across
different replicates and experimental conditions.
Note: When comparing experiments, it is recommended to
find TADs/Loops on individual experiments first,
then merge the features with merge2Dbed.pl, then 'score'
the commbined features across multiple tagDirs
Common Options:
-cpu <#> (max CPUs to use)
-res <#> (resolution, default: 3000)
-window <#> (overlapping window size, i.e. superRes,
default: 15000)
-balance | -nobalance (balance contact maps, default:
-balance)
-minDist <#> (minimum interaction distance to
consider, default: 25000)
-maxDist <#> (maximum interaction distance to
consider, default: 2000000)
Options
specific to 'find' mode:
-o <output prefix> (by default will place output
files within the Tag Directory)
TAD options:
-minTADsize <#> (default: 45000)
-minTADscore <#> (minimum inclusion score to call
TAD, default: 1.5)
-keepOverlappingTADs (by default, program removes
overlapping TAD calls)
Loop options:
-foldLoopLocalBg <#> (fold threshold for local loop
enrichment, default: 1.5)
-foldLoopGlobalBg <#> (fold threshold over avg.
interactions of similar distance, default: 2)
-poissonLoopLocalBg <#> (poisson p-value threshold
for differential interactions per loop, default: 0.01)
-poissonLoopGlobalBg <#> (poisson p-value threshold
for minimum interactions per loop, default: 1e-05)
Advanced options (probably best to avoid changing):
-minLoopDist <#> (minimum length of contact points
to find, default: 3x window size)
-anchorSize <#> (size of region used to calculate
loop score, default: window size)
-anchorLocalBgSize <#> (size of region used to
calculate local bg, default: 5x window size)
-minLoopReads <#> (minimum number of adjusted reads
for contact point, default: use -poissonLoopGlobalBg)
Artifact filtering options: (runs filterTADsAndLoops.pl)
-skipFiltering (skip post-feature finding filtering step)
-p <peak/BED file> (File of genome
duplications/bad/blacklisted regions to exclude from
Loop/TAD endpoints)
-genome <genome version> (i.e. hg19, hg38)
-badChr <chr> [chr] ... (chromosomes to
skip/exclude)
-minCoverageTAD <#> (minimum ratio of read coverage
in TAD relative to genome average, def: 0.25)
-minCoverageAnchor <#> (minimum ratio of read
coverage at anchors relative to average, def: 0.25)
Other options:
-insDist <#> (distance across which to calculate
insulation ratio, default: 150000)
-diDist <#> (distance across which to calculate
directionality index , default: 1000000)
-override (if res/maxDist are set to produce huge matrices
and you have the RAM, you may need to include this)
Options
specific to 'score' mode:
-o <output prefix> (default: "score")
-d <HiC TagDir1> [HiC TagDir2] ...
-loop <loop 2D BED file> (will score loop
interaction rates from this file)
-tad <tad 2D BED file> (will score TAD inclusion
ratios from this file)
Loop score reporting: (default: Hi-C depth normalized
counts [use for use with getDiffExpression.pl...])
-raw (report raw interaction counts, will impact TAD
inclusion ratio scoring too)
-normTotal <#> (normalize based on this depth of
sequencing for Hi-C experiments)
Command line options for filterTADsAndLoops.pl
filterTADsAndLoops.pl [options] -loop <Loop 2D bed>
-tad <TAD 2D bed> -d <tagDir> -genome
<genome> -o <outPrefix>
Paramters: (no
strict order needed)
-o <prefix for output files> (def: out)
-loop <loop 2D bed> (input loop file, 2D BED format)
-tad <TAD 2D bed> (input TAD file, 2D BED format)
-d <Hi-C tag dirctory> [tagdir2] ... (Used to
calculate experiment coverage)
-cpu <#> (number of processes to use to count reads
in multiple tagdirs at once)
-genome <genome version> (genome version, def: none)
** -p <peak/BED file of
duplication/bad/blacklist regions> (def: gaps/segDup
from UCSC)
-badChr <chr> [chr] ... (chromosomes to skip e.g.
chrY, def: none)
Thresholds:
-res <#> (resolution for interaction/region overlap,
def: 15000)
-minTADir <#> (minimum TAD inclusion ratio, def: 0)
-minLoopScore <#> (minimum int score, def: 0)
-minCoverageTAD <#> (minimum TAD coverage ratio
relative to average, def: 0.25)
-minCoverageAnchor <#> (minimum int coverage ratio
relative to average, def: 0.25)
co-filter score
files:
-TADscore <TAD score file>
-LoopScore <loop score file>
** Notes on -p
<peak/BED bad regions>:
Recommend to use genome assembly gaps and known duplicated
regions. To get these if you're using a UCSC
annotated genome, try something like the following to
create BED file with these regions (for hg38):
wget -O gap.txt.gz
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/gap.txt.gz
wget -O dups.txt.gz
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/genomicSuperDups.txt.gz
zcat gap.txt.gz dups.txt.gz | cut -f2-4 >
badRegions.bed
(then use "badRegions.bed" with the -p option)
Note that depending on the genome, UCSC might have
different file names...
Command line options for merge2Dbed.pl
merge2Dbed.pl
[options] <2D BED file1> <2D BED file2> [2D
BED file3]...
Options:
-res <#> (maximum distance between endpoints to
merge, default: 15000)
Usually for loops -res should be set to the
window/superRes size, for TADs 2x window/superRes
-loop (treat 2D bed input files as loops, default)
-tad (treat 2D bed input files as TADs)
-prefix <filePrefix> (output venn diagram overlaps
to separate files)
|