Finding Significant Interactions in Hi-C Data
HOMER can search for pairs of loci that have a greater
number of Hi-C reads than expected by chance, which below
will be referred to as 'significant interactions'.
The enhanced proximity of these regions may have relevant
biological interpretation. It is important to note
up front that it is basically unheard of for two loci to
ALWAYS co-localize next to one another (i.e. in the same
cross-linked complex). Usually regions simply show
"enrichment" for their co-localization, meaning that
evidence for their co-fixation comes from only a fraction of
the total cells used in the experiment. This also
means that some regions may form 'significant interactions'
with multiple other loci in the same experiment.
The section below describes the HOMER commands that can be
used to find significant interactions from Hi-C experiments,
followed by routines and techniques to analyze their
biological significance. The next section describes
how to visualize interactions in
Circos Diagrams. The section after that
describes the analysis of interactions with respect to other
features and their annotation. These sections focus on
the analysis of specific interactions between any two loci,
while the section after that describes the use of Structured Interaction Matrix Analysis
(SIMA), which boosts analysis power by considering
interactions between multiple regions simultaneously.
Finding Significant Interactions in Hi-C Data
To find significant interactions between any two
loci with HOMER, use the analyzeHiC command with
the option "-interactions <outputfilename.txt>".
This will trigger HOMER to perform significance
calculations between each pair of loci described in the
command. For example:
analyzeHiC ES-HiC -res 1000000
-interactions significantInteractions.txt -nomatrix
This command will search for significant interactions in
the genome at 1Mb resolution. The "-nomatrix" flag is
optional, but since the default of analyzeHiC is
to produce a matrix, you may want (or need) to stop
creation of the matrix as it could be very large and
consume too much memory. The resulting
"significantInteractions.txt" file can be opened with
Excel.
This file is formatted as a HOMER interaction file.
It is a tab-delimited text file with a header and one
interaction per line. The definitions of the columns
are as follows:
HOMER Interaction File Format Column Types
(tab-separated text file)
1) Interaction ID (must be unique)
2) Peak ID for region 1
3) chr for region 1
4) start position for region 1
5) end position for region 1
6) strand for region 1
7) total reads for region 1
8) Peak ID for region 2
9) chr for region 2
10) start position for region 2
11) end position for region 2
12) strand for region 2
13) total reads for region 2
14) Distance between regions (or "interchromosomal")
15) Interaction Reads (total Hi-C reads connecting the
regions)
16) Expected Interaction Reads (total expected Hi-C
reads based on background model)
17) Modified Z-score
18) Natural log of the p-value for the interaction
(binomial)
19) False Discovery Rate (based on Benjamini correction)
20**) Circos Thickness (used for visualization by
Circos)
The interaction file can be further analyzed with commands
in the sections below, or it can be the starting point of
your own custom analysis.
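As a starting point for a custom analysis, the column layout above can be read by position. The following is only a minimal sketch: the field names are illustrative labels chosen here (they are not HOMER identifiers), and the example line is made up purely to exercise the parser.

```python
import csv
import io

# Labels for the 19 columns listed above, in order. These names are
# assumptions made for this sketch, not headers produced by HOMER.
COLUMNS = [
    "interaction_id", "peak_id_1", "chr_1", "start_1", "end_1", "strand_1",
    "total_reads_1", "peak_id_2", "chr_2", "start_2", "end_2", "strand_2",
    "total_reads_2", "distance", "interaction_reads", "expected_reads",
    "z_score", "log_p", "fdr",
]
INT_FIELDS = ("start_1", "end_1", "start_2", "end_2")
FLOAT_FIELDS = ("total_reads_1", "total_reads_2", "interaction_reads",
                "expected_reads", "z_score", "log_p", "fdr")


def read_interactions(handle):
    """Yield one dict per interaction, skipping the header line."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    for row in reader:
        if len(row) < len(COLUMNS):
            continue  # ignore blank or truncated lines
        # zip() stops at 19 fields, so any later columns are ignored here.
        rec = dict(zip(COLUMNS, row))
        for key in INT_FIELDS:
            rec[key] = int(rec[key])
        for key in FLOAT_FIELDS:
            rec[key] = float(rec[key])
        # Column 14 holds either a distance or the word "interchromosomal".
        dist = rec["distance"]
        rec["distance"] = None if dist == "interchromosomal" else int(float(dist))
        yield rec


# Made-up example line, for illustration only (not real HOMER output).
EXAMPLE = "\t".join(COLUMNS) + "\n" + "\t".join([
    "int1", "p1", "chr1", "1000000", "2000000", "+", "5000",
    "p2", "chr1", "5000000", "6000000", "+", "4000",
    "4000000", "50", "12.5", "3.2", "-20.1", "0.0001",
]) + "\n"

records = list(read_interactions(io.StringIO(EXAMPLE)))
```

For a real file, replace the StringIO object with open("significantInteractions.txt", newline="").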
How HOMER finds Significant Interactions
The premise behind finding significant
interactions is simple enough: Look for pairs of regions
that have more Hi-C reads between them than would be
expected by chance. The expected number of reads is
calculated using the background model (described in a
separate section). The background model is used to estimate
how many reads we expect to connect a given region to each
other region in the genome. If regions are far apart (or on
separate chromosomes), then we expect only a small number
of reads to connect them.
Likewise, if two regions are close to one another, we
would expect a large number of interactions. These
expectations are also dependent on the number of total
reads mapping to each locus (which may depend on the
number of restriction sites etc. in the region). The
background model attempts to take all of these factors
into account.
Since the total number of reads per region is fixed (and
more or less constant for each region due to the unbiased
nature of Hi-C), we test how these reads are distributed
relative to the expectation. For two given loci that
could potentially interact, we model their randomly
expected read counts using the cumulative binomial
distribution, where the total number of trials is the
number of reads that could possibly map between the loci
(i.e. the region total), the rate of success is the
expected interaction frequency, and the number of observed
successes is the number of observed reads mapping between
the loci. In this setting, regions with only 1 or 2
reads between them will have high p-values, regardless of
their expected interaction frequency, while regions with
many interactions above expected will have low p-values.
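The binomial test described above can be sketched in a few lines. This is an illustration of the idea rather than HOMER's actual code: it assumes the rate of success is the expected read count divided by the region total, and it sums the upper tail directly in log space, which is slow for very large read totals.

```python
import math


def log_binom_pmf(k, n, p):
    """Natural log of the probability of exactly k successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def log_upper_tail(k, n, p):
    """Natural log of P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    if k <= 0:
        return 0.0  # P(X >= 0) is 1, and ln(1) is 0
    terms = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    peak = max(terms)
    # log-sum-exp keeps very small probabilities from underflowing to zero
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def interaction_log_p(observed_reads, expected_reads, region_total):
    """ln p-value, assuming rate = expected reads / region total."""
    rate = expected_reads / region_total
    return log_upper_tail(observed_reads, region_total, rate)


# Hypothetical numbers: 1000 reads in the region, 1 read expected.
few = interaction_log_p(2, 1.0, 1000)    # only 2 reads observed
many = interaction_log_p(10, 1.0, 1000)  # 10 reads observed
```

With these made-up numbers, observing 2 reads gives a p-value above 0.2, while observing 10 reads gives a p-value below 1e-6, which matches the behavior described above.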
Parameters to Consider when Finding Significant
Interactions
Regions to Analyze
The analyzeHiC command will look for
interactions across the entire genome unless otherwise
specified. To specify a specific subset of the
genome to analyze, use the "-chr/-start/-end/-pos/-p"
options.
Super resolution and congruent interactions
When using a "-superRes <#>"
value that is larger than the resolution ("-res
<#>"), HOMER analyzes the Hi-C data in
overlapping windows. If there is a truly
significant interaction in the data, there is a good
chance that multiple overlapping windows may observe the
same reads and call the interaction multiple times with
slightly different offsets. By default, HOMER
looks through the identified interactions before writing
them to the output and removes overlapping interactions
(defined as overlapping at both endpoints), keeping only
the interaction with the most significant logP value.
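The filtering rule above (drop interactions that overlap at both endpoints, keep the one with the most significant logP) can be sketched as a simple greedy pass. This is an assumed reading of that rule, not HOMER's implementation: the dictionary keys are hypothetical, the comparison is quadratic, and it does not handle interactions listed with their two regions swapped.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals share at least one position."""
    return a_start < b_end and b_start < a_end


def same_interaction(x, y):
    """True when both endpoints of x overlap the matching endpoints of y."""
    return (x["chr1"] == y["chr1"] and x["chr2"] == y["chr2"]
            and overlaps(x["start1"], x["end1"], y["start1"], y["end1"])
            and overlaps(x["start2"], x["end2"], y["start2"], y["end2"]))


def remove_overlapping(interactions):
    """Keep the most significant interaction in each overlapping group."""
    kept = []
    # The lowest (most negative) logP is the most significant, so visit it first.
    for cand in sorted(interactions, key=lambda r: r["logp"]):
        if not any(same_interaction(cand, k) for k in kept):
            kept.append(cand)
    return kept


# Made-up interactions for illustration.
a = dict(chr1="chr1", start1=10000, end1=20000,
         chr2="chr1", start2=100000, end2=110000, logp=-30.0)
b = dict(chr1="chr1", start1=12000, end1=22000,
         chr2="chr1", start2=102000, end2=112000, logp=-25.0)
c = dict(chr1="chr1", start1=12000, end1=22000,
         chr2="chr1", start2=500000, end2=510000, logp=-10.0)

kept = remove_overlapping([c, b, a])
```

Here b overlaps a at both endpoints and is less significant, so it is dropped; c shares only one endpoint with a and is kept.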
Limiting the Search Space
The further apart regions get, the fewer reads
are likely to map between them, and the less likely
it is that a significant interaction will be found.
Depending on the resolution and the sequencing depth, it
can be next to impossible to identify "significant
interactions" past a certain distance. To
control the space that is searched, specify "-maxDist
<#>" or "-minDist <#>".
This can dramatically speed up the search for high
resolution interactions (e.g. "-res 10000 -maxDist
3000000").
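To get a rough feel for why a distance limit helps, you can count the pairs of bins that need to be tested on one chromosome. The sketch below is simple arithmetic under assumed conditions (non-overlapping bins, a hypothetical 100 Mb chromosome); it ignores "-superRes" windows and says nothing about HOMER's real running time.

```python
def bin_pairs(chrom_length, res, max_dist=None):
    """Count pairs of bins (including a bin with itself) within max_dist."""
    n = -(-chrom_length // res)  # number of bins, by ceiling division
    if max_dist is None:
        max_offset = n - 1  # no limit: every pair on the chromosome
    else:
        max_offset = min(max_dist // res, n - 1)
    # There are n - k pairs of bins that sit exactly k bins apart.
    return sum(n - k for k in range(max_offset + 1))


# Hypothetical 100 Mb chromosome at 10 kb resolution.
all_pairs = bin_pairs(100_000_000, 10_000)
limited_pairs = bin_pairs(100_000_000, 10_000, max_dist=3_000_000)
fold_reduction = all_pairs / limited_pairs
```

In this made-up case the limit cuts the number of tested pairs from 50,005,000 to 2,964,850, roughly a 17-fold reduction.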
Increasing Accuracy by Centering Interactions
If you add the option "-center",
analyzeHiC will replace the coordinates of the regions
in the output file by re-centering the regions on the
average of the position of the Hi-C reads participating
in the interaction. Let's say a region spans from
position 10,000 to 20,000 (10k resolution), and it
interacts with the region at 100,000 to 110,000.
If most of the interacting reads originate from a
regulatory element at 17,000, the output file will be
re-centered such that the output coordinates are 12,000
to 22,000 for the first region.
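The re-centering arithmetic in this example can be written out as follows. This is a sketch of the calculation as described, assuming the new region keeps the same width as the resolution and is centered on the mean read position; the read positions are made up.

```python
def recenter(read_positions, res):
    """Return (start, end) of a res-wide region centered on the mean read."""
    center = round(sum(read_positions) / len(read_positions))
    half = res // 2
    return center - half, center + half


# Hypothetical reads clustered around a regulatory element at 17,000.
new_start, new_end = recenter([16500, 17000, 17500], res=10000)
```

With these positions the mean is 17,000, so the region moves from 10,000-20,000 to 12,000-22,000.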
Interaction Reporting Filtering
By default, HOMER will report all interactions
with a p-value less than 0.001. To modify this
threshold, use "-pvalue <#>". You can
also change the modified z-score cutoff using the
parameter "-zscore <#>".
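If you want to apply stricter cutoffs to an existing interaction file instead of re-running the search, remember that column 18 stores the natural log of the p-value, so the cutoff must be converted with a log as well. The sketch below assumes the 19-column layout described earlier and that the z-score cutoff keeps values at or above the threshold; the example rows are made up.

```python
import math

Z_SCORE_COL = 16  # column 17 in the 1-based list above
LOG_P_COL = 17    # column 18 in the 1-based list above


def passes(fields, pvalue=0.001, zscore=None):
    """True when a split interaction line meets the given cutoffs."""
    if float(fields[LOG_P_COL]) >= math.log(pvalue):
        return False  # p-value is not below the cutoff
    if zscore is not None and float(fields[Z_SCORE_COL]) < zscore:
        return False  # z-score is below the cutoff
    return True


def make_row(z_score, log_p):
    """Build a made-up 19-column row; only z-score and logP matter here."""
    row = ["0"] * 19
    row[Z_SCORE_COL] = str(z_score)
    row[LOG_P_COL] = str(log_p)
    return row


strong = make_row(z_score=4.0, log_p=-20.0)
weak = make_row(z_score=1.0, log_p=-5.0)

kept = [r for r in (strong, weak) if passes(r, pvalue=0.001, zscore=1.5)]
```

Since ln(0.001) is about -6.9, the row with logP of -5.0 fails the p-value cutoff and only the first row is kept.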
Multiple CPUs
Using the "-cpu <#>" option will
speed up the analysis if analyzing multiple chromosomes.
Finding Differential Interactions between two Hi-C
Experiments
To assess the interactions in a second Hi-C
experiment, use the "-ped <HiC Tag Directory>"
option. When this option is used with the "-interactions
<outputfile.txt>" option, interactions are
first found just as usual. After interactions are
found, HOMER will go back and quantify the Hi-C reads for
each interaction in the second Hi-C experiment.
Independent statistics will be calculated for that
experiment based on its background model, and then it
will be compared to the first Hi-C experiment. For
example:
analyzeHiC ES-HiC/ -res 100000 -ped MEF-HiC/
-interactions significantInteractions.txt -nomatrix
The output file is slightly different in the case of a
background Hi-C experiment. There are 8 extra
columns as described below:
20**) Background Experiment Interaction Reads
21) Background Experiment Expected Reads
22) Background Modified Z-score
23) Background LogP
24) Background total reads region 1
25) Background total reads region 2
26) LogP of Primary Experiment vs. Background
27) Modified Z-score of Primary Experiment vs.
Background
28) Circos Thickness (used for visualization by Circos)
(Without "-ped <tag dir>", column 20 is the Circos
Thickness instead, see above)
Note that the background Hi-C experiment is not used to
"find" interactions, just score the ones found in the
primary Hi-C experiment. To find interactions that
are specific to the background Hi-C experiment, reverse
their positions in the command.
Finding Intra-chromosomal Interactions Genome-wide at
High Resolution
A common task is to find all the interactions in
the genome. With more sequencing and better data
quality, the hope is to find interactions between
individual regulatory elements, at a resolution of 10kb or
less (maybe 1kb or 500bp! depending on restriction enzyme
used). At these small resolutions it is near
impossible to identify significant inter-chromosomal
interactions, and the search space for inter-chromosomal
interactions is massive.
To speed up the process, HOMER will look at chromosomes
individually which allows it to save on resources. A
separate program called findHiCInteractionsByChr.pl
automates this process and helps speed up the
calculation. Here's an example:
findHiCInteractionsByChr.pl ES-HiC/ -res
2000 -superRes 10000 -cpu 8 >
outputInteractions.txt
The first argument MUST be the Hi-C tag directory.
The output interaction formatted file is sent to stdout,
so make sure to capture the output in a file. Other
options are as follows:
-res <#> : resolution of
analysis, see above, default: 2000
-superRes <#> : super resolution (window
size), see above, default: 10000
-minDist <#> : Minimum distance between
regions to consider for an interaction (default:
-superRes value)
-maxDist <#> : Maximum distance between
regions to consider for an interaction (default:
10,000,000). Changing this parameter may change the
running time a bit
-pvalue <#> : pvalue cutoff, default: 0.01
-zscore <#> : modified z-score cutoff,
default: 1.5
-cpu <#> : number of CPUs to use, default:
1 (remember that more CPUs will require more memory too)
-ped <background HiC directory> :
Will quantify background experiment reads at significant
interactions.
-std <#> : exclude regions with sequencing
depth exceeding # std deviations, default: 4
-min <#> : exclude regions with sequencing
depth less than this fraction of mean, default: 0.2
Miscellaneous Interaction Actions
Quantifying Hi-C reads Given Predetermined
Interactions
Let's say you have known interactions, or
interactions from another experiment. You can
supply these as an input file to analyzeHiC and
it will quantify their read counts and significance for
you. The input file should look exactly like the
HOMER formatted interaction file. It should be a
tab-delimited text file. However, most of the
columns can have a value of 0. The only columns
that MUST have values are the interaction IDs and region
information (columns 1-12). The read totals do not
need to be specified - HOMER will fill them in. To
have HOMER analyze your given interactions, use the "-i
<inputInteractions.txt>" option. Below
is an example:
analyzeHiC ES-HiC -i inputInteractions
-res 100000 -interactions outputInteractions.txt
-nomatrix
The interaction quantification will be in the new
outputInteractions.txt file from this example.
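One way to prepare such an input file from a list of region pairs is sketched below. Several details here are assumptions rather than documented HOMER requirements: the header labels, the "+" strand, and the way peak IDs are built from coordinates are all placeholders chosen for this example, and the regions are made up.

```python
import csv
import io

# Placeholder header labels for the 19 columns (an assumption for this sketch).
HEADER = [
    "InteractionID", "PeakID(1)", "chr(1)", "start(1)", "end(1)", "strand(1)",
    "TotalReads(1)", "PeakID(2)", "chr(2)", "start(2)", "end(2)", "strand(2)",
    "TotalReads(2)", "Distance", "InteractionReads", "ExpectedReads",
    "Z-score", "LogP", "FDR",
]


def write_input_interactions(pairs, handle):
    """Write region pairs as tab-delimited rows, with 0 for unknown values."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for idx, (region1, region2) in enumerate(pairs, start=1):
        chr1, start1, end1 = region1
        chr2, start2, end2 = region2
        writer.writerow([
            f"interaction{idx}",  # column 1: unique interaction ID
            f"{chr1}:{start1}-{end1}", chr1, start1, end1, "+", 0,
            f"{chr2}:{start2}-{end2}", chr2, start2, end2, "+", 0,
            0, 0, 0, 0, 0, 0,     # columns 14-19 left as 0
        ])


# Made-up region pairs for illustration.
pairs = [
    (("chr1", 1000000, 1100000), ("chr1", 5000000, 5100000)),
    (("chr2", 300000, 400000), ("chr7", 900000, 1000000)),
]

buffer = io.StringIO()
write_input_interactions(pairs, buffer)
text = buffer.getvalue()
```

To write a file for use with the "-i" option, pass open("inputInteractions.txt", "w", newline="") instead of the StringIO buffer.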
Retrieving Hi-C reads at Interacting Loci
To investigate the exact evidence that
contributes to a given interaction or set of significant
interactions, use the "-i
<inputInteractions.txt>" option to specify
your interactions of interest in conjunction with the "-iraw
<outputFile.bed>" or "-irawtags
<outputFile.tags.tsv>" options. These
options will output the raw Hi-C reads from the
experiment that participate in the interactions in the
input file. The "-iraw" output file is in BED
format and can be uploaded to the UCSC Genome Browser to
visualize your interactions.