Annotating and Analyzing Significant Interactions From
Hi-C Data
Once you have found significant interactions in your data,
it's time to figure out what they mean. HOMER contains
a tool called annotateInteractions.pl that can be
used to execute a variety of different types of analysis
(analogous to annotatePeaks.pl in some ways).
The annotateInteractions.pl command takes a
HOMER-style interaction file as input (see the HOMER
interaction file format description). The
interactions do not need to be produced by HOMER; they can
come from any tool or be created manually, but they must have
the same format. The basic syntax of the command is as
follows:
annotateInteractions.pl <interaction file> <genome version> <output directory> [additional options...]
example: annotateInteractions.pl bcell-Interactions.txt mm9 AnnotationOutputDirectory/
By default, this command will produce a bunch of output
files and place them in the given output directory.
Below is a description of the default output files.
Additional options enable the assessment of feature
enrichment, and are covered further down.
Many of the options for the annotateInteractions.pl
command are used to filter the interactions such that a
subset is analyzed or annotated. Filtering options are
described below. One of the main purposes for the
filtering options is so that liberal cutoffs for the p-value
and z-score can be selected when running the analyzeHiC
or findHiCInteractionsByChr.pl command to find the
initial set of interactions. Once this large list of
possible interactions is found, the p-value and z-score
cutoffs can be changed/optimized in the annotateInteractions.pl
command to avoid rerunning the other commands, which can be
very time consuming.
Default Annotation Output:
Each of these output files will be placed in the
chosen output directory.
interactions.txt
This file is a "HOMER interaction" formatted
file that contains the interactions used in the
analysis. Several of the annotateInteractions.pl
options are used to filter the input interactions, and
this file will only contain the interactions that pass
the filters and are used for analysis and annotation.
interactionAnnotation.txt
This file is an extension of the HOMER
interaction format that includes 16 additional columns
with annotation information describing the regions at
each end of the interaction:
Column #) Description (Peak/Region 1 or 2)
21) Total Number of Significant Interactions at
region(1) - how many other interactions use the same
region endpoint
22) Total Number of Significant Interactions at
region(2)
23) Annotation(1) - basic annotation from annotatePeaks.pl
24) Detailed Annotation(1) - detailed annotation from
annotatePeaks.pl
25) Distance to TSS(1)
26) Nearest PromoterID(1)
27) Gene Name(1)
28) Gene Alias(1)
29) Gene Description(1)
30) Annotation(2)
31) Detailed Annotation(2)
32) Distance to TSS(2)
33) Nearest PromoterID(2)
34) Gene Name(2)
35) Gene Alias(2)
36) Gene Description(2)
The "Total Number of Significant Interactions
at region" describes how many other interactions
originate from the same region/endpoint. The
"Annotation" and "Detailed Annotation" correspond to the
basic and full annotations from annotatePeaks.pl.
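The column list above can be turned into a small lookup when reading the file from a script. The following is a minimal sketch, assuming the file is tab-delimited and that data lines carry the columns in exactly the order listed; the short labels and the function name are illustrative and are not part of HOMER.

```python
# 1-based column numbers for interactionAnnotation.txt, as listed above.
# The label strings are illustrative shorthand, not HOMER's own header text.
ANNOTATION_COLUMNS = {
    21: "Total Interactions (1)",
    22: "Total Interactions (2)",
    23: "Annotation (1)",
    24: "Detailed Annotation (1)",
    25: "Distance to TSS (1)",
    26: "Nearest PromoterID (1)",
    27: "Gene Name (1)",
    28: "Gene Alias (1)",
    29: "Gene Description (1)",
    30: "Annotation (2)",
    31: "Detailed Annotation (2)",
    32: "Distance to TSS (2)",
    33: "Nearest PromoterID (2)",
    34: "Gene Name (2)",
    35: "Gene Alias (2)",
    36: "Gene Description (2)",
}


def annotation_fields(line):
    """Return {label: value} for columns 21-36 of one tab-delimited line."""
    fields = line.rstrip("\n").split("\t")
    result = {}
    for col, label in ANNOTATION_COLUMNS.items():
        # Convert the 1-based column number to a 0-based list index.
        value = fields[col - 1] if col - 1 < len(fields) else None
        # "NA" is the placeholder used when annotation was skipped.
        result[label] = None if value in (None, "NA") else value
    return result
```

Values are returned as strings so that no assumption is made about how each field is formatted.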
NOTE: If you don't care too much about the
annotation part, you can put "none" as the genome, and
this part will be skipped (fields will be replaced with
"NA").
lengthDist.txt
This file contains a histogram showing the
distribution of interaction lengths. Graphing the
file (for example in Excel) shows how the number of
interactions varies with length.
peaks.txt
Peak file containing all of the unique
endpoint positions. The 7th column indicates how
many interactions connect to that peak position.
endpoint.bedGraph
This file is a coverage track of
interaction endpoints that can be loaded into the
UCSC Genome Browser.
endpoint.bedGraph.peaks
This peak file is an exploratory file that
attempts to identify hubs directly from the peaks in the
endpoint.bedGraph file that exceed the "-hubCount
<#>" threshold (default: 5).
hubs.gt#.interactions.txt
This file is a peak file that contains regions
that have more than 5 interactions connecting to that
region. This allows you to focus your attention on
highly connected regions if you want. To change
the number of interactions required to designate a hub,
use the "-hubCount <#>" option.
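If you want to try different hub thresholds without rerunning the command, the 7th column of peaks.txt can be filtered directly. This is a rough sketch, assuming a tab-delimited peak file with the peak ID in the first column and header/comment lines starting with "#"; the function name and the strict "more than" comparison are assumptions based on the description above.

```python
def find_hubs(lines, hub_count=5):
    """Return (peak_id, n_interactions) for peaks with more than hub_count."""
    hubs = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and '#' header/comment lines (assumed)
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # not enough columns to hold the interaction count
        try:
            n = int(float(fields[6]))  # 7th column: interactions at this peak
        except ValueError:
            continue  # non-numeric value, e.g. an unmarked header row
        if n > hub_count:
            hubs.append((fields[0], n))
    return hubs


# Example use (path is the output directory from the example above):
# with open("AnnotationOutputDirectory/peaks.txt") as fh:
#     for peak_id, n in find_hubs(fh, hub_count=10):
#         print(peak_id, n)
```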
hubs.distribution.txt
This file contains a histogram describing the
distribution of interactions per unique region,
which can be graphed in Excel.
Filtering Options
As mentioned above, one of the main purposes for
the filtering options is so that liberal cutoffs for the
p-value and z-score can be selected when running the analyzeHiC
or findHiCInteractionsByChr.pl command to find the
initial set of interactions. Once this large list of
possible interactions is found, the p-value and z-score
cutoffs can be changed/optimized in the annotateInteractions.pl
command to avoid rerunning the other commands, which can
be very time consuming.
Interaction Confidence:
-pvalue <#> : filter out interactions with
p-value greater than #
-zscore <#> : filter out interactions with
z-score less than #
Interaction Length:
-minDist <#> : filter out interactions
spaced less than # bp - set > 300 million for only
inter-chromosomal interactions
-maxDist <#> : filter out interactions
spaced more than # bp, will remove inter-chromosomal
interactions if set
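Because the confidence and length cutoffs can be changed without recomputing interactions, it can be handy to script several runs with different settings. Below is a minimal sketch that only assembles the command line from the -pvalue, -zscore, -minDist and -maxDist options described above; the function names and the output directory naming are hypothetical, and actually running the command requires HOMER to be installed and on your PATH.

```python
import subprocess


def build_command(interactions, genome, out_dir, pvalue=None, zscore=None,
                  min_dist=None, max_dist=None):
    """Assemble an annotateInteractions.pl argument list with optional filters."""
    cmd = ["annotateInteractions.pl", interactions, genome, out_dir]
    options = [("-pvalue", pvalue), ("-zscore", zscore),
               ("-minDist", min_dist), ("-maxDist", max_dist)]
    for flag, value in options:
        if value is not None:  # only add the filters that were requested
            cmd += [flag, str(value)]
    return cmd


def run_with_cutoffs(interactions, genome, pvalues):
    """Run once per p-value cutoff, each into its own output directory."""
    for p in pvalues:
        out_dir = f"annotation_p{p}/"  # hypothetical directory naming
        # check=True raises an error if the command exits with a failure code.
        subprocess.run(build_command(interactions, genome, out_dir, pvalue=p),
                       check=True)
```

For example, run_with_cutoffs("bcell-Interactions.txt", "mm9", [1e-3, 1e-5]) would annotate the same interaction file twice with progressively stricter p-value cutoffs.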
Interaction Confidence Vs. Background:
-dpvalue <#> : filter out interactions with
p-value vs. background Hi-C experiment greater than #
-dzscore <#> : filter out interactions with
z-score vs. background Hi-C experiment less than #
Filtering Regions:
-filter <peakfile> : only look at
interactions with endpoints overlapping a peak in
the file
-filter2 <peakfile> : only look at
interactions connecting peaks in the "-filter" file to
peaks in the "-filter2" file
Feature Enrichment at Interaction Endpoints
HOMER can calculate feature enrichment
for your interactions. Simply add
"-p <peak/BED file1> [peak/BED file2]..." to
the annotateInteractions.pl command. You can
add as many peak files as you want.
Below is an example:
annotateInteractions.pl bcell-Interactions.txt mm9 AnnotationOutputDirectory/ -p tss.peaks.txt ctcf.peaks.txt pu1.peaks.txt
Adding the "-p ..." option makes three
changes to the output of the program:
1. Creation of a "featureEnrichment.txt"
file in the output directory. This uses mergePeaks
to assess the significance of overlap between the
interaction endpoints and each of the peak files.
The file can be opened in Excel.
2. Creation of a "pairwiseFeatureEnrichment.txt"
file in the output directory. The program annotates
the endpoints of each significant
interaction and checks whether any of the feature peak
files overlap them. It then
quantifies how often each feature is "connected" to
each other feature by an interaction.
annotateInteractions.pl then assesses whether a connection
between two features is over- or under-represented given
the general enrichment for each feature in the data
set. In the output file, positive logp
enrichment values indicate under-represented connections.
3. The interactionAnnotation.txt file
will have one new column at the far right that gives
the interaction codes matching the first column of the
pairwiseFeatureEnrichment.txt file. In brief, each
peak file is given an ID number (starting with
0, 1, 2...). Interactions that link peak features
are assigned a code (#x#, e.g. 0x0, 0x2, 1x2,
etc.). For example, if an interaction is assigned
the code 0x2 in the example above, then the interaction
connects a TSS peak and a PU.1 peak.
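Translating the codes back into feature names is easy to script. The sketch below assumes that ID numbers follow the order in which the peak files were given after "-p" (consistent with the 0x2 example) and that the value passed in is a single code; the function name is illustrative.

```python
def decode_interaction_code(code, peak_files):
    """Map a code such as '0x2' to the pair of peak files it refers to."""
    first, second = code.split("x")
    # IDs are assumed to index the peak files in their "-p" order.
    return peak_files[int(first)], peak_files[int(second)]


# Peak files in the order used in the example command above.
features = ["tss.peaks.txt", "ctcf.peaks.txt", "pu1.peaks.txt"]
pair = decode_interaction_code("0x2", features)
# pair is ("tss.peaks.txt", "pu1.peaks.txt")
```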
Visualizing Feature Enrichment with Cytoscape
It's hard to appreciate the pairwise feature
enrichment in an Excel table. The good news is
that there is a great visualization tool called Cytoscape that
can help out with that. HOMER's support for
Cytoscape is a little clumsy (feedback on a more
efficient way to set up the input files would be
appreciated!).
When HOMER calculates feature enrichment, it will
produce six files in the output directory starting with
the name "cytoscape". These files are
tab-delimited text files formatted to be used with
Cytoscape. To load the network, follow these
instructions:
- Go to File -> Import -> Network From Table
(Text/MS Excel). For the file, select the "cytoscape.network.sif.txt"
file. In the "Interaction Definition" box, set
the Source Interaction to column 1, the Interaction
Type to column 2, and the Target Interaction to
column 3.
- Next go to File -> Import -> Node
Attributes... and load the "cytoscape.node.logp.txt"
file. Do the same for the other
cytoscape.node.* files (cytoscape.node.size.txt,
cytoscape.node.ratio.txt). These files
correspond to the values for general feature
enrichment (i.e. the featureEnrichment.txt file;
size is the total number of regions overlapping
with each set of peaks).
- Then go to File -> Import -> Edge
Attributes... and load the "cytoscape.edge.logp.txt"
file. Do the same for the other
cytoscape.edge.* files (cytoscape.edge.ratio.txt).
These values correspond to the pairwise feature
enrichment values.
- Clicking on the network diagram should now reveal
attributes in the "Data Panel" at the bottom of the
screen. Next, to make the network pretty,
click on the "VisMapper" on the left side of the
screen. From here you can customize how your
network displays the data. For example, click
on the "Edge Line Width" and double click to
activate it.
You can choose which attribute you want to visualize
and select the appropriate parameters.
- It takes a while the first time to play with it and make
it look right... Good luck!
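Before importing, it can help to peek at the network file from a script to confirm which features are linked. This is only a sketch, assuming the "cytoscape.network.sif.txt" file is tab-delimited with source, interaction type and target in columns 1-3 (as set in the import dialog) and no header line; the function name is hypothetical.

```python
from collections import defaultdict


def read_network(lines):
    """Return {source: [(interaction_type, target), ...]} from 3-column rows."""
    network = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or incomplete rows
        source, interaction_type, target = fields[:3]
        network[source].append((interaction_type, target))
    return dict(network)


# Example use:
# with open("AnnotationOutputDirectory/cytoscape.network.sif.txt") as fh:
#     for source, edges in read_network(fh).items():
#         print(source, len(edges))
```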
Modifying the Background for Feature Enrichment
By default, annotateInteractions.pl assumes your
interactions were found by searching the entire
genome. If you are analyzing a subset, you need
to specify what was used with one of the following
options:
-pos <chrN:XXX-YYY> : specify
the single, continuous region used for analysis
-gsize <#> : set the genome size used for
significance calculations
-bgp <peak/BED file> : peaks that were used to find
interactions
Miscellaneous/Specialized Analysis
Compare Interactions
If you have two interaction files, add "-i
<HOMER interaction file2>" to the
end of your annotateInteractions.pl command.
The 2nd interaction file will be compared to the
first one, and only the common interactions will be
analyzed. Common interactions are defined as
interactions whose endpoints at each end
overlap each other (within the resolution
size).
Connecting Features with Interactions
Let's say you have two peak files, maybe one called
"enhancers.txt" and the other "promoters.txt", and you
want to see which of the peaks in one file are
connected to peaks in the other by significant
interactions. If you add "-connect
<peak/BED file1> <peak/BED file2>"
to the command, a "mapping" file will be sent to stdout.
For example:
annotateInteractions.pl bcell-Interactions.txt mm9 AnnotationOutputDirectory/ -connect TSS.txt Enhancers.txt > outputMap.txt
The outputMap.txt file will contain 3 columns, the
peakID from the first peak file (column1), the peakID
from the 2nd file (column2), and the distance between
them (3rd column). There could be many mappings
for each peak, so each peak may appear multiple times
in the file.
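Since a peak can appear on many lines, grouping the mappings by the first peak ID is often the first thing you'll want to do. Here is a minimal sketch, assuming outputMap.txt is tab-delimited with the three columns described above and no header line; the distance is kept as text because its exact formatting is not assumed, and the function name is illustrative.

```python
from collections import defaultdict


def group_connections(lines):
    """Return {peak_id_file1: [(peak_id_file2, distance), ...]}."""
    connections = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or incomplete rows
        peak1, peak2, distance = fields[:3]
        connections[peak1].append((peak2, distance))
    return dict(connections)


# Example use:
# with open("outputMap.txt") as fh:
#     for tss, partners in group_connections(fh).items():
#         print(tss, len(partners))
```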
Manually Specifying Circos Edge Widths
The next section discusses how to visualize
interactions using Circos. However, you might
find that it's difficult to set the width of the edges
yourself. If you modify a HOMER interaction file
manually and put the desired width of the edges in the
file, you can run annotateInteractions.pl with
the "-circos" option. This will output a
Circos-formatted interaction file. More on this
in the next section...
Command Line options for annotateInteractions.pl
Usage: annotateInteractions.pl <interaction file> <genome version> <output directory> [additional options]

General Options:
    -res <#> (Resolution of analysis, default: auto detect)
    -hubCount <#> (Minimum number of interactions to define a hub, default: 5)

Filtering Options:
    -minDist <#> (filter out interactions spaced less than # bp - set high for only interchr)
    -maxDist <#> (filter out interactions spaced more than # bp, will remove interchr)
    -pvalue <#> (filter out interactions with p-value greater than #)
    -dpvalue <#> (filter out interactions with p-value (vs bg) greater than #)
    -zscore <#> (filter out interactions with zscore less than #)
    -dzscore <#> (filter out interactions with zscore (vs bg) less than #)
    -filter <peakfile> (only look at interactions with endpoints in peakfile)
    -filter2 <peakfile2> (only look at interactions connecting -filter and -filter2 peak files)

Enrichment Options:
    -p <peak file 1> [peak file 2] ... (Check overlap with peak files)

Special Operations:
    -circos (Convert interactions to circos interactions format - stdout)
    -i <interaction file2> [interaction file3] ... (Compare 1st file interactions to these)
    -connect <peakFile1> <peakFile2> (returns association table between sets of peaks)
    -pout (Convert interactions to a non-redundant peak file, sent to stdout)

Specifying Background (i.e. regions used to find interactions - default: whole genome)
    -gsize <#> (size of genome, default: 2e9)
    -pos chrN:XXX-YYY (specific, continuous region)
    -bgp <peak file> (peak file)