|
Creating and Normalizing Hi-C interaction Matrices
The most basic way to represent Hi-C data is in matrix
format, where the number of interactions can be reported
between sets of regions. Since it's difficult to
extract this data from the raw mapped reads, HOMER
provides tools create matrices from tag directories.
Most Hi-C tasks in HOMER revolve around the analyzeHiC
command. Below is a description of how to use it to
create matrices. analyzeHiC requires a Hi-C
tag directory (direction on creating one can be found here). Many of the
commands below require a "Hi-C background model" to
efficiently normalize and calculate expected interaction
counts (If one doesn't exist, it will be created
automatically). Creation of Hi-C background models
can be found here.
Quick Note about Hi-C Analysis
Hi-C is an unbiased assay of chromatin conformation,
resulting in even read coverage across the entire
genome. This, coupled with the fact that most Hi-C
reads describe interactions at close linear distance
along the chromosomes, results in relatively sparse read
coverage for interactions between individual restriction
fragments separated by great distance. This makes
it difficult to find high-resolution interactions across
long distances. To help describe how distal and/or
inter-chromosomal regions co-localize, two general
approaches are utilized:
- Increase the size of the regions [i.e. decrease the
resolution] to boost the number of Hi-C reads used for
interaction analysis. This is the most common
and simplest technique used by most approaches
(including HOMER). It's much easier to identify
significant inter-chromosomal interactions between
regions that are 100kb in size than regions that are
only 5kb in size (essentially 20x more Hi-C reads to
work with).
- Compare the profile of interactions regions
participate in when comparing regions. The best
example of this is calculating the correlation
coefficient of the interaction profiles between two
regions. The idea is that if two regions share
interactions with several other regions, they are
probably similar themselves (even if they do not have
direct interaction evidence between them). This
forms the basis of the PCA
analysis, and uses the transitive property to
link regions. If A interacts with C and D, and B
interactions with C and D, perhaps A and B interact as
well.
Running analyzeHiC
Basic usage:
analyzeHiC <Hi-C Tag Directory>
[options] > outputMatrixFile.txt
By default, HOMER generates
a normalized interaction matrix and sends it to
stdout. There are many other options covered
below
Changing the Resolution and Super Resolution
The default resolution is
10000000 (10 Mb). This is to make sure the command
finishes quickly if you forget to specify the correct
resolution. To specify a different resolution, use
"-res <#>".
analyzeHiC ES-HiC -res 10000 >
output.10kResolution.txt
If the output requires normalization, HOMER will
construct a background model for the specified
resolution. If it is the first time you've used
the designated resolution, HOMER will generate a
background model (more on that below). For small
resolutions, this can take a while...
HOMER uses two (related) notions of resolution.
The first, "- res <#>" represents how
frequent the genome is divided up into regions.
The second, the super resolution " -superRes <#>"
(poorly named), represents how large the region is
expanded when counting reads. Usually the " -res
<#>" should be smaller than the " -superRes
<#>". For example a res of 50000 and a
superRes of 100000 would mean that HOMER will analyze
the regions 0-50k, 50k-100k, 100k-150k etc., but at each
region it will look at reads from a region the size of
100k, so it would look at reads from -25k-75k, 25k-125k,
75k-175k, etc. This means HOMER will analyze data
with overlapping windows. The principle advantage
to this strategy is that you don't penalize features
that span boundaries. For example:
analyzeHiC
ES-HiC -res 10000 -superRes 20000 >
output.10kby20kResolution.txt
Keep in mind that the resolution will also dictate the
size of the output matrix. The command will warn
you if your parameters seem unreasonable...
Specifying specific Regions to Analyze
By default HOMER will
analyze interactions across the entire genome. You
can restrict the analysis to a specific chromosome, or
part of a chromosome using the following options:
-chr <chr name>
: will restrict analysis to this chromosome
-start <#> :
starting position for analysis
-end <#> :
end position for analysis
-pos
<chr:start-end> : UCSC browser
formatted position - if you're lazy like me, takes
place of the -chr/-start/-end
If you only " -chr chr1"
and do not specify a start and end, HOMER will simply
visualize all of chr1. Regions will be created
starting at position 0 to 1*resolution, then from
1*resolution to 2*resolution, etc. (i.e. 0-10kb,
10kb-20kb, 20kb-30kb) If an alternative start is
specified, then regions will be created at
start+0*resolution to start+1*resolution,
start+1*resolution to start+2*resolution, etc (i.e.
205kb-215kb,215kb-225kb,...).
HOMER will normally make a symmetric matrix by
default. If you want to specifically look at a
matrix between two different regions:
-chr2 <chr name> :
will restrict analysis to this chromosome
-start2 <#> :
starting position for analysis
-end2 <#>
: end position for analysis
-pos2
<chr:start-end> : UCSC browser
formatted position - if you're lazy like me, takes
place of the -chr2/-start2/-end2
-vsGenome : compare to the rest of the genome
NOTE: Don't use
-chr2/-start2/-end2/-pos2/-vsGenome unless you specified
something with -chr/-start/-end/-pos etc. first.
Using these regions, HOMER will divide them into #-bp
regions, where # is the resolution. HOMER doesn't
allow you to cherry-pick several regions from different
chromosomes. At least not using the -chr, -start,
-end, and -pos options. (You could do this with
peaks and the -p option).
You can arbitrarily define the regions you want to
examine by providing a peak/BED file of the
regions. However, these options work a little
differently. In the case above (with
-chr/-start/-end/-pos), HOMER will chop up the region
into resolution sized chunks and perform the
analysis. I.e. you provide a locus you want a
detailed picture of. When providing a peak file,
HOMER will only consider the center of the peak file and
the surrounding "resolution-sized" region. It will
not chop-up your peaks into resolution-sized chunks
(unless you specify the " -chopify" option).
In general, the "-p" option is more useful for
looking at all CTCF peaks, for example, to see which are
interacting. You can provide a peak/BED that
effectively tiles a region, in which case it would mimic
how -chr/-start/-end/-pos works, but give you the
flexibility to define any region(s) you wanted. To
specify a peak file containing regions to analyze with
analyzeHiC:
-p <peak/BED file>
: peak/BED file to use to search for interactions
between.
-p2 <peak/BED
file> : A second peak/BED file for
non-symetrical matricies.
Keep in mind that an interaction matrix is probably only
useful at 2000 x 2000 points - much bigger you can't
really visualize it easily anymore (The file will also
get really big...)
Matching the resolution of Peak/BED Files and
Resolution of analyzeHiC
Another important point,
often if you're using Transcription factor peaks for
analysis, they may be located less than the resolution
apart from one another. (i.e. two PU.1 peaks may be
less than 1000 bp from each other, but the resolution
for analyzeHiC is 50000) - this means that the two
PU.1 peaks will give essentially the same results. To
avoid this redundancy, It's best to run mergePeaks with a
single peak/BED file to collapse peaks within the size
of the resolution. For example:
mergePeaks pu1.peaks -d
50000 > newPu1Peaks.txt
The resulting file will contain only peaks at least
50000 bp from one another. Use this resulting
file with analyzeHiC.
Creating an Interaction Matrix
By default, analyzeHiC
creates an interaction matrix file. This is a
tab-delimited text file formatted to be easy opened with Java Tree View
(or any other software that generates Heat Maps).
The rows and columns correspond to genomic regions, and
the values correspond interaction information between each
locus. The type of information shown depends on the
options choosen when running analyzeHiC. The
genomic
positions reported correspond to the beginning of the
region. Normally, the output is sent to stdout,
but you can also specify " -o
outputfilename.txt".
Output Information Options (choose only one ):
-raw
Outputs the raw
interactions counts between the regions
-simpleNorm
Outputs the ratio of
observed to expected interactions assuming each region
has an equal chance of interacting with every other
region in the genome. This normalization assumes
each region should contain roughly the same number of
total interactions, and essentially controls for the
sequencing depth at each region.
-norm ( default)
Outputs the ratio of
observed to expected interactions by assuming each
region has an equal chance of interacting with every
other region in the genome AND that regions are expected
to interact depending on their linear distance along the
chromosome. This attempts to take into account the
"proximity ligation" effect, where adjact regions are
expected to have interactions regardless of the specific
3D genomic structure.
-logp
Outputs the natural log of
the p-value describing the likelihood of observeing the
actual number of interactions relative to the expected
number of interactions between to two loci. This
is calculated conservatively as a cumulative binomial
distribution.
-expected
Outputs the expected
number of interactions instead of the observed number of
interactions. To get the "simpleNorm" version of
the expected interactions, include "-simpleNorm" too.
-rawAndExpected <expectedMatrix filename>
Outputs both a raw
interaction matrix and the expected number of
interaction matrix. The raw interactions are sent
to stdout (or the file specified with "-o") and
the expected interactions are sent to the given
filename.
-corr (can be used
with any of the above options)
Instead of outputing the
matrix as is, the value of each cell is replaced with
the Peason's Correlation Coefficient between the row and
column. This can be useful as it adds transitive
information to the problem. Instead of just using
the number of interaction that directly span between to
loci, the correlation option will consider how each
region interacts with all of the other loci too.
If they have similar interaction profiles, the
correlation will be high (i.e. 1). If "-logp" or "-expected" is used,
those values are the ones that will be used for the
correlation calculation. The matrix must be
symmetric for this option to work.
-nomatrix
Don't create a matrix
Visualizing the Interaction Matrix
The interaction matrix
output is a tab-delimited text file. Below is an
example opened in Excel (resolution of 1 Mb). The
coordinates are given as "chr-position", where the
position indicates the beginning of the
region. If a peak file is given as input, the
columns/rows will be labeled with the peak names.
To view the output as a heatmap, open it in Java Tree
View (or any other software that generates Heat
Maps). You may have to adjust the pixel settings
to get the desired level of brightness. If viewing
normalized data of observed vs. expected log2 ratios,
you may want to view the matrix so that interacting
regions are positive (one color) and non-interacting
regions are negative (a different color).
By default, analyzeHiC sorts the chromosomes
numerically and places the X, Y, and M at the end of the
list (You can tell in the example below that the cells
are from a male because the X chromosome is lighter than
the others...)
Example 1: Raw interactions across the genome at
1 Mb resolution
analyzeHiC ES-HiC -res
1000000 -raw > outputFile.txt
As you can see, interactions are much more frequent
within chromosomes than between.
Example 2: Raw interactions across chr1 at
100 kb resolution
analyzeHiC ES-HiC/ -raw
-chr chr1 -res 100000 > outputFile.txt
Here you'll start to notice the compartment patterns.
Example 3: Ratio of observed vs. expected
interactions across chr1 at 100 kb resolution using
simple normalization (normalizing for sequencing
coverage only). Visualized as log2 ratios of
observed interactions relative to expected interactions
(red being positive, blue negative)
analyzeHiC ES-HiC
-chr chr1 -res 100000 -simpleNorm >
outputFile.txt
Example 4: Ratio of observed vs. expected
interactions across chr1 at 1 kb resolution (normalizing
for sequencing coverage and distance between loci - this
is the default). Visualized as log2 ratios of
observed interactions relative to expected interactions
(red being positive, blue negative). NOTE: If a
100kb background model
does not exist, it will be created the first time (may
take some time).
analyzeHiC ES-HiC
-chr chr1 -res 100000 -norm > outputFile.txt
Example 5: You might be curious what the
"expected" interactions are from the background
model. Use the " -expected" option for that:
analyzeHiC
ES-HiC -chr chr1 -res 100000 -expected >
outputFile.txt
Example 6: log p-values for interactions across
chr1 at 100 kb resolution (normalizing for sequencing
coverage and distance between loci). More
information about the p-value calculation is available
in the " Identifying
Significant Interactions" section.
analyzeHiC ES-HiC
-chr chr1 -res 100000 -logp > outputFile.txt
As you might notice, most of the significant
interactions appear closer to the diagonal where the
read coverage is high enough to confidently identify
overrepresented interactions.
Example 7: Correlation of normalized interaction
ratios across chr1 at 100 kb resolution. More
about correlation can be found in the " Sub-nuclear compartment
analysis/PCA" section. Positive correlation
is shown as red, negative correlation as green.
analyzeHiC ES-HiC
-chr chr1 -res 100000 -corr > outputFile.txt
Example 8: Poor man's data integration.
Chuck whipped this up
faster than you can spell "ROUNDHOUSE". Using
powerpoint as professional image arranging software,
import a picture such as the one from the previous
example, and a picture of a chromosome wide track of
H3K4me2 for chr1. You can then rotate the
interaction matrix by 45 degrees, and visualize them
together to see how all the active regions of the
genome "interact" together, and all the inactive
regions interact together. Pretty cool...
Example 9: Analyzing Peak Files
When analyzing peak files, use the "-p
<peak/BED file>". The key difference
here is that the matrix that is returned will show
each row/column as a peak, and not a continuous
section of the genome. Peak regions will be
sorted based on their position in the output.
For example:
mergePeaks
CTCFchr10.peaks.txt -d 50000 > inputPeaks.txt
analyzeHiC ES-HiC -p inputPeaks.txt -res
50000 > outputFile.txt
Using -res and -superRes
The "-superRes <#>" option is basically a
way to make moving windows (explained in more detail at
the beginning of this section). In summary,
instead of analyzing the data at resolution of 10kb for
every 10kb of sequence, you can analyze at a resolution
of 10kb (-superRes) but perform that analysis every 1kb
of sequence (-res) Normally, you want to superRes
to be larger than the resolution, or the same
(default). The idea is that it will smooth out
features and be more resilient to boundary
effects. Below is an example of how it looks with
raw interactions.
analyzeHiC ES-HiC -res 10000 -superRes
10000 -pos chr1:20000000-21000000 -raw >
output.txt
analyzeHiC ES-HiC -res 5000 -superRes 10000
-pos chr1:20000000-21000000 -raw > output.txt
analyzeHiC ES-HiC -res 1000 -superRes 10000
-pos chr1:20000000-21000000 -raw > output.txt
Creating 2D Histograms of Interactions
HOMER can create 2D
histograms by examining the interactions in the vicinity
of specified locations, such as looking at the interaction
profiles around transcription factor peaks, or the
Transcription Start Site. The idea is that you specify
locations in a peak file (with "-p <peak/BED
file>"), specify a resolution and a size, and then
HOMER will create several "mini-interaction matrices"
around each peak, with the peak position in the center,
and in the end adds them together to get a global
profile. This is similar to looking at the
distribution of a histone mark around a set of
transcription factor binding sites, accept in this case
you will see a 2D histogram that you would view in the
same way as a normal interaction matrix.
To make a 2D histogram, you MUST specify a peak file (" -p <peak/BED file>"),
a total size of the interaction matrix (" -size <#>"), and
the histogram resolution (" -hist <#>") ("-hist" takes the
place of "-res" in this case).
analyzeHiC
<HiC Tag Directory> -size <#> -hist
<#> -p <peak/BED file> >
output2dHistogram.txt
i.e.
analyzeHiC ES-HiC -size 1000000 -hist 10000 -p
tssPositions.txt > output2dHistogram.txt
If -logp, -simpleNorm, etc. are
used then those values will be added together for use in
the histogram. If you choose to view too many peaks,
too large of a size or small of a resolution, the memory
requirements can get large, so be careful.
For example:
analyzeHiC ES-HiC -size 1000000 -hist 5000
-p CTCFpeaks.chr10.bed > output.txt
Visualizing with TreeView:
In this example you can see how CTCF tends to sit at the
boundaries between interconnected "domains".
Command Line Options for analyzeHiC
Usage: analyzeHiC
<PE tag directory> [options]
...
Resolution
Options:
-res <#> (Resolution of matrix in bp or use "-res
site" [see below], default: 10000000)
-superRes <#> (size of region to count tags, i.e. make
twice the res, default: same as res)
Region Options:
-chr <name> (create matrix on this chromosome,
default: whole genome)
-start <#> (start matrix at this position, default:0)
-end <#> (end matrix at this position, default: no
limit)
-pos chrN:xxxxxx-yyyyyy (UCSC formatted position instead of
-chr/-start/-end)
-chr2 <name>, -start2 <#>, -end2 <#>, or
-pos2 (Use these positions on the
y-axis of the matrix. Otherwise the matrix will be
sysmetric)
-p <peak file> (specify regions to make matrix,
unbalanced, use -p2 <peak file>)
-vsGenome (normally makes a square matrix, this will force
2nd set of peaks to be the genome)
-chopify (divide up peaks into regions the size of the
resolution, default: use peak midpoints)
-fixed (do not scale the size of the peaks to that of the
resolution)
-pout <filename> (output peaks used for analysis as a
peak file)
Interaction
Matrix Options:
-norm (normalize by dividing each position by expected
counts [log ratio], default)
-zscoreNorm (normalize log ratio by distance dependent
stddev, add "-nolog" to return linear values)
-logp (output log p-values)
-simpleNorm (Only normalize based on total interactions per
location [log ratio], not distance)
-raw (report raw interaction counts)
-expected (report expected interaction counts based on
average profile)
-corr (report Pearson's correlation coeff, use "-corrIters
<#>" to recursively calculate)
-corrDepth <#> (merge regions in correlation so that
minimum # expected tags per data point)
-o <filename> (Output file name, default: sent to
stdout)
-cluster (cluster regions, uses "-o" to name cdt/gtr files,
default: out.cdt)
-clusterFixed (clusters adjacent regions, good for linear
domains)
(# defined the threshold at which to define
clusters/domains, i.e. 0.5)
-std <#> (# of std deviations from mean interactions
per region to exclude, default:4)
-min <#> (minimum fraction of average read depth to
include in analysis, default: 0.2)
Background
Options:
-force (force the creation of a fresh genome-wide background
model)
-bgonly (quit after creating the background model)
-createModel <custom bg model output file> (Create
custome bg from regions specified, i.e. -p/-pos)
-model <custom bg model input file> (Use Custom
background model, -modelBg for -ped)
-randomize <bgmodel> <# reads> (and the output
is a PE tag file, initail PE tag directory not used
Use makeTagDirectory ... -t outputfile to create the new
directory)
Non-matrix stuff:
-nomatrix (skip matrix creation - use if more than 100,000
loci)
-loci <filename> (calculate collective p-value of each
loci)
-interactions <filename> (output interactions, add
"-center" to adjust pos to avg of reads)
-pvalue <#> (p-value cutoff for interactions, default
0.001)
-zscore <#> (z-score cutoff for interactions, default
1.0)
-minDist <#> (Minimum interaction distance, default:
resolution/2)
-maxDist <#> (Maximum interaction distance, default:
-1=none)
Making
Histograms:
-hist <#> (create a histogram matrix around peak
positions, # is the resolution)
-size <#> (size of region in histogram, default = 100
* resolution)
Comparing HiC
experiments:
-ped <background PE tag directory>
Creating Circos
Diagrams:
-circos <filename prefix> (creates circos files with
the given prefix)
-d <tag directory 1> [tag directory 2] ... (will plot
tag densities with circos)
-b <peak/BED file> (similar to visiualization of
genes/-g, but no labels)
-g <gene location file> (shows gene locations)
Given Interaction
Analysis Mode (no matrix is produced):
-i <interaction input file> (for analyzing specific
sets of interactions)
-iraw <output BED filename> (output raw reads from
interactions, or -irawtags <file>)
-4C <output BED file> (outputs tags interacting with
specified regions)
-peakStats <output BED file prefix> (outputs several
UCSC bed/bedGraph files with stats)
-cpu <#> (number of CPUs to use, default: 1)
|