logo

HOMER

Software for motif discovery and next-gen sequencing analysis



Sub-nuclear Compartment Analysis (PCA/Clustering)

Hi-C data is very complex, and it's interpretation is not trivial.  One strategy is to look for general patterns in an unbiased way.  This section covers the application two general strategies, Principal Component Analysis and hierarchical clustering. Generally speaking, analysis of Hi-C data using these techniques tend to reveal the genome is spatially segregation in two compartments harboring "active" and "inactive" chromatin.

Principal Component Analysis of Hi-C Data

PCA is a powerful technique for analyzing high dimensional data, and first used on Hi-C data by Lieberman-Aiden et al.  The basic idea behind PCA is to redefine the coordinate system such the data can be "described" with as few dimensions as possible (a type of clustering or dimensionality reduction).  This axes of the coordinate system are referred to as the principle components, and the "1st" component is found such that it can be used to describe or discriminate as much of the system as possible, with the "2nd" component describing as much of the remaining variance as possible, and so on.  We can then consider each region with respect to their values along the first couple principal components to get a simplified view of the data.

To perform PCA on Hi-C data, we apply it to the normalized interaction matrix. In this setting, each region along the chromosome represents a dimension in the analysis.  Normally, PCA is applied to full chromosome matrices to identify important features.  You could apply PCA to the entire genome interaction matrix, but sparse interactions between chromosomes may introduce noise.  You can also apply PCA to a small region, although the conclusions from such an analysis are less general with respect to general chromosome structure.

To perform PCA analysis with HOMER, use the program runHiCpca.pl.  Below is an example:
runHiCpca.pl <outputPrefix> <HiC Tag Directory> [options]

example: runHiCpca.pl pcaOut ES-HiC/ -res 50000 -cpu 4 -genome mm9
This command will produce two output files: *.PC1.txt and *.PC1.bedGraph (i.e. pcaOut.PC1.bedGraph).  The bedGraph file can be directly uploaded to UCSC and represents the coordinates of each region with respect to the first two principle components.  The txt file contains the same information in peak file format.  Below is an example of the output on chr17:

Hi-C PCA chr17

Usually the values for the first principal component (PC1) are the most informative.  These indicate the classification of each region of the chromosome with respect to the first component.  Normally, gene-rich regions with 'active' chromatin marks have similar PC1 values, while gene deserts tend to have the opposite.  You can also view additional principal components by changing the "-pc <#>" option.  The 2nd component (PC2) usually describes the the location along the chromosome (e.g. near telomere vs. centromere, or different chromosome arms).  The PC2 values tend look a lot like the PC1 values but usually "flip" along the length of the chromosome.  Their similarity is partly due to the fact that they are mathematically linked - Each principal component must be orthogonal to one another, so unless the first PC1 perfectly describes a feature of the data, other PCs might reflect the feature as well.  In the example below, PC1 and PC2 values are more similar for the first 40 Mb, and then they tend to be more opposite for the rest of the chromosome.

PC1 vs. PC2 Hi-C


In general, this analysis tends to highlight the fact that the genome can be divided into two compartments, creating a "checkerboard" pattern, with positive PC1 regions reflecting "active/permissive" chromatin and negative PC1 regions indicative of "inactive/inert" chromatin

Hi-C PCA analysis

What does runHiCpca.pl do:

runHiCpca.pl is a wrapper script that will perform PCA analysis on each chromosome individually and combine all the results.  Below are each of the steps:

1. Check for the appropriate background model, create it if necessary.

2. For each chromosome:
- Generate normalized interaction matrix (analyzeHiC)
- Calculate correlation between the contact profiles from each region against each other region
- Using R, compute the principal components from the correlation matrix
- Output the values of reach region with respect to the first two principal components
3. If "seed" regions are provided, HOMER checks the overlap between active (and/or inactive) regions to decide if the coordinates should be multiplied by -1 to flip the signs.  HOMER will try to make it so regions in "active" zones are positive.  If no seed regions are used, but a genome is specified, HOMER will look for overlap with TSS to assign "active" vs. "inactive".

4. Concatenate results from all chromosomes, format, and print the output

runHiCpca.pl Options:

-pc <#> : Number of principal components to return (default: 1).

-res <#>
: resolution of analysis (if it's too small, the program may take too long or use up too much memory).  A resolution of 25000 is about the limit for mammalian genomes, unless your computer is ridiculous... The sequencing depth and quality of an experiment may dictate which resolutions are possible.  If the PC1 values are very noisy you may need to increase the resolution size. Default: 50000

-superRes <#> : window size of analysis, should be equal to or larger than -res. Default: 100000.

-active <peak/BED file> : seed regions of "active" chromatin to use when assessing the proper sign of the PC1 results.

-inactive <peak/BED file> : seed regions of "inactive" chromatin to use when assessing the proper sign of the PC1 results.

-genome <genome> : If no seed regions are given, HOMER will get the TSS locations and treat them as "active" seeds.

-corrDepth <#> : Minimum number of expected reads per region to use when computing correlation (pools adjacent regions), default: 3

-std <#> : exclude regions with sequencing depth exceeding # std deviations, default: 4.  The idea here is to remove regions that may throw off the PCA calculation

-min <#> : exclude regions with sequencing depth less than this fraction of mean, default: 0.15.  The idea is to get rid of regions with low coverage that don't behave as well during the PCA analysis.

-rpath <path to R executable> : If "R" is not available through the $PATH variable (i.e. can't just type "R" to run R)

-cpu <#> : number of CPUs to use.  Memory consumption is also higher with more CPUs (work separated by chromosome).  Please note that if multiple CPUs are configured, the screen output will spew all over the place - just ignore it...

Directly Comparing Two Hi-C Experiments

PCA analysis is a useful tool for analyzing single Hi-C experiments.  However, comparing PC1 values between two experiments is a little dangerous.  PCA is an unbiased analysis of data, and while PC1 values usually correlated with "active" vs. "inactive" compartments, the precise qualitative nature of this association may differ slightly between experiments.  Because of this, it is recommended to directly compare the interaction profiles between experiments rather than simply looking at PC1 values.

One way to do this is to directly correlate the interaction profile of a locus in one experiment to the interaction profile of that same locus in another experiment.  If the locus tends to interact with similar regions in both Hi-C experiments, the correlation will be high.  If the locus interacts with different regions in the two experiments, the correlation will be low.  Note that this is NOT PCA analysis, cut a direct comparison between two experiments at each locus.  By default, this will compare the interaction profiles across each chromosomes (interchromosomal interactions are generally too sparse for this type of analysis)

To calculate the "correlation difference" between two experiments, use the getHiCcorrDiff.pl.  Below is how it's used:

getHiCcorrDiff.pl <outputPrefix> <HiC Tag Directory1> <HiC Tag Directory2> [options]

example: getHiCcorrDiff.pl out ES-HiC Bcell-HiC -cpu 8 -res 50000

This program will produce two output files, much like runHiCpca.pl.  One is a bedGraph what can be loaded into the UCSC Genome Browser as a custom track.  The other is a peak file with the correlation information in the 6th column.  Below is an example of the corrDiff output loaded with the PC1 values from two experiments:

HiC CorrDiff


Additional Options for getHiCcorrDiff.pl:

-res <#> : resolution of analysis (if it's too small, the program may take too long or use up too much memory).  A resolution of 25000 is about the limit for mammalian genomes, unless your computer is ridiculous... The sequencing depth and quality of an experiment may dictate which resolutions are possible.  If the values are very noisy you may need to increase the resolution size. Default: 50000.

-superRes <#> : window size of analysis, should be equal to or larger than -res. Default: 100000.
               
-corrDepth <#> : Minimum number of expected reads per region to use when computing correlation (pools adjacent regions), default: 3

-std <#> : exclude regions with sequencing depth exceeding # std deviations, default: 4.  The idea here is to remove regions that may throw off the PCA calculation

-min <#> : exclude regions with sequencing depth less than this fraction of mean, default: 0.15.  The idea is to get rid of regions with low coverage that don't behave as well during the PCA analysis.

-maxDist <#> : By default, getHiCcorrDiff.pl compares each locus by looking at it's contact profile across the whole chromosome.  Setting this option will limit the profile comparison to the region +/- this distance from the locus (sort of like a 'local' correlation)

-cpu <#> : number of CPUs to use.  Memory consumption is also higher with more CPUs (work separated by chromosome).  Please note that if multiple CPUs are configured, the screen output will spew all over the place - just ignore it...


Downstream PC1 analysis

There are several ways to use HOMER to analyze PC1 values (or corrDiff values) with respect to other data.  Some are very general ways to manipulate bedGraphs, while others are very specialized to PC1 values (i.e. getDomains.pl).

Histograms/Quantifying PC1 values at genomic features

You can make histograms of PC1 values around areas of interest by using annotatePeaks.pl and the "-bedGraph" option.  Lets say you'd like to see the distribution of PC1 values around TSS. (alternatively you could look at PC1 values near ChIP-Seq peaks or other features)  Using the command below:

annotatePeaks.pl tss mm9 -size 1000000 -hist 1000 -bedGraph exp1.PC1.bedGraph exp2.PC1.bedGraph > output.histogram.txt

Opening the histogram output in Excel and graphing the "coverage" columns:

Hi-C PC1 histogram TSS

You can also omit the "-hist <#>" option and adjust the size from the annotatePeaks.pl command, and annotatePeaks.pl will instead quantify the PC1 values at each locus, which can be useful for classifying which TSS are in the 'active/permissive' compartment vs. which TSS are in the 'inactive/inert' compartment:

annotatePeaks.pl tss mm9 -size 1000 -bedGraph exp1.PC1.bedGraph exp2.PC1.bedGraph > output.table.txt

Opening this file with Excel:

Hi-C PC1 Chart

Plotting the bedGraph data columns as a scatter plot:

Hi-C PC1 scatter plot TSS

This file can also be used to figure out which individual TSS are changing the most, etc.  Similar operations could be performed on ChIP-Seq peaks or other features of interest.

Finding PC1 Based Compartments

Regions of continuous positive or negative PC1 values set the stage for identifying "compartments", in this case referring to linear regions of DNA that are in the "active/permissive" or "inactive/inert" compartments.  (This is a little different than the concept of topological domains reported in Dixon et al., which are more of a local phenomenon instead of a compartment association).  A custom tool is available in HOMER to help with this task named findHiCCompartments.pl.

Unlike the annotatePeaks.pl command avoid, findHiCCompartments.pl works on the *.PC1.txt files that are produced by runHiCpca.pl, not the *.PC1.bedGraph files.

To identify compartments, run the command:

findHiCCompartments.pl exp1.PC1.txt > compartments.txt

By default, this command will identify all regions that have a continuous positive PC1 value.  The output is a peak file that contains each of these regions, their average PC1 value across the region, and their length in the 6th and 7th columns, respectively.  To adjust the threshold used to call compartments, use the "-thresh <#>" option.  To identify regions with negative PC1 values, add the "-opp" option to the command.  Also, if you prefer to return compartments as the original sub-peaks (instead of stitching them together into a single, long peak), add the "-peaks" option.

Differential Compartments (i.e. Flipping)

The findHiCCompartments.pl command also has some built in support for identifying regions that change their compartment between experiments.  Two sources of information can be used to identify changes: the PC1 values from a second Hi-C experiment, and/or the correlation difference (i.e. getHiCcorrDiff.pl output) between two experiments.  If both options are used, the identified regions must pass both criterion.

-bg <PC1.txt file> : specify a background experiment's PC1 results.
-diff <#> : minimum difference between PC1 values to define region as different, default: 50

-corr <corrDiff.txt file> : specify a correlation difference file
-corrDiff <#> : maximum correlation for a region to be defined as different, default: 0.4

Below is an example of these files uploaded to the UCSC Genome Browser. NOTE: If UCSC complains that you chromsome coordinates exceed the end of the chromosome (could happen due to the resolution), run the command "removeOutOfBoundsReads.pl compartments.txt <genome> > compartments.new.txt" to adjust the coordinates at the end of the chromosome.

Hi-C Differential Domains



Clustering Regions Based on their Interactions

HOMER also offers more "direct" ways to group regions (most of these are experimental).  For now, it's recommended you stick to the PCA analysis from above, however, there may be situations where the routines below are more useful.

analyzeHiC has two types of clustering routines to help identify sets of regions that are "related" in 3D-space.  They are:

-cluster : Clustering of regions regardless of genomic locations (pure clustering based on interaction frequency)

-clusterFixed : Clustering of regions based on adjacent, linear, regions on the chromosome (for finding "linear domains")

The clustering is performed as hierarchical clustering, and the output is stored in files "out.cdt" and "out.gtr".  You can use Java Tree View to open the "out.cdt" file and view the clustering result.  From there you can select your own clusters if you like.  Selection of "-corr", "-logp" or "-simpleNorm" etc. during the command will cause those values to be used in the clustering instead of the default ("-norm"). 

To name the clustering output something other than "out", use "-o <filename>", which will use <filename> as the prefix for the clustering files instead of "out".

Example of -cluster

As an example, lets try clustering the normalized matrix of chr1 at 1 Mb resolution.  Using the command:

analyzeHiC ES-HiC/ -chr chr1 -res 1000000 -cluster > outputmatrix.txt

This produces several files, named out.cdt and out.gtr (and outputmatrix.txt, which we'll ignore).  Opening out.cdt with Java Tree View will give us:

Hi-C Clustering

In this example, we allow "free clustering", and the algorithm will group together loci that are considered "close together" - in this case, loci whose interaction profiles have high interaction log2 ratios.  As you can see, there are essentially two groups.  To visualize which regions are in the groups, highlight the sub-tree of interest, and then export the list (under Export->Save List in Java Tree View).  HOMER contains a tool called cluster2bed.pl to allow you to visualize these clusters in the genome browser.  Run the following command on the saved list file (be default this is called "out_list.txt"):

cluster2bed.pl <cluster list> <#resolution> [# minimum to cluster size to keep]  > out2.bed
i.e. cluster2bed.pl out_list.txt 1000000  > out.bed

You can now upload the output file to the UCSC genome browser.  The 3rd option in the cluster2bed.pl command is optional, which will remove clusters containing only a couple regions (be default, it removes cluster containing fewer than 5% of the total).  Colors are randomly assigned to the clusters to help differentiate them.

Visualizing these clusters in the UCSC genome browser along with H3K4me2 ChIP-Seq and the interaction matrix, you can see that the 2 major clusters from above correlate with H3K4me2 very nicely (would look better at higher resolution, say 100kb)

Hi-C Clustering Domains

Example of -clusterFixed

As an example, lets try clustering again, this time using "fixed" clustering, which will force adjacent positions to be grouped.  Using the command:

analyzeHiC ES-HiC/ -chr chr1 -res 1000000 -clusterFixed > outputmatrix.txt

This produces the same basic output files as "-cluster", named out.cdt and out.gtr (and outputmatrix.txt, which we'll ignore).  Opening out.cdt with Java Tree View will give us:

Hi-C Clustering Fixed


Here you'll notice that the heatmap look just as it would if you did not cluster the data.  The only difference is that it grouped regions together based on there similarity.  This can be useful for finding linear domains.  To extract domains from the clustering result, you must first choose a cutoff threshold to identify cluster (i.e. 0.5 - related to correlation) and use the "homerTools cluster" program, and then use the cluster2bed.pl tool to format for the UCSC Genome Browser:
homerTools cluster -gtr out.gtr -thresh 0.5 > cluster.txt
cluster2bed.pl cluster.txt 1000000  > out.bed
Uploading this to the browser will give us:

Hi-C Clustering Fixed



Command Line Options for runHiCpca.pl

        Usage runHiCpca.pl <output prefix> <HiC directory> [options]

        Options:
                -res <#> (resolution in bp, default: 50000)
                -superRes <#> (super resolution in bp, i.e. window size, default: 100000)
                -active <K4me+ regions> (Regions to use to help decide sign on principal component [active=+])
                -inactive <K4me- regions> (Regions to use to help decide sign on principal component [inactive=-])
                        -genome <genome> (If you don't have seed regions, this will use the TSS file as active seeds)
                -corrDepth <#> (number of expected reads needed per data point when calculating correlation, default: 3)
                -std <#> (exclude regions with sequencing depth exceeding # std deviations, default: 4)
                -min <#> (exclude regions with sequencing depth less than this fraction of mean, default: 0.15)
                -rpath <path to R executable> (If R is not accessible via the $PATH variable)
                -cpu <#> (number of CPUs to use, default: 1)

        Output files:
                <prefix>.PC1.txt - peak file containing coordinates along the first 2 principal components
                <prefix>.PC1.bedGraph - UCSC upload file showing PC1 values across the genome

Command Line Options for getHiCcorrDiff.pl

        Usage getHiCcorrDiff.pl <output prefix> <HiC directory1> <HiC directory2> [options]

        Options:
                -res <#> (resolution in bp, default: 50000)
                -superRes <#> (super resolution in bp, i.e. window size, default: 100000)
                -corrDepth <#> (number of expected reads needed per data point when calculating correlation, default: 3)
                -std <#> (exclude reads with sequencing depth exceeding # std deviations, default: 4)
                -min <#> (exclude reads with sequencing depth less than this fraction of mean, default: 0.1)
                -maxDist <#> (maximum distance around regions to calculate similarity metrics, default: none)
                -cpu <#> (number of CPUs to use, default: 1)

        Output files:
                <prefix>.corrDiff.txt - peak file containing correlation values for each region
                <prefix>.corrDiff.bedGraph - UCSC upload file showing correlation values across the genome


Command Line Options for getDomains.pl

        Usage: getActiveDomains.pl <PC1.txt file> [options]

        Options:
                -opp (return inactive, not active regions)
                -thresh <#> (threshold for active regions, default: 0)
                -bg <2nd PC1.txt file> (for differential domains)
                        -diff <#> (difference threshold, default: 50)
                -corr <corrDiff.txt file> (for differential domains)
                        -corrDiff <#> (correlation threshold, default: 0.4)
                -peaks (output as peaks/regions, not continuous domains)






Can't figure something out? Questions, comments, concerns, or other feedback:
cbenner@ucsd.edu