
The most basic way to represent Hi-C data is in matrix format, where the number of interactions is reported between sets of regions. Since it is difficult to extract this data from the raw mapped reads, HOMER provides tools to create matrices from tag directories.

Most Hi-C tasks in HOMER revolve around the analyzeHiC command. Below is a description of how to use it to create matrices. analyzeHiC requires a Hi-C tag directory (directions on creating one from FASTQ or alignment files can be found here). Newer versions of HOMER strive to normalize matrices on the fly to accommodate different analysis parameters. However, matrices that are normalized for interaction distance will still trigger the creation of a HOMER background model.

It is also worth noting that there are many other useful ways to visualize Hi-C data. One highly recommended way to visualize and 'surf' Hi-C data is to use Juicebox. This requires generating a *.hic file from the experiment, which is covered here.

Quick Reference:

Quick Note about Hi-C Analysis

Running analyzeHiC

By default, HOMER generates an interaction matrix normalized by sequencing depth per region and sends it to stdout (you can also specify "-o outputfilename.txt" instead). The resulting file is a tab-delimited text file that other programs can use to visualize the data as a heatmap (covered in more depth below). HOMER now normalizes the matrix to interactions per hundred square kilobases per billion mapped reads (ihskb). This unit is a mouthful, but the idea is to standardize the signal across different resolutions of analysis and sequencing depths. There are many other options covered below.

Changing the Resolution and Window Size of Analysis

The default resolution is 10000000 (10 Mb). This is to make sure the command finishes quickly if you forget to specify the correct resolution. To specify a different resolution, use "-res <#>".

analyzeHiC ES-HiC -pos chr1:10,000,000-13,000,000 -res 10000 -balance > output.10kResolution.txt

HOMER uses two (related) notions of resolution. The first, "-res <#>", represents how frequently the genome is divided up into regions to analyze (i.e. the step size of the analysis). The second, the window resolution "-window <#>", represents how large a region is used when counting reads. Usually "-res <#>" should be smaller than "-window <#>" - this effectively analyzes the data in overlapping windows, which is useful for removing edge effects. For example, a res of 50000 and a window of 100000 means that HOMER will analyze the regions 0-50k, 50k-100k, 100k-150k, etc., but at each region it will count reads from a window of size 100k, i.e. from -25k-75k, 25k-125k, 75k-175k, etc. The principal advantage of this strategy is that you don't penalize features that span boundaries. For example:

analyzeHiC ES-HiC -pos chr1:10,000,000-13,000,000 -res 10000 -window 20000 > output.10kby20kResolution.txt
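The relationship between the step size and the window can be sketched in a few lines of Python. This is only an illustration, not HOMER's code: the function name bins_and_windows is hypothetical, and it assumes the window is centered on the midpoint of each step, which is what reproduces the 0-50k / -25k-75k example above.

```python
def bins_and_windows(start, end, res, window):
    """Return (bin_start, bin_end, win_start, win_end) for each res-sized step.

    Assumption: the counting window is centered on the bin midpoint.
    """
    out = []
    for bin_start in range(start, end, res):
        bin_end = bin_start + res
        mid = bin_start + res // 2
        out.append((bin_start, bin_end, mid - window // 2, mid + window // 2))
    return out


rows = bins_and_windows(0, 150_000, res=50_000, window=100_000)
for bin_start, bin_end, win_start, win_end in rows:
    print(f"bin {bin_start}-{bin_end}: reads counted from {win_start}-{win_end}")
```

Each window overlaps its neighbors by half, so a feature sitting on a bin boundary still falls entirely inside at least one window.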

Keep in mind that the resolution will also dictate the size of the output matrix. The command will warn you if your parameters seem unreasonable...

Specifying specific regions to analyze

By default HOMER will analyze interactions across the entire genome. You can restrict the analysis to a specific chromosome, or part of a chromosome, using the following options. This is necessary when analyzing things at high resolution:

-chr <chr name> : will restrict analysis to this chromosome
-start <#> : starting position for analysis
-end <#> : end position for analysis
-pos <chr:start-end> : UCSC browser formatted position - if you're lazy like me, takes place of the -chr/-start/-end

If you only specify "-chr chr1" and do not specify a start and end, HOMER will simply analyze all of chr1. Regions will be created starting at position 0 to 1*resolution, then from 1*resolution to 2*resolution, etc. (i.e. 0-10kb, 10kb-20kb, 20kb-30kb). If an alternative start is specified, then regions will be created at start+0*resolution to start+1*resolution, start+1*resolution to start+2*resolution, etc. (i.e. 205kb-215kb, 215kb-225kb, ...).
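As a rough sketch of how the region boundaries follow from the start position and the resolution, the snippet below tiles an interval and counts the resulting regions, which is also the dimension of the matrix. The helper tile_regions is hypothetical, and it assumes the last region is not clipped at the end coordinate; HOMER may treat the final partial region differently.

```python
import math


def tile_regions(start, end, res):
    """Return [(region_start, region_end), ...] covering start..end in res-sized steps."""
    n = math.ceil((end - start) / res)
    return [(start + i * res, start + (i + 1) * res) for i in range(n)]


# Alternative start position, as in the 205kb example in the text.
regions = tile_regions(205_000, 245_000, 10_000)
print(regions[:2])  # [(205000, 215000), (215000, 225000)]

# chr1:10,000,000-13,000,000 at 10 kb resolution gives a 300 x 300 matrix.
n_bins = len(tile_regions(10_000_000, 13_000_000, 10_000))
print(n_bins)  # 300
```

Counting the regions this way before running analyzeHiC is a quick check that the requested matrix will stay a manageable size.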

By default, HOMER makes a symmetric matrix. If you want to look at a matrix between two different regions, use the following options:

-chr2 <chr name> : will restrict analysis to this chromosome
-start2 <#> : starting position for analysis
-end2 <#> : end position for analysis
-pos2 <chr:start-end> : UCSC browser formatted position - if you're lazy like me, takes place of the -chr2/-start2/-end2
-vsGenome : compare to the rest of the genome

NOTE: Don't use -chr2/-start2/-end2/-pos2/-vsGenome unless you specified something with -chr/-start/-end/-pos etc. first.

HOMER divides the specified regions into #-bp regions, where # is the resolution. HOMER doesn't allow you to cherry-pick several regions from different chromosomes, at least not using the -chr, -start, -end, and -pos options (you could do this with peaks and the -p option).

You can arbitrarily define the regions you want to examine by providing a peak/BED file of the regions. However, this option works a little differently. In the case above (with -chr/-start/-end/-pos), you provide a locus you want a detailed picture of, and HOMER chops the region into resolution-sized chunks and performs the analysis. When you provide a peak file, HOMER only considers the center of each peak and the surrounding "resolution-sized" region. It will not chop up your peaks into resolution-sized chunks (unless you specify the "-chopify" option). In general, the "-p" option is more useful for looking at all CTCF peaks, for example, to see which are interacting. You can provide a peak/BED file that effectively tiles a region, in which case it would mimic how -chr/-start/-end/-pos works but give you the flexibility to define any region(s) you want. To specify a peak file containing regions to analyze with analyzeHiC:

-p <peak/BED file> : peak/BED file to use to search for interactions between.
-p2 <peak/BED file> : A second peak/BED file for non-symmetric matrices.
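The peak-centered behavior described above can be pictured with a short sketch. It is not HOMER's implementation: peak_to_region is a hypothetical helper, and the integer rounding of the center and of the half-width is an assumption.

```python
def peak_to_region(chrom, start, end, res):
    """Return the res-sized region centered on the midpoint of a peak."""
    center = (start + end) // 2
    half = res // 2
    return chrom, center - half, center + half


# Made-up peaks of very different widths.
peaks = [("chr1", 1_000_200, 1_000_400), ("chr2", 5_000_000, 5_002_000)]
regions = [peak_to_region(c, s, e, res=50_000) for c, s, e in peaks]
for region in regions:
    print(region)
```

Note that the original peak width plays no role here: a 200 bp peak and a 2 kb peak both end up as one 50 kb region around their centers.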

Keep in mind that an interaction matrix is probably only useful up to about 2000 x 2000 points - anything much bigger can't really be visualized easily (the file will also get really big...).

Matching the resolution of Peak/BED Files and Resolution of analyzeHiC

Another important point: if you're using transcription factor peaks for analysis, they are often located less than the resolution apart from one another (e.g. two PU.1 peaks may be less than 1000 bp from each other while the resolution for analyzeHiC is 50000), which means that the two PU.1 peaks will give essentially the same results. To avoid this redundancy, it's best to run mergePeaks with a single peak/BED file to collapse peaks within the size of the resolution. For example:

mergePeaks pu1.peaks -d 50000 > newPu1Peaks.txt

The resulting file will contain only peaks at least 50000 bp from one another. Use this resulting file with analyzeHiC.
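A quick sanity check on peak spacing can be sketched as follows. This does not reproduce what mergePeaks does; it only reports neighboring peak centers on the same chromosome that are closer than the resolution. The function close_pairs and the (chrom, center) input format are assumptions made for illustration.

```python
from collections import defaultdict


def close_pairs(peaks, min_dist):
    """Return (chrom, a, b) for neighboring peak centers closer than min_dist.

    peaks: iterable of (chrom, center) tuples.
    """
    by_chrom = defaultdict(list)
    for chrom, center in peaks:
        by_chrom[chrom].append(center)
    pairs = []
    for chrom, centers in by_chrom.items():
        centers.sort()
        for a, b in zip(centers, centers[1:]):
            if b - a < min_dist:
                pairs.append((chrom, a, b))
    return pairs


peaks = [("chr1", 100_000), ("chr1", 100_800), ("chr1", 400_000), ("chr2", 100_500)]
print(close_pairs(peaks, 50_000))  # [('chr1', 100000, 100800)]
```

An empty result means no two peaks would map to nearly the same analysis region at that resolution.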

Visualizing the Interaction Matrix

Different types of normalization and analysis options for Hi-C matrices

By default, analyzeHiC creates an ihskb normalized interaction matrix file. This is a tab-delimited text file formatted to be easily opened with Java Tree View (or any other software that generates heat maps). The rows and columns correspond to genomic regions, and the values correspond to interaction information between each pair of loci. The type of information shown depends on the options chosen when running analyzeHiC. The genomic positions reported correspond to the beginning of each region. Normally, the output is sent to stdout, but you can also specify an output file with "-o outputfilename.txt".

Output Information Options (choose only one):

-raw

Outputs the raw interaction counts between the regions

-coverageNorm (default)

Outputs normalized interaction counts assuming each region should have the same number of Hi-C interaction reads. This normalization essentially controls for the sequencing depth at each region. By default it will also normalize to the total sequencing depth and resolution size (more on that below). This normalization can be customized by using the options "-normTotal <#>" (controlling the total sequencing depth, default 1e9) and "-normArea <#>" (controlling the area in bp^2 used to normalize).
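The depth and area scaling can be illustrated with a small calculation. This sketch covers only the unit conversion, not the per-region coverage correction, and it is not taken from HOMER's source. The function scale_to_reference is hypothetical, and the norm_area value used in the example is an illustrative number rather than HOMER's documented default.

```python
def scale_to_reference(count, total_reads, cell_area_bp2, norm_total, norm_area):
    """Rescale a raw count to a reference sequencing depth and a reference area."""
    depth_factor = norm_total / total_reads   # e.g. per billion mapped reads
    area_factor = norm_area / cell_area_bp2   # e.g. per fixed area in bp^2
    return count * depth_factor * area_factor


window = 20_000
value = scale_to_reference(
    count=12,
    total_reads=200_000_000,
    cell_area_bp2=window * window,  # area of one matrix cell in bp^2
    norm_total=1e9,
    norm_area=1e8,                  # illustrative value only (assumption)
)
print(value)  # 15.0
```

Because the count is divided by the cell area, the same interaction density yields the same scaled value whether the cell covers 20 kb x 20 kb or 40 kb x 40 kb, which is the point of standardizing across resolutions.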

-distNorm

Outputs the ratio of observed to expected interactions by assuming each region has an equal chance of interacting with every other region in the genome AND that regions are expected to interact depending on their linear distance along the chromosome. This attempts to take into account the "proximity ligation" effect, where adjacent regions are expected to have large numbers of interactions regardless of the specific 3D genomic structure in the region. If this option is used, HOMER will automatically generate a 'background' model that attempts to model this effect across the genome.
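One simple way to picture an observed/expected ratio that accounts for linear distance is to use the average count at each bin separation as the expectation. The sketch below does exactly that on a toy matrix. HOMER's background model is more involved (it also models per-region coverage), so treat this as a simplified stand-in rather than the method itself; both function names are hypothetical.

```python
def distance_expected(matrix):
    """Average count at each bin separation |i - j| in a square matrix."""
    n = len(matrix)
    sums = [0.0] * n
    counts = [0] * n
    for i in range(n):
        for j in range(n):
            d = abs(i - j)
            sums[d] += matrix[i][j]
            counts[d] += 1
    return [s / c for s, c in zip(sums, counts)]


def observed_over_expected(matrix):
    """Divide each cell by the average count at its distance from the diagonal."""
    expected = distance_expected(matrix)
    n = len(matrix)
    return [
        [matrix[i][j] / expected[abs(i - j)] if expected[abs(i - j)] > 0 else 0.0
         for j in range(n)]
        for i in range(n)
    ]


raw = [
    [100, 40, 10],
    [40, 100, 20],
    [10, 20, 100],
]
ratio = observed_over_expected(raw)
for row in ratio:
    print([round(v, 3) for v in row])
```

In the toy matrix, bins 0 and 1 interact more than the average pair of neighbors (ratio above 1) while bins 1 and 2 interact less (ratio below 1), even though both raw counts are dominated by proximity.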

-corr

Instead of outputting the matrix as is, the value of each cell is replaced with the Pearson correlation coefficient between the row and column. This can be useful as it adds transitive information to the problem. Instead of just using the number of interactions that directly span two loci, the correlation option considers how each region interacts with all of the other loci too. If two regions have similar interaction profiles, the correlation will be high (i.e. close to 1). If "-logp" or "-expected" is used, those values are the ones that will be used for the correlation calculation. The matrix must be symmetric for this option to work.
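The idea of replacing each cell with a correlation between interaction profiles can be sketched on a small symmetric matrix. This is a plain Pearson correlation between rows written with the standard library; any preprocessing HOMER applies before correlating is not reproduced here, and the toy values are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile has no defined correlation
    return cov / math.sqrt(vx * vy)


def correlation_matrix(matrix):
    """Cell (i, j) becomes the correlation between profiles i and j."""
    n = len(matrix)
    return [[pearson(matrix[i], matrix[j]) for j in range(n)] for i in range(n)]


# Toy matrix: regions 0-1 interact mostly with each other, as do regions 2-3.
m = [
    [5, 4, 1, 1],
    [4, 5, 1, 1],
    [1, 1, 5, 4],
    [1, 1, 4, 5],
]
corr = correlation_matrix(m)
for row in corr:
    print([round(v, 2) for v in row])
```

Regions 0 and 1 come out strongly positively correlated and regions 0 and 2 strongly negatively correlated, which sharpens the two-block structure that is only moderately visible in the raw values.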

-nomatrix

Don't create a matrix (useful in other contexts such as calculating compaction scores, etc.)

Matrix Balancing

Using "-balance" will iteratively balance matrices to ensure the total interactions are the same for each region (i.e. row/column). This helps remove artifacts caused by differential Hi-C read coverage. However, regions with limited/low read coverage will have "inflated" interaction counts - so be careful when interpreting interactions from these regions.
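Iterative balancing can be illustrated with a generic scheme in which every row and column is repeatedly divided by its relative coverage until all row sums agree. The sketch below is one common way to do this and is not claimed to be the exact procedure HOMER uses; the function balance, the iteration count, and the toy matrix are assumptions.

```python
def balance(matrix, iterations=200):
    """Iteratively rescale a symmetric matrix so that all row sums converge."""
    n = len(matrix)
    m = [[float(v) for v in row] for row in matrix]
    for _ in range(iterations):
        row_sums = [sum(row) for row in m]
        mean_sum = sum(row_sums) / n
        # Relative coverage of each region; empty rows are left untouched.
        bias = [s / mean_sum if s > 0 else 1.0 for s in row_sums]
        m = [[m[i][j] / (bias[i] * bias[j]) for j in range(n)] for i in range(n)]
    return m


# Region 2 has far fewer reads than regions 0 and 1.
raw = [
    [50, 20, 1],
    [20, 50, 2],
    [1, 2, 3],
]
balanced = balance(raw)
print([round(sum(row), 3) for row in balanced])
print([round(v, 3) for v in balanced[2]])
```

After balancing, the three row sums agree, but the values in the low-coverage row have been scaled up substantially relative to the others, which is the "inflation" the text warns about.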

Creating Relative Matrices

You can create a matrix which only analyzes contacts near the diagonal up to a maximum distance using the options "-relative" and "-maxDist <#>". This allows you to create a 'matrix' that excludes the relatively sparse interactions found between distal regions.

Other options:

-cpu <#> : (default: 1) Use multiple threads when performing analysis (in this case only useful when processing multiple chromosomes or when creating a background model for -distNorm)

-std <#> : (default: 8) exclude the analysis of regions where the number of mapped reads exceeds the average number by 8 standard deviations (i.e. z-score greater than 8).

-min <#> : (default: 0.05) exclude the analysis of regions where the number of mapped reads is lower than this fraction of the average (default excludes regions with less than 5%)

-override : By default, HOMER will bail if you try to make a matrix that is too big. This option will remove the check and go ahead and make it anyway...

-log, -nolog : Will force the output of log (or linear) transformed values
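The coverage filters behind "-std" and "-min" can be pictured with the following sketch, which flags regions whose read coverage is more than 8 standard deviations above the average or below 5% of the average. How HOMER computes the average and standard deviation (for example, whether outliers are included) is not stated above, so using the population statistics over all regions is an assumption, as is the function name filter_regions.

```python
import statistics


def filter_regions(coverages, max_std=8.0, min_frac=0.05):
    """Return indices of regions whose coverage passes both thresholds."""
    mean = statistics.fmean(coverages)
    sd = statistics.pstdev(coverages)
    kept = []
    for index, cov in enumerate(coverages):
        too_high = sd > 0 and (cov - mean) / sd > max_std
        too_low = cov < min_frac * mean
        if not (too_high or too_low):
            kept.append(index)
    return kept


# 100 ordinary regions, one extreme outlier, and one nearly empty region.
coverages = [1000] * 100 + [100_000, 10]
kept = filter_regions(coverages)
print(len(kept), "of", len(coverages), "regions kept")
```

In this made-up example the extreme region and the nearly empty region are both dropped, leaving the 100 ordinary regions.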

Examples of Hi-C matrices created with analyzeHiC

Visualizing Multiple Hi-C experiments

Command Line Options for analyzeHiC

Command Line Options for batchMakeHiCMatrix.pl

        batchMakeHiCMatrix.pl -pos <chr:start-end> -res <#> -window <#> [etc.] -d <HiCtagDir1> [HiCtagDir2] ...

        Options:
                -d <HiC TagDir1> [HiC TagDir2] ... (Tag Directories of Hi-C experiments to visulize)
                -pos <chr:start-end> (genomic position to visualize)
                -res <#> (resolution of step size to use for analysis)
                -window <#> (resolution of window size for aggregating interactions)
                -balance (balance resulting Hi-C matrix)
                -stack (Stacks matricies on top of one another i.e. square and symetric, non-rotated, default)
                -split (Creates split matricies i.e. square, non-symetric, non-rotated)
                        (printed in order of directories: 1\2 3\4 5\6 ...)
                -rotate (Rotates matrices, default)
                        -frac <#> (fraction of square matrix to consider for rotating, default: 0.333)
                -cpu (number of different processes to use, def: 1)

                Other options are passed to analyzeHiC to control the creation of the matrices

HOMER

Creating and Normalizing Hi-C interaction/contact Matrices