Software for motif discovery and next-gen sequencing analysis

Quick'n'Dirty HOMER Hi-C Tutorial

Quick cheat sheet for analyzing Hi-C data with HOMER. This workflow generally works well for in situ Hi-C experiments sequenced to a depth of 250 million to 1 billion reads. Lower read counts may require some parameter adjustments (such as using larger -res and -window values, i.e. a coarser resolution, for some analyses). More detailed descriptions of what HOMER is doing and how to use different utilities can be found here.

FASTQ trimming and read alignment:

#reads should be trimmed and aligned separately (do not perform paired-end alignment with HOMER).  Assumes MboI/DpnII (GATC) is the restriction enzyme used in the Hi-C assay:
homerTools trim -3 GATC -mis 0 -matchStart 20 -min 20 hicExp1_R1_fastq
homerTools trim -3 GATC -mis 0 -matchStart 20 -min 20 hicExp1_R2_fastq

bowtie2 -p 20 -x hg38index -U hicExp1_R1_fastq.trimmed > hicExp1_R1.hg38.sam
bowtie2 -p 20 -x hg38index -U hicExp1_R2_fastq.trimmed > hicExp1_R2.hg38.sam
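
#The same four commands can be wrapped in a loop over both mates. This is only a sketch: SAMPLE, GENOME, INDEX, THREADS and DRY_RUN are placeholder names introduced here, not HOMER settings. It uses bowtie2's -S option to name the SAM output instead of redirecting stdout. By default it only prints the commands (DRY_RUN=1); set DRY_RUN=0 to execute them:

```shell
# Placeholder settings for illustration; adjust to your own files and index.
SAMPLE="hicExp1"
GENOME="hg38"
INDEX="hg38index"
THREADS=20
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Trim and align each mate on its own (no paired-end alignment).
trim_and_align() {
    for mate in R1 R2; do
        fq="${SAMPLE}_${mate}_fastq"
        run homerTools trim -3 GATC -mis 0 -matchStart 20 -min 20 "$fq"
        # -S writes the SAM output to the named file
        run bowtie2 -p "$THREADS" -x "$INDEX" -U "${fq}.trimmed" -S "${SAMPLE}_${mate}.${GENOME}.sam"
    done
}

trim_and_align
```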

Create Hi-C Tag Directory with HOMER:

#Paired alignment files should be provided with a comma (NO spaces around the comma). The "-tbp 1" removes PCR duplicates and is highly recommended:
makeTagDirectory HicExp1TagDir/ hicExp1_R1.hg38.sam,hicExp1_R2.hg38.sam -tbp 1
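
#A small guard can catch the most common mistakes with the paired argument before makeTagDirectory is run. This is a sketch: paired_arg is a hypothetical helper name, not part of HOMER. It checks that both alignment files exist and are non-empty, and that the joined argument contains no spaces:

```shell
# Build "R1.sam,R2.sam" and fail early on missing files or stray spaces.
paired_arg() {
    r1="$1"
    r2="$2"
    for f in "$r1" "$r2"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty alignment file: $f" >&2
            return 1
        fi
    done
    arg="${r1},${r2}"
    case "$arg" in
        *" "*)
            echo "spaces are not allowed in: $arg" >&2
            return 1
            ;;
    esac
    printf '%s\n' "$arg"
}

# Example usage (prints the command instead of running it):
# arg="$(paired_arg hicExp1_R1.hg38.sam hicExp1_R2.hg38.sam)" && echo makeTagDirectory HicExp1TagDir/ "$arg" -tbp 1
```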

#optional - for more thorough QC read-outs (takes longer):
makeTagDirectory HicExp1TagDir/ hicExp1_R1.hg38.sam,hicExp1_R2.hg38.sam -tbp 1 -genome hg38 -checkGC -restrictionSite GATC

#optional - create a *.hic file if you have juicer_tools installed to visualize with Juicebox (output file will be placed inside the tag directory):
tagDir2hicFile.pl HicExp1TagDir/ -juicer auto -genome hg38 -p 10

Visualize a Hi-C contact map for a specific region in the genome:

analyzeHiC HicExp1TagDir/ -pos chr2:10,000,000-12,000,000 -res 3000 -window 15000 -balance > output.txt

#visualize "output.txt" with Treeview 3 or other heatmap/cluster visualization software
#-res controls the sampling resolution and -window controls the binning resolution (i.e. the command above pools reads in 15kb bins sampled at 3kb intervals, so adjacent bins overlap)
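
#The relationship between -res and -window for the region above can be worked out with shell arithmetic. This is a sketch: the bin count is an estimate (region length divided by -res, rounded up), and HOMER's exact matrix size may differ slightly:

```shell
# Values taken from the analyzeHiC command above.
START=10000000
END=12000000
RES=3000
WINDOW=15000

bins=$(( (END - START + RES - 1) / RES ))   # sampling positions across the region
overlap=$(( WINDOW - RES ))                  # bp shared by neighbouring windows
coverage=$(( WINDOW / RES ))                 # windows that cover any given position

echo "approx. bins per axis: $bins"
echo "overlap between adjacent windows: $overlap bp"
echo "windows covering each position: $coverage"
```

#With these values each position contributes to 5 overlapping windows, which smooths the map at the cost of neighbouring bins no longer being independent.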

Chromatin Compartment Analysis (PCA, requires R):

#PCA of Hi-C contact matrices essentially clusters apart the 'checkerboard' pattern to reveal active and inactive chromatin regions along the genome:
runHiCpca.pl auto HicExp1TagDir/ -res 25000 -window 50000 -genome hg38 -cpu 10

#This will create two files in the tag directory, *.PC1.bedGraph and *.PC1.txt.  The *.PC1.bedGraph file can be viewed in a Genome Browser. If you have ChIP-seq or other regions that you know represent 'active regions', replace "-genome hg38" with something like "-active K27ac.peaks.bed"
#If your sequencing depth is low, you may need to use "-res 50000 -window 100000"
#To compare multiple experiments, first run runHiCpca.pl on each tag directory. Once you have PCA analyses from several experiments, you can combine their quantification into a single spreadsheet:
annotatePeaks.pl HicExp1TagDir/HicExp1TagDir.PC1.txt hg38 -noblanks -bedGraph HicExp1TagDir/HicExp1TagDir.PC1.bedGraph HicExp2TagDir/HicExp2TagDir.PC1.bedGraph HicExp3TagDir/HicExp3TagDir.PC1.bedGraph > output.txt

#If you have PC1 bedGraphs from replicate experiments across two conditions, you can identify significantly changing compartments using the following:
annotatePeaks.pl Exp1r1.PC1.txt hg38 -noblanks -bedGraph Exp1r1.PC1.bedGraph Exp1r2.PC1.bedGraph Exp2r1.PC1.bedGraph Exp2r2.PC1.bedGraph > output.txt
getDiffExpression.pl output.txt exp1 exp1 exp2 exp2 -pc1 -export outputPrefix > output2.txt

#In the example above, "exp1 exp1 exp2 exp2" labels the groups/replicates in the order that they appear in the input file
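
#For a rough first look without replicates, bins whose PC1 value changes sign between two experiments can be listed with awk. This is a sketch, not a substitute for the statistical test above: pc1_sign_flips is a hypothetical helper, and it assumes both bedGraph files use identical bins with four columns (chr, start, end, PC1):

```shell
# Print bins where PC1 has opposite signs in the two bedGraph files.
# Output columns: chr, start, end, PC1 in file 1, PC1 in file 2.
pc1_sign_flips() {
    awk 'BEGIN { OFS = "\t" }
         /^track/ || /^#/ { next }          # skip bedGraph header lines
         NR == FNR { pc1[$1 FS $2 FS $3] = $4; next }
         {
             key = $1 FS $2 FS $3
             if (key in pc1 && pc1[key] * $4 < 0)
                 print $1, $2, $3, pc1[key], $4
         }' "$1" "$2"
}

# Example usage:
# pc1_sign_flips Exp1r1.PC1.bedGraph Exp2r1.PC1.bedGraph > flipped.bins.txt
```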

Chromatin Compaction (DLR, ICF):

#calculate the distal-to-local log2 ratio (DLR) and interchromosomal fraction of interactions for each 5kb region of the genome (pooling interactions from a 15kb window size):
analyzeHiC HicExp1TagDir/ -res 5000 -window 15000 -nomatrix -compactionStats auto -cpu 10

#This will produce both *.DLR.bedGraph and *.ICF.bedGraph files and place them in the tag directory.
#Both the DLR and ICF are most useful when compared between experiments:
subtractBedGraphs.pl exp1.DLR.bedGraph exp2.DLR.bedGraph -center -name Exp1VsExp2-DLR > output.DLR.bedGraph
subtractBedGraphs.pl exp1.ICF.bedGraph exp2.ICF.bedGraph -center -name Exp1VsExp2-ICF > output.ICF.bedGraph
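
#To make the two statistics concrete, they can be computed by hand for a single region. This is a simplified sketch of the definitions, not HOMER's exact implementation: dlr_icf is a hypothetical helper and the interaction counts are made up for illustration:

```shell
# $1 = local intrachromosomal, $2 = distal intrachromosomal, $3 = interchromosomal counts
dlr_icf() {
    awk -v loc="$1" -v dist="$2" -v inter="$3" 'BEGIN {
        dlr = log(dist / loc) / log(2)          # distal-to-local log2 ratio
        icf = inter / (loc + dist + inter)      # interchromosomal fraction
        printf "DLR=%.2f ICF=%.2f\n", dlr, icf
    }'
}

dlr_icf 800 200 250   # prints: DLR=-2.00 ICF=0.20
```

#A negative DLR means the region interacts mostly with nearby sequence, and a higher ICF means a larger share of its interactions are with other chromosomes.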

Finding TADs and Loops:

#analyzing TADs and loops (i.e. specific locations that interact, e.g. two CTCF sites interacting):
findTADsAndLoops.pl find HicExp1TagDir/ -cpu 10 -res 3000 -window 15000 -genome hg38

#This will create *.loop.2D.bed and *.tad.2D.bed files and place them within the tag directory. It is better still to include a list of segmental duplications/blacklisted regions with "-p <peak/BED file>" to filter out likely false positives. The 2D.bed files can be visualized with Juicebox.
#To analyze changes in TADs/loops across experiments, first merge the features you want to analyze from each experiment to get their union:
merge2Dbed.pl exp1.loop.2D.bed exp2.loop.2D.bed -loop > merged.loop.2D.bed
merge2Dbed.pl exp1.tad.2D.bed exp2.tad.2D.bed -tad > merged.tad.2D.bed

#Once features have been merged, next quantify them across all replicates/experimental conditions (computes stats for loops and TADs at the same time)
findTADsAndLoops.pl score -tad merged.tad.2D.bed -loop merged.loop.2D.bed -o outPrefix -d HicExp1r1TagDir/ HicExp1r2TagDir/ HicExp2r1TagDir/ HicExp2r2TagDir/ -cpu 10 -res 3000 -window 15000

#Finally, identify features that are differentially enriched (uses edgeR/limma):
getDiffExpression.pl outPrefix.loop.scores.txt exp1 exp1 exp2 exp2 -loop > output.loop.txt
getDiffExpression.pl outPrefix.tad.scores.txt exp1 exp1 exp2 exp2 -tad > output.tad.txt
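
#Because the group labels must follow the order of the tag directories given to "findTADsAndLoops.pl score", it can help to derive them from the directory names rather than type them by hand. This is a sketch: group_labels is a hypothetical helper, and it assumes directories are named like HicExp1r1TagDir/ (group "Exp1", replicate "r1"):

```shell
# Print one lower-case group label per tag directory, in the order given.
group_labels() {
    labels=""
    for d in "$@"; do
        name="${d%/}"   # drop a trailing slash
        group="$(printf '%s\n' "$name" | sed -n 's/^Hic\(Exp[0-9][0-9]*\)r[0-9][0-9]*TagDir$/\1/p' | tr 'E' 'e')"
        if [ -z "$group" ]; then
            echo "cannot derive a group label from: $d" >&2
            return 1
        fi
        labels="${labels:+$labels }$group"
    done
    printf '%s\n' "$labels"
}

group_labels HicExp1r1TagDir/ HicExp1r2TagDir/ HicExp2r1TagDir/ HicExp2r2TagDir/   # prints: exp1 exp1 exp2 exp2
```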
