Software for motif discovery and ChIP-Seq analysis

Introduction to HOMER

The best way to learn about HOMER is to go through the tutorial pages.  We've tried to spell out what happens in each step and explain the "why".  A brief description of the motif finding component of HOMER is found below.  Explanations of the sequencing analysis components of HOMER are integrated into the tutorials.

General Introduction to Motif Discovery with HOMER

HOMER is a collection of tools that are commonly needed for the analysis of gene expression profiling (microarray) and genome-wide location analysis experiments (ChIP-Seq or ChIP-Chip).  There are also routines for other types of sequencing experiments, such as DNase-Seq or GRO-Seq. 

Things HOMER does NOT do include finding differentially expressed genes (although it has some routines to help with this), clustering gene expression profiles, and searching for all the instances of Transfac motifs in order to make you hopelessly confused!!!  The idea was not to completely reinvent the wheel if possible.

Unfortunately, HOMER must be run as a command-line tool, and may be difficult to use if you are new to UNIX.  While commands have been distilled to be as simple and user-friendly as possible, basic knowledge of the UNIX environment and file system is critical (but can probably be learned quickly after typing “unix tutorial” into Google).  I am proud to say that many of the people using HOMER are completely new to UNIX, so it is indeed possible.  In addition, a spreadsheet program (e.g. Excel) is needed to graph and visualize some of the results produced by HOMER.

Below is a description of how motif analysis is executed with HOMER.  Documentation describing the steps of analysis for Next-Gen Sequencing (or genomic position analysis) or Microarrays (gene-based analysis) is covered in separate sections.

De Novo Motif Discovery Strategy

HOMER was designed as a de novo motif discovery algorithm that scores motifs by their differential enrichment between two sets of sequences.  This means that HOMER uses two sets of sequences when performing motif finding – 1. target sequences of interest (e.g. promoters of genes that are co-regulated) and 2. a set of background sequences (e.g. promoters of genes that are not regulated).  Without background sequences a motif discovery algorithm must guess what sequences are expected to be found by chance, such as assuming background sequences are a random collection of A, C, G, and T.  This can be extremely dangerous since real genomic sequence is anything but random. 

In practice HOMER will try to select the appropriate background sequences for you, but results can vary depending on what is used as background and certain applications may require careful consideration of these sequences.  By default HOMER will use confident, non-regulated promoters as background when analyzing promoters, and sequences in the vicinity of genes for ChIP-Seq analysis (i.e. from –50kb to +50kb).  In each case sequences are matched for their GC content to avoid bias from CpG Islands.
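
As a rough illustration of GC matching, the sketch below picks background sequences so that their GC-content distribution mirrors that of the targets.  It is not HOMER's actual procedure: the function names, the 10-bin scheme, and the one-background-per-target default are all assumptions made for the example.

```python
import random
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_bin(seq, n_bins=10):
    """Map a sequence to a GC-content bin index in the range 0..n_bins-1."""
    return min(int(gc_fraction(seq) * n_bins), n_bins - 1)


def select_gc_matched_background(targets, candidates, per_target=1, n_bins=10, seed=0):
    """Pick candidates so that each target gets up to `per_target`
    background sequences from its own GC bin (hypothetical helper)."""
    rng = random.Random(seed)

    # Group candidate background sequences by GC bin, in random order.
    pool = defaultdict(list)
    for seq in candidates:
        pool[gc_bin(seq, n_bins)].append(seq)
    for seqs in pool.values():
        rng.shuffle(seqs)

    # Draw from the matching bin for every target; skip if the bin is empty.
    selected = []
    for seq in targets:
        bucket = pool[gc_bin(seq, n_bins)]
        for _ in range(per_target):
            if bucket:
                selected.append(bucket.pop())
    return selected
```

If a GC bin runs out of candidates, this sketch simply returns fewer background sequences, which is one reason the choice of candidate pool deserves attention.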

Once target and background sequences are chosen, HOMER looks for motifs of a specific length that are over-represented in the target set relative to the background set.  This enrichment is measured using the cumulative hypergeometric distribution (or cumulative binomial distribution for large data sets), and places no requirement on the degeneracy of the motif or the number of occurrences.  Motifs are found by first exhaustively checking the enrichment of simple motifs, then refining promising candidates into accurate probability matrices.
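
To make the enrichment calculation concrete, here is a minimal sketch of a cumulative hypergeometric p-value using only the Python standard library.  The function and parameter names are made up for this example, and HOMER's own implementation may differ in its details (for instance in how it counts occurrences or handles very large sets).

```python
from math import comb


def hypergeom_enrichment_pvalue(total_seqs, target_seqs, seqs_with_motif, target_with_motif):
    """Probability of seeing at least `target_with_motif` motif-containing
    sequences among `target_seqs` sequences drawn without replacement from
    `total_seqs` sequences, of which `seqs_with_motif` contain the motif."""
    max_overlap = min(target_seqs, seqs_with_motif)
    denominator = comb(total_seqs, target_seqs)
    # Sum the upper tail of the hypergeometric distribution.
    numerator = sum(
        comb(seqs_with_motif, k) * comb(total_seqs - seqs_with_motif, target_seqs - k)
        for k in range(target_with_motif, max_overlap + 1)
    )
    return numerator / denominator


# Example: 1000 sequences in total (target + background), 100 of them targets,
# 200 sequences contain the motif, and 50 of those are targets.
p_value = hypergeom_enrichment_pvalue(1000, 100, 200, 50)
```

Here `total_seqs` counts target and background sequences together.  A small p-value means the motif is found in the target set more often than expected if targets were a random draw from the combined pool.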

With v3.0 of HOMER, the motif discovery software has been rewritten and modernized (the homer2 executable).  There is a subtle, but very important difference in how the new version of HOMER performs de novo motif analysis.  The original HOMER divided the input sequences into short oligos to perform the analysis, and once a motif was found, only the oligos considered "bound" by the motif were removed from the analysis.  The problem was that several oligos representing "offsets" of the original motif (think GGAAGT vs. GAAGTg) were left for the 2nd round of motif enrichment to find, creating results that often contained several versions of the original motif.  The new version revisits the input sequences and removes all oligos that are slightly offset from the optimal motifs, making it much more sensitive to co-enriched motifs.

Known Motif Discovery Strategy

The biggest problem when looking for “known” motifs is defining how degenerate you should allow them to be.  To circumvent this problem, we loaded motifs derived from published ChIP-Seq experiments that were already optimized for degeneracy thresholds.

Interpretation of Motif Discovery Results

De Novo Results

Unfortunately, if you give HOMER random data, HOMER will find motifs, and they may look significant.  Due to the finite amount of data and many degrees of freedom in a motif probability matrix, it is easy to find a motif with a seemingly significant p-value.  Because of this, we can only trust the most promising of motifs as likely to be real.  For most promoter datasets, motifs with a p-value of more than 1e-10 or even 1e-12 are likely to be false positives.  In general the p-value cutoff should be estimated by randomizing data labels and running the algorithm several times.  In practice you should start ignoring results when they are either less significant than 1e-10 (i.e. have a p-value larger than 1e-10) or start becoming very different from one another (in terms of sequence) yet have similar p-values.  In addition, high quality motifs usually appear multiple times in the list with different offsets (e.g. nnnTGACTCAnn and nTGACTCAnnnn).  HOMER attempts to remove extremely similar motifs, but different offsets of motifs are likely to be present if the signal is strong (remember motifs may appear as if on the negative strand).
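
The label randomization idea can be sketched as follows.  This is only an outline under stated assumptions: `best_pvalue_fn` is a hypothetical callable standing in for a full motif finding run that returns the best p-value it found, and the number of rounds is arbitrary.

```python
import random


def empirical_pvalue_cutoff(targets, background, best_pvalue_fn, n_rounds=10, seed=0):
    """Estimate how significant a motif can look by chance.

    Target/background labels are shuffled `n_rounds` times, and
    `best_pvalue_fn(fake_targets, fake_background)` (a hypothetical stand-in
    for a complete motif discovery run) is called on each shuffled split.
    Returns the smallest "best p-value" seen on shuffled data.
    """
    rng = random.Random(seed)
    combined = list(targets) + list(background)
    n_targets = len(targets)

    null_best = []
    for _ in range(n_rounds):
        rng.shuffle(combined)
        fake_targets = combined[:n_targets]      # same size as the real target set
        fake_background = combined[n_targets:]   # everything else
        null_best.append(best_pvalue_fn(fake_targets, fake_background))
    return min(null_best)
```

Motifs from the real analysis are only worth a closer look if their p-values are clearly more significant than the value returned here.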

Matching De Novo to Known Motifs

HOMER makes every attempt to tell you if the motifs it discovered resemble a known motif.  The difficulty of interpreting these results SHOULD NOT BE UNDERESTIMATED!!!  Consider the following:
  1. Databases of known motifs are a mixture of accurate and inaccurate motifs
  2. Databases of known motifs are not complete
  3. The literature (especially motif finding papers) is full of inaccurate assessments and motif annotations that are ludicrous.
HOMER tries to find the known motifs that correlate best with each de novo motif.  It then aligns the motifs from the top hits so that you can see them and judge the alignment for yourself.  The top known motif match is not always the best match.  The top match is not always annotated correctly.  If you feel something is worth pursuing, look up the known binding sites of the transcription factor via PubMed.  Feedback I got when writing the program was to provide the name of the motif in the main result table – this was promptly followed by the misinterpretation of results because people are too lazy to look at the alignment to figure out if it makes any sense.  These results do not write the paper for you – critical thinking and follow-up is required.
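
To give a feel for what comparing motifs by correlation can involve, the sketch below slides a de novo probability matrix along a known one, on both strands, and reports the best Pearson correlation.  It is an illustration only, not HOMER's scoring code: the function names are invented, matrices are assumed to be lists of [A, C, G, T] probability rows, and padding overhanging positions with a neutral 0.25 row is an assumption of this example.

```python
from math import sqrt

NEUTRAL_ROW = [0.25, 0.25, 0.25, 0.25]  # assumed padding for overhanging positions


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def reverse_complement(matrix):
    """Reverse the positions and swap A<->T, C<->G (columns are A, C, G, T)."""
    return [row[::-1] for row in reversed(matrix)]


def best_alignment(known, de_novo):
    """Return (correlation, offset, strand) for the best-scoring alignment.

    `offset` is the position in `known` where the first position of the
    (possibly reverse-complemented) de novo motif is placed.
    """
    best = (-1.0, 0, "+")
    for strand, query in (("+", de_novo), ("-", reverse_complement(de_novo))):
        for offset in range(-(len(query) - 1), len(known)):
            start = min(0, offset)
            end = max(len(known), offset + len(query))
            xs, ys = [], []
            for pos in range(start, end):
                k_row = known[pos] if 0 <= pos < len(known) else NEUTRAL_ROW
                q_idx = pos - offset
                q_row = query[q_idx] if 0 <= q_idx < len(query) else NEUTRAL_ROW
                xs.extend(k_row)
                ys.extend(q_row)
            score = pearson(xs, ys)
            if score > best[0]:
                best = (score, offset, strand)
    return best
```

Even with a score in hand, the reported offset and strand are there so that the alignment itself can be inspected rather than taken on faith.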

Additional Reading: Tips for de novo motif finding

Known Motif Enrichment

First and most important: There is a subtle but IMPORTANT difference between looking for motifs de novo and looking for known motif enrichment.  De novo motif discovery allows you to directly query the sequence to discover which motifs are the MOST enriched sequences in your target set.  Known motif discovery will simply tell you which of the known motifs is most enriched in your target set.

This may not seem important but consider the following scenario:  You have a set of random GA-rich sequences and compare them to random genomic sequences.  De novo motif finding will likely return a G/A-rich matrix that doesn’t look anything like a transcription factor.  Known motif finding will return astonishingly significant (i.e. very low) p-values for motifs like PU.1 (GAGGAAGT) and ISRE (GAAACTGAAA), simply because these motifs are themselves GA-rich.  Because of this, de novo motif finding results are much more trustworthy.

The greatest advantage to using known motifs is found when you have a limited set of target sequences.  The less data that is available or the weaker the true signal, the more difficult it is for de novo motif finding to accurately define a signal that is significant.  Known motifs have the advantage of far fewer degrees of freedom and in many cases find the correct motifs when the enrichment falls short of the 1e-10 threshold for reliability used when considering de novo results.

A more detailed description of the motif finding procedure is available in the Motif Finding Tutorial.

Next: Introduction to Homer Programs
