HOMER

Software for motif discovery and ChIP-Seq analysis

Motif finding details

Description of HOMER for de novo motif discovery

Sequences were divided into target and background sets for each application of the algorithm. Background sequences were then selectively weighted to equalize the distributions of CpG content in target and background sequences to avoid comparing sequences of different general sequence content common in mammalian genomes. Motifs of different lengths are identified separately by first exhaustively screening all possible oligos of a specific length for enrichment in the target set compared to the background set by assessing the number of target and background sequences containing each oligo and then using the cumulative hypergeometric distribution to score enrichment. Up to two mismatches were allowed in each oligonucleotide sequence to increase the sensitivity of the method. The top 100 oligonucleotides of each length with the best enrichment scores were then converted into basic probability matrices for further optimization. HOMER then generates motifs comprised of a position-weight matrix and detection threshold by empirically adjusting motif parameters to maximize the enrichment of motif instances in target sequences versus background sequences using the cumulative hypergeometric distribution as a scoring function. Probability matrix optimization follows a local hill-climbing approach that weights the contributions of individual oligos recognized by the motif to improve enrichment, while optimization of motif detection thresholds were performed by exhaustively screening degeneracy levels for maximal enrichment during each iteration of the algorithm. Once a motif is optimized, the individual oligos recognized by the motif are removed from the data set to facilitate the identification of additional motifs. Sequence logos were generated using SVG code to visualize them within HTML pages.

Algorithmic Details

The algorithm is composed of an exhaustive search step followed by the local optimization of promising motifs. Target and background sequences are initially parsed to produce an oligo table of length w, where w is the length of the motif. The number of occurrences in both target and background sequences is recorded for each oligo, providing a fast look up for oligo enrichment later in the algorithm. A motif in this context is a collection of oligos that are considered bound by the same transcription factor, so it is natural and more efficient to analyze the data in terms of oligos rather than the original sequences. The original sequences are discarded at this point in the motif finding algorithm and used only at the end for recovering actual instances of the discovered motifs.

Motif enrichment is calculated using a modified version of the cumulative hypergeometric distribution (or Fisher Exact test, referred to as the hypergeometric). Given a total set of N sequences, the hypergeometric is used to calculate the random probability that two independent sub-groups of sequences of size n1 and n2 are likely to share n or more common sequences. In this setting, the two independent groups are represented by the group of target sequences and the group of sequences containing a putative motif. This method of scoring implies that the existence of a motif in a sequence is enough to cause it to bind and does not try to integrate the collective probability of binding at each position over the entire sequence.

Since HOMER uses an oligo table to record the number of times each oligo appears in the target or background groups, exact mapping between individual sequences and the oligos of a putative motif is not possible. Instead, HOMER can only calculate the total number of oligos corresponding to a given motif found in either the target or background sets. However, one can calculate the expected number of sequences likely to contain the motif by assuming each oligo occurrence was independently distributed in the set of sequences using the recursive relationship ( nsi = nsi-1 + (Nset-nsi-1)/Nset; np1=1 where nsi is the expected number of sequences given the presence of i oligos in the set of Nset total sequence). Since the recursive relationship is not normally an integer, its value is rounded to the nearest integer for the calculation of the hypergeometric. While this scoring scheme differs from traditional applications of the hypergeometric, this approach can dramatically speed up implementation and drastically reduce memory requirements by removing associations between oligos and individual sequences.

HOMER searches for the most enriched motifs by first performing an exhaustive search of putative motifs centered on each entry in the oligo table. For each oligo, HOMER constructs a putative motif composed of the oligos with at most two mismatches from the given oligo. Each putative motif is scored by calculating the total number of oligos represented by the motif in both the target and background sets. These values are used to calculate the modified hypergeometric described above. A sequence tree is used for fast look up of oligos within a given number of mismatches from the consensus oligo. While allowing mismatches, as opposed to modeling specific nucleotide variations, is not a very descriptive motif model, optimal motifs at the end of the algorithm are typically represented by one or more highly significant putative mismatch motifs. More complicated schemes such as searching the space of IUPAC degenerate symbols motifs may be performed as well, but our experience has shown that the simple mismatch model offers a good trade off between running time and sensitivity for regulatory motifs.

The most promising putative motifs identified by the exhaustive search are then refined using an iterative local optimization algorithm. HOMER uses promising putative motifs (seeds) to construct probability matrices and detection thresholds capable of specifying groups of oligos that are not only similar in sequence but also collectively highly enriched in the target sequence set and correspondingly likely to be bound by the same transcription factor. These probability matrices and thresholds will be the final output of the algorithm. Prior to local optimization each promising seed from the exhaustive part of the algorithm is converted into a probability matrix that is representative of the consensus oligo with small arbitrary probabilities assigned to the non-consensus nucleotides.

The local optimization phase of the algorithm can be divided into two parts. The first part finds the optimal detection threshold, which corresponds to the overall degeneracy of the motif, and is analogous to the number of mismatches allowed. Given a probability matrix, each oligo in the oligo table is scored relative to the probability matrix by multiplying the probability of seeing each nucleotide at each position of the oligo. Each oligo is then sorted based on this score such that the oligos most similar to the probability matrix are listed first. The optimal detection threshold is found by adding each oligo in the list to the motif and calculating the enrichment after adding each oligo. With the addition of each oligo the effective threshold is lowered, and the threshold that results in the most significant enrichment is assigned as the optimal threshold.

The second part of the local optimization phase refines the probability matrix to more accurately define oligos described by the motif. This part is performed by assessing the relative enrichment contributed by each oligo when included in the motif. For each oligo with a similarity score greater than the optimal threshold calculated in part one, the enrichment of the motif (not including the oligo) is calculated and the log differential from the full motif is recorded. A new probability matrix is then formed by adding each oligo weighted by its relative contribution to the enrichment. Oligos that decrease the significance of the motif when removed are excluded from this calculation. The resulting matrix is normalized and the local optimization process repeated until the enrichment of the motif fails to improve. Since the probability matrix refinement step can be stuck in an unwanted local optimum, several new matrices corresponding to several thresholds below the optimal threshold are also used to effectively search for highly enriched oligos that miss the optimal threshold, giving them an opportunity to contribute to a new probability matrix. Each of the probability matrices are scored for their optimal enrichment, and only the most enriched matrix is used for subsequent refinement. Once the local optimization algorithm fails to improve the p-value of the motif, the motif is reported and the next seed is optimized. To avoid the tendency for seeds to converge on the same motif, oligos corresponding to the optimized motif are deleted from the oligo table, which is analogous to masking them from the primary sequence and repeating the procedure.

Can't figure something out? Questions, comments, concerns, or other feedback:
cbenner@ucsd.edu