|
Motif finding details
Description of HOMER for de novo motif discovery
Sequences were divided into target and background sets for
each
application of the algorithm. Background sequences
were then
selectively weighted to equalize the distributions of CpG
content in
target and background sequences to avoid comparing
sequences of
different general sequence content common in mammalian
genomes.
Motifs of different lengths are identified separately by
first
exhaustively screening all possible oligos of a specific
length for
enrichment in the target set compared to the background
set by
assessing the number of target and background sequences
containing each
oligo and then using the cumulative hypergeometric
distribution to
score enrichment. Up to two mismatches were allowed
in each
oligonucleotide sequence to increase the sensitivity of
the method. The
top 100 oligonucleotides of each length with the best
enrichment scores
were then converted into basic probability matrices for
further
optimization. HOMER then generates motifs comprised
of a
position-weight matrix and detection threshold by
empirically adjusting
motif parameters to maximize the enrichment of motif
instances in
target sequences versus background sequences using the
cumulative
hypergeometric distribution as a scoring function.
Probability
matrix optimization follows a local hill-climbing approach
that weights
the contributions of individual oligos recognized by the
motif to
improve enrichment, while optimization of motif detection
thresholds
were performed by exhaustively screening degeneracy levels
for maximal
enrichment during each iteration of the algorithm.
Once a motif
is optimized, the individual oligos recognized by the
motif are removed
from the data set to facilitate the identification of
additional
motifs. Sequence logos were generated using SVG code to
visualize them within HTML pages.
Algorithmic Details
The algorithm is composed of an exhaustive search step
followed by the
local optimization of promising motifs. Target and
background
sequences are initially parsed to produce an oligo table
of length w,
where w is the length of the motif. The number of
occurrences in
both target and background sequences is recorded for each
oligo,
providing a fast look up for oligo enrichment later in the
algorithm. A motif in this context is a collection
of oligos that
are considered bound by the same transcription factor, so
it is natural
and more efficient to analyze the data in terms of oligos
rather than
the original sequences. The original sequences are
discarded at
this point in the motif finding algorithm and used only at
the end for
recovering actual instances of the discovered motifs.
Motif enrichment is calculated using a modified version of
the
cumulative hypergeometric distribution (or Fisher Exact
test, referred
to as the hypergeometric). Given a total set of N
sequences, the
hypergeometric is used to calculate the random probability
that two
independent sub-groups of sequences of size n1 and n2 are
likely to
share n or more common sequences. In this setting,
the two
independent groups are represented by the group of target
sequences and
the group of sequences containing a putative motif.
This method
of scoring implies that the existence of a motif in a
sequence is
enough to cause it to bind and does not try to integrate
the collective
probability of binding at each position over the entire
sequence.
Since HOMER uses an oligo table to record the number of
times each
oligo appears in the target or background groups, exact
mapping between
individual sequences and the oligos of a putative motif is
not
possible. Instead, HOMER can only calculate the
total number of
oligos corresponding to a given motif found in either the
target or
background sets. However, one can calculate the
expected number
of sequences likely to contain the motif by assuming each
oligo
occurrence was independently distributed in the set of
sequences using
the recursive relationship ( nsi = nsi-1 +
(Nset-nsi-1)/Nset; np1=1
where nsi is the expected number of sequences given the
presence of i
oligos in the set of Nset total sequence). Since the
recursive
relationship is not normally an integer, its value is
rounded to the
nearest integer for the calculation of the
hypergeometric.
While this scoring scheme differs from traditional
applications of the
hypergeometric, this approach can dramatically speed up
implementation
and drastically reduce memory requirements by removing
associations
between oligos and individual sequences.
HOMER searches for the most enriched motifs by first
performing an
exhaustive search of putative motifs centered on each
entry in the
oligo table. For each oligo, HOMER constructs a
putative motif
composed of the oligos with at most two mismatches from
the given
oligo. Each putative motif is scored by calculating
the total
number of oligos represented by the motif in both the
target and
background sets. These values are used to calculate
the modified
hypergeometric described above. A sequence tree is
used for fast
look up of oligos within a given number of mismatches from
the
consensus
oligo. While allowing mismatches, as opposed to
modeling specific
nucleotide variations, is not a very descriptive motif
model, optimal
motifs at the end of the algorithm are typically
represented by one or
more highly significant putative mismatch motifs.
More
complicated schemes such as searching the space of IUPAC
degenerate
symbols motifs may be performed as well, but our
experience has shown
that the simple mismatch model offers a good trade off
between running
time and sensitivity for regulatory motifs.
The most promising putative motifs identified by the
exhaustive search
are then refined using an iterative local optimization
algorithm.
HOMER uses promising putative motifs (seeds) to construct
probability
matrices and detection thresholds capable of specifying
groups of
oligos that are not only similar in sequence but also
collectively
highly enriched in the target sequence set and
correspondingly likely
to be bound by the same transcription factor. These
probability
matrices and thresholds will be the final output of the
algorithm. Prior to local optimization each
promising seed from
the exhaustive part of the algorithm is converted into a
probability
matrix that is representative of the consensus oligo with
small
arbitrary probabilities assigned to the non-consensus
nucleotides.
The local optimization phase of the algorithm can be
divided into two
parts. The first part finds the optimal detection
threshold,
which corresponds to the overall degeneracy of the motif,
and is
analogous to the number of mismatches allowed. Given
a
probability matrix, each oligo in the oligo table is
scored relative to
the probability matrix by multiplying the probability of
seeing each
nucleotide at each position of the oligo. Each oligo
is then
sorted based on this score such that the oligos most
similar to the
probability matrix are listed first. The optimal
detection
threshold is found by adding each oligo in the list to the
motif and
calculating the enrichment after adding each oligo.
With the
addition of each oligo the effective threshold is lowered,
and the
threshold that results in the most significant enrichment
is assigned
as the optimal threshold.
The second part of the local optimization phase refines
the probability
matrix to more accurately define oligos described by the
motif.
This part is performed by assessing the relative
enrichment contributed
by each oligo when included in the motif. For each
oligo with a
similarity score greater than the optimal threshold
calculated in part
one, the enrichment of the motif (not including the oligo)
is
calculated and the log differential from the full motif is
recorded. A new probability matrix is then formed by
adding each
oligo weighted by its relative contribution to the
enrichment.
Oligos that decrease the significance of the motif when
removed are
excluded from this calculation. The resulting matrix
is
normalized and the local optimization process repeated
until the
enrichment of the motif fails to improve. Since the
probability
matrix refinement step can be stuck in an unwanted local
optimum,
several new matrices corresponding to several thresholds
below the
optimal threshold are also used to effectively search for
highly
enriched oligos that miss the optimal threshold, giving
them an
opportunity to contribute to a new probability
matrix. Each of
the probability matrices are scored for their optimal
enrichment, and
only the most enriched matrix is used for subsequent
refinement.
Once the local optimization algorithm fails to improve the
p-value of
the motif, the motif is reported and the next seed is
optimized.
To avoid the tendency for seeds to converge on the same
motif, oligos
corresponding to the optimized motif are deleted from the
oligo table,
which is analogous to masking them from the primary
sequence and
repeating the procedure.
|