Introduction to HOMER
The best way to learn about HOMER is to go through the tutorial pages. We've tried to spell out what happens in each step and explain the "why". A brief description of the motif finding component of HOMER is found below. Explanations of the sequencing analysis components of HOMER are integrated into the tutorials.
General Introduction to Motif Discovery with HOMER
HOMER is a collection of
tools
that are commonly needed for the
analysis of gene expression profiling (microarray) and
genome-wide
location analysis experiments (ChIP-Seq or
ChIP-Chip). There are
also routines for other types of sequencing experiments,
such as
DNase-Seq or GRO-Seq.
Some of the things HOMER does NOT do are finding differentially expressed genes (although it has some routines to help with this), clustering gene expression profiles, or searching for all the instances of Transfac motifs in order to make you hopelessly confused!!! The idea was not to completely reinvent the wheel if possible.
Unfortunately, HOMER must be run as a command-line tool, and may be difficult to use if you are new to UNIX. While commands have been distilled to be as simple and user-friendly as possible, basic knowledge of the UNIX environment and file system is critical (but can probably be learned quickly after typing “unix tutorial” into Google). I am proud to say that many of the people using HOMER are completely new to UNIX, so it is indeed possible. In addition, a spreadsheet program (e.g. Excel) is needed to graph and visualize some of the results produced by HOMER.
Below is a description of how motif analysis is executed with HOMER. Documentation describing the steps of analysis for Next-Gen Sequencing (genomic position analysis) or Microarrays (gene-based analysis) is covered in separate sections.
De Novo Motif Discovery Strategy
HOMER was designed as a de novo motif discovery algorithm that scores motifs by looking for motifs with differential enrichment between two sets of sequences. This means that HOMER uses two sets of sequences when performing motif finding: (1) target sequences of interest (e.g. promoters of genes that are co-regulated) and (2) a set of background sequences (e.g. promoters of genes that are not regulated). Without background sequences, a motif discovery algorithm must guess what sequences are expected to be found by chance, for example by assuming background sequences are a random collection of A, C, G, and T. This can be extremely dangerous since real genomic sequence is anything but random.
In practice HOMER will try to select the appropriate
background
sequences for you, but results can vary depending on what
is used as
background and certain applications may require careful
consideration
of these sequences. By default HOMER will use
confident,
non-regulated promoters as background when analyzing
promoters, and
sequences in the vicinity of genes for ChIP-Seq analysis (i.e. from -50 kb to +50 kb). In each case sequences are matched for their GC content to avoid bias from CpG islands.
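To make the idea of GC matching concrete, here is a minimal sketch in Python. It is only an illustration and NOT how HOMER itself selects or normalizes background sequences: the function names (gc_fraction, gc_bin, match_background_gc), the use of 10 GC bins, and the simple random sampling are all assumptions made for this example.

```python
import random
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_bin(seq, n_bins=10):
    """Assign a sequence to one of n_bins equally sized GC-content bins."""
    return min(int(gc_fraction(seq) * n_bins), n_bins - 1)


def match_background_gc(targets, candidates, n_bins=10, seed=0):
    """Pick background sequences so each GC bin holds (at most) as many
    background sequences as there are targets in that bin.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group candidate background sequences by GC bin.
    pool = defaultdict(list)
    for seq in candidates:
        pool[gc_bin(seq, n_bins)].append(seq)

    # Count how many target sequences fall into each bin.
    wanted = defaultdict(int)
    for seq in targets:
        wanted[gc_bin(seq, n_bins)] += 1

    # Sample from each bin without replacement.
    chosen = []
    for b, count in wanted.items():
        available = pool.get(b, [])
        chosen.extend(rng.sample(available, min(count, len(available))))
    return chosen


if __name__ == "__main__":
    targets = ["GGCCGGCC", "GCGCATGC"]
    candidates = ["ATATATAT", "GCGCGCGC", "GCATGCGC", "AATTAATT"]
    print(match_background_gc(targets, candidates))
```

The point of the sketch is that a GC-rich target set (e.g. CpG island promoters) gets compared against similarly GC-rich background, so that GC-rich motifs are not reported simply because the two sets differ in base composition.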
Once target and background
sequences are chosen, HOMER looks for motifs of a specific
length that
are over-represented in the target set relative to the
background
set. This enrichment is measured using the
cumulative
hypergeometric
distribution (or cumulative binomial distribution for
large data
sets), and places no requirement on the degeneracy
of the motif or the number of occurrences. Motifs
are found by
first exhaustively checking the enrichment of simple
motifs, then
refining promising candidates into accurate probability
matrices.
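As a rough illustration of the enrichment calculation, the sketch below computes a cumulative hypergeometric p-value in Python. It assumes that enrichment is scored by counting how many target and background sequences contain the motif at least once; the function name and arguments are invented for this example and do not correspond to HOMER's internal code.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_target, n_background, target_hits, background_hits):
    """Upper-tail cumulative hypergeometric p-value.

    Probability of seeing at least `target_hits` motif-containing sequences
    among the targets, if the target set were a random draw (without
    replacement) from the combined target + background pool.
    """
    total = n_target + n_background            # all sequences
    total_hits = target_hits + background_hits # sequences containing the motif
    denom = comb(total, n_target)
    upper = min(n_target, total_hits)
    tail = sum(
        comb(total_hits, k) * comb(total - total_hits, n_target - k)
        for k in range(target_hits, upper + 1)
    )
    return tail / denom


if __name__ == "__main__":
    # Made-up counts: 100 targets (30 with the motif) vs. 1000 background
    # sequences (50 with the motif).
    print(hypergeom_enrichment_pvalue(100, 1000, 30, 50))
```

Note that nothing in this calculation depends on how degenerate the motif is or how many times it occurs in total, which is why the score places no requirement on either. For large data sets the exact sum becomes expensive, which is one reason to fall back on the cumulative binomial distribution as an approximation.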
With v3.0 of HOMER, the motif discovery software has been
rewritten and
modernized (the homer2 executable). There is a
subtle, but very
important difference in how the new version of HOMER
performs de novo
motif analysis. The
original HOMER divided the input sequences into short
oligos to perform
the analysis, and once a motif was found, only the oligos
considered
"bound" by the motif were removed from the analysis.
The problem
was that several oligos representing "offsets" of the
original motif
(think GGAAGT vs. GAAGTg) were left for the 2nd round of
motif
enrichment to find, creating results that often contained
several versions of the original motif. The new
version revisits the input
sequences and removes all oligos that are slightly offset
from the optimal motifs, making it much more sensitive to
co-enriched motifs.
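The "offset" problem can be pictured with a small Python sketch. This is not the homer2 algorithm, just a toy check (with made-up function names and a made-up max_shift parameter) for whether one oligo is a shifted copy of another, using exact matching on the overlapping bases.

```python
def overlaps_at_shift(oligo, motif, shift):
    """True if oligo matches motif exactly on their overlap when the oligo
    is placed `shift` positions to the right of the motif start
    (negative shift = to the left)."""
    if shift >= 0:
        a, b = oligo, motif[shift:]
    else:
        a, b = oligo[-shift:], motif
    n = min(len(a), len(b))
    return n > 0 and a[:n] == b[:n]


def is_offset_of(oligo, motif, max_shift=2):
    """True if oligo is the motif itself or a slightly shifted version of it.

    Hypothetical helper for illustration only; real motifs are probability
    matrices, not exact strings.
    """
    oligo, motif = oligo.upper(), motif.upper()
    return any(
        overlaps_at_shift(oligo, motif, s)
        for s in range(-max_shift, max_shift + 1)
    )


if __name__ == "__main__":
    # The example from the text: GAAGTg is GGAAGT shifted by one base.
    print(is_offset_of("GAAGTg", "GGAAGT"))
    print(is_offset_of("TTTTTT", "GGAAGT"))
```

If only oligos that match the motif at shift 0 are removed, the shifted copies stay in the pool and get rediscovered in the next round, which is the behavior described for the original HOMER.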
Known Motif Discovery Strategy
The biggest problem when looking for “known” motifs is defining how degenerate you should allow them to be. To circumvent this problem, we loaded motifs derived from published ChIP-Seq experiments that were already optimized for degeneracy thresholds.
Interpretation of Motif Discovery Results
De Novo Results
Unfortunately, if you give HOMER random data, HOMER will find motifs, and they may look significant. Due to the finite amount of data and the many degrees of freedom in a motif probability matrix, it is easy to find a motif with a seemingly significant p-value. Because of this, we can only trust the most promising of motifs as likely to be real. For most promoter datasets, motifs with a p-value larger (i.e. less significant) than 1e-10 or even 1e-12 are likely to be false positives. In general the p-value cutoff should be estimated by randomizing data labels and running the algorithm several times. In practice you should start ignoring results that are either less significant than 1e-10, or that start becoming very different from one another (in terms of sequence) yet have similar p-values. In addition, high quality motifs usually appear multiple times in the list with different offsets (e.g. nnnTGACTCAnn and nTGACTCAnnnn). HOMER attempts to remove extremely similar motifs, but different offsets of motifs are likely to be present if the signal is strong (remember that motifs may appear as if on the negative strand).
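The label randomization mentioned above can be sketched as follows. This Python snippet only produces the randomized target/background splits; running motif discovery on each split is left to you. The function name, the number of rounds, and the fixed seed are assumptions made for the example, not part of HOMER.

```python
import random


def shuffled_label_sets(targets, background, n_rounds=5, seed=0):
    """Yield (fake_targets, fake_background) pairs with randomized labels.

    The pooled sequences are shuffled and split so that each fake target
    set has the same size as the real one. Hypothetical helper for
    illustration only.
    """
    rng = random.Random(seed)
    combined = list(targets) + list(background)
    n_target = len(targets)
    for _ in range(n_rounds):
        shuffled = combined[:]
        rng.shuffle(shuffled)
        yield shuffled[:n_target], shuffled[n_target:]


if __name__ == "__main__":
    real_targets = ["geneA", "geneB"]
    real_background = ["geneC", "geneD", "geneE"]
    for fake_targets, fake_background in shuffled_label_sets(real_targets, real_background):
        # Run motif discovery on each randomized split and record the best
        # p-value it reports; real results should clearly beat these.
        print(fake_targets, fake_background)
```

The best p-values obtained from the randomized splits give a feel for what "significant by chance" looks like for your data set size, and motifs from the real analysis that do not clearly beat them should be treated with suspicion.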
Matching De Novo to Known Motifs
HOMER makes every attempt to tell you if the motifs it discovered resemble a known motif. The difficulty of interpreting these results SHOULD NOT BE UNDERESTIMATED!!! Consider the following:
- Databases of known motifs are a mixture of accurate
and
inaccurate motifs
- Databases of known motifs are not complete
- The literature (especially motif finding papers) is
full of
inaccurate assessments and motif annotations that are
ludicrous.
HOMER tries to find the known motifs with the best correlation between the known motif and the de novo motif. It then aligns the motifs from the top hits so that you can see them and judge the alignment for yourself. The top known motif match is not always the best match. The top match is not always annotated correctly. If you feel something is worth pursuing, look up the known binding sites of the transcription factor via PubMed.
Feedback I got when writing the program was to provide the name of the motif in the main result table. This was promptly followed by the misinterpretation of results because people are too lazy to look at the alignment to figure out if it makes any sense. These results do not write the paper for you: critical thinking and follow-up are required.
Additional Reading: Tips for de novo motif finding
Known Motif Enrichment
First and most important:
There
is a subtle but IMPORTANT difference between looking for
motifs de novo
and looking for known motif enrichment. De novo
motif discovery
allows you to directly query the sequence to discover
which motifs are
the MOST enriched sequences in your target set.
Known motif
discovery will simply tell you which of the known motifs
is most
enriched in your target set.
This may not seem important, but consider the following scenario: You have a set of random GA-rich sequences and compare them to random genomic sequences. De novo motif finding will likely return a G/A-rich matrix that doesn’t look anything like a transcription factor motif. Known motif finding will return astonishingly significant p-values for motifs like PU.1 (GAGGAAGT) and ISRE (GAAACTGAAA), simply because those motifs happen to be GA-rich. Because of this, de novo motif finding results are much more trustworthy.
The greatest advantage to using known motifs is found when you have a limited set of target sequences. The less data that is available, or the weaker the true signal, the more difficult it is for de novo motif finding to accurately define a signal that is significant. Known motifs have the advantage of far fewer degrees of freedom and in many cases find the correct motifs when the enrichment is less significant than the 1e-10 threshold for reliability used when considering de novo results.
A more detailed description of the motif finding procedure
is available
in the Motif Finding Tutorial.