|
Creating Custom Motif Matrices
A common task when doing regulatory element analysis is to
scan for a specific sequence/motif - but not necessarily
with one that is found using motif finding. Often you
want to find a very specific sequence, or you want to load
your own motif matrix derived from another source.
This page will help explain how to get HOMER to play nice
with your custom motifs.
Below is a description of the HOMER *.motif format for
specifying motifs, as well as some tricks & tips on
creating simple motif files.
*.motif format files
HOMER works exclusively with
homer motif formatted files - so if you want to find a
sequence or motif with HOMER, you must first create a
"motif" file. A typical
motif file will look something like:
>ASTTCCTCTT
1-ASTTCCTCTT
8.059752
-23791.535714
0
T:17311.0(44 ...
0.726 0.002 0.170
0.103
0.002 0.494 0.354
0.151
0.016 0.017 0.014
0.954
0.005 0.006 0.027
0.963
0.002 0.995 0.002
0.002
0.002 0.989 0.008
0.002
0.004 0.311 0.148
0.538
0.002 0.757 0.233
0.009
0.276 0.153 0.030
0.542
0.189 0.214 0.055
0.543
The first row starts with a ">" followed by various
information, and
the other rows are the positions specific probabilities
for each
nucleotide (A/C/G/T). These values do not need to be
between 0-1. HOMER will automatically normalize
whatever values are there, so interger counts are
ok. The header row is actually TAB delimited,
and contains the following information:
- ">" + Consensus
sequence (not actually used for
anything, can be blank) example: >ASTTCCTCTT
- Motif name (should be
unique if several motifs are in the
same file) example: 1-ASTTCCTCTT or NFkB
- Log odds detection
threshold, used to determine bound vs.
unbound sites (mandatory)
example:
8.059752
- (optional) log P-value of enrichment, example:
-23791.535714
- (optional) 0 (A place holder for backward
compatibility, used to
describe "gapped" motifs in old version, turns out it
wasn't very
useful :)
- (optional) Occurence Information separated by
commas, example:
T:17311.0(44.36%),B:2181.5(5.80%),P:1e-10317
- T:#(%) - number of target sequences with motif, %
of
total of total targets
- B:#(%) - number of background sequences with
motif, % of
total background
- P:# - final enrichment p-value
- (optional) Motif statistics separated by commas,
example:
Tpos:100.7,Tstd:32.6,Bpos:100.1,Bstd:64.6,StrandBias:0.0,Multiplicity:1.13
- Tpos: average position of motif in target
sequences (0 =
start of sequences)
- Tstd: standard deviation of position in target
sequences
- Bpos: average position of motif in background
sequences
(0 = start of sequences)
- Bstd: standard deviation of position in background
sequences
- StrandBias: log ratio of + strand occurrences to -
strand
occurrences.
- Multiplicity: The averge number of occurrences per
sequence in sequences with 1 or more binding site.
Only the first 3 columns are
needed. In fact, the rest of the columns are really
just statistics from motif finding and aren't important
when searching for instances of a motif.
The MOST IMPORTANT value is the 3rd column - this sets the
detection threshold, which specifies whether a given
sequence is enough of a "match" to be considered
recognized by the motif. More on that below.
Creating Simple Motifs with seq2profile.pl
HOMER comes with a handly
little tool called seq2profile.pl.
This program automates the creation of motifs from
consensus sequences from letters ACGTN.
seq2profile.pl <consensus> [# mismatches]
[name] > output.motif
i.e. seq2profile.pl
GGAAGT 0 ets > output.motif
Output from this example looks like this:
>GGAAGT
ets 8.28973911259755
0.001 0.001 0.997
0.001
0.001 0.001 0.997
0.001
0.997 0.001 0.001
0.001
0.997 0.001 0.001
0.001
0.001 0.001 0.997
0.001
0.001 0.001 0.001
0.997
This script will automatically set the 3rd column to a
threshold to only detect perfect matches. Using this
file with tools like annotatePeaks.pl will only find
sequences matching GGAAGT (or ACTTCC on the reverse
strand) If we want to allow up to one mismatch:
i.e. seq2profile.pl
GGAAGT 1
ets > output.motif
Output from this example looks like this:
>GGAAGT
ets 1.38498834263571
0.001 0.001 0.997
0.001
0.001 0.001 0.997
0.001
0.997 0.001 0.001
0.001
0.997 0.001 0.001
0.001
0.001 0.001 0.997
0.001
0.001 0.001 0.001
0.997
The detection theshold now changed to allow up to one
mismatch. This will now identify GGAAGT, GGAAGA,
GGAAGC, GGAAGG, ... ettc.
Creating motif files manually
You can also create a motif
using a text editor or Excel. The first line should
contain 3 tab-separated columns with ">whatever",
"name", and "threshold", followed by motif matrix.
You can also start from an existing motif or use seq2profile.pl to
start a motif for you.
Once you get the probabilities the way you want it (i.e.
maybe you copied them from JASPAR, or from in vitro
affinity selection, or a set of binding sites, or
whatever), you need to pick the correct detection
threshold. During motif discovery, HOMER optimizes
this threshold as part of the algorithm. However,
since you're making up your own motif, you need to figure
out how degenerate you want your motif to be. This
is easily the biggest hang-up for people new to the
concept of probability matrices and motif finding.
Motif Scanning
In order to select the
proper detection threshold, it helps to understand how
motifs are "scored" by HOMER. Any given sequence
can be scored using the probability matrix. HOMER
calculates the score by adding the "log-odds
probabilities" for the nucleotide found in each
position.
Score for GGATGT
score = log(pG1/0.25) + log(pG2/0.25)
+ log(pA3/0.25) + log(pT4/0.25)
+ log(pG5/0.25) + log(pT6/0.25)
If the nucleotide in the given position has a high
probability in the motif matrix (i.e. 0.9), a positive
number is added to the score. If the nucleotide as
a neutral probility (i.e. 0.25), then the score is
unchanged, and if a nucleotide is disfavored in the
probability matrix (i.e. 0.001), then a negative number
is added to the score.
Homer fixes the log-odds expectation at 0.25 for each
nucleotide. If an organism has a strong imbalance
in nucleotide composition, this causes HOMER sacrifices
some sensitivity by keeping the expectation at 0.25
instead of adjusting this to reflect the general
sequence composition. However, the huge upside to
fixing this at 0.25 is that you can use the motif with
HOMER on any sequence in any context an always get
recognize the same sequences, which makes life much
easier.
If the score is above the threshold (from the 3rd column
of the motif file), HOMER will identify the sequence as
"recognized" by the motif. If the score is lower,
the sequence will be ignored. Most motif
thresholds are around 5.0-10.0
To select the correct threshold, you may need to "guess
and check" you results to ensure your motif is recognizing
the correct sequences. Usually, you can start with a
threshold around 5-10. For example, run findMotifs.pl/ findMotifsGenome.pl
with " -find
motifFile.motif". The score of each motif
is found in the 6th column of the output. Looking
through the file and the sequenes that were identified
will give you a better idea of what to set the score at.
|