Software for motif discovery and next-gen sequencing analysis

Creating Custom Motif Matrices

A common task when doing regulatory element analysis is to scan for a specific sequence/motif - but not necessarily with one that is found using motif finding.  Often you want to find a very specific sequence, or you want to load your own motif matrix derived from another source.  This page will help explain how to get HOMER to play nice with your custom motifs.

Below is a description of the HOMER *.motif format for specifying motifs, as well as some tricks & tips on creating simple motif files.

*.motif format files

HOMER works exclusively with homer motif formatted files - so if you want to find a sequence or motif with HOMER, you must first create a "motif" file.  A typical motif file will look something like:

>ASTTCCTCTT     1-ASTTCCTCTT    8.059752        -23791.535714   0       T:17311.0(44 ...
0.726   0.002   0.170   0.103
0.002   0.494   0.354   0.151
0.016   0.017   0.014   0.954
0.005   0.006   0.027   0.963
0.002   0.995   0.002   0.002
0.002   0.989   0.008   0.002
0.004   0.311   0.148   0.538
0.002   0.757   0.233   0.009
0.276   0.153   0.030   0.542
0.189   0.214   0.055   0.543

The first row starts with a ">" followed by various information, and the other rows are the positions specific probabilities for each nucleotide (A/C/G/T).  These values do not need to be between 0-1.  HOMER will automatically normalize whatever values are there, so interger counts are ok.  The header row is actually TAB delimited, and contains the following information:
  1. ">" + Consensus sequence (not actually used for anything, can be blank) example: >ASTTCCTCTT
  2. Motif name (should be unique if several motifs are in the same file) example: 1-ASTTCCTCTT  or NFkB
  3. Log odds detection threshold, used to determine bound vs. unbound sites (mandatory) example: 8.059752
  4. (optional) log P-value of enrichment, example: -23791.535714
  5. (optional) 0 (A place holder for backward compatibility, used to describe "gapped" motifs in old version, turns out it wasn't very useful :)
  6. (optional) Occurence Information separated by commas, example: T:17311.0(44.36%),B:2181.5(5.80%),P:1e-10317
    1. T:#(%) - number of target sequences with motif, % of total of total targets
    2. B:#(%) - number of background sequences with motif, % of total background
    3. P:# - final enrichment p-value
  7. (optional) Motif statistics separated by commas, example: Tpos:100.7,Tstd:32.6,Bpos:100.1,Bstd:64.6,StrandBias:0.0,Multiplicity:1.13
    1. Tpos: average position of motif in target sequences (0 = start of sequences)
    2. Tstd: standard deviation of position in target sequences
    3. Bpos: average position of motif in background sequences (0 = start of sequences)
    4. Bstd: standard deviation of position in background sequences
    5. StrandBias: log ratio of + strand occurrences to - strand occurrences.
    6. Multiplicity: The averge number of occurrences per sequence in sequences with 1 or more binding site.
Only the first 3 columns are needed.  In fact, the rest of the columns are really just statistics from motif finding and aren't important when searching for instances of a motif.

The MOST IMPORTANT value is the 3rd column - this sets the detection threshold, which specifies whether a given sequence is enough of a "match" to be considered recognized by the motif.  More on that below.

Creating Simple Motifs with seq2profile.pl

HOMER comes with a handly little tool called seq2profile.pl.  This program automates the creation of motifs from consensus sequences from letters ACGTN.

seq2profile.pl <consensus> [# mismatches] [name] > output.motif
i.e. seq2profile.pl GGAAGT 0 ets > output.motif

Output from this example looks like this:

>GGAAGT ets     8.28973911259755
0.001   0.001   0.997   0.001
0.001   0.001   0.997   0.001
0.997   0.001   0.001   0.001
0.997   0.001   0.001   0.001
0.001   0.001   0.997   0.001
0.001   0.001   0.001   0.997

This script will automatically set the 3rd column to a threshold to only detect perfect matches.  Using this file with tools like annotatePeaks.pl will only find sequences matching GGAAGT (or ACTTCC on the reverse strand)  If we want to allow up to one mismatch:

i.e. seq2profile.pl GGAAGT 1 ets > output.motif

Output from this example looks like this:

>GGAAGT ets     1.38498834263571
0.001   0.001   0.997   0.001
0.001   0.001   0.997   0.001
0.997   0.001   0.001   0.001
0.997   0.001   0.001   0.001
0.001   0.001   0.997   0.001
0.001   0.001   0.001   0.997

The detection theshold now changed to allow up to one mismatch.  This will now identify GGAAGT, GGAAGA, GGAAGC, GGAAGG, ... ettc.

Creating motif files manually

You can also create a motif using a text editor or Excel.  The first line should contain 3 tab-separated columns with ">whatever", "name", and "threshold", followed by motif matrix.  You can also start from an existing motif or use seq2profile.pl to start a motif for you.

Once you get the probabilities the way you want it (i.e. maybe you copied them from JASPAR, or from in vitro affinity selection, or a set of binding sites, or whatever), you need to pick the correct detection threshold.  During motif discovery, HOMER optimizes this threshold as part of the algorithm.  However, since you're making up your own motif, you need to figure out how degenerate you want your motif to be.  This is easily the biggest hang-up for people new to the concept of probability matrices and motif finding.

Motif Scanning

In order to select the proper detection threshold, it helps to understand how motifs are "scored" by HOMER.  Any given sequence can be scored using the probability matrix.  HOMER calculates the score by adding the "log-odds probabilities" for the nucleotide found in each position.

Score for GGATGT

score = log(pG1/0.25) + log(pG2/0.25) + log(pA3/0.25) + log(pT4/0.25) + log(pG5/0.25) + log(pT6/0.25)

If the nucleotide in the given position has a high probability in the motif matrix (i.e. 0.9), a positive number is added to the score.  If the nucleotide as a neutral probility (i.e. 0.25), then the score is unchanged, and if a nucleotide is disfavored in the probability matrix (i.e. 0.001), then a negative number is added to the score.

Homer fixes the log-odds expectation at 0.25 for each nucleotide.  If an organism has a strong imbalance in nucleotide composition, this causes HOMER sacrifices some sensitivity by keeping the expectation at 0.25 instead of adjusting this to reflect the general sequence composition.  However, the huge upside to fixing this at 0.25 is that you can use the motif with HOMER on any sequence in any context an always get recognize the same sequences, which makes life much easier.

If the score is above the threshold (from the 3rd column of the motif file), HOMER will identify the sequence as "recognized" by the motif.  If the score is lower, the sequence will be ignored.  Most motif thresholds are around 5.0-10.0

To select the correct threshold, you may need to "guess and check" you results to ensure your motif is recognizing the correct sequences.  Usually, you can start with a threshold around 5-10.  For example, run findMotifs.pl/findMotifsGenome.pl with "-find motifFile.motif".  The score of each motif is found in the 6th column of the output.  Looking through the file and the sequenes that were identified will give you a better idea of what to set the score at.

Can't figure something out? Questions, comments, concerns, or other feedback: