Software for motif discovery and ChIP-Seq analysis

File Formats

A list of the file formats used by HOMER, which may be helpful when troubleshooting problems.
Another good resource on file formats: UCSC Genome Browser File Formats

Peak/Positions files

These files specify genomic locations, similar to BED files.  They are tab-delimited text files with a minimum of 5 columns (additional columns are ignored).  They are 1-indexed and inclusive: the first nucleotide of a chromosome is referenced as position 1, and a line with a start of 100 and an end of 200 indicates a region of size 101.  The columns are as follows:

1. peak name (should be unique)
2. chromosome
3. starting position [integer] (1-indexed)
4. end position [integer]
5. strand [either 0/1 or +/-] (in HOMER strand of 0 is +, 1 is -)
6. Optional/ignored ...

Peak/Position files are very similar to BED files - to convert them use pos2bed.pl or bed2pos.pl.
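
As an illustration of the column layout and coordinate convention above, here is a minimal Python sketch that reads one peak/position line. The names Peak, parse_peak_line and peak_size are hypothetical helpers for this example and are not part of HOMER.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    name: str
    chrom: str
    start: int   # 1-indexed, inclusive
    end: int     # inclusive
    strand: str  # normalized to "+" or "-"


def parse_peak_line(line: str) -> Peak:
    """Parse one tab-delimited peak/position line; extra columns are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(fields)}")
    name, chrom, start, end, strand = fields[:5]
    # The strand may be given as 0/1 or +/-; a strand of 0 is +, 1 is -.
    strand = {"0": "+", "1": "-"}.get(strand, strand)
    if strand not in ("+", "-"):
        raise ValueError(f"unrecognized strand: {strand!r}")
    return Peak(name, chrom, int(start), int(end), strand)


def peak_size(peak: Peak) -> int:
    """Region size under 1-indexed, inclusive coordinates."""
    return peak.end - peak.start + 1
```

With this sketch, a line with a start of 100 and an end of 200 yields a size of 101, matching the description above.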

BED files

These are essentially the same as Peak/Position files, except that they have a stricter definition but greater portability.  They are also tab-delimited text files - the important difference is that they are 0-indexed, meaning the first nucleotide of the chromosome is referenced as position 0.

1. chromosome
2. starting position [integer] (0-indexed)
3. ending position [integer]
4. peak name
5. value (usually ignored)
6. strand [+/-]

BED files also come in a short form:

1. chromosome
2. starting position [integer] (0-indexed)
3. ending position [integer]
4. strand [+/-]

Peak/Position files are very similar to BED files - to convert them use pos2bed.pl or bed2pos.pl.
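
To make the indexing difference concrete, the following Python sketch converts a single line between the two layouts. It is not the implementation of pos2bed.pl or bed2pos.pl, and their output may differ in detail. It assumes the usual UCSC convention that the BED end coordinate is exclusive, so only the start shifts by one, and it writes a placeholder of 0 into the BED value column. The function names are hypothetical.

```python
def peak_to_bed(line: str) -> str:
    """Convert one peak/position line to a 6-column BED line (sketch)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("peak lines need at least 5 columns")
    name, chrom, start, end, strand = fields[:5]
    strand = {"0": "+", "1": "-"}.get(strand, strand)
    bed_start = int(start) - 1  # 1-indexed start -> 0-indexed start
    # Assumption: BED end is exclusive, so the inclusive 1-indexed end
    # carries over unchanged.
    return "\t".join([chrom, str(bed_start), end, name, "0", strand])


def bed_to_peak(line: str) -> str:
    """Convert one 6-column BED line to a peak/position line (sketch)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError("this sketch only handles 6-column BED lines")
    chrom, start, end, name, _value, strand = fields[:6]
    peak_start = int(start) + 1  # 0-indexed start -> 1-indexed start
    return "\t".join([name, chrom, str(peak_start), end, strand])
```

Under these assumptions, a peak spanning positions 100 to 200 becomes a BED interval with a start of 99 and an end of 200, and converting back restores the original coordinates.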

Motif files

These are files for specifying motifs, and are created by HOMER during motif discovery.  They are tab-delimited text files.  A more elaborate description of the format and how to tinker with it is here.  Basically, each motif within the file contains a header row starting with a ">", followed by several rows with 4 columns, specifying the probabilities of each nucleotide at each position.

>ASTTCCTCTT     1-ASTTCCTCTT    8.059752        -23791.535714   0       T:17311.0(44 ...
0.726   0.002   0.170   0.103
0.002   0.494   0.354   0.151
0.016   0.017   0.014   0.954
0.005   0.006   0.027   0.963
0.002   0.995   0.002   0.002
0.002   0.989   0.008   0.002
0.004   0.311   0.148   0.538
0.002   0.757   0.233   0.009
0.276   0.153   0.030   0.542
0.189   0.214   0.055   0.543

The first row starts with a ">" followed by various information, and the other rows are the position-specific probabilities for each nucleotide (A/C/G/T).  These values do not need to be between 0 and 1.  HOMER will automatically normalize whatever values are there, so integer counts are OK.  The header row is TAB-delimited and contains the following information:
  1. ">" + Consensus sequence (not actually used for anything, can be blank) example: >ASTTCCTCTT
  2. Motif name (should be unique if several motifs are in the same file) example: 1-ASTTCCTCTT  or NFkB
  3. Log odds detection threshold, used to determine bound vs. unbound sites (mandatory) example: 8.059752
  4. (optional) log P-value of enrichment, example: -23791.535714
  5. (optional) 0 (a placeholder for backward compatibility; it was used to describe "gapped" motifs in old versions, which turned out not to be very useful :)
  6. (optional) Occurrence information separated by commas, example: T:17311.0(44.36%),B:2181.5(5.80%),P:1e-10317
    1. T:#(%) - number of target sequences with motif, % of total targets
    2. B:#(%) - number of background sequences with motif, % of total background
    3. P:# - final enrichment p-value
  7. (optional) Motif statistics separated by commas, example: Tpos:100.7,Tstd:32.6,Bpos:100.1,Bstd:64.6,StrandBias:0.0,Multiplicity:1.13
    1. Tpos: average position of motif in target sequences (0 = start of sequences)
    2. Tstd: standard deviation of position in target sequences
    3. Bpos: average position of motif in background sequences (0 = start of sequences)
    4. Bstd: standard deviation of position in background sequences
    5. StrandBias: log ratio of + strand occurrences to - strand occurrences.
    6. Multiplicity: The average number of occurrences per sequence in sequences with 1 or more binding sites.
Only the first 3 columns are needed.  In fact, the rest of the columns are really just statistics from motif finding and aren't important when searching for instances of a motif.
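
The layout described above can be read with a few lines of Python. The sketch below keeps only the first columns of the header (plus the optional log P-value) and normalizes each matrix row so that it sums to 1, mirroring the statement that integer counts are acceptable. The Motif type and the function names are hypothetical, and this is not HOMER's own parser.

```python
from typing import List, NamedTuple, Optional


class Motif(NamedTuple):
    consensus: str           # header column 1, without the leading ">"
    name: str                # header column 2
    threshold: float         # header column 3, log odds detection threshold
    log_p: Optional[float]   # header column 4, if present
    matrix: List[List[float]]  # one [A, C, G, T] row per position


def normalize_row(values: List[float]) -> List[float]:
    """Scale a row of 4 non-negative values so that it sums to 1."""
    total = sum(values)
    if total <= 0:
        raise ValueError("matrix row must have a positive sum")
    return [v / total for v in values]


def parse_motifs(text: str) -> List[Motif]:
    """Parse the text of a motif file into a list of Motif records."""
    motifs: List[Motif] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(">"):
            fields = line[1:].split("\t")
            if len(fields) < 3:
                raise ValueError("header needs at least 3 tab-delimited columns")
            log_p = float(fields[3]) if len(fields) > 3 and fields[3] else None
            motifs.append(Motif(fields[0], fields[1], float(fields[2]), log_p, []))
        else:
            if not motifs:
                raise ValueError("matrix row found before any header row")
            values = [float(x) for x in line.split()]
            if len(values) != 4:
                raise ValueError("matrix rows need exactly 4 columns (A/C/G/T)")
            motifs[-1].matrix.append(normalize_row(values))
    return motifs
```

Because rows are normalized on load, a row such as 10, 20, 30, 40 is stored as 0.1, 0.2, 0.3, 0.4.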

The MOST IMPORTANT value is the 3rd column - this sets the detection threshold, which specifies whether a given sequence is enough of a "match" to be considered recognized by the motif.  More on that below.
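
To show how a detection threshold can be applied, here is a hedged Python sketch of log odds scoring. It assumes a uniform background of 0.25 per nucleotide, natural logarithms, a small floor on probabilities to avoid log(0), and that a score equal to the threshold counts as a match. These details are assumptions made for illustration and may not match HOMER's exact scoring; the function names are hypothetical.

```python
import math
from typing import List

BACKGROUND = 0.25   # assumed uniform background frequency
MIN_PROB = 0.001    # assumed floor so that log(0) never occurs
INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def log_odds_score(matrix: List[List[float]], seq: str) -> float:
    """Sum log(p / background) over positions for a sequence of motif length."""
    if len(seq) != len(matrix):
        raise ValueError("sequence length must equal motif length")
    score = 0.0
    for row, base in zip(matrix, seq.upper()):
        p = max(row[INDEX[base]], MIN_PROB)
        score += math.log(p / BACKGROUND)
    return score


def is_match(matrix: List[List[float]], seq: str, threshold: float) -> bool:
    """Call a site 'bound' when its score reaches the detection threshold."""
    return log_odds_score(matrix, seq) >= threshold
```

Under this scheme, raising the threshold in the 3rd header column makes the motif stricter, and lowering it admits weaker matches.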

Internal File Formats:

These are files that you normally won't modify or play with, but in case you're interested...

*.tags.tsv files

These are files used to store sequencing data in HOMER tag directories.  They are tab-delimited text files that are sorted to allow for relatively quick access and processing. 

1. blank (can be used for a name)
2. chromosome
3. position (1-indexed)
4. strand (0 or 1, +/- not allowed here)
5. Number of reads (can be fractional)
6. length of the read (optional)
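
For reference, a single line in this layout could be read with the Python sketch below. The Tag type and parse_tag_line function are hypothetical names for this example, and the sketch does not reproduce how HOMER itself reads tag directories.

```python
from typing import NamedTuple, Optional


class Tag(NamedTuple):
    name: str               # column 1, usually blank
    chrom: str
    position: int           # 1-indexed
    strand: int             # 0 (+) or 1 (-)
    count: float            # number of reads, may be fractional
    length: Optional[int]   # read length, if the optional column is present


def parse_tag_line(line: str) -> Tag:
    """Parse one tab-delimited line of a *.tags.tsv file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(fields)}")
    if fields[3] not in ("0", "1"):
        # +/- is not allowed in this format, only 0 or 1.
        raise ValueError(f"strand must be 0 or 1, got {fields[3]!r}")
    length = int(fields[5]) if len(fields) > 5 and fields[5] != "" else None
    return Tag(fields[0], fields[1], int(fields[2]), int(fields[3]),
               float(fields[4]), length)
```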
