File Formats
The file formats used by HOMER are listed below - this might be helpful when
troubleshooting problems.
Another good resource on file formats: UCSC Genome Browser File Formats
Peak/Positions files
These files specify genomic locations, similar to BED files. They are
tab-delimited text files with a minimum of 5 columns
(additional columns are ignored). They are 1-indexed
and inclusive: the first nucleotide of a chromosome is referenced as
position 1, and a line with a start of 100 and an end of 200 indicates
a region of size 101. The columns are as follows:
1. peak name (should be unique)
2. chromosome
3. starting position [integer] (1-indexed)
4. end position [integer]
5. strand [either 0/1 or +/-] (in HOMER a strand of 0 is +, 1 is -)
6. Optional/ignored ...
...
Peak/Position files are very
similar to BED files - to convert them use pos2bed.pl or bed2pos.pl.
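As a rough illustration of the column layout and the 1-indexed, inclusive coordinates described above, here is a minimal Python sketch that parses one line of a Peak/Position file and computes the region size. The names `Peak`, `parse_peak_line` and `region_size` are made up for this example and are not part of HOMER.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    """One line of a Peak/Position file (hypothetical helper type)."""
    name: str
    chrom: str
    start: int   # 1-indexed, inclusive
    end: int     # inclusive
    strand: str  # normalized to "+" or "-"


# The format allows either 0/1 or +/- for the strand; 0 means +, 1 means -.
STRAND_MAP = {"0": "+", "1": "-", "+": "+", "-": "-"}


def parse_peak_line(line: str) -> Peak:
    """Parse the first 5 tab-delimited columns; extra columns are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(fields)}")
    name, chrom, start, end, strand = fields[:5]
    if strand not in STRAND_MAP:
        raise ValueError(f"unrecognized strand: {strand!r}")
    return Peak(name, chrom, int(start), int(end), STRAND_MAP[strand])


def region_size(peak: Peak) -> int:
    """Both ends are included, so the size is end - start + 1."""
    return peak.end - peak.start + 1


peak = parse_peak_line("peak1\tchr1\t100\t200\t0\tignored-extra-column")
print(peak, region_size(peak))
```

With a start of 100 and an end of 200, `region_size` gives 101, matching the description above.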
BED files
These are essentially the same
as Peak/Position files, except that they have a stricter definition
but greater portability. They are also tab-delimited
text files - the important difference is that they are
0-indexed, meaning the first nucleotide of the chromosome
is referenced as position 0.
1. chromosome
2. starting position [integer] (0-indexed)
3. ending position [integer]
4. peak name
5. value (usually ignored)
6. strand [+/-]
BED files also come in a short
form:
1. chromosome
2. starting position [integer] (0-indexed)
3. ending position [integer]
4. strand [+/-]
Peak/Position files are very
similar to BED files - to convert them use pos2bed.pl or bed2pos.pl.
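To make the difference between the two coordinate systems concrete, here is a minimal Python sketch of the conversion in both directions. It is not the implementation of pos2bed.pl or bed2pos.pl. It assumes the standard UCSC convention that BED end positions are exclusive (so only the start shifts by 1), which is not spelled out in this section, and it writes a placeholder of 0 in the BED value column. The function names `peak_to_bed` and `bed_to_peak` are hypothetical.

```python
def peak_to_bed(peak_line: str) -> str:
    """Convert one Peak/Position line to a 6-column BED line (sketch)."""
    name, chrom, start, end, strand = peak_line.rstrip("\n").split("\t")[:5]
    # Peak files allow 0/1 for strand; BED expects +/-.
    strand = {"0": "+", "1": "-"}.get(strand, strand)
    # 1-indexed inclusive start -> 0-indexed start; the end value is unchanged
    # because BED ends are assumed to be exclusive.
    bed_start = int(start) - 1
    return "\t".join([chrom, str(bed_start), end, name, "0", strand])


def bed_to_peak(bed_line: str) -> str:
    """Convert one 6-column BED line to a Peak/Position line (sketch)."""
    chrom, start, end, name, _value, strand = bed_line.rstrip("\n").split("\t")[:6]
    # 0-indexed start -> 1-indexed start; the end value is unchanged.
    return "\t".join([name, chrom, str(int(start) + 1), end, strand])


bed = peak_to_bed("peak1\tchr1\t100\t200\t1")
print(bed)
print(bed_to_peak(bed))
```

Under that assumption, a round trip returns the original coordinates, with the strand written as +/- rather than 0/1.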
Motif files
These are files for
specifying motifs, and are created by HOMER during motif
discovery. They are tab-delimited text files.
A more elaborate description of the format and how to
tinker with it is here.
Basically, each motif within the file contains a header
row starting with a ">", followed by several rows with
4 columns, specifying the probabilities of each nucleotide
at each position.
>ASTTCCTCTT	1-ASTTCCTCTT	8.059752	-23791.535714	0	T:17311.0(44 ...
0.726	0.002	0.170	0.103
0.002	0.494	0.354	0.151
0.016	0.017	0.014	0.954
0.005	0.006	0.027	0.963
0.002	0.995	0.002	0.002
0.002	0.989	0.008	0.002
0.004	0.311	0.148	0.538
0.002	0.757	0.233	0.009
0.276	0.153	0.030	0.542
0.189	0.214	0.055	0.543
The first row starts with a ">" followed by various
information, and the other rows are the position-specific
probabilities for each nucleotide (A/C/G/T). These
values do not need to be between 0 and 1. HOMER will
automatically normalize whatever values are there, so
integer counts are OK. The header row is
TAB-delimited and contains the following information:
- ">" + consensus sequence (not actually used for anything, can be
  blank). Example: >ASTTCCTCTT
- Motif name (should be unique if several motifs are in the same file).
  Example: 1-ASTTCCTCTT or NFkB
- Log odds detection threshold, used to determine bound vs. unbound
  sites (mandatory). Example: 8.059752
- (optional) Log P-value of enrichment. Example: -23791.535714
- (optional) 0 (a placeholder for backward compatibility; it was used to
  describe "gapped" motifs in old versions, which turned out not to be
  very useful)
- (optional) Occurrence information separated by commas. Example:
  T:17311.0(44.36%),B:2181.5(5.80%),P:1e-10317
  - T:#(%) - number of target sequences with motif, % of total targets
  - B:#(%) - number of background sequences with motif, % of total
    background
  - P:# - final enrichment p-value
- (optional) Motif statistics separated by commas. Example:
  Tpos:100.7,Tstd:32.6,Bpos:100.1,Bstd:64.6,StrandBias:0.0,Multiplicity:1.13
  - Tpos: average position of motif in target sequences
    (0 = start of sequences)
  - Tstd: standard deviation of position in target sequences
  - Bpos: average position of motif in background sequences
    (0 = start of sequences)
  - Bstd: standard deviation of position in background sequences
  - StrandBias: log ratio of + strand occurrences to - strand
    occurrences
  - Multiplicity: the average number of occurrences per sequence in
    sequences with 1 or more binding sites
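The header layout and the row normalization described above can be sketched in a few lines of Python. This is only an illustration of the format, not HOMER's own parser: the function name `parse_motifs` and the dictionary keys are invented for the example, and dividing each row by its sum is an assumption about what "normalize" means here.

```python
from typing import Dict, List


def parse_motifs(text: str) -> List[Dict]:
    """Parse motif-file text into a list of dicts (hypothetical structure)."""
    motifs: List[Dict] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line.split("\t")
            if len(fields) < 3:
                raise ValueError("header needs consensus, name and threshold")
            motifs.append({
                "consensus": fields[0][1:],      # text after ">" (may be blank)
                "name": fields[1],
                "threshold": float(fields[2]),   # log odds detection threshold
                "optional": fields[3:],          # statistics, kept as raw text
                "matrix": [],
            })
        else:
            if not motifs:
                raise ValueError("matrix row found before any header row")
            values = [float(v) for v in line.split()]
            if len(values) != 4:
                raise ValueError("each matrix row needs 4 values (A/C/G/T)")
            total = sum(values)
            if total <= 0:
                raise ValueError("matrix row must have a positive sum")
            # Normalize so the row sums to 1; integer counts work as well.
            motifs[-1]["matrix"].append([v / total for v in values])
    return motifs


example = ">ACGT\tdemo\t5.0\n10\t0\t30\t60\n0.25\t0.25\t0.25\t0.25\n"
for motif in parse_motifs(example):
    print(motif["name"], motif["threshold"], motif["matrix"])
```

The first matrix row in the example is given as counts and the second as probabilities; both end up as rows that sum to 1.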
Only the first 3 columns of the header row are
needed. In fact, the rest of the columns are really
just statistics from motif finding and aren't important
when searching for instances of a motif.
The MOST IMPORTANT value is the 3rd column - this sets the
detection threshold, which specifies whether a given
sequence is enough of a "match" to be considered
recognized by the motif. More on that below.
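To show how a detection threshold can separate matches from non-matches, here is a hedged Python sketch of log odds scoring. The details are assumptions rather than something stated in this section: each position contributes the natural log of (probability / 0.25), i.e. a uniform background; probabilities are floored at 0.001 to avoid taking the log of 0; a score at or above the threshold counts as a match; and only the + strand is scanned. The names `log_odds_score` and `find_matches` are hypothetical.

```python
import math
from typing import List, Tuple

BACKGROUND = 0.25   # assumed uniform background frequency per nucleotide
MIN_PROB = 0.001    # assumed floor so that log() never sees 0
INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def log_odds_score(matrix: List[List[float]], site: str) -> float:
    """Sum of log(p / background) over the positions of one candidate site."""
    if len(site) != len(matrix):
        raise ValueError("site length must equal motif length")
    score = 0.0
    for row, base in zip(matrix, site.upper()):
        p = max(row[INDEX[base]], MIN_PROB)
        score += math.log(p / BACKGROUND)
    return score


def find_matches(matrix: List[List[float]], threshold: float,
                 sequence: str) -> List[Tuple[int, float]]:
    """Return (0-based offset, score) for + strand windows at or above threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width].upper()
        if any(base not in INDEX for base in window):
            continue  # skip windows containing N or other symbols
        score = log_odds_score(matrix, window)
        if score >= threshold:
            hits.append((i, score))
    return hits


# A toy 3-position motif that strongly prefers "ACG".
toy_matrix = [
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
]
hits = find_matches(toy_matrix, 3.0, "TTACGTT")
print(hits)
```

Raising the threshold makes the motif stricter and lowering it admits weaker sites, which is why the 3rd header column matters so much when searching for motif instances.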
Internal File Formats:
These are files that you normally won't modify or play with,
but in case you're interested...
*.tags.tsv files
These are files used to
store sequencing data in HOMER tag directories. They
are tab-delimited text files that are sorted to allow for
relatively quick access and processing.
1. blank (can be used for a name)
2. chromosome
3. position (1-indexed)
4. strand (0 or 1, +/- not allowed here)
5. Number of reads (can be fractional)
6. length of the read (optional)
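For anyone who wants to inspect a *.tags.tsv file, the column list above maps onto a small Python reader like the following. This is a sketch for illustration, not code from HOMER; the names `Tag` and `parse_tag_line` are invented here.

```python
from typing import NamedTuple, Optional


class Tag(NamedTuple):
    """One line of a *.tags.tsv file (hypothetical helper type)."""
    name: str               # usually blank
    chrom: str
    position: int           # 1-indexed
    strand: int             # 0 (+) or 1 (-); "+" and "-" are not allowed
    count: float            # number of reads, can be fractional
    length: Optional[int]   # read length, None when the column is absent


def parse_tag_line(line: str) -> Tag:
    """Parse one tab-delimited line following the column order listed above."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(fields)}")
    name, chrom, position, strand, count = fields[:5]
    if strand not in ("0", "1"):
        raise ValueError(f"strand must be 0 or 1, got {strand!r}")
    length = int(fields[5]) if len(fields) > 5 and fields[5] != "" else None
    return Tag(name, chrom, int(position), int(strand), float(count), length)


tag = parse_tag_line("\tchr1\t1000\t1\t2.5\t36")
print(tag)
```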