|
Selecting Background Sequences (homer2 background)
Traditionally, HOMER has selected background
sequences semi-randomly from the genome, making sure that
the per-region GC% distribution in the background sequences
matched that of the target sequences, explained here.
HOMER2 now offers much greater control of the background
sequence, including routines that help model
position-specific features of background sequence.
The new background features are performed using a new
subprogram called "homer2 background". This tool
allows HOMER to randomly select from the genome or generate
synthetic sequences that match positional k-mer
content among other things. This means that if your DNA
sequences are anchored on a feature of interest, HOMER will
attempt to account for lower-order (i.e. dinucleotide)
positional sequence bias that might be present in your
sequences. For example, if you have sequences centered on
transcription start sites (TSS), certain nucleotides are
likely to occur at certain positions within your sequences
(such as 'CA' near the TSS), which may artificially increase
the chance a given DNA motif will be recognized at that
position. New background selection methods help address this
concern, and provide tools to score the relative
enrichment/depletion of motifs as a function of the motif's
position in such sequences.
The "homer2 background" program is used by two other
utilities, findMotifsGenome.pl and createHomer2EnrichmentTable.pl.
However, if you want full control over how HOMER selects
background sequences, you can use this tool directly.
Furthermore, if you would like to use HOMER to generate
background sequences for other applications or for use with
other tools and analysis, you may want to use "homer2
background" directly. Ultimately, the output
sequences can then be use as controls for DNA motif finding,
enrichment calculations, machine learning applications, or
other creative uses.
One important note that applies to all of these
approaches - the sequences you analyze must all have the SAME
LENGTH. The assumption for positional
enrichment is that all of your sequences are anchored on
some position or feature of interest such that there is a
relationship between nucleotide 1, 2, 3, etc. across all
of the input sequences. There are no exceptions to
this - if you do have variable length sequences, you [the
user] are in charge of deciding how to trim (or pad them
with N's) to make them the same length.
homer2 background
The basic idea is to select background sequences
with similar position-specific sequence properties relative
to your input/target sequences. Traditionally, HOMER only
normalizes for GC content on a per-sequence basis, which
helps control for the presence of sequences with extreme
levels of GC dinucleotides which are typically found in CpG
Islands near vertebrate regulatory regions (e.g. promoters).
With HOMER2, not only will HOMER control for the overall
per-sequence GC%, it will also attempt to select (or model
sequences) that contain similar lower-order nucleotide
content (i.e. mononuclotides, dinucleotides, trinucleotides,
etc.) at each position.
One thing to keep in mind is that this program was designed
with regulatory elements in mind. It is meant to work with
collections of sequences that are ~50 to 500 bp in length
comprised of ~5000 to 500,000 total sequences. Although it
can work in other scenarios too, it's behavior and
performance may start to falter.
Overview of background selection:
Key inputs and parameters:
- Input sequences (either genomic positions or a FASTA
file, MUST be the same length)
- Genome or other FASTA file to select background
sequences from (technically optional as HOMER2 can
create synthetic background sequences if desired).
- k = length of k-mer at each position to match sequence
properties with (k=1 nucleotides, k=2 dinucleotides, k=3
trinucleotides, etc.)
- Position-dependent or position-independent analysis of
k-mer content (i.e. can also normalize k-mer content
regardless of position)
- Select actual background sequences from a file or
generate synthetic sequences.
HOMER2 then performs the following steps:
Following the diagram above, (1) HOMER will first
segregate your target sequences into 10 equal sized ‘bins’
based on their overall GC content (controlled by the -nbin
<#> parameter). This helps yield background
sequences with similar overall properties to the target
sequences as vertebrate regulatory elements often fall along
a spectrum of different overall GC content levels.
Otherwise, it is likely the subsequent sequence selection or
modeling steps would tend to select background sequences
resembling the average GC content of the target set (instead
of a mix of extremes), providing a poor set of control
sequences. Once it determines the overall GC content of the
target sequences, the program will read through the genome
(i.e. all possible background sequences) and segregate them
into the appropriate GC-bin for background selection at the
next step. Since large genomes will cause this assignment to
eat up a lot of memory, the relative number of sequences
considered at this step can be controlled using the
parameter -NN <#> (Default 10 million). The
program is ‘smart’ in the sense that when randomly selecting
potential background sequences, it will try to select them
such that each GC-bin gets a similar number (i.e. if your
target sequences are GC rich, it will tend to select GC-rich
background sequences at a higher rate relative to AT-rich
sequences).
(2) Once candidate background sequences where selected,
HOMER will then iteratively select background sequences to
use in the final background set that match the positional
k-mer content of the targets – doing this separately for
each GC-bin. This is done by calculating the average k-mer
content at each position for both the target and putative
background. The difference between these matrices is
calculated and then each target sequences is scored based on
summation of the frequency differences at each position
based on the k-mer present in the sequence. Sequences are
then placed in 10 ‘sub bins’ based on this difference (-nsubBins
<#>, instead of overall GC content). The same is
done for the putative background sequences, and then a
subset of background sequences from each sub bin are then
selected to use in the next iteration. This continues until
the desired number of background sequences are reached (3).
Sample Data:
- tss.positions.200.txt
- HOMER-style peak file similar to BED that contains
regions centered on 155k TSS locations identified by
csRNA-seq in U2OS cells in human [hg38]). The "200"
refers to the fact that the regions are 200 bp long
(centered on the TSS). Unzip it with gunzip before using
it.
- hg38.fa - you can download this directly from UCSC
using "wget
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz"
and then unzipping the file with gunzip
Example: This program will take ~5min and use ~16Gb of
memory (if that is too much, try setting -NN 10000000 to
only use ~8Gb):
homer2 bg -p tss.positions.200.txt -g hg38.fa
-pkmer 2 -N 100000 -NN 100000000 -o output -allowTargetOverlaps
-allowBgOverlaps
Output: The program will create several output files
starting with <prefix> (set with -o <prefix>, in
the case above it is 'output'):
- <prefix>.info.txt - reports the command
line options used and some basic information including
actual number of sequences in output background set and
the cumulative 'difference' or error in the background
sequence properties and the target set.
- <prefix>.bg.positions.bed - tab-delimited
BED format file with the genomic positions of all
selected sequences
- <prefix>.bg.sequences.fasta - FASTA
formatted file containing the selected background
sequences
- <prefix>.bg.stats.txt - table containing
the name of the background sequence, chromosome,
position (min regardless of strand), strand, weight
(1.0, not used), Profile Score (internal score, not
used), GC%, and the internal bin ID the background was
assigned from.
- <prefix>.bg.positions.bed, <prefix>.bg.sequences.fasta,
and <prefix>.target.stats.txt provide the
same information, but for the target sequences.
The output sequences can then be use as controls for DNA
motif finding, enrichment calculations, or other creative
uses.
Key input parameters to consider:
-model
Instead of selecting sequence from the
background, the program can also generate synthetic
sequences using what is essentially a positional k-mer
markov model by adding the parameter "-model" to
the command. One advantage is that this is much less
resource intensive, and you can essentially generate as
many as you want. The downside is that the sequences are
synthetic a may not capture some of the higher order
sequence features that are nonetheless present in the
genome, such as longer simple repeats, etc.
k-mer length
By default, homer will attempt to model
dinucloetides in a position-dependent manner ("-pkmer
2"). Certain applications might call for longer
kmers (i.e. when analyzing coding regions,
trinucleotides might be appropriate. However, the higher
the k-mer length, the more likely you are to "normalize
out" longer signals (such as TF binding sites). Also,
the ability of the program to find a background set that
matches the desired k-mer positional content
exponentially decreases as the length of the k-mer
increases. As a useful heuristic, it is best to have at
least 100x more target sequences than possible k-mers
such that each can be properly quantified. For example,
if you want to model trinucleotide content, you probably
want 4^3*100=6400 target sequences at minimum.
-pkmer <#> vs. -ikmer <#>
To model kmers in a position-dependent manner,
use "-pkmer <#>". However, if you don't
care about the positional enrichment for each kmer in
the dataset, just that overall kmer content of each
sequence, you can use "-ikmer <#>" instead.
This will generate background sequence sets more akin to
traditional motif analysis approaches (including
original HOMER).
-allowTargetOverlaps and -allowBgOverlaps
When selecting background sequences, the
program, by default, will avoid picking sequences that
overlap one another. "-allowBgOverlaps" means that the
final background sequences that are chosen may overlap
some of their bp. By also specifying "-allowTargetOverlaps",
it then is possible that some of the background
sequences will overlap with the target sequences to some
degree (i.e. the sequences should be the same, but some
regions may overlap by 10% of their length instead of
being completely separate sequences). When performing de
novo motif finding in a position independent manner, you
probably don't want to allow any overlaps. However, when
performing positional enrichment analysis, two
overlapping sequences may still contain valuable,
independent information even if some of their
nucleotides are in common (since their anchor positions
are ultimately different), so allowing overlaps could be
a good idea. The other practical consideration is that
for high GC content regions, there simply aren't very
many of them in the genome, and you could simply 'run
out' of sequences to chose from if you do not allow some
degree of overlap.
Practical considerations with respect to performance
-N <#> : Total number of sequences to
select
-NN <#> : Total number of putative background
sequences to consider during selection.
Reducing this
number can save on memory for big genomes.
-nbins <#> : Number of GC-bins to split the target
sequences into (default 10).
-nsubBins <#> : Number of sub bins used when
matching positional kmer content (default 10).
Increasing this
number can improve accuracy, but again as a useful
heuristic, you probably want at least 100-fold more
target sequences than the product of these two numbers
(i.e. #targets > 10x10x100)
Overall, it might be important to consider that this
approach was designed to work with eukaryotic regulatory
elements in the size range of 50-500 bp in length. As the
k-mer size is increased, or the size of the target regions
are increased, the ability of the program to find
background sequences that match the properties of the
targets will become increasingly difficult and resource
intensive, leading to reduced accuracy and long run times.
Using the new background selection with other HOMER
tools.
DNA motif discovery: findMotifsGenome.pl
Add "-useNewBg" to your command to
envoke the new positional dependent background
correction. Many of the same parameters above can also
be entered on the command line (i.e. -allowBgOverlaps,
-pkmer 2)
Note: You can always generate background sequences using
"homer2 background" and have complete control and then
use the resulting FASTA files for motif finding too.
Checking position-dependent motif enrichment for a library
of motifs: createHomer2EnrichmentTable.pl
HOMER2 has new routines to help
automate the screening of large motif libraries for
positional enrichment.
Usage:
homer2
background -i <target sequences.fasta> [options]
Generate/Select
background sequences that match properties in a set of
target sequences.
Inputs:
Target sequences you want to model:
-i <target sequences.fasta> (FASTA file)
-p <target positions.bed> (Alteratively, provide a BED
or HOMER peak file with genomic coordinates)
Background sequences to select from:
-model (generate sequences using a model, do not extract
real background sequences)
-g <genome.fasta> (genome FASTA file or seqeunce
resource to select sequences from)
-b <background sequences.fasta> (explicit set of
background sequences to choose from FASTA file)
-bg <background positions.bed> (explicit set of
background positions to choose from)
-bgr <background regions positions.bed> (regions of
the genome to select bg sequences from)
Key options:
-size <#> (size of regions to consider in background,
default: avg of length of target sequences)
-N <#> (number of background sequences to select,
default: 100000)
-NN <#> (number of background sequences initial screen
from genome, default: 100000000)
-mask (mask lowercase sequence i.e. softmasked sequence,
default: use all sequences)
-nbins <#> (number of bins to segregate sequences into
for GC selection, def: 10
-nsubBins <#> (number of bins to segregate sequences
into for positional frequencies, def: 10
-maxFractionN <#> (Maximum fraction of sequence that
can be N and still used, default: 0.5)
-allowTargetOverlaps (allow selected bg sequences from a
genome to overlap targets, def: not allowed)
-allowBgOverlaps (allow selected bg sequences from a genome
to overlap, def: not allowed)
-strand (allow sequences to overlap if on separate strands)
-pkmer <#> (match positional kmer content)
-ikmer <#> (match overall kmer content [position
independent])
-excludeNs/-includeNs (by default, kmers with Ns are
excluded when selecting bg sequences,
but included when generating sequences with -model)
-pscore <outputBEDfile> (Report initial pscores)
-maxIterations <#> (maximum iterations, def: 20)
-overlapIteration <#> (iteration to start enforcing no
overlaps, def: 5)
-decayRate <#> (selection rate per iteration, def:
0.75)
-seed <#> (seed for random number generator, def: uses
time)
Output:
-o <output prefix> (default: out)
-gs (include homer-style group and sequence output files)
Important: You input sequences
must be the same length!
|