Finding Motif Instances Across the Whole Genome
To make it easier to predict motif sites across the genome,
HOMER contains a program called scanMotifGenomeWide.pl
to assist with scanning large FASTA files.
Using scanMotifGenomeWide.pl to look for motif
instances:
The scanMotifGenomeWide.pl
script will take a motif file (may contain multiple
motifs) and look for instances across the genome.
The basic command looks like this (output file is sent to
stdout):
scanMotifGenomeWide.pl <motif file> <genome> [options]
scanMotifGenomeWide.pl pu1.motif mm9 -bed > pu1.sites.mm9.bed
The motif file and genome arguments are required.
You can look for several motifs at once by concatenating
them into a single motif file. For the genome, you
may also provide a FASTA file to analyze a custom genome.
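As a rough sketch of the concatenation step, the commands below build two tiny placeholder motif files and merge them with cat. The file names (a.motif, b.motif, combined.motif) and the matrix values are made up for illustration and are not real HOMER motifs. The layout assumes the usual HOMER motif format: a ">" header line followed by one row of A/C/G/T probabilities per position.

```shell
# Scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)

# Two placeholder motif files (hypothetical names and values, illustration only).
# Header fields: consensus, motif name, log-odds threshold.
printf '>AG\tmotifA\t1.0\n0.97\t0.01\t0.01\t0.01\n0.01\t0.01\t0.97\t0.01\n' > "$workdir/a.motif"
printf '>TC\tmotifB\t1.0\n0.01\t0.01\t0.01\t0.97\n0.01\t0.97\t0.01\t0.01\n' > "$workdir/b.motif"

# Concatenate them into one multi-motif file.
cat "$workdir/a.motif" "$workdir/b.motif" > "$workdir/combined.motif"

# Each motif starts with a '>' header line, so counting those lines
# gives the number of motifs in the combined file.
motif_count=$(grep -c '^>' "$workdir/combined.motif")
echo "$motif_count motifs in combined.motif"
```

The combined file can then be passed as the <motif file> argument in place of a single-motif file.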
Options:
-bed : Output file will be in BED
format - useful when you want to upload to the UCSC
browser.
-keepAll : By default, HOMER keeps only one site when
instances of the same motif overlap, which is useful for
palindromes. To report all sites, specify this option.
-mask : Do not look for motifs in RepeatMasked
sequence (lower case sequence in FASTA files)
-5p : Report motif positions based on the 5' end
of the motif sequence.
-homer1 or -homer2 : version of homer
motif finding to use (-homer2 is default)
Output format:
Tab delimited text file (default):
- Site ID (motif name + number)
- chr
- start
- end
- strand
- log-odds score
- sequence
BED (tab) format (use -bed):
- chr
- start
- end
- motif name
- log-odds score (will be floored to an integer)
- strand
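Because the default output is plain tab-delimited text, standard command-line tools can post-process it. The sketch below keeps only sites whose log-odds score (column 6) meets a cutoff. The sample rows, the cutoff value, and the variable names are hypothetical and only mirror the column layout listed above.

```shell
# Hypothetical sample in the default tab-delimited layout:
# site ID, chr, start, end, strand, log-odds score, sequence.
sites=$(mktemp)
printf 'PU.1-1\tchr1\t100\t109\t+\t9.5\tAAAGAGGAAG\n'  > "$sites"
printf 'PU.1-2\tchr1\t500\t509\t-\t6.2\tAGAGAGGAAC\n' >> "$sites"
printf 'PU.1-3\tchr2\t40\t49\t+\t11.0\tAAAGAGGAAG\n'  >> "$sites"

# Illustrative cutoff, not a recommended value.
min_score=8

# Keep rows whose column 6 is at least the cutoff (+0 forces numeric comparison).
awk -F'\t' -v min="$min_score" '($6 + 0) >= (min + 0)' "$sites" > "$sites.filtered"

# Count the remaining sites (grep -c '' counts lines).
kept=$(grep -c '' "$sites.filtered")
echo "$kept sites with score >= $min_score"
```

For BED output the score sits in column 5 instead of column 6, so the awk field would need to change accordingly.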
Creating a motif prediction UCSC track for All Motifs:
To upload your motif predictions, create a BED
file using the scanMotifGenomeWide.pl command above and
then upload the file as a custom track to the UCSC Genome
Browser (or your favorite browser). If your file
gets VERY large from predicting too many motifs,
you may need to create a bigBed file and use a webserver
to host it (much like a bigWig file).
NOTE: This is how the custom tracks on the HOMER homepage
were made!
To create a bigBed (let's say from the homer known.motifs
collection):
scanMotifGenomeWide.pl homer/data/knownTFs/vertebrates/known.motifs hg19 -bed -int -keepAll > output.bed
Next, we need to make sure it's properly sorted [this
step is not needed with newer versions of HOMER, where
scanMotifGenomeWide.pl will automatically sort the
output file]:
sort -k1,1 -k2,2n output.bed > output.sorted.bed
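If you are unsure whether a BED file is already in the order produced by the sort command above, sort -c can check it without rewriting the file. This is a minimal sketch on made-up data. The LC_ALL=C prefix is an assumption added here so that chromosome names are compared byte-wise regardless of the system locale; it is not part of the original command.

```shell
# Hypothetical unsorted BED rows: chr, start, end, motif name, score, strand.
bed=$(mktemp)
printf 'chr2\t50\t60\tm1\t7\t+\n'    > "$bed"
printf 'chr1\t300\t310\tm1\t8\t-\n' >> "$bed"
printf 'chr1\t20\t30\tm2\t9\t+\n'   >> "$bed"

# Sort by chromosome name, then numerically by start position.
LC_ALL=C sort -k1,1 -k2,2n "$bed" > "$bed.sorted"

# sort -c exits non-zero if the input is not already in the requested order.
if LC_ALL=C sort -c -k1,1 -k2,2n "$bed.sorted" 2>/dev/null; then
  sorted_ok=yes
else
  sorted_ok=no
fi
echo "sorted: $sorted_ok"
```

Running the same check on the unsorted file would report "no", which is a quick way to decide whether the sort step can be skipped.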
Finally, get the bedToBigBed program from UCSC and create
a bigBed file (you need a chrom.sizes file, which is just
a text file with chromosome names and sizes - refer to
UCSC for more info):
bedToBigBed output.sorted.bed homer/data/genomes/hg19/chrom.sizes output.bigBed
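When scanning a custom FASTA genome there may be no ready-made chrom.sizes file. Since the file is just chromosome names and lengths, one way to produce it is to count bases per FASTA record with awk. The sketch below uses a made-up two-sequence FASTA; the file names are hypothetical, and it assumes the chromosome name is the first word after ">" on each header line.

```shell
# Hypothetical miniature FASTA standing in for a custom genome.
fasta=$(mktemp)
printf '>chrA description text\nACGTAC\nGT\n>chrB\nNNNNACGT\nACG\n' > "$fasta"

# For each record: take the first word of the header (minus '>') as the name
# and sum the lengths of all sequence lines that follow it.
awk '
  /^>/ {
    if (name != "") printf "%s\t%d\n", name, len   # flush the previous record
    name = substr($1, 2)
    len = 0
    next
  }
  { len += length($0) }
  END { if (name != "") printf "%s\t%d\n", name, len }
' "$fasta" > "$fasta.chrom.sizes"

cat "$fasta.chrom.sizes"
```

The names in the resulting file need to match the chromosome names in the BED file passed to bedToBigBed.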
Now you're ready to view the bigBed file on the
browser. First, copy the bigBed file to a location
where it can be accessed as a URL on a webserver.
Next, load a custom track into the UCSC Genome Browser
like this:
track type=bigBed name="track name" description="track description" bigDataUrl=http://URLtoYourBigBED visibility=3
Command Line Usage for scanMotifGenomeWide.pl:
Usage: scanMotifGenomeWide.pl <motif> <genome> [-5p] [-homer1/2] [-bed] [-keepAll] [-mask]
Possible Genomes:
-- or --
Custom: provide the path to genome FASTA files (directory or single file)
Output will be sent to stdout
Add -5p to report positions centered on the 5' start of the motif
Add -bed to format as a BED file (i.e. for UCSC upload)
Add -homer1 to use the original homer
Add -homer2 to use homer2 instead of the original homer (default)
Add -keepAll to keep ALL sites, even ones that overlap (default - keep one)
Add -mask to search for motifs in repeat masked sequence.