Software for motif discovery and ChIP-Seq analysis

Finding Motif Instances Across the Whole Genome

To make it easier to predict motif sites across the genome, HOMER contains a program called scanMotifGenomeWide.pl to assist with scanning large FASTA files.

Using scanMotifGenomeWide.pl to look for motif instances:

The scanMotifGenomeWide.pl script takes a motif file (which may contain multiple motifs) and looks for instances across the genome.  The basic command looks like this (output is sent to stdout):
scanMotifGenomeWide.pl <motif file> <genome> [options]

scanMotifGenomeWide.pl pu1.motif mm9 -bed > pu1.sites.mm9.bed
The motif file and genome arguments are required.  You may look for several motifs at once by concatenating them into a single motif file.  For the genome, you may also provide a FASTA file to analyze a custom genome.
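
Combining motifs is plain file concatenation. The following is a minimal sketch, assuming two single-motif files; the file names (a.motif, b.motif, combined.motif) and the matrix values are made-up placeholders rather than real HOMER motifs:

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)

# Two toy motif files (illustrative values only). Each motif starts with a
# ">" header line (consensus, name, log-odds threshold), followed by one
# row of A/C/G/T probabilities per motif position.
{
  printf '>GGAA\ttoyMotifA\t4.0\n'
  printf '0.1\t0.1\t0.7\t0.1\n'
  printf '0.1\t0.1\t0.7\t0.1\n'
  printf '0.7\t0.1\t0.1\t0.1\n'
  printf '0.7\t0.1\t0.1\t0.1\n'
} > "$workdir/a.motif"

{
  printf '>TTGC\ttoyMotifB\t4.0\n'
  printf '0.1\t0.1\t0.1\t0.7\n'
  printf '0.1\t0.1\t0.1\t0.7\n'
  printf '0.1\t0.1\t0.7\t0.1\n'
  printf '0.1\t0.7\t0.1\t0.1\n'
} > "$workdir/b.motif"

# Concatenate the individual files into one multi-motif file.
cat "$workdir/a.motif" "$workdir/b.motif" > "$workdir/combined.motif"

# Sanity check: count the ">" header lines, one per motif.
n_motifs=$(grep -c '^>' "$workdir/combined.motif")
echo "combined.motif contains $n_motifs motifs"
```

The combined file can then be given as the <motif file> argument in place of pu1.motif in the command above.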


-bed : Output file will be in BED format - useful when you want to upload to the UCSC browser.
-keepAll : By default, when instances of the same motif overlap each other (common for palindromic motifs), HOMER keeps only one of them.  Specify this option to report all sites.
-mask : Do not look for motifs in RepeatMasked sequence (lower case sequence in FASTA files)
-5p : Report motif positions based on the 5' end of the motif sequence.
-homer1 or -homer2 : version of homer motif finding to use (-homer2 is default)

Output format:

Tab delimited text file (default):

  1. Site ID (motif name + number)
  2. chr
  3. start
  4. end
  5. strand
  6. log-odds score
  7. sequence
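
Because the log-odds score sits in column 6 of the default output, sites can be filtered by score with standard tools. This is a sketch on made-up rows; the file names and the cutoff of 8 are arbitrary assumptions, not HOMER recommendations:

```shell
workdir=$(mktemp -d)

# Toy rows in the 7-column default layout (values invented for illustration):
# site ID, chr, start, end, strand, log-odds score, sequence
{
  printf 'PU.1-1\tchr1\t100\t109\t+\t9.52\tAAAGAGGAAG\n'
  printf 'PU.1-2\tchr1\t500\t509\t-\t6.10\tAAAGAGGAAC\n'
  printf 'PU.1-3\tchr2\t250\t259\t+\t11.03\tAAAGAGGAAG\n'
} > "$workdir/sites.txt"

# Keep only rows whose score (column 6) is at least min_score.
min_score=8
awk -F'\t' -v min="$min_score" '$6 >= min' "$workdir/sites.txt" \
  > "$workdir/sites.filtered.txt"

cat "$workdir/sites.filtered.txt"
```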

BED (tab) format (use -bed):

  1. chr
  2. start
  3. end
  4. motif name
  5. log-odds score (will be floored to an integer)
  6. strand
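
When a multi-motif file has been scanned, the motif name in column 4 of the BED output makes it easy to see how many sites each motif contributed. A small sketch on invented rows (the file name sites.bed and the motif names are hypothetical):

```shell
workdir=$(mktemp -d)

# Toy BED rows (made up): chr, start, end, motif name, score, strand
{
  printf 'chr1\t100\t110\ttoyMotifA\t9\t+\n'
  printf 'chr1\t400\t410\ttoyMotifB\t7\t-\n'
  printf 'chr2\t250\t260\ttoyMotifA\t11\t+\n'
} > "$workdir/sites.bed"

# Tally sites per motif name, then print "name<TAB>count".
counts=$(cut -f4 "$workdir/sites.bed" | sort | uniq -c | awk '{ print $2 "\t" $1 }')
echo "$counts"
```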

Creating a motif prediction UCSC track for All Motifs:

To upload your motif predictions, create a BED file using the scanMotifGenomeWide.pl command above and then upload the file as a custom track to the UCSC Genome Browser (or your favorite browser).  If your file gets very large from predicting too many motifs, you may need to create a bigBed file and use a webserver to host it (much like a bigWig file).

NOTE: This is how the custom tracks on the HOMER homepage were made!

To create a bigBed (let's say from the homer known.motifs collection):
scanMotifGenomeWide.pl homer/data/knownTFs/vertebrates/known.motifs hg19 -bed -int -keepAll > output.bed

Next, we need to make sure it's properly sorted.  [This step is not needed with newer versions of HOMER: scanMotifGenomeWide.pl will automatically sort the output file.]
sort -k1,1 -k2,2n output.bed > output.sorted.bed
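
If it is unclear whether a given BED file is already sorted, `sort -c` can check the order without rewriting the file. The sketch below uses made-up rows and hypothetical file names; setting `LC_ALL=C` is an added assumption here, used so that chromosome names are compared byte by byte regardless of the system locale:

```shell
workdir=$(mktemp -d)

# Toy BED rows, deliberately out of order (values invented).
{
  printf 'chr2\t50\t60\ttoyMotifA\t7\t+\n'
  printf 'chr1\t300\t310\ttoyMotifA\t8\t-\n'
  printf 'chr1\t100\t110\ttoyMotifA\t9\t+\n'
} > "$workdir/output.bed"

# sort -c only checks the order; it exits non-zero if the file is unsorted.
if LC_ALL=C sort -c -k1,1 -k2,2n "$workdir/output.bed" 2>/dev/null; then
  status_before="sorted"
else
  status_before="unsorted"
fi
echo "before: $status_before"

# Sort by chromosome name, then numerically by start position.
LC_ALL=C sort -k1,1 -k2,2n "$workdir/output.bed" > "$workdir/output.sorted.bed"

cat "$workdir/output.sorted.bed"
```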

Finally, get the bedToBigBed program from UCSC and create a bigBed file (you need a chrom.sizes file, which is just a text file with chromosome names and sizes - refer to UCSC for more info):
bedToBigBed output.sorted.bed homer/data/genomes/hg19/chrom.sizes output.bigBed
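
For a custom genome supplied as a FASTA file, a chrom.sizes file may not be available. One way to build it, sketched here with awk on a toy FASTA file (the file names and sequences are made up, and this is only one of several possible approaches), is to sum the sequence line lengths under each header:

```shell
workdir=$(mktemp -d)

# Toy FASTA file with two short "chromosomes" (invented sequences).
{
  printf '>chrA example description\n'
  printf 'ACGT\n'
  printf 'ACG\n'
  printf '>chrB\n'
  printf 'AC\n'
} > "$workdir/genome.fa"

# For each ">" header, take the first word (minus ">") as the name and
# add up the lengths of the sequence lines that follow it.
awk '
  /^>/ { if (name != "") print name "\t" len
         name = substr($1, 2); len = 0; next }
       { len += length($0) }
  END  { if (name != "") print name "\t" len }
' "$workdir/genome.fa" > "$workdir/chrom.sizes"

cat "$workdir/chrom.sizes"
```

The resulting file has one line per sequence: the name, a tab, and the length in bases.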

Now you're ready to view the bigBed file on the browser.  First, copy the bigBed file to a location where it can be accessed as a URL on a webserver.  Next, load a custom track into the UCSC Genome Browser like this:

    track type=bigBed name="track name" description="track description" bigDataUrl=http://URLtoYourBigBED visibility=3

Command Line Usage for scanMotifGenomeWide.pl:

        Usage: scanMotifGenomeWide.pl <motif> <genome> [-5p] [-homer1/2] [-bed] [-keepAll] [-mask]
                Possible Genomes:

                        -- or --
                Custom: provide the path to genome FASTA files (directory or single file)

        Output will be sent to stdout
        Add -5p to report positions centered on the 5' start of the motif
        Add -bed to format as a BED file (i.e. for UCSC upload)
        Add -homer1 to use the original homer
        Add -homer2 to use homer2 instead of the original homer(default)
        Add -keepAll to keep ALL sites, even ones that overlap (default - keep one)
        Add -mask to search for motifs in repeat masked sequence.
