Motif Finding with HOMER from FASTA files
Most of HOMER's functionality is built around either
promoter-based or genomic-position-based analysis, and it aims to
handle the sequence manipulation for you, hiding it from the
user. However, if you have your own sequences that you
would like HOMER to analyze, the program findMotifs.pl accepts FASTA-formatted files
for analysis. Alternatively, you can use the homer2 executable, which
also accepts FASTA files as input.
HOMER is designed to analyze high-throughput data using
differential motif discovery, which means that it is HIGHLY
recommended that you have both target and background sequences.
Each set should contain many (preferably thousands of) sequences
that are roughly the same length. If you absolutely can't think of a
proper background, HOMER will scramble your input sequences
for you (starting with v4.3, you can also call the scrambling
script directly: scrambleFasta.pl). Even
better, use homer2 background to generate
background sequences.
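Before running anything, it can help to confirm that the two sets really are comparable in size and length. The following is a minimal Python sketch, not part of HOMER: the helper names read_fasta and summarize are made up for illustration, and the sketch assumes the identifier is the first whitespace-delimited token of each header line.

```python
from statistics import median


def read_fasta(lines):
    """Parse FASTA text lines into {identifier: sequence}.

    Assumption: the identifier is the first whitespace-delimited token
    of the header line. HOMER's own parsing may differ.
    """
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            records[name] = []
        elif name is not None:
            # Sequence data may be wrapped over several lines.
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def summarize(records):
    """Return the sequence count and basic length statistics for one set."""
    lengths = [len(seq) for seq in records.values()]
    if not lengths:
        return {"count": 0, "min": 0, "median": 0, "max": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": median(lengths),
        "max": max(lengths),
    }


# Toy data; in practice pass open("targetSequences.fa") and so on.
target = read_fasta([">t1", "ACGTACGT", ">t2", "ACGT", "ACGT"])
background = read_fasta([">b1", "TTTTAAAA", ">b2", "GGGGCCCC"])
print("target:", summarize(target))
print("background:", summarize(background))
```

If the counts are tiny or the length ranges of the two sets look very different, that is a hint to revisit the background before motif finding.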
A quick note about FASTA files: each sequence should have a unique
identifier. In theory, HOMER should be
flexible about what is in the header line, but if you're
having trouble, please keep it simple with minimal
white-space, especially tabs. For example:
>NM_003456
AAGGCCTGAGATAGCTAGAGCTGAGAGTTTTCCACACG
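If you want to check a file against this advice before running HOMER, a short script can flag repeated identifiers and headers containing tabs. This is a rough Python sketch under one assumption: the identifier is taken to be the first whitespace-delimited token after ">", which may not be exactly how HOMER reads headers. The function name check_fasta_headers is hypothetical.

```python
from collections import Counter


def check_fasta_headers(lines):
    """Return (duplicate_ids, headers_with_tabs) for an iterable of FASTA lines.

    Assumption: the identifier is the first whitespace-delimited token
    after ">". This is an illustration, not HOMER's own parser.
    """
    ids = []
    tab_headers = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.startswith(">"):
            continue  # sequence line, nothing to check here
        header = line[1:]
        if "\t" in header:
            tab_headers.append(header)
        tokens = header.split()
        ids.append(tokens[0] if tokens else "")
    counts = Counter(ids)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    return duplicates, tab_headers


# Toy input; in practice pass an open file handle instead of a list.
example = [
    ">NM_003456",
    "AAGGCCTGAGATAGCTAGAGCTGAGAGTTTTCCACACG",
    ">NM_003456 second copy",
    "ACGTACGT",
]
duplicates, tab_headers = check_fasta_headers(example)
print("duplicate identifiers:", duplicates)
print("headers containing tabs:", tab_headers)
```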
Running findMotifs.pl with FASTA files:
To find motifs from FASTA
files, run findMotifs.pl with the target sequence FASTA
file as the first command-line argument, and use the
option "-fasta <file>" to specify the background FASTA
file. You should make every attempt to supply sequences
that represent a thoughtful background; it would
defeat the purpose of differential motif finding not to
have one!
findMotifs.pl <targetSequences.fa> fasta <output directory> [-fastaBg <background.fa>] [options]
NOTE: you must provide an "organism" as the 2nd argument
to keep the structure of the command, even though
it isn't actually relevant for FASTA-based
analysis. The organism doesn't have to match the data in
the FASTA files. You can use a valid organism or
just put "fasta"
as a placeholder, e.g.:
findMotifs.pl chuckNorrisGenes.fa fasta analysis_output/ -fastaBg normalHumanGenes.fa
Many other options are available to control motif
finding parameters. findMotifs.pl will perform GC
normalization and autonormalization by default (see here for more details).
Selecting Background Sequences:
There are several ways to supply background sequences for FASTA input:
- Simplest (and not recommended) - let HOMER scramble
them for you: Simply use 'fasta' as the 2nd argument
and do not specify '-fastaBg <file>'. This will
randomly scramble the sequences, and is only
guaranteed to preserve nucleotide content (not higher
order k-mers; use "homer2 background" for that).
- Specify your own background FASTA file
(recommended): Add "-fastaBg <fasta file>"
to specify background sequences to use in FASTA
format. Note that the program will still try to
re-weight the sequences to normalize GC content etc.
unless you turn off these features.
- Specify large FASTA regions (or a genome FASTA file)
and have HOMER chop it up for you to use as background
(not really recommended): Add "-fastaBg <fasta file> -chopify".
This will chop up the FASTA
file sequences to match the average size of the target
sequences.
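To make the first and third options more concrete, here is a small Python sketch of the two ideas: shuffling a sequence so that only its nucleotide content is preserved, and chopping a long sequence into pieces the size of an average target. It illustrates the concepts only and is not how scrambleFasta.pl or -chopify are implemented. The function names, and the choice to drop a short trailing piece, are assumptions made for the example.

```python
import random


def scramble(seq, rng):
    """Shuffle the bases of one sequence.

    Base counts are preserved exactly; dinucleotide and higher-order
    k-mer content is not.
    """
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def average_length(seqs):
    """Average sequence length, rounded to the nearest integer."""
    return round(sum(len(s) for s in seqs) / len(seqs))


def chop(seq, size):
    """Split seq into consecutive non-overlapping pieces of length `size`.

    Assumption for this sketch: a final piece shorter than `size` is dropped.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


rng = random.Random(0)  # fixed seed so the example is repeatable
targets = ["ACGTACGTAC", "GGGCCCAATT"]

# Option 1: scrambled copies of the targets as background.
scrambled = [scramble(s, rng) for s in targets]

# Option 3: chop a long region into target-sized background pieces.
size = average_length(targets)
pieces = chop("ACGT" * 8, size)

print(scrambled)
print(size, pieces)
```

The scrambled sequences contain exactly the same bases as the originals, which is why this kind of background controls for nucleotide content and nothing more.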
As of now, HOMER2 is not integrated into this command. If
you would like HOMER2 to select background sequences for
your FASTA target input sequences, see this page
about how to run 'homer2 background'.
Finding instances of motifs with FASTA files:
To find instances of a motif,
run the same command used for motif discovery above but
add the option "-find <motif file>". Motif results will be
sent to stdout, so to capture the results in a file, add "> outputfile" to
the end of the command.
findMotifs.pl <targetSequences.fa> fasta <output directory> -fasta <background.fa> [options] -find motif1.motif > outputfile.txt
For more information on the
output file format, see here.
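If you drive HOMER from a script rather than a shell, the "> outputfile" redirection can be reproduced by pointing the command's stdout at a file. This is a minimal Python sketch using the standard subprocess module. The helper name run_to_file is hypothetical, and the demonstration runs a harmless Python one-liner instead of findMotifs.pl so that it works without HOMER installed.

```python
import os
import subprocess
import sys
import tempfile


def run_to_file(cmd, out_path):
    """Run cmd (a list of arguments) and write its stdout to out_path.

    Returns the exit code of the command. stderr is left untouched,
    so progress messages still appear on the terminal.
    """
    with open(out_path, "w") as out:
        result = subprocess.run(cmd, stdout=out, check=False)
    return result.returncode


# Stand-in command for demonstration purposes only.
demo_path = os.path.join(tempfile.mkdtemp(), "outputfile.txt")
exit_code = run_to_file([sys.executable, "-c", "print('motif hits')"], demo_path)
print(exit_code, demo_path)

# With HOMER on your PATH, the command shown above would be passed as
# (file names are placeholders):
# run_to_file(
#     ["findMotifs.pl", "targetSequences.fa", "fasta", "output_dir/",
#      "-fasta", "background.fa", "-find", "motif1.motif"],
#     "outputfile.txt",
# )
```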
Using homer2 directly with FASTA files:
homer2 is the motif finding executable,
and it can choke down FASTA files if you want to avoid all
the nonsense above. Running the homer2 command will
also give you access to other options for optimizing the
motif finding process. homer2 works by first specifying a
command, and then the appropriate options:
homer2 <command> [options]
e.g. homer2 denovo -i input.fa -b background.fa > outputfile.txt
To find instances of the output motifs, use "homer2 find". To
see other commands, just type "homer2".