Motif Finding with HOMER from FASTA files
Most of HOMER's functionality is built around either
promoter or genomic position based analysis, and aims to
take sequence management and manipulation out of the
user's hands. However, if you already have sequences that you
would like HOMER to analyze, the program findMotifs.pl accepts FASTA formatted files
for analysis.
HOMER is designed to analyze high-throughput data using
differential motif discovery, which means you MUST have both target and
background sequences. Each set should contain many sequences
(preferably thousands), and the sequences should be roughly
the same length.
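Before running HOMER it can help to confirm that both sets meet these expectations. The following is a minimal sketch, not part of HOMER: the helper names read_fasta and length_summary are hypothetical, and it simply reports how many sequences a FASTA file holds and how much their lengths vary.

```python
from statistics import median


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def length_summary(lines):
    """Return (count, shortest, median, longest) over all sequences."""
    lengths = [len(seq) for _, seq in read_fasta(lines)]
    if not lengths:
        return (0, 0, 0, 0)
    return (len(lengths), min(lengths), median(lengths), max(lengths))
```

Calling length_summary(open("targetSequences.fa")) and the same for the background file gives a quick view of whether the two sets are comparable in size and sequence length.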
A quick note about FASTA files - Each sequence must have a unique
identifier. In theory, HOMER should be
flexible with what is in the header line, but if you're
having trouble please just keep it simple with minimal
white-space, especially tabs. For example:
>NM_003456
AAGGCCTGAGATAGCTAGAGCTGAGAGTTTTCCACACG
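The two header rules above (unique identifiers, minimal white-space) are easy to check before running HOMER. This is a rough sketch only: check_headers is a hypothetical helper, and it assumes the identifier is the first white-space-delimited token of the header line, which is a common convention rather than something stated here.

```python
from collections import Counter


def check_headers(lines):
    """Return (duplicate_ids, headers_with_whitespace) for FASTA lines."""
    headers = [line.rstrip("\r\n")[1:] for line in lines if line.startswith(">")]
    # Assumption: the identifier is the first whitespace-delimited token.
    ids = [h.split()[0] if h.split() else "" for h in headers]
    duplicates = sorted(name for name, n in Counter(ids).items() if n > 1)
    # Headers containing spaces or tabs are the ones worth simplifying.
    with_whitespace = [h for h in headers if any(c.isspace() for c in h)]
    return duplicates, with_whitespace
```

An empty result for both lists means every identifier is unique and no header contains spaces or tabs.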
Running findMotifs.pl with FASTA files:
To find motifs from FASTA
files, run findMotifs.pl with the target sequence FASTA
file as the first command-line argument, and use the
option "-fasta <file>" or "-fastaBg <file>"
to specify the background FASTA file. You are
generally encouraged to specify a background file - not
having it would defeat the purpose of differential motif
finding.
findMotifs.pl <targetSequences.fa> <organism> <output directory> -fasta <background.fa> [options]
NOTE: you must choose an "organism" (e.g. just put
"human") for the 2nd argument, even though this isn't
actually relevant for FASTA based analysis. The organism
doesn't have to match the data in the FASTA files.
For example:
findMotifs.pl chuckNorrisGenes.fa human analysis_output/ -fasta normalHumanGenes.fa
Many other options are available to control motif
finding parameters (see here for
more details).
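When the same analysis is repeated over many FASTA files, it can be convenient to assemble the command line in a script. The sketch below is one possible way to do that: build_findmotifs_command is a hypothetical helper, and it uses only the argument order and the -fasta option described above.

```python
import shlex


def build_findmotifs_command(target, output_dir, organism="human",
                             background=None, extra=()):
    """Assemble a findMotifs.pl argument list for FASTA based analysis."""
    cmd = ["findMotifs.pl", target, organism, output_dir]
    if background is not None:
        cmd += ["-fasta", background]  # background FASTA file
    cmd += list(extra)  # any further options, passed through unchanged
    return cmd


cmd = build_findmotifs_command("chuckNorrisGenes.fa", "analysis_output/",
                               background="normalHumanGenes.fa")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run on a machine where HOMER is installed and on the PATH.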
Running findMotifs.pl with FASTA files without a
background file:
If you run findMotifs.pl without a FASTA file with
background sequences, HOMER will attempt to scramble
your input sequences and use them as background. HOMER
only does a simple first order scramble, meaning that if
there are any over-represented signals in your FASTA
file that are common in the genome but not necessarily
specific (think polyA - AAAAAAA), these will be picked
out first by the motif discovery algorithm. If you
still want to try it, be sure to specify "fasta" as the
2nd parameter and omit the "-fasta
<background.fa>" parameter:
findMotifs.pl <targetSequences.fa> fasta <output directory> [options]
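To see why a scrambled background leaves signals such as polyA standing out, consider the toy illustration below. It is not HOMER's scrambling code: a plain shuffle of the letters is assumed purely for illustration, and the sequence and the helper shuffle_sequence are made up. The shuffled copy keeps the same base composition, but an ordered run like AAAAAAA is usually broken up, so the run looks enriched in the target relative to this background.

```python
import random


def shuffle_sequence(seq, rng):
    """Return seq with its letters in random order (composition preserved)."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


rng = random.Random(0)  # fixed seed so repeated runs give the same shuffle
target = "GCGTACGT" + "A" * 12 + "CGTAGCTA"  # made-up sequence with a polyA run
background = shuffle_sequence(target, rng)

print("target has AAAAAAA:    ", "AAAAAAA" in target)
print("background has AAAAAAA:", "AAAAAAA" in background)
print("same composition:      ", sorted(target) == sorted(background))
```

A genuine background file drawn from real sequences would contain such common runs too, which is why supplying one is preferred.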
Finding instances of motifs with FASTA files:
To find instances of a motif,
run the same command used for motif discovery above but
add the option "-find <motif file>". Motif results will be
sent to stdout, so to capture the results in a file add " > outputfile" to
the end of the command.
findMotifs.pl <targetSequences.fa> <organism> <output directory> -fasta <background.fa> [options] -find motif1.motif > outputfile.txt
For more information on the
output file format, see here.
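If the search is launched from a script rather than a shell, the redirection to a file has to be done explicitly. The following is a minimal sketch under that assumption: run_to_file is a hypothetical helper, the file names mirror the placeholders above, and the actual HOMER call is left commented out because it requires HOMER to be installed.

```python
import subprocess


def run_to_file(cmd, out_path):
    """Run cmd and write its standard output to out_path."""
    with open(out_path, "w") as out:
        # check=True raises an error if the command exits with a failure code
        subprocess.run(cmd, stdout=out, check=True)


# Hypothetical file names, mirroring the placeholders used above.
find_cmd = ["findMotifs.pl", "targetSequences.fa", "human", "output_dir/",
            "-fasta", "background.fa", "-find", "motif1.motif"]

# Uncomment on a machine where HOMER is installed and on the PATH:
# run_to_file(find_cmd, "outputfile.txt")
```

This is equivalent to appending " > outputfile.txt" to the command in a shell.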