Configuring, Updating, and Customizing HOMER
This section covers how to update and customize HOMER. It
is new to v4.4, and long overdue. All of the
annotation parsing and updating scripts that were previously
kept with the development code are now part of HOMER.
Overall, there are 5 different strategies for customizing
HOMER:
- [configureHomer.pl] Installing & updating
packages from the HOMER website using configureHomer.pl
- this is the normal way to change HOMER, covered here.
- Using FASTA genome files and custom GTF files with
HOMER analysis programs (no permanent HOMER
configuration changes)
- [loadGenome.pl, loadPromoters.pl]
Creating your own genome and promoter packages from
FASTA and/or annotation files.
- [update/updateGeneIdentifiers.pl, update/updateUCSCGenomeAnnotations.pl]
Updating or adding organism gene accessions and GO terms
from NCBI, updating or adding additional UCSC genomes or
annotation.
- Spoofing HOMER format files to incorporate your own
data.
If these options don't make sense to you, don't worry.
As you go through this section and/or use HOMER these will
start to make more sense.
[1] Basic Configuration of HOMER - recommended
In an effort to make sure things are
standardized for analysis, HOMER organizes promoters,
genome sequences and annotation into packages.
Versions are based on assemblies from the UCSC Genome Browser
whenever possible. Accession numbers and gene ontology
definitions are based on the NCBI Gene database. All
program files and motif libraries are part of the
standard HOMER installation.
Configuration is handled
automatically through the configureHomer.pl script, which should
reside in the directory where HOMER
is installed (i.e. /path-to-homer/). To see
which packages are available, run the configureHomer.pl
script:
perl /path-to-homer/configureHomer.pl -list
Every time you run the configureHomer.pl script, it will
attempt to update the available packages by downloading
the update.txt
file from homer.salk.edu/homer/. Using this, the
program will assess which packages are installed and
which are available to download.
To install or remove any packages, simply rerun the
command using " -install
<package name>" or " -remove <package name>".
perl /path-to-homer/configureHomer.pl -install human
This would configure HOMER
for analysis of human promoters.
You may notice that a
package may have a "-p" (i.e. "human-p") at the end of it,
or a "-o" or "-g". These help disambiguate package
names if they have the same name in different sections
(i.e. -p for promoters). Overall, HOMER packages
come in 4 types:
- SOFTWARE - In this case only the homer code is
there. This package contains all of the code
plus some general data files, such as motif matrices.
- ORGANISMS - Species specific packages contain
accession conversion data, gene descriptions, and GO
analysis files specific to each organism. Most
are based on NCBI Gene database information
- PROMOTERS - Promoter sequences and related files for
analyzing promoters for motif enrichment. Most
often based on RefSeq transcript definitions.
Packages with "-mRNA" in the name contain RNA sequence
for analysis of RNA instead of DNA
- GENOMES - Genomic sequence and annotation
information
You may also notice that if you download the hg19 genome,
it will automatically download the 'human' organism
package. Each time you download a promoter or genome
package, it will check to make sure you have the Organism
package too.
[2] Using Custom Genomes and annotation files
"on-the-fly"
If you want to use a genome, set of promoters, or
genomic annotations that are not part of HOMER's
configuration, most HOMER commands support the use of
FASTA files, GTF files, or other sensible options to
enable analysis.
Genomes (FASTA):
Nearly every program in HOMER that accepts a
genome parameter can also accept a genome FASTA file.
Genome FASTA files should contain all chromosomes in a
single file. Chromosome names should occur right
after the FASTA header (i.e. ">chr1"). If
additional information is in the FASTA headers, there
should be a space between the chromosome name and the
additional information.
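As a quick sanity check before handing a custom FASTA file to HOMER, the sketch below lists the chromosome names as described above (the text between ">" and the first whitespace on each header line) and reports any name that occurs more than once. The function names are made up for this example, and treating duplicate names as a problem is an assumption rather than something HOMER is documented to enforce.

```shell
# List the chromosome name on each FASTA header line, i.e. the text between
# ">" and the first whitespace character.
fasta_chrom_names() {
    awk '/^>/ { name = $1; sub(/^>/, "", name); print name }' "$1"
}

# Print any chromosome name that appears more than once (ideally none).
fasta_duplicate_names() {
    fasta_chrom_names "$1" | sort | uniq -d
}
```

An empty line in the output of fasta_chrom_names would point to a header with a space directly after the ">", which is worth fixing before use.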
Example of using a FASTA file for motif finding for the
alien A.L.F.:
findMotifsGenome.pl peaks.txt ALF.fasta OutputResults/
Or, let's say you did ChIP-Seq on one of A.L.F.'s alien
transcription factors:
makeTagDirectory ALF-TF-ChIPseq/ reads.sam -genome ALF.fasta -checkGC
Gene Annotation (GTF files):
When annotating regions in the genome with
programs like annotatePeaks.pl, you may wish to
use novel transcripts (or be using a genome with no
annotation). Most programs like annotatePeaks.pl
offer a "-gtf <gtffilename>" option to
specify a custom annotation.
If you do not have a GTF file, you can try to use GFF or
GFF3 formatted files (use "-gff" or "-gff3" instead of
"-gtf"). GFF, GFF3, and GTF format files are all
very similar in format. However, GFF/GFF3 files
are not as strict in their specification, and it may be
difficult for HOMER to process their contents. As
a result, it's best to use GTF files whenever possible.
If you're having trouble with your file, you can try
parsing the GTF/GFF files with the parseGTF.pl
program. This script is used internally by
programs like annotatePeaks.pl to convert
GTF files into HOMER-style annotation files.
Running this script with your files as a test can
sometimes identify problems with the parsing.
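Before running parseGTF.pl, a rough pre-check of the file itself can rule out the most common formatting problems. This sketch relies only on general GTF conventions (nine tab-separated fields, a gene_id attribute on every row, and a transcript_id attribute on exon rows). It is not how HOMER parses the file, and gtf_bad_lines is a hypothetical helper name.

```shell
# Report rows that break basic GTF conventions.
# Comment lines and blank lines are skipped.
gtf_bad_lines() {
    awk -F '\t' '
        /^#/ || /^[ \t]*$/ { next }
        NF != 9 { print NR ": expected 9 fields, found " NF; next }
        $9 !~ /gene_id/ { print NR ": no gene_id attribute"; next }
        $3 == "exon" && $9 !~ /transcript_id/ { print NR ": exon row has no transcript_id" }
    ' "$1"
}
```

No output means the file passed these basic checks; it may still contain conventions that HOMER handles differently.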
Motif Finding from FASTA files (FASTA):
findMotifs.pl (and homer2) allow you
to use FASTA files for motif finding, i.e.:
findMotifs.pl targets.fa fasta OutputDirectory -fasta background.fa
More information here.
[3] Adding Custom Genomes and Promoters to the HOMER
configuration
To simplify the use of custom genomes and annotations,
you may want to load them into HOMER's configuration so
that you can use them by name like pre-configured HOMER
packages. HOMER offers two basic programs that
assist with promoter and genome configuration.
Loading Custom Promoter Sets [loadPromoters.pl]
Promoter set creation starts with either a FASTA file
of promoter sequences, or a genome and TSS peak file
that defines the promoter locations. The
loadPromoters.pl command takes either of these
data types and places a database of promoter sequences
in the data/promoters directory so that they can be
referenced by their 'name' using findMotifs.pl
and other commands that reference promoter sets.
Required pieces of information include:
- Promoter Set Name (-name): Name that you will
use to refer to it later when running findMotifs.pl
- Organism (-org): If you supply a HOMER organism it
will attempt to leverage all of the ID conversion
and GO analysis. Put null here if you
are using an unsupported organism.
- ID type (-id): The type of ID your promoters are
indexed by (gene [as in NCBI Gene], refseq, ensembl,
unigene, or custom). Put custom here
if you have an unsupported id type.
The Promoter Set Name should be unique. If
there is a conflict, the program will tell you to
rerun the command with "-force" to overwrite the
existing definition. There are other options,
such as specifying the version for record keeping
purposes (-version <#>). As a heads
up, if you start your version with a 'v', HOMER will
assume it is part of the normally maintained packages
and might overwrite it during an update with configureHomer.pl.
After you run the loadPromoters.pl command, a
new entry will appear in the config.txt file in the
base homer directory.
Example with FASTA file as input:
If all you want to do is find motifs from a FASTA
file, you do NOT want to run this configuration
command. If, however, you have a list of
promoter sequences (or something similar) where you
will frequently want to perform motif finding on a
subset of them (i.e. co-regulated genes, etc.), then
this is the perfect command for you! The key
is that you must specify the FASTA file with
sequences that are the same length for each entry
(i.e. +/- 2kb from the TSS). You MUST also
specify the offset from the TSS for the first
nucleotide in the file using "-offset <#>".
For example, if you have sequences that are from
-2kb to +2kb, use "-offset -2000".
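Since every entry in the promoter FASTA file has to be the same length, it can be worth checking that before loading. The sketch below tallies sequence lengths (multi-line entries are summed); fasta_length_counts is a made-up helper name, not a HOMER program.

```shell
# Print each distinct sequence length in a FASTA file together with the
# number of entries of that length, as "length<TAB>count", sorted by length.
# A single output line means all entries have the same length.
fasta_length_counts() {
    awk '
        /^>/ { if (seen) count[len]++; seen = 1; len = 0; next }
             { gsub(/[ \t\r]/, ""); len += length($0) }
        END  { if (seen) count[len]++
               for (l in count) printf "%d\t%d\n", l, count[l] }
    ' "$1" | sort -n
}
```

If more than one line comes back, the shorter or longer entries need to be trimmed or removed before running loadPromoters.pl.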
Example of custom human promoters (sequences from -2kb to +2kb):
loadPromoters.pl -name ChucksPromoters -org human -id refseq -fasta ChuckSequences.fasta -offset -2000
Example from alternative organism not recognized by
HOMER:
loadPromoters.pl -name ALFpromoters -org null -id custom -fasta ALFsequences.fasta -offset -2000
Example with Genome/TSS peaks as input:
The other popular option is that you may have a
list of TSS derived from 5'RNA-Seq or CAGE, or you
have a custom genome with TSS locations. In
these cases you can specify the genome and TSS
peaks. The TSS should be specified in a peak
file where the 'center' of the peak is the actual
TSS:
Example of custom human promoters:
loadPromoters.pl -name GencodePromoters -org human -id ensembl -genome hg19 -tss gencode.tss.txt
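If your TSS positions live in a simple table, a peak file whose peak centers sit exactly on the TSS can be produced by setting the start and end of each peak to the TSS coordinate. The sketch below assumes a hypothetical four-column input (ID, chromosome, TSS position, strand) and writes the five-column layout commonly used for HOMER peak files (ID, chromosome, start, end, strand). Both layouts are assumptions here, so check them against the peak file documentation before relying on this.

```shell
# Convert "id<TAB>chr<TAB>tss<TAB>strand" rows (hypothetical input layout)
# into five-column rows (id, chr, start, end, strand) with start = end = TSS,
# so that the center of each peak is the TSS itself.
# Comment lines and rows with fewer than four fields are skipped.
tss_table_to_peaks() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ || NF < 4 { next }
         { print $1, $2, $3, $3, $4 }' "$1"
}
```

For example, tss_table_to_peaks my_tss_table.tsv > my.tss.txt would give a file that could then be passed with -tss (both file names are placeholders).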
RNA-based motif finding (FASTA)
HOMER will let you load promoter sets that
aren't really promoters at all, but rather RNA
sequences. This can facilitate motif finding for
RNA motifs. In these cases it works just like a
traditional FASTA command, but be sure to name the
promoter set such that it contains -mRNA, such as
"human-mRNA". findMotifs.pl will look for
"-mRNA" in the promoter set name and automatically
adjust settings for RNA motif finding. It is
recommended that if you are using RNA sequences for
submission, use "-offset 0".
Loading Custom Genomes [loadGenome.pl]
Loading custom genomes can save you a lot of time if
you use HOMER a bunch. If your genome is
available through the UCSC Genome Browser, it is
recommended that you try adding it to HOMER through
the update scripts in the next section. However,
if that is not an option for you, I would recommend
adding it with loadGenome.pl. To load a
genome, there are several required parameters:
- Genome name (-name): Name that you will refer to
the genome as when running findMotifsGenome.pl,
annotatePeaks.pl, etc.
- A FASTA file of the genome (-fasta): all in one
file (soft masked is preferred)
- A GTF file describing the locations of genes
(-gtf): HOMER will attempt to choke down GFF
and GFF3 files, but the conventions for how genes
are recorded in these files are more variable and
HOMER might have trouble. You can test HOMER's
parsing of these files by running parseGTF.pl.
- Organism (-org): If you supply a HOMER organism it
will attempt to leverage all of the ID conversion
and GO analysis. Put null here if you
are using an unsupported organism.
The Genome Name should be unique. If there is a
conflict, the program will tell you to rerun the
command with "-force" to overwrite the existing
definition. There are other options, such as
specifying the version for record keeping purposes (-version
<#>). As a heads up, if you start
your version with a 'v', HOMER will assume it is part
of the normally maintained packages and might
overwrite it during an update with configureHomer.pl.
After you run the loadGenome.pl command, a
new entry will appear in the config.txt file in the
base homer directory.
Example:
loadGenome.pl -name alf -org null -fasta ALFgenome.fasta -gtf ALFgenes.gtf
loadGenome.pl can also save you some time by
creating a promoter set for your genome too! Add
"-promoters <PromoterSetName>" and it
will automatically call loadPromoters.pl in
the command using the TSS defined in your GTF file.
[4] Updating Organism Accessions, Promoters, and Genomes
from NCBI & UCSC
Starting with HOMER v4.4, the auxiliary scripts used to
generate HOMER annotation packages are now part of the
software release. This means that you can now
update HOMER annotations whenever you like, and also
add organisms and genomes such that they
are prepared the same way as most HOMER genomes and
annotations.
The following update scripts are located in the update/
directory and can be used for the following:
- Update and/or add organism gene accessions and
conversion tables, GO and pathway ontologies [updateGeneIdentifiers.pl]
- Update and/or add genomes and annotations maintained
by the UCSC Genome Browser [updateUCSCGenomeAnnotations.pl]
- Update and/or add promoter sets based on genome
annotations [updatePromoters.pl]
- Update Transcription Factor motif libraries [updateMotifFiles.pl]
NOTE: These programs are designed to be
executed from the update/ directory - do not
attempt to run them from a different directory.
Also, they should be executed in order when performing a
full update (i.e. update the gene identifiers before
updating the genome, etc.)
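The ordering requirement can be captured in a small wrapper. The sketch below is only an illustration: run_homer_updates and the DRY_RUN variable are made up for this example, the manifest names are the example files mentioned in this section, and updateMotifFiles.pl is left out because its arguments are not described here.

```shell
# Run (or, with DRY_RUN=1, only print) three update steps in order, each
# from inside the update/ directory given as the first argument.
run_homer_updates() {
    update_dir=$1
    for step in "updateGeneIdentifiers.pl taxids.tsv" \
                "updateUCSCGenomeAnnotations.pl ucsc.txt" \
                "updatePromoters.pl promoters.txt"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "(cd $update_dir && ./$step)"
        else
            # $step is left unquoted on purpose so it splits into script + manifest.
            # Stop at the first failure, since the steps are meant to run in order.
            (cd "$update_dir" && ./$step) || return 1
        fi
    done
}
```

Setting DRY_RUN=1 first is a cheap way to see exactly what would be executed before committing to downloads that can take hours.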
WARNING: These scripts *should* work on both UNIX/Linux
and Mac OS based systems. However, there are many
small differences between how these systems operate, so
if you encounter trouble please contact us.
General Concept behind update/ scripts:
Each of these scripts will attempt to download
information from various sources using wget to
fetch files over your internet connection. Most
of it comes from NCBI Gene Database and the UCSC
Genome Browser, but other sources are included as
well. The organisms, genomes, etc. that are
incorporated are determined by initial manifest tables
that are provided as the primary input file to each of
the commands. Examples are in the update/
directory and described in greater detail below.
In each case, the update programs will download the
appropriate data, parse it and perform any initial
data organization or manipulation that is necessary,
and then automatically configure the data for use
with HOMER's configuration management. It will
then be ready to use with the rest of HOMER.
Updating Gene Identifiers, Accession number mappings,
and Gene Ontology/Pathway resources (i.e. Organism
packages)
The updateGeneIdentifiers.pl script can be
used to update and add organism packages to
HOMER. To run the script you must create an
organism manifest file, or use/modify the example
provided in the update/ directory
("taxids.tsv"). The file is a tab-delimited text
file with the following columns:
- NCBI Taxonomy ID for the organism (i.e. human is
9606)
- The common name for the organism that HOMER will
use to reference the data later in commands like convertIDs.pl
(i.e. human, mouse, etc.)
- The full species name for the organism.
- The version specification of the package HOMER
creates. You may use whatever you want for
this value - if it starts with a "v" (i.e. v1.0),
the configureHomer.pl script may attempt to
modify the package later. If it starts with
anything else, configureHomer.pl should
ignore it when attempting to update HOMER.
(Check out taxids.tsv in the update/ directory for an
example)
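A manifest with the four columns listed above can be written with printf, which makes the tab characters explicit. In this sketch the function names and the version string "custom1" are placeholders (chosen so that the version does not start with "v"), and whether the script expects a header line is not covered here, so compare the result with the shipped taxids.tsv.

```shell
# Write a four-column, tab-delimited organism manifest:
# taxonomy ID, common name, full species name, package version.
write_taxid_manifest() {
    out=$1
    {
        printf '%s\t%s\t%s\t%s\n' 9606  human 'Homo sapiens' custom1
        printf '%s\t%s\t%s\t%s\n' 10090 mouse 'Mus musculus' custom1
    } > "$out"
}

# Report any row that does not have exactly the expected number of
# tab-separated fields; the exit status is non-zero if a bad row was found.
check_manifest_columns() {
    awk -F '\t' -v n="$2" '
        NF != n { printf "line %d has %d fields\n", NR, NF; bad = 1 }
        END { exit bad }' "$1"
}
```

Running check_manifest_columns my_taxids.tsv 4 on a hand-edited file catches the usual culprit, which is spaces typed where tabs were intended.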
To run the command (from the update/ directory):
./updateGeneIdentifiers.pl taxids.tsv
# or, to include some common Affymetrix ID conversions:
./updateGeneIdentifiers.pl taxids.tsv externalIDs.tsv
The gene identifier update process
Gene identifiers are built around the NCBI Gene
Database, which depending on your point of view is
one of the more complete databases of information on
gene accession information available across a wide
range of species. The script will basically
raid all of the files on the NCBI Gene FTP site
(ftp://ftp.ncbi.nih.gov/gene/DATA/) and process them
to create ID conversion tables for each of the
organisms in the input manifest file (i.e.
taxids.tsv). It will also download the primary
uniprot data files
(ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/).
The GO/pathway update process
By Gene Ontology(GO) and pathway
information, what we really mean is any type of gene
groupings that could provide information about how a
group of genes are connected. The Gene Ontology
is by far the most used database for functional
enrichment calculations, but there are many other
sources worth considering. HOMER downloads the
GO trees directly from the GO website, and uses the
NCBI Gene database mappings to populate the genes in
the GO tree. HOMER also parses the Gene
annotation in NCBI Gene and Uniprot files to identify
genes with common protein domains, chromosome
locations, and protein-protein interactions.
HOMER also downloads files from the new NCBI
biosystems database, which include KEGG, Pathway
Interaction Database, REACTOME, BIOCYC, Lipid Maps,
and Wikipathways databases. Finally, it
downloads genes with GWAS hits (GWAS Catalog) and/or
ones mutated in the same cancer (COSMIC database).
Downloading uniprot_trembl.dat
The most resource intensive aspect of the updateGeneIdentifiers.pl
script is the time it takes to download the
uniprot/trembl data files and parse them. The
script that does this is relatively inefficient and
uses up to 30 GB of RAM.
Downloading the file itself can also take several
hours - if you download it once and leave it in the
update/ directory, the program will NOT download it
again if it finds it. This can save time...
Updating Genomes and Genomic Annotations from the UCSC
Genome Browser (i.e. Genome packages)
updateUCSCGenomeAnnotations.pl helps automate
the process of compiling key information for genomes
maintained by the UCSC Genome Browser. Just like
the updateGeneIdentifiers.pl script, it
requires an input file containing key information
about the genomes you want to create packages out
of. The file is a tab-delimited text file
(example file in the update/ directory is
ucsc.txt). The following are the columns
required (Changed for v4.5):
- Name of the UCSC Genome (i.e. hg19, mm9, ce10,
etc.)
- Organism name (ideally the organism used for the
organism package, i.e. human, mouse, etc.)
- The version specification of the package HOMER
creates. You may use whatever you want for
this value - if it starts with a "v" (i.e. v1.0),
the configureHomer.pl script may attempt to
modify the package later. If it starts with
anything else, configureHomer.pl should
ignore it when attempting to update HOMER.
(Check out ucsc.txt in the update/ directory for an
example)
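Because the version column decides whether configureHomer.pl may later touch a package, it can help to list which manifest rows carry a version starting with "v". The helper below is a hypothetical sketch; the column number is passed in because the version is the third column in this manifest but the fourth in the organism manifest.

```shell
# Print the first field of every row whose version field starts with "v".
# Arguments: manifest file, 1-based column number of the version field.
rows_with_v_version() {
    awk -F '\t' -v col="$2" '$col ~ /^v/ { print $1 }' "$1"
}
```

For the genome manifest this would be called as rows_with_v_version ucsc.txt 3; any genome it prints is one that a later configureHomer.pl update may modify.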
To run the command (from the update/ directory):
./updateUCSCGenomeAnnotations.pl ucsc.txt
The Genome Update Process
Similar to the other update scripts, this one will
extract the genome FASTA file(s) and key files from
the annotation database at UCSC for each
organism. Gene annotations are built from the
refGene.txt files stored for each genome, and repeat
definitions are derived from repeat mask
files. These files are parsed into the various
files found in the genome directories.
Potential Problems when adding new organisms
The UCSC Genome Browser is nearly infallible due to
the incredible resource they provide the community,
but sometimes the information about each organism is
stored slightly differently and the current update
scripts in HOMER may not be looking for the correct
files (or they have a slightly different format,
etc.). If this happens let me know and maybe
we can adjust things to fix it.
Updating Promoters from available Genomes (i.e.
Promoter packages)
updatePromoters.pl works a lot like the other
update scripts, and takes a promoter manifest file as
input. The file should be a tab-delimited text
file with the following columns:
- Promoter Set Name (package name to be used with
findMotifs.pl)
- Genome to use (i.e. hg19)
- Offset
(Check out promoters.txt in the update/ directory for an
example)
To run the command (from the update/ directory):
./updatePromoters.pl promoters.txt
The Promoter Update Process
This script will automatically find the TSS
locations in the genome annotation and create promoter
sets using sequences from these regions. The
script basically automates the execution of loadPromoters.pl.
Creating mRNA, 3'UTR, and 5'UTR analysis sets
If you feel like trying your luck at RNA
motif finding, you can specify "-rna" at the
command line when running updatePromoters.pl
to have the program automatically download RefSeq RNAs
from UCSC and set them up as promoter sets.
Updating Motif Files
The updateMotifFiles.pl script is useful for
two purposes - it will re-download JASPAR motif
matrices and incorporate them in the motif files HOMER
uses. It also provides you an opportunity to add
your own motifs in the motif/ directory. Motifs
placed in the correct directories in the motif/ folder
will be incorporated into the final files.
[5] Customizing HOMER and the organization of the
software
What follows is a short description of how
HOMER is organized - as some researchers may want to force
HOMER to do things that aren't available out of the box,
this might help them accomplish this successfully!
HOMER configuration is stored in a file named "config.txt" which is
located in the base Homer directory. This is a
tab-delimited file that is read by various programs to
determine where certain data is stored. Directories
to genome or promoter based data are stored here (given
relative to the base Homer directory).
Other standard files, such as a README.txt, COPYING, and Homer.pdf documentation are also found in
this directory, as well as the configureHomer.pl script and the update.txt file which
is downloaded each time configureHomer.pl
is invoked.
Sub-Directories:
bin/
- location of all perl scripts and executable programs
that are a part of HOMER. There is a lot of "stuff" in
here, some of which are half finished, abandoned, or
simply don't work. These pages only talk about the
ones that do work :)
cpp/ - location
of c++ source files. Parts of the program which
need to be fast and/or memory efficient are written in
c++. As time goes on, and data sets get bigger,
I've been slowly migrating perl programs to c++. I
love perl - it's much much faster to write a useful
program, but in the end c++ is much much faster at
executing.
motifs/ - location to put custom motif files you
may want to reuse
update/ - location of annotation update
scripts. Also contains some specialized
information (such as organism specific motifs,
affymetrix probeID conversion files, etc.) (new in v4.4)
data/ -
location of all the data files for HOMER
data/accession/
- location of flat files for accession number
conversion. For each organism, there is a org2gene.tsv and org.description, both of
which are tab-delimited text files, which are used for
ID conversion and annotation information.
data/GO/ -
gene ontology files (*.genes) that are tab-delimited
text files with GO ID, GO name, and a comma separated
list of gene IDs for various "ontologies". These
files are species independent (contain IDs from
several organisms). The names of the files are
hard-coded in the gene ontology program, so you can
either replace the files with something you are
interested in or change the hard-coded file names in the findGO.pl program.
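Given the three-column layout described above (GO ID, GO name, comma-separated gene IDs), looking up which terms contain a particular gene takes only a few lines of awk. This is a sketch with a made-up function name; it assumes every row follows that layout and does not account for any header or comment lines a real *.genes file might contain.

```shell
# Print "GO ID<TAB>GO name" for every row whose comma-separated gene list
# (third column) contains the given gene ID as an exact match.
terms_for_gene() {
    awk -F '\t' -v gene="$2" '
        { n = split($3, ids, ",")
          for (i = 1; i <= n; i++)
              # Appending "" forces a string comparison rather than a numeric one.
              if ((ids[i] "") == (gene "")) { print $1 "\t" $2; break } }
    ' "$1"
}
```

The same layout is what you would need to reproduce when swapping in your own gene groupings.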
data/knownTFs/
- This directory contains motif libraries used for
checking the identities of de novo motifs (all.motifs - most
of which come from JASPAR), and a
list of previously found motifs (known.motifs) used
for checking the enrichment of known motifs.
These files can be replaced with similarly formatted
files if you wish. There is also a sub
directory, named "data/knownTFs/motifs/",
which contains *.motif
files for my own personal motif library (to be used
with other applications such as annotatePeaks.pl).
data/misc/ - I
guess if you don't like reading about a legendary human
being at the bottom of your motif finding results, you
could delete or change this file. Be warned -
I'm not responsible if you end up getting a swift
roundhouse kick to the face.
data/promoters/
- files used for promoter motif finding. For
each promoter set (called "name"), there are several files:
- name.seq (promoter sequence)
- name.mask (repeat-masked promoter sequence)
- name.cgbins (assignments of promoters to
different CpG classes)
- name.cgfreq (CpG/GC frequencies)
- name.pos (genomic positions of promoters)
- name.redun (mapping between redundant promoters
i.e. share similar sequence)
- name.cons (might be missing - 0-9 phastCons
score of promoter sequence)
- name.base (IDs to use as background - typically
expressed or confident promoters)
- name.base.gene (name.base in terms of gene ids
to use with gene ontology analysis)
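When building or spoofing a promoter set by hand, a quick check of which of the files listed above are present can save some head-scratching. The sketch below simply tests for each file name; missing_promoter_files is a hypothetical helper, and a missing name.cons is expected in some cases, as noted in the list.

```shell
# List which of the per-set files are missing for a promoter set.
# Arguments: the promoters directory, the promoter set name.
missing_promoter_files() {
    dir=$1
    name=$2
    for ext in seq mask cgbins cgfreq pos redun cons base base.gene; do
        [ -e "$dir/$name.$ext" ] || echo "$name.$ext"
    done
}
```

For example, missing_promoter_files /path-to-homer/data/promoters human would print nothing if every file for the "human" set is in place.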
data/genomes/
- each genome has its own directory.
Within each directory are the *.fa files or *.fa.masked files
containing the genome sequence. In addition,
there are several annotation files:
- *.fa or *.fa.masked files for each chromosome
- genome.tss (positions of refseq transcription
start sites)
- genome.tts (positions of refseq transcription
termination sites)
- genome.splice3p (positions of refseq 3' splice
sites)
- genome.splice5p (positions of refseq 5' splice
sites)
- genome.aug (positions of refseq translation
start codons)
- genome.stop (positions of refseq translation
stop codons)
- genome.rna (refseq RNA definition file)
- genome.repeats.rna (repeat RNA definition file)
- genome.basic.annotation
(exon/intron/TSS/TTS/intergenic region
annotations)
- genome.full.annotation (basic with CpG island
and repeats annotated)
- conservation/ subdirectory (contains
"FASTA-like" files with phastcons information) -
this is being phased out
- annotation/ subdirectory (contains annotation
definitions for the GenomeOntology)