
HOMER

Software for motif discovery and next-gen sequencing analysis



Configuring HOMER

To keep analyses standardized, HOMER organizes promoters, genome sequences, and annotation into packages.  Versions are based on assemblies from the UCSC Genome Browser.  Accession numbers, gene ontology definitions, and motif libraries are all part of the standard HOMER installation.

Basic configuration of HOMER

Configuration is handled automatically through the configureHomer.pl script, which should reside in the directory where HOMER is installed (e.g. /path-to-homer/).  To see which packages are available, run the configureHomer.pl script with the -list option:

perl /path-to-homer/configureHomer.pl -list

Every time you run the configureHomer.pl script, it attempts to update the list of available packages by downloading the update.txt file from homer.salk.edu.  Using this file, the program determines which packages are installed and which are available to download.

To install or remove any packages, simply rerun the command using "-install <package name>" or "-remove <package name>".

perl /path-to-homer/configureHomer.pl -install human

This would configure HOMER for analysis of human promoters.
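
If you find yourself installing and removing packages often, the commands above can be wrapped in a small helper.  This is only a sketch: the function name homer_pkg and the HOMER_DIR and HOMER_DRY_RUN variables are made up for this example and are not part of HOMER.  It uses only the -list, -install and -remove options described above, and by default it just prints the command it would run.

```shell
# Hypothetical helper, not part of HOMER.
# HOMER_DIR: directory holding configureHomer.pl (placeholder path from the text above).
HOMER_DIR="${HOMER_DIR:-/path-to-homer}"

homer_pkg() {
    # $1 = action (list, install or remove); $2 = package name (not needed for list)
    action="$1"
    pkg="$2"
    case "$action" in
        list)
            set -- perl "$HOMER_DIR/configureHomer.pl" -list ;;
        install|remove)
            if [ -z "$pkg" ]; then
                echo "usage: homer_pkg install|remove <package name>" >&2
                return 1
            fi
            set -- perl "$HOMER_DIR/configureHomer.pl" "-$action" "$pkg" ;;
        *)
            echo "unknown action: $action" >&2
            return 1 ;;
    esac
    if [ "${HOMER_DRY_RUN:-1}" = "1" ]; then
        echo "$*"      # dry run (default): only print the command
    else
        "$@"           # actually run configureHomer.pl
    fi
}
```

For example, "homer_pkg install human" prints the install command shown above; setting HOMER_DRY_RUN=0 first would run it for real.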

You may notice that a package name may end in "-p" (e.g. "human-p"), "-o", or "-g".  These suffixes disambiguate packages that share a name across sections (e.g. -p for promoters).  Overall, HOMER packages come in 4 types:
  • SOFTWARE - The HOMER code itself.  This package contains all of the code plus some general data files, such as motif matrices.
  • ORGANISMS - Species-specific packages contain accession conversion data, gene descriptions, and GO analysis files specific to each organism.  Most are based on NCBI Gene database information.
  • PROMOTERS - Promoter sequences and related files for analyzing promoters for motif enrichment.  Most often based on RefSeq transcript definitions.  Packages with "-mRNA" in the name contain RNA sequence for analysis of RNA instead of DNA.
  • GENOMES - Genomic sequence and annotation information
You may also notice that if you download the hg19 genome, it will automatically download the 'human' organism package.  Each time you download a promoter or genome package, it will check to make sure you have the Organism package too.
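
The suffix convention can be sketched as a tiny lookup.  Only "-p" for promoters is stated above; reading "-o" as organisms and "-g" as genomes is an assumption based on the section names, and the function name homer_pkg_section is invented for this example.

```shell
# Guess which section a package belongs to from its disambiguating suffix.
# Stated in the text: "-p" = promoters.
# Assumed (not confirmed in the text): "-o" = organisms, "-g" = genomes.
homer_pkg_section() {
    case "$1" in
        *-p) echo "PROMOTERS" ;;
        *-o) echo "ORGANISMS" ;;
        *-g) echo "GENOMES" ;;
        *)   echo "unknown" ;;   # no suffix: the name alone does not say
    esac
}
```

Names without a suffix (most of them) come back as "unknown", so the output of configureHomer.pl -list remains the place to check which section a package is in.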

Custom Genomes

If your favorite genome, promoter locations, or even organisms are not in the HOMER configuration list, don't panic! As of v4.4, HOMER organizes all of the annotation data scripts so that it is relatively easy to configure your own annotations for use with HOMER.  This is covered in the next section, Updating & Customizing HOMER.

Organization of HOMER

What follows is a short description of how HOMER is organized.  Some researchers may want to force HOMER to do things that aren't available out of the box, and this might help them do so successfully!

HOMER configuration is stored in a file named "config.txt", which is located in the base HOMER directory.  This is a tab-delimited file that is read by various programs to determine where certain data is stored.  Paths to genome- or promoter-based data are stored here (given relative to the base HOMER directory).
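
If you want to peek at config.txt without squinting at tabs, something like the following may help.  It is a generic sketch for any tab-delimited file: the meaning of the individual columns is not described here, so the code makes no assumptions about them, and the function name show_config is invented for this example.

```shell
# Print each non-empty line of a tab-delimited file with its fields
# joined by " | " so that column boundaries are easy to see.
# $1 = path to the file (e.g. the config.txt in the base HOMER directory)
show_config() {
    awk -F'\t' 'NF > 0 {
        out = $1
        for (i = 2; i <= NF; i++) out = out " | " $i
        print out
    }' "$1"
}
```

Usage would look like "show_config /path-to-homer/config.txt", where /path-to-homer is the same placeholder used above.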

Other standard files, such as README.txt, COPYING, and the Homer.pdf documentation, are also found in this directory, as well as the configureHomer.pl script and the update.txt file, which is downloaded each time configureHomer.pl is invoked.

Sub-Directories:

bin/ - location of all perl scripts and executable programs that are part of HOMER.  There is a lot of "stuff" in here, some of which is half finished, abandoned, or simply doesn't work.  These pages only talk about the ones that do work :)

cpp/ - location of c++ source files.  Parts of the program which need to be fast and/or memory efficient are written in c++.  As time goes on, and data sets get bigger, I've been slowly migrating perl programs to c++.  I love perl - it's much much faster to write a useful program, but in the end c++ is much much faster at executing. 

update/ - location of annotation update scripts.  Also contains some specialized information (such as organism specific motifs, affymetrix probeID conversion files, etc.) (new in v4.4)

data/ - location of all the data files for HOMER

data/accession/ - location of flat files for accession number conversion.  For each organism, there is an org2gene.tsv and an org.description file, both of which are tab-delimited text files used for ID conversion and annotation information.

data/GO/ - gene ontology files (*.genes) that are tab-delimited text files with GO ID, GO name, and a comma-separated list of gene IDs for various "ontologies".  These files are species independent (contain IDs from several organisms).  The names of the files are hard-coded in the gene ontology program, so you can either replace the files with something you are interested in or change the hard-coded file names in the findGO.pl program.
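
Because the *.genes format is so simple, it is easy to query directly.  The sketch below looks up every term that lists a given gene ID.  It assumes the three columns appear in the order given above (ID, name, comma-separated gene IDs) and that there are no header lines to skip; the function name go_terms_for_gene is invented for this example and is not a HOMER program.

```shell
# Print "ID<TAB>name" for every term whose gene list contains the given gene ID.
# Assumed column order: 1 = GO ID, 2 = GO name, 3 = comma-separated gene IDs.
# $1 = gene ID to look for, $2 = path to a *.genes file
go_terms_for_gene() {
    awk -F'\t' -v gene="$1" '
        NF >= 3 {
            n = split($3, ids, ",")
            for (i = 1; i <= n; i++) {
                # Append "" to force string comparison, so 0123 and 123 stay distinct.
                if ((ids[i] "") == (gene "")) {
                    print $1 "\t" $2
                    break
                }
            }
        }' "$2"
}
```

The same three-column layout is what you would need to produce if you replace one of these files with your own gene sets.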

data/knownTFs/ - This directory contains motif libraries used for checking the identities of de novo motifs (all.motifs - most of which come from JASPAR), and a list of previously found motifs (known.motifs) used for checking the enrichment of known motifs.  These files can be replaced with similarly formatted files if you wish.  There is also a subdirectory, named "data/knownTFs/motifs/", which contains *.motif files for my own personal motif library (to be used with other applications such as annotatePeaks.pl).

data/misc/ - I guess if you don't like reading about a legendary human being at the bottom of your motif finding results, you could delete or change this file.  Be warned - I'm not responsible if you end up getting a swift roundhouse kick to the face.

data/promoters/ - files used for promoter motif finding.  For each promoter set (called "name"), there are several files:
  • name.seq (promoter sequence)
  • name.mask (repeat-masked promoter sequence)
  • name.cgbins (assignments of promoters to different CpG classes)
  • name.cgfreq (CpG/GC frequencies)
  • name.pos (genomic positions of promoters)
  • name.redun (mapping between redundant promoters i.e. share similar sequence)
  • name.cons (might be missing - 0-9 phastCons score of promoter sequence)
  • name.base (IDs to use as background - typically expressed or confident promoters)
  • name.base.gene (name.base in terms of gene ids to use with gene ontology analysis) 
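
If you are building or debugging a promoter set, a quick check of which of the files listed above are present can save some head scratching.  This is a minimal sketch: the function name check_promoter_set is invented for this example, and the default directory is just the placeholder path used earlier on this page.

```shell
# Report which of the promoter-set files described above exist.
# $1 = promoter set name, $2 = promoters directory (optional)
check_promoter_set() {
    name="$1"
    dir="${2:-/path-to-homer/data/promoters}"
    for ext in seq mask cgbins cgfreq pos redun cons base base.gene; do
        if [ -e "$dir/$name.$ext" ]; then
            echo "found   $name.$ext"
        else
            echo "missing $name.$ext"
        fi
    done
}
```

Keep in mind that a missing name.cons file is not necessarily a problem, since that file may be absent.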
data/genomes/ - each genome has its own directory.  Within each directory are the *.fa files or *.fa.masked files containing the genome sequence.  In addition, there are several annotation files:
  • *.fa or *.fa.masked files for each chromosome
  • genome.tss (positions of refseq transcription start sites)
  • genome.tts (positions of refseq transcription termination sites)
  • genome.splice3p (positions of refseq 3' splice sites)
  • genome.splice5p (positions of refseq 5' splice sites)
  • genome.aug (positions of refseq translation start codons)
  • genome.stop (positions of refseq translation stop codons)
  • genome.rna (refseq RNA definition file)
  • genome.repeats.rna (repeat RNA definition file)
  • genome.basic.annotation (exon/intron/TSS/TTS/intergenic region annotations)
  • genome.full.annotation (basic with CpG island and repeats annotated)
  • conservation/ subdirectory (contains "FASTA-like" files with phastcons information) - this is being phased out
  • annotation/ subdirectory (contains annotation definitions for the GenomeOntology)

Next: Updating and Customizing HOMER




Can't figure something out? Questions, comments, concerns, or other feedback:
cbenner@ucsd.edu