Software for motif discovery and next-gen sequencing analysis

Annotating Regions in the Genome (annotatePeaks.pl)

Homer contains a useful, all-in-one program for performing peak annotation called annotatePeaks.pl.  In addition to associating peaks with nearby genes, annotatePeaks.pl can perform Gene Ontology Analysis, genomic feature association analysis (Genome Ontology), associate peaks with gene expression data, calculate ChIP-Seq Tag densities from different experiments, and find motif occurrences in peaks.  annotatePeaks.pl can also be used to create histograms and heatmaps.  Description of the annotation functions are covered below, while quantification of tags, motifs, histograms, etc. are covered here.

NOTE: If you're running annotatePeaks.pl on your laptop, you may want to use "-noann" to skip the full annotation routines, which use a bit of memory (up to 4Gb)

Basic usage:

annotatePeaks.pl <peak/BED file> <genome>   > <output file>

i.e. annotatePeaks.pl ERpeaks.txt hg18   > outputfile.txt

The first two arguments, the <peak file> and <genome>, are required, and must be the first two arguments.  Other optional command line arguments can be placed in any order after the first two.  By default, annotatePeaks.pl prints the program output to stdout, which can be captured in a file by appending " > filename" to the command.  With most uses of annotatePeaks.pl, the output is a data table that is meant to be opened with EXCEL or similar program.  An example of the output can been seen below:


Acceptable Input files

annotatePeaks.pl accepts HOMER peak files or BED files:

HOMER peak files should have at minimum 5 columns (separated by TABs, additional columns will be ignored):
  • Column1: Unique Peak ID
  • Column2: chromosome
  • Column3: starting position
  • Column4: ending position
  • Column5: Strand (+/- or 0/1, where 0="+", 1="-")
BED files should have at minimum 6 columns (separated by TABs, additional columns will be ignored)
  • Column1: chromosome
  • Column2: starting position
  • Column3: ending position
  • Column4: Unique Peak ID
  • Column5: not used
  • Column6: Strand (+/- or 0/1, where 0="+", 1="-")
In theory, HOMER will accept BED files with only 4 columns (+/- in the 4th column), and files without unique IDs, but this is NOT recommended.

Mac Users: If using a EXCEL to prepare input files, make sure to save files as a "Text (Windows)" if running MacOS - saving as "Tab delimited text" in Mac produces problems for the software.  Otherwise, you can run the script "changeNewLine.pl <filename>" to convert the Mac-formatted text file to a Windows/Dos/Unix formatted text file.

If errors occur, it is likely that the file is not in the correct format, or the first column is not actually populated with unique identifiers.

How Basic Annotation Works

The process of annotating peaks/regions is divided into two primary parts.  The first determines the distance to the nearest TSS and assigns the peak to that gene.  The second determines the genomic annotation of the region occupied by the center of the peak/region.

Distance to the nearest TSS

By default, annotatePeaks.pl loads a file in the "/path-to-homer/data/genomes/<genome>/<genome>.tss" that contains the positions of RefSeq transcription start sites.  It uses these positions to determine the closest TSS, reporting the distance (negative values mean upstream of the TSS, positive values mean downstream), and various annotation information linked to locus including alternative identifiers (unigene, entrez gene, ensembl, gene symbol etc.).  This information is also used to link gene-specific information (see below) to a peak/region, such as gene expression.

Genomic Annotation

To annotate the location of a given peak in terms of important genomic features, annotatePeaks.pl calls a separate program (assignGenomeAnnotation) to efficiently assign peaks to one of millions of possible annotations genome wide.  Two types of output are provided.  The first is "Basic Annotation" that includes whether a peak is in the TSS (transcription start site), TTS (transcription termination site), Exon (Coding), 5' UTR Exon, 3' UTR Exon, Intronic, or Intergenic, which are common annotations that many researchers are interested in.  A second round of "Detailed Annotation" also includes more detailed annotation, also considering repeat elements and CpG islands.  Since some annotation overlap, a priority is assign based on the following (in case of ties it's random [i.e. if there are two overlapping repeat element annotations]):
  1. TSS (by default defined from -1kb to +100bp)
  2. TTS (by default defined from -100 bp to +1kb)
  3. CDS Exons
  4. 5' UTR Exons
  5. 3' UTR Exons
  6. **CpG Islands
  7. **Repeats
  8. Introns
  9. Intergenic
** Only applicable for the "Detailed Annotation".

Although HOMER doesn't allow you to explicitly change the definition of the region that is the TSS (-1kb to +100bp), you can "do it yourself" by sorting the annotation output in EXCEL by the "Distance to nearest TSS" column, and selecting those within the range you are interested in.

Using Custom Annotations

Custom Gene Annotation Definition using GTF files

RefSeq doesn't do it for everyone.  There are many other quality gene annotations out there, including UCSC genes, Ensembl, and Gencode to name a couple.  Even more important, as RNA-Seq methods develop, the locations of exons etc. can be defined based on your own experimental RNA data rather than using a static database to define transcripts.  You can download GTF files for various gene definitions from the UCSC Genome Browser in the "Table Browser" section.  Custom GTF files can be created from RNA-Seq data using tools like Cufflinks.

HOMER can process GTF (Gene Transfer Format) files and use them for annotation purposes ("-gtf <gtf filename>").  If a GTF file is specified, HOMER will parse it and use the TSS from the GTF file for determining the distance to the nearest TSS.  It will also use the GTF file's definition of TSS/TTS/exons/Introns for Basic Genome Annotation.  The original HOMER annotation files are still used for the "Detailed Annotation" since the repeats will not be define in the GTF files.

i.e. annotatePeaks.pl ERpeaks.txt hg18 -gtf gencode.gtf  > outputfile.txt

HOMER will try it's best to take the "transcript_id" from the GTF definition and translate it into a known gene identifier.  If it can't match it to a known gene, many of the annotation columns corresponding to Unigene etc. will be empty.

HOMER can also work with GFF and GFF3 files - to a degree.  The problem with these is that the formats do not have the same strict naming convention enforced in GTF files.  To use them, substitute "-gtf <GTF file>" with "-gff <GFF file>" or "-gff3 <GFF3 file>"

Custom Promoter Locations

By default, annotatePeaks.pl assigns peaks to the nearest TSS.  If you have a custom peak/pos file of these locations, you can supply that with the option "-cTSS <peak/pos file>".  This will override the default, which is RefSeq TSS.
NOTE: For extreme control over annotations and use of the '-ann' option, check out the advanced annotation page.

Using Custom Genomes/Annotations

For organisms with relatively incomplete genomes, annotatePeaks.pl can still provide some functionality.  If the genome is not available as a pre-configured genome in HOMER, then you can supply the path to the full genome FASTA file or path to directory containing chromsome FASTA files as the 2nd argument.  For example, lets say you were sequencing the banana slug genome, and had downloaded the current draft of the genome into the file "bananaSlug.fa".  You could then run annotatePeaks.pl like this:

annotatePeaks.pl chip-seq-peaks.txt bananaSlug.fa > output.txt

If no genome sequence is available, you can also specify "none":

annotatePeaks.pl chip-seq-peaks.txt none > output.txt

You may also find a custom annotation file for the organism, such as banana_slug_genes.gtf, or banana_slug_genes.gff from the community website.  This can then be used to help annotate your data:

annotatePeaks.pl chip-seq-peaks.txt bananaSlug.fa -gtf banana_slug_genes.gtf > output.txt
annotatePeaks.pl chip-seq-peaks.txt bananaSlug.fa -gff banana_slug_genes.gff > output.txt

In the case of custom genomes, HOMER will not be able to convert gene IDs to names - you may have to do this yourself, but many of the other features, including motif finding etc., are available as long as you have the sequence.

NOTE: For extreme control over annotations and use of the '-ann' option, check out the advanced annotation page.

Basic Annotation File Output

Description of Columns:
  1. Peak ID
  2. Chromosome
  3. Peak start position
  4. Peak end position
  5. Strand
  6. Peak Score
  7. FDR/Peak Focus Ratio/Region Size
  8. Annotation (i.e. Exon, Intron, ...)
  9. Detailed Annotation (Exon, Intron etc. + CpG Islands, repeats, etc.)
  10. Distance to nearest RefSeq TSS
  11. Nearest TSS: Native ID of annotation file
  12. Nearest TSS: Entrez Gene ID
  13. Nearest TSS: Unigene ID
  14. Nearest TSS: RefSeq ID
  15. Nearest TSS: Ensembl ID
  16. Nearest TSS: Gene Symbol
  17. Nearest TSS: Gene Aliases
  18. Nearest TSS: Gene description
  19. Additional columns depend on options selected when running the program.
As of now, basic annotation is based on alignments of RefSeq transcripts to the UCSC hosted genomes.

Adding Gene Expression Data

annotatePeaks.pl can add gene-specific information to peaks based on each peak's nearest annotated TSS.  To add gene expression or other data types, first create a gene data file (tab delimited text file) where the first column contains gene identifiers, and the first row is a header describing the contents of each column.  In principle, the contents of these columns doesn't mater.  To add this information to the annotation result, use the "-gene <gene data file>".

annotation.pl <peak file> <genome> -gene <gene data file>   > output.txt

For peaks that are near genes with associated data in the "gene data file", this data will be appended to the end of the row for each peak.

Peak Annotation Enrichment

Gene Ontology Analysis of Associated Genes

annotatePeaks.pl offers two types of annotation enrichment analysis.  The first is based on Gene Ontology classifications for genes.  The idea is that your regions (i.e. ChIP-Seq peaks) might be preferentially found near genes with specific biological functions.  By specifying "-go <GO output directory>", HOMER will take the list of genes associated with your regions and search for enriched functional categories, placing the output in the indicated directory.  This is just like looking for GO enrichment in a set of regulated genes (covered in greater detail here). 

There are some caveats to this type of analysis - for example, some genes are simply more likely to be considered in the analysis if there are isolated along the genome where they are likely to be the "closest gene" for peaks in the region.  But... since gene density roughly correlates with transcription factor density, it works itself out to some degree.  Other tools, such as GREAT attempt to deal with this problem to some degree, and are worth trying.  My experience is that the results are not that different once you factor in the fact that most other tools use "GO slims".

Genome Ontology: Looking for Enriched Genomic Annotations

The Genome Ontology looks for enrichment of various genomic annotations in your list of peaks/regions.  Just as the Gene Ontology contains groups of genes sharing a biological function, the Genome Ontology contains groups of genomic regions sharing an annotation, such as TSS, or LINE repeats, coding exons, Transcription factor peaks, etc.  To run the Genome Ontology, add the option "-genomeOntology <output directory>".  A much more detailed description of the Genome Ontology will be available here (coming soon).

Command line options for annotatePeaks.pl

    Usage: annotatePeaks.pl <peak file | tss> <genome version>  [additional options...]

    Available Genomes (required argument): (name,org,directory,default promoter set)

    User defined annotation files (default is UCSC refGene annotation):
        annotatePeaks.pl accepts GTF (gene transfer formatted) files to annotate positions relative
        to custom annotations, such as those from de novo transcript discovery or Gencode.
        -gtf <gtf format file> (-gff and -gff3 can work for those files, but GTF is better)

    Peak vs. tss/tts/rna mode (works with custom GTF file):
        If the first argument is "tss" (i.e. annotatePeaks.pl tss hg18 ...) then a TSS centric
        analysis will be carried out.  Tag counts and motifs will be found relative to the TSS.
        (no position file needed) ["tts" now works too - e.g. 3' end of gene]
        ["rna" specifies gene bodies, will automaticall set "-size given"]
        NOTE: The default TSS peak size is 4000 bp, i.e. +/- 2kb (change with -size option)
        -list <gene id list> (subset of genes to perform analysis [unigene, gene id, accession,
             probe, etc.], default = all promoters)
        -cTSS <promoter position file i.e. peak file> (should be centered on TSS)

    Primary Annotation Options:
        -mask (Masked repeats, can also add 'r' to end of genome name)
        -m <motif file 1> [motif file 2] ... (list of motifs to find in peaks)
            -mscore (reports the highest log-odds score within the peak)
            -nmotifs (reports the number of motifs per peak)
            -mdist (reports distance to closest motif)
            -mfasta <filename> (reports sites in a fasta file - for building new motifs)
            -fm <motif file 1> [motif file 2] (list of motifs to filter from above)
            -rmrevopp <#> (only count sites found within <#> on both strands once, i.e. palindromic)
            -matrix <prefix> (outputs a motif co-occurrence files:
                prefix.count.matrix.txt - number of peaks with motif co-occurrence
                prefix.ratio.matrix.txt - ratio of observed vs. expected  co-occurrence
                prefix.logPvalue.matrix.txt - co-occurrence enrichment
                prefix.stats.txt - table of pair-wise motif co-occurrence statistics
                additional options:
                -matrixMinDist <#> (minimum distance between motif pairs - to avoid overlap)
                -matrixMaxDist <#> (maximum distance between motif pairs)
            -mbed <filename> (Output motif positions to a BED file to load at UCSC (or -mpeak))
        -d <tag directory 1> [tag directory 2] ... (list of experiment directories to show
            tag counts for) NOTE: -dfile <file> where file is a list of directories in first column
        -bedGraph <bedGraph file 1> [bedGraph file 2] ... (read coverage counts from bedGraph files)
        -wig <wiggle file 1> [wiggle file 2] ... (read coverage counts from wiggle files)
        -p <peak file> [peak file 2] ... (to find nearest peaks)
            -pdist to report only distance (-pdist2 gives directional distance)
            -pcount to report number of peaks within region
        -vcf <VCF file> (annotate peaks with genetic variation infomation, one col per individual)
            -editDistance (Computes the # bp changes relative to reference)
            -individuals <name1> [name2] ... (restrict analysis to these individuals)
        -gene <data file> ... (Adds additional data to result based on the closest gene.
            This is useful for adding gene expression data.  The file must have a header,
            and the first column must be a GeneID, Accession number, etc.  If the peak
            cannot be mapped to data in the file then the entry will be left empty.
        -go <output directory> (perform GO analysis using genes near peaks)
        -genomeOntology <output directory> (perform genomeOntology analysis on peaks)
            -gsize <#> (Genome size for genomeOntology analysis, default: 2e9)

    Annotation vs. Histogram mode:
        -hist <bin size in bp> (i.e 1, 2, 5, 10, 20, 50, 100 etc.)
        The -hist option can be used to generate histograms of position dependent features relative
        to the center of peaks.  This is primarily meant to be used with -d and -m options to map
        distribution of motifs and ChIP-Seq tags.  For ChIP-Seq peaks for a Transcription factor
        you might want to use the -center option (below) to center peaks on the known motif
        ** If using "-size given", histogram will be scaled to each region (i.e. 0-100%), with
        the -hist parameter being the number of bins to divide each region into.
            Histogram Mode specific Options:
            -nuc (calculated mononucleotide frequencies at each position,
                Will report by default if extracting sequence for other purposes like motifs)
            -di (calculated dinucleotide frequencies at each position)
            -histNorm <#> (normalize the total tag count for each region to 1, where <#> is the
                minimum tag total per region - use to avoid tag spikes from low coverage
            -ghist (outputs profiles for each gene, for peak shape clustering)
            -rm <#> (remove occurrences of same motif that occur within # bp)

    Peak Centering: (other options are ignored)
        -center <motif file> (This will re-center peaks on the specified motif, or remove peak
            if there is no motif in the peak.  ONLY recentering will be performed, and all other
            options will be ignored.  This will output a new peak file that can then be reanalyzed
            to reveal fine-grain structure in peaks (It is advised to use -size < 200) with this
            to keep peaks from moving too far (-mirror flips the position)
        -multi (returns genomic positions of all sites instead of just the closest to center)

    Advanced Options:
        -len <#> / -fragLength <#> (Fragment length, default=auto, might want to set to 0 for RNA)
        -size <#> (Peak size[from center of peak], default=inferred from peak file)
            -size #,# (i.e. -size -10,50 count tags from -10 bp to +50 bp from center)
            -size "given" (count tags etc. using the actual regions - for variable length regions)
        -log (output tag counts as log2(x+1+rand) values - for scatter plots)
        -sqrt (output tag counts as sqrt(x+rand) values - for scatter plots)
        -strand <+|-|both> (Count tags on specific strands relative to peak, default: both)
        -pc <#> (maximum number of tags to count per bp, default=0 [no maximum])
        -cons (Retrieve conservation information for peaks/sites)
        -CpG (Calculate CpG/GC content)
        -ratio (process tag values as ratios - i.e. chip-seq, or mCpG/CpG)
        -nfr (report nuclesome free region scores instead of tag counts, also -nfrSize <#>)
        -norevopp (do not search for motifs on the opposite strand [works with -center too])
        -noadj (do not adjust the tag counts based on total tags sequenced)
        -norm <#> (normalize tags to this tag count, default=1e7, 0=average tag count in all directories)
        -pdist (only report distance to nearest peak using -p, not peak name)
        -map <mapping file> (mapping between peak IDs and promoter IDs, overrides closest assignment)
        -noann, -nogene (skip genome annotation step, skip TSS annotation)
        -homer1/-homer2 (by default, the new version of homer [-homer2] is used for finding motifs)

Can't figure something out? Questions, comments, concerns, or other feedback: