
HOMER

Software for motif discovery and next-gen sequencing analysis



Also check out this page in the Basic NGS Tutorial (It's more up-to-date)

Alignment of High-throughput Sequencing Data

HOMER does not perform alignment - this must be done before running HOMER.  Several quality tools are available for aligning short reads to large genomes.  Check out this link for a list of programs that do short read alignment.  BLAST, BLAT, and other traditional alignment programs, while great at what they do, are not practical for aligning this type of data.

If you need help deciding on a program to use, I'll recommend Bowtie (it's nice and fast).

If you have a core that maps your data for you, don't worry about this step.  However, in many cases there is public data available that hasn't been mapped to the genome or mapped to a different version of the genome or mapped with different parameters.  In these cases it is nice to be able to map data yourself to keep a nice, consistent set of data for analysis.

Most types of ChIP-Seq/DNase-Seq/MNase-Seq and GRO-Seq data simply need to be mapped to the genome, as they represent the sequencing of genomic DNA (or nascent RNA, which should not be spliced yet).  If you are analyzing RNA-Seq, you may be throwing away interesting information about splicing if you simply align the data to the genome.  If aligning RNA, I'd recommend sticking to the formal wear and trying TopHat, which does a good job of identifying splice junctions in your data.

Which reference genome (version) should I map my reads to?

Both the organism and the exact version (e.g. hg18, hg19) are very important when mapping sequencing reads.  Reads mapped to one version are NOT interchangeable with reads mapped to a different version.  I would follow this recommendation list when choosing a genome (obviously, try to match the species or subspecies when selecting a genome):
    1. Do you have a favorite genome in the lab that already has a bunch of experiments mapped to it?  Use that one.
    2. Do any of your collaborators have a favorite genome?
    3. Use the latest stable release - I would recommend using genomes curated at UCSC so that you can easily visualize your data later using the UCSC Genome Browser (e.g. mm9, hg18).

Q: I'm changing genome versions, can I just "liftover" my data using UCSC liftover tool, or do I need to remap it to the new genome version? 

If you want to do it right, you need to remap it.  This is because some regions of the genome that are considered "unique" in one version may suddenly be found multiple times in the new version and vice versa, so using the liftover tool will yield different results from remapping.  However, liftover is fine if you're looking for a quick and dirty solution.  If you feel like cheating, as Chuck often does, try convertCoordinates.pl - it's a wrapper that uses the "liftOver" program to migrate peak files and whole Tag Directories.
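
To make the "unique in one version, repeated in another" point concrete, here is a toy sketch in Python.  The two "genome versions" and the read are made-up strings, and count_hits is a hypothetical helper, not part of HOMER or liftOver.  It only shows why a read that maps uniquely to the old assembly might be discarded as multi-mapping after a real remap, something a coordinate conversion would never notice.

```python
def count_hits(genome, read):
    """Count every (possibly overlapping) exact occurrence of read in genome."""
    hits = 0
    start = genome.find(read)
    while start != -1:
        hits += 1
        start = genome.find(read, start + 1)
    return hits


# Made-up sequences for illustration only.
old_version = "TTTACGTACGGG"
new_version = "TTTACGTACGGGCCACGTACTT"  # new assembly gained a second copy
read = "ACGTAC"

old_hits = count_hits(old_version, read)
new_hits = count_hits(new_version, read)

# A typical "unique reads only" policy keeps a read only if it has one hit.
keep_in_old = old_hits == 1
keep_in_new = new_hits == 1
```

Here the read would be kept when mapped to the old version but dropped when mapped to the new one, while a liftover of the old coordinates would carry it across regardless.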

Should I trim my reads when mapping to the genome?

Depends.  In the old days, read quality dropped off quite a bit past ~30 bp, but these days even the ends of sequencing reads are pretty high quality.  In fact, there's usually negligible difference between using full reads and using ones that have been trimmed based on quality scores.  However, sometimes the quality does drop off quite a bit.  There are many tools for trimming FASTQ files, including one included with HOMER: homerTools.
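
If you're curious what quality-based trimming amounts to, here is a minimal Python sketch.  It is not how homerTools does it; the function name, the Phred cutoff of 20, and the Phred+33 encoding are all assumptions chosen for illustration.  It simply chops bases off the 3' end until it reaches one that meets the cutoff.

```python
def trim_low_quality_tail(seq, qual, min_phred=20, offset=33):
    """Drop trailing bases whose Phred score is below min_phred.

    seq    - read sequence from a FASTQ record
    qual   - matching quality string (assumed Phred+33 encoded)
    Returns the trimmed (seq, qual) pair.
    """
    end = len(seq)
    # Walk backwards from the 3' end while the quality is too low.
    while end > 0 and ord(qual[end - 1]) - offset < min_phred:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
trimmed_seq, trimmed_qual = trim_low_quality_tail("ACGTACGT", "IIIII###")
```

Real trimmers usually also enforce a minimum read length after trimming, since very short reads tend to map to many places.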

I have barcodes and/or adapter sequences in my reads.  Should I remove them first or just map them?

You should definitely remove the adapter sequences or other "non-biological" sequences before mapping.  Various tools can accomplish this.  You can check out homerTools for trimming sequences and dealing with adapters.  Galaxy also has a nice variety of tools for accomplishing this type of stuff.
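
As a rough idea of what 3' adapter removal involves, here is a small Python sketch.  It is an illustration under simplifying assumptions, not the homerTools or Galaxy implementation: clip_3prime_adapter is a hypothetical helper, the adapter sequence is made up, and it only handles exact matches (real tools typically tolerate mismatches).  It clips a full adapter anywhere in the read, or a partial adapter hanging off the end of the read.

```python
def clip_3prime_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter (exact match only) from a read.

    First looks for the complete adapter anywhere in the read; failing
    that, looks for the longest adapter prefix (at least min_overlap
    bases) sitting at the very end of the read.
    """
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # The read may end partway through the adapter.
    longest = min(len(adapter), len(read)) - 1
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


adapter = "TGGAATTCTCGG"  # made-up adapter sequence for the example
full = clip_3prime_adapter("ACGTTGGAATTCTCGGAAAA", adapter)
partial = clip_3prime_adapter("ACGTACGTTGGAATTC", adapter)
untouched = clip_3prime_adapter("ACGTACGT", adapter)
```

The min_overlap setting is a trade-off: a small value catches short adapter remnants but will also clip a few genuine bases that happen to match the start of the adapter.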

Performing Genome Alignments (part of the HOMER Basic NGS Tutorial)




Can't figure something out? Questions, comments, concerns, or other feedback:
cbenner@salk.edu