Also check out this page in the Basic NGS Tutorial (it's more up-to-date).
Alignment of High-throughput Sequencing Data
HOMER does not perform alignment - this is something that must be done before running HOMER. Several high-quality tools are available for aligning short reads to large genomes. Check out this link for a list of programs that do short read alignment. BLAST, BLAT, and other traditional alignment programs, while great at what they do, are not practical for aligning these types of data. If you need help deciding on a program to use, I'd recommend Bowtie (it's nice and fast).
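For example, a basic single-end Bowtie command might look something like the sketch below (the index path and file names are placeholders, and the options are just reasonable defaults - check the Bowtie manual for settings appropriate to your data):

    bowtie -p 8 -m 1 --best --strata -S /path/to/hg19_index reads.fastq aligned.sam

Here -p sets the number of CPUs, -m 1 throws out reads that align to more than one position, and -S writes the output in SAM format.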
If you have a core that maps your data for you, don't worry about this step. However, in many cases there is public data available that hasn't been mapped to the genome, was mapped to a different version of the genome, or was mapped with different parameters. In these cases it is useful to be able to map the data yourself so that you have a consistent set of data for analysis.
Most types of ChIP-Seq/DNase-Seq/MNase-Seq and GRO-Seq
simply need to be mapped to the genome, as they represent
the sequencing of genomic DNA (or nascent RNA, which
should not be spliced yet). If analyzing RNA-Seq,
you may be throwing away interesting information about
splicing if you simply align the data to the genome.
If aligning RNA, I'd recommend sticking to the formal wear
and trying Tophat,
which does a good job of identifying splice junctions in
your data.
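As an illustration (the index path, file names, and output directory below are placeholders - see the Tophat documentation for options suited to your experiment), a basic Tophat run might look like:

    tophat -p 8 -o tophat_output /path/to/bowtie_index reads.fastq

The spliced alignments are typically written to accepted_hits.bam inside the output directory.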
Which reference genome (version) should I map my reads to?
Both the organism and the exact version (e.g. hg18, hg19) are very important when mapping sequencing reads. Reads mapped to one version are NOT interchangeable with reads mapped to a different version. I would follow this recommendation list when choosing a genome (obviously, try to match the species or subspecies when selecting a genome):
- Do you have a favorite genome in the lab that
already has a bunch of experiments mapped to it?
Use that one.
- Do any of your collaborators have a favorite genome? If so, use that one.
- Use the latest stable release - I would recommend using genomes curated at UCSC (e.g. mm9, hg18) so that you can easily visualize your data later using the UCSC Genome Browser.
Q: I'm changing genome versions. Can I just "liftover" my data using the UCSC liftOver tool, or do I need to remap it to the new genome version?
If you want to do it right, you need to remap it. This is because some regions of the genome that are considered "unique" in one version may suddenly be found multiple times in the new version, and vice versa, so using the liftOver tool will yield different results from remapping. However, liftOver is fine if you're looking for a quick and dirty solution. If you feel like cheating, as Chuck often does, try convertCoordinates.pl - it's a wrapper that uses the "liftOver" program to migrate peak files and whole Tag Directories.
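For a plain peak/BED file, the standalone UCSC liftOver program is run roughly like this (the file names are placeholders, and you'll need to download the appropriate chain file from UCSC):

    liftOver peaks.hg18.bed hg18ToHg19.over.chain peaks.hg19.bed unmapped.bed

Regions that cannot be converted to the new assembly end up in the last file (unmapped.bed).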
Should I trim my reads when mapping to the genome?
Depends. In the old days, the read quality dropped off quite a bit past ~30 bp, but these days even the ends of sequencing reads are pretty high quality. In fact, there's usually a negligible difference between using full reads and using ones that have been trimmed based on quality scores. However, sometimes the quality does drop off quite a bit. There are many tools for trimming FASTQ files, including the homerTools program that comes with HOMER.
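As a rough sketch (the adapter sequence, file name, and option names here are assumptions for illustration only - run homerTools trim without arguments to see the exact options available in your version of HOMER), trimming an adapter from the 3' end of reads might look like:

    homerTools trim -3 AGATCGGAAGAGC -min 20 reads.fastq

The idea is to clip the adapter sequence from the 3' end and discard reads that become too short to map reliably.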
I have barcodes and/or adapter sequences in my reads. Should I remove them first or just map them?