|
Understanding and manipulating SAM/BAM alignment files
Once you have mapped your FASTQ file to the reference genome
you will normally end up with a SAM or BAM alignment
file. SAM stands for Sequence Alignment/Map format,
and BAM is the binary version of a SAM file. These
formats have become industry standards for reporting
alignment/mapping information, so it's good to understand
them to some degree. We will also discuss samtools,
which is a free software package for manipulating SAM/BAM
files. Understanding how to use samtools is important
since BAM files are often the input files needed for many
different analysis programs. Sometimes they need to be
sorted, or have an index, and samtools is an easy way to
create one. BEDtools
will also be covered at the end to demonstrate some nifty
things you can do once you have a BAM file.
SAM file format
For the most part, you don't really need to
understand the SAM format since most analysis software
will handle it for you. A full description of the
SAM format can be found here[PDF].
SAM files are tab-delimited text files, and can be opened
using a text editor or viewed using the UNIX "more"
command. Most files will start with a header
section, which has lines that start with "@" and looks
like this:
@HD VN:1.0
SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ
SN:chr10
LN:135534747
@SQ
SN:chr11
LN:135006516
...
Most alignment programs will supply a header that looks
like this, which contains information about each of the
chromosomes that were mapped to. After the header
section, the aligned reads will appear, with one line per
alignment (below I added a space between lines):
HWI-ST1001:137:C12FPACXX:7:1115:14131:66670
0
chr1 12805
1 42M4I5M
*
0
0
TTGGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCACCAATATG
CCCFFFFFHHGHHJJJJJHJJJJJJJJJJJJJJJJIJJJJJJJJJJJJIJJ
AS:i:-28
XN:i:0 XM:i:2 XO:i:1XG:i:4
NM:i:6 MD:Z:2C41C2 YT:Z:UU
NH:i:3 CC:Z:chr15
CP:i:102518319 XS:A:+ HI:i:0
HWI-ST1001:137:C12FPACXX:7:2313:17391:30032
272 chr1
13494 1
51M
*
0
0
ACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGGAGACAGTGTTT
CFFFFHHJJJJIJJJJIJJJJJJJJJJJJJJJJJJJJJHHHHFA+FFFC@B
AS:i:-3 XN:i:0 XM:i:1 XO:i:0
XG:i:0NM:i:1
MD:Z:44G6 YT:Z:UU
XS:A:+ NH:i:3
CC:Z:chr15
CP:i:102517626 HI:i:0
HWI-ST1001:137:C12FPACXX:7:1109:17518:53305
16 chr1
13528 1
51M
*
0
0
CGCTGGAGCCGGTGTTTGTCATGGGCCTGGGCTGCAGGGATCCTGCTACAA
#############AB=?:*B?;A?<2+233++;A+A2+<7==@7,A<A<=>
AS:i:-5 XN:i:0 XM:i:2 XO:i:0
XG:i:0NM:i:2 MD:Z:8A21T20
YT:Z:UU XS:A:+ NH:i:4
CC:Z:chr15
CP:i:102517592 HI:i:0
The SAM alignment has several columns:
- Read Name
- SAM flag
- chromosome (if read is has no alignment, there will
be a "*" here)
- position (1-based index, "left end of read")
- MAPQ (mapping quality - describes the uniqueness of
the alignment, 0=non-unique, >10 probably unique)
- CIGAR string (describes the position of
insertions/deletions/matches in the alignment, encodes
splice junctions, for example)
- Name of mate (mate pair information for paired-end
sequencing, often "=")
- Position of mate (mate pair information)
- Template length (always zero for me)
- Read Sequence
- Read Quality
- Program specific Flags (i.e. HI:i:0, MD:Z:66G6,
etc.)
Obviously, the chromsome and position are important.
The CIGAR string is also important to know where
insertions (i.e. introns) might exist in your read.
The other two important values in a SAM file are the MAPQ
value (column 5) and the SAM flag (column 2). The
MAPQ value can be used to figure out how unique an
alignment is in the genome (large number, >10 indicate
it's likely the alignment is unique).
The SAM flag is a little more difficult to decipher - the
value of the flag is formulated as a bitwise flag, with
each binary bit corresponding to a certain
parameter. See the format specification for more
info [pdf].
For example, if the 0x10 bit is set, then the read is
aligned as the reverse compliment (i.e. maps to the -
strand).
NOTE: Usually, the process of removing duplicate
reads or removing non-unique alignments is handled by the
downstream analysis program.
Manipulating Alignment Files with samtools
Samtools is a very useful program for
manipulating SAM/BAM files, and is worth learning how to
use.
Install samtools
To install samtools, download the software
from the samtools website.
Unzip/untar the file and make the executable using the
make command:
tar zxvf samtools-0.1.18.tar.bz2
cd samtools-0.1.18/
make
This should produce the executable "samtools".
You may want to add the samtools directory to your
executable path so that all you have to do is type
"samtools" instead of the full path. You can also
just copy samtools to a different directory that is
already in the executable path.
Converting BAM to SAM and vice versa
A common thing you may want to do is view the
contents of a BAM file. Since BAM files are a
binary format, it is hard to read them with a text
editor. To convert a bam file to a sam file:
samtools view -h file.bam > file.sam
To convert back to a bam file:
samtools view -b -S file.sam >
file.bam
Sorting a BAM file
Many of the downstream analysis programs that
use BAM files actually require a sorted BAM file.
This allows access to reads to be done more
efficiently. To sort a BAM file:
samtools sort -m 1000000000 file.bam
outputPrefix
The number of -m specifies the maximum memory to use,
and can be changed to fit your system. The
output will be placed in file named
"outputPrefix.bam".
Creating a BAM
index
Some tools require a BAM Index file to more
efficiently access reads in a BAM file. An example
of this is viewing alignments using the UCSC
Genome Browser. To create a BAM index, you
must first sort the BAM file. After that, run the
following command to make a BAM index:
samtools index sorted.bam
This will create a file named sorted.bam.bai which
contains the index.
Using BEDtools to perform basic analysis tasks
BEDtools
is a powerful suite of tools that enables you to perform
several different types of analysis. In many ways,
it's like HOMER in that it is flexible and allows you to
analyze your data in many different ways.
Installing BEDtools
To install, download
the program (something like
"BEDTools.v2.16.2.tar.gz") and unzip and un-tar the
file. After that, move into the new directory and
type "make" to compile the program:
tar zxvf BEDTools.v2.16.2.tar.gz
cd BEDTools-Version-2.16.2
make
You might want to edit your executable PATH variable and
add the BEDtools directory to it
Creating a Genome Coverage BedGraph
Experiments such as ChIP-Seq and RNA-Seq are means to
measure the density of reads in a given location.
To build a density graph of your reads across the
genome, use the BEDtools program "genomeCoverageBed"
to create a bedGraph file.
genomeCoverageBed -ibam
SRR065240.notx.bam -bg -trackline -trackopts
'name="notx" color=250,0,0' > notx.bedGraph
The "-trackline" and "-trackopts" options are
used to specify parameters for the bedGraph file so that
when you upload it to UCSC or the like it looks
good. It's also a good idea to gzip file file
after creating it to reduce upload times to UCSC.
|