Software for motif discovery and next-gen sequencing analysis

Understanding and manipulating SAM/BAM alignment files

Once you have mapped your FASTQ file to the reference genome you will normally end up with a SAM or BAM alignment file.  SAM stands for Sequence Alignment/Map format, and BAM is the binary version of a SAM file.  These formats have become industry standards for reporting alignment/mapping information, so it's good to understand them to some degree.  We will also discuss samtools, which is a free software package for manipulating SAM/BAM files.  Understanding how to use samtools is important since BAM files are often the input files needed for many different analysis programs.  Sometimes they need to be sorted, or have an index, and samtools is an easy way to create one.  BEDtools will also be covered at the end to demonstrate some nifty things you can do once you have a BAM file.

SAM file format

For the most part, you don't really need to understand the SAM format since most analysis software will handle it for you.  A full description of the SAM format can be found here[PDF].  SAM files are tab-delimited text files, and can be opened using a text editor or viewed using the UNIX "more" command.  Most files will start with a header section, which has lines that start with "@" and looks like this:
@HD     VN:1.0  SO:coordinate
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
Most alignment programs will supply a header that looks like this, which contains information about each of the chromosomes that were mapped to.  After the header section, the aligned reads will appear, with one line per alignment (below I added a space between lines):
HWI-ST1001:137:C12FPACXX:7:1115:14131:66670     0       chr1    12805   1       42M4I5M *       0       0       TTGGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCACCAATATG     CCCFFFFFHHGHHJJJJJHJJJJJJJJJJJJJJJJIJJJJJJJJJJJJIJJ     AS:i:-28        XN:i:0  XM:i:2  XO:i:1XG:i:4   NM:i:6  MD:Z:2C41C2     YT:Z:UU NH:i:3  CC:Z:chr15      CP:i:102518319  XS:A:+  HI:i:0

HWI-ST1001:137:C12FPACXX:7:2313:17391:30032     272     chr1    13494   1       51M     *       0       0       ACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGGAGACAGTGTTT     CFFFFHHJJJJIJJJJIJJJJJJJJJJJJJJJJJJJJJHHHHFA+FFFC@B     AS:i:-3 XN:i:0  XM:i:1  XO:i:0  XG:i:0NM:i:1   MD:Z:44G6       YT:Z:UU XS:A:+  NH:i:3  CC:Z:chr15      CP:i:102517626  HI:i:0

HWI-ST1001:137:C12FPACXX:7:1109:17518:53305     16      chr1    13528   1       51M     *       0       0       CGCTGGAGCCGGTGTTTGTCATGGGCCTGGGCTGCAGGGATCCTGCTACAA     #############AB=?:*B?;A?<2+233++;A+A2+<7==@7,A<A<=>     AS:i:-5 XN:i:0  XM:i:2  XO:i:0  XG:i:0NM:i:2   MD:Z:8A21T20    YT:Z:UU XS:A:+  NH:i:4  CC:Z:chr15      CP:i:102517592  HI:i:0
The SAM alignment has several columns:
  1. Read Name
  2. SAM flag
  3. chromosome (if read is has no alignment, there will be a "*" here)
  4. position (1-based index, "left end of read")
  5. MAPQ (mapping quality - describes the uniqueness of the alignment, 0=non-unique, >10 probably unique)
  6. CIGAR string (describes the position of insertions/deletions/matches in the alignment, encodes splice junctions, for example)
  7. Name of mate (mate pair information for paired-end sequencing, often "=")
  8. Position of mate (mate pair information)
  9. Template length (always zero for me)
  10. Read Sequence
  11. Read Quality
  12. Program specific Flags (i.e. HI:i:0, MD:Z:66G6, etc.)
Obviously, the chromsome and position are important.  The CIGAR string is also important to know where insertions (i.e. introns) might exist in your read.  The other two important values in a SAM file are the MAPQ value (column 5) and the SAM flag (column 2).  The MAPQ value can be used to figure out how unique an alignment is in the genome (large number, >10 indicate it's likely the alignment is unique). 

The SAM flag is a little more difficult to decipher - the value of the flag is formulated as a bitwise flag, with each binary bit corresponding to a certain parameter.  See the format specification for more info [pdf].  For example, if the 0x10 bit is set, then the read is aligned as the reverse compliment (i.e. maps to the - strand).

NOTE: Usually, the process of removing duplicate reads or removing non-unique alignments is handled by the downstream analysis program.

Manipulating Alignment Files with samtools

Samtools is a very useful program for manipulating SAM/BAM files, and is worth learning how to use. 

Install samtools

To install samtools, download the software from the samtools website.  Unzip/untar the file and make the executable using the make command:
tar zxvf samtools-0.1.18.tar.bz2
cd samtools-0.1.18/
This should produce the executable "samtools".  You may want to add the samtools directory to your executable path so that all you have to do is type "samtools" instead of the full path.  You can also just copy samtools to a different directory that is already in the executable path.

Converting BAM to SAM and vice versa

A common thing you may want to do is view the contents of a BAM file.  Since BAM files are a binary format, it is hard to read them with a text editor.  To convert a bam file to a sam file:
samtools view -h file.bam > file.sam
To convert back to a bam file:
samtools view -b -S file.sam > file.bam

Sorting a BAM file

Many of the downstream analysis programs that use BAM files actually require a sorted BAM file.  This allows access to reads to be done more efficiently.  To sort a BAM file:
samtools sort -m 1000000000 file.bam outputPrefix

The number of -m specifies the maximum memory to use, and can be changed to fit your system.  The output will be placed in file named "outputPrefix.bam".

Creating a BAM index

Some tools require a BAM Index file to more efficiently access reads in a BAM file.  An example of this is viewing alignments using the UCSC Genome Browser.  To create a BAM index, you must first sort the BAM file.  After that, run the following command to make a BAM index:
samtools index sorted.bam
This will create a file named sorted.bam.bai which contains the index.

Using BEDtools to perform basic analysis tasks

BEDtools is a powerful suite of tools that enables you to perform several different types of analysis.  In many ways, it's like HOMER in that it is flexible and allows you to analyze your data in many different ways.

Installing BEDtools

To install, download the program (something like "BEDTools.v2.16.2.tar.gz") and unzip and un-tar the file.  After that, move into the new directory and type "make" to compile the program:
tar zxvf BEDTools.v2.16.2.tar.gz
cd BEDTools-Version-2.16.2
You might want to edit your executable PATH variable and add the BEDtools directory to it

Creating a Genome Coverage BedGraph

Experiments such as ChIP-Seq and RNA-Seq are means to measure the density of reads in a given location.  To build a density graph of your reads across the genome, use the BEDtools program "genomeCoverageBed" to create a bedGraph file.
genomeCoverageBed -ibam SRR065240.notx.bam -bg -trackline -trackopts 'name="notx" color=250,0,0' > notx.bedGraph
The "-trackline" and "-trackopts" options are used to specify parameters for the bedGraph file so that when you upload it to UCSC or the like it looks good.  It's also a good idea to gzip file file after creating it to reduce upload times to UCSC.

Can't figure something out? Questions, comments, concerns, or other feedback: