logo

HOMER

Software for motif discovery and next-gen sequencing analysis



Analysis of RNA-Seq data with R/Bioconductor

There are several different tools available for RNA-Seq analysis.  The statistical computing environment R has been a popular platform for the development of RNA-seq analysis algorithms.  This is partially a result of the strong microarray community that developed the Bioconductor suite of programs.

The primary input files for this analysis are sorted BAM files.  You should have a single BAM file for each experiment you want to analyze.  Technical replicates should be combined into single BAM files (this type of analysis is primarily meant for biological replicates).

Installing Bioconductor and packages in R

To install R, go to the R homepage and install the appropriate version for your computer (CRAN download page).

Go here to get a full description about how what bioconductor is and how to install it (below is the cheat sheet):
After you start R, type:

# load the script from the internet that is used in install bioconductor
source("http://bioconductor.org/biocLite.R")

# Each of these commands tells Bioconductor to download and install each package
biocLite("GenomicRanges")
biocLite("GenomicFeatures")
biocLite("Rsamtools")
biocLite("DESeq")
biocLite("edgeR")
biocLite("org.Mm.eg.db")

Couple of R basics

Main R tutorial

This one isn't too bad either.

Also, each bioconductor package has it's own tutorial/documentation (usually they offer a lot of explanation)

Also, to change your current working directory in R, got to the top menu "Misc -> Change Working Directory"

Counting Reads in annotated genes (using R)

Before checking differential expression, we need to quantify reads in transcripts.

# load library for genomic annotations
library(GenomicFeatures)
library(GenomicRanges)

# load the transcript annotation file from UCSC.  Make sure to enter the correct genome version
txdb=makeTranscriptDbFromUCSC(genome='mm9',tablename='refGene')

# Use the function transcriptsBy(txdb,'gene') for the whole genic region instead of just exons
ex_by_gene=exonsBy(txdb,'
gene')


# load the samtools library for R
library(Rsamtools)

# read the sequencing read alignment into R (combine with next step to save memory)
reads1r1=readBamGappedAlignments("exp1.rep1.bam")
reads1r2=readBamGappedAlignments("exp1.rep2.bam")
reads2r1=readBamGappedAlignments("exp2.rep1.bam")
reads2r2=readBamGappedAlignments("exp2.rep2.bam")
#repeat as necessary for more samples)

# count reads overlapping the exons
counts1r1 = countOverlaps(ex_by_gene,reads1r1)
counts1r2 = countOverlaps(ex_by_gene,reads1r2)
counts2r1 = countOverlaps(ex_by_gene,reads2r1)
counts2r2 = countOverlaps(ex_by_gene,reads2r2)

# create count table
countTable = data.frame(condition1r1=counts1r1,condition1r2=counts1r2,condition2r1=counts2r1,
    condition2r2=counts2r2,stringsAsFactors=FALSE)
# set the gene IDs to the table row names
rownames(countTable)=names(ex_by_gene)

#output tag counts to a file
write.table(countTable,file="countTable.txt",sep="\t")

#removing rows that are zero for all genes (edgeR and DESeq have trouble with these)
x <- rowSums(countTable==0)!=ncol(countTable)
newCountTable <- countTable[x,]

Differential Expression Analysis with EdgeR

# install edgeR
biocLite("edgeR")

# load the edgeR library
library(edgeR)

# either use the input data from above, or load your own table of read counts.
data <- as.matrix(read.table("countTable.txt"))
# - or -
data = newCountTable

# assign groups to the samples in the counts table
g <- c(0,0,1,1)
# get the library sizes (total counts in annotated genes)
libSizes <- as.vector(colSums(data))


# edgeR command pipeline (basically the same for each sample)
d <- DGEList(counts=data,group=g,lib.size=libSizes)
d <- calcNormFactors(d)
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
de.com <- exactTest(d)
results <- topTags(de.com,n = length(data[,1]))

# write the output to a text file
write.table(as.matrix(results$table),file="outputFile.txt",sep="\t")



# No replicates
# If you don't have replicates, edgeR will throw a fit.  In place of:
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
you'll want to use:
d$common.dispersion <- 0.05


DESeq

# install DESeq
biocLite("DESeq")

# load the DESeq library
library(DESeq)

# create a DESeq analysis object
cds = newCountDataSet(data,c("notx","notx","lps","lps"))

# DESeq pipeline (similar to the edgeR one)
cds = estimateSizeFactors( cds )
Differential expression of RNA-Seq data at the gene level – the DESeq package Differential expression of RNA-Seq data at the gene level – the DESeq pack cds = estimateDispersions( cds )
res = nbinomTest( cds, "notx", "lps" )

# print out the results
write.table(res,file="DESeqOutput.txt",sep="\t")


Adding Annotation

Lets say you have a table named "dataTable" (must be data table, i.e. dataTable <- as.data.table(x)).

# for human, replace org.Mm.eg.db with org.Hs.eg.db
biocLite("org.Mm.eg.db")
library(org.Mm.eg.db)

# Use this command to see which types of IDs you can convert:
keytypes(org.Mm.eg.db)

fbids <- rownames(dataTable)
cols <- "SYMBOL"
annots <- select(org.Mm.eg.db,keys=fbids,columns=cols,keytype="ENTREZID")
dataTable <- cbind(entrezid=rownames(as.data.frame(dataTable)), as.data.frame(dataTable))
newTable = merge(dataTable,annots,by.x="entrezid",by.y="ENTREZID")

write.table(newTable,file="annotatedOutput.txt",sep="\t")




Can't figure something out? Questions, comments, concerns, or other feedback:
cbenner@salk.edu