|
Analysis of RNA-Seq data with R/Bioconductor
There are several different tools available for RNA-Seq
analysis. The statistical computing environment R has
been a popular platform for the development of RNA-seq
analysis algorithms. This is partially a result of the
strong microarray community that developed the Bioconductor
suite of programs.
The primary input files for this analysis are sorted BAM
files. You should have a single BAM file for each
experiment you want to analyze. Technical replicates
should be combined into single BAM files (this type of
analysis is primarily meant for biological replicates).
Installing Bioconductor and packages in R
To install R, go to the R homepage and
install the appropriate version for your computer (CRAN download page).
Go here
to get a full description about how what bioconductor is
and how to install it (below is the cheat sheet):
After you start R, type:
# load the script from the internet that is used in
install bioconductor
source("http://bioconductor.org/biocLite.R")
# Each of these commands tells Bioconductor to download
and install each package
biocLite("GenomicRanges")
biocLite("GenomicFeatures")
biocLite("Rsamtools")
biocLite("DESeq")
biocLite("edgeR")
biocLite("org.Mm.eg.db")
Couple of R basics
Main
R tutorial
This one
isn't too bad either.
Also, each bioconductor package has it's own
tutorial/documentation (usually they offer a lot of
explanation)
Also, to change your current working directory in R,
got to the top menu "Misc -> Change Working
Directory"
Counting Reads in annotated genes (using R)
Before checking differential expression, we need
to quantify reads in transcripts.
# load library for genomic annotations
library(GenomicFeatures)
library(GenomicRanges)
# load the transcript annotation file from UCSC.
Make sure to enter the correct genome version
txdb=makeTranscriptDbFromUCSC(genome='mm9',tablename='refGene')
# Use the function transcriptsBy(txdb,'gene') for the
whole genic region instead of just exons
ex_by_gene=exonsBy(txdb,'gene')
# load the samtools library for R
library(Rsamtools)
# read the sequencing read alignment into R (combine with
next step to save memory)
reads1r1=readBamGappedAlignments("exp1.rep1.bam")
reads1r2=readBamGappedAlignments("exp1.rep2.bam")
reads2r1=readBamGappedAlignments("exp2.rep1.bam")
reads2r2=readBamGappedAlignments("exp2.rep2.bam")
#repeat as necessary for more samples)
# count reads overlapping the exons
counts1r1 = countOverlaps(ex_by_gene,reads1r1)
counts1r2 = countOverlaps(ex_by_gene,reads1r2)
counts2r1 = countOverlaps(ex_by_gene,reads2r1)
counts2r2 = countOverlaps(ex_by_gene,reads2r2)
# create count table
countTable =
data.frame(condition1r1=counts1r1,condition1r2=counts1r2,condition2r1=counts2r1,
condition2r2=counts2r2,stringsAsFactors=FALSE)
# set the gene IDs to the table row names
rownames(countTable)=names(ex_by_gene)
#output tag counts to a file
write.table(countTable,file="countTable.txt",sep="\t")
#removing rows that are zero for all genes (edgeR and
DESeq have trouble with these)
x <- rowSums(countTable==0)!=ncol(countTable)
newCountTable <- countTable[x,]
Differential Expression Analysis with EdgeR
# install edgeR
biocLite("edgeR")
# load the edgeR library
library(edgeR)
# either use the input data from above, or load your own
table of read counts.
data <- as.matrix(read.table("countTable.txt"))
# - or -
data = newCountTable
# assign groups to the samples in the counts table
g <- c(0,0,1,1)
# get the library sizes (total counts in annotated genes)
libSizes <- as.vector(colSums(data))
# edgeR command pipeline (basically the same for each
sample)
d <- DGEList(counts=data,group=g,lib.size=libSizes)
d <- calcNormFactors(d)
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
de.com <- exactTest(d)
results <- topTags(de.com,n = length(data[,1]))
# write the output to a text file
write.table(as.matrix(results$table),file="outputFile.txt",sep="\t")
# No replicates
# If you don't have replicates, edgeR will throw a
fit. In place of:
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
you'll want to use:
d$common.dispersion <- 0.05
DESeq
# install DESeq
biocLite("DESeq")
# load the DESeq library
library(DESeq)
# create a DESeq analysis object
cds =
newCountDataSet(data,c("notx","notx","lps","lps"))
# DESeq pipeline (similar to the edgeR one)
cds = estimateSizeFactors( cds )
Differential expression of RNA-Seq data at the gene
level – the DESeq package
Differential expression of RNA-Seq data at the gene
level – the DESeq pack
cds = estimateDispersions( cds )
res = nbinomTest( cds, "notx", "lps" )
# print out the results
write.table(res,file="DESeqOutput.txt",sep="\t")
Adding Annotation
Lets say you have a table named "dataTable"
(must be data table, i.e. dataTable <-
as.data.table(x)).
# for human, replace org.Mm.eg.db with org.Hs.eg.db
biocLite("org.Mm.eg.db")
library(org.Mm.eg.db)
# Use this command to see which types of IDs you can
convert:
keytypes(org.Mm.eg.db)
fbids <- rownames(dataTable)
cols <- "SYMBOL"
annots <-
select(org.Mm.eg.db,keys=fbids,columns=cols,keytype="ENTREZID")
dataTable <-
cbind(entrezid=rownames(as.data.frame(dataTable)),
as.data.frame(dataTable))
newTable =
merge(dataTable,annots,by.x="entrezid",by.y="ENTREZID")
write.table(newTable,file="annotatedOutput.txt",sep="\t")
|