HOMER

Software for motif discovery and next-gen sequencing analysis

5'GRO-Seq TSS Analysis Tutorial

This tutorial is now outdated. Please see the csRNA-seq tutorial, which also works well for 5'GRO-Seq and other TSS mapping methods.

csRNA-seq tutorial link.

[old tutorial information]:
Sequencing the 5' end of cap-protected RNAs enables the identification of Transcription Start Sites (TSS) at nucleotide resolution. Several varieties of this method exist, including CAGE, TSS-Seq, START-seq, PRO-Cap, GRO-cap, CAP-seq, 5'RNA-Seq, 5'GRO-Seq etc. Each is designed to collect 5' ends of RNA for sequencing, but they use different enzymatic or enrichment strategies to achieve this goal. 5'GRO-Seq and PRO-Cap are techniques that perform 5' RNA sequencing on nascent RNA, allowing the identification of TSS for unstable transcripts such as eRNAs. These techniques are particularly powerful for identifying active regulatory elements (enhancers + promoters) and assessing their activity in a quantitative manner with relatively low sequencing coverage. For example, 20-40 million reads from a 5'GRO-Seq experiment might yield the same number of reads at enhancers as 200-400 million reads in a 5'RNA-Seq experiment.

This tutorial will take you through the basic process of trying to analyze 5'RNA-Seq data with HOMER. Generally speaking, the analysis of each 5' RNA sequencing method is similar. The basic idea is to identify regions with a high density of 5' RNA sequencing reads, which on the surface sounds really similar to finding peaks in ChIP-Seq data (and it is!).

Introduction to Transcription Initiation at Metazoan Promoters

To understand the analysis of 5'RNA data, it is worth taking a moment to highlight that there are multiple 'types' of promoters in living organisms. First of all, there are different RNA polymerases including RNA polymerase I (rRNA), II (mRNA, lncRNA, miRNA), III (tRNA), IV (plant specific), viral polymerases, etc., and each polymerase has different mechanisms of transcriptional initiation that may vary between distantly related organisms. Also be aware that different RNA polymerases may generate RNAs with different covalent modifications, and these RNAs may or may not be present in your 5' RNA sequencing data, depending on how the experiment was performed. By and large, most researchers are interested in RNA polymerase II transcripts (mRNA), and as a result most 5'RNA methods focus on the identification of RNAs containing a 7-methylguanosine cap protecting their 5' end.

With respect to RNA polymerase II initiation sites, there are two generally recognized 'types' of TSS. Sharp (or focused) TSS initiate transcription from a single nucleotide (or within +/- 2 nt of it) and resemble the promoters found in molecular biology textbooks. They often contain well-defined core promoter elements such as the TATA box and usually initiate transcription from a purine preceded by a pyrimidine (PyPu, i.e. CA, with the A being the initiating nucleotide).
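The PyPu rule is easy to check in code. Here is a minimal Python sketch, assuming the sequence is already oriented 5' to 3' along the transcript and that positions are 0-based. The function name is_pypu_initiator is made up for this example and is not part of HOMER.

```python
PYRIMIDINES = set("CT")
PURINES = set("AG")


def is_pypu_initiator(seq, tss_index):
    """Return True if seq[tss_index] (the initiating, +1 nucleotide) is a
    purine and the base just upstream of it (-1) is a pyrimidine.

    Assumption: seq is given 5'->3' on the transcribed strand and
    tss_index is 0-based. Both conventions are choices made for this sketch.
    """
    if tss_index < 1 or tss_index >= len(seq):
        return False  # no upstream base to inspect, or index out of range
    upstream = seq[tss_index - 1].upper()
    initiating = seq[tss_index].upper()
    return upstream in PYRIMIDINES and initiating in PURINES


# "CA" with the A as the initiating nucleotide
print(is_pypu_initiator("GGCAGT", 3))  # True
```

For minus-strand TSS you would reverse-complement the genomic sequence first so that the same check applies.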

The other, more common type is the broad (or dispersed) TSS. These promoters initiate transcription from several different sites within a large area (often 50-100 nt in size). These promoters usually lack core promoter elements (no TATA box), but each individual initiation site DOES normally still initiate on a purine preceded by a pyrimidine (PyPu).
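One rough way to tell the two types apart is to ask what fraction of the reads in a TSS cluster start within a couple of nucleotides of the strongest initiation site. The Python sketch below does this. The function name classify_tss_shape, the +/- 2 nt window and the 80% cutoff are all assumptions for illustration, not values used by HOMER.

```python
def classify_tss_shape(counts_by_position, window=2, sharp_fraction=0.8):
    """Label a TSS cluster 'sharp' or 'broad'.

    counts_by_position: dict mapping genomic position -> number of reads
    whose 5' end maps there (one strand only).
    window, sharp_fraction: hypothetical thresholds chosen for this sketch.
    Returns (label, mode_position).
    """
    total = sum(counts_by_position.values())
    if total == 0:
        raise ValueError("no reads in cluster")
    # Mode = position with the most 5' ends; ties go to the lowest position.
    mode = min(counts_by_position, key=lambda p: (-counts_by_position[p], p))
    near_mode = sum(
        count for pos, count in counts_by_position.items()
        if abs(pos - mode) <= window
    )
    label = "sharp" if near_mode / total >= sharp_fraction else "broad"
    return label, mode


print(classify_tss_shape({100: 50, 101: 5, 130: 2}))
print(classify_tss_shape({100: 10, 120: 9, 140: 8, 160: 9}))
```

The first toy cluster has nearly all of its reads at one position, while the second spreads them over about 60 nt.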

False TSS - be careful of artifacts

A quick note about artifacts in 5'RNA-Seq data: Most 5' RNA-Seq methodologies work by enriching for 5' cap-protected RNA, which means that most of the sequence data describes 5' RNA ends, but a fraction of it may be noise from random RNA-Seq fragments (again, a lot like ChIP-Seq). In particular, highly expressed RNAs may yield "5'RNA-Seq" reads along the whole body of the gene, giving the appearance of alternative TSS which are likely false positives. Because of this, I would highly recommend using traditional RNA-Seq as a "background" when analyzing 5' RNA-Seq data. This approach (described below) may remove several real TSS from the results, but it is also likely to remove a large number of false positives and clean up your analysis.
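To make the background idea concrete, here is a rough Python sketch that compares the 5' read density at a candidate TSS with the density of conventional RNA-Seq reads in the same region, after scaling both to reads per million. The function names, the pseudocount and the 2-fold cutoff are hypothetical and chosen only for illustration. They are not HOMER's defaults, and HOMER's own filtering may work differently.

```python
def fold_over_background(tss_reads, tss_total, bg_reads, bg_total, pseudocount=1.0):
    """Fold enrichment of 5' reads over background reads in one region.

    tss_reads / bg_reads: reads falling in the region in each experiment.
    tss_total / bg_total: total mapped reads in each experiment.
    pseudocount: added to both normalized values to avoid dividing by zero
    (a hypothetical choice for this sketch).
    """
    tss_rpm = tss_reads * 1e6 / tss_total
    bg_rpm = bg_reads * 1e6 / bg_total
    return (tss_rpm + pseudocount) / (bg_rpm + pseudocount)


def passes_background_filter(fold, min_fold=2.0):
    """Keep a candidate TSS only if it is enriched over background.
    min_fold is an illustrative threshold, not a recommended value."""
    return fold >= min_fold


# Candidate in a highly expressed gene body: many 5' reads, but just as
# much conventional RNA-Seq signal, so it is likely a false positive.
gene_body = fold_over_background(50, 10_000_000, 200, 20_000_000)
# Candidate with 5' reads and little background signal.
promoter = fold_over_background(50, 10_000_000, 20, 20_000_000)
print(gene_body, passes_background_filter(gene_body))
print(promoter, passes_background_filter(promoter))
```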

Trans-splicing of transcripts (where the 5' end of one transcript is added to the front of another) and recapping (where a transcript is cleaved and a new cap is placed on the truncated product) are two phenomena you may want to think carefully about when analyzing 5' RNA-Seq data. Trans-splicing will create false negatives and recapping will create false positives. In certain organisms, such as C. elegans, trans-splicing is very common, making 5'GRO-Seq a much better assay for identifying TSS than 5'RNA-Seq (i.e. measuring the 5' RNA ends before they have a chance to trans-splice). In other organisms (e.g. mouse, human, fly, etc.) it appears to be rare. The degree to which transcripts are 'recapped' is a matter of debate because recapped ends can be hard to distinguish from true alternative TSS or noise in the 5' RNA-Seq assay.

Preprocessing and Mapping

Depending on the specific method of 5'RNA-Seq you are analyzing, you may or may not have to think about processing the reads before you analyze the data with HOMER. Most techniques simply yield sequence that starts with the 5' end of the read, and nothing special needs to be done. CAGE in particular may require you to remove initial 'G's that may have been added to the 5' end of the transcript during library construction. Also, older CAGE protocols may require you to separate the actual CAGE tags from longer 454 reads. Refer to the author/source of the data for how to deal with the processing of these reads.
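As an illustration of the 'G' trimming step, the Python sketch below removes up to a fixed number of leading G's from a read and trims the quality string to match. The function name trim_leading_g and the default of one base are assumptions for this example. Note that this trims blindly: a G that is genuinely part of the transcript is removed too, so a more careful approach would trim only G's that do not match the genome. Check the protocol used to generate your data before applying anything like this.

```python
def trim_leading_g(seq, qual, max_trim=1):
    """Remove up to max_trim leading 'G' bases from a read.

    seq, qual: sequence and quality strings of equal length (as in FASTQ).
    max_trim: hypothetical default; the right value depends on the protocol.
    Returns the trimmed (seq, qual) pair.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    n = 0
    while n < max_trim and n < len(seq) and seq[n] in "Gg":
        n += 1
    return seq[n:], qual[n:]


print(trim_leading_g("GGACT", "IIIII"))              # ('GACT', 'IIII')
print(trim_leading_g("GGACT", "IIIII", max_trim=2))  # ('ACT', 'III')
print(trim_leading_g("ACGT", "IIII"))                # ('ACGT', 'IIII')
```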

Mapping 5'RNA-Seq reads to the genome should be done with a splicing-aware mapper like STAR (see here for more details on mapping reads). You could use bowtie or another DNA-based mapping algorithm for 5'GRO-Seq, although STAR is fine for 5'GRO-Seq too.

Creating Tag Directories and Quality Control

Creation of a 5'RNA-Seq tag directory works the same way as with ChIP-Seq or RNA-Seq.

Finding TSS from 5'RNA-Seq Data

The basic idea behind identifying TSS from 5'RNA data is similar to finding peaks in ChIP-Seq data. Active TSS are likely to generate several reads within a confined space (<150 bp to cover both broad and focused promoters). findPeaks is already designed to look for regions like this, but, unlike for ChIP-Seq, we want to make sure we search each strand independently. Also, active TSS are likely to produce several reads from the same initiation site, giving them the appearance of clonal/PCR artifacts. However, in this case, we do not want to penalize clonal reads since they provide the dynamic range of expression at each nucleotide.
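The ideas in this paragraph (strand-separate search, a window of about 150 bp, and keeping duplicate reads) can be sketched in a few lines of Python. This is only a toy illustration of the concept and NOT the algorithm findPeaks implements. The function name call_tss_clusters, the greedy strategy and the min_reads cutoff are assumptions made for the example.

```python
from collections import Counter, defaultdict


def call_tss_clusters(five_prime_ends, size=150, min_reads=5):
    """Greedy toy TSS caller.

    five_prime_ends: iterable of (chrom, strand, position), one per read.
    Duplicate reads are kept on purpose, since they carry the expression
    signal at each nucleotide.
    Returns a list of (chrom, strand, mode_position, reads_in_cluster).
    """
    # Count 5' ends separately for every (chromosome, strand) pair.
    per_strand = defaultdict(Counter)
    for chrom, strand, pos in five_prime_ends:
        per_strand[(chrom, strand)][pos] += 1

    half = size // 2
    clusters = []
    for (chrom, strand), counts in sorted(per_strand.items()):
        remaining = dict(counts)
        while remaining:
            # Strongest remaining initiation site becomes the cluster center.
            mode = min(remaining, key=lambda p: (-remaining[p], p))
            members = [p for p in remaining if abs(p - mode) <= half]
            total = sum(remaining[p] for p in members)
            for p in members:
                del remaining[p]
            if total >= min_reads:
                clusters.append((chrom, strand, mode, total))
    return clusters


reads = (
    [("chr1", "+", 1000)] * 6
    + [("chr1", "+", 1010)] * 2
    + [("chr1", "-", 1005)] * 7
    + [("chr1", "+", 5000)]
)
for cluster in call_tss_clusters(reads):
    print(cluster)
```

In the toy data the plus-strand and minus-strand sites sit only 5 nt apart but are reported as two separate TSS, and the single read at position 5000 is dropped as noise.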

To find TSS with findPeaks, simply run:

findPeaks IMR90-5GROseq/ -o auto -style tss

If you also performed a traditional, non-5' version of the assay (i.e. RNA-Seq for 5'RNA-Seq, or GRO-Seq for 5'GRO-Seq), then use that as background:

findPeaks IMR90-5GROseq/ -o auto -style tss -i IMR90-GROseq/

The "-style tss" option automatically sets the options on findPeaks to work well with 5'RNA-Seq data. It basically expands to "-C 0 -strand separate -fragLength 1 -inputFragLength 1 -tbp 0 -inputtbp 0 -size 150". When used with "-o auto", the output TSS will be placed in a file called 'tss.txt' in the target tag directory.
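If you script your analysis, it can help to keep this expansion in one place. The Python sketch below only assembles the command line as a list of arguments; it does not run HOMER. The helper name build_findpeaks_command and its parameters are made up for this example. The explicit form is included purely to mirror the expansion described above, and whether it behaves identically to "-style tss" in every respect (for example the output file name chosen by "-o auto") has not been verified here.

```python
import shlex

# Options that "-style tss" basically expands to, as listed in the text.
TSS_STYLE_OPTIONS = [
    "-C", "0",
    "-strand", "separate",
    "-fragLength", "1",
    "-inputFragLength", "1",
    "-tbp", "0",
    "-inputtbp", "0",
    "-size", "150",
]


def build_findpeaks_command(tag_dir, background_dir=None, explicit=False):
    """Assemble (but do not execute) a findPeaks command for TSS calling.

    explicit=True writes out the individual options instead of "-style tss".
    """
    cmd = ["findPeaks", tag_dir, "-o", "auto"]
    cmd += TSS_STYLE_OPTIONS if explicit else ["-style", "tss"]
    if background_dir is not None:
        cmd += ["-i", background_dir]
    return cmd


print(shlex.join(build_findpeaks_command("IMR90-5GROseq/", "IMR90-GROseq/")))
print(shlex.join(build_findpeaks_command("IMR90-5GROseq/", explicit=True)))
```

The resulting list can be handed to subprocess.run on a machine where HOMER is installed.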

Output: Peaks will be centered on the mode of the TSS - i.e. the highest individual initiation site.
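Since each peak is centered on the mode, the midpoint of a peak gives you an approximate position for the strongest initiation site. Here is a small Python sketch for pulling that out of a peak file. It assumes the usual HOMER peak file layout (tab-delimited, header lines starting with '#', and the first five columns being peak ID, chromosome, start, end and strand), so check your own tss.txt before relying on it. The function name read_peak_centers and the example line are made up, and the exact midpoint may be off by one depending on coordinate conventions.

```python
def read_peak_centers(lines):
    """Return (peak_id, chrom, center, strand) for each peak line.

    lines: any iterable of text lines, e.g. an open file handle.
    Assumed columns: peak ID, chromosome, start, end, strand, then others.
    """
    centers = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        peak_id, chrom, start, end, strand = fields[:5]
        center = (int(start) + int(end)) // 2
        centers.append((peak_id, chrom, center, strand))
    return centers


# Made-up example content, not real output.
example = [
    "# example header\n",
    "#PeakID\tchr\tstart\tend\tstrand\n",
    "chr1-1\tchr1\t925\t1075\t+\t12.0\n",
]
print(read_peak_centers(example))
```

With a real file you would call it as read_peak_centers(open("IMR90-5GROseq/tss.txt")).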

Creating UCSC Visualization Files

To visualize 5'RNA-Seq experiments in the UCSC Genome Browser, we'll run the makeUCSCfile command (more info here). Since 5'GRO-Seq is strand specific, we need to specify options to ensure it is visualized on separate strands. For our example:

makeUCSCfile IMR90-5GROseq/ -o auto -style tss

You can also make 'coverage' tracks by extending the fragments so that they 'pileup'. Instead of specifying "-style tss", use "-strand separate" and "-fragment given" to generate more traditional coverage tracks, which are better for visualizing the data at larger intervals (i.e. > 50kb).
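The difference between the two kinds of track is easiest to see on a toy example. The Python sketch below builds either a 5'-end track (one count per read, at its first nucleotide) or a coverage track (one count at every nucleotide the read covers), keeping the strands separate. It only illustrates the concept and is not how makeUCSCfile is implemented. The function name build_tracks and the read representation are assumptions.

```python
from collections import Counter


def build_tracks(reads, extend=False):
    """Build per-strand signal tracks from reads.

    reads: iterable of (five_prime_pos, strand, length) tuples.
    extend=False: count only the 5' end of each read (nucleotide resolution).
    extend=True: count every position covered by the read ('pileup').
    Returns {"+": Counter, "-": Counter} mapping position -> signal.
    """
    tracks = {"+": Counter(), "-": Counter()}
    for pos, strand, length in reads:
        if not extend:
            span = [pos]
        elif strand == "+":
            span = range(pos, pos + length)          # extends to the right
        else:
            span = range(pos - length + 1, pos + 1)  # extends to the left
        for p in span:
            tracks[strand][p] += 1
    return tracks


reads = [(100, "+", 3), (100, "+", 3), (200, "-", 2)]
print(dict(build_tracks(reads)["+"]))               # {100: 2}
print(dict(build_tracks(reads, extend=True)["+"]))  # {100: 2, 101: 2, 102: 2}
print(dict(build_tracks(reads, extend=True)["-"]))  # {199: 1, 200: 1}
```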

You can also use makeBigWig.pl and makeMultiWigHub.pl if you have a webserver at your disposal to post the resulting bigWig files (covered in more depth here). Each has an option called '-cage' that will automatically generate tracks at nucleotide resolution.

Quantifying TSS data with annotatePeaks.pl

Be sure to include the following options to make sure you count reads strand-specifically from their 5' ends:

annotatePeaks.pl ... -fragLength 1 -strand + ... > output.txt
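Conceptually, these options mean "treat each read as a single nucleotide at its 5' end, and only count it if it is on the same strand as the TSS". The Python sketch below spells that out on toy data. It is an illustration of the counting rule as described here, not a reimplementation of annotatePeaks.pl, and the function name count_five_prime_ends and the tuple layouts are assumptions.

```python
def count_five_prime_ends(peaks, reads):
    """Count reads whose 5' end falls inside each peak on the same strand.

    peaks: list of (peak_id, chrom, start, end, strand), inclusive coordinates.
    reads: iterable of (chrom, strand, five_prime_pos).
    Returns a dict mapping peak_id -> read count.
    """
    counts = {peak_id: 0 for peak_id, *_ in peaks}
    for chrom, strand, pos in reads:
        for peak_id, p_chrom, start, end, p_strand in peaks:
            same_strand = strand == p_strand
            inside = chrom == p_chrom and start <= pos <= end
            if same_strand and inside:
                counts[peak_id] += 1
    return counts


peaks = [("tss1", "chr1", 925, 1075, "+")]
reads = [
    ("chr1", "+", 1000),
    ("chr1", "+", 1000),  # duplicate 5' end, still counted
    ("chr1", "-", 1000),  # opposite strand, ignored
    ("chr1", "+", 2000),  # outside the peak, ignored
]
print(count_five_prime_ends(peaks, reads))  # {'tss1': 2}
```

The nested loop is fine for a toy example. For real data you would want an indexed lookup, or simply let annotatePeaks.pl do the counting.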

Analysis of 5'RNA Data

Almost all of the routines in HOMER dedicated to ChIP-Seq work well with 5'RNA methods as well.

Can't figure something out? Questions, comments, concerns, or other feedback:
cbenner@ucsd.edu