HOMER

Software for motif discovery and next-gen sequencing analysis

csRNA-seq Analysis Tutorial

Check out the csRNA-seq paper in Genome Research: https://genome.cshlp.org/content/early/2019/10/24/gr.253492.119.abstract

Updated csRNA-seq protocol: csRNAseq2_V5.2.210523.pdf

Sequencing the 5' end of cap-protected RNAs enables the identification of Transcription Start Sites (TSS) at nucleotide resolution. Several varieties of this method exist, including CAGE, TSS-Seq, START-seq, PRO-Cap, GRO-cap, CAP-seq, 5'RNA-Seq, 5'GRO-Seq etc., and each are designed to collect the 5' ends of RNAPII transcripts for sequencing but use different enzymatic or enrichment strategies to achieve their goal. We recently developed [capped short] csRNA-seq to map 'nascent-like' initiation sites, which identifies initiating transcripts by sequencing short (20-65nt) RNAs with 5' caps extracted from total RNA. This strategy targets RNAs created just after RNAPII has initiated, transcribed a short distance, and added a 5' cap to the nascent RNA. Like other methods that attempt to isolate nascent initiation sites like Start-seq, GRO-cap/PRO-cap/5'GRO-seq/CoPRO/etc., csRNA-seq identifies TSS from all RNAPII transcripts, including mRNAs, primary miRNA transcripts, enhancer RNAs (eRNA), promoter anti-sense transcripts, and other stable and unstable transcripts. These techniques are particularly powerful for identifying active regulatory elements (enhancers + promoters) and assessing their activity in a quantitative manner with relatively low sequencing coverage. For example, 20-40 million reads from a csRNA-seq experiment might yield the same number of reads at enhancers as 200-400 million reads in a 5'RNA-Seq experiment.

Introduction to Transcription Initiation

To understand the analysis of 5'RNA data, it is worth taking a moment highlight that there are multiple 'types' of promoters in living organisms. First of all, there are different RNA polymerases including RNA polymerase I (rRNA), II (mRNA, lncRNA, miRNA), III (tRNA), IV(plant specific), viral polymerases, etc., and each polymerase has different mechanisms directing transcription initiation that may vary between organisms. Also be aware that different RNA polymerases may generate RNAs with different modifications and may or may not be present in your sample depending on how the experiment was performed. Most researchers are interested in RNA polymerase II transcripts (i.e. mRNA, eRNA, etc.), and as a result, most methods focus on the identification of RNAs containing a 7-methylguanosine cap protecting their 5' end.

With respect to RNA polymerase II initiation sites, there are two generally recognized 'types' of TSS. Sharp (or Focused) TSS initiate transcription from a single nucleotide (or +/- 2 nt) and resemble the promoters found in molecular biology text books. They often contain well define core-promoter elements such as the TATA box and usually initiate transcription from a purine preceded by a pyrimidine (PyPu, e.g. CA, with the A being the initiating nucleotide).

The other, more common TSS is a broad (or dispersed) TSS or TSS cluster. These promoters initiate transcription from several different sites within a large area (often 50-100 nt in size). These promoters usually lack clear core promoter elements (usually lack a TATA box), but each individual initiation site still usually initiates on a purine preceded by a pyrimidine (PyPu).

The csRNA-seq Method

Capped small RNA-seq (csRNA-seq) accurately maps RNA polymerase II transcription start sites (TSSs) from any tissue or eukaryotic organism where RNA can be extracted. csRNA-seq is an adaptation of Start-seq (Nechaev et al., Scruggs et al.) and small RNA sequencing (Lister et al., Gu et al.). The method captures small (<60 nt), 5'-capped RNAs which associate with newly initiated RNAs, providing data similar to Start-seq or 5'GROseq/GROcap(Kruesi et al., Lam et al.). However, csRNA-seq works with total RNA, eliminating the requirement of nuclei isolation or run-on. csRNA-seq is thus widely applicable and facilitates accurate maps of TSSs of stable (i.e. mRNAs, ncRNAs) and unstable transcripts (i.e. enhancer RNAs, pre-miRNAs, divergent transcripts, antisense transcripts) from fresh, frozen or archived tissues or samples - basically any material where quality RNA can be isolated from.

False TSS - be careful of artifacts

A quick note about artifacts in csRNA-seq (and other 5'RNA) data: Most 5' RNA-Seq methodologies work by enriching for 5' cap-protected RNA, which means that most of the sequence data describes true 5' RNA ends, but a fraction of it may be noise from random RNA fragments that may lack a 5' cap. In particular, highly expressed RNAs may yield reads along exons or highly transcribed gene bodies, giving the appearance of alternative TSS that are likely false positives. Because of this, it is recommend that a non-capped control sample or traditional RNA-Seq data set is used as a "background" when analyzing this type of data. This approach (described below) may end up removing some real TSS from the results, but it is also likely to remove a large number of false positives and clean up your analysis.

Trans-splicing of transcripts (where the 5' end of one transcript is added to the front of another) and recapping (where a transcript is cleaved and a new cap placed on the truncated product) are two phenomena you may want to think carefully about when analyzing 5' RNA data. Trans-splicing will create false negatives and recapping will create false-positives. In certain organisms, such as C. elegans, trans-splicing is common, making csRNA-seq or 5'GRO-Seq much better assays for identifying TSS than 5'RNA-Seq (i.e. measuring the 5' RNA ends before they have a chance to trans-splice). In other organisms (e.g. mouse, human, fly, etc.) trans-splicing appears to be rare. The degree to which transcripts are 'recapped' is a matter of debate because it can be hard to distinguish noise in these assays from true recapped transcripts, or true alternative TSS (in the genome). For practical purposes the analysis framework below assumes capped 5'RNAs found in the context of highly expressed exons are likely to be false positives.

Analyzing csRNA-seq data (initiated transcripts)

Nearly all of this analysis can be applied to related 5'RNA sequencing methods as well, including 5'RNA-seq, 5'GRO-seq, GRO-cap, PRO-cap, CoPRO, Start-seq, TSS-seq, etc.

1. FASTQ trimming

csRNA-seq captures short RNAs that are likely shorter than the read-length used for sequencing. To properly map and analyze the data, you will need to trim the 3' adapter from the end of the sequences. The following command uses homerTools, but you can use your favorite read trimming program too.

homerTools trim -3 AGATCGGAAGAGCACACGTCT -mis 2 -minMatchLength 4 -min 20 exp1.csRNA.fastq.gz
#repeat for all samples and input controls

This command allows up to 2 mismatches when matching the barcode sequence, will only allow a match if at least 4 nucleotides match at end of the read, and will automatically discard trimmed sequences that are less than 20 nucleotides long. The command will produce 2 files: exp1.csRNA.fastq.gz.trimmed (FASTQ containing trimmed sequences) and exp1.csRNA.fastq.gz.lengths, which contains a table in tab-delimited text format describing the read length distribution of the reads. Don't forget to do the same with control samples (i.e. input) that skip the 5' cap enrichment steps.

Quality Control Checks:

Total number of reads (e.g. did you get the expected number of reads in this experiment?)

% of adapter-dimers: Since csRNA-seq relies on adapter ligation, it is possible to generate adapter dimers, which are normally excluded from sequencing by size selection. Furthermore, reads trimmed to lengths less than ~15-20 nt are essentially unusable for downstream analysis due to their inability to be properly aligned to the genome.

Read length distribution (can check this here, or during step 3 below)

2. Align csRNA reads to the Genome

In general, RNA molecules identified by csRNA-seq should be freshly initiated without having time to be spliced or processed in any meaningful way (aside from the cap being added to the 5' end of the RNA). In that sense, it may be defensible to align reads with either a DNA (bwa, bowtie2) or RNA (STAR, Hisat2) alignment algorithm. However, we recommend aligning the reads with a splicing-aware (i.e. RNA) alignment program in case there is significant contamination from spliced RNA molecules in the sample. It is better to know they are there and potentially remove them later than to bury your head in the sand. For example, to align with STAR you might run:

STAR --genomeDir GenomeIndexForhg38/ --runThreadN 24 --readFilesIn exp1.csRNA.fastq.gz.trimmed --outFileNamePrefix exp1.csRNA.fastq.gz.trimmed. --outSAMstrandField intronMotif --outMultimapperOrder Random --outSAMmultNmax 1 --outFilterMultimapNmax 10000 --limitOutSAMoneReadBytes 10000000
#repeat for all samples and input controls

This command will align the reads to create the file exp1.csRNA.fastq.gz.trimmed.Aligned.out.sam, which should contain alignments for all of the csRNA-seq data. It also has options set to output a single alignment position in the case of a read mapping to multiple locations in the genome (i.e. multimapper). The exp1.csRNA.fastq.gz.trimmed.log.final.out file in this example should contain alignment statistics. Control samples should also be aligned in the same manner to facilitate comparisons later.

Quality Control Checks:

Overall alignment %: Ideally, this should be high (>70-90%) depending on organism and completeness of the genome assembly.

Uniquely alignable vs. multimapper %: The relative fraction of multimappers may depend biologically on the sample, organism, and cell type, and may vary technically with RNA quality (e.g. perhaps more rRNA is in the library).

3. Create HOMER Tag Directories

To leverage HOMER's quality control routines and prepare tag directories for use with HOMER's other utilities, run the makeTagDirectory command on the alignment file. This command will also analyze the sequence context of the reads, and requires the genome as well (either as a FASTA file, or if it's preconfigured in HOMER, a genome version name e.g. hg38).

makeTagDirectory Exp1-csRNA-tagDir/ exp1.csRNA.fastq.gz.trimmed.Aligned.out.sam -genome hg38.genome.fa -checkGC -fragLength 150
#repeat for all samples and input controls

This command will create a new directory (Exp1-csRNA-tagDir/) that will contain read information and several QC and metadata files that you can look at. The command also includes options that will trigger the program to check the nucleotide content of the reads and their genomic context out to 150 nt from the 5' end of the reads.

Other potential options to use:

-keepOne : Normally homer discards multimappers, this option will tell it to keep the primary alignment even if the MAPQ < 10
-minlen <#> and -maxlen <#> (Filter reads with lengths outside this range, could be useful to exclude specific contaminants, i.e. "-maxlen 65" could help avoid a stable snRNA that's 70nt long)
-single : If using an organism with a genome composed of many contigs/scaffolds, you may be required to specify "-single" to modify how it stores reads in the directory

Quality Control Checks:

Read clonality/PCR duplicates: This information is available in the tagCountDistribution.txt file located in the completed tag directories. csRNA-seq data is like a hybrid between ChIP- and RNA-seq. It has positional information like you would expect from ChIP-seq, but a dynamic range like RNA-seq. I would not recommend removing duplicate reads as it would unnecessarily limit the dynamic range of the data (although use of UMIs would be best). Usually the input libraries, which often contain high levels of miRNAs, should contain more 'highly clonal' locations in the data. A typical example of good data might look like the following (plotted as a log-log plot). Although rare, look out for anything that has a mode other than 1, which might be indicative of a low complexity library.

Read length distribution: The distribution of read lengths for each experiment will be available in the tagLengthDistribution.txt file located in the completed tag directory. Most csRNA-seq reads are between 20-55 nt in length, with relatively few longer than 60 nt. The distribution normally peaks around 30 nt or so. Depending on the source of the sample, there may be spikes in the distribution that correspond to specific high-abundance RNAs that may be present in the sample. For input controls, there is normally a spike in ~22 nt RNAs corresponding to miRNAs in the sample. Input sample may contain similar or unique spikes at specific lengths depending on the presence of high abundance RNAs or fragments of rRNA in the sample. A typical csRNA-seq library and it's control from a human cell line look like this:

Nucleotide preferences near the 5' end of the read: [Requires a genome to be specified or FASTA file] The composite nucleotide frequency in the genome relative to the 5' end of each read is available in the tagFreqUniq.txt and tagFreq.txt files in the tag directory. The files are similar, except that the tagFreq.txt weights each position based on the total number of reads at each position, while the tagFreqUniq.txt file only counts each position in the genome once, regardless of how many reads align to each position. Normally you can ignore the tagFreq.txt file since it will be heavily influenced by any high abundance/clonal reads in the sample.
What to look for: Most csRNA-seq data should have a strong preference for Initiator-like sequences near the 5' end (usually C_-1A₀). Most species also show evidence for a TATA box at approx. -30 bp (e.g. spike in A/T content). Most samples also have specific preferences in the region downstream of the TSS from +1 through +40. This region is normally G-rich in vertebrates. In general, Metazoans commonly display 10 nt helical fluctuations in nucleotide content downstream of the TSS (e.g. +40 to +190) likely corresponding to sequences that help rotationally position nucleosomes relative to transcription initiation. Control libraries, on the other hand, can vary widely depending on the relative fraction of non-TSS associated reads they contain. In some cases they may contain features that a good csRNA-seq library should have (i.e. as listed above), while other times they have their own sequence preferences corresponding to the types of high abundance RNAs represented in the library.
Note on GC-content: Some csRNA-seq libraries are more GC-rich or less GC-rich depending on the sample being studied, which might cause the relative levels of G+C relative to A+T to change. However, good csRNA-seq libraries should still have the major features described above.

Distribution of reads relative to one another: The tagAutocorrelation.txt file in the tag directory contains information about how reads relate to one another in the sample. In the case of transcription initiation data, reads from TSS will typically cluster on the same strand within 50-100nt from each other. Due to the [normally] bidirectional nature of transcription from regulatory elements, reads will also typically appear upstream of TSS on the opposite strand (i.e. bidirectional transcription). Control libraries, with reads typically originating from high abundance RNAs or fragments of longer RNAs, normally don't show much enrichment for these bidirectional read clusters:

4. Create Genome Browser Visualization Files (e.g. bigWigs)

Visualizing the experiments in a genome browser is imperative to provide an independent assessment of the experiment quality and read distribution, and enables you to visualize how the experiment may relate to other data or annotations. Visualizing the experiment as the rates of transcription at single nucleotide (5' ends, i.e. sites where transcription initiates) is typically the most useful way to visualize the experiment, but you could also visualize the result more traditionally as a read-pileup showing the coverage of the reads. Note that the assay is strand specific, so separate visualization files will need to be generated for + strand and - strand alignments. Usually, we visualize csRNA-seq data in either the UCSC Genome Browser or with IGV. In principle you can visualize the data with any genome browser. By default, these routines will normalize the resulting bedGraphs by total number of uniquely aligned reads (to an effective total of 10 million).

Creating visualization files in steps from bedGraphs:

makeUCSCfile Exp1-csRNA-tagDir/ -style tss -strand + > exp1.posStrand.bedGraph
makeUCSCfile Exp1-csRNA-tagDir/ -style tss -strand - -neg > exp1.negStrand.bedGraph

Note that when creating the negative strand data, it's useful to specify "-neg" to have the reads extend downward when visualized to make the separation between + and - strand reads more obvious. The above example will visualize the data at single nucleotide resolution - to visualize the total read coverage, replace "-style tss" with "-style rnaseq". The resulting bedGraph files can be gzipped and uploaded to either UCSC or used with IGV.

To create bigWigs from bedGraphs, you'll need UCSC's bedGraphToBigWig utility (more info on this stuff here), as well as a file specifying the sizes of the chromosomes:

bedGraphToBigWig input.bedGraph chrom.sizes output.bigWig

Other HOMER scripts to generate visualization files:

The following scripts help automate the creation of bigWig files from HOMER-style tag directories. More information is available here. Hubs are useful in that they can be used to combine + strand and -strand bigwig files into the same container so that both strands can be very clearly visualized at the same time (like in the examples below). Both scripts support the option "-tss" and "-rna", which will trigger the creation of single nucleotide or coverage graphs, respectively.

makeMultiWigHub.pl <hubname> <genome> [options] -tss -d <tag directory1> [tag directory2]...

makeBigWig.pl <tag directory> <genome> -tss [special options] [options]

Quality Control Checks:

General distribution of csRNA-seq reads: You should see nice signal at the 5' end of annotated genes. Depending on the organism, you should also generally see reads initiating in the opposite direction (divergent or bidirectional transcription). In most vertebrates you should also see csRNA-seq read clusters in enhancer regions too, although the signal here will generally be lower than near most promoters. Visually inspecting areas that are thought to be super enhancers is normally a good idea as they should have evidence of enhancer transcription. Very high signal may also be present at certain snRNA, snoRNA, or other high abundance RNA.

Visually check exon enrichment at highly expressed genes: One of the main sources of contamination in csRNA-seq is from reads that map to the downstream exons from highly expressed genes, particularly when analyzing samples with poor RNA quality (i.e. low RIN values). For example, reads from the csRNA-seq and the control sample have signal on the ENO1 gene exons in the example below, which is highly expressed in K562 cells. Note the level of the y-axis in the upper and lower panels. This example comes from a cell line with high quality RNA, which is relatively "clean", and in this case it should be relatively easy to identify these regions as not enriched. However, the more reads that associate with these regions, the greater the risk that TSS clustering discovery will identify false positives from these regions.

Spot-check TSS and key genes: In a vast majority of cases reads should be initiating from PyPu initiators (e.g. CA, TG, CG, TA), where the initiation site is on the Pu.

5. Finding TSS clusters from csRNA-seq data

This section will describe the identification of TSS clusters or "peaks" from csRNA-seq data. These are sites with significant initiation activity, and usually correspond to active regulatory elements in the genome. To be clear and avoid confusion, we will use the following definitions:

TSS in any individual nucleotide site that initiates transcription.
TSS cluster or peak: one or more TSS initiating from the same regulatory element and in the same direction (generally all TSS within the cluster are co-regulated or correlated in their activity). A TSS cluster is analogous to a 'peak' from ChIP/ATAC-seq experiments.

HOMER offers two related options to identify TSS clusters. If you would like to perform a quick'n'dirty analysis of the data, you can simply use findPeaks using the "-style tss" option, which does a very reasonable job of identifying TSS clusters with minimal assumptions. For a more careful and fully annotated analysis of the data, use findcsRNATSS.pl. The later script inherently uses the former, but also tries to use genome annotations to automatically optimize thresholds for TSS cluster detection, and will also provide various annotations that may be useful.

Simple Approach (findPeaks):

findPeaks will identify strand-specific clusters of reads in regions the size of 150 nt. It uses essentially the same approach used for ChIP-seq peak finding, except that duplicate reads are considered and the final TSS cluster will be centered on the mode (i.e. the most highly expressed individual TSS). This approach generally works well, particularly for clean samples with relatively minimal contamination from highly expressed exons.

findPeaks Exp1-csRNA-TagDir/ -i Exp1-input-TagDir/ -style tss > tssOutput.txt

This command will produce a HOMER-style peak file (similar to BED - convert with pos2bed.pl) specifying the locations of the identified TSS clusters.

Key parameters to consider:

-ntagThreshold <#> : Minimum number of reads to consider before a region is considered a valid TSS cluster, measured in reads per 10 million. By default this is estimated the same way as it is for ChIP-seq by considering a Poisson distribution. This can be tricky (covered more in the findcsRNATSS.pl section), so you may want to adjust this directly to get more or less clusters.
-F <#> : Fold enrichment relative input/control sample (default is 2.0). This helps remove contaminants.
-L <#> : Fold enrichment relative to the surrounding 10 kb region (default is 2.0). This helps remove regions with constant signal (like downstream exons) or other odd artifacts. It may also eliminate valid TSS clusters that are 'near' other much highly expressed TSS clusters.

Full Approach (findcsRNATSS.pl):

findcsRNATSS.pl is a relatively complicated script which uses first uses findPeaks to perform an initial identification of putative TSS clusters, then attempts to annotate them and quantify different aspects of their read coverage to distinguish true TSS from likely false positives. It also provides detailed annotation information, and depending on the data that is available, will attempt to annotate TSS clusters that produce stable/unstable transcripts, uni- or bi-directional initiation, etc.

Although the program can work with simply a single Tag Directory, its full functionality requires the following input files:

csRNA-seq Tag Directory [required] (assumed to be single end, + strand, i.e. 5' end of read represents TSS)

input/control Tag Directory (assumed to be single end, + strand)

RNA-seq Tag Directory (assumed to be + strand***)

Genome FASTA file*

GTF file*

* To use HOMER's pre-configured genome/annotations, just specify the genome version (e.g. "-genome hg38").

*** Notes about RNA-seq tag directories

Since different RNA-seq methods can yield different distributions of reads, you may need to create a special RNA-seq tag directory that works best for this analysis. In fact, it is best to create the tag directory such that only the 5' reads associated with each RNA-seq fragment are considered to make it a more 'fair' control for csRNA-seq. If using a truseq-style RNA-seq method (RNA encoded on the negative strand, dUTP methods), use "-flip" (single end data) or "-read2" (paired end data) when creating the tag directory with makeTagDirectory for the RNA-seq. If the sample encodes the + strand, then consider using "-read1" if the sample is paired-end.

SE, +strand encoding: makeTagDirectory TagDir/ alignment.sam
PE, +strand encoding: makeTagDirectory TagDir/ alignment.sam -read1
SE, -strand encoding(i.e. TruSeq): makeTagDirectory TagDir/ alignment.sam -flip
PE, -strand encoding(i.e. TruSeq): makeTagDirectory TagDir/ alignment.sam -read2

To find TSS clusters [or peaks] run the following command (the csRNA tag directory is the first argument):

findcsRNATSS.pl <Tag Directory> [options]

findcsRNATSS.pl Exp1-csRNA-TagDir/ -o outputPrefix -i Exp1-input-TagDir/ -rna Exp1-totalRNAseq-TagDir -gtf genes.gtf -genome genome.fasta

This command will create several output files with the prefix provided with parameter "-o <prefix>" (default is "out"). The main output file will be prefix.tss.txt, which is a tab-delimited text file with each row corresponding to a TSS cluster identified in the data. The positions are given as a chromosome, start and end position, and a strand. The positions are centered on the mode of the cluster (i.e. the most prominent TSS). The other output files are mainly for QC purposes, but include:

out.tss.txt : primary output file, contains positions of TSS clusters and annotation information

out.stats.txt : contains general information about the run, the command used, parameters selected for the analysis.

out.tss.txt.freq.tsv : Nucleotide frequencies relative to the primary TSS of each valid TSS cluster.

out.alltss.txt : All putative TSS clusters, not just the valid ones, in case you want to go back and look through the others.

out.inputDistribution.txt : if you used an input/control sample, this will show the CDFs for the threshold optimization.

out.rnaDistribution.txt : same as the input, but for an RNA-seq control.

Key parameters to consider:

-ntagThreshold <#> : Minimum number of reads to consider before a region is considered a valid TSS cluster, measured in reads per 10 million (default 7.0). Unlike peak finding for ChIP-seq, which has a high amount of background from non-specific DNA, csRNA-seq (and other RNA-based methods) can be much more sensitive, such that even single reads have a reasonable chance of representing true sites of transcription initiation. This difference is also problematic across genome sizes, where the size of the genome may not scale with the total number of TSS. The default is set to ensure TSS clusters have a reasonable number of reads to ensure their potential enrichment over control samples. You may need to adjust this lower (if the samples are really clean) or higher (if the samples are too dirty and the input/RNA-seq control data is not filtering false positive TSS clusters aggressively enough).

-includeSingleExons : As explained below, the method is a little more robust/sensitive when calculating detection thresholds when using multiexon genes (single exon genes are sometimes high abundance transcripts or non-RNAPII templated transcripts). However, some organisms are comprised of mainly single exon genes (i.e. no splicing) and specifying this option will include them in the threshold defining calculations.

Working with poorly annotated genomes:

Some organisms may lack a quality set of gene annotations (i.e. no GTF file). In these cases, your options are (1) Use the simple findPeaks method above, or (2) if RNA-seq data is available, identify transcripts de novo using Cufflinks/Stringtie/etc. and use the resulting GTF file with findcsRNATSS.pl.

Steps performed by findcsRNATSS.pl to identify valid TSS clusters:

First, all potential TSS clusters are identified by using findPeaks to scan for regions with a high density of reads. This step is performed without any filtering for input or RNA-seq coverage to identify the locations where csRNA-seq reads are being aligned.

Using the GTF file (or homer annotation), the putative TSS clusters are annotated based on whether they are near an annotated TSS (within 200 bp), or if they fall on an annotated exon. The exons are further segregated based on whether they are in single exon genes, if they are part of the first exon or a downstream exon, etc. By default, HOMER will not include TSS clusters from single exon genes when optimizing detection thresholds, however, this can be overridden by specifying -includeSingleExons (useful for lower organisms with few spliced transcripts).

The reads in putative TSS clusters at annotated TSS and downstream exons are they calculated for csRNA-seq, control, and RNA-seq samples. For csRNA-seq and input, the read density corresponding to the cluster is calculated by the number of 5' read ends found with 75 bp from the primary TSS (i.e. peak size 150 centered on the TSS). For RNA-seq, reads are counted from the regions -150 to -10 relative to the TSS (i.e. upstream). By avoiding counting RNA-seq reads found at the TSS or downstream, we hope to identify TSS found within the middle of transcripts where false positives are common. The RNA-seq levels at the TSS or downstream may be expected at valid TSS depend on other criterion such as transcript stability, resulting in the erroneous discarding of valid TSS if we count the signal in these regions.

Since the presence of high abundance RNAs (e.g. snRNAs, snoRNAs, miRNAs, rRNA) and the quality of the RNA (i.e. RIN number) can vary between samples, cell types, tissues, organisms, etc., it is difficult to know a priori what the relative levels of input or RNA-seq should be relative to csRNA-seq levels at valid TSS. To guide this decision, the log2 ratios of csRNA-seq to both input and RNA-seq are calculated for each TSS and exon putative TSS cluster, and the log2 ratio that provides the greatest trade-off between positive identification of annotated TSS-associated clusters and exon-associated clusters is selected. For example, the graph below is created from the prefix.inputDistribution.txt file that is created by the script. Here, the log2 threshold is continuously lowered, and the total fraction of annotated TSS and annotated exon associated TSS are shown. The threshold is selected where the distributions are maximal (In this case at -0.34). You can specify "-fisherExact" to instead select the log2 ratio where the hypergeometric enrichment is max (takes longer to compute). For comparison, a sample with significant exon contamination but low read coverage in the input is shown next to it.

Once optimized thresholds are determined, they are then applied equally to ALL putative TSS clusters (regardless of annotation) to establish a set of valid TSS clusters. If both control/input and RNA-seq tag directories are provided, each threshold will be applied independently (i.e. valid TSS cluster must pass both thresholds).

Annotation performed by findcsRNATSS.pl:

TSS cluster shape (aka focusRatio) : Ratio of reads found near the primary TSS [-5,+5] relative to the full surrounding region [-75,+75]. High focus ratios (approaching 1.0) indicate that all of the reads originate from single location within the promoter, making it a sharp/focused TSS cluster. Lower ratios indicate that the region has multiple strong TSS diffused over a larger region, making it a broad/dispersed TSS cluster. This value is also calculated by findPeaks when using the '-style tss' option.

General annotation : Type of TSS cluster and the Gene Identifier associated with it, which can include:

tss : likely annotated transcript/gene TSS

tssAntisense : antisense to an annotated TSS

singleExon : TSS for a single exon transcript/snRNA/snoRNA/etc.

firstExon : Found downstream of the annotated TSS in the first exon

otherExon : TSS found in a downstream transcript, sense direction

otherExonBidirectional : Found in downstream exon, but there is also evidence for bidirectional transcription increasing the likelihood that it may be a valid TSS.

other : everything else (enhancers, etc.)

Closest annotated TSS : including the distance, gene type, and gene name (for linking for regulatory purposes)

Promoter proximal vs. distal : relative to annotated promoters, default definition is >500bp for a distal TSS cluster)

Stable vs. Unstable Transcript : If RNA-seq is available, TSS clusters will be annotated based on whether stable RNA is detected downstream [-100, +500]. By default, if at least 2 reads (per 10 million) are found, the transcript is considered stable.

Bidirectional vs. unidirectional : If csRNA-seq reads (2 per 10 million) are found in the antisense direction [-500,+100] relative to the primary TSS, the cluster is considered bidirectional.

Regulatory Element Configuration: Combines the stable/unstable and bidirectional annotations into a 1/2-letter code:

SS (TSS cluster and antisense direction produce stable transcripts)

SU (TSS cluster is stable, antisense direction is unstable)

S (TSS cluster is stable and unidirectional)

US (TSS cluster produces an unstable transcript, but the antisense direction creates a stable one)

UU (TSS cluster produces two unstable transcripts in sense and antisense directions)

U (TSS cluster is unstable and unidirectional)

Quality Control Checks:

Total number of TSS clusters and their distribution: Data generated from mammalian cell lines will typically have 30-100k+ TSS clusters, and tissues a little less than that. Numbers that vary greatly above or below that number might require a closer look. Other species can be variable.

Spot check TSS clusters in browser : Informaticians don't like this as much, but it is important to check examples of the TSS clusters called with a genome browser to see if the identified sites look "normal". This works best if other data can be viewed at the same time (i.e. load tracks for RNA-seq or any other experiments you may have). For example, are the TSS clusters located near the 5' ends of transcripts?

Nucleotide Frequencies at the primary TSS : A quick quality control check provided by findcsRNATSS.pl is a file containing the nucleotide frequencies found near the primary TSS from each cluster (prefix.tss.txt.freq.tsv). You can also calculate these for any set (or subset) of TSS as described in the sections below. In general, mammalian TSS should resemble the human figure below, although the overall GC content may vary depending on the source of the sample. Fly was included to provide some reference of how things may vary as you investigate other metazoan TSS:

Cross-reference with other data : The best way verify the TSS clusters is to look at the TSS calls with respect to other data that provide independent evidence of transcription initiation activity. Are the TSS being called near the 5' ends of transcripts or regions with RNA-seq signal? Histone modifications like H3K4me3/2/1 and H3K27ac (histone acetylation in general) are highly associated with active initiation. TSS are also very often associated with open, nucleosome-free chromatin, meaning most sites should also overlap well with ATAC-seq or DNase-seq peaks.

The sections below cover various downstream analysis strategies that may be useful. Although this page describes how to perform these analyses with HOMER, the analysis methodology can be applied with a variety of other NGS tools as well:

6. Comparing TSS positions

To compare TSS clusters between two or more experiments, you can use HOMER's mergePeaks command (general page). This comparison is performed on the positions of the TSS, and does not consider quantitative changes - for that, see the next section.

To merge the TSS cluster positions from two separate experiments into a single set of non-redundant TSS clusters, use the following:

mergePeaks exp1.tss.txt exp2.tss.txt -strand > merged.tss.txt

By default, this will merge TSS clusters that overlap by at least 1 bp. It is critical to include the '-strand' option to ensure the TSS clusters that are merged are from the same strand. To control the distance between clusters to be merged, you can use the "-d <#>" to specify maximum distance between cluster centers that are merged. You can also merge multiple (3+) files at the same time. If you wish to separate out the TSS clusters that are specific to one or more of the files (instead of a single merged file), specify "-venn <filename>" and separate files will be produced.

NOTE: TSS clusters that are merged between two experiments will be replaced with a single region that may no longer be centered on the TSS mode (i.e. primary TSS) if the mode differs between the experiments compared (i.e. by default, merging two peaks will replace the original peaks with a single peak that is the average of their positions). If it is important that the merged peaks still be centered on the strongest TSS for subsequent analysis, you can recenter peaks on the primary TSS (i.e. the mode) for a given experiment by running the following:

getPeakTags exp1.tss.txt Exp1-csRNA-tagDir/ -center -fragLength 1 -strand + -start -150 -end 150 > recentered.txt

For some analyses you may want to determine which TSS overlap a given feature without merging the features together. mergePeaks has a "-cobound <#>" feature which will use the first file provided and compare it to the others. For example, lets say you want to overlap TSS positions with ATAC-seq peaks from another experiment to see which ones overlap and which ones don't:

mergePeaks exp1.tss.txt atac.peaks.txt -cobound 1

This command will produce two files - coBoundBy0.txt (non-overlapping TSS) and coBoundBy1.txt (TSS that overlap with ATAC peaks). In this case, since the ATAC-seq peaks are not strand specific, we omit the "-strand" option. The number (1) associated with "-cobound 1" is not too important here and can be set to 1 - more information on mergePeaks can be found here.

Quantifying shifts in TSS positions between expeirments

To compare the positions of the primary TSS between two experiments, you can run the following command. For each TSS in the first file given to the program, it will find the closest TSS in the second file on the same strand and provide the distance (The "-pdist2" and "-strand +" are analysis/reporting options needed to tell the program to only look for peaks on the same strand and to report the relative distance):

annotatePeaks.pl <tss/peak/BED file> <genome> -p <tss/peak/BED file2> -pdist2 -strand + > outputFile.txt
annotatePeaks.pl exp1.tss.txt hg38 -p exp2.tss.txt -pdist2 -strand + > output.txt

Alternatively, you could also generate a histogram to see the relative distribution of TSSs from the 2nd file relative to the first (The "-size <#>" and "-hist <#>" control the region plotted and the size of the histogram bins, respectively):

annotatePeaks.pl <tss/peak/BED file> <genome> -p <tss/peak/BED file2> -size <#> -hist <#> -strand + > outputFile.txt
annotatePeaks.pl exp1.tss.txt hg38 -p exp2.tss.txt -size 1000 -hist 10 -strand + > outputFile.txt

7. Quantifying TSS cluster strength and calculating differential expression

csRNA-seq is a lot like ChIP/RNA combination experiment. The signal is generally confined to well defined 'TSS/peak' regions in the genome like a ChIP-seq experiment for a transcription factor, but the dynamic range of the signal is much greater, similar to RNA-seq. In general, quantitative analysis of csRNA-seq data can be handled in much the same way as any ChIP/ATAC/RNA-seq experiment. The steps below highlight how to quantitatively compare csRNA-seq signal at different sites within the HOMER framework, although this can be done with other NGS tools in a similar manner too.

The general strategy to compare two experiments (or replicates) is the same for virtually any quantitative NGS experiment:

Identify the genomic features we want to compare across all experiments (i.e. merge TSS clusters across all experiments)

Quantify the reads for all features across all experiments

If replicates are available, perform differential enrichment calculations

This workflow can also apply to cases where we have 3+ experimental conditions - just replace the final differential expression calculations with a fancier analysis, such as clustering, ANOVA, etc.

Step1: Merge TSS clusters into a single non-redundant set to quantify reads with.

mergePeaks exp1.tss.txt exp2.tss.txt -strand > merged.tss.txt

If you have multiple replicates, you can either merge all of the TSS clusters identified from each individual replicate, or you could combine replicates into "meta-experiments" and perform the TSS cluster calling on the merged experiments. The latter is probably a little more sensitive, but in the end the results are *probably* not that different.

Step2: Quantify csRNA-seq read counts across experiments

annotatePeaks.pl merged.tss.txt hg38 -strand + -fragLength 1 -raw -d Exp1-tagDir/ Exp2-tagDir/ > counts.txt

Using HOMER's annotatePeaks.pl tool, we can create a table of read counts for all TSS clusters across all experiments. When using csRNA-seq, be sure to specify "-strand + -fragLength 1" to ensure that the program quantifies the reads based on their 5' ends (initiation sites). You can also have the program normalize the read counts in different ways, which can be useful:

-raw : Straight read counts, useful if feeding the table into another tool like DESeq2 (see below)

-rlog : If R (with DESeq2) is configured on the system, this will use the rlog to transform the data - very useful if you want to do clustering/PCA/etc. as a downstream analysis. The program also supports "-vst" to use that transform instead.

If neither is used, it will normalize the reads relative to 10 million total mapped reads per experiment.

In general, annotatePeaks.pl provides many useful capabilities to assist with analysis - more info here and here.

Step3: Calculate significant changes in csRNA-seq read counts (requires replicates, R/DESeq2 to be installed)

#lets say you have 2 experiments with 2 replicates each:

mergePeaks exp1-rep1.tss.txt exp1-rep2.tss.txt exp2-rep1.tss.txt exp2-rep2.tss.txt -strand > merged.tss.txt

annotatePeaks.pl merged.tss.txt hg38 -strand + -fragLength 1 -raw -d Exp1-rep1-tagDir/ Exp1-rep2-tagDir/ Exp2-rep1-tagDir/ Exp2-rep2-tagDir/ > rawcounts.txt

getDiffExpression.pl rawcounts.txt exp1 exp1 exp2 exp2 > output.txt

More information on differential analysis with HOMER can be found here.

8. Analysis of Sequence Features and Motif Discovery at TSS

Examining the nucleotide frequencies relative to TSS can reveal several interesting features, including evidence for core promoter elements, CpG islands, nucleosome positioning sequences, and other strong sequence biases that may be important for gene regulation. To find the DNA nucleotide frequencies relative to a set of TSS, run the following command:

annotatePeaks.pl <tss/peak/BED file> <genome> -size <#> -hist 1 -di > output.txt

annotatePeaks.pl exp1.tss.txt hg38 -size 1000 -hist 1 -di > output.txt

The first argument is the TSS cluster positions file (i.e. out.tss.txt), the 2nd argument is the genome (for custom genomes use a FASTA file). The "-size <#>" option specifies how large of a region you want to look for nucleotide frequencies, and the "-hist <#>" specifies the bin size used in the histogram (1=single nucleotide percision). The "-di" option is required to tell the program to look at the DNA sequence (short of dinucleotide frequencies). For example, below is an example of nucleotide frequencies near zebrafish promoters:

To find DNA motifs that are enriched near TSS clusters, you can run the following command:

findMotifsGenome.pl <tss/peak/BED file> <genome> <outputDirectory> [options]

findMotifsGenome.pl exp1.tss.txt hg38 OutputDirectory/ -size -150,50

The first argument is the TSS cluster positions file (i.e. out.tss.txt), the 2nd argument is the genome (for custom genomes use a FASTA file), and the 3rd argument is an output directory where the result files will be placed. For TSS, most of the key transcription factor motifs can usually be found just upstream of the TSS, unlike ChIP-seq or ATAC-seq, where the motifs are likely right in the center of the peak. Because of this, you can usually get higher sensitivity if you tell the program to modify the search to the region from -150 to +50 relative to the TSS ("-size -150,50", as in the example above).

Once you've identified motifs, you can check for their distribution relative to the TSS to look for additional patterns. To do this, you'll need to select one of the *.motif files identified in the motif analysis (or grab one or more from the homer/motifs/ directory). You can then run annotatePeaks.pl to create a histogram of their locations relative to the primary TSS. The "-size <#>" and "-hist <#>" specify the properties of the histogram:

annotatePeaks.pl exp1.tss.txt hg38 -size 500 -hist 1 -m sp1.motif > output.txt

The graph above looks for where motifs are found relative to the TSS. However, another strategy is to look at where the csRNA-seq reads fall relative to the location of motifs. To do this, we first recenter our TSS/peaks/regions on the motif, and then quantify the csRNA-seq data in the surrounding regions. When centering the motif, all of the resulting peaks will have the motif site (e.g. Sp1) starting at 0 (center of the peak). When quantifying csRNA-seq reads relative to Sp1, the option "-histNorm <#>" triggers the program to normalize the read profile for each peak before adding it to the composite profile (this way very highly expressed regions or other 'spikes' in the data don't drown out the other regions). The value for "-histNorm <#>" represents the minimum total normalized number of reads used in the normalization (this way regions with only 1 read do not create spikes in the profile either).

#first re-center the TSS on the motifs
annotatePeaks.pl exp1.tss.txt hg38 -size 500 -center sp1.motif > recentered.peaks.txt

#Then quantify reads near these peaks
annotatePeaks.pl recentered.peaks.txt hg38 -size 500 -hist 1 -histNorm 20 -d Exp1-csRNA-tagDir/ > output.txt

9. Comparing csRNA-seq/TSS to other data.

Integrating different types of data with csRNA-seq is similar to any integration workflow with NGS data. For example, if you have several HOMER tag directories for different types of experiments, such as histone modifications, ATAC-seq, etc., you can quantify each at TSS clusters or create histograms that show the relative enrichment of each experiment spatially.

For example, if you have ChIP-seq for H3K4me3/H3K27ac and ATAC-seq data, first create Tag Directories, and then you can map their distributions relative to the TSS clusters found from csRNA-seq (note: the "-tbp 3" option limits the number of reads (i.e. PCR duplicates) per position to 3, which removes random spikes from mapping artifacts):

annotatePeaks.pl exp1.tss.txt hg38 -size 2000 -hist 10 -tbp 3 -d ATACseq-TagDir/ H3K4me3-TagDir/ H3K27ac-TagDir/ > output.txt

Instead of making a histogram, you could quantify the levels of each experiment within 1kb of the TSS and compare their levels at each site:

annotatePeaks.pl exp1.tss.txt hg38 -size 1000 -log -d ATACseq-TagDir/ H3K4me3-TagDir/ H3K27ac-TagDir/ > output.txt

Graphing the levels of levels of H3K4me3 vs. H3K27ac at each peak yields the following, which shows groups of TSS corresponding to H3K4me3+/H3K27ac+ gene promoter TSS and H3K4me3-/H3K27ac+ enhancer TSS:

These are meant as examples - there are countless ways to analyze and compare these data!

Can't figure something out? Questions, comments, concerns, or other feedback:
cbenner@ucsd.edu