|
Quantifying Differential Features
(Enrichment/Expression)
If you boil down the analysis of most quantitative NGS data
types and experiments, it all comes down to defining a set
of features (whether they are peaks, transcripts, promoter
regions, etc.) and then quantifying your set of experiments
across those features to identify if any of them are
differentially regulated. Within HOMER, feature
identification is generally handled by findPeaks,
while quantification is mostly performed with annotatePeaks.pl
(peak/region based) or analyzeRepeats.pl
(gene/repeat based). The analysis of differential regulation
in handled by getDiffExpression.pl, which uses R/Bioconductor and DESeq2
or edgeR
to perform the differential enrichment calculations. If you
are trying to find differential peaks from scratch, consider
using the getDifferentialPeaksReplicates.pl command
instead, which automates running
findPeaks/annotatePeaks.pl/getDiffExpression.pl for you.
The other operation that getDiffExpression.pl will do is
normalize your input raw counts and return this in the
output file along with the differential regulation
statistics. By default the program will output rlog
variance stabilized values calculated by DESeq2. These are
useful when doing clustering, PCA analysis, etc. This can be
modified with the options specified below.
Required 3rd party software to run getDiffExpression.pl:
HOMER requires R/Bioconductor to be installed
with packages for DESeq2 for getDiffExpression.pl to
work. The command "R" must be available in your executable
PATH. For tips on how to install them, check out the
installation page.
R statistical computing environment
Bioconductor (R packages for bioinformatics)
DESeq2
DESeq - packages for
differential expression calculations (DESeq2 is
recommended)
EdgeR
Step1: Preparing input files for
getDiffExpression.pl
The getDiffExpression.pl program is
essentially a wrapper for R/Bioconductor/DESeq2 to make
running those programs easy using data generated by other
HOMER programs. In general, we want to provide a data
matrix containing raw read counts across
several experiments that we want to compare (tab-delimited
text format). These types of matrices are readily
created using annotatePeaks.pl or analyzeRepeats.pl
and using the "-raw" (or "-noadj") options
to report the raw counts.
For example, lets say we have RNA-seq tag directories for
WNT and Mock treated conditions in replicate. To
quantify their expression (raw counts), we might run the
following:
analyzeRepeats.pl rna hg38 -count exons
-condenseGenes -raw -d
RNAseq-Mock-rep1/ RNAseq-Mock-rep2 RNAseq-WNT-rep1/
RNAseq-WNT-rep2/ > countTable.rna.txt
For another example, lets consider quantifying the
differences in H3K4me3 signal at promoters for the same
conditions:
annotatePeaks.pl tss hg38 -raw -d
H3K4me3-Mock-rep1/ H3K4me3-Mock-rep2/
H3K4me3-WNT-rep1/ H3K4me3-WNT-rep3/ > countTable.peaks.txt
You can also use analyzeRNA.pl to generate these
matrices, but generally I would recommend using analyzeRepeats.pl.
You can also create your own simple data matrix
that contains a single column of IDs (i.e. gene/region
IDs) and the top row for experiment names.
IMPORTANT: Make sure you remember the order that
your experiments and replicates where entered in for
generating these commands. Because experiment names
can be cryptic, you will need to specify which experiments
are which when running getDiffExpression.pl to
assign replicates and conditions.
Some of you might be wondering if you can use FPKM, TPM,
normalized read counts, or some other type of processed
data. The simple answer is "yes you can" (getDiffExpression.pl
will round everything to the nearest integer and it
shouldn't 'crash'), but the better answer is "no you
shouldn't". For a vast majority of NGS assays, the raw
read counts have well-behaved variance structure as a
function of intensity, which is what DESeq2 and EdgeR will
be modeling when calculating the significance differences
in the data. FPKM, TPM, etc. introduces
normalization factors (i.e. for the length of the gene)
that will obscure the intensity vs. variance relationship
and undermine the assumptions used by the programs.
There may be cases where you do want to provide processed
values instead of raw counts, so the program will do its
best to choke them down!
Step2: Running getDiffExpression.pl to find
differentially regulated features
Once you have a raw count file containing a
tab-delimited matrix of raw count files, you're ready to
run getDiffExpression.pl. The getDiffExpression.pl
program is executed with the following arguments:
getDiffExpression.pl <raw count file>
<group code1> <group code2> [group
code3...] [options] > diffOutput.txt
#example for the RNAseq file prepared from above
getDiffExpression.pl countTable.rna.txt Mock Mock WNT
WNT > diffOutput.txt
Key arguments for getDiffExpression.pl:
- [required] The raw count file is the
first argument to the program - this is the output
file from analyzeRepeats.pl, analyzeRNA.pl
or annotatePeaks.pl. It can also be a simple
tab-delimited text file with a header and a single
column of IDs followed by the matrix of counts.
- [required] Provide sample group annotation
for each experiment with an argument on the command
line (in the same order found in the file, i.e. the
same order given to the
annotatePeaks.pl/analyzeRepeats.pl command when
preparing the raw count file).
i.e. if you provide three tag directories for
macrophage RNA-Seq and two tag directories for Bcell
RNA-seq, enter " mac mac mac bcell bcell ". This
tells the program which columns go with which
experiment, and which are replicates.
- Differential Expression Calculation program - To use
any of them they must already be installed on your
local copy of R:
- "-edgeR"
- "-DESeq"
- "-DESeq2" (default)
- [Usually not needed] Use "-repeats", "-rna",
"-peaks", or "-basic" depending on if
you used analyzeRepeats.pl, analyzeRNA.pl
or annotatePeaks.pl to generate the initial
raw count table. The program will generally auto
detect which program was used.
Other Parameters and Considerations
No replicates
If you do not have replicates, the R
packages DESeq2 or DESeq will automatically attempt to
model the variance and provide differential
expression, although they will be VERY conservative
and you'll be luck to find anything that is
differentially expressed. edgeR, on the other
hand, requires that you set the common dispersion when
performing the calculation to estimate what you think
the general variance in the sample is. You can
set this dispersion explicitly "-dispersion
<#>" (default is 0.05, only works with
edgeR). The other potential problem is that
DESeq2 might have a problem performing the rlog
variance stabilization described below. Because
of that you may also want to use the options
"-simpleNorm" to normalize the output read counts so
that they are directly comparable.
In general, if you don't have replicates, I would
recommend running the program with the paramters "-edgeR
-simpleNorm -dispersion 0.05".
Multiple Comparisons
By default, if 3 or more groups of samples are
provided, the script will compare the first
experiment against each of the others. If you
instead want all pairwise comparisons to be
performed (all vs. all), specify "-AvsA". For
example, lets say you had 6 experiments across 3
different conditions. If you specify "cond1
cond1 cond2 cond2 cond3 cond3", the output would
provide differential calls for "cond1 vs. cond2" and
"cond1 vs. cond3". Adding the "-AvsA"
option would include the "cond2 vs. cond3" as well.
Unbalanced Normalization
Normally RNA-seq analysis is carried out under the
assumption that the distribution of gene expression
values should be relatively similar between
experiments. There are situations where this
assumption breaks down, such as with certain types
of ChIP-Seq analysis (i.e. IP vs. input
control). If you would prefer that the
differential enrichment calculation normalize to the
total mapped reads in the tag directory instead of
normalizing based on the matrix of gene expression
values, specify "-norm2total".
Unfortunately, this will only work with DESeq/DESeq2
(edgeR doesn't really support this feature). The "-norm2total"
option is very useful for ChIP-Seq when you do
not expect the signal from each experiment to
be comparable, like when comparing target vs. IgG
experiemnts. However, if you were comparing a
type of experiment where you do expect similar
signal (for example, H3K27ac levels might be assumed
to be similar across conditions), it is not
recommended that you use the "-norm2total"
option.
Reporting Lists of Differentially Expressed Genes
By default, HOMER returns a full list of genes with
log2 Fold change, p-values, and FDR calculations for
each comparison. HOMER can create files with
only differentially expressed genes/features if you
specify the "-export <prefix>", where
<prefix> is the start of each filename that
will contain the differentially expressed
genes. The "-log2fold <#>" and "-fdr
<#>" parameters control the threshold
for differential detection. If "-export
<prefix>" is not included in the
command, the program will still output the numbers
of differentially expressed genes to stderr to give
you an idea how the comparison turned out.
Accounting for Paired Samples/Experiment Design
If your samples are paired or have other
relationships, you may want to try to account for
batch effects. EdgeR and DESeq2 allow you to
apply a generalized model to try to remove effects
caused by analyzing data on a different day, from the
same patient, or slightly different batch. For
example, lets say you did an experiment with a control
and a drug treatment, then did a second experiment two
weeks later with a different library preparation
protocol, etc. You can tell the program that the
samples are linked by specifying a batch code - the
code is like the experiment annotation in that it
applies to each experiment in the same order that they
were listed when preparing the raw count matrix:
analyzeRepeats.pl rna hg19 -noadj -d
Week1-Ctrl/ Week1-Drug/ Week2-Ctrl/ Week2-Drug/
> output.txt
getDiffExpression.pl output.txt -repeats ctrl drug
ctrl drug -batch 1 1 2 2 > diffOutput.txt
This type of analysis is analogous to a paired
t-test. The design formula will also be applied to
the variance stabilization transforms discussed in
the next paragraph.
Variance Stabilization/Normalized Read counts in
output file
By default, getDiffExpression.pl will now
also perform a variance stabilization transform on
your raw count data so that viewing the data is
easier after the command is finished. The idea
behind variance stabilization is to "transform" the
data such that variance in the read counts is
relatively constant as a function of
intensity. This limits the "flare" of low
expression variance that you normally see in
log-transformed scatter plots when comparing one
experiment versus another. By default the program
will use DESeq2's rlog transform to create nicely
normalized log2 read counts ("-rlog"). The
program also supports DESeq2's VST transform (option
"-vst"), which is faster and may be
recommended if you are processing a lot of samples.
In both cases the tranforms will be conducted with
the design matrix specifying which samples are
replicates (or paired in the batch definition). If
you prefer read counts that are simply normalized to
the total count in the raw count matrix, specify "-simpleNorm",
or "-raw" to keep the data the same as the
input data.
NOTE: The rlog and vst transforms may have problems
if there are no replicates or too few samples (i.e.
only 2), so in these cases it is recommended to use
"-simpleNorm" or "-raw".
Output Format
This command will basically add 3 columns to
the end of your original file per pairwise
comparison. Most important are the log2 fold
change and the p-value and FDR values. The raw
read counts in the output file will be log2
transformed and variance stabilized (using
DESeq2's rlog) unless specified otherwise (i.e. "-vst",
"-raw", "-simpleNorm"). This is generally
useful for other types of downstream analysis.
Command line options for getDiffExpression.pl
This program
uses DESeq2/edgeR to find differential expression between
sets of genes
(R must be
installed in the executable path, and the DESeq2/edgeR
package must be installed)
Step 1: Run analyzeRepeats.pl, but use -raw (or
analyzeRNA.pl or annotatePeaks.pl)
Step 2: Run this program using that file (use
-repeats/-rna/-peaks to match program)
The output is
sent to stdout - appends columns to original file
containing data
Group/Batch
names below correspond to order of experiments given as
tag directories
Usage:
getDiffExpression.pl <data file> <group
code...> [options]
i.e. getDiffExpression.pl inputFile.txt groupName1
groupName1 groupName2 > output.txt
(for 2 replicates in first group and 1 in second group)
To normalize
batch effects, add "-batch <batch code 1> [batch
code2] ..." to the command
For now batch normalization is only available with edgeR
and DESeq2
i.e. getDiffExpression.pl inputFile.txt gName1 gName1
gName2 gName2 -batch 1 2 1 2 > output.txt
(for 2 replicates in each group, where the 1st in each and
2nd in are paired)
Update: Expression values are now variance stabalized with
DESeq2/rlog by default
Input File
format options (will try to autodetect):
-rna (for analyzeRNA.pl formatted input, default)
-repeats (for analyzeRepeats.pl formatted input file)
-peaks (for annotatePeaks.pl formatted input file)
-basic (for simple file with one column of gene
identifiers and then the count data)
Additional
Options:
-dispersion <#> (edgeR common dispersion to use if
no replicates, default: 0.05)
-norm2total (normalize using tag directory totals,
default: normalize to gene totals(i.e.table)
-AvsA (compare each group vs. each group, default: compare
1st group vs. others)
-showR (Show R status updates, command output)
Differential
Expression program selection:
-DESeq2 (use DESeq2, now the default)
-DESeq (use DESeq instead of DESeq2 - doesn't support
batch mode)
-edgeR (use edgeR, - doesn't support -norm2total for
normalization to total mapped reads)
DE List
Reporting:
-export <prefix> (output differential expression
gene lists with filename prefix)
-fdr <#> (FDR threshold for diff. expression
reporting, default: 0.05)
-log2fold <#> (Log2 Fold threshold for diff.
expression reporting, default: 1.0)
Expression
value normalization/Variance Stabilization:
-rlog (Use DESeq2's rlog transform to normalize output
matrix, default)
-vst (Use DESeq2's vst transform, quicker for large sample
sets)
-simpleNorm (Normalize to experiment totals, i.e. basic
normalization)
-raw (do not normalize read counts in output table, keep
raw values)
Current
versions that work:
R: v3.3.1
Bioconductor: v3.2
DESeq v1.24.0
DESeq2 v1.12.4
edgeR v3.14.0
limma v3.29.0
|