|
Automating Analysis
Often experiments are composed of dozens (or even hundreds)
of samples. It is nearly impossible to type in all the
commands needed for each samples without driving yourself
crazy and needing ice packs for your figures. Below
are routines to help with automating various analysis tasks.
Note about our approach to automation: It can be
very dangerous to fully automate analysis since it may cause
you to ignore important QC metrics and limit the amount of
time you may spend asking yourself if the results make
sense. As a result, many of the commands below
emphasize the parallelization of samples for single steps,
which allow you to examine your results before moving
forward to make sure everything is OK. The exception
is the analyzeChIP-Seq.pl command, which performs a vertical
automation of analysis steps for a single sample (you could
combine it with batchParallel.pl if you want to parallelize
everything).
The sections below are separated into the following:
- General Automation (general commands to speed up any
task, batchParallel.pl)
- Mapping FASTQ files (map-star.pl, map-bowtie2.pl)
- Creating Tag Directories in parallel (batchMakeTagDirectory.pl)
- Finding Motifs in multiple files (batchFindMotifs.pl,
batchFindMotifsGenome.pl)
- Vertical Analysis of ChIP-Seq experimetns (analyzeChIP-Seq.pl)
- Several common commands are shown at the bottom (Cheat
sheet).
General Automation (batchParallel.pl)
HOMER contains a very simple automation script called
batchParallel.pl that can be useful for automating and
parallelizing any task. In fact, the first
argument is a command - it can be used to automate
pretty much any command. This is how it works:
batchParallel.pl <command> <stdoutSuffix
| none> [options] -f <file1> <file2>
How it works: It takes the command and
runs it on each of the files listed after the "-f
<file> ..." argument. It runs the command
like this:
command file1 options ... > file1.stdoutSuffix
command file2 options ... > file2.stdoutSuffix
etc.
For example, lets say you want to run findPeaks for
several experiments at the same time. You might
try something like this:
batchParallel.pl findPeaks peaks
-style factor -i InputTagDir/ -f ChIPExp1/
ChIPExp2/ ChIPExp3/
This would cause the following commands to be executed
in parallel:
findPeaks ChIPExp1/ -style factor -i InputTagDir/
> ChIPExp1.peaks
findPeaks ChIPExp2/ -style factor -i
InputTagDir/ > ChIPExp2.peaks
findPeaks ChIPExp3/ -style factor -i
InputTagDir/ > ChIPExp3.peaks
Note the 'stdoutSuffix' underlined above. If the
results do not go to stdout, specify 'none'
instead of a stdoutSuffix. For example, with
findPeaks, we may specify '-o auto' to have the peak
files placed automatically in the tag directories.
batchParallel.pl findPeaks none
-style factor -i InputTagDir/ -o auto -f
ChIPExp1/ ChIPExp2/ ChIPExp3/
this will run the commands:
findPeaks ChIPExp1/ -style
factor -i InputTagDir/ -o auto
findPeaks ChIPExp2/ -style
factor -i InputTagDir/ -o auto
findPeaks ChIPExp3/ -style
factor -i InputTagDir/ -o auto
The command is actually very general. Lets say
you want to gzip several FASTQ files at the same time:
batchParallel.pl gzip none -f *fastq
By default, the batchParallel.pl command limits the
number of CPUs used at 12 (i.e. if you have a 12-core
computer). If there are more than 12 files, it
will wait for them to finish such that you never have
more than 12 active processes at a single time. If
this is too many or too few, consider changing the
maximum using "-cpu <#>" (i.e. "-cpu 4").
Some commands don't like to be run in parallel (may
write to a file that has the same name) - in those cases
be sure to specify "-cpu 1".
Mapping FASTQ Files (map-star.pl, map-bowtie2.pl)
So you've just complete a round of sequencing, and you
just got back a dozen or even hundreds FASTQ files that
you want to map to the genome. The following two
scripts help automate the process of mapping reads to a
reference genome using either STAR or
bowtie2/tophat. If you like other alignment
programs more, you can probably modify either script to
accommodate a different algorithm. For more
information on mapping check out this page.
NOTE: These scripts work with gzipped files! This
means you do NOT need to unzip your fastq files before
processing them!
Batch Mapping with STAR
STAR
is extremely fast and accurate, making it the go-to
mapping algorithm at the moment for RNA (and probably
DNA too). The following script, map-star.pl,
will automate several steps for alignment.
Requirements:
- STAR must be installed on your system and
available in your executable path as "STAR".
Otherwise you may also specify the exact path to
STAR using "-path <path-to-star>".
- The genome index directory must be available
(download from the STAR site or create yourself
using a reference genome FASTA file - see the STAR
documentation to check it out).
- samtools must be available in your executable
PATH.
To run the map-star.pl command, use the
following syntax:
map-star.pl [optoins] -p <cpu#> -x
<genome index dir> <FASTQ file1> [FASTQ
file2] ...
For paired-end runs, separate mate-pairs with a
comma (no spaces):
<file1.r1>,<file1.r2>
<file2.r1>,<file2.r2> ...
For Example:
map-star.pl -p 12 -x
/data/STAR-indexes/Genome_hg19 *fastq.gz
By default, the script will map the reads to the
genome using STAR, then create a sorted bam and bai
file using samtools that can be used by downstream
analysis programs. The most important options
are:
-p <#cpus> (Maximum number of cores to
use when mapping and sorting bam files)
-x <path-to-star-index-directory>
(Location where the STAR genome index is located)
The command is only useful for one genome at a
time. If you have human and mouse data, you'll
have to run it twice for each genome. You'll
probably notice that samtools takes longer than
STAR...
Batch Mapping with bowtie2 and/or tophat2
Bowtie/bowtie2
and tophat2 are
extremely popular mapping programs for NGS
analysis. The map-bowtie2.pl script helps
automate analysis with them.
Requirements:
- bowtie2 must be installed on your system and
available in your executable PATH as
"bowtie2". If you wish to use tophat, tophat2
must be available too.
- bowtie2 formatted genome index files must be
available on your system.
- samtools must be installed and available in your
executable PATH.
To run map-bowtie2.pl, use the following syntax:
map-bowtie2.pl [options] -p <#> -x
<genome index prefix> <FASTQ file1>
[FASTQ file2]...
For paired-end runs, separate mate-pairs with a
comma (no spaces):
<file1.r1>,<file1.r2>
<file2.r1>,<file2.r2> ...
For Example:
map-bowtie2.pl -p 12 -x
/data/bowtie2-indexes/hg19 *fastq.gz
To run tophat2 for spliced alignments (i.e. RNA), add
the option "-tophat2" to the command
line. You can also specify the library type "--library-type
fr-firststrand" for example, and supply a GTF
file with "-G file.gtf". Also, to help
speed up tophat alignments, if you have a lot of
memory, you may wish to map several files in parallel
instead of using several CPUs on a single instance of
bowtie/tophat. This is because there are several
steps in tophat where the processing is serial and
does not take full advantage of multiprocessing.
In this case you may want to run tophat with the
options "-cpu 8 -p 1" (which will run 8 concurrent
jobs with 1 processor each, instead of the standard
"-cpu 1 -p 8" which would run one file using 8
processors. There are also several other options
you might want to check out.
Creating Tag Directories (batchMakeTagDirectory.pl)
To create Tag Directories in parallel, use
batchMakeTagDirectory.pl. Because the relationship
between mapping files and experiment names can be
complicated, this programs requires that you create a
simple tab-dlimited text file (called a key file)
containing information that assigns alignment files to tag
directories. The general structure is:
# Key File Format:
TagDirectoryName <tab> path-to-alignment-file
TagDirectoryName2 <tab> path-to-alignment-file2
...
If there are multiple alignment files for a single tag
directory, each one needs its own line - however, make
sure the tag directory name is the same for each one
(i.e.):
Macrophage-PU.1
lane1.pu1.sam
Macrophage-PU.1 lane2.pu1.sam
Both sam files in the above example will be used to create
the Macrophage-PU.1 tag directory because the tag
directory has the same name in each line.
One you have key file described above, you're ready to use
batchMakeTagDirectory.pl. The first argument is the
key file, and after that you can specify the number of tag
directories to create concurrently (-cpu <#>) and
common options for the makeTagDirectory command:
batchMakeTagDirectory.pl <keyFile>
[-cpu <#>] [options]
this will spawn the following for each unique tag
directory name in the key file:
makeTagDirectory tagDir
alignmentFiles [options]
Common options would be things like "-checkGC", "-genome
hg19", and/or "-format sam". An example
of the command might be:
batchMakeTagDirectory.pl key.txt -cpu 8
-checkGC -genome hg19 -format sam
Finding Motifs in multiple files (batchFindMotifs.pl,
batchFindMotifsGenome.pl)
To commands are available to automate findMotifs.pl
and findMotifsGenome.pl over large numbers of
files using the same parameters. There usage is
below:
batchFindMotifs.pl [promoter set]
[options...] -d <gene list file1> [gene list
file2] ...
batchFindMotifsGenome.pl [genome] [options...] -d
<TagDirectory1> [TagDirectory 2] ...
batchFindMotifsGenome.pl [genome] [options...] -f
<peak/BED file1> [peak/BED file2] ...
Each of these commands will produce a new directory named
"Motifs-filename" with the motif results. If
you provide tag directories to batchFindMotifsGenome.pl
(-d <...>), the program will look for
peaks.txt or regions.txt files and output the results to a
directory named "Motifs-BatchOutput/" inside the tag
directory.
Vertical analysis: Analyzing a
ChIP-Seq experiment with one command (analyzeChIP-Seq.pl)
Even I don't like typing the same commands over and over
again.
The following command performs the standard set of
analysis commands so
that you can do better things while your data is
processed.
analyzeChIP-Seq.pl
<Tag
Directory>
<genome>
[general
options] [-A | B | C | D
sub-program options]
i.e. a common use: analyzeChIP-Seq.pl
Factor-ChIP-Seq/
hg18r -i Input-ChIP-Seq -focus -A
factor_alignment_file.bed factor_alignment_file2.bed
This command performs 4 separate tasks labeled as A,B,C
& D:
A. Runs makeTagDirectory to
parse alignment
files, set up the tag directory, and performs basic QC
such as tag auto
correlation and checks for sequence bias.
B. Runs make makeUCSCfile
and findPeaks
to generate UCSC Genome
Browser files and peak files for the experiment.
C. Runs findMotifsGenome.pl
to
determine enriched motifs in your ChIP-Seq peaks.
D. Runs annotatePeaks.pl
to
generate an annotated peak file and performs GO analysis
on genes found
near the peaks.
As output, this program will create standard files in the
"Tag
Directory" including an "index.html" file that links you
to each of the
output files.
There are a couple of general
options that can be used with analyzeChIP-Seq.pl:
-i <input tag
directory> :
"Tag directory" to use as a control for peak finding.
-size <#> : Force this peak size for analysis
(default is "auto"
for peak finding, 200 bp for motif analysis and 50 bp
for focused peak
analysis)
-focus : This will find enriched motifs in 85% focused
peaks using only
+/- 25 bp of sequence, useful for identifying the
primary motif bound
by the factor.
-enhancer : This will set the peak size to "-size 1000"
and only
perform motif analysis on peaks > 3kb from the TSS.
To specifically tailor the options used by the
sub-programs, first
enter the "general options" you want above, then enter "-A" followed by the
options you want
passed to the sub-programs. For example, the most
used
sub-program option I use is the following:
"-A s_1_eland_result.txt"
- this
passes the alignment file to the makeTagDirectory
program so that it will be used to make the Tag
Directory.
If you've already made a tag directory, no need to use the
"-A blah
blah" option. In similar fashion, to tell the motif
finding to
check motifs of length 10, 11, and 12, add "-C -len 10,11,12" to
the end of the
command.
Batch analysis cheat sheet:
# map all fastq files
map-star.pl -x /data/starIndexes/Genome_hg19 -p 12 *fastq
# create tag directories (need to make the key.txt file)
batchMakeTagDirectories.pl key.txt -cpu 12 -genome hg19
-checkGC
# make UCSC files for all directories
batchParallel.pl makeUCSCfile none -o auto -f *TagDir/
# find peaks for all directories
batchParallel.pl findPeaks none -o auto -i InputTagDir/
-style factor -f *TagDir/
# find motifs for all peak files
batchFindMotifsGenome.pl hg19r -size 200 -p 20 -d *TagDir/
|