logo

HOMER

Software for motif discovery and next-gen sequecing analysis



Automating Analysis

Often experiments are composed of dozens (or even hundreds) of samples.  It is nearly impossible to type in all the commands needed for each samples without driving yourself crazy and needing ice packs for your figures.  Below are routines to help with automating various analysis tasks.

Note about our approach to automation: It can be very dangerous to fully automate analysis since it may cause you to ignore important QC metrics and limit the amount of time you may spend asking yourself if the results make sense.  As a result, many of the commands below emphasize the parallelization of samples for single steps, which allow you to examine your results before moving forward to make sure everything is OK.  The exception is the analyzeChIP-Seq.pl command, which performs a vertical automation of analysis steps for a single sample (you could combine it with batchParallel.pl if you want to parallelize everything).

The sections below are separated into the following:
  • General Automation (general commands to speed up any task, batchParallel.pl)
  • Mapping FASTQ files (map-star.pl, map-bowtie2.pl)
  • Creating Tag Directories in parallel (batchMakeTagDirectory.pl)
  • Finding Motifs in multiple files (batchFindMotifs.pl, batchFindMotifsGenome.pl)
  • Vertical Analysis of ChIP-Seq experimetns (analyzeChIP-Seq.pl)
  • Several common commands are shown at the bottom (Cheat sheet).

General Automation (batchParallel.pl)

HOMER contains a very simple automation script called batchParallel.pl that can be useful for automating and parallelizing any task.  In fact, the first argument is a command - it can be used to automate pretty much any command.  This is how it works:

batchParallel.pl <command> <stdoutSuffix | none> [options] -f <file1> <file2>

How it works:  It takes the command and runs it on each of the files listed after the "-f <file> ..." argument. It runs the command like this:

command file1 options ... > file1.stdoutSuffix
command file2 options ... > file2.stdoutSuffix
etc.

For example, lets say you want to run findPeaks for several experiments at the same time.  You might try something like this:

batchParallel.pl findPeaks peaks -style factor -i InputTagDir/ -f ChIPExp1/  ChIPExp2/  ChIPExp3/

This would cause the following commands to be executed in parallel:

findPeaks ChIPExp1/ -style factor -i InputTagDir/ > ChIPExp1.peaks
findPeaks ChIPExp2/ -style factor -i InputTagDir/ > ChIPExp2.peaks
findPeaks ChIPExp3/ -style factor -i InputTagDir/ > ChIPExp3.peaks

Note the 'stdoutSuffix' underlined above.  If the results do not go to stdout, specify 'none' instead of a stdoutSuffix. For example, with findPeaks, we may specify '-o auto' to have the peak files placed automatically in the tag directories.

batchParallel.pl findPeaks none -style factor -i InputTagDir/ -o auto -f ChIPExp1/  ChIPExp2/  ChIPExp3/

this will run the commands:
    findPeaks ChIPExp1/ -style factor -i InputTagDir/ -o auto
    findPeaks ChIPExp2/ -style factor -i InputTagDir/ -o auto
    findPeaks ChIPExp3/ -style factor -i InputTagDir/ -o auto

The command is actually very general.  Lets say you want to gzip several FASTQ files at the same time:

batchParallel.pl gzip none -f *fastq

By default, the batchParallel.pl command limits the number of CPUs used at 12 (i.e. if you have a 12-core computer).  If there are more than 12 files, it will wait for them to finish such that you never have more than 12 active processes at a single time.  If this is too many or too few, consider changing the maximum using "-cpu <#>" (i.e. "-cpu 4").  Some commands don't like to be run in parallel (may write to a file that has the same name) - in those cases be sure to specify "-cpu 1".

Mapping FASTQ Files (map-star.pl, map-bowtie2.pl)

So you've just complete a round of sequencing, and you just got back a dozen or even hundreds FASTQ files that you want to map to the genome.  The following two scripts help automate the process of mapping reads to a reference genome using either STAR or bowtie2/tophat.  If you like other alignment programs more, you can probably modify either script to accommodate a different algorithm.  For more information on mapping check out this page.

NOTE: These scripts work with gzipped files!  This means you do NOT need to unzip your fastq files before processing them!

Batch Mapping with STAR

STAR is extremely fast and accurate, making it the go-to mapping algorithm at the moment for RNA (and probably DNA too).  The following script, map-star.pl, will automate several steps for alignment.

Requirements:

  • STAR must be installed on your system and available in your executable path as "STAR".  Otherwise you may also specify the exact path to STAR using "-path <path-to-star>".
  • The genome index directory must be available (download from the STAR site or create yourself using a reference genome FASTA file - see the STAR documentation to check it out).
  • samtools must be available in your executable PATH.

To run the map-star.pl command, use the following syntax:

map-star.pl [optoins] -p <cpu#> -x <genome index dir> <FASTQ file1> [FASTQ file2] ...

For paired-end runs, separate mate-pairs with a comma (no spaces):
    <file1.r1>,<file1.r2> <file2.r1>,<file2.r2> ...

For Example:
map-star.pl -p 12 -x /data/STAR-indexes/Genome_hg19 *fastq.gz

By default, the script will map the reads to the genome using STAR, then create a sorted bam and bai file using samtools that can be used by downstream analysis programs.  The most important options are:

-p <#cpus> (Maximum number of cores to use when mapping and sorting bam files)
-x <path-to-star-index-directory> (Location where the STAR genome index is located)

The command is only useful for one genome at a time.  If you have human and mouse data, you'll have to run it twice for each genome.  You'll probably notice that samtools takes longer than STAR...

Batch Mapping with bowtie2 and/or tophat2

Bowtie/bowtie2 and tophat2 are extremely popular mapping programs for NGS analysis.  The map-bowtie2.pl script helps automate analysis with them.

Requirements:

  • bowtie2 must be installed on your system and available in your executable PATH as "bowtie2".  If you wish to use tophat, tophat2 must be available too.
  • bowtie2 formatted genome index files must be available on your system.
  • samtools must be installed and available in your executable PATH.

To run map-bowtie2.pl, use the following syntax:

map-bowtie2.pl [options] -p <#> -x <genome index prefix> <FASTQ file1> [FASTQ file2]...

For paired-end runs, separate mate-pairs with a comma (no spaces):
    <file1.r1>,<file1.r2> <file2.r1>,<file2.r2> ...

For Example:
map-bowtie2.pl -p 12 -x /data/bowtie2-indexes/hg19 *fastq.gz

To run tophat2 for spliced alignments (i.e. RNA), add the option "-tophat2" to the command line.  You can also specify the library type "--library-type fr-firststrand" for example, and supply a GTF file with "-G file.gtf".  Also, to help speed up tophat alignments, if you have a lot of memory, you may wish to map several files in parallel instead of using several CPUs on a single instance of bowtie/tophat.  This is because there are several steps in tophat where the processing is serial and does not take full advantage of multiprocessing.  In this case you may want to run tophat with the options "-cpu 8 -p 1" (which will run 8 concurrent jobs with 1 processor each, instead of the standard "-cpu 1 -p 8" which would run one file using 8 processors.  There are also several other options you might want to check out.

Creating Tag Directories (batchMakeTagDirectory.pl)

To create Tag Directories in parallel, use batchMakeTagDirectory.pl.  Because the relationship between mapping files and experiment names can be complicated, this programs requires that you create a simple tab-dlimited text file (called a key file) containing information that assigns alignment files to tag directories.  The general structure is:
# Key File Format:
TagDirectoryName
<tab> path-to-alignment-file
TagDirectoryName2 <tab> path-to-alignment-file2
...
If there are multiple alignment files for a single tag directory, each one needs its own line - however, make sure the tag directory name is the same for each one (i.e.):
Macrophage-PU.1    lane1.pu1.sam
Macrophage-PU.1    lane2.pu1.sam
Both sam files in the above example will be used to create the Macrophage-PU.1 tag directory because the tag directory has the same name in each line.

One you have key file described above, you're ready to use batchMakeTagDirectory.pl.  The first argument is the key file, and after that you can specify the number of tag directories to create concurrently (-cpu <#>) and common options for the makeTagDirectory command:
batchMakeTagDirectory.pl <keyFile> [-cpu <#>] [options]

this will spawn the following for each unique tag directory name in the key file:
    makeTagDirectory tagDir alignmentFiles [options]
Common options would be things like "-checkGC", "-genome hg19", and/or "-format sam".  An example of the command might be:
batchMakeTagDirectory.pl key.txt -cpu 8 -checkGC -genome hg19 -format sam


Finding Motifs in multiple files (batchFindMotifs.pl, batchFindMotifsGenome.pl)

To commands are available to automate findMotifs.pl and findMotifsGenome.pl over large numbers of files using the same parameters.  There usage is below:
batchFindMotifs.pl [promoter set] [options...] -d <gene list file1> [gene list file2] ...

batchFindMotifsGenome.pl [genome] [options...] -d <TagDirectory1> [TagDirectory 2] ...
batchFindMotifsGenome.pl [genome] [options...] -f <peak/BED file1> [peak/BED file2] ...
Each of these commands will produce a new directory named "Motifs-filename" with the motif results.  If you provide tag directories to batchFindMotifsGenome.pl (-d <...>), the program will look for peaks.txt or regions.txt files and output the results to a directory named "Motifs-BatchOutput/" inside the tag directory.

Vertical analysis: Analyzing a ChIP-Seq experiment with one command (analyzeChIP-Seq.pl)

Even I don't like typing the same commands over and over again.  The following command performs the standard set of analysis commands so that you can do better things while your data is processed.

analyzeChIP-Seq.pl <Tag Directory> <genome> [general options] [-A | B | C | D sub-program options]

i.e. a common use: analyzeChIP-Seq.pl Factor-ChIP-Seq/ hg18r -i Input-ChIP-Seq -focus -A factor_alignment_file.bed factor_alignment_file2.bed

This command performs 4 separate tasks labeled as A,B,C & D:

A. Runs makeTagDirectory to parse alignment files, set up the tag directory, and performs basic QC such as tag auto correlation and checks for sequence bias.
B. Runs make makeUCSCfile and findPeaks to generate UCSC Genome Browser files and peak files for the experiment.
C. Runs findMotifsGenome.pl to determine enriched motifs in your ChIP-Seq peaks.
D. Runs annotatePeaks.pl to generate an annotated peak file and performs GO analysis on genes found near the peaks.

As output, this program will create standard files in the "Tag Directory" including an "index.html" file that links you to each of the output files. 

There are a couple of general options that can be used with analyzeChIP-Seq.pl:

-i <input tag directory> : "Tag directory" to use as a control for peak finding.
-size <#> : Force this peak size for analysis (default is "auto" for peak finding, 200 bp for motif analysis and 50 bp for focused peak analysis)
-focus : This will find enriched motifs in 85% focused peaks using only +/- 25 bp of sequence, useful for identifying the primary motif bound by the factor.
-enhancer : This will set the peak size to "-size 1000" and only perform motif analysis on peaks > 3kb from the TSS.

To specifically tailor the options used by the sub-programs, first enter the "general options" you want above, then enter "-A" followed by the options you want passed to the sub-programs.  For example, the most used sub-program option I use is the following:

"-A s_1_eland_result.txt" - this passes the alignment file to the makeTagDirectory program so that it will be used to make the Tag Directory.

If you've already made a tag directory, no need to use the "-A blah blah" option.  In similar fashion, to tell the motif finding to check motifs of length 10, 11, and 12, add "-C -len 10,11,12" to the end of the command.

Batch analysis cheat sheet:

# map all fastq files
map-star.pl -x /data/starIndexes/Genome_hg19 -p 12 *fastq

# create tag directories (need to make the key.txt file)
batchMakeTagDirectories.pl key.txt -cpu 12 -genome hg19 -checkGC

# make UCSC files for all directories
batchParallel.pl makeUCSCfile none -o auto -f *TagDir/

# find peaks for all directories
batchParallel.pl findPeaks none -o auto -i InputTagDir/ -style factor -f *TagDir/

# find motifs for all peak files
batchFindMotifsGenome.pl hg19r -size 200 -p 20 -d *TagDir/



Back to ChIP-Seq Analysis


Can't figure something out? Questions, comments, concerns, or other feedback:
cbenner@salk.edu