Software for motif discovery and next-gen sequencing analysis

Visualizing Experiments with a Genome Browser

The UCSC Genome Browser is quite possibly one of the best computational tools ever developed.  Not only does it contain an incredible amount of data in a single application, it allows users to upload custom information such as data from their ChIP-Seq experiments so that they can be easily visualized and compared to other information.  There are also other genome browsers that are available, and each has a different strength:

UCSC Genome Browser
Truly a unique resource, with lots of preloaded data and annotations.
WashU Epigenome Browser
Capable of visualizing long-range interactions (great for data sets like Hi-C), also has a lot of preloaded data.
Integrative Genomics Viewer (IGV)
Great for looking at reads locally instead of needing to load them to a server/cloud-based solution.  Great for directly looking at sorted bam/bai files to examine mutations in reads.
Many others...

Most of the tools that are part of HOMER cater to the strengths of the UCSC Genome Browser - however, the bedGraph and other files generated by HOMER can normally be used in the other genome browsers as well.

Making Genome Browser Files

The basic strategy HOMER uses is to create a bedGraph formatted file that can then be uploaded as a custom track to the genome browser.  This is accomplished using the makeUCSCfile program.  To make a UCSC visualization file, type the following:

makeUCSCfile <tag directory> -o auto

i.e. makeUCSCfile PU.1-ChIP-Seq/ -o auto
(output file will be in the PU.1-ChIP-Seq/ folder named PU.1-ChIP-Seq.ucsc.bedGraph.gz)

The "-o auto" will make the program automatically generate an output file name (i.e. TagDirectory.ucsc.bedGraph.gz) and place it in the tag directory, which helps with the organization of all these files.  The output file can be named differently by specifying "-o outputfilename" or by simply omitting "-o", which will send the output of the program to stdout (i.e. add " > outputfile" to capture it in the file outputfile).  It is recommended that you compress the file using gzip and directly upload the compressed file when loading custom tracks at UCSC.

To visualize the experiment in the UCSC Genome Browser, go to the Genome Browser page and select the appropriate genome (i.e. the genome that the sequencing tags were mapped to).  Then click on the "add custom tracks" button (this will read "manage custom tracks" once at least one custom track is loaded).  Enter the file created earlier in the "Paste URLs or data" section and click "Submit".

Problems Loading UCSC Files

The most common problem encountered while loading UCSC files is to see "position exceeds chromosome length" or something to that effect.  This is usually caused by one of three problems:

1. You are trying to load the file to the wrong genome assembly.  Make sure the assembly is correct!

2. Did you align the reads to a UCSC version of the genome?  The chromosome names must match exactly: chr1 != Chr1 != 1

3. Some of your tags are mapping outside the reference chromosome - this can be caused by mapping to non-standard assemblies or by some alignment programs.  To remove all reads outside of the UCSC chromosome lengths, you can run the program removeOutOfBoundsReads.pl.

removeOutOfBoundsReads.pl <tag directory> <genome>
i.e. removeOutOfBoundsReads.pl PU.1-ChIP-Seq/ mm9

After running the program, you can rerun makeUCSCfile.
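Before uploading, it can help to check a bedGraph file against the chromosome sizes yourself.  The following is a minimal Python sketch of such a check, not part of HOMER: the function name find_bedgraph_problems and the example chromosome sizes are made up for illustration, and it only assumes the standard bedGraph layout (chromosome, start, end, value) with optional "track" header lines.

```python
def find_bedgraph_problems(lines, chrom_sizes):
    """Return (line_number, reason) pairs for records a browser would likely reject."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        # Skip blank lines and the "track"/"browser"/comment header lines.
        if not fields or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        chrom, end = fields[0], int(fields[2])
        if chrom not in chrom_sizes:
            problems.append((number, f"unknown chromosome name '{chrom}'"))
        elif end > chrom_sizes[chrom]:
            problems.append((number, f"end {end} exceeds {chrom} length {chrom_sizes[chrom]}"))
    return problems


# Hypothetical chromosome lengths, for illustration only (not real assembly sizes).
example_sizes = {"chr1": 1000, "chr2": 800}
example_lines = [
    "track type=bedGraph name=example",
    "chr1\t0\t100\t1.5",
    "Chr1\t100\t200\t2.0",   # wrong capitalization
    "1\t200\t300\t2.0",      # missing "chr" prefix
    "chr2\t700\t900\t3.0",   # runs past the end of chr2
]
for number, reason in find_bedgraph_problems(example_lines, example_sizes):
    print(f"line {number}: {reason}")
```

Unknown chromosome names point to problems 1 or 2 above, while an end coordinate past the chromosome length points to problem 3.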

What does makeUCSCfile do?

The program works by approximating the ChIP-fragment density at each position in the genome.  This is done by starting with each tag and extending it by the estimated fragment length (determined by autocorrelation, or it can be manually specified using "-fragLength <#>").  The ChIP-fragment density is then defined as the total number of overlapping fragments at each position in the genome.  Below is a diagram that depicts how this works:
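The same idea can be written down as a few lines of Python.  This is only a sketch of the extend-and-count approach described above, not HOMER's actual code: the function name, the 0-based coordinates, and the treatment of minus-strand tags (extended toward lower coordinates) are assumptions made for illustration.

```python
def fragment_density(tags, fragment_length, region_length):
    """Count overlapping fragments at each position of a small region.

    tags: list of (position, strand) with 0-based positions and strand '+' or '-'.
    """
    density = [0] * region_length
    for position, strand in tags:
        if strand == "+":
            start, end = position, position + fragment_length
        else:
            # Assumption: minus-strand tags extend toward lower coordinates.
            start, end = position - fragment_length + 1, position + 1
        # Clip the fragment to the region and add one unit of coverage per base.
        for i in range(max(start, 0), min(end, region_length)):
            density[i] += 1
    return density


example_tags = [(2, "+"), (4, "+"), (9, "-")]
print(fragment_density(example_tags, fragment_length=5, region_length=12))
# prints [0, 0, 1, 1, 2, 3, 3, 2, 2, 1, 0, 0]
```

The density peaks where the three extended fragments overlap, which is where the bound factor is most likely located.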
As great as the UCSC Genome Browser is, the large size of recent ChIP-Seq experiments results in custom track files that are very large.  In addition to taking a long time to upload, the genome browser has trouble loading excessively large files.  To help cope with this, makeUCSCfile lets you specify a target file size when zipped (i.e. "-fsize 50e6" for 50MB).  In order to meet the specified target file size, makeUCSCfile merges adjacent regions of tag density levels by their weighted average to reduce the total number of lines in the final bedGraph file.  If you have trouble getting your file to load, try reducing the size of the file using the "-fsize <#>" option (i.e. "-fsize 2e7").  To force the creation of larger files, use a very large file size (i.e. "-fsize 1e50") - this will create a file that does not merge any regions and displays a "native" view of the data.
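To give a feel for what merging by weighted average means, here is a small Python sketch.  It is an illustration only: how HOMER decides which regions to merge in order to hit the target file size is not described here, so the "tolerance" rule below (merge touching regions whose values differ by at most a fixed amount) is an assumption standing in for that logic.

```python
def merge_adjacent(intervals, tolerance):
    """Merge touching (start, end, value) intervals with similar values.

    The merged value is the length-weighted average of the two inputs.
    """
    merged = []
    for start, end, value in intervals:
        if merged:
            prev_start, prev_end, prev_value = merged[-1]
            if prev_end == start and abs(prev_value - value) <= tolerance:
                prev_len = prev_end - prev_start
                cur_len = end - start
                average = (prev_value * prev_len + value * cur_len) / (prev_len + cur_len)
                merged[-1] = (prev_start, end, average)
                continue
        merged.append((start, end, value))
    return merged


example = [(0, 10, 2.0), (10, 40, 3.0), (40, 50, 9.0), (60, 70, 9.0)]
print(merge_adjacent(example, tolerance=1.0))
# prints [(0, 40, 2.75), (40, 50, 9.0), (60, 70, 9.0)]
```

The first two regions collapse into a single line, which is how the file shrinks; the sharp jump to 9.0 and the region after the gap are left alone.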

Tags can be visualized separately for each strand using the "-strand separate" option.

Changing the Resolution

By default, makeUCSCfile uses the "-fsize <#>" option to determine how many reads to essentially "skip" when making the output file.  You can also manually set the resolution.

In an effort to reduce the size of large UCSC files, one attractive option is to reduce the overall resolution of the file.  By default, makeUCSCfile will make full resolution (i.e.  1 bp) files, but this can be changed by specifying the "-res <#>" option.  For example, "-res 10" will cause changes in ChIP-fragment density to be reported only every 10 bp.
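As a rough illustration of what a lower resolution does to the data, the Python sketch below bins a per-bp density profile into fixed windows.  The function name and the fixed-window binning are assumptions for illustration rather than HOMER's implementation; the choice between reporting the maximum (default) or the average mirrors the "-avg" option listed in the command line options.

```python
def reduce_resolution(density, res, use_average=False):
    """Summarize a per-bp density list as (start, end, value) windows of width res."""
    bins = []
    for start in range(0, len(density), res):
        window = density[start:start + res]
        value = sum(window) / len(window) if use_average else max(window)
        bins.append((start, start + len(window), value))
    return bins


per_bp = [0, 0, 1, 1, 2, 3, 3, 2, 2, 1, 0, 0]
print(reduce_resolution(per_bp, res=4))
# prints [(0, 4, 1), (4, 8, 3), (8, 12, 2)]
print(reduce_resolution(per_bp, res=4, use_average=True))
# prints [(0, 4, 0.5), (4, 8, 2.5), (8, 12, 0.75)]
```

Twelve per-bp values become three lines, at the cost of blurring where exactly the density changes.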

Normalization of UCSC files: 2 types of normalization

There are two important parameters to consider during normalization of data.  First, the total read depth of the experiment is important, which is obvious.  The 2nd factor to consider is the length of the reads (this is new to v4.4).  The problem is that an experiment with longer fragment lengths will generate more coverage than an experiment with shorter fragment lengths.  In order to make sure the total area under the curve is the same for each experiment, experiments are normalized to a fixed number of reads as well as a 100 bp fragment length.  If reads are longer than 100 bp, they are 'down-normalized' a fractional amount such that they produce the same relative coverage as a 100 bp fragment.  Experiments with shorter fragment lengths are 'up-normalized' a proportional amount (maximum of 4x or 25 bp).  This allows experiments with different fragment lengths to be comparable along the genome browser.  The behavior of each normalization is controlled with the following parameters:
-norm <#> : Normalize the total number of reads to this number, default 1e7.  This means that tags from an experiment with only 5 million mapped tags will count for 2 tags apiece.

-normLength <#> : Set the standard length for normalization, default 100.
"-normLength 0" will disable the length normalization altogether, useful when visualizing single nucleotide data.
-noadj or -raw : who needs normalization?  Just give me the raw coverage numbers...
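The arithmetic behind the two normalizations can be sketched in Python as follows.  This is a simplified reading of the description above, not HOMER's code: the function name is hypothetical, and combining the two factors by simple multiplication is an assumption.

```python
def tag_weight(total_tags, fragment_length, norm=1e7, norm_length=100, raw=False):
    """Approximate how much one tag counts after depth and length normalization."""
    if raw:
        # Corresponds to -noadj / -raw: every tag counts as exactly one.
        return 1.0
    weight = norm / total_tags            # depth normalization (-norm)
    if norm_length > 0:                   # -normLength 0 disables this step
        length_factor = norm_length / fragment_length
        weight *= min(length_factor, 4.0)  # up-normalization is capped at 4x
    return weight


print(tag_weight(5e6, 100))   # 5 million tags: each counts as 2.0
print(tag_weight(1e7, 200))   # 200 bp fragments: down-normalized to 0.5
print(tag_weight(1e7, 20))    # 20 bp fragments: capped at 4.0
```

With the defaults, an experiment with 10 million tags and 100 bp fragments is left unchanged (weight 1.0).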

Normalizing files to Input

The paragraph above specifies how to normalize read densities based on the total number of reads.  For some applications, particularly if studying organisms with small genomes, it is better to visualize the read density as a ratio relative to Input or IgG.  Normally I would NOT recommend visualizing reads this way if the Input/IgG read coverage is sparse as this will cause trouble when calculating ratios.  To normalize the experiment to a second tag directory, use the "-i <input tag directory>" option:
makeUCSCfile ExpTagDirectory/ -i InputTagDirectory  -o auto
Additional parameters to control the normalized output:
-pseudo <#> : To avoid fluctuations in the ratio due to low coverage input, a pseudo count is added to the numerator and denominator when calculating the ratio (default: 5)
-log : report as a log ratio (default is a simple ratio)
-inputtbp <#> : set the maximum tags per bp considered in the input experiment
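To see why the pseudo count matters, consider this small Python sketch of the ratio calculation.  It assumes the ratio is computed per position from already-normalized coverage values, and the function name is made up for illustration; the -inputtbp limit is not modeled.

```python
import math


def input_ratio(experiment, control, pseudo=5.0, log=False):
    """Ratio of experiment to input coverage, smoothed with a pseudo count."""
    ratio = (experiment + pseudo) / (control + pseudo)
    return math.log2(ratio) if log else ratio


print(input_ratio(35, 5))            # (35 + 5) / (5 + 5) = 4.0
print(input_ratio(35, 5, log=True))  # log2(4.0) = 2.0
print(input_ratio(1, 0))             # 6 / 5 instead of a division by zero
```

Without the pseudo count, the last case would divide by zero, and positions with one or two input reads would produce wildly fluctuating ratios.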

Separating data from different strands / RNA-Seq

You can specify that HOMER separate the data based on the strand by using the "-strand <...>" option.  This is useful when looking at strand-specific RNA-Seq/GRO-Seq experiments.  The following options are available:

-strand both : default behavior, for ChIP-Seq/MNase-Seq etc.
-strand separate : separate data by strand, for RNA-Seq/GRO-Seq
-strand + : only show the positive strand (i.e. Watson strand) data
-strand - : only show the negative strand (i.e. Crick strand) data

RNA-Seq and Splicing:

HOMER does not fully support the visualization of spliced RNA-Seq reads.  However, if you specify the "-fragLength given" option, HOMER will only visualize the reads from the 5' end of the read until the first splice site (or the end of the read).  This will help make the read densities look nice and crisp over exons, but will not visualize parts of the read that are 3' from the first splice found in the read.

Modifying Read Coverage

You can manually set the fragment lengths that are visualized and shift their positions, both of which can be useful:

-fragLength <# | given> : sets the fragment length, default: uses fragmentLengthEstimate in the tagInfo.txt file of the tag directory.  If you want to visualize how the signal changes over large regions, it can be useful to set the fragment length to a very large value (i.e. 10000).  If you want to visualize the exact length of the reads, use "-fragLength given".

-adjust <#> : adjust the position of the read by this amount from the 5' end.  For example, -adjust -10 would start the coverage 10 bp upstream.  This is useful when the 5' end of the read represents a localized signal, i.e. DNase nicking site, as opposed to a ChIP-Seq fragment, which implies the factor binds downstream from the 5' end.

-tbp <#> : limit the number of reads considered per position, default: no limit. i.e. "-tbp 1" only counts one read per position.

-inputFragLength <#>, -inputAdjust <#>, -inputtbp <#> work the same for input directories if calculating a ratio.

Special Visualization Styles

To help streamline the visualization of different data types, you can use the "-style <styles>" option (i.e. "-style rnaseq").  This will adjust parameters for each type:

chipseq : standard, default
rnaseq : strand specific, will only extend fragments by their given length to help visualize exon edges.
tss : strand specific, and only shows the 5' nucleotide of the read (single base precision)
dnase : for nicking style DNase data only (see here), centers read fragment over the 5' end of the read.
methylated : reports cytosine methylation percentage at single bp resolution.
unmethylated : reports the percentage of unmethylated cytosines at single bp resolution.
damid : reports large coverage fragments (2kb) centered on 5' end of the reads

Creating bigWig files with HOMER

Some data sets are very large, but you still want to see all of the details from your sequencing in the UCSC Genome Browser.  HOMER can produce bigWig files by running the conversion program for you (bedGraphToBigWig).  The only catch is that you must have access to a webserver where you can post the resulting bigWig file - this is because instead of uploading the whole file to UCSC, the browser actually looks for the data file on YOUR webserver and grabs only the parts it needs.  Slick, eh.  Chuck uses this all the time for big experiments.

Before even trying to make bigWigs, you must download the bedGraphToBigWig program from UCSC and place it somewhere in your executable path (i.e. the /path-to-homer/bin/ folder).  This program is called directly by HOMER to create the bigWig files.

Using the makeBigWig.pl Script

To make bigWig files easier to generate, HOMER includes a program creatively named "makeBigWig.pl" that automates all of the steps below.

makeBigWig.pl <tag directory> <genome> [special options] [makeUCSC file options] -webDir /path-to-web-fold/ -url http://webserverURL/bigwigFold/
i.e. makeBigWig.pl PU.1-ChIP-Seq/ mm9  -webDir /var/www/bigWigs/ -url http://ChuckNorrisU.edu/bigWigs/

If you are visualizing strand specific data (i.e. RNA-Seq), specify "-strand".  The -url and -webDir options specify the web URL directory and the file system directory where the bigWigs will be stored, respectively.  Recent changes to UCSC require that the chromosome sizes be specified exactly.  If you are having trouble, the current version of HOMER has the option "-chromSizes <filename>" so that you can specify the sizes explicitly.

Other makeBigWig.pl options:
-normal (default, similar to "-style chipseq" for makeUCSCfile).
-strand (for RNAseq, will create two bigWigs separately for each strand).
-dnase (will use "-style dnase")
-cage (combines -strand with -style cage)
-cpg (creates both methylated and unmethylated bigWigs)

-update (will overwrite default bigWig for that tag directory name.  Otherwise, if the same file name exists, a random number will be added to the end)
-chromSizes <chrom.size file> (specify the chromosome sizes, default: automatic)
-url <URL> (URL directory -no filename- to tell UCSC where to look)
-webdir <directory> (name of directory to place resulting bigWig file)

Making bigWigs from scratch

This is a quick description of what HOMER is doing.  To make a bigWig, add the "-bigWig <chrom.sizes file> -fsize 1e20" parameters to your makeUCSCfile command.  When making a bigWig, you usually want to see all of the tag information, so make sure the "-fsize" option is large.  You also need to specify an output file using "-o <bigwigfilename>" and also capture the stdout stream using "> trackfileoutput.txt".  You can also use "-o auto".  The "trackfileoutput.txt" will contain the header information that is uploaded as a custom track to UCSC.  Recent changes to UCSC require that HOMER know the exact size of the chromosomes when making the file - these should be placed in a file (<chrom.sizes> file).  makeBigWig.pl and makeMultiWigHub.pl will generate these files automatically by analyzing the sequences in the genome directory.

After running the makeUCSCfile program with the bigWig options, you need to do the following:
  1. Copy the *.bigWig file to your webserver location and make sure it is viewable over the internet.
  2. Edit the "trackfileoutput.txt" file and enter the URL of your bigWig file (...  bigDataUrl=http://server/path/bigWigFilename ...)
  3. Upload the "trackfileoutput.txt" file to UCSC as a custom track to view your data.
For example:

makeUCSCfile <tag directory> -o auto -bigWig <chrom.sizes file> -fsize 1e20 > trackInfo.txt

makeUCSCfile PU.1-ChIP-Seq/ -o auto -bigWig chrom.sizes -fsize 1e20 > PU.1-bigWig.trackInfo.txt
cp PU.1-ChIP-Seq/PU.1-ChIP-Seq.ucsc.bigWig /Web/Server/Root/Path/
** edit PU.1-bigWig.trackInfo.txt to have the right URL **
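The URL edit can be done by hand in a text editor, or scripted.  The Python sketch below shows one way to swap in the real URL; the function name, the example track line, and the example.org URL are all hypothetical (the actual contents of the file HOMER writes may differ), and the only thing assumed about the track line is that it contains a "bigDataUrl=" field without spaces in the URL.

```python
import re


def set_big_data_url(track_line, url):
    """Return track_line with the value of its bigDataUrl field replaced by url."""
    if "bigDataUrl=" not in track_line:
        raise ValueError("no bigDataUrl field found in track line")
    # A function is used as the replacement so the URL is inserted literally.
    return re.sub(r"bigDataUrl=\S*", lambda match: "bigDataUrl=" + url, track_line)


# Hypothetical track line, for illustration only.
example_line = (
    'track type=bigWig name="PU.1-ChIP-Seq" '
    "bigDataUrl=http://server/path/bigWigFilename visibility=full"
)
print(set_big_data_url(example_line, "http://example.org/bigWigs/PU.1-ChIP-Seq.ucsc.bigWig"))
```

The rest of the track line (name, visibility, and so on) is left untouched, so the result can be pasted or uploaded to UCSC as a custom track.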

NOTE: As of now, a bigWig file can only be composed of a single track - if you want to separate the data by strands, do the following:

makeUCSCfile PU.1-ChIP-Seq/ -o PU.1.positiveStrand.bigWig -bigWig chrom.sizes -fsize 1e20 -strand + > PU.1-bigWig.trackInfo.positiveStrand.txt
makeUCSCfile PU.1-ChIP-Seq/ -o PU.1.negativeStrand.bigWig -bigWig chrom.sizes -fsize 1e20 -strand - > PU.1-bigWig.trackInfo.negativeStrand.txt
cp PU.1.positiveStrand.bigWig PU.1.negativeStrand.bigWig /Web/Server/Root/Path/
cat PU.1-bigWig.trackInfo.positiveStrand.txt PU.1-bigWig.trackInfo.negativeStrand.txt > PU.1-bigWig.trackInfo.both.txt
** edit PU.1-bigWig.trackInfo.both.txt to have the right URLs for both the negative and positive strands **

Creating Multi-Experiment Overlay Tracks

UCSC has recently added the option to create overlay tracks, where several bigWig files can be viewed in the same space with the help of transparent colors.  The first example of this was the ENCODE Regulation Track, which showed H3K4me1/3 data from several cell types at the same time.  This is very useful for large-scale data sets with many different experiments.  In these cases it is just about impossible to get them on the screen together.

To make a "multi-wig hub", as we will refer to them, you need to make sure you have the bedGraphToBigWig program from UCSC, and a working webserver to host your files.  If you can handle bigWigs in the section above, you can make multi-wig hubs.

The HOMER program to handle multi-wig hubs is called makeMultiWigHub.pl.  It works essentially the same way as the makeBigWig.pl script, however, the syntax is a little different.  The basic usage is:

makeMultiWigHub.pl <hub name> <genome> [options] -d <tag directory1> <tag directory2> ...
i.e. makeMultiWigHub.pl ES-Factors mm9 -d mES-Oct4/ mES-Sox2/ mES-Nanog/ mES-Klf4/ mES-Esrrb/ mES-cMyc/ mES-Stat3/

NOTE: make sure you use the UCSC genome (e.g. mm9) and not the masked, bastardized HOMER version (mm9r).

The above example will produce a hub called "ES-Factors", composed of configuration files and bigWig files, and place it on your server in the directory specified by "-webDir <directory>".  It will also provide you with a URL to the hub (dependent on the value of "-url <base url>").  To load the Hub, click on "Track Hubs" on the UCSC browser (next to custom tracks button), and paste the URL into the dialog box.  The example above will look something like this:


To figure out which factors correspond to which colors, click on the Blue Heading for the Hub in the settings area below the UCSC picture.  Something like this should pop up:

[Figure: UCSC hub settings for a HOMER hub]

Unfortunately, as of now editing hub information can only be done by directly modifying the hub files on the server.  For example, to edit the colors, you must edit the "/webserver/directory/hubName/genome/trackDB.txt" file.

Because Hubs are so cool, HOMER will also do +/- strand RNA data right.  Unfortunately, for now you can't mix stranded and non-stranded data in the same hub with the makeMultiWigHub.pl program.  To visualize stranded information, add "-strand".  Below is an example:

[Figure: HOMER UCSC hub RNA example]

Other makeMultiWigHub.pl options are essentially identical to makeBigWig.pl.


Examples of UCSC bedGraph files

The following shows what the same data set looks like changing options for file size (-fsize) and resolution (-res).  Usually it's best to use one or the other.
  1. -fsize 5e7 -res 1
  2. -fsize 1e7 -res 1
  3. -fsize 5e7 -res 10
  4. -fsize 1e7 -res 10

Command line options for makeUCSCfile

    Usage: makeUCSCfile <tag directory> [options]

    Creates a bedgraph file for visualization using the UCSC Genome Browser

    General Options:
        -fsize <#> (Size of file, when gzipped, default: 1e10, i.e. no reduction)
        -strand <both|separate|+|-> (control if reads are separated by strand, default: both)
        -fragLength <# | auto | given> (Approximate fragment length, default: auto)
        -adjust <#> (Adjust edge of tag 3' by # bp, negative for 5', default: none[good for dnase])
        -tbp <#> (Maximum tags per bp to count, default: no limit)
        -mintbp <#> (Minimum tags per bp to count, default: no limit)
        -res <#> (Resolution, in bp, of file, default: 1)
            -avg (report average coverage if resolution is larger than 1bp, default: max is reported)
        -lastTag (To keep ucsc happy, last mapped tag is NOT extended by default
            Using this option will allow extending of data past the last tag position)
        -norm <#> (Total number of tags to normalize experiment to, default: 1e7)
        -normLength <#> (Expected length of fragment to normalize to [0=off], default: 100)
        -noadj (Do not normalize tag counts)
        -neg (plot negative values, i.e. for - strand transcription)
        -CpG (Show unmethylated CpG ratios)
        -color <(0-255),(0-255),(0-255)> (no spaces, rgb color for UCSC track, default: random)
        -i <input tag directory> (normalize bedGraph to input data)
            -pseudo <#> (Number of pseudo counts used to smooth out low coverage areas, default: 5)
            -log (report log2 ratio instead of linear ratio)
            -inputtbp <#>, -inputFragLength <#>, -inputAdjust <#> can also be set
        -bigWig <chrom.size file> (creates a full resolution bigWig file and track line file)
            This requires bedGraphToBigWig to be available in your executable path
            Also, because of how bigWig files work, use "-strand -" and "-strand +"
            in separate runs to make strand specific files: "-strand separate" will not work
            Consider using makeBigWig.pl and makeMultiWigHub.pl if interested in bigWigs
        -o <filename|auto> (send output to this file - will be gzipped, default: prints to stdout)
            auto: this will place an appropriately named file in the tag directory
        -name <...> (Name of UCSC track, default: auto generated)
        -style <option> (See options below:)
            chipseq (standard, default)
            rnaseq (strand specific, if unstranded add '-strand both' to end of command)
            tss (strand specific, single bp fragment length)
            dnase (fragments centered on tag position instead of downstream)
            methylated (single bp resolution of cytosine methylation)
            unmethylated (single bp resolution of unmethylated cytosines)
            damid (2kb fragments centered on 5' end of reads)
        -circos <chrN:XXX-YYY|genome> (output only a specific region for circos[no header])

Command line options for makeBigWig.pl

        Script for automating the process of creating bigWigs

        Usage: makeBigWig.pl <tag directory> <genome> [special options] [options]

        Special Options for bigWigs [choose one, don't combine]:
                -normal (ChIP-Seq style, default)
                -strand (Strand specific, for RNA-Seq and GRO-Seq)
                -dnase (Special options for Crawford-lab style DNase-Seq)
                -cage (Special options for CAGE/TSS-Seq)
                -cpg (Special options for mCpG/CpG)

        Other options:
                Whatever options you want to pass to makeUCSCfile
                !!Warning!!: do not try to specify "-strand separate" - use the special option above.

        File options:
                -fsize <#> (Use to limit the size of the bigwig files)
                -url <URL> (URL directory -no filename- to tell UCSC where to look)
                -webdir <directory> (name of directory to place resulting bigWig file)
                -update (overwrite bigwigs in the webDir directory, otherwise random numbers are
                        added to make the file unique.

        Current url target (-url):         http://homer.salk.edu/bigWig/
        Current web directory (-webDir):   /data/www/bigWig/

        You're going to want to modify the $wwwDir and $httpDir variables at the top of
        the makeBigWig.pl program file to accommodate your system so you don't have to
        specify -url and -webdir all the time.

Command line options for makeMultiWigHub.pl

        Script for automating the process of creating multiWig tracks

        Usage: makeMultiWigHub.pl <hubname> <genome> [options] -d <tag directory1> [tag directory2]...

        Special Options for bigWigs [choose one, don't combine]:
                -normal (ChIP-Seq style, default)
                -strand (Strand specific, for RNA-Seq and GRO-Seq)
                -dnase (Special options for Crawford-lab style DNase-Seq)
                -cage (Special options for CAGE/TSS-Seq)
                -cpg (Special options for mCpG/CpG)

        Other options:
                Whatever options you want to pass to makeUCSCfile
                !!Warning!!: do not try to specify "-strand separate" - use the special option above.
                Also, for the genome, do NOT use repeat version (mm9r) - use mm9 instead

        File options:
                -force (overwrite existing hub)
                -fsize <#> (limit the file size of the bigwig files to this value)
                -url <URL> (URL directory -no filename- to tell UCSC where to look)
                -webdir <directory> (name of directory to place resulting hub directory)

        Current url target (-url):         http://biowhat.ucsd.edu/hubs/
        Current web directory (-webDir):   /data/www/hubs/

        You're going to want to modify the $wwwDir and $httpDir variables at the top of
        the makeMultiWigHub.pl program file to accommodate your system so you don't have to
        specify -url and -webdir all the time.
