HOMER

Software for motif discovery and next-gen sequencing analysis

homerTools - General sequence manipulation

homerTools is a utility program Chuck uses for basic sequence manipulation of FASTQ files, extracting sequences from genome FASTA files, and calculating nucleotide frequencies. It is used by many of the other HOMER programs to do basic tasks, but can also be useful to run on its own. To run homerTools type the following:

homerTools [command] [command specific options]

e.g. homerTools trim -3 AAAAAAAA s_1_sequence.txt

The following commands are available in homerTools:

barcodes - for separating and removing 5' barcodes from FASTQ/FASTA files
trim - for trimming by adapter sequence, specific lengths, etc. from FASTQ/FASTA files
freq - for calculating nucleotide frequencies in FASTQ/FASTA/txt sequence files
extract - for extracting specific regions of sequence from genomic FASTA files

There are also some other specialized functions that are under passive development. If you're bored or desperate you can play with them:

truseq - tool to quantify barcode frequencies from FASTQ files (i.e. undetermined reads from demultiplexing)
decontaminate - tools to try and scrub unwanted reads from a tag directory in the case of sample mixing. Just redo the experiment...
cluster - hierarchical clustering utilities
special - binary file manipulation for mapability

Separating 5' Barcodes:

To separate and remove short 5' barcodes from sequencing data (where the first "x" base pairs of the read are the barcode):

homerTools barcodes <# length of barcode> [options] <sequence file1> [sequence file2] ...

e.g. homerTools barcodes 3 s_1_sequence.txt (removes the first 3 bp as the barcode and sorts the reads by barcode)

The first argument after "barcodes" must be the length of the 5' barcode, which is the number of base pairs at the start of each read that make up the barcode (this is the "3rd argument" when the program name and the command are counted). By default, this command creates files named "filename.barcode", such as s_1_sequence.txt.AAA, s_1_sequence.txt.AAC, s_1_sequence.txt.AAG, etc. The parameter "-min <#>" specifies the minimum barcode frequency to keep (default is 0.02 [2%]). The frequency of each barcode is recorded in the output file "filename.freq.txt". If important barcodes were dropped, rerun the command with a smaller value for "-min <#>".
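
As a rough illustration of what the barcodes command does conceptually, the Python sketch below groups reads by their first few bases, strips the barcode, and drops barcodes below a minimum frequency. It is not HOMER's implementation: the function name split_by_barcode, the in-memory lists, and the use of ">=" for the cutoff are all assumptions made for this example.

```python
from collections import defaultdict


def split_by_barcode(reads, barcode_length, min_frequency=0.02):
    """Group reads by 5' barcode (hypothetical helper, not part of HOMER).

    Returns (kept, freqs): kept maps each retained barcode to its reads with
    the barcode removed; freqs maps every observed barcode to its frequency.
    """
    groups = defaultdict(list)
    for read in reads:
        barcode = read[:barcode_length]
        groups[barcode].append(read[barcode_length:])

    total = sum(len(members) for members in groups.values())
    freqs = {bc: len(members) / total for bc, members in groups.items()} if total else {}

    # Assumption: a barcode is kept when its frequency is at least min_frequency.
    kept = {bc: members for bc, members in groups.items() if freqs[bc] >= min_frequency}
    return kept, freqs


if __name__ == "__main__":
    reads = ["AAAGGG"] * 60 + ["AACTTT"] * 39 + ["TTTCCC"]
    kept, freqs = split_by_barcode(reads, 3)
    for barcode in sorted(freqs):
        status = "kept" if barcode in kept else "dropped"
        print(barcode, freqs[barcode], status)
```

In this toy input the barcode TTT accounts for 1% of the reads, so it falls below the default 2% cutoff and is dropped, which mirrors the advice above about lowering "-min <#>" when an expected barcode is missing.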

WARNING: This tool consumes minimal memory by creating several files to output the reads. If your 5' barcodes are longer than 5 or 6 nucleotides the program will likely crash once the operating system reaches the maximum number of files it's willing to create in a given directory.
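
To see why longer barcodes become a problem, note that the number of possible barcodes, and so the worst-case number of output files, grows as 4^n with the barcode length n. The short Python sketch below only works through that arithmetic. The helper name max_barcode_files is hypothetical, and the real file count depends on which barcodes actually occur in the data.

```python
def max_barcode_files(barcode_length):
    """Upper bound on distinct A/C/G/T barcodes of the given length."""
    return 4 ** barcode_length


if __name__ == "__main__":
    for n in (3, 5, 6, 8):
        print(n, max_barcode_files(n))
    # 3 64
    # 5 1024
    # 6 4096
    # 8 65536
```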

Trimming Sequence Files

With all the fancy types of sequencing being done, it is increasingly common to find adapters as part of the sequences that are analyzed. The trim command allows users to trim sequences from the 3' and 5' ends, either by removing a specific number of nucleotides or by removing a specific adapter sequence. The basic command is executed like this:

homerTools trim [options] <sequence file1> [sequence file2]

The output will be placed in files "filename.trimmed" and the distribution of sequence lengths after trimming will be in "filename.lengths" for each of the input files. The following options control how homerTools trims the sequences:

-min <#> (remove sequences that are shorter than this after trimming)
-max <#> (Maximum read length, default: 100000)

-len <#> (Keep first # bp of sequence - i.e. make them the same length)
-3 <#> (trim this many bp off the 3' end of the sequence)
-5 <#> (trim this many bp off the 5' end of the sequence)
-3 <ACGT> (trim adapter sequence (e.g. "-3 GGAGGATTT") from the 3' end of the sequence)
-5 <ACGT> (trim adapter sequence (e.g. "-5 GGAGGATTT") from the 5' end of the sequence)
-mis <#> (Maximum allowed mismatches in adapter sequence, default: 0)
-minMatchLength <#> (minimum adapter sequence at edge to match, default: half adapter length)
-matchStart <#> (don't start searching for adapter until this position, default: 0)

-q <#> (Trim sequences once quality dips below threshold, default: none [range:0-40])
-qstart <#> (don't check quality until sequences are at least this long, default: 10)
-qwindow <#> (size of moving average to check for quality dropoff, default: 5)
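
The quality options can be pictured with a small moving-average sketch in Python. This is only one plausible reading of the option descriptions above, not HOMER's actual algorithm: the function name quality_trim, the choice to cut at the start of the first low-quality window, and the use of already-decoded integer Phred scores are all assumptions.

```python
def quality_trim(seq, quals, threshold, qstart=10, qwindow=5):
    """Trim seq where the moving-average quality first drops below threshold.

    Hypothetical helper. quals is a list of integer Phred scores, one per
    base. Windows are only examined from position qstart onward, so the
    returned sequence is never shorter than qstart bases (unless the read
    itself is shorter).
    """
    for start in range(qstart, len(seq) - qwindow + 1):
        window = quals[start:start + qwindow]
        if sum(window) / qwindow < threshold:
            # Assumption: cut at the first base of the offending window.
            return seq[:start]
    return seq


if __name__ == "__main__":
    seq = "ACGTACGTACGTACGTACGT"
    quals = [40] * 15 + [2] * 5  # quality collapses over the last 5 bases
    print(quality_trim(seq, quals, threshold=20))
```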

For adapter sequence trimming, homerTools will search for the first full match to the adapter sequence and delete it along with the rest of the read. For example, if you specify "-3 AA", it will search for the first instance of "AA" and delete the match and everything after it. It will also delete partial matches if they are at the end of the sequence (or at the beginning for 5'). As another example, our lab uses an amplification strategy for RNA that results in the ligation of a polyA tail to the RNA sequence. If the reads are long enough, the end of the read will be just As.

i.e. GAGATTATCTACGTACCGAAAAAAAAAAAAAAAAAA

Trimming with "-3 AAAAAAAAA" will cleave the complete polyA stretch.

In this example: GAGATTATCTACGTACCGTACTGCATGACGGGAAAA, only the final 4 As would be trimmed.
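
The two polyA examples can be reproduced with a minimal Python sketch of 3' adapter trimming. It is an illustration under stated assumptions rather than HOMER's code: the function name trim_3prime_adapter is hypothetical, mismatches are not modelled (as with the default "-mis 0"), and the default minimum match length is taken to be half the adapter length rounded down.

```python
def trim_3prime_adapter(read, adapter, min_match_length=None):
    """Remove a 3' adapter from read (hypothetical helper, exact matches only)."""
    if min_match_length is None:
        # Assumption: "half adapter length" means integer division by 2.
        min_match_length = max(1, len(adapter) // 2)

    # 1) First full match anywhere in the read: drop it and everything after.
    position = read.find(adapter)
    if position != -1:
        return read[:position]

    # 2) Otherwise look for a partial adapter (a prefix of it) at the very
    #    end of the read, trying the longest candidate first.
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_match_length - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]

    return read


if __name__ == "__main__":
    adapter = "AAAAAAAAA"
    full_tail = "GAGATTATCTACGTACCG" + "A" * 18
    short_tail = "GAGATTATCTACGTACCGTACTGCATGACGGGAAAA"
    print(trim_3prime_adapter(full_tail, adapter))   # GAGATTATCTACGTACCG
    print(trim_3prime_adapter(short_tail, adapter))  # GAGATTATCTACGTACCGTACTGCATGACGGG
```

With a 9 bp adapter the assumed default minimum match is 4, which is why the trailing "AAAA" in the second read is still removed.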

Common trimming tasks

TruSeq adapter trimming:

homerTools trim -3 GATCGGAAGAGCACACGTCT -mis 1 -minMatchLength 4 -min 15 file.fastq

Small RNA adapter trimming:

homerTools trim -3 TCGTATGCCGTCTTCTGCTTGT -mis 1 -minMatchLength 4 -min 15 file.fastq

Hi-C trimming:

homerTools trim -3 AAGCTT -matchStart 20 -min 20 file.fastq

Extracting Genomic Sequences From FASTA Files

The extract command can be used to extract large numbers of specific genomic sequences. The first input file you need is a HOMER-style peak file or a BED file with genomic locations. Next, you must have the genomic DNA sequences in one of two formats: (1) a directory of chr1.fa, chr2.fa, etc. FASTA files (these can be masked files like *.fa.masked), or (2) a single FASTA file with all of the chromosomes concatenated. The sequences are sent to stdout as a tab-delimited file, or as a FASTA formatted file if "-fa" is added to the end of the command. Save the output to a file by adding " > outputfile.txt" to the end of the command. The program is run like this:

homerTools extract <peak/BED file> <FASTA directory or file location> [-fa]

e.g. homerTools extract peaks.bed /home/chucknorris/homer/data/genomes/mm9/ > outputSequences.txt
Or, to get FASTA output instead, e.g. homerTools extract peaks.bed /home/chucknorris/homer/data/genomes/mm9/ -fa > outputSequence.fa
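
For readers who want to see the idea behind extract, here is a small Python sketch that pulls BED intervals out of FASTA text held in memory. It is not how HOMER is implemented. The function names parse_fasta and extract_bed_regions are hypothetical, the sketch assumes standard BED coordinates (0-based start, end-exclusive), and it ignores strand. HOMER-style peak files use a different column layout and are not handled here.

```python
def parse_fasta(text):
    """Parse FASTA text into {name: sequence}.

    Only the first whitespace-delimited token of each header is used as the
    sequence name; anything after it is ignored.
    """
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def extract_bed_regions(genome, bed_text):
    """Return [(region_name, sequence)] for each tab-delimited BED line."""
    regions = []
    for line in bed_text.splitlines():
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 3:
            continue  # skip comments, track lines, and malformed rows
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        regions.append((name, genome[chrom][start:end]))
    return regions


if __name__ == "__main__":
    genome = parse_fasta(">chr1 toy chromosome\nACGTAC\nGTTT\n>chr2\nGGGCCC\n")
    bed = "chr1\t2\t6\tpeakA\nchr2\t0\t3\n"
    for name, seq in extract_bed_regions(genome, bed):
        print(f"{name}\t{seq}")
```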

Calculating Nucleotide Frequencies

The freq command will calculate nucleotide frequencies from FASTQ, FASTA, or tab-delimited text sequence files. The program tries to auto-detect the format, but it may help to specify the format directly ("-format fastq", "-format fasta", "-format tsv"). The program outputs a position-dependent nucleotide/dinucleotide frequency file as a function of the distance from the start of the sequencing reads. The output is sent to stdout, unless you specify "-o <outputfile.txt>". If you specify "-gc <outputfile2.txt>", the program will also create a file that specifies the cumulative frequency of CpG, total G+C, total A+G, and total A+C in each individual sequence.

homerTools freq -format fastq s_1_sequence.txt > s_1.frequency.txt
homerTools freq -format fastq s_1_sequence.txt -gc GCdistribution.txt -o positionFrequency.txt
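
The kind of table that freq produces can be approximated with the Python sketch below, which computes per-position nucleotide frequencies and a per-sequence G+C fraction. It is a simplified stand-in, not HOMER's output format: the function names position_frequencies and gc_fraction are hypothetical, dinucleotides and CpG are left out, and non-ACGT characters such as N are excluded from the denominators by assumption.

```python
from collections import Counter


def position_frequencies(reads, alphabet="ACGT"):
    """Return one {base: frequency} dict per read position (0-based)."""
    longest = max((len(read) for read in reads), default=0)
    table = []
    for pos in range(longest):
        # Only reads long enough to cover this position contribute.
        counts = Counter(read[pos].upper() for read in reads if len(read) > pos)
        total = sum(counts[base] for base in alphabet)
        table.append({base: (counts[base] / total if total else 0.0) for base in alphabet})
    return table


def gc_fraction(seq, alphabet="ACGT"):
    """Fraction of A/C/G/T bases in seq that are G or C."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in alphabet)
    return (seq.count("G") + seq.count("C")) / total if total else 0.0


if __name__ == "__main__":
    reads = ["ACGT", "AAGT", "AC"]
    for pos, freqs in enumerate(position_frequencies(reads)):
        print(pos, freqs)
    for read in reads:
        print(read, gc_fraction(read))
```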

homerTools Command Line options:

    Usage: homerTools <command> [--help | options]

    Collection of tools for sequence manipulation

    Commands: [type "homerTools <command>" to see individual command options]
        barcodes - separate FASTQ file by barcodes
        truseq - process truseq barcodes from unidentified indexes (illumina)
        trim - trim adapter sequences or fixed sizes from FASTQ files(also splits)
        freq - calculate position-dependent nucleotide/dinucleotide frequencies
        extract - extract specific sequences from FASTA file(s)
        decontaminate - remove bad tags from a contaminated tag directory
        cluster - hierarchical clustering of a NxN distance matrix
        special - specialized routines (i.e. only really useful for chuck)

    Options for command: barcode
        3rd argument must be the number bp in the barcode

        -min <#> (Minimum frequency of barcodes to keep: default=0.020
        -freq <filename> (output file for barcode frequencies, default=file.freq.txt)
        -qual <#> (Minimum quality score for barcode nucleotides, default=not used)
        -qualBase <character> (Minimum quality character in FASTQ file, default=B)

    Options for command: trim
        -3 <#|[ACGT]> (trim # bp or adapter sequence from 3' end of sequences)
        -5 <#|[ACGT]> (trim # bp or adapter sequence from 5' end of sequences)
            -mis <#> (Maximum allowed mismatches in adapter sequence, default: 0)
            -minMatchLength <#> (minimum adapter sequence at edge to match, default: half adapter length)
            -matchStart <#> (don't start searching for adapter until this position, default: 0)
        -q <#> (Trim sequences once quality dips below threshold, default: none [range:0-40])
            -qstart <#> (don't check quality until sequences are at least this long, default: 10)
            -qwindow <#> (size of moving average to check for quality dropoff, default: 5)
        -len <#> (Keep first # bp of sequence - i.e. make them the same length)
        -stats <filename> (Output trimming statistics to filename, default: sent to stdout)
        -min <#> (Minimum size of trimmed sequence to keep, default: 1)
        -max <#> (Maximum read length, default: 100000)
        -suffix <filename suffix> (output is sent to InuptFileName.suffix, default: trimmed)
        -lenSuffix <filename suffix> (length distribution is sent to InuptFileName.suffix, default: lengths)

    Options for command: extract
        -fa (output sequences in FASTA format - default is tab-delimited format)
        -mask (mask out lower case sequence from genome)

      Usage: homerTools extract <peak file/BED file> <Directory of FASTA files> [options]

      The <Directory of FASTA files> can be a single FASTA file instead.
      If using a Directory, files should be named with chromosomes,
          i.e. chr1.fa or chr1.fa.masked or genome.fa/genome.fa.masked
    If having trouble, place all FASTA entries in single file instead of a directory
          FASTA format: >chrname ... (anything after whitespace will be ignored)
        This program will output sequences to stdout in tab-delimited format

        Alternate Usage: homerTools extract stats <Directory of FASTA files>
            Displays stats about the genome files (such as length)

Can't figure something out? Questions, comments, concerns, or other feedback:
cbenner@salk.edu