homerTools - General sequence manipulation
homerTools is a
utility program Chuck uses for basic sequence manipulation
of FASTQ files, extracting sequences from genome FASTA
files, and calculating nucleotide frequencies. It is
used by many of the other HOMER programs to do basic tasks,
but can also be useful to run on its own. To run homerTools type the
following:
homerTools [command] [command specific options]
e.g. homerTools trim -3 AAAAAAAA s_1_sequence.txt
The following commands are available in homerTools:
barcodes - for separating and removing 5' barcodes from FASTQ/FASTA files
trim - for trimming by adapter sequence, specific lengths, etc. from FASTQ/FASTA files
freq - for calculating nucleotide frequencies in FASTQ/FASTA/txt sequence files
extract - for extracting specific regions of sequence from genomic FASTA files
There are also some other specialized functions that are
under passive development. If you're bored or
desperate you can play with them:
truseq - tool to quantify barcode frequencies from FASTQ files (e.g. undetermined reads from demultiplexing)
decontaminate - tools to try and scrub unwanted reads from a tag directory in the case of sample mixing. Just redo the experiment...
cluster - hierarchical clustering utilities
special - binary file manipulation for mappability
Separating 5' Barcodes:
To separate and remove short
5' barcodes from sequencing data (where the first "x" base
pairs of the read are the barcode):
homerTools barcodes <# length of barcode> [options] <sequence file1> [sequence file2] ...
e.g. homerTools barcodes 3 s_1_sequence.txt (removes first 3bp as the barcode and sorts the reads by barcode)
The 3rd argument (the one right after "barcodes") must be the length of the 5' barcode,
that is, the number of base pairs at the start of each read that make up the barcode.
By default, this command creates files named
"filename.barcode", such as s_1_sequence.txt.AAA,
s_1_sequence.txt.AAC, s_1_sequence.txt.AAG, etc. The
parameter "-min <#>"
specifies the minimum barcode frequency to keep (default
is 0.02 [2%]). The frequency of each barcode is
recorded in the output file "filename.freq.txt". If
important barcodes were dropped, rerun the command with a
smaller value for "-min <#>".
WARNING: This tool keeps memory use minimal by writing
the reads out to a separate file for each barcode. If your 5'
barcodes are longer than 5 or 6 nucleotides (a 6 nt barcode
allows up to 4^6 = 4096 distinct files), the program
will likely crash once the operating system reaches the
maximum number of files it's willing to create in a given
directory.
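The splitting and frequency filtering described above can be pictured with a short Python sketch. This is not HOMER's implementation: the function name split_by_barcode, the in-memory grouping, and the choice of ">=" for the frequency cutoff are all assumptions made for illustration. It works on bare sequence strings rather than FASTQ records.

```python
from collections import Counter, defaultdict


def split_by_barcode(reads, barcode_len, min_freq=0.02):
    """Group reads by their first `barcode_len` bases (hypothetical helper).

    Returns (groups, freqs):
      groups maps each kept barcode to its reads with the barcode removed,
      freqs maps every observed barcode to its fraction of all reads.
    """
    counts = Counter(read[:barcode_len] for read in reads)
    total = sum(counts.values())
    freqs = {barcode: n / total for barcode, n in counts.items()}

    groups = defaultdict(list)
    for read in reads:
        barcode = read[:barcode_len]
        # Assumed to mirror "-min <#>": drop barcodes below the cutoff.
        if freqs[barcode] >= min_freq:
            groups[barcode].append(read[barcode_len:])
    return dict(groups), freqs


# Small demonstration with made-up reads and a deliberately high cutoff.
example_reads = ["AAAGGGTTT", "AAACCCTTT", "AACGGGTTT", "AAATTTGGG"]
kept, frequencies = split_by_barcode(example_reads, 3, min_freq=0.3)
print(kept)
print(frequencies)

# Upper bound on the number of per-barcode output files for length n is 4**n.
print(4 ** 6)
```

The last line shows why long barcodes are a problem for a one-file-per-barcode design: the number of possible barcodes grows fourfold with every added nucleotide.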
Trimming Sequence Files
With all the fancy types of
sequencing being done, it is getting common to find
adapters as part of the sequences that are analyzed.
The trim command allows users to trim sequences from the
3' and 5' ends, either by a specific number of nucleotides
or by removing a specific adapter sequence. The basic
command is executed like this:
homerTools trim [options] <sequence file1> [sequence file2]
The output will be placed in files "filename.trimmed" and
the distribution of sequence lengths after trimming will
be in "filename.lengths" for each of the input
files. The following options control how homerTools
trims the sequences:
-len <#> (Keep first # bp of sequence - i.e. make them the same length)
-min <#> (remove sequences that are shorter than this after trimming)
-max <#> (Maximum read length, default: 100000)
-3 <#> (trim this many bp off the 3' end of the sequence)
-5 <#> (trim this many bp off the 5' end of the sequence)
-3 <ACGT> (trim adapter sequence (e.g. "-3 GGAGGATTT") from the 3' end of the sequence)
-5 <ACGT> (trim adapter sequence (e.g. "-5 GGAGGATTT") from the 5' end of the sequence)
-mis <#> (Maximum allowed mismatches in adapter sequence, default: 0)
-minMatchLength <#> (minimum adapter sequence at edge to match, default: half adapter length)
-matchStart <#> (don't start searching for adapter until this position, default: 0)
-q <#> (Trim sequences once quality dips below threshold, default: none [range:0-40])
-qstart <#> (don't check quality until sequences are at least this long, default: 10)
-qwindow <#> (size of moving average to check for quality dropoff, default: 5)
For adapter sequence trimming, it will search for the
first full match to the adapter and delete that match along
with the rest of the sequence. For example, if you specify "-3 AA", it
will search for the first instance of "AA" and delete
it and everything after it. It will also delete partial
matches if they are at the end of the sequence (or the
beginning for 5'). As another example, our lab uses
an amplification strategy for RNA that results in the
ligation of a polyA tail to the RNA sequence. If the
reads are long enough, the end of the read will be just As,
e.g.
GAGATTATCTACGTACCGAAAAAAAAAAAAAAAAAA
Trimming with "-3 AAAAAAAAA" will cleave the complete polyA
stretch.
In this example: GAGATTATCTACGTACCGTACTGCATGACGGGAAAA,
only the final 4 As would be trimmed.
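The two 3' trimming rules above (first full match, then partial match at the edge) can be sketched in Python as follows. This is an illustration only, not HOMER's code: the function name trim_3prime_adapter is hypothetical, mismatches ("-mis") and "-matchStart" are ignored, and the default minimum partial match is assumed to be half the adapter length rounded down, which is consistent with the 4-A example above.

```python
def trim_3prime_adapter(seq, adapter, min_match_length=None):
    """Exact-match 3' adapter trimming sketch (no mismatches allowed).

    1. Cut at the first full occurrence of `adapter`, dropping the adapter
       and everything after it.
    2. Otherwise, remove the longest adapter prefix found at the very end of
       the read, provided it is at least `min_match_length` bases long.
    """
    if min_match_length is None:
        # Assumption: "half adapter length", rounded down, but at least 1.
        min_match_length = max(1, len(adapter) // 2)

    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos]

    # Partial match at the 3' edge: try the longest adapter prefix first.
    for k in range(len(adapter) - 1, min_match_length - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[: len(seq) - k]
    return seq


# The two reads from the text, trimmed with "-3 AAAAAAAAA".
print(trim_3prime_adapter("GAGATTATCTACGTACCGAAAAAAAAAAAAAAAAAA", "AAAAAAAAA"))
print(trim_3prime_adapter("GAGATTATCTACGTACCGTACTGCATGACGGGAAAA", "AAAAAAAAA"))
```

The first read contains a full 9-A match, so the whole polyA stretch goes; the second only ends in a 4-A prefix of the adapter, so only those 4 bases are removed.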
Common trimming tasks
TruSeq adapter trimming:
homerTools trim -3 GATCGGAAGAGCACACGTCT -mis 1 -minMatchLength 4 -min 15 file.fastq
Small RNA adapter trimming:
homerTools trim -3 TCGTATGCCGTCTTCTGCTTGT -mis 1 -minMatchLength 4 -min 15 file.fastq
Hi-C trimming:
homerTools trim -3 AAGCTT -matchStart 20 -min 20 file.fastq
Extracting Genomic Sequences From FASTA Files
The extract command can be
used to extract large numbers of specific genomic
sequences. The first input file you need is a HOMER
style peak file or a BED file with genomic
locations. Next, you must have the genomic DNA
sequences in one of two formats: (1) a directory of
chr1.fa, chr2.fa, ... FASTA files (these can be masked files like
*.fa.masked), or (2) a single FASTA file with all of
the chromosomes concatenated in one file. The
sequences are sent to stdout
as a tab-delimited file, or as a FASTA formatted file if "-fa" is added to the
end of the command. Save the output to a file by
adding "> outputfile.txt" to the end of the command. The
program is run like this:
homerTools extract <peak/BED file> <FASTA directory or file location> [-fa]
e.g. homerTools extract peaks.bed /home/chucknorris/homer/data/genomes/mm9/ > outputSequences.txt
Or, to get FASTA output back:
homerTools extract peaks.bed /home/chucknorris/homer/data/genomes/mm9/ -fa > outputSequence.fa
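To make the idea of region extraction concrete, here is a minimal Python sketch that pulls BED intervals out of FASTA text. It is not how homerTools does it internally: the helper names are hypothetical, the whole genome is held in memory, and coordinates follow the standard BED convention (0-based start, exclusive end), which may differ from how HOMER treats its own peak files.

```python
def parse_fasta(text):
    """Parse FASTA text into {name: sequence}; the name stops at whitespace."""
    genome, name, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                genome[name] = "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        genome[name] = "".join(chunks)
    return genome


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement, preserving upper/lower case."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_bed_regions(bed_text, genome):
    """Yield (name, sequence) for each tab-delimited BED line."""
    for line in bed_text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        strand = fields[5] if len(fields) > 5 else "+"
        seq = genome[chrom][start:end]
        if strand == "-":
            seq = reverse_complement(seq)
        yield name, seq


# Made-up genome and peaks; output is tab-delimited, one region per line.
demo_genome = parse_fasta(">chr1 demo\nACGTACGT\nTTGG\n>chr2\nGGGCCC\n")
demo_bed = "chr1\t2\t6\tp1\t0\t+\nchr2\t0\t3\tp2\t0\t-\n"
for region_name, region_seq in extract_bed_regions(demo_bed, demo_genome):
    print(f"{region_name}\t{region_seq}")
```

For real genomes, reading one chromosome at a time (as a directory of per-chromosome FASTA files allows) keeps memory use far lower than loading everything at once.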
Calculating Nucleotide Frequencies
The freq command will
calculate nucleotide frequencies from FASTQ, FASTA, or
tab-delimited text sequence files. The program tries
to auto-detect the format, but it may help to specify the
format directly ("-format fastq", "-format fasta", or
"-format tsv"). The program outputs a
position-dependent nucleotide/dinucleotide frequency file
as a function of the distance from the start of the
sequencing reads. The output is sent to stdout, unless you
specify "-o <outputfile.txt>". If you specify "-gc <outputfile2.txt>",
the program will also create a file that specifies the
cumulative frequency of CpG, total G+C, total A+G,
and total A+C in each individual sequence.
homerTools freq -format fastq s_1_sequence.txt > s_1.frequency.txt
homerTools freq -format fastq s_1_sequence.txt -gc GCdistribution.txt -o positionFrequency.txt
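As a rough picture of what "position-dependent nucleotide frequency" and per-sequence G+C content mean, consider the Python sketch below. It is only an illustration under simplifying assumptions: the function names are hypothetical, input is a list of bare sequence strings, and dinucleotides, CpG, A+G, and A+C are left out.

```python
from collections import Counter


def position_frequencies(seqs):
    """Return one {base: fraction} dict per position from the read start.

    Reads may differ in length; each position is normalized by the number
    of reads that actually reach it.
    """
    columns = []
    for seq in seqs:
        for i, base in enumerate(seq.upper()):
            if i == len(columns):
                columns.append(Counter())
            columns[i][base] += 1
    return [
        {base: n / sum(col.values()) for base, n in col.items()}
        for col in columns
    ]


def gc_fraction(seq):
    """Fraction of G and C bases in one sequence (0.0 for an empty one)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Made-up reads: print one line per position, then per-read G+C content.
demo_reads = ["ACGT", "AGGT", "AC"]
for position, freqs in enumerate(position_frequencies(demo_reads)):
    print(position, freqs)
for read in demo_reads:
    print(read, gc_fraction(read))
```

A strong skew at particular positions in such a table is often the first hint of leftover adapters or barcodes in the reads.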
homerTools Command Line options:
Usage: homerTools <command> [--help | options]
Collection of tools for sequence manipulation
Commands: [type "homerTools <command>" to see individual command options]
barcodes - separate FASTQ file by barcodes
truseq - process truseq barcodes from unidentified indexes (illumina)
trim - trim adapter sequences or fixed sizes from FASTQ files (also splits)
freq - calculate position-dependent nucleotide/dinucleotide frequencies
extract - extract specific sequences from FASTA file(s)
decontaminate - remove bad tags from a contaminated tag directory
cluster - hierarchical clustering of a NxN distance matrix
special - specialized routines (i.e. only really useful for chuck)
Options for command: barcode
3rd argument must be the number bp in the barcode
-min <#> (Minimum frequency of barcodes to keep: default=0.020)
-freq <filename> (output file for barcode frequencies, default=file.freq.txt)
-qual <#> (Minimum quality score for barcode nucleotides, default=not used)
-qualBase <character> (Minimum quality character in FASTQ file, default=B)
Options for command: trim
-3 <#|[ACGT]> (trim # bp or adapter sequence from 3' end of sequences)
-5 <#|[ACGT]> (trim # bp or adapter sequence from 5' end of sequences)
-mis <#> (Maximum allowed mismatches in adapter sequence, default: 0)
-minMatchLength <#> (minimum adapter sequence at edge to match, default: half adapter length)
-matchStart <#> (don't start searching for adapter until this position, default: 0)
-q <#> (Trim sequences once quality dips below threshold, default: none [range:0-40])
-qstart <#> (don't check quality until sequences are at least this long, default: 10)
-qwindow <#> (size of moving average to check for quality dropoff, default: 5)
-len <#> (Keep first # bp of sequence - i.e. make them the same length)
-stats <filename> (Output trimming statistics to filename, default: sent to stdout)
-min <#> (Minimum size of trimmed sequence to keep, default: 1)
-max <#> (Maximum read length, default: 100000)
-suffix <filename suffix> (output is sent to InuptFileName.suffix, default: trimmed)
-lenSuffix <filename suffix> (length distribution is sent to InuptFileName.suffix, default: lengths)
Options for command: extract
-fa (output sequences in FASTA format - default is tab-delimited format)
-mask (mask out lower case sequence from genome)
Usage: homerTools extract <peak file/BED file> <Directory of FASTA files> [options]
The <Directory of FASTA files> can be a single FASTA file instead.
If using a Directory, files should be named with chromosomes,
i.e. chr1.fa or chr1.fa.masked or genome.fa/genome.fa.masked
If having trouble, place all FASTA entries in single file instead of a directory
FASTA format: >chrname ... (anything after whitespace will be ignored)
This program will output sequences to stdout in tab-delimited format
Alternate Usage: homerTools extract stats <Directory of FASTA files>
Displays stats about the genome files (such as length)