The metagenomics classifier uses a k-mer based algorithm to classify each query sequence (usually a read) against a collection of reference sequences. The process has two logical steps: (1) the reference sequences are indexed into a searchable database, and (2) the database is searched with the query sequences, and each query sequence is classified to the taxid(s) associated with the matching reference sequences. This guide explains how to build your own indexed, searchable database of reference sequences for use with the k-mer classifier.
K-mer/G-mer length considerations:
G-mer length refers to the size of the window from which to pick a minimizer. The larger the window, the fewer minimizers will be chosen overall, resulting in a smaller database. However, this can cause a loss of sensitivity since fewer k-mers overall will be saved.
K-mer length refers to the size of the minimizer saved from each window, where the window size is set by the g-mer length. In general, larger k-mers give greater specificity and shorter k-mers give greater sensitivity. This rule does not hold for every application, so we recommend trying a few different g-mer and k-mer lengths to determine what works best for a given sequence type and application.
As a general rule, we recommend starting with a g-mer length of 35 and k-mer length of 31.
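
To make the relationship between the two lengths concrete, the sketch below picks one minimizer per window. It assumes the minimizer is the lexicographically smallest k-mer in each window of g bases; the ordering the classifier actually uses is not described in this guide, and the function name `minimizers` and the example sequence are hypothetical.

```python
def minimizers(seq, k, g):
    """Return the set of minimizers: one k-mer chosen from every g-base window.

    Assumption: the minimizer is the lexicographically smallest k-mer in the
    window. The real tool may order k-mers differently (for example by hash).
    """
    if not 0 < k <= g:
        raise ValueError("g-mer length must be >= k-mer length")
    chosen = set()
    for start in range(len(seq) - g + 1):
        window = seq[start:start + g]
        # Every k-mer that fits inside this window is a candidate.
        kmers = [window[i:i + k] for i in range(g - k + 1)]
        chosen.add(min(kmers))
    return chosen


seq = "ACGTTGCATGCCGATAGCTAGGCTA"  # made-up sequence for illustration
small_window = minimizers(seq, k=4, g=6)
large_window = minimizers(seq, k=4, g=12)
print(len(small_window), len(large_window))
```

Under this assumption, every minimizer selected with the larger window is also selected with the smaller one, so widening the window can only shrink the set of saved k-mers. When g equals k, every k-mer in the sequence is saved.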
Pre-built Explify Reference Database k-mer/g-mer length settings for reference:
Very large collection of NCBI RefSeq genomes and the entire NCBI nucleotide (nt) database (more than 2 terabases of sequence): g-mer length of 41 and k-mer length of 31. A compressed version is built with a g-mer length of 64 and a k-mer length of 31, which results in a database with less than half the storage requirements.
Subset of reference genomes from RefSeq with a focus on viral detection: g-mer length of 35 and k-mer length of 31
Collection of 16S sequences for bacterial identification / profiling: g-mer length of 31 and k-mer length of 31
UniRef90 protein sequences: g-mer length of 15 and k-mer length of 12 (these are k-mers of amino acids, not nucleotides)
Three types of databases can be built with this tool:
Binner: each k-mer is assigned to a category/bin.
Must use --kmer-class-db-builder-num-categories.
Do not use --kmer-class-db-builder-tax-tree-file, --kmer-class-db-builder-save-weights, or --kmer-class-db-builder-kmer-cutoff.
Classifier: each k-mer is assigned to one taxid.
Must define a taxonomic tree with --kmer-class-db-builder-tax-tree-file.
Do not use --kmer-class-db-builder-num-categories, --kmer-class-db-builder-save-weights, or --kmer-class-db-builder-kmer-cutoff.
Classifier with weights: each k-mer is assigned to one or more taxids; associated weights are also stored (frequency of k-mer's association with a taxid). Uses much more memory, but is more accurate.
Must use --kmer-class-db-builder-save-weights and define a taxonomic tree with --kmer-class-db-builder-tax-tree-file.
Can use --kmer-class-db-builder-kmer-cutoff.
Do not use --kmer-class-db-builder-num-categories.
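
The must-use and do-not-use rules for the three database types can be checked before launching a build. The following is a minimal sketch, not part of the tool: the helper `database_type` is hypothetical, and it encodes only the rules listed above, taking the set of option names that will be passed on the command line.

```python
PREFIX = "--kmer-class-db-builder-"


def database_type(flags):
    """Infer the database type from the option names in use, or raise ValueError.

    Hypothetical helper: it mirrors the documented rules and nothing more.
    """
    def has(name):
        return (PREFIX + name) in flags

    if has("num-categories"):
        kind = "binner"
        forbidden = ["tax-tree-file", "save-weights", "kmer-cutoff"]
    elif has("save-weights"):
        if not has("tax-tree-file"):
            raise ValueError("a weighted classifier needs " + PREFIX + "tax-tree-file")
        kind = "classifier with weights"
        forbidden = ["num-categories"]
    elif has("tax-tree-file"):
        kind = "classifier"
        forbidden = ["num-categories", "save-weights", "kmer-cutoff"]
    else:
        raise ValueError("set either num-categories or tax-tree-file")

    clash = [PREFIX + name for name in forbidden if has(name)]
    if clash:
        raise ValueError(kind + " database cannot use: " + ", ".join(clash))
    return kind


print(database_type({PREFIX + "tax-tree-file", PREFIX + "save-weights"}))
```

Catching a bad combination this way costs nothing compared with discovering it after a long indexing run has started.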
Required Inputs
Option | Description |
---|---|
--enable-kmer-class-db-builder | Enables the Kmer Classifier Database Builder. (Default=false). |
--kmer-class-db-builder-input-file | Headerless, tab-delimited file where each line is (1) the path to a reference fasta file and (2) the associated taxid. When using --kmer-class-db-builder-taxids-as-seq-name, the second column is required but ignored. |
--output-file-prefix | Prefix for all output files. |
--output-directory | Directory for all output files. |
--kmer-class-db-builder-kmer-length | Kmer length (Range: [4, 31]). |
--kmer-class-db-builder-gmer-length | Gmer length (must be >= kmer length. Range: [4, 64]). |
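
As an illustration of the input file format and the length constraints, the sketch below writes a two-line input file and checks a k-mer/g-mer pair against the documented ranges. The helper names `write_input_file` and `check_lengths`, the FASTA paths, and the output file name are all hypothetical; the taxids are NCBI taxonomy IDs for Escherichia coli and SARS-CoV-2.

```python
import csv
import tempfile
from pathlib import Path


def write_input_file(path, references):
    """Write (fasta_path, taxid) pairs as a headerless, tab-delimited file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for fasta_path, taxid in references:
            writer.writerow([str(fasta_path), int(taxid)])


def check_lengths(kmer_length, gmer_length):
    """Apply the documented ranges: k in [4, 31], g in [4, 64], g >= k."""
    if not 4 <= kmer_length <= 31:
        raise ValueError("k-mer length must be in [4, 31]")
    if not 4 <= gmer_length <= 64:
        raise ValueError("g-mer length must be in [4, 64]")
    if gmer_length < kmer_length:
        raise ValueError("g-mer length must be >= k-mer length")


# Hypothetical FASTA paths, written to a temporary directory for the example.
references = [
    ("refs/escherichia_coli.fa", 562),
    ("refs/sars_cov_2.fa", 2697049),
]
input_path = Path(tempfile.mkdtemp()) / "db_input.tsv"
write_input_file(input_path, references)
check_lengths(kmer_length=31, gmer_length=35)
print(input_path.read_text())
```

The resulting file is what --kmer-class-db-builder-input-file expects: one reference per line, with the path in the first column and the taxid in the second.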
Optional Inputs
Option | Description |
---|---|
--kmer-class-db-builder-tax-tree-file | .tri file with nodes in the taxonomic tree for a classifier database (not required if building a binner database). Headerless, tab-delimited file where each line has (1) the child node taxid and (2) the parent node taxid. The root of the tree must be 1 and have a parent of 0. |
--kmer-class-db-builder-protein | Set to indicate that the input sequences are protein sequences. (Default=false). |
--kmer-class-db-builder-taxids-to-keep | File with taxids to keep. If set, any k-mers with taxids not in this file are excluded from the database. |
--kmer-class-db-builder-num-categories | Set to build a binner database with this number of categories. The maximum is 25 categories, and categories are assumed to run sequentially from 2^0 to 2^n. The categories take the place of taxids in the input file. |
--kmer-class-db-builder-save-weights | Set to build a classification database that saves all k-mers / taxids / weights. |
--kmer-class-db-builder-kmer-cutoff | Excludes k-mers that are found in more than this number of taxids when building a database with --kmer-class-db-builder-save-weights. Helps speed up classification. (Default=1000). |
--kmer-class-db-builder-mask-bits | Number of bits to mask in each k-mer before building / searching. (Default=7). |
--kmer-class-db-builder-num-cpus | Sets the number of CPUs available for processing. |
--kmer-class-db-builder-num-kmers-per-bucket | Set to output the number of k-mers in each minimizer bucket. (Default=false). |
--kmer-class-db-builder-include-lowercase | Set to include k-mers with lowercase bases (usually repeatmasked). (Default=false). |
--kmer-class-db-builder-taxids-as-seq-name | Set to indicate that the reference fastas listed in the input file have taxids as the sequence name. In this case, the second column of the input file is ignored. (Default=false). |
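
The taxonomic tree file format can be sanity-checked before a build. This is a minimal sketch under stated assumptions: the function `load_tax_tree` is hypothetical, and beyond the documented rule (root taxid 1 with parent 0) it also assumes that no other node may have parent 0 and that every node must lead back to the root.

```python
def load_tax_tree(lines):
    """Parse child<TAB>parent lines into a {child: parent} dict and validate it."""
    parents = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        child, parent = (int(field) for field in line.split("\t"))
        parents[child] = parent

    # Documented rule: the root is taxid 1 and its parent is 0.
    if parents.get(1) != 0:
        raise ValueError("root taxid 1 must be present with parent 0")

    # Assumption: a single root, so only taxid 1 may have parent 0.
    for child, parent in parents.items():
        if parent == 0 and child != 1:
            raise ValueError(f"taxid {child} has parent 0 but is not the root")

    # Assumption: every node must reach the root without cycles or gaps.
    for child in parents:
        node, seen = child, set()
        while node != 1:
            if node in seen or node not in parents:
                raise ValueError(f"taxid {child} does not lead to the root")
            seen.add(node)
            node = parents[node]
    return parents


# Root (1), cellular organisms (131567), Bacteria (2) from the NCBI taxonomy.
tree = load_tax_tree(["1\t0", "131567\t1", "2\t131567"])
print(tree)
```

To check a real .tri file, pass an open file handle as `lines`, since iterating over a text file yields one line at a time.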