Kmer Classifier Database Builder

Description

The metagenomics classifier uses a k-mer-based algorithm to classify each query sequence (usually a read) against a collection of reference sequences. The process has two logical steps: 1) the reference sequences are indexed into a searchable database, and 2) the reference sequence database is searched with the query sequences, and each query sequence is classified to the taxid(s) associated with the reference sequences. This guide explains how to generate your own indexed, searchable database of reference sequences for use by the k-mer classifier.

Command Line Settings

Option

Description

Example Command Line

dragen \
  --enable-kmer-class-db-builder=true \
  --kmer-class-db-builder-input-file <builder_input.txt> \
  --output-file-prefix <PREFIX> \
  --output-directory <OUTPUT_DIR> \
  --kmer-class-db-builder-kmer-length 31 \
  --kmer-class-db-builder-gmer-length 64 \
  --kmer-class-db-builder-num-categories 3

Usage

K-mer/G-mer length considerations:

G-mer length is the size of the window from which a minimizer is picked. The larger the window, the fewer minimizers are chosen overall, which results in a smaller database. However, this can reduce sensitivity because fewer k-mers are saved overall.
K-mer length is the size of the minimizer saved from each window of G-mer length. In general, longer k-mers give greater specificity, while shorter k-mers give greater sensitivity. This rule does not hold for every application, so we recommend trying a few different g-mer and k-mer lengths to determine what works best for a given sequence type and application.
As a general rule, we recommend starting with a g-mer length of 35 and k-mer length of 31.
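
To make the relationship between the two lengths concrete, the following is a minimal Python sketch of minimizer selection. It is an illustration only and not the DRAGEN implementation: it assumes the minimizer of a window is the lexicographically smallest k-mer (real tools often order k-mers by a hash instead), and the function name minimizers is hypothetical.

```python
from typing import Set, Tuple


def minimizers(seq: str, k: int, g: int) -> Set[Tuple[int, str]]:
    """Return the (position, k-mer) pairs selected as minimizers.

    Assumption: for every window of g bases, the lexicographically
    smallest k-mer inside the window is kept, and ties go to the
    leftmost occurrence.
    """
    if not 0 < k <= g:
        raise ValueError("k must satisfy 0 < k <= g")
    chosen: Set[Tuple[int, str]] = set()
    for start in range(len(seq) - g + 1):
        window = seq[start:start + g]
        # Every k-mer inside this g-base window, with its absolute position.
        candidates = [(window[i:i + k], start + i) for i in range(g - k + 1)]
        kmer, pos = min(candidates)  # smallest k-mer, then leftmost position
        chosen.add((pos, kmer))
    return chosen


if __name__ == "__main__":
    read = "ACGTACGTTAGCCGATAGCTAGGCTA"
    # A larger window never selects more minimizers than a smaller one.
    for g in (4, 6, 12):
        print(g, len(minimizers(read, k=4, g=g)))
```

When g equals k, each window holds exactly one k-mer, so every k-mer is saved. As g grows, neighboring windows increasingly share the same minimizer, which is why fewer k-mers are stored.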
Pre-built Explify Reference Database k-mer/g-mer length settings for reference:
- Very large collection of NCBI RefSeq genomes and the entirety of the NCBI nucleotide (nt) database (more than 2 terabases of sequence): G-mer length of 41 and k-mer length of 31. A compressed version is built with a g-mer length of 64 and k-mer length of 31, which results in a database with less than half the storage requirements.
- Subset of reference genomes from RefSeq with a focus on viral detection: G-mer length of 35 and k-mer length of 31.
- Collection of 16S sequences for bacterial identification / profiling: G-mer length of 31 and k-mer length of 31.
- UniRef90 protein sequences: G-mer length of 15 and k-mer length of 12 (these are k-mers of amino acids, not nucleotides).
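
As a rough way to compare the settings listed above, the fraction of k-mers kept can be estimated with the standard random-minimizer approximation 2 / (w + 1), where w is the number of k-mers per window. This sketch assumes that a g-mer window contains g - k + 1 k-mers and that k-mers are ordered effectively at random. Neither assumption is confirmed by this guide, and actual database size also depends on how duplicate k-mers are stored. The function name expected_density is hypothetical.

```python
def expected_density(k: int, g: int) -> float:
    """Approximate fraction of k-mer positions selected as minimizers.

    Assumption: a window of g bases holds w = g - k + 1 k-mers, and a
    random ordering of k-mers gives a density of about 2 / (w + 1).
    """
    w = g - k + 1
    if k < 1 or w < 1:
        raise ValueError("k must satisfy 1 <= k <= g")
    return 2.0 / (w + 1)


if __name__ == "__main__":
    # (k, g) pairs taken from the pre-built database list above.
    for k, g in [(31, 41), (31, 64), (31, 35), (31, 31), (12, 15)]:
        print(f"k={k} g={g} density~{expected_density(k, g):.3f}")
```

Under these assumptions, moving from a g-mer length of 41 to 64 at k = 31 lowers the estimated density from 2/12 to 2/35, a ratio of roughly 0.34. That is in line with the compressed database needing less than half the storage, although the estimate alone does not account for every factor in the final size.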

Three types of databases can be built with this tool:

Binner: each k-mer is assigned to a category/bin.
- Must use --kmer-class-db-builder-num-categories.
- Do not use --kmer-class-db-builder-tax-tree-file, --kmer-class-db-builder-save-weights, or --kmer-class-db-builder-kmer-cutoff.
Classifier: each k-mer is assigned to one taxid.
- Must define a taxonomic tree with --kmer-class-db-builder-tax-tree-file.
- Do not use --kmer-class-db-builder-num-categories, --kmer-class-db-builder-save-weights, or --kmer-class-db-builder-kmer-cutoff.
Classifier with weights: each k-mer is assigned to one or more taxids, and an associated weight is stored for each assignment (the frequency of the k-mer's association with that taxid). Uses much more memory, but is more accurate.
- Must use --kmer-class-db-builder-save-weights and define a taxonomic tree with --kmer-class-db-builder-tax-tree-file.
- Can use --kmer-class-db-builder-kmer-cutoff.
- Do not use --kmer-class-db-builder-num-categories.
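
The option rules for the three database types can be checked before launching a build. The following Python sketch encodes only the rules listed above. The helper name database_type is hypothetical, the script is not part of DRAGEN, and it looks only at which options are present, not at their values.

```python
from typing import Iterable

NUM_CATEGORIES = "--kmer-class-db-builder-num-categories"
TAX_TREE = "--kmer-class-db-builder-tax-tree-file"
SAVE_WEIGHTS = "--kmer-class-db-builder-save-weights"
KMER_CUTOFF = "--kmer-class-db-builder-kmer-cutoff"


def database_type(options: Iterable[str]) -> str:
    """Infer the database type from the option names that are present.

    Raises ValueError when the combination breaks one of the rules
    listed in this guide.
    """
    present = set(options)
    has_categories = NUM_CATEGORIES in present
    has_tree = TAX_TREE in present
    has_weights = SAVE_WEIGHTS in present
    has_cutoff = KMER_CUTOFF in present

    if has_categories:
        # Binner: none of the classifier options may be combined with it.
        if has_tree or has_weights or has_cutoff:
            raise ValueError("binner cannot use tax-tree-file, save-weights, or kmer-cutoff")
        return "binner"
    if not has_tree:
        raise ValueError("need either num-categories or tax-tree-file")
    if has_weights:
        # kmer-cutoff is optional for the weighted classifier.
        return "classifier with weights"
    if has_cutoff:
        raise ValueError("kmer-cutoff is only allowed together with save-weights")
    return "classifier"


if __name__ == "__main__":
    print(database_type([NUM_CATEGORIES]))
    print(database_type([TAX_TREE, SAVE_WEIGHTS, KMER_CUTOFF]))
```

Checking only option presence keeps the sketch independent of how each option's value is written on the command line, which this guide does not specify for every option.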
