Kmer Classifier
Description
The metagenomics classifier uses a k-mer based algorithm to classify each query sequence (usually a read) against a collection of reference sequences. The process has two logical steps: 1) the reference sequences are indexed into a searchable database, and 2) the database is searched with the query sequences, and each query sequence is classified to the taxid(s) associated with the matching reference sequences. This guide explains how to run query sequences against a pre-existing reference sequence database. As of DRAGEN 4.3, users can also build their own custom reference sequence database.
Command Line Settings
Option | Description |
---|---|
Required Inputs | |
| Enables the Kmer Classifier. (Default=false). |
| Prefix for all output files. |
| Directory for all output files. |
--kmer-classifier-input-read-file | Input sequence file (zipped or unzipped) to the Kmer Classifier. |
--kmer-classifier-db-file | Database of sequences to classify against. |
Optional Inputs | |
| Area for temporary files. Its size must be greater than twice the combined size of all FASTQ files. |
--kmer-classifier-load-db-ram | Load the database into RAM. Do not use if the database is on a RAM disk. (Default=false). |
--kmer-classifier-multiple-inputs | Set to true to run with multiple inputs. The input read file is then a .tsv file with three columns: Sample ID, Read 1 file, and (optional) Read 2 file. (Default=false). |
| Minimum number of consecutive kmers required to classify a read to a taxid. (Default=1). |
--kmer-classifier-output-read-seq | Option to enable the read sequence column in the output file. (Default=false). |
--kmer-classifier-output-taxid-seq | Option to enable a taxid string column in the output file. (Default=false). |
--kmer-classifier-db-to-taxid-json | Path to JSON file that maps database IDs to external taxids, names, and ranks. |
| Option to not create individual read output. (Default=false). |
| Option to not write taxid count output file. (Default=false). |
| Option to indicate protein query sequences. To use this option, the reference sequence database MUST be of protein sequences. (Default=false). |
| Option to set the number of CPUs available for processing. |
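The temporary-area requirement in the table above (more than twice the combined FASTQ size) can be checked before a run. The following is a minimal sketch, not part of DRAGEN: the helper names are hypothetical, and it assumes the requirement refers to the on-disk size of the FASTQ files as stored.

```python
import os
import shutil


def required_temp_bytes(fastq_paths):
    """Return twice the combined on-disk size of the given FASTQ files."""
    return 2 * sum(os.path.getsize(path) for path in fastq_paths)


def has_enough_temp_space(temp_dir, fastq_paths):
    """Return True if free space in temp_dir exceeds the required amount."""
    free_bytes = shutil.disk_usage(temp_dir).free
    return free_bytes > required_temp_bytes(fastq_paths)
```

If the requirement is instead meant for uncompressed data, the size estimate would need to be adjusted accordingly.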
Example Command Line
Input Details
Input Reads
Applies to: --kmer-classifier-input-read-file, --kmer-classifier-multiple-inputs
If the analysis is for a single FASTA/FASTQ read file, pass that filename to --kmer-classifier-input-read-file and set --kmer-classifier-multiple-inputs=false. However, many read files can be submitted to the Kmer Classifier in one run, which avoids reloading a large reference sequence database for each file. In this case, the input file must be a tab-separated .tsv file with two columns and an optional third. The first column is a unique ID, the second column is the path to the read file, and the optional third column is the path to the second read file for paired-end reads. The ID is used to distinguish the output files. There is no header line. Pass this .tsv file to --kmer-classifier-input-read-file and set --kmer-classifier-multiple-inputs=true.
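The multiple-input .tsv file described above can be generated with a short script. This is a minimal sketch under the stated layout (unique ID, read file, optional second read file, no header); the function name and the sample paths are hypothetical.

```python
import csv


def write_manifest(path, samples):
    """Write a headerless, tab-separated input file.

    samples: iterable of (sample_id, read1_path, read2_path_or_None).
    """
    seen_ids = set()
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for sample_id, read1, read2 in samples:
            # The first column must be unique because it names the outputs.
            if sample_id in seen_ids:
                raise ValueError(f"duplicate sample ID: {sample_id}")
            seen_ids.add(sample_id)
            row = [sample_id, read1]
            if read2:
                # Third column only for paired-end samples.
                row.append(read2)
            writer.writerow(row)
```

For example, `write_manifest("inputs.tsv", [("s1", "/data/s1_R1.fastq.gz", "/data/s1_R2.fastq.gz"), ("s2", "/data/s2.fastq.gz", None)])` writes one three-column row and one two-column row.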
Reference Sequences
Applies to: --kmer-classifier-db-file, --kmer-classifier-db-to-taxid-json, --kmer-classifier-load-db-ram
A file of reference sequences (the "database") can be quite large. If the database file is stored on a normal file system, set --kmer-classifier-load-db-ram=true so that the Kmer Classifier loads the database file into memory for faster analysis. The database file can also be stored on a RAM disk, which reduces load time across many Kmer Classifier runs. In that case, set --kmer-classifier-load-db-ram=false.
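The rule above (load into RAM unless the database already sits on a RAM disk) can be captured when assembling the database-related arguments. This is a sketch only: the function name is hypothetical, it covers just the three options named in this section rather than a complete command line, and it assumes path-valued options accept the same `--option=value` form shown for the boolean option.

```python
def classifier_db_args(db_file, db_on_ramdisk, taxid_json=None):
    """Return the database-related arguments as a list of strings."""
    # Load into RAM only when the database is NOT already on a RAM disk.
    load_into_ram = "false" if db_on_ramdisk else "true"
    args = [
        f"--kmer-classifier-db-file={db_file}",
        f"--kmer-classifier-load-db-ram={load_into_ram}",
    ]
    if taxid_json is not None:
        args.append(f"--kmer-classifier-db-to-taxid-json={taxid_json}")
    return args
```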
DB TaxID JSON Mapping File
Applies to: --kmer-classifier-db-to-taxid-json
This input file is downloaded alongside the reference sequence database. It maps each taxid internal to the classifier database to an external source, such as the NCBI taxonomy. The JSON file is a dictionary whose keys are internal taxids and whose values give the external taxid, name, and rank.
The internal taxids are used in the output files. This JSON file can be used to map the results to taxids from the NCBI taxonomy.
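As an illustration of how such a mapping might be used, the sketch below loads a dictionary keyed by internal taxid and looks one up. The exact field names inside each entry are not given in this section, so the keys "taxid", "name", and "rank", the internal ID 17, and the function name are all assumptions made for the example.

```python
import json


def load_taxid_map(json_text):
    """Parse the mapping and return {internal_taxid (int): record (dict)}."""
    raw = json.loads(json_text)
    # JSON object keys are always strings, so convert them to integers.
    return {int(key): value for key, value in raw.items()}


# Made-up entry; the field names are assumed, not taken from a real file.
example = '{"17": {"taxid": 562, "name": "Escherichia coli", "rank": "species"}}'
taxid_map = load_taxid_map(example)
external = taxid_map[17]
```

Check the downloaded JSON file for the actual field names before relying on this layout.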
Downloading Reference Sequence Databases and Mapping Files
Genome Database
The genome database includes NCBI RefSeq genomes for human, bacteria, archaea, viruses, and fungi. The December 3, 2023 NCBI taxonomy was used to build the database, and the sequences were collected in December 2023.
To download the reference index file and the taxid mapping JSON:
Genome and NT Database
This database includes the contents of the Genome database and all of the NCBI nucleotide (nt) database. The sequences from the NCBI nucleotide database were collected in July 2023, and the December 3, 2023 NCBI taxonomy was used to build the database. Two versions of this database are available for download: one that requires a machine with at least 550 GB of RAM, and a compressed version that trades approximately 5-10% accuracy for a smaller RAM footprint and requires a machine with at least 225 GB of RAM.
To download the reference index file and the taxid mapping JSON:
To download the compressed reference index file and the taxid mapping JSON:
UniRef90 Database
This database includes all protein sequences of the UniRef90 database. The sequences were collected in March 2024, and the March 28, 2024 NCBI taxonomy was used to build the database.
To download the reference index file and the taxid mapping JSON:
16S Database
This database includes full-length bacterial 16S sequences from NCBI. The sequences were collected in April 2024, and the March 28, 2024 NCBI taxonomy was used to build the database.
To download the reference index file and the taxid mapping JSON:
Output Details
There are two output files, one organized around the reads, and the other organized around the taxids.
Read-level Output
Applies to: --kmer-classifier-output-taxid-seq, --kmer-classifier-output-read-seq
The main output file is a .tsv file with the extension .read_classifications.tsv. It has no header line, has tab-separated columns, and the number of columns varies with the command line options. It details the results for each read.
Column | Description | Data Type |
---|---|---|
1 | Read index | integer |
2 | Read name | string |
3 | Taxid the read classified to | integer |
4 | Maximum number of contiguous kmers that classified to this taxid | integer |
5 | Score assigned to the classification | integer |
6 | Number of kmers that classified to this taxid | integer |
7 | Read duplication count | integer |
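8 | Name associated with taxid, if given with --kmer-classifier-db-to-taxid-json | string |
9 | Taxonomic rank associated with taxid, if given with --kmer-classifier-db-to-taxid-json | string |
10 | Taxid that each kmer classified to (output when the --kmer-classifier-output-taxid-seq option is enabled) | list of integers separated by commas |
11 | Read sequence (output when the --kmer-classifier-output-read-seq option is enabled) | string |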
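Because the read-level file has no header and a variable number of columns, a parser can treat the first seven columns as fixed and keep the rest as raw strings. The following is a minimal sketch under that assumption; the class and function names are hypothetical, and it assumes the taxid column always holds an integer.

```python
from typing import NamedTuple


class ReadClassification(NamedTuple):
    read_index: int
    read_name: str
    taxid: int
    max_contiguous_kmers: int
    score: int
    kmer_count: int
    duplication_count: int
    extra: tuple  # optional columns 8 and beyond, kept as raw strings


def parse_read_line(line):
    """Parse one tab-separated line of the read-level output."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 7:
        raise ValueError(f"expected at least 7 columns, got {len(fields)}")
    return ReadClassification(
        int(fields[0]),
        fields[1],
        int(fields[2]),
        int(fields[3]),
        int(fields[4]),
        int(fields[5]),
        int(fields[6]),
        tuple(fields[7:]),
    )
```

Which optional columns appear in `extra` depends on the options used for the run, so they are left uninterpreted here.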
TaxID-level Output
The second output file is a .tsv file with the extension .classifier.taxid_kmer_counts.tsv. It has a header line and tab-separated columns. It summarizes the results for each taxid.
Header | Description | Data Type |
---|---|---|
db_taxid | Identifier for this taxid used internally in the database | integer |
duplicity | Ratio of total number of kmers from reads assigned to this taxid compared to the number of distinct kmers from reads assigned to this taxid | float |
distinct_coverage | Percent of kmers in the database assigned to this taxid that are covered by kmers in the reads assigned to this taxid | integer |
read_count | Number of reads that classified to this taxid | integer |
total_kmer_count | Number of kmers that classified to this taxid | integer |
distinct_kmer_count | Number of distinct kmers that classified to this taxid | integer |
cumulative_read_count | Cumulative number of reads assigned to this taxid and its taxonomic descendants | integer |
taxid | Taxid | integer |
name | Name associated with the taxid, if given with --kmer-classifier-db-to-taxid-json | string |
rank | Taxonomic rank of the taxid, if given with --kmer-classifier-db-to-taxid-json | string |
taxid_distinct_kmer_count | Number of distinct kmers assigned to this taxid from the reference sequences | string |
probability_present | Not in use | float |
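Since the taxid-level file has a header line, it can be read by column name. The sketch below loads the rows and recomputes the duplicity ratio from the two kmer count columns. The function names are hypothetical, and it assumes that total_kmer_count and distinct_kmer_count are the numerator and denominator of the reported duplicity, which this section suggests but does not state outright.

```python
import csv


def read_taxid_counts(handle):
    """Return one dict per taxid row, keyed by the header names."""
    return list(csv.DictReader(handle, delimiter="\t"))


def recomputed_duplicity(row):
    """Ratio of total to distinct kmers for one row (0.0 if no distinct kmers)."""
    distinct = int(row["distinct_kmer_count"])
    if distinct == 0:
        return 0.0
    return int(row["total_kmer_count"]) / distinct
```

A typical use is `rows = read_taxid_counts(open(path, newline=""))` followed by filtering on `read_count` or `distinct_coverage`.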