Several pre-built k-mer databases are available for download and use with the k-mer classifier. The databases fall into two categories: (1) taxonomy-based, where each k-mer is assigned to a taxonomic identifier (taxid), i.e., a node in a taxonomic tree; and (2) category-based, where each k-mer is assigned to one or more categories, e.g., "bacterial". The k-mer classifier auto-detects the database type, so it does not need to be specified in the DRAGEN command.
For each database, there are two files to download. The index file contains the k-mer mapping and is specified with the --kmer-classifier-db-file option. The name map JSON file maps internal identifiers to a taxid or category and its name, and is specified with the --kmer-classifier-db-to-taxid-json option.
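As a minimal sketch of how the two files pair with the two options, the Python helper below builds only that fragment of a command line. The function name and the local paths are hypothetical, the file names are borrowed from the UniRef90 table in this section, and every other DRAGEN argument is deliberately left out.

```python
def kmer_db_options(index_file, name_map_json):
    """Return the two database options as an argument list.

    Only the two option names come from the text above; the rest of the
    DRAGEN command line is not shown and must be supplied separately.
    """
    return [
        "--kmer-classifier-db-file", index_file,
        "--kmer-classifier-db-to-taxid-json", name_map_json,
    ]


# Hypothetical download location; adjust to wherever the files were saved.
options = kmer_db_options(
    "/data/kmer_db/dragen-kmer-classifier.u90_all.v6dh.t6db",
    "/data/kmer_db/dragen-kmer-classifier.u90_all.name_map.json",
)
print(" ".join(options))
```

The index file and the name map JSON should come from the same database, since the JSON translates the internal identifiers stored in that index.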
Genome Database
The genome database includes NCBI RefSeq genomes for human, bacteria, archaea, viruses, and fungi. The database was built with the December 3, 2023 NCBI taxonomy, and the sequences were collected in December 2023.
To download the reference index file and the taxid mapping JSON:
This database includes the contents of the Genome database and all of the NCBI nucleotide (nt) database. The sequences from the NCBI nucleotide database were collected in July 2023, and the database was built with the December 3, 2023 NCBI taxonomy. Two versions of this database are available for download: a full version that requires a machine with at least 550 GB of RAM, and a compressed version that trades approximately 5-10% accuracy for a smaller memory footprint and requires a machine with at least 225 GB of RAM.
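The choice between the two versions can be sketched as a simple threshold check. In the Python snippet below, the function name and the labels "full" and "compressed" are illustrative only; the thresholds are the RAM figures stated above, and the caller is assumed to supply the machine's RAM in GB.

```python
from typing import Optional

FULL_MIN_RAM_GB = 550        # stated requirement for the uncompressed version
COMPRESSED_MIN_RAM_GB = 225  # stated requirement for the compressed version


def choose_database_version(ram_gb: float) -> Optional[str]:
    """Pick a database version label for a machine with ram_gb of RAM.

    Returns "full", "compressed", or None when neither version fits.
    The labels are illustrative and are not DRAGEN identifiers.
    """
    if ram_gb >= FULL_MIN_RAM_GB:
        return "full"
    if ram_gb >= COMPRESSED_MIN_RAM_GB:
        return "compressed"
    return None


for ram in (768, 256, 128):
    print(ram, choose_database_version(ram))
```

The full version is preferred whenever it fits, because the compressed version gives up some accuracy in exchange for the lower memory requirement.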
To download the reference index file and the taxid mapping JSON:
This database includes all protein sequences of the UniRef90 database. The sequences were collected in March 2024, and the database was built with the March 28, 2024 NCBI taxonomy.
To download the reference index file and the taxid mapping JSON:
| File | Size | md5sum |
| --- | --- | --- |
| dragen-kmer-classifier.u90_all.v6dh.t6db | 81G | 78bba8b3635241ac9adc35f101df7f46 |
| dragen-kmer-classifier.u90_all.name_map.json | 27M | 8ebe7b070aa85212f8f37a2f8b901cff |
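After downloading, the files can be checked against the md5sum values listed for this database. The following Python sketch is one way to do that with the standard library; the helper names are hypothetical, and the files are assumed to be in the current working directory.

```python
import hashlib
import os

# md5sum values copied from the listing for this database.
EXPECTED_MD5 = {
    "dragen-kmer-classifier.u90_all.v6dh.t6db": "78bba8b3635241ac9adc35f101df7f46",
    "dragen-kmer-classifier.u90_all.name_map.json": "8ebe7b070aa85212f8f37a2f8b901cff",
}


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that a large index file is never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Return True when the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected_md5.lower()


for name, expected in EXPECTED_MD5.items():
    if not os.path.exists(name):
        print(name, "not found, skipping")
        continue
    print(name, "OK" if matches_expected(name, expected) else "MISMATCH")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be downloaded again before use.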
16S Database
This database includes full-length bacterial 16S sequences from NCBI. The sequences were collected in April 2024, and the database was built with the March 28, 2024 NCBI taxonomy.
To download the reference index file and the taxid mapping JSON: