Microbial Binner Database

The human and microbial binning database can be used to classify (or "bin") each read into a category. The categories include human, viral, bacterial, fungal, and parasite. A read can classify equally well to more than one category, or it can remain unclassified.

This database is useful for analyzing the composition of a sample or splitting an input FASTQ file into category-specific FASTQ files.

Download the database

To download the reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.human_microbial_binner.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.human_microbial_binner.name_map.json

File                                                          Size   md5sum
dragen-kmer-classifier.human_microbial_binner.v6dh.t6db       54G    1cdf5176bf03d9fcc3b86fd5a12fe99a
dragen-kmer-classifier.human_microbial_binner.name_map.json   0.3K   eda0826ef3d079b73c52609b8cadd048
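
Because the index file is large, it can be worth confirming that both downloads match the md5sum values listed above before starting a run. The following is a minimal sketch, assuming a Linux host with the coreutils md5sum command and that the files are in the current directory. The verify_md5 helper is a hypothetical name used for illustration and is not part of DRAGEN.

```shell
#!/bin/sh
# verify_md5 FILE EXPECTED_MD5
# Hypothetical helper (not part of DRAGEN): compares a file's MD5 digest
# with the expected value and returns non-zero on a mismatch.
verify_md5() {
  actual=$(md5sum "$1" | awk '{ print $1 }')
  if [ "$actual" = "$2" ]; then
    echo "OK: $1"
  else
    echo "MISMATCH: $1 (expected $2, got $actual)" >&2
    return 1
  fi
}

# Check each downloaded file against the digest listed above.
# Files that are not in the current directory are skipped.
for entry in \
  "dragen-kmer-classifier.human_microbial_binner.v6dh.t6db 1cdf5176bf03d9fcc3b86fd5a12fe99a" \
  "dragen-kmer-classifier.human_microbial_binner.name_map.json eda0826ef3d079b73c52609b8cadd048"
do
  file=${entry% *}       # text before the space: file name
  expected=${entry#* }   # text after the space: expected digest
  if [ -f "$file" ]; then
    verify_md5 "$file" "$expected"
  else
    echo "SKIP: $file not found in $(pwd)"
  fi
done
```

A MISMATCH line usually indicates an incomplete or corrupted download, in which case the file should be downloaded again.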

Run the K-mer Classifier

Point to the downloaded resource files with --kmer-classifier-db-file and --kmer-classifier-db-to-taxid-json in the DRAGEN command. For example:

dragen \
  --enable-kmer-classifier true \
  --output-file-prefix <PREFIX> \
  --output-directory <OUTPUT_DIR> \
  --kmer-classifier-input-read-file /path/to/fastq.gz \
  --kmer-classifier-db-file /path/to/database/dragen-kmer-classifier.human_microbial_binner.v6dh.t6db \
  --kmer-classifier-db-to-taxid-json /path/to/database/dragen-kmer-classifier.human_microbial_binner.name_map.json \
  --kmer-classifier-min-window 1 \
  --kmer-classifier-ncpus 20 \
  --kmer-classifier-split-fastq true \
  --kmer-classifier-remove-dups false

If you want to generate category-specific FASTQ files, ensure that --kmer-classifier-split-fastq is set to true. If you only wish to analyze the sample composition, and do not wish to generate category-specific FASTQ files, setting it to false may reduce the run time.

When generating category-specific FASTQs, it is important that --kmer-classifier-remove-dups is set to false (its default value) so that all reads from the input file are represented in the output files.

There is no specific argument to enable the sample composition analysis. The category-based database will be automatically detected and a category summary output will be generated.

Output Files

Category Summary Output

The high-level results of the composition analysis can be found in an output file with the extension .categories.tsv. It contains the number and percent of reads that classified to each category in the database. If a read classifies equally well to multiple categories, it is counted towards "Ambiguous". If it does not classify to any category, it is counted towards "Unclassified".
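
For a quick look at the summary from the command line, the tab-separated file can be printed as fixed-width columns. This is a small sketch that assumes only that the file is tab-separated and ends in .categories.tsv, as described above. It makes no assumption about column names or order, and show_category_summary is a hypothetical helper name.

```shell
#!/bin/sh
# show_category_summary DIR
# Hypothetical helper: prints every *.categories.tsv file found in DIR
# as fixed-width columns (16 characters per field).
show_category_summary() {
  for tsv in "$1"/*.categories.tsv; do
    # If the glob matched nothing, the literal pattern is left in $tsv.
    [ -e "$tsv" ] || { echo "No .categories.tsv file in $1" >&2; return 1; }
    echo "== $(basename "$tsv")"
    awk -F '\t' '{ for (i = 1; i <= NF; i++) printf "%-16s", $i; print "" }' "$tsv"
  done
}
```

Call it with the directory passed to --output-directory, for example: show_category_summary <OUTPUT_DIR>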

Here is an example of the output file:

FASTQs

If --kmer-classifier-split-fastq is set to true, category-specific FASTQ files will be generated. Each file is named according to its category and is only created if at least one read is classified to that category. The possible FASTQ files are:

*.ambiguous.fastq.gz
*.bacterial.fastq.gz
*.fungal.fastq.gz
*.human.fastq.gz
*.parasite.fastq.gz
*.unclassified.fastq.gz
*.viral.fastq.gz

If a read classifies equally well to multiple categories, it will be written to the "ambiguous" FASTQ. If it does not classify to any category, it will be written to the "unclassified" FASTQ.
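
As a sanity check after splitting, the reads in each category-specific FASTQ can be counted and summed. The sketch below assumes standard four-line FASTQ records and the file-name patterns listed above. The helper names count_fastq_reads and summarize_split_fastqs are hypothetical and not part of DRAGEN.

```shell
#!/bin/sh
# count_fastq_reads FILE
# Hypothetical helper: prints the number of reads in a gzip-compressed FASTQ,
# assuming each record spans exactly four lines.
count_fastq_reads() {
  lines=$(gzip -dc "$1" | wc -l)
  echo $(( $lines / 4 ))
}

# summarize_split_fastqs DIR
# Prints "<file name><TAB><read count>" for every category FASTQ found in DIR,
# followed by the total across all categories.
summarize_split_fastqs() {
  total=0
  for category in ambiguous bacterial fungal human parasite unclassified viral; do
    for fastq in "$1"/*."$category".fastq.gz; do
      # A missing file means no read was classified to that category.
      [ -e "$fastq" ] || continue
      n=$(count_fastq_reads "$fastq")
      printf '%s\t%s\n' "$(basename "$fastq")" "$n"
      total=$(( $total + $n ))
    done
  done
  printf 'total\t%s\n' "$total"
}
```

Run it as summarize_split_fastqs <OUTPUT_DIR>. With --kmer-classifier-remove-dups set to false, and assuming each read is written to exactly one output file, the total would be expected to match the number of reads in the input FASTQ.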
