# Microbial Binner Database

The human and microbial binning database can be used to classify (or "bin") each read into a category. The categories include human, viral, bacterial, fungal, and parasite. A read can classify equally well to more than one category, or it can remain unclassified.

This database is useful for analyzing the composition of a sample or splitting an input FASTQ file into category-specific FASTQ files.

## Download the database

To download the reference index file and the taxid mapping JSON:

```
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.human_microbial_binner.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.human_microbial_binner.name_map.json
```

| File                                                           | Size | md5sum                           |
| -------------------------------------------------------------- | ---- | -------------------------------- |
| dragen-kmer-classifier.human\_microbial\_binner.v6dh.t6db      | 54G  | 1cdf5176bf03d9fcc3b86fd5a12fe99a |
| dragen-kmer-classifier.human\_microbial\_binner.name\_map.json | 0.3K | eda0826ef3d079b73c52609b8cadd048 |
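
Because the index file is large, it can be worth confirming that both downloads match the md5sum values in the table before starting a run. The following is a minimal sketch, assuming GNU `md5sum` and `awk` are available; `verify_md5` is a hypothetical helper name and is not part of DRAGEN.

```shell
# verify_md5 EXPECTED FILE
# Hypothetical helper: succeeds only if the md5sum of FILE equals EXPECTED.
verify_md5() {
  expected="$1"
  file="$2"
  # Keep only the hash column of the md5sum output (empty if FILE is missing).
  actual=$(md5sum "$file" 2>/dev/null | awk '{print $1}')
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (expected $expected, got ${actual:-nothing})" >&2
    return 1
  fi
}

# Example calls, using the checksums from the table above:
# verify_md5 1cdf5176bf03d9fcc3b86fd5a12fe99a dragen-kmer-classifier.human_microbial_binner.v6dh.t6db
# verify_md5 eda0826ef3d079b73c52609b8cadd048 dragen-kmer-classifier.human_microbial_binner.name_map.json
```

A mismatch usually points to an interrupted or corrupted download, in which case the file should be downloaded again.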

## Run the K-mer Classifier

Point to the downloaded resource files with `--kmer-classifier-db-file` and `--kmer-classifier-db-to-taxid-json` in the DRAGEN command. For example:

```
dragen \
  --enable-kmer-classifier true \
  --output-file-prefix <PREFIX> \
  --output-directory <OUTPUT_DIR> \
  --kmer-classifier-input-read-file /path/to/fastq.gz \
  --kmer-classifier-db-file /path/to/database/dragen-kmer-classifier.human_microbial_binner.v6dh.t6db \
  --kmer-classifier-db-to-taxid-json /path/to/database/dragen-kmer-classifier.human_microbial_binner.name_map.json \
  --kmer-classifier-min-window 1 \
  --kmer-classifier-ncpus 20 \
  --kmer-classifier-split-fastq true \
  --kmer-classifier-remove-dups false
```

To generate category-specific FASTQ files, set `--kmer-classifier-split-fastq` to `true`. If you only need the sample composition and not the category-specific FASTQ files, setting it to `false` may reduce the run time.

When generating category-specific FASTQs, it is important that `--kmer-classifier-remove-dups` is set to `false` (its default value) so that all reads from the input file are represented in the output files.

There is no specific argument to enable the sample composition analysis. The category-based database will be automatically detected and a category summary output will be generated.

## Output Files

### Category Summary Output

The high-level results of the composition analysis are written to an output file with the extension `.categories.tsv`. It contains the number and percentage of reads classified to each category in the database. If a read classifies equally well to multiple categories, it is counted as "Ambiguous". If it does not classify to any category, it is counted as "Unclassified".

Here is an example of the output file:

|   category   |  reads | percent |
| :----------: | :----: | :-----: |
|   Ambiguous  |  1817  |   0.18  |
|   Bacterial  |  13036 |   1.3   |
|    Fungal    |  1454  |   0.15  |
|     Human    | 983385 |  98.34  |
|   Parasite   |   297  |   0.03  |
| Unclassified |    3   |    0    |
|     Viral    |    8   |    0    |
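
In the example, the `percent` column is rounded, so small categories such as Viral (8 reads) appear as 0. The sketch below recomputes the percentages from the `reads` column with more decimal places. It assumes the file is tab-delimited with a single header row and the column order shown above; `category_percentages` is a hypothetical helper name.

```shell
# category_percentages FILE
# Hypothetical helper: prints category, reads, and percent (4 decimal places)
# for each row of a .categories.tsv file, recomputed from the reads column.
category_percentages() {
  awk -F'\t' '
    NR == 1 { next }                        # skip the header row (assumed present)
    { name[++n] = $1; reads[n] = $2; total += $2 }
    END {
      if (total == 0) exit 1                # nothing to report
      for (i = 1; i <= n; i++)
        printf "%s\t%d\t%.4f\n", name[i], reads[i], 100 * reads[i] / total
    }
  ' "$1"
}
```

For example, `category_percentages <PREFIX>.categories.tsv` would print one line per category, in the same order as the input file.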

### FASTQs

If `--kmer-classifier-split-fastq` is set to `true`, category-specific FASTQ files will be generated. They are named according to the category and are only created if at least one read is classified to that category. The possible FASTQ files are:

* \*.ambiguous.fastq.gz
* \*.bacterial.fastq.gz
* \*.fungal.fastq.gz
* \*.human.fastq.gz
* \*.parasite.fastq.gz
* \*.unclassified.fastq.gz
* \*.viral.fastq.gz

If a read classifies equally well to multiple categories, it will be written to the "ambiguous" FASTQ. If it does not classify to any category, it will be written to the "unclassified" FASTQ.
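
One way to sanity-check a split is to compare the number of reads in the input FASTQ with the total across the category-specific FASTQs. This is a minimal sketch, assuming standard four-line FASTQ records, gzip-compressed files, and that each input read is written to exactly one category file. `count_fastq_reads` and `sum_fastq_reads` are hypothetical helper names.

```shell
# count_fastq_reads FILE
# Hypothetical helper: prints the number of reads in a gzipped FASTQ,
# assuming four lines per record.
count_fastq_reads() {
  lines=$(gzip -cd "$1" | wc -l)
  echo $((lines / 4))
}

# sum_fastq_reads FILE...
# Hypothetical helper: prints the total number of reads across the given files,
# skipping any argument that is not an existing file.
sum_fastq_reads() {
  total=0
  for f in "$@"; do
    [ -f "$f" ] || continue
    n=$(count_fastq_reads "$f")
    total=$((total + n))
  done
  echo "$total"
}
```

With `--kmer-classifier-remove-dups false`, the value printed by `count_fastq_reads /path/to/fastq.gz` would be expected to equal the value printed by `sum_fastq_reads` when given all of the category-specific FASTQ files from the run.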
