The human and microbial binning database can be used to classify (or "bin") each read to a category. The categories include human, viral, bacterial, fungal, and parasite. It is possible for a read to classify equally well to more than one category, or for a read to remain unclassified.
This database is useful for analyzing the composition of a sample or splitting an input FASTQ file into category-specific FASTQ files.
Download the database
To download the reference index file and the taxid mapping JSON:
If you want to generate category-specific FASTQ files, ensure that --kmer-classifier-split-fastq is set to true. If you only want to analyze the sample composition, setting it to false may reduce the run time.
When generating category-specific FASTQs, it is important that --kmer-classifier-remove-dups is set to false (its default value) so that all reads from the input file are represented in the output files.
There is no specific argument to enable the sample composition analysis. The category-based database will be automatically detected and a category summary output will be generated.
Output Files
Category Summary Output
The high-level results of the composition analysis are written to an output file with the extension .categories.tsv. It contains the number and percentage of reads classified to each category in the database. If a read classifies equally well to multiple categories, it is counted towards "Ambiguous". If it does not classify to any category, it is counted towards "Unclassified".
Here is an example of the output file:
category       reads     percent
Ambiguous      1817      0.18
Bacterial      13036     1.3
Fungal         1454      0.15
Human          983385    98.34
Parasite       297       0.03
Unclassified   3         0
Viral          8         0
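The summary can also be read programmatically. The following is a minimal sketch, not part of the product: it assumes the .categories.tsv file is tab-separated with the header row category, reads, percent, as the example above suggests. The function name read_category_summary is hypothetical. It parses the example values and recomputes each percentage from the read counts as a sanity check.

```python
import csv
import io


def read_category_summary(handle):
    """Parse a category summary into {category: (reads, percent)}.

    Assumption: tab-separated text with a header row
    'category', 'reads', 'percent'.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["category"]: (int(row["reads"]), float(row["percent"]))
        for row in reader
    }


# The example values from this section, written as tab-separated text.
EXAMPLE = (
    "category\treads\tpercent\n"
    "Ambiguous\t1817\t0.18\n"
    "Bacterial\t13036\t1.3\n"
    "Fungal\t1454\t0.15\n"
    "Human\t983385\t98.34\n"
    "Parasite\t297\t0.03\n"
    "Unclassified\t3\t0\n"
    "Viral\t8\t0\n"
)

summary = read_category_summary(io.StringIO(EXAMPLE))
total = sum(reads for reads, _ in summary.values())

# Compare the reported percentage with one recomputed from the counts.
for category, (reads, percent) in summary.items():
    recomputed = 100 * reads / total
    print(f"{category}\t{reads}\treported={percent}\trecomputed={recomputed:.4f}")
```

To read a real output file, pass an open file handle instead of the in-memory example, for instance open(path, newline="") where path points to the .categories.tsv file.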
FASTQs
If --kmer-classifier-split-fastq is set to true, category-specific FASTQ files will be generated. They are named according to the category, and a file is only created if at least one read is classified to that category. The possible FASTQ files are:
*.ambiguous.fastq.gz
*.bacterial.fastq.gz
*.fungal.fastq.gz
*.human.fastq.gz
*.parasite.fastq.gz
*.unclassified.fastq.gz
*.viral.fastq.gz
If a read classifies equally well to multiple categories, it will be written to the "ambiguous" FASTQ. If it does not classify to any category, it will be written to the "unclassified" FASTQ.
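One way to confirm that every input read is represented in the split output is to count the reads in each category-specific file. The sketch below is illustrative only: the function names and the output directory argument are hypothetical, and it assumes standard four-line FASTQ records and the file name suffixes listed above. Categories with no file are simply absent from the result, matching the rule that a file is only created when at least one read classifies to that category.

```python
import gzip
from pathlib import Path

# Category suffixes as listed in this section.
CATEGORIES = (
    "ambiguous",
    "bacterial",
    "fungal",
    "human",
    "parasite",
    "unclassified",
    "viral",
)


def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    return line_count // 4


def count_split_reads(output_dir):
    """Return {category: read count} for the split FASTQs found in output_dir.

    Files are matched with the pattern '*.<category>.fastq.gz'. If several
    files match one category, their counts are added together.
    """
    counts = {}
    for category in CATEGORIES:
        matches = sorted(Path(output_dir).glob(f"*.{category}.fastq.gz"))
        if matches:
            counts[category] = sum(count_fastq_reads(p) for p in matches)
    return counts
```

Summing the returned counts and comparing the total with the number of reads in the input FASTQ gives a quick completeness check, provided --kmer-classifier-remove-dups was left at false.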