Explify Analysis Pipeline

Description

The Explify Analysis Pipeline offers a dedicated informatics solution with flexible analysis options for the following Illumina Infectious Disease and Microbiology target-capture enrichment panel kits: the Illumina Respiratory Pathogen ID/AMR Enrichment Panel Kit (RPIP), Illumina Urinary Pathogen ID/AMR Enrichment Panel Kit (UPIP), and Illumina Viral Surveillance Panel V2 Kit (VSP V2). The application delivers easy-to-use, powerful secondary analysis of Illumina sequencing data, with workflows for sample QC, viral WGS (whole-genome sequencing), pathogen detection and quantification, and antimicrobial resistance (AMR) marker profiling. It also supports custom reference sequence analysis.

RPIP: Target-capture enrichment of >280 RNA and DNA respiratory pathogens, including SARS-CoV-2, Influenza viruses, Respiratory syncytial virus, Mycobacterium and Legionella species, and >4000 AMR markers.
UPIP: Target-capture enrichment of >170 genitourinary pathogens, including fastidious, slow-growing, and anaerobic uropathogens, sexually transmitted microorganisms, and >4000 bacterial AMR markers.
VSP V2: Target-capture enrichment for whole-genome sequencing (WGS) of 200 RNA and DNA viruses prioritized as high-risk to public health, zoonotic surveillance, and biotech, and >200 viral AMR markers.
Custom: Analyze FASTQ/FASTA read files with a custom reference sequence database.

Note that samples enriched using the Illumina Respiratory Virus Oligo Panel/Respiratory Virus Enrichment Kit (RVOP/RVEK) and Viral Surveillance Panel Kit (VSP) can also be analyzed using the Explify Analysis Pipeline and VSPv2 database.

Command Line Settings

Required Inputs

--enable-explify: Enables the Explify Analysis Pipeline. (Default=false)
--output-file-prefix: Prefix for all output files.
--output-directory: Directory for all output files.
--explify-sample-list: Input sample list .tsv file with sample IDs, FASTQs, etc.
--explify-test-panel-name: Test panel name. One of "RPIP", "UPIP", "VSPv2", "Custom".
--explify-test-panel-version: Test panel version (e.g. "1.0.0").
--explify-ref-db-dir: Path to root directory for Explify Database files.

Optional Inputs

--intermediate-results-dir: Area for temporary files. Its size must be greater than 3 times the combined size of all FASTQ files.
--explify-load-db-ram: Option to load database into RAM if not on ramdisk. (Default=false)
--explify-no-read-qc: Option to turn off read QC on FASTQs before analysis. (Default=false)
--explify-internal-control: Option to set internal control from an accepted list. (Default="Enterobacteria phage T7")
--explify-internal-control-concentration: Option to set internal control concentration. (Default=12100000)
--explify-ncpus: Option to set the number of CPUs available for processing.
--explify-sensitivity-threshold: Option to set sensitivity threshold. Range: 0 < Integer < 1000. Only valid for VSPv2. (Default=5)
--explify-custom-ref-fasta: Reference FASTA file. Required for Custom reference DBs.
--explify-custom-ref-bed: Reference BED file. Optional for Custom reference DBs.
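
The size requirement for --intermediate-results-dir can be checked before a run. The following is a minimal Python sketch, not part of the pipeline: the function names are hypothetical, and the demonstration uses small placeholder files rather than real FASTQs.

```python
import os
import shutil
import tempfile


def required_intermediate_bytes(fastq_paths):
    """Return 3x the combined size of the FASTQ files (the documented minimum)."""
    return 3 * sum(os.path.getsize(path) for path in fastq_paths)


def has_enough_space(intermediate_dir, fastq_paths):
    """True if free space in intermediate_dir exceeds the required size."""
    free = shutil.disk_usage(intermediate_dir).free
    return free > required_intermediate_bytes(fastq_paths)


# Demonstration with two placeholder files (100 and 250 bytes), not real FASTQs.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, size in (("a.fastq.gz", 100), ("b.fastq.gz", 250)):
        path = os.path.join(tmp, name)
        with open(path, "wb") as handle:
            handle.write(b"\0" * size)
        paths.append(path)
    needed = required_intermediate_bytes(paths)
    enough = has_enough_space(tmp, paths)

print(needed, enough)
```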

Example Command Line

dragen \
  --enable-explify=true \
  --output-file-prefix <PREFIX> \
  --explify-sample-list /path/to/sample/list/tsv \
  --explify-test-panel-name <"RPIP"/"UPIP"/"VSPv2"/"Custom"> \
  --explify-test-panel-version <VERSION> \
  --explify-ref-db-dir /path/to/root/db/dir \
  --explify-load-db-ram=true \
  --output-directory <OUTPUT_DIR> \
  --intermediate-results-dir <OUTPUT_DIR> \
  --explify-ncpus=1
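
When runs are launched from a script, the same command can be assembled as an argument list. The sketch below is illustrative only: the helper name build_explify_command and the example values (such as the prefix "run1") are hypothetical, and only options listed in the table above are used. It builds the command and prints it without executing DRAGEN.

```python
import shlex

PANELS = ("RPIP", "UPIP", "VSPv2", "Custom")


def build_explify_command(prefix, output_dir, sample_list, panel, version,
                          db_dir, intermediate_dir=None, load_db_ram=False,
                          ncpus=None):
    """Return the dragen argument list; nothing is executed here."""
    if panel not in PANELS:
        raise ValueError(f"unknown test panel name: {panel!r}")
    cmd = [
        "dragen",
        "--enable-explify=true",
        "--output-file-prefix", prefix,
        "--output-directory", output_dir,
        "--explify-sample-list", sample_list,
        "--explify-test-panel-name", panel,
        "--explify-test-panel-version", version,
        "--explify-ref-db-dir", db_dir,
    ]
    if intermediate_dir is not None:
        cmd += ["--intermediate-results-dir", intermediate_dir]
    if load_db_ram:
        cmd.append("--explify-load-db-ram=true")
    if ncpus is not None:
        cmd.append(f"--explify-ncpus={int(ncpus)}")
    return cmd


cmd = build_explify_command(
    prefix="run1",  # hypothetical prefix
    output_dir="/path/to/output",
    sample_list="/path/to/sample/list/tsv",
    panel="RPIP",
    version="6.5.1",
    db_dir="/path/to/root/db/dir",
    load_db_ram=True,
    ncpus=1,
)
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a system where dragen is installed.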

Input Details

Sample Input List

Applies to: --explify-sample-list

The sample input list is a column-formatted file with tab separations between the columns (i.e., a .tsv file).

SampleID     BatchID     RunID     ControlFlag     FastQs
MySample     MyBatch     MyRun     POS             /path/to/fastq1.gz     /path/to/fastq2.gz

Notes:

The SampleID values must be unique.
BatchID and RunID help users track and manage sample analyses. Often the BatchID is used to track libraries that were prepared together, and the RunID is used to track sequencing runs. Both can be left blank.
The ControlFlag value can be POS, NEG, BLANK, or left empty.
- POS is used to indicate a positive control sample.
- NEG is used to indicate a negative control sample.
- BLANK is used to indicate a blank control sample (e.g. buffer only).
If there are multiple FASTQ files, they are tab delimited.
Take care when editing .tsv files: some editors replace tabs with spaces without alerting the user.
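
The rules in these notes can be checked automatically before a run. The following Python sketch is one possible pre-flight check, not part of the pipeline. It assumes the first line of the file is the header row shown above; the function name validate_sample_list and the sample rows are hypothetical.

```python
import csv
import io

ALLOWED_FLAGS = {"POS", "NEG", "BLANK", ""}


def validate_sample_list(text):
    """Return a list of problems found in sample-list text (empty if none)."""
    problems = []
    seen_ids = set()
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    next(rows, None)  # assumption: the first line is the header row
    for line_no, row in enumerate(rows, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) < 5:
            problems.append(
                f"line {line_no}: expected at least 5 tab-separated columns, "
                f"found {len(row)} (tabs replaced by spaces?)")
            continue
        sample_id, _batch_id, _run_id, flag = row[:4]
        fastqs = [path for path in row[4:] if path]
        if sample_id in seen_ids:
            problems.append(f"line {line_no}: duplicate SampleID {sample_id!r}")
        seen_ids.add(sample_id)
        if flag not in ALLOWED_FLAGS:
            problems.append(f"line {line_no}: invalid ControlFlag {flag!r}")
        if not fastqs:
            problems.append(f"line {line_no}: no FASTQ path given")
    return problems


HEADER = "SampleID\tBatchID\tRunID\tControlFlag\tFastQs\n"
good = HEADER + "MySample\tMyBatch\tMyRun\tPOS\t/path/to/fastq1.gz\t/path/to/fastq2.gz\n"
bad = (HEADER
       + "S1 B1 R1 POS /path/to/a.gz\n"      # spaces instead of tabs
       + "S2\t\t\tMAYBE\t/path/to/b.gz\n"    # unknown ControlFlag
       + "S2\t\t\tNEG\t/path/to/c.gz\n")     # duplicate SampleID

print(validate_sample_list(good))
for problem in validate_sample_list(bad):
    print(problem)
```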

Internal Control

Applies to: --explify-internal-control, --explify-internal-control-concentration

The user may specify one of the internal controls listed below. If NONE is specified, the internal control concentration is ignored. The names are case-sensitive and must be entered exactly as they appear:

Allobacillus halotolerans
Armored RNA Quant Internal Process Control
Enterobacteria phage T7 (This is the default)
Escherichia virus MS2
Escherichia virus Qbeta
Escherichia virus T4
Imtechella halotolerans
Phocid alphaherpesvirus 1
Phocine morbillivirus
Truepera radiovictrix
NONE

The internal control concentration is an integer representing the number of copies/mL of sample for the internal control.
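
Because the names are case-sensitive, a wrapper script may want to reject a misspelled internal control before starting a run. A minimal sketch follows; the helper name internal_control_args is hypothetical, and leaving out the concentration option for NONE is a choice made here because the text above says the value is ignored in that case.

```python
ACCEPTED_INTERNAL_CONTROLS = (
    "Allobacillus halotolerans",
    "Armored RNA Quant Internal Process Control",
    "Enterobacteria phage T7",
    "Escherichia virus MS2",
    "Escherichia virus Qbeta",
    "Escherichia virus T4",
    "Imtechella halotolerans",
    "Phocid alphaherpesvirus 1",
    "Phocine morbillivirus",
    "Truepera radiovictrix",
    "NONE",
)


def internal_control_args(name="Enterobacteria phage T7", concentration=12100000):
    """Return command-line arguments for the internal control options."""
    if name not in ACCEPTED_INTERNAL_CONTROLS:
        raise ValueError(f"not an accepted internal control (case-sensitive): {name!r}")
    args = ["--explify-internal-control", name]
    if name != "NONE":
        if not isinstance(concentration, int):
            raise ValueError("concentration must be an integer (copies/mL of sample)")
        args += ["--explify-internal-control-concentration", str(concentration)]
    return args


print(internal_control_args())
print(internal_control_args("NONE"))
```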

Reference Databases

Applies to: --explify-ref-db-dir, --explify-test-panel-name, --explify-test-panel-version, --explify-load-db-ram, --explify-custom-ref-fasta, --explify-custom-ref-bed

An Explify Reference Database is required to run the Explify Analysis Pipeline in DRAGEN. The databases are stored remotely and must be downloaded prior to running an analysis. The database download script provided to facilitate the download is described below.

Directory Setup

Prior to downloading the databases, create a directory dedicated to storing them. It is recommended that the directory be on a disk with at least 150 GB of free space. The path to this directory is passed as the -d parameter when the download script is run in subsequent steps. The examples below use "explify-databases/".

Obtaining the Download Script

Download and management of Explify reference databases is handled by a shell script. The script can be downloaded with the following command:

wget -O explify-dbs.sh https://illumina-explify-databases.s3.us-east-1.amazonaws.com/explify-dbs.sh
chmod +x explify-dbs.sh

Seeing What Databases are Available for Download

The search subcommand can be used to list what databases can be downloaded:

$ ./explify-dbs.sh search -d explify-databases/
4 database(s) found meeting those criteria:
- Custom-1.0.0
- RPIP-6.5.1
- UPIP-8.6.0
- VSPv2-2.7.0

The -d argument is the base directory used for storage of the databases
Optionally, when a test panel name is specified with the -p argument, the results will be limited to that panel
Optionally, setting the -n argument will filter the search to databases that have not already been downloaded

Downloading a Database

The download subcommand is used to download the database files for a test panel:

./explify-dbs.sh download -d explify-databases/ -p UPIP -v 8.6.0 -n 20

The -d argument is the base directory used for storage of the databases
The -p argument is the test panel name
The -v argument is the test panel version
The -n argument is the number of CPUs that can be used to download the files (defaults to 1)

Additional notes:

In this example, after the UPIP-8.6.0 files are downloaded, additional required files will be downloaded to a subdirectory named "common"
After the files are downloaded, their checksums will be automatically checked
Due to the size of some of the files, this command will take some time. It is best to run it via screen or nohup

Listing Downloaded Databases

The list subcommand is used to view the databases that have already been downloaded:

$ ./explify-dbs.sh list -d explify-databases/

The -d argument is the base directory used for storage of the databases
Optionally, when a test panel name is specified with the -p argument, the results will be limited to that panel

Checking Database Integrity

The download subcommand will automatically check the file checksums after download. The check subcommand can also be used on its own to check the files:

$ ./explify-dbs.sh check -d explify-databases/ -p UPIP -v 8.6.0 -n 20

The -d argument is the base directory used for storage of the databases
The -p argument is the test panel name
The -v argument is the test panel version
The -n argument is the number of CPUs that can be used to check the files (defaults to 1)

Using the Databases with the Explify Analysis Pipeline

Assume the Explify database distributable, when unpacked, has a root directory name of /explify-databases. The database files will be organized in this root directory first by test panel type, then by test panel version:

explify-databases/
    Custom/
        1.0.0/
    RPIP/
        6.5.1/
    UPIP/
        8.6.0/
    VSPv2/
        2.7.0/

To run an analysis with RPIP 6.5.1, for example, the following inputs would be needed:

--explify-ref-db-dir /explify-databases
--explify-test-panel-name RPIP
--explify-test-panel-version 6.5.1

The Explify Analysis Pipeline will use these inputs to navigate to the specified database location, namely /explify-databases/RPIP/6.5.1.
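
The lookup described above amounts to joining the three inputs into one path. The sketch below mirrors that layout so a script can confirm a database is present before launching a run. The function names are hypothetical, and this is not the pipeline's own code.

```python
from pathlib import Path


def database_dir(ref_db_dir, panel_name, panel_version):
    """Join root directory, test panel name, and version, as in the layout above."""
    return Path(ref_db_dir) / panel_name / panel_version


def database_is_present(ref_db_dir, panel_name, panel_version):
    """True if the expected database directory exists on disk."""
    return database_dir(ref_db_dir, panel_name, panel_version).is_dir()


db = database_dir("/explify-databases", "RPIP", "6.5.1")
print(db.as_posix())
print(database_is_present("/explify-databases", "RPIP", "6.5.1"))
```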

If the databases are stored on a normal file system, it is recommended that you set --explify-load-db-ram=true, which tells the Explify Analysis Pipeline to load the databases into memory for faster analysis. The databases can also be stored on a RAM disk, which reduces load time over many Explify Analysis Pipeline runs. In this case, it is recommended to set --explify-load-db-ram=false.

Using the Custom Database Option

To use a Custom database, references are supplied through a FASTA file via --explify-custom-ref-fasta and an optional BED file via --explify-custom-ref-bed. Note that you must have downloaded the Custom database and set --explify-test-panel-name to "Custom", and set --explify-test-panel-version to the version you have downloaded. The supplied Custom Explify Reference Database is used by the Explify Analysis Pipeline to filter out host reads.

In the FASTA file, sequence names must be unique and should not contain any spaces. If a FASTA header contains a space, the part before the first space is taken as the sequence name. It is recommended to use only the following characters in sequence names: letters, numbers, underscore (_), hyphen (-), parentheses (( and )), and period (.). Otherwise, the sequence names may appear different in the output.

The BED file must be tab-delimited with at least 4 columns:

chrom: the sequence name as it appears in the FASTA
chromStart: start position (always set to 0)
chromEnd: end position (sequence length)
genomeName: name of the genome, target, or microorganism the sequence belongs to (e.g. Monkeypox virus clade II)
segmentName (optional): the name of the segment or gene (e.g. Segment 4 (HA)). Set to 'Full' if the sequence is the full genome

Sequence names must match between the FASTA file and BED file, and the same set of sequences must appear in both files. If there are multiple viruses, their names should be unique. For example, if there are multiple Influenza genomes, they should not be labeled with the same virus name in the 4th column.

The BED file controls how sequences are labeled in the output JSON. If the custom reference FASTA file includes sequences from multiple segments, it is recommended to provide a BED file so that the segments are included under the results of that microorganism.
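
Since chromStart is always 0 and chromEnd is the sequence length, a BED file can be derived from the FASTA plus a mapping of sequence names to genome and segment names. The sketch below shows one way to do this. The function names, sequence names, and genome names are hypothetical placeholders, and the FASTA parsing is deliberately simple.

```python
def read_fasta_lengths(lines):
    """Map each sequence name (header text before the first space) to its length."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("FASTA header without a sequence name")
            name = parts[0]
            if name in lengths:
                raise ValueError(f"duplicate sequence name: {name}")
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def bed_rows(lengths, genome_names, segment_names=None):
    """Build tab-delimited BED lines: chrom, chromStart, chromEnd, genomeName[, segmentName]."""
    rows = []
    for chrom, length in lengths.items():
        # A KeyError here means a FASTA sequence has no genome name assigned.
        row = [chrom, "0", str(length), genome_names[chrom]]
        if segment_names is not None:
            row.append(segment_names.get(chrom, "Full"))
        rows.append("\t".join(row))
    return rows


# Placeholder FASTA content and names, for illustration only.
fasta = [">seg4_HA example segment 4", "ACGTACGT", "ACGT", ">seg6_NA", "ACGTAC"]
lengths = read_fasta_lengths(fasta)
rows = bed_rows(
    lengths,
    genome_names={"seg4_HA": "Example influenza genome 1",
                  "seg6_NA": "Example influenza genome 1"},
    segment_names={"seg4_HA": "Segment 4 (HA)", "seg6_NA": "Segment 6 (NA)"},
)
print("\n".join(rows))
```

Generating the BED from the FASTA keeps the sequence names identical in both files, which is the matching requirement stated above.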

Output Details

The output of the Explify Analysis Pipeline is a single ap.json file written to the specified output directory containing general metadata, version information, sample QC, microorganism, and AMR marker results, as well as detailed test information.

ap.json format

Top-Level Node

The top-level section of the output JSON contains general metadata and version information.

Field

Description

.accession

Identifier used for the sample

.deploymentEnvironment

Environment in which the results were produced

.batchId

Identifier used for the batch of samples processed together

.analysisId

Identifier used for the analysis

.runId

Identifier used for the sequencing run

.controlFlag

Indicates whether the sample is a control. It is based on the ControlFlag field in the sample .tsv and can be set to "POS", "NEG", "BLANK", or "-"

.dragenVersion

DRAGEN release version

.analysisPipelineVersion

Analysis Pipeline release version

.testType

Type of test panel ("RPIP", "UPIP", "VSPv2", "Custom")

.testVersion

Test panel release version

.testName

Full name of test panel

.testUse

Test use. "For Research Use Only. Not for use in diagnostic procedures"

.reportTime

Date and time the report was generated

.warnings

List of warnings encountered during the analysis

.errors

List of errors encountered during the analysis
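
The top-level fields can be read with any JSON parser. The sketch below summarizes a few of them. The embedded document is a made-up fragment for illustration, not real pipeline output, and the function name summarize_metadata is hypothetical. In practice the report would be loaded from the ap.json file in the output directory.

```python
import json

# Made-up fragment using the field names from the table above.
EXAMPLE = """
{
  "accession": "MySample",
  "batchId": "MyBatch",
  "runId": "MyRun",
  "controlFlag": "POS",
  "testType": "RPIP",
  "testVersion": "6.5.1",
  "warnings": [],
  "errors": []
}
"""


def summarize_metadata(report):
    """Collect a few top-level fields into a flat summary."""
    return {
        "sample": report.get("accession"),
        "panel": f'{report.get("testType")} {report.get("testVersion")}',
        "is_control": report.get("controlFlag") in {"POS", "NEG", "BLANK"},
        "warning_count": len(report.get("warnings") or []),
        "error_count": len(report.get("errors") or []),
    }


report = json.loads(EXAMPLE)
summary = summarize_metadata(report)
print(summary)
```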

.qcReport.sampleQc Node

This section contains information about sample quality control (QC). The fields are relative to .qcReport.sampleQc

Field

Description

.totalRawBases

Number of base pairs in sample before read QC processing

.totalRawReads

Number of reads in sample before read QC processing

.uniqueReads

Number of distinct reads in sample before read QC processing

.uniqueReadsProportion

Proportion of distinct reads in sample before read QC processing

.preQualityMeanReadLength

Average read length before read QC processing

.postQualityMeanReadLength

Average read length after read QC processing

.postQualityReads

Number of reads in sample after read QC processing, inclusive of any duplicate reads

.postQualityReadsProportion

Proportion of post-quality reads in sample relative to total raw reads

.removedInDehostingReads

Number of host reads in sample removed during dehosting (host = human)

.removedInDehostingReadsProportion

Proportion of host reads in sample removed relative to total raw reads (host = human)

.entropy

Shannon entropy of the counts of 5-mers in the reads after read QC processing, which is a measure of randomness

.gContent

Proportion of guanine (G) base calls in reads after read QC processing

.libraryQScore

Quality score of the library after read QC processing
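
Two of the proportions above are defined relative to total raw reads, so they can be recomputed from the counts as a sanity check. The sketch below uses made-up counts; the function name is hypothetical, and the values are not real pipeline output.

```python
def recompute_proportions(sample_qc):
    """Recompute the proportions that are defined relative to totalRawReads."""
    total = sample_qc["totalRawReads"]
    if total <= 0:
        raise ValueError("totalRawReads must be positive")
    return {
        "postQualityReadsProportion": sample_qc["postQualityReads"] / total,
        "removedInDehostingReadsProportion": sample_qc["removedInDehostingReads"] / total,
    }


# Made-up counts for illustration.
sample_qc = {
    "totalRawReads": 1_000_000,
    "postQualityReads": 900_000,
    "removedInDehostingReads": 250_000,
}
proportions = recompute_proportions(sample_qc)
print(proportions)
```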

.qcReport.enrichmentFactor Node

This section contains information about the enrichment factor calculation. Detection of an appropriate Internal Control is required. The fields are relative to .qcReport.enrichmentFactor

Field

Description

.value

Enrichment factor value reflecting how well targeted regions were enriched

.category

Enrichment factor category: "poor", "fair", "good", or "not calculated"

.qcReport.sampleComposition Node

This section contains information about the composition of the sample. The fields are relative to .qcReport.sampleComposition

Field

Description

.readClassification

Proportion of post-quality reads classified to the following categories:

.readClassification.targetedMicrobial

Targeted microbial

.readClassification.targetedInternalControl

Targeted Internal Control

.readClassification.untargeted

Untargeted

.readClassification.ambiguous

More than one category

.readClassification.unclassified

No category

.readClassification.lowComplexity

Low complexity

.targetedMicrobial

Proportion of post-quality targeted microbial reads classified to the following sub-categories:

.targetedMicrobial.viral

Viral targeted

.targetedMicrobial.bacterial

Bacterial targeted

.targetedMicrobial.fungal

Fungal targeted

.targetedMicrobial.parasitic

Parasitic targeted

.targetedMicrobial.bacterialAmr

Bacterial AMR targeted

.untargeted

Proportion of post-quality untargeted reads classified to the following sub-categories:

.untargeted.viral

Viral untargeted

.untargeted.bacterial

Bacterial untargeted

.untargeted.fungal

Fungal untargeted

.untargeted.parasitic

Parasitic untargeted

.untargeted.bacterialAmr

Bacterial AMR untargeted

.untargeted.internalControl

Internal Control untargeted

.untargeted.human

Human untargeted

.viral

Proportion of post-quality viral reads classified to the following categories:

.viral.targeted

Viral targeted

.viral.untargeted

Viral untargeted

.viral.untargetedSubcategories

Proportion of post-quality viral untargeted reads classified to the following sub-categories:

.viral.untargetedSubcategories.panel

Viral panel members

.viral.untargetedSubcategories.phage

Viral phage

.viral.untargetedSubcategories.other

Viral other (not a panel member or phage)

.bacterial

Proportion of post-quality bacterial reads classified to the following categories:

.bacterial.targeted

Bacterial targeted

.bacterial.untargeted

Bacterial untargeted

.bacterial.untargetedSubcategories

Proportion of post-quality bacterial untargeted reads classified to the following sub-categories:

.bacterial.untargetedSubcategories.panel

Bacterial panel members

.bacterial.untargetedSubcategories.ribosomalDna

Bacterial ribosomal DNA (16S)

.bacterial.untargetedSubcategories.plasmid

Bacterial plasmids

.bacterial.untargetedSubcategories.other

Bacterial other (not a panel member, ribosomal DNA, or plasmid)

.fungal

Proportion of post-quality fungal reads classified to the following categories:

.fungal.targeted

Fungal targeted

.fungal.untargeted

Fungal untargeted

.fungal.untargetedSubcategories

Proportion of post-quality fungal untargeted reads classified to the following sub-categories:

.fungal.untargetedSubcategories.panel

Fungal panel members

.fungal.untargetedSubcategories.ribosomalDna

Fungal ribosomal DNA (18S)

.fungal.untargetedSubcategories.other

Fungal other (not a panel member or ribosomal DNA)

.parasitic

Proportion of post-quality parasitic reads classified to the following categories:

.parasitic.targeted

Parasitic targeted

.parasitic.untargeted

Parasitic untargeted

.parasitic.untargetedSubcategories

Proportion of post-quality parasitic untargeted reads classified to the following sub-categories:

.parasitic.untargetedSubcategories.panel

Parasitic panel members

.parasitic.untargetedSubcategories.ribosomalDna

Parasitic ribosomal DNA (18S)

.parasitic.untargetedSubcategories.other

Parasitic other (not a panel member or ribosomal DNA)

.human

Proportion of post-quality human reads classified to the following categories:

.human.untargeted

Human untargeted

.human.untargetedSubcategories

Proportion of post-quality human untargeted reads classified to the following sub-categories:

.human.untargetedSubcategories.ribosomalDna

Human ribosomal DNA

.human.untargetedSubcategories.codingSequence

Human coding sequence

.human.untargetedSubcategories.other

Human other (not ribosomal DNA or coding sequence)

.internalControl

Proportion of post-quality Internal Control reads classified to the following categories:

.internalControl.targeted

Internal Control targeted

.internalControl.untargeted

Internal Control untargeted

.microbialAndInternalControl

Proportion of post-quality Microbial and Internal Control reads classified to the following categories:

.microbialAndInternalControl.targeted

Microbial and Internal Control targeted

.microbialAndInternalControl.untargeted

Microbial and Internal Control untargeted

.bacterialAmr

Proportion of post-quality bacterial AMR reads classified to the following categories:

.bacterialAmr.targeted

Bacterial AMR targeted

.bacterialAmr.untargeted

Bacterial AMR untargeted

.qcReport.internalControls Node

This section contains information about internal control detection. The value of the .qcReport.internalControls field is an array of objects containing name and RPKM information for each Internal Control. See the code block below for an example:

[
    {
        "name": "Allobacillus halotolerans",
        "rpkm": 0
    },
    {
        "name": "Armored RNA Quant Internal Process Control",
        "rpkm": 0
    },
    {
        "name": "Enterobacteria phage T7",
        "rpkm": 180323
    },
    {
        "name": "Escherichia virus MS2",
        "rpkm": 0
    },
    {
        "name": "Escherichia virus Qbeta",
        "rpkm": 0
    },
    {
        "name": "Escherichia virus T4",
        "rpkm": 0
    },
    {
        "name": "Imtechella halotolerans",
        "rpkm": 0
    },
    {
        "name": "Phocid alphaherpesvirus 1",
        "rpkm": 0
    },
    {
        "name": "Phocine morbillivirus",
        "rpkm": 0
    },
    {
        "name": "Truepera radiovictrix",
        "rpkm": 0
    }
]
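
A script can scan this array to see which internal controls have any signal. The sketch below uses three entries from the example above. Treating rpkm > 0 as "has signal" is an assumption made here for illustration, not the pipeline's detection criterion, and the function name is hypothetical.

```python
import json

# Three entries copied from the example array above.
INTERNAL_CONTROLS = json.loads("""
[
    {"name": "Allobacillus halotolerans", "rpkm": 0},
    {"name": "Enterobacteria phage T7", "rpkm": 180323},
    {"name": "Escherichia virus MS2", "rpkm": 0}
]
""")


def controls_with_signal(internal_controls):
    """Return entries with rpkm > 0, highest RPKM first (assumed criterion)."""
    hits = [entry for entry in internal_controls if entry["rpkm"] > 0]
    return sorted(hits, key=lambda entry: entry["rpkm"], reverse=True)


signal_names = [entry["name"] for entry in controls_with_signal(INTERNAL_CONTROLS)]
print(signal_names)
```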

.userOptions Node

This section gives information about analysis options specified by the user. The fields are relative to .userOptions

Field

Description

.quantitativeInternalControlName

Quantitative Internal Control used for microorganism absolute quantification (recommendation: Enterobacteria phage T7)

.quantitativeInternalControlConcentration

Quantitative Internal Control concentration (recommendation: 1.21 x 10^7 copies/mL of sample)

.readQcEnabled

Boolean indicating if read QC (trimming and filtering based on quality and read length) is enabled

.readClassificationSensitivity

(VSP V2 only) Sensitivity threshold for classifying reads. Determines whether alignment should proceed for a microorganism and/or reference sequence. Value is an integer with a valid range of 1 to 1000, inclusive

.customPanelFastaFile

(Custom Panel only) Name of the custom reference FASTA file

.customPanelBedFile

(Custom Panel only) Name of the custom reference BED file

.targetReport.microorganisms[] Node

The value of the .targetReport.microorganisms[] field is an array of objects containing information about detected microorganisms. The following table describes one .targetReport.microorganisms[] object. The fields are relative to .targetReport.microorganisms[]

Field

Description

.class

Microorganism class ("viral", "bacterial", "fungal", "parasite")

.name

Name of microorganism

.coverage

Proportion of targeted microorganism reference sequence bases that appear in sample sequencing reads

.ani

Average nucleotide identity of consensus sequence to targeted microorganism reference sequences

.medianDepth

Median depth of sample sequencing reads aligned to targeted microorganism reference sequences, indicating the median number of times each targeted microorganism reference sequence base appears in sample sequencing reads

.condensedDepthVector

Read depth across the targeted microorganism reference sequences, condensed to 256 bins

.rpkm

Normalized representation of the number of sample sequencing reads aligned to targeted microorganism reference sequences (targeted Reads mapped Per Kilobase of targeted sequence per Million quality-filtered reads)

.alignedReadCount

Number of sample sequencing reads that aligned to targeted microorganism reference sequences

.kmerReadCount

(UPIP only) Number of sample sequencing reads classified to targeted microorganism reference sequences

.absoluteQuantityRatio

Numerical absolute quantification value. Quantitative internal control required for calculation

.absoluteQuantityRatioFormatted

Formatted absolute quantification value with units. Quantitative internal control required for calculation

.phenotypicGroup

(RPIP, UPIP only) Grouping indicating general association with normal flora, colonization, or contamination from the environment or other sources, as well as general association with disease

.associatedAmrMarkers

(Bacteria only) Information about the bacterial AMR markers associated with the microorganism

.associatedAmrMarkers.applicable

Boolean indicating whether one or more bacterial AMR markers are associated with the microorganism

.associatedAmrMarkers.detected

List of detected bacterial AMR markers associated with the microorganism

.associatedAmrMarkers.predicted

List of predicted bacterial AMR markers associated with the microorganism

.consensusGenomeSequences

(RPIP, VSP V2 viruses only) Information about the majority consensus genome (or segment) sequence

.consensusGenomeSequences.sequence

Consensus genome (or segment) sequence bases

.consensusGenomeSequences.referenceAccession

Accession of the reference genome (or segment) sequence

.consensusGenomeSequences.referenceDescription

Description of the reference genome (or segment) sequence

.consensusGenomeSequences.referenceLength

Length of the reference genome (or segment) sequence

.consensusGenomeSequences.maximumAlignmentLength

Longest contiguous alignment between consensus sequence and reference genome (or segment) sequence

.consensusGenomeSequences.maximumGapLength

Longest contiguous alignment gap (insertion or deletion) between consensus sequence and reference genome (or segment) sequence

.consensusGenomeSequences.maximumUnalignedLength

Longest section of the reference genome (or segment) sequence not aligned to by consensus sequence

.consensusGenomeSequences.coverage

Proportion of reference genome (or segment) sequence bases that appear in sample sequencing reads

.consensusGenomeSequences.ani

Average nucleotide identity of consensus sequence to reference genome (or segment) sequence

.consensusGenomeSequences.alignedReadCount

Number of sample sequencing reads that aligned to reference genome (or segment) sequence

.consensusGenomeSequences.medianDepth

Median depth of sample sequencing reads aligned to reference genome (or segment) sequence, indicating the median number of times each reference genome (or segment) sequence base appears in sample sequencing reads

.consensusGenomeSequences.targetAnnotation

List of targeted region annotations for the reference genome (or segment) sequence. Each annotation is a JSON object with the following fields: start (int), end (int), strand (string: "+", "-"), target_name (string), type (string)

.consensusGenomeSequences.condensedDepthVector

Read depth across the reference genome (or segment) sequence, condensed to 256 bins

.consensusTargetSequences

(RPIP viruses only) Information about the majority targeted region consensus sequences

.consensusTargetSequences.sequence

Consensus targeted region sequence bases

.consensusTargetSequences.name

Name of the targeted region

.consensusTargetSequences.referenceAccession

Accession of the targeted region reference sequence

.consensusTargetSequences.depthVector

Read depth across the targeted region reference sequence, not condensed

.predictionInformation

Information about microorganism prediction results

.predictionInformation.predictedPresent

Boolean indicating whether the microorganism passed its reporting logic algorithm

.predictionInformation.notes

List of notes about the prediction result

.predictionInformation.subpanels

List of pre-defined subpanels that the microorganism belongs to

.predictionInformation.relatedMicroorganisms

Array of objects with information about genetically related microorganisms. See below for details

.variants

(all VSP V2 viruses, RPIP: SARS-CoV-2 & FluA/B/C only) Information about viral variants. See below for details
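
A common first step with this node is to list the microorganisms that passed their reporting logic. The sketch below filters on .predictionInformation.predictedPresent and sorts by .rpkm. The report fragment, organism names, and values are made-up placeholders, and the function name is hypothetical.

```python
def predicted_microorganisms(report):
    """Return microorganisms with predictedPresent set, highest RPKM first."""
    organisms = report.get("targetReport", {}).get("microorganisms", [])
    present = [
        organism for organism in organisms
        if organism.get("predictionInformation", {}).get("predictedPresent")
    ]
    return sorted(present, key=lambda organism: organism.get("rpkm", 0), reverse=True)


# Made-up fragment for illustration; not real pipeline output.
report = {"targetReport": {"microorganisms": [
    {"class": "viral", "name": "Example virus A", "rpkm": 1200.5,
     "predictionInformation": {"predictedPresent": True}},
    {"class": "bacterial", "name": "Example bacterium B", "rpkm": 3.1,
     "predictionInformation": {"predictedPresent": False}},
    {"class": "viral", "name": "Example virus C", "rpkm": 45000.0,
     "predictionInformation": {"predictedPresent": True}},
]}}

predicted_names = [organism["name"] for organism in predicted_microorganisms(report)]
print(predicted_names)
```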

.targetReport.microorganisms[].predictionInformation[].relatedMicroorganisms[] Node

The value of the .targetReport.microorganisms[].predictionInformation[].relatedMicroorganisms[] field is an array of objects containing information about genetically related microorganisms. The following table describes one .targetReport.microorganisms[].predictionInformation[].relatedMicroorganisms[] object. The fields are relative to .targetReport.microorganisms[].predictionInformation[].relatedMicroorganisms[]

Field

Description

.name

Name of related microorganism

.onPanel

Boolean indicating whether the related microorganism is a panel member

.kmerReadCount

(UPIP only) Number of sample sequencing reads classified to related microorganism reference sequences

.coverage

Proportion of related microorganism reference sequence bases that appear in sample sequencing reads

.ani

Average nucleotide identity of consensus sequence to related microorganism reference sequences

.alignedReadCount

Number of sample sequencing reads that aligned to related microorganism reference sequences

.targetReport.microorganisms[].variants[] Node

The value of the .targetReport.microorganisms[].variants[] field is an array of objects containing information about viral variants. It is reported for all VSP V2 viruses and, for RPIP, for SARS-CoV-2 and FluA/B/C only. The following table describes one .targetReport.microorganisms[].variants[] object. The fields are relative to .targetReport.microorganisms[].variants[]

Field

Description

.referenceAccession

Accession of reference genome (or segment) sequence used for variant calling

.segment

(Segmented viruses only) Segment number of reference segment sequence

.ntChange

Nucleotide change associated with variant

.referencePosition

Variant position in viral reference genome (or segment) sequence

.referenceAllele

Reference allele at variant position

.variantAllele

Variant allele

.depth

Variant depth, indicating the number of times variant position appears in sample sequencing reads

.alleleFrequency

Frequency of variant allele in sample sequencing reads

.targetReport.amrMarkers[] Node

The value of the .targetReport.amrMarkers[] field is an array of objects containing information about detected bacterial AMR markers. The following table describes one .targetReport.amrMarkers[] object. The fields are relative to .targetReport.amrMarkers[]

Field

Description

.class

Microorganism class ("bacterial")

.cardModelType

Bacterial AMR marker model type in the Comprehensive Antibiotic Resistance Database (CARD) ("homolog", "protein variant", "rRNA variant")

.cardGeneFamily

Bacterial AMR marker gene family in the Comprehensive Antibiotic Resistance Database (CARD)

.name

Bacterial AMR marker name

.cardName

Bacterial AMR marker name in the Comprehensive Antibiotic Resistance Database (CARD)

.ncbiName

Bacterial AMR marker name in the National Center for Biotechnology Information (NCBI) Reference Gene Catalog

.referenceAccession

Accession of the bacterial AMR marker reference sequence

.coverage

Proportion of bacterial AMR marker reference sequence residues that appear in sample sequencing reads (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)

.pid

Percent identity of consensus sequence aligned to bacterial AMR marker reference sequence (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)

.medianDepth

Median depth of sample sequencing reads aligned to bacterial AMR marker reference sequence, indicating the median number of times each bacterial AMR marker sequence residue appears in sample sequencing reads (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)

.rpkm

Normalized representation of the number of sample sequencing reads aligned to bacterial AMR reference sequence (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)

.alignedReadCount

Number of sample sequencing reads that aligned to bacterial AMR reference sequence (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)

.nucleotideConsensusSequence

Nucleotide consensus sequence bases

.proteinConsensusSequence

Protein consensus sequence residues

.nucleotideDepthVector

Read depth across the bacterial AMR marker nucleotide reference sequence, not condensed

.proteinDepthVector

Read depth across the bacterial AMR marker protein reference sequence, not condensed

.associatedMicroorganisms

Information about the microorganisms associated with the bacterial AMR marker

.associatedMicroorganisms.all

List of all microorganisms associated with the bacterial AMR marker

.associatedMicroorganisms.detected

List of detected microorganisms associated with the bacterial AMR marker

.associatedMicroorganisms.predicted

List of predicted microorganisms associated with the bacterial AMR marker

.predictionInformation

Information about bacterial AMR marker prediction results

.predictionInformation.predictedPresent

Boolean indicating whether the bacterial AMR marker passed its reporting logic algorithm

.predictionInformation.confidence

Confidence level of bacterial AMR marker prediction ("high", "medium", "low")

.predictionInformation.notes

List of notes about the prediction result
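
As an illustration of how these fields might be consumed, the following Python sketch keeps only the bacterial AMR markers that passed their reporting logic at an accepted confidence level. The report excerpt and all of its values are invented for illustration, the helper name reported_markers is hypothetical, and only field names listed in this section are used.

```python
import json

# Hypothetical excerpt of a report; every value below is invented.
EXAMPLE_REPORT = json.loads("""
{"targetReport": {"amrMarkers": [
  {"coverage": 0.98, "pid": 99.5, "medianDepth": 42,
   "predictionInformation": {"predictedPresent": true,
                             "confidence": "high", "notes": []}},
  {"coverage": 0.41, "pid": 88.0, "medianDepth": 3,
   "predictionInformation": {"predictedPresent": false,
                             "confidence": "low", "notes": []}}
]}}
""")


def reported_markers(report, accepted_confidence=("high", "medium")):
    """Return AMR marker objects predicted present at an accepted confidence."""
    markers = report.get("targetReport", {}).get("amrMarkers", [])
    kept = []
    for marker in markers:
        info = marker.get("predictionInformation", {})
        if info.get("predictedPresent") and info.get("confidence") in accepted_confidence:
            kept.append(marker)
    return kept


for marker in reported_markers(EXAMPLE_REPORT):
    print(marker["coverage"], marker["pid"], marker["medianDepth"])
```

The accepted confidence levels are a parameter because the appropriate cutoff depends on the use case; the default shown here is an arbitrary choice, not a recommendation from the pipeline.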

.targetReport.amrMarkers[].variants[] Node

The value of the .targetReport.amrMarkers[].variants[] field is an array of objects containing information about variants for bacterial AMR markers with "protein variant" or "rRNA variant" model types. The following table describes one .targetReport.amrMarkers[].variants[] object. The fields are relative to .targetReport.amrMarkers[].variants[]

Field

Description

.category

Variant category ("Bacterial Variant; Known AMR")

.referenceSourceMicroorganism

Microorganism that reference sequence is associated with in NCBI

.comments

List of additional information regarding the variant

.product

Protein product of gene

.ntChange

Nucleotide change associated with variant

.referencePosition

Variant position in reference sequence

.referenceAllele

Reference allele at variant position

.variantAllele

Variant allele

.depth

Variant depth, indicating the number of times variant position appears in sample sequencing reads

.alleleFrequency

Frequency of variant allele in sample sequencing reads

.annotation

Type of change (e.g. "Nonsynonymous Variant")

.aaChange

Amino acid change associated with variant

.epistaticGroups

List of epistatic groups variant is associated with

.targetReport.customReferences[] Node

This section contains information about custom reference detection results and is only present for custom database analyses. When only a custom reference FASTA file is provided (no BED file), each .targetReport.customReferences[] object contains information for a single reference sequence. When both a FASTA and BED file are provided, each .targetReport.customReferences[] object contains information for a single genome/microorganism, which can be a collection of one or more reference sequences. The fields are relative to .targetReport.customReferences[]

Field

Description

.name

Provided name of custom reference sequence, accession, genome, or microorganism

.coverage

Proportion of custom reference sequence bases that appear in sample sequencing reads

.ani

Average nucleotide identity of consensus sequence to custom reference sequence or, if specified, collection of one or more custom reference sequences

.medianDepth

Median depth of sample sequencing reads aligned to custom reference sequence or, if specified, collection of one or more custom reference sequences, indicating the median number of times each custom reference sequence base appears in sample sequencing reads

.condensedDepthVector

Read depth across custom reference sequence or, if specified, collection of one or more custom reference sequences, condensed to 256 bins

.rpkm

Normalized number of sample sequencing reads aligned to custom reference sequence or, if specified, collection of one or more custom reference sequences (targeted Reads mapped Per Kilobase of targeted sequence per Million quality-filtered reads)

.alignedReadCount

Number of sample sequencing reads that aligned to custom reference sequence or, if specified, collection of one or more custom reference sequences

.consensusSequences

Array of objects with information about each consensus sequence. See below for details

.variants

Array of objects with information about variants detected in custom reference sequence or, if specified, collection of one or more custom reference sequences. See below for details

.targetReport.customReferences[].consensusSequences[] Node

The value of the .targetReport.customReferences[].consensusSequences[] field is an array of objects containing majority consensus sequence information for a single custom reference sequence. When only a FASTA file is provided (no BED file), there will be only one object in the array. When both a FASTA and BED file are provided, there may be more than one object in the array. The fields are relative to .targetReport.customReferences[].consensusSequences[]

Field

Description

.sequence

Majority consensus sequence bases

.referenceAccession

Accession of custom reference sequence

.referenceDescription

Description of custom reference sequence

.referenceLength

Length of custom reference sequence

.coverage

Proportion of custom reference sequence bases that appear in sample sequencing reads

.ani

Average nucleotide identity of consensus sequence to custom reference sequence

.medianDepth

Median depth of sample sequencing reads aligned to custom reference sequence, indicating the median number of times each custom reference sequence base appears in sample sequencing reads

.depthVector

Read depth across custom reference sequence, not condensed

.alignedReadCount

Number of sample sequencing reads that aligned to custom reference sequence

.maximumAlignmentLength

Longest contiguous alignment between consensus sequence and custom reference sequence

.maximumGapLength

Longest contiguous alignment gap (insertion or deletion) between consensus sequence and custom reference sequence

.maximumUnalignedLength

Longest section of custom reference sequence not aligned to by consensus sequence
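
The consensus sequence fields above can be exported for downstream tools. The following Python sketch renders each consensus sequence as a FASTA record built from .referenceAccession, .referenceDescription, and .sequence. The input structure mirrors the field names in this section, but the example values and the helper name consensus_to_fasta are hypothetical.

```python
# Hypothetical .targetReport.customReferences[] content; values are invented.
EXAMPLE_REFERENCES = [
    {
        "name": "MyVirus",
        "consensusSequences": [
            {
                "referenceAccession": "REF1",
                "referenceDescription": "example segment",
                "sequence": "ACGT" * 20,
            }
        ],
    }
]


def consensus_to_fasta(custom_references, width=60):
    """Render every consensus sequence as a FASTA record with wrapped lines."""
    records = []
    for reference in custom_references:
        for consensus in reference.get("consensusSequences", []):
            accession = consensus["referenceAccession"]
            description = consensus.get("referenceDescription", "")
            header = (">" + accession + " " + description).strip()
            sequence = consensus["sequence"]
            # Split the sequence into fixed-width lines.
            chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
            records.append("\n".join([header] + chunks) + "\n")
    return "".join(records)


print(consensus_to_fasta(EXAMPLE_REFERENCES), end="")
```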

.targetReport.customReferences[].variants[] Node

The value of the .targetReport.customReferences[].variants[] field is an array of objects, each containing information about a single detected variant. The fields are relative to .targetReport.customReferences[].variants[]

Field

Description

.ntChange

Nucleotide change associated with variant

.referenceAccession

Accession of custom reference sequence used for variant calling

.referencePosition

Variant position in custom reference sequence

.referenceAllele

Reference allele at variant position

.variantAllele

Variant allele

.depth

Variant depth, indicating the number of times variant position appears in sample sequencing reads

.alleleFrequency

Frequency of variant allele in sample sequencing reads
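
A common follow-up step is to keep only well-supported variants. The Python sketch below filters variant objects by .depth and .alleleFrequency and prints a compact description of each. It assumes that .alleleFrequency is reported as a fraction between 0 and 1, which this section does not state; the thresholds, example values, and helper names are hypothetical.

```python
# Hypothetical variant objects; values are invented for illustration.
EXAMPLE_VARIANTS = [
    {"referenceAccession": "REF1", "referencePosition": 100,
     "referenceAllele": "A", "variantAllele": "G",
     "depth": 250, "alleleFrequency": 0.97},
    {"referenceAccession": "REF1", "referencePosition": 412,
     "referenceAllele": "C", "variantAllele": "T",
     "depth": 4, "alleleFrequency": 0.5},
    {"referenceAccession": "REF2", "referencePosition": 33,
     "referenceAllele": "G", "variantAllele": "A",
     "depth": 80, "alleleFrequency": 0.08},
]


def filter_variants(variants, min_depth=10, min_allele_frequency=0.5):
    """Keep variants meeting both thresholds (assumes frequency is in 0..1)."""
    return [
        v for v in variants
        if v["depth"] >= min_depth and v["alleleFrequency"] >= min_allele_frequency
    ]


def describe(variant):
    """Format one variant as accession:position ref>alt with its support."""
    return "{}:{} {}>{} (depth={}, AF={})".format(
        variant["referenceAccession"], variant["referencePosition"],
        variant["referenceAllele"], variant["variantAllele"],
        variant["depth"], variant["alleleFrequency"],
    )


for variant in filter_variants(EXAMPLE_VARIANTS):
    print(describe(variant))
```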

Kmer Classifier

Description

The metagenomics classifier uses a k-mer based classification algorithm to classify each query sequence (usually a read) against a collection of reference sequences. The process has two logical steps: 1) the reference sequences are indexed into a searchable database, and 2) the database is searched with the query sequences, and each query sequence is classified to the taxid(s) associated with the matching reference sequences. This guide explains how to run query sequences against a pre-existing reference sequence database. As of DRAGEN 4.3, users can also build their own custom reference sequence database.

Command Line Settings

Option

Description

Required Inputs

--enable-kmer-classifier

Enables the Kmer Classifier. (Default=false).

--output-file-prefix

Prefix for all output files.

--output-directory

Directory for all output files.

--kmer-classifier-input-read-file

Input sequence file (zipped or unzipped) to the Kmer Classifier.

--kmer-classifier-db-file

Database of sequences to classify against.

Optional Inputs

--intermediate-results-dir

Area for temporary files. Size must be greater than size of all FASTQ files multiplied by 2.

--kmer-classifier-load-db-ram

Load the database onto RAM. Do not use if database is on ramdisk. (Default=false).

--kmer-classifier-multiple-inputs

Set to true to run with multiple inputs. The input read file is now a .tsv file that has three columns: Sample ID, Read1 file, (optional) Read 2 file. (Default=false).

--kmer-classifier-min-window

The minimum number of consecutive kmers to classify assignment at taxid. (Default=1).

--kmer-classifier-output-read-seq

Option to enable read sequence column in the output file. (Default=false).

--kmer-classifier-output-taxid-seq

Option to enable a taxid string column in the output file. (Default=false).

--kmer-classifier-db-to-taxid-json

Path to JSON file that maps database IDs to external taxids, names, and ranks.

--kmer-classifier-no-read-output

Option to not create individual read output. (Default=false).

--kmer-classifier-no-taxid-counts

Option to not write taxid count output file. (Default=false).

--kmer-classifier-protein-input

Option to indicate protein query sequences. To use this option, the reference sequence database MUST be of protein sequences. (Default=false).

--kmer-classifier-ncpus

Option to set the number of CPUs available for processing.

Example Command Line

dragen \
  --enable-kmer-classifier=true \
  --output-file-prefix <PREFIX> \
  --output-directory <OUTPUT_DIR> \
  --kmer-classifier-input-read-file /path/to/fastq.gz \
  --kmer-classifier-db-file /path/to/database \
  --kmer-classifier-min-window 1 \
  --kmer-classifier-ncpus=2 \
  --kmer-classifier-output-read-seq=false \
  --kmer-classifier-output-taxid-seq=false

Input Details

Input Reads

Applies to: --kmer-classifier-input-read-file, --kmer-classifier-multiple-inputs

To analyze a single FASTA/FASTQ read file, pass its filename to --kmer-classifier-input-read-file and set --kmer-classifier-multiple-inputs=false. Multiple read files can also be submitted to the Kmer Classifier in one run, which minimizes the load time for a large reference sequence database. In this case, the input must be a tab-separated .tsv file with two or three columns and no header line. The first column is a unique ID, the second column is the path to the read file, and the optional third column is the path to the second read file for paired-end reads. The ID is used to distinguish the output files. Pass this .tsv file to --kmer-classifier-input-read-file and set --kmer-classifier-multiple-inputs=true.
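
The multiple-input .tsv file can be generated programmatically, which avoids accidental space-for-tab substitutions. The following Python sketch writes one row per sample with an optional third column for paired-end reads. The sample IDs and paths are placeholders, and the helper name write_input_tsv is hypothetical.

```python
import csv
import io


def write_input_tsv(samples, handle):
    """Write (sample_id, read1, read2_or_None) tuples as a header-less TSV."""
    seen = set()
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for sample_id, read1, read2 in samples:
        if sample_id in seen:
            # The first column must be unique because it names the outputs.
            raise ValueError("duplicate sample ID: " + sample_id)
        seen.add(sample_id)
        row = [sample_id, read1]
        if read2:
            row.append(read2)  # optional second read file for paired-end data
        writer.writerow(row)


# Placeholder samples: one paired-end, one single-end.
EXAMPLE_SAMPLES = [
    ("S1", "/data/S1_R1.fastq.gz", "/data/S1_R2.fastq.gz"),
    ("S2", "/data/S2.fastq.gz", None),
]

buffer = io.StringIO()
write_input_tsv(EXAMPLE_SAMPLES, buffer)
print(buffer.getvalue(), end="")
```

In practice the handle would be a file opened with open(path, "w", newline=""), and that path would then be passed to --kmer-classifier-input-read-file.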

Reference Sequences

Applies to: --kmer-classifier-db-file, --kmer-classifier-db-to-taxid-json, --kmer-classifier-load-db-ram

A file of reference sequences (the "database") can be quite large. If the database file is stored on a normal file system, it is recommended that you set --kmer-classifier-load-db-ram=true. This tells the Kmer Classifier to load the database file into memory for faster analysis. You can also store the database file on a RAM disk, which reduces load time over many Kmer Classifier runs. In this case, it is recommended that you set --kmer-classifier-load-db-ram=false.

DB TaxID JSON Mapping File

Applies to: --kmer-classifier-db-to-taxid-json

This input file is downloaded alongside the reference sequence database. It associates each taxid internal to the classifier database with an external source, such as the NCBI taxonomy. The JSON file is a dictionary whose keys are internal taxids, each mapped to an external taxid, name, and rank. Example:

 {
   "2": {"taxid": 2, "name": "bacteria", "rank": "kingdom"},
   "3": {"taxid": 2697049, "name": "SARS-CoV-2", "rank": "subspecies"},
   "4": {"taxid": 5052, "name": "Aspergillus", "rank": "genus"}
 }

The internal taxids are used in the output files. This JSON file can be used to map the results to taxids from the NCBI taxonomy.
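
As a minimal sketch of that lookup, the following Python snippet loads a mapping with the same structure as the example above and translates an internal taxid into its external taxid, name, and rank. The helper name to_external is hypothetical; in practice the mapping would be read from the downloaded JSON file with json.load.

```python
import json

# Same structure as the example mapping shown above.
NAME_MAP = json.loads("""
{
  "2": {"taxid": 2, "name": "bacteria", "rank": "kingdom"},
  "3": {"taxid": 2697049, "name": "SARS-CoV-2", "rank": "subspecies"},
  "4": {"taxid": 5052, "name": "Aspergillus", "rank": "genus"}
}
""")


def to_external(internal_taxid, name_map):
    """Return (external taxid, name, rank), or None if the ID is unmapped."""
    # JSON object keys are strings, so convert integer IDs before the lookup.
    entry = name_map.get(str(internal_taxid))
    if entry is None:
        return None
    return entry["taxid"], entry["name"], entry["rank"]


print(to_external(3, NAME_MAP))
```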

Downloading Reference Sequence Databases and Mapping Files

Genome Database

The genome database includes NCBI RefSeq genomes for human, bacteria, archaea, viruses, and fungi. The December 3 2023 NCBI taxonomy was used to build the database, and the sequences were collected in December 2023.

To download the reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.refseq_genomes.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.refseq_genomes.name_map.json

Genome and NT Database

This database includes the contents of the Genome database and all of the NCBI nucleotide (nt) database. The sequences from the NCBI nucleotide database were collected in July 2023, and the December 3 2023 NCBI taxonomy was used to build the database. Two versions of this database are available for download: one that requires a machine with >= 550GB RAM, and a compressed version that trades approximately 5-10% accuracy for a smaller RAM footprint and requires a machine with >= 225GB RAM.

To download the reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.genomes_plus_nt.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.genomes_plus_nt.name_map.json

To download the compressed reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.genomes_plus_nt.compressed.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.genomes_plus_nt.name_map.json

UniRef90 Database

This database includes all protein sequences of the UniRef90 database. The sequences were collected in March 2024 and the March 28 2024 NCBI taxonomy was used to build the database.

To download the reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.u90_all.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.u90_all.name_map.json

16S Database

This database includes full length bacterial 16S sequences from the NCBI. The sequences were collected in April 2024 and the March 28 2024 NCBI taxonomy was used to build the database.

To download the reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.16S.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.16S.name_map.json

Output Details

There are two output files, one organized around the reads, and the other organized around the taxids.

Read-level Output

Applies to: --kmer-classifier-output-taxid-seq, --kmer-classifier-output-read-seq

The main output file is a .tsv file with the extension .read_classifications.tsv. It has no header line, has tab-separated columns, and can vary in the number of columns depending on command line options. It details the results for each read.

Column

Description

Data Type

Read index

integer

Read name

string

Taxid the read classified to

integer

Maximum number of contiguous kmers that classified to this taxid

integer

Score assigned to the classification

integer

Number of kmers that classified to this taxid

integer

Read duplication count

integer

Name associated with taxid, if given with --kmer-classifier-db-to-taxid-json

string

Taxonomic rank associated with taxid, if given with --kmer-classifier-db-to-taxid-json

string

Taxid that each kmer classified to (is output when the --kmer-classifier-output-taxid-seq flag is set)

list of integers separated by commas

Read sequence (is output when the --kmer-classifier-output-read-seq flag is set)

string
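
Because the read-level file has no header line, columns must be addressed by position. The Python sketch below counts reads per taxid, assuming that the columns appear in the same order as the rows of the table above and that the first seven columns are always present. That ordering is an assumption, not something this section states explicitly. The column labels, example rows, and helper name are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical labels for the first seven columns, in the assumed order.
FIXED_COLUMNS = [
    "read_index", "read_name", "taxid", "max_contiguous_kmers",
    "score", "kmer_count", "duplication_count",
]

# Invented example rows in the assumed layout.
READS_TSV = (
    "0\tread_a\t3\t12\t40\t15\t1\n"
    "1\tread_b\t3\t8\t25\t9\t2\n"
    "2\tread_c\t4\t5\t11\t5\t1\n"
)


def reads_per_taxid(handle):
    """Count rows per taxid, skipping rows with fewer than seven columns."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < len(FIXED_COLUMNS):
            continue
        record = dict(zip(FIXED_COLUMNS, row))
        counts[int(record["taxid"])] += 1
    return counts


counts = reads_per_taxid(io.StringIO(READS_TSV))
for taxid, n_reads in counts.most_common():
    print(taxid, n_reads)
```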

TaxID-level Output

The second output file is a .tsv file with the extension .classifier.taxid_kmer_counts.tsv. It has a header line and has tab-separated columns. It summarizes the results for each taxid.

Header

Description

Data Type

db_taxid

Identifier for this taxid used internally in the database

integer

duplicity

Ratio of total number of kmers from reads assigned to this taxid compared to the number of distinct kmers from reads assigned to this taxid

float

distinct_coverage

Percent of kmers in the database assigned to this taxid that are covered by kmers in the reads assigned to this taxid

integer

read_count

Number of reads that classified to this taxid

integer

total_kmer_count

Number of kmers that classified to this taxid

integer

distinct_kmer_count

Number of distinct kmers that classified to this taxid

integer

cumulative_read_count

Cumulative number of reads assigned to this taxid and its taxonomic descendants

integer

taxid

Taxid

integer

name

Name associated with the taxid, if given with --kmer-classifier-db-to-taxid-json

string

rank

Taxonomic rank of the taxid, if given with --kmer-classifier-db-to-taxid-json

string

taxid_distinct_kmer_count

Number of distinct kmers assigned to this taxid from the reference sequences

string

probability_present

Not in use

float
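
Since the taxid-level file has a header line, it can be read by column name. The following Python sketch ranks taxids by read_count and recomputes the duplicity ratio from total_kmer_count and distinct_kmer_count, following the definition in the table above. The example rows are invented and contain only a subset of the columns; the helper names are hypothetical.

```python
import csv
import io

# Invented example with a subset of the documented header columns.
TAXID_TSV = (
    "db_taxid\tread_count\ttotal_kmer_count\tdistinct_kmer_count\n"
    "4\t15\t80\t80\n"
    "3\t120\t900\t300\n"
)


def top_taxids(handle, n=5):
    """Return up to n rows sorted by read_count, highest first."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    rows.sort(key=lambda row: int(row["read_count"]), reverse=True)
    return rows[:n]


def duplicity(row):
    """Ratio of total kmers to distinct kmers for one taxid row."""
    distinct = int(row["distinct_kmer_count"])
    if distinct == 0:
        return 0.0
    return int(row["total_kmer_count"]) / distinct


ranked = top_taxids(io.StringIO(TAXID_TSV))
for row in ranked:
    print(row["db_taxid"], row["read_count"], duplicity(row))
```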

Kmer Classifier Database Builder

Description

The metagenomics classifier uses a k-mer based classification algorithm to classify each query sequence (usually a read) against a collection of reference sequences. The process has two logical steps: 1) the reference sequences are indexed into a searchable database, and 2) the database is searched with the query sequences, and each query sequence is classified to the taxid(s) associated with the matching reference sequences. This guide explains how to generate your own indexed, searchable database of reference sequences to be used by the k-mer classifier.

Command Line Settings

Option

Description

Example Command Line

Usage

K-mer/G-mer length considerations:

G-mer length refers to the size of the window from which to pick a minimizer. The larger the window, the fewer minimizers will be chosen overall, resulting in a smaller database. However, this can cause a loss of sensitivity since fewer k-mers overall will be saved.
K-mer length refers to the size of the minimizer to be saved in a window size specified by the G-mer length. In general, larger k-mers result in greater specificity while shorter k-mers result in greater sensitivity. However, this general statement can be proven wrong by the specifics of an application and we recommend trying a few different g-mer and k-mer lengths to determine what works best for a given sequence type and application.
As a general rule, we recommend starting with a g-mer length of 35 and k-mer length of 31.
Pre-built Explify Reference Database k-mer/g-mer length settings for reference:
- Very large collection of NCBI Refseq genomes and the entirety of the NCBI nucleotide (nt) database (more than 2 Terabases of sequence): G-mer length of 41 and k-mer length of 31. Compressed version built with g-mer length of 64 and k-mer length of 31. This results in a database with less than half the storage requirements.
- Subset of reference genomes from Refseq with a focus on viral detection: G-mer length of 35 and k-mer length of 31
- Collection of 16S sequences for bacterial identification / profiling: G-mer length of 31 and k-mer length of 31
- Uniref90 protein sequences: G-mer length of 15, k-mer length of 12 (these are k-mers of amino acids, not nucleotides)
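
To make the relationship between g-mer and k-mer length concrete, the following Python sketch selects one minimizer per g-length window and shows that a larger window never selects more k-mers. It is only an illustration of the general minimizer idea: it picks the lexicographically smallest k-mer in each window, which is an assumption made for simplicity. The ordering actually used by the database builder is not described here, and the sequence and function name are hypothetical.

```python
def minimizers(sequence, k, g):
    """Return the set of (position, k-mer) pairs chosen as window minimizers.

    Each window of length g contributes its lexicographically smallest
    k-mer; ties are broken by the leftmost position.
    """
    if not 0 < k <= g:
        raise ValueError("require 0 < k <= g")
    chosen = set()
    for start in range(len(sequence) - g + 1):
        candidates = [
            (sequence[i:i + k], i)
            for i in range(start, start + g - k + 1)
        ]
        kmer, position = min(candidates)
        chosen.add((position, kmer))
    return chosen


# Arbitrary example sequence.
SEQUENCE = "ACGTTGCATGCCGATAGCTAGGCTTAACGGATCCGATTACAGGCATCG"

for g in (5, 8, 12):
    print("k=5 g=%d -> %d minimizers saved" % (g, len(minimizers(SEQUENCE, 5, g))))
```

When g equals k, every k-mer is saved. As g grows, the saved set can only shrink or stay the same, which is the trade-off between database size and sensitivity described above.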

Three types of databases can be built with this tool:

Binner: each k-mer is assigned to a category/bin.
- Must use --kmer-class-db-builder-num-categories.
- Do not use --kmer-class-db-builder-tax-tree-file, --kmer-class-db-builder-save-weights, or --kmer-class-db-builder-kmer-cutoff.
Classifier: each k-mer is assigned to one taxid.
- Must define a taxonomic tree with --kmer-class-db-builder-tax-tree-file.
- Do not use --kmer-class-db-builder-num-categories, --kmer-class-db-builder-save-weights, or --kmer-class-db-builder-kmer-cutoff.
Classifier with weights: each k-mer is assigned to one or more taxids; associated weights are also stored (frequency of k-mer's association with a taxid). Uses much more memory, but is more accurate.
- Must use --kmer-class-db-builder-save-weights and define a taxonomic tree with --kmer-class-db-builder-tax-tree-file.
- Can use --kmer-class-db-builder-kmer-cutoff.
- Do not use --kmer-class-db-builder-num-categories.
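
The option rules for the three database types can be checked before launching a build. The Python sketch below maps a set of option names to the database type it corresponds to, or returns None when the combination matches none of the rules listed above. The option names come from this section; the function name and return labels are hypothetical.

```python
NUM_CATEGORIES = "--kmer-class-db-builder-num-categories"
TAX_TREE_FILE = "--kmer-class-db-builder-tax-tree-file"
SAVE_WEIGHTS = "--kmer-class-db-builder-save-weights"
KMER_CUTOFF = "--kmer-class-db-builder-kmer-cutoff"


def database_type(options):
    """Return the database type implied by the option names, or None."""
    has_categories = NUM_CATEGORIES in options
    has_tree = TAX_TREE_FILE in options
    has_weights = SAVE_WEIGHTS in options
    has_cutoff = KMER_CUTOFF in options

    # Binner: categories only.
    if has_categories and not (has_tree or has_weights or has_cutoff):
        return "binner"
    # Classifier with weights: tree and weights, cutoff optional.
    if has_tree and has_weights and not has_categories:
        return "classifier with weights"
    # Classifier: tree only.
    if has_tree and not (has_categories or has_weights or has_cutoff):
        return "classifier"
    return None


print(database_type({TAX_TREE_FILE, SAVE_WEIGHTS, KMER_CUTOFF}))
```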

Explify Analysis Pipeline

Description

RPIP: Target-capture enrichment of >280 RNA and DNA respiratory pathogens, including SARS-CoV-2, Influenza viruses, Respiratory syncytial virus, Mycobacterium and Legionella species, and >4000 AMR markers.
UPIP: Target-capture enrichment of >170 genitourinary pathogens, including fastidious, slow-growing, and anaerobic uropathogens, sexually transmitted microorganisms, and >4000 bacterial AMR markers.
VSP V2: Target-capture enrichment for whole-genome sequencing (WGS) of 200 RNA and DNA viruses prioritized as high-risk to public health, zoonotic surveillance, and biotech, and >200 viral AMR markers.
Custom: Analyze FASTQ/FASTA read files with a custom reference sequence database.

Command Line Settings

Option

Description

Required Inputs

--enable-explify

Enables the Explify Analysis Pipeline. (Default=false)

--output-file-prefix

Prefix for all output files.

--output-directory

Directory for all output files.

--explify-sample-list

Input sample list .tsv file with sample IDs, FASTQs, etc.

--explify-test-panel-name

"RPIP", "UPIP", "VSPv2", "Custom".

--explify-test-panel-version

Set to test panel version (e.g. "1.0.0").

--explify-ref-db-dir

Path to root directory for Explify Database files.

Optional Inputs

--intermediate-results-dir

Area for temporary files. Size must be greater than size of all FASTQ files multiplied by 3.

--explify-load-db-ram

Option to load database into RAM if not on ramdisk. (Default=false).

--explify-no-read-qc

Option to turn off read QC on FASTQs before analysis. (Default=false).

--explify-internal-control

Option to set internal control from an accepted list. (Default="Enterobacteria phage T7")

--explify-internal-control-concentration

Option to set internal control concentration. (Default=12100000)

--explify-ncpus

Option to set the number of CPUs available for processing.

--explify-sensitivity-threshold

Option to set sensitivity threshold. Range: 0 < Integer < 1000. Only valid for VSPv2. (Default=5).

--explify-custom-ref-fasta

Reference FASTA file. Required for Custom reference DBs.

--explify-custom-ref-bed

Reference BED file. Optional for Custom reference DBs.

Example Command Line

dragen \
  --enable-explify=true \
  --output-file-prefix <PREFIX> \
  --explify-sample-list /path/to/sample/list/tsv \
  --explify-test-panel-name <"RPIP"/"UPIP"/"VSPv2"/"Custom"> \
  --explify-test-panel-version <VERSION> \
  --explify-ref-db-dir /path/to/root/db/dir \
  --explify-load-db-ram=true \
  --output-directory <OUTPUT_DIR> \
  --intermediate-results-dir <OUTPUT_DIR> \
  --explify-ncpus=1

Input Details

Sample Input List

Applies to: --explify-sample-list

The sample input list is a column-formatted file with tab separations between the columns (i.e., a .tsv file).

SampleID     BatchID     RunID     ControlFlag     FastQs
MySample     MyBatch     MyRun     POS             /path/to/fastq1.gz     /path/to/fastq2.gz

Notes:

The SampleID values must be unique.
BatchID and RunID are to help users track and manage sample analyses. Often the BatchID is used to track libraries that were prepared together, and the RunID is used to track sequencing runs. They can also be left blank.
The ControlFlag value can be POS, NEG, BLANK, or left empty.
- POS is used to indicate a positive control sample.
- NEG is used to indicate a negative control sample.
- BLANK is used to indicate a blank control sample (e.g. buffer only).
If there are multiple FASTQ files, they are tab delimited.
Please be very careful when editing tsv files. Some editors replace tabs with spaces without alerting the user.
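
The notes above can be turned into a quick pre-flight check. The following Python sketch validates a sample list against those rules: tab-separated fields, unique SampleID values, an accepted ControlFlag, and at least one FASTQ path. It assumes the first line is the header row shown in the example; the function name and error messages are hypothetical.

```python
VALID_CONTROL_FLAGS = {"POS", "NEG", "BLANK", ""}

EXAMPLE_SAMPLE_LIST = (
    "SampleID\tBatchID\tRunID\tControlFlag\tFastQs\n"
    "MySample\tMyBatch\tMyRun\tPOS\t/path/to/fastq1.gz\t/path/to/fastq2.gz\n"
)


def validate_sample_list(text):
    """Return a list of problems found; an empty list means no problems."""
    problems = []
    seen_ids = set()
    lines = text.splitlines()
    # Line 1 is assumed to be the header, so data starts at line 2.
    for number, line in enumerate(lines[1:], start=2):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            problems.append(
                "line %d: fewer than 5 tab-separated fields "
                "(were tabs replaced with spaces?)" % number
            )
            continue
        sample_id = fields[0].strip()
        control_flag = fields[3].strip()
        fastqs = [f.strip() for f in fields[4:] if f.strip()]
        if sample_id in seen_ids:
            problems.append("line %d: duplicate SampleID %r" % (number, sample_id))
        seen_ids.add(sample_id)
        if control_flag not in VALID_CONTROL_FLAGS:
            problems.append("line %d: invalid ControlFlag %r" % (number, control_flag))
        if not fastqs:
            problems.append("line %d: no FASTQ path given" % number)
    return problems


print(validate_sample_list(EXAMPLE_SAMPLE_LIST))
```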

Internal Control

Applies to: --explify-internal-control, --explify-internal-control-concentration

Allobacillus halotolerans
Armored RNA Quant Internal Process Control
Enterobacteria phage T7 (This is the default)
Escherichia virus MS2
Escherichia virus Qbeta
Escherichia virus T4
Imtechella halotolerans
Phocid alphaherpesvirus 1
Phocine morbillivirus
Truepera radiovictrix
NONE

The internal control concentration is an integer representing the number of copies/mL of sample for the internal control.

Reference Databases

Applies to: --explify-ref-db-dir, --explify-test-panel-name, --explify-test-panel-version, --explify-load-db-ram, --explify-custom-ref-fasta, --explify-custom-ref-bed

Directory Setup

Obtaining the Download Script

Download and management of Explify reference databases is handled by a shell script. The script can be downloaded with the following command:

wget -O explify-dbs.sh https://illumina-explify-databases.s3.us-east-1.amazonaws.com/explify-dbs.sh
chmod +x explify-dbs.sh

Seeing What Databases are Available for Download

The search subcommand can be used to list what databases can be downloaded:

$ ./explify-dbs.sh search -d explify-databases/
4 database(s) found meeting those criteria:
- Custom-1.0.0
- RPIP-6.5.1
- UPIP-8.6.0
- VSPv2-2.7.0

The -d argument is the base directory used for storage of the databases
Optionally, when a test panel name is specified with the -p argument, the results will be limited to that panel
Optionally, setting the -n argument will filter the search to databases that have not already been downloaded

Downloading a Database

The download subcommand is used to download the database files for a test panel:

./explify-dbs.sh download -d explify-databases/ -p UPIP -v 8.6.0 -n 20

The -d argument is the base directory used for storage of the databases
The -p argument is the test panel name
The -v argument is the test panel version
The -n argument is the number of CPUs that can be used to download the files (defaults to 1)

Additional notes:

In this example, after the UPIP-8.6.0 database files are downloaded, additional required files will be downloaded to a subdirectory named "common"
After the files are downloaded, their checksums will be automatically checked
Due to the size of some of the files, this command will take some time. It is best to run it via screen or nohup

Listing Downloaded Databases

The list subcommand is used to view the databases that have already been downloaded:

$ ./explify-dbs.sh list -d explify-databases/

The -d argument is the base directory used for storage of the databases
Optionally, when a test panel name is specified with the -p argument, the results will be limited to that panel

Checking Database Integrity

The download subcommand will automatically check the file checksums after download. The check subcommand can also be used on its own to check the files:

$ ./explify-dbs.sh check -d explify-databases/ -p UPIP -v 8.6.0 -n 20

The -d argument is the base directory used for storage of the databases
The -p argument is the test panel name
The -v argument is the test panel version
The -n argument is the number of CPUs that can be used to check the files (defaults to 1)

Using the Databases with the Explify Analysis Pipeline

explify-databases/
    Custom/
        1.0.0/
    RPIP/
        6.5.1/
    UPIP/
        8.6.0/
    VSPv2/
        2.7.0/

To run an analysis with RPIP 6.5.1, for example, the following inputs would be needed:

--explify-ref-db-dir /explify-databases
--explify-test-panel-name RPIP
--explify-test-panel-version 6.5.1

The Explify Analysis Pipeline will use these inputs to navigate to the specified database location, namely /explify-databases/RPIP/6.5.1.
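
The way the three inputs combine into a database location can be sketched in a few lines of Python. This only reproduces the root/panel/version layout described above so that a path can be checked before launching an analysis; the function name is hypothetical and the snippet is not part of the pipeline.

```python
from pathlib import PurePosixPath


def database_location(ref_db_dir, panel_name, panel_version):
    """Join the root directory, panel name, and panel version into one path."""
    return PurePosixPath(ref_db_dir) / panel_name / panel_version


print(database_location("/explify-databases", "RPIP", "6.5.1"))
```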

Using the Custom Database Option

The BED file must be tab-delimited with at least 4 columns:

chrom: the sequence name as it appears in the FASTA
chromStart: start position (always set to 0)
chromEnd: end position (sequence length)
genomeName: name of the genome, target, or microorganism the sequence belongs to (e.g. Monkeypox virus clade II)
segmentName (optional): the name of the segment or gene (e.g. Segment 4 (HA)). Set to 'Full' if the sequence is the full genome
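
A BED file in this layout can be derived from the custom reference FASTA. The Python sketch below computes each sequence length and emits one tab-delimited row per sequence with chromStart set to 0 and chromEnd set to the sequence length. It assumes that the sequence name is the first whitespace-delimited token of the FASTA header, which this section does not specify; the example sequences, genome name, and function names are hypothetical.

```python
# Invented two-segment FASTA for illustration.
EXAMPLE_FASTA = (
    ">seg4 example segment 4\n"
    "ACGTACGT\n"
    "ACG\n"
    ">seg6 example segment 6\n"
    "ACGT\n"
)


def fasta_lengths(fasta_text):
    """Map each sequence name (first header token, by assumption) to its length."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def bed_rows(fasta_text, genome_name, segment_names=None):
    """Build BED rows: chrom, 0, length, genomeName, and optional segmentName."""
    rows = []
    for chrom, length in fasta_lengths(fasta_text).items():
        fields = [chrom, "0", str(length), genome_name]
        if segment_names and chrom in segment_names:
            fields.append(segment_names[chrom])
        rows.append("\t".join(fields))
    return rows


SEGMENTS = {"seg4": "Segment 4 (HA)", "seg6": "Segment 6 (NA)"}
for row in bed_rows(EXAMPLE_FASTA, "Example virus", SEGMENTS):
    print(row)
```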

Output Details

ap.json format

Top-Level Node

The top-level section of the output JSON contains general metadata and version information.

Field

Description

.accession

Identifier used for the sample

.deploymentEnvironment

Environment in which the results were produced

.batchId

Identifier used for the batch of samples processed together

.analysisId

Identifier used for the analysis

.runId

Identifier used for the sequencing run

.controlFlag

Indicates whether the sample is a control. It is based on the ControlFlag field in the sample .tsv and can be set to "POS", "NEG", "BLANK", or "-"

.dragenVersion

DRAGEN release version

.analysisPipelineVersion

Analysis Pipeline release version

.testType

Type of test panel ("RPIP", "UPIP", "VSPv2", "Custom")

.testVersion

Test panel release version

.testName

Full name of test panel

.testUse

Test use. "For Research Use Only. Not for use in diagnostic procedures"

.reportTime

Date and time the report was generated

.warnings

List of warnings encountered during the analysis

.errors

List of errors encountered during the analysis

.qcReport.sampleQc Node

This section contains information about sample quality control (QC). The fields are relative to .qcReport.sampleQc

Field

Description

.totalRawBases

Number of base pairs in sample before read QC processing

.totalRawReads

Number of reads in sample before read QC processing

.uniqueReads

Number of distinct reads in sample before read QC processing

.uniqueReadsProportion

Proportion of distinct reads in sample before read QC processing

.preQualityMeanReadLength

Average read length before read QC processing

.postQualityMeanReadLength

Average read length after read QC processing

.postQualityReads

Number of reads in sample after read QC processing, inclusive of any duplicate reads

.postQualityReadsProportion

Proportion of post-quality reads in sample relative to total raw reads

.removedInDehostingReads

Number of host reads in sample removed during dehosting (host = human)

.removedInDehostingReadsProportion

Proportion of host reads in sample removed relative to total raw reads (host = human)

.entropy

Shannon entropy of the counts of 5-mers in the reads after read QC processing, which is a measure of randomness

.gContent

Proportion of guanine (G) base calls in reads after read QC processing

.libraryQScore

Quality score of the library after read QC processing
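
To illustrate what the .entropy field measures, the following Python sketch computes the Shannon entropy of 5-mer counts over a set of reads. It assumes overlapping 5-mers pooled across all reads and a base-2 logarithm. Neither detail is stated in this section, so values from this sketch are not expected to match the pipeline's output; the function name is hypothetical.

```python
import math
from collections import Counter


def kmer_entropy(reads, k=5):
    """Shannon entropy (base 2, by assumption) of pooled k-mer counts."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# A homopolymer read has a single distinct 5-mer, so no randomness.
print(kmer_entropy(["AAAAAAAAAA"]))
# A more varied read spreads counts over more 5-mers, giving a higher value.
print(kmer_entropy(["ACGTTGCATGCCGATAGCTAGG"]))
```

Low values indicate that a few 5-mers dominate the reads (for example, low-complexity or repetitive libraries), while higher values indicate a more even 5-mer distribution.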

.qcReport.enrichmentFactor Node

This section contains information about the enrichment factor calculation. Detection of an appropriate Internal Control is required. The fields are relative to .qcReport.enrichmentFactor

Field

Description

.value

Enrichment factor value reflecting how well targeted regions were enriched

.category

Enrichment factor category: "poor", "fair", "good", or "not calculated"

.qcReport.sampleComposition Node

This section contains information about the composition of the sample. The fields are relative to .qcReport.sampleComposition

Field

Description

.readClassification

Proportion of post-quality reads classified to the following categories:

.readClassification.targetedMicrobial

Targeted microbial

.readClassification.targetedInternalControl

Targeted Internal Control

.readClassification.untargeted

Untargeted

.readClassification.ambiguous

More than one category

.readClassification.unclassified

No category

.readClassification.lowComplexity

Low complexity

.targetedMicrobial

Proportion of post-quality targeted microbial reads classified to the following sub-categories:

.targetedMicrobial.viral

Viral targeted

.targetedMicrobial.bacterial

Bacterial targeted

.targetedMicrobial.fungal

Fungal targeted

.targetedMicrobial.parasitic

Parasitic targeted

.targetedMicrobial.bacterialAmr

Bacterial AMR targeted

.untargeted

Proportion of post-quality untargeted reads classified to the following sub-categories:

.untargeted.viral

Viral untargeted

.untargeted.bacterial

Bacterial untargeted

.untargeted.fungal

Fungal untargeted

.untargeted.parasitic

Parasitic untargeted

.untargeted.bacterialAmr

Bacterial AMR untargeted

.untargeted.internalControl

Internal Control untargeted

.untargeted.human

Human untargeted

.viral

Proportion of post-quality viral reads classified to the following categories:

.viral.targeted

Viral targeted

.viral.untargeted

Viral untargeted

.viral.untargetedSubcategories

Proportion of post-quality viral untargeted reads classified to the following sub-categories:

.viral.untargetedSubcategories.panel

Viral panel members

.viral.untargetedSubcategories.phage

Viral phage

.viral.untargetedSubcategories.other

Viral other (not a panel member or phage)

.bacterial

Proportion of post-quality bacterial reads classified to the following categories:

.bacterial.targeted

Bacterial targeted

.bacterial.untargeted

Bacterial untargeted

.bacterial.untargetedSubcategories

Proportion of post-quality bacterial untargeted reads classified to the following sub-categories:

.bacterial.untargetedSubcategories.panel

Bacterial panel members

.bacterial.untargetedSubcategories.ribosomalDna

Bacterial ribosomal DNA (16S)

.bacterial.untargetedSubcategories.plasmid

Bacterial plasmids

.bacterial.untargetedSubcategories.other

Bacterial other (not a panel member, ribosomal DNA, or plasmid)

.fungal

Proportion of post-quality fungal reads classified to the following categories:

.fungal.targeted

Fungal targeted

.fungal.untargeted

Fungal untargeted

.fungal.untargetedSubcategories

Proportion of post-quality fungal untargeted reads classified to the following sub-categories:

.fungal.untargetedSubcategories.panel

Fungal panel members

.fungal.untargetedSubcategories.ribosomalDna

Fungal ribosomal DNA (18S)

.fungal.untargetedSubcategories.other

Fungal other (not a panel member or ribosomal DNA)

.parasitic

Proportion of post-quality parasitic reads classified to the following categories:

.parasitic.targeted

Parasitic targeted

.parasitic.untargeted

Parasitic untargeted

.parasitic.untargetedSubcategories

Proportion of post-quality parasitic untargeted reads classified to the following sub-categories:

.parasitic.untargetedSubcategories.panel

Parasitic panel members

.parasitic.untargetedSubcategories.ribosomalDna

Parasitic ribosomal DNA (18S)

.parasitic.untargetedSubcategories.other

Parasitic other (not a panel member or ribosomal DNA)

.human

Proportion of post-quality human reads classified to the following categories:

.human.untargeted

Human untargeted

.human.untargetedSubcategories

Proportion of post-quality human untargeted reads classified to the following sub-categories:

.human.untargetedSubcategories.ribosomalDna

Human ribosomal DNA

.human.untargetedSubcategories.codingSequence

Human coding sequence

.human.untargetedSubcategories.other

Human other (not ribosomal DNA or coding sequence)

.internalControl

Proportion of post-quality Internal Control reads classified to the following categories:

.internalControl.targeted

Internal Control targeted

.internalControl.untargeted

Internal Control untargeted

.microbialAndInternalControl

Proportion of post-quality Microbial and Internal Control reads classified to the following categories:

.microbialAndInternalControl.targeted

Microbial and Internal Control targeted

.microbialAndInternalControl.untargeted

Microbial and Internal Control untargeted

.bacterialAmr

Proportion of post-quality bacterial AMR reads classified to the following categories:

.bacterialAmr.targeted

Bacterial AMR targeted

.bacterialAmr.untargeted

Bacterial AMR untargeted
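
The sketch below shows one way to rank the top-level read classification categories. It assumes the parsed report is a dictionary nested as the field paths suggest and that each category holds a numeric proportion. The function name and all proportions are hypothetical.

```python
def top_read_classes(report, n=3):
    """Return the n largest readClassification categories as (name, proportion)."""
    classes = report["qcReport"]["sampleComposition"]["readClassification"]
    ranked = sorted(classes.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Made-up proportions for illustration only.
example_report = {
    "qcReport": {
        "sampleComposition": {
            "readClassification": {
                "targetedMicrobial": 0.40,
                "targetedInternalControl": 0.05,
                "untargeted": 0.45,
                "ambiguous": 0.02,
                "unclassified": 0.07,
                "lowComplexity": 0.01,
            }
        }
    }
}
print(top_read_classes(example_report, n=2))
```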

.qcReport.internalControls Node

[
    {
        "name": "Allobacillus halotolerans",
        "rpkm": 0
    },
    {
        "name": "Armored RNA Quant Internal Process Control",
        "rpkm": 0
    },
    {
        "name": "Enterobacteria phage T7",
        "rpkm": 180323
    },
    {
        "name": "Escherichia virus MS2",
        "rpkm": 0
    },
    {
        "name": "Escherichia virus Qbeta",
        "rpkm": 0
    },
    {
        "name": "Escherichia virus T4",
        "rpkm": 0
    },
    {
        "name": "Imtechella halotolerans",
        "rpkm": 0
    },
    {
        "name": "Phocid alphaherpesvirus 1",
        "rpkm": 0
    },
    {
        "name": "Phocine morbillivirus",
        "rpkm": 0
    },
    {
        "name": "Truepera radiovictrix",
        "rpkm": 0
    }
]
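
A short sketch of reading this array: it lists the Internal Controls that have a nonzero RPKM. The JSON string is a shortened copy of the example above, and the function name is hypothetical.

```python
import json

# Shortened copy of the example array shown above.
internal_controls_json = """
[
    {"name": "Enterobacteria phage T7", "rpkm": 180323},
    {"name": "Escherichia virus MS2", "rpkm": 0},
    {"name": "Escherichia virus T4", "rpkm": 0}
]
"""


def controls_with_reads(controls):
    """Return the names of Internal Controls whose RPKM is greater than zero."""
    return [control["name"] for control in controls if control["rpkm"] > 0]


controls = json.loads(internal_controls_json)
print(controls_with_reads(controls))
```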

.userOptions Node

This section gives information about analysis options specified by the user. The fields are relative to .userOptions.

Field

Description

.quantitativeInternalControlName

Quantitative Internal Control used for microorganism absolute quantification (recommendation: Enterobacteria phage T7)

.quantitativeInternalControlConcentration

Quantitative Internal Control concentration (recommendation: 1.21 x 10^7 copies/mL of sample)

.readQcEnabled

Boolean indicating if read QC (trimming and filtering based on quality and read length) is enabled

.readClassificationSensitivity

.customPanelFastaFile

(Custom Panel only) Name of the custom reference FASTA file

.customPanelBedFile

(Custom Panel only) Name of the custom reference BED file

.targetReport.microorganisms[] Node

Field

Description

.class

Microorganism class ("viral", "bacterial", "fungal", "parasite")

.name

Name of microorganism

.coverage

Proportion of targeted microorganism reference sequence bases that appear in sample sequencing reads

.ani

Average nucleotide identity of consensus sequence to targeted microorganism reference sequences

.medianDepth

Median depth of sample sequencing reads aligned to targeted microorganism reference sequences

.condensedDepthVector

Read depth across the targeted microorganism reference sequences, condensed to 256 bins

.rpkm

.alignedReadCount

Number of sample sequencing reads that aligned to targeted microorganism reference sequences

.kmerReadCount

(UPIP only) Number of sample sequencing reads classified to targeted microorganism reference sequences

.absoluteQuantityRatio

Numerical absolute quantification value. Quantitative internal control required for calculation

.absoluteQuantityRatioFormatted

Formatted absolute quantification value with units. Quantitative internal control required for calculation

.phenotypicGroup

(RPIP, UPIP only) Grouping indicating general association with normal flora, colonization, or contamination from the environment or other sources, as well as general association with disease

.associatedAmrMarkers

(Bacteria only) Information about the bacterial AMR markers associated with the microorganism

.associatedAmrMarkers.applicable

Boolean indicating whether one or more bacterial AMR markers are associated with the microorganism

.associatedAmrMarkers.detected

List of detected bacterial AMR markers associated with the microorganism

.associatedAmrMarkers.predicted

List of predicted bacterial AMR markers associated with the microorganism

.consensusGenomeSequences

(RPIP, VSP V2 viruses only) Information about the majority consensus genome (or segment) sequence

.consensusGenomeSequences.sequence

Consensus genome (or segment) sequence bases

.consensusGenomeSequences.referenceAccession

Accession of the reference genome (or segment) sequence

.consensusGenomeSequences.referenceDescription

Description of the reference genome (or segment) sequence

.consensusGenomeSequences.referenceLength

Length of the reference genome (or segment) sequence

.consensusGenomeSequences.maximumAlignmentLength

Longest contiguous alignment between consensus sequence and reference genome (or segment) sequence

.consensusGenomeSequences.maximumGapLength

Longest contiguous alignment gap (insertion or deletion) between consensus sequence and reference genome (or segment) sequence

.consensusGenomeSequences.maximumUnalignedLength

Longest section of the reference genome (or segment) sequence not aligned to by consensus sequence

.consensusGenomeSequences.coverage

Proportion of reference genome (or segment) sequence bases that appear in sample sequencing reads

.consensusGenomeSequences.ani

Average nucleotide identity of consensus sequence to reference genome (or segment) sequence

.consensusGenomeSequences.alignedReadCount

Number of sample sequencing reads that aligned to reference genome (or segment) sequence

.consensusGenomeSequences.medianDepth

Median depth of sample sequencing reads aligned to reference genome (or segment) sequence

.consensusGenomeSequences.targetAnnotation

.consensusGenomeSequences.condensedDepthVector

Read depth across the reference genome (or segment) sequence, condensed to 256 bins

.consensusTargetSequences

(RPIP viruses only) Information about the majority targeted region consensus sequences

.consensusTargetSequences.sequence

Consensus targeted region sequence bases

.consensusTargetSequences.name

Name of the targeted region

.consensusTargetSequences.referenceAccession

Accession of the targeted region reference sequence

.consensusTargetSequences.depthVector

Read depth across the targeted region reference sequence, not condensed

.predictionInformation

Information about microorganism prediction results

.predictionInformation.predictedPresent

Boolean indicating whether the microorganism passed its reporting logic algorithm

.predictionInformation.notes

List of notes about the prediction result

.predictionInformation.subpanels

List of pre-defined subpanels that the microorganism belongs to

.predictionInformation.relatedMicroorganisms

Array of objects with information about genetically related microorganisms. See below for details

.variants

(All VSP V2 viruses; for RPIP, SARS-CoV-2 and Influenza A/B/C only) Information about viral variants. See below for details
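
The following sketch lists the microorganisms that passed their reporting logic. It assumes a parsed report dictionary in which .predictionInformation is an object, as the field paths in the table suggest. The function name, organism names, and coverage values are hypothetical.

```python
def predicted_microorganisms(report):
    """Return (name, class, coverage) for microorganisms flagged as predicted present."""
    hits = []
    for organism in report.get("targetReport", {}).get("microorganisms", []):
        prediction = organism.get("predictionInformation", {})
        if prediction.get("predictedPresent"):
            hits.append((organism.get("name"), organism.get("class"), organism.get("coverage")))
    return hits


# Hypothetical entries; names and coverage values are made up.
example_report = {
    "targetReport": {
        "microorganisms": [
            {"class": "viral", "name": "Organism A", "coverage": 0.98,
             "predictionInformation": {"predictedPresent": True}},
            {"class": "bacterial", "name": "Organism B", "coverage": 0.02,
             "predictionInformation": {"predictedPresent": False}},
        ]
    }
}
print(predicted_microorganisms(example_report))
```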

.targetReport.microorganisms[].predictionInformation[].relatedMicroorganisms[] Node

Field

Description

.name

Name of related microorganism

.onPanel

Boolean indicating whether the related microorganism is a panel member

.kmerReadCount

(UPIP only) Number of sample sequencing reads classified to related microorganism reference sequences

.coverage

Proportion of related microorganism reference sequence bases that appear in sample sequencing reads

.ani

Average nucleotide identity of consensus sequence to related microorganism reference sequences

.alignedReadCount

Number of sample sequencing reads that aligned to related microorganism reference sequences

.targetReport.microorganisms[].variants[] Node

Field

Description

.referenceAccession

Accession of reference genome (or segment) sequence used for variant calling

.segment

(Segmented viruses only) Segment number of reference segment sequence

.ntChange

Nucleotide change associated with variant

.referencePosition

Variant position in viral reference genome (or segment) sequence

.referenceAllele

Reference allele at variant position

.variantAllele

Variant allele

.depth

Variant depth, indicating the number of times variant position appears in sample sequencing reads

.alleleFrequency

Frequency of variant allele in sample sequencing reads
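
As an illustration of working with the variant fields, the sketch below keeps variants that meet a minimum depth and allele frequency. The thresholds, the function name, the ntChange strings, and the assumption that .alleleFrequency is a fraction between 0 and 1 are all hypothetical; check them against real output before relying on them.

```python
def filter_variants(variants, min_depth=10, min_allele_frequency=0.5):
    """Keep variants whose depth and allele frequency meet both thresholds."""
    kept = []
    for variant in variants:
        deep_enough = variant["depth"] >= min_depth
        frequent_enough = variant["alleleFrequency"] >= min_allele_frequency
        if deep_enough and frequent_enough:
            kept.append(variant)
    return kept


# Made-up variants; the ntChange notation is a placeholder.
example_variants = [
    {"ntChange": "A100G", "referencePosition": 100, "referenceAllele": "A",
     "variantAllele": "G", "depth": 250, "alleleFrequency": 0.97},
    {"ntChange": "C200T", "referencePosition": 200, "referenceAllele": "C",
     "variantAllele": "T", "depth": 4, "alleleFrequency": 0.75},
    {"ntChange": "G300A", "referencePosition": 300, "referenceAllele": "G",
     "variantAllele": "A", "depth": 120, "alleleFrequency": 0.08},
]
print([variant["ntChange"] for variant in filter_variants(example_variants)])
```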

.targetReport.amrMarkers[] Node

Field

Description

.class

Microorganism class ("bacterial")

.cardModelType

Bacterial AMR marker model type in the Comprehensive Antibiotic Resistance Database (CARD) ("homolog", "protein variant", "rRNA variant")

.cardGeneFamily

Bacterial AMR marker gene family in the Comprehensive Antibiotic Resistance Database (CARD)

.name

Bacterial AMR marker name

.cardName

Bacterial AMR marker name in the Comprehensive Antibiotic Resistance Database (CARD)

.ncbiName

Bacterial AMR marker name in the National Center for Biotechnology Information (NCBI) Reference Gene Catalog

.referenceAccession

Accession of the bacterial AMR marker reference sequence

.coverage

.pid

.medianDepth

.rpkm

.alignedReadCount

.nucleotideConsensusSequence

Nucleotide consensus sequence bases

.proteinConsensusSequence

Protein consensus sequence amino acids

.nucleotideDepthVector

Read depth across the bacterial AMR marker nucleotide reference sequence, not condensed

.proteinDepthVector

Read depth across the bacterial AMR marker protein reference sequence, not condensed

.associatedMicroorganisms

Information about the microorganisms associated with the bacterial AMR marker

.associatedMicroorganisms.all

List of all microorganisms associated with the bacterial AMR marker

.associatedMicroorganisms.detected

List of detected microorganisms associated with the bacterial AMR marker

.associatedMicroorganisms.predicted

List of predicted microorganisms associated with the bacterial AMR marker

.predictionInformation

Information about bacterial AMR marker prediction results

.predictionInformation.predictedPresent

Boolean indicating whether the bacterial AMR marker passed its reporting logic algorithm

.predictionInformation.confidence

Confidence level of bacterial AMR marker prediction ("high", "medium", "low")

.predictionInformation.notes

List of notes about the prediction result
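
The sketch below groups predicted bacterial AMR markers by their reported confidence level. It assumes a parsed report dictionary nested as the field paths suggest. The function name and marker names are hypothetical.

```python
from collections import defaultdict


def amr_by_confidence(report):
    """Group the names of predicted-present AMR markers by confidence level."""
    grouped = defaultdict(list)
    for marker in report.get("targetReport", {}).get("amrMarkers", []):
        prediction = marker.get("predictionInformation", {})
        if prediction.get("predictedPresent"):
            grouped[prediction.get("confidence")].append(marker.get("name"))
    return dict(grouped)


# Made-up markers for illustration only.
example_report = {
    "targetReport": {
        "amrMarkers": [
            {"name": "marker-1",
             "predictionInformation": {"predictedPresent": True, "confidence": "high"}},
            {"name": "marker-2",
             "predictionInformation": {"predictedPresent": True, "confidence": "low"}},
            {"name": "marker-3",
             "predictionInformation": {"predictedPresent": False, "confidence": "high"}},
        ]
    }
}
print(amr_by_confidence(example_report))
```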

.targetReport.amrMarkers[].variants[] Node

Field

Description

.category

Variant category ("Bacterial Variant; Known AMR")

.referenceSourceMicroorganism

Microorganism that reference sequence is associated with in NCBI

.comments

List of additional information regarding the variant

.product

Protein product of gene

.ntChange

Nucleotide change associated with variant

.referencePosition

Variant position in reference sequence

.referenceAllele

Reference allele at variant position

.variantAllele

Variant allele

.depth

Variant depth, indicating the number of times variant position appears in sample sequencing reads

.alleleFrequency

Frequency of variant allele in sample sequencing reads

.annotation

Type of change (e.g. "Nonsynonymous Variant")

.aaChange

Amino acid change associated with variant

.epistaticGroups

List of epistatic groups variant is associated with

.targetReport.customReferences[] Node

Field

Description

.name

Provided name of custom reference sequence, accession, genome, or microorganism

.coverage

Proportion of custom reference sequence bases that appear in sample sequencing reads

.ani

Average nucleotide identity of consensus sequence to custom reference sequence or, if specified, collection of one or more custom reference sequences

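.medianDepth

Median depth of sample sequencing reads aligned to custom reference sequence or, if specified, collection of one or more custom reference sequences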

.condensedDepthVector

Read depth across custom reference sequence or, if specified, collection of one or more custom reference sequences, condensed to 256 bins

.rpkm

.alignedReadCount

Number of sample sequencing reads that aligned to custom reference sequence or, if specified, collection of one or more custom reference sequences

.consensusSequences

Array of objects with information about each consensus sequence. See below for details

.variants

Array of objects with information about variants detected in custom reference sequence or, if specified, collection of one or more custom reference sequences. See below for details

.targetReport.customReferences[].consensusSequences[] Node

Field

Description

.sequence

Majority consensus sequence bases

.referenceAccession

Accession of custom reference sequence

.referenceDescription

Description of custom reference sequence

.referenceLength

Length of custom reference sequence

.coverage

Proportion of custom reference sequence bases that appear in sample sequencing reads

.ani

Average nucleotide identity of consensus sequence to custom reference sequence

.medianDepth

Median depth of sample sequencing reads aligned to custom reference sequence, indicating the median number of times each custom reference sequence base appears in sample sequencing reads

.depthVector

Read depth across custom reference sequence, not condensed

.alignedReadCount

Number of sample sequencing reads that aligned to custom reference sequence

.maximumAlignmentLength

Longest contiguous alignment between consensus sequence and custom reference sequence

.maximumGapLength

Longest contiguous alignment gap (insertion or deletion) between consensus sequence and custom reference sequence

.maximumUnalignedLength

Longest section of custom reference sequence not aligned to by consensus sequence
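
One possible use of these fields is exporting the consensus sequences as FASTA text. The sketch below assumes a single parsed .targetReport.customReferences[] entry as a Python dictionary. The function name, the accession, and the sequence are hypothetical.

```python
def consensus_to_fasta(custom_reference, width=60):
    """Format each consensus sequence of one custom reference entry as a FASTA record."""
    records = []
    for consensus in custom_reference.get("consensusSequences", []):
        accession = consensus["referenceAccession"]
        description = consensus.get("referenceDescription", "")
        header = f">{accession} {description}".rstrip()
        sequence = consensus["sequence"]
        # Wrap the sequence into fixed-width lines.
        lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
        records.append("\n".join([header] + lines) + "\n")
    return "".join(records)


# Made-up entry for illustration only.
example_reference = {
    "name": "demo",
    "consensusSequences": [
        {"sequence": "ACGTACGTAC", "referenceAccession": "ACC1",
         "referenceDescription": "demo reference"}
    ],
}
fasta_text = consensus_to_fasta(example_reference, width=4)
print(fasta_text, end="")
```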

.targetReport.customReferences[].variants[] Node

Field

Description

.ntChange

Nucleotide change associated with variant

.referenceAccession

Accession of custom reference sequence used for variant calling

.referencePosition

Variant position in custom reference sequence

.referenceAllele

Reference allele at variant position

.variantAllele

Variant allele

.depth

Variant depth, indicating the number of times variant position appears in sample sequencing reads

.alleleFrequency

Frequency of variant allele in sample sequencing reads

Kmer Classifier

Description

Command Line Settings

Option

Description

Required Inputs

--enable-kmer-classifier

Enables the Kmer Classifier. (Default=false).

--output-file-prefix

Prefix for all output files.

--output-directory

Directory for all output files.

--kmer-classifier-input-read-file

Input sequence file (zipped or unzipped) to the Kmer Classifier.

--kmer-classifier-db-file

Database of sequences to classify against.

Optional Inputs

--intermediate-results-dir

Directory for temporary files. Available space must be greater than twice the combined size of all FASTQ files.

--kmer-classifier-load-db-ram

Load the database onto RAM. Do not use if database is on ramdisk. (Default=false).

--kmer-classifier-multiple-inputs

Set to true to run with multiple inputs. When enabled, the input read file must be a .tsv file with three columns: Sample ID, Read 1 file, and (optional) Read 2 file. (Default=false).

--kmer-classifier-min-window

Minimum number of consecutive kmers required to classify a read to a taxid. (Default=1).

--kmer-classifier-output-read-seq

Option to enable read sequence column in the output file. (Default=false).

--kmer-classifier-output-taxid-seq

Option to enable a taxid string column in the output file. (Default=false).

--kmer-classifier-db-to-taxid-json

Path to JSON file that maps database IDs to external taxids, names, and ranks.

--kmer-classifier-no-read-output

Option to not create individual read output. (Default=false).

--kmer-classifier-no-taxid-counts

Option to not write taxid count output file. (Default=false).

--kmer-classifier-protein-input

Option to indicate protein query sequences. To use this option, the reference sequence database MUST be of protein sequences. (Default=false).

--kmer-classifier-ncpus

Option to set the number of CPUs available for processing.

Example Command Line

dragen \
  --enable-kmer-classifier=true \
  --output-file-prefix <PREFIX> \
  --output-directory <OUTPUT_DIR> \
  --kmer-classifier-input-read-file /path/to/fastq.gz \
  --kmer-classifier-db-file /path/to/database \
  --kmer-classifier-min-window 1 \
  --kmer-classifier-ncpus=2 \
  --kmer-classifier-output-read-seq=false \
  --kmer-classifier-output-taxid-seq=false
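
When the classifier is launched from a script, the same command can be assembled as an argument list. This is a sketch only: it uses flags documented in the table above, the function name is hypothetical, and the prefix and paths are placeholders. The list could be passed to subprocess.run on a system where dragen is installed.

```python
import shlex


def kmer_classifier_command(read_file, db_file, prefix, output_dir, ncpus=2, min_window=1):
    """Build a dragen argument list for the Kmer Classifier from documented options."""
    return [
        "dragen",
        "--enable-kmer-classifier=true",
        "--output-file-prefix", prefix,
        "--output-directory", output_dir,
        "--kmer-classifier-input-read-file", read_file,
        "--kmer-classifier-db-file", db_file,
        "--kmer-classifier-min-window", str(min_window),
        f"--kmer-classifier-ncpus={ncpus}",
    ]


# Placeholder values; replace with real paths before running.
command = kmer_classifier_command(
    "/path/to/fastq.gz", "/path/to/database", "sample1", "/path/to/output"
)
print(shlex.join(command))
```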

Input Details

Input Reads

Applies to: --kmer-classifier-input-read-file, --kmer-classifier-multiple-inputs

Reference Sequences

Applies to: --kmer-classifier-db-file, --kmer-classifier-db-to-taxid-json, --kmer-classifier-load-db-ram

DB TaxID JSON Mapping File

Applies to: --kmer-classifier-db-to-taxid-json

 {
   "2": {"taxid": 2, "name": "bacteria", "rank": "kingdom"},
   "3": {"taxid": 2697049, "name": "SARS-CoV-2", "rank": "subspecies"},
   "4": {"taxid": 5052, "name": "Aspergillus", "rank": "genus"}
 }

The internal taxids are used in the output files. This JSON file can be used to map the results to taxids from the NCBI taxonomy.
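
A minimal sketch of that mapping step, using the example JSON above. The function name is hypothetical. JSON object keys are strings, so an integer ID read from an output file is converted before lookup.

```python
import json

# Copy of the example mapping shown above.
mapping_json = """
{
  "2": {"taxid": 2, "name": "bacteria", "rank": "kingdom"},
  "3": {"taxid": 2697049, "name": "SARS-CoV-2", "rank": "subspecies"},
  "4": {"taxid": 5052, "name": "Aspergillus", "rank": "genus"}
}
"""


def lookup_external_taxid(db_taxid, mapping):
    """Return the mapped entry for an internal database taxid, or None if absent."""
    return mapping.get(str(db_taxid))


mapping = json.loads(mapping_json)
entry = lookup_external_taxid(3, mapping)
print(entry["taxid"], entry["name"], entry["rank"])
```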

Downloading Reference Sequence Databases and Mapping Files

Genome Database

To download the reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.refseq_genomes.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.refseq_genomes.name_map.json

Genome and NT Database

To download the reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.genomes_plus_nt.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.genomes_plus_nt.name_map.json

To download the compressed reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.genomes_plus_nt.compressed.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.genomes_plus_nt.name_map.json

UniRef90 Database

This database includes all protein sequences of the UniRef90 database. The sequences were collected in March 2024, and the March 28, 2024 NCBI taxonomy was used to build the database.

To download the reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.u90_all.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.u90_all.name_map.json

16S Database

This database includes full-length bacterial 16S sequences from NCBI. The sequences were collected in April 2024, and the March 28, 2024 NCBI taxonomy was used to build the database.

To download the reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.16S.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.16S.name_map.json

Output Details

There are two output files, one organized around the reads, and the other organized around the taxids.

Read-level Output

Column

Description

Data Type

Read index

integer

Read name

string

Taxid the read classified to

integer

Maximum number of contiguous kmers that classified to this taxid

integer

Score assigned to the classification

integer

Number of kmers that classified to this taxid

integer

Read duplication count

integer

Name associated with taxid, if given with --kmer-classifier-db-to-taxid-json

string

Taxonomic rank associated with taxid, if given with --kmer-classifier-db-to-taxid-json

string

Taxid that each kmer classified to (is output when the --kmer-classifier-output-taxid-seq flag is set)

list of integers separated by commas

Read sequence (is output when the --kmer-classifier-output-read-seq flag is set)

string

TaxID-level Output

The second output file is a .tsv file with the extension .classifier.taxid_kmer_counts.tsv. It has a header line and has tab-separated columns. It summarizes the results for each taxid.

Header

Description

Data Type

db_taxid

Identifier for this taxid used internally in the database

integer

duplicity

Ratio of total number of kmers from reads assigned to this taxid compared to the number of distinct kmers from reads assigned to this taxid

float

distinct_coverage

Percent of kmers in the database assigned to this taxid that are covered by kmers in the reads assigned to this taxid

integer

read_count

Number of reads that classified to this taxid

integer

total_kmer_count

Number of kmers that classified to this taxid

integer

distinct_kmer_count

Number of distinct kmers that classified to this taxid

integer

cumulative_read_count

Cumulative number of reads assigned to this taxid and its taxonomic descendants

integer

taxid

Taxid

integer

name

Name associated with the taxid, if given with --kmer-classifier-db-to-taxid-json

string

rank

Taxonomic rank of the taxid, if given with --kmer-classifier-db-to-taxid-json

string

taxid_distinct_kmer_count

Number of distinct kmers assigned to this taxid from the reference sequences

integer

probability_present

Not in use

float
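
The sketch below reads a taxid-level table by header name and recomputes the total-to-distinct kmer ratio that the duplicity column describes. The rows are made up, the function name and read-count threshold are hypothetical, and a real .classifier.taxid_kmer_counts.tsv file would be opened with open() instead of io.StringIO.

```python
import csv
import io

HEADER = [
    "db_taxid", "duplicity", "distinct_coverage", "read_count", "total_kmer_count",
    "distinct_kmer_count", "cumulative_read_count", "taxid", "name", "rank",
    "taxid_distinct_kmer_count", "probability_present",
]
# Made-up rows for illustration only.
ROWS = [
    [3, 2.0, 10, 50, 400, 200, 50, 2697049, "SARS-CoV-2", "subspecies", 2000, 0.0],
    [4, 1.5, 1, 5, 30, 20, 5, 5052, "Aspergillus", "genus", 2000, 0.0],
]
tsv_text = "\n".join("\t".join(str(cell) for cell in row) for row in [HEADER] + ROWS) + "\n"


def summarize_taxids(tsv_handle, min_reads=10):
    """Return (name, read_count, total/distinct kmer ratio) sorted by read count."""
    summary = []
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        reads = int(row["read_count"])
        distinct = int(row["distinct_kmer_count"])
        if reads < min_reads or distinct == 0:
            continue
        ratio = int(row["total_kmer_count"]) / distinct
        summary.append((row["name"], reads, ratio))
    return sorted(summary, key=lambda item: item[1], reverse=True)


result = summarize_taxids(io.StringIO(tsv_text))
print(result)
```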