
Illumina Infectious Disease Software


Illumina Infectious Disease and Microbiology Software

This page provides an overview of the software available on Illumina's cloud platforms

Infectious Disease and Microbiology software includes powerful bioinformatics tools to analyze NGS data ranging from single microbial genomes to complex microbial communities of thousands of viruses, bacteria, parasites, and fungi. This comprehensive suite of secondary analysis tools supports target-specific workflows, such as amplicon and hybrid capture enrichment sequencing, as well as generalized microbiology methods like small WGS, shotgun sequencing, or 16S amplicon sequencing. All tools are available on BaseSpace, and some are also available on-board select Illumina sequencers.

Click the links below to learn more about our currently-available infectious disease software products:

How to start

  1. Launch the DRAGEN Microbial Amplicon BaseSpace application.

  2. Choose the analysis name and destination project to save results to.

  3. Choose either Biosample or Project as input method. Selecting Project will result in all biosamples in the selected project being analyzed.

  4. Choose the appropriate Amplicon Primer Set that matches the primer design used to prepare your samples or choose Custom to provide your own. See this page to learn more about the custom option.

  5. If needed, uncheck the appropriate boxes to disable Pangolin and Nextclade analyses. These tools perform phylogenetic analysis and lineage assignment for SARS-CoV-2 (Pangolin) and other viruses (Nextclade). Depending on the chosen Amplicon Primer Set, these tools may not be applicable.

  6. If needed, expand the Advanced Workflow Settings box to change default settings. Click on the "i" circle next to each setting for more information.

  7. If needed, expand the Additional DRAGEN Command Line Arguments to provide additional arguments to default DRAGEN commands.

  8. Click "Launch Application".

Reference BED file format

A BED-like tab-separated value (TSV) file with no header row and with 4 or 5 columns:

  1. accession: each sequence accession as it appears in the Custom Reference FASTA header

  2. start: start position (always set to 0)

  3. end: end position (sequence length)

  4. genome: full name of the virus the sequence belongs to (e.g. Influenza A H1N1)

  5. (optional) segment: how this sequence is labeled within the virus (e.g. Segment 4 (HA)). Set to 'Full' if the sequence is the full genome

Guidelines

  • This file affects how sequences are labeled in the output.

  • Sequence names must match those in Custom Reference FASTA. The same set of sequences must appear in both.

  • If there are multiple viruses, their names should be unique. For example, if there are multiple Influenza genomes, they should not be labeled with the same virus name in the 4th column.

  • If the Custom Reference FASTA includes sequences from multiple segments, it is strongly recommended to provide this BED file. Otherwise, each segment will be treated independently and not all of them may be used as reference.
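The column rules and guidelines above can be checked before upload with a short script. The following is a minimal sketch, not part of the app: the function names `fasta_lengths` and `check_reference_bed` are hypothetical, and the check that `end` equals the FASTA sequence length is an interpretation of "end position (sequence length)".

```python
def fasta_lengths(fasta_text):
    """Map each accession (header text before the first whitespace) to its sequence length."""
    lengths = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def check_reference_bed(bed_text, fasta_text):
    """Return a list of problems; an empty list means none were detected."""
    lengths = fasta_lengths(fasta_text)
    problems = []
    seen = set()
    for line_no, line in enumerate(bed_text.splitlines(), start=1):
        if not line.strip():
            continue
        cols = line.split("\t")  # tab-separated, no header row
        if len(cols) not in (4, 5):
            problems.append(f"line {line_no}: expected 4 or 5 columns, got {len(cols)}")
            continue
        accession, start, end = cols[0], cols[1], cols[2]
        seen.add(accession)
        if accession not in lengths:
            problems.append(f"line {line_no}: {accession} is not in the FASTA")
            continue
        if start != "0":
            problems.append(f"line {line_no}: start must be 0")
        if end != str(lengths[accession]):
            problems.append(f"line {line_no}: end should equal the sequence length {lengths[accession]}")
    # The same set of sequences must appear in both files.
    for accession in sorted(set(lengths) - seen):
        problems.append(f"{accession} is in the FASTA but not in the BED")
    return problems
```

Running `check_reference_bed` on the text of both files returns human-readable messages, so formatting problems can be fixed before the files are uploaded to a BaseSpace project.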

Example

NC_012532.1 0   10794   Zika    Full
NC_007373.1 0   2341    Influenza A virus (H3N2)    Segment 1 (PB2)
NC_007372.1 0   2341    Influenza A virus (H3N2)    Segment 2 (PB1)
NC_007371.1 0   2233    Influenza A virus (H3N2)    Segment 3 (PA+PA-X)
NC_007366.1 0   1762    Influenza A virus (H3N2)    Segment 4 (HA)
NC_007369.1 0   1566    Influenza A virus (H3N2)    Segment 5 (NP)
NC_007368.1 0   1467    Influenza A virus (H3N2)    Segment 6 (NB+NA)
NC_007367.1 0   1027    Influenza A virus (H3N2)    Segment 7 (M1+M2)
NC_007370.1 0   890     Influenza A virus (H3N2)    Segment 8 (NS1+NEP)

DRAGEN Microbial Amplicon App Documentation

Overview

DRAGEN Microbial Amplicon is a software application designed to analyze sequencing data from amplicon library preps (both DNA and RNA) on microbiological samples, with an emphasis on viruses. Illumina sequencing reads are processed to generate consensus sequences that represent a best estimate of the population of viral sequences in each sample. Where appropriate, these consensus sequences are further analyzed by the phylogenetic analysis tools Nextclade and/or Pangolin to provide an identification of the clade or lineage of each sequence.

Input

Data can be provided in one of the following ways:

  • Samples / biosamples with FASTQ datasets (see details in library preparation documents)

  • A project containing one or more samples / biosamples with FASTQ datasets

    • All samples / biosamples in the selected project will be analyzed

Supported amplicon primer schemes

  • Chikungunya

    • Illumina

    • Grubaugh Lab

  • Dengue

    • Serotype 1 - Illumina

    • All serotypes - DengueSeq from Grubaugh Lab

  • Influenza A / B - Universal

  • Mpox

    • Clade I - Illumina

    • Pan-clade - ARTIC

    • Clade II - Grubaugh Lab

  • RSV

    • CDC

    • WCCRRI

  • SARS-CoV-2 - ARTIC

    • v5.4.2

    • v5.3.2, v4.1, v4, v3

  • Zika - Grubaugh Lab

Custom genome and primer sets

Users can upload custom files to provide user-defined reference genome set and primer definitions. Multiplexed amplicon panels targeting multiple organisms in the same reaction are supported.

Pipeline steps

  1. Trim and filter reads using Trimmomatic

  2. Remove off-target reads using DRAGEN v4.3.6 kmer classifier (for custom reference, remove human reads using a modified version of the SRA Human Read Scrubber tool v2.2.1)

  3. For organisms with one default reference genome, skip this step. For organisms with multiple candidates, trim primer sequences in reads using Trimmomatic, perform assembly using MEGAHIT, cluster contigs using CD-HIT-EST, map contigs to candidate reference genomes using minimap2, then select reference genomes based on the mapping

  4. Align reads to the default reference genome or selected reference genomes using DRAGEN v4.3.6

  5. Trim primer sequences in aligned reads based on coordinates

  6. Filter out samples with insufficient amplicon coverage

  7. Call sequence variants from the alignments using DRAGEN Somatic v4.3.6 and apply them to the corresponding reference genomes to create consensus sequences

  8. If applicable, run Nextclade/Pangolin on the consensus sequences

Output

  • Consensus sequences representing a best estimate of targeted sequences

  • Tables and plots reporting read counts, coverage, and Nextclade/Pangolin results

Currently supported platforms

  • BaseSpace Sequence Hub

Important Notes

  • The sequences are labeled according to the best match in the reference database, which is not exhaustive and the labels should not be taken as definitive for strain-typing. If strain typing is needed, the built-in Nextclade and/or Pangolin tools can be used for supported organisms. Alternatively, a BLAST or similar search of nucleotide databases may provide a more detailed match.

  • Because of sequence homology, it is possible that organisms with very few reads will result in the generation of a sequence not present (false positive). Although the de novo assembly step of this software largely mitigates such instances, sequences with very low horizontal coverage (< 5%) should be treated with caution.

Output files

Note: Some files may not be generated depending on user inputs and pipeline outcome

Analysis level

Analysis_Results/<analysisId>.report.html displays tables and plots that summarize results from all samples combined.

Sample level

An output directory named after each sample contains <sampleName>.html, which displays tables and plots specific to the sample. The HTML files are identical to the ones displayed in BaseSpace Reports.

Each sample directory also contains the following subdirectories and output files:

amplicon/

<sampleName>.amplicon_coverage.log
Log from computing coverage metrics for each amplicon in a sample

<sampleName>.amplicon_coverage.csv

CSV reporting coverage metrics for each amplicon in a sample

<sampleName>.amplicon_detection.json

JSON reporting amplicon detection results for a sample

consensus/

<sampleName>.hard_masked_consensus.fa
FASTA containing all hard-masked consensus sequences generated for a sample

<sampleName>.soft_masked_consensus.fa

FASTA containing all soft-masked consensus sequences generated for a sample

<sampleName>.sample_consensus.fasta

<sampleName>.hard_masked_consensus.fa but with informative headers

<sampleName>_<genomeName>.genome_consensus.fasta

<sampleName>.sample_consensus.fasta but specific to consensus sequences generated using reference sequences that belong to a particular genome

<sampleName>_<accessionName>.accession_consensus.fasta

<sampleName>.sample_consensus.fasta but specific to consensus sequences generated using a particular reference sequence

<sampleName>.consensus.json

JSON containing information on all consensus sequences generated for a sample

contig/

<sampleName>.sample_contig.fasta
FASTA containing all contig sequences generated for a sample

<sampleName>_<genomeName>.genome_contig.fasta

<sampleName>.sample_contig.fasta but specific to contigs mapping to reference sequences that belong to a particular genome

<sampleName>_<accessionName>.accession_contig.fasta

<sampleName>.sample_contig.fasta but specific to contigs mapping to a particular reference sequence

<sampleName>.contig.json

JSON containing information on all contig sequences generated for a sample

map_align/

<sampleName>-replay.json

JSON reporting parameters and versions used when running DRAGEN to perform short read alignment

<sampleName>-unmapped_S1_L001_R1_001.fastq.gz

FASTQ containing R1 reads that did not map to any selected reference sequences

<sampleName>-unmapped_S1_L001_R2_001.fastq.gz

FASTQ containing R2 reads that did not map to any selected reference sequences

<sampleName>-unmapped-singleton_S1_L001_R1_001.fastq.gz

FASTQ containing singleton reads that did not map to any selected reference sequences

<sampleName>.bam

BAM containing all short read alignments

<sampleName>.bam.bai

BAI for <sampleName>.bam

<sampleName>.mapping_metrics.csv

CSV generated by DRAGEN to report mapping metrics

<sampleName>.trim.log

Log from performing post-facto primer trimming after short read alignment

<sampleName>.trimmer_metrics.csv

CSV generated by DRAGEN to report trimmer metrics

dragen_run_<runId>.log

Log from running DRAGEN to perform short read alignment

metrics/

<sampleName>.coverage.tsv
TSV reporting base-pair resolution coverage values across all reference sequences used in short read alignment

<sampleName>.report.json

JSON containing summary metrics generated for a sample

nextclade/

<sampleName>_<datasetName>.aligned.fasta
FASTA generated by Nextclade from aligning consensus sequences to a reference sequence

<sampleName>_<datasetName>.auspice.json

Auspice JSON generated by Nextclade containing output phylogenetic tree

<sampleName>_<datasetName>.csv

CSV generated by Nextclade to report results from mutation calling, clade assignment, quality control, etc.

<sampleName>_<datasetName>.json

<sampleName>_<datasetName>.csv in JSON format

<sampleName>_<datasetName>.ndjson

<sampleName>_<datasetName>.csv in NDJSON format

<sampleName>_<datasetName>.tsv

<sampleName>_<datasetName>.csv in TSV format

pangolin/

variant_calling/

<sampleName>-replay.json
JSON reporting parameters and versions used when running DRAGEN to perform variant calling

<sampleName>.consensus_filtered.bcftools_stats.txt

TXT generated by BCFtools stats command to report statistics on called variants that passed the consensus filter

<sampleName>.consensus_filtered.summary.csv

CSV generated by BCFtools query command to summarize called variants that passed the consensus filter

<sampleName>.consensus_filtered.vcf.gz

VCF containing called variants that passed the consensus filter

<sampleName>.consensus_filtered.vcf.gz.tbi

TBI for <sampleName>.consensus_filtered.vcf.gz

<sampleName>.hard-filtered.bcftools_stats.txt

TXT generated by BCFtools stats command to report statistics on called variants

<sampleName>.hard-filtered.summary.csv

CSV generated by BCFtools query command to summarize called variants

<sampleName>.hard-filtered.vcf.gz

VCF containing called variants

<sampleName>.hard-filtered.vcf.gz.tbi

TBI for <sampleName>.hard-filtered.vcf.gz

<sampleName>.vc_metrics.csv

CSV generated by DRAGEN to report variant calling metrics

dragen_run_<runId>.log

Log from running DRAGEN to perform variant calling
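When results are downloaded, the per-sample layout described above can be walked programmatically. The following is a minimal sketch that collects each sample's `sample_consensus.fasta`. It assumes the sample directories sit directly under one local download directory, which may not match every download method, and the function name `find_sample_consensus` is hypothetical. Samples without a consensus file are skipped, since some files may not be generated.

```python
from pathlib import Path


def find_sample_consensus(analysis_dir):
    """Map sample directory name -> path of its sample_consensus.fasta, if present."""
    found = {}
    pattern = "*/consensus/*.sample_consensus.fasta"
    for path in sorted(Path(analysis_dir).glob(pattern)):
        # <analysis_dir>/<sampleName>/consensus/<sampleName>.sample_consensus.fasta
        sample = path.parent.parent.name
        found[sample] = path
    return found
```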

PCR Primer definition file formats

If primer coordinates are known (recommended)

Provide a BED file with at least 4 columns: accession, start, end, primerName. Additional columns can be included: pool, strand, sequence, but their order must be maintained.

For example, accession, start, end, primerName, pool for 5-column BED format:

seqX    0           15        primer1_LEFT   1
seqX    1745        1760      primer1_RIGHT  1
seqY    0           15        primer2_LEFT   2
seqY    1015        1030      primer2_RIGHT  2

And accession, start, end, primerName, pool, strand, sequence for 7-column BED format:

seqX    0           15        primer1_LEFT   1     +       GGGCAAACCTAAAGG
seqX    1745        1760      primer1_RIGHT  1     -       GTTATGTAAAGGTGC
seqY    0           15        primer2_LEFT   2     +       GGGCGAAACTAAAGG
seqY    1015        1030      primer2_RIGHT  2     -       GTTATGTAAAGGTGC
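The BED layouts above can be read with a few lines of Python. This is a minimal sketch rather than the app's own parser: the names `Primer` and `parse_primer_bed` are hypothetical, and the check that a primer sequence is as long as `end - start` is an added sanity check based on BED's 0-based, half-open intervals, not a documented requirement.

```python
from typing import NamedTuple, Optional


class Primer(NamedTuple):
    accession: str
    start: int  # 0-based position of the first base of the binding site
    end: int    # 0-based position one past the last base (half-open)
    name: str
    pool: Optional[str] = None
    strand: Optional[str] = None
    sequence: Optional[str] = None

    @property
    def length(self):
        return self.end - self.start


def parse_primer_bed(text):
    """Parse 4- to 7-column primer BED text into Primer records."""
    primers = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and '#' comment/header lines are skipped
        cols = line.split()  # any amount of whitespace separates columns
        if not 4 <= len(cols) <= 7:
            raise ValueError(f"expected 4 to 7 columns: {line!r}")
        primer = Primer(cols[0], int(cols[1]), int(cols[2]), *cols[3:])
        if primer.sequence is not None and len(primer.sequence) != primer.length:
            raise ValueError(f"{primer.name}: sequence length differs from end - start")
        primers.append(primer)
    return primers
```

For the first line of the 7-column example, `start` is 0 and `end` is 15, so the primer covers 15 bases, which in 1-based terms are positions 1 through 15.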

If primer coordinates are unknown

Option 1. One line per amplicon with 3 columns: ampliconName, forwardSequence, reverseSequence.

amplicon1      GGGCAAACCTAAAGG  GTTATGTAAAGGTGC
amplicon2      GGGCGAAACTAAAGG  GTTATGTAAAGGTGC

Option 2. One line per primer with 3 columns: primerName, sequence, pool.

primer1_LEFT      GGGCAAACCTAAAGG  1
primer1_LEFT_alt  GGGCGAAACTAAAGG  1
primer1_RIGHT     GTTATGTAAAGGTGC  1

Formatting rules

  • General

    • All text is case sensitive.

    • Any line starting with '#' is ignored. This can be used to add a header line with column names.

    • Every line must have the same number of columns and format (except those starting with '#').

    • Any number of spaces can separate the columns. A value within a single column should not have any space.

  • BED format

    • Per standard BED conventions, sequence coordinates are given as 0-based, half-open intervals, such that the start field (2nd column) contains the first nucleotide in the primer binding site and the last nucleotide in the primer binding site is the value in the end field (3rd column) minus 1.

    • accession field must contain a sequence identifier that matches the header of the FASTA file containing the sequence that the coordinates are relative to.

    • Multiple sequence identifiers (accession) are permitted within one file.

  • Primer name

    • primerName must be unique and encode the name of the amplicon for which the primer is designed, the direction tag indicating which side of the amplicon, left or right, the primer belongs to, and an optional indicator that the primer is an alternative primer for that amplicon.

    • In addition to _LEFT and _RIGHT, we permit _L and _R as direction tags in primerName. Any text after the direction tag should be separated by an underscore.

    • Text in primerName before the direction tag is considered to be an amplicon identifier. Ensure that the text of the amplicon identifier is unique for that amplicon and that the direction tag occurs only once in primerName.

    • Each amplicon must have at least one left and right primer (including alternative primers) associated with it.

    • Alternative primers are used to bind to locations that avoid sequence variation in the default primer binding site that may disrupt hybridization. An amplicon may have an arbitrary number of alternative primers (as long as the primer name is unique), but most amplicons will have none. Alternative primers are indicated by the presence of the _alt after the direction tag in primerName, followed by optional text to distinguish between different alternative primers, such as a number.

    • Examples of valid primer names:

      • MY_SEQUENCE_434_A_LEFT

      • virus1_L

      • amplicon_4934m_RIGHT_alt

      • amplicon_4934m_RIGHT_alt1

      • amplicon_4934m_R_altprimerB

    • Examples of invalid primer names:

      • LEFT_MY_SEQUENCE_434_A

      • virus1_l

      • amplicon_4934m_RIGHT_L
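The naming rules above can be expressed as a small parser. This is a minimal sketch of one reading of those rules, not the app's validation code: the function name `parse_primer_name` is hypothetical, and the name is split on underscores so that exactly one token must be a direction tag.

```python
DIRECTION_TAGS = {"LEFT": "left", "L": "left", "RIGHT": "right", "R": "right"}


def parse_primer_name(name):
    """Split a primerName into amplicon identifier, side, and alternative flag."""
    tokens = name.split("_")
    tag_positions = [i for i, tok in enumerate(tokens) if tok in DIRECTION_TAGS]
    if len(tag_positions) != 1:
        # Names are case sensitive, so 'l' or 'left' do not count as tags.
        raise ValueError(f"{name!r}: expected exactly one direction tag")
    pos = tag_positions[0]
    if pos == 0:
        raise ValueError(f"{name!r}: no amplicon identifier before the direction tag")
    suffix = tokens[pos + 1:]
    return {
        "amplicon": "_".join(tokens[:pos]),
        "side": DIRECTION_TAGS[tokens[pos]],
        # An alternative primer carries 'alt' right after the direction tag.
        "is_alt": bool(suffix) and suffix[0].startswith("alt"),
    }
```

With this reading, `amplicon_4934m_R_altprimerB` parses as an alternative right primer of amplicon `amplicon_4934m`, while the three invalid examples above each raise `ValueError`.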

Sample Report

The Sample Report contains at most four tabs: Sample QC, Virus Metrics, Nextclade Report, and Pangolin Report.

Sample QC

This tab contains tables and plots summarizing the sample.

Sample Summary Metrics

This table reports summary metrics for the sample, such as Status and Detected Amplicons. See here for their definitions.

Pre-processing Metrics

This plot displays counts of reads that fall into different categories. See here for their definitions.

Sequence Alignment

This plot displays the number of reads that mapped to each reference sequence. If there is a single reference sequence (e.g. SARS-CoV-2), one bar is shown.

Sequence Alignment Metrics

This table provides the number of reads that mapped to each reference sequence along with the genome and segment names of the reference sequence. The "Download CSV" button enables downloading the contents of the table as a text comma-separated value (CSV) file.

Virus Metrics

Metrics by Virus

This table summarizes results for each viral genome generated in the sample with each row corresponding to a single viral genome. For segmented viruses like Influenza, a row will summarize information across multiple sequences generated for a single viral genome.

At the top is the "Download CSV" button, which enables downloading the contents of the table as a text comma-separated value (CSV) file.

The table itself contains rows for every viral genome with at least one sequence generated in the sample with the following columns:

  1. Virus: Name of the viral genome

    • For custom references, this will be the part of the FASTA header before the first whitespace character for the corresponding reference sequence if no custom genome definition file is provided. If a custom genome definition file is provided, this will be the value of the genomeName column

  2. % Callable: Percentage of bases in the reference sequence with coverage above the minimum read coverage depth for consensus sequence generation (10x by default). This is computed across all sequences belonging to the viral genome. See here for more information.

  3. Median Coverage: Median coverage value (in number of reads overlapping each position) over the entire reference genome (not just the generated consensus sequence).

Metrics by Sequence

This table summarizes the results for each sequence generated in the sample. For segmented viruses like Influenza, there are typically multiple rows with the same virus name. Otherwise, this table contains information similar to the Metrics by Virus table.

At the top is the "Download CSV" button, which enables downloading the contents of the table as a text comma-separated value (CSV) file.

  1. Virus: Name of the virus genome

  2. Segment: Name of the segment to which the reference sequence corresponds. For non-segmented viruses, this is typically set to "Full".

  3. Accession: Unique identifier of the reference sequence (text before first space in FASTA header if custom reference FASTA was provided)

  4. % Callable: Percentage of bases in the reference sequence with coverage above the minimum read coverage depth for consensus sequence generation (10x by default). See here for more information on this metric.

  5. Callable Bases: Number of bases in the reference sequence with coverage above the minimum read coverage depth for consensus sequence generation (10x by default)

  6. Median Coverage: Median coverage value (in number of reads overlapping each position) over the entire reference genome (not just the generated consensus sequence).

  7. Consensus Length: Length of the final consensus, without leading and trailing masked bases if sequence trimming is enabled. Sequence trimming can be disabled in the Input Form under Advanced Workflow Settings.
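To make the coverage-based columns concrete, the following is a minimal sketch that computes Callable Bases, % Callable, and Median Coverage from a list of per-position depths for one reference sequence. The function name `coverage_metrics` is hypothetical, and the sketch treats positions at or above the minimum depth as callable, which is an assumption about how the boundary value is handled.

```python
from statistics import median


def coverage_metrics(depths, min_depth=10):
    """Summarize per-position read depths for one reference sequence."""
    if not depths:
        raise ValueError("depths must contain at least one position")
    # Assumption: a position is callable when its depth is >= min_depth.
    callable_bases = sum(1 for depth in depths if depth >= min_depth)
    return {
        "callable_bases": callable_bases,
        "pct_callable": 100.0 * callable_bases / len(depths),
        # Median is taken over every reference position, callable or not.
        "median_coverage": median(depths),
    }
```

For a segmented virus, the Metrics by Virus values would be obtained by pooling the depths of all segments that belong to the genome before calling the function.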

Consensus Coverage

Displays a trace of read coverage over each reference genome. On the top right is a drop-down menu that allows users to switch between genomes. The blue line represents the read coverage, with the coverage depth in log 10 of number of reads on the y-axis and the genomic position in the reference genome on the x-axis.

For segmented viruses like Influenza, coverage values for each segment are displayed in a horizontally stacked fashion. Grey blocks at the top show their boundaries.

Nextclade Report (optional)

This tab contains tables reporting the results of the Nextclade analysis performed on the generated consensus sequences in the sample. See here for more details.

Pangolin Report (optional)

This tab contains tables reporting the results of the Pangolin analysis performed on the generated consensus sequences in this sample. See here for more details.

Understanding the BaseSpace Reports

Once the analysis completes, the "REPORTS" tab on BaseSpace enables users to view the Summary of the entire analysis, which summarizes results from all input samples, as well as an individual Sample Report for each sample.

Pipeline Logic

Steps

| Step | Module/Script | Run |
| --- | --- | --- |
| QC | trimmomatic | Always |
| Primer trimming (on FASTQ) | trimmomatic | If assembly is to run |
| Remove off-target reads | DRAGEN | If checked in Input Form |
| Assembly | MEGAHIT | If reference FASTA and BED files imply more than one genome as reference |
| Contig clustering | CD-HIT | If assembly ran |
| Reference selection | custom script | If assembly ran, otherwise input reference database is used as is |
| Map/Align | DRAGEN | If at least one reference sequence is generated |
| Post-facto primer trimming (on BAM) | custom script | If Map/Align ran and primer set exists |
| Sample filtering based on amplicon coverage | custom script | If Map/Align ran and primer set exists |
| Variant calling | DRAGEN | If Map/Align ran and sample passed filter above |
| Consensus sequence generation | custom script | If Map/Align ran and sample passed filter above |

Outcomes

| Status | Level | Outcome |
| --- | --- | --- |
| Completed successfully | Pipeline | Exit with all applicable output files |
| Custom files are not formatted correctly | Pipeline | Exit early with error |
| No remaining reads after preprocessing | Sample | Exit early with a report of read counts |
| No contig generated | Sample | Exit early with a report of read counts |
| No reference found after assembly | Sample | Exit early with a report of read counts and contig FASTA |
| None of the primers provided in custom primer definition file align to selected reference sequences | Sample | Skip post-facto primer trimming and sample filtering based on amplicon coverage for this sample |
| Insufficient amplicon coverage | Sample | Exit early before variant calling and consensus sequence generation |


Frequently Asked Questions (FAQ)

Q: The majority of my reads are removed in preprocessing as off-target reads. Is the amplicon panel working?

A: For many sample types, especially clinical, wastewater, or environmental samples, viral RNA or DNA makes up a tiny proportion of the total nucleic acids, with the remainder dominated by host or bacteria/archaea. Therefore, even with a dramatic increase of abundance over what you would obtain without targeted sequencing, the percentage of targeted reads can still be low.

Q: How is the default minimum read coverage depth of 10x applied? Is that averaged across the entire sequence?

A: The 10x threshold is applied per-nucleotide. Any positions below 10x coverage will be hard-masked with "N".
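The per-nucleotide rule can be illustrated with a minimal sketch. It is a simplification, not the app's consensus code: it takes a sequence and a same-length list of depths and ignores insertions and deletions. The function name `hard_mask` is hypothetical.

```python
def hard_mask(sequence, depths, min_depth=10):
    """Replace every base whose depth is below min_depth with 'N'."""
    if len(sequence) != len(depths):
        raise ValueError("sequence and depths must have the same length")
    return "".join(
        base if depth >= min_depth else "N"
        for base, depth in zip(sequence, depths)
    )


# Each position is judged on its own depth; no averaging is involved.
print(hard_mask("ACGTAC", [12, 9, 10, 0, 30, 3]))  # prints ANGNAN
```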

Q: Header lines in the consensus sequence FASTA files say “Panel reference sequences are not necessarily comprehensive and should not be used for strain typing.” What does this mean?

A: This message is to warn users that the sequence accession in the consensus genome does not necessarily reflect the true phylogeny of the organisms in the sample and should not be taken as such.

Because the app uses a limited set of reference sequences, the accession in the consensus sequence FASTA file headers (and coverage plots, etc) merely reflects the best match from that limited set. There may be sequences in RefSeq or elsewhere that are a closer match.

Q: Why does the "Detected Amplicons" column show 7 in the denominator for an Influenza genome when there should be 8 amplicons in total?

A: The denominator in the "Detected Amplicons" columns is based on the reference sequences selected based on de novo assembled contigs. Depending on the quality of the sample and/or reads, the assembler may not have enough data to generate a contig for some segments. Shorter segments are more likely to be missed. If only 7 segments are selected as reference for short read alignment, then we expect 7 amplicons in total. If you believe that the sample should contain all 8 segments, you can download the contig FASTA file from our report page and submit it to NCBI BLAST to see if all 8 segments are present in the contig sequences.

One known issue is that chimeric reads can be generated during library preparation, which can lead to chimeric contigs, where the contig sequence contains sequences from more than one segment. This can result in missing an entire segment in the reference selection stage. A workaround may be to filter out chimeric reads from your FASTQ files before running the app.

Alternatively, you can force the app to use all 8 segments of a particular Influenza genome by providing a custom reference FASTA file with all 8 segment sequences and a custom reference BED file with the genome column set to the same value for every segment (e.g. Influenza A). This way, the app skips assembly and uses all 8 segments as the reference sequences for short read alignment.

Q: What is the difference between the "Detected Amplicons" and "% callable bases" columns?

A: The "Detected Amplicons" column shows the number of detected amplicons over the total number of expected amplicons. The percentage of detected amplicons is used to infer if the sample is of sufficient quality for variant calling. The "% callable bases" column shows the percentage of the selected reference genome whose bases are at or above the minimum read coverage depth for consensus sequence generation, which is computed independent of amplicons.

Both metrics are useful to assess the quality of the sample, but the percentage of detected amplicons is used by the app after short read alignment to filter out low-titer samples and the percentage of callable bases is not.

Q: I don't see the virus I'm interested in listed in the "Genomes generated" column. Does that mean the virus is not present?

A: Not necessarily. Your virus of interest may be present in the sample, but the app may not have generated a consensus sequence for it for various reasons.

One reason could be that there are too few reads coming from that virus. Tools like DRAGEN Metagenomics can be used to characterize what is in the sample more broadly.

Another reason could be that the virus in your sample is too divergent from the reference sequences used in the app. In such cases, we recommend downloading the contig FASTA file (if available) from our report page and submitting it to NCBI BLAST. If you do see a genome that matches your virus of interest, you can provide that to the app as a custom reference genome.

Q: I cannot find any contig FASTA files in the output, why?

A: De novo assembly is performed only if there are multiple candidate reference genomes, which is typically when there are multiple serotypes, strains, subtypes, or clades. This currently applies to the following Amplicon Primer Set options:

  • Dengue Virus All Serotypes, 400-bp DengueSeq primers

  • Influenza A, Universal primers

  • Influenza B, Universal primers

  • Influenza A and B, Universal primers

  • Mpox All Clades, 2500-bp ARTIC-INRB v1 primers

  • Respiratory Syncytial Virus (RSV), CDC primers

  • Respiratory Syncytial Virus (RSV), WCCRRI primers

If a custom reference FASTA file is provided, assembly is performed if there are multiple sequences in the file. If a custom reference BED file is also provided, assembly is performed if the BED file indicates multiple genome-segment pairs (or multiple non-segmented full genomes). Otherwise, all sequences in the custom reference FASTA file are used as reference for short read alignment.

Q: I see both consensus sequence FASTA files and contig FASTA files. Which one is better?

A: In most cases, the consensus sequence FASTA file. Contig sequences are useful if the reference sequences used for consensus sequence generation were not the best match. They should be used with caution, however, because base calls in contigs are not filtered by coverage or quality as they are in consensus sequence generation.

Q: I am using the custom analysis workflow and my analysis aborted or shows an error, why?

A: It is most likely that the custom database was not formatted correctly. Below are requirements for the Custom Reference FASTA For Consensus Generation:

  • Do not use spaces in the file name; use an underscore "_" instead

  • Do not exceed 25 characters in the file name

  • File extension must be .fasta or .fa

  • Do not exceed the file size limitation: 16GB for a single file or 25GB for multiple files

  • Do not have duplicate entries

  • If providing a Custom Reference BED and/or Custom PCR Primer Definitions in BED format, the names in the first column of the BED file (accession) must match the names that appear in the FASTA (text after > and before the first whitespace character).

Please see this page on general guidelines to upload data to BaseSpace for more details. If you continue having issues, reach out to techsupport@illumina.com.

Special considerations for amplicon detection

Amplicon sequencing, especially of RNA viruses, requires additional bioinformatics processing to ensure maximum quality of the resulting data.

In RT-PCR, a reverse transcriptase enzyme first generates cDNA molecules using the RNA molecules in the sample as templates, before amplifying the cDNA sequences using a DNA polymerase enzyme during PCR. These amplified cDNA sequences are then further processed to generate the sequencing libraries. Both of these enzymes can potentially introduce an incorrect base into a sequence, generating a position where the resulting sequence does not match the sequence in the sample -- that is, an error.

Reverse transcriptases exhibit error rates that are multiple orders of magnitude higher than those of DNA polymerases.

When large numbers of nucleic acid molecules are present in a reaction, these individual misincorporation errors are largely uncorrelated and appear at very low frequencies, so that they are typically ignored by variant callers.

However, when there is a small number of incoming nucleic acid molecules, such as for a low-titer sample, an error that occurs during the RT step or early in the PCR reaction can, as a result of sampling noise, be amplified to high frequencies in the resulting sequencing libraries. The variant caller may treat this error as a sequence variant, since it is a true sequence variant in the context of the library provided to the instrument. As a result, these artifactual sequence variants often have high allele frequencies and quality scores, which makes them very difficult to detect, and they appear in the final consensus sequence. While less common, it is also possible for a true sequence variant to have its allele frequency depressed by this same process (if the error results in a reversion to the reference sequence).

Since it is difficult to identify enzyme-introduced false variants after the fact, we instead take a preemptive approach of determining if there is sufficient sample material present before variant calling and consensus sequence generation in order to ensure data quality.

Specifically, the app calculates the number of amplicons with at least 1x coverage for at least 90% of the non-overlapping portion of the amplicon sequence. The 1x coverage threshold used here is fixed and independent of the minimum read coverage depth for consensus sequence generation which defaults to 10x. The number of amplicons that meet this threshold is then divided by the total number of amplicons expected in the experiment, which is the number of amplicons whose location falls in reference sequences selected for short read alignment. If the resulting percentage is at least 80%, the sample is considered to have sufficient material for accurate variant calling. If it is below this threshold, the sample is not processed further to avoid spurious variant calls. The user can override the 80% threshold in the "Minimum percentage of amplicons with at least 90% coverage ≥ 1x to enable variant calling and consensus sequence generation" control in the "Advanced Workflow Settings" section.

The threshold above was determined through data analysis using an experimentally-determined threshold corresponding to the minimum concentration needed to produce reliable variant calls. We assumed that higher nucleic acid concentrations lead to a higher probability of amplifying each amplicon.
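The amplicon detection check described in this section can be sketched as follows. This is an illustrative sketch only, not the app's implementation: the function names and the per-base depth input format are assumptions, while the 1x, 90%, and 80% values are the defaults stated above.

```python
from typing import Sequence, Tuple


def amplicon_detected(depths: Sequence[int],
                      min_depth: int = 1,
                      min_fraction: float = 0.90) -> bool:
    """Return True if enough of one amplicon is covered.

    depths holds the per-base read depth over the non-overlapping portion
    of a single amplicon (hypothetical input format).
    """
    if not depths:
        return False
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths) >= min_fraction


def sufficient_material(amplicon_depths: Sequence[Sequence[int]],
                        min_detected_pct: float = 80.0) -> Tuple[int, int, bool]:
    """Return (detected, expected, passes) for one sample."""
    detected = sum(1 for d in amplicon_depths if amplicon_detected(d))
    expected = len(amplicon_depths)
    pct = 100.0 * detected / expected if expected else 0.0
    return detected, expected, pct >= min_detected_pct


# Five expected amplicons: four fully covered, one only half covered.
sample = [[3] * 100] * 4 + [[1] * 50 + [0] * 50]
print(sufficient_material(sample))
```

In this example four of five amplicons are detected (80%), so the sample would proceed to variant calling at the default threshold. Raising `min_detected_pct` mirrors overriding the control in the "Advanced Workflow Settings" section.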

How to set up and run an analysis

  1. Launch the DRAGEN Targeted Microbial BaseSpace application.

  2. After choosing a name and destination project for the Analysis, choose either “Biosample” or “Project” as input type. Selecting “Project” will result in all biosamples in the selected project being analyzed.

  3. Next, choose between Enrichment and Amplicon for Experiment Type. Libraries prepared with IMAP should be run as “Amplicon” experiments. Choose the appropriate Amplicon Primer Set that matches the primer design used to prepare your samples or choose “Custom” to provide your own genome references and primer designs. Note that all provided files must first be uploaded to a BaseSpace project before they can be selected in the software.

  4. To use a custom reference and primer design, click the “Custom Reference” block to expand it.

  5. At a minimum, the user must provide a custom genome reference containing one or more target sequences (to be used for alignment, variant calling and consensus generation) in the form of a FASTA file.

    1. Optionally, the user may provide a BED file that assigns human-readable names and segment numbers (if applicable) to each sequence in the provided FASTA file. Note that the accessions in the genome definition file must match the first part (before whitespace) of the FASTA headers. See the pages for “Genome Definition File Format Specification” in the “Supporting Information” section of this user guide for information on the required format of this file.

    2. Optionally, the user may provide a file containing the locations or sequences of the primers used to prepare this sample. These primer definitions are important to guide the trimming of primer sequence from reads that overlap the binding sites, as well as to define the boundaries of the amplicons whose coverage is used to determine if the sample has sufficient viral material to reliably call variants and generate consensus sequence.

    3. Optionally, the user may choose one or more NextClade datasets to use for phylogenetic analysis of the consensus sequences generated from the samples.

  6. Check the appropriate boxes to enable or disable Pangolin and/or NextClade analysis if desired. These tools perform phylogenetic analysis and lineage assignment for SARS-CoV-2 (Pangolin) and other viruses (NextClade). Depending on the chosen Amplicon primer set, not all of these options may be applicable.

  7. Click “Launch Application” to begin the Analysis.

Custom genomes and primer sets

In addition to the built-in options, DRAGEN Targeted Microbial supports the use of custom reference genomes and primer definitions. These files must be uploaded to a BaseSpace Project before they can be used. See https://help.basespace.illumina.com/manage-data/import-data for more information about importing files into BaseSpace. These files can be used for both Enrichment and Amplicon libraries, when choosing the 'Custom' option for 'Enrichment Panel' or 'Amplicon Primer Set', respectively. Expand the 'Custom Reference' settings block to access the options for custom files. The following controls are applicable to the specified experiment type:

Custom Enrichment Panel

  • Custom Reference FASTA for Consensus Generation (required)

  • Custom Reference BED (optional)

Custom Amplicon Primer Set

  • Custom Reference FASTA for Consensus Generation (required)

  • Custom Reference BED (optional)

  • Custom PCR Primer Definitions (optional)

Custom genome references

The user may provide one or more reference genomes as the target for read alignment (and as the basis for generating consensus sequences). At a minimum, the user must provide a FASTA file containing the sequences of the reference genomes. The software will generate the required DRAGEN hash tables and other auxiliary files automatically, so there is no need to process the FASTA file with a separate app. Use the 'Custom Reference FASTA for Consensus Generation' control to select the previously-uploaded FASTA file containing the reference sequences.

Optionally, a genome definition BED file may also be provided, which tells the software more information about each sequence, such as a human-readable common name to be used in the reports. For multi-segment genomes such as Influenza, the genome definition file provides the segment name of each sequence and indicates that all the segments of a single genome belong together. Use the 'Custom Reference BED' control to select the previously-uploaded BED file containing the genome definition. See the 'Genome definition file formats' page for a description of the format of the genome definition file.

Custom primer sets

For amplicon experiments, the user may optionally provide a file that defines the primer sequences or locations. The primers defined in this file are used for two purposes:

  1. The primer binding locations are used to trim reads, which eliminates sequence data that may be contributed by the primer sequences themselves (which we do not want) from sequence data contributed by the sample (which we do want). This is important to avoid reference bias that can depress the observed allele frequency of sequence variants in primer binding sites.

  2. The primers are matched to define the boundaries of the expected amplicons resulting from the PCR reaction. The read coverage within the unique (non-overlapping) regions of these amplicons is used to determine whether or not each amplicon is reliably observed. The fraction of observed amplicons is a function of the concentration of the sample, and is used to determine whether or not sufficient material exists within the sample to reliably and accurately call variants and generate a consensus sequence. See 'Special considerations for amplicon sequencing with IMAP protocols' for a more in-depth discussion.

Use the 'Custom PCR Primer Definitions' control to select the previously-uploaded primer definition file. The allowed formats for this file are described on the 'Primer definition file formats' page.

Required custom input based on reference type for amplicon experiments

| Reference | Example | Required input | Note |
| --- | --- | --- | --- |
| Single non-segmented genome | Zika | Primer set | |
| Single segmented genome | All 8 segments from one Influenza A genome | Primer set | |
| Multiple non-segmented genomes | Multiple genomes of Zika | Reference BED, Primer set | Reference BED must be provided to make it clear that the reference sequences are not segments in the same genome. Otherwise, the pipeline will assume this is a single segmented genome (above). If multiple genomes remain after reference selection, the genome with the best per-amplicon coverage will be considered for sample filtering. |
| Multiple segmented genomes | A collection of Influenza A and B genomes | Reference BED, Primer set | Reference BED must be provided to specify which sequences belong to the same genome. Otherwise, the pipeline will assume this is a single segmented genome. If multiple genomes remain after reference selection, the genome with the best per-amplicon coverage will be considered for sample filtering. |

Custom reference

In addition to the built-in options, DRAGEN Microbial Amplicon supports the use of custom reference genomes and primer definitions. These files must be uploaded to a BaseSpace Project before they can be used. See https://help.basespace.illumina.com/manage-data/import-data for more information about importing files into BaseSpace.

In the app input form, select the 'Custom' option for 'Amplicon Primer Set'. Then expand the 'Custom Reference' settings to provide the following:

  • Custom Reference FASTA for Consensus Generation (required)

  • Custom Reference BED (optional)

  • Custom PCR Primer Definitions (optional)

Custom reference FASTA

If the 'Custom' option is selected for 'Amplicon Primer Set', the user must provide a custom FASTA file containing one or more reference sequences as the target for read alignment (and as the basis for generating consensus sequences). The software generates the required DRAGEN hash tables and other auxiliary files automatically, so there is no need to process the FASTA file with a separate app. Note that not all provided reference sequences in the FASTA file may be used for read alignment and consensus sequence generation.

Custom reference BED

Optionally, a reference BED file may be provided to add information about each reference sequence in the FASTA file, such as human-readable names to be used in the reports. For multi-segment genomes such as Influenza, this file assigns the segment name to each sequence, which allows the software to group individual segment sequences by genome. See the 'Genome definition file formats' page for the format of this file.

Custom PCR primer definitions

Optionally, a TSV file may be provided to define the primer sequences or binding locations, which are used for two purposes:

  1. Primer sequences are trimmed from reads, which eliminates sequences that may come from the primer sequences themselves (which we do not want) from sequences contributed by the biological sample (which we do want). This reduces reference bias that can incorrectly lower the observed allele frequency of true sequence variants in primer binding sites.

  2. Primer locations are used to define the amplicons expected from PCR reactions. The read coverage within the unique (non-overlapping) amplicon regions is used to determine whether each amplicon is reliably detected. The percentage of detected amplicons is used to determine whether sufficient material exists to accurately call variants and generate consensus sequences from the sample.
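The idea of a unique (non-overlapping) amplicon region can be illustrated with a short sketch. It assumes a tiled scheme in which each amplicon overlaps only its immediate neighbors on the same reference, and it uses hypothetical half-open `(name, start, end)` coordinates. It is not the app's implementation.

```python
from typing import Dict, List, Optional, Tuple

Amplicon = Tuple[str, int, int]  # (name, start, end), half-open coordinates


def unique_regions(amplicons: List[Amplicon]) -> Dict[str, Optional[Tuple[int, int]]]:
    """Return the part of each amplicon not shared with its neighbors.

    Assumption: amplicons tile one reference and each overlaps at most the
    previous and the next amplicon. An amplicon that is entirely covered by
    its neighbors maps to None.
    """
    ordered = sorted(amplicons, key=lambda a: a[1])
    regions: Dict[str, Optional[Tuple[int, int]]] = {}
    for i, (name, start, end) in enumerate(ordered):
        lo, hi = start, end
        if i > 0:
            lo = max(lo, ordered[i - 1][2])   # skip overlap with previous amplicon
        if i + 1 < len(ordered):
            hi = min(hi, ordered[i + 1][1])   # skip overlap with next amplicon
        regions[name] = (lo, hi) if lo < hi else None
    return regions


scheme = [("amp_1", 0, 400), ("amp_2", 300, 700), ("amp_3", 600, 1000)]
print(unique_regions(scheme))
```

For this three-amplicon example the unique regions are 0-300, 400-600, and 700-1000. Coverage would be evaluated within these intervals only, so reads from a neighboring amplicon cannot make a failed amplicon look detected.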

See the 'Special considerations for amplicon detection' and 'PCR Primer definition file formats' pages for further information.

Nextclade datasets

Optionally, one or more Nextclade datasets can be selected to use for phylogenetic analysis of the consensus sequences generated from the samples. Every selected dataset will be applied to every consensus sequence generated in every sample.

Summary

The Summary contains at most three tabs: Summary Report, Nextclade Report, and Pangolin Report.

Summary Report

Metrics By Sample

This table provides a top-line summary of each of the analyzed samples.

At the top is the "Download CSV" button, which enables downloading the contents of the table as a text comma-separated value (CSV) file.

Next is the table itself, which contains one row per sample and the following columns:

  1. Sample: Name of the BaseSpace sample analyzed

  2. Status: Status of the sample analysis

  3. Input Reads: Total number of reads in input FASTQs

  4. Mapped Reads: Number of reads that map to reference sequences during short read alignment

  5. Detected Amplicons: Proportion of amplicons detected out of the total expected for the sample, which is used to determine whether the sample is of sufficient quality for variant calling. See this page for more details.

  6. Num Genomes: Number of genomes chosen during the reference selection stage

  7. Virus: Name of the genome to which the reference sequence belongs

  8. % Callable: Percentage of bases in the reference sequence with coverage above the minimum read coverage depth for consensus sequence generation (10x by default)

    • Callable bases are those for which reliable variant calling can be performed and therefore for which the software can output a base call. They are defined as genomic positions with read coverage above the minimum read coverage depth for consensus sequence generation (10x by default).

    • When generating consensus sequences, genomic positions below the threshold are hard-masked with "N" characters to avoid reference bias (inclusion of a reference base when the true base cannot be accurately determined).

    • This percentage is calculated over the lengths of the reference genome(s), not the final consensus sequence(s) which may be trimmed.
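The relationship between read depth, callable bases, and hard-masking can be sketched as below. This is a simplified illustration rather than the app's code: it treats a position as callable when its depth is at least the minimum (an assumption about how "above" is applied), ignores insertions and deletions, and uses hypothetical function names.

```python
from typing import Sequence


def percent_callable(depths: Sequence[int], min_depth: int = 10) -> float:
    """Percentage of reference positions with depth >= min_depth."""
    if not depths:
        return 0.0
    callable_bases = sum(1 for d in depths if d >= min_depth)
    return 100.0 * callable_bases / len(depths)


def mask_low_coverage(sequence: str, depths: Sequence[int], min_depth: int = 10) -> str:
    """Replace bases at positions below min_depth with 'N'."""
    if len(sequence) != len(depths):
        raise ValueError("sequence and depths must have the same length")
    return "".join(base if d >= min_depth else "N"
                   for base, d in zip(sequence, depths))


depths = [0, 5, 10, 20]
print(percent_callable(depths))            # half of the positions reach 10x
print(mask_low_coverage("ACGT", depths))   # low-depth positions become N
```

Because the percentage is computed over the reference length, leading or trailing masked positions still count against % Callable even when they are trimmed from the reported consensus sequence.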

Pre-processing Metrics

This stacked bar plot contains counts of reads that fall into the following categories:

  • Removed in Downsampling: Reads that were removed during downsampling because the user specified a downsampling target in the Input Form under Advanced Workflow Settings

  • Removed in QC: Reads removed as poor quality reads based on quality thresholds during pre-processing

  • Removed as Duplicate: Reads that were labeled as duplicate during short read alignment. Removal of them can be disabled in the Input Form under Advanced Workflow Settings

  • Removed in Trimming: Reads that were removed in the initial sequence-based primer trimming step and were excluded from further processing

  • Removed in De-hosting: Reads that were filtered out as human reads based on kmer-based classification during pre-processing.

    • This improves the quality of downstream analysis and helps ensure that human sequences are not included in the output BAM files.

    • This is applied only if 'Amplicon Primer Set' was set to 'Custom' in the Input Form.

    • This can be disabled in the Input Form under Advanced Workflow Settings by unchecking "Remove off-target reads".

  • Removed as Off-target: Reads that were filtered out as off-target reads based on kmer-based classification during pre-processing

    • Similar to de-hosting, this improves the quality of downstream analysis.

    • Off-target is defined as not coming from the target organism, which is determined based on the 'Amplicon Primer Set' selection in the Input Form. For example, if "Influenza A and B, Universal Primers" option is selected, a kmer database generated from a large collection of publicly available Influenza sequences is used to separate reads likely coming from Influenza from the rest.

    • This can be disabled in the Input Form under Advanced Workflow Settings by unchecking "Remove off-target reads".

  • Unmapped: Reads that were not aligned to any reference genomes

  • Mapped: Reads that were mapped to at least one reference genome

Nextclade Report (optional)

This tab contains tables reporting the results of the Nextclade analysis performed on the generated consensus sequences across all samples. Nextclade is run if the "Enable NextClade" box is checked on the Input Form and one of the following is true:

  • 'Amplicon Primer Set' is set to a non-custom set with a reference with Nextclade dataset available and a valid consensus sequence was generated.

  • 'Amplicon Primer Set' is set to 'Custom' and one or more Nextclade datasets are selected under 'Custom Reference'. In this case, each of the selected Nextclade datasets is applied to each consensus sequence generated for every sample. This may result in multiple Nextclade results for each consensus sequence.

Each table contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.

All content shown in the tab is derived from the output of the Nextclade software. Please see the Nextclade documentation for more details.

Pangolin Report (optional)

This tab contains tables reporting the results of the Pangolin analysis performed on the generated consensus sequences across all samples. Pangolin is run if the "Enable Pangolin" box is checked on the input form and one of the following is true:

  • 'Amplicon Primer Set' is set to a non-custom set with SARS-CoV-2 as reference (e.g. SARS-CoV-2, ARTIC v5.4.2 primers) and a valid consensus sequence was generated

  • 'Amplicon Primer Set' is set to 'Custom'. In this case, Pangolin is applied to every consensus sequence generated for the sample since the software assumes all of them to be potentially SARS-CoV-2 sequences.

Each table contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.

All content shown in the tab is derived from the output of the Pangolin software. Please see the Pangolin documentation for more details.

Genome definition file formats

A BED-like tab-separated value (TSV) file with no header row, consisting of the following columns:

  1. chrom: each sequence name as it appears in Custom Reference FASTA

  2. chromStart: start position (always set to 0)

  3. chromEnd: end position (sequence length)

  4. genomeName: full name of the virus the sequence belongs to (e.g. Monkeypox virus clade II)

  5. (optional) segmentName: how this sequence is labeled within the virus (e.g. Segment 4 (HA)). Set to 'Full' if the sequence is the full genome

Guidelines

  • This file affects how sequences are labeled in the output.

  • Sequence names must match those in Custom Reference FASTA. The same set of sequences must appear in both.

  • If there are multiple viruses, their names should be unique. For example, if there are multiple Influenza genomes, they should not be labeled with the same virus name in the 4th column.

  • If the Custom Reference FASTA includes sequences from multiple segments, it is recommended to provide this BED file. Otherwise, each segment will be treated independently and not all of them may be used as reference.
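Before uploading, it can be useful to check a genome definition file against its FASTA locally. The following sketch is one possible way to do that and is not part of the app. The function names are hypothetical, and the checks reflect only the rules listed on this page (tab-separated columns, chromStart of 0, matching sequence names).

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Entry = Tuple[str, int, Optional[str]]  # (chrom, length, segmentName or None)


def read_genome_definitions(bed_text: str) -> Dict[str, List[Entry]]:
    """Group sequences by genomeName (column 4) from a header-less TSV."""
    genomes: Dict[str, List[Entry]] = defaultdict(list)
    for row in csv.reader(io.StringIO(bed_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) not in (4, 5):
            raise ValueError(f"expected 4 or 5 columns, got {len(row)}: {row}")
        chrom, start, end, genome = row[:4]
        if int(start) != 0:
            raise ValueError(f"chromStart must be 0 for {chrom}")
        segment = row[4] if len(row) == 5 else None
        genomes[genome].append((chrom, int(end), segment))
    return dict(genomes)


def fasta_ids(fasta_text: str) -> List[str]:
    """Sequence names: the part of each FASTA header before whitespace."""
    return [line[1:].split()[0]
            for line in fasta_text.splitlines()
            if line.startswith(">") and line[1:].split()]


def names_match(genomes: Dict[str, List[Entry]], fasta_text: str) -> bool:
    """True if the BED and the FASTA contain the same set of sequence names."""
    bed_ids = {chrom for entries in genomes.values() for chrom, _, _ in entries}
    return bed_ids == set(fasta_ids(fasta_text))
```

A typical use would be to call `read_genome_definitions` on the file contents, confirm that each virus name groups the expected segments, and then call `names_match` with the FASTA text before uploading both files to BaseSpace.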

Example

NC_012532.1 0   10794   Zika    Full
NC_007373.1 0   2341    Influenza A virus (H3N2)    Segment 1 (PB2)
NC_007372.1 0   2341    Influenza A virus (H3N2)    Segment 2 (PB1)
NC_007371.1 0   2233    Influenza A virus (H3N2)    Segment 3 (PA+PA-X)
NC_007366.1 0   1762    Influenza A virus (H3N2)    Segment 4 (HA)
NC_007369.1 0   1566    Influenza A virus (H3N2)    Segment 5 (NP)
NC_007368.1 0   1467    Influenza A virus (H3N2)    Segment 6 (NB+NA)
NC_007367.1 0   1027    Influenza A virus (H3N2)    Segment 7 (M1+M2)
NC_007370.1 0   890     Influenza A virus (H3N2)    Segment 8 (NS1+NEP)

Understanding the BaseSpace Reports

Brief description of Summary and Result reports and an explanation of their contents

The app produces a summary report as well as result reports for each of the samples analyzed. See the links below for a description of each.

DRAGEN Targeted Microbial App Documentation

Summary

DRAGEN Targeted Microbial is a software application designed to analyze sequencing data from enrichment and amplicon library preps (both DNA and RNA) on microbiological samples, with an emphasis on viruses. Illumina sequencing reads are processed to remove human-origin sequence, then assembled into consensus sequences that represent a best estimate of the population of viral sequences in each sample. Where appropriate, these consensus sequences are further analyzed by the phylogenetic analysis tools NextClade and/or Pangolin to provide an identification of the clade or lineage of each sequence.

Inputs

  • Samples / biosamples with FASTQ datasets (see details in library preparation documents)

  • A project containing one or more samples / biosamples with FASTQ datasets

    • all samples / biosamples in the selected project will be analyzed

Supported hybrid-capture enrichment panels

  • Viral Surveillance Panel (VSP)

  • Pan-Coronavirus Panel (Pan-Cov)

  • Respiratory Virus Oligo Panel (RVOP)

Supported amplicon primer schemes

  • Chikungunya (Grubaugh lab; Illumina)

  • Dengue Serotype 1 (Grubaugh Lab; Illumina)

  • Influenza A / B (Zhou et al)

  • MPXV (Grubaugh Lab)

  • Respiratory Syncytial Virus (RSV) (WCCRRI; CDC)

  • SARS-CoV-2 (ARTIC v3, v4, v4.1, v5.3.2)

  • Zika (Grubaugh lab)

Custom genomes and panels

Supports uploading FASTA files to use as reference genomes for both enrichment and amplicon panels, as well as custom primer definitions for amplicon panels. Multiplexed amplicon panels targeting multiple organisms in the same reaction are supported.

Analysis Pipeline

  1. Reads are trimmed and filtered using Trimmomatic with the following parameters: LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36.

  2. Human reads are removed with a modified version of the SRA Human Read Scrubber tool.

  3. MEGAHIT is used to perform de novo assembly on the scrubbed reads.

  4. CD-HIT-EST is used to cluster similar contigs to reduce redundancy.

  5. The resulting contigs are mapped to a set of reference genomes using minimap2.

  6. The best matching reference for each contig is selected for short read mapping.

  7. The scrubbed reads from step 2 are aligned to the selected reference genomes using DRAGEN v4.2.4.

  8. Sequence variants are called from the alignments using DRAGEN Somatic Small Variant Caller v4.2.4 and applied to the corresponding reference sequences to create consensus sequences.

  9. If applicable, Pangolin and/or Nextclade are run on the consensus sequences.

Outputs

The software generates consensus sequences representing a best estimate of the population of targeted sequences in the sample. NextClade and Pangolin analysis are run on select organisms. See this page for details:

Important Notes

  • The sequences are labeled according to the best match in the panel references. These references are not exhaustive and the labels should not be taken as definitive for strain-typing. If strain typing is needed, the built-in NextClade and/or Pangolin tools can be used for supported organisms. Alternatively, a BLAST or similar search of nucleotide databases may provide a more detailed match.

  • Because of sequence homology, it is possible that organisms with very few reads will result in the generation of a sequence not present (false positive). Although the de novo assembly step of this software largely mitigates such instances, sequences with very low horizontal coverage (< 5%) should be treated with caution and are highlighted as "low confidence" in the reports.

Currently supported platforms

  • BaseSpace Sequence Hub (native BaseSpace app)

Overview

Summary Report

Describes the report that can be viewed from the Summary link on the Reports tab of a completed analysis.

Metrics by Sample Table

At the top of the report, after the app version display, is the Metrics by Sample table which provides a top-line summary of each of the analyzed samples.

The first element is a button that will trigger downloading of a FASTA-formatted file containing all consensus sequences generated across all samples.

The "Download CSV" button allows for downloading the contents of the table as a text comma-separated value (CSV) file. Note that for fields with multiple entries, these entries will be combined as a semicolon-separated list in the corresponding fields in the CSV file.

Next is the table itself, which contains one row per sample, with the genomes generated for each sample nested as sub-rows within that row. The table has the following columns:

  1. Sample: The name of the BaseSpace sample analyzed. The sample name is a clickable link that will take you directly to the Result Report for that sample.

  2. Detected Amplicons (only if 'Experiment Type' is set to 'Amplicon'): The number of amplicons detected over the total number of amplicons expected for that sequence. The percentage of amplicons detected is used to determine whether the sample is of sufficient quality for variant calling. See Special considerations for amplicon sequencing with IMAP protocols for more details.

  3. Num genomes: The number of genomes chosen during the reference selection stage of the pipeline

  4. Genomes generated: The names of each genome chosen during the reference selection stage. If the percentage of callable bases (callable bases are defined as genomic positions with read coverage above the minimum read coverage depth for consensus sequence generation, 10x by default) for a genome is below the minimum percentage of consensus sequence generated to label as confident (5% by default), the cell is highlighted in yellow to indicate that there is only marginal evidence that the indicated genome is present in the sample and should be treated with caution. For amplicon experiments, if the sample is considered to have insufficient titer for VC because the percentage of detected amplicons is below the minimum percentage required for reliable variant calling (80% by default), cells are highlighted in orange. For genomes for which a consensus sequence was generated, clicking on the name of that genome initiates a download of a FASTA file containing the consensus sequences of that genome only.

  5. % callable bases: The percentage of the selected reference genome whose bases are considered "callable." Callable bases are those for which reliable variant calling can be performed and therefore for which the software can output a base call. Callable bases are defined as genomic positions with read coverage above the minimum read coverage depth for consensus sequence generation (10x by default). Note that genomic positions below the confidence threshold are hard-masked with "N" characters to avoid reference bias (inclusion of a reference base when the actual base cannot be accurately determined). Note that this percentage is calculated over the lengths of the reference genome(s), not the reported consensus sequence(s).

  6. Status: The overall outcome of the analysis for this virus

    1. Full analysis (consensus, VC) means that the sample analysis completed normally, that a sufficient number of amplicons were detected to ensure reliable variant calling (amplicon experiments only), and that the percentage of callable bases was above the minimum percentage of consensus sequence generated to label as confident (5% by default)

    2. Low confidence means that there is at least one callable base but the overall percentage of callable bases was below the minimum percentage of consensus sequence generated to label as confident (5% by default)

    3. No callable bases indicates that zero positions in the indicated reference genome were callable and no consensus sequence is therefore provided.

    4. Insufficient titer for VC will only be present for an amplicon experiment and indicates that the number of detected amplicons was below the minimum percentage (default 80%) required for reliable variant calling. See Special considerations for amplicon sequencing with IMAP protocols for more details.

  7. Consensus FASTA: This column contains links to download a FASTA-formatted text file containing all of the consensus genomes generated for a sample. If no consensus genomes were generated for a sample, this column contains "N/A."

  8. Input read count: The number of reads (or read pairs / clusters for paired-end samples) in the sample.

  9. Mapped read count: The number of reads that could be mapped to any reference genome.

  10. Unmapped reads: Displays buttons that initiate downloads of gzipped FASTQ files containing reads that could not be mapped to any reference genomes.

  11. Raw Contigs: Displays a button that initiates a download of a FASTA file containing all contigs generated during the de novo assembly step of the pipeline. If a contig could be mapped to a reference genome, the contig name contains information about the reference genome it aligned to.
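The four Status values described in the column list above can be read as a simple decision rule. The sketch below is an interpretation for illustration only: the function name, its parameters, and the order in which the conditions are checked are assumptions, while the 5% and 80% defaults come from the descriptions above.

```python
from typing import Optional


def genome_status(percent_callable: float,
                  detected_amplicon_pct: Optional[float] = None,
                  min_confident_pct: float = 5.0,
                  min_detected_pct: float = 80.0) -> str:
    """Map per-genome metrics to a Status label.

    detected_amplicon_pct is None for enrichment experiments, where the
    amplicon check does not apply. The precedence of the checks is an
    assumption, not documented behavior.
    """
    if detected_amplicon_pct is not None and detected_amplicon_pct < min_detected_pct:
        return "Insufficient titer for VC"
    if percent_callable <= 0:
        return "No callable bases"
    if percent_callable < min_confident_pct:
        return "Low confidence"
    return "Full analysis (consensus, VC)"


print(genome_status(97.2, detected_amplicon_pct=95.0))
print(genome_status(2.0))
print(genome_status(60.0, detected_amplicon_pct=40.0))
```

Under this reading, an amplicon sample with too few detected amplicons is reported as having insufficient titer regardless of its callable percentage, since variant calling is skipped for such samples.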

Pangolin Report (optional)

This table contains the results of the Pangolin analysis performed on the generated consensus sequences across all samples. Pangolin is run if the "Enable Pangolin" box is checked on the input form and one of the following is true:

  • 'Enrichment Panel' is set to a non-custom panel (e.g. VSP) and a consensus sequence was generated using SARS-CoV-2 as reference

  • 'Amplicon Primer Set' is set to a non-custom set with SARS-CoV-2 as reference (e.g. SARS-CoV-2, ARTIC v5.3.2 primers) and a valid consensus sequence was generated

  • Either 'Enrichment Panel' or 'Amplicon Primer Set' is set to 'Custom'. In this case, Pangolin is applied to every consensus sequence generated for the sample since the software assumes all of them to be potentially SARS-CoV-2 sequences.

The Pangolin report contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.

The table in the Pangolin report is derived from the output of the Pangolin software. Please see the Pangolin documentation for more details: https://cov-lineages.org/resources/pangolin/output.html. Sequences with a bad Pangolin QC status are highlighted in yellow.

NextClade Report (optional)

This table contains the results of the NextClade analysis performed on the generated consensus sequences across all samples. NextClade is run if the "Enable NextClade" box is checked on the input form and one of the following is true:

  • 'Enrichment Panel' is set to a non-custom panel (e.g. VSP) and a consensus sequence was generated using a reference with NextClade dataset available.

  • 'Amplicon Primer Set' is set to a non-custom set with a reference with NextClade dataset available (e.g. SARS-CoV-2, ARTIC v5.3.2 primers) and a valid consensus sequence was generated.

  • Either 'Enrichment Panel' or 'Amplicon Primer Set' is set to 'Custom' and one or more NextClade datasets are selected under 'Custom Reference'. In this case, each of the selected NextClade datasets is applied to each consensus sequence generated for the sample. This may result in multiple NextClade results for each consensus sequence, some of which may not be meaningful (e.g. "flu_h1n1pdm_ha" dataset applied to a NA segment of an Influenza genome).

The NextClade Report contains a button labeled "Group table by" with a drop-down menu allowing the user to group the results by various fields, including "None". The default is "Dataset" which means that all of the results for each NextClade dataset will be grouped together. For example, if a user is only interested in phylogenetic analysis performed on the HA segment of Influenza A H1N1, these results for each sample can be viewed together in the "flu_h1n1pdm_ha" collapsible group.

The NextClade report also contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.

The table in the NextClade report is derived from the output of the NextClade software. Please see the NextClade documentation for more details: https://docs.nextstrain.org/projects/nextclade/en/stable/user/output-files/04-results-tsv.html. Sequences with a bad NextClade QC status are highlighted in yellow.

App Settings

Describes the controls on the Input Form and their function

Item name
Description
Choices
Default
Required

Save Results To

Project to run the analysis in

Required

Input Type

This app can accept samples or a project as input.

  • Samples: Select up to 60 individual samples, from any project(s)

  • Project: Select a single project containing up to 1536 samples. The app will analyze every FASTQ sample in that project (FASTQ datasets with QcStatus=QcFailed will be excluded)

Biosamples, Project

Biosamples

Required

Input Biosample

Select one or more samples to analyze. Either Input Samples or an Input Project can be selected - not both.

Required if Input Type is set to 'Samples'

Input Project

Select a Project containing up to 1536 samples to be analyzed. The analysis will process all samples from that project (FASTQ datasets with QcStatus=QcFailed will be excluded). There is currently no way to filter specific samples from a project. If the project contains more than 1536 Biosamples, the app will appear to launch, but then will immediately exit.

Required if Input Type is set to 'Project'

Experiment Type

This app can analyze samples generated from enrichment or amplicon sequencing experiments. Either can be selected - not both.

Enrichment, Amplicon

Enrichment

Required

Enrichment Panel

Select the enrichment panel used to generate the data. This determines the set of reference genomes the app uses. Different selections will produce different results. Choose 'Custom' to provide your own reference genomes below.

  • Viral Surveillance Panel (VSP)

  • Pan-Coronavirus Panel (Pan-Cov)

  • Respiratory Virus Oligo Panel (RVOP)

  • Custom

Required if Experiment Type is set to 'Enrichment'

Amplicon Primer Set

Select the virus genome to align to and the primer set used to generate the data. Primer locations determine primer trimming locations and amplicon definitions. If processing SARS-CoV-2 data from a non-amplicon protocol, choose 'SARS-CoV-2, no primers'. Different selections will produce different results. Choose 'Custom' to provide your own reference genomes and primer set below.

  • SARS-CoV-2, ARTIC v5.3.2 primers

  • SARS-CoV-2, ARTIC v4.1 primers

  • SARS-CoV-2, ARTIC v4 primers

  • SARS-CoV-2, ARTIC v3 primers

  • SARS-CoV-2, no primers

  • Influenza A, Universal primers

  • Influenza B, Universal primers

  • Influenza A and B, Universal primers

  • Chikungunya Virus, Grubaugh Lab primers

  • Chikungunya Virus, Illumina primers

  • Dengue Virus Serotype 1 (DENV1), 400-bp DengueSeq primers

  • Dengue Virus Serotype 1 (DENV1), Illumina primers

  • Monkeypox Virus (MPXV) Clade II, Grubaugh Lab primers

  • Respiratory Syncytial Virus (RSV), CDC primers

  • Respiratory Syncytial Virus (RSV), WCCRRI primers

  • Zika Virus, Grubaugh Lab primers

  • Custom

Required if Experiment Type is set to 'Amplicon'

Custom Reference: Custom Reference FASTA For Consensus Generation

Provide a custom reference FASTA to use for consensus generation. Either Enrichment Panel or Amplicon Primer Set must be set to Custom to enable this field.

  • Sequence names must be unique and must not contain any space. If there is any space in the FASTA header, the part before the first space is assumed to be the sequence name.

  • It is recommended to use only the following characters in sequence names: letters, numbers, underscores (_), hyphens (-), parentheses (( and )), and periods (.). Otherwise, the sequence names may appear different in the output.

  • It is recommended to keep sequence names short (e.g. NC_045512.2). If needed, full names can be provided in the genomeName column of Reference BED below.

  • FASTA file name must not include any space, must not exceed 25 characters, and must use extension .fasta or .fa

Required if either Enrichment Panel or Amplicon Primer Set is set to 'Custom'

Custom Reference: Custom Reference BED

Optional if Enrichment Panel or Amplicon Primer Set is set to 'Custom'. Otherwise not applicable

Custom Reference: Custom PCR Primer Definitions

Optional if Amplicon Primer Set is set to 'Custom'. Otherwise not applicable

Custom Reference: NextClade Datasets

Select one or more available NextClade Datasets from the drop-down menu below. Hold ctrl/command key to select multiple or deselect.

Optional if either Enrichment Panel or Amplicon Primer Set is set to 'Custom'. Otherwise not applicable

Pangolin

Run Pangolin on applicable consensus genomes

True, False

True

Optional if any Enrichment Panel is selected, any SARS-CoV-2 Amplicon Primer Set is selected, or 'Custom' is selected for Enrichment Panel or Amplicon Primer Set. Otherwise not applicable

NextClade

Run NextClade on applicable consensus genomes. If providing Custom Reference, select NextClade Datasets above to enable. Otherwise not applicable.

True, False

True

Optional if any Enrichment Panel is selected, if a genome with NextClade dataset available is selected for Amplicon Primer Set, or if 'Custom' is selected for Enrichment Panel or Amplicon Primer Set. Otherwise not applicable

Advanced Workflow Settings: Dehost

If checked: input FASTQs will be scrubbed of all human reads, before the Map/Align stage, so that the output BAM includes only viral reads.

True, False

True

Required

Advanced Workflow Settings: Trim Consensus Sequences

Remove any leading and trailing masked nucleotides from the resulting consensus sequences. Does not affect internal masked regions.

True, False

True

Required

Advanced Workflow Settings: Minimum percentage of amplicons with at least 90% coverage ≥ 1x to enable variant calling and consensus sequence generation

80.0%

Required if Experiment Type is set to 'Amplicon'

Advanced Workflow Settings: Minimum read coverage depth for consensus sequence generation

Genomic positions with read coverage below this threshold will be considered indeterminate and hard-masked in the final consensus sequence

10

Required

Advanced Workflow Settings: Minimum percentage of consensus sequence generated to label as confident

Consensus sequences with percentage of callable bases below this threshold will be considered 'low confidence'. Callability is defined based on minimum coverage depth for consensus sequence generation (above)

5.0%

Required

Additional DRAGEN Command Line Arguments: Additional DRAGEN Map/Align Command Line Arguments

USE AT YOUR OWN RISK. This field allows the user to add any DRAGEN command line argument, which can cause DRAGEN to:

  • Crash/fail/hang

  • Run for a very long time

  • Generate unexpected or invalid results

The app appends this input text to the DRAGEN command line after removing invalid characters (valid characters are alphanumeric plus ._-"'). Note that there is no validation of the contents. If you use this field and the appsession aborts, the output*.log appsession log file may help to understand the cause of the failure.

Optional

Additional DRAGEN Command Line Arguments: Additional DRAGEN Variant Calling (Somatic) Command Line Arguments

USE AT YOUR OWN RISK. This field allows the user to add any DRAGEN command line argument, which can cause DRAGEN to:

  • Crash/fail/hang

  • Run for a very long time

  • Generate unexpected or invalid results

The app appends this input text to the DRAGEN command line after removing invalid characters (valid characters are alphanumeric plus ._-"'). Note that there is no validation of the contents. If you use this field and the appsession aborts, the output*.log appsession log file may help to understand the cause of the failure.

Optional

Organisms to Report (VSP)

Only the checked organisms will be reported (consensus sequences and metrics). This will not affect the underlying bioinformatics pipeline, only which outputs are provided.

All VSP organisms

Optional if Enrichment Panel is set to 'VSP'. Otherwise, not applicable

Organisms to Report (RVOP)

Only the checked organisms will be reported (consensus sequences and metrics). This will not affect the underlying bioinformatics pipeline, only which outputs are provided.

All RVOP organisms

Optional if Enrichment Panel is set to 'RVOP'. Otherwise, not applicable

Primer definition file formats

If primer coordinates are known (recommended)

Provide a BED file with at least 4 columns: chrom, chromStart, chromEnd, primerName. Additional columns can be included: pool, strand, sequence, but their order must be maintained.

For example, a 5-column BED file contains the columns chrom, chromStart, chromEnd, primerName, pool.

A 7-column BED file contains the columns chrom, chromStart, chromEnd, primerName, pool, strand, sequence.

If primer coordinates are unknown

Option 1. One line per amplicon with 3 columns: ampliconName, forwardSequence, reverseSequence.

Option 2. One line per primer with 3 columns: primerName, sequence, pool.

Formatting rules

  • General

    • All text is case sensitive.

    • Any line starting with '#' is ignored. This can be used to add a header line with column names.

    • Every line must have the same number of columns and format (except those starting with '#').

    • Any number of spaces can separate the columns. A value within a single column should not have any space.

  • BED format

    • Per standard BED conventions, sequence coordinates are given as 0-based, half-open intervals, such that the chromStart field (2nd column) contains the first nucleotide in the primer binding site and the last nucleotide in the primer binding site is the value in the chromEnd field (3rd column) minus 1.

    • chrom field must contain a sequence identifier that matches the header of the FASTA file containing the sequence that the coordinates are relative to.

    • Multiple sequence identifiers (chrom) are permitted within one file.

  • Primer name

    • primerName must be unique and encode the name of the amplicon for which the primer is designed, the direction tag indicating which side of the amplicon, left or right, the primer belongs to, and an optional indicator that the primer is an alternative primer for that amplicon.

    • In addition to _LEFT and _RIGHT, we permit _L and _R as direction tags in primerName. Any text after the direction tag should be separated by an underscore.

    • Text in primerName before the direction tag is considered to be an amplicon identifier. Ensure that the text of the amplicon identifier is unique for that amplicon and that the direction tag occurs only once in primerName.

    • Each amplicon must have at least one left and right primer (including alternative primers) associated with it.

    • Alternative primers are used to bind to locations that avoid sequence variation in the default primer binding site that may disrupt hybridization. An amplicon may have an arbitrary number of alternative primers (as long as the primer name is unique), but most amplicons will have none. Alternative primers are indicated by the presence of the _alt after the direction tag in primerName, followed by optional text to distinguish between different alternative primers, such as a number.

    • Examples of valid primer names:

      • MY_SEQUENCE_434_A_LEFT

      • virus1_L

      • amplicon_4934m_RIGHT_alt

      • amplicon_4934m_RIGHT_alt1

      • amplicon_4934m_R_altprimerB

    • Examples of invalid primer names:

      • LEFT_MY_SEQUENCE_434_A

      • virus1_l

      • amplicon_4934m_RIGHT_L
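The naming and coordinate rules above can be checked before upload with a short script. The following is a minimal sketch, not the app's own parser: the function and field names are hypothetical, and it assumes that any text following the direction tag is optional and marks an alternative primer only when it begins with "alt".

```python
from typing import NamedTuple, Optional

# Direction tags permitted in primerName (case sensitive).
DIRECTION_TAGS = {"LEFT": "left", "L": "left", "RIGHT": "right", "R": "right"}


class PrimerName(NamedTuple):
    amplicon: str   # text before the direction tag
    side: str       # "left" or "right"
    is_alt: bool    # True if "_alt..." follows the direction tag


def parse_primer_name(name: str) -> Optional[PrimerName]:
    """Return the parsed name, or None if it breaks the naming rules."""
    tokens = name.split("_")
    tag_positions = [i for i, tok in enumerate(tokens) if tok in DIRECTION_TAGS]
    # The direction tag must occur exactly once ...
    if len(tag_positions) != 1:
        return None
    pos = tag_positions[0]
    # ... and must be preceded by a non-empty amplicon identifier.
    if pos == 0:
        return None
    suffix = tokens[pos + 1:]
    is_alt = bool(suffix) and suffix[0].startswith("alt")
    return PrimerName("_".join(tokens[:pos]), DIRECTION_TAGS[tokens[pos]], is_alt)


def parse_primer_bed_line(line: str) -> Optional[dict]:
    """Parse one primer BED line; return None for blank or '#' lines."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split()  # any number of spaces separates columns
    start, end = int(fields[1]), int(fields[2])
    return {
        "chrom": fields[0],
        # BED is 0-based, half-open: first base is start + 1 in 1-based terms,
        # last base is chromEnd - 1 in 0-based terms, i.e. end in 1-based terms.
        "first_base_1based": start + 1,
        "last_base_1based": end,
        "length": end - start,
        "primer": parse_primer_name(fields[3]),
    }


if __name__ == "__main__":
    for candidate in ["MY_SEQUENCE_434_A_LEFT", "virus1_l", "amplicon_4934m_RIGHT_L"]:
        print(candidate, parse_primer_name(candidate))
```

A name that returns None here (for example a lower-case tag or a repeated tag) corresponds to the invalid examples listed above.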


Description for 'Custom Reference: Custom Reference BED': Provide a custom reference BED to describe each sequence in Custom Reference FASTA. See Genome definition BED file format.

Description for 'Custom Reference: Custom PCR Primer Definitions': Provide a file defining primers used in amplicon sequencing. See Primer definition file formats.

Description for 'Advanced Workflow Settings: Minimum percentage of amplicons with at least 90% coverage ≥ 1x to enable variant calling and consensus sequence generation': At low input concentrations, errors produced by the reverse transcriptase enzyme can propagate to high frequencies, leading to false positive sequence variants. Therefore, we attempt to infer the sample concentration from the amplicon coverage using this metric. If you wish to adjust this, we advise conducting internal studies to examine variant call reproducibility between replicates to determine a threshold that will produce acceptable quality levels for your application. Only applicable to amplicon sequencing where primers are defined. See Special considerations for amplicon sequencing with IMAP protocols.

Example of a primer definition file in 5-column BED format:

#chrom  chromStart  chromEnd  primerName     pool
seqX    0           15        primer1_LEFT   1
seqX    1745        1760      primer1_RIGHT  1
seqY    0           15        primer2_LEFT   2
seqY    1015        1030      primer2_RIGHT  2

Example of a primer definition file in 7-column BED format:

#chrom  chromStart  chromEnd  primerName     pool  strand  sequence
seqX    0           15        primer1_LEFT   1     +       GGGCAAACCTAAAGG
seqX    1745        1760      primer1_RIGHT  1     -       GTTATGTAAAGGTGC
seqY    0           15        primer2_LEFT   2     +       GGGCGAAACTAAAGG
seqY    1015        1030      primer2_RIGHT  2     -       GTTATGTAAAGGTGC

Example of a primer definition file with unknown coordinates, Option 1 (one line per amplicon):

#ampliconName  forwardSequence  reverseSequence
amplicon1      GGGCAAACCTAAAGG  GTTATGTAAAGGTGC
amplicon2      GGGCGAAACTAAAGG  GTTATGTAAAGGTGC

Example of a primer definition file with unknown coordinates, Option 2 (one line per primer):

#primerName       sequence         pool
primer1_LEFT      GGGCAAACCTAAAGG  1
primer1_LEFT_alt  GGGCGAAACTAAAGG  1
primer1_RIGHT     GTTATGTAAAGGTGC  1

Special considerations for amplicon sequencing with IMAP protocols

Amplicon sequencing, especially of RNA viruses, requires additional bioinformatics processing to ensure maximum quality of the resulting data.

In RT-PCR, a reverse transcriptase enzyme first generates cDNA molecules using the RNA molecules in the sample as templates, before amplifying the cDNA sequences using a DNA polymerase enzyme during PCR. These amplified cDNA sequences are then further processed to generate the sequencing libraries. Both of these enzymes can potentially introduce an incorrect base into a sequence, generating a position where the resulting sequence does not match the sequence in the sample -- that is, an error.

Reverse transcriptases exhibit error rates that are multiple orders of magnitude higher than those of DNA polymerases.

When large numbers of nucleic acid molecules are present in a reaction, these individual misincorporation errors are largely uncorrelated and appear at very low frequencies, so that they are typically ignored by variant callers.

However, when the number of incoming nucleic acid molecules is small, such as for a low-titer virus sample, an error that occurs during the RT step or early in the PCR reaction can, as a result of sampling noise, be amplified to high frequencies in the resulting sequencing libraries. When the variant caller encounters such a position, it will be treated as a sequence variant, since it is a true sequence variant in the context of the library provided to the instrument. As a result, these artifactual sequence variants often have high allele frequency and very good quality scores, which makes them very difficult to detect and remove. This can result in a false positive variant call that, at a sufficiently high allele frequency, will also be incorporated into the consensus sequence. It is also possible for a true sequence variant to have its allele frequency depressed by this same process (if the error results in a reversion to the reference sequence), but this is much less common.

Since it is difficult to identify enzyme-introduced false variants after the fact, we instead take a pre-emptive approach to ensuring data quality. As noted above, sampling noise as a function of molecular abundance is the mechanism responsible for boosting of the frequency of individual enzymatic errors into artifactual variants, and therefore the magnitude of this effect is largely a function of the concentration of the nucleic acids in the reaction. Therefore, the software first attempts to determine whether there is sufficient sample material present before proceeding with variant calling and consensus sequence generation.

To determine this, the software takes advantage of the fact that the probability of each amplicon being amplified is a function of the nucleic acid concentration, with higher concentrations leading to a higher probability of amplification. By counting the observed proportion of amplicons with detectable sequence coverage, we can estimate this probability and compare it to an experimentally-determined threshold that corresponds to the minimum concentration needed to produce reliable variant calls.

To compute this, we calculate the number of amplicons with at least 1x coverage for at least 90% of the non-overlapping portion of the amplicon sequence. The 1x coverage threshold used here is fixed and independent of the minimum read coverage depth for consensus sequence generation which defaults to 10x. The number of amplicons that meet this threshold is then divided by the total number of amplicons in the experiment, which is the number of amplicons whose location falls in reference sequences selected for short read alignment. If the resulting fraction is at least 80%, the sample is considered to have sufficient material for accurate variant calling and the variant calling and consensus sequence generation steps are performed. If it is below this threshold, the sample is not processed further to avoid spurious variant calls. The user can override the 80% threshold in the "Minimum percentage of amplicons with at least 90% coverage ≥ 1x to enable variant calling and consensus sequence generation" control in the "Advanced Workflow Settings" section. See App Settings.
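The check described above can be sketched in a few lines of Python. This is an illustration only, not the app's implementation: the function names and the input layout (a dict mapping a hypothetical amplicon name to the per-base read depths over its non-overlapping region) are assumptions. Only the 1x depth, 90% coverage, and 80% detection values come from the text.

```python
from typing import Dict, List, Tuple


def amplicon_is_detected(depths: List[int],
                         min_depth: int = 1,
                         min_fraction: float = 0.90) -> bool:
    """True if at least min_fraction of the bases have depth >= min_depth."""
    if not depths:
        return False
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths) >= min_fraction


def sample_has_sufficient_material(
        amplicon_depths: Dict[str, List[int]],
        min_detected_pct: float = 80.0) -> Tuple[int, int, bool]:
    """Return (detected, total, passes) for one sample."""
    total = len(amplicon_depths)
    detected = sum(amplicon_is_detected(d) for d in amplicon_depths.values())
    pct = 100.0 * detected / total if total else 0.0
    return detected, total, pct >= min_detected_pct


if __name__ == "__main__":
    # Toy sample with five amplicons of ten bases each (made-up depths).
    sample = {
        "amp1": [5] * 10,             # fully covered
        "amp2": [0] + [3] * 9,        # 90% of bases at >= 1x: detected
        "amp3": [0, 0] + [2] * 8,     # 80% of bases at >= 1x: not detected
        "amp4": [12] * 10,
        "amp5": [1] * 10,
    }
    print(sample_has_sufficient_material(sample))
```

In this toy sample four of five amplicons are detected, which is exactly 80%, so variant calling and consensus generation would proceed under the default threshold. Note that the depth used here (1x) is independent of the 10x default used for consensus masking.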

Output files

Note: Some files may not be generated depending on user inputs and pipeline outcome

For each sample, the pipeline generates a directory named after the sample. This directory contains the following subdirectories:

  • consensus/

    • {sample_name}_sample_consensus.fasta : Contains all hard-masked consensus sequences for the sample. Regions with coverage below minimum coverage depth for consensus sequence generation (10x by default) are considered not callable and are therefore “hard-masked” with letter N. Variant calling is not applied to these regions. If the user selected specific VSP or RVOP organisms to be reported, this file excludes consensus sequences that are generated but do not belong to the selected organisms.

    • {sample_name}.consensus_hard_masked_sequence.fa : Identical to {sample_name}_sample_consensus.fasta, except for sequence headers. Moreover, even if the user selected specific VSP or RVOP organisms to be reported, this file contains all consensus sequences, including those that do not belong to the selected organisms.

    • {sample_name}.consensus_soft_masked_sequence.fa : Identical to {sample_name}.consensus_hard_masked_sequence.fa, except low-coverage regions are “soft-masked” with lower-case letters that match the reference. Variant calling is not performed in these regions.

    • {sample_name}_{virus_name}_virus_consensus.fasta : Contains hard-masked consensus sequences generated for a particular virus. If the virus is not segmented (i.e. one reference sequence for the virus), this file contains a single sequence and is identical to {sample_name}_{virus_name}_{segment_name}_{sequence_accession}_consensus.fasta.

    • {sample_name}_{virus_name}_{segment_name}_{sequence_accession}_consensus.fasta : Contains a hard-masked consensus sequence generated with a particular reference sequence.

    • {sample_name}.consensus_from_vcf.log : Log file generated during consensus sequence generation.

  • contig/

    • {sample_name}_sample_contig.fasta : Contains all de novo assembled contigs generated for the sample.

    • {sample_name}_unmapped_contig.fasta : Contains de novo assembled contigs that could not be mapped to any reference sequence. Because de novo assembly is reference-free, these contigs may correspond to sequences that are too diverged from those in the reference database or sequences not included in the database.

    • {sample_name}_{virus_name}_virus_contig.fasta : Contains de novo assembled contigs that mapped to a particular virus.

    • {sample_name}_{virus_name}_{segment_name}_{sequence_accession}_contig.fasta : Contains de novo assembled contigs that mapped to a particular reference sequence, which resulted in choosing the reference sequence for short-read alignment and generating {sample_name}_{virus_name}_{segment_name}_{sequence_accession}_consensus.fasta.

    • {sample_name}_reference_selection.log : Log file generated during reference selection.

  • map_align/

    • {sample_name}_unfiltered_tumor.bam or {sample_name}_unfiltered_tumor_primertrim_with_unmapped_reads.bam : Short reads mapped to all selected reference sequences. If a primer set is available and properly mapped to the reference sequences, {sample_name}_unfiltered_tumor_primertrim_with_unmapped_reads.bam is provided as output. Its reads have primer sequences trimmed based on primer binding site coordinates.

    • {sample_name}-unmapped_S1_L001_R1_001.fastq.gz, {sample_name}-unmapped_S1_L001_R2_001.fastq.gz : Short reads that do not map to any selected reference sequences. These may be used to find organisms not reported by the pipeline.

  • variant_calling/

    • {sample_name}.consensus_filtered_variants.vcf.gz : Contains variant calls that passed consensus filter and were therefore applied to consensus sequences. They are a subset of variants listed in {sample_name}.hard-filtered.vcf.gz.

    • {sample_name}.consensus_filtered_variants_vcf_stats.txt : Summarizes all variant calls in {sample_name}.consensus_filtered_variants.vcf.gz. Outputted by bcftools stats.

    • {sample_name}.consensus_filtered_variants_summary.csv : Describes each variant call in {sample_name}.consensus_filtered_variants.vcf.gz.

    • {sample_name}.hard-filtered.vcf.gz : Contains all variant calls.

    • {sample_name}.consensus_input_vcf_stats.txt : Summarizes all variant calls in {sample_name}.hard-filtered.vcf.gz. Outputted by bcftools stats.

    • {sample_name}.consensus_all_variants_summary.csv : Describes each variant call in {sample_name}.hard-filtered.vcf.gz.

  • metrics/

    • {sample_name}_num_reads.tsv : Reports number of input reads, reads filtered out at each pre-processing step, reads mapped to each selected reference sequence, etc.

    • {sample_name}_metadata.json : Reports parameters, read counts, amplicon counts, analysis results, and other metadata.

    • {sample_name}.consensus_metrics.csv : Reports consensus metrics (e.g. total length of pre-trimmed sequence, fraction of masked bases, number of callable bases) for each generated consensus sequence

    • {sample_name}.consensus_coverage_from_filtered_bam.tsv : Reports base-pair-resolution read coverage for all reference sequences based on short-read map/align step. Its three columns correspond to: chrom/accession, base position, coverage.

    • {sample_name}.consensus_callable_regions_from_filtered_bam.bed : Reports callable regions in all reference sequences based on base-pair-resolution read coverage and minimum coverage depth for consensus sequence generation (10x by default). Bases outside of these regions are masked in consensus FASTA.

  • amplicon/

    • {sample_name}_processed_non_overlapping_amplicon.bed : Lists all non-overlapping amplicon regions (i.e. covered with exactly one amplicon). If a custom primer set is provided, this file also lists selected reference sequences lacking amplicons. While amplicons are defined based on primer binding sites, for viruses like Influenza, reference sequences often lack primer binding sites, which are located at sequence ends. This results in defining fewer or sometimes no amplicons for an entire viral genome. To avoid this, each reference sequence without any amplicons defined is considered an amplicon and is listed in this file. All regions in this file are used for amplicon detection to infer sample concentration and determine if it is sufficient to apply variant calling and consensus sequence generation.

    • {sample_name}_calculate_amplicon_coverage.csv : Reports coverage metrics (e.g. median coverage, fraction of bases with at least 1x coverage) for each non-overlapping amplicon region listed in {sample_name}_processed_non_overlapping_amplicon.bed.

    • {sample_name}_generate_all_primer_bed.log : Log file generated while defining amplicons for selected reference sequences and writing relevant BED files.

  • tertiary/

    • nextclade_{sample_name}_{sequence_accession}_{dataset_name}.csv : CSV output file generated by NextClade given a consensus sequence (generated with the specified sequence as reference) and a NextClade dataset.

    • nextclade_{sample_name}_{sequence_accession}_{dataset_name}.tsv : Same as nextclade_{sample_name}_{sequence_accession}_{dataset_name}.csv except in TSV format.

    • nextclade_{sample_name}_{sequence_accession}_{dataset_name}.json : Same as nextclade_{sample_name}_{sequence_accession}_{dataset_name}.csv except in JSON format.

    • nextclade_{sample_name}_{sequence_accession}_{dataset_name}_log.txt : Log file generated by NextClade given a consensus sequence (generated with the specified sequence as reference) and a NextClade dataset.

    • pangolin_{sample_name}_{sequence_accession}_lineage_report.csv : CSV output file generated by Pangolin given a consensus sequence (generated with the specified sequence as reference) as input.

    • pangolin_{sample_name}_{sequence_accession}_log.txt : Log file generated by Pangolin given a consensus sequence (generated with the specified sequence as reference) as input.

  • reference/

    • reference.bed : Describes all reference sequences. If a custom reference was provided, sequence names may appear different in this BED file.

    • reference.json : Same as reference.bed but with more detail. If any of the sequences were renamed during the pipeline, this file provides the mapping between the original and renamed versions.
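The relationship between per-base coverage, callable regions, and the hard-masked and soft-masked consensus files listed above can be illustrated with a small sketch. It is not the pipeline's code: the function names are hypothetical, the consensus is assumed to have the same length as the reference (substitutions only, no indels), and callable regions are written as 0-based, half-open intervals on the assumption that the BED output follows the usual BED convention.

```python
from typing import List, Tuple


def callable_regions(depths: List[int], min_depth: int = 10) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) runs with depth >= min_depth."""
    regions = []
    start = None
    for i, depth in enumerate(depths):
        if depth >= min_depth and start is None:
            start = i
        elif depth < min_depth and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(depths)))
    return regions


def mask_consensus(consensus: str, reference: str, depths: List[int],
                   min_depth: int = 10, hard: bool = True) -> str:
    """Mask low-coverage positions with N (hard) or lower-case reference (soft)."""
    out = []
    for cons_base, ref_base, depth in zip(consensus, reference, depths):
        if depth >= min_depth:
            out.append(cons_base.upper())
        else:
            out.append("N" if hard else ref_base.lower())
    return "".join(out)


def trim_masked_ends(hard_masked: str) -> str:
    """Drop leading and trailing N; internal masked bases are kept."""
    return hard_masked.strip("N")


if __name__ == "__main__":
    reference = "ACGTAGCA"
    consensus = "ACGTTGCA"             # one substitution at index 4
    depths = [0, 3, 12, 15, 20, 4, 11, 2]
    hard = mask_consensus(consensus, reference, depths, hard=True)
    print(callable_regions(depths))
    print(hard)
    print(mask_consensus(consensus, reference, depths, hard=False))
    print(trim_masked_ends(hard))
```

The last function mirrors the "Trim Consensus Sequences" setting: only masked bases at the ends are removed, so internal low-coverage stretches remain as N.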

Frequently Asked Questions (FAQ)

Q: After processing my sample with an enrichment panel, the majority of my reads are removed in preprocessing and/or I only have a small amount of viral reads. Is the enrichment panel working?

A: The enrichment protocols can create a several-thousand-fold increase in the abundance of the targeted viral species. However, in many sample types, especially clinical, wastewater, or environmental samples, viral RNA or DNA makes up a tiny proportion of the total nucleic acids present, with the remainder dominated by host (human) or bacterial/archaeal material. So even with a dramatic enrichment over what you would obtain without enrichment, the percentage of viral reads can still be low. For example, you may have a sample with only 2% viral reads, but without enrichment you might have obtained only 0.1% viral reads. If a virus is at low abundance after enrichment, it was likely at extremely low abundance prior to enrichment.

Q: The default minimum coverage is 10x, but is that applied across the entire contig, as a median, as an average, or in some other way?

A: The 10x threshold is applied per-nucleotide. Any positions below 10x coverage will be hard-masked with "N".

Q: In the demo data, I downloaded the consensus sequence FASTA file and each sequence line says “Panel reference sequences are not necessarily comprehensive and should not be used for strain typing.” Does that mean that even though VSP provides full-genome resolution of all 66 viruses, the app can only strain-type the strains listed because of the reference sequences the app uses?

A: Correct, we only align to a limited number of reference sequences for each virus type, so the sequence accession in the consensus genomes (and coverage plots, etc) merely reflects the best match chosen from that subset. There could be sequences in RefSeq that are a closer match. Furthermore, strain typing is not necessarily as simple as choosing the closest matching genome; there are further complexities that can go into it, and we have not systematically developed or tested any strain typing capability to date. The noted message is to warn users that the sequence accession in the consensus genome does not necessarily reflect the true phylogeny of the organisms in the sample and should not be taken as such.

Q: Why am I seeing some segments from one Influenza A subtype and some from another subtype? Does my sample contain both subtypes?

A: For each de novo assembled contig, we aim to find the best matching reference sequence rather than an entire reference genome. If the best match for one contig is a reference sequence from one subtype and the best match for another contig is a reference sequence from another subtype, then we will report them as such. This is not necessarily indicative of a mixed infection, reassortment, or error. It is usually reflective of how similar certain segments can be across different subtypes.

Influenza A viruses are classified into different subtypes based on the hemagglutinin (HA) and neuraminidase (NA) proteins, which are encoded by segments 4 and 6, respectively. Therefore, we recommend focusing on those segments to infer the subtype. If there is a sequence generated from segment 8 of an H3N2 genome but all the rest of the consensus sequences are generated from reference sequences from an H1N1 genome (indicating H1 and N1 subtypes), then the sample likely contains H1N1, not H3N2. One possible explanation is that segment 8 from H1N1 and segment 8 from H3N2 were both good matches for a particular contig but the one from H3N2 was a slightly better match and therefore chosen as final reference. Similarly, if there is a sequence generated from segment 4 of an H1N1 genome (indicating H1 subtype) and a sequence generated from segment 6 of an H5N1 genome (indicating N1 subtype), then the sample likely contains H1N1, not H5N1.
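The recommendation to read the subtype from segments 4 and 6 can be expressed as a short sketch. The input layout (a dict mapping segment number to the subtype label of the reference chosen for that segment) and the function name are hypothetical and only serve to illustrate the reasoning; the app does not report a subtype call in this form.

```python
import re
from typing import Dict, Optional

SUBTYPE_PATTERN = re.compile(r"H(\d+)N(\d+)")


def infer_influenza_a_subtype(segment_subtypes: Dict[int, str]) -> Optional[str]:
    """Combine the H type from segment 4 (HA) with the N type from segment 6 (NA)."""
    ha_label = segment_subtypes.get(4)
    na_label = segment_subtypes.get(6)
    if ha_label is None or na_label is None:
        return None  # cannot infer without both HA and NA segments
    ha_match = SUBTYPE_PATTERN.fullmatch(ha_label)
    na_match = SUBTYPE_PATTERN.fullmatch(na_label)
    if ha_match is None or na_match is None:
        return None
    return f"H{ha_match.group(1)}N{na_match.group(2)}"


if __name__ == "__main__":
    # Segment 8 matched an H3N2 reference, but HA and NA both point to H1N1.
    print(infer_influenza_a_subtype({4: "H1N1", 6: "H1N1", 8: "H3N2"}))
    # HA from an H1N1 reference, NA from an H5N1 reference.
    print(infer_influenza_a_subtype({4: "H1N1", 6: "H5N1"}))
```

Both calls correspond to the two scenarios described above, and in both the segments other than 4 and 6 do not influence the result.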

Q: Why does the "Detected Amplicons" column show 7 in the denominator for an Influenza genome when there should be 8 amplicons in total?

A: The denominator in the "Detected Amplicons" column is based on the reference sequences selected based on de novo assembled contigs. Depending on the quality of the sample and/or reads, the assembler may not have enough data to generate a contig for some segments. Shorter segments are more likely to be missed. If only 7 segments are selected as final reference for short read alignment, then we expect 7 amplicons in total. If you believe that the sample should contain all 8 segments, you can download the contig FASTA file from our report page and submit it to NCBI BLAST to see if all 8 segments are present in the contig sequences.

One known issue is that chimeric reads can be generated during library preparation, which can lead to chimeric contigs, where the contig sequence contains sequences from more than one segment. This can result in missing an entire segment in the reference selection stage. A workaround may be to filter out chimeric reads from your FASTQ files before running the app.

Alternatively, you can force the app to use all 8 segments of a particular Influenza genome by providing a custom reference FASTA file with all 8 segment sequences and a custom reference BED file with the genomeName column set to the same value for every segment (e.g. Influenza A). This way, the app will not perform assembly and will use all 8 segments as the reference sequences for short read alignment.

Q: What is the difference between the "Detected Amplicons" and "% callable bases" columns?

A: The "Detected Amplicons" column shows the number of amplicons detected over the total number of amplicons expected for that genome. The percentage of amplicons detected is used to infer if the sample is of sufficient quality for variant calling. The "% callable bases" column shows the percentage of the selected reference genome whose bases are above the minimum read coverage depth for consensus sequence generation, which is computed independent of amplicon coordinates. Both metrics are useful to assess the quality of the sample, but the percentage of detected amplicons is used by the app after short read alignment to filter out low-titer samples and the percentage of callable bases is not.

Q: I don't see the virus I'm interested in listed in the "Genomes generated" column. Does that mean the virus is not present in my sample?

A: Not necessarily. Your virus of interest may be present in the sample, but the app may not have generated a consensus sequence for it for various reasons. One reason could be that there are too few reads coming from that virus. Tools like DRAGEN Metagenomics can be used to characterize what is in the sample more broadly. Another reason could be that the virus in your sample is too divergent from the reference sequences used in the app. In such cases, we recommend downloading the contig FASTA file from our report page and submitting it to NCBI BLAST. If you do see a sequence that matches your virus of interest, you can provide that sequence to the app as a custom reference genome.

Q: I am using the custom analysis workflow and my analysis aborted or shows an error, why?

A: An analysis can fail for many reasons, but one of the most common is that the custom database was not formatted correctly. The requirements for the Custom Reference FASTA For Consensus Generation are:

  • Do not use spaces in the file name; use an underscore "_" instead

  • Do not exceed 25 characters in the file name

  • File extension must be .fasta or .fa

  • Do not exceed the file size limitation: 16GB for a single file or 25GB for multiple files

  • Do not have duplicate entries

  • If providing a Custom Reference BED and/or Custom PCR Primer Definitions in BED format, the names in the first column of the BED file (chrom) must match the names that appear in the FASTA (text after > and before the first whitespace character)

For more details, see the general guidelines for uploading data to BaseSpace: https://help.basespace.illumina.com/manage-data/import-data

If you continue having issues, reach out to techsupport@illumina.com
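
Several of the requirements above can be checked locally before uploading. The following is a minimal sketch, not part of the app: the function names are hypothetical, the 25-character limit is assumed to include the file extension, "duplicate entries" is assumed to mean repeated sequence names, and the file size limit is not checked.

```python
from pathlib import Path

MAX_NAME_LENGTH = 25                    # limit stated in the requirements
ALLOWED_EXTENSIONS = {".fasta", ".fa"}  # extensions stated in the requirements


def fasta_names(fasta_text):
    """Return sequence names: text after '>' and before the first whitespace."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


def check_custom_reference(file_name, fasta_text, bed_text=None):
    """Return a list of problems found (empty if none were found)."""
    problems = []
    if any(ch.isspace() for ch in file_name):
        problems.append("file name contains whitespace; use '_' instead")
    # Assumption: the 25-character limit includes the extension.
    if len(file_name) > MAX_NAME_LENGTH:
        problems.append("file name is longer than 25 characters")
    if Path(file_name).suffix not in ALLOWED_EXTENSIONS:
        problems.append("file extension must be .fasta or .fa")

    names = fasta_names(fasta_text)
    # Assumption: "duplicate entries" means repeated sequence names.
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        problems.append("duplicate FASTA entries: " + ", ".join(duplicates))

    if bed_text is not None:
        known = set(names)
        for line in bed_text.splitlines():
            if not line.strip():
                continue
            chrom = line.split("\t")[0]  # first BED column
            if chrom not in known:
                problems.append(f"BED chrom '{chrom}' not found in FASTA")
    return problems
```

An empty list means none of the checked requirements were violated; it does not guarantee that the analysis will succeed.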

Pipeline Logic

Descriptions of the logic underlying the DRAGEN Targeted Microbial analysis pipeline

Flowchart

Pipeline Steps

| Step | Module/Script | Run |
| --- | --- | --- |
| Reference reformatting/validation | custom script | If custom reference is provided |
| Read QC | trimmomatic | Always |
| Primer trimming (on FASTQ) | trimmomatic | If primer set exists |
| Read dehosting | human_read_scrubber | If checked in Input Form |
| Assembly | MEGAHIT | If reference FASTA and BED files imply more than one genome as reference |
| Contig clustering | CD-HIT | If assembly ran |
| Reference selection | custom script | If assembly ran, otherwise input reference database is used as is |
| Primer alignment / reformatting | custom script | If primer set is provided. Primers are aligned to selected reference sequences if coordinates are not provided. Otherwise, primers mapping to selected reference sequences (based on the provided primer coordinates) are selected as final set of primers |
| Map/Align | DRAGEN | If at least one reference sequence is generated |
| Post-facto primer trimming (on BAM) | custom script | If Map/Align ran and primer set exists |
| Sample filtering based on amplicon coverage | custom script | If Map/Align ran and primer set exists |
| Variant calling | DRAGEN | If Map/Align ran and sample passed filter above |
| Consensus sequence generation | custom script | If Map/Align ran and sample passed filter above |
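
The run conditions in the table above can be read as a small decision procedure. The sketch below only restates those conditions in code; the function and parameter names are hypothetical, a single `primer_set` flag stands in for both "primer set exists" and "primer set is provided", and it is assumed that a sample with no primer set skips the amplicon filter and proceeds to variant calling.

```python
def upstream_steps(custom_reference, primer_set, dehost, multiple_genomes):
    """List the steps that run before Map/Align, following the table."""
    steps = []
    if custom_reference:
        steps.append("Reference reformatting/validation")
    steps.append("Read QC")  # always runs
    if primer_set:
        steps.append("Primer trimming (on FASTQ)")
    if dehost:
        steps.append("Read dehosting")
    if multiple_genomes:
        # Clustering and reference selection only run if assembly ran.
        steps += ["Assembly", "Contig clustering", "Reference selection"]
    if primer_set:
        steps.append("Primer alignment / reformatting")
    return steps


def downstream_steps(reference_generated, primer_set, passed_amplicon_filter=True):
    """List the steps from Map/Align onward for one sample."""
    if not reference_generated:
        return []  # no reference sequence, so Map/Align does not run
    steps = ["Map/Align"]
    if primer_set:
        steps += [
            "Post-facto primer trimming (on BAM)",
            "Sample filtering based on amplicon coverage",
        ]
    # Assumption: without a primer set the amplicon filter is not applied.
    if not primer_set or passed_amplicon_filter:
        steps += ["Variant calling", "Consensus sequence generation"]
    return steps
```

For example, `downstream_steps(True, True, passed_amplicon_filter=False)` stops after the amplicon coverage filter.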

Possible outcomes

| Status | Level | Outcome | Output files (excluding log files) |
| --- | --- | --- | --- |
| Pipeline completed | Pipeline | Pipeline exits | All |
| Custom files are not formatted correctly | Pipeline | Pipeline exits | None |
| None of the primers provided in custom primer definition file align to selected reference sequences | Sample | Skip post-facto primer trimming and sample filtering based on amplicon coverage for this sample | All except for amplicon-related output files |
| No reference found after assembly | Sample | Do not proceed to short read alignment for this sample | Contig FASTA |
| Insufficient amplicon coverage | Sample | Do not proceed to variant calling for this sample | Contig FASTA (if assembly was run) |

Result Reports

Describes the reports that can be viewed from the individual sample links on the left side of the reports tab or by clicking on sample names in the Metrics by sample table.

Version information

At the top of the report is version information for the App and any third-party components.

FASTA downloads

Two buttons provide the ability to download relevant FASTA-formatted text files for this sample. The "Consensus" button initiates a download of a FASTA file containing all consensus sequences generated for this sample. The "Contig" button initiates a download of a FASTA file containing all assembled contigs for this sample.

Metrics by virus

The metrics by virus table contains information about each viral genome generated. Each row summarizes all sequences assigned to that virus. In the case of multi-segment viruses like Influenza, a row will summarize information across all segment sequences generated for a single viral genome. It contains buttons to download the contents of the table as a CSV, JSON or PDF file. The table itself contains rows for every virus with at least one generated genome in the sample. It contains the following columns:

  1. Virus: The name of the virus. For custom references, this will be the part of the FASTA header before the first whitespace character for the corresponding reference sequence if no custom genome definition file is provided. If a custom genome definition file is provided, this will be the value of the genomeName column that corresponds to the selected reference (matched by the value in the chrom column of the genome definition file and the part of the FASTA header before the first whitespace character).

  2. % callable bases: The percentage of the selected reference genome whose bases are considered "callable." Callable bases are those for which reliable variant calling can be performed and therefore for which the software can output a base call. Callable bases are defined as genomic positions with read coverage above the minimum read coverage depth for consensus sequence generation (10x by default). Note that genomic positions below the confidence threshold are hard-masked with "N" characters to avoid reference bias (inclusion of a reference base when the actual base cannot be accurately determined). Note that this percentage is calculated over the lengths of the reference genome(s), not the reported consensus sequence(s).

  3. Status: The overall outcome of the analysis for this virus

    1. Full analysis (consensus, VC) means that the sample analysis completed normally, that a sufficient number of amplicons were detected to ensure reliable variant calling (amplicon experiments only), and that the percentage of callable bases was above the minimum percentage of consensus sequence generated to label as confident (5% by default)

    2. Low confidence means that there is at least one callable base but the overall percentage of callable bases was below the minimum percentage of consensus sequence generated to label as confident (5% by default)

    3. No callable bases indicates that zero positions in the indicated reference genome were callable and no consensus sequence is therefore provided.

  4. Median coverage: The median coverage value (in number of reads overlapping each position) over the entire reference genome (not just the generated consensus sequence).

  5. Consensus FASTA: A download link to a FASTA-formatted text file containing all the consensus sequences generated for this virus.
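
The Status values can be illustrated with a short sketch. This is not the app's implementation: the function name is hypothetical, the order in which the checks are applied is an assumption, and a percentage exactly equal to a threshold is assumed to pass. The 5% and 80% defaults are the ones quoted in this documentation.

```python
def virus_status(pct_callable, pct_amplicons_detected=None,
                 min_pct_confident=5.0, min_pct_amplicons=80.0):
    """Map summary percentages to a Status label.

    pct_amplicons_detected is only meaningful for amplicon experiments;
    leave it as None otherwise.
    """
    # Assumption: the amplicon check takes precedence over the others.
    if pct_amplicons_detected is not None and pct_amplicons_detected < min_pct_amplicons:
        return "Insufficient titer for VC"
    if pct_callable <= 0:
        return "No callable bases"
    if pct_callable < min_pct_confident:
        return "Low confidence"
    return "Full analysis (consensus, VC)"
```

With the defaults, a virus with 3.2% callable bases would be labeled "Low confidence".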

Metrics by sequence

This table summarizes the results for each sequence generated for the sample. For multi-segment viruses such as Influenza, there may be multiple sequences detected for a given virus. For single-segment viruses there will typically be only one sequence per virus. It contains buttons to download the contents of the table as a CSV, JSON or PDF file. The table itself contains rows for every sequence. It contains the following columns:

  1. Virus: The name of the virus. For custom references, this will be the part of the FASTA header before the first whitespace character for the corresponding reference sequence if no custom genome definition file is provided. If a custom genome definition file is provided, this will be the value of the genomeName column that corresponds to the selected reference (matched by the value in the chrom column of the genome definition file and the part of the FASTA header before the first whitespace character).

  2. Segment: The name of the genome segment to which the sequence belongs. For viruses with a single segment, the name of the segment will typically be "Full".

  3. Accession: The accession number or other short unique identifier for the sequence. If using a custom genome definition BED, this value is taken from the first column (chrom) in the definition file. If using a custom reference without a genome definition file, the value is taken from the part of the FASTA header before the first whitespace character.

  4. % callable bases: The percentage of the selected reference sequence whose bases are considered "callable." Callable bases are those for which reliable variant calling can be performed and therefore for which the software can output a base call. Callable bases are defined as genomic positions with read coverage above the minimum read coverage depth for consensus sequence generation (10x by default). Note that genomic positions below the confidence threshold are hard-masked with "N" characters to avoid reference bias (inclusion of a reference base when the actual base cannot be accurately determined). Note that this percentage is calculated over the lengths of the reference sequence, not the reported consensus sequence.

  5. Median coverage: The median coverage value (in number of reads overlapping each position) over the entire reference sequence (not just the generated consensus sequence).

  6. Consensus sequence length: The length of this consensus sequence. The reported length is the length of the hard-masked sequence after trimming any leading and trailing masked regions (if trimming is active).

  7. # callable bases: The number of positions in the reference sequence above the minimum read coverage depth for consensus sequence generation (default 10x). In other words, the number of positions not masked. This may not be equal to the number of unmasked positions in the final consensus sequence since insertions and deletions are applied after masking.

  8. Consensus FASTA: A download link to a FASTA-formatted text file containing this consensus sequence.
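
To make the coverage-based columns concrete, here is a minimal sketch that derives them from a list of per-position read depths. The function names are hypothetical, and a position at exactly the minimum depth is assumed to count as callable. The masking helper only shows the hard-masking and trimming steps; it does not apply variants, insertions, or deletions.

```python
from statistics import median


def coverage_metrics(depths, min_depth=10):
    """Compute # callable bases, % callable bases and median coverage."""
    if not depths:
        raise ValueError("depths must contain at least one position")
    # Assumption: depth equal to min_depth counts as callable.
    callable_bases = sum(1 for d in depths if d >= min_depth)
    return {
        "callable_bases": callable_bases,
        # Percentage is taken over the reference length, not the consensus.
        "pct_callable_bases": 100.0 * callable_bases / len(depths),
        "median_coverage": median(depths),
    }


def hard_mask(sequence, depths, min_depth=10):
    """Replace bases at positions below min_depth with 'N'."""
    return "".join(
        base if depth >= min_depth else "N"
        for base, depth in zip(sequence, depths)
    )


depths = [0, 4, 12, 30, 30, 9, 15, 2]
masked = hard_mask("ACGTACGT", depths)
# Trimming leading and trailing masked regions gives the reported length.
trimmed_length = len(masked.strip("N"))
```

In this toy example 4 of 8 positions are callable (50%), while the trimmed consensus is 5 bases long because an internal masked position is retained.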

Pre-processing metrics

This stacked bar plot contains information about the outcome of the pre-processing steps (read QC, trimming, de-hosting) as well as the alignment step. It contains counts of reads that fall into the following categories:

  • Removed in QC: Reads that failed to meet the minimum quality thresholds and were excluded from further processing.

  • Removed in trimming: Reads that were removed in the initial sequence-based primer trimming step and were excluded from further processing.

  • Removed in de-hosting: Reads that were removed in the de-hosting step and excluded from further processing. De-hosting is the process of removing reads that may originate from the host organism. Currently only human hosts are supported. De-hosting reads improves the quality of downstream analysis and helps ensure that human sequences are not included in the output BAM files.

  • Unmapped: Reads that were not aligned to any reference genomes.

  • Mapped: Reads that were mapped to at least one reference genome.

Alignment metrics

A column plot displaying the numbers and percentages of all reads that aligned to each reference sequence with at least one mapped read. The columns are labeled by both virus and segment name (if available) on the x-axis, and the y-axis is the read count for each sequence.

Coverage

Displays a trace of the read coverage over each reference sequence. The drop-down menu in the upper left allows the user to switch between viruses. If multiple segment sequences are generated for a single virus, their corresponding coverage plots will be displayed in a vertically stacked fashion. The black trace represents the read coverage, with the coverage depth in number of reads on the left y-axis and the position in the reference sequence on the x-axis.

The minimum read coverage depth for consensus sequence generation (default 10x) is plotted as a dashed orange line across the plot, to easily visualize locations where coverage drops below the threshold (which will be masked in the consensus sequence) and where the coverage is above the threshold (which will be reported in the consensus sequence).

The median coverage is plotted as a dashed teal line across the plot.

By default, sequence variants representing differences between the consensus sequence and the reference sequence are also plotted, with allele frequency on the right y-axis. The colors and symbols represent different sequence variant types. See the figure legend for details.

The "Show log-scale" toggle switch allows the user to switch between logarithmic and linear scales for the coverage (left) y-axis.

The "Show Median" toggle switch allows the user to turn the median coverage line on and off.

The "Show Sequence Variants" toggle switch allows the user to turn the plotting of sequence variants on and off.

Pangolin Report (optional)

This table contains the results of the Pangolin analysis performed on the generated consensus sequences. Pangolin is run if the "Enable Pangolin" box is checked on the input form and one of the following is true:

  • 'Enrichment Panel' is set to a non-custom panel (e.g. VSP) and a consensus sequence was generated using SARS-CoV-2 as reference

  • 'Amplicon Primer Set' is set to a non-custom set with SARS-CoV-2 as reference (e.g. SARS-CoV-2, ARTIC v5.3.2 primers) and a valid consensus sequence was generated

  • Either 'Enrichment Panel' or 'Amplicon Primer Set' is set to 'Custom'. In this case, Pangolin is applied to every consensus sequence generated for the sample since the software assumes all of them to be potentially SARS-CoV-2 sequences.

The Pangolin report contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.

NextClade Report (optional)

This table contains the results of the NextClade analysis performed on the generated consensus sequences. NextClade is run if the "Enable NextClade" box is checked on the input form and one of the following is true:

  • 'Enrichment Panel' is set to a non-custom panel (e.g. VSP) and a consensus sequence was generated using a reference with NextClade dataset available.

  • 'Amplicon Primer Set' is set to a non-custom set with a reference with NextClade dataset available (e.g. SARS-CoV-2, ARTIC v5.3.2 primers) and a valid consensus sequence was generated.

  • Either 'Enrichment Panel' or 'Amplicon Primer Set' is set to 'Custom' and one or more NextClade datasets are selected under 'Custom Reference'. In this case, each of the selected NextClade datasets is applied to each consensus sequence generated for the sample. This may result in multiple NextClade results for each consensus sequence, some of which may not be meaningful (e.g. "flu_h1n1pdm_ha" dataset applied to a NA segment of an Influenza genome).

The NextClade Report contains a button labeled "Group table by" with a drop-down menu allowing the user to group the results by various fields, including "None". The default is "Dataset" which means that all of the results for each NextClade dataset will be grouped together. For example, if a user is only interested in phylogenetic analysis performed on the HA segment of Influenza A H1N1, these results for each sample can be viewed together in the "flu_h1n1pdm_ha" collapsible group.

The NextClade report also contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.

Detected Amplicons (only if 'Experiment Type' is set to 'Amplicon'): The number of amplicons detected over the total number of amplicons expected for that sequence. The percentage of amplicons detected is used to determine if the sample is of sufficient quality for variant calling. See Special considerations for amplicon sequencing with IMAP protocols for more details.

Insufficient titer for VC will only be present for an amplicon experiment and indicates that the percentage of detected amplicons was below the minimum percentage (default 80%) required for reliable variant calling. See Special considerations for amplicon sequencing with IMAP protocols for more details.

The table in the Pangolin report is derived from the output of the Pangolin software. For more details, see the Pangolin documentation (https://cov-lineages.org/resources/pangolin/output.html). Sequences with a bad Pangolin QC status are highlighted in yellow.

The table in the NextClade report is derived from the output of the NextClade software. For more details, see the NextClade documentation (https://docs.nextstrain.org/projects/nextclade/en/stable/user/output-files/04-results-tsv.html). Sequences with a bad NextClade QC status are highlighted in yellow.

Known issues

  1. For version 1.1 and later, consensus FASTA files generated for each sample, virus, and reference sequence incorrectly contain soft-masked sequences instead of hard-masked sequences. To get hard-masked sequences, use {sample_name}.consensus_hard_masked_sequence.fa or convert lowercase nucleotides to "N".
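
The lowercase-to-"N" conversion mentioned above can be done with a few lines of Python. This is a minimal sketch with a hypothetical function name; it assumes that every lowercase character on a sequence line is a soft-masked base and leaves header lines unchanged.

```python
def hard_mask_fasta(fasta_text):
    """Convert soft-masked (lowercase) bases to 'N', keeping headers as-is."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(line)  # header line: do not modify
        else:
            out.append("".join("N" if ch.islower() else ch for ch in line))
    return "\n".join(out) + "\n"


example = ">sample1 consensus\nACgtAC\nnnGT\n"
converted = hard_mask_fasta(example)
```

To process a file, read its contents into a string, pass it to the function, and write the result to a new file so the original download is preserved.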

How to set up and run an analysis

  1. Launch the DRAGEN Microbial Enrichment Plus BaseSpace app, which can be found in the "Dragen" and "Infectious Disease + Microbiology" app collections.

  2. Enter a name for the Analysis.

  3. Choose either “Biosample” or “Project” as input type. When a Project is selected, the app will attempt to find all FASTQ files in that Project and run analyses on them. There is no FASTQ file limitation when reading Biosamples from a Project. However, 99 associated FASTQ files is the maximum allowed per analysis when providing Biosample input from a list.

  4. Select a target-capture Enrichment Panel for the appropriate analysis options and default settings to populate. Only one enrichment panel can be selected per analysis. If Custom Panel is selected, the "Custom panel specification" section is enabled to allow entry of a reference FASTA file and (optionally) a reference BED file. See Custom reference FASTA and BED files for further details.

  5. Under "Enrichment Panel Microorganism Reporting List", select from the available list to report All microorganisms (default), specify a Pre-defined subset of microorganisms (RPIP, UPIP only), or specify a User-defined microorganism reporting list and reporting thresholds.

    • If Pre-defined is selected, the Pre-defined specification section is enabled to allow specification of a pre-defined subset of microorganisms for the selected Enrichment Panel. This option is only available for RPIP and UPIP.

    • If User-defined is selected, the User-defined specification section is enabled to allow entry of a microorganism reporting list and reporting thresholds file in TSV or XLSX format. See Microorganism Reporting File format for further details.

  6. Analysis Options:

  • Perform read QC (Quality Control)

    • If checked, reads are pre-processed using quality metrics before analysis.

    • If unchecked, read quality metrics are calculated, but reads are not trimmed or filtered before analysis.

  • Report bacterial AMR markers only

    • If checked, only bacterial AMR markers but no microorganisms are reported

    • This option is disabled if RVOP/RVEK, VSP, VSP V2 or Custom Panel is selected

    • This option is disabled if the "Report bacterial AMR markers only when an associated microorganism is reported" option is enabled

  • Report bacterial AMR markers only when an associated microorganism is reported

    • If checked, detected bacterial AMR markers are reported if the bacterial AMR marker passes a minimum reporting threshold and one or more associated microorganisms are also detected and reported

    • If unchecked, detected bacterial AMR markers are reported if the bacterial AMR marker passes a minimum reporting threshold

    • This option is disabled if RVOP/RVEK, VSP, VSP V2 or Custom Panel is selected

  • Report microorganisms and/or AMR markers that are below threshold

    • If checked, microorganisms and/or AMR markers below reporting thresholds are included in reports

    • If unchecked, only microorganisms and/or AMR markers above reporting thresholds are included in reports

    • This option is disabled if Custom Panel is selected

    • This option is disabled if the "Report bacterial AMR markers only when an associated microorganism is reported" option is enabled

  7. Specify "Read classification sensitivity". This setting is used as a pre-alignment filtering step for RVOP/RVEK, VSP, and VSP V2 only. The default setting of 5 means that if fewer than 5 reads classify to the set of reference sequences belonging to a given virus, that virus will not be reported. If 5 or more reads classify to the set of reference sequences belonging to a given virus, read alignment will proceed and alignment-based thresholds will be used to determine whether that virus is reported. The read classification sensitivity can be set as low as 1 or as high as 1000. Lowering the read classification sensitivity threshold below 5 may significantly increase computational run time and is not recommended for most use cases.

  8. Pangolin is currently enabled for all enrichment panels besides UPIP. For Custom Panel analyses, Pangolin will run on custom reference sequences with at least 3% coverage that meet these naming conventions:

    • If only a FASTA file is provided, Pangolin will run on sequences that have a header containing either SARS-CoV-2 or NC_045512

    • If both a FASTA and BED file are provided, Pangolin will run on sequences where the first column (chrom) contains NC_045512 or the fourth column (genomeName) contains SARS-CoV-2

  9. Optionally, enable Nextclade to run when one of the following microorganisms is detected (RPIP, RVOP/RVEK, VSP, VSP V2 only):

    • Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)

    • Influenza A virus (H1N1)

    • Influenza A virus (H3N2)

    • Influenza A virus (H5N1)

    • Influenza A virus (H5N6)

    • Influenza A virus (H5N8)

    • Influenza B virus (B/Victoria/2/87-like)

    • Influenza B virus (B/Yamagata/16/88-like)

    • Human immunodeficiency virus 1 (HIV-1)

    • Human respiratory syncytial virus A (HRSV-A)

    • Human respiratory syncytial virus B (HRSV-B)

    • Monkeypox virus (MPV)

    • Measles virus (MV)

    • Dengue virus (DENV), Dengue virus type 1 (DENV-1), Dengue virus type 2 (DENV-2), Dengue virus type 3 (DENV-3), Dengue virus type 4 (DENV-4)

  10. Select a quantitative Internal Control (IC) from the available list (RPIP, UPIP, VSP V2 only). If the quantitative IC is set to NONE, which is the default, the IC concentration value is ignored. For VSP V2, the only valid quantitative IC selections are:

    • NONE

    • Enterobacteria phage T7

    • Escherichia virus T4

    • Escherichia Virus MS2

    • Armored RNA Quant Internal Process Control

  11. Enter Internal Control (IC) concentration as an integer in the following scientific notation format: "#.## x 10^#". Note: An incorrect quantitative IC or incorrect IC concentration will result in inaccurate microorganism absolute quantification results.

  12. Select the Project where the Analysis Output should be saved.
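
The effect of the "Read classification sensitivity" setting described in the steps above can be sketched as a simple pre-alignment filter. The function name and the virus labels are hypothetical; only the default of 5 and the allowed range of 1 to 1000 come from this page.

```python
def viruses_to_align(classified_read_counts, sensitivity=5):
    """Return the viruses that proceed to read alignment.

    classified_read_counts maps a virus name to the number of reads
    classified to that virus's set of reference sequences.
    """
    if not 1 <= sensitivity <= 1000:
        raise ValueError("sensitivity must be between 1 and 1000")
    # Viruses with fewer reads than the threshold are not reported.
    return [
        virus
        for virus, n_reads in classified_read_counts.items()
        if n_reads >= sensitivity
    ]


counts = {"Virus A": 4, "Virus B": 5, "Virus C": 120}  # placeholder names
default_result = viruses_to_align(counts)
permissive_result = viruses_to_align(counts, sensitivity=1)
```

Passing this filter does not mean a virus is reported: alignment-based thresholds are still applied afterwards.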

Understanding the BaseSpace HTML reports

Summary results

Sample Composition

The Sample Composition bargraphs show the proportion of reads classified to six broad categories for all samples in the analysis run: Targeted Microbial, Untargeted, Ambiguous, Unclassified, Low Complexity, and Targeted Internal Control (RPIP, UPIP, VSP V2 only).

Summary Statistics

The Summary Statistics table summarizes sample QC metrics for all samples in the analysis run. Further details on each metric can be found by hovering over each column header.

Per sample results

Individual sample results can be further explored by clicking on "Report" under each sample name in the panel on the left. There are four tabs in the Sample Report: Sample Quality Control, Microorganisms, Antimicrobial Resistance Markers, and User Options.

1. Sample Quality Control

  • Version Information is a table with the application version, test type, and test version that were run. Running the latest version of the application is recommended.

  • Sample Composition is a bargraph showing the proportion of post-quality reads classified to six broad categories for the sample (RPIP, UPIP, VSP V2 only).

  • Read Classification is a dynamic plot that can be configured to show the following (RPIP, UPIP, VSP V2 only):

    • Targeted Microbial Reads - Relative (default): Bargraph of post-quality targeted microbial reads belonging to Viral, Bacterial, Fungal, Parasite and AMR categories, relative to post-quality targeted microbial reads only. Percentages are expected to sum to 100%. Hover over an individual bar to display the values.

    • Targeted Microbial Reads - Absolute: Bargraph of post-quality targeted microbial reads belonging to Viral, Bacterial, Fungal, Parasite and AMR categories for all post-quality reads in the sample overall. Hover over an individual bar to display the values.

    • Untargeted Reads - Relative: Bargraph of post-quality untargeted reads belonging to untargeted categories, relative to post-quality untargeted reads only. Percentages are expected to sum to 100%. Hover over an individual bar to display the values.

    • Untargeted Reads - Absolute: Bargraph of post-quality untargeted reads belonging to untargeted categories for all post-quality reads in the sample overall. Hover over an individual bar to display the values.

      Note that accurate sample composition and read classification results rely on selecting the correct enrichment panel. If you run an analysis that is not specific to the enrichment panel (e.g., VSP V2 analysis with VSP-enriched samples), reads from high background viruses that are not targeted by VSP probes (e.g., Measles virus) but that are targeted by VSP V2 probes will be reported as targeted viral reads.

  • Internal Controls is a table containing supported Internal Control options along with observed RPKM values (RPIP, UPIP, VSP V2 only).

  • QC Metrics is a table containing sample QC metrics. Dehosting refers to human reads only.
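
The difference between the "Relative" and "Absolute" views of the Read Classification plot comes down to the denominator. The sketch below is an illustration with made-up read counts and a hypothetical function name, assuming that "Absolute" means the percentage of all post-quality reads in the sample.

```python
def read_percentages(category_counts, total_post_quality_reads):
    """Return (relative, absolute) percentages per category.

    relative: share of the reads in these categories only (sums to 100%).
    absolute: share of all post-quality reads in the sample.
    """
    category_total = sum(category_counts.values())
    if category_total == 0 or total_post_quality_reads == 0:
        raise ValueError("read counts must be greater than zero")
    relative = {
        name: 100.0 * n / category_total for name, n in category_counts.items()
    }
    absolute = {
        name: 100.0 * n / total_post_quality_reads
        for name, n in category_counts.items()
    }
    return relative, absolute


# Made-up counts for illustration only.
targeted = {"Viral": 600, "Bacterial": 300, "Fungal": 50, "Parasite": 0, "AMR": 50}
relative, absolute = read_percentages(targeted, total_post_quality_reads=10000)
```

Here viral reads make up 60% of targeted microbial reads but only 6% of all post-quality reads.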

2. Microorganisms

  • Microorganism results are summarized in tables, separated by type (Viruses, Bacteria, Fungi, Parasites). Each table includes whether the microorganism is predicted present in the sample, as well as various alignment metrics. Further details on each metric can be found by hovering over each column header. The best-match Reference Accession(s) are provided for all RPIP, RVOP/RVEK, VSP, and VSP V2 viruses in the Viruses table. To see all best-match Reference Accession(s), click on the three dots (...) in the table and scroll down the page.

  • Reference Coverage is a dynamic plot showing the coverage depth across the viral genome for detected RPIP, RVOP/RVEK, VSP, and VSP V2 viruses. Select a virus from the dropdown list to view the coverage plot. Segments are concatenated for segmented viruses, and the targeted regions of the viral genome are indicated for RPIP viruses.

3. Antimicrobial Resistance Markers

  • Viral AMR (Variants) is a table with viral AMR variant results for Influenza A/B viruses (RPIP, RVOP/RVEK, VSP, and VSP V2 only)

  • Bacterial AMR (Genes) is a table with bacterial AMR gene results (RPIP, UPIP only)

  • Bacterial AMR (Variants) is a table with bacterial AMR variant results (RPIP, UPIP only)

4. User Options

The User Options table summarizes user options selected during launch of the analysis.

Custom reference FASTA and BED files

Custom reference FASTA file:

A custom reference FASTA file containing one or more reference sequences is required to run the custom reference sequence analysis. In the FASTA file, sequence names must be unique and should not contain any spaces. If there is any space in the FASTA header, the part before the first space is assumed to be the sequence name. It is recommended to use only the following characters in sequence names: letters, numbers, underscores (_), hyphens (-), parentheses, and periods (.). Otherwise, the sequence names may appear different in the output. An example custom reference FASTA file is provided in the link below.

To upload a custom reference FASTA file, go to the "Projects" tab and click on the folded paper icon (representing File) to reveal a dropdown menu. Click on "Upload" and select "Files". Within the upload page, select "Other" format for FASTA files, and upload the file as a Biosample. Within the DRAGEN Microbial Enrichment Plus app, under "Custom panel specification" use the "Custom reference FASTA for consensus generation" control to select the uploaded FASTA file.

Custom reference BED file (optional):

Optionally, a custom reference BED file may also be provided. Sequence names must match between the FASTA file and BED file, and the same set of sequences must appear in both files. If there are multiple viruses, their names should be unique. For example, if there are multiple Influenza genomes, they should not be labeled with the same virus name in the 4th column.

The BED file controls how sequences are grouped and labeled in the output. If the custom reference FASTA file includes sequences from multiple segments of a viral genome, it is recommended to provide a BED file so that the segments are included under the results of that microorganism.

The BED file must be tab-delimited with at least 4 columns:

  1. chrom: the sequence name as it appears in the FASTA

  2. chromStart: start position (always set to 0)

  3. chromEnd: end position (sequence length)

  4. genomeName: name of the genome, target, or microorganism the sequence belongs to (e.g. Monkeypox virus clade II)

  5. segmentName (optional): the name of the segment or gene (e.g. Segment 4 (HA)). Set to 'Full' if the sequence is the full genome

Example custom reference BED file:

NC_012532.1	0	10794	Zika	Full
KJ609203.1	0	2292	Influenza A virus (A/Perth/16/2009(H3N2))	Segment 1 (PB2)
KJ609204.1	0	2304	Influenza A virus (A/Perth/16/2009(H3N2))	Segment 2 (PB1)
KJ609205.1	0	2168	Influenza A virus (A/Perth/16/2009(H3N2))	Segment 3 (PA+PA-X)
KJ609206.1	0	1727	Influenza A virus (A/Perth/16/2009(H3N2))	Segment 4 (HA)
KJ609207.1	0	1530	Influenza A virus (A/Perth/16/2009(H3N2))	Segment 5 (NP)
KJ609208.1	0	1441	Influenza A virus (A/Perth/16/2009(H3N2))	Segment 6 (NA)
KJ609209.1	0	1001	Influenza A virus (A/Perth/16/2009(H3N2))	Segment 7 (M1+M2)
KJ609210.1	0	866	Influenza A virus (A/Perth/16/2009(H3N2))	Segment 8 (NS1+NEP)
MK239128.1	0	2316	Influenza A virus (A/Iowa/38/2017(H3N2v))	Segment 1 (PB2)
MK239126.1	0	2316	Influenza A virus (A/Iowa/38/2017(H3N2v))	Segment 2 (PB1)
MK239124.1	0	2208	Influenza A virus (A/Iowa/38/2017(H3N2v))	Segment 3 (PA+PA-X)
MK239073.1	0	1737	Influenza A virus (A/Iowa/38/2017(H3N2v))	Segment 4 (HA)
MK239074.1	0	1540	Influenza A virus (A/Iowa/38/2017(H3N2v))	Segment 5 (NP)
MK239123.1	0	1441	Influenza A virus (A/Iowa/38/2017(H3N2v))	Segment 6 (NA)
MK239125.1	0	1002	Influenza A virus (A/Iowa/38/2017(H3N2v))	Segment 7 (M1+M2)
MK239127.1	0	865	Influenza A virus (A/Iowa/38/2017(H3N2v))	Segment 8 (NS1+NEP)

To upload a custom reference BED file, go to the "Projects" tab and click on the folded paper icon (representing File) to reveal a dropdown menu. Click on "Upload" and select "Files". Within the upload page, select "Other" format for BED files, and upload the file as a Biosample. Within the DRAGEN Microbial Enrichment Plus App, under "Custom panel specification" use the "Custom reference BED (optional)" dropdown to select the uploaded BED file.
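
A BED file in the format described above can be generated from the reference FASTA itself, which keeps the chrom names and sequence lengths consistent between the two files. The following is a minimal sketch with hypothetical function and parameter names; the genome and segment names must be supplied by the user.

```python
def bed_rows_from_fasta(fasta_text, genome_names, segment_names=None):
    """Build tab-delimited BED text (5 columns) from FASTA text.

    genome_names maps each sequence name to its genomeName value.
    segment_names optionally maps sequence names to segmentName values;
    sequences without an entry are labeled 'Full'.
    """
    segment_names = segment_names or {}
    lengths = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without a sequence name")
            current = fields[0]  # name is the text before the first whitespace
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)

    rows = []
    for name, length in lengths.items():
        rows.append("\t".join([
            name,                             # chrom
            "0",                              # chromStart (always 0)
            str(length),                      # chromEnd (sequence length)
            genome_names[name],               # genomeName
            segment_names.get(name, "Full"),  # segmentName
        ]))
    return "\n".join(rows) + "\n"


fasta = ">seq1 example\nACGT\nAC\n>seq2\nGGG\n"
bed = bed_rows_from_fasta(fasta, {"seq1": "Virus X", "seq2": "Virus Y"})
```

A missing entry in `genome_names` raises a `KeyError`, which is a useful reminder that every sequence in the FASTA must also appear in the BED file.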

Pangolin custom analysis behavior:

For Custom Panel analyses, Pangolin is enabled and will run on custom reference sequences with at least 3% coverage that meet these naming conventions:

  • If only a FASTA file is provided, Pangolin will run on sequences that have a header containing either SARS-CoV-2 or NC_045512

  • If both a FASTA and BED file are provided, Pangolin will run on sequences where the first column (chrom) contains NC_045512 or the fourth column (genomeName) contains SARS-CoV-2
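
The naming rules above can be expressed as a small predicate, which may help when deciding how to name custom sequences. This is a sketch, not the app's code: the function name is hypothetical, matching is assumed to be case-sensitive substring matching, and a coverage of exactly 3% is assumed to qualify.

```python
SARS_COV_2_NAME = "SARS-CoV-2"
SARS_COV_2_ACCESSION = "NC_045512"


def pangolin_would_run(fasta_header, pct_coverage,
                       bed_chrom=None, bed_genome_name=None):
    """Return True if a custom reference sequence meets the stated rules.

    Leave bed_chrom and bed_genome_name as None when no BED file is used.
    """
    if pct_coverage < 3.0:
        return False
    if bed_chrom is None:
        # FASTA only: the header must contain the name or the accession.
        return (SARS_COV_2_NAME in fasta_header
                or SARS_COV_2_ACCESSION in fasta_header)
    # FASTA and BED: the decision is based on the BED columns.
    return (SARS_COV_2_ACCESSION in bed_chrom
            or SARS_COV_2_NAME in (bed_genome_name or ""))
```

For example, a sequence whose BED genomeName is "SARS-CoV-2 isolate" qualifies at 10% coverage even if its chrom value is an arbitrary identifier.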

Nextclade custom analysis behavior:

For Custom Panel analyses, Nextclade is disabled and will not be run. Do not enable Nextclade.

DRAGEN Microbial Enrichment Plus App Documentation

Summary:

DRAGEN Microbial Enrichment Plus offers a dedicated informatics solution with flexible analysis options for Illumina Infectious Disease and Microbiology target-capture enrichment panel kits. The app delivers easy-to-use, powerful secondary analysis of Illumina sequencing data, with workflows for sample QC, viral WGS (whole-genome sequencing), pathogen detection and quantification, and antimicrobial resistance (AMR) marker profiling. It also supports user-defined microorganism reporting thresholds and custom reference sequence analysis.

Supported hybrid-capture enrichment panels:

Input files:

  • FASTQ files

  • User-defined microorganism reporting file in TSV or XLSX format (optional)

  • Custom reference FASTA file (if applicable)

  • Custom reference BED file (if applicable)

Demo Data:

The DRAGEN Microbial Enrichment Plus Demo Project includes external control, contrived, and environmental samples prepared using the RPIP, UPIP, RVOP, VSP, and VSP V2 target-capture enrichment kits. Example custom reference sequence FASTA and BED files are also included.

Analysis Pipeline:

(Steps apply to all panels except where noted. An asterisk (*) marks steps that also apply to custom reference sequence analysis.)

  1. Read QC* (optional)

  2. Dehosting* (human read removal)

  3. Sample QC (sample composition and enrichment factor calculations; an internal control is required to calculate the enrichment factor) – RPIP, UPIP, VSP V2

  4. Microorganism classification (configurable sensitivity) – RVOP, VSP, VSP V2

  5. Microorganism detection (alignment, consensus generation, variant calling)

  6. Microorganism quantification (quantitative internal control required) – RPIP, UPIP, VSP V2

  7. Microorganism reporting thresholds (proprietary algorithms or user-defined reporting logic)

  8. Bacterial AMR marker analysis (nucleotide and protein alignment, consensus generation, variant calling and annotation) – RPIP, UPIP

  9. Viral AMR marker analysis (variant calling and annotation) – RPIP, RVOP, VSP, VSP V2

  10. Viral clade and lineage prediction (Pangolin, Nextclade) – RPIP, RVOP, VSP, VSP V2

  11. Result filters (user-specified filters applied)

  12. Reporting*

Output files:

  • Analysis-level outputs: XLSX, HTML, ZIP

  • Sample-level outputs: JSON, HTML, FASTA (consensus sequences), VCF (viral variants)

Important Notes:

DRAGEN Microbial Enrichment Plus is a secondary analysis tool for research use only. Further interpretation, statistical analysis, and downstream analysis of results may be necessary.

For Research Use Only. Not for use in diagnostic procedures.

Microorganism Reporting File format

How to edit the template file

  1. Save the provided template file under a new name before editing.

  2. Do not add or delete any columns in the template file.

  3. Do not change or remove any text from the header row. Note: The "kmer_read_count" metric is only valid with the UPIP enrichment panel.

  4. Upload the microorganism reporting file to a BaseSpace Project. Note: It is only necessary to upload the file once.

  5. Select the file by clicking on the "Dataset File(s)" option under the "User-defined specification" section.

Example user-defined microorganism reporting file

See example below for 6 RPIP microorganism reporting names. Prediction logic can be specified on a microorganism-by-microorganism basis using multiple parameters and combinatorial logical expressions.

Output files

Note: Some files may not be generated depending on the selected analysis options and analysis results

Sample-level output files

Analysis-level output files

Test information

RPIP

Pipeline logic

Pipeline steps

Overview

Product Page
Panel Summary
Enrichment panel
Template file

Rows with microorganism names that are not of interest can be deleted. However, to preserve tiered reporting logic (if desired), the entire tiered reporting group for the affected viruses must be kept. Membership in a tiered reporting group means that a hierarchical relationship is pre-built into the database and the most granular tier level passing reporting thresholds is reported. For example, if Influenza B virus (B/Victoria/2/87-like) or Influenza B virus (B/Yamagata/16/88-like) is reported in a sample, then the less granular Influenza B virus reporting name will NOT be reported. See the "Has Tiered Reporting" and "Reporting Tier" columns of the "Microorganisms" table for RPIP, RVOP/RVEK, VSP, and VSP V2 to see which viruses are reported as part of a tiered reporting group.

Filename
Type
Description
Filename
Type
Description
Enrichment panel
Abbreviation
Definition
Category
Test information
Step
Description
Notes
RPIP
UPIP
RVOP/RVEK
VSP
VSP V2
Custom Panel

| reporting_name | prediction_logic | coverage | median_depth | ani | aligned_read_count | rpkm | kmer_read_count |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Acinetobacter baumannii | default | | | | | | |
| Cryptococcus neoformans | coverage | 0.3 | | | | | |
| Escherichia coli | aligned_read_count | | | | 200 | | |
| Human adenovirus E | (coverage AND median_depth) OR (aligned_read_count AND ani) | 0.1 | 1 | 0.95 | 100 | | |
| Human bocavirus 1 (HBoV1) | rpkm OR (ani AND coverage) OR median_depth | 0.2 | 5 | 0.9 | | 5 | |
| Klebsiella pneumoniae | default AND coverage | 0.5 | | | | | |
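To illustrate how a prediction_logic expression from the example file combines per-metric thresholds, here is a minimal sketch of an evaluator. Several points are assumptions rather than documented behavior: a metric is treated as passing when the observed value is greater than or equal to its threshold, AND binds more tightly than OR, and `default` stands for the outcome of the app's built-in reporting logic, passed in as `default_result`. The function name is hypothetical.

```python
import re


def evaluate_logic(expression, thresholds, observed, default_result=False):
    """Evaluate one prediction_logic expression (illustrative sketch only)."""
    # Split into parentheses and words (metric names, AND, OR, default).
    tokens = re.findall(r"\(|\)|[A-Za-z_]+", expression)
    pos = 0

    def metric_passes(name):
        if name == "default":
            return default_result  # assumed: result of the built-in logic
        # Assumption: a metric passes when observed >= threshold.
        return observed[name] >= thresholds[name]

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            rhs = parse_and()  # always parse the right side to advance pos
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "AND":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            pos += 1  # skip the closing ")" (expression assumed well-formed)
            return value
        return metric_passes(token)

    return parse_or()
```

Using the Human adenovirus E row, a sample with coverage 0.05, median depth 3, ANI 0.97, and 150 aligned reads fails the first clause but passes the second, so the expression evaluates to true under these assumptions.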

Sample-level output files:

| Filename | Type | Description |
| --- | --- | --- |
| Samplename.Panelname.report.json | json | Comprehensive report file. See Report JSON format for further details |
| Samplename.Panelname.report.html | html | Visual report file. See Understanding the BaseSpace HTML reports for further details |
| Samplename.Panelname.viral_variants.vcf | vcf | Viral variant call file describing variant calls between viral consensus genome (or segment) sequences and best-match reference sequences (all RVOP/RVEK, VSP, and VSP V2 viruses, RPIP: SARS-CoV-2 & FluA/B/C only) |
| Samplename.Panelname.viral_genomes_consensus.fa | fasta | Viral genome (or segment) nucleotide consensus sequence(s) for all viruses reported in the sample (RPIP, RVOP/RVEK, VSP, VSP V2 only) |
| Samplename.Panelname.viral_targets_consensus.fa | fasta | Viral targeted region nucleotide consensus sequence(s) for all viruses reported in the sample (RPIP only) |
| Samplename.Panelname.bacterial_amr_nucleotide_consensus.fa | fasta | Bacterial AMR gene nucleotide consensus sequence(s) for all bacterial AMR markers reported in the sample (RPIP, UPIP only) |
| Samplename.Panelname.bacterial_amr_protein_consensus.fa | fasta | Bacterial AMR gene protein consensus sequence(s) for all bacterial AMR markers reported in the sample (RPIP, UPIP only) |
| viral_consensus_genomes | Dataset | Directory containing viral genome (or segment) nucleotide consensus sequence(s) per virus reported in the sample (RPIP, RVOP/RVEK, VSP, VSP V2 only) |

Analysis-level output files:

| Filename | Type | Description |
| --- | --- | --- |
| AnalysisIDnumber.Panelname.results.zip | zip | Compressed file containing all output files for single-click download |
| AnalysisIDnumber.Panelname.report.xlsx | xlsx | Aggregate Excel report file that summarizes results for all samples across 4 tabs: Samples, Microorganisms, AMR, and Variants. See below for further details |
| report.html | html | Visual report file. See Understanding the BaseSpace HTML reports for further details |

Illumina Respiratory Pathogen ID/AMR Enrichment Panel Kit (RPIP)

Illumina Urinary Pathogen ID/AMR Enrichment Panel Kit (UPIP)

Illumina Respiratory Virus Oligo Panel / Respiratory Virus Enrichment Kit (RVOP / RVEK)

Illumina Viral Surveillance Panel (VSP)

Illumina Viral Surveillance Panel V2 (VSP V2)

Illumina Custom Panel

AMR

antimicrobial resistance

CLSI

Clinical and Laboratory Standards Institute

ESBL

extended spectrum beta-lactamase

EUCAST

European Committee on Antimicrobial Susceptibility Testing

mL

milliliter

NAI

neuraminidase inhibitor

NGS

next-generation sequencing

PAI

polymerase acidic endonuclease inhibitor

pangolin

phylogenetic assignment of named global outbreak lineages

RPIP

Respiratory Pathogen ID/AMR Panel

RPKM

targeted Reads mapped Per Kilobase of targeted sequence per Million quality-filtered reads

RUO

For Research Use Only. Not for use in diagnostic procedures.

URL

See https://www.illumina.com/ for additional information.

Quantification - when a quantitative Internal Control {ic_name} and concentration {ic_concentration} is specified

RPIP data analysis using DRAGEN Microbial Enrichment Plus detects 41 viruses, 187 bacteria, 53 fungi, and 4,079 AMR markers, unless filtered reporting options are selected, based on target enriched next-generation sequencing (NGS) of microorganism DNA and cDNA sequences. Sequencing data are interpreted by the DRAGEN software platform and microorganisms that pass detection thresholds are reported. Absolute quantification assumes use of {ic_name} as an Internal Control spiked at {ic_concentration} copies/mL of sample. Relative abundance is calculated based on absolute quantities and is expressed as proportion of absolute quantities within each pathogen class (i.e., bacteria, viruses, fungi). If RPKM for the Internal Control is zero, no absolute quantification is provided, and relative abundance is expressed as proportion of microorganism RPKM values within each pathogen class.

Quantification - when a quantitative Internal Control is NOT specified

RPIP data analysis using DRAGEN Microbial Enrichment Plus detects 41 viruses, 187 bacteria, 53 fungi, and 4,079 AMR markers, unless filtered reporting options are selected, based on target enriched next-generation sequencing (NGS) of microorganism DNA and cDNA sequences. Sequencing data are interpreted by the DRAGEN software platform and microorganisms that pass detection thresholds are reported. Relative abundance is expressed as proportion of microorganism RPKM values within each pathogen class (i.e., bacteria, viruses, fungi). Internal Control not specified; no absolute quantification provided.
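As a rough sketch of the relative abundance calculation described for the case without a quantitative Internal Control, the snippet below expresses each microorganism's RPKM as a proportion of the total RPKM within its pathogen class. The function name is hypothetical, and the dictionary keys `class`, `name`, and `rpkm` are assumed to mirror the corresponding fields of the report JSON.

```python
from collections import defaultdict


def relative_abundance(microorganisms):
    """Return {name: proportion of RPKM within the microorganism's class}."""
    # Sum RPKM per pathogen class (e.g. bacterial, viral, fungal).
    class_totals = defaultdict(float)
    for organism in microorganisms:
        class_totals[organism["class"]] += organism["rpkm"]

    abundance = {}
    for organism in microorganisms:
        total = class_totals[organism["class"]]
        # Guard against a class whose RPKM values are all zero.
        abundance[organism["name"]] = organism["rpkm"] / total if total else 0.0
    return abundance
```

With two bacteria at RPKM 300 and 100 and a single virus at RPKM 50, the bacteria receive 0.75 and 0.25 and the virus receives 1.0, because proportions are computed separately for each class.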

AMR - when "Report bacterial AMR markers only when an associated microorganism is reported" is selected

This test detects 4,079 antimicrobial resistance (AMR) markers and reports associations for 99 microorganisms, 181 antimicrobials, and 35 drug classes, unless filtered reporting options are selected. Bacterial AMR markers are based on the Comprehensive Antibiotic Resistance Database (CARD, version 3.2.8) and viral AMR markers are based on World Health Organization (WHO) Influenza virus neuraminidase inhibitor (NAI) and polymerase acidic protein inhibitor (PAI) Reduced Susceptibility Marker Tables (07 March 2023 version). Detection of an AMR marker is reported if the AMR marker passes a minimum detection threshold and if one or more of the microorganisms associated with the AMR marker is also detected, in alignment with guidance provided by the College of American Pathologists (CAP) MIC.21855. However, reported AMR markers may originate from microorganisms that did not meet detection thresholds or microorganisms not targeted by the test. Association between microorganisms and bacterial AMR markers is based on scientific literature and the Comprehensive Antibiotic Resistance Database Prevalence Data (CARD Prevalence, version 4.0.1) from McMaster University. Reported AMR markers have been associated with antimicrobial resistance but may not always indicate phenotypic resistance. Failure to detect AMR markers does not always indicate phenotypic susceptibility. Results should be interpreted in the context of all available information.

AMR - when "Report bacterial AMR markers only when an associated microorganism is reported" is NOT selected

This test detects 4,079 antimicrobial resistance (AMR) markers and reports associations for 99 microorganisms, 181 antimicrobials, and 35 drug classes. Bacterial AMR markers are based on the Comprehensive Antibiotic Resistance Database (CARD, version 3.2.8) and viral AMR markers are based on World Health Organization (WHO) Influenza virus neuraminidase inhibitor (NAI) and polymerase acidic protein inhibitor (PAI) Reduced Susceptibility Marker Tables (07 March 2023 version). Association between microorganisms and bacterial AMR markers is based on scientific literature and the Comprehensive Antibiotic Resistance Database Prevalence Data (CARD Prevalence, version 4.0.1) from McMaster University. Detection of a bacterial AMR marker is reported if the marker passes a minimum detection threshold, regardless of associated microorganism detection. Reported AMR markers may originate from microorganisms that did not meet detection thresholds or microorganisms not targeted by the test. Reported AMR markers have been associated with antimicrobial resistance but may not always indicate phenotypic resistance. Failure to detect AMR markers does not always indicate phenotypic susceptibility. Results should be interpreted in the context of all available information.

AMR

Linkage between bacterial AMR marker, antimicrobial, and drug class is based on the Comprehensive Antibiotic Resistance Database (CARD, version 3.2.8) from McMaster University, ResFinder (version 2.2.1), NCBI Reference Gene Catalog (version 2023-09-26.1), EUCAST expert rules on indicator agents (2019-2023), and CLSI Performance Standards for Antimicrobial Susceptibility Testing (M100 34th Edition). Linkage between viral AMR marker, antimicrobial, and drug class is based on the publications provided in the JSON report; see the PubMed IDs (pmids) field. Not all antimicrobials and drug classes that are listed may be relevant. Detected AMR markers may also confer resistance to antimicrobials and drug classes that are not listed.

AMR

A representative list of associated microorganisms known to harbor the detected or similar bacterial AMR markers, based on the Comprehensive Antibiotic Resistance Database Prevalence Data (CARD Prevalence, version 4.0.1) from McMaster University, can be found in the Associated Microorganisms field.

AMR

Mutations connected with a '+' form an epistatic group. Epistatic groups are two or more mutations that need to be present concurrently to confer the associated resistance.

AMR

All intrinsic resistance described in CLSI Performance Standards for Antimicrobial Susceptibility Testing, M100 34th Edition, Appendix B for detected microorganism(s) is reported. Additional comments regarding CLSI intrinsic resistance definitions may be reported in footnotes specific to the detected microorganism(s). Some intrinsic resistance is described with reference to drug classes rather than specific antimicrobials. Users may reference CLSI Glossary I (Part 1 and Part 2): Class and Subclass Designations and Generic Names for information on how CLSI categorizes antimicrobials and drug classes.

AMR

Confidence of bacterial AMR marker detection is shown as High, Medium, or Low and is based on the available sequencing data. High confidence indicates that a bacterial AMR marker has 100% protein sequence coverage and 100% protein sequence percent identity (PID). Medium confidence indicates that a bacterial AMR marker has ≥90% protein sequence coverage and ≥90% protein sequence percent identity (PID). Low confidence indicates that a bacterial AMR marker has ≥60% protein sequence coverage and ≥80% protein sequence percent identity (PID).
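The confidence tiers above can be expressed as a short classifier. This is a sketch under stated assumptions: the function name is hypothetical, both inputs are percentages (0 to 100), and markers that fall below the Low thresholds return `None`, since the text does not say how such markers are labeled.

```python
def amr_confidence(coverage_pct, identity_pct):
    """Map protein sequence coverage and percent identity (PID) to a tier."""
    if coverage_pct == 100 and identity_pct == 100:
        return "High"
    if coverage_pct >= 90 and identity_pct >= 90:
        return "Medium"
    if coverage_pct >= 60 and identity_pct >= 80:
        return "Low"
    # Below the documented Low thresholds (handling here is an assumption).
    return None
```

Note that a marker with full coverage but 85% identity is classified as Low, because both the coverage and identity conditions of a tier must be met.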

Phenotypic group

Targeted microorganisms are classified into three Phenotypic Groups based on general association with normal flora, colonization, or contamination from the environment or other sources, as well as based on general association with disease. Phenotypic grouping DOES NOT INDICATE PATHOGENICITY IN A GIVEN CASE and results need to be interpreted in the context of all available information. Phenotypic Group 1: Microorganisms that are frequently considered part of the normal flora, colonizers, or contaminants but may be associated with disease in certain settings. Phenotypic Group 2: Microorganisms that may represent normal flora, colonizers, or contaminants but that are frequently associated with disease. Phenotypic Group 3: Microorganisms that are not generally considered part of the normal flora, colonizers, or contaminants and are generally considered to be associated with disease.

Pango lineage

The most likely Pango (phylogenetic assignment of named global outbreak) lineage is assigned to the majority consensus SARS-CoV-2 genome sequence using pangolin 4.3.1 (Áine O'Toole & Emily Scher et al. 2021 Virus Evolution DOI:10.1093/ve/veab064).

Read classification

This test differentiates sequencing reads classified to microorganism and Internal Control regions that are targeted by capture probes (“Targeted Microbial” and “Targeted Internal Control”) from those that are not targeted (“Untargeted”), are low complexity (“Low Complexity”), cannot be unambiguously assigned to one category (“Ambiguous”), or cannot be classified with confidence (“Unclassified”).

Limitations

Non-detected results do not rule out the presence of viruses, bacteria, fungi, and AMR markers. Contamination with microorganisms is possible during specimen collection, transport, and processing. Closely related microorganisms may be misidentified based on sequence homology to species present in the database. The identification of cDNA or DNA sequences from a microorganism does not confirm that the identified microorganism is causing symptoms, is viable, or is infectious. Recombinant viral strains may not be reported or may be reported as one or more individual viruses. Should one or more individual viruses be reported for a recombinant viral strain, antiviral resistance results may be inaccurate.

Limitations

The best matching allele is reported for each detected bacterial AMR gene family. If two or more alleles within the same bacterial AMR gene family are detected, only the allele with the higher confidence will be reported as the best match unless multiple alleles have a High confidence interpretation (100% coverage and PID). In strains containing insertion-deletion mutations (indels), there is a risk of false positive or false negative results for other resistance mutations within a region of 100 nucleotides around the indel.

Limitations

Information provided by DRAGEN Microbial Enrichment Plus is based on scientific knowledge and has been curated; however, scientific knowledge evolves and information about associated microorganism and associated resistance may not always be complete and/or correct. Results should be interpreted in the context of all available information. Other sources of data may be required for confirmation.

| Step | Description | Notes | RPIP | UPIP | RVOP/RVEK | VSP | VSP V2 | Custom Panel |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Read QC | Low-quality bases are trimmed from the ends of each read. After trimming, the read is discarded if fewer than 50% of its bases have a quality score greater than or equal to Q20, the read is shorter than 32 bp, or the read has 5 or more ambiguous bases. It is assumed that appropriate adapter trimming has already been performed. | Optional | X | X | X | X | X | X |
| Dehosting | Human read removal using the DRAGEN Kmer Classifier | | X | X | X | X | X | X |
| Sample QC | Sample composition and enrichment factor calculations | Internal control required to calculate the enrichment factor | X | X | | | X | |
| Microorganism classification | Pre-alignment filtering step | Configurable sensitivity | | | X | X | X | |
| Microorganism detection | Reference alignment, consensus sequence generation, variant calling | | X | X | X | X | X | X |
| Microorganism quantification | Absolute copies/mL calculation | Quantitative internal control and concentration required | X | X | | | X | |
| Microorganism reporting thresholds | Proprietary algorithms or user-defined reporting logic | | X | X | X | X | X | |
| Bacterial AMR marker analysis | Nucleotide and protein alignment, consensus sequence generation, variant calling and annotation | | X | X | | | | |
| Viral AMR marker analysis | Variant calling and annotation | | X | | X | X | X | |
| Viral clade and lineage prediction | Pangolin, Nextclade | | X | | X | X | X | |
| Result filters | User-specified filters applied | | X | X | X | X | X | |
| Reporting - Analysis level | XLSX, HTML, ZIP | | X | X | X | X | X | X |
| Reporting - Sample level | JSON, HTML, FASTA (consensus sequences), VCF (viral variants) | | X | X | X | X | X | X |
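The discard rule of the Read QC step can be sketched as follows. This covers only the post-trimming filter, not the trimming itself, whose algorithm is not described here. The function name is hypothetical, qualities are assumed to be integer Phred scores, and ambiguous bases are assumed to be `N` calls.

```python
def passes_read_qc(sequence, qualities):
    """Return True if an already-trimmed read survives the Read QC filter."""
    # Discard reads shorter than 32 bp.
    if len(sequence) < 32:
        return False
    # Discard reads with 5 or more ambiguous bases (assumed to be "N").
    if sequence.upper().count("N") >= 5:
        return False
    # Discard reads where fewer than 50% of bases have quality >= Q20.
    high_quality = sum(1 for q in qualities if q >= 20)
    return high_quality >= 0.5 * len(qualities)
```

For example, a 40 bp read in which exactly 20 bases reach Q20 is kept, while the same read with only 19 such bases is discarded.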


Report JSON format

The DRAGEN Microbial Enrichment Plus app outputs a comprehensive sample-level report.json file containing general metadata, version information, sample QC, microorganism, and AMR marker results, as well as detailed test information. The additional convenience file formats generated by the DRAGEN Microbial Enrichment Plus app do not contain novel content.

(*) indicates results generated by the application layer as opposed to the DRAGEN secondary analysis pipeline

Top-Level Node

The top-level section of the report JSON contains general metadata and version information.

Field
Description

.accession

Identifier used for the sample

.deploymentEnvironment

Environment in which the results were produced

.batchId

Identifier used for the batch of samples processed together

.analysisId

Identifier used for the analysis

.runId

Identifier used for the sequencing run

.controlFlag

Indicates whether the sample is a control. It is set to “POS” if the substring “PosCon” is found in the sample name, “NEG” if the substring “NegCon” is found, or “BLANK” if the substring “controlBlk” is found. Otherwise, it is set to “-”

.dragenVersion

DRAGEN release version

.analysisPipelineVersion

Analysis Pipeline release version

.testType

Type of test panel ("RPIP", "UPIP", "RVOP", "VSPv1", "VSPv2", "Custom")

.testVersion

Test panel release version

.testName

Full name of test panel

.testUse

Test use. "For Research Use Only. Not for use in diagnostic procedures"

.reportTime

Date and time the report was generated

.warnings

List of warnings encountered during the analysis

.errors

List of errors encountered during the analysis

.results*

High-level result: “One or more potential pathogens predicted” or “No potential pathogens predicted”

.appVersion*

DRAGEN Microbial Enrichment Plus application release version
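The .controlFlag rule described above can be sketched as a simple substring check on the sample name. The function name is hypothetical. Two details are assumptions because the description does not state them: matching is case-sensitive, and when several substrings occur in one name they are checked in the order listed.

```python
def control_flag(sample_name):
    """Derive the control flag from substrings in the sample name."""
    if "PosCon" in sample_name:
        return "POS"
    if "NegCon" in sample_name:
        return "NEG"
    if "controlBlk" in sample_name:
        return "BLANK"
    # Not a recognized control sample.
    return "-"
```

Naming control samples with these substrings (for example, `Run12_NegCon_01`) is therefore what marks them as controls in the report.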

.qcReport.sampleQc Node

This section contains information about sample quality control (QC). The fields are relative to .qcReport.sampleQc

Field
Description

.totalRawBases

Number of base pairs in sample before read QC processing

.totalRawReads

Number of reads in sample before read QC processing

.uniqueReads

Number of distinct reads in sample before read QC processing

.uniqueReadsProportion

Proportion of distinct reads in sample before read QC processing

.preQualityMeanReadLength

Average read length before read QC processing

.postQualityMeanReadLength

Average read length after read QC processing

.postQualityReads

Number of reads in sample after read QC processing, inclusive of any duplicate reads

.postQualityReadsProportion

Proportion of post-quality reads in sample relative to total raw reads

.removedInDehostingReads

Number of host reads in sample removed during dehosting (host = human)

.removedInDehostingReadsProportion

Proportion of host reads in sample removed relative to total raw reads (host = human)

.entropy

Shannon entropy of the counts of 5-mers in the reads after read QC processing, which is a measure of randomness

.gContent

Proportion of guanine (G) base calls in reads after read QC processing

.libraryQScore

Quality score of the library after read QC processing
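To make the .entropy field more concrete, the sketch below computes the Shannon entropy of 5-mer counts over a set of reads. It is an illustration only: the function name is hypothetical, and the logarithm base (base 2, giving bits) and the use of overlapping 5-mers are assumptions, since the field description does not specify them.

```python
import math
from collections import Counter


def kmer_entropy(reads, k=5):
    """Shannon entropy (in bits, assumed) of overlapping k-mer counts."""
    # Count every overlapping k-mer across all reads.
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = -sum(p * log2(p)) over the observed k-mer frequencies.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A library made of a single repeated 5-mer has entropy 0, and the value grows as the 5-mer distribution becomes more even, which is why the field is described as a measure of randomness.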

.qcReport.enrichmentFactor Node

This section contains information about the enrichment factor calculation and is relevant to RPIP, UPIP, and VSP V2 only. Detection of an appropriate Internal Control is required. The fields are relative to .qcReport.enrichmentFactor

Field
Description

.value

Enrichment factor value reflecting how well targeted regions were enriched

.category

Enrichment factor category: "poor", "fair", "good", or "not calculated"

.qcReport.sampleComposition Node

This section contains information about the composition of the sample and is provided for RPIP, UPIP, and VSP V2 only. The fields are relative to .qcReport.sampleComposition

Field
Description

.readClassification

Proportion of post-quality reads classified to the following categories:

.readClassification.targetedMicrobial

Targeted microbial

.readClassification.targetedInternalControl

Targeted Internal Control

.readClassification.untargeted

Untargeted

.readClassification.ambiguous

More than one category

.readClassification.unclassified

No category

.readClassification.lowComplexity

Low complexity

.targetedMicrobial

Proportion of post-quality targeted microbial reads classified to the following sub-categories:

.targetedMicrobial.viral

Viral targeted

.targetedMicrobial.bacterial

Bacterial targeted

.targetedMicrobial.fungal

Fungal targeted

.targetedMicrobial.parasitic

Parasitic targeted

.targetedMicrobial.bacterialAmr

Bacterial AMR targeted

.untargeted

Proportion of post-quality untargeted reads classified to the following sub-categories:

.untargeted.viral

Viral untargeted

.untargeted.bacterial

Bacterial untargeted

.untargeted.fungal

Fungal untargeted

.untargeted.parasitic

Parasitic untargeted

.untargeted.bacterialAmr

Bacterial AMR untargeted

.untargeted.internalControl

Internal Control untargeted

.untargeted.human

Human untargeted

.viral

Proportion of post-quality viral reads classified to the following categories:

.viral.targeted

Viral targeted

.viral.untargeted

Viral untargeted

.viral.untargetedSubcategories

Proportion of post-quality viral untargeted reads classified to the following sub-categories:

.viral.untargetedSubcategories.panel

Viral panel members

.viral.untargetedSubcategories.phage

Viral phage

.viral.untargetedSubcategories.other

Viral other (not a panel member or phage)

.bacterial

Proportion of post-quality bacterial reads classified to the following categories:

.bacterial.targeted

Bacterial targeted

.bacterial.untargeted

Bacterial untargeted

.bacterial.untargetedSubcategories

Proportion of post-quality bacterial untargeted reads classified to the following sub-categories:

.bacterial.untargetedSubcategories.panel

Bacterial panel members

.bacterial.untargetedSubcategories.ribosomalDna

Bacterial ribosomal DNA (16S)

.bacterial.untargetedSubcategories.plasmid

Bacterial plasmids

.bacterial.untargetedSubcategories.other

Bacterial other (not a panel member, ribosomal DNA, or plasmid)

.fungal

Proportion of post-quality fungal reads classified to the following categories:

.fungal.targeted

Fungal targeted

.fungal.untargeted

Fungal untargeted

.fungal.untargetedSubcategories

Proportion of post-quality fungal untargeted reads classified to the following sub-categories:

.fungal.untargetedSubcategories.panel

Fungal panel members

.fungal.untargetedSubcategories.ribosomalDna

Fungal ribosomal DNA (18S)

.fungal.untargetedSubcategories.other

Fungal other (not a panel member or ribosomal DNA)

.parasitic

Proportion of post-quality parasitic reads classified to the following categories:

.parasitic.targeted

Parasitic targeted

.parasitic.untargeted

Parasitic untargeted

.parasitic.untargetedSubcategories

Proportion of post-quality parasitic untargeted reads classified to the following sub-categories:

.parasitic.untargetedSubcategories.panel

Parasitic panel members

.parasitic.untargetedSubcategories.ribosomalDna

Parasitic ribosomal DNA (18S)

.parasitic.untargetedSubcategories.other

Parasitic other (not a panel member or ribosomal DNA)

.human

Proportion of post-quality human reads classified to the following categories:

.human.untargeted

Human untargeted

.human.untargetedSubcategories

Proportion of post-quality human untargeted reads classified to the following sub-categories:

.human.untargetedSubcategories.ribosomalDna

Human ribosomal DNA

.human.untargetedSubcategories.codingSequence

Human coding sequence

.human.untargetedSubcategories.other

Human other (not ribosomal DNA or coding sequence)

.internalControl

Proportion of post-quality Internal Control reads classified to the following categories:

.internalControl.targeted

Internal Control targeted

.internalControl.untargeted

Internal Control untargeted

.microbialAndInternalControl

Proportion of post-quality Microbial and Internal Control reads classified to the following categories:

.microbialAndInternalControl.targeted

Microbial and Internal Control targeted

.microbialAndInternalControl.untargeted

Microbial and Internal Control untargeted

.bacterialAmr

Proportion of post-quality bacterial AMR reads classified to the following categories:

.bacterialAmr.targeted

Bacterial AMR targeted

.bacterialAmr.untargeted

Bacterial AMR untargeted

.qcReport.internalControls Node

This section contains information about internal control detection and is relevant to RPIP, UPIP, and VSP V2 only. The value of the .qcReport.internalControls field is an array of objects containing name and RPKM information for each Internal Control. See the code block below for an example:

[
    {
        "name": "Allobacillus halotolerans",
        "rpkm": 0
    },
    {
        "name": "Armored RNA Quant Internal Process Control",
        "rpkm": 0
    },
    {
        "name": "Enterobacteria phage T7",
        "rpkm": 180323
    },
    {
        "name": "Escherichia virus MS2",
        "rpkm": 0
    },
    {
        "name": "Escherichia virus Qbeta",
        "rpkm": 0
    },
    {
        "name": "Escherichia virus T4",
        "rpkm": 0
    },
    {
        "name": "Imtechella halotolerans",
        "rpkm": 0
    },
    {
        "name": "Phocid alphaherpesvirus 1",
        "rpkm": 0
    },
    {
        "name": "Phocine morbillivirus",
        "rpkm": 0
    },
    {
        "name": "Truepera radiovictrix",
        "rpkm": 0
    }
]
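A minimal sketch of reading this array from a loaded report is shown below. It assumes the report JSON has already been parsed into a Python dictionary (for example with `json.load`). The helper name is hypothetical, and treating an RPKM greater than zero as "detected" is an assumption made for illustration.

```python
def detected_internal_controls(report):
    """Return names of Internal Controls with a non-zero RPKM."""
    # Path follows the documented node: .qcReport.internalControls
    controls = report.get("qcReport", {}).get("internalControls", [])
    return [control["name"] for control in controls if control.get("rpkm", 0) > 0]


# Abbreviated example based on the array shown above.
example_report = {
    "qcReport": {
        "internalControls": [
            {"name": "Allobacillus halotolerans", "rpkm": 0},
            {"name": "Enterobacteria phage T7", "rpkm": 180323},
            {"name": "Escherichia virus MS2", "rpkm": 0},
        ]
    }
}
print(detected_internal_controls(example_report))
```

For the example above, only Enterobacteria phage T7 has a non-zero RPKM.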

.userOptions Node

This section gives information about analysis options specified by the user. The fields are relative to .userOptions

Field
Description

.quantitativeInternalControlName

Quantitative Internal Control used for microorganism absolute quantification (recommendation: Enterobacteria phage T7)

.quantitativeInternalControlConcentration

Quantitative Internal Control concentration (recommendation: 1.21 x 10^7 copies/mL of sample)

.readQcEnabled

Boolean indicating if read QC (trimming and filtering based on quality and read length) is enabled

.readClassificationSensitivity

(RVOP/RVEK, VSP, VSP V2 only) Sensitivity threshold for classifying reads. Determines whether alignment should proceed for a microorganism and/or reference sequence. Value is an integer with a valid range of 1 to 1000, inclusive

.customPanelFastaFile

(Custom Panel only) Name of the custom reference FASTA file

.customPanelBedFile

(Custom Panel only) Name of the custom reference BED file

.belowThresholdEnabled*

Boolean indicating if microorganisms and/or AMR markers below detection thresholds are reported

.bacterialAmrMarkersOnly*

(RPIP, UPIP only) Boolean indicating if only bacterial AMR markers are reported

.bacterialAmrMarkerMicroorganismRequired*

(RPIP, UPIP only) Boolean indicating if bacterial AMR markers are reported only when an associated microorganism is reported

.preDefinedMicroorganismReportingList*

(RPIP, UPIP only) Pre-defined microorganism reporting list, if specified

.userDefinedMicroorganismReportingListUsed*

Boolean indicating if a user-defined microorganism reporting file is specified

.userDefinedMicroorganismReportingListFile*

Name of the user-defined microorganism reporting file, if specified

.providedAnalysisName*

User-provided analysis name

.targetReport.microorganisms[] Node

The value of the .targetReport.microorganisms[] field is an array of objects containing information about detected microorganisms. The following table describes one .targetReport.microorganisms[] object. The fields are relative to .targetReport.microorganisms[]

Field
Description

.class

Microorganism class ("viral", "bacterial", "fungal", "parasite")

.name

Name of microorganism

.coverage

Proportion of targeted microorganism reference sequence bases that appear in sample sequencing reads

.ani

Average nucleotide identity of consensus sequence to targeted microorganism reference sequences

.medianDepth

Median depth of sample sequencing reads aligned to targeted microorganism reference sequences, indicating the median number of times each targeted microorganism reference sequence base appears in sample sequencing reads

.condensedDepthVector

Read depth across the targeted microorganism reference sequences, condensed to 256 bins

.rpkm

Normalized representation of the number of sample sequencing reads aligned to targeted microorganism reference sequences (targeted Reads mapped Per Kilobase of targeted sequence per Million quality-filtered reads)

.alignedReadCount

Number of sample sequencing reads that aligned to targeted microorganism reference sequences

.kmerReadCount

(UPIP only) Number of sample sequencing reads classified to targeted microorganism reference sequences

.absoluteQuantityRatio

Numerical absolute quantification value. Quantitative internal control required for calculation

.absoluteQuantityRatioFormatted

Formatted absolute quantification value with units. Quantitative internal control required for calculation

.phenotypicGroup

(RPIP, UPIP only) Grouping indicating general association with normal flora, colonization, or contamination from the environment or other sources, as well as general association with disease

.associatedAmrMarkers

(Bacteria only) Information about the bacterial AMR markers associated with the microorganism

.associatedAmrMarkers.applicable

Boolean indicating whether one or more bacterial AMR markers are associated with the microorganism

.associatedAmrMarkers.detected

List of detected bacterial AMR markers associated with the microorganism

.associatedAmrMarkers.predicted

List of predicted bacterial AMR markers associated with the microorganism

.consensusGenomeSequences

(RPIP, RVOP/RVEK, VSP, VSP V2 viruses only) Information about the majority consensus genome (or segment) sequence

.consensusGenomeSequences.sequence

Consensus genome (or segment) sequence bases

.consensusGenomeSequences.referenceAccession

Accession of the reference genome (or segment) sequence

.consensusGenomeSequences.referenceDescription

Description of the reference genome (or segment) sequence

.consensusGenomeSequences.referenceLength

Length of the reference genome (or segment) sequence

.consensusGenomeSequences.maximumAlignmentLength

Longest contiguous alignment between consensus sequence and reference genome (or segment) sequence

.consensusGenomeSequences.maximumGapLength

Longest contiguous alignment gap (insertion or deletion) between consensus sequence and reference genome (or segment) sequence

.consensusGenomeSequences.maximumUnalignedLength

Longest section of the reference genome (or segment) sequence not aligned to by consensus sequence

.consensusGenomeSequences.coverage

Proportion of reference genome (or segment) sequence bases that appear in sample sequencing reads

.consensusGenomeSequences.ani

Average nucleotide identity of consensus sequence to reference genome (or segment) sequence

.consensusGenomeSequences.alignedReadCount

Number of sample sequencing reads that aligned to reference genome (or segment) sequence

.consensusGenomeSequences.medianDepth

Median depth of sample sequencing reads aligned to reference genome (or segment) sequence, indicating the median number of times each reference genome (or segment) sequence base appears in sample sequencing reads

.consensusGenomeSequences.targetAnnotation

List of targeted region annotations for the reference genome (or segment) sequence. Each annotation is a JSON object with the following fields: start (int), end (int), strand (string: "+", "-"), target_name (string), type (string)

.consensusGenomeSequences.condensedDepthVector

Read depth across the reference genome (or segment) sequence, condensed to 256 bins

.consensusTargetSequences

(RPIP viruses only) Information about the majority targeted region consensus sequences

.consensusTargetSequences.sequence

Consensus targeted region sequence bases

.consensusTargetSequences.name

Name of the targeted region

.consensusTargetSequences.referenceAccession

Accession of the targeted region reference sequence

.consensusTargetSequences.depthVector

Read depth across the targeted region reference sequence, not condensed

.consensusTargetSequences.scaledDepthVector*

Read depth across the targeted region reference sequence, condensed and scaled such that the longest targeted region for the microorganism has a maximum length of 256 bins

.predictionInformation

Information about microorganism prediction results

.predictionInformation.predictedPresent

Boolean indicating whether the microorganism passed its reporting logic algorithm

.predictionInformation.notes

List of notes about the prediction result

.predictionInformation.subpanels

List of pre-defined subpanels that the microorganism belongs to

.predictionInformation.relatedMicroorganisms

Array of objects with information about genetically related microorganisms. See below for details

.predictionInformation.userDefined*

User-defined reporting prediction logic for microorganism, if specified

.variants

(All RVOP/RVEK, VSP, and VSP V2 viruses; for RPIP, SARS-CoV-2 and Flu A/B/C only) Information about viral variants. See below for details

.comments*

List of additional information regarding the microorganism

.abundance*

Relative abundance of the microorganism within the microorganism class

.pangoLineage*

(SARS-CoV-2 only) Information about SARS-CoV-2 Pango lineage prediction results. See below for details

.nextclade*

(applicable viruses only) Information about viral clade assignment results. See below for details

.potentialAmrDetected*

(Bacteria only) Potential AMR detection flag for microorganism. Can be "Yes", "Not Detected", or "n/a"

.potentialAmrPredicted*

(Bacteria only) Potential AMR prediction flag for microorganism. Can be "Yes", "Not Predicted", or "n/a"

.flags*

(Bacteria only) Flag for potential resistance to an important drug class ("Potential ESBL", "Potential Carbapenemase")

.intrinsicResistance*

(Bacteria only) List of antimicrobials to which the reported bacteria is intrinsically resistant, based on CLSI Performance Standards for Antimicrobial Susceptibility Testing, M100 34th Edition, Appendix B

.intrinsicResistanceDrugClasses*

(Bacteria only) List of drug classes to which the reported bacteria is intrinsically resistant, based on CLSI Performance Standards for Antimicrobial Susceptibility Testing, M100 34th Edition, Appendix B
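
As a rough illustration of how the fields in this table might be consumed, the Python sketch below walks .targetReport.microorganisms[] and keeps the entries whose .predictionInformation.predictedPresent is true. The report fragment is hand-written and hypothetical (it is not real pipeline output), the helper name reported_microorganisms is invented for this example, and .predictionInformation is assumed to be a single object, as the dotted field names in the table suggest.

```python
import json

# Hypothetical, hand-written fragment shaped like the fields described above.
EXAMPLE_REPORT = json.loads("""
{
  "targetReport": {
    "microorganisms": [
      {"class": "viral", "name": "Example virus A", "coverage": 0.98,
       "alignedReadCount": 1200,
       "predictionInformation": {"predictedPresent": true, "notes": []}},
      {"class": "bacterial", "name": "Example bacterium B", "coverage": 0.12,
       "alignedReadCount": 15,
       "predictionInformation": {"predictedPresent": false, "notes": []}}
    ]
  }
}
""")


def reported_microorganisms(report):
    """Return (class, name, coverage) for microorganisms predicted present."""
    rows = []
    for organism in report.get("targetReport", {}).get("microorganisms", []):
        # Assumption: predictionInformation is one object per microorganism.
        prediction = organism.get("predictionInformation", {})
        if prediction.get("predictedPresent"):
            rows.append((organism.get("class"), organism.get("name"),
                         organism.get("coverage")))
    return rows


for row in reported_microorganisms(EXAMPLE_REPORT):
    print(row)
```

Using .get() with defaults keeps the loop tolerant of optional fields, which matters here because many fields apply only to specific panels.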

.targetReport.microorganisms[].predictionInformation[].relatedMicroorganisms[] Node

The value of the .targetReport.microorganisms[].predictionInformation[].relatedMicroorganisms[] field is an array of objects containing information about genetically related microorganisms. The following table describes one .targetReport.microorganisms[].predictionInformation[].relatedMicroorganisms[] object. The fields are relative to .targetReport.microorganisms[].predictionInformation[].relatedMicroorganisms[]

Field
Description

.name

Name of related microorganism

.onPanel

Boolean indicating whether the related microorganism is a panel member

.kmerReadCount

(UPIP only) Number of sample sequencing reads classified to related microorganism reference sequences

.coverage

Proportion of related microorganism reference sequence bases that appear in sample sequencing reads

.ani

Average nucleotide identity of consensus sequence to related microorganism reference sequences

.alignedReadCount

Number of sample sequencing reads that aligned to related microorganism reference sequences

.targetReport.microorganisms[].variants[] Node

The value of the .targetReport.microorganisms[].variants[] field is an array of objects containing information about viral variants. It is populated for all RVOP/RVEK, VSP, and VSP V2 viruses and, for RPIP, for SARS-CoV-2 and Flu A/B/C only. The following table describes one .targetReport.microorganisms[].variants[] object. The fields are relative to .targetReport.microorganisms[].variants[]

Field
Description

.referenceAccession

Accession of reference genome (or segment) sequence used for variant calling

.segment

(Segmented viruses only) Segment number of reference segment sequence

.ntChange

Nucleotide change associated with variant

.referencePosition

Variant position in viral reference genome (or segment) sequence

.referenceAllele

Reference allele at variant position

.variantAllele

Variant allele

.depth

Variant depth, indicating the number of times variant position appears in sample sequencing reads

.alleleFrequency

Frequency of variant allele in sample sequencing reads

.category*

Variant category ("Viral Variant; Known AMR", "Viral Variant")

.comments*

List of additional information regarding the variant

.gene*

(SARS-CoV-2, Flu A/B/C only) Gene name

.product*

Protein product of gene

.annotation*

Type of change (e.g., "Nonsynonymous Variant")

.aachange*

Amino acid change associated with variant

.epistaticGroups*

List of epistatic groups variant is associated with

.standardNomenclatureEpistaticGroups*

(Flu A/B only) List of epistatic groups variant is associated with using standard nomenclature coordinates

.standardNomenclatureAaChange*

(Flu A/B only) Amino acid change associated with variant using standard nomenclature coordinates

.standardNomenclatureAccession*

(Flu A/B only) NCBI accession of the reference sequence used to establish standard nomenclature coordinates

.drugClasses*

List of drug classes variant is predicted to confer resistance to

.representativeAntimicrobials*

List of representative antimicrobials variant is predicted to confer resistance to

.inhibitionLevel*

(Flu A/B only) Level of inhibition per cited publications (see pmids)

.pmids*

PubMed IDs of publications associated with variant
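
One possible way to work with these variant objects is to filter them on .alleleFrequency and .depth. The sketch below is only an illustration: the cut-offs (0.5 and 10) are arbitrary, the example objects are hypothetical, and it assumes .alleleFrequency is expressed as a fraction between 0 and 1, which this page does not state.

```python
def filter_variants(variants, min_frequency=0.5, min_depth=10):
    """Keep variants whose allele frequency and depth meet both cut-offs.

    The default cut-offs are arbitrary illustration values, not
    recommendations from the software.
    """
    kept = []
    for variant in variants:
        frequency = variant.get("alleleFrequency")
        depth = variant.get("depth")
        if frequency is None or depth is None:
            continue  # skip entries missing either value
        if frequency >= min_frequency and depth >= min_depth:
            kept.append(variant)
    return kept


# Hypothetical variant objects using field names from the table above.
example_variants = [
    {"ntChange": "A100G", "referencePosition": 100, "depth": 250, "alleleFrequency": 0.97},
    {"ntChange": "C200T", "referencePosition": 200, "depth": 8, "alleleFrequency": 0.90},
    {"ntChange": "G300A", "referencePosition": 300, "depth": 400, "alleleFrequency": 0.12},
]

for variant in filter_variants(example_variants):
    print(variant["ntChange"], variant["alleleFrequency"])
```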

.targetReport.microorganisms[].pangoLineage[] Node

The value of the .targetReport.microorganisms[].pangoLineage[] field is an array of objects containing information about SARS-CoV-2 Pango lineage prediction results. The following table describes one .targetReport.microorganisms[].pangoLineage[] object. The fields are relative to .targetReport.microorganisms[].pangoLineage[].

Field
Description

.lineage*

From Pangolin: "The most likely lineage assigned to a given sequence based on the inference engine used and the SARS-CoV-2 diversity designated. This assignment may be sensitive to missing data at key sites"

.conflict*

From Pangolin: "In the pangoLEARN model, a given sequence gets assigned to the most likely category based on known diversity. If a sequence can fit into more than one category, the conflict score will be greater than 0 and reflect the number of categories the sequence could fit into. If the conflict score is 0, this means that within the current decision tree there is only one category that the sequence could be assigned to"

.ambiguityScore*

From Pangolin: "This score is a function of the quantity of missing data in a sequence. It represents the proportion of relevant sites in a sequence which were imputed to the reference values. A score of 1 indicates that no sites were imputed, while a score of 0 indicates that more sites were imputed than were not imputed. This score only includes sites which are used by the decision tree to classify a sequence"

.version*

Version of the PUSHER database

.pangolinVersion*

Version of the Pangolin software

.targetReport.microorganisms[].nextclade[] Node

The value of the .targetReport.microorganisms[].nextclade[] field is an array of objects containing information about viral clade assignment results for applicable viruses. The following table describes one .targetReport.microorganisms[].nextclade[] object. The fields are relative to .targetReport.microorganisms[].nextclade[].

Field
Description

.sequenceName*

Name of the sequence

.referenceAccession*

Reference accession

.overallStatus*

Overall status

.clade*

Assigned clade

.pangoLineage*

Pango lineage assigned by Nextclade

.cladeWho*

World Health Organization (WHO) nomenclature

.substitutions*

Total number of detected nucleotide substitutions

.totalNonACGTNs*

Total number of detected ambiguous nucleotides (nucleotide characters that are not A, C, G, T, N)

.totalMissing*

Total number of detected missing nucleotides (nucleotide character N)

.coverage*

Proportion of consensus genome (or segment) sequence bases that aligned to reference accession

.totalInsertions*

Total number of inserted nucleotide bases

.totalFrameShifts*

Total number of detected frame shifts

.stopCodons*

Total number of detected stop codons

.version*

Version of the Nextclade software

.targetReport.amrMarkers[] Node

The value of the .targetReport.amrMarkers[] field is an array of objects containing information about detected bacterial AMR markers. The following table describes one .targetReport.amrMarkers[] object. The fields are relative to .targetReport.amrMarkers[]

Field
Description

.class

Microorganism class ("bacterial")

.cardModelType

Bacterial AMR marker model type in the Comprehensive Antibiotic Resistance Database (CARD) ("homolog", "protein variant", "rRNA variant")

.cardGeneFamily

Bacterial AMR marker gene family in the Comprehensive Antibiotic Resistance Database (CARD)

.name

Bacterial AMR marker name

.cardName

Bacterial AMR marker name in the Comprehensive Antibiotic Resistance Database (CARD)

.ncbiName

Bacterial AMR marker name in the National Center for Biotechnology Information (NCBI) Reference Gene Catalog

.referenceAccession

Accession of the bacterial AMR marker reference sequence

.coverage

Proportion of bacterial AMR marker reference sequence residues that appear in sample sequencing reads (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)

.pid

Percent identity of consensus sequence aligned to bacterial AMR marker reference sequence (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)

.medianDepth

Median depth of sample sequencing reads aligned to bacterial AMR marker reference sequence, indicating the median number of times each bacterial AMR marker sequence residue appears in sample sequencing reads (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)

.rpkm

Normalized representation of the number of sample sequencing reads aligned to bacterial AMR reference sequence (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)

.alignedReadCount

Number of sample sequencing reads that aligned to bacterial AMR reference sequence (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)

.nucleotideConsensusSequence

Nucleotide consensus sequence bases

.proteinConsensusSequence

Protein consensus sequence residues

.nucleotideDepthVector

Read depth across the bacterial AMR marker nucleotide reference sequence, not condensed

.proteinDepthVector

Read depth across the bacterial AMR marker protein reference sequence, not condensed

.associatedMicroorganisms

Information about the microorganisms associated with the bacterial AMR marker

.associatedMicroorganisms.all

List of all microorganisms associated with the bacterial AMR marker

.associatedMicroorganisms.detected

List of detected microorganisms associated with the bacterial AMR marker

.associatedMicroorganisms.predicted

List of predicted microorganisms associated with the bacterial AMR marker

.predictionInformation

Information about bacterial AMR marker prediction results

.predictionInformation.predictedPresent

Boolean indicating whether the bacterial AMR marker passed its reporting logic algorithm

.predictionInformation.confidence

Confidence level of bacterial AMR marker prediction ("high", "medium", "low")

.predictionInformation.notes

List of notes about the prediction result

.flags*

Flag indicating AMR marker is an extended-spectrum beta-lactamase (ESBL) or carbapenemase (Carbapenemase)

.representativeAntimicrobials*

List of representative antimicrobials the AMR marker is predicted to confer resistance to

.drugClasses*

List of drug classes the AMR marker is predicted to confer resistance to
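
To illustrate how .predictionInformation.predictedPresent and .predictionInformation.confidence could be combined, the sketch below groups marker names by confidence level. The marker names and the helper markers_by_confidence are hypothetical, and the "unknown" fallback is an assumption for entries where no confidence value is present.

```python
from collections import defaultdict


def markers_by_confidence(amr_markers):
    """Group names of AMR markers predicted present by confidence level."""
    grouped = defaultdict(list)
    for marker in amr_markers:
        prediction = marker.get("predictionInformation", {})
        if not prediction.get("predictedPresent"):
            continue  # marker did not pass its reporting logic
        level = prediction.get("confidence", "unknown")
        grouped[level].append(marker.get("name"))
    return dict(grouped)


# Hypothetical marker objects using field names from the table above.
example_markers = [
    {"name": "markerA", "cardModelType": "homolog",
     "predictionInformation": {"predictedPresent": True, "confidence": "high"}},
    {"name": "markerB", "cardModelType": "protein variant",
     "predictionInformation": {"predictedPresent": True, "confidence": "low"}},
    {"name": "markerC", "cardModelType": "homolog",
     "predictionInformation": {"predictedPresent": False, "confidence": "medium"}},
]

print(markers_by_confidence(example_markers))
```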

.targetReport.amrMarkers[].variants[] Node

The value of the .targetReport.amrMarkers[].variants[] field is an array of objects containing information about variants for bacterial AMR markers with "protein variant" or "rRNA variant" model types. The following table describes one .targetReport.amrMarkers[].variants[] object. The fields are relative to .targetReport.amrMarkers[].variants[]

Field
Description

.category

Variant category ("Bacterial Variant; Known AMR")

.referenceSourceMicroorganism

Microorganism that reference sequence is associated with in NCBI

.comments

List of additional information regarding the variant

.product

Protein product of gene

.ntChange

Nucleotide change associated with variant

.referencePosition

Variant position in reference sequence

.referenceAllele

Reference allele at variant position

.variantAllele

Variant allele

.depth

Variant depth, indicating the number of times variant position appears in sample sequencing reads

.alleleFrequency

Frequency of variant allele in sample sequencing reads

.annotation

Type of change (e.g., "Nonsynonymous Variant")

.aaChange

Amino acid change associated with variant

.epistaticGroups

List of epistatic groups variant is associated with

.representativeAntimicrobials*

List of representative antimicrobials variant is predicted to confer resistance to

.drugClasses*

List of drug classes variant is predicted to confer resistance to

.confidenceLevel*

(MTB only) Confidence level is given for Mycobacterium tuberculosis variants if provided by the WHO Catalogue of mutations in Mycobacterium tuberculosis (Final Grading Confidence; for rpoB only), or the Comprehensive Antibiotic Resistance Database (CARD), as part of a confidence model for AMR developed by the Relational Sequencing Tuberculosis Data Platform (ReSeqTB)

.pmids*

PubMed IDs of publications associated with variant

.targetReport.customReferences[] Node

This section contains information about custom reference detection results and is only present for custom database analyses. When only a custom reference FASTA file is provided (no BED file), each .targetReport.customReferences[] object contains information for a single reference sequence. When both a FASTA and BED file are provided, each .targetReport.customReferences[] object contains information for a single genome/microorganism, which can be a collection of one or more reference sequences. The fields are relative to .targetReport.customReferences[]

Field
Description

.name

Provided name of custom reference sequence, accession, genome, or microorganism

.coverage

Proportion of custom reference sequence bases that appear in sample sequencing reads

.ani

Average nucleotide identity of consensus sequence to custom reference sequence or, if specified, collection of one or more custom reference sequences

.medianDepth

Median depth of sample sequencing reads aligned to custom reference sequence or, if specified, collection of one or more custom reference sequences, indicating the median number of times each custom reference sequence base appears in sample sequencing reads

.condensedDepthVector

Read depth across custom reference sequence or, if specified, collection of one or more custom reference sequences, condensed to 256 bins

.rpkm

Normalized number of sample sequencing reads aligned to custom reference sequence or, if specified, collection of one or more custom reference sequences (targeted Reads mapped Per Kilobase of targeted sequence per Million quality-filtered reads)

.alignedReadCount

Number of sample sequencing reads that aligned to custom reference sequence or, if specified, collection of one or more custom reference sequences

.consensusSequences

Array of objects with information about each consensus sequence. See below for details

.variants

Array of objects with information about variants detected in custom reference sequence or, if specified, collection of one or more custom reference sequences. See below for details

.pangoLineage*

Array of objects with information about SARS-CoV-2 Pango lineage prediction results. See below for details
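
The .rpkm definition above (targeted Reads mapped Per Kilobase of targeted sequence per Million quality-filtered reads) can be written out as a small calculation. This is a sketch of the textbook RPKM formula under that wording. The function and parameter names are hypothetical, and the exact read counts and target lengths the software uses internally are not described on this page.

```python
def rpkm(aligned_reads, target_length_bp, quality_filtered_reads):
    """Reads per kilobase of target per million quality-filtered reads."""
    if target_length_bp <= 0 or quality_filtered_reads <= 0:
        raise ValueError("target length and read total must be positive")
    kilobases = target_length_bp / 1_000
    millions = quality_filtered_reads / 1_000_000
    return aligned_reads / (kilobases * millions)


# 500 reads on a 2,000 bp target in a library of 1,000,000 filtered reads:
# 500 / (2 kb * 1 million) = 250.0
print(rpkm(500, 2_000, 1_000_000))
```

Normalizing by both target length and library size is what makes values comparable between short and long references and between samples sequenced to different depths.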

.targetReport.customReferences[].consensusSequences[] Node

The value of the .targetReport.customReferences[].consensusSequences[] field is an array of objects, each containing majority consensus sequence information for a single custom reference sequence. When only a FASTA file is provided (no BED file), there will be only one object in the array. When both a FASTA and BED file are provided, there may be more than one object in the array. The fields are relative to .targetReport.customReferences[].consensusSequences[]

Field
Description

.sequence

Majority consensus sequence bases

.referenceAccession

Accession of custom reference sequence

.referenceDescription

Description of custom reference sequence

.referenceLength

Length of custom reference sequence

.coverage

Proportion of custom reference sequence bases that appear in sample sequencing reads

.ani

Average nucleotide identity of consensus sequence to custom reference sequence

.medianDepth

Median depth of sample sequencing reads aligned to custom reference sequence, indicating the median number of times each custom reference sequence base appears in sample sequencing reads

.depthVector

Read depth across custom reference sequence, not condensed

.alignedReadCount

Number of sample sequencing reads that aligned to custom reference sequence

.maximumAlignmentLength

Longest contiguous alignment between consensus sequence and custom reference sequence

.maximumGapLength

Longest contiguous alignment gap (insertion or deletion) between consensus sequence and custom reference sequence

.maximumUnalignedLength

Longest section of custom reference sequence not aligned to by consensus sequence
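
The uncondensed .depthVector in this table can be reduced to a fixed number of bins, similar in spirit to the 256-bin .condensedDepthVector described for the parent object. The sketch below averages depth within equal-width windows. Averaging is an assumption made for illustration, because this page does not say how the software condenses depth, and the function name condense_depth is hypothetical.

```python
def condense_depth(depth_vector, bins=256):
    """Condense a per-base depth vector into at most `bins` mean values.

    Assumption: each bin holds the mean depth of its window. The software
    may use a different summary (for example median or maximum).
    """
    n = len(depth_vector)
    if n == 0:
        return []
    bins = min(bins, n)  # never create more bins than positions
    condensed = []
    for i in range(bins):
        start = i * n // bins
        end = (i + 1) * n // bins
        window = depth_vector[start:end]
        condensed.append(sum(window) / len(window))
    return condensed


# A 512-position toy vector collapses to 256 bins of two positions each.
toy = list(range(512))
result = condense_depth(toy)
print(len(result), result[0], result[-1])
```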

.targetReport.customReferences[].variants[] Node

The value of the .targetReport.customReferences[].variants[] field is an array of objects, each containing information about a single detected variant. The fields are relative to .targetReport.customReferences[].variants[]

Field
Description

.ntChange

Nucleotide change associated with variant

.referenceAccession

Accession of custom reference sequence used for variant calling

.referencePosition

Variant position in custom reference sequence

.referenceAllele

Reference allele at variant position

.variantAllele

Variant allele

.depth

Variant depth, indicating the number of times variant position appears in sample sequencing reads

.alleleFrequency

Frequency of variant allele in sample sequencing reads

.targetReport.customReferences[].pangoLineage[] Node

The value of the .targetReport.customReferences[].pangoLineage[] field is an array of objects containing information about SARS-CoV-2 Pango lineage prediction results. The following table describes one .targetReport.customReferences[].pangoLineage[] object. The fields are relative to .targetReport.customReferences[].pangoLineage[]

Field
Description

.lineage*

The most likely lineage assigned to a given sequence based on the inference engine used and the SARS-CoV-2 diversity designated. This assignment may be sensitive to missing data at key sites

.conflict*

In the pangoLEARN model, a given sequence gets assigned to the most likely category based on known diversity. If a sequence can fit into more than one category, the conflict score will be greater than 0 and reflect the number of categories the sequence could fit into. If the conflict score is 0, this means that within the current decision tree there is only one category that the sequence could be assigned to

.ambiguityScore*

This score is a function of the quantity of missing data in a sequence. It represents the proportion of relevant sites in a sequence which were imputed to the reference values. A score of 1 indicates that no sites were imputed, while a score of 0 indicates that more sites were imputed than were not imputed. This score only includes sites which are used by the decision tree to classify a sequence

.version*

Version of the PUSHER database

.pangolinVersion*

Version of the Pangolin software

.additionalInformation[] Node

The value of the .additionalInformation[] field is an array of objects containing additional information about the test and data analysis solution. The fields are relative to .additionalInformation[]

Field
Description

.abbreviations*

Information about abbreviations relevant to test

.abbreviations.abbreviation*

Abbreviation

.abbreviations.definition*

Abbreviation definition

.interpretiveData*

Information about test

.interpretiveData.header*

Test information category

.interpretiveData.paragraph*

Test information text

Custom Panel

UPIP

Scientific evidence

Illumina Respiratory Pathogen ID/AMR Enrichment Panel Kit (RPIP)

Illumina Urinary Pathogen ID/AMR Enrichment Panel Kit (UPIP)

Illumina Respiratory Virus Oligo Panel / Respiratory Virus Enrichment Kit (RVOP / RVEK)

Illumina Viral Surveillance Panel (VSP)

Illumina Viral Surveillance Panel V2 (VSP V2)

VSP V2

RVOP/RVEK

VSP

Abbreviation
Definition
Category
Test information

Application note:

Application note:

Technical note:

Genomics Research Hub (GRH) article:

Data sheet:

UPIP ID Week Scientific Poster:

Application note:

Application note:

Application note:

Data sheet:

Data sheet:

Abbreviation
Definition
Category
Test information

NGS

next-generation sequencing

pangolin

phylogenetic assignment of named global outbreak lineages

RUO

For Research Use Only. Not for use in diagnostic procedures.

URL

See https://www.illumina.com/ for additional information.

Pango lineage

The most likely Pango (phylogenetic assignment of named global outbreak) lineage is assigned to the majority consensus SARS-CoV-2 genome sequence using pangolin 4.3.1 (Áine O'Toole & Emily Scher et al. 2021 Virus Evolution DOI:10.1093/ve/veab064).

Limitations

Custom panel data analysis using DRAGEN Microbial Enrichment Plus aligns human-dehosted next-generation sequencing (NGS) reads to reference sequences. Contamination with microorganisms is possible during specimen collection, transport, and processing. Reads from closely related microorganisms may align to reference sequences based on sequence homology. Alignment of reads to a microorganism does not confirm that the microorganism is causing symptoms, is viable, or is infectious. Results should be interpreted in the context of all available information. Other sources of data may be required for confirmation.

AMR

antimicrobial resistance

CLSI

Clinical and Laboratory Standards Institute

ESBL

extended-spectrum beta-lactamase

EUCAST

European Committee on Antimicrobial Susceptibility Testing

mL

milliliter

NGS

next-generation sequencing

RPKM

targeted Reads mapped Per Kilobase of targeted sequence per Million quality-filtered reads

UPIP

Urinary Pathogen ID/AMR Panel

RUO

For Research Use Only. Not for use in diagnostic procedures.

URL

See https://www.illumina.com/ for additional information.

Quantification - when a quantitative Internal Control {ic_name} and concentration {ic_concentration} are specified

UPIP data analysis using DRAGEN Microbial Enrichment Plus detects 35 viruses, 121 bacteria, 14 fungi, 4 parasites, and 4,371 AMR markers, unless filtered reporting options are selected, based on target enriched next-generation sequencing (NGS) of microorganism DNA sequences. Sequencing data are interpreted by the DRAGEN software platform and microorganisms that pass reporting thresholds are reported. Absolute quantification assumes use of {ic_name} as an Internal Control spiked at {ic_concentration} copies/mL of sample. Relative abundance is calculated based on absolute quantities and is expressed as proportion of absolute quantities within each pathogen class (i.e., bacteria, viruses, fungi, parasites). If RPKM for the Internal Control is zero, no absolute quantification is provided, and relative abundance is expressed as proportion of microorganism RPKM values within each pathogen class.

Quantification - when a quantitative Internal Control is NOT specified

UPIP data analysis using DRAGEN Microbial Enrichment Plus detects 35 viruses, 121 bacteria, 14 fungi, 4 parasites, and 4,371 AMR markers, unless filtered reporting options are selected, based on target enriched next-generation sequencing (NGS) of microorganism DNA sequences. Sequencing data are interpreted by the DRAGEN software platform and microorganisms that pass reporting thresholds are reported. Relative abundance is expressed as proportion of microorganism RPKM values within each pathogen class (i.e., bacteria, viruses, fungi, parasites). Internal Control not specified; no absolute quantification provided.

AMR - when "Report bacterial AMR markers only when an associated microorganism is reported" is selected

This test detects 4,371 antimicrobial resistance (AMR) markers and reports associations for 72 microorganisms, 185 antimicrobials, and 33 drug classes, unless filtered reporting options are selected. AMR markers are based on the Comprehensive Antibiotic Resistance Database (CARD, version 3.2.8). Detection of an AMR marker is reported if the AMR marker passes a minimum detection threshold and if one or more of the microorganisms associated with the AMR marker is also detected, in alignment with guidance provided by the College of American Pathologists (CAP) MIC.21855. However, reported AMR markers may originate from microorganisms that did not meet detection thresholds or microorganisms not targeted by the test. Association between microorganisms and AMR marker is based on scientific literature and the Comprehensive Antibiotic Resistance Database Prevalence Data (CARD Prevalence, version 4.0.1) from McMaster University. 3,968 out of 4,371 AMR markers are associated with a microorganism targeted by UPIP. Reported AMR markers have been associated with antimicrobial resistance but may not always indicate phenotypic resistance. Failure to detect AMR markers does not always indicate phenotypic susceptibility. Results should be interpreted in the context of all available information.

AMR - when "Report bacterial AMR markers only when an associated microorganism is reported" is NOT selected

This test detects 4,371 antimicrobial resistance (AMR) markers and reports associations for 72 microorganisms, 185 antimicrobials, and 33 drug classes. AMR markers are based on the Comprehensive Antibiotic Resistance Database (CARD, version 3.2.8). Association between microorganisms and AMR marker is based on scientific literature and the Comprehensive Antibiotic Resistance Database Prevalence Data (CARD Prevalence, version 4.0.1) from McMaster University. Detection of an AMR marker is reported if the AMR marker passes a minimum detection threshold, regardless of associated microorganism detection. Reported AMR markers may originate from microorganisms that did not meet detection thresholds or microorganisms not targeted by the test. Reported AMR markers have been associated with antimicrobial resistance but may not always indicate phenotypic resistance. Failure to detect AMR markers does not always indicate phenotypic susceptibility. Results should be interpreted in the context of all available information.

AMR

Linkage between AMR marker, antimicrobial, and drug class is based on the Comprehensive Antibiotic Resistance Database (CARD, version 3.2.8) from McMaster University, ResFinder (version 2.2.1), NCBI Reference Gene Catalog (version 2023-09-26.1), EUCAST expert rules on indicator agents (2019-2023), and CLSI Performance Standards for Antimicrobial Susceptibility Testing (M100 34th Edition). Not all antimicrobials and drug classes that are listed may be relevant. Detected AMR markers may also confer resistance to antimicrobials and drug classes that are not listed.

AMR

A representative list of associated microorganisms known to harbor the detected or similar AMR markers, based on the Comprehensive Antibiotic Resistance Database Prevalence Data (CARD Prevalence, version 4.0.1) from McMaster University, can be found in the Associated Microorganisms field.

AMR

Mutations connected with a '+' form an epistatic group. Epistatic groups are two or more mutations that need to be present concurrently to confer the associated resistance.
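
As a rough illustration of the '+' notation, the sketch below checks whether every member of an epistatic group is among the detected mutations. The function name and the mutation labels (`X10Y`, `Z20W`) are hypothetical placeholders, not identifiers from the software or its reports.

```python
def epistatic_group_present(group: str, detected: set) -> bool:
    """Return True only if every mutation joined by '+' was detected."""
    members = [m.strip() for m in group.split("+") if m.strip()]
    # An empty group string never counts as present.
    return bool(members) and all(m in detected for m in members)


# Placeholder mutation names, for illustration only.
detected_mutations = {"X10Y", "Z20W", "Q5R"}
print(epistatic_group_present("X10Y+Z20W", detected_mutations))
print(epistatic_group_present("X10Y+A1B", detected_mutations))
```

A single mutation without a '+' is handled as a group of one.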

AMR

All intrinsic resistance described in CLSI Performance Standards for Antimicrobial Susceptibility Testing, M100 34th Edition, Appendix B for detected microorganism(s) is reported. Additional comments regarding CLSI intrinsic resistance definitions may be reported in footnotes specific to the detected microorganism(s). Some intrinsic resistance is described with reference to drug classes rather than specific antimicrobials. Users may reference CLSI Glossary I (Part 1 and Part 2): Class and Subclass Designations and Generic Names for information on how CLSI categorizes antimicrobials and drug classes.

AMR

Confidence of AMR marker detection is shown as High, Medium, or Low and is based on the available sequencing data. High confidence indicates that an AMR marker has 100% protein sequence coverage and 100% protein sequence percent identity (PID). Medium confidence indicates that an AMR marker has ≥90% protein sequence coverage and ≥90% protein sequence percent identity (PID). Low confidence indicates that an AMR marker has ≥60% protein sequence coverage and ≥80% protein sequence percent identity (PID).
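
The thresholds above can be read as a top-down lookup. The following is a minimal sketch under that assumption; the function name and the choice to return `None` below the Low thresholds are illustrative and not taken from the software.

```python
from typing import Optional


def amr_confidence(coverage_pct: float, identity_pct: float) -> Optional[str]:
    """Map protein sequence coverage and percent identity (PID) to a tier.

    Tiers are checked from strictest to most lenient (an assumption about
    evaluation order; the thresholds themselves come from the text above).
    """
    if coverage_pct >= 100 and identity_pct >= 100:
        return "High"
    if coverage_pct >= 90 and identity_pct >= 90:
        return "Medium"
    if coverage_pct >= 60 and identity_pct >= 80:
        return "Low"
    return None  # assumption: no confidence tier below the Low thresholds


print(amr_confidence(100, 100))
print(amr_confidence(95, 99))
print(amr_confidence(100, 85))
```

Note that a marker with full coverage but 85% identity falls to Low, because Medium requires both values to be at least 90%.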

Phenotypic group

Targeted microorganisms are classified into three Phenotypic Groups based on general association with urinary tract infections, normal flora, colonization, or contamination from the environment or other sources. Phenotypic grouping DOES NOT INDICATE PATHOGENICITY IN A GIVEN CASE and results need to be interpreted in the context of all available information. Phenotypic Group 1: Microorganisms that are rarely associated with urinary tract infections and may frequently represent normal flora, colonizers, or contaminants. Phenotypic Group 2: Microorganisms that are infrequently associated with urinary tract infections and may frequently represent part of the normal flora, colonizers, or contaminants. Phenotypic Group 3: Microorganisms that are commonly associated with urinary tract infections but may also represent part of the normal flora, colonizers, or contaminants.

Read classification

This test differentiates sequencing reads classified to microorganism and Internal Control regions that are targeted by capture probes (“Targeted Microbial” and “Targeted Internal Control”) from those that are not targeted (“Untargeted”), are low complexity (“Low Complexity”), cannot be unambiguously assigned to one category (“Ambiguous”), or cannot be classified with confidence (“Unclassified”).

Limitations

Non-detected results do not rule out the presence of viruses, bacteria, fungi, parasites, and AMR markers. Contamination with microorganisms is possible during specimen collection, transport, and processing. Closely related microorganisms may be misidentified based on sequence homology to species present in the database. The identification of DNA sequences from a microorganism does not confirm that the identified microorganism is causing symptoms, is viable, or is infectious. Recombinant viral strains may not be reported or may be reported as one or more individual viruses. The Enterobacter cloacae complex may not be reported if targeted species members (Enterobacter cloacae, Enterobacter hormaechei, and Enterobacter cancerogenus) are not present.

Limitations

The best matching allele is reported for each detected AMR gene family. If two or more alleles within the same AMR gene family are detected, only the allele with the higher confidence will be reported as the best match unless multiple alleles have a High confidence interpretation (100% protein sequence coverage and PID). In bacterial strains containing insertion-deletion mutations (indels), there is a risk of false positive or false negative results for other resistance mutations within a region of 100 nucleotides around the indel.

Limitations

Information provided by DRAGEN Microbial Enrichment Plus is based on scientific knowledge and has been curated; however, scientific knowledge evolves and information about associated microorganism and associated resistance may not always be complete and/or correct. Results should be interpreted in the context of all available information. Other sources of data may be required for confirmation.

AMR

antimicrobial resistance

mL

milliliter

NAI

neuraminidase inhibitor

NGS

next-generation sequencing

PAI

polymerase acidic endonuclease inhibitor

pangolin

phylogenetic assignment of named global outbreak lineages

RPKM

targeted Reads mapped Per Kilobase of targeted sequence per Million quality-filtered reads

VSP

Viral Surveillance Panel

RUO

For Research Use Only. Not for use in diagnostic procedures.

URL

See https://www.illumina.com/ for additional information.

Quantification - when a quantitative Internal Control {ic_name} and concentration {ic_concentration} is specified

VSP (generation 2) data analysis using DRAGEN Microbial Enrichment Plus detects 200 viruses and 238 AMR markers based on target enriched next-generation sequencing (NGS) of viral DNA and cDNA sequences. Sequencing data are interpreted by the DRAGEN software platform and viruses that pass detection thresholds are reported. Absolute quantification assumes use of {ic_name} as an Internal Control spiked at {ic_concentration} copies/mL of sample. Relative abundance is calculated based on absolute quantities and is expressed as proportion of absolute quantities. If RPKM for the Internal Control is zero, no absolute quantification is provided, and relative abundance is expressed as proportion of RPKM values.
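
The quantities named above can be sketched as follows. RPKM follows the glossary definition (targeted reads per kilobase of targeted sequence per million quality-filtered reads). The absolute quantification step assumes a simple proportional scaling against the Internal Control; that formula is an assumption for illustration and is not stated in this document. All function and variable names are hypothetical.

```python
def rpkm(targeted_reads: int, targeted_bases: int, quality_filtered_reads: int) -> float:
    """Targeted reads per kilobase of targeted sequence per million QC-passed reads."""
    return targeted_reads / (targeted_bases / 1e3) / (quality_filtered_reads / 1e6)


def quantify(rpkm_by_virus: dict, ic_rpkm: float, ic_copies_per_ml: float):
    """Return (absolute quantities or None, relative abundance)."""
    if ic_rpkm > 0:
        # ASSUMPTION: copies/mL scale linearly with the RPKM ratio to the Internal Control.
        absolute = {v: r / ic_rpkm * ic_copies_per_ml for v, r in rpkm_by_virus.items()}
        basis = absolute
    else:
        # Internal Control RPKM is zero: no absolute quantification,
        # relative abundance falls back to proportions of RPKM.
        absolute = None
        basis = rpkm_by_virus
    total = sum(basis.values())
    relative = {v: (x / total if total else 0.0) for v, x in basis.items()}
    return absolute, relative


absolute, relative = quantify({"virus_a": 30.0, "virus_b": 10.0}, ic_rpkm=20.0, ic_copies_per_ml=1.0e6)
print(absolute, relative)
```

Under this proportional assumption the relative abundances are identical whether they are computed from absolute quantities or from RPKM values, since every virus is scaled by the same factor.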

Quantification - when a quantitative Internal Control is NOT specified

VSP (generation 2) data analysis using DRAGEN Microbial Enrichment Plus detects 200 viruses and 238 AMR markers based on target enriched next-generation sequencing (NGS) of viral DNA and cDNA sequences. Sequencing data are interpreted by the DRAGEN software platform and viruses that pass detection thresholds are reported. Relative abundance is expressed as proportion of RPKM values. Internal Control not specified; no absolute quantification provided.

AMR

This test detects 238 antimicrobial resistance (AMR) markers associated with resistance to Influenza virus neuraminidase inhibitor (NAI) and polymerase acidic protein inhibitor (PAI) in Influenza A virus (H1N1pdm09), Influenza A virus (H1N1), Influenza A virus (H5N1), Influenza A virus (H3N2), Influenza A virus (H3N2; swine-like), Influenza A virus (H7N9), and Influenza B virus. AMR markers and drug associations are based on the World Health Organization (WHO) Influenza virus NAI and PAI Reduced Susceptibility Marker Tables (07 March 2023 version). Detection of an AMR marker is reported if the marker passes a minimum detection threshold and if the Influenza virus associated with the marker is also detected. Reported AMR markers have been associated with antimicrobial resistance but may not always indicate phenotypic resistance. Failure to detect AMR variants does not always indicate phenotypic susceptibility. Results should be interpreted in the context of all available information.

AMR

Mutations connected with a '+' form an epistatic group. Epistatic groups are two or more mutations that need to be present concurrently to confer the associated resistance.

Read classification

This test differentiates sequencing reads classified to microorganism and Internal Control regions that are targeted by capture probes (“Targeted Microbial” and “Targeted Internal Control”) from those that are not targeted (“Untargeted”), are low complexity (“Low Complexity”), cannot be unambiguously assigned to one category (“Ambiguous”), or cannot be classified with confidence (“Unclassified”).

Pango lineage

The most likely Pango (phylogenetic assignment of named global outbreak) lineage is assigned to the majority consensus SARS-CoV-2 genome sequence using pangolin 4.3.1 (Áine O'Toole & Emily Scher et al. 2021 Virus Evolution DOI:10.1093/ve/veab064).

Limitations

Non-detected results do not rule out the presence of viruses and AMR markers. Contamination is possible during specimen collection, transport, and processing. Closely related viruses may be misidentified based on sequence homology to viruses present in the database. The identification of cDNA or DNA sequences from a virus does not confirm that the identified virus is causing symptoms, is viable, or is infectious. Recombinant viral strains may not be reported or may be reported as one or more individual viruses. Should one or more individual viruses be reported for a recombinant viral strain, antiviral resistance results may be inaccurate. In viral strains containing insertion-deletion mutations (indels), there is a risk of false positive or false negative results for other resistance mutations within a region of 100 nucleotides around the indel.

Limitations

Information provided by DRAGEN Microbial Enrichment Plus is based on scientific knowledge and has been curated; however, scientific knowledge evolves and reported information may not always be complete and/or correct. Results should be interpreted in the context of all available information. Other sources of data may be required for confirmation.

AMR

antimicrobial resistance

mL

milliliter

NAI

neuraminidase inhibitor

NGS

next-generation sequencing

PAI

polymerase acidic endonuclease inhibitor

pangolin

phylogenetic assignment of named global outbreak lineages

RPKM

targeted Reads mapped Per Kilobase of targeted sequence per Million quality-filtered reads

RVEK

Respiratory Virus Enrichment Kit

RVOP

Respiratory Virus Oligo Panel

RUO

For Research Use Only. Not for use in diagnostic procedures.

URL

See https://www.illumina.com/ for additional information.

Quantification

RVOP data analysis using DRAGEN Microbial Enrichment Plus detects 24 viruses and 238 AMR markers based on target enriched next-generation sequencing (NGS) of viral DNA and cDNA sequences. Sequencing data are interpreted by the DRAGEN software platform and viruses that pass detection thresholds are reported. Relative abundance is expressed as proportion of RPKM values.

AMR

This test detects 238 antimicrobial resistance (AMR) markers associated with resistance to Influenza virus neuraminidase inhibitor (NAI) and polymerase acidic protein inhibitor (PAI) in Influenza A virus (H1N1pdm09), Influenza A virus (H1N1), Influenza A virus (H5N1), Influenza A virus (H3N2), Influenza A virus (H3N2; swine-like), Influenza A virus (H7N9), and Influenza B virus. AMR markers and drug associations are based on the World Health Organization (WHO) Influenza virus NAI and PAI Reduced Susceptibility Marker Tables (07 March 2023 version). Detection of an AMR marker is reported if the marker passes a minimum detection threshold and if the Influenza virus associated with the marker is also detected. Reported AMR markers have been associated with antimicrobial resistance but may not always indicate phenotypic resistance. Failure to detect AMR variants does not always indicate phenotypic susceptibility. Results should be interpreted in the context of all available information.

AMR

Mutations connected with a '+' form an epistatic group. Epistatic groups are two or more mutations that need to be present concurrently to confer the associated resistance.

Pango lineage

The most likely Pango (phylogenetic assignment of named global outbreak) lineage is assigned to the majority consensus SARS-CoV-2 genome sequence using pangolin 4.3.1 (Áine O'Toole & Emily Scher et al. 2021 Virus Evolution DOI:10.1093/ve/veab064).

Limitations

Non-detected results do not rule out the presence of viruses and AMR markers. Contamination is possible during specimen collection, transport, and processing. Closely related viruses may be misidentified based on sequence homology to viruses present in the database. The identification of cDNA or DNA sequences from a virus does not confirm that the identified virus is causing symptoms, is viable, or is infectious. Recombinant viral strains may not be reported or may be reported as one or more individual viruses. Should one or more individual viruses be reported for a recombinant viral strain, antiviral resistance results may be inaccurate. In viral strains containing insertion-deletion mutations (indels), there is a risk of false positive or false negative results for other resistance mutations within a region of 100 nucleotides around the indel.

Limitations

Information provided by DRAGEN Microbial Enrichment Plus is based on scientific knowledge and has been curated; however, scientific knowledge evolves and reported information may not always be complete and/or correct. Results should be interpreted in the context of all available information. Other sources of data may be required for confirmation.

AMR

antimicrobial resistance

mL

milliliter

NAI

neuraminidase inhibitor

NGS

next-generation sequencing

PAI

polymerase acidic endonuclease inhibitor

pangolin

phylogenetic assignment of named global outbreak lineages

RPKM

targeted Reads mapped Per Kilobase of targeted sequence per Million quality-filtered reads

VSP

Viral Surveillance Panel

RUO

For Research Use Only. Not for use in diagnostic procedures.

URL

See https://www.illumina.com/ for additional information.

Quantification

VSP data analysis using DRAGEN Microbial Enrichment Plus detects 149 viruses and 238 AMR markers based on target enriched next-generation sequencing (NGS) of viral DNA and cDNA sequences. Sequencing data are interpreted by the DRAGEN software platform and viruses that pass detection thresholds are reported. Relative abundance is expressed as proportion of RPKM values.

AMR

This test detects 238 antimicrobial resistance (AMR) markers associated with resistance to Influenza virus neuraminidase inhibitor (NAI) and polymerase acidic protein inhibitor (PAI) in Influenza A virus (H1N1pdm09), Influenza A virus (H1N1), Influenza A virus (H5N1), Influenza A virus (H3N2), Influenza A virus (H3N2; swine-like), Influenza A virus (H7N9), and Influenza B virus. AMR markers and drug associations are based on the World Health Organization (WHO) Influenza virus NAI and PAI Reduced Susceptibility Marker Tables (07 March 2023 version). Detection of an AMR marker is reported if the marker passes a minimum detection threshold and if the Influenza virus associated with the marker is also detected. Reported AMR markers have been associated with antimicrobial resistance but may not always indicate phenotypic resistance. Failure to detect AMR variants does not always indicate phenotypic susceptibility. Results should be interpreted in the context of all available information.

AMR

Mutations connected with a '+' form an epistatic group. Epistatic groups are two or more mutations that need to be present concurrently to confer the associated resistance.

Pango lineage

The most likely Pango (phylogenetic assignment of named global outbreak) lineage is assigned to the majority consensus SARS-CoV-2 genome sequence using pangolin 4.3.1 (Áine O'Toole & Emily Scher et al. 2021 Virus Evolution DOI:10.1093/ve/veab064).

Limitations

Non-detected results do not rule out the presence of viruses and AMR markers. Contamination is possible during specimen collection, transport, and processing. Closely related viruses may be misidentified based on sequence homology to viruses present in the database. The identification of cDNA or DNA sequences from a virus does not confirm that the identified virus is causing symptoms, is viable, or is infectious. Recombinant viral strains may not be reported or may be reported as one or more individual viruses. Should one or more individual viruses be reported for a recombinant viral strain, antiviral resistance results may be inaccurate. In viral strains containing insertion-deletion mutations (indels), there is a risk of false positive or false negative results for other resistance mutations within a region of 100 nucleotides around the indel.

Limitations

Information provided by DRAGEN Microbial Enrichment Plus is based on scientific knowledge and has been curated; however, scientific knowledge evolves and reported information may not always be complete and/or correct. Results should be interpreted in the context of all available information. Other sources of data may be required for confirmation.

Analytical performance of the Respiratory Pathogen ID/AMR Panel
Rapid detection of respiratory pathogens using the MiniSeq™ System
Evaluating reference materials for use with the Illumina Respiratory Pathogen ID/AMR Enrichment Panel Kit: Ensure optimal performance with external controls from commercial vendors
Wastewater AMR surveillance with a broad probe-capture precision metagenomics (PMG) panel
Urinary Pathogen ID/AMR Panel: Highly sensitive, culture-free identification and quantification of common and underrecognized uropathogens
Theoretical Antimicrobial Selection Based on Precision Metagenomics Compared with Standard Urine Culture/Susceptibility: A Reliability and Inter-Rater Agreement Feasibility Analysis
Detection and characterization of respiratory viruses, including SARS-CoV-2, using Illumina RNA Prep with Enrichment
Faster detection of respiratory viruses using the MiniSeq™ Rapid Reagent Kit and Illumina RNA Prep with Enrichment
Surveillance of infectious disease through wastewater sequencing: Detect SARS-CoV-2 variants and other respiratory viruses in the community
Viral Surveillance Panel: Streamlined whole-genome sequencing of high-impact viruses using hybrid–capture enrichment
Viral Surveillance Panel v2: Streamlined whole-genome sequencing for high-risk viral surveillance and research

Frequently Asked Questions (FAQ)

General

Q: Which Illumina Infectious Disease and Microbiology target-capture enrichment panel kits are compatible with the DRAGEN Microbial Enrichment Plus app?

A: RPIP, UPIP, RVOP/RVEK, VSP, VSP V2, and Custom infectious disease and microbiology enrichment panels. To analyze the Pan-Coronavirus (Pan-CoV) panel, a custom coronavirus reference sequence database may be specified. The DME+ app is not intended for use with non-infectious disease enrichment panels (such as human exome).

Q: Can I analyze the Pan-Coronavirus (Pan-CoV) panel here?

A: The only infectious disease and microbiology enrichment panel without a pre-set DME+ database is the Pan-CoV panel. To analyze Pan-CoV enriched data with the DME+ app, select "Custom Panel" under the "Enrichment Panel" drop-down list and specify a custom coronavirus reference sequence database. Alternatively, we recommend using the DRAGEN Targeted Microbial app.

Q: What does it cost to analyze samples using the DRAGEN Microbial Enrichment Plus app?

A: A Basic BaseSpace Sequence Hub (BSSH) user account is required to access the DME+ app. However, there is no subscription cost for a Basic BSSH account and no compute cost to run the DME+ app. A Basic BSSH account provides 1 TB of free storage. Additional storage may require iCredits.

Q: Where do I upload my custom reference FASTA and/or BED file?

A: Upload these files to a BSSH project before launching the DME+ app. It will then be possible to select these files in the "Select Dataset File(s)" browser in the app. Please see general guidelines for how to upload data to BaseSpace and reach out to techsupport@illumina.com with any unresolved upload issues.

Panel Content & Design

Q: Is my viral subtype of interest captured by the VSP V2 panel?

A: See the "Virus Types Captured" column of the "Microorganisms" table in the VSP V2 Panel Summary.

Q: Was VSP V2 designed using contemporary viral genomes or against traditional reference strains only?

A: The VSP V2 viral genome sourcing approach aimed to be as inclusive and comprehensive as possible for the 200 targeted human viruses. All viral genomes passing quality filters available as of June 2023 were included in the design, including recombinant and vaccine strains.

Q: How much of the genome is targeted by the RPIP, UPIP, RVOP/RVEK, VSP, and VSP V2 panels?

A: The full viral genome is targeted for all RVOP/RVEK, VSP, and VSP V2 viruses. For RPIP viruses, see the "Percent Genome Targeted" column of the "Microorganisms" table in the RPIP Panel Summary. No more than ~1% of bacterial, fungal, and parasitic genomes are targeted by RPIP or UPIP.

Analysis Options & Settings

Q: I am using the "Custom panel specification" option and my custom analysis aborted or shows an error, why?

A: While there are many possible reasons, one of the most common causes is that the custom database was not formatted correctly. Below are requirements for the custom reference FASTA and (optional) BED file:

  • Do not exceed the file size limitation: 10 million bases

  • Do not include duplicate entries

  • Do not use spaces in the file name; instead use an underscore "_"

  • File extension must be .fasta or .fa for custom reference FASTA file and .bed for custom reference BED file

  • If providing a custom reference BED file, the names in the first column of the BED file (chrom) must match the names that appear in the FASTA file (text after > and before the first whitespace character).

See Custom reference FASTA and BED files for further details.
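
The requirements listed above can be checked before upload with a short script. This is a minimal sketch, not a tool provided by the app: the function names are hypothetical, "duplicate entries" is interpreted here as duplicate sequence names (an assumption), and the inputs are passed as text so the example stays self-contained.

```python
MAX_BASES = 10_000_000  # file size limitation stated above: 10 million bases


def fasta_names_and_bases(fasta_text: str):
    """Return (sequence names, total base count) for FASTA text."""
    names, total = [], 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Name = text after '>' and before the first whitespace character.
            parts = line[1:].split()
            names.append(parts[0] if parts else "")
        else:
            total += len(line)
    return names, total


def check_custom_reference(fasta_name, fasta_text, bed_name=None, bed_text=None):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if " " in fasta_name:
        problems.append("FASTA file name contains spaces")
    if not fasta_name.endswith((".fasta", ".fa")):
        problems.append("FASTA file extension must be .fasta or .fa")
    names, total = fasta_names_and_bases(fasta_text)
    if total > MAX_BASES:
        problems.append("FASTA exceeds 10 million bases")
    if len(names) != len(set(names)):
        # ASSUMPTION: duplicates are detected by sequence name only.
        problems.append("FASTA contains duplicate entries")
    if bed_name is not None:
        if " " in bed_name:
            problems.append("BED file name contains spaces")
        if not bed_name.endswith(".bed"):
            problems.append("BED file extension must be .bed")
        known = set(names)
        for line in (bed_text or "").splitlines():
            if not line.strip():
                continue
            chrom = line.split("\t")[0]
            if chrom not in known:
                problems.append(f"BED chrom '{chrom}' not found in FASTA")
    return problems


print(check_custom_reference("my_ref.fasta", ">seq1 example\nACGT\n",
                             "my_ref.bed", "seq1\t0\t4\tVirusA\n"))
```

An empty list means none of the listed formatting problems were detected; it does not guarantee that the analysis will succeed.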

Q: I am using the "User-defined specification" option. I am not seeing the microorganisms I expect to be there AND/OR I am seeing microorganisms that I do not want to see.

A: Ensure that the correct microorganism reporting file was uploaded and used. We recommend saving the updated microorganism reporting file with a new name. Rows with microorganism names that are not of interest can be deleted, but do not add any new columns or delete any columns from the provided template. Similarly, do not change or remove any text from the header row. Also, please note that the "kmer_read_count" metric is only valid with the UPIP panel. See Microorganism Reporting File format for further details.

Q: What read QC (Quality Control) is performed by the DRAGEN Microbial Enrichment Plus app?

A: If enabled, low-quality bases are trimmed from the ends of each read. After trimming, the read is discarded if fewer than 50% of its bases have a quality score greater than or equal to Q20, the read is shorter than 32 bp, or the read has 5 or more ambiguous bases. It is assumed that appropriate adapter trimming has already been performed.
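
The three discard conditions can be sketched as a single filter applied to an already trimmed read. The end-trimming step is omitted because its quality cutoff is not stated here. The function name is hypothetical, and treating every non-ACGT character as an ambiguous base is an assumption.

```python
def passes_read_qc(sequence: str, qualities: list) -> bool:
    """Apply the post-trimming read filters described above."""
    if len(sequence) < 32:
        return False  # read shorter than 32 bp
    ambiguous = sum(1 for base in sequence.upper() if base not in "ACGT")
    if ambiguous >= 5:
        return False  # 5 or more ambiguous bases (ASSUMPTION: any non-ACGT base)
    high_quality = sum(1 for q in qualities if q >= 20)
    # Discard if fewer than 50% of bases reach Q20.
    return high_quality >= 0.5 * len(qualities)


read = "ACGT" * 10  # 40 bp
print(passes_read_qc(read, [30] * 40))
print(passes_read_qc(read, [30] * 19 + [10] * 21))
```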

Q: What does "Read classification sensitivity" mean in the settings for RVOP/RVEK, VSP, and VSP V2?

A: This setting is used as a pre-alignment filtering step for all viral whole-genome sequencing (WGS) panels. The default setting of 5 means that if fewer than 5 reads classify to the set of reference sequences belonging to a given virus, that virus will not be reported. On the other hand, if 5 or more reads classify to the set of reference sequences belonging to a given virus, read alignment will proceed and alignment-based thresholds will be used to determine whether that virus is reported. The read classification sensitivity can be set as low as 1 or as high as 1000. Lowering the read classification sensitivity threshold below 5 may significantly increase computational run time and is not recommended for most use cases.
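
As a sketch of this pre-alignment filter, the function below keeps only the viruses whose classified read count reaches the threshold. The function name and the input dictionary are hypothetical; the actual classification and alignment steps are performed by the app and are not modeled here.

```python
def viruses_to_align(classified_read_counts: dict, sensitivity: int = 5) -> list:
    """Return viruses with enough classified reads to proceed to alignment."""
    if not 1 <= sensitivity <= 1000:
        raise ValueError("read classification sensitivity must be between 1 and 1000")
    return [virus for virus, count in classified_read_counts.items() if count >= sensitivity]


counts = {"virus_a": 4, "virus_b": 5, "virus_c": 120}
print(viruses_to_align(counts))      # default threshold of 5
print(viruses_to_align(counts, 1))   # most permissive setting
```

Passing this filter does not mean a virus is reported; alignment-based thresholds still apply afterwards.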

Q: When is a Pangolin analysis run?

A: Pangolin is currently enabled for all enrichment panels except UPIP. For Custom Panel analyses, Pangolin is enabled and will run on custom reference sequences with at least 3% coverage that meet these naming conventions:

  • If only a FASTA file is provided, Pangolin will run on sequences that have a header containing either SARS-CoV-2 or NC_045512

  • If both a FASTA and BED file are provided, Pangolin will run on sequences where the first column (chrom) contains NC_045512 or the fourth column (genomeName) contains SARS-CoV-2
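
The naming conventions above can be expressed as a small predicate. This is an illustrative sketch only: the function and parameter names are hypothetical, and case-sensitive substring matching is an assumption.

```python
def pangolin_eligible(coverage_pct, fasta_header=None, bed_chrom=None, bed_genome_name=None):
    """Decide whether a custom reference sequence qualifies for Pangolin."""
    if coverage_pct < 3:
        return False  # requires at least 3% coverage
    if bed_chrom is not None or bed_genome_name is not None:
        # FASTA + BED provided: check BED column 1 (chrom) or column 4 (genomeName).
        return "NC_045512" in (bed_chrom or "") or "SARS-CoV-2" in (bed_genome_name or "")
    # FASTA only: check the sequence header.
    header = fasta_header or ""
    return "SARS-CoV-2" in header or "NC_045512" in header


print(pangolin_eligible(45.0, fasta_header="NC_045512.2 Wuhan-Hu-1"))
print(pangolin_eligible(2.0, fasta_header="NC_045512.2 Wuhan-Hu-1"))
print(pangolin_eligible(45.0, bed_chrom="my_contig", bed_genome_name="SARS-CoV-2"))
```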

Q: When is a Nextclade analysis run?

A: When enabled, a Nextclade analysis using the specified dataset(s) is run for the following microorganisms, as applicable:

| Microorganism | Nextclade Dataset | Type of Nextclade Dataset |
| --- | --- | --- |
| Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) | SARS-CoV-2 (relative to Wuhan-Hu-1/2019) | Official |
| Influenza A virus (H1N1) | Influenza A H1N1pdm HA (relative to A/Wisconsin/588/2019) & Influenza A H1N1pdm NA (relative to A/Wisconsin/588/2019) | Official |
| Influenza A virus (H3N2) | Influenza A H3N2 HA (relative to A/Darwin/6/2021) & Influenza A H3N2 NA (relative to A/Darwin/6/2021) | Official |
| Influenza B virus (B/Victoria/2/87-like) | Influenza B Victoria HA (relative to B/Brisbane/60/2008) | Official |
| Influenza B virus (B/Yamagata/16/88-like) | Influenza B Yamagata HA (relative to B/Wisconsin/01/2010) | Official |
| Human respiratory syncytial virus A (HRSV-A) | RSV-A | Official |
| Human respiratory syncytial virus B (HRSV-B) | RSV-B | Official |
| Monkeypox virus (MPV) | Mpox virus (All Clades) | Official |
| Measles virus (MV) | Measles virus N450 (WHO-2012) | Official |
| Dengue virus (DENV), Dengue virus type 1 (DENV-1), Dengue virus type 2 (DENV-2), Dengue virus type 3 (DENV-3), Dengue virus type 4 (DENV-4) | Dengue virus (All Serotypes) | Official |
| Human immunodeficiency virus 1 (HIV-1) | HIV-1 (relative to HXB2) | Community |
| Influenza A virus (H5N1) | Influenza A H5Nx HA (relative to A/Goose/Guangdong/1/96) | Community |
| Influenza A virus (H5N6) | Influenza A H5Nx HA (relative to A/Goose/Guangdong/1/96) | Community |
| Influenza A virus (H5N8) | Influenza A H5Nx HA (relative to A/Goose/Guangdong/1/96) | Community |

Q: What Internal Control (IC) options are supported and what additional information does using an IC provide?

A: The RPIP, UPIP, and VSP V2 enrichment panels contain probes targeting commercially available Internal Controls. See the table below for Internal Control options compatible with RPIP, UPIP, and VSP V2. It is recommended to spike each sample prior to extraction with Enterobacteria phage T7 at 1.21 x 10^7 copies/mL of sample.

Internal Control
RPIP
UPIP
VSP V2
Process control
Enrichment factor calculation
Microorganism absolute quantification*
Notes

Allobacillus halotolerans

X

X

X

X

X

Armored RNA Quant Internal Process Control

X

X

X

X

X

Enterobacteria phage T7

X

X

X

X

X

X

Recommended IC concentration = 1.21 x 10^7 copies/mL of sample

Escherichia virus MS2

X

X

X

X

X

X

Escherichia virus Qbeta

X

X

X

X

X

Escherichia virus T4

X

X

X

X

X

X

Imtechella halotolerans

X

X

X

X

X

Phocid alphaherpesvirus

X

X

X

X

X

Phocine morbillivirus

X

X

X

X

X

Truepera radiovictrix

X

X

X

X

X

*Quantitative Internal Control concentration must be provided

Q: What are the DRAGEN Microbial Enrichment Plus app settings related to consensus sequence generation and variant calling?

A: See the table below. Consensus sequence bases without aligned read support are indicated by "N" bases.

| Setting | Value |
| --- | --- |
| Read de-duplication | Not performed |
| Depth threshold for consensus sequence generation | 1x |
| Depth threshold for variant calling | 5x |
| Minimum minor allele frequency | 20% |
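
To make the interaction of these thresholds concrete, the sketch below derives a majority consensus and candidate variants from per-position base counts. It is a simplified illustration and not the DRAGEN implementation: the function name, the pileup representation, and the handling of ties are assumptions, and indels are ignored.

```python
CONSENSUS_MIN_DEPTH = 1     # depth threshold for consensus sequence generation
VARIANT_MIN_DEPTH = 5       # depth threshold for variant calling
MIN_ALLELE_FREQUENCY = 0.20  # minimum minor allele frequency


def consensus_and_variants(reference: str, pileup: list):
    """pileup[i] maps base -> read count at reference position i (0-based)."""
    consensus, variants = [], []
    for position, (ref_base, counts) in enumerate(zip(reference, pileup), start=1):
        depth = sum(counts.values())
        if depth < CONSENSUS_MIN_DEPTH:
            consensus.append("N")  # no aligned read support
            continue
        # Majority base (ASSUMPTION: ties resolved by dictionary order).
        consensus.append(max(counts, key=counts.get))
        if depth >= VARIANT_MIN_DEPTH:
            for base, count in sorted(counts.items()):
                frequency = count / depth
                if base != ref_base and frequency >= MIN_ALLELE_FREQUENCY:
                    variants.append((position, ref_base, base, frequency))
    return "".join(consensus), variants


pileup = [{"A": 10}, {}, {"G": 6, "T": 4}, {"T": 3, "C": 1}]
print(consensus_and_variants("ACGT", pileup))
```

In this example the second position has no reads and becomes "N", the third position yields a T variant at 40% frequency, and the fourth position contributes to the consensus but is too shallow (4x) for variant calling.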

Reporting

Q: I don't see the microbe I'm interested in listed in the reported microorganism summary. Does that mean my microbe of interest is not present?

A: Not necessarily. The microbe of interest may be present in the sample, but the DME+ app may not have reported it because the detection metrics fell below the default reporting thresholds. If it is suspected that this may be the case, select the "Report microorganisms and/or AMR markers that are below threshold" option. A user-defined microorganism reporting file can also be specified on a microorganism-by-microorganism basis using multiple parameters should more sensitive reporting be required for a given use case. See Microorganism Reporting File format for further details.

Q: What is the default reporting threshold for a microorganism to be "predicted present" and make it into reports?

A: Multiple parameters are used to determine whether the sequencing data for a given microorganism is sufficient for a positive call. These may include the horizontal coverage, median read depth, normalized read count, and average nucleotide identity of the microorganism and/or other genetically related microorganisms. The default reporting thresholds differ between microorganisms, as microorganisms with close genetic neighbors generally require more stringent reporting thresholds than genetically distinct microorganisms. As with most tests and prediction algorithms, the default reporting thresholds are intended to balance the trade-off between analytical sensitivity and specificity. Should a given use case require more sensitive or specific reporting, a user-defined microorganism reporting file can be specified on a microorganism-by-microorganism basis using multiple parameters. See Microorganism Reporting File format for further details. Additionally, the "Report microorganisms and/or AMR markers that are below threshold" option can be enabled.

Q: Are low coverage, median depth 0 microorganisms actually in the sample or are they artifacts?

A: Mathematically, any result with a horizontal coverage of <50% will have a median depth of 0 (50% or more of the nucleotide positions have a depth of 0). Low coverage results could represent true low positives (the most likely reason) or non-specific results, contamination, etc. If maximum confidence is required for a given use case, stricter microorganism reporting thresholds can be specified on a microorganism-by-microorganism basis using multiple parameters. See Microorganism Reporting File format for further details.
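
The relationship between horizontal coverage and median depth can be verified with a toy depth profile. The values below are made up for illustration, and the function name is hypothetical.

```python
from statistics import median


def coverage_and_median_depth(depths: list):
    """Return (horizontal coverage in percent, median depth) for per-position depths."""
    covered = sum(1 for d in depths if d > 0)
    return 100.0 * covered / len(depths), median(depths)


# 40% of positions covered at 12x, 60% uncovered: the median is still 0.
print(coverage_and_median_depth([0] * 60 + [12] * 40))
# 60% of positions covered: the median becomes non-zero.
print(coverage_and_median_depth([0] * 40 + [12] * 60))
```

Because more than half of the positions have depth 0 whenever coverage is below 50%, the middle value of the sorted depths is necessarily 0, however deep the covered regions are.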

Q: What is tiered reporting logic, which viruses are reported as part of a tiered reporting group, and why should I care?

A: See the "Has Tiered Reporting" and "Reporting Tier" columns of the "Microorganisms" table in the Panel Summary for RPIP, RVOP/RVEK, VSP, or VSP V2 to see which viruses are reported as part of a tiered reporting group. Membership in a tiered reporting group means that a hierarchical relationship is pre-built into the database and the most granular tier level passing reporting thresholds is reported. For example, if Influenza B virus (B/Victoria/2/87-like) or Influenza B virus (B/Yamagata/16/88-like) is reported in a sample, then the less granular Influenza B virus reporting name will NOT be reported. Tiered reporting group membership is especially relevant when specifying a user-defined microorganism reporting file, as including the entire tiered reporting group is necessary to preserve tiered reporting logic.
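
One way to picture tiered reporting is as a parent/child hierarchy in which a less granular name is suppressed whenever a more granular descendant passes reporting thresholds. The sketch below is a simplified model under that assumption; the function names and the hierarchy dictionary are hypothetical and do not reflect the database format.

```python
def descendants(node: str, children_of: dict) -> set:
    """Collect all more granular tiers below a reporting name."""
    found, stack = set(), list(children_of.get(node, []))
    while stack:
        child = stack.pop()
        if child not in found:
            found.add(child)
            stack.extend(children_of.get(child, []))
    return found


def apply_tiered_reporting(passing: set, children_of: dict) -> set:
    """Keep only the most granular reporting names that pass thresholds."""
    return {name for name in passing
            if not (descendants(name, children_of) & passing)}


hierarchy = {
    "Influenza B virus": [
        "Influenza B virus (B/Victoria/2/87-like)",
        "Influenza B virus (B/Yamagata/16/88-like)",
    ]
}
passing = {"Influenza B virus", "Influenza B virus (B/Victoria/2/87-like)"}
print(apply_tiered_reporting(passing, hierarchy))
```

If a user-defined reporting file dropped the lineage-level rows, only the less granular name could ever pass, which is why the whole group needs to be kept together.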

Q: How can I evaluate DRAGEN Microbial Enrichment Plus microorganism absolute quantification results?

A: To evaluate microorganism absolute quantification results, it is recommended to perform experiments using the relevant sample type and full sequencing workflow (including extraction) and to compare results obtained from the DME+ app with those from droplet digital PCR (ddPCR) and/or quantitative PCR (qPCR) assays. A per-microorganism absolute quantification correction factor can be applied to DME+ results as needed.

Q. I noticed that some of the listed antimicrobials are not usually used in clinical environments. Is this expected?

A: Yes. Not all antimicrobials and drug classes that are listed may be relevant. Detected AMR markers may also confer resistance to antimicrobials and drug classes that are not listed. Linkage between bacterial AMR marker, antimicrobial, and drug class is based on the Comprehensive Antibiotic Resistance Database (CARD, version 3.2.8) from McMaster University, ResFinder (version 2.2.1), NCBI Reference Gene Catalog (version 2023-09-26.1), EUCAST expert rules on indicator agents (2019-2023), and CLSI Performance Standards for Antimicrobial Susceptibility Testing (M100 34th Edition). Linkage between viral AMR marker, antimicrobial, and drug class is based on the publications provided in the JSON report; see the PubMed IDs (pmids) field.

Q. Some of the reported bacterial AMR markers in my sample have an “ESBL” flag, a “Carbapenemase” flag, or both. How are these flags determined?

A: Extended-spectrum beta-lactamase (ESBL) and Carbapenemase flags are assigned based on the antimicrobials and drug classes associated with each bacterial AMR marker. An ESBL flag is reported if a 3rd, 4th, or 5th generation cephalosporin OR a beta-lactam + beta-lactamase inhibitor combination is contained in the list of associated antimicrobials or drug classes. A Carbapenemase flag is reported if a carbapenem is contained in the list of associated antimicrobials or drug classes. The logic for each of these flags is decoupled, such that a marker can be reported with both flags if the associated antimicrobial or drug class metadata indicates both ESBL and carbapenemase activity.
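
The two decoupled checks can be sketched as independent set lookups. This is a minimal illustration of the rule described above; the label strings in `ESBL_CLASSES` and `CARBAPENEM_CLASSES` and the `amr_flags` function are hypothetical, and the vocabulary actually used in the app's metadata may differ.

```python
# Hypothetical drug class labels; the app's actual vocabulary may differ.
ESBL_CLASSES = {
    "3rd generation cephalosporin",
    "4th generation cephalosporin",
    "5th generation cephalosporin",
    "beta-lactam + beta-lactamase inhibitor combination",
}
CARBAPENEM_CLASSES = {"carbapenem"}

def amr_flags(associated):
    """Return the flags for one marker given its associated drugs/classes."""
    labels = {a.lower() for a in associated}
    flags = []
    # Each flag is evaluated on its own, so a marker can receive both.
    if labels & ESBL_CLASSES:
        flags.append("ESBL")
    if labels & CARBAPENEM_CLASSES:
        flags.append("Carbapenemase")
    return flags

print(amr_flags(["Carbapenem", "3rd generation cephalosporin"]))
print(amr_flags(["penicillin"]))
```

Keeping the checks independent, rather than using an if/else chain, is what allows a single marker to carry both flags.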

Results & Output Files

Q: Most of my reads are untargeted reads. Is enrichment working?

A: For complex samples, or samples in which the majority of nucleic acid is host/untargeted, targeted reads will typically still represent only a minority of the overall sequencing reads, even though 100-1000X more targeted reads and sensitivity are expected compared with a shotgun/pre-enriched library. Notably, RPIP, UPIP, and VSP V2 support various Internal Control options that can be spiked into samples prior to extraction to enable automated calculation of an enrichment factor sample QC metric.

Q: Is any typing information included for my virus of interest?

A: See the "Has Tiered Reporting" and "Lineage/Clade Prediction" columns of the "Microorganisms" table in the Panel Summary for RPIP, RVOP/RVEK, VSP, and VSP V2. Consensus sequence and best match reference accession are also provided for RPIP, RVOP/RVEK, VSP, and VSP V2 viruses. It may be possible to infer subtype information from the consensus sequence (e.g. by BLAST) or from the best match reference accession (if annotated in NCBI). The consensus sequence can also be used as input to downstream viral typing tools.

Q. The % Targeted Microbial Reads is not exactly equal to the sum of microorganism Aligned Read Count values, why?

A: The % Targeted Microbial Reads is calculated using a kmer-based classification approach that is intended to give a quick, high-level overview of sample composition. The Aligned Read Count values for microorganisms are calculated in a separate pipeline step using microorganism-specific reference sequence alignment as opposed to broad, categorical, kmer-based classification. Reads that were unclassified or that were classified as low-complexity or ambiguous may actually align to reference sequences. It is also possible for a read to align to a reference sequence of more than one microorganism, for example in a conserved region.

Q: How can I verify or compare results of the DRAGEN Microbial Enrichment Plus app to previously used apps (such as DRAGEN Targeted Microbial)?

A: FASTQ files previously run through other apps can be re-analyzed using the DME+ app. Results from other apps may not be identical to results from the DME+ app, most notably because of the expanded databases used in DME+.

Q: The Reference Coverage section of the HTML report only shows coverage plots for viral genomes. Why doesn't it show the plots for bacterial genomes and/or for viral targeted regions?

A: Viral genomes are orders of magnitude smaller and thus computationally much "cheaper" to align to than bacterial, fungal, and parasitic genomes. In the case of RVOP/RVEK, VSP, and VSP V2, the full viral genome is targeted for all viruses. For RPIP viruses, see the "Percent Genome Targeted" column of the "Microorganisms" table in the RPIP Panel Summary. While not visualized in the HTML report at this time, the DME+ Report JSON does contain coverage depth vector information for all microorganism targeted regions (viruses, bacteria, fungi, and parasites). See: .targetReport.microorganisms[].condensedDepthVector[], which is the read depth across the targeted microorganism reference sequences, condensed (if needed) into 256 bins.
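
As a starting point for working with the depth vectors directly, the following sketch walks the JSON path given above. Only the path `.targetReport.microorganisms[].condensedDepthVector[]` comes from this guide; the helper names and the toy JSON are assumptions, and a real DME+ report contains many more fields.

```python
import json

def depth_vectors(report):
    """Return the condensedDepthVector of each microorganism, in report order."""
    organisms = report.get("targetReport", {}).get("microorganisms", [])
    return [m.get("condensedDepthVector", []) for m in organisms]

def fraction_of_bins_covered(vector):
    """Fraction of bins with non-zero depth (0.0 for an empty vector)."""
    if not vector:
        return 0.0
    return sum(1 for d in vector if d > 0) / len(vector)

# Toy stand-in for a DME+ report JSON, with a 4-bin vector for brevity.
example = json.loads(
    '{"targetReport": {"microorganisms": [{"condensedDepthVector": [0, 3, 9, 0]}]}}'
)

vectors = depth_vectors(example)
for vector in vectors:
    print(len(vector), fraction_of_bins_covered(vector))
```

For a real report, replace the toy string with `json.load` on the report file. Note that because the vector is condensed into bins, the fraction of non-zero bins is only a rough proxy and should not be expected to match the reported coverage metric exactly.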

Release notes

DRAGEN Microbial Enrichment Plus app version 1.1.0

Component versions

  • Test type, version:

    • RPIP 6.5.1

    • UPIP 8.6.0

    • RVOP 2.7.0

    • VSP 2.7.0

    • VSPv2 2.7.0

    • Custom 1.0.0

  • Analysis Pipeline version: 6.3.12

  • DRAGEN version: 4.3.11

Third-party versions

  • Pangolin 4.3.1 (Pangolin database PUSHER version 1.27)

  • Nextclade 3.5.0

  • SnpEff 5.1

  • ResFinder (version 2.2.1)

  • NCBI Reference Gene Catalog (version 2023-09-26.1)

  • EUCAST expert rules on indicator agents (2019-2023)

  • CLSI Performance Standards for Antimicrobial Susceptibility Testing (M100 34th Edition)

  • Comprehensive Antibiotic Resistance Database (CARD, version 3.2.8)

  • Comprehensive Antibiotic Resistance Database Prevalence Data (CARD Prevalence, version 4.0.1)

  • World Health Organization (WHO) Influenza virus neuraminidase inhibitor (NAI) and polymerase acidic protein inhibitor (PAI) Reduced Susceptibility Marker Tables (07 March 2023 version)

Key updates

  • Various bug fixes (see below)

  • Tiered reporting added for Norovirus (GI, GII, GIV, GVIII, GIX) and Dengue virus (type 1, type 2, type 3, type 4)

  • Tiered reporting at below-subtype resolution suppressed for Influenza A virus subtypes H1N1 and H3N2

  • Nextclade datasets added for Measles virus (MV) and Dengue virus (DENV) clade assignment

  • Reference genomes added for Monkeypox virus (MPV) Clade 1b

  • Additional database curation

Known issues

  • When reading Biosamples from a Project, Fastq files for Biosamples sharing the same sample name prefix before the first underscore are merged. For example, Fastq files for Biosamples PREFIX_001 and PREFIX_002 will be merged and reported as a single PREFIX sample. To avoid this error, ensure that sample names are unique before the first underscore, replace underscores with a hyphen, or provide Biosample input from a list

  • Coverage results for SARS-CoV-2 are slightly (<1%) over-estimated, which may result in coverage >100% due to an error accounting for masked polyA-tail bases

  • Viral genome consensus sequence bases without aligned read support are indicated by "X" bases rather than "N" bases for RPIP viruses except SARS-CoV-2 and Influenza viruses

  • Variant annotation information for Influenza A and B viruses, including antiviral resistance prediction results, may not be populated when below threshold reporting is enabled and/or a user-defined microorganism reporting file is specified that does not include all members of the Influenza A and B virus tiered reporting groups. If viral variant annotation is of interest for Influenza A and B viruses, default microorganism reporting options are recommended
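
The sample-name merging issue listed above can be caught before launching an analysis by checking for shared prefixes. The following is a minimal sketch that assumes you have the Biosample names as a plain list of strings; the function names are hypothetical and nothing here calls a BaseSpace API.

```python
from collections import defaultdict

def find_prefix_collisions(sample_names):
    """Group sample names sharing the text before the first underscore."""
    groups = defaultdict(list)
    for name in sample_names:
        groups[name.split("_", 1)[0]].append(name)
    return {prefix: names for prefix, names in groups.items() if len(names) > 1}

def hyphenate(sample_names):
    """One workaround from the list above: replace underscores with hyphens."""
    return [name.replace("_", "-") for name in sample_names]

names = ["PREFIX_001", "PREFIX_002", "CTRL_001"]
collisions = find_prefix_collisions(names)
print(collisions)

# After hyphenation no two names share a prefix before an underscore.
print(find_prefix_collisions(hyphenate(names)))
```

Any prefix returned by `find_prefix_collisions` identifies Biosamples whose Fastq files would be merged when read from a Project.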

Known limitations

  • When providing Biosample input from a list, 99 associated Fastq files is the maximum allowed per analysis. There is no Fastq file limitation when reading Biosamples from a Project

  • In strains containing insertion-deletion mutations (indels), there is a risk of false positive or false negative results for other resistance mutations within a region of 100 nucleotides around the indel

  • In strains containing long insertion-deletion mutations (indels), there is a risk of false negative results for two large vAMR-associated deletion mutations (RPIP, VSPv2) and one large bAMR-associated insertion mutation (RPIP). Long indels may be incorrectly reported as other variant types, such as frameshift mutations

  • Small differences in SARS-CoV-2 and Influenza virus results may be observed between repeat analyses

Bug fixes

  • Nextclade parsing errors for some samples

  • Custom reference sequence analysis not functional in non-US regions

  • User-defined microorganism reporting feature not reporting microorganisms that belong to a tiered reporting group when “prediction_logic” column set to “default”

  • RPKM and absolute quantity metrics inaccurate when read QC disabled

  • SHV beta-lactamase AMR markers incorrectly reported as carbapenemases based on a known curation error in CARD version 3.2.8

  • Reads duplicated for samples with a single FASTQ file

  • Empty FASTQ files abort analysis

  • Pangolin not run on all samples with SARS-CoV-2 detected

  • Viral genome coverage plots not rendered for segmented viruses when not all segments are detected

  • Description information missing for some viral genome accessions

DRAGEN Microbial Enrichment Plus app version 1.0.0

Initial release.

Component versions

  • Test type, version:

    • RPIP 6.3.0

    • UPIP 8.4.0

    • RVOP 2.3.0

    • VSP 2.3.0

    • VSPv2 2.3.0

    • Custom 1.0.0

  • Analysis Pipeline version: 6.3.12

  • DRAGEN version: 4.3.6

Third-party versions

  • Pangolin 4.3.1 (Pangolin data 1.27)

  • Nextclade 3.5.0

  • SnpEff 5.1

  • ResFinder (version 2.2.1)

  • NCBI Reference Gene Catalog (version 2023-09-26.1)

  • EUCAST expert rules on indicator agents (2019-2023)

  • CLSI Performance Standards for Antimicrobial Susceptibility Testing (M100 34th Edition)

  • Comprehensive Antibiotic Resistance Database (CARD, version 3.2.8)

  • Comprehensive Antibiotic Resistance Database Prevalence Data (CARD Prevalence, version 4.0.1)

  • World Health Organization (WHO) Influenza virus neuraminidase inhibitor (NAI) and polymerase acidic protein inhibitor (PAI) Reduced Susceptibility Marker Tables (07 March 2023 version)

Key updates

  • Updated and expanded microorganism and bacterial AMR marker databases

  • Updated and expanded Influenza virus typing capability and antiviral resistance (AVR) reporting

  • User-defined microorganism reporting list and reporting thresholds

  • Below threshold reporting for microorganisms and/or AMR markers

  • Custom reference sequence analysis

Known issues

  • When reading Biosamples from a Project, Fastq files for Biosamples sharing the same sample name prefix before the first underscore are merged. For example, Fastq files for Biosamples PREFIX_001 and PREFIX_002 will be merged and reported as a single PREFIX sample. To avoid this error, ensure that sample names are unique before the first underscore, replace underscores with a hyphen, or provide Biosample input from a list

  • Reads are duplicated for samples with a single FASTQ file

  • Empty FASTQ files will abort analysis

  • Nextclade may encounter a parsing error for some samples. If an analysis fails, try re-running the analysis with Nextclade disabled

  • Pangolin may not be run on all samples with SARS-CoV-2 detected

  • Custom reference sequence analysis is not functional in non-US regions

  • The user-defined microorganism reporting feature does not report microorganisms that belong to a tiered reporting group when the “prediction_logic” column is set to “default”. See the User Guide for further information about microorganism tiered reporting

  • RPKM and absolute quantity metrics are inaccurate when read QC is disabled

  • SHV beta-lactamase AMR markers are incorrectly reported as carbapenemases based on a known curation error in CARD version 3.2.8

  • Coverage results for SARS-CoV-2 are slightly (<1%) over-estimated, which may result in coverage >100% due to an error accounting for masked polyA-tail bases

  • Viral genome consensus sequence bases without aligned read support are indicated by "X" bases rather than "N" bases for RPIP viruses except SARS-CoV-2 and Influenza viruses

  • Variant annotation information for Influenza A and B viruses, including antiviral resistance prediction results, may not be populated when below threshold reporting is enabled and/or a user-defined microorganism reporting file is specified that does not include all members of the Influenza A and B virus tiered reporting groups. If viral variant annotation is of interest for Influenza A and B viruses, default microorganism reporting options are recommended

Known limitations

  • When providing Biosample input from a list, 99 associated Fastq files is the maximum allowed per analysis. There is no Fastq file limitation when reading Biosamples from a Project

  • Small differences in results may be observed between repeat analyses

  • In strains containing insertion-deletion mutations (indels), there is a risk of false positive or false negative results for other resistance mutations within a region of 100 nucleotides around the indel

  • In strains containing long insertion-deletion mutations (indels), there is a risk of false negative results for two large vAMR-associated deletion mutations (RPIP, VSPv2) and one large bAMR-associated insertion mutation (RPIP). Long indels may be incorrectly reported as other variant types, such as frameshift mutations

  • The RPIP, VSPv2, VSPv1, and RVOP Data Analysis solutions can report Influenza A virus subtypes H1N1 and H3N2 to a below-subtype resolution. Multiple results for H1N1 and/or H3N2 may be reported concurrently, particularly in samples that contain a mixture of Influenza A virus subtypes

  • Viral genome coverage plots are not rendered for segmented viruses when not all segments are detected

  • Description information is missing for some viral genome accessions

custom_reference_snippet.fasta: Example Fasta file formatting
DMEplus_aggregate_report_descriptions.xlsx: RPIP, UPIP, RVOP/RVEK, VSP, VSP V2: Description of aggregate Excel report file fields
DMEplus_custom_aggregate_report_descriptions.xlsx: Custom Panel: Description of aggregate Excel report file fields
RPIP_6-5-1_Panel_Summary.xlsx
UPIP_8-6-0_Panel_Summary.xlsx
RVOP_2-7-0_Panel_Summary.xlsx
VSP_2-7-0_Panel_Summary.xlsx
VSPV2_2-7-0_Panel_Summary.xlsx
RPIP_6-5-1_Microorganism_Reporting_Template.xlsx
UPIP_8-6-0_Microorganism_Reporting_Template.xlsx
RVOP_2-7-0_Microorganism_Reporting_Template.xlsx
VSP_2-7-0_Microorganism_Reporting_Template.xlsx
VSPV2_2-7-0_Microorganism_Reporting_Template.xlsx
Example of a Virus Metric tab
Control and data flow diagram for the DRAGEN Microbial Amplicon app. Not all steps shown may be run depending on user inputs and pipeline outcome
Example of a Summary Report tab
Example entry in Metrics by Sample table.
Control and data flow diagram for the DRAGEN Targeted Microbial analysis pipeline
The Metrics by virus table for a single Influenza A sample