Provide a BED file with at least 4 columns: chrom, chromStart, chromEnd, primerName. Additional columns can be included: pool, strand, sequence, but their order must be maintained.
For example, chrom, chromStart, chromEnd, primerName, pool for 5-column BED format:
And chrom, chromStart, chromEnd, primerName, pool, strand, sequence for 7-column BED format:
Option 1. One line per amplicon with 3 columns: ampliconName, forwardSequence, reverseSequence.
Option 2. One line per primer with 3 columns: primerName, sequence, pool.
General
All text is case sensitive.
Any line starting with '#' is ignored. This can be used to add a header line with column names.
Every line must have the same number of columns and format (except those starting with '#').
Any number of spaces can separate the columns. A value within a single column should not have any space.
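The general rules above can be sketched as a small reader. This is a minimal illustration, not the application's own parser: the function name `read_primer_table`, the example primer names, and the example sequences are hypothetical, and skipping blank lines is an assumption that the rules above do not state.

```python
def read_primer_table(text):
    """Split a primer definition file into rows of column values."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        # Lines starting with '#' are ignored; skipping blank lines is an assumption.
        if not line.strip() or line.startswith("#"):
            continue
        # Any run of whitespace separates columns, so values cannot contain spaces.
        fields = line.split()
        if rows and len(fields) != len(rows[0]):
            raise ValueError(
                f"line {line_number}: expected {len(rows[0])} columns, "
                f"found {len(fields)}"
            )
        rows.append(fields)
    return rows


# Hypothetical "one line per primer" file with a '#' header line.
example = """#primerName sequence pool
amp1_LEFT   ACGTACGTAC  1
amp1_RIGHT  TTGACCGTAA  1
"""
rows = read_primer_table(example)
print(rows)
```

Because values are compared as written, the sketch does not change case anywhere, in line with the rule that all text is case sensitive.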
BED format
Per standard BED conventions, sequence coordinates are given as 0-based, half-open intervals, such that the chromStart field (2nd column) contains the first nucleotide in the primer binding site and the last nucleotide in the primer binding site is the value in the chromEnd field (3rd column) minus 1.
The chrom field must contain a sequence identifier that matches the header of the FASTA file containing the sequence that the coordinates are relative to.
Multiple sequence identifiers (chrom) are permitted within one file.
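To make the 0-based, half-open convention concrete, the following sketch converts one interval into its length and its first and last bases. The function name `primer_site` and the coordinates 30 and 54 are hypothetical and chosen only for illustration.

```python
def primer_site(chrom_start, chrom_end):
    """Describe a 0-based, half-open BED interval [chrom_start, chrom_end)."""
    if chrom_end <= chrom_start:
        raise ValueError("chromEnd must be greater than chromStart")
    return {
        "length": chrom_end - chrom_start,
        "first_base_0based": chrom_start,
        "last_base_0based": chrom_end - 1,  # chromEnd itself is excluded
        "first_base_1based": chrom_start + 1,
        "last_base_1based": chrom_end,
    }


# A hypothetical 24-nucleotide primer binding site.
site = primer_site(30, 54)
print(site)
```

Under this convention the primer length is simply chromEnd minus chromStart, with no off-by-one correction.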
Primer name
primerName must be unique and encode the name of the amplicon for which the primer is designed, the direction tag indicating which side of the amplicon, left or right, the primer belongs to, and an optional indicator that the primer is an alternative primer for that amplicon.
In addition to _LEFT and _RIGHT, we permit _L and _R as direction tags in primerName. Any text after the direction tag should be separated by an underscore.
Text in primerName before the direction tag is considered to be an amplicon identifier. Ensure that the text of the amplicon identifier is unique for that amplicon and that the direction tag occurs only once in primerName.
Each amplicon must have at least one left and right primer (including alternative primers) associated with it.
Alternative primers are used to bind to locations that avoid sequence variation in the default primer binding site that may disrupt hybridization. An amplicon may have an arbitrary number of alternative primers (as long as each primer name is unique), but most amplicons will have none. Alternative primers are indicated by the presence of _alt after the direction tag in primerName, followed by optional text to distinguish between different alternative primers, such as a number.
Examples of valid primer names:
MY_SEQUENCE_434_A_LEFT
virus1_L
amplicon_4934m_RIGHT_alt
amplicon_4934m_RIGHT_alt1
amplicon_4934m_R_altprimerB
Examples of invalid primer names:
LEFT_MY_SEQUENCE_434_A
virus1_l
amplicon_4934m_RIGHT_L
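The naming rules and examples above can be checked with a short sketch. It is one possible reading of the rules, not the application's actual validator: the function names are hypothetical, and treating any underscore-delimited LEFT, RIGHT, L, or R token as a direction tag is an assumption.

```python
from collections import defaultdict

# Direction tags are case sensitive, so "l" or "left" are not accepted.
DIRECTION_TAGS = {"LEFT": "left", "L": "left", "RIGHT": "right", "R": "right"}


def parse_primer_name(name):
    """Return (amplicon_id, side, is_alt), or None if the name breaks the rules."""
    tokens = name.split("_")
    tag_positions = [i for i, token in enumerate(tokens) if token in DIRECTION_TAGS]
    if len(tag_positions) != 1:
        return None  # the direction tag must occur exactly once
    position = tag_positions[0]
    amplicon_id = "_".join(tokens[:position])
    if not amplicon_id:
        return None  # nothing before the direction tag to identify the amplicon
    suffix = tokens[position + 1:]
    is_alt = bool(suffix) and suffix[0].startswith("alt")
    return amplicon_id, DIRECTION_TAGS[tokens[position]], is_alt


def amplicons_missing_a_side(names):
    """List amplicon identifiers that lack a left or a right primer."""
    sides = defaultdict(set)
    for name in names:
        parsed = parse_primer_name(name)
        if parsed is not None:
            sides[parsed[0]].add(parsed[1])
    return sorted(a for a, found in sides.items() if found != {"left", "right"})


print(parse_primer_name("amplicon_4934m_R_altprimerB"))
print(parse_primer_name("amplicon_4934m_RIGHT_L"))
```

The second helper reflects the requirement that each amplicon has at least one left and one right primer, counting alternative primers.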
A BED-like tab-separated value (TSV) file with no header row, consisting of the following columns:
chrom: each sequence name as it appears in Custom Reference FASTA
chromStart: start position (always set to 0)
chromEnd: end position (sequence length)
genomeName: full name of the virus the sequence belongs to (e.g. Monkeypox virus clade II)
(optional) segmentName: how this sequence is labeled within the virus (e.g. Segment 4 (HA)). Set to 'Full' if the sequence is the full genome
This file affects how sequences are labeled in the output.
Sequence names must match those in Custom Reference FASTA. The same set of sequences must appear in both.
If there are multiple viruses, their names should be unique. For example, if there are multiple Influenza genomes, they should not be labeled with the same virus name in the 4th column.
If the Custom Reference FASTA includes sequences from multiple segments, it is recommended to provide this BED file. Otherwise, each segment will be treated independently and not all of them may be used as reference.
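As an illustration of the format described above, the sketch below derives the chrom, chromStart, and chromEnd columns from a FASTA file and appends user-supplied labels. The function name `genome_definition_rows`, the sequence names, and the labels are all hypothetical; the check that both inputs contain the same set of sequences mirrors the requirement stated above.

```python
def genome_definition_rows(fasta_text, labels):
    """Build tab-separated rows: chrom, 0, length, genomeName, segmentName.

    `labels` maps each sequence name to a (genomeName, segmentName) pair.
    """
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # Use the part of the header before the first whitespace.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif line and name is not None:
            lengths[name] += len(line)
    if set(lengths) != set(labels):
        raise ValueError("labels and FASTA must contain the same set of sequences")
    return [
        "\t".join([seq_name, "0", str(length), *labels[seq_name]])
        for seq_name, length in lengths.items()
    ]


# Hypothetical two-segment reference.
fasta = ">segHA example header text\nACGTACGT\nACGT\n>segNA\nACGTAC\n"
labels = {
    "segHA": ("Influenza A example", "Segment 4 (HA)"),
    "segNA": ("Influenza A example", "Segment 6 (NA)"),
}
for row in genome_definition_rows(fasta, labels):
    print(row)
```

Tabs are used as the separator because genomeName and segmentName may contain spaces.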
Launch the DRAGEN Targeted Microbial BaseSpace application.
After choosing a name and destination project for the Analysis, choose either “Biosample” or “Project” as input type. Selecting “Project” will result in all biosamples in the selected project being analyzed.
Next, choose between Enrichment and Amplicon for Experiment Type. Libraries prepared with IMAP should be run as “Amplicon” experiments. Choose the appropriate Amplicon Primer Set that matches the primer design used to prepare your samples or choose “Custom” to provide your own genome references and primer designs. Note that all provided files must first be uploaded to a BaseSpace project before they can be selected in the software.
To use a custom reference and primer design, click the “Custom Reference” block to expand it.
At a minimum, the user must provide a custom genome reference containing one or more target sequences (to be used for alignment, variant calling and consensus generation) in the form of a FASTA file.
Optionally, the user may provide a BED file that assigns human-readable names and segment numbers (if applicable) to each sequence in the provided FASTA file. Note that the accessions in the genome definition file must match the first part (before whitespace) of the FASTA headers. See the pages for “Genome Definition File Format Specification” in the “Supporting Information” section of this user guide for information on the required format of this file.
Optionally, the user may provide a file containing the locations or sequences of the primers used to prepare this sample. These primer definitions are important to guide the trimming of primer sequence from reads that overlap the binding sites, as well as to define the boundaries of the amplicons whose coverage is used to determine if the sample has sufficient viral material to reliably call variants and generate consensus sequence.
Optionally, the user may choose one or more NextClade datasets to use for phylogenetic analysis of the consensus sequences generated from the samples.
Check the appropriate boxes to enable or disable Pangolin and/or NextClade analysis if desired. These tools perform phylogenetic analysis and lineage assignment for SARS-CoV-2 (Pangolin) and other viruses (NextClade). Depending on the chosen Amplicon primer set, not all of these options may be applicable.
Click “Launch Application” to begin the Analysis.
Note: Some files may not be generated depending on user inputs and pipeline outcome
For each sample, the pipeline generates a directory named after the sample. This directory contains the following subdirectories:
consensus/
{sample_name}_sample_consensus.fasta
: Contains all hard-masked consensus sequences for the sample. Regions with coverage below minimum coverage depth for consensus sequence generation (10x by default) are considered not callable and are therefore “hard-masked” with letter N. Variant calling is not applied to these regions. If the user selected specific VSP or RVOP organisms to be reported, this file excludes consensus sequences that are generated but do not belong to the selected organisms.
{sample_name}.consensus_hard_masked_sequence.fa
: Identical to {sample_name}_sample_consensus.fasta, except for sequence headers. Moreover, even if the user selected specific VSP or RVOP organisms to be reported, this file contains all consensus sequences, including those that do not belong to the selected organisms.
{sample_name}.consensus_soft_masked_sequence.fa
: Identical to {sample_name}.consensus_hard_masked_sequence.fa, except low-coverage regions are “soft-masked” with lower-case letters that match the reference. Variant calling is not performed in these regions.
{sample_name}_{virus_name}_virus_consensus.fasta
: Contains hard-masked consensus sequences generated for a particular virus. If the virus is not segmented (i.e. one reference sequence for the virus), this file contains a single sequence and is identical to {sample_name}_{virus_name}_{segment_name}_{sequence_accession}_consensus.fasta.
{sample_name}_{virus_name}_{segment_name}_{sequence_accession}_consensus.fasta
: Contains a hard-masked consensus sequence generated with a particular reference sequence.
{sample_name}.consensus_from_vcf.log
: Log file generated during consensus sequence generation.
contig/
{sample_name}_sample_contig.fasta
: Contains all de novo assembled contigs generated for the sample.
{sample_name}_unmapped_contig.fasta
: Contains de novo assembled contigs that could not be mapped to any reference sequence. Because de novo assembly is reference-free, these contigs may correspond to sequences that are too diverged from those in the reference database or sequences not included in the database.
{sample_name}_{virus_name}_virus_contig.fasta
: Contains de novo assembled contigs that mapped to a particular virus.
{sample_name}_{virus_name}_{segment_name}_{sequence_accession}_contig.fasta
: Contains de novo assembled contigs that mapped to a particular reference sequence, which resulted in choosing the reference sequence for short-read alignment and generating {sample_name}_{virus_name}_{segment_name}_{sequence_accession}_consensus.fasta.
{sample_name}_reference_selection.log
: Log file generated during reference selection.
map_align/
{sample_name}_unfiltered_tumor.bam or {sample_name}_unfiltered_tumor_primertrim_with_unmapped_reads.bam
: Short reads mapped to all selected reference sequences. If a primer set is available and properly mapped to the reference sequences, {sample_name}_unfiltered_tumor_primertrim_with_unmapped_reads.bam is provided as output. Its reads have primer sequences trimmed based on primer binding site coordinates.
{sample_name}-unmapped_S1_L001_R1_001.fastq.gz, {sample_name}-unmapped_S1_L001_R2_001.fastq.gz
: Short reads that do not map to any selected reference sequences. These may be used to find organisms not reported by the pipeline.
variant_calling/
{sample_name}.consensus_filtered_variants.vcf.gz
: Contains variant calls that passed consensus filter and were therefore applied to consensus sequences. They are a subset of variants listed in {sample_name}.hard-filtered.vcf.gz.
{sample_name}.consensus_filtered_variants_vcf_stats.txt
: Summarizes all variant calls in {sample_name}.consensus_filtered_variants.vcf.gz. Outputted by bcftools stats.
{sample_name}.consensus_filtered_variants_summary.csv
: Describes each variant call in {sample_name}.consensus_filtered_variants.vcf.gz.
{sample_name}.hard-filtered.vcf.gz
: Contains all variant calls.
{sample_name}.consensus_input_vcf_stats.txt
: Summarizes all variant calls in {sample_name}.hard-filtered.vcf.gz. Outputted by bcftools stats.
{sample_name}.consensus_all_variants_summary.csv
: Describes each variant call in {sample_name}.hard-filtered.vcf.gz.
metrics/
{sample_name}_num_reads.tsv
: Reports number of input reads, reads filtered out at each pre-processing step, reads mapped to each selected reference sequence, etc.
{sample_name}_metadata.json
: Reports parameters, read counts, amplicon counts, analysis results, and other metadata.
{sample_name}.consensus_metrics.csv
: Reports consensus metrics (e.g. total length of pre-trimmed sequence, fraction of masked bases, number of callable bases) for each generated consensus sequence
{sample_name}.consensus_coverage_from_filtered_bam.tsv
: Reports base-pair-resolution read coverage for all reference sequences based on short-read map/align step. Its three columns correspond to: chrom/accession, base position, coverage.
{sample_name}.consensus_callable_regions_from_filtered_bam.bed
: Reports callable regions in all reference sequences based on base-pair-resolution read coverage and minimum coverage depth for consensus sequence generation (10x by default). Bases outside of these regions are masked in consensus FASTA.
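The relationship between the per-base coverage table and the callable-regions BED file can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's implementation: the function name `callable_regions` is hypothetical, positions in the coverage table are assumed to be 1-based and sorted, and a position is treated as callable when its depth is at least the minimum (10x by default).

```python
def callable_regions(coverage_rows, min_depth=10):
    """Merge consecutive positions with depth >= min_depth into BED intervals.

    coverage_rows: iterable of (chrom, position, depth) with 1-based positions
    (an assumption), sorted by position within each chrom.
    Returns (chrom, start, end) tuples as 0-based, half-open intervals.
    """
    regions = []
    for chrom, position, depth in coverage_rows:
        if depth < min_depth:
            continue  # below the minimum depth: masked in the consensus
        start = position - 1
        if regions and regions[-1][0] == chrom and regions[-1][2] == start:
            # Adjacent to the previous callable base: extend the current run.
            regions[-1] = (chrom, regions[-1][1], position)
        else:
            regions.append((chrom, start, position))
    return regions


# Hypothetical depths for an 8-base reference named "seqA".
depths = [3, 12, 15, 10, 9, 4, 11, 20]
rows = [("seqA", index + 1, depth) for index, depth in enumerate(depths)]
regions = callable_regions(rows)
callable_bases = sum(end - start for _, start, end in regions)
print(regions)
print(100.0 * callable_bases / len(depths))
```

The last line computes a callable-base percentage over the reference length, which is how the reports describe that metric.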
amplicon/
{sample_name}_processed_non_overlapping_amplicon.bed
: Lists all non-overlapping amplicon regions (i.e. covered with exactly one amplicon). If a custom primer set is provided, this file also lists selected reference sequences lacking amplicons. While amplicons are defined based on primer binding sites, for viruses like Influenza, reference sequences often lack primer binding sites, which are located at sequence ends. This results in defining fewer or sometimes no amplicons for an entire viral genome. To avoid this, each reference sequence without any amplicons defined is considered an amplicon and is listed in this file. All regions in this file are used for amplicon detection to infer sample concentration and determine if it is sufficient to apply variant calling and consensus sequence generation.
{sample_name}_calculate_amplicon_coverage.csv
: Reports coverage metrics (e.g. median coverage, fraction of bases with at least 1x coverage) for each non-overlapping amplicon region listed in {sample_name}_processed_non_overlapping_amplicon.bed.
{sample_name}_generate_all_primer_bed.log
: Log file generated while defining amplicons for selected reference sequences and writing relevant BED files.
tertiary/
nextclade_{sample_name}_{sequence_accession}_{dataset_name}.csv
: CSV output file generated by NextClade given a consensus sequence (generated with the specified sequence as reference) and a NextClade dataset.
nextclade_{sample_name}_{sequence_accession}_{dataset_name}.tsv
: Same as nextclade_{sample_name}_{sequence_accession}_{dataset_name}.csv except in TSV format.
nextclade_{sample_name}_{sequence_accession}_{dataset_name}.json
: Same as nextclade_{sample_name}_{sequence_accession}_{dataset_name}.csv except in JSON format.
nextclade_{sample_name}_{sequence_accession}_{dataset_name}_log.txt
: Log file generated by NextClade given a consensus sequence (generated with the specified sequence as reference) and a NextClade dataset.
pangolin_{sample_name}_{sequence_accession}_lineage_report.csv
: CSV output file generated by Pangolin given a consensus sequence (generated with the specified sequence as reference) as input.
pangolin_{sample_name}_{sequence_accession}_log.txt
: Log file generated by Pangolin given a consensus sequence (generated with the specified sequence as reference) as input.
reference/
reference.bed
: Describes all reference sequences. If a custom reference was provided, sequence names may appear different in this BED file.
reference.json
: Same as reference.bed but with more detail. If any of the sequences were renamed during the pipeline, this file provides the mapping between the original and renamed versions.
This page provides an overview of the software available on Illumina's cloud platforms
Infectious Disease and Microbiology software includes powerful bioinformatics tools to analyze NGS data ranging from single microbial genomes to complex microbial communities of thousands of viruses, bacteria, parasites, and fungi. This comprehensive suite of secondary analysis tools supports workflows ranging from target-specific methods, such as amplicon and hybrid-capture enrichment sequencing, to generalized microbiology methods like small WGS, shotgun sequencing, or 16S amplicon sequencing. All tools are available on BaseSpace, and some are also available on board select Illumina sequencers.
Click the links below to learn more about our currently-available infectious disease software products:
For version 1.1 and later, consensus FASTA files generated for each sample, virus, and reference sequence incorrectly contain soft-masked sequences instead of hard-masked sequences. To get hard-masked sequences, use {sample_name}.consensus_hard_masked_sequence.fa or convert lowercase nucleotides to "N".
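The lowercase-to-"N" conversion mentioned above could be done with a few lines of Python. This is a minimal sketch; the function name `hard_mask` and the example record are hypothetical.

```python
def hard_mask(fasta_text):
    """Replace lowercase (soft-masked) nucleotides with N; headers are untouched."""
    masked_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            masked_lines.append(line)
        else:
            masked_lines.append(
                "".join("N" if base.islower() else base for base in line)
            )
    return "\n".join(masked_lines) + "\n"


# Hypothetical soft-masked consensus record.
soft = ">sample1_consensus\nacgtACGTTGCAacg\nGGCCtt\n"
print(hard_mask(soft), end="")
```

Header lines are skipped so that lowercase letters in sequence names are preserved.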
Describes the controls on the Input Form and their function
Item name | Description | Choices | Default | Required |
---|---|---|---|---|
Brief description of Summary and Result reports and an explanation of their contents
The app produces a summary report as well as result reports for each of the samples analyzed. See the links below for a description of each.
DRAGEN Targeted Microbial is a software application designed to analyze sequencing data from enrichment and amplicon library preps (both DNA and RNA) on microbiological samples, with an emphasis on viruses. Illumina sequencing reads are processed to remove human-origin sequence, then assembled into consensus sequences that represent a best estimate of the population of viral sequences in each sample. Where appropriate, these consensus sequences are further analyzed by the phylogenetic analysis tools NextClade and/or Pangolin to provide an identification of the clade or lineage of each sequence.
Samples / biosamples with FASTQ datasets (see details in library preparation documents)
A project containing one or more samples / biosamples with FASTQ datasets
all samples / biosamples in the selected project will be analyzed
Supported hybrid-capture enrichment panels
Supported amplicon primer schemes
Chikungunya (Grubaugh lab; Illumina)
Dengue Serotype 1 (Grubaugh Lab; Illumina)
MPXV (Grubaugh Lab)
SARS-CoV-2 (ARTIC v3, v4, v4.1, v5.3.2)
Zika (Grubaugh lab)
Custom genomes and panels
Supports uploading FASTA files to use as reference genomes for both enrichment and amplicon panels, as well as custom primer definitions for amplicon panels. Multiplexed amplicon panels targeting multiple organisms in the same reaction are supported.
1. Reads are trimmed and filtered using Trimmomatic with the following parameters: LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36.
2. Human reads are removed with a modified version of the SRA Human Read Scrubber tool.
3. MEGAHIT is used to perform de novo assembly on the scrubbed reads.
4. CD-HIT-EST is used to cluster similar contigs to reduce redundancy.
5. The resulting contigs are mapped to a set of reference genomes using minimap2.
6. The best matching reference for each contig is selected for short read mapping.
7. The scrubbed reads from step 2 are aligned to the selected reference genomes using DRAGEN v4.2.4.
8. Sequence variants are called from the alignments using DRAGEN Somatic Small Variant Caller v4.2.4 and applied to the corresponding reference sequences to create consensus sequences.
9. If applicable, Pangolin and/or NextClade are run on the consensus sequences.
The software generates consensus sequences representing a best estimate of the population of targeted sequences in the sample. NextClade and Pangolin analysis are run on select organisms. See this page for details:
The sequences are labeled according to the best match in the panel references. These references are not exhaustive and the labels should not be taken as definitive for strain-typing. If strain typing is needed, the built-in NextClade and/or Pangolin tools can be used for supported organisms. Alternatively, a BLAST or similar search of nucleotide databases may provide a more detailed match.
Because of sequence homology, a small number of reads can lead to the generation of a sequence for an organism that is not actually present in the sample (a false positive). Although the de novo assembly step of this software largely mitigates such instances, sequences with very low horizontal coverage (< 5%) should be treated with caution and are highlighted as "low confidence" in the reports.
BaseSpace Sequence Hub (native BaseSpace app)
Describes the reports that can be viewed from the individual sample links on the left side of the reports tab or by clicking on sample names in the Metrics by sample table.
At the top of the report is version information for the App and any third-party components.
Two buttons provide the ability to download relevant FASTA-formatted text files for this sample. The "Consensus" button initiates a download of a FASTA file containing all consensus sequences generated for this sample. The "Contig" button initiates a download of a FASTA file containing all assembled contigs for this sample.
The metrics by virus table contains information about each viral genome generated. Each row summarizes all sequences assigned to that virus. In the case of multi-segment viruses like Influenza, a row will summarize information across all segment sequences generated for a single viral genome. It contains buttons to download the contents of the table as a CSV, JSON or PDF file. The table itself contains rows for every virus with at least one generated genome in the sample. It contains the following columns:
Virus: The name of the virus. For custom references, this will be the part of the FASTA header before the first whitespace character for the corresponding reference sequence if no custom genome definition file is provided. If a custom genome definition file is provided, this will be the value of the genomeName column that corresponds to the selected reference (matched by the value in the chrom column of the genome definition file and the part of the FASTA header before the first whitespace character).
% callable bases: The percentage of the selected reference genome whose bases are considered "callable." Callable bases are those for which reliable variant calling can be performed and therefore for which the software can output a base call. Callable bases are defined as genomic positions with read coverage above the minimum read coverage depth for consensus sequence generation (10x by default). Note that genomic positions below the confidence threshold are hard-masked with "N" characters to avoid reference bias (inclusion of a reference base when the actual base cannot be accurately determined). Note that this percentage is calculated over the lengths of the reference genome(s), not the reported consensus sequence(s).
Status: The overall outcome of the analysis for this virus
Full analysis (consensus, VC) means that the sample analysis completed normally, that a sufficient number of amplicons were detected to ensure reliable variant calling (amplicon experiments only), and that the percentage of callable bases was above the minimum percentage of consensus sequence generated to label as confident (5% by default).
Low confidence means that there is at least one callable base but the overall percentage of callable bases was below the minimum percentage of consensus sequence generated to label as confident (5% by default).
No callable bases indicates that zero positions in the indicated reference genome were callable and no consensus sequence is therefore provided.
Median coverage: The median coverage value (in number of reads overlapping each position) over the entire reference genome (not just the generated consensus sequence).
Consensus FASTA: A download link to a FASTA-formatted text file containing all the consensus sequences generated for this virus.
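The three Status values described above depend on the callable-base percentage, which can be sketched as below. This is a simplified illustration and not the application's logic: the function name `virus_status` is hypothetical, the amplicon-detection check for amplicon experiments is omitted, and a percentage exactly equal to the threshold is assumed to count as confident.

```python
def virus_status(callable_bases, reference_length, min_confident_pct=5.0):
    """Classify a result from its callable-base percentage (5% default threshold)."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    if callable_bases == 0:
        return "No callable bases"
    # The percentage is taken over the reference length, not the consensus length.
    percent = 100.0 * callable_bases / reference_length
    if percent < min_confident_pct:
        return "Low confidence"
    # Assumption: the amplicon-count requirement is checked elsewhere.
    return "Full analysis (consensus, VC)"


print(virus_status(0, 1000))
print(virus_status(40, 1000))
print(virus_status(900, 1000))
```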
This table summarizes the results for each sequence generated for the sample. For multi-segment viruses such as Influenza, there may be multiple sequences detected for a given virus. For single-segment viruses there will typically be only one sequence per virus. It contains buttons to download the contents of the table as a CSV, JSON or PDF file. The table itself contains rows for every sequence. It contains the following columns:
Virus: The name of the virus. For custom references, this will be the part of the FASTA header before the first whitespace character for the corresponding reference sequence if no custom genome definition file is provided. If a custom genome definition file is provided, this will be the value of the genomeName column that corresponds to the selected reference (matched by the value in the chrom column of the genome definition file and the part of the FASTA header before the first whitespace character).
Segment: The name of the genome segment to which the sequence belongs. For viruses with a single segment, the name of the segment will typically be "Full".
Accession: The accession number or other short unique identifier for the sequence. If using a custom genome definition BED, this value is taken from the first column (chrom) in the definition file. If using a custom reference without a genome definition file, the value is taken from the part of the FASTA header before the first whitespace character.
% callable bases: The percentage of the selected reference sequence whose bases are considered "callable." Callable bases are those for which reliable variant calling can be performed and therefore for which the software can output a base call. Callable bases are defined as genomic positions with read coverage above the minimum read coverage depth for consensus sequence generation (10x by default). Note that genomic positions below the confidence threshold are hard-masked with "N" characters to avoid reference bias (inclusion of a reference base when the actual base cannot be accurately determined). Note that this percentage is calculated over the lengths of the reference sequence, not the reported consensus sequence.
Median coverage: The median coverage value (in number of reads overlapping each position) over the entire reference sequence (not just the generated consensus sequence).
Consensus sequence length: The length of this consensus sequence. The reported length is the length of the hard-masked sequence after trimming any leading and trailing masked regions (if trimming is active).
# callable bases: The number of positions in the reference sequence above the minimum read coverage depth for consensus sequence generation (default 10x). In other words, the number of positions not masked. This may not be equal to the number of unmasked positions in the final consensus sequence since insertions and deletions are applied after masking.
Consensus FASTA: A download link to a FASTA-formatted text file containing this consensus sequence.
This stacked bar plot contains information about the outcome of the pre-processing steps (read QC, trimming, de-hosting) as well as the alignment step. It contains counts of reads that fall into the following categories:
Removed in QC: Reads that failed to meet the minimum quality thresholds and were excluded from further processing.
Removed in trimming: Reads that were removed in the initial sequence-based primer trimming step and were excluded from further processing.
Removed in de-hosting: Reads that were removed in the de-hosting step and excluded from further processing. De-hosting is the process of removing reads that may originate from the host organism. Currently only human hosts are supported. De-hosting reads improves the quality of downstream analysis and helps ensure that human sequences are not included in the output BAM files.
Unmapped: Reads that were not aligned to any reference genomes.
Mapped: Reads that were mapped to at least one reference genome.
A column plot displaying the numbers and percentages of all reads that aligned to each reference sequence with at least one mapped read. The columns are labeled by both virus and segment name (if available) on the x-axis, and the y-axis is the read count for each sequence.
Displays a trace of the read coverage over each reference sequence. The drop-down menu in the upper left allows the user to switch between viruses. If multiple segment sequences are generated for a single virus, their corresponding coverage plots will be displayed in a vertically stacked fashion. The black trace represents the read coverage, with the coverage depth in number of reads on the left y-axis and the position in the reference sequence on the x-axis.
The minimum read coverage depth for consensus sequence generation (default 10x) is plotted as a dashed orange line across the plot, to easily visualize locations where coverage drops below the threshold (which will be masked in the consensus sequence) and where the coverage is above the threshold (which will be reported in the consensus sequence).
The median coverage is plotted as a dashed teal line across the plot.
By default, sequence variants representing differences between the consensus sequence and the reference sequence are also plotted, with allele frequency on the right y-axis. The colors and symbols represent different sequence variant types. See the figure legend for details.
The "Show log-scale" toggle switch allows the user to switch between logarithmic and linear scales for the coverage (left) y-axis.
The "Show Median" toggle switch allows the user to turn the median coverage line on and off.
The "Show Sequence Variants" toggle switch allows the user to turn the plotting of sequence variants on and off.
This table contains the results of the Pangolin analysis performed on the generated consensus sequences. Pangolin is run if the "Enable Pangolin" box is checked on the input form and one of the following is true:
'Enrichment Panel' is set to a non-custom panel (e.g. VSP) and a consensus sequence was generated using SARS-CoV-2 as reference
'Amplicon Primer Set' is set to a non-custom set with SARS-CoV-2 as reference (e.g. SARS-CoV-2, ARTIC v5.3.2 primers) and a valid consensus sequence was generated
Either 'Enrichment Panel' or 'Amplicon Primer Set' is set to 'Custom'. In this case, Pangolin is applied to every consensus sequence generated for the sample since the software assumes all of them to be potentially SARS-CoV-2 sequences.
The Pangolin report contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.
This table contains the results of the NextClade analysis performed on the generated consensus sequences. NextClade is run if the "Enable NextClade" box is checked on the input form and one of the following is true:
'Enrichment Panel' is set to a non-custom panel (e.g. VSP) and a consensus sequence was generated using a reference with NextClade dataset available.
'Amplicon Primer Set' is set to a non-custom set with a reference with NextClade dataset available (e.g. SARS-CoV-2, ARTIC v5.3.2 primers) and a valid consensus sequence was generated.
Either 'Enrichment Panel' or 'Amplicon Primer Set' is set to 'Custom' and one or more NextClade datasets are selected under 'Custom Reference'. In this case, each of the selected NextClade datasets is applied to each consensus sequence generated for the sample. This may result in multiple NextClade results for each consensus sequence, some of which may not be meaningful (e.g. "flu_h1n1pdm_ha" dataset applied to a NA segment of an Influenza genome).
The NextClade Report contains a button labeled "Group table by" with a drop-down menu allowing the user to group the results by various fields, including "None". The default is "Dataset" which means that all of the results for each NextClade dataset will be grouped together. For example, if a user is only interested in phylogenetic analysis performed on the HA segment of Influenza A H1N1, these results for each sample can be viewed together in the "flu_h1n1pdm_ha" collapsible group.
The NextClade report also contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.
Describes the report that can be viewed from the Summary link on the Reports tab of a completed analysis.
At the top of the report, after the app version display, is the Metrics by Sample table which provides a top-line summary of each of the analyzed samples.
The first element is a button that will trigger downloading of a FASTA-formatted file containing all consensus sequences generated across all samples.
The "Download CSV" button allows for downloading the contents of the table as a text comma-separated value (CSV) file. Note that for fields with multiple entries, these entries will be combined as a semicolon-separated list in the corresponding fields in the CSV file.
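When post-processing the downloaded CSV, the semicolon-separated fields can be split back into lists. The sketch below assumes hypothetical column headers and values, since the exact header names in the exported file are not given here; the function name `read_summary_csv` is also hypothetical.

```python
import csv
import io


def read_summary_csv(text, list_columns):
    """Parse summary CSV text, splitting the named columns on semicolons."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        for column in list_columns:
            record[column] = [item.strip() for item in record[column].split(";")]
        rows.append(record)
    return rows


# Hypothetical export with two genomes reported for the first sample.
text = (
    "Sample,Genomes generated,Status\n"
    "S1,Virus A;Virus B,Full analysis;Low confidence\n"
    "S2,Virus C,Full analysis\n"
)
for row in read_summary_csv(text, ["Genomes generated", "Status"]):
    print(row["Sample"], row["Genomes generated"], row["Status"])
```

Naming the list columns explicitly keeps single-entry fields as one-element lists, so every row has the same shape.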
Next is the table itself, which contains one row per sample. The various genomes generated for each sample are nested as sub-rows within this row.
The table contains one row per sample and the following columns:
Sample: The name of the BaseSpace sample analyzed. The sample name is a clickable link that will take you directly to the Result Report for that sample.
Num genomes: The number of genomes chosen during the reference selection stage of the pipeline
Genomes generated: The names of each genome chosen during the reference selection stage. If the percentage of callable bases (callable bases are defined as genomic positions with read coverage above the minimum read coverage depth for consensus sequence generation, 10x by default) for a genome is below the minimum percentage of consensus sequence generated to label as confident (5% by default), the cell is highlighted in yellow to indicate that there is only marginal evidence that the indicated genome is present in the sample and should be treated with caution. For amplicon experiments, if the sample is considered to have insufficient titer for VC because the percentage of detected amplicons is below the minimum percentage required for reliable variant calling (80% by default), cells are highlighted in orange. For genomes for which a consensus sequence was generated, clicking on the name of that genome initiates a download of a FASTA file containing the consensus sequences of that genome only.
% callable bases: The percentage of the selected reference genome whose bases are considered "callable." Callable bases are those for which reliable variant calling can be performed and therefore for which the software can output a base call. Callable bases are defined as genomic positions with read coverage above the minimum read coverage depth for consensus sequence generation (10x by default). Note that genomic positions below the confidence threshold are hard-masked with "N" characters to avoid reference bias (inclusion of a reference base when the actual base cannot be accurately determined). Note that this percentage is calculated over the lengths of the reference genome(s), not the reported consensus sequence(s).
Status: The overall outcome of the analysis for this virus
Full analysis (consensus, VC) means that the sample analysis completed normally, that a sufficient number of amplicons were detected to ensure reliable variant calling (amplicon experiments only), and that the percentage of callable bases was above the minimum percentage of consensus sequence generated to label as confident (5% by default)
Low confidence means that there is at least one callable base but the overall percentage of callable bases was below the minimum percentage of consensus sequence generated to label as confident (5% by default).
No callable bases indicates that zero positions in the indicated reference genome were callable and no consensus sequence is therefore provided.
Consensus FASTA: This column contains links to download a FASTA-formatted text file containing all of the consensus genomes generated for a sample. If no consensus genomes were generated for a sample, this column contains "N/A."
Input read count: The number of reads (or read pairs / clusters for paired-end samples) in the sample.
Mapped read count: The number of reads that could be mapped to any reference genome.
Unmapped reads: Displays buttons that initiate downloads of gzipped FASTQ files containing reads that could not be mapped to any reference genomes.
Raw Contigs: Displays a button that initiates a download of a FASTA file containing all contigs generated during the de novo assembly step of the pipeline. If a contig could be mapped to a reference genome the contig name contains information about the reference genome they aligned to.
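The relationship between "% callable bases" and the Status values described above can be sketched as follows. This is a minimal illustration, not the app's implementation: the function names are hypothetical, depth at or above the threshold is assumed to count as callable, and the order in which the conditions are checked (in particular, insufficient titer before low confidence) is an assumption.

```python
from typing import Optional, Sequence


def percent_callable(depths: Sequence[int], min_depth: int = 10) -> float:
    """Percentage of reference positions whose read depth reaches min_depth.

    The percentage is taken over the reference length, as in the report.
    """
    if not depths:
        return 0.0
    callable_positions = sum(1 for depth in depths if depth >= min_depth)
    return 100.0 * callable_positions / len(depths)


def classify_status(
    pct_callable: float,
    pct_amplicons_detected: Optional[float] = None,
    min_pct_confident: float = 5.0,
    min_pct_amplicons: float = 80.0,
) -> str:
    """Map the two percentages to a Status label (check order is assumed).

    pct_amplicons_detected is None for enrichment experiments, where the
    amplicon check does not apply.
    """
    if pct_callable == 0.0:
        return "No callable bases"
    if pct_amplicons_detected is not None and pct_amplicons_detected < min_pct_amplicons:
        return "Insufficient titer for VC"
    if pct_callable < min_pct_confident:
        return "Low confidence"
    return "Full analysis (consensus, VC)"
```

For example, a genome with 3% callable bases in an enrichment run would be labeled "Low confidence" under these assumptions, because 3% is below the 5% default.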
This table contains the results of the Pangolin analysis performed on the generated consensus sequences across all samples. Pangolin is run if the "Enable Pangolin" box is checked on the input form and one of the following is true:
'Enrichment Panel' is set to a non-custom panel (e.g. VSP) and a consensus sequence was generated using SARS-CoV-2 as reference
'Amplicon Primer Set' is set to a non-custom set with SARS-CoV-2 as reference (e.g. SARS-CoV-2, ARTIC v5.3.2 primers) and a valid consensus sequence was generated
Either 'Enrichment Panel' or 'Amplicon Primer Set' is set to 'Custom'. In this case, Pangolin is applied to every consensus sequence generated for the sample since the software assumes all of them to be potentially SARS-CoV-2 sequences.
The Pangolin report contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.
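The conditions under which Pangolin runs can be summarized as a small decision function. This is only a sketch of the rules listed above; the function and parameter names are hypothetical and do not correspond to any option exposed by the app.

```python
def should_run_pangolin(
    enable_pangolin: bool,
    is_custom: bool,
    reference_is_sars_cov_2: bool,
    consensus_generated: bool,
) -> bool:
    """Return True if Pangolin would be applied to a consensus sequence.

    is_custom: 'Enrichment Panel' or 'Amplicon Primer Set' is set to 'Custom'.
    reference_is_sars_cov_2: the consensus was generated against SARS-CoV-2.
    """
    if not enable_pangolin or not consensus_generated:
        return False
    if is_custom:
        # With custom references, every consensus sequence is treated as
        # potentially SARS-CoV-2.
        return True
    return reference_is_sars_cov_2
```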
This table contains the results of the NextClade analysis performed on the generated consensus sequences across all samples. NextClade is run if the "Enable NextClade" box is checked on the input form and one of the following is true:
'Enrichment Panel' is set to a non-custom panel (e.g. VSP) and a consensus sequence was generated using a reference with NextClade dataset available.
'Amplicon Primer Set' is set to a non-custom set with a reference with NextClade dataset available (e.g. SARS-CoV-2, ARTIC v5.3.2 primers) and a valid consensus sequence was generated.
Either 'Enrichment Panel' or 'Amplicon Primer Set' is set to 'Custom' and one or more NextClade datasets are selected under 'Custom Reference'. In this case, each of the selected NextClade datasets is applied to each consensus sequence generated for the sample. This may result in multiple NextClade results for each consensus sequence, some of which may not be meaningful (e.g. "flu_h1n1pdm_ha" dataset applied to a NA segment of an Influenza genome).
The NextClade Report contains a button labeled "Group table by" with a drop-down menu allowing the user to group the results by various fields, including "None". The default is "Dataset" which means that all of the results for each NextClade dataset will be grouped together. For example, if a user is only interested in phylogenetic analysis performed on the HA segment of Influenza A H1N1, these results for each sample can be viewed together in the "flu_h1n1pdm_ha" collapsible group.
The NextClade report also contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.
In addition to the built-in options, DRAGEN Targeted Microbial supports custom reference genomes and primer definitions. These files must be uploaded to a BaseSpace Project before they can be used. See https://help.basespace.illumina.com/manage-data/import-data for more information about importing files into BaseSpace. These files can be used for both Enrichment and Amplicon libraries by choosing the 'Custom' option for 'Enrichment Panel' or 'Amplicon Primer Set', respectively. Expand the 'Custom Reference' settings block to access the options for custom files. The following controls are applicable to the specified experiment type:
Enrichment:
Custom Reference FASTA for Consensus Generation (required)
Custom Reference BED (optional)
Amplicon:
Custom Reference FASTA for Consensus Generation (required)
Custom Reference BED (optional)
Custom PCR Primer Definitions (optional)
The user may provide one or more reference genomes as the target for read alignment (and as the basis for generating consensus sequences). At a minimum, the user must provide a FASTA file containing the sequences of the reference genomes. The software will generate the required DRAGEN hash tables and other auxiliary files automatically, so there is no need to process the FASTA file with a separate app. Use the 'Custom Reference FASTA for Consensus Generation' control to select the previously-uploaded FASTA file containing the reference sequences.
Optionally, a genome definition BED file may also be provided, which tells the software more information about each sequence, such as a human-readable common name to be used in the reports. For multi-segment genomes such as Influenza, the genome definition file provides the segment name of each sequence and indicates that all the segments of a single genome belong together. Use the 'Custom Reference BED' control to select the previously-uploaded BED file containing the genome definition. See the following page for a description of the format of the genome definition file:
For amplicon experiments, the user may optionally provide a file that defines the primer sequences or locations. The primers defined in this file are used for two purposes:
The primer binding locations are used to trim reads, which eliminates sequence data that may be contributed by the primer sequences themselves (which we do not want) from sequence data contributed by the sample (which we do want). This is important to avoid reference bias that can depress the observed allele frequency of sequence variants in primer binding sites.
The primers are matched to define the boundaries of the expected amplicons resulting from the PCR reaction. The read coverage within the unique (non-overlapping) regions of these amplicons is used to determine whether or not each amplicon is reliably observed. The fraction of observed amplicons is a function of the concentration of the sample, and is used to determine whether or not sufficient material exists within the sample to reliably and accurately call variants and generate a consensus sequence. See this page for a more in-depth discussion:
Use the 'Custom PCR Primer Definitions' control to select the previously-uploaded primer definition file. The allowed formats for this file are described here:
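The amplicon detection step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the app's algorithm: amplicons are assumed to be given as 0-based, half-open intervals in a tiled design where only adjacent amplicons overlap, and an amplicon is assumed to count as detected when at least 90% of its unique region has coverage of 1x or more. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Amplicon = Tuple[int, int]  # 0-based, half-open (start, end) on the reference


def unique_region(index: int, amplicons: Sequence[Amplicon]) -> Amplicon:
    """Part of amplicons[index] not overlapped by its neighbours.

    Assumes amplicons are sorted and only adjacent amplicons overlap.
    """
    start, end = amplicons[index]
    if index > 0:
        start = max(start, amplicons[index - 1][1])
    if index + 1 < len(amplicons):
        end = min(end, amplicons[index + 1][0])
    return start, end


def is_detected(
    depths: Sequence[int],
    region: Amplicon,
    min_depth: int = 1,
    min_fraction: float = 0.9,
) -> bool:
    """True if enough of the region is covered at min_depth or more."""
    start, end = region
    if end <= start:
        return False
    covered = sum(1 for depth in depths[start:end] if depth >= min_depth)
    return covered / (end - start) >= min_fraction


def percent_detected(depths: Sequence[int], amplicons: Sequence[Amplicon]) -> float:
    """Percentage of amplicons detected in their unique regions."""
    ordered: List[Amplicon] = sorted(amplicons)
    if not ordered:
        return 0.0
    detected = sum(
        is_detected(depths, unique_region(i, ordered)) for i in range(len(ordered))
    )
    return 100.0 * detected / len(ordered)
```

With the default threshold, a sample would be flagged as having insufficient titer for variant calling when the value returned by `percent_detected` is below 80.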
A: The enrichment protocols can create a several thousand fold increase in the abundance of the targeted viral species. However, it is important to keep in mind that in many sample types, especially clinical, wastewater, or environmental samples, viral RNA or DNA makes up a tiny proportion of the total nucleic acids present, with the remainder dominated by host (human) or bacteria/archaea. So even with a dramatic enrichment of abundance over what you would obtain without enrichment, the percentage of viral reads can still be low. E.g. you may have a sample with only 2% viral reads, but without enrichment you might have only obtained 0.1% viral reads. If it is low abundance after enrichment, it is likely extremely low abundance prior to enrichment.
A: The 10x threshold is applied per-nucleotide. Any positions below 10x coverage will be hard-masked with "N".
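As an illustration of per-nucleotide masking, the sketch below replaces every base whose depth is below the threshold with "N". The function names are hypothetical; the second function mirrors the behavior described for the 'Trim Consensus Sequences' option, which removes only leading and trailing masked nucleotides.

```python
from typing import Sequence


def mask_consensus(bases: str, depths: Sequence[int], min_depth: int = 10) -> str:
    """Hard-mask each position whose read depth is below min_depth."""
    if len(bases) != len(depths):
        raise ValueError("bases and depths must have the same length")
    return "".join(
        base if depth >= min_depth else "N" for base, depth in zip(bases, depths)
    )


def trim_masked_ends(sequence: str) -> str:
    """Remove leading and trailing N characters; internal N runs are kept."""
    return sequence.strip("N")


masked = mask_consensus("ACGTAC", [3, 12, 10, 9, 40, 2])
print(masked)                    # NCGNAN
print(trim_masked_ends(masked))  # CGNA
```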
A: Correct, we only align to a limited number of reference sequences for each virus type, so the sequence accession in the consensus genomes (and coverage plots, etc) merely reflects the best match chosen from that subset. There could be sequences in RefSeq that are a closer match. Furthermore, strain typing is not necessarily as simple as choosing the closest matching genome; there are further complexities that can go into it, and we have not systematically developed or tested any strain typing capability to date. The noted message is to warn users that the sequence accession in the consensus genome does not necessarily reflect the true phylogeny of the organisms in the sample and should not be taken as such.
A: For each de novo assembled contig, we aim to find the best matching reference sequence rather than an entire reference genome. If the best match for one contig is a reference sequence from one subtype and the best match for another contig is a reference sequence from another subtype, then we will report them as such. This is not necessarily indicative of a mixed infection, reassortment, or error. It is usually reflective of how similar certain segments can be across different subtypes.
Influenza A viruses are classified into different subtypes based on the hemagglutinin (HA) and neuraminidase (NA) proteins, which are encoded by segments 4 and 6, respectively. Therefore, we recommend focusing on those segments to infer the subtype. If there is a sequence generated from segment 8 of an H3N2 genome but all the rest of the consensus sequences are generated from reference sequences from an H1N1 genome (indicating H1 and N1 subtypes), then the sample likely contains H1N1, not H3N2. One possible explanation is that segment 8 from H1N1 and segment 8 from H3N2 were both good matches for a particular contig but the one from H3N2 was a slightly better match and therefore chosen as final reference. Similarly, if there is a sequence generated from segment 4 of an H1N1 genome (indicating H1 subtype) and a sequence generated from segment 6 of an H5N1 genome (indicating N1 subtype), then the sample likely contains H1N1, not H5N1.
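The recommendation to infer the subtype from segments 4 and 6 only can be expressed as a short sketch. The input layout (a mapping from segment number to the subtype of the reference genome chosen for that segment) and the function name are assumptions made for illustration; the app does not produce this structure directly.

```python
import re
from typing import Dict, Optional

SUBTYPE_PATTERN = re.compile(r"^(H\d+)(N\d+)$")


def infer_subtype(reference_subtype_by_segment: Dict[int, str]) -> Optional[str]:
    """Combine the H type from segment 4 (HA) with the N type from segment 6 (NA).

    Other segments are ignored because they do not define the subtype.
    Returns None if either segment is missing or its label cannot be parsed.
    """
    ha_source = reference_subtype_by_segment.get(4)
    na_source = reference_subtype_by_segment.get(6)
    if ha_source is None or na_source is None:
        return None
    ha_match = SUBTYPE_PATTERN.match(ha_source)
    na_match = SUBTYPE_PATTERN.match(na_source)
    if ha_match is None or na_match is None:
        return None
    return ha_match.group(1) + na_match.group(2)


# Segment 4 from an H1N1 reference and segment 6 from an H5N1 reference;
# the H3N2 match on segment 8 does not affect the result.
print(infer_subtype({4: "H1N1", 6: "H5N1", 8: "H3N2"}))  # H1N1
```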
A: The denominator in the "Detected Amplicons" columns is based on the reference sequences selected based on de novo assembled contigs. Depending on the quality of the sample and/or reads, the assembler may not have enough data to generate a contig for some segments. Shorter segments are more likely to be missed. If only 7 segments are selected as final reference for short read alignment, then we expect 7 amplicons in total. If you believe that the sample should contain all 8 segments, you can download the contig FASTA file from our report page and submit it to to see if all 8 segments are present in the contig sequences.
One known issue is that chimeric reads can be generated during library preparation, which can lead to chimeric contigs, where the contig sequence contains sequences from more than one segment. This can result in missing an entire segment in the reference selection stage. A workaround may be to filter out chimeric reads from your FASTQ files before running the app.
Alternatively, you can force the app to use all 8 segments of a particular Influenza genome by providing a custom reference FASTA file with all 8 segment sequences and a custom reference BED file with the genomeName column set to the same value for every segment (e.g. Influenza A). This way, the app will not perform assembly and will use all 8 segments as the reference sequences for short read alignment.
A: The "Detected Amplicons" column shows the number of amplicons detected over the total number of amplicons expected for that genome. The percentage of amplicons detected is used to infer if the sample is of sufficient quality for variant calling. The "% callable bases" column shows the percentage of the selected reference genome whose bases are above the minimum read coverage depth for consensus sequence generation, which is computed independent of amplicon coordinates. Both metrics are useful to assess the quality of the sample, but the percentage of detected amplicons is used by the app after short read alignment to filter out low-titer samples and the percentage of callable bases is not.
A: An analysis can fail for many reasons, but one of the most common is a custom reference that is not formatted correctly. The requirements for the Custom Reference FASTA For Consensus Generation are:
Do not use Spaces in the file name, instead use an underscore "_"
Do not exceed 25 characters in the file name
File extension must be .fasta or .fa
Do not exceed the file size limitation: 16GB for a single file or 25GB for multiple files
Do not have duplicate entries
If providing a Custom Reference BED and/or Custom PCR Primer Definitions in BED format, the names in the first column of the BED file (chrom) must match the names that appear in the FASTA (the text after > and before the first whitespace character).
See https://help.basespace.illumina.com/manage-data/import-data for general guidelines on uploading data to BaseSpace. If you continue to have issues, contact techsupport@illumina.com
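Several of the requirements above can be checked locally before uploading. The following is a minimal sketch with hypothetical function names, not a tool provided by the app. It assumes that the 25-character limit counts the whole file name including the extension, that "duplicate entries" refers to duplicate sequence names, and that BED lines starting with '#' are comments. File size limits are not checked.

```python
import os
from typing import Iterable, List

ALLOWED_EXTENSIONS = (".fasta", ".fa")
MAX_FILE_NAME_LENGTH = 25  # assumed to count the whole name, extension included


def fasta_sequence_names(fasta_lines: Iterable[str]) -> List[str]:
    """Return each header's name: the text after '>' up to the first whitespace."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


def check_custom_reference(
    file_name: str,
    fasta_lines: Iterable[str],
    bed_lines: Iterable[str] = (),
) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if " " in file_name:
        problems.append("file name contains a space")
    if len(file_name) > MAX_FILE_NAME_LENGTH:
        problems.append("file name is longer than 25 characters")
    if os.path.splitext(file_name)[1] not in ALLOWED_EXTENSIONS:
        problems.append("file extension is not .fasta or .fa")

    seen = set()
    for name in fasta_sequence_names(fasta_lines):
        if name in seen:
            problems.append(f"duplicate sequence name: {name}")
        seen.add(name)

    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        chrom = line.split()[0]
        if chrom not in seen:
            problems.append(f"BED chrom not found in FASTA: {chrom}")
    return problems
```

Calling `check_custom_reference` with the lines of the FASTA and BED files (for example, from `open(path)`) returns an empty list when none of these particular problems are present; it does not guarantee that the analysis will succeed.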
Detected Amplicons (only if 'Experiment Type' is set to 'Amplicon'): The number of amplicons detected over the total number of amplicons expected for that sequence. The percentage of amplicons detected is used to determine if the sample is of sufficient quality for variant calling. See for more details.
Insufficient titer for VC will only be present for an amplicon experiment and indicates that the number of detected amplicons was below the minimum percentage (default 80%) required for reliable variant calling. See for more details.
The table in the Pangolin report is derived from the output of the Pangolin software. Please see the Pangolin documentation for more details: . Sequences with a bad Pangolin QC status are highlighted in yellow.
The table in the NextClade report is derived from the output of the NextClade software. Please see the NextClade documentation for more details: . Sequences with a bad NextClade QC status are highlighted in yellow.
Reference | Example | Required input | Note |
---|---|---|---|
A: Not necessarily. Your virus of interest may be present in the sample, but the app may not have generated a consensus sequence for it for various reasons. One reason could be that there are too few reads coming from that virus. Tools like DRAGEN Metagenomics can be used to characterize what is in the sample more broadly. Another reason could be that the virus in your sample is too divergent from the reference sequences used in the app. In such cases, we recommend downloading the contig FASTA file from our report page and submitting it to . If you do see a sequence that matches your virus of interest, you can provide that sequence to the app as a custom reference genome.
Save Results To
Project to run the analysis in
Required
Input Type
This app can accept samples or a project as input.
Samples: Select up to 60 individual samples, from any project(s)
Project: Select a single project containing up to 1536 samples. The app will analyze every FASTQ sample in that project (FASTQ datasets with QcStatus=QcFailed will be excluded)
Biosamples, Project
Biosamples
Required
Input Biosample
Select one or more samples to analyze. Either Input Samples or an Input Project can be selected - not both.
Required if Input Type is set to 'Samples'
Input Project
Select a Project containing up to 1536 samples to be analyzed. The analysis will process all samples from that project (FASTQ datasets with QcStatus=QcFailed will be excluded). There is currently no way to filter specific samples from a project. If the project contains more than 1536 Biosamples, the app will appear to launch, but then will immediately exit.
Required if Input Type is set to 'Project'
Experiment Type
This app can analyze samples generated from enrichment or amplicon sequencing experiments. Either can be selected - not both.
Enrichment, Amplicon
Enrichment
Required
Enrichment Panel
Select the enrichment panel used to generate the data. This determines the set of reference genomes the app uses. Different selection will produce different results. Choose 'Custom' to provide your own reference genomes below.
Viral Surveillance Panel (VSP)
Pan-Coronavirus Panel (Pan-Cov)
Respiratory Virus Oligo Panel (RVOP)
Custom
Required if Experiment Type is set to 'Enrichment'
Amplicon Primer Set
Select the virus genome to align to and primer set used to generate the data. Primer locations determine primer trimming locations and amplicon definitions. If processing SARS-CoV-2 data from a non-amplicon protocol, choose 'SARS-CoV-2, no primers'. Different selection will produce different results. Choose 'Custom' to provide your own reference genomes and primer set below
SARS-CoV-2, ARTIC v5.3.2 primers
SARS-CoV-2, ARTIC v4.1 primers
SARS-CoV-2, ARTIC v4 primers
SARS-CoV-2, ARTIC v3 primers
SARS-CoV-2, no primers
Influenza A, Universal primers
Influenza B, Universal primers
Influenza A and B, Universal primers
Chikungunya Virus, Grubaugh Lab primers
Chikungunya Virus, Illumina primers
Dengue Virus Serotype 1 (DENV1), 400-bp DengueSeq primers
Dengue Virus Serotype 1 (DENV1), Illumina primers
Monkeypox Virus (MPXV) Clade II, Grubaugh Lab primers
Respiratory Syncytial Virus (RSV), CDC primers
Respiratory Syncytial Virus (RSV), WCCRRI primers
Zika Virus, Grubaugh Lab primers
Custom
Required if Experiment Type is set to 'Amplicon'
Custom Reference: Custom Reference FASTA For Consensus Generation
Provide a custom reference FASTA to use for consensus generation. Either Enrichment Panel or Amplicon Primer Set must be set to Custom to enable this field.
Sequence names must be unique and must not contain any space. If there is any space in the FASTA header, the part before the first space is assumed to be the sequence name.
It is recommended to use only the following characters in sequence names: letters, numbers, underscore (_), hyphen (-), parentheses (( and )), and period (.). Otherwise, the sequence names may appear differently in the output.
It is recommended to keep sequence names short (e.g. NC_045512.2). If needed, full names can be provided in the genomeName column of Reference BED below.
FASTA file name must not include any space, must not exceed 25 characters, and must use extension .fasta or .fa
Required if either Enrichment Panel or Amplicon Primer Set is set to 'Custom'
Custom Reference: Custom Reference BED
Provide a custom reference BED to describe each sequence in Custom Reference FASTA. See Genome definition BED file format
Optional if Enrichment Panel or Amplicon Primer Set is set to 'Custom'. Otherwise not applicable
Custom Reference: Custom PCR Primer Definitions
Provide a file defining primers used in amplicon sequencing. See Primer definition file formats
Optional if Amplicon Primer Set is set to 'Custom'. Otherwise not applicable
Custom Reference: NextClade Datasets
Select one or more available NextClade Datasets from the drop-down menu below. Hold ctrl/command key to select multiple or deselect.
Optional if either Enrichment Panel or Amplicon Primer Set is set to 'Custom'. Otherwise not applicable
Pangolin
Run Pangolin on applicable consensus genomes
True, False
True
Optional if any Enrichment Panel is selected, any SARS-CoV-2 Amplicon Primer Set is selected, or 'Custom' is selected for Enrichment Panel or Amplicon Primer Set. Otherwise not applicable
NextClade
Run NextClade on applicable consensus genomes. If providing Custom Reference, select NextClade Datasets above to enable. Otherwise not applicable.
True, False
True
Optional if any Enrichment Panel is selected, if a genome with NextClade dataset available is selected for Amplicon Primer Set, or if 'Custom' is selected for Enrichment Panel or Amplicon Primer Set. Otherwise not applicable
Advanced Workflow Settings: Dehost
If checked: input FASTQs will be scrubbed of all human reads, before the Map/Align stage, so that the output BAM includes only viral reads.
True, False
True
Required
Advanced Workflow Settings: Trim Consensus Sequences
Remove any leading and trailing masked nucleotides from the resulting consensus sequences. Does not affect internal masked regions.
True, False
True
Required
Advanced Workflow Settings: Minimum percentage of amplicons with at least 90% coverage ≥ 1x to enable variant calling and consensus sequence generation
At low input concentrations, errors produced by the reverse transcriptase enzyme can propagate to high frequencies, leading to false positive sequence variants. Therefore, we attempt to infer the sample concentration from the amplicon coverage using this metric. If you wish to adjust this, we advise conducting internal studies to examine variant call reproducibility between replicates to determine a threshold that will produce acceptable quality levels for your application. Only applicable to amplicon sequencing where primers are defined. See Special considerations for amplicon sequencing with IMAP protocols
80.0%
Required if Experiment Type is set to 'Amplicon'
Advanced Workflow Settings: Minimum read coverage depth for consensus sequence generation
Genomic positions with read coverage below this threshold will be considered indeterminate and hard-masked in the final consensus sequence
10
Required
Advanced Workflow Settings: Minimum percentage of consensus sequence generated to label as confident
Consensus sequences with percentage of callable bases below this threshold will be considered 'low confidence'. Callability is defined based on minimum coverage depth for consensus sequence generation (above)
5.0%
Required
Additional DRAGEN Command Line Arguments: Additional DRAGEN Map/Align Command Line Arguments
USE AT YOUR OWN RISK. This field allows the user to add any DRAGEN command line argument, which can cause DRAGEN to:
Crash/fail/hang
Run for a very long time
Generate unexpected or invalid results
The app appends this input text to the DRAGEN command line after removing invalid characters (valid characters are alphanumeric plus ._-"'). Note that there is no validation of the contents. If you use this field and the appsession aborts, the output*.log appsession log file may help to understand the cause of the failure.
Optional
Additional DRAGEN Command Line Arguments: Additional DRAGEN Variant Calling (Somatic) Command Line Arguments
USE AT YOUR OWN RISK. This field allows the user to add any DRAGEN command line argument, which can cause DRAGEN to:
Crash/fail/hang
Run for a very long time
Generate unexpected or invalid results
The app appends this input text to the DRAGEN command line after removing invalid characters (valid characters are alphanumeric plus ._-"'). Note that there is no validation of the contents. If you use this field and the appsession aborts, the output*.log appsession log file may help to understand the cause of the failure.
Optional
Organisms to Report (VSP)
Only the checked organisms will be reported (consensus sequences and metrics). This will not affect the underlying bioinformatics pipeline, only which outputs are provided.
All VSP organisms
Optional if Enrichment Panel is set to 'VSP'. Otherwise, not applicable
Organisms to Report (RVOP)
Only the checked organisms will be reported (consensus sequences and metrics). This will not affect the underlying bioinformatics pipeline, only which outputs are provided.
All RVOP organisms
Optional if Enrichment Panel is set to 'RVOP'. Otherwise, not applicable
Single non-segmented genome | Zika | Primer set | |
Single segmented genome | All 8 segments from one Influenza A genome | Primer set | |
Multiple non-segmented genomes | Multiple genomes of Zika | Reference BED, Primer set | Reference BED must be provided to make it clear that the reference sequences are not segments in the same genome. Otherwise, the pipeline will assume this is a single segmented genome (above). If multiple genomes remain after reference selection, the genome with the best per-amplicon coverage will be considered for sample filtering. |
Multiple segmented genomes | A collection of Influenza A and B genomes | Reference BED, Primer set | Reference BED must be provided to specify which sequences belong to the same genome. Otherwise, the pipeline will assume this is a single segmented genome. If multiple genomes remain after reference selection, the genome with the best per-amplicon coverage will be considered for sample filtering. |
Launch the DRAGEN Microbial Enrichment Plus BaseSpace application. The application is found in the "DRAGEN" dropdown and the "Infectious Disease and Microbiology" dropdown in BaseSpace.
The DRAGEN Microbial Enrichment Plus app only supports Biosample inputs.
After choosing a name for the Analysis, choose either “Biosample” or “Project” as input type. Selecting “Project” will result in all biosamples in the selected project being analyzed.
Select the enrichment panel library from the dropdown. Only one panel can be selected per analysis. There is no need to open the "Custom panel specification" tab if selecting one of the pre-set panels. Note that the user must first select the Enrichment Panel for the appropriate downstream analysis options to populate.
Under "Enrichment Panel Microorganism Reporting List" the default option is "All microorganisms". Optional: the microorganism reporting list can be subset from the full set of organisms targeted by the selected panel if desired. If the user is analyzing with the RPIP or UPIP panel, they can select "Pre-defined specification" or "User-defined specification". For VSP, VSP V2 or RVOP/RVEK, there is no "Pre-defined specification" option and only "User-defined specification" is available. If selecting "User-defined specification", the following steps must be taken. We recommend pre-loading your "User-defined specification" file before starting the analysis.
Download the template with all the microorganisms as a starting point.
Remove any rows of organisms that are not wanted in the analysis/report
DO NOT remove any columns or alter their names
DO NOT delete the header row or alter the names of the header row
Optional but recommended: Rename the file to indicate it is altered. Note that in BaseSpace the file name is text before a period.
Upload the file to a BaseSpace Project
Once the data file is uploaded, select the "Dataset File(s)" under the "User-defined specification" tab.
Note: Once uploaded to a project, the same "User-defined specification" sheet can be reused across different analyses.
Analysis Options:
Perform read QC (Quality Control)
If checked, reads are pre-processed using quality metrics before analysis.
If unchecked, read quality metrics are calculated, but reads are not trimmed or filtered before analysis.
Report bacterial AMR markers only
If checked, only bacterial AMR markers but no microorganisms are reported
This option is disabled if RVOP, VSP, VSP V2 or Custom Panel is selected
This option is disabled if the 'Report bacterial AMR markers only when an associated microorganism is reported' option is enabled
Report bacterial AMR markers only when an associated microorganism is reported
If checked, detected bacterial AMR markers are reported if the bacterial AMR marker passes a minimum reporting threshold and one or more associated microorganisms are also detected and reported
If unchecked, detected bacterial AMR markers are reported if the bacterial AMR marker passes a minimum reporting threshold
This option is disabled if RVOP, VSP, VSP V2 or Custom Panel is selected
Report microorganisms and/or AMR markers that are below threshold
If checked, microorganisms and/or AMR markers below reporting thresholds are included in reports
If unchecked, only microorganisms and/or AMR markers above reporting thresholds are included in reports
This option is disabled if Custom Panel is selected
This option is disabled if the 'Report bacterial AMR markers only when an associated microorganism is reported' option is enabled
The 'Read classification sensitivity' default value is 5. This is a pre-filtering step only valid for VSP, VSP v2 or RVOP/RVEK. Decreasing the default may considerably slow the run time.
Pangolin is Enabled by default, except when analyzing the UPIP panel. Note that Pangolin only runs if SARS-CoV-2 is detected.
Nextclade is by default Disabled. This can be Enabled. Note that Nextclade only runs if the viruses listed below are detected:
Human immunodeficiency virus 1 (HIV-1)
Human respiratory syncytial virus A (HRSV-A)
Human respiratory syncytial virus B (HRSV-B)
Influenza A virus (H1N1pdm09)
Influenza A virus (H3N2)
Influenza A virus (H5N1)
Influenza A virus (H5N6)
Influenza A virus (H5N8)
Influenza B virus (B/Victoria/2/87-like)
Influenza B virus (B/Yamagata/16/88-like)
Monkeypox virus (MPV)
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)
Select an Internal Control (IC). The default is none. Note that RVOP/RVEK and VSP do not have internal controls in their panel. VSP V2 can be analyzed with the following ICs - Enterobacteria Phage T7, Escherichia virus T4, Escherichia Virus MS2, or Armored RNA Quant Internal Process Control. The IC must be added to the workflow before library preparation - otherwise select "None" for IC. The Internal Control concentration needs to be normalized and in the correct format.
Select the Project where the Analysis Output should be saved.
Users may define their own reference database through the Custom Panel analysis option. This option can be used to analyze samples enriched with standard Illumina Infectious Disease and Microbiology panels (RPIP, UPIP, VSP, VSP V2, RVOP/RVEK) if the user wishes to use a specific set of references rather than the built-in databases for each of these panels. This option also enables users to analyze samples enriched with other targeted enrichment kits through the DRAGEN Microbial Enrichment Plus app.
Launch the DRAGEN Microbial Enrichment Plus BaseSpace application. The application is found in the "DRAGEN" dropdown and the "Infectious Disease and Microbiology" dropdown in BaseSpace.
The DRAGEN Microbial Enrichment Plus only supports Biosample inputs.
After choosing a name for the Analysis, choose either “Biosample” or “Project” as input type. Selecting “Project” will result in all biosamples in the selected project being analyzed.
Select the "Custom Panel" option in the Enrichment Panel dropdown.
A custom reference FASTA must be uploaded to BaseSpace before launching the analysis. See https://help.basespace.illumina.com/manage-data/import-data for more information about importing files into BaseSpace. Information on formatting the FASTA file is provided here.
A BED file is optional. If provided, the BED file must also be uploaded to BaseSpace before launching the analysis. Information on formatting the BED file is provided here.
The only Analysis Option for a custom panel is the ability to turn read QC on or off.
Perform read QC (Quality Control)
If checked, reads are pre-processed using quality metrics before analysis.
If unchecked, read quality metrics are calculated, but reads are not trimmed or filtered before analysis.
Pangolin will run on custom reference sequences with at least 3% coverage that meet these naming conventions:
If only a FASTA file is provided, Pangolin will run on sequences that have a header containing either SARS-CoV-2 or NC_045512
If both a FASTA and BED file are provided, Pangolin will run on sequences where the first column (chrom) contains NC_045512 or the fourth column (genomeName) contains SARS-CoV-2
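The naming conventions above can be checked before launching an analysis. The sketch below assumes a case-sensitive substring match and treats coverage as a proportion between 0 and 1; the function name and parameters are hypothetical and do not correspond to any app API.

```python
from typing import Optional

MIN_COVERAGE = 0.03  # "at least 3% coverage"


def pangolin_eligible(fasta_header: str,
                      coverage: float,
                      bed_chrom: Optional[str] = None,
                      bed_genome_name: Optional[str] = None) -> bool:
    """Return True if a custom reference sequence matches the Pangolin rules.

    bed_chrom and bed_genome_name are the 1st and 4th BED columns for the
    sequence; leave both as None when no BED file is provided.
    """
    if coverage < MIN_COVERAGE:
        return False
    if bed_chrom is None:
        # FASTA only: the header must contain one of the two identifiers.
        return "SARS-CoV-2" in fasta_header or "NC_045512" in fasta_header
    # FASTA and BED: check the chrom and genomeName columns instead.
    return "NC_045512" in bed_chrom or "SARS-CoV-2" in (bed_genome_name or "")


print(pangolin_eligible("NC_045512.2 Wuhan-Hu-1", coverage=0.42))
print(pangolin_eligible("ref1", 0.42, bed_chrom="ref1", bed_genome_name="SARS-CoV-2"))
```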
Nextclade DOES NOT work for custom panel analysis - if this is "Enabled" it will not run.
For Custom Analysis, the only option for "Internal Control" is "NONE".
Select the Project where the Analysis Output should be saved.
Amplicon sequencing, especially of RNA viruses, requires additional bioinformatics processing to ensure maximum quality of the resulting data.
In RT-PCR, a reverse transcriptase enzyme first generates cDNA molecules using the RNA molecules in the sample as templates, before amplifying the cDNA sequences using a DNA polymerase enzyme during PCR. These amplified cDNA sequences are then further processed to generate the sequencing libraries. Both of these enzymes can potentially introduce an incorrect base into a sequence, generating a position where the resulting sequence does not match the sequence in the sample -- that is, an error.
Reverse transcriptases exhibit error rates that are multiple orders of magnitude higher than those of DNA polymerases.
When large numbers of nucleic acid molecules are present in a reaction, these individual misincorporation errors are largely uncorrelated and appear at very low frequencies, so that they are typically ignored by variant callers.
However, when the number of incoming nucleic acid molecules is small, such as for a low-titer virus sample, an error that occurs during the RT step or early in the PCR reaction can, as a result of sampling noise, be amplified to high frequencies in the resulting sequencing libraries. When the variant caller encounters such a position, it will be treated as a sequence variant, since it is a true sequence variant in the context of the library provided to the instrument. As a result, these artifactual sequence variants often have high allele frequency and very good quality scores, which makes them very difficult to detect and remove. This can result in a false positive variant call that, at a sufficiently high allele frequency, will also be incorporated into the consensus sequence. It is also possible for a true sequence variant to have its allele frequency depressed by this same process (if the error results in a reversion to the reference sequence), but this is much less common.
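A back-of-the-envelope calculation illustrates why input quantity matters. The sketch below makes a strong simplifying assumption that is not part of the software: every input molecule is amplified equally, so an error introduced on one template during the RT step ends up at a frequency of 1/N in the library. Real reactions add further sampling noise on top of this.

```python
def artifact_allele_frequency(input_molecules: int, molecules_with_error: int = 1) -> float:
    """Expected library frequency of an RT-step error under uniform amplification.

    Simplified model for illustration only (hypothetical helper).
    """
    if input_molecules <= 0:
        raise ValueError("input_molecules must be positive")
    if not 0 <= molecules_with_error <= input_molecules:
        raise ValueError("molecules_with_error must be between 0 and input_molecules")
    return molecules_with_error / input_molecules


# The same single error is prominent in a low-titer sample and
# negligible when many template molecules are present.
for n in (10, 1_000, 100_000):
    print(f"{n:>7} input molecules -> allele frequency {artifact_allele_frequency(n):.5f}")
```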
Since it is difficult to identify enzyme-introduced false variants after the fact, we instead take a pre-emptive approach to ensuring data quality. As noted above, sampling noise as a function of molecular abundance is the mechanism responsible for boosting of the frequency of individual enzymatic errors into artifactual variants, and therefore the magnitude of this effect is largely a function of the concentration of the nucleic acids in the reaction. Therefore, the software first attempts to determine whether there is sufficient sample material present before proceeding with variant calling and consensus sequence generation.
To determine this, the software takes advantage of the fact that the probability of each amplicon being amplified is a function of the nucleic acid concentration, with higher concentrations leading to a higher probability of amplification. By counting the observed proportion of amplicons with detectable sequence coverage, we can estimate this probability and compare it to an experimentally-determined threshold that corresponds to the minimum concentration needed to produce reliable variant calls.
To compute this, we calculate the number of amplicons with at least 1x coverage for at least 90% of the non-overlapping portion of the amplicon sequence. The 1x coverage threshold used here is fixed and independent of the minimum read coverage depth for consensus sequence generation which defaults to 10x. The number of amplicons that meet this threshold is then divided by the total number of amplicons in the experiment, which is the number of amplicons whose location falls in reference sequences selected for short read alignment. If the resulting fraction is at least 80%, the sample is considered to have sufficient material for accurate variant calling and the variant calling and consensus sequence generation steps are performed. If it is below this threshold, the sample is not processed further to avoid spurious variant calls. The user can override the 80% threshold in the "Minimum percentage of amplicons with at least 90% coverage ≥ 1x to enable variant calling and consensus sequence generation" control in the "Advanced Workflow Settings" section. See App Settings.
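The calculation described above can be sketched as follows. This is an illustration of the stated thresholds (1x depth, 90% of the non-overlapping amplicon region, 80% of amplicons), not the app's custom script; the function names and the input layout (a list of per-base depths for each amplicon) are assumptions.

```python
from typing import Dict, Sequence


def amplicon_passes(depths: Sequence[int],
                    min_depth: int = 1,
                    min_fraction: float = 0.90) -> bool:
    """True if at least min_fraction of positions reach min_depth.

    depths: per-base read depth over the non-overlapping portion of one amplicon.
    """
    if not depths:
        return False
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths) >= min_fraction


def sample_has_sufficient_material(amplicon_depths: Dict[str, Sequence[int]],
                                   min_passing_fraction: float = 0.80) -> bool:
    """True if enough amplicons pass to enable variant calling and consensus generation."""
    if not amplicon_depths:
        return False
    passing = sum(1 for d in amplicon_depths.values() if amplicon_passes(d))
    return passing / len(amplicon_depths) >= min_passing_fraction


# Hypothetical sample with five amplicons, four of which are well covered
example = {
    "amplicon_1": [12] * 100,
    "amplicon_2": [3] * 95 + [0] * 5,
    "amplicon_3": [1] * 100,
    "amplicon_4": [40] * 100,
    "amplicon_5": [0] * 100,
}
print(sample_has_sufficient_material(example))
```

Note that the per-position depth requirement here (1x) is separate from the minimum depth used for consensus generation, so a sample can pass this filter and still produce masked positions in the consensus sequence.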
A: The following enrichment panels can be analyzed using the app: VSP, VSP V2, RVOP/RVEK, RPIP, and UPIP enrichment panels. In addition, custom infectious disease and microbiology panels can also be analyzed. The Pan-Coronavirus Panel cannot be analyzed in DME+ using default settings; however, a custom coronavirus reference database can be used to analyze this panel. This application was not intended for non-infectious disease enrichment panels (such as exome) and was not intended for other types of workflows (such as shotgun metagenomics).
A: There is no compute cost to run the application. A Basic BSSH account is needed, and additional storage may require iCredits.
A: FASTQ files previously run through other apps can be re-analyzed using DME+. Results from other pipelines may not be identical to DME+, most notably because of the unique databases used in DME+. To perform a direct comparison between DME+ and a different analysis pipeline, we recommend running the generated consensus FASTA files through NCBI BLAST to assess sequence similarity.
A: Not necessarily. The microbe of interest may be present in the sample, but the app may not have reported it because the alignment metrics fell below the default prediction thresholds. If you suspect this may be the case, select the "Report microorganisms and/or AMR markers that are below threshold" option. If you know a priori which organisms are expected, the "Custom panel specification" can also be used.
A: While there may be quite a few causes for the analysis to fail, one of the most common causes is that the custom database was not formatted correctly. Below are requirements for the "Custom reference FASTA for consensus generation":
Do not use Spaces in the file name; instead use an underscore "_"
Do not exceed 25 characters in the file name
File extension must be .fasta or .fa
Do not exceed the file size limitation: 16GB for a single file
Do not have duplicate entries
If providing a Custom Reference BED and/or Custom PCR Primer Definitions in BED format, the names in the first column of the BED file (chrom) must match the names that appear in the FASTA (text after > and before the first whitespace character).
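Several of these requirements can be checked locally before uploading. The sketch below covers the file name rules and duplicate entries; it assumes the 25-character limit applies to the whole file name including the extension, which the text above does not specify. The helper names are hypothetical.

```python
from typing import Iterable, List, Set

MAX_NAME_LENGTH = 25  # assumption: counted over the full file name, extension included


def check_reference_filename(filename: str) -> List[str]:
    """Return a list of problems found in a custom reference FASTA file name."""
    problems = []
    if " " in filename:
        problems.append("file name contains a space; use an underscore instead")
    if len(filename) > MAX_NAME_LENGTH:
        problems.append(f"file name is longer than {MAX_NAME_LENGTH} characters")
    if not filename.endswith((".fasta", ".fa")):
        problems.append("file extension must be .fasta or .fa")
    return problems


def duplicate_sequence_names(fasta_lines: Iterable[str]) -> Set[str]:
    """Return sequence names (text after '>' up to the first whitespace) seen more than once."""
    seen: Set[str] = set()
    duplicates: Set[str] = set()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue
        parts = line[1:].split()
        if not parts:
            continue  # empty header, nothing to compare
        name = parts[0]
        if name in seen:
            duplicates.add(name)
        seen.add(name)
    return duplicates


print(check_reference_filename("my custom reference panel.txt"))
print(duplicate_sequence_names([">seq1 first copy", "ACGT", ">seq2", "ACGT", ">seq1 second copy"]))
```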
A: Ensure that the correct file with the intended microorganisms was uploaded and used. We recommend saving the updated microorganism file with a new name. Do not add any new columns or delete any columns from the Excel template. Similarly, do not change or remove any items in the Header row, but rows with microorganism names that you are not interested in can be deleted.
A: If analyzing with VSP, VSP V2 or RVOP panels, the default read classification sensitivity is set to 5. This setting is used as a pre-alignment filtering step. If fewer than 5 reads classify to the set of reference sequences belonging to a given organism, that organism will not be reported. If 5 or more reads classify to the set of reference sequences belonging to a given organism, analysis will proceed to a read alignment workflow, and alignment-based thresholds will be used to determine whether that organism is reported. Therefore, even if 5 or more reads map to a viral group, the final analysis will not report that virus if the alignment-based thresholds are not met. The read classification sensitivity can be set as low as 1 or as high as 1000. Lowering the read classification sensitivity threshold below 5 may significantly increase computational run time and is not recommended for most use cases.
A: The only infectious disease and microbiology panel that is not pre-set in this application is the Pan-CoV enrichment panel. To use the Pan-CoV enrichment panel with the DME+ app, select the "Custom panel" under the "Enrichment Panel" drop-down list and use a custom reference database. Otherwise, we recommend using the DRAGEN Targeted Microbial Enrichment BaseSpace app.
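The two-stage logic in this answer can be written out as a short sketch. The function names are hypothetical, and the alignment-based thresholds are represented by a single boolean because their details are not covered here.

```python
def proceeds_to_alignment(classified_reads: int, sensitivity: int = 5) -> bool:
    """Pre-alignment filter: does the organism have enough classified reads?"""
    if not 1 <= sensitivity <= 1000:
        raise ValueError("read classification sensitivity must be between 1 and 1000")
    return classified_reads >= sensitivity


def is_reported(classified_reads: int,
                meets_alignment_thresholds: bool,
                sensitivity: int = 5) -> bool:
    """An organism is reported only if it passes both stages."""
    return proceeds_to_alignment(classified_reads, sensitivity) and meets_alignment_thresholds


# 4 reads never reach alignment; 50 reads still need to meet alignment thresholds.
print(is_reported(4, meets_alignment_thresholds=True))
print(is_reported(50, meets_alignment_thresholds=False))
print(is_reported(50, meets_alignment_thresholds=True))
```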
A custom reference FASTA file is required to run the custom panel analysis. Sequence names in the custom reference FASTA file must be unique and should not contain any spaces. If there is any space in the FASTA header, the part before the first space is assumed to be the sequence name. It is recommended to use only the following characters in sequence names: letters, numbers, underscore (_), hyphen (-), parentheses (( and )), and period (.). Otherwise, the sequence names may appear different in the output. An example custom reference FASTA file is provided in the link below.
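To see how a header is interpreted under these rules, the following sketch extracts the sequence name and checks it against the recommended character set. The helper names and the regular expression are illustrative assumptions based on the list of characters above.

```python
import re

# Letters, numbers, underscore, hyphen, parentheses, and period
RECOMMENDED_NAME = re.compile(r"[A-Za-z0-9_().-]+")


def sequence_name(header_line: str) -> str:
    """Return the part of a FASTA header before the first whitespace, without '>'."""
    parts = header_line.lstrip(">").split(None, 1)
    return parts[0] if parts else ""


def uses_recommended_characters(name: str) -> bool:
    """True if the name contains only the recommended characters."""
    return RECOMMENDED_NAME.fullmatch(name) is not None


for header in (">NC_045512.2 Severe acute respiratory syndrome coronavirus 2",
               ">FluA|segment4"):
    name = sequence_name(header)
    print(name, uses_recommended_characters(name))
```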
The user may provide one or more reference genomes as the target for read alignment (and as the basis for generating consensus sequences). At a minimum, the user must provide a FASTA file containing the sequences of the reference genomes. To upload the reference FASTA file, go to the "Projects" tab and click on the folded paper icon (representing File), which will reveal a dropdown menu. Click on "Upload" and select "Files". Within the upload page, select "Other" format for FASTA files, and upload the file as a BioSample. Within the DRAGEN Microbial Enrichment Plus App, under 'Custom panel specification' use the 'Custom reference FASTA for consensus generation' control to select the uploaded FASTA file containing the reference sequences. The software will generate the required DRAGEN hash tables and other auxiliary files automatically, so there is no need to process the FASTA file with a separate app.
Optionally, a genome definition BED file may also be provided. The BED file tells the software more information about each sequence in the fasta file, such as a human-readable common name to be used in the reports. For multi-segment genomes such as Influenza, the genome definition BED file provides the segment name of each sequence and indicates that all the segments of a single genome belong together. To upload the BED file, go to the "Projects" tab and click on the folded paper icon (representing File), which will reveal a dropdown menu. Click on "Upload" and select "Files". Within the upload page, select "Other" format for BED files, and upload the file as a BioSample. Within the DRAGEN Microbial Enrichment Plus App, under 'Custom panel specification' use the 'Custom reference BED' dropdown to select the uploaded BED file containing the genome definition. See the following page for a description of the format of the genome definition BED file:
The file must be tab-delimited with at least 4 columns:
chrom: the sequence name as it appears in the FASTA
chromStart: start position (always set to 0)
chromEnd: end position (sequence length)
genomeName: name of the genome, target, or microorganism the sequence belongs to (e.g. Monkeypox virus clade II)
segmentName (optional): the name of the segment or gene (e.g. Segment 4 (HA)). Set to 'Full' if the sequence is the full genome
Sequence names must match between the FASTA file and BED file (as included in the "chrom" column), and the same set of sequences must appear in both files. If there are multiple viruses, their names should be unique. For example, if there are multiple Influenza genomes, they should not be labeled with the same virus name in the 4th column.
The BED file controls how sequences are labeled in the output JSON. If the custom reference FASTA file includes sequences from multiple segments, it is recommended to provide a BED file so that the segments are included under the results of that microorganism. Otherwise, each segment will be treated independently and not all of them may be used as reference.
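Because chromStart is always 0 and chromEnd is the sequence length, a genome definition BED file can be derived from the FASTA itself once genome and segment names are assigned. The sketch below shows one way to do this; the function names and the example sequences and labels are made up for illustration.

```python
from typing import Dict, Optional


def fasta_lengths(fasta_text: str) -> Dict[str, int]:
    """Map each sequence name (text before the first whitespace) to its length."""
    lengths: Dict[str, int] = {}
    name = None
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("FASTA header without a sequence name")
            name = parts[0]
            if name in lengths:
                raise ValueError(f"duplicate sequence name: {name}")
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def genome_definition_bed(lengths: Dict[str, int],
                          genome_names: Dict[str, str],
                          segment_names: Optional[Dict[str, str]] = None) -> str:
    """Build tab-delimited BED text: chrom, chromStart, chromEnd, genomeName, segmentName."""
    segment_names = segment_names or {}
    rows = []
    for chrom, length in lengths.items():
        genome = genome_names[chrom]  # every FASTA sequence needs a genome name
        segment = segment_names.get(chrom, "Full")
        rows.append("\t".join([chrom, "0", str(length), genome, segment]))
    return "\n".join(rows) + "\n"


# Hypothetical two-segment genome
fasta = ">segA example\nACGTAC\nGT\n>segB\nACGTACGTAC\n"
bed = genome_definition_bed(
    fasta_lengths(fasta),
    genome_names={"segA": "Example virus 1", "segB": "Example virus 1"},
    segment_names={"segA": "Segment 4 (HA)", "segB": "Segment 6 (NA)"},
)
print(bed, end="")
```

Giving both segments the same genomeName is what groups them under a single microorganism in the results, while distinct genomes should receive distinct names.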
DRAGEN Microbial Enrichment Plus offers a dedicated informatics solution with flexible analysis options for Illumina Infectious Disease and Microbiology target-capture enrichment panel kits. The app delivers easy-to-use, powerful secondary analysis of Illumina sequencing data, with workflows for sample QC, viral WGS (whole-genome sequencing), pathogen detection and quantification, and antimicrobial resistance (AMR) marker profiling. It also supports user-defined microorganism reporting thresholds and custom reference sequence analysis.
Product Page | Panel Summary |
---|---|
FASTQ files
User-defined microorganism reporting file in TSV or XLSX format (optional)
Custom reference FASTA file (if applicable)
Custom reference BED file (if applicable)
The [DRAGEN Microbial Enrichment Plus Demo Project] includes external control, contrived, and environmental samples prepared using the RPIP, UPIP, RVOP, VSP, and VSP V2 target-capture enrichment kits. Example custom reference sequence FASTA and BED files are also included.
(all panels except where noted, (*) indicates applicable to custom reference sequence analysis)
Read QC* (optional)
Dehosting* (human read removal)
Sample QC (sample composition and enrichment factor calculations. Internal control required to calculate the enrichment factor) – RPIP, UPIP, VSP V2
Microorganism classification (configurable sensitivity) - RVOP, VSP, VSP V2
Microorganism detection (alignment, consensus generation, variant calling)
Microorganism quantification (quantitative internal control required) – RPIP, UPIP, VSP V2
Microorganism reporting thresholds (proprietary algorithms or user-defined reporting logic)
Bacterial AMR marker analysis (nucleotide and protein alignment, consensus generation, variant calling and annotation) – RPIP, UPIP
Viral AMR marker analysis (variant calling and annotation) – RPIP, RVOP, VSP, VSP V2
Viral clade and lineage prediction (Pangolin, Nextclade) – RPIP, RVOP, VSP, VSP V2
Result filters (user-specified filters applied)
Reporting*
Analysis-level outputs: XLSX, HTML, ZIP
Sample-level outputs: JSON, HTML, FASTA (consensus sequences), VCF (viral variants)
DRAGEN Microbial Enrichment Plus is a secondary analysis tool for research use only. Further interpretation, statistical analysis, and downstream analysis of results may be necessary.
Panel | Template File |
---|---|
Do not delete or rename columns, and do not add additional columns.
To create a user-defined subset of on-panel organisms to target, do not change the prediction logic; instead, delete the rows of the unwanted microorganisms. For example, here is a reporting table of three bacteria with default reporting thresholds.
Reporting Name | Prediction Logic | Coverage | Median Depth | ANI | Aligned Read Count | RPKM | Kmer Read Count |
---|---|---|---|---|---|---|---|
Note that if sub-selecting to keep Influenza or Enterovirus (including Coxsackievirus, Poliovirus, and Echovirus) this analysis may require additional support.
A: Upload these files to a BaseSpace project before launching the DME+ app; you will then be able to select these files in the "Select Dataset File(s)" browser in the app. Please see . If you continue to have issues, contact techsupport@illumina.com
Step | Module/Script | Run |
---|---|---|
Reference reformatting/validation | custom script | If custom reference is provided |
Read QC | trimmomatic | Always |
Primer trimming (on FASTQ) | trimmomatic | If primer set exists |
Read dehosting | human_read_scrubber | If checked in Input Form |
Assembly | MEGAHIT | If reference FASTA and BED files imply more than one genome as reference |
Contig clustering | CD-HIT | If assembly ran |
Reference selection | custom script | If assembly ran, otherwise input reference database is used as is |
Primer alignment / reformatting | custom script | If primer set is provided. Primers are aligned to selected reference sequences if coordinates are not provided. Otherwise, primers mapping to selected reference sequences (based on the provided primer coordinates) are selected as final set of primers |
Map/Align | DRAGEN | If at least one reference sequence is generated |
Post-facto primer trimming (on BAM) | custom script | If Map/Align ran and primer set exists |
Sample filtering based on amplicon coverage | custom script | If Map/Align ran and primer set exists |
Variant calling | DRAGEN | If Map/Align ran and sample passed filter above |
Consensus sequence generation | custom script | If Map/Align ran and sample passed filter above |

Status | Level | Outcome | Output files (excluding log files) |
---|---|---|---|
Pipeline completed | Pipeline | Pipeline exits | All |
Custom files are not formatted correctly | Pipeline | Pipeline exits | None |
None of the primers provided in custom primer definition file align to selected reference sequences | Sample | Skip post-facto primer trimming and sample filtering based on amplicon coverage for this sample | All except for amplicon-related output files |
No reference found after assembly | Sample | Do not proceed to short read map alignment for this sample | Contig FASTA |
Insufficient amplicon coverage | Sample | Do not proceed to variant calling for this sample | Contig FASTA (if assembly was run) |
File | Type | Description |
---|---|---|
viral_consensus_genomes | Dataset | Directory containing single sample consensus genome(s) that were generated. Each fasta file within the directory contains a single viral genome. There is a Dataset for each sample with results |
Samplename.Panelname.report.html | html | Report visualization for a single sample: includes Sample Quality Control, Microorganisms present, Antimicrobial Resistance Markers (only for Influenza and bacteria/fungi) and User Options, separated by tabs |
Samplename.Panelname.report.json | json | See JSON report section for information on file contents |
Samplename.Panelname.viral_genomes_consensus.fa | fasta | Full-genome consensus sequence(s) for all viruses reported in the sample (nucleotide sequences) |
Samplename.Panelname.viral_targets_consensus.fa | fasta | Consensus sequence(s) for each targeted gene for all viruses reported in the sample (nucleotide sequences) |
Samplename.Panelname.bacterial_amr_nucleotide_consensus.fa | fasta | Consensus sequence(s) for all bacterial AMR markers reported in the sample (nucleotide sequences) |
Samplename.Panelname.bacterial_amr_protein_consensus.fa | fasta | Consensus sequence(s) for all bacterial AMR markers reported in the sample (amino acid sequences) |
Samplename.Panelname.viral_variants.vcf | vcf | Variant call file with viral consensus genome and closest matching reference genome |
Analysis Results | Dataset | Directory with three files including: Excel results for all samples, zip file with all files available for download, and html report with summary of all samples |
AnalysisIDnumber.Panelname.report.xlsx | Excel compatible file | Sample, Microorganism, AMR, and Variant results from all samples in the analysis. Further details on the fields in the Sample Report are included below the table. |
AnalysisIDnumber.Panelname.results.zip | Zip | Compressed Zip file containing all of the result files available for download |
report.html | html | Report visualization including Summary of all Sample Composition (reads off target/on target, Unclassified, Internal Controls, etc) and Summary Statistics of all samples within an analysis run |
Klebsiella pneumoniae | default |
Cryptococcus neoformans | default |
Acinetobacter baumannii | default |
The DRAGEN Microbial Enrichment Plus app outputs a comprehensive sample-level report.json file containing general metadata, version information, sample QC, microorganism, and AMR marker results, as well as detailed test information. Additional convenience file formats are generated by the DRAGEN Microbial Enrichment Plus app but do not contain novel content.
Top-Level Node
The top-level section of the report JSON contains general metadata and version information.
Field | Description |
---|---|
.qcReport Node
This section contains information about sample quality control (QC). The fields are relative to .qcReport
.qcReport.sampleComposition Node
This section contains information about the composition of the sample. The fields are relative to .qcReport.sampleComposition
.qcReport.internalControls Node
The value of the .qcReport.internalControls field is an array of objects containing name and RPKM information for each Internal Control. See the code block below for an example:
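As an illustration of this structure, the sketch below parses a small hand-written JSON fragment and lists the Internal Controls. The key names "name" and "rpkm" and the values shown are assumptions inferred from the description above, not copied from an actual report; check a real report.json for the exact keys.

```python
import json

# Hypothetical fragment of a report.json file (keys and values are assumptions)
example_fragment = """
{
  "qcReport": {
    "internalControls": [
      {"name": "Enterobacteria phage T7", "rpkm": 1520.4},
      {"name": "Escherichia virus MS2", "rpkm": 87.9}
    ]
  }
}
"""

report = json.loads(example_fragment)
internal_controls = report["qcReport"]["internalControls"]

# Print one line per Internal Control
for control in internal_controls:
    print(f'{control["name"]}: RPKM {control["rpkm"]}')
```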
.userOptions Node
This section gives information about analysis options specified by the user. The fields are relative to .userOptions
.targetReport.microorganisms[] Node
The value of the .targetReport.microorganisms[] field is an array of objects containing information about detected microorganisms. The following table describes one .targetReport.microorganisms[] object. The fields are relative to .targetReport.microorganisms[]
.targetReport.microorganisms[].relatedMicroorganisms[] Node
The value of the .targetReport.microorganisms[].relatedMicroorganisms[] field is an array of objects containing information about genetically related microorganisms. The following table describes one .targetReport.microorganisms[].relatedMicroorganisms[] object. The fields are relative to .targetReport.microorganisms[].relatedMicroorganisms[]
.targetReport.microorganisms[].variants[] Node
The value of the .targetReport.microorganisms[].variants[] field is an array of objects containing information about variants for all VSP V2 viruses and select RPIP WGS viruses (SARS-CoV-2 & FluA/B/C). The following table describes one .targetReport.microorganisms[].variants[] object. The fields are relative to .targetReport.microorganisms[].variants[]
.targetReport.amrMarkers[] Node
The value of the .targetReport.amrMarkers[] field is an array of objects containing information about detected bacterial AMR markers. The following table describes one .targetReport.amrMarkers[] object. The fields are relative to .targetReport.amrMarkers[]
.targetReport.amrMarkers[].variants[] Node
The value of the .targetReport.amrMarkers[].variants[] field is an array of objects containing information about variants for bacterial AMR markers with "protein variant" or "rRNA variant" model types. The following table describes one .targetReport.amrMarkers[].variants[] object. The fields are relative to .targetReport.amrMarkers[].variants[]
.targetReport.customReferences[] Node
This section contains information about custom reference detection results and is only present for custom database analyses. When only a custom reference FASTA file is provided (no BED file), each .targetReport.customReferences[] object contains information for a single reference sequence. When both a FASTA and BED file are provided, each .targetReport.customReferences[] object contains information for a single genome/microorganism, which can be a collection of one or more reference sequences. The fields are relative to .targetReport.customReferences[]
.targetReport.customReferences[].consensusSequences[] Node
The value of the .targetReport.customReferences[].consensusSequences[] field is an array of objects containing majority consensus sequence information for a single custom reference sequence. When only a FASTA file is provided (no BED file), there will be only one object in the array. When both a FASTA and BED file are provided, there may be more than one object in the array. The fields are relative to .targetReport.customReferences[].consensusSequences[]
.targetReport.customReferences[].variants[] Node
The value of the .targetReport.customReferences[].variants[] field is an array of objects containing information about a single detected variant. The fields are relative to .targetReport.customReferences[].variants[]
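Because report.json is plain JSON, per-sample results can be pulled into downstream scripts with standard tooling. The sketch below summarizes detected microorganisms using field names documented for the .targetReport.microorganisms[] node (class, name, coverage, ani, medianDepth, rpkm). The example report content and all values in it are invented for illustration, and the helper name is hypothetical.

```python
import json
from typing import Any, Dict, List

FIELDS = ("class", "name", "coverage", "ani", "medianDepth", "rpkm")


def summarize_microorganisms(report: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one row per detected microorganism, sorted by RPKM (highest first)."""
    organisms = report.get("targetReport", {}).get("microorganisms", [])
    rows = [{field: organism.get(field) for field in FIELDS} for organism in organisms]
    return sorted(rows, key=lambda row: row["rpkm"] or 0.0, reverse=True)


# Hypothetical report content; a real file would be read with json.load()
example_report = json.loads("""
{
  "targetReport": {
    "microorganisms": [
      {"class": "viral", "name": "Influenza A virus (H3N2)",
       "coverage": 0.91, "ani": 0.987, "medianDepth": 35, "rpkm": 1200.5},
      {"class": "viral", "name": "Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)",
       "coverage": 0.99, "ani": 0.999, "medianDepth": 410, "rpkm": 56000.0}
    ]
  }
}
""")

for row in summarize_microorganisms(example_report):
    print(f'{row["name"]}: coverage={row["coverage"]}, ANI={row["ani"]}, RPKM={row["rpkm"]}')
```

For a file on disk, the same function can be applied to the result of `json.load()` on the opened Samplename.Panelname.report.json file.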
Field | Description |
---|---|
Field | Description |
---|---|
Field | Description |
---|---|
Field | Description |
---|---|
Field | Description |
---|---|
Field | Description |
---|---|
Field | Description |
---|---|
Field | Description |
---|---|
Field | Description |
---|---|
Field | Description |
---|---|
Field | Description |
---|---|
.accession
Identifier used for the sample
.deploymentEnvironment
Environment in which the results were produced
.batchId
Identifier used for the batch of samples processed together
.analysisId
Identifier used for the analysis
.runId
Identifier used for the sequencing run
.controlFlag
Indicates whether the sample is a control. It is based on the ControlFlag field in the sample .tsv
and can be set to “POS”, “NEG”, “BLANK”, or “-”
.dragenVersion
DRAGEN release version
.analysisPipelineVersion
Analysis Pipeline release version
.testType
Type of test enrichment panel (e.g. RPIP, VSP V2, Custom)
.testVersion
Test panel release version
.testName
Name of the test panel, e.g. "Explify® Respiratory Pathogen ID/AMR Panel (RPIP) - Data Analysis Solution"
.testUse
Test use. "For Research Use Only. Not for use in diagnostic procedures"
.reportTime
Time the report was generated
.warnings
List of warnings encountered during the analysis
.errors
List of errors encountered during the analysis
.sampleQc
Sample QC information
.sampleQc.totalRawBases
Number of base pairs in sample before read QC processing
.sampleQc.totalRawReads
Number of reads in sample before read QC processing
.sampleQc.uniqueReads
Number of distinct reads in sample before read QC processing
.sampleQc.uniqueReadsProportion
Proportion of distinct reads in sample before read QC processing
.sampleQc.preQualityMeanReadLength
Average read length before read QC processing
.sampleQc.postQualityMeanReadLength
Average read length after read QC processing
.sampleQc.postQualityReads
Number of reads in sample after read QC processing
.sampleQc.postQualityReadsProportion
Proportion of post-quality reads in sample relative to total raw reads
.sampleQc.removedInDehostingReads
Number of host reads in sample removed during dehosting
.sampleQc.removedInDehostingReadsProportion
Proportion of host reads in sample removed relative to total raw reads
.sampleQc.entropy
Kmer entropy of reads after read QC processing
.sampleQc.gContent
Proportion of guanine (G) base calls in reads after read QC processing
.sampleQc.libraryQScore
Quality score of the library after read QC processing
.sampleQc.enrichmentFactor
Enrichment factor information (calculation requires detection of an appropriate Internal Control)
.sampleQc.enrichmentFactor.value
Enrichment factor value reflecting how well targeted regions were enriched
.sampleQc.enrichmentFactor.category
Enrichment factor category: "poor", "fair", "good", or "not calculated"
.readClassification
Proportion of reads classified to the following categories:
.readClassification.targetedMicrobial
Targeted microbial
.readClassification.targetedInternalControl
Targeted Internal Control
.readClassification.untargeted
Untargeted
.readClassification.ambiguous
More than one category
.readClassification.unclassified
No category
.readClassification.lowComplexity
Low complexity
.targetedMicrobial
Proportion of targeted microbial reads classified to the following sub-categories:
.targetedMicrobial.viral
Viral targeted
.targetedMicrobial.bacterial
Bacterial targeted
.targetedMicrobial.fungal
Fungal targeted
.targetedMicrobial.parasitic
Parasitic targeted
.targetedMicrobial.bacterialAmr
Bacterial AMR targeted
.untargeted
Proportion of untargeted reads classified to the following sub-categories:
.untargeted.viral
Viral untargeted
.untargeted.bacterial
Bacterial untargeted
.untargeted.fungal
Fungal untargeted
.untargeted.parasitic
Parasitic untargeted
.untargeted.bacterialAmr
Bacterial AMR untargeted
.untargeted.internalControl
Internal Control untargeted
.untargeted.human
Human untargeted
.viral
Proportion of viral reads classified to the following categories:
.viral.targeted
Viral targeted
.viral.untargeted
Viral untargeted
.viral.untargetedSubcategories
Proportion of viral untargeted reads classified to the following sub-categories:
.viral.untargetedSubcategories.panel
Viral panel members
.viral.untargetedSubcategories.phage
Viral phage
.viral.untargetedSubcategories.other
Viral other (not a panel member or phage)
.bacterial
Proportion of bacterial reads classified to the following categories:
.bacterial.targeted
Bacterial targeted
.bacterial.untargeted
Bacterial untargeted
.bacterial.untargetedSubcategories
Proportion of bacterial untargeted reads classified to the following sub-categories:
.bacterial.untargetedSubcategories.panel
Bacterial panel members
.bacterial.untargetedSubcategories.ribosomalDna
Bacterial ribosomal DNA (16S)
.bacterial.untargetedSubcategories.plasmid
Bacterial plasmids
.bacterial.untargetedSubcategories.other
Bacterial other (not a panel member, ribosomal DNA, or plasmid)
.fungal
Proportion of fungal reads classified to the following categories:
.fungal.targeted
Fungal targeted
.fungal.untargeted
Fungal untargeted
.fungal.untargetedSubcategories
Proportion of fungal untargeted reads classified to the following sub-categories:
.fungal.untargetedSubcategories.panel
Fungal panel members
.fungal.untargetedSubcategories.ribosomalDna
Fungal ribosomal DNA (18S)
.fungal.untargetedSubcategories.other
Fungal other (not a panel member or ribosomal DNA)
.parasitic
Proportion of parasitic reads classified to the following categories:
.parasitic.targeted
Parasitic targeted
.parasitic.untargeted
Parasitic untargeted
.parasitic.untargetedSubcategories
Proportion of parasitic untargeted reads classified to the following sub-categories:
.parasitic.untargetedSubcategories.panel
Parasitic panel members
.parasitic.untargetedSubcategories.ribosomalDna
Parasitic ribosomal DNA (18S)
.parasitic.untargetedSubcategories.other
Parasitic other (not a panel member or ribosomal DNA)
.human
Proportion of human reads classified to the following categories:
.human.untargeted
Human untargeted
.human.untargetedSubcategories
Proportion of human untargeted reads classified to the following sub-categories:
.human.untargetedSubcategories.ribosomalDna
Human ribosomal DNA
.human.untargetedSubcategories.codingSequence
Human coding sequence
.human.untargetedSubcategories.other
Human other (not ribosomal DNA or coding sequence)
.internalControl
Proportion of Internal Control reads classified to the following categories:
.internalControl.targeted
Internal Control targeted
.internalControl.untargeted
Internal Control untargeted
.microbialAndInternalControl
Proportion of Microbial and Internal Control reads classified to the following categories:
.microbialAndInternalControl.targeted
Microbial and Internal Control targeted
.microbialAndInternalControl.untargeted
Microbial and Internal Control untargeted
.bacterialAmr
Proportion of bacterial AMR reads classified to the following categories:
.bacterialAmr.targeted
Bacterial AMR targeted
.bacterialAmr.untargeted
Bacterial AMR untargeted
.quantitativeInternalControlName
The quantitative Internal Control used for microorganism absolute quantification (recommendation: Enterobacteria phage T7)
.quantitativeInternalControlConcentration
The quantitative Internal Control concentration (recommendation: 1.21 x 10^7 copies/mL of sample)
.readQcEnabled
Boolean field that indicates whether read QC (trimming and filtering based on quality and read length) was enabled
.readClassificationSensitivity
(VSPv2 only) Sensitivity threshold for classifying reads. Determines whether alignment should proceed for a microorganism and/or reference sequence
.class
Microorganism class ("viral", "bacterial", "fungal", "parasite")
.name
Name of microorganism
.coverage
Proportion of targeted microorganism reference sequence bases that appear in sample sequencing reads
.ani
Average nucleotide identity of consensus sequence to targeted microorganism reference sequences
.medianDepth
Median depth of sample sequencing reads aligned to targeted microorganism reference sequences, indicating the median number of times each targeted microorganism reference sequence base appears in sample sequencing reads
.condensedDepthVector
Read depth across the targeted microorganism reference sequences, condensed to 256 bins
.rpkm
Normalized representation of the number of sample sequencing reads aligned to targeted microorganism reference sequences (targeted Reads mapped Per Kilobase of targeted sequence per Million quality-filtered reads)
.alignedReadCount
Number of sample sequencing reads that aligned to targeted microorganism reference sequences
.kmerReadCount
(UPIP only) Number of sample sequencing reads classified to targeted microorganism reference sequences
.absoluteQuantityRatio
Numerical absolute quantification value
.absoluteQuantityRatioFormatted
Formatted absolute quantification value with units
.phenotypicGroup
Grouping indicating general association with normal flora, colonization, or contamination from the environment or other sources, as well as general association with disease
.associatedAmrMarkers
(Bacteria only) Information about the bacterial AMR markers associated with the microorganism
.associatedAmrMarkers.applicable
Boolean indicating whether one or more bacterial AMR markers are associated with the microorganism
.associatedAmrMarkers.detected
List of detected bacterial AMR markers associated with the microorganism
.associatedAmrMarkers.predicted
List of predicted bacterial AMR markers associated with the microorganism
.consensusGenomeSequences
(RPIP/VSP V2 viruses only) Information about the majority consensus genome (or segment) sequence
.consensusGenomeSequences.sequence
Consensus genome (or segment) sequence bases
.consensusGenomeSequences.referenceAccession
Accession of the reference genome (or segment) sequence
.consensusGenomeSequences.referenceDescription
Description of the reference genome (or segment) sequence
.consensusGenomeSequences.referenceLength
Length of the reference genome (or segment) sequence
.consensusGenomeSequences.maximumAlignmentLength
Longest contiguous alignment between consensus sequence and reference genome (or segment) sequence
.consensusGenomeSequences.maximumGapLength
Longest contiguous alignment gap (insertion or deletion) between consensus sequence and reference genome (or segment) sequence
.consensusGenomeSequences.maximumUnalignedLength
Longest section of the reference genome (or segment) sequence not aligned to by consensus sequence
.consensusGenomeSequences.coverage
Proportion of reference genome (or segment) sequence bases that appear in sample sequencing reads
.consensusGenomeSequences.ani
Average nucleotide identity of consensus sequence to reference genome (or segment) sequence
.consensusGenomeSequences.alignedReadCount
Number of sample sequencing reads that aligned to reference genome (or segment) sequence
.consensusGenomeSequences.medianDepth
Median depth of sample sequencing reads aligned to reference genome (or segment) sequence, indicating the median number of times each reference genome (or segment) sequence base appears in sample sequencing reads
.consensusGenomeSequences.targetAnnotation
List of targeted region annotations for the reference genome (or segment) sequence. Each annotation is a JSON object with the following fields: start (int), end (int), strand (string: "+", "-"), target_name (string), type (string)
.consensusGenomeSequences.condensedDepthVector
Read depth across the reference genome (or segment) sequence, condensed to 256 bins
.consensusTargetSequences
(RPIP viruses only) Information about the majority targeted region consensus sequences
.consensusTargetSequences.sequence
Consensus targeted region sequence bases
.consensusTargetSequences.name
Name of the targeted region
.consensusTargetSequences.referenceAccession
Accession of the targeted region reference sequence
.consensusTargetSequences.depthVector
Read depth across the targeted region reference sequence, not condensed
.predictionInformation
Information about microorganism prediction results
.predictionInformation.predictedPresent
Boolean indicating whether the microorganism passed its proprietary reporting logic algorithm
.predictionInformation.notes
List of notes about the prediction result
.predictionInformation.subpanels
List of pre-defined subpanels that the microorganism belongs to
.predictionInformation.relatedMicroorganisms
Array of objects with information about genetically related microorganisms. See below for details
.name
Name of related microorganism
.onPanel
Boolean indicating whether the related microorganism is a panel member
.kmerReadCount
(UPIP only) Number of sample sequencing reads classified to related microorganism reference sequences
.coverage
Proportion of related microorganism reference sequence bases that appear in sample sequencing reads
.ani
Average nucleotide identity of consensus sequence to related microorganism reference sequences
.alignedReadCount
Number of sample sequencing reads that aligned to related microorganism reference sequences
.referenceAccession
Accession of reference genome (or segment) sequence used for variant calling
.segment
(Segmented viruses only) Segment number of reference segment sequence
.ntChange
Nucleotide change associated with variant
.referencePosition
Variant position in reference genome (or segment) sequence
.referenceAllele
Reference allele at variant position
.variantAllele
Variant allele
.depth
Variant depth, indicating the number of times variant allele appears in sample sequencing reads
.alleleFrequency
Frequency of variant allele in sample sequencing reads
.class
Microorganism class ("bacterial")
.cardModelType
Bacterial AMR marker model type in the Comprehensive Antibiotic Resistance Database (CARD) ("homolog", "protein variant", "rRNA variant")
.cardGeneFamily
Bacterial AMR marker family in the Comprehensive Antibiotic Resistance Database (CARD)
.name
Bacterial AMR marker name
.cardName
Bacterial AMR marker name in the Comprehensive Antibiotic Resistance Database (CARD)
.ncbiName
Bacterial AMR marker name in the National Center for Biotechnology Information (NCBI)
.referenceAccession
Accession of the bacterial AMR marker reference sequence
.coverage
Proportion of bacterial AMR marker reference sequence residues that appear in sample sequencing reads (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)
.pid
Percent identity of consensus sequence aligned to bacterial AMR marker reference sequence (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)
.medianDepth
Median depth of sample sequencing reads aligned to bacterial AMR marker reference sequence, indicating the median number of times each bacterial AMR marker sequence residue appears in sample sequencing reads (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)
.rpkm
Normalized representation of the number of sample sequencing reads aligned to bacterial AMR reference sequence (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)
.alignedReadCount
Number of sample sequencing reads that aligned to bacterial AMR reference sequence (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)
.nucleotideConsensusSequence
Nucleotide consensus sequence bases
.proteinConsensusSequence
Protein consensus sequence residues
.nucleotideDepthVector
Read depth across the bacterial AMR marker nucleotide reference sequence, not condensed
.proteinDepthVector
Read depth across the bacterial AMR marker protein reference sequence, not condensed
.associatedMicroorganisms
Information about the microorganisms associated with the bacterial AMR marker
.associatedMicroorganisms.all
List of all microorganisms associated with the bacterial AMR marker
.associatedMicroorganisms.detected
List of detected microorganisms associated with the bacterial AMR marker
.associatedMicroorganisms.predicted
List of predicted microorganisms associated with the bacterial AMR marker
.predictionInformation
Information about bacterial AMR marker prediction results
.predictionInformation.predictedPresent
Boolean indicating whether the bacterial AMR marker passed its proprietary reporting logic algorithm
.predictionInformation.confidence
Confidence level of bacterial AMR marker prediction ("high", "medium", "low")
.predictionInformation.notes
List of notes about the prediction result
.category
Variant category ("Bacterial Variant; Known AMR")
.referenceSourceMicroorganism
Microorganism that reference sequence is associated with in NCBI
.comments
Comments about variant
.product
Protein product of gene
.ntChange
Nucleotide change associated with variant
.referencePosition
Variant position in reference sequence
.referenceAllele
Reference allele at variant position
.variantAllele
Variant allele
.depth
Variant depth, indicating the number of times variant allele appears in sample sequencing reads
.alleleFrequency
Frequency of variant allele in sample sequencing reads
.annotation
Type of change (e.g. "Nonsynonymous Variant")
.aaChange
Amino acid change associated with variant
.epistaticGroups
List of epistatic groups variant is associated with
.name
Name of custom reference sequence, accession, or genome/microorganism
.coverage
Proportion of custom reference sequence bases that appear in sample sequencing reads
.ani
Average nucleotide identity of consensus sequence to custom reference sequence or, if specified, collection of one or more custom reference sequences
.medianDepth
Median depth of sample sequencing reads aligned to custom reference sequence or, if specified, collection of one or more custom reference sequences, indicating the median number of times each custom reference sequence base appears in sample sequencing reads
.condensedDepthVector
Read depth across custom reference sequence or, if specified, collection of one or more custom reference sequences, condensed to 256 bins
.rpkm
Normalized number of sample sequencing reads aligned to custom reference sequence or, if specified, collection of one or more custom reference sequences (targeted Reads mapped Per Kilobase of targeted sequence per Million quality-filtered reads)
.alignedReadCount
Number of sample sequencing reads that aligned to custom reference sequence or, if specified, collection of one or more custom reference sequences
.consensusSequences
Array of objects with information about each consensus sequence
.variants
Array of objects with information about variants detected in custom reference sequence or, if specified, collection of one or more custom reference sequences
.sequence
Majority consensus sequence bases
.referenceAccession
Accession of custom reference sequence
.referenceDescription
Description of custom reference sequence
.referenceLength
Length of custom reference sequence
.coverage
Proportion of custom reference sequence bases that appear in sample sequencing reads
.ani
Average nucleotide identity of consensus sequence to custom reference sequence
.medianDepth
Median depth of sample sequencing reads aligned to custom reference sequence, indicating the median number of times each custom reference sequence base appears in sample sequencing reads
.depthVector
Read depth across custom reference sequence, not condensed
.alignedReadCount
Number of sample sequencing reads that aligned to custom reference sequence
.maximumAlignmentLength
Longest contiguous alignment between consensus sequence and custom reference sequence
.maximumGapLength
Longest contiguous alignment gap (insertion or deletion) between consensus sequence and custom reference sequence
.maximumUnalignedLength
Longest section of custom reference sequence not aligned to by consensus sequence
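As a rough illustration of how the .coverage and .medianDepth values relate to the uncondensed .depthVector, the sketch below recomputes both from a list of per-base depths. It assumes a base counts as covered when its depth is at least 1; any minimum-depth threshold the pipeline applies is not documented here, so treat the helper names and the threshold as hypothetical.

```python
from statistics import median


def coverage_from_depth(depth_vector):
    """Proportion of reference positions with depth >= 1 (assumed threshold)."""
    if not depth_vector:
        return 0.0
    covered = sum(1 for depth in depth_vector if depth > 0)
    return covered / len(depth_vector)


def median_depth(depth_vector):
    """Median of the per-base depths across the reference sequence."""
    if not depth_vector:
        return 0.0
    return float(median(depth_vector))


# Illustrative depths for an 8-base reference, not real pipeline output.
example_depths = [0, 0, 3, 5, 5, 8, 2, 0]
print(coverage_from_depth(example_depths))  # 0.625
print(median_depth(example_depths))         # 2.5
```

Values recomputed this way may differ slightly from the reported fields if the pipeline filters or masks positions before summarizing.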
.ntChange
Nucleotide change associated with the variant
.referenceAccession
Accession of custom reference sequence used for variant calling
.referencePosition
Variant position in custom reference sequence
.referenceAllele
Reference allele at variant position
.variantAllele
Variant allele
.depth
Variant depth, indicating the number of times variant allele appears in sample sequencing reads
.alleleFrequency
Frequency of variant allele in sample sequencing reads
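The .rpkm fields above are described as targeted Reads mapped Per Kilobase of targeted sequence per Million quality-filtered reads. The sketch below applies the conventional RPKM formula to that description. The function and parameter names are hypothetical, and the exact denominators used by the pipeline are an assumption based on the field description.

```python
def rpkm(aligned_read_count, target_length_bp, quality_filtered_read_count):
    """Reads per kilobase of targeted sequence per million quality-filtered reads."""
    if target_length_bp <= 0 or quality_filtered_read_count <= 0:
        raise ValueError("target length and read total must be positive")
    kilobases = target_length_bp / 1_000
    millions_of_reads = quality_filtered_read_count / 1_000_000
    return aligned_read_count / (kilobases * millions_of_reads)


# Illustrative numbers: 500 aligned reads, 10 kb of targeted sequence,
# 2 million quality-filtered reads in the sample.
print(rpkm(500, 10_000, 2_000_000))  # 25.0
```

Because the value is normalized by both target length and sequencing depth, it can be compared across targets of different sizes and across samples with different read totals.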
The Sample composition bar graphs show the broad compositional makeup of all samples in the analysis run. If an internal control is used, it appears in the bar graph. Untargeted reads typically comprise a large portion of the sample composition, particularly in samples where viruses make up a small portion of the overall genetic material, such as wastewater samples or viruses grown in host cells.
The table shows an overview of QC metrics for all the samples in the analysis run. Click the "Download CSV" link to download an Excel-compatible version. Hover over a column header for further details on that metric.
Individual sample composition can be further explored by clicking on "Report" under each sample name in the panel on the left. There are four tabs in the Sample Report: Sample Quality Control, Microorganisms, Antimicrobial Resistance Markers and User Options.
Version information provides a table with details of the application version, the analysis pipeline that was run, and the test version. We recommend running the latest version of the application.
Sample composition summarizes the broad compositional makeup of the individual sample.
Read Classification is a dynamic figure in which one can select the following options:
Targeted Microbial Reads - Relative (default): This bar graph shows the compositional makeup of the targeted types relative to targeted microbial reads. Specifically, it breaks down the percentage of targeted microbial reads belonging to Viral, Bacterial, Fungal, Parasite, and AMR. Hovering over an individual bar shows the read value.
Targeted Microbial Reads - Absolute: This bar graph shows the compositional makeup of the targeted types within all reads (i.e. types within the absolute number of all reads). Hovering over an individual bar shows the read value, which for a given type is the same value shown in Targeted Microbial Reads - Relative.
Untargeted Reads - Relative: This bar graph shows the compositional makeup of the untargeted types relative to untargeted microbial reads. It is possible for reads not in the panel to be sequenced when there is a high concentration of background DNA/RNA.
Untargeted Reads - Absolute: This bar graph shows the compositional makeup of the untargeted types within all reads (i.e. types within the absolute number of all reads). It is possible for reads not in the panel to be sequenced when there is a high concentration of background DNA/RNA.
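The difference between the Relative and Absolute views can be sketched as follows: the read count for each type is identical in both views, and only the denominator changes. The function name, dictionary keys, and counts are hypothetical and purely illustrative.

```python
def read_proportions(targeted_counts, total_reads):
    """Return read count, relative share, and absolute share per targeted type.

    relative: share of the targeted microbial reads only
    absolute: share of all reads in the sample
    """
    targeted_total = sum(targeted_counts.values())
    proportions = {}
    for read_type, count in targeted_counts.items():
        proportions[read_type] = {
            "reads": count,
            "relative": count / targeted_total if targeted_total else 0.0,
            "absolute": count / total_reads if total_reads else 0.0,
        }
    return proportions


# Illustrative counts: 1,000 targeted microbial reads out of 10,000 total reads.
counts = {"Viral": 600, "Bacterial": 300, "Fungal": 100}
for read_type, values in read_proportions(counts, total_reads=10_000).items():
    print(read_type, values)
```

In this example the Viral bar reports 600 reads in both views, but represents 60% of targeted microbial reads and 6% of all reads.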
Note that accurate sample composition results rely on selecting the correct enrichment panel. If you run an analysis pipeline that does not match the enrichment panel (e.g. VSP-enriched samples through the VSP V2 pipeline), you may see unenriched viruses reported as on target, even if they are not in the panel design. For example, if your sample has a high level of measles and was prepared with the VSP panel but analyzed with the VSP V2 pipeline, measles reads will be counted as "on target" despite measles not being in the VSP enrichment panel.
Also note that sample composition metrics are not available for VSP or RVOP.
Internal Controls provides a table with RPKM values if internal controls (ICs) were used in the wet lab workflow. If an internal control is selected for the analysis but was not included in the wet lab workflow/sequencing, then this information may show as 0 and may lead to inaccurate RPKM scores.
QC Metrics includes a table with defined metrics and their corresponding values. Dehosted Reads are human reads only. If another host is used, these would be removed as "off target" reads.
Microbes detected will be shown in the tables, separated by microbe type (virus, bacteria, etc). Each table includes whether the microbe is present, and various metrics. Hovering over each metric title will provide more details on the metrics. The best-match Reference Accession(s) for each virus is shown in the Viruses table. Non-segmented viruses will have a single accession number and segmented viruses will have one accession per segment. To see all of the accessions, click on the three dots (...) in the table and scroll down to see a list of all accession numbers.
Certain microbes show their "Phenotypic Group". Each number represents a predefined group.
Phenotypic Group 1: Microorganisms that are frequently considered part of the normal flora, colonizers, or contaminants but may be associated with disease in certain settings.
Phenotypic Group 2: Microorganisms that may represent normal flora, colonizers, or contaminants but that are frequently associated with disease.
Phenotypic Group 3: Microorganisms that are not generally considered part of the normal flora, colonizers, or contaminants and are generally considered to be associated with disease.
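When working with the JSON output rather than the report tables, a similar view can be approximated by keeping only microorganisms whose .predictionInformation.predictedPresent is true and grouping them by .phenotypicGroup. This is a minimal sketch: it assumes each microorganism is available as a dictionary with the documented field names, that .phenotypicGroup holds a number (or is absent), and it uses placeholder records rather than real results.

```python
from collections import defaultdict


def group_predicted_by_phenotype(microorganisms):
    """Group names of predicted-present microorganisms by phenotypic group."""
    groups = defaultdict(list)
    for organism in microorganisms:
        prediction = organism.get("predictionInformation", {})
        if not prediction.get("predictedPresent", False):
            continue  # skip microorganisms that did not pass reporting logic
        # None is used as the key when no phenotypic group is reported.
        groups[organism.get("phenotypicGroup")].append(organism["name"])
    return dict(groups)


# Placeholder records for illustration only.
records = [
    {"name": "Organism A", "phenotypicGroup": 1,
     "predictionInformation": {"predictedPresent": True}},
    {"name": "Organism B", "phenotypicGroup": 3,
     "predictionInformation": {"predictedPresent": True}},
    {"name": "Organism C", "phenotypicGroup": 3,
     "predictionInformation": {"predictedPresent": False}},
    {"name": "Organism D",
     "predictionInformation": {"predictedPresent": True}},
]
print(group_predicted_by_phenotype(records))
```

How the list of microorganism records is located inside the JSON file depends on the report structure, which is not shown in this section.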
Reference Coverage plots show the coverage depth across the consensus genomes of the viruses. A dropdown lets the user select which virus to examine. For segmented viruses (such as influenza), the segments are concatenated into a single coverage plot.
Viral Antimicrobial Resistance (AMR) results will be reported for Influenza viruses in the RPIP, RVOP/RVEK, VSP and VSP V2 analysis pipelines, and bacterial AMR marker results will be reported in the RPIP and UPIP analysis pipelines. Custom workflows and non-influenza viruses do not include AMR results. Note that this does not mean there are no AMR markers in the sample; rather, further analysis using a virus-specific tool is recommended if other AMR markers are of interest.
This table summarizes which settings the user selected at launch of the analysis.