
DRAGEN Microbial Amplicon App Documentation

Overview

DRAGEN Microbial Amplicon is a software application designed to analyze sequencing data from amplicon library preps (both DNA and RNA) on microbiological samples, with an emphasis on viruses. Illumina sequencing reads are processed to generate consensus sequences that represent a best estimate of the population of viral sequences in each sample. Where appropriate, these consensus sequences are further analyzed by the phylogenetic analysis tools Nextclade and/or Pangolin to provide an identification of the clade or lineage of each sequence.

Input

Data can be provided in one of the following ways:

Samples / biosamples with FASTQ datasets (see details in library preparation documents)
A project containing one or more samples / biosamples with FASTQ datasets
- All samples / biosamples in the selected project will be analyzed

Supported amplicon primer schemes

Chikungunya
- Grubaugh Lab
- Illumina
Dengue
- Serotype 1 - Illumina
- All serotypes - DengueSeq from Grubaugh Lab
Influenza A/B - Universal
Mpox
- Pan-clade - ARTIC
- Clade I - Illumina
- Clade II - Grubaugh Lab
RSV
- CDC
- WCCRRI
SARS-CoV-2 - ARTIC
- v5.4.2
- v5.3.2, v4.1, v4, v3
Zika - Grubaugh Lab

Custom genome and primer sets

Users can upload custom files to provide a user-defined set of reference genomes and primer definitions. Multiplexed amplicon panels targeting multiple organisms in the same reaction are supported.

Pipeline steps

Trim and filter reads using Trimmomatic
Remove off-target reads using DRAGEN v4.3.6 kmer classifier (for custom reference, remove human reads using a modified version of the SRA Human Read Scrubber tool v2.2.1)
For organisms with one default reference genome, skip this step. For organisms with multiple candidates, trim primer sequences in reads using Trimmomatic, perform assembly using MEGAHIT, cluster contigs using CD-HIT-EST, map contigs to candidate reference genomes using minimap2, then select reference genomes based on the mapping
Align reads to the default reference genome or selected reference genomes using DRAGEN v4.3.6
Trim primer sequences in aligned reads based on coordinates
Filter out samples with insufficient amplicon coverage
Call sequence variants from the alignments using DRAGEN Somatic v4.3.6 and apply them to the corresponding reference genomes to create consensus sequences
If applicable, run Nextclade/Pangolin on the consensus sequences

Output

Consensus sequences representing a best estimate of targeted sequences
Tables and plots reporting read counts, coverage, and Nextclade/Pangolin results

Currently supported platforms

BaseSpace Sequence Hub

Important Notes

The sequences are labeled according to the best match in the reference database, which is not exhaustive and the labels should not be taken as definitive for strain-typing. If strain typing is needed, the built-in Nextclade and/or Pangolin tools can be used for supported organisms. Alternatively, a BLAST or similar search of nucleotide databases may provide a more detailed match.
Because of sequence homology, it is possible that organisms with very few reads will result in the generation of a sequence not present (false positive). Although the de novo assembly step of this software largely mitigates such instances, sequences with very low horizontal coverage (< 5%) should be treated with caution.

How to start

Launch the DRAGEN Microbial Amplicon BaseSpace application.
Choose the analysis name and destination project to save results to.
Choose either Biosample or Project as input method. Selecting Project will result in all biosamples in the selected project being analyzed.
Choose the appropriate Amplicon Primer Set that matches the primer design used to prepare your samples or choose Custom to provide your own. See this page to learn more about the custom option.
If needed, uncheck the appropriate boxes to disable Pangolin and Nextclade analyses. These tools perform phylogenetic analysis and lineage assignment for SARS-CoV-2 (Pangolin) and other viruses (Nextclade). Depending on the chosen Amplicon Primer Set, these tools may not be applicable.
If needed, expand the Advanced Workflow Settings box to change default settings. Click on the "i" circle next to each setting for more information.
If needed, expand the Additional DRAGEN Command Line Arguments to provide additional arguments to default DRAGEN commands.
Click "Launch Application".

Custom reference

In addition to the built-in options, DRAGEN Microbial Amplicon supports the use of custom reference genomes and primer definitions. These files must be uploaded to a BaseSpace Project before they can be used. See the BaseSpace documentation for more information about importing files into BaseSpace.

In the app input form, select the 'Custom' option for 'Amplicon Primer Set'. Then expand the 'Custom Reference' settings to provide the following:

Custom Reference FASTA for Consensus Generation (required)
Custom Reference BED (optional)
Custom PCR Primer Definitions (optional)

Custom reference FASTA

If the 'Custom' option is selected for 'Amplicon Primer Set', the user must provide a custom FASTA file containing one or more reference sequences as the target for read alignment (and as the basis for generating consensus sequences). The software generates the required DRAGEN hash tables and other auxiliary files automatically, so there is no need to process the FASTA file with a separate app. Note that not all provided reference sequences in the FASTA file may be used for read alignment and consensus sequence generation.

Custom reference BED

Optionally, a reference BED file may be provided to add information about each reference sequence in the FASTA file, such as human-readable names to be used in the reports. For multi-segment genomes such as Influenza, this file assigns the segment name to each sequence, which allows the software to group individual segment sequences by genome. See the Reference BED file format section for the format of this file.

Custom PCR primer definitions

Optionally, a TSV file may be provided to define the primer sequences or binding locations, which are used for two purposes:

Primer sequences are trimmed from reads, which separates sequence that may come from the primers themselves (which we do not want) from sequence contributed by the biological sample (which we do want). This reduces reference bias that can incorrectly lower the observed allele frequency of true sequence variants in primer binding sites.
Primer locations are used to define the amplicons expected from PCR reactions. The read coverage within the unique (non-overlapping) amplicon regions is used to determine whether each amplicon is reliably detected. The percentage of detected amplicons is used to determine whether sufficient material exists to accurately call variants and generate consensus sequences from the sample.

See the PCR Primer definition file formats and Formatting rules sections for further information.

Nextclade datasets

Optionally, one or more Nextclade datasets can be selected to use for phylogenetic analysis of the consensus sequences generated from the samples. Every selected dataset will be applied to every consensus sequence generated in every sample.

Reference BED file format

A BED-like tab-separated value (TSV) file with no header row and with 4 or 5 columns:

accession: each sequence accession as it appears in the Custom Reference FASTA header
start: start position (always set to 0)
end: end position (sequence length)
genome: full name of the virus the sequence belongs to (e.g. Influenza A H1N1)
(optional) segment: how this sequence is labeled within the virus (e.g. Segment 4 (HA)). Set to 'Full' if the sequence is the full genome

Guidelines

This file affects how sequences are labeled in the output.
Sequence names must match those in Custom Reference FASTA. The same set of sequences must appear in both.
If there are multiple viruses, their names should be unique. For example, if there are multiple Influenza genomes, they should not be labeled with the same virus name in the 4th column.
If the Custom Reference FASTA includes sequences from multiple segments, it is strongly recommended to provide this BED file. Otherwise, each segment will be treated independently and not all of them may be used as reference.

Example

NC_012532.1 0   10794   Zika    Full
NC_007373.1 0   2341    Influenza A virus (H3N2)    Segment 1 (PB2)
NC_007372.1 0   2341    Influenza A virus (H3N2)    Segment 2 (PB1)
NC_007371.1 0   2233    Influenza A virus (H3N2)    Segment 3 (PA+PA-X)
NC_007366.1 0   1762    Influenza A virus (H3N2)    Segment 4 (HA)
NC_007369.1 0   1566    Influenza A virus (H3N2)    Segment 5 (NP)
NC_007368.1 0   1467    Influenza A virus (H3N2)    Segment 6 (NB+NA)
NC_007367.1 0   1027    Influenza A virus (H3N2)    Segment 7 (M1+M2)
NC_007370.1 0   890     Influenza A virus (H3N2)    Segment 8 (NS1+NEP)
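
The column layout and guidelines above lend themselves to a quick check before upload. The following is a minimal sketch and not part of the app: the function names `parse_reference_bed` and `check_against_fasta` are hypothetical, and the FASTA sequence lengths are assumed to be supplied by the caller as a dictionary. Columns are split on tabs only, because genome and segment names may contain spaces.

```python
def parse_reference_bed(text):
    """Parse reference BED text (tab-separated, no header) into a list of dicts."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) not in (4, 5):
            raise ValueError(f"line {line_no}: expected 4 or 5 columns, got {len(fields)}")
        accession, start, end, genome = fields[:4]
        if int(start) != 0:
            raise ValueError(f"line {line_no}: start must be 0")
        records.append({
            "accession": accession,
            "length": int(end),  # end equals the sequence length
            "genome": genome,
            "segment": fields[4] if len(fields) == 5 else None,
        })
    return records


def check_against_fasta(records, fasta_lengths):
    """Compare BED records with a dict of accession -> FASTA sequence length."""
    problems = []
    bed_names = {r["accession"] for r in records}
    # The same set of sequences must appear in both files.
    for name in sorted(bed_names ^ set(fasta_lengths)):
        problems.append(f"{name}: present in only one of BED and FASTA")
    for r in records:
        n = fasta_lengths.get(r["accession"])
        if n is not None and n != r["length"]:
            problems.append(f"{r['accession']}: BED end {r['length']} != FASTA length {n}")
    return problems
```

An empty list returned by `check_against_fasta` means the two files agree on sequence names and lengths; it says nothing about whether genome names are unique per virus, which still needs to be reviewed by hand.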

PCR Primer definition file formats

If primer coordinates are known (recommended)

Provide a BED file with at least 4 columns: accession, start, end, primerName. Additional columns can be included: pool, strand, sequence, but their order must be maintained.

For example, accession, start, end, primerName, pool for 5-column BED format:

seqX    0           15        primer1_LEFT   1
seqX    1745        1760      primer1_RIGHT  1
seqY    0           15        primer2_LEFT   2
seqY    1015        1030      primer2_RIGHT  2

And accession, start, end, primerName, pool, strand, sequence for 7-column BED format:

seqX    0           15        primer1_LEFT   1     +       GGGCAAACCTAAAGG
seqX    1745        1760      primer1_RIGHT  1     -       GTTATGTAAAGGTGC
seqY    0           15        primer2_LEFT   2     +       GGGCGAAACTAAAGG
seqY    1015        1030      primer2_RIGHT  2     -       GTTATGTAAAGGTGC
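
Because BED intervals are 0-based and half-open (see Formatting rules), the primer length is simply end minus start, and in the 7-column format it should equal the length of the sequence column. A minimal sketch of that consistency check follows; the helper names are hypothetical and this is not the app's own parser.

```python
PRIMER_BED_COLUMNS = ["accession", "start", "end", "primerName", "pool", "strand", "sequence"]


def parse_primer_bed_line(line):
    """Split one primer BED line (4 to 7 whitespace-separated columns) into a dict."""
    fields = line.split()
    if not 4 <= len(fields) <= 7:
        raise ValueError(f"expected 4 to 7 columns, got {len(fields)}")
    record = dict(zip(PRIMER_BED_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


def primer_length(record):
    # 0-based, half-open interval: the length is end - start.
    return record["end"] - record["start"]


def one_based_inclusive(record):
    # First and last base of the binding site in 1-based, inclusive coordinates.
    return record["start"] + 1, record["end"]


def sequence_matches_interval(record):
    # Only meaningful when the optional sequence column is present.
    return "sequence" not in record or len(record["sequence"]) == primer_length(record)
```

For the line `seqX 1745 1760 primer1_RIGHT 1 - GTTATGTAAAGGTGC`, the interval spans 15 bases, which matches the 15-nucleotide sequence.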

If primer coordinates are unknown

Option 1. One line per amplicon with 3 columns: ampliconName, forwardSequence, reverseSequence.

amplicon1      GGGCAAACCTAAAGG  GTTATGTAAAGGTGC
amplicon2      GGGCGAAACTAAAGG  GTTATGTAAAGGTGC

Option 2. One line per primer with 3 columns: primerName, sequence, pool.

primer1_LEFT      GGGCAAACCTAAAGG  1
primer1_LEFT_alt  GGGCGAAACTAAAGG  1
primer1_RIGHT     GTTATGTAAAGGTGC  1
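
All of the primer definition layouts above are whitespace-separated tables. As a rough sketch of how such a file could be read under the general rules listed in Formatting rules (lines starting with '#' are ignored, every remaining line has the same number of columns), consider the following. The function name `read_definition_rows` is hypothetical, and the app's actual parsing may differ.

```python
def read_definition_rows(text):
    """Return whitespace-split rows, skipping blank lines and '#' comment lines."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()  # any number of spaces may separate columns
        if rows and len(fields) != len(rows[0]):
            raise ValueError(
                f"line {line_no}: {len(fields)} columns, expected {len(rows[0])}"
            )
        rows.append(fields)
    return rows
```

A header such as `# primerName sequence pool` can therefore be kept at the top of the file without affecting the parsed rows.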

Formatting rules

General
- All text is case sensitive.
- Any line starting with '#' is ignored. This can be used to add a header line with column names.
- Every line must have the same number of columns and format (except those starting with '#').
- Any number of spaces can separate the columns. A value within a single column should not have any space.
BED format
- Per standard BED conventions, sequence coordinates are given as 0-based, half-open intervals, such that the start field (2nd column) contains the first nucleotide in the primer binding site and the last nucleotide in the primer binding site is the value in the end field (3rd column) minus 1.
- accession field must contain a sequence identifier that matches the header of the FASTA file containing the sequence that the coordinates are relative to.
- Multiple sequence identifiers (accession) are permitted within one file.
Primer name
- primerName must be unique and encode the name of the amplicon for which the primer is designed, the direction tag indicating which side of the amplicon, left or right, the primer belongs to, and an optional indicator that the primer is an alternative primer for that amplicon.
- In addition to _LEFT and _RIGHT, we permit _L and _R as direction tags in primerName. Any text after the direction tag should be separated by an underscore.
- Text in primerName before the direction tag is considered to be an amplicon identifier. Ensure that the text of the amplicon identifier is unique for that amplicon and that the direction tag occurs only once in primerName.
- Each amplicon must have at least one left and right primer (including alternative primers) associated with it.
- Alternative primers are used to bind to locations that avoid sequence variation in the default primer binding site that may disrupt hybridization. An amplicon may have an arbitrary number of alternative primers (as long as the primer name is unique), but most amplicons will have none. Alternative primers are indicated by the presence of the _alt after the direction tag in primerName, followed by optional text to distinguish between different alternative primers, such as a number.
- Examples of valid primer names:
  - MY_SEQUENCE_434_A_LEFT
  - virus1_L
  - amplicon_4934m_RIGHT_alt
  - amplicon_4934m_RIGHT_alt1
  - amplicon_4934m_R_altprimerB
- Examples of invalid primer names:
  - LEFT_MY_SEQUENCE_434_A
  - virus1_l
  - amplicon_4934m_RIGHT_L
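
The primer naming rules can be expressed as a small parser. The sketch below is one interpretation of the rules above and not the app's implementation: it splits the name on underscores, requires exactly one case-sensitive direction tag, takes everything before the tag as the amplicon identifier, and treats a suffix beginning with `alt` as an alternative primer. The name `parse_primer_name` is hypothetical.

```python
DIRECTION_TAGS = {"LEFT": "left", "L": "left", "RIGHT": "right", "R": "right"}


def parse_primer_name(name):
    """Return (amplicon, side, is_alt), or None if the name breaks the rules."""
    tokens = name.split("_")
    tag_positions = [i for i, tok in enumerate(tokens) if tok in DIRECTION_TAGS]
    if len(tag_positions) != 1:
        return None  # missing tag, wrong case, or tag used more than once
    i = tag_positions[0]
    amplicon = "_".join(tokens[:i])
    if not amplicon:
        return None  # the tag must follow an amplicon identifier
    suffix = "_".join(tokens[i + 1:])
    return amplicon, DIRECTION_TAGS[tokens[i]], suffix.startswith("alt")
```

Applied to the examples above, the five valid names parse successfully (for instance `amplicon_4934m_R_altprimerB` yields amplicon `amplicon_4934m`, side `right`, alternative), and the three invalid names return `None`.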

Output files

Note: Some files may not be generated depending on user inputs and pipeline outcome

Analysis level

Analysis_Results/<analysisId>.report.html displays tables and plots that summarize results from all samples combined.

Sample level

An output directory named after each sample contains <sampleName>.html, which displays tables and plots specific to the sample. The HTML files are identical to the ones displayed in BaseSpace Reports.

Each sample directory also contains the following subdirectories and output files:

amplicon/

<sampleName>.amplicon_coverage.log

Log from computing coverage metrics for each amplicon in a sample

consensus/

<sampleName>.hard_masked_consensus.fa

FASTA containing all hard-masked consensus sequences generated for a sample

contig/

map_align/

metrics/

nextclade/

pangolin/

variant_calling/

Understanding the BaseSpace Reports

Once the analysis completes, the "REPORTS" tab on BaseSpace lets users view the Summary, which combines results from all input samples, as well as an individual Sample Report for each sample.

Summary

The Summary contains at most three tabs: Summary Report, Nextclade Report, and Pangolin Report.

Summary Report

Metrics By Sample

This table provides a top-line summary of each of the analyzed samples.

At the top is the "Download CSV" button, which enables downloading the contents of the table as a text comma-separated value (CSV) file.

Next is the table itself, which contains one row per sample and the following columns:

Sample: Name of the BaseSpace sample analyzed
Status: Status of the sample analysis
Input Reads: Total number of reads in input FASTQs
Mapped Reads: Number of reads that map to reference sequences during short read alignment
Detected Amplicons: Proportion of amplicons detected out of the total expected for the sample, which is used to determine whether the sample is of sufficient quality for variant calling. See Special considerations for amplicon detection for more details.
Num Genomes: Number of genomes chosen during the reference selection stage
Virus: Name of the genome to which the reference sequence belongs
% Callable: Percentage of bases in the reference sequence with coverage above the minimum read coverage depth for consensus sequence generation (10x by default)
- Callable bases are those for which reliable variant calling can be performed and therefore for which the software can output a base call. They are defined as genomic positions with read coverage above the minimum read coverage depth for consensus sequence generation (10x by default).
- When generating consensus sequences, genomic positions below the threshold are hard-masked with "N" characters to avoid reference bias (inclusion of a reference base when the true base cannot be accurately determined).
- This percentage is calculated over the lengths of the reference genome(s), not the final consensus sequence(s) which may be trimmed.
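
To make the % Callable definition concrete, here is a minimal sketch that computes it from a list of per-position read depths and applies the corresponding hard-masking. It assumes that positions at or above the minimum depth are callable (consistent with the FAQ statement that positions below 10x are masked) and, for simplicity, that the sequence being masked has the same length as the reference (no insertions or deletions). The function names are hypothetical.

```python
def callable_stats(depths, min_depth=10):
    """Return (callable_bases, percent_callable) over the full reference length."""
    callable_bases = sum(1 for d in depths if d >= min_depth)
    return callable_bases, 100.0 * callable_bases / len(depths)


def hard_mask(sequence, depths, min_depth=10):
    """Replace every base whose depth is below min_depth with 'N'."""
    return "".join(
        base if depth >= min_depth else "N"
        for base, depth in zip(sequence, depths)
    )
```

Note that the percentage uses the reference length as the denominator, so trimming masked ends from the final consensus does not change it.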

Pre-processing Metrics

This stacked bar plot contains counts of reads that fall into the following categories:

Removed in Downsampling: Reads that were removed during downsampling because the user specified a downsampling target in the Input Form under Advanced Workflow Settings
Removed in QC: Reads removed as poor quality reads based on quality thresholds during pre-processing
Removed as Duplicate: Reads that were labeled as duplicate during short read alignment. Removal of them can be disabled in the Input Form under Advanced Workflow Settings
Removed in Trimming: Reads that were removed in the initial sequence-based primer trimming step and were excluded from further processing
Removed in De-hosting: Reads that were filtered out as human reads based on kmer-based classification during pre-processing.
- This improves the quality of downstream analysis and helps ensure that human sequences are not included in the output BAM files.
- This is applied only if 'Amplicon Primer Set' was set to 'Custom' in the Input Form.
- This can be disabled in the Input Form under Advanced Workflow Settings by unchecking "Remove off-target reads".
Removed as Off-target: Reads that were filtered out as off-target reads based on kmer-based classification during pre-processing
- Similar to de-hosting, this improves the quality of downstream analysis.
- Off-target is defined as not coming from the target organism, which is determined based on the 'Amplicon Primer Set' selection in the Input Form. For example, if "Influenza A and B, Universal Primers" option is selected, a kmer database generated from a large collection of publicly available Influenza sequences is used to separate reads likely coming from Influenza from the rest.
- This can be disabled in the Input Form under Advanced Workflow Settings by unchecking "Remove off-target reads".
Unmapped: Reads that were not aligned to any reference genomes
Mapped: Reads that were mapped to at least one reference genome

Nextclade Report (optional)

This tab contains tables reporting the results of the Nextclade analysis performed on the generated consensus sequences across all samples. Nextclade is run if the "Enable NextClade" box is checked on the Input Form and one of the following is true:

'Amplicon Primer Set' is set to a non-custom set with a reference with Nextclade dataset available and a valid consensus sequence was generated.
'Amplicon Primer Set' is set to 'Custom' and one or more Nextclade datasets are selected under 'Custom Reference'. In this case, each of the selected Nextclade datasets is applied to each consensus sequence generated for every sample. This may result in multiple Nextclade results for each consensus sequence.

Each table contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.

All content shown in the tab is derived from the output of the Nextclade software. Please see the Nextclade documentation for more details.

Pangolin Report (optional)

This tab contains tables reporting the results of the Pangolin analysis performed on the generated consensus sequences across all samples. Pangolin is run if the "Enable Pangolin" box is checked on the input form and one of the following is true:

'Amplicon Primer Set' is set to a non-custom set with SARS-CoV-2 as reference (e.g. SARS-CoV-2, ARTIC v5.4.2 primers) and a valid consensus sequence was generated
'Amplicon Primer Set' is set to 'Custom'. In this case, Pangolin is applied to every consensus sequence generated for the sample since the software assumes all of them to be potentially SARS-CoV-2 sequences.

Each table contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.

All content shown in the tab is derived from the output of the Pangolin software. Please see the Pangolin documentation for more details.

Sample Report

The Sample Report contains at most four tabs: Sample QC, Virus Metrics, Nextclade Report, and Pangolin Report.

Sample QC

This tab contains tables and plots summarizing the sample.

Sample Summary Metrics

This table reports summary metrics for the sample, such as Status and Detected Amplicons. See here for their definitions.

Pre-processing Metrics

This plot displays counts of reads that fall into different categories. See here for their definitions.

Sequence Alignment

This plot displays the number of reads that mapped to each reference sequence. If there is a single reference sequence (e.g. SARS-CoV-2), one bar is shown.

Sequence Alignment Metrics

This table provides the number of reads that mapped to each reference sequence along with the genome and segment names of the reference sequence. The "Download CSV" button enables downloading the contents of the table as a text comma-separated value (CSV) file.

Virus Metrics

Metrics by Virus

This table summarizes results for each viral genome generated in the sample with each row corresponding to a single viral genome. For segmented viruses like Influenza, a row will summarize information across multiple sequences generated for a single viral genome.

At the top is the "Download CSV" button, which enables downloading the contents of the table as a text comma-separated value (CSV) file.

The table itself contains rows for every viral genome with at least one sequence generated in the sample with the following columns:

Virus: Name of the viral genome
- For custom references, this will be the part of the FASTA header before the first whitespace character for the corresponding reference sequence if no Custom Reference BED file is provided. If a Custom Reference BED file is provided, this will be the value of the genome column
% Callable: Percentage of bases in the reference sequence with coverage above the minimum read coverage depth for consensus sequence generation (10x by default). This is computed across all sequences belonging to the viral genome. See here for more information.
Median Coverage: Median coverage value (in number of reads overlapping each position) over the entire reference genome (not just the generated consensus sequence).

Metrics by Sequence

This table summarizes the results for each sequence generated in the sample. For segmented viruses like Influenza, there are typically multiple rows with the same virus name. Otherwise, this table contains information similar to the Metrics By Virus table.

At the top is the "Download CSV" button, which enables downloading the contents of the table as a text comma-separated value (CSV) file.

Virus: Name of the virus genome
Segment: Name of the segment to which the reference sequence corresponds. For non-segmented viruses, this is typically set to "Full".
Accession: Unique identifier of the reference sequence (text before first space in FASTA header if custom reference FASTA was provided)
% Callable: Percentage of bases in the reference sequence with coverage above the minimum read coverage depth for consensus sequence generation (10x by default). See here for more information on this metric.
Callable Bases: Number of bases in the reference sequence with coverage above the minimum read coverage depth for consensus sequence generation (10x by default)
Median Coverage: Median coverage value (in number of reads overlapping each position) over the entire reference genome (not just the generated consensus sequence).
Consensus Length: Length of the final consensus, without leading and trailing masked bases if sequence trimming is enabled. Sequence trimming can be disabled in the Input Form under Advanced Workflow Settings.
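
As an illustration of the last two columns, the sketch below computes a median coverage from per-position depths and a consensus length with and without trimming of leading and trailing masked bases. It is a simplified reading of the definitions above with hypothetical function names, not the app's code.

```python
from statistics import median


def median_coverage(depths):
    """Median depth over every position of the reference genome."""
    return median(depths)


def consensus_length(masked_consensus, trim=True):
    """Length of the consensus, optionally without leading/trailing 'N' bases."""
    if trim:
        # str.strip removes only the runs of 'N' at the two ends;
        # masked bases inside the sequence are kept and counted.
        return len(masked_consensus.strip("N"))
    return len(masked_consensus)
```

For a consensus such as `NNNACGTNNACGNN`, trimming gives a length of 9, while the untrimmed length is 14.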

Consensus Coverage

Displays a trace of read coverage over each reference genome. On the top right is a drop-down menu that allows users to switch between genomes. The blue line represents the read coverage, with the coverage depth in log 10 of number of reads on the y-axis and the genomic position in the reference genome on the x-axis.

For segmented viruses like Influenza, coverage values for each segment are displayed in a horizontally stacked fashion. Grey blocks at the top show their boundaries.

Nextclade Report (optional)

This tab contains tables reporting the results of the Nextclade analysis performed on the generated consensus sequences in the sample. See here for more details.

Pangolin Report (optional)

This tab contains tables reporting the results of the Pangolin analysis performed on the generated consensus sequences in this sample. See here for more details.

Pipeline Logic

Steps

Step    Module/Script    Run    Outcomes

Special considerations for amplicon detection

Amplicon sequencing, especially of RNA viruses, requires additional bioinformatics processing to ensure maximum quality of the resulting data.

In RT-PCR, a reverse transcriptase enzyme first generates cDNA molecules using the RNA molecules in the sample as templates, before amplifying the cDNA sequences using a DNA polymerase enzyme during PCR. These amplified cDNA sequences are then further processed to generate the sequencing libraries. Both of these enzymes can potentially introduce an incorrect base into a sequence, generating a position where the resulting sequence does not match the sequence in the sample -- that is, an error.

Reverse transcriptases exhibit error rates that are multiple orders of magnitude higher than those of DNA polymerases.

When large numbers of nucleic acid molecules are present in a reaction, these individual misincorporation errors are largely uncorrelated and appear at very low frequencies, so that they are typically ignored by variant callers.

However, when there is a small number of incoming nucleic acid molecules, such as for a low-titer sample, an error that occurs during the RT step or early in the PCR reaction can, as a result of sampling noise, be amplified to high frequencies in the resulting sequencing libraries. The variant caller may treat this error as a sequence variant, since it is a true sequence variant in the context of the library provided to the instrument. As a result, these artifactual sequence variants often have high allele frequency and quality scores, which makes them very difficult to detect, and appear in the final consensus sequence. While less common, it is also possible for a true sequence variant to have its allele frequency depressed by this same process (if the error results in a reversion to the reference sequence).
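
A back-of-the-envelope calculation shows why template count matters. The sketch below assumes an idealized reaction in which every molecule is amplified equally and a single misincorporation occurs either at the RT step (cycle 0) or during a given PCR cycle; real reactions are noisier, so the numbers are only indicative. The function name and parameters are hypothetical.

```python
def artifact_allele_frequency(template_molecules, pcr_cycle=0):
    """Expected fraction of library molecules carrying one enzyme-introduced error.

    pcr_cycle=0 models an error made at the RT step; an error first made
    during PCR cycle c is inherited by only 1 / 2**c of that template's copies.
    """
    return 1.0 / (template_molecules * 2 ** pcr_cycle)
```

With 10,000 template molecules, an RT error is expected in about 0.01% of reads and is ignored by the variant caller. With only 5 template molecules, the same error is expected in about 20% of reads before any sampling noise is considered, which is high enough to be called as a variant.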

Since it is difficult to identify enzyme-introduced false variants after the fact, we instead take a preemptive approach of determining if there is sufficient sample material present before variant calling and consensus sequence generation in order to ensure data quality.

Specifically, the app calculates the number of amplicons with at least 1x coverage for at least 90% of the non-overlapping portion of the amplicon sequence. The 1x coverage threshold used here is fixed and independent of the minimum read coverage depth for consensus sequence generation which defaults to 10x. The number of amplicons that meet this threshold is then divided by the total number of amplicons expected in the experiment, which is the number of amplicons whose location falls in reference sequences selected for short read alignment. If the resulting percentage is at least 80%, the sample is considered to have sufficient material for accurate variant calling. If it is below this threshold, the sample is not processed further to avoid spurious variant calls. The user can override the 80% threshold in the "Minimum percentage of amplicons with at least 90% coverage ≥ 1x to enable variant calling and consensus sequence generation" control in the "Advanced Workflow Settings" section.
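
The detection rule described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the app's implementation: amplicons are given as hypothetical (name, start, end) tuples with 0-based, half-open coordinates on a single reference, `depths` is a per-position list of read depths, and the non-overlapping portion of an amplicon is taken to be the positions covered by no other amplicon.

```python
def unique_positions(amplicons):
    """Map each amplicon name to the positions covered by that amplicon only."""
    counts = {}
    for _, start, end in amplicons:
        for pos in range(start, end):
            counts[pos] = counts.get(pos, 0) + 1
    return {
        name: [p for p in range(start, end) if counts[p] == 1]
        for name, start, end in amplicons
    }


def detected_amplicons(amplicons, depths, min_fraction=0.90):
    """Names of amplicons with >= 1x depth over >= 90% of their unique region."""
    detected = []
    for name, positions in unique_positions(amplicons).items():
        if not positions:
            continue  # no unique region: treated as not detected in this sketch
        covered = sum(1 for p in positions if depths[p] >= 1)
        if covered / len(positions) >= min_fraction:
            detected.append(name)
    return detected


def sample_passes(n_detected, n_expected, min_percent=80.0):
    """True if enough amplicons were detected to proceed to variant calling."""
    return 100.0 * n_detected / n_expected >= min_percent
```

For example, with three expected amplicons of which two are detected, the detected percentage is about 66.7%, which is below the default 80% threshold, so the sample would not proceed to variant calling.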

The threshold above was determined through data analysis using an experimentally determined threshold corresponding to the minimum concentration needed to produce reliable variant calls. We assumed that a higher nucleic acid concentration leads to a higher probability of amplifying each amplicon.

Frequently Asked Questions (FAQ)

Q: The majority of my reads are removed in preprocessing as off-target reads. Is the amplicon panel working?

A: For many sample types, especially clinical, wastewater, or environmental samples, viral RNA or DNA makes up a tiny proportion of the total nucleic acids, with the remainder dominated by host or bacteria/archaea. Therefore, even with a dramatic increase of abundance over what you would obtain without targeted sequencing, the percentage of targeted reads can still be low.

Q: How is the default minimum read coverage depth of 10x applied? Is that averaged across the entire sequence?

A: The 10x threshold is applied per-nucleotide. Any positions below 10x coverage will be hard-masked with "N".

Q: Header lines in the consensus sequence FASTA files say “Panel reference sequences are not necessarily comprehensive and should not be used for strain typing.” What does this mean?

A: This message is to warn users that the sequence accession in the consensus genome does not necessarily reflect the true phylogeny of the organisms in the sample and should not be taken as such.

Because the app uses a limited set of reference sequences, the accession in the consensus sequence FASTA file headers (and coverage plots, etc) merely reflects the best match from that limited set. There may be sequences in RefSeq or elsewhere that are a closer match.

Q: Why does the "Detected Amplicons" column show 7 in the denominator for an Influenza genome when there should be 8 amplicons in total?

A: The denominator in the "Detected Amplicons" column is based on the reference sequences selected based on de novo assembled contigs. Depending on the quality of the sample and/or reads, the assembler may not have enough data to generate a contig for some segments. Shorter segments are more likely to be missed. If only 7 segments are selected as reference for short read alignment, then we expect 7 amplicons in total. If you believe that the sample should contain all 8 segments, you can download the contig FASTA file from our report page and submit it to a nucleotide search tool (e.g. BLAST) to see if all 8 segments are present in the contig sequences.

One known issue is that chimeric reads can be generated during library preparation, which can lead to chimeric contigs, where the contig sequence contains sequences from more than one segment. This can result in missing an entire segment in the reference selection stage. A workaround may be to filter out chimeric reads from your FASTQ files before running the app.

Alternatively, you can force the app to use all 8 segments of a particular Influenza genome by providing a custom reference FASTA file with all 8 segment sequences and a custom reference BED file with the genome column set to the same value for every segment (e.g. Influenza A). This way, the app skips assembly and uses all 8 segments as the reference sequences for short read alignment.

Q: What is the difference between the "Detected Amplicons" and "% callable bases" columns?

A: The "Detected Amplicons" column shows the number of detected amplicons over the total number of expected amplicons. The percentage of detected amplicons is used to infer if the sample is of sufficient quality for variant calling. The "% callable bases" column shows the percentage of the selected reference genome whose bases are at or above the minimum read coverage depth for consensus sequence generation, which is computed independent of amplicons.

Both metrics are useful to assess the quality of the sample, but the percentage of detected amplicons is used by the app after short read alignment to filter out low-titer samples and the percentage of callable bases is not.

Q: I don't see the virus I'm interested in listed in the "Genomes generated" column. Does that mean the virus is not present?

A: Not necessarily. Your virus of interest may be present in the sample, but the app may not have generated a consensus sequence for it for various reasons.

One reason could be that there are too few reads coming from that virus. Tools like DRAGEN Metagenomics can be used to characterize what is in the sample more broadly.

Q: I cannot find any contig FASTA files in the output, why?

A: De novo assembly is performed only if there are multiple candidate reference genomes, which is typically when there are multiple serotypes, strains, subtypes, or clades. This currently applies to the following Amplicon Primer Set options:

Dengue Virus All Serotypes, 400-bp DengueSeq primers
Influenza A, Universal primers
Influenza B, Universal primers
Influenza A and B, Universal primers
Mpox All Clades, 2500-bp ARTIC-INRB v1 primers
Respiratory Syncytial Virus (RSV), CDC primers
Respiratory Syncytial Virus (RSV), WCCRRI primers

If a custom reference FASTA file is provided, assembly is performed if there are multiple sequences in the file. If a custom reference BED file is also provided, assembly is performed only if the BED file defines multiple genomes for the same segment (or multiple non-segmented full genomes). Otherwise, all sequences in the custom reference FASTA file are used as references for short read alignment.

Q: I see both consensus sequence FASTA files and contig FASTA files. Which one is better?

A: In most cases, the consensus sequence FASTA file. Contig sequences are useful if the reference sequences used for consensus sequence generation were not the best match. However, they should be used with caution because, unlike consensus sequences, their base calls are not filtered by coverage or quality.

Q: I am using the custom analysis workflow and my analysis aborted or shows an error, why?

A: The most likely cause is that the custom database was not formatted correctly. The requirements for the Custom Reference FASTA For Consensus Generation are:

Do not use spaces in the file name; use an underscore "_" instead
Do not exceed 25 characters in the file name
File extension must be .fasta or .fa
Do not exceed the file size limitation: 16GB for a single file or 25GB for multiple files
Do not have duplicate entries
If providing a Custom Reference BED and/or Custom PCR Primer Definitions in BED format, the names in the first column of the BED file (accession) must match the names that appear in the FASTA (text after > and before the first whitespace character).
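
Some of these requirements can be checked locally before upload. The Python sketch below is one possible pre-flight check, not part of the app. All function names are hypothetical, and it makes two assumptions that the text above does not confirm: the 25-character limit is applied to the full file name including the extension, and "duplicate entries" is read as duplicate sequence identifiers. The file size limits are not checked here.

```python
def check_fasta_name(file_name):
    """Return a list of problems with the FASTA file name (empty if none)."""
    problems = []
    if " " in file_name:
        problems.append("contains spaces")
    if len(file_name) > 25:  # assumption: the limit includes the extension
        problems.append("longer than 25 characters")
    if not file_name.endswith((".fasta", ".fa")):
        problems.append("extension is not .fasta or .fa")
    return problems


def fasta_ids(fasta_lines):
    """Identifiers: text after '>' and before the first whitespace character."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.append(fields[0])
    return ids


def duplicated(ids):
    """Identifiers that occur more than once, in order of first repeat."""
    seen, dupes = set(), []
    for identifier in ids:
        if identifier in seen and identifier not in dupes:
            dupes.append(identifier)
        seen.add(identifier)
    return dupes


def unmatched_accessions(bed_lines, ids):
    """First-column BED accessions that do not appear among the FASTA ids."""
    known, missing = set(ids), []
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and '#' comment lines
        accession = line.split()[0]
        if accession not in known and accession not in missing:
            missing.append(accession)
    return missing


fasta = [">seqX segment 1", "ACGT", ">seqY", "GGCC"]
bed = ["# accession start end primerName", "seqX 0 15 primer1_LEFT", "seqZ 0 15 primer2_LEFT"]
print(check_fasta_name("my_reference.fasta"))       # []
print(unmatched_accessions(bed, fasta_ids(fasta)))  # ['seqZ']
```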

PCR Primer definition file formats

If primer coordinates are known (recommended)

Provide a BED file with at least 4 columns: accession, start, end, primerName. Additional columns (pool, strand, sequence) can be included, but they must appear in this order.

For example, accession, start, end, primerName, pool for 5-column BED format:

seqX    0           15        primer1_LEFT   1
seqX    1745        1760      primer1_RIGHT  1
seqY    0           15        primer2_LEFT   2
seqY    1015        1030      primer2_RIGHT  2

And accession, start, end, primerName, pool, strand, sequence for 7-column BED format:

seqX    0           15        primer1_LEFT   1     +       GGGCAAACCTAAAGG
seqX    1745        1760      primer1_RIGHT  1     -       GTTATGTAAAGGTGC
seqY    0           15        primer2_LEFT   2     +       GGGCGAAACTAAAGG
seqY    1015        1030      primer2_RIGHT  2     -       GTTATGTAAAGGTGC
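
As a minimal sketch of how such lines could be read, the Python snippet below parses one primer BED line with 4 to 7 columns and shows the coordinate arithmetic implied by the 0-based, half-open convention described under Formatting rules. The `Primer` type and the function names are hypothetical and not part of the app.

```python
from typing import NamedTuple, Optional


class Primer(NamedTuple):
    accession: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str
    pool: Optional[str] = None
    strand: Optional[str] = None
    sequence: Optional[str] = None


def parse_primer_bed_line(line):
    """Parse one whitespace-separated primer BED line (4 to 7 columns)."""
    fields = line.split()
    if not 4 <= len(fields) <= 7:
        raise ValueError(f"expected 4 to 7 columns, found {len(fields)}")
    accession, start, end, name = fields[:4]
    start, end = int(start), int(end)
    if not 0 <= start < end:
        raise ValueError("start must be >= 0 and smaller than end")
    # Optional columns keep their fixed order: pool, strand, sequence.
    return Primer(accession, start, end, name, *fields[4:])


def one_based_inclusive(primer):
    """Convert the half-open interval to 1-based, inclusive positions."""
    return primer.start + 1, primer.end


p = parse_primer_bed_line(
    "seqX    1745        1760      primer1_RIGHT  1     -       GTTATGTAAAGGTGC"
)
print(p.end - p.start, len(p.sequence))  # 15 15
print(one_based_inclusive(p))            # (1746, 1760)
```

With half-open intervals, `end - start` equals the primer length, which gives a quick consistency check against the sequence column when it is present.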

If primer coordinates are unknown

Option 1. One line per amplicon with 3 columns: ampliconName, forwardSequence, reverseSequence.

amplicon1      GGGCAAACCTAAAGG  GTTATGTAAAGGTGC
amplicon2      GGGCGAAACTAAAGG  GTTATGTAAAGGTGC

Option 2. One line per primer with 3 columns: primerName, sequence, pool.

primer1_LEFT      GGGCAAACCTAAAGG  1
primer1_LEFT_alt  GGGCGAAACTAAAGG  1
primer1_RIGHT     GTTATGTAAAGGTGC  1
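
The two options carry similar information in different layouts. The Python sketch below reads either layout using the general rules listed under Formatting rules ('#' lines ignored, whitespace-separated columns, same column count on every line) and shows how Option 1 rows could be expanded into per-primer names. The function names are hypothetical. Mapping the forward sequence to `_LEFT` and the reverse sequence to `_RIGHT`, and skipping blank lines, are assumptions made for illustration.

```python
def read_rows(lines, expected_columns=3):
    """Split non-comment lines into columns and check the column count."""
    rows = []
    for number, line in enumerate(lines, start=1):
        # Lines starting with '#' are ignored; skipping blank lines is an assumption.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()  # any number of spaces may separate columns
        if len(fields) != expected_columns:
            raise ValueError(
                f"line {number}: expected {expected_columns} columns, "
                f"found {len(fields)}"
            )
        rows.append(fields)
    return rows


def primers_from_amplicon_rows(rows):
    """Expand Option 1 rows into (primerName, sequence) pairs.

    Assumes forward -> _LEFT and reverse -> _RIGHT (illustrative only).
    """
    primers = []
    for amplicon, forward, reverse in rows:
        primers.append((f"{amplicon}_LEFT", forward))
        primers.append((f"{amplicon}_RIGHT", reverse))
    return primers


option1 = [
    "# ampliconName forwardSequence reverseSequence",
    "amplicon1      GGGCAAACCTAAAGG  GTTATGTAAAGGTGC",
]
for name, sequence in primers_from_amplicon_rows(read_rows(option1)):
    print(name, sequence)
```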

Formatting rules

General
- All text is case sensitive.
- Any line starting with '#' is ignored. This can be used to add a header line with column names.
- Every line must have the same number of columns and format (except those starting with '#').
- Any number of spaces can separate the columns. A value within a single column should not have any space.
BED format
- Per standard BED conventions, sequence coordinates are given as 0-based, half-open intervals, such that the start field (2nd column) contains the first nucleotide in the primer binding site and the last nucleotide in the primer binding site is the value in the end field (3rd column) minus 1.
- accession field must contain a sequence identifier that matches the header of the FASTA file containing the sequence that the coordinates are relative to.
- Multiple sequence identifiers (accession) are permitted within one file.
Primer name
- primerName must be unique and encode the name of the amplicon for which the primer is designed, the direction tag indicating which side of the amplicon, left or right, the primer belongs to, and an optional indicator that the primer is an alternative primer for that amplicon.
- In addition to _LEFT and _RIGHT, we permit _L and _R as direction tags in primerName. Any text after the direction tag should be separated by an underscore.
- Text in primerName before the direction tag is considered to be an amplicon identifier. Ensure that the text of the amplicon identifier is unique for that amplicon and that the direction tag occurs only once in primerName.
- Each amplicon must have at least one left and right primer (including alternative primers) associated with it.
- Alternative primers bind to locations that avoid sequence variation in the default primer binding site that may disrupt hybridization. An amplicon may have an arbitrary number of alternative primers (as long as each primer name is unique), but most amplicons will have none. Alternative primers are indicated by _alt after the direction tag in primerName, followed by optional text to distinguish between different alternative primers, such as a number.
- Examples of valid primer names:
  - MY_SEQUENCE_434_A_LEFT
  - virus1_L
  - amplicon_4934m_RIGHT_alt
  - amplicon_4934m_RIGHT_alt1
  - amplicon_4934m_R_altprimerB
- Examples of invalid primer names:
  - LEFT_MY_SEQUENCE_434_A
  - virus1_l
  - amplicon_4934m_RIGHT_L
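
The primer name rules above can be checked with a short script. The Python sketch below is one possible reading of those rules, not the app's own validator, and all function names are hypothetical. It splits the name on underscores, requires exactly one case-sensitive direction tag with a non-empty amplicon identifier before it, and treats a suffix beginning with `alt` as an alternative primer. Suffixes that do not begin with `alt` are accepted here, which is an assumption.

```python
DIRECTION_TAGS = {"LEFT": "left", "L": "left", "RIGHT": "right", "R": "right"}


def parse_primer_name(name):
    """Return a dict describing the primer name, or None if it is invalid."""
    tokens = name.split("_")
    tag_positions = [i for i, token in enumerate(tokens) if token in DIRECTION_TAGS]
    if len(tag_positions) != 1:  # the direction tag must occur exactly once
        return None
    position = tag_positions[0]
    amplicon = "_".join(tokens[:position])
    if not amplicon:  # text before the tag is the amplicon identifier
        return None
    suffix = "_".join(tokens[position + 1:])
    return {
        "amplicon": amplicon,
        "side": DIRECTION_TAGS[tokens[position]],
        "is_alt": suffix.startswith("alt"),
        "suffix": suffix,
    }


def amplicons_missing_a_side(names):
    """Amplicons that lack a left or a right primer (alternatives count)."""
    sides = {}
    for name in names:
        parsed = parse_primer_name(name)
        if parsed is None:
            raise ValueError(f"invalid primer name: {name}")
        sides.setdefault(parsed["amplicon"], set()).add(parsed["side"])
    return sorted(a for a, found in sides.items() if found != {"left", "right"})


for candidate in ["MY_SEQUENCE_434_A_LEFT", "amplicon_4934m_R_altprimerB",
                  "LEFT_MY_SEQUENCE_434_A", "virus1_l", "amplicon_4934m_RIGHT_L"]:
    print(candidate, "valid" if parse_primer_name(candidate) else "invalid")
```

Under this reading, the five valid examples listed above are accepted and the three invalid examples are rejected.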
