Launch the DRAGEN Microbial Amplicon BaseSpace application.
Choose the analysis name and destination project to save results to.
Choose either Biosample or Project as input method. Selecting Project will result in all biosamples in the selected project being analyzed.
Choose the appropriate Amplicon Primer Set that matches the primer design used to prepare your samples or choose Custom to provide your own. See this page to learn more about the custom option.
If needed, uncheck the appropriate boxes to disable Pangolin and Nextclade analyses. These tools perform phylogenetic analysis and lineage assignment for SARS-CoV-2 (Pangolin) and other viruses (Nextclade). Depending on the chosen Amplicon Primer Set, these tools may not be applicable.
If needed, expand the Advanced Workflow Settings box to change default settings. Click on the "i" circle next to each setting for more information.
If needed, expand the Additional DRAGEN Command Line Arguments to provide additional arguments to default DRAGEN commands.
Click "Launch Application".
A BED-like tab-separated value (TSV) file with no header row and with 4 or 5 columns:
accession: each sequence accession as it appears in the Custom Reference FASTA header
start: start position (always set to 0)
end: end position (sequence length)
genome: full name of the virus the sequence belongs to (e.g. Influenza A H1N1)
segment (optional): how this sequence is labeled within the virus (e.g. Segment 4 (HA)). Set to 'Full' if the sequence is the full genome
This file affects how sequences are labeled in the output.
Sequence names must match those in Custom Reference FASTA. The same set of sequences must appear in both.
If there are multiple viruses, their names should be unique. For example, if there are multiple Influenza genomes, they should not be labeled with the same virus name in the 4th column.
If the Custom Reference FASTA includes sequences from multiple segments, it is strongly recommended to provide this BED file. Otherwise, each segment will be treated independently and not all of them may be used as reference.
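The rules above can be checked before upload with a short script. The following is a minimal sketch, not the app's own validation: the function name `check_reference_bed`, the accessions, and the sequence lengths are all hypothetical, and the caller is assumed to supply the sequence lengths from the FASTA.

```python
def check_reference_bed(bed_text, fasta_lengths):
    """Return a list of problems found in a Custom Reference BED (empty if none).

    fasta_lengths maps each accession in the FASTA to its sequence length.
    """
    problems = []
    seen = set()
    for number, line in enumerate(bed_text.splitlines(), 1):
        row = line.split("\t")
        if len(row) not in (4, 5):
            problems.append(f"line {number}: expected 4 or 5 columns, got {len(row)}")
            continue
        accession, start, end = row[0], row[1], row[2]
        seen.add(accession)
        if accession not in fasta_lengths:
            problems.append(f"line {number}: {accession} is not in the FASTA")
        elif start != "0" or end != str(fasta_lengths[accession]):
            problems.append(f"line {number}: start must be 0 and end the sequence length")
    # The same set of sequences must appear in both files.
    for missing in sorted(set(fasta_lengths) - seen):
        problems.append(f"{missing} is in the FASTA but not in the BED")
    return problems


# Hypothetical two-segment reference; names and lengths are made up.
lengths = {"seg4_example": 1701, "seg6_example": 1410}
bed = (
    "seg4_example\t0\t1701\tInfluenza A H1N1\tSegment 4 (HA)\n"
    "seg6_example\t0\t1410\tInfluenza A H1N1\tSegment 6 (NA)\n"
)
print(check_reference_bed(bed, lengths))
```

Both rows share the same value in the 4th column, so they would be grouped as segments of one genome.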
Provide a BED file with at least 4 columns: accession, start, end, primerName. Additional columns can be included (pool, strand, sequence), but their order must be maintained.
For example, the 5-column BED format uses accession, start, end, primerName, pool.
The 7-column BED format uses accession, start, end, primerName, pool, strand, sequence.
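As an illustration of the column order, the sketch below renders the same two primer records as 5-column and 7-column BED text. Every value in `primers` is invented for illustration, `to_bed` is a hypothetical helper, and allowing a 6-column variant is an assumption based on the "order must be maintained" rule.

```python
# Hypothetical primer records: every value below is made up for illustration.
# Field order: accession, start, end, primerName, pool, strand, sequence
primers = [
    ("ref_example_1", 30, 54, "amplicon_1_LEFT", "1", "+", "ACGTACGTACGTACGTACGTACGT"),
    ("ref_example_1", 410, 434, "amplicon_1_RIGHT", "1", "-", "TGCATGCATGCATGCATGCATGCA"),
]


def to_bed(records, n_columns):
    """Render records as tab-separated BED text, keeping the first n_columns fields."""
    if n_columns not in (4, 5, 6, 7):
        raise ValueError("a primer BED file needs between 4 and 7 columns")
    return "\n".join("\t".join(str(value) for value in rec[:n_columns]) for rec in records)


print(to_bed(primers, 5))
print(to_bed(primers, 7))
```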
Option 1. One line per amplicon with 3 columns: ampliconName, forwardSequence, reverseSequence.
Option 2. One line per primer with 3 columns: primerName, sequence, pool.
General
All text is case sensitive.
Any line starting with '#' is ignored. This can be used to add a header line with column names.
Every line must have the same number of columns and format (except those starting with '#').
Any number of spaces can separate the columns. A value within a single column should not have any space.
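The general rules above translate into a small reader. This is a sketch of one way to apply them, not the app's parser: `read_primer_rows` is a hypothetical name, treating tabs as column separators alongside spaces is an assumption, and so is skipping blank lines.

```python
def read_primer_rows(text):
    """Split a primer definition file into rows of columns.

    Lines starting with '#' are skipped, columns are separated by any run of
    whitespace, and all remaining lines must have the same number of columns.
    """
    rows = []
    for number, line in enumerate(text.splitlines(), 1):
        if line.startswith("#") or not line.strip():
            continue  # comment line; skipping blank lines is an assumption
        columns = line.split()
        if rows and len(columns) != len(rows[0]):
            raise ValueError(f"line {number}: expected {len(rows[0])} columns, got {len(columns)}")
        rows.append(columns)
    return rows


example = "#ampliconName forwardSequence reverseSequence\namp1   ACGT  TTGA\namp2 GGCC AATT\n"
print(read_primer_rows(example))
```

Values are kept exactly as written, since all text is case sensitive.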
BED format
Per standard BED conventions, sequence coordinates are given as 0-based, half-open intervals, such that the start field (2nd column) contains the first nucleotide in the primer binding site and the last nucleotide in the primer binding site is the value in the end field (3rd column) minus 1.
The accession field must contain a sequence identifier that matches the header of the FASTA file containing the sequence that the coordinates are relative to.
Multiple sequence identifiers (accession) are permitted within one file.
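To make the 0-based, half-open convention concrete, the sketch below converts a BED interval into the 1-based, inclusive positions used by many genome viewers. The function name `primer_site` and the coordinates are hypothetical.

```python
def primer_site(start, end):
    """Convert a 0-based, half-open BED interval to 1-based inclusive positions."""
    if not 0 <= start < end:
        raise ValueError("BED intervals need 0 <= start < end")
    return {
        "first_base_1based": start + 1,  # start is the 0-based index of the first base
        "last_base_1based": end,         # end - 1 (0-based) equals end (1-based)
        "length": end - start,
    }


print(primer_site(30, 54))
# {'first_base_1based': 31, 'last_base_1based': 54, 'length': 24}
```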
Primer name
primerName must be unique and must encode the name of the amplicon for which the primer is designed, a direction tag indicating which side of the amplicon (left or right) the primer belongs to, and an optional indicator that the primer is an alternative primer for that amplicon.
In addition to _LEFT and _RIGHT, _L and _R are permitted as direction tags in primerName. Any text after the direction tag should be separated by an underscore.
Text in primerName before the direction tag is considered to be an amplicon identifier. Ensure that the text of the amplicon identifier is unique for that amplicon and that the direction tag occurs only once in primerName.
Each amplicon must have at least one left and right primer (including alternative primers) associated with it.
Alternative primers are used to bind to locations that avoid sequence variation in the default primer binding site that may disrupt hybridization. An amplicon may have an arbitrary number of alternative primers (as long as the primer name is unique), but most amplicons will have none. Alternative primers are indicated by the presence of _alt after the direction tag in primerName, followed by optional text to distinguish between different alternative primers, such as a number.
Examples of valid primer names:
MY_SEQUENCE_434_A_LEFT
virus1_L
amplicon_4934m_RIGHT_alt
amplicon_4934m_RIGHT_alt1
amplicon_4934m_R_altprimerB
Examples of invalid primer names:
LEFT_MY_SEQUENCE_434_A
virus1_l
amplicon_4934m_RIGHT_L
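The naming rules can be checked with a short parser. This is a sketch of one reading of the rules, not the app's implementation: it splits the name on underscores, requires exactly one direction tag with a non-empty amplicon identifier before it, and treats a following token that starts with "alt" as the alternative-primer indicator. The names `parse_primer_name` and `amplicons_missing_a_side` are hypothetical.

```python
DIRECTION_TAGS = {"LEFT": "left", "L": "left", "RIGHT": "right", "R": "right"}


def parse_primer_name(name):
    """Split a primerName into (amplicon_id, side, is_alternative).

    Raises ValueError unless the name holds exactly one direction tag that is
    preceded by a non-empty amplicon identifier. Matching is case sensitive.
    """
    tokens = name.split("_")
    tag_positions = [i for i, token in enumerate(tokens) if token in DIRECTION_TAGS]
    if len(tag_positions) != 1:
        raise ValueError(f"invalid primer name: {name}")
    position = tag_positions[0]
    amplicon_id = "_".join(tokens[:position])
    if not amplicon_id:
        raise ValueError(f"invalid primer name: {name}")
    suffix = tokens[position + 1:]
    is_alternative = bool(suffix) and suffix[0].startswith("alt")
    return amplicon_id, DIRECTION_TAGS[tokens[position]], is_alternative


def amplicons_missing_a_side(primer_names):
    """Return amplicon identifiers that lack a left or a right primer."""
    sides = {}
    for name in primer_names:
        amplicon_id, side, _ = parse_primer_name(name)
        sides.setdefault(amplicon_id, set()).add(side)
    return sorted(a for a, found in sides.items() if found != {"left", "right"})


print(parse_primer_name("amplicon_4934m_R_altprimerB"))
# ('amplicon_4934m', 'right', True)
```

The three invalid examples listed above are all rejected under this reading: the first has no amplicon identifier before the tag, the second uses a lowercase tag, and the third contains two tags.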
In addition to the built-in options, DRAGEN Microbial Amplicon supports the use of custom reference genomes and primer definitions. These files must be uploaded to a BaseSpace Project before they can be used. See https://help.basespace.illumina.com/manage-data/import-data for more information about importing files into BaseSpace.
In the app input form, select the 'Custom' option for 'Amplicon Primer Set'. Then expand the 'Custom Reference' settings to provide the following:
Custom Reference FASTA for Consensus Generation (required)
Custom Reference BED (optional)
Custom PCR Primer Definitions (optional)
If the 'Custom' option is selected for 'Amplicon Primer Set', the user must provide a custom FASTA file containing one or more reference sequences as the target for read alignment (and as the basis for generating consensus sequences). The software generates the required DRAGEN hash tables and other auxiliary files automatically, so there is no need to process the FASTA file with a separate app. Note that not all provided reference sequences in the FASTA file may be used for read alignment and consensus sequence generation.
Optionally, a reference BED file may be provided to add information about each reference sequence in the FASTA file, such as human-readable names to be used in the reports. For multi-segment genomes such as Influenza, this file assigns the segment name to each sequence, which allows the software to group individual segment sequences by genome. See the following page on the format of this file:
Optionally, a TSV file may be provided to define the primer sequences or binding locations, which are used for two purposes:
Primer sequences are trimmed from reads, which separates bases that may come from the primers themselves (which we do not want) from bases contributed by the biological sample (which we do want). This reduces reference bias that can incorrectly lower the observed allele frequency of true sequence variants in primer binding sites.
Primer locations are used to define the amplicons expected from PCR reactions. The read coverage within the unique (non-overlapping) amplicon regions is used to determine whether each amplicon is reliably detected. The percentage of detected amplicons is used to determine whether sufficient material exists to accurately call variants and generate consensus sequences from the sample.
See the following pages for further information:
Optionally, one or more Nextclade datasets can be selected to use for phylogenetic analysis of the consensus sequences generated from the samples. Every selected dataset will be applied to every consensus sequence generated in every sample.
DRAGEN Microbial Amplicon is a software application designed to analyze sequencing data from amplicon library preps (both DNA and RNA) on microbiological samples, with an emphasis on viruses. Illumina sequencing reads are processed to generate consensus sequences that represent a best estimate of the population of viral sequences in each sample. Where appropriate, these consensus sequences are further analyzed by the phylogenetic analysis tools Nextclade and/or Pangolin to provide an identification of the clade or lineage of each sequence.
Data can be provided in one of the following ways:
Samples / biosamples with FASTQ datasets (see details in library preparation documents)
A project containing one or more samples / biosamples with FASTQ datasets
All samples / biosamples in the selected project will be analyzed
Supported amplicon primer schemes
Chikungunya
Illumina
Dengue
Serotype 1 - Illumina
All serotypes - DengueSeq from Grubaugh Lab
Mpox
Pan-clade - ARTIC
Clade I - Illumina
Clade II - Grubaugh Lab
SARS-CoV-2 - ARTIC
Zika - Grubaugh Lab
Custom genome and primer sets
Users can upload custom files to provide user-defined reference genome set and primer definitions. Multiplexed amplicon panels targeting multiple organisms in the same reaction are supported.
Trim and filter reads using Trimmomatic
Remove off-target reads using DRAGEN v4.3.6 kmer classifier (for custom reference, remove human reads using a modified version of the SRA Human Read Scrubber tool v2.2.1)
For organisms with one default reference genome, skip this step. For organisms with multiple candidates, trim primer sequences in reads using Trimmomatic, perform assembly using MEGAHIT, cluster contigs using CD-HIT-EST, map contigs to candidate reference genomes using minimap2, then select reference genomes based on the mapping
Align reads to the default reference genome or selected reference genomes using DRAGEN v4.3.6
Trim primer sequences in aligned reads based on coordinates
Filter out samples with insufficient amplicon coverage
Call sequence variants from the alignments using DRAGEN Somatic v4.3.6 and apply them to the corresponding reference genomes to create consensus sequences
If applicable, run Nextclade/Pangolin on the consensus sequences
Consensus sequences representing a best estimate of targeted sequences
Tables and plots reporting read counts, coverage, and Nextclade/Pangolin results
BaseSpace Sequence Hub
The sequences are labeled according to the best match in the reference database, which is not exhaustive and the labels should not be taken as definitive for strain-typing. If strain typing is needed, the built-in Nextclade and/or Pangolin tools can be used for supported organisms. Alternatively, a BLAST or similar search of nucleotide databases may provide a more detailed match.
Because of sequence homology, it is possible that organisms with very few reads will result in the generation of a sequence not present (false positive). Although the de novo assembly step of this software largely mitigates such instances, sequences with very low horizontal coverage (< 5%) should be treated with caution.
Amplicon sequencing, especially of RNA viruses, requires additional bioinformatics processing to ensure maximum quality of the resulting data.
In RT-PCR, a reverse transcriptase enzyme first generates cDNA molecules using the RNA molecules in the sample as templates, before amplifying the cDNA sequences using a DNA polymerase enzyme during PCR. These amplified cDNA sequences are then further processed to generate the sequencing libraries. Both of these enzymes can potentially introduce an incorrect base into a sequence, generating a position where the resulting sequence does not match the sequence in the sample -- that is, an error.
Reverse transcriptases exhibit error rates that are multiple orders of magnitude higher than those of high-fidelity DNA polymerases.
When large numbers of nucleic acid molecules are present in a reaction, these individual misincorporation errors are largely uncorrelated and appear at very low frequencies, so that they are typically ignored by variant callers.
However, when there is a small number of incoming nucleic acid molecules, such as for a low-titer sample, an error that occurs during the RT step or early in the PCR reaction can, as a result of sampling noise, be amplified to high frequencies in the resulting sequencing libraries. The variant caller may treat this error as a sequence variant, since it is a true sequence variant in the context of the library provided to the instrument. As a result, these artifactual sequence variants often have high allele frequency and quality scores, which makes them very difficult to detect, and appear in the final consensus sequence. While less common, it is also possible for a true sequence variant to have its allele frequency depressed by this same process (if the error results in a reversion to the reference sequence).
Since it is difficult to identify enzyme-introduced false variants after the fact, we instead take a preemptive approach of determining if there is sufficient sample material present before variant calling and consensus sequence generation in order to ensure data quality.
Specifically, the app calculates the number of amplicons with at least 1x coverage for at least 90% of the non-overlapping portion of the amplicon sequence. The 1x coverage threshold used here is fixed and independent of the minimum read coverage depth for consensus sequence generation which defaults to 10x. The number of amplicons that meet this threshold is then divided by the total number of amplicons expected in the experiment, which is the number of amplicons whose location falls in reference sequences selected for short read alignment. If the resulting percentage is at least 80%, the sample is considered to have sufficient material for accurate variant calling. If it is below this threshold, the sample is not processed further to avoid spurious variant calls. The user can override the 80% threshold in the "Minimum percentage of amplicons with at least 90% coverage ≥ 1x to enable variant calling and consensus sequence generation" control in the "Advanced Workflow Settings" section.
The threshold above was determined through data analysis using an experimentally determined threshold corresponding to the minimum concentration needed to produce reliable variant calls. We assumed that higher nucleic acid concentrations lead to a higher probability of amplifying each amplicon.
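The amplicon-detection check described above can be sketched as follows. This is an illustration under stated assumptions, not the app's code: `depths` is a hypothetical list of per-position read depths for one reference, `unique_regions` holds the non-overlapping portion of each amplicon as 0-based, half-open intervals, and the function names are invented.

```python
def is_detected(depths, start, end, min_depth=1, min_fraction=0.90):
    """True if at least min_fraction of positions in [start, end) reach min_depth."""
    region = depths[start:end]
    covered = sum(1 for depth in region if depth >= min_depth)
    return bool(region) and covered / len(region) >= min_fraction


def sample_passes(depths, unique_regions, min_percent_detected=80.0):
    """Return (detected, percent_detected, passes) for one sample."""
    if not unique_regions:
        raise ValueError("no amplicons expected for this sample")
    detected = sum(is_detected(depths, start, end) for start, end in unique_regions)
    percent = 100.0 * detected / len(unique_regions)
    return detected, percent, percent >= min_percent_detected


# Made-up depths: amplicon 1 fully covered, amplicon 2 absent, amplicon 3 95% covered.
depths = [5] * 100 + [0] * 100 + [3] * 95 + [0] * 5
regions = [(0, 100), (100, 200), (200, 300)]
detected, percent, passes = sample_passes(depths, regions)
print(detected, round(percent, 1), passes)
```

Here two of three amplicons are detected, which is below the default 80% threshold, so this sample would not proceed to variant calling.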
Once the analysis completes, the "REPORTS" tab on BaseSpace enables users to view the Summary of the entire analysis, which summarizes results from all input samples, as well as individual Sample Report for each sample.
A: For many sample types, especially clinical, wastewater, or environmental samples, viral RNA or DNA makes up a tiny proportion of the total nucleic acids, with the remainder dominated by host or bacteria/archaea. Therefore, even with a dramatic increase of abundance over what you would obtain without targeted sequencing, the percentage of targeted reads can still be low.
A: The 10x threshold is applied per-nucleotide. Any positions below 10x coverage will be hard-masked with "N".
A: This message is to warn users that the sequence accession in the consensus genome does not necessarily reflect the true phylogeny of the organisms in the sample and should not be taken as such.
Because the app uses a limited set of reference sequences, the accession in the consensus sequence FASTA file headers (and coverage plots, etc) merely reflects the best match from that limited set. There may be sequences in RefSeq or elsewhere that are a closer match.
A: The denominator in the "Detected Amplicons" columns is based on the reference sequences selected based on de novo assembled contigs. Depending on the quality of the sample and/or reads, the assembler may not have enough data to generate a contig for some segments. Shorter segments are more likely to be missed. If only 7 segments are selected as reference for short read alignment, then we expect 7 amplicons in total. If you believe that the sample should contain all 8 segments, you can download the contig FASTA file from our report page and submit it to a nucleotide database search such as BLAST to see if all 8 segments are present in the contig sequences.
One known issue is that chimeric reads can be generated during library preparation, which can lead to chimeric contigs, where the contig sequence contains sequences from more than one segment. This can result in missing an entire segment in the reference selection stage. A workaround may be to filter out chimeric reads from your FASTQ files before running the app.
Alternatively, you can force the app to use all 8 segments of a particular Influenza genome by providing a custom reference FASTA file with all 8 segment sequences and a custom reference BED file with the genome column set to the same value (e.g. Influenza A) for every segment. This way, the app skips assembly and uses all 8 segments as the reference sequences for short read alignment.
A: The "Detected Amplicons" column shows the number of detected amplicons over the total number of expected amplicons. The percentage of detected amplicons is used to infer if the sample is of sufficient quality for variant calling. The "% callable bases" column shows the percentage of the selected reference genome whose bases are at or above the minimum read coverage depth for consensus sequence generation, which is computed independent of amplicons.
Both metrics are useful to assess the quality of the sample, but the percentage of detected amplicons is used by the app after short read alignment to filter out low-titer samples and the percentage of callable bases is not.
A: Not necessarily. Your virus of interest may be present in the sample, but the app may not have generated a consensus sequence for it for various reasons.
One reason could be that there are too few reads coming from that virus. Tools like DRAGEN Metagenomics can be used to characterize what is in the sample more broadly.
A: De novo assembly is performed only if there are multiple candidate reference genomes, which is typically when there are multiple serotypes, strains, subtypes, or clades. This currently applies to the following Amplicon Primer Set options:
Dengue Virus All Serotypes, 400-bp DengueSeq primers
Influenza A, Universal primers
Influenza B, Universal primers
Influenza A and B, Universal primers
Mpox All Clades, 2500-bp ARTIC-INRB v1 primers
Respiratory Syncytial Virus (RSV), CDC primers
Respiratory Syncytial Virus (RSV), WCCRRI primers
If a custom reference FASTA file is provided, assembly is performed if there are multiple sequences in the file. If a custom reference BED file is also provided, assembly is performed if, based on the BED file, there are multiple genome-segment pairs (or multiple non-segmented full genomes). Otherwise, all sequences in the custom reference FASTA file are used as reference for short read alignment.
A: In most cases, the consensus sequence FASTA file. Contig sequences are useful if the reference sequences used for consensus sequence generation were not the best match. They should be used with caution however because there is no filtering of base calls based on coverage or quality as done in consensus sequence generation.
A: It is most likely that the custom database was not formatted correctly. Below are requirements for the Custom Reference FASTA For Consensus Generation:
Do not use spaces in the file name; use an underscore "_" instead
Do not exceed 25 characters in the file name
File extension must be .fasta or .fa
Do not exceed the file size limitation: 16GB for a single file or 25GB for multiple files
Do not have duplicate entries
If providing a Custom Reference BED and/or Custom PCR Primer Definitions in BED format, the names in the first column of the BED file (accession) must match the names that appear in the FASTA (text after > and before the first whitespace character).
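Several of the requirements above can be checked locally before upload. The sketch below is a hypothetical pre-flight check, not part of the app. It assumes the 25-character limit counts the whole file name including the extension, and that "duplicate entries" means repeated sequence identifiers; it does not check file size.

```python
def check_custom_fasta(file_name, fasta_text):
    """Return a list of problems with a Custom Reference FASTA (empty if none)."""
    problems = []
    if " " in file_name:
        problems.append("file name contains spaces")
    if len(file_name) > 25:  # assumption: the limit includes the extension
        problems.append("file name is longer than 25 characters")
    if not file_name.endswith((".fasta", ".fa")):
        problems.append("file extension must be .fasta or .fa")
    # The identifier is the text after '>' and before the first whitespace character.
    identifiers = [
        (line[1:].split() or [""])[0]
        for line in fasta_text.splitlines()
        if line.startswith(">")
    ]
    duplicates = sorted({name for name in identifiers if identifiers.count(name) > 1})
    if duplicates:
        problems.append("duplicate entries: " + ", ".join(duplicates))
    return problems


print(check_custom_fasta("my_refs.fasta", ">seqA example\nACGT\n>seqB\nGGCC\n"))
```

The identifiers collected here are also the values that the accession column of any accompanying BED file must match.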
Another reason could be that the virus in your sample is too divergent from the reference sequences used in the app. In such cases, we recommend downloading the contig FASTA file (if available) from our report page and submitting it to a nucleotide database search such as BLAST. If you do see a genome that matches your virus of interest, you can provide that to the app as a custom reference genome.
See the general guidelines for uploading data to BaseSpace for more details. If you continue to have issues, contact techsupport@illumina.com.
Status | Level | Outcome |
---|---|---|
Completed successfully | Pipeline | Exit with all applicable output files |
Custom files are not formatted correctly | Pipeline | Exit early with error |
No remaining reads after preprocessing | Sample | Exit early with a report of read counts |
No contig generated | Sample | Exit early with a report of read counts |
No reference found after assembly | Sample | Exit early with a report of read counts and contig FASTA |
None of the primers provided in custom primer definition file align to selected reference sequences | Sample | Skip post-facto primer trimming and sample filtering based on amplicon coverage for this sample |
Insufficient amplicon coverage | Sample | Exit early before variant calling and consensus sequence generation |

Step | Tool | Condition |
---|---|---|
QC | Trimmomatic | Always |
Primer trimming (on FASTQ) | Trimmomatic | If assembly is to run |
Remove off-target reads | DRAGEN | If checked in Input Form |
Assembly | MEGAHIT | If reference FASTA and BED files imply more than one genome as reference |
Contig clustering | CD-HIT | If assembly ran |
Reference selection | custom script | If assembly ran, otherwise input reference database is used as is |
Map/Align | DRAGEN | If at least one reference sequence is generated |
Post-facto primer trimming (on BAM) | custom script | If Map/Align ran and primer set exists |
Sample filtering based on amplicon coverage | custom script | If Map/Align ran and primer set exists |
Variant calling | DRAGEN | If Map/Align ran and sample passed filter above |
Consensus sequence generation | custom script | If Map/Align ran and sample passed filter above |
The Sample Report contains at most four tabs: Sample QC, Virus Metrics, Nextclade Report, and Pangolin Report.
This tab contains tables and plots summarizing the sample.
This table reports summary metrics for the sample, such as Status and Detected Amplicons. See here for their definitions.
This plot displays counts of reads that fall into different categories. See here for their definitions.
This plot displays the number of reads that mapped to each reference sequence. If there is a single reference sequence (e.g. SARS-CoV-2), one bar is shown.
This table provides the number of reads that mapped to each reference sequence along with the genome and segment names of the reference sequence. The "Download CSV" button enables downloading the contents of the table as a text comma-separated value (CSV) file.
This table summarizes results for each viral genome generated in the sample with each row corresponding to a single viral genome. For segmented viruses like Influenza, a row will summarize information across multiple sequences generated for a single viral genome.
At the top is the "Download CSV" button, which enables downloading the contents of the table as a text comma-separated value (CSV) file.
The table itself contains rows for every viral genome with at least one sequence generated in the sample with the following columns:
Virus: Name of the viral genome
For custom references, this will be the part of the FASTA header before the first whitespace character for the corresponding reference sequence if no custom genome definition file is provided. If a custom genome definition file is provided, this will be the value of the genomeName column.
% Callable: Percentage of bases in the reference sequence with coverage above the minimum read coverage depth for consensus sequence generation (10x by default). This is computed across all sequences belonging to the viral genome. See here for more information.
Median Coverage: Median coverage value (in number of reads overlapping each position) over the entire reference genome (not just the generated consensus sequence).
This table summarizes the results for each sequence generated in the sample. For segmented viruses like Influenza, there are typically multiple rows with the same virus name. Otherwise, this table contains similar information as the Metrics By Virus table.
At the top is the "Download CSV" button, which enables downloading the contents of the table as a text comma-separated value (CSV) file.
Virus: Name of the virus genome
Segment: Name of the segment to which the reference sequence corresponds. For non-segmented viruses, this is typically set to "Full".
Accession: Unique identifier of the reference sequence (text before first space in FASTA header if custom reference FASTA was provided)
% Callable: Percentage of bases in the reference sequence with coverage above the minimum read coverage depth for consensus sequence generation (10x by default). See here for more information on this metric.
Callable Bases: Number of bases in the reference sequence with coverage above the minimum read coverage depth for consensus sequence generation (10x by default)
Median Coverage: Median coverage value (in number of reads overlapping each position) over the entire reference genome (not just the generated consensus sequence).
Consensus Length: Length of the final consensus, without leading and trailing masked bases if sequence trimming is enabled. Sequence trimming can be disabled in the Input Form under Advanced Workflow Settings.
Displays a trace of read coverage over each reference genome. On the top right is a drop-down menu that allows users to switch between genomes. The blue line represents the read coverage, with the coverage depth (log10 of the number of reads) on the y-axis and the genomic position in the reference genome on the x-axis.
For segmented viruses like Influenza, coverage values for each segment are displayed in a horizontally stacked fashion. Grey blocks at the top show their boundaries.
This tab contains tables reporting the results of the Nextclade analysis performed on the generated consensus sequences in the sample. See here for more details.
This tab contains tables reporting the results of the Pangolin analysis performed on the generated consensus sequences in this sample. See here for more details.
The Summary contains at most three tabs: Summary Report, Nextclade Report, and Pangolin Report.
This table provides a top-line summary of each of the analyzed samples.
At the top is the "Download CSV" button, which enables downloading the contents of the table as a text comma-separated value (CSV) file.
Next is the table itself, which contains one row per sample and the following columns:
Sample: Name of the BaseSpace sample analyzed
Status: Status of the sample analysis
Input Reads: Total number of reads in input FASTQs
Mapped Reads: Number of reads that map to reference sequences during short read alignment
Detected Amplicons: Proportion of amplicons detected out of the total expected for the sample, which is used to determine whether the sample is of sufficient quality for variant calling. See this page for more details.
Num Genomes: Number of genomes chosen during the reference selection stage
Virus: Name of the genome to which the reference sequence belongs
% Callable: Percentage of bases in the reference sequence with coverage above the minimum read coverage depth for consensus sequence generation (10x by default)
Callable bases are those for which reliable variant calling can be performed and therefore for which the software can output a base call. They are defined as genomic positions with read coverage above the minimum read coverage depth for consensus sequence generation (10x by default).
When generating consensus sequences, genomic positions below the threshold are hard-masked with "N" characters to avoid reference bias (inclusion of a reference base when the true base cannot be accurately determined).
This percentage is calculated over the lengths of the reference genome(s), not the final consensus sequence(s) which may be trimmed.
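The relationship between coverage, callable bases, and hard-masking can be sketched as below. This is a simplified illustration, not the app's consensus generation: it assumes a consensus with no insertions or deletions, so that each base lines up with one reference position, and it treats positions at or above the minimum depth as callable. The function names and depth values are hypothetical.

```python
def hard_mask(consensus, depths, min_depth=10):
    """Replace every base whose read depth is below min_depth with 'N'."""
    if len(consensus) != len(depths):
        raise ValueError("consensus and depths must have the same length")
    return "".join(base if depth >= min_depth else "N" for base, depth in zip(consensus, depths))


def percent_callable(depths, min_depth=10):
    """Percentage of reference positions with depth at or above min_depth."""
    return 100.0 * sum(depth >= min_depth for depth in depths) / len(depths)


depths = [12, 10, 9, 0, 30, 10]  # made-up per-position depths
print(hard_mask("ACGTAC", depths))          # ACNNAC
print(round(percent_callable(depths), 1))   # 66.7
```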
This stacked bar plot contains counts of reads that fall into the following categories:
Removed in Downsampling: Reads that were removed during downsampling because the user specified a downsampling target in the Input Form under Advanced Workflow Settings
Removed in QC: Reads removed as poor quality reads based on quality thresholds during pre-processing
Removed as Duplicate: Reads that were labeled as duplicate during short read alignment. Removal of them can be disabled in the Input Form under Advanced Workflow Settings
Removed in Trimming: Reads that were removed in the initial sequence-based primer trimming step and were excluded from further processing
Removed in De-hosting: Reads that were filtered out as human reads based on kmer-based classification during pre-processing.
This improves the quality of downstream analysis and helps ensure that human sequences are not included in the output BAM files.
This is applied only if 'Amplicon Primer Set' was set to 'Custom' in the Input Form.
This can be disabled in the Input Form under Advanced Workflow Settings by unchecking "Remove off-target reads".
Removed as Off-target: Reads that were filtered out as off-target reads based on kmer-based classification during pre-processing
Similar to de-hosting, this improves the quality of downstream analysis.
Off-target is defined as not coming from the target organism, which is determined based on the 'Amplicon Primer Set' selection in the Input Form. For example, if "Influenza A and B, Universal Primers" option is selected, a kmer database generated from a large collection of publicly available Influenza sequences is used to separate reads likely coming from Influenza from the rest.
This can be disabled in the Input Form under Advanced Workflow Settings by unchecking "Remove off-target reads".
Unmapped: Reads that were not aligned to any reference genomes
Mapped: Reads that were mapped to at least one reference genome
This tab contains tables reporting the results of the Nextclade analysis performed on the generated consensus sequences across all samples. Nextclade is run if the "Enable NextClade" box is checked on the Input Form and one of the following is true:
'Amplicon Primer Set' is set to a non-custom set with a reference with Nextclade dataset available and a valid consensus sequence was generated.
'Amplicon Primer Set' is set to 'Custom' and one or more Nextclade datasets are selected under 'Custom Reference'. In this case, each of the selected Nextclade datasets is applied to each consensus sequence generated for every sample. This may result in multiple Nextclade results for each consensus sequence.
Each table contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.
All content shown in the tab is derived from the output of the Nextclade software. Please see the Nextclade documentation for more details.
This tab contains tables reporting the results of the Pangolin analysis performed on the generated consensus sequences across all samples. Pangolin is run if the "Enable Pangolin" box is checked on the input form and one of the following is true:
'Amplicon Primer Set' is set to a non-custom set with SARS-CoV-2 as reference (e.g. SARS-CoV-2, ARTIC v5.4.2 primers) and a valid consensus sequence was generated
'Amplicon Primer Set' is set to 'Custom'. In this case, Pangolin is applied to every consensus sequence generated for the sample since the software assumes all of them to be potentially SARS-CoV-2 sequences.
Each table contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.
All content shown in the tab is derived from the output of the Pangolin software. Please see the Pangolin documentation for more details.
Note: Some files may not be generated depending on user inputs and pipeline outcome
Analysis_Results/<analysisId>.report.html displays tables and plots that summarize results from all samples combined.
An output directory named after each sample contains <sampleName>.html, which displays tables and plots specific to the sample. The HTML files are identical to the ones displayed in BaseSpace Reports.
Each sample directory also contains the following subdirectories and output files:
<sampleName>.amplicon_coverage.log
Log from computing coverage metrics for each amplicon in a sample
<sampleName>.hard_masked_consensus.fa
FASTA containing all hard-masked consensus sequences generated for a sample
<sampleName>.sample_contig.fasta
FASTA containing all contig sequences generated for a sample
<sampleName>.coverage.tsv
TSV reporting base-pair resolution coverage values across all reference sequences used in short read alignment
<sampleName>_<datasetName>.aligned.fasta
FASTA generated by Nextclade from aligning consensus sequences to a reference sequence
<sampleName>-replay.json
JSON reporting parameters and versions used when running DRAGEN to perform variant calling
<sampleName>.amplicon_coverage.csv
CSV reporting coverage metrics for each amplicon in a sample
<sampleName>.amplicon_detection.json
JSON reporting amplicon detection results for a sample
<sampleName>.soft_masked_consensus.fa
FASTA containing all soft-masked consensus sequences generated for a sample
<sampleName>.sample_consensus.fasta
Same as <sampleName>.hard_masked_consensus.fa, but with informative headers
<sampleName>_<genomeName>.genome_consensus.fasta
Same as <sampleName>.sample_consensus.fasta, but limited to consensus sequences generated using reference sequences that belong to a particular genome
<sampleName>_<accessionName>.accession_consensus.fasta
Same as <sampleName>.sample_consensus.fasta, but limited to consensus sequences generated using a particular reference sequence
<sampleName>.consensus.json
JSON containing information on all consensus sequences generated for a sample
<sampleName>_<genomeName>.genome_contig.fasta
Same as <sampleName>.sample_contig.fasta, but limited to contigs mapping to reference sequences that belong to a particular genome
<sampleName>_<accessionName>.accession_contig.fasta
Same as <sampleName>.sample_contig.fasta, but limited to contigs mapping to a particular reference sequence
<sampleName>.contig.json
JSON containing information on all contig sequences generated for a sample
<sampleName>-replay.json
JSON reporting parameters and versions used when running DRAGEN to perform short read alignment
<sampleName>-unmapped_S1_L001_R1_001.fastq.gz
FASTQ containing R1 reads that did not map to any selected reference sequences
<sampleName>-unmapped_S1_L001_R2_001.fastq.gz
FASTQ containing R2 reads that did not map to any selected reference sequences
<sampleName>-unmapped-singleton_S1_L001_R1_001.fastq.gz
FASTQ containing singleton reads that did not map to any selected reference sequences
<sampleName>.bam
BAM containing all short read alignments
<sampleName>.bam.bai
BAI for <sampleName>.bam
<sampleName>.mapping_metrics.csv
CSV generated by DRAGEN to report mapping metrics
<sampleName>.trim.log
Log from performing post-facto primer trimming after short read alignment
<sampleName>.trimmer_metrics.csv
CSV generated by DRAGEN to report trimmer metrics
dragen_run_<runId>.log
Log from running DRAGEN to perform short read alignment
<sampleName>.report.json
JSON containing summary metrics generated for a sample
<sampleName>_<datasetName>.auspice.json
Auspice JSON generated by Nextclade containing output phylogenetic tree
<sampleName>_<datasetName>.csv
CSV generated by Nextclade to report results from mutation calling, clade assignment, quality control, etc.
<sampleName>_<datasetName>.json
Same content as <sampleName>_<datasetName>.csv, in JSON format
<sampleName>_<datasetName>.ndjson
Same content as <sampleName>_<datasetName>.csv, in NDJSON format
<sampleName>_<datasetName>.tsv
Same content as <sampleName>_<datasetName>.csv, in TSV format
<sampleName>.consensus_filtered.bcftools_stats.txt
TXT generated by BCFtools stats command to report statistics on called variants that passed the consensus filter
<sampleName>.consensus_filtered.summary.csv
CSV generated by BCFtools query command to summarize called variants that passed the consensus filter
<sampleName>.consensus_filtered.vcf.gz
VCF containing called variants that passed the consensus filter
<sampleName>.consensus_filtered.vcf.gz.tbi
TBI for <sampleName>.consensus_filtered.vcf.gz
<sampleName>.hard-filtered.bcftools_stats.txt
TXT generated by BCFtools stats command to report statistics on called variants
<sampleName>.hard-filtered.summary.csv
CSV generated by BCFtools query command to summarize called variants
<sampleName>.hard-filtered.vcf.gz
VCF containing called variants
<sampleName>.hard-filtered.vcf.gz.tbi
TBI for <sampleName>.hard-filtered.vcf.gz
<sampleName>.vc_metrics.csv
CSV generated by DRAGEN to report variant calling metrics
dragen_run_<runId>.log
Log from running DRAGEN to perform variant calling
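As an illustration of how the consensus FASTA outputs listed above might be inspected downstream, the following Python sketch reads a FASTA file and reports the fraction of masked bases per sequence. It is a minimal sketch, not part of the application. It assumes that hard-masked positions are written as `N`, which is a common convention but is not confirmed by this page, and the helper names `read_fasta` and `masked_fraction` are hypothetical.

```python
def read_fasta(path):
    """Parse a FASTA file into a dict mapping header -> sequence."""
    records = {}
    header = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # Store the previous record before starting a new one.
                if header is not None:
                    records[header] = "".join(chunks)
                header = line[1:]
                chunks = []
            else:
                chunks.append(line)
    # Store the final record, if any.
    if header is not None:
        records[header] = "".join(chunks)
    return records


def masked_fraction(sequence):
    """Fraction of bases equal to 'N' or 'n'.

    ASSUMPTION: hard-masked positions are written as N.
    """
    if not sequence:
        return 0.0
    return sequence.upper().count("N") / len(sequence)
```

A typical use would be to call `read_fasta` on a sample's `<sampleName>.hard_masked_consensus.fa` and pass each returned sequence to `masked_fraction`. The location of that file within the sample directory depends on the output layout and is not assumed here.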