Describes the report that can be viewed from the Summary link on the Reports tab of a completed analysis.
At the top of the report, after the app version display, is the Metrics by Sample table which provides a top-line summary of each of the analyzed samples.
The first element is a button that will trigger downloading of a FASTA-formatted file containing all consensus sequences generated across all samples.
The "Download CSV" button allows for downloading the contents of the table as a text comma-separated value (CSV) file. Note that for fields with multiple entries, these entries will be combined as a semicolon-separated list in the corresponding fields in the CSV file.
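When the downloaded CSV is processed programmatically, the semicolon-separated fields need to be split back into individual entries. The following is a minimal sketch of one way to do that with the Python standard library. The column names and the example rows are assumptions made for illustration; check the header of the actual downloaded file before relying on them.

```python
import csv
import io


def split_multi_value_fields(csv_text, multi_value_columns):
    """Parse CSV text and turn semicolon-separated fields into lists.

    `multi_value_columns` names the columns expected to hold several
    entries (an assumption; adjust to the real CSV header).
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column in multi_value_columns:
            raw = row.get(column) or ""
            # Drop empty items so an empty field becomes an empty list.
            row[column] = [item.strip() for item in raw.split(";") if item.strip()]
        rows.append(row)
    return rows


# Hypothetical file contents, for illustration only.
example_csv = (
    "Sample,Genomes generated\n"
    "sample_A,SARS-CoV-2;Influenza A\n"
    "sample_B,\n"
)
parsed = split_multi_value_fields(example_csv, ["Genomes generated"])
```

Here `parsed[0]["Genomes generated"]` is a two-item list, and the empty field for `sample_B` becomes an empty list.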
Next is the table itself, which contains one row per sample. The various genomes generated for each sample are nested as sub-rows within this row.
The table contains the following columns:
Sample: The name of the BaseSpace sample analyzed. The sample name is a clickable link that will take you directly to the Result Report for that sample.
Detected Amplicons (only if 'Experiment Type' is set to 'Amplicon'): The number of amplicons detected over the total number of amplicons expected for that sequence. The percentage of amplicons detected is used to determine whether the sample is of sufficient quality for variant calling. See Special considerations for amplicon sequencing with IMAP protocols for more details.
Num genomes: The number of genomes chosen during the reference selection stage of the pipeline.
Genomes generated: The name of each genome chosen during the reference selection stage. Callable bases are genomic positions with read coverage above the minimum read coverage depth for consensus sequence generation (10x by default). If the percentage of callable bases for a genome is below the minimum percentage of consensus sequence generated to label as confident (5% by default), the cell is highlighted in yellow to indicate that there is only marginal evidence that the genome is present in the sample and that the result should be treated with caution. For amplicon experiments, if the sample is considered to have insufficient titer for variant calling (VC) because the percentage of detected amplicons is below the minimum percentage required for reliable variant calling (80% by default), cells are highlighted in orange. For genomes for which a consensus sequence was generated, clicking the genome name initiates a download of a FASTA file containing the consensus sequences of that genome only.
% callable bases: The percentage of the selected reference genome whose bases are considered "callable." Callable bases are those for which reliable variant calling can be performed and therefore for which the software can output a base call. Callable bases are defined as genomic positions with read coverage above the minimum read coverage depth for consensus sequence generation (10x by default). Note that genomic positions below the confidence threshold are hard-masked with "N" characters to avoid reference bias (inclusion of a reference base when the actual base cannot be accurately determined). Note that this percentage is calculated over the lengths of the reference genome(s), not the reported consensus sequence(s).
Status: The overall outcome of the analysis for this virus
Full analysis (consensus, VC) means that the sample analysis completed normally, that a sufficient number of amplicons were detected to ensure reliable variant calling (amplicon experiments only), and that the percentage of callable bases was above the minimum percentage of consensus sequence generated to label as confident (5% by default).
Low confidence means that there is at least one callable base but the overall percentage of callable bases was below the minimum percentage of consensus sequence generated to label as confident (5% by default).
No callable bases indicates that zero positions in the indicated reference genome were callable and no consensus sequence is therefore provided.
Insufficient titer for VC will only be present for an amplicon experiment and indicates that the number of detected amplicons was below the minimum percentage (default 80%) required for reliable variant calling. See Special considerations for amplicon sequencing with IMAP protocols for more details.
Consensus FASTA: This column contains links to download a FASTA-formatted text file containing all of the consensus genomes generated for a sample. If no consensus genomes were generated for a sample, this column contains "N/A."
Input read count: The number of reads (or read pairs / clusters for paired-end samples) in the sample.
Mapped read count: The number of reads that could be mapped to any reference genome.
Unmapped reads: Displays buttons that initiate downloads of gzipped FASTQ files containing reads that could not be mapped to any reference genomes.
Raw Contigs: Displays a button that initiates a download of a FASTA file containing all contigs generated during the de novo assembly step of the pipeline. If a contig could be mapped to a reference genome, the contig name contains information about the reference genome it aligned to.
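The "% callable bases" and "Status" columns described above can be illustrated with a small sketch. It is not the app's implementation. The function names, the use of `>=` at the thresholds, and the order in which the status checks are applied are all assumptions; the defaults (10x, 5%, 80%) are the ones stated in the column descriptions.

```python
def percent_callable(depths, min_depth=10):
    """Percentage of reference positions with enough read coverage.

    `depths` holds one read-depth value per reference position, so the
    denominator is the reference length, not the consensus length.
    Treating a depth equal to `min_depth` as callable is an assumption.
    """
    if not depths:
        return 0.0
    n_callable = sum(1 for depth in depths if depth >= min_depth)
    return 100.0 * n_callable / len(depths)


def classify_status(pct_callable, detected_amplicons=None, expected_amplicons=None,
                    min_pct_confident=5.0, min_pct_amplicons=80.0):
    """Map metrics to a status label (the check order is an assumption)."""
    if pct_callable == 0:
        return "No callable bases"
    if pct_callable < min_pct_confident:
        return "Low confidence"
    # The amplicon check only applies to amplicon experiments.
    if detected_amplicons is not None and expected_amplicons:
        pct_amplicons = 100.0 * detected_amplicons / expected_amplicons
        if pct_amplicons < min_pct_amplicons:
            return "Insufficient titer for VC"
    return "Full analysis (consensus, VC)"


# Hypothetical per-position depths for an 8-base reference.
example_depths = [0, 5, 10, 12, 30, 9, 15, 20]
example_pct = percent_callable(example_depths)  # 5 of 8 positions -> 62.5
example_status = classify_status(example_pct)
```

With these example depths, five of eight positions reach 10x, so the genome would be labeled "Full analysis (consensus, VC)" in a non-amplicon experiment.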
This table contains the results of the Pangolin analysis performed on the generated consensus sequences across all samples. Pangolin is run if the "Enable Pangolin" box is checked on the input form and one of the following is true:
'Enrichment Panel' is set to a non-custom panel (e.g. VSP) and a consensus sequence was generated using SARS-CoV-2 as reference.
'Amplicon Primer Set' is set to a non-custom set with SARS-CoV-2 as reference (e.g. SARS-CoV-2, ARTIC v5.3.2 primers) and a valid consensus sequence was generated.
Either 'Enrichment Panel' or 'Amplicon Primer Set' is set to 'Custom'. In this case, Pangolin is applied to every consensus sequence generated for the sample since the software assumes all of them to be potentially SARS-CoV-2 sequences.
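The conditions above can be summarized as a per-sequence decision. The sketch below is only a restatement of that list in code, under the assumption that the decision is made for each consensus sequence. The function and parameter names are hypothetical and do not correspond to any setting name in the app.

```python
def pangolin_applies(enable_pangolin, selection_is_custom,
                     reference_is_sars_cov_2, consensus_generated):
    """Return True if Pangolin would be run on a consensus sequence.

    selection_is_custom: 'Enrichment Panel' or 'Amplicon Primer Set'
    is set to 'Custom' (hypothetical flag for illustration).
    """
    if not enable_pangolin or not consensus_generated:
        return False
    if selection_is_custom:
        # Custom runs treat every consensus sequence as potentially SARS-CoV-2.
        return True
    return reference_is_sars_cov_2
```

For example, with a non-custom panel a consensus sequence built against an Influenza reference is skipped, while with a custom panel the same sequence is passed to Pangolin.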
The Pangolin report contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.
The table in the Pangolin report is derived from the output of the Pangolin software. Please see the Pangolin documentation for more details: https://cov-lineages.org/resources/pangolin/output.html. Sequences with a bad Pangolin QC status are highlighted in yellow.
This table contains the results of the NextClade analysis performed on the generated consensus sequences across all samples. NextClade is run if the "Enable NextClade" box is checked on the input form and one of the following is true:
'Enrichment Panel' is set to a non-custom panel (e.g. VSP) and a consensus sequence was generated using a reference with NextClade dataset available.
'Amplicon Primer Set' is set to a non-custom set with a reference with NextClade dataset available (e.g. SARS-CoV-2, ARTIC v5.3.2 primers) and a valid consensus sequence was generated.
Either 'Enrichment Panel' or 'Amplicon Primer Set' is set to 'Custom' and one or more NextClade datasets are selected under 'Custom Reference'. In this case, each of the selected NextClade datasets is applied to each consensus sequence generated for the sample. This may result in multiple NextClade results for each consensus sequence, some of which may not be meaningful (e.g. "flu_h1n1pdm_ha" dataset applied to a NA segment of an Influenza genome).
The NextClade Report contains a button labeled "Group table by" with a drop-down menu allowing the user to group the results by various fields, including "None". The default is "Dataset" which means that all of the results for each NextClade dataset will be grouped together. For example, if a user is only interested in phylogenetic analysis performed on the HA segment of Influenza A H1N1, these results for each sample can be viewed together in the "flu_h1n1pdm_ha" collapsible group.
The NextClade report also contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.
The table in the NextClade report is derived from the output of the NextClade software. Please see the NextClade documentation for more details: https://docs.nextstrain.org/projects/nextclade/en/stable/user/output-files/04-results-tsv.html. Sequences with a bad NextClade QC status are highlighted in yellow.
Describes the reports that can be viewed from the individual sample links on the left side of the Reports tab or by clicking on sample names in the Metrics by Sample table.
At the top of the report is version information for the App and any third-party components.
Two buttons provide the ability to download relevant FASTA-formatted text files for this sample. The "Consensus" button initiates a download of a FASTA file containing all consensus sequences generated for this sample. The "Contig" button initiates a download of a FASTA file containing all assembled contigs for this sample.
The metrics by virus table contains information about each viral genome generated. Each row summarizes all sequences assigned to that virus. In the case of multi-segment viruses like Influenza, a row summarizes information across all segment sequences generated for a single viral genome. The report provides buttons to download the contents of the table as a CSV, JSON or PDF file. The table contains one row for every virus with at least one generated genome in the sample, and the following columns:
Virus: The name of the virus. For custom references, this will be the part of the FASTA header before the first whitespace character for the corresponding reference sequence if no custom genome definition file is provided. If a custom genome definition file is provided, this will be the value of the genomeName column that corresponds to the selected reference (matched by the value in the chrom column of the genome definition file and the part of the FASTA header before the first whitespace character).
Detected Amplicons (only if 'Experiment Type' is set to 'Amplicon'): The number of amplicons detected over the total number of amplicons expected for that sequence. The percentage of amplicons detected is used to determine whether the sample is of sufficient quality for variant calling. See Special considerations for amplicon sequencing with IMAP protocols for more details.
% callable bases: The percentage of the selected reference genome whose bases are considered "callable." Callable bases are those for which reliable variant calling can be performed and therefore for which the software can output a base call. Callable bases are defined as genomic positions with read coverage above the minimum read coverage depth for consensus sequence generation (10x by default). Note that genomic positions below the confidence threshold are hard-masked with "N" characters to avoid reference bias (inclusion of a reference base when the actual base cannot be accurately determined). Note that this percentage is calculated over the lengths of the reference genome(s), not the reported consensus sequence(s).
Status: The overall outcome of the analysis for this virus
Full analysis (consensus, VC) means that the sample analysis completed normally, that a sufficient number of amplicons were detected to ensure reliable variant calling (amplicon experiments only), and that the percentage of callable bases was above the minimum percentage of consensus sequence generated to label as confident (5% by default).
Low confidence means that there is at least one callable base but the overall percentage of callable bases was below the minimum percentage of consensus sequence generated to label as confident (5% by default).
No callable bases indicates that zero positions in the indicated reference genome were callable and no consensus sequence is therefore provided.
Insufficient titer for VC will only be present for an amplicon experiment and indicates that the number of detected amplicons was below the minimum percentage (default 80%) required for reliable variant calling. See Special considerations for amplicon sequencing with IMAP protocols for more details.
Median coverage: The median coverage value (in number of reads overlapping each position) over the entire reference genome (not just the generated consensus sequence).
Consensus FASTA: A download link to a FASTA-formatted text file containing all the consensus sequences generated for this virus.
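Because the median coverage is taken over the entire reference genome, positions with no coverage pull the value down. The sketch below illustrates the difference between the two ways of computing it. The depth values are made up for illustration, and the helper name is hypothetical.

```python
from statistics import median


def median_coverage(depths):
    """Median read depth over all positions supplied (0 if none)."""
    return median(depths) if depths else 0


# Hypothetical depths for a 7-base reference with an uncovered start.
reference_depths = [0, 0, 0, 40, 50, 60, 70]

over_reference = median_coverage(reference_depths)                      # 40
over_covered = median_coverage([d for d in reference_depths if d > 0])  # 55.0
```

The value reported in the table corresponds to the first calculation (all reference positions), not the second.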
This table summarizes the results for each sequence generated for the sample. For multi-segment viruses such as Influenza, there may be multiple sequences detected for a given virus. For single-segment viruses there will typically be only one sequence per virus. The report provides buttons to download the contents of the table as a CSV, JSON or PDF file. The table contains one row for every sequence, and the following columns:
Virus: The name of the virus. For custom references, this will be the part of the FASTA header before the first whitespace character for the corresponding reference sequence if no custom genome definition file is provided. If a custom genome definition file is provided, this will be the value of the genomeName column that corresponds to the selected reference (matched by the value in the chrom column of the genome definition file and the part of the FASTA header before the first whitespace character).
Segment: The name of the genome segment to which the sequence belongs. For viruses with a single segment, the name of the segment will typically be "Full".
Accession: The accession number or other short unique identifier for the sequence. If using a custom genome definition BED, this value is taken from the first column (chrom) in the definition file. If using a custom reference without a genome definition file, the value is taken from the part of the FASTA header before the first whitespace character.
% callable bases: The percentage of the selected reference sequence whose bases are considered "callable." Callable bases are those for which reliable variant calling can be performed and therefore for which the software can output a base call. Callable bases are defined as genomic positions with read coverage above the minimum read coverage depth for consensus sequence generation (10x by default). Note that genomic positions below the confidence threshold are hard-masked with "N" characters to avoid reference bias (inclusion of a reference base when the actual base cannot be accurately determined). Note that this percentage is calculated over the lengths of the reference sequence, not the reported consensus sequence.
Median coverage: The median coverage value (in number of reads overlapping each position) over the entire reference sequence (not just the generated consensus sequence).
Consensus sequence length: The length of this consensus sequence. The reported length is the length of the hard-masked sequence after trimming any leading and trailing masked regions (if trimming is active).
# callable bases: The number of positions in the reference sequence above the minimum read coverage depth for consensus sequence generation (default 10x). In other words, the number of positions not masked. This may not be equal to the number of unmasked positions in the final consensus sequence since insertions and deletions are applied after masking.
Consensus FASTA: A download link to a FASTA-formatted text file containing this consensus sequence.
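Two of the columns above lend themselves to a short illustration: the identifier taken from a FASTA header, and the consensus length after leading and trailing masked regions are trimmed. This is a sketch under stated assumptions, not the app's code. The function names are hypothetical, and the masked sequence is invented for the example.

```python
def accession_from_header(header):
    """Return the part of a FASTA header before the first whitespace."""
    parts = header.lstrip(">").split(None, 1)
    return parts[0] if parts else ""


def trimmed_consensus_length(masked_sequence, mask_char="N"):
    """Length after removing leading and trailing masked positions.

    Internal masked positions are kept, so the result can be larger
    than the number of unmasked bases.
    """
    return len(masked_sequence.strip(mask_char))


example_header = ">MN908947.3 Severe acute respiratory syndrome coronavirus 2"
example_id = accession_from_header(example_header)  # "MN908947.3"

example_masked = "NNNACGTNNACGTNN"  # made-up hard-masked consensus
example_length = trimmed_consensus_length(example_masked)             # 10
example_unmasked = sum(1 for base in example_masked if base != "N")   # 8
```

In this example the trimmed length (10) exceeds the number of unmasked bases (8) because the two internal masked positions are retained.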
This stacked bar plot contains information about the outcome of the pre-processing steps (read QC, trimming, de-hosting) as well as the alignment step. It contains counts of reads that fall into the following categories:
Removed in QC: Reads that failed to meet the minimum quality thresholds and were excluded from further processing.
Removed in trimming: Reads that were removed in the initial sequence-based primer trimming step and were excluded from further processing.
Removed in de-hosting: Reads that were removed in the de-hosting step and excluded from further processing. De-hosting is the process of removing reads that may originate from the host organism. Currently only human hosts are supported. De-hosting reads improves the quality of downstream analysis and helps ensure that human sequences are not included in the output BAM files.
Unmapped: Reads that were not aligned to any reference genomes.
Mapped: Reads that were mapped to at least one reference genome.
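To relate the categories above to each other, the counts can be converted to percentages of the total. The sketch below assumes the five categories are mutually exclusive and together account for all input reads, which the report does not state explicitly. The counts are invented for illustration.

```python
def category_percentages(counts):
    """Convert per-category read counts to percentages of their total."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}


# Hypothetical counts for one sample.
example_counts = {
    "Removed in QC": 50,
    "Removed in trimming": 25,
    "Removed in de-hosting": 125,
    "Unmapped": 100,
    "Mapped": 700,
}
example_percentages = category_percentages(example_counts)
```

With these counts, 70% of the reads are mapped and 12.5% are removed in de-hosting.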
A column plot displaying the numbers and percentages of all reads that aligned to each reference sequence with at least one mapped read. The columns are labeled by both virus and segment name (if available) on the x-axis, and the y-axis is the read count for each sequence.
Displays a trace of the read coverage over each reference sequence. The drop-down menu in the upper left allows the user to switch between viruses. If multiple segment sequences are generated for a single virus, their corresponding coverage plots will be displayed in a vertically stacked fashion. The black trace represents the read coverage, with the coverage depth in number of reads on the left y-axis and the position in the reference sequence on the x-axis.
The minimum read coverage depth for consensus sequence generation (default 10x) is plotted as a dashed orange line across the plot, to easily visualize locations where coverage drops below the threshold (which will be masked in the consensus sequence) and where the coverage is above the threshold (which will be reported in the consensus sequence).
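The regions where the coverage trace falls below the dashed threshold line can also be found from per-position depths. The sketch below lists them as 1-based, inclusive intervals. The function name is hypothetical, and treating a depth equal to the threshold as sufficient is an assumption.

```python
def low_coverage_intervals(depths, min_depth=10):
    """Return 1-based inclusive (start, end) runs with depth < min_depth."""
    intervals = []
    start = None
    for position, depth in enumerate(depths, start=1):
        if depth < min_depth:
            if start is None:
                start = position  # a new low-coverage run begins
        elif start is not None:
            intervals.append((start, position - 1))
            start = None
    if start is not None:
        # The reference ends inside a low-coverage run.
        intervals.append((start, len(depths)))
    return intervals


# Hypothetical depths for a 10-base reference.
example_intervals = low_coverage_intervals([0, 5, 10, 12, 30, 9, 15, 20, 3, 2])
# -> [(1, 2), (6, 6), (9, 10)]
```

Each interval corresponds to a stretch that would be masked in the consensus sequence.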
The median coverage is plotted as a dashed teal line across the plot.
By default, sequence variants representing differences between the consensus sequence and the reference sequence are also plotted, with allele frequency on the right y-axis. The colors and symbols represent different sequence variant types. See the figure legend for details.
The "Show log-scale" toggle switch allows the user to switch between logarithmic and linear scales for the coverage (left) y-axis.
The "Show Median" toggle switch allows the user to turn the median coverage line on and off.
The "Show Sequence Variants" toggle switch allows the user to turn the plotting of sequence variants on and off.
This table contains the results of the Pangolin analysis performed on the generated consensus sequences. Pangolin is run if the "Enable Pangolin" box is checked on the input form and one of the following is true:
'Enrichment Panel' is set to a non-custom panel (e.g. VSP) and a consensus sequence was generated using SARS-CoV-2 as reference.
'Amplicon Primer Set' is set to a non-custom set with SARS-CoV-2 as reference (e.g. SARS-CoV-2, ARTIC v5.3.2 primers) and a valid consensus sequence was generated.
Either 'Enrichment Panel' or 'Amplicon Primer Set' is set to 'Custom'. In this case, Pangolin is applied to every consensus sequence generated for the sample since the software assumes all of them to be potentially SARS-CoV-2 sequences.
The Pangolin report contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.
The table in the Pangolin report is derived from the output of the Pangolin software. Please see the Pangolin documentation for more details: https://cov-lineages.org/resources/pangolin/output.html. Sequences with a bad Pangolin QC status are highlighted in yellow.
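The sequences that the report highlights in yellow can also be picked out of the downloaded CSV. The sketch below assumes the CSV keeps Pangolin's own `taxon`, `lineage` and `qc_status` column names and that a passing status is written as `pass`. Verify both against the actual file. The rows are invented for illustration.

```python
import csv
import io


def failing_qc(csv_text, status_column="qc_status", pass_value="pass"):
    """Return rows whose QC status is anything other than `pass_value`."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in rows
        if (row.get(status_column) or "").strip().lower() != pass_value
    ]


# Hypothetical report contents.
example_pangolin_csv = (
    "taxon,lineage,qc_status\n"
    "sample_A,BA.2,pass\n"
    "sample_B,Unassigned,fail\n"
)
example_failed = failing_qc(example_pangolin_csv)
```

Here `example_failed` contains only the row for `sample_B`.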
This table contains the results of the NextClade analysis performed on the generated consensus sequences. NextClade is run if the "Enable NextClade" box is checked on the input form and one of the following is true:
'Enrichment Panel' is set to a non-custom panel (e.g. VSP) and a consensus sequence was generated using a reference with NextClade dataset available.
'Amplicon Primer Set' is set to a non-custom set with a reference with NextClade dataset available (e.g. SARS-CoV-2, ARTIC v5.3.2 primers) and a valid consensus sequence was generated.
Either 'Enrichment Panel' or 'Amplicon Primer Set' is set to 'Custom' and one or more NextClade datasets are selected under 'Custom Reference'. In this case, each of the selected NextClade datasets is applied to each consensus sequence generated for the sample. This may result in multiple NextClade results for each consensus sequence, some of which may not be meaningful (e.g. "flu_h1n1pdm_ha" dataset applied to a NA segment of an Influenza genome).
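In the custom case, the number of NextClade results grows as the product of consensus sequences and selected datasets. The sketch below enumerates those pairings to show why some results may not be meaningful. The sequence labels are hypothetical, and the dataset names are used only as examples.

```python
from itertools import product


def nextclade_pairings(consensus_names, selected_datasets):
    """List every (consensus sequence, dataset) pair that would be analyzed."""
    return list(product(consensus_names, selected_datasets))


# Hypothetical segment sequences from one Influenza sample.
example_pairings = nextclade_pairings(
    ["sample_A HA segment", "sample_A NA segment"],
    ["flu_h1n1pdm_ha", "flu_h1n1pdm_na"],
)
```

Two sequences and two datasets give four results, two of which pair a segment with a dataset built for the other segment.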
The NextClade Report contains a button labeled "Group table by" with a drop-down menu allowing the user to group the results by various fields, including "None". The default is "Dataset" which means that all of the results for each NextClade dataset will be grouped together. For example, if a user is only interested in phylogenetic analysis performed on the HA segment of Influenza A H1N1, these results for each sample can be viewed together in the "flu_h1n1pdm_ha" collapsible group.
The NextClade report also contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.
The table in the NextClade report is derived from the output of the NextClade software. Please see the NextClade documentation for more details: https://docs.nextstrain.org/projects/nextclade/en/stable/user/output-files/04-results-tsv.html. Sequences with a bad NextClade QC status are highlighted in yellow.
Brief description of Summary and Result reports and an explanation of their contents
The app produces a summary report as well as result reports for each of the samples analyzed. See the links below for a description of each.