Once the analysis completes, the "REPORTS" tab on BaseSpace lets users view the Summary of the entire analysis, which summarizes results from all input samples, as well as an individual Sample Report for each sample.
The Sample Report contains at most four tabs: Sample QC, Virus Metrics, Nextclade Report, and Pangolin Report.
This tab contains tables and plots summarizing the sample.
This table reports summary metrics for the sample, such as Status and Detected Amplicons. See here for their definitions.
This plot displays counts of reads that fall into different categories. See here for their definitions.
This plot displays the number of reads that mapped to each reference sequence. If there is a single reference sequence (e.g. SARS-CoV-2), one bar is shown.
This table provides the number of reads that mapped to each reference sequence along with the genome and segment names of the reference sequence. The "Download CSV" button enables downloading the contents of the table as a text comma-separated value (CSV) file.
This table summarizes results for each viral genome generated in the sample with each row corresponding to a single viral genome. For segmented viruses like Influenza, a row will summarize information across multiple sequences generated for a single viral genome.
At the top is the "Download CSV" button, which enables downloading the contents of the table as a text comma-separated value (CSV) file.
The table itself contains rows for every viral genome with at least one sequence generated in the sample with the following columns:
Virus: Name of the viral genome
For custom references with no custom genome definition file, this is the part of the FASTA header before the first whitespace character for the corresponding reference sequence. If a custom genome definition file is provided, this is the value of the genomeName column.
% Callable: Percentage of bases in the reference sequence with coverage above the minimum read coverage depth for consensus sequence generation (10x by default). This is computed across all sequences belonging to the viral genome. See here for more information.
Median Coverage: Median coverage value (in number of reads overlapping each position) over the entire reference genome (not just the generated consensus sequence).
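The genome-level metrics above can be illustrated with a minimal sketch. It assumes per-position depths are already available for each segment, that "above the minimum" means a depth of at least 10, and that values are pooled across segments before computing the median. The function name and the example depths are hypothetical and are not taken from the software.

```python
from statistics import median

MIN_DEPTH = 10  # default minimum read coverage depth quoted in the text


def genome_metrics(depth_by_segment, min_depth=MIN_DEPTH):
    """Pool per-position depths from all segments of one viral genome."""
    pooled = [d for depths in depth_by_segment.values() for d in depths]
    # Assumption: a position is callable when depth >= min_depth.
    callable_bases = sum(1 for d in pooled if d >= min_depth)
    return {
        "pct_callable": 100.0 * callable_bases / len(pooled),
        "median_coverage": median(pooled),
    }


# Made-up depths for a hypothetical two-segment genome.
example = {"PB2": [0, 12, 30, 30], "HA": [10, 9, 50, 50, 50, 50]}
metrics = genome_metrics(example)
print(metrics)
```

Here 8 of the 10 pooled positions reach the threshold, so the sketch reports 80% callable for the genome as a whole.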
This table summarizes the results for each sequence generated in the sample. For segmented viruses like Influenza, there are typically multiple rows with the same virus name. Otherwise, this table contains information similar to that in the Metrics By Virus table.
At the top is the "Download CSV" button, which enables downloading the contents of the table as a text comma-separated value (CSV) file.
Virus: Name of the virus genome
Segment: Name of the segment to which the reference sequence corresponds. For non-segmented viruses, this is typically set to "Full".
Accession: Unique identifier of the reference sequence (text before first space in FASTA header if custom reference FASTA was provided)
% Callable: Percentage of bases in the reference sequence with coverage above the minimum read coverage depth for consensus sequence generation (10x by default). See here for more information on this metric.
Callable Bases: Number of bases in the reference sequence with coverage above the minimum read coverage depth for consensus sequence generation (10x by default)
Median Coverage: Median coverage value (in number of reads overlapping each position) over the entire reference genome (not just the generated consensus sequence).
Consensus Length: Length of the final consensus, without leading and trailing masked bases if sequence trimming is enabled. Sequence trimming can be disabled in the Input Form under Advanced Workflow Settings.
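As a rough illustration of the Accession and Consensus Length columns, the sketch below takes the text before the first whitespace of a FASTA header and measures a consensus with and without leading and trailing masked bases. The helper names and the example header are hypothetical, and it assumes masked bases are written as "N".

```python
def accession_from_header(header):
    """Return the text before the first whitespace of a FASTA header line."""
    return header.lstrip(">").split(None, 1)[0]


def consensus_length(consensus, trim=True):
    """Length of the consensus, optionally ignoring leading/trailing N runs."""
    # Only the ends are stripped; internal masked positions still count.
    return len(consensus.strip("Nn")) if trim else len(consensus)


accession = accession_from_header(">MY_REF.1 example virus, complete genome")
length_trimmed = consensus_length("NNNACGTNNACGTNN")
length_full = consensus_length("NNNACGTNNACGTNN", trim=False)
print(accession, length_trimmed, length_full)
```

With trimming enabled the example sequence has length 10; with trimming disabled it has length 15.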
Displays a trace of read coverage over each reference genome. On the top right is a drop-down menu that allows users to switch between genomes. The blue line represents the read coverage, with the coverage depth (log10 of the number of reads) on the y-axis and the genomic position in the reference genome on the x-axis.
For segmented viruses like Influenza, coverage values for each segment are displayed in a horizontally stacked fashion. Grey blocks at the top show their boundaries.
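The layout of such a plot can be sketched as follows. This is an illustration only, not the report's actual rendering code: the function name is hypothetical, and clamping zero depth to one read so that log10 is defined is an assumption.

```python
import math


def stacked_trace(depth_by_segment):
    """Return x positions, log10 depths, and segment boundaries."""
    xs, ys, boundaries = [], [], []
    offset = 0
    for name, depths in depth_by_segment.items():
        # Each segment occupies the next block of x positions.
        boundaries.append((name, offset + 1, offset + len(depths)))
        for i, depth in enumerate(depths, start=1):
            xs.append(offset + i)
            # Assumption: zero depth is clamped to 1 read before taking log10.
            ys.append(math.log10(max(depth, 1)))
        offset += len(depths)
    return xs, ys, boundaries


# Made-up depths for two hypothetical segments.
xs, ys, boundaries = stacked_trace({"seg1": [1, 10, 100], "seg2": [0, 1000]})
print(boundaries)
```

The boundary tuples correspond to the grey blocks: each gives a segment name with its first and last x position in the combined axis.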
This tab contains tables reporting the results of the Nextclade analysis performed on the generated consensus sequences in the sample. See here for more details.
This tab contains tables reporting the results of the Pangolin analysis performed on the generated consensus sequences in this sample. See here for more details.
The Summary contains at most three tabs: Summary Report, Nextclade Report, and Pangolin Report.
This table provides a top-line summary of each of the analyzed samples.
At the top is the "Download CSV" button, which enables downloading the contents of the table as a text comma-separated value (CSV) file.
Next is the table itself, which contains one row per sample and the following columns:
Sample: Name of the BaseSpace sample analyzed
Status: Status of the sample analysis
Input Reads: Total number of reads in input FASTQs
Mapped Reads: Number of reads that map to reference sequences during short read alignment
Detected Amplicons: Proportion of amplicons detected out of the total expected for the sample, which is used to determine whether the sample is of sufficient quality for variant calling. See this page for more details.
Num Genomes: Number of genomes chosen during the reference selection stage
Virus: Name of the genome to which the reference sequence belongs
% Callable: Percentage of bases in the reference sequence with coverage above the minimum read coverage depth for consensus sequence generation (10x by default)
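A downloaded summary CSV can be screened with a few lines of code. The sketch below assumes the CSV header row uses the column names listed above ("Sample", "Virus", "% Callable") and that the percentage is a plain number, optionally followed by "%"; neither is confirmed here. The 90% threshold and the example rows are made up for illustration.

```python
import csv
import io


def low_callable_samples(csv_text, threshold=90.0):
    """List (sample, virus, pct) rows whose % Callable is below threshold."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: the value is numeric, optionally with a trailing "%".
        pct = float(row["% Callable"].rstrip("%"))
        if pct < threshold:
            flagged.append((row["Sample"], row["Virus"], pct))
    return flagged


# Made-up CSV content standing in for a downloaded file.
example_csv = "Sample,Virus,% Callable\nS1,SARS-CoV-2,99.1\nS2,SARS-CoV-2,42.0\n"
flagged = low_callable_samples(example_csv)
print(flagged)
```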
Callable bases are those for which reliable variant calling can be performed and therefore for which the software can output a base call. They are defined as genomic positions with read coverage above the minimum read coverage depth for consensus sequence generation (10x by default).
When generating consensus sequences, genomic positions below the threshold are hard-masked with "N" characters to avoid reference bias (inclusion of a reference base when the true base cannot be accurately determined).
This percentage is calculated over the lengths of the reference genome(s), not the final consensus sequence(s) which may be trimmed.
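The definition above can be made concrete with a small sketch. It assumes a per-position depth list aligned to the reference and treats a depth of at least 10 as callable; the function names and example values are hypothetical and do not describe the software's internal implementation.

```python
MIN_DEPTH = 10  # default minimum read coverage depth quoted in the text


def callable_stats(depths, min_depth=MIN_DEPTH):
    """Return (callable bases, % callable) over the reference length."""
    callable_bases = sum(1 for d in depths if d >= min_depth)
    # The denominator is the reference length, not the trimmed consensus.
    return callable_bases, 100.0 * callable_bases / len(depths)


def mask_low_coverage(bases, depths, min_depth=MIN_DEPTH):
    """Hard-mask positions below the depth threshold with 'N'."""
    return "".join(b if d >= min_depth else "N" for b, d in zip(bases, depths))


# Made-up depths for an 8-base reference.
depths = [3, 10, 25, 40, 8, 0, 12, 15]
callable_bases, pct_callable = callable_stats(depths)
masked = mask_low_coverage("ACGTACGT", depths)
print(callable_bases, pct_callable, masked)
```

In this example five of eight positions are callable (62.5%), and the three low-coverage positions are replaced with "N" rather than filled in from the reference.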
This stacked bar plot contains counts of reads that fall into the following categories:
Removed in Downsampling: Reads that were removed during downsampling because the user specified a downsampling target in the Input Form under Advanced Workflow Settings
Removed in QC: Reads removed as poor quality reads based on quality thresholds during pre-processing
Removed as Duplicate: Reads that were labeled as duplicates during short read alignment. Duplicate removal can be disabled in the Input Form under Advanced Workflow Settings.
Removed in Trimming: Reads that were removed in the initial sequence-based primer trimming step and were excluded from further processing
Removed in De-hosting: Reads that were filtered out as human reads based on kmer-based classification during pre-processing.
This improves the quality of downstream analysis and helps ensure that human sequences are not included in the output BAM files.
This is applied only if 'Amplicon Primer Set' was set to 'Custom' in the Input Form.
This can be disabled in the Input Form under Advanced Workflow Settings by unchecking "Remove off-target reads".
Removed as Off-target: Reads that were filtered out as off-target reads based on kmer-based classification during pre-processing
Similar to de-hosting, this improves the quality of downstream analysis.
Off-target is defined as not coming from the target organism, which is determined based on the 'Amplicon Primer Set' selection in the Input Form. For example, if "Influenza A and B, Universal Primers" option is selected, a kmer database generated from a large collection of publicly available Influenza sequences is used to separate reads likely coming from Influenza from the rest.
This can be disabled in the Input Form under Advanced Workflow Settings by unchecking "Remove off-target reads".
Unmapped: Reads that were not aligned to any reference genome
Mapped: Reads that were mapped to at least one reference genome
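To show how the categories relate to one another in a stacked bar, the sketch below converts per-category counts into shares of the total. It assumes each read is counted in exactly one category, which the text implies but does not state outright. The function name and the example counts are made up.

```python
CATEGORIES = [
    "Removed in Downsampling",
    "Removed in QC",
    "Removed as Duplicate",
    "Removed in Trimming",
    "Removed in De-hosting",
    "Removed as Off-target",
    "Unmapped",
    "Mapped",
]


def category_shares(counts):
    """Return each category's percentage of all counted reads."""
    total = sum(counts.get(c, 0) for c in CATEGORIES)
    if total == 0:
        return {c: 0.0 for c in CATEGORIES}
    # Categories missing from the input are treated as zero reads.
    return {c: 100.0 * counts.get(c, 0) / total for c in CATEGORIES}


# Made-up counts for one hypothetical sample.
shares = category_shares({"Mapped": 800, "Unmapped": 100, "Removed in QC": 100})
print(shares["Mapped"])
```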
This tab contains tables reporting the results of the Nextclade analysis performed on the generated consensus sequences across all samples. Nextclade is run if the "Enable NextClade" box is checked on the Input Form and one of the following is true:
'Amplicon Primer Set' is set to a non-custom set with a reference with Nextclade dataset available and a valid consensus sequence was generated.
'Amplicon Primer Set' is set to 'Custom' and one or more Nextclade datasets are selected under 'Custom Reference'. In this case, each of the selected Nextclade datasets is applied to each consensus sequence generated for every sample. This may result in multiple Nextclade results for each consensus sequence.
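The two conditions above can be read as a simple decision rule. The sketch below is an interpretation of the text, not the application's code: all function and parameter names are hypothetical. The second helper shows why a custom run can yield several Nextclade results per consensus sequence.

```python
def should_run_nextclade(enabled, primer_set, builtin_has_dataset,
                         has_valid_consensus, custom_datasets):
    """Decide whether Nextclade runs, following the conditions in the text."""
    if not enabled:
        return False
    if primer_set == "Custom":
        # Custom runs need at least one selected Nextclade dataset.
        return len(custom_datasets) > 0
    return builtin_has_dataset and has_valid_consensus


def nextclade_jobs(consensus_ids, custom_datasets):
    """Pair every consensus sequence with every selected dataset."""
    return [(c, d) for c in consensus_ids for d in custom_datasets]


# Hypothetical consensus and dataset names.
jobs = nextclade_jobs(["cons1", "cons2"], ["datasetA", "datasetB"])
print(len(jobs))
```

Two consensus sequences and two selected datasets give four Nextclade results in this example.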
Each table contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.
All content shown in the tab is derived from the output of the Nextclade software. Please see the Nextclade documentation for more details.
This tab contains tables reporting the results of the Pangolin analysis performed on the generated consensus sequences across all samples. Pangolin is run if the "Enable Pangolin" box is checked on the Input Form and one of the following is true:
'Amplicon Primer Set' is set to a non-custom set with SARS-CoV-2 as reference (e.g. SARS-CoV-2, ARTIC v5.4.2 primers) and a valid consensus sequence was generated
'Amplicon Primer Set' is set to 'Custom'. In this case, Pangolin is applied to every consensus sequence generated for the sample since the software assumes all of them to be potentially SARS-CoV-2 sequences.
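A minimal sketch of the Pangolin conditions follows, written as a function that returns the consensus sequences Pangolin would be applied to. The names are hypothetical, and the sketch assumes the list passed in holds only valid consensus sequences.

```python
def pangolin_targets(enabled, primer_set, reference_is_sars_cov_2,
                     consensus_ids):
    """Return the consensus sequences that Pangolin would be applied to."""
    if not enabled:
        return []
    if primer_set == "Custom":
        # Every consensus is treated as potentially SARS-CoV-2.
        return list(consensus_ids)
    if reference_is_sars_cov_2:
        return list(consensus_ids)
    return []


# Hypothetical custom run with two consensus sequences.
targets = pangolin_targets(True, "Custom", False, ["cons1", "cons2"])
print(targets)
```

Note that in the custom case the reference identity is not checked, so lineage calls for non-SARS-CoV-2 consensus sequences should be interpreted with that in mind.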
Each table contains a "Download CSV" button which allows the user to download the contents of the report as a text CSV file.
All content shown in the tab is derived from the output of the Pangolin software. Please see the Pangolin documentation for more details.