❓Frequently Asked Questions (FAQ)

Q: After processing my sample with an enrichment panel, the majority of my reads are removed in preprocessing and/or I only have a small amount of viral reads. Is the enrichment panel working?

A: The enrichment protocols can increase the abundance of the targeted viral species several thousand fold. However, in many sample types, especially clinical, wastewater, or environmental samples, viral RNA or DNA makes up a tiny proportion of the total nucleic acids present, with the remainder dominated by host (human) or bacterial/archaeal material. So even after a dramatic increase over what you would obtain without enrichment, the percentage of viral reads can still be low. For example, you may have a sample with only 2% viral reads after enrichment, where without enrichment you might have obtained only 0.1% viral reads. If a virus is at low abundance after enrichment, it was likely at extremely low abundance before enrichment.
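To make the arithmetic concrete, here is a minimal sketch. It assumes a simplified model in which enrichment multiplies the viral molecules by a fixed factor and leaves everything else unchanged. The function name and the numbers are hypothetical and are not measurements from the panel.

```python
def viral_fraction_after_enrichment(start_fraction: float, fold: float) -> float:
    """Fraction of viral material after enriching it `fold` times.

    Assumption: only the viral part is multiplied; background stays as is.
    """
    viral = start_fraction * fold
    background = 1.0 - start_fraction
    return viral / (viral + background)


# Hypothetical sample: 1 viral molecule per 100,000 before enrichment.
before = 1e-5
after = viral_fraction_after_enrichment(before, fold=2000)
print(f"before: {before:.4%}  after: {after:.2%}")
```

Under these assumptions, a 2000-fold enrichment still leaves viral reads at roughly 2% of the total, because the starting fraction was so small.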

Q: The default minimum contig coverage is 10x, but is that across the entire contig, the median, the average, or something else?

A: The 10x threshold is applied per-nucleotide. Any positions below 10x coverage will be hard-masked with "N".
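As an illustration of per-nucleotide masking, the following sketch replaces every base whose depth is below the threshold with "N". It is not the app's code; the function name and the input lists are hypothetical.

```python
def mask_low_coverage(sequence: str, depths: list[int], min_coverage: int = 10) -> str:
    """Hard-mask each position whose read depth is below min_coverage."""
    if len(sequence) != len(depths):
        raise ValueError("sequence and depths must have the same length")
    return "".join(
        base if depth >= min_coverage else "N"
        for base, depth in zip(sequence, depths)
    )


# Positions with depth 9 and 3 are masked; depth 10 meets the threshold.
print(mask_low_coverage("ACGTAC", [25, 12, 9, 10, 3, 40]))  # ACNTNC
```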

Q: In the demo data, I downloaded the consensus sequence FASTA file and each sequence line says “Panel reference sequences are not necessarily comprehensive and should not be used for strain typing.” Does that mean that even though VSP provides full-genome resolution of all 66 viruses, the app can only strain-type the strains listed because of the reference sequences the app uses?

A: Correct, we only align to a limited number of reference sequences for each virus type, so the sequence accession in the consensus genomes (and coverage plots, etc) merely reflects the best match chosen from that subset. There could be sequences in RefSeq that are a closer match. Furthermore, strain typing is not necessarily as simple as choosing the closest matching genome; there are further complexities that can go into it, and we have not systematically developed or tested any strain typing capability to date. The noted message is to warn users that the sequence accession in the consensus genome does not necessarily reflect the true phylogeny of the organisms in the sample and should not be taken as such.

Q: Why am I seeing some segments from one Influenza A subtype and some from another subtype? Does my sample contain both subtypes?

A: For each de novo assembled contig, we aim to find the best matching reference sequence rather than an entire reference genome. If the best match for one contig is a reference sequence from one subtype and the best match for another contig is a reference sequence from another subtype, then we will report them as such. This is not necessarily indicative of a mixed infection, reassortment, or error. It is usually reflective of how similar certain segments can be across different subtypes.

Influenza A viruses are classified into different subtypes based on the hemagglutinin (HA) and neuraminidase (NA) proteins, which are encoded by segments 4 and 6, respectively. Therefore, we recommend focusing on those segments to infer the subtype. If there is a sequence generated from segment 8 of an H3N2 genome but all the rest of the consensus sequences are generated from reference sequences from an H1N1 genome (indicating H1 and N1 subtypes), then the sample likely contains H1N1, not H3N2. One possible explanation is that segment 8 from H1N1 and segment 8 from H3N2 were both good matches for a particular contig but the one from H3N2 was a slightly better match and therefore chosen as final reference. Similarly, if there is a sequence generated from segment 4 of an H1N1 genome (indicating H1 subtype) and a sequence generated from segment 6 of an H5N1 genome (indicating N1 subtype), then the sample likely contains H1N1, not H5N1.
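The recommendation to read the subtype from segments 4 and 6 can be sketched as follows. This is a hypothetical helper, not part of the app. It assumes you have already noted, for each segment, the subtype label (such as "H1N1") of the reference genome that the consensus sequence was generated from.

```python
import re


def infer_influenza_a_subtype(segment_refs: dict[int, str]) -> str:
    """Combine the H type from segment 4 (HA) with the N type from segment 6 (NA).

    segment_refs maps segment number -> subtype label of its chosen reference.
    Other segments are ignored; a missing segment yields "H?" or "N?".
    """
    ha = re.match(r"(H\d+)N\d+", segment_refs.get(4, ""))
    na = re.match(r"H\d+(N\d+)", segment_refs.get(6, ""))
    h_type = ha.group(1) if ha else "H?"
    n_type = na.group(1) if na else "N?"
    return h_type + n_type


# Segment 8 matched an H3N2 reference, but HA and NA both point to H1N1.
print(infer_influenza_a_subtype({4: "H1N1", 6: "H1N1", 8: "H3N2"}))  # H1N1
# HA from an H1N1 reference, NA from an H5N1 reference: still H1 and N1.
print(infer_influenza_a_subtype({4: "H1N1", 6: "H5N1"}))  # H1N1
```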

Q: Why does the "Detected Amplicons" column show 7 in the denominator for an Influenza genome when there should be 8 amplicons in total?

A: The denominator in the "Detected Amplicons" column is based on the reference sequences selected from the de novo assembled contigs. Depending on the quality of the sample and/or reads, the assembler may not have enough data to generate a contig for some segments. Shorter segments are more likely to be missed. If only 7 segments are selected as final references for short read alignment, then we expect 7 amplicons in total. If you believe that the sample should contain all 8 segments, you can download the contig FASTA file from our report page and submit it to NCBI BLAST to see if all 8 segments are present in the contig sequences.

One known issue is that chimeric reads can be generated during library preparation, which can lead to chimeric contigs, where the contig sequence contains sequences from more than one segment. This can result in missing an entire segment in the reference selection stage. A workaround may be to filter out chimeric reads from your FASTQ files before running the app.

Alternatively, you can force the app to use all 8 segments of a particular Influenza genome by providing a custom reference FASTA file with all 8 segment sequences and a custom reference BED file with the genomeName column set to the same value for every segment (e.g. Influenza A). This way, the app will not perform assembly and will use all 8 segments as the reference sequences for short read alignment.

Q: What is the difference between the "Detected Amplicons" and "% callable bases" columns?

A: The "Detected Amplicons" column shows the number of amplicons detected over the total number of amplicons expected for that genome. The percentage of amplicons detected is used to infer if the sample is of sufficient quality for variant calling. The "% callable bases" column shows the percentage of the selected reference genome whose bases are above the minimum read coverage depth for consensus sequence generation, which is computed independent of amplicon coordinates. Both metrics are useful to assess the quality of the sample, but the percentage of detected amplicons is used by the app after short read alignment to filter out low-titer samples and the percentage of callable bases is not.
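The sketch below contrasts the two metrics on a toy coverage track. The callable-base calculation follows the description above. The rule used to call an amplicon "detected" (mean depth at or above a threshold) is an assumption made only for illustration; the app's actual criterion is not described here, and all names are hypothetical.

```python
def percent_callable(depths: list[int], min_coverage: int = 10) -> float:
    """Percentage of reference positions at or above min_coverage.

    Computed over the whole reference, without using amplicon coordinates.
    """
    callable_bases = sum(1 for depth in depths if depth >= min_coverage)
    return 100.0 * callable_bases / len(depths)


def detected_amplicons(depths, amplicons, min_mean_depth=10):
    """Return (detected, expected) for half-open (start, end) amplicons.

    ASSUMPTION: an amplicon counts as detected when its mean depth
    reaches min_mean_depth. This rule is hypothetical.
    """
    detected = 0
    for start, end in amplicons:
        window = depths[start:end]
        if window and sum(window) / len(window) >= min_mean_depth:
            detected += 1
    return detected, len(amplicons)


depths = [30] * 6 + [0] * 4            # toy 10-base reference
amplicons = [(0, 5), (5, 10)]          # two toy amplicons
print(percent_callable(depths))        # 60.0
print(detected_amplicons(depths, amplicons))  # (1, 2)
```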

Q: I don't see the virus I'm interested in listed in the "Genomes generated" column. Does that mean the virus is not present in my sample?

A: Not necessarily. Your virus of interest may be present in the sample, but the app may not have generated a consensus sequence for it for various reasons. One reason could be that there are too few reads coming from that virus. Tools like DRAGEN Metagenomics can be used to characterize what is in the sample more broadly. Another reason could be that the virus in your sample is too divergent from the reference sequences used in the app. In such cases, we recommend downloading the contig FASTA file from our report page and submitting it to NCBI BLAST. If you do see a sequence that matches your virus of interest, you can provide that sequence to the app as a custom reference genome.

Q: I am using the custom analysis workflow and my analysis aborted or shows an error, why?

A: An analysis can fail for many reasons, but one of the most common is that the custom database was not formatted correctly. The requirements for the Custom Reference FASTA For Consensus Generation are:

  • Do not use spaces in the file name; use an underscore "_" instead

  • Do not exceed 25 characters in the file name

  • File extension must be .fasta or .fa

  • Do not exceed the file size limitation: 16GB for a single file or 25GB for multiple files

  • Do not have duplicate entries

  • If providing a Custom Reference BED and/or Custom PCR Primer Definitions in BED format, the names in the first column of the BED file (chrom) must match the names that appear in the FASTA (the text after > and before the first whitespace character)

Please see this link for general guidelines on uploading data to BaseSpace: https://help.basespace.illumina.com/manage-data/import-data

If you continue having issues, reach out to techsupport@illumina.com
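Several of the requirements listed above can be checked before upload with a short script. The sketch below is a hypothetical helper, not an Illumina tool. It assumes that the 25-character limit counts the full file name including the extension, that "duplicate entries" means repeated sequence names, and that BED lines starting with "#", "track", or "browser" are headers. It does not check file size.

```python
from collections import Counter

ALLOWED_EXTENSIONS = (".fasta", ".fa")
MAX_NAME_LENGTH = 25  # assumption: counted over the full file name


def fasta_names(fasta_text: str) -> list[str]:
    """Sequence names: text after '>' and before the first whitespace."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            names.append(parts[0] if parts else "")
    return names


def check_custom_reference(file_name, fasta_text, bed_text=None):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if " " in file_name:
        problems.append("file name contains a space")
    if len(file_name) > MAX_NAME_LENGTH:
        problems.append("file name is longer than 25 characters")
    if not file_name.endswith(ALLOWED_EXTENSIONS):
        problems.append("file extension is not .fasta or .fa")

    names = fasta_names(fasta_text)
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
    if duplicates:
        problems.append("duplicate FASTA entries: " + ", ".join(duplicates))

    if bed_text is not None:
        known = set(names)
        for line in bed_text.splitlines():
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue  # skip blank lines and assumed header lines
            chrom = line.split()[0]
            if chrom not in known:
                problems.append(f"BED chrom '{chrom}' not found in FASTA")
    return problems


fasta = ">segA example description\nACGT\n>segB\nGGCC\n"
bed = "segA\t0\t4\n"
print(check_custom_reference("flu_ref.fa", fasta, bed))  # []
```

An empty list means none of the checked problems were found; it does not guarantee that the analysis will succeed.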
