Frequently Asked Questions (FAQ)

Q: The majority of my reads are removed during preprocessing as off-target reads. Is the amplicon panel working?

A: For many sample types, especially clinical, wastewater, or environmental samples, viral RNA or DNA makes up a tiny proportion of the total nucleic acids, with the remainder dominated by host or bacterial/archaeal material. Therefore, even with a dramatic increase in abundance over what you would obtain without targeted sequencing, the percentage of targeted reads can still be low. A high proportion of off-target reads does not by itself mean that the panel is not working.

Q: How is the default minimum read coverage depth of 10x applied? Is that averaged across the entire sequence?

A: The 10x threshold is applied per-nucleotide. Any positions below 10x coverage will be hard-masked with "N".
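
The per-nucleotide behavior can be illustrated with a minimal Python sketch. This is not the app's implementation; the function name mask_low_coverage and the example depths are hypothetical, and the only rule taken from the answer above is that positions below the minimum depth become "N".

```python
from typing import Sequence


def mask_low_coverage(consensus: str, depths: Sequence[int], min_depth: int = 10) -> str:
    """Hard-mask every position whose own depth is below min_depth.

    Hypothetical helper for illustration only; the threshold is checked
    position by position, never as an average over the sequence.
    """
    if len(consensus) != len(depths):
        raise ValueError("consensus and depths must have the same length")
    return "".join(
        base if depth >= min_depth else "N"
        for base, depth in zip(consensus, depths)
    )


# Made-up depths: the mean is 15.5x, yet three positions fall below 10x.
example_depths = [50, 12, 10, 9, 0, 30, 3, 10]
masked = mask_low_coverage("ACGTACGT", example_depths)
print(masked)
```

In this example the average depth is above 10x, but the positions with depths 9, 0, and 3 are still masked, which is the difference between a per-nucleotide threshold and an averaged one.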

Q: Header lines in the consensus sequence FASTA files say “Panel reference sequences are not necessarily comprehensive and should not be used for strain typing.” What does this mean?

A: This message is to warn users that the sequence accession in the consensus genome does not necessarily reflect the true phylogeny of the organisms in the sample and should not be taken as such.

Because the app uses a limited set of reference sequences, the accession in the consensus sequence FASTA file headers (and coverage plots, etc.) merely reflects the best match from that limited set. There may be sequences in RefSeq or elsewhere that are a closer match.

Q: Why does the "Detected Amplicons" column show 7 in the denominator for an Influenza genome when there should be 8 amplicons in total?

A: The denominator in the "Detected Amplicons" column reflects the reference sequences that were selected using the de novo assembled contigs. Depending on the quality of the sample and/or reads, the assembler may not have enough data to generate a contig for some segments. Shorter segments are more likely to be missed. If only 7 segments are selected as references for short read alignment, then 7 amplicons are expected in total. If you believe that the sample should contain all 8 segments, you can download the contig FASTA file from our report page and submit it to NCBI BLAST to see if all 8 segments are present in the contig sequences.

One known issue is that chimeric reads can be generated during library preparation, which can lead to chimeric contigs, where the contig sequence contains sequences from more than one segment. This can result in missing an entire segment in the reference selection stage. A workaround may be to filter out chimeric reads from your FASTQ files before running the app.

Alternatively, you can force the app to use all 8 segments of a particular Influenza genome by providing a custom reference FASTA file with all 8 segment sequences and a custom reference BED file with the genome column set to the same value (e.g. Influenza A). This way, the app skips assembly and uses all 8 segments as the reference sequences for short read alignment.

Q: What is the difference between the "Detected Amplicons" and "% callable bases" columns?

A: The "Detected Amplicons" column shows the number of detected amplicons over the total number of expected amplicons. The percentage of detected amplicons is used to infer whether the sample is of sufficient quality for variant calling. The "% callable bases" column shows the percentage of the selected reference genome whose bases are at or above the minimum read coverage depth for consensus sequence generation. It is computed independently of amplicons.

Both metrics are useful for assessing the quality of the sample, but only the percentage of detected amplicons is used by the app (after short read alignment) to filter out low-titer samples. The percentage of callable bases is not used for filtering.
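
To make the distinction concrete, here is a rough Python sketch that computes both kinds of metric from the same per-position depths. It is not how the app computes them: the FAQ does not state the app's criterion for calling an amplicon "detected", so the min_fraction rule below, the function names, and the example data are all assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def percent_callable_bases(depths: Sequence[int], min_depth: int = 10) -> float:
    """Percentage of reference positions at or above min_depth (amplicons ignored)."""
    if not depths:
        return 0.0
    return 100.0 * sum(d >= min_depth for d in depths) / len(depths)


def count_detected_amplicons(
    depths: Sequence[int],
    amplicons: List[Tuple[int, int]],
    min_depth: int = 10,
    min_fraction: float = 0.5,
) -> Tuple[int, int]:
    """Return (detected, expected) for 0-based, half-open amplicon intervals.

    ASSUMPTION: an amplicon counts as detected when at least min_fraction of
    its positions reach min_depth. The app's real rule is not given in the FAQ.
    """
    detected = 0
    for start, end in amplicons:
        window = depths[start:end]
        if window and sum(d >= min_depth for d in window) / len(window) >= min_fraction:
            detected += 1
    return detected, len(amplicons)


# Made-up 300-base reference tiled by three amplicons; the middle one dropped out.
depths = [25] * 100 + [2] * 100 + [25] * 100
amplicons = [(0, 100), (100, 200), (200, 300)]

detected, expected = count_detected_amplicons(depths, amplicons)
print(f"Detected Amplicons: {detected}/{expected}")
print(f"% callable bases: {percent_callable_bases(depths):.1f}")
```

The first metric is counted per amplicon and the second per base of the reference, so the two can diverge, for example when an amplicon is only partially covered.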

Q: I don't see the virus I'm interested in listed in the "Genomes generated" column. Does that mean the virus is not present?

A: Not necessarily. Your virus of interest may be present in the sample, but the app may not have generated a consensus sequence for it for various reasons.

One reason could be that there are too few reads coming from that virus. Tools like DRAGEN Metagenomics can be used to characterize what is in the sample more broadly.

Another reason could be that the virus in your sample is too divergent from the reference sequences used in the app. In such cases, we recommend downloading the contig FASTA file (if available) from our report page and submitting it to NCBI BLAST. If you do see a genome that matches your virus of interest, you can provide that to the app as a custom reference genome.

Q: Why can't I find any contig FASTA files in the output?

A: De novo assembly is performed only if there are multiple candidate reference genomes, which is typically when there are multiple serotypes, strains, subtypes, or clades. This currently applies to the following Amplicon Primer Set options:

Dengue Virus All Serotypes, 400-bp DengueSeq primers
Influenza A, Universal primers
Influenza B, Universal primers
Influenza A and B, Universal primers
Mpox All Clades, 2500-bp ARTIC-INRB v1 primers
Respiratory Syncytial Virus (RSV), CDC primers
Respiratory Syncytial Virus (RSV), WCCRRI primers

If a custom reference FASTA file is provided, assembly is performed if there are multiple sequences in the file. If a custom reference BED file is also provided, assembly is performed if the BED file defines multiple genome-segment pairs (or multiple non-segmented full genomes). Otherwise, all sequences in the custom reference FASTA file are used as references for short read alignment.

Q: I see both consensus sequence FASTA files and contig FASTA files. Which one is better?

A: In most cases, the consensus sequence FASTA file. Contig sequences are useful if the reference sequences used for consensus sequence generation were not the best match. They should be used with caution, however, because base calls are not filtered by coverage or quality as they are in consensus sequence generation.

Q: I am using the custom analysis workflow and my analysis aborted or shows an error. Why?

A: The most likely cause is that the custom database was not formatted correctly. The requirements for the Custom Reference FASTA For Consensus Generation are as follows:

Do not use spaces in the file name; use an underscore "_" instead
Do not exceed 25 characters in the file name
File extension must be .fasta or .fa
Do not exceed the file size limit: 16 GB for a single file or 25 GB for multiple files
Do not include duplicate entries
If providing a Custom Reference BED and/or Custom PCR Primer Definitions in BED format, the names in the first column of the BED file (accession) must match the names that appear in the FASTA (text after > and before the first whitespace character).
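
Most of these requirements can be checked locally before uploading. The following Python sketch is one possible pre-flight check, not a tool provided with the app. The function names are hypothetical, and it makes several assumptions: the 25-character limit is applied to the whole file name including the extension, "duplicate entries" is read as duplicate sequence names, and BED lines starting with "#", "track", or "browser" are skipped. File size is not checked.

```python
from pathlib import Path
from typing import List, Optional

MAX_FILE_NAME_LENGTH = 25  # assumption: counted over the full name, extension included
ALLOWED_EXTENSIONS = (".fasta", ".fa")


def fasta_entry_names(fasta_text: str) -> List[str]:
    """Names as defined above: text after '>' and before the first whitespace."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.append(fields[0])
    return names


def check_custom_reference(
    file_name: str, fasta_text: str, bed_text: Optional[str] = None
) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if " " in file_name:
        problems.append("file name contains spaces")
    if len(file_name) > MAX_FILE_NAME_LENGTH:
        problems.append("file name exceeds 25 characters")
    if Path(file_name).suffix not in ALLOWED_EXTENSIONS:
        problems.append("file extension is not .fasta or .fa")

    names = fasta_entry_names(fasta_text)
    # Assumption: entries are duplicates when they share the same name.
    duplicates = sorted({name for name in names if names.count(name) > 1})
    if duplicates:
        problems.append("duplicate FASTA entries: " + ", ".join(duplicates))

    if bed_text is not None:
        # The first BED column must match a FASTA entry name.
        bed_names = {
            line.split()[0]
            for line in bed_text.splitlines()
            if line.strip() and not line.startswith(("#", "track", "browser"))
        }
        missing = sorted(bed_names - set(names))
        if missing:
            problems.append("BED names not found in FASTA: " + ", ".join(missing))
    return problems


# Made-up inputs: a space in the file name, a repeated entry, and an unmatched BED name.
issues = check_custom_reference(
    "my ref.fasta",
    ">seq1 example\nACGT\n>seq1\nTTTT\n",
    "seq2\t0\t4\n",
)
for issue in issues:
    print(issue)
```

Passing the file contents as text keeps the sketch self-contained; for real references you would read the FASTA and BED files from disk and check their sizes separately.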

Please see this page on general guidelines for uploading data to BaseSpace for more details. If you continue to have issues, reach out to techsupport@illumina.com.