Analysis Methods

The software processes sequencing data to perform quality control, detect variants, determine tumor mutational burden (TMB), microsatellite instability (MSI) status, and genomic instability score (GIS), and report results. The following sections describe the analysis methods used in DRAGEN TruSight Oncology 500 Analysis Software.

DRAGEN TruSight Oncology 500 Analysis Software uses the following workflows to analyze sequencing data.

  • FASTQ Generation

  • DNA Analysis

    • DNA Alignment and Realignment

    • Read Collapsing

    • Indel Realignment and Read Stitching

    • Small Variant Calling

    • Small Variant Filtering

    • Copy Number Variant (CNV) Calling

    • Phased Variant Calling

    • Variant Merging

    • Annotation

    • Tumor Mutational Burden (TMB) Scoring

    • Microsatellite Instability (MSI) Status

    • Contamination Detection

  • RNA Analysis

    • Downsampling

    • Read Trimming

    • Alignment

    • Duplicate Marking

    • Fusion Calling

    • RNA Fusion Filtering

    • Splice Variant Calling

    • Annotation

    • Fusion Merging

  • Quality Control

    • Run QC

    • DNA Sample QC

    • RNA Sample QC

    DNA Analysis Methods

    DNA Alignment and Error Correction

    DNA alignment and error correction involves aligning sequencing reads derived from DNA libraries to a reference genome and correcting errors in the sequencing reads prior to variant calling.

    DRAGEN unique molecular identifier (UMI) error correction comprises three main steps:

    1. DRAGEN UMI uses its hardware-accelerated mapper (based on a hash table implementation) to align DNA sequences in FASTQ files to the hg19 reference genome. These alignments are not written to a BAM file.

    2. The raw alignments are processed to remove errors, including errors introduced during FFPE preservation, PCR amplification, and sequencing. Reads from the same original DNA molecule are tagged with the same UMI during library preparation. The UMI allows DRAGEN to compare related reads, remove outlier signals, and collapse multiple reads into a single high-quality sequence. Read collapsing adds the following BAM tags:

      • RX/XU—UMI.

      • XV—Number of reads in the family.

      • XW—Number of reads in the duplex-family or 0 if not a duplex family.

    3. DRAGEN performs a final alignment step on the UMI-collapsed reads. These final alignments are then written to a BAM file and a corresponding BAM index file is created.

    DRAGEN uses these final alignments as input for gene amplification (copy number) calling, small variant calling (SNV, indel, MNV, delins), microsatellite instability (MSI) status determination, and DNA library quality control.
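
    The read collapsing idea in step 2 can be illustrated with a minimal sketch. This is not the DRAGEN implementation, which is not described here; it assumes reads are already grouped by UMI and alignment position, and it uses a simple per-base majority vote. The function name and the tuple layout are hypothetical.

```python
from collections import Counter, defaultdict


def collapse_by_umi(reads):
    """Group (umi, position, sequence) reads into families and build a
    per-base majority-vote consensus for each family.

    Simplified illustration only: base qualities, duplex families, and
    UMI sequencing errors are ignored.
    """
    families = defaultdict(list)
    for umi, pos, seq in reads:
        families[(umi, pos)].append(seq)

    collapsed = {}
    for key, seqs in families.items():
        length = min(len(s) for s in seqs)
        consensus = "".join(
            Counter(s[i] for s in seqs).most_common(1)[0][0]
            for i in range(length)
        )
        # The family size plays the role of the XV tag described above.
        collapsed[key] = (consensus, len(seqs))
    return collapsed


# Illustrative reads (made up for this example).
reads = [
    ("ACGTAC", 100, "TTGACCA"),
    ("ACGTAC", 100, "TTGACCA"),
    ("ACGTAC", 100, "TTGTCCA"),  # one read disagrees at index 3
    ("GGATCC", 250, "CCATGGA"),
]
result = collapse_by_umi(reads)
```

    In this sketch the outlier base in the third read is outvoted by the other two members of its family, which is the effect that collapsing is meant to achieve.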

    Small Variant Calling and Filtering

    DRAGEN supports calling SNVs, indels, MNVs, and delins in tumor-only samples by using mapped and aligned DNA reads from a tumor sample as input. Variants are detected via both column-wise pileup analysis and local de novo assembly of haplotypes. The de novo haplotypes allow the detection of much larger insertions and deletions than is possible through column-wise pileup analysis alone. Insertions and deletions are validated for lengths of 0–25 bp, and lengths of more than 25 bp can be supported. DRAGEN also uses the de novo assembly to detect SNVs, insertions, and deletions that are co-phased and part of the same haplotype. Any such co-phased variants that are within a window of 15 bp can then be reassembled into complex variants (MNVs and delins). The tumor-only pipeline produces a VCF file containing both germline and somatic variants that can be further analyzed to identify tumor mutations. Variant calling extends ± 10 bp into introns; details of the regions covered can be found in the assay manifest file. The pipeline makes no ploidy assumptions, enabling detection of low-frequency alleles.

    DRAGEN small variant calling includes the following steps:

    1. Detects regions with sufficient read coverage (callable regions).

    2. Detects regions where the reads deviate from the reference and there is a possibility of a germline or somatic call (active regions).

    3. Assembles de novo graph haplotypes from reads (haplotype assembly).

    4. Extracts possible somatic or germline calls (events) from column-wise pileup analysis.

    Copy Number Variant Calling

    The DRAGEN copy number variant caller performs amplification, reference, and deletion calling for CNV targets within the assay. It counts the coverage of each target interval on the panel, uses a preprocessed panel of normal samples to normalize target counts, corrects for GC coverage bias, calculates scores of a CNV event from observed coverage, and makes copy number calls.
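
    The normalization step can be sketched as follows. This is a simplified illustration rather than the DRAGEN algorithm: it assumes median scaling within each sample and a per-target median across the panel of normals, and it omits GC bias correction and event scoring. All names are hypothetical.

```python
import math
from statistics import median


def log2_fold_changes(sample_counts, normal_panel_counts):
    """Return the log2 fold change of each target versus a panel of normals.

    sample_counts: {target: read count} for the tumor sample.
    normal_panel_counts: list of {target: read count}, one per normal sample.
    """
    sample_median = median(sample_counts.values())
    # Scale every normal sample by its own median once.
    scaled_normals = [
        {t: c / median(n.values()) for t, c in n.items()}
        for n in normal_panel_counts
    ]
    result = {}
    for target, count in sample_counts.items():
        baseline = median(n[target] for n in scaled_normals)
        result[target] = math.log2((count / sample_median) / baseline)
    return result


# Illustrative counts (made up for this example).
sample = {"T1": 100, "T2": 100, "T3": 400}
normals = [
    {"T1": 100, "T2": 100, "T3": 100},
    {"T1": 200, "T2": 200, "T3": 200},
]
fold_changes = log2_fold_changes(sample, normals)
```

    Here target T3 has four times the normalized coverage of the baseline, which corresponds to a log2 fold change of 2 and would be a candidate amplification.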

    Exon-Level Copy Number Variant Calling

    The BRCA large rearrangement step generates segmentation of the BRCA1 and BRCA2 genes for exon-level CNV detection from the BAM file. Using the same method as CNV calling, the large rearrangement component counts coverage of each target interval of the panel, performs normalization, and calculates the fold change values for each probe across the BRCA genes. Normalization corrects for GC bias, sequencing depth, and probe efficiency using a collection of normal FFPE and genomic DNA samples. Initial segmentation is performed for each gene with circular binary segmentation. The merging of segments is then determined by amplitude, noise, and variance at adjacent segments using thresholds established with in silico data. A large rearrangement is reported for genes with more than one segment. Coordinates of the exon-level CNV and the log2 mean fold change for each of the BRCA gene segments are found in the *_DragenExonCNV.json file.

    Annotation

    The Illumina Annotation Engine performs annotation of small variants, CNVs, and exon-level CNVs. The inputs are gVCF files and the outputs are annotated JSON files.

    The Illumina Annotation Engine processes each variant entry and annotates with available information from databases such as dbSNP, gnomAD genome and exome, 1000 genomes, ClinVar, COSMIC, RefSeq, and Ensembl. The header includes version information and general details. Each annotated variant is included as a nested dictionary structure in separate lines following the header.

    The following table shows version information for each annotation database:

    Database
    Version

    Tumor Mutational Burden

    DRAGEN is used to compute tumor mutational burden (TMB) in coding regions where there is sufficient coverage.

    The following variants are excluded from the TMB calculation:

    • Non-PASS variants.

    • Mitochondrial variants.

    • MNVs.

    • Variants that do not meet a minimum depth threshold (50).

    Variants with a population allele count ≥ 10 that are observed in either the 1000 Genomes or gnomAD databases are marked as germline. MNVs, which do not count towards TMB, may be marked as germline when all their component small variants are marked as germline. The proxy filter scans the variants surrounding a specific variant and identifies those variants with similar variant allele frequencies (VAF). If the majority of surrounding variants of similar VAF are germline, then the variant is also marked as germline.

    The formula for TMB calculation is:

    Outputs are captured in a _TMB_Trace.tsv file that contains information on variants used in the TMB calculation and a .tmb.json file that contains the TMB score calculation and configuration details.

    Microsatellite Instability Status

    DRAGEN can determine the MSI status of a sample. It uses a normal reference file, which was created from a set of normal samples. Normal reference files were generated by tabulating read counts for each microsatellite site. The normal file contains the read count distribution for each microsatellite site.

    MSI calling is assessed on a predefined list of 130 A and T repeats. The first step in calculating the MSI score is determining how many sites are assessable. A site is considered assessable if it has at least 60 spanning reads. A spanning read is defined as one that extends 5 bp before and after the repeat.

    Once assessable sites are identified, the distribution of repeat lengths is compared to the panel of normals. A site is classified as unstable if:

    • Jensen-Shannon distance ≥ 0.1, and

    • P-value ≤ 0.01.

    After all sites are evaluated, DRAGEN reports:

    • The total number of sites assessed

    • The count of unstable sites

    • The percentage of unstable sites across the sample

    Finally, the MSI score is calculated as:
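
    The formula itself is not reproduced in this section. The sketch below is one plausible reading of the steps above, assuming the MSI score is the percentage of unstable sites among assessable sites. The Jensen-Shannon distance is computed with a base-2 logarithm, which is an assumption, and the p-value is taken as an input because its calculation is not described here. All field names are hypothetical.

```python
import math

MIN_SPANNING_READS = 60   # from the text above
JSD_THRESHOLD = 0.1       # from the text above
P_VALUE_THRESHOLD = 0.01  # from the text above


def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions
    (square root of the divergence; log base 2 is an assumption)."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, divergence))


def msi_summary(sites):
    """Return (assessed sites, unstable sites, percent unstable)."""
    assessable = [s for s in sites if s["spanning_reads"] >= MIN_SPANNING_READS]
    unstable = [
        s for s in assessable
        if js_distance(s["sample_dist"], s["normal_dist"]) >= JSD_THRESHOLD
        and s["p_value"] <= P_VALUE_THRESHOLD
    ]
    pct = 100.0 * len(unstable) / len(assessable) if assessable else None
    return len(assessable), len(unstable), pct


# Illustrative sites (made up for this example).
sites = [
    {"spanning_reads": 100, "sample_dist": [1.0, 0.0],
     "normal_dist": [0.0, 1.0], "p_value": 0.001},
    {"spanning_reads": 80, "sample_dist": [0.5, 0.5],
     "normal_dist": [0.5, 0.5], "p_value": 0.8},
    {"spanning_reads": 10, "sample_dist": [1.0, 0.0],
     "normal_dist": [0.0, 1.0], "p_value": 0.001},
]
summary = msi_summary(sites)
```

    The third site is skipped because it has fewer than 60 spanning reads, so only two sites are assessed and one of them is unstable.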

    Genomic Instability Score

    Requires HRD add-on assay

    Genomic instability score (GIS) is a whole genome signature for homologous recombination deficiency. The GIS is composed of the sum of three components: loss of heterozygosity, telomeric allele imbalance, and large-scale state transition. These components are estimated using the GIS algorithm contracted from Myriad Genetics, which uses an input of the b-allele frequency and coverage across a genome-wide single nucleotide panel. A panel of normal samples is used for both bias reduction and normalization prior to GIS estimation. Final GIS results can be found in the *.gis.json file.

    Contamination Detection

    The contamination analysis step detects foreign human DNA contamination using the SNP error file and pileup file that are generated during small variant calling, together with the TMB trace file. The software determines whether a sample has foreign DNA using the contamination score. In contaminated samples, the variant allele frequencies in SNPs shift from the expected values of 0%, 50%, or 100%. The algorithm collects all positions that overlap with common SNPs that have variant allele frequencies of < 25% or > 75%. Then, the algorithm computes the likelihood that the positions are an error or a real mutation. The contamination score is the sum of all the log likelihood scores across the predefined SNP positions that have a minor allele frequency < 25% in the sample and are not likely due to CNV events.

    The larger the contamination score, the more likely there is foreign DNA contamination. A sample is considered to be contaminated if the contamination score is above the predefined quality threshold. The contamination score was found to be high in samples with highly rearranged genomes or HRD samples. 1% of HRD samples were found to be above the threshold with no evidence of actual contamination.

    Tumor Fraction

    Tumor fraction is calculated as described in the User Guide, section “HRD Metrics Report” and leverages the Myriad Genetics algorithm. Tumor fraction is output in the Logs_Intermediates/Gis/SAMPLE/SAMPLE.gis.json and Combined Variant Output files.

    Ploidy

    Ploidy is calculated as described in the User Guide, section “HRD Metrics Report” and leverages the Myriad Genetics algorithm. Ploidy is output in the Logs_Intermediates/Gis/SAMPLE/SAMPLE.gis.json and Combined Variant Output files.

    Absolute Copy Number (Beta)

    This is a beta feature. Beta feature results are included in the Combined Variant Output file and other files. However, disclaimers that the results are generated by beta features are only provided in the Combined Variant Output file. Requires HRD add-on assay.

    Absolute copy numbers are calculated by leveraging the Myriad Genetics algorithm. The algorithm segments the entire genome using the HRD panel and provides an A and B allele estimate for each segment. After the TSO 500 pipeline determines CNV calls (using the TSO 500 panel), the segment covering the gene is identified, and the A and B allele numbers of the segment overlapping the gene are reported. If the gene is within 300 kb of the segment boundary, the estimate is unreliable and “-1” is output. Absolute copy numbers are output in the Logs_Intermediates/Gis/SAMPLE/SAMPLE.abcn_annotated.vcf and Logs_Intermediates/Gis/SAMPLE/SAMPLE.abcn_genes.tsv files and in the Combined Variant Output file.

    Absolute copy number estimation has an upper limit of 6. While biological samples may exhibit copy numbers exceeding this value, the estimation algorithm does not report values above 6.
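
    The segment boundary rule can be sketched as below. This is an illustration under stated assumptions, not the Myriad Genetics algorithm: "within 300 kb" is read as a strict less-than comparison, “-1” is assumed to replace both allele values, and the upper limit of 6 is not modeled. The function and field names are hypothetical.

```python
BOUNDARY_MARGIN_BP = 300_000  # 300 kb, from the text above
UNRELIABLE = -1               # value output for unreliable estimates


def gene_allele_copy_numbers(gene_start, gene_end, segment):
    """Return (A, B) allele copy numbers for a gene covered by a segment,
    or (-1, -1) when the gene lies too close to a segment boundary."""
    near_boundary = (
        gene_start - segment["start"] < BOUNDARY_MARGIN_BP
        or segment["end"] - gene_end < BOUNDARY_MARGIN_BP
    )
    if near_boundary:
        return UNRELIABLE, UNRELIABLE
    return segment["a_copies"], segment["b_copies"]


# Illustrative segment (made up for this example).
segment = {"start": 500_000, "end": 2_000_000, "a_copies": 2, "b_copies": 1}
interior = gene_allele_copy_numbers(1_000_000, 1_050_000, segment)
near_edge = gene_allele_copy_numbers(600_000, 650_000, segment)
```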

    Gene-Level Loss of Heterozygosity (Beta)

    This is a beta feature. Beta feature results are included in the Combined Variant Output file and other files. However, disclaimers that the results are generated by beta features are only provided in the Combined Variant Output file. Requires HRD add-on assay.

    Gene-level loss of heterozygosity is calculated based on the minor copy number reported in the abcn_annotated.vcf file. If the minor copy number is 0, the gene is assumed to have a loss of heterozygosity. Gene-level loss of heterozygosity is output in the Logs_Intermediates/Gis/SAMPLE/SAMPLE.abcn_genes.tsv and Combined Variant Output files.
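
    A minimal sketch of this rule, assuming the minor copy number is the smaller of the A and B allele estimates and that an unreliable estimate (“-1”) yields no call. The function name is hypothetical.

```python
def has_gene_level_loh(a_copies, b_copies):
    """Return True when the minor copy number is 0, False otherwise,
    or None when either allele estimate is flagged unreliable (-1)."""
    if a_copies < 0 or b_copies < 0:
        return None
    return min(a_copies, b_copies) == 0


loh_call = has_gene_level_loh(2, 0)
no_loh_call = has_gene_level_loh(1, 1)
```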

    Block List

    The block list represents high-noise regions in the panel where false positive variant calls are likely to be produced. All positions in these regions are marked in the gVCF as Filter=excluded_regions to indicate that variant call results are not reliable there.

    The block list includes the following genes:

    • HLA-A

    • HLA-B

    • HLA-C

    • KMT2B

    • KMT2C

    • KMT2D

    • chrY

    • Any position with VAF 1% occurrence in six or more of the 60 baseline samples.

    DRAGEN small variant calling steps, continued from step 4 of the list under Small Variant Calling and Filtering:

    5. Calibrates read base qualities to account for FFPE noise.

    6. Computes read likelihoods for each read/haplotype pair.

    7. Performs variant calling by summing the genotype probabilities across all read/haplotype pairs.

    8. Performs additional filtering to improve variant calling accuracy, including using a systematic noise file. The systematic noise file indicates the statistical probability of noise at specific positions in the genome. This noise file is constructed using clean (normal) samples. Regions where noise is common (eg, difficult-to-map regions) have higher noise values. The small variant caller penalizes those regions to reduce the probability of making false positive calls.

    Variants excluded from the TMB calculation, continued from the list under Tumor Mutational Burden:

    • Variants that do not meet the minimum variant allele frequency threshold (0.05).

    • Variants that fall outside the eligible regions.

    • Tumor driver mutations. Variants with a population allele count ≥ 50 are treated as tumor driver mutations.

    Germline variants are not counted towards TMB. Variants are determined as germline based on a database and a proxy filter.

    Annotation database versions:

    Database | Version
    --- | ---
    gnomAD | 2.1
    COSMIC | v84
    ClinVar | 2019-02-04
    dbSNP | v151
    1000 Genomes Project | Phase 3 v5a
    RefSeq | NCBI Homo sapiens Annotation Release 105.20201022

    $$\text{TMB} = \frac{\text{Filtered Variants}}{\text{Eligible Region Size (Mbp)}}$$

    $$\text{Nonsynonymous TMB} = \frac{\text{Filtered Nonsynonymous Variants}}{\text{Eligible Region Size (Mbp)}}$$
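
    The TMB ratio can be sketched as a short calculation. This is an illustration only: it applies the exclusion criteria stated in this document that can be checked per variant (non-PASS, mitochondrial, MNV, depth below 50, VAF below 0.05, germline), and leaves out the eligible-region and tumor driver checks. The record fields and the "chrM" contig name are assumptions.

```python
MIN_DEPTH = 50   # minimum depth threshold from this document
MIN_VAF = 0.05   # minimum variant allele frequency from this document


def counts_toward_tmb(variant):
    """Apply the per-variant exclusion criteria to one variant record."""
    return (
        variant["filter"] == "PASS"
        and variant["chrom"] != "chrM"
        and variant["type"] != "MNV"
        and variant["depth"] >= MIN_DEPTH
        and variant["vaf"] >= MIN_VAF
        and not variant["germline"]
    )


def tmb_score(variants, eligible_region_bp):
    """TMB = filtered variants / eligible region size in Mbp."""
    if eligible_region_bp <= 0:
        raise ValueError("eligible region size must be positive")
    eligible_mbp = eligible_region_bp / 1_000_000
    kept = sum(1 for v in variants if counts_toward_tmb(v))
    return kept / eligible_mbp


# Illustrative variant records (made up for this example).
variants = [
    {"filter": "PASS", "chrom": "chr1", "type": "SNV", "depth": 120, "vaf": 0.20, "germline": False},
    {"filter": "PASS", "chrom": "chr1", "type": "SNV", "depth": 30, "vaf": 0.20, "germline": False},
    {"filter": "PASS", "chrom": "chrM", "type": "SNV", "depth": 500, "vaf": 0.30, "germline": False},
    {"filter": "PASS", "chrom": "chr2", "type": "indel", "depth": 80, "vaf": 0.10, "germline": True},
    {"filter": "PASS", "chrom": "chr3", "type": "SNV", "depth": 60, "vaf": 0.06, "germline": False},
]
score = tmb_score(variants, eligible_region_bp=1_000_000)
```

    Two of the five illustrative variants survive filtering, so with a 1 Mbp eligible region the score is 2 mutations per Mbp.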

    RNA Expanded Metrics

    RNA expanded metrics are provided for information only. They can be informative for troubleshooting but are provided without explicit specification limits and are not directly used for sample quality control. For additional guidance, contact Illumina Technical Support.

    Metric | Description | Units
    --- | --- | ---
    TOTAL_PF_READS | Total number of reads passing filter. | Count
    GENE_MEDIAN_COVERAGE | The median coverage depth of all genes in the panel. | Count
    GENE_ABOVE_MEDIAN_CUTOFF | Number of genes above the median coverage cutoff. | Count
    PER_GENE_MEDIAN_COVERAGE | Median deduped coverage across each gene (available in Logs_Intermediates only). | Count
    PCT_CHIMERIC_READS | Percentage of reads that are aligned as two segments that map to nonconsecutive regions in the genome. | %
    PCT_ON_TARGET_READS | Percentage of reads that cross any part of the target region versus total reads. A read that partially maps to a target region is counted as on target. | %
    SCALED_MEDIAN_GENE_COVERAGE | Median of median base coverage of genes scaled by length. An indication of median coverage depth of genes in the panel. | Count

    DNA Expanded Metrics

    DNA expanded metrics are provided for information only. They can be informative for troubleshooting but are provided without explicit specification limits and are not directly used for sample quality control. For additional guidance, contact Illumina Technical Support.

    Metric
    Description
    Troubleshooting

    RNA Analysis Methods

    Refer to for more information.

    Downsampling

    Each sample is downsampled to 30 million RNA reads. This number represents the total number of single reads (eg, R1 + R2, from all lanes). When using the recommended sequencing configurations or plexity, the samples can have fewer reads than the downsampling limit. In these cases, the FASTQ files are left as-is.

    Read Trimming

    Reads are trimmed to 76 base pairs for further processing.
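
    A minimal sketch of read trimming on a single FASTQ record. It assumes trimming removes bases from the 3' end, which this section does not state, and the function name and tuple layout are hypothetical.

```python
TRIM_LENGTH = 76  # from the text above


def trim_fastq_record(record, length=TRIM_LENGTH):
    """Trim the sequence and quality strings of a 4-line FASTQ record.

    record: (header, sequence, separator, qualities).
    Reads shorter than the trim length are returned unchanged.
    """
    header, sequence, separator, qualities = record
    return header, sequence[:length], separator, qualities[:length]


# Illustrative 101-base read (made up for this example).
long_read = ("@read1", "A" * 101, "+", "I" * 101)
trimmed = trim_fastq_record(long_read)
```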

    RNA Alignment and Fusion Detection

    RNA alignment and fusion detection uses trimmed reads in FASTQ format as input. The outputs include a BAM file that contains duplicate-marked read alignments, an SJ.out.tab file that contains unannotated splice junctions, and a CSV file that contains fusion candidates.

    DRAGEN aligns RNA reads in a transcript-aware mode using the human hg19 genome containing unplaced contigs (ie, chrUn_gl regions) and uses GENCODEv19 transcript annotations to identify splice sites. DRAGEN identifies and marks duplicate read alignments using start and end coordinates of alignments, which are adjusted for soft clipped reads.
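
    The soft-clip adjustment used for duplicate marking can be sketched by computing unclipped alignment coordinates from a CIGAR string. This illustrates the general technique and is not the DRAGEN implementation; the function name is hypothetical, and hard clips are skipped without contributing to the adjustment, which is an assumption.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def unclipped_span(pos, cigar):
    """Return (start, end) reference coordinates extended by soft clips.

    pos: 1-based leftmost aligned position, as in the SAM POS field.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Operations that consume reference bases.
    ref_len = sum(n for n, op in ops if op in "MDN=X")

    def soft_clip(sequence):
        total = 0
        for n, op in sequence:
            if op == "H":
                continue
            if op != "S":
                break
            total += n
        return total

    start = pos - soft_clip(ops)
    end = pos + ref_len - 1 + soft_clip(reversed(ops))
    return start, end


# Two alignments of the same fragment, one of them soft clipped.
clipped = unclipped_span(105, "5S70M")
unclipped = unclipped_span(100, "75M")
```

    Both alignments resolve to the same adjusted coordinates, so a duplicate-marking scheme keyed on those coordinates would treat them as duplicates even though their reported start positions differ.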

    Fusion and splice variant calling only use deduped fragments to score variants. DRAGEN identifies fusion candidates using chimeric split read alignments (pairs of primary and supplementary alignments) against multiple genes. DRAGEN scores and filters reads based on the various features of each candidate such as the number of supporting reads, mapping quality of supporting reads, and sequence homology between parent genes.

    The DRAGEN RNA Fusion caller identifies gene fusions by searching for chimeric reads spanning two distinct parent genes. Based on the chimeric reads, DRAGEN first creates a list of fusion candidates, then scores the candidates to report the list of high confidence fusion calls from the candidate pool.

    DRAGEN RNA Fusion caller performs the following steps:

    1. Generates fusion candidates based on split read alignment.

    2. Recruits additional evidence from fusion supporting discordant read pairs and soft-clipped reads.

    3. Computes fusion candidate features such as gene coverage, read mapping quality, alternate allele frequency, gene homology, alignment anchor length, and breakpoint distance from exon boundary.

    4. Scores and ranks the fusion candidates using a logistic regression model.

    5. Selects a final list of fusion calls based on score and other filters including number of supporting reads, unique read alignment count, read through transcripts, and fusions matching the enriched regions.
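
    Step 4 can be illustrated with a generic logistic regression scoring function. The features, weights, and bias below are hypothetical placeholders chosen for illustration; the trained DRAGEN model and its parameters are not described in this document.

```python
import math

# Hypothetical weights and bias, for illustration only.
HYPOTHETICAL_WEIGHTS = {
    "supporting_reads": 0.4,
    "mean_mapq": 0.05,
    "gene_homology": -2.0,
}
HYPOTHETICAL_BIAS = -4.0


def fusion_score(features):
    """Logistic function of a weighted sum of candidate features."""
    z = HYPOTHETICAL_BIAS + sum(
        weight * features[name] for name, weight in HYPOTHETICAL_WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def rank_candidates(candidates):
    """Sort fusion candidates from highest to lowest score."""
    return sorted(candidates, key=lambda c: fusion_score(c["features"]), reverse=True)


# Illustrative candidates (made up for this example).
candidates = [
    {"name": "weak", "features": {"supporting_reads": 2, "mean_mapq": 20, "gene_homology": 1}},
    {"name": "strong", "features": {"supporting_reads": 20, "mean_mapq": 60, "gene_homology": 0}},
]
ranked = rank_candidates(candidates)
```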

    Splice Variant Calling

    RNA splice variant calling is performed for RNA sample libraries. Candidate splice variants (junctions) from RNA Alignment are compared against a database of known transcripts and a splice variant baseline of non-tumor junctions generated from a set of normal FFPE samples from different tissue types. Any splice variants that match the database or baseline are filtered out unless they are in a set of junctions with known oncological function. If there is sufficient read support, the candidate splice variant is kept. This process also identifies candidate RNA fusions.

    RNA Fusion Merging

    Fusions identified during RNA fusion calling are merged with fusions from proximal genes identified during RNA splice variant calling. These are then annotated with gene symbols or names with respect to a static database of transcripts (GENCODE Release 19). The result of this process is a set of fusion calls that are eligible for reporting.

    RNA Splice Variant Annotation

    The Illumina Annotation Engine annotates detected RNA splice variant calls with transcript-level changes (eg, affected exons in the transcript of a gene) with respect to RefSeq. This RefSeq database is the same RefSeq database used by the small variant annotation process.

    RNA Output

    Lower median target coverage may be due to poor sample input/quality, library preparation issues or low sequencing output.

    PCT_CHIMERIC_READS (%)

    Chimeric reads occur when one sequencing read aligns to two distinct portions of the genome with little or no overlap. The metric is the proportion of chimeric reads among the total number of non-supplementary, non-secondary, QC-passing reads after alignment to the whole genome sequence.

    While this can be indicative of large-scale structural rearrangement of the genome, values that are elevated above the usual baseline may indicate enrichment probe contamination during library preparation. A suggested metric USL is 8% (samples with higher values may show decreased performance in small variant calling and TMB scores).

    PCT_EXON_100X (%)

    Percentage of exon bases with 100X fragment coverage. Calculated against all regions in manifest containing _exon in name.

    Can be used in combination with other PCT_EXON metrics to understand under or over coverage of exons.

    PCT_READ_ENRICHMENT (%)

    Percentage of reads that have overlapping sequence with the target regions defined in the sample manifest.

    Indicative of general enrichment performance. Reduced proportions of enriched reads may indicate issues with the enrichment portion of the library preparation.

    PCT_USABLE_UMI_READS (%)

    Percentage of reads that have valid UMI sequences associated with them.

    As UMI reads are sequenced at the start of each read, loss of valid UMI sequence may be caused by sequencing issues impacting the quality of base calling in this portion of the sequencing read.

    MEAN_TARGET_COVERAGE (count)

    Mean depth across all the unique loci defined in the manifest file.

    Lower mean target coverage may be due to poor sample input/quality, library preparation issues, or low sequencing output. Large differences between the median and mean target coverage values may indicate a skewed distribution of target coverage.

    PCT_ALIGNED_READS (%)

    Proportion of aligned reads that are non-supplementary, non-secondary and pass QC versus aligned reads that are non-supplementary, non-secondary, mapped and pass QC.

    PCT_CONTAMINATION_EST (%)

    This metric should only be evaluated if the CONTAMINATION_SCORE metric exceeds the USL. This metric estimates the amount of contamination in a sample. The contamination level is computed by taking 2.0 times the average of the adjusted allele frequencies of all variants that were selected. The adjusted allele frequency is either the actual allele frequency of the variant if it is less than 0.5, or 1 minus the allele frequency if it is greater than or equal to 0.5.

    If the sample does not fail the CONTAMINATION_SCORE, this metric has no intended meaning because it will be driven by statistical noise (eg, the few variants that naturally fall outside an expected interval around 0.5 due to random chance).
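
    The calculation described above can be written out as a short sketch. The function name is hypothetical, the selection of variants is assumed to have happened already, and the result is returned as a fraction; whether the reported metric multiplies it by 100 is an assumption.

```python
def contamination_estimate(allele_frequencies):
    """Return 2.0 times the mean adjusted allele frequency.

    The adjusted frequency is af when af < 0.5 and 1 - af otherwise.
    """
    if not allele_frequencies:
        raise ValueError("at least one selected variant is required")
    adjusted = [af if af < 0.5 else 1.0 - af for af in allele_frequencies]
    return 2.0 * sum(adjusted) / len(adjusted)


# Illustrative allele frequencies of selected variants (made up).
estimate = contamination_estimate([0.02, 0.98, 0.03, 0.97])
estimate_pct = 100.0 * estimate  # percentage form, under the assumption above
```

    With these values every adjusted frequency is 0.02 or 0.03, the mean is 0.025, and the estimate is 0.05.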

    High contamination estimates may be due to any of the following:

    • Inter-sample contamination caused by mixing of samples during extraction or library preparation.

    • Intra-sample contamination, due to mixing of clonally different cell populations during extraction.

    • Large-scale genomic rearrangements that cause unexpected VAFs for large numbers of variants.

    PCT_TARGET_0.4X_MEAN (%)

    Percentage of target (all locations in manifest) reads that have a coverage depth greater than 0.4x the mean target coverage depth (see definition above).

    Provides an indication of uniformity of coverage of the target regions in the manifest file. When trended over time reductions in this metric may indicate an issue with the enrichment process resulting in coverage bias.

    PCT_TARGET_50X (%)

    Percentage of target bases with 50X fragment coverage. Calculated against all regions in manifest file.

    Can be used in combination with other PCT_TARGET metrics to understand under or over coverage of targets.

    PCT_TARGET_100X (%)

    Percentage of target bases with 100X fragment coverage. Calculated against all regions in manifest file.

    Can be used in combination with other PCT_TARGET metrics to understand under or over coverage of targets.

    PCT_TARGET_250X (%)

    Percentage of target bases with 250X fragment coverage. Calculated against all regions in manifest file.

    Can be used in combination with other PCT_TARGET metrics to understand under or over coverage of targets.

    ALLELE DOSAGE_RATIO (with HRD add-on)

    Proprietary Myriad Genetics estimate of b-allele dosage based on the b-allele noise/signal ratio. B-allele noise is correlated with coverage; lower coverage samples have higher noise. B-allele signal is also correlated with tumor fraction; a higher tumor fraction produces a higher signal for b-allele sites. Samples with a lower tumor fraction and a higher amount of noise (or lower coverage) have a higher Allele Dosage Ratio. The upper limit of the score is 50; any sample with an Allele Dosage Ratio of 50 can be assumed to have a tumor fraction close to zero and typically has a GIS = 0.

    MEDIAN TARGET HRD (with HRD add-on)

    Median target fragment coverage across all target positions in the genome. Coverage is the total number of non-duplicate pair alignments that overlap.

    TOTAL_PF_READS (count)

    Total number of non-supplementary, non-secondary, and passing QC reads after alignment to the whole genome sequence.

    Primarily driven by data output of the sequencer, quality of the library, and balancing of the library in the library pool. If TOTAL_PF_READS is in line with other samples but coverage metrics differ, this may suggest non-specific enrichment.

    Low values for all samples indicate a poor quality run with possible low cluster numbers or low numbers of Q30 and PF%.

    A low value for an individual sample indicates poor pooling of this library into the final pool.

    MEAN_FAMILY_SIZE (count)

    A UMI family is a group of reads that all have the same UMI barcode. The family size is the number of reads in the family. MEAN_FAMILY_SIZE is the mean of the entire population of reads assembled into UMI families.

    The mean UMI family size decreases with increased unique read numbers, and more input DNA leads to more unique reads. Conversely, oversequencing of a fixed population of unique DNA molecules leads to increased family size.

    As a guide, for a good run with optimal cluster density, passing specs, even sample pooling, and good quality DNA we usually observe values <10.

    UMI family size = 1 is not ideal as it is harder to correct for errors.

    UMI family size of 2 to 5 enables efficient error correction without wasting sequencing capacity on high percentages of duplicate reads.
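
    A minimal sketch of the family size calculation, assuming the mean is taken across families and that a family is identified by its UMI alone. In practice alignment position would also contribute to the family key. The function name is hypothetical.

```python
from collections import Counter


def mean_family_size(read_umis):
    """Return the mean number of reads per UMI family."""
    families = Counter(read_umis)
    if not families:
        raise ValueError("no reads provided")
    return sum(families.values()) / len(families)


# Six reads in three families of size 3, 2, and 1 (made up).
example_mean = mean_family_size(["AAC", "AAC", "AAC", "GGT", "GGT", "TCA"])
```

    The same six reads spread over six distinct UMIs would give a mean of 1, the case described above as harder to error correct.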

    MEDIAN_TARGET_COVERAGE (count)

    Median depth across all the unique loci occurring in all regions of the manifest file.

    Contamination

    The contamination score evaluates presence of sample-to-sample contamination. The algorithm uses common germline SNPs in the homozygous state expected to have variant allele frequencies (VAF) at 0% and 100%. In contaminated samples, the VAFs shift away from the expected values allowing the detection of sample-to-sample contamination.

    The contamination score can detect sample-to-sample contamination greater than or equal to 2% (that is, at least 2% of the DNA input comes from the contaminant).

    Contamination Score Calculation

    The contamination score is calculated using the SNP error file and pileup file that are generated during small variant calling, as well as the TMB trace file. The algorithm includes the following steps:

    • All positions that overlap with a predefined set of common SNPs and have variant allele frequencies of < 25% or > 75% are collected (only SNPs are considered; indels are excluded).

    • Variants in CNV events are removed using a clustering method.

    • The likelihood that the positions are an error or a real mutation is calculated by:

      • Estimating the error rate per sample

      • Counting mutation support

      • Counting total depth

    • The contamination score is calculated as the sum of all the log likelihood scores across the predefined SNP positions whose minor allele frequency is < 25% in the sample and that are not likely due to CNV events:

    CONTAMINATION_SCORE = sum(log10(P(vi is False Positive)))

    Contamination Score Interpretation

    • The contamination score is output in the metrics output file, MetricsOutput.tsv.

    • If a contamination score is equal to or below 1457 (the upper specification limit provided in the "USL Guideline" field in the metrics output file; see the Metrics Output page), the sample has less than 2% sample-to-sample contamination.

    • If a contamination score is above 1457, the sample has more than 2% sample-to-sample contamination. In this case, an estimate of the contamination can be obtained from the PCT_CONTAMINATION_EST metric; for more details, see the DNA Expanded Metrics page. As noted, PCT_CONTAMINATION_EST is not valid unless the contamination score exceeds 1457.

    • Samples with highly rearranged genomes (HRD samples) can have variants with VAFs that shift away from the expected frequencies due to genomic rearrangement, which can lead to false-positive contamination scores.

      • Visual investigation of VAFs across the genome can help determine if a shift of VAFs is due to true contamination.

    How to build a VAF plot for visual examination

    1. To build a VAF plot, use the {Sample_ID}.tmb.trace.csv file. Filter to only germline variants (for example, by using the tags "Germline_DB" and "Germline_Proxi" in the "Status" column) and use the values in the VAF column.

    2. Select Scatter from the Charts menu.

    3. Review the plot as described above, checking whether variants are scattered or clustered around 50% and 100% VAF.
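
    For users who prefer a script to a spreadsheet, the germline filtering step of the VAF plot can be sketched as follows. The "Status" and "VAF" column names and the two germline tags come from the description in this section; the delimiter, the other columns, and the sample rows are assumptions made for illustration.

```python
import csv
import io

GERMLINE_TAGS = {"Germline_DB", "Germline_Proxi"}


def germline_vafs(handle, delimiter=","):
    """Return the VAF values of rows whose Status is a germline tag."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [
        float(row["VAF"])
        for row in reader
        if row["Status"] in GERMLINE_TAGS
    ]


# Illustrative trace content (made up; a real trace file has more columns).
example_trace = io.StringIO(
    "Chromosome,Position,VAF,Status\n"
    "chr1,100,0.49,Germline_DB\n"
    "chr1,200,0.07,Somatic\n"
    "chr2,300,0.99,Germline_Proxi\n"
)
vafs = germline_vafs(example_trace)
```

    The resulting list can be passed to any plotting tool as the y-axis of a scatter plot, with genomic position or row order on the x-axis.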

    FASTQ Generation

    Sequencing data stored in BCL format are demultiplexed through a process that uses the index sequences unique to each sample to assign clusters to the library from which they originated. Each cluster contains two indexes (i7 and i5 sequences, one at each end of the library fragment). The combination of those index sequences is used to demultiplex the pooled libraries.

    After demultiplexing, this process generates FASTQ files, which contain the sequencing reads for each individual sample library and the associated quality scores for each base call, excluding reads from any clusters that did not pass filter.
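
    The assignment of clusters to samples by index pair can be sketched as a dictionary lookup. This is a simplified illustration: it requires exact index matches, whereas production demultiplexing typically tolerates a configurable number of mismatches, and all names and sequences are hypothetical.

```python
def demultiplex(clusters, sample_sheet):
    """Assign passing-filter reads to samples by (i7, i5) index pair.

    clusters: iterable of (i7, i5, read, passed_filter).
    sample_sheet: {(i7, i5): sample_id}.
    """
    per_sample = {sample_id: [] for sample_id in sample_sheet.values()}
    for i7, i5, read, passed_filter in clusters:
        sample_id = sample_sheet.get((i7, i5))
        # Skip clusters with unknown indexes or that did not pass filter.
        if sample_id is not None and passed_filter:
            per_sample[sample_id].append(read)
    return per_sample


# Illustrative sample sheet and clusters (made up for this example).
sample_sheet = {("ATCACG", "TTAGGC"): "Sample_A", ("CGATGT", "TGACCA"): "Sample_B"}
clusters = [
    ("ATCACG", "TTAGGC", "ACGTACGT", True),
    ("CGATGT", "TGACCA", "GGGTTTAA", True),
    ("ATCACG", "TTAGGC", "TTTTCCCC", False),  # did not pass filter
    ("AAAAAA", "CCCCCC", "ACACACAC", True),   # index pair not in sample sheet
]
assigned = demultiplex(clusters, sample_sheet)
```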

    Quality Control

    The software calculates several quality control metrics for runs and samples.

    These metrics and guidelines apply to DRAGEN TSO 500 v2.1 and above.

    Run QC

    The Run Metrics section of the metrics output report provides sequencing run quality metrics along with suggested values to determine if they are within an acceptable range. The overall percentage of reads passing filter is compared to a minimum threshold. For Read 1 and Read 2, the average percentage of bases ≥ Q30 is also compared to a minimum threshold. The Q-score gives a prediction of the probability of an incorrect base call. The following tables show run metric and quality threshold information for different systems.
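
    The relationship between a Q-score and the predicted error probability follows the standard Phred definition, Q = -10 log10(p). The sketch below shows that conversion and a percentage of bases ≥ Q30 calculation; the function names are hypothetical and the quality values are illustrative.

```python
def phred_to_error_probability(q_score):
    """Convert a Phred quality score to the predicted error probability."""
    return 10 ** (-q_score / 10)


def pct_bases_at_least_q30(qualities):
    """Return the percentage of base calls with a quality score of 30 or more."""
    if not qualities:
        raise ValueError("no quality scores provided")
    return 100.0 * sum(1 for q in qualities if q >= 30) / len(qualities)


q30_error = phred_to_error_probability(30)        # about 1 error in 1,000 calls
pct_q30 = pct_bases_at_least_q30([30, 35, 20, 40])
```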

    The values in the Run Metrics section are listed as NA in the following situations:

    • If the analysis was started from FASTQ files.

    • If the analysis was started from BCL files and the InterOp files are missing or corrupt.
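    A run-level check against the guideline thresholds listed in this section, including the NA cases above, might look like the following. This is a hedged sketch only: the dictionary layout, the function name, and the use of the string "NA" for missing values are assumptions, and the metric names omit the "(%)" suffix shown in the tables.

```python
# Minimum guideline values per system, taken from the Run QC tables in this section.
RUN_QC_THRESHOLDS = {
    "NextSeq 500/550": {"PCT_PF_READS": 80.0, "PCT_Q30_R1": 80.0, "PCT_Q30_R2": 80.0},
    "NovaSeq 6000": {"PCT_PF_READS": 55.0, "PCT_Q30_R1": 80.0, "PCT_Q30_R2": 80.0},
    "NextSeq 1000/2000": {"PCT_Q30_R1": 85.0, "PCT_Q30_R2": 85.0},
    "NovaSeq X": {"PCT_Q30_R1": 85.0, "PCT_Q30_R2": 85.0},
}


def evaluate_run_qc(system, metrics):
    """Return PASS, FAIL, or NA for each run metric defined for the given system."""
    results = {}
    for name, minimum in RUN_QC_THRESHOLDS[system].items():
        value = metrics.get(name, "NA")
        if value == "NA":
            # For example, analysis started from FASTQ or InterOp files were missing.
            results[name] = "NA"
        else:
            results[name] = "PASS" if float(value) >= minimum else "FAIL"
    return results


# Made-up metric values for illustration.
run_result = evaluate_run_qc(
    "NovaSeq 6000",
    {"PCT_PF_READS": 60.2, "PCT_Q30_R1": 79.5, "PCT_Q30_R2": "NA"},
)
```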

    NextSeq 500/550 or NextSeq 550Dx (RUO)

    | Metric | Description | Recommended Guideline Quality Threshold | Variant Class |
    | --- | --- | --- | --- |
    | PCT_PF_READS (%) | Total percentage of reads passing filter. | ≥ 80.0 | All |
    | PCT_Q30_R1 (%) | Percentage of Read 1 reads with quality score ≥ 30. | ≥ 80.0 | All |
    | PCT_Q30_R2 (%) | Percentage of Read 2 reads with quality score ≥ 30. | ≥ 80.0 | All |

    NovaSeq 6000 or NovaSeq 6000Dx (RUO)

    | Metric | Description | Recommended Guideline Quality Threshold | Variant Class |
    | --- | --- | --- | --- |
    | PCT_PF_READS (%) | Total percentage of reads passing filter. | ≥ 55.0 | All |
    | PCT_Q30_R1 (%) | Percentage of Read 1 reads with quality score ≥ 30. | ≥ 80.0 | All |
    | PCT_Q30_R2 (%) | Percentage of Read 2 reads with quality score ≥ 30. | ≥ 80.0 | All |

    NextSeq 1000/2000

    | Metric | Description | Recommended Guideline Quality Threshold | Variant Class |
    | --- | --- | --- | --- |
    | PCT_Q30_R1 (%) | Percentage of Read 1 reads with quality score ≥ 30. | ≥ 85.0 | All |
    | PCT_Q30_R2 (%) | Percentage of Read 2 reads with quality score ≥ 30. | ≥ 85.0 | All |

    NovaSeq X

    | Metric | Description | Recommended Guideline Quality Threshold | Variant Class |
    | --- | --- | --- | --- |
    | PCT_Q30_R1 (%) | Percentage of Read 1 reads with quality score ≥ 30. | ≥ 85.0 | All |
    | PCT_Q30_R2 (%) | Percentage of Read 2 reads with quality score ≥ 30. | ≥ 85.0 | All |

    DNA Sample QC

    DRAGEN TruSight Oncology 500 uses QC metrics to assess the validity of analysis for DNA libraries that pass contamination quality control. If the library fails one or more quality metrics, then the corresponding variant type or biomarker is not reported, and the associated QC category in the report header displays FAIL. Additionally, a companion diagnostic result may not be available if it relies on QC passing for one or more of the following QC categories.

    DNA library QC results are available in the MetricsOutput.tsv file.

    | Metric | Description | Recommended Guideline Quality Threshold | Variant Class |
    | --- | --- | --- | --- |
    | CONTAMINATION_SCORE | The contamination score is based on VAF distribution of SNPs. | ≤ 1457 | All |
    | MEDIAN_EXON_COVERAGE | Median exon fragment coverage across all exon bases. | ≥ 150 | Small variant TMB |
    | PCT_EXON_50X | Percent exon bases with 50x fragment coverage. | ≥ 90.0 | Small variant TMB |
    | MEDIAN_INSERT_SIZE | The median fragment length in the sample. | ≥ 70 | Small variant TMB |
    | USABLE_MSI_SITES | The number of MSI sites usable for MSI calling. | ≥ 40 | MSI |
    | MEDIAN_BIN_COUNT_CNV_TARGET | The median raw bin count per CNV target. | ≥ 1.0 | CNV |

    RNA Sample QC

    The input for RNA Library QC is RNA alignment. Metrics and guideline thresholds can be found in the MetricsOutput.tsv file.

    | Metric | Description | Recommended Guideline Quality Threshold | Variant Class |
    | --- | --- | --- | --- |
    | MEDIAN_CV_GENE_500X | The median coefficient of variation (CV) for all genes with median coverage > 500x. Genes with median coverage > 500x are likely to be highly expressed. A higher median CV for these genes indicates an issue with library preparation (poor sample input and/or a probe pulldown issue). | ≤ 0.93 | Fusion Splice |
    | MEDIAN_INSERT_SIZE | The median fragment length in the sample. | ≥ 80 | Fusion Splice |
    | TOTAL_ON_TARGET_READS | The total number of reads that map to the target regions. | ≥ 9000000 | Fusion Splice |
    | GENE_MEDIAN_COVERAGE* | The median deduped coverage across all genes in the RNA panel (55 genes). | N/A | Fusion Splice |

    *To avoid failing RNA samples unnecessarily, Illumina does not recommend a universal threshold to determine RNA sample quality. RNA expression varies significantly across tissue types, and the small panel size (55 genes) makes normalization challenging. Tissue-specific thresholds could be considered for normalization.
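    The relationship between a failed DNA metric and the variant class that is then not reported can be sketched as a small lookup. This is an illustration under stated assumptions, not the software's logic: the tuple layout, the function name, and the plain dictionary of metric values are hypothetical, and parsing of MetricsOutput.tsv is deliberately left out because its layout is not described here.

```python
import operator

# (metric, comparison that must hold, guideline threshold, affected variant class),
# taken from the DNA Sample QC guideline values in this section.
DNA_GUIDELINES = [
    ("CONTAMINATION_SCORE", operator.le, 1457, "All"),
    ("MEDIAN_EXON_COVERAGE", operator.ge, 150, "Small variant TMB"),
    ("PCT_EXON_50X", operator.ge, 90.0, "Small variant TMB"),
    ("MEDIAN_INSERT_SIZE", operator.ge, 70, "Small variant TMB"),
    ("USABLE_MSI_SITES", operator.ge, 40, "MSI"),
    ("MEDIAN_BIN_COUNT_CNV_TARGET", operator.ge, 1.0, "CNV"),
]


def failed_variant_classes(metrics, guidelines=DNA_GUIDELINES):
    """Return the set of variant classes with at least one metric outside its guideline."""
    failed = set()
    for name, passes, threshold, variant_class in guidelines:
        # Metrics absent from the input are skipped rather than treated as failures.
        if name in metrics and not passes(metrics[name], threshold):
            failed.add(variant_class)
    return failed


# Made-up metric values for illustration.
sample_metrics = {
    "CONTAMINATION_SCORE": 820,
    "MEDIAN_EXON_COVERAGE": 412,
    "PCT_EXON_50X": 97.3,
    "MEDIAN_INSERT_SIZE": 118,
    "USABLE_MSI_SITES": 35,
    "MEDIAN_BIN_COUNT_CNV_TARGET": 2.4,
}
failed = failed_variant_classes(sample_metrics)
```

    In this example only USABLE_MSI_SITES falls below its guideline, so only the MSI class is flagged.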