The contamination score evaluates presence of sample-to-sample contamination. The algorithm uses common germline SNPs in the homozygous state expected to have variant allele frequencies (VAF) at 0% and 100%. In contaminated samples, the VAFs shift away from the expected values allowing the detection of sample-to-sample contamination.
The contamination score can detect sample-to-sample contamination of 0.4% or more, meaning that at least 0.4% of the DNA input comes from the contaminant.
The contamination score is calculated using the SNP error file and Pileup file that are generated during the small variant calling, as well as the TMB trace file. The algorithm includes the following steps:
All positions that overlap with a pre-defined set of common SNPs and have variant allele frequencies of < 25% or > 75% are collected. Only SNPs are considered; indels are excluded.
Variants in CNV events are removed using a clustering method.
The likelihood that the positions are an error or a real mutation is calculated by:
CONTAMINATION_SCORE = sum(log10(P(vi is False Positive)))
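The collection step and the summation can be sketched as follows. This is a minimal illustration under stated assumptions, not the DRAGEN implementation: the `Site` record, its field names, and the `p_false_positive` inputs are hypothetical, and the CNV clustering step is omitted. The sum is a literal reading of the formula above, so it is non-positive; the text does not state how the reported score is scaled.

```python
import math
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical record for one position in the common-SNP set."""
    ref: str
    alt: str
    vaf: float               # variant allele frequency as a fraction (0.0 to 1.0)
    p_false_positive: float  # assumed per-site P(v_i is False Positive)


def is_snp(site):
    # Single-base substitution only; indels are excluded.
    return len(site.ref) == 1 and len(site.alt) == 1


def collect_sites(sites):
    # Keep SNPs whose VAF is below 25% or above 75%.
    return [s for s in sites if is_snp(s) and (s.vaf < 0.25 or s.vaf > 0.75)]


def sum_log10_false_positive(sites):
    # Literal reading of: sum(log10(P(vi is False Positive)))
    return sum(math.log10(s.p_false_positive) for s in sites)


sites = [
    Site("A", "G", 0.02, 0.1),
    Site("C", "T", 0.98, 0.01),
    Site("G", "A", 0.50, 0.5),     # heterozygous VAF, not collected
    Site("AT", "A", 0.01, 0.001),  # indel, not collected
]
kept = collect_sites(sites)
total = sum_log10_false_positive(kept)
```

Only the first two example sites are collected, so the sum covers two terms.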
The contamination score is reported in the metrics output file, MetricsOutput.tsv.
If the contamination score is equal to or below 1227 (the upper specification limit provided in the "USL Guideline" field in the metrics output file), the sample has less than 0.4% sample-to-sample contamination.
If the contamination score is above 1227, the sample has more than 0.4% sample-to-sample contamination. In this case, an estimate of the contamination level can be obtained from the PCT_CONTAMINATION_EST metric. PCT_CONTAMINATION_EST is not valid unless the contamination score exceeds 1227.
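The decision rule above can be expressed as a short helper. This is a sketch only: the function name and the returned dictionary are hypothetical, and the values are assumed to have already been read from MetricsOutput.tsv.

```python
USL_CONTAMINATION_SCORE = 1227  # upper specification limit stated in the text


def interpret_contamination(score, pct_contamination_est=None):
    """Apply the threshold rule to a contamination score.

    PCT_CONTAMINATION_EST is only returned when the score exceeds the
    limit, because it is not valid otherwise.
    """
    if score <= USL_CONTAMINATION_SCORE:
        return {"contaminated": False, "estimate": None}
    return {"contaminated": True, "estimate": pct_contamination_est}


below = interpret_contamination(1227, pct_contamination_est=0.9)
above = interpret_contamination(1500, pct_contamination_est=1.2)
```

A score exactly at the limit is treated as passing, and the estimate is discarded in that case.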
Samples with highly rearranged genomes (HRD samples) can have variants whose VAFs shift away from the expected frequencies due to genomic rearrangement, which can lead to false-positive contamination scores.
Visual examination of a VAF plot can help determine whether a shift in VAFs is due to true contamination.
To build a VAF plot, use the {Sample_ID}.tmb.trace.csv file. Filter to only germline variants (for example, by using tags "Germline_DB" and "Germline_Proxi" in the column "Status") and use values in the VAF column.
Select Scatter from the Charts menu
Review the plot as described above, analyzing whether variants are scattered or clustered around 50% and 100% VAF.
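As an alternative to a spreadsheet, the germline filter can be scripted. The sketch below assumes the trace file is a comma-separated file with a header row, that the "Status" column contains the tag text, and that VAF is stored as a fraction between 0 and 1; the sample rows, the extra columns, and the `fraction_clustered` helper with its tolerance are hypothetical.

```python
import csv
import io

GERMLINE_TAGS = ("Germline_DB", "Germline_Proxi")


def germline_vafs(handle):
    """Return VAF values for rows whose Status carries a germline tag."""
    vafs = []
    for row in csv.DictReader(handle):
        if any(tag in row["Status"] for tag in GERMLINE_TAGS):
            vafs.append(float(row["VAF"]))
    return vafs


def fraction_clustered(vafs, centers=(0.5, 1.0), tolerance=0.1):
    """Share of VAFs within `tolerance` of 50% or 100% (assumed cutoff)."""
    if not vafs:
        return 0.0
    near = sum(1 for v in vafs if any(abs(v - c) <= tolerance for c in centers))
    return near / len(vafs)


# Hypothetical stand-in for {Sample_ID}.tmb.trace.csv
sample = io.StringIO(
    "Chromosome,Position,VAF,Status\n"
    "chr1,100,0.49,Germline_DB\n"
    "chr1,200,0.99,Germline_Proxi\n"
    "chr2,300,0.07,Somatic\n"
    "chr3,400,0.31,Germline_DB\n"
)
vafs = germline_vafs(sample)
clustered = fraction_clustered(vafs)
```

For a real file, replace the in-memory sample with `open(path, newline="")`. The resulting list can be passed to any plotting tool to draw the scatter plot.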
The likelihood calculation relies on the following:
Estimating the error rate per sample
Counting mutation support
Counting total depth
The contamination score is calculated as the sum of all the log likelihood scores across the pre-defined SNP positions whose minor allele frequency is < 25% in the sample and that are not likely due to CNV events, as given by the CONTAMINATION_SCORE formula above.

The Run Metrics section of the metrics output report provides sequencing run quality metrics along with suggested values for determining whether each metric is within an acceptable range. The overall percentage of reads passing filter is compared to a minimum threshold. For Read 1 and Read 2, the average percentage of bases with a quality score of Q30 or higher is also compared to a minimum threshold; the Q-score predicts the probability of an incorrect base call. The following tables show run metric and quality threshold information for different systems.
The values in the Run Metrics section are listed as NA in the following situations:
If the analysis was started from FASTQ files.
If the analysis was started from BCL files and the InterOp files are missing or corrupt.
If the run is a NovaSeq X Plus run. These runs have no PCT_PF_READS value, so PCT_PF_READS is always NA.
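| Metric | Description | Guideline threshold | Variant type or biomarker |
| --- | --- | --- | --- |
| PCT_PF_READS (%) | Total percentage of reads passing filter. | ≥ 55.0 | All |
| PCT_Q30_R1 (%) | Percentage of Read 1 reads with quality score ≥ 30. | ≥ 80.0 | All |
| PCT_Q30_R2 (%) | Percentage of Read 2 reads with quality score ≥ 30. | ≥ 80.0 | All |

| Metric | Description | Guideline threshold | Variant type or biomarker |
| --- | --- | --- | --- |
| PCT_Q30_R1 (%) | Percentage of Read 1 reads with quality score ≥ 30. | ≥ 85.0 | All |
| PCT_Q30_R2 (%) | Percentage of Read 2 reads with quality score ≥ 30. | ≥ 85.0 | All |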
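A run-metric check against minimum thresholds, including the NA cases, might look like the sketch below. The function name and the input dictionaries are hypothetical; the threshold values are examples taken from the tables, and the set that applies depends on the sequencing system.

```python
def check_run_metrics(values, thresholds):
    """Compare each run metric to its minimum threshold.

    A metric reported as "NA" (or missing) is passed through as "NA"
    rather than being treated as a failure.
    """
    results = {}
    for name, minimum in thresholds.items():
        raw = values.get(name, "NA")
        if raw == "NA":
            results[name] = "NA"
        else:
            results[name] = "PASS" if float(raw) >= minimum else "FAIL"
    return results


# Example thresholds; choose the set that matches the system in use.
thresholds = {"PCT_PF_READS": 55.0, "PCT_Q30_R1": 80.0, "PCT_Q30_R2": 80.0}
# Hypothetical values as they might be read from the metrics output file.
values = {"PCT_PF_READS": "NA", "PCT_Q30_R1": "91.2", "PCT_Q30_R2": "79.5"}
run_qc = check_run_metrics(values, thresholds)
```

Keeping NA distinct from FAIL matches the situations in which the Run Metrics values are unavailable rather than out of range.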
DRAGEN TruSight Oncology 500 uses QC metrics to assess the validity of analysis for DNA libraries that pass contamination quality control. If the library fails one or more quality metrics, then the corresponding variant type or biomarker is not reported, and the associated QC category in the report header displays FAIL.
DNA library QC results are available in the MetricsOutput.tsv file. Refer to Metrics Output for details.
| Metric | Description | Guideline threshold | Variant type or biomarker |
| --- | --- | --- | --- |
| CONTAMINATION_SCORE | The contamination score is based on VAF distribution of SNPs. | ≤ 1227 | All |
| MEDIAN_EXON_COVERAGE | Median exon fragment coverage across all exon bases. | ≥ 1300 | Small variant, TMB, Fusion, MSI |
| PCT_EXON_1000X | Percent exon bases with 1000x fragment coverage. | ≥ 80.0 | Small variant, TMB |
| GENE_SCALED_MAD | The median of absolute deviations normalized by gene fold change. | ≤ 0.059* | CNV |
| MEDIAN_BIN_COUNT_CNV_TARGET | The median raw bin count per CNV target. | ≥ 6.0 | CNV |
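The relationship between a failed metric and the variant types or biomarkers it affects can be sketched as a lookup. This is an illustration only, not the DRAGEN logic: the rule table is transcribed from the thresholds listed above, the function name and example values are hypothetical, and the qualifier marked with an asterisk on the GENE_SCALED_MAD threshold is not modeled.

```python
import operator

# metric -> (comparison that must hold, limit, affected variant types/biomarkers)
DNA_QC_RULES = {
    "CONTAMINATION_SCORE": (operator.le, 1227, ["All"]),
    "MEDIAN_EXON_COVERAGE": (operator.ge, 1300, ["Small variant", "TMB", "Fusion", "MSI"]),
    "PCT_EXON_1000X": (operator.ge, 80.0, ["Small variant", "TMB"]),
    "GENE_SCALED_MAD": (operator.le, 0.059, ["CNV"]),  # asterisk qualifier not modeled
    "MEDIAN_BIN_COUNT_CNV_TARGET": (operator.ge, 6.0, ["CNV"]),
}


def failed_categories(metrics):
    """Return the set of variant types or biomarkers affected by failed metrics."""
    failed = set()
    for name, (passes, limit, categories) in DNA_QC_RULES.items():
        if name in metrics and not passes(metrics[name], limit):
            failed.update(categories)
    return failed


# Hypothetical metric values for one DNA library.
metrics = {
    "CONTAMINATION_SCORE": 800,
    "MEDIAN_EXON_COVERAGE": 1500,
    "PCT_EXON_1000X": 75.0,
    "GENE_SCALED_MAD": 0.03,
    "MEDIAN_BIN_COUNT_CNV_TARGET": 9.0,
}
failed = failed_categories(metrics)
```

In this example only PCT_EXON_1000X is below its threshold, so the affected categories are Small variant and TMB.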