Contamination Detection
The DRAGEN cross-sample contamination module estimates the fraction of sequencing reads originating from another human sample using a probabilistic mixture model.
DRAGEN provides two contamination detection modes. The appropriate mode depends on sample type, coverage, and expected contamination level.
Quick Decision Guide
| Use case | Sample conditions | Option | Behavior |
| --- | --- | --- | --- |
| General germline or somatic (default) | ≥ 20× coverage; FFPE/CNV/LOH allowed | --qc-detect-contamination=true | Runs GATK-based model; automatically falls back to legacy VerifyBamID-like model if the GATK-based model fails (e.g. high contamination) |
| RNA-seq | Variable expression and coverage | --qc-detect-contamination=true | Runs GATK-based model in experimental mode; results are best-effort and qualitative |
| Low coverage germline | Low coverage (~10×), no FFPE/CNV/LOH | --qc-cross-cont-vcf | Runs legacy VerifyBamID-like model directly; robust at low coverage |
Fallback Mechanism
When --qc-detect-contamination=true is specified, DRAGEN:
First attempts contamination estimation using the GATK-based model
Automatically falls back to the legacy VerifyBamID-like model if the GATK-based model fails to converge, most commonly at high contamination levels
No additional settings are required to enable fallback behavior.
GATK-Based Contamination Detection (Default)
Use for: Germline, tumor-only, and tumor-normal workflows. This is the recommended default.
Enable
--qc-detect-contamination=true
Population Marker Resources
(hg19, hg38, hs37d5)
Markers can also be supplied explicitly:
Behavior
Accounts for FFPE damage, copy number variation (CNV), and loss of heterozygosity (LOH)
Empirically adjusts base qualities to reduce FFPE deamination and oxidation noise
Optimized for low-to-moderate contamination levels
RNA-seq Support (Experimental)
--qc-detect-contamination=true can be run on RNA-seq data.
Limitations
Less stable than DNA due to expression and coverage variability
Results are qualitative indicators only
Feature is experimental
Legacy Contamination Detection (VerifyBamID-like)
Use for: Clean germline samples, especially at low coverage (~10×), or when fallback occurs.
Enable
--qc-cross-cont-vcf
Population Marker Resources
(hg19, hg38, hs37d5)
Behavior
Models the sample as a mixture of individuals
Performs well on clean germline data
Robust at low coverage
Can remain informative at high contamination
Not robust to FFPE damage, CNVs, or extended runs of homozygosity (ROH)
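To make the mixture idea concrete, the sketch below computes the expected alternate-allele read fraction at a single marker when a sample is mixed with one contaminant. It is a simplified two-sample illustration that ignores sequencing error and is not DRAGEN's actual likelihood model. The function name and the genotype encoding (0, 1, or 2 alternate alleles) are assumptions made for this example.

```python
def expected_alt_fraction(
    sample_alt_alleles: int,
    contaminant_alt_alleles: int,
    contamination: float,
) -> float:
    """Expected fraction of alternate-allele reads at one diploid marker.

    Genotypes are encoded as the number of alternate alleles (0, 1, or 2).
    `contamination` is the fraction of reads coming from the contaminant.
    """
    if not 0.0 <= contamination <= 1.0:
        raise ValueError("contamination must be between 0 and 1")
    for genotype in (sample_alt_alleles, contaminant_alt_alleles):
        if genotype not in (0, 1, 2):
            raise ValueError("genotype must be 0, 1, or 2 alternate alleles")
    # Each source contributes reads in proportion to its share of the mix.
    sample_part = (1.0 - contamination) * sample_alt_alleles / 2.0
    contaminant_part = contamination * contaminant_alt_alleles / 2.0
    return sample_part + contaminant_part


if __name__ == "__main__":
    # Homozygous-reference sample, homozygous-alternate contaminant, 2% mix.
    print(expected_alt_fraction(0, 2, 0.02))
```

Under these assumptions, a marker where the sample is homozygous reference shows alternate reads only in proportion to the contaminating fraction.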
Output and Interpretation
The contamination estimate is reported as a fraction. For example, a reported value of 0.011 corresponds to 1.1% contamination.
Interpretation Guidance
Contamination should be well below the minimum allele frequency of interest
Example: at 1% contamination, variants below ~5% AF may be unreliable
The metric saturates at roughly 30% contamination
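A small helper can make this guidance explicit when post-processing results. The sketch below converts the reported fraction to a percentage and flags estimates at or beyond the approximate saturation point. The function names and the `SATURATION_FRACTION` constant (taken from the approximate 30% figure above) are illustrative assumptions, not DRAGEN outputs or options.

```python
SATURATION_FRACTION = 0.30  # approximate saturation point stated above


def contamination_percent(fraction: float) -> float:
    """Convert a contamination fraction (0 to 1) to a percentage."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    # Round to suppress floating-point noise such as 1.0999999999999999.
    return round(fraction * 100.0, 4)


def near_saturation(fraction: float, saturation: float = SATURATION_FRACTION) -> bool:
    """Return True when the estimate is at or above the saturation point."""
    return fraction >= saturation


if __name__ == "__main__":
    estimate = 0.011
    print(f"{contamination_percent(estimate)}% contamination")
    print("near saturation:", near_saturation(estimate))
```

An estimate flagged by `near_saturation` is best read as "at least this contaminated" rather than as a precise value.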
Coverage and Validity Requirements
Contamination estimation requires ≥100 valid pileups.
A pileup is valid if:
Coverage ≥ 10×
≥ 95% of reads are valid
Soft-clipped reads are excluded. Excessive soft clipping is often caused by untrimmed adapters. If contamination is reported as NA, inspect marker loci in IGV and correct adapter issues upstream.
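The validity rules above can be sketched as a simple filter, which may help when reasoning about why an estimate is reported as NA. This is an illustration only: the `Pileup` class and function names are hypothetical, and treating coverage as the total read count at the locus is an assumption rather than a documented DRAGEN definition.

```python
from dataclasses import dataclass
from typing import Iterable

MIN_COVERAGE = 10            # minimum coverage per pileup
MIN_VALID_READ_RATIO = 0.95  # minimum fraction of valid reads
MIN_VALID_PILEUPS = 100      # pileups needed for an estimate


@dataclass
class Pileup:
    """Hypothetical per-marker summary: total reads and reads counted as valid."""
    total_reads: int
    valid_reads: int


def is_valid_pileup(
    pileup: Pileup,
    min_cov: int = MIN_COVERAGE,
    min_ratio: float = MIN_VALID_READ_RATIO,
) -> bool:
    """Apply the coverage and valid-read-ratio thresholds to one pileup."""
    if pileup.total_reads <= 0:
        return False
    ratio = pileup.valid_reads / pileup.total_reads
    return pileup.total_reads >= min_cov and ratio >= min_ratio


def can_estimate(pileups: Iterable[Pileup], min_valid: int = MIN_VALID_PILEUPS) -> bool:
    """Return True when enough pileups pass the validity checks."""
    return sum(is_valid_pileup(p) for p in pileups) >= min_valid


if __name__ == "__main__":
    clean = [Pileup(total_reads=20, valid_reads=20)] * 100
    clipped = [Pileup(total_reads=20, valid_reads=18)] * 100  # ratio 0.90
    print(can_estimate(clean), can_estimate(clipped))
```

The second case shows how a modest drop in the valid-read ratio across all markers, as with untrimmed adapters, is enough to leave no valid pileups.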
Legacy Model–Specific Settings
--qc-contam-min-cov
Minimum coverage per pileup (default: 10).
--qc-contam-min-valid-read-ratio
Minimum fraction of valid reads (default: 0.95). Can be lowered to ~0.75, but adapter trimming issues should be fixed instead.
Key Takeaways
Use GATK-based contamination detection for most workflows
Use the legacy model for low-coverage clean germline samples
High contamination triggers automatic fallback when using --qc-detect-contamination=true
RNA-seq support is experimental
