# Contamination Detection

The DRAGEN cross-sample contamination module estimates the fraction of sequencing reads originating from another human sample using a probabilistic mixture model.

DRAGEN provides **two contamination detection modes**. The appropriate mode depends on sample type, coverage, and expected contamination level.

***

## Quick Decision Guide

| What are you running?                 | Sample characteristics                | Setting to use                   | What DRAGEN does                                                                                                         |
| ------------------------------------- | ------------------------------------- | -------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| General germline or somatic (default) | ≥ 20× coverage; FFPE/CNV/LOH allowed  | `--qc-detect-contamination=true` | Runs GATK-based model; automatically falls back to legacy VerifyBamID-like model if GATK fails (e.g. high contamination) |
| RNA-seq                               | Variable expression and coverage      | `--qc-detect-contamination=true` | Runs GATK-based model in experimental mode; results are best-effort and qualitative                                      |
| Low-coverage germline                 | Low coverage (\~10×), no FFPE/CNV/LOH | `--qc-cross-cont-vcf`            | Runs legacy VerifyBamID-like model directly; robust at low coverage                                                      |
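
The table can be read as a small decision function. The sketch below is illustrative only: `choose_contamination_flag` and its parameters are hypothetical names, not part of DRAGEN, and the 20× cutoff is taken from the table above. The table does not cover low-coverage samples with FFPE/CNV/LOH, so routing them to the GATK-based model is an assumption.

```python
def choose_contamination_flag(coverage, has_ffpe_cnv_loh=False, is_rna=False):
    """Return the option suggested by the decision guide (hypothetical helper)."""
    # RNA-seq: only the GATK-based model is described, in experimental mode.
    if is_rna:
        return "--qc-detect-contamination=true"
    # FFPE damage, CNV, and LOH are only accounted for by the GATK-based model
    # (assumption for low-coverage samples, which the table does not cover).
    if has_ffpe_cnv_loh:
        return "--qc-detect-contamination=true"
    # Clean, low-coverage germline: run the legacy model directly.
    if coverage < 20:
        return "--qc-cross-cont-vcf"
    # Default for general germline or somatic samples.
    return "--qc-detect-contamination=true"
```

Note that `--qc-cross-cont-vcf` also needs a population marker VCF as its value.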

***

## Fallback Mechanism

When `--qc-detect-contamination=true` is specified, DRAGEN:

1. First attempts contamination estimation using the **GATK-based model**
2. Automatically falls back to the **legacy VerifyBamID-like model** if the GATK-based model fails to converge, most commonly at high contamination levels

No additional settings are required to enable fallback behavior.
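
The control flow of the fallback can be pictured as a try-then-fall-back pattern. The following is a conceptual sketch, not DRAGEN's implementation: `ConvergenceError`, `estimate_with_fallback`, and the two estimator callables are hypothetical stand-ins.

```python
class ConvergenceError(Exception):
    """Raised by the hypothetical primary estimator when it cannot converge."""


def estimate_with_fallback(primary, fallback, pileups):
    """Try the primary estimator; use the fallback only if it fails to converge.

    Returns (estimate, model_name) so the caller can see which model ran.
    """
    try:
        return primary(pileups), "gatk-based"
    except ConvergenceError:
        # Mirrors step 2 above: the legacy model takes over automatically.
        return fallback(pileups), "legacy"
```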

***

## GATK-Based Contamination Detection (Default)

**Use for:**\
Germline, tumor-only, and tumor-normal workflows. This is the **recommended default**.

**Enable**

```
--qc-detect-contamination=true
```

**Population Marker Resources**

```
/opt/dragen/<VERSION>/resources/qc/somatic_sample_cross_contamination_resource_*.vcf.gz
```

Resource files are provided for the hg19, hg38, and hs37d5 reference builds.

Markers can also be supplied explicitly:

```
--qc-somatic-contam-vcf <population_markers.vcf>
```

**Behavior**

* Accounts for FFPE damage, copy number variation (CNV), and loss of heterozygosity (LOH)
* Empirically adjusts base qualities to reduce FFPE deamination and oxidation noise
* Optimized for low-to-moderate contamination levels

***

### RNA-seq Support (Experimental)

`--qc-detect-contamination=true` can be run on RNA-seq data.

**Limitations**

* Less stable than DNA due to expression and coverage variability
* Results are qualitative indicators only
* Feature is experimental

***

## Legacy Contamination Detection (VerifyBamID-like)

**Use for:**\
Clean germline samples, especially at **low coverage (\~10×)**. This is also the model that runs when the GATK-based model falls back.

**Enable**

```
--qc-cross-cont-vcf <population_markers.vcf>
```

**Population Marker Resources**

```
/opt/dragen/<VERSION>/resources/qc/sample_cross_contamination_resource_*.vcf.gz
```

Resource files are provided for the hg19, hg38, and hs37d5 reference builds.

**Behavior**

* Models the sample as a mixture of individuals
* Performs well on clean germline data
* Robust at low coverage
* Can remain informative at high contamination
* Not robust to FFPE damage, CNVs, or extended runs of homozygosity (ROH)

***

## Output and Interpretation

The contamination estimate is reported as a fraction:

```
MAPPING/ALIGNING SUMMARY Estimated sample contamination 0.011
```

This corresponds to **1.1% contamination**.
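
A minimal sketch of reading this metric programmatically is shown below. It assumes the line layout shown above; because metrics files may instead be comma-separated, the splitter accepts both. `parse_contamination` is a hypothetical helper, not a DRAGEN utility.

```python
import re


def parse_contamination(line):
    """Return the contamination fraction as a float, or None if reported as NA."""
    if "Estimated sample contamination" not in line:
        raise ValueError("not a contamination metric line")
    # Split on whitespace or commas; the value is the last non-empty field.
    fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
    value = fields[-1]
    if value.upper() == "NA":
        return None
    return float(value)


line = "MAPPING/ALIGNING SUMMARY Estimated sample contamination 0.011"
fraction = parse_contamination(line)
print(f"{fraction:.1%}")  # prints 1.1%
```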

**Interpretation Guidance**

* Contamination should be well below the minimum allele frequency (AF) of interest
* Example: at 1% contamination, variants below \~5% AF may be unreliable
* The metric saturates near \~30% contamination, so an estimate close to that value may understate the true level
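
The guidance above can be turned into a simple check. This is a sketch under stated assumptions: the `safety_factor` of 5 is extrapolated from the single 1% to \~5% example and is not a documented DRAGEN rule, and `assess_contamination` is a hypothetical helper.

```python
def assess_contamination(contamination, min_af_of_interest,
                         safety_factor=5.0, saturation=0.30):
    """Return a list of warnings for a contamination fraction (0 to 1).

    safety_factor: assumed ratio between contamination and the lowest AF
    that can still be trusted (5 reproduces the 1% -> ~5% example).
    saturation: level near which the metric stops increasing.
    """
    notes = []
    if contamination >= saturation:
        notes.append("estimate is near saturation; true contamination may be higher")
    if contamination * safety_factor > min_af_of_interest:
        notes.append("variants near the minimum AF of interest may be unreliable")
    return notes


# 1% contamination with a 2% AF target: 0.01 * 5 exceeds 0.02, so this warns.
print(assess_contamination(0.01, 0.02))
```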

***

## Coverage and Validity Requirements

Contamination estimation requires **≥100 valid pileups**.

A pileup is valid if:

* Coverage ≥ **10×**
* ≥ **95%** of reads are valid

Soft-clipped reads are excluded, so excessive soft clipping, which is often caused by untrimmed adapters, can leave too few valid pileups. If contamination is reported as **NA**, inspect marker loci in IGV and correct adapter issues upstream.
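
The validity rules can be sketched as follows. The `Pileup` type and function names are hypothetical, and treating coverage as the total number of reads overlapping a marker locus is an assumption; the thresholds are the ones stated above.

```python
from dataclasses import dataclass


@dataclass
class Pileup:
    coverage: int     # reads overlapping the marker locus (assumed definition)
    valid_reads: int  # reads counted as valid, e.g. not soft-clipped


def is_valid_pileup(p, min_cov=10, min_valid_ratio=0.95):
    """A pileup is valid if it has enough coverage and enough valid reads."""
    if p.coverage <= 0 or p.coverage < min_cov:
        return False
    return p.valid_reads / p.coverage >= min_valid_ratio


def can_estimate(pileups, min_valid_pileups=100):
    """Estimation needs at least `min_valid_pileups` valid pileups."""
    n_valid = sum(1 for p in pileups if is_valid_pileup(p))
    return n_valid >= min_valid_pileups
```

When `can_estimate` would return `False`, the corresponding DRAGEN behavior described above is a contamination value of **NA**.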

***

## Legacy Model–Specific Settings

| Setting                            | Description                                                                                                                     |
| ---------------------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| `--qc-contam-min-cov`              | Minimum coverage per pileup (default: 10).                                                                                      |
| `--qc-contam-min-valid-read-ratio` | Minimum fraction of valid reads (default: 0.95). Can be lowered to \~0.75, but adapter trimming issues should be fixed instead. |
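
For scripted runs, these options can be assembled into an argument list. The helper below is a hypothetical sketch: it uses only the option names and defaults documented on this page, and passing each value as a separate token after its option is an assumption about the command-line form.

```python
def legacy_contamination_args(markers_vcf, min_cov=10, min_valid_read_ratio=0.95):
    """Build legacy-model contamination options as a list of CLI tokens."""
    if min_cov < 1:
        raise ValueError("min_cov must be at least 1")
    if not 0 < min_valid_read_ratio <= 1:
        raise ValueError("min_valid_read_ratio must be in (0, 1]")
    return [
        "--qc-cross-cont-vcf", markers_vcf,
        "--qc-contam-min-cov", str(min_cov),
        "--qc-contam-min-valid-read-ratio", str(min_valid_read_ratio),
    ]


# Example with a placeholder marker file name.
print(" ".join(legacy_contamination_args("population_markers.vcf")))
```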

***

## Key Takeaways

* Use **GATK-based contamination detection** for most workflows
* Use the **legacy model** for low-coverage clean germline samples
* High contamination triggers **automatic fallback** when using `--qc-detect-contamination=true`
* RNA-seq support is **experimental**
