# TruPath Outputs

The DRAGEN Germline pipeline, run in proximity mode for the Illumina TruPath Genome prep, produces a layered set of outputs. These begin with **proximity‑aware templates and alignments**, expand into **long‑range phasing**, and culminate in **phased small variant calls, haplotype‑resolved SVs, small variants in paralogous regions (MRJD), STR expansion calls, and colocation‑validated break ends**, all backed by extensive QC and reporting. Each algorithm consumes the same underlying proximity signal but exposes its results through **standard genomics artifacts (BAM, VCF, CSV, JSON, Cooler)**, which keeps the workflow interoperable with existing tools.

For more information on the DRAGEN algorithms, features, and outputs that support the Illumina TruPath Genome prep, see the [DRAGEN User Guide](https://help.dragen.illumina.com/dragen-v4.5-trupath).

<figure><img src="https://2122546113-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiFAstZpxWHpI6k3vTvF%2Fuploads%2Fgit-blob-a39dc915961efa570d2d8991425b42a0022bb5da%2Fclear_Dragen%20secondary%20analysis%20workflow-v2%20(1).png?alt=media" alt=""><figcaption></figcaption></figure>

#### 1. Proximity Linking Model & Template Reconstruction

* A **non‑linear proximity linking model** is fit per read group using flow cell (X,Y) distance and genomic distance to reconstruct long DNA templates. This is the foundational signal reused by each downstream algorithm.
* Key Output Files:
  * **Proximity‑aware BAM/CRAM**
    * Reads tagged with **`BX:Z`** (template ID)
  * **Template metrics CSVs** (WGS + QC regions):
    * `<prefix>.<qc>_template_subpairs.csv`
    * `<prefix>.<qc>_template_gdist.csv`
    * `<prefix>.<qc>_template_xdist.csv`
    * `<prefix>.<qc>_template_ydist.csv`
    * `<prefix>.<qc>_template_thresholds.csv`
  * **Link metrics CSVs**:
    * `<prefix>.<qc>_proximity_gdist.csv`
    * `<prefix>.<qc>_proximity_xdist.csv`
    * `<prefix>.<qc>_proximity_ydist.csv`
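
As a minimal sketch of how the `BX:Z` template tag might be consumed downstream, the following groups read names by template ID from SAM-formatted text. It relies only on the standard SAM layout (optional tags start at the 12th column); the example records and the helper names `template_id` and `group_reads_by_template` are hypothetical and are not part of DRAGEN.

```python
from collections import defaultdict


def template_id(sam_line):
    """Return the value of the BX:Z tag on one SAM record, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start at the 12th column of a SAM record.
    for tag in fields[11:]:
        if tag.startswith("BX:Z:"):
            return tag[len("BX:Z:"):]
    return None


def group_reads_by_template(sam_lines):
    """Map each template ID to the names of the reads that carry it."""
    groups = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        bx = template_id(line)
        if bx is not None:
            read_name = line.split("\t", 1)[0]
            groups[bx].append(read_name)
    return dict(groups)


# Hypothetical records, made up for illustration only.
example = [
    "@HD\tVN:1.6",
    "readA\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tFFFF\tBX:Z:T1",
    "readB\t99\tchr1\t5000\t60\t4M\t=\t5200\t204\tACGT\tFFFF\tBX:Z:T1",
    "readC\t99\tchr2\t100\t60\t4M\t=\t300\t204\tACGT\tFFFF\tBX:Z:T2",
    "readD\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF",
]
groups = group_reads_by_template(example)
```

In practice the input lines could come from `samtools view` piped into a script like this one. Reads without a `BX:Z` tag, such as `readD` here, are simply left out of the grouping.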

#### 2. Proximity-Aware Mapping & Alignment

* Uses proximity link probabilities as an additional alignment score to resolve ambiguous mappings that standard short-read alignment cannot place confidently.
* Key Output Files:
  * **Mapped BAM/CRAM**
    * Improved placement in repeats/paralogs
    * Carries proximity and template tags

#### 3. Read Phasing & Personalized Haplotypes

* Combines haplotype databases with long‑range proximity links to create long, confident phase blocks, enabling haplotype‑aware variant calling and downstream analyses.
* Key Output Files:
  * **Phased BAM/CRAM**
    * Tags: `HP` (haplotype), `PS` (phase block), `pp` (phasing confidence)
  * **Personalized haplotypes TSV**
    * `<prefix>.personal_haplotypes.tsv.gz`
  * **Phase block GTF**
    * `<prefix>.phase_blocks.gtf.gz`
  * **Imputed variants VCF**
    * `<prefix>.personal.vcf.gz`
  * **Phasing metrics CSV**
    * `<prefix>.phasing_summary_stats.csv`
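
The phase block GTF can be summarized with a few lines of code. The sketch below assumes only the generic GTF layout (tab-separated, with 1-based inclusive start and end in columns 4 and 5) and assumes that each record describes one phase block. The source, feature, and attribute values in the example records are hypothetical, as are the function names.

```python
import gzip


def phase_block_lengths(lines):
    """Return (contig, length) for every GTF record in an iterable of lines."""
    lengths = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        start, end = int(cols[3]), int(cols[4])
        # GTF coordinates are 1-based and inclusive on both ends.
        lengths.append((cols[0], end - start + 1))
    return lengths


def read_phase_blocks(path):
    """Read a gzip-compressed GTF, e.g. <prefix>.phase_blocks.gtf.gz."""
    with gzip.open(path, "rt") as handle:
        return phase_block_lengths(handle)


# Hypothetical records; the real source/feature/attribute values may differ.
example = [
    "# illustrative records only",
    'chr1\tDRAGEN\tphase_block\t1001\t51000\t.\t.\t.\tgene_id "1001";',
    'chr1\tDRAGEN\tphase_block\t60001\t60500\t.\t.\t.\tgene_id "60001";',
]
blocks = phase_block_lengths(example)
```

The resulting lengths can feed summary statistics such as the longest block per contig or a block-length histogram.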

#### 4. Proximity-Aware Structural Variant (SV) Calling

* Uses haplotype‑segregated assemblies and phasing‑aware machine learning (ML) models to improve SV detection and genotyping in single‑sample germline WGS. Proximity information enters indirectly through phasing.
* Key Output Files:
  * **SV VCF**
    * Includes TruPath‑specific fields:
      * `PHASEDASM`
      * `ML_UPDATED`
      * `MLQS`
      * `ColocationSum` (when colocation filter applied)
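
A minimal sketch for pulling these fields out of an SV VCF record is shown below. It assumes that `PHASEDASM`, `ML_UPDATED`, and `MLQS` appear as keys in the INFO column and that `ColocationSum` appears in the FILTER column; the VCF header of an actual run is the authority on where each field lives and what type it has. The example record and its values are hypothetical.

```python
def parse_info(info_field):
    """Parse a VCF INFO string into a dict; flag-style keys map to True."""
    info = {}
    if info_field == ".":
        return info
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


def trupath_annotations(vcf_line, keys=("PHASEDASM", "ML_UPDATED", "MLQS")):
    """Collect the listed keys from INFO and check FILTER for ColocationSum.

    Assumption: the keys are INFO entries; adjust if the header says otherwise.
    """
    cols = vcf_line.rstrip("\n").split("\t")
    info = parse_info(cols[7])
    filters = [] if cols[6] in (".", "PASS") else cols[6].split(";")
    return {
        "id": cols[2],
        "annotations": {k: info[k] for k in keys if k in info},
        "colocation_filtered": "ColocationSum" in filters,
    }


# Hypothetical record, made up for illustration only.
record = "chr1\t10000\tsv1\tN\t<DEL>\t.\tColocationSum\tSVTYPE=DEL;END=12000;PHASEDASM;MLQS=12"
result = trupath_annotations(record)
```

Values are kept as strings so that no field type is assumed; convert them once the header confirms the declared type.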

#### 5. Multi-Region Joint Detection (MRJD)

* Performs *de novo*, haplotype‑resolved small‑variant calling in paralogous regions, estimating copy number and assigning variants to specific gene copies or haplotypes using proximity information.
* Key Output Files:
  * **Primary MRJD VCF**
    * `<prefix>.mrjd.hard-filtered.vcf.gz`
  * **MRJD JSON summary**
    * `<prefix>.mrjd.json`
    * Copy number, region/haplotype assignments, run status
  * **MRJD phased BAM**
    * `<prefix>.mrjd.phased.bam`
    * Tags: `HP` (copy), `PC` (confidence), `PS`, `BX`
  * **Supporting files directory**
    * `mrjd_supporting_files/`
      * Multi‑column VCFs per paralog set
      * Reference region alignments (SAM)

#### 6. STR Calling with IRR Recovery

* Recovers otherwise unmapped in‑repeat reads using proximity, enabling more accurate sizing of large STR expansions and improving genotyping in heterozygous cases.
* Key Output Files:
  * **STR VCF** (standard DRAGEN‑STR format)
  * **BAM/CRAM with IRR tags**
    * `tr` tag encoding recovered repeat motif

#### 7. Colocation Analysis & Filtering

* Summarizes long‑range genomic interactions from proximity‑linked reads and uses that signal to validate or filter SV break ends that lack molecular support.
* Key Output Files:
  * **Cooler file**
    * Sparse colocation matrix (Hi‑C‑like)
  * **SV VCF annotations**
    * `NORMALIZED_COLOC_SUM`
    * `ColocationSum` filter (when applied)

#### 8. Proximity‑Filtered Coverage & Reporting

* Provides QC and interpretability for proximity data quality, template structure, and phasing performance, and is integrated into the standard DRAGEN Report.
* Key Output Files:
  * **Proximity coverage CSVs** (per link‑quality threshold):
    * `<prefix>_proximity_linkqual<q>_coverage_metrics.csv`
    * `<prefix>_proximity_linkqual<q>_hist.csv`
    * `<prefix>_proximity_linkqual<q>_fine_hist.csv`
    * `<prefix>_proximity_linkqual<q>_overall_mean_cov.csv`
    * `<prefix>_proximity_linkqual<q>_contig_mean_cov.csv`
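
The naming pattern above lends itself to automatic discovery of the per-threshold files. The sketch below recovers the prefix, the threshold `<q>`, and the file kind from a filename, then converts the threshold with the standard Phred relation $p = 10^{-Q/10}$. It assumes that `<q>` is written as a plain integer and that the link-quality threshold follows the usual Phred convention; the regular expression and function names are illustrative.

```python
import re

# Assumption: <q> is a plain integer such as 25.
PATTERN = re.compile(
    r"^(?P<prefix>.+)_proximity_linkqual(?P<q>\d+)_"
    r"(?P<kind>coverage_metrics|fine_hist|hist|overall_mean_cov|contig_mean_cov)"
    r"\.csv$"
)


def phred_to_error_probability(q):
    """Standard Phred relation: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def describe(filename):
    """Return (prefix, q, kind, error_probability), or None if no match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    q = int(match.group("q"))
    return (
        match.group("prefix"),
        q,
        match.group("kind"),
        phred_to_error_probability(q),
    )


# Hypothetical filename built from the pattern listed above.
info = describe("sample1_proximity_linkqual25_fine_hist.csv")
```

Under the Phred convention, a threshold of Q20 corresponds to an error probability of 0.01 and Q25 to roughly 0.003.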

#### 9. DRAGEN Reports (TruPath Germline WGS)

* Dedicated *Proximity* tab with QC metrics and visualizations. See the DRAGEN Reports section of the DRAGEN User Guide for additional information.
* Key Metrics:
  * `Fit RMSE` - An estimate of how much the probabilities estimated by the parametric and non-parametric models can differ, expressed on the Phred scale
  * `Q25 Proximity Rate` - Percentage of read pairs with at least one neighbor whose link quality is above Q25 (Phred scale)
  * `Q25 Proximity Coverage` - Average alignment coverage across the genome from reads with link quality ≥ Q25 (Phred scale)
  * `P75 Template Size` - The size of linked template molecules at the 75th percentile
  * `NG50` - The phase block length such that blocks of that length or longer cover 50% of the genome
* Key Plots:

  * Template Genomic Span, displaying the distribution of template genomic lengths from `<prefix>.wgs_template_gdist.csv`

  <figure><img src="https://2122546113-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FCiFAstZpxWHpI6k3vTvF%2Fuploads%2Fgit-blob-d9af0b14a9a940c11b4a161b348c5a09e8d9cf92%2Fimage%20(46).png?alt=media" alt=""><figcaption></figcaption></figure>
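
The `NG50` metric listed among the key metrics can be illustrated with a short worked computation. This is a sketch of the conventional NG50 definition applied to phase block lengths, not DRAGEN's implementation; the block lengths and genome size are made-up numbers, and the exact genome size used as the denominator in the report is not specified here.

```python
def ng50(block_lengths, genome_size):
    """Length of the block at which the running total of blocks, taken
    from longest to shortest, first reaches half of the genome size.

    Returns 0 if all blocks together cover less than half of the genome.
    """
    target = genome_size / 2
    covered = 0
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0


# Made-up numbers: four blocks on a genome of size 120.
# Cumulative totals are 40, 70, ...; 70 is the first to reach 60.
value = ng50([40, 30, 20, 10], genome_size=120)
```

Because the denominator is the genome size rather than the total phased length, NG50 stays comparable across samples even when the fraction of the genome that is phased differs.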
