1 of 6

Biomarkers

Tumor Mutational Burden

DRAGEN supports Tumor Mutational Burden (TMB) in Tumor-Only or Tumor-Normal Mode.

It is important to note that in T/O mode germline variants must be identified and filtered using database information and optionally also allele frequency information. These germline filtering techniques are generally not as accurate as tumor normal subtraction. When using databases only to subtract germline variants, the TMB may be slightly higher than the more accurate T/N estimate. When using database and allele frequency information to remove germline variants, the TMB may be slightly underestimated for high purity tumor samples.

TMB Processing Flow

DRAGEN TMB comprise the following steps:

1. Variant calling

Please refer to "Somatic mode" for detailed variant calling options.

2. Eligible region detection

TMB is computed over protein coding regions with sufficient coverage. If DRAGEN detects a reference hg19/38, GRCh37/38 or hs37d5 it will automatically select the appropriate coding region based on the bed files available in "<INSTALL_PATH>/resources/tmb/". By default the coverage threshold for eligible regions is 50.

The protein coding region bed file and the coverage settings can explicitly be specified using the qc-coverage options listed below in [QC coverage settings to override the default eligible region]. If DRAGEN does not automatically detect the reference it is required to specify these settings.

3. Variant filters

The following variants are excluded from the TMB calculation:

Non-PASS variants
Mitochondrial variants
MNVs
Variants that do not meet the minimum depth (DP) threshold. Use the --vc-callability-tumor-thresh command line option to specify the threshold value.
Variants that do not meet the minimum variant allele threshold. Use the --tmb-vaf-threshold command line option to specify the threshold value.
Variants that fall outside the eligible regions.
Tumor driver mutations. Variants with a population allele count ≥ 50 are treated as tumor driver mutations. You can specify the cosmic driver threshold using the tmb-cosmic-count-threshold command line option. The tumor driver mutations filter relies on Nirvana annotations and will additionally require settings for --enable-variant-annotation=true, --variant-annotation-assembly, and --variant-annotation-data.

4. Support for germline variants

By default, germline variants are not counted towards TMB. Variants are determined as germline based on a database or a proxi filter. The database germline filter can be disabled with tmb-skip-db-filter. Disabling the database germline filter will effectively also disable the germline proxi filter.

Database filter

Variants with a population allele count ≥ 50 that are observed in either the 1000 Genome or gnomAD database will be marked as germline. Use germline-tagging-db-threshold to change the population allele counts. The database germline filter relies on Nirvana annotations and requires settings for --enable-variant-annotation=true, --variant-annotation-assembly, and --variant-annotation-data.

Proxi filter

Proxi filter can be enabled with tmb-enable-proxi-filter. The proxi filter will flag any variants with VAF > 0.9 as germline. The proxi filter scans the variants surround a specific variant and identifies those variants with similar VAFs. The proxi window size that determines the number of surrounding variants can be specified with tmb-proxi-window-size. If 95% (default value for tmb-proxi-fraction-threshold) and no less than 5 (tmb-proxi-count-threshold) of the surrounding variants of similar VAF are germline, then mark the current variant also as germline.
Proxi filter can also be done via a probabilistic approach, which can be enabled with tmb-enable-prob-proxi. It estimates the expected germline allele frequency using the surrounding germline variants and then tests whether the allele frequency of the target variant is similar to the expected germline allele frequency or not. P value threshold can be set by tmb-prob-proxi-p-value (the default value 1e-15 is set for ultra-deep sequenced samples, e.g. cfDNA)
Note that proxi filters can be too aggressive for 100% pure cell lines. Probabilistic proxi filter can be problematic for mixing or contaminated samples, as these samples do not have clear germline variant allele frequency distributions.

CH filter

When processing ctDNA samples it may be beneficial to also remove CH (clonal hematopoiesis) variants. Circulating tumor DNA generally has shorter fragment size. CH variants can be identified based on the insert size of the reads supporting the call. To capture the insert-size distribution for each variant call, it is required to specify vc-log-insert-size during variant calling (step1). Once specified, potential CH variants based on insert size distribution will be labeled in the output. Additional, CH variants can be also labeled via a bed file supplied to tmb-ch-bed. Variants other than germlines overlapping the region will be labeled as CH.

5. Support Nonsynonymous variants.

Nonsynonymous consequences are detected based on the Nirvana annotations. Nirvana variants that are annotated with the following consequences are labaled as nonsynonymous:

feature_elongation, feature_truncation, frameshift_variant, incomplete_terminal_codon_variant, inframe_deletion, inframe_insertion, missense_variant, protein_altering_variant, splice_acceptor_variant, splice_donor_variant, start_lost, stop_gained, stop_lost, transcript_truncation

TMB outputs a tmb.trace.csv file with detailed information on each variant used the TMB score. The trace file contains a column "Nonsynonymous" that indicates the appropriate status for each variant.

The subset of filtered variants that are nonsynonymous are used as numerator in the "Filtered Nonsyn Variant Count" metric.

6. TMB calculation

TMB = Filtered Variants / Eligible Region (Mbp)
Nonsynonymous TMB = Filtered Nonsynonymous Variants / Eligible Region (Mbp)

7. Maximum somatic allele frequency (MSAF)

The maximum somatic allele frequency (MSAF) outputs the estimated maximum somatic allele frequency of the sample. This is done via finding the confident somatic variants with highest allele frequency. MSAF is a rough approximate to the tumor fraction of cfDNA in peripheral blood samples. The MSAF mode can be enabled with tmb-enable-msaf.

Command-Line Options

[Required]

--enable-tmb true Enables TMB. If set, the small variant caller, Illumina Annotation Engine, and the related callability report are enabled.
--enable-variant-annotation=true, --variant-annotation-assembly, and --variant-annotation-data enables Nirvana, the Illumina Annotation Engine. For more information on selecting the correct assembly and downloading reference files, see Illumina Annotation Engine.

[Recommended]

[QC coverage settings to override the default eligible region]

The protein coding region and the coverage settings can explicitly be specified using the qc-coverage options listed below. All four settings must be specified to override the defaults. If DRAGEN does not automatically detect the reference it is required to specify these settings.

--qc-coverage-region-1 Specify the coding regions bed file to use.
--qc-coverage-tag-1=tmb Required to associate these coverage settings with TMB. If this setting is not specified then DRAGEN will revert to default coding regions.
--vc-callability-tumor-thresh Specify the somatic_callable bed minimum threshold, this will limit the regions over which TMB will be computed (default is 50).
--qc-coverage-reports-1=callability. The callability report is required whenever it is desired to override the default TMB coverage settings.

[Optional settings]

--tmb-vaf-threshold Specify the minimum VAF threshold for a variant. Variants that do not meet the threshold are filtered out (default=0.05)
--tmb-cosmic-count-threshold The minimum number of observations in cosmic for variant to be considered a driver mutation. Driver mutations are not counted in TMB. This setting has very little impact on WES/WGS, but can help avoid bias in small panels (default=50)
--tmb-skip-db-filter Skip database germline filtering. The database germline filter is required for tumor-only samples, but can be skipped for tumor-normal (default=false)
--germline-tagging-db-threshold Specify the minimum allele count (total number of observations) for an allele in gnomAD or 1000 Genome to be considered a germline variant. Variant calls that have the same positions and allele are ignored from the TMB calculation (default=50)
--tmb-germline-max-cosmic-count Restrict the db-filter. Variants with cosmic allele count higher than this threshold will never be marked as germline. Set to 0 to disable. (default=0, range=[0;1000]).
--tmb-germline-min-vaf Restrict the db-filter. Variants with a variant allele frequency lower than this threshold will never be marked as germline. Set to 0 to disable. (default=0, range=[0;1])
--tmb-enable-proxi-filter Enable proxi filter functionality in germline filtering. This is an optional feature that may be appropriate for T/O runs. In T/O mode the DB germline filter may not able to detect all germline variants, especially for ethnicity groups that are not well represented in germline databases. The proxi filter uses allele frequency information to help remove germline variants missed by the DB, and can help to obtain more accurate (lower) TMB values on samples with low tumor purity. In samples with high tumor purity this filter may be too aggressive and mark some somatic variants as germline resulting in too low TMB scores. (default=false)
--tmb-proxi-count-threshold Proxi filter surrounding variant count threshold in germline filtering (default=5)
--tmb-proxi-fraction-threshold Proxi filter surrounding variant db filter fraction threshold in germline filtering (default=0.95)
--tmb-proxi-window-size, Number of surrounding variants before and after the target variant for proxi filter (default=500)
--tmb-ch-bed Variants in the region will be labeled as clonal hematopoiesis (CH) variants.
--tmb-ch-insert-p-value Minimum P value to classify a variant as CH using insert size (default=0.1)
--tmb-ch-insert-min-len Minimum fragment size to test for CH using insert size (default=100)
--tmb-ch-insert-max-len Maximum fragment size to test for CH using insert size (default=200)
--tmb-ch-insert-min-num Minimum number of fragment size record to test for CH using insert size (default=50)
--tmb-enable-msaf Enable MSAF output (default=false)
--tmb-msaf-p-value Maximum P value (from insert size) to call confident somatic variant (default=1e-5)
--tmb-msaf-rank-num If no confident somatic variant found, it will use the specified ranked variant (default=4)

Output

The TMB values are output to <output prefix>.tmb.metrics.csv. The file format uses the following CSV column convention, similar to other metric CSV files.

The TMB module also outputs a tmb.trace.csv file that provides detailed information on each variant that was included in the TMB calculation.

When enabling MSAF, the information is output to <output prefix>.tmb.msaf.csv.

Microsatellite Instability

Microsatellites are genomic regions of short DNA motifs that are repeated 5–50 times and are associated with high mutation rates. Microsatellite Instability (MSI) results from deficiencies in the DNA mismatch repair pathway and can be used as a critical biomarker to predict immunotherapy responses in multiple tumor types.

DRAGEN MSI supports running in tumor-normal and tumor-only modes. Tumor-normal is generally expected to generate more accurate results. The tumor-only mode will require a panel of normals. The panel of normals will be generated using the collect-evidence mode.

Command-Line Options

The following is an example command for tumor-normal mode. Default resource files are available for WES and WGS. Please note that the WES and WGS tumor-normal modes are fully supported and tested. Custom panels may require more extensive validation and possibly require generating a new sites file.

The following is an example command for the tumor-only mode. Please note that the WES and WGS tumor-only modes are not as extensively tested as the tumor-normal modes. The TSO500 panels do not have normal controls, and are only tested and validated in tumor-only mode.

Assay Specific Settings

TSO500 Solid microsatellite instability is defined as all samples with "PercentageUnstableSites >= 20". It is generally recommended to use "PercentageUnstableSites" as metric for determining the MSI status. This metric is normalized, and is expected to be more consistent for different pipelines and with different input site files. The exact thresholds for other assays may still depend on the sample noise characteristics (PCR / UMI etc) and may need some empirical calibration.

Default Microsatellite sites files

The following is an example of a microsatellite file:

For panels it is recommended to post-process the file by intersecting the WES or WGS sites with the panel of interest. This will avoid using any off-target reads in the MSI analysis.

Custom Microsatellite files

Custom Microsatellite site files may be required if a small panel is targeted and the default site files do not have sufficient overlapping sites.

Custom Microsatellite site files can be generated by using msi-sensor [https://github.com/xjtu-omics/msisensor-pro/wiki/Best-Practices].

A subsequent post-processing step is recommended:

only keep microsatellites sites with a repeat unit of length 1
keep sites with 10 - 50bp repeats (a max length of 100bp repeats is supported)
remove any sites containing Ns in the left or right anchors
downsample the remaining sites to contain at least 2000 sites, but no more than 1 million sites (to avoid excessive run time)

Please note an error would occur if long (>100bp) microsatellite sites are present in the file.

Normal references of miscrosatellite repeat distribution

Normal reference files can be generated by running collect-evidence mode on a panel of normal samples.

Please note:

The collect-evidence mode MUST be run in DRAGEN germline mode.
The --msi-microsatellites-file and --msi-coverage-threshold settings used in collect-evidence mode must be consistent with the settings used during tumor-only MSI calling.
At least 20 normal samples are required.

MSI Output

The output containing MSI score (PecentageUnstableSites) are stored in <output prefix>.microsat_output.json.

The "SumDistance" is the sum of Jensen-Shannon distance of all unstable sites based on distances of T vs N distributions. The "sumDistace" depends on the size of microsatellite file, and is not normalized. In general it is recommended to set MSI thresholds based on "PecentageUnstableSites" rather than "SumDistance".

In TSO500, Solid microsatellite instability is defined as all samples with "PercentageUnstableSites >= 20". The exact thresholds for other assays with different site files and noise characteristics may need some empirical calibration.

MSI Debugging Output

There are two other output files (*_diffs.txt and *.dist) that are useful for debugging.

Here is an example of *_diffs.txt file

The fourth column (Assessed) is the coverage filter. Any site with coverage >= 60 is true for this column

The sixth column (PassFilter) is an internal flag used for left allele filter. It removes low quality sites that has no coverage and helps to increase prediction accuracy. It's true when the following conditions are met.

The *.dist file stores the read counts for each repeat length of the microsatellite site

The coverage of the site can be obtained by summing up all counts in the last column

MSI Verbose Output

1. <output prefix>.microsat_output.json (described above)
1. <output prefix>.microsat_tumor.dist. This file contains the repeat length array for every microsatellite.

Column length_dis is the repeat length array.

1. <output prefix>.microsat_diffs.txt. This file contains the distance metrics for every microsatellite between tumor/normal or tumor/reference normals.

Column Assessed indicates if a site passes the coverage filter (msi-coverage-threshold). Column PassFilter is an internal metric and currently is not used for filtering microsatellites.

MSI Algorithm

The MSI algorithm performs the following steps:

Tabulates tumor and normal counts from the read alignments for each microsatellite site.
Calculates Jensen-Shannon distance of tumor and normal distribution for each microsatellite site (tumor-normal mode), or Jensen-Shannon distance of two normal baseline samples (tumor-only mode).
Determines unstable sites by performing chi-square testing of tumor and normal distribution. Unstable sites have repeat length distributions that are significantly shifted between tumor and normal measured by Jensen-Shannon distance (tumor-normal mode). In tumor-only mode, JSD is calculated for each pair of tumor and normal reference samples, as well as each pair of normal-normal samples. Then the two sets of JSD is compared to derive a mean distance difference and p-value calculated from student t-test. Microsatellite instability is called if the mean distance difference is greater than or equal to the distance threshold (default 0.1) and p-value less than or equal to the p-value threshold (default 0.01).
Produces a report given assessed site count, unstable site count, the percentage of unstable sites in all assessed sites and the sum of the Jensen-Shannon distance of all the unstable sites.

Homologous Recombination Deficiency

DRAGEN Homologous Recombination Deficiency (HRD) Scoring takes in allele-specific copy number calls in either VCF format or directly streamed from somatic copy number callers. DRAGEN HRD then calculates scores for Loss of Heterozygosity (LOH), Telomeric Allelic Imbalance (TAI), and Large-Scale State Transition (LST). The three scores are output to the .hrdscore.csv file. You can only use DRAGEN HRD when inputting results from WGS somatic CNV calling or ASCN WES somatic CNV calling.

Command-Line Options

Use the following command-line options to run HRD scoring. You can run HRD scoring with somatic CNV calling or after using somatic CNV calling results.

To run HRD scoring together with somatic CNV calling, use the following options. For more CNV parameters, please refer to CNV calling.

--enable-hrd=true Set to true to enable HRD scoring to quantify genomic instability.

--enable-cnv=true Set to true to enable CNV calling to run together with HRD scoring.

To run HRD scoring after somatic CNV calling, use the following options:

--enable-hrd=true Set to true to enable HRD scoring to quantify genomic instability.

--hrd-input-ascn Specify the allele-specific copy number file (*cnv.vcf.gz). The CNV VCF file should include REF calls for proper HRD segmentation. See the option --cnv-enable-ref-calls in the CNV section.

--hrd-input-tn Specify the tumor normalized bin count file (*.tn.tsv.gz).

If reference is failed to AutoDetected, then centromere and blacklist files should be specified with following options:

--hrd-input-centromere Centromere locations per chromosome in tsv format

--hrd-input-blacklist Blacklist bed file

HRD Output

The following metrics are included in the .hrdscore.csv output file. The following is an example output file.

Example command line

The following example command runs HRD end to end workflow with CNV. This is an example of Somatic WGS T/N. See the Somatic CNV section for other use cases. HRD is supported for any CNV workflows that support ASCN, and just needs to add --enable-hrd=true on top of the CNV command lines.

The following example command runs HRD standalone.

BRCA Large Genomic Rearrangment

Large genomic rearrangements affecting one or more exons account for approximately 5~10% of all disease-causing mutations in BRCA1 and BRCA2 genes in patients with hereditary breast and ovarian cancer syndrome. DRAGEN LR can detect within gene large genomic rearrangements in tumor-only mode for targeted panels such as TruSight Oncology 500. The performance has been verified for BRCA1/2 with TruSight Oncology 500 Assay.

Command-Line Options

Use the following command-line options to run large rearrangement detection. The same cmd line options can be tested on other tumor-only pipelines.

--tso500-solid-brca-lr=true Set to true enable large rearrangement parameters. This is not limited to TruSight Oncology 500 Assay.

--cnv-normals-list Specify the panel of normal samples to measure instrinsic biases of the upstream processes to allow for proper normalization. To generate a panel of normals, see the example command line. The panel of normal samples should be well matched to the case sample under analysis.

--cnv-target-bed Specify the targeted regions of the panel.

--cnv-within-gene-lr-bed Specify the gene regions in BED format to do large rearrangment calling. Example file:

Example to generate panel of normal

Run the following command on each normal sample to generate .target.counts.gc-corrected.gz file.

Put the path to the generated .target.counts.gc-corrected.gz files into a txt file. One file per line. This will be the file given to --cnv-normals-list.

Example command lines

LR Output

The output file .cnv.LR.json contains the breakpoints detected for each specified gene region. The following is an example output file.

Note that coordinate follows BED format [start,stop) suggesting:

start: segment starting coordinate. (0-base inclusive: first base on the chromosome is numbered 0. start coordinate is included in the interval)
stop: segment stop coordinate. (0-base exclusive: first base on the chromosome is numbered 0. stop coordinate is not included in the interval)

DRAGEN Fragmentomics

Fragmentomics is the study of fragmentation patterns of cell-free DNA or circulating turmor DNA (ctDNA). DNA molecules are released into the plasma from various tissues and cell types. Fragmentation features, such as fragment sizes and end motifs, of the cell-free DNA contains the characteristics of their tissue of origin. Studies have shown that fragmentation features are distinct between cancer and noncancer cells derived ctDNA. The use of genome-wide fragment profile of cell-free DNA has proven to be a powerful tool to infer cancer status and their tissue of origin. The DRAGEN fragmentomics component computes three fragmentomics metrics as following.[1]

Fragment profile
End motif frequency
Window protection score (WPS)

The fragmentomics component works by taking aligned reads from the mapper, calculating per read metrics, and finally tabulating into per-bin or target region metrics. DRAGEN first gets the chromosome sizes from the reference genome. Only autosomes and X, Y chromosomes are considered for fragment profile calculation. The genome is binned with the bin size specified by the user. Each aligned read is processed sequentially. Only reads satisfied with the following criteria are considered: 1) mapped, 2) mate-mapped, 3) not PCR duplicates, 4) primary alignment, 5) mapping quality no less than minimum mapq specified by the user. Reads that have template length within the short fragment size ranges are counted as short fragment. Reads that have template length within the long fragment size range are counted as long fragment. The fragment profile is calculated as the ratio of short-to-long fragment counts for each genomic bin. Genome-wide short fragment counts, long fragment counts, and their ratio are normalized against the GC bias of each genomic bin using the GC correction module from DRAGEN CNV component.

End motif frequency calculation is enabled when --fragmentomics-end-motif-len is set to positive integers. Unmapped, duplicated, or secondary alignments are excluded for end motif frequency calculation. The first x basepair sequences (x is specified by --fragmentomics-end-motif-len) at the 5' end of the reads is tabulated into a frequency dictionary with keys being the sequences and values being the frequencies. If the first x basepair contains any 'N's, this read will be ignored. After all reads are processed, the frequency table is sorted by the sequences in alphabetic order.

Window protection score (WPS) calculation is enabled when a target region is provided with --fragmentomics-wps-target-file. This file must be a BED format text file with three columns. Each row in the file represents a 120-bp region for which WPS will be calculated. An interval tree will be constructed for the target regions. Then each aligned read is processed sequentially, and unmapped, duplicated, or secondary alignments are excluded. Any read with 5' end falling in a target region increments the read count for the region by one. Forward and reverse reads are counted separately. If a read fully spans the region, the fully-span read count for the region increments by one. After all reads are processed, WPS is calculated for each target region. Two ways of WPS calculation are supported, 1) number of fully spanning rads subtracted by the number of reads with 5' ending in the region. 2) percentage of reads ending in the region of all reads mapped to this region.

Supported assays and DRAGEN modes

DRAGEN Fragmentomics currently supports Tumor-only and Normal-only sequencing data from TSO500/WES/WGS ctDNA assays. The results for Tumor-Normal pair data are undefined because ctDNA data are derived from mixture of tumor and normal DNA. Therefore, users should avoid running Fragmentomics in Tumor-Normal mode.

Command-Line Options

Required options:

Enable the Fragmentomics component:

--enable-fragmentomics

Optional options:

    --fragmentomics-bin-size                      Uint. Default 100000
    --fragmentomics-num-threads                   Uint. Default 12
    --fragmentomics-min-mapq                      Uint. Default 30
    --fragmentomics-short-fragment-min-size       Uint. Default 100
    --fragmentomics-short-fragment-max-size       Uint. Default 150
    --fragmentomics-long-fragment-min-size        Uint. Default 151
    --fragmentomics-long-fragment-max-size        Uint. Default 220
    --fragmentomics-num-gc-bins                   Uint. Default 25
    --fragmentomics-gc-enable-smoothing           Bool. Default true
    --fragmentomics-end-motif-len.                Uint. Default 4
    --fragmentomics-wps-target-file               String 
    --fragmentomics-exclude-bed                   String

Target regions for window protection score

The target regions file is used only in window protection score calculation. The target regions file is in BED format with three columns.

chr1    2488101    2488120
chr1    2488120    2488174

Exclude regions for fragment profile

Users can provide a blocklist of regions to remove reads from fragment profile calculation. For example, low mappability regions. This file is in BED format with three columns.

chr1    1       1000
chr2    1000    2000

Example command line options for FASTQ input of WGS ctDNA

dragen \
	--ref-dir=$REF \
        --fastq-file1 $fastq1 \
	--fastq-file2 $fastq2 \
	--RGID "test" \
	--RGSM "test" \
	--enable-map-align true \
	--enable-sort false \
	--generate-ploidy-vcf false \
	--enable-cnv false \
	--enable-fragmentomics true \
	--fragmentomics-exclude-bed hg19_exclude.bed \
	--fragmentomics-bin-size 100000 \
	--fragmentomics-num-threads 12 \
	--fragmentomics-min-mapq 30 \
	--fragmentomics-short-fragment-min-size 100 \
	--fragmentomics-short-fragment-max-size 150 \
	--fragmentomics-long-fragment-min-size 151 \
	--fragmentomics-long-fragment-max-size 220 \
	--fragmentomics-num-gc-bins 25 \
	--fragmentomics-gc-enable-smoothing true \
	--fragmentomics-end-motif-len 4 \
	--fragmentomics-wps-target-file hg19.window.bed \
	--output-directory $outdir \
	--output-file-prefix $outprefix

Fragmentomics Output

The system should output the fragment profile file, and optionally the end motif frequency file or WPS file if either or both are enabled.

The fragment profile file is in the following format:

Chr     Start     End    ShortFragCounts    LongFragCounts    ShortToLongRatio    GCBias    ShortFragCountsCorrected    LongFragCountsCorrected    ShortToLongRatioCorrected
chr1    0         100000    4    7    0.571429    0.424522    3.921048    6.925498    0.572042
chr1    100000    200000    1    2    0.500000    0.436106    0.980262    1.978714    0.500537
chr1    200000    300000    0    2    0.000000    0.391445    0.000000    2.008326    0.000000

The end motif frequency file is in the following format:

Motif   Frequency
AAAA    111559
AAAC    39204
AAAG    56773
AAAT    68437

The WPS file is in the following format:

Chr     Start      End    ForwardCount    ReverseCount    FullySpanCount    WPS    TotalCount    RatioForward    RatioReverse
chr1    2488101    2488120    3    21    314    290    355    0.0084507    0.0591549
chr1    2488101    2488140    3    32    285    250    366    0.00819672    0.0874317
chr1    2488101    2488160    4    41    255    210    376    0.0106383    0.109043

Reference

Y. M. DENNIS LO, DIANA S. C. HAN, PEIYONG JIANG, ROSSA W. K. CHIU. Epigenetics, fragmentomics, and topology of cell-free DNA in liquid biopsies. Science. 2021. DOI: 10.1126/science.aaw3616

DRAGEN Fragmentomics

Fragment profile
End motif frequency
Window protection score (WPS)

Supported assays and DRAGEN modes

Command-Line Options

Required options:

Enable the Fragmentomics component:

--enable-fragmentomics

Optional options:

    --fragmentomics-bin-size                      Uint. Default 100000
    --fragmentomics-num-threads                   Uint. Default 12
    --fragmentomics-min-mapq                      Uint. Default 30
    --fragmentomics-short-fragment-min-size       Uint. Default 100
    --fragmentomics-short-fragment-max-size       Uint. Default 150
    --fragmentomics-long-fragment-min-size        Uint. Default 151
    --fragmentomics-long-fragment-max-size        Uint. Default 220
    --fragmentomics-num-gc-bins                   Uint. Default 25
    --fragmentomics-gc-enable-smoothing           Bool. Default true
    --fragmentomics-end-motif-len.                Uint. Default 4
    --fragmentomics-wps-target-file               String 
    --fragmentomics-exclude-bed                   String

Target regions for window protection score

The target regions file is used only in window protection score calculation. The target regions file is in BED format with three columns.

chr1    2488101    2488120
chr1    2488120    2488174

Exclude regions for fragment profile

Users can provide a blocklist of regions to remove reads from fragment profile calculation. For example, low mappability regions. This file is in BED format with three columns.

chr1    1       1000
chr2    1000    2000

Example command line options for FASTQ input of WGS ctDNA

dragen \
	--ref-dir=$REF \
        --fastq-file1 $fastq1 \
	--fastq-file2 $fastq2 \
	--RGID "test" \
	--RGSM "test" \
	--enable-map-align true \
	--enable-sort false \
	--generate-ploidy-vcf false \
	--enable-cnv false \
	--enable-fragmentomics true \
	--fragmentomics-exclude-bed hg19_exclude.bed \
	--fragmentomics-bin-size 100000 \
	--fragmentomics-num-threads 12 \
	--fragmentomics-min-mapq 30 \
	--fragmentomics-short-fragment-min-size 100 \
	--fragmentomics-short-fragment-max-size 150 \
	--fragmentomics-long-fragment-min-size 151 \
	--fragmentomics-long-fragment-max-size 220 \
	--fragmentomics-num-gc-bins 25 \
	--fragmentomics-gc-enable-smoothing true \
	--fragmentomics-end-motif-len 4 \
	--fragmentomics-wps-target-file hg19.window.bed \
	--output-directory $outdir \
	--output-file-prefix $outprefix

Fragmentomics Output

The system should output the fragment profile file, and optionally the end motif frequency file or WPS file if either or both are enabled.

The fragment profile file is in the following format:

Chr     Start     End    ShortFragCounts    LongFragCounts    ShortToLongRatio    GCBias    ShortFragCountsCorrected    LongFragCountsCorrected    ShortToLongRatioCorrected
chr1    0         100000    4    7    0.571429    0.424522    3.921048    6.925498    0.572042
chr1    100000    200000    1    2    0.500000    0.436106    0.980262    1.978714    0.500537
chr1    200000    300000    0    2    0.000000    0.391445    0.000000    2.008326    0.000000

The end motif frequency file is in the following format:

Motif   Frequency
AAAA    111559
AAAC    39204
AAAG    56773
AAAT    68437

The WPS file is in the following format:

Chr     Start      End    ForwardCount    ReverseCount    FullySpanCount    WPS    TotalCount    RatioForward    RatioReverse
chr1    2488101    2488120    3    21    314    290    355    0.0084507    0.0591549
chr1    2488101    2488140    3    32    285    250    366    0.00819672    0.0874317
chr1    2488101    2488160    4    41    255    210    376    0.0106383    0.109043

Reference

Y. M. DENNIS LO, DIANA S. C. HAN, PEIYONG JIANG, ROSSA W. K. CHIU. Epigenetics, fragmentomics, and topology of cell-free DNA in liquid biopsies. Science. 2021. DOI: 10.1126/science.aaw3616

Tumor Mutational Burden

DRAGEN supports Tumor Mutational Burden (TMB) in Tumor-Only or Tumor-Normal Mode.

TMB Processing Flow

DRAGEN TMB comprise the following steps:

1. Variant calling

Please refer to "Somatic mode" for detailed variant calling options.

2. Eligible region detection

3. Variant filters

The following variants are excluded from the TMB calculation:

Non-PASS variants
Mitochondrial variants
MNVs
Variants that do not meet the minimum depth (DP) threshold. Use the --vc-callability-tumor-thresh command line option to specify the threshold value.
Variants that do not meet the minimum variant allele threshold. Use the --tmb-vaf-threshold command line option to specify the threshold value.
Variants that fall outside the eligible regions.
Tumor driver mutations. Variants with a population allele count ≥ 50 are treated as tumor driver mutations. You can specify the cosmic driver threshold using the tmb-cosmic-count-threshold command line option. The tumor driver mutations filter relies on Nirvana annotations and will additionally require settings for --enable-variant-annotation=true, --variant-annotation-assembly, and --variant-annotation-data.

4. Support for germline variants

Database filter

Variants with a population allele count ≥ 50 that are observed in either the 1000 Genome or gnomAD database will be marked as germline. Use germline-tagging-db-threshold to change the population allele counts. The database germline filter relies on Nirvana annotations and requires settings for --enable-variant-annotation=true, --variant-annotation-assembly, and --variant-annotation-data.

Proxi filter

Proxi filter can be enabled with tmb-enable-proxi-filter. The proxi filter will flag any variants with VAF > 0.9 as germline. The proxi filter scans the variants surround a specific variant and identifies those variants with similar VAFs. The proxi window size that determines the number of surrounding variants can be specified with tmb-proxi-window-size. If 95% (default value for tmb-proxi-fraction-threshold) and no less than 5 (tmb-proxi-count-threshold) of the surrounding variants of similar VAF are germline, then mark the current variant also as germline.
Proxi filter can also be done via a probabilistic approach, which can be enabled with tmb-enable-prob-proxi. It estimates the expected germline allele frequency using the surrounding germline variants and then tests whether the allele frequency of the target variant is similar to the expected germline allele frequency or not. P value threshold can be set by tmb-prob-proxi-p-value (the default value 1e-15 is set for ultra-deep sequenced samples, e.g. cfDNA)
Note that proxi filters can be too aggressive for 100% pure cell lines. Probabilistic proxi filter can be problematic for mixing or contaminated samples, as these samples do not have clear germline variant allele frequency distributions.

CH filter

When processing ctDNA samples it may be beneficial to also remove CH (clonal hematopoiesis) variants. Circulating tumor DNA generally has shorter fragment size. CH variants can be identified based on the insert size of the reads supporting the call. To capture the insert-size distribution for each variant call, it is required to specify vc-log-insert-size during variant calling (step1). Once specified, potential CH variants based on insert size distribution will be labeled in the output. Additional, CH variants can be also labeled via a bed file supplied to tmb-ch-bed. Variants other than germlines overlapping the region will be labeled as CH.

5. Support Nonsynonymous variants.

Nonsynonymous consequences are detected based on the Nirvana annotations. Nirvana variants that are annotated with the following consequences are labaled as nonsynonymous:

feature_elongation, feature_truncation, frameshift_variant, incomplete_terminal_codon_variant, inframe_deletion, inframe_insertion, missense_variant, protein_altering_variant, splice_acceptor_variant, splice_donor_variant, start_lost, stop_gained, stop_lost, transcript_truncation

TMB outputs a tmb.trace.csv file with detailed information on each variant used the TMB score. The trace file contains a column "Nonsynonymous" that indicates the appropriate status for each variant.

The subset of filtered variants that are nonsynonymous are used as numerator in the "Filtered Nonsyn Variant Count" metric.

6. TMB calculation

TMB = Filtered Variants / Eligible Region (Mbp)
Nonsynonymous TMB = Filtered Nonsynonymous Variants / Eligible Region (Mbp)

7. Maximum somatic allele frequency (MSAF)

Command-Line Options

[Required]

--enable-tmb true Enables TMB. If set, the small variant caller, Illumina Annotation Engine, and the related callability report are enabled.
--enable-variant-annotation=true, --variant-annotation-assembly, and --variant-annotation-data enables Nirvana, the Illumina Annotation Engine. For more information on selecting the correct assembly and downloading reference files, see Illumina Annotation Engine.

[Recommended]

Setting

Description

Tumor-Normal Panel/WES/WGS

Tumor-Only Panel/WES/WGS

High coverage bTMB

[QC coverage settings to override the default eligible region]

--qc-coverage-region-1 Specify the coding regions bed file to use.
--qc-coverage-tag-1=tmb Required to associate these coverage settings with TMB. If this setting is not specified then DRAGEN will revert to default coding regions.
--vc-callability-tumor-thresh Specify the somatic_callable bed minimum threshold, this will limit the regions over which TMB will be computed (default is 50).
--qc-coverage-reports-1=callability. The callability report is required whenever it is desired to override the default TMB coverage settings.

[Optional settings]

--tmb-vaf-threshold Specify the minimum VAF threshold for a variant. Variants that do not meet the threshold are filtered out (default=0.05)
--tmb-cosmic-count-threshold The minimum number of observations in cosmic for variant to be considered a driver mutation. Driver mutations are not counted in TMB. This setting has very little impact on WES/WGS, but can help avoid bias in small panels (default=50)
--tmb-skip-db-filter Skip database germline filtering. The database germline filter is required for tumor-only samples, but can be skipped for tumor-normal (default=false)
--germline-tagging-db-threshold Specify the minimum allele count (total number of observations) for an allele in gnomAD or 1000 Genome to be considered a germline variant. Variant calls that have the same positions and allele are ignored from the TMB calculation (default=50)
--tmb-germline-max-cosmic-count Restrict the db-filter. Variants with cosmic allele count higher than this threshold will never be marked as germline. Set to 0 to disable. (default=0, range=[0;1000]).
--tmb-germline-min-vaf Restrict the db-filter. Variants with a variant allele frequency lower than this threshold will never be marked as germline. Set to 0 to disable. (default=0, range=[0;1])
--tmb-enable-proxi-filter Enable proxi filter functionality in germline filtering. This is an optional feature that may be appropriate for T/O runs. In T/O mode the DB germline filter may not able to detect all germline variants, especially for ethnicity groups that are not well represented in germline databases. The proxi filter uses allele frequency information to help remove germline variants missed by the DB, and can help to obtain more accurate (lower) TMB values on samples with low tumor purity. In samples with high tumor purity this filter may be too aggressive and mark some somatic variants as germline resulting in too low TMB scores. (default=false)
--tmb-proxi-count-threshold Proxi filter surrounding variant count threshold in germline filtering (default=5)
--tmb-proxi-fraction-threshold Proxi filter surrounding variant db filter fraction threshold in germline filtering (default=0.95)
--tmb-proxi-window-size, Number of surrounding variants before and after the target variant for proxi filter (default=500)
--tmb-ch-bed Variants in the region will be labeled as clonal hematopoiesis (CH) variants.
--tmb-ch-insert-p-value Minimum P value to classify a variant as CH using insert size (default=0.1)
--tmb-ch-insert-min-len Minimum fragment size to test for CH using insert size (default=100)
--tmb-ch-insert-max-len Maximum fragment size to test for CH using insert size (default=200)
--tmb-ch-insert-min-num Minimum number of fragment size record to test for CH using insert size (default=50)
--tmb-enable-msaf Enable MSAF output (default=false)
--tmb-msaf-p-value Maximum P value (from insert size) to call confident somatic variant (default=1e-5)
--tmb-msaf-rank-num If no confident somatic variant found, it will use the specified ranked variant (default=4)

Output

The TMB values are output to <output prefix>.tmb.metrics.csv. The file format uses the following CSV column convention, similar to other metric CSV files.

Metric

Definition

The TMB module also outputs a tmb.trace.csv file that provides detailed information on each variant that was included in the TMB calculation.

When enabling MSAF, the information is output to <output prefix>.tmb.msaf.csv.

Microsatellite Instability

Command-Line Options

dragen \
--msi-command tumor-only \
--msi-coverage-threshold 60 \ 
--msi-microsatellites-file ${microsatellite_file} \  # See section: Default Microsatellite sites files
--msi-ref-normal-dir ${normal_reference_directory} \ # See section: Normal references of miscrosatellite repeat distribution 
--output-directory ${output_directory} \
--output-file-prefix ${prefix} \
--enable-map-align=true \
--RGID=read_group_ID \ 
--RGSM=read_group_sample \
--ref-dir ${reference_directory} \
--enable-map-align-output=true \
--enable-sort true \
--enable-duplicate-marking=true \
--tumor-fastq1 ${tumor_fq1} \
--tumor-fastq2 ${tumor_fq2}

Option

Description

Assay Specific Settings

Sample Type

Assay

Microsatelitte file

Specific Settings

PercentageUnstableSites Threshold

Default Microsatellite sites files

The following is an example of a microsatellite file:

#chromosome     location        repeat_unit_length      repeat_unit_binary      repeat_times    left_flank_binary       right_flank_binary      repeat_unit_bases       left_flank_bases    right_flank_bases
chr1	985443	1	2	15	676	992	G	GGGCA	TTGAA
chr1	7980985	1	0	10	231	1020	A	ATGCT	TTTTA
chr1	8022800	1	3	19	13	41	T	AAATC	AAGGC
chr1	8029500	1	2	10	39	0	G	AAGCT	AAAAA
chr1	9146447	1	3	15	887	248	T	TCTCT	ATTGA
chr1	9767837	1	3	12	704	195	T	GTAAA	ATAAT

Default WES and WGS Microsatellite site files can be downloaded here:

For panels it is recommended to post-process the file by intersecting the WES or WGS sites with the panel of interest. This will avoid using any off-target reads in the MSI analysis.

Custom Microsatellite files

Custom Microsatellite site files may be required if a small panel is targeted and the default site files do not have sufficient overlapping sites.

Custom Microsatellite site files can be generated by using msi-sensor [https://github.com/xjtu-omics/msisensor-pro/wiki/Best-Practices].

msisensor-pro scan -d /path/to/reference.fa -o ${microsatellite_file}

A subsequent post-processing step is recommended:

only keep microsatellites sites with a repeat unit of length 1
keep sites with 10 - 50bp repeats (a max length of 100bp repeats is supported)
remove any sites containing Ns in the left or right anchors
downsample the remaining sites to contain at least 2000 sites, but no more than 1 million sites (to avoid excessive run time)

Please note an error would occur if long (>100bp) microsatellite sites are present in the file.

Normal references of miscrosatellite repeat distribution

Normal reference files can be generated by running collect-evidence mode on a panel of normal samples.

dragen -f \
--msi-command collect-evidence \
--ref-dir ${reference_directory} \
--msi-microsatellites-file ${microsatellite_file} \
--msi-coverage-threshold 60 \
--output-directory ${output_directory} \
--output-file-prefix ${prefix} \
-1 ${normal_fq1} \
-2 ${normal_fq2}

Please note:

The collect-evidence mode MUST be run in DRAGEN germline mode.
The --msi-microsatellites-file and --msi-coverage-threshold settings used in collect-evidence mode must be consistent with the settings used during tumor-only MSI calling.
At least 20 normal samples are required.

MSI Output

The output containing MSI score (PecentageUnstableSites) are stored in <output prefix>.microsat_output.json.

"TotalMicrosatelliteSitesAssessed": "20020",
"TotalMicrosatelliteSitesUnstable": "4374",
"PecentageUnstableSites": "21.850000000000001",
"ResultIsValid": "true",
"ResultMessage": "",
"SumDistance": "1214.174"

MSI Debugging Output

There are two other output files (*_diffs.txt and *.dist) that are useful for debugging.

Here is an example of *_diffs.txt file

#Chromosome	Start	RepeatUnit	Assessed	Distance	PValue	PassFilter
chr1	69106	T	true	0.04105300052	0.4786448589	true
chr1	69116	TC	false	0	0	false

The fourth column (Assessed) is the coverage filter. Any site with coverage >= 60 is true for this column

1. read count for [reference repeat length] > 0
2. read count for [reference repeat length - 1] >= 0
3. the ratio of read count for [reference repeat length - 1] and [reference repeat length] >= 0

The *.dist file stores the read counts for each repeat length of the microsatellite site

#chromosome	location	repeat_unit_bases	reference_allele	covered	length_distribution
chr1	69106	T	5	true	0,0,0,0,103,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

The coverage of the site can be obtained by summing up all counts in the last column

MSI Verbose Output

1. <output prefix>.microsat_output.json (described above)
1. <output prefix>.microsat_tumor.dist. This file contains the repeat length array for every microsatellite.

#chromosome     location        repeat_unit_bases       reference_allele        covered length_dis
chr1    16200729        T       10      true    0,0,0,0,0,0,0,0,0,118,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
chr1    40361307        A       10      true    0,0,0,0,0,0,2,0,2,95,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Column length_dis is the repeat length array.

1. <output prefix>.microsat_diffs.txt. This file contains the distance metrics for every microsatellite between tumor/normal or tumor/reference normals.

#Chromosome     Start   RepeatUnit      Assessed        Distance        PValue  PassFilter
chr1    16200729        T       true    0.03348841224   0.002411104562  true
chr1    40361307        A       true    0.0406985608    0.0006306633961 true
chr1    156842471       T       false   0       0       true
chr1    239881908       T       true    0.003136536956  0.5983661726    true

Column Assessed indicates if a site passes the coverage filter (msi-coverage-threshold). Column PassFilter is an internal metric and currently is not used for filtering microsatellites.

MSI Algorithm

The MSI algorithm performs the following steps:

Tabulates tumor and normal counts from the read alignments for each microsatellite site.
Calculates Jensen-Shannon distance of tumor and normal distribution for each microsatellite site (tumor-normal mode), or Jensen-Shannon distance of two normal baseline samples (tumor-only mode).
Determines unstable sites by performing chi-square testing of tumor and normal distribution. Unstable sites have repeat length distributions that are significantly shifted between tumor and normal measured by Jensen-Shannon distance (tumor-normal mode). In tumor-only mode, JSD is calculated for each pair of tumor and normal reference samples, as well as each pair of normal-normal samples. Then the two sets of JSD is compared to derive a mean distance difference and p-value calculated from student t-test. Microsatellite instability is called if the mean distance difference is greater than or equal to the distance threshold (default 0.1) and p-value less than or equal to the p-value threshold (default 0.01).
Produces a report given assessed site count, unstable site count, the percentage of unstable sites in all assessed sites and the sum of the Jensen-Shannon distance of all the unstable sites.