Fragmentomics is the study of fragmentation patterns of cell-free DNA or circulating turmor DNA (ctDNA). DNA molecules are released into the plasma from various tissues and cell types. Fragmentation features, such as fragment sizes and end motifs, of the cell-free DNA contains the characteristics of their tissue of origin. Studies have shown that fragmentation features are distinct between cancer and noncancer cells derived ctDNA. The use of genome-wide fragment profile of cell-free DNA has proven to be a powerful tool to infer cancer status and their tissue of origin. The DRAGEN fragmentomics component computes three fragmentomics metrics as following.[1]
Fragment profile
End motif frequency
Window protection score (WPS)
The fragmentomics component works by taking aligned reads from the mapper, calculating per read metrics, and finally tabulating into per-bin or target region metrics. DRAGEN first gets the chromosome sizes from the reference genome. Only autosomes and X, Y chromosomes are considered for fragment profile calculation. The genome is binned with the bin size specified by the user. Each aligned read is processed sequentially. Only reads satisfied with the following criteria are considered: 1) mapped, 2) mate-mapped, 3) not PCR duplicates, 4) primary alignment, 5) mapping quality no less than minimum mapq specified by the user. Reads that have template length within the short fragment size ranges are counted as short fragment. Reads that have template length within the long fragment size range are counted as long fragment. The fragment profile is calculated as the ratio of short-to-long fragment counts for each genomic bin. Genome-wide short fragment counts, long fragment counts, and their ratio are normalized against the GC bias of each genomic bin using the GC correction module from DRAGEN CNV component.
End motif frequency calculation is enabled when --fragmentomics-end-motif-len
is set to positive integers. Unmapped, duplicated, or secondary alignments are excluded for end motif frequency calculation. The first x basepair sequences (x is specified by --fragmentomics-end-motif-len
) at the 5' end of the reads is tabulated into a frequency dictionary with keys being the sequences and values being the frequencies. If the first x basepair contains any 'N's, this read will be ignored. After all reads are processed, the frequency table is sorted by the sequences in alphabetic order.
Window protection score (WPS) calculation is enabled when a target region is provided with --fragmentomics-wps-target-file
. This file must be a BED format text file with three columns. Each row in the file represents a 120-bp region for which WPS will be calculated. An interval tree will be constructed for the target regions. Then each aligned read is processed sequentially, and unmapped, duplicated, or secondary alignments are excluded. Any read with 5' end falling in a target region increments the read count for the region by one. Forward and reverse reads are counted separately. If a read fully spans the region, the fully-span read count for the region increments by one. After all reads are processed, WPS is calculated for each target region. Two ways of WPS calculation are supported, 1) number of fully spanning rads subtracted by the number of reads with 5' ending in the region. 2) percentage of reads ending in the region of all reads mapped to this region.
DRAGEN Fragmentomics currently supports Tumor-only
and Normal-only
sequencing data from TSO500/WES/WGS ctDNA assays. The results for Tumor-Normal
pair data are undefined because ctDNA data are derived from mixture of tumor and normal DNA. Therefore, users should avoid running Fragmentomics in Tumor-Normal
mode.
Enable the Fragmentomics component:
The target regions file is used only in window protection score calculation. The target regions file is in BED format with three columns.
Users can provide a blocklist of regions to remove reads from fragment profile calculation. For example, low mappability regions. This file is in BED format with three columns.
The system should output the fragment profile file, and optionally the end motif frequency file or WPS file if either or both are enabled.
The fragment profile file is in the following format:
The end motif frequency file is in the following format:
The WPS file is in the following format:
Y. M. DENNIS LO, DIANA S. C. HAN, PEIYONG JIANG, ROSSA W. K. CHIU. Epigenetics, fragmentomics, and topology of cell-free DNA in liquid biopsies. Science. 2021. DOI: 10.1126/science.aaw3616
DRAGEN supports Tumor Mutational Burden (TMB) in Tumor-Only or Tumor-Normal Mode.
It is important to note that in T/O mode germline variants must be identified and filtered using database information and optionally also allele frequency information. These germline filtering techniques are generally not as accurate as tumor normal subtraction. When using databases only to subtract germline variants, the TMB may be slightly higher than the more accurate T/N estimate. When using database and allele frequency information to remove germline variants, the TMB may be slightly underestimated for high purity tumor samples.
DRAGEN TMB comprise the following steps:
Please refer to "Somatic mode" for detailed variant calling options.
TMB is computed over protein coding regions with sufficient coverage. If DRAGEN detects a reference hg19/38, GRCh37/38 or hs37d5 it will automatically select the appropriate coding region based on the bed files available in "<INSTALL_PATH>/resources/tmb/". By default the coverage threshold for eligible regions is 50.
The protein coding region bed file and the coverage settings can explicitly be specified using the qc-coverage
options listed below in [QC coverage settings to override the default eligible region]. If DRAGEN does not automatically detect the reference it is required to specify these settings.
The following variants are excluded from the TMB calculation:
Non-PASS variants
Mitochondrial variants
MNVs
Variants that do not meet the minimum depth (DP) threshold. Use the --vc-callability-tumor-thresh
command line option to specify the threshold value.
Variants that do not meet the minimum variant allele threshold. Use the --tmb-vaf-threshold
command line option to specify the threshold value.
Variants that fall outside the eligible regions.
Tumor driver mutations. Variants with a population allele count ≥ 50 are treated as tumor driver mutations. You can specify the cosmic driver threshold using the tmb-cosmic-count-threshold
command line option. The tumor driver mutations filter relies on Nirvana annotations and will additionally require settings for --enable-variant-annotation=true
, --variant-annotation-assembly
, and --variant-annotation-data
.
By default, germline variants are not counted towards TMB. Variants are determined as germline based on a database or a proxi filter. The database germline filter can be disabled with tmb-skip-db-filter
. Disabling the database germline filter will effectively also disable the germline proxi filter.
Database filter
Variants with a population allele count ≥ 50 that are observed in either the 1000 Genome or gnomAD database will be marked as germline. Use germline-tagging-db-threshold
to change the population allele counts. The database germline filter relies on Nirvana annotations and requires settings for --enable-variant-annotation=true
, --variant-annotation-assembly
, and --variant-annotation-data
.
Proxi filter
Proxi filter can be enabled with tmb-enable-proxi-filter
. The proxi filter will flag any variants with VAF > 0.9 as germline. The proxi filter scans the variants surround a specific variant and identifies those variants with similar VAFs. The proxi window size that determines the number of surrounding variants can be specified with tmb-proxi-window-size
. If 95% (default value for tmb-proxi-fraction-threshold
) and no less than 5 (tmb-proxi-count-threshold
) of the surrounding variants of similar VAF are germline, then mark the current variant also as germline.
Proxi filter can also be done via a probabilistic approach, which can be enabled with tmb-enable-prob-proxi
. It estimates the expected germline allele frequency using the surrounding germline variants and then tests whether the allele frequency of the target variant is similar to the expected germline allele frequency or not. P value threshold can be set by tmb-prob-proxi-p-value
(the default value 1e-15 is set for ultra-deep sequenced samples, e.g. cfDNA)
Note that proxi filters can be too aggressive for 100% pure cell lines. Probabilistic proxi filter can be problematic for mixing or contaminated samples, as these samples do not have clear germline variant allele frequency distributions.
CH filter
When processing ctDNA samples it may be beneficial to also remove CH (clonal hematopoiesis) variants. Circulating tumor DNA generally has shorter fragment size. CH variants can be identified based on the insert size of the reads supporting the call. To capture the insert-size distribution for each variant call, it is required to specify vc-log-insert-size
during variant calling (step1). Once specified, potential CH variants based on insert size distribution will be labeled in the output. Additional, CH variants can be also labeled via a bed file supplied to tmb-ch-bed
. Variants other than germlines overlapping the region will be labeled as CH.
Nonsynonymous consequences are detected based on the Nirvana annotations. Nirvana variants that are annotated with the following consequences are labaled as nonsynonymous:
feature_elongation, feature_truncation, frameshift_variant, incomplete_terminal_codon_variant, inframe_deletion, inframe_insertion, missense_variant, protein_altering_variant, splice_acceptor_variant, splice_donor_variant, start_lost, stop_gained, stop_lost, transcript_truncation
TMB outputs a tmb.trace.csv file with detailed information on each variant used the TMB score. The trace file contains a column "Nonsynonymous" that indicates the appropriate status for each variant.
The subset of filtered variants that are nonsynonymous are used as numerator in the "Filtered Nonsyn Variant Count" metric.
TMB = Filtered Variants / Eligible Region (Mbp)
Nonsynonymous TMB = Filtered Nonsynonymous Variants / Eligible Region (Mbp)
The maximum somatic allele frequency (MSAF) outputs the estimated maximum somatic allele frequency of the sample. This is done via finding the confident somatic variants with highest allele frequency. MSAF is a rough approximate to the tumor fraction of cfDNA in peripheral blood samples. The MSAF mode can be enabled with tmb-enable-msaf
.
[Required]
--enable-tmb true
Enables TMB. If set, the small variant caller, Illumina Annotation Engine, and the related callability report are enabled.
--enable-variant-annotation=true
, --variant-annotation-assembly
, and --variant-annotation-data
enables Nirvana, the Illumina Annotation Engine. For more information on selecting the correct assembly and downloading reference files, see Illumina Annotation Engine.
[Recommended]
[QC coverage settings to override the default eligible region]
The protein coding region and the coverage settings can explicitly be specified using the qc-coverage
options listed below. All four settings must be specified to override the defaults. If DRAGEN does not automatically detect the reference it is required to specify these settings.
--qc-coverage-region-1
Specify the coding regions bed file to use.
--qc-coverage-tag-1=tmb
Required to associate these coverage settings with TMB. If this setting is not specified then DRAGEN will revert to default coding regions.
--vc-callability-tumor-thresh
Specify the somatic_callable bed minimum threshold, this will limit the regions over which TMB will be computed (default is 50).
--qc-coverage-reports-1=callability
. The callability report is required whenever it is desired to override the default TMB coverage settings.
[Optional settings]
--tmb-vaf-threshold
Specify the minimum VAF threshold for a variant. Variants that do not meet the threshold are filtered out (default=0.05)
--tmb-cosmic-count-threshold
The minimum number of observations in cosmic for variant to be considered a driver mutation. Driver mutations are not counted in TMB. This setting has very little impact on WES/WGS, but can help avoid bias in small panels (default=50)
--tmb-skip-db-filter
Skip database germline filtering. The database germline filter is required for tumor-only samples, but can be skipped for tumor-normal (default=false)
--germline-tagging-db-threshold
Specify the minimum allele count (total number of observations) for an allele in gnomAD or 1000 Genome to be considered a germline variant. Variant calls that have the same positions and allele are ignored from the TMB calculation (default=50)
--tmb-germline-max-cosmic-count
Restrict the db-filter. Variants with cosmic allele count higher than this threshold will never be marked as germline. Set to 0 to disable. (default=0, range=[0;1000]).
--tmb-germline-min-vaf
Restrict the db-filter. Variants with a variant allele frequency lower than this threshold will never be marked as germline. Set to 0 to disable. (default=0, range=[0;1])
--tmb-enable-proxi-filter
Enable proxi filter functionality in germline filtering. This is an optional feature that may be appropriate for T/O runs. In T/O mode the DB germline filter may not able to detect all germline variants, especially for ethnicity groups that are not well represented in germline databases. The proxi filter uses allele frequency information to help remove germline variants missed by the DB, and can help to obtain more accurate (lower) TMB values on samples with low tumor purity. In samples with high tumor purity this filter may be too aggressive and mark some somatic variants as germline resulting in too low TMB scores. (default=false)
--tmb-proxi-count-threshold
Proxi filter surrounding variant count threshold in germline filtering (default=5)
--tmb-proxi-fraction-threshold
Proxi filter surrounding variant db filter fraction threshold in germline filtering (default=0.95)
--tmb-proxi-window-size
, Number of surrounding variants before and after the target variant for proxi filter (default=500)
--tmb-ch-bed
Variants in the region will be labeled as clonal hematopoiesis (CH) variants.
--tmb-ch-insert-p-value
Minimum P value to classify a variant as CH using insert size (default=0.1)
--tmb-ch-insert-min-len
Minimum fragment size to test for CH using insert size (default=100)
--tmb-ch-insert-max-len
Maximum fragment size to test for CH using insert size (default=200)
--tmb-ch-insert-min-num
Minimum number of fragment size record to test for CH using insert size (default=50)
--tmb-enable-msaf
Enable MSAF output (default=false)
--tmb-msaf-p-value
Maximum P value (from insert size) to call confident somatic variant (default=1e-5)
--tmb-msaf-rank-num
If no confident somatic variant found, it will use the specified ranked variant (default=4)
The TMB values are output to <output prefix>.tmb.metrics.csv
. The file format uses the following CSV column convention, similar to other metric CSV files.
The TMB module also outputs a tmb.trace.csv file that provides detailed information on each variant that was included in the TMB calculation.
When enabling MSAF, the information is output to <output prefix>.tmb.msaf.csv
.
Large genomic rearrangements affecting one or more exons account for approximately 5~10% of all disease-causing mutations in BRCA1 and BRCA2 genes in patients with hereditary breast and ovarian cancer syndrome. DRAGEN LR can detect within gene large genomic rearrangements in tumor-only mode for targeted panels such as TruSight Oncology 500. The performance has been verified for BRCA1/2 with TruSight Oncology 500 Assay.
Use the following command-line options to run large rearrangement detection. The same cmd line options can be tested on other tumor-only pipelines.
--tso500-solid-brca-lr=true
Set to true
enable large rearrangement parameters. This is not limited to TruSight Oncology 500 Assay.
--cnv-normals-list
Specify the panel of normal samples to measure instrinsic biases of the upstream processes to allow for proper normalization. To generate a panel of normals, see the example command line. The panel of normal samples should be well matched to the case sample under analysis.
--cnv-target-bed
Specify the targeted regions of the panel.
--cnv-within-gene-lr-bed
Specify the gene regions in BED format to do large rearrangment calling. Example file:
Run the following command on each normal sample to generate .target.counts.gc-corrected.gz
file.
Put the path to the generated .target.counts.gc-corrected.gz
files into a txt file. One file per line. This will be the file given to --cnv-normals-list
.
The output file .cnv.LR.json
contains the breakpoints detected for each specified gene region. The following is an example output file.
Note that coordinate follows BED format [start,stop) suggesting:
start: segment starting coordinate. (0-base inclusive: first base on the chromosome is numbered 0. start coordinate is included in the interval)
stop: segment stop coordinate. (0-base exclusive: first base on the chromosome is numbered 0. stop coordinate is not included in the interval)
DRAGEN Homologous Recombination Deficiency (HRD) Scoring takes in allele-specific copy number calls in either VCF format or directly streamed from somatic copy number callers. DRAGEN HRD then calculates scores for Loss of Heterozygosity (LOH), Telomeric Allelic Imbalance (TAI), and Large-Scale State Transition (LST). The three scores are output to the .hrdscore.csv
file. You can only use DRAGEN HRD when inputting results from WGS somatic CNV calling or ASCN WES somatic CNV calling.
Use the following command-line options to run HRD scoring. You can run HRD scoring with somatic CNV calling or after using somatic CNV calling results.
To run HRD scoring together with somatic CNV calling, use the following options. For more CNV parameters, please refer to CNV calling.
--enable-hrd=true
Set to true to enable HRD scoring to quantify genomic instability.
--enable-cnv=true
Set to true to enable CNV calling to run together with HRD scoring.
To run HRD scoring after somatic CNV calling, use the following options:
--enable-hrd=true
Set to true to enable HRD scoring to quantify genomic instability.
--hrd-input-ascn
Specify the allele-specific copy number file (*cnv.vcf.gz
). The CNV VCF file should include REF
calls for proper HRD segmentation. See the option --cnv-enable-ref-calls
in the CNV section.
--hrd-input-tn
Specify the tumor normalized bin count file (*.tn.tsv.gz
).
If reference is failed to AutoDetected
, then centromere and blacklist files should be specified with following options:
--hrd-input-centromere
Centromere locations per chromosome in tsv format
--hrd-input-blacklist
Blacklist bed file
The following metrics are included in the .hrdscore.csv
output file. The following is an example output file.
The following example command runs HRD end to end workflow with CNV. This is an example of Somatic WGS T/N. See the Somatic CNV section for other use cases. HRD is supported for any CNV workflows that support ASCN, and just needs to add --enable-hrd=true
on top of the CNV command lines.
The following example command runs HRD standalone.
Microsatellites are genomic regions of short DNA motifs that are repeated 5–50 times and are associated with high mutation rates. Microsatellite Instability (MSI) results from deficiencies in the DNA mismatch repair pathway and can be used as a critical biomarker to predict immunotherapy responses in multiple tumor types.
DRAGEN MSI supports running in tumor-normal and tumor-only modes. Tumor-normal is generally expected to generate more accurate results. The tumor-only mode will require a panel of normals. The panel of normals will be generated using the collect-evidence
mode.
The following is an example command for tumor-normal
mode. Default resource files are available for WES and WGS. Please note that the WES and WGS tumor-normal
modes are fully supported and tested. Custom panels may require more extensive validation and possibly require generating a new sites file.
The following is an example command for the tumor-only
mode. Please note that the WES and WGS tumor-only
modes are not as extensively tested as the tumor-normal
modes. The TSO500 panels do not have normal controls, and are only tested and validated in tumor-only
mode.
TSO500 Solid microsatellite instability is defined as all samples with "PercentageUnstableSites >= 20". It is generally recommended to use "PercentageUnstableSites" as metric for determining the MSI status. This metric is normalized, and is expected to be more consistent for different pipelines and with different input site files. The exact thresholds for other assays may still depend on the sample noise characteristics (PCR / UMI etc) and may need some empirical calibration.
The following is an example of a microsatellite file:
For panels it is recommended to post-process the file by intersecting the WES or WGS sites with the panel of interest. This will avoid using any off-target reads in the MSI analysis.
Custom Microsatellite site files may be required if a small panel is targeted and the default site files do not have sufficient overlapping sites.
Custom Microsatellite site files can be generated by using msi-sensor [https://github.com/xjtu-omics/msisensor-pro/wiki/Best-Practices].
A subsequent post-processing step is recommended:
only keep microsatellites sites with a repeat unit of length 1
keep sites with 10 - 50bp repeats (a max length of 100bp repeats is supported)
remove any sites containing Ns in the left or right anchors
downsample the remaining sites to contain at least 2000 sites, but no more than 1 million sites (to avoid excessive run time)
Please note an error would occur if long (>100bp) microsatellite sites are present in the file.
Normal reference files can be generated by running collect-evidence
mode on a panel of normal samples.
Please note:
The collect-evidence
mode MUST be run in DRAGEN germline mode.
The --msi-microsatellites-file
and --msi-coverage-threshold
settings used in collect-evidence
mode must be consistent with the settings used during tumor-only MSI calling.
At least 20 normal samples are required.
The output containing MSI score (PecentageUnstableSites
) are stored in <output prefix>.microsat_output.json
.
The "SumDistance" is the sum of Jensen-Shannon distance of all unstable sites based on distances of T vs N distributions. The "sumDistace" depends on the size of microsatellite file, and is not normalized. In general it is recommended to set MSI thresholds based on "PecentageUnstableSites" rather than "SumDistance".
In TSO500, Solid microsatellite instability is defined as all samples with "PercentageUnstableSites >= 20". The exact thresholds for other assays with different site files and noise characteristics may need some empirical calibration.
There are two other output files (*_diffs.txt
and *.dist
) that are useful for debugging.
Here is an example of *_diffs.txt
file
The fourth column (Assessed) is the coverage filter. Any site with coverage >= 60 is true for this column
The sixth column (PassFilter) is an internal flag used for left allele filter. It removes low quality sites that has no coverage and helps to increase prediction accuracy. It's true when the following conditions are met.
The *.dist
file stores the read counts for each repeat length of the microsatellite site
The coverage of the site can be obtained by summing up all counts in the last column
<output prefix>.microsat_output.json
(described above)
<output prefix>.microsat_tumor.dist
. This file contains the repeat length array for every microsatellite.
Column length_dis
is the repeat length array.
<output prefix>.microsat_diffs.txt
. This file contains the distance metrics for every microsatellite between tumor/normal or tumor/reference normals.
Column Assessed
indicates if a site passes the coverage filter (msi-coverage-threshold
). Column PassFilter
is an internal metric and currently is not used for filtering microsatellites.
The MSI algorithm performs the following steps:
Tabulates tumor and normal counts from the read alignments for each microsatellite site.
Calculates Jensen-Shannon distance of tumor and normal distribution for each microsatellite site (tumor-normal
mode), or Jensen-Shannon distance of two normal baseline samples (tumor-only
mode).
Determines unstable sites by performing chi-square testing of tumor and normal distribution. Unstable sites have repeat length distributions that are significantly shifted between tumor and normal measured by Jensen-Shannon distance (tumor-normal
mode). In tumor-only
mode, JSD is calculated for each pair of tumor and normal reference samples, as well as each pair of normal-normal samples. Then the two sets of JSD is compared to derive a mean distance difference and p-value calculated from student t-test. Microsatellite instability is called if the mean distance difference is greater than or equal to the distance threshold (default 0.1) and p-value less than or equal to the p-value threshold (default 0.01).
Produces a report given assessed site count, unstable site count, the percentage of unstable sites in all assessed sites and the sum of the Jensen-Shannon distance of all the unstable sites.
Setting | Description | Tumor-Normal Panel/WES/WGS | Tumor-Only Panel/WES/WGS | High coverage bTMB |
---|---|---|---|---|
Metric | Definition |
---|---|
Sample | LOH_Score | TAI_Score | LST_Score | HRD_Score |
---|
Option | Description |
---|
Sample Type | Assay | Microsatelitte file | Specific Settings | PercentageUnstableSites Threshold |
---|
Default WES and WGS Microsatellite site files can be downloaded here:
--vc-callability-tumor-thresh
The minimum coverage for usable coding regions
50 (default)
50 (default)
1000 (not default)
--tmb-vaf-threshold
Variant mininum allele frequency for usable variants
0.05 (default)
0.05 (default)
0.002 (not default)
--tmb-cosmic-count-threshold
Number of observations in cosmic for variant to be considered a driver mutation.
50 (default)
50 (default)
50 (default)
--tmb-skip-db-filter
Do not use Nirvana database to filter germline variants
TRUE (default:T/N)
FALSE (default:T/O)
FALSE (not default)
--tmb-enable-proxi-filter
Use allele frequency information to filter germline variants
OPTIONAL (default is FALSE)
FALSE (not default)
TRUE (not default)
Eligible Region (Mbp)
The specified custom regions in (Mbp) that meet the minimum coverage threshold.
Filtered Variant Count
Remaining variants after variant and germline filters.
Filtered Nonsyn Variant Count
Subset of filtered variants that are nonsynonymous.
TMB
Filtered variants normalized by the eligible regions (Mbp).
Nonsyn TMB
Filtered nonsynonymous variants normalized by the eligible regions (Mbp).
Sample | 16 | 17 | 28 | 61 |
| Mode of execution: tumor-only, tumor-normal, or collect-evidence. |
| Specify the file containing the microsatellites. You can generate this file by scanning the genome for microsatellites using an MSI-sensor. DRAGEN has tested with ≥ 10 bp homopolymers for solid samples, and 6-7 bp homopolymers for liquid samples. |
| Full name of directory containing files with normal reference repeat length distribution. Used only in |
| Specify the minimum spanning read coverage for a microsatellite. Microsatellites that do not meet the specified threshold are not included in analysis. DRAGEN recommends using 60 as the value for solid samples. For TSO500 liquid, a value of 500 is recommended. |
| Threshold for distance distributions to be considered different. Default is 0.1. For liquid samples, a value of 0.02 is recommended. |
Solid | TSO500 | Part of TSO500 resource bundle. Repeats 10 - 50. 130 sites. | msi-distance-threshold=0.1 | 20 |
Heme | TSO500 | N/A | N/A | N/A |
Liquid (cfDNA) | TSO500 | Part of TSO500 resource bundle. Repeats 6,7. 2344 sites. | msi-distance-threshold=0.02 | TBD |
Solid, Heme | WES | Available for download. Repeats 10 - 50. Approx. 3.5K sites. | msi-distance-threshold=0.1 | TBD |
Liquid (cfDNA) | WES | Available for download. Repeats 10 - 50. Approx. 3.5K sites. | msi-distance-threshold=0.02 | TBD |
Solid, Heme | WGS | Available for download. Repeats 10 - 50.Approx. 1 mil sites. | msi-distance-threshold=0.1 | TBD |
Liquid (cfDNA) | WGS | Available for download. Repeats 10 - 50. Approx. 1 mil sites. | msi-distance-threshold=0.02 | TBD |