Microsatellite Instability
Microsatellites are genomic regions of short DNA motifs that are repeated 5–50 times and are associated with high mutation rates. Microsatellite Instability (MSI) results from deficiencies in the DNA mismatch repair pathway and can be used as a critical biomarker to predict immunotherapy responses in multiple tumor types.
DRAGEN MSI supports running in tumor-normal and tumor-only modes. Tumor-normal mode is generally expected to generate more accurate results. Tumor-only mode requires a panel of normals, which is generated using the collect-evidence mode.
Command-Line Options
The following is an example command for tumor-normal mode. Default resource files are available for WES and WGS. Please note that the WES and WGS tumor-normal modes are fully supported and tested. Custom panels may require more extensive validation and may require generating a new sites file.
The following is an example command for the tumor-only mode. Please note that the WES and WGS tumor-only modes are not as extensively tested as the tumor-normal modes. The TSO500 panels do not have normal controls, and are only tested and validated in tumor-only mode.
Assay Specific Settings
In TSO500 Solid, microsatellite instability is called for all samples with "PercentageUnstableSites >= 20". It is generally recommended to use "PercentageUnstableSites" as the metric for determining the MSI status. This metric is normalized, and is expected to be more consistent across different pipelines and different input site files. The exact thresholds for other assays may still depend on the sample noise characteristics (PCR, UMI, etc.) and may need some empirical calibration.
Default Microsatellite sites files
The following is an example of a microsatellite file:
Default WES and WGS Microsatellite site files can be downloaded here: DRAGEN Software Support Site page
For panels it is recommended to post-process the file by intersecting the WES or WGS sites with the panel of interest. This will avoid using any off-target reads in the MSI analysis.
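The intersection step can be sketched in Python as follows. This is a minimal illustration, not a DRAGEN utility: the function names are hypothetical, sites and panel targets are assumed to be (chromosome, start, end) tuples in the same coordinate system, and a site is kept only when it lies entirely inside a target region (a stricter rule than "any overlap", chosen here so that partially covered repeats are dropped).

```python
from bisect import bisect_right
from collections import defaultdict


def merge_targets(targets):
    """Group panel targets by chromosome and merge overlapping intervals."""
    by_chrom = defaultdict(list)
    for chrom, start, end in targets:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                # Overlaps or touches the previous interval: extend it.
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = [(s, e) for s, e in out]
    return merged


def on_target(site, merged):
    """Return True if the site is fully contained in one merged target."""
    chrom, start, end = site
    intervals = merged.get(chrom, [])
    # Index of the last interval whose start is <= the site start.
    i = bisect_right(intervals, (start, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= start and end <= intervals[i][1]


def sites_on_panel(sites, targets):
    """Keep only the sites that fall inside the panel targets."""
    merged = merge_targets(targets)
    return [site for site in sites if on_target(site, merged)]
```

In practice the same result is often obtained with an interval tool run on the site file and the panel BED file; the sketch only shows the logic.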
Custom Microsatellite files
Custom Microsatellite site files may be required if a small panel is targeted and the default site files do not have sufficient overlapping sites.
Custom Microsatellite site files can be generated by using msi-sensor [https://github.com/xjtu-omics/msisensor-pro/wiki/Best-Practices].
A subsequent post-processing step is recommended:
only keep microsatellite sites with a repeat unit of length 1
keep sites with 10 - 50bp repeats (a max length of 100bp repeats is supported)
remove any sites containing Ns in the left or right anchors
downsample the remaining sites so that the file contains at least 2000 sites, but no more than 1 million sites (to avoid excessive run time)
Please note that an error occurs if long (>100bp) microsatellite sites are present in the file.
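The post-processing rules above could be applied with a short Python filter such as the one below. It is a sketch under assumptions: each site is a dict whose keys (repeat_unit_length, repeat_times, left_flank_bases, right_flank_bases) are assumed to follow the column names of the scan output used to build the site list, and the function name and its parameters are hypothetical. For a repeat unit of length 1, the repeat length in bp equals the repeat count, so the 10 - 50bp rule is checked on repeat_times.

```python
import random


def postprocess_sites(sites, min_sites=2000, max_sites=1_000_000, seed=0):
    """Filter candidate sites and downsample to at most max_sites."""
    kept = [
        s for s in sites
        if s["repeat_unit_length"] == 1                  # homopolymers only
        and 10 <= s["repeat_times"] <= 50                # 10 - 50bp repeats
        and "N" not in s["left_flank_bases"].upper()     # no Ns in left anchor
        and "N" not in s["right_flank_bases"].upper()    # no Ns in right anchor
    ]
    if len(kept) > max_sites:
        # Sample indices and sort them so the original site order is preserved.
        rng = random.Random(seed)
        chosen = sorted(rng.sample(range(len(kept)), max_sites))
        kept = [kept[i] for i in chosen]
    if len(kept) < min_sites:
        raise ValueError(
            f"only {len(kept)} sites remain; at least {min_sites} are recommended"
        )
    return kept
```

A fixed seed keeps the downsampling reproducible, which matters because the same site file has to be used for the panel of normals and for tumor-only calling.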
Normal references of microsatellite repeat distribution
Normal reference files can be generated by running collect-evidence mode on a panel of normal samples.
Please note:
The collect-evidence mode MUST be run in DRAGEN germline mode.
The --msi-microsatellites-file and --msi-coverage-threshold settings used in collect-evidence mode must be consistent with the settings used during tumor-only MSI calling.
At least 20 normal samples are required.
MSI Output
The output containing the MSI score (PercentageUnstableSites) is stored in <output prefix>.microsat_output.json.
The "SumDistance" is the sum of the Jensen-Shannon distances of all unstable sites, based on the distances between the tumor and normal distributions. The "SumDistance" depends on the size of the microsatellite file and is not normalized. In general it is recommended to set MSI thresholds based on "PercentageUnstableSites" rather than "SumDistance".
In TSO500, Solid microsatellite instability is defined as all samples with "PercentageUnstableSites >= 20". The exact thresholds for other assays with different site files and noise characteristics may need some empirical calibration.
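As an illustration of applying this threshold, the Python sketch below reads a parsed report and compares "PercentageUnstableSites" with a cutoff. It assumes the field sits at the top level of the JSON object and may be stored as either a number or a string; the function name is hypothetical and the example values are made up for illustration.

```python
import json


def is_msi_unstable(report, threshold=20.0):
    """Return True when PercentageUnstableSites meets the threshold.

    `report` is the parsed content of <output prefix>.microsat_output.json.
    The top-level location of the field is an assumption of this sketch.
    """
    percentage = float(report["PercentageUnstableSites"])
    return percentage >= threshold


# Illustrative, made-up report content (not real DRAGEN output).
example = json.loads('{"PercentageUnstableSites": "23.5", "SumDistance": "12.1"}')
print(is_msi_unstable(example))
```

The default of 20 matches the TSO500 Solid definition; for other assays the threshold argument would be set to the empirically calibrated value.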
MSI Debugging Output
There are two other output files (*_diffs.txt and *.dist) that are useful for debugging.
Here is an example of a *_diffs.txt file:
The fourth column (Assessed) is the coverage filter. It is true for any site with coverage >= 60.
The sixth column (PassFilter) is an internal flag used for the left allele filter. It removes low-quality sites that have no coverage and helps to increase prediction accuracy. It is true when the following conditions are met.
The *.dist file stores the read counts for each repeat length of the microsatellite site.
The coverage of the site can be obtained by summing up all counts in the last column.
MSI Verbose Output
<output prefix>.microsat_output.json (described above)
<output prefix>.microsat_tumor.dist. This file contains the repeat length array for every microsatellite. Column length_dis is the repeat length array.
<output prefix>.microsat_diffs.txt. This file contains the distance metrics for every microsatellite between tumor/normal or tumor/reference normals. Column Assessed indicates if a site passes the coverage filter (msi-coverage-threshold). Column PassFilter is an internal metric and currently is not used for filtering microsatellites.
MSI Algorithm
The MSI algorithm performs the following steps:
Tabulates tumor and normal counts from the read alignments for each microsatellite site.
Calculates the Jensen-Shannon distance (JSD) between the tumor and normal distributions for each microsatellite site (tumor-normal mode), or the Jensen-Shannon distance of two normal baseline samples (tumor-only mode).
Determines unstable sites by performing chi-square testing of the tumor and normal distributions. Unstable sites have repeat length distributions that are significantly shifted between tumor and normal, as measured by Jensen-Shannon distance (tumor-normal mode). In tumor-only mode, JSD is calculated for each pair of tumor and normal reference samples, as well as for each pair of normal-normal samples. The two sets of JSD values are then compared to derive a mean distance difference and a p-value calculated from a Student's t-test. Microsatellite instability is called if the mean distance difference is greater than or equal to the distance threshold (default 0.1) and the p-value is less than or equal to the p-value threshold (default 0.01).
Produces a report giving the assessed site count, the unstable site count, the percentage of unstable sites among all assessed sites, and the sum of the Jensen-Shannon distances of all the unstable sites.
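The distance and reporting steps can be sketched in Python as follows. This is an illustrative approximation and not the DRAGEN implementation:
- The function names and dictionary keys are hypothetical.
- Repeat-length distributions are assumed to be dicts mapping repeat length to read count.
- The Jensen-Shannon distance is taken as the square root of the base-2 Jensen-Shannon divergence.
- The direction of the mean distance difference (tumor-normal minus normal-normal) is an assumption.
- The chi-square and t-test p-values are not computed here and are passed in as inputs.

```python
import math
from statistics import mean


def jensen_shannon_distance(counts_a, counts_b):
    """Jensen-Shannon distance between two repeat-length count distributions."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both distributions need at least one read")
    lengths = sorted(set(counts_a) | set(counts_b))
    p = [counts_a.get(k, 0) / total_a for k in lengths]
    q = [counts_b.get(k, 0) / total_b for k in lengths]
    m = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(x, ref):
        # Kullback-Leibler divergence in bits; zero-probability terms add 0.
        return sum(xi * math.log2(xi / ri) for xi, ri in zip(x, ref) if xi > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


def mean_distance_difference(tumor_normal_jsd, normal_normal_jsd):
    """Tumor-only mode: difference between the two mean JSD values."""
    return mean(tumor_normal_jsd) - mean(normal_normal_jsd)


def site_is_unstable(distance_difference, p_value,
                     distance_threshold=0.1, p_value_threshold=0.01):
    """Apply the default distance and p-value thresholds to one site."""
    return (distance_difference >= distance_threshold
            and p_value <= p_value_threshold)


def summarize_sites(site_results):
    """Build the report from (assessed, unstable, distance) tuples."""
    assessed = [r for r in site_results if r[0]]
    unstable = [r for r in assessed if r[1]]
    percentage = 100.0 * len(unstable) / len(assessed) if assessed else 0.0
    return {
        "assessed_sites": len(assessed),
        "unstable_sites": len(unstable),
        "percentage_unstable_sites": percentage,
        "sum_distance": sum(r[2] for r in unstable),
    }
```

Because the percentage is computed over assessed sites only, it stays comparable across site files of different sizes, whereas the summed distance grows with the number of sites.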