# Microsatellite Instability

Microsatellites are genomic regions of short DNA motifs that are repeated 5–50 times and are associated with high mutation rates. Microsatellite Instability (MSI) results from deficiencies in the DNA mismatch repair pathway and can be used as a critical biomarker to predict immunotherapy responses in multiple tumor types.

DRAGEN MSI supports running in tumor-normal and tumor-only modes. The tumor-only mode requires a panel of normals, which can be generated using the `generate-baseline` mode (see [Baseline microsatellite repeat distribution](#baseline-microsatellite-repeat-distribution)).

The default microsatellite site lists and the panel of normals are available for WES and WGS ([DRAGEN Software Support Site page](https://support.illumina.com/sequencing/sequencing_software/dragen-bio-it-platform/product_files.html)). Custom panels other than WES and WGS may require more extensive validation and possibly require [generating a new sites file](#custom-microsatellite-files).

## MSI Algorithm

The MSI algorithm performs the following steps:

1. Tabulate the number of read alignments for each microsatellite site in tumor and normal samples.
   * A read is counted toward a repeat length only if it contains the complete repeat sequence plus the 5 flanking bases on each side, as specified in the microsatellite site list.
   * When `msi-read-stitching` is turned on, a pair of reads is counted as one read if the two reads overlap.
2. Calculate the Jensen-Shannon (JS) distance between the tumor and normal distributions.
   * In `tumor-normal` mode, the JS distance is calculated between the tumor sample and the normal sample.
   * In `tumor-only` mode, the intra-normal JS distances between all pairs of normal samples are calculated first. The mean JS distance between the tumor sample and all normal samples is then normalized by the mean intra-normal distance.
3. Compute P-values for each site using
   * a chi-square test between the tumor and normal distributions in `tumor-normal` mode, and
   * a Student's t-test between the mean tumor and normal distributions in `tumor-only` mode.
4. Determine whether the site is **assessed**. A site is assessed if the following criteria are satisfied:
   * the total number of supporting reads is greater than `SpanningCoverageThreshold` in both tumor and normal samples
   * the number of reads supporting the reference repeat length is larger than `MinReferencePeakHeight`.
5. Determine whether a site is **unstable** based on both the Jensen-Shannon distance and the P-value. A site is unstable if the JS distance is larger than `DistanceThreshold` (default=0.1) and the P-value is smaller than `PValueThreshold` (default=0.01).
6. Determine whether the site **passes filters** based on specific peak heights. A site passes if the following criteria are satisfied:

   * the number of reads supporting (reference repeat length - 1) is greater than or equal to `MinLeftPeakHeight`
   * the ratio of number of reads supporting reference repeat length and (reference repeat length - 1) is between `MinLeftPeakRatio` and `MaxLeftPeakRatio`.

   If a filter is not passed, the site is counted toward the total assessed sites, but is not counted toward the unstable sites even if the distance and P-value pass the thresholds.
7. Summarize the statistics and produce a report in the [JSON output file](#msi-score-report) containing the assessed site count, the unstable site count, the percentage of unstable sites among all assessed sites, and the sum of the Jensen-Shannon distances of all unstable sites. The parameter values mentioned above are also reported.
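
The distance and instability steps can be illustrated with a minimal Python sketch. It is not DRAGEN's implementation: the function names are hypothetical, the Jensen-Shannon distance is assumed to be the square root of the base-2 JS divergence, and the tumor-only normalization is assumed to be a simple division of the two means.

```python
import math
from itertools import combinations


def js_distance(p_counts, q_counts):
    """Jensen-Shannon distance between two equal-length read-count histograms.

    Assumption: distance = sqrt(JS divergence) with base-2 logarithms,
    so the result lies in [0, 1].
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must have the same length")
    p_total, q_total = sum(p_counts), sum(q_counts)
    if p_total == 0 or q_total == 0:
        raise ValueError("each histogram needs at least one read")
    divergence = 0.0
    for p_raw, q_raw in zip(p_counts, q_counts):
        p, q = p_raw / p_total, q_raw / q_total
        m = 0.5 * (p + q)
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return math.sqrt(max(divergence, 0.0))


def tumor_only_distance(tumor_counts, normal_counts_list):
    """Mean tumor-vs-normal distance divided by the mean intra-normal distance.

    Assumption: "normalize" in the tumor-only description means division.
    """
    if len(normal_counts_list) < 2:
        raise ValueError("at least two normal samples are needed")
    intra = [js_distance(a, b) for a, b in combinations(normal_counts_list, 2)]
    to_tumor = [js_distance(tumor_counts, n) for n in normal_counts_list]
    mean_intra = sum(intra) / len(intra)
    if mean_intra == 0:
        raise ValueError("normal samples are identical; cannot normalize")
    return (sum(to_tumor) / len(to_tumor)) / mean_intra


def is_unstable(distance, p_value, distance_threshold=0.1, p_value_threshold=0.01):
    """Step 5: both the distance and the P-value must pass their thresholds."""
    return distance > distance_threshold and p_value < p_value_threshold
```

Each histogram here is a list of read counts per repeat length for one site, in the same order for both samples.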

## Command-Line Options

DRAGEN MSI is turned on by `--enable-msi=true` and supports MSI calling in `tumor-only` and `tumor-normal` modes. The key differences between the two modes are:

1. `tumor-only` mode requires providing a panel of normals with `msi-ref-normal-input`
2. `tumor-normal` mode requires a paired normal sample

The MSI calling mode is automatically inferred from the command-line inputs.

DRAGEN MSI also supports generating baselines for an input sample via `--msi-generate-baseline=true`. See the section [Baseline microsatellite repeat distribution](#baseline-microsatellite-repeat-distribution) for an example command.

### Example command for `tumor-only` mode

It is recommended to use `tumor-only` mode rather than `tumor-normal` if a panel of normals is available, especially for low-coverage and/or low-quality samples.

The TSO500 panels do not have normal controls, and are only tested and validated in `tumor-only` mode.

```
dragen \
--enable-msi=true \
--msi-microsatellites-file ${microsatellite_file} \
--msi-ref-normal-input ${panel_of_normals} \
--output-directory ${output_directory} \
--output-file-prefix ${prefix} \
--enable-map-align=true \
--RGID=${read_group_ID} \
--RGSM=${read_group_sample} \
--ref-dir ${reference_directory} \
--enable-map-align-output=true \
--enable-sort true \
--enable-duplicate-marking=true \
--tumor-fastq1 ${tumor_fq1} \
--tumor-fastq2 ${tumor_fq2}
```

> Note: DRAGEN MSI runs in `tumor-only` mode even if a normal sample is provided along with the tumor sample, as long as `msi-ref-normal-input` is provided.

### Example command for `tumor-normal` mode

The paired normal sample is specified by `--fastq-file1` and `--fastq-file2`.

```
dragen \
--enable-msi=true \
--msi-microsatellites-file ${microsatellite_file} \
--output-directory ${output_directory} \
--output-file-prefix ${prefix} \
--enable-map-align true \
--RGID=${read_group_ID} \
--RGSM=${read_group_sample} \
--ref-dir ${reference_directory} \
--enable-map-align-output true \
--enable-sort true \
--enable-duplicate-marking true \
--tumor-fastq1 ${tumor_fq1} \
--tumor-fastq2 ${tumor_fq2} \
--fastq-file1 ${fq1} \
--fastq-file2 ${fq2}
```

| Option                              | Description                                                                                                                                                                                                                                                                                                                                                                                                              | Defaults                                                        |
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------- |
| `msi-microsatellites-file`          | Specify the file containing the [microsatellite sites](#microsatellite-sites-files). DRAGEN has tested with ≥ 10 bp homopolymers for solid samples, and 6-7 bp homopolymers for liquid samples.                                                                                                                                                                                                                          | NA                                                              |
| `msi-ref-normal-input`              | Specify a set of [normal reference repeat length distributions](#baseline-microsatellite-repeat-distribution). These files can be generated by running `generate-baseline` mode on each normal sample. A site is only evaluated if enough samples have coverage for that site. The set of normal references can be either a combined `.dist` file or a directory containing a set of `.dist` files.                    | NA                                                              |
| `msi-coverage-threshold` (optional) | Specify the minimum spanning read coverage for a microsatellite.See [MSI algorithm](#msi-algorithm) for the details on how the number of spanning reads are counted.                                                                                                                                                                                                                                                     | <p>Tumor-only: 30<br>Tumor-normal: 60<br>TSO500-liquid: 500</p> |
| `msi-distance-threshold` (optional) | Threshold for distance distributions to be considered different.                                                                                                                                                                                                                                                                                                                                                         | <p>Solid samples: 0.1<br>Liquid samples: 0.02</p>               |
| `msi-read-stitching` (optional)     | Whether to count overlapping reads as one fragment. When read-stitching is turned on, the number of evidence covering each site will be lowered as each pair of reads now counts as one piece of evidence.                                                                                                                                                                                                               | True                                                            |

> **Notes on coverage threshold:** Setting the coverage threshold lower than the defaults is not recommended as the quality of calls at the low-coverage sites may be low. In tumor-normal mode, if the sample coverage is lower than MSI's default coverage threshold, please provide a panel of normals and adjust the coverage threshold accordingly for the best quality of calls.

## Assay-Specific Settings

For TSO500 Solid, a sample is classified as microsatellite unstable when "PercentageUnstableSites >= 20". It is generally recommended to use "PercentageUnstableSites" as the metric for determining the MSI status. This metric is normalized and is expected to be more consistent across pipelines and input site files. The exact thresholds for other assays may still depend on the sample noise characteristics (PCR, UMI, etc.) and may need some empirical calibration.

| Sample Type    | Assay  | Microsatellite file                                          | Recommended Settings        | PercentageUnstableSites Threshold                 |
| -------------- | ------ | ------------------------------------------------------------ | --------------------------- | ------------------------------------------------- |
| Solid          | TSO500 | Part of TSO500 resource bundle. Repeats 10 - 50. 130 sites.  | msi-distance-threshold=0.1  | 20                                                |
| Heme           | TSO500 | N/A                                                          | N/A                         | N/A                                               |
| Liquid (cfDNA) | TSO500 | Part of TSO500 resource bundle. Repeats 6,7. 2344 sites.     | msi-distance-threshold=0.02 | MSI status determined by sum JSD (see note below) |
| Solid, Heme    | WES    | Available for download. Repeats 10 - 50. Approx. 1.1K sites. | msi-distance-threshold=0.1  | 20                                                |
| Liquid (cfDNA) | WES    | Available for download. Repeats 10 - 50. Approx. 1.1K sites. | msi-distance-threshold=0.02 | 20                                                |
| Solid, Heme    | WGS    | Available for download. Repeats 10 - 50. Approx. 250K sites. | msi-distance-threshold=0.1  | 20                                                |
| Liquid (cfDNA) | WGS    | Available for download. Repeats 10 - 50. Approx. 250K sites. | msi-distance-threshold=0.02 | 20                                                |

> **Note: For TSO500 liquid (cfDNA), the MSI status is determined by the sum of Jensen-Shannon distance of all unstable sites. Therefore PercentageUnstableSites is not applicable.**

## Baseline microsatellite repeat distribution

The MSI baselines, or MSI panel of normals, can be provided in two formats: as separate files in one directory, or as a single file containing distributions from multiple samples. Both formats can be provided with `msi-ref-normal-input`. The combined `.dist` file must contain an additional column that specifies the name of the sample for each distribution. The default normal baseline files are available for WES and WGS at [DRAGEN Software Support Site page](https://support.illumina.com/sequencing/sequencing_software/dragen-bio-it-platform/product_files.html).

It is recommended to match the sample types of the panel of normals and the tumor sample for optimal performance. For example, a panel of normals constructed from FFPE samples should be used with FFPE tumor samples.

> Note: Currently, we provide baseline files for FFPE only. In the case where FF normal samples are not available to generate a baseline, FFPE baselines can be used. However, FFPE baselines should be used with caution with FF tumor samples as mismatched sample types may lead to inaccurate results.

### Custom microsatellite baseline files

MSI baselines can be generated by running `generate-baseline` mode on normal samples. The command below shows an example of generating a baseline for one normal sample.

```
dragen -f \
--enable-msi=true \
--msi-generate-baseline=true \
--ref-dir ${reference_directory} \
--msi-microsatellites-file ${microsatellite_file} \
--output-directory ${output_directory} \
--output-file-prefix ${prefix} \
-1 ${normal_fq1} \
-2 ${normal_fq2}
```

The above command needs to be run for each normal sample in order to produce a set of baseline files. The output is in the same format as the `.dist` file described in [MSI output](#distribution-of-repeat-lengths).

Please note:

* The `generate-baseline` mode **MUST** be run in DRAGEN germline mode, as indicated by fastq options `-1` and `-2`.
* The `--msi-microsatellites-file` used in `generate-baseline` mode must be consistent with MSI calling.
* At least 20 normal samples are required to be used as baselines for MSI calling.
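
Because the baseline command has to be repeated for every normal sample, it can be convenient to generate the command lines from a sample sheet. The following Python sketch only assembles the argument lists, using the options shown in the example above; the function names and the `normals` mapping are hypothetical, and running the commands (for example with `subprocess.run`) is left to the caller.

```python
import shlex


def baseline_command(reference_dir, microsatellite_file, output_dir,
                     prefix, fastq1, fastq2):
    """Build the argument list for one baseline run (germline FASTQ inputs)."""
    return [
        "dragen", "-f",
        "--enable-msi=true",
        "--msi-generate-baseline=true",
        "--ref-dir", reference_dir,
        "--msi-microsatellites-file", microsatellite_file,
        "--output-directory", output_dir,
        "--output-file-prefix", prefix,
        "-1", fastq1,
        "-2", fastq2,
    ]


def baseline_commands(normals, reference_dir, microsatellite_file,
                      output_dir, min_samples=20):
    """Build one command per normal sample.

    normals: dict mapping a sample name to its (fastq1, fastq2) pair.
    The same microsatellite file is used for every sample, and at least
    `min_samples` normals are required, as stated in the notes above.
    """
    if len(normals) < min_samples:
        raise ValueError(
            f"{len(normals)} normal samples given, at least {min_samples} required"
        )
    return [
        baseline_command(reference_dir, microsatellite_file, output_dir,
                         name, fastq1, fastq2)
        for name, (fastq1, fastq2) in sorted(normals.items())
    ]


def as_shell(command):
    """Render an argument list as a shell-quoted command line."""
    return shlex.join(command)
```

Using the sample name as the output prefix keeps the resulting `.dist` files distinguishable when they are collected into one directory.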

## Microsatellite sites files

The following is an example of a microsatellite file:

```
#chromosome     location        repeat_unit_length      repeat_unit_binary      repeat_times    left_flank_binary       right_flank_binary      repeat_unit_bases       left_flank_bases    right_flank_bases
chr1	985443	1	2	15	676	992	G	GGGCA	TTGAA
chr1	7980985	1	0	10	231	1020	A	ATGCT	TTTTA
chr1	8022800	1	3	19	13	41	T	AAATC	AAGGC
chr1	8029500	1	2	10	39	0	G	AAGCT	AAAAA
chr1	9146447	1	3	15	887	248	T	TCTCT	ATTGA
chr1	9767837	1	3	12	704	195	T	GTAAA	ATAAT
```

Default WES and WGS Microsatellite site files can be downloaded here: [DRAGEN Software Support Site page](https://support.illumina.com/sequencing/sequencing_software/dragen-bio-it-platform/product_files.html)

For panels it is recommended to post-process the file by intersecting the WES or WGS sites with the panel of interest. This will avoid using any off-target reads in the MSI analysis.

### Microsatellite site list columns

| Column name          | Description                                                                           |
| -------------------- | ------------------------------------------------------------------------------------- |
| chromosome           | Chromosome of the site                                                                |
| location             | Start location of the site                                                            |
| repeat\_unit\_length | Size of the repeat unit                                                               |
| repeat\_unit\_binary | Binary encoding of the repeat unit base converted to decimal (A: 0, C: 1, G: 2, T: 3) |
| repeat\_times        | Number of repeats units in reference                                                  |
| left\_flank\_binary  | Left flank bases in terms of binary encoding converted to decimal                     |
| right\_flank\_binary | Right flank bases in terms of binary encoding converted to decimal                    |
| repeat\_unit\_bases  | Repeat unit base in A/T/C/G                                                           |
| left\_flank\_bases   | Five bases on the left flank of the microsatellite site                               |
| right\_flank\_bases  | Five bases on the right flank of the microsatellite site                              |
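
The binary columns can be reproduced from the base columns. The sketch below is inferred from the example rows above rather than from a format specification: each base is treated as a base-4 digit (A: 0, C: 1, G: 2, T: 3), with the leftmost base as the most significant digit. The function names are hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_bases(bases):
    """Convert a base string such as 'GGGCA' to its decimal encoding."""
    value = 0
    for base in bases.upper():
        value = value * 4 + BASE_CODE[base]  # KeyError for N or other symbols
    return value


def decode_bases(value, length):
    """Convert a decimal encoding back to a base string of the given length."""
    bases = []
    for _ in range(length):
        bases.append(CODE_BASE[value % 4])
        value //= 4
    return "".join(reversed(bases))


# First example row: repeat unit G, left flank GGGCA, right flank TTGAA
print(encode_bases("G"), encode_bases("GGGCA"), encode_bases("TTGAA"))
# prints: 2 676 992
```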

### Custom Microsatellite files

Custom Microsatellite site files may be required if a small panel is targeted and/or the default site files do not have sufficient overlapping sites.

Custom Microsatellite site files can be generated by using MSIsensor-Pro <https://github.com/xjtu-omics/msisensor-pro/wiki/Best-Practices>.

```
msisensor-pro scan -d /path/to/reference.fa -o ${microsatellite_file}
```

A subsequent post-processing step is required for the site list to be used by DRAGEN:

* only keep microsatellite sites with a repeat unit of length 1
* keep sites with 10 - 50bp repeats (a max length of 100bp repeats is supported)
* remove any sites containing Ns in the left or right anchors
* downsample the remaining sites to contain no more than 1 million sites (to avoid excessive run time)
* rearrange the columns to match the format of a DRAGEN microsatellite site list (see [Microsatellite site list columns](#microsatellite-site-list-columns))

> **An error occurs if long (>100bp) microsatellite sites are present in the file.**

> **The microsatellite site file output by MSIsensor-Pro is in a different format from the DRAGEN site file. A post-processing step is required to convert the format.**
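
The post-processing steps listed above could be sketched in Python as follows. This is an illustration under assumptions, not a supported converter: it assumes the scan output is tab-delimited with a header row that uses the column names from the site list table, and it writes a `#`-prefixed header to mirror the example site file. Check both assumptions against your MSIsensor-Pro version and DRAGEN release. All function names are hypothetical.

```python
import csv
import random

DRAGEN_COLUMNS = [
    "chromosome", "location", "repeat_unit_length", "repeat_unit_binary",
    "repeat_times", "left_flank_binary", "right_flank_binary",
    "repeat_unit_bases", "left_flank_bases", "right_flank_bases",
]


def keep_site(row, min_repeats=10, max_repeats=50):
    """Homopolymers only, 10 - 50 repeats, no N in either flank."""
    flanks = (row["left_flank_bases"] + row["right_flank_bases"]).upper()
    return (
        int(row["repeat_unit_length"]) == 1
        and min_repeats <= int(row["repeat_times"]) <= max_repeats
        and "N" not in flanks
    )


def filter_sites(rows, max_sites=1_000_000, seed=0):
    """Filter rows, then randomly downsample while preserving file order."""
    kept = [row for row in rows if keep_site(row)]
    if len(kept) > max_sites:
        chosen = sorted(random.Random(seed).sample(range(len(kept)), max_sites))
        kept = [kept[i] for i in chosen]
    return kept


def convert(scan_path, out_path, max_sites=1_000_000):
    """Read a scan file and write the filtered sites in DRAGEN column order."""
    with open(scan_path, newline="") as src:
        sites = filter_sites(csv.DictReader(src, delimiter="\t"), max_sites)
    with open(out_path, "w", newline="") as dst:
        dst.write("#" + "\t".join(DRAGEN_COLUMNS) + "\n")
        for row in sites:
            dst.write("\t".join(row[col] for col in DRAGEN_COLUMNS) + "\n")
```

A fixed random seed keeps the downsampled site list reproducible, which matters because the same site file must be used for baseline generation and MSI calling.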

### Germline variant filtering

We recommend filtering out microsatellite sites that overlap with known population variants. A locus affected by small variants results in artificially inflated differences between samples. In the example below, the site in the normal sample overlaps with a heterozygous variant (possibly a one-base insertion or deletion). In the paired tumor sample, the heterozygosity is lost (LOH). The difference observed between the two distributions is therefore due to LOH, not microsatellite instability.

![msi-snv](https://25033470-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FG9szlFZupV6Q2DasL98y%2Fuploads%2Fgit-blob-0bc502f3d88923a4e2c8c717bfd677dbf4f1d34c%2Fmsi-snv.png?alt=media)

We recommend using [gnomAD](https://gnomad.broadinstitute.org/) as the reference database to filter all sites that overlap with small variants with population allele frequencies > 1%.

## MSI Output

DRAGEN outputs the following files during the MSI workflow:

| File name                       | Description                                                                                                                                                                                                                      | Output in TO mode | Output in TN mode | Output in baseline generation |
| ------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------- | ----------------- | ----------------------------- |
| \<prefix>.microsat\_output.json | [MSI score report](#msi-score-report) reports MSI status and parameters used in JSON format.                                                                                                                                     | Y                 | Y                 | N                             |
| \<prefix>.microsat\_diffs.txt   | [Difference between tumor and normal samples](#difference-between-tumor-and-normal-samples) reports the statistical distance between tumor and normal samples for each site, and stats used to determine the status of the site. | Y                 | Y                 | N                             |
| \<prefix>.microsat\_normal.dist | [Distribution of repeat lengths](#distribution-of-repeat-lengths) reports the repeat length distribution of each site in a normal sample.                                                                                        | N                 | Y                 | Y                             |
| \<prefix>.microsat\_tumor.dist  | [Distribution of repeat lengths](#distribution-of-repeat-lengths) reports the repeat length distribution of each site in a tumor sample.                                         | Y                 | Y                 | N                             |
| \<prefix>.microsat\_log.txt     | Logs the runtime and MSI results                                                                                                                                                                                                 | Y                 | Y                 | Y                             |

### MSI score report

The JSON file `<prefix>.microsat_output.json` contains the parameters needed to reproduce the experiment, and the MSI results (including the MSI score `PercentageUnstableSites`).

```
{   
    "Settings":{
        "Command": "tumor-normal",
        ...,
    },
    "TotalMicrosatelliteSitesAssessed": "20020",
    "TotalMicrosatelliteSitesUnstable": "4374",
    "PercentageUnstableSites": "21.850000000000001",
    "ResultIsValid": "true",
    "ResultMessage": "",
    "SumDistance": "1214.174" 
}
```

The "SumDistance" is the sum of the Jensen-Shannon distances of all unstable sites, based on the distances between the tumor and normal distributions. "SumDistance" depends on the size of the microsatellite file and is not normalized. In general, it is recommended to set MSI thresholds based on "PercentageUnstableSites" rather than "SumDistance".

In TSO500-solid, a sample is considered microsatellite unstable when "PercentageUnstableSites >= 20". The exact thresholds for other assays with different site files and noise characteristics may need some empirical calibration.
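
A report can be turned into a call with a few lines of Python. The sketch below is only an illustration: the function name is hypothetical, the percentage is recomputed from the two site counts, and the default threshold of 20 is the TSO500-solid value, which may not suit other assays. Values are converted from strings because the example report stores them as strings.

```python
import json


def msi_call(report, percent_threshold=20.0):
    """Return (percent_unstable, is_unstable), or None if the result is unusable."""
    if str(report.get("ResultIsValid", "")).lower() != "true":
        return None
    assessed = int(report["TotalMicrosatelliteSitesAssessed"])
    unstable = int(report["TotalMicrosatelliteSitesUnstable"])
    if assessed == 0:
        return None
    percent = 100.0 * unstable / assessed
    return percent, percent >= percent_threshold


# Counts taken from the example report above
example = json.loads("""
{
    "TotalMicrosatelliteSitesAssessed": "20020",
    "TotalMicrosatelliteSitesUnstable": "4374",
    "ResultIsValid": "true"
}
""")
percent, unstable = msi_call(example)
print(f"{percent:.2f}% unstable sites, MSI call: {unstable}")
# prints: 21.85% unstable sites, MSI call: True
```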

### Distribution of repeat lengths

DRAGEN MSI computes the number of repeat units (repeat lengths) supported by each read fragment.

![distribution](https://25033470-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FG9szlFZupV6Q2DasL98y%2Fuploads%2Fgit-blob-510ccda40b578fbe441372afe4661acbbee4c56c%2Fmsi-distribution.png?alt=media)

The above figure shows a mock example of a read pileup (left) at a pre-specified homopolymer site with 10 repeat units of T in the reference, with two abnormal alignments at the bottom, and the distribution of repeat lengths (right) corresponding to the pileup.

The distribution is recorded in `<prefix>.microsat_normal.dist` and `<prefix>.microsat_tumor.dist` for normal and tumor samples, respectively.

Example `.dist` file:

```
#chromosome     location        repeat_unit_bases       reference_allele        covered length_distribution
chr1    985443  G       15      false   0,0,0,0,0,0,0,0,0,0,0,0,...,0
chr1    7980985 A       10      true    0,0,0,0,0,0,2,0,8,393,14,1,...,0
chr1    8022800 T       19      true    0,0,0,0,0,0,0,0,0,0,0,0,0,2,3,3,4,35,42,13,2,2,0,0...0,0
```

Summing up the numbers in the last column gives the total number of reads covering the site.

Columns in `.dist` files:

| Column name          | Description                                                                              |
| -------------------- | ---------------------------------------------------------------------------------------- |
| chromosome           | chromosome of the site                                                                   |
| location             | start position of the site                                                               |
| repeat\_unit\_bases  | the base(s) of the repeat unit in reference in A/T/C/G string                            |
| reference\_allele    | the number of repeats in reference                                                       |
| covered              | whether the site is covered by sufficient reads (determined by `msi-coverage-threshold`) |
| length\_distribution | A vector of size 100 that records read support for each repeat length from 1 to 100.     |
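
As an illustration of how these columns fit together, the Python sketch below parses one line of a per-sample `.dist` file and derives the total read count, the support for the reference length, and the most frequent repeat length. It assumes whitespace-delimited columns and a complete (non-elided) `length_distribution` vector; the combined format with an extra sample-name column is not handled. The function names are hypothetical.

```python
def parse_dist_line(line):
    """Split one data line of a per-sample .dist file into typed fields."""
    chromosome, location, unit, reference_allele, covered, dist = line.split()
    return {
        "chromosome": chromosome,
        "location": int(location),
        "repeat_unit_bases": unit,
        "reference_allele": int(reference_allele),
        "covered": covered == "true",
        "counts": [int(value) for value in dist.split(",")],
    }


def summarize_site(site):
    """Return (total reads, reads at the reference length, most frequent length)."""
    counts = site["counts"]
    total = sum(counts)
    # The vector starts at repeat length 1, so length L is at index L - 1.
    reference_support = counts[site["reference_allele"] - 1]
    mode_length = counts.index(max(counts)) + 1 if total else None
    return total, reference_support, mode_length
```

For the second example row above (reference length 10), the reference support is the tenth entry of the vector, 393.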

### Difference between tumor and normal samples

**Example `<prefix>.microsat_diffs.txt` file**

```
#Chromosome	Start	RepeatUnit	Assessed	Distance	PValue	PassFilter
chr1	69106	T	true	0.04105300052	0.4786448589	true
chr1	69116	TC	false	0	0	false
```

**Columns in `<prefix>.microsat_diffs.txt`**

The details of how column values are computed can be found in [MSI algorithm](#msi-algorithm).

| Column name | Description                                                                                            |
| ----------- | ------------------------------------------------------------------------------------------------------ |
| Chromosome  | chromosome of the site                                                                                 |
| Start       | start position of the site                                                                             |
| RepeatUnit  | the base(s) of the repeat unit in reference in A/T/C/G string                                          |
| Assessed    | whether the site is assessed based on read coverage and count of reads supporting the reference length |
| Distance    | the Jensen-Shannon distance between tumor and normal distributions                                     |
| PValue      | statistical significance of the difference observed between distributions                              |
| PassFilter  | whether the site passes filters based on specific peak heights                                         |
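
The per-site file can be aggregated into summary counts by applying the rules from the [MSI algorithm](#msi-algorithm) section. The following Python sketch is a hedged approximation, not DRAGEN's code: the function name is hypothetical, the default thresholds are the documented solid-sample defaults, and the totals are not guaranteed to match the JSON report exactly.

```python
def summarize_diffs(lines, distance_threshold=0.1, p_value_threshold=0.01):
    """Count assessed and unstable sites from microsat_diffs.txt lines."""
    assessed = unstable = 0
    sum_distance = 0.0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        _chrom, _start, _unit, is_assessed, distance, p_value, pass_filter = line.split()
        if is_assessed != "true":
            continue
        assessed += 1
        distance, p_value = float(distance), float(p_value)
        # A site that fails the peak filters still counts as assessed,
        # but never as unstable.
        if (pass_filter == "true"
                and distance > distance_threshold
                and p_value < p_value_threshold):
            unstable += 1
            sum_distance += distance
    percent = 100.0 * unstable / assessed if assessed else 0.0
    return {
        "assessed": assessed,
        "unstable": unstable,
        "percent_unstable": percent,
        "sum_distance": sum_distance,
    }
```

The function accepts any iterable of lines, so an open file handle can be passed directly.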
