# Microsatellite Instability

Microsatellites are genomic regions of short DNA motifs that are repeated 5–50 times and are associated with high mutation rates. Microsatellite Instability (MSI) results from deficiencies in the DNA mismatch repair pathway and can be used as a critical biomarker to predict immunotherapy responses in multiple tumor types.

DRAGEN MSI supports tumor-normal and tumor-only modes. Tumor-normal mode is generally expected to produce more accurate results. Tumor-only mode requires a panel of normals, which is generated using the `collect-evidence` mode.

## Command-Line Options

The following is an example command for `tumor-normal` mode. Default resource files are available for WES and WGS. Please note that the WES and WGS `tumor-normal` modes are fully supported and tested. Custom panels may require more extensive validation and possibly require generating a new sites file.

```
# --msi-microsatellites-file: see section "Default Microsatellite sites files"
dragen \
--msi-command tumor-normal \
--msi-coverage-threshold 60 \
--msi-microsatellites-file ${microsatellite_file} \
--output-directory ${output_directory} \
--output-file-prefix ${prefix} \
--enable-map-align true \
--RGID=read_group_ID \
--RGSM=read_group_sample \
--ref-dir ${reference_directory} \
--enable-map-align-output true \
--enable-sort true \
--enable-duplicate-marking true \
--tumor-fastq1 ${tumor_fq1} \
--tumor-fastq2 ${tumor_fq2} \
--fastq-file1 ${fq1} \
--fastq-file2 ${fq2}
```

The following is an example command for the `tumor-only` mode. Please note that the WES and WGS `tumor-only` modes are not as extensively tested as the `tumor-normal` modes. The TSO500 panels do not have normal controls, and are only tested and validated in `tumor-only` mode.

```
# --msi-microsatellites-file: see section "Default Microsatellite sites files"
# --msi-ref-normal-dir: see section "Normal references of microsatellite repeat distribution"
dragen \
--msi-command tumor-only \
--msi-coverage-threshold 60 \
--msi-microsatellites-file ${microsatellite_file} \
--msi-ref-normal-dir ${normal_reference_directory} \
--output-directory ${output_directory} \
--output-file-prefix ${prefix} \
--enable-map-align=true \
--RGID=read_group_ID \
--RGSM=read_group_sample \
--ref-dir ${reference_directory} \
--enable-map-align-output=true \
--enable-sort true \
--enable-duplicate-marking=true \
--tumor-fastq1 ${tumor_fq1} \
--tumor-fastq2 ${tumor_fq2}
```

| Option                                                 | Description                                                                                                                                                                                                                                                          |
| ------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `msi-command tumor-only/tumor-normal/collect-evidence` | Mode of execution: tumor-only, tumor-normal, or collect-evidence.                                                                                                                                                                                                    |
| `msi-microsatellites-file`                             | Specify the file containing the microsatellites. You can generate this file by scanning the genome for microsatellites using MSIsensor-pro. DRAGEN has been tested with ≥ 10 bp homopolymers for solid samples, and 6-7 bp homopolymers for liquid samples.          |
| `msi-ref-normal-dir`                                   | Full name of directory containing files with normal reference repeat length distribution. Used only in `tumor-only` mode. These files can be generated by running `collect-evidence` on each normal sample. At least 20 normal samples are required.                 |
| `msi-coverage-threshold`                               | Specify the minimum spanning read coverage for a microsatellite. Microsatellites that do not meet the specified threshold are not included in analysis. DRAGEN recommends using 60 as the value for solid samples. For TSO500 liquid, a value of 500 is recommended. |
| `msi-distance-threshold`                               | Threshold for distance distributions to be considered different. Default is 0.1. For liquid samples, a value of 0.02 is recommended.                                                                                                                                 |

## Assay Specific Settings

For TSO500 Solid, a sample is considered microsatellite unstable when "PercentageUnstableSites >= 20". "PercentageUnstableSites" is generally the recommended metric for determining MSI status. This metric is normalized and is expected to be more consistent across pipelines and input site files. The exact thresholds for other assays may still depend on the sample noise characteristics (PCR, UMI, etc.) and may need some empirical calibration.

| Sample Type    | Assay  | Microsatellite file                                           | Specific Settings           | PercentageUnstableSites Threshold |
| -------------- | ------ | ------------------------------------------------------------- | --------------------------- | --------------------------------- |
| Solid          | TSO500 | Part of TSO500 resource bundle. Repeats 10 - 50. 130 sites.   | msi-distance-threshold=0.1  | 20                                |
| Heme           | TSO500 | N/A                                                           | N/A                         | N/A                               |
| Liquid (cfDNA) | TSO500 | Part of TSO500 resource bundle. Repeats 6,7. 2344 sites.      | msi-distance-threshold=0.02 | TBD                               |
| Solid, Heme    | WES    | Available for download. Repeats 10 - 50. Approx. 3.5K sites.  | msi-distance-threshold=0.1  | TBD                               |
| Liquid (cfDNA) | WES    | Available for download. Repeats 10 - 50. Approx. 3.5K sites.  | msi-distance-threshold=0.02 | TBD                               |
| Solid, Heme    | WGS    | Available for download. Repeats 10 - 50. Approx. 1 mil sites. | msi-distance-threshold=0.1  | TBD                               |
| Liquid (cfDNA) | WGS    | Available for download. Repeats 10 - 50. Approx. 1 mil sites. | msi-distance-threshold=0.02 | TBD                               |

## Default Microsatellite sites files

The following is an example of a microsatellite file:

```
#chromosome     location        repeat_unit_length      repeat_unit_binary      repeat_times    left_flank_binary       right_flank_binary      repeat_unit_bases       left_flank_bases    right_flank_bases
chr1	985443	1	2	15	676	992	G	GGGCA	TTGAA
chr1	7980985	1	0	10	231	1020	A	ATGCT	TTTTA
chr1	8022800	1	3	19	13	41	T	AAATC	AAGGC
chr1	8029500	1	2	10	39	0	G	AAGCT	AAAAA
chr1	9146447	1	3	15	887	248	T	TCTCT	ATTGA
chr1	9767837	1	3	12	704	195	T	GTAAA	ATAAT
```

Default WES and WGS microsatellite site files can be downloaded from the [DRAGEN Software Support Site page](https://support.illumina.com/sequencing/sequencing_software/dragen-bio-it-platform/product_files.html).

For panels, it is recommended to post-process the file by intersecting the WES or WGS sites with the panel of interest. This avoids using off-target reads in the MSI analysis.
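
The intersection can be done with any interval tool. As a minimal sketch in Python, assuming the panel is described by BED-style intervals (0-based, half-open) and that `location` in the sites file is the 0-based start of the repeat, the following keeps only sites that lie entirely inside a target interval. The function names, the panel interval, and the coordinate convention are assumptions for illustration and should be checked against your own files.

```python
def load_bed(lines):
    """Collect target intervals per chromosome from BED-formatted lines."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions


def is_header(line):
    """Header line of the sites file, with or without a leading '#'."""
    return line.startswith("#") or line.split()[0] == "chromosome"


def intersect_sites(site_lines, regions):
    """Keep header lines and sites fully contained in a target interval."""
    kept = []
    for line in site_lines:
        if not line.strip():
            continue
        if is_header(line):
            kept.append(line)
            continue
        fields = line.split()
        chrom, start = fields[0], int(fields[1])
        # Repeat span = unit length x number of repeats (assumed convention).
        end = start + int(fields[2]) * int(fields[4])
        if any(s <= start and end <= e for s, e in regions.get(chrom, [])):
            kept.append(line)
    return kept


# Hypothetical panel interval; the two sites come from the example file above.
panel = load_bed(["chr1\t985000\t986000"])
sites = [
    "chr1\t985443\t1\t2\t15\t676\t992\tG\tGGGCA\tTTGAA",
    "chr1\t7980985\t1\t0\t10\t231\t1020\tA\tATGCT\tTTTTA",
]
on_target = intersect_sites(sites, panel)
print(len(on_target))  # 1
```

The linear scan over intervals keeps the sketch short; for a large panel, a sorted interval index would likely be a better fit.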

## Custom Microsatellite files

Custom Microsatellite site files may be required if a small panel is targeted and the default site files do not have sufficient overlapping sites.

Custom microsatellite site files can be generated using MSIsensor-pro (<https://github.com/xjtu-omics/msisensor-pro/wiki/Best-Practices>).

```
msisensor-pro scan -d /path/to/reference.fa -o ${microsatellite_file}
```

A subsequent post-processing step is recommended:

* keep only microsatellite sites with a repeat unit of length 1
* keep sites with 10 - 50 bp repeats (a maximum repeat length of 100 bp is supported)
* remove any sites containing Ns in the left or right anchors
* downsample the remaining sites so that the file contains at least 2000 sites, but no more than 1 million sites (to avoid excessive run time)

**Please note that an error occurs if long (>100 bp) microsatellite sites are present in the file.**
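
The post-processing steps listed above can be sketched in Python as follows. This is an illustration only: it assumes the ten-column layout shown in the example sites file, and the function names, the random seed, and the hypothetical test lines are not part of DRAGEN or MSIsensor-pro.

```python
import random


def keep_site(fields, min_bp=10, max_bp=50):
    """Apply the recommended per-site filters to one split line."""
    unit_len, repeats = int(fields[2]), int(fields[4])
    left_flank, right_flank = fields[8].upper(), fields[9].upper()
    return (
        unit_len == 1                      # repeat unit of length 1 only
        and min_bp <= repeats <= max_bp    # unit length 1, so repeats == bp
        and "N" not in left_flank
        and "N" not in right_flank
    )


def postprocess(lines, max_sites=1_000_000, seed=0):
    """Filter sites, then randomly downsample if too many remain."""
    header, sites = [], []
    for line in lines:
        if not line.strip():
            continue
        if line.startswith("#") or line.split()[0] == "chromosome":
            header.append(line)
        elif keep_site(line.split()):
            sites.append(line)
    if len(sites) > max_sites:
        rng = random.Random(seed)
        # Sample indices and re-sort them so the original order is preserved.
        chosen = sorted(rng.sample(range(len(sites)), max_sites))
        sites = [sites[i] for i in chosen]
    return header + sites


example = [
    "chr1\t8022800\t1\t3\t19\t13\t41\tT\tAAATC\tAAGGC",  # from the example file
    "chr1\t100\t2\t6\t12\t0\t0\tAC\tAAATC\tAAGGC",       # hypothetical: 2 bp unit
    "chr1\t200\t1\t0\t8\t0\t0\tA\tAAATC\tAAGGC",         # hypothetical: too short
    "chr1\t300\t1\t0\t12\t0\t0\tA\tAANTC\tAAGGC",        # hypothetical: N in flank
]
filtered = postprocess(example)
print(len(filtered))  # 1
```

The sketch does not enforce the lower bound of 2000 sites; if fewer sites remain after filtering, the filters or the panel design may need to be revisited.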

## Normal references of microsatellite repeat distribution

Normal reference files can be generated by running `collect-evidence` mode on a panel of normal samples.

```
dragen -f \
--msi-command collect-evidence \
--ref-dir ${reference_directory} \
--msi-microsatellites-file ${microsatellite_file} \
--msi-coverage-threshold 60 \
--output-directory ${output_directory} \
--output-file-prefix ${prefix} \
-1 ${normal_fq1} \
-2 ${normal_fq2}
```

Please note:

* The `collect-evidence` mode **MUST** be run in DRAGEN germline mode.
* The `--msi-microsatellites-file` and `--msi-coverage-threshold` settings used in `collect-evidence` mode must be consistent with the settings used during tumor-only MSI calling.
* At least 20 normal samples are required.

## MSI Output

The output containing the MSI score (`PercentageUnstableSites`) is stored in `<output prefix>.microsat_output.json`.

```
"TotalMicrosatelliteSitesAssessed": "20020",
"TotalMicrosatelliteSitesUnstable": "4374",
"PercentageUnstableSites": "21.850000000000001",
"ResultIsValid": "true",
"ResultMessage": "",
"SumDistance": "1214.174"
```

"SumDistance" is the sum of the Jensen-Shannon distances between the tumor and normal distributions over all unstable sites. "SumDistance" depends on the size of the microsatellite file and is not normalized. In general, it is recommended to set MSI thresholds based on "PercentageUnstableSites" rather than "SumDistance".

For TSO500 Solid, a sample is considered microsatellite unstable when "PercentageUnstableSites >= 20". The exact thresholds for other assays with different site files and noise characteristics may need some empirical calibration.
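
As a minimal sketch of how the report might be consumed downstream, the Python snippet below recomputes the percentage of unstable sites from the two site counts and compares it with a threshold. It assumes the fields shown above sit at the top level of a JSON object and are stored as strings, as in the excerpt; the function name and the returned keys are hypothetical, and the default threshold of 20 is the TSO500 Solid value, which may not suit other assays.

```python
import json


def summarize(report, threshold=20.0):
    """Recompute the unstable-site percentage and compare it to a threshold."""
    assessed = int(report["TotalMicrosatelliteSitesAssessed"])
    unstable = int(report["TotalMicrosatelliteSitesUnstable"])
    percent = 100.0 * unstable / assessed if assessed else 0.0
    valid = str(report.get("ResultIsValid", "")).lower() == "true"
    return {
        "percent_unstable": percent,
        "valid": valid,
        # Only trust the comparison when DRAGEN marked the result as valid.
        "meets_threshold": valid and percent >= threshold,
    }


# Values copied from the example output above.
example = json.loads("""{
  "TotalMicrosatelliteSitesAssessed": "20020",
  "TotalMicrosatelliteSitesUnstable": "4374",
  "ResultIsValid": "true"
}""")
result = summarize(example)
print(round(result["percent_unstable"], 2), result["meets_threshold"])  # 21.85 True
```

For a real run, the dictionary would be loaded with `json.load` from `<output prefix>.microsat_output.json` instead of the inline string.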

## MSI Debugging Output

There are two other output files (`*_diffs.txt` and `*.dist`) that are useful for debugging.

Here is an example of a `*_diffs.txt` file:

```
#Chromosome	Start	RepeatUnit	Assessed	Distance	PValue	PassFilter
chr1	69106	T	true	0.04105300052	0.4786448589	true
chr1	69116	TC	false	0	0	false
```

The fourth column (Assessed) is the coverage filter. It is true for any site whose coverage meets the `msi-coverage-threshold` setting (**coverage >= 60** in the example commands above).

The seventh column (PassFilter) is an internal flag used for the left allele filter. It removes low-quality sites that have no coverage and helps to increase prediction accuracy. It is true when the following conditions are met:

```
1. read count for [reference repeat length] > 0
2. read count for [reference repeat length - 1] >= 0
3. the ratio of read count for [reference repeat length - 1] and [reference repeat length] >= 0
```

The `*.dist` file stores the read counts for each repeat length of the microsatellite site:

```
#chromosome	location	repeat_unit_bases	reference_allele	covered	length_distribution
chr1	69106	T	5	true	0,0,0,0,103,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
```

The coverage of the site can be obtained by summing all counts in the last column.
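
The summation can be sketched in Python as shown below. The parser assumes the six whitespace-separated columns of the example, and that the n-th entry of `length_distribution` holds the read count for repeat length n (consistent with the example, where the reference allele is 5 and the fifth entry is 103). The function name and returned keys are hypothetical, and the distribution in the sample line is abbreviated.

```python
def parse_dist_line(line):
    """Parse one data line of a *.dist file into a small summary dict."""
    chrom, location, unit, ref_len, covered, dist = line.split()
    counts = [int(c) for c in dist.split(",") if c]
    ref_len = int(ref_len)
    return {
        "chrom": chrom,
        "location": int(location),
        "repeat_unit": unit,
        "reference_length": ref_len,
        "covered": covered == "true",
        # Coverage is the total number of reads across all repeat lengths.
        "coverage": sum(counts),
        # Reads supporting the reference repeat length (assumed 1-based index).
        "reference_reads": counts[ref_len - 1] if 0 < ref_len <= len(counts) else 0,
    }


# Example line from above, with the trailing zeros of the distribution trimmed.
line = "chr1\t69106\tT\t5\ttrue\t0,0,0,0,103,0,0,0"
site = parse_dist_line(line)
print(site["coverage"], site["reference_reads"])  # 103 103
```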

## MSI Verbose Output

1. `<output prefix>.microsat_output.json` (described above)
2. `<output prefix>.microsat_tumor.dist`. This file contains the repeat length array for every microsatellite.

```
#chromosome     location        repeat_unit_bases       reference_allele        covered length_dis
chr1    16200729        T       10      true    0,0,0,0,0,0,0,0,0,118,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
chr1    40361307        A       10      true    0,0,0,0,0,0,2,0,2,95,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0
```

Column `length_dis` is the repeat length array.

3. `<output prefix>.microsat_diffs.txt`. This file contains the distance metrics for every microsatellite between tumor/normal or tumor/reference normals.

```
#Chromosome     Start   RepeatUnit      Assessed        Distance        PValue  PassFilter
chr1    16200729        T       true    0.03348841224   0.002411104562  true
chr1    40361307        A       true    0.0406985608    0.0006306633961 true
chr1    156842471       T       false   0       0       true
chr1    239881908       T       true    0.003136536956  0.5983661726    true
```

Column `Assessed` indicates whether a site passes the coverage filter (`msi-coverage-threshold`). Column `PassFilter` is an internal metric and is currently not used for filtering microsatellites.
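
For inspection, the diffs file can be loaded with a few lines of Python. The sketch below assumes the seven whitespace-separated columns shown above; the function name and dictionary keys are hypothetical. It counts the assessed sites and finds the assessed site with the largest distance.

```python
def parse_diffs(lines):
    """Parse *_diffs.txt lines into a list of per-site dictionaries."""
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start, unit, assessed, distance, pvalue, pass_filter = line.split()
        rows.append({
            "chrom": chrom,
            "start": int(start),
            "repeat_unit": unit,
            "assessed": assessed == "true",
            "distance": float(distance),
            "pvalue": float(pvalue),
            "pass_filter": pass_filter == "true",
        })
    return rows


# Lines copied from the example above.
example = [
    "#Chromosome Start RepeatUnit Assessed Distance PValue PassFilter",
    "chr1 16200729 T true 0.03348841224 0.002411104562 true",
    "chr1 40361307 A true 0.0406985608 0.0006306633961 true",
    "chr1 156842471 T false 0 0 true",
    "chr1 239881908 T true 0.003136536956 0.5983661726 true",
]
rows = parse_diffs(example)
assessed = [r for r in rows if r["assessed"]]
largest = max(assessed, key=lambda r: r["distance"])
print(len(rows), len(assessed), largest["start"])  # 4 3 40361307
```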

## MSI Algorithm

The MSI algorithm performs the following steps:

1. Tabulates tumor and normal counts from the read alignments for each microsatellite site.
2. Calculates the Jensen-Shannon distance between the tumor and normal distributions for each microsatellite site (`tumor-normal` mode), or the Jensen-Shannon distance of two normal baseline samples (`tumor-only` mode).
3. Determines unstable sites by performing chi-square testing of the tumor and normal distributions. Unstable sites have repeat length distributions that are significantly shifted between tumor and normal, as measured by the Jensen-Shannon distance (`tumor-normal` mode). In `tumor-only` mode, the Jensen-Shannon distance is calculated for each pair of tumor and normal reference samples, as well as for each pair of normal-normal samples. The two sets of distances are then compared to derive a mean distance difference and a p-value calculated from a Student's t-test. Microsatellite instability is called if the mean distance difference is greater than or equal to the distance threshold (default 0.1) and the p-value is less than or equal to the p-value threshold (default 0.01).
4. Produces a report containing the assessed site count, the unstable site count, the percentage of unstable sites among all assessed sites, and the sum of the Jensen-Shannon distances of all unstable sites.
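
To make the distance in step 2 concrete, the following Python sketch computes a Jensen-Shannon distance between two repeat-length count arrays. It is an illustration under stated assumptions, not DRAGEN's implementation: it uses the base-2 logarithm (so the distance lies between 0 and 1) and takes the square root of the Jensen-Shannon divergence, and the function names and example counts are hypothetical.

```python
import math


def normalize(counts):
    """Turn read counts into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize an empty distribution")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(counts_a, counts_b):
    """Jensen-Shannon distance between two repeat-length count arrays."""
    if len(counts_a) != len(counts_b):
        raise ValueError("count arrays must have the same length")
    p, q = normalize(counts_a), normalize(counts_b)
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))


# Hypothetical counts for one site: the tumor shows extra shorter alleles.
normal_counts = [0, 0, 0, 0, 118, 0]
tumor_counts = [0, 2, 0, 2, 95, 3]
print(js_distance(normal_counts, normal_counts))  # 0.0
print(js_distance(tumor_counts, normal_counts) > 0.0)  # True
```

Identical distributions give a distance of 0 and distributions with no shared repeat lengths give a distance of 1 under these assumptions, which is why a fixed threshold such as 0.1 can be compared across sites.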
