# Hurdle model

*Hurdle model* is a statistical test for differential analysis that uses a two-part model: a discrete (logistic) part that models zero vs. non-zero counts, and a continuous (log-normal) part that models the distribution of the non-zero counts. For RNA-Seq data, the discrete part can be thought of as modeling whether or not a gene is expressed, and the continuous part as modeling how much it is expressed when it is expressed. *Hurdle model* is well suited to data sets in which features have many zero values, such as single cell RNA-Seq data.

With default settings, *Hurdle model* is equivalent to MAST, a published differential analysis tool designed for single cell RNA-Seq data that uses a hurdle model \[1].
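The two-part structure described above can be sketched in base R for a single feature. This is only an illustration on simulated data, not the implementation used by the task or by MAST: the variable names, the simulated values, and the way the two likelihood-ratio statistics are combined (summing the statistics and their degrees of freedom) are assumptions made for the example.

```r
# Minimal two-part (hurdle) fit for one feature, using base R only.
set.seed(1)

n     <- 200
group <- factor(rep(c("control", "treated"), each = n / 2))

# Simulated log-scale expression: many exact zeros, roughly normal positives.
detected <- rbinom(n, 1, ifelse(group == "treated", 0.6, 0.3))
level    <- rnorm(n, mean = ifelse(group == "treated", 2.5, 2.0), sd = 0.5)
cells    <- data.frame(group = group, y = detected * level)

background      <- 0  # hypothetical background expression level
cells$expressed <- as.integer(cells$y > background)

# Discrete part: logistic regression on expressed vs. not expressed.
fit_d  <- glm(expressed ~ group, family = binomial(), data = cells)
null_d <- glm(expressed ~ 1,     family = binomial(), data = cells)

# Continuous part: linear model on the expressed cells only
# (values are already on a log scale, hence "log-normal").
pos    <- cells[cells$expressed == 1, ]
fit_c  <- lm(y ~ group, data = pos)
null_c <- lm(y ~ 1,     data = pos)

# Likelihood-ratio statistic for each part.
lr_d <- as.numeric(2 * (logLik(fit_d) - logLik(null_d)))
lr_c <- as.numeric(2 * (logLik(fit_c) - logLik(null_c)))

# Assumed combination: sum the statistics, one degree of freedom per part.
p_hurdle <- pchisq(lr_d + lr_c, df = 2, lower.tail = FALSE)
```

A feature can therefore be called differential because it is detected in a different fraction of cells, because its level differs among the cells that express it, or both.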

## Running Hurdle model

We recommend normalizing your data prior to running *Hurdle model*, but it can be invoked on any counts data node.

* Click the counts data node
* Click the **Differential analysis** section in the toolbox
* Click **Hurdle model**
* Select the factors and interactions to include in the statistical test

Numeric and categorical attributes can be added as factors. To add attributes as factors, check the attribute check boxes and click **Add factors**. To add interactions between attributes, select at least two attributes by clicking check boxes and click **Add interaction**.

<div align="left"><figure><img src="https://580316046-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWMxqQAMFOJtu98OBk9KN%2Fuploads%2Fgit-blob-5b380f44c8347f400bfb726c56b5f4279c3e1129%2Fimage%20(93).png?alt=media" alt=""><figcaption></figcaption></figure></div>

* Click **Next**
* Define comparisons between factor or interaction levels

Adding comparisons in *Hurdle model* uses the same interface as [ANOVA/LIMMA-trend/LIMMA-voom](https://help.connected.illumina.com/icm/analyses/analysis-functionality/task-menu/statistics/differential-analysis/anova-limma-trend-limma-voom). Start by choosing a factor or interaction from the *Factor* drop-down list. The levels of the factor or interaction will appear in the left-hand panel. Select levels in the panel on the left and click the **>** arrow buttons to add them to the top or bottom panels on the right. The control level(s) should be added to the bottom box and the experimental level(s) should be added to the top box. Click **Add comparison** to add the comparison to the *Comparisons* table. Only comparisons in the *Comparisons* table will be included in the statistical test.

<div align="left"><figure><img src="https://580316046-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWMxqQAMFOJtu98OBk9KN%2Fuploads%2Fgit-blob-c7d212a1ec4d77a40fc222f42b849d95e34b0dfc%2Fimage%20(94).png?alt=media" alt=""><figcaption></figcaption></figure></div>

* Click **Finish** to run the statistical test

*Hurdle model* produces a *Feature list* task node. The results table and options are the same as for [ANOVA/LIMMA-trend/LIMMA-voom](https://help.connected.illumina.com/icm/analyses/analysis-functionality/task-menu/statistics/differential-analysis/anova-limma-trend-limma-voom). The percentages of cells in which the feature is detected (its value is above the background threshold) in the different groups (Pct(group1), Pct(group2)) are calculated and included in the *Hurdle model* report.

<figure><img src="https://580316046-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWMxqQAMFOJtu98OBk9KN%2Fuploads%2Fgit-blob-43a27c93da568df2e0c83828d6434646eb60291a%2Fimage%20(95).png?alt=media" alt=""><figcaption></figcaption></figure>

## Hurdle model advanced options

#### Multiple test correction

Multiple test correction can be performed on the p-values of each comparison, with **FDR step-up** being the default. If you check *Storey q-value*, an extra column with q-values will be added to the report.
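As a rough illustration of what a step-up FDR correction does to a set of p-values, the sketch below uses the Benjamini-Hochberg method from base R. It assumes that **FDR step-up** corresponds to Benjamini-Hochberg; the feature names and p-values are made up for the example.

```r
# Hypothetical raw p-values for one comparison, one per feature.
p_values <- c(geneA = 0.0004, geneB = 0.012, geneC = 0.04,
              geneD = 0.2,    geneE = 0.7)

# Benjamini-Hochberg step-up adjustment (assumed equivalent of "FDR step-up").
fdr <- p.adjust(p_values, method = "BH")

# Features that pass a 5% FDR cut-off.
significant <- names(p_values)[fdr <= 0.05]
```

Adjusted values are never smaller than the raw p-values, so a feature such as the hypothetical `geneC` can be below 0.05 before correction and above it afterwards.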

#### Use only reliable estimation results

Sometimes the model estimation procedure does not fail outright but still encounters difficulties. In such cases it may still generate a p-value and fold change for a comparison, but these values are not reliable and can be misleading. For this reason, *Use only reliable estimation results* is set to **Yes** by default.

#### Data has been transformed with log base

Shows the current scale of the input data for this task.

#### Background expression level

Set the threshold for a feature to be considered expressed for the two-part hurdle model. If the feature value is greater than the specified value, it is considered expressed. If the upstream data node contains log-transformed values, be sure to specify the value on the same log scale. Default value is **0**.
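The thresholding rule, and the per-group detection percentages that appear in the report, can be sketched as follows. The values, group labels, and variable names are hypothetical; the only rule taken from the text above is that a value strictly greater than the background level counts as expressed.

```r
# Hypothetical values of one feature in six cells, on a log2 scale.
values <- c(0, 0, 1.8, 0.3, 2.4, 0)
group  <- factor(c("group1", "group1", "group1",
                   "group2", "group2", "group2"))

# Background expression level; must be on the same scale as `values`.
background <- 0

# A cell expresses the feature if its value exceeds the background level.
expressed <- values > background

# Percentage of cells per group in which the feature is detected.
pct <- tapply(expressed, group, mean) * 100
```

Raising `background` to 0.5 in this example would reclassify the cell with value 0.3 as not expressed and lower the percentage for `group2`.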

#### Shrinkage of error term variance

Applies shrinkage to the error variance in the continuous (log-normal) part of the hurdle model. The error term variance will be shrunk towards a common value, and a shrinkage plot will be produced on the task report page if enabled. Default is **Enabled**.
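To give an intuition for shrinking variances towards a common value, the sketch below uses one generic form of shrinkage: a weighted average of each feature's own variance and a shared prior variance, weighted by degrees of freedom. This is an assumption chosen for illustration and is not necessarily the estimator used by the task; in particular, the prior variance and prior degrees of freedom are simply picked by hand here, whereas a real method would estimate them from the data.

```r
# Hypothetical per-feature residual variances and residual degrees of freedom.
s2 <- c(0.10, 0.45, 0.80, 2.50)
df <- c(3, 3, 3, 3)

# Assumed common value and its weight (hand-picked for this example).
prior_s2 <- mean(s2)
prior_df <- 4

# Weighted average of the common value and each feature's own variance.
shrunk_s2 <- (prior_df * prior_s2 + df * s2) / (prior_df + df)
```

Features with few expressing cells have few degrees of freedom, so their variance estimates are pulled more strongly towards the common value, which stabilizes the test statistics for those features.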

#### Shrinkage of regression coefficients

Applies shrinkage to the regression coefficients in the discrete (logistic) part of the hurdle model. The initial versions of MAST contained a bug that was fixed in its R source in March 2020. However, for the sake of reproducibility, the fix was released only on a topic branch of the MAST GitHub repository \[2], and the default version of MAST was left unchanged. To install the fixed version of MAST in R, run the following R script.

```
# Uninstall the default version of MAST, if it's installed.
if ("MAST" %in% rownames(installed.packages())) remove.packages("MAST")

# Install devtools, if it's not installed yet, then load it.
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
library("devtools")

# Install MAST from the fix/bayesglm topic branch of the RGLab/MAST repository.
install_github("RGLab/MAST", ref = "fix/bayesglm")

library(MAST)
```

In Connected Multiomics, the user can switch between the fixed and default versions by selecting **Fixed version** or **Default version**, respectively. To disable the shrinkage altogether, choose **Disabled**.

## References

\[1] Finak, G., McDavid, A., Yajima, M., Deng, J., Gersuk, V., Shalek, A. K., ... & Linsley, P. S. (2015). MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome biology, 16(1), 278.

\[2] MAST topic branch that contains the regression coefficient shrinkage fix: <https://github.com/RGLab/MAST/tree/fix/bayesglm>
