# Illumina Connected Multiomics

## Getting Started

To get started with ICM, refer to the following pages of the ICM user guide:

* [Registration and Login](https://help.connected.illumina.com/icm/introduction/icm)
* [Data Inputs](https://help.connected.illumina.com/icm/introduction/data-inputs)
* [Creating a Study from an ICA Project](https://help.connected.illumina.com/icm/studies/create-study)
* [Viewing Results and Navigating in ICM](https://help.connected.illumina.com/icm/studies/enter-study)

## Default Sample Settings

Each Illumina DRAGEN Single Cell sample contains a feature-barcode matrix file. By default, the feature name is used as the primary feature identifier, and features whose ID is not unique are summarized using 'Mean' as the Deduplication method. The count value format is 'Raw counts', whether the 'filtered' or the unfiltered sparse matrix files are used. All features and cells with a total read count of at least 400 are reported. These settings are shown below.

<figure><img src="https://2107948471-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqVEYIKB8JFfdScsTocFN%2Fuploads%2FKM7suV0cpo7kcl1ABtW1%2Fimage.png?alt=media&#x26;token=946ad50a-e191-4426-9201-f42ee6e49281" alt=""><figcaption></figcaption></figure>

If any changes to the default Illumina DRAGEN Single Cell sample settings are desired, add data to your study using the scRNA feature-barcode-matrix option and then create a Custom: Third Party Assays analysis. After initiating the analysis, file format options can be adjusted through the required user input on the analysis. Once finished, the import will proceed with the options selected.
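The default deduplication described in this section (summarizing features that share an ID by their mean) can be illustrated with a minimal sketch. The function name, the data layout, and the feature IDs below are hypothetical and chosen only for illustration; this is not ICM's implementation.

```python
from collections import defaultdict
from statistics import mean


def deduplicate_by_mean(rows):
    """Collapse rows that share a feature ID by averaging their counts per cell.

    `rows` is a list of (feature_id, counts) pairs, where `counts` holds one
    value per cell. The name and layout are hypothetical.
    """
    grouped = defaultdict(list)
    for feature_id, counts in rows:
        grouped[feature_id].append(counts)
    # zip(*group) walks the duplicated rows cell by cell.
    return {
        feature_id: [mean(cell_values) for cell_values in zip(*group)]
        for feature_id, group in grouped.items()
    }


# Made-up rows: "F1" appears twice, "F2" once; three cells per row.
rows = [
    ("F1", [2, 0, 4]),
    ("F1", [4, 2, 0]),  # duplicate ID, averaged with the row above
    ("F2", [1, 1, 1]),
]
deduplicated = deduplicate_by_mean(rows)
```

Here the two `F1` rows collapse into a single row holding the per-cell mean, while `F2` is left unchanged.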

## Default Single Cell Analysis

Below are explanations of the steps that are run in the default single-cell analysis that is automatically launched on import of single-cell data in ICM. Also included below are the instructions to launch each step manually if input parameters need to be adjusted from the default settings.

### Normalize Counts

Because different cells will have a different number of total counts, it is important to normalize the data prior to downstream analysis. For droplet-based single cell isolation and library preparation methods that use a 3' counting strategy, where only the 3' end of each transcript is captured and sequenced, we recommend the following normalization:

1. CPM (counts per million)
2. Add 1
3. Log2

This accounts for differences in total UMI counts per cell and log transforms the data, which makes the data easier to visualize. While the above normalization is already run in a default analysis, additional normalization can be done by following the steps below.

* Click the counts node you wish to normalize
* Click **Normalization and scaling** in the context-sensitive task menu on the right
* Click **Normalization**
* Click **Use recommended** to add the recommended normalization scheme

This adds *CPM (counts per million), Add 1,* and *Log2* to the *Normalization order* panel. Normalization steps are performed in the order listed, from top to bottom.

* Click **Finish** to apply the normalization
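As a rough illustration of what the recommended scheme computes for one cell (CPM, then add 1, then log2), consider the sketch below. It uses plain Python on toy counts; the function name is hypothetical and the code is not taken from ICM.

```python
import math


def normalize_cell(counts):
    """Apply CPM, then add 1, then log2 to the counts of a single cell."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cell has no counts")
    cpm = [c / total * 1_000_000 for c in counts]
    return [math.log2(value + 1) for value in cpm]


# Two toy cells with the same relative expression but different depth.
shallow_cell = [1, 3, 0]
deep_cell = [10, 30, 0]
shallow_norm = normalize_cell(shallow_cell)
deep_norm = normalize_cell(deep_cell)
```

Both cells end up with identical normalized values, which is the point of scaling by the total count. Adding 1 before the log keeps features with zero counts at zero.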

<figure><img src="https://2107948471-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqVEYIKB8JFfdScsTocFN%2Fuploads%2FiiGJm7U0uY3pvrzMLfTx%2FNormalization.png?alt=media&#x26;token=cbe65264-befa-4cc1-8725-9e2030b9f95b" alt=""><figcaption></figcaption></figure>

### PCA

Principal component analysis (PCA) is an exploratory technique that describes the structure of high-dimensional data by reducing its dimensionality to a set of principal components (PCs). Because PCA is used to reduce the dimensionality of the data prior to clustering as part of a standard single cell analysis workflow, it is useful to examine the results of PCA for your data set prior to clustering.

* Select a data node containing the normalized and filtered count matrix
* Click **Exploratory analysis** in the task menu
* Click **PCA** from the drop-down list
* Select the number of features to include
* Select the number of PCs to calculate

You can choose *Features contribute* **equally** to standardize the genes prior to PCA or allow more variable genes to have a larger effect on the PCA by choosing **by variance**. By default, we take variance into account and focus on the most variable genes.

If you have multiple samples, you can choose to run PCA for each sample individually or for all samples together by selecting or not selecting the *Split by sample* option.

* Click **Finish** to run

<figure><img src="https://2107948471-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqVEYIKB8JFfdScsTocFN%2Fuploads%2FOsxAVIXw3Yfwaq58I9dt%2FPCA.png?alt=media&#x26;token=05667588-e816-49b6-ae73-231a72d4f299" alt=""><figcaption></figcaption></figure>

A new *PCA* task node will be produced on the task graph for the analysis. When complete, double-click the **PCA** task node to open the 3D PCA scatter plot in data viewer.

Besides the PCA coordinates of the cells, the PCA task report also includes the Scree plot, the component loadings table, and the PC projections table.

The Scree plot lists PCs on the x-axis and the amount of variance explained by each PC on the y-axis, measured as an eigenvalue. The higher the eigenvalue, the more variance is explained by the PC. Typically, after an initial set of highly informative PCs, the amount of variance explained by analyzing additional PCs is minimal. By identifying the point where the Scree plot levels off, you can choose an optimal number of PCs to use in downstream analysis steps like graph-based clustering, UMAP and t-SNE.
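To make the idea of the Scree plot "leveling off" concrete, the sketch below converts eigenvalues into the fraction of variance explained and counts the leading PCs above a cutoff. The eigenvalues are made up, and the 5% cutoff is an arbitrary heuristic chosen for illustration, not an ICM default; in practice the choice is usually made by eye.

```python
def explained_variance(eigenvalues):
    """Return the fraction of total variance explained by each PC."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def pcs_before_plateau(eigenvalues, min_fraction=0.05):
    """Count leading PCs that each explain at least `min_fraction` of variance.

    The 5% cutoff is an arbitrary illustration, not an ICM default.
    """
    count = 0
    for fraction in explained_variance(eigenvalues):
        if fraction < min_fraction:
            break
        count += 1
    return count


# Made-up eigenvalues, sorted from PC1 downward as on a Scree plot.
eigenvalues = [50.0, 25.0, 15.0, 4.0, 3.0, 3.0]
fractions = explained_variance(eigenvalues)
n_pcs = pcs_before_plateau(eigenvalues)
```

With these toy values the first three PCs carry most of the variance, and the remaining PCs each add only a few percent.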

### Graph-based Clustering

Graph-based clustering identifies groups of similar cells using PC values as the input. By including only the most informative PCs, noise in the data set is excluded, improving the results of clustering.

* Click the *PCA* data node
* Click **Exploratory analysis** in the task menu
* Click **Graph-based clustering**

Clustering can be performed on each sample individually or on all samples together.

* Select the **Clustering algorithm** to use. The default Single-Cell analysis uses the Louvain algorithm.
* Check **Compute biomarkers** to compute features that are highly expressed when comparing each cluster
* Select the number of **PCs to use**
* Click **Configure** to access the *Advanced options*

The *Number of principal components* can be set based on your examination of the Scree plot and component loadings table. The default value is likely exhaustive for most data sets; altering this value may introduce noise that influences the number of clusters that are distinguished.

* Click **Finish** to run the task

<figure><img src="https://2107948471-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqVEYIKB8JFfdScsTocFN%2Fuploads%2Fzsi0QH63H3YWSHRpQksj%2FGraph%20based%20clustering.png?alt=media&#x26;token=4ee18aa0-8ecf-4ed3-a9f5-79f38f648ca3" alt=""><figcaption></figcaption></figure>

New *Graph-based clusters* and *Biomarkers* data nodes will be generated along with the task nodes.

* Double-click the **Graph-based clusters** node to see the cluster results and statistics. The *Graph-based clustering result* lists the *Total number of clusters* and what proportion of cells fall into each cluster.
* Double-click the **Biomarkers** node to see the computed biomarkers if you have selected this option. The *Biomarkers* node includes the top features for each graph-based cluster. It displays the top-10 genes that distinguish each cluster from the others. **Download** at the top left of the table can be used to view and save more features. These are calculated using an ANOVA test comparing the cells in each group to all the other cells, filtering to genes that are 1.5 fold upregulated, and sorting by ascending p-value. This ensures that the top-10 genes of each cluster are highly and disproportionately expressed in that cluster.
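The selection logic for the top biomarkers (keep upregulated genes, then sort by ascending p-value) can be sketched as follows. The record keys, gene names, and values are hypothetical, and treating "1.5 fold upregulated" as a fold change of at least 1.5 is an assumption of this sketch.

```python
def top_biomarkers(results, min_fold_change=1.5, top_n=10):
    """Keep upregulated genes and return the `top_n` with the smallest p-values.

    `results` is a list of dicts with hypothetical keys "gene",
    "fold_change", and "p_value" for one cluster versus all other cells.
    """
    upregulated = [r for r in results if r["fold_change"] >= min_fold_change]
    ranked = sorted(upregulated, key=lambda r: r["p_value"])
    return [r["gene"] for r in ranked[:top_n]]


# Made-up test results for a single cluster.
results = [
    {"gene": "geneA", "fold_change": 3.2, "p_value": 1e-8},
    {"gene": "geneB", "fold_change": 1.2, "p_value": 1e-12},  # below 1.5-fold
    {"gene": "geneC", "fold_change": 2.0, "p_value": 1e-10},
    {"gene": "geneD", "fold_change": 0.4, "p_value": 1e-9},   # downregulated
]
markers = top_biomarkers(results)
```

Note that `geneB` has the smallest p-value but is dropped by the fold-change filter, which is why the top genes are both significant and disproportionately expressed in the cluster.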

### UMAP

Uniform Manifold Approximation and Projection (UMAP) is a dimensional reduction technique. UMAP aims to preserve the essential high-dimensional structure and present it in a low-dimensional representation. UMAP is particularly useful for visually identifying groups of similar samples or cells in large high-dimensional data sets such as single cell RNA-Seq.

* Click the **Graph-based clusters** or **PCA** node
* Click **Exploratory analysis** in the task menu
* Click **UMAP**
* Select the number of **PCs to use**
* Click **Configure** to access the *Advanced options*
* Click **Finish** to run

If you have multiple samples, you can choose to run UMAP for each sample individually or for all samples together using the *Split cells by sample* option.

Like Graph-based clustering, UMAP takes PC values as its input and further reduces the data down to two or three dimensions. For consistency, you should use the same number of PCs as the input for UMAP that you used for Graph-based clustering.

A new *UMAP* task node will be produced. When complete, double-click the **UMAP** node to open the UMAP task report. Use the panel on the left to modify the plot or add more plots to this Data viewer session.

The UMAP scatter plot is interactive and can be viewed in 2D or 3D. The UMAP plot is 3D by default. You can rotate the 3D plot by left-clicking and dragging your mouse or using **Control** under *Configure*. You can zoom in and out using your mouse wheel. You can pan by right-clicking and dragging your mouse. You can use **Style** to modify *color*, *shape*, *size*, and *labeling* (e.g. add a fog effect to improve depth perception on the plot). To add a 2D plot, click **New plot**, select **2D Scatter plot**, and select UMAP as the source of the data.

## Other Single-Cell Analysis Tasks

### QA/QC

The Single-cell QA/QC task in ICM enables you to visualize several useful metrics that will help you include only high-quality cells. To invoke the Single-cell QA/QC task:

* Click a **Single cell counts** data node
* Click the **QA/QC** section of the task menu
* Click **Single cell QA/QC**

By default, all samples are used together to perform QA/QC. You can instead choose to split by sample and perform QA/QC separately for each sample.

The Single cell QA/QC configuration dialog prompts you to choose the genome assembly and annotation file. Ideally, these closely match the references used in the DRAGEN secondary analysis. The task can still be run without an annotation file, but mitochondrial counts cannot then be detected.

<figure><img src="https://2107948471-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqVEYIKB8JFfdScsTocFN%2Fuploads%2F9RF0VUglTeLG2S3o7OCi%2FQA-QC.png?alt=media&#x26;token=77c31fd0-b74a-4625-867a-8e27ac6fd590" alt=""><figcaption></figcaption></figure>

### Filter Features

A common task in single-cell RNA-Seq analysis is to filter the data to include only informative genes (features). Because there is no gold standard for what makes a gene informative and the ideal gene filtering criteria depend on your experimental design and research question, ICM has a wide variety of flexible filtering options. The Filter features step can be performed either before or after normalization.

* Select a data node containing the count matrix
* Click **Filtering** in the task menu
* Click **Filter features**
* Select the **Filter type** and **Filter criteria** desired

There are four categories of filter available: noise reduction, statistics-based, feature metadata, and feature list.

The noise reduction filter allows you to exclude genes considered background noise based on a variety of criteria. The statistics-based filter is useful for focusing on a certain number or percentile of genes based on a variety of metrics, such as variance. The metadata, saved list, and manual list filters allow you to filter your data set to include or exclude particular genes.

For example, you can use a noise reduction filter to exclude genes that are not expressed by any cell in the data set, but were included in the matrix file. To do so:

* Click the **Noise reduction filter** check box
* Set the *Noise reduction filter* to *Exclude features where* **value <= 0 in at least 99.9% of cells** using the drop-down menus and text boxes
* Click **Finish** to apply the filter
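The rule configured above (exclude a feature where the value is <= 0 in at least 99.9% of cells) can be expressed as a short predicate. This is a minimal sketch on a toy matrix with hypothetical names, not ICM's implementation.

```python
def passes_noise_filter(values, threshold=0.0, max_fraction=0.999):
    """Return False when `value <= threshold` in at least `max_fraction` of cells."""
    n_low = sum(1 for v in values if v <= threshold)
    return n_low / len(values) < max_fraction


# Toy matrix: feature name -> one count per cell (four cells).
matrix = {
    "geneA": [0, 0, 0, 0],  # never detected, so excluded
    "geneB": [0, 5, 0, 0],  # detected in one cell, so kept
}
kept = [name for name, values in matrix.items() if passes_noise_filter(values)]
```

With only four cells, a single expressing cell is enough to keep a feature. In a data set with thousands of cells, the 99.9% setting also removes features detected in only a handful of cells.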

<figure><img src="https://2107948471-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqVEYIKB8JFfdScsTocFN%2Fuploads%2FeDuIVQacKhRPfp7S8hZl%2FFilter%20Features%20.png?alt=media&#x26;token=62f86b12-5ee2-496c-8abc-47f337c2847b" alt=""><figcaption></figcaption></figure>

The default single cell pipeline uses the statistics-based filter to filter for the top 10% of features with the highest variance.
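A statistics-based filter of this kind can be sketched as ranking features by variance and keeping the top fraction. The sketch below uses population variance and rounds the number of kept features up; both choices, along with the feature names, are assumptions made for illustration.

```python
import math
from statistics import pvariance


def top_variance_features(matrix, fraction=0.10):
    """Return the names of the most variable features, highest variance first.

    Rounding the number of kept features up is an assumption of this sketch.
    """
    n_keep = math.ceil(len(matrix) * fraction)
    ranked = sorted(matrix, key=lambda name: pvariance(matrix[name]), reverse=True)
    return ranked[:n_keep]


# Ten toy features across three cells; "g7" varies the most, then "g3".
matrix = {f"g{i}": [1, 1, 1] for i in range(10)}
matrix["g7"] = [0, 5, 10]
matrix["g3"] = [1, 2, 3]
selected = top_variance_features(matrix)
```

Using sample variance instead of population variance would not change the ranking, since both differ only by a constant factor when every feature has the same number of cells.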

### t-SNE

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensional reduction technique that prioritizes local relationships to build a low-dimensional representation of the high-dimensional data that places objects that are similar in high-dimensional space close together in the low-dimensional representation. This makes t-SNE well suited for analyzing high-dimensional data when the goal is to identify groups of similar objects, such as cell types in single cell RNA-Seq data.

* Click the **Graph-based clusters** or **PCA** node
* Click **Exploratory analysis** in the task menu
* Click **t-SNE**
* Select the number of **PCs to use**
* Click **Configure** to access the *Advanced options*
* Click **Finish** to run

The t-SNE scatter plot visualization has the same functionality and style elements as the UMAP plot described above.

### Differential Analysis

A common goal in single cell analysis is to identify genes that distinguish a cell type. To do this, you can use the differential analysis tools in ICM.

* Click the **Normalized counts** results node
* Click **Statistics** in the toolbox
* Click **Differential Analysis**
* Select **ANOVA** as the Method to use for differential analysis and click **Next** (other models suggested for single-cell data include the **Hurdle model** and **Wilcoxon**, but you are not limited to these)
* Select and add the categorical and/or numeric factors for analysis
* Click **Next**

<figure><img src="https://2107948471-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqVEYIKB8JFfdScsTocFN%2Fuploads%2FC8S58cWxhlE62ZpwcsC3%2Fanova-1.png?alt=media&#x26;token=ad1cdedc-5f1d-422f-87d3-3a675f4fc670" alt=""><figcaption></figcaption></figure>

The differential analysis tool can be used to compare one group of cells to another group of cells to identify genes or features that distinguish cells. Common examples include determining distinguishing genes between one cell type and all others, two cell types, or the same cell type between two experimental conditions.

The comparison builder can be used to create any of these tests. The top panel is the numerator for fold-change calculations so usually the experimental or test groups are selected in the top panel. The bottom panel is the denominator for fold-change calculations so the control group is often selected in the bottom panel.
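The roles of the two panels can be illustrated with a simple ratio of group means. This is only a sketch on made-up linear-scale values with hypothetical names; the statistical model selected in ICM may estimate group means differently.

```python
from statistics import mean


def fold_change(numerator_values, denominator_values):
    """Ratio of the mean expression in the numerator group to the denominator group."""
    return mean(numerator_values) / mean(denominator_values)


# Made-up linear-scale expression of one gene in two groups of cells.
treated = [8.0, 12.0, 10.0]  # numerator (top panel)
control = [4.0, 6.0, 5.0]    # denominator (bottom panel)
fc = fold_change(treated, control)
fc_swapped = fold_change(control, treated)
```

Swapping the panels inverts the ratio (2.0 becomes 0.5 here), which is why the test group usually goes in the top panel and the control group in the bottom panel.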

* Add attributes/classifications to the **numerator**
* Add attributes/classifications to the **denominator**
* Select **Combine** for a single comparison or **Pairwise** for a factorial set of comparisons
* Select **Add comparison**
* Optionally select the checkbox to **Apply lowest average coverage filter** to exclude a feature if the geometric average of its values over all samples is less than the specified value. This can be useful if no noise reduction filter has already been applied in the pipeline.
* Click **Configure** to access the *Advanced options*, which include other multiple test correction options.
* Click **Finish** to run
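The optional lowest average coverage filter mentioned in the steps above compares a geometric average against a cutoff. A minimal sketch follows, assuming strictly positive values; how zero values are handled is not described in this guide, and the function names and the cutoff of 1.0 are hypothetical.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def passes_coverage_filter(values, minimum=1.0):
    """Keep a feature when its geometric mean is at least `minimum`."""
    return geometric_mean(values) >= minimum


# Made-up values for two features across three samples.
well_covered = [1.0, 10.0, 100.0]
poorly_covered = [0.1, 1.0, 2.0]
keep_first = passes_coverage_filter(well_covered)
keep_second = passes_coverage_filter(poorly_covered)
```

The geometric mean is less sensitive to a single large value than the arithmetic mean, so a feature that is high in only one sample is less likely to pass on that basis alone.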

<figure><img src="https://2107948471-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FqVEYIKB8JFfdScsTocFN%2Fuploads%2FmnuREsAka5LOwiY0JnNQ%2Fanova-2.png?alt=media&#x26;token=3e646d61-da73-485a-8e57-527d4886f356" alt=""><figcaption></figcaption></figure>

When completed, double-click the newly generated data node to open the **ANOVA** task report. The ANOVA task report lists genes on rows and the results of the statistical test (p-value, fold change, etc.) on columns. Genes are listed in ascending order by the p-value of the first comparison so the most significant gene is listed first.

#### Filter for Significant Genes

Using the filter control panel on the left, we can filter to just the genes that are significantly different for the comparison using the p-value and/or multiple test correction value (FDR step-up by default). The number of genes at the top of the filter control panel updates to indicate how many genes are left after the filters are applied.
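FDR step-up is commonly identified with the Benjamini-Hochberg procedure. Assuming that is the correction applied here, the adjusted values used for filtering can be sketched as below; the function name and p-values are made up for illustration.

```python
def fdr_step_up(p_values):
    """Benjamini-Hochberg step-up adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up p-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.20]
adjusted = fdr_step_up(p_values)
significant = [i for i, q in enumerate(adjusted) if q <= 0.05]
```

In this toy example three genes have a raw p-value below 0.05, but only the first remains below 0.05 after correction. This is why filtering on the corrected value typically leaves fewer genes than filtering on the p-value alone.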

Click **Generate filtered node** to generate a filtered version of the table for downstream analysis. A task then runs in the Analyses pipeline to produce a **Filtered feature list** data node. To find it, close the ANOVA report and navigate to the Analyses pipeline in the task graph; the filtered list can then be used for downstream tasks.

### Gene set enrichment

While a long list of significantly different genes is important information about a cell type, it can be difficult to identify what the biological consequences of these changes might be just by looking at the genes one at a time. Using enrichment analysis, you can identify gene sets and pathways that are over-represented in a list of significant genes, providing clues to the biological meaning of your results.

* Click the **Feature list** data node produced by the Differential analysis filter
* Click **Biological interpretation** in the task menu
* Click **Gene set enrichment**
* Select the **Database** to use. ICM distributes the gene sets from the Gene Ontology Consortium, but Gene set enrichment can work with any custom or public gene set database.
* Choose the latest assembly available from the Gene set drop-down
* Click **Finish**

When completed, double-click the *Gene set enrichment* task node to open the task report.

The **Gene set enrichment** task report lists gene sets on rows with an enrichment score and p-value for each. It also lists how many genes in the gene set were in the input gene list and how many were not. Clicking the Gene set ID links to the geneontology.org or KEGG page for the gene set.
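This guide does not state which statistical test produces the enrichment p-value. One common choice for this kind of over-representation analysis is a one-sided hypergeometric (Fisher's exact) test on the counts described above, sketched here with made-up numbers. The function name, the test itself, and the score defined as the negative natural log of the p-value are all assumptions of this sketch.

```python
from math import comb, log


def enrichment_p_value(in_list_in_set, list_size, set_size, background_size):
    """Upper-tail hypergeometric probability of seeing at least this overlap."""
    max_overlap = min(list_size, set_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, list_size - k)
        for k in range(in_list_in_set, max_overlap + 1)
    )
    return tail / total


# Made-up numbers: 20 background genes, 5 of them in the gene set,
# and 4 significant genes of which 3 fall in the gene set.
p = enrichment_p_value(in_list_in_set=3, list_size=4, set_size=5, background_size=20)
score = -log(p)  # assumed convention: larger score means stronger enrichment
```

A small p-value indicates that the overlap between the gene list and the gene set is larger than expected by chance given the size of the background.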

### Hierarchical clustering / heatmap

Since we have filtered to a list of significantly different genes, we can visualize these genes by generating a heatmap or bubble map.

* Click the **Filtered feature list** data node produced by the Differential analysis filter
* Click **Exploratory analysis** in the toolbox
* Click **Hierarchical clustering / heatmap**

This task is used to generate the heatmap or bubble map; choose **Heatmap** as the plot type. Under *Feature order* and *Cell order* in the *Ordering* section, you can choose to **Cluster** features (genes) and cells (samples). This performs hierarchical clustering and produces a dendrogram, which is useful for determining relationships. For single cell data sets, you may choose to forgo clustering the cells in favor of ordering them by the attribute of interest (e.g. drag and drop to order the attribute in a way that makes sense). Both ordering methods help to make the heatmap more comprehensible. For more information on this task, see the [hierarchical clustering documentation](https://help.multiomics.illumina.com/partek/partek-flow/user-manual/task-menu/exploratory-analysis/hierarchical-clustering#invoking-hierarchical-clustering).

* Select **Feature order**
* Select **Cell order**
* Optionally add any additional **Filtering**
* Click **Configure** to access the *Advanced options*
* Click **Finish** to run

### Cell Typing with ScType

ScType allows automated cell-type identification based on scRNA-seq data along with a comprehensive cell marker database as background information.

* Click the data node containing the non-normalized count matrix
* Click on **Classification** > **Single cell type** in the toolbox
* Select the marker database from the drop-down menu; the original full ScType database is provided by default
* Select categorical attributes to **Categorize by** (e.g. graph-based clusters)
* Optionally **Filter tissue types**
* Select the **SC Type algorithm** to use
* Click **Configure** to access and change any *Advanced options*
* Click **Finish** to run

A new *scType classification* task node will be produced. When complete, double-click the **Single cell type** node to open the results of the cell-type identification. For each cell, the tissue, sctype result, and typescore are reported.
