UMAP

What is UMAP?

Uniform Manifold Approximation and Projection (UMAP) is a dimensionality reduction technique [1]. UMAP aims to preserve the essential high-dimensional structure of the data and present it in a low-dimensional representation. UMAP is particularly useful for visually identifying groups of similar samples or cells in large high-dimensional data sets such as single cell RNA-Seq.

Running UMAP

The UMAP task can run on any counts data node. However, it is very computationally intensive, so we recommend running PCA first and then running UMAP on the PCA output data node using only the top few PCs.

Click the counts data node or PCA data node (recommended)
Click the Exploratory analysis section of the toolbox
Click UMAP
Click Finish to run

Initialize output values

Sets the initialization mode. Options are Spectral and Random.

Spectral - good initial points are chosen using spectral embedding (more accurate)

Random - random initial points are chosen (faster)

PCs to use

Choose how many PCs to use. The default is the top 20 PCs. Using fewer PCs reduces the run time.

Split cells by sample

Choose whether to run UMAP on all samples together or on each sample individually.

Checking the box will run UMAP on each sample individually.
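As a rough illustration of what the checkbox changes, the sketch below groups cells by a sample label so that each group could be embedded on its own. The function name `split_by_sample` and the toy data are hypothetical and are not part of the software.

```python
from collections import defaultdict

def split_by_sample(cells, sample_labels):
    """Group cell profiles by the sample each cell came from."""
    groups = defaultdict(list)
    for cell, sample in zip(cells, sample_labels):
        groups[sample].append(cell)
    return dict(groups)

# Toy data: four cells, each a short vector of PC scores (made-up values)
cells = [[0.1, 1.2], [0.3, 0.9], [2.5, -0.4], [2.7, -0.1]]
sample_labels = ["sampleA", "sampleA", "sampleB", "sampleB"]

per_sample = split_by_sample(cells, sample_labels)
# Box checked: one independent UMAP run per entry of per_sample
# Box unchecked: a single UMAP run on all of `cells`
```

Embeddings computed in separate runs do not share a coordinate system, so point positions should not be compared between samples when the box is checked.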

UMAP produces a UMAP task node. Opening the task report launches a scatter plot showing the UMAP results. Each point on the plot is a cell for single cell data or a sample for bulk data. The plot will open in 2D or 3D depending on the user preference.

UMAP vs. t-SNE

Both t-SNE and UMAP are dimensionality reduction techniques that are useful for identifying groups of similar samples in large high-dimensional data sets. A comparison of the techniques for visualizing single cell RNA-Seq data by the authors of UMAP suggests that UMAP runs faster, is more reproducible, gives a more meaningful organization of clusters, and preserves more information about the global structure of the data than t-SNE [2].

We find UMAP to be more informative than t-SNE for many data sets. For example, the similarities and differences between clusters are clearly visible with UMAP, but more difficult to judge with t-SNE.

Advanced UMAP parameters

Local neighborhood size

UMAP preserves the local structure of the data by focusing on the distances between each point and its nearest neighbors. Local neighborhood size is the number of nearest neighbors to consider.

You can adjust this value to prioritize global or local relationships. Smaller values will give a more local view, while larger values will give a more global view. Default is 30.
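The effect of the neighborhood size can be seen in a minimal nearest-neighbor sketch. This covers only the neighbor search, not UMAP itself, and the function name, toy points, and brute-force Euclidean search are assumptions made for illustration.

```python
import math

def nearest_neighbors(points, k):
    """Return, for each point, the indices of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, paired with that point's index
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

# Two tight groups of three points, far apart from each other
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

local_view = nearest_neighbors(points, k=2)   # neighbors stay inside each group
global_view = nearest_neighbors(points, k=4)  # neighbors must reach into the other group
```

With `k=2` every point only sees members of its own group, whereas with `k=4` each point is also linked to the distant group. This is the sense in which a larger neighborhood gives a more global view.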

Minimal distance

The effective minimum distance between embedded points. Smaller values will create a more clustered embedding, while larger values will create a more evenly dispersed embedding.

You can decrease this value to make clusters more tightly packed or increase it to make them looser. Default is 0.3.
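To see why a larger value spreads points out, it can help to look at the target similarity curve used when the embedding is optimized. The sketch below follows the curve that the open-source umap-learn package approximates, where similarity stays at 1 up to the minimal distance and then decays exponentially. Whether this software uses exactly the same curve is an assumption, as are the `spread` parameter and its value of 1.0.

```python
import math

def target_similarity(d, min_dist=0.3, spread=1.0):
    """Target similarity for two embedded points separated by distance d."""
    if d <= min_dist:
        return 1.0  # closer than min_dist: nothing is gained by moving nearer
    return math.exp(-(d - min_dist) / spread)

# Compare how similarity falls off for three settings of min_dist
for min_dist in (0.05, 0.3, 0.8):
    values = [round(target_similarity(d, min_dist), 3) for d in (0.1, 0.5, 1.0)]
    print(min_dist, values)
```

Under this assumption, a small minimal distance rewards points for packing tightly, while a large one leaves a wide plateau in which neighboring points have no incentive to move closer.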

Distance metric

The metric to use when computing distances in high-dimensional space. Options are Euclidean, Manhattan, Chebyshev, Canberra, Bray Curtis, and Cosine. Default is Cosine.
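For reference, the listed metrics can be written out in a few lines each. These are the standard textbook definitions in plain Python; how the software handles edge cases such as all-zero vectors is not specified here, so the handling below is an assumption.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def canberra(x, y):
    # Coordinates where both values are zero contribute nothing (assumed convention)
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, y) if a != 0 or b != 0)

def bray_curtis(x, y):
    denom = sum(abs(a + b) for a, b in zip(x, y))
    # Returns 0.0 when the denominator is zero (assumed convention)
    return sum(abs(a - b) for a, b in zip(x, y)) / denom if denom else 0.0

def cosine(x, y):
    # Cosine distance = 1 - cosine similarity; assumes neither vector is all zeros
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

# y has the same expression pattern as x at twice the magnitude
x = (1.0, 2.0, 3.0)
y = (2.0, 4.0, 6.0)
for metric in (euclidean, manhattan, chebyshev, canberra, bray_curtis, cosine):
    print(metric.__name__, round(metric(x, y), 4))
```

In this example the cosine distance is essentially zero because the two vectors point in the same direction, while the other metrics report a clear difference. Cosine compares the pattern of values and ignores overall magnitude.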

Number of iterations

UMAP uses an iterative algorithm to optimize the low-dimensional representation. The value 0 corresponds to the default, which chooses the number of iterations based on the size of the input data. More iterations will result in a more accurate embedding, but will take longer to run. Default is 0.
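The "0 means automatic" convention could be resolved along the following lines. The function name is hypothetical, and the thresholds (500 iterations for up to 10,000 points, 200 otherwise) are borrowed from the open-source umap-learn package; whether this software uses the same rule is an assumption.

```python
def resolve_iterations(n_iterations, n_points):
    """Return the number of optimization iterations to run.

    A value of 0 means "choose automatically from the data size".
    """
    if n_iterations < 0:
        raise ValueError("n_iterations must be 0 (automatic) or a positive integer")
    if n_iterations > 0:
        return n_iterations  # an explicit value always wins
    # Assumed rule, borrowed from umap-learn: smaller inputs get more iterations
    return 500 if n_points <= 10000 else 200

small = resolve_iterations(0, n_points=5000)
large = resolve_iterations(0, n_points=50000)
explicit = resolve_iterations(300, n_points=50000)
```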

Random generator seed

Several parts of UMAP use a random number generator, and the seed sets its initial state. Default is 42. To reproduce results, use the same seed for every run.
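The role of the seed can be shown with a small sketch of a seeded random initialization. The function name and the coordinate range of -10 to 10 are illustrative assumptions and do not describe the software's internals.

```python
import random

def random_init(n_points, n_dims=2, seed=42):
    """Draw random starting coordinates from a generator with a fixed seed."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return [[rng.uniform(-10, 10) for _ in range(n_dims)] for _ in range(n_points)]

run_a = random_init(5, seed=42)
run_b = random_init(5, seed=42)  # same seed: identical starting points
run_c = random_init(5, seed=7)   # different seed: different starting points
```

Two runs with the same seed start from identical coordinates, which is a precondition for obtaining the same embedding from the same input and parameters.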

References

[1] McInnes L and Healy J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, arXiv e-prints, 2018, arXiv:1802.03426.

[2] Becht E, McInnes L, Healy J, Dutertre C-A, Kwok IWH, Ng LG, Ginhoux F, and Newell EW, Dimensionality reduction for visualizing single-cell data using UMAP, Nature Biotechnology, 2019, 37, 38-44.
