t-SNE

What is t-SNE?

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique [1]. t-SNE aims to preserve the essential structure of high-dimensional data and present it in a low-dimensional representation. It is particularly useful for visually identifying groups of similar samples or cells in large high-dimensional data sets such as single cell RNA-Seq.

Running t-SNE

The t-SNE task can run on any counts data node. However, it is very computationally intensive, so we recommend running PCA first and then running t-SNE on the PCA output data node using the top few PCs.

Click the counts data node or PCA data node (recommended)
Click the Exploratory analysis section of the toolbox
Click t-SNE
Click Finish to run

PCs to use

Choose how many PCs to use. Default is the top 20 PCs. Using fewer PCs reduces the run time.
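
The PCA-then-t-SNE workflow can be sketched outside the application as follows. This is a minimal illustration only: it assumes the scikit-learn library and a synthetic matrix, neither of which is part of this documentation, and it does not claim to reproduce the task's internal implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for a normalized expression matrix: 300 cells x 100 features.
rng = np.random.default_rng(0)
expression = np.log1p(rng.poisson(5.0, size=(300, 100)).astype(float))

# Step 1: reduce to the top 20 PCs (the default number described above).
n_pcs = 20
pcs = PCA(n_components=n_pcs, random_state=1).fit_transform(expression)

# Step 2: run t-SNE on the PC scores rather than on the full matrix.
embedding = TSNE(n_components=2, perplexity=30, random_state=1).fit_transform(pcs)

print(pcs.shape)        # (300, 20)
print(embedding.shape)  # (300, 2)
```

Lowering n_pcs shrinks the input that t-SNE has to process, which is why fewer PCs shorten the run.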

Split cells by sample

Choose whether to run t-SNE on all samples together or on each sample individually.

Checking the box will run t-SNE on each sample individually.

Advanced options (Configure)

Perplexity

t-SNE preserves the local structure of the data by focusing on the distances between each point and its nearest neighbors. Perplexity can be thought of as the number of nearest neighbors being considered. The optimal perplexity depends on the size and density of the data. Generally, a larger and/or denser data set benefits from a higher perplexity. Default is 30. The range of possible values is 3 to 100.
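
To make the "number of nearest neighbors" reading concrete, the sketch below computes perplexity as 2 raised to the Shannon entropy of one point's neighbor probabilities, following the definition in [1]. The function names and the synthetic distances are illustrative assumptions, not part of the application.

```python
import numpy as np

def perplexity(p):
    """Perplexity 2**H of a discrete distribution p (H = Shannon entropy in bits)."""
    p = p[p > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

def neighbor_probs(sq_dists, sigma):
    """Gaussian-weighted probabilities of picking each neighbor of one point."""
    # Subtracting the minimum avoids underflow and does not change the result.
    w = np.exp(-(sq_dists - sq_dists.min()) / (2.0 * sigma ** 2))
    return w / w.sum()

def sigma_for_perplexity(sq_dists, target, lo=1e-3, hi=1e3, steps=60):
    """Bisection (in log space) for the bandwidth that gives the target perplexity."""
    for _ in range(steps):
        mid = np.sqrt(lo * hi)
        if perplexity(neighbor_probs(sq_dists, mid)) < target:
            lo = mid   # too few effective neighbors: widen the kernel
        else:
            hi = mid   # too many effective neighbors: narrow the kernel
    return np.sqrt(lo * hi)

# A uniform distribution over 30 neighbors has a perplexity of 30.
uniform_perplexity = perplexity(np.full(30, 1.0 / 30.0))

# Synthetic squared distances from one point to 200 other points.
rng = np.random.default_rng(1)
sq_dists = rng.uniform(0.1, 10.0, size=200)
sigma = sigma_for_perplexity(sq_dists, target=30.0)
achieved = perplexity(neighbor_probs(sq_dists, sigma))

print(round(uniform_perplexity, 6))  # 30.0
print(round(achieved, 3))            # 30.0
```

A higher perplexity setting corresponds to a wider kernel, so each point's position is influenced by more of its neighbors.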

Number of iterations

t-SNE uses an iterative algorithm to minimize the discrepancy between the high-dimensional and low-dimensional distances among the points. More iterations yield a more accurate low-dimensional representation, up to a point, but take longer to run. Default is 1000.

Random generator seed

Several parts of t-SNE use a random number generator to provide initial values. Default is 1. To reproduce the results, use the same random seed for every run.

Initialize output values at random

If selected, t-SNE starts from a random initial position for each point. If disabled, the initial position of each point is assigned using the top principal components extracted from the raw data. Default is enabled.
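
The two initialization modes, and the effect of the random seed, can be illustrated with a small sketch. The helper initial_layout is hypothetical, and the noise scale and the unscaled PCA projection are assumptions made for illustration. The application's exact initialization is not specified here.

```python
import numpy as np

def initial_layout(data, random_init=True, seed=1, n_dims=2):
    """Starting coordinates for the embedding (hypothetical helper)."""
    if random_init:
        # Small Gaussian noise; the seed fixes the draw.
        rng = np.random.default_rng(seed)
        return rng.normal(scale=1e-4, size=(data.shape[0], n_dims))
    # Otherwise project the centered data onto its top principal components.
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_dims].T

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 10))

same_a = initial_layout(data, random_init=True, seed=1)
same_b = initial_layout(data, random_init=True, seed=1)
other = initial_layout(data, random_init=True, seed=2)
pca_start = initial_layout(data, random_init=False)

print(np.array_equal(same_a, same_b))  # True
print(np.array_equal(same_a, other))   # False
print(pca_start.shape)                 # (50, 2)
```

The same seed gives the same starting layout, which is what makes a run repeatable. The PCA-based start involves no random draw in this sketch.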

Distance metric

The metric used to compute distances in high-dimensional space. Options are Euclidean, Manhattan, Chebyshev, Canberra, Bray-Curtis, and Cosine. Default is Euclidean.
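
For reference, the six metrics can be written out as below. These are the common textbook definitions and are assumed, not confirmed, to match the ones the application uses. The function names are illustrative.

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def chebyshev(a, b):
    return np.max(np.abs(a - b))

def canberra(a, b):
    denom = np.abs(a) + np.abs(b)
    keep = denom > 0  # terms where both values are zero contribute nothing
    return np.sum(np.abs(a - b)[keep] / denom[keep])

def bray_curtis(a, b):
    return np.sum(np.abs(a - b)) / np.sum(np.abs(a + b))

def cosine(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
print(chebyshev(a, b))  # 4.0
print(canberra(a, b))
print(bray_curtis(a, b))
print(cosine(a, b))
```

Canberra and Bray-Curtis scale each difference by the size of the values involved, and Cosine ignores vector length, so these metrics can rank neighbors differently from Euclidean on the same data.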

Learning rate

Changing the learning rate may improve the visualization. For instance, if the optimization appears to be stuck, try increasing the learning rate. However, a learning rate that is too high or too low will degrade the result.
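
How the number of iterations and the learning rate interact can be seen in a deliberately simplified gradient-descent sketch of the t-SNE objective from [1]. It is an illustration under stated assumptions, not the application's optimizer: it uses one fixed kernel bandwidth instead of a perplexity search, plain gradient descent without momentum or early exaggeration, and synthetic data. All function names are hypothetical.

```python
import numpy as np

def sq_dists(points):
    diff = points[:, None, :] - points[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def high_dim_affinities(data, sigma_sq=10.0):
    """Symmetric affinities P. Simplification: one fixed bandwidth for all
    points, whereas t-SNE tunes it per point to match the perplexity."""
    w = np.exp(-sq_dists(data) / (2.0 * sigma_sq))
    np.fill_diagonal(w, 0.0)
    cond = w / w.sum(axis=1, keepdims=True)
    p = (cond + cond.T) / (2.0 * len(data))
    return np.maximum(p, 1e-12)

def kl_and_gradient(p, y):
    """KL divergence between P and the low-dimensional Q, and its gradient."""
    w = 1.0 / (1.0 + sq_dists(y))  # Student-t kernel
    np.fill_diagonal(w, 0.0)
    q = np.maximum(w / w.sum(), 1e-12)
    kl = np.sum(p * np.log(p / q))
    m = (p - q) * w
    grad = 4.0 * (np.diag(m.sum(axis=1)) - m) @ y
    return kl, grad

def optimize(p, n_iter, learning_rate, seed=1):
    """Plain gradient descent from a small random start; returns the final KL."""
    rng = np.random.default_rng(seed)
    y = rng.normal(scale=1e-4, size=(p.shape[0], 2))
    for _ in range(n_iter):
        _, grad = kl_and_gradient(p, y)
        y -= learning_rate * grad
    return kl_and_gradient(p, y)[0]

# Two synthetic clusters of 20 points each in 10 dimensions.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 1.0, size=(20, 10)),
                  rng.normal(4.0, 1.0, size=(20, 10))])
p = high_dim_affinities(data)

kl_few = optimize(p, n_iter=5, learning_rate=5.0)
kl_many = optimize(p, n_iter=200, learning_rate=5.0)
kl_slow = optimize(p, n_iter=200, learning_rate=0.01)

print("5 iterations, rate 5.0:    ", kl_few)
print("200 iterations, rate 5.0:  ", kl_many)
print("200 iterations, rate 0.01: ", kl_slow)
```

In this toy setting, the longer run ends at a lower divergence than the short one, and the run with the very small learning rate barely moves from its starting point in the same number of iterations.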

The t-SNE task produces a t-SNE task node. Opening the task report launches a scatter plot showing the t-SNE results. Each point on the plot is a cell for single cell data or a sample for bulk data. The plot opens in 2D or 3D depending on the user's preference.

References

[1] L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.
