Graph-based clustering

What is Graph-based clustering?

Graph-based clustering is a method for identifying groups of similar cells or samples. It makes no prior assumptions about the clusters in the data. This means the number, size, density, and shape of clusters does not need to be known or assumed prior to clustering. Consequently, graph-based clustering is useful for identifying clustering in complex data sets such as scRNA-seq.

Running Graph-based clustering

Graph-based clustering can run on any counts data node, however, it is very computationally intensive, we recommend running PCA first and run this task on PC output data node using the top few number of PCs

Click the counts data node or PCA output data node (recommended)
Click the Exploratory analysis section of the toolbox
Click Graph-based clustering
Configure the parameters
Click Finish to run

Choose which version of the Louvain clustering algorithm to use. Options are Louvain [1], Louvain with refinement [2], SLM [3] and Leiden [4], the default is Leiden, choose number of PCs to use, default is the top 20 PCs. Compute biomarkers task will be performed after this task run if the checkbox is selected. The attribute used in compute biomarkers is the graph-based cluster annotation.

Cluster results

Graph-based clustering produces a Clustering result data node. The task report lists the cluster results and cluster statistics. If clustering was run with Split cells by sample enabled on a single cell counts data node, the cluster results table displays the number of clusters found for each sample and clicking the sample name opens the sample-level report.

The Clustering result data node includes cluster assignment as a new attribute, Graph-based, for each observation. If the Clustering result data node is visualized by Scatter plot, PCA, t-SNE, or UMAP, the plot will be colored by the Graph-based attribute.

Advanced Graph-based clustering parameters

Resolution

To increase the number of clusters, increase the resolution . To decrease the number of clusters, decrease the resolution. Default is 0.5.

A larger number may be more appropriate for larger numbers of cells.

Prune parameter

Removes links between pairs of points if their similarity is below the threshold. Larger values lead to a shorter run time, but can result in many singleton clusters. Default is 0.0.

Number of nearest neighbors

Clustering preserves the local structure of the data by focusing on the distances between each point and its k nearest neighbors. The optimal perplexity depends on the size and density of the data. Generally, a larger and/or more dense data set will benefit from a larger number of nearest neighbors. Increasing the number of nearest neighbors will increase the size of clusters and vice versa. Default is 30. The range of possible values is 3 to 100.

This parameter can be used to speed up clustering at the expense of accuracy. Larger scale implies greater accuracy and helps avoid singletons, but takes more time to run. To maximize accuracy, the total count of observations being clustered should be below the product of nearest neighbors and scale. Default is 100,000. The range of possible values is 1 to 100,000.

Modularity function

The modularity function measures the overall quality of clustering. Graph-based clustering amounts to finding a local maximum of the modularity function. Possibilities are Standard [5] and Alternative [6]. Default is Standard.

Number of random starts

The clustering result depends on the order observations are considered. Each random start corresponds to a different order and result. A larger number of random starts can deliver a better result because the result with the highest quality (modularity) out of all of the random starts is chosen. Increasing the number of random starts will increase the run time. The range of possible values is 3 to 1,000. The default is 100.

Random seed

The random seed is used in the random starts portion of the algorithm. Using a different seed might give a better result. Use the same random seed to reproduce results. Default is 0.

Number of iterations per random start

To maximize modularity, clustering proceeds iteratively by moving individual points, clusters, or subsets of points within clusters. A larger number of iterations can give better results, but will take longer to run. Default is 10.

Minimal cluster size

Clusters smaller than the minimal cluster size value will be merged with a nearby cluster unless they are completely isolated. To avoid isolation, set the prune parameter to zero (default) and the scale parameter to the maximum (default). Default is 1.

Sequential random starts

Enable this option to use the slower sequential ordering of random starts. Default is disabled.

Nearest Neighbor Type

Different methods for determining nearest neighbors. The K nearest neighbors (K-NN) algorithm is the standard. The NN-Descent algorithm is used by UMAP and is an alternative. Default is K-NN.

Distance metric

If NN-Descent is chosen for Nearest Neighbor Type, the metric to use when determining distance between data points in high dimensional space can be set. Options are Euclidean, Manhattan, Chebyshev, Canberra, Bray Curtis, and Cosine. Default is Euclidean.

References

[1] Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10), P10008.

[2] Rotta, R., & Noack, A. (2011). Multilevel local search algorithms for modularity clustering. Journal of Experimental Algorithmics (JEA), 16, 2-3.

[3] Waltman, L., & Van Eck, N. J. (2013). A smart local moving algorithm for large-scale modularity-based community detection. The European Physical Journal B, 86(11), 471.

[4]Traag, V.A., Waltman, L. & van Eck, N.J. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 9, 5233 (2019). https://doi.org/10.1038/s41598-019-41695-z

[5] Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical review E, 69(2), 026113.

[6] Traag, V. A., Van Dooren, P., & Nesterov, Y. (2011). Narrow scope for resolution-limit-free community detection. Physical Review E, 84(1), 016114.

PreviousExploratory analysis NextK-means clustering

Last updated 1 month ago

Was this helpful?