K-means clustering is a method for identifying groups of similar observations, i.e. cells or samples. K-means clustering aims to group observations into a pre-determined number of clusters (k) so that each observation belongs to the cluster with the nearest mean. An important aspect of K-means clustering is that it expects clusters to be of similar size (equal variance) and shape (distribution of variance is spherical). The Compare Clusters task can also be used to help determine the optimal number of K-means clusters.
We recommend normalizing your data prior to running K-means clustering, but the task will run on any counts data node.
Click the counts data node
Click the Exploratory analysis section of the toolbox
Click K-means clustering
Configure the parameters
Click Finish to run (Figure 1)
K-means clustering produces a K-means Clusters result data node; double-click to open the task report which lists the cluster statistics (Figure 2). If Compute biomarkers was enabled, top markers will be available by double-clicking the Biomarkers result data node. If clustering was run with Split by sample enabled on a single cell counts data node, the cluster results table displays the number of clusters found for each sample and clicking the sample name opens the sample-level report.
The total number of clusters is listed along with the number and percentage of cells in each cluster.
The K-means Clustering result data node includes the input values and adds cluster assignment as a new attribute, K-means, for each observation.
Choose which distance metric to use for cluster distance calculations. Options include Euclidean, Absolute Value, Euclidean Squared, Kendall Correlation, Max Value, Min Value, Pearson Correlation, Rank Correlation, Average Euclidean, Shape, Cosine, Canberra, Bray Curtis, Tanimoto, Pearson Correlation Absolute, Rank Correlation Absolute, and Kendall Correlation Absolute. The default is Euclidean.
Choose between specifying a set number of clusters or a range to test for the best fit number of clusters. The best fit is determined by the number of clusters with the lowest Davies–Bouldin index. The default is set to 10 for a fixed number of clusters. The initial values for the range option are 3 to 20 clusters.
Choose whether to run the ANOVA test comparing each cluster to all other observations to identify features that have higher values in that cluster. Default is Enabled.
This option is present in single cell data. If enabled, K-means clustering will be run separately for each sample. If disabled, K-means clustering will be run on all cells from the input data. Default is set by the Split single cell by sample option in the user preference page.
If enabled, the initial cluster centroids will be selected randomly from among the data points. If disabled, the initial cluster centroids will be selected to optimize distance between clusters. Default is Disabled.
This sets the random seed used if Random cluster initialization is enabled. Use the same random seed to reproduce results.
If enabled, all cluster centroids will be recomputed at the end of each iteration. If disabled, each cluster centroid will be recomputed as the members of the cluster change. Default is Enabled.
The maximum number of iterations to perform before settling on a set of clusters. Default is 1000.
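The parameters above map onto a short algorithm. The following is a minimal sketch of K-means with Euclidean distance, random initialization from the data points, and centroids recomputed once at the end of each iteration. It is not Partek Flow's implementation; the function name kmeans and the example points are assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, max_iter=1000, seed=0):
    """Illustrative K-means: Euclidean distance, batch centroid updates."""
    rng = random.Random(seed)
    # Random cluster initialization: pick k distinct data points as centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: recompute every centroid once per iteration.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


# Hypothetical data: two well-separated groups of three observations each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2, seed=1)
print(labels)
```

Reusing the same seed reproduces the same starting centroids, which is why the random seed matters only when random initialization is enabled.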
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Principal components analysis (PCA) is an exploratory technique used to describe the structure of high-dimensional data by reducing its dimensionality. It is a linear transformation that converts n original variables (typically genes or transcripts) into n new variables called principal components (PCs), which have three important properties:
PCs are ordered by the amount of variance explained
PCs are uncorrelated
PCs explain all variation in the data
PCA is a principal axis rotation of the original variables that preserves the variation in the data. Therefore, the total variance of the original variables is equal to the total variance of the PCs.
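The variance-preservation property can be checked by hand for two variables, where the eigenvalues of the covariance matrix have a closed form. This sketch is only an illustration of the property, not the PCA task itself; the helper names and the data values are made up.

```python
import math
from statistics import mean


def covariance_2d(xs, ys):
    """Sample variances and covariance of two variables."""
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return sxx, sxy, syy


def eigenvalues_2x2_symmetric(a, b, c):
    """Eigenvalues of [[a, b], [b, c]], largest first."""
    mid = (a + c) / 2
    half_gap = math.hypot((a - c) / 2, b)
    return mid + half_gap, mid - half_gap


# Hypothetical expression values for two features across four observations.
xs = [2.0, 4.0, 6.0, 8.0]
ys = [1.0, 3.0, 2.0, 6.0]
sxx, sxy, syy = covariance_2d(xs, ys)
pc_variances = eigenvalues_2x2_symmetric(sxx, sxy, syy)

# The PC variances (eigenvalues) sum to the total variance of the originals.
print(sum(pc_variances), sxx + syy)
```

The eigenvalues are the variances of the PCs, so their sum matching the sum of the original variances is the rotation property stated above.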
If read quantification (i.e. mapping to a transcript model) was performed by Partek E/M algorithm, PCA can be invoked on a quantification output data node (Gene counts or Transcript counts) or, after normalization, on a Normalized counts data node. Select a node on the canvas and then PCA in the Exploratory analysis section of the context sensitive menu.
There are two options for how features contribute (Figure 1):
equally: all features are standardized to a mean of 0 and a standard deviation of 1. This option gives all features equal weight in the analysis and is the default for, e.g., bulk RNA-seq data.
by variance: the analysis gives more emphasis to features with higher variance. This is the default for, e.g., single cell RNA-seq data.
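The difference between the two options can be seen on a toy matrix. The sketch below assumes rows are observations and columns are features; the function names and values are hypothetical and are not part of Partek Flow.

```python
from statistics import mean, stdev, variance


def standardize_features(matrix):
    """'equally': scale each column to mean 0 and standard deviation 1."""
    scaled_columns = []
    for col in zip(*matrix):
        m, s = mean(col), stdev(col)
        # A constant feature has zero spread; map it to zeros instead of dividing by 0.
        scaled_columns.append([(v - m) / s if s > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_columns)]


def column_variances(matrix):
    return [variance(col) for col in zip(*matrix)]


# Hypothetical data: the second feature varies far more than the first.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]

print(column_variances(data))                        # unequal: second feature dominates
print(column_variances(standardize_features(data)))  # equal after standardization
```

With by variance, PCA sees the unequal variances in the first line, so the second feature drives the leading PC. With equally, both features enter with variance 1.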
If the input data node is in linear scale, you can log transform the data as part of the PCA calculation.
The PCA task creates a new task node. To open it and see the result, either select the PCA task node and choose Task result from the context sensitive menu, or double-click the PCA task node. The report contains eigenvalues, PC projections, component loadings, and mapping error information for the first three PCs.
When the PCA node is opened in the Data viewer, it shows by default a 3D scatter plot, a Scree plot with eigenvalues, and a Component loadings table (Figure 2). Each dot on the 3D scatter plot represents an observation, and the first three PCs are shown on the X-, Y-, and Z-axes, respectively, with the information content of each PC given in parentheses.
As an exploratory tool, the PCA scatter plot is used to view any groupings in the data set and generate hypotheses based on the outcome, or to spot possible outliers.
To rotate the 3D scatter plot, left-click and drag. To zoom in or out, use the mouse wheel. Click and drag the legend to move it to a different location in the viewer.
Detailed configuration options for the PCA plot are covered in the Data viewer section under Help > How-to videos.
In the Data viewer, when a PCA data node is selected from Get Data under Setup (left panel), the node can be dragged and dropped onto the screen (Figure 3); you will then have the option to select a scree plot or tables.
Choosing the Scree plot icon draws a 2D plot in which the X-axis represents PCs and the Y-axis represents eigenvalues (Figure 4).
Mousing over a point on the line displays detailed information about that PC. The scree plot shows how much variation each PC represents, so it is often used to determine the number of principal components to keep for downstream analysis (e.g. t-SNE, UMAP, graph-based clustering). The "elbow" point of the graph, where the eigenvalues seem to level off, should be considered as a cutoff point for downstream analysis.
A PCA data node can also be drawn as a table. Choosing the Table icon displays the component loadings matrix in the viewer (Figure 5). The content can be modified using the Content configuration option; the table can be paged through there or from the lower right corner.
In the table, each row is a feature, each column represents a PC, and each value is a correlation coefficient. Under Content, there is a PCA projections option; switch to it to display the projection table (Figure 6). In this table, each row is an observation, each column is a PC, and the values are the PC scores.
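To make the scree plot reasoning above concrete, the sketch below converts eigenvalues into the share of variance each PC explains. It then picks the number of leading PCs needed to reach a cumulative threshold, which is a common stand-in for judging the elbow by eye and not a rule used by Partek Flow. The eigenvalues and the 0.8 threshold are hypothetical.

```python
def variance_explained(eigenvalues):
    """Share of total variance carried by each PC."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


def components_for_threshold(eigenvalues, threshold=0.8):
    """Smallest number of leading PCs whose cumulative share reaches the threshold."""
    cumulative = 0.0
    for count, share in enumerate(variance_explained(eigenvalues), start=1):
        cumulative += share
        if cumulative >= threshold:
            return count
    return len(eigenvalues)


# Hypothetical eigenvalues, ordered from largest to smallest as in a scree plot.
eigenvalues = [50.0, 25.0, 15.0, 5.0, 3.0, 2.0]
print(variance_explained(eigenvalues))
print(components_for_threshold(eigenvalues, threshold=0.8))
```

In this made-up example the curve flattens after the third PC, and the threshold rule selects three PCs as well; on real data the two heuristics can disagree.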
Partek Flow offers a wide variety of tools to help you explore your data. Which tools are available depends on the type of data node selected.
Compare clusters is a tool to identify the optimal number of clusters for K-means Clustering using the Davies-Bouldin index. The Davies-Bouldin index is a measure of cluster quality where a lower value indicates better clustering, i.e., the separation between points within the clusters is low (tight clusters) and separation between clusters is high (distinct clusters).
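As a rough illustration of what the index rewards, the sketch below computes the Davies-Bouldin index with Euclidean distance for two ways of partitioning the same four points. It is a textbook formulation and not necessarily the exact computation in Partek Flow; the function names and points are hypothetical.

```python
import math


def centroid(points):
    return tuple(sum(x) / len(points) for x in zip(*points))


def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of coordinate tuples (needs >= 2 clusters)."""
    centers = [centroid(pts) for pts in clusters]
    # Scatter: mean distance from a cluster's points to its own centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in pts) / len(pts)
        for pts, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    worst_ratios = []
    for i in range(k):
        # For each cluster, keep its worst (largest) similarity to another cluster.
        worst_ratios.append(max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k) if j != i
        ))
    return sum(worst_ratios) / k


# The same four hypothetical points, partitioned two different ways.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
mixed = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]

scores = {"tight": davies_bouldin(tight), "mixed": davies_bouldin(mixed)}
best = min(scores, key=scores.get)  # the lowest index wins
print(scores, best)
```

The partition with compact, well-separated clusters gets the lower index, which is the same selection rule applied when comparing different numbers of clusters.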
We recommend normalizing your data prior to running Compare clusters, but the task will run on any counts data node.
Click the counts data node
Click the Exploratory analysis section of the toolbox
Click Compare clusters
Configure the parameters
Click Finish to run (Figure 1)
The parameters for Compare clusters are the same as for K-means clustering.
The Compare clusters task report is an interactive line chart with the number of clusters on the x-axis and the Davies-Bouldin index on the y-axis (Figure 2).
The Compare clusters task report can be used to run K-means clustering.
Click a point on the plot to select it or type the number of clusters in the text box Partition data into clusters
Selecting a point sets it as the number of clusters to partition the data into. The number of clusters with the lowest Davies-Bouldin index value is chosen by default.
Click Generate clusters to run K-means clustering with the selected number of clusters
A K-means clustering task node and a Clustering result data node are produced. Please see our documentation on K-means Clustering for more details.
Graph-based clustering is a method for identifying groups of similar cells or samples. It makes no prior assumptions about the clusters in the data. This means the number, size, density, and shape of clusters do not need to be known or assumed prior to clustering. Consequently, graph-based clustering is useful for identifying clusters in complex data sets such as scRNA-seq.
We recommend normalizing your data prior to running Graph-based clustering, but the task will run on any counts data node.
Click the counts data node
Click the Exploratory analysis section of the toolbox
Click Graph-based clustering
Configure the parameters
Click Finish to run
Graph-based clustering produces a Clustering result data node. The task report lists the cluster results and cluster statistics (Figure 1). If clustering was run with Split cells by sample enabled on a single cell counts data node, the cluster results table displays the number of clusters found for each sample and clicking the sample name opens the sample-level report.
The Maximum modularity is a measure of the quality of the clustering result. Modularity measures how much cells within a cluster are similar to each other and less similar to cells in other clusters. Higher modularity indicates a better result. Optimal modularity is 1, but may not be attainable for the input data.
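For intuition, the sketch below computes the standard (Newman-Girvan) modularity of an unweighted, undirected graph for a given cluster assignment. It omits the resolution parameter and any edge weighting that Partek Flow may apply; the function name and the toy graph are assumptions for illustration.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Standard modularity: sum over clusters of (internal edge share - expected share)."""
    m = len(edges)
    internal_edges = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal_edges[community_of[u]] += 1
    return sum(
        internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Hypothetical graph: two triangles joined by a single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

two_clusters = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_cluster = {node: "A" for node in range(6)}

print(modularity(edges, two_clusters))  # higher: matches the graph structure
print(modularity(edges, one_cluster))   # 0: no better than a random grouping
```

Splitting the graph along its single bridging edge gives the higher modularity, which is the kind of partition the clustering algorithm searches for.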
The total number of clusters is listed along with the number and percentage of cells in each cluster.
The Clustering result data node includes the input values for each gene and adds cluster assignment as a new attribute, Graph-based, for each observation. If the Clustering result data node is visualized by Scatter plot, PCA, t-SNE, or UMAP, the plot will be colored by the Graph-based attribute (Figure 2).
Choose which version of the Louvain clustering algorithm to use. Options are Louvain [1], Louvain with refinement [2], Smart Local Moving (SLM) [3], and Leiden [4]. Of these, Leiden [4] is the most recently published. The default is Louvain.
Compute biomarkers will compute features that are highly expressed when comparing each cluster.
Choose whether to run Graph-based clustering on all samples together or on each sample individually.
Checking the box will run Graph-based clustering on each sample individually.
This option appears when there are multiple feature types in the input data node (e.g., CITE-Seq data).
Select Any to run on all features or pick a feature type.
To increase the number of clusters, increase the resolution. To decrease the number of clusters, decrease the resolution. Default is 0.5.
A larger number may be more appropriate for larger numbers of cells.
Removes links between pairs of points if their similarity is below the threshold. Larger values lead to a shorter run time, but can result in many singleton clusters. Default is 0.0.
Clustering preserves the local structure of the data by focusing on the distances between each point and its k nearest neighbors. The optimal number of nearest neighbors depends on the size and density of the data. Generally, a larger and/or more dense data set will benefit from a larger number of nearest neighbors. Increasing the number of nearest neighbors will increase the size of clusters and vice versa. Default is 30. The range of possible values is 3 to 100.
This parameter can be used to speed up clustering at the expense of accuracy. Larger scale implies greater accuracy and helps avoid singletons, but takes more time to run. To maximize accuracy, the total count of observations being clustered should be below the product of nearest neighbors and scale. Default is 100,000. The range of possible values is 1 to 100,000.
The modularity function measures the overall quality of clustering. Graph-based clustering amounts to finding a local maximum of the modularity function. Possibilities are Standard [5] and Alternative [6]. Default is Standard.
The clustering result depends on the order observations are considered. Each random start corresponds to a different order and result. A larger number of random starts can deliver a better result because the result with the highest quality (modularity) out of all of the random starts is chosen. Increasing the number of random starts will increase the run time. The range of possible values is 3 to 1,000. The default is 100.
The random seed is used in the random starts portion of the algorithm. Using a different seed might give a better result. Use the same random seed to reproduce results. Default is 0.
To maximize modularity, clustering proceeds iteratively by moving individual points, clusters, or subsets of points within clusters. A larger number of iterations can give better results, but will take longer to run. Default is 10.
Clusters smaller than the minimal cluster size value will be merged with a nearby cluster unless they are completely isolated. To avoid isolation, set the prune parameter to zero (default) and the scale parameter to the maximum (default). Default is 1.
Enable this option to use the slower sequential ordering of random starts. Default is disabled.
Different methods for determining nearest neighbors. The K nearest neighbors (K-NN) algorithm is the standard. The NN-Descent algorithm is used by UMAP and is an alternative. Default is K-NN.
If NN-Descent is chosen for Nearest Neighbor Type, the metric to use when determining distance between data points in high dimensional space can be set. Options are Euclidean, Manhattan, Chebyshev, Canberra, Bray Curtis, and Cosine. Default is Euclidean.
Graph-based clustering uses principal components as its input. The number of principal components to use is set here.
We recommend using the PCA task to determine the optimal number of principal components for your data. Default is 100.
Options are equally or by variance. Feature values can be standardized prior to PCA so that the contribution of each feature does not depend on its variance. To standardize, choose equally. To take variance into account and focus on the most variable features, choose by variance. Default is by variance.
You can choose to log transform the data prior to running PCA as part of Graph-based clustering. Default is disabled.
If you are log transforming the data, choose a log base. Default is 2 when Log transform data is enabled.
If you are log transforming the data, choose an offset. Default is 1 when Log transform data is enabled.
[1] Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10), P10008.
[2] Rotta, R., & Noack, A. (2011). Multilevel local search algorithms for modularity clustering. Journal of Experimental Algorithmics (JEA), 16, 2-3.
[3] Waltman, L., & Van Eck, N. J. (2013). A smart local moving algorithm for large-scale modularity-based community detection. The European Physical Journal B, 86(11), 471.
[4] Traag, V. A., Waltman, L., & van Eck, N. J. (2019). From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9, 5233. https://doi.org/10.1038/s41598-019-41695-z
[5] Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical review E, 69(2), 026113.
[6] Traag, V. A., Van Dooren, P., & Nesterov, Y. (2011). Narrow scope for resolution-limit-free community detection. Physical Review E, 84(1), 016114.
AUCell is a tool to identify cells that are actively expressing genes within a gene list [1]. For each input gene list, AUCell calculates a value for each cell by ranking all genes by their expression level in the cell and identifying what proportion of the genes from the gene list fall within the top 5% (default cutoff) of genes. This method allows the AUCell value to represent the proportion of genes from the gene list that are expressed in the cell and their relative expression compared to other genes within the cell. Because this is a rank-based method and is calculated for each cell individually, AUCell can be run on raw or normalized data. As an AUCell value is the proportion of genes from the list that are within the top percentile of expressed genes, AUCell values can range from 0 to 1, but may have a more restricted range.
AUCell values can be used directly as input for downstream analysis, such as clustering. Another common use is to set an AUCell value cutoff for expressing vs. not expressing and use this to classify cells. AUCell values will separate cells most effectively when the genes in the list are highly and specifically expressed in a population of cells. If the genes are specifically expressed, but not highly expressed, the AUCell value will not be as useful.
AUCell can be run on any single cell counts data node.
Click the single cell counts data node
Click the Exploratory analysis section in the toolbox
Click AUCell
Choose gene lists by clicking and dragging them to the panel on the right or clicking the green plus that appears after mousing over a gene list (Figure 1)
Click Finish to run
AUCell produces an AUCell result data node. The AUCell result data node includes the input counts data and adds the AUCell scores to the original data as a new data type, AUCell Values. AUCell values for each input feature list are included as features in the AUCell result data node. These features created by AUCell are named after the feature list (e.g., B cells, Cytotoxic cells).
Because the AUCell values are added as features, they can be used as input for clustering, differential analysis, and visualization tasks.
To produce a data node containing only the AUCell values, use Split matrix to split the AUCell result data node into separate data nodes for each of its data types. This can be helpful if you intend to perform downstream analysis on the AUCell values. To perform differential analysis, it is advisable to normalize the values with the Normalization task by adding a small offset (e.g. 1E-9) and applying a Logit transformation with log base 2. This will make the values continuous and suitable for differential analysis with methods such as ANOVA/LIMMA-trend/LIMMA-voom, Non-parametric ANOVA or Welch's ANOVA. For differential analysis, please check that the Low-value filter is set to None and that the values are correctly recognized as Log2 transformed in the Advanced settings.
If an AUCell result data node or other downstream data node containing AUCell Values is used as the input for AUCell, the additional AUCell values will be added as additional features of the AUCell values data type in the new AUCell result data node.
For each gene set, AUCell computes the intersection between the gene list and the input data set. If the intersection size is below the specified threshold, the gene set is ignored and no AUCell score is calculated for it. Default is 5.
To calculate the AUCell value, genes are ranked and the fraction of genes from the gene list that are above the percentile cutoff is the AUCell value. This parameter sets the percentile cutoff. Default is 5.
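The two parameters above can be illustrated with a deliberately simplified sketch that follows only the description given in this document: rank the genes in one cell, take the top percentile, and report the fraction of the gene list found there. The published AUCell method computes an area under a recovery curve, so this is not the real algorithm or Partek Flow's implementation. The function name, gene names, and values are hypothetical.

```python
import math


def top_fraction_score(cell_expression, gene_list, percent_cutoff=5.0, min_overlap=5):
    """Fraction of the gene list found among the top-ranked genes of one cell.

    cell_expression maps gene name -> expression value for a single cell.
    Returns None when fewer than min_overlap list genes are present in the data.
    """
    present = [g for g in gene_list if g in cell_expression]
    if len(present) < min_overlap:
        return None  # gene set ignored, as with the minimum intersection size
    # Rank genes from highest to lowest expression within this cell.
    ranked = sorted(cell_expression, key=cell_expression.get, reverse=True)
    n_top = max(1, math.ceil(len(ranked) * percent_cutoff / 100))
    top_genes = set(ranked[:n_top])
    return sum(g in top_genes for g in present) / len(present)


# Hypothetical cell with 100 genes; g0 is the most highly expressed.
cell = {f"g{i}": 100 - i for i in range(100)}

marker_list = ["g0", "g1", "g2", "g50", "g60"]   # 3 of 5 fall in the top 5%
sparse_list = ["g0", "g1", "not_in_data"]        # too few genes in the data

print(top_fraction_score(cell, marker_list))
print(top_fraction_score(cell, sparse_list))
```

Because only the ranking within each cell is used, multiplying every value in the cell by a constant leaves the score unchanged, which is why raw or normalized counts both work as input.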
[1] Aibar, S., González-Blas, C. B., Moerman, T., Imrichova, H., Hulselmans, G., Rambow, F., ... & Atak, Z. K. (2017). SCENIC: single-cell regulatory network inference and clustering. Nature methods, 14(11), 1083.
CellPhoneDB addresses the challenges of studying cell-cell communication in scRNA-seq and spatial data. It allows researchers to move beyond measuring gene expression alone and explore cellular communication. By analyzing scRNA-seq or spatial data with CellPhoneDB, researchers can identify potential signaling pathways and communication networks between different cell types within the sample. Partek Flow wraps the statistical analysis pipeline (method 2) from CellPhoneDB v5 [1][2] for this purpose.
Invoke the CellPhoneDB task in Partek Flow from a normalized counts data node using the Exploratory analysis section (Figure 1). We recommend running CellPhoneDB on the log normalized data directly.
To run the CellPhoneDB task:
Click a Normalized counts data node
Click the Exploratory analysis section in the toolbox
Click CellPhoneDB
On the task setup page, the grey description under each option explains it in more detail (Figure 2). For single cell RNA-Seq data, the Micro environment file is not needed. For spatial data, however, you will most likely want to provide the Micro environment file to capture the spatial context. By default, a value of 0.10 is used as the threshold to select which cells in a cluster are used for the analysis; this value can be adjusted or typed in directly. To run the task with default settings, click the Finish button.
Double-clicking the CellPhoneDB result data node opens the task report in the Data Viewer. It is a heatmap that summarizes how many significant interactions were identified for each cell type pair (Figure 3).
To explore further, the Explore CellPhoneDB results task allows users to filter CellPhoneDB results by specifying the cell type pairs and genes of interest. After clicking the CellPhoneDB data node (Figure 4a), it is the only task available under the Exploratory analysis menu (Figure 4b). Its task setup page is similarly straightforward (Figure 4c). Genes of interest are data dependent and usually come from published results of similar studies or from differential gene analysis between conditions (e.g., cancer patients vs. healthy controls). Once set up, click the Finish button to submit the job.
Double-clicking the Output matrix data node opens the task report in the Data Viewer. It is another heatmap variant that displays how your genes of interest interact in the defined cell type pairs (Figure 5). The example plot also indicates that the data are from two environments. For instructions on setting up the Micro environment file for your spatial study, refer to Figure 2. CellPhoneDB analysis classifies signaling pathways for genes of interest, and these classifications are used to annotate the heatmap in the task report.
It is important to note that the interactions are not symmetric. The authors state that, "Partner A expression is considered for the first cluster/cell type (clusterA), and partner B expression is considered on the second cluster/cell type (clusterB). Thus, IL12-IL12 receptor for clusterA-clusterB (i.e. the receptor is in clusterB) is not the same as IL-12-IL-12 receptor for clusterB-clusterA (i.e. the receptor is in clusterA), and will have different values." [3][4]
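The asymmetry can be illustrated with a small sketch of an interaction score for an ordered cluster pair. It assumes the score is the average of partner A's mean expression in the first cluster and partner B's mean expression in the second, set to zero when either partner is expressed in fewer than the threshold fraction of cells. This is a simplification that leaves out the permutation test; the function names, data layout, and values are hypothetical.

```python
from statistics import mean


def fraction_expressing(values):
    """Share of cells with non-zero expression."""
    return sum(v > 0 for v in values) / len(values)


def interaction_mean(expr, partner_a, partner_b, cluster_a, cluster_b, threshold=0.10):
    """Score for the ordered pair (cluster_a, cluster_b).

    expr[cluster][gene] is a list of per-cell expression values (assumed layout).
    """
    a_values = expr[cluster_a][partner_a]
    b_values = expr[cluster_b][partner_b]
    # Discard the interaction if either partner is expressed in too few cells.
    if min(fraction_expressing(a_values), fraction_expressing(b_values)) < threshold:
        return 0.0
    return (mean(a_values) + mean(b_values)) / 2


# Hypothetical data: the ligand is expressed only in clusterA,
# the receptor only in clusterB.
expr = {
    "clusterA": {"IL12": [2.0, 4.0], "IL12_receptor": [0.0, 0.0]},
    "clusterB": {"IL12": [0.0, 0.0], "IL12_receptor": [1.0, 3.0]},
}

print(interaction_mean(expr, "IL12", "IL12_receptor", "clusterA", "clusterB"))
print(interaction_mean(expr, "IL12", "IL12_receptor", "clusterB", "clusterA"))
```

Swapping the cluster order changes which cluster must supply the ligand and which the receptor, so the two directions give different values.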
The interactions come from the CellPhoneDB database, a manually curated repository of reviewed molecular interactions with demonstrated evidence for a role in cellular communication. [5]
Troule et al. (2023). CellPhoneDB v5: Inferring cell-cell communication from single cell multiomics data.
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensional reduction technique [1]. t-SNE aims to preserve the essential high-dimensional structure and present it in a low-dimensional representation. t-SNE is particularly useful for visually identifying groups of similar samples or cells in large high-dimensional data sets such as single cell RNA-Seq.
We recommend normalizing your data prior to running t-SNE, but the task will run on any counts data node.
Click the counts data node
Click the Exploratory analysis section of the toolbox
Click t-SNE
Click Finish to run
t-SNE produces a t-SNE task node. Opening the task report launches a scatter plot showing the t-SNE results. Each point on the plot is a cell for single cell data or a sample for bulk data. The plot will open in 2D or 3D depending on the user preference.
Choose whether to run t-SNE on all samples together or on each sample individually.
Checking the box will run t-SNE on each sample individually.
This option appears when there are multiple feature types in the input data node (e.g., CITE-Seq data).
Select Any to run on all features or pick a feature type.
t-SNE preserves the local structure of the data by focusing on the distances between each point and its nearest neighbors. Perplexity can be thought of as the number of nearest neighbors being considered. The optimal perplexity depends on the size and density of the data. Generally, a larger and/or more dense data set will benefit from a higher perplexity (Figure 2). Default is 30. The range of possible values is 3 to 100.
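The link between perplexity and a neighbor count comes from its definition as 2 raised to the Shannon entropy (in bits) of a point's neighbor probability distribution. The sketch below evaluates that definition on two hypothetical distributions; it is not part of the t-SNE task.

```python
import math


def perplexity(probabilities):
    """2 ** (Shannon entropy in bits) of a discrete probability distribution."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy


# Hypothetical case 1: a point treats 30 neighbors as equally similar.
uniform_30 = [1 / 30] * 30

# Hypothetical case 2: nearly all weight sits on a single neighbor.
concentrated = [0.97, 0.01, 0.01, 0.01]

print(perplexity(uniform_30))    # close to 30, the effective neighbor count
print(perplexity(concentrated))  # only slightly above 1
```

t-SNE adjusts the width of each point's similarity kernel until this value matches the requested perplexity, so a higher setting makes each point take more neighbors into account.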
t-SNE uses an iterative algorithm to optimize the low-dimensional representation. More iterations will result in a more accurate embedding to an extent, but will take longer to run. Default is 1000.
Several parts of t-SNE use a random number generator to provide an initial value. Default is 1. To reproduce results, use the same random seed for all runs.
If selected, t-SNE initializes from random initial positions for each point. If disabled, the initial values for each point are assigned using the largest principal components extracted from the raw data. Default is enabled.
The metric to use when computing distances in high-dimensional space. Options are Euclidean, Manhattan, Chebyshev, Canberra, Bray Curtis, and Cosine. Default is Euclidean.
If checked, mapping error information will be available in the task report. Default is disabled.
Output a t-SNE table data node that can be downloaded. The 2D t-SNE coordinates are labeled Feature 1 and Feature 2; the 3D t-SNE coordinates are labeled Feature 3, 4, and 5. Default is disabled.
t-SNE uses principal components as its input. The number of principal components to use is set here.
We recommend using the PCA task to determine the optimal number of principal components for your data. Default is 50.
Options are equally or by variance. Feature values can be standardized prior to PCA so that the contribution of each feature does not depend on its variance. To standardize, choose equally. To take variance into account and focus on the most variable features, choose by variance. Default is by variance.
You can choose to log transform the data prior to running PCA as part of t-SNE. Default is disabled.
If you are log transforming the data, choose a log base. Default is 2 when Log transform data is enabled.
If you are log transforming the data, choose an offset. Default is 1 when Log transform data is enabled.
[1] L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.
Multi-omics single cell analysis is based on simultaneous detection of different types of biological molecules on the same cells. Common multi-omics techniques include feature barcoding or CITE-seq (cellular indexing of transcriptomes and epitopes by sequencing) technologies, which enable parallel assessment of gene and protein expression. Specific bioinformatics tools have been developed to enable scientists to integrate results of multiple assays and learn relative importance of each type (or each biological molecule) in identification of cell types. Partek Flow supports weighted nearest neighbor (WNN) analysis (1), which can help combine output of two molecular assays.
This task can only be performed on data nodes containing PCA scores, i.e. PCA output nodes and graph-based clustering output nodes generated from PCA nodes. To start, select a PCA data node of one of the assays (e.g. gene expression) and go to Exploratory analysis > Find multimodal neighbors in the toolbox. On the task setup page, use the Select data node button to point to the PCA data node of the other assay (e.g. protein expression); a node is selected by default (Figure 1).
When you click the Select data node button, Partek Flow will open another dialog showing your current pipeline (Figure 2). Data nodes that can be used for WNN are shown in the color of their branch; other nodes are disabled (greyed out). To pick a node, left-click it and then click the Select button.
The selected data node is shown under the Select data node button. If you made a mistake, use the Clear selection link (Figure 1).
If a graph-based clustering task was performed on a PCA data node, its output node also carries the PCA scores from the input data, so graph-based clustering output data nodes are also valid inputs for the WNN task.
To customize the Advanced options, select the Configure link (Figure 1). At present you can only change the number of nearest neighbors for each modality (-k.nn option of the Seurat package); the default value is 20 (Figure 3). An illustration on how to use that option to assess the robustness of WNN analysis can be found in Hao et al. (1). The nearest neighbor search method is K-NN and distance metric is Euclidean.
To launch the Find multimodal neighbors task, click the Finish button on the task setup page (Figure 1). For each cell, the WNN algorithm calculates its closest neighbors based on a weighted combination of RNA and protein similarities. The output of the Find multimodal neighbors task is a WNN data node.
For downstream analysis, you can launch UMAP or graph-based clustering tasks on a WNN node. For example, Figure 4 shows a snippet of the analysis of a feature barcoding data set; gene expression and protein expression data were processed separately, and then Find multimodal neighbors was invoked on the two respective PCA data nodes. UMAP and graph-based clustering tasks were then performed on the WNN node.
For an excellent illustration on advantages of WNN algorithm for identification of cell types, please see this blog post.
Hao Y, Hao S, Andersen-Nissen E, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573-3587.e29. doi:10.1016/j.cell.2021.04.048
To analyze scATAC-seq data, Partek Flow introduced a new technique, latent semantic indexing (LSI) [1]. LSI combines term frequency-inverse document frequency (TF-IDF) normalization with singular value decomposition (SVD), which returns a reduced-dimension representation of a matrix. Although SVD and principal components analysis (PCA) are two different techniques, they are closely connected: PCA is simply an application of the SVD. Users who are more familiar with scRNA-seq can think of the SVD output as analogous to the output of PCA. Similarly, the singular values can be interpreted statistically as the variance in the data explained by the various components. The singular values produced by the SVD are ordered from largest to smallest and, when squared, are proportional to the amount of variance explained by a given singular vector.
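The two LSI ingredients can be sketched in a few lines. The TF-IDF variant below (term frequency multiplied by log(1 + cells / cells with the peak)) is one of several in common use and is an assumption, since the exact formula used by Partek Flow is not stated here. The singular values are made up rather than computed, and all function names are hypothetical.

```python
import math


def tf_idf(counts):
    """counts[cell][peak] -> TF-IDF weights, using one common variant (assumed)."""
    n_cells = len(counts)
    n_peaks = len(counts[0])
    cells_with_peak = [sum(1 for row in counts if row[j] > 0) for j in range(n_peaks)]
    weighted = []
    for row in counts:
        total = sum(row)
        weighted.append([
            # Term frequency (share of the cell's counts) times log inverse document frequency.
            (row[j] / total) * math.log(1 + n_cells / cells_with_peak[j])
            if total > 0 and cells_with_peak[j] > 0 else 0.0
            for j in range(n_peaks)
        ])
    return weighted


def variance_share(singular_values):
    """Squared singular values, normalized to sum to 1."""
    squared = [s * s for s in singular_values]
    total = sum(squared)
    return [sq / total for sq in squared]


# Hypothetical peak-by-cell counts: 2 cells (rows) x 3 peaks (columns).
counts = [[1, 0, 1],
          [1, 1, 0]]
weights = tf_idf(counts)
print(weights[0])  # the peak seen in only one cell gets the larger weight

# Hypothetical singular values, ordered from largest to smallest.
print(variance_share([3.0, 2.0, 1.0]))
```

Peaks open in every cell are down-weighted relative to rarer peaks, and the squared singular values play the same role as PCA eigenvalues when choosing how many components to keep.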
The SVD task in Flow can be invoked from the Exploratory analysis section by clicking any single cell counts data node (Figure 1). We recommend running SVD on normalized data, particularly TF-IDF normalized counts for scATAC-seq analysis.
To run the SVD task:
Click a single cell counts data node
Click the Exploratory analysis section in the toolbox
Click SVD
The SVD dialog only asks you to select the number of singular values to compute (Figure 2). By default, the first 100 singular values are computed instead of all of them; the number can be adjusted manually or typed in directly. Click the Finish button to run the task with the default settings.
The task report for SVD is similar to that of PCA. Its output will be used for downstream analysis and visualization, including Harmony (Figure 3).
Cusanovich, D., Reddington, J., Garfield, D. et al. The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature 555, 538–542 (2018). https://doi.org/10.1038/nature25981
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Uniform Manifold Approximation and Projection (UMAP) is a dimensional reduction technique [1]. UMAP aims to preserve the essential high-dimensional structure and present it in a low-dimensional representation. UMAP is particularly useful for visually identifying groups of similar samples or cells in large high-dimensional data sets such as single cell RNA-Seq.
We recommend normalizing your data prior to running UMAP, but the task will run on any counts data node.
Click the counts data node
Click the Exploratory analysis section of the toolbox
Click UMAP
Click Finish to run
UMAP produces a UMAP task node. Opening the task report launches a scatter plot showing the UMAP results. Each point on the plot is a cell for single cell data or a sample for bulk data. The plot will open in 2D or 3D depending on the user preference.
Both t-SNE and UMAP are dimensional reduction techniques that are useful for identifying groups of similar samples in large high-dimensional data sets. A comparison of the techniques for visualizing single cell RNA-Seq data by the authors of UMAP suggests that UMAP runs faster, is more reproducible, gives a more meaningful organization of clusters, and preserves more information about the global structure of the data than t-SNE [2].
In our hands, we find UMAP to be more informative than t-SNE for many data sets. For example, the similarities and differences between clusters are clearly visible with UMAP, but more difficult to judge with t-SNE (Figure 1).
Sets the initialization mode. Options are Spectral and Random.
Spectral - good initial points are chosen using spectral embedding (more accurate)
Random - random initial points are chosen (faster)
Choose whether to run UMAP on all samples together or on each sample individually.
Checking the box will run UMAP on each sample individually.
This option appears when there are multiple feature types in the input data node (e.g., CITE-Seq data).
Select Any to run on all features or pick a feature type.
UMAP preserves the local structure of the data by focusing on the distances between each point and its nearest neighbors. Local neighborhood size is the number of nearest neighbors to consider.
You can adjust this value to prioritize global or local relationships. Smaller values will give a more local view, while larger values will give a more global view (Figure 2). Default is 30.
The effective minimum distance between embedded points. Smaller values will create a more clustered embedding, while larger values will create a more evenly dispersed embedding.
You can decrease this value to make clusters more tightly packed or increase it to make them looser (Figure 3). Default is 0.3.
The metric to use when computing distances in high-dimensional space. Options are Euclidean, Manhattan, Chebyshev, Canberra, Bray Curtis, and Cosine. Default is Cosine.
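For reference, the listed metrics can be written down compactly. The sketch below uses the standard textbook definitions; Partek Flow's own implementation may handle edge cases (such as all-zero vectors) differently, and the function names are chosen here for illustration only.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def canberra(x, y):
    # Terms where both coordinates are zero are skipped (they contribute nothing).
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(x, y) if a or b)

def bray_curtis(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(abs(a + b) for a, b in zip(x, y))

def cosine(x, y):
    # Cosine distance = 1 - cosine similarity; undefined for all-zero vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

x, y = [1.0, 0.0, 2.0], [2.0, 0.0, 4.0]
for metric in (euclidean, manhattan, chebyshev, canberra, bray_curtis, cosine):
    print(metric.__name__, round(metric(x, y), 4))
```

In this example y is simply x doubled, so the cosine distance is essentially zero while the magnitude-sensitive metrics are not. This is one reason cosine distance is often preferred when overall signal strength differs between cells.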
UMAP uses an iterative algorithm to optimize the low-dimensional representation. The value 0 corresponds to the default, which chooses the number of iterations based on the size of the input data. More iterations will result in a more accurate embedding, but will take longer to run. Default is 0.
Several parts of UMAP use a random number generator to provide initial values. Default is 42. To reproduce the results, use the same random seed for all runs.
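The effect of the seed can be illustrated with a small, generic Python sketch of a random starting layout. This is not Partek Flow's code; the function name random_init and the coordinate range are arbitrary choices made for illustration.

```python
import random

def random_init(n_points, n_dims, seed):
    """Random starting coordinates (roughly between -10 and 10) for each point."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return [[rng.uniform(-10, 10) for _ in range(n_dims)] for _ in range(n_points)]

a = random_init(5, 2, seed=42)
b = random_init(5, 2, seed=42)
c = random_init(5, 2, seed=7)

print(a == b)  # same seed: identical starting layout
print(a == c)  # different seed: a different starting layout
```

Because the optimization starts from these initial values, two runs with the same seed and the same settings follow the same path, while a different seed can yield a visibly different (though usually qualitatively similar) plot.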
Output a UMAP table data node that can be downloaded. The 2D UMAP coordinates are labeled Feature 1 and Feature 2; the 3D UMAP coordinates are labeled Feature 3, 4, and 5. Default is disabled.
UMAP uses principal components as its input. The number of principal components to use is set here. Default is 10.
We recommend using the PCA task to determine the optimal number of principal components for your data.
Options are equally or by variance. Feature values can be standardized prior to PCA so that the contribution of each feature does not depend on its variance. To standardize, choose equally. To take variance into account and focus on the most variable features, choose by variance. Default is by variance.
You can choose to log transform the data prior to running PCA as part of UMAP. Default is disabled.
If you are log transforming the data, choose a log base. Default is 2 when Log transform data is enabled.
If you are log transforming the data, choose an offset. Default is 1 when Log transform data is enabled.
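The log base and offset combine in the usual way: each value x becomes log_base(x + offset). A minimal sketch, with a hypothetical function name and toy values:

```python
import math

def log_transform(values, base=2, offset=1):
    """Return log_base(value + offset) for each value."""
    return [math.log(v + offset, base) for v in values]

raw = [0, 1, 3, 7]
transformed = log_transform(raw)  # defaults: base 2, offset 1
print([round(t, 3) for t in transformed])
```

The offset keeps zero counts defined, since log(0) does not exist; with an offset of 1, a zero count maps to zero.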
[1] McInnes L and Healy J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv, 2018, e-prints 1802.03426.
[2] Becht E, McInnes L, Healy J, Dutertre A-C, Kwok I, Guan Ng L, Ginhoux F, and Newell E, Dimensionality reduction for visualizing single-cell data using UMAP, Nature Biotechnology, 2019, 37, 38-44.
Hierarchical clustering is a statistical method used to assign similar objects into groups called clusters. It is typically performed on results of statistical analyses, such as a list of significant genes/transcripts, but can also be invoked on the full data set, as a part of exploratory analysis.
Hierarchical clustering is an unsupervised technique, and the number of clusters does not need to be specified upfront. In the beginning, each row and/or column is considered a cluster. The two most similar clusters are then combined, and this merging continues until all objects are in the same cluster. Hierarchical clustering produces a tree (called a dendrogram) that shows the hierarchy of the clusters.
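The merging procedure described above can be sketched in a few lines of Python. This is a naive illustration on one-dimensional toy values using average linkage, not the implementation used by Partek Flow; the function names and data are hypothetical.

```python
def average_linkage(a, b, points):
    """Mean absolute difference over all pairs drawn from clusters a and b."""
    return sum(abs(points[i] - points[j]) for i in a for j in b) / (len(a) * len(b))

def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[i] for i in range(len(points))]  # every object starts alone
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        history.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return history

values = [1.0, 1.5, 5.0, 5.2, 9.0]
history = agglomerate(values)
for left, right, dist in history:
    print(left, right, round(dist, 3))
```

Each entry in the merge history corresponds to one branch point of the dendrogram, and the merge distance gives the height at which the two branches join.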
This tutorial will illustrate how to:
To invoke hierarchical clustering, select a data node containing count data (e.g. Gene counts, Normalized counts, Single cell counts), or a Feature list data node (to cluster significant genes/transcripts) and then click on the Hierarchical clustering / heat map option in the context sensitive menu (Figure 1).
The hierarchical clustering setup dialog (Figure 2) enables you to control the clustering algorithm. Starting from the top, you can choose to plot a Heatmap or a Bubble map (clustering can be performed on both plot types). Next, set the Ordering by selecting Cluster for the feature order (genes/transcripts/proteins), the cell/sample/group order, or both. A context-sensitive image shows whether hierarchical clustering (dendrogram) or an assigned order (arrow) will be applied to the columns and rows, which helps you orient yourself and make decisions (in Figure 2, Cluster is selected for both options, so a dendrogram is shown in the image).
When you choose Assign order, the Default order of cells/samples/groups (rows) is based on the labels as displayed in the Data tab, and the order of features (columns) depends on the input data of the data node.
Feature order can be assigned by selecting a managed list from the drop-down (e.g. saved feature lists generated from report nodes, or lists added under list management in the settings). This limits the features to those in the list and orders them as they are listed. If a feature from the list is not available in the input data node, it will not be shown in the plot. Note that if none of the features are available in the data node, the task cannot run and an error message will be shown.
Cell/Sample/Group order can also be assigned by choosing an attribute from the drop down list. Click and drag to rearrange categorical attributes; numeric attributes can be sorted in ascending or descending order (note the arrows in the image which are different from the dendrogram for Cluster) (Figure 3).
Another way to invoke a heatmap without performing clustering is via the data viewer. When you select the Heatmap icon in the available plots list, data nodes that contain two-dimensional matrices can be used to draw this type of plot. A bubble map can be plotted similarly (use the arrow on the heatmap icon to select Bubble map) for descriptive statistics that have been generated in the data analysis pipeline.
If you do not want to cluster all the samples, but select a subset based on a specific sample or cell attribute (i.e. group membership), check Filter cells under Filtering and set a filtering rule using the drop down lists (Figure 4). Notice the drop-down lists allow more than one factor (when available) to be selected at a time. When configuring the filtering rule, use AND to ensure all conditions pass for inclusion and use OR for any conditions to pass.
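The difference between AND and OR can be shown with a short Python sketch. The attribute names, values, and the passes helper are hypothetical and only mimic the logic of a filtering rule.

```python
def passes(cell, rules, mode="AND"):
    """Check a cell's attributes against (attribute, allowed values) rules."""
    checks = [cell.get(attr) in allowed for attr, allowed in rules]
    return all(checks) if mode == "AND" else any(checks)

cells = [
    {"name": "c1", "cell_type": "B cell", "sample": "S1"},
    {"name": "c2", "cell_type": "T cell", "sample": "S1"},
    {"name": "c3", "cell_type": "B cell", "sample": "S2"},
]
rules = [("cell_type", {"B cell"}), ("sample", {"S1"})]

kept_and = [c["name"] for c in cells if passes(c, rules, "AND")]
kept_or = [c["name"] for c in cells if passes(c, rules, "OR")]
print(kept_and)  # ['c1']
print(kept_or)   # ['c1', 'c2', 'c3']
```

AND keeps only cells that satisfy every condition, so it narrows the selection; OR keeps cells that satisfy at least one condition, so it widens it.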
Hierarchical clustering uses distance metrics to sort rows and columns by similarity; the cluster distance metric is set to Average Linkage by default. This can be adjusted by clicking Configure under Advanced options (Figure 5). You can also choose how the data is scaled (sometimes referred to as normalized). There are three Feature scaling options. Standardize (the default for a heatmap) scales each feature (column) to a mean of zero and a standard deviation of one, so that all features (e.g., genes or proteins) have equal weight; standardized values are also known as Z-scores. Shift subtracts the column mean so that each column has a mean of zero. None applies no scaling, and clustering is performed on the values in the input data node (this is the default for a bubble map). If a bubble map is scaled, scaling is performed on the group summary method (color).
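The three scaling modes can be summarized in a minimal Python sketch applied to a single feature. The function name is hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption; the documentation does not state which estimator Partek Flow uses.

```python
import statistics

def scale_feature(values, mode="Standardize"):
    """Scale one feature (column) across all samples."""
    if mode == "None":
        return list(values)
    mean = statistics.fmean(values)
    shifted = [v - mean for v in values]
    if mode == "Shift":
        return shifted
    if mode == "Standardize":
        # Sample standard deviation (n - 1) is an assumption; it is zero,
        # and the division fails, when the feature is constant.
        sd = statistics.stdev(values)
        return [s / sd for s in shifted]
    raise ValueError(f"unknown scaling mode: {mode}")

expression = [2.0, 4.0, 6.0]
print(scale_feature(expression, "None"))
print(scale_feature(expression, "Shift"))
print(scale_feature(expression, "Standardize"))
```

After standardization a highly expressed gene and a lowly expressed gene are on the same scale, so the heatmap colors reflect relative changes across samples rather than absolute abundance.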
Cluster distance metric for cells/samples and features is used to determine how the distance between two clusters will be calculated:
Single Linkage: the distance between two clusters is determined by the distance of the closest objects in the two clusters
Complete Linkage: the distance between two clusters is equal to the distance between the two furthest members of those clusters
Average Linkage: the average distance between all the pairs of objects in the two different clusters is used as the measure of distance between the two clusters
Centroid method: the distance between two clusters is equal to the distance between the centroids of those clusters
Ward's method: clusters are merged so as to minimize the increase in an error measure based on the within-cluster sum of squares
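The cluster distance options listed above can be expressed compactly in Python. The sketch uses Euclidean distance between points and the standard textbook definitions; the function names are illustrative, and Partek Flow's implementation details may differ.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

def single_linkage(a, b):
    return min(math.dist(p, q) for p in a for q in b)

def complete_linkage(a, b):
    return max(math.dist(p, q) for p in a for q in b)

def average_linkage(a, b):
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid_linkage(a, b):
    return math.dist(centroid(a), centroid(b))

def ward_increase(a, b):
    """Increase in within-cluster sum of squares if a and b were merged."""
    weight = len(a) * len(b) / (len(a) + len(b))
    return weight * math.dist(centroid(a), centroid(b)) ** 2

cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 2.0)]
for fn in (single_linkage, complete_linkage, average_linkage,
           centroid_linkage, ward_increase):
    print(fn.__name__, round(fn(cluster_a, cluster_b), 4))
```

Single linkage tends to produce long, chained clusters, while complete linkage and Ward's method favor compact ones; average linkage, the default, sits between these extremes.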
Point distance metric is used to determine the distance between two rows or columns. For more detailed information about the equations, we refer you to the distance metrics chapter below.
The output of a Hierarchical clustering task can be a heatmap (Figure 6) or a bubble map with or without dendrograms depending on whether you performed clustering on cells/samples/groups or features. By default, samples are on rows (sample labels are displayed as seen in the Data tab) and features (depending on the input data) are on columns. Colors are based on standardized expression values (default selection; performed on the fly). Dendrograms show clustering of rows (samples) and columns (variables).
Depending on the resolution of your screen and the number of samples and variables (features) that need to be displayed, some binning may be involved. If there are more samples/genes than pixels, values of neighboring rows/columns will be averaged together. Use the mouse wheel to zoom in and out. When you zoom in to a certain level on the heatmap, each cell represents one sample/gene. When you mouse over the row dendrogram or label area and zoom, only the rows are zoomed in or out; the binning on the columns remains the same. Similarly, when you mouse over the column dendrogram or label area and zoom, only the columns are zoomed in or out; the binning on the rows remains the same. To move the map around when zoomed in, press down the left mouse button and drag the map. The plot can be saved as a full-size image or as a current view; when Save image is clicked, a prompt will ask how you would like to save the image.
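The idea behind binning can be illustrated with a small Python sketch that averages neighboring values so they fit into a limited number of pixels. The exact binning scheme used by the viewer is not documented here, so this equal-width scheme and the function name bin_average are assumptions.

```python
def bin_average(values, n_bins):
    """Average neighboring values so that len(values) items fit into n_bins slots."""
    n = len(values)
    if n <= n_bins:
        return list(values)  # enough pixels: one value per slot, no binning
    binned = []
    for b in range(n_bins):
        # Integer arithmetic splits the values into n_bins contiguous chunks.
        start = b * n // n_bins
        end = (b + 1) * n // n_bins
        chunk = values[start:end]
        binned.append(sum(chunk) / len(chunk))
    return binned

print(bin_average([1, 3, 5, 7, 9, 11], 3))  # [2.0, 6.0, 10.0]
print(bin_average([1, 3, 5], 10))           # [1, 3, 5]
```

Zooming in reduces the number of rows or columns that must share the available pixels, which is why individual samples and genes become visible at higher zoom levels.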
The Hierarchical clustering task can also be used to plot a bubble map. Let's go through the steps to make a bubble map (Figure 7):
Choose to plot a Bubble map (note the selection of a bubble map in the image which is different from the heatmap). This will open the Bubble map settings.
Configure the Bubble map settings. First, Group cells by an available categorical attribute (e.g. cell type). Next, summarize the group’s first dimension by color (Group summary method) then choose an additional dimension to plot size (Additional statistic) by using the drop down lists. If these settings are not adjusted, the default dimensions will generate two descriptive statistic measurements that plot the group mean by color and size by the percent of cells. Hierarchical clustering can be performed on the first assigned dimension (by color) which is the Group summary method. The second dimension (size) which is an Additional statistic is not required but it is selected by default (this can be unchecked with the checkbox).
Ordering the plot columns (Feature order) and rows (Group order) behaves the same as for a heatmap. In this example, both features and groups are ordered by Cluster, so hierarchical clustering is performed using the default distance metrics (these can be changed under Configure in the Advanced options section). Alternatively, Assign order to features using a managed (saved) feature list or the default order, which depends on the input data. Assign order for groups can be used to rearrange the attribute by drag and drop, in ascending or descending order, or in the default order, which is how the labels are displayed in the Data tab.
Filtering can be applied to the groups by checking Filter cells then specifying the logical operations to filter by (this is the same as a heatmap).
Advanced options let the user perform Feature scaling (e.g. Standardize by a z-score) but in a bubble map the default is set to None. It also allows the user to change the Group clustering and Feature clustering options by altering the Cluster distance metrics and Point distance metrics (similar to a heatmap).
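To illustrate the two default bubble map dimensions from the steps above (group mean for color, percent of cells for size), the sketch below computes both for one feature. Interpreting percent of cells as the share of cells in the group with a non-zero value is an assumption, as are the function name and the toy data.

```python
from collections import defaultdict

def bubble_stats(expression, groups):
    """Per-group mean and percent of cells with a non-zero value for one feature."""
    by_group = defaultdict(list)
    for value, group in zip(expression, groups):
        by_group[group].append(value)
    stats = {}
    for group, values in by_group.items():
        mean = sum(values) / len(values)
        # Assumption: a cell counts toward the percentage when its value is above zero.
        percent = 100.0 * sum(1 for v in values if v > 0) / len(values)
        stats[group] = {"mean": mean, "percent_of_cells": percent}
    return stats

expr = [0.0, 2.0, 4.0, 0.0, 0.0, 3.0]
cell_type = ["B", "B", "B", "T", "T", "T"]
stats = bubble_stats(expr, cell_type)
for group, values in stats.items():
    print(group, round(values["mean"], 2), round(values["percent_of_cells"], 1))
```

In the plot, each group and feature pair becomes one bubble: the first statistic sets its color and the second sets its size.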
There are plot Configuration/Action options for the Hierarchical clustering / heatmap task which apply to both the heatmap and bubble map in the Data viewer (below): Axes, Heatmap, Dendrograms, Annotations, and Description. Click on the icon to open these configuration options.
This section controls the Content or data source used to draw the values in the heatmap or bubble map and also the ability to transpose the axes. The plot is a color representation of the values in the selected matrix. Most of the data nodes contain only one matrix, so it will just say Matrix for the chosen data node. However, if a data node contains multiple matrices (e.g. descriptive statistics were performed on cluster groups for every gene like mean, standard deviation, percent of cells, etc) each statistic will be in a separate matrix in the output data node. In this case, you can choose which statistic/matrix to display using the drop-down list (this would be the case in a bubble map).
To change the orientation (switch the columns and rows) of the plot, click on the Transpose toggle switch.
Row labels and Column labels can be turned on or off by clicking the relevant toggle switches.
The label size can be changed by specifying the number of pixels using Max size and Font. If an Ensembl annotation model has been associated with the data, you can choose to display the gene name or the Ensembl ID using the Content option.
This section is used to configure the color, range, size, and shape of the components in the heatmap.
In the color palette horizontal bar, the left side color represents the lowest value and the right side color represents the highest value in the matrix data. Note that when you zoom in/out the lowest and highest values captured by the color palette may change. By default, there are 3 color stops: minimum, middle, and maximum color value of the default range calculated on the matrix. Left-click on the middle color stop and drag left or right to change the middle value this color stop represents. If you left-click on the middle color stop once, you can change the color and value this color stop represents. Click on the (X) to remove this color stop.
Click on the color square or the adjacent triangle to choose a color to represent the value. This will display a color picker dialog which allows selection of a color, either by clicking or by typing a HEX color code, then clicking OK.
The min and max color stops cannot be dragged or removed. If you left-click on them, you can choose a different color. When you click on the Palette bar, you can add a new color stop between min and max. Adding a color stop can be useful when there is an outlier value in the data. You can use a different color to represent different value ranges.
Right-clicking a color stop will reveal a list of options. Space colors evenly will rearrange the position of the stops so there is an equal distance between all stops. Center will move a stop to the middle of the two adjacent stops. Delete will remove the stop.
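The behavior of color stops can be sketched in Python. The example assumes that colors between two stops are blended by linear interpolation in RGB space, which is a common approach but is not confirmed for this viewer; the function names and stop values are hypothetical.

```python
def interpolate_color(value, stops):
    """Map a value to an RGB color by linear interpolation between color stops.

    stops is a list of (value, (r, g, b)) pairs with strictly increasing values.
    """
    if value <= stops[0][0]:
        return stops[0][1]
    if value >= stops[-1][0]:
        return stops[-1][1]
    for (v0, c0), (v1, c1) in zip(stops, stops[1:]):
        if v0 <= value <= v1:
            t = (value - v0) / (v1 - v0)
            return tuple(round(a + t * (b - a)) for a, b in zip(c0, c1))

def space_evenly(stops):
    """Reposition the stops at equal intervals between the min and max stops."""
    low, high = stops[0][0], stops[-1][0]
    step = (high - low) / (len(stops) - 1)
    return [(low + i * step, color) for i, (_, color) in enumerate(stops)]

# Blue at the minimum, white in the middle, red at the maximum.
stops = [(-2.0, (0, 0, 255)), (0.0, (255, 255, 255)), (4.0, (255, 0, 0))]
print(interpolate_color(1.0, stops))
print(space_evenly(stops))
```

Moving the middle stop changes which value maps to the middle color, and adding a stop near an outlier keeps that single extreme value from compressing the color range used for the rest of the data.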
In addition to color, you can also use the Size drop-down list to size by a set of values from another matrix stored in the same data node. Most of the data nodes contain only one matrix, so the only options available in the Size drop down will be None or Matrix. In cases where you have multiple matrices, you might want to use the color of the component in the heatmap to represent one type of statistic (like mean of the groups) and the size of the component to represent the information from a different statistic (like std. dev).
The shape of the heatmap cell (component) can be configured either as a rectangle or circle by selecting the radio button under Shape.
If cluster analysis is performed on samples and/or features, the result will be displayed as dendrograms. By default, the dendrograms are all colored in black.
The color of the dendrograms can be configured.
Click on the color square or its triangle to choose a different color for the dendrogram.
When By cluster is selected in the Row/Column color drop-down list, the number of clusters needs to be specified. The top N clusters will be shown in N different colors.
This section allows you to add sample or cell level annotations to the viewer. First, make sure to choose the correct data node, which contains the annotation information you would like to use, by clicking the circle. All project level annotations will be available on all data nodes in the pipeline.
Choose an attribute from the Row annotation drop-down list. Multiple attributes can be chosen from the drop-down list and can be reordered by clicking and dragging the groups below the drop-down list. Each attribute is represented as an annotation bar next to the heatmap. Different colors represent the different groups in the attribute.
The width of the annotation bar can be changed using the Block size slider when the Show labels toggle switch is on.
The annotation label font size can be changed by specifying the size in pixels.
The Fill blocks toggle switch adds or removes color from the annotation labels.
Description is used to modify the Title and toggle on or off the Legend.
The heatmap has several different mouse modes which modify the way the plot responds to the mouse buttons. The mode buttons are in the upper right corner of the heatmap. Clicking one of these buttons puts the heatmap into that mode.
In point mode, you can left-click and drag to move around the heatmap (if you are not fully zoomed out). Left-clicking once on the heatmap or on a dendrogram branch will select the associated rows/columns.
In selection mode, you can click and drag to select a range of rows, columns, or components.
In flip mode, you can click on a line in the dendrogram (which represents a cluster branch) and the location of the two legs of the branch will be swapped. If no clustering is performed (no dendrogram is generated), you can use this mode to click on the label of an item (observation or feature) and drag and drop it to manually change the order of rows or columns on the heatmap.
Click on reset view to reset to the default view.
The Save Image icon enables you to download the heat map to your local computer. If the heat map contains up to 2.5M cells (features * observations), you can choose between saving the current appearance of the heat map window (Current view) and saving the entire heat map (All data). Depending on the number of features / observations, Partek Flow may not be able to fit all the labels on the screen, due to the limit imposed by the screen resolution. The All data option provides an image file of sufficient size so that all the labels are readable (in turn, that image may not fit the computer screen and the image file may be quite large). If the heat map exceeds 2.5M cells, the Current view option will not be shown, and you will see only a dialog like the one below.
After selecting either Current view (if applicable) or All data button, the next dialog (below) will allow you to specify the image format, size, and resolution.