PCA

Principal components analysis (PCA) is an exploratory technique used to describe the structure of high-dimensional data by reducing its dimensionality. It is a linear transformation that converts n original variables (e.g. genes, transcripts, or proteins) into n new variables called principal components (PCs). PCs have three important properties:

PCs are ordered by the amount of variance explained
PCs are uncorrelated
Together, all PCs explain all of the variation in the data

PCA is a principal axis rotation of the original variables that preserves the variation in the data. Therefore, the total variance of the original variables is equal to the total variance of the PCs.
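The three properties above can be checked numerically. The following is a minimal NumPy sketch on made-up toy data, not the implementation used by the PCA task; the variable names and the random matrix are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 50 observations x 4 correlated features
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))

# Center the data and eigendecompose its covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns ascending eigenvalues; reorder so PC1 has the largest variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rotate the centered data onto the principal axes to obtain PC scores
scores = Xc @ eigvecs

pc_var = scores.var(axis=0, ddof=1)      # variance of each PC
pc_cov = np.cov(scores, rowvar=False)    # covariance between PCs
total_original = X.var(axis=0, ddof=1).sum()

print("PC variances (descending):", pc_var)
print("Total variance, original vs PCs:", total_original, pc_var.sum())
```

The PC variances come out in descending order, the off-diagonal entries of `pc_cov` are zero up to rounding, and the two totals agree, which is the variance-preserving rotation described above.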

The PCA task can be performed on any data node containing an observations × features matrix, e.g. a raw counts or normalized counts data node. Select the node and click PCA in the Exploratory analysis section of the context-sensitive menu.

Features to include in calculation

You do not have to use all the features to compute PCs for the observations, especially when the input matrix is very large (e.g. scRNA-seq data). This option lets you choose a subset of features based on a selected statistic. By default, the 2000 features with the highest variance are used to compute PCs.
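As an illustration of variance-based feature selection, here is a minimal NumPy sketch. The function name `top_variance_features` and the toy matrix are hypothetical, and the software may compute and rank the statistic differently.

```python
import numpy as np

def top_variance_features(X, n_top=2000):
    """Return column indices of the n_top highest-variance features.

    X is an observations x features matrix. Hypothetical helper for
    illustration only.
    """
    variances = X.var(axis=0, ddof=1)
    n_top = min(n_top, X.shape[1])          # cannot keep more than exist
    return np.argsort(variances)[::-1][:n_top]

rng = np.random.default_rng(1)
# Toy matrix: 200 observations x 5 features with very different spreads
scales = np.array([0.1, 5.0, 1.0, 3.0, 0.5])
X = rng.normal(size=(200, 5)) * scales

keep = top_variance_features(X, n_top=2)
X_subset = X[:, keep]                        # matrix passed on to PCA
print("Selected feature indices:", keep)
```

Only the selected columns would then be used to compute the PCs, which reduces both memory use and running time on large matrices.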

Number of PCs to calculate

When the matrix is large, as with single-cell data, you may not need to compute all the PCs, depending on what you plan to do downstream with the PCA output. Choosing fewer PCs reduces the running time of the task. By default, the top 100 PCs are output.

Features contribute

equally: all the features are standardized to a mean of 0 and a standard deviation of 1. This option gives all the features equal weight in the analysis. It is the default option for e.g. bulk RNA-seq data.

by variance: the analysis will give more emphasis to the features with higher variances. This is the default option for e.g. single cell RNA-seq data.
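The difference between the two options can be sketched as follows. This is an illustrative NumPy example on toy data, assuming that "equally" corresponds to PCA on standardized features and "by variance" to PCA on centered but unscaled features; the `pca` helper is hypothetical and is not the software's API.

```python
import numpy as np

def pca(X, standardize):
    """PCA via SVD; returns (scores, loadings, eigenvalues).

    standardize=True  ~ "equally" (each feature scaled to unit variance)
    standardize=False ~ "by variance" (features only centered)
    """
    Xc = X - X.mean(axis=0)
    if standardize:
        # Assumes no feature has zero variance
        Xc = Xc / Xc.std(axis=0, ddof=1)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = S**2 / (X.shape[0] - 1)
    return Xc @ Vt.T, Vt.T, eigvals

rng = np.random.default_rng(2)
# Toy data: three independent features, the third with a much larger spread
X = rng.normal(size=(100, 3)) * np.array([1.0, 1.0, 20.0])

_, load_var, eig_var = pca(X, standardize=False)
_, load_eq, eig_eq = pca(X, standardize=True)

# Feature with the largest absolute weight on PC1 when not standardized
dominant = int(np.argmax(np.abs(load_var[:, 0])))
print("By variance, PC1 is dominated by feature", dominant)
print("Equally, eigenvalues sum to the number of features:", eig_eq.sum())
```

Without standardization, the high-variance feature dominates PC1. With standardization, every feature contributes one unit of variance, so the eigenvalues sum to the number of features.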

The PCA task creates a new task node. To open it and see the result, either select the PCA task node and choose Task result from the context-sensitive menu, or double-click the PCA task node.

When the PCA node is opened in the Data viewer, it contains by default a scatterplot, a Scree plot with eigenvalues, and a Component loadings table. Each dot on the scatterplot represents an observation, and the first three PCs are shown on the X-, Y-, and Z-axes, respectively, with the information content of each PC given in parentheses.

As an exploratory tool, the PCA scatterplot is used to view clustering patterns in the data and generate hypotheses based on the outcome, or to spot possible outliers.

To rotate the 3D scatterplot, left-click and drag. To zoom in or out, use the mouse wheel. Click and drag the legend to move it to a different location in the viewer.

Detailed configuration options for the PCA plot can be found in the Data viewer section.

In the Data viewer, when a PCA data node is selected from Get Data under Setup (left panel), the node can be dragged and dropped onto the screen. You will then have the option to select a scree plot or tables.

When you choose the Scree plot icon, a 2D plot is drawn in which the X-axis represents PCs and the Y-axis represents eigenvalues. Mousing over a point on the line displays detailed information about that PC. The scree plot shows how much variation each PC represents, so it is often used to determine the number of principal components to keep for downstream analysis (e.g. t-SNE, UMAP, graph-based clustering). The "elbow" point of the graph, where the eigenvalues seem to level off, should be considered as a cutoff point for downstream analysis.
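The quantities behind a scree plot can be computed directly. The sketch below uses toy data with three strong underlying factors, and picks a cutoff with a simple heuristic (the largest ratio between consecutive eigenvalues). That heuristic is an assumption for illustration, not the method used by the software; in practice the elbow is usually judged by eye.

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy data: 300 observations x 20 features driven by 3 latent factors
W = np.zeros((3, 20))
W[0, :7] = 3.0       # factor 1 drives features 0-6
W[1, 7:14] = 2.5     # factor 2 drives features 7-13
W[2, 14:] = 2.0      # factor 3 drives features 14-19
X = rng.normal(size=(300, 3)) @ W + 0.5 * rng.normal(size=(300, 20))

# Eigenvalues of the covariance matrix from the singular values
Xc = X - X.mean(axis=0)
S = np.linalg.svd(Xc, compute_uv=False)
eigvals = S**2 / (X.shape[0] - 1)          # Y-axis of the scree plot

explained = eigvals / eigvals.sum()        # fraction of variance per PC
cumulative = np.cumsum(explained)

# Hypothetical elbow heuristic: cut where consecutive eigenvalues drop most
drops = eigvals[:-1] / eigvals[1:]
n_keep = int(np.argmax(drops)) + 1

print("Eigenvalues:", np.round(eigvals[:6], 2))
print("Cumulative variance explained:", np.round(cumulative[:6], 3))
print("PCs kept by the ratio heuristic:", n_keep)
```

Here the eigenvalues fall sharply after the third PC and then level off, so three PCs would be carried into downstream analysis.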

A PCA data node can also be drawn as tables. When you choose the Table icon, the component loadings matrix is displayed in the viewer. The content can be modified using the Content configuration option; the table can be paged through here or from the lower right corner.

In the table, each row is a feature, each column is a PC, and each value is a correlation coefficient. Under Content, there is a PCA projections option; switch to it to display the projection table. In that table, each row is an observation, each column is a PC, and the values are the PC scores.
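The two tables can be reproduced on toy data as follows. This sketch assumes that the correlation coefficient in the loadings table is the Pearson correlation between each feature and each PC's scores, and it uses standardized features; both are assumptions for illustration and may differ from how the software computes its tables.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical toy data: 100 observations x 5 correlated features
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))

# Standardize features, then run PCA via SVD
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
_, S, Vt = np.linalg.svd(Z, full_matrices=False)
eigvals = S**2 / (Z.shape[0] - 1)

# Projection table: rows are observations, columns are PCs
scores = Z @ Vt.T

# Loadings table: rows are features, columns are PCs, values are the
# correlation between each feature and each PC's scores
n_features = Z.shape[1]
corr = np.corrcoef(Z, scores, rowvar=False)
loadings = corr[:n_features, n_features:]

print("Projection table shape:", scores.shape)
print("Loadings table shape:", loadings.shape)
```

For standardized data, these correlations equal each eigenvector scaled by the square root of its eigenvalue, so a loading near 1 or -1 marks a feature that closely tracks that PC.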
