To ensure that different data sets are comparable, several normalization and scaling options are available in Partek Flow. These include newly-developed algorithms specifically tailored for genomic analysis.
Raw read counts are generated after quantification for each feature on all samples. These read counts need to be normalized prior to differential expression detection to ensure that samples are comparable.
This chapter covers the implementation of each normalization method. The Normalize counts option is available on the context-sensitive menu (Figure 1) upon selection of any quantified output data node or an imported count matrix:
Gene counts
Transcript counts
MicroRNA counts
Cufflinks quantification
Quantification
The output has the same format as the input data, and the output node is called Normalized counts. This data node can be selected and normalized further using the same task.
Select whether you want your data normalized on a per-sample or per-feature basis (Figure 2). Some transformations, such as log transformation, are applied to each value independently of the others, so the result is identical regardless of your choice.
The following normalization methods will generate different results depending on whether the transformation was performed on samples or on features:
Divided by mean, median, Q1, Q3, std dev, sum
Subtract mean, median, Q1, Q3, std dev, sum
Quantile normalization
Note that each task can only perform normalization on samples or on features. If you wish to perform both transformations, run two normalization tasks successively.
To normalize the data, click on a method in the left panel, then drag and drop the method to the right panel. Add all the normalization methods you wish to perform. Multiple methods can be added to the right panel, and they will be processed in the order in which they are listed. You can change the order of methods by dragging each method up or down. To remove a method from the Normalization order panel, click the minus button to the right of the method. Click Finish when you are done choosing the normalization methods.
For some data nodes, recommended methods are available:
Data nodes resulting from Quantify to annotation model (Partek E/M) or Quantify to reference (Partek E/M) contain raw read counts; the recommendation is Total Count, Add 0.0001
Cufflinks quantification data nodes contain FPKM normalized read counts; the recommendation is Add 0.0001
If available, the Recommended button will appear. Clicking the button will populate the right panel (Figure 3).
Below is the notation that will be used to explain each method:
Absolute value: $TX_{sf} = |X_{sf}|$
Add: $TX_{sf} = X_{sf} + C$. A constant value C needs to be specified.
Antilog: $TX_{sf} = b^{X_{sf}}$. A log base value b needs to be specified from the drop-down list; any positive number can be specified when Custom value is chosen.
Arcsinh: $TX_{sf} = \operatorname{arcsinh}(X_{sf})$. The hyperbolic arcsine (arcsinh) transformation is often used on flow cytometry data.
CLR (centered log ratio): $TX_{sf} = \ln\left(\frac{X_{sf} + 1}{\operatorname{geom}(X_{sf} + 1)} + 1\right)$, where geom is the geometric mean of either the observation or the feature. This method can be applied to protein expression data.
CPM (counts per million): $TX_{sf} = \frac{10^6 \times X_{sf}}{TMR_s}$, where $X_{sf}$ is the raw read count of sample S on feature F, and $TMR_s$ is the total mapped reads of sample S. If quantification is performed on an aligned reads data node, the total mapped reads is the number of aligned reads. If the quantification comes from an imported read count text file, the total mapped reads is the sum of all feature reads in the sample.
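As a minimal sketch of the CPM formula above, the following Python function scales the raw counts of one sample. The function name `cpm` and its `total_mapped_reads` parameter are illustrative and are not part of Partek Flow. When no total is supplied, the sketch falls back to the sum of the feature counts, as described for imported count matrices.

```python
def cpm(counts, total_mapped_reads=None):
    """Counts per million for one sample (illustrative helper, not a Partek Flow API)."""
    # For an imported count matrix, the total is the sum of all feature reads.
    total = sum(counts) if total_mapped_reads is None else total_mapped_reads
    return [1e6 * x / total for x in counts]


sample = [10, 30, 60]      # raw reads of one sample on three features
print(cpm(sample))         # total taken from the counts themselves
print(cpm(sample, 200))    # total supplied, e.g. the number of aligned reads
```

Supplying a larger total (for example, aligned reads that include reads not assigned to any feature) lowers every CPM value proportionally.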
Divided by: When mean, median, Q1, Q3, std dev, or sum is selected, the corresponding statistic is calculated according to the transform on samples or features option.
Example: If transform on Samples is selected, Divide by mean is calculated as $TX_{sf} = X_{sf} / M_s$, where $M_s$ is the mean of the sample.
Example: If transform on Features is selected, Divide by mean is calculated as $TX_{sf} = X_{sf} / M_f$, where $M_f$ is the mean of the feature.
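The difference between the two transform options can be sketched as follows. This is an illustration only: the matrix layout (one row per sample, one column per feature) and the function name `divide_by_mean` are assumptions made for the example.

```python
from statistics import mean


def divide_by_mean(matrix, on="samples"):
    """Divide each value by the mean of its sample (row) or of its feature (column)."""
    if on == "samples":
        result = []
        for row in matrix:
            m_s = mean(row)                      # mean of the sample
            result.append([x / m_s for x in row])
        return result
    # Transform on features: one mean per column.
    m_f = [mean(col) for col in zip(*matrix)]
    return [[x / m for x, m in zip(row, m_f)] for row in matrix]


data = [[2, 4],    # sample 1
        [6, 8]]    # sample 2
print(divide_by_mean(data, on="samples"))
print(divide_by_mean(data, on="features"))
```

The same input gives different outputs for the two options, which is why the choice matters for the Divided by and Subtract methods but not for value-wise transformations such as Log.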
Log: $TX_{sf} = \log_b X_{sf}$. A log base value b needs to be specified from the drop-down list; any positive number can be specified when Custom value is chosen.
Logit: $TX_{sf} = \log_b\left(\frac{X_{sf}}{1 - X_{sf}}\right)$. A log base value b needs to be specified from the drop-down list; any positive number can be specified when Custom value is chosen.
Lower bound: A constant value C needs to be specified. If $X_{sf}$ is smaller than C, then $TX_{sf} = C$; otherwise, $TX_{sf} = X_{sf}$.
Median ratio (DESeq2 only), Median ratio (edgeR): These approaches are slightly different implementations of the method proposed by Anders and Huber (2010). The idea is as follows: for each feature, its expression is divided by the geometric mean of that feature's expression across the samples. Then, for a given sample, the median of these ratios across the features is taken as the sample-specific size factor. The normalized expression is equal to the raw expression divided by the size factor. Median ratio (DESeq2 only) is available in the R package DESeq2 under the name "ratio". Select this method if DESeq2 will be used for downstream differential analysis; because it is not on a per-million scale, it is not recommended for any differential analysis method other than DESeq2. Median ratio (edgeR) is available in the R package edgeR under the name "RLE". It is very similar to the Median ratio (DESeq2 only) method, but it uses a per-million scale.
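The size factor idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the DESeq2 or edgeR code: the function names are hypothetical, the matrix is assumed to hold one row per sample, and features with a zero in any sample are skipped because their geometric mean is zero. The per-million rescaling used by the edgeR variant is not shown.

```python
import math
from statistics import median


def median_ratio_size_factors(matrix):
    """Return one size factor per sample; matrix[s][f] holds raw counts."""
    n_samples = len(matrix)
    n_features = len(matrix[0])
    # Keep only features that are non-zero in every sample.
    usable = [f for f in range(n_features)
              if all(matrix[s][f] > 0 for s in range(n_samples))]
    # Log of the geometric mean of each usable feature across samples.
    log_geo = {f: sum(math.log(matrix[s][f]) for s in range(n_samples)) / n_samples
               for f in usable}
    # Size factor: median across features of (count / feature geometric mean).
    return [math.exp(median(math.log(matrix[s][f]) - log_geo[f] for f in usable))
            for s in range(n_samples)]


def divide_by_size_factor(matrix, factors):
    """Normalized expression = raw expression / size factor of the sample."""
    return [[x / k for x in row] for row, k in zip(matrix, factors)]


counts = [[10, 20, 30, 0],     # sample 1
          [20, 40, 60, 5]]     # sample 2, sequenced about twice as deeply
factors = median_ratio_size_factors(counts)
print(factors)
print(divide_by_size_factor(counts, factors))
```

In this toy input the second sample has twice the counts of the first for the usable features, so its size factor is twice as large and the normalized values of those features coincide.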
Multiply by: $TX_{sf} = X_{sf} \times C$. A constant value C needs to be specified.
Poscounts (DESeq2 only): A DESeq2 size factor estimation option. Compared with Median ratio, the poscounts method can be used when every gene contains a sample with a zero count. It calculates a modified geometric mean by taking the nth root of the product of the non-zero counts. It is not on a per-million scale.
Quantile normalization: A rank-based normalization method. For instance, if the transformation is performed on samples, it first ranks all the features in each sample. Let vector Vs be the feature values of sample S sorted in ascending order. The method calculates a vector Vm that is the average of the sorted vectors across all samples, and then each value in Vs is replaced by the value in Vm at the same rank. Detailed information can be found in [1].
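The procedure can be sketched as below for the transform-on-samples case. This is a simplified illustration: the function name is hypothetical, the matrix is assumed to hold one row per sample, and tied values are ranked by their position rather than by any special tie rule.

```python
def quantile_normalize(matrix):
    """Quantile-normalize samples; matrix[s][f] is the value of sample s on feature f."""
    n_samples = len(matrix)
    # Vs for every sample: the feature values sorted in ascending order.
    sorted_rows = [sorted(row) for row in matrix]
    # Vm: rank-by-rank average of the sorted vectors across all samples.
    vm = [sum(col) / n_samples for col in zip(*sorted_rows)]
    result = []
    for row in matrix:
        # Feature indices ordered from smallest to largest value.
        order = sorted(range(len(row)), key=lambda f: row[f])
        out = [0.0] * len(row)
        for rank, f in enumerate(order):
            out[f] = vm[rank]      # replace the value by Vm at the same rank
        result.append(out)
    return result


data = [[5, 2, 3],
        [4, 1, 6]]
print(quantile_normalize(data))    # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After the transformation every sample contains the same set of values, so all samples share an identical distribution and differ only in which feature holds which value.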
Rank: This transformation replaces each value with its rank in the list of sorted values. The smallest value is replaced by 1 and the largest value is replaced by the total number of non-missing values, N. If there are no tied values, the result is a perfectly uniform distribution. In the case of ties, all tied values receive the mean rank.
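A minimal sketch of the rank transformation with mean ranks for ties is shown below. The function name `mean_rank` is illustrative, and the sketch assumes the input has no missing values.

```python
def mean_rank(values):
    """Replace each value by its 1-based rank; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1      # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


print(mean_rank([10, 20, 10, 30]))    # [1.5, 3.0, 1.5, 4.0]
```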
Rlog: The regularized log transformation is the method implemented in the DESeq2 package under the name rlog. It applies a transformation that removes the dependence of the variance on the mean. It should not be applied to zero-inflated data such as single cell RNA-seq raw count data. The output of this task should not be used for differential expression analysis, but rather for data exploration, such as clustering.
Round Round the value to the nearest integer.
RPKM (Reads per kilobase of transcript per million mapped reads [2]): $TX_{sf} = \frac{10^9 \times X_{sf}}{TMR_s \times L_f}$, where $X_{sf}$ is the raw read count of sample S on feature F, $TMR_s$ is the total mapped reads of sample S, and $L_f$ is the length of feature F.
If quantification is performed on an aligned reads data node, the total mapped reads is the number of aligned reads. If the quantification comes from an imported read count text file, the total mapped reads is the sum of all feature reads in the sample. If the feature is a transcript, the transcript length $L_f$ is the sum of the lengths of all its exons. If the feature is a gene, the gene length is the distance between the start position of the most upstream exon and the stop position of the most downstream exon. See Bullard et al. for additional comparisons with other normalization packages [3].
For paired reads, the normalization option will show up as FPKM (Fragments per kilobase per million mapped reads) rather than RPKM. However, the calculations are the same.
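The RPKM formula can be sketched as follows, with feature lengths given in base pairs. The function and parameter names are illustrative and not part of Partek Flow. As with an imported count file, the total falls back to the sum of the feature counts when it is not supplied.

```python
def rpkm(counts, lengths_bp, total_mapped_reads=None):
    """RPKM (or FPKM for paired reads) for one sample; lengths are in base pairs."""
    total = sum(counts) if total_mapped_reads is None else total_mapped_reads
    # 10^9 combines the per-kilobase (10^3) and per-million (10^6) factors.
    return [1e9 * x / (total * length) for x, length in zip(counts, lengths_bp)]


counts = [10, 30, 60]
lengths = [1000, 2000, 3000]
print(rpkm(counts, lengths))    # [100000.0, 150000.0, 200000.0]
```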
Subtract: When mean, median, Q1, Q3, std dev, or sum is selected, the corresponding statistic is calculated according to the transform on samples or features option.
Example: If transform on Samples is selected, Subtract mean is calculated as $TX_{sf} = X_{sf} - M_s$, where $M_s$ is the mean of the sample.
Example: If transform on Features is selected, Subtract mean is calculated as $TX_{sf} = X_{sf} - M_f$, where $M_f$ is the mean of the feature.
TMM (Trimmed mean of M-values): The scaling factors are produced according to the algorithm described in Robinson et al. [4]. The paper by Dillies et al. [5] contains evidence that TMM has an edge over other normalization methods. The reference sample is randomly selected. When trimming is performed, the upper 30% and lower 30% of the M values (fold change) are removed, and the upper 5% and lower 5% of the A values (absolute expression) are removed.
TPM (Transcripts per million as described in Wagner et al [6]) The following steps are performed:
Normalize the reads by the feature length: $RPK_{sf} = X_{sf} / L_f$. Here length is measured in kilobases, but the final TPM values do not depend on the length unit.
Obtain a scaling factor for sample s: $K_s = 10^{-6} \sum_{f=1}^{F} RPK_{sf}$
Divide the raw reads by the length and the scaling factor to get TPM: $TX_{sf} = X_{sf} / L_f / K_s$
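The three TPM steps can be sketched as below for a single sample. The function name `tpm` is illustrative. The example also checks the statement that the result does not depend on the length unit.

```python
def tpm(counts, lengths):
    """Transcripts per million for one sample; lengths may be in any consistent unit."""
    # Step 1: reads per unit length for each feature.
    rpk = [x / length for x, length in zip(counts, lengths)]
    # Step 2: per-sample scaling factor.
    k = 1e-6 * sum(rpk)
    # Step 3: divide by the scaling factor.
    return [r / k for r in rpk]


counts = [10, 20, 30]
in_kb = tpm(counts, [1, 2, 3])
in_bp = tpm(counts, [1000, 2000, 3000])
print(in_kb)
print(sum(in_kb))    # the values of a sample add up to about one million
print(all(abs(a - b) < 1e-3 for a, b in zip(in_kb, in_bp)))
```

Because the length unit appears in both the numerator and the scaling factor, it cancels out, which is why kilobases and base pairs give the same TPM values.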
Upper quartile
The method is exactly the same as the one in the LIMMA package [7]. A brief summary of the calculation follows:
Remove all the features that have 0 reads in all samples.
Calculate the effective library size per sample: effective library size = raw library size (in millions) × (upper quartile of the sample / geometric mean of the upper quartiles of all samples)
Get the normalized counts by dividing the raw counts per feature by the effective library size (for the respective sample)
The Normalization report includes the Normalization methods used, a Feature distribution table, Box-whisker plots of the Expression signal before and after normalization, and Sample histogram charts before and after normalization. Note that all visualizations are disabled for results with more than 30 samples.
A summary of the normalization methods performed. They are listed in the order in which they were performed.
A table that presents descriptive statistics for each sample. The last row contains the grand statistics across all samples (Figure 4).
These box-whisker plots show the expression signal distribution for each sample before and after normalization. When you mouse over a box in the plot, a balloon shows detailed percentile information (Figure 5).
A histogram is displayed for the data before and after normalization. Each line is a sample, where the X-axis is the range of the data in the node and the Y-axis is the frequency of the values within the range. When you mouse over a circle, which represents the center of an interval, detailed information appears in a balloon (Figure 6). It includes:
The sample name.
The range of the interval, where "[" represents an inclusive bound and ")" represents an exclusive bound.
The frequency value within the interval
Bolstad BM, Irizarry RA, Astrand M, Speed TP. A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics. 2003; 19(2): 185-193.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008; 5(7): 621–628.
Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010; 11: 94.
Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010; 11: R25.
Dillies MA, Rau A, Aubert J et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform. 2013; 14(6): 671-83.
Wagner GP, Kin K, Lynch VJ. Measurement of mRNA abundance using RNA-seq data. Theory Biosci. 2012; 131(4): 281-5.
Ritchie ME, Phipson B, Wu D et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015; 43(15):e97.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Single cell RNA-seq gene expression counts are zero inflated due to inefficient mRNA capture. This normalization task is based on MAGIC (MArkov Affinity-based Graph Imputation of Cells) [1] and recovers gene expression lost due to drop-out. The method is limited to input data nodes with up to 50K cells.
To invoke this task, click on a normalized data node that has fewer than 50K cells. The task first computes PCA and then uses the specified number of PCs for imputation (Figure 1).
Click Finish to run the task. It will output a low expression imputed matrix in the output report node.
References
Dijk D et al. MAGIC: A diffusion-based imputation method reveals gene-gene interactions in single-cell RNA-sequencing data
Library size normalization is the simplest strategy for performing scaling normalization. But composition biases will be present when any unbalanced differential expression exists between samples. The removal of composition biases is a well-studied problem for bulk RNA sequencing data analysis. However, single-cell data can be problematic for these bulk normalization methods due to the dominance of low and zero counts[1]. To overcome this, Partek Flow wrapped the calculateSumFactors() function from R package scran. It pools counts from many cells to increase the size of the counts for accurate size factor estimation. Pool-based size factors are then “deconvolved” into cell-based factors for normalization of each cell’s expression profile[1].
Scran deconvolution in Flow can be invoked in Normalization and scaling section by clicking any single cell counts data node (Figure 1).
To run Scran deconvolution,
Click a single cell counts data node
Click the Normalization and scaling section in the toolbox
Click Scran deconvolution
The first Scran deconvolution dialog asks you to select the cluster name from a drop-down list that includes all the attributes for this dataset. The selected cluster is an optional factor specifying which cells belong to which cluster, for deconvolution within clusters (Figure 2). Click the Finish button to run the task with the default settings.
The output of Scran deconvolution is a new data node that has been normalized by the pool-based size factors of each cell and log2 transformed. We can then use this new normalized matrix for downstream analysis and visualization (Figure 3).
Other parameters in this task that you can adjust include:
Pool size: A numeric vector of pool sizes, i.e., number of cells per pool.
Max cluster size: An integer scalar specifying the maximum number of cells in each cluster.
Enforce positive estimates: A logical scalar indicating whether linear inverse models should be used to enforce positive estimates.
Scaling factor: A numeric scalar containing scaling factors to adjust the counts prior to computing size factors.
Lun, A. T., K. Bach, and J. C. Marioni. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016.
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0947-7
The SC transform task performs the variance stabilizing normalization proposed in [1]. The task's interface follows that of the SCTransform() function in R [2]. SCTransform v2 [3] makes it possible to perform downstream differential expression analyses, in addition to improving running speed and memory consumption. v2 is the default method in Flow.
We recommend performing the normalization on a single cell raw count data node. Select the SCTransform task in the Normalization and scaling section of the pop-up menu to invoke the dialog (Figure 1).
By default, the task generates a report on all the input features. By unchecking Report all features, you can limit the results to a certain number of features with the highest variance.
In Advanced options, click Configure to change the default settings (Figure 2).
Scale results: Whether to scale the residuals to have unit variance. Default is FALSE.
Center results: Whether to center all the transformed features to have zero mean expression. Default is TRUE.
VST v2: Default is TRUE. When checked, it sets method = glmGamPoi_offset, n_cells = 2000, and exclude_poisson = TRUE, which causes the model to learn only theta and the intercept and excludes Poisson genes from learning and regularization. When unchecked, the original sctransform model (v1) is used and only the SC scaled data node is generated.
There are two data nodes generated from this task (if VST v2 option is checked as default):
SC scaled data: A matrix of normalized values (residuals) that by default has the same size as the input data set. This data node is used for downstream exploratory analysis, e.g. PCA, Seurat3 integration, etc. (Figure 3). It is not recommended for differential analysis.
SC corrected data: This is equivalent to the 'corrected counts' in the data slot of the SCT assay in a Seurat object, generated after the PrepSCTFindMarkers task. It is used for downstream differential expression (DE) analyses (Figure 3).
Note: When performing DE analysis with Hurdle, the 'shrinkage of error term variance' option might need to be turned off, depending on the dataset. Similarly, the 'Lognormal with shrinkage/voom' option needs to be turned off when running DE with GSA.
References
Christoph Hafemeister, Rahul Satija. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. https://doi.org/10.1101/576827
SCTransform() documentation https://www.rdocumentation.org/packages/Seurat/versions/3.1.4/topics/SCTransform
Saket Choudhary, Rahul Satija. Comparison and evaluation of statistical error models for scRNA-seq. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02584-9
If your experimental design includes a sample or a group of samples serving as a baseline control, you can normalize the experimental samples by subtracting or dividing by the baseline sample(s) using the Normalize to baseline task in Partek Flow. For example, in PCR experiments, the delta Ct values of control samples are subtracted from the delta Ct values of experimental samples to obtain delta-delta Ct values for the experimental samples.
The Normalize to baseline option is available in the Normalization and Scaling section of the context-sensitive menu (Figure 1) upon selection of any count matrix data node.
There are three options to choose the baseline samples:
use all samples
use a group
use matched pairs
To normalize data to all the samples, choose to calculate the baseline using the mean or median of all samples for each feature, and choose to subtract baseline or ratio to baseline for the normalization method (Figure 2), and click Finish.
Use a group to create baseline
When there is a subset of samples that serve as the baseline in the experiment, select use group for Choose baseline samples. The specific group should be specified using sample attributes (Figure 3).
Choose use group, then select the attribute containing the baseline group information (Treatment in this example) and the group that serves as the baseline (Control in this example). The control samples can be filtered out after normalization by selecting the Remove baseline samples after normalization check box.
When using matched pairs, one sample from each pair serves as the control. An attribute specifying the pairs must be selected in addition to an attribute designating which sample in each pair is the baseline sample (Figure 4).
After normalization, all values for the control sample will be either 0 or 1 depending on the normalization method chosen, so we recommend removing baseline samples when using matched pairs.
The output of Normalize to baseline is a Normalized counts data node.
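The baseline calculation can be sketched as follows. This is an illustration under stated assumptions: the matrix holds one row per sample, `baseline_rows` lists the indices of the baseline samples (all samples, or the members of a group), and the function name is hypothetical. Matched pairs would apply the same arithmetic within each pair.

```python
from statistics import mean, median


def normalize_to_baseline(matrix, baseline_rows, method="subtract", stat=mean):
    """Subtract or divide by a per-feature baseline computed from the baseline samples."""
    # One baseline value per feature: mean (or median) over the baseline samples.
    baseline = [stat(col) for col in zip(*(matrix[i] for i in baseline_rows))]
    if method == "subtract":
        return [[x - b for x, b in zip(row, baseline)] for row in matrix]
    # Ratio to baseline.
    return [[x / b for x, b in zip(row, baseline)] for row in matrix]


data = [[4.0, 8.0],     # control 1
        [6.0, 12.0],    # control 2
        [10.0, 5.0]]    # treated
print(normalize_to_baseline(data, [0, 1], method="subtract"))
print(normalize_to_baseline(data, [0, 1], method="ratio", stat=median))
```

When a single sample serves as its own baseline, as with the control in a matched pair, its values become 0 after subtraction or 1 after taking the ratio, which is why removing baseline samples is recommended in that case.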
This task replaces missing values in the data with estimated values based on the selected method.
First select whether the computation is based on samples/cells or on features, then click Finish to replace missing values. Some methods generate the same results no matter which transform option is selected, e.g. constant value. Others generate different results:
Constant value: specify a value to replace the missing data
Maximum: use the maximum value of the samples/cells or features, depending on the transform option, to replace missing data
Mean: use the mean value of the samples/cells or features, depending on the transform option, to replace missing data
Median: use the median value of the samples/cells or features, depending on the transform option, to replace missing data
Minimum: use the minimum value of the samples/cells or features, depending on the transform option, to replace missing data
K-nearest neighbor (mean): specify the number of neighbors (N); the Euclidean metric is used to find the neighbors, and the mean of the N neighbors replaces the missing data
K-nearest neighbor (median): specify the number of neighbors (N); the Euclidean metric is used to find the neighbors, and the median of the N neighbors replaces the missing data
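As a minimal sketch of how the transform option changes the result, the following Python function implements the Mean method only. The function name is hypothetical, missing values are represented by `None`, and the matrix is assumed to hold one row per sample; the K-nearest neighbor methods are not shown.

```python
from statistics import mean


def impute_mean(matrix, on="features"):
    """Replace None by the mean of the non-missing values of the same feature or sample."""
    if on == "samples":
        fills = [mean(x for x in row if x is not None) for row in matrix]
        return [[fills[s] if x is None else x for x in row]
                for s, row in enumerate(matrix)]
    # Based on features: one fill value per column.
    fills = [mean(x for x in col if x is not None) for col in zip(*matrix)]
    return [[fills[f] if x is None else x for f, x in enumerate(row)]
            for row in matrix]


data = [[1.0, None, 3.0],
        [5.0, 6.0, None]]
print(impute_mean(data, on="features"))    # [[1.0, 6.0, 3.0], [5.0, 6.0, 3.0]]
print(impute_mean(data, on="samples"))     # [[1.0, 2.0, 3.0], [5.0, 6.0, 5.5]]
```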
This normalization is performed on observations (samples) using internal control features (genes). The internal control features, usually housekeeping genes, should not vary among samples[1]. The implementation details are as follows:
1. For each sample, compute the geometric mean of all the control genes (features g1 to gm, where m is the number of control features). These per-sample geometric means are represented by GS1 to GSn, where n is the number of samples.
2. Compute the geometric mean across all samples (GS1 to GSn), represented by GS.
3. Compute the scaling factor for each sample: S1 = GS1/GS, S2 = GS2/GS, ..., Sn = GSn/GS.
4. Normalize all gene expression values by dividing them by the scaling factor of their sample.
Note: The input data node must contain all positive values to compute geometric mean.
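The four steps can be sketched as follows. This is an illustration, not the Partek Flow code: the function names are hypothetical, the matrix holds one row per sample, and `control_features` lists the column indices of the housekeeping genes. All values are assumed to be positive, as required by the note above.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalize_to_housekeeping(matrix, control_features):
    # Step 1: geometric mean of the control genes in each sample (GS1 .. GSn).
    gs = [geometric_mean([row[f] for f in control_features]) for row in matrix]
    # Step 2: geometric mean across all samples (GS).
    g = geometric_mean(gs)
    # Step 3: scaling factor per sample.
    factors = [x / g for x in gs]
    # Step 4: divide every value by the scaling factor of its sample.
    return [[x / s for x in row] for row, s in zip(matrix, factors)]


data = [[10.0, 40.0, 7.0],     # sample 1
        [20.0, 80.0, 28.0]]    # sample 2
print(normalize_to_housekeeping(data, control_features=[0, 1]))
```

In this toy input the control genes of the second sample are twice as high as in the first, so after normalization the controls agree between samples, while the third gene keeps a two-fold difference that is not explained by the controls.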
Select the Normalize to housekeeping genes task in the Normalization and scaling section of the pop-up menu when you select a count matrix data node. The dialog lists all the features included in the data node in the left panel.
Select control genes in the left panel and move them to the right panel. You can also use the search box to find a feature and click the plus button to add it to the right panel.
Click Finish
Frank Speleman. Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biology. 2002.
Symbol | Meaning |
---|---|
S | Sample (or cell for single cell data node) |
F | Feature |
$X_{sf}$ | Value of sample S from feature F (if normalization is performed on a quantification data node, this would be the raw read counts) |
$TX_{sf}$ | Transformed value of $X_{sf}$ |
C | Constant value |
b | Base of log |
Latent semantic indexing (LSI) was first introduced for the analysis of scATAC-seq data by Cusanovich et al. 2018[1]. LSI combines term frequency-inverse document frequency (TF-IDF) normalization with singular value decomposition (SVD). Partek Flow wraps Signac's TF-IDF normalization for single cell ATAC-seq datasets. It is a two-step normalization procedure that normalizes both across cells, to correct for differences in cellular sequencing depth, and across peaks, to give higher values to rarer peaks[2].
TF-IDF normalization in Flow can be invoked in Normalization and scaling section by clicking any single cell counts data node (Figure 1).
To run TF-IDF normalization,
Click a single cell counts data node
Click the Normalization and scaling section in the toolbox
Click TF-IDF normalization
The output of TF-IDF normalization is a new data node that has been normalized by log(TF x IDF). We can then use this new normalized matrix for downstream analysis and visualization (Figure 2).
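A minimal sketch of a log(TF x IDF) calculation is shown below. Several details are assumptions made for the example rather than statements about Partek Flow or Signac: TF is taken as the count divided by the total counts of the cell, IDF as the number of cells divided by the total counts of the peak, and the product is multiplied by a scale factor of 10,000 and passed through log(1 + x) so that zero counts stay at zero. The function name `tf_idf` is hypothetical.

```python
import math


def tf_idf(matrix, scale_factor=1e4):
    """Sketch of log(TF x IDF); matrix[c][p] is the count of cell c on peak p."""
    n_cells = len(matrix)
    cell_totals = [sum(row) for row in matrix]           # sequencing depth per cell
    peak_totals = [sum(col) for col in zip(*matrix)]     # total counts per peak
    result = []
    for row, depth in zip(matrix, cell_totals):
        result.append([
            # TF corrects for depth; IDF up-weights peaks that are rare across cells.
            math.log1p((x / depth) * (n_cells / total) * scale_factor)
            for x, total in zip(row, peak_totals)
        ])
    return result


counts = [[1, 0, 1],
          [2, 1, 0]]
for row in tf_idf(counts):
    print(row)
```

In the first cell, the first and third peaks have the same raw count, but the third peak is rarer across cells and therefore receives the higher normalized value. The sketch assumes every cell and every peak has at least one count.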
Cusanovich, D., Reddington, J., Garfield, D. et al. The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature 555, 538–542 (2018). https://doi.org/10.1038/nature25981