The ANOVA method applies a specified lognormal model to all features.
To set up an ANOVA model, or the alternative Welch's ANOVA (used on normally distributed data that violates the assumption of homogeneity of variance), select factors from the sample attributes. The factors can be categorical or numeric attributes. Check the box next to an attribute and click the Add factors button to add it to the model (Figure 1).
LIMMA-trend and LIMMA-voom setup dialogs are identical to ANOVA's setup.
Note: The LIMMA-voom method can only be invoked on output data nodes from normalization methods that produce library sizes:
TMM, CPM, Upper Quartile, Median ratio, Postcounts
When more than one factor is selected, click the Add interaction button to add an interaction term for the selected factors.
Once a factor is added to the model, you can specify whether the factor is a random effect (check Random check box) or not.
Most factors in an analysis of variance are fixed factors, i.e. the levels of that factor represent all the levels of interest. Examples of fixed factors include gender, treatment, and genotype. However, in more complex experiments a factor can be a random effect, meaning the levels of the factor represent only a random sample of all the levels of interest. Examples of random effects include subject and batch. Consider an example where one factor is Treatment (with levels treated and control) and another factor is Subject (the subjects selected for the experiment). Here, Treatment is a fixed factor since the levels treated and control represent all conditions of interest. Subject, on the other hand, is a random effect since the subjects are only a random sample of all possible levels of that factor. When a model has both fixed and random effects, it is called a mixed model.
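For reference, a linear mixed model can be written in the standard textbook form (a sketch for orientation, not Partek Flow's internal parameterization):

$$y = X\beta + Zu + \varepsilon, \qquad u \sim \mathcal{N}(0, \sigma_u^2 I), \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I),$$

where the fixed effects $X\beta$ correspond to factors such as Treatment and the random effects $Zu$ correspond to factors such as Subject or batch.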
When more than one factor is added to the model, click on the Cross tabulation link at the bottom to view the relationship between the factors in a different browser tab (Figure 2).
Once the model is set, click the Next button to set up comparisons (contrasts) (Figure 3).
Start by choosing a factor or interaction from the Factor drop-down list. The subgroups (levels) of the factor or interaction will be displayed in the left panel; click to select one or more levels or subgroup names and move them to one of the boxes on the right. The ratio/fold change for the comparison will use the group in the top box as the numerator and the group in the bottom box as the denominator. When multiple levels (groups) are placed in the numerator and/or denominator boxes, the behavior depends on the mode: in Combine mode, clicking the Add comparison button combines all numerator levels and all denominator levels into a single comparison in the Comparisons table below; in Pairwise mode, clicking the Add comparison button splits the numerator and denominator levels into a factorial set of comparisons, i.e. every numerator level paired with every denominator level is added to the Comparisons table. For example, with numerator levels {A1, A2} and denominator levels {B1, B2}, Combine adds one comparison, (A1+A2) vs (B1+B2), while Pairwise adds four: A1 vs B1, A1 vs B2, A2 vs B1, and A2 vs B2. Multiple comparisons from different factors can be added from the specified model.
Click Configure to customize the Advanced options (Figure 4).
The Low-expression feature and Multiple test correction sections are the same as the matching GSA advanced options; see the GSA advanced options section.
Report option
Use only reliable estimation results: There are situations when a model estimation procedure does not fail outright but still encounters some difficulties. In that case, it can still generate p-values and fold changes for the comparisons, but they are not reliable, i.e. they can be misleading. Therefore, the default of Use only reliable estimation results is set to Yes.
Report p-value for effects: If set to No, only the p-values of the comparisons are displayed in the report; the p-values of the factors and interaction terms are not shown in the report table. If set to Yes, type III p-values are displayed for all non-random terms in the model in addition to the comparison p-values.
Shrinkage to error term variance: by default, None is selected, which corresponds to the lognormal model. The Limma-trend and Limma-voom options are lognormal with shrinkage (Limma-trend is the same as the GSA default option, lognormal with shrinkage). The shrinkage options are recommended for small sample size designs; no random effects can be included when shrinkage is performed. If there are numeric factors in the model, partial correlations cannot be reported for the numeric factors when shrinkage is performed. Limma-trend works well if the ratio of the largest library size to the smallest is no more than about 3-fold; it is simple and robust for any type of data. Limma-voom is recommended for sequencing data when library sizes vary substantially, but it can only be invoked on data nodes normalized using the TMM, CPM, or Upper quartile methods, whereas Limma-trend can be applied to data normalized with any method.
Report partial correlations: If the model has numeric factor(s) and Yes is chosen, the partial correlation coefficient(s) of the numeric factor(s) will be displayed in the result table. When No is chosen, partial correlation coefficients are not shown.
Data has been log transformed with base: shows the current scale of the input data for this task.
Since there is only one model for all features, there are no pie charts with design model and response distribution information. The gene list table format is the same as in the GSA report.
Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, JRSS, B, 57, 289-300.
Storey JD. (2003) The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics, 31: 2013-2035.
Auer, 2011, A two-stage Poisson model for testing RNA-Seq
Burnham, Anderson, 2010, Model selection and multimodel inference
Law C, Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 2014 15:R29.
Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biology, 2010
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Powerful Partek Flow statistical analysis tools help identify differential expression patterns in the dataset. These can take into account a wide variety of data types and experimental designs.
Partek Flow offers the DESeq2 method for differential expression detection. There are two options: DESeq2(R) and DESeq2.
DESeq2(R) is a wrapper of Bioconductor package ‘DESeq2’. The implementation details for DESeq2(R) can be found at the external DESeq2 documentation page, which includes changes made by the algorithm authors since the publication of the original manuscript (Love, Huber, and Anders 2014).
The DESeq2(R) task can be invoked from data nodes generated by quantification tasks that contain raw read count values for each feature in each sample (Gene counts, Transcript counts, microRNA counts, etc.). DESeq2(R) cannot be run on a normalized counts data node because DESeq2(R) internally corrects for library size and implements a low-expression filter.
If a raw count value includes a decimal fraction, the value will be rounded to an integer before DESeq2(R) is performed. The DESeq2(R) task itself includes feature filtering, normalization, and differential analysis. Please note that to run the DESeq2(R) task, you have to install the DESeq2 R package beforehand (the installation is performed outside of Partek Flow).
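A typical Bioconductor installation, run in R outside of Partek Flow, looks like the following (a minimal sketch; consult the Bioconductor/DESeq2 documentation for the exact steps for your R version):

```r
# Install DESeq2 from Bioconductor (run in R, outside of Partek Flow)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")
```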
The DESeq2 task, on the other hand, is a Partek Flow native implementation of the DESeq2 differential expression detection algorithm and is much faster than DESeq2(R). As with GSA and ANOVA, before running this task we recommend first removing (filtering out) features expressed at a low level and then performing normalization using Median ratio (DESeq2 only).
Note: DESeq2 differential analysis can only be performed on output data nodes from normalization methods that produce library sizes:
TMM, CPM, Upper Quartile, Median ratio, Postcounts
Installation of R is not required to run the DESeq2 task.
Categorical and numeric attributes, as well as interaction terms can be added to the DESeq2 model. The DESeq2 configuration dialog for adding attributes and interactions to the model is very similar to the ANOVA configuration dialog. However, DESeq2(R) has two important limitations not shared by GSA or ANOVA/LIMMA-trend/LIMMA-voom.
First, interaction terms cannot be added to contrasts in DESeq2(R). In order to perform contrasts of an interaction term in DESeq2(R), a new attribute that combines the factors of interest must be added and the contrast performed on the new combined attribute. This limitation of DESeq2(R) is detailed in the official DESeq2 documentation page. To perform contrasts of interaction terms without creating new combined attributes, please use either DESeq2, GSA, or ANOVA/LIMMA-trend/LIMMA-voom method.
Second, DESeq2(R) only allows two subgroups to be compared in each contrast. To analyze multiple subgroups, please use either DESeq2, GSA, or ANOVA method.
In the DESeq2 advanced options configuration dialog, there is a reference level selection option:
A reference level is specified for each categorical factor in the model, and the result may depend on the choice. In R, a reference level is chosen by default whenever a categorical factor is present in the model. This Flow option was created to allow the user to specify exactly the same reference level as in an R script if need be, e.g. to compare the results between DESeq2 and DESeq2(R).
The report produced by DESeq2 is similar to the ANOVA report: each row is a feature, and columns include the p-value, FDR p-value, and fold change in linear scale for each contrast. However, the DESeq2(R) report does not include LSMeans information for the compared groups.
The R implementation of DESeq2, and therefore DESeq2(R), fails to generate results in some special cases of input data that are handled differently in Partek's implementation of DESeq2.
First, if two or more categorical factors are present in the model, there can be a situation where some combinations of factor levels have no observations. In Flow, you can see whether that is the case by clicking the Cross tabulation link after the factors have been selected:
In this example, if the user tries to fit the model including the Factor_A, Factor_G, and Factor_A*Factor_G terms, the R implementation of DESeq2 fails. At the same time, none of these three terms is completely redundant, even though not all contrasts are estimable. It is possible to obtain an answer in R only by combining Factor_A and Factor_G into a single new factor with no empty levels. Partek's implementation of DESeq2 eliminates the need for that extra step and produces an answer right away. However, the results (most likely, the p-values) may be somewhat different from what one would obtain by using a combined factor in R. To match the R results perfectly, create the combined factor in Flow as well.
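As an illustrative sketch of the combined-factor approach in R (assuming a raw counts matrix counts and a sample table coldata; the level names used in the contrast are hypothetical):

```r
library(DESeq2)

# Combine the two factors so that only observed level combinations become groups
coldata$combined <- factor(paste(coldata$Factor_A, coldata$Factor_G, sep = "_"))

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ combined)
dds <- DESeq(dds)

# Contrasts are then requested between levels of the combined factor, e.g.:
res <- results(dds, contrast = c("combined", "A1_G1", "A2_G1"))
```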
Second, occasionally all of the gene-wise dispersion estimates are too close to zero. In that case, the R implementation fails with the error message "all gene-wise dispersion estimates are within 2 orders of magnitude from the minimum value ... one can instead use the gene-wise estimates as final estimates". As suggested in the error message, the Flow implementation uses the gene-wise dispersion estimates instead of failing the task.
In all of these special cases, an informative warning is output in the Flow log.
In R, shrinkage of log2 fold changes is a separate step performed by the lfcShrink() function. In Flow, that functionality is absent in DESeq2(R) but present in the DESeq2 task. The latter implements the shrinkage method corresponding to the "ashr" option in lfcShrink(). The default shrinkage option in lfcShrink() is "apeglm", but that default method is unable to produce results for some comparisons, whereas "ashr" has no restrictions. The fold change shrinkage results are reported in the "Shrunken Log2(Ratio)" and "s-value" columns of the DESeq2 task report.
In addition to the issues addressed in Differential Analysis Troubleshooting, DESeq2 may generate missing values in the multiplicity adjustment columns (such as FDR) if "independent filtering" is enabled in Advanced Options:
"Independent filtering" tries removing some features with low expression in order to increase the statistical power. For such removed features, the p-value is reported but FDR and similar multiplicity adjustment measures are set to "?". In order to avoid the missing values in the report, set the option to "No".
Because of the limitations in DESeq2(R), we recommend using the DESeq2 task.
Love MI, Huber W, and Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 2014;15(12):550.
Hurdle model is a statistical test for differential analysis that utilizes a two-part model: a discrete (logistic) part for modeling zero vs. non-zero counts and a continuous (log-normal) part for modeling the distribution of non-zero counts. In RNA-Seq data, the discrete part can be thought of as modeling whether or not the gene is expressed, and the continuous part as modeling how much it is expressed, if it is expressed. Hurdle model is well suited to data sets in which features have many zero values, such as single cell RNA-Seq data.
On default settings, Hurdle model is equivalent to MAST, a published differential analysis tool designed for single cell RNA-Seq data that uses a hurdle model [1].
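For orientation, the two-part model can be sketched as follows (notation loosely follows the MAST publication [1]; this is a sketch, not the exact internal parameterization): for feature $g$ in sample $i$, let $Y_{ig}$ be the expression value and $Z_{ig}$ indicate whether the feature is detected above the background threshold. Then

$$\operatorname{logit}\big(\Pr(Z_{ig}=1)\big) = X_i \beta^{D}_{g}, \qquad Y_{ig} \mid Z_{ig}=1 \;\sim\; \mathcal{N}\big(X_i \beta^{C}_{g},\, \sigma^2_g\big),$$

where $X_i$ contains the selected factors, the discrete coefficients $\beta^{D}_{g}$ model detection, and the continuous coefficients $\beta^{C}_{g}$ model the expression level in the samples where the feature is detected.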
We recommend normalizing your data prior to running Hurdle model, but it can be invoked on any counts data node.
Click the counts data node
Click the Differential analysis section in the toolbox
Click Hurdle model
Select the factors and interactions to include in the statistical test (Figure 1)
Numeric and categorical attributes can be added as factors. To add attributes as factors, check the attribute check boxes and click Add factors. To add interactions between attributes, check the attribute check boxes and click Add interaction.
Click Next
Define comparisons between factor or interaction levels (Figure 2)
Adding comparisons in Hurdle model uses the same interface as ANOVA/LIMMA-trend/LIMMA-voom. Start by choosing a factor or interaction from the Factor drop-down list. The levels of the factor or interaction will appear in the left-hand panel. Select levels in the panel on the left and click the > arrow buttons to add them to the top or bottom panels on the right. The control level(s) should be added to the bottom box and the experimental level(s) should be added to the top box. Click Add comparison to add the comparison to the Comparisons table. Only comparisons in the Comparisons table will be included in the statistical test.
Click Finish to run the statistical test
Hurdle model produces a Feature list task node. The results table and options are the same as in the GSA task report, except for the last two columns (Figure 3). The percentage of cells in which the feature is detected (i.e. the value is above the background threshold) in each group (Pct(group1), Pct(group2)) is calculated and included in the Hurdle model report.
Low-value filter allows you to specify criteria to exclude features that do not meet the requirements for the calculation. If a filter features task was performed in the upstream analysis, the default of this filter is set to None; otherwise, the default is Lowest average coverage set to 1.
Lowest average coverage: the computation will exclude a feature if its geometric mean across all samples is below the specified value
Lowest maximum coverage: the computation will exclude a feature if its maximum across all samples is below the specified value
Minimum coverage: the computation will exclude a feature if its sum across all samples is below the specified value
None: include all features in the computation
Multiple test correction can be performed on the p-values of each comparison, with FDR step-up being the default. If you check the Storey q-value, an extra column with q-values will be added to the report.
There are situations when a model estimation procedure does not fail outright but still encounters some difficulties. In that case, it can still generate p-values and fold changes for the comparisons, but they are not reliable, i.e. they can be misleading. Therefore, the default of Use only reliable estimation results is set to Yes.
Shows the current scale of the input data for this task
Set the threshold for a feature to be considered expressed for the two-part hurdle model. If the feature value is greater than the specified value, it is considered expressed. If the upstream data node contains log-transformed values, be sure to specify the value on the same log scale. Default value is 0.
Applies shrinkage to the error variance in the continuous (log-normal) part of the hurdle model. The error term variance will be shrunk towards a common value, and a shrinkage plot will be produced on the task report page if this option is enabled. Default is Enabled.
Applies shrinkage to the regression coefficients in the discrete (logistic) part of the hurdle model. The initial versions of MAST contained a bug that was fixed in its R source in March 2020. However, for the sake of reproducibility, the fix was released only on a topic branch in the MAST GitHub repository [2] and the default version of MAST remained as is. To install the fixed version of MAST in R, run the script below.
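A minimal sketch, assuming the remotes package is used to install from the topic branch referenced in [2]:

```r
# Install the MAST topic branch containing the regression coefficient shrinkage fix [2]
install.packages("remotes")
remotes::install_github("RGLab/MAST", ref = "fix/bayesglm")
```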
In Flow, the user can switch between the fixed and default version by selecting Fixed version or Default version, respectively. To disable the shrinkage altogether, choose Disabled.
[1] Finak, G., McDavid, A., Yajima, M., Deng, J., Gersuk, V., Shalek, A. K., ... & Linsley, P. S. (2015). MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome biology, 16(1), 278.
[2] MAST topic branch that contains the regression coefficient shrinkage fix:
https://github.com/RGLab/MAST/tree/fix/bayesglm
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
GSA stands for gene specific analysis, the goal of which is to identify the statistical model that is best for a specific gene among all the selected models, and then use that best model to calculate the p-value and fold change.
The first step of GSA is to choose which attributes to include in the test (Figure 1). All sample attributes, including numeric and categorical attributes, are displayed in the dialog; use the check buttons to select among them. An experiment with two attributes, Cell type (with groups A and B) and Time (time points 0, 5, 10), is used as an example in this section.
Click Next to display the levels of each attribute to be selected for sub-group comparisons (contrasts).
To compare A vs. B, select A for Cell type on the top, B for Cell type on the bottom and click Add comparison. The specified comparison is added to the table below (Figure 2).
To compare Time point 5 vs. 0, select 5 for Time on the top, 0 for Time on the bottom, and click Add comparison (Figure 3).
To compare cell types at a certain time point, e.g. time point 5, select A and 5 on the top, and B and 5 on the bottom. Thereafter click Add comparison (Figure 4).
Multiple comparisons can be computed in one GSA run; Figure 5 shows the three comparisons above added to the computation.
In terms of the design pool, i.e. the choices of model designs to select from, the two factors in this example lead to seven possibilities in the design pool:
Cell type
Time
Cell type, Time
Cell type, Cell type * Time
Time, Cell type * Time
Cell type * Time
Cell type, Time, Cell type * Time
In GSA, if a 2nd order interaction term is present in the design, then all first order terms must also be present; in other words, if the Cell type * Time interaction is present, both factors must be included in the model. Consequently, the following designs are not considered:
Cell type, Cell type * Time
Time, Cell type * Time
Cell type * Time
If a comparison is added, models that do not contain the compared factor are also eliminated. E.g. if a comparison of Cell type A vs. B is added, only designs that include the Cell type factor will be in the computation. These are:
Cell type
Cell type, Time
Cell type, Time, Cell type * Time
The more comparisons on different terms are added, the fewer models will be included in the computation. If the following comparisons are added in one GSA run:
A vs B (Cell type)
5 vs 0 (Time)
only the following two models will be computed:
Cell type, Time
Cell type, Time, Cell type * Time
If comparisons on all the three terms are added in one GSA run:
A vs B (Cell type)
5 vs 0 (Time)
A*5 vs B*5 (Cell type * Time)
then only one model will be computed:
Cell type, Time, Cell type * Time
If GSA is invoked from a quantification output data node directly, you will have the option to use the default normalization methods before performing differential expression detection (Figure 6).
If invoked from a Partek E/M method output, the data node contains raw read counts and the default normalization is:
Normalize to total count (RPM)
Add 0.0001 (offset)
If invoked from a Cufflinks method output, the data node contains FPKM and the default normalization is:
Add 0.0001 (offset)
If advanced normalization needs to be applied, perform the Normalize counts task on a quantification data node before doing differential expression detection (GSA or ANOVA).
Click on Configure to customize Advanced options (Figure 7).
The Low-expression feature section allows you to specify criteria to exclude features that do not meet the requirements for the calculation (see the illustrative sketch after this list). If a filter features task was performed in the upstream analysis, the default of this filter is set to None; otherwise, the default is Lowest average coverage set to 1.
Lowest average coverage: the computation will exclude a feature if its geometric mean across all samples is below the specified value
Lowest maximum coverage: the computation will exclude a feature if its maximum across all samples is below the specified value
Minimum coverage: the computation will exclude a feature if its sum across all samples is below the specified value
None: include all features in the computation
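As an illustrative sketch (toy data in R; not Partek Flow code), the three criteria on a features-by-samples count matrix can be computed as follows:

```r
# Toy counts matrix: features in rows, samples in columns
counts <- matrix(c(2, 0,  1, 4,
                   9, 7, 12, 8,
                   0, 0,  1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"), paste0("s", 1:4)))
threshold <- 1
geo_mean <- apply(counts, 1, function(x) exp(mean(log(x))))   # 0 if any count is 0
keep_lowest_average <- geo_mean >= threshold                  # Lowest average coverage
keep_lowest_maximum <- apply(counts, 1, max) >= threshold     # Lowest maximum coverage
keep_minimum        <- rowSums(counts) >= threshold           # Minimum coverage
```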
Multiple test correction can be performed on the p-values of each comparison, with FDR step-up being the default (1). Other options, such as Storey q-value (2) and Bonferroni, are also provided; select one method at a time. None means no multiple test correction will be performed.
FDR step-up:
Suppose there are n p-values (n is the number of features). The p-values are sorted in ascending order, $p_{(1)} \le p_{(2)} \le \dots \le p_{(n)}$, and m denotes the rank of a p-value. The calculation compares $p_{(m)} \cdot (n/m)$ with the specified alpha level, and the cut-off p-value is the largest one for which this product is at or below the alpha level. The goal of the step-up method is to find

$$K^* = \max\Big\{ k : \tfrac{n}{k}\, p_{(k)} \le \alpha \Big\}.$$

Define the step-up value as

$$S_k = \min_{j \ge k} \tfrac{n}{j}\, p_{(j)}.$$

Then an equivalent definition of $K^*$ is

$$K^* = \max\{ k : S_k \le \alpha \}.$$

In particular, when $k = n$, the step-up value is $S_n = p_{(n)}$. In order to find $K^*$, start with $S_n$ and then go up the list until you find the first step-up value that is less than or equal to alpha.
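The following R sketch (illustrative only, not Partek Flow code) computes the step-up values from the definitions above and checks them against R's built-in p.adjust():

```r
# Toy p-values for n features
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216)
n <- length(p)
ord <- order(p)                                              # ascending order, rank m = 1..n
s <- pmin(1, rev(cummin(rev(p[ord] * n / seq_len(n)))))      # S_k = min_{j >= k} (n/j) p_(j)
alpha <- 0.05
K_star <- max(which(s <= alpha), 0)                          # number of features passing FDR alpha
all.equal(s, p.adjust(p, method = "BH")[ord])                # TRUE: step-up values = BH-adjusted p-values
```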
Storey q-value:
The q-value is the minimum "positive false discovery rate" (pFDR) that can occur when rejecting a statistic. For an observed statistic T = t and a nested set of rejection regions $\{C\}$, the q-value of t is

$$q(t) = \inf_{\{C \,:\, t \in C\}} \mathrm{pFDR}(C).$$
Bonferroni:
In the results, the best model's Akaike weight is also reported. The weight is interpreted as the probability that the model would be picked as the best if the study were reproduced. The Akaike weight ranges from 0 to 1, where a value close to 1 means the best model is clearly superior to the other candidates in the model pool; if the best model's Akaike weight is close to 0.5, on the other hand, the best model is likely to be replaced by another candidate if the study were reproduced. The best model is still used, but the accuracy of that best shot is fairly low.
The default value for Enable multimodel approach is Yes. This means that the estimation will utilize all models in the pool by assigning weights to them based on AIC or AICc. If No is selected instead, the estimation is based only on the single best model, i.e. the one with the smallest AIC or AICc. The output p-value differs depending on the selected multimodel option, but the fold change is the same. The multimodel approach is recommended when the best model's Akaike weight is not close to 1, meaning that the best model is not compelling.
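For reference, the Akaike weights follow the standard definition (see Burnham and Anderson in the References): if $\Delta_i = AIC_i - \min_j AIC_j$ for model $i$ in the pool, then

$$w_i = \frac{\exp(-\Delta_i/2)}{\sum_j \exp(-\Delta_j/2)},$$

and the same form applies when AICc is used in place of AIC.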
There are situations when a model estimation procedure does not fail outright but still encounters some difficulties. In that case, it can still generate p-values and fold changes for the comparisons, but those values are not reliable and can be misleading. It is recommended to use only reliable estimation results, so the default for Use only reliable estimation results is set to Yes.
Partek Flow provides five response distribution types for each design model in the pool, namely:
Normal
Lognormal (the same as ANOVA task)
Lognormal with shrinkage (the same as limma-trend method 4)
Negative binomial
Poisson
We recommend using the lognormal with shrinkage distribution (the default); an experienced user may want to click Custom to configure the model type and p-value type (Figure 8).
If multiple distribution types are selected, then the total number of models evaluated for each feature is the product of the number of design models and the number of distribution types. In the above example, suppose we have only compared A vs. B in Cell type as in Figure 2; then the design model pool will have the following three models:
Cell type
Cell type, Time
Cell type, Time, Cell type * Time
If we select Lognormal with shrinkage and Negative binomial, i.e. two distribution types, the best model fit for each feature will be selected from 3 * 2 = 6 models using AIC or AICc.
The design pool can also be restricted by Min error degrees of freedom. When Model types configuration is set to Default, this is automated as follows: it is desirable to keep the error degrees of freedom at or above six, so the threshold is automatically set to the largest k, 0 <= k <= 6, for which admissible models exist. An admissible model is one that can be estimated given the specified contrasts. In the above example, when we compare A vs. B in Cell type, there are three possible design models. The error degrees of freedom of the model Cell type are the largest and those of the model Cell type, Time, Cell type * Time are the smallest:
k(Cell type) > k(Cell type, Time) > k (Cell type, Time, Cell type*Time)
If the sample size is large enough that k >= 6 for all three models, all the models will be evaluated and the best model will be selected for each feature. However, if the sample size is so small that none of the models has k >= 6, only the model with the maximal k will be used in the calculation. If the maximal k happens to be zero, we are forced to use the Poisson response distribution only.
There are two types of p-value: F and Wald. Poisson, negative binomial, and normal models can generate p-values using either Wald or F statistics, whereas lognormal models always employ the F statistic. The more replicates in the study, the smaller the difference between the two options. When there are no replicates, only the Poisson distribution can be used, and the p-value is generated using the Wald statistic.
Note: Partek Flow keeps track of the log status of the data; no matter whether GSA is performed on logged data or not, the LSMeans, ratio, and fold change are always calculated in linear scale. Ratio is the ratio of the two LSMeans of the two groups in the comparison (left is the numerator, right is the denominator). Fold change is converted from the ratio: when the ratio is greater than 1, the fold change is the same as the ratio; when the ratio is less than 1, the fold change is -1/ratio. In other words, the fold change value is always >= 1 or <= -1; there is no fold change value between -1 and 1. When the LSMean of the numerator group is greater than that of the denominator group, the fold change is greater than 1; when the LSMean of the numerator group is less than that of the denominator group, the fold change is less than -1; when the two groups are the same, the fold change is 1. Logratio is the log2-transformed ratio, which is equivalent to the log fold change reported by some other software.
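As a small illustration (a hypothetical helper function, not Flow's API), the ratio-to-fold-change convention described above can be expressed as:

```r
# Convert a linear-scale ratio of LSMeans into the signed fold change
ratio_to_fold_change <- function(ratio) {
  ifelse(ratio >= 1, ratio, -1 / ratio)
}
ratio_to_fold_change(c(4, 1, 0.25))   # returns  4  1 -4
log2(c(4, 1, 0.25))                   # logratio: 2  0 -2
```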
If there are multiple design models and multiple distribution types included in the calculation, the fraction of genes using each model and type will be displayed as pie charts in the task result (Figure 9).
A feature list with the p-values and fold changes generated from the best selected model is displayed in a table along with other statistical information (Figure 10). By default, the gene list table is sorted by the first p-value column.
The following information is included in the table by default:
Feature ID information: if transcript level analysis was performed, and the annotation file has both transcript and gene level information, both gene ID and transcript ID are displayed. Otherwise, the table shows only the available information.
Total counts: total number of read counts across all the observations from the input data.
Each contrast outputs the p-value, FDR step-up p-value, ratio, and fold change in linear scale, as well as the LSMean of each group in the comparison in linear scale
When you click the Optional columns link in the top-left corner of the table, extra information can be selected and displayed in the table:
Maximum count: maximum number of read counts across all the observations from the input data.
Geometric mean: geometric mean value of the input counts across all observations.
Arithmetic mean: arithmetic mean value of input counts across all observations.
By clicking Optional columns, you can also retrieve additional statistical information, e.g. Average coverage, which is the geometric mean of normalized reads in linear scale across all samples; fold change lower/upper limits generated from the 95% confidence interval; and feature annotation information if there are additional annotation fields in the annotation model specified for quantification, such as genomic location and strand information.
Select the check box of a field and specify the cutoff by typing directly or using the slider. Press Enter to apply. After the filter has been applied, the total number of included features is updated at the top of the panel (Result).
Note that for the LSMean, there are two columns corresponding to the two groups in the contrast. The cutoffs can be specified for the left and right columns separately. For example, in Figure 6, the LSMean (left) corresponds to A while the LSMean (right) is for B.
The filtered result can be saved into a filtered data node by selecting the Generate list button at the lower-left corner of the table. Selecting the Download button at the lower-right corner of the table downloads the table as a text file to the local computer.
If the lognormal with shrinkage method was selected for GSA, a shrinkage plot is generated in the report (Figure 13). The X-axis shows the log2 value of the average coverage. The plot helps to determine the threshold for low-expression features: if there is an increase before a monotone decreasing trend on the left side of the plot, you need to set a higher threshold on the low-expression filter. Detailed information on how to set the threshold can be found in the GSA white paper.
Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, JRSS, B, 57, 289-300.
Storey JD. (2003) The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics, 31: 2013-2035.
Auer, 2011, A two-stage Poisson model for testing RNA-Seq
Burnham, Anderson, 2010, Model selection and multimodel inference
Law C, Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 2014 15:R29.
Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biology, 2010
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Alternative splicing results in a single gene coding for multiple protein isoforms, so this task can only be invoked from transcript-level data. The algorithm is based on ANOVA and detects genes whose transcripts show different expression changes in different biological groups, e.g. a gene has two transcripts, A and B; transcript A is up-regulated in the treated group compared to the control group, while B is down-regulated in the treated group.
The alt-splicing dialog is very similar to the ANOVA dialog, since the analysis is based on the specified ANOVA model. To set up an ANOVA model, first choose factors from the available sample attributes. The factors can be categorical or numeric attribute(s). Click a check box to select a factor and click the Add factors button to add it to the model (Figure 1).
Only one alt-splicing factor needs to be selected from the ANOVA factors. The ANOVA model is based on the factors specified in the dialog, while the transcript ID effect and the interaction of transcript ID with the alt-splicing factor are added to the model automatically.
Transcript ID effect: not all transcripts of a gene are expressed at the same level, so transcript ID is added to the model to account for transcript-to-transcript differences.
Interaction of transcript ID with the alt-splicing factor: this effect is used to estimate whether different transcripts have different expression among the levels of the same factor.
Suppose there is an experiment designed to detect transcripts showing differential expression between two tissue groups: liver vs. muscle. The alt-splicing ANOVA dialog allows you to specify the ANOVA model, which in this analysis is Tissue. The alt-splicing factor is chosen from the ANOVA factor(s), so the alt-splicing factor is also Tissue (Figure 1).
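As an illustration of the model for this example (a sketch of the effects described above, not the exact internal parameterization), the fitted ANOVA can be written as

$$y_{ijk} = \mu + \mathrm{Tissue}_i + \mathrm{Transcript}_j + (\mathrm{Tissue} \times \mathrm{Transcript})_{ij} + \varepsilon_{ijk},$$

where the p-value of the $(\mathrm{Tissue} \times \mathrm{Transcript})_{ij}$ interaction is the alt-splicing p-value reported for the gene.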
The alt-splicing range will limit analysis to genes possessing the number of transcripts in the specified range. Lowering the maximum number of transcripts will increase the speed of analysis.
Click Next to set up the comparisons (Figure 2). The levels (i.e. subgroups) of the alt-splicing factor will be displayed in the left panel; click to select a level name and move it to one of the panels on the right. The fold change calculation for the comparison will use the group in the top panel as the numerator and the group in the bottom panel as the denominator. Click the Add comparison button to add a comparison to the comparisons table.
Click Configure to customize the Advanced options (Figure 3).
The Low-expression feature and Multiple test correction sections are the same as the matching GSA advanced options; see the GSA advanced options discussion.
Report option
Use only reliable estimation results: There are situations when a model estimation procedure does not fail outright but still encounters some difficulties. In that case, it can still generate p-values and fold changes for the comparisons, but they are not reliable, i.e. they can be misleading. Therefore, the default is set to Yes.
Data has been log transformed with base: shows the current scale of the input data for this task.
For this analysis, only genes with more than one transcript are included in the calculation. The report format is the same as the ANOVA report: each row represents a transcript, and besides the statistical information on the specified comparisons, there is also alt-splicing information at the right end of the table. That information is represented by the p-value of the interaction of transcript ID with the alt-splicing factor. Note that all transcripts of the same gene have the same interaction p-value. A small p-value indicates a significant alt-splicing event, so the table is sorted by that p-value by default (Figure 4).
In the example above (Figure 4), the alt-splicing p-value of gene SLC25A3 is very small, which indicates that this gene shows preferential transcript expression across tissues. There are three splicing variants of the gene: NM_213611, NM_005888, and NM_002635. The fold changes show that NM_005888 has higher expression in muscle relative to liver (negative fold change, with liver as the reference category), while NM_002635 has higher expression in liver.
To visualize the difference, click on the Browse to location icon (Figure 5). The 3rd exon is differentially expressed between NM_005888 and NM_002635. Muscle primarily expresses NM_005888 while liver primarily uses NM_002635.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Kruskal-Wallis and Dunn's tests (Non-parametric ANOVA) task is used to identify differentially expressed genes among two or more groups. Note that such rank-based tests are generally advised for larger sample sizes.
To invoke the Kruskal-Wallis test, select any count-based data node; these include:
Gene counts
Transcript counts
Normalized counts
Select Statistics > Differential analysis in the context-sensitive menu, then select Kruskal-Wallis (Figure 1).
Select a specific factor for analysis and click the Next button (Figure 2). Note that this task can only take into account one factor at a time.
For more complicated experimental designs, go back to the original count data that will be used as input and perform Rank normalization at the Features level (Figure 3). The resulting Normalized counts data node can then be analyzed using the Detect differential expression (ANOVA) task, which can take into account multiple factors as well as interactions.
Define the desired comparisons between groups and click the Finish button (Figure 4). Note that comparisons can only be added between single groups (i.e. one group per box).
The results of the analysis will appear similar to other differential expression analysis results. However, the column to indicate mean expression levels for each group will display the median instead (Figure 5).
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
This task can be invoked from a count matrix data node or a clustering task report (Statistics > Compute biomarkers). It performs Student's t-tests on the selected attribute, comparing one subgroup at a time vs. all the others combined. By default, the up-regulated genes are reported as biomarkers.
In the set-up dialog, select the attribute from the drop-down list. The available attributes are categorical attributes that can be seen on the Data tab (i.e. project-level attributes) as well as data node-specific annotations, e.g. a graph-based clustering result (Figure 1). If the task is run on a graph-based clustering output data node, the calculation uses the upstream data node that contains feature counts, typically the input data node of PCA.
By default, the result outputs the features that are up-regulated by at least 1.5-fold (in linear scale) in each subgroup compared to the others. The result is displayed in a table in which each column is a subgroup name and each row is a feature. Features are ranked by ascending p-value within each subgroup. An example is shown in Figure 2. If a subgroup has fewer biomarkers than the others, the "extra" fields for that subgroup are left blank.
Figure 3. Biomarkers table (example). Top 10 biomarkers for each cluster are shown. Download link provides the full results table
Furthermore, the Download link (upper-left corner of the table report) downloads a .txt file to the local computer (default file name: Biomarkers.txt) containing the full report: all genes with fold change > 1.5, with the corresponding fold changes and p-values.
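The one-vs-rest logic described above can be sketched in R as follows (toy data; the fold change convention assumes log2-scale input, and this is not Partek Flow's implementation):

```r
set.seed(1)
expr  <- matrix(rnorm(20 * 12, mean = 5), nrow = 20,
                dimnames = list(paste0("gene", 1:20), paste0("cell", 1:12)))
group <- rep(c("c1", "c2", "c3"), each = 4)

one_vs_rest <- function(g) {
  in_g <- group == g
  t(apply(expr, 1, function(x) {
    c(p           = t.test(x[in_g], x[!in_g])$p.value,
      fold_change = 2^(mean(x[in_g]) - mean(x[!in_g])))   # assumes log2-scale values
  }))
}

res_c1 <- one_vs_rest("c1")
# Keep up-regulated features with linear fold change >= 1.5, ranked by ascending p-value
biomarkers_c1 <- res_c1[res_c1[, "fold_change"] >= 1.5, , drop = FALSE]
biomarkers_c1 <- biomarkers_c1[order(biomarkers_c1[, "p"]), , drop = FALSE]
```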
This option is only available when a Cufflinks quantification node is selected. Detailed implementation information can be found in the Cuffdiff manual [5].
When the task is selected, the dialog displays all categorical attributes with more than one subgroup (Figure 1).
When an attribute is selected, pairwise comparisons of all the levels will be performed independently.
Click the Configure button in the Advanced options to configure the normalization method and library type (Figure 2).
There are three library normalization methods:
Class-fpkm: library size factor is set to 1, no scaling applied to FPKM values
Geometric: FPKMs are scaled via the median of the geometric means of the fragment counts across all libraries [6]. This is the default option (and is identical to the one used by DESeq)
Quartile: FPKMs are scaled via the ratio of the 75th percentile fragment counts to the average 75th percentile value across all libraries
The library types have three options:
Fr-unstranded: reads from the left-most end of the fragment in transcript coordinates map to the transcript strand, and the right-most end maps to the opposite strand. E.g. standard Illumina
Fr-firststrand: reads from the left-most end of the fragment in transcript coordinates map to the transcript strand, and the right-most end maps to the opposite strand. The right-most end of the fragment is the first sequenced (or only sequenced for single-end reads). It is assumed that only the strand generated during first strand synthesis is sequenced. E.g. dUTP, NSR, NNSR
Fr-secondstrand: reads from the left-most end of the fragment in transcript coordinates map to the transcript strand, and the right-most end maps to the opposite strand. The left-most end of the fragment is the first sequenced or only sequenced for single-end reads. It is assumed that only the strand generated during second strand synthesis is sequenced. E.g. Directional Illumina, standard SOLiD.
The report of the Cuffdiff task is a table of features with p-value, q-value, and log2 fold-change information for all the comparisons (Figure 3).
In the p-value column, besides an actual p-value, which means the test was performed successfully, the following flags indicate that the test was not successful:
NOTEST: not enough alignments for testing
LOWDATA: too complex or too shallowly sequenced
HIDATA: too many fragments in locus
FAIL: when an ill-conditioned covariance matrix or other numerical exception prevents testing
The table can be downloaded as a text file by clicking the Download button in the lower-right corner of the table.
Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, JRSS, B, 57, 289-300.
Storey JD. (2003) The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics, 31: 2013-2035.
Auer, 2011, A two-stage Poisson model for testing RNA-Seq
Burnham, Anderson, 2010, Model selection and multimodel inference
Law C, Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 2014 15:R29.
Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biology, 2010
If a Partek Flow task fails (no report is produced), please follow the directions in .
If the task report is produced, but the results are missing for some features represented by "?" (Figure 1), it may be because something went wrong with the estimation procedure. To better understand this, use the information available in the Extra details report (Figure 1). This type of information is present for many tasks, including Differential Analysis and Survival Analysis.
Click the Extra details report icon for the feature of interest (Figure 1). This will display the Extra details report (Figure 2). When the estimation procedure fails, a red triangle will be present next to the information criteria value. Hover over the triangle to see a detailed error message.
In many cases, estimation failure is due to low expression; filtering out low-expression features or choosing a reasonable normalization method will resolve this issue.
Sometimes the estimation results are not missing but the reported values look inadequate. If this is the case, the Extra details report may show that the estimation procedure generated a warning, and the triangle is yellow. To remove suspicious results in the report, set Use only reliable estimation results to Yes in the Advanced Options (Figure 3). The warnings will then be treated the same way as estimation failures.
To see the results for as many features as possible, regardless of how reliable they are, set Use only reliable estimation results to No; the result will then be reported unless there is an estimation failure. For example, DESeq2 uses Cook's distances to flag features with outlying expression values; if Use only reliable estimation results is set to Yes (Figure 3), the p-values for such features are not reported, which may lead to some missing values in the report (set the option to No to avoid this).
Suppose there are n p-values (n is the number of features); if each test is performed at significance level $\alpha$, the expected number of Type I errors would be given by $n\alpha$, thus the significance level of each individual test should be adjusted to $\alpha/n$. Alternatively, the p-values can be adjusted as $p_B = p \cdot n$, where $p_B$ is the Bonferroni-corrected p-value; if $p_B$ is greater than 1, it is set to 1.
Click the View extra details report icon under the View section to get more statistical information about a feature. In cases where the task doesn't fail but certain statistical information is not generated, e.g. the p-value and/or fold change of a certain comparison is not generated for some or all features, click this icon and mouse over the red exclamation icon for more information.
To the right of each contrast header there is a volcano plot icon. Select it to display the volcano plot for the chosen contrast (Figure 11).
The feature list filter panel is on the left of the table (Figure 12). Click the black triangle to collapse or expand the panel.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.