Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
With gene ontology (GO) ANOVA, Partek Genomics Suite includes the ability to use rigorous statistical analysis to find differentially expressed functional groupings of genes. Leveraging the Gene Ontology database, Partek Genomics Suite can organize genes into functional groups. Not only can GO ANOVA detect up and down regulated functional groups, but also functional groups, which are disrupted in a few genes as a result of treatment. Moreover, the common diction of the GO effort enables this analysis to be compared across all types of gene expression data, including those from other species. Traditional tests, such as GO enrichment, require defining filtered lists of differentially expressed genes followed by an analysis of functional groups related to those genes. On the other hand, GO ANOVA is performed directly after data import and normalization. This minimizes the risk that a highly stringent filter will cause important functional groups to be overlooked.
Other tests, such as gene set enrichment analysis (GSEA), tolerate minimal or no pre-filtering. However, these tests are very limited in their ability to integrate complicated experimental designs. GSEA, for example, can only handle two groups at a time. GO ANOVA, on the other hand, can leverage the wealth of sample information collected and use powerful multi-factor ANOVA statistics to analyze very complex interactions and regulatory events. The analysis output includes detailed statistical results specifying the effect and importance of phenotypic information on differential expression and subsequent disruption of Gene Ontology functional categories. Furthermore, GSEA calculates enrichment scores using a running-sum statistic on a ranked gene list. GO ANOVA takes into account more information by utilizing each sample’s expression values to calculate the enrichment score.
Note that the same principles apply to Pathway ANOVA, the only difference being the mapping file; GO ANOVA organizes genes into GO categories, while Pathway ANOVA looks at pathways.
This user guide deals with the following topics:
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The setup dialog for GO ANOVA can be found in the Biological Interpretation section of the expression workflows (Gene Expression, MicroRNA Expression, Exon, RNA-Seq, miRNA-Seq). It is recommended that GO ANOVA is run on the sheet with expression levels, after import and normalization, though GO ANOVA can be run on any spreadsheet with samples on rows and genes on columns. If a child spreadsheet is selected, such as the result of a prior ANOVA analysis, then the test will be automatically run on the parent spreadsheet.
Upon selecting GO ANOVA (Biological Interpretation > Gene Set Analysis), Partek Genomics Suite will first offer the opportunity to configure the parameters of the test and exclude functional groups with too few or too many genes (Figure 1). To save time when running GO ANOVA, the size of GO categories analyzed can be limited using the Restrict analysis to function groups with fewer than __ genes. Large GO categories may be less interesting and also take the most time to analyze. We recommend to restrict the analysis to the groups with fewer than 150 genes, as it can make the analysis much quicker (and the results easier to interpret). In the current example, the maximum category was set to only 20 genes, for demonstration purposes only.
Figure 1. Configure the parameters of the test: gene ontology categories with too few or too many genes can be excluded
The next dialog (Figure 2) specifies the method of mapping genes to gene sets. Default mapping file is built from annotation files from geneontology.org. Custom mapping file points to the mapping files available on the local computer and present in the Microarray libraries directory. Create a new mapping file from the chip's annotation file option will try to build the annotation file from the annotation file created by the microarray vendor. Create a new mapping file from a spreadsheet enables you to create a custom mapping file from an open spreadsheet, which has gene symbols on one column, and gene groups on the other column. Finally, files in gene matrix transposed (GMT) or gene annotation (GA) formats can also be used.
Figure 2. Setting the method of mapping genes to gene sets
To setup the GO ANOVA dialogue you must consider all factors that would normally be included in an ANOVA model analyzing gene expression among the samples (Figure 3). Briefly this should include:
Experimental factors
Factors explaining sample dependence
Factors explaining noise
For more details on ANOVA, see Chapter 11 of the User’s Manual.
Figure 3. GO ANOVA setup dialog. Including a factor in the ANOVA model (ANOVA Factors) will identify gene ontology (GO) categories whose expression is different across the genes within the category, by the factor of interest. Including a factor as a Disruption Factor will identify GO categories where the expression of the genes within the category are affected but not uniformly across the genes withing the category. Genes (probesets) can be excluded based on expression levels, to reduce the noise.
Factors inherent to the experiment include variables that would be considered as the experimental variables during experiment design. Generally this will include all variables necessary to answer the questions of the researcher. Examples may include factors such as tissue type, disease state, treatment, or dosage.
Sometimes factors do not act independently of each other. For example, different dosages of a drug may affect patients differently over time, or a drug may not affect tissues equally as in many toxicity studies. If the effect of one variable on the other is either suspected of occurring, or of particular interest, an interaction between the two factors should be included. To do this, select the two factors simultaneously by CTRL-clicking the factors and then select Add Interaction.
Factors to control for sample dependence include variables that account for relation between samples. If tissues are collected in pairs from the same patient, patient ID would be included. Similarly if tissues are collected from two distinct populations, this variable should probably be included as well.
Noise variables may be caused by technical processes used during sample collection and processing. Scan data and dye color are often among these variables.
Factors included in the GO ANOVA fall into two separate categories: the normal ANOVA factors (middle box) and those interacting with the gene (right-side box).
Fundamentally, you can run the GO ANOVA with the same parameters used to run a standard ANOVA analysis on gene expression data. (In other words, the middle box of the GO ANOVA is populated exactly as the normal ANOVA and the Interact with Gene box is left empty.) If such an analysis is run, the results would be similar to a standard statistical analysis, except resulting data will report on differential expression of functional categories instead of individual genes. Expression of a functional group is derived from the mean of all genes included within the group. Running GO ANOVA with the same parameters as the differential expression analysis is the most common method of running GO ANOVA. This keeps the analysis much more accessible and the results are easier to interpret.
There is no need to interact a factor with the gene if such an interaction is not of interest. The right most box in the GO ANOVA setup is optional and may be left empty if this is the case.
More advanced analysis can include factors, which are interacted with the genes in the GO ANOVA model. After factors are added to the ANOVA factor(s) box, some can be added additionally to the Disruption Factor(s) box. At the mathematical level, this will include the Factor*Gene term in the model, called a Factor-Gene interaction. At the biological level, this will test whether patterns of gene expression within the functional group are being modified as a result of the factor. This altering of gene expression patterns is referred to in this document as the disruption of the functional group.
For example, if comparing different tissue types, adding tissue to the middle ANOVA factor(s) box, will identify entire GO functional groups that are up or down regulated between tissue types. If comparing nerves and muscles, this might include such categories as myosin binding or actin production, which will be wholly up regulated in muscles as the function is much less important to nerve function.
By interacting tissue with the gene in the model (adding tissue to the right most box), the interaction p-value may provide a method of discovering categories where total expression might not changed significantly but the pattern of gene expression with the category is altered or disrupted. Within a functional group, the interaction p-value represents how similar the patterns of gene expression are between the different tissues. One example of a functional group identified by a tissue*gene interaction might include a category such as ion transfer. Ion transfer is equally important to both nerve and muscle function, but the distribution of ion channels and many of the responsible genes may be quite different between the two.
Sometimes factors may be included in the Interact with Gene box even if they are not of specific interest in a similar way that factors to control for noise are added to the ANOVA factors middle box. If any factors are included in Disruption Factor(s) box, to get the most accurate p-values, the more advanced model must fit the data as well as possible. All factors that may alter gene expression patterns should be included. It is important to keep in mind that the GO ANOVA is not only looking for significance in the factors included, but is attempting to generally fit the data. As appropriate factors are added to the model, not only are more aspects of the data analyzed; the model becomes a better fit to the true data and the results will become more accurate.
To understand how including a Gene*Factor interaction may improve the fit of the model, consider the complex GO ANOVA design in the case of a dose-time analysis of a drug. While it may seem clear that the ANOVA factors in the middle box - dose, time, and the dose*time interaction should be specified (to consider the effect of dose, time, and the change in the effect of dose over time) what to put in the rightmost Gene*Factor box is not as clear. Adding dose alone (which is actually Dose*Gene) will check if different drug doses affect the pattern of gene expression. Similarly adding time into the right box (which is actually Time*Gene) will identify gene ontology categories that are affected in different times but differentially across the genes. While this may be the true limit of questions of interest, including the interactions of the gene and both dose and time may be prudent. In general, if it is likely, or expected, that a factor will affect gene distribution within functional categories, then the factor should be included in the Disruption Factor(s) box if the gene distribution is being analyzed at all.
To review, including a factor in the middle box will identify GO categories whose expression is consistently affected across the genes within the category by the factor of interest. Including a factor in the right box (factor*gene) will identify gene ontology categories where the expression of the genes within the category are affected but not uniformly across the genes within the category.
GO ANOVA is not restricted to analysis of factors with only two levels. The ANOVA p-values are measures of likelihood that all groups are equivalent. While this is useful in general, sometimes tests comparing only two sets of data are more desirable. Using contrasts to define pair wise comparisons in an ANOVA model is superior to using a test that is limited to a two group comparison.
To specify individual pair wise comparisons, press the Contrast button. Contrasts are performed on groups already defined in the ANOVA model. If two tissue types should be compared to each other, select the tissue term from the Select Factor/Interaction dropdown in the upper left. Select either one or a set of categories and add them to group 1 and group 2. All samples falling into group 1 will be compared to all samples falling into group 2. Output will include not only a p-value, but also a fold change. This fold change will represent the average fold change of the GO category between the two groups. Fold change is calculated as Group 1 divided by Group 2. For data in log space, the data is antilogged as well; fold change output is always for data on a linear scale.
Check Exclude probe sets and differential expression p-value(s) > to filter out probe sets (=genes) which are not express in any of the genes. The Exclude probe sets option will remove any gene that meets the specified limit. Using the default options, this will remove low expression genes. Note that the default value of 3 is a suggestion for Affymetrix expression arrays and may not be applicable for other data sets. We suggest to perform exploratory analysis and inspect the distribution of the expression values first (e.g. View > Histogram > Row or View > Box and Whiskers > Row). The sub-checkbox, differential expression p-values, provides an override to the low expression limit. Here, a gene will be included in the analysis despite a low expression value if the gene displays a p-value below the specified limit, suggesting that the gene is differentially expressed
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
When looking for simple differential expression, sorting by ascending on the factor p-values is ideal. This will find groups that are the most significantly apart across all the contained genes. In the interest of finding groups that are less likely to be called by chance, it may be wise to filter to groups with a minimum of 4 or 5 genes (Figure 1). Simple filters can be done using the interactive filter () available from the button on the toolbar at the top of the screen.
If there is more than one factor in the model, more complex criteria combining the factors can be specified using Tools>List Manager menu Advanced tab. For example, to find categories that are significant and changed by at least two fold, make two criteria: one for a low p-value and the other for a minimum of two fold change, and take the intersection of the two criteria.
Figure 1. Top ten functional groups sorted by the Tissue p-value after filtering to a minimum five gene in the GO category. Note that most of the groups can be directly related to the heart muscle
If the disruption (factor*gene interaction) is tested, the filters can become more complicated. The most pressing need for complex filters is that when analyzing larger functional groups it is not expected that the entire functional group will behave the same. Looking back at Figure 1, notice how the low values in column 7 are present because not every gene is equally differentially expressed even in the most differentially expressed of groups. That is, when there is significant differential expression, it is likely that there will also be disruption as at least a single gene is likely participating in a role beyond that of the functional group and will not follow the pattern of the rest of the group. This situation is expected and leads to a new type of filter.
Filtering for low p-values on the factor and then filtering for low p-values on the factor interacted with gene will find groups that are differentially expressed, but contain at least a few genes that are either disrupted due to treatment, or simply are involved in additional functional groups beyond the scope of the one being analyzed. This list often contains some of the more informative big picture functional groups.
Figure 2. Top ten functional categories sorted by Disruption(Tissue) p-value after filtering to a minimum of five genes in the GO category. By prioritizing by the disruption column this type of a list is more "big picture"
If looking for disruption for groups which are not so much differentially expressed, but instead which express different genes for different treatments, filter for low disruption p-values but for high factor p-values. As shown by Figure 2, large or diverse groups that are differentially expressed will often exhibit significant disruption. In fact, a group that is differentially expressed but includes even a single gene that is not changed will have very significant disruption. These situations are certainly notable, but are distracting if looking for functional groups that instead are uniquely patterned based on treatment. By filtering out those groups with low p-values for the factor and then looking at the remaining groups with low p-values for disruption, groups observed have usually very distinct patterns of expression (Figure 3).
Figure 3. Top ten functional categories sorted by Disruption(Tissue) p-value after filtering to a minimum of five genes in the GO category and minimum Tissue p-value of 0.3. This list is especially interesting, as using enrichment alone to detect such categories would require a lot of labour.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The method used to detect changes in functional groups is ANOVA. For detailed information about ANOVA, see Chapter 11 of the Partek User Manual. There is one result per functional group based on the expression of all the genes contained in the group. Besides all the factors specified in the ANOVA model, the following extra terms will be added to the model by Partek Genomics Suite automatically:
Gene ID - Since not all genes in a functional group express at the same level, gene ID is added to the model to account for gene-to-gene differences
Factor * Gene ID (optional) - Interaction of gene ID with the factor can be added to detect changes within the expression of a GO category with respect to different levels of the factor, referred to in this document as the disruption of the categories expression pattern or simply disruption
Suppose there is an experiment to find genes differentially expressed in two tissues: Two different tissues are taken from each patient and a paired sample t-test, or 2-way ANOVA can be used to analyze the data. The GO ANOVA dialog allows you to specify the ANOVA model, which includes the two factors: tissue and participant ID. The analysis is performed at the gene level, but the result is displayed at the level of the functional group by averaging of the member genes’ results. The equation of the model that can be specified is:
y = µ + T + P + ε
y: expression of a functional group
µ: average expression of the functional group
T: tissue-to-tissue effect
P: participant-to-participant effect (a random effect)
ε: error term
When the tissue is interacted with the gene ID then the ANOVA model becomes more complicated as demonstrated in the model below. The functional group result is not explicitly derived by averaging the member genes as the new model includes terms for both gene and group level results:
y = µ + T + P + G + T *G + ε
y: expression of a functional group
µ: average expression of the functional group
T: tissue-to-tissue effect
P: patient-to-patient effect (this can be specified as a random effect)
G: gene-to-gene effect (differential expression of genes within the function group independent of tissue type)
T*G: Tissue-Gene interaction (differential patterning of gene expression in different tissue types)
ε: error term
In the case that there is more than one data column mapping to the same gene symbol, Partek Genomics Suite will assume that the markers target different isoforms and will not treat the two markers as replicated of the same gene. Instead, each column is treated as a gene unto itself.
If there are only two samples in the spreadsheet then, Partek Genomics Suite cannot calculate a type by gene ID interaction. In this case, the result spreadsheet will contain a column labeled Disruption score. First, for each gene in the functional group Partek Genomics Suite will calculate the difference between the two samples. A z-test is used to compare the difference between each gene and the rest of the genes in the functional group. The disruption score is the minimum p-value from the z-tests comparing each gene to the rest in the functional group. A low disruption score therefore indicates that at least one gene behaves differently from the rest. This implies a change in the pattern of gene expression within the functional group and potential disruption of the normal operation of the group. The category as a whole may or may not exhibit differential expression in addition to the disruption.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Preparing a data set for analysis requires importing the data, normalizing the data as appropriate for standard gene expression analysis, and inserting columns containing the experimental variables. Checkout our tutorial page for more details about preparing data. It is not necessary to perform a differential analysis of gene expression before GO ANOVA.
For the sake of example, the following walkthrough will consider an experiment that has been imported which includes two different tissues, brain tissue and heart tissue, extracted from a small set of patients.
The GO ANOVA function is available in the Gene Expression, microRNA Expression, RNA-Seq, and miRNA-Seq workflows.
Select the Gene Expression workflow (or any of the other ones) from the Workflows drop-down on the upper right of the spread sheet
Go to the Biological Interpretation section of the workflow
Select Gene Set Analysis (Figure 1) and then Gene Set ANOVA
Figure 1. GO ANOVA dialog can be invoked via Gene Set Analysis option of the workflow
For this example analysis, the model was kept easy to interpret by including Subject and Tissue as the only ANOVA factors. Additionally, Tissue was added to the Disruption Factor(s). Including Subject controled for person to person variation, and including Tissue allowed the analysis of differential expression and of functional category disruption between tissue types. For the sake of simplicity and minimizing run time, the term Subject was not added to the Disruption Factor(s) box. Including it would have helped correct for subject specific gene expression patterns, though the results were largely unaffected in this case.
Performing GO ANOVA analysis on very large GO categories can take quite a bit of time. More importantly, very large categories may have too large a scope to be useful. To speed the operation and analyze only smaller GO categories, specify 20 genes as the maximum size for an analyzed GO category.
For the sample dataset, the GO ANOVA dialog setup should appear as in Figure 2 below.
Figure 2. GO ANOVA configured for the user guide data set. Two factors added to the model
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
GO ANOVA output is very similar to standard ANOVA output except each row in the resulting sheet contains statistical results from a single GO functional group rather than a single gene. Columns can be broken down into four sections:
Annotations contain detail about the category being considered
ANOVA results contain the significance of the effect of the factors in the model
Contrast results contain significance and fold change of the difference between groups compared via contrast
F-ratios display the significance of the factors in the ANOVA model
Annotations will take up the first four columns of the results sheet (Figure 1). The first column (# of genes) is the number of genes in the GO category. Specifically, this is not necessarily the number of unique genes in the category; depending on the technology, it can be the number of probes or probe sets on the microarray whose targets fall into the GO category. Genes targeted more than once will be counted more than once. The second column (GO ID) is the unique numeric identifier of the GO category; it is sometimes useful for searching with when the GO category has a very long name. The third column is the type of the GO category, while the fourth column (GO Description) is the name of the GO category.
Figure 1. GO ANOVA annotation columns (example)
When right click on any row header to choose Create Gene List , a new spreadsheet will be generated, it contains a list of genes (probes/probesets) within the selected GO category.
ANOVA results will include a column for each factor in the setup (Figure 2). A column with the name of the factor or interaction followed by p-value will contain how significant the effect of the variable is on the data. A lower p-value corresponds with a more significant effect. For example, a p-value of 0.1 for tissue means that given the difference between the tissue and the inherent variability of the measurements of the genes in the functional group, there is a 10% likelihood that the tissues are equivalent. A p-value of 0 occurs when the value is too small to be displayed. This can be caused by a very low estimate of inherent variability due to either a very small number of replicates or severely unbalanced data.
Figure 2. Viewing the GO ANOVA result
In the example experiment, a low p-value for tissue would imply the functional group is differentially expressed across tissues.
A low p-value for an interaction implies that the effect of one factor on the other is significant. In the example dataset, no interactions between two main variables were included as factors. To illustrate what the interaction p-value would mean, consider the case that a drug compound and a control injection were dosed over several time points and an interaction between injection compound and time point was included in the GO ANOVA. A low p-value for the drug-time point interaction corresponds to the effect of drug on the functional group being altered with time.
A column will also be present for each factor placed in the Disruption Factor(s) box. This column will have the header Disruption(Factor name). A low p-value in this column corresponds to the different states presenting with different gene patterns within the functional group. For functional groups containing only a single gene, no value will be present as the pattern cannot change. In the example experiment, a low p-value for the Disruption(Tissue) represents function categories which have different genes operating in the heart and in the brain.
Contrast results include four columns for each of the comparisons declared during GO ANOVA setup. The first column contains the p-value representing the significance of the difference between the two categories. The second column contains the ratio between the two groups where increases are represented as greater than one and decreases are represented as values between zero and one. The third column is the fold change of the functional group between the two categories where increases are greater than one and decreases are less than negative one. The fourth column contains a plain text description of the direction of the fold change. Fold changes and ratios represent the average change in the functional category. In the example, a contrast was run comparing expression in the cerebral tissue to the heart tissue (Figure 3). As these were the only tissues, the p-values are identical to those in column 5. While the p-value column shows which groups are differentially expressed between the tissues, the fold change columns allow us to see by how much they are differentially expressed. Using the sign of the fold change, or the description column, you can see which categories are increased in brain and which are increased in heart.
Figure 3. Viewing the GO ANOVA contrast columns
F-Ratios
F-ratios (Figure 4) are used in the computation of p-values. The values in the columns can safely be ignored by most users; there are exceptional cases when the F-ratios may be informative. To see the general significance of the factors included in the model, a Sources of Variation plot can be computed from these values from the View menu (or the Workflow). The higher the average F-ratio, the more important the factor is to the model on average.
Figure 4. Viewing the GO ANOVA F-ratios
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
There are two main visualizations for use with GO ANOVA outputs:
Dot plots used to visualize differential expression of functional groups
Profile plots used for visualizing disruption of gene expression patterns within the group
Dot plots represent each sample with a single dot. The position of each dot is calculated as the average expression of all genes included in the functional group. Invoke this plot by right clicking on the row header of a functional group of interest and choosing Dot Plot (Orig. Data). The color, shape, and size of the dots can be set to represent sample information in the plot properties dialogue, invoked by pressing on the red ball in the upper left.
Figure 1 shows a dot plot for a GO category "cell growth involved in cardiac muscle cell development", which is expressed in the heart at a level of almost four times that of the brain, evidenced by the difference of just under two units on the y-axis (in the current example the values on the y-axis are shown in log2 space). Note that the replicates are grouped neatly, making this category highly significant. That is not a surprise, given that the genes belonging to that category are likely very specific for the heart.
Figure 1. Dot plot of a significantly differentially expressed GO category. Each dot is a sample, box-and-whiskers summarize groups
Profile plots or profiles represent each category of one of the GO ANOVA factors as a few overlapping lines. Horizontal coordinates refer to individual genes or probes in the original data. Vertical coordinates represents expression of the individual gene. Invoke this plot by right clicking on the row header of a function group of interest and choosing Profile (Orig. Data). This plot is useful as the pattern of gene expression in the group is displayed as a line. If the pattern is conserved across treatments, the lines will lie parallel, but if the gene reacts differently, the lines will follow a different pattern, maybe even cross each other.
Profile plot on Figure 2 visualizes a GO category without differential expression, but with significant disruption. Note that the gene TNNI3 is up-regulated in the heart, while STX1A is down-regulated in the heart.
Figure 2. Profile plot of a GO category with significant disruption but not differential expression. Each data point is a gene (error bars are standard error of the mean)
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.