Partek Flow offers biological interpretation tools that can provide additional insight into lists of genes, such as significantly different genes between experimental groups.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
GSEA is a bioinformatics tool that determines whether a set of genes (e.g. a gene ontology (GO) group or a pathway) shows statistically significant, concordant differences between two experimental groups (1,2). Briefly, the goal of GSEA is to determine whether the genes belonging to a gene set are randomly distributed throughout the ranked (by expression) list of all the genes that should be taken into consideration (e.g. gene model), or are primarily found at the top or at the bottom of the list.
To run GSEA, your project has to contain at least one categorical factor with exactly two levels (e.g. Treated and Control). If you are running GSEA on RNA-seq data, note that some common normalisation transformations, such as fragments/reads per kilobase of transcript per million mapped reads (FPKM/RPKM) or transcripts per million (TPM), are not considered suitable for GSEA (for more information, please see the GSEA documentation). Instead, you should use an approach such as DESeq2 normalisation, trimmed mean of M-values (TMM), or geometric mean.
To launch GSEA, select the data node with normalised data and then go to Biological interpretation > GSEA (Figure 1).
Use the first dialog (Figure 2) to specify gene sets. You can run GSEA on pathways (currently based on Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways) or on other gene set databases. When using the KEGG option, the KEGG database (i.e. the species) is automatically set, based on the upstream nodes. The Gene set size option allows you to restrict your analysis on gene sets of certain size (i.e. number of genes).
If you select Gene set database, two additional options will appear. Genome build will be detected automatically, based on the upstream nodes. The gene sets that are available for that build are listed in the drop down list (Figure 3). Custom databases will be labeled by their name as specified in the Library file management, while GO database will be labeled by the release date (as seen in Figure 3).
Once your choices are made, push Next to proceed.
In the second part of the set up (Figure 4) pick the experimental factor for GSEA (three are available in this example: Condition, Stim, Numeric). The dialog will list only the factors with two categories; if your project contains additional factors, which have a single category or more than two categories, a warning message will be displayed at the top.
If the warning message is displayed, click on the details link to learn more about unavailable factors (an example is shown in Figure 5).
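The two-category rule can be illustrated with a short Python sketch. It is not Partek Flow code: the function name `split_factors` and the sample attributes other than Condition are hypothetical, and the check simply counts distinct categories per factor.

```python
def split_factors(sample_attributes):
    """Separate factors with exactly two categories from all other factors."""
    usable, unavailable = {}, {}
    for factor, values in sample_attributes.items():
        levels = sorted(set(values))
        if len(levels) == 2:
            usable[factor] = levels
        else:
            unavailable[factor] = levels
    return usable, unavailable


# Hypothetical sample attributes: factor name -> category of each sample
attributes = {
    "Condition": ["Treated", "Treated", "Control", "Control"],
    "Batch": ["A", "B", "C", "A"],                    # three categories
    "Species": ["Human", "Human", "Human", "Human"],  # single category
}

usable, unavailable = split_factors(attributes)
print("Available for GSEA:", usable)
print("Unavailable:", unavailable)
```

In this toy example only Condition would be offered in the dialog, while Batch and Species would be reported as unavailable.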
Select the experimental factor that you want to run GSEA on and push Next.
The third dialog is Define comparisons (Figure 6). The box on the left side displays the categories of the selected factor (shown as Factor). Use the arrow buttons (>) to move one of the categories to the Denominator box (that category is interpreted as the reference category) and the other category to the Numerator box. Confirm your selection by pushing the Add comparison button and the comparison will be added to the Comparisons table (Figure 6).
The Low value filter is turned on by default and removes all the genes whose average coverage is 1.0 or below; if a filter feature task was performed before this task, the default low-value filter is set to None (for details, please see the GSA chapter).
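As a rough illustration of the low-value filter, the sketch below drops genes whose mean value across samples is 1.0 or below. It assumes that "average coverage" means the per-gene mean across samples; the function name and the toy counts are hypothetical and not part of Partek Flow.

```python
from statistics import mean


def low_value_filter(counts, threshold=1.0):
    """Keep only genes whose average value across samples exceeds the threshold."""
    return {gene: values for gene, values in counts.items()
            if mean(values) > threshold}


# Hypothetical toy data: gene -> value in each sample
counts = {
    "GENE_A": [0, 1, 0, 2],   # mean 0.75, removed
    "GENE_B": [1, 1, 1, 1],   # mean 1.0, removed (1.0 or below)
    "GENE_C": [5, 8, 3, 10],  # mean 6.5, kept
}

filtered = low_value_filter(counts)
print(sorted(filtered))
```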
Push Finish to launch GSEA with the default settings.
Alternatively, click on the Configure icon to access the advanced options (Figure 7). The number of data permutations (needed to calculate the normalised enrichment scores) can be controlled using the Permutations option. A permutation randomly reassigns the group labels of the samples: for each permutation, a random order is computed, and that order is used to compute the score for each gene. Finally, if you start your project by importing a count matrix (i.e. as opposed to generating the count matrix using Partek Flow), you need to specify whether the expression values were log transformed before the import (use the Data has been log transformed with base drop-down).
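The permutation idea can be sketched in Python as follows. This is an illustration only: the per-gene score used here is a plain difference of group means, which is an assumption and not necessarily the score Partek Flow computes, and all names and data are hypothetical. Note that one shuffled label order is drawn per permutation and reused for every gene.

```python
import random
from statistics import mean


def diff_of_means(values, labels, numerator, denominator):
    """Illustrative per-gene score: mean(numerator group) - mean(denominator group)."""
    num = [v for v, g in zip(values, labels) if g == numerator]
    den = [v for v, g in zip(values, labels) if g == denominator]
    return mean(num) - mean(den)


def permutation_scores(data, labels, numerator, denominator, n_permutations, seed=0):
    """For each permutation, shuffle the labels once and rescore every gene."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # group sizes stay the same, only the assignment changes
        results.append({gene: diff_of_means(values, shuffled, numerator, denominator)
                        for gene, values in data.items()})
    return results


# Hypothetical toy data: gene -> expression value in each sample
labels = ["Treated", "Treated", "Treated", "Control", "Control", "Control"]
data = {
    "GENE_A": [8.0, 9.0, 10.0, 2.0, 3.0, 4.0],
    "GENE_B": [5.0, 4.0, 6.0, 5.0, 6.0, 4.0],
}

observed = {g: diff_of_means(v, labels, "Treated", "Control") for g, v in data.items()}
null = permutation_scores(data, labels, "Treated", "Control", n_permutations=200)
print(observed)
print(len(null), "permuted score sets")
```

The permuted scores form the null distribution against which the observed scores are compared.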
When the task completes, double click on the GSEA task node (Figure 8) to view the report.
The report consists of two parts: the GSEA result table on the right and the filter panel on the left (Figure 9).
Gene set ID. The Gene set IDs are based on the gene set file that was selected during set up. Each ID is a link to the geneontology.org page of the selected set.
Gene set size. Number of genes in the set (as specified in the gene set file).
Enrichment score. The enrichment score is the primary result of GSEA; it reflects the degree to which the current gene set is overrepresented at the top or the bottom of the ranked list of all the genes in the gene model (for details, see the References). The higher the enrichment score the more overrepresented (enriched) the gene set is.
Normalised score. Normalisation of the enrichment score takes into account the size of the gene set. We recommend using normalised values for filtering.
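A minimal sketch of one way to normalise an enrichment score is given below. It follows the approach described by Subramanian et al. (dividing the observed score by the mean of the permutation scores that have the same sign); whether Partek Flow uses exactly this formula is an assumption. The function name and the null scores are hypothetical.

```python
from statistics import mean


def normalised_enrichment_score(es, null_es):
    """Divide ES by the mean magnitude of the same-sign permutation scores."""
    same_sign = [abs(x) for x in null_es if (x >= 0) == (es >= 0)]
    if not same_sign or mean(same_sign) == 0:
        return None  # not enough same-sign permutation scores to normalise
    return es / mean(same_sign)


# Hypothetical enrichment scores obtained from five permutations
null_es = [0.25, 0.5, -0.5, 0.75, -0.25]

print(normalised_enrichment_score(1.0, null_es))    # positive ES
print(normalised_enrichment_score(-0.75, null_es))  # negative ES
```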
P-value. P-value estimates the statistical significance of the enrichment score.
FDR. False discovery rate (FDR) is used to control for multiple testing. We recommend using FDR values for filtering.
Enrichment score. The algorithm walks down the ranked list of all the genes in the model, increasing the running sum (y axis) each time a gene in the current gene set is encountered. Conversely, the running sum is decreased each time a gene not in the current gene set is encountered. The magnitude of the increment depends on the correlation of the gene with the experimental factor. The enrichment score is then the maximum deviation from zero encountered in the random walk (the summit of the curve).
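The random walk can be sketched in Python as follows. The weighting scheme (increments proportional to the absolute ranking metric, decrements of equal size for every gene outside the set) follows Subramanian et al. and is an assumption about the implementation; the function name, gene names and metric values are hypothetical.

```python
def running_enrichment(ranked_genes, metric, gene_set, weight=1.0):
    """Return (enrichment score, running sum) for a gene set over a ranked list.

    ranked_genes: gene names sorted by the ranking metric, highest first.
    metric: gene name -> ranking metric value.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_total = sum(abs(metric[g]) ** weight
                    for g, is_hit in zip(ranked_genes, hits) if is_hit)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running, total = [], 0.0
    for g, is_hit in zip(ranked_genes, hits):
        if is_hit:
            total += abs(metric[g]) ** weight / hit_total  # weighted increment
        else:
            total -= 1.0 / n_miss                          # uniform decrement
        running.append(total)

    es = max(running, key=abs)  # maximum deviation from zero
    return es, running


# Hypothetical ranked list and metric values
metric = {"g1": 3.0, "g2": 2.0, "g3": 1.0, "g4": -1.0, "g5": -2.0, "g6": -3.0}
ranked = sorted(metric, key=metric.get, reverse=True)

es, running = running_enrichment(ranked, metric, {"g1", "g2"})
print(es, running)
```

Because both set members sit at the top of the list in this toy example, the running sum climbs to its peak after the second gene and then decays back to zero.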
Gene set hits. Each column shows the location of a gene from the current gene set, within the ranked list of all the genes in the model.
Rank metric. The plot shows the value of the ranking metric (y axis) as you move down the ranked list of all the genes in the model (x axis). The ranking metric measures a gene’s correlation with a phenotype. A positive value of the metric indicates correlation with the first category (Numerator) and a negative value indicates correlation with the second category (Denominator).
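To make the sign convention concrete, the sketch below ranks genes with a signal-to-noise metric (difference of group means divided by the sum of group standard deviations). The source does not state which ranking metric Partek Flow uses, so this choice is an assumption, and the function name and data are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(values, labels, numerator, denominator):
    """Positive when the gene is higher in the numerator category."""
    num = [v for v, g in zip(values, labels) if g == numerator]
    den = [v for v, g in zip(values, labels) if g == denominator]
    noise = stdev(num) + stdev(den)  # needs at least two samples per category
    if noise == 0:
        raise ValueError("no variation within either category")
    return (mean(num) - mean(den)) / noise


# Hypothetical toy data: gene -> expression value in each sample
labels = ["Treated", "Treated", "Control", "Control"]
expression = {
    "GENE_UP": [10.0, 12.0, 4.0, 6.0],
    "GENE_DOWN": [1.0, 3.0, 7.0, 9.0],
    "GENE_FLAT": [5.0, 6.0, 5.0, 6.0],
}

metric = {gene: signal_to_noise(values, labels, "Treated", "Control")
          for gene, values in expression.items()}
ranked = sorted(metric, key=metric.get, reverse=True)
print(ranked)
print(metric)
```

With Treated as the Numerator, GENE_UP receives a positive value and lands at the top of the ranked list, while GENE_DOWN receives a negative value and lands at the bottom.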
Leading edge genes. The subset of genes that contribute most to the enrichment score (ES). For a positive ES, the leading edge subset is the set of members that appear in the ranked list prior to the peak score. For a negative ES, it is the set of members that appear subsequent to the peak score.
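The leading edge definition can be expressed as a short Python sketch. It assumes that the running-sum values for the ranked list have already been computed; the function name and the toy values are hypothetical.

```python
def leading_edge(ranked_genes, gene_set, running_sum):
    """Return the gene set members on the leading side of the peak score."""
    peak = max(range(len(running_sum)), key=lambda i: abs(running_sum[i]))
    if running_sum[peak] >= 0:
        candidates = ranked_genes[:peak + 1]  # members up to and including the peak
    else:
        candidates = ranked_genes[peak:]      # members from the peak onwards
    return [g for g in candidates if g in gene_set]


ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]

# Hypothetical running sums for a positively and a negatively enriched set
positive = leading_edge(ranked, {"g1", "g2", "g5"},
                        [0.33, 0.67, 0.33, 0.0, 0.33, 0.0])
negative = leading_edge(ranked, {"g2", "g5", "g6"},
                        [-0.33, 0.0, -0.33, -0.67, -0.33, 0.0])
print(positive, negative)
```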
Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005;102(43):15545-15550. doi:10.1073/pnas.0506580102
Mootha VK, Lindgren CM, Eriksson KF, et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003;34(3):267-273. doi:10.1038/ng1180
The comparison (i.e. Denominator vs. Numerator) is given at the top of the GSEA table. To download the table to your local computer as a text file, use the Download link in the bottom right. Each row of the table corresponds to one gene set and the gene sets are ranked by the P-value, ascending (lowest values at the top). The icons in the column headers are used for sorting. The columns of the table are as follows.
View. The icons in the View column open the enrichment plot or the extra details report (explanations below).
The filter panel is used to narrow the list of gene sets. The Results field shows the number of gene sets currently in the table. Filtering can be performed on: Gene set ID (search for the numeric ID), Gene set description (search for a key word), Gene set size (number of genes in the set), Enrichment score, Normalised enrichment score, P-value, FDR. Click on the black triangle to open the controls for each filter (Figure 10). To remove all the filters, click on the Clear filter link. If you commonly use a filter, you can save the filter settings by clicking the Save filter button. The saved filter will be shown under Saved filters. The cogwheel icon is a link to the Settings > Filter management page.
Click on the View enrichment report icon to open a new Data viewer session with the per-gene set report. The current gene set is in the title, at the top of the canvas (Enrichment profile). To quickly switch to another gene set, use the Axis > Content drop-down list. The individual plots are as follows (Figure 11; from top to bottom).
Click on the View extra details icon to open a gene set-specific report page (Figure 12).
Enrichment analysis is a technique commonly used to add biological context to a list of genes, such as a list of significant genes filtered from a differential analysis report. The procedure is based on assigning genes to groups and then finding overrepresented groups in filtered gene lists using Fisher's exact test.
The Gene set enrichment task can be invoked on a differential analysis output (or filtered differential analysis output) data node or on a filtered count matrix data node. Since a data node including all the features will serve as the background, always use a data node containing a subset of features to invoke this task in order to get a meaningful result. Only gene names will be used in the computation.
Click a Feature list data node
Click the Biological interpretation section of the toolbox
Click Gene set enrichment
There are two options for Database. KEGG database requires a special license (Figure 1).
Gene set database is a user-defined database; see the Adding a Gene Set chapter for more details. The gene sets available for the current Assembly are listed under the Gene set database drop-down list (Figure 2). The assembly is automatically selected, if possible. If the assembly cannot be detected, you can specify it using the drop-down.
Partek distributes Gene Ontology (GO) gene sets for the human and mouse genomes. GO is a bioinformatics initiative to unify the representation of gene and gene product attributes across species [1, 2].
Select feature identifier (optional) can be used to specify the feature format (e.g. Gene name, Gene ID, Feature ID).
Specify the background gene list (optional) can be used to choose a feature list as the background. Select the list using the drop-down. For more information, see List management.
The background gene list is used as the list of possible genes. By default, this is the set of genes included in the selected gene set database. If your assay limits the genes that could be detected, you may want to specify a background list.
Click Finish to run
The result is stored under an Enrichment task node. To open it, double click on the node or select the respective Task report from the context sensitive menu.
As previously mentioned, if you are using the GO gene sets distributed by Partek, the GO identifiers in the first column are hyperlinks to the Gene Ontology web-site entries (an example shown in Figure 6).
When the KEGG database is used, clicking on a pathway ID in the Gene set column of the enrichment task report displays a KEGG pathway gene network picture (Figure 7).
Each rectangle on the map represents a gene product in the pathway. Gene products are mostly proteins coded by a gene or group of genes, but they could be RNA too. Related pathways are shown as large rounded rectangles. Chemical compounds, DNA or other molecules are shown as circles.
The pathway map is colored by the first fold-change column in the input Feature list data node. The control panel on the left can be used to configure the colors of the pathway map. In all options, rectangles colored white do not have gene information. Options for coloring include:
Fixed color: all genes are colored black.
Genes in list: all genes in the list are colored, default color is yellow, but this can be configured. Genes not in the list are black.
Statistics in the gene list: e.g. FDR, p-value, Fold change, etc. Colors can be customized by clicking on the color square.
Mousing over a rectangle shows the genes indicated by the rectangle in the tooltip (Figure 8). Genes are listed on rows with all aliases in the KEGG database included on the row. Genes that are in the list are shown in bold. The gene being used to color the rectangle is shown in red.
On KEGG pathway maps that include chemical compounds, the chemical structure is shown in the tooltip on mouse-over (Figure 9).
Clicking a rectangle opens the page for that gene or group of genes on the KEGG website in a new tab in your web browser.
If the gene set enrichment table has fewer than 100 results (rows), the GO categories can be visualized in the Data Viewer. Otherwise, a notification is displayed in the top left corner (Figure 10).
If needed, filter down the number of results, for instance by using a cut-off based on the enrichment score. Type in the cut-off value in the text box beneath the Enrichment score and hit Enter (an example is shown in Figure 11). Once the number of results falls below 100, a link to the Data Viewer will be displayed (Figure 8). Click on the View plots in Data Viewer link to open a new Data Viewer session.
Two plots are loaded into Data Viewer (Figure 12). Both plots show the enrichment score on the horizontal axis and gene ontology categories (i.e. the ones present in the gene enrichment table) on the vertical axis. The plots show enrichment scores (Enrichment score column of the gene ontology table) and, in addition, the plot on the left uses a color range to depict the enrichment P-value (green = low, red = high P-value).
The same functionality is available for pathway enrichment results.
Ashburner M, Ball CA, Blake JA et al. Gene Ontology: tool for the unification of biology. Nat Genetics. 2000; 25:25-29.
The Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucleic Acids Res. 2015; 43:D1049-1056.
(Recommended citations from the geneontology.org website.)
Figure 3 shows an example Gene set enrichment task report using the GO database. The table contains one gene set per row (Gene set column; the column entries are hyperlinks when using the distributed GO gene sets), with the category name in the Description column. The categories are ranked by the Enrichment score, which is the negative natural logarithm of the enrichment p-value (P-value column) derived from Fisher's exact test on the underlying contingency table. The higher the enrichment score, the more overrepresented the GO category is within the input list of significant genes. The columns can be searched by typing in the search term in the respective box (and hitting Enter), or sorted by selecting the double arrow icon.
The contingency table (Figure 4) can be displayed by selecting the View gene breakdown chart icon on the right. The term "list" refers to the list of significant genes, while the term "set" refers to the respective GO category. The first row of the contingency table is also seen in the report, namely the Genes in list and Genes not in list columns.
The View extra details button provides additional information on the GO category (Figure 5). In addition to the details already given in the report, a full list of Genes in list and Genes not in list can be inspected and downloaded (Download data) to the local computer as a text file. Use the arrow to expand these sections.
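The relationship between the contingency table, the p-value and the enrichment score can be sketched in Python. The example assumes a one-sided (overrepresentation) Fisher's exact test, which the text above does not specify; the function name and the counts are hypothetical.

```python
from math import comb, log


def fisher_enrichment(in_set_in_list, in_set_not_list, not_set_in_list, not_set_not_list):
    """One-sided Fisher's exact test on a 2x2 table; returns (p-value, -ln(p))."""
    n_set = in_set_in_list + in_set_not_list    # genes in the gene set
    n_list = in_set_in_list + not_set_in_list   # genes in the significant list
    total = n_set + not_set_in_list + not_set_not_list

    # Probability of seeing at least the observed overlap by chance
    # (upper tail of the hypergeometric distribution).
    max_overlap = min(n_set, n_list)
    p_value = sum(comb(n_set, k) * comb(total - n_set, n_list - k)
                  for k in range(in_set_in_list, max_overlap + 1)) / comb(total, n_list)
    return p_value, -log(p_value)


# Hypothetical counts: 3 of the 4 set members are in a list of 4 genes,
# drawn from a background of 10 genes.
p_value, enrichment_score = fisher_enrichment(3, 1, 1, 5)
print(p_value, enrichment_score)
```

A smaller p-value gives a larger enrichment score, which is why ranking by enrichment score puts the most overrepresented categories at the top.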
Click the Save image icon to download a PNG file showing the configured KEGG pathway map to your local computer.