GSEA

Gene set enrichment analysis (GSEA) is a bioinformatics tool that determines whether a set of genes (e.g. a gene ontology (GO) group or a pathway) shows statistically significant, concordant differences between two experimental groups (1,2). Briefly, the goal of GSEA is to determine whether the genes belonging to a gene set are randomly distributed throughout the list of all the genes under consideration (e.g. the gene model), ranked by expression, or are primarily found at the top or at the bottom of that list.

Prerequisites

To run GSEA, your project has to contain at least one categorical factor with at least two levels (e.g. Treated and Control). If you are running GSEA on RNA-seq data, note that some common normalisation methods, such as fragments/reads per kilobase of transcript per million mapped reads (FPKM/RPKM) or transcripts per million (TPM), are not considered suitable for GSEA. Instead, you should use an approach such as DESeq2 normalisation, trimmed mean of M-values (TMM), or geometric mean.

Running GSEA

To launch GSEA, select the data node with normalised data and then go to Biological interpretation > GSEA.

Use the first dialog to specify gene sets. You can run GSEA on pathways (currently based on Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways) or on other gene set databases. When using the KEGG option, the KEGG database (i.e. the species) is set automatically, based on the upstream nodes. The Gene set size option allows you to restrict your analysis to gene sets of a certain size (i.e. number of genes).
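
The effect of the Gene set size option can be pictured with a minimal Python sketch. The function name, the gene set names, and the minimum/maximum bounds are hypothetical and chosen only for illustration; they are not taken from the application.

```python
def filter_gene_sets_by_size(gene_sets, min_size, max_size):
    """Keep gene sets whose number of genes lies in [min_size, max_size]."""
    return {
        name: genes
        for name, genes in gene_sets.items()
        if min_size <= len(genes) <= max_size
    }


# Hypothetical gene sets, for illustration only.
gene_sets = {
    "set_small": {"G1", "G2"},
    "set_medium": {"G1", "G2", "G3", "G4", "G5"},
    "set_large": {f"G{i}" for i in range(1, 41)},
}

kept = filter_gene_sets_by_size(gene_sets, min_size=3, max_size=20)
print(sorted(kept))  # ['set_medium']
```

Very small sets tend to give unstable scores and very large sets are hard to interpret, which is the usual motivation for restricting the size range.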

Once your choices are made, push Next to proceed.

In the second part of the setup, pick the experimental factor for GSEA.

The GSEA task computes only one factor at a time. If you select more than one factor, the computation is performed on each one individually. Click Next to set up comparisons.

The box on the left side displays the categories of the selected factor (shown as Factor). Use the arrow buttons (>) to move one of the categories to the Denominator box (that category is interpreted as the reference category) and the other category to the Numerator box. Confirm your selection by pushing the Add comparison button and the comparison will be added to the Comparisons table.

The Low value filter is turned on by default and removes all genes with a lowest average coverage of 1.0 or below. If a filter feature task was performed before this task, the default low-value filter is set to None.

Push Finish to launch GSEA with the default settings. Each comparison will be performed individually and generate its own section in the report.

Click on the Configure icon to access the advanced options.

The number of data permutations (needed to calculate the normalised enrichment scores) can be controlled using the Permutations option. In each permutation, the group assignments are randomly shuffled, and the resulting random order is used to compute the score for each gene. Finally, indicate whether or not the input data is in log scale.
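
To illustrate how permutations feed into the normalised enrichment score, here is a minimal Python sketch. It is not the application's implementation and rests on several assumptions: group labels are shuffled once per permutation and applied to all genes, the per-gene score is a simple difference of group means, and the normalised score is the observed score divided by the mean of the same-sign permuted scores (the scheme described in reference 1). All function names and the expression values are hypothetical.

```python
import random
from statistics import mean


def rank_scores(expr, labels, numerator, denominator):
    """Per-gene score: mean(numerator samples) - mean(denominator samples)."""
    scores = {}
    for gene, values in expr.items():
        num = [v for v, g in zip(values, labels) if g == numerator]
        den = [v for v, g in zip(values, labels) if g == denominator]
        scores[gene] = mean(num) - mean(den)
    return scores


def enrichment_score(scores, gene_set):
    """Weighted running-sum enrichment score over genes sorted by descending score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    # "or 1.0" avoids a division by zero if every hit has a score of 0.
    hit_total = sum(abs(scores[g]) for g in ranked if g in gene_set) or 1.0
    n_miss = sum(g not in gene_set for g in ranked) or 1
    running, best = 0.0, 0.0
    for g in ranked:
        running += abs(scores[g]) / hit_total if g in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_test(expr, labels, gene_set, numerator, denominator,
                     n_perm=200, seed=0):
    """Return (observed ES, normalised ES, empirical p-value)."""
    rng = random.Random(seed)
    observed = enrichment_score(
        rank_scores(expr, labels, numerator, denominator), gene_set)
    null = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # random reassignment of samples to groups
        null.append(enrichment_score(
            rank_scores(expr, shuffled, numerator, denominator), gene_set))
    same_sign = [es for es in null if (es >= 0) == (observed >= 0)]
    # The +1 terms are a common convention that keeps the p-value above zero.
    extreme = sum(abs(es) >= abs(observed) for es in same_sign)
    p_value = (extreme + 1) / (len(same_sign) + 1)
    nes = observed / mean(abs(es) for es in same_sign) if same_sign else float("nan")
    return observed, nes, p_value


# Hypothetical log-scale expression values: 3 Treated samples, then 3 Control.
expr = {
    "G1": [8.0, 9.0, 10.0, 5.0, 6.0, 7.0],
    "G2": [7.0, 8.0, 9.0, 6.0, 6.0, 6.0],
    "G3": [5.0, 5.0, 6.0, 5.0, 5.0, 5.0],
    "G4": [4.0, 5.0, 4.0, 5.0, 4.0, 5.0],
    "G5": [6.0, 5.0, 6.0, 6.0, 7.0, 6.0],
    "G6": [3.0, 4.0, 3.0, 4.0, 5.0, 4.0],
}
labels = ["Treated"] * 3 + ["Control"] * 3

observed, nes, p_value = permutation_test(
    expr, labels, {"G1", "G2"}, numerator="Treated", denominator="Control")
print(observed, nes, p_value)
```

More permutations give a finer-grained null distribution, at the cost of a longer run time.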

GSEA Results

When the task completes, double-click on the GSEA task node to view the report.

Like the ANOVA report, the report consists of two parts: the GSEA result table on the right and the filter panel on the left.

The comparison (i.e. Denominator vs. Numerator) is given at the top of the GSEA table. Each row of the table corresponds to one gene set (pathway) and the gene sets are ranked by the first comparison's normalized enrichment score in descending order.

View. The icons in the View column open the enrichment plot or the extra details report (explanations below).
Gene set ID. The gene set IDs are based on the gene set file that was selected during setup. Each ID is a link to the details of the selected set.
Gene set size. Number of genes in the set (as specified in the gene set file). Click on the number to download the list of genes.
Enrichment score. The enrichment score is the primary result of GSEA; it reflects the degree to which the current gene set is overrepresented at the top or the bottom of the ranked list of all the genes in the gene model (for details, see the References). The larger the absolute value of the enrichment score, the more overrepresented (enriched) the gene set is.
Normalised score. Normalisation of the enrichment score takes into account the size of the gene set. We recommend using normalised values for filtering.
P-value. The p-value estimates the statistical significance of the enrichment score.
FDR. The false discovery rate (FDR) is used to control for multiple testing. We recommend using FDR values for filtering.

Click on the View enrichment report icon to open a new Data viewer session with the per gene set report. The selected gene set is in the title, at the top of the canvas (Enrichment profile). To quickly switch to another gene set, use the Axis > Content drop-down list. The individual plots are as follows:

Enrichment score. The algorithm walks down the ranked list of all the genes in the model, increasing the running sum (y axis) each time a gene in the current gene set is encountered. Conversely, the running sum is decreased each time a gene not in the current gene set is encountered. The magnitude of the increment depends on the correlation of the gene with the experimental factor. The enrichment score is then the maximum deviation from zero encountered in the random walk (the summit of the curve for a positive score, the lowest point for a negative one).
Gene set hits. Each vertical line shows the location of a gene from the current gene set, within the ranked list of all the genes in the model.
Rank metric. The plot shows the value of the ranking metric (y axis) as you move down the ranked list of all the genes in the model (x axis). The ranking metric measures a gene’s correlation with the attribute specified in the comparison.
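
The three plots can be related to each other with a minimal Python sketch of the running-sum calculation. It assumes the weighting scheme of reference 1 (each hit is weighted by the absolute value of its ranking metric raised to the power `weight`, each miss by an equal share); the exact weighting used by the application is not stated here, and the function name and metric values are hypothetical.

```python
def running_sum_profile(ranked_genes, metric, gene_set, weight=1.0):
    """Return (running-sum profile, hit positions, enrichment score).

    ranked_genes: genes sorted by descending ranking metric
    metric: dict mapping gene -> ranking metric value
    Assumes at least one hit with a non-zero metric and at least one miss.
    """
    hits = [g in gene_set for g in ranked_genes]
    hit_weights = [abs(metric[g]) ** weight if h else 0.0
                   for g, h in zip(ranked_genes, hits)]
    total_hit_weight = sum(hit_weights)
    n_miss = hits.count(False)

    running, profile = 0.0, []
    for w, h in zip(hit_weights, hits):
        # Step up on a hit (scaled by its metric), step down on a miss.
        running += w / total_hit_weight if h else -1.0 / n_miss
        profile.append(running)

    es = max(profile, key=abs)  # maximum deviation from zero, sign kept
    hit_positions = [i for i, h in enumerate(hits) if h]
    return profile, hit_positions, es


# Hypothetical ranking metric values, for illustration only.
metric = {"A": 2.0, "B": 1.5, "C": 1.0, "D": -0.5, "E": -1.0, "F": -2.0}
ranked_genes = sorted(metric, key=metric.get, reverse=True)

profile, hit_positions, es = running_sum_profile(ranked_genes, metric, {"A", "C"})
print(hit_positions)  # [0, 2]
print(round(es, 2))   # 0.75
```

Here `profile` corresponds to the Enrichment score curve, `hit_positions` to the vertical lines in Gene set hits, and the values of `metric` in ranked order to the Rank metric plot.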

Click on the View extra details plot icon to open a gene set-specific report page.

Leading edge genes. The subset of genes in the set that contribute most to the enrichment score (ES). For a positive ES, the leading edge subset consists of the members that appear in the ranked list prior to the peak score. For a negative ES, it consists of the members that appear subsequent to the peak score.
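
The leading edge definition can be sketched in Python as follows. This is a minimal illustration, assuming the same metric-weighted running sum as in reference 1 and counting the gene at the peak itself as part of the leading edge for a positive ES; the function name and metric values are hypothetical.

```python
def leading_edge(ranked_genes, metric, gene_set):
    """Return (enrichment score, leading edge genes) for one gene set.

    Assumes at least one hit with a non-zero metric and at least one miss.
    """
    total = sum(abs(metric[g]) for g in ranked_genes if g in gene_set)
    n_miss = sum(g not in gene_set for g in ranked_genes)

    running, profile = 0.0, []
    for g in ranked_genes:
        running += abs(metric[g]) / total if g in gene_set else -1.0 / n_miss
        profile.append(running)

    # Position of the maximum deviation from zero.
    peak = max(range(len(profile)), key=lambda i: abs(profile[i]))
    es = profile[peak]
    if es >= 0:
        # Set members up to and including the peak.
        members = [g for g in ranked_genes[:peak + 1] if g in gene_set]
    else:
        # Set members after the (negative) peak.
        members = [g for g in ranked_genes[peak + 1:] if g in gene_set]
    return es, members


# Hypothetical ranking metric values, for illustration only.
metric = {"A": 2.0, "B": 1.5, "C": 1.0, "D": -0.5, "E": -1.0, "F": -2.0}
ranked_genes = sorted(metric, key=metric.get, reverse=True)

es_up, edge_up = leading_edge(ranked_genes, metric, {"A", "C"})
es_down, edge_down = leading_edge(ranked_genes, metric, {"E", "F"})
print(edge_up)    # ['A', 'C']
print(edge_down)  # ['E', 'F']
```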

The filter panel is used to narrow the list of gene sets. The Results field shows the number of gene sets currently in the table. Filtering can be performed on: Gene set ID (search for the numeric ID), Gene set description (search for a keyword), Gene set size (number of genes in the set), Enrichment score, Normalised enrichment score, P-value, and FDR. Click on the black triangle to open the controls for each filter. To remove all the filters, click on the Clear filter link.
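
For readers who export the table and want to reproduce the filtering outside the application, here is a minimal Python sketch. The field names, the result rows, and the thresholds are hypothetical and serve only as an illustration; pick cut-offs appropriate to your own study.

```python
def filter_results(rows, max_fdr=0.05, min_abs_nes=1.5):
    """Keep gene sets passing both thresholds, sorted by descending normalised score."""
    kept = [
        r for r in rows
        if r["fdr"] <= max_fdr and abs(r["nes"]) >= min_abs_nes
    ]
    return sorted(kept, key=lambda r: r["nes"], reverse=True)


# Hypothetical result rows, for illustration only.
results = [
    {"gene_set_id": "set_A", "size": 45, "nes": 2.1, "p_value": 0.001, "fdr": 0.01},
    {"gene_set_id": "set_B", "size": 120, "nes": 1.2, "p_value": 0.04, "fdr": 0.20},
    {"gene_set_id": "set_C", "size": 30, "nes": -1.9, "p_value": 0.002, "fdr": 0.03},
]

filtered = filter_results(results)
print([r["gene_set_id"] for r in filtered])  # ['set_A', 'set_C']
```

Filtering on the absolute value of the normalised score keeps gene sets enriched at either end of the ranked list.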

Click the Generate filtered node button to perform the filter task based on the specified criteria.

References

Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005;102(43):15545-15550. doi:10.1073/pnas.0506580102
Mootha VK, Lindgren CM, Eriksson KF, et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003;34(3):267-273. doi:10.1038/ng1180
