The Kaplan-Meier estimator is a non-parametric statistic used to estimate the survival function where time-to-event incidence varies over time in a population. The Kaplan-Meier estimator is displayed as a Kaplan-Meier curve, a series of declining horizontal steps. The Kaplan-Meier curve should approach the true survival curve for the population with a sufficiently large sample size. Kaplan-Meier survival analysis can handle censored data, i.e., data where the event is not observed for some subjects.
To perform Kaplan-Meier survival analysis, at least two pieces of information (one column each) must be provided for each sample: time-to-event (a numeric factor) and event status (categorical factor with two levels). Event status indicates whether the event occurred or the subject was censored (did not experience the event). Time-to-event indicates the time elapsed between the enrollment of a subject in the study and the occurrence of the event.
Common examples of Kaplan-Meier analysis include the fraction of patients who remain disease-free after cancer remission. In this case, the event would be disease recurrence and patients would be listed as censored if they do not experience recurrence during the study or if they drop out of the study before experiencing recurrence.
Partek Genomics Suite does not impose any limitation on the labels used for the event and censored categories; in this tutorial, the events are coded as either "death" or "censored". If a subject is still alive at the end of the study, time-to-event indicates the period between enrollment and the end of the study. If a subject dropped out of the study, time-to-event indicates the period between enrollment and the last recorded time point.
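As a rough illustration of what the estimator computes from these two columns, the sketch below implements the textbook product-limit formula in plain Python. It is not Partek Genomics Suite code; the function name `kaplan_meier` and the toy data are hypothetical, and the "death"/"censored" labels simply follow this tutorial's coding.

```python
from collections import Counter

def kaplan_meier(times, statuses, event_label="death"):
    """Return [(time, survival_probability)] at each distinct event time.

    Textbook product-limit estimate: at every time with d deaths among
    n subjects still at risk, the running survival is multiplied by (1 - d/n).
    """
    deaths = Counter(t for t, s in zip(times, statuses) if s == event_label)
    exits = Counter(times)   # deaths plus censored subjects leaving at each time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:                # the curve only steps down when an event occurs
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # censored subjects leave the risk set without a step
    return curve

# Hypothetical toy data (years), not taken from the tutorial data set
times = [1, 2, 2, 3, 4, 5]
statuses = ["death", "death", "censored", "death", "censored", "death"]
curve = kaplan_meier(times, statuses)
for t, s in curve:
    print(f"t={t}: S(t)={s:.4f}")
```

Note that censored subjects still count toward the number at risk up to the time they leave the study, which is how the estimator makes use of incomplete follow-up.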
To begin, you should have the Survival Tutorial data set open in Partek Genomics Suite as shown.
Select Stat from the main toolbar
Select Survival Analysis then Kaplan-Meier from the Stat menu (Figure 1)
Figure 1. Invoking Kaplan-Meier
The Kaplan-Meier dialog will open. Please note that in this tutorial data set, column 1. Survival (years) indicates the survival time of each patient in years and column 2. Event indicates the event status for each patient, death or censored.
Set Time Variable to 1. Survival (years) using the drop-down menu
Set Event Variable to 2. Event using the drop-down menu
Only numeric data are displayed in the Time Variable drop-down list and only categorical data with two categories are displayed in Event Variable.
Set Event Status to death using the drop-down menu (Figure 2)
Event Status should be set to the primary event outcome.
Figure 2. Configuring the Kaplan-Meier dialog
Select 3. p53 status from the Candidate(s) panel
Select Add Factor > to add 3. p53 status to the Strata (Categorical) panel
This will test the difference in survival rates between the p53 mutants (mutant) and samples with wild-type p53 (wt).
Select OK to run the test (Figure 3)
Figure 3. Configuring the Kaplan-Meier dialog to test the difference in survival rates between patients with different p53 status
The Kaplan-Meier Plot will open in a new tab (Figure 4).
Figure 4. Kaplan-Meier plot comparing the survival curves between two groups.
The horizontal axis indicates time-to-event; the vertical axis shows the cumulative percentage of survival. Censoring is shown as a triangle; event occurrence is shown as a step-down in the plot. Partek Genomics Suite performs two statistical tests to compare the survival curves: a log-rank test and the Wilcoxon-Gehan test. Low p-values indicate that the groups have significantly different survival times.
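For readers who want to see how a log-rank comparison of two groups works, here is a minimal sketch of the standard two-group log-rank statistic. It is an assumption that this matches the variant Partek Genomics Suite reports; the function name `logrank_test`, the 1/0 event coding, and the toy data are hypothetical.

```python
import math

def logrank_test(times, events, groups):
    """Two-group log-rank test; events: 1 = event, 0 = censored.

    Returns (chi_square, p_value) with one degree of freedom.
    """
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("exactly two groups expected")
    first = labels[0]
    observed = expected = variance = 0.0
    # Walk over every distinct time at which at least one event occurred
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        n = sum(1 for x in times if x >= t)                       # at risk, all groups
        n1 = sum(1 for x, g in zip(times, groups) if x >= t and g == first)
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        d1 = sum(1 for x, e, g in zip(times, events, groups)
                 if x == t and e == 1 and g == first)
        observed += d1
        expected += d * n1 / n                                    # expected events in first group
        if n > 1:                                                 # hypergeometric variance
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi_square = (observed - expected) ** 2 / variance
    # Upper tail of a chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value

# Hypothetical toy data: the "mutant" group fails earlier than the "wt" group
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 1, 1, 1, 1]
groups = ["mutant"] * 4 + ["wt"] * 4
chi_square, p_value = logrank_test(times, events, groups)
print(f"chi-square = {chi_square:.3f}, p = {p_value:.4f}")
```

The statistic compares the observed number of events in one group with the number expected if both groups shared the same survival curve, which is why a low p-value points to different survival times.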
Select the Analysis tab to switch to the Kaplan-Meier results spreadsheet (Figure 5)
Figure 5. Kaplan-Meier spreadsheet. Each row represents occurrence of at least one significant event.
The spreadsheet is organized into two sections: the analysis of the p53 mutant group and the analysis of the p53 wild type group. Each row represents a time point at which at least one event occurred; the columns provide the following information:
1. Identifies the group membership (according to the strata)
2. Survival time corresponds to the entries in column 1. of the original (Survival_Tutorial) spreadsheet. At each given time, at least one event, either death or censored, was recorded.
3. Probability of Survival: cumulative probability of survival at a given time point (also known as the KM survival estimate). Cumulative probability is the probability of surviving all of the intervals before this time point. As time increases, the cumulative survival probability decreases as events occur.
4. Number of group members at risk (i.e., have not experienced the event). The count in each row is calculated by subtracting the number of deaths and censored events in the row above from the number at risk in the row above.
5. Count of deaths at this time point in the group
6. Count of censored events at the given time in the group
7. Total number of deaths in all groups at the given time
8. Total number of participants at risk in all groups. The count in each row is calculated by subtracting the number of deaths and censored events at the previous time point in both groups from the total number at risk at the previous time point
9. Natural logarithm of column 3.; also noted as ln(KM)
10. Natural logarithm of the negative value of column 9., i.e., ln(-ln(KM)). A plot of ln(-ln(KM)) vs. ln(t) is often used to test the proportional hazards assumption. To visualize the risk, select this column and select View > Log Log S Plot (Figure 6).
Figure 6. Log Log S plot of KM data. As the lines are mostly parallel and do not cross, the log-rank test assumptions are valid. The Wilcoxon-Gehan test would have more power if the lines crossed or were not parallel, but it performs less well when the data are heavily censored.
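The per-group columns described above can be reproduced by hand. The sketch below builds one row per recorded time (death or censoring) for a single group, with the number at risk, the counts, the KM estimate, ln(KM), and ln(-ln(KM)). The function name `km_table`, the dictionary keys, and the toy data are hypothetical, and the layout is only an approximation of the Partek spreadsheet.

```python
import math
from collections import Counter

def km_table(times, statuses, event_label="death"):
    """One row per distinct recorded time for a single group."""
    deaths = Counter(t for t, s in zip(times, statuses) if s == event_label)
    censored = Counter(t for t, s in zip(times, statuses) if s != event_label)
    at_risk = len(times)
    km = 1.0
    rows = []
    for t in sorted(set(times)):
        d, c = deaths[t], censored[t]          # Counter returns 0 for missing keys
        km *= 1.0 - d / at_risk
        # ln(KM) is undefined when KM is 0; ln(-ln(KM)) is undefined when KM is 0 or 1
        ln_km = math.log(km) if km > 0 else None
        lnln_km = math.log(-ln_km) if ln_km is not None and ln_km < 0 else None
        rows.append({"time": t, "km": km, "at_risk": at_risk, "deaths": d,
                     "censored": c, "ln_km": ln_km, "lnln_km": lnln_km})
        at_risk -= d + c                       # next row: subtract deaths and censored
    return rows

# Hypothetical toy data for one group
times = [1, 2, 3, 4, 5]
statuses = ["death", "censored", "death", "death", "censored"]
rows = km_table(times, statuses)
for r in rows:
    print(r)
```

Plotting `lnln_km` against the logarithm of `time` for each group gives the log-log view used to judge whether the curves are roughly parallel.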
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Please note that the Kaplan-Meier results spreadsheet is a temporary file. If you would like to be able to view the spreadsheet again after closing Partek Genomics Suite, be sure to save it by selecting the Save Active Spreadsheet icon.
This tutorial will illustrate:
Note: the workflow described below is enabled in Partek Genomics Suite version 7.0 software. Please fill out the form on our support page to request this version or use the Help > Check for Updates command to check whether you have the latest released version. The screenshots shown within this tutorial may vary across platforms and across different versions of Partek Genomics Suite.
Survival analysis is a branch of statistics that deals with modeling of time-to-event. In the context of “survival,” the most common event studied is death; however, any other important biological event could be analyzed in a similar fashion (e.g., spreading of the primary tumor or occurrence/relapse of disease). The significant event should be well-defined and occur at a specific time. As the primary outcome event is typically unfavorable (e.g., death, metastasis, relapse, etc.), the event is called a “hazard.” Survival analysis tries to answer questions such as: What is the proportion of a population who will survive past a certain time (i.e., what is the 5-year survival rate)? What is the rate at which the event occurs? Do particular characteristics have an impact on survival rates (e.g., are certain genes associated with survival)? Is the 5-year survival rate improved in patients treated by a new drug?
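A question such as "what is the 5-year survival rate?" amounts to reading a step-shaped survival curve at a chosen time. The sketch below shows one way to do that lookup; the function name `survival_at` and the curve values are hypothetical and not derived from the tutorial data.

```python
from bisect import bisect_right

def survival_at(curve, t):
    """Read a step-shaped survival curve at time t.

    curve: [(time, survival_probability)] sorted by time, one entry per step.
    Survival is 1.0 before the first step and constant between steps.
    """
    step_times = [time for time, _ in curve]
    i = bisect_right(step_times, t)   # number of steps at or before t
    return 1.0 if i == 0 else curve[i - 1][1]

# Hypothetical curve: steps at 1, 3, and 6 years
curve = [(1.0, 0.9), (3.0, 0.75), (6.0, 0.6)]
five_year = survival_at(curve, 5.0)
print(f"Estimated 5-year survival: {five_year:.2f}")
```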
An important feature of survival analysis is the presence of “censored” data. Censored data refers to subjects that have not experienced the event being studied. For example, medical studies often focus on survival of patients after treatment so the survival times are recorded during the study period. At the end of the study period, some patients are dead, some patients are alive, and the status of some patients is unknown because they dropped out of the study. Censored data refers to the latter two groups. The patients who survived until the end of the study or those who dropped out of the study have not experienced the study event "death" and are listed as "censored".
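The coding rule described here can be written down explicitly. The following sketch assigns a time-to-event and a status to one subject, assuming all times are measured in years since enrollment; the function name `code_subject`, its parameters, and the 10-year study length are hypothetical.

```python
def code_subject(death_time=None, dropout_time=None, study_end=10.0):
    """Return (time_to_event, status) for one subject.

    All times are years since enrollment. A death counts as the event only
    if it was observed before the subject left the study or the study ended.
    """
    died_in_study = (
        death_time is not None
        and death_time <= study_end
        and (dropout_time is None or death_time <= dropout_time)
    )
    if died_in_study:
        return death_time, "death"
    if dropout_time is not None and dropout_time < study_end:
        # Dropped out: censored at the last recorded time point
        return dropout_time, "censored"
    # Still alive at the end of the study: censored at study end
    return study_end, "censored"

print(code_subject(death_time=3.5))     # event observed
print(code_subject(dropout_time=4.0))   # dropped out
print(code_subject())                   # alive at study end
```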
The tutorial data set (236 samples) is a subset of fresh-frozen breast tumor specimens from a population-based cohort of 315 women with breast cancer. The clinicopathological characteristics accompanying each tumor include p53 status (mutant or wild-type), estrogen receptor (ER) status, progesterone receptor (PgR) status, lymph node status, tumor size, and patient age. Gene expression was assessed on Affymetrix® U133A and U133B arrays (Miller LD et al., GSE3494). Please note that Affymetrix data have been chosen for illustration purposes only, and that the same functionality can be used to analyze any data set. The raw data files (.CEL) have already been imported into Partek Genomics Suite; samples with no survival time data, as well as sample attributes irrelevant for the survival analysis, were removed, and the final spreadsheet was saved in Partek Genomics Suite (Survival_Tutorial.fmt and Survival_Tutorial.txt). To go through the tutorial, download the tutorial data set, unzip the downloaded folder, and save it in an easily accessible location on your computer.
After saving the unzipped file, you can open it in Partek Genomics Suite.
Select File from the main toolbar
Select Open...
Browse to the folder containing the tutorial data set and select the file Survival_Tutorial.fmt
The data spreadsheet will open (Figure 1). Each row represents a tumor sample from a breast cancer patient. Sample attributes are listed in columns 1-8, while columns 9+ are intensity values for the probe sets listed in the column headers.
Figure 1. Viewing the sample data (one sample per row) for the survival analysis tutorial
Miller LD, Smeds J, George J, Vega VB et al. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. PNAS, 2005; 102(38): 13550-5.
Cox regression (Cox proportional-hazards model) tests the effects of several factors (predictors) on survival time. Predictors that lower the probability of survival at a given time are called risk factors; predictors that increase the probability of survival at a given time are called protective factors. The Cox proportional-hazards model is similar to a multiple logistic regression that considers time-to-event rather than simply whether an event occurred or not.
In this tutorial, we will use Cox Regression to test the effects of tumor gene expression on survival time while accounting for tumor size.
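To make the model more concrete, the sketch below fits a Cox model with a single predictor by Newton-Raphson on the partial likelihood. It is a simplified illustration, not the routine used by Partek Genomics Suite: it assumes one predictor, no co-predictors, Breslow handling of tied times, and no safeguards such as step-halving. The function name `cox_fit_one` and the toy data are hypothetical.

```python
import math

def cox_fit_one(times, events, x, iterations=25):
    """Fit h(t | x) = h0(t) * exp(beta * x) for a single predictor x.

    events: 1 = event, 0 = censored. Returns (beta, standard_error).
    """
    beta = 0.0
    info = 0.0
    for _ in range(iterations):
        score = info = 0.0
        for t_i, e_i, x_i in zip(times, events, x):
            if not e_i:
                continue                      # censored subjects add no event term
            # Risk set: everyone still under observation at time t_i
            risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
            weights = [math.exp(beta * x_j) for x_j in risk]
            s0 = sum(weights)
            s1 = sum(w * x_j for w, x_j in zip(weights, risk))
            s2 = sum(w * x_j * x_j for w, x_j in zip(weights, risk))
            score += x_i - s1 / s0            # first derivative of the log partial likelihood
            info += s2 / s0 - (s1 / s0) ** 2  # negative second derivative
        step = score / info
        beta += step
        if abs(step) < 1e-10:
            break
    # Standard error from the information evaluated in the last iteration
    return beta, 1.0 / math.sqrt(info)

# Hypothetical toy data: subjects with x = 1 tend to fail slightly earlier
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]
x = [1, 0, 1, 0, 1, 0]
beta, se = cox_fit_one(times, events, x)
print(f"beta = {beta:.3f}, hazard ratio = {math.exp(beta):.3f}, se = {se:.3f}")
```

In the tutorial workflow this kind of fit is repeated for every probe set, with tumor size included as an additional term in the model.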
To begin, you should have the Survival Tutorial data set open in Partek Genomics Suite as shown.
Select Stat from the main toolbar
Select Survival Analysis then Cox Regression from the Stat menu (Figure 1)
Figure 1. Invoking Cox Regression
The Cox Regression dialog will open. Please note that in this tutorial data set, column 1. Survival (years) indicates the survival time of each patient in years and column 2. Event indicates the event status for each patient, death or censored.
Set Time Variable to 1. Survival (years) using the drop-down menu
Set Event Variable to 2. Event using the drop-down menu
Only numeric data are displayed in the Time Variable drop-down list and only categorical data with two categories are displayed in Event Variable.
Set Event Status to death using the drop-down menu (Figure 2)
Event Status should be set to the primary event outcome. All Response Variables will be automatically selected for Predictor. This means that Cox Regression will test every probe set for association with the survival (time-to-event).
Figure 2. Configuring the Cox Regression dialog
Co-predictors are numeric or categorical factors that will be included in the regression model. To account for tumor size when testing the association between gene expression and survival time, we can add tumor size to the co-predictors list.
Select 7. tumor size (mm) from the Candidate(s) panel
Select Add Factor > to add it to the Co-predictor(s) panel
Advanced options, such as the inclusion of interactions between predictors and co-predictors, can be accessed by selecting Model... (Figure 3). The Results... button invokes a dialog (Figure 4) with additional output options for the results spreadsheet. We do not need to adjust any of the advanced model or output options for this tutorial.
Figure 3. Configuring advanced options for Cox Regression
Figure 4. Configuring output options for Cox Regression
Select OK to run Cox Regression (Figure 5)
Figure 5. Configuring Cox Regression to assess the effect of gene expression and tumor size on survival
The spreadsheet generated by Cox Regression (Figure 6) includes a row for each probe set; the columns provide the following information:
1. Column # - Column number of probe set in probe intensities spreadsheet
2. Probeset ID - ID of probe set in probe intensities spreadsheet
3. HRatio(gene) - Hazard ratio for the probe set
4. LowCI(gene) - lower 95% confidence boundary of the hazard ratio for the probe set
5. UpCI(gene) - upper 95% confidence boundary of the hazard ratio for the probe set
6. p-value(gene) - P-value of the corresponding Chi-squared test. A low value indicates that the predictor (probe set) is significantly associated with survival time; the hazard ratio in column 3 indicates whether the association is with shortened or lengthened survival time
7. to 10. - Effects of the co-predictor on survival time; for each co-predictor, a similar set of columns is added
11. modelfit(0) - P-value of the test assessing the overall model fit, i.e., the relationship between survival time, the predictor, and co-predictors in the model. A modelfit value of > 0.05 indicates a low association between the predictor and/or co-predictors with survival time.
Please note that the Cox Regression results spreadsheet is a temporary file. If you would like to be able to view the spreadsheet again after closing Partek Genomics Suite, be sure to save it by selecting the Save Active Spreadsheet icon.
Figure 6. Cox Regression results spreadsheet
The hazard ratio is an effect size measure used to assess the direction and magnitude of the effect of a predictor variable on the relative likelihood of the event occurring at any given point in time, controlling for other predictors in the model.
For continuous predictors, such as gene expression values and tumor size, the hazard ratio is the predicted change in the hazard for a unit increase in the predictor. A hazard ratio greater than 1 indicates that the predictor is associated with shorter time-to-event, a hazard ratio less than 1 indicates that the predictor is associated with longer time-to-event, and a hazard ratio of 1 indicates that the predictor has no effect on time-to-event. For categorical predictors, the hazard ratio is relative to the reference category.
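The relationship between a fitted regression coefficient and the hazard ratio columns can be sketched as follows. The example assumes a Wald-type 95% confidence interval, exp(beta ± 1.96 × SE); whether Partek Genomics Suite computes its interval in exactly this way is an assumption, and the function name `hazard_ratio_summary` and the input values are hypothetical.

```python
import math

def hazard_ratio_summary(beta, se, z=1.959964):
    """Hazard ratio and Wald-type 95% confidence interval for one coefficient.

    beta: Cox regression coefficient (log hazard per unit increase in the predictor)
    se:   standard error of beta
    """
    hr = math.exp(beta)
    low = math.exp(beta - z * se)
    up = math.exp(beta + z * se)
    if hr > 1:
        direction = "shorter time-to-event"
    elif hr < 1:
        direction = "longer time-to-event"
    else:
        direction = "no effect on time-to-event"
    excludes_one = low > 1 or up < 1   # interval does not contain a hazard ratio of 1
    return hr, low, up, direction, excludes_one

# Hypothetical coefficient: hazard doubles per unit increase in the predictor
hr, low, up, direction, excludes_one = hazard_ratio_summary(math.log(2), 0.1)
print(f"HR = {hr:.2f} (95% CI {low:.2f} to {up:.2f}): {direction}")
```

Because the hazard ratio is per unit of the predictor, its magnitude depends on the scale of that predictor, for example millimeters for tumor size.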
For any probe set, we can view a detailed HTML report.
Right-click the row header for row 1
Select HTML Report from the pop-up menu (Figure 7)
Figure 7. Invoking an HTML report for a probe set
The HTML report (Figure 8) will open in your default web browser.
Figure 8. Cox Regression HTML report