Download the data from the Partek site to your local disk. The zip file contains both data and annotation files.
Unzip the files to C:\Partek Training Data\Down_Syndrome-GE or to a directory of your choosing. Be sure to create a directory or folder to hold the contents of the zip file
Copy or move the annotation files (HG-U133A.cdf, HG-U133A.na36.annot, HG-U133A.na36.annot.idx) to C:\Microarray Libraries.
The annotation files are copied to the default library location because annotation files released after this tutorial was published may produce results that differ from those shown here. If, however, you prefer to download the latest version, you may skip copying the HG-U133A files to C:\Microarray Libraries.
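For readers who prefer to script the setup, the unzip and copy steps above can be sketched in Python. This is only an illustration under stated assumptions: the function name `unpack_tutorial_data` and the archive name in the usage note are hypothetical, and nothing here is part of Partek Genomics Suite.

```python
import shutil
import zipfile
from pathlib import Path


def unpack_tutorial_data(zip_path, data_dir, library_dir, annotation_names):
    """Extract the tutorial archive, then copy annotation files to the library folder.

    Returns the list of annotation file names that were found and copied.
    """
    data_dir = Path(data_dir)
    library_dir = Path(library_dir)
    # Create both folders if they do not exist yet.
    data_dir.mkdir(parents=True, exist_ok=True)
    library_dir.mkdir(parents=True, exist_ok=True)

    # Unzip everything (data and annotation files) into the data folder.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(data_dir)

    copied = []
    for name in annotation_names:
        # The archive layout is not documented here, so search recursively.
        matches = sorted(data_dir.rglob(name))
        if matches:
            shutil.copy2(matches[0], library_dir / name)
            copied.append(name)
    return copied
```

A call might look like `unpack_tutorial_data("tutorial_data.zip", r"C:\Partek Training Data\Down_Syndrome-GE", r"C:\Microarray Libraries", ["HG-U133A.cdf", "HG-U133A.na36.annot", "HG-U133A.na36.annot.idx"])`, where `tutorial_data.zip` stands in for whatever name the downloaded archive has.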
Start Partek® Genomics Suite® and select Gene Expression from the Workflows panel on the right side of the tool bar in the main window (Figure 1)
Figure 1. Selecting the gene expression workflow
Select Import Samples under the Import section of the workflow
Select Import from Affymetrix CEL Files and then select OK
Select the Browse button and select the C:\Partek Training Data\Down_Syndrome-GE folder. By default, all the files with a .CEL extension are selected (Figure 2)
Figure 2. Selecting the folder and CEL files for the experiment
Select the Add File(s) > button to move all the .CEL files to the right panel. Twenty-five CEL files will be processed
Select the Next > button to open the Import Affymetrix CEL Files dialog (Figure 3)
Figure 3. Configuring import files window
Select Customize… to open the Advanced Import Options dialog (Figure 4)
Figure 4. Configuring the Advanced Import Options dialog
Select Library Files… to open the Specify File Locations dialog (Figure 5). This dialog is used to specify the location of the library folder and the annotation files
Figure 5. Specifying Microarray Library files or changing the default library directory
Partek Genomics Suite will automatically assign the annotation files according to the chip type stored in the .CEL files. If the annotation files are not available in the library directory, Partek Genomics Suite will automatically download and store them in the Default Library File Folder.
The default library location can be modified by selecting Change... in the Default Library File Folder panel. By default, the library directory is at C:\Microarray Libraries. This directory is used to store all the external libraries and annotation files needed for analysis and visualization. The library directory can also be modified from Tools > File Manager in the main Partek Genomics Suite menu
Select OK (Figure 5) to close the Specify File Locations dialog
Select the Outputs tab from the Advanced Import Options dialog (Figure 6)
Figure 6. Specifying Advanced Import Options to create chip images and extract the scan date from the CEL files
In the Extract Time Stamp and Date from CEL File panel, make sure the Date button is selected to extract the chip scan date. This information can help you detect if there are batch effects caused by the process time
In the Quality Assess of Gene Expression panel, leave the QC report button unselected. A user guide for the microarray data quality assessment and quality control features is available in the User’s Manual
Select OK to exit the Advanced Import Options dialog
Select Import. The progress bar on the lower left of the Import Affymetrix CEL files dialog will update as .CEL files are imported. Once all files have been imported, the Import Affymetrix CEL Files dialog will close
After the .CEL files have been imported, the result file will open in Partek Genomics Suite as a spreadsheet named 1 (Down_Syndrome-GE). The spreadsheet should contain 25 rows representing the microarray chips (samples) and over 22,000 columns representing the probe sets (genes) (Figure 7).
Figure 7. Viewing the main or top-level spreadsheet
For additional information on importing data into Partek Genomics Suite, see Chapter 4 Importing and Exporting Data in the Partek User’s Manual. The User’s Manual is available from the Partek Genomics Suite main menu under Help > User’s Manual. The FAQ (Help > On-line Tutorials > FAQ) may also be helpful. Because this tutorial addresses only some topics, you may need to consult the User’s Manual for information about other useful features.
We recommend becoming familiar with Chapter 6 The Pattern Visualization System of the User’s Manual before going through the next section of the tutorial.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
This tutorial will illustrate:
Note: the workflow described below is enabled in Partek Genomics Suite version 7.0 software. Please fill out the form on our support page to request this version, or use the Help > Check for Updates command to check whether you have the latest released version. The screenshots shown within this tutorial may vary across platforms and across different versions of Partek Genomics Suite.
Down syndrome is caused by an extra copy of all or part of chromosome 21; it is the most common non-lethal trisomy in humans. At the time of the study used in this tutorial, conflicting reports had thrown into doubt whether individuals with Down syndrome have dysregulation of gene expression throughout the genome or primarily in genes from chromosome 21. To address this question, Affymetrix GeneChip™ Human U133A arrays were used to assay 25 samples taken from 10 human subjects, with or without Down syndrome, and 4 different tissues. The data revealed a significant upregulation of chromosome 21 genes at the gene expression level in individuals with Down syndrome; this dysregulation was largely specific to chromosome 21 and not a genome-wide phenomenon.
The raw data is available as experiment number GSE1397 in the Gene Expression Omnibus.
Data and associated files for this tutorial can be downloaded using this link - Gene Expression Analysis tutorial data (right-click the link and choose "Save Link As" to download the tutorial data).
Partek Genomics Suite tutorials provide step-by-step instructions using a supplied data set to teach you how to use the software’s tools. Upon completion of each tutorial, you will be able to apply your knowledge in your own studies.
Partek Genomics Suite enables you to visualize each probe and compare the methylation between the groups at a single CpG site level.
Right-click row 5. SBNO2 in the LCLs_vs_B_Cells_CpG_Islands spreadsheet
Select Browse to Location from the pop-up menu
Figure 1. Browsing to location from spreadsheet with differentially expressed genes
The Chromosome View tab will open, zoomed in to the selected CpG locus in SBNO2 (Figure 2).
Figure 2. Viewing location in Genome Viewer
The Chromosome View visualization is composed of a series of tracks corresponding to annotation files and data files.
RefSeq Transcripts 2017-05-02 (hg19) (+): transcripts coded by the positive strand
RefSeq Transcripts 2017-05-02 (hg19) (-): transcripts coded by the negative strand
Regions: by default, difference in methylation (M-value) between the groups
Heatmap (1/mvalue): M values for all the samples
Barchart (Methylation): methylation level in M value of the selected sample (to select a sample, click on a heat map)
Heatmap (Methylation Tutorial): Beta values for all the samples
Barchart (Methylation): methylation level in Beta value of the selected sample (to select a sample, click on a heat map)
Cytoband: cytobands of the current chromosome
Genomic Label: coordinates on the current chromosome
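The tracks above show methylation on two scales, M values and Beta values. As a hedged illustration, the sketch below uses the commonly cited relationship between the two, M = log2(Beta / (1 - Beta)); whether Partek Genomics Suite computes its M values in exactly this way is an assumption, and the function names are hypothetical.

```python
import math


def beta_to_m(beta):
    """Convert a Beta value (strictly between 0 and 1) to an M value."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return math.log2(beta / (1.0 - beta))


def m_to_beta(m_value):
    """Convert an M value back to a Beta value."""
    return 2.0 ** m_value / (2.0 ** m_value + 1.0)
```

Under this relationship a Beta value of 0.5 (half-methylated) corresponds to an M value of 0, and M values grow without bound as Beta approaches 0 or 1, which is why differences between groups are often computed on the M scale while Beta values are easier to read as a proportion.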
To modify a track, select it in the Tracks panel to bring up its configuration options panel below the Tracks panel. Let's modify a few tracks to improve our visualization of the data.
Select the Regions track; its configuration panel opens to the Profile tab
Select Color tab
Set Color bars by to Difference (LCLs vs. B cells) (Description)
Select Apply to apply the change
This colors each region according to whether its methylation is increased or decreased.
Select the Heatmap (1/mvalue)
Select Remove Track
Select Bar Chart (Methylation) located directly below the Regions track
Select Remove Track
We can now more clearly see the difference in M values for the region in the Regions track, the heatmap of beta values in the Heatmap track, and the beta values of the selected sample at each locus in the Bar Chart track.
Select a sample on the heatmap to view its beta value in the Bar Chart track (Figure 3)
Figure 3. Modify the tracks of the Genome Viewer to facilitate visual analysis
The available tracks can be supplemented with a special annotation file that can be built using a UCSC annotation file as the basis. Building and viewing the UCSC annotation file is available as an optional section of the tutorial, Optional: Add UCSC CpG island annotations.
Twenty-five CEL files (samples) have been imported into Partek Genomics Suite. Sample information must be added to define the grouping and the goals of the experiment.
Select Add Sample Attributes in the Import section of the Gene Expression workflow panel
Choose the option Add Attributes from an Existing Column
Select OK to open the Sample Information Creation dialog
In this tutorial, the file name (e.g., Down Syndrome-Astrocyte-748-Male-1-U133A.CEL) contains the information about a sample and is separated by hyphens (-). Choosing to split the file name by delimiters will separate the categories into different columns
In the Sample Information panel, specify the column labels (Labels 1-4) as Type, Tissue, Subject, and Gender, set each as categorical, and set the other columns as skip (Figure 1). Select OK
Figure 1. Configuring the Sample Information Creation dialog
A dialog window asking if you would like to save the spreadsheet with the new sample attribute will appear. Select Yes
Make column 5. (Subject) random by right-clicking on the column header and selecting Properties from the pop-up menu (Figure 2).
Figure 2. Changing column properties
Select the Random Effect check box from the Properties dialog (Figure 3) then select OK.
Figure 3. Setting column to Random Effect
The column 5. (Subject) will now be colored red, indicating that it is a random effect.
Note: More details on Random vs. Fixed Effects can be found later in this tutorial under the section Identifying differentially expressed genes using ANOVA.
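The delimiter-based split of the file name described above can be sketched outside the software as follows. This is a minimal illustration, assuming every file name follows the Type-Tissue-Subject-Gender pattern of the example; the function name `parse_sample_name` is hypothetical.

```python
def parse_sample_name(file_name):
    """Split a CEL file name on hyphens and label the first four fields."""
    # Drop the extension, e.g. ".CEL".
    stem = file_name.rsplit(".", 1)[0]
    fields = stem.split("-")
    if len(fields) < 4:
        raise ValueError("expected at least four hyphen-separated fields")
    sample_type, tissue, subject, gender = fields[:4]
    # Remaining fields correspond to the columns set to "skip" in the dialog.
    return {
        "Type": sample_type,
        "Tissue": tissue,
        "Subject": subject,
        "Gender": gender,
    }
```

For the example file name `Down Syndrome-Astrocyte-748-Male-1-U133A.CEL`, the four labeled fields are `Down Syndrome`, `Astrocyte`, `748`, and `Male`; the trailing `1` and `U133A` are skipped.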
Now that you have obtained statistical results from the microarray experiment, you can create new spreadsheets containing just those genes that pass certain criteria. This will streamline data management by focusing on just those genes with the most significant differential expression or substantial fold change. The List Manager can be used to specify numerous conditions for selecting genes of interest. In this tutorial, we are going to create a list of genes with a fold change greater than 1.3 or less than -1.3 and an unadjusted p-value of < 0.0005.
Invoke the List Manager dialog by selecting Create Gene List in the Analysis section of the Gene Expression workflow
Ensure that the 1/ANOVA-3way (ANOVAResults) spreadsheet is selected as this is the spreadsheet we will be using to create our new gene list as shown (Figure 1)
Select the ANOVA Streamlined tab.
In the Contrast: find genes that change between two categories panel, select Down Syndrome vs. Normal and choose Have Any Change from the Setting drop-down menu
This will find genes with different expression levels in the different types of samples.
In the Configuration for “Down Syndrome vs Normal” panel, check that Include size of the change is selected and enter 1.3 into Change > and -1.3 in OR Change <
Select Include significance of the change, choose unadjusted p-value from the dropdown menu, and < 0.001 for the cutoff
The number of genes that pass your cutoff criteria will be shown next to the # Pass field. In this example, 30 genes pass the criteria.
Set Save the list as A
Select Create to generate the new list A
Select Close to view the new gene list spreadsheet
Figure 1. Creating a gene list from ANOVA results
The spreadsheet Down_Syndrome_vs_Normal (A) will be created as a child spreadsheet under the Down_Syndrome-GE spreadsheet.
This gene list spreadsheet can now be used for further analysis such as hierarchical clustering, gene ontology, integration of copy number data, or be exported into other data analysis tools such as pathway analysis.
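The selection logic configured in the List Manager can be sketched in a few lines. This is an illustration only, not the software's implementation: the function names are hypothetical, the input is assumed to be (gene, fold change, unadjusted p-value) tuples, and the p-value cutoff is left as a required argument so that you can pass whichever threshold you entered in the dialog.

```python
def passes_cutoffs(fold_change, p_value, p_cutoff, fc_cutoff=1.3):
    """True if the change is large enough in either direction and significant."""
    large_change = fold_change > fc_cutoff or fold_change < -fc_cutoff
    return large_change and p_value < p_cutoff


def create_gene_list(results, p_cutoff, fc_cutoff=1.3):
    """Keep the genes whose fold change and p-value pass both cutoffs."""
    return [
        gene
        for gene, fold_change, p_value in results
        if passes_cutoffs(fold_change, p_value, p_cutoff, fc_cutoff)
    ]
```

Note that the two conditions are combined with AND, while the two fold-change directions are combined with OR, matching the "Change >" and "OR Change <" fields in the dialog.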
Next, we will generate a list of genes that passed a p-value threshold of 0.05 and fold-changes greater than 1.3 using a volcano plot.
Select the 1/ANOVA-3way (ANOVAResults) spreadsheet in the Analysis tab. This is the spreadsheet our gene list will be drawn from
Select View > Volcano Plot from the Partek Genomics Suite main menu (Figure 2)
Figure 2. Generating a Volcano Plot from ANOVA results
Set X Axis (Fold-Change) to 12. Fold-Change(Down Syndrome vs. Normal), and the Y axis (p-value) to be 10. p-value(Down Syndrome vs. Normal)
Select OK to generate a Volcano Plot tab for genes in the ANOVA spreadsheet (Figure 3)
Figure 3. Volcano plot generated from ANOVA spreadsheet
In the plot, each dot represents a gene. The X-axis represents the fold change of the contrast (Down syndrome vs. Normal), and the Y-axis represents the range of p-values. The genes with increased expression in Down syndrome samples are on the right side of the N/C (no change) line; genes with reduced expression in Down syndrome samples are on the left. The genes become more statistically significant with increasing Y-axis position. The genes that have larger and more significant changes between the Down syndrome and normal groups are on the upper right and upper left corner.
In order to select the genes by fold-change and p-value, we will draw a horizontal line to represent the p-value 0.05 and two vertical lines indicating the –1.3 and 1.3-fold changes (cutoff lines).
Choose the Axes tab
Check Select all points in a section to allow Partek Genomics Suite to automatically select all the points in any given section
Select the Set Cutoff Lines button and configure the Set Cutoff Lines dialog as shown (Figure 4)
Figure 4. Setting cutoff lines for -1.3 to 1.3 fold changes and a p-value of 0.05
Select OK to draw the cutoff lines
Select OK in the Plot Rendering Properties dialog to close the dialog and view the plot
The plot will be divided into six sections. By clicking on the upper-right section, all genes in that section will be selected.
Right-click on the selected region in the plot and choose Create List to create a list including the genes from the section selected (Figure 5). Note that these p-values are uncorrected
Figure 5. Creating a gene list from a volcano plot
Note: If no column is selected in the parent (ANOVA) spreadsheet, all of the columns will be included in the gene list; if some columns are selected, only the selected columns will be included in the list.
Specify a name for the gene list (example: volcano plot list) and write a brief description about the list.
The description is shown when you right-click on the spreadsheet > Info > Comments. Here, I have named the list "volcano plot list" and described it as "Genes with >1.3 fold change and p-value <0.05" (Figure 6). The list can be saved as a text file (File > Save As Text File) for use in reports or by downstream analysis software.
Figure 6. Saving a list created from a volcano plot
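The six sections created by the cutoff lines can be described programmatically. The sketch below is a hypothetical helper, not part of Partek Genomics Suite, that assigns a gene to a section from its fold change and p-value, assuming the cutoffs used in this section (fold change of -1.3 and 1.3, p-value of 0.05).

```python
def volcano_section(fold_change, p_value, fc_cutoff=1.3, p_cutoff=0.05):
    """Name the volcano-plot section that a gene falls into."""
    # Vertical cutoff lines split the plot into left, middle, and right.
    if fold_change > fc_cutoff:
        horizontal = "right"
    elif fold_change < -fc_cutoff:
        horizontal = "left"
    else:
        horizontal = "middle"
    # More significant genes (smaller p-values) sit above the horizontal line.
    vertical = "upper" if p_value < p_cutoff else "lower"
    return f"{vertical}-{horizontal}"
```

Genes classified as `upper-right` correspond to the section selected in the steps above: increased expression in Down syndrome samples with an uncorrected p-value below the cutoff.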
During data importation, the GeneChip annotation file was linked to the imported data. This linked annotation information can be added as new columns to the ANOVA or gene list spreadsheets. For example, we can add additional annotation to the gene list we created from the ANOVA results as follows:
In the Down_Syndrome_vs_Normal (A) spreadsheet, right-click on the second column header 2. ProbesetID and select Insert Annotation from the pop-up menu (Figure 1)
Figure 1. Inserting an annotation
Select Chromosomal Location under the Column Configuration panel (Figure 2). Leave everything else as default
Select OK
Figure 2. Adding Chromosomal Location annotation
Interestingly, of the 23 genes of the Down_Syndrome_vs_Normal (A) spreadsheet, 20 genes are located on chromosome 21. This suggests that the gene expression changes associated with Down syndrome observed in this study are primarily located on chromosome 21, not distributed throughout the genome, an important finding of this study.
The original experiment is listed on the Gene Expression Omnibus as GSE848; however, this tutorial only uses a subset of the original experiment and should be downloaded from the Partek website tutorial page, Gene Expression Analysis with Batch Effects.
Download the zipped project folder, Breast_Cancer-GE.zip
Unzip the project folder to C:/Partek Training Data/ or a directory of your choosing
This location should be easily accessible. The unzipped Breast_Cancer-GE project folder and a zipped annotation file will be added to the selected directory.
Unzip the included annotation file, HG_U95Av2.na32.annot.rar
Move the annotation file, HG_U95Av2.na32.annot, to the microarray libraries folder
By default, the microarray libraries folder will be located at C:/Microarray Libraries, but the location may vary depending on your operating system and configuration.
Open Partek Genomics Suite
Select () from the main command bar
Navigate to the tutorial folder, Breast_Cancer-GE
Select Breast_Cancer.txt
Select Open (Figure 1)
Figure 1. Opening a data file. The red Partek Genomics Suite icon is shown next to the data file (FMT file format)
The spreadsheet will open as 1 (Breast_Cancer.txt) (Figure 2).
Figure 2. Breast_Cancer.txt data file
The summary at the bottom of the spreadsheet shows there are 18 rows and 12,631 columns in the spreadsheet. The first column contains the Filename listing the GEO GSM number. This is also an identifier for the microarray. Treatment, Time, and Batch are in columns 2, 3, and 4, respectively. Column 6 marks the beginning of the probesets. The data is log2 transformed.
At this point in analysis, you should explore the data preliminarily. Do the genes you expected to be differentially regulated appear to have larger or smaller intensity values? Do similar samples resemble each other?
The latter question can be explored using Principal Components Analysis (PCA), an excellent method for reducing and visualizing high-dimensional data.
Select PCA Scatter Plot from the QA/QC section of the Gene Expression workflow
A Scatter Plot tab containing your PCA plot will open (Figure 1).
Figure 1. PCA Scatter Plot tab
In the scatter plot, each point represents a chip (sample) and corresponds to a row on the top-level spreadsheet. The color of the dot represents the Type of the sample; red represents a normal sample and blue represents a Down syndrome sample. Points that are close together in the plot have similar intensity values across the probe sets on the whole chip, while points that are far apart in the plot are dissimilar
Left-clicking on any point in the scatter plot selects that point. A dash with an identifying row number will appear on the selected PCA plot point. The spreadsheet in the Analysis tab will also jump to the corresponding row.
As you can see from rotating the plot, there is no clear separation between Down syndrome and normal samples in this data since the red and blue samples are not separated in space. However, there are other factors that may separate the data.
Color the points by column 4. Tissue and size the points by column 3. Type
Select OK
Figure 2. Configuring the PCA scatter plot: Color by Tissue, size by Type
Notice now that the data are clustered by different tissues (Figure 3).
Figure 3. PCA scatter plot configured with color by Tissue, size by Type
Another way to see the cluster pattern is to put an ellipse around the Tissue groups.
Open the Plot Rendering Properties dialog and select the Ellipsoids tab
Select Add Ellipse/Ellipsoid
Select Ellipse in the Add Ellipse/Ellipsoid... dialog
Double click on Tissue in the Categorical Variable(s) panel to move it to the Grouping Variable(s) panel (Figure 4)
Select OK to close the Add Ellipse/Ellipsoid... dialog and select OK again to exit the Plot Rendering Properties dialog
Figure 4. Adding Ellipses to PCA Scatter Plot
By rotating this PCA plot, you can see that the data is separated by tissues, and within some of the tissues, the Down syndrome samples and normal samples are separated. For example, in the Astrocyte and Heart tissues, the Down syndrome samples (small dots) are on the left, and the normal samples (large dots) are on the right (Figure 5).
Figure 5. PCA scatter plot with ellipses, rotated to show separation by Type
PCA is an example of exploratory data analysis and is useful for identifying outliers and major effects in the data. From the scatter plot, you can see that the tissue is the biggest source of variation. There are many genes that express differently between the tissues, but not as many genes that express differently between type (Down syndrome and normal) across the whole chip.
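To make the idea behind the PCA plot concrete, the sketch below computes each sample's position along the first principal component only. It is a simplified illustration using power iteration in plain Python, not a description of how Partek Genomics Suite computes its PCA; the function name `first_pc_scores` is hypothetical, and rows are assumed to be samples with one intensity per probe set.

```python
import math


def first_pc_scores(rows, iterations=200):
    """Return each sample's score on the first principal component."""
    n_samples = len(rows)
    n_features = len(rows[0])
    # Center every column (probe set) so its mean across samples is zero.
    means = [sum(row[j] for row in rows) / n_samples for j in range(n_features)]
    centered = [[row[j] - means[j] for j in range(n_features)] for row in rows]
    # Sample-by-sample Gram matrix; its top eigenvector gives the PC1 scores.
    gram = [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in centered]
        for row_i in centered
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vector = [float(i + 1) for i in range(n_samples)]
    for _ in range(iterations):
        product = [sum(g * v for g, v in zip(row, vector)) for row in gram]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            return [0.0] * n_samples  # all samples are identical
        vector = [x / norm for x in product]
    # Rayleigh quotient gives the eigenvalue; scores are eigenvector * sqrt(eigenvalue).
    eigenvalue = sum(
        v * sum(g * w for g, w in zip(row, vector)) for v, row in zip(vector, gram)
    )
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [v * scale for v in vector]
```

Samples with similar intensities across probe sets receive similar scores, which is why they appear close together in the scatter plot. The sign of a principal component is arbitrary, so only the relative positions of the samples are meaningful.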
The next step is to draw a histogram to examine the samples.
Select Sample Histogram in the QA/QC section of the Gene Expression workflow to generate the Histogram tab (Figure 6)
Figure 6. Histogram tab
The histogram plots one line for each of the samples with the intensity of the probes graphed on the X-axis and the frequency of the probe intensity on the Y-axis. This allows you to view the distribution of the intensities to identify any outliers. In this dataset, all the samples follow the same distribution pattern indicating that there are no obvious outliers in the data. As demonstrated with the PCA plot, if you click on any of the lines in the histogram, the corresponding row will be highlighted in the spreadsheet 1 (Down_Syndrome-GE). You can also change the way the histogram displays the data by clicking on the Plot Properties button. Feel free to explore these options on your own.
The decision to discard any samples would be based on information from the PCA plot, sample histogram plot, and QC metrics. To discard a sample and renormalize the data (without the effects of the outlier), start over with importing samples and omit the outlier sample(s) during the .CEL file import.
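The per-sample curves in the histogram can be thought of as binned intensity frequencies. The sketch below is a minimal illustration of that idea; the function name and the choice of bin edges are hypothetical and do not reflect the binning that Partek Genomics Suite uses.

```python
def intensity_histogram(values, bin_edges):
    """Return the fraction of probe intensities that fall in each bin.

    Bins are half-open [low, high), except the last bin, which also
    includes its upper edge. Values outside the edges are ignored.
    """
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for value in values:
        for i in range(len(counts)):
            low, high = bin_edges[i], bin_edges[i + 1]
            if low <= value < high or (i == last and value == high):
                counts[i] += 1
                break
    total = len(values)
    if total == 0:
        return [0.0] * len(counts)
    return [count / total for count in counts]
```

Computing this for every sample with the same bin edges and overlaying the results gives one line per sample; a sample whose line has a clearly different shape from the rest is a candidate outlier.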
We are now ready to measure gene expression in our dataset. To do this, we will use the mRNA quantification task in the Analyze Known Genes section of the RNA-Seq workflow. mRNA quantification creates spreadsheets showing expression at exon, transcript, and gene levels and reports raw and normalized reads for each sample.
Please note that the normalization method used by Partek Genomics Suite is Reads Per Kilobase per Million mapped reads (RPKM) (Mortazavi et al. 2008). In brief, this normalization method counts total reads in a sample, divides by one million to create a per million scaling factor for each sample; then divides the read counts for the feature (exon, transcript, or gene) by the per million scaling factor to normalize for sequencing depth and give a reads per million value; and finally divides reads per million values by the length of the feature (exon, transcript, or gene) in kilobases to normalize for feature size.
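The three steps of the RPKM calculation described above can be written out directly. This is a minimal sketch that follows the description step by step; the function name `rpkm` and its parameter names are hypothetical.

```python
def rpkm(feature_reads, feature_length_bp, total_reads):
    """Reads Per Kilobase per Million mapped reads for one feature."""
    # Step 1: per-million scaling factor for the sample.
    per_million = total_reads / 1_000_000
    # Step 2: normalize the feature's read count for sequencing depth.
    reads_per_million = feature_reads / per_million
    # Step 3: normalize for feature size, with the length in kilobases.
    return reads_per_million / (feature_length_bp / 1_000)
```

For example, a 2,000 bp feature with 500 reads in a sample of 10 million reads has 50 reads per million, and dividing by its length of 2 kilobases gives an RPKM of 25.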
Select 1 (RNA-Seq) from the spreadsheet tree
Select mRNA quantification in the Analyze Known Genes section of the RNA-seq workflow
The RNA-Seq Quantification dialog will appear (Figure 1).
Select RefSeq Transcripts 2017-05-02 from the mRNA section of the Specify a database of genomic features to quantify panel of the dialog
Your choices in the Configure the test panel of the dialog depend on the design and aims of your experiment. A detailed description of each option can be viewed by selecting the () icon next to it.
For Strand-specificity: select No
Your choice here depends on the method used for sample preparation. A directional mRNA-seq sample preparation protocol only synthesizes the first strand of cDNA whereas other methods reverse transcribe the mRNA into double-stranded cDNA. If double-stranded cDNA has been synthesized, the sequencer reads sequences from both the forward and reverse strands but does not discriminate between them, eliminating strand information. When strand information is preserved, it is possible for paired-end sequences to come from a combination of the forward and reverse strands. If in doubt, select Auto-detect from the drop-down list. The data for this tutorial did not preserve strand information so we selected No.
For In the gene-level result report intronic reads as compatible with the gene?, select No
Selecting Yes would include intronic reads in the gene-level results, which might be useful for discovering unannotated transcripts for known genes, and also includes introns in the RPKM calculation for the gene-level results.
For Require strict paired-end compatibility select No
Selecting Yes would require that two alignments from the same read must map to the same transcript to be considered compatible. However, the data set used in this tutorial consists of single-end reads so this option is unnecessary.
For Report results with no reads from any sample? select No
Selecting Yes would include all the genes/transcripts/exons in the transcriptome, even if there are no reads for that feature from any sample.
Make sure Report unexplained regions with more than ___ reads is selected and specify 5 as the number of reads
This option will create a spreadsheet that includes all regions with a specified number of reads that map to the genome, but not to any feature included in the selected database of genomic features.
Select Report exon-level results
If selected, spreadsheets will be created describing expression at the exon level.
Your RNA-Seq Quantification dialog should now be configured as shown (Figure 1). Descriptions of the spreadsheets that can be created by mRNA Quantification can be viewed by selecting Describe results to bring up the Quantification Result Help dialog.
Figure 1. Configuring the RNA-Seq Quantification dialog
Select OK to perform the RNA-Seq quantification
Reads will now be assigned to individual transcripts of a gene based on the Expectation/Maximization (E/M) algorithm (Xing, et al. 2006). In Partek Genomics Suite software, the E/M algorithm is modified to accept paired-end reads, junction aligned reads, and multiple aligned reads if these are present in your data. For a detailed description of the E/M algorithm, refer to the RNA-Seq white paper (Help > On-line Tutorials > White Papers). Several spreadsheets containing the analyzed results will be generated. Progress bars in the lower left-hand corner of the RNA-Seq Quantification window and the main window will update as the data is analyzed.
If you have not disabled it, the Quantification Result Help dialog will appear. Select Close
The Analysis tab now shows the spreadsheets created by mRNA Quantification in the spreadsheet tree as child spreadsheets of 1 (RNA-seq) (Figure 2).
Figure 2. Viewing the results of mRNA Quantification
The _reads and _rpkm spreadsheets
Data on features (genes, transcripts, and exons) are presented before and after normalization as _reads and _rpkm spreadsheets. In this tutorial, we have created exon_reads, exon_rpkm, gene_reads, gene_rpkm, transcript_reads, and transcript_rpkm spreadsheets. In these spreadsheets, samples are listed one per row and the normalized counts of the reads mapped to features are in columns (Figure 2).
The transcripts spreadsheet
The transcripts spreadsheet lists a transcript in each row.
Even if you don’t have replicates, it is possible to derive basic information about differential and alternative splicing between your samples from the RNA-Seq_result.transcripts spreadsheet using a simple chi-squared or log-likelihood test, because each sample is represented only once and we can assume a null hypothesis that the transcripts are evenly distributed across all samples. However, when you do have biological replicates, the power of Partek Genomics Suite software resides in its implementation of a mixed-model ANOVA that can handle unbalanced and incomplete datasets, nested designs, numerical and categorical variables, any number of factors, and flexible linear contrasts.
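As a rough illustration of the two tests mentioned above, the sketch below computes the chi-squared and log-likelihood (G) statistics for one transcript's read counts under the null hypothesis of an even distribution across samples. It is a simplified example, not the calculation performed by Partek Genomics Suite: the function names are hypothetical, and it assumes all samples were sequenced to the same depth so that equal expected counts are reasonable.

```python
import math


def chi_squared_even(counts):
    """Chi-squared statistic against equal expected counts in every sample."""
    expected = sum(counts) / len(counts)
    return sum((observed - expected) ** 2 / expected for observed in counts)


def g_statistic_even(counts):
    """Log-likelihood ratio (G) statistic against equal expected counts."""
    expected = sum(counts) / len(counts)
    # Samples with zero reads contribute nothing to the sum.
    return 2.0 * sum(
        observed * math.log(observed / expected)
        for observed in counts
        if observed > 0
    )
```

Both statistics are compared against a chi-squared distribution with one fewer degree of freedom than the number of samples; larger values indicate a stronger departure from an even distribution.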
The unexplained_regions spreadsheet
The contents of this spreadsheet are explained in more detail in a later section of the tutorial - Analyzing the unexplained regions spreadsheet.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008, 5: 621-628.
Xing Y, Yu T, Wu YN, Roy M, Kim J, Lee C: An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs. Nucleic Acids Res 2006, 34: 3150-3160.
Analysis of variance (ANOVA) is a very powerful technique for identifying differentially expressed genes in a multi-factor experiment such as this one. In this data set, ANOVA will be used to generate a list of genes that are significantly different between Down syndrome and normal samples with an absolute difference bigger than 1.3 fold.
The ANOVA model should include Type because it is the primary factor of interest. From the exploratory analysis using the PCA plot, we observed that tissue is a large source of variation; therefore, Tissue should be included in the model. In the experiment, multiple samples were taken from the same subject, so Subject must be included in the model. If Subject were excluded from the model, the ANOVA assumption that samples within groups are independent would be violated. Additionally, the PCA scatter plot showed that the Down syndrome and normal samples separated within tissue type, so the Type*Tissue interaction should be included in the model.
To invoke the ANOVA dialog, select Detect Differentially Expressed Genes in the Analysis section of the Gene Expression workflow
In the Experimental Factor(s) panel, select Type, Tissue and Subject by holding Ctrl and left-clicking each factor
Use the Add Factor > button to move the selections to the ANOVA Factor(s) panel
Select both Type and Tissue by holding Ctrl on your keyboard and left-clicking each factor
Select the Add Interaction > button to add a Type * Tissue interaction to the ANOVA Factor(s) panel (Figure 1)
Do NOT select OK or Apply. We will be adding contrasts to this ANOVA model in an upcoming section of the tutorial.
Figure 1. Configuring ANOVA factors and interactions
Most factors in ANOVA are fixed effects, whose levels in a data set represent all the levels of interest. In this study, Type and Tissue are fixed effects. If the levels of a factor in a data set only represent a random sample of all the levels of interest (for example, Subject), the factor is a random effect. The ten subjects in this study represent only a random sample of the global population about which inferences are being made. Random effects are colored red on the spreadsheet and in the ANOVA dialog. When the ANOVA model includes both random and fixed factors, it is a mixed-model ANOVA.
Another way to determine if a factor is random or fixed is to imagine repeating the experiment. Would the same levels of each factor be used again?
Type – Yes, the same types would be used again - a fixed effect
Tissue – Yes, the same tissues would be used again - a fixed effect
Subject – No, the samples would be taken from other subjects - a random effect
You can specify which factors are random and which are fixed when you import your data or after importing by right-clicking on the column corresponding to a categorical variable, selecting Properties, and checking Random Effect. By doing that, the ANOVA will automatically know which factors to treat as random and which factors to treat as fixed.
The subject factor in the ANOVA model is listed as “5. Subject (3. Type)”, which means that Subject is nested in Type. Partek Genomics Suite can automatically detect this sort of hierarchical design and will adjust the ANOVA calculation accordingly.
By default, an ANOVA only outputs a p-value for each factor/interaction. To get the fold change and ratio between Down syndrome and normal samples, a contrast must be set up.
Select Contrasts… to invoke the Configure dialog
Choose 3. Type from the Select Factor/Interaction drop-down list. The levels in this factor are listed on the Candidate Level(s) panel on the left side of the dialog
Left click to select Down Syndrome from the Candidate Level(s) panel and move it to the Group 1 panel (renamed Down Syndrome) by selecting Add Contrast Level > in the top half of the dialog.
Label 1 will be changed to the subgroup name automatically, but you can also manually specify the label name.
Select Normal from the Candidate Level(s) panel and move it to the Group 2 panel (renamed Normal)
The Add Contrast button can now be selected (Figure 2)
Figure 2. Adding a contrast of Down Syndrome and Normal samples
Because the data is log2 transformed, Partek Genomics Suite will detect this and automatically select Yes for Data is already log transformed? in the top right-hand corner of the dialog. Partek Genomics Suite will use the geometric mean of the samples in each group to calculate the fold change and mean ratio for the contrast between the Down syndrome and normal samples.
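The link between log2 data, geometric means, and fold change can be sketched as follows. This is a simplified illustration using plain group means rather than the least-squares means from the ANOVA model; the function name and intensities are hypothetical, and reporting ratios below 1 as negative reciprocals is an assumption about the fold-change convention.

```python
from statistics import mean


def contrast_from_log2(group1_log2, group2_log2):
    """Return (ratio, fold_change) for group 1 vs. group 2 from log2 intensities."""
    # The difference of log2 means equals log2 of the ratio of geometric means.
    log2_ratio = mean(group1_log2) - mean(group2_log2)
    ratio = 2.0 ** log2_ratio
    # Assumed convention: ratios below 1 are reported as negative reciprocals.
    fold_change = ratio if ratio >= 1.0 else -1.0 / ratio
    return ratio, fold_change


# Hypothetical log2 intensities for one probe set.
ratio, fold_change = contrast_from_log2([7.0, 7.0, 7.0], [6.0, 6.0, 6.0])
print(ratio, fold_change)
```

Swapping the two groups in this example gives a ratio of 0.5, which under the assumed convention corresponds to a fold change of -2.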
Select Add Contrast to add the Down Syndrome vs. Normal contrast
Select OK to apply the configuration
If successfully added, the Contrasts… button will now read Contrasts Included (Figure 3)
Figure 3. ANOVA configuration with contrasts included
By default, Specify Output File is checked and gives a name to the output file. If you are trying to determine which factors should be included in the model and you do not wish to save the output file, simply uncheck this box
Select OK in the ANOVA dialog to compute the 3-way mixed-model ANOVA
Several progress messages will display in the lower left-hand side of the ANOVA dialog while the results are being calculated.
The result will be displayed in a child spreadsheet, ANOVA-3way (ANOVAResults). In this spreadsheet, each row represents a probe set and the columns represent the computation results for that probe set (Figure 4). Although not synonymous, probe set and gene will be treated as synonyms in this tutorial for convenience. By default, the genes are sorted in ascending order by the p-value of the first categorical factor. In this tutorial, Type is the first categorical factor, which means the most significantly differentially expressed gene between Down syndrome and normal samples is at the top of the spreadsheet in row 1.
Figure 4. ANOVA spreadsheet
For additional information about ANOVA in Partek Genomics Suite, see Chapter 11 Inferential Statistics in the User’s Manual (Help > User’s Manual).
Deciding which factors to include in the ANOVA may be an iterative process while you decide which factors and interactions are relevant as not all factors have to be included in the model. For example, in this tutorial, Gender and Scan date were not included. The Sources of Variation plot is a way to quantify the relative contribution of each factor in the model towards explaining the variability of the data.
Select View Sources of Variation from the Analysis section of the Gene Expression workflow with the ANOVA result spreadsheet active
A Sources of Variation tab will appear (Figure 5) with a bar chart showing the signal-to-noise ratio for each factor across the whole genome. Sources of variation can also be viewed as a pie chart showing sum of squares by selecting the Pie Chart (Sum of Squares) tab in the upper left-hand side of the Sources of Variation tab.
Figure 5. Sources of Variation tab showing a bar chart
This plot presents the mean signal-to-noise ratio of all the genes on the microarray. All the non-random factors in the ANOVA model are listed on the X-axis (including error). The Y-axis represents the mean of the ratios of mean square of all the genes to the mean square error of all the genes. Mean square is ANOVA’s measure of variance. Compare the bar for each factor to the bar for error; if a factor's bar is higher than error's bar, that factor contributed significant variation to the data across all the variables. Notice that this plot is very consistent with the results in the PCA scatter plot. In this data, on average, Tissue is the largest source of variation.
To view the source of variation for each individual gene, right click on a row header in the ANOVA-3way (ANOVAResults) spreadsheet and select Sources of Variation from the pop-up menu. This generates a Sources of Variation tab for the individual gene. View a few Sources of Variation plots from rows at the top of the ANOVA table and a few from the bottom of the table.
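To make the signal-to-noise idea concrete, here is a minimal sketch that computes an F ratio (factor mean square divided by error mean square) per gene and averages it across genes. It uses a single-factor ANOVA for simplicity, not the multi-factor mixed model fitted in this tutorial; the function name, data values, and group labels are hypothetical.

```python
from statistics import mean


def one_way_f_ratio(values, labels):
    """F ratio (factor mean square / error mean square) for one gene."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = mean(values)
    # Between-group and within-group sums of squares.
    ss_factor = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ss_error = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    # Mean square = sum of squares divided by its degrees of freedom.
    ms_factor = ss_factor / (len(groups) - 1)
    ms_error = ss_error / (len(values) - len(groups))
    return ms_factor / ms_error


# Hypothetical log2 intensities: rows are genes, columns are samples.
expression = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 5.5, 5.0, 5.5],
]
tissue = ["A", "A", "B", "B"]
f_ratios = [one_way_f_ratio(row, tissue) for row in expression]
mean_f_ratio = mean(f_ratios)
print(f_ratios, mean_f_ratio)
```

Under this definition the error term always has a ratio of 1, so a factor whose mean F ratio is well above 1 explains more variation than random noise does.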
Another useful graph is an ANOVA Interaction Plot.
Right-click on a row header in the ANOVA spreadsheet (Figure 6)
Select ANOVA Interaction Plot to generate an Interaction Plot tab for that individual gene
Figure 6. Calling an ANOVA Interaction Plot for a gene
Generate these plots for rows 3 (DSCR3) and 8 (CSTB). If the lines in the interaction plot are not parallel, then there is a chance that there is an interaction between Tissue and Type. Error bars show the standard error of the least-squares mean. DSCR3 is a good example of this (Figure 7). We can look at the p-values in column 9, p-value(Type * Tissue), to check if this apparent interaction is statistically significant.
Figure 7. Interaction Plot for DSCR3
We can view the expression levels of a gene for each sample using a dot plot.
Right click on the gene row header and select Dot Plot (Orig. Data) from the pop-up menu. This generates a Dot Plot tab for the selected gene (Figure 8)
Figure 8. Dot Plot showing DSCR3 expression levels for each sample
In the plot, each dot is a sample of the original data. The Y-axis represents the log2 normalized intensity of the gene and the X-axis represents the different types of samples. The median expression of each group is different from each other in this example. The median of the Down syndrome samples is ~6.3, but the median of the normal samples is ~6.0. The line inside the Box & Whiskers represents the median of the samples in a group. Placing the mouse cursor over a Box & Whiskers plot will show its median and range.
The New Track button allows new tracks to be added to the viewer, while the Remove Track button removes the selected track from the viewer. Tracks can be reordered by selecting a track in the Tracks panel and dragging it up or down to move it in the list. In the Chromosome View, use the mode icons to switch between selection mode and navigation mode. In navigation mode, left-click and draw a box on any track to zoom in. All tracks are synced and will zoom together. Zooming can also be controlled using the interface in the lower right-hand corner of the tab. The view can be reset to the whole chromosome level using reset zoom. Searching for a gene or transcript in the position box will also zoom directly to its location.
To save changes to the spreadsheet, select the Save Active Spreadsheet icon. Spreadsheets with unsaved changes have an asterisk next to their name in the spreadsheet tree.
You can practice creating new gene list criteria of your own to become familiar with the List Manager tool. For more information, you can always click on the () buttons.
Select Rendering Properties
While pressing the mouse wheel down, drag the mouse to rotate the plot, or select the Rotate Mode icon on the left side of the Scatter Plot tab. With Rotate Mode selected, press the left mouse button and drag to rotate the plot. Rotating the plot allows you to examine the grouping pattern or outliers of the data on the first three principal components (PCs).
To zoom in and out, scroll the mouse wheel up or down while the cursor is on the PCA plot, or select the Zoom Mode icon on the left side of the Scatter Plot tab.
Selecting the Reset icon on the left side of the Scatter Plot tab will return the PCA plot to its original orientation and zoom.
In the Scatter Plot tab, select the Rendering Properties icon and configure the plot as shown (Figure 2)
The _reads and _rpkm spreadsheets can be used for data analysis. Sample grouping can be visualized using PCA. Select View > Scatter Plot from the toolbar or use the corresponding icon on the quick action bar to create a PCA plot from the selected spreadsheet. See Exploring gene expression data for an example of using PCA plots for data analysis or consult Chapter 7 of the Partek User's Manual for a detailed introduction to PCA. With replicates in a sample group, you would also be able to use the _rpkm spreadsheet to perform differential expression analysis using ANOVA.
This tutorial will illustrate:
Note: the workflow described below is enabled in Partek Genomics Suite version 7.0 software. Please fill out the form on our support page to request this version or use the Help > Check for Updates command to check whether you have the latest released version. The screenshots shown within this tutorial may vary across platforms and across different versions of Partek Genomics Suite.
The data for this tutorial is taken from an experiment that examined the effects of four treatment conditions at two time points on estrogen receptor-positive breast cancer cell lines in vitro. Each treatment/time combination has two replicates and there are two control samples for a total of eighteen samples. Gene expression analysis was performed using the Affymetrix GeneChip® Human U95A array. Values are transformed to log base 2 scale by f(x) = log2(x+1).
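The transformation f(x) = log2(x+1) and its inverse can be written in a few lines. The function names are hypothetical; the sketch only illustrates why the +1 offset is convenient, since an intensity of 0 maps to 0 instead of being undefined.

```python
import math


def to_log2(intensity):
    """Apply f(x) = log2(x + 1) to a raw intensity value."""
    return math.log2(intensity + 1)


def from_log2(log_value):
    """Invert f(x) = log2(x + 1) to recover the raw intensity."""
    return 2.0 ** log_value - 1


print(to_log2(0), to_log2(1023), from_log2(10.0))
```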
The gene list in spreadsheet Down_Syndrome_vs_Normal (A) can be used for hierarchical clustering to visualize patterns in the data.
Under the Visualization section in the Gene Expression workflow, select Cluster Based on Significant Genes
The Cluster Significant Genes dialog asks you to specify the type of clustering you want to perform.
Choose Hierarchical Clustering and select OK
Choose the Down_Syndrome_vs_Normal (A) spreadsheet under the Spreadsheet with differentially expressed genes
Choose the Standardize – shift genes to mean of zero and scale to standard deviation of one under the Expression normalization panel (Figure 1)
This option will adjust all the gene intensities such that the mean is zero and the standard deviation is 1.
Figure 1. Configuring Hierarchical Clustering
Select OK to generate a Hierarchical Clustering tab (Figure 2)
Figure 2. Hierarchical Clustering of Down_Syndrome_vs_Normal (A)
The graph (Figure 2) illustrates the standardized gene expression level of each gene in each sample. Each gene is represented in one column, and each sample is represented in one row. Genes with no difference in expression have a value of zero and are colored black. Genes with increased expression in Down syndrome samples have positive values and are colored red. Genes with reduced expression in Down syndrome samples have negative values and are colored green. Down syndrome samples are colored red and normal samples are colored orange. On the left-hand side of the graph, we can see that the Down syndrome samples cluster together.
For more information on the methods used for clustering, you can refer to Chapter 8: Hierarchical & Partitioning Clustering in Help > User’s Manual. For a tutorial on configuring the clustering plot, please refer to Hierarchical Clustering Analysis
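The Standardize option described above (shift each gene to a mean of zero and scale to a standard deviation of one) can be sketched as follows. The function name and values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption; the software may use a different denominator.

```python
from statistics import mean, stdev


def standardize(values):
    """Shift one gene's values to mean 0 and scale to standard deviation 1."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (assumption)
    return [(value - center) / spread for value in values]


# Hypothetical log2 intensities of one gene across three samples.
standardized = standardize([6.0, 8.0, 10.0])
print(standardized)
```

After standardization, a value of zero means the sample sits at the gene's mean expression, which is why unchanged genes appear black in the heat map. A gene with identical values in every sample has a standard deviation of zero and would need to be excluded before calling this function.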
While many types of data sets are automatically linked with appropriate annotation files upon import, if this does not occur, a spreadsheet can be manually linked with an annotation file.
Right-click Breast_Cancer.txt in the spreadsheet tree
Select Properties (Figure 1)
Figure 1. Selecting file properties for a spreadsheet
Configure the Configure Genomic Properties dialog as shown (Figure 2) with the following steps:
Select Gene Expression from the Choose the type of genomic data drop-down menu
Select Feature in column label
Select Browse...
Select HG_U95Av2.na36.annot.csv from the microarray libraries folder
Select Set Column
Select Gene Symbol from the Choose column containing gene symbol/microRNA name dialog
Select Homo sapiens and hg19 from the Species and Genome Build drop-down menus
Figure 2. Configure the genomic properties dialog as shown
There is now an * after the spreadsheet name in the spreadsheet tree. This indicates an unsaved change has been made to the spreadsheet.
Principal Components Analysis (PCA) is an excellent method to visualize similarities and differences between the samples in a data set. PCA can be invoked through a workflow, by selecting its icon from the main command bar, or by selecting Scatter Plot from the View section of the main toolbar. We will use a workflow.
Select Gene Expression from the Workflows drop-down menu
Select PCA Scatter Plot from the QA/QC section of the Gene Expression workflow
The PCA scatter plot will open as a new tab (Figure 1).
Figure 1. Viewing the PCA scatter plot. Each point is a sample. Samples are colored by treatment.
In this PCA scatter plot, each point represents a sample in the spreadsheet. Points that are close together in the plot are more similar, while points that are far apart in the plot are more dissimilar.
To better view the data, we can rotate the plot.
Click and drag to rotate the plot
Rotating the plot allows us to look for outliers in the data on each of the three principal components (PC1-3). The percentage of the total variation explained by each PC is listed by its axis label. The chart label shows the sum percentage of the total variation explained by the displayed PCs.
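The percentage of variation explained by each PC can be illustrated with a small NumPy sketch. This is not the software's implementation: it assumes PCA on mean-centered data (covariance-based, no scaling), and the function name and random data are hypothetical stand-ins for an eighteen-sample spreadsheet.

```python
import numpy as np


def pca_scores(data, n_components=3):
    """Return PC scores and percent variance explained (rows = samples, columns = genes)."""
    # Center each gene so the PCs describe variation around the mean sample.
    centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    # Squared singular values are proportional to the variance along each PC.
    explained = s ** 2 / np.sum(s ** 2)
    scores = u[:, :n_components] * s[:n_components]
    return scores, explained[:n_components] * 100.0


# Hypothetical data: 18 samples by 50 genes of random values.
rng = np.random.default_rng(0)
data = rng.normal(size=(18, 50))
scores, percent = pca_scores(data)
print(scores.shape, percent, percent.sum())
```

The sum of the returned percentages corresponds to the chart label value: the share of total variation captured by the three displayed PCs.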
We can change the plot properties to better visualize the effects of different variables.
Set Shape to 4. Batch
Set Size to 3. Time
Set Connect to 5. Treatment Combination
Select OK (Figure 2)
Figure 2. Configuring plot properties to color by treatment, shape by batch, size by time, and connect by treatment combination
The PCA scatter plot now shows information about treatment, batch, and time for each sample (Figure 3).
Figure 3. PCA scatter plot showing treatment, batch, and time information for each sample. A batch effect is clearly visible.
PCA is particularly useful for identifying outliers and batch effects in data sets. We can see a batch effect in this data set as samples separate by batch. To make this clearer, we can add ellipses by Batch.
Select Ellipsoids from the tab
Select Add Ellipse/Ellipsoid
Select Ellipse
Select Batch from the Categorical Variable(s) panel and move it to the Group Variable(s) panel
Select OK
Select OK to close the dialog
The ellipses help illustrate that the data is separated by batch (Figure 4).
Figure 4. Ellipses around batch groups show that samples separate by batch
Ways to address the batch effect in the data set will be detailed later in this tutorial.
The List Manager can be used to generate lists of genes by applying criteria such as fold change and false discovery rate (FDR) adjusted p-value thresholds.
Select the Analysis tab
Select ANOVAResults in the spreadsheet tree
Select Create Gene List from the Analysis section of the Gene Expression workflow (Figure 1)
Figure 1. Selecting Create Gene List from the Gene Expression workflow
Select E2 vs. Control from the Contrast panel of the ANOVA Streamlined tab in the List Manager dialog
Deselect the Include size of the change option
Set p-value with FDR < to 0.1 (Figure 2)
Figure 2. Configuring the List Manager using the ANOVA Streamlined filtering options
There should be ~545 probe(sets)/genes that meet this threshold.
Select Create
A new spreadsheet, E2 vs. Control, will be added as a child spreadsheet of Breast_Cancer.txt.
Repeat the steps listed above to create lists for E2+ICI vs. Control (~24 genes), E2+Ral vs. Control (~22 genes), and E2+TOT vs. Control (~177 genes) with the same threshold
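The FDR threshold used in the steps above can be illustrated with the Benjamini-Hochberg step-up procedure. That the List Manager uses exactly this procedure is an assumption; the function name and p-values below are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return step-up (Benjamini-Hochberg) adjusted p-values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical contrast p-values for four probe sets.
p_values = [0.001, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(p_values)
# Keep probe sets whose FDR-adjusted p-value is below 0.1.
selected = [i for i, q in enumerate(adjusted) if q < 0.1]
print(adjusted, selected)
```

Filtering on the adjusted values rather than the raw p-values controls the expected proportion of false positives in the resulting gene list.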
Now we can use the Venn Diagram to create a list of genes that are differentially regulated in all treatment groups.
Select the Venn Diagram tab in the List Manager dialog
The Venn Diagram shows overlap between selected gene lists.
Select the four created lists (E-H) in the spreadsheet list in the List Manager dialog by selecting each while holding the Ctrl key on your keyboard
The Venn Diagram will display the number of overlapping and distinct genes from the four lists (Figure 3).
Figure 3. Viewing the Venn Diagram with intersections of four lists of significant genes
The intersection of the four ellipses shows that 14 differentially regulated genes are common to all four treatment schemes.
Select the region intersecting all four ellipses
Right-click the intersected region
Select Create List From Highlighted Regions
Select Close to exit the List Manager dialog
The new list will appear in the spreadsheet tree with a temporary file name (ptpm).
Select the temporary list in the spreadsheet tree
Save the list as fourtreatments
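The list created from the central Venn region is simply the set intersection of the four gene lists. The sketch below shows the idea with placeholder gene identifiers; the contents of each set are hypothetical and do not reflect the tutorial's actual results.

```python
# Hypothetical gene lists keyed by contrast name; identifiers are placeholders.
gene_lists = {
    "E2 vs. Control": {"GENE_A", "GENE_B", "GENE_C"},
    "E2+ICI vs. Control": {"GENE_A", "GENE_C"},
    "E2+Ral vs. Control": {"GENE_A", "GENE_D"},
    "E2+TOT vs. Control": {"GENE_A", "GENE_B"},
}

# Genes present in every list correspond to the region where all ellipses overlap.
four_treatments = set.intersection(*gene_lists.values())
print(sorted(four_treatments))
```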
Gene lists can be visualized and their ability to distinguish samples evaluated using a hierarchical clustering heat map. Because of the batch effect in this data set, we will perform hierarchical clustering using batch-corrected intensity values. To do this, we need to open the fourtreatments list of differentially expressed genes as a child spreadsheet of the batch-remove spreadsheet
Select fourtreatments from the spreadsheet tree
Select the close icon to close the spreadsheet
Select 1-removeresult (batch-remove) from the spreadsheet tree
Select File from the main tool bar
Select Open as child...
Select fourtreatments using the file browser
The fourtreatments spreadsheet will open as a child spreadsheet of batch-remove (Figure 1).
Figure 1. The fourtreatments spreadsheet is open as a child spreadsheet of batch-remove. Visualizations performed using fourtreatments will pull intensity values from batch-remove.
Visualizations performed using the fourtreatments spreadsheet will now use intensity values from the batch-remove spreadsheet.
To invoke hierarchical clustering, follow the steps below.
Select Cluster Based on Significant Genes from the Visualization section of the Gene Expression workflow
Select Hierarchical Clustering
Select OK
Select 1-removeresult/1 (fourtreatments) from the drop-down menu
Select Standardize for Expression normalization (Figure 2)
Figure 2. Configuring the Cluster the significant genes dialog
Select OK
The hierarchical clustering heat map will open in a new tab (Figure 3).
Figure 3. Hierarchical clustering of genes with significantly different expression across the treatment groups
For detailed information about the methods used for clustering, refer to the Partek Manual Chapter 8: Hierarchical & Partitioning Clustering.
By including Batch in the ANOVA model, the variability due to the batch effect is accounted for when calculating p-values for the non-random factors. In this sense, the batch effect has already been removed. However, visualizing biological effects can be very difficult if batch effects are present in the original intensity data used to generate visualizations. We can modify the original intensity data to remove the batch effect using the Remove Batch Effect tool.
The Remove Batch Effect tool functions much like ANOVA in reverse: it calculates the variation attributed to the factor being removed, then adjusts the original intensity values to remove that variation. Once the variation caused by the batch effect has been removed, tools like PCA or clustering can be used to visualize what the data would look like if the batch effect were not present.
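The idea of estimating a batch effect and subtracting it can be sketched for a single gene. This simplified version shifts each batch so its mean matches the grand mean, which matches an additive model only when treatments are balanced across batches; the actual tool fits the full ANOVA model. The function name and values are hypothetical.

```python
from statistics import mean


def remove_batch_effect(values, batches):
    """Shift each batch so its mean matches the grand mean (one gene)."""
    grand_mean = mean(values)
    batch_means = {
        batch: mean([v for v, b in zip(values, batches) if b == batch])
        for batch in set(batches)
    }
    # Subtract each batch's estimated offset from its samples.
    return [v - (batch_means[b] - grand_mean) for v, b in zip(values, batches)]


# Hypothetical log2 intensities: Control and E2 samples measured in batches A and B.
values = [5.0, 7.0, 6.0, 8.0]
batches = ["A", "A", "B", "B"]
corrected = remove_batch_effect(values, batches)
print(corrected)
```

In this example the one-unit offset between batches disappears while the two-unit difference between the treatment groups within each batch is preserved.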
Select the 1 (Breast_Cancer.txt) spreadsheet
Select Stat from the main tool bar
Select Remove Batch Effect... (Figure 1)
Figure 1. Invoking the Remove Batch Effect tool
The Remove Batch Effects dialog will open. The tool functions by performing an ANOVA, then modifying the original intensity values to remove the effects of the specified factor(s).
Select Treatment, Time, and Batch
Select Add Factor > to add them to the ANOVA Factor(s) panel
Select Batch in the ANOVA Factor(s) panel
Select Add Factor > to add Batch to the Remove Effect(s) of These Factor(s) panel
By default, the results will be displayed in a new spreadsheet. Options to overwrite the current spreadsheet and specify the output file appear in the bottom of the dialog (Figure 2).
Figure 2. Configuring the Remove Batch Effects tool to remove Batch and create a new spreadsheet
Select OK
The new spreadsheet, 1-removeresult (batch-remove) will open in the Analysis tab (Figure 3).
Figure 3. Viewing the new spreadsheet with batch effects removed
We can visualize the effects of removing the batch effects using PCA.
Select 1 (Breast_Cancer.txt) from the spreadsheet tree
Set Drawing Mode to Mixed
Select the Ellipsoids tab
Select Add Centroid
Add Batch to the Grouping Variable(s) panel
Set the colors of the two centroids as shown (Figure 4) to pink and yellow
Figure 4. Adding a centroid for Batch
Select OK to close the Add Centroid... dialog
Select OK to close the Configure Plot Properties dialog
The two centroids are distinct, showing the batch effect (Figure 5).
Figure 5. Viewing a batch effect using PCA. The batches are shown as the pink (A) and yellow (B) centroids. The clear separation of the centroids indicates a batch effect
Repeat the above steps for 1-removeresult (batch-remove)
For 1-removeresult (batch-remove), the centroids of the two batches overlap, showing that the batch effect has been removed (Figure 6).
Figure 6. Overlapping centroids for batches A and B show that the batch effect has been removed.
Visualization of ANOVA results for single probe(sets)/genes also benefits from batch removal. To illustrate this, we first need to repeat our ANOVA using the new batch-remove intensity values spreadsheet.
Select the Analysis tab
Select 1-removeresult (batch-remove) in the spreadsheet tree
Select Stat from the main toolbar
Select ANOVA...
Add Treatment, Time, and Batch factors to the ANOVA Factor(s) panel
Add Treatment * Time interaction to the ANOVA Factor(s) panel
Select Contrasts...
Select Treatment from the Select Factor/Interaction drop-down menu
Select Yes for Data is already log transformed?
Set up contrasts of treatment vs. control for E2, E2+ICI, E2+Ral, and E2+TOT (Figure 7)
Figure 7. Configuring ANOVA to compare treatment groups to control
Select OK to add contrasts
Change output file name to ANOVAResults_batch-remove
Select OK to perform the ANOVA
The ANOVAResults_batch-remove spreadsheet will open in the Analysis tab.
Select the ANOVAResults spreadsheet
Right-click on the row header for row 2, TFF1
Select Dot Plot (Orig. Data) (Figure 8)
Figure 8. Invoking a dot plot from the ANOVAResults spreadsheet
A dot plot for trefoil factor 1 (TFF1) will open (Figure 9). The dot plot shows gene intensity values (y-axis) for each sample. Samples are grouped by Treatment.
Figure 9. Viewing the dot plot for trefoil factor 1 (TFF1) across different treatment groups
To visualize the batch effect we will make a few changes to the plot.
Select H/V to switch the horizontal and vertical axis
Set Color to Batch
Set Size to Time
Set Connect to Treatment Combination (Figure 10)
Figure 10. Configuring the dot plot (part 1 of 2)
Select the Labels tab
Select Column for In Point Labels
Select Time from the Column drop-down list (Figure 11)
Figure 11. Configuring the dot plot (part 2 of 2)
Select OK
The dot plot now clearly shows the batch effect (Figure 12). Samples within treatment groups are separated clearly between the two batches shown in blue and red.
Figure 12. Viewing a dot plot showing a batch effect. Each dot is a sample. The y-axis is treatment combinations; the x-axis is the expression value of the TFF1 gene. Dots are colored by batch, sized by time, connected by treatment combination, and labeled by time.
To view the effects of batch removal, we can view this dot plot for the ANOVAResults_batch-remove spreadsheet.
Select the Analysis tab
Select ANOVA-3way (ANOVAResults_batch-remove) from the spreadsheet tree
Repeat the steps shown above to create the dot plot for trefoil factor 1
The dot plot invoked from the ANOVAResults_batch-remove spreadsheet shows that the batch effect has been removed, as the samples no longer clearly separate by color within treatment groups (Figure 13).
Figure 13. Viewing the dot plot that shows batch effect removal. The plot configuration matches Figure 12.
Illumina’s MethylationEPIC array interrogates the methylation status of over 850,000 cytosines in the human genome. Because the MethylationEPIC array is closely related to the Infinium HumanMethylation450 BeadChip, the steps presented in this document can be applied to either platform.
This tutorial illustrates how to:
Note: the workflow described below is enabled in Partek Genomics Suite version 7.0 software. Please fill out the form on our support page to request this version or use the Help > Check for Updates command to check whether you have the latest released version. The screenshots shown within this tutorial may vary across platforms and across different versions of Partek Genomics Suite.
The data set accompanying this document consists of sixteen human samples processed by Illumina MethylationEPIC arrays. The data set is taken from a study of DNA methylation in human B cells and B cells infected with Epstein-Barr virus (EBV).
Infecting B cells with EBV in vitro transforms them, making them capable of indefinite growth in vitro. These immortalized cell lines are referred to as lymphoblastoid cell lines (LCLs). LCLs behave similarly to activated B cells, making them useful for expanding T cells in vitro. Because EBV is a carcinogen and immortalized cell growth is a hallmark of cancer, examining the effects of EBV transformation on B cell DNA methylation might shed light on the roles of DNA methylation in tumor development.
Analysis of variance (ANOVA) is a very powerful technique for identifying differentially expressed genes in a multi-factor experiment. In this data set, ANOVA will be used to generate a list of genes that are significantly differentially regulated by each treatment.
When setting up the ANOVA, the primary factors of interest, Treatment and Time, should be included. We will also include the interaction between Treatment and Time, Treatment * Time, because we are interested in whether different treatments behave differently over time. From our exploratory analysis using PCA, we also know that Batch is a major source of variation and needs to be included. Including Batch as a random factor will allow us to account for the batch effect.
Select Detect differentially expressed genes from the Analysis section of the Gene Expression workflow
Select Treatment, Time, and Batch in the Experimental Factor(s) panel
Select Add Factor > to move the selections to the ANOVA Factor(s) panel
Select both Treatment and Time in the Experimental Factor(s) panel by holding Ctrl on the keyboard while selecting each
Select Add Interaction > to add the Treatment * Time interaction to the ANOVA Factor(s) panel (Figure 1)
Do not select OK or Apply. We still need to add linear contrasts to the ANOVA model
Figure 1. Adding factors and interactions to the ANOVA
ANOVA will output a p-value and F ratio for each factor or interaction; to get the fold-change and ratio between the different levels of a factor or interaction, linear contrasts, or comparisons, must be added.
Select Contrasts... in the ANOVA dialog (Figure 1)
Select Yes for Data is already log transformed?
Select Treatment * Time from the Select Factor/Interaction drop-down menu
We will add contrasts comparing each of the four treatment groups to the control group.
Select E2 * 8 and E2 * 48 from the Candidate Level(s) panel
Select Add Contrast Level > to move them to the top panel (Group 1) on the right-hand side
The Group 1 panel will be renamed after the contents of the panel. We can specify a name for the group.
Set Label of the top panel to E2
Select Control * 0 from the Candidate Level(s) panel
Select Add Contrast Level > to move it to the bottom panel (Group 2) on the right-hand side
Set Label of the bottom panel to Control
The lower panel (Group 2) is considered the reference level. Because the data is log2 transformed, the geometric mean will be used to calculate the fold change and mean ratio to place both on a linear scale instead of a log scale.
Select Add Contrast (Figure 2)
Figure 2. Adding a contrast between E2 vs. Control at all time points.
To examine the time points of each treatment condition separately, we can select Add Combinations instead of Add Contrast. This adds every possible contrast for the levels in the Group 1 and Group 2 panels.
Select E2 * 8 and E2 * 48 from the Candidate Level(s) panel
Select Add Contrast Level > to move them to the top panel (Group 1) on the right-hand side
Select Control * 0 from the Candidate Level(s) panel
Select Add Contrast Level > to move it to the bottom panel (Group 2) on the right-hand side
Select Add Combinations to add contrasts for E2 * 8 vs. Control * 0 and E2 * 48 vs. Control * 0 (Figure 3)
Figure 3. Add Combinations creates contrasts for every combination of levels from the two group panels.
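The behavior of Add Combinations described above, one contrast for every pairing of a Group 1 level with a Group 2 level, is a Cartesian product. The sketch below reproduces the two contrast names from this step; the variable names are hypothetical.

```python
from itertools import product

# Levels placed in the two group panels in the steps above.
group1 = ["E2 * 8", "E2 * 48"]
group2 = ["Control * 0"]

# One contrast per (Group 1 level, Group 2 level) pair.
contrasts = [f"{top} vs. {bottom}" for top, bottom in product(group1, group2)]
print(contrasts)
```

With two levels in Group 1 and one in Group 2 this yields two contrasts; adding a second level to Group 2 would yield four.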
For this tutorial, we will not be considering the time points of each treatment condition individually. We can remove the E2 * 8 vs. Control * 0 and E2 * 48 vs. Control * 0 contrasts.
Select E2 * 8 vs. Control * 0 and E2 * 48 vs. Control * 0 from the contrasts list
Select Delete
We will now add contrasts for the other treatment conditions.
Add contrasts for E2+ICI vs. Control, E2+Ral vs. Control, and E2+TOT vs. Control following the steps outlined for E2 vs. Control
There should now be four contrasts added to the contrasts panel (Figure 4).
Figure 4. Fully configured contrasts for the tutorial
Select OK to add the contrasts to the ANOVA model
The Contrasts... button should now read Contrasts Included in the ANOVA dialog.
Select OK to perform the ANOVA
The result of the 3-way mixed model ANOVA is displayed in a new spreadsheet, ANOVA-3way (ANOVAResults), which is a child of the Breast_Cancer.txt spreadsheet. In ANOVAResults, each row represents a probe(set)/gene, and the columns contain the results of the ANOVA (Figure 5).
Figure 5. Viewing the ANOVA Results spreadsheet. Probe(sets)/genes are on rows and the ANOVA results are on columns.
By default, the rows are sorted in ascending order by the p-value of the first factor, which places the gene most significantly differentially expressed between treatments at the top of the spreadsheet.
Each factor in the ANOVA adds p-value, F value, and SS value columns. F value is a ratio of signal to noise; high values indicate that the probe(set)/gene explains variation in the data set due to the factor. SS value is the sum of squares.
Each contrast in the ANOVA adds p-value, ratio, and fold-change columns. The p-value is calculated using log space. The ratio and fold change are calculated using linear space.
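The relationship between log-space means and linear-space ratios can be sketched as follows. This is a minimal illustration, not Partek's implementation: the function names are hypothetical, and the signed fold-change convention (ratios below 1 reported as negative reciprocals) is an assumption.

```python
from statistics import mean


def contrast_ratio(log2_group1, log2_group2):
    """Ratio of geometric means on the linear scale.

    Averaging log2 values and then back-transforming is equivalent to
    taking the geometric mean of the linear-scale values.
    """
    difference = mean(log2_group1) - mean(log2_group2)
    return 2.0 ** difference


def fold_change(ratio):
    """Signed fold change (assumed convention): ratios below 1 become -1/ratio."""
    return ratio if ratio >= 1.0 else -1.0 / ratio


# Hypothetical log2 intensities for one probe set
e2 = [3.0, 5.0]
control = [2.0, 2.0]

ratio = contrast_ratio(e2, control)
print(ratio, fold_change(ratio))
```

Under this convention, a ratio of 0.25 would be reported as a fold change of -4, which keeps up- and down-regulation symmetric around 1 and -1.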
Sources of variation captured in the ANOVA can be viewed for the entire data set or for individual probe(sets)/genes.
Select View Sources of Variation from the Analysis section of the Gene Expression workflow
The Sources of Variation plot will open in a new tab (Figure 6).
Figure 6. Viewing the sources of variation plot. Non-random factors are included when ANOVA is run using the default REML model.
This plot presents the signal-to-noise ratio across all probe(sets)/genes for each of the non-random factors and interactions in the ANOVA model. The y-axis represents the average mean square or F ratio, the ANOVA measure of variance, for all the probesets. Each bar is a factor, and random error is also included. If a factor has a greater mean F ratio than Error, the factor contributed significant variation to the data set.
Note that Batch is not included as a factor. This is because Batch is a random factor and is accounted for by the ANOVA model.
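To make the idea of an F ratio as signal to noise concrete, here is a minimal sketch for a single probe set and a single fixed factor. It is a plain one-way ANOVA, not the REML mixed model used in the tutorial, and the function name and example values are hypothetical.

```python
from statistics import mean


def one_way_f(groups):
    """F ratio = mean square between groups / mean square within groups."""
    all_values = [v for group in groups for v in group]
    grand_mean = mean(all_values)

    # Signal: variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Noise: variation of the values around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical log2 intensities for one probe set in two treatment groups
print(one_way_f([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))
```

Averaging such F ratios over all probe sets for a given factor gives the kind of bar height shown in the sources of variation plot.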
The sources of variation for each probe(set)/gene can be viewed individually.
Right-click on a row header in the ANOVAResults spreadsheet
Select Sources of Variation from the pop-up menu
Gene Ontology (GO) enrichment analysis compares a gene list to lists of genes associated with biological processes, cellular compartments, and molecular functions to provide biological insights. Once a list of genes has been created, it is possible to see which GO terms the genes are associated with and whether any GO terms are significantly enriched in the gene list.
Select the E2 vs. Control spreadsheet from the spreadsheet tree
Select Gene Set Analysis from the Biological Interpretation section of the Gene Expression workflow
Select Next > to continue with GO Enrichment
Select Next > to continue with 1/E2_vs_Control (E2 vs. Control)
Select Next > to continue with default parameter settings
Select Next > to continue with the default mapping file
A new spreadsheet 1 (GO-Enrichment.txt) will open as a child spreadsheet of E2 vs. Control (Figure 1).
Figure 1. GO Enrichment results spreadsheet
GO terms are shown in rows and are sorted by ascending enrichment p-value.
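One common way to compute an enrichment p-value is a one-sided hypergeometric test. The sketch below uses that approach as an assumption; it is not confirmed here to be the exact test Partek Genomics Suite applies, and the GO terms and counts are made up for illustration.

```python
from math import comb


def enrichment_p_value(overlap, list_size, term_size, background_size):
    """P(X >= overlap) when drawing list_size genes from the background.

    term_size genes in the background are annotated with the GO term.
    """
    upper = min(list_size, term_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(term_size, i) * comb(background_size - term_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical terms: (name, genes in list with term, genes in background with term)
terms = [("term A", 5, 5), ("term B", 1, 5)]
results = [(name, enrichment_p_value(k, 5, size, 10)) for name, k, size in terms]

# Sort by ascending enrichment p-value, as in the results spreadsheet
for name, p in sorted(results, key=lambda item: item[1]):
    print(name, p)
```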
To visualize the results, we can launch the Gene Ontology Browser.
Select View from the main tool bar
Select Gene Ontology Browser
The Gene Ontology Browser will open in a new tab (Figure 2).
Figure 2. Viewing GO enrichment results in the Gene Ontology Browser
The bar chart shows the GO terms with the highest enrichment scores for the gene list.
To follow this tutorial, download the 32 .idat files (note that two .idat files are generated for each array) and unzip them on your local computer using 7-zip, WinRAR, or a similar program. The .idat files can be downloaded in a zipped folder using this link - .
Store the 32 .idat files in C:\Partek Training Data\Methylation or in a directory of your choosing. We recommend creating a dedicated folder for the tutorial.
From the Workflows drop-down list, select Methylation (Figure 1)
Figure 1. Selecting the methylation workflow
Select Microarray Loci Methylation from the Methylation sub-workflows panel (Figure 2)
Figure 2. Selecting the Illumina BeadArray Methylation workflow
This opens the Illumina BeadArray Methylation workflow (Figure 3)
Figure 3. Illumina BeadArray Methylation workflow
Select Import Illumina Methylation Data to bring up the Load Methylation Data dialog
Select Import human methylation 450/850 .idat files (Figure 4)
Figure 4. Selecting human methylation 450/850 .idat file type for import
Select OK
Select Browse... to navigate to the folder where you stored the .idat files
All .idat files in the folder will be selected by default (Figure 5).
Figure 5. Selecting .idat files to import
Select Add File(s) > to move the files to the idat Files to Process pane of the Import Illumina iDAT Data dialog (Figure 6)
Figure 6. Confirming selection of .idat files for import
Select Next >
The following dialog (Figure 7) deals with the manifest file, i.e., the probe annotation file. If a manifest file is not present locally, it will be downloaded to the Microarray libraries directory automatically. The download takes place in the background with no message on the screen, and it may take a few minutes depending on the internet connection. In the future, you may want to reanalyze a data set using the same version of the manifest file used during the initial analysis, rather than downloading an up-to-date version. To facilitate this, the Manual specify option in the Manifest File section allows you to select a specific version. For this tutorial, we will leave this on the default settings.
Figure 7. Selecting manifest file and output file
By default, the output file destination is set to the folder containing your .idat files and the output file name matches the folder name. The name and location of the output file can be changed using the Output File panel.
Select Customize to view advanced options for data normalization
In the Algorithm tab of the Advanced Import Options dialog (Figure 8), there are two filtering options and five normalization options available. The filters allow you to exclude probes from the X and Y chromosomes or based on detection p-value. In this tutorial, we have male and female samples, so we will apply the X and Y chromosome Filter. We will also filter probes based on detection p-value to exclude low-quality probes.
Select Exclude X and Y Chromosomes
Analysis of differentially methylated loci in humans and mice often excludes probes on the X and Y chromosomes because of the difficulties caused by the inactivation of one X chromosome in female samples.
Select Exclude probes using detection p-value and leave the default settings of 0.05 and 1 out of 16 samples.
Figure 8. Advanced Import Options offers choice of normalization method and additional data outputs
Select OK to close the Advanced Import Options dialog
Select Import on the Import Illumina iDAT data dialog
The imported and normalized data will appear as spreadsheet 1 (Methylation Tutorial) (Figure 9)
Figure 9. Viewing the imported methylation data in a spreadsheet
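The two import filters chosen above can be pictured with the short sketch below. It assumes that the setting "0.05 and 1 out of 16 samples" means a probe is excluded when its detection p-value exceeds 0.05 in at least one sample; that reading, the function name, and the parameter names are assumptions rather than documented Partek behavior.

```python
def passes_import_filters(chromosome, detection_p_values,
                          p_threshold=0.05, min_failing_samples=1):
    """Return True if a probe would be kept under the assumed filter rules."""
    # Filter 1: exclude probes located on the X and Y chromosomes
    if chromosome in {"X", "Y"}:
        return False

    # Filter 2: exclude probes whose detection p-value is above the
    # threshold in at least min_failing_samples samples (assumed meaning)
    failing = sum(1 for p in detection_p_values if p > p_threshold)
    return failing < min_failing_samples


good_probe = [0.01] * 16           # detected in all 16 samples
weak_probe = [0.01] * 15 + [0.20]  # fails detection in one sample

print(passes_import_filters("1", good_probe))  # autosomal, well detected
print(passes_import_filters("X", good_probe))  # removed by the X/Y filter
print(passes_import_filters("1", weak_probe))  # removed by the detection filter
```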
Principal component analysis (PCA) can be performed to visualize clusters in the methylation data, but also serves as a quality control procedure; outliers within a group could suggest poor data quality, batch effects, mislabeled samples, or uninformative groupings.
Select PCA Scatter Plot from the QA/QC section of the Illumina BeadArray Methylation workflow to bring up a Scatter Plot tab
Select 2. Cell Type for Color by
Select 3. Gender for Size by
Select () to enable Rotate Mode
Left click and drag to rotate the plot and view different angles (Figure 1)
Each dot of the plot is a single sample and represents the average methylation status across all CpG loci. Two of the LCLs samples do not cluster with the others, but we will not exclude them for this tutorial.
Figure 1. Principal components analysis (PCA) showing methylation profiles of the study samples. Each sample is represented by a dot, the axes are the first three PCs, and the numbers in parentheses indicate the fraction of variance explained by each PC. The number at the top is the variance explained by the first three PCs. The samples are colored by levels of 2. Cell Type
Next, the distribution of beta values across the samples can be inspected with a box-and-whiskers plot.
Select Sample Box and Whiskers Chart from the QA/QC section of the Illumina BeadArray Methylation workflow to bring up a Box and Whiskers tab
Each box-and-whisker is a sample and the y-axis shows beta-value ranges. Samples in this data set seem reasonably uniform (Figure 2).
Figure 2. Box and whiskers plot showing the distribution of beta values (y-axis) across the study samples (x-axis). Samples are colored by a categorical attribute (Cell Type). The middle line is the median, the box represents the upper and lower quartiles, and the whiskers correspond to the 90th and 10th percentiles of the data
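The five numbers drawn for each sample in such a chart can be computed as in the sketch below. It uses Python's default quantile interpolation, which may differ slightly from the method Partek Genomics Suite uses, and the sample values are invented.

```python
from statistics import median, quantiles


def box_stats(values):
    """Median, quartiles, and 10th/90th percentiles for one sample."""
    deciles = quantiles(values, n=10)   # 9 cut points: 10th ... 90th percentile
    quartiles = quantiles(values, n=4)  # 3 cut points: Q1, median, Q3
    return {
        "p10": deciles[0],      # lower whisker
        "q1": quartiles[0],     # bottom of the box
        "median": median(values),
        "q3": quartiles[2],     # top of the box
        "p90": deciles[8],      # upper whisker
    }


# Hypothetical methylation values for one sample
sample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
print(box_stats(sample))
```

Comparing these summaries across samples is a quick numeric counterpart to checking the chart for a sample whose box sits noticeably apart from the rest.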
An alternative way to take a look at the distribution of beta-values is a histogram.
Select Sample Histogram from the QA/QC section of the Illumina BeadArray Methylation workflow to bring up a Histogram tab
Again, no sample in the tutorial data set stands out (Figure 3).
Figure 3. Sample histogram. Each sample is a line, beta values are on the horizontal axis and their frequencies on the vertical axis. Two peaks correspond to two probe types (I and II) present on the MethylationEPIC array. Sample colors correspond to a categorical attribute (Cell Type)
To detect differential methylation between CpG loci in different experimental groups, we can perform an ANOVA test. For this tutorial, we will perform a simple two-way ANOVA to compare the methylation states of the two experimental groups.
Select Detect Differential Methylation from the Analysis section of the Illumina BeadArray Methylation workflow
A new child spreadsheet, mvalue, is created when Detect Differential Methylation is selected. M-values are an alternative metric for measuring methylation. β-values can be easily converted to M-values using the following equation: M-value = log2( β / (1 - β)).
An M-value close to 0 for a CpG site indicates a similar intensity between the methylated and unmethylated probes, which means the CpG site is about half-methylated. Positive M-values mean that more molecules are methylated than unmethylated, while negative M-values mean that more molecules are unmethylated than methylated. As discussed by , the β-value has a more intuitive biological interpretation, but the M-value is more statistically valid for the differential analysis of methylation levels.
Because we are performing differential methylation analysis, Partek Genomics Suite automatically creates an M-values spreadsheet to use for statistical analysis.
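The conversion between β-values and M-values given above can be written as a short sketch. The function names are illustrative, and the explicit rejection of β-values of exactly 0 or 1 (where the logit is undefined) is an assumption about how edge cases might be handled.

```python
from math import log2


def beta_to_m(beta):
    """M-value = log2(beta / (1 - beta)); defined only for 0 < beta < 1."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must be strictly between 0 and 1")
    return log2(beta / (1.0 - beta))


def m_to_beta(m):
    """Inverse transform: beta = 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)


for beta in (0.2, 0.5, 0.8):
    print(beta, beta_to_m(beta))
```

A half-methylated site (β = 0.5) maps to M = 0, and β-values above or below 0.5 map to positive or negative M-values, respectively.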
Select 2. Cell Type and 3. Gender from the Experimental Factor(s) panel
Select Add Factor > to move 2. Cell Type and 3. Gender to the ANOVA Factor(s) panel (Figure 1)
Figure 1. ANOVA setup dialog. Experimental factors listed on the left can be added to the ANOVA model.
Select Contrasts...
Leave Data is already log transformed? set to No
Leave Report comparisons as set to Difference
For methylation data, fold-change comparisons are not appropriate. Instead, comparisons should be reported as the difference between groups.
Select 2. Cell Type from the Select Factor/Interaction drop-down menu
Select LCLs
Select Add Contrast Level > for the upper group
Select B cells
Select Add Contrast Level > for the lower group
Select Add Contrast (Figure 2)
Figure 2. Configuring ANOVA contrasts
Select OK to close the Configuration dialog
The Contrasts... button of the ANOVA dialog now reads Contrasts Included
Select OK to close the ANOVA dialog and run the ANOVA
If this is the first time you have analyzed a MethylationEPIC array using the Partek Genomics Suite software, the manifest file may need to be configured. If it needs configuration, the Configure Annotation dialog will appear (Figure 3).
Select Chromosome is in one column and the physical location is in another column for Choose the column configuration
Select Ilmn ID for Marker ID
Select CHR for Chromosome
Select MAPINFO for Physical Position
Select Close
This enables Partek Genomics Suite to parse out probe annotations from the manifest file.
Figure 3. Processing the annotation file. User needs to point to the columns of the annotation file that contain the probe identifier as well as the chromosome and coordinates of the probe.
The results will appear as ANOVA-2way (ANOVAResults), a child spreadsheet of mvalue. Each row of the spreadsheet represents a single CpG locus (identified by Column ID).
Figure 4. ANOVA spreadsheet. Each row is a result of an ANOVA at a given CpG locus (identified by the Column ID column). The remaining columns contain annotation and statistical output
For each contrast, a p-value, Difference, Difference (Description), Beta Difference, and Beta Difference (Description) are generated. The Difference column reports the difference in M-values between the two groups while the Beta Difference column reports the difference in beta values between the two groups.
Each row of the spreadsheet (Figure 1) corresponds to a single sample. The first column is the names of the .idat files and the remaining columns are the array probes. The table values are β-values, which correspond to the percentage methylation at each site. A β-value is calculated as the ratio of methylated probe intensity over the overall intensity at each site (the overall intensity is the sum of methylated and unmethylated probe intensities).
Figure 1. Spreadsheet after .idat file import: samples on rows (Sample IDs are based on file names), probes on columns, cell values are functionally normalized beta values (default settings)
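The β-value definition above amounts to a single ratio. The sketch below follows that definition literally; it does not include any offset or normalization that the import step may apply, and the intensity values are invented.

```python
def beta_value(methylated, unmethylated):
    """Ratio of methylated intensity to the overall (summed) intensity."""
    total = methylated + unmethylated
    if total <= 0:
        raise ValueError("overall intensity must be positive")
    return methylated / total


# Hypothetical probe intensities at three CpG sites
print(beta_value(300.0, 100.0))  # mostly methylated
print(beta_value(100.0, 100.0))  # about half-methylated
print(beta_value(50.0, 950.0))   # mostly unmethylated
```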
Before we can perform any analysis, the study samples need to be organized into their experimental groups.
Select Add Sample Attributes from the Import section of the Illumina BeadArray Methylation workflow
Select Add a Categorical Attribute from the Add Sample Attributes dialog (Figure 2)
Figure 2. Adding sample attributes. Adding Attributes from an Existing Column can be used to split file names into sections, based on delimiters (e.g. _, -, space etc.). Adding a Numeric or Categorical Attribute enables the user to manually specify sample attributes
Select OK
The Create categorical attribute dialog allows us to create groups for a categorical attribute. By default, two groups are created, but additional groups can be added.
Set Attribute name: to Cell Type
Rename the groups B cells and LCLs
Drag and drop the samples from the Unassigned list to their groups as listed in the table below
There should now be two groups with eight samples in each group (Figure 3).
Figure 3. Adding Cell Type attribute as a categorical group
Select OK
Select Yes from the Add another categorical attribute dialog
Set Attribute name: to Gender
Rename the groups Male and Female
Drag and drop the samples from the Unassigned list to their groups as listed in the table below
There should now be two groups with four samples in Male and twelve samples in Female (Figure 4).
Figure 4. Adding Gender attribute as a categorical group
Select OK
Select No from the Add another categorical attribute dialog
Select Yes to save the spreadsheet
Two new columns have been added to spreadsheet 1 (Methylation) with the cell type and gender of each sample (Figure 5).
Figure 5. Annotated beta values spreadsheet
The list, LCLs vs B cells, includes differentially methylated loci for locations across the genome; however, in many cases we may want to focus on loci located in particular regions of the genome. To filter our list to include only regions of interest, we can use the annotations provided by Illumina and the interactive filter in Partek Genomics Suite.
Select LCLs_Vs_B_cells from the spreadsheet tree
Right-click on the Gene Symbol column
Select Insert Annotation (Figure 1)
Figure 1. Adding an annotation column to the ANOVA results
Select the Add as categorical option
Select Relation_to_UCSC_CpG_Island (Figure 2)
CpG islands are regions of the genome with an atypically high frequency of CpG sites. CpG islands and their surrounding regions (termed shelf and shore) include many gene promoters and altered methylation in these regions can have a disproportionate effect on gene expression. For example, hyper-methylation of promoter CpG islands is a common mechanism for down-regulating gene expression in cancer.
Figure 2. Adding chromosome location to ANOVA results
Select OK to add Relation_to_UCSC_CpG_Island as a column next to 3. Gene Symbol
Now, we can filter probes by their relation to CpG islands.
Select 4. Relation_to_UCSC_CpG_Island for Column
For categorical columns, the interactive filter displays each category of the selected column as a colored bar. For 4. Relation_to_UCSC_CpG_Island, each bar represents one of the categories of the UCSC annotation. To filter out a category, left-click on its bar. Right-clicking on a bar will include only the selected category. A pop-up balloon will show the category label as you mouse over each bar.
Right-click the Island bar to filter out all other categories (Figure 3)
Figure 3. Using the Interactive Filter tool to filter out probes by annotation. When pointed to a categorical column, the Interactive Filter tool summarises the content of the column as a column chart. Left-click to exclude a category (two columns were excluded, so they are grayed out), right-click to include only the selected category
The yellow and black bar on the right-hand side of the spreadsheet panel shows the fraction of excluded cells in black and included cells in yellow. Right-clicking this bar brings up an option to clear the filter.
Now that we have filtered out probes that are not in CpG islands, we will create a spreadsheet containing only these probes.
Right-click on the LCLs vs. B cells spreadsheet in the spreadsheet tree panel (Figure 4)
Figure 4. Cloning a filtered spreadsheet creates a new spreadsheet with only the included cells
Select Clone
Rename the new spreadsheet LCLs_vs_B_cells_CpG_Islands using the Clone Spreadsheet dialog
Select mvalue from the Create new spreadsheet as a child spreadsheet: drop-down menu (Figure 5)
Select OK
Figure 5. Renaming and configuring filtered spreadsheet
Specify a name for the spreadsheet using the Save File dialog (we chose LCLs_vs_B_cells_CpG_Islands)
Select Save to save the spreadsheet
You may want to save the project before proceeding to the next section of the tutorial.
Select () to save the changes
Select () to activate Rotate Mode
Select () to open the Configure__Plot Properties dialog
Select () from the command bar
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Genes without changes in expression are given a value of zero and are colored black. Up-regulated genes have positive values and are displayed in red. Down-regulated genes have negative values and are displayed in green. Each sample is represented in a row while genes are represented as columns. Dendrograms illustrate clustering of samples and genes. To learn more about configuring the hierarchical clustering heat map, see the user guide.
Select () to plot the PCA scatter plot
The data files can be downloaded from Gene Expression Omnibus using accession number or by selecting this link - . To follow this tutorial, download the 32 .idat files (note that two .idat files are generated for each array) and unzip them on your local computer using 7-zip, WinRAR, or a similar program.
The plot will open in a new tab. For additional plots that can be invoked from the ANOVA results spreadsheet, see the user guide.
To learn more about GO enrichment and using the Gene Ontology Browser, please consult the tutorial.
We recommend using the default option for normalization; however, advanced users can select their preferred normalization method. Select the () next to each normalization option for details. If you want to import probe intensity, raw probe intensity, probe signals, raw probe signals, or anti-log probe intensity values, they can be added to the data import using the Outputs tab of the Advanced Import Options dialog. Probe intensities and raw probe intensities can be used for advanced troubleshooting purposes and antilog probe intensities can be used for copy number detection. The Outputs tab of the Advanced Import Options dialog also has an option to create NCBI GEO submission spreadsheets from your imported data. For this tutorial, we do not need to import any of these values or create GEO submission spreadsheets.
Select () from the quick action bar to save the ANOVA-2way (ANOVA Results) spreadsheet with the added annotation
Select () from the quick action bar to invoke the interactive filter
Select () from the quick action bar to save the filtered spreadsheet

Sample ID | Cell Type |
---|---|
GSM2452106_200483200025_R04C01 | B cells |
GSM2452107_200483200021_R01C01 | B cells |
GSM2452108_200483200021_R02C01 | B cells |
GSM2452109_200483200025_R06C01 | B cells |
GSM2452110_200483200025_R07C01 | B cells |
GSM2452111_200483200021_R08C01 | B cells |
GSM2452112_200483200021_R06C01 | B cells |
GSM2452113_200483200021_R04C01 | B cells |
GSM2452114_200483200025_R01C01 | LCLs |
GSM2452115_200483200025_R03C01 | LCLs |
GSM2452116_200483200021_R03C01 | LCLs |
GSM2452117_200483200025_R05C01 | LCLs |
GSM2452118_200483200025_R02C01 | LCLs |
GSM2452119_200483200021_R07C01 | LCLs |
GSM2452120_200483200021_R05C01 | LCLs |
GSM2452121_200483200025_R08C01 | LCLs |

Sample ID | Gender |
---|---|
GSM2452106_200483200025_R04C01 | Female |
GSM2452107_200483200021_R01C01 | Female |
GSM2452108_200483200021_R02C01 | Male |
GSM2452109_200483200025_R06C01 | Female |
GSM2452110_200483200025_R07C01 | Female |
GSM2452111_200483200021_R08C01 | Female |
GSM2452112_200483200021_R06C01 | Female |
GSM2452113_200483200021_R04C01 | Male |
GSM2452114_200483200025_R01C01 | Female |
GSM2452115_200483200025_R03C01 | Female |
GSM2452116_200483200021_R03C01 | Male |
GSM2452117_200483200025_R05C01 | Female |
GSM2452118_200483200025_R02C01 | Female |
GSM2452119_200483200021_R07C01 | Female |
GSM2452120_200483200021_R05C01 | Female |
GSM2452121_200483200025_R08C01 | Male |
To analyze differences in methylation between our experimental groups, we need to create a list of differentially methylated loci.
Select Create Marker List from the Analysis section of the Illumina BeadArray Methylation workflow
Select LCLs vs. B cells (Figure 1)
Figure 1. Creating a list of significantly differentially methylated loci
Leave Include size of the change selected and set to Change > 2 OR Change < -2
Leave Include significance of the change selected and set to p-value with FDR < 0.05
Select Create
Select Close to exit the list manager
The new spreadsheet LCLs vs. B cells (LCLs vs. B cells) will open in the Analysis tab.
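The two list criteria used above (size of change and FDR-controlled significance) can be sketched as follows. The Benjamini-Hochberg step-up procedure is used here as an assumption about how the FDR threshold is applied, and the probe IDs, differences, and p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def marker_list(results, min_change=2.0, max_fdr=0.05):
    """Keep loci with |difference| > min_change and adjusted p-value < max_fdr."""
    adjusted = benjamini_hochberg([r["p_value"] for r in results])
    return [
        r["id"]
        for r, q in zip(results, adjusted)
        if abs(r["difference"]) > min_change and q < max_fdr
    ]


# Hypothetical ANOVA results for four CpG loci
results = [
    {"id": "cg_a", "difference": 3.0, "p_value": 0.01},
    {"id": "cg_b", "difference": -2.5, "p_value": 0.04},
    {"id": "cg_c", "difference": 1.0, "p_value": 0.03},
    {"id": "cg_d", "difference": -4.0, "p_value": 0.50},
]
print(marker_list(results))
```

Both conditions must hold for a locus to enter the list, so a large change with a weak p-value, or a strong p-value with a small change, is left out.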
It is best practice to occasionally save the project you are working on. Let's take the opportunity to do this now.
Select File from the main command toolbar
Select Save Project...
Specify a name for the project using the Save File dialog (we chose Methylation Tutorial)
Select Save to save the project
Saving the project saves the identity and child-parent relationships of all spreadsheets displayed in the spreadsheet tree. This allows us to open all relevant spreadsheets for our analysis by selecting the project file.
The significant CpG loci detected in the previous step actually form a methylation signature that differentiates between LCLs and B cells. We can build and visualize this methylation signature using clustering and a heat map.
Select the LCLs_vs_B_cells_CpG_Islands spreadsheet in the spreadsheet pane on the left
Select Cluster Based on Significant Genes from the Visualization panel of the Illumina BeadArray Methylation workflow
Select Hierarchical Clustering for Specify Method (Figure 1)
Figure 1. Selecting Hierarchical Clustering for clustering method
Select OK
Verify that LCLs_vs_B_cells_CpG_Islands is selected in the drop-down menu
Verify that Standardize is selected for Expression normalization (Figure 2)
Figure 2. Selecting spreadsheet and normalization method for clustering
Select OK
The heat map will be displayed on the Hierarchical Clustering tab (Figure 3).
Figure 3. Hierarchical clustering with heat map invoked on a list of significant CpG loci
The experimental groups are rows, while the CpG loci from the LCLs vs B cells spreadsheet are columns. Methylation levels are compared between the LCLs and B cells groups. CpG loci with higher methylation are colored red, and CpG loci with lower methylation are colored green. LCLs samples are colored orange and B cells samples are colored red in the dendrogram on the left-hand side of the heat map.
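The Standardize option selected earlier can be thought of as a per-locus z-score, sketched below. That interpretation, and the use of the sample standard deviation, are assumptions rather than documented Partek behavior; the M-values are invented.

```python
from statistics import mean, stdev


def standardize(values):
    """Shift one locus to mean 0 and scale it to standard deviation 1."""
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation (assumed)
    return [(v - mu) / sd for v in values]


# Hypothetical M-values of one CpG locus across three samples
print(standardize([1.0, 2.0, 3.0]))
```

After such a transform, every locus contributes on the same scale, so the heat map colors reflect relative rather than absolute methylation.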
The approach described in previous sections relies on ANOVA to detect differentially methylated CpG sites and takes individual sites as a starting point for interpretation. Since ANOVA compares M values at each site independently, this strategy is robust to type I/type II probe bias.
An alternative could be to first summarize all the probes belonging to a CpG island region (i.e. island, N-shore, N-shelf, S-shore, S-shelf) and then use ANOVA to compare regions across the groups. Since the summarization will include both type I and type II probes, you may want to split the analysis in two branches and analyze type I and type II probes independently. To do this, we need to annotate each probe as type I or type II.
Select the mvalue spreadsheet
Select Transform from the main toolbar
Select Create Transposed Spreadsheet... from the Transform drop-down menu (Figure 1)
Figure 1. Creating a transposed spreadsheet
Select Sample ID for Column: and numeric for Data Type:
Select OK
A new temporary spreadsheet will be created with a row for each probe and columns for each sample.
Right-click on column 1. ID to bring up the pop-up menu
Select Insert Annotation
Select Add as categorical
Select Infinium_Design_Type and UCSC_CpG_Islands_Name from the Column Configuration options (Figure 2)
Figure 2. Adding Infinium design type and CpG island annotations
Select OK to add the Infinium design type and UCSC CpG island name as categorical columns on the spreadsheet
Now, we can use the interactive filter to create separate spreadsheets for type I and type II probes.
Select () to launch the interactive filter
Select 2. Infinium_Design_Type from the drop-down menu if not selected by default
Left-click the type I column to exclude it
Right-click the temporary spreadsheet in the spreadsheet tree to bring up the pop-up dialog
Select Clone... (Figure 3)
Figure 3. Creating a probe list with only Infinium type II probes
Name the new spreadsheet female_only_typeII_probes
Select OK
Save the created spreadsheet (we chose the file name female_only_typeII_probes)
Repeat the process to create a spreadsheet for type I probes
The temporary spreadsheet is no longer needed so we can close it.
Close the temporary spreadsheet by selecting it in the file tree and selecting ()
We can use these spreadsheets to generate lists of M values at CpG island regions.
Select spreadsheet female_only_typeII_probes
Select Stat from the main toolbar
Select Column Statistics... under Descriptive (Figure 4)
Figure 4. Selecting column statistics
Add Mean to the Selected Measure(s) panel
Select Group By and set it to 3. UCSC_CpG_Islands_Name (Figure 5)
Figure 5. Configuring column statistics
Select OK
The new temporary spreadsheet has one CpG island region per row (Figure 6), samples on columns, and the values in the cells represent the mean of M values of all the CpG probes in the region.
Figure 6. New spreadsheet with average M values for probes at each CpG island; probes not at CpG islands are collected into the first row "- Mean"
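The Group By summarization performed by Column Statistics can be sketched as below: probes sharing a CpG island name are averaged per sample. The function name and region labels are hypothetical, and an empty string stands in for probes that map outside any CpG island.

```python
from collections import defaultdict
from statistics import mean


def mean_by_region(probes):
    """Average M-values per sample over all probes in the same region.

    probes: list of (region_name, [M-value for each sample]) tuples.
    """
    grouped = defaultdict(list)
    for region, values in probes:
        grouped[region].append(values)
    # zip(*rows) walks the samples column by column
    return {
        region: [mean(column) for column in zip(*rows)]
        for region, rows in grouped.items()
    }


# Hypothetical probes: two in one island, one outside any island
probes = [
    ("chr1:100-200", [1.0, 2.0]),
    ("chr1:100-200", [3.0, 4.0]),
    ("", [0.0, 0.0]),
]
print(mean_by_region(probes))
```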
Note the first row, with label “– Mean”. It corresponds to all the probes that map outside of UCSC CpG islands. As it is not needed for the downstream analysis, we will remove it.
Right-click on the row header for Mean
Select Delete to remove the row
The final step is to transpose the data back to its original orientation.
Select Transform from the main toolbar
Select Create Transposed Spreadsheet... from the Transform drop-down menu
Select 2. Level for Column: and numeric for Data Type:
Select OK
The layout of the new transposed spreadsheet is as follows: one sample per row with CpG island regions on columns; cell entries correspond to mean methylation status of the region (Figure 7). The column with a blank value for the column header is the average of all probes not associated with CpG island regions. You can delete this column if you like.
Figure 7. Spreadsheet with average M values of probes in each CpG island for each sample
Right-click the transposed spreadsheet, 2_transpose
Select Save as... from the pop-up menu
Name it mvalues_typeII_probes_CpG_islands
Close the source temporary spreadsheet by selecting it in the spreadsheet tree and selecting ()
The mvalues_typeII_probes_CpG_islands spreadsheet can be used as a starting point for ANOVA and other analyses. You can also repeat the steps above to create an equivalent spreadsheet for type I probes.
To perform gene set and pathway analysis, we need to create a list of genes that overlap with differentially methylated CpG loci.
Select LCLs_vs_B_cells_CpG_Islands in the spreadsheet tree
Select Find Overlapping Genes from the Analysis section of the workflow
The Output Overlapping Features dialog will open (Figure 1). This dialog allows you to choose the annotation database that defines where genes are located. By default, the promoter region is defined as 5000 base pairs upstream and 3000 base pairs downstream from the transcription start site.
Figure 1. Selecting Find Overlapping Genes from the main toolbar
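The default promoter definition can be illustrated with a small sketch that tests whether a CpG position falls inside the promoter window of a transcript. Treating upstream and downstream relative to the transcript's strand is an assumption, and the function names and coordinates are invented.

```python
def promoter_window(tss, strand, upstream=5000, downstream=3000):
    """Return (start, end) of the promoter region around a transcription start site."""
    if strand == "+":
        return (tss - upstream, tss + downstream)
    if strand == "-":
        # On the minus strand, upstream lies at higher coordinates
        return (tss - downstream, tss + upstream)
    raise ValueError("strand must be '+' or '-'")


def in_promoter(position, tss, strand, upstream=5000, downstream=3000):
    """True if a CpG position lies within the promoter window."""
    start, end = promoter_window(tss, strand, upstream, downstream)
    return start <= position <= end


# Hypothetical CpG locus 4000 bp before a TSS at position 10000
print(in_promoter(6000, 10000, "+"))  # upstream on the plus strand
print(in_promoter(6000, 10000, "-"))  # downstream on the minus strand
```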
Select Ensembl Transcripts release 75 from the Report regions from the specified database options
You can choose a name for the new list; we have named it gene-list
Select OK
A new spreadsheet will be created as a child spreadsheet (Figure 2)
Figure 2. Annotating the differentially methylated CpG loci with genes
Partek Genomics Suite offers several tools to help interpret this list of genes. First, let's look at Gene Set Analysis.
Select Gene Set Analysis from the Biological Interpretation section of the Illumina BeadArray Methylation workflow
Select GO Enrichment for Select the method of analysis
Select Next >
Select 1/mvalue/lcls_vs_b_cells_cpg_islands/gene-list (gene-list.txt) for the source spreadsheet
Select Next >
Select Invoke gene ontology browser on the result and leave the rest of the options set to defaults for Configure the parameters of the test (Figure 3)
Figure 3. Configuring the parameters of the test
Select Next >
Select Default Mapping File for Select the method of mapping genes to genes sets
Select Next >
A new spreadsheet will be created with categories ranked by enrichment score and the Gene Ontology Browser will launch to graphically display the results of the spreadsheet (Figure 4). The results show which gene sets are overrepresented in the list of genes overlapped by differentially methylated CpG loci between the experimental and control groups.
Figure 4. GO enrichment browser showing gene groups overrepresented in the list of genes which overlap with differentially methylated CpG loci
To get a better idea whether genes associated with these GO terms have increased or decreased methylation, we can view the Forest Plot.
Select the Forest Plot tab
GO terms are listed by the number of significantly up-regulated genes, with the percent up-regulated and down-regulated shown in red to green bars. Here, we see that most GO terms show increased methylation in their associated genes (Figure 5).
Figure 5. Gene Ontology Forest Plot
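The quantities summarized in the Forest Plot can be approximated with a small Python sketch. The GO term names and difference values below are hypothetical, and the sketch assumes each list already contains only the significant genes for that term; it is not the software's own calculation.

```python
# Hypothetical input: GO term -> methylation differences of its significant genes
# (positive = increased methylation, negative = decreased methylation).
go_terms = {
    "immune response": [1.2, 0.8, -0.5, 2.1],
    "cell adhesion": [0.4, -1.1],
    "signal transduction": [0.9, 1.5, 0.3],
}


def summarize(diffs):
    """Count up-regulated genes and compute percent up and percent down."""
    up = sum(1 for d in diffs if d > 0)
    down = sum(1 for d in diffs if d < 0)
    return {
        "n_up": up,
        "pct_up": 100.0 * up / len(diffs),
        "pct_down": 100.0 * down / len(diffs),
    }


summary = {term: summarize(diffs) for term, diffs in go_terms.items()}

# List terms by the number of up-regulated genes, largest first.
ranked = sorted(summary, key=lambda term: summary[term]["n_up"], reverse=True)
```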
Next, we can perform Pathway Analysis to see which pathways are overrepresented in the genes overlapped by differentially methylated CpG loci.
Select gene-list from the spreadsheet tree
Select Pathway Analysis from the Biological Interpretation section of the Illumina BeadArray Methylation workflow
Select Pathway Enrichment for Select the method of analysis
Select Next >
Select 1/mvalue/lcls_vs_b_cells_cpg_islands/gene-list (gene-list.txt) for the source spreadsheet
Select Next >
Leave the default selections for the Configure parameters of the test panel
Select Next >
Leave the default selections for the Result File and Select the parameters panels
Select Next > to run the analysis
The Pathway-Enrichment spreadsheet will be added to the spreadsheet tree in Partek Genomics Suite and the Partek® Pathway™ software will open to provide visualization of the most significantly enriched pathway as a pathway diagram (Figure 6). The color of the gene boxes reflects p-values of the associated differentially methylated CpG loci (bright orange is insignificant, blue is highly significant). The Color by option can be changed to another column from the gene-list.txt spreadsheet, such as Difference.
Figure 6. Partek Pathway illustrating one of the pathways overrepresented in the list of genes overlapping the differentially methylated CpG sites.
The Pathway-Enrichment spreadsheet can also be viewed in Partek Pathway by switching to the Pathway-Enrichment section of the menu tree on the left-hand side of the window. From the spreadsheet view, you can select a pathway name to visualize that pathway. Alternatively, you can open a pathway visualization in Partek Pathway from the Pathway-Enrichment spreadsheet in Partek Genomics Suite by right-clicking on a row and selecting Show pathway... from the pop-up menu. Please note that if you have closed Partek Pathway and reopened it, you will need to import a gene list if you want to color the visualization by attributes from the gene list. For more information about using Partek Pathway, check out our Partek Pathway Tutorial.
Partek Genomics Suite software can view annotation .BED files as tracks in the Genome Viewer. We can add a CpG islands track to the Genome Viewer using the UCSC Genome Browser CpG islands annotation.
Go to UCSC Genome Browser
Select Table Browser under Tools in the main command bar of the webpage (Figure 1)
Figure 1. Navigating to the Table Browser at the UCSC Genome Browser website
Configure the Table Browser page as shown (Figure 2)
Figure 2. Configuring the Table Browser to output CpG Islands BED file
Set assembly to Feb. 2009 (GRCh37/hg19)
Set group to Regulation
Set track to CpG Islands
Set table to cpgIslandExt
Set output format to BED
Set output file to cpg.bed
Select get output
The Output cpgIslandExt as BED page will open.
Select get BED to download a compressed folder containing the BED file
Unzip the file using 7-Zip, WinRAR, or a similar program of your choice to a location you will be able to find
Next, we can import the BED file into Partek Genomics Suite.
Select Genomic Database... under Import under File in the main toolbar in Partek Genomics Suite (Figure 3)
Figure 3. Importing the CpG Islands map BED file
Select the file cpg.bed
The BED file will open as a new spreadsheet.
Change the spreadsheet name to UCSC CpG Island Annotation and save it
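If you want to inspect the downloaded file before or after importing it, a BED file can be read with a few lines of Python. The sketch below assumes the standard BED layout (tab-separated chromosome, 0-based start, end, and an optional name column) and uses two illustrative rows in place of the real cpg.bed; `read_bed` is a hypothetical helper, not part of any Partek tool.

```python
import io

# Illustrative stand-in for the contents of cpg.bed.
sample_bed = """track name=cpgIslandExt
chr1\t28735\t29810\tCpG: 116
chr1\t135124\t135563\tCpG: 30
"""


def read_bed(handle):
    """Parse BED lines into (chrom, start, end, name) tuples, skipping header lines."""
    regions = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else ""
        regions.append((chrom, start, end, name))
    return regions


regions = read_bed(io.StringIO(sample_bed))

# BED intervals are half-open, so the length is simply end - start.
lengths = [end - start for _, start, end, _ in regions]
```

To read the real file, replace `io.StringIO(sample_bed)` with `open("cpg.bed")`.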
For this region list, you can also calculate the average beta values for the probes in each island per sample and detect differentially methylated CpG island regions. Detailed information on how to get the average beta value for each CpG island can be found in the Determining the average values for a region list section of Starting with a list of genomic regions.
Gene Ontology (GO) enrichment analysis has been incorporated into the gene expression, microRNA expression, exon, copy number, tiling, ChIP-Seq, RNA-Seq, miRNA-Seq and methylation workflows. The Gene Ontology Consortium provides an excellent overview for new and experienced users of GO analysis. In brief, the common nomenclature of genes and gene products has been used to group genes into a functional hierarchy. This enables analyses to be compared across all types of genomic data, even data from different species. A broader understanding of experimental results is possible by grouping genes of interest into biological processes, cellular components and molecular functions. With the GO enrichment tool in Partek® Genomics Suite® you can take a list of genes (e.g. significantly differentially expressed genes) and see how they group in the functional hierarchy. This is analogous to going from looking at individual trees (genes) to seeing how the whole forest (gene ontology) is organized.
This tutorial illustrates how to:
Note: the workflow described below is enabled in Partek Genomics Suite version 7.0 software. Please fill out the form on our support page to request this version or use the Help > Check for Updates command to check whether you have the latest released version. The screenshots shown within this tutorial may vary across platforms and across different versions of Partek Genomics Suite.
This tutorial will provide a step-by-step guide to performing GO enrichment analysis. The data set used is based on 51 subjects run on the Illumina Human Ref-8 BeadChip platform. Twenty-six of the subjects were categorized as "Young" with an age range of 18 to 28. The other 25 subjects were categorized as "Old" with an age range of 65 to 84. Skeletal muscle, a type of striated muscle tissue, was obtained via biopsy from each subject. The total RNA was extracted from the skeletal cells, prepared and run on the BeadChips producing the data that is used for this tutorial.
The paper this data is based on can be found at .
The data for this tutorial can be downloaded by going to Help > On-line Tutorials on the main menu toolbar within the Partek Genomics Suite software. Download the zipped file and store it on your local disk drive. There is no need to manually unzip the directory.
Although the 450K and MethylationEPIC arrays were initially designed to analyze DNA methylation, they are essentially a dense SNP array and can be used for copy number analysis (Feber et al. 2014). The probe intensity data is easily parsed from the idat files by using the Additional Probe Data Spreadsheet Selection dialog (Figure 1) when importing the raw data. Examining the raw intensity data can also be useful for QA/QC purposes.
Follow the steps for importing Illumina methylation data until you reach the Import Illumina iDAT Data dialog with Manifest File and Output File panels (Figure 1).
Figure 1. Customizing output during data import
Select Customize... to open the Advanced Import Options dialog
Choose No normalization in the Normalization section of the Algorithm tab
Select the Outputs tab (Figure 2)
Figure 2. Selecting additional probe data to include during data import
Detection p-values. This is the confidence score that the signal of a probe was significantly higher than the background defined by negative control probes. Selecting this checkbox produces a spreadsheet ending with '_detectionp' in addition to the spreadsheet containing beta values. Each row of the _detectionp spreadsheet will be a different sample and the sample names will end in '_detectionp'. This spreadsheet can be used to filter out probes that do not show signal above background.
Probe Intensity. This is the sum of the methylated and unmethylated intensities per probe. Selecting this checkbox produces a spreadsheet ending with ‘_probe’ in addition to the spreadsheet containing beta values. Each row of the _probe spreadsheet will be a different sample and the file names will also end in ‘_probe’. The probe intensity values will be log2 transformed by default (note that the beta values are not log2 transformed).
Probe Signal. This option will become available if Probe Intensity is selected. Selecting this checkbox produces a spreadsheet ending with ‘_probe’. The methylated and unmethylated intensities are shown on separate rows for each sample, in addition to the summed values. The sample names will end in ‘_meth’ or ‘_unmeth’ for methylated and unmethylated values, respectively. The probe intensity values will be log2 transformed by default.
Raw Probe Intensity. This is the sum of the raw red and green signal intensities per probe. Selecting this checkbox produces a spreadsheet ending with ‘_raw’ in addition to the spreadsheet containing beta values. Each row of the spreadsheet will be a different sample and the file names will also end in ‘_raw’. The raw probe intensity values will be log2 transformed by default.
Raw Probe Signal. This option will become available if Raw Probe Intensity is selected. Selecting this checkbox produces a spreadsheet ending with ‘_raw’. The red and green intensities will be shown on separate rows for each sample, in addition to the summed values. The sample names will end in ‘_red’ or ‘_green’ for red and green values, respectively. The raw probe intensity values will be log2 transformed by default.
Antilog Probe Intensity Values. Selecting this checkbox will show the probe intensity data without log2 transformation.
Create NCBI GEO Submission Spreadsheets. Generates matrix processed and matrix signal intensities spreadsheets for GEO submission.
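The relationship between the probe-level outputs described above can be sketched in Python. This is an illustration under stated assumptions: the function names are hypothetical, the intensity values are made up, and the 0.05 detection p-value cutoff is an example choice rather than a documented default.

```python
import math


def probe_intensity(meth, unmeth, antilog=False):
    """Sum methylated and unmethylated signal; log2 transform unless antilog is True."""
    total = meth + unmeth
    return total if antilog else math.log2(total)


def probes_above_background(detection_p, threshold=0.05):
    """Return probe IDs whose detection p-value is below threshold in every sample.

    detection_p maps probe ID -> {sample: p-value}. The 0.05 cutoff is a
    hypothetical choice, not a documented default.
    """
    return sorted(
        probe
        for probe, per_sample in detection_p.items()
        if all(p < threshold for p in per_sample.values())
    )


detection_p = {
    "cg0001": {"S1": 0.001, "S2": 0.002},
    "cg0002": {"S1": 0.001, "S2": 0.300},
}
kept = probes_above_background(detection_p)
```

Here `probe_intensity(3000, 1096)` corresponds to the default log2 output, while passing `antilog=True` mirrors the Antilog Probe Intensity Values option.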
How you proceed depends on your study design. Here is an example series of steps to prepare the tutorial data set for copy number analysis:
Select Probe Intensity and Antilog Probe Intensity Values (Figure 2)
Select OK to close the Advanced Import Options dialog
Select Import to import the data and perform the selected normalization method
Select the (_probe) spreadsheet from the spreadsheet tree
Delete any samples with _detectionp names
Select Transform from the main toolbar
Select Normalize to baseline
Configure the Normalize to Baseline 1 dialog as shown (Figure 3)
Select Use control set from this spreadsheet
Set Control Category to B cells
Select Ratio to baseline from the Normalization Method section
Select After ratio apply log base 2
Select New Spreadsheet from the Configure Output section
Figure 3. Configuring normalize to baseline
Select OK to generate the spreadsheet
This spreadsheet contains copy number values per probe in log2 space (i.e. diploid = 0). Prior to performing copy number analysis, you can normalize for local GC abundance.
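The arithmetic behind a log2 ratio to baseline can be sketched as follows. The sample names and intensities are hypothetical, and using the per-probe mean of the control samples as the baseline is an assumption made for illustration; the software may summarize the control set differently.

```python
import math
from statistics import mean

# Hypothetical antilog probe intensities: sample -> one value per probe.
intensities = {
    "B_cell_1": [100.0, 200.0],
    "B_cell_2": [300.0, 200.0],
    "LCL_1": [400.0, 100.0],
}
controls = ["B_cell_1", "B_cell_2"]


def normalize_to_baseline(intensities, controls):
    """Divide each probe by the control-set mean, then apply log base 2."""
    n_probes = len(next(iter(intensities.values())))
    # Assumed baseline: per-probe mean over the control samples.
    baseline = [mean(intensities[c][i] for c in controls) for i in range(n_probes)]
    return {
        sample: [math.log2(value / base) for value, base in zip(values, baseline)]
        for sample, values in intensities.items()
    }


log2_ratios = normalize_to_baseline(intensities, controls)
```

A value of 0 means the probe matches the baseline (diploid), positive values suggest gain, and negative values suggest loss.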
Select Transform
Select Adjust Based on Local GC Content...
Click OK to run Local GC Adjustment (Figure 4)
Figure 4. Adjusting for local GC content
The GC adjusted spreadsheet is the starting spreadsheet for copy number analysis. You can now switch over to the Copy number workflow, skip the Create copy number step, and begin with the Detect amplifications and deletions step. Consult the user's guide for the copy number workflow for subsequent steps.
Feber A, Guilhamon P, Lechner M, et al. Using high-density DNA methylation arrays to profile copy number alterations. Genome Biology. 2014;15(2):R30. doi:10.1186/gb-2014-15-2-r30.
Partek Pathway provides a visualization tool for pathway enrichment spreadsheets utilizing the KEGG database. This tutorial will illustrate:
Note: the workflow described below is enabled in Partek Genomics Suite version 7.0 software. Please fill out the form on our support page to request this version or use the Help > Check for Updates command to check whether you have the latest released version. The screenshots shown within this tutorial may vary across platforms and across different versions of Partek Genomics Suite.
The pathway enrichment analysis illustrated in this user guide uses the . This data set is also used in our .
Download and save the zipped project folder in an accessible location on your computer. The project folder for the tutorial will be created in the same location the zipped project folder is stored.
Import the project using the zipped project importer in Partek Genomics Suite.
Select File from the main toolbar
Select Import
Select Zipped Project...
Choose the zipped project folder, miRNA_tutorial_data
The project will open with three spreadsheets:
1. Affy_miR_BrainHeart_intensities,
2. Affy_HuGeneST_BrainHeart_GeneIntensities,
3. ANOVAResults gene.
An Illumina-type project file (.bsc format) can be imported in Illumina’s GenomeStudio® (please note: to process 450K chips, you need GenomeStudio 2010 or newer) and exported using the Partek Methylation Plug-in for GenomeStudio. For more information on the plug-in, please see the . The plug-in creates six files: a Partek project file (*.ppj), an annotation file (*.annotation.txt), files containing intensity values (*.fmt and *.txt), and files containing β-values (*.fmt and *.txt) (Figure 1).
Figure 1. Output of Partek Methylation Plug-in for GenomeStudio
To load all the files automatically, open the .ppj file as follows.
Select Methylation from the Workflows drop-down menu
Select Illumina BeadArray Methylation from the Methylation sub-workflows section
Select Import Illumina Methylation Data from the Import section
Select Load a project following Illumina GenomeStudio export from the Load Methylation Data dialog
Pathway enrichment generates a results spreadsheet, Pathway-Enrichment.txt, visible in both Partek Genomics Suite (Figure 1) and in Partek Pathway.
Figure 1. The pathway enrichment spreadsheet is visible in both Partek Genomics Suite (shown here) and Partek Pathway
The spreadsheet includes 13 columns with information for each pathway represented in the source gene list.
1. Pathway Name - the name of the KEGG pathway
2. Database - the source database for the pathway annotation
3. Enrichment score - the negative natural log of the enrichment p-value derived from the contingency table (Fisher's Exact test) or the Chi-squared test
4. Enrichment p-value - the enrichment p-value derived from the contingency table (Fisher's Exact test) or the Chi-squared test
5. % genes in pathway that are present - the percentage of genes from the pathway that are present in the source gene list
6. Tissue score, 7. Replicate score, 8. Brain vs. Heart score - for each factor, interaction, and contrast in the ANOVA results spreadsheet, a separate score is calculated. This is derived from the negative log (base 10) of the average p-value for genes within the pathway for each factor. A high score indicates that the genes that fall into the pathway have a low p-value for the given factor.
9. # genes in list, in pathway - number of genes from the list in the pathway
10. # genes not in list, in pathway - number of genes from the pathway, not in the list
11. # genes in list, not in pathway - number of genes in list, not in the pathway
12. # genes not in list, not in pathway - number of genes not in the pathway or the list that are included in KEGG database pathways for the species
13. Pathway ID - KEGG pathway ID
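The enrichment columns can be related to one another with a short Python sketch built on the four contingency-table counts (columns 9 to 12). It is a sketch under assumptions: it uses a one-sided (right-tail) Fisher's Exact test, which may differ from the exact variant the software applies, and the function names and example counts are hypothetical.

```python
from math import comb, log, log10


def fisher_right_tail(in_list_in_path, not_list_in_path, in_list_not_path, not_list_not_path):
    """One-sided Fisher's Exact p-value for over-representation of the pathway in the list."""
    a, b, c, d = in_list_in_path, not_list_in_path, in_list_not_path, not_list_not_path
    n_path = a + b          # genes in the pathway
    n_list = a + c          # genes in the list
    total = a + b + c + d   # all genes considered
    denominator = comb(total, n_list)
    k_max = min(n_path, n_list)
    # Sum hypergeometric probabilities for overlaps at least as large as observed.
    numerator = sum(
        comb(n_path, k) * comb(total - n_path, n_list - k) for k in range(a, k_max + 1)
    )
    return numerator / denominator


def enrichment_score(p_value):
    """Negative natural log of the enrichment p-value (column 3)."""
    return -log(p_value)


def factor_score(p_values):
    """Negative log base 10 of the average p-value of the pathway's genes (columns 6 to 8)."""
    return -log10(sum(p_values) / len(p_values))


p = fisher_right_tail(3, 1, 1, 3)
score = enrichment_score(p)
```

Because the score is a negative log, a smaller p-value gives a larger enrichment score, which is why the top-ranked pathway is the most significantly enriched one.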
In Partek Genomics Suite, we can view several new options that are available for each pathway (row) in the Pathway-Enrichment.txt spreadsheet.
Right-click the row header of any row in the Pathway-Enrichment.txt spreadsheet (Figure 2)
Figure 2. The Pathway-Enrichment.txt spreadsheet in Partek Genomics Suite
The new options include:
Export genes in pathway, which creates a child spreadsheet of Pathway-Enrichment.txt that contains all the genes from the selected pathway(s) (Figure 3). This new spreadsheet includes gene symbols and their pathway.
Figure 3. Spreadsheet with all genes in pathway. Includes gene symbols and pathway.
Export genes in list and in pathway, which creates a child spreadsheet of Pathway-Enrichment.txt that contains the genes from your list that are present in the selected pathway(s) (Figure 4). This new spreadsheet includes gene symbols and their pathway.
Figure 4. Spreadsheet with genes only in list and pathway. Includes gene symbols and pathway.
Create Gene List, which creates a new child spreadsheet of the ANOVA results spreadsheet that contains the genes from your list that are present in the selected pathway(s) (Figure 5). This new spreadsheet includes all information for each gene from the ANOVA results spreadsheet. However, this list does not indicate the pathway of each gene.
Figure 5. Spreadsheet with genes in list and pathway. Includes all information from ANOVA results for each gene.
Show Pathway, which opens the selected pathway map in Partek Pathway.
Information about the different output options can be found by selecting the adjacent () icon.
Create sample attributes and assign samples to the groups as described in
RNA-Seq is a high-throughput sequencing technology used to generate information about a sample’s RNA content. Partek Genomics Suite offers convenient visualization and analysis of the high volumes of data generated by RNA-Seq experiments.
This tutorial illustrates:
Note: the workflow described below is enabled in Partek Genomics Suite version 7.0 software. Please fill out the form on our support page to request this version or use the Help > Check for Updates command to check whether you have the latest released version. The screenshots shown within this tutorial may vary across platforms and across different versions of Partek Genomics Suite.
In this tutorial, you will analyze an RNA-Seq experiment using the Partek Genomics Suite software RNA-Seq workflow. The data used in this tutorial was generated from mRNA extracted from four diverse human tissues (skeletal muscle, brain, heart, and liver) from different donors and sequenced on the Illumina® Genome Analyzer™. The single-end mRNA-Seq reads were mapped to the human genome (hg19), allowing up to two mismatches, using Partek Flow alignment and the default alignment options. The output files of Partek Flow are BAM files which can be imported directly into Partek Genomics Suite 7.0 software. BAM or SAM files from other alignment programs like ELAND (CASAVA), Bowtie, BWA, or TopHat are also supported. This same workflow will also work for aligned reads from any sequencing platform in the (aligned) BAM or SAM file formats.
Data and associated files for this tutorial can be downloaded by going to Help > On-line Tutorials from the Partek Genomics Suite main menu or using this link - RNA-Seq Data Analysis tutorial files. Once the zipped data directory has been downloaded to your local drive:
Unzip the downloaded files to C:\Partek Training Data\RNA-seq or to a directory of your choosing. Be sure to create a directory or folder to hold the contents of the zip file
Chromatin Immunoprecipitation Sequencing (ChIP-Seq) uses high-throughput DNA sequencing to map protein-DNA interactions across the entire genome. Partek Genomics Suite offers convenient visualization and analysis of ChIP-Seq data.
In this tutorial, we will use the Partek Genomics Suite ChIP-Seq workflow to analyze aligned data from a ChIP sample versus a control sample in .bam format.
This tutorial illustrates:
Note: the workflow described below is enabled in Partek Genomics Suite version 7.0 software. Please fill out the form on our support page to request this version or use the Help > Check for Updates command to check whether you have the latest released version. The screenshots shown within this tutorial may vary across platforms and across different versions of Partek Genomics Suite.
The data for this tutorial comes from Johnson et al. 2007, which first described the ChIP-Seq technique.
This study mapped genomic binding sites for neuron-restrictive silencer factor (NRSF) transcription factor across the genome. There are two samples: an NRSF-enriched ChIP sample (chip.bam) and a control sample of input DNA without antibody immunoenrichment (mock.bam). The chip.bam file contains ~1.7 million mapped reads and the mock.bam file contains ~2.3 million mapped reads. These .bam files contain the aligned genomic locations and sequences of mapped reads. This data set contains reads from a single-end (SE) library; the differences in processing paired-end (PE) reads will be discussed when applicable.
Data for this tutorial can be downloaded from the Partek website using this link - ChIP-Seq tutorial data. To follow this tutorial, download the 2 .bam files and unzip them on your local computer using 7-zip, WinRAR, or a similar program. Because of the large size of the .bam files, we recommend saving them to a local drive instead of trying to access them on a network drive. The first time a .bam file is read by Partek Genomics Suite, the file will be sorted to allow for faster access. Therefore, you must have write permissions for the .bam files after download and on the file folder where they are stored.
Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007;316(5830):1497-1502. doi:10.1126/science.1141319.
Before performing pathway enrichment, we need to create a gene list from the ANOVA results.
Select Gene Expression from the workflows drop-down menu
Select the ANOVAResults gene spreadsheet
Select Create Gene List from the Analysis section of the Gene Expression workflow
Select Brain vs. Heart from the List Manager dialog (Figure 1) leaving the other options as defaults
Select Create
Figure 1. Configuring the list manager dialog
A new list of 420 genes will be created as a child spreadsheet of 1 (ANOVAResults gene).
Select Close to exit the List Manager dialog
Select the new gene list, Brain vs. Heart
Select Pathway Analysis from the Biological Interpretation section of the Gene Expression workflow
Select Next > to continue with Pathway Enrichment
Pathway Enrichment is the only option available for a gene list. To learn more about the other option, Pathway ANOVA, see the Gene Ontology ANOVA tutorial, which follows the same procedure as Pathway ANOVA.
Select Next > to continue with the Brain vs. Heart spreadsheet
Select Next > to continue with default settings for Fisher's Exact test
Select Next > to continue with Homo sapiens and 4. Gene Symbol as parameters
Partek Pathway will now open. If this is your first time using Partek Pathway on the selected species, Partek Pathway will automatically download the KEGG information needed for the analysis. Once the pathway enrichment calculation has been performed, a new spreadsheet, Pathway-Enrichment.txt, will be added as a child spreadsheet of Brain vs. Heart and Partek Pathway will launch (Figure 2).
Figure 2. Partek Pathway displaying the most significantly enriched pathway from the gene list
The pathway currently displayed has the highest enrichment score. Both Partek Genomics Suite and Partek Pathway offer options for analyzing the results of pathway enrichment analysis. The next two sections of the user guide will show the options for analyzing the results of pathway enrichment in each program.
Now that the data has been imported, we need to make a few changes to the data annotation before analysis.
Notice that the Sample ID names in column 1 are gray (Figure 1). This indicates that Sample ID is a text factor. Text factors cannot be used as a variable in downstream analysis so we need to change Sample ID to a categorical factor.
Figure 1. Viewing the imported data in a spreadsheet
Right-click on the column header to invoke the pop-up menu
Select Properties (Figure 2)
Figure 2. Changing column properties
Configure the Properties of Column 1 in Spreadsheet 1 dialog as shown (Figure 3) with Type set to categorical and Attribute set to factor
Figure 3. Changing column 1 properties
Select OK
The sample names in column 1 are now black, indicating that they have been changed to a categorical variable. Next, we will add attributes for grouping the data.
From the RNA-seq workflow panel, select Add Sample Attributes to bring up the Add Sample Attributes dialog (Figure 4)
Figure 4. Add Sample Attributes dialog
Select Add a Categorical Attribute
Select OK to bring up the Create categorical attribute dialog
Creating a categorical sample attribute allows us to group samples. This is useful for designating samples as replicates, as members of an experimental group, or as sharing a phenotype of interest. In this tutorial, we have four different samples from different tissues and different donors, but to illustrate the available statistical analysis options, we need to divide the samples into two groups: muscle (Heart and Muscle) and not muscle (Brain and Liver).
Set Attribute name: as Tissue
Rename Group 1 to muscle and Group 2 to not muscle
Select and drag the samples from the Unassigned panel to the correct group panel (Figure 5)
Figure 5. Creating a categorical attribute
Select OK
Select No from the Add another attribute? dialog
Select Yes from the Save spreadsheet 1 dialog
The attribute will now appear as a new column in the RNA-seq spreadsheet with the heading Tissue and the groups muscle and not muscle.
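Conceptually, the categorical attribute is a mapping from each sample to exactly one group. The following Python sketch illustrates that idea; the sample names are placeholders based on the tissue names, and `assign_attribute` is a hypothetical helper rather than anything provided by the software.

```python
# Group definitions from the tutorial; sample names here are placeholders.
groups = {
    "muscle": ["Heart", "Muscle"],
    "not muscle": ["Brain", "Liver"],
}
samples = ["Brain", "Heart", "Liver", "Muscle"]


def assign_attribute(samples, groups):
    """Return {sample: group}, raising an error if any sample is left unassigned."""
    lookup = {sample: group for group, members in groups.items() for sample in members}
    unassigned = [sample for sample in samples if sample not in lookup]
    if unassigned:
        raise ValueError(f"unassigned samples: {unassigned}")
    return {sample: lookup[sample] for sample in samples}


tissue = assign_attribute(samples, groups)
```

The resulting `tissue` mapping plays the role of the new Tissue column, with one group label per sample row.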
The next available step in the Import panel of the RNA-seq workflow is Choose Sample ID Column. Verifying the correct column is designated the Sample ID becomes particularly important when data from multiple experiments is being combined.
Select Choose Sample ID Column from the Import panel of the RNA-Seq workflow
Select OK (Figure 6)
Figure 6. Choosing the correct column as Sample ID
The zipped project file contains several prepared files used in this analysis as well as the annotation information for the BeadChip. The zipped file also contains a Partek project file (.ppj).
After downloading the file, go to File > Import > Zipped project... and browse to GO_Enrichment.zip on your local drive
Partek Genomics Suite will automatically unzip the file, read the .ppj file, open and annotate all spreadsheets (Figure 1). The parent spreadsheet (GSE8479-AVGSignal) contains the original intensity data. The first child spreadsheet (ANOVAResults) contains the results of differential gene expression analysis from a 3-way ANOVA. The second child spreadsheet (Gene_List.txt) is a list of significantly differentially expressed genes. When working with your own data, you will need to detect differentially expressed genes and create a gene list yourself.
Figure 1. Viewing the Gene List spreadsheet
Partek Pathway is a separate program from Partek Genomics Suite with a distinct user interface (Figure 1).
Figure 1. Partek Pathway
The Project Elements panel (Figure 2) displays the selected pathway, the original gene list, the Pathway Enrichment spreadsheet, and the library references that were used for the pathway analysis. The Project Elements panel is used to navigate between open pathway diagrams and spreadsheets.
Figure 2. Project Elements panel
Select the Brain vs. Heart spreadsheet under Gene Lists
The Brain vs. Heart gene list we created earlier will open (Figure 3). The spreadsheet can be sorted by any column by left-clicking a column header; the first click will sort by ascending values, the second click will switch to descending values.
Figure 3. Viewing a gene list in Partek Pathway
Select the Pathway-Enrichment spreadsheet under Pathway Lists
The Pathway-Enrichment.txt spreadsheet will open (Figure 4). This spreadsheet has the same contents as the Pathway-Enrichment.txt spreadsheet in Partek Genomics Suite. Selecting any of the pathway names will open its pathway diagram. The spreadsheet can be sorted by any column.
Figure 4. Viewing the Pathway Enrichment spreadsheet in Partek Pathway
Select the GABAergic synapse pathway on the Pathway-Enrichment spreadsheet or the Project Elements panel
The GABAergic synapse pathway diagram will open. Genes in the pathway are shown as boxes. The color of the boxes is set by the Configuration panel (Figure 5).
Figure 5. The Configuration panel
Any numerical column from the source gene list can be used to color the gene boxes. While significant p-values indicate a difference between the categories, they give no information about upregulation or downregulation of the pathway. We can overlay fold-change information on the pathway diagram.
Select Brain vs. Heart: Fold-Change(Brain vs. Heart) from the drop-down menu
The pathway diagram now shows fold change for each gene in the pathway included in the gene list (Figure 6). Genes not in the gene list remain black.
Figure 6. GABAergic synapse pathway diagram showing fold-changes for genes in the gene list
The colors and range of the color scale can be changed using the Color By panel.
Select the red color square next to Max
Select yellow from the color picker interface
Select OK
Select the green color square next to Min
Select pale blue from the color picker interface
Select OK
We can see that all the colored genes in the GABAergic synapse pathway are yellow (Figure 7), indicating that they are up-regulated.
Figure 7. Changing colors in the Pathway Diagram; up-regulated genes are yellow and down-regulated genes are teal
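One way to picture how a Min color and a Max color translate into gene box colors is linear interpolation over the value range. The sketch below is purely illustrative: the RGB values, the range of -4 to 4, and the simple two-color gradient are all assumptions, and the actual color scale used by Partek Pathway may be built differently.

```python
# Hypothetical RGB values for the colors chosen in the steps above.
PALE_BLUE = (173, 216, 230)
YELLOW = (255, 255, 0)


def scale_color(value, vmin, vmax, min_rgb, max_rgb):
    """Map a value to a color by linear interpolation, clamping to [vmin, vmax]."""
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    clamped = min(max(value, vmin), vmax)
    fraction = (clamped - vmin) / (vmax - vmin)
    return tuple(round(lo + fraction * (hi - lo)) for lo, hi in zip(min_rgb, max_rgb))


# A strongly up-regulated gene takes the Max color; values beyond the range are clamped.
high = scale_color(6.0, -4.0, 4.0, PALE_BLUE, YELLOW)
low = scale_color(-4.0, -4.0, 4.0, PALE_BLUE, YELLOW)
```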
We can select a gene to learn more about it.
Right-click GABAB (Figure 8)
Figure 8. Learn more about any gene on a pathway diagram by right-clicking
Options available include:
Look up on KEGG - opens the KEGG page for the pathway on GenomeNet (genome.jp) in your web browser
Ensembl - under External Links, opens the page for the selected Ensembl ID on ensembl.org
UniProt - under External Links, opens the page for the selected UniProt ID on uniprot.org
Jump to ___ on "___" - opens the source gene list in Partek Pathway to the row of the selected gene
The pathway database can be searched for genes of interest using the Search panel.
Type NSF in the search box
Pathways containing NSF appear on the right-hand side in the Search Results section in alphabetical order (Figure 9).
Figure 9. Using the search panel to find pathways containing a gene of interest
If multiple species or libraries have been loaded, the Filter Options section on the left-hand side can be used to choose which species and libraries to search.
Double click on Synaptic vesicle cycle in the Search Results section
The selected pathway, Synaptic vesicle cycle, will open in the Pathway Diagram panel (Figure 10).
Figure 10. Opening a pathway diagram from search results
On the right-hand side of the Partek Pathway window, we see the Pathway Detail panel (Figure 11).
Figure 11. The Pathway Detail panel
Select KEGG_Gene to open the list of genes in the pathway
Selecting a gene in the list will highlight it in the pathway diagram (Figure 12).
Figure 12. Selecting a gene in a pathway diagram using the Pathway Detail panel
Another way to select and view a pathway is browsing the Pathway Libraries.
The Pathway Libraries dialog will open (Figure 13).
Figure 13. Downloading and browsing pathway libraries
In the upper section of the dialog, you can view available KEGG libraries and download them by selecting the Download Library link. Selecting a pathway opens it in the lower section of the dialog.
In the lower section of the dialog, you can view a list of all the pathways in the selected pathway library. You can also open any pathway from the selected library in the Pathway Diagram panel.
Select Adherens Junction
Select View Pathway to open the pathway diagram
We can use the Project Elements panel to close an open pathway diagram or list.
Right-click Adherens Junction in the Project Elements panel
Select Delete from the pop-up menu to close the diagram (Figure 14)
Figure 14. Closing a pathway diagram
The Search panel and Pathway Libraries can also be used to open pathway diagrams for pathways without any open gene or pathway lists.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
One of the main functions of GO enrichment is to find the overrepresentation of functional categories in a gene list. With the Gene_List.txt spreadsheet selected:
From the Gene Expression workflow, choose Biological Interpretation followed by Gene Set Analysis
Select the GO Enrichment radio button in the Gene Set Analysis dialog (Figure 1) followed by Next
In the next dialog, make sure the Gene_List.txt spreadsheet is chosen from the drop-down list and click Next
Figure 1. Gene set analysis dialog. Choose GO Enrichment and select Next
You have the choice to use the Fisher's Exact or Chi-Square test. Both tests compare the proportion of a gene list in a functional group to the proportion of genes in the background for that group. Both are acceptable and you can always test both by re-running the analysis. You can also restrict the analysis to functional groups with more than or fewer than a specified number of genes. Restricting the analysis to GO groups with fewer than 150-200 genes will increase the speed of analysis and exclude large groups which may not be too informative. If analysis time is not a concern, you can just use the default settings.
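To make the comparison of proportions concrete, the following is a minimal sketch of a one-sided Fisher's Exact test for over-representation, written with the Python standard library only. The function name, the argument names, and the example counts are hypothetical, and the sketch assumes a one-sided test; it is not the Partek Genomics Suite implementation, which may differ in sidedness and in how the background is defined.

```python
from math import comb


def enrichment_p_value(in_list_in_group, list_size, group_size, background_size):
    """One-sided Fisher's Exact p-value for over-representation.

    Probability of seeing at least `in_list_in_group` members of a functional
    group in a gene list of `list_size` genes, when the list is drawn without
    replacement from `background_size` genes, of which `group_size` belong to
    the group (upper tail of the hypergeometric distribution).
    """
    max_overlap = min(list_size, group_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(group_size, k) * comb(background_size - group_size, list_size - k)
        for k in range(in_list_in_group, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 12 of 200 list genes fall in a group of 150 genes,
# with 20,000 genes in the background.
print(enrichment_p_value(12, 200, 150, 20000))
```

A small p-value means the group makes up a larger share of the gene list than of the background. The Chi-Square test asks the same question using an approximation that is less reliable when counts are small.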
Select the Use Fisher's Exact test radio button (Figure 2)
Make sure the Invoke gene ontology browser on the result check box is selected
Leave all other settings as default and click Next
Figure 2. Configure the parameters of the GO enrichment test
Select the Default mapping file radio button and click Next
Figure 3. For an explanation of the different kinds of mapping file supported, click the help icon next to each one
A new spreadsheet (Figure 4) and the gene ontology browser (Figure 5) will appear.
Figure 4. GO enrichment output spreadsheet. Right-click a row header to perform additional tasks on a chosen GO term
The new spreadsheet (GO-Enrichment.txt) is a child spreadsheet of the gene list. The first column contains the GO functional groups, each of which falls into a broader category (biological process, cellular component or molecular function), shown in column 2. The GO functional groups are sorted by descending enrichment score, which is shown in column 3. The enrichment score is the negative natural logarithm of the enrichment p-value, which is shown in column 4. The higher the enrichment score, the more overrepresented a functional group is in the gene list. If a functional group has an enrichment score of over 1, it is overrepresented. As a rule of thumb, an enrichment score of 3 corresponds to significant overrepresentation (p-value=0.05). For your data, you may wish to add a multiple test correction (e.g. FDR) by going to Stat > Multiple Test correction. We will not perform the multiple test correction for this tutorial.
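The relationship between the enrichment score and the p-value, and the effect of a multiple test correction, can be sketched in a few lines of Python. The function names and the example p-values below are hypothetical. The correction shown is the Benjamini-Hochberg step-up procedure, which is one common FDR method; we assume, but have not confirmed, that it resembles the FDR option offered under Stat > Multiple Test correction.

```python
import math


def enrichment_score(p_value):
    """Enrichment score as described above: negative natural log of the p-value."""
    return -math.log(p_value)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(round(enrichment_score(0.05), 3))  # close to the rule-of-thumb score of 3
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Because the score is a logarithm, each additional unit of enrichment score corresponds to a p-value that is smaller by a factor of about 2.7.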
Additional columns help describe the enrichment score for each group, including the percentage of genes in the group that are present in the gene list, the number of genes in the group that are present in the list, and the total number of genes in the group. Because the original gene list was derived from statistical analysis, extra columns will appear for all p-values in the ANOVA model. For example, the Young/Old score and Gender score columns contain the negative natural logarithm of the geometric mean of p-values for each marker/gene present in the list and in the group. These scores represent the level of differential expression of the genes in the functional group. The larger the score, the more differentially expressed the genes are in the group. A score of 3 or greater corresponds to an average p-value of 0.05 or less. For example, the Young/Old score explains how differentially expressed the genes present in the list and in a given group are between the "Young" and "Old" categories.
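The per-factor score described above can be reproduced by hand. The sketch below is an illustration only, with a hypothetical function name and made-up p-values; it relies on the identity that the negative natural logarithm of a geometric mean equals the arithmetic mean of the negative natural logarithms.

```python
import math


def group_score(p_values):
    """Negative natural log of the geometric mean of the p-values.

    Computed as the mean of -ln(p), which is mathematically the same and
    avoids multiplying many small numbers together.
    """
    return -sum(math.log(p) for p in p_values) / len(p_values)


# Hypothetical p-values for the genes of one functional group that are
# present in the gene list.
print(group_score([0.001, 0.02, 0.04, 0.3]))
```

A group whose genes all have a p-value of 0.05 scores about 3, which matches the rule of thumb given above.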
On the GO-Enrichment.txt spreadsheet, right-click on a row header of a functional group, such as hydrogen ion transmembrane transporter activity, which has an enrichment score of 29.9484, and choose Browse to GO term from the menu
Figure 5. Viewing a functional group on the gene ontology browser
The Gene Ontology Browser opens in a separate tab (Figure 5). A functional group viewed in the browser will show the hierarchical relationship to the other GO terms. The selected functional group will be highlighted on the left. In Figure 5, you can see the hydrogen ion transmembrane transporter activity is found in the tree molecular function > transporter activity > substrate-specific transmembrane transporter activity > cation-transporting ATPase activity > inorganic cation transmembrane transporter activity > monovalent inorganic cation transmembrane activity. On the right, a bar chart shows the sub-groups of the selected group and their respective enrichment scores.
On the GO-Enrichment.txt spreadsheet, right-click on a row header of a functional group and choose Term Details from the menu
A web browser will be opened and you will be re-directed to the AmiGO website, where you will find more details about the chosen GO term (internet connection required).
On the GO-Enrichment.txt spreadsheet, right-click on a row header of a functional group and choose Create Gene List from the menu
Figure 6. New gene list spreadsheet containing all the genes in the original list that belong to the chosen functional group
We will be using the RNA-Seq workflow to analyze RNA-Seq data throughout this tutorial. The commands included in the RNA-Seq workflow are also available from the command toolbar, but may be labeled differently.
Select the RNA-Seq workflow by selecting it from the Workflow drop-down menu in the upper right-hand corner of the Partek Genomics Suite window (Figure 1)
Figure 1. Selecting the RNA-seq workflow
The Partek Genomics Suite software can import next generation sequencing data that has been aligned to a reference genome. Two standard types of alignment formats can be imported: .BAM and .SAM. It is also possible to convert ELAND .txt files to .BAM files with the converter found in the Tools menu in the main command bar. The data used in this tutorial was aligned using the Partek® Flow® software and saved as .BAM files.
To import the .BAM files, select Import and Manage Samples from the Import section of the RNA-Seq workflow. The Sequence Import dialog box will open (Figure 2)
Figure 2. Importing .BAM files
Select BAM Files (*.bam) from the Files of type drop-down menu if not set by default
Use the file browser panel on the left-hand side of the Sequence Dialog or select Browse... to navigate to the folder where you stored the tutorial .BAM files
Files with checked boxes next to the file name will be imported. For this tutorial, select brain_fa, heart_fa, liver_fa, and muscle_fa
Select OK to confirm the file selection and open the next dialog (Figure 3)
Figure 3. Viewing the Sequence Import wizard; specify Output file (and directory using Browse), Species, and Genome
Configure the dialog as shown (Figure 3)
Output file provides a name for the top-level spreadsheet. Browse can be used to change the output directory.
Select Homo sapiens from the Species drop-down menu
This will allow us to select a human genome reference assembly alignment.
Select hg19 for Genome/Transcriptome reference used to align the reads
This is the reference genome our tutorial data was aligned to using Partek Flow.
Select OK to open the BAM Sample Manager dialog (Figure 4)
Figure 4. Bam Sample Manager dialog
The Bam Sample Manager dialog allows additional samples to be added or removed after the initial sample import. To remove a sample, select a sample from the list and then select Remove selected samples. This dialog also allows us to modify samples.
Select Manage samples to open the Assign files to samples dialog
Sample ID is by default set to the file name, which may be too long or uninformative, so the Assign files to samples dialog can be used to give informative names to samples.
Change the sample names to Brain, Heart, Liver, and Muscle as shown (Figure 5)
The Assign files to samples dialog also allows multiple .BAM files to be merged into one sample. This is useful if reads from one sample are split into multiple .BAM files.
Figure 5. Changing sample names using the Assign files to samples dialog
Select OK to close the Assign files to samples dialog
Select Close to exit the Bam Sample Manager dialog and view the imported data (Figure 6)
Figure 6. Viewing the imported data in a spreadsheet
Additional files can be added to this spreadsheet using the Bam Sample Manager dialog. The Bam Sample Manager dialog can also be used to add imported samples to a separate spreadsheet by selecting a new option in the dialog, Add new experiment.
During import, you created a categorical attribute called Tissue and assigned the 4 samples to either the muscle or not muscle groups. This step was to create replicates within a group, albeit this grouping is somewhat artificial and is only used in this tutorial because we want to illustrate ANOVA with a small data set. Replicates are a prerequisite for differential expression analysis using ANOVA.
Select Differential Expression Analysis from the Analyze Known Genes section of the RNA-Seq workflow
The Differential Expression Analysis dialog offers the choice of analyzing at Gene-, Transcript-, or Exon-level.
Select Gene-level
Specify the 1/gene_rpkm (RNA-Seq_results.gene.rpkm) spreadsheet from the Spreadsheet drop-down menu (Figure 1)
Figure 1. Choosing the type of differential expression analysis
Select OK to open the ANOVA dialog
Available factors are listed in the Experimental Factor(s) panel on the left-hand side of the dialog.
Select Tissue, then select Add Factor > to move Tissue to the ANOVA Factor(s) panel on the right-hand side of the dialog (Figure 2)
Figure 2. The ANOVA dialog
If the ANOVA were now performed (without contrasts), a p-value for differential expression would be calculated, but it would only indicate if there are differences within the factor Tissue; it would not inform you which groups are different or give any information on the magnitude of the difference between groups (fold-change or ratio). To get this more specific information, you need to define linear contrasts.
Select Contrasts... to open the Configure dialog
For Select Factor/Interaction, Tissue will be the only factor available as it was the only factor included in the ANOVA model in the previous step; if multiple factors were included, they could be selected in the Select Factor/Interaction: drop-down menu. The levels in this factor are listed on the Candidate Level(s) panel on the left side of the dialog
For this data set, verify that No is selected for Data is already log transformed?
Left click to select muscle from the Candidate Level(s) panel and move it to the Group 1 panel (renamed muscle) by selecting Add Contrast Level > in the top half of the dialog. Label 1 will be changed to the subgroup name automatically, but you can also manually specify the label name
Select not muscle from the Candidate Level(s) panel and move it to the Group 2 panel (renamed not muscle)
The Add Contrast button can now be selected (Figure 3)
Select OK to return to the ANOVA dialog
Figure 3. Defining linear contrasts
Select OK to perform the ANOVA as configured (Figure 4)
Figure 4. Fully configured ANOVA
Once the ANOVA has been performed on each gene in the data set, an ANOVA child spreadsheet ANOVA-1way (ANOVAResults) will appear under the gene_rpkm spreadsheet (Figure 5). The format of the ANOVA spreadsheet is similar for all workflows. Mouse over each column title for a description of the column contents.
Figure 5. Viewing ANOVA results
In this tutorial, the overall p-value for the factor (column 4) is the same as the p-value for the linear contrast (column 5) as there are only two levels within Tissue. If we had more than two groups, the overall p-value and the linear contrast p-values would most likely differ. You can also see the ? symbol in the ratio/fold-change columns (6 and 7) for several genes that also have a low p-value because there are zero reads in one of the groups, thus making it impossible to calculate ratios and fold-changes between groups.
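The reason a ? appears when one group has zero reads can be seen in a short sketch of the ratio and fold-change calculation. This is an assumption-laden illustration: the function name and the values are hypothetical, and the signed fold-change convention used here (the ratio itself when it is at least 1, otherwise the negative reciprocal) is an assumption about how the fold-change column is reported, not a confirmed description of the software.

```python
def ratio_and_fold_change(mean_group1, mean_group2):
    """Return (ratio, fold_change) of group 1 relative to group 2.

    Returns (None, None) when either group mean is zero, because the ratio
    or its reciprocal would require dividing by zero.
    """
    if mean_group1 == 0 or mean_group2 == 0:
        return None, None
    ratio = mean_group1 / mean_group2
    # Assumed convention: up-regulation is reported as the ratio, and
    # down-regulation as the negative reciprocal of the ratio.
    fold_change = ratio if ratio >= 1 else -1 / ratio
    return ratio, fold_change


print(ratio_and_fold_change(8.0, 2.0))   # muscle higher than not muscle
print(ratio_and_fold_change(2.0, 8.0))   # muscle lower than not muscle
print(ratio_and_fold_change(0.0, 5.0))   # no reads in one group
```

A gene with zero reads in one group can still have a low p-value, because the test only needs the difference between group means and the variability within groups.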
For using ANOVA with more complicated experimental designs, including multiple factors and linear contrasts, please refer to Identifying differentially expressed genes using ANOVA in the Gene Expression Analysis tutorial.
Once imported, it is possible to visualize the mapped reads along with gene annotation information and cytobands.
Select the parent spreadsheet 1 (RNA-seq)
Select Chromosome View in the Visualization section of the RNA-Seq workflow panel
Unless you have previously downloaded an annotation file, you will be prompted to select an annotation source.
Select RefSeq Transcripts - 2017-05-02
Partek Genomics Suite will download the relevant file and save it to your default library location. The Chromosome View tab will open with chromosome 1 displayed (Figure 1)
Figure 1. Visualizing reads on a chromosome level in Chromosome View
In Chromosome View you can choose other chromosomes from the position field drop-down menu (Figure 2) to change which chromosome is displayed. You may also type a search term (e.g. gene symbol or transcript ID) directly into the position field.
Figure 2. Choosing a chromosome to view in Chromosome View
The Tracks panel contains the following tracks:
RefSeq Transcripts (+)
The RefSeq Transcripts (+) track shows all genes encoded on the forward strand of the currently selected chromosome. This experiment uses RefSeq Transcripts, which defines genomic sequences of well-characterized genes, as the reference annotation track. Mouse-over a particular region in this track, and all genes within this region are shown in the information bar. Zoom in on this track to see individual genes, including alternative isoforms.
RefSeq Transcripts (-)
The RefSeq Transcripts (-) track shows all the genes encoded on the reverse strand of the currently selected chromosome.
Legend Base Colors
The Legend Base Colors track shows the color for each nucleotide. Colored nucleotide bases become visible in the Bam Profile tracks at higher levels of magnification. By default, the colors are set to red for adenine (A), blue for cytosine (C), yellow for guanine (G), green for thymine (T), and black for base not called (N). The color of the bases can be configured by selecting the Legend Base Colors track and selecting Configure colors in the track configuration panel beneath the Tracks panel on the left-hand side of the Genome Viewer.
Bam Profile (Heart, Brain, Muscle, Liver)
The Bam Profile tracks show all the reads that mapped to the currently selected chromosome from the four tissue samples. The y-axis numbers on the left side of the tracks indicate the raw read counts. The aligned reads are shown in the Genome Viewer in each track with a different color for each Bam Profile track.
Genome Sequence, Cytoband, and Genomic Label
The Genome Sequence, Cytoband and Genomic Label tracks are shown at the bottom of the panel. These labels are helpful for navigating the chromosome.
The basic method of creating a gene list from ANOVA results based on fold-change and p-value cut-offs is detailed in Creating gene lists from ANOVA results. Advanced options enable the creation of lists based on more complex criteria. For example, we can use the Create Gene List function to identify transcripts that are both significantly differentially expressed AND alternatively-spliced among the four tissue samples.
Select Create Gene List from the Analyze Known Genes panel of the RNA-Seq workflow to invoke the List Manager dialog
Select the Advanced tab (Figure 1)
Figure 1. Creating a gene list using advanced options
Select Specify New Criteria to invoke the Configure Criteria dialog (Figure 2)
Figure 2. Configuring criteria for transcripts with a p-value < 0.05
In the Configure Criteria dialog box (Figure 2), provide a name for the list (Diff Exp)
Select 1/transcripts (RNA-Seq_results.transcripts) from the Spreadsheet drop-down menu
Select 8. p-value(DiffExp) from the Column drop-down menu
Set Include p-values to significant with FDR with a value of 0.05
According to the # pass score on the right-hand side of the dialog, a list of 38,285 transcripts that pass this criterion will be generated. If the settings are changed, this number will automatically update.
Select OK
Repeat the same steps to create a list of transcripts that are likely alternatively spliced, named Alt Splice, using the same p-value cutoff and Column set to 10. p-value (AltSplice) (Figure 3)
Figure 3. Configuring criteria for a list of alternatively spliced genes
Select OK to generate Alt Splice
Select both lists in the right-hand panel under the Criteria panel while holding the Ctrl key on your keyboard
Select Intersection from the left-hand panel of the List Manager dialog (Figure 4)
Figure 4. Creating a gene list at the intersection of two criteria
Enter a name for the criteria (Diff Exp and Alt Splice)
Select OK to close the naming dialog and OK again to close the list creation hint dialog
Select Save List from the Manage criteria section of the List Manager dialog (Figure 5)
Figure 5. Saving a created list criteria
Select Diff Exp and Alt Splice in the List Creator dialog (Figure 6)
Figure 6. Selecting list to save in List Creator dialog
Select OK to save the list
Select Close to exit the List Manager dialog and view the Diff_Exp_and_Alt_Splice spreadsheet (Figure 7)
Figure 7. A list of the differentially expressed and alternatively spliced genes is now available for downstream analysis
This list of differentially expressed and alternatively spliced transcripts will be used later in the tutorial.
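The Intersection operation used above keeps only the transcripts that satisfy both criteria. As a rough illustration outside of Partek Genomics Suite, the same logic can be sketched with Python sets. The function name and the transcript identifiers below are hypothetical placeholders.

```python
def intersect_lists(*lists):
    """Return the identifiers present in every list, sorted for stable output."""
    common = set(lists[0])
    for other in lists[1:]:
        common &= set(other)
    return sorted(common)


# Hypothetical transcript IDs passing each criterion.
diff_exp = ["TX_A", "TX_B", "TX_C", "TX_D"]
alt_splice = ["TX_B", "TX_D", "TX_E"]

print(intersect_lists(diff_exp, alt_splice))  # ['TX_B', 'TX_D']
```

Replacing the `&=` operator with `|=` would give the union of the lists instead, which corresponds to keeping transcripts that pass either criterion.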
Data for this tutorial can be downloaded from the Partek website using this link - ChIP-Seq tutorial data. To follow this tutorial, download the two .bam files and unzip them on your local computer using 7-zip, WinRAR, or a similar program.
Store the two .bam files at C:\Partek Training Data\ChIP-Seq or in a directory of your choosing. We recommend creating a dedicated folder for the tutorial on a local drive.
Select ChIP-Seq from the Workflows drop-down menu (Figure 1)
Figure 1. Selecting the ChIP-Seq workflow
Select Import and Manage Samples from the Import section of the ChIP-Seq workflow
Select Browse... or use the file tree to navigate to the folder where you stored the .bam files
All .bam files in the folder will be selected by default (Figure 2).
Figure 2. Importing .bam files using the Sequence Import dialog
Verify that chip.bam and mock.bam are selected
Select OK
The Sequence Import dialog will launch (Figure 3). This allows us to choose the output file name and destination for the parent spreadsheet, as well as the species and genome build of the imported samples. By default, the output file destination is the folder where the .bam files are located.
Figure 3. Setting the output file name, species, and genome build during .bam file import
Set Output file to ChIP-Seq
Set Species to Homo sapiens using the drop-down menu
Set Genome build to hg18 using the drop-down menu
Select OK
The Bam Sample Manager dialog will open (Figure 4). This dialog can be used to add samples to the project (Add samples), remove samples (Remove samples), associate multiple files with particular samples (Manage samples), and map the chromosome names from the input files to the association files (Manage sequence names).
Figure 4. The Bam Sample Manager can be used to add, remove, and manage files and samples
Select Close
The Sort bam files dialog will open. Sorting is necessary for all imported .bam files, but you can choose to hide this hint in the future by selecting Please don't show me this hint again.
Select OK
The imported spreadsheet will open while the .bam files are sorted. Progress in sorting will be displayed on the progress bar in the lower left-hand corner of the Partek Genomics Suite window. Once sorting has completed, the spreadsheet will show samples on rows, with the sample names in column 1. Sample ID and the number of reads mapped to the reference genome for each sample in column 2. Number of alignments (Figure 5).
Figure 5. Imported .bam files with one sample in each row
With the GO Enrichment feature in Partek Genomics Suite, you can take a list of significantly expressed genes/transcripts and find GO terms that are significantly enriched within the list. For a detailed introduction to GO Enrichment, refer to the GO Enrichment User Guide (Help > On-line Tutorials > User Guides).
Select the Diff_Exp_and_Alt_Splice spreadsheet from the spreadsheet tree
Select Gene Set Analysis in the Biological Interpretation section of the RNA-Seq workflow (Figure 1)
Figure 1. Selecting Gene Set Analysis
Select GO Enrichment in the Gene Set Analysis dialog (Figure 2)
Select Next >
Figure 2. Selecting the method of analysis
Select the spreadsheet 1/Diff_Exp_and_Alt_Splice (Diff Exp and Alt Splice.txt) from the drop-down menu (Figure 3)
Select Next >
Figure 3. Selecting the spreadsheet that contains the genes you want to test
Select Use Fisher's Exact test
Select Invoke gene ontology browser on the result
Set Restrict analysis to functional groups with more than _ genes to 2 (Figure 4)
Select Next >
Figure 4. GO Enrichment options
Select Default mapping file (Figure 5)
Select Next >
Figure 5. Selecting the mapping file
A GO-Enrichment spreadsheet, as well as a browser (Figure 6), will be generated with the enrichment score shown for each GO term. Browse through the results to find a functional group of interest by examining the enrichment scores. The higher the enrichment score, the more overrepresented this functional group is in the input gene list. Alternatively, you may use the Interactive filter on the GO-Enrichment spreadsheet to identify functional groups that have low p-values and perhaps a higher percentage of genes in the group that are present.
Figure 6. Viewing the Gene Ontology Browser
We can check the quality of the samples using Partek Genomics Suite before analyzing the data.
In ChIP-Seq, genomic DNA is fragmented and target-protein-bound DNA fragments are purified by immunoprecipitation. These purified fragments are between 100 and 500 base pairs depending on the protocol; however, because ChIP-Seq uses short-read sequencing (25 to 35 base pair reads) to maximize sequencing depth, only the ends of each fragment will be sequenced. Consequently, with single-end sequencing, the forward and reverse strand reads for each fragment will come from opposite ends of the fragment. At a protein-binding site, there will be two peaks of read enrichment, one from enrichment of forward strand reads and another from enrichment of reverse strand reads. The average distance between these peaks is termed the effective fragment length. Because the forward and reverse strand peaks are generated from a common set of fragments, the peaks should be roughly symmetrical. By phase shifting the data to the mid-point between the two peaks, a common read density plot can be created that shows single peaks at binding sites.
Strand Cross-Correlation allows us to use the symmetrical distribution of forward and reverse strand fragments to calculate the effective fragment length (Kharchenko et al., 2008). The Pearson correlation coefficient between the read densities of the forward and reverse strands is calculated after phase shifts of between 0 and 500 base pairs. This is visualized with the phase shift range on the x-axis and the corresponding Pearson correlation coefficients between forward and reverse strand read densities on the y-axis (Figure 1). High-quality ChIP-Seq data will give a strong peak on the Strand Cross-Correlation plot at the effective fragment length. When calling peaks, the forward and reverse (or paired end) reads are each phase-shifted by the effective fragment length to create a combined read density profile.
For paired-end sequencing, Strand Cross-Correlation is calculated from the distribution of distances between the paired reads from the ends of each fragment.
We will perform Strand Cross-Correlation to identify the effective fragment length we can use when calling read enrichment peaks.
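The calculation described above can be illustrated on toy data. The following is a minimal sketch in plain Python, assuming per-base read densities for each strand are already available as equal-length lists. All function names and the toy densities are hypothetical, and the sketch ignores practical details such as multiple chromosomes and the handling of duplicate reads, so it should not be read as the method used by Partek Genomics Suite.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a flat profile
    return cov / sqrt(var_x * var_y)


def strand_cross_correlation(forward, reverse, max_shift):
    """Correlate forward[i] with reverse[i + shift] for each shift."""
    correlations = {}
    for shift in range(max_shift + 1):
        overlap = len(forward) - shift
        correlations[shift] = pearson(
            forward[:overlap], reverse[shift:shift + overlap]
        )
    return correlations


def effective_fragment_length(correlations):
    """Shift with the highest correlation coefficient."""
    return max(correlations, key=correlations.get)


# Toy densities: a forward-strand peak at position 10 and a matching
# reverse-strand peak 6 positions downstream.
forward = [0] * 40
reverse = [0] * 40
forward[9:12] = [1, 3, 1]
reverse[15:18] = [1, 3, 1]

cc = strand_cross_correlation(forward, reverse, max_shift=12)
print(effective_fragment_length(cc))  # 6
```

On real data the shift range would run from 0 to 500 base pairs, as described above, and the densities would cover the whole genome rather than 40 positions.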
Select Strand Cross-Correlation from the QA/QC section of the ChIP-Seq workflow
If you have not run this step before, you will be asked if you would like to create a new QA/QC child spreadsheet.
If prompted, select Yes to create a new child spreadsheet for QA/QC
After running Strand Cross-Correlation, the Strand Separation of Samples viewer will open as a new tab (Figure 1).
Figure 1. Strand Cross-Correlation profile plot showing possible effective fragment lengths on the x-axis and resulting Pearson correlation coefficients on the y-axis.
For the chip sample (blue), we can see the peak at 111 base pairs, corresponding to an effective fragment length of 111 base pairs. This number can be determined by examining the values in the strand_correlation spreadsheet (Figure 2), by moving the cursor over the peak in the graph, or by sorting the data in the spreadsheet. The Strand Separation of Samples graph is also useful as a quality control measure. In lower quality ChIP-Seq data, we would also observe a peak at the read length. The ratio between the Pearson correlation coefficient of the effective fragment length peak and the read length peak, normalized with the minimum correlation coefficient, [cc(fragment length) - min(cc)] / [cc(read length) - min(cc)] should be greater than 0.8 to meet the minimum quality standards recommended by the ENCODE project (Landt et al., 2012).
The mock sample (red) does not have an effective fragment length peak because it does not have read density peaks to phase shift. It does have a small peak at the sequencing read length of 26 base pairs.
Figure 2. The strand correlation spreadsheet shows the Pearson correlation coefficients for each relative strand shift value (effective fragment length)
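The quality ratio given above can be computed directly from the values in the strand correlation spreadsheet. The sketch below is illustrative only: the function name is hypothetical and the correlation coefficients are made-up numbers, with only the shift values 111 (effective fragment length) and 26 (read length) taken from this tutorial.

```python
def strand_quality_ratio(correlations, fragment_length, read_length):
    """[cc(fragment length) - min(cc)] / [cc(read length) - min(cc)]."""
    min_cc = min(correlations.values())
    return (correlations[fragment_length] - min_cc) / (
        correlations[read_length] - min_cc
    )


# Hypothetical correlation coefficients keyed by strand shift in base pairs.
correlations = {0: 0.10, 26: 0.15, 111: 0.30, 500: 0.05}

ratio = strand_quality_ratio(correlations, fragment_length=111, read_length=26)
print(ratio > 0.8)  # True for these made-up values
```

Subtracting the minimum coefficient from both terms measures each peak relative to the background level of correlation rather than relative to zero.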
BAM files can contain both aligned and unaligned reads. The spreadsheet created during import shows the number of reads that were aligned to the reference genome. A large number of unaligned reads may be the result of poor quality sequencing data or alignment problems. It may also be useful to know how many reads map to more than one location in the genome if the options used during alignment supported multiple-mapped reads.
Select Alignments per read from the QA/QC section of the ChIP-Seq workflow
A new spreadsheet named Alignment_Counts will be generated (Figure 3).
Figure 3. Unaligned reads have been removed from these BAM files and the alignment options did not permit mapping to more than one location
The titles of columns 2. 0 Single End Alignments Per Read and 3. 1 Single End Alignment Per Read indicate that this is single end data. Column 2 shows the number of unaligned reads, while column 3 shows the number of reads that aligned exactly once. If the BAM files used in this tutorial included reads that mapped to more than one location in the genome, there would be additional columns.
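The idea behind the Alignment_Counts columns can be sketched by tallying alignment records per read name. The example below is an illustration under several assumptions: it works on SAM-format text lines rather than on a binary BAM file (which would first need to be decoded, for example with samtools view), it treats the data as single-end, and the function name and the three reads are hypothetical. It relies on the SAM convention that bit 0x4 of the FLAG field marks an unmapped read.

```python
from collections import Counter

UNMAPPED_FLAG = 0x4  # SAM FLAG bit set when the read is unmapped


def alignments_per_read(sam_lines):
    """Map 'number of alignments' to 'number of reads with that many'."""
    per_read = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        # Unmapped records register the read with zero alignments.
        per_read[name] += 0 if flag & UNMAPPED_FLAG else 1
    return Counter(per_read.values())


# Hypothetical records: read name, FLAG, reference name, position.
sam_lines = [
    "@HD\tVN:1.0",
    "read1\t0\tchr1\t100",
    "read2\t4\t*\t0",
    "read3\t0\tchr1\t500",
    "read3\t256\tchr2\t900",
]

counts = alignments_per_read(sam_lines)
print(sorted(counts.items()))  # [(0, 1), (1, 1), (2, 1)]
```

For the tutorial data only the entry for exactly one alignment would be non-zero, since unaligned reads were removed and multiple mapping was not permitted.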
In this section, we will create a list of peaks significantly enriched in the ChIP sample versus the control sample.
Select Create a list of enriched regions from the Peak Analysis section of the ChIP-Seq workflow
Select Specify New Criteria (Figure 1)
Figure 1. List creator for ChIP-Seq data allows you to create lists using preset or custom criteria
Configure the new criteria as shown (Figure 2).
Name the criteria p-value filtered
Select 1/regions (peaks) from the Spreadsheet drop-down menu
Select 11. p-value(Sample ID vs. mock) from the Column drop-down menu
Select significant with FDR of from the Include p-values drop-down menu and set the value to 0.05
Figure 2. Creating a criteria that includes regions significantly enriched in ChIP vs. mock
Select OK to add the criteria to the criteria list (Figure 3)
Figure 3. New criteria are added to the criteria list
Select Save
Select p-value filtered from the list of criteria (Figure 4)
Figure 4. Choosing criteria to save as lists
Select OK
The new spreadsheet will open (Figure 5).
Figure 5. Spreadsheet with regions that are significantly enriched in the ChIP sample vs. control
Other List Creator operations like the Venn Diagram, Union (Or), and Intersection (And) of the lists could be used to create different lists of enriched peaks. For example, you could filter on the intersection between Strand Separation FDR of 0.05 and Peaks not in mock or filter by scaled fold change or apply a minimum number of reads per million. The choice of what peaks you want to consider for downstream analysis depends on the goals and details of your experimental design.
With a list of enriched regions, you can now identify recurring patterns or motifs in these regions. Transcription factors bind sites throughout the genome, but each has a characteristic sequence it binds - a consensus sequence that appears in most of its binding sites. By searching for binding site motifs, you can determine the consensus sequence for a transcription factor and predict potential binding locations throughout the genome that may not have been found in your experiment.
Partek Genomics Suite detects de novo motifs using the Gibbs motif sampler (Neuwald et al., 1995) and can search for known transcription factor binding sites using a database such as JASPAR.
Select Motif Discovery from the Peak Analysis section of the ChIP-Seq workflow
Select Discover de novo motifs
Select OK
The Detect Motifs dialog will open to allow you to configure the search (Figure 1).
Figure 1. Configuring search parameters for de novo motifs
Select 1/p-value_filtered from the Spreadsheet with genomic regions drop-down menu
Set Number of Motifs to 1
Set Discover motifs of length to 6 bp to 16 bp
Set Result file to Motifs
Select OK
If you have not previously downloaded the reference genome on your computer, you may be asked if you would like to download the .2bit reference genome. If prompted, select Automatically download a .2bit file then select OK. If Partek Genomics Suite cannot connect to the internet, this option may not be available. If not, you will need to download the .2bit file from the UCSC Genome Browser and import it by selecting Manually specify a .2bit file and choosing the downloaded .2bit file. The reference genome map is required to determine which genes overlap the enriched peak regions and to display the aligned sequences in the Genome Viewer.
A motif visualization tab, Sequence Logo, will open and two spreadsheets will be generated. One spreadsheet, motifs (Motifs), contains information about the motif. The other, instances (Motifs_instances.txt), lists the genomic locations of the motif.
The Sequence Logo tab (Figure 2) opens after motif detection and displays the most significant motif found in the regions listed in the source spreadsheet.
Figure 2. Viewing the binding site for NRSF. Use the blue arrows to cycle through views of all motifs found (if there is more than one). Select Reverse to view the reverse complement sequence.
In this case, the motif finder discovered a motif in the NRSF-enriched regions that is 16 base pairs in length. The height of each position is the relative entropy (in bits) and indicates the importance of a base at a particular location in the binding site.
The title CT.TCC..GGT.CTG. is the consensus sequence for the sequence logo. Dots represent positions that contain more than one significant base across all reads in the motif. The dots can be replaced with characters representing the possible bases at each location by selecting Show nucleotide codes. A description of the IUPAC nucleotide codes is available at the UCSC Genome Browser.
To view the reverse complement of the motif, select Reverse.
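The letter heights in a sequence logo can be reproduced from base counts. The sketch below computes the relative entropy (in bits) of one motif position against a background distribution. The uniform background and the absence of pseudocounts are assumptions made for illustration; the software may use a different background or small-sample correction.

```python
import math


def relative_entropy_bits(counts, background=(0.25, 0.25, 0.25, 0.25)):
    """Relative entropy (bits) of one motif position.

    counts: observed counts for (A, C, G, T) at this position.
    background: assumed background frequency of (A, C, G, T).
    """
    total = sum(counts)
    bits = 0.0
    for count, q in zip(counts, background):
        if count > 0:  # a base that never occurs contributes nothing
            p = count / total
            bits += p * math.log2(p / q)
    return bits


# Hypothetical positions: fully conserved, two equally likely bases, no preference
print(relative_entropy_bits((10, 0, 0, 0)))  # conserved position
print(relative_entropy_bits((5, 5, 0, 0)))   # two bases equally likely
print(relative_entropy_bits((5, 5, 5, 5)))   # matches the background
```

With a uniform background, a fully conserved position reaches the maximum of 2 bits, and a position that matches the background has a height of 0.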
The motif information spreadsheet (Figure 3), Motifs, lists the information about all motifs discovered during de novo Motif Detection. This includes five columns describing each motif.
Figure 3. Viewing the Motifs spreadsheet
1. Counts gives the summed counts for each base call across all occurrences of the motif in the region list as {A, C, G, T}
2. Consensus Sequence gives the consensus sequence of the motif in IUPAC nucleotide codes
3. Motif ID gives a unique ID to each discovered motif using its row in the Motifs spreadsheet
4. Log Likelihood Ratio scores the relative likelihood that the pattern did not occur by chance, with larger numbers indicating that it is less likely to have occurred by chance
5. Background frequency (A,C,G,T) gives the frequency of each of the bases in all the sequences of that motif
You can bring up the Sequence Logo visualization of a listed motif by right-clicking on the row header and selecting Logo View from the pop-up menu.
The instances (Motif_instances) spreadsheet (Figure 4) is a child spreadsheet of the Motifs spreadsheet. It details all the locations of the motif(s) detected in the enriched regions. Each row lists a putative binding site for a motif. The columns give detailed information about the putative binding sites.
Figure 4. Viewing the instances spreadsheet
1-4. chromosome, start, stop, strand give the position
5. Motif ID gives the identity of the motif
6. instance gives the sequence of this instance of the motif
7. score gives the log ratio of the probability that this sequence was generated by the motif versus the background distribution. A higher number indicates a better chance that the sequence is an instance of the motif.
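As an illustration of the score column, the following sketch computes a log-odds score for one candidate sequence from a matrix of base counts. The natural logarithm, the pseudocount of 1, and the uniform background are assumptions; the tutorial does not specify the log base or smoothing used by the software.

```python
import math

BASES = "ACGT"


def log_odds_score(sequence, counts, background=(0.25, 0.25, 0.25, 0.25), pseudocount=1.0):
    """Sum over positions of log(P(base | motif) / P(base | background)).

    counts: one (A, C, G, T) count tuple per motif position; the sequence
    must have the same length as the motif.
    """
    if len(sequence) != len(counts):
        raise ValueError("sequence length must match motif length")
    score = 0.0
    for base, column in zip(sequence, counts):
        i = BASES.index(base)
        # Pseudocounts keep unseen bases from producing log(0)
        p = (column[i] + pseudocount) / (sum(column) + 4 * pseudocount)
        score += math.log(p / background[i])
    return score


# Hypothetical 3-bp motif that strongly prefers "ACG"
motif_counts = [(8, 0, 0, 0), (0, 8, 0, 0), (0, 0, 8, 0)]
print(log_odds_score("ACG", motif_counts))  # positive: looks like the motif
print(log_odds_score("TTT", motif_counts))  # negative: looks like background
```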
Select Motif discovery from the Peak Analysis section of the ChIP-Seq workflow
Select Search for known motifs
Select OK
Search for known motifs will search the JASPAR database for motifs that are over-represented in the list of sequences in the significant regions list. The JASPAR database will download automatically if needed during the Search for known motifs step. Downloading the JASPAR database will create a spreadsheet in your experiment named JASPAR.txt that contains all of the species-specific motifs in the database. To visualize the motifs, right-click on a row in the JASPAR.txt spreadsheet and select Logo View.
Before Search for known motifs runs, we need to configure the search (Figure 5).
Figure 5. Configuring a search for known motifs in the JASPAR database
Select 1/p-value_filtered (p-value filtered.txt) from the Choose Region Spreadsheet drop-down menu
Select Search using motifs specified in: for Choose Motifs to Search
Set Search using motifs specified in: to 2 (JASPAR.txt) using the drop-down menu
Set Search for to All Motifs using the drop-down menu
Set Sequence Quality >= to 0.7
Name the result file MotifSearch
Select OK
Because we are searching for around 1200 motifs, the process will take some time to complete. Progress is displayed in the progress bar in the lower left-hand side of the Search for Motif(s) in Sequences dialog (Figure 6).
Figure 6. Progress in the motif search will display in the progress bar
As in de novo motif discovery, two spreadsheets are created: the motif_summary (MotifSearch) spreadsheet (Figure 7) and the motif_instances (MotifSearch.instance) spreadsheet.
Figure 7. Viewing the results of motif search
In the MotifSearch spreadsheet, each motif used in the motif search is shown. The columns detail the results of the search for each motif that was found in the reads.
1. Motif gives the name or ID of the motif
2. Probability of Occurrence gives the probability of detecting a false positive for this motif in a random DNA sequence
3. Expected Number of Outcomes gives the Probability of Occurrence multiplied by the summed length of the reads
4. Actual Number of Occurrences gives a count of sequences that match the known motif in the reads
5. p-value is the uncorrected p-value (binomial test)
As you can see, REST, which is another name for NRSF, is near the top of the list as one of the most significantly over-represented motifs (Figure 7). This motif agrees with the motif found in the de novo motif detection step. Interestingly, other motifs appear a significant number of times in the ChIP-Seq peaks and may represent possible co-factors or regulators.
The motif_instances spreadsheet contains all instances of the motifs from the motif_summary spreadsheet in a format identical to the instances spreadsheet from de novo motif detection.
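The relationship between the expected number, the actual number, and the p-value in the motif_summary spreadsheet can be sketched as an upper-tail binomial test. The input values below are hypothetical, and the exact form of the test used by the software (for example, how the number of trials is defined) is an assumption.

```python
import math


def binomial_pmf(i, n, p):
    """P(X = i) for X ~ Binomial(n, p), computed in log space for stability."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p) + (n - i) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def binomial_upper_tail(k, n, p):
    """P(X >= k): chance of seeing k or more matches if matches occur at random."""
    lower = sum(binomial_pmf(i, n, p) for i in range(k))
    return min(1.0, max(0.0, 1.0 - lower))


# Hypothetical values for one motif
probability_of_occurrence = 0.001  # chance of a random match at one position
summed_length = 10_000             # total sequence length searched
actual_occurrences = 25

expected = probability_of_occurrence * summed_length
p_value = binomial_upper_tail(actual_occurrences, summed_length, probability_of_occurrence)
print(expected, p_value)
```

Because the tail is computed as one minus the lower tail, this sketch loses precision for extremely small p-values; it is meant to show the logic rather than to replace the reported values.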
While the motif_instances spreadsheet contains every instance of every motif, it may be useful to create a spreadsheet with just instances of one motif or a select group of motifs. Let's do this for both REST motifs.
Select the motif_instances spreadsheet in the spreadsheet tree
Right-click the 5. Motif Name column
Select Find / Replace / Select... from the pop-up menu (Figure 8)
Figure 8. Finding all REST peaks (step 1)
Set Find What: to REST
Select By Columns for Search:
Select Only in column with 5. Motif Name selected from the drop-down menu
Select Select All (Figure 9)
Figure 9. Selecting all REST instances in motif_instances spreadsheet (step 2)
This finds and selects every instance of REST in column 5. Motif Name.
Select Close
In the motif_instances spreadsheet, the selected rows are highlighted.
Right-click on the first highlighted row visible; in this example, we see row 13196
Select Filter Include from the pop-up menu (Figure 10)
Figure 10. Filtering for selected rows
The spreadsheet will now include 2098 rows and a black and yellow bar will appear on the right-hand side of the spreadsheet (Figure 11). The black and yellow bar is a filter indicator showing the fraction of the spreadsheet currently visible as yellow and the filtered fraction as black.
Figure 11. Filtered motif_instances spreadsheet containing 2098 instances of the REST motifs
To create a spreadsheet that contains only the REST instances, we can clone the motif_instances spreadsheet while the filter is applied.
Right-click on motif_instances in the spreadsheet navigator
Select Clone... from the pop-up menu
Set the Name of resulting copy as REST
Select 1/p-value_filtered/motif_summary (MotifSearch) from the Create as a child of spreadsheet drop-down menu
Select OK
This creates a temporary spreadsheet rest from the filtered motif_instances spreadsheet. We can now save the new spreadsheet.
Select rest from the spreadsheet tree
Name the file REST
Select Save
We can now remove the filter from the source motif_instances spreadsheet.
Select motif_instances from the spreadsheet tree
Right-click the filter bar
Select Clear Filter
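The find, filter, and clone steps above amount to copying the rows whose motif name contains a search term while leaving the source table unchanged. A minimal sketch, assuming the instances have been exported as a list of dictionaries (the key names and example rows are hypothetical):

```python
def filter_instances(rows, query, column="motif_name"):
    """Return a new list with the rows whose motif name contains the query text."""
    return [row for row in rows if query in row[column]]


# Hypothetical exported rows from the motif_instances spreadsheet
instances = [
    {"chromosome": "chr1", "start": 100, "stop": 121, "motif_name": "REST"},
    {"chromosome": "chr2", "start": 500, "stop": 512, "motif_name": "SP1"},
    {"chromosome": "chr3", "start": 900, "stop": 921, "motif_name": "REST"},
]

rest_only = filter_instances(instances, "REST")
print(len(rest_only), len(instances))  # the source list is not modified
```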
Neuwald, A. F., Liu, J.S., & Lawrence, C.E. (1995). Gibbs motif sampling: detection of outer membrane repeats (Vol. 4). Protein Science.
During an earlier step of this workflow, a spreadsheet named unexplained_regions was generated. This spreadsheet contains locations where reads map to the genome but are not annotated by the transcript database, in this case, RefSeqGene. The unexplained_regions spreadsheet is potentially very interesting as it may contain novel findings.
Right-click column 6. Average Coverage and select Sort Descending from the menu
Select Find Overlapping Genes from the Tools option in the command toolbar (Figure 1)
Figure 1. Selecting Find Overlapping Genes from Tools in the command toolbar
Select Add a new column with the gene nearest to the region in the Find Overlapping Genes dialog (Figure 2)
Select OK
Figure 2. Find Overlapping Genes
Select RefSeq Transcripts - 2017-05-02 from the Output Overlapping Features dialog (Figure 3)
Please note that it is recommended that you annotate with the same database used when you performed mRNA quantification.
Select OK
Figure 3. Select the database to search for overlapping features
The closest overlapping feature and the distance to it are now included as columns 7. Overlapping Features and 8. Nearest Features in the unexplained_regions spreadsheet.
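The idea behind finding the nearest gene can be sketched with simple interval arithmetic. This is not the software's implementation; the half-open coordinates, the distance convention (0 for overlapping or directly adjacent intervals), and the gene names are assumptions for illustration.

```python
def nearest_feature(region, features):
    """Find the annotated feature closest to a region on the same chromosome.

    region: (start, stop) in half-open coordinates.
    features: list of (name, start, stop) tuples, also half-open.
    Returns (name, distance_in_bp, overlaps), or None if there are no features.
    """
    start, stop = region
    best = None
    for name, f_start, f_stop in features:
        overlaps = f_start < stop and start < f_stop
        # Gap between the two intervals; 0 if they overlap or touch
        distance = max(0, f_start - stop, start - f_stop)
        if best is None or distance < best[1]:
            best = (name, distance, overlaps)
    return best


# Hypothetical annotation on one chromosome
genes = [("GENE_A", 100, 200), ("GENE_B", 500, 800)]
print(nearest_feature((210, 260), genes))  # just downstream of GENE_A
print(nearest_feature((150, 250), genes))  # overlaps GENE_A
```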
Right-clicking on a row header and selecting Browse to Location will show the reads mapped to the chromosome. For this tutorial, a couple of genes are selected to show regions that are located after a known gene or in the intron of a gene.
Right-click row 39 and select Browse to location from the pop-up menu
Select the Chromosome View tab to view a region within an intron of UNC45B. This may be a novel exon (Figure 4)
Figure 4. A region within an intron of UNC45B that might be a novel exon
Right-click row 12576 and select Browse to location to go to a region that starts 1 bp after CD82.
This peak may represent an extended exon (Figure 5).
Figure 5. A region that starts 1 bp after CD82 that might represent an extended exon
While RefSeq was used to identify overlapping features, the choice of which database to use will depend on the biological context of your experiment. For example, you may wish to utilize promoter or miRNA databases if you are interested in regulation of expression.
Chromosome View in the Partek Genomics Suite software enables visualization of differential expression and alternative splicing results in RNA-Seq data.
Select New Track
Select Add a track from spreadsheet and select 1/transcripts (RNA-Seq_results.transcripts) from the drop-down menu
Select Next > (Figure 1)
Figure 1. Adding a new track to Chromosome View
The new track will be added to Chromosome View (Figure 2).
Figure 2. Viewing isoform proportion track in Chromosome View
At this point, you may find it useful to alter track properties. Each track can be individually configured. For example, isoform information will be easier to visualize if we remove a few tracks.
Select Cytoband (hg19) in the Tracks panel
Select Remove Track to remove it from the viewer
Repeat for Genomic Label, RefSeq Transcripts - 2017-05-02 (hg19) (-), Legend: Base Colors, and Genome Sequence
Next, we are going to view a single gene, SLC25A3, with differentially expressed isoforms.
Type SLC25A3 in the Plot Position bar at the top of the window and hit Enter. The browser will browse to the gene
To further improve our visualization of SLC25A3 isoforms, we can modify the remaining tracks.
Select RefSeq Transcripts - 2014-01-03 (hg19) (+) from the Tracks panel
Change Track height to 60 using the slider
Select Apply to change track height
Repeat steps to set each Bam Profile track to a height of 40 to complete our changes
Move the Isoform proportion track to below the RefSeq Transcripts track by selecting and dragging it up the list (Figure 3)
Figure 3. Changing tracks in Chromosome View to facilitate visual analysis of isoform proportions
The Muscle, Brain, Heart, Liver, and genomic label tracks were described in a previous section. Here, the focus is on the Isoform proportion track, which visualizes differential expression and alternative splicing. The reads that are mapped to a certain sample and the proportion of the transcript expressed in that sample are colored to match the Bam Profile track of that sample. In this screenshot, Brain is yellow, Heart is green, Liver is red, and Muscle is orange.
SLC25A3 was reported by Wang et al. (Nature, 2008) to have “mutually exclusive exons (MXEs)”. The reads mapped to the 3 transcripts of this gene in each of the tissue samples are shown in the Genome Viewer in the isoform proportion track. The relative abundances of the individual transcripts of this gene are shown by the height of the color-coded bars on each transcript in the isoform proportion track. Note that transcript NM_213611 has low expression while transcripts NM_005888 and NM_002635 have higher expression. Also note that NM_005888 is expressed primarily in the heart and muscle, indicated by the primarily green and orange bars, while NM_002635 is expressed primarily in the brain and liver, indicated by the primarily yellow and red bars.
Wang, E.T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., Kingsmore, S.F., Schroth, G.P., & Burge, C.B. Alternative isoform regulation in human tissue transcriptomes. Nature, 2008; 456: 470-6.
The Kaplan-Meier estimator is a non-parametric statistic used to estimate the survival function where time-to-event incidence varies over time in a population. The Kaplan-Meier estimator is displayed as a Kaplan-Meier curve, a series of declining horizontal steps. The Kaplan-Meier curve should approach the true survival curve for the population with a sufficiently large sample size. Kaplan-Meier survival analysis can handle censored data, i.e., data where the event is not observed for some subjects.
To perform Kaplan-Meier survival analysis, at least two pieces of information (one column each) must be provided for each sample: time-to-event (a numeric factor) and event status (categorical factor with two levels). Event status indicates whether the event occurred or the subject was censored (did not experience the event). Time-to-event indicates the time elapsed between the enrollment of a subject in the study and the occurrence of the event.
Common examples of Kaplan-Meier analysis include the fraction of patients who remain disease-free after cancer remission. In this case, the event would be disease recurrence and patients would be listed as censored if they do not experience recurrence during the study or if they drop out of the study before experiencing recurrence.
Partek Genomics Suite does not impose any limitation on the labels used for the event and censored categories; in this tutorial, the events are coded as either "death" or "censored". If a subject is still alive at the end of the study, time-to-event indicates the period between enrollment and the end of the study. If a subject dropped out of the study, time-to-event indicates the period between enrollment and the last recorded time point.
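The estimator itself is compact enough to sketch directly. The code below follows the standard Kaplan-Meier product-limit formula, using the "death" and "censored" labels from this tutorial; the function name and the example data are hypothetical and are not taken from the tutorial data set.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times: time-to-event for each subject.
    events: True if the event occurred, False if the subject was censored.
    Returns a list of (time, survival_probability), one entry per distinct
    time at which at least one event occurred.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather every subject whose recorded time equals t
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Subjects with an event or censored at t leave the risk set
        at_risk -= removed
    return curve


# Hypothetical data: survival in years and event status
years = [1, 2, 2, 3, 4, 5]
status = ["death", "censored", "death", "death", "censored", "death"]
curve = kaplan_meier(years, [s == "death" for s in status])
for t, s in curve:
    print(t, round(s, 4))
```

Censored subjects never cause a step down, but they do reduce the number at risk for later time points, which is why the steps become larger toward the end of the curve.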
To begin, you should have the Survival Tutorial data set open in Partek Genomics Suite.
Select Stat from the main toolbar
Select Survival Analysis then Kaplan-Meier from the Stat menu (Figure 1)
Figure 1. Invoking Kaplan-Meier
The Kaplan-Meier dialog will open. Please note that in this tutorial data set, column 1. Survival (years) indicates the survival time of each patient in years and column 2. Event indicates the event status for each patient, death or censored.
Set Time Variable to 1. Survival (years) using the drop-down menu
Set Event Variable to 2. Event using the drop-down menu
Only numeric data are displayed in the Time Variable drop-down list and only categorical data with two categories are displayed in Event Variable.
Set Event Status to death using the drop-down menu (Figure 2)
Event Status should be set to the primary event outcome.
Figure 2. Configuring the Kaplan-Meier dialog
Select 3. p53 status from the Candidate(s) panel
Select Add Factor > to add 3. p53 status to the Strata (Categorical) panel
This will test the difference in survival rates between the p53 mutants (mutant) and samples with wild-type p53 (wt).
Select OK to run the test (Figure 3)
Figure 3. Configuring the Kaplan-Meier dialog to test the difference in survival rates between patients with different p53 status
The Kaplan-Meier Plot will open in a new tab (Figure 4).
Figure 4. Kaplan-Meier plot comparing the survival curves between two groups.
The horizontal axis indicates time-to-event; the vertical axis shows the cumulative percentage of survival. Censoring is shown as a triangle; event occurrence is shown as a step-down in the plot. Partek Genomics Suite performs two statistical tests to compare the survival curves: a log-rank test and the Wilcoxon-Gehan test. Low p-values indicate that the groups have significantly different survival times.
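For readers who want to see how a log-rank comparison of two groups works, here is a minimal sketch of the standard two-group log-rank statistic. It is a textbook formulation, not necessarily the exact computation performed by the software, and the example data are hypothetical. The p-value uses the fact that the upper tail of a chi-square distribution with 1 degree of freedom equals erfc(sqrt(x / 2)).

```python
import math


def log_rank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Inputs are lists; events are True for an event.

    Returns (chi_square, p_value) with 1 degree of freedom.
    """
    all_times = times_a + times_b
    all_events = events_a + events_b
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    observed_a = 0.0
    expected_a = 0.0
    variance = 0.0
    for t in event_times:
        # Subjects still at risk just before time t
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        # Events occurring exactly at time t
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi_square = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Hypothetical groups: group A has earlier events than group B
chi_square, p_value = log_rank_test([1, 2, 3], [True] * 3, [4, 5, 6], [True] * 3)
print(round(chi_square, 4), round(p_value, 4))
```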
Select the Analysis tab to switch to the Kaplan-Meier results spreadsheet (Figure 5)
Figure 5. Kaplan-Meier spreadsheet. Each row represents occurrence of at least one significant event.
The spreadsheet is organized into two sections: the analysis of the p53 mutant group and the analysis of the p53 wild type group. Each row represents a time point at which at least one event occurred; the columns provide the following information:
1. Identifies the group membership (according to the strata)
2. Survival time corresponds to the entries in column 1. of the original (Survival_Tutorial) spreadsheet. At each given time, at least one event, either death or censored, was recorded.
3. Probability of Survival: cumulative probability of survival at a given time point (also known as KM survival estimate). Cumulative probability is the probability of surviving all of the intervals before this time point. As time increases, the cumulative survival probability decreases as events occur.
4. Number of group members at risk (i.e., have not experienced the event). The count in each row is calculated by subtracting the number of deaths and censored events in the row above from the number at risk in the row above.
5. Count of deaths at this time point in the group
6. Count of censored events at the given time in the group
7. Total number of deaths in all groups at the given time
8. Total number of participants at risk in all groups. The count in each row is calculated by subtracting the number of deaths and censored events at the previous time point in both groups from the total number at risk at the previous time point
9. Natural logarithm of column 3.; also noted as ln(KM)
10. Natural logarithm of the negative value of column 9., i.e., ln(-ln(KM)). A plot of ln(-ln(KM)) vs. ln(t) is often used to test the proportional hazards assumption. To visualize the risk, select this column and select View > Log Log S Plot (Figure 6).
Figure 6. Log Log S plot of KM data. As the lines are mostly parallel and do not cross, the log-rank test assumptions are valid. The Wilcoxon-Gehan test would have more power if the lines crossed or were not parallel, but it performs less well when the data are extensively censored.
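The values plotted in a log-log survival plot can be derived from any Kaplan-Meier curve. A minimal sketch, assuming the curve is available as (time, survival probability) pairs; points where the transformation is undefined (survival of exactly 0 or 1, or a time of 0) are skipped, which is a choice made here for illustration.

```python
import math


def log_log_survival(curve):
    """Convert (time, KM estimate) pairs into (ln(t), ln(-ln(KM))) points."""
    points = []
    for t, s in curve:
        if t > 0 and 0 < s < 1:  # the transform is undefined outside this range
            points.append((math.log(t), math.log(-math.log(s))))
    return points


# Hypothetical KM estimates for one group
km_curve = [(1, 0.9), (2, 0.5), (5, 0.0)]
for x, y in log_log_survival(km_curve):
    print(round(x, 4), round(y, 4))
```

Computing these points for each stratum and plotting them on the same axes gives the lines whose parallelism is assessed in the plot.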
Binding sites for the DNA-binding protein of interest are indicated by peaks of enriched sequencing read density. How are peaks calculated from reads in Partek Genomics Suite?
Using the effective fragment length calculated by Cross Strand-Correlation, each read is extended in the 3' direction by the effective fragment length and overlapping extended reads are merged into single peaks. For paired-end reads, the distance between paired reads is used as the fragment length and overlapping fragments are merged into peaks. For peak detection, the genome is divided into windows of a user-defined size and the number of fragments whose mid-points fall within each window is counted. A model for expected read density (a zero-truncated negative binomial) is used to determine which peaks are significantly enriched over a user-defined false discovery rate (FDR). See the for more information on the peak-finding algorithm and tips for setting the Fragment extension and window sizes.
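The read extension and window counting described above can be sketched as follows. This covers only the counting step for single-end reads; the statistical model (the zero-truncated negative binomial) and peak merging are not implemented, and the coordinate conventions (half-open intervals, read start as the leftmost aligned base) and example reads are assumptions.

```python
from collections import Counter


def extend_read(start, length, strand, fragment_length):
    """Extend a single-end read in its 3' direction to the fragment length.

    Returns a half-open (start, stop) interval on the forward coordinate axis.
    """
    if strand == "+":
        return start, start + fragment_length
    # For a reverse-strand read, the 5' end is the rightmost aligned base,
    # so extending toward 3' means extending to the left.
    end = start + length
    return max(0, end - fragment_length), end


def count_fragments_per_window(fragments, window_size):
    """Count fragments whose midpoint falls in each window (keyed by window index)."""
    counts = Counter()
    for f_start, f_stop in fragments:
        midpoint = (f_start + f_stop) // 2
        counts[midpoint // window_size] += 1
    return counts


# Hypothetical 36-bp reads: (start, length, strand)
reads = [(100, 36, "+"), (120, 36, "+"), (300, 36, "-")]
fragments = [extend_read(s, n, strand, fragment_length=111) for s, n, strand in reads]
window_counts = count_fragments_per_window(fragments, window_size=111)
print(fragments)
print(dict(window_counts))
```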
Select spreadsheet 1 (ChIP-Seq) from the spreadsheet tree
Select Detect peaks from the Peak Analysis section of the ChIP-Seq workflow
The Peak Detection dialog will open. Configure the dialog as shown (Figure 1).
Figure 1. Configuring the peak detection dialog. The appropriate settings will depend on your experimental design and data.
Select Maximum average fragment size for Fragment Extensions
Set Maximum average fragment size to 111
Your choice for Maximum average fragment size is based on your experimental design. If you have used an antibody that binds DNA as the control antibody such as an IgG control, you could use different fragment lengths for each sample based on its effective fragment length by selecting the Individual maximum fragment sizes option. Here, we have chosen the effective fragment length of 111 base pairs calculated using Cross Strand-Correlation.
Select Reference sample from Reference sample
Select mock from the Reference sample drop-down menu
Set Set the window size to (base pairs) to 111
The peak detection algorithm divides the genome into windows and identifies windows enriched for reads based on an FDR cut-off value. Here, we have chosen to match the window size to the maximum average fragment size.
Select Overlapping for How should windows be merged?
Set The fraction of false positive peaks allowed to 0.001
The Peak Cut-off FDR determines the cut-off for calling peaks. Setting a lower value demands greater differences between mock and chip samples for a peak to be called; a false discovery rate of 0.001 anticipates 1 false positive per 1000 peaks called.
Select Entire region, spanning all merged windows for Which regions should be reported?
Optimal peak detection settings depend on your experimental design and data, so fine-tuning may be required. Because transcription factor binding sites tend to have localized and sharp clusters of reads, the window size used during the analysis of a transcription factor study can be left relatively small, approximately the same as the average fragment length, and the option to allow for gaps between enriched windows does not need to be used. Region in the window with most reads could also be selected to report a narrower region for each peak call. Conversely, histone modification peaks tend to be subtle and diffuse. To analyze histone modification ChIP-seq data, larger window sizes, combining neighboring windows into larger windows using Within a gap distance of, and reporting entire regions using Entire region, spanning all merged windows might be appropriate.
A convenient way to visualize the relationship between window size and gap size is to select the More info link at the top of the Peak Detection dialog box. A simulated read count histogram will open below the Description of Peak Detection section (Figure 2). The blue bars underneath the histogram will reflect how regions are detected and reported using your current Peak Detection settings. Try changing the How should windows be merged or Which regions should be reported? options to visualize their effects on peak detection.
Figure 2. The visual guide helps show the impact of window size and result reporting settings on peak calling.
Select OK to run the peak detection algorithm with your chosen settings
Peak Detection generates a new child spreadsheet, regions (peaks) (Figure 3).
Figure 3. Peaks spreadsheet lists regions with significant peak enrichment with one row per region.
This spreadsheet is sorted by chromosome number and genomic location. Each row represents one genomic region of peak enrichment. The columns are:
Column 1. Chromosome gives the chromosome location of region
Column 2. Start gives the start of region (inclusive)
Column 3. Stop gives the end of region (exclusive)
Column 4. Sample ID gives the sample containing the enriched region
Column 5. Interval length gives the length of the region, Stop - Start, in base pairs
Column 6. Maximum Extended Reads in Window gives the greatest number of extended reads in any of the windows of a region
Column 7. Reads per Million (RPM) divides the read count for the region by the total number of aligned reads in the sample (in millions). This helps you compare peaks across samples, especially when there is a large difference in the number of aligned reads between samples.
Column 8. Mann-Whitney p-value identifies the separation between forward and reverse peaks for single-end reads using the Mann-Whitney U-test. Lower p-values indicate better separation. This p-value can be used if there was no control sample or to eliminate regions called due to PCR bias.
Columns 9-10. Total reads in region gives the total number of non-extended reads for each sample in the given genomic region. One column for each sample.
Column 11. p-value(Sample ID vs. mock) compares the sample specified in column 4 to the reference sample for this genomic region using a one-tailed binomial test. A low p-value means there are significantly more reads in the sample specified in column 4 than in the mock sample. This column is only included if a reference sample is specified.
Column 12. scaled fold change (Sample ID vs. mock) compares intensity of signal between the sample specified in column 4 to the reference sample in the given genomic region. The fold-change is scaled by a ratio of the number of reads for each sample (IP vs. control) on a per-chromosome basis. Scaled fold changes >1 indicate more enrichment in the IP-sample than in the control sample. This column is only included if a reference sample is specified.
Columns 13-14. overlap percent gives the percentage of the given genomic region that overlaps a called peak region from the indicated sample. For example, the values of 100% in column 13 and 0% in column 14 indicate regions detected in the chip sample, but not in the mock sample. Similarly, regions with the value of 100% in column 14 were detected in the mock sample.
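Several of the columns described above are simple derived quantities. The sketch below shows how interval length, reads per million, and overlap percent could be computed, assuming half-open coordinates (inclusive start, exclusive stop) and non-overlapping peak intervals. The function names and input values are hypothetical, and the choice of read count used for RPM is an assumption.

```python
def interval_length(start, stop):
    """Length in base pairs of a region with inclusive start and exclusive stop."""
    return stop - start


def reads_per_million(read_count, total_aligned_reads):
    """Normalize a read count by the sample's total aligned reads, in millions."""
    return read_count / (total_aligned_reads / 1_000_000)


def overlap_percent(region, peaks):
    """Percentage of a region covered by a sample's peaks.

    Assumes the peaks do not overlap each other, so covered bases are
    not counted twice.
    """
    start, stop = region
    covered = sum(
        max(0, min(stop, p_stop) - max(start, p_start)) for p_start, p_stop in peaks
    )
    return 100.0 * covered / (stop - start)


# Hypothetical region and sample values
print(interval_length(1000, 1333))
print(reads_per_million(50, 20_000_000))
print(overlap_percent((100, 200), [(150, 300)]))
```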
This tutorial will illustrate:
Note: the workflow described below is enabled in Partek Genomics Suite version 7.0 software. Please fill out the form on to request this version or use the Help > Check for Updates command to check whether you have the latest released version. The screenshots shown within this tutorial may vary across platforms and across different versions of Partek Genomics Suite.
Survival analysis is a branch of statistics that deals with modeling of time-to-event. In the context of “survival,” the most common event studied is death; however, any other important biological event could be analyzed in a similar fashion (e.g., spreading of the primary tumor or occurrence/relapse of disease). The significant event should be well-defined and occur at a specific time. As the primary outcome event is typically unfavorable (e.g., death, metastasis, relapse, etc.), the event is called a “hazard.” Survival analysis tries to answer questions such as: What is the proportion of a population who will survive past a certain time (i.e., what is the 5-year survival rate)? What is the rate at which the event occurs? Do particular characteristics have an impact on survival rates (e.g., are certain genes associated with survival)? Is the 5-year survival rate improved in patients treated by a new drug?
An important feature of survival analysis is the presence of “censored” data. Censored data refers to subjects that have not experienced the event being studied. For example, medical studies often focus on survival of patients after treatment so the survival times are recorded during the study period. At the end of the study period, some patients are dead, some patients are alive, and the status of some patients is unknown because they dropped out of the study. Censored data refers to the latter two groups. The patients who survived until the end of the study or those who dropped out of the study have not experienced the study event "death" and are listed as "censored".
The tutorial data set (236 samples) is a subset of fresh-frozen breast tumor specimens from a population-based cohort of 315 women with breast cancer. The clinicopathological characteristics accompanying each tumor include p53 status (mutant or wild-type), estrogen receptor (ER) status, progesterone receptor (PgR) status, lymph node status, tumor size, and patient age. Gene expression was assessed on Affymetrix® U133A and U133B arrays (Miller LD et al., GSE3494). Please note that Affymetrix data have been chosen for illustration purposes only and that the same functionality can be used to analyze any data set. The raw data files (.CEL) have already been imported into Partek Genomics Suite; samples with no survival time data, as well as sample attributes irrelevant for the survival analysis, were removed, and the final spreadsheet was saved in Partek Genomics Suite (Survival_Tutorial.fmt and Survival_Tutorial.txt). To go through the tutorial, download the data set, unzip the downloaded folder, and save it in an easily accessible location on your computer.
After saving the unzipped file, you can open it in Partek Genomics Suite.
Select File from the main toolbar
Select Open...
Browse to the folder containing the tutorial data set and select the file Survival_Tutorial.fmt
The data spreadsheet will open (Figure 1). Each row represents a tumor sample from a breast cancer patient. Sample attributes are listed in columns 1-8, while columns 9+ are intensity values for the probe sets listed in the column headers.
Figure 1. Viewing the sample data (one sample per row) for the survival analysis tutorial
Miller LD, Smeds J, George J, Vega VB et al. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. PNAS, 2005; 102(38): 13550-5.
This tutorial will illustrate:
Note: the workflow described below is enabled in Partek Genomics Suite version 7.0 software. Please fill out the form on to request this version or use the Help > Check for Updates command to check whether you have the latest released version. The screenshots shown within this tutorial may vary across platforms and across different versions of Partek Genomics Suite.
Copy number analysis asks whether there are regions of the genome with altered abundance. Of particular interest are any genes within those regions and how a change in gene abundance might alter phenotype. Partek Genomics Suite software allows these questions to be answered by analyzing a variety of commercially available assays for copy number analysis. SNP-genotyping arrays with closely spaced genomic markers (Affymetrix and Illumina) and comparative genomic hybridization (CGH) arrays (Agilent, NimbleGen, or custom spotted arrays) can be imported into Partek Genomics Suite and analyzed.
When performing copy number analysis, it is important to remember an inherent limitation of copy number region analysis: the inability to detect copy-neutral events caused by copy-number-neutral loss of heterozygosity (LOH) or copy-number-neutral allelic imbalance. This limitation can be addressed by supplementing copy number analysis with SNP genotyping data. Partek Genomics Suite supports both LOH and allele-specific copy number (AsCN) analysis with dedicated workflows. Tutorials on LOH and AsCN analysis are also available.
Ramakrishna M, Williams LH, Boyle SE, Bearfoot JL et al. Identification of candidate growth promoting genes in ovarian cancer through integrated copy number and expression analysis. PLoS One 2010;5(4).
We can visualize the reads and enriched regions using the Partek Genome Viewer.
Select ChIP-Seq from the spreadsheet tree
Select Chromosome View from the Visualization section of the ChIP-Seq workflow
The Chromosome View tab will open. The default tracks are the transcript tracks, the sequence visualization tracks, and the cytoband track (Figure 1).
Figure 1. Viewing ChIP-Seq data in Chromosome View
We can add additional tracks to view the results of our analysis.
Select New Track from the left-hand side of the Tracks tab
Select Add tracks from a list of spreadsheets
Select Next >
Select the p-value_filtered.txt and Motifs_instances.txt tracks on the Tracks panel
De-select Aligned Reads
Select Create (Figure 2)
Figure 2. Adding spreadsheet tracks to Genome Viewer
This will display the enriched regions found in the samples and the locations of the motif instances from the de novo motif discovery (Figure 3).
Figure 3. Adding p-value and motif binding tracks
The two tracks display the detected regions at each location on the chromosome for the NRSF-enriched sample, chip, and align them to the de novo discovered motif binding sites. If you have not gone through the steps for peak detection and motif discovery, these tracks will not be available.
Select RefSeq Transcripts - 2016-10-18 (hg18) (+) on the Tracks panel
Set Strand to Both using the drop-down menu in the Track section of the track properties panel
Select Apply
The RefSeq Transcripts track now shows transcripts from both strands. We can remove the dedicated (-) strand track.
Select RefSeq Transcripts - 2016-10-18 (hg18) (-) on the Tracks panel
Select Remove Track
Several other tracks will not be used in our analysis; we can remove them as well.
Select Legend: Base Colors, Genome Sequences (hg18), and Cytoband (hg18) on the Tracks panel by left clicking on each while holding the Ctrl key on your keyboard
Select Remove Track
The height of many tracks can be adjusted by changing the Track height setting in the Profile tab.
Select Regions (1/p-value_filtered (p-value filtered.txt)) on the Tracks panel
Set Track height to 25 using the slider
Select Apply
Repeat for Regions (1/p-value_filtered/motifs/instances (Motifs_instances.txt))
This view clearly shows the Bam Profile tracks for chip and mock samples alongside smaller tracks indicating motif binding sites, significant peaks, and RefSeq transcripts (Figure 4).
Figure 4. Genome Viewer tracks can be modified to facilitate data exploration and analysis
We can also see strand-specific information by color-coding forward and reverse strand read information on the Bam Profile (chip) track.
Select Bam Profile (chip) on the Tracks panel
Select Alignments in the Display Color Options section of the Style tab
Select Strands under Color by
Select Apply
A legend track will be added at the bottom of the viewer showing that forward strand reads are colored green and reverse strand reads are colored red. We can move this track to below Bam Profile (chip).
Left-click Legend: strands in the Tracks panel and drag it up the list to below Bam Profile (chip) (Figure 5)
Figure 5. Move tracks by left-clicking and dragging them in the Tracks panel
Chromosome View opens to a whole-chromosome view of chromosome 1. To analyze the data we can zoom in on the data (Figure 6).
Figure 6. Viewing the zoomed-in view of an enriched region showing two possible binding sites at the location
We can practice this using a gene highlighted in the paper describing this data set, NEUROD1.
Type NEUROD1 in the navigation bar
Select Enter on your keyboard to navigate to NEUROD1 (Figure 7)
Figure 7. Zooming to the NEUROD1 gene
NEUROD1 contains a binding site for the NRSF motif. Notice that the enriched region for the NRSF transcription factor is within the NEUROD1 gene. As discussed in the Johnson et al. paper, NRSF is implicated in the repression of NEUROD1, but it was unknown exactly where the NRSF binding occurred. This data indicates that the binding site is within the NEUROD1 gene itself, as shown by the orange box in the Regions track. We can zoom in further to view the forward and reverse strand read histogram.
Left-click and drag to draw a box around the predicted binding site (Figure 8)
Figure 8. Using Zoom/Navigate Mode. Drawing a box around a region on any track will zoom all tracks to that region
Zoomed-in further, we can see the intersecting peaks from forward and reverse strand reads (Figure 9).
Figure 9. Zoomed-in view of NEUROD1 binding site for NRSF
Cox regression (Cox proportional-hazards model) tests the effects of several factors (predictors) on survival time. Predictors that lower the probability of survival at a given time are called risk factors; predictors that increase the probability of survival at a given time are called protective factors. The Cox proportional-hazards model is similar to a multiple logistic regression that considers time-to-event rather than simply whether an event occurred or not.
In this tutorial, we will use Cox Regression to test the effects of tumor gene expression on survival time while accounting for tumor size.
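For reference, the model fitted for each probe set can be written in the standard Cox proportional-hazards form. The notation is introduced here for illustration only and does not appear in the Partek dialogs: h_0(t) is the unspecified baseline hazard, x_gene is the probe set intensity, x_size is the tumor size in mm, and the beta terms are the fitted coefficients.

```latex
h(t \mid x_{\mathrm{gene}}, x_{\mathrm{size}})
  = h_0(t)\,\exp\!\left(\beta_{\mathrm{gene}}\,x_{\mathrm{gene}} + \beta_{\mathrm{size}}\,x_{\mathrm{size}}\right),
\qquad
\mathrm{HR}_{\mathrm{gene}} = e^{\beta_{\mathrm{gene}}}
```

Under this form, the hazard ratio for a predictor is the factor by which the hazard changes for a one-unit increase in that predictor while the other predictor is held fixed.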
To begin, you should have the Survival Tutorial data set open in Partek Genomics Suite.
Select Stat from the main toolbar
Select Survival Analysis then Cox Regression from the Stat menu (Figure 1)
Figure 1. Invoking Cox Regression
The Cox Regression dialog will open. Please note that in this tutorial data set, column 1. Survival (years) indicates the survival time of each patient in years and column 2. Event indicates the event status for each patient, death or censored.
Set Time Variable to 1. Survival (years) using the drop-down menu
Set Event Variable to 2. Event using the drop-down menu
Only numeric data are displayed in the Time Variable drop-down list and only categorical data with two categories are displayed in Event Variable.
Set Event Status to death using the drop-down menu (Figure 2)
Event Status should be set to the primary event outcome. All Response Variables will be automatically selected for Predictor. This means that Cox Regression will test every probe set for association with the survival (time-to-event).
Figure 2. Configuring the Cox Regression dialog
Co-predictors are numeric or categorical factors that will be included in the regression model. To evaluate the association between tumor size and gene expression, we can add tumor size to the co-predictors list.
Select 7. tumor size (mm) from the Candidate(s) panel
Select Add Factor > to add it to the Co-predictor(s) panel
Advanced options such as the inclusion of interactions between predictors and co-predictors can be accessed by selecting Model... (Figure 3) and the Results... button invokes a dialog (Figure 4) with additional output options for the results spreadsheet. We do not need to adjust any of the advanced model or output options for this tutorial.
Figure 3. Configuring advanced options for Cox Regression
Figure 4. Configuring output options for Cox Regression
Select OK to run Cox Regression (Figure 5)
Figure 5. Configuring Cox Regression to assess the effect of gene expression and tumor size on survival
The spreadsheet generated by Cox Regression (Figure 6) includes a row for each probe set; the columns provide the following information:
1. Column # - Column number of probe set in probe intensities spreadsheet
2. Probeset ID - ID of probe set in probe intensities spreadsheet
3. HRatio(gene) - Hazard ratio for the probe set
4. LowCI(gene) - lower 95% confidence boundary of the hazard ratio for the probe set
5. UpCI(gene) - upper 95% confidence boundary of the hazard ratio for the probe set
6. p-value(gene) - P-value of the corresponding Chi-squared test. A low value indicates that the predictor (probe set) is significantly associated with survival time; the hazard ratio shows whether the association is with shorter or longer survival
7. to 10. - Effects of the co-predictor on survival time; for each co-predictor, a similar set of columns is added
11. modelfit(0) - P-value of the test assessing the overall model fit, i.e., the relationship between survival time, the predictor, and co-predictors in the model. A modelfit value of > 0.05 indicates a low association between the predictor and/or co-predictors with survival time.
Figure 6. Cox Regression results spreadsheet
The hazard ratio is an effect size measure used to assess the direction and magnitude of the effect of a predictor variable on the relative likelihood of the event occurring at any given point in time, controlling for other predictors in the model.
For continuous predictors, such as gene expression values and tumor size, the hazard ratio is the predicted change in the hazard for a unit increase in the predictor. A hazard ratio greater than 1 indicates that the predictor is associated with shorter time-to-event, a hazard ratio less than 1 indicates that the predictor is associated with longer time-to-event, and a hazard ratio of 1 indicates that the predictor has no effect on time-to-event. For categorical predictors, the hazard ratio is relative to the reference category.
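The interpretation rules above can be sketched in a few lines of Python. This is a minimal illustration, not Partek's implementation: it assumes the usual relationship between a Cox regression coefficient and the hazard ratio (HR = exp(beta)) and a Wald-type 95% interval; the function names and the example coefficient are hypothetical.

```python
import math

def hazard_ratio(beta, delta=1.0):
    """Hazard ratio for an increase of `delta` units in a continuous predictor."""
    return math.exp(beta * delta)

def wald_ci(beta, se, z=1.96):
    """Approximate 95% bounds of the hazard ratio (Wald interval; an assumption here)."""
    return math.exp(beta - z * se), math.exp(beta + z * se)

def interpret(hr, tol=1e-9):
    """Translate a hazard ratio into the wording used in the text."""
    if abs(hr - 1.0) <= tol:
        return "no effect on time-to-event"
    return "shorter time-to-event" if hr > 1.0 else "longer time-to-event"

# Hypothetical coefficient and standard error, not taken from the tutorial data
beta, se = 0.4, 0.1
hr = hazard_ratio(beta)
low, up = wald_ci(beta, se)
print(interpret(hr), low < hr < up)
```

Because the hazard ratio is multiplicative, an increase of two units corresponds to the square of the one-unit hazard ratio, which is why `delta` appears inside the exponent.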
For any probe set, we can view a detailed HTML report.
Right-click the row header for row 1
Select HTML Report from the pop-up menu (Figure 7)
Figure 7. Invoking an HTML report for a probe set
The HTML report (Figure 8) will open in your default web browser.
Figure 8. Cox Regression HTML report
In this section, you will learn how to find genomic features (genes) that are near the IP-enriched regions of the data. You will also learn how to classify the peak locations by gene section (5’ UTR, 3’ UTR, Promoter, exon, intron).
Select p-value_filtered from the spreadsheet tree
Select Find Nearest Genomic Feature from the Peak Analysis section of the ChIP-Seq workflow
The Output Overlapping Features dialog will open (Figure 1).
Figure 1. Selecting a database for overlapping features
With this dialog, you can specify the reference database.
Select RefSeq Transcripts 81 - 2017-08-02 or your preferred annotation database
The promoter region can also be defined. The default settings are appropriate in this case.
Select OK
The resulting spreadsheet, gene-list, is a child of the p-value_filtered spreadsheet (Figure 2). Each row represents a transcript.
Figure 2. Viewing genes overlapped by regions
Column 1. transcript chromosome gives the chromosome location of transcript
Column 2. transcript start gives the start of transcript (inclusive)
Column 3. transcript stop gives the end of transcript (exclusive)
Column 4. strand gives the strand of the transcript
Column 5. Transcript ID gives the identity of the transcript
Column 6. Gene Symbol gives the identity of the gene
Column 7. Distance to TSS gives the distance of each enriched region to the transcription start site in base pairs with positive indicating downstream and negative indicating upstream
Column 8. Percent overlap with gene gives the percent of the gene that overlaps with the region
Column 9. Percent overlap with region gives the percent of the region that overlaps with the gene
Percent overlap with gene is more likely to be close to 1 in cases where one region covers several genes, in histone studies, for example. Percent overlap with region is likely to be close to 1 in cases where a region is relatively small and is found completely within a gene, in transcription factor binding studies, for example. If both columns are close to 1, then the gene and the region have nearly the same start and stop sites. If both columns are close to 0, then the region does not overlap with the gene directly and likely covers only the promoter region.
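The two overlap columns can be reproduced by hand with a short Python sketch. It assumes half-open coordinates (start inclusive, stop exclusive, as described for the transcript columns) and reports fractions between 0 and 1, matching the "close to 1" wording above; the function names and coordinates are hypothetical.

```python
def overlap_length(a_start, a_stop, b_start, b_stop):
    """Number of bases shared by two half-open intervals [start, stop)."""
    return max(0, min(a_stop, b_stop) - max(a_start, b_start))

def overlap_fractions(gene, region):
    """Return (fraction of gene covered by region, fraction of region covered by gene)."""
    shared = overlap_length(gene[0], gene[1], region[0], region[1])
    return shared / (gene[1] - gene[0]), shared / (region[1] - region[0])

# Hypothetical coordinates for illustration only
gene = (1000, 5000)

# Small region entirely inside the gene (transcription-factor-like case)
print(overlap_fractions(gene, (2000, 2400)))

# Region upstream of the gene (promoter-only case): no direct overlap
print(overlap_fractions(gene, (500, 900)))
```

In the first case the region fraction is 1 while the gene fraction is small; in the second case both fractions are 0, which corresponds to the promoter-only situation described above.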
Another way to interpret the genomic location of peaks is to use Classify regions by gene section.
Select p-value_filtered from the spreadsheet tree
Select Classify regions by gene section from the Peak Analysis section of the ChIP-Seq workflow
The Output Overlapping Features dialog will open.
Select RefSeq Transcripts 81 - 2017-08-02 or your preferred annotation database
The promoter region can also be defined. The default settings are appropriate in this case. The results can be further configured to give one result per detected region or one result per genomic feature. The default setting, one result per detected region, is appropriate in this case.
Select OK
A new spreadsheet, gene-classification will be generated (Figure 3).
Figure 3. Classifying regions by gene section
Columns 1-6 have the same contents we saw in gene-list.
Column 7. Gene Section gives the section of the gene that overlaps with the region
Column 8. Distance to TSS gives the distance of each enriched region to the transcription start site in base pairs with positive indicating downstream and negative indicating upstream
Column 9. Distance to nearest gene gives the distance of each enriched region to the nearest gene in base pairs with positive indicating downstream and negative indicating upstream
Column 10. Sample ID gives the sample in which the region is enriched
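The sign convention used for the distance columns (positive downstream, negative upstream) depends on the strand of the transcript. The Python sketch below illustrates that convention for a single genomic position. It is an assumption-laden illustration: the tutorial does not state which point of a region Partek Genomics Suite measures from, so `position` is a hypothetical representative point (for example, a region midpoint), and transcript coordinates are taken to be half-open.

```python
def tss_position(tx_start, tx_stop, strand):
    """TSS of a transcript with half-open coordinates [tx_start, tx_stop)."""
    return tx_start if strand == "+" else tx_stop - 1

def signed_distance_to_tss(position, tx_start, tx_stop, strand):
    """Positive = downstream of the TSS, negative = upstream, in base pairs."""
    offset = position - tss_position(tx_start, tx_stop, strand)
    # On the reverse strand, larger coordinates lie upstream of the TSS
    return offset if strand == "+" else -offset

# Hypothetical transcript and positions for illustration only
print(signed_distance_to_tss(1200, 1000, 5000, "+"))
print(signed_distance_to_tss(5199, 1000, 5000, "-"))
```

The same coordinate offset therefore changes sign depending on the strand, which is worth keeping in mind when sorting or filtering on these columns.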
This tutorial describes the Partek model selection tool, shows how to use it, and points out some common mistakes to avoid. The data set used in the tutorial consists of simulated human microarray intensity values in log space. The data are not intended for any diagnostic procedure; they serve only to show how to use the function.
Download the zip file from . The download contains the following files:
Training set data: 28 samples (11 disease samples and 15 normal samples) on 9953 genes
Test set data: 8 samples on 9953 genes
configuration of the model builder (.pcms file)
36 sample data set – total of training and test samples
deployed model (.pbb file)
A classification model has two parts: variables and classifier. The model selection tool in Partek Genomics Suite uses cross-validation to choose the best classification model and gives the accuracy estimate of the best model.
1-level cross-validation is used to select the best model to deploy. There are two ways to report an unbiased accuracy estimate (or correct rate): 2-level cross-validation on the same data set, or deploying the model on an independent test set. We will show both in this tutorial.
Open Partek Genomics Suite and choose File > Open... from the main menu to open Training.fmt
Select Tools > Predict > Model Selection from the Partek main menu
In the Cross-Validation tab, set Predict on to Type, Positive Outcome to Disease, and Selection Criterion to Normalized Correct Rate (Figure 1)
Choose the 1-Level Cross-Validation option and set Manually specify partition to 5. The purpose of 1-level cross-validation is to select the best model to deploy on the test data set.
Figure 1. Model selection dialog: 1-level cross validation configuration
Choose the Variable Selection tab and use ANOVA to select variables. Genes are selected based on the p-value from a 1-way ANOVA model in which the factor is Type. In each iteration of cross-validation, ANOVA is performed on the training set and the top N genes with the most significant p-values are used to build the classifier. The Configure button allows you to specify the ANOVA model if you want to include multiple factors (Figure 2).
Since we don't know how many genes should be used to build the model, we will try 10, 20, 30, 40, and 50 genes. The more options you try, the longer it takes to run. Under How many groups of variables do you want to try, select Multiple groups with sizes from 10 to 50 step 10
Figure 2. Model selection dialog: Variable selection configuration
Click on the Classification tab, select K-Nearest Neighbor, and choose 1 and 3 neighbors using the default Euclidean distance measure (Figure 3)
Figure 3. Model selection dialog: K-nearest neighbor configuration
Select the Discriminant Analysis option and use the default setting, which has the Linear with equal prior probabilities option checked
Click on the Summary tab. We have configured 15 models to choose from (Figure 4)
Figure 4. Model selection dialog: Summary page
The more models configured, the longer it takes to run. In this example, to save time, we specified only 15 models and chose 5-fold cross-validation. You can also click the Load Spec button to load the above configuration from the file tutorial.pcms
When you click Run, a dialog like the one in Figure 5 will be displayed, notifying you that some classifiers, such as discriminant analysis, are not recommended on data sets with more variables than samples.
Figure 5. A notification that discriminant analysis model is not recommended on data with more variables than samples
Click the Run without those models button to dismiss the dialog, leaving 12 models in this model space
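The model counts in these steps can be checked with a short Python sketch. The 15 configured models follow from 5 variable group sizes combined with 3 classifiers. The filtering rule below is an assumption that happens to reproduce the 12 remaining models (discriminant analysis is dropped only where the number of variables exceeds the number of samples); the tutorial does not spell out the exact rule, and the classifier labels are illustrative.

```python
from itertools import product

variable_counts = [10, 20, 30, 40, 50]
classifiers = ["1-NN (Euclidean)", "3-NN (Euclidean)", "Linear discriminant analysis"]

# Every combination of variable group size and classifier is one candidate model
model_space = list(product(variable_counts, classifiers))

n_samples = 28  # size of the training data set in this tutorial

def is_recommended(n_variables, classifier, n_samples):
    # Assumption: discriminant analysis is dropped when variables outnumber samples
    if classifier == "Linear discriminant analysis":
        return n_variables <= n_samples
    return True

kept = [m for m in model_space if is_recommended(m[0], m[1], n_samples)]
print(len(model_space), len(kept))  # prints "15 12"
```

Under this reading, the K-nearest neighbor models are all retained and only the discriminant analysis models with 30, 40, or 50 variables are removed.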
Since we are doing 5-fold cross-validation on 28 samples, 5 or 6 samples are held out as the test set in each iteration, and the models are built on the remaining 22 or 23 training samples. When the run is complete, all 12 models have been tested on the 28 samples and their correct rates are reported. The models are displayed in the summary page in descending order of the normalized biased correct rate; the top one is the best model among the 12 models (Figure 6).
Figure 6. One-level cross-validation result: 20 variables 3 nearest neighbor with Euclidean distance measure is the best model among the 12 models on the 28 sample data
Click the Deploy button to launch the model using the whole data set, but first save the file as 20var-3NN-Euclidean.ppb. This will run ANOVA on the 28 samples to generate the top 20 genes and build a model using 3-nearest neighbor based on the Euclidean distance measure.
Since the deployed model was built from all 28 samples, we need a test set to run the model on in order to know the correct rate.
To get an unbiased correct rate, the test set samples must be independent of the training set. We will now load another data set, which has 8 samples with logged intensity values on the same set of genes as the training data set. Using a completely independent test set to get the correct rate is called hold-out validation.
Choose File > Open... to browse and open testSet.fmt
Choose Tools > Predict > Run Deployed Model... from the menu
Select 20var-3NN-Euclidean.ppb to open and click the Test button to run. The Correct rate (= accuracy) is reported at the top of the dialog (Figure 7)
Figure 7. Report on deploying a model on a test data set
Click Add Prediction to New Spreadsheet to generate a new spreadsheet with the predicted class name in the first column. The samples (rows) whose predicted and true class names differ are highlighted (Figure 8)
Figure 8. Test deployed model on test set report on spreadsheet
Click Test Report to generate a report in HTML format
Click Close to dismiss the dialog
Hold-out validation requires splitting the whole data set into two parts: a training set and a test set. However, genomic data (like microarray or NGS data) typically don't contain a large number of samples, so with the hold-out method we have to make the training and test sets even smaller. When the sample size is small (the example data here only illustrate the function), the result is not precise. As a rule of thumb, you should have at least 100 test samples to measure the correct rate with useful precision. In other words, the larger the training set, the better the fitted prediction models; the larger the test set, the greater the power of the validation.
Another method to get an unbiased accuracy estimate is to perform a 2-level cross-validation on all the available samples (here, the 36-sample set), so that you don't have to split the data. The following steps show how to use all 36 samples to select the best model and get the accuracy estimate.
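To see why a small test set gives an imprecise correct rate, the binomial standard error of a proportion can be computed for different test set sizes. This Python sketch is a generic statistical illustration, not a Partek function; the correct rate of 0.9 is a hypothetical value chosen only for the comparison.

```python
import math

def correct_rate_standard_error(correct_rate, n_test):
    """Binomial standard error of a correct rate measured on n_test samples."""
    return math.sqrt(correct_rate * (1.0 - correct_rate) / n_test)

# Hypothetical correct rate; compare the 8-sample test set with 100 samples
for n in (8, 100):
    se = correct_rate_standard_error(0.9, n)
    print(n, round(se, 3))
```

With 8 test samples the standard error is roughly 0.1, so the measured correct rate could easily be off by ten percentage points or more, whereas with 100 test samples it shrinks to about 0.03.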
Choose File > Open... to browse to and open the file 36samples.fmt
Choose Tools > Predict > Model Selection... from the menu
Click on Load Spec to select tutorial.pcms
Click Run on 1-level cross validation to select the best model using 36 samples
The best model is 30 variables using 1-Nearest Neighbor with Euclidean distance measure (Figure 9).
Figure 9. One-level cross-validation result: 30 variables 1 nearest neighbor with Euclidean distance measure is the best model among the 12 models on the 36 sample data
Click on the model with best correct rate and deploy the model
Since there is no separate data to test the correct rate of the best model in the 12 model space, we will do a 2-level cross-validation to get the accuracy estimate.
Click on the Cross-Validation tab, choose 2-Level Nested Cross-Validation, specify the number of partitions as 5 for both, leave everything else the same, and click Run (Figure 10)
Figure 10. Two-level cross-validation configuration setup
After it is done, you will get a report like the one in Figure 11. The highlighted number is the unbiased accuracy estimate of the best model in the 12 model space.
Figure 11. Two level cross-validation report. The highlighted model had the highest accuracy
Cross-validation is used to estimate the accuracy of a predictive model; it is a solution to the overfitting problem. One example of overfitting is testing the model on the same data set on which the model was trained. It is like giving students sample questions to practice before an exam and then giving them the exact same questions during the exam, so the result is biased.
In cross-validation, the data are partitioned into a training set and a test set. A model is built on the training set and validated on the test set, so the test set is completely independent of model training. In K-fold cross-validation, the whole data set is randomly divided into K equal-sized subsets; one subset is taken as the test set and the remaining K-1 subsets are used to train a model. This is repeated K times, so all the samples are used for both training and testing, and each sample is tested once. The following figure shows 5-fold cross-validation:
Figure 12. 5-fold cross-validation
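The partitioning described above can be sketched in plain Python. This is a minimal illustration of K-fold splitting in general, not the partitioning code used by the Partek model selection tool; the function name, the fixed seed, and the interleaved assignment of shuffled samples to folds are assumptions.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled samples into k folds of nearly equal size
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 28 samples in 5 folds cannot be split evenly
fold_sizes = [len(test) for _, test in k_fold_indices(28, 5)]
print(sorted(fold_sizes, reverse=True))  # [6, 6, 6, 5, 5]
```

Every sample appears in exactly one test fold and in the training set of all other iterations, which is the property that lets all samples serve for both training and testing.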
In the Partek model selection tool, the cross-validation partitioning is performed first. In each iteration of cross-validation, variable selection and classification are performed on the training set, and the test set remains completely independent for validating the model. One common mistake is to select variables beforehand, e.g., to perform ANOVA on the whole data set, use the ANOVA results to select the top genes, and then perform cross-validation to get the correct rate. In this case, the test sets in cross-validation were used in the variable selection and are not independent of the training sets, so the result will be biased.
Another common mistake is to run 1-level cross-validation with multiple models and report the correct rate of the best model as the estimate of the generalization correct rate. This correct rate is optimistically biased. The reason is that in 1-level cross-validation the test set is used to select the best model, so it is no longer independent for estimating the correct rate on an unseen data set. Either use the 2-level cross-validation option or use another independent set to get the accuracy estimate. The idea is to partition the data into 3 sets: a training set, a validation set, and a test set. The models are trained on the training set, the validation set is used to select the best model, and the test set is used to generate an unbiased accuracy estimate.
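The order of operations matters more than the particular statistics, so it can be sketched with simple stand-ins. In the Python sketch below, variable selection runs inside each cross-validation iteration on the training fold only. The ranking score (absolute difference in class means) is a simplified substitute for ANOVA, the classifier is a plain 1-nearest neighbor, and the data are made-up toy values; none of this is Partek's code.

```python
def top_variables(rows, labels, n_top):
    """Rank variables by the spread of class means (a simple stand-in for ANOVA)."""
    classes = sorted(set(labels))
    scores = []
    for j in range(len(rows[0])):
        means = []
        for c in classes:
            values = [row[j] for row, lab in zip(rows, labels) if lab == c]
            means.append(sum(values) / len(values))
        scores.append((max(means) - min(means), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:n_top]]

def nearest_neighbor(train_rows, train_labels, query, variables):
    """Predict with 1-nearest neighbor using squared Euclidean distance."""
    def dist(row):
        return sum((row[j] - query[j]) ** 2 for j in variables)
    best = min(range(len(train_rows)), key=lambda i: dist(train_rows[i]))
    return train_labels[best]

def cross_validated_correct_rate(rows, labels, k, n_top):
    correct = 0
    for fold in range(k):
        test_idx = list(range(fold, len(rows), k))
        train_idx = [i for i in range(len(rows)) if i % k != fold]
        train_rows = [rows[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        # Variable selection sees the training fold only, never the test fold
        variables = top_variables(train_rows, train_labels, n_top)
        for i in test_idx:
            if nearest_neighbor(train_rows, train_labels, rows[i], variables) == labels[i]:
                correct += 1
    return correct / len(rows)

# Toy data (hypothetical): variable 0 separates the classes, the others do not
rows = [[0.1, 5.0, 3.0], [0.2, 4.0, 3.1], [0.0, 5.5, 2.9], [0.3, 4.5, 3.0],
        [2.1, 5.1, 3.0], [2.0, 4.2, 3.1], [2.2, 5.4, 2.9], [1.9, 4.4, 3.0]]
labels = ["normal"] * 4 + ["disease"] * 4
print(cross_validated_correct_rate(rows, labels, k=4, n_top=1))
```

Moving the call to `top_variables` outside the loop and applying it to all rows would reproduce the mistake described above, because the test samples would then influence which variables are chosen.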
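The three-way separation of training, validation, and test data can be expressed as two nested loops. The Python sketch below shows only the structure: inner folds select a model, outer folds estimate its correct rate. It is not the Partek implementation; `n_correct(model, train, test)` is a hypothetical callback that trains the given model on the `train` indices and returns how many `test` samples it classifies correctly.

```python
def folds(indices, k):
    """Split a list of indices into k interleaved folds."""
    return [indices[i::k] for i in range(k)]

def one_level_cv(indices, k, models, n_correct):
    """Return the model with the highest cross-validated correct rate on `indices`."""
    best_model, best_rate = None, -1.0
    for model in models:
        correct = 0
        for test in folds(indices, k):
            train = [i for i in indices if i not in test]
            correct += n_correct(model, train, test)
        rate = correct / len(indices)
        if rate > best_rate:
            best_model, best_rate = model, rate
    return best_model

def two_level_cv(n_samples, k_outer, k_inner, models, n_correct):
    """Outer folds estimate accuracy; inner folds select the model."""
    indices = list(range(n_samples))
    correct = 0
    for outer_test in folds(indices, k_outer):
        outer_train = [i for i in indices if i not in outer_test]
        # Model selection never sees the outer test fold
        chosen = one_level_cv(outer_train, k_inner, models, n_correct)
        correct += n_correct(chosen, outer_train, outer_test)
    return correct / n_samples
```

The correct rate returned by `two_level_cv` describes the whole selection procedure rather than one fixed model, since a different model may be chosen in each outer iteration.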
With a list of amplified or deleted regions in our cohort in hand, one of the more interesting questions to ask is what genes have recurrent amplifications or deletions in the data set. To address this question, we can use the Find overlapping genes function to either add a column to our region list with the genes present in each region or create a new list of genes that overlap the regions.
Here, we will create a new spreadsheet with genes that overlap the regions in the amplified_or_deleted spreadsheet.
Select the amplified_or_deleted spreadsheet in the spreadsheet tree
Select Find Overlapping Genes from the Copy Number Analysis section of the workflow
Select Create a New Spreadsheet with Genes that Overlap the Regions from the Find Overlapping Genes dialog (Figure 1)
Select OK
Figure 1. Options in Find Overlapping Genes dialog
To determine what regions in the genome correspond to genes, we need to select an annotation database (Figure 2).
Figure 2. Viewing the Output Overlapping Features dialog. Database files not present on the computer display Download required in red
Partek Genomics Suite offers a variety of annotation sources, including RefSeq, Ensembl, and GENCODE; custom annotations can also be used. If the database file has not been downloaded, the message Download required. Click OK to download the file will be listed in red beneath the annotation. Selecting OK will automatically download the file and then run the task.
Select Ensembl Transcripts release 75
Select OK
A new spreadsheet, gene-list, is created as a child spreadsheet of amplified_or_deleted (Figure 3).
Figure 3. Viewing the gene-list spreadsheet, a result of overlapping genes with regions of copy number changes. Each row of the table represents one Ensembl transcript
Each row corresponds to a transcript and the columns are as follows:
1-3. Genomic coordinates of the transcript
4. Coding strand
5. Transcript ID
6. Gene Symbol
7. Minimum distance of the region to the transcription start site with positive values indicating downstream and negative values indicating upstream
8. Percent overlap with gene indicates how much of the transcript sequence overlaps the region
9. Percent overlap with region indicates how much of the region is overlapped by the transcript
10+. Correspond to columns 1+ in the segment-analysis spreadsheet
The first step in analyzing Affymetrix intensity data is to estimate the number of copies of each marker (allele).
Select Create Copy Number (from Allele Intensities Only)
This launches the Copy Number Creation dialog (Figure 1).
Figure 1. Choosing paired samples or unpaired samples
Choosing Paired samples assumes that each sample has its own reference sample with a common sample ID and generates a copy number spreadsheet. Choosing Unpaired samples uses a common reference, either a single sample or a group of samples, to create both a copy number spreadsheet and an allele ratio spreadsheet.
In this tutorial, we have paired tumor-normal samples and thus can use the Paired samples option.
Select Paired samples
Select OK
The next dialog, Create Copy Number from Pairs, asks you to choose the column shared by each pair and the column that identifies the baseline category (Figure 2).
Figure 2. Creating copy number from pairs
Select 3. Tumor for Column
Select N for Baseline category
Select 4. SubjectID for the Column to match sample pairs
Select OK
This will pair samples based on 4. SubjectID, and set the baseline sample as the sample in the pair with a value of N in the 3. Tumor column. The spreadsheet produced (Figure 3) has a row for each tumor sample. In this tutorial, columns 7+ include copy number estimates for each marker. Columns 1-6 are identical to the source spreadsheet.
Figure 3. Viewing the paired copy number spreadsheet
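The pairing logic can be sketched in Python as follows. This is an illustration under stated assumptions, not Partek's algorithm: it assumes a diploid baseline, so the copy number estimate is taken as 2 times the ratio of tumor intensity to matched normal intensity for each marker. The dictionary keys, sample values, and function name are hypothetical.

```python
def paired_copy_number(samples, baseline="N"):
    """Estimate per-marker copy number for each non-baseline sample.

    samples: list of dicts with keys "subject", "tumor", and "intensities".
    Assumes a diploid baseline, i.e. copy number = 2 * tumor / normal.
    """
    # Index the baseline (normal) sample of each subject
    baselines = {s["subject"]: s for s in samples if s["tumor"] == baseline}
    result = {}
    for s in samples:
        if s["tumor"] == baseline:
            continue
        ref = baselines[s["subject"]]
        result[s["subject"]] = [
            2.0 * t / n for t, n in zip(s["intensities"], ref["intensities"])
        ]
    return result

# Made-up intensities for one subject and two markers
samples = [
    {"subject": "P1", "tumor": "N", "intensities": [100.0, 200.0]},
    {"subject": "P1", "tumor": "T", "intensities": [100.0, 400.0]},
]
print(paired_copy_number(samples))
```

As in the spreadsheet described above, the output has one entry per tumor sample, and the baseline samples are consumed as references rather than reported as rows of their own.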
Alternatively, if paired samples are not available or appropriate, the Unpaired samples option can be selected in the Copy Number Creation dialog (Figure 1). Selecting this option opens the Unpaired Copy Number dialog (Figure 4).
Figure 4. Viewing the unpaired copy number dialog
There are several options for creating a reference baseline. First, you can use an existing reference file. These may be distributed by the manufacturer of your array, such as Affymetrix or Illumina, or previously created using Partek Genomics Suite from a set of normal samples. Second, you can use the reference file distributed by Partek. Third, you can choose all the samples from a separately imported spreadsheet. Fourth, you can choose a subset of the samples within the current spreadsheet to pool to create a reference.
In each case, every sample in your spreadsheet will be compared to the reference and two spreadsheets will be generated, a copy number spreadsheet and an allele ratio spreadsheet.
Select () to activate Selection Mode
Selecting () activates Navigation Mode. This enables navigation on large pathway diagrams by left-clicking and dragging to move the view.
Select () to open the search panel
Select () to search
Select () to minimize the Search panel
Select () in the upper left-hand corner of the Partek Pathway window
Partek Genomics Suite supports different types of mapping files (Figure 3). These are library files that define how genes are organized into functional groups. For an explanation of each type of mapping file, click on the help icon () next to each one.
A new spreadsheet (gene-list) that contains the genes in the original list that belong to the chosen functional group will be created (Figure 6). Note that this spreadsheet is a Partek temporary (ptmp) file. To save it, click the Save icon ().
Select () from the command bar
Select () several times to zoom out slightly
If you need additional assistance, please visit to submit a help ticket or find phone numbers for regional support.
For additional tips on using the Chromosome View, refer to .
Please note that the Kaplan-Meier results spreadsheet is a temporary file. If you would like to be able to view the spreadsheet again after closing Partek Genomics Suite, be sure to save it by selecting the Save Active Spreadsheet icon ().
The example data set consists of 20 paired samples from an ovarian cancer study in which a fresh-frozen tumor sample and peripheral blood sample were obtained from 10 female patients (Ramakrishna et al. 2010). All 20 samples were analyzed using the Affymetrix Genome Wide Human SNP Array 6.0. To download the data set, select this link - . The data set is also used for the LOH and AsCN tutorials. The spreadsheet used in this tutorial was generated by importing SNP6 CEL files and annotating them with attributes for each sample. The experimental goal is to identify copy number changes present in multiple patient tumors.
To facilitate analysis, we will consolidate the tracks. To select a track, either left-click it in the Tracks panel or activate Selection Mode by selecting the () icon and left-click the track in the viewer. Selecting a track opens its configuration options in the track properties panel below the Tracks panel.
There are six ways to zoom and navigate in Chromosome View. (1) Select () to activate Zoom/Navigate Mode. With this mode selected, left-clicking and drawing a box on a track will zoom to the region in the box. (2) Use the mouse scroll wheel. (3) Select the zoom in () and zoom out () buttons on the zoom controls (). Selecting () resets the zoom to the whole chromosome. (4) Use the slider on the zoom controls (). (5) Zoom to a location or gene by typing the coordinates or gene name in the Navigation bar () and pressing Enter on your keyboard. (6) Starting on a spreadsheet, right-click a row with region information and select Browse to Location.
Select ()
You can save the reads shown in the visible genome browser window by selecting (), right-clicking in the peak area on the Bam Profile track, and selecting Dump Displayed Reads to Spreadsheet from the pop-up menu.
Please note that the Cox Regression results spreadsheet is a temporary file. If you would like to be able to view the spreadsheet again after closing Partek Genomics Suite, be sure to save it by selecting the Save Active Spreadsheet icon ().
Column 10.-23. These columns are detailed in
This gene-list spreadsheet is gene-centric and enables genomic integration. For example, GO and Pathway enrichment can be invoked directly on the gene-list spreadsheet to detect functional groups affected by copy number changes. While not detailed in this tutorial, please feel free to explore these options on your own. For more information on enrichment analysis, you can consult the tutorial.
For more information about using unpaired samples in copy number calculations, please consult our white paper.
This tutorial uses a spreadsheet generated after data import, but we will illustrate the steps used to import the data in this section.
Select Copy Number from the Workflows drop-down menu
Select Import Samples from the Copy Number workflow
The import dialog will open (Figure 1).
Figure 1. Viewing the Import Copy Number Samples dialog
For Affymetrix arrays, Partek Genomics Suite can import CEL files with allele intensity values and calculate copy number estimates from these intensities. For Agilent, Illumina, NimbleGen, or Affymetrix .CHP files, Partek Genomics Suite can import files containing calculated copy numbers or log ratios.
For this tutorial, we will not be importing CEL files.
Select Cancel to close the import dialog
Later sections of this tutorial will address starting with copy number or log ratios and performing GC wave correction on Affymetrix CEL files.
We can now open the tutorial data file.
Download the zipped tutorial data folder Overlapping Copy Number with LOH
Unzip the files to an accessible directory
Select File from the main menu
Select Open...
Select the file IC_Intensities_SNP6.fmt
The spreadsheet will open in the Analysis tab (Figure 2).
Figure 2. Viewing the tutorial data set spreadsheet
This spreadsheet was generated from the import of SNP6 CEL files and shows all 20 samples on rows. Columns 1-6 describe the samples with information such as file names, Subject ID, Gender, etc. The other columns are individual markers from the microarray with the log2 normalized intensities associated with each marker (marker labels are column headers). Opening the IC_Intensities_SNP6.fmt file is equivalent to importing the 20 sample files and adding sample attributes.
In addition to annotating regions with overlapping genes, other annotations can be used to characterize the regions showing copy number variation.
For example, Overlap with known SNPs in the Copy Number workflow gives the option of annotating regions with SNPs from dbSNP or a custom SNP database (Figure 1).
Figure 1. Annotate regions with SNPs from dbSNP
This task adds two columns to the region list spreadsheet: the list of SNPs described in each region and the total number of SNPs in the region. If the list of SNPs is very long, you can output a separate list by right-clicking on the row header and selecting Create list of dbSNP from the pop-up menu.
Another option in the workflow is Test for known abnormalities. Selecting this option compares the regions listed in the region list with a database of genomic abnormalities characteristic of particular diseases or syndromes to find possible matches. Annotation options include a Partek-distributed database of 60 syndromes or a custom database (Figure 2). Please note that the included table of known abnormalities is distributed for research use only.
Figure 2. Test for known abnormalities in your copy number data
If you would like to add a custom database, organize the following information by column: the name of the abnormality, chromosome number, start location, and stop location. The input for the task should be a list of aberrations for every sample; do not include unchanged regions in the input or every syndrome will be shown as positive.
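The matching idea behind this task can be illustrated with a minimal Python sketch. This is not Partek's implementation; the `Region` type, the database entries, and the sample regions are hypothetical, and we assume a match simply means that two regions on the same chromosome overlap by at least one base.

```python
from typing import NamedTuple


class Region(NamedTuple):
    name: str        # abnormality name, or a label for a sample region
    chromosome: str
    start: int       # start location (bp)
    stop: int        # stop location (bp)


def overlaps(a: Region, b: Region) -> bool:
    """True when two regions share at least one base on the same chromosome."""
    return a.chromosome == b.chromosome and a.start <= b.stop and b.start <= a.stop


def match_abnormalities(aberrations, database):
    """Return (aberration name, abnormality name) pairs that overlap."""
    return [(ab.name, known.name)
            for ab in aberrations
            for known in database
            if overlaps(ab, known)]


# Hypothetical custom database: name, chromosome, start, stop
custom_db = [
    Region("Example syndrome A", "7", 72_000_000, 74_000_000),
    Region("Example syndrome B", "22", 18_000_000, 21_000_000),
]

# Hypothetical aberrant regions only (unchanged regions are left out)
sample_aberrations = [
    Region("sample1_del", "22", 19_500_000, 20_200_000),
    Region("sample1_amp", "3", 1_000_000, 5_000_000),
]

matches = match_abnormalities(sample_aberrations, custom_db)
print(matches)
```

If unchanged regions were included in `sample_aberrations`, they would overlap nearly every database entry, which is why the input should list aberrations only.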
Starting with copy number estimates for each marker (either taken directly from the vendor’s input file or calculated previously), the next step is to create a list of regions where adjacent markers share the same copy number.
There are two algorithms available for copy number region detection: Genomic Segmentation and Hidden Markov Model (HMM). Both algorithms look for trends across multiple adjacent markers. The genomic segmentation algorithm identifies breakpoints - changes in copy number between two neighboring regions. The HMM algorithm looks for discrete changes of whole number copy number states (e.g., 0, 1, 2 … with no upper limit) and will find regions with those numbers of copies. Therefore, the HMM model performs better in cases of homogeneous samples such as clinical syndromes with underlying copy number aberrations. Genomic segmentation is preferable for heterogeneous samples such as cancer because tumor biopsies often contain “contaminating” healthy tissue and a tumor can have cells with different genomic aberrations.
The number of copies of each marker created in the previous step will be used to detect the genomic regions with copy number variation, i.e., to identify amplifications and deletions across the genome.
Select the IC_IntensitiesSNP6pairedcopynumber spreadsheet in the Analysis tab
Select Detect Amplifications and Deletions from the Copy Number Analysis section of the workflow (Figure 1)
Figure 1. Invoking Detect Amplifications and Deletions
The Detect Amplifications and Deletions dialog will give you the option to choose Genomic Segmentation or HMM Region Detection (Figure 2).
Figure 2. Select a method for detecting amplifications and deletions
Select Genomic Segmentation
Select OK
The Genomic Copy Number Segmentation dialog gives options for setting segmentation parameters and configuring the region report (Figure 3).
Figure 3. Configuring the Genomic Copy Number Segmentation dialog
Set Minimum genomic markers to 50
Leave the rest of the parameters set to default values as shown (Figure 3)
Select OK
The Genomic Segmentation task is divided into two steps. In the first step, each region is compared to an adjacent region to determine whether both have the same average copy number and whether a breakpoint can be inserted. This is determined by first using a two-sided t-test to compare the average intensities of adjacent regions and then checking whether the resulting p-value is below the specified P-value threshold. The minimum genomic size of a region is set by the number of genomic markers in the region, Minimum genomic markers, while the magnitude of the significant difference between two regions is controlled by Signal to noise, which can be thought of as the difference in copy numbers between the regions. If the t-test is significant, the copy number of the region differs significantly from its nearest neighbors. However, a second step is needed to determine whether the difference is due to amplification or deletion. In this second step, two one-sided t-tests are used to compare the mean copy number in the region with the expected diploid copy number. For a detailed explanation of the genomic segmentation procedure, please consult our Genomic Segmentation white paper. For more detailed information about fine-tuning the parameters of your copy number analysis, please consult our guide, Optimizing Copy Number Segmentation.
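To make the first step more concrete, here is a minimal Python sketch of comparing two adjacent windows of markers. It is an illustration only and not Partek's algorithm: the window values are hypothetical, the t statistic is the Welch form, and the signal-to-noise definition (difference in means divided by a pooled standard deviation) is our assumption.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(left, right):
    """Two-sample (Welch) t statistic comparing the means of two marker windows."""
    standard_error = sqrt(variance(left) / len(left) + variance(right) / len(right))
    return (mean(left) - mean(right)) / standard_error


def signal_to_noise(left, right):
    """Absolute difference in means scaled by the pooled standard deviation."""
    pooled_sd = sqrt((variance(left) + variance(right)) / 2)
    return abs(mean(left) - mean(right)) / pooled_sd


# Hypothetical copy number estimates for two adjacent windows of markers
window_a = [2.0, 2.1, 1.9, 2.0, 2.1, 1.9]
window_b = [3.0, 3.1, 2.9, 3.0, 3.1, 2.9]

t_stat = welch_t(window_a, window_b)
snr = signal_to_noise(window_a, window_b)
print(f"t = {t_stat:.2f}, signal to noise = {snr:.2f}")
```

A large absolute t statistic together with a signal-to-noise value above the chosen threshold would support placing a breakpoint between the two windows.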
The resulting spreadsheet, segmentation, shows one row per genomic region per sample (Figure 4). The columns provide the following information:
1-4. Genomic location of the region
5. Sample ID
6. Description of the copy number change
7. The length of the region (in base pairs)
8. The number of markers in the region
9. Marker density in the region (region length in base pairs divided by the number of markers)
10. Geometric mean of the copy number of all the markers in the region
11. Minimum p-value of the one-sided t-tests of the difference of the copy number in column 10 vs. the diploid range
Figure 4. Viewing the segmentation spreadsheet
If desired, you can use Merge Adjacent Regions under Tools in the main toolbar to combine similar regions.
Individual regions of interest can be visualized using Chromosome View.
Right-click a row header in the segmentation spreadsheet
Select Browse to location from the pop-up menu
Alternatively, you can visualize results at the whole chromosome level.
Select the segmentation spreadsheet
Select Chromosome View from the QA/QC section of the workflow
The Genomic Segmentation track displays the segmentation results (Figure 5). Each line in the track represents a sample. Amplified, deleted, and unchanged regions are shown in red, blue, and white, respectively. The Profile track now also includes information from the segmentation spreadsheet for the selected sample.
Figure 5. Segmentation results shown as regions of amplification and deletion in each sample
Now that amplified and deleted regions have been detected in each sample, we can compare the regions across samples to find copy number changes that are shared by multiple samples.
Select Analyze detected segments from the Copy Number Analysis section of the workflow
The Analyze Segments task (Figure 6) can test for associations between copy number variations and sample categories using the χ2 test. In this tutorial, all pairs share the same phenotype, so we will not test for associations.
Figure 6. Viewing the Analyze segments dialog
Leave all boxes unchecked
Select OK to run the Analyze Segments task
The task generates a new spreadsheet, summary (segment-analysis) (Figure 7), with one region per row. The columns provide the following information:
1-4. Genomic locations of the regions
5. Total number of samples
6-7. Number of samples with amplifications and the average amplified copy number, respectively
8-9. Number of samples with deletions and the average deleted copy number, respectively
10. Total number of samples with copy number aberrations
11-12. Number of samples with no change in copy number and the average copy number in those samples, respectively
13. Number of markers in the region
14. Length of the region (in base pairs)
15+. Two columns per sample - the average copy number in each sample as well as the copy number change status of the same sample (e.g., amplified, deleted, unchanged, depending on the copy number and the threshold for unchanged defined in the Genomic Segmentation dialog)
A "?" indicates that a region with the particular characteristic does not exist or cannot be computed. For example, if a region is not amplified in any of the samples, the average amplified copy number will be shown as "?". This list may be filtered to contain only regions that meet user-specified criteria as discussed in the next section of the tutorial.
Figure 7. Viewing the results of Analyze Detected Segments
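The way per-sample results roll up into the summary columns can be sketched in Python. This is a simplified illustration, not Partek's code; the dictionary keys and the example region are hypothetical, and rounding to two decimals is our choice.

```python
from statistics import mean


def summarize_region(per_sample):
    """Summarize one region from {sample: (average copy number, status)}."""
    groups = {"amplified": [], "deleted": [], "unchanged": []}
    for copy_number, status in per_sample.values():
        groups[status].append(copy_number)

    def average_or_missing(values):
        # "?" marks a value that cannot be computed because the group is empty
        return round(mean(values), 2) if values else "?"

    return {
        "total_samples": len(per_sample),
        "total_amplifications": len(groups["amplified"]),
        "average_amplified": average_or_missing(groups["amplified"]),
        "total_deletions": len(groups["deleted"]),
        "average_deleted": average_or_missing(groups["deleted"]),
        "total_aberrations": len(groups["amplified"]) + len(groups["deleted"]),
        "total_unchanged": len(groups["unchanged"]),
        "average_unchanged": average_or_missing(groups["unchanged"]),
    }


# Hypothetical per-sample results for a single region
region = {
    "S1": (3.2, "amplified"),
    "S2": (2.8, "amplified"),
    "S3": (2.0, "unchanged"),
}
summary = summarize_region(region)
print(summary)
```

Because no sample in this example carries a deletion, the average deleted copy number is reported as "?".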
To get an overview of the common aberrations in the group of samples over the entire genome, we can use View Detected Regions.
Select View Detected Regions
The View Detected Regions dialog (Figure 8) allows you to select the spreadsheet with genomic regions and choose between histogram and copy number classification plots.
Figure 8. View Detected Regions dialog
Select summary (segment-analysis) from the drop-down menu
Select View Histogram
Select OK
The plot will open in a new tab titled Karyogram View (Figure 9).
Figure 9. Viewing amplification and deletion histograms using Karyogram View
The Karyogram View shows each chromosome with red and blue histograms on either side corresponding to amplification and deletion, respectively. The histogram height reflects the number of samples that share either an amplification or a deletion at that particular region. For example, the long arms of chromosomes 3 and 7 are amplified in the majority of samples and most samples share a deletion in the long arm of chromosome 4.
Mousing over a chromosome will give cytoband information; mousing over the histogram will give the number of shared regions at each position and the number of samples sharing the type of variation. Both the menu and the display may be used to control which chromosomes are displayed: left-click in the menu to toggle a chromosome on or off, and right-click in the menu or graph to show only that chromosome.
Alternatively, we can use the Copy Number Classification plot to get a more sample-centric view.
Select View Detected Regions
Select View Copy Number Classification
Select OK
The Copy Number Classification plot also uses Karyogram View to provide an overview of all the samples and the copy number of regions on each chromosome (Figure 10).
Figure 10. Viewing the Copy Number Classification plot
Each sample is drawn as a separate column next to the chromosome. Amplified regions are depicted in red, deleted regions in blue, and regions with no copy number change in white. Sample names are given across the top of each column. For greater detail, try viewing fewer chromosomes.
Principal component analysis (PCA) is a way to explore the overall similarity between samples, visualize possible groupings within the data set, and detect outliers.
Select PCA Scatter Plot from the QA/QC section of the workflow
Figure 1. Principal component analysis showing total allele intensities of normal (blue) and cancer (red) samples. Each dot represents a single sample.
Each dot on the plot corresponds to a single sample and can be thought of as a summary of all normalized marker intensities for the sample. The first categorical column is used to color the plot; here, tumor samples are shown in red and normal samples are shown in blue.
To better view the data, we can rotate the plot.
Click and drag to rotate the plot
Rotating the plot allows us to look for outliers in the data on each of the three principal components (PC1-3). The percentage of the total variation explained by each PC is listed by its axis label. The chart label shows the sum percentage of the total variation explained by the displayed PCs.
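The percentages on the axis labels follow from the standard relationship between principal components and eigenvalues: each PC explains its eigenvalue's share of the total. A minimal sketch, assuming hypothetical eigenvalues sorted from largest to smallest (the function name is ours, not part of the software):

```python
def explained_variance(eigenvalues, displayed=3):
    """Return per-PC percentages and the summed percentage of the displayed PCs."""
    total = sum(eigenvalues)
    percentages = [100 * value / total for value in eigenvalues]
    return percentages, sum(percentages[:displayed])


# Hypothetical eigenvalues of the covariance matrix, largest first
eigenvalues = [50.0, 25.0, 15.0, 6.0, 4.0]

per_pc, displayed_total = explained_variance(eigenvalues)
print(per_pc[:3], displayed_total)
```

Here the first three PCs would be labeled with their individual percentages, and the chart label would show their sum.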
We can see that the peripheral blood samples (normal) cluster together whereas the cancer tissue samples (tumor) are more dispersed and show considerable variability. This corresponds well with the known genomic variability of cancer cells.
To view the similarity of paired normal and tumor samples from the same patient, we can connect dots by Subject ID.
Select 4. SubjectID from the Connect by drop-down menu in the upper right-hand corner of the plot tab
Paired tumor and normal samples are now connected by lines, illustrating the range of differences between normal and tumor copy number in the data set (Figure 2).
Figure 2. Lines connect paired tumor and normal samples
In this tutorial, the experimental goal is to identify regions with copy number changes in multiple patients. To do this, we will create a list containing deleted and amplified regions across the genome shared by 8 or more samples.
Select Create Region List from the Copy Number Analysis section of the Copy Number workflow
Select Specify New Criteria
For our first criterion, we want to include all the amplified regions across the genome shared by at least 8 samples (Figure 1).
Set Name to Amplified
Set Spreadsheet to 2/segmentation/summary (segment-analysis)
Set Column to 6. Total Amplifications using the drop-down menu
Deselect the box next to Include values less than or equal to
Set Include values greater than or equal to value to 8
The # pass should be 86, indicating that 86 regions meet the criteria.
Select OK
Figure 1. Configuring the Amplified criteria
Select Save to save the list
Select OK to confirm that you would like to save Amplified as a list
Select Close to exit the List Creator dialog
Amplified is now open in the Analysis tab as a child spreadsheet of segmentation. Although this list contains regions amplified in 8 or more samples, some samples may also contain deletions in the same regions. For downstream analysis, we may want to filter out these regions to create a final list with only amplified regions. Here, we will use the interactive filter.
Select the Amplified spreadsheet
Set the Column drop-down list to 8. Total Deletions
Type 0 in the Max box
Select Enter on your keyboard
This will apply a filter excluding any region with deletions (Figure 2).
Figure 2. Interactive filter excluding regions with deletions.
The yellow and black bar on the right-hand side of the spreadsheet indicates the proportion of rows that have been filtered. Next, we can save the filtered list.
Right-click the Amplified spreadsheet in the spreadsheet tree
Select Clone... from the pop-up menu
Set the Name of the new spreadsheet to amplified_only
Set Create new spreadsheet as a child of spreadsheet to 2/segmentation (segmentation.txt)
Select OK
The new spreadsheet is a temporary file. To keep the spreadsheet, we need to save it.
Select amplified_only in the spreadsheet tree
Set the file name as amplified_only
Select Save
The amplified_only spreadsheet contains 60 rows and includes regions that were amplified in 8 or more samples and not deleted in any sample.
To create a list of regions only deleted in 8 or more samples, repeat the above steps for deleted regions. You should create a final list, deleted_only, with 92 regions.
Next, we can merge the two lists to create a spreadsheet with both deleted and amplified regions.
Select File from the main taskbar
Select Merge Spreadsheets...
Select the Append Rows tab
Select 2/segmentation/deleted_only (deleted_only) from the First Spreadsheet drop-down menu
Select 2/segmentation/amplified_only (amplified_only) from the Second Spreadsheet drop-down menu
Name the merged spreadsheet amplified_or_deleted using the Specify Output File (Figure 3)
Select OK
Figure 3. Merging amplified and deleted spreadsheets
Select the new spreadsheet, amplified_or_deleted, in the spreadsheet tree
This spreadsheet, amplified_or_deleted, will be used as the basis for the downstream steps in this analysis.
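The filtering and merging logic of this section can be summarized in a short Python sketch. The rows, column names, and helper functions below are hypothetical stand-ins for the summary (segment-analysis) spreadsheet; the sketch only mirrors the criteria used above (shared by 8 or more samples, none of the opposite change) and is not how Partek Genomics Suite stores or processes the data.

```python
def shared_by(regions, column, min_samples=8):
    """Keep regions where at least min_samples samples show the change."""
    return [r for r in regions if r[column] >= min_samples]


def without(regions, column):
    """Drop regions where any sample shows the opposite change."""
    return [r for r in regions if r[column] == 0]


# Hypothetical rows standing in for the summary spreadsheet
summary = [
    {"region": "R1", "total_amplifications": 9, "total_deletions": 0},
    {"region": "R2", "total_amplifications": 8, "total_deletions": 1},
    {"region": "R3", "total_amplifications": 0, "total_deletions": 10},
    {"region": "R4", "total_amplifications": 2, "total_deletions": 3},
]

amplified_only = without(shared_by(summary, "total_amplifications"), "total_deletions")
deleted_only = without(shared_by(summary, "total_deletions"), "total_amplifications")

# Append rows: deleted regions first, then amplified regions
amplified_or_deleted = deleted_only + amplified_only
print([r["region"] for r in amplified_or_deleted])
```

Region R2 is amplified in 8 samples but is excluded because one sample carries a deletion there, which is the case the interactive filter is meant to catch.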
To normalize for GC content, use the custom import settings during import. Select Customize... and under the Algorithm tab of the Advanced Import dialog, check the Adjust for GC content box (Figure 1).
Figure 1. Adjusting for GC content during CEL file import
This tutorial outlines how to analyze miRNA expression data in Partek Genomics Suite and how to integrate miRNA expression data with mRNA expression data from gene expression microarrays.
This tutorial illustrates how to:
Note: the workflow described below is enabled in Partek Genomics Suite version 7.0 software. Please fill out the form on our support page to request this version or use the Help > Check for Updates command to check whether you have the latest released version. The screenshots shown within this tutorial may vary across platforms and across different versions of Partek Genomics Suite.
The data set for this tutorial includes miRNA from 3 human brain samples and 3 heart samples quantified using the Affymetrix GeneChip miRNA 1.0 array. The same sample set was also processed on GeneChip Human Gene 1.0 ST arrays for mRNA expression.
For this tutorial, the gene expression and miRNA expression studies have been analyzed and stored in Partek Genomics Suite project (ppj) format as miRNAmRNA integration. The project contains two Partek format files: Affy_miR_BrainHeart_intensities.fmt with the miRNA data and Affy_HuGeneST_BrainHeart_GeneIntensities.fmt with the analyzed mRNA data. There is also an ANOVA results spreadsheet open as a child spreadsheet of Affy_HuGeneST_BrainHeart_GeneIntensities.fmt.
Download the miRNA Expression and Integration with Gene Expression data set and save it in an easily accessible location on your computer
We can now open the project in Partek Genomics Suite.
Select File
Select Import
Select Zipped Project...
Select the miRNA_tutorial_data.zip zipped folder
The project files will open in the Analysis tab (Figure 1).
Figure 1. The miRNA tutorial data set
Typically, you would begin a miRNA expression analysis with the same steps outlined in the Importing Affymetrix CEL files section of the Gene Expression tutorial. Here, the data has already been imported and attributes added.
To begin our analysis, we will open the miRNA Expression workflow.
Select the miRNA Expression workflow from the Workflows drop-down menu
The miRNA Expression workflow provides a series of steps for analyzing miRNA expression data and integrating it with gene expression data (Figure 1).
Figure 1. The miRNA Expression workflow
Select the Affy_miR_BrainHeart_intensities spreadsheet
This is the probe intensities spreadsheet for the miRNA expression data (Figure 2). Each row is a sample; columns 7 to 9 give attribute information about each sample, including tissue, replicate number, and scan date, while columns 10 onward give probe intensity values.
Figure 2. Viewing the miRNA probe intensities spreadsheet
Select PCA Scatter Plot from the QA/QC section of the workflow
A new tab will open showing a PCA scatter plot (Figure 3).
Figure 3. PCA scatter plot. Samples are spheres. Samples with more similar miRNA expression are close together while dissimilar samples are further apart.
In this PCA scatter plot, each point represents a sample in the spreadsheet. Points that are close together in the plot are more similar, while points that are far apart in the plot are more dissimilar.
To better view the data, we can rotate the plot.
Click and drag to rotate the plot
Rotating the plot allows us to look for outliers in the data on each of the three principal components (PC1-3). The percentage of the total variation explained by each PC is listed by its axis label. The chart label shows the sum percentage of the total variation explained by the displayed PCs.
Here, we can see that the brain and heart samples are well separated across PC1, which is expected.
For more information about customizing the plot, please see Exploring the data set with PCA from the Gene Expression with Batch Effect tutorial.
Next, we will identify miRNAs that are differentially expressed between brain and heart tissues.
Select the Analysis tab
Select the Affy_miR_BrainHeart_intensities spreadsheet
Select Detect Differentially Expressed miRNAs from the Analysis section of the workflow
The ANOVA dialog (Figure 4) allows us to configure the comparisons we want to make between samples and groups within the data set.
Figure 4. ANOVA dialog
Select Tissue from the Experimental Factor(s) panel
Select Add Factor > to move Tissue to the ANOVA Factor(s) panel
The Contrasts... button will now be available to select.
Select Contrasts...
The Configure ANOVA dialog (Figure 5) is used to set up contrasts. Contrasts are the comparisons between groups and are where experimental questions can be asked. In this study, we are asking what miRNAs are differentially expressed between heart and brain tissue.
Figure 5. ANOVA configuration dialog
Select Yes for Data is already log transformed?
Select Fold change for Report comparisons as
Select 7. Tissue from the Select Factor/Interaction drop-down menu
Select brain from the left panel
Select Add Contrast Level > to move brain to the upper group - initially Group 1
Select heart from the left panel
Select Add Contrast Level > to move heart to the lower group - initially Group 2
This contrast (Figure 6) will compare expression of miRNAs in brain samples to expression in heart samples with brain as the numerator and heart as the denominator for fold-change calculations.
Figure 6. Configuring a contrast between brain and heart tissue in the ANOVA dialog
Select Add Contrast
Select OK
The Contrasts... button should now read Contrasts Included.
Select OK to run the ANOVA as configured
An ANOVA Results sheet, ANOVAResults, will be created as a child spreadsheet of Affy_miR_BrainHeart_intensities (Figure 7). In this spreadsheet, each row represents a probe set and the columns represent the computation results for that probe set. Although not synonymous, probe set and gene will be treated as synonyms in this tutorial for convenience. By default, the genes are sorted in ascending order by the p-value of the first categorical factor, which, in this case, is Tissue. This means the most significantly differentially expressed miRNAs between the brain and heart (up-regulated and down-regulated) are at the top of the spreadsheet.
Figure 7. Viewing the ANOVA results spreadsheet
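The fold change reported for the brain vs. heart contrast can be illustrated with a small Python sketch. It assumes the intensities are log2 transformed and that ratios below 1 are reported as negative reciprocals, which is consistent with thresholds such as > 2 and < -2; the function name and the intensity values are hypothetical, and this is not Partek's code.

```python
from statistics import mean


def signed_fold_change(numerator_log2, denominator_log2):
    """Fold change of numerator vs. denominator from log2 intensities."""
    ratio = 2 ** (mean(numerator_log2) - mean(denominator_log2))
    # Ratios below 1 are expressed as negative reciprocals (e.g., 0.25 -> -4)
    return ratio if ratio >= 1 else -1 / ratio


# Hypothetical log2 intensities for one miRNA in three samples per tissue
brain = [10.0, 10.0, 10.0]
heart = [8.0, 8.0, 8.0]

up_in_brain = signed_fold_change(brain, heart)
down_in_brain = signed_fold_change(heart, brain)
print(up_in_brain, down_in_brain)
```

With brain as the numerator, a positive value means higher expression in brain and a negative value means higher expression in heart.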
You may explore what is known about any listed miRNA using external databases TargetScan, miRBase, microRNA.org, or miR2Disease, by right-clicking a row header, selecting Find miRNA in... and choosing one of the external databases. This will open a web page in your default web browser and requires your computer be connected to the internet.
For more information about ANOVA in Partek Genomics Suite, see Identifying differentially expressed genes using ANOVA.
The ANOVA results spreadsheet includes every miRNA on the array for a total of 7815 miRNAs. However, many of these miRNAs are not significantly differentially expressed between brain and heart and, thus, are not of interest. Next, we will create a filtered list of significantly differentially expressed miRNAs.
Select the ANOVAResults spreadsheet
Select Create List from the Analysis section of the workflow
The List Manager dialog will open (Figure 8).
Select brain vs. heart under Contrast: find genes that change between two categories
By default, the fold-change and significance thresholds are set to > 2, < -2 and p-value with FDR < 0.05. These defaults are appropriate for this tutorial so we will leave them in place.
Select Create to create a new list, brain vs. heart containing only the 1404 miRNAs that pass the criteria
Figure 8. Creating a list of significantly differentially expressed miRNAs
A new spreadsheet, brain vs. heart will be created as a child spreadsheet of Affy_miR_BrainHeart (Figure 9).
Figure 9. Viewing brain vs. heart spreadsheet
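The list criteria can be sketched in Python as a fold-change cutoff combined with a false discovery rate cutoff. We assume here that the FDR adjustment is a Benjamini-Hochberg step-up procedure; that choice, the probe names, and all values below are illustrative assumptions rather than a description of the List Manager's internals.

```python
def benjamini_hochberg(p_values):
    """Return step-up FDR-adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical ANOVA results: probe name, signed fold change, unadjusted p-value
results = [
    ("probe_a", 12.5, 0.001),
    ("probe_b", -3.0, 0.02),
    ("probe_c", 1.5, 0.03),
    ("probe_d", -2.5, 0.8),
]

fdr = benjamini_hochberg([p for _, _, p in results])
passing = [name
           for (name, fold, _), q in zip(results, fdr)
           if abs(fold) > 2 and q < 0.05]
print(passing)
```

In this example probe_c fails the fold-change cutoff and probe_d fails the FDR cutoff, so only the first two probes would be kept.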
To view the miRNAs with the largest difference between tissues, we can sort by fold-change.
Right-click the 6. Fold-Change(brain vs. heart) column header
Select Sort Descending by Absolute Value from the pop-up menu
The top 33 miRNAs we see (Figure 10) are all miR-124 from different species. The miRNA miR-124 is the most abundant miRNA in neuronal cells so this finding is expected. The multiple species versions of miR-124 are present because Affymetrix GeneChip miRNA arrays provide comprehensive coverage of miRNAs from multiple organisms including human, mouse, rat, canine, monkey, and many more on a single chip. The miRNAs from these different species are highly homologous so probes targeting miRNAs from other species will hybridize with human miRNAs. Therefore, we need to filter the list of miRNAs to include only human miRNAs.
Figure 10. miR-124 is highly differentially expressed in brain vs. heart
To do this, we need to add a new annotation column containing species information for each probe.
Right-click on the 2. Probeset ID column header
Select Insert Annotation from the pop-up menu
Select Add as categorical
Check Species Scientific Name (Figure 11)
Select OK to add the annotation column
Figure 11. Inserting species annotation column
The table now includes a column 3. Species Scientific Name with the species name of each miRNA. We can now filter to include only human miRNAs.
Right-click the 3. Species Scientific Name column header
Select Find / Replace / Select... from the pop-up menu
Type Homo sapiens for Find What
Select Only in column for Search
Select 3. Species Scientific Name from the drop-down menu next to the Only in column option
Select Select All (Figure 12)
Figure 12. Configuring the Find / Replace / Select... dialog
The search should find and select 251 miRNAs.
Select Close
Right-click any of the row headers that are selected
Select Filter Include from the pop-up menu
The spreadsheet will now include only the 251 miRNAs from human (Figure 13). The first row is still miR-124 with a fold change of 4087.94. The black and gold bar on the right-hand side of the spreadsheet indicates the fraction of rows that have been filtered. To retain this filtered list, we can create a new spreadsheet.
Figure 13. Viewing differentially expressed human miRNAs
Right-click the brain_vs_heart spreadsheet in the spreadsheet tree
Select Clone... from the pop-up menu
Cloning a spreadsheet while a filter is applied copies only the included rows/columns.
Name the spreadsheet brain_vs_heart_human
Select Affy_miR_BrainHeart_intensities from the drop-down menu Create new spreadsheet as a child of spreadsheet
Name the new file brain vs. heart human
Select Save
The new spreadsheet includes only the 251 human miRNAs that are significantly differentially expressed between brain and heart tissue (Figure 14).
Figure 14. Viewing the filtered human miRNAs spreadsheet
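The sort and species filter performed above amount to two simple operations, sketched here in Python. The rows, probe names, and fold-change values are hypothetical placeholders, and the field names are ours rather than spreadsheet column names.

```python
# Hypothetical rows from the list of differentially expressed miRNAs
rows = [
    {"probeset": "probe_1", "species": "Mus musculus", "fold_change": 310.0},
    {"probeset": "probe_2", "species": "Homo sapiens", "fold_change": -45.0},
    {"probeset": "probe_3", "species": "Homo sapiens", "fold_change": 400.0},
    {"probeset": "probe_4", "species": "Rattus norvegicus", "fold_change": -390.0},
]

# Sort descending by absolute value so strong up- and down-regulation rank together
by_magnitude = sorted(rows, key=lambda row: abs(row["fold_change"]), reverse=True)

# Keep only rows annotated as human, preserving the sorted order
human_only = [row for row in by_magnitude if row["species"] == "Homo sapiens"]
print([row["probeset"] for row in human_only])
```

Sorting by absolute value keeps strongly down-regulated miRNAs near the top instead of pushing them to the bottom of the list.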
The next step in our analysis will be integrating miRNA and gene expression data.
This document was developed for Partek Genomics Suite version 6.6 software. Documentation for Partek Genomics Suite version 7.0 software is in development and will replace this document.
Although copy number analysis is a powerful tool for studying genomic aberrations, it lacks the capability to detect changes that are copy-neutral. For example, loss of heterozygosity (LOH) can involve a change in copy number or be copy-neutral. In the former case, LOH could be caused by a hemizygous deletion in which one allele is lost and the other allele remains present (Figure 1, middle panel). This type of LOH can be recognized by copy number analysis or SNP-genotyping. However, in the latter case, an allele is lost initially, but a subsequent amplification of the remaining copy creates a copy-neutral LOH (Figure 1, right panel). This copy-neutral LOH can only be detected when copy number is studied in combination with SNP genotype.
Figure 1. Possible mechanisms of LOH and their impact on copy number. Left panel: heterozygous SNP; numbers indicate the number of copies of each allele (“normal” allele = green, “mutant” allele = red). Middle panel: hemizygous deletion leading to the loss of the normal allele. Right panel: duplication of the “mutant” allele. The case in the middle panel changes the copy number, while the case in the right panel is copy-number neutral
Copy-neutral events can be detected by combining the copy number workflow with the LOH workflow or the Allele-Specific Copy Number (AsCN) workflow to detect allelic imbalance (AI) (advantages of AsCN over LOH are discussed below). With these approaches, the copy number data are supplemented with SNP genotyping data (currently available with Affymetrix® and Illumina® arrays) to label the genomic regions as amplification without LOH/AI, amplification with LOH/AI, deletion without LOH/AI, deletion with LOH/AI, or copy-neutral LOH/AI (Figure 2). The last category, copy-neutral LOH/AI, is the added value of the workflow integration.
An important consideration when choosing between LOH and AsCN analysis is that LOH analysis in the context of cancer has proven complex and difficult because cancer cells frequently deviate from the diploid state and tumor samples often contain many normal cells. As the proportion of tumor cells in a sample decreases and approaches 50% or less, the capacity to detect LOH diminishes (Yamamoto et al., Am J Hum Genet 2007). Additionally, in cases where only one of two alleles is amplified, LOH genotyping algorithms fail to call a heterozygous SNP, resulting in a false-positive LOH call.
Figure 2. Integration of copy number workflow with loss of heterozygosity (LOH) or allelic imbalance (AI) under allele-specific copy number (AsCN) workflows enables the identification of copy-neutral events
AsCN analysis, on the other hand, enables reliable detection of allelic imbalance in tumor samples even in the presence of large proportions of normal cells. Unlike LOH, it does not require a large set of normal reference samples. For a heterozygous SNP, a balance is expected between the two alleles (1×A and 1×B, or 1:1 ratio). The AsCN algorithm provides an estimated number of copies of each allele and therefore enables the detection of allelic imbalance even in cases when alleles are amplified or deleted (e.g. 3×A and 1×B). Moreover, LOH can be considered a special case of AI (e.g., 1×A, B allele deleted) (Figure 3). Therefore, AsCN should be the preferred workflow for tumor samples.
Figure 3. Loss of heterozygosity (LOH) as a special case of allelic imbalance. The situation on the left represents a normal heterozygous SNP, with one copy of each allele
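The categories described above can be illustrated with a minimal sketch. It assumes integer per-allele copy estimates and a diploid baseline of two copies; the function name and the labels it returns are hypothetical and are not taken from Partek Genomics Suite, whose AsCN estimates are continuous and would need thresholds.

```python
def classify_region(copies_a: int, copies_b: int) -> str:
    """Label a region from estimated copies of allele A and allele B.

    Illustrative only: assumes integer estimates and a diploid baseline.
    """
    total = copies_a + copies_b
    if total > 2:
        state = "amplification"
    elif total < 2:
        state = "deletion"
    else:
        state = "copy-neutral"

    if copies_a == copies_b:
        # Balanced alleles: no allelic imbalance.
        return "normal" if state == "copy-neutral" else f"{state} without AI"

    # LOH is the special case of imbalance in which one allele is absent.
    imbalance = "LOH" if min(copies_a, copies_b) == 0 else "AI"
    if state == "copy-neutral":
        return f"copy-neutral {imbalance}"
    return f"{state} with {imbalance}"


if __name__ == "__main__":
    for a, b in [(1, 1), (3, 1), (1, 0), (2, 0), (2, 2)]:
        print(f"{a}xA, {b}xB -> {classify_region(a, b)}")
```

In this sketch the 3×A and 1×B example from the text is labeled as amplification with AI, while 2×A and 0×B is the copy-neutral LOH case that copy number analysis alone would miss.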
Diskin SJ, Li M, Hou C, Yang S, Glessner J, Hakonarson H, Bucan M, Maris JM, Wang K. Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res. 2008 Nov;36(19):e126.
Ramakrishna M, Williams LH, Boyle SE, Bearfoot JL, Sridhar A, Speed TP, Gorringe KL, Campbell IG. Identification of candidate growth promoting genes in ovarian cancer through integrated copy number and expression analysis. PLoS One. 2010 Apr 8;5(4):e9983.
Yamamoto G, Nannya Y, Kato M, Sanada M, Levine RL, Kawamata N, Hangaishi A, Kurokawa M, Chiba S, Gilliland DG, Koeffler HP, Ogawa S. Highly sensitive method for genomewide detection of allelic composition in nonpaired, primary tumor specimens by use of Affymetrix single-nucleotide-polymorphism genotyping microarrays. Am J Hum Genet. 2007 Jul;81(1):114-26.
miRNAs regulate gene expression at the post-transcriptional level by base-pairing with the three prime untranslated region (3’ UTR) of the target gene, causing cleavage/degradation of the cognate mRNA or preventing translation initiation. Integration of miRNA expression with gene expression data to study the overall network of gene regulation is vital to understanding miRNA function in a given sample. Partek Genomics Suite provides a platform that can analyze miRNA and gene expression data independently, yet allows data to be integrated for downstream analysis.
This integrative analysis can be accomplished at several different levels. If you only have miRNA data, then Partek Genomics Suite can search the predicted gene targets in a miRNA-mRNA database like TargetScan to provide a list of genes that might be regulated by the differentially expressed miRNAs. Alternatively, if you have only gene expression data, Partek Genomics Suite can use the same database to identify the miRNAs that putatively regulate those differentially expressed genes in a statistically significant manner. If you have gene expression data and miRNA data from comparable tissue/species, Partek Genomics Suite can combine the results of these separate experiments into one spreadsheet. Lastly, if the miRNA and mRNA from the same source were analyzed (as in this tutorial), then you may statistically correlate the results of miRNA and gene expression assays.
This application is useful in the case where you have miRNA expression data, but not gene expression data. Using a database like TargetScan, microCosm, or a custom database, you can identify the list of genes that are predicted to be regulated by these differentially expressed miRNAs and then perform Biological Interpretation tasks on the list of genes.
Select Combine miRNAs with their mRNA targets from the miRNA Integration section of the miRNA Expression workflow
Select the Get All Targets tab
Select TargetScan7.1 for Database Name
Select brain vs. heart human for Spreadsheet Name
Set Column with microRNA labels to 2. Probeset ID
Name the Result file PutativeGenes
Select OK (Figure 1)
Figure 1. Identifying all predicted gene targets of differentially expressed miRNAs
This will create a new spreadsheet PutativeGenes that contains a miRNA and a putative gene target in each row. Because each miRNA can regulate multiple genes, the list will be much longer than the input miRNA list. Each row contains a gene so this spreadsheet can be analyzed using GO Enrichment and Pathway Enrichment tasks from the Biological Interpretation section of the workflow.
Another useful way to analyze this list is to determine which genes could be targeted by multiple miRNAs in the list. To do this:
Right-click on the column 13. Gene Symbol header
Select Create List With Occurrence Counts from the pop-up menu (Figure 2)
Figure 2. Creating an occurrence counts list from the list of putative miRNA target genes
The new spreadsheet is a temporary spreadsheet listing each gene in alphabetical order and giving the occurrence count of each. Sorting in descending order will list the gene with the most occurrences first (Figure 3).
Figure 3. Occurrence list of putative miRNA target genes
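Outside the software, the same occurrence count can be sketched in a few lines of Python. The miRNA names and gene symbols below are hypothetical placeholders standing in for rows of the PutativeGenes spreadsheet; this shows the idea only and is not how Partek Genomics Suite implements it.

```python
from collections import Counter

# Hypothetical (miRNA, gene symbol) pairs standing in for PutativeGenes rows.
putative_rows = [
    ("miR-A", "GENE1"),
    ("miR-A", "GENE2"),
    ("miR-B", "GENE1"),
    ("miR-C", "GENE1"),
    ("miR-C", "GENE3"),
    ("miR-B", "GENE2"),
]


def occurrence_counts(rows):
    """Count how many miRNA rows name each gene, most frequent first."""
    counts = Counter(gene for _mirna, gene in rows)
    # Sort by descending count; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for gene, count in occurrence_counts(putative_rows):
        print(gene, count)
```

Genes at the top of the sorted list are the ones predicted to be targeted by the largest number of miRNAs in the input list.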
This application is useful when you only have gene expression results or a gene list of interest and want to identify which miRNAs might regulate the genes. Using a database like TargetScan, you can create a list of miRNAs that are statistically predicted to regulate those genes. miRNAs of particular interest could then be explored using a lower-throughput technique like RT-qPCR.
Using the gene list as input, a Fisher's Exact right-tailed p-value is calculated to show the overrepresentation of genes of interest for each miRNA in the database. The smaller the p-value, the more strongly that miRNA's predicted targets are overrepresented in the gene list. Target associations are taken from a database, TargetScan in this example. If the input list is a filtered list of genes from an ANOVA calculation, the parent spreadsheet is used to identify the background list of genes from the array. Genes in the array but not in the significant gene list will be treated as background in the calculations.
To begin, we need to create a list of significant genes using the ANOVAResults gene spreadsheet.
Select the ANOVAResults gene spreadsheet in the spreadsheet tree
Select Create List from the workflow
Select Brain vs. Heart
Set the Save list as to brain vs. heart genes
Leave other fields at their default values (Figure 4)
Select Create
Figure 4. Creating a list of significantly differentially expressed genes
Select Close to exit the List Manager dialog
We will now use this list to identify overrepresented miRNA target sets.
Select Find overrepresented miRNA target sets from the miRNA Integration section of the workflow
Select TargetScan 7.1 from the Target Database drop-down menu
Select brain vs. heart genes from the mRNA Spreadsheet drop-down menu
Select 4. Gene Symbol from the Column with gene symbols drop-down menu (Figure 5)
Select OK
Figure 5. Finding enriched miRNA target sets
A new spreadsheet, enrichedAssociations, will be created with miRNAs from the database on rows (Figure 6). Column 1 contains the miRNA name and column 2 shows its p-value. The smaller the p-value, the more significant it is. Column 3 contains the number of genes from the (input) significant gene list that are targeted by this microRNA and Column 7 shows the number of significant genes from the input list that are not targeted by this microRNA. Columns 4 and 5 contain the number of significantly up- and down-regulated genes from the input significant gene list targeted by the miRNA. Column 6 shows the number of background genes (genes on the array but not in the input significant gene list) that are targeted by the miRNA and Column 8 shows the number of background genes on the array that are not targeted by the miRNA. The numbers in columns 3, 6, 7 and 8 will be used to calculate the Fisher’s Exact (right-tailed) p-value, a measure of the overrepresentation of the predicted miRNAs within a gene set.
Figure 6. Output of the Find Overrepresented miRNA Target Sets tool
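To make the calculation concrete, the sketch below computes a right-tailed Fisher's Exact p-value from the four counts described for columns 3, 6, 7 and 8. It uses the standard hypergeometric formulation; the function and parameter names are hypothetical, and it is not claimed to reproduce the exact numerical procedure inside Partek Genomics Suite.

```python
from math import comb


def fisher_right_tailed(sig_targeted, bg_targeted, sig_not_targeted, bg_not_targeted):
    """Right-tailed Fisher's Exact p-value for one miRNA.

    Arguments correspond to the counts described for columns 3, 6, 7 and 8:
    significant genes targeted, background genes targeted, significant
    genes not targeted, background genes not targeted.
    """
    targeted = sig_targeted + bg_targeted
    not_targeted = sig_not_targeted + bg_not_targeted
    significant = sig_targeted + sig_not_targeted
    total = targeted + not_targeted

    # Probability of seeing at least `sig_targeted` targeted genes among the
    # significant genes if targets were spread at random over the array.
    upper = min(targeted, significant)
    numerator = sum(
        comb(targeted, k) * comb(not_targeted, significant - k)
        for k in range(sig_targeted, upper + 1)
    )
    return numerator / comb(total, significant)


if __name__ == "__main__":
    # Toy counts: 3 of 4 significant genes are targets, 1 of 4 background genes.
    print(fisher_right_tailed(3, 1, 1, 3))
```

With the toy counts shown, the p-value is 17/70 (about 0.243): with so few genes, three targeted genes out of four is not strong evidence of enrichment.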
As the enrichment p-values have not been corrected for running multiple statistical tests, we can use the multiple test correction feature of Partek Genomics Suite to adjust the p-values.
Select the enrichedAssociations spreadsheet
Select Stat from the main menu toolbar
Select Multiple Test Correction
Select all the multiple test correction options
Transfer Enrichment p-value to the Selected Column(s) panel from the Candidate Column(s) panel (Figure 7)
Figure 7. Configuring the Multiple Test Correction dialog
Columns for each of the test correction methods will be added to the enrichedAssociations spreadsheet and can be used to filter the list of miRNAs.
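For readers who want to see what such an adjustment does, here is a minimal sketch of two widely used corrections, Bonferroni and the Benjamini-Hochberg step-up FDR procedure. That these correspond to options offered in the dialog is an assumption; the function names are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg step-up adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]
    print(bonferroni(raw))
    print(benjamini_hochberg(raw))
```

Bonferroni controls the chance of any false positive and is the more conservative of the two; Benjamini-Hochberg controls the expected fraction of false positives among the miRNAs called significant.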
This option is useful if you have miRNA and gene expression experiments you want to compare. The samples should be comparable, but do not have to originate from the same specimens.
Select Combine miRNAs with their mRNA targets from the miRNA Integration section of the workflow
Select the Get Targets from Spreadsheet tab
Select TargetScan 7.1 from the Target Database drop-down menu
Select brain vs. heart human from the microRNA Spreadsheet drop-down menu
Select 2. Probeset ID for Column with microRNA labels
Select ANOVAResults gene from the mRNA Spreadsheet drop-down menu
Select 4. Gene Symbol for Column with gene symbols (Figure 8)
Select OK
Figure 8. Combining miRNAs with their mRNA targets
In the new spreadsheet, each row represents a specific miRNA associated with one of its target genes; a single miRNA can have multiple targets. For example, hsa-miR-133b_st has 659 rows, one for each target (Figure 9).
Figure 9. Viewing the combined spreadsheet with miRNAs and mRNA targets
Columns 1-12 are taken from the miRNA expression source spreadsheet while columns 13-26 are taken from the gene expression source spreadsheet.
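Conceptually, the combined spreadsheet is a join of the miRNA results and the gene results through the miRNA-target pairs in the database. The sketch below illustrates this with hypothetical names and values; keeping only pairs found in both result sets is an assumption about the behavior, not a documented detail.

```python
# Hypothetical miRNA results, target database pairs, and gene results.
mirna_results = {
    "miR-A": {"fold_change": 4.2},
    "miR-B": {"fold_change": -3.1},
}
target_pairs = [
    ("miR-A", "GENE1"),
    ("miR-A", "GENE2"),
    ("miR-B", "GENE2"),
    ("miR-B", "GENE9"),  # GENE9 has no gene expression result
]
gene_results = {
    "GENE1": {"fold_change": -2.0},
    "GENE2": {"fold_change": 1.5},
}


def combine(mirnas, pairs, genes):
    """Build one row per miRNA-target pair present in both result sets."""
    combined = []
    for mirna, gene in pairs:
        if mirna in mirnas and gene in genes:
            combined.append(
                {
                    "miRNA": mirna,
                    "miRNA_fold_change": mirnas[mirna]["fold_change"],
                    "gene": gene,
                    "gene_fold_change": genes[gene]["fold_change"],
                }
            )
    return combined


if __name__ == "__main__":
    for row in combine(mirna_results, target_pairs, gene_results):
        print(row)
```

Because one miRNA can appear in many pairs, its columns are repeated on every row for each of its targets, which is why a single miRNA can occupy hundreds of rows in the combined spreadsheet.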
This application is useful when you have miRNA and mRNA expression data from the same samples and want to correlate the findings to determine whether up- or down-regulated miRNAs result in gene expression changes in their cognate genes. Pearson and Spearman correlation coefficients and their p-values are calculated.
Select Correlate miRNA and mRNA data from the miRNA Integration section of the workflow
Select TargetScan7.1 from the Target Database drop-down menu
Select Affy_miR_BrainHeart_intensities for the microRNA spreadsheet using the drop-down menu
Select Affy_HuGeneST_BrainHeart_GeneIntensities as the mRNA spreadsheet using the drop-down menu (Figure 10)
Select OK
Figure 10. Configuring the Correlate miRNA-mRNA dialog
Next, select the SampleID column from each spreadsheet. These must match.
Select 6. SampleID for Affy_miR_BrainHeart_intensities
Select 6. SampleID for Affy_HuGeneST_BrainHeart_GeneIntensities
Select OK (Figure 11)
Figure 11. Choosing matching Sample ID columns
The new spreadsheet, correlation.txt, will open (Figure 12). Each row contains one miRNA correlated with one of its target genes. The first column contains the miRNA probeset ID from the miRNA intensities spreadsheet. The second column contains the mRNA probeset ID from the gene expression intensities spreadsheet. The third column lists the gene symbol and the fourth the miRNA name. The fifth and sixth columns are the Pearson correlation coefficient and its p-value for the gene-miRNA pair. The seventh and eighth columns are the Spearman's rank correlation coefficient and its p-value for the gene-miRNA pair. Negative correlation indicates that a high level of the miRNA is associated with a low expression level of its target gene. Positive correlation indicates that a high level of the miRNA is associated with a high level of its target gene.
Figure 12. Viewing the correlation spreadsheet
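The two coefficients can be sketched as follows for a single miRNA-gene pair measured across matched samples. This is a minimal illustration with hypothetical function names; it omits the p-values and does not handle a variable with zero variance (which would raise a ZeroDivisionError).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    mirna = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical miRNA intensities
    gene = [10.0, 8.0, 6.0, 4.0, 2.0]   # hypothetical target gene intensities
    print(pearson(mirna, gene), spearman(mirna, gene))
```

Pearson measures linear association, whereas Spearman only requires the relationship to be monotonic, so the two can differ when a miRNA and its target are related in a non-linear way.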
We can visualize the correlation between any miRNA and target gene.
Right-click a row header
Select Scatter Plot (Orig. Data) from the pop-up menu
The correlation plot shows miRNA intensity on the x-axis and gene expression on the y-axis (Figure 13). Here, we see a negative correlation between expression of xtr-miR-148a_st and its target gene, RAB14, in brain and heart tissues. Drawing the scatter plot will create a temporary file with miRNA and gene expression probe intensities for all samples that is used to draw the plot.
Figure 13. Viewing the scatter plot showing correlated miRNA and target gene expression
Please note that the correlation function is only useful for identifying miRNAs that affect mRNA stability, not translation.
Select to activate Rotate Mode
Select to open the interactive filter
Select to save the spreadsheet
Principal Components Analysis (PCA) is an excellent method to visualize similarities and differences between the samples in a data set. PCA can be invoked through a workflow, by selecting from the main command bar, or by selecting Scatter Plot from the View section of the main toolbar. We will use a workflow.
We will not be using this temporary spreadsheet moving forward. You can close the spreadsheet by selecting