# Exploring gene expression data

At this point in the analysis, take a preliminary look at the data. Do the genes you expected to be differentially regulated appear to have larger or smaller intensity values? Do similar samples resemble each other?

The latter question can be explored using Principal Components Analysis (PCA), a method for reducing the dimensionality of high-dimensional data so that it can be visualized.

* Select **PCA Scatter Plot** from the *QA/QC* section of the *Gene Expression* workflow

A *Scatter Plot* tab containing your PCA plot will open (Figure 1).

![](https://1384254481-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FJVEESmJAPppJ3ijFq5aR%2Fuploads%2Fgit-blob-2b0e4ae5a3306a8f2e53878ae328ec7641644111%2F2017-08-07%2009_12_07-Partek%20Genomics%20Suite%20-%201%20\(Down_Syndrome-GE\).png?alt=media)

Figure 1. PCA Scatter Plot tab

In the scatter plot, each point represents a chip (sample) and corresponds to a row on the top-level spreadsheet. The color of the point represents the *Type* of the sample; red represents a normal sample and blue represents a Down syndrome sample. Points that are close together in the plot have similar intensity values across the probe sets on the whole chip, while points that are far apart in the plot are dissimilar.
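To make the idea of "close together means similar" concrete, here is a minimal NumPy sketch of how samples can be projected onto their first three principal components. It is not Partek Genomics Suite's implementation; the function name `pca_scores` and the synthetic intensity matrix are assumptions made purely for illustration.

```python
import numpy as np

def pca_scores(intensities, n_components=3):
    """Project samples (rows) onto their first principal components."""
    # Center each probe set (column) so PCA describes variation around the mean
    centered = intensities - intensities.mean(axis=0)
    # SVD of the centered matrix; u * s gives the sample coordinates (scores)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    # Fraction of the total variance captured by each component
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]

# Synthetic stand-in for the top-level spreadsheet: 6 samples x 200 probe sets
rng = np.random.default_rng(0)
intensities = rng.normal(8.0, 0.3, size=(6, 200))
intensities[3:, :100] += 2.0  # samples 3-5 differ on half of the probe sets

scores, explained = pca_scores(intensities)

# Distance in PC space: similar samples sit closer than dissimilar ones
d_within = np.linalg.norm(scores[0] - scores[1])
d_between = np.linalg.norm(scores[0] - scores[3])
print(f"within-group distance:  {d_within:.2f}")
print(f"between-group distance: {d_between:.2f}")
print("variance explained by PC1-PC3:", np.round(explained, 3))
```

Each row of `scores` plays the role of one point in the 3D scatter plot, so the distance between two rows approximates how far apart the corresponding samples would appear.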

Left-clicking on any point in the scatter plot selects that point. A dash with an identifying row number will appear on the selected PCA plot point. The spreadsheet in the *Analysis* tab will also jump to the corresponding row.

To rotate the plot, hold the mouse wheel down and drag the mouse, or select the **Rotate Mode** icon (![](https://1384254481-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FJVEESmJAPppJ3ijFq5aR%2Fuploads%2Fgit-blob-8e084927fb0c9214c060d851c9623eb34f3b28d3%2Fimage2017-6-14%2017_13_37.png?alt=media)) on the left side of the *Scatter Plot* tab. With **Rotate Mode** selected, press the left mouse button and drag to rotate the plot. Rotating the plot allows you to examine grouping patterns and outliers in the data on the first three principal components (PCs).

To zoom in and out, scroll the mouse wheel up or down while the cursor is on the PCA plot, or select the **Zoom Mode** icon (![](https://1384254481-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FJVEESmJAPppJ3ijFq5aR%2Fuploads%2Fgit-blob-e1ae205079ca0d8853a0c797be71850ec4a8538d%2Fimage2017-6-14%2017_13_54.png?alt=media)) on the left side of the *Scatter Plot* tab.

Selecting the **Reset** icon (![](https://1384254481-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FJVEESmJAPppJ3ijFq5aR%2Fuploads%2Fgit-blob-c71d9436626450f666d628a5b37ddf1abd7539a5%2Fimage2017-6-14%2017_14_5.png?alt=media)) on the left side of the *Scatter Plot* tab will return the PCA plot to its original orientation and zoom.

As you can see from rotating the plot, there is no clear separation between Down syndrome and normal samples in these data, because the red and blue points are not separated in space. However, there are other factors that may separate the data.

* In the *Scatter Plot* tab, select the **Rendering Properties** icon (![](https://1384254481-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FJVEESmJAPppJ3ijFq5aR%2Fuploads%2Fgit-blob-16d5b969b6110132d4500fd8ef843ebd9622ea62%2Fimage2017-6-14%2017_15_50.png?alt=media)) and configure the plot as shown (Figure 2)
* *Color* the points by column **4. Tissue** and *Size* the points by column **3. Type**
* Select **OK**

![](https://1384254481-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FJVEESmJAPppJ3ijFq5aR%2Fuploads%2Fgit-blob-209f93b900e54684d4e689675807c6c992b0ec91%2F2017-08-07%2009_13_09-Plot%20Rendering%20Properties.png?alt=media)

Figure 2. Configuring the PCA scatter plot: Color by Tissue, size by Type

Notice now that the data are clustered by different tissues (Figure 3).

![](https://1384254481-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FJVEESmJAPppJ3ijFq5aR%2Fuploads%2Fgit-blob-40f5738ced749356cf1ea2f86f7121964bd70778%2F2017-08-07%2009_13_59-Greenshot.png?alt=media)

Figure 3. PCA scatter plot configured with color by Tissue, size by Type

Another way to see the cluster pattern is to draw an ellipse around each of the *Tissue* groups.

* Open the *Plot Rendering Properties* dialog and select the **Ellipsoids** tab
* Select **Add Ellipse/Ellipsoid**
* Select **Ellipse** in the *Add Ellipse/Ellipsoid...* dialog
* Double-click on **Tissue** in the *Categorical Variable(s)* panel to move it to the *Grouping Variable(s)* panel (Figure 4)
* Select **OK** to close the *Add Ellipse/Ellipsoid...* dialog and select **OK** again to exit the *Plot Rendering Properties* dialog

![](https://1384254481-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FJVEESmJAPppJ3ijFq5aR%2Fuploads%2Fgit-blob-ee3a3e50dd0784eda5f9d5adbd58083f46c00337%2F2017-08-07%2009_15_01-Add%20Ellipse_Ellipsoid....png?alt=media)

Figure 4. Adding Ellipses to PCA Scatter Plot

By rotating this PCA plot, you can see that the data are separated by tissue and that, within some of the tissues, the Down syndrome samples and normal samples are also separated. For example, in the *Astrocyte* and *Heart* tissues, the Down syndrome samples (small dots) are on the left, and the normal samples (large dots) are on the right (Figure 5).

![](https://1384254481-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FJVEESmJAPppJ3ijFq5aR%2Fuploads%2Fgit-blob-0d707246d6ceac8e44752fede2a2504cdd0aa080%2F2017-08-07%2009_15_45-Partek%20Genomics%20Suite%20-%201%20\(Down_Syndrome-GE\).png?alt=media)

Figure 5. PCA scatter plot with ellipses, rotated to show separation by Type

PCA is an example of exploratory data analysis and is useful for identifying outliers and major effects in the data. From the scatter plot, you can see that tissue is the biggest source of variation. Many genes are expressed differently between tissues, but fewer genes are expressed differently between types (Down syndrome and normal) across the whole chip.
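One way to put a number on "biggest source of variation" is to ask how much of the spread along the first principal component is explained by each factor. The sketch below does this with an eta-squared ratio (between-group sum of squares over total sum of squares) on synthetic data. The helper `eta_squared`, the tissue labels, and the effect sizes are all hypothetical and chosen only so that the tissue effect dominates; this is not a calculation performed by Partek Genomics Suite in this tutorial.

```python
import numpy as np

def eta_squared(values, labels):
    """Fraction of the variance in `values` explained by group membership."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    grand_mean = values.mean()
    ss_total = np.sum((values - grand_mean) ** 2)
    ss_between = sum(
        np.sum(labels == g) * (values[labels == g].mean() - grand_mean) ** 2
        for g in np.unique(labels)
    )
    return ss_between / ss_total

# Hypothetical balanced design: 3 tissues x 2 types x 2 replicates = 12 samples
tissue = np.repeat(["tissue_A", "tissue_B", "tissue_C"], 4)
sample_type = np.tile(["Down syndrome", "Normal"], 6)

# Synthetic intensities: 12 samples x 300 probe sets
rng = np.random.default_rng(1)
data = rng.normal(8.0, 0.3, size=(12, 300))
data[tissue == "tissue_B", :100] += 2.0              # large tissue effect
data[tissue == "tissue_C", 100:200] += 2.0           # large tissue effect
data[sample_type == "Down syndrome", 200:230] += 0.5  # smaller type effect

# Scores of each sample on the first principal component
centered = data - data.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
pc1 = u[:, 0] * s[0]

eta_tissue = eta_squared(pc1, tissue)
eta_type = eta_squared(pc1, sample_type)
print(f"PC1 variance explained by tissue: {eta_tissue:.3f}")
print(f"PC1 variance explained by type:   {eta_type:.3f}")
```

When one factor explains most of PC1 and the other explains very little, the plot shows the pattern described above: clear clusters for the dominant factor, with the weaker factor visible only within those clusters, if at all.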

The next step is to draw a histogram to examine the samples.

* Select **Sample Histogram** in the *QA/QC* section of the *Gene Expression* workflow to generate the *Histogram* tab (Figure 6)

![](https://1384254481-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FJVEESmJAPppJ3ijFq5aR%2Fuploads%2Fgit-blob-48740f5ebd780caf3326c0c05b11ffdcc3593e6e%2F2017-08-07%2009_16_28-Partek%20Genomics%20Suite%20-%201%20\(Down_Syndrome-GE\).png?alt=media)

Figure 6. Histogram tab

The histogram plots one line for each sample, with probe intensity on the X-axis and the frequency of that intensity on the Y-axis. This lets you view the distribution of intensities and identify any outliers. In this dataset, all the samples follow the same distribution pattern, indicating that there are no obvious outliers in the data. As with the PCA plot, clicking on any line in the histogram highlights the corresponding row in the spreadsheet *1 (Down\_Syndrome-GE)*. You can also change the way the histogram displays the data by clicking on the *Plot Properties* button. Feel free to explore these options on your own.
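As a rough illustration of what each line in the sample histogram represents, the following NumPy sketch bins every sample's intensities on a shared set of bin edges. The function `sample_histograms`, the bin count, and the synthetic data are assumptions for illustration, and the `deviation` score is only one hypothetical way to compare distributions, not a Partek QC metric.

```python
import numpy as np

def sample_histograms(intensities, n_bins=20):
    """Per-sample frequency of probe intensities on shared bin edges."""
    # Shared edges make the per-sample curves directly comparable
    edges = np.linspace(intensities.min(), intensities.max(), n_bins + 1)
    counts = np.array(
        [np.histogram(row, bins=edges)[0] for row in intensities]
    )
    return edges, counts

# Synthetic stand-in: 5 samples x 1000 probe intensities
rng = np.random.default_rng(2)
intensities = rng.normal(8.0, 1.5, size=(5, 1000))

edges, counts = sample_histograms(intensities)

# Hypothetical outlier score: how far each sample's profile is from the
# median profile, as a fraction of that sample's probe count
median_profile = np.median(counts, axis=0)
deviation = np.abs(counts - median_profile).sum(axis=1) / counts.sum(axis=1)

for i, d in enumerate(deviation):
    print(f"sample {i}: deviation from median profile = {d:.3f}")
```

Each row of `counts` corresponds to one line in the histogram; a sample whose row differs markedly from the others would stand out as a line with a different shape.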

The decision to discard any samples would be based on information from the PCA plot, sample histogram plot, and QC metrics. To discard a sample and renormalize the data (without the effects of the outlier), start over with importing samples and omit the outlier sample(s) during the .CEL file import.

## Additional Assistance

If you need additional assistance, please visit [our support page](http://www.partek.com/support) to submit a help ticket or find phone numbers for regional support.


