
Blog Posts

This is a collection of blog posts from our website that you might find of interest.

How to select the best single cell quality control thresholds

The answer no one wants to hear

Using trajectory analysis to determine their fate

Transcriptome-wide studies of gene expression certainly provide invaluable insight into biology on a molecular level, particularly when performed at the single-cell level

Which tools to use for single cell analysis

Can nuisance batch effects or undesirable numeric or categorical factors be removed?

Step one in performing single cell analysis

Strategies for integrating single cell RNA-Seq multiomic data

Comparing gene expression levels across samples

We discuss the advantages of simultaneous gene and protein expression measurement.

Let’s discuss how to make multi-omics analysis and integration seamless by bringing all your analysis tools and data together.

Using tasting results of 86 different Scotch malts, let’s explore how you can apply statistical tools to non-genomic data.

With the human genome being extensively described and studies of the proteome well under way, is the glycome the final frontier?

Cells are sometimes mysterious and do not readily reveal their true identity. Here’s how to identify their biological nature.

How well does Partek Flow software analyze non-model organism data, without having to deal with any command-line tools?

How to select the best single cell quality control thresholds

When asked what the best single cell quality control thresholds are, I know the person asking wants a number such as:

Cells with a total count between X and Y are of good quality.
You need at least X number of detected genes for a cell to be informative.
Cells with more than X% of mitochondrial counts are of bad quality.

The real answer is it depends.

Why there is no one best single cell quality control threshold

When I started analyzing single cell RNA-Seq data, I found myself asking the same questions. At first, I was expecting some standards to emerge in the field as they did for interpreting PHRED base call quality scores in NGS sequencing data. Instead of standard threshold values, the field developed a broad set of considerations to account for when assessing single cell quality.

To date, the best set of single cell quality control threshold recommendations I have seen are outlined in . This paper is a personal favorite of mine to which I refer often.

From my own hands-on experience, here are the most important lessons I have learned when deciding which single cell QA/QC thresholds to use.

Reason one: single cell threshold selection is a trade-off between quality and quantity

If you are more stringent with your thresholds, you will retain fewer cells of higher quality. If you are more lenient, you will retain more cells, some of which may be of lower quality.

How stringent or lenient you are will depend on how many cells you need to answer your research question, and at what quality.

Do you need many cells to assess the general heterogeneity of a sample? If so, perhaps you can afford to be more lenient.

Are you looking for a rare cell type that may be present in low frequency? Go lenient.

Do you need highly accurate cell type identifications for a set of cells? If so, perhaps it makes sense to be more stringent.
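The trade-off can be made concrete with a small sketch. Everything below is hypothetical and chosen only for illustration: the per-cell metrics, both sets of threshold values, and the names `CellQC`, `QCThresholds`, `passes`, and `retained`. None of the numbers are recommended values.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    total_count: int      # total counts observed in the cell
    detected_genes: int   # number of genes with at least one count
    pct_mito: float       # percentage of counts from mitochondrial genes


@dataclass
class QCThresholds:
    min_count: int
    max_count: int
    min_genes: int
    max_pct_mito: float


def passes(cell: CellQC, t: QCThresholds) -> bool:
    """Return True if the cell satisfies all three quality criteria."""
    return (
        t.min_count <= cell.total_count <= t.max_count
        and cell.detected_genes >= t.min_genes
        and cell.pct_mito <= t.max_pct_mito
    )


def retained(cells, t):
    """Keep only the cells that pass the given thresholds."""
    return [c for c in cells if passes(c, t)]


# Made-up cells, not real data.
cells = [
    CellQC(5200, 1800, 4.0),
    CellQC(900, 450, 12.0),
    CellQC(3100, 1200, 22.0),
    CellQC(600, 250, 35.0),
    CellQC(48000, 6500, 3.0),
]

# Two hypothetical threshold sets: one stringent, one lenient.
stringent = QCThresholds(min_count=1000, max_count=30000, min_genes=500, max_pct_mito=15.0)
lenient = QCThresholds(min_count=500, max_count=50000, min_genes=200, max_pct_mito=30.0)

n_stringent = len(retained(cells, stringent))
n_lenient = len(retained(cells, lenient))
print(f"stringent keeps {n_stringent} cells, lenient keeps {n_lenient} cells")
```

Running the same cells through both threshold sets and comparing how many survive is a quick way to see what a stricter or looser choice costs you before committing to it.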

Reason two: the biological context of the sample matters in threshold selection

The biology of the sample or an experimental treatment may affect the single cell quality control metrics in a predictable way.

For example, you might be tempted to set the maximum percentage of mitochondrial counts threshold to 15%, because a higher percentage is typically indicative of a damaged cell. However, if you are working with a more metabolically active tissue, such as the kidney, a maximum threshold of ~30% makes more biological sense.

Perhaps you are working with a dissociated tumor sample and expecting some infiltrating normal cells, which are known to be smaller than the tumor cells. It stands to reason those smaller cells will have less total RNA content, so we would expect their total count to be lower. In this case, it might make sense to lower the minimum total count threshold, so you don’t inadvertently exclude a sub-population of infiltrating normal cells.

These are just a couple of examples of things to consider when selecting a single cell QA/QC threshold. Other things to consider are the expected effects of treatments, gene knockouts, and sample handling on the three quality metrics.

Conclusion

Don’t be afraid of making a mistake. You will not break the data by choosing different thresholds. The original data will always be intact, so you can go back and re-run things.

So, make a choice. It doesn’t have to be a perfect choice, just make a choice, and test it. Look at the downstream clustering, biomarkers, differential gene expression results, and visualizations.

Cellular Differentiation Using Trajectory Analysis & Single Cell RNA-Seq Data

Researchers use single cell RNA-Seq trajectory analysis to reveal cell fate. In ancient Greek mythology, three goddesses known as the Moirai determined each mortal’s fate at birth by spinning their thread of life. Today, instead of a spinning wheel, the Moirai might use Monocle, a trajectory analysis tool embedded in Partek Flow, to determine a cell’s fate. Monocle uses single cell RNA-Seq data to detect cell transitions between different cell states, in other words, to reveal cell fate.

Cell differentiation is at the heart of many biological processes

A hallmark example of a cell fate study is cell differentiation. Cellular differentiation can be studied under physiological conditions or as part of a pathophysiological process. For example, we may want to dissect the cell differentiation of lymphocytes as a response to SARS-CoV-2 infection.

Taking this one step further, cancer cells are very heterogeneous, ranging from undifferentiated cells to cells resembling their normal counterparts. Therefore, their different states and transitions between them can be analyzed with a trajectory approach.

Trajectory analysis of cancer single cell RNA-Seq data

Here’s an example. In 2017, published their work on single cell RNA-Seq analysis of human gliomas. For this short article, we used a subset of that study, i.e., four oligodendroglioma samples (3,000 cells in total), and ran trajectory analysis.

By looking at the left panel in this figure, we can immediately appreciate the complexity of this cell population: eleven different states (sub-populations) were detected. Each sub-population is characterized by a unique gene expression pattern. An interesting gene in this context is the cyclin I gene (CCNI), a member of a gene family tightly associated with the cell cycle. High expression levels are observed on the right side of the plot, such as in state 10, while the levels of CCNI decrease as we traverse down the cell differentiation tree represented by the black lines.

Knowing the pattern of gene expression (i.e., genes being turned on or switched off), we can reconstruct the differentiation sequence from early to final states. That information is shown in the right panel, where the cells are colored by pseudotime. The beginning of pseudotime is indicated by a very light blue, and the color becomes brighter as pseudotime progresses. Note the complementarity of the CCNI expression and pseudotime patterns: as the cancer cells become more differentiated (that is, more similar to healthy cells), the activity of cell cycle genes goes down.

What if a new drug or treatment affected cancer cell differentiation? Since cancer differentiation is typically tied to improved survival rates, a drug with the ability to promote differentiation is well worth reviewing.

Other applications of trajectory analysis

We encourage you to explore the use of trajectory analysis tools not only for obvious purposes where cell differentiation is of primary interest, such as developmental biology, but whenever you are looking at cells that can be classified into sub-groups, which, in turn, can be organized in a before-after relationship.

Or, to rephrase: Greek heroes were bound by the verdict of the Moirai and were not able to change their fate and (most often!) escape their doom. On the other hand, software tools are not Greek heroes, so why not deviate from the routine and use them for non-obvious purposes?

Spatial transcriptomics—what’s the big deal and why you should do it

We may all be under the impression that single cell RNA-Seq is a new technology, but it has been around for more than a decade. Hard to believe, right? And what about tissue transcriptomics (a.k.a. spatially resolved transcriptomics)? If we focus only on RNA sequencing-based analysis, then the first paper was published in 2013, which, in the “omics” age, cannot be considered a novelty. To be fair, the concept of gene expression analysis within tissue context is considerably older: it was first introduced in 1982. Either way, tissue transcriptomics really rose to prominence within the last two years and continues to gain a lot of attention.

Transcriptome-wide studies of gene expression certainly provide invaluable insight into biology on a molecular level, particularly when performed at the single cell level (i.e., single cell RNA-Seq). However, some key pieces of information, such as tissue relationships, are still missing and are only made available by tissue transcriptomics techniques. Knowing the location of each cell, as well as its surroundings, enables us to fully understand and appreciate molecular events. Two straightforward examples that come to mind are embryonic development and tumor microenvironments. One of the key features of development is spatially and temporally coordinated gradients of gene expression, which then turn off developmental genes. Cancer tissue, the second example, is composed of different neoplastic populations and has a “contaminating” component of healthy tissue, such as stroma or, especially exciting, invading leukocytes.

Detecting differential gene expression in single cell RNA-Seq analysis

One of the main goals of both the Single Cell RNA-Seq and bulk RNA-Seq pipelines is to detect differentially expressed genes. There are many available tools from which to choose such as and , as well as classic statistical tests like ANOVA or Welch’s ANOVA. Thus, a frequent question is which to choose?

When discussing data analysis with our customers, I often hear that “this” or “that” approach has been suggested to them by others, because it is “better”. “Better” is quite difficult to define but typically leads to a conclusion like “I found more significant genes with this test”. Having a research scientist’s background, I certainly understand the appeal of more targets, but is this necessarily something for which to strive? Spoiler alert: no, you should find the right genes, not just more genes.

Finding the right genes is where the hurdle models come into play. Hurdle models are a class of statistical models developed for count data, with the idea of handling and . Sounds exactly like single cell RNA-Seq data, right? (If you are interested in the mathematics, you may want to go over this excellent overview of )

We can think of a hurdle model as a two-question process: is a gene detected in the project or not, and if yes, has the expression level hurdle been crossed? Note that if we observe zero counts for a gene, this does not necessarily mean that the gene is not expressed. Expression may not have been detected because sequencing was too shallow or for a number of other technical reasons (e.g., sampling error, reverse transcription inefficiency, etc.). That phenomenon is known as “

Batch remover for single cell data

Experimental data, such as single cell RNA-Seq, is frequently burdened by nuisance batch effects, or undesirable numeric or categorical factors. Due to logistic constraints, data is often processed in different batches, e.g., different operator, different flow cell, different reagent lot and so on. If the processing batches are included in the experimental design or are relatively well balanced within the experimental conditions (for example, technician A processed half of the control and half of the treated samples, while technician B took care of the other half), their effects can be identified and removed from the data.
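To illustrate why a balanced design makes a batch effect removable, here is a minimal sketch for a single gene. It shifts each batch so that its mean matches the overall mean. This is plain mean-centering, used here as a stand-in; it is not the Partek batch remover, and the function name and toy values are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def remove_batch_effect(values, batches):
    """Shift each batch so that its mean equals the grand mean.

    Only safe when conditions are balanced across batches; otherwise
    part of the biological effect would be removed along with the batch.
    """
    grand_mean = mean(values)
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)
    # Offset of each batch relative to the grand mean.
    offsets = {b: mean(vs) - grand_mean for b, vs in by_batch.items()}
    return [value - offsets[batch] for value, batch in zip(values, batches)]


# Toy expression values for one gene: technician B reads 3 units higher,
# and treatment raises expression by 4 units in both batches.
expression = [10.0, 14.0, 13.0, 17.0]
batches = ["A", "A", "B", "B"]
conditions = ["control", "treated", "control", "treated"]

corrected = remove_batch_effect(expression, batches)
for cond, batch, before, after in zip(conditions, batches, expression, corrected):
    print(f"{cond:8s} batch {batch}: {before:5.1f} -> {after:5.1f}")
```

After the correction, the two control values agree with each other and so do the two treated values, while the difference between control and treated is unchanged. If technician A had processed only controls and technician B only treated samples, the same subtraction would have erased the treatment effect, which is why the balance of the design matters.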

All the way back in the microarray era, Partek Genomics Suite was well known in the field for its batch remover. Now, our batch remover has been implemented in Partek Flow 8.0. To illustrate the batch remover in action, I downloaded two public data sets from 10x Genomics®: 1,000 peripheral blood mononuclear cells (PBMC) from a healthy human, processed by v2 chemistry, and the same sample processed by v3 chemistry. I analyzed the data in Partek Flow (e.g., by removing dead or apoptotic cells) and identified several cell types.

You would expect the batch effect in this project to be as large as it gets, and you would be right. The left panel of Figure 1 shows the t-SNE before the correction: the cells of each type split into two groups, based on the chemistry (instead of being clustered together). Now the good news! Once the batch effect has been removed, that pattern is no longer discernible (Figure 1, right panel): cells of the same type are grouped tightly together.

The scenario presented above, where two different chemistries were used to generate the data, can be considered as an extreme example of batch effect and it is highly unlikely that you would face it in the real world. I was using it for illustration purposes only. If the Partek Flow batch remover can handle batch effects of this magnitude, it will have no problem helping you with an actual data set.

How to perform single cell RNA sequencing: exploratory analysis

Is anyone out there still performing bulk RNA-Seq? I am sure that it is far from dying out but judging by what scientists talk about and report on, single cell RNA sequencing (scRNA-Seq) has become the new norm.

Although, when you think about it, it can hardly be called a novel approach. The first manuscript on scRNA-Seq was published in 2009, but you may be surprised to hear that the basic principles were described in the 1990s (seminal works by and ). That decade (or three) is quite negligible for a geologist, but in terms of molecular biology, scRNA-Seq is getting gray hair. In comparison, bioinformatics analysis of scRNA-Seq data has yet to mature and there is no consensus in the community.

Single cell analysis is too large a topic for a single blog post. To break it down into bite-size reading, we’ll talk about each step in a separate blog post.

Single Cell Multiomics Analysis: Strategies for Integration

One of the most exciting technological innovations in recent years is multiomic analysis in single cells. Technologies like CITE-Seq (transcriptomic and proteomic expression data from the same cells) have taken some research fields, such as immuno-oncology and neuroscience, to the next level.

It’s one thing to obtain both data types, and another to know how best to use them. Here are three strategies for integrating transcriptomic and proteomic data, particularly for identifying cell types.

Use only the proteomic data to cluster cells, use protein & gene expression to infer cell types

Cell surface proteins offer excellent resolving power in identifying the major cell types, so it makes sense to combine information across multiple protein markers to cluster cells together. You can use the most highly expressed proteins (biomarkers) in each cluster to infer which cell type each cluster corresponds to. If any protein markers are missing from the panel, you can use highly expressed genes as biomarkers to fill in the gaps (Figure 1).

One drawback to this approach is the limited number of protein markers in a panel. Each oligo-conjugated antibody targets a different cell surface protein. A panel of 15-20 antibodies may be enough to distinguish some of the major cell types, which may be fine for some studies. But if cell surface protein markers specific to certain cell types are missing from the panel, those cell types may not be easily distinguished as separate clusters and identified.

Use only the transcriptomic data to cluster cells, use gene & protein expression to infer cell types

Single cell RNA-Sequencing has proven to be effective in clustering single cells based on their transcriptome-wide gene expression profiles. High expression of known, canonical marker genes in each cluster can be used to infer the cell type identity of each cluster. While the proteomic data may be limited by the number of known markers, transcriptomic data has no such limitation, with thousands of genes being quantified per cell in an unbiased manner. The additional gene information and biomarkers also create the possibility of discovering novel, cryptic cell type subsets (Figure 2).

One drawback of this approach is that cells that are phenotypically distinct at the cell surface protein level are not always distinct at the transcriptomic level. For example, CD4+ and CD8+ T cells are clearly distinguished as separate clusters at the proteomic level (Figure 1), but they are closer together at the transcriptomic level (Figure 2). Furthermore, the T cells cluster primarily by their state (naive & differentiated) in Figure 2, rather than by their phenotype. Layering in the protein biomarkers onto the transcriptome-derived clusters can help resolve these cell types.

Use a weighted nearest neighbor (WNN) approach to integrate both data sets, cluster the cells, use gene & protein expression to infer cell types

Both transcriptomic and proteomic data will vary in their information content and quality. WNN is a computational method that blends the best of both data types. It learns the information content of each data type and generates an integrated representation of the joint data.

First, you reduce the dimensionality of each data type using principal component analysis (PCA). The PCA results are fed into the WNN algorithm, which generates a WNN graph. For each cell, the graph lists the most similar cells (nearest neighbors) and their distances, where the distances are weighted on cell-specific information content from each data type. Some cells might have more information from the transcriptomic data, and others from the proteomic data. The WNN graph can then be used as input for clustering and visualization (Figure 3).

In some of our own internal testing, we’ve seen WNN work best when the two data sets have complementary information. For example, let’s say the transcriptomic data correctly separates hypothetical cell types A and B, but cell types C and D form a single cluster. The proteomic data would be complementary if it separates cell types C and D, but not A and B. WNN would then do a good job of combining the best of both worlds, separating A, B, C, and D.
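The hypothetical A, B, C, D scenario can be mimicked with a toy sketch. To be clear about what it is and is not: it combines per-modality distances with fixed, equal weights, whereas the actual WNN method learns cell-specific weights, so this only illustrates why complementary modalities help. The coordinates, the function name `neighbor_purity`, and the weights are all made up for the example.

```python
# Toy cells: (cell type, RNA-derived coordinate, protein-derived coordinate).
# RNA separates A and B but puts C and D together;
# protein separates C and D but puts A and B together.
cells = [
    ("A", 0.0, 0.0), ("A", 0.2, 0.1),
    ("B", 5.0, 0.1), ("B", 5.2, 0.0),
    ("C", 10.0, 0.0), ("C", 10.1, 0.2),
    ("D", 10.1, 5.0), ("D", 10.0, 5.2),
]


def neighbor_purity(w_rna: float, w_protein: float) -> float:
    """Fraction of cells whose nearest neighbor has the same cell type,
    using a weighted sum of the RNA and protein distances."""
    hits = 0
    for i, (label_i, rna_i, prot_i) in enumerate(cells):
        best_label, best_dist = None, float("inf")
        for j, (label_j, rna_j, prot_j) in enumerate(cells):
            if i == j:
                continue
            dist = w_rna * abs(rna_i - rna_j) + w_protein * abs(prot_i - prot_j)
            if dist < best_dist:
                best_label, best_dist = label_j, dist
        hits += best_label == label_i
    return hits / len(cells)


rna_only = neighbor_purity(1.0, 0.0)
protein_only = neighbor_purity(0.0, 1.0)
combined = neighbor_purity(0.5, 0.5)
print(f"RNA only: {rna_only:.2f}, protein only: {protein_only:.2f}, combined: {combined:.2f}")
```

In this toy setup, neither modality alone gives every cell a same-type nearest neighbor, but the combined distance does. With real data, the two modalities differ in scale and quality from cell to cell, which is the problem the learned, cell-specific weights in WNN are meant to address.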

Summary

There have been some very exciting technological innovations in single cell multiomics in the last few years. It seems that there are multiple ways in which you can combine transcriptomic and proteomic data, not limited to the three strategies outlined above. My recommendation would be to try a few different methods and see what works best for your data and research question.

*The PBMC data used to generate the plots were from a healthy donor and downloaded from . The cells are screened with 10x Genomics’ Feature barcoding assay, with TotalSeq™-B antibodies targeting 17 cell surface proteins. All analyses were performed in Partek Flow v10. The filtered UMI count data were quality-filtered and normalized. PCA was used for dimensionality reduction. WNN was used for integration across the two data types. UMAP was used for clustering and visualization.

Pathway Analysis: ANOVA vs. Enrichment Analysis

One of the main steps in nearly every bulk RNA-Seq or single cell RNA-Seq project is some sort of statistical testing to compare gene expression levels across samples. A list of significant genes is then generated by applying a filter strategy, the most common being absolute fold change > 2 and false discovery rate (FDR) < 0.05 (the exact values depend on the research question).
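As a minimal sketch, the filter described above can be written as follows. The gene names and statistics are invented, and `GeneResult` and `is_significant` are hypothetical names. The fold change is assumed to be signed, so that -2 means a two-fold decrease.

```python
from dataclasses import dataclass


@dataclass
class GeneResult:
    gene: str
    fold_change: float  # signed: 3.1 is up 3.1-fold, -2.4 is down 2.4-fold
    fdr: float          # false discovery rate (adjusted p-value)


def is_significant(result: GeneResult, min_abs_fc: float = 2.0, max_fdr: float = 0.05) -> bool:
    """Keep genes that change by more than min_abs_fc in either direction
    and whose FDR is below max_fdr."""
    return abs(result.fold_change) > min_abs_fc and result.fdr < max_fdr


# Invented results, for illustration only.
results = [
    GeneResult("gene_a", 3.1, 0.001),
    GeneResult("gene_b", -2.4, 0.03),
    GeneResult("gene_c", 1.3, 0.0005),   # small change, fails fold change
    GeneResult("gene_d", -5.0, 0.2),     # large change, fails FDR
    GeneResult("gene_e", 2.0, 0.01),     # exactly 2, not strictly greater
]

significant = [r.gene for r in results if is_significant(r)]
print(significant)
```

Exposing both cutoffs as parameters reflects the point that the exact values depend on the research question.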

There are different ways to interpret the resulting list of significant genes. The most straightforward is to perform enrichment analysis, which identifies gene sets (e.g., pathways or gene ontology categories) that are overrepresented in the list of significant genes. A 2 ✕ 2 contingency table is created for each gene set: the number of genes present in the list of significant genes and in the gene set, the number of genes present in the list of significant genes but not in the gene set, etc. A statistical test (like Fisher’s exact test) can then be applied to the contingency table. Most researchers transform the p-value into an enrichment score (enrichment score = -ln(p-value)) to make it easier to read (e.g., 6.9 instead of 0.001).

Let’s have a look at a simple hypothetical example: a list of significant genes consists of 291 genes. Of those genes, 35 (12%) belong to a particular gene set. The set itself consists of 80 genes, and almost half (35/80) are present in the list of significant genes (Table 1). Based on the contingency table, the p-value of Fisher’s exact test is 8.28 × 10⁻²⁴. Converted to the enrichment score, the value is 53.15. In other words, the number of genes in the gene set that are also in the list of significant genes exceeds the expectation.
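The calculation can be sketched in a few lines of Python. The enrichment score follows the formula given in the text. The test itself is written here as a one-sided Fisher’s exact test, computed as the upper tail of the hypergeometric distribution. One caveat: the total number of genes tested (the background) is not stated in the example, so the value of 20,000 below is purely an assumption, and the resulting p-value is not expected to match the one reported above. The function names are hypothetical.

```python
import math


def enrichment_score(p_value: float) -> float:
    """Enrichment score as defined in the text: -ln(p-value)."""
    return -math.log(p_value)


def overrepresentation_p_value(overlap: int, n_significant: int,
                               n_in_set: int, n_background: int) -> float:
    """One-sided Fisher's exact test for overrepresentation.

    Probability of seeing at least `overlap` gene-set members among
    `n_significant` genes drawn from `n_background` genes, of which
    `n_in_set` belong to the gene set (hypergeometric upper tail).
    """
    total = math.comb(n_background, n_significant)
    max_overlap = min(n_significant, n_in_set)
    tail = sum(
        math.comb(n_in_set, k) * math.comb(n_background - n_in_set, n_significant - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Converting the p-values quoted in the text.
print(round(enrichment_score(0.001), 1))      # 6.9
print(round(enrichment_score(8.28e-24), 2))   # 53.15

# The hypothetical example: 35 of 291 significant genes fall in an 80-gene set.
ASSUMED_BACKGROUND = 20000  # assumption, not given in the text
p = overrepresentation_p_value(35, 291, 80, ASSUMED_BACKGROUND)
print(f"p-value with assumed background: {p:.3g}")
print(f"enrichment score: {enrichment_score(p):.2f}")
```

Changing `ASSUMED_BACKGROUND` changes the p-value substantially, which is a useful reminder that the choice of background gene list is part of the analysis, not a detail.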

Studying Immunotherapy with Multiomics: Simultaneous Measurement of Gene and Protein

If you follow the field of single cell biology, then you know about the recent trend of simultaneous measurement of both gene expression and protein expression. Here’s an application example based on immunotherapy of mucosa-associated lymphoid tissue (MALT) lymphoma. The most common site is the stomach, but any mucous membrane can be affected.

The cancerous cells of MALT lymphoma are the B lymphocytes, originating from the marginal zone of the MALT, hence the synonym: extranodal marginal zone B cell lymphoma. In addition to B lymphocytes, the other common cell type found within MALT lymphoma is the infiltrating T lymphocyte. A depiction of the cellular composition of MALT lymphoma, generated in Partek Flow, is shown in Figure 1.

Each dot represents a single cell: the entire sample consists of B lymphocytes (red), some of which are the cancer cells, and infiltrating T lymphocytes (green), which are normal cells reacting to the tumor. Cells from a dissociated MALT lymphoma were processed by 10X Genomics’ Feature Barcoding technology and can be downloaded from here. The filtered HDF5 file was loaded and processed in Partek Flow analysis software. The t-SNE is based on a combined analysis of gene and protein data for 8,018 single cells.

Although B lymphocytes are malignant cells in this type of tumor, there is a growing interest in the neighboring normal T lymphocytes. Penetration of T lymphocytes into the tumor is a reaction to cancer and is a possible therapeutic strategy. Although T lymphocytes can, depending on their type, either regulate immune response directed against the tumor or directly kill the malignant B lymphocytes, they rarely do so.

One way of exploiting the activity of T lymphocytes is to target the immune checkpoint molecules expressed on the surface of T lymphocytes, such as cytotoxic T-lymphocyte-associated protein 4 (CTLA-4). Signaling via CTLA-4 regulates the immune response and prevents autoimmune diseases and allergies, but also hinders the immune system from destroying cancer cells. Alas, only a fraction of patients are responsive to therapeutics acting on the CTLA-4 signaling pathway, which motivates the investigation of other inhibitory or stimulatory receptors.

A promising new immunotherapy drug target is programmed cell death protein 1 (PD-1). The PD-1 protein is encoded by the gene PDCD1 and is expressed on the cell surface of T lymphocytes. Ample evidence confirms that PD-1 acts as a negative regulator of the immune response. Another interesting potential target molecule is ITM2A, encoded by the ITM2A gene. ITM2A is also involved in the regulation of immune response and has recently been shown to be commonly co-expressed with PD-1. The list of possible therapeutic targets is much longer, and research efforts are often aimed not at individual molecules, but at their combinations and networks.

Antibodies are a common tool for detecting drug targets (and identifying the cells that express them), but, to complicate matters, high-fidelity antibodies are not available for most potential drug targets. Moreover, standard analysis techniques, such as fluorescence in situ hybridization (FISH) or flow cytometry, allow for simultaneous detection of no more than 20 molecules.

The good news is that these limitations can be resolved by coupling data analysis in Partek Flow software with single cell RNA-Seq using the Feature Barcoding approach. For instance, take a look at Figure 2, which shows an ideal target for immunomodulation: a target population of “helper” T lymphocytes (positive for CD4 protein) expressing both PDCD1 and ITM2A immunoregulatory genes.

Each dot is a single cell. Cells from a dissociated MALT lymphoma were processed by 10X Genomics’ Feature Barcoding technology and can be downloaded from . The filtered HDF5 file was loaded and processed in Partek Flow. The data points are based on the combined analysis of gene and protein data for single cells. T lymphocytes were identified as cells expressing CD3 antigen. A total of 3,069 CD3-positive cells were gated and then charted by expression levels of PDCD1 and ITM2A mRNA, and CD4 protein. To highlight the helper T lymphocytes population, the plot was then colored by expression levels of CD8 antigen, a cytotoxic T lymphocyte marker (red cells in the background).

The analysis presented above can be performed in Partek Flow software with just a few mouse clicks. If you are curious, don’t hesitate to reach out to us for a free trial.

How to Integrate ChIP-Seq and RNA-Seq Data

Next generation sequencing has enabled us to ask questions about biology and disease with unprecedented scope and detail. High-throughput assays are available to study many aspects of genomic regulation including RNA-Seq for gene expression, ATAC-Seq for chromatin accessibility, and ChIP-Seq for protein binding sites.

Bringing together multiple genomic assays to analyze both the epigenome and transcriptome in the same samples promises to uncover the mechanisms underlying biology and disease. But while performing the experiments requires good hands and persistence, the real challenge begins after you receive the data. How do you make sense of it all?

Unfortunately, most analysis pipelines and tools are built for one genomic assay, leaving it to you to piece together disparate output spreadsheets and data files to figure out how the results from different assays mesh to form a coherent picture.

At Partek, we make multi-omics analysis and integration seamless by bringing all your analysis tools and data together in Partek Flow.

To illustrate how easy it is to analyze and integrate multi-omics data in Partek Flow, I took a quick look at some data from a recently published study. In the study, the authors used ChIP-Seq and RNA-Seq data to characterize TGF-β signaling through the SMAD2/3 transcription factors.

By analyzing the data in Partek Flow, I was able to quickly go from raw data to integrated results. By analyzing the RNA-Seq and ChIP-Seq data together, I identified potential direct targets of SMAD2/3: genes that were near SMAD2/3 binding sites and differentially expressed after TGF-β treatment.

For the RNA-Seq data, I found genes that were differentially expressed between inhibitor and TGF-β treated conditions. This gave me a list of indirect and direct target genes of TGF-β. These genes are shown in the green circle in Figure 1.

For the ChIP-Seq data, I identified regions that were enriched in a SMAD2/3-pull down sample relative to input control using MACS2, a powerful tool for detecting enriched regions in ChIP-Seq and ATAC-Seq data. I then annotated these regions with nearby genes to give me a list of genes that were likely to have been regulated by SMAD2/3. These genes are shown in the blue circle in Figure 1.

I used the Venn diagram tool in Partek Flow to find the intersection between the TGF-β regulated genes and the SMAD2/3 bound genes – the 202 potential direct targets of SMAD2/3 in the experiment.
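Conceptually, the Venn diagram step is a set intersection. Here is a minimal sketch of that idea, not a description of how Partek Flow implements it. Apart from Skil, which the study names as a known target, the gene names are placeholders, and the variable names are assumptions.

```python
# Genes differentially expressed between inhibitor and TGF-beta conditions
# (RNA-Seq). Placeholder names except Skil.
tgfb_regulated = {"Skil", "gene_b", "gene_c", "gene_d"}

# Genes annotated to SMAD2/3-enriched regions (ChIP-Seq). Placeholder names.
smad_bound = {"Skil", "gene_c", "gene_e"}

# Regulated and bound: potential direct targets of SMAD2/3.
direct_targets = tgfb_regulated & smad_bound

# Regulated but not bound: likely indirect targets.
indirect_targets = tgfb_regulated - smad_bound

# Bound but not differentially expressed in this experiment.
bound_only = smad_bound - tgfb_regulated

print("direct:", sorted(direct_targets))
print("indirect:", sorted(indirect_targets))
print("bound only:", sorted(bound_only))
```

The three resulting sets correspond to the overlap and the two non-overlapping regions of the two-circle Venn diagram.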

Going a step further, I performed pathway enrichment analysis on the list of direct target genes to find pathways that were likely to be quickly impacted by signaling through SMAD2/3. You can see one of the annotated KEGG pathway maps I generated using Partek Flow in Figure 2.

I also visualized several of these direct target genes in Partek Flow. Figure 3 shows Skil, a known target gene of SMAD2/3 signaling, in the Partek Flow genome browser. The top track is the ChIP-Seq data. The ChIP-Seq reads histogram shows that the SMAD2/3 pull-down sample is enriched relative to the input control. The bottom track presents the RNA-Seq data, where the TGF-β treated condition showed higher expression of the gene than the inhibitor-treated condition. The predicted SMAD2/3 binding site for this gene is directly upstream of the transcription start site of Skil in the promoter region, illustrating why it was identified as a direct target of SMAD2/3.

—

Enjoy Responsibly!

By: Ivan Lukic, Ph.D. – Field Application Scientist, Partek Incorporated

Let’s talk about something completely different. At Partek training events, I usually explain that our tools can be easily utilized for any quantitative data, not just genomics. For instance, last month I wrote a blog post on the analysis of glycomics data. But actually, we go way beyond that, to some completely unrelated applications, for instance, whiskey tasting. I was fortunate enough to stumble upon a whiskey data set (courtesy of the University of Strathclyde), which consists of the results of tasting 86 different Scotch malts. Each malt is distilled at a different site and is described by 12 taste categories (e.g., body, sweetness, smoky, and so on). The taste categories are subjective measurements obtained by referees while tasting a particular whiskey and range from zero (taste category absent) to four (taste category very strong).

Let’s explore the data by principal components analysis (Figure 1). The interpretation is analogous to the one in genomics: if two whiskeys are close on the plot, that means that they have similar values across the taste categories (to rephrase, they have a similar taste). My first observation was that there is no striking pattern (i.e., the data does not split into several well-defined clusters). That actually supports my hunch that much of whiskey tasting is merely mumbo-jumbo of connoisseurs and aficionados. You know, …with a slight touch of (insert a name of a berry that cannot be purchased at your local market or found at a specialized store because it is a highly endemic and endangered species for which you need three PhDs in botany and several decades of field experience to be able to find it in the wild). However, there is a group of several whiskeys on the left (including Lagavulin, Ardbeg, and Talisker, among others) that does stand out a bit. An obvious question is: what about those? Is there something particular about their taste?

To answer that, I turned to a Principal Components Analysis (PCA) biplot. In general, a biplot highlights the extent to which objects represented by rows of a quantitative matrix (observations) differ in terms of the objects represented by the columns (features). In this context, the PCA biplot shows the biggest patterns evident in the data in terms of how the whiskeys differ in particular taste categories. Whiskeys are shown as dots, while taste categories are presented as vectors (Figure 2; to simplify the chart, I chose a 2D PCA plot).

Now we can tell that “body” and “smoky” should be interpreted in a similar fashion. That is, they describe very similar components of a whiskey’s flavor and are quite different from, say, “fruity” and “sweetness” (which point in a different direction). That makes sense, right? So, if I am looking for a smoky, full-bodied whiskey, I should pick Lagavulin or Laphroaig (I have no idea how to pronounce that), because their points project furthest in the direction of the “smoky” and “body” vectors. On the other hand, spring is in the air and to celebrate it with a sip of something with a more floral note, I should go for Amenthosan. Well, in reality, I would have neither, because I am a bourbon guy, but you get my point on biplots, right?
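The phrase "project furthest in the direction of a vector" can be made concrete with a scalar projection. The sketch below is only an illustration: the coordinates are hypothetical and were not read off Figure 2, and the names `projection_along`, `smoky_arrow`, and `dots` are invented for the example.

```python
import math

def projection_along(point, arrow):
    """Scalar projection of a dot onto an arrow: how far the dot lies in the arrow's direction."""
    length = math.hypot(*arrow)
    return (point[0] * arrow[0] + point[1] * arrow[1]) / length

# Hypothetical biplot coordinates (PC1, PC2); not taken from the actual figure.
smoky_arrow = (-0.8, 0.3)
dots = {
    "malt_A": (-3.0, 1.0),
    "malt_B": (-1.0, -0.5),
    "malt_C": (2.0, 0.5),
}

# Rank malts from most to least "smoky" according to the biplot geometry.
ranked = sorted(dots, key=lambda name: projection_along(dots[name], smoky_arrow),
                reverse=True)
print(ranked)
```

A dot with a large positive projection sits far along the arrow (strong in that taste category), while a negative projection means the dot lies on the opposite side of the plot.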

To Boldly Go…

So, what about the Star Trek opening quote? Well, with the human genome being extensively described and studies of the proteome well underway, I believe that the glycome is the final frontier (but then again, who knows… maybe just the next frontier?). Moreover, glycomic research has profound practical consequences due to the key role of protein glycosylation in normal physiology as well as pathophysiology. Unfortunately, there are some issues that need to be resolved, such as the need for “sensitive, robust, and affordable” high-throughput analysis techniques (Trbojević-Akmačić et al., 2016).

On the bright side, a recent manuscript provided a major step in that direction, proposing a standardized method for glycome profiling: Zou and his co-workers used lectin microarrays to obtain glycosylation signatures of various tissues. That was precisely the type of study I was looking for to test my hypothesis that glycomic data can be processed using genomics tools in an easy and straightforward way.

Thus, I downloaded the normalized intensity levels of each lectin from the manuscript supplement and analyzed them with Partek Flow analysis software. Figure 1, based on the t-distributed stochastic neighbor embedding (t-SNE) algorithm, highlights the distinct glycome profiles of five mouse tissues harvested in the original study.

This is in line with the results by Zou and his colleagues, but there is a difference that I would like to point out. In the original manuscript, one of the techniques used for differential glycomic profiling was principal components analysis (PCA). Although PCA visualization (Figure 2) leads to the same conclusion, that each tissue has a characteristic glycome fingerprint, the t-SNE (Figure 1) provides a cleaner tissue separation and emphasizes the point in a more convincing fashion.

Partek is best known for our expertise in genomics; however, our tools give you the flexibility to examine any type of -omics data. Join us in the journey where no man has gone before.

Get to Know Your Cell

Just like people, cells are sometimes mysterious and do not readily reveal their true identity. If you want an example, just think of metastatic cancer cells, which sometimes lose all the hallmark features of their mother tissue, making detection of the primary tumor site difficult. Another example is single cell RNA-Seq. Getting a nice t-distributed stochastic neighbor embedding (t-SNE) chart based on sequencing data is pretty much straightforward, but figuring out the biological nature of the cells on it can be quite challenging.

One approach to classifying the cells into groups is to use marker genes. An extension of that concept is to use gene groups, such as pathways. Although these strategies are very useful, they are not applicable to every research situation. For instance, you may want to come up with a completely new set of marker genes, or you may want to work hypothesis-free.

To handle those situations, use Partek Flow to combine clustering results with t-SNE visualization. If you perform clustering with the compute biomarkers option and then invoke the t-SNE plot on the result, genes specific for every cluster can be used for classification (Figure 1). Moreover, a clustering algorithm does not have to be involved: you can simply classify cells into groups by selecting them directly on the chart, and Partek Flow will produce biomarkers of your custom groups. The biomarkers table is based on a comparison of one cluster at a time versus all the other clusters combined. The gene list is then filtered by p-value (0.05) and fold-change (|1.5|) criteria and ranked by p-value, and the top 10 genes are listed in the table by default (the full list can be viewed in the Data Viewer or downloaded).
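The filter-and-rank step described above can be sketched in a few lines of Python. This is an illustration of the logic only, not Partek Flow code: the function name `top_biomarkers`, the record layout, and the gene names and numbers are all hypothetical. It also assumes that the cutoffs are inclusive and that fold change is signed (so -2.1 means 2.1-fold lower in the cluster), neither of which is specified in the text.

```python
def top_biomarkers(results, p_cutoff=0.05, fc_cutoff=1.5, top_n=10):
    """Keep genes passing both filters, rank by p-value, return the first top_n."""
    kept = [r for r in results
            if r["p_value"] <= p_cutoff and abs(r["fold_change"]) >= fc_cutoff]
    return sorted(kept, key=lambda r: r["p_value"])[:top_n]

# Hypothetical one-cluster-versus-rest results; gene names and numbers are invented.
cluster_vs_rest = [
    {"gene": "gene_1", "p_value": 0.001, "fold_change": 3.2},
    {"gene": "gene_2", "p_value": 0.20, "fold_change": 4.0},    # fails the p-value filter
    {"gene": "gene_3", "p_value": 0.0004, "fold_change": -2.1},
    {"gene": "gene_4", "p_value": 0.01, "fold_change": 1.2},    # fails the fold-change filter
    {"gene": "gene_5", "p_value": 0.03, "fold_change": 1.8},
]

for row in top_biomarkers(cluster_vs_rest):
    print(row["gene"], row["p_value"], row["fold_change"])
```

Running the same function once per cluster, each time with that cluster compared against all remaining cells, would give one short marker list per cluster.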

What if you do not want to compare a particular cluster with all the remaining cells, but would rather pick two clusters and contrast them directly? That is also easily performed in Partek Flow.

Aliens Among Us: How I Analyzed Non-Model Organism Data in Partek Flow

Recently, a post popped up on my Twitter feed about how the microscopic water bear (tardigrade) can survive extreme drought, temperature, pressure, and radiation, as well as the vacuum of space. Being a fan of anything space-related, I was intrigued. I downloaded the publicly available data to explore what it is about the water bear’s transcriptomic profile that allows it to survive the extreme environmental difference between Earth and space. This presented the perfect opportunity for me to see how well the software would let me analyze non-model organism data without having to deal with any command-line tools.

Water bears survive by shriveling up and entering a dormant state, called the ‘tun’ state, so I compared samples from tun and active adults. I started by aligning the RNA-Seq data to the water bear reference genome (in fact, I could use the reference genome of any model or non-model organism in Partek Flow). To improve the quality of the analysis, I used the visualizations in the post-alignment QAQC and quantification reports to help identify a low-quality sample and remove it from my downstream analysis (Figure 1 and Figure 2).

I found 1385 genes that were differentially expressed between tun and active adults (Figure 3). Interestingly, I found most genes (1194 genes) were up-regulated in the tun group, compared to the active adults, indicating the transformation into the tun state requires the activation of many genes, rather than inactivation. This was rather surprising. When in the tun state, the water bears seemingly regress into dormancy. Intuitively, one would expect more inactivation. Then again, the transformation process requires major physiological, cellular, and metabolic changes, which would require the activation of pathways that would normally not be needed when they are active adults. Biological interpretation of the gene list revealed functions involved in oxidative damage response, redox reactions, and membrane integrity, all of which have been implicated in extreme stress response.
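As a quick check on the proportions, the numbers above can be worked through directly. The only assumption in this sketch is that every differentially expressed gene is either up- or down-regulated in the tun group; the variable names are illustrative.

```python
total_de_genes = 1385  # differentially expressed genes reported above
up_in_tun = 1194       # of those, up-regulated in the tun group

down_in_tun = total_de_genes - up_in_tun
up_fraction = up_in_tun / total_de_genes

print(f"down-regulated in tun: {down_in_tun}")          # 191
print(f"share up-regulated in tun: {up_fraction:.1%}")  # 86.2%
```

So roughly six out of every seven differentially expressed genes were more active in the tun state.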


