arrow-left

All pages
gitbookPowered by GitBook
1 of 15

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Blog Posts

This is a collection of blog posts from our website that you might find of interest.

hashtag
How to select the best single cell quality control thresholds

The answer no one wants to hear

hashtag

Using trajectory analysis to determine their fate

hashtag

Transcriptome-wide studies of gene expression certainly provide invaluable insight into biology on a molecular level, particularly when performed at the single-cell level

hashtag

Which tools to use for single cell analysis

hashtag

Can nuisance batch effects or undesirable numeric or categorical factors be removed?

hashtag

Step one in performing single cell analysis

hashtag

Strategies for integrating single cell RNA-Seq multiomic data

hashtag

Comparing gene expression levels across samples

hashtag

We discuss the advantages of simultaneous gene and protein expression measurement.

hashtag

Let’s discuss how to make multi-omics integration analysis and integration seamless by bringing all your analysis tools and data together.

hashtag

Using tasting results of 86 different Scotch malts, let’s explore how you can apply statistical tools to non-genomic data.

hashtag

With the human genome being extensively described and studies of the proteome well under way, is the glycome is the final frontier?

hashtag

Cells are sometimes mysterious and do not readily reveal their true identity. Here’s how to identify their biological nature.

hashtag

How well does Partek Flow software analyze non-model organism data, without having to deal with any command-line tools?

Using trajectory analysis to study cellular differentiation in single cell RNA-Seq experiments
Tissue transcriptomics—what’s the big deal and why you should do it
Less is more: detecting differential gene expression in single cell RNA-Seq analysis
Batch remover for single cell data
How to perform single cell RNA sequencing: exploratory analysis
Single Cell Multiomics Analysis: Strategies for Integration
Pathway Analysis: ANOVA vs. Enrichment Analysis
Studying Immunotherapy with Multiomics: Simultaneous Measurement of Gene and Protein Expression
How to Integrate ChIP-Seq and RNA-Seq Data
Enjoy Responsibly!
To Boldly Go…
Get to Know Your Cell
Aliens Among Us: How I Analysed Non-Model Organism Data in Partek Flow

How to select the best single cell quality control thresholds

When asked what the best single cell quality control thresholds are, I know the person asking wants a number such as:

  • Cells with a total count between X and Y are of good quality.

  • You need at least X number of detected genes for a cell to be informative.

  • Cells with more than X% of mitochondrial counts are of bad quality.

The real answer is it depends.

hashtag
Why there is no one best single cell quality control threshold

When I started analyzing single cell RNA-Seq data, I found myself asking the same questions. At first, I was expecting some standards to emerge in the field as they did for interpreting PHRED base call quality scores in NGS sequencing data. Instead of standard threshold values, the field developed a broad set of considerations to account for when assessing single cell quality.

To date, the best set of single cell quality control threshold recommendations I have seen are outlined in . This paper is a personal favorite of mine to which I refer often.

From my own hands-on experience, here are the most important lessons I have learned when deciding which single cell QA/QC thresholds to use.

hashtag
Reason one: single cell threshold selection is a trade-off between quality and quantity

If you are more stringent with your thresholds, you will retain fewer cells of higher quality. If you are more lenient, you will retain more cells, some of which may be of lower quality.


How stringent or lenient you are will depend on how many cells you need to answer your research question, and at what quality.

Do you need many cells to assess the general heterogeneity of a sample? If so, perhaps you can afford to be more lenient.

Are you looking for a rare cell type that may be present in low frequency? Go lenient.

Do you need highly accurate cell type identifications for a set of cells? If so, perhaps it makes sense to be more stringent.

hashtag
Reason two: metrics in the biological sample context matters in threshold selection

The biology of the sample or an experimental treatment may affect the single cell quality control metrics in a predictable way.

For example, you might be tempted to set the maximum percentage of mitochondrial counts threshold to 15%, because a higher percentage is typically indicative of a damaged cell. However, if you are working with a more metabolically active tissue, such as the kidney, a maximum threshold of ~30% makes more biological sense .

Perhaps you are working with a dissociated tumor sample and expecting some infiltrating normal cells, which are known to be smaller than the tumor cells. It stands to reason those smaller cells will have less total RNA content, so we would expect their total count to be lower. In this case, it might make sense to lower the minimum total count threshold, so you don’t inadvertently exclude a sub-population of infiltrating normal cells.


These are just a couple of examples of things to consider when selecting a single cell QA/QC threshold. Other things to consider are the expected effects of treatments, gene knockouts, and sample handling on the three quality metrics.

hashtag
Conclusion

Don’t be afraid of making a mistake. You will not break the data by choosing different thresholds. The original data will always be intact, so you can go back and re-run things.

So, make a choice. It doesn’t have to be a perfect choice, just make a choice, and test it. Look at the downstream clustering, biomarkers, differential gene expression results, and visualizations

Spatial transcriptomics—what’s the big deal and why you should do it

We may all be under the impression that single cell RNA-Seq is a new technology, but it has been around for more than a decade (). Hard to believe, right? And what about tissue transcriptomics (a.k.a. spatially resolved transcriptomics)? If we focus only on RNA sequencing-based analysis, then the first paper was published in 2013 (), which in the “omics” age, cannot be considered a novelty. To be fair, the concept of gene expression analysis within tissue context is considerably older—first introduced in 1982 (). Either way, tissue transcriptomics really rose to prominence within the last two years and continues to gain a lot of attention.

Transcriptome-wide studies of gene expression certainly provide invaluable insight into biology on a molecular level, particularly when performed at the single cell level (i.e., single cell RNA-Seq). However, some key pieces of information such as the tissue relationships, are still missing and are only made available by tissue transcriptomics techniques. Knowing the location of each cell, as well as its surroundings, enable us to fully understand and appreciate molecular events. Two straightforward examples that come to mind are embryonic development and tumor microenvironments. One of the key features of the development is spatially and temporally coordinated gradients of gene expression, which then turn off developmental genes. Cancer tissue, the second example, is composed of different neoplastic populations and has a “contaminating” component of healthy tissue such as stroma or especially exciting, invading leukocytes.

Luecken & Theis (2019)arrow-up-right
(Liao et al. 2020)arrow-up-right
Figure 1. Narrow, stringent thresholds on the left. Wider, more lenient thresholds on the right. The lenient settings will retain more single cells, but some of them might be less informative, more likely to be doublets, and more damaged
Figure 2. Frequency distribution of total counts from a dissociated glioma sample. The data is from the MGH56 sample taken from (Venteicher et al. 2017)arrow-up-right, processed and visualized in Partek Flow v9. Setting the minimum total count threshold where the broken red line is will inadvertently exclude the infiltrating microglial sub-population
Moreover, some tissues are not easily singularized (e.g., cartilage, bone, nervous tissue) or if they are, single cell suspension still lacks topographical information (e.g., bone marrow, lymph nodes). For instance, if you study interactions between bone and bone marrow the positional information is critical.

And finally, spatial considerations do not need to be limited to the tissue level but can also be transferred to the level of an individual cell, as intra-cellular functions and signals require well-defined positional arguments (Asp et al. in BioAssays 2020). This is exactly where tissue transcriptomics comes into play, as it merges gene expression data and spatial data. At the same time, we know what genes are expressed, their expression levels, and where within the tissue those genes are located.

To learn how to perform spatial transcriptomics analysis in Partek Flow, take a look at our live training recording: Single Cell Analysis - Tissue Transcriptomics. The training is based on 10x Genomics’ Visium data, but you are not limited to that assay: other related technologies, such as Nanostring’s GeoMx Digital Spatial Profiling are also supported.

Tang F et al. Nat Methods 2009arrow-up-right
Ke et al. Nat Methods 2013arrow-up-right
Singer & Ward. Proc Natl Acad Sci USA 1982arrow-up-right
Analyzing spatial transcriptomic data in Partek Flow

Cellular Differentiation Using Trajectory Analysis & Single Cell RNA-Seq Data

Researchers use single cell RNA-Seq trajectory analysis to reveal cell fate. In ancient Greek mythology, three ancient goddesses known as the Moirai determined each mortal's fate at birth by spinning their thread of life. Today instead of a spinning wheel the Moirai might use Monocle, a trajectory analysis tool embedded in Partek Flow, to determine their fate. Monocle uses single cell RNA-Seq data to detect cell transitions between different cell states, in other words, revealing cell fate.

hashtag
Cell differentiation is at the heart of many biological processes

A hallmark example of a cell fate study is cell differentiation. Cellular differentiation can be studied under physiological conditions or as part of a pathophysiological process. For example, we may want to dissect the cell differentiation of lymphocytes as a response to SARS-CoV-2 infection.

Taking this one step further, cancer cells are very heterogeneous, ranging from undifferentiated cells to cells resembling their normal counterparts. Therefore, their different states and transitions between them can be analyzed with a trajectory approach.

hashtag
Trajectory analysis of cancer single cell RNA-Seq data

Here’s an example. In 2017, published their work on single cell RNA-Seq analysis of human gliomas. For this short article, we used a subset of that study, i.e., four oligodendroglioma samples (3,000 cells in total), and ran trajectory analysis.


By looking at the left panel in this figure, we can immediately appreciate the complexity of this cell population: eleven different states (sub-populations) were detected. Each subpopulation is characterized by a unique gene expression pattern. An interesting gene in this context is the cyclin I gene (CCNI), a member of a gene family tightly associated with the cell cycle. High expression levels are observed on the right side of the plot, such as state 10, while the levels of CCNI decrease as we traverse down the cell differentiation tree represented by the black lines. Knowing the pattern of gene expression (i.e., genes being turned on / switched off), we can reconstruct the differentiation sequence from early to final states. That information is shown on the right panel where the cells are colored by pseudotime. The beginning of pseudotime is indicated by a very light blue and becomes brighter as time progresses. Note the complementarity of the CCNI expression and pseudotime patterns. As the cancer cells become more differentiated (=similar to the healthy cells), the activity of cell cycle genes goes down. What if a new drug or treatment would affect the cancer cell differentiation? Since cancer differentiation is typically tied with improved survival rates, a drug with the ability to promote differentiation is well worth reviewing.

hashtag
Other applications of trajectory analysis

We encourage you to explore the use of trajectory analysis tools not only for obvious purposes where cell differentiation is of primary interest, such as developmental but whenever you are looking at cells that can be classified into sub-groups, which in turn, can be organized in a before-after relationship.

Or, to rephrase: a Greek hero was bound by the verdict of the Moirai and they were not able to change their fate and (most often!) escape their doom. On the other hand, software tools are not Greek heroes, so why not deviate from the routine and use them for non-obvious purposes?

Venteicher et al.arrow-up-right
Trajectory analysis of single cell RNA-Seq data collected from 3,000 oligodendroglioma cancer cells (each dot is a single cell). The cells are colored by differentiation state (left), the expression level of cyclin I (center), and pseudotime (right)

Get to Know Your Cell

Just like people, cells are sometimes mysterious and do not readily reveal their true identity. If you want an example, just think of metastatic cancer cells, which sometimes lose all the hallmark features of their mother tissue, making detection of the primary tumor site difficult. Or another example, single cell RNA-Seq. Getting a nice t-distributed stochastic neighbor embedding (t-SNE) chart based on sequencing data is pretty much straightforward but figuring out their biological nature can be quite challenging.

One approach to classifying the cells into groups is to use marker genes. An extension of that concept is to use gene groups, such as pathways. Although these strategies are very useful, they are not applicable to every research situation. For instance, you may want to come up with a completely new set of marker genes, or you may want to work hypothesis-free.

To handle those situations, use Partek Flow to combine clustering results with t-SNE visualization. If you perform clustering with compute biomarkers and then invoke the t-SNE plot on the result, genes specific for every cluster can be used for classification (Figure 1). Moreover, a clustering algorithm does not have to be involved, you can simply classify cells into groups by selecting them directly on the chart and Partek Flow will produce biomarkers of your custom groups. The biomarkers table is based on a comparison of one cluster at a time versus all the other clusters combined. The gene list is then filtered by using p-value (0.05) and fold-change criteria (|1.5|), ranked by the p-value, and the top 10 genes are then listed in the table by default (the full list can be viewed in the Data Viewer or downloaded).

What if you do not want to compare a particular cluster with all the remaining cells, but would rather pick two clusters and contrast them directly? That is also easily performed in Partek Flow.

Detecting differential gene expression in single cell RNA-Seq analysis

One of the main goals of both the Single Cell RNA-Seq and bulk RNA-Seq pipelines is to detect differentially expressed genes. There are many available tools from which to choose such as and , as well as classic statistical tests like ANOVA or Welch’s ANOVA. Thus, a frequent question is which to choose?

When discussing data analysis with our customers, I often hear that “this” or “that” approach has been suggested to them by others, because it is “better”. “Better” is quite difficult to define but typically leads to a conclusion like “I found more significant genes with this test”. Having a research scientist’s background, I certainly understand the appeal of more targets, but is this necessarily something for which to strive? Spoiler alert: no, you should find the right genes, not just more genes.

Finding the right genes is where the hurdle models come into play. Hurdle models are a class of statistical models developed for count data, with the idea of handling and . Sounds exactly like single cell RNA-Seq data, right? (If you are interested in the mathematics, you may want to go over this excellent overview of )

We can think of a hurdle model as a two-question process: is a gene detected in the project or not, and if yes, has the expression level hurdle been crossed? Note that if we observe zero counts for a gene, this does not necessarily mean that the gene is not expressed. Expression may not have been detected because sequencing was too shallow or for a number of other technical reasons (e.g., sampling error, reverse transcription inefficiency, etc.). That phenomenon is known as “

Aliens Among Us: How I Analyzed Non-Model Organism Data in Partek Flow

Recently a post about how the microscopic water bear (tardigrade) can survive in extreme drought, temperature, pressure, and radiation as well as in the vacuum of space popped up on my Twitter feed. Being the fan I am of anything space-related I was intrigued. I downloaded the publicly available data to explore what is it about the water bear transcriptomic profile that allows them to survive the extreme environmental difference between Earth and space. This presented the perfect opportunity for me to see how well software would let me analyze non-model organism data, and at the same time, not have to deal with any command-line tools.

Water bears survive by shriveling up and entering a dormant state, called the ‘tun’ state, so I compared samples from tun and active adults. I started by aligning the RNA-Seq data to the water bear reference genome (in fact, I could use the reference genome of any model or non-model organism in Partek Flow). To improve the quality of the analysis, I used the visualizations in the post-alignment QAQC and quantification reports to help identify a low-quality sample and remove it from my downstream analysis (Figure 1 and Figure 2).

I found 1385 genes that were differentially expressed between tun and active adults (Figure 3). Interestingly, I found most genes (1194 genes) were up-regulated in the tun group, compared to the active adults, indicating the transformation into the tun state requires the activation of many genes, rather than inactivation. This was rather surprising. When in the tun state, the water bears seemingly regress into dormancy. Intuitively, one would expect more inactivation. Then again, the transformation process requires major physiological, cellular, and metabolic changes, which would require the activation of pathways that would normally not be needed when they are active adults. Biological interpretation of the gene list revealed functions involved in oxidative damage response, redox reactions, and membrane integrity, all of which have been implicated in extreme stress response.

” and hurdle models that that into account.

To illustrate the advantages of hurdle models in the context of Single Cell RNA-Seq, I am going to present an example based on the (MAST) algorithm. The data originates from , who prepared Single Cell RNA-Seq libraries from primary mouse dendritic cells and stimulated them with three pathogenic components. I downloaded the TPM-normalized count matrices from and extracted the data for unstimulated cells (n = 96; control) and cells stimulated with lipopolysaccharide (LPS; these are molecules found in some bacteria) for 6 hours (n = 96) (27,723 genes each). After that, I removed the genes with zero counts in at least 90% of the cells and log2-transformed the expression values (with the offset of 0.001). Finally, I proceeded to detect differentially expressed genes by either MAST or limma-trend (default settings). Significant genes were selected by applying the cut-off of false discovery rate of 0.05 and fold change > |2|.

As we can see, the number of significant genes reported by MAST was lower by roughly one-third (Figure 1).

However, have a look at the results of the pathway enrichment invoked on the significant genes (Figure 2). The enriched pathways are not exactly the same between the two statistical models but point to the same biological direction; dendritic cells are responsible for mounting and directing the immune response and upon activation by LPS they express genes belonging to immune signaling pathways (e.g., TNF signaling pathway) or are involved in combating specific pathogens (like Salmonella). There are some differences between the two results, which are worth mentioning. The MAST-based output includes a “C-type lectin receptor signaling pathway” and dendritic cells use C-type lectin receptors to detect LPS. On the other hand, one of the pathways in the limma output is “hepatitis B”, which does not directly resonate with LPS stimulation (LPS challenge mimics bacterial, not viral infection). Furthermore, the enrichment scores are considerably higher for the MAST group, despite of fewer significant genes. In conclusion, the hurdle model helped us to focus on the right genes and to obtain more biologically relevant results, which is consistent with the conclusion of the MAST paper.

It is important to notice that the analysis in this example was very straightforward. If you stimulate dendritic cells with LPS, you know that you will find a bunch of differentially regulated genes. But what about an experiment where the effect of the experimental factor is not as well defined and predictable? Like development, tissue regeneration, or cancer? In all those cases it is easy to get a long list of significant genes, but the trick is to pick the genes that really matter. Tools like MAST can help you with that challenge and we fully encourage you to try out different analysis strategies. Good luck!

DESeq2arrow-up-right
limmaarrow-up-right
zero-inflationarrow-up-right
overdispersionarrow-up-right
single-cell RNA-Seq bioinformaticsarrow-up-right

The whole analysis, from data download to biological interpretation of the gene list, was completed in a matter of hours with a few easy clicks thanks to the intuitive graphical user interface in Partek Flow.

References:

1. Arakawa, K. et al. Genome Sequencing of a Single Tardigrade Hypsibius Dujardini Individual. Sci. Data 3:160063 doi: 10.1038/sdata.2016.63 (2016)

2. Koutsovoulos, Georgios et al. “No Evidence for Extensive Horizontal Gene Transfer in the Genome of the Tardigrade Hypsibius Dujardini.” Proceedings of the National Academy of Sciences of the United States of America 113.18 (2016): 5053–5058. PMC. Web. 16 Nov. 2016.

Partek Flow
Figure 1: Post QAQC alignment breakdown of a non-model organism
Figure 2: Quantification log expression signal of a non-model organism
Figure 1. Detect biomarkers using Graph-based clustering or K-means clustering for the differentially expressed features between each cluster and the rest of the data; t-SNE colored by Graph-based clusters (left) with table of Graph-based biomarkers (right). Drag and drop the biomarkers from the table to the t-SNE plot to visualize genes
gene dropoutarrow-up-right
Model-based Analysis of Single Cell Transcriptomicsarrow-up-right
Shalek et al.arrow-up-right
herearrow-up-right
Figure 1: Number of significant genes reported by MAST and limma on the same single-cell RNA-Seq data (default options)
Figure 2: Pathway enrichment results (top 10) invoked on significant genes reported by MAST and limma (using the same single-cell RNA-seq data). KEGG was used as the pathway database. Enrichment scores (the higher - the better) are based on the p-value
Figure 3: Dendrogram of 1385 genes that were differentially expressed between tun and active adults

Enjoy Responsibly!

Enjoy Responsibly!

By: Ivan Lukic, Ph.D. – Field Application Scientist, Partek Incorporated

Let’s talk about something completely different. At Partek training events, I usually explain that our tools can be easily utilized for any quantitative data, not just genomics. For instance, last month I wrote a blog post on the analysis of gylcomics data. But actually, we go way beyond that, to some completely unrelated applications, for instance, whiskey tasting. I was fortunate enough to stumble upon a whiskey data set (courtesy of the University of Strathclyde), which consists of the results of tasting 86 different Scotch malts. Each malt is distilled at a different site and is described by 12 taste categories (e.g. body, sweetness, smoky, and so on). The taste categories are subjective measurements obtained by referees while tasting a particular whiskey and range from zero (taste category absent) to four (taste category very strong). Let’s explore the data by principal components analysis (Figure 1). The interpretation is analogous to the one in genomics: if two whiskeys are close on the plot, that means that they have similar values across the taste categories (to rephrase, they have a similar taste). My first observation was that there is no striking pattern (i.e., the data does not split into several well-defined clusters). That actually supports my hunch that much of whiskey tasting is merely mumbo-jumbo of connoisseurs and aficionados. You know, …with a slight touch of (insert a name of a berry that cannot be purchased at your local market or found at a specialized store because it is a highly endemic and endangered species for which you need three PhDs in botany and several decades of field experience to be able to find it in the wild). However, there is a group of several whiskeys on the left (including Lagavulin, Ardberg, and Talisker, among others) that does stand out a bit. An obvious question is – what about those? Is there something particular about their taste?

To answer that, I turned to a Principal Components Analysis (PCA) biplot. In general, a biplot highlights the extent to which objects represented by rows of a quantitative matrix (observations) differ in terms of the objects represented by the columns (features). In this context, the PCA biplot shows the biggest patterns evident in the data in terms of how the whiskeys differ in particular taste categories. Whiskeys are shown as dots, while taste categories are presented as vectors (Figure 2; to simplify the chart, I chose 2D PCA plot).

Now we can tell that “body” and “smoky” should be interpreted in a similar fashion. That is, they describe very similar components of a whiskey’s flavor and are quite different from, say “fruity” and “sweetness” (which point to a different direction). That makes sense, right? So, if I am looking for a smoky whiskey of a full-body, I should pick Lagavoulin or Laphroig (I have no idea how to pronounce that), because their points project furthest in the direction of “smoky” and “body” vectors. On the other hand, spring is in the air and to celebrate it with a sip of something with a more floral note, I should go for Amenthosan. Well, in reality, I would have neither, because I am a bourbon guy, but you get my point on biplots, right?

How to perform single cell RNA sequencing: exploratory analysis

Is anyone out there still performing bulk RNA-Seq? I am sure that it is far from dying out but judging by what scientists talk about and report on, single cell RNA sequencing (scRNA-Seq) has become the new norm.

Although, when you think about it, it can hardly be called a novel approach. The first manuscript on scRNA-Seq was published in 2009 (), but you may be surprised to hear that the basic principles were described in the 1990s (seminal works by and ). That decade (or three) is quite negligible for a geologist, but in terms of molecular biology, scRNA-Seq is getting gray hair. In comparison, bioinformatics analysis of scRNA-Seq data has yet to mature and there is no consensus in the community.

Single Cell analysis is a larger topic than what is possible in a single blog post. To break it down into bite-size reading, we’ll talk about each step in a separate blog post.

Pathway Analysis: ANOVA vs. Enrichment Analysis

One of the main steps in nearly every bulk RNA-Seq or single cell RNA-Seq project is some sort of statistical testing to compare gene expression levels across samples. After which a list of significant genes is generated by applying a filter strategy, the most common being fold change > |2| and false discovery rate (FDR) < 0.05 (the exact values depend on the research question).

There are different ways to interpret the resulting list of significant genes. The most straightforward is to perform enrichment analysis, which identifies gene sets (e.g., pathways or gene ontology categories) that are overrepresented in the list of the significant genes. Then a 2 ✕ 2 contingency table is created for each gene set: the number of genes present in the list of significant genes and in the gene set and the number of genes present in the list of significant genes but not in the gene set, etc. A statistical test (like Fisher’s exact test) can be invoked on the contingency table. Most researchers transform the p-value into an enrichment score (enrichment score = -ln(p-value)) to make it easier to read (e.g., 6.9, instead of 0.001). Let’s have a look at a simple hypothetical example; a list of significant genes consists of 291 genes. Of those genes, 35 (12%) belong to a particular gene set. The set itself consists of 80 genes and almost half (35/80) are present in the list of significant genes (Table 1). Based on the contingency table, Fisher's exact test statistic p-value is 8.28×10-24. Converted to the enrichment score, the value is 53.15. In other words, the number of genes in gene sets that are also in the list of the significant genes exceeds the expectation.

Back to Blog Post List
Figure 1: Principal components analysis of tasting data of 86 Scotch malt whiskeys. Each whiskey (dots) is produced by a different distillery (dots colored by the distillery and labeled) and described by 12 different taste categories
Figure 2: Principal components analysis biplot of tasting data of 86 Scotch malt whiskeys. Each whiskey (dots) is produced by a different distillery (dots colored by the distillery and labeled) and described by 12 different taste categories (lines)
hashtag
Step One: Clear the Path for Clean Single Cell Analysis

Like with any other analysis, the golden rule of garbage in, garbage outarrow-up-right still applies, so once you have the quantified matrix, you should do some exploratory analysis, on both the cell and gene levels.

hashtag
Cell-Level Exploratory Analysis

Let’s start with cells. You may want to look at the fraction of mitochondrial counts per cell (i.e., the ratio of reads mapping to mitochondrial genes the total number of reads; Figure 1). Cells with an increase in the proportion of mitochondrial genes may be apoptotic and, hence, should be excluded from the analysis. The decision on the cut-off should be made considering the experiment and the samples, but values of 5% or 10% are quite common for single cell analysis studies.

Figure 1. The fraction of mitochondrial counts per cell. Each data point is a single cell, while the pink violin summarizes the data distribution. The plot is based on a publicly available data set, provided by 10x Genomics

Next, some cells may show unusually high count numbers. As an example, I plotted the results of a Drop-Seq experiment (Figure 2). The cell in the top right corner has ~1.3 million reads, which is some 1000⨉ more than most of the cells. That event is most likely not a single cell, but a doublet (or a triplet) and, again, should be excluded from any downstream steps.

Tip! Use a violin plot to show the data distribution of your single cell analysis experiment. It makes it easier to set the cutoff.

Figure 2. The number of counts per cell. Each data point is a single cell, while the pink violin summarizes the data distribution. The plot is based on a Drop-Seq data set

hashtag
Gene Level Exploratory Analysis

There are also several gene-level matrices that you should take into consideration. Going over a distribution plot like the one in Figure 3, based on a 10x Genomics data set, will reveal the genes with the highest number of reads.

hashtag
Filtering the Genes

One way to interpret a distribution plot is to see where you are investing your reads. For example, out of the ten top genes in Figure 3, six are ribosomal genes. Unless you are specifically interested in ribosomal biology or translation machinery, you may want to remove them from the downstream steps, since their presence only introduces noise in the single cell analysis.

Tip! A list of ribosomal genes can be downloaded from genenames.org.arrow-up-right

Figure 3. Feature distribution plot. The horizontal axis shows the number of reads per gene, while the genes are listed on the vertical axis. Each data point (blue bar) is a single cell. The grey dots indicate the median

There is an additional filter strategy to consider: what genes to focus on? One approach is to prune the genes that are not detected (for example, have 0 counts across the cells) or that are detected in a few cells only (“background”; for example, have 0 counts in at least 99% of the cells).

We advise you to carefully interpret the non-detectable genes: are those genes not expressed in your cells (=biology) or are you not picking them up due to the experimental setup, e.g., insufficient sequencing depth (=technology)?

Another approach is to filter only the most variable genes, with the rationale that those genes are also the most informative ones. This is quite appealing and can help to identify the main cell groups. On the flip side, key biological information may be lost.

Irrespective of your decision, filter strategy needs to be considered when interpreting the results. To illustrate the impact of filtering, let us compare two t-SNE charts, based on the same 10x Genomics data (Figure 4): all the detected genes were used for the t-SNE on the left (in this case: 6,178), while only the 100 most variable genes were used for the one on the right. As shown by overlaying the output of graph-based clustering, the cells in the left panel form five, while the cells in the right panel form six clusters.

Figure 4. Impact of gene filtering on cell clustering. Each dot is a single cell (713 in total; public data set by 10x Genomics). t-SNE is based on either all the detected genes (left) or the top 100 genes by variance (right). Graph-based clustering output is indicated by color

Having performed cell-based and gene-based filtering you are one (big) step closer to the analytical data set, but there is still work to be done. For example, you may want to eliminate technical nuisance factors by using batch removal or scaling.

Tang F, et al. 2009arrow-up-right
Paul Colemanarrow-up-right
Norman Iscovearrow-up-right
Genes in list
Genes not in list
Total

Genes in set

35

45

80

Genes not in set

256

4,929

5,185

Total

291

4,974

5,265

To illustrate biological interpretation, we used a spatial transcriptomics data set, consisting of two human prostate samples: a cancer sample and a normal sample harvested from another individual. The samples were processed by the 10x Genomics Visium platform and are available herearrow-up-right. Using the graph-based clustering approach, tissue spots were classified as cancer or normal. Differential expression testing was performed and a list of significant genes was created using fold change > |2| with a false discovery rate (FDR) < 0.05. The list was then interpreted by pathway enrichment with KEGG pathwaysarrow-up-right as the pathways database.

Figure 1 shows the top ten enriched pathways. The first pathway really stands out as a “focal adhesion”. Two additional pathways relate to the adhesion of cells to either other cells or components of the extracellular matrix; dysregulation of genes involved in adhesion is a well-known feature of prostate cancer (particularly important for metastases).

Figure 1. Pathway enrichment results invoked on a list of differentially expressed genes between normal human prostate tissue and cancer tissue (top ten pathways are shown)

Although the enrichment results make sense, they fail to reveal all the biological nuances. It is important to note that enrichment focuses on genes that show considerable differences in expression level between the conditions, while the genes with subtle changes simply go unnoticed. But what if within a pathway several genes are slightly up-regulated to the baseline condition? For instance, in a signaling pathway, even slight changes in expression levels may have profound biological consequences.

One way to address the limitation of gene set enrichment is to use a strategy that could be described as “differential set expression”. For any gene set (e.g., a pathway) expression values of all the genes within the set are added up and then those sums are compared between the samples (basically, pathway ANOVA). This is the principle behind Pathway Analysis as implemented in Partek Flow (Figure 2).

Figure 2. Differential Analysis section of the toolbox in Partek Flow, showing the Pathway Analysis tool

Using the same project to illustrate the concept, cancer and normal samples were compared using Pathway Analysis. The resulting pathways are sorted by fold change (descending) and the top ten pathways are shown in Figure 3.

Figure 3. Pathway analysis results obtained by comparing a normal human prostate and cancer tissues (results sorted by descending fold change, top ten pathways are shown)

If you are not familiar with the biology of prostate cancer, these results may be unexpected: half of the pathways with the highest fold change values relate to lipid metabolism. It turns out, dysregulation of lipid metabolism is one of the hallmarks of prostate cancer (e.g., Poulose et al., Nat Gen 2018arrow-up-right).

Examination of one of the pathways (ɑ-linolenic acid metabolism, Figure 4) sheds light on the difference between pathway analysis and pathway enrichment. All the genes in the pathway are indeed expressed at a higher level in the cancer samples (blue), but the difference in terms of fold change (per gene) is modest, at best, and below the usual cut-off value of 2. Hence, it is not surprising that the enrichment score value for the ɑ-linolenic acid metabolism pathway is 1.34 (corresponding to Fisher’s exact test p-value of 0.26).

Figure 4. Expression levels of genes in the ɑ-linolenic acid metabolism pathway in the cancer (blue dots) and normal sample (red dots)

In summary, enrichment is a valid and valuable approach to contextualize gene lists (sometimes it is the only available method, e.g., if a gene list is obtained from a publication), but it has limitations. In the current example, pathway enrichment yielded valuable results, but additional biological insight was obtained when applying pathway ANOVA. We encourage you to try it with your next experiment and share your experience with us.

How to Integrate ChIP-Seq and RNA-Seq Data

Next generation sequencing has enabled us to ask questions about biology and disease with unprecedented scope and detail. High-throughput assays are available to study many aspects of genomic regulation including RNA-Seq for gene expression, ATAC-Seq for chromatin accessibility, and ChIP-Seq for protein binding sites.

Bringing together multiple genomic assays to analyze both the epigenome and transcriptome in the same samples promises to uncover the mechanisms underlying biology and disease. But while performing the experiments requires good hands and persistence, the real challenge begins after you receive the data. How do you make sense of it all?

Unfortunately, most analysis pipelines and tools are built for one genomic assay, leaving it to you to piece together disparate output spreadsheets and data files to figure out how the results from different assays mesh to form a coherent picture.

At Partek, we make multi-omics integration analysis and integration seamless by bringing all your analysis tools and data together in Partek Flow.

To illustrate how easy it is to analyze and integrate multi-omics data in Partek Flow, I took a quick look at some data from a recently published study. In the study, the authors used ChIP-Seq and RNA-Seq data to characterize TGF-β signaling through the SMAD2/3 transcription factors.

By analyzing the data in Partek Flow, I was able to quickly go from raw data to integrated results. I identified potential direct targets of SMAD2/3 – genes that were nearby SMAD2/3 binding sites and differentially expressed after TGF-β treatment – by analyzing the RNA-Seq and ChIP-Seq data together.

For the RNA-Seq data, I found genes that were differentially expressed between inhibitor and TGF-β treated conditions. This gave me a list of indirect and direct target genes of TGF-β. These genes are shown in the green circle in Figure 1.

For the ChIP-Seq data, I identified regions that were enriched in a SMAD2/3-pull down sample relative to input control using MACS2, a powerful tool for detecting enriched regions in ChIP-Seq and ATAC-Seq data. I then annotated these regions with nearby genes to give me a list of genes that were likely to have been regulated by SMAD2/3. These genes are shown in the blue circle in Figure 1.

I used the Venn diagram tool in Partek Flow to find the intersection between the TGF-β regulated genes and the SMAD2/3 bound genes – the 202 potential direct targets of SMAD2/3 in the experiment.

Going a step further, I performed pathway enrichment analysis on the list of direct target genes to find pathways that were likely to be quickly impacted by signaling through SMAD2/3. You can see one of the annotated KEGG pathway maps I generated using Partek Flow in Figure 2.

I also visualized several of these direct target genes in Partek Flow. Figure 3 shows Skil, a known target gene of SMAD2/3 signaling, in the Partek Flow genome browser. The top track is the ChIP-Seq data. The ChIP-Seq reads histogram shows that the SMAD2/3 pull-down sample is enriched relative to the input control. The bottom track presents the RNA-Seq data, where the TGF-β treated condition showed higher expression of the gene than the inhibitor-treated condition. The predicted SMAD2/3 binding site for this gene is directly upstream of the transcription start site of Skil in the promoter region, illustrating why it was identified as a direct target of SMAD2/3.

—

To Boldly Go…

So, what about the Star Trek opening quote? Well, with the human genome being extensively described and studies of the proteome well underway, I believe that the glycome is the final frontier (but then again, who knows… maybe just the next frontier?). Moreover, glycomic research has profound practical consequences due to the key role of protein glycosylation in normalarrow-up-right physiology as well as pathophysiologyarrow-up-right. Unfortunately, there are some issues that need to be resolved, such as the need for “sensitive, robust, and affordable” high-throughput analysis techniques (Trbojević-Akmačić et al.arrow-up-right, 2016).

On the bright side, a recent manuscript provided a major step in that direction, proposing a standardized method for glycome profiling: Zou and his co-workersarrow-up-right used lectin microarrays to obtain glycation signatures of various tissues. That was precisely the type of study I was looking for to test my hypothesis; that glycomic data can be processed using genomics tools in an easy and straightforward way.

Thus, I downloaded the normalized intensity levels of each lectin from the manuscript supplement and analyzed it with Partek Flow analysis software. Figure 1, based on the t-distributed stochastic neighbor embedding (t-SNE) algorithm, highlights the distinct glycome profiles of five mouse tissues harvested in the original study.

Figure 1. t-Embedded stochastic neighbor embedding analysis of normalized lectin microarray data. Each data point corresponds to one tissue fragment and the original features are intensity signals of the 45 lectins present on the array. Data provided as a supplement to Zou et al., Scientific Reports 2017

This is in line with the results by Zou and his colleagues, but there is a difference that I would like to point out. In the original manuscript, one of the techniques used for differential glycomic profiling was principal components analysis (PCA). Although PCA visualization (Figure 2) leads to the same conclusion, that each tissue has a characteristic glycome fingerprint, the t-SNE (Figure 1) provides a cleaner tissue separation and emphasizes the point in a more convincing fashion.

Partek is best known for our expertise in genomics, however, our tools empower you with the flexibility to examine any type of -omics data. Join us in the journey where no man has gone before.

Figure 1. Venn Diagram of integrated ChIP-Seq and RNA-Seq results showing the direct and indirect targets of SMAD2/3 in the experiments
Figure 2. KEGG pathway map for the JAK-STAT signaling pathway. Direct targets of SMAD2/3 are shown in red
Figure 3. Chromosome view of integrated ChIP-Seq and RNA-Seq results showing Skil, a direct target of SMAD2/3
copyright of Kanehisa Laboratories. Under license from Pathway Solutions Inc.
Figure 2. Principal components analysis of normalized lectin microarray data. Each data point corresponds to one tissue fragment and the original features are intensity signals of the 45 lectins present on the array. Data provided as a supplement to Zou et al., Scientific Reports 2017

Studying Immunotherapy with Multiomics: Simultaneous Measurement of Gene and Protein

If you follow the field of single cell biology, then you know about the recent trend of simultaneous measurement of both gene expression and protein expression. Here’s an application example based on immunotherapy of mucosa-associated lymphoid tissue (MALT) lymphoma cancer. The most common site is the stomach, but any mucous membrane can be affected.

The cancerous cells of MALT lymphoma are the B lymphocytes, originating from the marginal zone of the MALT, and hence the synonym: extranodal marginal zone B cell lymphoma. In addition to B lymphocytes, the other common cell type found within MALT lymphoma are infiltrating T lymphocytes. A depiction of the cellular composition of MALT lymphoma, generated in Partek Flow, is shown in Figure 1.

Figure 1. Cellular composition of a MALT lymphoma sample by a t-SNE chart

Each dot represents a single cell: the entire sample consists of B lymphocytes (red), some of which are the cancer cells, and infiltrating T lymphocytes (green), which are normal cells reacting to the tumor. Cells from a dissociated MALT lymphoma were processed by 10X Genomics’ Feature Barcoding technology and can be downloaded from herearrow-up-right. The filtered HDF5 file was loaded and processed in Partek Flow analysis software. The t-SNE is based on a combined analysis of gene and protein data for 8,018 single cells.


Although B lymphocytes are malignant cells in this type of tumor, there is a growing interest in the neighboring normal T lymphocytes. Penetration of T lymphocytes into the tumor is a reaction to cancer and is a possible therapeutic strategy. Although T lymphocytes can, depending on their type, either regulate immune response directed against the tumor or directly kill the malignant B lymphocytes, they rarely do so.

One way of exploiting the activity of T lymphocytes is to target the immune checkpoint molecules expressed on the surface of T lymphocytes, such as cytotoxic T-lymphocyte-associated protein 4 (CTLA-4). Signaling via CTLA-4 regulates the immune response and prevents autoimmune diseases and allergies, but also hinders the immune system from destroying cancer cells. Alas, only a fraction of patients are responsive to therapeutics acting on the CTLA-4 signaling pathway, which motivates the investigation of other inhibitory or stimulatory receptors.

A promising new immunotherapy drug target is programmed cell death protein 1 (PD-1). The PD-1 protein is encoded by the gene PDCD1 and is expressed on the cell surface of T lymphocytes. Ample evidence confirms that PD-1 acts as a negative regulator of the immune response. Another interesting potential target molecule is ITM2A, encoded by the ITM2A gene. ITM2A is also involved in the regulation of immune response and has recently been shown as commonly co-expressed with PD-1 (). The list of possible therapeutic targets is much longer, and research efforts are often aimed not at individual molecules, but on their combinations and networks.

A common tool for detection of drug targets (and identification of cells expressing them) are antibodies, but – to complicate matters – high fidelity antibodies are not available for most potential drug targets. Moreover, standard analysis techniques, such as fluorescence in situ hybridization (FISH) or flow cytometry, allow for simultaneous detection of no more than 20 molecules.

The good news is that these limitations can be resolved by coupling data analysis in Partek Flow software with single cell RNA-Seq using the Feature Barcoding approach. For instance, take a look at Figure 2, which shows an ideal target for immunomodulation: a target population of “helper” T lymphocytes (positive for CD4 protein) expressing both PDCD1 and ITM2A immunoregulatory genes.

Each dot is a single cell. Cells from a dissociated MALT lymphoma were processed by 10X Genomics’ Feature Barcoding technology and can be downloaded from . The filtered HDF5 file was loaded and processed in Partek Flow. The data points are based on the combined analysis of gene and protein data for single cells. T lymphocytes were identified as cells expressing CD3 antigen. A total of 3,069 CD3-positive cells were gated and then charted by expression levels of PDCD1 and ITM2A mRNA, and CD4 protein. To highlight the helper T lymphocytes population, the plot was then colored by expression levels of CD8 antigen, a cytotoxic T lymphocyte marker (red cells in the background).


Analysis presented above can be performed in Partek Flow software with just a few mouse clicks. If you are curious, don’t hesitate to reach out to us for a free trial.

Andor et al. 2018arrow-up-right
herearrow-up-right
Figure 2. Detection of helper T lymphocytes co-expressing two immunomodulatory genes: PDCD1 and ITM2A

Single Cell Multiomics Analysis: Strategies for Integration

One of the most exciting technological innovations in recent years is multiomic analysis in single cells. Technologies like CITE-Seq (transcriptomic and proteomic expression data from the same cells) have taken some research fields, such as immuno-oncology and neuroscience, to the next level.

It’s one thing to obtain both data types, and another to know how best to use them. Here are three strategies for integrating transcriptomic and proteomic data, particularly for identifying cell types.

hashtag
Use only the proteomic data to cluster cells, use protein & gene expression to infer cell types

Cell surface proteins offer excellent resolving power in identifying the major cell types, so it makes sense to combine information across multiple protein markers to cluster cells together. You can use the most highly expressed proteins (biomarkers) in each cluster to infer which cell type each cluster corresponds to. If any protein markers are missing from the panel, you can use highly expressed genes as biomarkers to fill in the gaps (Figure 1).

One drawback to this approach is the limited number of protein markers in a panel. Each oligo-conjugated antibody targets a different cell surface protein. A panel of 15-20 antibodies may be enough to distinguish some of the major cell types, which may be fine for some studies. But if cell surface protein markers specific to certain cell types are missing from the panel, those cell types may not be easily distinguished as separate clusters and identified.

hashtag
Use only the transcriptomic data to cluster cells, use gene & protein expression to infer cell types

Single cell RNA-Sequencing has proven to be effective in clustering single cells based on their transcriptome-wide gene expression profiles. High expression of known, canonical marker genes in each cluster can be used to infer the cell type identity of each cluster. While the proteomic data may be limited by the number of known markers, transcriptomic data has no such limitation, with thousands of genes being quantified per cell in an unbiased manner. The additional gene information and biomarkers also create the possibility of discovering novel, cryptic cell type subsets (Figure 2).

One drawback of this approach is that cells that are phenotypically distinct at the cell surface protein level are not always distinct at the transcriptomic level. For example, CD4+ and CD8+ T cells are clearly distinguished as separate clusters at the proteomic level (Figure 1), but they are closer together at the transcriptomic level (Figure 2). Furthermore, the T cells cluster primarily by their state (naive & differentiated) in Figure 2, rather than by their phenotype. Layering in the protein biomarkers onto the transcriptome-derived clusters can help resolve these cell types.

hashtag
Use a Weighted nearest neighbor (WNN) approach to integrate both data sets, cluster the cells, use gene & protein expression to infer cell types.

Both transcriptomic and proteomic data will vary in their information content and quality. WNN is a computational method that blends the best of both data types (). It learns the information content of each data type and generates an integrated representation of the joint data.

First, you reduce the dimensionality of each data type using principal component analysis (PCA). The PCA results are fed into the WNN algorithm, which generates a WNN graph. For each cell, the graph lists the most similar cells (nearest neighbors) and their distances, where the distances are weighted on cell-specific information content from each data type. Some cells might have more information from the transcriptomic data, and others from the proteomic data. The WNN graph can then be used as input for clustering and visualization (Figure 3).

In some of our own internal testing, we’ve seen WNN work best when the two data sets have complimentary information. For example, let’s say the transcriptomic data correctly separates hypothetical cell types A and B, but cell types C and D form a single cluster. The proteomic data would be complimentary if it separates cell types C and D, but not A and B. WNN would then do a good job of combining the best of both worlds, separating A, B, C, and D.

hashtag
Summary

There have been some very exciting technological innovations in single cell multiomics in the last few years. It seems that there are multiple ways in which you can combine transcriptomic and proteomic data, not limited to the three strategies outlined above. My recommendation would be to try a few different methods and see what works best for your data and research question.

*The PBMC data used to generate the plots were from a healthy donor and downloaded from . The cells are screened with 10x Genomics’ Feature barcoding assay, with TotalSeq™-B antibodies targeting 17 cell surface proteins. All analyses were performed in Partek Flow v10. The filtered UMI count data were quality-filtered and normalized. PCA was used for dimensionality reduction. WNN was used for integration across the two data types. UMAP was used for clustering and visualization.

Hao, Hao, Andersen-NIssen et al., 2021arrow-up-right
10x Genomics’ websitearrow-up-right
Figure 1. UMAP plot of single cells from a PBMC sample*. Cells are clustered based on 17 protein markers. Clusters are labeled with some diagnostic protein biomarkers (black, bold) and gene biomarkers (grey, italic) for each cell type
Figure 2. UMAP plot of single cells from a PBMC sample*. Cells are clustered based on 21,002 gene expression markers. Clusters are labelled with some diagnostic protein biomarkers (black, bold) and gene biomarkers (grey, italic) for each cell type
Figure 3. UMAP plot of single cells from a PBMC sample*, following WNN integration of transcriptomic and proteomic data. Clusters are labeled with some diagnostic protein biomarkers (black, bold) and gene biomarkers (grey, italic) for each cell type

Batch remover for single cell data

Experimental data, such as single cell RNA-Seq, is frequently burdened by nuisance batch effects, or undesirable numeric or categorical factors. Due to logistic constraints, data is often processed in different batches, e.g., different operator, different flow cell, different reagent lot and so on. If the processing batches are included in the experimental design or are relatively well balanced within the experimental conditions (for example, technician A processed half of the control and half of the treated samples, while technician B took care of the other half), their effects can be identified and removed from the data.

All the way back in the microarray era, Partek Genomics Suite was well known in the field for its batch remover. Now, our batch remover has been implemented in Partek Flow 8.0. To illustrate the batch remover in action, I downloaded two public data setsarrow-up-right from 10x Genomics®: 1,000 peripheral blood mononuclear cells (PBMC) from a healthy human, processed by v2 chemistry, and the same sample processed by v3 chemistry. I analyzed the data in Partek Flow (e.g., by removing dead or apoptotic cells) and identified several cell types.

You would expect the batch effect in this project to be as large as it gets, and you would be right. The left panel of Figure 1 shows the t-SNE before the correction: the cells of each type split into two groups, based on the chemistry (instead of being clustered together). Now the good news! Once the batch effect has been removed, that pattern is no longer discernible (Figure 1, right panel): cells of the same type are grouped tightly together.

Figure 1. Effect of batch removal by Partek Flow. The t-SNE charts are based on analysis of 1,000 human peripheral blood mononuclear cells processed by two different 10X Genomics’ chemistries, thus introducing a batch effect. Each dot on the plot is a single cell. The version of chemistry is indicated by dot size. Three cell types were identified in the data set and that information is depicted using color. Some cells were not classified (N/A)

The scenario presented above, where two different chemistries were used to generate the data, can be considered as an extreme example of batch effect and it is highly unlikely that you would face it in the real world. I was using it for illustration purposes only. If the Partek Flow batch remover can handle batch effects of this magnitude, it will have no problem helping you with an actual data set.