QA/QC

Partek Flow contains a number of quality control tools and reports that can be used to evaluate the current status of your analysis and decide downstream steps. Quality control tools are organized under the Quality Assurance / Quality Control (QA/QC) section of the context-sensitive menu and are available for unaligned and aligned reads data nodes.

This section will illustrate:

Pre-alignment QA/QC
ERCC Assessment
Post-alignment QA/QC
Coverage Report
Validate Variants
Feature distribution
Single-cell QA/QC
Cell barcode QA/QC

In addition to the tools listed above, many other functions can also serve as quality control checks, for instance principal components analysis, hierarchical clustering (at the sample level), the variant detection report, and the quantification report.

Additional Assistance

If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.

Pre-alignment QA/QC

Selecting a node with unaligned reads (either Unaligned reads or Trimmed reads) shows the QA/QC section in the context-sensitive menu, with two options (Figure 1). To assess the quality of your raw reads, use Pre-alignment QA/QC.

The Pre-alignment QA/QC setup dialog is shown in Figure 2. Examine reads controls the number of reads processed by the tool: All reads, or a subset (One of every n reads). The latter option is less thorough, but much faster than All reads.

If selected, K-mer length creates a per-sample report with the position of the most frequent k-mers (i.e. sequences of k nucleotides) of the length specified in the dialog. The range of input values is from 1 to 10.

The last control refers to .fastq files. Partek® Flow® can automatically detect the quality encoding scheme (Auto detect), or you can use one of the options available in the drop-down list. However, auto-detection only works for the Phred+33 and Phred+64 quality encodings. For the early Solexa quality encoding, select Solexa+64 from the Quality encoding drop-down list. For paired-end data, pre-alignment QA/QC is performed on each read of the pair separately, and the results are shown separately as well.

Most sequencing applications now use the Phred quality score. This score indicates the probability that the base was accurately identified. For example, Phred scores of 10, 20, 30, and 40 correspond to base call accuracies of 90%, 99%, 99.9%, and 99.99%, respectively.
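The relationship between a Phred score and base call accuracy follows the standard definition Q = -10 * log10(P), where P is the probability that the base call is wrong. The sketch below illustrates that definition and the decoding of a .fastq quality string with an ASCII offset of 33 or 64. It is not Partek Flow code, the function names are hypothetical, and Solexa scores, which use a different formula, are not covered.

```python
def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def phred_to_accuracy(q: float) -> float:
    """Probability that a base call with Phred score q is correct."""
    return 1.0 - phred_to_error_probability(q)


def decode_qualities(quality_string: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into integer scores.

    offset is 33 for Phred+33 and 64 for Phred+64 encoded files.
    """
    return [ord(ch) - offset for ch in quality_string]


# Example: accuracy for a few common scores, and one decoded quality string.
for q in (10, 20, 30, 40):
    print(f"Q{q}: accuracy {phred_to_accuracy(q):.2%}")
print(decode_qualities("II5+"))
```

Decoding a file with the wrong offset shifts every score by 31, which is why the correct Quality encoding matters when auto-detection is not applicable.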

The task report is organised in two tiers. The initial view shows a project-level report with all the samples. An overview table is at the top, with matching plots below.

Two project-level plots are Average base quality per position and Average base quality score per read (Figure 4). The latter plot presents the proportion of reads (y-axis) with certain average quality score (meaning all the base qualities within a read are averaged; x-axis). Mouse over a data point to get the matching readouts. The Save icon saves the plot in a .svg format to the local machine. Each line on the plot represents a data file and you can select the sample names from the legend to hide/un-hide individual lines.
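The two project-level summaries can be sketched in a few lines of Python. This is an illustration only, not Partek Flow's implementation: the function names are hypothetical, each read is assumed to be a list of already-decoded integer quality scores, and rounding the per-read average to the nearest integer for binning is an assumption made here.

```python
from collections import Counter


def average_quality_per_position(reads: list[list[int]]) -> list[float]:
    """Mean base quality at each read position (reads may differ in length)."""
    max_len = max((len(r) for r in reads), default=0)
    means = []
    for pos in range(max_len):
        # Only reads long enough to reach this position contribute.
        values = [r[pos] for r in reads if len(r) > pos]
        means.append(sum(values) / len(values))
    return means


def read_quality_distribution(reads: list[list[int]]) -> dict[int, float]:
    """Proportion of reads at each (rounded) average quality score."""
    per_read = Counter(round(sum(r) / len(r)) for r in reads if r)
    total = sum(per_read.values())
    return {score: n / total for score, n in sorted(per_read.items())}


# Made-up quality scores for three reads.
example_reads = [[30, 30, 20], [40, 30], [10, 10, 10]]
print(average_quality_per_position(example_reads))
print(read_quality_distribution(example_reads))
```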

A sample-level report begins with a header, which is a collection of typical quality metrics (Figure 5).

Below the header you will find four plots: Base composition, Average base quality score per position (same as above, but on the sample level), Distribution of base quality scores (the same as Average base quality score per read, but on the sample level), and Distribution of read lengths.

Distribution of read lengths shows a single column for fixed-length data (e.g. Illumina sequencing). However, for quality-trimmed data or non-fixed-length data (like Ion Torrent sequencing), expect to see a distribution of read lengths (Figure 7).

If K-mer length option was turned on when setting up the task, an additional plot will be added to the sample-level report, i.e. K-mer Content (Figure 8). For each position, K-mer composition is given, but only the top six most frequent K-mers are reported; high frequency of a K-mer at a given site (enrichment) indicates a possible presence of sequencing adapters in the short reads.
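As a rough illustration of the idea behind the K-mer Content plot, the sketch below counts the k-mers that start at each read position and keeps the six most frequent per position. It assumes reads are plain nucleotide strings; the function name is hypothetical, and how Partek Flow counts and ranks k-mers internally is not documented here.

```python
from collections import Counter, defaultdict


def top_kmers_per_position(reads: list[str], k: int, top_n: int = 6):
    """Return {position: [(kmer, count), ...]} with the top_n k-mers per position."""
    counts = defaultdict(Counter)
    for read in reads:
        # Slide a window of width k along the read.
        for pos in range(len(read) - k + 1):
            counts[pos][read[pos:pos + k]] += 1
    return {pos: c.most_common(top_n) for pos, c in sorted(counts.items())}


# Made-up reads; a k-mer that dominates one position hints at adapter sequence.
example = top_kmers_per_position(["ACGT", "ACGA", "TCGT"], k=2)
for position, kmers in example.items():
    print(position, kmers)
```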

The pre-alignment QA/QC report described above is generally available for NGS data in fastq format. For other types of data, the report may differ depending on the information available. For example, the fasta format carries no base quality scores, so all figures and graphs related to base or read quality scores will be unavailable.

ERCC Assessment

The ERCC (External RNA Control Consortium) developed a set of RNA standards for quality control in microarray, qPCR, and sequencing applications. These RNA standards are spiked-in RNA with known concentrations and composition (i.e. sequence length and GC content). They can be used to evaluate sensitivity and accuracy of RNA-seq data.

The ERCC analysis is performed on unaligned data, provided the ERCC RNA standards have been added to the samples. There are 92 ERCC spike-in sequences with different concentrations and different compositions. The raw data are aligned (with Bowtie) to the known ERCC RNA sequences to get the count of each ERCC sequence. This reference information is available within Partek Flow and is used to plot the correlation between the observed counts and the expected concentration. If the observed counts correlate highly with the expected concentration, you can be confident that the quantified RNA-seq data are reliable. Partek Flow supports the Mix 1 and Mix 2 ERCC formulations. Both formulations use the same ERCC sequences, but each sequence is present at different expected concentrations. If both Mix 1 and Mix 2 formulations have been used, an ExFold comparison can be performed to compare the observed and expected Mix1:Mix2 ratio for each spike-in.

To start ERCC assessment, select an unaligned reads node and choose ERCC in the context-sensitive menu. If all samples in the project have used the Mix 1 or Mix 2 formulation, choose the appropriate radio button at the top (Figure 1).

You can change the Bowtie parameters by clicking Configure before the alignment (Figure 1), although the default parameters work fine for most data. Once the task has been set up correctly, select Finish.

The ERCC task report starts with a table (Figure 3), which summarizes the result on the project level. The table shows which samples use the Mix 1 or Mix 2 formulation. The total number of alignments to the ERCC controls is also shown, further divided into the number of alignments to the forward strand and to the reverse strand. The summary table also gives the percentage of ERCC controls that contain alignment counts (i.e. are present). Generally, the fraction of present controls should be as high as possible; however, certain ERCC controls may not contain alignment counts due to their low concentration, which is useful for evaluating the sensitivity of the RNA-seq experiment. The coefficient of determination (R squared) of the present ERCC controls is listed in the next column. As a rule of thumb, you should expect a good correlation between the observed alignment counts and the actual concentration; otherwise the RNA-seq quantification results may not be accurate. Finally, the last two columns give estimates of bias with regard to sequence length and GC content, by giving the correlation of the alignment counts with the sequence length and the GC content, respectively.
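The summary columns can be approximated with a short calculation. The sketch below is an illustration under stated assumptions, not Partek Flow's implementation: the dictionary keys and function names are hypothetical, the example values are made up, and the use of log2-transformed counts for the length and GC correlations is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ercc_summary(controls):
    """Summarize ERCC controls given as dicts with the (hypothetical) keys
    'concentration', 'count', 'length', and 'gc'."""
    # A control is "present" when it has at least one alignment.
    present = [c for c in controls if c["count"] > 0]
    log_conc = [math.log2(c["concentration"]) for c in present]
    log_count = [math.log2(c["count"]) for c in present]
    return {
        "percent_present": 100 * len(present) / len(controls),
        "r_squared": pearson_r(log_conc, log_count) ** 2,
        "length_correlation": pearson_r([c["length"] for c in present], log_count),
        "gc_correlation": pearson_r([c["gc"] for c in present], log_count),
    }


# Made-up example values, not real ERCC concentrations.
example_controls = [
    {"concentration": 1.0, "count": 2, "length": 500, "gc": 40},
    {"concentration": 4.0, "count": 8, "length": 1000, "gc": 45},
    {"concentration": 16.0, "count": 32, "length": 750, "gc": 50},
    {"concentration": 0.25, "count": 0, "length": 600, "gc": 55},
]
print(ercc_summary(example_controls))
```

Controls without alignments are excluded before taking logarithms, which matches the statement that R squared is reported for the present controls only.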

If ExFold comparison was enabled, an extra table will be produced in the ERCC task report (Figure 4). Each row in the table is a pairwise comparison. This table lists the percentage of ERCC controls present in the Mix 1 and Mix 2 samples and the R squared for the observed vs expected Mix1:Mix2 ratios.

The ERCC spike-ins plot (Figure 5) shows the regression lines between the actual spike-in concentration (x-axis, given in log2 space) and the observed alignment counts (y-axis, given in log2 space), for all the samples in the project. The samples are depicted as lines, and the probes with the highest and lowest concentration are highlighted as dots. The regression line for a particular sample can be turned off by simply clicking on the sample name in the legend beneath the plot.

Optionally, you can invoke a principal components analysis plot (View PCA), which is based on RPKM-normalised counts, using the ERCC sequences as the annotation file (not shown).

For more details, go to the sample-level report (Figure 6) by selecting a sample name on the summary table. First, you will get a comprehensive scatter plot of observed alignment counts (y-axis, in log2 space) vs. the actual spike-in concentration (x-axis, in log2 space). Each dot on the plot represents an ERCC sequence, coloured based on GC content and sized by sequence length (plot controls are on the right).

The table (Figure 7) lists individual controls, with their actual concentration, alignment counts, sequence length, and % GC content. The table can be downloaded to the local computer by selecting the Download link.

For more details on ExFold comparisons, select a comparison name in the ExFold summary table (Figure 8). First, you will get a comprehensive scatter plot of observed Mix1:Mix2 ratios (y-axis, in log2 space) vs. the expected Mix1:Mix2 ratio (x-axis, in log2 space). Each dot on the plot represents an ERCC sequence, coloured based on GC content and sized by sequence length (plot controls are on the right).

The table (Figure 9) lists individual controls, with each sample's alignment counts, together with the observed and expected Mix1:Mix2 ratios. The table can be downloaded to the local computer by selecting the Download link.

Post-alignment QA/QC

Post-alignment QA/QC is available for data nodes containing aligned reads (Aligned reads) and has no special control dialog. Similar to the pre-alignment QA/QC report, the post-alignment report contains two tiers, i.e. a project-level report and a sample-level report.

The project-level report starts with a summary table (Figure 1). Unlike the pre-alignment QA/QC report, each row now corresponds to a sample (sample names are hyperlinks to the sample-level report). The table allows for a quick comparison across all the samples within the project, so any outlying sample can easily be spotted.

Note that the summary table reflects the underlying chemistry. While Figure 1 shows a summary table for single-end sequencing, an example table for paired-end sequencing is given in Figure 2. Common features are discussed first.

The first two columns contain the total number of reads (Total reads) and the total number of alignments (Total alignments). Theoretically, for single-end chemistry, the total number of reads equals the total number of alignments. For paired-end reads, the theoretical result is twice as many alignments as reads (the term “read” refers to the fragment being sequenced, and since each fragment is sequenced from two directions, one can expect two alignments per fragment). When counting the actual number of alignments (Total alignments), however, reads that align more than once (multimappers) are also taken into account. Next, the Aligned column contains the fraction of all reads that were aligned to the reference assembly.

The Coverage column shows the fraction (%) of the reference assembly that was sequenced, and the Avg. coverage depth column shows the average sequencing coverage (×) of the covered regions. The Avg. quality is mapping quality, as reported by the aligner (not all aligners support this metric). Avg. length is the average read length, and average read quality is given in the Avg. quality column. Finally, %GC is the fraction of G or C calls.

In addition, the Post-alignment QA/QC report for single-end reads (Figure 1) contains the Unique column. This refers to the fraction of uniquely aligned reads.

On the other hand, the Post-alignment QA/QC report for paired-end reads (Figure 2) contains these columns:

Unique singleton
- fraction of alignments corresponding to the reads where only one of the paired reads can be uniquely aligned
Unique paired
- fraction of alignments corresponding to the reads where both of the paired reads can be uniquely aligned
Non-unique singleton
- fraction of singletons that align to multiple locations
Non-unique paired
- fraction of paired reads that align to multiple locations

Note: for paired-end reads, if one end is aligned but its mate is not, the read is counted in the numerator of the alignment rate. Because the mate is unaligned, the read is also included in the unaligned reads data node for second-stage alignment (if the option to generate an unaligned reads data node is selected). This creates a discrepancy between total reads and "unaligned reads + total reads * alignment rate", because reads with only one mate aligned are counted twice.
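The double counting described in the note can be checked with simple arithmetic. The sketch below uses made-up numbers and a hypothetical function name, and assumes every read pair falls into exactly one of three groups: both mates aligned, only one mate aligned, or neither mate aligned.

```python
def paired_end_accounting(both_aligned: int, one_mate_aligned: int, neither_aligned: int):
    """Show why 'unaligned reads + total reads * alignment rate' exceeds total reads."""
    total_reads = both_aligned + one_mate_aligned + neither_aligned
    # Reads with at least one aligned mate count towards the alignment rate.
    aligned_reads = both_aligned + one_mate_aligned
    alignment_rate = aligned_reads / total_reads
    # Reads with an unaligned mate also go to the unaligned reads data node.
    unaligned_node_reads = neither_aligned + one_mate_aligned
    discrepancy = unaligned_node_reads + aligned_reads - total_reads
    return {
        "total_reads": total_reads,
        "alignment_rate": alignment_rate,
        "unaligned_node_reads": unaligned_node_reads,
        "discrepancy": discrepancy,
    }


print(paired_end_accounting(both_aligned=800, one_mate_aligned=150, neither_aligned=50))
```

The discrepancy always equals the number of reads with only one mate aligned, since those are the reads counted in both places.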

In addition to the summary table, several graphs are plotted to give a comparison across multiple samples in the project. Those graphs are Alignment breakdown, Coverage, Genomic Coverage, Average base quality per position, Average base quality score per read, and Average alignments per read. Two of those (Average base quality plots) have already been described.

The alignment breakdown chart (Figure 3) presents each sample as a column, and has two vertical axes (i.e. Alignment percent and Total reads). The percentage of reads with different alignment outcomes (Unique paired, Unique singleton, Non-unique, Unaligned) is represented by the left-side y-axis and visualized by stacked columns. The total number of reads in each sample is given using the black line and shown on the right-side y-axis.

The Coverage plot (Figure 4) shows the Average read depth (in covered regions) for each sample using columns and can be read off the left-hand y-axis. Similarly, the Genomic coverage plot shows genome coverage in each sample, expressed as a fraction of the genome.

The last graph is Average alignments per read (Figure 5) and shows the average number of alignments for each read, with samples as columns. For single-end data, the expected average alignments per read is one, while for paired-end data, the expected average alignments per read is two.

Coverage Report

Coverage report is also available for data nodes containing aligned reads (Aligned reads, Trimmed reads, or Filtered reads). The purpose of the report is to understand how well the genomic regions of interest are covered by sequencing reads for a particular analysis.

When setting up the task (Figure 1), you first need to specify the Genome build and then a Gene/feature annotation file, which defines the genomic regions you are interested in (e.g. exome or genes within a panel). The Gene/feature annotation can be previously associated with Partek® Flow® via Library File Management or added on the fly.

The complete coverage report will contain the percentage of bases within the specified genes/features with coverage greater than or equal to the coverage levels defined under Add minimum coverage levels. To add a level, click the green plus icon. To remove a level, click the red cross icon.

As for the Advanced options, if Strand-specificity is turned on, only reads which match the strand of a given region will be considered for that region’s coverage statistics.

Generate target enrichment graphs will generate a graphical overview of coverage across each feature.

When Use multithreading is checked, the computation will utilize multiple CPUs. However, if the input or output data reside on a file system that does not handle multithreaded tasks well, such as GPFS, unchecking this option will prevent task failures.

The Coverage report result page contains a project-level overview and starts with a summary table, with one sample per row (Figure 2). The first few columns show the percentage of bases in the genomic features that are covered at the specified level or higher (default: 1×, 20×, 100×). Average coverage is defined as the sum of base calls of each base in the genomic features divided by the length of the genomic features. Similarly, Average quality is defined as the sum of average quality of those bases that cover the genomic features, divided by the length of covered genomic features. The last two columns show the number of On-target reads (overlapping the genomic features) and Off-target reads (not overlapping the features). The Optional columns link enables import of any meta-data present in the data table (Data tab).
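The coverage-level and average coverage columns can be illustrated with a small calculation. This sketch assumes the per-base read depth of every base in the targeted features is already available as a list of integers; the function name and the example depths are hypothetical, and the code is not taken from Partek Flow.

```python
def coverage_summary(depths: list[int], levels=(1, 20, 100)):
    """Return (percent of bases covered at >= each level, average coverage)."""
    length = len(depths)
    percent_at_level = {
        level: 100 * sum(1 for d in depths if d >= level) / length
        for level in levels
    }
    # Sum of base calls over all feature bases, divided by the feature length.
    average_coverage = sum(depths) / length
    return percent_at_level, average_coverage


# Made-up depths for a four-base feature.
percentages, average = coverage_summary([0, 5, 20, 150])
print(percentages, average)
```

Uncovered bases (depth 0) still count towards the feature length, so they lower both the percentages and the average coverage.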

Quantification of on- and off-target reads is also displayed in the column chart below the table (Figure 3), showing each sample as a separate column and fraction of on-/off-target reads on the y-axis.

Region coverage summary hyperlink opens a new page, with a table showing average coverage for each region (rows), across the samples (columns) (Figure 4).

The Coverage summary plot (Figure 6) is an overview of coverage across the targeted genomic features for all the samples in the project. Each line within the plot is a single sample. The horizontal axis is the normalized position within the genomic feature, represented as the 1st to 100th percentile of the length of the feature, while the vertical axis shows the average coverage (across all the features for a given sample).
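One way to picture the normalized position is to split each feature into 100 equal-width bins along its length, average the depth within each bin, and then average the bins across features. The sketch below follows that approach; the binning scheme and function names are assumptions for illustration, not a description of how Partek Flow computes the plot.

```python
def normalized_profile(depths: list[int], bins: int = 100) -> list[float]:
    """Average depth in each of `bins` equal-width slices of one feature."""
    n = len(depths)
    profile = []
    for b in range(bins):
        start = b * n // bins
        # Guarantee at least one base per bin for features shorter than `bins`.
        end = max((b + 1) * n // bins, start + 1)
        window = depths[start:end]
        profile.append(sum(window) / len(window))
    return profile


def sample_profile(features: list[list[int]], bins: int = 100) -> list[float]:
    """Average the normalized profiles of all non-empty features of one sample."""
    profiles = [normalized_profile(d, bins) for d in features if d]
    return [sum(column) / len(column) for column in zip(*profiles)]


# Two made-up features, summarized with four bins for readability.
print(sample_profile([[10, 20], [30, 40]], bins=4))
```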

If you need more details about a sample, click on the sample name in the Coverage report table (Figure 7). The columns are as follows:

Region name: the genomic feature identifier (as specified in the annotation file)
Chromosome: the chromosome of the genomic feature (or region)
Start: the start position of the genomic feature (1-based)
Stop: the stop position of the genomic feature (exclusive, which means the stop position itself is not part of the feature)
Strand: the strand of the genomic feature
Total exon length: the length of the genomic feature
Reads: the total number of reads aligning to the genomic feature
% GC: the percentage of G and C bases in the reads aligning to the genomic feature
% N: the percentage of ambiguous bases (N) in the reads aligning to the genomic feature
(n)x: the proportion of the genomic feature which is covered by at least n number of alignments. [Note: n is the coverage level that you specified when submitting Coverage report task, defaults are 1×, 20×, 100×]
Average coverage: the average sequencing depth across all bases in the genomic feature
Average quality: the average quality score across covered bases in the genomic feature

Validate Variants

Validate variants is available for data nodes containing variants (Variants, Filtered variants, or Annotated Variants). The purpose of this task is to understand the performance of the variant calling pipeline by comparing variant calls from a sample within the project to known “gold standard” variant data that already exist for that sample. This “gold standard” data can encompass variants identified with high confidence by other experimental or computational approaches.

Setting up the task (Figure 1) involves identifying the Genome Build used for variant detection and the Sample to validate within the project. Target specific regions allows you to specify the Target regions for this study, i.e. the regions sequenced for all samples in the project. Benchmark target regions represent the regions that have been previously interrogated to identify “gold standard” variant calls in the sample of interest. These parameters are important to ensure that only overlapping regions are compared, avoiding the identification of false positive or false negative variants in regions covered by only the project sample or only the “gold standard” sample. Both sections utilize a Gene/feature annotation file, which can be previously associated with Partek Flow via Library File Management or added on the fly. The Validated variants file is a single-sample vcf file containing the “gold standard” variant calls for the sample of interest and can be previously associated with Partek Flow as a Variant Annotation Database via Library File Management or added on the fly.

The Validate variants results page contains statistics from the comparison of variants in the project sample with the validated variant calls for that sample (Figure 2). The results are split into two sections, one based on metrics calculated from the comparison of SNVs and the other from the comparison of INDELs.

The following SNV-level metrics are contained within the report, comparing the sample in the project to the validated variant data:

No genotypes: the number of missing genotypes from the sample in the Flow project
Same as reference: the number of homozygous reference genotypes from the sample in the Flow project
True positives: the number of variant genotypes from the sample in the Flow project that match the validated variants file
False positives: the number of variant genotypes from the sample in the Flow project that are not found in the validated variants file
True negatives: the number of loci that do not have variant genotypes in the sample in the Flow project and the validated variants file
False negatives: the number of loci that do not have variant genotypes in the sample in the Flow project but do have variant genotypes in the validated variants file
Sensitivity: the proportion of variant genotypes in the validated variants file that are correctly identified in the sample in the Flow project (true positive rate)
Specificity: the proportion of non-variant loci in the validated variants file that are non-variant in the sample in the Flow project (true negative rate)
Precision: the number of true positive calls divided by the number of all variant genotypes called in the sample in the Flow project (positive predictive value)
F-measure: a measure of the accuracy of the calling in the Flow pipeline relative to the validated variants. It considers both the precision and the recall (sensitivity) of the test to compute the score. The best value is 1 (perfect precision and recall) and the worst is 0.
Matthews correlation: a measure of the quality of classification, taking into account true and false positives and negatives. The Matthews correlation is a correlation coefficient between the observed and predicted classifications, ranging from −1 to +1. A coefficient of +1 represents a perfect prediction, 0 a prediction no better than random, and −1 a completely wrong prediction.
Transitions: variant allele interchanges between two purines or between two pyrimidines in the sample in the Flow project relative to the reference
Transversions: variant allele interchanges of purines to/from pyrimidines in the sample in the Flow project relative to the reference
Ti/Tv ratio: ratio of transitions to transversions in the sample in the Flow project
Heterozygous/Homozygous ratio: the ratio of heterozygous to homozygous genotypes in the sample in the Flow project
Percentage of sites with depth < 5: the percentage of variant genotypes in the sample in the Flow project that have fewer than 5 supporting reads
Depth, 5th percentile: the 5th percentile of sequencing depth across all variant genotypes in the sample in the Flow project
Depth, 50th percentile: the 50th percentile (median) of sequencing depth across all variant genotypes in the sample in the Flow project
Depth, 95th percentile: the 95th percentile of sequencing depth across all variant genotypes in the sample in the Flow project
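Several of the metrics above follow directly from the four confusion-matrix counts. The sketch below uses the textbook formulas and assumes that the F-measure is the balanced F1 score (the harmonic mean of precision and sensitivity); the report does not state which variant it uses. Function names and example counts are hypothetical, and degenerate inputs such as zero true positives are not handled.

```python
import math


def validation_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Classification metrics from true/false positive/negative counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    matthews = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        "matthews": matthews,
    }


# Purine <-> purine and pyrimidine <-> pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snvs: list[tuple[str, str]]) -> float:
    """Ratio of transitions to transversions for (reference, alternate) pairs."""
    transitions = sum(1 for pair in snvs if pair in TRANSITIONS)
    transversions = len(snvs) - transitions
    if transversions == 0:
        raise ValueError("no transversions: Ti/Tv ratio is undefined")
    return transitions / transversions


print(validation_metrics(tp=90, fp=10, tn=890, fn=10))
print(ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```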

The INDEL-level metrics columns contained within the report are identical, except that they lack the information on transitions and transversions.

Feature distribution

The Feature distribution plot visualizes the distribution of features in a counts data node.

Running Feature distribution

To run Feature distribution:

Click a counts data node
Click the QA/QC section of the toolbox
Click Feature distribution

A new task node is generated with the Feature distribution report.

Feature distribution plot configuration

The Feature distribution task report plots the distribution of all features (genes or transcripts) in the input data node with one feature per row (Figure 1). Features are ordered by average value in descending order.

The plot can be configured using the panel on the left-hand side of the page.

Filter

Using the filter, you can choose which features are shown in the task report.

The Manual filter lets you type a feature ID (such as a gene symbol) and filter to matching features by clicking + . You can add multiple feature IDs to filter to multiple features (Figure 2).

Plot type

Distributions can be plotted as histograms, with the x-axis being the expression value and the y-axis the frequency, or as a strip plot, where the x-axis is the expression value and the position of each cell/sample is shown as a thin vertical line, or strip, on the plot (Figure 3).

To switch between plot types, use the Plot type radio buttons.

Mousing over a dot in the histogram plot gives the range of feature values that are being binned to generate the dot and the number of cells/samples for that bin in a pop-up (Figure 4).

Mousing over a strip shows the sample ID and feature value in a pop-up. If there are multiple cells/samples with the same value, only one strip will be visible for those cells/samples and the mouse-over will indicate how many cells/samples are represented by that one strip (Figure 5).

Clicking a strip will highlight that cell/sample in all of the plots on the page (Figure 6).

The grey dot in each strip plot shows the median value for that feature. To view the median value, mouse over the dot (Figure 7).

Page

To navigate between pages, use the Previous and Next buttons or type the page number in the text field and press Enter on your keyboard.

The number of features that appear in the plot on each page is set by the Items per page drop-down menu (Figure 8). You can choose to show 10, 25, or 50 features per page.

Scale Y-axis

When Plot type is set to Histogram, you can choose to configure the Y-axis scale using the Scale Y-axis radio buttons. Feature max sets each feature plot y-axis individually. Page max sets the same y-axis range for every feature plot on the page, with the range determined by the feature with the highest frequency value.

Color by

You can add attribute information to the plots using the Color by drop-down menu.

For histogram plots, the histograms will be split and colored by the levels of the selected attribute (Figure 9). You can choose any categorical attribute.

For strip plots, the sample/cell strips will be colored by the levels or values of the selected attribute (Figure 10). You can choose any categorical or numeric attribute.

Single-cell QA/QC

The Single-cell QA/QC task in Partek Flow enables you to visualize several useful metrics that will help you include only high-quality cells. To invoke the Single-cell QA/QC task:

Click a Single cell counts data node
Click the QA/QC section of the task menu
Click Single cell QA/QC

By default, all samples are used together to perform QA/QC. Alternatively, you can choose to split by sample and perform QA/QC separately for each sample.

If your Single cell counts data node has been annotated with a gene/transcript annotation, the task will run without a task configuration dialog. However, if you imported a single cell counts matrix without specifying a gene/transcript annotation file, you will be prompted to choose the genome assembly and annotation file by the Single cell QA/QC configuration dialog (Figure 1). Note, it is still possible to run the task without specifying an annotation file. If you choose not to specify an annotation file, the detection of mitochondrial counts will not be possible.

The Single cell QA/QC task report opens in a new data viewer session. Four dot and violin plots showing the value of every cell in the project are displayed on the canvas: counts per cell, detected features per cell, the percentage of mitochondrial counts per cell, and the percentage of ribosomal counts per cell (Figure 2).

If your cells do not express any mitochondrial genes or an appropriate annotation file was not specified, the plot for the percentage of mitochondrial counts per cell will be non-informative (Figure 3).

Mitochondrial genes are defined as genes located on a mitochondrial chromosome in the gene annotation file. The mitochondrial chromosome is identified in the gene annotation file by having "M" or "MT" in its chromosome name. If the gene annotation file does not follow this naming convention for the mitochondrial chromosome, Partek Flow will not be able to identify any mitochondrial genes. If your single cell RNA-Seq data was processed in another program and the count matrix was imported into Partek Flow, be sure that the annotation field that matches your feature IDs was chosen during import; Partek Flow will be unable to identify any mitochondrial genes if the gene symbols in the imported single cell data and the chosen gene/feature annotation do not match.

Ribosomal genes are defined as genes that code for proteins in the large and small ribosomal subunits. Ribosomal genes are identified by searching their gene symbol against a list of 89 L & S ribosomal genes taken from HGNC. The search is case-insensitive and includes all known gene name aliases from HGNC. Identifying ribosomal genes is performed independent of the gene annotation file specified.

Total counts are calculated as the sum of the counts for all features in each cell from the input data node. The number of detected features is calculated as the number of features in each cell with greater than zero counts. The percentage of mitochondrial counts is calculated as the sum of counts for known mitochondrial genes divided by the sum of counts for all features and multiplied by 100. The percentage of ribosomal counts is calculated as the sum of counts for known ribosomal genes divided by the sum of counts for all features and multiplied by 100.
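The four per-cell metrics can be written out as a short calculation. In this sketch a cell is a dictionary mapping feature IDs to counts, and the mitochondrial and ribosomal gene sets are passed in directly. These inputs, the function name, and the example gene lists are assumptions for illustration; Partek Flow derives the mitochondrial set from the annotation file and the ribosomal set from its HGNC-based list.

```python
def cell_qc_metrics(counts: dict[str, int], mito_genes: set[str], ribo_genes: set[str]) -> dict:
    """Compute total counts, detected features, and percent mito/ribo for one cell."""
    total = sum(counts.values())
    detected = sum(1 for value in counts.values() if value > 0)
    mito = sum(value for gene, value in counts.items() if gene in mito_genes)
    # Ribosomal symbols are matched case-insensitively (ribo_genes is upper case).
    ribo = sum(value for gene, value in counts.items() if gene.upper() in ribo_genes)
    return {
        "total_counts": total,
        "detected_features": detected,
        "percent_mito": 100 * mito / total if total else 0.0,
        "percent_ribo": 100 * ribo / total if total else 0.0,
    }


# Made-up counts for one cell and small illustrative gene sets.
example_cell = {"MT-CO1": 10, "RPL3": 20, "Rps6": 10, "ACTB": 60, "GAPDH": 0}
print(cell_qc_metrics(example_cell, mito_genes={"MT-CO1"}, ribo_genes={"RPL3", "RPS6"}))
```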

Each point on the plots is a cell. All cells from all samples are shown on the plots. The overlaid violins illustrate the distribution of cell values for the y-axis metric.

The appearance of a plot can be configured by selecting a plot and adjusting the Configure settings in the panel on the left (Figure 4). Here are some suggestions, but feel free to explore the other options available:

Open Axes and change the Y-axis scale to Logarithmic. This can be helpful to view the range of values better, although it is usually better to keep the Ribosomal counts plot in linear scale.
Open Style and reduce the Color Opacity using the slider. For data sets with very many cells, it may be helpful to decrease the dot opacity to better visualize the plot density.
Within Style switch on Summary Box & Whiskers. Inspecting the median, Q1, Q3, upper 90%, and lower 10% quantiles of the distributions can be helpful in deciding appropriate thresholds.

High-quality cells can be selected using Select & Filter, which is pre-loaded with the selection criteria, one for each quality metric (Figure 5).

Hovering the mouse over one of the selection criteria reveals a histogram showing you the frequency distribution of the respective quality metric. The minimum and maximum thresholds can be adjusted by clicking and dragging the sliders or by typing directly into the text boxes for each selection criteria (Figure 6).

Alternatively, use Pin histogram to view all of the distributions at once, which makes it easier to determine thresholds (Figure 7).

Adjusting the selection criteria will select and deselect cells in all three plots simultaneously. Depending on your settings, the deselected points will either be dimmed or gray. The filters are additive: a cell is selected only if it passes every filter, so combining multiple filters selects their intersection. The number of cells selected is shown in the figure legend of each plot (Figure 8).
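The intersection behavior can be illustrated with a short sketch. The cell records, metric names, and threshold values are hypothetical, and treating the minimum and maximum as inclusive bounds is an assumption.

```python
# Hypothetical per-cell quality metrics.
cells = [
    {"id": "c1", "counts": 1200, "features": 600, "pct_mito": 3.0},
    {"id": "c2", "counts": 150, "features": 90, "pct_mito": 2.0},
    {"id": "c3", "counts": 5000, "features": 2100, "pct_mito": 25.0},
    {"id": "c4", "counts": 800, "features": 450, "pct_mito": 8.5},
]

# One (minimum, maximum) pair per quality metric; values are illustrative only.
thresholds = {
    "counts": (500, 20000),
    "features": (200, 6000),
    "pct_mito": (0, 10),
}


def passes_all(cell, limits):
    """A cell is kept only if every metric lies within its limits."""
    return all(low <= cell[metric] <= high for metric, (low, high) in limits.items())


selected = [cell["id"] for cell in cells if passes_all(cell, thresholds)]
print(selected)
```

Here c2 fails the counts and features filters and c3 fails the mitochondrial filter, so only c1 and c4 remain selected.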

Select the input data node for the filtering task and click Select (Figure 10).

A new data node, Filtered counts, will be generated under the Analyses tab (Figure 11).

Double-click the Filtered counts data node to view the task report. The report includes a summary of the count distribution across all features for each sample; a detailed breakdown of the number of cells included in the filter for each sample; and the minimum and maximum values for each quality metric (expressed genes, total counts, etc.) across the included cells for each sample (Figure 12).

Cell barcode QA/QC

The Cell barcode QA/QC task lets you determine whether a given cell barcode is associated with a cell. This is an important QC step in all droplet-based single cell RNA-seq experiments, such as Drop-seq, where all barcodes are sequenced.

To invoke Cell barcode QA/QC:

Click a Single cell counts data node
Click the QA/QC section of the task menu
Click Cell barcode QA/QC

The task can be performed with or without the EmptyDrops method enabled.

Cell Barcode QA/QC without EmptyDrops

To perform the task without the EmptyDrops method enabled, leave the checkbox unchecked and click Finish (Figure 1).

The Cell barcode QA/QC task report is a plot (Figure 2). Barcodes are ordered on the X-axis by the number of reads such that the barcode closest to the Y-axis has the most reads and the barcode furthest from the Y-axis has the fewest reads. The Y-axis value is the number of mapped reads corresponding to each barcode. This type of plot is often referred to as a knee plot.

The knee plot is used to choose a cutoff point between barcodes that correspond to cells and barcodes that do not. Partek Flow automatically calculates an inflection point, shown by the vertical line on the graph. Barcodes designated as cells are shown in blue, while barcodes designated as background (without cells) are shown in grey.
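To make the idea of a knee plot concrete, the sketch below ranks barcodes by read count and picks the rank just before the steepest drop on a log-log scale. This is only one simple heuristic chosen for illustration; the documentation does not describe how Partek Flow calculates its inflection point, and real data would normally need smoothing first. The barcode names and read counts are made up.

```python
import math


def rank_barcodes(reads_per_barcode):
    """Order barcodes from most to fewest reads, as on the knee plot X-axis."""
    return sorted(reads_per_barcode.items(), key=lambda item: item[1], reverse=True)


def steepest_drop_rank(sorted_counts):
    """Return how many barcodes precede the steepest log-log drop (illustrative heuristic)."""
    best_rank, best_slope = 0, 0.0
    for i in range(len(sorted_counts) - 1):
        current, following = sorted_counts[i], sorted_counts[i + 1]
        if current <= 0 or following <= 0:
            break  # the logarithm is undefined for zero counts
        rise = math.log10(following) - math.log10(current)
        run = math.log10(i + 2) - math.log10(i + 1)
        slope = rise / run
        if slope < best_slope:
            best_slope, best_rank = slope, i + 1
    return best_rank


# Made-up read counts per barcode.
reads = {"bc1": 9000, "bc2": 50, "bc3": 10000, "bc4": 7000, "bc5": 8000, "bc6": 40, "bc7": 30}
ranked = rank_barcodes(reads)
n_cells = steepest_drop_rank([count for _, count in ranked])
print(ranked[:n_cells])
```

In this toy input the largest drop occurs between the fourth and fifth ranked barcodes, so the first four are treated as cells.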

The cutoff can be adjusted by dragging the vertical line across the graph or by using the text fields in the Filter panel on the left-hand side of the plot. Using the Filter panel, you can specify the number of cells or the percentage of reads in cells and the cutoff point will be adjusted to match your criteria. The number of cells and the percentage of counts in cells is adjusted as the cutoff point is changed. To return to the automatically calculated cutoff, click Reset sample filter.

The percentage of counts in cells and median counts per cell are useful technical quality metrics that can be consulted when optimizing sample handling, cell isolation techniques, and library preparation.
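The relationship between the cutoff, the number of cells, the percentage of counts in cells, and the median counts per cell can be sketched as follows. The function names and the example counts are hypothetical, and the code illustrates the arithmetic only, not Partek Flow's implementation.

```python
import statistics


def cutoff_summary(sorted_counts, n_cells):
    """Percentage of counts in cells and median counts per cell for a given cutoff."""
    in_cells = sorted_counts[:n_cells]
    total = sum(sorted_counts)
    pct_in_cells = 100.0 * sum(in_cells) / total if total else 0.0
    median_per_cell = statistics.median(in_cells) if in_cells else 0
    return pct_in_cells, median_per_cell


def cells_for_percentage(sorted_counts, target_pct):
    """Smallest number of top-ranked barcodes holding at least target_pct of counts."""
    total = sum(sorted_counts)
    if total == 0:
        return 0
    running = 0
    for n, count in enumerate(sorted_counts, start=1):
        running += count
        if 100.0 * running / total >= target_pct:
            return n
    return len(sorted_counts)


# Made-up counts per barcode, already sorted from highest to lowest.
counts = [400, 300, 200, 50, 30, 20]
pct, median = cutoff_summary(counts, n_cells=3)
needed = cells_for_percentage(counts, target_pct=95)
print(pct, median, needed)
```

With a cutoff of three cells, 90% of counts fall in cells and the median is 300 counts per cell; reaching 95% of counts requires four cells.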

One knee plot is generated for each sample. In projects with multiple samples, Next and Back buttons will appear at the top to enable navigation between sample knee plots. Manual filters must be set separately for each sample. Manual filtering is typically used when the user expects a certain number of cells to be processed, as in experiments where droplets were loaded with a predefined number of cells.

To view a summary of the currently selected filter settings for all samples, click Summary table. This opens a table showing key metrics for each sample in the project (Figure 3).

To return to the knee plot view, click Back to filter. To apply the filter and run the Filter barcodes task, click Apply filter. A Filtered counts data node will be generated.

Cell Barcode QA/QC with EmptyDrops

The EmptyDrops method (1) uses a statistical test to distinguish barcodes that correspond to real cells from those that correspond to empty droplets. An ambient RNA expression profile is estimated from barcodes below a specified total UMI count threshold, using the Good-Turing algorithm. The expression profile of each barcode above the low-count threshold is then tested for deviations from the ambient profile. Real cells are expected to have a low p-value, indicating a significant deviation from the expected background noise level. False discovery rate (FDR) correction is applied to all the p-values, and barcodes with a corrected value equal to or below the specified FDR level are detected as real cells. This can allow for the detection of additional cells that would otherwise be discarded due to a low total UMI count.
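The FDR correction step can be illustrated with the Benjamini-Hochberg procedure, sketched below. That this particular procedure is the one applied here is an assumption, and the p-values and the 0.01 threshold are made up for the example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up p-values for four barcodes and a hypothetical FDR threshold.
pvalues = [0.001, 0.02, 0.03, 0.5]
fdr_threshold = 0.01
adjusted = benjamini_hochberg(pvalues)
is_cell = [q <= fdr_threshold for q in adjusted]
print(is_cell)
```

Only the first barcode has an adjusted value at or below the threshold, so it is the only one called a real cell in this toy example.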

This method requires empty barcodes to be present in the single cell count matrix, in order to estimate the ambient RNA profile. If your data has already been filtered to remove barcodes with low total counts, this method will not be suitable. For example, if you are working with 10X Genomics data, the EmptyDrops method can only be run on the raw counts, not the filtered counts.

In addition, a knee point threshold will be calculated to identify cells with a very high total UMI count. It's possible that some barcodes with a high total UMI count will not pass the EmptyDrops significance test. This could be due to biases in the ambient RNA profile, leading to a non-significant difference between a barcode's expression profile vs the ambient profile. To protect against this issue, it is advisable to use the EmptyDrops results in conjunction with the knee point filter, on the assumption that barcodes with a very high total UMI count will always correspond to real cells. Note, the knee point will be more conservative than the inflection point calculated by Partek Flow when the EmptyDrops method is not enabled.

To perform the task with the EmptyDrops method, check the checkbox, configure the additional options, and click Finish (Figure 4).

Ambient count threshold

Barcodes with a total UMI count equal to or below this threshold will be used to create the ambient RNA expression profile to estimate background noise. The default is set to 100, which is reasonable for most data.

FDR threshold

Barcodes equal to or below this FDR threshold show a significant deviation from the ambient profile and can therefore be considered real cells. Increasing this value will result in more cells, but will also increase the number of potential false positives.

Random generator seed

This is used for performing Monte Carlo simulations to determine p-values. To reproduce results, use the same random seed for all runs.
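The effect of the seed can be shown with a generic Monte Carlo p-value, sketched below. The null simulation used here (a sum of uniform draws) is a hypothetical stand-in and is not the test statistic used by EmptyDrops; the point is only that a fixed seed yields an identical p-value on every run.

```python
import random


def monte_carlo_pvalue(observed, simulate, n_iter, seed):
    """Estimate a p-value as the share of simulated statistics at least as extreme."""
    rng = random.Random(seed)
    exceed = sum(1 for _ in range(n_iter) if simulate(rng) >= observed)
    # Adding 1 to both terms keeps the estimate above zero.
    return (exceed + 1) / (n_iter + 1)


def simulate_null(rng):
    """Hypothetical null statistic: the sum of ten uniform random draws."""
    return sum(rng.random() for _ in range(10))


first_run = monte_carlo_pvalue(7.0, simulate_null, n_iter=1000, seed=42)
second_run = monte_carlo_pvalue(7.0, simulate_null, n_iter=1000, seed=42)
print(first_run == second_run)
```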

The task report will appear similar to Figure 2, with additional metrics on the left (Figure 5).

The numbers of cells detected by the EmptyDrops test and by the knee point filter are shown above the Venn diagram on the left. In Figure 5, 3,189 barcodes are above the knee point filter (represented by the vertical blue line on the plot) and 2,657 barcodes passed the significance test in EmptyDrops. The overlap between these sets of barcodes is represented by the Venn diagram. In Figure 5, 1,583 barcodes pass the significance test in EmptyDrops and have a high total UMI count above the knee point filter; 1,606 barcodes have a very high total UMI count with no significant difference from the ambient profile in EmptyDrops; 1,074 barcodes fall below the knee point but are still significantly different from the ambient profile.

The number of cells included by the knee point filter can be adjusted either by clicking on the plot to change the position of the vertical blue line or by typing a different number of cells into the text box on the left.

The total number of cells is shown in the text box on the left. By default, this will be all of the cells detected by the knee point filter plus the extra cells detected by EmptyDrops. In Figure 5, this means the 3,189 cells with a high total UMI count plus the additional 1,074 cells from EmptyDrops (total = 4,263).

Different sections of the Venn diagram can be selected/deselected to include/exclude barcodes. For example, in Figure 5, clicking the '1,606' section of the Venn diagram will deselect those barcodes. Now, the only cells that will pass the filter will be the significant ones from EmptyDrops (Figure 6).
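The Venn diagram arithmetic from Figure 5 can be reproduced with set operations. In the sketch below, integer IDs stand in for barcodes and are constructed only so that the set sizes match the figures quoted in the text.

```python
# Placeholder barcode IDs sized to match the counts quoted for Figure 5.
knee_cells = set(range(3189))                     # above the knee point filter
emptydrops_cells = set(range(1606, 1606 + 2657))  # significant in EmptyDrops

both = knee_cells & emptydrops_cells
knee_only = knee_cells - emptydrops_cells
emptydrops_only = emptydrops_cells - knee_cells

# Default selection: everything from the knee point filter plus the extra EmptyDrops cells.
default_selection = knee_cells | emptydrops_cells

# Deselecting the knee-only section leaves only the EmptyDrops-significant barcodes.
after_deselect = default_selection - knee_only

print(len(both), len(knee_only), len(emptydrops_only))
print(len(default_selection), len(after_deselect))
```

The sizes are 1,583, 1,606, and 1,074 for the three sections, 4,263 for the default selection, and 2,657 after the '1,606' section is deselected.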

References

1. Lun, A., Riesenfeld, S., Andrews, T. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol. 2019; 20: 63.
