Partek Flow contains a number of quality control tools and reports that can be used to evaluate the current status of your analysis and decide downstream steps. Quality control tools are organized under the Quality Assurance / Quality Control (QA/QC) section of the context-sensitive menu and are available for unaligned and aligned reads data nodes.
This section will illustrate:
In addition to the tools listed above, many other outputs can be interpreted in the context of quality control, for instance, principal component analysis, hierarchical clustering (at the sample level), the variant detection report, and the quantification report.
Left-clicking any task (the rectangles) in the analysis pipeline will cause a Task Actions section to appear in the pop-up menu. This allows users to:
Rerun task: reruns the selected task. The task dialog will open, allowing users to change the task's parameters. Downstream analyses of the selected task will not be rerun.
Rerun with downstream tasks: reruns the selected task. The task dialog will open, allowing users to change the parameters of the current task; the downstream analyses will then be rerun with the same configuration as before.
Edit description: the description of the task can be replaced by manually typing in a new one.
Change color: choose a color to apply only to the selected task by clicking Apply. Click Apply to downstream to change the selected task and its downstream pipeline to the newly selected color.
Delete task: this option is only available if the user is the owner of the project or the owner of the task. When a task is deleted, all downstream tasks, including tasks from other users, will be deleted. Users may check the box to choose to delete the task's output files. If delete output files is not checked, the task will be removed from the pipeline, but the output files of the task will remain on the disk.
Restart task: this option is only available on failed tasks and requires an admin role to perform. When an admin restarts a task, it will not take up a concurrent seat, and the disk space consumed by the output files will count towards the storage quota of the task's original owner.
This section has tools that are useful in managing and understanding single cell data, especially for downstream analysis. To invoke Annotation/Metadata tools, click on any Single cell counts data node. These include the following tasks:
This section contains tools that are useful in preparing single cell data for downstream analysis, such as multi-sample comparison or multi-omics analysis. To invoke Pre-analysis tools, click on any Single cell counts data node. These include the following tasks:
The Trim tags task allows you to process unaligned read data with adaptors, barcodes, and UMIs using a Prep kit file that specifies the configuration of these elements in your NGS reads.
Click an Unaligned reads data node
Click the Pre-alignment QA/QC section of the toolbox
Click Trim tags
There are three parameters to configure - Prep kit, Keep untrimmed, and Map feature barcodes.
Selecting Keep untrimmed will generate a separate unaligned reads data node with any reads that do not match the structure specified by the prep kit. This option is off by default, to save on disk space. Selecting Map feature barcodes is only necessary for processing protein data from 10x Genomics' Feature Barcoding assay (v3+ chemistry). For single cell gene expression data, leave this option unchecked.
Partek distributes prep kits for processing several types of data:
10x Chromium Single Cell 3' v2
10x Chromium Single Cell 3' v3
10x Chromium Single Cell 5'
Drop-seq
Lexogen QuantSeq FWD-UMI
Bio-Rad SureCell WTA 3'
Fluidigm C1 mRNA Seq HT IFC
Rubicon Genomics ThruPLEX Tag-seq
1CellBio inDrop
If your data is from one of these sources, you can select the appropriate option in the Prep kit drop-down menu. If the data is from another source, you can build a custom prep kit file to process your data.
Choose a Prep kit from the drop-down menu
Click Finish to run Trim tags (Figure 1)
The output of Trim tags is a Trimmed reads data node. An additional Untrimmed reads data node will be generated if the Keep untrimmed option was selected.
The task report provides a table with the total reads, reads retained, % reads retained, reads removed, and % reads removed for each sample (Figure 2). You can click Download at the bottom of the table to save a text file copy to your computer.
Select Other / Custom from the Prep kit name drop-down menu
Give the new prep kit a name
Choose Build prep kit
You can select Import prep kit if you have a Prep kit .zip file downloaded from Partek Flow.
Click Create (Figure 3)
The Prep kit builder interface will load (Figure 4).
There are three sections:
Is paired end - select to switch from single end to paired end FASTQ files (Figure 5). If you choose paired end, the First mate will correspond to the _R1 FASTQ file and the Second mate will correspond to the _R2 FASTQ file.
Figure 5. Paired end prep kits have first and second mate segmentation sections
Segmentation - this is where you will describe the structure of your reads
Segments include adaptors, barcodes, UMIs, and the insert (i.e., the target sequence of the assay)
For adaptors, you have the option of choosing a file with your adaptor sequences or entering the adaptor sequences manually.
To use a file, choose File for Sequences and then click Choose File (Figure 6). Use the file browser to choose a FASTA file from your local computer.
You can specify the mismatch allowance using the Mismatches option.
After you have specified the file or manually entered the sequences, click Add to add the adaptor sequence(s).
Unique Molecular Identifiers (UMIs) are randomly generated sequences that uniquely identify an original starting molecule after PCR amplification.
Including a UMI in your prep kit will allow you to access a downstream task that uses UMI information for removing PCR duplicates. For more information about the Deduplicate UMIs task, please see our UMI Deduplication in Partek Flow white paper. Note that while the UMI sequence will be trimmed, a record of the UMI sequence for each read is retained for use by this downstream task.
When adding a UMI segment to your prep kit, you can specify the length of your UMIs (Figure 8).
Adding a barcode segment to a prep kit allows you to access downstream tasks that use barcode information, including Filter barcodes and Quantify barcodes to annotation model (Partek E/M). While the barcode sequence will be trimmed, a record of the barcode sequence for each read is retained for use by downstream tasks.
Like adaptors, barcodes can be specified using a file or manually specified, but you can also choose to designate any segment of arbitrary length in the sequence as the barcode. This is useful if you do not have a specific set of known barcodes.
To set the barcode to an arbitrary segment of fixed length, choose Arbitrary and specify the barcode length (Figure 9).
Remember to click Add to add the new segment to your prep kit.
The insert is the sequence retained after trimming in the Trimmed reads data node. For example, in RNA-Seq, this would be the mRNA sequence. Every prep kit must include an insert segment. You can specify the minimum size of the insert section using the Length field (Figure 10). Reads shorter than the minimum length will be discarded.
Remember to click Add to add the new segment to your prep kit.
Segments are placed from 5' to 3' in the read in the order they are added. You should add the 5' segment first and add additional elements in order of their position in the read. Segments will appear in the Segmentation sections as they are added. You can mouse over a segment to view its details (Figure 11).
For example, the expected read structure (Figure 12) and a completed prep kit for a standard Drop-seq library prep are shown below (Figure 13).
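To make the segmentation concrete, below is a minimal sketch of how the published Drop-seq read 1 layout (a 12 bp cell barcode followed by an 8 bp UMI; the cDNA insert is on read 2) maps onto a read. The function name and example sequence are ours, for illustration only; the segment lengths must match your own prep kit definition.

```python
# Minimal sketch of Drop-seq style segmentation of read 1 (5' to 3'):
# bases 1-12 are the cell barcode, bases 13-20 are the UMI.
def segment_dropseq_r1(read_seq: str):
    """Split a Drop-seq read 1 sequence into prep kit segments."""
    barcode = read_seq[:12]   # cell barcode segment
    umi = read_seq[12:20]     # UMI segment
    return barcode, umi

barcode, umi = segment_dropseq_r1("AACGTGATGCCTTTGCAGTACCGA")
print(barcode, umi)  # AACGTGATGCCT TTGCAGTA
```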
Remove poly-A tail - choose this option to trim poly-A tails from the end of the read containing your insert sequence
Click Next to complete your prep kit
You can manage saved prep kits by going to Home > Settings > Library file management and opening the Prep kit files tab (Figure 14).
Prep kits download as a .zip file. This Prep kit .zip file can be imported into Partek Flow by selecting Import from a file when adding a new prep kit. Select the .zip file when importing, do not unzip the file.
The Task Menu lists all the tasks that can be performed on a specific node. It can be invoked from either a Data or Task node and appears on the right hand side of the Analyses tab. It is context-sensitive, meaning that it will only present tasks that the user can perform on the selected node. For example, selecting an Aligned reads data node will not present aligners as options.
Clicking a Data node presents a variety of tasks:
Clicking a Task node gives you the option to view the Task results or perform Task actions such as rerunning the task (Figure 1).
The Data summary report in Partek Flow provides an overview of all tasks performed as part of a pipeline. This is particularly useful for report writing, record keeping and revisiting projects after a long period of time.
This user guide will cover the following topics:
Click on an output data node under the Analyses tab of a project and choose Data summary report from the context sensitive menu on the right (Figure 1). The report will include details from all of the tasks upstream of the selected node. If tasks have been performed downstream of the selected data node, they will not be included in the report.
The Data summary report can be saved in different formats via the web browser. The instructions below are for Google Chrome. If you are using a different browser, consult your browser's help for equivalent instructions.
On the Data summary report, expand all sections and show all task details. Right-click anywhere on the page and choose Print... from the menu (Figure 4) or use Ctrl+P (Command+P on Mac). In the print dialog, click Change… (Figure 5) and set the destination to Save as PDF. Select the Background graphics checkbox (optional), click the blue Save button (Figure 5) and choose a file location on your local machine.
The PDF can be attached to an email and/or opened in a PDF viewer of your choice.
On the Data summary report, right-click anywhere on the page and choose Save as… from the menu (Figure 6) or use Ctrl+S (Command+S on Mac). Choose a file location on your local machine and set the file type to Web Page, Complete.
The HTML file can be opened in a browser of your choice.
The short video clip below (with audio) demonstrates how to view the Data summary report.
The ERCC (External RNA Control Consortium) developed a set of RNA standards for quality control in microarray, qPCR, and sequencing applications. These RNA standards are spiked-in RNA with known concentrations and composition (i.e. sequence length and GC content). They can be used to evaluate sensitivity and accuracy of RNA-seq data.
The ERCC analysis is performed on unaligned data, if the ERCC RNA standards have been added to the samples. There are 92 ERCC spike-in sequences with different concentrations and different compositions. The raw data are aligned (with Bowtie) to the known ERCC sequences to get the count of each ERCC sequence. This information is available within Partek Flow and is used to plot the correlation between the observed counts and the expected concentrations. If there is a high correlation between the observed counts and the expected concentrations, you can be confident that the quantified RNA-seq data are reliable. Partek Flow supports the Mix 1 and Mix 2 ERCC formulations. Both formulations use the same ERCC sequences, but each sequence is present at a different expected concentration. If both Mix 1 and Mix 2 formulations have been used, an ExFold comparison can be performed to compare the observed and expected Mix 1:Mix 2 ratio for each spike-in.
To start ERCC assessment, select an unaligned reads node and choose ERCC in the context sensitive menu. If all samples in the project have used the Mix 1 or Mix 2 formulation, choose the appropriate radio button at the top (Figure 1).
You can change the Bowtie parameters by clicking Configure before the alignment (Figure 1), although the default parameters work fine for most data. Once the task has been set up correctly, select Finish.
The ERCC task report starts with a table (Figure 3), which summarizes the results at the project level. The table shows which samples use the Mix 1 or Mix 2 formulation. The total number of alignments to the ERCC controls is also shown, further divided into the total number of alignments to the forward strand and to the reverse strand. The summary table also gives the percentage of ERCC controls that contain alignment counts (i.e. are present). Generally, the fraction of present controls should be as high as possible; however, certain ERCC controls may not contain alignment counts due to their low concentration, and that information is useful for evaluating the sensitivity of the RNA-seq experiment. The coefficient of determination (R squared) of the present ERCC controls is listed in the next column. As a rule of thumb, you should expect a good correlation between the observed alignment counts and the actual concentrations, or else the RNA-seq quantification results may not be accurate. Finally, the last two columns give estimates of bias with regard to sequence length and GC content, by giving the correlation of the alignment counts with the sequence length and the GC content, respectively.
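As a rough illustration of the R squared statistic described above, the sketch below correlates log2 observed counts with log2 expected concentrations for a handful of made-up spike-ins; it is not Partek Flow's implementation, and the numbers are purely hypothetical.

```python
# Correlate observed ERCC alignment counts with expected concentrations,
# both in log2 space, and report the coefficient of determination.
import numpy as np

expected_conc = np.array([15.0, 30.0, 7.5, 120.0, 0.9])  # hypothetical concentrations
observed_counts = np.array([410, 850, 180, 3300, 20])    # hypothetical counts

r = np.corrcoef(np.log2(expected_conc), np.log2(observed_counts))[0, 1]
print(f"R squared = {r ** 2:.3f}")
```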
If ExFold comparison was enabled, an extra table will be produced in the ERCC task report (Figure 4). Each row in the table is a pairwise comparison. This table lists the percentage of ERCC controls present in the Mix 1 and Mix 2 samples and the R squared for the observed vs expected Mix1:Mix2 ratios.
The ERCC spike-ins plot (Figure 5) shows the regression lines between the actual spike-in concentration (x-axis, given in log2 space) and the observed alignment counts (y-axis, given in log2 space), for all the samples in the project. The samples are depicted as lines, and the probes with the highest and lowest concentration are highlighted as dots. The regression line for a particular sample can be turned off by simply clicking on the sample name in the legend beneath the plot.
Optionally, you can invoke a principal components analysis plot (View PCA), which is based on RPKM-normalised counts, using the ERCC sequences as the annotation file (not shown).
For more details, go to the sample-level report (Figure 6) by selecting a sample name on the summary table. First, you will get a comprehensive scatter plot of observed alignment counts (y-axis, in log2 space) vs. the actual spike-in concentration (x-axis, in log2 space). Each dot on the plot represents an ERCC sequence, coloured based on GC content and sized by sequence length (plot controls are on the right).
The table (Figure 7) lists individual controls, with their actual concentration, alignment counts, sequence length, and % GC content. The table can be downloaded to the local computer by selecting the Download link.
For more details on ExFold comparisons, select a comparison name in the ExFold summary table (Figure 8). First, you will get a comprehensive scatter plot of observed Mix1:Mix2 ratios (y-axis, in log2 space) vs. the expected Mix1:Mix2 ratio (x-axis, in log2 space). Each dot on the plot represents an ERCC sequence, coloured based on GC content and sized by sequence length (plot controls are on the right).
The table (Figure 9) lists individual controls, with each samples' alignment counts, together with the observed and expected Mix1:Mix2 ratios. The table can be downloaded to the local computer by selecting the Download link.
Post-alignment QA/QC is available for data nodes containing aligned reads (Aligned reads) and has no special control dialog. Similar to the pre-alignment QA/QC report, the post-alignment report contains two tiers, i.e. a project-level report and sample-level reports.
The project-level report starts with a summary table (Figure 1). Unlike the pre-alignment QA/QC report, each row now corresponds to a sample (sample names are hyperlinks to the sample-level reports). The table allows for a quick comparison across all the samples within the project, so any outlying sample can easily be spotted.
Note that the summary table reflects the underlying chemistry. While Figure 1 shows a summary table for single-end sequencing, an example table for paired-end sequencing is given in Figure 2. Common features are discussed first.
The first two columns contain the total number of reads (Total reads) and the total number of alignments (Total alignments). Theoretically, for single-end chemistry, the total number of reads equals the total number of alignments. For paired-end reads, the theoretical expectation is twice as many alignments as reads (the term “read” refers to the fragment being sequenced, and since each fragment is sequenced from two directions, one can expect two alignments per fragment). When counting the actual number of alignments (Total alignments), however, reads that align more than once (multimappers) are also taken into account. Next, the Aligned column contains the fraction of all the reads that were aligned to the reference assembly.
The Coverage column shows the fraction (%) of the reference assembly that was sequenced, while the average sequencing depth (×) of the covered regions is in the Avg. coverage depth column. Avg. quality is the mapping quality, as reported by the aligner (not all aligners support this metric). Avg. length is the average read length. Finally, %GC is the fraction of G or C calls.
In addition, the Post-alignment QA/QC report for single-end reads (Figure 1) contains the Unique column. This refers to the fraction of uniquely aligned reads.
On the other hand, the Post-alignment QA/QC report for paired-end reads (Figure 2) contains these columns:
Unique singleton: fraction of alignments corresponding to reads where only one of the paired reads can be uniquely aligned
Unique paired: fraction of alignments corresponding to reads where both of the paired reads can be uniquely aligned
Non-unique singleton: fraction of singletons that align to multiple locations
Non-unique paired: fraction of paired reads that align to multiple locations
Note: for paired-end reads, if one mate is aligned and the other is not, the read is counted in the numerator when calculating the alignment rate. Because its mate is not aligned, the read is also included in the unaligned reads data node (if the generate unaligned reads data node option is selected) for second-stage alignment. As a result, the total number of reads will not equal "unaligned reads + total reads × alignment rate", because reads with only one aligned mate are counted twice.
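The arithmetic below illustrates the double counting described in this note, using made-up numbers.

```python
# Hypothetical paired-end example: reads with exactly one aligned mate are
# counted in the alignment rate AND placed in the unaligned reads data node.
total_reads = 1000
both_mates_aligned = 900
one_mate_aligned = 60
neither_mate_aligned = 40

aligned = both_mates_aligned + one_mate_aligned           # alignment rate: 96%
unaligned_node = neither_mate_aligned + one_mate_aligned  # sent to 2nd-stage alignment

# 960 + 100 = 1060, not 1000: the 60 single-mate reads are counted twice.
print(aligned + unaligned_node - total_reads)  # 60
```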
In addition to the summary table, several graphs are plotted to give a comparison across multiple samples in the project. Those graphs are Alignment breakdown, Coverage, Genomic Coverage, Average base quality per position, Average base quality score per read, and Average alignments per read. Two of those (Average base quality plots) have already been described.
The alignment breakdown chart (Figure 3) presents each sample as a column, and has two vertical axes (i.e. Alignment percent and Total reads). The percentage of reads with different alignment outcomes (Unique paired, Unique singleton, Non-unique, Unaligned) is represented by the left-side y-axis and visualized by stacked columns. The total number of reads in each sample is given using the black line and shown on the right-side y-axis.
The Coverage plot (Figure 4) shows the Average read depth (in covered regions) for each sample using columns and can be read off the left-hand y-axis. Similarly, the Genomic coverage plot shows genome coverage in each sample, expressed as a fraction of the genome.
The last graph is Average alignments per read (Figure 5) and shows the average number of alignments for each read, with samples as columns. For single-end data, the expected average alignments per read is one, while for paired-end data, the expected average alignments per read is two.
The Cell barcode QA/QC task lets you determine whether a given cell barcode is associated with a cell. This is an important QC step in all droplet-based single cell RNA-seq experiments, such as Drop-seq, where all barcodes are sequenced.
To invoke Cell barcode QA/QC:
Click a Single cell counts data node
Click the QA/QC section of the task menu
Click Cell barcode QA/QC
The task can be performed with or without the EmptyDrops method enabled.
To perform the task without the EmptyDrops method enabled, leave the checkbox unchecked and click Finish (Figure 1).
The Cell barcode QA/QC task report is a plot (Figure 2). Barcodes are ordered on the X-axis by the number of reads, such that the barcode closest to the Y-axis has the most reads and the barcode furthest from the Y-axis has the fewest reads. The Y-axis value is the number of mapped reads corresponding to each barcode. This type of plot is often referred to as a knee plot.
The knee plot is used to choose a cutoff point between barcodes that correspond to cells and barcodes that do not. Partek Flow automatically calculates an inflection point, shown by the vertical line on the graph. Barcodes designated as cells are shown in blue while barcodes designated as without cells (background) are shown in grey.
The cutoff can be adjusted by dragging the vertical line across the graph or by using the text fields in the Filter panel on the left-hand side of the plot. Using the Filter panel, you can specify the number of cells or the percentage of reads in cells and the cutoff point will be adjusted to match your criteria. The number of cells and the percentage of counts in cells is adjusted as the cutoff point is changed. To return to the automatically calculated cutoff, click Reset sample filter.
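For intuition, the sketch below shows one simple heuristic for placing a cutoff on a knee plot: rank barcodes by read count and find the largest drop between consecutive barcodes in log space. This is only an illustration of the idea; it is not necessarily the inflection point calculation Partek Flow uses.

```python
# Toy knee-plot cutoff: the biggest log-space drop between consecutive
# barcodes (sorted by read count) separates cells from background.
import numpy as np

counts = np.array([9800, 9500, 9100, 8700, 150, 90, 40, 12, 5, 1])  # sorted descending
drops = np.diff(np.log10(counts))            # negative values; most negative = biggest drop
cutoff = int(np.argmin(drops)) + 1           # barcodes before the drop are called cells
print(f"{cutoff} barcodes called as cells")  # 4
```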
The percentage of counts in cells and median counts per cell are useful technical quality metrics that can be consulted when optimizing sample handling, cell isolation techniques, and library preparation.
One knee plot is generated for each sample. In projects with multiple samples, Next and Back buttons will appear at the top to enable navigation between sample knee plots. Manual filters must be set separately for each sample. This is typically used when the user expects a certain number of cells to be processed, like in experiments where droplets were loaded with a predefined number of cells.
To view a summary of the currently selected filter settings for all samples, click Summary table. This opens a table showing key metrics for each sample in the project (Figure 3).
To return to the knee plot view, click Back to filter. To apply the filter and run the Filter barcodes task, click Apply filter. A Filtered counts data node will be generated.
The EmptyDrops method (1) uses a statistical test to distinguish barcodes that correspond to real cells from those that correspond to empty droplets. An ambient RNA expression profile is estimated from barcodes below a specified total UMI count threshold, using the Good-Turing algorithm. The expression profile of each barcode above the low-count threshold is then tested for deviations from the ambient profile. Real cells are expected to have a low p-value, indicating a significant deviation from the expected background noise level. False discovery rate (FDR) correction is applied to all the p-values, and barcodes with an FDR at or below the specified level are detected as real cells. This can allow for the detection of additional cells that would otherwise be discarded due to a low total UMI count.
This method requires empty barcodes to be present in the single cell count matrix, in order to estimate the ambient RNA profile. If your data has already been filtered to remove barcodes with low total counts, this method will not be suitable. For example, if you are working with 10X Genomics data, the EmptyDrops method can only be run on the raw counts, not the filtered counts.
In addition, a knee point threshold will be calculated to identify cells with a very high total UMI count. It's possible that some barcodes with a high total UMI count will not pass the EmptyDrops significance test. This could be due to biases in the ambient RNA profile, leading to a non-significant difference between a barcode's expression profile vs the ambient profile. To protect against this issue, it is advisable to use the EmptyDrops results in conjunction with the knee point filter, on the assumption that barcodes with a very high total UMI count will always correspond to real cells. Note, the knee point will be more conservative than the inflection point calculated by Partek Flow when the EmptyDrops method is not enabled.
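The combined filter described above can be summarized as follows. This is a simplified sketch with illustrative names and thresholds, not Partek Flow internals: a barcode is kept if it passes the EmptyDrops FDR test or sits above the knee point.

```python
# A barcode is called a cell if it is significant in EmptyDrops OR its
# total UMI count exceeds the knee point. Thresholds are illustrative.
def keep_barcode(fdr, total_umis, fdr_threshold=0.01, knee_point=1500):
    significant = fdr is not None and fdr <= fdr_threshold  # EmptyDrops call
    high_count = total_umis >= knee_point                   # knee point filter
    return significant or high_count

print(keep_barcode(fdr=0.001, total_umis=400))  # True: rescued by EmptyDrops
print(keep_barcode(fdr=0.30, total_umis=5000))  # True: above the knee point
print(keep_barcode(fdr=0.30, total_umis=200))   # False: background
```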
To perform the task with the EmptyDrops method, check the checkbox, configure the additional options, and click Finish (Figure 4).
Ambient count threshold: barcodes with a total UMI count equal to or below this threshold will be used to create the ambient RNA expression profile to estimate background noise. The default is set to 100, which is reasonable for most data.
FDR threshold: barcodes at or below this FDR threshold show a significant deviation from the ambient profile and can therefore be considered real cells. Increasing this value will result in more cells, but will also increase the number of potential false positives.
Random generator seed: this is used for performing Monte Carlo simulations to determine p-values. To reproduce results, use the same random seed for all runs.
The task report will appear similar to Figure 2, with additional metrics on the left (Figure 5).
The number of actual cells detected by the EmptyDrops test and the knee point filter are shown above the Venn diagram on the left. In Figure 5, 3,189 barcodes are above the knee point filter (represented by the vertical blue line on the plot) and 2,657 barcodes passed the significance test in EmptyDrops. The overlap between these sets of barcodes is represented by the Venn diagram. In Figure 5, 1,583 barcodes pass the significance test in EmptyDrops and have a high total UMI count above the knee point filter; 1,606 barcodes have a very high total UMI count with no significant difference from the ambient profile in EmptyDrops; 1,074 barcodes fall below the knee point but are still significantly different from the ambient profile.
The number of cells included by the knee point filter can be adjusted either by clicking on the plot to change the position of the vertical blue line or by typing a different number of cells into the text box on the left.
The total number of cells is shown in the text box on the left. By default, this will be all of the cells detected by the knee point filter plus the extra cells detected by EmptyDrops. In Figure 5, this means the 3,189 cells with a high total UMI count plus the additional 1,074 cells from EmptyDrops (total = 4,263).
Different sections of the Venn diagram can be selected/deselected to include/exclude barcodes. For example, in Figure 5, clicking the '1,606' section of the Venn diagram will deselect those barcodes. Now, the only cells that will pass the filter will be the significant ones from EmptyDrops (Figure 6).
Lun, A., Riesenfeld, S., Andrews, T. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol. 2019; 20: 63.
Coverage report is also available for data nodes containing aligned reads (Aligned reads, Trimmed reads, or Filtered reads). The purpose of the report is to understand how well the genomic regions of interest are covered by sequencing reads for a particular analysis.
When setting up the task (Figure 1), you first need to specify the Genome build and then a Gene/feature annotation file, which defines the genomic regions you are interested in (e.g. exome or genes within a panel). The Gene/feature annotation can be previously associated with Partek® Flow® via Library File Management or added on the fly.
The coverage report will contain the percentage of bases within the specified genes/features with coverage greater than or equal to the coverage levels defined under Add minimum coverage levels. To add a level, click the green plus icon; to remove one, click the red cross icon.
As for the Advanced options, if Strand-specificity is turned on, only reads which match the strand of a given region will be considered for that region’s coverage statistics.
Generate target enrichment graphs will generate a graphical overview of coverage across each feature.
When Use multithreading is checked, the computation will utilize multiple CPUs. However, if the input or output data is on a file system that does not handle multi-threaded tasks well (such as GPFS), unchecking this option will prevent task failures.
The Coverage report result page contains a project-level overview and starts with a summary table, with one sample per row (Figure 2). The first few columns show the percentage of bases in the genomic features which are covered at the specified level (or higher) (default: 1×, 20×, 100×). Average coverage is defined as the sum of base calls of each base in the genomic features divided by the length of the genomic features. Similarly, Average quality is defined as the sum of average quality of those bases that cover the genomic features, divided by the length of covered genomic features. The last two columns show the number of On-target reads (overlapping the genomic features) and Off-target reads (not overlapping the features). The Optional columns link enables import of any meta-data present in the data table (Data tab).
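The sketch below works through the Average coverage and coverage-level definitions above on a toy 10 bp target region; the depths are made up.

```python
# Average coverage = sum of per-base depths over the target / target length.
# The n-fold columns report the fraction of target bases with depth >= n.
per_base_depth = [0, 3, 5, 5, 8, 2, 0, 0, 1, 4]  # depth at each base of a 10 bp target

average_coverage = sum(per_base_depth) / len(per_base_depth)
fraction_1x = sum(d >= 1 for d in per_base_depth) / len(per_base_depth)
fraction_5x = sum(d >= 5 for d in per_base_depth) / len(per_base_depth)

print(average_coverage, fraction_1x, fraction_5x)  # 2.8 0.7 0.3
```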
Quantification of on- and off-target reads is also displayed in the column chart below the table (Figure 3), showing each sample as a separate column and fraction of on-/off-target reads on the y-axis.
Region coverage summary hyperlink opens a new page, with a table showing average coverage for each region (rows), across the samples (columns) (Figure 4).
The Coverage summary plot (Figure 6) is an overview of coverage across the targeted genomic features for all the samples in the project. Each line within the plot is a single sample; the horizontal axis is the normalized position within the genomic feature, represented as the 1st to 100th percentile of the length of the feature, while the vertical axis shows the average coverage (across all the features for a given sample).
If you need more details about a sample, click on the sample name in the Coverage report table (Figure 7). The columns are as follows:
Region name: the genomic feature identifier (as specified in the annotation file)
Chromosome: the chromosome of the genomic feature (or region)
Start: the start position of the genomic feature (1-based)
Stop: the stop position of the genomic feature (2-based, which means the stop position is exclusive)
Strand: the strand of the genomic feature
Total exon length: the length of the genomic feature
Reads: the total number of reads aligning to the genomic feature
% GC: the percentage of GC contents of those reads aligning to the genomic feature
% N: the percentage of ambiguous bases (N) of those reads aligning to the genomic feature
(n)x: the proportion of the genomic feature which is covered by at least n number of alignments. [Note: n is the coverage level that you specified when submitting Coverage report task, defaults are 1×, 20×, 100×]
Average coverage: the average sequencing depth across all bases in the genomic feature
Average quality: the average quality score across covered bases in the genomic feature
Validate variants is available for data nodes containing variants (Variants, Filtered variants, or Annotated Variants). The purpose of this task is to understand the performance of the variant calling pipeline by comparing variant calls from a sample within the project to known “gold standard” variant data that already exist for that sample. This “gold standard” data can encompass variants identified with high confidence by other experimental or computational approaches.
Setting up the task (Figure 1) involves identifying the Genome Build used for variant detection and the Sample to validate within the project. Target specific regions allows for specification of the Target regions for this study, relating to the regions sequenced for all samples in the project. Benchmark target regions represent the regions that have been previously interrogated to identify “gold standard” variant calls in the sample of interest. These parameters are important to ensure that only overlapping regions are compared, avoiding the identification of false positive or false negative variants in regions covered by only the project sample or the “gold standard” sample. Both sections utilize a Gene/feature annotation file, which can be previously associated with Partek Flow via Library File Management or added on the fly. The Validated variants file is a single-sample vcf file containing the “gold standard” variant calls for the sample of interest and can be previously associated with Partek Flow as a Variant Annotation Database via Library File Management or added on the fly.
The Validate variants results page contains statistics related to the comparison of variants in the project sample compared to the validated variant calls for the sample (Figure 2). The results are split into two sections, one based on metrics calculated from the comparison of SNVs and the other from the comparison of INDELs.
The following SNP-level metrics are contained within the report, comparing the sample in the project to the validated variant data (a sketch of the standard formulas behind the summary statistics follows the list):
No genotypes: the number of missing genotypes from the sample in the Flow project
Same as reference: the number of homozygous reference genotypes from the sample in the Flow project
True positives: the number of variant genotypes from the sample in the Flow project that match the validated variants file
False positives: the number of variant genotypes from the sample in the Flow project that are not found in the validated variants file
True negatives: the number of loci that do not have variant genotypes in the sample in the Flow project and the validated variants file
False negatives: the number of genotypes that do not have variant genotypes in the sample in the Flow project but do have variant genotypes in the validated variants file
Sensitivity: the proportion of variant genotypes in the validated variants file that are correctly identified in the sample in the Flow project (true positive rate)
Specificity: the proportion of non-variant loci in the validated variants file that are non-variant in the sample in the Flow project (true negative rate)
Precision: the number of true positive calls divided by the number of all variant genotypes called in the sample in the Flow project (positive predictive value)
F-measure: a measure of the accuracy of the calling in the Flow pipeline relative to the validated variants. It considers both the precision and the recall of the test to compute the score. The best value at 1 (perfect precision and recall) and worst at 0.
Matthews correlation: a measure of the quality of classification, taking into account true and false positives and negatives. The Matthews correlation is a correlation coefficient between the observed and predicted classifications, ranging from −1 to +1. A coefficient of +1 represents a perfect prediction, 0 is no better than random prediction, and −1 indicates a completely wrong prediction.
Transitions: variant allele interchanges of purines or pyrimidines in the sample in the Flow project relative to the reference
Transversions: variant allele interchanges of purines to/from pyrimidines in the sample in the Flow project relative to the reference
Ti/Tv ratio: ratio of transition to transversions in the sample in the Flow project
Heterozygous/Homozygous ratio: the ratio of heterozygous to homozygous genotypes in the sample in the Flow project
Percentage of sites with depth < 5: the percentage of variant genotypes in the sample in the Flow project that have fewer than 5 supporting reads
Depth, 5th percentile: the 5th percentile of sequencing depth across all variant genotypes in the sample in the Flow project
Depth, 50th percentile: the 50th percentile (median) of sequencing depth across all variant genotypes in the sample in the Flow project
Depth, 95th percentile: the 95th percentile of sequencing depth across all variant genotypes in the sample in the Flow project
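The sketch below shows the standard confusion-matrix formulas behind these summary statistics, applied to hypothetical counts.

```python
# Sensitivity, specificity, precision, F-measure, and Matthews correlation
# from true/false positive and negative counts (hypothetical numbers).
import math

tp, fp, tn, fn = 9500, 120, 88000, 300

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
precision = tp / (tp + fp)     # positive predictive value
f_measure = 2 * precision * sensitivity / (precision + sensitivity)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"{sensitivity:.4f} {specificity:.4f} {precision:.4f} "
      f"{f_measure:.4f} {mcc:.4f}")
```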
The INDEL-level metric columns contained within the report are identical, except that transition and transversion information is not included.
The Single-cell QA/QC task in Partek Flow enables you to visualize several useful metrics that will help you include only high-quality cells. To invoke the Single-cell QA/QC task:
Click a Single cell counts data node
Click the QA/QC section of the task menu
Click Single cell QA/QC
By default, all samples are used to perform QA/QC. You can choose to split the sample and perform QA/QC separately for each sample.
If your Single cell counts data node has been annotated with a gene/transcript annotation, the task will run without a task configuration dialog. However, if you imported a single cell counts matrix without specifying a gene/transcript annotation file, you will be prompted to choose the genome assembly and annotation file by the Single cell QA/QC configuration dialog (Figure 1). Note, it is still possible to run the task without specifying an annotation file. If you choose not to specify an annotation file, the detection of mitochondrial counts will not be possible.
The Single cell QA/QC task report opens in a new data viewer session. Four dot and violin plots showing the value of every cell in the project are displayed on the canvas: counts per cell, detected features per cell, the percentage of mitochondrial counts per cell, and the percentage of ribosomal counts per cell (Figure 2).
If your cells do not express any mitochondrial genes or an appropriate annotation file was not specified, the plot for the percentage of mitochondrial counts per cell will be non-informative (Figure 3).
Mitochondrial genes are defined as genes located on a mitochondrial chromosome in the gene annotation file. The mitochondrial chromosome is identified in the gene annotation file by having "M" or "MT" in its chromosome name. If the gene annotation file does not follow this naming convention for the mitochondrial chromosome, Partek Flow will not be able to identify any mitochondrial genes. If your single cell RNA-Seq data was processed in another program and the count matrix was imported into Partek Flow, be sure that the annotation field that matches your feature IDs was chosen during import; Partek Flow will be unable to identify any mitochondrial genes if the gene symbols in the imported single cell data and the chosen gene/feature annotation do not match.
Ribosomal genes are defined as genes that code for proteins in the large and small ribosomal subunits. Ribosomal genes are identified by searching their gene symbols against a list of 89 L & S ribosomal genes taken from HGNC. The search is case-insensitive and includes all known gene name aliases from HGNC. Identification of ribosomal genes is performed independently of the gene annotation file specified.
Total counts are calculated as the sum of the counts for all features in each cell from the input data node. The number of detected features is calculated as the number of features in each cell with greater than zero counts. The percentage of mitochondrial counts is calculated as the sum of counts for known mitochondrial genes divided by the sum of counts for all features and multiplied by 100. The percentage of ribosomal counts are calculated as the sum of counts for known ribosomal genes divided by the sum of counts for all features and multiplied by 100.
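The sketch below applies these definitions to a toy cells-by-genes count matrix. The gene symbols, prefix-based gene lists, and counts are illustrative only; as described above, Partek Flow derives the mitochondrial list from the annotation file's chromosome names and the ribosomal list from HGNC.

```python
# Per-cell QC metrics from a cells x genes count matrix (toy data).
import numpy as np

genes = ["MT-CO1", "MT-ND1", "RPL13", "ACTB", "GAPDH"]
counts = np.array([[30, 12, 55, 210, 180],   # cell 1
                   [400, 250, 5, 20, 15]])   # cell 2: high mitochondrial %

mito = np.array([g.startswith("MT-") for g in genes])
ribo = np.array([g.startswith(("RPL", "RPS")) for g in genes])

total_counts = counts.sum(axis=1)     # counts per cell
detected = (counts > 0).sum(axis=1)   # detected features per cell
pct_mito = 100 * counts[:, mito].sum(axis=1) / total_counts
pct_ribo = 100 * counts[:, ribo].sum(axis=1) / total_counts

print(total_counts, detected, pct_mito.round(1), pct_ribo.round(1))
```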
Each point on the plots is a cell. All cells from all samples are shown on the plots. The overlaid violins illustrate the distribution of cell values for the y-axis metric.
The appearance of a plot can be configured by selecting a plot and adjusting the Configure settings in the panel on the left (Figure 4). Here are some suggestions, but feel free to explore the other options available:
Open Axes and change the Y-axis scale to Logarithmic. This can be helpful to view the range of values better, although it is usually better to keep the Ribosomal counts plot in linear scale.
Open Style and reduce the Color Opacity using the slider. For data sets with many cells, decreasing the dot opacity can make the density of the plot easier to see.
Within Style, switch on Summary Box & Whiskers. Inspecting the median, Q1, Q3, upper 90%, and lower 10% quantiles of the distributions can be helpful in deciding appropriate thresholds.
High-quality cells can be selected using Select & Filter, which is pre-loaded with the selection criteria, one for each quality metric (Figure 5).
Hovering the mouse over one of the selection criteria reveals a histogram showing the frequency distribution of the respective quality metric. The minimum and maximum thresholds can be adjusted by clicking and dragging the sliders or by typing directly into the text boxes for each selection criterion (Figure 6).
Alternatively, use Pin histogram to view all of the distributions at once and determine thresholds with ease (Figure 7).
Adjusting the selection criteria will select and deselect cells in all of the plots simultaneously. Depending on your settings, the deselected points will either be dimmed or gray. The filters are additive; combining multiple filters will select the intersection of all the filters. The number of cells selected is shown in the figure legend of each plot (Figure 8).
Select the input data node for the filtering task and click Select (Figure 10).
A new data node, Filtered counts, will be generated under the Analyses tab (Figure 11).
Double click the Filtered counts data node to view the task report. The report includes a summary of the count distribution across all features for each sample; a detailed breakdown of the number of cells included in the filter for each sample; and the minimum and maximum values for each quality metric (expressed genes, total counts, etc) across the included cells for each sample (Figure 12).
Selecting a node with unaligned reads (either Unaligned reads or Trimmed reads) shows the QA/QC section in the context sensitive menu, with two options (Figure 1). To assess the quality of your raw reads, use Pre-alignment QA/QC.
The Pre-alignment QA/QC setup dialog is shown in Figure 2. Examine reads lets you control the number of reads processed by the tool: All reads, or a subset (One of every n reads). The latter option is not as thorough, but is much faster than All reads.
If selected, K-mer length creates a per-sample report with the position of the most frequent k-mers (i.e. sequences of k nucleotides) of the length specified in the dialog. The range of input values is from one to 10.
The last control refers to .fastq files. Partek® Flow® can automatically detect the quality encoding scheme (Auto detect), or you can use one of the options available in the drop-down list. However, auto-detection is only applicable to the Phred+33 and Phred+64 quality encodings. For the early Solexa quality encoding, select Solexa+64 from the Quality encoding drop-down list. For paired-end data, the pre-alignment QA/QC will be performed on each read in the pair separately and the results will be shown separately as well.
Most sequencing applications now use the Phred quality score. This score indicates the probability that the base was called incorrectly: Q = −10·log10(P). For example, a Phred score of 10 corresponds to 90% base call accuracy, 20 to 99%, 30 to 99.9%, and 40 to 99.99%.
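The conversion is easy to compute directly; a one-function sketch:

```python
# Phred scores encode the probability P that a base call is wrong:
# Q = -10 * log10(P), so accuracy = 1 - 10^(-Q/10).
def phred_to_accuracy(q):
    return 1.0 - 10 ** (-q / 10.0)

for q in (10, 20, 30, 40):
    print(q, f"{phred_to_accuracy(q):.4%}")  # 90%, 99%, 99.9%, 99.99%
```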
The task report is organised in two tiers. The initial view shows the project-level report with all the samples: an overview table is at the top, while matching plots are below.
Two project-level plots are Average base quality per position and Average base quality score per read (Figure 4). The latter presents the proportion of reads (y-axis) with a certain average quality score (i.e. all the base qualities within a read are averaged; x-axis). Mouse over a data point to see the matching readouts. The Save icon saves the plot in .svg format to the local machine. Each line on the plot represents a data file; you can select the sample names from the legend to hide/un-hide individual lines.
A sample-level report begins with a header, which is a collection of typical quality metrics (Figure 5).
Below the header you will find four plots: Base composition, Average base quality score per position (same as above, but on the sample level), Distribution of base quality scores (the same as Average base quality score per read, but on the sample level), and Distribution of read lengths.
Distribution of read lengths shows a single column for fixed-length data (e.g. Illumina sequencing). However, for quality-trimmed data or non-fixed-length data (like Ion Torrent sequencing), expect to see a distribution of read lengths (Figure 7).
If the K-mer length option was turned on when setting up the task, an additional plot is added to the sample-level report, i.e. K-mer content (Figure 8). For each position, the K-mer composition is given, but only the top six most frequent K-mers are reported; a high frequency of a K-mer at a given site (enrichment) indicates the possible presence of sequencing adapters in the short reads.
The pre-alignment QA/QC report as described above is generally available for NGS data in fastq format. For other types of data, the report may differ depending on the available information. For example, fasta format contains no base quality score information, so all the figures or graphs related to base or read quality scores will be unavailable.
Next generation sequencing (NGS) data files are notoriously large. Dealing with NGS data is not only time consuming but also puts constraints on hard disk space, especially if analysis parameters need to be optimized. The Filter reads task is a very useful tool for taking a subset of the raw data upon which optimization can be performed. The optimized parameters can then be saved and applied to the whole dataset.
Filter reads is only available for unaligned reads of FASTQ format. Select the Unaligned Reads data node then select Filter reads from the Pre-alignment tools section on the menu.
There are two options to filter reads: Subsample reads and Filter by read length.
To Subsample reads, specify that one read should be kept out of every n reads. For example, if the user specifies "Keep one read for every 10 reads" (Figure 1), the program will keep only 1 read out of every 10. This is equivalent to keeping 10% of the data.
To Filter by read length, set the read length limits by choosing the minimum and maximum read length(s) to keep.
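The sketch below illustrates the logic of both options on a plain FASTQ file (four lines per record). It is an outline of the idea only; Partek Flow performs the filtering internally, and the file names are placeholders.

```python
# Keep one read out of every n, then drop reads outside a length range.
def filter_fastq(in_path, out_path, n=10, min_len=25, max_len=500):
    with open(in_path) as fin, open(out_path, "w") as fout:
        index = 0
        while True:
            record = [fin.readline() for _ in range(4)]  # one FASTQ record
            if not record[0]:
                break  # end of file
            seq = record[1].rstrip("\n")
            if index % n == 0 and min_len <= len(seq) <= max_len:
                fout.writelines(record)
            index += 1

filter_fastq("sample.fastq", "subsampled.fastq")  # keeps ~10% of the reads
```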
The Trim bases task is used to trim bases from the 5'-end or 3'-end of the reads. The most obvious reason for Trim bases is to trim away poor quality bases from the read prior to alignment because these can potentially affect alignment rate.
The task allows the user to trim reads in different ways (Figure 1), including:
Trim bases based on quality score
Trim bases from 3'-end
Trim bases from 5'-end
Trim bases from both ends
Trim bases from 5'- or 3'-end (Figures 2-3) allows a fixed number of bases to be trimmed away from the 5'- or 3'-end of the reads. These two functions are useful when the read length is constant. They are not recommended if the read length is not constant, since good quality bases from shorter reads are likely to be trimmed away.
Trim bases from both ends (Figure 4) allows the user to keep only the bases between a fixed start and end position of the reads. This is particularly useful if poor quality bases are observed on both ends of the read: instead of trimming successively from the 5'- and then the 3'-end, trimming is performed only once, from both ends.
Trim bases based on quality score (Figure 5) is probably the most useful function for trimming poor quality bases from the 5'- or 3'-ends of reads. This function allows dynamic trimming of bases depending on their quality scores. The trimming can be done from the 5'-end, the 3'-end, or both ends of the reads. The function evaluates each base from the end of the read and trims it away until the last remaining base has a quality score greater than the specified threshold (see the sketch below). For an extensive evaluation of read trimming effects on Illumina NGS data analysis, see Del Fabbro et al. [1].
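A minimal sketch of the 3'-end case, assuming the quality scores have already been decoded to integers; this illustrates the rule described above, not Partek Flow's exact implementation.

```python
# Trim bases from the 3' end while the terminal base quality is at or
# below the threshold; stop once the last base exceeds it.
def trim_3prime(seq, quals, threshold=20):
    end = len(seq)
    while end > 0 and quals[end - 1] <= threshold:
        end -= 1
    return seq[:end], quals[:end]

seq, quals = trim_3prime("ACGTACGTAC", [38, 37, 36, 35, 30, 28, 22, 15, 12, 8])
print(seq)  # ACGTACG
```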
In some cases, the reads that result from base trimming can be very short and thus are not suitable for alignment. Partek® Flow® therefore provides the option to set a Min read length after base trimming; reads shorter than the set length are discarded.
Also, reads can have a high percentage of Ns (ambiguous bases). The Max N setting is therefore available to discard reads with a percentage of Ns higher than the set threshold.
The Quality encoding option refers to the Phred quality score encoding of the FASTQ input file. The available options are Phred+33, Phred+64, Solexa+64 and Integers. Selecting Auto-detect will determine whether the quality encoding is Phred+33 or Phred+64; for Solexa data, you will need to select Solexa+64. For most datasets the auto-detect option works well, with a few exceptions where the base quality scores fall into the ambiguous zone shared by Phred+33 and Phred+64. If the quality encoding scheme is known, we therefore recommend selecting the encoding format directly from the Quality encoding list.
Figure 6 shows the options available for the different selections of the Trim bases function. Note that the default Min read length is 25 bp. For micro RNA sequencing data, this default needs to be set to a smaller value (we recommend 15) to account for the short length of mature microRNAs.
The Task Details page for Trim bases can be accessed by selecting the task node Trim bases, and subsequently selecting Task Details from the Task results section. In the Task details page, several sections are available:
General task information: contains information such as the task name, owner, status, submitted time, start, end and duration of the task
Output Files: contains a description of each output file. If you hover your mouse cursor over a file name, you will see the exact location of the file on the server. If you click the file name, you will have the option to view up to 999 lines of the raw data. You can also download the file from the server.
Input Files: contains information about the input files. This section lists all the input files used in the Trim bases task.
Input Parameters: contains the parameters used for running the Trim bases function. This section shows which options were selected for the task, including all the parameters used, such as the minimum read length, maximum percentage of N bases, quality encoding, quality score threshold (if applicable) and how trimming was performed.
Command Lines: shows the commands Partek Flow used to run the Trim bases function
The Trim bases Task Report page can be accessed by selecting either the Trim bases task node or Trimmed reads data node and then selecting the Task Report from the Task results section of the context sensitive menu. There is a link at the bottom of the page to directly go to the Task Details page. The page displays the following components:
Summary table: gives the total number of reads in each sample, the total number of reads trimmed (i.e. with at least one base trimmed from the read), the total number of reads removed (due to the Min read length and Max N parameters), the average number of bases trimmed per read, and the average read quality before and after trimming.
Stacked bar-chart: shows the percentages of untrimmed, trimmed, and removed reads in a stacked bar-chart, comparing all the samples.
Average base quality score per position of trimmed reads: shows the average base quality score at each position of the trimmed reads for all samples in the project.
The Trim bases function produces trimmed unaligned reads in a data node named Trimmed reads; the word "trimmed" is appended to the output filenames. The trimmed data can be downloaded by selecting the Trimmed reads node and then selecting Download data from the context sensitive menu. Alternatively, if you have access to the Partek Flow server, you can go to the Task Details page and identify the location of the output files from the Output Files section, as described in the Trim bases Task Details section above. The Trimmed reads data node will have the same format as the raw data.
Del Fabbro C, Scalabrin S, Morgante M, Giorgi FM. An extensive evaluation of read trimming effects on Illumina NGS data analysis. PLoS ONE. 2013; 8(12): e85024.
The Combine alignments task can potentially maximize the number of reads that map to a region. When unaligned reads resulting from one aligner are then aligned using a different one, the two resulting alignments can be combined for downstream analysis within Partek Flow. Note that this can only be performed if the reads were aligned to the same reference genome.
To invoke this task, click an Aligned reads data node and select Combine alignments (Figure 1).
A list of compatible alignments will appear. The color of the text signifies the layer the alignment corresponds to (Figure 2). Select the alignment you would like to combine and click Finish.
The resulting Aligned reads data node can now be used for downstream analysis (Figure 3).
Note that this task combines the files in the data node within Partek Flow but does not merge the BAM files. Downloading an aligned reads data node from a Combine alignments task will result in multiple BAM files per sample.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The presence of adapter sequences at the 5'-end or 3'-end of reads has been shown to be one of the major problems during alignment, causing reads to go unaligned. Removing adapter sequences is therefore essential when the sequenced read length is longer than the molecule of interest, such as microRNA. Because mature microRNAs are short, it is almost certain that the adapter sequence will be sequenced at the 3'-end of the miRNA.
To determine whether microRNA data has already been adapter-trimmed, look at the pre-alignment QA/QC of the raw data, specifically the read length distribution. If the read length distribution peaks at approximately 22-23 bases, the data has usually been adapter-trimmed. If instead the reads have a fixed length, the data is very likely not adapter-trimmed; in that case, obtain the adapter sequence from your vendor or service provider and use the Trim adapters function to trim it away.
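As a quick sanity check outside of Partek Flow, the read length distribution can be computed directly from a FASTQ file. Below is a minimal Python sketch; the file name sample.fastq.gz is a placeholder.

```python
# Minimal sketch: tally read lengths in a gzipped FASTQ file.
# "sample.fastq.gz" is a placeholder file name.
import gzip
from collections import Counter

lengths = Counter()
with gzip.open("sample.fastq.gz", "rt") as fq:
    for i, line in enumerate(fq):
        if i % 4 == 1:                      # sequence line of each FASTQ record
            lengths[len(line.strip())] += 1

# A peak near 22-23 nt suggests the adapters were already trimmed;
# a single fixed length suggests untrimmed reads.
for length in sorted(lengths):
    print(length, lengths[length])
```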
Partek Flow software wraps Cutadapt [1], a widely used tool for adapter trimming. It can be used to trim adapter sequences in nucleotide-space data as well as color-space data.
To use the Trim adapters function, you will need to know the adapter sequences. Select an Unaligned reads data node and then select Trim adapters from the Pre-alignment tools section of the task pane. In the Trim adapters page (Figure 1), paste the adapter sequences into the text box and add them using the button.
There are three options when it comes to trimming the adapter sequence:
Trimming for adapter ligated to 3'-end: the adapter sequence and anything that follows it will be trimmed away from the 3'-end.
Trimming for adapter ligated to 5'-end or 3'-end: if the adapter sequence is identified within the read or overlapping the 3'-end, the adapter sequence and anything that follows it is trimmed away. However, if the adapter sequence partially overlaps the 5'-end of the read, the initial portion of the read matching the adapter sequence is trimmed and everything that follows is kept.
Trimming for adapter ligated to 5'-end: if the adapter sequence appears partially at the 5'-end or within the read, the preceding sequence including the adapter sequence is trimmed. Users have the option to use the special character '^' at the beginning of the adapter sequence to mark the adapter as 'anchored'. An anchored adapter must appear in its entirety at the 5'-end of the read (i.e. it is a prefix of the read).
For Trim adapters, more than one adapter sequence can be specified at once. When multiple adapters are provided, each adapter is evaluated based on how many bases it overlaps the read as well as its error rate. Adapters with fewer overlapping nucleotides or higher error rates are removed from consideration.
After that, the best adapter is chosen based on the number of bases matching the read. If there is a tie, adapters of the same type are chosen in the order they were provided, and adapters of different types are chosen by type in the following order: first 3', then 5' or 3', and lastly 5' adapters.
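The selection logic can be illustrated with a short Python sketch. This is a simplified illustration of the ordering described above, not Cutadapt's internal implementation; the field names are invented for the example.

```python
# Minimal sketch of choosing the best adapter among tied candidates.
TYPE_ORDER = {"3p": 0, "5p_or_3p": 1, "5p": 2}   # 3' first, then 5'/3', then 5'

def choose_best(candidates):
    """candidates: dicts with 'matches' (bases matching the read),
    'type' (one of TYPE_ORDER), and 'input_order' (position in the list)."""
    return min(
        candidates,
        key=lambda a: (-a["matches"],          # most matching bases wins
                       TYPE_ORDER[a["type"]],  # ties broken by adapter type
                       a["input_order"]),      # then by order provided
    )

best = choose_best([
    {"matches": 18, "type": "5p", "input_order": 0},
    {"matches": 18, "type": "3p", "input_order": 1},
])
print(best["type"])   # '3p' -- the 3' adapter wins the tie
```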
There are cases when the Trim adapters function does not work properly, for example when reads contain N (ambiguous) bases. Therefore, advanced options allow users to configure how the matching of the adapter sequence is performed. The advanced options dialog box is shown in Figure 2.
The first section of advanced options is the Adapter options. This is used to configure how the matching between the adapter sequence and the read is performed, including the maximum error rate allowed, the number of times an adapter can be matched, the minimum length of overlapping bases, whether Ns (ambiguous bases) are allowed in the adapter and whether Ns are treated as wildcards. Hover your mouse cursor over the info button to get more information about each parameter.
The second section of advanced options is the Filtering options. This is used to remove adapter-trimmed reads that are shorter than the minimum read length, since reads that are too short tend to produce non-unique alignments.
The third section of advanced options is Additional modifications to reads. The quality cutoff is used to trim poor-quality bases from the reads before adapter trimming. Quality encoding specifies the quality score encoding of the raw data. Read name prefix and suffix are used to add a prefix and suffix to the read IDs. Lastly, if Negative quality zero is checked, all bases with negative quality scores are converted to zero.
Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 2011; 17: 10-12.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Feature distribution plot visualizes the distribution of features in a counts data node.
To run Feature distribution:
Click a counts data node
Click the QA/QC section of the toolbox
Click Feature distribution
A new task node is generated with the Feature distribution report.
The Feature distribution task report plots the distribution of all features (genes or transcripts) in the input data node with one feature per row (Figure 1). Features are ordered by average value in descending order.
The plot can be configured using the panel on the left-hand side of the page.
Using the filter, you can choose which features are shown in the task report.
The Manual filter lets you type a feature ID (such as a gene symbol) and filter to matching features by clicking +. You can add multiple feature IDs to filter to multiple features (Figure 2).
The List filter lets you filter to the features included in a feature list. To learn more about feature lists in Partek Flow, please see List management.
Distributions can be plotted as histograms, with the x-axis being the expression value and the y-axis the frequency, or as a strip plot, where the x-axis is the expression value and the position of each cell/sample is shown as a thin vertical line, or strip, on the plot (Figure 3).
To switch between plot types, use the Plot type radio buttons.
Mousing over a dot in the histogram plot gives the range of feature values that are being binned to generate the dot and the number of cells/samples for that bin in a pop-up (Figure 4).
Mousing over a strip shows the sample ID and feature value in a pop-up. If there are multiple cells/samples with the same value, only one strip will be visible for those cells/samples and the mouse-over will indicate how many cells/samples are represented by that one strip (Figure 5).
Clicking a strip will highlight that cell/sample in all of the plots on the page (Figure 6).
The grey dot in each strip plot shows the median value for that feature. To view the median value, mouse over the dot (Figure 7).
To navigate between pages, use the Previous and Next buttons or type the page number in the text field and press Enter on your keyboard.
The number of features that appear in the plot on each page is set by the Items per page drop-down menu (Figure 8). You can choose to show 10, 25, or 50 features per page.
When Plot type is set to Histogram, you can choose to configure the Y-axis scale using the Scale Y-axis radio buttons. Feature max sets each feature plot y-axis individually. Page max sets the same y-axis range for every feature plot on the page, with the range determined by the feature with the highest frequency value.
You can add attribute information to the plots using the Color by drop-down menu.
For histogram plots, the histograms will be split and colored by the levels of the selected attribute (Figure 9). You can choose any categorical attribute.
For strip plots, the sample/cell strips will be colored by the levels or values of the selected attribute (Figure 10). You can choose any categorical or numeric attribute.
The Filter alignments task can be used to filter aligned reads data using specified parameters. To invoke the task, click on an Aligned reads data node and select Filter alignments. By default, this task removes low-quality reads, singletons and unaligned read information stored within the BAM/SAM file (Figure 1).
Users also have the option to remove duplicate reads in aligned data. For DNA-Seq analysis, this is typically performed to minimize redundant variant calling information. To remove duplicates, click on the Remove duplicates checkbox (Figure 2).
Select the number of reads you want to keep. Then specify when alignments are treated as duplicates. This can either be reads that map to the same start position or, additionally, have the same sequence. You can also select whether to keep the read with the highest mapping score or a randomly-selected duplicate.
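The duplicate definition can be sketched in a few lines of Python. This is an illustration of the grouping logic described above under simple assumptions (alignments as dictionaries), not Partek Flow's BAM-level implementation.

```python
# Minimal sketch: group alignments sharing a start position (and optionally
# the same sequence), then keep n_keep reads per group.
import random

def dedup(alignments, same_sequence=False, keep="best_mapq", n_keep=1):
    groups = {}
    for aln in alignments:   # aln: dict with 'chrom', 'start', 'seq', 'mapq'
        key = (aln["chrom"], aln["start"]) + ((aln["seq"],) if same_sequence else ())
        groups.setdefault(key, []).append(aln)
    kept = []
    for dups in groups.values():
        if keep == "best_mapq":
            dups = sorted(dups, key=lambda a: -a["mapq"])  # highest score first
        else:
            random.shuffle(dups)                           # random duplicate
        kept.extend(dups[:n_keep])
    return kept
```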
To remove alignments with mismatches, select the Remove alignments with mismatches check box. Using the selector, specify the number of mismatched bases that must be exceeded for the alignment to be excluded (Figure 3). Note that mismatches also include insertions and deletions.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Post-alignment tools involve tasks that can be performed on aligned data. These are typically used in preparing aligned data for other downstream analyses, such as DNA-Seq or RNA-Seq analysis.
To invoke Post-alignment tools, click on an Aligned reads data node (Figure 1). There are three functions available in Post-alignment tools:
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Annotation report provides a table summarizing the cell-level attributes of a single cell counts data node.
To run Annotation report:
Click a Single cell counts data node
Click the Annotation/Metadata section of the toolbox
Click Annotation report
The task will run and generate a task report.
The Annotation report includes two tables (Figure 1) - the top table summarizes the categorical attributes, giving the number of levels of each attribute, and the bottom table summarizes the numeric attributes, providing some basic summary statistics about the distribution of each attribute.
To download a text-file version of one of the tables, click Download in lower right-hand corner of the table.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Deduplicate UMIs task identifies and removes reads mapped to the same chromosomal location with duplicate unique molecular identifiers (UMIs). The details of our UMI deduplication methods are outlined in the UMI Deduplication in Partek Flow white paper.
To invoke Deduplicate UMIs:
Click an Aligned reads data node
Click Post-alignment tools in the toolbox
Click Deduplicate UMIs
The task configuration dialog content depends on whether you imported FASTQ files or BAM files into Partek Flow.
UMIs and barcodes are detected and recorded by the Trim tags task in Partek Flow. You can choose whether to retain only one alignment per UMI or not (Figure 1). The default will depend on which prep kit was used in the Trim tags task.
If you select Retain only one alignment per UMI, you will be asked to choose an assembly and gene/feature annotation file. The annotation file is used to check whether a read overlaps an exonic region. Only reads that have 50% overlap with an exon will be retained.
If you do not select Retain only one alignment per UMI, UMI deduplication will proceed without filtering to exonic reads. Other differences between the two options are outlined in the UMI Deduplication in Partek Flow white paper.
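Conceptually, UMI deduplication collapses alignments that share a cell barcode, UMI, and mapping position into a single molecule. The sketch below illustrates this basic idea only; Partek Flow's full method, including UMI error correction and the exon-overlap filter, is described in the white paper.

```python
# Minimal sketch: one representative alignment per (barcode, UMI, position).
def deduplicate(alignments):
    """alignments: iterable of (barcode, umi, chrom, position) tuples."""
    seen = set()
    unique = []
    for aln in alignments:
        if aln not in seen:
            seen.add(aln)
            unique.append(aln)
    return unique

reads = [("ACGT", "AAC", "chr1", 100),
         ("ACGT", "AAC", "chr1", 100),   # PCR duplicate, removed
         ("ACGT", "GGT", "chr1", 100)]   # different UMI, kept
print(len(deduplicate(reads)))           # 2
```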
BAM files generated by other tools can be imported into Partek Flow and deduplicated by the software. Additional options are available in the task configuration dialog to let you specify the location of the UMI and cell barcode information, which is typically stored in BAM tags. Specify the BAM tags in the text fields. For example, when processing a BAM file produced by Cell Ranger 3.0.1, the BAM tag for the UMI sequence is UR and the BAM tag for the barcode sequence is CR (Figure 2).
The option to Retain only one alignment per UMI is also available when starting from a BAM file.
The Deduplicate UMIs task report includes a knee plot showing the number of deduplicated reads per barcode. This plot is used to filter the barcodes to include only barcodes corresponding to cells. For more information about using the knee plot to filter barcodes, please see the Cell Barcode QA/QC page. One difference between the Deduplication report and the Cell Barcode QA/QC report is that the Deduplication report gives the number of initial alignments and the number of deduplicated alignments for each sample (Figure 3). This indicates how many of your aligned reads were PCR duplicates and how many were unique molecules.
The initial number of cells is set by our automatic filter. You can set the filter manually by clicking on the plot or by typing a cutoff number in the Cells or Reads in cells text boxes. If there are multiple samples, each sample receives a plot and filters are set per sample.
The number of cells, reads in cells, median reads per cell, number of initial alignments, and number of deduplicated alignments are listed for each sample in the summary table (Figure 4).
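The underlying idea of the barcode filter can be sketched as follows: tally deduplicated reads per barcode, rank barcodes in descending order (the order the knee plot shows), and keep the top-ranked barcodes as cells. This is a simplified illustration, not the automatic filter's actual algorithm.

```python
# Minimal sketch: rank barcodes by read count and keep the top n_cells.
from collections import Counter

def filter_barcodes(barcode_per_read, n_cells):
    counts = Counter(barcode_per_read)         # deduplicated reads per barcode
    ranked = counts.most_common()              # descending, knee-plot order
    return {bc for bc, n in ranked[:n_cells]}  # barcodes called as cells

cells = filter_barcodes(["ACGT", "ACGT", "TTTT", "ACGT", "GGGG"], n_cells=1)
print(cells)   # {'ACGT'}
```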
Clicking Apply filter at either the knee plot or the summary table will run the Filter barcodes task and generate a Filtered reads data node.
To return to the knee plot, click Back to filter.
To reset the filters for all samples to the automatic cutoff, click Reset all filters.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Aligned reads can be converted to unaligned reads in Partek Flow. The task is available under Post-alignment tools in the task menu when any Aligned reads data node is selected, which can be a result of an aligner in Partek Flow or data already aligned before import.
Generating unaligned reads from aligned data gives you the flexibility to remap the reads using a different aligner, a different set of alignment parameters, or a different genome reference. This is particularly useful when analyzing sequences from xenograft models, where the same set of reads can be aligned to two different species. It may also be useful if the original unaligned FASTQ files are not as easily accessible to the user as the aligned BAM files.
To perform the task, select an Aligned reads data node and click Convert alignments to unaligned reads task in the task menu (Figure 1).
During the conversion, the BAM files are converted to FASTQ files and a new Unaligned reads data node will be generated (Figure 2).
The filenames of the FASTQ files are based on the sample names in the Data tab. The generated files are compressed with the extension *.fq.gz. For samples containing BAM files with paired-end reads, two FASTQ files will be generated for each sample, with file names appended with _1 and _2. The example in Figure 3 shows 18 .fq.gz output files generated from 9 BAM files.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Visium spatial gene expression solution from 10X Genomics allows us to spatially resolve RNA-Seq expression data in individual tissue sections. For the analysis of Visium spatial gene expression data in Partek Flow, you will need the following output files from the Space Ranger outs subdirectory [1]:
The filtered count matrix -- either the .h5 file (one file) or feature-barcode matrix (three files).
spatial – outputs of the spatial pipeline, i.e. the spatial imaging data (Figure 1)
We recommend importing the count matrix using the .h5 file format, which allows you to import multiple samples at the same time. Rename the files and place them all in the same folder to import them in one go.
The spatial subdirectory contains the following image-related files:
tissue_hires_image.png
tissue_lowres_image.png
aligned_fiducials.jpg
detected_tissue_image.jpg
tissue_positions_list.csv
scalefactors_json.json
The folder should be compressed into a single .gz or .zip file before being uploaded to the Partek Flow server. You can pick the file for each sample from the Partek Flow server, your local computer, or a URL using the file browser.
To run Annotate Visium image:
Click a Single cell counts data node
Click the Annotation/Metadata section in the toolbox
Click Annotate Visium image
You will be prompted to pick a Spatial image file for each sample you want to annotate (Figure 2).
Click Finish
A new data node, Annotated counts, will be generated (Figure 3).
When the task report of the Annotated counts node is opened (or the Annotated counts node is double-clicked), the images are displayed in the Data viewer (Figure 4).
It is a 2D plot whose X and Y axes are the tissue spot coordinates, with the tissue spots drawn on top of the slide image. The opacity of the tissue spots can be changed using the slider under Configure>Style>Color to show more of the underlying image.
From the Configure>Background>Image drop-down list in the Data viewer, different formats of the image can be selected (Figure 6).
Note that the "Annotate Visium image" task splits by sample so all of the downstream tasks will also do this (e.g. if the pipeline is built from this node all of the downstream tasks will also be split and viewed per sample). To generate plots with multiple samples on one plot, build the pipeline from the "Single cell counts" node.
[1] https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/output/overview
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Downscale alignments task can be invoked on a data node containing BAM/SAM files, e.g. an Aligned reads data node. The only parameter to specify is the percentage of alignments to keep in the results, which should be between 0 and 100 (Figure 1). All samples in the input data node use the same value. The output data node contains BAM files with a subset of the alignments.
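Percentage-based downsampling can be illustrated with a short sketch. One common approach is to keep each alignment with probability p; this is an assumption for illustration, as the task's exact selection strategy is not described here.

```python
# Minimal sketch: keep each alignment with probability percent_to_keep/100.
import random

def downscale(alignments, percent_to_keep):
    p = percent_to_keep / 100.0          # parameter ranges from 0 to 100
    return [aln for aln in alignments if random.random() < p]

subset = downscale(list(range(10000)), percent_to_keep=25)
print(len(subset))    # roughly 2500
```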
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
If a single cell data node contains cell attribute information, e.g., clustering results, classifications, or imported attributes, a counts-type data node containing the number of cells from each attribute group for each sample can be generated and used for downstream analysis.
To invoke Generate group cell counts:
Click a single cell counts data node with cell-level attribute information
Click Pre-analysis tools in the toolbox
Click Generate group cell counts
Select the attribute to group the cells from the Group by drop-down menu (Figure 1)
Click Finish
A group cell counts node will be generated. The data node contains a matrix of cell counts in each sample for each group. You can view the counts results in the Group cell counts report (Figure 2).
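The structure of the resulting matrix can be illustrated with pandas: one row per sample, one column per attribute group, and cell counts as values. The sample and cluster labels below are invented for the example.

```python
# Minimal sketch: cross-tabulate cells by sample and attribute group.
import pandas as pd

cells = pd.DataFrame({
    "sample":  ["S1", "S1", "S1", "S2", "S2"],
    "cluster": ["T cell", "T cell", "B cell", "T cell", "B cell"],
})
group_counts = pd.crosstab(cells["sample"], cells["cluster"])
print(group_counts)   # one row per sample, one column per cluster
```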
The Cell counts data node is a counts type data node and downstream analysis tasks, such as normalization, PCA, and ANOVA, can be used to analyze the group cell counts data.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Split matrix can be invoked on any counts data node with more than one feature type. For example, a CITE-Seq experiment would have Gene Expression counts and Antibody Capture counts in the single cell counts data node. Datasets generated by 10X Genomics' Feature Barcoding experiments also utilize this task to split different feature measurements for downstream analysis.
There are no parameters to configure. To run:
Click the counts data node you want to split
Click the Pre-analysis tools section of the toolbox
Click Split matrix
The Split matrix task will run and generate output data nodes for each of the feature types. For example, if there are Antibody Capture and Gene Expression feature types in the input, Split matrix will generate two data nodes (Figure 1). Every sample is included in both matrices.
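The effect of the split can be illustrated with pandas, keeping every cell in both output matrices. The feature names and type labels below are invented for the example.

```python
# Minimal sketch: split a counts matrix into one matrix per feature type.
import pandas as pd

counts = pd.DataFrame(
    [[5, 0, 120], [2, 7, 85]],
    index=["cell_1", "cell_2"],
    columns=["CD3_ab", "CD19_ab", "MS4A1"],
)
feature_type = {"CD3_ab": "Antibody Capture",
                "CD19_ab": "Antibody Capture",
                "MS4A1": "Gene Expression"}

matrices = {ft: counts.loc[:, [f for f, t in feature_type.items() if t == ft]]
            for ft in set(feature_type.values())}
print(matrices["Antibody Capture"])   # every cell/sample kept in both matrices
```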
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Partek Flow provides Pre-alignment tools that allow the user to process next-generation sequencing data before proceeding to alignment. These tools are not only useful for controlling the quality of data, but can also be used for subsampling prior to analyzing the full dataset. There are three functions available in Pre-alignment tools:
Users are expected to have a preliminary understanding of:
File formats for next generation sequencing data
Phred-quality score
To show the Pre-alignment tools, select an Unaligned reads or Trimmed reads data node. The tools will appear in the context-sensitive menu on the right of the screen (Figure 1).
Different Pre-alignment tools are available for different formats of unaligned reads. For example, if the reads are in FASTQ format, all of the tools are available; if the unaligned reads are in FASTA or SFF format, the Filter reads option is not available.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Cell Hashing enables sample multiplexing and super-loading in single cell RNA-Seq isolation by labeling each sample with a sample-specific oligo-tagged antibody against a ubiquitously expressed cell surface protein.
The Hashtag demultiplexing task is an implementation of the algorithm used in Stoeckius et al. 2018 [1] for demultiplexing cell hashing data. The task adds the cell-level attributes Sample of origin and Cells in droplet.
To run Hashtag demultiplexing, your data must meet the following criteria:
The data node must contain fewer features than observations
The data node must be the output of a normalization task (the recommended normalization method for hashtag counts is CLR)
If you are processing your FASTQ files in Partek Flow, be sure to specify a different Data type for your Cell Hashing FASTQ files on import than the FASTQ files for your gene expression and any other antibody data.
If you are processing your FASTQ files using Cell Ranger, be sure to specify a different feature_type for your Cell Hashing antibodies than any other antibodies in the Feature Reference CSV File.
If you want to specify sample IDs instead of using hashtag feature IDs as the sample IDs, you will need to prepare a tab-delimited text file (.txt) with hashtag feature IDs in the first column and the corresponding sample IDs in the second column (Figure 1). A header row is required.
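For example, a sample ID file could look like the following; the hashtag feature IDs and sample names here are hypothetical and must match the feature IDs in your own data.

```
Hashtag	Sample
Hashtag1	Control
Hashtag2	Treated
```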
Click the Normalized counts data node for your cell hashing data
Click Hashtag demultiplexing in the Pre-analysis tools section of the toolbox
Click Browse to select your Sample ID file (Optional)
Click Finish to run
The output is a Demultiplexed counts data node (Figure 2).
Two cell-level attributes, Cells in droplet and Sample of origin, are added by this task and are available for use in downstream tasks. You can download the attribute values for each cell by clicking the Demultiplexed counts data node, clicking Download, and choosing to download Attributes only.
We recommend using Annotate cells to transfer the new attributes to other sections of your project after downloading the attributes text file.
It is also possible to use the Merge matrices task to combine your data types and attributes.
Stoeckius, M., Zheng, S., Houck-Loomis, B., Hao, S., Yeung, B.Z., Mauck, W.M., Smibert, P. and Satija, R., 2018. Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome biology, 19(1), p.224.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Descriptive statistics task can be invoked on a matrix data node, e.g. a Gene counts or Normalized counts data node in a bulk RNA-Seq analysis pipeline, or a Single cell counts data node. It calculates measures of central tendency and variability on the observations or features of the matrix data.
Click on a counts data node
Choose Descriptive statistics in the Statistics section of the toolbox (Figure 1)
This will invoke the task configuration dialog; use it to specify which calculation(s) will be performed on cells (or samples, for a bulk analysis data node) or features (Figure 2).
The available statistics are listed in the left panel. In the descriptions below, suppose x1, x2, ..., xn represents an array of numbers.
Number of cells: Available when Calculate for is set to Features. Reports the number of cells whose value is [<, <=, =, !=, >, >=] (select one from the drop-down list) the cutoff value entered in the text box. The cutoff is applied to the values present in the input data node, i.e. if invoked on a non-normalized data node, the values are raw counts. For instance, use this option if you want to know the number of cells in which each feature was detected; a possible filter is: number of cells whose value > 0.0
Percent of cells: Available when Calculate for is set to Features. Reports the percentage of cells whose value is [<, <=, =, !=, >, >=] (select one from the drop-down list) the cutoff value entered in the text box.
Number of features: Available when Calculate for is set to Cells. Reports the number of features whose value is [<, <=, =, !=, >, >=] (select one from the drop-down list) the cutoff value entered in the text box. The cutoff is applied to the values present in the input data node, i.e. if invoked on a non-normalized data node, the values are raw counts. For example, use this option if you want to know the number of detected genes per cell; a possible filter is: number of features whose value > 0.0
Percent of features: Available when Calculate for is set to Cells. Reports the percentage of features whose value is [<, <=, =, !=, >, >=] (select one from the drop-down list) the cutoff value entered in the text box.
Q1: 25th percentile
Q3: 75th percentile
Range: xmax - xmin
Left-click a measurement and drag it to the right panel one at a time, or hover over a measurement and click the green plus button to move it to the right panel. When Sample (Cell) is selected, the calculation will be performed on all the features in the input matrix for each sample (or cell). When Feature is selected, the calculation will be performed across all the samples (cells) in the input matrix for each feature.
In addition, when Feature is selected, there is an extra Group by option (Figure 3).
From the drop-down list, choose a categorical attribute to calculate the descriptive statistics on all the subgroups for each feature.
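For example, per-feature statistics with a Group by attribute can be reproduced in pandas as follows; the gene names and group labels are invented for the example.

```python
# Minimal sketch: per-feature descriptive statistics, computed by group.
import pandas as pd

expr = pd.DataFrame({"GeneA": [0, 3, 5, 2], "GeneB": [1, 0, 0, 4]},
                    index=["c1", "c2", "c3", "c4"])
group = pd.Series(["tumor", "tumor", "normal", "normal"], index=expr.index)

stats = expr.groupby(group).agg(["mean", "median"])    # mean/median per group
pct_detected = expr.gt(0).groupby(group).mean() * 100  # percent of cells > 0
print(stats)
print(pct_detected)
```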
The output of the task is a matrix: Cell stats (result of Calculate for Cells) or Feature stats (result of Calculate for Features) (Figure 4). The results can be visualized in the Data Viewer.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
If you have attribute information about your cells, you can use the Annotate cells task in Partek Flow to apply this information to the data. Once applied, these can be used like any other attribute in Partek Flow, and thus can be used for cell selection, classification and differential analysis.
To run Annotate cells:
Click a Single cell counts data node
Click the Annotation/Metadata section in the toolbox
Click Annotate cells
You will be prompted to specify annotation input options:
Single file (all samples): requires one .txt file covering the cells in all samples. Each row in the file represents a barcode; at least one column must contain barcodes matching the barcodes in your data, and another column must contain sample IDs matching the sample names in the Data tab of your project.
File per sample: requires all of the annotation files to have the same format. Each file has barcodes on rows and requires one barcode column matching the barcodes in your data for that sample. All files must have the same set of columns; column headers are case-sensitive.
You can pick the file for each sample from the Partek Flow server; you have to specify annotation files for all the samples in the dialog (Figure 1).
To view a preview of the files, click Show Preview (Figure 2).
If you would like to annotate your matrix features with a gene annotation file, you can choose an annotation file at the bottom of the dialog. You can choose any gene/feature annotation available on the Partek Flow server. If a feature annotation is selected, the percentage of mitochondrial reads will be calculated using the selected annotation file.
Click Next to continue
The next dialog page previews the attributes found in the annotations text file (Figure 3).
You can choose which attributes to import using the check-boxes, change the names of attributes using the text fields, and indicate whether an attribute with numbers is categorical or numeric.
Click Finish to import the attributes.
A new data node, Annotated single cell counts, will be generated (Figure 4).
Your annotations will be available in downstream analysis tasks.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Next generation sequencing can produce anywhere from hundreds of thousands to tens of millions short nucleotide sequences for a single sample. For any given base within an individual sequence there can also be a quality score associated with the confidence of that base call from the sequencer. The process of alignment is used to map all of these reads to a reference sequence, providing information with regards to the start and stop positions of each read within the reference sequence as well as a quality metric for the mapping. This document will provide information about the available aligners within Partek Flow as well as illustrate how to perform alignment against a reference sequence. The result of alignment will be an Aligned reads data node that contains the BAM files generated from the alignment.
The user should be familiar with:
Alignment tools appear in the context-sensitive menu on the right of the screen (Figure 1) when you click any data node containing FASTQ files. Examples include Unaligned reads, Trimmed reads, and Subsampled reads data nodes.
Partek Flow provides numerous publicly available tools for the alignment process to meet the needs of your specific sequencing experiment. The information below provides a synopsis of each aligner as well as the current version. Please refer to the aligner links and references section for further information on each aligner.
Bowtie [1] (Version 1.0.0) - Uses a Burrows-Wheeler transform to create a permanent, reusable index of the genome. Backtracking is used to conduct a quality-aware, greedy, randomized, depth-first search of all possible alignments based on the specified alignment parameters. Does not handle gapped alignments. Fast, memory efficient, and accurate for short reads of high quality (<50bp). Popular for short DNA-Seq reads and small RNA-Seq reads. (http://bowtie-bio.sourceforge.net/index.shtml)
Bowtie 2 [2] (Version 2.2.5) - Uses a Burrows-Wheeler transform to create a permanent, reusable index of the genome. Alignment involves mapping seed sequences in an ungapped fashion and then performing a gapped extension. Supports a local alignment mode that "soft clips" alignments which do not align end-to-end. Unlike Bowtie, handles gapped alignments, ambiguous bases (N's), and paired reads that do not align in a paired fashion. Fast, memory efficient, and accurate for longer reads (>50bp) with no upper limit on read length. Popular for DNA-Seq reads and small RNA-Seq reads. (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
BWA [3,4] (Version 0.7.15) - Uses a Burrows-Wheeler transform to create an index of the genome. Handles gapped alignments and ambiguous bases (N's). BWA-backtrack uses a backward search and may be optimal for short reads (<70bp). BWA-MEM is typically fastest and most accurate for longer reads, although BWA-SW may have better sensitivity when alignments are gapped. Popular for DNA-Seq variant calling pipelines, but not for RNA-Seq as splicing is not taken into account. (http://bio-bwa.sourceforge.net/)
GSNAP [5] (Version 2015-12-31(v8)) - A short read aligner (>14bp) using a successive constrained search, capable of handling splicing using either a probabilistic model or a database. Built to handle SNPs in alignment. Good sensitivity but slower speed and higher memory usage. Popular for RNA-Seq analysis. (http://research-pub.gene.com/gmap/)
HISAT2 [6] (Version 2.1.0) - A fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of genomes. HISAT2 is a successor to TopHat2. (https://github.com/DaehwanKimLab/hisat2)
Isaac 2 [7] (Version 15.07.16) - Gapped aligner that finds candidate mapping positions by matching 32-mers from the data to 32-mers from the reference, extending the candidate mappings to the whole read, and selecting the best mapping. Useful for mapping DNA-Seq with good speed and accuracy but high memory usage. (https://github.com/Illumina/isaac2)
STAR [8] (Version 2.6.1d) - Splice-aware aligner that utilizes a novel sequential maximal mappable seed search capable of handling splice junctions. Seeds are subsequently stitched together by local alignment. Capable of handling long reads. Good speed and sensitivity for RNA-Seq analysis but with high memory usage. (https://github.com/alexdobin/STAR)
TMAP [9] (Version 5.0.0) - Integrates a set of aligners (including a modified BWA) to identify candidate mapping locations and performs alignment using the Smith-Waterman algorithm. TMAP is optimized to handle the variable length reads and error profiles generated by Ion Torrent data. (https://github.com/iontorrent/TMAP)
TopHat [10] (Version 1.4.1 with Bowtie 1.0.0) - Two-stage aligner that first utilizes Bowtie to map to a reference; unaligned reads are subsequently mapped to a database of possible splice junctions. Popular for RNA-Seq analysis with solid performance, speed, and memory usage. (https://ccb.jhu.edu/software/tophat/index.shtml)
TopHat 2 [11] (Version 2.1.0) - A newer version of TopHat that utilizes Bowtie 2 and refined algorithms from TopHat to improve both speed and accuracy. Popular for RNA-Seq analysis with solid performance, speed, and memory usage. (https://ccb.jhu.edu/software/tophat/index.shtml)
Selecting an aligner will open the task dialog (Figure 2). All aligners will have an index selection section where the genome build for the species of interest must be entered for Assembly and the Aligner Index must be specified. Aligner indexes provide a means to break apart the reference sequence for fast sequence matching, and can be created for the whole genome or for regions of interest in a Gene/Feature annotation file. Adding Reference Aligner Indexes or Adding Aligner Indexes based on an Annotation Model can be performed via Library File Management or built on the fly.
The Alignment options section is available for all aligners and includes the option to Generate unaligned reads. Selecting this option will create a new fastq file for each sample in the project that contains the reads that do not map during the alignment process.
In addition, some aligners have additional options specific to that tool. BWA allows for selection of the Alignment algorithm, including backtrack, MEM and SW (see BWA documentation). GSNAP has multiple options for Alignment mode (see GSNAP documentation). Both TopHat and TopHat2 have the option to select Fusion search (see Fusion Gene Detection).
The Advanced options section allows for the customization of option sets (see Option Set Management), providing the ability to specify parameters specific to each aligner. Default parameters are those specified by the developer of each aligner; parameter details can be found in the documentation for each aligner.
1. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.
2. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357-359.
3. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754-1760.
4. Li H, Durbin R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics. 2010;26(5):589-595.
5. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinforma Oxf Engl. 2010;26(7):873-881.
6. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12:357-360.
7. Raczy C, Petrovski R, Saunders CT, et al. Isaac: Ultra-fast whole genome secondary analysis on Illumina sequencing platforms. Bioinformatics. June 2013:btt314.
8. Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinforma Oxf Engl. 2013;29(1):15-21.
9. Torrent Suite User Documentation : Technical Note - TMAP Alignment (https://ts-pgm.epigenetic.ru/ion-docs/Technical-Note---TMAP-Alignment_9012907.html).
10. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinforma Oxf Engl. 2009;25(9):1105-1111.
11. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14:R36.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The term spot swapping describes an artifact in spatial data whereby mRNA bleeding from nearby spots causes substantial contamination of UMI counts [1]. Spot clean in Partek Flow is a task that aims to improve expression estimates by correcting for spot swapping.
The task can only be invoked from the Space Ranger task output data node since it takes the raw count matrix as input. To run the Spot clean task in Flow:
Click the Single cell counts outputted from Space Ranger (Figure 1)
Click Pre-analysis tools in the toolbox
Click Spot clean
Click Finish to run the task with default settings
Another single cell counts node will be generated. This data node contains a matrix of decontaminated gene expression counts (Figure 2). Downstream analysis tasks, such as normalization, PCA, and ANOVA, can be performed on the new single cell counts node.
Parameters in this task that you can adjust include:
Gene cutoff: Filter out genes with average expressions among tissue spots below or equal to this cutoff. Default: 0.1.
Max iteration: Maximum iteration for EM parameter updates. Default: 10. Set a smaller number to save computation time.
Ni, Z., Prasad, A., Chen, S. et al. SpotClean adjusts for spot swapping in spatial transcriptomics data. Nat Commun 13, 2971 (2022). https://doi.org/10.1038/s41467-022-30587-y
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
This task does not need an annotation model file, since the annotation is retrieved from the BAM file itself. The sequence names in the BAM files constitute the features against which the reads are quantified.
This task is generally performed on reads aligned to a transcriptome, e.g. when a species does not have a genome reference and the BAM files contain transcriptome information. In this case, the features for the quantification task are the reference sequence names in the input BAM files.
There are two parameters in Quantify to reference (Figure 1):
Min coverage: will filter out any features (sequence names) that have fewer reads across all samples than the value specified
Strict paired-end compatibility: this only affects paired end data. When it is checked, only reads that have two ends aligned to the same feature will be counted. Otherwise, reads will still be counted as exonic compatible reads even if the mate is not compatible with the feature
During quantification:
We scan through each of the BAM files and find all the transcripts that meet the minimum coverage threshold.
With those transcripts, we "create" an annotation file in which the transcript name is the sequence name and both the Gene ID and the Transcript ID are set to the transcript name. The start position is 1 and the end position is the length of the transcript.
Effectively, what the annotation file does is filter out the low coverage transcripts.
Since we don't know where the transcripts are in the genome, chromosome view will display only one transcript at a time (i.e., the transcript names are treated like "chromosomes").
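A minimal sketch of the pseudo-annotation construction described in the steps above, assuming transcript lengths and total read counts have already been tallied:

```python
# Minimal sketch: each transcript passing the coverage filter becomes its own
# "chromosome" spanning positions 1..length, with Gene ID and Transcript ID
# both set to the transcript name.
def build_annotation(transcript_lengths, read_counts, min_coverage):
    rows = []
    for name, length in transcript_lengths.items():
        if read_counts.get(name, 0) >= min_coverage:  # low coverage filtered out
            rows.append({"seq_name": name,
                         "gene_id": name,
                         "transcript_id": name,
                         "start": 1,
                         "end": length})
    return rows

print(build_annotation({"tx1": 1500, "tx2": 900}, {"tx1": 42, "tx2": 3}, 10))
```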
In RNA-Seq data analysis, the most common step after alignment is to estimate gene and/or transcript expression abundance, with the expression level represented by read counts. There are three options in this step:
Pool cells combines RNA-Seq data from all cells of a particular cell type classification for each sample. In essence, Pool cells creates virtual bulk RNA-Seq data from single cell RNA-Seq data. Because it is virtual bulk RNA-Seq data, all the same tasks that can be performed on bulk RNA-Seq gene counts data in Partek Flow can be performed on the output of Pool cells.
Pool cells makes it easy to compare gene expression for a cell type of interest between experimental groups.
Before running Pool cells, you must classify the cells. To run Pool cells, select the data node with your classified cells and select Pool cells from the QA/QC section of the task menu (Figure 1).
Options for Pool cells are Sum, Maximum, Mean, and Median. Expression values for cells from the same sample with the same cell type classification will be merged using the chosen pooling method (Figure 2). Sum is selected by default. After choosing a pooling method, select Finish to run the Pool cells task.
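The pooling operation itself is a simple group-and-aggregate; the sketch below illustrates it with pandas (sample and cell type labels invented for the example).

```python
# Minimal sketch: merge expression values for cells sharing a sample and
# cell type classification, using the chosen pooling method.
import pandas as pd

expr = pd.DataFrame({"GeneA": [5, 3, 2, 8], "GeneB": [0, 1, 4, 2]})
expr["sample"] = ["S1", "S1", "S2", "S2"]
expr["cell_type"] = ["T cell", "T cell", "T cell", "B cell"]

pooled = expr.groupby(["cell_type", "sample"]).agg("sum")  # or max/mean/median
print(pooled)   # one virtual bulk row per cell type per sample
```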
Pool cells generates a counts data node for each classified cell type in the data set (Figure 3).
Each counts data node is equivalent to simulated bulk RNA-Seq counts data for a cell type. The same tasks that can be performed on bulk RNA-Seq counts data can be performed on Pool cells output data nodes, including normalization, filtering, PCA, and differential expression analysis.
The counts data of a cell type for each sample can be downloaded by clicking the counts data node and selecting Download data from the task menu. The counts data text file lists each sample and its pooled counts values (sum, maximum, mean, or median) for each feature (gene/transcript) in alphabetical order (Figure 4).
Cufflinks assembles transcripts and estimates transcript abundances on aligned reads. Implementation details are explained in Trapnell et al. [1]
The Cufflinks task has three options that can be configured (Figure 1):
Novel transcript: this option does not require an annotation reference; it performs de novo assembly to reconstruct transcripts and estimate their abundance
Annotation transcript: this option requires an annotation model to quantify the aligned reads to known transcripts based on the annotation file.
Novel transcript with annotation as guide: this option requires an annotation file to quantify the aligned reads to known transcripts as well as assemble aligned reads to novel transcripts. The results include all transcripts in the annotation file plus any novel transcripts that are assembled.
When the Use bias correction check box is selected, it will use the genome sequence information to look for overrepresented sequences and improve the accuracy of transcript abundance estimates.
Trapnell C, Williams B, Pertea G, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotech. 2010; 28:511-515.
In ChIP-seq or ATAC-seq analysis, a major challenge after detecting enriched regions or peaks is to compare samples and identify differentially enriched regions. In order to compare samples, a common set of regions must be identified and the number of reads mapping to each region quantified. The Quantify regions task addresses this challenge by generating a union set of unique regions and reporting the number of reads from each sample mapping to each region.
To run Quantify regions:
Click a Peaks data node
Click the Quantification section in the toolbox
Click Quantify regions
The Quantify regions task takes MACS2 output, a Peaks or Annotated Peaks data node, as its input. In a typical ATAC-Seq or ChIP-Seq analysis, MACS2 is configured to output a set of enriched regions or peaks for each experimental sample or group individually. Quantify regions takes these sets of regions and merges them into a union set of unique regions that it saves as a .bed file. To combine the region sets, overlapping regions between samples/groups are merged. Where overlap ends, a break point is created and a new region defined. All non-overlapping or unique regions from each sample/group are also included.
For example, consider an experiment where MACS2 detected enriched regions for two samples, Sample A and Sample B. In Sample A, a region is detected on chromosome 1 from 100bp to 300bp, chr1:100-300. In Sample B, a region is detected at chr1:160-360. The Quantify regions task will give the following union set of unique regions for these partially overlapping regions:
chr1:100-160 (region detected in Sample A only)
chr1:160-300 (region detected in both Sample A and Sample B)
chr1:300-360 (region detected in Sample B only)
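A minimal sketch of this breakpoint logic, assuming regions are given as (chromosome, start, end) tuples:

```python
# Minimal sketch: break overlapping regions at every start/end point and keep
# the intervals covered by at least one sample's region.
def union_regions(regions):
    out = []
    for chrom in {c for c, _, _ in regions}:
        spans = [(s, e) for c, s, e in regions if c == chrom]
        points = sorted({p for s, e in spans for p in (s, e)})
        for s, e in zip(points, points[1:]):
            if any(rs <= s and e <= re for rs, re in spans):  # covered?
                out.append((chrom, s, e))
    return sorted(out)

print(union_regions([("chr1", 100, 300), ("chr1", 160, 360)]))
# [('chr1', 100, 160), ('chr1', 160, 300), ('chr1', 300, 360)]
```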
After generating a .bed file with the union set of unique regions, Quantify regions performs quantification using the same algorithm as Quantify to annotation model, with the .bed file serving as the annotation model.
The Quantify regions dialog includes configuration options for generating the union set of unique regions and quantifying reads to the regions (Figure 1).
When regions from multiple samples are combined, a small offset in position between enriched regions in different samples can result in many very short unique regions in the union set. The Minimum region size option lets you filter out these very short regions. If a region is smaller than the specified cutoff, the region is excluded. By default, this is set to 50bp, but may need to be adjusted depending on the size of regions you expect to see in your assay.
To download the .bed file with the union set of unique regions, click the Quantify regions task node, click Task details, click the regions.bed file in the Output files section, and click Download.
Xing Y, Yu T, Wu YN, Roy M, Kim J, Lee C. An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs. Nucleic Acids Res. 2006; 34(10):3150-60.
Salmon [1] is a method for quantifying transcript abundance from raw FASTQ files; it generates a transcript-level count matrix as output.
Salmon will run on any FASTQ file input.
Click the data node containing your FASTQ files
Click the Quantification section in the toolbox
Click Salmon
Choose the Assembly and an Aligner index based on a gene annotation model (Figure 1)
The task generates two data nodes: the Transcript counts node contains NumReads, the estimated number of reads mapping to each transcript; the best estimate is often not an integer. The Gene counts node contains, for each gene, the sum of the reads from its corresponding transcripts.
Note: if you want to perform differential analysis, you need to add an offset to deal with 0 values.
When the reads are aligned to a genome reference, e.g. hg38, the quantification is performed on the transcriptome, so you need to provide an annotation model file for the transcriptome.
If the alignment was generated in Partek Flow, the genome assembly will be displayed as text on the top of the page (Figure 1), you do not have the option to change the reference.
If the BAM file is imported, you need to select the assembly to which the reads were aligned and the annotation model file you will use to quantify, both from the drop-down menus (Figure 2).
In the Quantification options section, when the Strict paired-end compatibility check button is selected, paired-end reads will be considered compatible with a transcript only if both ends are compatible with it. If it is not selected, reads where only one end has an alignment compatible with the transcript will also be counted for that transcript.
Minimum read overlap with feature can be specified in percentage of read length or number of bases. By default, a read has to be 100% within a feature. You can allow some overhanging bases outside the exonic region by modifying these parameters.
The Filter features option is a minimum-reads filter; by default, only features whose sum of reads across all samples is greater than or equal to 10 will be reported. To report all the features in the annotation file, set the value to 0.
Some library preparations reverse transcribe the mRNA into double stranded cDNA, thus losing strand information. In this case, the total transcript count will include all the reads that map to a transcript location. Others will preserve the strand information of the original transcript by only synthesizing the first strand cDNA. Thus, only the reads that have sense compatibility with the transcripts will be included in the calculation. We recommend verifying with the data source how the NGS library was prepared to ensure correct option selection.
In the Advanced options Configure dialog, the Strand specificity field controls strand matching: forward means the strand of the read must be the same as the strand of the transcript, while reverse means the read must be on the strand complementary to the transcript (Figure 3). The options in the drop-down list differ for paired-end and single-end data. For paired-end reads, the dash separates first- and second-in-pair, determined by the flag information of the read in the BAM file. Briefly, the paired-end Strand specificity options are:
No: Reads will be included in the calculation as long as they map to exonic regions, regardless of the direction
Auto-detect: The first 200,000 reads will be used to examine the strand compatibility with the transcripts. The following percentages are calculated on paired-end reads:
(1) If (first-in-pair same strand + second-in-pair same strand)/Alignments examined > 75%, Forward-Forward will be specified
(2) If (first-in-pair same strand + second-in-pair opposite strand)/Alignments examined > 75%, Forward-Reverse will be specified
(3) If (first-in-pair opposite strand + second-in-pair same strand)/Alignments examined > 75%, Reverse-Forward will be specified
(4) If none of the percentages exceeds 75%, the No option will be used
Forward - Reverse: this option is equivalent to the --fr-secondstrand option in Cufflinks [1]. First-in-pair is the same strand as the transcript, second-in-pair is the opposite strand to the transcript
Reverse - Forward: this option is equivalent to --fr-firststrand option in Cufflinks. First-in-pair is the opposite strand to the transcript, second-in-pair is the same strand as the transcript. The Illumina TruSeq Stranded library prep kit is an example of this configuration
Forward - Forward: Both ends of the read are matching the strand of the transcript. Generally colorspace data generated from SOLiD technology would follow this format
The single-end Strand specificity options are:
No: same as for paired-end reads
Auto-detect: same as for paired-end reads. All single-end reads are treated as first-in-pair reads
Forward: this option is equivalent to the --fr-secondstrand option in Cufflinks. The single-end reads are the same strand as the transcript
Reverse: this option is equivalent to --fr-firststrand option in Cufflinks. The single-end reads are the opposite strand to the transcript. The Illumina TruSeq Stranded library prep kit is an example of this configuration
If the Report unexplained regions check button is selected, an additional report will be generated on the reads that are considered not compatible with any transcripts in the annotation provided. Based on the Min reads for unexplained region cutoff, the adjacent regions that meet the criteria are combined and region start and stop information will be reported.
Depending on the annotation file, the output could be one or two data nodes. If the annotation file only contains one level of information, e.g. miRNA annotation file, you will only get one output data node. On the other hand, if the annotation file contains gene level and transcript level information, such as those from the Ensembl database, both gene and transcript level data nodes will be generated. If two nodes are generated, the Task report will also contain two tabs, reporting quantification results from each node. Each report has two tables. The first one is a summary table displaying the coverage information for each sample quantified against the specified transcriptome annotation (Figure 4).
The second table contains feature distribution information on each sample and across all the samples, number of features in the annotation model is displayed on the table title (Figure 5).
The bar chart displaying the distribution of raw read counts is helpful in assessing the expression level distribution within each sample. The X-axis is the read count range, the Y-axis is the number of features within the range, and each bar is a sample. Hovering your mouse over a bar displays the following information (Figure 6):
Sample name
Range of read counts, where "[" means inclusive and ")" means exclusive, e.g. [0,0] means 0 read counts; (0,10] means greater than 0 counts but less than or equal to 10 counts
Number of features within the read count range
Percentage of the features within the read count range
The coverage breakdown bar chart is a graphical representation of the reads summary table for each sample (Figure 7).
In the box-whisker plot, each box on the X-axis is a sample; the box represents the 25th and 75th percentiles, the whiskers represent the 10th and 90th percentiles, and the Y-axis represents the feature counts. When you hover over a box, detailed sample information is displayed (Figure 8).
In the sample histogram, each line represents a sample and the range of read counts is divided into 20 bins. Clicking a sample in the legend hides the line for that sample. Hovering over a circle displays detailed information about the sample and that specific bin (Figure 9). The information includes:
Sample name
Range of read counts, where "[" means inclusive and ")" means exclusive
Number of features within the read count range in the sample
The box whisker and sample histogram plots are helpful for understanding the expression level distribution across samples. This may indicate that normalization between samples might be needed prior to downstream analysis. Note that all four visualizations are disabled for results with more than 30 samples.
The output data node contains the raw reads of each sample for each feature (gene, transcript, miRNA, etc., depending on the annotation used). Click an output data node, e.g. the Transcript counts data node, and choose Download data from the context-sensitive menu on the right; the raw transcript counts can be downloaded in three different formats (Figure 10):
Partek Genomics Suite project format: this is a zip file; do not manually unzip it. Choose File > Import > Zipped project in Partek Genomics Suite to import the zip file into PGS.
Features on columns and Features on rows formats: these are .txt files that can be opened in any text editor or Microsoft Excel. For the Features on columns format, samples will be on rows; for the Features on rows format, samples will be on columns.
Xing Y, Yu T, Wu YN, Roy M, Kim J, Lee C. An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs. Nucleic Acids Res. 2006; 34(10):3150-60.
In complex projects, different data matrices (e.g. observations on rows and features on columns) need to be merged in order to achieve the analysis goals. For example, if two cell populations were identified on separate branches of the analysis pipeline, their expression matrices have to be combined before any joint downstream steps. Alternatively, if two assays (gene expression and protein expression) were performed on the same cells, the expression matrices have to be merged for joint analysis.
Merge matrices task is located in the Pre-analysis tools section of the toolbox and it can handle two scenarios: Merge cells/samples and Merge features (Figure 1). To start, select the first data node on the pipeline (e.g. single cell counts) and then select the Merge matrices task.
To use the Merge cells option, the data matrices (one or more) to be merged with the currently selected one should have the same features (e.g. genes), but distinct cells. Push the Select data nodes button and Partek Flow will display a preview of the pipeline; the data nodes that can be merged are shown in the color of their branch, while other data nodes are disabled (greyed out). Left-click on the data node that you want to merge with the current one and push the Select button; you can select multiple data nodes to merge. The selected node(s) will be shown under the Select data nodes button (Figure 2). If you made a mistake, use the Clear selection icon. Push Finish to proceed.
To use the Merge features option, the data matrices (one or more) to be merged with the currently selected one should have the same cells (or samples), but distinct features (e.g. gene and protein expression). Push the Select data nodes button and Partek Flow will display a preview of the pipeline; the data nodes that can be merged are shown in the color of their branch, while others are disabled (greyed out). Left-click on the data node that you want to merge with the current one and push the Select button. The selected node will be shown under the Select data nodes button. Repeat the procedure if you would like to merge additional nodes. If you made a mistake, use the Clear selection icon. Push Finish to proceed.
The output of the Merge matrices task is a Merged counts data node (Figure 3).
Count feature barcodes is a tool for quantifying the number of feature barcodes per cell from CITE-Seq, cell hashing, or other feature barcoding assays to measure protein expression. The input for Count feature barcodes is FASTQ files.
Count feature barcodes will run on any unaligned reads data node.
Click the data node containing your unaligned feature barcode reads
Click the Quantification section in the toolbox
Click Count feature barcodes
The task set up page allows you to configure the settings for your assay (Figure 1).
Choose the Prep kit from the drop-down menu
Check Map feature barcodes box (optional)
This is only necessary for processing data from 10X Genomics' Feature Barcoding assay (v3+ chemistry), which utilizes BioLegend TotalSeq-B. For all other assays, leave this box unchecked.
Choose the Barcode location
For BioLegend TotalSeq-A, choose bases 1-15. For BioLegend TotalSeq-B/C, choose bases 11-25. For other locations, select Custom and specify the start and stop positions.
Choose a Sequences text file
This tab-delimited text file should have the feature ID in the first column and the nucleotide sequence in the second column. Do not include column headers. See Figure 2 for an example; a sample of the file contents is also sketched after these steps.
Check Keep bam files box (optional)
This option will retain the alignment BAM files instead of automatically deleting them when the task is complete. An extra Aligned reads output data node will be produced on the task graph. This option is unchecked by default to save on disk space.
Click Finish to run
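For reference, a Sequences text file is plain tab-delimited text with no header row (columns are separated by a single tab); the feature IDs and barcode sequences below are invented purely for illustration:

```
CD3_tag	AACAAGACCCTTGAG
CD4_tag	TACCCGTAATAGCGT
CD8_tag	ATTGGCACTCAGATG
```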
The output of Count feature barcodes is a Single cell counts data node.
Count feature barcodes uses a series of tasks available independently in Partek Flow to process the input FASTQ files. The output files generated by these tasks are not retained in the Count feature barcodes output, with the exception of BAM files if Keep bam files is checked.
Quantify barcodes counts the number of UMIs per cell for each feature in the Sequences file. Quantify barcodes uses default settings.
To perform these steps individually instead of using the Count feature barcodes task, you will need to generate a FASTA and GTF file containing the feature barcode IDs and sequences instead of a text file and build an index file for the Bowtie aligner.
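As a rough illustration of that preparation step, the following Python sketch converts a sequences text file into a FASTA file that could be used to build a Bowtie index; the file names are hypothetical, and a matching GTF file would still need to be created separately:

```python
# Hypothetical helper: convert a feature barcode sequences file
# (feature ID <tab> sequence, no header) into FASTA records.
def sequences_txt_to_fasta(txt_path: str, fasta_path: str) -> None:
    with open(txt_path) as src, open(fasta_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            feature_id, sequence = line.rstrip("\n").split("\t")
            dst.write(f">{feature_id}\n{sequence}\n")

sequences_txt_to_fasta("feature_barcodes.txt", "feature_barcodes.fasta")
```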
Partek Flow has the flexibility to subsample your data for further downstream analyses. Filter data by:
A common task in bulk and single-cell RNA-Seq analysis is to filter the data to include only informative genes. Because there is no gold standard for what makes a gene informative or not, and ideal gene filtering criteria depend on your experimental design and research question, Partek Flow has a wide variety of flexible filtering options.
Filter features task can be invoked from any counts or single cell data node. Noise Reduction and Statistics Based filters take each feature and perform the specified calculation across all of the cells. The filter is applied to the values in the selected data node and the output is a filtered version of the input data node.
In the task dialog, select the filter option to activate the filter type and configure the filter, then click Finish to run.
The Noise reduction filter lets you exclude features that meet basic criteria (Figure 1).
Descriptive statistics you can choose are:
Coefficient of variation: std. dev divided by mean of the feature
Geometric mean: the nth root of the product of n numbers, where n is the number of values for the feature
Maximum: the highest value of a feature
Mean: the average value of a feature
Median: the middle value of a feature
Minimum: lowest value of a feature
Range: the difference between the highest and lowest values of a feature
Std dev.: the square root of the variance
Sum: total value of the feature
Variance: the average of the squared differences from the mean
Dispersion: variance divided by mean of the feature
For each of these you can choose to exclude features that are:
<: less than
<=: less than or equal to
==: equal to
>: greater than
>=: greater than or equal to
The threshold is set using the text box. The input must be a number; it can be an integer or decimal, positive or negative.
If you select value, you can also choose a percentage of samples or cells that must meet the criteria for the feature to be excluded (Figure 2).
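To make the exclusion logic concrete, here is a minimal pandas sketch of a noise-reduction style filter on a features-by-samples matrix; the matrix, thresholds, and names are invented, and this is only an illustration, not Partek Flow's actual implementation:

```python
import pandas as pd

# Toy features x samples count matrix (values invented for illustration)
counts = pd.DataFrame(
    {"s1": [0, 5, 100], "s2": [1, 0, 80], "s3": [0, 2, 90]},
    index=["geneA", "geneB", "geneC"],
)

# "Exclude features where maximum <= 1": keep everything else
filtered = counts[counts.max(axis=1) > 1]

# Value-based variant: drop a feature only if the value is <= 1
# in 100% of samples (i.e. every sample fails the cutoff)
filtered_value = counts[~(counts <= 1).all(axis=1)]
```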
The Statistics based filter lets you include a number or percentile of genes based on descriptive statistics (Figure 3).
Select Counts to specify a number of top features to include or select Percentiles to specify the top percentile of features to include.
Descriptive statistics you can choose are:
Coefficient of variation
Geometric mean
Maximum
Mean
Median
Minimum
Range
Standard deviation (std dev)
Sum
Variance
Dispersion
If the data is linked to a feature (gene) annotation, different fields in the annotation can be used to filter, e.g. genomic location information, gene biotype information, etc. (Figure 4)
You can combine different annotation fields using logical operations.
You can also filter features based on a feature list (Figure 5).
You can choose to include or exclude features in any list that you have added.
Use the Feature identifier option to choose which identifier from your annotation matches the values in the feature list.
The filter features task report lists the filter criteria, reports distribution statistics for the remaining features, and indicates the number and percentage of features that passed the filter (Figure 6).
If the input was a count matrix data node, sample box plot and sample histograms are provided to show the distribution of features after filtering. These plots are not available if the input was a single cell counts data node.
This task is only available on single cell matrix data nodes. It publishes one or more cell-level attributes to the project level, so the attributes can be edited and seen from all single cell count data nodes within the project. This function can be used on the annotate cell task output data node, a graph-based cluster data node, etc.
Click on a single cell counts data node
Choose Publish cell attributes to project in the Annotation/Metadata section of the toolbox
This invokes the dialog as (Figure 1)
From the drop-down list, select one or more attributes to publish. Only numeric attributes and categorical attributes with fewer than 1000 levels will be available in the list.
Click Finish at the bottom of the page. All of the selected attributes will be available to edit via the Data tab > Cell attributes Manage, and all data nodes in the project will be able to use those attributes.
The Downsample cells task randomly downsamples the number of cells in a single cell data set. It can be used to reduce large single cell datasets to small, manageable sizes for quick analysis. Another use case is a project with multiple samples, each having a different number of cells: Downsample cells can randomly select an equal number of cells for all the samples in the project. By default, the sample with the minimum number of cells determines the number of cells to be selected from the other samples. This default can be changed to a preferred number by the user. If the number selected by the user is greater than the number of cells in one or more samples, those samples will not be downsampled and all of their cells will be returned. If the number selected by the user is greater than the number of cells in all the samples, then none of the samples will be downsampled.
To run a downsample task, first click on a single cell counts data node. Go to the Filtering section and select the Downsample cells task (Figure 1).
Clicking on the Downsample cells task opens a dialog with the number of cells to downsample set to the minimum number of cells per sample in the project. In the figure below, the minimum number of cells in any of the samples was 2658, and this is used as the default. Click Finish to run the task (Figure 2).
Filter samples or cells in order to perform downstream analysis on a subset of data.
To filter groups, click a count matrix or single cell counts data node, click the Filtering section of the toolbox, and choose to Filter samples (bulk data) or Filter cells (single cell data).
The dialog lets you build a series of filters based on sample or cell attributes.
Click Finish to apply the filter. If no sample or cell will pass the filter criteria, a warning message will appear and the task will not run.
The first drop-down menu allows you to choose to include or exclude based on the specified criteria.
The second drop-down menu allows you to choose any categorical or numeric attribute to use for the filter criteria.
If the attribute is categorical, the third drop-down menu includes in and not in as options. A fourth drop-down menu allows you to search and choose from the levels of the selected attribute (Figure 1).
If the attribute is numeric, the third drop-down includes:
<: less than
<=: less than or equal to
==: equal to
>: greater than
>=: greater than or equal to
The threshold is set using the text box (Figure 2). The input must be a number; it can be an integer or decimal, positive or negative.
Using the OR and AND options, you can combine multiple filters.
When combining multiple filters all set to Include:
With AND, all statements must be true for the sample to meet the filter criteria.
With OR, if any statement is true, the sample will meet the filter criteria.
When combining multiple filters all set to Exclude:
With AND, if any statement is true, the sample will meet the filter criteria.
With OR, all statements must be true for the sample to meet the filter criteria.
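As a toy illustration of the Include logic above (the attributes and cutoffs are hypothetical):

```python
# One sample with two hypothetical numeric attributes
sample = {"age": 40, "dose": 10}

passes_f1 = sample["age"] > 30   # filter 1: Include samples with age > 30
passes_f2 = sample["dose"] > 20  # filter 2: Include samples with dose > 20

meets_with_and = passes_f1 and passes_f2  # AND: all statements must be true -> False
meets_with_or = passes_f1 or passes_f2    # OR: any statement may be true -> True
```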
The filter groups task report lists the filter criteria and reports feature distribution statistics for the remaining samples (Figure 3).
If the input was a count matrix data node, the percentage of samples remaining after the filter is listed and charts are provided to show the breakdown of samples by categorical attributes before and after filtering (Figure 4).
If the input was a single cell counts data node, a second table displays the details from each sample based on the filtered criteria (Figure 5).
If the input was a classified groups single cell counts data node, the cell count table includes a breakdown by classification and a bar chart is provided to show the number of cells from each classification remaining after filtering (Figure 6).
Click the add button to add a segment.
To enter the sequences manually, choose Manual for Sequences, then type or paste the adaptor sequences into the text field and click the add button (Figure 7). You must click the add button for the adaptor sequence to be included. You can remove any adaptor you have added by clicking the remove button.
You can add new prep kits from this page by clicking the add button.
You can preview a prep kit, delete a prep kit, or download a prep kit to your computer by clicking the corresponding icon.
Each task will appear as a separate section on the Data summary report (Figure 2). The first section of the report (Sample data) summarizes the input sample information. Click the grey arrows to expand and collapse each section. When expanded, the task name, the user that performed the task, the start date and time, the duration, and the output file size are displayed (Figure 2). To view or hide a table of task settings, click Show/hide details (Figure 3).
If some samples have been treated with the Mix 1 formulation and others have been treated with the Mix 2 formulation, choose the ExFold comparison radio button (Figure 2). Set up the pairwise comparisons by choosing the Mix 1 and Mix 2 samples that you wish to compare from the drop-down lists, followed by the green plus icon. The selected pair of samples will be added to the table below.
The browser icon in the right-most column of the Region average coverage summary table opens the Coverage graph for the respective region (Figure 5). The horizontal axis is the normalized position within the genomic feature, represented as the 1st to 100th percentile of the length of the feature. The vertical axis is coverage. Each line on the plot is a single sample, and the samples are listed below the plot.
The first icon invokes the Coverage graph across the genomic feature, showing the current sample only (Figure 23); you can also mouse over it to get a preview of the plot.
The second icon invokes the Chromosome view and browses to the genomic location.
To filter the high-quality cells, click the include selected cells icon in Filter in the top right of Select & Filter, and click Apply observation filter... (Figure 9).
| Phred Quality Score | Base Call Accuracy |
| --- | --- |
| 10 | 90% |
| 20 | 99% |
| 30 | 99.9% |
| 40 | 99.99% |
The Pre-alignment QA/QC output table contains one input file per row, with typical metrics on columns (%GC: fraction of GC content; %N: fraction of no-calls) (Figure 3). The file names are hyperlinks leading to the sample-level reports. To save the table as a txt file to a local computer, push the Download link. Table columns can be sorted using the double arrows icon.
The base composition plot shows the relative abundance of each base per position (Figure 6), with N standing for no-calls. By selecting individual bases on the legend, you can remove them from the plot or bring them back. To zoom in, left-click and drag over a region of interest. To zoom out, use the Reset button to recreate the original view, or the magnifier glass to zoom out one level.
To view different samples in the Data viewer, navigate to Axes under Configure, then click on the button under Content in the left panel (Figure 5).
Click on Show image to turn the background image on or off.
Coefficient of variation (CV): $CV = s / \bar{x}$, where $s$ is the standard deviation and $\bar{x}$ is the mean
Geometric mean: $\left( \prod_{i=1}^{n} x_i \right)^{1/n}$
Max: $\max(x_1, \ldots, x_n)$
Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Median: when $n$ is odd, the median is $x_{((n+1)/2)}$; when $n$ is even, the median is $\left( x_{(n/2)} + x_{(n/2+1)} \right) / 2$, where $x_{(i)}$ is the $i$-th smallest value
Median absolute deviation: $\mathrm{MAD} = \mathrm{median}\left( \left| x_i - \tilde{x} \right| \right)$, where $\tilde{x} = \mathrm{median}(x_1, \ldots, x_n)$
Min: $\min(x_1, \ldots, x_n)$
Standard deviation: $s = \sqrt{s^2}$, where $s^2$ is the variance
Sum: $\sum_{i=1}^{n} x_i$
Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2$
The output data node will display a similar Task report as the task.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The annotation file contains the features the aligned reads will be quantified to. For more information about adding an annotation model, please see the library file management documentation.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Quantification options are the same as in the Quantify to annotation model (Partek E/M) dialog. The Percent of read length is set to 50% by default to account for small offsets in position between enriched regions in different samples.
Quantify regions generates a counts data node with the number of counts in each region for each sample. This data node can be annotated with gene information using the Annotate regions task and analyzed using tasks that take counts data as input, such as normalization, PCA, and ANOVA. For ChIP-Seq experiments with input control samples, the task can be used prior to downstream analysis.
Similar to the Quantify to annotation model task report, the Quantify regions task report includes feature distribution information, including a descriptive stats table, a distribution bar chart, a sample box plot, and a sample histogram (Figure 2).
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Salmon index cannot be built on a reference assembly alone; you need to provide a transcript annotation file beforehand, and the index will be built on the transcript annotation model. For more information about adding an aligner index based on an annotation model, see the library file management documentation.
Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017; 14(4): 417-419. doi: 10.1038/nmeth.4197.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
If the Require junction reads to match introns check button is selected, only junction reads that overlap with exonic regions and match the skipped bases of an intron in the transcript will be included in the calculation. Otherwise, as long as the reads overlap within the exonic region, they will be counted. Detailed information about read compatibility can be found in the white paper.
In the annotation file, there might be multiple features at the same location, or one read might have multiple alignments, so the read count of a feature might not be an integer. Our white paper has more details on Partek's implementation of the E/M algorithm initially described by Xing et al. [1]
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
For a practical example using Merge matrices, please see our tutorials.
Depending on your goal, you may want to consider a different approach. For example, data matrices based on two different assays (e.g. gene and protein expression) can be combined using the Merge features option. Instead of merging two (or more) cell populations using Merge cells, you may want to use filtering (Filtering > Filter groups) to exclude the populations that you do not consider relevant, or to keep only the populations of interest.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
For more details on adding Prep kits, please see our documentation. The prep kit should include the cell barcode and unique molecular identifier (UMI) locations.
Trim tags identifies the UMI and cell barcode sequences. The Prep kit is specified using the Prep kit setting.
Trim bases trims the insert read to include only the feature barcode sequence. Trim bases is set to Both ends for Trim based on, with the start and stop set by the Barcode location preference and the Min read length set to 1.
Bowtie is used to align the reads to the sequences specified in the Sequences text file. Bowtie is set to Ignore quality limit for the Alignment mode. Other settings are left at their defaults.
Deduplicate UMIs consolidates duplicate reads based on UMIs. Deduplicate UMIs is set to Retain only one alignment per UMI.
Filter barcodes filters the cell barcodes to include cells and not empty droplets. Filter barcodes is set to Automatic.
For more information, please see the library file management documentation.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
If you have added feature lists in Partek Flow using the List management feature, the filter using Saved list option will be available. Otherwise, you can specify a Manual list by typing in the Filter criteria box.
If you choose Saved list, the drop-down list will display all the feature lists added in List management; if you choose Manual list, you can manually type the feature IDs/names in the box, one feature per row.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
After selection, click on the green plus button to add the attribute; its name can be changed by typing in the New name box (Figure 2).
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
To ensure that different data sets are comparable, several normalization and scaling options are available in Partek Flow. These include newly-developed algorithms specifically tailored for genomic analysis.
When a project contains multiple libraries, the data might contain variability due to technical differences (e.g. sequencing machine, library prep kit, etc.) in addition to biological differences (e.g. treatment, genotype, etc.). Batch removal is essential to remove this noise and discover biological variation.
Powerful Partek Flow statistical analysis tools help identify differential expression patterns in the dataset. These can take into account a wide variety of data types and experimental designs.
This task replaces missing values in the data with estimated values based on the selected method.
First, select whether the computation is based on samples/cells or on features, then click Finish to replace the missing values. Some methods will generate the same results regardless of which transform option is selected, e.g. constant value. Others will generate different results:
Constant value: specify a value to replace the missing data
Maximum: use the maximum value of samples/cells or features, depending on the transform option, to replace missing data
Mean: use the mean value of samples/cells or features, depending on the transform option, to replace missing data
Median: use the median value of samples/cells or features, depending on the transform option, to replace missing data
Minimum: use the minimum value of samples/cells or features, depending on the transform option, to replace missing data
K-nearest neighbor (mean): specify the number of neighbors (N); the Euclidean metric is used to compute neighbors, and the mean of the N neighbors replaces the missing data
K-nearest neighbor (median): specify the number of neighbors (N); the Euclidean metric is used to compute neighbors, and the median of the N neighbors replaces the missing data
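For intuition, here is a simplified NumPy sketch of K-nearest neighbor (mean) imputation across samples; the handling of missing entries in the distance computation is an assumption, and this is not Partek Flow's actual implementation:

```python
import numpy as np

def knn_impute_mean(X: np.ndarray, n_neighbors: int = 3) -> np.ndarray:
    """X: samples x features matrix, with np.nan marking missing values."""
    X = X.copy()
    for i in range(X.shape[0]):
        missing = np.isnan(X[i])
        if not missing.any():
            continue
        # Distance to every other sample over features observed in both
        dists = []
        for j in range(X.shape[0]):
            if j == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if shared.any():
                d = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
                dists.append((d, j))
        neighbors = [j for _, j in sorted(dists)[:n_neighbors]]
        # Replace each missing entry with the mean of the neighbors' values
        for f in np.where(missing)[0]:
            vals = [X[j, f] for j in neighbors if not np.isnan(X[j, f])]
            if vals:
                X[i, f] = np.mean(vals)
    return X

# Example: impute one missing value using the 2 nearest samples
X = np.array([[1.0, 2.0, 3.0],
              [1.1, np.nan, 3.2],
              [0.9, 2.1, 2.8]])
print(knn_impute_mean(X, n_neighbors=2))
```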
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Single cell RNA-seq gene expression counts are zero inflated due to inefficient mRNA capture. This normalization task is based on MAGIC (Markov Affinity-based Graph Imputation of Cells) [1] and recovers gene expression lost due to drop-out. The method is limited to input data nodes with at most 50K cells.
To invoke this task, click on a normalized data node with fewer than 50K cells. The task first computes PCA and uses the specified number of PCs to impute (Figure 1).
Click Finish to run the task; it will output the imputed low-expression matrix in the output report node.
References
van Dijk D, et al. MAGIC: A diffusion-based imputation method reveals gene-gene interactions in single-cell RNA-sequencing data. bioRxiv.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
In droplet-based single cell isolation and library prep methods, each droplet is labeled by a unique nucleotide barcode. Because not all droplets will contain cells, it is important to filter out nucleotide barcodes that correspond to empty droplets prior to downstream analysis.
In Partek Flow, you can filter barcodes interactively using a knee plot after UMI deduplication in the UMI deduplication task report or after quantification in the Cell barcode QA/QC task report. Alternatively, you can filter using preset options in the Filter barcodes task.
To invoke Filter barcodes:
Click a Deduplicated reads data node
Click the Filtering section of the toolbox
Click Filter barcodes
You can choose to filter barcodes using three or four options, depending on whether you have already run a Filter barcodes task for your samples in the project (Figure 1).
The automatic filter threshold is set for each sample individually. It picks the cutoff between cells and empty droplets by identifying where the UMI content per barcode drops precipitously when moving in descending order from the barcode with the highest number of UMIs.
Set the number of cells per sample to include. This is set for all samples; if set to 100, the top 100 barcodes by total UMI count for each sample will be retained.
Set the percent of reads in cells per sample to include. The number of barcodes included will be set to match the specified percent of reads in cells for each sample. Barcodes are included starting with the barcode with the highest number of total UMIs and proceeding in descending order of total UMIs per barcode until the specified percent of reads has been met or exceeded.
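The two preset cutoffs can be sketched in a few lines of NumPy; the UMI counts below are invented, and Partek Flow's exact tie-breaking may differ:

```python
import numpy as np

# Total UMIs per barcode for one sample (invented values)
umis = np.array([5000, 4200, 3900, 150, 40, 12, 3])
order = np.argsort(umis)[::-1]   # barcode indices in descending UMI order

# "Number of cells" cutoff: keep the top N barcodes (e.g. N = 3)
top_n_barcodes = order[:3]

# "Percent of reads in cells" cutoff: keep barcodes in descending order
# until their cumulative UMI share meets or exceeds the target (e.g. 90%)
cum_frac = np.cumsum(umis[order]) / umis.sum()
n_keep = int(np.searchsorted(cum_frac, 0.90)) + 1
percent_barcodes = order[:n_keep]
```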
If you have already run a Filter barcodes task for your samples in the project, the Previous filter option will be available. This option lets you filter to the same cell barcodes that were included by the previous filter. This option is particularly useful for CITE-Seq data, where antibody barcodes and gene expression data must be processed separately, but you will want to analyze the same cell barcodes in downstream steps.
Selecting Previous filter opens a table with information about the previous barcode filters in your project (Figure 2).
Use the radio buttons in the first column to pick which filter you want to use.
After configuring the task, click Finish to run.
Filter barcodes produces a Filtered reads data node (Figure 4). Filter barcodes does not have a task report.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Split by attribute task is used to split a data node into separate nodes based on the groups of a categorical attribute; each output data node includes only the samples/cells from one group. It is a more efficient way to filter your data if you plan to perform downstream analysis separately on each and every group of an attribute.
Click on the data node and select Split by attribute from the Filtering section in the task menu (Figure 1).
Select the attribute to split the data on. In this case, data will be split according to the Age attribute (Figure 2).
The result of the Split by attribute task will be two separate data nodes, each containing the samples from one age group (Figure 3).
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
This normalization is performed on observations (samples) using internal control features (genes). The internal control features, usually housekeeping genes, should not vary among samples [1]. The implementation details are as follows:
1. Compute the geometric mean of all the control features (g1 to gm) within each sample, yielding GS1 to GSn (m is the number of control features, n is the number of samples).
2. Compute the geometric mean across all samples (GS1 to GSn), represented by GS.
3. Compute the scaling factor for each sample: S1 = GS1/GS, S2 = GS2/GS, ..., Sn = GSn/GS.
4. Normalize all gene expression values by dividing each by its sample's scaling factor.
Note: The input data node must contain all positive values to compute geometric mean.
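The four steps can be written out in a short NumPy sketch; the matrix and the choice of control rows are invented for illustration:

```python
import numpy as np

# Features x samples matrix of positive values; rows 0 and 1 are the
# (hypothetical) housekeeping control features
counts = np.array([[10.0, 20.0, 15.0],
                   [ 8.0, 16.0, 12.0],
                   [50.0, 90.0, 70.0]])
control = counts[[0, 1], :]

G_s = np.exp(np.log(control).mean(axis=0))  # step 1: per-sample geometric mean
G = np.exp(np.log(G_s).mean())              # step 2: geometric mean across samples
k = G_s / G                                 # step 3: per-sample scaling factors
normalized = counts / k                     # step 4: divide by the scaling factor
```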
Select the Normalize to housekeeping genes task in the Normalization and scaling section of the pop-up menu when a count matrix data node is selected. The dialog will list all the features included in the data node on the left panel.
Select control genes on the left panel and move them to the right panel. You can also use the search box to find a feature and click the plus button to add it to the right panel.
Click Finish
Vandesompele J, De Preter K, Pattyn F, Poppe B, Van Roy N, De Paepe A, Speleman F. Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biology. 2002.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Latent semantic indexing (LSI) was first introduced for the analysis of scATAC-seq data by Cusanovich et al. 2018 [1]. LSI combines steps of term frequency-inverse document frequency (TF-IDF) normalization followed by singular value decomposition (SVD). Partek Flow wrapped Signac's TF-IDF normalization for single cell ATAC-seq datasets. It is a two-step normalization procedure that normalizes across cells to correct for differences in cellular sequencing depth, and across peaks to give higher values to more rare peaks [2].
TF-IDF normalization in Flow can be invoked in Normalization and scaling section by clicking any single cell counts data node (Figure 1).
To run TF-IDF normalization,
Click a single cell counts data node
Click the Normalization and scaling section in the toolbox
Click TF-IDF normalization
The output of TF-IDF normalization is a new data node that has been normalized by log(TF x IDF). We can then use this new normalized matrix for downstream analysis and visualization (Figure 2).
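For intuition, a rough NumPy sketch of a log(TF x IDF) transform on a peaks-by-cells matrix is shown below; the scale factor and exact formula are assumptions and may differ in detail from the Signac implementation wrapped by Partek Flow:

```python
import numpy as np

# Toy peaks x cells count matrix (values invented)
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [3.0, 0.0, 0.0]])

tf = X / X.sum(axis=0, keepdims=True)            # term frequency: per-cell depth correction
idf = X.shape[1] / X.sum(axis=1, keepdims=True)  # inverse document frequency: up-weights rare peaks
tfidf = np.log1p(tf * idf * 1e4)                 # log-scaled TF-IDF matrix
```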
Cusanovich DA, Reddington JP, Garfield DA, et al. The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature. 2018; 555: 538-542. https://doi.org/10.1038/nature25981
Stuart T, Srivastava A, Madad S, Lareau CA, Satija R. Single-cell chromatin state analysis with Signac. Nat Methods. 2021; 18: 1333-1341.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
This method is based on a general linear model. Much like ANOVA in reverse, it calculates the variation attributed to the factor(s) being removed, then adjusts the original values to remove that variation.
By including batch in the differential analysis model, the variability due to the batch effect is accounted for when calculating p-values. In this sense, batch effects are best handled as part of the differential analysis model. However, clustering data or visualizing biological effects can be very difficult if batch effects are present in the original data. We transform the original values to remove the batch effect using this tool.
We recommend normalizing your data prior to removing batch effects, but the task will run on any counts data node.
Click the counts data node
Click the Batch removal section of the toolbox
Click General linear model
The batch effect removal dialog is similar to the dialog for ANOVA. To set up the model, you need to choose which attributes should be considered. Here, you should include the batch attribute, any attributes that interact with batch, and the interactions between these attributes.
For example, in the case where you have different cell types and batch may have a different effect on different cell types, you would need to include batch, cell type, and the interaction between batch and cell type. Here, batch is the attribute Version and cell type is the attribute Classification.
Click Version and Classification
Click Add factors
Click Version and Classification
Click Add interaction (Figure 1)
To remove the batch effect and its interaction with cell type, we can click the Remove checkbox for both Version and Version*Classification.
Click the Remove checkbox for Version and Version*Classification
Click Finish to run (Figure 2)
The output of General linear model batch removal is a new data node, Batch effect adjusted counts. This data node contains the batch-effect-corrected values, which can be used as the input for downstream tasks such as clustering and UMAP (Figure 3).
The advanced options for Remove batch effect are shared by ANOVA/LIMMA-trend/LIMMA-voom.
Library size normalization is the simplest strategy for performing scaling normalization. But composition biases will be present when any unbalanced differential expression exists between samples. The removal of composition biases is a well-studied problem for bulk RNA sequencing data analysis. However, single-cell data can be problematic for these bulk normalization methods due to the dominance of low and zero counts[1]. To overcome this, Partek Flow wrapped the calculateSumFactors() function from R package scran. It pools counts from many cells to increase the size of the counts for accurate size factor estimation. Pool-based size factors are then “deconvolved” into cell-based factors for normalization of each cell’s expression profile[1].
Scran deconvolution in Flow can be invoked in Normalization and scaling section by clicking any single cell counts data node (Figure 1).
To run Scran deconvolution,
Click a single cell counts data node
Click the Normalization and scaling section in the toolbox
Click Scran deconvolution
The first Scran deconvolution dialog asks you to select the cluster name from a drop-down list that includes all the attributes of this dataset. The selected cluster is an optional factor specifying which cells belong to which cluster, for deconvolution within clusters (Figure 2). Simply click the Finish button if you want to run the task with default settings.
The output of Scran deconvolution is a new data node that has been normalized by the pool-based size factors of each cell and log2 transformed. We can then use this new normalized matrix for downstream analysis and visualization (Figure 3).
Other parameters in this task that you can adjust include:
Pool size: A numeric vector of pool sizes, i.e., number of cells per pool.
Max cluster size: An integer scalar specifying the maximum number of cells in each cluster.
Enforce positive estimates: A logical scalar indicating whether linear inverse models should be used to enforce positive estimates.
Scaling factor: A numeric scalar containing scaling factors to adjust the counts prior to computing size factors.
Lun ATL, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0947-7
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Raw read counts are generated after quantification for each feature on all samples. These read counts need to be normalized prior to differential expression detection to ensure that samples are comparable.
This chapter covers the implementation of each normalization method. The Normalize counts option is available on the context-sensitive menu (Figure 1) upon selection of any quantified output data node or an imported count matrix:
Gene counts
Transcript counts
MicroRNA counts
Cufflinks quantification
Quantification
The format of the output is the same as the input data format; the output node is called Normalized counts. This data node can be selected and normalized further using the same task.
Select whether you want your data normalized on a per sample or per feature basis (Figure 2). Some transformations, e.g. log transformation, are performed on each value independently of the others, so you will get an identical result regardless of your choice.
The following normalization methods will generate different results depending on whether the transformation was performed on samples or on features:
Divided by mean, median, Q1, Q3, std dev, sum
Subtract mean, median, Q1, Q3, std dev, sum
Quantile normalization
Note that each task can only perform normalization on samples or on features. If you wish to perform both transformations, run two normalization tasks successively. To normalize the data, click on a method in the left panel, then drag and drop the method into the right panel. Add all the normalization methods you wish to perform; multiple methods can be added to the right panel and they will be processed in the order they are listed. You can change the order of methods by dragging each method up or down. To remove a method from the Normalization order panel, click the minus button to the right of the method. Click Finish when you are done choosing normalization methods.
For some data nodes, recommended methods are available:
Data nodes resulting from Quantify to annotation model (Partek E/M) or Quantify to reference (Partek E/M) contain raw read counts; the recommendation is Total count followed by Add 0.0001
The Cufflinks quantification data node outputs FPKM normalized read counts; the recommendation is Add 0.0001
If available, the Recommended button will appear. Clicking the button will populate the right panel (Figure 3).
Below is the notation that will be used to explain each method: Xsf is the value of sample S on feature F, and TXsf is the transformed value of sample S on feature F.
Absolute value TXsf = | Xsf |
Add TXsf = Xsf + C; a constant value C needs to be specified
Antilog TXsf = b^Xsf; a log base value b needs to be specified from the drop-down list; any positive number can be specified when Custom value is chosen
Arcsinh TXsf = arcsinh(Xsf); the hyperbolic arcsine (arcsinh) transformation is often used on flow cytometry data
CLR (centered log ratio) TXsf = ln((Xsf + 1)/geom(Xsf + 1) + 1), where geom is the geometric mean of either the observation or the feature. This method can be applied to protein expression data.
CPM (counts per million) TXsf = (10^6 x Xsf)/TMRs, where Xsf here is the raw read of sample S on feature F, and TMRs is the total mapped reads of sample S. If quantification is performed on an aligned reads data node, total mapped reads is the aligned reads. If quantification is generated from an imported read count text file, the total mapped reads is the sum of all feature reads in the sample.
Divided by: when mean, median, Q1, Q3, std dev, or sum is selected, the corresponding statistic will be calculated based on the transform on samples or features option. Example: if transform on Samples is selected, Divide by mean is calculated as TXsf = Xsf/Ms, where Ms is the mean of the sample. If transform on Features is selected, Divide by mean is calculated as TXsf = Xsf/Mf, where Mf is the mean of the feature.
Log TXsf = log_b(Xsf) A log base value b needs to be specified from the drop-down list; any positive number can be specified when Custom value is chosen
Logit TXsf = log_b(Xsf/(1 - Xsf)) A log base value b needs to be specified from the drop-down list; any positive number can be specified when Custom value is chosen
Lower bound A constant value C needs to be specified, if Xsf is smaller than C, then TXsf= C; otherwise, TXsf = Xsf
**Median ratio (DESeq2 only), Median ratio (edgeR)** These approaches are slightly different implementations of the method proposed by Anders and Huber (2010). The idea is as follows: for each feature, its expression is divided by the feature's geometric mean expression across the samples. Then, for a given sample, one takes the median of these ratios across the features and obtains a sample-specific size factor. The normalized expression is equal to the raw expression divided by the size factor. Median ratio (DESeq2 only) is present in the R DESeq2 package under the name "ratio". This method should be selected if DESeq2 differential analysis will be used downstream; since it is not on a per-million scale, it is not recommended for any differential analysis method other than DESeq2. Median ratio (edgeR) is present in the R edgeR package under the name "RLE". It is very similar to the Median ratio (DESeq2 only) method, but it uses a per-million scale.
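A compact NumPy sketch of the median-of-ratios idea follows (not the exact DESeq2/edgeR code; the counts are invented and assumed to be all positive):

```python
import numpy as np

counts = np.array([[100.0, 200.0, 150.0],   # genes x samples, all positive
                   [ 10.0,  20.0,  15.0],
                   [ 40.0,  80.0,  60.0]])

# Per-gene geometric mean across samples, then per-sample median of ratios
geo_mean = np.exp(np.log(counts).mean(axis=1, keepdims=True))
size_factors = np.median(counts / geo_mean, axis=0)
normalized = counts / size_factors
```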
Multiply by TXsf = Xsf x C A constant value C needs to be specified
Poscounts (DESeq2 only) A DESeq2 size factor estimation option. Compared with Median ratio, the poscounts method can be used when all genes contain a sample with a zero. It calculates a modified geometric mean by taking the nth root of the product of the non-zero counts. It is not on a per-million scale. Details are available in the DESeq2 documentation.
Quantile normalization A rank-based normalization method. For instance, if the transformation is performed on samples, it first ranks all the features in each sample. Say vector Vs is the sorted feature values of sample S in ascending order; a vector Vm is calculated as the average of the sorted vectors across all samples, and then each value in Vs is replaced by the value in Vm at the same rank. Detailed information can be found in [1].
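The rank-and-average procedure can be sketched as follows; ties are broken arbitrarily here, unlike the averaging a full implementation would use:

```python
import numpy as np

X = np.array([[5.0, 4.0, 3.0],    # features x samples (invented values)
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])

ranks = X.argsort(axis=0).argsort(axis=0)      # rank of each value within its sample
mean_sorted = np.sort(X, axis=0).mean(axis=1)  # Vm: average of the sorted vectors
normalized = mean_sorted[ranks]                # replace each value by its rank's average
```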
Rank This transformation replaces each value with its rank in the list of sorted values. The smallest value is replaced by 1 and the largest value is replaced by the total number of non-missing values, N. If there are no tied values, the result is a perfectly uniform distribution. In the case of ties, all tied values receive the mean rank.
Rlog The regularized log transformation is the method implemented in the DESeq2 package under the name rlog. It applies a transformation to remove the dependence of the variance on the mean. It should not be applied to zero-inflated data such as single cell RNA-seq raw count data. The output of this task should not be used for differential expression analysis, but rather for data exploration, like clustering, etc.
Round Round the value to the nearest integer.
RPKM (Reads per kilobase of transcript per million mapped reads [2]) TXsf = (10^9 x Xsf)/(TMRs x Lf), where Xsf is the raw read of sample S on feature F, TMRs is the total mapped reads of sample S, and Lf is the length of feature F.
If quantification is performed on an aligned reads data node, total mapped reads is the aligned reads. If quantification is generated from imported read count text file, the total mapped reads is the sum of all feature reads in the sample. If the feature is a transcript, transcript length Lf is the sum of the lengths of all the exons. If the feature is a gene, gene length is the distance between the start position of the most downstream exon and the stop position of the most upstream exon. See Bullard et al. for additional comparisons with other normalization packages [3]
For paired reads, the normalization option will show up as FPKM (Fragments per kilobase per million mapped reads) rather than RPKM. However, the calculations are the same.
Subtract When mean, median, Q1, Q3, std dev, or sum is selected, the corresponding statistic will be calculated based on the transform on samples or features option. Example: if transform on Samples is selected, Subtract mean is calculated as TXsf = Xsf - Ms, where Ms is the mean of the sample. If transform on Features is selected, Subtract mean is calculated as TXsf = Xsf - Mf, where Mf is the mean of the feature.
TMM (Trimmed mean of M-values) The scaling factors are produced according to the algorithm described in Robinson et al [4]. The paper by Dillies et al. [5] contains evidence that TMM has an edge over other normalization methods. The reference sample is randomly selected. When performing the trimming, for M values (fold change), the upper 30% and lower 30% are removed; for A values (absolute expression), the upper 5% and lower 5% are removed.
TPM (Transcripts per million as described in Wagner et al [6]) The following steps are performed:
Normalize the reads by the feature length. Here length is measured in kilobases but the final TPM values do not depend on the length unit. RPKsf = Xsf / Lf;
Obtain a scaling factor for sample s as Ks = 10^-6 × Σ (f=1 to F) RPKsf
Divide raw reads by the length and the scaling factor to get TPM TXsf = Xsf / Lf / Ks
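Putting the three steps together for a single sample (lengths in kilobases; all values invented):

```python
import numpy as np

counts = np.array([500.0, 1000.0, 200.0])  # raw reads Xsf for one sample
lengths_kb = np.array([2.0, 5.0, 0.5])     # feature lengths Lf in kilobases

rpk = counts / lengths_kb                  # step 1: reads per kilobase
k_s = rpk.sum() / 1e6                      # step 2: scaling factor Ks
tpm = rpk / k_s                            # step 3: TPM values (sum to 10^6)
```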
Upper quartile
The method is exactly the same as the LIMMA package [7]. The following is the simple summarization of the calculation:
Remove all the features that have 0 reads in all samples.
Calculate the effective library size per sample: effective library size = (raw library size (in millions))*((upper quartile for a particular sample)/ (geometric mean of upper quartiles in all the samples))
Get the normalized counts by dividing the raw counts per feature by the effective library size (for the respective sample)
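A NumPy sketch of this effective-library-size calculation (counts invented; all-zero features assumed already removed):

```python
import numpy as np

counts = np.array([[ 10.0,  20.0,  15.0],   # features x samples
                   [100.0, 250.0, 120.0],
                   [  5.0,   8.0,   4.0],
                   [ 60.0,  90.0,  75.0]])

lib_size_m = counts.sum(axis=0) / 1e6   # raw library size in millions
uq = np.percentile(counts, 75, axis=0)  # upper quartile per sample
eff_lib = lib_size_m * uq / np.exp(np.log(uq).mean())  # effective library size
normalized = counts / eff_lib           # normalized counts per sample
```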
The Normalization report includes the Normalization methods used, a Feature distribution table, Box-whisker plots of the Expression signal before and after normalization, and Sample histogram charts before and after normalization. Note that all visualizations are disabled for results with more than 30 samples.
A summary of the normalization methods performed. They are listed by the order they were performed.
A table that presents descriptive statistics on each sample, the last row is the grand statistics across all samples (Figure 4).
These box-whisker plots show the expression signal distribution for each sample before and after normalization. When you mouse over a bar in the plot, a balloon shows detailed percentile information (Figure 5).
A histogram is displayed for the data before and after normalization. Each line is a sample; the X-axis is the range of the data in the node and the Y-axis is the frequency of values within the range. When you mouse over a circle, which represents the center of an interval, detailed information appears in a balloon (Figure 6). It includes:
The sample name.
The range of the interval: "[" means inclusive, ")" means exclusive
The frequency value within the interval
Bolstad BM, Irizarry RA, Astrand M, Speed, TP. A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics. 2003; 19(2): 185-193.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008; 5(7): 621–628.
Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010; 11: 94.
Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010; 11: R25.
Dillies MA, Rau A, Aubert J et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform. 2013; 14(6): 671-83.
Wagner GP, Kin K, Lynch VJ. Measurement of mRNA abundance using RNA-seq data. Theory Biosci. 2012; 131(4): 281-5.
Ritchie ME, Phipson B, Wu D et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015; 43(15):e97.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The SCTransform task performs the variance stabilizing normalization proposed in [1]. The task's interface follows that of the SCTransform() function in R [2]. SCTransform v2 [3] provides the ability to perform downstream differential expression analyses, in addition to improvements in running speed and memory consumption. v2 is the default method in Flow.
We recommend performing the normalization on a single cell raw count data node. Select SCTransform task in Normalization and scaling section on the pop-up menu to invoke the dialog (Figure 1).
By default, it will generate a report on all the input features. By unchecking Report all features, users can limit the results to a certain number of features with the highest variance.
In Advanced options, users can click Configure to change the default settings (Figure 2).
Scale results: whether to scale residuals to have unit variance; default is FALSE.
Center results: when set to Yes, center all the transformed features to have zero mean expression. Default is TRUE.
VST v2: default is TRUE. When set to v2, it sets method = glmGamPoi_offset, n_cells = 2000, and exclude_poisson = TRUE, which causes the model to learn theta and intercept only, besides excluding Poisson genes from learning and regularization. If the default is unchecked, the original sctransform model (v1) is used and only the SC scaled data node will be generated.
There are two data nodes generated from this task (if VST v2 option is checked as default):
SC scaled data: a matrix of normalized values (residuals) that by default has the same size as the input data set. This data node is used to perform downstream exploratory analysis, e.g. PCA, Seurat3 integration, etc. (Figure 3); it is not recommended for differential analysis.
SC corrected data: equivalent to the 'corrected counts' in the data slot generated after the PrepSCTFindMarkers task in the SCT assay of a Seurat object. It is used for downstream differential expression (DE) analyses (Figure 3).
Note: When performing DE analysis with Hurdle, the 'shrinkage of error term variance' option might need to be turned off depending on the dataset. Similarly, the 'Lognormal with shrinkage/voom' option needs to be turned off when running DE with GSA.
References
Christoph Hafemeister, Rahul Satija. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. https://doi.org/10.1101/576827
SCTransform() documentation https://www.rdocumentation.org/packages/Seurat/versions/3.1.4/topics/SCTransform
Saket Choudhary, Rahul Satija. Comparison and evaluation of statistical error models for scRNA-seq. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02584-9
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
If your experimental design includes a sample or a group of samples serving as a baseline control, you can normalize the experimental samples by subtracting or dividing by the baseline sample(s) using the Normalize to baseline task in Partek Flow. For example, in PCR experiments, the delta Ct values of control samples are subtracted from the delta Ct values of experimental samples to obtain delta-delta Ct values for the experimental samples.
The Normalize to baseline option is available in the Normalization and Scaling section of the context-sensitive menu (Figure 1) upon selection of any count matrix data node.
There are three options to choose the baseline samples:
use all samples
use a group
use matched pairs
To normalize data to all the samples, choose to calculate the baseline using the mean or median of all samples for each feature, choose subtract baseline or ratio to baseline as the normalization method (Figure 2), and click Finish.
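As a small sketch of the two normalization methods with a mean-of-all-samples baseline (values invented):

```python
import numpy as np

X = np.array([[2.0, 4.0, 6.0],            # features x samples
              [1.0, 3.0, 5.0]])
baseline = X.mean(axis=1, keepdims=True)  # per-feature baseline (mean of all samples)

subtracted = X - baseline   # "subtract baseline"
ratio = X / baseline        # "ratio to baseline"
```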
Use a group to create baseline
When there is a subset of samples that serve as the baseline in the experiment, select use group for Choose baseline samples. The specific group should be specified using sample attributes (Figure 3).
Choose use group and select the attribute containing the baseline group information, e.g. Treatment in this example; the samples with the group Control for the Treatment attribute are used as the baseline. The control samples can be filtered out after normalization by selecting the Remove baseline samples after normalization check box.
When using matched pairs, one sample from each pair serves as the control. An attribute specifying the pairs must be selected in addition to an attribute designating which sample in each pair is the baseline sample (Figure 4).
After normalization, all values for the control sample will be either 0 or 1 depending on the normalization method chosen, so we recommend removing baseline samples when using matched pairs.
The output of Normalize to baseline is a Normalized counts data node.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The ANOVA method applies a specified lognormal model to all the features.
To set up the ANOVA model or the alternative Welch's ANOVA (which is used on normally distributed data that violates the assumption of homogeneity of variance), select factors from the sample attributes. The factors can be categorical or numeric attributes. Click a check button to select a factor and click the Add factors button to add it to the model (Figure 1).
LIMMA-trend and LIMMA-voom setup dialogs are identical to ANOVA's setup.
Note: the LIMMA-voom method can only be invoked on output data nodes from normalization methods that produce library sizes:
TMM, CPM, Upper quartile, Median ratio, Poscounts
When more than one factor is selected, click Add interaction button to add interaction term of the selected factors.
Once a factor is added to the model, you can specify whether the factor is a random effect (check Random check box) or not.
Most factors in an analysis of variance are fixed factors, i.e. the levels of that factor represent all the levels of interest. Examples of fixed factors include gender, treatment, genotype, etc. However, in more complex experiments, a factor can be a random effect, meaning the levels of the factor only represent a random sample of all of the levels of interest. Examples of random effects include subject and batch. Consider an example where one factor is treatment (with levels treated and control), and another factor is subject (the subjects selected for the experiment). In this example, Treatment is a fixed factor since the levels treated and control represent all conditions of interest. Subject, on the other hand, is a random effect since the subjects are only a random sample of all the levels of that factor. When a model has both fixed and random effects, it is called a mixed model.
When more than one factor is added to the model, click on the Cross tabulation link at the bottom to view the relationship between the factors in a different browser tab (Figure 2).
Once the model is set, click on Next button to setup comparisons (contrasts) (Figure 3).
Start by choosing a factor or interaction from the Factor drop-down list. The subgroups of the factor or interaction will be displayed in the left panel; click to select one or more levels or subgroup names and move them to one of the boxes on the right. The ratio/fold change calculation for the comparison will use the group in the top box as the numerator and the group in the bottom box as the denominator. When multiple levels (groups) are in the numerator and/or denominator boxes, the behavior depends on the mode: in Combine mode, clicking the Add comparison button combines all numerator levels and all denominator levels into a single comparison in the Comparison table below; in Pairwise mode, clicking the Add comparison button splits the numerator and denominator levels into a factorial set of comparisons. In other words, it adds a comparison for every numerator level paired with every denominator level to the Comparison table. Multiple comparisons from different factors can be added from the specified model.
Click Configure to customize the Advanced options (Figure 4).
The Low-expression feature and Multiple test correction sections are the same as the matching GSA advanced options; see the GSA advanced options section.
Report option
Use only reliable estimation results: there are situations when a model estimation procedure does not fail outright but still encounters difficulties. In this case, it can generate p-values and fold changes for the comparisons, but they are not reliable, i.e. they can be misleading. Therefore, the default of Use only reliable estimation results is set to Yes.
Report p-value for effects: if set to No, only the comparison p-values are displayed in the report; the p-values of factors and interaction terms are not shown in the report table. If set to Yes, type III p-values are displayed for all non-random terms in the model, in addition to the comparison p-values.
Shrinkage to error term variance: by default, None is selected, which corresponds to the lognormal model. The Limma-trend and Limma-voom options are lognormal with shrinkage (Limma-trend is the same as the GSA default option, lognormal with shrinkage). The shrinkage options are recommended for designs with small sample sizes; no random effects can be included when shrinkage is performed. If there are numeric factors in the model, partial correlations cannot be reported for those factors when shrinkage is performed. Limma-trend works well if the ratio of the largest library size to the smallest is no more than about 3-fold; it is simple and robust for any type of data. Limma-voom is recommended for sequencing data when library sizes vary substantially, but it can only be invoked on data nodes normalized with methods that produce library sizes (see the note above), whereas Limma-trend can be applied to data normalized by any method.
Report partial correlations: if the model has numeric factor(s), choosing Yes displays the partial correlation coefficient(s) of the numeric factor(s) in the result table; choosing No hides them.
Data has been log transformed with base: shows the current scale of the input data for this task.
Since a single model is used for all features, there are no pie charts for design models and response distribution information. The Gene list table format is the same as in the GSA report.
Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS B, 57, 289-300.
Storey, J.D. (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics, 31, 2013-2035.
Auer, P.L., Doerge, R.W. (2011). A two-stage Poisson model for testing RNA-Seq data. Statistical Applications in Genetics and Molecular Biology, 10(1).
Burnham, K.P., Anderson, D.R. (2010). Model Selection and Multimodel Inference. Springer.
Law, C.W., Chen, Y., Shi, W., Smyth, G.K. (2014). Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15, R29.
Anders, S., Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11, R106.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
GSA stands for gene specific analysis. Its goal is to identify the statistical model that is best for each specific gene among all the selected models, and then use that best model to calculate the p-value and fold change.
The first step of GSA is to choose which attributes to include in the test (Figure 1). All sample attributes, numeric and categorical, are displayed in the dialog; use the check buttons to select among them. An experiment with two attributes, Cell type (with groups A and B) and Time (time points 0, 5, 10), is used as the example in this section.
Click Next to display the levels of each attribute to be selected for sub-group comparisons (contrasts).
To compare A vs. B, select A for Cell type on the top, B for Cell type on the bottom and click Add comparison. The specified comparison is added to the table below (Figure 2).
To compare Time point 5 vs. 0, select 5 for Time on the top, 0 for Time on the bottom, and click Add comparison (Figure 3).
To compare cell types at a certain time point, e.g. time point 5, select A and 5 on the top, and B and 5 on the bottom. Thereafter click Add comparison (Figure 4).
Multiple comparisons can be computed in one GSA run; Figure 5 shows the above three comparisons are added in the computation.
In terms of the design pool, i.e. the choices of model designs to select from, the two factors in this example lead to seven possibilities:
Cell type
Time
Cell type, Time
Cell type, Cell type * Time
Time, Cell type * Time
Cell type * Time
Cell type, Time, Cell type * Time
In GSA, if a 2nd-order interaction term is present in the design, then all first-order terms must be present as well; that is, if the Cell type * Time interaction is present, both factors must be included in the model. In other words, the following designs are not considered:
Cell type, Cell type * Time
Time, Cell type * Time
Cell type * Time
If a comparison is added, models that do not contain the comparison's factors are also eliminated. E.g. if a comparison of Cell type A vs. B is added, only designs that include the Cell type factor remain in the computation. These are:
Cell type
Cell type, Time
Cell type, Time, Cell type * Time
The more comparisons on different terms are added, the fewer models will be included in the computation. If the following comparisons are added in one GSA run:
A vs B (Cell type)
5 vs 0 (Time)
only the following two models will be computed:
Cell type, Time
Cell type, Time, Cell type * Time
If comparisons on all the three terms are added in one GSA run:
A vs B (Cell type)
5 vs 0 (Time)
A*5 vs B*5 (Cell type * Time)
then only one model will be computed:
Cell type, Time, Cell type * Time
If GSA is invoked from a quantification output data node directly, you will have the option to use the default normalization methods before performing differential expression detection (Figure 6).
If invoked from a Partek E/M method output, the data node contains raw read counts and the default normalization is:
Normalize to total count (RPM)
Add 0.0001 (offset)
If invoked from a Cufflinks method output, the data node contains FPKM and the default normalization is:
Add 0.0001 (offset)
If advanced normalization needs to be applied, perform the Normalize counts task on a quantification data node before doing differential expression detection (GSA or ANOVA).
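As a rough sketch of what the default normalization described above amounts to for a raw-counts matrix (features in rows, samples in columns); this mirrors the description, not Partek Flow's internal implementation:

```r
# Sketch of the default normalization for raw read counts (Partek E/M output):
# normalize to total count (RPM), then add a small offset.
normalize_rpm <- function(counts, offset = 1e-4) {
  lib_sizes <- colSums(counts)                   # total count per sample
  rpm <- sweep(counts, 2, lib_sizes, "/") * 1e6  # reads per million
  rpm + offset                                   # offset avoids log(0) downstream
}

set.seed(1)
raw <- matrix(rpois(20, lambda = 50), nrow = 5)  # toy counts: 5 features x 4 samples
normalized <- normalize_rpm(raw)
```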
Click on Configure to customize Advanced options (Figure 7).
The Low-expression feature section allows you to specify criteria to exclude features that do not meet the requirements for the calculation (a sketch of the three criteria follows the list below). If a filter features task was performed in the upstream analysis, the default of this filter is set to None; otherwise, the default is Lowest average coverage set to 1.
Lowest average coverage: the computation will exclude a feature if its geometric mean across all samples is below the specified value
Lowest maximum coverage: the computation will exclude a feature if its maximum across all samples is below the specified value
Minimum coverage: the computation will exclude a feature if its sum across all samples is below the specified value
None: include all features in the computation
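A minimal R sketch of the three criteria, assuming strictly positive counts (how Flow treats zeros in the geometric mean is not specified here):

```r
# Sketch of the three low-expression filter criteria listed above.
geom_mean <- function(x) exp(mean(log(x)))  # assumes positive values; a zero count drives the mean to 0

passes_lowest_average <- function(counts, cutoff) {
  apply(counts, 1, geom_mean) >= cutoff     # geometric mean across samples
}
passes_lowest_maximum <- function(counts, cutoff) {
  apply(counts, 1, max) >= cutoff           # maximum across samples
}
passes_minimum_coverage <- function(counts, cutoff) {
  rowSums(counts) >= cutoff                 # sum across samples
}
```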
Multiple test correction can be performed on the p-values of each comparison, with FDR step-up (1) being the default. Other options, such as Storey q-value (2) and Bonferroni, are provided; select one method at a time. None means no multiple test correction will be performed.
FDR step-up:
Suppose there are n p-values (n is the number of features). Sort them in ascending order, $p_{(1)} \le p_{(2)} \le \dots \le p_{(n)}$, and let m be the rank of a p-value. The calculation compares $p_{(m)} \cdot (n/m)$ with the specified alpha level; the cut-off p-value is the last one for which this product is below alpha. The goal of the step-up method is to find

$$K^* = \max\left\{ m : p_{(m)} \le \frac{m}{n}\,\alpha \right\}$$

Define the step-up value as

$$S_m = \frac{n}{m}\, p_{(m)}$$

Then an equivalent definition of $K^*$ is

$$K^* = \max\left\{ m : S_m \le \alpha \right\}$$

so when $S_m \le \alpha$, all features with rank up to m are declared significant. To find $K^*$, start with $S_n$ and go up the list until you find the first step-up value that is less than or equal to alpha.
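A short R sketch of this search, for illustration only (base R's p.adjust(p, method = "BH") computes the matching adjusted p-values):

```r
# Sketch of the FDR step-up search described above: find the largest rank m
# with step-up value S_m = p_(m) * n / m <= alpha.
fdr_step_up <- function(pvalues, alpha = 0.05) {
  n   <- length(pvalues)
  ord <- order(pvalues)
  s   <- pvalues[ord] * n / seq_len(n)    # step-up values S_m in rank order
  k   <- which(s <= alpha)
  if (length(k) == 0) return(integer(0))  # no feature passes
  ord[seq_len(max(k))]                    # indices of all features up to rank K*
}
```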
Storey q-value:
The q-value is the minimum "positive false discovery rate" (pFDR) that can occur when rejecting a statistic. For an observed statistic T = t and a nested set of rejection regions {C},

$$q(t) = \inf_{\{C \,:\, t \in C\}} \mathrm{pFDR}(C)$$
Bonferroni: each p-value is multiplied by the number of tests n (and capped at 1); equivalently, a feature is declared significant at level alpha only if its unadjusted p-value is at most alpha/n.
In the results, the best model's Akaike weight is also reported. The weight is interpreted as the probability that the model would be picked as the best if the study were reproduced. Akaike weights range from 0 to 1, where a weight close to 1 means the best model is clearly superior to the other candidates in the model pool; if the best model's Akaike weight is close to 0.5, on the other hand, the best model is likely to be replaced by another candidate if the study were reproduced. The best model is still used in that case, but the confidence in that choice is fairly low.
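For reference, a sketch of how Akaike weights are conventionally computed (following Burnham and Anderson, cited in the references below): for candidate models $i = 1, \dots, M$ with criterion values $AIC_i$,

$$\Delta_i = AIC_i - \min_j AIC_j, \qquad w_i = \frac{\exp(-\Delta_i/2)}{\sum_{j=1}^{M} \exp(-\Delta_j/2)}$$

The best model has $\Delta_i = 0$ and therefore the largest weight, and the weights across the pool sum to 1.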
The default value for Enable multimodel approach is Yes, meaning the estimation utilizes all models in the pool by assigning them weights based on AIC or AICc. If No is selected, the estimation is based on only the single best model, i.e. the one with the smallest AIC or AICc. The output p-value differs depending on the selected multimodel option, but the fold change is the same. The multimodel approach is recommended when the best model's Akaike weight is not close to 1, meaning the best model is not compelling.
There are situations when a model estimation procedure does not fail outright but still encounters difficulties. In such cases it can generate p-values and fold changes for the comparisons, but those values are not reliable and can be misleading. It is recommended to use only reliable estimation results, so the default for Use only reliable estimation results is set to Yes.
Partek Flow provides five response distribution types for each design model in the pool, namely:
Normal
Lognormal (the same as ANOVA task)
Lognormal with shrinkage (the same as the limma-trend method [4])
Negative binomial
Poisson
We recommend using the lognormal with shrinkage distribution (the default); an experienced user may click Custom to configure the model type and p-value type (Figure 8).
If multiple distribution types are selected, then the number of total models that is evaluated for each feature is the product of the number of design models and the number of distribution types. In the above example, suppose we have only compared A vs B in Cell type as in Figure 2, then the design model pool will have the following three models:
Cell type
Cell type, Time
Cell type, Time, Cell type * Time
If we select Lognormal with shrinkage and Negative binomial, i.e. two distribution types, the best model fit for each feature will be selected from 3 * 2 = 6 models using AIC or AICc.
The design pool can also be restricted by Min error degrees of freedom. When Model types configuration is set to Default, this is automated as follows: it is desirable to keep the error degrees of freedom at or above six, so the minimum is automatically set to the largest k, 0 <= k <= 6, for which admissible models exist. An admissible model is one that can be estimated given the specified contrasts. In the above example, when we compare A vs B in Cell type, there are three possible design models. The error degrees of freedom of the model Cell type are the largest, and those of the model Cell type, Time, Cell type * Time are the smallest:
k(Cell type) > k(Cell type, Time) > k (Cell type, Time, Cell type*Time)
If the sample size is big, k >= 6 holds for all three models, all the models are evaluated, and the best model is selected for each feature. If the sample size is too small and no model has k >= 6, only the model with the maximal k is used in the calculation. If the maximal k happens to be zero, only the Poisson response distribution can be used.
There are two types of p-values: F and Wald. Poisson, negative binomial, and normal models can generate p-values using either Wald or F statistics; lognormal models always employ the F statistic. The more replicates in the study, the smaller the difference between the two options. When there are no replicates, only Poisson can be used, and the p-value is generated using the Wald statistic.
Note: Partek Flow keeps track of the log status of the data; whether or not GSA is performed on logged data, the LSMeans, ratio, and fold change calculations are always in linear scale. Ratio is the ratio of the two LSMeans of the two groups in the comparison (left is the numerator, right is the denominator). Fold change is converted from ratio: when the ratio is greater than or equal to 1, the fold change equals the ratio; when the ratio is less than 1, the fold change is -1/ratio. In other words, the fold change is always >= 1 or <= -1; there are no fold change values between -1 and 1. When the LSMean of the numerator group is greater than that of the denominator group, the fold change is greater than 1; when it is smaller, the fold change is less than -1; when the two group means are equal, the fold change is 1. Logratio is the log2-transformed ratio, which is equivalent to the log fold change reported by some other software.
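A minimal R sketch of this ratio-to-fold-change convention, for illustration:

```r
# Sketch of the ratio-to-fold-change convention described in the note above.
ratio_to_fold_change <- function(ratio) {
  ifelse(ratio >= 1, ratio, -1 / ratio)  # never strictly between -1 and 1
}
ratio_to_fold_change(c(4, 1, 0.25))      # ->  4  1 -4

log_ratio <- log2  # "logratio" = log2(ratio), i.e. the log fold change elsewhere
```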
If there are multiple design models and multiple distribution types included in the calculation, the fraction of genes using each model and type will be displayed as pie charts in the task result (Figure 9).
Feature list with p-value and fold change generated from the best model selected is displayed in a table with other statistical information (Figure 10). By default, the gene list table is sorted by the first p-value column.
The following information is included in the table by default:
Feature ID information: if transcript level analysis was performed, and the annotation file has both transcript and gene level information, both gene ID and transcript ID are displayed. Otherwise, the table shows only the available information.
Total counts: total number of read counts across all the observations from the input data.
Each contrast outputs the p-value, FDR step-up p-value, ratio, and fold change in linear scale, plus the LSMean of each group in the comparison in linear scale.
Click the Optional columns link in the top-left corner of the table to select extra information to display:
Maximum count: maximum number of reads counts across all the observations from the input data.
Geometric mean: geometric mean value of the input counts across all observations.
Arithmetic mean: arithmetic mean value of input counts across all observations.
By clicking Optional columns, you can also retrieve more statistical results, e.g. Average coverage, which is the geometric mean of normalized reads in linear scale across all samples; the fold change lower/upper limits from the 95% confidence interval; and feature annotation information if additional annotation fields exist in the annotation model specified for quantification, such as genomic location and strand.
Select the check box of a field and specify the cutoff by typing directly or using the slider; press Enter to apply. After a filter has been applied, the total number of included features is updated at the top of the panel.
Note that for the LSMean there are two columns corresponding to the two groups in the contrast, and the cutoffs can be specified for the left and right columns separately. For example, in Figure 6, LSMean (left) corresponds to A while LSMean (right) is for B.
The filtered result can be saved into a filtered data node by selecting the Generate list button at the lower-left corner of the table. Selecting the Download button at the lower-right corner of the table downloads the table as a text file to the local computer.
Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS B, 57, 289-300.
Storey, J.D. (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics, 31, 2013-2035.
Auer, P.L., Doerge, R.W. (2011). A two-stage Poisson model for testing RNA-Seq data. Statistical Applications in Genetics and Molecular Biology, 10(1).
Burnham, K.P., Anderson, D.R. (2010). Model Selection and Multimodel Inference. Springer.
Law, C.W., Chen, Y., Shi, W., Smyth, G.K. (2014). Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15, R29.
Anders, S., Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11, R106.
Partek Flow offers the DESeq2 method for differential expression detection. There are two options: DESeq2(R) and DESeq2.
DESeq2(R) is a wrapper of the Bioconductor package 'DESeq2'. The implementation details for DESeq2(R) can be found in the external package documentation, which includes changes made by the algorithm authors since the publication of the original manuscript (Love, Huber, and Anders 2014).
The DESeq2(R) task can be invoked from data nodes generated by quantification tasks that contains raw read count values for each feature in each sample (Gene counts, Transcript counts, microRNA counts, etc.). DESeq2(R) cannot be run on a normalized counts data node because DESeq2(R) internally corrects for library size and implements a low expression filter.
If a raw count value includes a decimal fraction, the value is rounded to an integer before DESeq2(R) is performed. The DESeq2(R) task itself includes the feature filter, normalization, and differential analysis. Please note that to run the DESeq2(R) task, you have to install the R package beforehand (the installation has to be performed outside of Partek Flow).
The DESeq2 task, on the other hand, is a Partek Flow native implementation of the DESeq2 differential expression detection algorithm; it is much faster than DESeq2(R). Like GSA and ANOVA, before you run this task we recommend that you first remove (filter out) features expressed at a low level and then perform normalization using Median ratio (DESeq2 only).
Note: DESeq2 differential analysis can only be performed on the output data nodes of normalization methods that produce library sizes:
TMM, CPM, Upper Quartile, Median ratio, Postcounts
Installation of R is not required to run the DESeq2 task.
Categorical and numeric attributes, as well as interaction terms, can be added to the DESeq2 model. The DESeq2 configuration dialog for adding attributes and interactions to the model is very similar to the ANOVA configuration dialog. However, DESeq2(R) has two important limitations not shared by GSA or ANOVA.
First, interaction terms cannot be added to contrasts in DESeq2(R). To perform contrasts of an interaction term in DESeq2(R), a new attribute that combines the factors of interest must be added and the contrast performed on the new combined attribute. This limitation of DESeq2(R) is detailed in the official DESeq2 documentation. To perform contrasts of interaction terms without creating new combined attributes, please use the DESeq2, GSA, or ANOVA/LIMMA-trend/LIMMA-voom method.
Second, DESeq2(R) only allows two subgroups to be compared in each contrast. To analyze multiple subgroups, please use either DESeq2, GSA, or ANOVA method.
In DESeq2 advanced options configure dialog, there is reference selection option:
A reference level is specified for each categorical factor in the model, and the result may depend on that choice. In R, a reference level is chosen by default whenever a categorical factor is present in the model. This Flow option allows the user to specify exactly the same reference level as in an R script, if needed, e.g. to compare results between DESeq2 and DESeq2(R).
The report produced by DESeq2 is similar to the ANOVA report; each row is a feature and the columns include the p-value, FDR p-value, and fold change in linear scale for each contrast. However, the DESeq2(R) report does not include LSMeans information for the compared groups.
The R implementation of DESeq2, and hence DESeq2(R), fails to generate results in some special cases of input data; these cases are handled differently in Partek's implementation of DESeq2.
First, if two or more categorical factors are present in the model, there can be a situation when some combinations of factor levels have no observations. In Flow, one can see if that is the case by clicking "Cross tabulation" link after the factors have been selected:
In this example, if the user tries to fit a model including the Factor_A, Factor_G, and Factor_A*Factor_G terms, the R implementation of DESeq2 fails. At the same time, none of these three terms is completely redundant, even though not all contrasts are estimable. In R, an answer can be obtained only by combining Factor_A and Factor_G into a single new factor with no empty levels. Partek's implementation of DESeq2 eliminates the need for that extra step and produces an answer right away. However, the results (most likely, the p-values) may be somewhat different from what one would obtain by using a combined factor in R. To match the R results perfectly, create a combined factor in Flow as well.
Second, occasionally all of the gene-wise dispersion estimates are too close to zero. In that case, R implementation fails with an error message "all gene-wise dispersion estimates are within 2 orders of magnitude from the minimum value ... one can instead use the gene-wise estimates as final estimates". As suggested in the error message, Flow implementation uses gene-wise dispersion estimates instead of failing the task.
In all of the special cases, an informative warning is output in Flow log.
In R, shrinkage of log2 fold changes is a separate step performed by the lfcShrink() function. In Flow, that functionality is absent in DESeq2(R) but present in the DESeq2 task, which implements the shrinkage method corresponding to the "ashr" option of lfcShrink(). The default shrinkage option in lfcShrink() is "apeglm", but that method is unable to produce results for some comparisons, whereas "ashr" has no restrictions. The fold change shrinkage results are reported in the "Shrunken Log2(Ratio)" and "s-value" columns of the DESeq2 task report.
"Independent filtering" tries to remove some features with low expression in order to increase statistical power. For such removed features, the p-value is reported but FDR and similar multiplicity adjustment measures are set to "?". To avoid missing values in the report, set this option to "No".
Given the limitations of DESeq2(R), we recommend using the DESeq2 task.
Love, M.I., Huber, W., Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.
Hurdle model is a statistical test for differential analysis that utilizes a two-part model, a discrete (logistic) part for modeling zero vs. non-zero counts and a continuous (log-normal) part for modeling the distribution of non-zero counts. In RNA-Seq data, this can be thought of as the discrete part modeling whether or not the gene is expressed and the continuous part modeling how much it is expressed if it is expressed. Hurdle model is well suited to data sets where features have very many zero values, such as single cell RNA-Seq data.
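As a minimal sketch of the two parts (the exact parameterization used by Flow and MAST may differ), for the expression Y of one gene with design matrix row x:

$$\Pr(Y > 0) = \mathrm{logit}^{-1}(x^\top \beta_D), \qquad \log_2(Y) \mid Y > 0 \;\sim\; \mathcal{N}(x^\top \beta_C,\, \sigma^2)$$

where $\beta_D$ are the coefficients of the discrete (logistic) part and $\beta_C$ those of the continuous (log-normal) part; a gene-level test combines evidence from both parts.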
On default settings, Hurdle model is equivalent to MAST, a published differential analysis tool designed for single cell RNA-Seq data that uses a hurdle model [1].
We recommend normalizing your data prior to running Hurdle model, but it can be invoked on any counts data node.
Click the counts data node
Click the Differential analysis section in the toolbox
Click Hurdle model
Select the factors and interactions to include in the statistical test (Figure 1)
Numeric and categorical attributes can be added as factors. To add attributes as factors, check the attribute check boxes and click Add factors. To add interactions between attributes, check the attribute check boxes and click Add interaction.
Click Next
Define comparisons between factor or interaction levels (Figure 2)
Adding comparisons in Hurdle model uses the same interface as ANOVA/LIMMA-trend/LIMMA-voom. Start by choosing a factor or interaction from the Factor drop-down list. The levels of the factor or interaction will appear in the left-hand panel. Select levels in the panel on the left and click the > arrow buttons to add them to the top or bottom panels on the right. The control level(s) should be added to the bottom box and the experimental level(s) should be added to the top box. Click Add comparison to add the comparison to the Comparisons table. Only comparisons in the Comparisons table will be included in the statistical test.
Click Finish to run the statistical test
Hurdle model produces a Feature list task node. The results table and options are the same as the GSA task report except for the last two columns (Figure 3): the percentage of cells in which the feature is detected (value above the background threshold) in each group (Pct(group1), Pct(group2)) is calculated and included in the Hurdle model report.
Low-value filter allows you to specify criteria to exclude features that do not meet the requirements for the calculation. If a filter features task was performed in the upstream analysis, the default of this filter is set to None; otherwise, the default is Lowest average coverage set to 1.
Lowest average coverage: the computation will exclude a feature if its geometric mean across all samples is below the specified value
Lowest maximum coverage: the computation will exclude a feature if its maximum across all samples is below the specified value
Minimum coverage: the computation will exclude a feature if its sum across all samples is below the specified value
None: include all features in the computation
Multiple test correction can be performed on the p-values of each comparison, with FDR step-up being the default. If you check the Storey q-value, an extra column with q-values will be added to the report.
There are situations when a model estimation procedure does not fail outright but still encounters difficulties. In such cases it can generate p-values and fold changes for the comparisons, but they are not reliable, i.e. they can be misleading. Therefore, the default of Use only reliable estimation results is set to Yes.
Shows the current scale of the input data for this task
Set the threshold for a feature to be considered expressed for the two-part hurdle model. If the feature value is greater than the specified value, it is considered expressed. If the upstream data node contains log-transformed values, be sure to specify the value on the same log scale. Default value is 0.
Applies shrinkage to the error variance in the continuous (log-normal) part of the hurdle model. The error term variance is shrunk towards a common value, and a shrinkage plot is produced on the task report page if enabled. Default is Enabled.
Applies shrinkage to the regression coefficients in the discrete (logistic) part of the hurdle model. The initial versions of MAST contained a bug that was fixed in its R source in March 2020. However, for the sake of reproducibility, the fix was released only on a topic branch in the MAST GitHub repository [2] and the default version of MAST remained as is. To install the fixed version of MAST in R, run the following R script.
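The script itself is not reproduced here; a typical way to install a specific MAST branch from GitHub uses the remotes package, where the ref value below is a placeholder for the actual topic branch referenced in [2]:

```r
# Hypothetical sketch: install MAST from a specific GitHub branch.
# "fix-branch-name" is a placeholder; substitute the topic branch from [2].
install.packages("remotes")  # if not already installed
remotes::install_github("RGLab/MAST", ref = "fix-branch-name")
```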
In Flow, the user can switch between the fixed and default version by selecting Fixed version or Default version, respectively. To disable the shrinkage altogether, choose Disabled.
[1] Finak, G., McDavid, A., Yajima, M., Deng, J., Gersuk, V., Shalek, A. K., ... & Linsley, P. S. (2015). MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome biology, 16(1), 278.
[2] MAST topic branch that contains the regression coefficient shrinkage fix:
This task can be invoked from a count matrix data node or a clustering task report (Statistics > Compute biomarkers). It performs Student's t-tests on the selected attribute, comparing each subgroup against all the others combined. By default, the up-regulated genes are reported as biomarkers.
In the set-up dialog, select the attribute from the drop-down list. The available attributes are the categorical attributes visible on the Data tab (i.e. project-level attributes) as well as data node-specific annotations, e.g. graph-based clustering results (Figure 1). If the task is run on a graph-based clustering output data node, the calculation uses the upstream data node that contains feature counts – typically the input data node of PCA.
By default, the result reports the features that are up-regulated by at least a 1.5 fold change (in linear scale) in each subgroup compared to the others. The result is displayed in a table where each column is a subgroup and each row is a feature; features are ranked by ascending p-value within each subgroup. An example is shown in Figure 2. If a subgroup has fewer biomarkers than the others, the "extra" fields for that subgroup are left blank.
Figure 3. Biomarkers table (example). Top 10 biomarkers for each cluster are shown. Download link provides the full results table
Furthermore, the Download link (upper-left corner of the table report) downloads a .txt file to the local computer (default file name: Biomarkers.txt) containing the full report: all genes with fold change > 1.5, with corresponding fold changes and p-values.
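A simplified R sketch of the one-vs-rest logic described above, for illustration only (Flow's exact computation, e.g. whether tests run on the log scale, may differ):

```r
# Sketch: for cluster k, a Student's t-test per feature against all other
# cells combined, keeping up-regulated features (fold change > 1.5)
# ranked by ascending p-value.
biomarkers_for_cluster <- function(expr, clusters, k, min_fc = 1.5) {
  in_k <- clusters == k
  stats <- t(apply(expr, 1, function(x) {
    c(fold_change = mean(x[in_k]) / mean(x[!in_k]),  # linear-scale fold change
      p_value     = t.test(x[in_k], x[!in_k])$p.value)
  }))
  res <- as.data.frame(stats)
  res <- res[res$fold_change > min_fc, , drop = FALSE]  # up-regulated only
  res[order(res$p_value), , drop = FALSE]               # rank by p-value
}
```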
The Kruskal-Wallis and Dunn's tests (non-parametric ANOVA) task is used to identify differentially expressed genes among two or more groups. Note that such rank-based tests are generally advised for use with larger sample sizes.
To invoke the Kruskal-Wallis test, select any count-based data node, including:
Gene counts
Transcript counts
Normalized counts
Select Statistics > Differential analysis in the context-sensitive menu, then select Kruskal-Wallis (Figure 1).
Select a specific factor for analysis and click the Next button (Figure 2). Note that this task can only take into account one factor at a time.
For more complicated experimental designs, go back to the original count data that will be used as input and perform Rank normalization at the Features level (Figure 3). The resulting Normalized counts data node can then be analyzed using the Detect differential expression (ANOVA) task, which can take into account multiple factors as well as interactions.
Define the desired comparisons between groups and click the Finish button (Figure 4). Note that comparisons can only be added between single groups (i.e. one group per box).
The results of the analysis will appear similar to other differential expression analysis results. However, the column to indicate mean expression levels for each group will display the median instead (Figure 5).
Alternative splicing results in a single gene coding for multiple protein isoforms, so this task can only be invoked from transcript-level data. The algorithm is based on ANOVA and detects genes whose transcripts show expression changes that differ between biological groups. E.g. a gene has two transcripts, A and B; transcript A is up-regulated in the treated group compared to the control group, while B is down-regulated in the treated group.
The alt-splicing dialog is very similar to the ANOVA dialog, since the analysis is based on the specified ANOVA model. To set up an ANOVA model, first choose factors from the available sample attributes. The factors can be categorical or numeric attribute(s). Click a check box to select a factor and click the Add factors button to add it to the model (Figure 1).
Only one alt-splicing factor needs to be selected from the ANOVA factors. The ANOVA model performed is based on the factors specified in the dialog, while the transcript ID and transcript ID interaction with alt-splicing factor effects are added into the model automatically.
Transcript ID effect: not all transcripts of a gene are expressed at the same level, so transcript ID is added to the model to account for transcript-to-transcript differences.
Interaction of transcript ID with the alt-splicing factor: this effect is used to estimate whether different transcripts have different expression among the levels of the same factor.
Suppose there is an experiment designed to detect transcripts showing differential expression in two tissue groups: liver vs. muscle. The alt-splicing ANOVA dialog allows you to specify the ANOVA model, which in this analysis is Tissue. The alt-splicing factor is chosen from the ANOVA factor(s), so the alt-splicing factor is also Tissue (Figure 1).
The alt-splicing range will limit analysis to genes possessing the number of transcripts in the specified range. Lowering the maximum number of transcripts will increase the speed of analysis.
Click Next to setup the comparisons (Figure 2). The levels (i.e. subgroups) of the alt-splicing factor will be displayed in the left panel; click to select a level name and move it to one of the panels on the right. The fold change calculation on the comparison will use the group in the top panel as the numerator, and the group in the bottom panel as the denominator. Click on Add comparison button to add a comparison to the comparisons table.
Click Configure to customize Advanced options (Figure 3).
Report option
Use only reliable estimation results: there are situations when a model estimation procedure does not fail outright but still encounters difficulties. In such cases it can generate p-values and fold changes for the comparisons, but they are not reliable, i.e. they can be misleading. Therefore, the default is set to Yes.
Data has been log transformed with base: showing the current scale of the input data on this task.
In the example above (Figure 4), the alt-splicing p-value of gene SLC25A3 is very small, which indicates that this gene shows preferential transcript expression across tissues. There are 3 splicing variants of the gene: NM_213611, NM_005888, and NM_002635. The fold change shows that NM_005888 has higher expression in the muscle relative to the liver (negative fold change, liver as the reference category), while NM_002635 has higher expression in the liver.
To visualize the difference, click on the Browse to location icon (Figure 5). The 3rd exon is differentially expressed between NM_005888 and NM_002635. Muscle primarily expresses NM_005888 while liver primarily uses NM_002635.
Seurat v3 [1] introduced new methods for the integration of multiple single-cell datasets, whether they were collected from different individuals, experimental conditions, or technologies. The Seurat 3 integration method uses a subset of the data as a reference for the integration analysis and integrates all other data with that reference. The subset can be one sample or a subgroup of samples defined by a factor attribute.
Seurat3 integration in Flow can be invoked from the Batch removal section when a Normalized counts data node is selected (Figure 1).
To run Seurat3 integration,
Click a Normalized counts data node
Click the Batch removal section in the toolbox
Click Seurat3 Integration
You will be prompted to pick attribute(s) for the analysis. The first Seurat3 integration dialog is a drop-down list that includes the factors for data integration. To set up the model, you need to choose which attribute should be considered. For example, in the case of a dataset that has different cell types assayed with multiple technologies (Tech), different technologies may have divergent impacts on different cell types. Hence, the attribute Tech should be considered the batch factor. The attribute celltype represents the different cell types in this dataset (Figure 2).
To integrate data with default settings,
Select Tech from the dropdown list
Click Finish
The output of Seurat3 integration is a new data node - Integrated counts (Figure 1). We can then use this new integrated matrix for downstream analysis and visualization (Figure 3).
Users can click Configure to change the default settings in Advanced options (Figure 4).
Use reference to find anchors: when this box is checked, the first group of the selected attribute is used as the reference to find anchors. To use a different group as the reference, change the order of the subgroups of the attribute on the attribute management page of the Data tab. When the box is unchecked, anchors are identified by comparing all pairs of subgroups; this option is very computationally intensive.
Perform L2 normalization: Perform L2 normalization on the CCA cell embeddings after dimensional reduction.
Pick anchors: How many neighbors (k) to use when picking anchors.
Filter anchors: How many neighbors (k) to use when filtering anchors.
Score anchors: How many neighbors (k) to use when scoring anchors.
Nearest neighbor finding methods: Method for nearest neighbor finding. Options include: rann, annoy.
This option is only available when Cufflinks quantification node is selected. Detailed implementation information can be found in the Cuffdiff manual [5].
When the task is selected, the dialog displays all the categorical attributes with more than one subgroup (Figure 1).
When an attribute is selected, pairwise comparisons of all the levels will be performed independently.
Click on Configure button in the Advanced options to configure normalization method and library types (Figure 2).
There are three library normalization methods:
Classic-fpkm: the library size factor is set to 1; no scaling is applied to FPKM values
Geometric: FPKMs are scaled via the median of the geometric means of the fragment counts across all libraries [6]. This is the default option (and is identical to the one used by DESeq)
Quartile: FPKMs are scaled via the ratio of the 75th percentile fragment counts to the average 75th percentile value across all libraries
The library types have three options:
Fr-unstranded: reads from the left-most end of the fragment (in transcript coordinates) map to the transcript strand, and the right-most end maps to the opposite strand. E.g. standard Illumina
Fr-firststrand: same as above, except the right-most end of the fragment is the first sequenced (or the only end sequenced for single-end reads). It is assumed that only the strand generated during first-strand synthesis is sequenced. E.g. dUTP, NSR, NNSR
Fr-secondstrand: same as above, except the left-most end of the fragment is the first sequenced (or the only end sequenced for single-end reads). It is assumed that only the strand generated during second-strand synthesis is sequenced. E.g. directional Illumina, standard SOLiD
The report of the Cuffdiff task is a table of features with p-value, q-value, and log2 fold-change information for all the comparisons (Figure 3).
In the p-value column, besides an actual p-value, which means the test was performed successfully, the following flags can appear to indicate that the test was not successful:
NOTEST: not enough alignments for testing
LOWDATA: too complex or too shallowly sequenced
HIGHDATA: too many fragments in locus
FAIL: when an ill-conditioned covariance matrix or other numerical exception prevents testing
The table can be downloaded as a text file when clicking the Download button on the lower-right corner of the table.
Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS B, 57, 289-300.
Storey, J.D. (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics, 31, 2013-2035.
Auer, P.L., Doerge, R.W. (2011). A two-stage Poisson model for testing RNA-Seq data. Statistical Applications in Genetics and Molecular Biology, 10(1).
Burnham, K.P., Anderson, D.R. (2010). Model Selection and Multimodel Inference. Springer.
Law, C.W., Chen, Y., Shi, W., Smyth, G.K. (2014). Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15, R29.
Anders, S., Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11, R106.
It is challenging to analyze scRNA-seq data, particularly when datasets are assayed with different technologies, because biological and technical differences are interspersed. Harmony [1] is an algorithm that projects cells into a shared embedding in which cells group by cell type rather than by dataset-specific conditions. Harmony can simultaneously account for multiple experimental and biological factors while integrating different datasets.
Harmony in Flow can be invoked in Batch removal section only if
the data has some categorical attributes (only categorical attributes can be included in the model)
PCA data node is selected (Figure 1).
To run Harmony,
Click a PCA data node
Click the Batch removal section in the toolbox
Click Harmony
You will be prompted to pick some attribute(s) for analysis. The Harmony dialog is similar to the General linear model batch removal. To set up the model, you need to choose which attributes should be considered. For example, in the case of one dataset that has different cell types from multiple batches, the batch may have divergent impacts on different cell types. Here, batch is the attribute Sample name and cell type is the attribute Cell type (Figure 2).
To remove batch effects with default settings,
Click Sample name
Click Add factors
Click Finish
The output of Harmony is a new data node. This data node contains the Harmony-corrected values and can be used as the input for downstream tasks such as Graph-based clustering, UMAP, and t-SNE (Figure 3).
Users can click Configure to change the default settings in Advanced options (Figure 4).
Diversity clustering penalty (theta): default theta=2. Higher penalty values give stronger correction, which results in better mixing; zero penalty means no correction. The range of this value is from 0 to positive infinity.
Number of clusters (nclust): the number of clusters in the model. Set this to the distinct count of cell types. nclust=1 is equivalent to simple linear regression. Use 0 to enable the default setting of Seurat's RunHarmony().
Width of soft kmeans clusters (sigma): the range of this value is from 0 to positive infinity. When set to 0, an observation is assigned to exactly one cluster (hard clustering); when the value is greater than 0, an observation can belong to multiple clusters (soft, or fuzzy, clustering). Default sigma=0.1. Sigma scales the distance from a cell to the cluster centroids: larger values of sigma result in observations being assigned to more clusters, while smaller values make soft kmeans clustering approach hard clustering.
Ridge regression penalty (lambda): Default lambda=1. Lambda must be strictly positive. Smaller values result in more aggressive correction.
Random seed: Use the same random seed to reproduce the results.
Partek Flow offers a wide variety of tools to help you explore your data. Which tools are available depends on the type of data node selected.
Compare clusters is a tool to identify the optimal number of clusters for K-means Clustering using the Davies-Bouldin index. The Davies-Bouldin index is a measure of cluster quality where a lower value indicates better clustering, i.e., the separation between points within the clusters is low (tight clusters) and separation between clusters is high (distinct clusters).
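For reference, the Davies-Bouldin index for k clusters is conventionally defined as

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \ne i} \frac{s_i + s_j}{d_{ij}}$$

where $s_i$ is the average distance of points in cluster i to its centroid (within-cluster scatter) and $d_{ij}$ is the distance between the centroids of clusters i and j; tight, well-separated clusters give small values.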
We recommend normalizing your data prior to running Compare clusters, but the task will run on any counts data node.
Click the counts data node
Click the Exploratory analysis section of the toolbox
Click Compare clusters
Configure the parameters
Click Finish to run (Figure 1)
The parameters for Compare clusters are the same as for K-means clustering.
The Compare clusters task report is an interactive line chart with the number of clusters on the x-axis and the Davies-Bouldin index on the y-axis (Figure 2).
The Compare clusters task report can be used to run K-means clustering.
Click a point on the plot to select it or type the number of clusters in the text box Partition data into clusters
Selecting a point sets it as the number of clusters to partition the data into. The number of clusters with the lowest Davies-Bouldin index value is chosen by default.
Click Generate clusters to run K-means clustering with the selected number of clusters
A K-means clustering task node and a Clustering result data node are produced. Please see our documentation on K-means Clustering for more details.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Graph-based clustering is a method for identifying groups of similar cells or samples. It makes no prior assumptions about the clusters in the data. This means the number, size, density, and shape of clusters does not need to be known or assumed prior to clustering. Consequently, graph-based clustering is useful for identifying clustering in complex data sets such as scRNA-seq.
We recommend normalizing your data prior to running Graph-based clustering, but the task will run on any counts data node.
Click the counts data node
Click the Exploratory analysis section of the toolbox
Click Graph-based clustering
Configure the parameters
Click Finish to run
Graph-based clustering produces a Clustering result data node. The task report lists the cluster results and cluster statistics (Figure 1). If clustering was run with Split cells by sample enabled on a single cell counts data node, the cluster results table displays the number of clusters found for each sample and clicking the sample name opens the sample-level report.
The Maximum modularity is a measure of the quality of the clustering result. Modularity measures how similar cells within a cluster are to each other and how dissimilar they are to cells in other clusters. Higher modularity indicates a better result. The optimal modularity is 1, but it may not be attainable for the input data.
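For reference, the standard (Newman-Girvan) modularity that the algorithm maximizes can be written in its usual graph form as

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$

where $A_{ij}$ is the edge weight between cells i and j in the nearest-neighbor graph, $k_i$ is the total edge weight attached to cell i, $m$ is the total edge weight in the graph, and $\delta(c_i, c_j)$ is 1 when the two cells are in the same cluster and 0 otherwise.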
The total number of clusters is listed along with the number and percentage of cells in each cluster.
The Clustering result data node includes the input values for each gene and adds cluster assignment as a new attribute, Graph-based, for each observation. If the Clustering result data node is visualized by Scatter plot, PCA, t-SNE, or UMAP, the plot will be colored by the Graph-based attribute (Figure 2).
Choose which version of the clustering algorithm to use. Options are Louvain [1], Louvain with refinement [2], SLM (Smart Local Moving) [3], and Leiden [4], the most recent of these methods. The default is Louvain.
Compute biomarkers identifies features that are highly expressed in each cluster compared to the other clusters.
Choose whether to run Graph-based clustering on all samples together or on each sample individually.
Checking the box will run Graph-based clustering on each sample individually.
This option appears when there are multiple feature types in the input data node (e.g., CITE-Seq data).
Select Any to run on all features or pick a feature type.
To increase the number of clusters, increase the resolution; to decrease the number of clusters, decrease the resolution. Default is 0.5.
A larger number may be more appropriate for larger numbers of cells.
Removes links between pairs of points if their similarity is below the threshold. Larger values lead to a shorter run time, but can result in many singleton clusters. Default is 0.0.
Clustering preserves the local structure of the data by focusing on the distances between each point and its k nearest neighbors. The optimal number of nearest neighbors depends on the size and density of the data. Generally, a larger and/or denser data set benefits from a larger number of nearest neighbors. Increasing the number of nearest neighbors will increase the size of clusters, and vice versa. Default is 30. The range of possible values is 3 to 100.
This parameter can be used to speed up clustering at the expense of accuracy. Larger scale implies greater accuracy and helps avoid singletons, but takes more time to run. To maximize accuracy, the total count of observations being clustered should be below the product of nearest neighbors and scale. Default is 100,000. The range of possible values is 1 to 100,000.
The modularity function measures the overall quality of clustering. Graph-based clustering amounts to finding a local maximum of the modularity function. Possibilities are Standard [5] and Alternative [6]. Default is Standard.
The clustering result depends on the order observations are considered. Each random start corresponds to a different order and result. A larger number of random starts can deliver a better result because the result with the highest quality (modularity) out of all of the random starts is chosen. Increasing the number of random starts will increase the run time. The range of possible values is 3 to 1,000. The default is 100.
The random seed is used in the random starts portion of the algorithm. Using a different seed might give a better result. Use the same random seed to reproduce results. Default is 0.
To maximize modularity, clustering proceeds iteratively by moving individual points, clusters, or subsets of points within clusters. A larger number of iterations can give better results, but will take longer to run. Default is 10.
Clusters smaller than the minimal cluster size value will be merged with a nearby cluster unless they are completely isolated. To avoid isolation, set the prune parameter to zero (default) and the scale parameter to the maximum (default). Default is 1.
Enable this option to use the slower sequential ordering of random starts. Default is disabled.
Different methods for determining nearest neighbors. The K nearest neighbors (K-NN) algorithm is the standard. The NN-Descent algorithm is used by UMAP and is an alternative. Default is K-NN.
If NN-Descent is chosen for Nearest Neighbor Type, the metric to use when determining distance between data points in high dimensional space can be set. Options are Euclidean, Manhattan, Chebyshev, Canberra, Bray Curtis, and Cosine. Default is Euclidean.
Graph-based clustering uses principal components as its input. The number of principal components to use is set here.
We recommend using the PCA task to determine the optimal number of principal components for your data. Default is 100.
Options are equally or by variance. Feature values can be standardized prior to PCA so that the contribution of each feature does not depend on its variance. To standardize, choose equally. To take variance into account and focus on the most variable features, choose by variance. Default is by variance.
You can choose to log transform the data prior to running PCA as part of Graph-based clustering. Default is disabled.
If you are log transforming the data, choose a log base. Default is 2 when Log transform data is enabled.
If you are log transforming the data, choose an offset. Default is 1 when Log transform data is enabled.
[1] Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10), P10008.
[2] Rotta, R., & Noack, A. (2011). Multilevel local search algorithms for modularity clustering. Journal of Experimental Algorithmics (JEA), 16, 2-3.
[3] Waltman, L., & Van Eck, N. J. (2013). A smart local moving algorithm for large-scale modularity-based community detection. The European Physical Journal B, 86(11), 471.
[4] Traag, V.A., Waltman, L., & van Eck, N.J. (2019). From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 9, 5233. https://doi.org/10.1038/s41598-019-41695-z
[5] Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical review E, 69(2), 026113.
[6] Traag, V. A., Van Dooren, P., & Nesterov, Y. (2011). Narrow scope for resolution-limit-free community detection. Physical Review E, 84(1), 016114.
K-means clustering is a method for identifying groups of similar observations, i.e. cells or samples. K-means clustering aims to group observations into a pre-determined number of clusters (k) so that each observation belongs to the cluster with the nearest mean. An important aspect of K-means clustering is that it expects clusters to be of similar size (equal variance) and shape (distribution of variance is spherical). The Compare Clusters task can also be used to help determine the optimal number of K-means clusters.
We recommend normalizing your data prior to running K-means clustering, but the task will run on any counts data node.
Click the counts data node
Click the Exploratory analysis section of the toolbox
Click K-means clustering
Configure the parameters
Click Finish to run (Figure 1)
K-means clustering produces a K-means Clusters result data node; double-click to open the task report which lists the cluster statistics (Figure 2). If Compute biomarkers was enabled, top markers will be available by double-clicking the Biomarkers result data node. If clustering was run with Split by sample enabled on a single cell counts data node, the cluster results table displays the number of clusters found for each sample and clicking the sample name opens the sample-level report.
The total number of clusters is listed along with the number and percentage of cells in each cluster.
The K-means Clustering result data node includes the input values and adds cluster assignment as a new attribute, K-means, for each observation.
Choose which distance metric to use for cluster distance calculations. Options include Euclidean, Absolute Value, Euclidean Squared, Kendall Correlation, Max Value, Min Value, Pearson Correlation, Rank Correlation, Average Euclidean, Shape, Cosine, Canberra, Bray Curtis, Tanimoto, Pearson Correlation Absolute, Rank Correlation Absolute, and Kendall Correlation Absolute. The default is Euclidean.
Choose between specifying a set number of clusters or a range to test for the best fit number of clusters. The best fit is determined by the number of clusters with the lowest Davies–Bouldin index. The default is set to 10 for a fixed number of clusters. The initial values for the range option are 3 to 20 clusters.
Choose whether to run the ANOVA test comparing each cluster to all other observations to identify features that have higher values in that cluster. Default is Enabled.
This option is present in single cell data. If enabled, K-means clustering will be run separately for each sample. If disabled, K-means clustering will be run on all cells from the input data. Default is set by the Split single cell by sample option in the user preference page.
If enabled, the initial cluster centroids will be selected randomly from among the data points. If disabled, the initial cluster centroids will be selected to optimize distance between clusters. Default is Disabled.
This sets the random seed used if Random cluster initialization is enabled. Use the same random seed to reproduce results.
If enabled, all cluster centroids will be recomputed at the end of each iteration. If disabled, each cluster centroid will be recomputed as the members of the cluster change. Default is Enabled.
The maximum number of iterations to perform before settling on a set of clusters. Default is 1000.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
If a Partek Flow task fails (no report is produced), please follow the directions in the task error reporting documentation.
If the task report is produced but the results are missing for some features (represented by "?", Figure 1), something may have gone wrong with the estimation procedure. To better understand what happened, use the information available in the Extra details report (Figure 1). This type of information is present for many tasks, including Differential Analysis and Survival Analysis.
Click the Extra details report for the feature of interest (Figure 1). This will display the Extra details report (Figure 2). When the estimation procedure fails, a red triangle will be present next to the information criteria value. Hover over the triangle to see a detailed error message.
In many cases, estimation failure is due to low expression; filtering out low-expression features or choosing a suitable normalization method will resolve the issue.
Sometimes the estimation results are not missing but the reported values look inadequate. If this is the case, the Extra details report may show that the estimation procedure generated a warning, and the triangle is yellow. To remove suspicious results in the report, set Use only reliable estimation results to Yes in the Advanced Options (Figure 3). The warnings will then be treated the same way as estimation failures.
To see results for as many features as possible, regardless of how reliable they are, set Use only reliable estimation results to No; the result is then reported unless there is an estimation failure. For example, DESeq2 uses Cook's distances to flag features with outlying expression values; if Use only reliable estimation results is set to Yes (Figure 3), the p-values for such features are not reported, which may lead to missing values in the report (set the option to No to avoid this).
Survival analysis is a branch of statistics that deals with modeling of time-to-event. In the context of “survival,” the most common event studied is death; however, any other important biological event could be analyzed in a similar fashion (e.g., spreading of the primary tumor or occurrence/relapse of disease). Survival analysis tries to answer questions such as: What is the proportion of a population who will survive past a certain time (i.e., what is the 5-year survival rate)? What is the rate at which the event occurs? Do particular characteristics have an impact on survival rates (e.g., are certain genes associated with survival)? Is the 5-year survival rate improved in patients treated by a new drug? Cox regression and Kaplan-Meier analysis are two techniques which are commonly used to assess survival analysis.
In survival analysis, the event should be well-defined, with two levels, and occur at a specific time. Because the primary outcome of the event is typically unfavorable (e.g., death, metastasis, relapse), the event is called a “hazard.” The hazard ratio is used to assess the likelihood of the event occurring while controlling for other co-predictors (co-variables/co-factors) if they are added to the model. In other words, the hazard ratio expresses how rapidly the event is experienced in one group compared to another. A hazard ratio greater than 1 indicates a shorter time-to-event (an increase in the hazard), a hazard ratio less than 1 is associated with a longer time-to-event (a reduction in the hazard), and a hazard ratio of 1 indicates no effect on time-to-event. For example, a hazard ratio of 2 means the event occurs at twice the rate in one group compared to the other. In cancer studies, a hazard ratio greater than 1 is considered a bad prognostic factor, while a hazard ratio less than 1 is a good prognostic factor.
An important aspect of survival analysis is “censored” data. Censored data refers to subjects that have not experienced the event being studied. For example, medical studies often focus on survival of patients after treatment so the survival times are recorded during the study period. At the end of the study period, some patients are dead, some patients are alive, and the status of some patients is unknown because they dropped out of the study. Censored data refers to the latter two groups. The patients who survived until the end of the study or those who dropped out of the study have not experienced the study event "death" and are listed as "censored".
Cox regression (Cox proportional-hazards model) tests the effects of factors (predictors) on survival time. Predictors that lower the probability of survival at a given time are called risk factors; predictors that increase the probability of survival at a given time are called protective factors. The Cox proportional-hazards model is similar to a multiple logistic regression, but it considers time-to-event rather than simply whether an event occurred. Cox regression should not be used with a small sample size because the events could accidentally concentrate in one of the cohorts, which will not produce meaningful results.
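For readers comfortable with scripting, a minimal Cox regression can be sketched with the Python lifelines package. This is not what Partek Flow runs internally, and the toy column names (time, event, expr) are made up for illustration.

```python
import pandas as pd
from lifelines import CoxPHFitter

# toy cohort: time-to-event, event status (1 = event, 0 = censored),
# and one numeric predictor (e.g., expression of a single gene)
df = pd.DataFrame({
    "time":  [5, 8, 12, 3, 9, 14, 7, 11],
    "event": [1, 0, 1, 1, 0, 1, 1, 0],
    "expr":  [2.1, 0.4, 3.3, 2.8, 0.9, 3.9, 2.5, 1.0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()  # hazard ratio = exp(coef) for each predictor
```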
Open the Cox Regression task in the task menu under Statistics for any counts node.
Next, select the Time, Event, and Event status. Partek Flow will automatically guess factors that might be appropriate for these options. Click Next to proceed with the task.
The predictors (factors or variables) and co-predictors in the model must be defined. Co-predictors are numeric or categorical factors that will be included in the Cox regression model. Time-to-event analysis will be performed on features (e.g. genes) by default unless Use feature expression as predictor is unchecked (Figure 2). If unchecked, select a factor and click Add factors to model a variable other than features. With the default setting, Use feature expression as predictor, the user can Add factors to the model that help explain the relationship for time-to-event (co-predictors) in addition to features. Choose Add interaction to add co-predictors with known dependencies. If factors are added here, they cannot be added as stratification factors. Click Next to proceed with the task.
Next, the user can define comparisons for the co-predictors, if any have been added. Configure contrasts by moving factors into the numerator (e.g. experimental factor) or denominator (e.g. control factor / reference), choose Combine or Pairwise, and add the comparison, which will be displayed below. Combine merges all numerator levels and all denominator levels into a single comparison; Pairwise splits the numerator and denominator levels into a factorial set of comparisons, meaning every numerator level is paired with every denominator level. Multiple comparisons from different factors can be added with Add comparison. The Low value filter can be used to exclude features; choose a filter or select none. Click Next to proceed with the task (Figure 3).
The user can select categorical factors to perform stratification if needed. Stratification is needed when the proportional hazards assumption holds within each stratum but not across strata. When stratification factors are included, the proportional hazards assumption is required to hold only for each combination of levels of the stratification factors; a separate submodel is estimated for each level combination and the results are aggregated. Click Finish to complete the task (Figure 4).
The results of Cox regression analysis provide key information to interpret, including:
Hazard ratio (HR): if the HR = 0.5 then half as many patients are experiencing the event compared to the control group, if the HR = 1 the event rates are the same in both groups, and if the HR = 2 then twice as many are experiencing an event compared to the control group.
HR limit: this is the confidence interval of the hazard ratio.
P-value: the lower the p-value, the greater the significance of the observation.
For example, if you have selected both a co-predictor and a stratification factor, a comparison using the co-predictors and a Type III p-value for the co-predictor will be generated in the Cox regression report.
The Kaplan-Meier task is used for comparing the survival curves among two or more groups of samples. The groups are defined by one or more categorical attributes (factors) specified by the user. As in the case of Cox Regression, it is possible to use feature expression data, if available; in that case, quantitative feature expression is converted into a feature-specific categorical attribute. Each combination of the attribute levels corresponds to a distinct group. If one selects three factors with 2, 3, and 5 levels, respectively, then the total count of compared groups is 2*3*5 = 30. Therefore, selecting too many factors and/or factors with many levels may not work, since the total number of samples may not be enough to fill all of the groups.
To perform Kaplan-Meier survival analysis, at least two pieces of information must be provided for each sample: time-to-event (a numeric factor) and event status (a categorical factor with two levels). Time-to-event indicates the time elapsed between the enrollment of a subject in the study and the occurrence of the event. Event status indicates whether the event occurred or the subject was censored (did not experience the event). The survival curve is not drawn as straight lines connecting points; instead, it follows a staircase pattern, where each drop in the staircase represents an event occurrence.
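The following sketch, using the Python lifelines package (not Partek Flow's internal code), illustrates the two required inputs, durations and event status, with 0 marking censored subjects; it also runs the Log-rank test discussed later in this section.

```python
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

# time-to-event and event status for two groups (1 = event, 0 = censored)
t_a, e_a = [6, 7, 10, 15, 19, 25], [1, 1, 0, 1, 0, 1]
t_b, e_b = [1, 3, 4, 8, 12, 14],  [1, 1, 1, 1, 0, 1]

kmf = KaplanMeierFitter()
kmf.fit(t_a, event_observed=e_a, label="group A")
ax = kmf.plot_survival_function()      # staircase-shaped survival curve
kmf.fit(t_b, event_observed=e_b, label="group B")
kmf.plot_survival_function(ax=ax)

result = logrank_test(t_a, t_b, event_observed_A=e_a, event_observed_B=e_b)
print(result.p_value)
```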
The Kaplan-Meier task begins similar to the Cox regression task, then differs when selecting categorical attributes to define the compared groups.
For each feature (e.g. gene), the expression values are sorted in ascending order and placed into B bins of (roughly) equal size. As a result, a feature-specific categorical attribute with B levels is constructed, which can be used by itself or in combination with other categorical attributes. For instance, for B = 2 (Figure 5), we take a given feature and compute its median expression. The samples are separated into two bins, depending on whether the expression in the sample is below or above the median. If two percentiles are chosen, the bins are automatically labeled "Low" and "High", but the text box can be used to re-label the bins. The bins are feature-specific since this procedure is repeated for each feature separately.
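The median split described above is easy to emulate; here is a two-line pandas sketch (illustrative only, not Flow's code):

```python
import pandas as pd

expr = pd.Series([0.2, 1.5, 3.1, 0.8, 2.2, 4.0])  # one feature across samples
# B = 2: split at the median into feature-specific "Low"/"High" bins
bins = pd.qcut(expr, q=2, labels=["Low", "High"])
```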
For each group, the survival curve (aka survival function) is estimated using the Kaplan-Meier estimator [1]. For instance, if one selects ER status, which has two levels, and chooses two feature expression bins, four survival curves are displayed in the Data Viewer (Figure 6). The Grouping configuration option can be used to split and modify the connections.
To see whether the survival curves are statistically different, the Kaplan-Meier task runs Log-rank and Wilcoxon (aka Wilcoxon-Gehan) tests. The null hypothesis is that the survival curves do not differ among the groups (the computational details are available in [2]). When feature expression is used, the p-values are also feature-specific (Figure 7). Select the step-plot icon under View to visualize the Kaplan-Meier survival curves for each gene.
Like in Cox Regression task, it is possible to choose stratification factor(s), but the purpose and meaning of stratification are not the same as in Cox Regression. Suppose we want to compare the survival among the four groups defined by the two levels of ER status and the two bins of feature expression. We can select the two factors on “Select group factor(s)” page. In that case, the reported p-values will reflect the statistical difference among the four survival curves that are due to both ER status and the feature expression. Imagine that our primary interest is the effect of feature expression on survival. Although ER status can be important and therefore should be included in the model, we want to know whether the effect of feature expression is significant after the contribution of ER status is taken into account. In other words, the goal is to treat ER status as a nuisance factor and the binned feature expression as a factor of interest.
In qualitative terms, it is possible to obtain an answer if we group the survival curves by the level of ER status. This can be achieved in the Data Viewer by choosing Grouping > Split by under Configure. That makes it easy to compare the survival curves that have the same level of ER status and avoid the comparison of curves across different levels of ER status.
If in Figure 8, we see one or more subplots where the survival curves differ a lot, that is evidence that the feature expression affects the survival even after adjusting for the contribution of ER status. To obtain an answer in terms of adjusted Log-rank and Wilcoxon p-values, one should deselect ER status as a “group factor” (Figure 4) and mark it as a stratification factor instead (Figure 9).
The computation of stratification adjusted p-values is elaborated in [2].
Suppose when the feature expression and ER status are selected as “group factors”, Log-rank p-value is 0.001, and when ER status is marked as stratification factor, the p-value becomes 0.70. This means that ER status is very useful for explaining the difference in survival while the feature factor is of no use if ER status is already in the model. In other words, the marginal contribution of the binned expression factor is low.
If more than two attributes are present, it is possible to measure the marginal contribution of any single factor in a similar manner: the attribute of interest should be selected as “group factor” and the other attributes should be marked as stratification factors. There is no limit on the count of factors that can be selected as “group” or stratification, except that all of the selected factors are involved in defining the groups and the groups should contain enough samples (at least, be non-empty) for the results to be reliable.
[2] Klein, Moeschberger (1997), Survival Analysis: Techniques for Censored and Truncated Data. ISBN-13: 978-0387948294
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensional reduction technique [1]. t-SNE aims to preserve the essential high-dimensional structure and present it in a low-dimensional representation. t-SNE is particularly useful for visually identifying groups of similar samples or cells in large high-dimensional data sets such as single cell RNA-Seq.
We recommend normalizing your data prior to running t-SNE, but the task will run on any counts data node.
Click the counts data node
Click the Exploratory analysis section of the toolbox
Click t-SNE
Click Finish to run
t-SNE produces a t-SNE task node. Opening the task report launches a scatter plot showing the t-SNE results. Each point on the plot is a cell for single cell data or a sample for bulk data. The plot will open in 2D or 3D depending on the user preference.
Choose whether to run t-SNE on all samples together or on each sample individually.
Checking the box will run t-SNE on each sample individually.
This option appears when there are multiple feature types in the input data node (e.g., CITE-Seq data).
Select Any to run on all features or pick a feature type.
t-SNE preserves the local structure of the data by focusing on the distances between each point and its nearest neighbors. Perplexity can be thought of as the number of nearest neighbors being considered. The optimal perplexity depends on the size and density of the data. Generally, a larger and/or more dense data set will benefit from a higher perplexity (Figure 2). Default is 30. The range of possible values is 3 to 100.
t-SNE uses an iterative algorithm to optimize the low-dimensional representation. More iterations will result in a more accurate embedding to an extent, but will take longer to run. Default is 1000.
Several parts of t-SNE utilize a random number generator to provide an initial value. Default is 1. To reproduce the results, use the same random seed across runs.
If selected, t-SNE initializes from random initial positions for each point. If disabled, the initial values for each point are assigned using the largest principal components extracted from the raw data. Default is enabled.
The metric to use when computing distances in high-dimensional space. Options are Euclidean, Manhattan, Chebyshev, Canberra, Bray Curtis, and Cosine. Default is Euclidean.
If checked, mapping error information will be available in the task report. Default is disabled.
Output a t-SNE table data node that can be downloaded. The 2D t-SNE coordinates are labeled Feature 1 and Feature 2; the 3D t-SNE coordinates are labeled Feature 3, 4, and 5. Default is disabled.
t-SNE uses principal components as its input. The number of principal components to use is set here.
We recommend using the PCA task to determine the optimal number of principal components for your data. Default is 50.
Options are equally or by variance. Feature values can be standardized prior to PCA so that the contribution of each feature does not depend on its variance. To standardize, choose equally. To take variance into account and focus on the most variable features, choose by variance. Default is by variance.
You can choose to log transform the data prior to running PCA as part of t-SNE. Default is disabled.
If you are normalizing the data, choose a log base. Default is 2 when Log transform data is enabled.
If you are normalizing the data, choose an offset. Default is 1 when Log transform data is enabled.
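For reference, most of the parameters above have close analogs in the open-source scikit-learn implementation of t-SNE. The sketch below is an illustration under that assumption, not Partek Flow's internal code:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(1000, 2000)  # cells x genes, already normalized (toy)

# t-SNE runs on principal components extracted from the data
pcs = PCA(n_components=50).fit_transform(X)

embedding = TSNE(
    n_components=2,
    perplexity=30,       # effective number of nearest neighbors
    n_iter=1000,         # optimization iterations (max_iter in newer releases)
    metric="euclidean",  # distance metric in high-dimensional space
    init="random",       # or "pca" to start from principal components
    random_state=1,      # fixed seed for reproducibility
).fit_transform(pcs)
```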
[1] L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.
To help you identify which previous filter you want to apply, the color of the task node on the Analyses tab, the number of cell barcodes retained (summed for all samples), and the time/date the previous filter task was submitted are included in the table. To view the number of cells and percentage of reads in cells for each sample in a previous filter, mouse over the button (Figure 3).
Suppose there are n p-values (n is the number of features) and the significance level of each test is α. The expected number of Type I errors is then n*α, so the significance level of each individual test should be adjusted to α/n. Alternatively, the p-values can be adjusted as pB = p*n, where pB is the Bonferroni-corrected p-value. If pB is greater than 1, it is set to 1.
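A short numeric illustration of the correction:

```python
import numpy as np

p = np.array([1e-6, 0.004, 0.03, 0.2])  # raw p-values
n = 20000                                # number of features tested
p_bonferroni = np.minimum(p * n, 1.0)    # pB = p*n, capped at 1
# equivalently, compare each raw p-value to alpha/n,
# e.g. 0.05 / 20000 = 2.5e-6
```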
Click the View extra details report icon under the View section to get more statistical information about the feature. If the task doesn't fail but certain statistical information is not generated, e.g. the p-value and/or fold change of a certain comparison is missing for some or all features, click this icon and mouse over the red exclamation icon for more information.
On the right of each contrast header, there is a volcano plot icon. Select it to display the volcano plot for the chosen contrast (Figure 11).
The feature list filter panel is on the left of the table (Figure 12). Click on the black triangle to collapse and expand the panel.
If the lognormal with shrinkage method was selected for GSA, a shrinkage plot is generated in the report (Figure 13). The X-axis shows the log2 value of average coverage. The plot helps to determine the threshold for low-expression features: if there is an increase before a monotone decreasing trend on the left side of the plot, you need to set a higher threshold on the low-expression filter. Detailed information on how to set the threshold can be found in the .
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
In addition to the issues addressed in , DESeq2 may generate missing values in the multiplicity adjustment columns (such as FDR) if "independent filtering" is enabled in Advanced Options.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Low-expression feature and Multiple test correction sections are the same as the matching GSA advanced options; see that discussion.
For this analysis, only genes with more than one transcript will be included in the calculation. The report format is the same as , with each row representing a transcript; besides statistics on the specified comparisons, there is also alt-splicing information at the right end of the table. That information is represented by the p-value of the interaction of transcript ID with the alt-splicing factor. Note that transcripts of the same gene should have the same p-value. A small p-value indicates a significant alt-splicing event, hence the table is sorted on that p-value by default (Figure 4).
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Stuart T, Butler A, Hoffman P, et al. Comprehensive integration of single-cell data. Cell, 2019.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, Baglaenko Y, Brenner M, Loh P-r, Raychaudhuri S. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods; 2019.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
If the task fails (no report is produced), please follow the directions in .
If the task report is produced, but the results are missing for some features, it may be possible to fix the issue by following the directions in the section.
[1] Kaplan-Meier (product limit) estimator:
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
| Symbol | Meaning |
|---|---|
| S | Sample (or cell for single cell data node) |
| F | Feature |
| Xsf | Value of sample S from feature F (if normalization is performed on a quantification data node, this would be the raw read counts) |
| TXsf | Transformed value of Xsf |
| C | Constant value |
| b | Base of log |
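Assuming the transformation documented with these symbols is TXsf = log_b(Xsf + C) (an assumption based on the symbol definitions above), a minimal sketch:

```python
import numpy as np

def log_transform(X, C=1, b=2):
    """TXsf = log_b(Xsf + C): add constant C, then take log base b."""
    return np.log(X + C) / np.log(b)

X = np.array([[0, 5, 99], [3, 0, 255]])  # raw counts (Xsf)
TX = log_transform(X)                    # transformed values (TXsf)
```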
Cells undergo changes to transition from one state to another as part of development, disease, and throughout life. Because these changes can be gradual, trajectory analysis attempts to describe progress through a biological process as a position along a path. Because biological processes are often complex, trajectory analysis builds branching trajectories where different paths can be chosen at different points along the trajectory. The progress of a cell along a trajectory from the starting point or root, can be quantified as a numeric value, pseudotime.
Partek Flow offers Monocle 2 and Monocle 3 methods.
Major updates in Monocle 3 (compared to Monocle 2) include:
Monocle 3 learns the principal trajectory graph in the UMAP space;
the principal graph is smoothed and small branches are excluded;
support for principal graphs with loops and convergence points;
support for multiple root nodes.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Variations in nucleotide sequence, in the form of single nucleotide variants (SNVs) and insertion and deletion events (INDELs), can either be neutral in nature or can have functional effects. Partek Flow provides all the tools necessary to interrogate and prioritize variants for further analysis. Variants stored in Variant Call Format (vcf) files can be analyzed to filter, annotate, summarize, visualize, and validate your panel of identified variants. Multiple vcf processing tools are available under the Variant analysis section of the context sensitive menu.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
In Partek Flow, we use tools from Monocle 2 [1] to build trajectories, identify states and branch points, and calculate pseudotime values. The output of Trajectory analysis includes an interactive scatter plot visualization for viewing the trajectory and setting the root state (starting point of the trajectory) and adds a categorical cell level attribute, State. From the Trajectory analysis task report, you can run a second task, Calculate pseudotime, which adds a numeric cell-level attribute, Pseudotime, calculated using the chosen root state. Using the state and pseudotime attributes, you can perform downstream analysis to identify genes that change over pseudotime and characterize branch points.
Note that trajectory analysis will only work on data with fewer than 600,000,000 observations (number of cells × number of features); for example, 100,000 cells × 3,000 filtered genes = 300,000,000 observations, which is under the limit. If your data set exceeds this limit, the Trajectory analysis task will not appear in the toolbox. Prior to performing trajectory analysis, you should:
1) Normalize the data
Trajectory analysis requires normalized counts as the input data. We recommend our default "CPM, Add 1, Log 2" normalization for most scRNA-Seq data. For alternative normalization methods, see our Normalization documentation.
2) Filter to cells that belong in the same trajectory
Trajectory analysis will build a single branching trajectory for all input cells. Consequently, only cells that share the biological process being studied should be included. For example, a trajectory describing progression through T cell activation should not include monocytes that do not undergo T cell activation. To learn more about filtering, please see our Filter groups (samples or cells) documentation.
3) Filter to genes that characterize the trajectory
The trajectory should be built using a set of genes that increase or decrease as a function of progression through the biological processes being modeled. One example is using differentially expressed genes between cells collected at the beginning of the process to cells collected at the end of the process. If you have no prior knowledge about the process being studied, you can try identifying genes that are differentially expressed between clusters of cells or genes that are highly variable within the data set. Generally, you should try to filter to 1,000 to 3,000 informative genes prior to performing trajectory analysis. The list manager functionality in Partek Flow is useful for creating a list of genes to use in the filter. To learn more, please see our documentation on Lists.
Dimensionality of the reduced space
While the trajectory is always visualized in a 2D scatter plot, the underlying structure of the trajectory may be more complex and better represented by more than two dimensions.
Scaling
You can choose to scale the genes prior to building the trajectory. Scaling removes any differences in variability between genes, while not scaling allows more variable genes to have a greater weight in building the trajectory.
Click on the task report; a 2D scatter plot will open in the Data Viewer (Figure 1).
The trajectory is shown as a black line. Branch points are indicated by numbers in black circles. By default, cells are colored by state. You can use the control panel on the left to color, size, and shape points by genes and attributes to help identify which state is the root of the trajectory.
To calculate pseudotime, you must choose a root state. The tip of the root state branch will have a value of 0 for pseudotime. Click any cell belonging to that state to select the state. The selected state will be highlighted while unselected cells are dimmed (Figure 2). Choose Calculate pseudotime in the Additional actions on the left control panel.
The Calculate pseudotime task generates a new Pseudotime result data node, which contains a Pseudotime annotation for each cell (Figure 3).
Open the Pseudotime result report; a 2D scatter plot will be displayed in the Data Viewer, colored by Pseudotime by default (Figure 4).
[1] Xiaojie Qiu, Qi Mao, Ying Tang, Li Wang, Raghav Chawla, Hannah Pliner, and Cole Trapnell. Reversed graph embedding resolves complex single-cell developmental trajectories. Nature methods, 2017.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Uniform Manifold Approximation and Projection (UMAP) is a dimensional reduction technique [1]. UMAP aims to preserve the essential high-dimensional structure and present it in a low-dimensional representation. UMAP is particularly useful for visually identifying groups of similar samples or cells in large high-dimensional data sets such as single cell RNA-Seq.
We recommend normalizing your data prior to running UMAP, but the task will run on any counts data node.
Click the counts data node
Click the Exploratory analysis section of the toolbox
Click UMAP
Click Finish to run
UMAP produces a UMAP task node. Opening the task report launches a scatter plot showing the UMAP results. Each point on the plot is a cell for single cell data or a sample for bulk data. The plot will open in 2D or 3D depending on the user preference.
Both t-SNE and UMAP are dimensional reduction techniques that are useful for identifying groups of similar samples in large high-dimensional data sets. A comparison of the techniques for visualizing single cell RNA-Seq data by the authors of UMAP suggests that UMAP runs faster, is more reproducible, gives a more meaningful organization of clusters, and preserves more information about the global structure of the data than t-SNE [2].
In our hands, we find UMAP to be more informative than t-SNE for many data sets. For example, the similarities and differences between clusters are clearly visible with UMAP, but more difficult to judge with t-SNE (Figure 1).
Sets the initialization mode. Options are Spectral and Random.
Spectral - good initial points are chosen using spectral embedding (more accurate)
Random - random initial points are chosen (faster)
Choose whether to run UMAP on all samples together or on each sample individually.
Checking the box will run UMAP on each sample individually.
This option appears when there are multiple feature types in the input data node (e.g., CITE-Seq data).
Select Any to run on all features or pick a feature type.
UMAP preserves the local structure of the data by focusing on the distances between each point and its nearest neighbors. Local neighborhood size is the number of nearest neighbors to consider.
You can adjust this value to prioritize global or local relationships. Smaller values will give a more local view, while larger values will give a more global view (Figure 2). Default is 30.
The effective minimum distance between embedded points. Smaller values will create a more clustered embedding, while larger values will create a more evenly dispersed embedding.
You can decrease this value to make clusters more tightly packed or increase it to make them looser (Figure 3). Default is 0.3.
The metric to use when computing distances in high-dimensional space. Options are Euclidean, Manhattan, Chebyshev, Canberra, Bray Curtis, and Cosine. Default is Cosine.
UMAP uses an iterative algorithm to optimize the low-dimensional representation. The value 0 corresponds to the default, which chooses the number of iterations based on the size of the input data. More iterations will result in a more accurate embedding, but will take longer to run. Default is 0.
Several parts of UMAP utilize a random number generator to provide an initial value. Default is 42. To reproduce the results, use the same random seed across runs.
Output a UMAP table data node that can be downloaded. The 2D UMAP coordinates are labeled Feature 1 and Feature 2; the 3D UMAP coordinates are labeled Feature 3, 4, and 5. Default is disabled.
UMAP uses principal components as its input. The number of principal components to use is set here. Default is 10.
We recommend using the PCA task to determine the optimal number of principal components for your data.
Options are equally or by variance. Feature values can be standardized prior to PCA so that the contribution of each feature does not depend on its variance. To standardize, choose equally. To take variance into account and focus on the most variable features, choose by variance. Default is by variance.
You can choose to log transform the data prior to running PCA as part of UMAP. Default is disabled.
If you are normalizing the data, choose a log base. Default is 2 when Log transform data is enabled.
If you are normalizing the data, choose an offset. Default is 1 when Log transform data is enabled.
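Most of the parameters above have close analogs in the open-source umap-learn package. The sketch below is an illustration under that assumption, not Partek Flow's internal code:

```python
import numpy as np
import umap
from sklearn.decomposition import PCA

X = np.random.rand(1000, 2000)  # cells x genes, already normalized (toy)

# UMAP runs on principal components extracted from the data
pcs = PCA(n_components=10).fit_transform(X)

embedding = umap.UMAP(
    n_components=2,
    n_neighbors=30,   # local neighborhood size
    min_dist=0.3,     # minimum distance between embedded points
    metric="cosine",  # distance metric in high-dimensional space
    init="spectral",  # or "random" for faster, less accurate starts
    random_state=42,  # fixed seed for reproducibility
).fit_transform(pcs)
```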
[1] McInnes L and Healy J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv, 2018, e-prints 1802.03426,
[2] Becht E, McInnes L, Healy J, Dutertre A-C, Kwok I, Guan Ng L, Ginhoux F, and Newell E, Dimensionality reduction for visualizing single-cell data using UMAP, Nature Biotechnology, 2019, 37, 38-44.
Multi-omics single cell analysis is based on simultaneous detection of different types of biological molecules on the same cells. Common multi-omics techniques include feature barcoding or CITE-seq (cellular indexing of transcriptomes and epitopes by sequencing) technologies, which enable parallel assessment of gene and protein expression. Specific bioinformatics tools have been developed to enable scientists to integrate results of multiple assays and learn relative importance of each type (or each biological molecule) in identification of cell types. Partek Flow supports weighted nearest neighbor (WNN) analysis (1), which can help combine output of two molecular assays.
This task can only be performed on data nodes containing PCA scores, i.e. PCA output nodes and graph-based clustering output nodes generated from PCA nodes. To start, select the PCA data node of one of the assays (e.g. gene expression) and go to Exploratory analysis > Find multimodal neighbors in the toolbox. On the task setup page, use the Select data node button to point to the PCA data node of the other assay (e.g. protein expression); a node is already selected by default (Figure 1).
When you click the Select data node button, Partek Flow will open another dialog showing your current pipeline (Figure 2). Data nodes that can be used for WNN are shown in the color of their branch; other nodes are disabled (greyed out). To pick a node, left-click on it and then click the Select button.
The selected data node is shown under the Select data node button. If you made a mistake, use the Clear selection link (Figure 1).
If graph-based clustering has been performed on a PCA data node, the clustering output node retains the PCA scores from its input data, so graph-based clustering data nodes can also serve as input for the WNN task.
To customize the Advanced options, select the Configure link (Figure 1). At present, you can only change the number of nearest neighbors for each modality (the k.nn option of the Seurat package); the default value is 20 (Figure 3). An illustration of how to use that option to assess the robustness of WNN analysis can be found in Hao et al. (1). The nearest neighbor search method is K-NN and the distance metric is Euclidean.
To launch the Find multimodal neighbors task, click the Finish button on the task setup page (Figure 1). For each cell, the WNN algorithm calculates its closest neighbors based on a weighted combination of RNA and protein similarities. The output of the Find multimodal neighbors task is a WNN data node.
For downstream analysis, you can launch UMAP or graph-based clustering tasks on a WNN node. For example, Figure 4 shows a snippet of an analysis of a feature barcoding data set; gene expression and protein expression data were processed separately, and then Find multimodal neighbors was invoked on the two respective PCA data nodes. UMAP and graph-based clustering tasks were then performed on the WNN node.
For an excellent illustration of the advantages of the WNN algorithm for identification of cell types, please see this blog post.
Hao Y, Hao S, Andersen-Nissen E, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573-3587.e29. doi:10.1016/j.cell.2021.04.048
Hierarchical clustering is a statistical method used to assign similar objects into groups called clusters. It is typically performed on results of statistical analyses, such as a list of significant genes/transcripts, but can also be invoked on the full data set, as a part of exploratory analysis.
Hierarchical clustering is an unsupervised technique, meaning that the number of clusters is not specified upfront. In the beginning, each row and/or column is considered a cluster. The two most similar clusters are merged, and merging continues until all objects are in the same cluster. Hierarchical clustering produces a tree (called a dendrogram) that shows the hierarchy of the clusters.
This tutorial will illustrate how to:
To invoke hierarchical clustering, select a data node containing count data (e.g. Gene counts, Normalized counts, Single cell counts), or a Feature list data node (to cluster significant genes/transcripts) and then click on the Hierarchical clustering / heat map option in the context sensitive menu (Figure 1).
The hierarchical clustering setup dialog (Figure 2) enables you to control the clustering algorithm. Starting from the top, you can choose to plot a Heatmap or a Bubble map (clustering can be performed on both plot types). Next, perform Ordering by selecting Cluster for either feature order (genes/transcripts/proteins) or cell/sample/group order or both. Note the context-sensitive image that helps you decide to either perform hierarchical clustering (dendrogram) or assign order (arrow) for the columns and rows to help you orient yourself and make decisions (In Figure 2 below, Cluster is selected for both options so a dendrogram is shown in the image).
When Assign order is chosen, the default order of cells/samples/groups (rows) is based upon the labels as displayed in the Data tab, and the order of features (columns) depends on the input data node.
Feature order can be assigned by selecting a managed list (e.g. feature lists saved from report nodes, or lists added under List management in the settings) from the drop-down. This limits the plot to the features in the list, ordered as they are listed. If a feature from the list is not available in the input data node, it will not be shown in the plot. Note that if none of the features are available in the data node, the task cannot be performed and an error message will be shown.
Cell/Sample/Group order can also be assigned by choosing an attribute from the drop down list. Click and drag to rearrange categorical attributes; numeric attributes can be sorted in ascending or descending order (note the arrows in the image which are different from the dendrogram for Cluster) (Figure 3).
If you do not want to cluster all the samples, but select a subset based on a specific sample or cell attribute (i.e. group membership), check Filter cells under Filtering and set a filtering rule using the drop down lists (Figure 4). Notice the drop-down lists allow more than one factor (when available) to be selected at a time. When configuring the filtering rule, use AND to ensure all conditions pass for inclusion and use OR for any conditions to pass.
Hierarchical clustering uses distance metrics to sort rows and columns by similarity; the default is Average Linkage. This can be adjusted by clicking Configure under Advanced options (Figure 5). You can also choose how the data is scaled (sometimes referred to as normalized). There are three Feature scaling options: Standardize (the default for a heatmap) makes each column's mean zero and standard deviation 1, so that all features (e.g. genes or proteins) have equal weight; standardized values are also known as Z-scores. Shift makes each column's mean zero. Choose None to skip scaling and perform clustering on the values in the input data node (this is the default for a bubble map). If a bubble map is scaled, scaling is performed on the group summary method (color).
Cluster distance metric for cells/samples and features is used to determine how the distance between two clusters will be calculated:
Single Linkage: the distance between two clusters is determined by the distance of the closest objects in the two clusters
Complete Linkage: the distance between two clusters is equal to the distance between the two furthest members of those clusters
Average Linkage: the average distance between all the pairs of objects in the two different clusters is used as the measure of distance between the two clusters
Centroid method: the distance between two clusters is equal to the distance between the centroids of those clusters
Ward's method: the distance between two clusters is designed to minimize the size of an error measure based on the sum of squares
Point distance metric is used to determine the distance between two rows or columns. For more detailed information about the equations, we refer you to the distance metrics chapter below.
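The same combination of a cluster distance metric (linkage method) and a point distance metric can be sketched with SciPy; this is illustrative only, not Flow's implementation:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(30, 500)  # samples x features (e.g., standardized values)

# method = cluster distance metric: single, complete, average, centroid, ward
# metric = point distance metric between rows, e.g. euclidean
Z = linkage(X, method="average", metric="euclidean")
dendrogram(Z)  # tree showing the hierarchy of clusters
```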
The output of a Hierarchical clustering task can be a heatmap (Figure 6) or a bubble map with or without dendrograms depending on whether you performed clustering on cells/samples/groups or features. By default, samples are on rows (sample labels are displayed as seen in the Data tab) and features (depending on the input data) are on columns. Colors are based on standardized expression values (default selection; performed on the fly). Dendrograms show clustering of rows (samples) and columns (variables).
Depending on the resolution of your screen and the number of samples and variables (features) to be displayed, some binning may be involved: if there are more samples/genes than pixels, values of neighboring rows/columns will be averaged together. Use the mouse wheel to zoom in and out. When you zoom in far enough on the heatmap, each cell represents one sample/gene. When you mouse over the row dendrogram or label area and zoom, only the rows zoom in/out; the binning on the columns remains the same. Similarly, when you mouse over the column dendrogram or label area and zoom, only the columns zoom in/out; the binning on the rows remains the same. To move the map around when zoomed in, hold the left mouse button and drag the map. The plot can be saved as a full-size image or as the current view; when Save image is clicked, a prompt will ask how you would like to save the image.
The Hierarchical clustering task can also be used to plot a bubble map. Let's go through the steps to make a bubble map (Figure 7):
Choose to plot a Bubble map (note the selection of a bubble map in the image which is different from the heatmap). This will open the Bubble map settings.
Configure the Bubble map settings. First, Group cells by an available categorical attribute (e.g. cell type). Next, summarize the group's first dimension by color (Group summary method), then choose an additional dimension to plot as size (Additional statistic) using the drop-down lists. If these settings are not adjusted, the defaults will generate two descriptive statistics: the group mean plotted by color and the percent of cells plotted by size. Hierarchical clustering can be performed on the first dimension (color), which is the Group summary method. The second dimension (size), the Additional statistic, is not required; it is selected by default but can be unchecked.
Ordering the plot columns (Feature order) and rows (Group order) behaves the same as for a heatmap. In this example, Ordering both features and groups by Cluster uses hierarchical clustering with the default distance metrics (these can be changed under Configure in the Advanced options section). Alternatively, Assign order to features using a managed (saved) feature list or the default order, which depends on the input data. Assign order for groups can be used to rearrange the attribute by drag and drop, in ascending or descending order, or in the default order, which is how the labels are displayed in the Data tab.
Filtering can be applied to the groups by checking Filter cells then specifying the logical operations to filter by (this is the same as a heatmap).
Advanced options let the user perform Feature scaling (e.g. Standardize by z-score), though in a bubble map the default is None. They also allow the user to change the Group clustering and Feature clustering options by altering the Cluster distance metric and Point distance metric (similar to a heatmap).
There are plot Configuration/Action options for the Hierarchical clustering / heatmap task which apply to both the heatmap and bubble map in the Data viewer (below): Axes, Heatmap, Dendrograms, Annotations, and Description. Click on the icon to open these configuration options.
This section controls the Content or data source used to draw the values in the heatmap or bubble map and also the ability to transpose the axes. The plot is a color representation of the values in the selected matrix. Most of the data nodes contain only one matrix, so it will just say Matrix for the chosen data node. However, if a data node contains multiple matrices (e.g. descriptive statistics were performed on cluster groups for every gene like mean, standard deviation, percent of cells, etc) each statistic will be in a separate matrix in the output data node. In this case, you can choose which statistic/matrix to display using the drop-down list (this would be the case in a bubble map).
To change the orientation (switch the columns and rows) of the plot, click on the Transpose toggle switch.
Row labels and Column labels can be turned on or off by clicking the relevant toggle switches.
The label size can be changed by specifying the number of pixels using Max size and Font. If an Ensembl annotation model has been associated with the data, you can choose to display the gene name or the Ensembl ID using the Content option.
This section is used to configure the color, range, size, and shape of the components in the heatmap.
In the color palette horizontal bar, the left-most color represents the lowest value and the right-most color represents the highest value in the matrix data. Note that when you zoom in or out, the lowest and highest values captured by the color palette may change. By default, there are 3 color stops: the minimum, middle, and maximum color values of the default range calculated on the matrix. Left-click and drag the middle color stop left or right to change the value it represents; left-click it once to change its color and value. Click the (X) to remove a color stop.
Click on the color square or the adjacent triangle to choose a color to represent the value. This will display a color picker dialog which allows selection of a color, either by clicking or by typing a HEX color code, then clicking OK.
The shape of the heatmap cell (component) can be configured either as a rectangle or circle by selecting the radio button under Shape.
If cluster analysis is performed on samples and/or features, the result will be displayed as dendrograms. By default, the dendrograms are all colored in black.
The color of the dendrograms can be configured.
Click on the color square or its triangle to choose a different color for the dendrogram.
Choose an attribute from the Row annotation drop-down list. Multiple attributes can be chosen from the drop-down list and can be reordered by clicking and dragging the groups below the drop-down list. Each attribute is represented as an annotation bar next to the heatmap. Different colors represent the different groups in the attribute.
The width of the annotation bar can be changed using the Block size slider when the Show labels toggle switch is on.
The annotation label font size can be changed by specifying the size in pixels.
The Fill blocks toggle switch adds or removes color from the annotation labels.
Description is used to modify the Title and toggle on or off the Legend.
The heatmap has several different mouse modes which modify the way the plot responds to the mouse buttons. The mode buttons are in the upper right corner of the heatmap. Clicking one of these buttons puts the heatmap into that mode.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
AUCell is a tool to identify cells that are actively expressing genes within a gene list [1]. For each input gene list, AUCell calculates a value for each cell by ranking all genes by their expression level in the cell and identifying what proportion of the genes from the gene list fall within the top 5% (default cutoff) of genes. This method allows the AUCell value to represent the proportion of genes from the gene list that are expressed in the cell and their relative expression compared to other genes within the cell. Because this is a rank-based method and is calculated for each cell individually, AUCell can be run on raw or normalized data. As an AUCell value is the proportion of genes from the list that are within the top percentile of expressed genes, AUCell values can range from 0 to 1, but may have a more restricted range.
AUCell values can be used directly as input for downstream analysis, such as clustering. Another common use is to set an AUCell value cutoff for expressing vs. not expressing and use this to classify cells. AUCell values will separate cells most effectively when the genes in the list are highly and specifically expressed in a population of cells. If the genes are specifically expressed, but not highly expressed, the AUCell value will not be as useful.
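Following the simplified description above (the published method computes an area under a gene-recovery curve, so real AUCell values differ in detail), here is a toy sketch of the per-cell score:

```python
import numpy as np

def aucell_like_score(expression, gene_names, gene_list, top_pct=5):
    """Fraction of genes from gene_list found among the top `top_pct`
    percent of genes ranked by expression in one cell (simplified sketch,
    not the published AUCell algorithm)."""
    n_top = max(1, int(len(gene_names) * top_pct / 100))
    # indices of the n_top most highly expressed genes in this cell
    top_idx = np.argsort(expression)[::-1][:n_top]
    top_genes = set(np.asarray(gene_names)[top_idx])
    hits = sum(1 for g in gene_list if g in top_genes)
    return hits / len(gene_list)
```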
AUCell can be run on any single cell counts data node.
Click the single cell counts data node
Click the Exploratory analysis section in the toolbox
Click AUCell
Choose gene lists by clicking and dragging them to the panel on the right or clicking the green plus that appears after mousing over a gene list (Figure 1)
Click Finish to run
AUCell produces an AUCell result data node. The AUCell result data node includes the input counts data and adds the AUCell scores to the original data as a new data type, AUCell Values. AUCell values for each input feature list are included as features in the AUCell result data node. These features created by AUCell are named after the feature list (e.g., B cells, Cytotoxic cells).
Because the AUCell values are added as features, they can be used as input for clustering, differential analysis, and visualization tasks.
To produce a data node containing only the AUCell values, use Split matrix to split the AUCell result data node into separate data nodes for each of its data types. This can be helpful if you intend to perform downstream analysis on the AUCell values. To perform differential analysis, it is advisable to normalize the values by adding a small offset (e.g. 1E-9) and applying a Logit transformation with base Log2 using the Normalization task. This makes the values continuous and suitable for differential analysis with methods such as ANOVA/LIMMA-trend/LIMMA-voom, Non-parametric ANOVA, or Welch's ANOVA. For differential analysis, please check that the Low-value filter is set to None and that the values are correctly recognized as Log2 transformed in the Advanced settings.
If an AUCell result data node or other downstream data node containing AUCell Values is used as the input for AUCell, the additional AUCell values will be added as additional features of the AUCell values data type in the new AUCell result data node.
For each gene set, AUCell computes the intersection between the gene list and the input data set. If the intersection size is below the specified threshold, the gene set is ignored and no AUCell score is calculated for it. Default is 5.
To calculate the AUCell value, genes are ranked and the fraction of genes from the gene list that are above the percentile cutoff is the AUCell value. This parameter sets the percentile cutoff. Default is 5.
[1] Aibar, S., González-Blas, C. B., Moerman, T., Imrichova, H., Hulselmans, G., Rambow, F., ... & Atak, Z. K. (2017). SCENIC: single-cell regulatory network inference and clustering. Nature methods, 14(11), 1083.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
To analyze scATAC-seq data, Partek Flow introduced latent semantic indexing (LSI) [1]. LSI combines term frequency-inverse document frequency (TF-IDF) normalization with singular value decomposition (SVD), returning a reduced-dimension representation of a matrix. Although SVD and principal components analysis (PCA) are different techniques, SVD has a close connection to PCA; indeed, PCA can be computed as an application of the SVD. For users more familiar with scRNA-seq, you can think of the SVD output as analogous to the output of PCA. Similarly, the statistical interpretation of singular values is in terms of the variance in the data explained by the various components: the singular values produced by the SVD are ordered from largest to smallest and, when squared, are proportional to the amount of variance explained by the corresponding singular vector.
The SVD task can be invoked from the Exploratory analysis section of the toolbox by clicking any single cell counts data node (Figure 1). We recommend running SVD on normalized data, particularly TF-IDF normalized counts for scATAC-seq analysis.
To run SVD task,
Click a single cell counts data node
Click the Exploratory analysis section in the toolbox
Click SVD
The SVD dialog asks only for the number of singular values to compute (Figure 2). By default, 100 singular values are computed; the number can be adjusted or typed in directly. Simply click the Finish button to run the task with default settings.
The task report for SVD is similar to PCA. Its output will be used for downstream analysis and visualization, including Harmony (Figure 3).
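The TF-IDF plus truncated SVD combination can be sketched with scikit-learn. This illustrates the LSI idea on a toy peak-by-cell matrix and is not Partek Flow's implementation:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD

counts = np.random.poisson(0.2, size=(500, 10000))  # cells x peaks (toy)

tfidf = TfidfTransformer().fit_transform(counts)      # TF-IDF normalization
svd = TruncatedSVD(n_components=100, random_state=0)  # truncated SVD
lsi = svd.fit_transform(tfidf)                        # reduced representation

# squared singular values are proportional to the variance explained
print(svd.explained_variance_ratio_[:5])
```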
Cusanovich, D., Reddington, J., Garfield, D. et al. The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature 555, 538–542 (2018). https://doi.org/10.1038/nature25981
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Variations in nucleotide sequence, in the form of single nucleotide variants (SNVs) and insertion and deletion events (INDELs), can exist within the germline or can be acquired by somatic alterations. Partek Flow provides pipeline creation tools to identify both SNVs and INDELs using aligned reads generated from targeted, whole exome, or whole genome DNA-Seq (or RNA-Seq) data. Detection of these variants can be performed by comparison against either the reference sequence utilized for alignment or among paired samples in a project. Tools for variant detection are performed on either Aligned reads or Filtered reads data nodes (Figure 1), and the Detect variants task node will produce a Variants data node. The Variants data node will contain Variant Call Format (vcf) files for each sample in the project. Three detection tools, each employing unique algorithms to identify variants in aligned sequence data, are available under the Variant callers section of the context sensitive menu:
Figure 1. Variant callers available from an Aligned reads data node
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
A very popular variant detection approach that performs well in many situations, FreeBayes (version 1.0.1) employs a Bayesian statistical framework to determine the most likely combination of genotypes in the sample(s) at each position in a reference genome for any number of individuals from a population. It is haplotype-based, calling variants based on the literal sequences of reads aligned to a particular target and not their precise alignment. This method can identify both single nucleotide variants and insertions/deletion events. Information on the model underlying the variant detection are detailed by Garrison et al.1
Selecting FreeBayes from the context sensitive menu will bring up the Freebayes task dialog (Figure 1), which contains two sections: Select Reference sequence and Advanced options.
Select Reference sequence will specify the reference assembly to utilize for variant detection. If the alignment was generated in Partek Flow, the Assembly will be displayed as text in the section, and you do not have the option to change the reference. In the event that alignment was performed outside of Partek Flow, you will need to select the appropriate Assembly utilized for alignment in the drop-down list. Assemblies previously added to library files (see Library File Management) will be available for selection or New assembly… can be utilized to import the reference sequence to library files from within the task.
Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. July 2012. https://arxiv.org/abs/1207.3907.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
CellPhoneDB addresses the challenges of studying cell-cell communication in scRNA-seq and spatial data. It allows researchers to move beyond measuring gene expression and delve into the complex world of cellular communication. By analyzing scRNA-seq or spatial data through the lens of CellPhoneDB, researchers can identify potential signaling pathways and communication networks between different cell types within the sample. Partek Flow wraps the statistical analysis pipeline (method 2) from CellPhoneDB v5 [1][2] for this purpose.
Invoke the CellPhoneDB task in Partek Flow from a normalized counts data node using the Exploratory analysis section (Figure 1). We recommend running CellPhoneDB on the log normalized data directly.
To run CellPhoneDB task,
Click a Normalized counts data node
Click the Exploratory analysis section in the toolbox
Click CellPhoneDB
The GUI is simple and easy to understand; for each option, the grey description text explains the details (Figure 2). If the dataset is single cell RNA-Seq, the Micro environment file is not needed. However, for spatial data you will most likely want to provide the Micro environment file to make use of its spatial context. By default, a value of 0.10 is used as the threshold to select which cells in a cluster are used for the analysis; the number can be adjusted or typed in directly. Simply click the Finish button to run the task with default settings.
Double-click the CellPhoneDB result data node to open the task report in the Data Viewer. It is a heatmap summarizing how many significant interactions were identified for each cell type pair (Figure 3).
To explore further, the Explore CellPhoneDB results task allows users to filter CellPhoneDB results by specifying cell type pairs and genes of interest. After clicking the CellPhoneDB data node (Figure 4a), you will find it is the only task offered under the Exploratory analysis menu (Figure 4b). Its dialog is likewise straightforward (Figure 4c). Genes of interest are data dependent and usually come from published results of similar studies or from differential gene analysis between conditions (e.g., cancer patients vs. healthy controls). Once set up, click the Finish button to submit the job.
Double-clicking the Output matrix data node opens the task report in the Data Viewer. It is another variant of the heatmap, displaying how the genes of interest interact in the defined cell type pairs (Figure 5). The example plot also indicates the data come from two environments. For instructions on setting up the Micro environment file for your spatial study, refer to Figure 2. CellPhoneDB analysis classifies signaling pathways for genes of interest; these classifications are then used to annotate the heatmap within the task report.
It is important to note that the interactions are not symmetric. The authors state that, "Partner A expression is considered for the first cluster/cell type (clusterA), and partner B expression is considered on the second cluster/cell type (clusterB). Thus, IL12-IL12 receptor for clusterA-clusterB (i.e. the receptor is in clusterB) is not the same as IL-12-IL-12 receptor for clusterB-clusterA (i.e. the receptor is in clusterA), and will have different values." [3][4]
The interactions come from the CellPhoneDB database, a manually curated repository of reviewed molecular interactions with demonstrated evidence for a role in cellular communication. [5]
Troulé et al. (2023). CellPhoneDB v5: inferring cell-cell communication from single-cell multiomics data. https://arxiv.org/pdf/2311.04567.pdf
In Partek Flow, we use tools from Monocle 3 (1) to build trajectories, identify states and branch points, and calculate pseudotime values. The output of the Trajectory analysis task includes an interactive 2D/3D visualization for viewing the trajectory trees and setting the root states (starting points of the trajectories). From the Trajectory analysis report, you can run a second task, Calculate pseudotime, which adds a numeric cell-level attribute, Pseudotime, calculated using the chosen root states.
Trajectory analysis by Monocle 3 requires data normalization and preprocessing. For normalization, we suggest first using the Normalization and scaling section of the toolbox to normalize by counts per million (CPM), add an offset of 1, and log2 transform. After that, launch Trajectory analysis on the Normalized counts node; this input node cannot have zero values.
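As an illustration of that recipe, here is a minimal sketch of the CPM, offset, and log2 steps in Python; it mirrors the description above, not Partek Flow's internal implementation:

```python
import numpy as np

def cpm_log2(counts, offset=1.0):
    # counts: 2D array, rows = cells/samples, columns = genes/features
    per_million = counts / counts.sum(axis=1, keepdims=True) * 1e6  # CPM
    return np.log2(per_million + offset)  # the offset avoids log2 of zero

raw = np.array([[10.0, 0.0, 90.0],
                [5.0, 5.0, 40.0]])
normalized = cpm_log2(raw)
```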
According to the Monocle 3 authors, you may want to filter in the top 5,000 genes with the highest variance (2,000 genes for datasets with fewer than 5,000 cells, and 300 genes for datasets with fewer than 1,000 cells) (1). These numbers should be used as guidance for a first-pass analysis and may need to be optimized, depending on the project at hand and the biological question.
To run the Trajectory analysis tool, select the Normalized counts data node (or equivalent) and go to the toolbox: Exploratory analysis > Trajectory analysis.
The configuration dialog presents four options (Figure 1).
Dimensionality of reduced space. This option specifies the number of UMAP dimensions that the original data are reduced to in order to learn the trajectory tree (the dimensionality of the original data equals the number of genes). The default is two, meaning that the trajectory plot will be drawn in two dimensions. To get a 3D trajectory plot, increase this option to 3.
Scaling. Normalized expression values can be further transformed by scaling to unit variance and zero mean (i.e. converting to Z score). The use of this option is recommended (1).
Data is logged. Select this option if the data have already been log-transformed upstream. When selected, Monocle 3 will skip the log2 step on the input data (see below).
Programmatically calculate default root nodes. If not selected (the default), the user has to specify the root nodes of the trajectory tree manually. Depending on the available meta-data, Monocle 3 may be able to pick the root nodes programmatically (see below for details).
Under the hood, Monocle 3 will perform a log2 transformation of the gene count matrix (if Data is logged was unselected), scale the matrix (if Scaling was selected), and project the gene count matrix onto the top 50 principal components. Next, dimensionality reduction is performed by UMAP (using the default settings of the reduce dimension command).
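A rough sketch of this sequence of steps, using scikit-learn and umap-learn as stand-ins for Monocle 3's internals; the parameter defaults here are illustrative, not Monocle 3's exact settings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import umap  # umap-learn package

def trajectory_embedding(x, data_is_logged=True, scaling=True, n_umap=2):
    if not data_is_logged:
        x = np.log2(x + 1)                     # skipped if already logged upstream
    if scaling:
        x = StandardScaler().fit_transform(x)  # unit variance, zero mean
    pcs = PCA(n_components=50).fit_transform(x)  # assumes >= 50 cells and genes
    return umap.UMAP(n_components=n_umap).fit_transform(pcs)
```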
The result of running Trajectory analysis in Partek Flow is the Trajectory result data node. Double-clicking the node opens a Data Viewer window with the trajectory plot (Figure 2). The cell trajectory graph shows the position of each cell (blue dot) with respect to the UMAP coordinates (axes). Cell trajectories (one or more, depending on the data set) are depicted as black lines. Gray circles are trajectory nodes (i.e. cell communities).
To show or hide the cell trajectory tree and trajectory nodes, select Axes in the Configure section and, in the upper-right corner of the dialog, use the Extra data drop-down options (Figure 2).
To perform pseudotime analysis, you need to point to the cells at the beginning of the biological process you are interested in, for example, cells at the earliest stage of a differentiation sequence. There are two ways to perform pseudotime analysis in Partek Flow, depending on how the root nodes (the cells at the beginning of pseudotime) are defined.
Manual selection of root node. The user has to specify the root nodes (one or more).
Automatic selection of the root node. The root node is picked by the algorithm.
If you want to manually pick the root nodes, leave the option Programmatically calculate default root nodes unselected when setting up the Trajectory analysis.
To start, select the root nodes (gray circles in the trajectory tree) by left-clicking. If the trajectory result consists of more than one trajectory tree, you can specify more than one root node, e.g. one root node per trajectory tree (Ctrl & click). If no root node is specified for a tree, that tree will not be included in the pseudotime calculation. Figure 4 shows an example where seven root nodes were identified.
Once you have identified all the root nodes, click the Additional button in the Tools section on the left panel, then push the Calculate pseudotime button in the dialog (Figure 5).
As a result, the cells will be annotated by pseudotime, using a green-to-red gradient (start and end, respectively) (Figure 6). If no root node has been identified for a particular tree, its cells will be omitted from the pseudotime calculation and colored in black (Figure 8).
The pseudotime calculation displays the structure of the graph using black lines. The circles with numbers (cell nodes) on the black lines represent special points. There are three types of cell nodes:
Root node (white). Root nodes are start points of the pseudotime and were defined by the user in the previous step (e.g. node 4 in Figure 7).
Branch node (black). Branch nodes indicate where the trajectory tree forks out; i.e. each branch represents a different cell fate or different trajectory (e.g. nodes 3-6, and 8 in Figure 7).
Leaf (light gray). Leaves correspond to different cell fates / different trajectory outcomes (e.g. nodes 5, 9, and 12 in Figure 7). The leaves correspond to cell states of Monocle 2.
The numbers within the circles are provided for reference purposes only. The intermediate nodes from the previous step have been removed.
If suitable meta-data are available, it is possible to automatically select the root node. For example, you may know which cells were harvested from the earliest time points. The cells need to be annotated by that information (Annotate Cells task) before running Trajectory analysis. The annotation will, in turn, be available in the Trajectory analysis setup dialog, upon selecting the Programmatically calculate default root nodes option.
Attribute for root nodes. The drop down list will show the available cell-level attributes. Specify the one which should be used to identify the root nodes.
Attribute value for root nodes. The drop down list will show the content of the attribute selected under Attribute for root nodes. Specify the entry that corresponds to the earliest time point.
Once the options have been set, Monocle 3 will first group the cells according to which trajectory node they are nearest to. It then calculates the fraction of the cells from the earliest time point at each trajectory node. Finally, it picks the node with the highest prevalence of the early cells and treats it as the root node.
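A sketch of that heuristic in Python, with hypothetical variable names (this is not Monocle 3's actual code):

```python
import numpy as np

def pick_root_node(cell_coords, node_coords, is_early_cell):
    # Assign each cell to its nearest trajectory node (distances in UMAP space).
    dists = np.linalg.norm(cell_coords[:, None, :] - node_coords[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    # Fraction of earliest-time-point cells grouped at each trajectory node.
    early_fraction = np.array([
        is_early_cell[nearest == k].mean() if (nearest == k).any() else 0.0
        for k in range(len(node_coords))
    ])
    return early_fraction.argmax()  # node with the highest prevalence of early cells
```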
Cao J, Spielmann M, Qiu X, Huang X, Ibrahim DM, Hill AJ, Zhang F, Mundlos S, Christiansen L, Steemers FJ, Trapnell C, Shendure J. The single-cell transcriptional landscape of mammalian organogenesis. Nature. 2019 Feb;566(7745):496-502. doi: 10.1038/s41586-019-0969-x. Epub 2019 Feb 20. PMID: 30787437; PMCID: PMC6434952.
Principal components (PC) analysis (PCA) is an exploratory technique used to describe the structure of high-dimensional data by reducing its dimensionality. It is a linear transformation that converts n original variables (typically genes or transcripts) into n new variables, called PCs, which have three important properties:
PCs are ordered by the amount of variance explained
PCs are uncorrelated
PCs explain all variation in the data
PCA is a principal axis rotation of the original variables that preserves the variation in the data. Therefore, the total variance of the original variables is equal to the total variance of the PCs.
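This property is easy to verify numerically; a minimal sketch using scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

x = np.random.default_rng(0).normal(size=(100, 5))
scores = PCA().fit_transform(x)  # keep all components

total_original = x.var(axis=0, ddof=1).sum()
total_pcs = scores.var(axis=0, ddof=1).sum()
assert np.isclose(total_original, total_pcs)  # total variance is preserved
```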
If read quantification (i.e. mapping to a transcript model) was performed by Partek E/M algorithm, PCA can be invoked on a quantification output data node (Gene counts or Transcript counts) or, after normalization, on a Normalized counts data node. Select a node on the canvas and then PCA in the Exploratory analysis section of the context sensitive menu.
There are two options for how features contribute (Figure 1):
equally: all features are standardized to a mean of 0 and a standard deviation of 1. This gives all features equal weight in the analysis and is the default option for e.g. bulk RNA-seq data.
by variance: the analysis gives more emphasis to features with higher variances. This is the default option for e.g. single cell RNA-seq data.
If the input data node is on a linear scale, you can apply a log transformation as part of the PCA calculation.
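A sketch contrasting the two contribution options above, assuming standardization is how equal contribution is achieved:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

x = np.random.default_rng(1).normal(size=(50, 10))
x[:, 0] *= 100  # one feature with a much larger variance

by_variance = PCA(n_components=3).fit(x)  # high-variance feature dominates PC1
equally = PCA(n_components=3).fit(StandardScaler().fit_transform(x))  # equal weights
```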
The PCA task creates a new task node. To open it and see the result, either select the PCA task node and go to Task result in the context sensitive menu, or double-click on the PCA task node. The report contains eigenvalues, PC projections, component loadings, and mapping error information for the first three PCs.
When the PCA node is opened in the Data viewer, the default layout shows a 3D scatterplot, a Scree plot with eigenvalues, and a Component loadings table (Figure 2). Each dot on the 3D scatter plot represents an observation; the first three PCs are shown on the X-, Y-, and Z-axes respectively, with the information content of each PC in parentheses.
As an exploratory tool, the PCA scatterplot is used to view any groupings in the data set and generate hypotheses based on the outcome, or to spot possible outliers.
To rotate the 3D scatter plot, left-click & drag. To zoom in or out, use the mouse wheel. Click and drag the legend to move it to a different location on the viewer.
Detailed configuration on PCA plot can be found by clicking Help>How-to videos>Data viewer section.
In the Data viewer, when a PCA data node is selected from Get Data under Setup (left panel), the node can be dragged and dropped onto the screen (Figure 3); you will then have the option to select a scree plot and tables.
Hovering over a point on the line displays detailed information about that PC. The scree plot shows how much variation each PC represents, so it is often used to determine the number of principal components to keep for downstream analysis (e.g. tSNE, UMAP, graph-based clustering). The "elbow" point of the graph, where the eigenvalues seem to level off, should be considered as a cutoff point for downstream analysis.
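Besides eyeballing the elbow, a common programmatic alternative is to keep enough PCs to explain a target fraction of the variance; a sketch with an illustrative 90% threshold:

```python
import numpy as np
from sklearn.decomposition import PCA

x = np.random.default_rng(2).normal(size=(100, 20))
pca = PCA().fit(x)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, 0.90)) + 1  # PCs covering 90% of variance
```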
In the table, each row is a feature and each column represents a PC; the values are correlation coefficients. Under Content, there is a PCA projections option; switch to this option to display the projection table (Figure 6). In this table, each row is an observation, each column is a PC, and the values are the PC scores.
SAMtools1 (version 1.2) utilizes the mpileup command to look at observed bases in the reads covering every genomic position represented in the aligned sequence data and calculate the likelihood of every possible genotype at a locus. Subsequently, bcftools applies the prior probability and uses Bayesian inference to call actual genotypes, outputting variant information in Variant Call Format (vcf). This method can identify both single nucleotide variants and insertion/deletion events. General information about the underlying algorithm utilized by SAMtools is detailed by Li.2,3
Selecting SAMtools from the context sensitive menu will bring up the SAMtools task dialog, which contains three default sections: Variant detection method, Select Reference sequence, and Advanced options.
In the Variant detection method drop-down list, Against reference will compare base composition for each sample against the reference sequence assembly, independently (Figure 1).
In the event paired samples exist within the project, the Paired samples detection method can be utilized to identify loci with differing genotypes between the pair once each sample has been compared to the reference sequence assembly. In instances where there is limited information to accurately determine genotypes in one or both of the samples, the same genotype may be called for case and control if it differs from the reference. The Filter variants task can be used to exclude these spurious loci. To perform this analysis, sample attributes must be added in the Data tab of the project (Figure 2). Specifically, an attribute must be added for sample ID (shared between the paired samples) and an attribute must also be added for sample type that differentiates the paired samples.
Examples of the latter can include case and control or tumor and normal. If these attributes are present, a section for Analysis options will be displayed below the Variant detection method (Figure 3). To utilize this feature, select Paired analysis. Match ID must then be specified and should correspond to the attribute that references the sample ID shared between the pair. Selecting Case/control will allow for discriminating genotypes between paired samples in downstream tasks. Attribute should correspond to the attribute that defines type within sample pairs, and Control can be specified for whatever category relates to the reference sample.
Select Reference sequence will specify the reference assembly to utilize for variant detection. If the alignment was generated in Partek® Flow®, the Assembly will be displayed as text in the section, and you do not have the option to change the reference. In the event that alignment was performed outside of Partek Flow, you will need to select the appropriate Assembly utilized for alignment in the drop-down list. Assemblies previously added to library files (see Library File Management) will be available for selection or New assembly… can be utilized to import the reference sequence to library files from within the task.
Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinforma Oxf Engl. 2009;25(16):2078-2079.
Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27(21):2987-2993.
Li H. Improving SNP discovery by base alignment quality. Bioinformatics. 2011;27(8):1157-1158.
LoFreq (version 2.1.3a) is a very sensitive and fast variant caller that can be employed to robustly call low-frequency variants. By utilizing sources of sequencing error in the detection model, LoFreq can identify variants below the sequencing error rate. The significance of each variant is calculated to allow for control of false positives. This method can identify both single nucleotide variants and insertion/deletion events, although the current implementation does not produce discrete genotype calls. Information on the model underlying the variant detection is detailed by Wilm et al.1
Selecting LoFreq from the context sensitive menu will bring up the LoFreq task dialog (Figure 1), which contains two sections: Select Reference sequence and Advanced options.
Select Reference sequence will specify the reference assembly to utilize for variant detection. If the alignment was generated in Partek Flow, the Assembly will be displayed as text in the section, and you do not have the option to change the reference. In the event that alignment was performed outside of Partek Flow, you will need to select the appropriate Assembly utilized for alignment in the drop-down list. Assemblies previously added to library files (see Library File Management) will be available for selection or New assembly… can be utilized to import the reference sequence to library files from within the task.
Wilm A, Aw PPK, Bertrand D, et al. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012;40(22):11189-11201.
An important aspect of variant analysis is the ability to prioritize variants for downstream analysis. As variant detection can often identify a large number of variants, it may be difficult to determine which variants may impact phenotypes. As implemented in Partek Flow, the Ensembl Variant Effect Predictor (VEP, version 84)1 provides a means to add detailed annotation to variants in the project, such as discrete aspects of transcript models and variant databases not available in the Annotate Variants task. For variants identified in human data, information from popular tools that predict the impact of variants causing amino acid changes, SIFT2–4 and PROVEAN5 (available for the hg19 genome assembly), will be included. VEP databases can be obtained for multiple species, and content will depend on the available transcript and variant information for that organism. The Annotate variants (VEP) task can be invoked from any Variants or Annotated variants data node, and the task will supplement any existing annotation in the vcf files. Annotation information will also be visible in the downstream View variants Variant report.
The task dialog for Annotate variants (VEP) contains two sections: Select Variant Effect Predictor database and Advanced options (Figure 1). Select Variant Effect Predictor database will specify the reference assembly to utilize for variant detection. If the variant detection was performed in Partek Flow, the Assembly will be displayed as text in the section. Upon initial task usage, click the Create variant effect predictor database button to import a database. The VEP database for hg19 is available for automated download in Partek Flow, and information regarding obtaining additional databases for other species and genome assemblies can be found in the VEP documentation.
The report includes variant impact information, a subjective classification of the severity of the variant consequence:
Low: a variant that is assumed to be mostly harmless or unlikely to change protein behavior
Moderate: a non-disruptive variant that might change protein effectiveness
Modifier: usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact
High: a variant is assumed to have high disruptive impact in the protein, probably causing protein truncation, loss of function or triggering nonsense mediated decay.
The Annotate variants task in Partek Flow provides a means to add information regarding genomic features, such as transcript models, and existing variant databases to the variants contained in the project. This information can be useful for filtering, interpreting, and prioritizing variants for downstream investigation. The Annotate variants task can be invoked from any Variants or Annotated variants data node, and the task will supplement any existing annotation in the underlying vcf files. Annotation information will also be visible in the downstream View variants Variant report.
The task dialog for Annotate variants contains three sections: Assembly, Annotate with genomic features, and Annotate with known variants (Figure 1). If variant detection was performed in Partek Flow, the Assembly will be displayed as text in the section, and you do not have the option to change the reference. In the event that variant detection was performed outside of Partek Flow, you will need to select the appropriate Assembly utilized for variant detection in the drop-down list. Assemblies previously added to library files (see Library File Management) will be available for selection or New assembly… can be utilized to import the reference sequence from within the task.
Selecting Annotate with genomic feature provides the means to add gene/feature information to the variants (Figure 2). This typically takes the form of overlaying a transcript model (such as Ensembl). Annotation models previously added to library files (see Library File Management) will be available for selection, or Add annotation model in the drop-down list can be utilized to import an annotation model to library files within the task. Promoter upstream limit and Promoter downstream limit provide a means to set the number of bases flanking the transcription start site; this region will be considered the promoter of a feature.
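A sketch of how such limits define a promoter region around the transcription start site (TSS); the strand handling below is an assumption for illustration:

```python
def promoter_region(tss, strand, upstream, downstream):
    # Returns (start, end) genomic coordinates of the promoter window.
    if strand == "+":
        return (tss - upstream, tss + downstream)
    return (tss - downstream, tss + upstream)  # mirrored on the minus strand

promoter_region(tss=1_000_000, strand="+", upstream=2000, downstream=500)
```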
Selecting Annotate with known variants will provide the ability to specify a Variant annotation database (Figure 2). Known variant databases in vcf format, such as dbSNP1 and 1000 Genomes2 for human variants, can be used in the task. Additional databases not provided for automated download in Partek® Flow®, such as the Catalogue of Somatic Mutations in Cancer (COSMIC)3, can be obtained and employed by the user. Variant databases previously added to library files (see Library File Management) will be available for selection or Add variant database in the menu can be utilized to import the variant database to library files from within the task.
Sherry ST. dbSNP: the NCBI database of genetic variation. Nucleic Acids Research. 2001;29(1):308-311. doi:10.1093/nar/29.1.308
Auton A, Abecasis GR, Altshuler DM, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68-74. doi:10.1038/nature15393.
Forbes SA, Bhamra G, Bamford S, et al. The Catalogue of Somatic Mutations in Cancer (COSMIC). In: Haines JL, Korf BR, Morton CC, Seidman CE, Seidman JG, Smith DR, eds. Current Protocols in Human Genetics. Hoboken, NJ, USA: John Wiley & Sons, Inc.; 2008. http://doi.wiley.com/10.1002/0471142905.hg1011s57.
A fusion gene is a hybrid gene that combines parts of two or more original genes. Fusion genes can form as a result of chromosomal rearrangements (such as translocation, interstitial deletion, or chromosomal inversion) or abnormal transcription, and they have been shown to act as drivers of malignant transformation and/or progression in various neoplasms (1). The discovery and characterization of fusion genes have been greatly facilitated by the use of NGS (2), and several computational algorithms have been developed to detect them.
This chapter will illustrate how to detect fusion genes by:
The STAR aligner also has the ability to detect fusion genes (referred to as “chimeric alignments”) (5,6). During the first phase of alignment, STAR searches for maximal mappable prefixes (seeds) of sequencing reads. In the second phase, all the seeds that align within user-defined genomic windows are stitched together. If an alignment within one genomic window does not cover the entire read sequence, STAR will try to find two or more windows that cover the entire read. This essentially results in the detection of fusion events, with different parts of reads aligning to distal genomic locations, or different chromosomes, or different strands.
STAR fusion detection is performed in two steps: chimeric alignment of reads with the STAR aligner and fusion detection with STAR-Fusion. Performing fusion detection in two steps is equivalent to running the analysis in "Kickstart" mode, as described by the authors of STAR-Fusion. We recommend using STAR version 2.7.8a (see Task management to check which version you are running).
To save time, you can import the pre-built STAR-Fusion pipeline from our hosted pipeline page. This pipeline includes the two steps outlined below, where the advanced options for the STAR 2.7.8a alignment have been optimized for fusion detection according to the STAR-Fusion author's recommendations. See Importing a Pipeline for more information.
When performing an alignment with STAR, chimeric alignment can be activated by tick-marking the Chimeric alignment option in the Advanced options of the aligner (the Advanced options dialog is reached via the Configure link in the setup dialog). When the Chimeric alignment checkbox is selected, additional options specific to the fusion search algorithm are shown (Figure 1). For a discussion on the details of the options, see STAR documentation.
The output is associated with the Chimeric junctions data node (Figure 2), which is part of the STAR results, in addition to the Aligned reads node and, optionally, the Unaligned reads node.
STAR-Fusion v1.10 is wrapped into Partek Flow. STAR-Fusion will process the chimeric output generated by the STAR aligner to map junction reads and spanning reads to a reference annotation set. To run fusion detection, select the Chimeric junctions data node and choose STAR-Fusion from the Variant analysis menu in the toolbox (Figure 5).
Choose the STAR-Fusion annotation from the drop-down list. We provide automatic downloads of the plug-n-play libraries distributed by the Trinity Cancer Transcriptome Analysis Toolkit (CTAT) for Human hg38 (Gencode v22 and v37) and hg19 (Gencode v19) assemblies (Figure 6). If you wish to add your own STAR-Fusion library, you can either import a pre-built CTAT library or gather the appropriate files and build it in Partek Flow. See here for more details on the files you need.
To change any of the advanced options, click the Configure link (Figure 7). To run the task, click Finish.
The resulting Fusion predictions data node (Figure 8) can be downloaded to your local machine by selecting the data node and clicking Download data from the toolbox. There will be one tab-separated (.tsv) file per sample. To view the full table, double-click the new data node to open the task report (Figure 9). Each row of the table is a fusion event and the columns contain information about each detected fusion.
FusionName: the name of the fusion event, given as LeftGene--RightGene. Multiple fusion events can be detected across the same pair of genes, so the FusionName of an event is not necessarily unique;
JunctionReadCount: indicates the number of RNA-Seq fragments containing a read that aligns as a split read at the site of the putative fusion junction;
SpanningFragCount: indicates the number of RNA-Seq fragments that encompass the fusion junction such that one read of the pair aligns to a different gene than the other paired-end read of that fragment;
est_J: estimated junction read counts corrected for multiple mappings;
est_S: estimated spanning fragment counts corrected for multiple mappings;
SpliceType: indicates whether the proposed breakpoint occurs at reference exon junctions as provided by the reference transcript structure annotations (Gencode);
LeftGene: name of the first (left) gene;
LeftBreakpoint: genome coordinates for the breakpoint in left gene;
RightGene: name of the second (right) gene;
RightBreakpoint: genome coordinates for the breakpoint in right gene;
JunctionReads: sequence identifiers for all junction reads;
SpanningFrags: sequence identifiers for all spanning fragments;
LargeAnchorSupport: indicates whether there are split reads that provide 'long' (set to 25bp) alignments on both sides of the putative breakpoint;
FFPM: fusion fragments per million reads;
LeftBreakDinuc: dinucleotide base pairs at the left breakpoint;
LeftBreakEntropy: the Shannon entropy of the 15 exonic bases flanking the left breakpoint (see the sketch after this list);
RightBreakDinuc: dinucleotide base pairs at the right breakpoint;
RightBreakEntropy: the Shannon entropy of the 15 exonic bases flanking the right breakpoint;
annots: provides a simplified annotation for the fusion transcript.
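For reference, minimal sketches of two of these metrics: the Shannon (log2) entropy of a 15-base flanking sequence, and FFPM assuming the usual fusion-fragments-per-million-total-fragments definition (consult the STAR-Fusion documentation for the exact formulas):

```python
import math
from collections import Counter

def shannon_entropy(seq):
    # Entropy in bits; near 2.0 for high-complexity DNA, near 0 for repeats.
    counts = Counter(seq.upper())
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def ffpm(junction_frags, spanning_frags, total_frags):
    # Fusion fragments per million total RNA-Seq fragments (assumed definition).
    return (junction_frags + spanning_frags) / total_frags * 1e6

shannon_entropy("ACGTACGTACGTACG")  # ~2.0 bits for this 15-mer
ffpm(12, 8, 40_000_000)             # 0.5 FFPM
```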
TopHat-Fusion is a version of TopHat with the ability to align reads across fusion points and detect fusion genes resulting from breakage and re-joining of two different chromosomes or from rearrangements within a chromosome (3). It is independent of gene annotation and can discover fusion products from known genes, unannotated splice variants of known genes or completely unknown genes.
The reads are first aligned to the genome. The unaligned reads resulting from this initial alignment are split into multiple 25 bp sequences which are, in turn, aligned to the genome by Bowtie. The TopHat-Fusion algorithm identifies the cases where the first and the last 25 bp segments are aligned to either two different chromosomes or two locations on the same chromosome (spacing is defined by the user). The whole read is used to identify a fusion point. After the initial fusion candidates are defined, all the segments from the initially unaligned reads are realigned against the fusion points (as well as intron boundaries and indels). The resulting alignments are combined with the full read alignments.
The most recent TopHat-Fusion version implemented in Partek® Flow® at the time this manual was written (2.1.0) focuses on fusions due to chromosomal rearrangements; fusions resulting from read-through transcription or trans-splicing are not supported. For details, as well as a discussion of TopHat-Fusion options, see the TopHat-Fusion home page (4).
TopHat-Fusion is integrated in the TopHat 2 task and is invoked by using the Fusion search check box in the Alignment options dialog (Figure 10).
The output is generated as a new data node, Fusion results (Figure 11), as part of the TopHat 2 align reads task (in addition to the Aligned reads node and, optionally, the Unaligned reads node).
Selecting the Fusion results data node opens the task menu, with four options (Figure 12): Data summary report, Fusion report, Fusion attribute report, and Download data.
Clicking Download data downloads a *.fusion file to the local computer. The file is human-readable and can be opened in a text editor (example in Figure 13). For details, refer to the TopHat-Fusion documentation.
A list of annotated fusion genes, in the form of a Fusion report, can be obtained by first selecting the Fusion report task node and then the Task report link from the task menu. Since the task provides an annotated report, an annotation file needs to be specified first (Figure 14).
The resulting Fusion report task node (Figure 15) can be double-clicked to reveal the full table (Figure 16).
Each row of the table in Figure 16 is a potential fusion event, with the columns providing the following information:
Sample ID: sample in which the fusion event was identified
Chromosome 1: chromosome hosting the first (left) segment of the fusion transcript
Stop 1: end of the first (left) segment of the fusion transcript
Chromosome 2: chromosome hosting the second (right) part of the fusion transcript
Start 2: beginning of the second (right) segment of the fusion transcript
Gene1: gene on the left side of the fusion
Gene2: gene on the right side of the fusion
Spanning reads: number of reads which were unaligned during the initial phase of TopHat and where only one mate is used as evidence of the fusion event
Mate Pairs: number of reads which were unaligned during the initial phase of TopHat and where both mates are used as evidence of the fusion event
Spanning mate pairs: number of reads where both mates were aligned during the initial phase of TopHat, but their pairing is discordant (e.g. different chromosomes, different orientation etc.)
Contradicting reads: number of reads which do not support the fusion
Left bases: number of bases on the left side of the fusion
Right bases: number of bases on the right side of the fusion
All the columns can be sorted using the arrow buttons in the column headers, while the type-in boxes can be used for searching. TopHat-Fusion does not report exact start and stop positions for each side of the fusion event; it reports a single location for the end of the upstream segment (Stop 1) and the beginning of the downstream segment (Start 2). Therefore, the columns Start 1 and Stop 2 are added for (internal) consistency with other Partek Flow tools.
The checkboxes Disrupted Genes and Gene/Gene fusions are filter tools. When selected, Disrupted Genes removes all the rows (fusion events) that have no genes assigned to them, i.e. those that merge two intergenic regions; a fusion between a gene and an intergenic region, however, will be kept in the table. Gene/Gene fusions filters in only those fusion events that have an annotated gene on both sides of the breakpoint. In other words, only gene-to-gene fusions are kept in the table.
Another table which can be generated based on a Fusion results node is the Fusion attribute report. When the option is selected, it brings up the dialog shown in Figure 17. First, you need to specify one or more categorical attributes (Select attribute(s) to test), which have at least two categories (see Data tab). Second, you need to specify an annotation file, using the Assembly and Gene/feature annotation drop-down lists.
A new data node, Fusion attribute report, is generated in the Analysis tab (Figure 18) and it provides access to the Task report link in the task menu.
The output Fusion report table resembles the basic TopHat-Fusion output: each row of the table is a single fusion event, while the information on the merged segments is in the columns.
Chromosome 1: chromosome hosting the first (left) segment of the fusion transcript;
Start 1: beginning of the first (left) segment of the fusion transcript;
Stop 1: end of the first (left) segment of the fusion transcript;
Chromosome 2: chromosome hosting the second (right) segment of the fusion transcript;
Start 2: beginning of the second (right) segment of the fusion transcript;
Stop 2: end of the second (right) segment of the fusion transcript;
Gene1: gene on the left side of the fusion;
Gene2: gene on the right side of the fusion;
% in (category name): fraction of samples within the category with the fusion event.
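A sketch of how such a fraction can be computed, with illustrative column names and using only the samples present in the table as the denominator:

```python
import pandas as pd

fusions = pd.DataFrame({
    "fusion":   ["A--B", "A--B", "C--D"],
    "sample":   ["s1",   "s2",   "s3"],
    "category": ["tumor", "normal", "tumor"],
})
samples_per_category = fusions.groupby("category")["sample"].nunique()
pct_in = (
    fusions.groupby(["fusion", "category"])["sample"].nunique()
           .div(samples_per_category, level="category")  # fraction per category
)
```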
The same Disrupted Genes and Gene/Gene fusions checkboxes filter this table as well: Disrupted Genes removes fusion events that merge two intergenic regions (fusions between a gene and an intergenic region are kept), while Gene/Gene fusions keeps only those events with an annotated gene on both sides of the breakpoint.
References
Annala MJ, Parker BC, Zhang W, Nykter M. Fusion genes and their discovery using high throughput sequencing. Cancer Lett. 2013;340:192-200.
Costa V, Aprile M, Esposito R, Ciccodicola A. RNA-Seq and human complex diseases: recent accomplishments and future perspectives. Eur J Hum Genet. 2013;21:134-142.
Kim D, Salzberg SL. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biology. 2011;12:R72
TopHat-Fusion. An algorithm for discovery of novel fusion transcripts. http://tophat.cbcb.umd.edu/fusion_index.html. Accessed April 25, 2014.
Dobin A, Davies CA, Schlesinger F et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15-21.
Haas BJ, Dobin A, Li B, et al. Accuracy assessment of fusion transcript detection via read-mapping and de novo fusion transcript assembly-based methods. Genome Biol. 2019;20:213.
An important aspect of variant analysis is the ability to prioritize specific variants for further investigation. As variant detection can often identify a large number of variants, it may be difficult to determine which variants may impact phenotypes. SnpEff (version 4.1k) provides a means to annotate and predict the effects of variants on genes, allowing for prioritization of variants within the project. In addition, the SnpEff databases utilized for prediction support a large number of genome assemblies. Information regarding the implementation of the predictions is detailed by Cingolani et al.1 The predicted effect of the variant is categorized by impact:
HIGH - frame shifts, addition/deletion of stop codons, etc;
MODERATE – codon change/deletion/insertion, etc;
LOW – synonymous changes, etc;
MODIFIER – changes outside coding regions, etc.
Further details about output metrics can be found in the SnpEff documentation. The Annotate variants (SnpEff) task can be invoked from any Variants or Annotated variants data node, and the task will supplement any existing annotation in the vcf files. Annotation information will also be visible in the View variants Variant report and the Summarize cohort mutations Cohort mutation summary report.
The task dialog for Annotate variants (SnpEff) contains two sections: Select SnpEff database and Advanced options (Figure 1). Select SnpEff database will specify the reference assembly to utilize for variant annotation. If the variant detection was performed in Partek Flow, the Assembly will be displayed as text in the section, and you do not have the option to change the reference. In the event that variant detection was performed outside of Partek Flow, you will need to select the appropriate Assembly utilized for variant detection in the drop-down list. Assemblies previously added to library files (see Library File Management) will be available for selection, or New assembly… can be utilized to import the reference sequence to library files from within the task. Select SnpEff database will also allow selection of the database utilized for prediction, and Partek Flow provides automated download of a limited number of these databases. Databases previously added to library files (see Library File Management) will be available for selection, or Add SnpEff variant database in the menu can be utilized to import the database to library files from within the task. Additional information on SnpEff databases can be found in the SnpEff documentation.
Cingolani P, Platts A, Wang LL, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012;6(2):80-92.
Another way to invoke a heatmap without performing clustering is via the data viewer. When you select the Heatmap icon in the available plots list, data nodes that contain two-dimensional matrices can be used to draw this type of plot. A bubble map can be plotted similarly (use the arrow on the heatmap icon to select a Bubble map) for descriptive statistics that have been generated in the data analysis pipeline.
The min and max color stops cannot be dragged or removed. If you left-click on them, you can choose a different color. When you click on the Palette bar, you can add a new color stop between min and max. Adding a color stop can be useful when there is an outlier value in the data. You can use a different color to represent different value ranges.
Right-clicking a color stop will reveal a list of options. Space colors evenly will rearrange the position of the stops so there is an equal distance between all stops. Center will move a stop to the middle of the two adjacent stops. Delete will remove the stop.
In addition to color, you can also use the Size drop-down list to size by a set of values from another matrix stored in the same data node. Most of the data nodes contain only one matrix, so the only options available in the Size drop down will be None or Matrix. In cases where you have multiple matrices, you might want to use the color of the component in the heatmap to represent one type of statistic (like mean of the groups) and the size of the component to represent the information from a different statistic (like std. dev).
When By cluster is selected in the Row/Column color drop-down list, the number of clusters needs to be specified. The top N clusters will be shown in N different colors.
This section allows you to add sample or cell level annotations to the viewer. First, make sure to choose the correct data node which contains the annotation information you would like to use by clicking the circle (). All project level annotations will be available on all data nodes in the pipeline.
In point mode (), you can left-click and drag to move around the heatmap (if you are not fully zoomed out). Left-clicking once on the heatmap or on a dendrogram branch will select the associated rows/columns.
In selection mode (), you can click and drag to select a range of rows, columns, or components.
In flip mode (), you can click on a line in the dendrogram (which represents a cluster branch) and the location of the two legs of the branch will be swapped. If no clustering is performed (no dendrogram is generated), in this mode, you can click on the label of an item (observation or feature), drag and drop to manually switch orders of the row or column on the heatmap.
Click on reset view () to reset to the default view.
The Save Image icon () enables you to download the heat map to your local computer. If the heat map contains up to 2.5M cells (features * observations), you can choose between saving the current appearance of the heat map window (Current view) and saving the entire heat map (All data). Depending on the number of features / observations, Partek Flow may not be able to fit all the labels on the screen, due to the limit imposed by the screen resolution. The All data option provides an image file of sufficient size so that all the labels are readable (in turn, that image may not fit the computer screen and the image file may be quite large). If the heat map exceeds 2.5M cells, the Current view option will not be shown, and you will see only a dialog like the one below.
After selecting either Current view (if applicable) or All data button, the next dialog (below) will allow you to specify the image format, size, and resolution.
Advanced options provides a means to tune parameters in the variant detection for optimal performance. Upon invoking the task dialog, Option set is set to Default, and these parameters are provided by the FreeBayes developers. Clicking the Configure button will open a window to tune advanced options. FreeBayes has advanced options for Population model, Allele scope, Indel realignment, Input filters, Mappability priors, Genotype likelihoods, Algorithmic features, and Report options. Moving the mouse cursor over the info button will provide details for each parameter. Please refer to the FreeBayes documentation for further information on tuning these parameters.
When the Scree plot icon is chosen, a 2D plot is drawn: the X-axis represents the PCs and the Y-axis represents the eigenvalues (Figure 4).
A PCA data node can also be drawn as a table. When the Table icon is chosen, the component loadings matrix is displayed in the viewer (Figure 5). The Content can be modified using the Content configuration option; the table can be paged through here or from the lower right corner.
Advanced options provides a means to tune parameters in the variant detection for optimal performance. Upon invoking the task dialog, Option set is set to Default, and these parameters are provided by the SAMtools developers. Clicking Configure will open a window to tune advanced options. Moving the mouse cursor over the info button will provide details for each parameter. Please refer to the SAMtools documentation for further details on any of these parameters.
Advanced options provides a means to tune parameters in the variant detection for optimal performance. Upon invoking the task dialog, Option set is set to Default, and these parameters are provided by the LoFreq developers. Clicking Configure will open a window to tune advanced options. LoFreq has advanced options for Region control, Base-call quality, Base-alignment (BAQ) and indel-alignment (IDAQ) qualities, Mapping quality, Indels, Source quality, P-values, and Other. Moving the mouse cursor over the info button will provide details for each parameter. Please refer to the LoFreq documentation for further suggestions on tuning these parameters.
Advanced options provides a means to specify aspects of the annotation generated from the VEP annotation task. Upon invoking the task dialog, Option set is set to Default. Clicking Configure will open a window to specify additional components of annotation (Figure 2). VEP has Advanced options for Identifiers, Output options, and Co-located variants. Moving the mouse cursor over the info button will provide details for each parameter.
To obtain a .fusion file that summarizes the chimeric reads across samples, double-click on the Chimeric junctions data node to open the report, click on the View output files link, select the chimeric_result.fusion file, and click the download icon (Figure 3). The file is human-readable and can be opened in a text editor (example in Figure 4). For details, refer to STAR's documentation.
Advanced options provides a means to tune parameters for annotation generated from the SnpEff database. Upon invoking the task dialog, Option set is set to Default, and these parameters are prescribed by the developers of SnpEff. Clicking Configure will open a window to tune advanced options (Figure 2). SnpEff has Advanced options for Results filter options, Annotation options, and Database options. Moving the mouse cursor over the info button will provide details for each parameter.