Partek Flow contains a number of quality control tools and reports that can be used to evaluate the current status of your analysis and decide downstream steps. Quality control tools are organized under the Quality Assurance / Quality Control (QA/QC) section of the context-sensitive menu and are available for unaligned and aligned reads data nodes.
This section will illustrate:
In addition to the tools listed above, many other outputs can be interpreted in the context of quality control, for instance, principal component analysis, hierarchical clustering (at the sample level), the variant detection report, and the quantification report.
Left-clicking any task (the rectangles) in the analysis pipeline will cause a Task Actions section to appear in the pop-up menu. This allows users to:
Rerun task: reruns the selected task. The task dialog will open, allowing users to change the task's parameters. Downstream analyses of the selected task will not be rerun.
Rerun with downstream tasks: reruns the selected task. The task dialog will open, allowing users to change the parameters of the current task; the downstream analyses will then be rerun with the same configuration as before.
Edit description: the description of the task can be replaced by manually typing in a new one.
Change color: choose a color to apply only to the selected task by clicking Apply. Click Apply to downstream to change the selected task and its downstream pipeline to the newly selected color.
Delete task: this option is only available if the user is the owner of the project or the owner of the task. When a task is deleted, all downstream tasks, including tasks from other users, will be deleted. Users may check the box to choose to delete the task's output files. If delete output files is not checked, the task will be removed from the pipeline, but the output files of the task will remain on the disk.
Restart task: this option is only available on failed tasks and requires an admin role to perform. When an admin restarts a task, it will not take up a concurrent seat, and the disk space consumed by the output files will count towards the storage quota of the task's original owner.
This section has tools that are useful in managing and understanding single cell data, especially for downstream analysis. To invoke Annotation/Metadata tools, click on any Single cell counts data node. These include the following tasks:
This section contains tools that are useful in preparing single cell data for downstream analysis, such as multi-sample comparison or multi-omics analysis. To invoke Pre-analysis tools, click on any Single cell counts data node. These include the following tasks:
The Trim tags task allows you to process unaligned read data with adaptors, barcodes, and UMIs using a Prep kit file that specifies the configuration of these elements in your NGS reads.
Click an Unaligned reads data node
Click the Pre-alignment QA/QC section of the toolbox
Click Trim tags
There are three parameters to configure - Prep kit, Keep untrimmed, and Map feature barcodes.
Selecting Keep untrimmed will generate a separate unaligned reads data node with any reads that do not match the structure specified by the prep kit. This option is off by default, to save on disk space. Selecting Map feature barcodes is only necessary for processing protein data from 10x Genomics' Feature Barcoding assay (v3+ chemistry). For single cell gene expression data, leave this option unchecked.
Partek distributes prep kits for processing several types of data:
10x Chromium Single Cell 3' v2
10x Chromium Single Cell 3' v3
10x Chromium Single Cell 5'
Drop-seq
Lexogen QuantSeq FWD-UMI
Bio-Rad SureCell WTA 3'
Fluidigm C1 mRNA Seq HT IFC
Rubicon Genomics ThruPLEX Tag-seq
1CellBio inDrop
If your data is from one of these sources, you can select the appropriate option in the Prep kit drop-down menu. If the data is from another source, you can build a custom prep kit file to process your data.
Choose a Prep kit from the drop-down menu
Click Finish to run Trim tags (Figure 1)
The output of Trim tags is a Trimmed reads data node. An additional Untrimmed reads data node will be generated if the Keep untrimmed option was selected.
The task report provides a table with the total reads, reads retained, % reads retained, reads removed, and % reads removed for each sample (Figure 2). You can click Download at the bottom of the table to save a text file copy to your computer.
Select Other / Custom from the Prep kit name drop-down menu
Give the new prep kit a name
Choose Build prep kit
You can select Import prep kit if you have a Prep kit .zip file downloaded from Partek Flow.
Click Create (Figure 3)
The Prep kit builder interface will load (Figure 4).
There are three sections:
Is paired end - select to switch from single end to paired end FASTQ files (Figure 5). If you choose paired end, the First mate will correspond to the _R1 FASTQ file and the Second mate will correspond to the _R2 FASTQ file.
Figure 5. Paired end prep kits have first and second mate segmentation sections
Segmentation - this is where you will describe the structure of your reads
Segments include adaptors, barcodes, UMIs, and the insert (i.e., the target sequence of the assay)
For adaptors, you have the option of choosing a file with your adaptor sequences or entering the adaptor sequences manually.
To use a file, choose File for Sequences and then click Choose File (Figure 6). Use the file browser to choose a FASTA file from your local computer.
You can specify the mismatch allowance using the Mismatches option.
After you have specified the file or manually entered the sequences, click Add to add the adaptor sequence(s).
Unique Molecular Identifiers (UMIs) are randomly generated sequences that uniquely identify an original starting molecule after PCR amplification.
Including a UMI in your prep kit will allow you to access a downstream task that uses UMI information for removing PCR duplicates. For more information about the Deduplicate UMIs task, please see our UMI Deduplication in Partek Flow white paper. Note that while the UMI sequence will be trimmed, a record of the UMI sequence for each read is retained for use by this downstream task.
When adding a UMI segment to your prep kit, you can specify the length of your UMIs (Figure 8).
Adding a barcode segment to a prep kit allows you to access downstream tasks that use barcode information, including Filter barcodes and Quantify barcodes to annotation model (Partek E/M). While the barcode sequence will be trimmed, a record of the barcode sequence for each read is retained for use by downstream tasks.
Like adaptors, barcodes can be specified using a file or manually specified, but you can also choose to designate any segment of arbitrary length in the sequence as the barcode. This is useful if you do not have a specific set of known barcodes.
To set the barcode to an arbitrary segment of fixed length, choose Arbitrary and specify the barcode length (Figure 9).
Remember to click Add to add the new segment to your prep kit.
The insert is the sequence retained after trimming in the Trimmed reads data node. For example, in RNA-Seq, this would be the mRNA sequence. Every prep kit must include an insert segment. You can specify the minimum size of the insert section using the Length field (Figure 10). Reads shorter than the minimum length will be discarded.
Remember to click Add to add the new segment to your prep kit.
Segments are placed from 5' to 3' in the read in the order they are added. You should add the 5' segment first and add additional elements in order of their position in the read. Segments will appear in the Segmentation sections as they are added. You can mouse over a segment to view its details (Figure 11).
For example, the expected read structure (Figure 12) and a completed prep kit for a standard Drop-seq library prep are shown below (Figure 13).
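To make the segmentation concrete, below is a minimal sketch of how the published Drop-seq read 1 layout (a 12 bp cell barcode followed by an 8 bp UMI; the cDNA insert is on read 2) maps onto a read. The function name and example sequence are ours, for illustration only; the segment lengths must match your own prep kit definition.

```python
# Minimal sketch of Drop-seq style segmentation of read 1 (5' to 3'):
# bases 1-12 are the cell barcode, bases 13-20 are the UMI.
def segment_dropseq_r1(read_seq: str):
    """Split a Drop-seq read 1 sequence into prep kit segments."""
    barcode = read_seq[:12]   # cell barcode segment
    umi = read_seq[12:20]     # UMI segment
    return barcode, umi

barcode, umi = segment_dropseq_r1("AACGTGATGCCTTTGCAGTACCGA")
print(barcode, umi)  # AACGTGATGCCT TTGCAGTA
```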
Remove poly-A tail - choose this option to trim poly-A tails from the end of the read containing your insert sequence
Click Next to complete your prep kit
You can manage saved prep kits by going to Home > Settings > Library file management and opening the Prep kit files tab (Figure 14).
Prep kits download as a .zip file. This Prep kit .zip file can be imported into Partek Flow by selecting Import from a file when adding a new prep kit. Select the .zip file when importing, do not unzip the file.
The Task Menu lists all the tasks that can be performed on a specific node. It can be invoked from either a Data or Task node and appears on the right hand side of the Analyses tab. It is context-sensitive, meaning that it will only present tasks that the user can perform on the selected node. For example, selecting an Aligned reads data node will not present aligners as options.
Clicking a Data node presents a variety of tasks:
Clicking a Task node gives you the option to view the Task results or perform Task actions such as rerunning the task (Figure 1).
The Data summary report in Partek Flow provides an overview of all tasks performed as part of a pipeline. This is particularly useful for report writing, record keeping and revisiting projects after a long period of time.
This user guide will cover the following topics:
Click on an output data node under the Analyses tab of a project and choose Data summary report from the context sensitive menu on the right (Figure 1). The report will include details from all of the tasks upstream of the selected node. If tasks have been performed downstream of the selected data node, they will not be included in the report.
The Data summary report can be saved in different formats via the web browser. The instructions below are for Google Chrome. If you are using a different browser, consult your browser's help for equivalent instructions.
On the Data summary report, expand all sections and show all task details. Right-click anywhere on the page and choose Print... from the menu (Figure 4) or use Ctrl+P (Command+P on Mac). In the print dialog, click Change… (Figure 5) and set the destination to Save as PDF. Select the Background graphics checkbox (optional), click the blue Save button (Figure 5) and choose a file location on your local machine.
The PDF can be attached to an email and/or opened in a PDF viewer of your choice.
On the Data summary report, right-click anywhere on the page and choose Save as… from the menu (Figure 6) or use Ctrl+S (Command+S on Mac). Choose a file location on your local machine and set the file type to Web Page, Complete.
The HTML file can be opened in a browser of your choice.
The short video clip below (with audio) demonstrates how to view the Data summary report.
The ERCC (External RNA Control Consortium) developed a set of RNA standards for quality control in microarray, qPCR, and sequencing applications. These RNA standards are spiked-in RNA with known concentrations and composition (i.e. sequence length and GC content). They can be used to evaluate sensitivity and accuracy of RNA-seq data.
The ERCC analysis is performed on unaligned data, if the ERCC RNA standards have been added to the samples. There are 92 ERCC spike-in sequences with different concentrations and different compositions. The raw data are aligned (with Bowtie) to the known ERCC sequences to get the count of each ERCC sequence. This information is available within Partek Flow and is used to plot the correlation between the observed counts and the expected concentrations. If there is a high correlation between the observed counts and the expected concentrations, you can be confident that the quantified RNA-seq data are reliable. Partek Flow supports the Mix 1 and Mix 2 ERCC formulations. Both formulations use the same ERCC sequences, but each sequence is present at a different expected concentration. If both Mix 1 and Mix 2 formulations have been used, an ExFold comparison can be performed to compare the observed and expected Mix 1:Mix 2 ratio for each spike-in.
To start ERCC assessment, select an unaligned reads node and choose ERCC in the context sensitive menu. If all samples in the project have used the Mix 1 or Mix 2 formulation, choose the appropriate radio button at the top (Figure 1).
You can change the Bowtie parameters by clicking Configure before the alignment (Figure 1), although the default parameters work fine for most data. Once the task has been set up correctly, select Finish.
The ERCC task report starts with a table (Figure 3), which summarizes the results at the project level. The table shows which samples use the Mix 1 or Mix 2 formulation. The total number of alignments to the ERCC controls is also shown, further divided into the total number of alignments to the forward strand and to the reverse strand. The summary table also gives the percentage of ERCC controls that contain alignment counts (i.e. are present). Generally, the fraction of present controls should be as high as possible; however, certain ERCC controls may not contain alignment counts due to their low concentration, and that information is useful for evaluating the sensitivity of the RNA-seq experiment. The coefficient of determination (R squared) of the present ERCC controls is listed in the next column. As a rule of thumb, you should expect a good correlation between the observed alignment counts and the actual concentrations, or else the RNA-seq quantification results may not be accurate. Finally, the last two columns give estimates of bias with regard to sequence length and GC content, by giving the correlation of the alignment counts with the sequence length and the GC content, respectively.
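As a rough illustration of the R squared statistic described above, the sketch below correlates log2 observed counts with log2 expected concentrations for a handful of made-up spike-ins; it is not Partek Flow's implementation, and the numbers are purely hypothetical.

```python
# Correlate observed ERCC alignment counts with expected concentrations,
# both in log2 space, and report the coefficient of determination.
import numpy as np

expected_conc = np.array([15.0, 30.0, 7.5, 120.0, 0.9])  # hypothetical concentrations
observed_counts = np.array([410, 850, 180, 3300, 20])    # hypothetical counts

r = np.corrcoef(np.log2(expected_conc), np.log2(observed_counts))[0, 1]
print(f"R squared = {r ** 2:.3f}")
```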
If ExFold comparison was enabled, an extra table will be produced in the ERCC task report (Figure 4). Each row in the table is a pairwise comparison. This table lists the percentage of ERCC controls present in the Mix 1 and Mix 2 samples and the R squared for the observed vs expected Mix1:Mix2 ratios.
The ERCC spike-ins plot (Figure 5) shows the regression lines between the actual spike-in concentration (x-axis, given in log2 space) and the observed alignment counts (y-axis, given in log2 space), for all the samples in the project. The samples are depicted as lines, and the probes with the highest and lowest concentration are highlighted as dots. The regression line for a particular sample can be turned off by simply clicking on the sample name in the legend beneath the plot.
Optionally, you can invoke a principal components analysis plot (View PCA), which is based on RPKM-normalised counts, using the ERCC sequences as the annotation file (not shown).
For more details, go to the sample-level report (Figure 6) by selecting a sample name on the summary table. First, you will get a comprehensive scatter plot of observed alignment counts (y-axis, in log2 space) vs. the actual spike-in concentration (x-axis, in log2 space). Each dot on the plot represents an ERCC sequence, coloured based on GC content and sized by sequence length (plot controls are on the right).
The table (Figure 7) lists individual controls, with their actual concentration, alignment counts, sequence length, and % GC content. The table can be downloaded to the local computer by selecting the Download link.
For more details on ExFold comparisons, select a comparison name in the ExFold summary table (Figure 8). First, you will get a comprehensive scatter plot of observed Mix1:Mix2 ratios (y-axis, in log2 space) vs. the expected Mix1:Mix2 ratio (x-axis, in log2 space). Each dot on the plot represents an ERCC sequence, coloured based on GC content and sized by sequence length (plot controls are on the right).
The table (Figure 9) lists individual controls, with each samples' alignment counts, together with the observed and expected Mix1:Mix2 ratios. The table can be downloaded to the local computer by selecting the Download link.
Post-alignment QA/QC is available for data nodes containing aligned reads (Aligned reads) and has no special control dialog. Similar to the pre-alignment QA/QC report, the post-alignment report contains two tiers, i.e. a project-level report and sample-level reports.
The project-level report starts with a summary table (Figure 1). Unlike the pre-alignment QA/QC report, each row now corresponds to a sample (sample names are hyperlinks to the sample-level reports). The table allows for a quick comparison across all the samples within the project, so any outlying sample can easily be spotted.
Note that the summary table reflects the underlying chemistry. While Figure 1 shows a summary table for single-end sequencing, an example table for paired-end sequencing is given in Figure 2. Common features are discussed first.
The first two columns contain the total number of reads (Total reads) and the total number of alignments (Total alignments). Theoretically, for single-end chemistry, the total number of reads equals the total number of alignments. For paired-end reads, the theoretical expectation is twice as many alignments as reads (the term “read” refers to the fragment being sequenced, and since each fragment is sequenced from two directions, one can expect two alignments per fragment). When counting the actual number of alignments (Total alignments), however, reads that align more than once (multimappers) are also taken into account. Next, the Aligned column contains the fraction of all the reads that were aligned to the reference assembly.
The Coverage column shows the fraction (%) of the reference assembly that was sequenced, while the average sequencing depth (×) of the covered regions is in the Avg. coverage depth column. Avg. quality is the mapping quality, as reported by the aligner (not all aligners support this metric). Avg. length is the average read length. Finally, %GC is the fraction of G or C calls.
In addition, the Post-alignment QA/QC report for single-end reads (Figure 1) contains the Unique column. This refers to the fraction of uniquely aligned reads.
On the other hand, the Post-alignment QA/QC report for paired-end reads (Figure 2) contains these columns:
Unique singleton: fraction of alignments corresponding to reads where only one of the paired reads can be uniquely aligned
Unique paired: fraction of alignments corresponding to reads where both of the paired reads can be uniquely aligned
Non-unique singleton: fraction of singletons that align to multiple locations
Non-unique paired: fraction of paired reads that align to multiple locations
Note: for paired-end reads, if one mate is aligned and the other is not, the read is counted in the numerator when calculating the alignment rate. Because its mate is not aligned, the read is also included in the unaligned reads data node (if the generate unaligned reads data node option is selected) for second-stage alignment. As a result, the total number of reads will not equal "unaligned reads + total reads × alignment rate", because reads with only one aligned mate are counted twice.
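The arithmetic below illustrates the double counting described in this note, using made-up numbers.

```python
# Hypothetical paired-end example: reads with exactly one aligned mate are
# counted in the alignment rate AND placed in the unaligned reads data node.
total_reads = 1000
both_mates_aligned = 900
one_mate_aligned = 60
neither_mate_aligned = 40

aligned = both_mates_aligned + one_mate_aligned           # alignment rate: 96%
unaligned_node = neither_mate_aligned + one_mate_aligned  # sent to 2nd-stage alignment

# 960 + 100 = 1060, not 1000: the 60 single-mate reads are counted twice.
print(aligned + unaligned_node - total_reads)  # 60
```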
In addition to the summary table, several graphs are plotted to give a comparison across multiple samples in the project. Those graphs are Alignment breakdown, Coverage, Genomic Coverage, Average base quality per position, Average base quality score per read, and Average alignments per read. Two of those (Average base quality plots) have already been described.
The alignment breakdown chart (Figure 3) presents each sample as a column, and has two vertical axes (i.e. Alignment percent and Total reads). The percentage of reads with different alignment outcomes (Unique paired, Unique singleton, Non-unique, Unaligned) is represented by the left-side y-axis and visualized by stacked columns. The total number of reads in each sample is given using the black line and shown on the right-side y-axis.
The Coverage plot (Figure 4) shows the Average read depth (in covered regions) for each sample using columns and can be read off the left-hand y-axis. Similarly, the Genomic coverage plot shows genome coverage in each sample, expressed as a fraction of the genome.
The last graph is Average alignments per read (Figure 5) and shows the average number of alignments for each read, with samples as columns. For single-end data, the expected average alignments per read is one, while for paired-end data, the expected average alignments per read is two.
The Cell barcode QA/QC task lets you determine whether a given cell barcode is associated with a cell. This is an important QC step in all droplet-based single cell RNA-seq experiments, such as Drop-seq, where all barcodes are sequenced.
To invoke Cell barcode QA/QC:
Click a Single cell counts data node
Click the QA/QC section of the task menu
Click Cell barcode QA/QC
The task can be performed with or without the EmptyDrops method enabled.
To perform the task without the EmptyDrops method enabled, leave the checkbox unchecked and click Finish (Figure 1).
The Cell barcode QA/QC task report is a plot (Figure 2). Barcodes are ordered on the X-axis by the number of reads, such that the barcode closest to the Y-axis has the most reads and the barcode furthest from the Y-axis has the fewest reads. The Y-axis value is the number of mapped reads corresponding to each barcode. This type of plot is often referred to as a knee plot.
The knee plot is used to choose a cutoff point between barcodes that correspond to cells and barcodes that do not. Partek Flow automatically calculates an inflection point, shown by the vertical line on the graph. Barcodes designated as cells are shown in blue while barcodes designated as without cells (background) are shown in grey.
The cutoff can be adjusted by dragging the vertical line across the graph or by using the text fields in the Filter panel on the left-hand side of the plot. Using the Filter panel, you can specify the number of cells or the percentage of reads in cells and the cutoff point will be adjusted to match your criteria. The number of cells and the percentage of counts in cells is adjusted as the cutoff point is changed. To return to the automatically calculated cutoff, click Reset sample filter.
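For intuition, the sketch below shows one simple heuristic for placing a cutoff on a knee plot: rank barcodes by read count and find the largest drop between consecutive barcodes in log space. This is only an illustration of the idea; it is not necessarily the inflection point calculation Partek Flow uses.

```python
# Toy knee-plot cutoff: the biggest log-space drop between consecutive
# barcodes (sorted by read count) separates cells from background.
import numpy as np

counts = np.array([9800, 9500, 9100, 8700, 150, 90, 40, 12, 5, 1])  # sorted descending
drops = np.diff(np.log10(counts))            # negative values; most negative = biggest drop
cutoff = int(np.argmin(drops)) + 1           # barcodes before the drop are called cells
print(f"{cutoff} barcodes called as cells")  # 4
```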
The percentage of counts in cells and median counts per cell are useful technical quality metrics that can be consulted when optimizing sample handling, cell isolation techniques, and library preparation.
One knee plot is generated for each sample. In projects with multiple samples, Next and Back buttons will appear at the top to enable navigation between sample knee plots. Manual filters must be set separately for each sample. This is typically used when the user expects a certain number of cells to be processed, like in experiments where droplets were loaded with a predefined number of cells.
To view a summary of the currently selected filter settings for all samples, click Summary table. This opens a table showing key metrics for each sample in the project (Figure 3).
To return to the knee plot view, click Back to filter. To apply the filter and run the Filter barcodes task, click Apply filter. A Filtered counts data node will be generated.
The EmptyDrops method (1) uses a statistical test to distinguish barcodes that correspond to real cells from those that correspond to empty droplets. An ambient RNA expression profile is estimated from barcodes below a specified total UMI count threshold, using the Good-Turing algorithm. The expression profile of each barcode above the low-count threshold is then tested for deviations from the ambient profile. Real cells are expected to have a low p-value, indicating a significant deviation from the expected background noise level. False discovery rate (FDR) correction is applied to all the p-values, and barcodes with an FDR at or below the specified level are detected as real cells. This can allow for the detection of additional cells that would otherwise be discarded due to a low total UMI count.
This method requires empty barcodes to be present in the single cell count matrix, in order to estimate the ambient RNA profile. If your data has already been filtered to remove barcodes with low total counts, this method will not be suitable. For example, if you are working with 10X Genomics data, the EmptyDrops method can only be run on the raw counts, not the filtered counts.
In addition, a knee point threshold will be calculated to identify cells with a very high total UMI count. It's possible that some barcodes with a high total UMI count will not pass the EmptyDrops significance test. This could be due to biases in the ambient RNA profile, leading to a non-significant difference between a barcode's expression profile vs the ambient profile. To protect against this issue, it is advisable to use the EmptyDrops results in conjunction with the knee point filter, on the assumption that barcodes with a very high total UMI count will always correspond to real cells. Note, the knee point will be more conservative than the inflection point calculated by Partek Flow when the EmptyDrops method is not enabled.
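The combined filter described above can be summarized as follows. This is a simplified sketch with illustrative names and thresholds, not Partek Flow internals: a barcode is kept if it passes the EmptyDrops FDR test or sits above the knee point.

```python
# A barcode is called a cell if it is significant in EmptyDrops OR its
# total UMI count exceeds the knee point. Thresholds are illustrative.
def keep_barcode(fdr, total_umis, fdr_threshold=0.01, knee_point=1500):
    significant = fdr is not None and fdr <= fdr_threshold  # EmptyDrops call
    high_count = total_umis >= knee_point                   # knee point filter
    return significant or high_count

print(keep_barcode(fdr=0.001, total_umis=400))  # True: rescued by EmptyDrops
print(keep_barcode(fdr=0.30, total_umis=5000))  # True: above the knee point
print(keep_barcode(fdr=0.30, total_umis=200))   # False: background
```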
To perform the task with the EmptyDrops method, check the checkbox, configure the additional options, and click Finish (Figure 4).
Ambient count threshold: barcodes with a total UMI count equal to or below this threshold will be used to create the ambient RNA expression profile to estimate background noise. The default is set to 100, which is reasonable for most data.
FDR threshold: barcodes at or below this FDR threshold show a significant deviation from the ambient profile and can therefore be considered real cells. Increasing this value will result in more cells, but will also increase the number of potential false positives.
Random generator seed: this is used for performing Monte Carlo simulations to determine p-values. To reproduce results, use the same random seed for all runs.
The task report will appear similar to Figure 2, with additional metrics on the left (Figure 5).
The number of actual cells detected by the EmptyDrops test and the knee point filter are shown above the Venn diagram on the left. In Figure 5, 3,189 barcodes are above the knee point filter (represented by the vertical blue line on the plot) and 2,657 barcodes passed the significance test in EmptyDrops. The overlap between these sets of barcodes is represented by the Venn diagram. In Figure 5, 1,583 barcodes pass the significance test in EmptyDrops and have a high total UMI count above the knee point filter; 1,606 barcodes have a very high total UMI count with no significant difference from the ambient profile in EmptyDrops; 1,074 barcodes fall below the knee point but are still significantly different from the ambient profile.
The number of cells included by the knee point filter can be adjusted either by clicking on the plot to change the position of the vertical blue line or by typing a different number of cells into the text box on the left.
The total number of cells is shown in the text box on the left. By default, this will be all of the cells detected by the knee point filter plus the extra cells detected by EmptyDrops. In Figure 5, this means the 3,189 cells with a high total UMI count plus the additional 1,074 cells from EmptyDrops (total = 4,263).
Different sections of the Venn diagram can be selected/deselected to include/exclude barcodes. For example, in Figure 5, clicking the '1,606' section of the Venn diagram will deselect those barcodes. Now, the only cells that will pass the filter will be the significant ones from EmptyDrops (Figure 6).
Lun, A., Riesenfeld, S., Andrews, T. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol. 2019; 20: 63.
Coverage report is also available for data nodes containing aligned reads (Aligned reads, Trimmed reads, or Filtered reads). The purpose of the report is to understand how well the genomic regions of interest are covered by sequencing reads for a particular analysis.
When setting up the task (Figure 1), you first need to specify the Genome build and then a Gene/feature annotation file, which defines the genomic regions you are interested in (e.g. exome or genes within a panel). The Gene/feature annotation can be previously associated with Partek® Flow® via Library File Management or added on the fly.
The coverage report will contain the percentage of bases within the specified genes/features with coverage greater than or equal to the coverage levels defined under Add minimum coverage levels. To add a level, click the green plus icon; to remove one, click the red cross icon.
As for the Advanced options, if Strand-specificity is turned on, only reads which match the strand of a given region will be considered for that region’s coverage statistics.
Generate target enrichment graphs will generate a graphical overview of coverage across each feature.
When Use multithreading is checked, the computation will utilize multiple CPUs. However, if the input or output data is on a file system that does not handle multi-threaded tasks well (such as GPFS), unchecking this option will prevent task failures.
The Coverage report result page contains a project-level overview and starts with a summary table, with one sample per row (Figure 2). The first few columns show the percentage of bases in the genomic features which are covered at the specified level (or higher) (default: 1×, 20×, 100×). Average coverage is defined as the sum of base calls of each base in the genomic features divided by the length of the genomic features. Similarly, Average quality is defined as the sum of average quality of those bases that cover the genomic features, divided by the length of covered genomic features. The last two columns show the number of On-target reads (overlapping the genomic features) and Off-target reads (not overlapping the features). The Optional columns link enables import of any meta-data present in the data table (Data tab).
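The sketch below works through the Average coverage and coverage-level definitions above on a toy 10 bp target region; the depths are made up.

```python
# Average coverage = sum of per-base depths over the target / target length.
# The n-fold columns report the fraction of target bases with depth >= n.
per_base_depth = [0, 3, 5, 5, 8, 2, 0, 0, 1, 4]  # depth at each base of a 10 bp target

average_coverage = sum(per_base_depth) / len(per_base_depth)
fraction_1x = sum(d >= 1 for d in per_base_depth) / len(per_base_depth)
fraction_5x = sum(d >= 5 for d in per_base_depth) / len(per_base_depth)

print(average_coverage, fraction_1x, fraction_5x)  # 2.8 0.7 0.3
```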
Quantification of on- and off-target reads is also displayed in the column chart below the table (Figure 3), showing each sample as a separate column and fraction of on-/off-target reads on the y-axis.
Region coverage summary hyperlink opens a new page, with a table showing average coverage for each region (rows), across the samples (columns) (Figure 4).
The Coverage summary plot (Figure 6) is an overview of coverage across the targeted genomic features for all the samples in the project. Each line within the plot is a single sample; the horizontal axis is the normalized position within the genomic feature, represented as the 1st to 100th percentile of the length of the feature, while the vertical axis shows the average coverage (across all the features for a given sample).
If you need more details about a sample, click on the sample name in the Coverage report table (Figure 7). The columns are as follows:
Region name: the genomic feature identifier (as specified in the annotation file)
Chromosome: the chromosome of the genomic feature (or region)
Start: the start position of the genomic feature (1-based)
Stop: the stop position of the genomic feature (2-based, which means the stop position is exclusive)
Strand: the strand of the genomic feature
Total exon length: the length of the genomic feature
Reads: the total number of reads aligning to the genomic feature
% GC: the percentage of GC contents of those reads aligning to the genomic feature
% N: the percentage of ambiguous bases (N) of those reads aligning to the genomic feature
(n)x: the proportion of the genomic feature which is covered by at least n number of alignments. [Note: n is the coverage level that you specified when submitting Coverage report task, defaults are 1×, 20×, 100×]
Average coverage: the average sequencing depth across all bases in the genomic feature
Average quality: the average quality score across covered bases in the genomic feature
Validate variants is available for data nodes containing variants (Variants, Filtered variants, or Annotated Variants). The purpose of this task is to understand the performance of the variant calling pipeline by comparing variant calls from a sample within the project to known “gold standard” variant data that already exist for that sample. This “gold standard” data can encompass variants identified with high confidence by other experimental or computational approaches.
Setting up the task (Figure 1) involves identifying the Genome Build used for variant detection and the Sample to validate within the project. Target specific regions allows for specification of the Target regions for this study, relating to the regions sequenced for all samples in the project. Benchmark target regions represent the regions that have been previously interrogated to identify “gold standard” variant calls in the sample of interest. These parameters are important to ensure that only overlapping regions are compared, avoiding the identification of false positive or false negative variants in regions covered by only the project sample or the “gold standard” sample. Both sections utilize a Gene/feature annotation file, which can be previously associated with Partek Flow via Library File Management or added on the fly. The Validated variants file is a single-sample vcf file containing the “gold standard” variant calls for the sample of interest and can be previously associated with Partek Flow as a Variant Annotation Database via Library File Management or added on the fly.
The Validate variants results page contains statistics related to the comparison of variants in the project sample compared to the validated variant calls for the sample (Figure 2). The results are split into two sections, one based on metrics calculated from the comparison of SNVs and the other from the comparison of INDELs.
The following SNP-level metrics are contained within the report, comparing the sample in the project to the validated variant data (a sketch of the standard formulas behind the summary statistics follows the list):
No genotypes: the number of missing genotypes from the sample in the Flow project
Same as reference: the number of homozygous reference genotypes from the sample in the Flow project
True positives: the number of variant genotypes from the sample in the Flow project that match the validated variants file
False positives: the number of variant genotypes from the sample in the Flow project that are not found in the validated variants file
True negatives: the number of loci that do not have variant genotypes in the sample in the Flow project and the validated variants file
False negatives: the number of genotypes that do not have variant genotypes in the sample in the Flow project but do have variant genotypes in the validated variants file
Sensitivity: the proportion of variant genotypes in the validated variants file that are correctly identified in the sample in the Flow project (true positive rate)
Specificity: the proportion of non-variant loci in the validated variants file that are non-variant in the sample in the Flow project (true negative rate)
Precision: the number of true positive calls divided by the number of all variant genotypes called in the sample in the Flow project (positive predictive value)
F-measure: a measure of the accuracy of the calling in the Flow pipeline relative to the validated variants. It considers both the precision and the recall of the test to compute the score. The best value at 1 (perfect precision and recall) and worst at 0.
Matthews correlation: a measure of the quality of classification, taking into account true and false positives and negatives. The Matthews correlation is a correlation coefficient between the observed and predicted classifications, ranging from −1 to +1. A coefficient of +1 represents a perfect prediction, 0 is no better than random prediction, and −1 indicates a completely wrong prediction.
Transitions: variant allele interchanges of purines or pyrimidines in the sample in the Flow project relative to the reference
Transversions: variant allele interchanges of purines to/from pyrimidines in the sample in the Flow project relative to the reference
Ti/Tv ratio: ratio of transition to transversions in the sample in the Flow project
Heterozygous/Homozygous ratio: the ratio of heterozygous to homozygous genotypes in the sample in the Flow project
Percentage of sites with depth < 5: the percentage of variant genotypes in the sample in the Flow project that have fewer than 5 supporting reads
Depth, 5th percentile: the 5th percentile of sequencing depth across all variant genotypes in the sample in the Flow project
Depth, 50th percentile: the 50th percentile (median) of sequencing depth across all variant genotypes in the sample in the Flow project
Depth, 95th percentile: the 95th percentile of sequencing depth across all variant genotypes in the sample in the Flow project
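The sketch below shows the standard confusion-matrix formulas behind these summary statistics, applied to hypothetical counts.

```python
# Sensitivity, specificity, precision, F-measure, and Matthews correlation
# from true/false positive and negative counts (hypothetical numbers).
import math

tp, fp, tn, fn = 9500, 120, 88000, 300

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
precision = tp / (tp + fp)     # positive predictive value
f_measure = 2 * precision * sensitivity / (precision + sensitivity)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"{sensitivity:.4f} {specificity:.4f} {precision:.4f} "
      f"{f_measure:.4f} {mcc:.4f}")
```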
The INDEL-level metric columns contained within the report are identical, except that transition and transversion information is not included.
The Single-cell QA/QC task in Partek Flow enables you to visualize several useful metrics that will help you include only high-quality cells. To invoke the Single-cell QA/QC task:
Click a Single cell counts data node
Click the QA/QC section of the task menu
Click Single cell QA/QC
By default, all samples are used to perform QA/QC. You can choose to split the sample and perform QA/QC separately for each sample.
If your Single cell counts data node has been annotated with a gene/transcript annotation, the task will run without a task configuration dialog. However, if you imported a single cell counts matrix without specifying a gene/transcript annotation file, you will be prompted to choose the genome assembly and annotation file by the Single cell QA/QC configuration dialog (Figure 1). Note, it is still possible to run the task without specifying an annotation file. If you choose not to specify an annotation file, the detection of mitochondrial counts will not be possible.
The Single cell QA/QC task report opens in a new data viewer session. Four dot and violin plots showing the value of every cell in the project are displayed on the canvas: counts per cell, detected features per cell, the percentage of mitochondrial counts per cell, and the percentage of ribosomal counts per cell (Figure 2).
If your cells do not express any mitochondrial genes or an appropriate annotation file was not specified, the plot for the percentage of mitochondrial counts per cell will be non-informative (Figure 3).
Mitochondrial genes are defined as genes located on a mitochondrial chromosome in the gene annotation file. The mitochondrial chromosome is identified in the gene annotation file by having "M" or "MT" in its chromosome name. If the gene annotation file does not follow this naming convention for the mitochondrial chromosome, Partek Flow will not be able to identify any mitochondrial genes. If your single cell RNA-Seq data was processed in another program and the count matrix was imported into Partek Flow, be sure that the annotation field that matches your feature IDs was chosen during import; Partek Flow will be unable to identify any mitochondrial genes if the gene symbols in the imported single cell data and the chosen gene/feature annotation do not match.
Ribosomal genes are defined as genes that code for proteins in the large and small ribosomal subunits. Ribosomal genes are identified by searching their gene symbols against a list of 89 L & S ribosomal genes taken from HGNC. The search is case-insensitive and includes all known gene name aliases from HGNC. Identification of ribosomal genes is performed independently of the gene annotation file specified.
Total counts are calculated as the sum of the counts for all features in each cell from the input data node. The number of detected features is calculated as the number of features in each cell with greater than zero counts. The percentage of mitochondrial counts is calculated as the sum of counts for known mitochondrial genes divided by the sum of counts for all features and multiplied by 100. The percentage of ribosomal counts are calculated as the sum of counts for known ribosomal genes divided by the sum of counts for all features and multiplied by 100.
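The sketch below applies these definitions to a toy cells-by-genes count matrix. The gene symbols, prefix-based gene lists, and counts are illustrative only; as described above, Partek Flow derives the mitochondrial list from the annotation file's chromosome names and the ribosomal list from HGNC.

```python
# Per-cell QC metrics from a cells x genes count matrix (toy data).
import numpy as np

genes = ["MT-CO1", "MT-ND1", "RPL13", "ACTB", "GAPDH"]
counts = np.array([[30, 12, 55, 210, 180],   # cell 1
                   [400, 250, 5, 20, 15]])   # cell 2: high mitochondrial %

mito = np.array([g.startswith("MT-") for g in genes])
ribo = np.array([g.startswith(("RPL", "RPS")) for g in genes])

total_counts = counts.sum(axis=1)     # counts per cell
detected = (counts > 0).sum(axis=1)   # detected features per cell
pct_mito = 100 * counts[:, mito].sum(axis=1) / total_counts
pct_ribo = 100 * counts[:, ribo].sum(axis=1) / total_counts

print(total_counts, detected, pct_mito.round(1), pct_ribo.round(1))
```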
Each point on the plots is a cell. All cells from all samples are shown on the plots. The overlaid violins illustrate the distribution of cell values for the y-axis metric.
The appearance of a plot can be configured by selecting a plot and adjusting the Configure settings in the panel on the left (Figure 4). Here are some suggestions, but feel free to explore the other options available:
Open Axes and change the Y-axis scale to Logarithmic. This can be helpful to view the range of values better, although it is usually better to keep the Ribosomal counts plot in linear scale.
Open Style and reduce the Color Opacity using the slider. For data sets with many cells, decreasing the dot opacity can make the density of the plot easier to see.
Within Style, switch on Summary Box & Whiskers. Inspecting the median, Q1, Q3, upper 90%, and lower 10% quantiles of the distributions can be helpful in deciding appropriate thresholds.
High-quality cells can be selected using Select & Filter, which is pre-loaded with the selection criteria, one for each quality metric (Figure 5).
Hovering the mouse over one of the selection criteria reveals a histogram showing the frequency distribution of the respective quality metric. The minimum and maximum thresholds can be adjusted by clicking and dragging the sliders or by typing directly into the text boxes for each selection criterion (Figure 6).
Alternatively, use Pin histogram to view all of the distributions at once and determine thresholds with ease (Figure 7).
Adjusting the selection criteria will select and deselect cells in all of the plots simultaneously. Depending on your settings, the deselected points will either be dimmed or gray. The filters are additive; combining multiple filters will select the intersection of all the filters. The number of cells selected is shown in the figure legend of each plot (Figure 8).
Select the input data node for the filtering task and click Select (Figure 10).
A new data node, Filtered counts, will be generated under the Analyses tab (Figure 11).
Double click the Filtered counts data node to view the task report. The report includes a summary of the count distribution across all features for each sample; a detailed breakdown of the number of cells included in the filter for each sample; and the minimum and maximum values for each quality metric (expressed genes, total counts, etc) across the included cells for each sample (Figure 12).
Selecting a node with unaligned reads (either Unaligned reads or Trimmed reads) shows the QA/QC section in the context sensitive menu, with two options (Figure 1). To assess the quality of your raw reads, use Pre-alignment QA/QC.
The Pre-alignment QA/QC setup dialog is shown in Figure 2. Examine reads lets you control the number of reads processed by the tool: All reads, or a subset (One of every n reads). The latter option is not as thorough, but is much faster than All reads.
If selected, K-mer length creates a per-sample report with the position of the most frequent k-mers (i.e. sequences of k nucleotides) of the length specified in the dialog. The range of input values is from one to 10.
The last control refers to .fastq files. Partek® Flow® can automatically detect the quality encoding scheme (Auto detect), or you can use one of the options available in the drop-down list. However, auto-detection is only applicable to the Phred+33 and Phred+64 quality encodings. For the early Solexa quality encoding, select Solexa+64 from the Quality encoding drop-down list. For paired-end data, the pre-alignment QA/QC will be performed on each read in the pair separately and the results will be shown separately as well.
Most sequencing applications now use the Phred quality score. This score indicates the probability that the base was called incorrectly: Q = −10·log10(P). For example, a Phred score of 10 corresponds to 90% base call accuracy, 20 to 99%, 30 to 99.9%, and 40 to 99.99%.
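The conversion is easy to compute directly; a one-function sketch:

```python
# Phred scores encode the probability P that a base call is wrong:
# Q = -10 * log10(P), so accuracy = 1 - 10^(-Q/10).
def phred_to_accuracy(q):
    return 1.0 - 10 ** (-q / 10.0)

for q in (10, 20, 30, 40):
    print(q, f"{phred_to_accuracy(q):.4%}")  # 90%, 99%, 99.9%, 99.99%
```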
The task report is organised in two tiers. The initial view shows the project-level report with all the samples: an overview table is at the top, while matching plots are below.
Two project-level plots are Average base quality per position and Average base quality score per read (Figure 4). The latter presents the proportion of reads (y-axis) with a certain average quality score (i.e. all the base qualities within a read are averaged; x-axis). Mouse over a data point to see the matching readouts. The Save icon saves the plot in .svg format to the local machine. Each line on the plot represents a data file; you can select the sample names from the legend to hide/un-hide individual lines.
A sample-level report begins with a header, which is a collection of typical quality metrics (Figure 5).
Below the header you will find four plots: Base composition, Average base quality score per position (same as above, but on the sample level), Distribution of base quality scores (the same as Average base quality score per read, but on the sample level), and Distribution of read lengths.
Distribution of read lengths shows a single column for fixed-length data (e.g. Illumina sequencing). However, for quality-trimmed data or non-fixed-length data (like Ion Torrent sequencing), expect to see a distribution of read lengths (Figure 7).
If the K-mer length option was turned on when setting up the task, an additional plot is added to the sample-level report, i.e. K-mer content (Figure 8). For each position, the K-mer composition is given, but only the top six most frequent K-mers are reported; a high frequency of a K-mer at a given site (enrichment) indicates the possible presence of sequencing adapters in the short reads.
The pre-alignment QA/QC report as described above is generally available for NGS data in fastq format. For other types of data, the report may differ depending on the available information. For example, fasta format contains no base quality score information, so all the figures or graphs related to base or read quality scores will be unavailable.
Next generation sequencing (NGS) data files are notoriously large. Dealing with NGS data is not only time consuming but also puts constraints on hard disk space, especially if analysis parameters need to be optimized. The Filter reads task is a very useful tool for taking a subset of the raw data upon which optimization can be performed. The optimized parameters can then be saved and applied to the whole dataset.
Filter reads is only available for unaligned reads of FASTQ format. Select the Unaligned Reads data node then select Filter reads from the Pre-alignment tools section on the menu.
There are two options to filter reads: Subsample reads and Filter by read length.
To Subsample reads, specify that one read should be kept out of every n reads. For example, if the user specifies "Keep one read for every 10 reads" (Figure 1), the program will keep only 1 read out of every 10. This is equivalent to keeping 10% of the data.
To Filter by read length, set the read length limits by choosing the minimum and maximum read length(s) to keep.
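The sketch below illustrates the logic of both options on a plain FASTQ file (four lines per record). It is an outline of the idea only; Partek Flow performs the filtering internally, and the file names are placeholders.

```python
# Keep one read out of every n, then drop reads outside a length range.
def filter_fastq(in_path, out_path, n=10, min_len=25, max_len=500):
    with open(in_path) as fin, open(out_path, "w") as fout:
        index = 0
        while True:
            record = [fin.readline() for _ in range(4)]  # one FASTQ record
            if not record[0]:
                break  # end of file
            seq = record[1].rstrip("\n")
            if index % n == 0 and min_len <= len(seq) <= max_len:
                fout.writelines(record)
            index += 1

filter_fastq("sample.fastq", "subsampled.fastq")  # keeps ~10% of the reads
```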
The Trim bases task is used to trim bases from the 5'-end or 3'-end of the reads. The most obvious reason for Trim bases is to trim away poor quality bases from the read prior to alignment because these can potentially affect alignment rate.
The task allows the user to trim reads in different ways (Figure 1), including:
Trim bases based on quality score
Trim bases from 3'-end
Trim bases from 5'-end
Trim bases from both ends
Trim bases from 5'- or 3'-end (Figures 2-3) allows a fixed number of bases to be trimmed away from the 5'- or 3'-end of the reads. These two functions are useful when the read length is constant. They are not recommended if the read length is not constant, since good quality bases from shorter reads are likely to be trimmed away.
Trim bases from both ends (Figure 4) allows the user to keep only the bases between a fixed start and end position of the reads. This is particularly useful if poor quality bases are observed on both ends of the read: instead of trimming successively from the 5'- and then the 3'-end, trimming is performed only once, from both ends.
Trim bases based on quality score (Figure 5) is probably the most useful function for trimming poor quality bases from the 5'- or 3'-ends of reads. This function allows dynamic trimming of bases depending on their quality scores. The trimming can be done from the 5'-end, the 3'-end, or both ends of the reads. The function evaluates each base from the end of the read and trims it away until the last remaining base has a quality score greater than the specified threshold (see the sketch below). For an extensive evaluation of read trimming effects on Illumina NGS data analysis, see Del Fabbro et al. [1].
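A minimal sketch of the 3'-end case, assuming the quality scores have already been decoded to integers; this illustrates the rule described above, not Partek Flow's exact implementation.

```python
# Trim bases from the 3' end while the terminal base quality is at or
# below the threshold; stop once the last base exceeds it.
def trim_3prime(seq, quals, threshold=20):
    end = len(seq)
    while end > 0 and quals[end - 1] <= threshold:
        end -= 1
    return seq[:end], quals[:end]

seq, quals = trim_3prime("ACGTACGTAC", [38, 37, 36, 35, 30, 28, 22, 15, 12, 8])
print(seq)  # ACGTACG
```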
In some cases, the reads that result from base trimming can be very short and thus are not suitable for alignment. Partek® Flow® therefore provides the option to set a Min read length after base trimming; reads shorter than the set length are discarded.
Also, reads can have a high percentage of Ns (ambiguous bases). The Max N setting is therefore available to discard reads with a percentage of Ns higher than the set threshold.
The Quality encoding option refers to the Phred quality score encoding of the FASTQ input file. The available options are Phred+33, Phred+64, Solexa+64 and Integers. Selecting Auto-detect will determine whether the quality encoding is Phred+33 or Phred+64; for Solexa data, you will need to select Solexa+64. For most datasets the auto-detect option works well, with a few exceptions where the base quality scores fall into the ambiguous zone shared by Phred+33 and Phred+64. If the quality encoding scheme is known, we therefore recommend selecting the encoding format directly from the Quality encoding list.
Figure 6 shows the options available for the different selections of the Trim bases function. Note that the default Min read length is 25 bp. For micro RNA sequencing data, this default needs to be set to a smaller value (we recommend 15) to account for the short length of mature microRNAs.
The Task Details page for Trim bases can be accessed by selecting the task node Trim bases, and subsequently selecting Task Details from the Task results section. In the Task details page, several sections are available:
General task information: contains information such as the task name, owner, status, submitted time, start, end and duration of the task
Output Files: contains a description of each output file. If you hover your mouse cursor over a file name, you will see the exact location of the file on the server. If you click the file name, you will have the option to view up to 999 lines of the raw data. You can also download the file from the server.
Input Files: contains information about the input files. This section lists all the input files used in the Trim bases task.
Input Parameters: contains the parameters used for running the Trim bases function. This section shows which options were selected for the task, including all the parameters used, such as the minimum read length, maximum percentage of N bases, quality encoding, quality score threshold (if applicable) and how trimming was performed.
Command Lines: shows the commands Partek Flow used to run the Trim bases function
The Trim bases Task Report page can be accessed by selecting either the Trim bases task node or Trimmed reads data node and then selecting the Task Report from the Task results section of the context sensitive menu. There is a link at the bottom of the page to directly go to the Task Details page. The page displays the following components:
Summary table: gives the total number of reads in each sample, the total number of reads trimmed (i.e. with at least one base trimmed from the read), the total number of reads removed (due to the Min read length and Max N parameters), the average number of bases trimmed per read, and the average read quality before and after trimming.
Stacked bar-chart: shows the percentages of untrimmed, trimmed, and removed reads in a stacked bar-chart, comparing all the samples.
Average base quality score per position of trimmed reads: shows the average base quality score at each position of the trimmed reads for all samples in the project.
The Trim bases function produces trimmed unaligned reads in a data node named Trimmed reads; the word "trimmed" is appended to the output filenames. The trimmed data can be downloaded by selecting the Trimmed reads node and then selecting Download data from the context sensitive menu. Alternatively, if you have access to the Partek Flow server, you can go to the Task Details page and identify the location of the output files from the Output Files section, as described in the Trim bases Task Details section above. The Trimmed reads data node will have the same format as the raw data.
Del Fabbro C, Scalabrin S, Morgante M, Giorgi FM. An extensive evaluation of read trimming effects on Illumina NGS data analysis. PLoS ONE. 2013; 8(12): e85024.
The Combine alignments task can potentially maximize the number of reads that map to a region. When unaligned reads resulting from one aligner are then aligned using a different one, the two resulting alignments can be combined for downstream analysis within Partek Flow. Note that this can only be performed if the reads were aligned to the same reference genome.
To invoke this task, click an Aligned reads data node and select Combine alignments (Figure 1).
A list of compatible alignments will appear. The color of the text signifies the layer the alignment corresponds to (Figure 2). Select the alignment you would like to combine and click Finish.
The resulting Aligned reads data node can now be used for downstream analysis (Figure 3).
Note that this task combines the files in the data node within Partek Flow but does not merge the BAM files. Downloading an aligned reads data node from a Combine alignments task will result in multiple BAM files per sample.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The presence of adapter sequences at the 5'-end or 3'-end of reads has been shown to be one of the major problems during alignment, causing reads to go unaligned. Removing adapter sequences is therefore essential when the sequenced read length is longer than the molecule of interest, such as microRNA. Because mature microRNAs are short, it is almost certain that the adapter sequence will be sequenced at the 3'-end of the miRNA.
To determine whether microRNA data has already been adapter-trimmed, look at the pre-alignment QA/QC of the raw data, specifically the read length distribution. If the read length distribution peaks at approximately 22-23 bases, the data has usually been adapter-trimmed. If instead the reads have a fixed length, the data is very likely not adapter-trimmed; in that case, obtain the adapter sequence from your vendor or service provider and use the Trim adapters function to trim it away.
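As a quick sanity check outside of Partek Flow, the read length distribution can be computed directly from a FASTQ file. Below is a minimal Python sketch; the file name sample.fastq.gz is a placeholder.

```python
# Minimal sketch: tally read lengths in a gzipped FASTQ file.
# "sample.fastq.gz" is a placeholder file name.
import gzip
from collections import Counter

lengths = Counter()
with gzip.open("sample.fastq.gz", "rt") as fq:
    for i, line in enumerate(fq):
        if i % 4 == 1:                      # sequence line of each FASTQ record
            lengths[len(line.strip())] += 1

# A peak near 22-23 nt suggests the adapters were already trimmed;
# a single fixed length suggests untrimmed reads.
for length in sorted(lengths):
    print(length, lengths[length])
```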
Partek Flow software wraps Cutadapt [1], a widely used tool for adapter trimming. It can be used to trim adapter sequences in nucleotide-space data as well as color-space data.
To use the Trim adapters function, you will need to know the adapter sequences. Select an Unaligned reads data node and then select Trim adapters from the Pre-alignment tools section of the task pane. In the Trim adapters page (Figure 1), paste the adapter sequences into the text box and add them using the button.
There are three options when it comes to trimming the adapter sequence:
Trimming for adapter ligated to 3'-end: the adapter sequence and anything that follows it will be trimmed away from the 3'-end.
Trimming for adapter ligated to 5'-end or 3'-end: if the adapter sequence is identified within the read or overlapping the 3'-end, the adapter sequence and anything that follows it is trimmed away. However, if the adapter sequence partially overlaps the 5'-end of the read, the initial portion of the read matching the adapter sequence is trimmed and everything that follows is kept.
Trimming for adapter ligated to 5'-end: if the adapter sequence appears partially at the 5'-end or within the read, the preceding sequence including the adapter sequence is trimmed. Users have the option to use the special character '^' at the beginning of the adapter sequence to mark the adapter as 'anchored'. An anchored adapter must appear in its entirety at the 5'-end of the read (i.e. it is a prefix of the read).
For Trim adapters, more than one adapter sequence can be specified at once. When multiple adapters are provided, each adapter is evaluated based on how many bases it overlaps the read as well as its error rate. Adapters with fewer overlapping nucleotides or higher error rates are removed from consideration.
After that, the best adapter is chosen based on the number of bases matching the read. If there is a tie, adapters of the same type are chosen in the order they were provided, and adapters of different types are chosen by type in the following order: first 3', then 5' or 3', and lastly 5' adapters.
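The selection logic can be illustrated with a short Python sketch. This is a simplified illustration of the ordering described above, not Cutadapt's internal implementation; the field names are invented for the example.

```python
# Minimal sketch of choosing the best adapter among tied candidates.
TYPE_ORDER = {"3p": 0, "5p_or_3p": 1, "5p": 2}   # 3' first, then 5'/3', then 5'

def choose_best(candidates):
    """candidates: dicts with 'matches' (bases matching the read),
    'type' (one of TYPE_ORDER), and 'input_order' (position in the list)."""
    return min(
        candidates,
        key=lambda a: (-a["matches"],          # most matching bases wins
                       TYPE_ORDER[a["type"]],  # ties broken by adapter type
                       a["input_order"]),      # then by order provided
    )

best = choose_best([
    {"matches": 18, "type": "5p", "input_order": 0},
    {"matches": 18, "type": "3p", "input_order": 1},
])
print(best["type"])   # '3p' -- the 3' adapter wins the tie
```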
There are cases when the Trim adapters function does not work properly, for example when reads contain N (ambiguous) bases. Therefore, advanced options allow users to configure how the matching of the adapter sequence is performed. The advanced options dialog box is shown in Figure 2.
The first section of advanced options is the Adapter options. This is used to configure how the matching between the adapter sequence and the read is performed, including the maximum error rate allowed, the number of times an adapter can be matched, the minimum length of overlapping bases, whether Ns (ambiguous bases) are allowed in the adapter and whether Ns are treated as wildcards. Hover your mouse cursor over the info button to get more information about each parameter.
The second section of advanced options is the Filtering options. This is used to remove adapter-trimmed reads that are shorter than the minimum read length, since reads that are too short tend to produce non-unique alignments.
The third section of advanced options is Additional modifications to reads. The quality cutoff is used to trim poor-quality bases from the reads before adapter trimming. Quality encoding specifies the quality score encoding of the raw data. Read name prefix and suffix are used to add a prefix and suffix to the read IDs. Lastly, if Negative quality zero is checked, all bases with negative quality scores are converted to zero.
Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 2011; 17: 10-12.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Feature distribution plot visualizes the distribution of features in a counts data node.
To run Feature distribution:
Click a counts data node
Click the QA/QC section of the toolbox
Click Feature distribution
A new task node is generated with the Feature distribution report.
The Feature distribution task report plots the distribution of all features (genes or transcripts) in the input data node with one feature per row (Figure 1). Features are ordered by average value in descending order.
The plot can be configured using the panel on the left-hand side of the page.
Using the filter, you can choose which features are shown in the task report.
The Manual filter lets you type a feature ID (such as a gene symbol) and filter to matching features by clicking +. You can add multiple feature IDs to filter to multiple features (Figure 2).
The List filter lets you filter to the features included in a feature list. To learn more about feature lists in Partek Flow, please see List management.
Distributions can be plotted as histograms, with the x-axis being the expression value and the y-axis the frequency, or as a strip plot, where the x-axis is the expression value and the position of each cell/sample is shown as a thin vertical line, or strip, on the plot (Figure 3).
To switch between plot types, use the Plot type radio buttons.
Mousing over a dot in the histogram plot gives the range of feature values that are being binned to generate the dot and the number of cells/samples for that bin in a pop-up (Figure 4).
Mousing over a strip shows the sample ID and feature value in a pop-up. If there are multiple cells/samples with the same value, only one strip will be visible for those cells/samples and the mouse-over will indicate how many cells/samples are represented by that one strip (Figure 5).
Clicking a strip will highlight that cell/sample in all of the plots on the page (Figure 6).
The grey dot in each strip plot shows the median value for that feature. To view the median value, mouse over the dot (Figure 7).
To navigate between pages, use the Previous and Next buttons or type the page number in the text field and press Enter on your keyboard.
The number of features that appear in the plot on each page is set by the Items per page drop-down menu (Figure 8). You can choose to show 10, 25, or 50 features per page.
When Plot type is set to Histogram, you can choose to configure the Y-axis scale using the Scale Y-axis radio buttons. Feature max sets each feature plot y-axis individually. Page max sets the same y-axis range for every feature plot on the page, with the range determined by the feature with the highest frequency value.
You can add attribute information to the plots using the Color by drop-down menu.
For histogram plots, the histograms will be split and colored by the levels of the selected attribute (Figure 9). You can choose any categorical attribute.
For strip plots, the sample/cell strips will be colored by the levels or values of the selected attribute (Figure 10). You can choose any categorical or numeric attribute.
The Filter alignments task can be used to filter aligned reads data using specified parameters. To invoke the task, click on an Aligned reads data node and select Filter alignments. By default, this task removes low-quality reads, singletons and unaligned read information stored within the BAM/SAM file (Figure 1).
Users also have the option to remove duplicate reads in aligned data. For DNA-Seq analysis, this is typically performed to minimize redundant variant calling information. To remove duplicates, click on the Remove duplicates checkbox (Figure 2).
Select the number of reads you want to keep. Then specify when alignments are treated as duplicates. This can either be reads that map to the same start position or, additionally, have the same sequence. You can also select whether to keep the read with the highest mapping score or a randomly-selected duplicate.
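The duplicate definition can be sketched in a few lines of Python. This is an illustration of the grouping logic described above under simple assumptions (alignments as dictionaries), not Partek Flow's BAM-level implementation.

```python
# Minimal sketch: group alignments sharing a start position (and optionally
# the same sequence), then keep n_keep reads per group.
import random

def dedup(alignments, same_sequence=False, keep="best_mapq", n_keep=1):
    groups = {}
    for aln in alignments:   # aln: dict with 'chrom', 'start', 'seq', 'mapq'
        key = (aln["chrom"], aln["start"]) + ((aln["seq"],) if same_sequence else ())
        groups.setdefault(key, []).append(aln)
    kept = []
    for dups in groups.values():
        if keep == "best_mapq":
            dups = sorted(dups, key=lambda a: -a["mapq"])  # highest score first
        else:
            random.shuffle(dups)                           # random duplicate
        kept.extend(dups[:n_keep])
    return kept
```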
To remove alignments with mismatches, select the Remove alignments with mismatches check box. Using the selector, specify the number of mismatched bases that must be exceeded for the alignment to be excluded (Figure 3). Note that mismatches also include insertions and deletions.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Post-alignment tools involve tasks that can be performed on aligned data. These are typically used in preparing aligned data for other downstream analyses, such as DNA-Seq or RNA-Seq analysis.
To invoke Post-alignment tools, click on an Aligned reads data node (Figure 1). There are three functions available in Post-alignment tools:
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Annotation report provides a table summarizing the cell-level attributes of a single cell counts data node.
To run Annotation report:
Click a Single cell counts data node
Click the Annotation/Metadata section of the toolbox
Click Annotation report
The task will run and generate a task report.
The Annotation report includes two tables (Figure 1) - the top table summarizes the categorical attributes, giving the number of levels of each attribute, and the bottom table summarizes the numeric attributes, providing some basic summary statistics about the distribution of each attribute.
To download a text-file version of one of the tables, click Download in lower right-hand corner of the table.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Deduplicate UMIs task identifies and removes reads mapped to the same chromosomal location with duplicate unique molecular identifiers (UMIs). The details of our UMI deduplication methods are outlined in the UMI Deduplication in Partek Flow white paper.
To invoke Deduplicate UMIs:
Click an Aligned reads data node
Click Post-alignment tools in the toolbox
Click Deduplicate UMIs
The task configuration dialog content depends on whether you imported FASTQ files or BAM files into Partek Flow.
UMIs and barcodes are detected and recorded by the Trim tags task in Partek Flow. You can choose whether to retain only one alignment per UMI or not (Figure 1). The default will depend on which prep kit was used in the Trim tags task.
If you select Retain only one alignment per UMI, you will be asked to choose an assembly and gene/feature annotation file. The annotation file is used to check whether a read overlaps an exonic region. Only reads that have 50% overlap with an exon will be retained.
If you do not select Retain only one alignment per UMI, UMI deduplication will proceed without filtering to exonic reads. Other differences between the two options are outlined in the UMI Deduplication in Partek Flow white paper.
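Conceptually, UMI deduplication collapses alignments that share a cell barcode, UMI, and mapping position into a single molecule. The sketch below illustrates this basic idea only; Partek Flow's full method, including UMI error correction and the exon-overlap filter, is described in the white paper.

```python
# Minimal sketch: one representative alignment per (barcode, UMI, position).
def deduplicate(alignments):
    """alignments: iterable of (barcode, umi, chrom, position) tuples."""
    seen = set()
    unique = []
    for aln in alignments:
        if aln not in seen:
            seen.add(aln)
            unique.append(aln)
    return unique

reads = [("ACGT", "AAC", "chr1", 100),
         ("ACGT", "AAC", "chr1", 100),   # PCR duplicate, removed
         ("ACGT", "GGT", "chr1", 100)]   # different UMI, kept
print(len(deduplicate(reads)))           # 2
```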
BAM files generated by other tools can be imported into Partek Flow and deduplicated by the software. Additional options are available in the task configuration dialog to let you specify the location of the UMI and cell barcode information, which is typically stored in BAM tags. Specify the BAM tags in the text fields. For example, when processing a BAM file produced by Cell Ranger 3.0.1, the BAM tag for the UMI sequence is UR and the BAM tag for the barcode sequence is CR (Figure 2).
The option to Retain only one alignment per UMI is also available when starting from a BAM file.
The Deduplicate UMIs task report includes a knee plot showing the number of deduplicated reads per barcode. This plot is used to filter the barcodes to include only barcodes corresponding to cells. For more information about using the knee plot to filter barcodes, please see the Cell Barcode QA/QC page. One difference between the Deduplication report and the Cell Barcode QA/QC report is that the Deduplication report gives the number of initial alignments and the number of deduplicated alignments for each sample (Figure 3). This indicates how many of your aligned reads were PCR duplicates and how many were unique molecules.
The initial number of cells is set by our automatic filter. You can set the filter manually by clicking on the plot or by typing a cutoff number in the Cells or Reads in cells text boxes. If there are multiple samples, each sample receives a plot and filters are set per sample.
The number of cells, reads in cells, median reads per cell, number of initial alignments, and number of deduplicated alignments are listed for each sample in the summary table (Figure 4).
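The underlying idea of the barcode filter can be sketched as follows: tally deduplicated reads per barcode, rank barcodes in descending order (the order the knee plot shows), and keep the top-ranked barcodes as cells. This is a simplified illustration, not the automatic filter's actual algorithm.

```python
# Minimal sketch: rank barcodes by read count and keep the top n_cells.
from collections import Counter

def filter_barcodes(barcode_per_read, n_cells):
    counts = Counter(barcode_per_read)         # deduplicated reads per barcode
    ranked = counts.most_common()              # descending, knee-plot order
    return {bc for bc, n in ranked[:n_cells]}  # barcodes called as cells

cells = filter_barcodes(["ACGT", "ACGT", "TTTT", "ACGT", "GGGG"], n_cells=1)
print(cells)   # {'ACGT'}
```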
Clicking Apply filter at either the knee plot or the summary table will run the Filter barcodes task and generate a Filtered reads data node.
To return to the knee plot, click Back to filter.
To reset the filters for all samples to the automatic cutoff, click Reset all filters.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Aligned reads can be converted to unaligned reads in Partek Flow. The task is available under Post-alignment tools in the task menu when any Aligned reads data node is selected, which can be a result of an aligner in Partek Flow or data already aligned before import.
Generating unaligned reads from aligned data gives you the flexibility to remap the reads using a different aligner, a different set of alignment parameters, or a different genome reference. This is particularly useful when analyzing sequences from xenograft models, where the same set of reads can be aligned to two different species. It may also be useful if the original unaligned FASTQ files are not as easily accessible to the user as the aligned BAM files.
To perform the task, select an Aligned reads data node and click Convert alignments to unaligned reads task in the task menu (Figure 1).
During the conversion, the BAM files are converted to FASTQ files and a new Unaligned reads data node will be generated (Figure 2).
The filenames of the FASTQ files are based on the sample names in the Data tab. The generated files are compressed with the extension *.fq.gz. For samples containing BAM files with paired-end reads, two FASTQ files will be generated for each sample, with file names appended with _1 and _2. The example in Figure 3 shows 18 .fq.gz output files generated from 9 BAM files.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Visium spatial gene expression solution from 10X Genomics allows us to spatially resolve RNA-Seq expression data in individual tissue sections. For the analysis of Visium spatial gene expression data in Partek Flow, you will need the following output files from the Space Ranger outs subdirectory [1]:
The filtered count matrix -- either the .h5 file (one file) or feature-barcode matrix (three files).
spatial – outputs of the spatial pipeline, i.e. the spatial imaging data (Figure 1)
We recommend importing the count matrix using the .h5 file format, which allows you to import multiple samples at the same time. Rename the files and place them all in the same folder to import them in one go.
The spatial subdirectory contains the following image-related files:
tissue_hires_image.png
tissue_lowres_image.png
aligned_fiducials.jpg
detected_tissue_image.jpg
tissue_positions_list.csv
scalefactors_json.json
The folder should be compressed into a single .gz or .zip file before being uploaded to the Partek Flow server. You can pick the file for each sample from the Partek Flow server, your local computer, or a URL using the file browser.
To run Annotate Visium image:
Click a Single cell counts data node
Click the Annotation/Metadata section in the toolbox
Click Annotate Visium image
You will be prompted to pick a Spatial image file for each sample you want to annotate (Figure 2).
Click Finish
A new data node, Annotated counts, will be generated (Figure 3).
When the task report of the Annotated counts node is opened (or the Annotated counts node is double-clicked), the images are displayed in the Data viewer (Figure 4).
It is a 2D plot whose X and Y axes are the tissue spot coordinates, with the tissue spots drawn on top of the slide image. The opacity of the tissue spots can be changed using the slider under Configure>Style>Color to show more of the underlying image.
From the Configure>Background>Image drop-down list in the Data viewer, different formats of the image can be selected (Figure 6).
Note that the "Annotate Visium image" task splits by sample so all of the downstream tasks will also do this (e.g. if the pipeline is built from this node all of the downstream tasks will also be split and viewed per sample). To generate plots with multiple samples on one plot, build the pipeline from the "Single cell counts" node.
[1] https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/output/overview
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Downscale alignments task can be invoked on a data node containing BAM/SAM files, e.g. an Aligned reads data node. The only parameter to specify is the percentage of alignments to keep in the results, which should be between 0 and 100 (Figure 1). All samples in the input data node use the same value. The output data node contains BAM files with a subset of the alignments.
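Percentage-based downsampling can be illustrated with a short sketch. One common approach is to keep each alignment with probability p; this is an assumption for illustration, as the task's exact selection strategy is not described here.

```python
# Minimal sketch: keep each alignment with probability percent_to_keep/100.
import random

def downscale(alignments, percent_to_keep):
    p = percent_to_keep / 100.0          # parameter ranges from 0 to 100
    return [aln for aln in alignments if random.random() < p]

subset = downscale(list(range(10000)), percent_to_keep=25)
print(len(subset))    # roughly 2500
```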
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
If a single cell data node contains cell attribute information, e.g., clustering results, classifications, or imported attributes, a counts-type data node containing the number of cells from each attribute group for each sample can be generated and used for downstream analysis.
To invoke Generate group cell counts:
Click a single cell counts data node with cell-level attribute information
Click Pre-analysis tools in the toolbox
Click Generate group cell counts
Select the attribute to group the cells from the Group by drop-down menu (Figure 1)
Click Finish
A group cell counts node will be generated. The data node contains a matrix of cell counts in each sample for each group. You can view the counts results in the Group cell counts report (Figure 2).
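The structure of the resulting matrix can be illustrated with pandas: one row per sample, one column per attribute group, and cell counts as values. The sample and cluster labels below are invented for the example.

```python
# Minimal sketch: cross-tabulate cells by sample and attribute group.
import pandas as pd

cells = pd.DataFrame({
    "sample":  ["S1", "S1", "S1", "S2", "S2"],
    "cluster": ["T cell", "T cell", "B cell", "T cell", "B cell"],
})
group_counts = pd.crosstab(cells["sample"], cells["cluster"])
print(group_counts)   # one row per sample, one column per cluster
```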
The Cell counts data node is a counts type data node and downstream analysis tasks, such as normalization, PCA, and ANOVA, can be used to analyze the group cell counts data.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Split matrix can be invoked on any counts data node with more than one feature type. For example, a CITE-Seq experiment would have Gene Expression counts and Antibody Capture counts in the single cell counts data node. Datasets generated by 10X Genomics' Feature Barcoding experiments also utilize this task to split different feature measurements for downstream analysis.
There are no parameters to configure. To run:
Click the counts data node you want to split
Click the Pre-analysis tools section of the toolbox
Click Split matrix
The Split matrix task will run and generate output data nodes for each of the feature types. For example, if there are Antibody Capture and Gene Expression feature types in the input, Split matrix will generate two data nodes (Figure 1). Every sample is included in both matrices.
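The effect of the split can be illustrated with pandas, keeping every cell in both output matrices. The feature names and type labels below are invented for the example.

```python
# Minimal sketch: split a counts matrix into one matrix per feature type.
import pandas as pd

counts = pd.DataFrame(
    [[5, 0, 120], [2, 7, 85]],
    index=["cell_1", "cell_2"],
    columns=["CD3_ab", "CD19_ab", "MS4A1"],
)
feature_type = {"CD3_ab": "Antibody Capture",
                "CD19_ab": "Antibody Capture",
                "MS4A1": "Gene Expression"}

matrices = {ft: counts.loc[:, [f for f, t in feature_type.items() if t == ft]]
            for ft in set(feature_type.values())}
print(matrices["Antibody Capture"])   # every cell/sample kept in both matrices
```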
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Partek Flow provides Pre-alignment tools that allow the user to process next-generation sequencing data before proceeding to alignment. These tools are not only useful for controlling the quality of data, but can also be used for subsampling prior to analyzing the full dataset. There are three functions available in Pre-alignment tools:
Users are expected to have a preliminary understanding of:
File formats for next generation sequencing data
Phred-quality score
To show the Pre-alignment tools, select an Unaligned reads or Trimmed reads data node. The tools will appear in the context-sensitive menu on the right of the screen (Figure 1).
Different Pre-alignment tools are available for different formats of unaligned reads. For example, if the reads are in FASTQ format, all of the tools are available; if the unaligned reads are in FASTA or SFF format, the Filter reads option is not available.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Cell Hashing enables sample multiplexing and super-loading in single cell RNA-Seq isolation by labeling each sample with a sample-specific oligo-tagged antibody against a ubiquitously expressed cell surface protein.
The Hashtag demultiplexing task is an implementation of the algorithm used in Stoeckius et al. 2018 [1] for demultiplexing cell hashing data. The task adds the cell-level attributes Sample of origin and Cells in droplet.
To run Hashtag demultiplexing, your data must meet the following criteria:
The data node must contain fewer features than observations
The data node must be the output of a normalization task (the recommended normalization method for hashtag counts is CLR)
If you are processing your FASTQ files in Partek Flow, be sure to specify a different Data type for your Cell Hashing FASTQ files on import than the FASTQ files for your gene expression and any other antibody data.
If you are processing your FASTQ files using Cell Ranger, be sure to specify a different feature_type for your Cell Hashing antibodies than any other antibodies in the Feature Reference CSV File.
If you want to specify sample IDs instead of using hashtag feature IDs as the sample IDs, you will need to prepare a tab-delimited text file (.txt) with hashtag feature IDs in the first column and the corresponding sample IDs in the second column (Figure 1). A header row is required.
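For example, a sample ID file could look like the following; the hashtag feature IDs and sample names here are hypothetical and must match the feature IDs in your own data.

```
Hashtag	Sample
Hashtag1	Control
Hashtag2	Treated
```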
Click the Normalized counts data node for your cell hashing data
Click Hashtag demultiplexing in the Pre-analysis tools section of the toolbox
Click Browse to select your Sample ID file (Optional)
Click Finish to run
The output is a Demultiplexed counts data node (Figure 2).
Two cell-level attributes, Cells in droplet and Sample of origin, are added by this task and are available for use in downstream tasks. You can download the attribute values for each cell by clicking the Demultiplexed counts data node, clicking Download, and choosing to download Attributes only.
We recommend using Annotate cells to transfer the new attributes to other sections of your project after downloading the attributes text file.
It is also possible to use the Merge matrices task to combine your data types and attributes.
Stoeckius, M., Zheng, S., Houck-Loomis, B., Hao, S., Yeung, B.Z., Mauck, W.M., Smibert, P. and Satija, R., 2018. Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome biology, 19(1), p.224.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Descriptive statistics task can be invoked on a matrix data node, e.g. a Gene counts or Normalized counts data node in a bulk RNA-Seq analysis pipeline, or a Single cell counts data node. It calculates measures of central tendency and variability on the observations or features of the matrix data.
Click on a counts data node
Choose Descriptive statistics in the Statistics section of the toolbox (Figure 1)
This will invoke the task configuration dialog; use it to specify which calculation(s) will be performed on cells (or samples, for a bulk analysis data node) or features (Figure 2).
The available statistics are listed in the left panel. In the descriptions below, suppose x1, x2, ..., xn represents an array of numbers.
Number of cells: Available when Calculate for is set to Features. Reports the number of cells whose value is [<, <=, =, !=, >, >=] (select one from the drop-down list) the cutoff value entered in the text box. The cutoff is applied to the values present in the input data node, i.e. if invoked on a non-normalized data node, the values are raw counts. For instance, use this option if you want to know the number of cells in which each feature was detected; a possible filter is: number of cells whose value > 0.0
Percent of cells: Available when Calculate for is set to Features. Reports the percentage of cells whose value is [<, <=, =, !=, >, >=] (select one from the drop-down list) the cutoff value entered in the text box.
Number of features: Available when Calculate for is set to Cells. Reports the number of features whose value is [<, <=, =, !=, >, >=] (select one from the drop-down list) the cutoff value entered in the text box. The cutoff is applied to the values present in the input data node, i.e. if invoked on a non-normalized data node, the values are raw counts. For example, use this option if you want to know the number of detected genes per cell; a possible filter is: number of features whose value > 0.0
Percent of features: Available when Calculate for is set to Cells. Reports the percentage of features whose value is [<, <=, =, !=, >, >=] (select one from the drop-down list) the cutoff value entered in the text box.
Q1: 25th percentile
Q3: 75th percentile
Range: xmax - xmin
Left-click a measurement and drag it to the right panel one at a time, or hover over a measurement and click the green plus button to move it to the right panel. When Sample (Cell) is selected, the calculation will be performed on all the features in the input matrix for each sample (or cell). When Feature is selected, the calculation will be performed across all the samples (cells) in the input matrix for each feature.
In addition, when Feature is selected, there is an extra Group by option (Figure 3).
From the drop-down list, choose a categorical attribute to calculate the descriptive statistics on all the subgroups for each feature.
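For example, per-feature statistics with a Group by attribute can be reproduced in pandas as follows; the gene names and group labels are invented for the example.

```python
# Minimal sketch: per-feature descriptive statistics, computed by group.
import pandas as pd

expr = pd.DataFrame({"GeneA": [0, 3, 5, 2], "GeneB": [1, 0, 0, 4]},
                    index=["c1", "c2", "c3", "c4"])
group = pd.Series(["tumor", "tumor", "normal", "normal"], index=expr.index)

stats = expr.groupby(group).agg(["mean", "median"])    # mean/median per group
pct_detected = expr.gt(0).groupby(group).mean() * 100  # percent of cells > 0
print(stats)
print(pct_detected)
```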
The output of the task is a matrix: Cell stats (result of Calculate for Cells) or Feature stats (result of Calculate for Features) (Figure 4). The results can be visualized in the Data Viewer.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
If you have attribute information about your cells, you can use the Annotate cells task in Partek Flow to apply this information to the data. Once applied, these can be used like any other attribute in Partek Flow, and thus can be used for cell selection, classification and differential analysis.
To run Annotate cells:
Click a Single cell counts data node
Click the Annotation/Metadata section in the toolbox
Click Annotate cells
You will be prompted to specify annotation input options:
Single file (all samples): requires one .txt file covering the cells in all samples. Each row in the file represents a barcode; at least one column must contain barcodes matching the barcodes in your data, and another column must contain sample IDs matching the sample names in the Data tab of your project.
File per sample: requires all of the annotation files to have the same format. Each file has barcodes on rows and requires one barcode column matching the barcodes in your data for that sample. All files must have the same set of columns; column headers are case-sensitive.
You can pick the file for each sample from the Partek Flow server; you have to specify annotation files for all the samples in the dialog (Figure 1).
To view a preview of the files, click Show Preview (Figure 2).
If you would like to annotate your matrix features with a gene annotation file, you can choose an annotation file at the bottom of the dialog. You can choose any gene/feature annotation available on the Partek Flow server. If a feature annotation is selected, the percentage of mitochondrial reads will be calculated using the selected annotation file.
Click Next to continue
The next dialog page previews the attributes found in the annotations text file (Figure 3).
You can choose which attributes to import using the check-boxes, change the names of attributes using the text fields, and indicate whether an attribute with numbers is categorical or numeric.
Click Finish to import the attributes.
A new data node, Annotated single cell counts, will be generated (Figure 4).
Your annotations will be available in downstream analysis tasks.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Next generation sequencing can produce anywhere from hundreds of thousands to tens of millions short nucleotide sequences for a single sample. For any given base within an individual sequence there can also be a quality score associated with the confidence of that base call from the sequencer. The process of alignment is used to map all of these reads to a reference sequence, providing information with regards to the start and stop positions of each read within the reference sequence as well as a quality metric for the mapping. This document will provide information about the available aligners within Partek Flow as well as illustrate how to perform alignment against a reference sequence. The result of alignment will be an Aligned reads data node that contains the BAM files generated from the alignment.
The user should be familiar with:
Alignment tools appear in the context-sensitive menu on the right of the screen (Figure 1) when you click any data node containing FASTQ files. Examples include Unaligned reads, Trimmed reads, and Subsampled reads data nodes.
Partek Flow provides numerous publicly available tools for the alignment process to meet the needs of your specific sequencing experiment. The information below provides a synopsis of each aligner as well as the current version. Please refer to the aligner links and references section for further information on each aligner.
Bowtie [1] (Version 1.0.0) - Uses a Burrows-Wheeler transform to create a permanent, reusable index of the genome. Backtracking is used to conduct a quality-aware, greedy, randomized, depth-first search of all possible alignments based on the specified alignment parameters. Does not handle gapped alignments. Fast, memory efficient, and accurate for short reads of high quality (<50bp). Popular for short DNA-Seq reads and small RNA-Seq reads. (http://bowtie-bio.sourceforge.net/index.shtml)
Bowtie 2 [2] (Version 2.2.5) - Uses a Burrows-Wheeler transform to create a permanent, reusable index of the genome. Alignment involves mapping seed sequences in an ungapped fashion and then performing a gapped extension. Supports a local alignment mode that "soft clips" alignments which do not align end-to-end. Unlike Bowtie, handles gapped alignments, ambiguous bases (N's), and paired reads that do not align in a paired fashion. Fast, memory efficient, and accurate for longer reads (>50bp) with no upper limit on read length. Popular for DNA-Seq reads and small RNA-Seq reads. (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
BWA [3,4] (Version 0.7.15) - Uses a Burrows-Wheeler transform to create an index of the genome. Handles gapped alignments and ambiguous bases (N's). BWA-backtrack uses a backward search and may be optimal for short reads (<70bp). BWA-MEM is typically fastest and most accurate for longer reads, although BWA-SW may have better sensitivity when alignments are gapped. Popular for DNA-Seq variant calling pipelines, but not for RNA-Seq as splicing is not taken into account. (http://bio-bwa.sourceforge.net/)
GSNAP [5] (Version 2015-12-31(v8)) - A short read aligner (>14bp) using a successive constrained search, capable of handling splicing using either a probabilistic model or a database. Built to handle SNPs in alignment. Good sensitivity but slower speed and higher memory usage. Popular for RNA-Seq analysis. (http://research-pub.gene.com/gmap/)
HISAT2 [6] (Version 2.1.0) - A fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of genomes. HISAT2 is a successor to TopHat2. (https://github.com/DaehwanKimLab/hisat2)
Isaac 2 [7] (Version 15.07.16) - Gapped aligner that finds candidate mapping positions by matching 32-mers from the data to 32-mers from the reference, extending the candidate mappings to the whole read, and selecting the best mapping. Useful for mapping DNA-Seq with good speed and accuracy but high memory usage. (https://github.com/Illumina/isaac2)
STAR [8] (Version 2.6.1d) - Splice-aware aligner that utilizes a novel sequential maximal mappable seed search capable of handling splice junctions. Seeds are subsequently stitched together by local alignment. Capable of handling long reads. Good speed and sensitivity for RNA-Seq analysis but with high memory usage. (https://github.com/alexdobin/STAR)
TMAP [9] (Version 5.0.0) - Integrates a set of aligners (including a modified BWA) to identify candidate mapping locations and performs alignment using the Smith-Waterman algorithm. TMAP is optimized to handle the variable length reads and error profiles generated by Ion Torrent data. (https://github.com/iontorrent/TMAP)
TopHat [10] (Version 1.4.1 with Bowtie 1.0.0) - Two-stage aligner that first utilizes Bowtie to map to a reference; unaligned reads are subsequently mapped to a database of possible splice junctions. Popular for RNA-Seq analysis with solid performance, speed, and memory usage. (https://ccb.jhu.edu/software/tophat/index.shtml)
TopHat 2 [11] (Version 2.1.0) - A newer version of TopHat that utilizes Bowtie 2 and refined algorithms from TopHat to improve both speed and accuracy. Popular for RNA-Seq analysis with solid performance, speed, and memory usage. (https://ccb.jhu.edu/software/tophat/index.shtml)
Selecting an aligner will open the task dialog (Figure 2). All aligners will have an index selection section where the genome build for the species of interest must be entered for Assembly and the Aligner Index must be specified. Aligner indexes provide a means to break apart the reference sequence for fast sequence matching, and can be created for the whole genome or for regions of interest in a Gene/Feature annotation file. Adding Reference Aligner Indexes or Adding Aligner Indexes based on an Annotation Model can be performed via Library File Management or built on the fly.
The Alignment options section is available for all aligners and includes the option to Generate unaligned reads. Selecting this option will create a new fastq file for each sample in the project that contains the reads that do not map during the alignment process.
In addition, some aligners have additional options specific to that tool. BWA allows for selection of the Alignment algorithm, including backtrack, MEM and SW (see BWA documentation). GSNAP has multiple options for Alignment mode (see GSNAP documentation). Both TopHat and TopHat2 have the option to select Fusion search (see Fusion Gene Detection).
The Advanced options section allows for the customization of option sets (see Option Set Management), providing the ability to specify parameters specific to each aligner. Default parameters are those specified by the developer of each aligner; parameter details can be found in the documentation for each aligner.
1. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.
2. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357-359.
3. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754-1760.
4. Li H, Durbin R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics. 2010;26(5):589-595.
5. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinforma Oxf Engl. 2010;26(7):873-881.
6. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12:357-360.
7. Raczy C, Petrovski R, Saunders CT, et al. Isaac: Ultra-fast whole genome secondary analysis on Illumina sequencing platforms. Bioinformatics. June 2013:btt314.
8. Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinforma Oxf Engl. 2013;29(1):15-21.
9. Torrent Suite User Documentation : Technical Note - TMAP Alignment (https://ts-pgm.epigenetic.ru/ion-docs/Technical-Note---TMAP-Alignment_9012907.html).
10. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinforma Oxf Engl. 2009;25(9):1105-1111.
11. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14:R36.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The term spot swapping describes an artifact in spatial data whereby mRNA bleeding from nearby spots causes substantial contamination of UMI counts [1]. Spot clean in Partek Flow is a task that aims to improve expression estimates by correcting for spot swapping.
The task can only be invoked from the Space Ranger task output data node since it takes the raw count matrix as input. To run the Spot clean task in Flow:
Click the Single cell counts outputted from Space Ranger (Figure 1)
Click Pre-analysis tools in the toolbox
Click Spot clean
Click Finish to run the task with default settings
Another single cell counts node will be generated. This data node contains a matrix of decontaminated gene expression counts (Figure 2). Downstream analysis tasks, such as normalization, PCA, and ANOVA, can be performed on the new single cell counts node.
Parameters in this task that you can adjust include:
Gene cutoff: Filter out genes with average expressions among tissue spots below or equal to this cutoff. Default: 0.1.
Max iteration: Maximum iteration for EM parameter updates. Default: 10. Set a smaller number to save computation time.
Ni, Z., Prasad, A., Chen, S. et al. SpotClean adjusts for spot swapping in spatial transcriptomics data. Nat Commun 13, 2971 (2022). https://doi.org/10.1038/s41467-022-30587-y
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
This task does not need an annotation model file, since the annotation is retrieved from the BAM file itself. The sequence names in the BAM files constitute the features against which the reads are quantified.
This task is generally performed on reads aligned to a transcriptome, e.g. when a species does not have a genome reference and the BAM files contain transcriptome information. In this case, the features for the quantification task are the reference sequence names in the input BAM files.
There are two parameters in Quantify to reference (Figure 1):
Min coverage: will filter out any features (sequence names) that have fewer reads across all samples than the value specified
Strict paired-end compatibility: this only affects paired end data. When it is checked, only reads that have two ends aligned to the same feature will be counted. Otherwise, reads will still be counted as exonic compatible reads even if the mate is not compatible with the feature
During quantification:
We scan through each of the BAM files and find all the transcripts that meet the minimum coverage threshold.
With those transcripts, we "create" an annotation file in which the transcript name is the sequence name and both the Gene ID and the Transcript ID are set to the transcript name. The start position is 1 and the end position is the length of the transcript.
Effectively, what the annotation file does is filter out the low coverage transcripts.
Since we don't know where the transcripts are in the genome, chromosome view will display only one transcript at a time (i.e., the transcript names are treated like "chromosomes").
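A minimal sketch of the pseudo-annotation construction described in the steps above, assuming transcript lengths and total read counts have already been tallied:

```python
# Minimal sketch: each transcript passing the coverage filter becomes its own
# "chromosome" spanning positions 1..length, with Gene ID and Transcript ID
# both set to the transcript name.
def build_annotation(transcript_lengths, read_counts, min_coverage):
    rows = []
    for name, length in transcript_lengths.items():
        if read_counts.get(name, 0) >= min_coverage:  # low coverage filtered out
            rows.append({"seq_name": name,
                         "gene_id": name,
                         "transcript_id": name,
                         "start": 1,
                         "end": length})
    return rows

print(build_annotation({"tx1": 1500, "tx2": 900}, {"tx1": 42, "tx2": 3}, 10))
```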
In RNA-Seq data analysis, the most common step after alignment is to estimate gene and/or transcript expression abundance, with the expression level represented by read counts. There are three options in this step:
Pool cells combines RNA-Seq data from all cells of a particular cell type classification for each sample. In essence, Pool cells creates virtual bulk RNA-Seq data from single cell RNA-Seq data. Because it is virtual bulk RNA-Seq data, all the same tasks that can be performed on bulk RNA-Seq gene counts data in Partek Flow can be performed on the output of Pool cells.
Pool cells makes it easy to compare gene expression for a cell type of interest between experimental groups.
Before running Pool cells, you must classify the cells. To run Pool cells, select the data node with your classified cells and select Pool cells from the QA/QC section of the task menu (Figure 1).
Options for Pool cells are Sum, Maximum, Mean, and Median. Expression values for cells from the same sample with the same cell type classification will be merged using the chosen pooling method (Figure 2). Sum is selected by default. After choosing a pooling method, select Finish to run the Pool cells task.
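The pooling operation itself is a simple group-and-aggregate; the sketch below illustrates it with pandas (sample and cell type labels invented for the example).

```python
# Minimal sketch: merge expression values for cells sharing a sample and
# cell type classification, using the chosen pooling method.
import pandas as pd

expr = pd.DataFrame({"GeneA": [5, 3, 2, 8], "GeneB": [0, 1, 4, 2]})
expr["sample"] = ["S1", "S1", "S2", "S2"]
expr["cell_type"] = ["T cell", "T cell", "T cell", "B cell"]

pooled = expr.groupby(["cell_type", "sample"]).agg("sum")  # or max/mean/median
print(pooled)   # one virtual bulk row per cell type per sample
```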
Pool cells generates a counts data node for each classified cell type in the data set (Figure 3).
Each counts data node is equivalent to simulated bulk RNA-Seq counts data for a cell type. The same tasks that can be performed on bulk RNA-Seq counts data can be performed on Pool cells output data nodes, including normalization, filtering, PCA, and differential expression analysis.
The counts data of a cell type for each sample can be downloaded by clicking the counts data node and selecting Download data from the task menu. The counts data text file lists each sample and its pooled counts values (sum, maximum, mean, or median) for each feature (gene/transcript) in alphabetical order (Figure 4).
Cufflinks assembles transcripts and estimates transcript abundances on aligned reads. Implementation details are explained in Trapnell et al. [1]
The Cufflinks task has three options that can be configured (Figure 1):
Novel transcript: this option does not require an annotation reference; it performs de novo assembly to reconstruct transcripts and estimate their abundance
Annotation transcript: this option requires an annotation model to quantify the aligned reads to known transcripts based on the annotation file.
Novel transcript with annotation as guide: this option requires an annotation file to quantify the aligned reads to known transcripts as well as assemble aligned reads to novel transcripts. The results include all transcripts in the annotation file plus any novel transcripts that are assembled.
When the Use bias correction check box is selected, it will use the genome sequence information to look for overrepresented sequences and improve the accuracy of transcript abundance estimates.
Trapnell C, Williams B, Pertea G, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotech. 2010; 28:511-515.
In ChIP-seq or ATAC-seq analysis, a major challenge after detecting enriched regions or peaks is to compare samples and identify differentially enriched regions. In order to compare samples, a common set of regions must be identified and the number of reads mapping to each region quantified. The Quantify regions task addresses this challenge by generating a union set of unique regions and reporting the number of reads from each sample mapping to each region.
To run Quantify regions:
Click a Peaks data node
Click the Quantification section in the toolbox
Click Quantify regions
The Quantify regions task takes MACS2 output, a Peaks or Annotated Peaks data node, as its input. In a typical ATAC-Seq or ChIP-Seq analysis, MACS2 is configured to output a set of enriched regions or peaks for each experimental sample or group individually. Quantify regions takes these sets of regions and merges them into a union set of unique regions that it saves as a .bed file. To combine the region sets, overlapping regions between samples/groups are merged. Where overlap ends, a break point is created and a new region defined. All non-overlapping or unique regions from each sample/group are also included.
For example, consider an experiment where MACS2 detected enriched regions for two samples, Sample A and Sample B. In Sample A, a region is detected on chromosome 1 from 100bp to 300bp, chr1:100-300. In Sample B, a region is detected at chr1:160-360. The Quantify regions task will give the following union set of unique regions for these partially overlapping regions:
chr1:100-160 (region detected in Sample A only)
chr1:160-300 (region detected in both Sample A and Sample B)
chr1:300-360 (region detected in Sample B only)
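A minimal sketch of this breakpoint logic, assuming regions are given as (chromosome, start, end) tuples:

```python
# Minimal sketch: break overlapping regions at every start/end point and keep
# the intervals covered by at least one sample's region.
def union_regions(regions):
    out = []
    for chrom in {c for c, _, _ in regions}:
        spans = [(s, e) for c, s, e in regions if c == chrom]
        points = sorted({p for s, e in spans for p in (s, e)})
        for s, e in zip(points, points[1:]):
            if any(rs <= s and e <= re for rs, re in spans):  # covered?
                out.append((chrom, s, e))
    return sorted(out)

print(union_regions([("chr1", 100, 300), ("chr1", 160, 360)]))
# [('chr1', 100, 160), ('chr1', 160, 300), ('chr1', 300, 360)]
```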
After generating a .bed file with the union set of unique regions, Quantify regions performs quantification using the same algorithm as Quantify to annotation model, with the .bed file serving as the annotation model.
The Quantify regions dialog includes configuration options for generating the union set of unique regions and quantifying reads to the regions (Figure 1).
When regions from multiple samples are combined, a small offset in position between enriched regions in different samples can result in many very short unique regions in the union set. The Minimum region size option lets you filter out these very short regions. If a region is smaller than the specified cutoff, the region is excluded. By default, this is set to 50bp, but may need to be adjusted depending on the size of regions you expect to see in your assay.
To download the .bed file with the union set of unique regions, click the Quantify regions task node, click Task details, click the regions.bed file in the Output files section, and click Download.
Xing Y, Yu T, Wu YN, Roy M, Kim J, Lee C. An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs. Nucleic Acids Res. 2006; 34(10):3150-60.
Salmon [1] is a method for quantifying transcript abundance from raw FASTQ files; it generates a transcript-level count matrix as output.
Salmon will run on any FASTQ file input.
Click the data node containing your FASTQ files
Click the Quantification section in the toolbox
Click Salmon
Choose the Assembly and an Aligner index based on a gene annotation model (Figure 1)
The task generates two data nodes: the Transcript counts node contains NumReads, the estimated number of reads mapping to each transcript; the best estimate is often not an integer. The Gene counts node contains, for each gene, the sum of the reads from its corresponding transcripts.
Note: if you want to perform differential analysis, you need to add an offset to deal with 0 values.
When the reads are aligned to a genome reference, e.g. hg38, the quantification is performed on the transcriptome, so you need to provide an annotation model file for the transcriptome.
If the alignment was generated in Partek Flow, the genome assembly will be displayed as text on the top of the page (Figure 1), you do not have the option to change the reference.
If the BAM file is imported, you need to select the assembly to which the reads were aligned and the annotation model file you will use to quantify, both from the drop-down menus (Figure 2).
In the Quantification options section, when the Strict paired-end compatibility check button is selected, paired-end reads will be considered compatible with a transcript only if both ends are compatible with it. If it is not selected, reads where only one end has an alignment compatible with the transcript will also be counted for that transcript.
Minimum read overlap with feature can be specified in percentage of read length or number of bases. By default, a read has to be 100% within a feature. You can allow some overhanging bases outside the exonic region by modifying these parameters.
The Filter features option is a minimum-reads filter; by default, only features whose sum of reads across all samples is greater than or equal to 10 will be reported. To report all the features in the annotation file, set the value to 0.
Some library preparations reverse transcribe the mRNA into double stranded cDNA, thus losing strand information. In this case, the total transcript count will include all the reads that map to a transcript location. Others will preserve the strand information of the original transcript by only synthesizing the first strand cDNA. Thus, only the reads that have sense compatibility with the transcripts will be included in the calculation. We recommend verifying with the data source how the NGS library was prepared to ensure correct option selection.
In the Advanced options Configure dialog, the Strand specificity field controls strand matching: forward means the strand of the read must be the same as the strand of the transcript, while reverse means the read must be on the strand complementary to the transcript (Figure 3). The options in the drop-down list differ for paired-end and single-end data. For paired-end reads, the dash separates first- and second-in-pair, determined by the flag information of the read in the BAM file. Briefly, the paired-end Strand specificity options are:
No: Reads will be included in the calculation as long as they map to exonic regions, regardless of the direction
Auto-detect: The first 200,000 reads will be used to examine the strand compatibility with the transcripts. The following percentages are calculated on paired-end reads:
(1) If (first-in-pair same strand + second-in-pair same strand)/Alignments examined > 75%, Forward-Forward will be specified
(2) If (first-in-pair same strand + second-in-pair opposite strand)/Alignments examined > 75%, Forward-Reverse will be specified
(3) If (first-in-pair opposite strand + second-in-pair same strand)/Alignments examined > 75%, Reverse-Forward will be specified
(4) If none of the percentages exceeds 75%, the No option will be used
Forward - Reverse: this option is equivalent to the --fr-secondstrand option in Cufflinks [1]. First-in-pair is the same strand as the transcript, second-in-pair is the opposite strand to the transcript
Reverse - Forward: this option is equivalent to --fr-firststrand option in Cufflinks. First-in-pair is the opposite strand to the transcript, second-in-pair is the same strand as the transcript. The Illumina TruSeq Stranded library prep kit is an example of this configuration
Forward - Forward: Both ends of the read are matching the strand of the transcript. Generally colorspace data generated from SOLiD technology would follow this format
The single-end Strand specificity options are:
No: same as for paired-end reads
Auto-detect: same as for paired-end reads. All single-end reads are treated as first-in-pair reads
Forward: this option is equivalent to the --fr-secondstrand option in Cufflinks. The single-end reads are the same strand as the transcript
Reverse: this option is equivalent to --fr-firststrand option in Cufflinks. The single-end reads are the opposite strand to the transcript. The Illumina TruSeq Stranded library prep kit is an example of this configuration
If the Report unexplained regions check button is selected, an additional report will be generated on the reads that are considered not compatible with any transcripts in the annotation provided. Based on the Min reads for unexplained region cutoff, the adjacent regions that meet the criteria are combined and region start and stop information will be reported.
Depending on the annotation file, the output could be one or two data nodes. If the annotation file only contains one level of information, e.g. miRNA annotation file, you will only get one output data node. On the other hand, if the annotation file contains gene level and transcript level information, such as those from the Ensembl database, both gene and transcript level data nodes will be generated. If two nodes are generated, the Task report will also contain two tabs, reporting quantification results from each node. Each report has two tables. The first one is a summary table displaying the coverage information for each sample quantified against the specified transcriptome annotation (Figure 4).
The second table contains feature distribution information on each sample and across all the samples, number of features in the annotation model is displayed on the table title (Figure 5).
The bar chart displaying the distribution of raw read counts is helpful in assessing the expression level distribution within each sample. The X-axis is the read count range, the Y-axis is the number of features within the range, and each bar is a sample. Hovering your mouse over a bar displays the following information (Figure 6):
Sample name
Range of read counts, where "[" means inclusive and ")" means exclusive, e.g. [0,0] means 0 read counts; (0,10] means greater than 0 counts but less than or equal to 10 counts
Number of features within the read count range
Percentage of the features within the read count range
The coverage breakdown bar chart is a graphical representation of the reads summary table for each sample (Figure 7).
In the box-whisker plot, each box on the X-axis is a sample; the box represents the 25th and 75th percentiles, the whiskers represent the 10th and 90th percentiles, and the Y-axis represents the feature counts. When you hover over a box, detailed sample information is displayed (Figure 8).
In the sample histogram, each line represents a sample and the range of read counts is divided into 20 bins. Clicking a sample in the legend hides the line for that sample. Hovering over a circle displays detailed information about the sample and that specific bin (Figure 9). The information includes:
Sample name
Range of read counts, where "[" means inclusive and ")" means exclusive
Number of features within the read count range in the sample
The box whisker and sample histogram plots are helpful for understanding the expression level distribution across samples. This may indicate that normalization between samples might be needed prior to downstream analysis. Note that all four visualizations are disabled for results with more than 30 samples.
The output data node contains the raw reads of each sample for each feature (gene, transcript, miRNA, etc., depending on the annotation used). Click an output data node, e.g. the Transcript counts data node, and choose Download data from the context-sensitive menu on the right; the raw transcript counts can be downloaded in three different formats (Figure 10):
Partek Genomics Suite project format: this is a zip file; do not manually unzip it. Choose File > Import > Zipped project in Partek Genomics Suite to import the zip file into PGS.
Features on columns and Features on rows formats: these are .txt files that can be opened in any text editor or Microsoft Excel. For the Features on columns format, samples will be on rows; for the Features on rows format, samples will be on columns.
Xing Y, Yu T, Wu YN, Roy M, Kim J, Lee C. An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs. Nucleic Acids Res. 2006; 34(10):3150-60.
In complex projects, different data matrices (e.g. observations on rows and features on columns) need to be merged in order to achieve the analysis goals. For example, if two cell populations were identified on separate branches of the analysis pipeline, their expression matrices have to be combined before any joint downstream steps. Alternatively, if two assays (gene expression and protein expression) were performed on the same cells, the expression matrices have to be merged for joint analysis.
Merge matrices task is located in the Pre-analysis tools section of the toolbox and it can handle two scenarios: Merge cells/samples and Merge features (Figure 1). To start, select the first data node on the pipeline (e.g. single cell counts) and then select the Merge matrices task.
To use the Merge cells option, the data matrices (one or more) to be merged with the currently selected one should have the same features (e.g. genes), but distinct cells. Push the Select data nodes button and Partek Flow will display a preview of the pipeline; the data nodes that can be merged are shown in the color of their branch, while other data nodes are disabled (greyed out). Left-click on the data node that you want to merge with the current one and push the Select button; you can select multiple data nodes to merge. The selected node(s) will be shown under the Select data nodes button (Figure 2). If you made a mistake, use the Clear selection icon. Push Finish to proceed.
To use the Merge features option, the data matrices (one or more) to be merged with the currently selected one should have the same cells (or samples), but distinct features (e.g. gene and protein expression). Push the Select data nodes button and Partek Flow will display a preview of the pipeline; the data nodes that can be merged are shown in the color of their branch, while others are disabled (greyed out). Left-click on the data node that you want to merge with the current one and push the Select button. The selected node will be shown under the Select data nodes button. Repeat the procedure if you would like to merge additional nodes. If you made a mistake, use the Clear selection icon. Push Finish to proceed.
The output of the Merge matrices task is a Merged counts data node (Figure 3).
Count feature barcodes is a tool for quantifying the number of feature barcodes per cell from CITE-Seq, cell hashing, or other feature barcoding assays to measure protein expression. The input for Count feature barcodes is FASTQ files.
Count feature barcodes will run on any unaligned reads data node.
Click the data node containing your unaligned feature barcode reads
Click the Quantification section in the toolbox
Click Count feature barcodes
The task set up page allows you to configure the settings for your assay (Figure 1).
Choose the Prep kit from the drop-down menu
Check Map feature barcodes box (optional)
This is only necessary for processing data from 10X Genomics' Feature Barcoding assay (v3+ chemistry), which utilizes BioLegend TotalSeq-B. For all other assays, leave this box unchecked.
Choose the Barcode location
For BioLegend TotalSeq-A, choose bases 1-15. For BioLegend TotalSeq-B/C, choose bases 11-25. For other locations, select Custom and specify the start and stop positions.
Choose a Sequences text file
This tab-delimited text file should have the feature ID in the first column and the nucleotide sequence in the second column. Do not include column headers. See Figure 2 for an example; a sample of the file contents is also sketched after these steps.
Check Keep bam files box (optional)
This option will retain the alignment BAM files instead of automatically deleting them when the task is complete. An extra Aligned reads output data node will be produced on the task graph. This option is unchecked by default to save on disk space.
Click Finish to run
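For reference, a Sequences text file is plain tab-delimited text with no header row (columns are separated by a single tab); the feature IDs and barcode sequences below are invented purely for illustration:

```
CD3_tag	AACAAGACCCTTGAG
CD4_tag	TACCCGTAATAGCGT
CD8_tag	ATTGGCACTCAGATG
```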
The output of Count feature barcodes is a Single cell counts data node.
Count feature barcodes uses a series of tasks available independently in Partek Flow to process the input FASTQ files. The output files generated by these tasks are not retained in the Count feature barcodes output, with the exception of BAM files if Keep bam files is checked.
Quantify barcodes counts the number of UMIs per cell for each feature in the Sequences file. Quantify barcodes uses default settings.
To perform these steps individually instead of using the Count feature barcodes task, you will need to generate a FASTA and GTF file containing the feature barcode IDs and sequences instead of a text file and build an index file for the Bowtie aligner.
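As a rough illustration of that preparation step, the following Python sketch converts a sequences text file into a FASTA file that could be used to build a Bowtie index; the file names are hypothetical, and a matching GTF file would still need to be created separately:

```python
# Hypothetical helper: convert a feature barcode sequences file
# (feature ID <tab> sequence, no header) into FASTA records.
def sequences_txt_to_fasta(txt_path: str, fasta_path: str) -> None:
    with open(txt_path) as src, open(fasta_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            feature_id, sequence = line.rstrip("\n").split("\t")
            dst.write(f">{feature_id}\n{sequence}\n")

sequences_txt_to_fasta("feature_barcodes.txt", "feature_barcodes.fasta")
```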
Partek Flow has the flexibility to subsample your data for further downstream analyses. Filter data by:
A common task in bulk and single-cell RNA-Seq analysis is to filter the data to include only informative genes. Because there is no gold standard for what makes a gene informative or not, and ideal gene filtering criteria depend on your experimental design and research question, Partek Flow has a wide variety of flexible filtering options.
Filter features task can be invoked from any counts or single cell data node. Noise Reduction and Statistics Based filters take each feature and perform the specified calculation across all of the cells. The filter is applied to the values in the selected data node and the output is a filtered version of the input data node.
In the task dialog, select the filter option to activate the filter type and configure the filter, then click Finish to run.
The Noise reduction filter lets you exclude features that meet basic criteria (Figure 1).
Descriptive statistics you can choose are:
Coefficient of variation: std. dev divided by mean of the feature
Geometric mean: the nth root of the product of n numbers, where n is the number of values for the feature
Maximum: the highest value of a feature
Mean: the average value of a feature
Median: the middle value of a feature
Minimum: lowest value of a feature
Range: the difference between the highest and lowest values of a feature
Std dev.: the square root of the variance
Sum: total value of the feature
Variance: the average of the squared differences from the mean
Dispersion: variance divided by mean of the feature
For each of these you can choose to exclude features that are:
<: less than
<=: less than or equal to
==: equal to
>: greater than
>=: greater than or equal to
The threshold is set using the text box. The input must be a number; it can be an integer or decimal, positive or negative.
If you select value, you can also choose a percentage of samples or cells that must meet the criteria for the feature to be excluded (Figure 2).
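To make the exclusion logic concrete, here is a minimal pandas sketch of a noise-reduction style filter on a features-by-samples matrix; the matrix, thresholds, and names are invented, and this is only an illustration, not Partek Flow's actual implementation:

```python
import pandas as pd

# Toy features x samples count matrix (values invented for illustration)
counts = pd.DataFrame(
    {"s1": [0, 5, 100], "s2": [1, 0, 80], "s3": [0, 2, 90]},
    index=["geneA", "geneB", "geneC"],
)

# "Exclude features where maximum <= 1": keep everything else
filtered = counts[counts.max(axis=1) > 1]

# Value-based variant: drop a feature only if the value is <= 1
# in 100% of samples (i.e. every sample fails the cutoff)
filtered_value = counts[~(counts <= 1).all(axis=1)]
```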
The Statistics based filter lets you include a number or percentile of genes based on descriptive statistics (Figure 3).
Select Counts to specify a number of top features to include or select Percentiles to specify the top percentile of features to include.
Descriptive statistics you can choose are:
Coefficient of variation
Geometric mean
Maximum
Mean
Median
Minimum
Range
Standard deviation (std dev)
Sum
Variance
Dispersion
If the data is linked to a feature (gene) annotation, different fields in the annotation can be used to filter, e.g. genomic location information, gene biotype information, etc. (Figure 4)
You can combine different annotation fields using logical operations.
You can also filter features based on a feature list (Figure 5).
You can choose to include or exclude features in any list that you have added.
Use the Feature identifier option to choose which identifier from your annotation matches the values in the feature list.
The filter features task report lists the filter criteria, reports distribution statistics for the remaining features, and indicates the number and percentage of features that passed the filter (Figure 6).
If the input was a count matrix data node, sample box plot and sample histograms are provided to show the distribution of features after filtering. These plots are not available if the input was a single cell counts data node.
This task is only available on single cell matrix data nodes. It publishes one or more cell-level attributes to the project level, so the attributes can be edited and seen from all single cell count data nodes within the project. This function can be used on the annotate cell task output data node, a graph-based cluster data node, etc.
Click on a single cell counts data node
Choose Publish cell attributes to project in the Annotation/Metadata section of the toolbox
This invokes the dialog as (Figure 1)
From the drop-down list, select one or more attributes to publish. Only numeric attributes and categorical attributes with fewer than 1000 levels will be available in the list.
Click Finish at the bottom of the page. All of the selected attributes will be available to edit via the Data tab > Cell attributes Manage, and all data nodes in the project will be able to use those attributes.
The Downsample cells task randomly downsamples the number of cells in a single cell data set. It can be used to reduce large single cell datasets to small, manageable sizes for quick analysis. Another use case is a project with multiple samples, each having a different number of cells: Downsample cells can randomly select an equal number of cells for all the samples in the project. By default, the sample with the minimum number of cells determines the number of cells to be selected from the other samples. This default can be changed to a preferred number by the user. If the number selected by the user is greater than the number of cells in one or more samples, those samples will not be downsampled and all of their cells will be returned. If the number selected by the user is greater than the number of cells in all the samples, then none of the samples will be downsampled.
To run a downsample task, first click on a single cell counts data node. Go to the Filtering section and select the Downsample cells task (Figure 1).
Clicking on the Downsample cells task opens a dialog with the number of cells to downsample set to the minimum number of cells per sample in the project. In the figure below, the minimum number of cells in any of the samples was 2658, and this is used as the default. Click Finish to run the task (Figure 2).
Filter samples or cells in order to perform downstream analysis on a subset of data.
To filter groups, click a count matrix or single cell counts data node, click the Filtering section of the toolbox, and choose to Filter samples (bulk data) or Filter cells (single cell data).
The dialog lets you build a series of filters based on sample or cell attributes.
Click Finish to apply the filter. If no sample or cell will pass the filter criteria, a warning message will appear and the task will not run.
The first drop-down menu allows you to choose to include or exclude based on the specified criteria.
The second drop-down menu allows you to choose any categorical or numeric attribute to use for the filter criteria.
If the attribute is categorical, the third drop-down menu includes in and not in as options. A fourth drop-down menu allows you to search and choose from the levels of the selected attribute (Figure 1).
If the attribute is numeric, the third drop-down includes:
<: less than
<=: less than or equal to
==: equal to
>: greater than
>=: greater than or equal to
The threshold is set using the text box (Figure 2). The input must be a number; it can be an integer or decimal, positive or negative.
Using the OR and AND options, you can combine multiple filters.
When combining multiple filters all set to Include:
With AND, all statements must be true for the sample to meet the filter criteria.
With OR, if any statement is true, the sample will meet the filter criteria.
When combining multiple filters all set to Exclude:
With AND, if any statement is true, the sample will meet the filter criteria.
With OR, all statements must be true for the sample to meet the filter criteria.
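As a toy illustration of the Include logic above (the attributes and cutoffs are hypothetical):

```python
# One sample with two hypothetical numeric attributes
sample = {"age": 40, "dose": 10}

passes_f1 = sample["age"] > 30   # filter 1: Include samples with age > 30
passes_f2 = sample["dose"] > 20  # filter 2: Include samples with dose > 20

meets_with_and = passes_f1 and passes_f2  # AND: all statements must be true -> False
meets_with_or = passes_f1 or passes_f2    # OR: any statement may be true -> True
```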
The filter groups task report lists the filter criteria and reports feature distribution statistics for the remaining samples (Figure 3).
If the input was a count matrix data node, the percentage of samples remaining after the filter is listed and charts are provided to show the breakdown of samples by categorical attributes before and after filtering (Figure 4).
If the input was a single cell counts data node, a second table displays the details from each sample based on the filtered criteria (Figure 5).
If the input was a classified groups single cell counts data node, the cell count table includes a breakdown by classification and a bar chart is provided to show the number of cells from each classification remaining after filtering (Figure 6).
Click the add button to add a segment.
To enter the sequences manually, choose Manual for Sequences, then type or paste the adaptor sequences into the text field and click the add button (Figure 7). You must click the add button for the adaptor sequence to be included. You can remove any adaptor you have added by clicking the remove button.
You can add new prep kits from this page by clicking the add button.
You can preview a prep kit, delete a prep kit, or download a prep kit to your computer by clicking the corresponding icon.
Each task will appear as a separate section on the Data summary report (Figure 2). The first section of the report (Sample data) summarizes the input sample information. Click the grey arrows to expand and collapse each section. When expanded, the task name, the user that performed the task, the start date and time, the duration, and the output file size are displayed (Figure 2). To view or hide a table of task settings, click Show/hide details (Figure 3).
If some samples have been treated with the Mix 1 formulation and others have been treated with the Mix 2 formulation, choose the ExFold comparison radio button (Figure 2). Set up the pairwise comparisons by choosing the Mix 1 and Mix 2 samples that you wish to compare from the drop-down lists, followed by the green plus icon. The selected pair of samples will be added to the table below.
The browser icon in the right-most column of the Region average coverage summary table opens the Coverage graph for the respective region (Figure 5). The horizontal axis is the normalized position within the genomic feature, represented as the 1st to 100th percentile of the length of the feature. The vertical axis is coverage. Each line on the plot is a single sample, and the samples are listed below the plot.
The first icon invokes the Coverage graph across the genomic feature, showing the current sample only (Figure 23); you can also mouse over it to get a preview of the plot.
The second icon invokes the Chromosome view and browses to the genomic location.
To filter the high-quality cells, click the include selected cells icon in Filter in the top right of Select & Filter, and click Apply observation filter... (Figure 9).
| Phred Quality Score | Base Call Accuracy |
| --- | --- |
| 10 | 90% |
| 20 | 99% |
| 30 | 99.9% |
| 40 | 99.99% |
The Pre-alignment QA/QC output table contains one input file per row, with typical metrics on columns (%GC: fraction of GC content; %N: fraction of no-calls) (Figure 3). The file names are hyperlinks leading to the sample-level reports. To save the table as a txt file to a local computer, push the Download link. Table columns can be sorted using the double arrows icon.
The base composition plot shows the relative abundance of each base per position (Figure 6), with N standing for no-calls. By selecting individual bases on the legend, you can remove them from the plot or bring them back. To zoom in, left-click and drag over a region of interest. To zoom out, use the Reset button to recreate the original view, or the magnifier glass to zoom out one level.
To view different samples in the Data viewer, navigate to Axes under Configure, then click on the button under Content in the left panel (Figure 5).
Click on Show image to turn the background image on or off.
Coefficient of variation (CV): $CV = s / \bar{x}$, where $s$ is the standard deviation and $\bar{x}$ is the mean
Geometric mean: $\left( \prod_{i=1}^{n} x_i \right)^{1/n}$
Max: $\max(x_1, \ldots, x_n)$
Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Median: when $n$ is odd, the median is $x_{((n+1)/2)}$; when $n$ is even, the median is $\left( x_{(n/2)} + x_{(n/2+1)} \right) / 2$, where $x_{(i)}$ is the $i$-th smallest value
Median absolute deviation: $\mathrm{MAD} = \mathrm{median}\left( \left| x_i - \tilde{x} \right| \right)$, where $\tilde{x} = \mathrm{median}(x_1, \ldots, x_n)$
Min: $\min(x_1, \ldots, x_n)$
Standard deviation: $s = \sqrt{s^2}$, where $s^2$ is the variance
Sum: $\sum_{i=1}^{n} x_i$
Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2$
The output data node will display a similar Task report as the task.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The annotation file contains the features the aligned reads will be quantified to. For more information about adding an annotation model, please see the library file management documentation.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Quantification options are the same as in the Quantify to annotation model (Partek E/M) dialog. The Percent of read length is set to 50% by default to account for small offsets in position between enriched regions in different samples.
Quantify regions generates a counts data node with the number of counts in each region for each sample. This data node can be annotated with gene information using the Annotate regions task and analyzed using tasks that take counts data as input, such as normalization, PCA, and ANOVA. For ChIP-Seq experiments with input control samples, the task can be used prior to downstream analysis.
Similar to the Quantify to annotation model task report, the Quantify regions task report includes feature distribution information, including a descriptive stats table, a distribution bar chart, a sample box plot, and a sample histogram (Figure 2).
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Salmon index cannot be built on a reference assembly alone; you need to provide a transcript annotation file beforehand, and the index will be built on the transcript annotation model. For more information about adding an aligner index based on an annotation model, see the library file management documentation.
Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017; 14(4): 417-419. doi: 10.1038/nmeth.4197.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
If the Require junction reads to match introns check button is selected, only junction reads that overlap with exonic regions and match the skipped bases of an intron in the transcript will be included in the calculation. Otherwise, as long as the reads overlap within the exonic region, they will be counted. Detailed information about read compatibility can be found in the white paper.
In the annotation file, there might be multiple features at the same location, or one read might have multiple alignments, so the read count of a feature might not be an integer. Our white paper has more details on Partek's implementation of the E/M algorithm initially described by Xing et al. [1]
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
For a practical example using Merge matrices, please see our tutorials.
Depending on your goal, you may want to consider a different approach. For example, data matrices based on two different assays (e.g. gene and protein expression) can be combined using the Merge features option. Instead of merging two (or more) cell populations using Merge cells, you may want to use filtering (Filtering > Filter groups) to exclude the populations that you do not consider relevant, or to keep only the populations of interest.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
For more details on adding Prep kits, please see our documentation. The prep kit should include the cell barcode and unique molecular identifier (UMI) locations.
Trim tags identifies the UMI and cell barcode sequences. The Prep kit is specified using the Prep kit setting.
Trim bases trims the insert read to include only the feature barcode sequence. Trim bases is set to Both ends for Trim based on, with the start and stop set by the Barcode location preference and the Min read length set to 1.
Bowtie is used to align the reads to the sequences specified in the Sequences text file. Bowtie is set to Ignore quality limit for the Alignment mode. Other settings are left at their defaults.
Deduplicate UMIs consolidates duplicate reads based on UMIs. Deduplicate UMIs is set to Retain only one alignment per UMI.
Filter barcodes filters the cell barcodes to include cells and not empty droplets. Filter barcodes is set to Automatic.
For more information, please see the library file management documentation.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
If you have added feature lists in Partek Flow using the List management feature, the filter using Saved list option will be available. Otherwise, you can specify a Manual list by typing in the Filter criteria box.
If you choose Saved list, the drop-down list will display all the feature lists added in List management; if you choose Manual list, you can manually type the feature IDs/names in the box, one feature per row.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
After selection, click on the green plus button to add the attribute; its name can be changed by typing in the New name box (Figure 2).
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
To ensure that different data sets are comparable, several normalization and scaling options are available in Partek Flow. These include newly-developed algorithms specifically tailored for genomic analysis.
When a project contains multiple libraries, the data might contain variability due to technical differences (e.g. sequencing machine, library prep kit, etc.) in addition to biological differences (e.g. treatment, genotype, etc.). Batch removal is essential to remove this noise and discover biological variation.
Powerful Partek Flow statistical analysis tools help identify differential expression patterns in the dataset. These can take into account a wide variety of data types and experimental designs.
This task replaces missing values in the data with estimated values based on the selected method.
First, select whether the computation is based on samples/cells or on features, then click Finish to replace the missing values. Some methods will generate the same results regardless of which transform option is selected, e.g. constant value. Others will generate different results:
Constant value: specify a value to replace the missing data
Maximum: use the maximum value of samples/cells or features, depending on the transform option, to replace missing data
Mean: use the mean value of samples/cells or features, depending on the transform option, to replace missing data
Median: use the median value of samples/cells or features, depending on the transform option, to replace missing data
Minimum: use the minimum value of samples/cells or features, depending on the transform option, to replace missing data
K-nearest neighbor (mean): specify the number of neighbors (N); the Euclidean metric is used to compute neighbors, and the mean of the N neighbors replaces the missing data
K-nearest neighbor (median): specify the number of neighbors (N); the Euclidean metric is used to compute neighbors, and the median of the N neighbors replaces the missing data
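For intuition, here is a simplified NumPy sketch of K-nearest neighbor (mean) imputation across samples; the handling of missing entries in the distance computation is an assumption, and this is not Partek Flow's actual implementation:

```python
import numpy as np

def knn_impute_mean(X: np.ndarray, n_neighbors: int = 3) -> np.ndarray:
    """X: samples x features matrix, with np.nan marking missing values."""
    X = X.copy()
    for i in range(X.shape[0]):
        missing = np.isnan(X[i])
        if not missing.any():
            continue
        # Distance to every other sample over features observed in both
        dists = []
        for j in range(X.shape[0]):
            if j == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if shared.any():
                d = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
                dists.append((d, j))
        neighbors = [j for _, j in sorted(dists)[:n_neighbors]]
        # Replace each missing entry with the mean of the neighbors' values
        for f in np.where(missing)[0]:
            vals = [X[j, f] for j in neighbors if not np.isnan(X[j, f])]
            if vals:
                X[i, f] = np.mean(vals)
    return X

# Example: impute one missing value using the 2 nearest samples
X = np.array([[1.0, 2.0, 3.0],
              [1.1, np.nan, 3.2],
              [0.9, 2.1, 2.8]])
print(knn_impute_mean(X, n_neighbors=2))
```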
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Single cell RNA-seq gene expression counts are zero inflated due to inefficient mRNA capture. This normalization task is based on MAGIC (Markov Affinity-based Graph Imputation of Cells) [1] and recovers gene expression lost due to drop-out. The method is limited to input data nodes with at most 50K cells.
To invoke this task, click on a normalized data node with fewer than 50K cells. The task first computes PCA and uses the specified number of PCs to impute (Figure 1).
Click Finish to run the task; it will output the imputed low-expression matrix in the output report node.
References
van Dijk D, et al. MAGIC: A diffusion-based imputation method reveals gene-gene interactions in single-cell RNA-sequencing data. bioRxiv.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
In droplet-based single cell isolation and library prep methods, each droplet is labeled by a unique nucleotide barcode. Because not all droplets will contain cells, it is important to filter out nucleotide barcodes that correspond to empty droplets prior to downstream analysis.
In Partek Flow, you can filter barcodes interactively using a knee plot after UMI deduplication in the UMI deduplication task report or after quantification in the Cell barcode QA/QC task report. Alternatively, you can filter using preset options in the Filter barcodes task.
To invoke Filter barcodes:
Click a Deduplicated reads data node
Click the Filtering section of the toolbox
Click Filter barcodes
You can choose to filter barcodes using three or four options, depending on whether you have already run a Filter barcodes task for your samples in the project (Figure 1).
The automatic filter threshold is set for each sample individually. It picks the cutoff between cells and empty droplets by identifying where the UMI content per barcode drops precipitously when moving in descending order from the barcode with the highest number of UMIs.
Set the number of cells per sample to include. This is set for all samples; if set to 100, the top 100 barcodes by total UMI count for each sample will be retained.
Set the percent of reads in cells per sample to include. The number of barcodes included will be set to match the specified percent of reads in cells for each sample. Barcodes are included starting with the barcode with the highest number of total UMIs and proceeding in descending order of total UMIs per barcode until the specified percent of reads has been met or exceeded.
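The two preset cutoffs can be sketched in a few lines of NumPy; the UMI counts below are invented, and Partek Flow's exact tie-breaking may differ:

```python
import numpy as np

# Total UMIs per barcode for one sample (invented values)
umis = np.array([5000, 4200, 3900, 150, 40, 12, 3])
order = np.argsort(umis)[::-1]   # barcode indices in descending UMI order

# "Number of cells" cutoff: keep the top N barcodes (e.g. N = 3)
top_n_barcodes = order[:3]

# "Percent of reads in cells" cutoff: keep barcodes in descending order
# until their cumulative UMI share meets or exceeds the target (e.g. 90%)
cum_frac = np.cumsum(umis[order]) / umis.sum()
n_keep = int(np.searchsorted(cum_frac, 0.90)) + 1
percent_barcodes = order[:n_keep]
```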
If you have already run a Filter barcodes task for your samples in the project, the Previous filter option will be available. This option lets you filter to the same cell barcodes that were included by the previous filter. This option is particularly useful for CITE-Seq data, where antibody barcodes and gene expression data must be processed separately, but you will want to analyze the same cell barcodes in downstream steps.
Selecting Previous filter opens a table with information about the previous barcode filters in your project (Figure 2).
Use the radio buttons in the first column to pick which filter you want to use.
After configuring the task, click Finish to run.
Filter barcodes produces a Filtered reads data node (Figure 4). Filter barcodes does not have a task report.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Split by attribute task is used to split a data node into separate nodes based on the groups of a categorical attribute; each output data node includes only the samples/cells from one group. It is a more efficient way to filter your data if you plan to perform downstream analysis separately on each and every group of an attribute.
Click on the data node and select Split by attribute from the Filtering section in the task menu (Figure 1).
Select the attribute to split the data on. In this case, data will be split according to the Age attribute (Figure 2).
The result of the Split by attribute task will be two separate data nodes, each containing the samples from one age group (Figure 3).
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
This normalization is performed on observations (samples) using internal control features (genes). The internal control features, usually housekeeping genes, should not vary among samples [1]. The implementation details are as follows:
1. Compute the geometric mean of all the control features (g1 to gm) within each sample, yielding GS1 to GSn (m is the number of control features, n is the number of samples).
2. Compute the geometric mean across all samples (GS1 to GSn), represented by GS.
3. Compute the scaling factor for each sample: S1 = GS1/GS, S2 = GS2/GS, ..., Sn = GSn/GS.
4. Normalize all gene expression values by dividing each by its sample's scaling factor.
Note: The input data node must contain all positive values to compute geometric mean.
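The four steps can be written out in a short NumPy sketch; the matrix and the choice of control rows are invented for illustration:

```python
import numpy as np

# Features x samples matrix of positive values; rows 0 and 1 are the
# (hypothetical) housekeeping control features
counts = np.array([[10.0, 20.0, 15.0],
                   [ 8.0, 16.0, 12.0],
                   [50.0, 90.0, 70.0]])
control = counts[[0, 1], :]

G_s = np.exp(np.log(control).mean(axis=0))  # step 1: per-sample geometric mean
G = np.exp(np.log(G_s).mean())              # step 2: geometric mean across samples
k = G_s / G                                 # step 3: per-sample scaling factors
normalized = counts / k                     # step 4: divide by the scaling factor
```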
Select the Normalize to housekeeping genes task in the Normalization and scaling section of the pop-up menu when a count matrix data node is selected. The dialog will list all the features included in the data node on the left panel.
Select control genes on the left panel and move them to the right panel. You can also use the search box to find a feature and click the plus button to add it to the right panel.
Click Finish
Vandesompele J, De Preter K, Pattyn F, Poppe B, Van Roy N, De Paepe A, Speleman F. Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biology. 2002.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Latent semantic indexing (LSI) was first introduced for the analysis of scATAC-seq data by Cusanovich et al. 2018 [1]. LSI combines steps of term frequency-inverse document frequency (TF-IDF) normalization followed by singular value decomposition (SVD). Partek Flow wrapped Signac's TF-IDF normalization for single cell ATAC-seq datasets. It is a two-step normalization procedure that normalizes across cells to correct for differences in cellular sequencing depth, and across peaks to give higher values to more rare peaks [2].
TF-IDF normalization in Flow can be invoked in Normalization and scaling section by clicking any single cell counts data node (Figure 1).
To run TF-IDF normalization,
Click a single cell counts data node
Click the Normalization and scaling section in the toolbox
Click TF-IDF normalization
The output of TF-IDF normalization is a new data node that has been normalized by log(TF x IDF). We can then use this new normalized matrix for downstream analysis and visualization (Figure 2).
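For intuition, a rough NumPy sketch of a log(TF x IDF) transform on a peaks-by-cells matrix is shown below; the scale factor and exact formula are assumptions and may differ in detail from the Signac implementation wrapped by Partek Flow:

```python
import numpy as np

# Toy peaks x cells count matrix (values invented)
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [3.0, 0.0, 0.0]])

tf = X / X.sum(axis=0, keepdims=True)            # term frequency: per-cell depth correction
idf = X.shape[1] / X.sum(axis=1, keepdims=True)  # inverse document frequency: up-weights rare peaks
tfidf = np.log1p(tf * idf * 1e4)                 # log-scaled TF-IDF matrix
```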
Cusanovich DA, Reddington JP, Garfield DA, et al. The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature. 2018; 555: 538-542. https://doi.org/10.1038/nature25981
Stuart T, Srivastava A, Madad S, Lareau CA, Satija R. Single-cell chromatin state analysis with Signac. Nat Methods. 2021; 18: 1333-1341.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
This method is based on a general linear model. Much like ANOVA in reverse, it calculates the variation attributed to the factor(s) being removed, then adjusts the original values to remove that variation.
By including batch in the differential analysis model, the variability due to the batch effect is accounted for when calculating p-values. In this sense, batch effects are best handled as part of the differential analysis model. However, clustering data or visualizing biological effects can be very difficult if batch effects are present in the original data. We transform the original values to remove the batch effect using this tool.
We recommend normalizing your data prior to removing batch effects, but the task will run on any counts data node.
Click the counts data node
Click the Batch removal section of the toolbox
Click General linear model
The batch effect removal dialog is similar to the dialog for ANOVA. To set up the model, you need to choose which attributes should be considered. Here, you should include the batch attribute, any attributes that interact with batch, and the interactions between these attributes.
For example, in the case where you have different cell types and batch may have a different effect on different cell types, you would need to include batch, cell type, and the interaction between batch and cell type. Here, batch is the attribute Version and cell type is the attribute Classification.
Click Version and Classification
Click Add factors
Click Version and Classification
Click Add interaction (Figure 1)
To remove the batch effect and its interaction with cell type, we can click the Remove checkbox for both Version and Version*Classification.
Click the Remove checkbox for Version and Version*Classification
Click Finish to run (Figure 2)
The output of General linear model batch removal is a new data node, Batch effect adjusted counts. This data node contains the batch-effect-corrected values, which can be used as the input for downstream tasks such as clustering and UMAP (Figure 3).
The advanced options for Remove batch effect are shared by ANOVA/LIMMA-trend/LIMMA-voom.
Library size normalization is the simplest strategy for performing scaling normalization. But composition biases will be present when any unbalanced differential expression exists between samples. The removal of composition biases is a well-studied problem for bulk RNA sequencing data analysis. However, single-cell data can be problematic for these bulk normalization methods due to the dominance of low and zero counts[1]. To overcome this, Partek Flow wrapped the calculateSumFactors() function from R package scran. It pools counts from many cells to increase the size of the counts for accurate size factor estimation. Pool-based size factors are then “deconvolved” into cell-based factors for normalization of each cell’s expression profile[1].
Scran deconvolution in Flow can be invoked in Normalization and scaling section by clicking any single cell counts data node (Figure 1).
To run Scran deconvolution,
Click a single cell counts data node
Click the Normalization and scaling section in the toolbox
Click Scran deconvolution
The first Scran deconvolution dialog asks you to select the cluster name from a drop-down list that includes all the attributes of this dataset. The selected cluster is an optional factor specifying which cells belong to which cluster, for deconvolution within clusters (Figure 2). Simply click the Finish button if you want to run the task with default settings.
The output of Scran deconvolution is a new data node that has been normalized by the pool-based size factors of each cell and log2 transformed. We can then use this new normalized matrix for downstream analysis and visualization (Figure 3).
Other parameters in this task that you can adjust include:
Pool size: A numeric vector of pool sizes, i.e., number of cells per pool.
Max cluster size: An integer scalar specifying the maximum number of cells in each cluster.
Enforce positive estimates: A logical scalar indicating whether linear inverse models should be used to enforce positive estimates.
Scaling factor: A numeric scalar containing scaling factors to adjust the counts prior to computing size factors.
Lun ATL, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0947-7
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Raw read counts are generated after quantification for each feature on all samples. These read counts need to be normalized prior to differential expression detection to ensure that samples are comparable.
This chapter covers the implementation of each normalization method. The Normalize counts option is available on the context-sensitive menu (Figure 1) upon selection of any quantified output data node or an imported count matrix:
Gene counts
Transcript counts
MicroRNA counts
Cufflinks quantification
Quantification
The format of the output is the same as the input data format; the output node is called Normalized counts. This data node can be selected and normalized further using the same task.
Select whether you want your data normalized on a per sample or per feature basis (Figure 2). Some transformations, e.g. log transformation, are performed on each value independently of the others, so you will get an identical result regardless of your choice.
The following normalization methods will generate different results depending on whether the transformation was performed on samples or on features:
Divided by mean, median, Q1, Q3, std dev, sum
Subtract mean, median, Q1, Q3, std dev, sum
Quantile normalization
Note that each task can only perform normalization on samples or on features. If you wish to perform both transformations, run two normalization tasks successively. To normalize the data, click on a method in the left panel, then drag and drop the method into the right panel. Add all the normalization methods you wish to perform; multiple methods can be added to the right panel and they will be processed in the order they are listed. You can change the order of methods by dragging each method up or down. To remove a method from the Normalization order panel, click the minus button to the right of the method. Click Finish when you are done choosing normalization methods.
For some data nodes, recommended methods are available:
Data nodes resulting from Quantify to annotation model (Partek E/M) or Quantify to reference (Partek E/M) contain raw read counts; the recommendation is Total count followed by Add 0.0001
The Cufflinks quantification data node outputs FPKM normalized read counts; the recommendation is Add 0.0001
If available, the Recommended button will appear. Clicking the button will populate the right panel (Figure 3).
Below is the notation that will be used to explain each method: Xsf is the value of sample S on feature F, and TXsf is the transformed value of sample S on feature F.
Absolute value TXsf = | Xsf |
Add TXsf = Xsf + C; a constant value C needs to be specified
Antilog TXsf = b^Xsf; a log base value b needs to be specified from the drop-down list; any positive number can be specified when Custom value is chosen
Arcsinh TXsf = arcsinh(Xsf); the hyperbolic arcsine (arcsinh) transformation is often used on flow cytometry data
CLR (centered log ratio) TXsf = ln((Xsf + 1)/geom(Xsf + 1) + 1), where geom is the geometric mean of either the observation or the feature. This method can be applied to protein expression data.
CPM (counts per million) TXsf = (10^6 x Xsf)/TMRs, where Xsf here is the raw read of sample S on feature F, and TMRs is the total mapped reads of sample S. If quantification is performed on an aligned reads data node, total mapped reads is the aligned reads. If quantification is generated from an imported read count text file, the total mapped reads is the sum of all feature reads in the sample.
Divided by: when mean, median, Q1, Q3, std dev, or sum is selected, the corresponding statistic will be calculated based on the transform on samples or features option. Example: if transform on Samples is selected, Divide by mean is calculated as TXsf = Xsf/Ms, where Ms is the mean of the sample. If transform on Features is selected, Divide by mean is calculated as TXsf = Xsf/Mf, where Mf is the mean of the feature.
Log TXsf = log_b(Xsf) A log base value b needs to be specified from the drop-down list; any positive number can be specified when Custom value is chosen
Logit TXsf = log_b(Xsf/(1 - Xsf)) A log base value b needs to be specified from the drop-down list; any positive number can be specified when Custom value is chosen
Lower bound A constant value C needs to be specified, if Xsf is smaller than C, then TXsf= C; otherwise, TXsf = Xsf
**Median ratio (DESeq2 only), Median ratio (edgeR)** These approaches are slightly different implementations of the method proposed by Anders and Huber (2010). The idea is as follows: for each feature, its expression is divided by the feature's geometric mean expression across the samples. Then, for a given sample, one takes the median of these ratios across the features and obtains a sample-specific size factor. The normalized expression is equal to the raw expression divided by the size factor. Median ratio (DESeq2 only) is present in the R DESeq2 package under the name "ratio". This method should be selected if DESeq2 differential analysis will be used downstream; since it is not on a per-million scale, it is not recommended for any differential analysis method other than DESeq2. Median ratio (edgeR) is present in the R edgeR package under the name "RLE". It is very similar to the Median ratio (DESeq2 only) method, but it uses a per-million scale.
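A compact NumPy sketch of the median-of-ratios idea follows (not the exact DESeq2/edgeR code; the counts are invented and assumed to be all positive):

```python
import numpy as np

counts = np.array([[100.0, 200.0, 150.0],   # genes x samples, all positive
                   [ 10.0,  20.0,  15.0],
                   [ 40.0,  80.0,  60.0]])

# Per-gene geometric mean across samples, then per-sample median of ratios
geo_mean = np.exp(np.log(counts).mean(axis=1, keepdims=True))
size_factors = np.median(counts / geo_mean, axis=0)
normalized = counts / size_factors
```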
Multiply by TXsf = Xsf x C A constant value C needs to be specified
Poscounts (DESeq2 only) A DESeq2 size factor estimation option. Compared with Median ratio, the poscounts method can be used when all genes contain a sample with a zero. It calculates a modified geometric mean by taking the nth root of the product of the non-zero counts. It is not on a per-million scale. Details are available in the DESeq2 documentation.
Quantile normalization A rank-based normalization method. For instance, if the transformation is performed on samples, it first ranks all the features in each sample. Say vector Vs is the sorted feature values of sample S in ascending order; a vector Vm is calculated as the average of the sorted vectors across all samples, and then each value in Vs is replaced by the value in Vm at the same rank. Detailed information can be found in [1].
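The rank-and-average procedure can be sketched as follows; ties are broken arbitrarily here, unlike the averaging a full implementation would use:

```python
import numpy as np

X = np.array([[5.0, 4.0, 3.0],    # features x samples (invented values)
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])

ranks = X.argsort(axis=0).argsort(axis=0)      # rank of each value within its sample
mean_sorted = np.sort(X, axis=0).mean(axis=1)  # Vm: average of the sorted vectors
normalized = mean_sorted[ranks]                # replace each value by its rank's average
```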
Rank This transformation replaces each value with its rank in the list of sorted values. The smallest value is replaced by 1 and the largest value is replaced by the total number of non-missing values, N. If there are no tied values, the result is a perfectly uniform distribution. In the case of ties, all tied values receive the mean rank.
Rlog The regularized log transformation is the method implemented in the DESeq2 package under the name rlog. It applies a transformation to remove the dependence of the variance on the mean. It should not be applied to zero-inflated data such as single cell RNA-seq raw count data. The output of this task should not be used for differential expression analysis, but rather for data exploration, like clustering, etc.
Round Round the value to the nearest integer.
RPKM (Reads per kilobase of transcript per million mapped reads [2]) TXsf = (10^9 x Xsf)/(TMRs x Lf), where Xsf is the raw read of sample S on feature F, TMRs is the total mapped reads of sample S, and Lf is the length of feature F.
If quantification is performed on an aligned reads data node, total mapped reads is the aligned reads. If quantification is generated from imported read count text file, the total mapped reads is the sum of all feature reads in the sample. If the feature is a transcript, transcript length Lf is the sum of the lengths of all the exons. If the feature is a gene, gene length is the distance between the start position of the most downstream exon and the stop position of the most upstream exon. See Bullard et al. for additional comparisons with other normalization packages [3]
For paired reads, the normalization option will show up as FPKM (Fragments per kilobase per million mapped reads) rather than RPKM. However, the calculations are the same.
Subtract When mean, median, Q1, Q3, std dev, or sum is selected, the corresponding statistic will be calculated based on the transform on samples or features option. Example: if transform on Samples is selected, Subtract mean is calculated as TXsf = Xsf - Ms, where Ms is the mean of the sample. If transform on Features is selected, Subtract mean is calculated as TXsf = Xsf - Mf, where Mf is the mean of the feature.
TMM (Trimmed mean of M-values) The scaling factors are produced according to the algorithm described in Robinson et al [4]. The paper by Dillies et al. [5] contains evidence that TMM has an edge over other normalization methods. The reference sample is randomly selected. When performing the trimming, for M values (fold change), the upper 30% and lower 30% are removed; for A values (absolute expression), the upper 5% and lower 5% are removed.
TPM (Transcripts per million as described in Wagner et al [6]) The following steps are performed:
Normalize the reads by the feature length. Here length is measured in kilobases but the final TPM values do not depend on the length unit. RPKsf = Xsf / Lf;
Obtain a scaling factor for sample s as Ks = 10^-6 × Σ (f=1 to F) RPKsf
Divide raw reads by the length and the scaling factor to get TPM TXsf = Xsf / Lf / Ks
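Putting the three steps together for a single sample (lengths in kilobases; all values invented):

```python
import numpy as np

counts = np.array([500.0, 1000.0, 200.0])  # raw reads Xsf for one sample
lengths_kb = np.array([2.0, 5.0, 0.5])     # feature lengths Lf in kilobases

rpk = counts / lengths_kb                  # step 1: reads per kilobase
k_s = rpk.sum() / 1e6                      # step 2: scaling factor Ks
tpm = rpk / k_s                            # step 3: TPM values (sum to 10^6)
```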
Upper quartile
The method is exactly the same as the LIMMA package [7]. The following is the simple summarization of the calculation:
Remove all the features that have 0 reads in all samples.
Calculate the effective library size per sample: effective library size = (raw library size (in millions))*((upper quartile for a particular sample)/ (geometric mean of upper quartiles in all the samples))
Get the normalized counts by dividing the raw counts per feature by the effective library size (for the respective sample)
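A NumPy sketch of this effective-library-size calculation (counts invented; all-zero features assumed already removed):

```python
import numpy as np

counts = np.array([[ 10.0,  20.0,  15.0],   # features x samples
                   [100.0, 250.0, 120.0],
                   [  5.0,   8.0,   4.0],
                   [ 60.0,  90.0,  75.0]])

lib_size_m = counts.sum(axis=0) / 1e6   # raw library size in millions
uq = np.percentile(counts, 75, axis=0)  # upper quartile per sample
eff_lib = lib_size_m * uq / np.exp(np.log(uq).mean())  # effective library size
normalized = counts / eff_lib           # normalized counts per sample
```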
The Normalization report includes the Normalization methods used, a Feature distribution table, Box-whisker plots of the Expression signal before and after normalization, and Sample histogram charts before and after normalization. Note that all visualizations are disabled for results with more than 30 samples.
A summary of the normalization methods performed. They are listed by the order they were performed.
A table that presents descriptive statistics on each sample, the last row is the grand statistics across all samples (Figure 4).
These box-whisker plots show the expression signal distribution for each sample before and after normalization. When you mouse over a bar in the plot, a balloon shows detailed percentile information (Figure 5).
A histogram is displayed for the data before and after normalization. Each line is a sample; the X-axis is the range of the data in the node and the Y-axis is the frequency of values within the range. When you mouse over a circle, which represents the center of an interval, detailed information appears in a balloon (Figure 6). It includes:
The sample name.
The range of the interval: "[" means inclusive, ")" means exclusive
The frequency value within the interval
Bolstad BM, Irizarry RA, Astrand M, Speed, TP. A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics. 2003; 19(2): 185-193.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008; 5(7): 621–628.
Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010; 11: 94.
Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010; 11: R25.
Dillies MA, Rau A, Aubert J et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform. 2013; 14(6): 671-83.
Wagner GP, Kin K, Lynch VJ. Measurement of mRNA abundance using RNA-seq data. Theory Biosci. 2012; 131(4): 281-5.
Ritchie ME, Phipson B, Wu D et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015; 43(15):e97.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The SCTransform task performs the variance stabilizing normalization proposed in [1]. The task's interface follows that of the SCTransform() function in R [2]. SCTransform v2 [3] provides the ability to perform downstream differential expression analyses, in addition to improvements in running speed and memory consumption. v2 is the default method in Flow.
We recommend performing the normalization on a single cell raw count data node. Select SCTransform task in Normalization and scaling section on the pop-up menu to invoke the dialog (Figure 1).
By default, it will generate a report on all the input features. By unchecking Report all features, users can limit the results to a certain number of features with the highest variance.
In Advanced options, users can click Configure to change the default settings (Figure 2).
Scale results: whether to scale residuals to have unit variance; default is FALSE.
Center results: when set to Yes, center all the transformed features to have zero mean expression. Default is TRUE.
VST v2: default is TRUE. When set to v2, it sets method = glmGamPoi_offset, n_cells = 2000, and exclude_poisson = TRUE, which causes the model to learn theta and intercept only, besides excluding Poisson genes from learning and regularization. If the default is unchecked, the original sctransform model (v1) is used and only the SC scaled data node will be generated.
There are two data nodes generated from this task (if VST v2 option is checked as default):
SC scaled data: a matrix of normalized values (residuals) that by default has the same size as the input data set. This data node is used to perform downstream exploratory analysis, e.g. PCA, Seurat3 integration, etc. (Figure 3); it is not recommended for differential analysis.
SC corrected data: equivalent to the 'corrected counts' in the data slot generated after the PrepSCTFindMarkers task in the SCT assay of a Seurat object. It is used for downstream differential expression (DE) analyses (Figure 3).
Note: When performing DE analysis with Hurdle, the 'shrinkage of error term variance' option might need to be turned off depending on the dataset. Similarly, the 'Lognormal with shrinkage/voom' option needs to be turned off when running DE with GSA.
References
Christoph Hafemeister, Rahul Satija. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. https://doi.org/10.1101/576827
SCTransform() documentation https://www.rdocumentation.org/packages/Seurat/versions/3.1.4/topics/SCTransform
Saket Choudhary, Rahul Satija. Comparison and evaluation of statistical error models for scRNA-seq. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02584-9
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
If your experimental design includes a sample or a group of samples serving as a baseline control, you can normalize the experimental samples by subtracting or dividing by the baseline sample(s) using the Normalize to baseline task in Partek Flow. For example, in PCR experiments, the delta Ct values of control samples are subtracted from the delta Ct values of experimental samples to obtain delta-delta Ct values for the experimental samples.
The Normalize to baseline option is available in the Normalization and Scaling section of the context-sensitive menu (Figure 1) upon selection of any count matrix data node.
There are three options to choose the baseline samples:
use all samples
use a group
use matched pairs
To normalize data to all the samples, choose to calculate the baseline using the mean or median of all samples for each feature, choose subtract baseline or ratio to baseline as the normalization method (Figure 2), and click Finish.
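As a small sketch of the two normalization methods with a mean-of-all-samples baseline (values invented):

```python
import numpy as np

X = np.array([[2.0, 4.0, 6.0],            # features x samples
              [1.0, 3.0, 5.0]])
baseline = X.mean(axis=1, keepdims=True)  # per-feature baseline (mean of all samples)

subtracted = X - baseline   # "subtract baseline"
ratio = X / baseline        # "ratio to baseline"
```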
Use a group to create baseline
When there is a subset of samples that serve as the baseline in the experiment, select use group for Choose baseline samples. The specific group should be specified using sample attributes (Figure 3).
Choose use group and select the attribute containing the baseline group information, e.g. Treatment in this example; the samples with the group Control for the Treatment attribute are used as the baseline. The control samples can be filtered out after normalization by selecting the Remove baseline samples after normalization check box.
When using matched pairs, one sample from each pair serves as the control. An attribute specifying the pairs must be selected in addition to an attribute designating which sample in each pair is the baseline sample (Figure 4).
After normalization, all values for the control sample will be either 0 or 1 depending on the normalization method chosen, so we recommend removing baseline samples when using matched pairs.
The output of Normalize to baseline is a Normalized counts data node.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The ANOVA method applies a specified lognormal model to all the features.
To set up the ANOVA model or the alternative Welch's ANOVA (which is used on normally distributed data that violates the assumption of homogeneity of variance), select factors from the sample attributes. The factors can be categorical or numeric attributes. Click a check button to select a factor and click the Add factors button to add it to the model (Figure 1).
LIMMA-trend and LIMMA-voom setup dialogs are identical to ANOVA's setup.
Note: the LIMMA-voom method can only be invoked on output data nodes from normalization methods that produce library sizes:
TMM, CPM, Upper quartile, Median ratio, Poscounts
When more than one factor is selected, click Add interaction button to add interaction term of the selected factors.
Once a factor is added to the model, you can specify whether the factor is a random effect (check Random check box) or not.
Most factors in an analysis of variance are fixed factors, i.e. the levels of that factor represent all the levels of interest. Examples of fixed factors include gender, treatment, genotype, etc. However, in more complex experiments, a factor can be a random effect, meaning the levels of the factor only represent a random sample of all of the levels of interest. Examples of random effects include subject and batch. Consider an example where one factor is treatment (with levels treated and control), and another factor is subject (the subjects selected for the experiment). In this example, Treatment is a fixed factor since the levels treated and control represent all conditions of interest. Subject, on the other hand, is a random effect since the subjects are only a random sample of all the levels of that factor. When a model has both fixed and random effects, it is called a mixed model.
When more than one factor is added to the model, click on the Cross tabulation link at the bottom to view the relationship between the factors in a different browser tab (Figure 2).
Once the model is set, click on Next button to setup comparisons (contrasts) (Figure 3).
Start by choosing a factor or interaction from the Factor drop-down list. The subgroups of the factor or interaction will be displayed in the left panel; click to select one or more levels or subgroup names and move them to one of the boxes on the right. The ratio/fold change calculation for the comparison will use the group in the top box as the numerator and the group in the bottom box as the denominator. When multiple levels (groups) are in the numerator and/or denominator boxes, the behavior depends on the mode: in Combine mode, clicking the Add comparison button combines all numerator levels and all denominator levels into a single comparison in the Comparison table below; in Pairwise mode, clicking the Add comparison button splits the numerator and denominator levels into a factorial set of comparisons. In other words, it adds a comparison for every numerator level paired with every denominator level to the Comparison table. Multiple comparisons from different factors can be added from the specified model.
Click Configure to customize the Advanced options (Figure 4).
The Low-expression feature and Multiple test correction sections are the same as the matching GSA advanced options; see the GSA advanced options section.
Report option
Use only reliable estimation results: there are situations when a model estimation procedure does not fail outright but still encounters difficulties. In this case, it can generate p-values and fold changes for the comparisons, but they are not reliable, i.e. they can be misleading. Therefore, the default of Use only reliable estimation results is set to Yes.
Report p-value for effects: if set to No, only the comparison p-values are displayed in the report; the p-values of factors and interaction terms are not shown in the report table. If set to Yes, type III p-values are displayed for all non-random terms in the model, in addition to the comparison p-values.
Shrinkage to error term variance: by default, None is selected, which corresponds to the lognormal model. The Limma-trend and Limma-voom options are lognormal with shrinkage (Limma-trend is the same as the GSA default option, lognormal with shrinkage). The shrinkage options are recommended for designs with small sample sizes; no random effects can be included when shrinkage is performed. If there are numeric factors in the model, partial correlations cannot be reported for those factors when shrinkage is performed. Limma-trend works well if the ratio of the largest library size to the smallest is no more than about 3-fold; it is simple and robust for any type of data. Limma-voom is recommended for sequencing data when library sizes vary substantially, but it can only be invoked on data nodes normalized with methods that produce library sizes (see the note above), whereas Limma-trend can be applied to data normalized by any method.
Report partial correlations: if the model has numeric factor(s), choosing Yes displays the partial correlation coefficient(s) of the numeric factor(s) in the result table; choosing No hides them.
Data has been log transformed with base: shows the current scale of the input data for this task.
Since a single model is used for all features, there are no pie charts for design models and response distribution information. The Gene list table format is the same as in the GSA report.
Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS B, 57, 289-300.
Storey, J.D. (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics, 31, 2013-2035.
Auer, P.L., Doerge, R.W. (2011). A two-stage Poisson model for testing RNA-Seq data. Statistical Applications in Genetics and Molecular Biology, 10(1).
Burnham, K.P., Anderson, D.R. (2010). Model Selection and Multimodel Inference. Springer.
Law, C.W., Chen, Y., Shi, W., Smyth, G.K. (2014). Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15, R29.
Anders, S., Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11, R106.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
GSA stands for gene specific analysis. Its goal is to identify the statistical model that is best for each specific gene among all the selected models, and then use that best model to calculate the p-value and fold change.
The first step of GSA is to choose which attributes to include in the test (Figure 1). All sample attributes, numeric and categorical, are displayed in the dialog; use the check buttons to select among them. An experiment with two attributes, Cell type (with groups A and B) and Time (time points 0, 5, 10), is used as the example in this section.
Click Next to display the levels of each attribute to be selected for sub-group comparisons (contrasts).
To compare A vs. B, select A for Cell type on the top, B for Cell type on the bottom and click Add comparison. The specified comparison is added to the table below (Figure 2).
To compare Time point 5 vs. 0, select 5 for Time on the top, 0 for Time on the bottom, and click Add comparison (Figure 3).
To compare cell types at a certain time point, e.g. time point 5, select A and 5 on the top, and B and 5 on the bottom. Thereafter click Add comparison (Figure 4).
Multiple comparisons can be computed in one GSA run; Figure 5 shows the above three comparisons are added in the computation.
In terms of the design pool, i.e. the choices of model designs to select from, the two factors in this example lead to seven possibilities:
Cell type
Time
Cell type, Time
Cell type, Cell type * Time
Time, Cell type * Time
Cell type * Time
Cell type, Time, Cell type * Time
In GSA, if a 2nd-order interaction term is present in the design, then all first-order terms must be present as well; that is, if the Cell type * Time interaction is present, both factors must be included in the model. In other words, the following designs are not considered:
Cell type, Cell type * Time
Time, Cell type * Time
Cell type * Time
If a comparison is added, models that do not contain the comparison's factors are also eliminated. E.g. if a comparison of Cell type A vs. B is added, only designs that include the Cell type factor remain in the computation. These are:
Cell type
Cell type, Time
Cell type, Time, Cell type * Time
The more comparisons on different terms are added, the fewer models will be included in the computation. If the following comparisons are added in one GSA run:
A vs B (Cell type)
5 vs 0 (Time)
only the following two models will be computed:
Cell type, Time
Cell type, Time, Cell type * Time
If comparisons on all the three terms are added in one GSA run:
A vs B (Cell type)
5 vs 0 (Time)
A*5 vs B*5 (Cell type * Time)
then only one model will be computed:
Cell type, Time, Cell type * Time
If GSA is invoked from a quantification output data node directly, you will have the option to use the default normalization methods before performing differential expression detection (Figure 6).
If invoked from a Partek E/M method output, the data node contains raw read counts and the default normalization is:
Normalize to total count (RPM)
Add 0.0001 (offset)
If invoked from a Cufflinks method output, the data node contains FPKM and the default normalization is:
Add 0.0001 (offset)
If advanced normalization needs to be applied, perform the Normalize counts task on a quantification data node before doing differential expression detection (GSA or ANOVA).
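As a rough sketch of what the default normalization described above amounts to for a raw-counts matrix (features in rows, samples in columns); this mirrors the description, not Partek Flow's internal implementation:

```r
# Sketch of the default normalization for raw read counts (Partek E/M output):
# normalize to total count (RPM), then add a small offset.
normalize_rpm <- function(counts, offset = 1e-4) {
  lib_sizes <- colSums(counts)                   # total count per sample
  rpm <- sweep(counts, 2, lib_sizes, "/") * 1e6  # reads per million
  rpm + offset                                   # offset avoids log(0) downstream
}

set.seed(1)
raw <- matrix(rpois(20, lambda = 50), nrow = 5)  # toy counts: 5 features x 4 samples
normalized <- normalize_rpm(raw)
```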
Click on Configure to customize Advanced options (Figure 7).
The Low-expression feature section allows you to specify criteria to exclude features that do not meet the requirements for the calculation (a sketch of the three criteria follows the list below). If a filter features task was performed in the upstream analysis, the default of this filter is set to None; otherwise, the default is Lowest average coverage set to 1.
Lowest average coverage: the computation will exclude a feature if its geometric mean across all samples is below the specified value
Lowest maximum coverage: the computation will exclude a feature if its maximum across all samples is below the specified value
Minimum coverage: the computation will exclude a feature if its sum across all samples is below the specified value
None: include all features in the computation
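A minimal R sketch of the three criteria, assuming strictly positive counts (how Flow treats zeros in the geometric mean is not specified here):

```r
# Sketch of the three low-expression filter criteria listed above.
geom_mean <- function(x) exp(mean(log(x)))  # assumes positive values; a zero count drives the mean to 0

passes_lowest_average <- function(counts, cutoff) {
  apply(counts, 1, geom_mean) >= cutoff     # geometric mean across samples
}
passes_lowest_maximum <- function(counts, cutoff) {
  apply(counts, 1, max) >= cutoff           # maximum across samples
}
passes_minimum_coverage <- function(counts, cutoff) {
  rowSums(counts) >= cutoff                 # sum across samples
}
```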
Multiple test correction can be performed on the p-values of each comparison, with FDR step-up (1) being the default. Other options, such as Storey q-value (2) and Bonferroni, are provided; select one method at a time. None means no multiple test correction will be performed.
FDR step-up:
Suppose there are n p-values (n is the number of features). Sort them in ascending order, $p_{(1)} \le p_{(2)} \le \dots \le p_{(n)}$, and let m be the rank of a p-value. The calculation compares $p_{(m)} \cdot (n/m)$ with the specified alpha level; the cut-off p-value is the last one for which this product is below alpha. The goal of the step-up method is to find

$$K^* = \max\left\{ m : p_{(m)} \le \frac{m}{n}\,\alpha \right\}$$

Define the step-up value as

$$S_m = \frac{n}{m}\, p_{(m)}$$

Then an equivalent definition of $K^*$ is

$$K^* = \max\left\{ m : S_m \le \alpha \right\}$$

so when $S_m \le \alpha$, all features with rank up to m are declared significant. To find $K^*$, start with $S_n$ and go up the list until you find the first step-up value that is less than or equal to alpha.
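A short R sketch of this search, for illustration only (base R's p.adjust(p, method = "BH") computes the matching adjusted p-values):

```r
# Sketch of the FDR step-up search described above: find the largest rank m
# with step-up value S_m = p_(m) * n / m <= alpha.
fdr_step_up <- function(pvalues, alpha = 0.05) {
  n   <- length(pvalues)
  ord <- order(pvalues)
  s   <- pvalues[ord] * n / seq_len(n)    # step-up values S_m in rank order
  k   <- which(s <= alpha)
  if (length(k) == 0) return(integer(0))  # no feature passes
  ord[seq_len(max(k))]                    # indices of all features up to rank K*
}
```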
Storey q-value:
The q-value is the minimum "positive false discovery rate" (pFDR) that can occur when rejecting a statistic. For an observed statistic T = t and a nested set of rejection regions {C},

$$q(t) = \inf_{\{C \,:\, t \in C\}} \mathrm{pFDR}(C)$$
Bonferroni: each p-value is multiplied by the number of tests n (and capped at 1); equivalently, a feature is declared significant at level alpha only if its unadjusted p-value is at most alpha/n.
In the results, the best model's Akaike weight is also reported. The weight is interpreted as the probability that the model would be picked as the best if the study were reproduced. Akaike weights range from 0 to 1, where a weight close to 1 means the best model is clearly superior to the other candidates in the model pool; if the best model's Akaike weight is close to 0.5, on the other hand, the best model is likely to be replaced by another candidate if the study were reproduced. The best model is still used in that case, but the confidence in that choice is fairly low.
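For reference, a sketch of how Akaike weights are conventionally computed (following Burnham and Anderson, cited in the references below): for candidate models $i = 1, \dots, M$ with criterion values $AIC_i$,

$$\Delta_i = AIC_i - \min_j AIC_j, \qquad w_i = \frac{\exp(-\Delta_i/2)}{\sum_{j=1}^{M} \exp(-\Delta_j/2)}$$

The best model has $\Delta_i = 0$ and therefore the largest weight, and the weights across the pool sum to 1.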
The default value for Enable multimodel approach is Yes, meaning the estimation utilizes all models in the pool by assigning them weights based on AIC or AICc. If No is selected, the estimation is based on only the single best model, i.e. the one with the smallest AIC or AICc. The output p-value differs depending on the selected multimodel option, but the fold change is the same. The multimodel approach is recommended when the best model's Akaike weight is not close to 1, meaning the best model is not compelling.
There are situations when a model estimation procedure does not fail outright but still encounters difficulties. In such cases it can generate p-values and fold changes for the comparisons, but those values are not reliable and can be misleading. It is recommended to use only reliable estimation results, so the default for Use only reliable estimation results is set to Yes.
Partek Flow provides five response distribution types for each design model in the pool, namely:
Normal
Lognormal (the same as ANOVA task)
Lognormal with shrinkage (the same as the limma-trend method [4])
Negative binomial
Poisson
We recommend using the lognormal with shrinkage distribution (the default); an experienced user may click Custom to configure the model type and p-value type (Figure 8).
If multiple distribution types are selected, then the number of total models that is evaluated for each feature is the product of the number of design models and the number of distribution types. In the above example, suppose we have only compared A vs B in Cell type as in Figure 2, then the design model pool will have the following three models:
Cell type
Cell type, Time
Cell type, Time, Cell type * Time
If we select Lognormal with shrinkage and Negative binomial, i.e. two distribution types, the best model fit for each feature will be selected from 3 * 2 = 6 models using AIC or AICc.
The design pool can also be restricted by Min error degrees of freedom. When Model types configuration is set to Default, this is automated as follows: it is desirable to keep the error degrees of freedom at or above six, so the minimum is automatically set to the largest k, 0 <= k <= 6, for which admissible models exist. An admissible model is one that can be estimated given the specified contrasts. In the above example, when we compare A vs B in Cell type, there are three possible design models. The error degrees of freedom of the model Cell type are the largest, and those of the model Cell type, Time, Cell type * Time are the smallest:
k(Cell type) > k(Cell type, Time) > k (Cell type, Time, Cell type*Time)
If the sample size is big, k >= 6 holds for all three models, all the models are evaluated, and the best model is selected for each feature. If the sample size is too small and no model has k >= 6, only the model with the maximal k is used in the calculation. If the maximal k happens to be zero, only the Poisson response distribution can be used.
There are two types of p-values: F and Wald. Poisson, negative binomial, and normal models can generate p-values using either Wald or F statistics; lognormal models always employ the F statistic. The more replicates in the study, the smaller the difference between the two options. When there are no replicates, only Poisson can be used, and the p-value is generated using the Wald statistic.
Note: Partek Flow keeps track of the log status of the data; whether or not GSA is performed on logged data, the LSMeans, ratio, and fold change calculations are always in linear scale. Ratio is the ratio of the two LSMeans of the two groups in the comparison (left is the numerator, right is the denominator). Fold change is converted from ratio: when the ratio is greater than or equal to 1, the fold change equals the ratio; when the ratio is less than 1, the fold change is -1/ratio. In other words, the fold change is always >= 1 or <= -1; there are no fold change values between -1 and 1. When the LSMean of the numerator group is greater than that of the denominator group, the fold change is greater than 1; when it is smaller, the fold change is less than -1; when the two group means are equal, the fold change is 1. Logratio is the log2-transformed ratio, which is equivalent to the log fold change reported by some other software.
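A minimal R sketch of this ratio-to-fold-change convention, for illustration:

```r
# Sketch of the ratio-to-fold-change convention described in the note above.
ratio_to_fold_change <- function(ratio) {
  ifelse(ratio >= 1, ratio, -1 / ratio)  # never strictly between -1 and 1
}
ratio_to_fold_change(c(4, 1, 0.25))      # ->  4  1 -4

log_ratio <- log2  # "logratio" = log2(ratio), i.e. the log fold change elsewhere
```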
If there are multiple design models and multiple distribution types included in the calculation, the fraction of genes using each model and type will be displayed as pie charts in the task result (Figure 9).
Feature list with p-value and fold change generated from the best model selected is displayed in a table with other statistical information (Figure 10). By default, the gene list table is sorted by the first p-value column.
The following information is included in the table by default:
Feature ID information: if transcript level analysis was performed, and the annotation file has both transcript and gene level information, both gene ID and transcript ID are displayed. Otherwise, the table shows only the available information.
Total counts: total number of read counts across all the observations from the input data.
Each contrast outputs the p-value, FDR step-up p-value, ratio, and fold change in linear scale, plus the LSMean of each group in the comparison in linear scale.
Click the Optional columns link in the top-left corner of the table to select extra information to display:
Maximum count: maximum number of reads counts across all the observations from the input data.
Geometric mean: geometric mean value of the input counts across all observations.
Arithmetic mean: arithmetic mean value of input counts across all observations.
By clicking Optional columns, you can also retrieve more statistical results, e.g. Average coverage, which is the geometric mean of normalized reads in linear scale across all samples; the fold change lower/upper limits from the 95% confidence interval; and feature annotation information if additional annotation fields exist in the annotation model specified for quantification, such as genomic location and strand.
Select the check box of a field and specify the cutoff by typing directly or using the slider; press Enter to apply. After a filter has been applied, the total number of included features is updated at the top of the panel.
Note that for the LSMean there are two columns corresponding to the two groups in the contrast, and the cutoffs can be specified for the left and right columns separately. For example, in Figure 6, LSMean (left) corresponds to A while LSMean (right) is for B.
The filtered result can be saved into a filtered data node by selecting the Generate list button at the lower-left corner of the table. Selecting the Download button at the lower-right corner of the table downloads the table as a text file to the local computer.
Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS B, 57, 289-300.
Storey, J.D. (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics, 31, 2013-2035.
Auer, P.L., Doerge, R.W. (2011). A two-stage Poisson model for testing RNA-Seq data. Statistical Applications in Genetics and Molecular Biology, 10(1).
Burnham, K.P., Anderson, D.R. (2010). Model Selection and Multimodel Inference. Springer.
Law, C.W., Chen, Y., Shi, W., Smyth, G.K. (2014). Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15, R29.
Anders, S., Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11, R106.
Partek Flow offers the DESeq2 method for differential expression detection. There are two options: DESeq2(R) and DESeq2.
DESeq2(R) is a wrapper of the Bioconductor package 'DESeq2'. The implementation details for DESeq2(R) can be found in the external package documentation, which includes changes made by the algorithm authors since the publication of the original manuscript (Love, Huber, and Anders 2014).
The DESeq2(R) task can be invoked from data nodes generated by quantification tasks that contains raw read count values for each feature in each sample (Gene counts, Transcript counts, microRNA counts, etc.). DESeq2(R) cannot be run on a normalized counts data node because DESeq2(R) internally corrects for library size and implements a low expression filter.
If a raw count value includes a decimal fraction, the value is rounded to an integer before DESeq2(R) is performed. The DESeq2(R) task itself includes the feature filter, normalization, and differential analysis. Please note that to run the DESeq2(R) task, you have to install the R package beforehand (the installation has to be performed outside of Partek Flow).
The DESeq2 task, on the other hand, is a Partek Flow native implementation of the DESeq2 differential expression detection algorithm; it is much faster than DESeq2(R). Like GSA and ANOVA, before you run this task we recommend that you first remove (filter out) features expressed at a low level and then perform normalization using Median ratio (DESeq2 only).
Note: DESeq2 differential analysis can only be performed on the output data nodes of normalization methods that produce library sizes:
TMM, CPM, Upper Quartile, Median ratio, Postcounts
Installation of R is not required to run the DESeq2 task.
Categorical and numeric attributes, as well as interaction terms, can be added to the DESeq2 model. The DESeq2 configuration dialog for adding attributes and interactions to the model is very similar to the ANOVA configuration dialog. However, DESeq2(R) has two important limitations not shared by GSA or ANOVA.
First, interaction terms cannot be added to contrasts in DESeq2(R). To perform contrasts of an interaction term in DESeq2(R), a new attribute that combines the factors of interest must be added and the contrast performed on the new combined attribute. This limitation of DESeq2(R) is detailed in the official DESeq2 documentation. To perform contrasts of interaction terms without creating new combined attributes, please use the DESeq2, GSA, or ANOVA/LIMMA-trend/LIMMA-voom method.
Second, DESeq2(R) only allows two subgroups to be compared in each contrast. To analyze multiple subgroups, please use either DESeq2, GSA, or ANOVA method.
In DESeq2 advanced options configure dialog, there is reference selection option:
A reference level is specified for each categorical factor in the model, and the result may depend on that choice. In R, a reference level is chosen by default whenever a categorical factor is present in the model. This Flow option allows the user to specify exactly the same reference level as in an R script, if needed, e.g. to compare results between DESeq2 and DESeq2(R).
The report produced by DESeq2 is similar to the ANOVA report; each row is a feature and the columns include the p-value, FDR p-value, and fold change in linear scale for each contrast. However, the DESeq2(R) report does not include LSMeans information for the compared groups.
The R implementation of DESeq2, and hence DESeq2(R), fails to generate results in some special cases of input data; these cases are handled differently in Partek's implementation of DESeq2.
First, if two or more categorical factors are present in the model, there can be a situation when some combinations of factor levels have no observations. In Flow, one can see if that is the case by clicking "Cross tabulation" link after the factors have been selected:
In this example, if the user tries to fit a model including the Factor_A, Factor_G, and Factor_A*Factor_G terms, the R implementation of DESeq2 fails. At the same time, none of these three terms is completely redundant, even though not all contrasts are estimable. In R, an answer can be obtained only by combining Factor_A and Factor_G into a single new factor with no empty levels. Partek's implementation of DESeq2 eliminates the need for that extra step and produces an answer right away. However, the results (most likely, the p-values) may be somewhat different from what one would obtain by using a combined factor in R. To match the R results perfectly, create a combined factor in Flow as well.
Second, occasionally all of the gene-wise dispersion estimates are too close to zero. In that case, R implementation fails with an error message "all gene-wise dispersion estimates are within 2 orders of magnitude from the minimum value ... one can instead use the gene-wise estimates as final estimates". As suggested in the error message, Flow implementation uses gene-wise dispersion estimates instead of failing the task.
In all of the special cases, an informative warning is output in Flow log.
In R, shrinkage of log2 fold changes is a separate step performed by the lfcShrink() function. In Flow, that functionality is absent in DESeq2(R) but present in the DESeq2 task, which implements the shrinkage method corresponding to the "ashr" option of lfcShrink(). The default shrinkage option in lfcShrink() is "apeglm", but that method is unable to produce results for some comparisons, whereas "ashr" has no restrictions. The fold change shrinkage results are reported in the "Shrunken Log2(Ratio)" and "s-value" columns of the DESeq2 task report.
"Independent filtering" tries to remove some features with low expression in order to increase statistical power. For such removed features, the p-value is reported but FDR and similar multiplicity adjustment measures are set to "?". To avoid missing values in the report, set this option to "No".
Given the limitations of DESeq2(R), we recommend using the DESeq2 task.
Love, M.I., Huber, W., Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.
Hurdle model is a statistical test for differential analysis that utilizes a two-part model, a discrete (logistic) part for modeling zero vs. non-zero counts and a continuous (log-normal) part for modeling the distribution of non-zero counts. In RNA-Seq data, this can be thought of as the discrete part modeling whether or not the gene is expressed and the continuous part modeling how much it is expressed if it is expressed. Hurdle model is well suited to data sets where features have very many zero values, such as single cell RNA-Seq data.
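As a minimal sketch of the two parts (the exact parameterization used by Flow and MAST may differ), for the expression Y of one gene with design matrix row x:

$$\Pr(Y > 0) = \mathrm{logit}^{-1}(x^\top \beta_D), \qquad \log_2(Y) \mid Y > 0 \;\sim\; \mathcal{N}(x^\top \beta_C,\, \sigma^2)$$

where $\beta_D$ are the coefficients of the discrete (logistic) part and $\beta_C$ those of the continuous (log-normal) part; a gene-level test combines evidence from both parts.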
On default settings, Hurdle model is equivalent to MAST, a published differential analysis tool designed for single cell RNA-Seq data that uses a hurdle model [1].
We recommend normalizing your data prior to running Hurdle model, but it can be invoked on any counts data node.
Click the counts data node
Click the Differential analysis section in the toolbox
Click Hurdle model
Select the factors and interactions to include in the statistical test (Figure 1)
Numeric and categorical attributes can be added as factors. To add attributes as factors, check the attribute check boxes and click Add factors. To add interactions between attributes, check the attribute check boxes and click Add interaction.
Click Next
Define comparisons between factor or interaction levels (Figure 2)
Adding comparisons in Hurdle model uses the same interface as ANOVA/LIMMA-trend/LIMMA-voom. Start by choosing a factor or interaction from the Factor drop-down list. The levels of the factor or interaction will appear in the left-hand panel. Select levels in the panel on the left and click the > arrow buttons to add them to the top or bottom panels on the right. The control level(s) should be added to the bottom box and the experimental level(s) should be added to the top box. Click Add comparison to add the comparison to the Comparisons table. Only comparisons in the Comparisons table will be included in the statistical test.
Click Finish to run the statistical test
Hurdle model produces a Feature list task node. The results table and options are the same as the GSA task report except for the last two columns (Figure 3): the percentage of cells in which the feature is detected (value above the background threshold) in each group (Pct(group1), Pct(group2)) is calculated and included in the Hurdle model report.
Low-value filter allows you to specify criteria to exclude features that do not meet the requirements for the calculation. If a filter features task was performed in the upstream analysis, the default of this filter is set to None; otherwise, the default is Lowest average coverage set to 1.
Lowest average coverage: the computation will exclude a feature if its geometric mean across all samples is below the specified value
Lowest maximum coverage: the computation will exclude a feature if its maximum across all samples is below the specified value
Minimum coverage: the computation will exclude a feature if its sum across all samples is below the specified value
None: include all features in the computation
Multiple test correction can be performed on the p-values of each comparison, with FDR step-up being the default. If you check the Storey q-value, an extra column with q-values will be added to the report.
There are situations when a model estimation procedure does not fail outright but still encounters difficulties. In such cases it can generate p-values and fold changes for the comparisons, but they are not reliable, i.e. they can be misleading. Therefore, the default of Use only reliable estimation results is set to Yes.
Shows the current scale of the input data for this task
Set the threshold for a feature to be considered expressed for the two-part hurdle model. If the feature value is greater than the specified value, it is considered expressed. If the upstream data node contains log-transformed values, be sure to specify the value on the same log scale. Default value is 0.
Applies shrinkage to the error variance in the continuous (log-normal) part of the hurdle model. The error term variance is shrunk towards a common value, and a shrinkage plot is produced on the task report page if enabled. Default is Enabled.
Applies shrinkage to the regression coefficients in the discrete (logistic) part of the hurdle model. The initial versions of MAST contained a bug that was fixed in its R source in March 2020. However, for the sake of reproducibility, the fix was released only on a topic branch in the MAST GitHub repository [2] and the default version of MAST remained as is. To install the fixed version of MAST in R, run the following R script.
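The script itself is not reproduced here; a typical way to install a specific MAST branch from GitHub uses the remotes package, where the ref value below is a placeholder for the actual topic branch referenced in [2]:

```r
# Hypothetical sketch: install MAST from a specific GitHub branch.
# "fix-branch-name" is a placeholder; substitute the topic branch from [2].
install.packages("remotes")  # if not already installed
remotes::install_github("RGLab/MAST", ref = "fix-branch-name")
```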
In Flow, the user can switch between the fixed and default version by selecting Fixed version or Default version, respectively. To disable the shrinkage altogether, choose Disabled.
[1] Finak, G., McDavid, A., Yajima, M., Deng, J., Gersuk, V., Shalek, A. K., ... & Linsley, P. S. (2015). MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome biology, 16(1), 278.
[2] MAST topic branch that contains the regression coefficient shrinkage fix:
This task can be invoked from a count matrix data node or a clustering task report (Statistics > Compute biomarkers). It performs Student's t-tests on the selected attribute, comparing each subgroup against all the others combined. By default, the up-regulated genes are reported as biomarkers.
In the set-up dialog, select the attribute from the drop-down list. The available attributes are the categorical attributes visible on the Data tab (i.e. project-level attributes) as well as data node-specific annotations, e.g. graph-based clustering results (Figure 1). If the task is run on a graph-based clustering output data node, the calculation uses the upstream data node that contains feature counts – typically the input data node of PCA.
By default, the result reports the features that are up-regulated by at least a 1.5 fold change (in linear scale) in each subgroup compared to the others. The result is displayed in a table where each column is a subgroup and each row is a feature; features are ranked by ascending p-value within each subgroup. An example is shown in Figure 2. If a subgroup has fewer biomarkers than the others, the "extra" fields for that subgroup are left blank.
Figure 3. Biomarkers table (example). Top 10 biomarkers for each cluster are shown. Download link provides the full results table
Furthermore, the Download link (upper-left corner of the table report) downloads a .txt file to the local computer (default file name: Biomarkers.txt) containing the full report: all genes with fold change > 1.5, with corresponding fold changes and p-values.
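A simplified R sketch of the one-vs-rest logic described above, for illustration only (Flow's exact computation, e.g. whether tests run on the log scale, may differ):

```r
# Sketch: for cluster k, a Student's t-test per feature against all other
# cells combined, keeping up-regulated features (fold change > 1.5)
# ranked by ascending p-value.
biomarkers_for_cluster <- function(expr, clusters, k, min_fc = 1.5) {
  in_k <- clusters == k
  stats <- t(apply(expr, 1, function(x) {
    c(fold_change = mean(x[in_k]) / mean(x[!in_k]),  # linear-scale fold change
      p_value     = t.test(x[in_k], x[!in_k])$p.value)
  }))
  res <- as.data.frame(stats)
  res <- res[res$fold_change > min_fc, , drop = FALSE]  # up-regulated only
  res[order(res$p_value), , drop = FALSE]               # rank by p-value
}
```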
The Kruskal-Wallis and Dunn's tests (non-parametric ANOVA) task is used to identify differentially expressed genes among two or more groups. Note that such rank-based tests are generally advised for use with larger sample sizes.
To invoke the Kruskal-Wallis test, select any count-based data node, including:
Gene counts
Transcript counts
Normalized counts
Select Statistics > Differential analysis in the context-sensitive menu, then select Kruskal-Wallis (Figure 1).
Select a specific factor for analysis and click the Next button (Figure 2). Note that this task can only take into account one factor at a time.
For more complicated experimental designs, go back to the original count data that will be used as input and perform Rank normalization at the Features level (Figure 3). The resulting Normalized counts data node can then be analyzed using the Detect differential expression (ANOVA) task, which can take into account multiple factors as well as interactions.
Define the desired comparisons between groups and click the Finish button (Figure 4). Note that comparisons can only be added between single groups (i.e. one group per box).
The results of the analysis will appear similar to other differential expression analysis results. However, the column to indicate mean expression levels for each group will display the median instead (Figure 5).
Alternative splicing results in a single gene coding for multiple protein isoforms, so this task can only be invoked from transcript-level data. The algorithm is based on ANOVA and detects genes whose transcripts show expression changes that differ between biological groups. E.g. a gene has two transcripts, A and B; transcript A is up-regulated in the treated group compared to the control group, while B is down-regulated in the treated group.
The alt-splicing dialog is very similar to the ANOVA dialog, since the analysis is based on the specified ANOVA model. To set up an ANOVA model, first choose factors from the available sample attributes. The factors can be categorical or numeric attribute(s). Click a check box to select a factor and click the Add factors button to add it to the model (Figure 1).
Only one alt-splicing factor needs to be selected from the ANOVA factors. The ANOVA model performed is based on the factors specified in the dialog, while the transcript ID and transcript ID interaction with alt-splicing factor effects are added into the model automatically.
Transcript ID effect: not all transcripts of a gene are expressed at the same level, so transcript ID is added to the model to account for transcript-to-transcript differences.
Interaction of transcript ID with the alt-splicing factor: this effect is used to estimate whether different transcripts have different expression among the levels of the same factor.
Suppose there is an experiment designed to detect transcripts showing differential expression in two tissue groups: liver vs. muscle. The alt-splicing ANOVA dialog allows you to specify the ANOVA model, which in this analysis is Tissue. The alt-splicing factor is chosen from the ANOVA factor(s), so the alt-splicing factor is also Tissue (Figure 1).
The alt-splicing range will limit analysis to genes possessing the number of transcripts in the specified range. Lowering the maximum number of transcripts will increase the speed of analysis.
Click Next to setup the comparisons (Figure 2). The levels (i.e. subgroups) of the alt-splicing factor will be displayed in the left panel; click to select a level name and move it to one of the panels on the right. The fold change calculation on the comparison will use the group in the top panel as the numerator, and the group in the bottom panel as the denominator. Click on Add comparison button to add a comparison to the comparisons table.
Click Configure to customize Advanced options (Figure 3).
Report option
Use only reliable estimation results: there are situations when a model estimation procedure does not fail outright but still encounters difficulties. In such cases it can generate p-values and fold changes for the comparisons, but they are not reliable, i.e. they can be misleading. Therefore, the default is set to Yes.
Data has been log transformed with base: showing the current scale of the input data on this task.
In the example above (Figure 4), the alt-splicing p-value of gene SLC25A3 is very small, which indicates that this gene shows preferential transcript expression across tissues. There are 3 splicing variants of the gene: NM_213611, NM_005888, and NM_002635. The fold change shows that NM_005888 has higher expression in the muscle relative to the liver (negative fold change, liver as the reference category), while NM_002635 has higher expression in the liver.
To visualize the difference, click on the Browse to location icon (Figure 5). The 3rd exon is differentially expressed between NM_005888 and NM_002635. Muscle primarily expresses NM_005888 while liver primarily uses NM_002635.
Seurat v3 [1] introduced new methods for the integration of multiple single-cell datasets, whether they were collected from different individuals, experimental conditions, or technologies. The Seurat 3 integration method uses a subset of the data as a reference for the integration analysis and integrates all other data with that reference. The subset can be one sample or a subgroup of samples defined by a factor attribute.
Seurat3 integration in Flow can be invoked from the Batch removal section when a Normalized counts data node is selected (Figure 1).
To run Seurat3 integration,
Click a Normalized counts data node
Click the Batch removal section in the toolbox
Click Seurat3 Integration
You will be prompted to pick attribute(s) for the analysis. The first Seurat3 integration dialog is a drop-down list that includes the factors for data integration. To set up the model, you need to choose which attribute should be considered. For example, in the case of a dataset that has different cell types assayed with multiple technologies (Tech), different technologies may have divergent impacts on different cell types. Hence, the attribute Tech should be considered the batch factor. The attribute celltype represents the different cell types in this dataset (Figure 2).
To integrate data with default settings,
Select Tech from the dropdown list
Click Finish
The output of Seurat3 integration is a new data node - Integrated counts (Figure 1). We can then use this new integrated matrix for downstream analysis and visualization (Figure 3).
Users can click Configure to change the default settings in Advanced options (Figure 4).
Use reference to find anchors: when this box is checked, the first group of the selected attribute is used as the reference to find anchors. To use a different group as the reference, change the order of the subgroups of the attribute on the attribute management page of the Data tab. When the box is unchecked, anchors are identified by comparing all pairs of subgroups; this option is very computationally intensive.
Perform L2 normalization: Perform L2 normalization on the CCA cell embeddings after dimensional reduction.
Pick anchors: How many neighbors (k) to use when picking anchors.
Filter anchors: How many neighbors (k) to use when filtering anchors.
Score anchors: How many neighbors (k) to use when scoring anchors.
Nearest neighbor finding methods: Method for nearest neighbor finding. Options include: rann, annoy.
This option is only available when Cufflinks quantification node is selected. Detailed implementation information can be found in the Cuffdiff manual [5].
When the task is selected, the dialog displays all the categorical attributes with more than one subgroup (Figure 1).
When an attribute is selected, pairwise comparisons of all the levels will be performed independently.
Click on Configure button in the Advanced options to configure normalization method and library types (Figure 2).
There are three library normalization methods:
Classic-fpkm: the library size factor is set to 1; no scaling is applied to FPKM values
Geometric: FPKMs are scaled via the median of the geometric means of the fragment counts across all libraries [6]. This is the default option (and is identical to the one used by DESeq)
Quartile: FPKMs are scaled via the ratio of the 75th percentile fragment counts to the average 75th percentile value across all libraries
The library types have three options:
Fr-unstranded: reads from the left-most end of the fragment (in transcript coordinates) map to the transcript strand, and the right-most end maps to the opposite strand. E.g. standard Illumina
Fr-firststrand: same as above, except the right-most end of the fragment is the first sequenced (or the only end sequenced for single-end reads). It is assumed that only the strand generated during first-strand synthesis is sequenced. E.g. dUTP, NSR, NNSR
Fr-secondstrand: same as above, except the left-most end of the fragment is the first sequenced (or the only end sequenced for single-end reads). It is assumed that only the strand generated during second-strand synthesis is sequenced. E.g. directional Illumina, standard SOLiD
The report of the Cuffdiff task is a table of features with p-value, q-value, and log2 fold-change information for all the comparisons (Figure 3).
In the p-value column, besides an actual p-value, which means the test was performed successfully, the following flags can appear to indicate that the test was not successful:
NOTEST: not enough alignments for testing
LOWDATA: too complex or too shallowly sequenced
HIGHDATA: too many fragments in locus
FAIL: when an ill-conditioned covariance matrix or other numerical exception prevents testing
The table can be downloaded as a text file when clicking the Download button on the lower-right corner of the table.
Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS B, 57, 289-300.
Storey, J.D. (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics, 31, 2013-2035.
Auer, P.L., Doerge, R.W. (2011). A two-stage Poisson model for testing RNA-Seq data. Statistical Applications in Genetics and Molecular Biology, 10(1).
Burnham, K.P., Anderson, D.R. (2010). Model Selection and Multimodel Inference. Springer.
Law, C.W., Chen, Y., Shi, W., Smyth, G.K. (2014). Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15, R29.
Anders, S., Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11, R106.
It is challenging to analyze scRNA-seq data, particularly when datasets are assayed with different technologies, because biological and technical differences are interspersed. Harmony [1] is an algorithm that projects cells into a shared embedding in which cells group by cell type rather than by dataset-specific conditions. Harmony can simultaneously account for multiple experimental and biological factors while integrating different datasets.
Harmony in Flow can be invoked in Batch removal section only if
the data has some categorical attributes (only categorical attributes can be included in the model)
PCA data node is selected (Figure 1).
To run Harmony,
Click a PCA data node
Click the Batch removal section in the toolbox
Click Harmony
You will be prompted to pick some attribute(s) for analysis. The Harmony dialog is similar to the General linear model batch removal. To set up the model, you need to choose which attributes should be considered. For example, in the case of one dataset that has different cell types from multiple batches, the batch may have divergent impacts on different cell types. Here, batch is the attribute Sample name and cell type is the attribute Cell type (Figure 2).
To remove batch effects with default settings,
Click Sample name
Click Add factors
Click Finish
The output of Harmony is a new data node. This data node contains the Harmony-corrected values and can be used as the input for downstream tasks such as Graph-based clustering, UMAP, and t-SNE (Figure 3).
Users can click Configure to change the default settings in Advanced options (Figure 4).
Diversity clustering penalty (theta): default theta=2. Higher penalty values give stronger correction, which results in better mixing; zero penalty means no correction. The range of this value is from 0 to positive infinity.
Number of clusters (nclust): the number of clusters in the model. Set this to the distinct count of cell types. nclust=1 is equivalent to simple linear regression. Use 0 to enable the default setting of Seurat's RunHarmony().
Width of soft kmeans clusters (sigma): the range of this value is from 0 to positive infinity. When set to 0, an observation is assigned to exactly one cluster (hard clustering); when the value is greater than 0, an observation can belong to multiple clusters (soft, or fuzzy, clustering). Default sigma=0.1. Sigma scales the distance from a cell to the cluster centroids: larger values of sigma result in observations being assigned to more clusters, while smaller values make soft kmeans clustering approach hard clustering.
Ridge regression penalty (lambda): Default lambda=1. Lambda must be strictly positive. Smaller values result in more aggressive correction.
Random seed: Use the same random seed to reproduce the results.
Partek Flow offers a wide variety of tools to help you explore your data. Which tools are available depends on the type of data node selected.
Compare clusters is a tool to identify the optimal number of clusters for K-means Clustering using the Davies-Bouldin index. The Davies-Bouldin index is a measure of cluster quality where a lower value indicates better clustering, i.e., the separation between points within the clusters is low (tight clusters) and separation between clusters is high (distinct clusters).
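For reference, the Davies-Bouldin index for k clusters is conventionally defined as

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \ne i} \frac{s_i + s_j}{d_{ij}}$$

where $s_i$ is the average distance of points in cluster i to its centroid (within-cluster scatter) and $d_{ij}$ is the distance between the centroids of clusters i and j; tight, well-separated clusters give small values.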
We recommend normalizing your data prior to running Compare clusters, but the task will run on any counts data node.
Click the counts data node
Click the Exploratory analysis section of the toolbox
Click Compare clusters
Configure the parameters
Click Finish to run (Figure 1)
The parameters for Compare clusters are the same as for K-means clustering.
The Compare clusters task report is an interactive line chart with the number of clusters on the x-axis and the Davies-Bouldin index on the y-axis (Figure 2).
The Compare clusters task report can be used to run K-means clustering.
Click a point on the plot to select it or type the number of clusters in the text box Partition data into clusters
Selecting a point sets it as the number of clusters to partition the data into. The number of clusters with the lowest Davies-Bouldin index value is chosen by default.
Click Generate clusters to run K-means clustering with the selected number of clusters
A K-means clustering task node and a Clustering result data node are produced. Please see our documentation on K-means Clustering for more details.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Graph-based clustering is a method for identifying groups of similar cells or samples. It makes no prior assumptions about the clusters in the data. This means the number, size, density, and shape of clusters does not need to be known or assumed prior to clustering. Consequently, graph-based clustering is useful for identifying clustering in complex data sets such as scRNA-seq.
We recommend normalizing your data prior to running Graph-based clustering, but the task will run on any counts data node.
Click the counts data node
Click the Exploratory analysis section of the toolbox
Click Graph-based clustering
Configure the parameters
Click Finish to run
Graph-based clustering produces a Clustering result data node. The task report lists the cluster results and cluster statistics (Figure 1). If clustering was run with Split cells by sample enabled on a single cell counts data node, the cluster results table displays the number of clusters found for each sample and clicking the sample name opens the sample-level report.
The Maximum modularity is a measure of the quality of the clustering result. Modularity measures how similar cells within a cluster are to each other and how dissimilar they are to cells in other clusters. Higher modularity indicates a better result. The optimal modularity is 1, but it may not be attainable for the input data.
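For reference, the standard (Newman-Girvan) modularity that the algorithm maximizes can be written in its usual graph form as

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$

where $A_{ij}$ is the edge weight between cells i and j in the nearest-neighbor graph, $k_i$ is the total edge weight attached to cell i, $m$ is the total edge weight in the graph, and $\delta(c_i, c_j)$ is 1 when the two cells are in the same cluster and 0 otherwise.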
The total number of clusters is listed along with the number and percentage of cells in each cluster.
The Clustering result data node includes the input values for each gene and adds cluster assignment as a new attribute, Graph-based, for each observation. If the Clustering result data node is visualized by Scatter plot, PCA, t-SNE, or UMAP, the plot will be colored by the Graph-based attribute (Figure 2).
Choose which version of the clustering algorithm to use. Options are Louvain [1], Louvain with refinement [2], SLM (Smart Local Moving) [3], and Leiden [4], the most recent of these methods. The default is Louvain.
Compute biomarkers identifies features that are highly expressed in each cluster compared to the other clusters.
Choose whether to run Graph-based clustering on all samples together or on each sample individually.
Checking the box will run Graph-based clustering on each sample individually.
This option appears when there are multiple feature types in the input data node (e.g., CITE-Seq data).
Select Any to run on all features or pick a feature type.
To increase the number of clusters, increase the resolution; to decrease the number of clusters, decrease the resolution. Default is 0.5.
A larger number may be more appropriate for larger numbers of cells.
Removes links between pairs of points if their similarity is below the threshold. Larger values lead to a shorter run time, but can result in many singleton clusters. Default is 0.0.
Clustering preserves the local structure of the data by focusing on the distances between each point and its k nearest neighbors. The optimal number of nearest neighbors depends on the size and density of the data. Generally, a larger and/or denser data set benefits from a larger number of nearest neighbors. Increasing the number of nearest neighbors will increase the size of clusters, and vice versa. Default is 30. The range of possible values is 3 to 100.
This parameter can be used to speed up clustering at the expense of accuracy. Larger scale implies greater accuracy and helps avoid singletons, but takes more time to run. To maximize accuracy, the total count of observations being clustered should be below the product of nearest neighbors and scale. Default is 100,000. The range of possible values is 1 to 100,000.
The modularity function measures the overall quality of clustering. Graph-based clustering amounts to finding a local maximum of the modularity function. Possibilities are Standard [5] and Alternative [6]. Default is Standard.
The clustering result depends on the order observations are considered. Each random start corresponds to a different order and result. A larger number of random starts can deliver a better result because the result with the highest quality (modularity) out of all of the random starts is chosen. Increasing the number of random starts will increase the run time. The range of possible values is 3 to 1,000. The default is 100.
The random seed is used in the random starts portion of the algorithm. Using a different seed might give a better result. Use the same random seed to reproduce results. Default is 0.
To maximize modularity, clustering proceeds iteratively by moving individual points, clusters, or subsets of points within clusters. A larger number of iterations can give better results, but will take longer to run. Default is 10.
Clusters smaller than the minimal cluster size value will be merged with a nearby cluster unless they are completely isolated. To avoid isolation, set the prune parameter to zero (default) and the scale parameter to the maximum (default). Default is 1.
Enable this option to use the slower sequential ordering of random starts. Default is disabled.
Different methods for determining nearest neighbors. The K nearest neighbors (K-NN) algorithm is the standard. The NN-Descent algorithm is used by UMAP and is an alternative. Default is K-NN.
If NN-Descent is chosen for Nearest Neighbor Type, the metric to use when determining distance between data points in high dimensional space can be set. Options are Euclidean, Manhattan, Chebyshev, Canberra, Bray Curtis, and Cosine. Default is Euclidean.
Graph-based clustering uses principal components as its input. The number of principal components to use is set here.
We recommend using the PCA task to determine the optimal number of principal components for your data. Default is 100.
Options are equally or by variance. Feature values can be standardized prior to PCA so that the contribution of each feature does not depend on its variance. To standardize, choose equally. To take variance into account and focus on the most variable features, choose by variance. Default is by variance.
You can choose to log transform the data prior to running PCA as part of Graph-based clustering. Default is disabled.
If you are log transforming the data, choose a log base. Default is 2 when Log transform data is enabled.
If you are log transforming the data, choose an offset. Default is 1 when Log transform data is enabled.
[1] Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10), P10008.
[2] Rotta, R., & Noack, A. (2011). Multilevel local search algorithms for modularity clustering. Journal of Experimental Algorithmics (JEA), 16, 2-3.
[3] Waltman, L., & Van Eck, N. J. (2013). A smart local moving algorithm for large-scale modularity-based community detection. The European Physical Journal B, 86(11), 471.
[4] Traag, V.A., Waltman, L., & van Eck, N.J. (2019). From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 9, 5233. https://doi.org/10.1038/s41598-019-41695-z
[5] Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical review E, 69(2), 026113.
[6] Traag, V. A., Van Dooren, P., & Nesterov, Y. (2011). Narrow scope for resolution-limit-free community detection. Physical Review E, 84(1), 016114.
K-means clustering is a method for identifying groups of similar observations, i.e. cells or samples. K-means clustering aims to group observations into a pre-determined number of clusters (k) so that each observation belongs to the cluster with the nearest mean. An important aspect of K-means clustering is that it expects clusters to be of similar size (equal variance) and shape (distribution of variance is spherical). The Compare Clusters task can also be used to help determine the optimal number of K-means clusters.
We recommend normalizing your data prior to running K-means clustering, but the task will run on any counts data node.
Click the counts data node
Click the Exploratory analysis section of the toolbox
Click K-means clustering
Configure the parameters
Click Finish to run (Figure 1)
K-means clustering produces a K-means Clusters result data node; double-click to open the task report which lists the cluster statistics (Figure 2). If Compute biomarkers was enabled, top markers will be available by double-clicking the Biomarkers result data node. If clustering was run with Split by sample enabled on a single cell counts data node, the cluster results table displays the number of clusters found for each sample and clicking the sample name opens the sample-level report.
The total number of clusters is listed along with the number and percentage of cells in each cluster.
The K-means Clustering result data node includes the input values and adds cluster assignment as a new attribute, K-means, for each observation.
Choose which distance metric to use for cluster distance calculations. Options include Euclidean, Absolute Value, Euclidean Squared, Kendall Correlation, Max Value, Min Value, Pearson Correlation, Rank Correlation, Average Euclidean, Shape, Cosine, Canberra, Bray Curtis, Tanimoto, Pearson Correlation Absolute, Rank Correlation Absolute, and Kendall Correlation Absolute. The default is Euclidean.
Choose between specifying a set number of clusters or a range to test for the best fit number of clusters. The best fit is determined by the number of clusters with the lowest Davies–Bouldin index. The default is set to 10 for a fixed number of clusters. The initial values for the range option are 3 to 20 clusters.
Choose whether to run the ANOVA test comparing each cluster to all other observations to identify features that have higher values in that cluster. Default is Enabled.
This option is present in single cell data. If enabled, K-means clustering will be run separately for each sample. If disabled, K-means clustering will be run on all cells from the input data. Default is set by the Split single cell by sample option in the user preference page.
If enabled, the initial cluster centroids will be selected randomly from among the data points. If disabled, the initial cluster centroids will be selected to optimize distance between clusters. Default is Disabled.
This sets the random seed used if Random cluster initialization is enabled. Use the same random seed to reproduce results.
If enabled, all cluster centroids will be recomputed at the end of each iteration. If disabled, each cluster centroid will be recomputed as the members of the cluster change. Default is Enabled.
The maximum number of iterations to perform before settling on a set of clusters. Default is 1000.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
If a Partek Flow task fails (no report is produced), please follow the directions in the task error reporting documentation.
If the task report is produced but the results are missing for some features (represented by "?", Figure 1), something may have gone wrong with the estimation procedure. To better understand what happened, use the information available in the Extra details report (Figure 1). This type of information is present for many tasks, including Differential Analysis and Survival Analysis.
Click the Extra details report for the feature of interest (Figure 1). This will display the Extra details report (Figure 2). When the estimation procedure fails, a red triangle will be present next to the information criteria value. Hover over the triangle to see a detailed error message.
In many cases, estimation failure is due to low expression; filtering out low-expression features or choosing a suitable normalization method will resolve the issue.
Sometimes the estimation results are not missing but the reported values look inadequate. If this is the case, the Extra details report may show that the estimation procedure generated a warning, and the triangle is yellow. To remove suspicious results in the report, set Use only reliable estimation results to Yes in the Advanced Options (Figure 3). The warnings will then be treated the same way as estimation failures.
To see results for as many features as possible, regardless of how reliable they are, set Use only reliable estimation results to No; the result is then reported unless there is an estimation failure. For example, DESeq2 uses Cook's distances to flag features with outlying expression values; if Use only reliable estimation results is set to Yes (Figure 3), the p-values for such features are not reported, which may lead to missing values in the report (set the option to No to avoid this).
Survival analysis is a branch of statistics that deals with modeling of time-to-event. In the context of “survival,” the most common event studied is death; however, any other important biological event could be analyzed in a similar fashion (e.g., spreading of the primary tumor or occurrence/relapse of disease). Survival analysis tries to answer questions such as: What is the proportion of a population who will survive past a certain time (i.e., what is the 5-year survival rate)? What is the rate at which the event occurs? Do particular characteristics have an impact on survival rates (e.g., are certain genes associated with survival)? Is the 5-year survival rate improved in patients treated by a new drug? Cox regression and Kaplan-Meier analysis are two techniques which are commonly used to assess survival analysis.
In survival analysis, the event should be well-defined, with two levels, and occur at a specific time. Because the primary outcome of the event is typically unfavorable (e.g., death, metastasis, relapse), the event is called a “hazard.” The hazard ratio is used to assess the likelihood of the event occurring while controlling for other co-predictors (co-variables/co-factors) if they are added to the model. In other words, the hazard ratio expresses how rapidly the event is experienced in one group compared to another. A hazard ratio greater than 1 indicates a shorter time-to-event (an increase in the hazard), a hazard ratio less than 1 is associated with a longer time-to-event (a reduction in the hazard), and a hazard ratio of 1 indicates no effect on time-to-event. For example, a hazard ratio of 2 means the event occurs at twice the rate in one group compared to the other. In cancer studies, a hazard ratio greater than 1 is considered a bad prognostic factor, while a hazard ratio less than 1 is a good prognostic factor.
An important aspect of survival analysis is “censored” data. Censored data refers to subjects that have not experienced the event being studied. For example, medical studies often focus on survival of patients after treatment so the survival times are recorded during the study period. At the end of the study period, some patients are dead, some patients are alive, and the status of some patients is unknown because they dropped out of the study. Censored data refers to the latter two groups. The patients who survived until the end of the study or those who dropped out of the study have not experienced the study event "death" and are listed as "censored".
Cox regression (Cox proportional-hazards model) tests the effects of factors (predictors) on survival time. Predictors that lower the probability of survival at a given time are called risk factors; predictors that increase the probability of survival at a given time are called protective factors. The Cox proportional-hazards model is similar to a multiple logistic regression, but it considers time-to-event rather than simply whether an event occurred. Cox regression should not be used with a small sample size because the events could accidentally concentrate in one of the cohorts, which will not produce meaningful results.
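For readers comfortable with scripting, a minimal Cox regression can be sketched with the Python lifelines package. This is not what Partek Flow runs internally, and the toy column names (time, event, expr) are made up for illustration.

```python
import pandas as pd
from lifelines import CoxPHFitter

# toy cohort: time-to-event, event status (1 = event, 0 = censored),
# and one numeric predictor (e.g., expression of a single gene)
df = pd.DataFrame({
    "time":  [5, 8, 12, 3, 9, 14, 7, 11],
    "event": [1, 0, 1, 1, 0, 1, 1, 0],
    "expr":  [2.1, 0.4, 3.3, 2.8, 0.9, 3.9, 2.5, 1.0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()  # hazard ratio = exp(coef) for each predictor
```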
Open the Cox Regression task in the task menu under Statistics for any counts node.
Next, select the Time, Event, and Event status. Partek Flow will automatically guess factors that might be appropriate for these options. Click Next to proceed with the task.
The predictors (factors or variables) and co-predictors in the model must be defined. Co-predictors are numeric or categorical factors that will be included in the Cox regression model. Time-to-event analysis will be performed on features (e.g. genes) by default unless Use feature expression as predictor is unchecked (Figure 2). If unchecked, select a factor and click Add factors to model a variable other than features. With the default setting, Use feature expression as predictor, the user can Add factors to the model that help explain the relationship for time-to-event (co-predictors) in addition to features. Choose Add interaction to add co-predictors with known dependencies. If factors are added here, they cannot be added as stratification factors. Click Next to proceed with the task.
Next, the user can define comparisons for the co-predictors, if any have been added. Configure contrasts by moving factors into the numerator (e.g. experimental factor) or denominator (e.g. control factor / reference), choose Combine or Pairwise, and add the comparison, which will be displayed below. Combine merges all numerator levels and all denominator levels into a single comparison; Pairwise splits the numerator and denominator levels into a factorial set of comparisons, meaning every numerator level is paired with every denominator level. Multiple comparisons from different factors can be added with Add comparison. The Low value filter can be used to exclude features; choose a filter or select none. Click Next to proceed with the task (Figure 3).
The user can select categorical factors to perform stratification if needed. Stratification is needed when the proportional hazards assumption holds within each stratum but not across strata. When stratification factors are included, the proportional hazards assumption is required to hold only for each combination of levels of the stratification factors; a separate submodel is estimated for each level combination and the results are aggregated. Click Finish to complete the task (Figure 4).
The results of Cox regression analysis provide key information to interpret, including:
Hazard ratio (HR): if the HR = 0.5 then half as many patients are experiencing the event compared to the control group, if the HR = 1 the event rates are the same in both groups, and if the HR = 2 then twice as many are experiencing an event compared to the control group.
HR limit: this is the confidence interval of the hazard ratio.
P-value: the lower the p-value, the greater the significance of the observation.
For example, if you have selected both a co-predictor and a stratification factor, a comparison using the co-predictors and a Type III p-value for the co-predictor will be generated in the Cox regression report.
The Kaplan-Meier task is used for comparing the survival curves among two or more groups of samples. The groups are defined by one or more categorical attributes (factors) specified by the user. As in the case of Cox Regression, it is possible to use feature expression data, if available; in that case, quantitative feature expression is converted into a feature-specific categorical attribute. Each combination of the attribute levels corresponds to a distinct group. If one selects three factors with 2, 3, and 5 levels, respectively, then the total count of compared groups is 2*3*5 = 30. Therefore, selecting too many factors and/or factors with many levels may not work, since the total number of samples may not be enough to fill all of the groups.
To perform Kaplan-Meier survival analysis, at least two pieces of information must be provided for each sample: time-to-event (a numeric factor) and event status (a categorical factor with two levels). Time-to-event indicates the time elapsed between the enrollment of a subject in the study and the occurrence of the event. Event status indicates whether the event occurred or the subject was censored (did not experience the event). The survival curve is not drawn as straight lines connecting points; instead, it follows a staircase pattern, where each drop in the staircase represents an event occurrence.
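The following sketch, using the Python lifelines package (not Partek Flow's internal code), illustrates the two required inputs, durations and event status, with 0 marking censored subjects; it also runs the Log-rank test discussed later in this section.

```python
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

# time-to-event and event status for two groups (1 = event, 0 = censored)
t_a, e_a = [6, 7, 10, 15, 19, 25], [1, 1, 0, 1, 0, 1]
t_b, e_b = [1, 3, 4, 8, 12, 14],  [1, 1, 1, 1, 0, 1]

kmf = KaplanMeierFitter()
kmf.fit(t_a, event_observed=e_a, label="group A")
ax = kmf.plot_survival_function()      # staircase-shaped survival curve
kmf.fit(t_b, event_observed=e_b, label="group B")
kmf.plot_survival_function(ax=ax)

result = logrank_test(t_a, t_b, event_observed_A=e_a, event_observed_B=e_b)
print(result.p_value)
```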
The Kaplan-Meier task begins similar to the Cox regression task, then differs when selecting categorical attributes to define the compared groups.
For each feature (e.g. gene), the expression values are sorted in ascending order and placed into B bins of (roughly) equal size. As a result, a feature-specific categorical attribute with B levels is constructed, which can be used by itself or in combination with other categorical attributes. For instance, for B = 2 (Figure 5), we take a given feature and compute its median expression. The samples are separated into two bins, depending on whether the expression in the sample is below or above the median. If two percentiles are chosen, the bins are automatically labeled "Low" and "High", but the text box can be used to re-label the bins. The bins are feature-specific since this procedure is repeated for each feature separately.
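The median split described above is easy to emulate; here is a two-line pandas sketch (illustrative only, not Flow's code):

```python
import pandas as pd

expr = pd.Series([0.2, 1.5, 3.1, 0.8, 2.2, 4.0])  # one feature across samples
# B = 2: split at the median into feature-specific "Low"/"High" bins
bins = pd.qcut(expr, q=2, labels=["Low", "High"])
```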
For each group, the survival curve (aka survival function) is estimated using the Kaplan-Meier estimator [1]. For instance, if one selects ER status, which has two levels, and chooses two feature expression bins, four survival curves are displayed in the Data Viewer (Figure 6). The Grouping configuration option can be used to split and modify the connections.
To see whether the survival curves are statistically different, the Kaplan-Meier task runs Log-rank and Wilcoxon (aka Wilcoxon-Gehan) tests. The null hypothesis is that the survival curves do not differ among the groups (the computational details are available in [2]). When feature expression is used, the p-values are also feature-specific (Figure 7). Select the step-plot icon under View to visualize the Kaplan-Meier survival curves for each gene.
Like in Cox Regression task, it is possible to choose stratification factor(s), but the purpose and meaning of stratification are not the same as in Cox Regression. Suppose we want to compare the survival among the four groups defined by the two levels of ER status and the two bins of feature expression. We can select the two factors on “Select group factor(s)” page. In that case, the reported p-values will reflect the statistical difference among the four survival curves that are due to both ER status and the feature expression. Imagine that our primary interest is the effect of feature expression on survival. Although ER status can be important and therefore should be included in the model, we want to know whether the effect of feature expression is significant after the contribution of ER status is taken into account. In other words, the goal is to treat ER status as a nuisance factor and the binned feature expression as a factor of interest.
In qualitative terms, it is possible to obtain an answer if we group the survival curves by the level of ER status. This can be achieved in the Data Viewer by choosing Grouping > Split by under Configure. That makes it easy to compare the survival curves that have the same level of ER status and avoid the comparison of curves across different levels of ER status.
If in Figure 8, we see one or more subplots where the survival curves differ a lot, that is evidence that the feature expression affects the survival even after adjusting for the contribution of ER status. To obtain an answer in terms of adjusted Log-rank and Wilcoxon p-values, one should deselect ER status as a “group factor” (Figure 4) and mark it as a stratification factor instead (Figure 9).
The computation of stratification adjusted p-values is elaborated in [2].
Suppose when the feature expression and ER status are selected as “group factors”, Log-rank p-value is 0.001, and when ER status is marked as stratification factor, the p-value becomes 0.70. This means that ER status is very useful for explaining the difference in survival while the feature factor is of no use if ER status is already in the model. In other words, the marginal contribution of the binned expression factor is low.
If more than two attributes are present, it is possible to measure the marginal contribution of any single factor in a similar manner: the attribute of interest should be selected as “group factor” and the other attributes should be marked as stratification factors. There is no limit on the count of factors that can be selected as “group” or stratification, except that all of the selected factors are involved in defining the groups and the groups should contain enough samples (at least, be non-empty) for the results to be reliable.
[2] Klein, Moeschberger (1997), Survival Analysis: Techniques for Censored and Truncated Data. ISBN-13: 978-0387948294
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensional reduction technique [1]. t-SNE aims to preserve the essential high-dimensional structure and present it in a low-dimensional representation. t-SNE is particularly useful for visually identifying groups of similar samples or cells in large high-dimensional data sets such as single cell RNA-Seq.
We recommend normalizing your data prior to running t-SNE, but the task will run on any counts data node.
Click the counts data node
Click the Exploratory analysis section of the toolbox
Click t-SNE
Click Finish to run
t-SNE produces a t-SNE task node. Opening the task report launches a scatter plot showing the t-SNE results. Each point on the plot is a cell for single cell data or a sample for bulk data. The plot will open in 2D or 3D depending on the user preference.
Choose whether to run t-SNE on all samples together or on each sample individually.
Checking the box will run t-SNE on each sample individually.
This option appears when there are multiple feature types in the input data node (e.g., CITE-Seq data).
Select Any to run on all features or pick a feature type.
t-SNE preserves the local structure of the data by focusing on the distances between each point and its nearest neighbors. Perplexity can be thought of as the number of nearest neighbors being considered. The optimal perplexity depends on the size and density of the data. Generally, a larger and/or more dense data set will benefit from a higher perplexity (Figure 2). Default is 30. The range of possible values is 3 to 100.
t-SNE uses an iterative algorithm to optimize the low-dimensional representation. More iterations will result in a more accurate embedding to an extent, but will take longer to run. Default is 1000.
Several parts of t-SNE utilize a random number generator to provide an initial value. Default is 1. To reproduce the results, use the same random seed across runs.
If selected, t-SNE initializes from random initial positions for each point. If disabled, the initial values for each point are assigned using the largest principal components extracted from the raw data. Default is enabled.
The metric to use when computing distances in high-dimensional space. Options are Euclidean, Manhattan, Chebyshev, Canberra, Bray Curtis, and Cosine. Default is Euclidean.
If checked, mapping error information will be available in the task report. Default is disabled.
Output a t-SNE table data node that can be downloaded. The 2D t-SNE coordinates are labeled Feature 1 and Feature 2; the 3D t-SNE coordinates are labeled Feature 3, 4, and 5. Default is disabled.
t-SNE uses principal components as its input. The number of principal components to use is set here.
We recommend using the PCA task to determine the optimal number of principal components for your data. Default is 50.
Options are equally or by variance. Feature values can be standardized prior to PCA so that the contribution of each feature does not depend on its variance. To standardize, choose equally. To take variance into account and focus on the most variable features, choose by variance. Default is by variance.
You can choose to log transform the data prior to running PCA as part of t-SNE. Default is disabled.
If you are normalizing the data, choose a log base. Default is 2 when Log transform data is enabled.
If you are normalizing the data, choose an offset. Default is 1 when Log transform data is enabled.
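For reference, most of the parameters above have close analogs in the open-source scikit-learn implementation of t-SNE. The sketch below is an illustration under that assumption, not Partek Flow's internal code:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(1000, 2000)  # cells x genes, already normalized (toy)

# t-SNE runs on principal components extracted from the data
pcs = PCA(n_components=50).fit_transform(X)

embedding = TSNE(
    n_components=2,
    perplexity=30,       # effective number of nearest neighbors
    n_iter=1000,         # optimization iterations (max_iter in newer releases)
    metric="euclidean",  # distance metric in high-dimensional space
    init="random",       # or "pca" to start from principal components
    random_state=1,      # fixed seed for reproducibility
).fit_transform(pcs)
```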
[1] L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.
To help you identify which previous filter you want to apply, the color of the task node on the Analyses tab, the number of cell barcodes retained (summed for all samples), and the time/date the previous filter task was submitted are included in the table. To view the number of cells and percentage of reads in cells for each sample in a previous filter, mouse over the button (Figure 3).
Suppose there are n p-values (n is the number of features) and the significance level of each test is α. The expected number of Type I errors is then n*α, so the significance level of each individual test should be adjusted to α/n. Alternatively, the p-values can be adjusted as pB = p*n, where pB is the Bonferroni-corrected p-value. If pB is greater than 1, it is set to 1.
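A short numeric illustration of the correction:

```python
import numpy as np

p = np.array([1e-6, 0.004, 0.03, 0.2])  # raw p-values
n = 20000                                # number of features tested
p_bonferroni = np.minimum(p * n, 1.0)    # pB = p*n, capped at 1
# equivalently, compare each raw p-value to alpha/n,
# e.g. 0.05 / 20000 = 2.5e-6
```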
Click the View extra details report icon under the View section to get more statistical information about the feature. If the task doesn't fail but certain statistical information is not generated, e.g. the p-value and/or fold change of a certain comparison is missing for some or all features, click this icon and mouse over the red exclamation icon for more information.
On the right of each contrast header, there is a volcano plot icon. Select it to display the volcano plot for the chosen contrast (Figure 11).
The feature list filter panel is on the left of the table (Figure 12). Click on the black triangle to collapse and expand the panel.
If the lognormal with shrinkage method was selected for GSA, a shrinkage plot is generated in the report (Figure 13). The X-axis shows the log2 value of average coverage. The plot helps to determine the threshold for low-expression features: if there is an increase before a monotone decreasing trend on the left side of the plot, you need to set a higher threshold on the low-expression filter. Detailed information on how to set the threshold can be found in the .
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
In addition to the issues addressed in , DESeq2 may generate missing values in the multiplicity adjustment columns (such as FDR) if "independent filtering" is enabled in Advanced Options.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
The Low-expression feature and Multiple test correction sections are the same as the matching GSA advanced options; see that discussion.
For this analysis, only genes with more than one transcript will be included in the calculation. The report format is the same as , with each row representing a transcript; besides statistics on the specified comparisons, there is also alt-splicing information at the right end of the table. That information is represented by the p-value of the interaction of transcript ID with the alt-splicing factor. Note that transcripts of the same gene should have the same p-value. A small p-value indicates a significant alt-splicing event, hence the table is sorted on that p-value by default (Figure 4).
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Stuart T, Butler A, Hoffman P, et al. Comprehensive integration of single-cell data. Cell, 2019.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, Baglaenko Y, Brenner M, Loh P-r, Raychaudhuri S. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods; 2019.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
If the task fails (no report is produced), please follow the directions in .
If the task report is produced, but the results are missing for some features, it may be possible to fix the issue by following the directions in the section.
[1] Kaplan-Meier (product limit) estimator:
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
| Symbol | Meaning |
|---|---|
| S | Sample (or cell for single cell data node) |
| F | Feature |
| Xsf | Value of sample S from feature F (if normalization is performed on a quantification data node, this would be the raw read counts) |
| TXsf | Transformed value of Xsf |
| C | Constant value |
| b | Base of log |
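Assuming the transformation documented with these symbols is TXsf = log_b(Xsf + C) (an assumption based on the symbol definitions above), a minimal sketch:

```python
import numpy as np

def log_transform(X, C=1, b=2):
    """TXsf = log_b(Xsf + C): add constant C, then take log base b."""
    return np.log(X + C) / np.log(b)

X = np.array([[0, 5, 99], [3, 0, 255]])  # raw counts (Xsf)
TX = log_transform(X)                    # transformed values (TXsf)
```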
Cells undergo changes to transition from one state to another as part of development, disease, and throughout life. Because these changes can be gradual, trajectory analysis attempts to describe progress through a biological process as a position along a path. Because biological processes are often complex, trajectory analysis builds branching trajectories where different paths can be chosen at different points along the trajectory. The progress of a cell along a trajectory from the starting point or root, can be quantified as a numeric value, pseudotime.
Partek Flow offers Monocle 2 and Monocle 3 methods.
Major updates in Monocle 3 (compared to Monocle 2) include:
Monocle 3 learns the principal trajectory graph in the UMAP space;
the principal graph is smoothed and small branches are excluded;
support for principal graphs with loops and convergence points;
support for multiple root nodes.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Variations in nucleotide sequence, in the form of single nucleotide variants (SNVs) and insertion and deletion events (INDELs), can either be neutral in nature or can have functional effects. Partek Flow provides all the tools necessary to interrogate and prioritize variants for further analysis. Variants stored in Variant Call Format (vcf) files can be analyzed to filter, annotate, summarize, visualize, and validate your panel of identified variants. Multiple vcf processing tools are available under the Variant analysis section of the context sensitive menu.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
In Partek Flow, we use tools from Monocle 2 [1] to build trajectories, identify states and branch points, and calculate pseudotime values. The output of Trajectory analysis includes an interactive scatter plot visualization for viewing the trajectory and setting the root state (starting point of the trajectory) and adds a categorical cell level attribute, State. From the Trajectory analysis task report, you can run a second task, Calculate pseudotime, which adds a numeric cell-level attribute, Pseudotime, calculated using the chosen root state. Using the state and pseudotime attributes, you can perform downstream analysis to identify genes that change over pseudotime and characterize branch points.
Note that trajectory analysis will only work on data with fewer than 600,000,000 observations (number of cells × number of features); for example, 100,000 cells × 3,000 filtered genes = 300,000,000 observations, which is under the limit. If your data set exceeds this limit, the Trajectory analysis task will not appear in the toolbox. Prior to performing trajectory analysis, you should:
1) Normalize the data
Trajectory analysis requires normalized counts as the input data. We recommend our default "CPM, Add 1, Log 2" normalization for most scRNA-Seq data. For alternative normalization methods, see our Normalization documentation.
2) Filter to cells that belong in the same trajectory
Trajectory analysis will build a single branching trajectory for all input cells. Consequently, only cells that share the biological process being studied should be included. For example, a trajectory describing progression through T cell activation should not include monocytes that do not undergo T cell activation. To learn more about filtering, please see our Filter groups (samples or cells) documentation.
3) Filter to genes that characterize the trajectory
The trajectory should be built using a set of genes that increase or decrease as a function of progression through the biological processes being modeled. One example is using differentially expressed genes between cells collected at the beginning of the process to cells collected at the end of the process. If you have no prior knowledge about the process being studied, you can try identifying genes that are differentially expressed between clusters of cells or genes that are highly variable within the data set. Generally, you should try to filter to 1,000 to 3,000 informative genes prior to performing trajectory analysis. The list manager functionality in Partek Flow is useful for creating a list of genes to use in the filter. To learn more, please see our documentation on Lists.
Dimensionality of the reduced space
While the trajectory is always visualized in a 2D scatter plot, the underlying structure of the trajectory may be more complex and better represented by more than two dimensions.
Scaling
You can choose to scale the genes prior to building the trajectory. Scaling removes any differences in variability between genes, while not scaling allows more variable genes to have a greater weight in building the trajectory.
Click on the task report; a 2D scatter plot will open in the Data Viewer (Figure 1).
The trajectory is shown as a black line. Branch points are indicated by numbers in black circles. By default, cells are colored by state. You can use the control panel on the left to color, size, and shape points by genes and attributes to help identify which state is the root of the trajectory.
To calculate pseudotime, you must choose a root state. The tip of the root state branch will have a value of 0 for pseudotime. Click any cell belonging to that state to select the state. The selected state will be highlighted while unselected cells are dimmed (Figure 2). Choose Calculate pseudotime in the Additional actions on the left control panel.
The Calculate pseudotime task generates a new Pseudotime result data node, which contains a Pseudotime annotation for each cell (Figure 3).
Open the Pseudotime result report; a 2D scatter plot will be displayed in the Data Viewer, colored by Pseudotime by default (Figure 4).
[1] Xiaojie Qiu, Qi Mao, Ying Tang, Li Wang, Raghav Chawla, Hannah Pliner, and Cole Trapnell. Reversed graph embedding resolves complex single-cell developmental trajectories. Nature methods, 2017.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Uniform Manifold Approximation and Projection (UMAP) is a dimensional reduction technique [1]. UMAP aims to preserve the essential high-dimensional structure and present it in a low-dimensional representation. UMAP is particularly useful for visually identifying groups of similar samples or cells in large high-dimensional data sets such as single cell RNA-Seq.
We recommend normalizing your data prior to running UMAP, but the task will run on any counts data node.
Click the counts data node
Click the Exploratory analysis section of the toolbox
Click UMAP
Click Finish to run
UMAP produces a UMAP task node. Opening the task report launches a scatter plot showing the UMAP results. Each point on the plot is a cell for single cell data or a sample for bulk data. The plot will open in 2D or 3D depending on the user preference.
Both t-SNE and UMAP are dimensional reduction techniques that are useful for identifying groups of similar samples in large high-dimensional data sets. A comparison of the techniques for visualizing single cell RNA-Seq data by the authors of UMAP suggests that UMAP runs faster, is more reproducible, gives a more meaningful organization of clusters, and preserves more information about the global structure of the data than t-SNE [2].
In our hands, we find UMAP to be more informative than t-SNE for many data sets. For example, the similarities and differences between clusters are clearly visible with UMAP, but more difficult to judge with t-SNE (Figure 1).
Sets the initialization mode. Options are Spectral and Random.
Spectral - good initial points are chosen using spectral embedding (more accurate)
Random - random initial points are chosen (faster)
Choose whether to run UMAP on all samples together or on each sample individually.
Checking the box will run UMAP on each sample individually.
This option appears when there are multiple feature types in the input data node (e.g., CITE-Seq data).
Select Any to run on all features or pick a feature type.
UMAP preserves the local structure of the data by focusing on the distances between each point and its nearest neighbors. Local neighborhood size is the number of nearest neighbors to consider.
You can adjust this value to prioritize global or local relationships. Smaller values will give a more local view, while larger values will give a more global view (Figure 2). Default is 30.
The effective minimum distance between embedded points. Smaller values will create a more clustered embedding, while larger values will create a more evenly dispersed embedding.
You can decrease this value to make clusters more tightly packed or increase it to make them looser (Figure 3). Default is 0.3.
The metric to use when computing distances in high-dimensional space. Options are Euclidean, Manhattan, Chebyshev, Canberra, Bray Curtis, and Cosine. Default is Cosine.
UMAP uses an iterative algorithm to optimize the low-dimensional representation. The value 0 corresponds to the default, which chooses the number of iterations based on the size of the input data. More iterations will result in a more accurate embedding, but will take longer to run. Default is 0.
Several parts of UMAP utilize a random number generator to provide an initial value. Default is 42. To reproduce the results, use the same random seed across runs.
Output a UMAP table data node that can be downloaded. The 2D UMAP coordinates are labeled Feature 1 and Feature 2; the 3D UMAP coordinates are labeled Feature 3, 4, and 5. Default is disabled.
UMAP uses principal components as its input. The number of principal components to use is set here. Default is 10.
We recommend using the PCA task to determine the optimal number of principal components for your data.
Options are equally or by variance. Feature values can be standardized prior to PCA so that the contribution of each feature does not depend on its variance. To standardize, choose equally. To take variance into account and focus on the most variable features, choose by variance. Default is by variance.
You can choose to log transform the data prior to running PCA as part of UMAP. Default is disabled.
If you are normalizing the data, choose a log base. Default is 2 when Log transform data is enabled.
If you are normalizing the data, choose an offset. Default is 1 when Log transform data is enabled.
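Most of the parameters above have close analogs in the open-source umap-learn package. The sketch below is an illustration under that assumption, not Partek Flow's internal code:

```python
import numpy as np
import umap
from sklearn.decomposition import PCA

X = np.random.rand(1000, 2000)  # cells x genes, already normalized (toy)

# UMAP runs on principal components extracted from the data
pcs = PCA(n_components=10).fit_transform(X)

embedding = umap.UMAP(
    n_components=2,
    n_neighbors=30,   # local neighborhood size
    min_dist=0.3,     # minimum distance between embedded points
    metric="cosine",  # distance metric in high-dimensional space
    init="spectral",  # or "random" for faster, less accurate starts
    random_state=42,  # fixed seed for reproducibility
).fit_transform(pcs)
```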
[1] McInnes L and Healy J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv, 2018, e-prints 1802.03426,
[2] Becht E, McInnes L, Healy J, Dutertre A-C, Kwok I, Guan Ng L, Ginhoux F, and Newell E, Dimensionality reduction for visualizing single-cell data using UMAP, Nature Biotechnology, 2019, 37, 38-44.
Multi-omics single cell analysis is based on simultaneous detection of different types of biological molecules on the same cells. Common multi-omics techniques include feature barcoding or CITE-seq (cellular indexing of transcriptomes and epitopes by sequencing) technologies, which enable parallel assessment of gene and protein expression. Specific bioinformatics tools have been developed to enable scientists to integrate results of multiple assays and learn relative importance of each type (or each biological molecule) in identification of cell types. Partek Flow supports weighted nearest neighbor (WNN) analysis (1), which can help combine output of two molecular assays.
This task can only be performed on data nodes containing PCA scores, i.e. PCA output nodes and graph-based clustering output nodes generated from PCA nodes. To start, select the PCA data node of one of the assays (e.g. gene expression) and go to Exploratory analysis > Find multimodal neighbors in the toolbox. On the task setup page, use the Select data node button to point to the PCA data node of the other assay (e.g. protein expression); a node is already selected by default (Figure 1).
When you click the Select data node button, Partek Flow will open another dialog showing your current pipeline (Figure 2). Data nodes that can be used for WNN are shown in the color of their branch; other nodes are disabled (greyed out). To pick a node, left-click on it and then click the Select button.
The selected data node is shown under the Select data node button. If you made a mistake, use the Clear selection link (Figure 1).
If graph-based clustering has been performed on a PCA data node, the clustering output node retains the PCA scores from its input data, so graph-based clustering data nodes can also serve as input for the WNN task.
To customize the Advanced options, select the Configure link (Figure 1). At present, you can only change the number of nearest neighbors for each modality (the k.nn option of the Seurat package); the default value is 20 (Figure 3). An illustration of how to use that option to assess the robustness of WNN analysis can be found in Hao et al. (1). The nearest neighbor search method is K-NN and the distance metric is Euclidean.
To launch the Find multimodal neighbors task, click the Finish button on the task setup page (Figure 1). For each cell, the WNN algorithm calculates its closest neighbors based on a weighted combination of RNA and protein similarities. The output of the Find multimodal neighbors task is a WNN data node.
For downstream analysis, you can launch UMAP or graph-based clustering tasks on a WNN node. For example, Figure 4 shows a snippet of an analysis of a feature barcoding data set; gene expression and protein expression data were processed separately, and then Find multimodal neighbors was invoked on the two respective PCA data nodes. UMAP and graph-based clustering tasks were then performed on the WNN node.
For an excellent illustration of the advantages of the WNN algorithm for identification of cell types, please see this blog post.
Hao Y, Hao S, Andersen-Nissen E, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573-3587.e29. doi:10.1016/j.cell.2021.04.048
Hierarchical clustering is a statistical method used to assign similar objects into groups called clusters. It is typically performed on results of statistical analyses, such as a list of significant genes/transcripts, but can also be invoked on the full data set, as a part of exploratory analysis.
Hierarchical clustering is an unsupervised technique, meaning that the number of clusters is not specified upfront. In the beginning, each row and/or column is considered a cluster. The two most similar clusters are merged, and merging continues until all objects are in the same cluster. Hierarchical clustering produces a tree (called a dendrogram) that shows the hierarchy of the clusters.
This tutorial will illustrate how to:
To invoke hierarchical clustering, select a data node containing count data (e.g. Gene counts, Normalized counts, Single cell counts), or a Feature list data node (to cluster significant genes/transcripts) and then click on the Hierarchical clustering / heat map option in the context sensitive menu (Figure 1).
The hierarchical clustering setup dialog (Figure 2) enables you to control the clustering algorithm. Starting from the top, you can choose to plot a Heatmap or a Bubble map (clustering can be performed on both plot types). Next, perform Ordering by selecting Cluster for either feature order (genes/transcripts/proteins) or cell/sample/group order or both. Note the context-sensitive image that helps you decide to either perform hierarchical clustering (dendrogram) or assign order (arrow) for the columns and rows to help you orient yourself and make decisions (In Figure 2 below, Cluster is selected for both options so a dendrogram is shown in the image).
When Assign order is chosen, the default order of cells/samples/groups (rows) is based upon the labels as displayed in the Data tab, and the order of features (columns) depends on the input data node.
Feature order can be assigned by selecting a managed list (e.g. feature lists saved from report nodes, or lists added under List management in the settings) from the drop-down. This limits the plot to the features in the list, ordered as they are listed. If a feature from the list is not available in the input data node, it will not be shown in the plot. Note that if none of the features are available in the data node, the task cannot be performed and an error message will be shown.
Cell/Sample/Group order can also be assigned by choosing an attribute from the drop down list. Click and drag to rearrange categorical attributes; numeric attributes can be sorted in ascending or descending order (note the arrows in the image which are different from the dendrogram for Cluster) (Figure 3).
If you do not want to cluster all the samples, but select a subset based on a specific sample or cell attribute (i.e. group membership), check Filter cells under Filtering and set a filtering rule using the drop down lists (Figure 4). Notice the drop-down lists allow more than one factor (when available) to be selected at a time. When configuring the filtering rule, use AND to ensure all conditions pass for inclusion and use OR for any conditions to pass.
Hierarchical clustering uses distance metrics to sort rows and columns by similarity; the default is Average Linkage. This can be adjusted by clicking Configure under Advanced options (Figure 5). You can also choose how the data is scaled (sometimes referred to as normalized). There are three Feature scaling options: Standardize (the default for a heatmap) makes each column's mean zero and standard deviation 1, so that all features (e.g. genes or proteins) have equal weight; standardized values are also known as Z-scores. Shift makes each column's mean zero. Choose None to skip scaling and perform clustering on the values in the input data node (this is the default for a bubble map). If a bubble map is scaled, scaling is performed on the group summary method (color).
Cluster distance metric for cells/samples and features is used to determine how the distance between two clusters will be calculated:
Single Linkage: the distance between two clusters is determined by the distance of the closest objects in the two clusters
Complete Linkage: the distance between two clusters is equal to the distance between the two furthest members of those clusters
Average Linkage: the average distance between all the pairs of objects in the two different clusters is used as the measure of distance between the two clusters
Centroid method: the distance between two clusters is equal to the distance between the centroids of those clusters
Ward's method: the distance between two clusters is designed to minimize the size of an error measure based on the sum of squares
Point distance metric is used to determine the distance between two rows or columns. For more detailed information about the equations, we refer you to the distance metrics chapter below.
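The same combination of a cluster distance metric (linkage method) and a point distance metric can be sketched with SciPy; this is illustrative only, not Flow's implementation:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(30, 500)  # samples x features (e.g., standardized values)

# method = cluster distance metric: single, complete, average, centroid, ward
# metric = point distance metric between rows, e.g. euclidean
Z = linkage(X, method="average", metric="euclidean")
dendrogram(Z)  # tree showing the hierarchy of clusters
```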
The output of a Hierarchical clustering task can be a heatmap (Figure 6) or a bubble map with or without dendrograms depending on whether you performed clustering on cells/samples/groups or features. By default, samples are on rows (sample labels are displayed as seen in the Data tab) and features (depending on the input data) are on columns. Colors are based on standardized expression values (default selection; performed on the fly). Dendrograms show clustering of rows (samples) and columns (variables).
Depending on the resolution of your screen and the number of samples and variables (features) to be displayed, some binning may be involved: if there are more samples/genes than pixels, values of neighboring rows/columns will be averaged together. Use the mouse wheel to zoom in and out. When you zoom in far enough on the heatmap, each cell represents one sample/gene. When you mouse over the row dendrogram or label area and zoom, only the rows zoom in/out; the binning on the columns remains the same. Similarly, when you mouse over the column dendrogram or label area and zoom, only the columns zoom in/out; the binning on the rows remains the same. To move the map around when zoomed in, hold the left mouse button and drag the map. The plot can be saved as a full-size image or as the current view; when Save image is clicked, a prompt will ask how you would like to save the image.
The Hierarchical clustering task can also be used to plot a bubble map. Let's go through the steps to make a bubble map (Figure 7):
Choose to plot a Bubble map (note the selection of a bubble map in the image which is different from the heatmap). This will open the Bubble map settings.
Configure the Bubble map settings. First, Group cells by an available categorical attribute (e.g. cell type). Next, summarize the group's first dimension by color (Group summary method), then choose an additional dimension to plot as size (Additional statistic) using the drop-down lists. If these settings are not adjusted, the defaults will generate two descriptive statistics: the group mean plotted by color and the percent of cells plotted by size. Hierarchical clustering can be performed on the first dimension (color), which is the Group summary method. The second dimension (size), the Additional statistic, is not required; it is selected by default but can be unchecked.
Ordering the plot columns (Feature order) and rows (Group order) behaves the same as for a heatmap. In this example, Ordering both features and groups by Cluster uses hierarchical clustering with the default distance metrics (these can be changed under Configure in the Advanced options section). Alternatively, Assign order to features using a managed (saved) feature list or the default order, which depends on the input data. Assign order for groups can be used to rearrange the attribute by drag and drop, in ascending or descending order, or in the default order, which is how the labels are displayed in the Data tab.
Filtering can be applied to the groups by checking Filter cells then specifying the logical operations to filter by (this is the same as a heatmap).
Advanced options let the user perform Feature scaling (e.g. Standardize by z-score), though in a bubble map the default is None. They also allow the user to change the Group clustering and Feature clustering options by altering the Cluster distance metric and Point distance metric (similar to a heatmap).
There are plot Configuration/Action options for the Hierarchical clustering / heatmap task which apply to both the heatmap and bubble map in the Data viewer (below): Axes, Heatmap, Dendrograms, Annotations, and Description. Click on the icon to open these configuration options.
This section controls the Content or data source used to draw the values in the heatmap or bubble map and also the ability to transpose the axes. The plot is a color representation of the values in the selected matrix. Most of the data nodes contain only one matrix, so it will just say Matrix for the chosen data node. However, if a data node contains multiple matrices (e.g. descriptive statistics were performed on cluster groups for every gene like mean, standard deviation, percent of cells, etc) each statistic will be in a separate matrix in the output data node. In this case, you can choose which statistic/matrix to display using the drop-down list (this would be the case in a bubble map).
To change the orientation (switch the columns and rows) of the plot, click on the Transpose toggle switch.
Row labels and Column labels can be turned on or off by clicking the relevant toggle switches.
The label size can be changed by specifying the number of pixels using Max size and Font. If an Ensembl annotation model has been associated with the data, you can choose to display the gene name or the Ensembl ID using the Content option.
This section is used to configure the color, range, size, and shape of the components in the heatmap.
In the color palette horizontal bar, the left-most color represents the lowest value and the right-most color represents the highest value in the matrix data. Note that when you zoom in or out, the lowest and highest values captured by the color palette may change. By default, there are 3 color stops: the minimum, middle, and maximum color values of the default range calculated on the matrix. Left-click and drag the middle color stop left or right to change the value it represents; left-click it once to change its color and value. Click the (X) to remove a color stop.
Click on the color square or the adjacent triangle to choose a color to represent the value. This will display a color picker dialog which allows selection of a color, either by clicking or by typing a HEX color code, then clicking OK.
The shape of the heatmap cell (component) can be configured either as a rectangle or circle by selecting the radio button under Shape.
If cluster analysis is performed on samples and/or features, the result will be displayed as dendrograms. By default, the dendrograms are all colored in black.
The color of the dendrograms can be configured.
Click on the color square or its triangle to choose a different color for the dendrogram.
Choose an attribute from the Row annotation drop-down list. Multiple attributes can be chosen from the drop-down list and can be reordered by clicking and dragging the groups below the drop-down list. Each attribute is represented as an annotation bar next to the heatmap. Different colors represent the different groups in the attribute.
The width of the annotation bar can be changed using the Block size slider when the Show labels toggle switch is on.
The annotation label font size can be changed by specifying the size in pixels.
The Fill blocks toggle switch adds or removes color from the annotation labels.
Description is used to modify the Title and toggle on or off the Legend.
The heatmap has several different mouse modes which modify the way the plot responds to the mouse buttons. The mode buttons are in the upper right corner of the heatmap. Clicking one of these buttons puts the heatmap into that mode.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
AUCell is a tool to identify cells that are actively expressing genes within a gene list [1]. For each input gene list, AUCell calculates a value for each cell by ranking all genes by their expression level in the cell and identifying what proportion of the genes from the gene list fall within the top 5% (default cutoff) of genes. This method allows the AUCell value to represent the proportion of genes from the gene list that are expressed in the cell and their relative expression compared to other genes within the cell. Because this is a rank-based method and is calculated for each cell individually, AUCell can be run on raw or normalized data. As an AUCell value is the proportion of genes from the list that are within the top percentile of expressed genes, AUCell values can range from 0 to 1, but may have a more restricted range.
AUCell values can be used directly as input for downstream analysis, such as clustering. Another common use is to set an AUCell value cutoff for expressing vs. not expressing and use this to classify cells. AUCell values will separate cells most effectively when the genes in the list are highly and specifically expressed in a population of cells. If the genes are specifically expressed, but not highly expressed, the AUCell value will not be as useful.
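Following the simplified description above (the published method computes an area under a gene-recovery curve, so real AUCell values differ in detail), here is a toy sketch of the per-cell score:

```python
import numpy as np

def aucell_like_score(expression, gene_names, gene_list, top_pct=5):
    """Fraction of genes from gene_list found among the top `top_pct`
    percent of genes ranked by expression in one cell (simplified sketch,
    not the published AUCell algorithm)."""
    n_top = max(1, int(len(gene_names) * top_pct / 100))
    # indices of the n_top most highly expressed genes in this cell
    top_idx = np.argsort(expression)[::-1][:n_top]
    top_genes = set(np.asarray(gene_names)[top_idx])
    hits = sum(1 for g in gene_list if g in top_genes)
    return hits / len(gene_list)
```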
AUCell can be run on any single cell counts data node.
Click the single cell counts data node
Click the Exploratory analysis section in the toolbox
Click AUCell
Choose gene lists by clicking and dragging them to the panel on the right or clicking the green plus that appears after mousing over a gene list (Figure 1)
Click Finish to run
AUCell produces an AUCell result data node. The AUCell result data node includes the input counts data and adds the AUCell scores to the original data as a new data type, AUCell Values. AUCell values for each input feature list are included as features in the AUCell result data node. These features created by AUCell are named after the feature list (e.g., B cells, Cytotoxic cells).
Because the AUCell values are added as features, they can be used as input for clustering, differential analysis, and visualization tasks.
To produce a data node containing only the AUCell values, use Split matrix to split the AUCell result data node into separate data nodes for each of its data types. This can be helpful if you intend to perform downstream analysis on the AUCell values. To perform differential analysis, it is advisable to normalize the values by adding a small offset (e.g. 1E-9) and applying a Logit transformation with base Log2 using the Normalization task. This makes the values continuous and suitable for differential analysis with methods such as ANOVA/LIMMA-trend/LIMMA-voom, Non-parametric ANOVA, or Welch's ANOVA. For differential analysis, please check that the Low-value filter is set to None and that the values are correctly recognized as Log2 transformed in the Advanced settings.
If an AUCell result data node or other downstream data node containing AUCell Values is used as the input for AUCell, the additional AUCell values will be added as additional features of the AUCell values data type in the new AUCell result data node.
For each gene set, AUCell computes the intersection between the gene list and the input data set. If the intersection size is below the specified threshold, the gene set is ignored and no AUCell score is calculated for it. Default is 5.
To calculate the AUCell value, genes are ranked and the fraction of genes from the gene list that are above the percentile cutoff is the AUCell value. This parameter sets the percentile cutoff. Default is 5.
[1] Aibar, S., González-Blas, C. B., Moerman, T., Imrichova, H., Hulselmans, G., Rambow, F., ... & Atak, Z. K. (2017). SCENIC: single-cell regulatory network inference and clustering. Nature methods, 14(11), 1083.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
To analyze scATAC-seq data, Partek Flow introduced latent semantic indexing (LSI) [1]. LSI combines term frequency-inverse document frequency (TF-IDF) normalization with singular value decomposition (SVD), returning a reduced-dimension representation of a matrix. Although SVD and principal components analysis (PCA) are different techniques, SVD has a close connection to PCA; indeed, PCA can be computed as an application of the SVD. For users more familiar with scRNA-seq, you can think of the SVD output as analogous to the output of PCA. Similarly, the statistical interpretation of singular values is in terms of the variance in the data explained by the various components: the singular values produced by the SVD are ordered from largest to smallest and, when squared, are proportional to the amount of variance explained by the corresponding singular vector.
The SVD task can be invoked from the Exploratory analysis section of the toolbox by clicking any single cell counts data node (Figure 1). We recommend running SVD on normalized data, particularly TF-IDF normalized counts for scATAC-seq analysis.
To run SVD task,
Click a single cell counts data node
Click the Exploratory analysis section in the toolbox
Click SVD
The SVD dialog asks only for the number of singular values to compute (Figure 2). By default, 100 singular values are computed; the number can be adjusted or typed in directly. Simply click the Finish button to run the task with default settings.
The task report for SVD is similar to PCA. Its output will be used for downstream analysis and visualization, including Harmony (Figure 3).
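The TF-IDF plus truncated SVD combination can be sketched with scikit-learn. This illustrates the LSI idea on a toy peak-by-cell matrix and is not Partek Flow's implementation:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD

counts = np.random.poisson(0.2, size=(500, 10000))  # cells x peaks (toy)

tfidf = TfidfTransformer().fit_transform(counts)      # TF-IDF normalization
svd = TruncatedSVD(n_components=100, random_state=0)  # truncated SVD
lsi = svd.fit_transform(tfidf)                        # reduced representation

# squared singular values are proportional to the variance explained
print(svd.explained_variance_ratio_[:5])
```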
Cusanovich, D., Reddington, J., Garfield, D. et al. The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature 555, 538–542 (2018). https://doi.org/10.1038/nature25981
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
Variations in nucleotide sequence, in the form of single nucleotide variants (SNVs) and insertion and deletion events (INDELs), can exist within the germline or can be acquired by somatic alterations. Partek Flow provides pipeline creation tools to identify both SNVs and INDELs using aligned reads generated from targeted, whole exome, or whole genome DNA-Seq (or RNA-Seq) data. Detection of these variants can be performed by comparison against either the reference sequence utilized for alignment or among paired samples in a project. Tools for variant detection are performed on either Aligned reads or Filtered reads data nodes (Figure 1), and the Detect variants task node will produce a Variants data node. The Variants data node will contain Variant Call Format (vcf) files for each sample in the project. Three detection tools, each employing unique algorithms to identify variants in aligned sequence data, are available under the Variant callers section of the context sensitive menu:
Figure 1. Variant callers available from an Aligned reads data node
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
A very popular variant detection approach that performs well in many situations, FreeBayes (version 1.0.1) employs a Bayesian statistical framework to determine the most likely combination of genotypes in the sample(s) at each position in a reference genome for any number of individuals from a population. It is haplotype-based, calling variants based on the literal sequences of reads aligned to a particular target and not their precise alignment. This method can identify both single nucleotide variants and insertions/deletion events. Information on the model underlying the variant detection are detailed by Garrison et al.1
Selecting FreeBayes from the context sensitive menu will bring up the Freebayes task dialog (Figure 1), which contains two sections: Select Reference sequence and Advanced options.
Select Reference sequence will specify the reference assembly to utilize for variant detection. If the alignment was generated in Partek Flow, the Assembly will be displayed as text in the section, and you do not have the option to change the reference. In the event that alignment was performed outside of Partek Flow, you will need to select the appropriate Assembly utilized for alignment in the drop-down list. Assemblies previously added to library files (see Library File Management) will be available for selection or New assembly… can be utilized to import the reference sequence to library files from within the task.
Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. July 2012. https://arxiv.org/abs/1207.3907.
If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.
CellPhoneDB addresses the challenges of studying cell-cell communication in scRNA-seq and spatial data. It allows researchers to move beyond measuring gene expression and delve into the complex world of cellular communication. By analyzing scRNA-seq or spatial data through the lens of CellPhoneDB, researchers can identify potential signaling pathways and communication networks between different cell types within the sample. Partek Flow wraps the statistical analysis pipeline (method 2) from CellPhoneDB v5 [1][2] for this purpose.
Invoke the CellPhoneDB task in Partek Flow from a normalized counts data node using the Exploratory analysis section (Figure 1). We recommend running CellPhoneDB on the log normalized data directly.
To run CellPhoneDB task,
Click a Normalized counts data node
Click the Exploratory analysis section in the toolbox
Click CellPhoneDB
The GUI is simple and easy to understand; for each option, the grey description text explains the details (Figure 2). If the dataset is single cell RNA-Seq, the Micro environment file is not needed. However, for spatial data you will most likely want to provide the Micro environment file to make use of its spatial context. By default, a value of 0.10 is used as the threshold to select which cells in a cluster are used for the analysis; the number can be adjusted or typed in directly. Simply click the Finish button to run the task with default settings.
Double-click the CellPhoneDB result data node to open the task report in the Data Viewer. It is a heatmap summarizing how many significant interactions were identified for each cell type pair (Figure 3).
To explore further, the Explore CellPhoneDB results task allows users to filter CellPhoneDB results by specifying cell type pairs and genes of interest. After clicking the CellPhoneDB data node (Figure 4a), you will find it is the only task offered under the Exploratory analysis menu (Figure 4b). Its dialog is likewise straightforward (Figure 4c). Genes of interest are data dependent and usually come from published results of similar studies or from differential gene analysis between conditions (e.g., cancer patients vs. healthy controls). Once set up, click the Finish button to submit the job.
Double-clicking the Output matrix data node opens the task report in the Data Viewer. It is another variant of the heatmap, displaying how the genes of interest interact in the defined cell type pairs (Figure 5). The example plot also indicates the data come from two environments. For instructions on setting up the Micro environment file for your spatial study, refer to Figure 2. CellPhoneDB analysis classifies signaling pathways for genes of interest; these classifications are then used to annotate the heatmap within the task report.
It is important to note that the interactions are not symmetric. The authors state that, "Partner A expression is considered for the first cluster/cell type (clusterA), and partner B expression is considered on the second cluster/cell type (clusterB). Thus, IL12-IL12 receptor for clusterA-clusterB (i.e. the receptor is in clusterB) is not the same as IL-12-IL-12 receptor for clusterB-clusterA (i.e. the receptor is in clusterA), and will have different values." [3][4]
The interactions come from the CellPhoneDB database, a manually curated repository of reviewed molecular interactions with demonstrated evidence for a role in cellular communication. [5]
Troulé et al. (2023). CellPhoneDB v5: inferring cell-cell communication from single-cell multiomics data. https://arxiv.org/pdf/2311.04567.pdf
In Partek Flow, we use tools from Monocle 3 (1) to build trajectories, identify states and branch points, and calculate pseudotime values. The output of the Trajectory analysis task includes an interactive 2D/3D visualization for viewing the trajectory trees and setting the root states (starting points of the trajectories). From the Trajectory analysis report, you can run a second task, Calculate pseudotime, which adds a numeric cell-level attribute, Pseudotime, calculated using the chosen root states.
Trajectory analysis by Monocle 3 requires data normalization and preprocessing. For normalization, we suggest first using the Normalization and scaling section of the toolbox to normalize by counts per million (CPM), add an offset of 1, and log2 transform. After that, launch Trajectory analysis on the Normalized counts node; this input node cannot have zero values.
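As an illustration of that recipe, here is a minimal sketch of the CPM, offset, and log2 steps in Python; it mirrors the description above, not Partek Flow's internal implementation:

```python
import numpy as np

def cpm_log2(counts, offset=1.0):
    # counts: 2D array, rows = cells/samples, columns = genes/features
    per_million = counts / counts.sum(axis=1, keepdims=True) * 1e6  # CPM
    return np.log2(per_million + offset)  # the offset avoids log2 of zero

raw = np.array([[10.0, 0.0, 90.0],
                [5.0, 5.0, 40.0]])
normalized = cpm_log2(raw)
```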
According to the Monocle 3 authors, you may want to filter in the top 5,000 genes with the highest variance (2,000 genes for datasets with fewer than 5,000 cells, and 300 genes for datasets with fewer than 1,000 cells) (1). These numbers should be used as guidance for a first-pass analysis and may need to be optimized, depending on the project at hand and the biological question.
To run the Trajectory analysis tool, select the Normalized counts data node (or equivalent) and go to the toolbox: Exploratory analysis > Trajectory analysis.
The configuration dialog presents four options (Figure 1).
Dimensionality of reduced space. This option specifies the number of UMAP dimensions that the original data are reduced to in order to learn the trajectory tree (the dimensionality of the original data equals the number of genes). The default is two, meaning that the trajectory plot will be drawn in two dimensions. To get a 3D trajectory plot, increase this option to 3.
Scaling. Normalized expression values can be further transformed by scaling to unit variance and zero mean (i.e. converting to Z score). The use of this option is recommended (1).
Data is logged. Select this option if the data have already been log-transformed upstream. When selected, Monocle 3 will skip the log2 step on the input data (see below).
Programmatically calculate default root nodes. If not selected (the default), the user has to specify the root nodes of the trajectory tree manually. Depending on the available meta-data, Monocle 3 may be able to pick the root nodes programmatically (see below for details).
Under the hood, Monocle 3 will perform a log2 transformation of the gene count matrix (if Data is logged was unselected), scale the matrix (if Scaling was selected), and project the gene count matrix onto the top 50 principal components. Next, dimensionality reduction is performed by UMAP (using the default settings of the reduce dimension command).
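A rough sketch of this sequence of steps, using scikit-learn and umap-learn as stand-ins for Monocle 3's internals; the parameter defaults here are illustrative, not Monocle 3's exact settings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import umap  # umap-learn package

def trajectory_embedding(x, data_is_logged=True, scaling=True, n_umap=2):
    if not data_is_logged:
        x = np.log2(x + 1)                     # skipped if already logged upstream
    if scaling:
        x = StandardScaler().fit_transform(x)  # unit variance, zero mean
    pcs = PCA(n_components=50).fit_transform(x)  # assumes >= 50 cells and genes
    return umap.UMAP(n_components=n_umap).fit_transform(pcs)
```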
The result of running Trajectory analysis in Partek Flow is the Trajectory result data node. Double-clicking the node opens a Data Viewer window with the trajectory plot (Figure 2). The cell trajectory graph shows the position of each cell (blue dot) with respect to the UMAP coordinates (axes). Cell trajectories (one or more, depending on the data set) are depicted as black lines. Gray circles are trajectory nodes (i.e. cell communities).
To show or hide the cell trajectory tree and trajectory nodes, select Axes in the Configure section and, in the upper-right corner of the dialog, use the Extra data drop-down options (Figure 2).
To perform pseudotime analysis, you need to point to the cells at the beginning of the biological process you are interested in, for example, cells at the earliest stage of a differentiation sequence. There are two ways to perform pseudotime analysis in Partek Flow, depending on how the root nodes (the cells at the beginning of pseudotime) are defined.
Manual selection of root node. The user has to specify the root nodes (one or more).
Automatic selection of the root node. The root node is picked by the algorithm.
If you want to manually pick the root nodes, leave the option Programmatically calculate default root nodes unselected when setting up the Trajectory analysis.
To start, select the root nodes (gray circles in the trajectory tree) by left-clicking. If the trajectory result consists of more than one trajectory tree, you can specify more than one root node, e.g. one root node per trajectory tree (Ctrl & click). If no root node is specified for a tree, that tree will not be included in the pseudotime calculation. Figure 4 shows an example where seven root nodes were identified.
Once you have identified all the root nodes, click the Additional button in the Tools section on the left panel, then push the Calculate pseudotime button in the dialog (Figure 5).
As a result, the cells will be annotated by pseudotime, using a green-to-red gradient (start and end, respectively) (Figure 6). If no root node has been identified for a particular tree, its cells will be omitted from the pseudotime calculation and colored in black (Figure 8).
The pseudotime calculation displays the structure of the graph using black lines. The circles with numbers (cell nodes) on the black lines represent special points. There are three types of cell nodes:
Root node (white). Root nodes are start points of the pseudotime and were defined by the user in the previous step (e.g. node 4 in Figure 7).
Branch node (black). Branch nodes indicate where the trajectory tree forks out; i.e. each branch represents a different cell fate or different trajectory (e.g. nodes 3-6, and 8 in Figure 7).
Leaf (light gray). Leaves correspond to different cell fates / different trajectory outcomes (e.g. nodes 5, 9, and 12 in Figure 7). The leaves correspond to cell states of Monocle 2.
The numbers within the circles are provided for reference purposes only. The intermediate nodes from the previous step have been removed.
If suitable meta-data are available, it is possible to automatically select the root node. For example, you may know which cells were harvested from the earliest time points. The cells need to be annotated by that information (Annotate Cells task) before running Trajectory analysis. The annotation will, in turn, be available in the Trajectory analysis setup dialog, upon selecting the Programmatically calculate default root nodes option.
Attribute for root nodes. The drop down list will show the available cell-level attributes. Specify the one which should be used to identify the root nodes.
Attribute value for root nodes. The drop down list will show the content of the attribute selected under Attribute for root nodes. Specify the entry that corresponds to the earliest time point.
Once the options have been set, Monocle 3 will first group the cells according to which trajectory node they are nearest to. It then calculates the fraction of the cells from the earliest time point at each trajectory node. Finally, it picks the node with the highest prevalence of the early cells and treats it as the root node.
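A sketch of that heuristic in Python, with hypothetical variable names (this is not Monocle 3's actual code):

```python
import numpy as np

def pick_root_node(cell_coords, node_coords, is_early_cell):
    # Assign each cell to its nearest trajectory node (distances in UMAP space).
    dists = np.linalg.norm(cell_coords[:, None, :] - node_coords[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    # Fraction of earliest-time-point cells grouped at each trajectory node.
    early_fraction = np.array([
        is_early_cell[nearest == k].mean() if (nearest == k).any() else 0.0
        for k in range(len(node_coords))
    ])
    return early_fraction.argmax()  # node with the highest prevalence of early cells
```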
Cao J, Spielmann M, Qiu X, Huang X, Ibrahim DM, Hill AJ, Zhang F, Mundlos S, Christiansen L, Steemers FJ, Trapnell C, Shendure J. The single-cell transcriptional landscape of mammalian organogenesis. Nature. 2019 Feb;566(7745):496-502. doi: 10.1038/s41586-019-0969-x. Epub 2019 Feb 20. PMID: 30787437; PMCID: PMC6434952.
Principal components (PC) analysis (PCA) is an exploratory technique used to describe the structure of high-dimensional data by reducing its dimensionality. It is a linear transformation that converts n original variables (typically genes or transcripts) into n new variables, called PCs, which have three important properties:
PCs are ordered by the amount of variance explained
PCs are uncorrelated
PCs explain all variation in the data
PCA is a principal axis rotation of the original variables that preserves the variation in the data. Therefore, the total variance of the original variables is equal to the total variance of the PCs.
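This property is easy to verify numerically; a minimal sketch using scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

x = np.random.default_rng(0).normal(size=(100, 5))
scores = PCA().fit_transform(x)  # keep all components

total_original = x.var(axis=0, ddof=1).sum()
total_pcs = scores.var(axis=0, ddof=1).sum()
assert np.isclose(total_original, total_pcs)  # total variance is preserved
```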
If read quantification (i.e. mapping to a transcript model) was performed by Partek E/M algorithm, PCA can be invoked on a quantification output data node (Gene counts or Transcript counts) or, after normalization, on a Normalized counts data node. Select a node on the canvas and then PCA in the Exploratory analysis section of the context sensitive menu.
There are two options for how features contribute (Figure 1):
equally: all features are standardized to a mean of 0 and a standard deviation of 1. This gives all features equal weight in the analysis and is the default option for e.g. bulk RNA-seq data.
by variance: the analysis gives more emphasis to features with higher variances. This is the default option for e.g. single cell RNA-seq data.
If the input data node is on a linear scale, you can apply a log transformation as part of the PCA calculation.
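A sketch contrasting the two contribution options above, assuming standardization is how equal contribution is achieved:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

x = np.random.default_rng(1).normal(size=(50, 10))
x[:, 0] *= 100  # one feature with a much larger variance

by_variance = PCA(n_components=3).fit(x)  # high-variance feature dominates PC1
equally = PCA(n_components=3).fit(StandardScaler().fit_transform(x))  # equal weights
```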
The PCA task creates a new task node. To open it and see the result, either select the PCA task node and go to Task result in the context sensitive menu, or double-click on the PCA task node. The report contains eigenvalues, PC projections, component loadings, and mapping error information for the first three PCs.
When the PCA node is opened in the Data viewer, the default layout shows a 3D scatterplot, a Scree plot with eigenvalues, and a Component loadings table (Figure 2). Each dot on the 3D scatter plot represents an observation; the first three PCs are shown on the X-, Y-, and Z-axes respectively, with the information content of each PC in parentheses.
As an exploratory tool, the PCA scatterplot is used to view any groupings in the data set and generate hypotheses based on the outcome, or to spot possible outliers.
To rotate the 3D scatter plot, left-click & drag. To zoom in or out, use the mouse wheel. Click and drag the legend to move it to a different location on the viewer.
Detailed configuration on PCA plot can be found by clicking Help>How-to videos>Data viewer section.
In the Data viewer, when a PCA data node is selected from Get Data under Setup (left panel), the node can be dragged and dropped onto the screen (Figure 3); you will then have the option to select a scree plot and tables.
Hovering over a point on the line displays detailed information about that PC. The scree plot shows how much variation each PC represents, so it is often used to determine the number of principal components to keep for downstream analysis (e.g. tSNE, UMAP, graph-based clustering). The "elbow" point of the graph, where the eigenvalues seem to level off, should be considered as a cutoff point for downstream analysis.
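Besides eyeballing the elbow, a common programmatic alternative is to keep enough PCs to explain a target fraction of the variance; a sketch with an illustrative 90% threshold:

```python
import numpy as np
from sklearn.decomposition import PCA

x = np.random.default_rng(2).normal(size=(100, 20))
pca = PCA().fit(x)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, 0.90)) + 1  # PCs covering 90% of variance
```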
In the table, each row is a feature and each column represents a PC; the values are correlation coefficients. Under Content, there is a PCA projections option; switch to this option to display the projection table (Figure 6). In this table, each row is an observation, each column is a PC, and the values are the PC scores.
SAMtools1 (version 1.2) utilizes the mpileup command to look at observed bases in the reads covering every genomic position represented in the aligned sequence data and calculate the likelihood of every possible genotype at a locus. Subsequently, bcftools applies the prior probability and uses Bayesian inference to call actual genotypes, outputting variant information in Variant Call Format (vcf). This method can identify both single nucleotide variants and insertion/deletion events. General information about the underlying algorithm utilized by SAMtools is detailed by Li.2,3
Selecting SAMtools from the context sensitive menu will bring up the SAMtools task dialog, which contains three default sections: Variant detection method, Select Reference sequence, and Advanced options.
In the Variant detection method drop-down list, Against reference will compare base composition for each sample against the reference sequence assembly, independently (Figure 1).
In the event paired samples exist within the project, the Paired samples detection method can be utilized to identify loci with differing genotypes between the pair once each sample has been compared to the reference sequence assembly. In instances where there is limited information to accurately determine genotypes in one or both of the samples, the same genotype may be called for case and control if it differs from the reference. The Filter variants task can be used to exclude these spurious loci. To perform this analysis, sample attributes must be added in the Data tab of the project (Figure 2). Specifically, an attribute must be added for sample ID (shared between the paired samples) and an attribute must also be added for sample type that differentiates the paired samples.
Examples of the latter can include case and control or tumor and normal. If these attributes are present, a section for Analysis options will be displayed below the Variant detection method (Figure 3). To utilize this feature, select Paired analysis. Match ID must then be specified and should correspond to the attribute that references the sample ID shared between the pair. Selecting Case/control will allow for discriminating genotypes between paired samples in downstream tasks. Attribute should correspond to the attribute that defines type within sample pairs, and Control can be specified for whatever category relates to the reference sample.
Select Reference sequence will specify the reference assembly to utilize for variant detection. If the alignment was generated in Partek® Flow®, the Assembly will be displayed as text in the section, and you do not have the option to change the reference. In the event that alignment was performed outside of Partek Flow, you will need to select the appropriate Assembly utilized for alignment in the drop-down list. Assemblies previously added to library files (see Library File Management) will be available for selection or New assembly… can be utilized to import the reference sequence to library files from within the task.
Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinforma Oxf Engl. 2009;25(16):2078-2079.
Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011;27(21):2987-2993.
Li H. Improving SNP discovery by base alignment quality. Bioinformatics. 2011;27(8):1157-1158.
LoFreq (version 2.1.3a) is a very sensitive and fast variant caller that can be employed to robustly call low-frequency variants. By utilizing sources of sequencing error in the detection model, LoFreq can identify variants below the sequencing error rate. The significance of each variant is calculated to allow for control of false positives. This method can identify both single nucleotide variants and insertion/deletion events, although the current implementation does not produce discrete genotype calls. Information on the model underlying the variant detection is detailed by Wilm et al.1
Selecting LoFreq from the context sensitive menu will bring up the LoFreq task dialog (Figure 1), which contains two sections: Select Reference sequence and Advanced options.
Select Reference sequence will specify the reference assembly to utilize for variant detection. If the alignment was generated in Partek Flow, the Assembly will be displayed as text in the section, and you do not have the option to change the reference. In the event that alignment was performed outside of Partek Flow, you will need to select the appropriate Assembly utilized for alignment in the drop-down list. Assemblies previously added to library files (see Library File Management) will be available for selection or New assembly… can be utilized to import the reference sequence to library files from within the task.
Wilm A, Aw PPK, Bertrand D, et al. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012;40(22):11189-11201.
An important aspect of variant analysis is the ability to prioritize variants for downstream analysis. As variant detection can often identify a large number of variants, it may be difficult to determine which variants may impact phenotypes. As implemented in Partek Flow, the Ensembl Variant Effect Predictor (VEP, version 84)1 provides a means to add detailed annotation to variants in the project, such as discrete aspects of transcript models and variant databases not available in the Annotate Variants task. For variants identified in human data, information from popular tools that predict the impact of variants causing amino acid changes, SIFT2–4 and PROVEAN5 (available for the hg19 genome assembly), will be included. VEP databases can be obtained for multiple species, and content will depend on the available transcript and variant information for that organism. The Annotate variants (VEP) task can be invoked from any Variants or Annotated variants data node, and the task will supplement any existing annotation in the vcf files. Annotation information will also be visible in the downstream View variants Variant report.
The task dialog for Annotate variants (VEP) contains two sections: Select Variant Effect Predictor database and Advanced options (Figure 1). Select Variant Effect Predictor database will specify the reference assembly to utilize for variant detection. If the variant detection was performed in Partek Flow, the Assembly will be displayed as text in the section. Upon initial task usage, click the Create variant effect predictor database button to import a database. The VEP database for hg19 is available for automated download in Partek Flow, and information regarding obtaining additional databases for other species and genome assemblies can be found in the VEP documentation.
The report includes variant impact information, a subjective classification of the severity of the variant consequence:
Low: a variant that is assumed to be mostly harmless or unlikely to change protein behavior
Moderate: a non-disruptive variant that might change protein effectiveness
Modifier: usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact
High: a variant is assumed to have high disruptive impact in the protein, probably causing protein truncation, loss of function or triggering nonsense mediated decay.
The Annotate variants task in Partek Flow provides a means to add information regarding genomic features, such as transcript models, and existing variant databases to the variants contained in the project. This information can be useful for filtering, interpreting, and prioritizing variants for downstream investigation. The Annotate variants task can be invoked from any Variants or Annotated variants data node, and the task will supplement any existing annotation in the underlying vcf files. Annotation information will also be visible in the downstream View variants Variant report.
The task dialog for Annotate variants contains three sections: Assembly, Annotate with genomic features, and Annotate with known variants (Figure 1). If variant detection was performed in Partek Flow, the Assembly will be displayed as text in the section, and you do not have the option to change the reference. In the event that variant detection was performed outside of Partek Flow, you will need to select the appropriate Assembly utilized for variant detection in the drop-down list. Assemblies previously added to library files (see Library File Management) will be available for selection or New assembly… can be utilized to import the reference sequence from within the task.
Selecting Annotate with genomic feature provides the means to add gene/feature information to the variants (Figure 2). This typically takes the form of overlaying a transcript model (such as Ensembl). Annotation models previously added to library files (see Library File Management) will be available for selection, or Add annotation model in the drop-down list can be utilized to import an annotation model to library files within the task. Promoter upstream limit and Promoter downstream limit provide a means to set the number of bases flanking the transcription start site; this region will be considered the promoter of a feature.
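A sketch of how such limits define a promoter region around the transcription start site (TSS); the strand handling below is an assumption for illustration:

```python
def promoter_region(tss, strand, upstream, downstream):
    # Returns (start, end) genomic coordinates of the promoter window.
    if strand == "+":
        return (tss - upstream, tss + downstream)
    return (tss - downstream, tss + upstream)  # mirrored on the minus strand

promoter_region(tss=1_000_000, strand="+", upstream=2000, downstream=500)
```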
Selecting Annotate with known variants will provide the ability to specify a Variant annotation database (Figure 2). Known variant databases in vcf format, such as dbSNP1 and 1000 Genomes2 for human variants, can be used in the task. Additional databases not provided for automated download in Partek® Flow®, such as the Catalogue of Somatic Mutations in Cancer (COSMIC)3, can be obtained and employed by the user. Variant databases previously added to library files (see Library File Management) will be available for selection or Add variant database in the menu can be utilized to import the variant database to library files from within the task.
Sherry ST. dbSNP: the NCBI database of genetic variation. Nucleic Acids Research. 2001;29(1):308-311. doi:10.1093/nar/29.1.308
Auton A, Abecasis GR, Altshuler DM, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68-74. doi:10.1038/nature15393.
Forbes SA, Bhamra G, Bamford S, et al. The Catalogue of Somatic Mutations in Cancer (COSMIC). In: Haines JL, Korf BR, Morton CC, Seidman CE, Seidman JG, Smith DR, eds. Current Protocols in Human Genetics. Hoboken, NJ, USA: John Wiley & Sons, Inc.; 2008. http://doi.wiley.com/10.1002/0471142905.hg1011s57.
A fusion gene is a hybrid gene that combines parts of two or more original genes. Fusion genes can form as a result of chromosomal rearrangements (such as translocation, interstitial deletion, or chromosomal inversion) or abnormal transcription, and they have been shown to act as drivers of malignant transformation and/or progression in various neoplasms (1). The discovery and characterization of fusion genes have been greatly facilitated by the use of NGS (2), and several computational algorithms have been developed to detect them.
This chapter will illustrate how to detect fusion genes by:
The STAR aligner also has the ability to detect fusion genes (referred to as “chimeric alignments”) (5,6). During the first phase of alignment, STAR searches for maximal mappable prefixes (seeds) of sequencing reads. In the second phase, all the seeds that align within user-defined genomic windows are stitched together. If an alignment within one genomic window does not cover the entire read sequence, STAR will try to find two or more windows that cover the entire read. This essentially results in the detection of fusion events, with different parts of reads aligning to distal genomic locations, or different chromosomes, or different strands.
STAR fusion detection is performed in two steps: chimeric alignment of reads with the STAR aligner and fusion detection with STAR-Fusion. Performing fusion detection in two steps is equivalent to running the analysis in "Kickstart" mode, as described by the authors of STAR-Fusion. We recommend using STAR version 2.7.8a (see Task management to check which version you are running).
To save time, you can import the pre-built STAR-Fusion pipeline from our hosted pipeline page. This pipeline includes the two steps outlined below, where the advanced options for the STAR 2.7.8a alignment have been optimized for fusion detection according to the STAR-Fusion author's recommendations. See Importing a Pipeline for more information.
When performing an alignment with STAR, chimeric alignment can be activated by tick-marking the Chimeric alignment option in the Advanced options of the aligner (the Advanced options dialog is reached via the Configure link in the setup dialog). When the Chimeric alignment checkbox is selected, additional options specific to the fusion search algorithm are shown (Figure 1). For a discussion on the details of the options, see STAR documentation.
The output is associated with the Chimeric junctions data node (Figure 2), which is part of the STAR results, in addition to the Aligned reads node and, optionally, the Unaligned reads node.
STAR-Fusion v1.10 is wrapped into Partek Flow. STAR-Fusion will process the chimeric output generated by the STAR aligner to map junction reads and spanning reads to a reference annotation set. To run fusion detection, select the Chimeric junctions data node and choose STAR-Fusion from the Variant analysis menu in the toolbox (Figure 5).
Choose the STAR-Fusion annotation from the drop-down list. We provide automatic downloads of the plug-n-play libraries distributed by the Trinity Cancer Transcriptome Analysis Toolkit (CTAT) for Human hg38 (Gencode v22 and v37) and hg19 (Gencode v19) assemblies (Figure 6). If you wish to add your own STAR-Fusion library, you can either import a pre-built CTAT library or gather the appropriate files and build it in Partek Flow. See here for more details on the files you need.
To change any of the advanced options, click the Configure link (Figure 7). To run the task, click Finish.
The resulting Fusion predictions data node (Figure 8) can be downloaded to your local machine by selecting the data node and clicking Download data from the toolbox. There will be one tab-separated (.tsv) file per sample. To view the full table, double-click the new data node to open the task report (Figure 9). Each row of the table is a fusion event and the columns contain information about each detected fusion.
FusionName: the name of the fusion event, given as LeftGene--RightGene. Multiple fusion events can be detected across the same pair of genes, so the FusionName of an event is not necessarily unique;
JunctionReadCount: indicates the number of RNA-Seq fragments containing a read that aligns as a split read at the site of the putative fusion junction;
SpanningFragCount: indicates the number of RNA-Seq fragments that encompass the fusion junction such that one read of the pair aligns to a different gene than the other paired-end read of that fragment;
est_J: estimated junction read counts corrected for multiple mappings;
est_S: estimated spanning fragment counts corrected for multiple mappings;
SpliceType: indicates whether the proposed breakpoint occurs at reference exon junctions as provided by the reference transcript structure annotations (Gencode);
LeftGene: name of the first (left) gene;
LeftBreakpoint: genome coordinates for the breakpoint in left gene;
RightGene: name of the second (right) gene;
RightBreakpoint: genome coordinates for the breakpoint in right gene;
JunctionReads: sequence identifiers for all junction reads;
SpanningFrags: sequence identifiers for all spanning fragments;
LargeAnchorSupport: indicates whether there are split reads that provide 'long' (set to 25bp) alignments on both sides of the putative breakpoint;
FFPM: fusion fragments per million reads;
LeftBreakDinuc: dinucleotide base pairs at the left breakpoint;
LeftBreakEntropy: the Shannon entropy of the 15 exonic bases flanking the left breakpoint (see the sketch after this list);
RightBreakDinuc: dinucleotide base pairs at the right breakpoint;
RightBreakEntropy: the Shannon entropy of the 15 exonic bases flanking the right breakpoint;
annots: provides a simplified annotation for the fusion transcript.
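For reference, minimal sketches of two of these metrics: the Shannon (log2) entropy of a 15-base flanking sequence, and FFPM assuming the usual fusion-fragments-per-million-total-fragments definition (consult the STAR-Fusion documentation for the exact formulas):

```python
import math
from collections import Counter

def shannon_entropy(seq):
    # Entropy in bits; near 2.0 for high-complexity DNA, near 0 for repeats.
    counts = Counter(seq.upper())
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def ffpm(junction_frags, spanning_frags, total_frags):
    # Fusion fragments per million total RNA-Seq fragments (assumed definition).
    return (junction_frags + spanning_frags) / total_frags * 1e6

shannon_entropy("ACGTACGTACGTACG")  # ~2.0 bits for this 15-mer
ffpm(12, 8, 40_000_000)             # 0.5 FFPM
```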
TopHat-Fusion is a version of TopHat with the ability to align reads across fusion points and detect fusion genes resulting from breakage and re-joining of two different chromosomes or from rearrangements within a chromosome (3). It is independent of gene annotation and can discover fusion products from known genes, unannotated splice variants of known genes or completely unknown genes.
The reads are first aligned to the genome. The unaligned reads resulting from this initial alignment are split into multiple 25 bp sequences which are, in turn, aligned to the genome by Bowtie. The TopHat-Fusion algorithm identifies the cases where the first and the last 25 bp segments are aligned to either two different chromosomes or two locations on the same chromosome (spacing is defined by the user). The whole read is used to identify a fusion point. After the initial fusion candidates are defined, all the segments from the initially unaligned reads are realigned against the fusion points (as well as intron boundaries and indels). The resulting alignments are combined with the full read alignments.
The most recent TopHat-Fusion version implemented in Partek® Flow® at the time this manual was written (2.1.0) focuses on fusions due to chromosomal rearrangements; fusions resulting from read-through transcription or trans-splicing are not supported. For details, as well as a discussion of TopHat-Fusion options, see the TopHat-Fusion home page (4).
TopHat-Fusion is integrated in the TopHat 2 task and is invoked by using the Fusion search check box in the Alignment options dialog (Figure 10).
The output is generated as a new data node, Fusion results (Figure 11), as part of the TopHat 2 align reads task (in addition to the Aligned reads node and, optionally, the Unaligned reads node).
Selecting the Fusion results data node opens the task menu, with four options (Figure 12): Data summary report, Fusion report, Fusion attribute report, and Download data.
Clicking Download data downloads a *.fusion file to the local computer. The file is human-readable and can be opened in a text editor (example in Figure 13). For details, refer to the TopHat-Fusion documentation.
A list of annotated fusion genes, in the form of a Fusion report, can be obtained by first selecting the Fusion report task node and then the Task report link from the task menu. Since the task provides an annotated report, an annotation file needs to be specified first (Figure 14).
The resulting Fusion report task node (Figure 15) can be double-clicked to reveal the full table (Figure 16).
Each row of the table in Figure 16 is a potential fusion event, with the columns providing the following information:
Sample ID: sample in which the fusion event was identified
Chromosome 1: chromosome hosting the first (left) segment of the fusion transcript
Stop 1: end of the first (left) segment of the fusion transcript
Chromosome 2: chromosome hosting the second (right) part of the fusion transcript
Start 2: beginning of the second (right) segment of the fusion transcript
Gene1: gene on the left side of the fusion
Gene2: gene on the right side of the fusion
Spanning reads: number of reads which were unaligned during the initial phase of TopHat and where only one mate is used as evidence of the fusion event
Mate Pairs: number of reads which were unaligned during the initial phase of TopHat and where both mates are used as evidence of the fusion event
Spanning mate pairs: number of reads where both mates were aligned during the initial phase of TopHat, but their pairing is discordant (e.g. different chromosomes, different orientation etc.)
Contradicting reads: number of reads which do not support the fusion
Left bases: number of bases on the left side of the fusion
Right bases: number of bases on the right side of the fusion
All the columns can be sorted using the arrow buttons in the column headers, while the type-in boxes can be used for searching. TopHat-Fusion does not report exact start and stop positions for each side of the fusion event; it reports a single location for the end of the upstream segment (Stop 1) and the beginning of the downstream segment (Start 2). Therefore, the columns Start 1 and Stop 2 are added for (internal) consistency with other Partek Flow tools.
The checkboxes Disrupted Genes and Gene/Gene fusions are filter tools. When selected, Disrupted Genes removes all the rows (fusion events) that have no genes assigned to them, i.e. those that merge two intergenic regions; a fusion between a gene and an intergenic region, however, will be kept in the table. Gene/Gene fusions filters in only those fusion events that have an annotated gene on both sides of the breakpoint. In other words, only gene-to-gene fusions are kept in the table.
Another table which can be generated based on a Fusion results node is the Fusion attribute report. When the option is selected, it brings up the dialog shown in Figure 17. First, you need to specify one or more categorical attributes (Select attribute(s) to test), which have at least two categories (see Data tab). Second, you need to specify an annotation file, using the Assembly and Gene/feature annotation drop-down lists.
A new data node, Fusion attribute report, is generated in the Analysis tab (Figure 18) and it provides access to the Task report link in the task menu.
The output Fusion report table resembles the basic TopHat-Fusion output: each row of the table is a single fusion event, while the information on the merged segments is in the columns.
Chromosome 1: chromosome hosting the first (left) segment of the fusion transcript;
Start 1: beginning of the first (left) segment of the fusion transcript;
Stop 1: end of the first (left) segment of the fusion transcript;
Chromosome 2: chromosome hosting the second (right) segment of the fusion transcript;
Start 2: beginning of the second (right) segment of the fusion transcript;
Stop 2: end of the second (right) segment of the fusion transcript;
Gene1: gene on the left side of the fusion;
Gene2: gene on the right side of the fusion;
% in (category name): fraction of samples within the category with the fusion event.
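A sketch of how such a fraction can be computed, with illustrative column names and using only the samples present in the table as the denominator:

```python
import pandas as pd

fusions = pd.DataFrame({
    "fusion":   ["A--B", "A--B", "C--D"],
    "sample":   ["s1",   "s2",   "s3"],
    "category": ["tumor", "normal", "tumor"],
})
samples_per_category = fusions.groupby("category")["sample"].nunique()
pct_in = (
    fusions.groupby(["fusion", "category"])["sample"].nunique()
           .div(samples_per_category, level="category")  # fraction per category
)
```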
The same Disrupted Genes and Gene/Gene fusions checkboxes filter this table as well: Disrupted Genes removes fusion events that merge two intergenic regions (fusions between a gene and an intergenic region are kept), while Gene/Gene fusions keeps only those events with an annotated gene on both sides of the breakpoint.
References
Annala MJ, Parker BC, Zhang W, Nykter M. Fusion genes and their discovery using high throughput sequencing. Cancer Lett. 2013;340:192-200.
Costa V, Aprile M, Esposito R, Ciccodicola A. RNA-Seq and human complex diseases: recent accomplishments and future perspectives. Eur J Hum Genet. 2013;21:134-142.
Kim D, Salzberg SL. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biology. 2011;12:R72
TopHat-Fusion. An algorithm for discovery of novel fusion transcripts. http://tophat.cbcb.umd.edu/fusion_index.html. Accessed April 25, 2014.
Dobin A, Davies CA, Schlesinger F et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15-21.
Haas BJ, Dobin A, Li B, et al. Accuracy assessment of fusion transcript detection via read-mapping and de novo fusion transcript assembly-based methods. Genome Biol. 2019;20:213.
An important aspect of variant analysis is the ability to prioritize specific variants for further investigation. As variant detection can often identify a large number of variants, it may be difficult to determine which variants may impact phenotypes. SnpEff (version 4.1k) provides a means to annotate and predict the effects of variants on genes, allowing for prioritization of variants within the project. In addition, the SnpEff databases utilized for prediction support a large number of genome assemblies. Information regarding the implementation of the predictions is detailed by Cingolani et al.1 The predicted effect of the variant is categorized by impact:
HIGH - frame shifts, addition/deletion of stop codons, etc;
MODERATE – codon change/deletion/insertion, etc;
LOW – synonymous changes, etc;
MODIFIER – changes outside coding regions, etc.
Further details about output metrics can be found in the SnpEff documentation. The Annotate variants (SnpEff) task can be invoked from any Variants or Annotated variants data node, and the task will supplement any existing annotation in the vcf files. Annotation information will also be visible in the View variants Variant report and the Summarize cohort mutations Cohort mutation summary report.
The task dialog for Annotate variants (SnpEff) contains two sections: Select SnpEff database and Advanced options (Figure 1). Select SnpEff database will specify the reference assembly to utilize for variant annotation. If the variant detection was performed in Partek Flow, the Assembly will be displayed as text in the section, and you do not have the option to change the reference. In the event that variant detection was performed outside of Partek Flow, you will need to select the appropriate Assembly utilized for variant detection in the drop-down list. Assemblies previously added to library files (see Library File Management) will be available for selection, or New assembly… can be utilized to import the reference sequence to library files from within the task. Select SnpEff database will also allow selection of the database utilized for prediction, and Partek Flow provides automated download of a limited number of these databases. Databases previously added to library files (see Library File Management) will be available for selection, or Add SnpEff variant database in the menu can be utilized to import the database to library files from within the task. Additional information on SnpEff databases can be found in the SnpEff documentation.
Cingolani P, Platts A, Wang LL, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012;6(2):80-92.
Another way to invoke a heatmap without performing clustering is via the data viewer. When you select the Heatmap icon in the available plots list, data nodes that contain two-dimensional matrices can be used to draw this type of plot. A bubble map can be plotted similarly (use the arrow on the heatmap icon to select a Bubble map) for descriptive statistics that have been generated in the data analysis pipeline.
The min and max color stops cannot be dragged or removed. If you left-click on them, you can choose a different color. When you click on the Palette bar, you can add a new color stop between min and max. Adding a color stop can be useful when there is an outlier value in the data. You can use a different color to represent different value ranges.
Right-clicking a color stop will reveal a list of options. Space colors evenly will rearrange the position of the stops so there is an equal distance between all stops. Center will move a stop to the middle of the two adjacent stops. Delete will remove the stop.
In addition to color, you can also use the Size drop-down list to size by a set of values from another matrix stored in the same data node. Most of the data nodes contain only one matrix, so the only options available in the Size drop down will be None or Matrix. In cases where you have multiple matrices, you might want to use the color of the component in the heatmap to represent one type of statistic (like mean of the groups) and the size of the component to represent the information from a different statistic (like std. dev).
When By cluster is selected in the Row/Column color drop-down list, the number of clusters needs to be specified. The top N clusters will be shown in N different colors.
This section allows you to add sample or cell level annotations to the viewer. First, make sure to choose the correct data node which contains the annotation information you would like to use by clicking the circle (). All project level annotations will be available on all data nodes in the pipeline.
In point mode (), you can left-click and drag to move around the heatmap (if you are not fully zoomed out). Left-clicking once on the heatmap or on a dendrogram branch will select the associated rows/columns.
In selection mode (), you can click and drag to select a range of rows, columns, or components.
In flip mode (), you can click on a line in the dendrogram (which represents a cluster branch) and the location of the two legs of the branch will be swapped. If no clustering is performed (no dendrogram is generated), in this mode, you can click on the label of an item (observation or feature), drag and drop to manually switch orders of the row or column on the heatmap.
Click on reset view () to reset to the default view.
The Save Image icon () enables you to download the heat map to your local computer. If the heat map contains up to 2.5M cells (features * observations), you can choose between saving the current appearance of the heat map window (Current view) and saving the entire heat map (All data). Depending on the number of features / observations, Partek Flow may not be able to fit all the labels on the screen, due to the limit imposed by the screen resolution. The All data option provides an image file of sufficient size so that all the labels are readable (in turn, that image may not fit the computer screen and the image file may be quite large). If the heat map exceeds 2.5M cells, the Current view option will not be shown, and you will see only a dialog like the one below.
After selecting either Current view (if applicable) or All data button, the next dialog (below) will allow you to specify the image format, size, and resolution.
Advanced options provides a means to tune parameters in the variant detection for optimal performance. Upon invoking the task dialog, Option set is set to Default, and these parameters are provided by the FreeBayes developers. Clicking the Configure button will open a window to tune advanced options. FreeBayes has advanced options for Population model, Allele scope, Indel realignment, Input filters, Mappability priors, Genotype likelihoods, Algorithmic features, and Report options. Moving the mouse cursor over the info button will provide details for each parameter. Please refer to the FreeBayes documentation for further information on tuning these parameters.
When the Scree plot icon is chosen, a 2D plot is drawn: the X-axis represents the PCs and the Y-axis represents the eigenvalues (Figure 4).
A PCA data node can also be drawn as a table. When the Table icon is chosen, the component loadings matrix is displayed in the viewer (Figure 5). The Content can be modified using the Content configuration option; the table can be paged through here or from the lower right corner.
Advanced options provides a means to tune parameters in the variant detection for optimal performance. Upon invoking the task dialog, Option set is set to Default, and these parameters are provided by the SAMtools developers. Clicking Configure will open a window to tune advanced options. Moving the mouse cursor over the info button will provide details for each parameter. Please refer to the SAMtools documentation for further details on any of these parameters.
Advanced options provides a means to tune parameters in the variant detection for optimal performance. Upon invoking the task dialog, Option set is set to Default, and these parameters are provided by the LoFreq developers. Clicking Configure will open a window to tune advanced options. LoFreq has advanced options for Region control, Base-call quality, Base-alignment (BAQ) and indel-alignment (IDAQ) qualities, Mapping quality, Indels, Source quality, P-values, and Other. Moving the mouse cursor over the info button will provide details for each parameter. Please refer to the LoFreq documentation for further suggestions on tuning these parameters.
Advanced options provides a means to specify aspects of the annotation generated from the VEP annotation task. Upon invoking the task dialog, Option set is set to Default. Clicking Configure will open a window to specify additional components of annotation (Figure 2). VEP has Advanced options for Identifiers, Output options, and Co-located variants. Moving the mouse cursor over the info button will provide details for each parameter.
To obtain a .fusion file that summarizes the chimeric reads across samples, double-click on the Chimeric junctions data node to open the report, click on the View output files link, select the chimeric_result.fusion file, and click the download icon (Figure 3). The file is human-readable and can be opened in a text editor (example in Figure 4). For details, refer to STAR's documentation.
Advanced options provides a means to tune parameters for annotation generated from the SnpEff database. Upon invoking the task dialog, Option set is set to Default, and these parameters are prescribed by the developers of SnpEff. Clicking Configure will open a window to tune advanced options (Figure 2). SnpEff has Advanced options for Results filter options, Annotation options, and Database options. Moving the mouse cursor over the info button will provide details for each parameter.