This white paper explains some basic concepts related to alignment and quantification tasks in Partek Flow. The term “alignment” describes the process of finding the position of a sequencing read on the reference genome, while quantification refers to assigning already aligned reads to transcripts based on a specified transcript/gene annotation.
The aligned data node contains reads that have been aligned, counting each read once even if it has multiple alignments. Processing of paired-end reads needs further elaboration: a paired-end read is counted as one read only if both ends align to the genome, no matter how many alignments each end has (Figure 1). If one end of a paired-end read is not aligned, the read is considered unaligned and is discarded from downstream analysis.
As reads can be aligned to more than one location, the number of alignments may be greater than the number of reads; since some reads may be unaligned, the number of alignments may be less than the number of reads in the bam/sam file (Figure 1).
Partek Flow shows the number of alignments as well as the percentage of aligned reads per sample in the post-alignment QC report.
This step maps the aligned reads to transcripts using a modified E/M algorithm; detailed information about this algorithm can be found in the white paper RNA-Seq quantification.
Given a list of transcripts, each read can be mapped to one of four classes: exonic, partially overlaps exon, intronic, or between genes (Figure 2). Generally, “exonic” means the read came from a sequence present in the mature mRNA, while “intronic” means the read overlaps a gene, but aligns to the portion of the sequence not present in the mature mRNA. Finally, “between genes” reads came from a region outside of any gene. Detailed definitions of all four types can be found below.
The read assignments are labeled within the context of the chosen transcriptome database. If a read maps to multiple locations and any one of its alignments is exonic, the read is labeled “exonic”; the same logic applies to partially overlaps exon, intronic, and intergenic. In other words, exonic takes precedence over partially overlaps exon, partially overlaps exon takes precedence over intronic, and intronic takes precedence over intergenic.
Exonic: A read is labeled exonic if any one of its alignments is completely contained within the respective exon as defined by the database (read 1), i.e., even if there is a single base shift of the read relative to the exon, the read will not be called exonic but will fall into the category ‘partially overlaps exon’. If the alignments are strand-specific, then the strand of the alignment must also agree with the strand of the transcript.
Partially overlaps exon: A read is assigned to partially overlap an exon if any of its alignments overlaps an exon, but at least partly (one base-pair or more) maps out of the exon (read 3, read 4).
Intronic: A read is labeled intronic if any one of its alignments maps completely within an intron (read 2), but none of the alignments are exonic (either fully or in part). If the alignments are strand-specific, then the strand of the alignment must also agree with the strand of the transcript.
Reads between genes: A read is labeled ‘between genes’ if none of its alignments overlap a gene (read 5).
A read is assigned to a transcript only when it is compatible with it. A compatible read is one that fits the transcript model from the chosen annotation database. A compatible read must be exonic (fully mapped to an exon); however, not all exonic reads are compatible with a transcript. For paired-end reads, both ends have to be exonic and both have to map to the same transcript; if the two ends map to different transcripts or different chromosomes, the read is not compatible with any transcript.
Reads that partially overlap an exon, intronic reads, and intergenic reads are incompatible with any transcript. Incompatible reads do not contribute to gene or transcript level read counts. Several single-end and paired-end scenarios are discussed in the following sections.
A single-end read that is considered “compatible” would be any read that overlaps the exon 100% as described above in the paragraph on “exonic” reads.
If a single-end read includes an exon to exon junction, the region skipped in the read must be consistent with an intron in that transcript. If it is not, the read will be incompatible.
For example, consider an alignment that begins at base 100 and has a CIGAR string of 5M10N5M, where M means alignment match and N means skipped bases (a junction). To be compatible, bases 100-104 must be exonic, bases 105-114 must be intronic, and bases 115-119 must be exonic.
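To make the junction rule concrete, here is a minimal Python sketch, not Partek Flow's implementation: it walks a CIGAR string, collects the reference blocks, and checks each matched block against a hypothetical transcript's exons. Coordinates are half-open, and the exon list is an assumption made up for illustration.

```python
import re

def cigar_blocks(start, cigar):
    """Split an alignment into reference blocks. Only M/=/X (match) and
    N/D (skipped/deleted) operations move along the reference; I and S
    consume the read only, which is enough for this illustration."""
    blocks, pos = [], start
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in ("M", "=", "X"):
            blocks.append(("M", pos, pos + length))
            pos += length
        elif op in ("N", "D"):
            blocks.append((op, pos, pos + length))
            pos += length
    return blocks

def junction_compatible(start, cigar, exons):
    """Every matched block must fall inside an exon and every skipped
    block must coincide with an intron implied by adjacent exons."""
    introns = [(e1_end, e2_start)
               for (_, e1_end), (e2_start, _) in zip(exons, exons[1:])]
    for kind, s, e in cigar_blocks(start, cigar):
        if kind == "M" and not any(es <= s and e <= ee for es, ee in exons):
            return False
        if kind == "N" and (s, e) not in introns:
            return False
    return True

# The example from the text: alignment at base 100 with CIGAR 5M10N5M.
# In half-open coordinates, 100-104 and 115-119 must be exonic and
# 105-114 must be a full intron of the transcript.
exons = [(90, 105), (115, 130)]  # hypothetical transcript model
print(junction_compatible(100, "5M10N5M", exons))  # True
```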
The handling of paired-end data is a bit more complex. There can be zero or more alignments associated with the first-in-pair read as well as zero or more alignments associated with the second-in-pair read. Each of the alignments associated with a paired-end read is considered compatible/incompatible with overlapping transcripts using the same rules that apply to alignments associated with single-end reads. A paired-end read will be considered compatible if it contains any pair of alignments where an alignment from the first-in-pair read is compatible with a transcript and an alignment from the second-in-pair read is compatible with the same transcript. Consequently, if only one of the ends is compatible with the transcript, the read is counted as incompatible. Similarly, if a paired-end read has a junction that is not consistent with an intron in the transcript, the read will be incompatible.
Considering genes with multiple transcripts, a read can be both counted as compatible for some transcripts, as well as counted incompatible for other transcripts of the same gene (Figure 3). Please note that this concept holds for single-end reads as well.
Furthermore, all the reads that have at least one alignment contribute to the total number of aligned reads.
The unexplained regions portion of quantification considers any read that is “not compatible” with all transcripts. It is essentially a three-step process:
Filter down to reads that are not compatible with the transcript model.
Call regions that have at least an average of 5 reads per position.
Combine adjacent regions and report the result.
Since we are filtering down to reads that are incompatible with the transcript model, only incompatible reads are combined during the combine adjacent regions step.
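As an illustration of the three steps, here is a toy Python sketch. It assumes a per-base coverage array already built from incompatible reads only (step 1), and it simplifies step 2 by thresholding each position rather than a regional average; the depths are made up.

```python
import numpy as np

def unexplained_regions(coverage, min_mean=5):
    """Call positions whose depth meets the threshold and merge runs of
    adjacent called positions into half-open regions (steps 2 and 3)."""
    called = coverage >= min_mean
    regions, start = [], None
    for i, flag in enumerate(called):
        if flag and start is None:
            start = i                      # open a new region
        elif not flag and start is not None:
            regions.append((start, i))     # close the current region
            start = None
    if start is not None:
        regions.append((start, len(called)))
    return regions

cov = np.array([0, 0, 6, 7, 8, 2, 9, 9, 0, 0])  # hypothetical depths
print(unexplained_regions(cov))  # [(2, 5), (6, 8)]
```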
When interpreting the unexplained reads, keep in mind that these are the reads not compatible with the applied transcript model. By changing the model, some reads will become compatible and thus will no longer be labeled “unexplained”. Figure 6 shows such an example. The depicted regions map just downstream of the human LONP2 gene as defined by RefSeq and are hence flagged as unexplained. However, by overlaying the AceView transcripts, it is apparent that mapping to AceView would yield a different result.
The method used to detect changes in functional groups is ANOVA. There is one result per functional group, based on the expression of all the genes contained in the group. Besides the factor specified in the ANOVA model, the following extra terms are added to the model by Partek Flow automatically:
Gene ID - Since not all genes in a functional group are expressed at the same level, gene ID is added to the model to account for gene-to-gene differences.
Factor * Gene ID - The interaction of gene ID with the factor is added to detect changes within the expression of a gene set with respect to different levels of the factor. A small p-value for this term means the genes in the set behave differently on this factor, i.e., the category's expression pattern is disrupted; we call this the disruption effect.
Suppose there is an experiment to find genes differentially expressed in two tissues: a t-test or ANOVA can be used to analyze the data. The Gene Set ANOVA dialog allows you to specify the ANOVA model: tissue. However, under the hood, the interaction term is automatically added to the model:
y = µ + T + G + T*G + ε
y: expression of a functional group
µ: average expression of the functional group
T: tissue-to-tissue effect
G: gene-to-gene effect (differential expression of genes within the function group independent of tissue type)
T*G: Tissue-Gene interaction (differential patterning of gene expression in different tissue types)
ε: error term
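For readers who want to reproduce this kind of model outside Partek Flow, here is a minimal sketch using statsmodels on simulated long-format data; the column names and values are hypothetical, not Partek Flow output.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
# Hypothetical long-format table: expression of 10 genes from one
# functional group, each measured twice per tissue.
df = pd.DataFrame({
    "expression": rng.normal(8, 1, 40),
    "tissue": np.tile(["liver", "brain"], 20),
    "gene": np.repeat([f"g{i}" for i in range(10)], 4),
})

# y = mu + T + G + T*G + error: the formula expands to exactly the
# tissue, gene, and tissue*gene terms described above.
fit = smf.ols("expression ~ C(tissue) * C(gene)", data=df).fit()
print(anova_lm(fit))  # the C(tissue):C(gene) row tests the disruption effect
```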
Most single cell RNA-seq library prep kits compensate for the small quantity of starting material by PCR amplifying the reverse transcribed cDNA. Because some sequences will amplify preferentially, the final proportions of PCR amplified molecules will diverge from the original number of molecules. To correct for these PCR artifacts, reverse transcribed molecules are tagged with unique nucleotide sequences, termed unique molecular identifiers (UMIs), prior to amplification. These UMIs are retained through PCR amplification, allowing PCR products that were amplified from the same original molecule to be identified. Counting UMIs for each gene instead of reads allows the original number of molecules corresponding to each gene to be more faithfully represented.
Identifying reads with matching UMIs and consolidating them into a single aligned read for use in quantification is handled by the Deduplicate UMIs task in Partek Flow.
An important consideration when analyzing UMI data is the errors introduced into the UMIs themselves during PCR amplification of the original molecule. If these errors are not accounted for and each sequenced UMI is considered representative of an original UMI, the number of unique molecules can be significantly overestimated. To account for this, the Deduplicate UMIs task uses an implementation of the UMI-tools algorithm described in Smith et al. 2017. Paired-end read support was further improved by incorporating components of the UMI deduplication tool Connor.
The task works by first partitioning reads into groups. Reads are grouped if they align to the same genomic position, map to the same strand, and any barcodes present match within an edit distance of two.
Within each group, sequenced UMIs are analyzed to determine whether they originated from the same UMI. To do this, UMIs are clustered. The UMI that has the most reads is used as the seed for the first cluster. The seed UMI is connected to all UMIs within a single edit distance that have fewer reads than it to form a cluster. Every UMI within the cluster then serves as the seed for a subsequent round of connection, again connecting seed UMIs to all UMIs within a single edit distance that have fewer reads than the seed UMI. Additional rounds of connection are performed until no more UMIs can be incorporated into the cluster. The unclustered UMI with the highest number of reads is chosen as the seed for a second cluster and the same clustering procedure is repeated. This process of clustering continues until all UMIs in the group have been assigned to a cluster.
This directional clustering method has two important benefits. First, it corrects for PCR errors introduced into UMIs by clustering sequenced UMIs with highly similar sequences so that they are counted as a single UMI. Second, it recognizes that PCR errors that arise in later cycles will be present in lower quantities. This is why clustering proceeds directionally, connecting UMIs with more reads to UMIs with fewer reads.
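A compact Python sketch of the clustering just described; it follows the text's "fewer reads" rule (UMI-tools' exact count threshold differs slightly), and the UMIs and counts are hypothetical.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def directional_clusters(counts):
    """Seed with the most-read unassigned UMI, connect it to UMIs within
    one edit that have fewer reads, and keep expanding until closed."""
    unassigned = set(counts)
    clusters = []
    while unassigned:
        seed = max(unassigned, key=lambda u: counts[u])
        cluster, frontier = {seed}, [seed]
        while frontier:
            nxt = []
            for s in frontier:
                for u in list(unassigned - cluster):
                    if hamming(s, u) == 1 and counts[u] < counts[s]:
                        cluster.add(u)
                        nxt.append(u)
            frontier = nxt
        unassigned -= cluster
        clusters.append(cluster)
    return clusters

# Hypothetical UMI read counts within one read group:
counts = {"AAAA": 100, "AAAT": 10, "AATT": 2, "CCCC": 50}
print(directional_clusters(counts))
# e.g. [{'AAAA', 'AAAT', 'AATT'}, {'CCCC'}]
```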
Once the clusters have been identified, a consensus read for the cluster is generated. To begin, any reads that do not match the common CIGAR string for their cluster are discarded. From the remaining reads, the percentage of each base at each position is determined. If a base is present in over 60% of reads, it is used in the consensus read. Otherwise, N is used. The base quality score for each position in the consensus read is the maximum at each position from the contributing reads.
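The consensus step can be sketched as follows. This toy version assumes the surviving reads share one CIGAR string and length, and it takes the maximum quality across all reads at each position, which may differ in detail from Partek Flow's implementation.

```python
from collections import Counter

def consensus(reads, quals, threshold=0.6):
    """Per position: use a base if it appears in more than `threshold`
    of reads, otherwise N; quality is the per-position maximum."""
    seq, qual = [], []
    for i in range(len(reads[0])):
        column = [r[i] for r in reads]
        base, n = Counter(column).most_common(1)[0]
        seq.append(base if n / len(column) > threshold else "N")
        qual.append(max(q[i] for q in quals))
    return "".join(seq), qual

reads = ["ACGT", "ACGT", "ACTT"]  # hypothetical cluster members
quals = [[30, 30, 20, 30], [35, 30, 25, 30], [30, 30, 38, 30]]
print(consensus(reads, quals))  # ('ACGT', [35, 30, 38, 30])
```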
Deduplicate UMIs has an alternative setting to more closely match methods used by 10X Genomics' Cell Ranger pipeline: retain only one alignment per UMI. Selecting this option changes how the task functions and requires that you specify the genome assembly and gene/feature annotation.
The algorithm first checks whether each aligned read is compatible with a transcript in the annotation file. Here, compatible is defined as 50% or more of the aligned read sequence overlapping the transcript; strand is not considered. Aligned reads that are not compatible with a transcript or are mapped to multiple gene annotations are discarded.
The occurrence of each barcode and UMI combination is counted to establish the prevalence of each barcode+UMI.
UMIs within a Levenshtein distance of 1 are grouped.
The UMI within each UMI group with the highest number of reads is reported and other UMIs within the group are filtered out. If two UMIs within the group have the same number of reads, the UMI with the lowest sequence ASCII value is used.
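A minimal sketch of this alternative filtering; the greedy pass below, processing UMIs from most to least supported with the lowest-ASCII tiebreak, approximates the behavior described above and is not the actual Cell Ranger code.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def filter_umis(counts):
    """Keep one representative per group of UMIs within edit distance 1:
    the UMI with the most reads, ties broken by lowest sequence value."""
    kept = []
    for umi in sorted(counts, key=lambda u: (-counts[u], u)):
        # An already-kept UMI absorbs any later UMI within one edit of it.
        if not any(hamming(umi, k) <= 1 for k in kept):
            kept.append(umi)
    return kept

counts = {"AAAA": 9, "AAAT": 9, "GGGG": 3}  # hypothetical barcode+UMI tallies
print(filter_umis(counts))  # ['AAAA', 'GGGG'] -- AAAT loses the ASCII tie
```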
This method is also similar to the default method in the Drop-seq Alignment Cookbook (Macosko et al. 2015), which collapses UMI barcodes with a Hamming distance of 1.
This method may output more UMIs than the default behavior as only UMIs within an edit distance of 1 are summarized, whereas UMIs with a greater distance can be linked in the UMI-tools method. For a comparison of the performance of the two approaches, please see the Adjacency (Cell Ranger) and Directional (UMI-tools) methods described in Smith et al. 2017.
Smith T, Hegar A, Sudbery I. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Research 2017; 27(3): 491-499. https://doi.org/10.1101/gr.209601.116
Connor, University of Michigan BRCF Bioinformatics Core https://github.com/umich-brcf-bioinf/Connor
Cell Ranger Algorithms Overview, 10X Genomics https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/algorithms/overview
Drop-seq Alignment Cookbook v1.2 Jan 2016, James Nemesh, Steve McCarroll’s lab, Harvard Medical School http://mccarrolllab.com/wp-content/uploads/2016/03/Drop-seqAlignmentCookbookv1.2Jan2016.pdf
Macosko E, Basu A, Satija R, et al. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell 2015; 161(5):1202-1214. https://doi.org/10.1016/j.cell.2015.05.002
Partek Flow follows a two-week patch cycle to address any Critical-rated vulnerabilities located in any Partek Flow component. Partek Flow does not make use of Log4j components and is not vulnerable to CVE-2021-44228.
Partek Flow software resists external or internal compromise via isolation between users, groups, and data. This isolation is tunable by the Partek Flow administrator as well as users that need to increase the visibility of the data they own. Collections of per-account roles and permissions allow the administrator to define coarse or fine-grain data access rules. Once the administrator configures user privacy directories, users cannot see one another’s data on the same Partek Flow server unless they are collaborators on the same project.
Accounts are protected by a username and password authentication over HTTPS. Authentication credentials are protected using industry-standard approaches like salting, password complexity checks, and administrator-controlled account lockouts. To conform with existing corporate security standards, administrators can use LDAP instead of the Partek Flow internal authentication.
All data uploaded and downloaded through Partek Flow is transited using HTTPS. Keys are provided by the Partek Flow administrator, and thus can be as secure as policy dictates.
All user and administrator actions are immutably logged and timestamped.
All Partek Flow components are kept up to date. Should a security vulnerability be identified, it will be immediately rectified. The Partek Flow server and its sub-processes run with limited privileges, so damage from a compromised Partek Flow server is minimized. Third-party user-added executable files are restricted from running unless contained within a dedicated folder. Partek Flow is compatible with additional containment strategies such as Docker or virtualization.
Given the minimal deployment requirements of Partek Flow, it is easily integrated into existing infrastructure. The design of this surrounding infrastructure can significantly increase or weaken the security of a Partek Flow deployment. The Partek operations and development team has experience deploying Partek Flow in a variety of environments including cloud, cluster, and desktop, each with varying security requirements. We are available to assist administrators with recommendations on infrastructure design, deployment, and backup.
All parameters and steps taken to generate results using Partek Flow are tracked and easily allow for the reproduction of past results or detailed auditing. Partek Flow detects and reports external modification of user data. Partek performs internal tests to ensure data generated by Partek Flow is scientifically accurate and consistent with previous results.
An RNA-Seq quantification algorithm determines how aligned reads are assigned (i.e. mapped) to the transcript model. In turn, the quantification result provides a basis for the expression estimation and novel transcript discovery.
Each combination of sequencing instrument and sequencing protocol varies in how precisely it can estimate transcript abundance and in what data biases it propagates downstream. A good quantification algorithm is expected to extract the maximal amount of information from the reads and, importantly, to adjust for the biases of the sequencer and protocol. However, the practical impact of biases depends on the purpose of the study. Likewise, if the study agenda includes particular items such as novel transcript discovery, this alone can determine the choice of a combination of aligner and quantification package.
The most recognized quantification issues common to RNA-Seq include the following:
Multireads (multimappers)
Sequence (composition) bias
Position bias
GC bias
Multireads (or multimappers) are reads aligning to multiple locations in the genome. Three different approaches to handling those reads have been described.
Disallowing multireads (obsolete) (1, 2).
Full expectation-maximization (EM) algorithm (3), as implemented in Partek® Flow®, with slight modifications.
The “rescue” method implemented in Cufflinks (4 - 7), which is equivalent to the first (presumably the most informative) iteration of EM.
For paired-end reads with multiple alignments, it is possible to improve quantification by extracting information from the fragment (insert) size distribution. This feature is available in Cufflinks.
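The EM idea for multireads can be sketched in a few lines. This toy version ignores transcript length and fragment-size information and assumes every read has at least one compatible transcript; it simply alternates fractional read assignment with abundance updates.

```python
import numpy as np

def em_abundance(compat, n_iter=50):
    """`compat[i, j]` is 1 if read i is compatible with transcript j.
    E-step: split each read among its transcripts in proportion to the
    current abundances. M-step: abundances become the normalized sums
    of the fractional assignments."""
    n_reads, n_tx = compat.shape
    theta = np.full(n_tx, 1.0 / n_tx)        # uniform start
    for _ in range(n_iter):
        w = compat * theta                    # E-step numerators
        w /= w.sum(axis=1, keepdims=True)     # fractional assignment
        theta = w.sum(axis=0) / n_reads       # M-step update
    return theta

# Read 0 maps only to transcript A, read 1 only to B,
# and read 2 is a multiread compatible with both.
compat = np.array([[1, 0], [0, 1], [1, 1]], dtype=float)
print(em_abundance(compat))  # converges near [0.5, 0.5]
```

The "rescue" method corresponds to stopping after the first iteration of this loop.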
Sequence bias refers to the observation that the reads that have particular subsequences of nucleotides (typically, at 5’ or 3’ end) may be over- or underrepresented due to some artifacts of sequencing technology (8, 9) (for an illustration see Figure 2 in Roberts et al. (4)). The corresponding bias correction is implemented in Cufflinks.
Position bias (10 - 12) is manifested in unequal distribution of reads along the length of a transcript (see Figure 4 in Jiang et al. (12)). The corresponding correction is implemented in Cufflinks.
GC bias means that the estimated abundance of a transcript (gene) depends on its GC content. While Dohm et al. reported that GC-rich regions attract reads (see Figure 2 in (13)), another study argued that the dependence can go either way (see Figure 2 in Zheng et al. (14)). Although no available package implements a GC bias correction, the Cufflinks team has shown that in some cases the GC bias is highly correlated with the sequence bias, in which case a separate GC correction is unnecessary.
An accepted way to measure the impact of a certain bias is as follows: take a number of transcripts (genes) and perform the quantification with and without bias correction and then observe the change in concordance between the expected (known) and the estimated abundance. Initially, the expected abundances were estimated based on qPCR (4). More recently, the usage of presumably more accurate External RNA Controls Consortium (ERCC) controls took hold (12, 15). In the latter approach, the known abundance is measured as concentration of transcripts in attomoles/μL.
The estimated abundance is derived from fragments per kilobase of transcript per million aligned reads (FPKM)/reads per kilobase of transcript per million aligned reads (RPKM) values delivered by the quantification algorithm. The concordance between known and estimated abundances is measured by the r2, a value ranging from 0% to 100%, where 100% corresponds to perfect estimation of abundance.
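For example, the concordance can be computed as the squared Pearson correlation of log2 abundances; the numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical known concentrations (attomoles/uL) and estimated FPKM values:
known = np.array([1.0, 4.0, 16.0, 64.0, 256.0])
fpkm = np.array([0.9, 4.4, 14.0, 70.0, 241.0])

# r^2 between log2-known and log2-estimated abundance, as a percentage.
r = np.corrcoef(np.log2(known), np.log2(fpkm))[0, 1]
print(f"r^2 = {100 * r**2:.1f}%")
```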
The studies of Roberts et al. (4) and Li et al. (10) measured the importance of sequence and position bias correction based on qPCR using a set of genes from Microarray Quality Control project (MAQC). Roberts and colleagues reported that, when both sequence and position bias are accounted for, the r2 goes up by about 5%, from 75.3% to 80.7%. The impact of position bias is reported to be small relative to that of sequence bias.
Li and Dewey (10) showed that the RSEM quantification package, which implements only the position bias correction, delivers an r2 of 69%. For comparison, running Cufflinks on the same dataset with only the position bias correction yielded an r2 of 71%. When both sequence and position bias corrections were enabled in Cufflinks, the r2 went up to 79%. We can conclude that, for real datasets, the impact of position bias is negligible, and the impact of sequence bias is about 5-8%.
More qPCR-based insight was given by Li et al. (9). They proposed the so-called MART model for the sequence bias, which was subsequently incorporated into Cufflinks in May 2011. According to their findings, the sequence bias correction has a negligible effect on the estimated abundance. However, this conclusion changed when they conditioned on gene length and fold change: long genes were not affected by sequence bias as much as short genes, and the impact was sizable for the genes with the largest fold change.
An ERCC-based study by Jiang et al. (12) used a dataset consisting of 4.4 billion 76-base paired-end reads with 96 ERCC transcripts. They found that, while the sequencing bias does exist over short genomic intervals, it disappears (averages out) very quickly as the length of the transcript (gene) increases. They also found some evidence of GC content and length biases. The length bias was not covered earlier in Challenges and Solutions (at the beginning of this document) because length is a feature of the entire transcript (gene), and therefore it can be adjusted for downstream of quantification, as proposed in (16). A similar approach can be applied to GC bias, and it should work as long as our goal is to compare the abundances of the same transcript (gene) under different conditions.
On the other hand, suppose we want to compare the expression of two isoforms that differ by a short subsequence, such as a 50 bp exon. Then, the GC bias can be practically significant because the difference in expression may be attributed to the exon’s GC content. Likewise, it may be attributed to the nucleotide composition of the exon that causes it to attract an unfair number of reads due to the sequence bias. In that case, the GC and sequence biases need to be adjusted for during quantification.
Data from the Sequencing Quality Control Project (SEQC) was obtained from the Mayo Clinic site, which used the Illumina HiSeq2000. The dataset consists of about one billion 2×100 paired-end reads, which amounts to about 512 GB of uncompressed fastq files, with about 4 GB for each of the 128 samples.
To control the sequencing quality, 92 ERCC transcripts were used. The expected abundance (concentration) of the transcripts was known and differed by a factor of one million across transcripts. The transcript reads come from two groups, Mix A and Mix B. The mixes were prepared in such a way that, for each transcript, we knew the fold change between Mix A and Mix B (the fold change could take four distinct values). Therefore, we were able to test the accuracy of both abundance and fold change estimation.
After Bowtie alignment to the ERCC reference file and filtering on base and alignment quality, about 2.3% of the original one billion reads were retained. For technical reasons, the reads were treated as single-end.
A drawback of ERCC transcripts is that they do not produce multireads, so it was impossible to compare all of the different approaches to multiread handling described above. Likewise, ERCC transcripts are not alternatively spliced, i.e. they do not contain a large number of isoforms that differ by a short exon. Therefore, we were not likely to observe the effect of sequencing bias and GC bias corrections.
Quantification was performed by Partek Flow with the EM algorithm and the plots of estimated vs. expected transcript abundance for Mix A and Mix B are in Figures 1 and 2 (respectively). As we can see, the r2 is about 97% and, hence, there is very little room for improvement in abundance estimation.
Furthermore, a fold change plot is given (Figure 3), comparing the estimated and expected fold change in Mix A vs. Mix B, for the four groups of transcripts with the known fold change values. The fold change values were calculated based on log2 transformed RPKM values for each transcript (control).
To test for possible biases in abundance estimates, we combined the approaches of Li et al. (9) and Jiang et al. (12). That amounted to regressing the estimated abundance not only on the expected abundance, but also on the transcript length, GC content, and the expected fold change, subjecting all the variables to log2 transformation.
We started with the full model containing four covariates and performed model selection based on two criteria: adjusted r2 (computed for all possible models) and stepwise regression (with a cutoff p-value of 0.15). The combined approach allowed us to consider both practical and statistical significance of the covariates.
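A rough Python analogue of the all-subsets search on adjusted r2, with the expected abundance always kept in the model, might look like the following; the simulated covariates are purely illustrative, not the SEQC data.

```python
from itertools import combinations
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 92  # roughly one row per ERCC transcript
df = pd.DataFrame({
    "expected": rng.uniform(0, 20, n),     # log2 expected abundance
    "length": rng.uniform(8, 11, n),       # log2 transcript length
    "gc": rng.uniform(-1.5, -0.5, n),      # log2 GC fraction
    "fold": rng.choice([0.0, 1.0, 2.0], n),
})
df["estimated"] = df["expected"] + 0.05 * df["length"] + rng.normal(0, 0.5, n)

# Fit every subset of the covariates and keep the highest adjusted r^2.
covariates = ["length", "gc", "fold"]
best = max(
    (smf.ols("estimated ~ expected" + "".join(" + " + c for c in subset),
             df).fit()
     for k in range(len(covariates) + 1)
     for subset in combinations(covariates, k)),
    key=lambda m: m.rsquared_adj,
)
print(best.model.formula, round(best.rsquared_adj, 4))
```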
The full model for Mix A (Table 1) has the highest adjusted r2, but it was only 0.01% better than the model consisting of the expected abundance and length only. The latter was also pointed to by the stepwise selection, so we nominated it as the best model (Table 2). While the length effect in the best model was statistically significant, the adjusted r2 was only 0.44% higher than that of the benchmark model (Table 3). Therefore, we found little evidence of the practical significance of the length bias.
The full and the best models for the Mix B are shown in Tables 4 and 5 (respectively). Apparently, the regression failed to find evidence of any kind of bias.
Although we did not report the results obtained by the Cufflinks quantification algorithm, we believe that no extra analysis is necessary. First, the ERCC approach is not sensitive enough to detect the subtle effects that are estimated by the Cufflinks quantification algorithm. Second, we showed that, even if those effects were taken into account, there would be very little room for improvement that could be detected by ERCC tools.
Jiang and colleagues came to a similar conclusion (12). Although they did not include the dose response and fold change plots in their manuscript, in personal correspondence they acknowledged that they had actually constructed the plots but failed to see a significant impact of the bias corrections implemented in Cufflinks. The minimal ERCC transcript length is about 250 bp, and, unless one is interested in comparing isoforms that differ by a much shorter sequence (50 bp is a good estimate), the bias corrections are of little use.
Langmead B, Hansen KD, Leek JT. Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol. 2010;11(8):R83.
Li J, Jiang H, Wong WH. Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biol. 2010;11:R50.
Xing Y, Yu T, Wu YN, Roy M, Kim J, Lee C. An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs. Nucleic Acids Res. 2006;34:3150-3160.
Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 2011;12:R22.
Trapnell C, Williams BA, Pertea G, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28:511-515.
Trapnell C, Roberts A, Goff L, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7:562-578.
Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011;27:2325-2329.
Hansen KD, Brenner SE, Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010;38:e131.
Li J, Jiang H, Wong WH. Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biol. 2010;11:R50.
Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics. 2010;26:493-500.
Leng N, Dawson JA, Thomson JA, et al. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics. 2013;29:1035-1043.
Jiang L, Schlesinger F, Davis CA, et al. Synthetic spike-in standards for RNA-seq experiments. Genome Res. 2011;21:1543-1551.
Dohm JC, Lottaz C, Borodina T, Himmelbauer H. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008;36:e105.
Zheng W, Chung LM, Zhao H. Bias detection and correction in RNA-Sequencing data. BMC Bioinformatics. 2011;12:290.
ERCC RNA spike-in control mixes. http://tools.thermofisher.com/content/sfs/manuals/cms_086340.pdf Accessed: March 25, 2016
Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology. Biol Direct. 2009;4:14.
When visualizing high-throughput data, for example via a 2D or 3D scatter plot, where each observation carries thousands of measurements, dimension reduction techniques need to be used to display the distances between observations. One can apply Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), etc. Those techniques convert data from a high-dimensional space to a lower-dimensional space while conveying similar information. When the information represented in the lower-dimensional space is not exactly the same as in the higher-dimensional space, a mapping error is generated. In the advanced option settings of dialogs like PCA and t-SNE, there is a Generate mapping error statistics option; checking it will output the mapping error on the plot.
The distance between the ith and jth data points in the original space is denoted by d*_ij, and the distance between their projections in the lower-dimensional space is denoted by d_ij. Sammon's mapping error (1) is calculated as follows:

E = (1 / Σ_{i<j} d*_ij) * Σ_{i<j} (d*_ij − d_ij)² / d*_ij
To avoid emphasizing small distances, a quadratic mapping error is also generated, according to the formula:

E = Σ_{i<j} (d*_ij − d_ij)² / Σ_{i<j} (d*_ij)²
Both error measures range from zero to plus infinity.
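Given a high-dimensional matrix and its low-dimensional projection, both measures can be computed directly from the two sets of pairwise distances; the quadratic normalization below follows the formula above, and the data are random for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist

def mapping_errors(X_high, X_low):
    """Sammon and quadratic mapping errors: d_star are the pairwise
    distances in the original space, d those after reduction."""
    d_star, d = pdist(X_high), pdist(X_low)
    sammon = np.sum((d_star - d) ** 2 / d_star) / np.sum(d_star)
    quadratic = np.sum((d_star - d) ** 2) / np.sum(d_star ** 2)
    return sammon, quadratic

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))  # 50 observations, 1000 features
P = X[:, :2]                     # a crude 2-D "projection" for illustration
print(mapping_errors(X, P))
```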
[1] Sammon JW, 1969, A nonlinear mapping for data structure analysis, IEEE Transactions on Computers, C-18(5): 401-409
Gene-specific analysis (GSA) is a statistical modeling approach used to test for differential expression of genes or transcripts in Partek Flow. The terms "gene", "transcript" and "feature" are used interchangeably below. The goal of GSA is to identify a statistical model that is the best for a specific transcript, and then use the best model to test for differential expression.
To test for differential expression, a typical approach is to pick a certain parametric model that can deliver p-values for the factors of interest. Each parametric model is defined by two components:
The choice of response distribution, e.g. Lognormal, Poisson, Negative Binomial.
The choice of model design, i.e. the factors and factor interactions that should be present in the model. Apart from the factor(s) of interest, it might be necessary to include some nuisance factors in order to increase the statistical power of detecting the effect(s) of interest. It is not known in advance which nuisance factors are important enough to be included in the model.
A fairly typical approach is to select the response distribution in advance, regardless of a particular dataset, and assume that the same response distribution is a good fit for all transcripts. GSA provides a more flexible approach by assuming that the transcript expression can be described by the transcript-specific combination of response distribution and design.
For instance, it might be the case that Poisson distribution is a good fit for Transcript 1, but Negative Binomial is a much better choice for Transcript 2. Likewise, the expression of Transcript 1 can be affected by a factor describing the subject's gender, whereas the expression of Transcript 2 is not gender-sensitive.
In GSA, the data are used to choose the best model for each transcript. Then, the best model produces p-values, fold change estimates, and other measures of importance for the factor(s) of interest.
Currently, GSA is capable of considering the following five response distributions: Normal, Lognormal, Lognormal with shrinkage, Negative Binomial, Poisson (Figure 1). The GSA interface has an option to restrict this pool of distributions in any way, e.g. by specifying just one response distribution. The user also specifies the factors that may enter the model (Figure 2A) and comparisons for categorical factors (Figure 2B).
Based on the set of user-specified factors and comparisons, GSA considers all possible designs with the following restrictions:
It should be possible to estimate the specified comparison(s) using the design. That is, if the user specifies a comparison w.r.t. levels of a certain factor, that factor must be included.
The design can include the factor interactions, but only up to 2nd order.
If a 2nd order interaction term is present in the design, then all first order terms must be present.
Restrictions 2 and 3 mean that GSA considers only hierarchical designs of 1st and 2nd order.
Example. Suppose the study involves two categorical factors, A and B. Factor A has two levels, and so does factor B. There are three observations for each combination of the levels of A and B, i.e. we are dealing with a 2*2 design with 3 observations per cell, and a total sample size of 12. The user considers factor A to be of interest, and factor B is believed to be a nuisance factor, so a comparison is specified only w.r.t. factor A. As a result, the design space contains the following three elements:
A (main effect of A)
A + B (both main effects)
A + B + A*B (both main effects and the interaction of A and B)
In addition, the design pool can be restricted by the user through specifying the minimal number of degrees of freedom for the error (Figure 1). The recommended threshold value is six [1]. In this example, the largest model with both main effects and interactions has eight error degrees of freedom, hence all of the three possible designs will be considered. If the user decides to search for the best gene-specific model across all five possible response distributions, it amounts to fitting 3 * 5 = 15 models for each feature.
What model is the best for a given feature is determined through an information criterion specified by the user. At this point, the user can choose between AICc and AIC (Figure 1). AICc and AIC are recommended for small and medium sample sizes, respectively [2]. As the sample size goes up, AICc converges to AIC. Therefore, the user should usually do fine by utilizing the default choice, AICc.
The model with the lowest information criterion is considered the best choice. It is possible to quantify the superiority of the best model by computing the so-called Akaike weight (Figure 3). The model's weight is interpreted as the probability that the model would be picked as the best if the study were reproduced. In the example above, we can obtain 15 Akaike weights that sum to one. For instance, if the best model has an Akaike weight of 0.95, then it is greatly superior to the other candidates from the model pool. If, on the other hand, the best weight is 0.52, then the best model is likely to be replaced if the study were reproduced. We can still use this "best shot" model for downstream analysis, keeping in mind that the accuracy of this "best shot" is fairly low.
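The machinery is easy to state in code: AICc penalizes the fit by the parameter count with a small-sample correction, and Akaike weights renormalize the candidate models; the AICc values below are hypothetical.

```python
import numpy as np

def aicc(log_lik, k, n):
    """AIC with the small-sample correction; k counts all estimated
    parameters and n is the number of observations."""
    aic = 2 * k - 2 * log_lik
    return aic + 2 * k * (k + 1) / (n - k - 1)

def akaike_weights(criteria):
    """Akaike weights from a vector of AICc values: rescale by the
    best model and normalize, so the weights sum to one."""
    delta = np.asarray(criteria) - np.min(criteria)
    w = np.exp(-0.5 * delta)
    return w / w.sum()

# Hypothetical AICc values for three candidate models of one feature:
print(akaike_weights([102.1, 104.5, 110.0]))  # roughly [0.76, 0.23, 0.01]
```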
GSA is a great tool for automatic model selection: the user does not have to spend time and effort on picking the best model by hand. In addition, GSA allows one to run studies with no replication (one sample per condition) by specifying Poisson distribution in Advanced Options (Figure 1). However, not having enough replication is likely to deliver the results that are far more descriptive of the dataset at hand rather than the population from which the dataset was sampled. For instance, in the example from the previous section the number of fitted models (15) exceeds the number of observations (12), and therefore the results are unlikely to be representative of the population. If the study were to be reproduced with some 12 new samples, the list of significant features could be very different.
The problem is especially pronounced if the user disables the multimodel approach option in Advanced Options (Figure 1, "Enable multimodel approach"). In that case, for each feature only the one model with the highest Akaike weight is used. Under a low replication scenario and a large number of candidate models in the pool, such a single best model provides a great description of the sampled data but not of the underlying population. The multimodel approach has some protection against generating non-reproducible results: all other things being equal, the p-value for the comparison of interest goes up as the number of models in the pool increases. Suppose there are two models with Akaike weights of 0.5 each. When used separately, the first model generates a p-value of 0.04 and the second model generates a p-value of 0.07. When multimodel inference is enabled, it is possible to get a multimodel p-value well above 0.07. That protects one from making spurious calls but, if the number of models is very large, the user can end up with inflated p-values and no features to call.
The settings used in GSA's Default mode ("Model type configuration" in Figure 1) are aimed at ensuring higher reproducibility of the study for a fairly typical scenario of low replication. First of all, the response distribution is limited to "Lognormal with shrinkage". The latter is virtually identical to "limma trend" method from the limma package [5]. The idea behind shrinkage is to pool information across all of the genes which results in a more reproducible study under a low replication scenario. As the sample size goes up, the amount of shrinkage is reduced automatically, and eventually the method and its results become equivalent to a feature-specific "Lognormal" method (Figure 1).
Second, the error degrees of freedom threshold in Default mode is adjusted automatically for the purpose of excluding designs that have too many terms compared to the number of observations. Third, the hierarchical design restriction explained in the previous section is also aimed at keeping the number of considered designs reasonable. The user can override the first two restrictions by switching from Default to Custom mode in Advanced Options (Figure 1). The 2nd order and hierarchical design restrictions cannot be disabled in GSA but they are absent in Flow ANOVA. The latter is similar to "Lognormal" option in GSA.
When "Lognormal with shrinkage" is enabled, a separate shrinkage plot is displayed for each design (Figure 4). First, a lognormal linear model is fitted for each gene separately, and the standard deviations of residual errors are obtained (green dots in the plot). Applying shrinkage amounts to two more steps. We look at how the errors change depending on the average gene expression and we estimate the corresponding trend (black curve). Finally, the original error terms are adjusted (shrunk) towards the trend (red dots). The adjusted error terms are plugged back into the lognormal model to obtain the reported results such as p-value.
All other things being equal, the comparison p-value goes up as the magnitude of the error term goes up, and vice versa. As a result, the "shrunken" p-value goes up (down) if the error term is adjusted up (down). Table 1 reports some results for the two features highlighted in Figure 4.
For a large sample size, the amount of shrinkage is small (Figure 5), and the "Lognormal" and "Lognormal with shrinkage" p-values become virtually identical.
One important usage of the shrinkage plot is a meaningful setting of low expression threshold in Low expression filter section (Figure 6). For features with low expression, the proportion of zero counts is high. Such features are less likely to be of interest in the study, and, in any case, they cannot be modeled well by a continuous distribution, such as Lognormal. Note that adding a positive offset to get rid of zeros does not help because that does not affect the error term of a lognormal model much. A high proportion of zeros can ultimately result in a drop in the trend in the leftmost part of the shrinkage plot (Figure 6).
A rule of thumb suggested by limma authors is to set the low expression threshold to get rid of the drop and to obtain a monotone decreasing trend in the left-hand part of the plot.
For instance, in Figure 5 it looks like a threshold of 2 can get us what we want. Since the x axis is on the log2 scale, the corresponding value for "Lowest average coverage" is 2^2=4 (Figure 6). After we set the filter that way and rerun GSA (Figure 7), the shrinkage plot takes the required form (Figure 8).
Note that it is possible to achieve a similar effect by increasing a threshold of "Lowest maximal coverage", "Minimum coverage", or any similar filtering option (Figure 7). However, using "Average coverage" is the most straightforward: the shrinkage procedure uses log2(Average coverage) as an independent variable to fit the trend, so the x axis in the shrinkage plot is always log2(Average coverage) regardless of the filtering option chosen in Figure 7.
As suggested above, Lognormal with shrinkage is not intended to work for a dataset that contains a lot of zeros, a set of low expression features being one example. In that case, a count-based (Poisson or Negative Binomial) model is more appropriate. With that in mind, it appears to be a simple solution to enable Poisson, Negative Binomial, and Lognormal with shrinkage in Advanced Options and let the model selection happen automatically. However, there are a couple of technical difficulties with that approach. First, shrinkage can fail altogether because of the presence of low expression features (in that case, the user will get an informative error message in GSA). Second, there is an issue of reproducibility discussed above. However, for a large sample size shrinkage becomes all but disabled and reproducibility is not a concern. In that case, one can indeed cover all levels of expression by using Poisson, Negative Binomial, and Lognormal together.
While shrinkage is likely to be beneficial for a small sample study overall, there can be some features that deviate so much from the rest that, for them, a feature-specific model can generate results that are more reproducible. The shrinkage plot provides a visually appealing, if informal, way to identify such features. For instance, in Figure 4 the feature-specific error term (green) for feature ERCC-00046 is far away from the trend and, as a result, it is adjusted upwards a lot. When the adjusted error term is plugged into the lognormal model for ERCC-00046, the model fits the observed ERCC-00046 counts far worse, but the hope is that the results become closer to how ERCC-00046 behaves in the population. However, in this case the adjustment is quite big, and one can say that a very large feature-specific effect is present and, as a result, shrinkage will do more harm than good.
A similar problem is present in count-based models as well. For instance, the handling of features with atypical variance parameters is addressed in DESeq2 [6], if in a rather peculiar manner. In DESeq2, if a feature has a variance that is "too high" compared to the trend then shrinkage is disabled altogether, which results in a much higher reported p-value because the variance is not adjusted downwards. However, for a feature with abnormally low variance shrinkage is never disabled which inflates the corresponding p-value because the variance is adjusted upwards. It is not clear why low and high variance features are not treated in the same manner, unless the goal is to exclude them from the final call list by inflating their p-values through this ad hoc procedure. One of DESeq2 authors admitted that, indeed, the goal is to stay on the "conservative side" when dealing with outlying features in a low replication scenario [7].
That line of reasoning suggests that neither DESeq2 nor limma are perfectly equipped for dealing with abnormal features. In fact, "limma trend" has no way to deal with them at all: shrinkage is applied regardless. If such abnormality is coupled with a low level of expression, it could be a good idea to get rid of the outlying features by raising the low expression threshold. For instance, while the trend in Figure 8A is monotone and decreasing in the left hand part of the plot, there are many low expression features with abnormally low error terms. Unless we have a special interest in those features, it makes sense to raise the low expression threshold so as to get rid of them.
There can also be a situation where a small number of low expression features have a very high influence on the trend which affects the p-values for all of the features (Figure 8B). It is reasonable to assume that the overall results should not be sensitive to the presence or absence of a few features, especially if they happen to have low expression. It makes sense to get rid of such influential points by increasing the threshold accordingly.
Speaking of higher expression features, presently GSA has no automatic method to separate "abnormal" and "normal" features, so the user has to do some eyeballing of the shrinkage plot. However, for the purpose of investigating standalone outliers GSA can quantify the benefit of shrinkage in a well grounded way. In order to do that, one can enable both Lognormal and Lognormal with shrinkage in Advanced Options (Figure 9).
Figure 10 contains a pie chart for the dataset whose shrinkage plot is displayed in Figure 4. Because of a small sample size (two groups with four observations each) we see that, overall, shrinkage is beneficial: for an "average" feature, Akaike weight for feature-specific Lognormal is 25%, whereas Lognormal with shrinkage weighs 75%.
At the same time, if we look at ERCC-00046 specifically (Figure 11), we see that Lognormal with shrinkage fits so poorly that its Akaike weight is virtually zero, despite having fewer parameters than feature-specific Lognormal.
Using multimodel inference appears to be a better alternative to the ad hoc method in DESeq2 that switches shrinkage fully on or off. Once again, it is both technically possible and emotionally tempting to automate the handling of abnormal features by enabling both Lognormal models in GSA and applying them to all of the transcripts. Unfortunately, that can make the results less reproducible overall, even though it is likely to yield more accurate conclusions about the drastically outlying features.
[1] Auer, 2011, A two-stage Poisson model for testing RNA-Seq.
[2] Burnham, Anderson, 2010, Model selection and multimodel inference.
[3] Dillies et al, 2012, A comprehensive evaluation of normalization methods.
[4] Storey, 2002, A direct approach to false discovery rates.
[5] Law et al, 2014, Voom: precision weights unlock linear model analysis. Note that this paper introduces both "limma trend" and "limma voom", but the present implementation in GSA corresponds to "limma trend".
[6] Love et al, 2014, Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2.
[7] Bioconductor support forum. Accessed last: 4/12/16. https://support.bioconductor.org/p/80745/#80758
The purpose of scaling is to remove the variation of response that is described by certain nuisance experimental factors, meaning that scaling can also be called removal of unwanted variation. To distinguish scaling from similarly named procedures such as RUV normalization [1], it is important to keep in mind the following two points. First, the experimental factors participating in scaling are always known (observed) before the model is fitted.
Second, it is important to understand why scaling needs to be performed as a separate step. In the context of single cell analysis, scaled data are meant to be used for subpopulation identification. We assume that the response variance is explained by an unobserved (latent) factor of interest that identifies the subpopulations and by some observed nuisance factor(s). If we perform clustering on unscaled data, it is possible that the data will cluster by the nuisance factor as opposed to the factor of interest. We can compare this to a bulk-RNA experiment where both the factor of interest and the nuisance factors are known and the goal is to find features that are differentially expressed with respect to the factor of interest. In that case, a separate scaling task is not necessary because we can simply include all of the factors in the model and specify the contrasts only with respect to the factor of interest. Therefore, if for some reason we assume that all of the factors are observed, we should skip scaling and apply a bulk-RNA type of differential expression analysis.
Note also that after the k cell types have been identified one can add the corresponding factor with k levels to the data, and the new factor can be treated as observed [3] in downstream analysis.
We assume that for a given single feature (gene) the expected response can be represented as follows:

g(E[Y]) = β0 + βX*X + β1*Z1 + … + βm*Zm (Equation 1)

log(Y) = β0 + βX*X + β1*Z1 + … + βm*Zm + ε (Equation 2)
Here, Equation 1 corresponds to a generalized linear model (Poisson, Negative Binomial, etc.) with link function g, and Equation 2 is a loglinear (Lognormal) model. Factor X is the unobserved factor of interest and Z1, …, Zm are the observed nuisance factors. Because we assume that the library sizes have already been adjusted for via normalization, the intercept term is the same across libraries and no “offset term” (denoted by O in [1], Equation 1) is necessary.
The idea of scaling is to adjust the observed response, Y, so that its expectation does not depend on Z1, …, Zm. If a random factor is present in the model, it describes not the expected value of Y but the variance of the error term and/or the correlation among observations. We treat all of the nuisance factors as fixed because our primary goal is to adjust the expectation of the response. However, since in some models (Poisson, Negative Binomial) the response variance depends on the mean, we also adjust the response variance for the batch effect whenever such models are used.
We fit a generalized linear or loglinear model with Z1, …, Zm as covariates to obtain the estimates b1, …, bm of the nuisance effects as follows:

g(E[Y]) = b0 + b1*Z1 + … + bm*Zm (Equation 3)

The values of b1, …, bm will coincide with β1, …, βm only if Z is perfectly independent of X, which is never the case in practice. The point is that if we scale the observed Y as follows:

Y' = Y / exp(b1*Z1 + … + bm*Zm) (Equation 4)

log(Y') = log(Y) − (b1*Z1 + … + bm*Zm) (Equation 5)

via a generalized linear (Equation 4) or loglinear (Equation 5) model, then the expected value of the scaled response, Y', becomes independent of Z. If we regress Y' on Z, we should see far greater p-values than in Equation 3.
If there is correlation between the nuisance factors and the factor of interest, then by adjusting the response for Z, we also reduce its dependence on X. The analysis can still be insightful, but only if the factor of interest explains a sizeable part of the response variation even after the nuisance factors are taken into account. For a single cell subpopulation study, if there are no meaningful clusters in the end, it might be because there are no distinct cell types, or because distinct cell types are present but the type factor is too correlated with the nuisance factor(s), or because a combination of these two scenarios is in play.
In theory, the problem of choosing the best model is tricky because, strictly speaking, the choice of the best response distribution depends on the unknown factor X. That being said, it is possible to automate the model choice by utilizing the AICc exactly the way it is already implemented in Flow GSA. The multiple models are constructed by combining the available response distributions with the set of all possible batch designs up to 2nd order (subject to a hierarchical restriction). In the multimodel approach, the final scaled response is obtained by weighting the scaled responses from the individual models by the corresponding Akaike weights.
The scaling step in Seurat [2] allows the user to choose among log-linear, Negative Binomial, and Poisson models. Unfortunately for the user, Seurat provides no guidance as to how to pick the best option and, most likely, the log-linear default is applied all the time in practice. Likewise, if there are a few batch factors, there is no guidance in Seurat as to how to decide which design (a set of batch factors, possibly with interactions) is best. In particular, including factors that do not exhibit a significant batch effect can lead to overcorrection, meaning that the variation of interest is removed instead of the nuisance variation.
[1] Risso et al, 2014, Normalization of RNA-Seq data using factor analysis
[2] Satija et al, 2017, Seurat - Guided Clustering Tutorial, “Scaling the data and removing unwanted sources of variation” section (http://satijalab.org/seurat/pbmc3k_tutorial.html)
[3] Satija et al, 2017, Seurat - Guided Clustering Tutorial, “Finding differentially expressed genes (cluster biomarkers)” section (http://satijalab.org/seurat/pbmc3k_tutorial.html)
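To make the loglinear variant concrete, below is a minimal sketch of Equation 5-style scaling with a single observed batch factor; the function name and simulated values are illustrative, not Partek Flow's implementation.

```python
import numpy as np
import statsmodels.api as sm

def scale_out_batch(y_log, batch):
    """Fit the nuisance (batch) factor to the log expression of one
    feature and keep residuals plus the grand mean, so the scaled
    response no longer depends on batch."""
    design = sm.add_constant(np.asarray(batch, dtype=float))
    fit = sm.OLS(y_log, design).fit()
    return y_log - fit.fittedvalues + y_log.mean()

rng = np.random.default_rng(0)
batch = np.repeat([0, 1], 50)             # one observed nuisance factor
y = rng.normal(5, 1, 100) + 2.0 * batch   # strong batch shift
y_scaled = scale_out_batch(y, batch)
print(y[:3].round(2), y_scaled[:3].round(2))
# Regressing y_scaled on batch now yields a near-zero coefficient.
```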