RNA-Seq Quantification

An RNA-Seq quantification algorithm determines how aligned reads are assigned (i.e. mapped) to the transcript model. In turn, the quantification result provides a basis for the expression estimation and novel transcript discovery.

Each combination of sequencing instrument and sequencing protocol varies in terms of how precisely it is able to estimate the transcript abundance, and/or what data biases it propagates downstream. A good quantification algorithm is expected to extract the maximal amount of information from the reads, and, importantly, to adjust for the biases of the sequencer and protocol. However, the practical impact of biases depends on the purpose of the study. Likewise, if the study agenda includes some particular items such as novel transcript discovery, this alone can determine the choice of a combination of aligner and quantification package.

Challenges and Solutions

The most recognized quantification issues common to RNA-Seq include the following:

  • Multireads (multimappers)

  • Sequence (composition) bias

  • Position bias

  • GC bias

Multireads

Multireads (or multimappers) are reads aligning to multiple locations in the genome. Three different approaches to handling those reads have been described.

  • Disallowing multireads (obsolete) (1, 2).

  • Full expectation-maximization (EM) algorithm (3), as implemented in Partek® Flow®, with slight modifications.

  • The “rescue” method implemented in Cufflinks (4 - 7), which is equivalent to the first (presumably the most informative) iteration of EM.

For paired-end reads with multiple alignments, it is possible to improve quantification by extracting information from the fragment (insert) size distribution. This feature is available in Cufflinks.
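
The EM approach to multireads listed above can be illustrated with a minimal sketch. This is a generic, textbook-style EM, not Partek's modified version and not the Cufflinks implementation; the function name `em_multireads`, its arguments, and the toy reads are all hypothetical. Each read is split across the transcripts it aligns to in proportion to the current abundance per base (E-step), and the abundances are then re-estimated from those fractional counts (M-step).

```python
from collections import defaultdict

def em_multireads(read_hits, lengths, n_iter=50):
    """Estimate relative transcript abundance when reads may align to
    several transcripts (illustrative only).

    read_hits: list of lists; read_hits[i] holds the transcript IDs read i aligns to.
    lengths:   dict mapping transcript ID -> transcript length in bases.
    Returns a dict of read fractions per transcript, summing to 1.
    """
    transcripts = sorted(lengths)
    # Start from a uniform guess.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        counts = defaultdict(float)
        # E-step: split each read across its candidate transcripts in
        # proportion to the current abundance per base.
        for hits in read_hits:
            weights = [theta[t] / lengths[t] for t in hits]
            total = sum(weights)
            for t, w in zip(hits, weights):
                counts[t] += w / total
        # M-step: new abundances are the normalised expected counts.
        n = sum(counts.values())
        theta = {t: counts[t] / n for t in transcripts}
    return theta

# Toy input: 6 reads unique to A, 2 unique to B, 4 multireads hitting both.
reads = [["A"]] * 6 + [["B"]] * 2 + [["A", "B"]] * 4
theta = em_multireads(reads, {"A": 1000, "B": 1000})
```

In this toy input the multireads end up shared 3:1 in favour of transcript A, because the uniquely aligned reads already indicate that A is more abundant.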

Sequence Bias

Sequence bias refers to the observation that the reads that have particular subsequences of nucleotides (typically, at 5’ or 3’ end) may be over- or underrepresented due to some artifacts of sequencing technology (8, 9) (for an illustration see Figure 2 in Roberts et al. (4)). The corresponding bias correction is implemented in Cufflinks.

Position Bias

Position bias (10 - 12) is manifested in unequal distribution of reads along the length of a transcript (see Figure 4 in Jiang et al. (12)). The corresponding correction is implemented in Cufflinks.

GC Bias

GC bias means that the estimated abundance of a transcript (gene) depends on its GC content. While Dohm et al. reported that GC-rich regions attract reads (see Figure 2 in (13)), another study argued that the dependence can go either way (see Figure 2 in Zheng et al. (14)). Although no available package includes a GC bias correction, the Cufflinks team showed that in some cases the GC bias is highly correlated with the sequence bias, in which case a separate GC correction is unnecessary.

Assessing the Practical Impact of Biases

An accepted way to measure the impact of a certain bias is as follows: take a number of transcripts (genes) and perform the quantification with and without bias correction and then observe the change in concordance between the expected (known) and the estimated abundance. Initially, the expected abundances were estimated based on qPCR (4). More recently, the usage of presumably more accurate External RNA Controls Consortium (ERCC) controls took hold (12, 15). In the latter approach, the known abundance is measured as concentration of transcripts in attomoles/μL.

The estimated abundance is derived from the fragments per kilobase of transcript per million aligned reads (FPKM) or reads per kilobase of transcript per million aligned reads (RPKM) values delivered by the quantification algorithm. The concordance between known and estimated abundances is measured by r^2, a value ranging from 0% to 100%, where 100% corresponds to perfect estimation of abundance.
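
As a rough illustration of the two quantities above, the sketch below computes an RPKM value from a read count and an r^2 between known and estimated abundances. It assumes that r^2 is the squared Pearson correlation of a simple linear fit; the function names and the numbers in the usage lines are hypothetical.

```python
def rpkm(read_count, transcript_length_bp, total_aligned_reads):
    """Reads per kilobase of transcript per million aligned reads."""
    per_kilobase = read_count / (transcript_length_bp / 1e3)
    return per_kilobase / (total_aligned_reads / 1e6)

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical transcript: 500 reads, 2,000 bases long, 10 million aligned reads.
value = rpkm(500, 2000, 10_000_000)  # 25.0

# Hypothetical known vs. estimated abundances, already on a log2 scale.
concordance = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.9])
```

Multiplying the result of `r_squared` by 100 gives the percentage form used in the text.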

The studies of Roberts et al. (4) and Li et al. (10) measured the importance of sequence and position bias correction based on qPCR, using a set of genes from the Microarray Quality Control (MAQC) project. Roberts and colleagues reported that, when both sequence and position bias are accounted for, the r^2 goes up by about 5 percentage points, from 75.3% to 80.7%. The impact of position bias is reported to be small relative to that of sequence bias.

Li and Dewey (10) showed that the RSEM quantification package, which implements only the position bias correction, delivers an r^2 of 69%. For comparison, running Cufflinks on the same dataset with the position bias correction yielded an r^2 of only 71%. When both sequence and position bias corrections were enabled in Cufflinks, the r^2 went up to 79%. We can conclude that, for real datasets, the impact of position bias is negligible, and the impact of sequence bias is about 5-8 percentage points of r^2.

More qPCR-based insight was given by Li et al. (9). They proposed a so-called MART model for the sequence bias, which was subsequently incorporated in Cufflinks in May 2011. According to their findings, the sequence bias correction has a negligible effect on the estimated abundance overall. However, this conclusion changed when they conditioned on gene length and fold change: long genes were not affected by sequence bias as much as short genes, and the impact was sizable for the genes with the largest fold change.

An ERCC-based study by Jiang et al. (12) used a dataset consisting of 4.4 billion 76-base paired-end reads with 96 ERCC transcripts. They found that, while the sequencing bias does exist on short genomic intervals, it disappeared (averaged out) very quickly as the length of the transcript (gene) increased. They also found some evidence of GC content and length biases. The length bias was not covered earlier in Challenges and Solutions (at the beginning of this document) because length is a feature of the entire transcript (gene), and therefore it can be adjusted for downstream of quantification, as proposed in (16). A similar approach can be applied to GC bias, and it should work as long as our goal is to compare the abundances of the same transcript (gene) under different conditions.

On the other hand, suppose we want to compare the expression of two isoforms that differ by a short subsequence, such as a 50 bp exon. Then, the GC bias can be practically significant because the difference in expression may be attributed to the exon’s GC content. Likewise, it may be attributed to the nucleotide composition of the exon that causes it to attract an unfair number of reads due to the sequence bias. In that case, the GC and sequence biases need to be adjusted for during quantification.

Partek EM Algorithm: Validation of Quantification Results

Data from the Sequencing Quality Control Project (SEQC) was obtained from the Mayo Clinic site, which used the Illumina HiSeq2000. The dataset consists of about one billion 2×100 paired-end reads, which amounts to about 512 GB of uncompressed fastq files, or about 4 GB for each of the 128 samples.

To control the sequencing quality, 92 ERCC transcripts were used. The expected abundance (concentration) of the transcripts was known, and it differed by a factor of one million across transcripts. The transcript reads come from two groups, Mix A and Mix B. The mixes were prepared in such a way that, for each transcript, we knew the fold change between Mix A and Mix B (the fold change could take four distinct values). Therefore, we were able to test the accuracy of both abundance and fold change estimation.

After Bowtie alignment to the ERCC reference file and filtering on base and alignment quality, about 2.3% of the original one billion reads were retained. For technical reasons, the reads were treated as single-end.

A drawback of ERCC transcripts is that they do not produce multireads, so it was impossible to compare all of the different approaches to multiread handling described above. Likewise, ERCC transcripts are not alternatively spliced, i.e. they do not contain a large number of isoforms that differ by a short exon. Therefore, we were not likely to observe the effect of sequencing bias and GC bias corrections.

Quantification was performed by Partek Flow with the EM algorithm, and the plots of estimated vs. expected transcript abundance for Mix A and Mix B are in Figures 1 and 2 (respectively). As we can see, the r^2 is about 97% and, hence, there is very little room for improvement in abundance estimation.

Furthermore, a fold change plot is given (Figure 3), comparing the estimated and expected fold change in Mix A vs. Mix B, for the four groups of transcripts with the known fold change values. The fold change values were calculated based on log2 transformed RPKM values for each transcript (control).
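
The fold change calculation described above amounts to a difference of log2-transformed values. A minimal sketch follows; the function name and the example RPKM values are hypothetical, and how zero RPKM values were handled is not stated in the text, so the function simply rejects them.

```python
import math

def log2_fold_change(rpkm_mix_a, rpkm_mix_b):
    """Return log2(RPKM in Mix A) - log2(RPKM in Mix B) for one control."""
    if rpkm_mix_a <= 0 or rpkm_mix_b <= 0:
        # Assumption: zero or negative values are not handled in this sketch.
        raise ValueError("RPKM values must be positive to take log2")
    return math.log2(rpkm_mix_a) - math.log2(rpkm_mix_b)

# A control with RPKM 8.0 in Mix A and 2.0 in Mix B is 4-fold higher in Mix A.
observed = log2_fold_change(8.0, 2.0)  # 2.0 on the log2 scale
```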

To test for possible biases in abundance estimates, we combined the approaches of Li et al. (9) and Jiang et al. (12). That amounted to regressing the estimated abundance not only on the expected abundance, but also on the transcript length, GC content, and the expected fold change, subjecting all the variables to log2 transformation.

We started with the full model containing four covariates and performed model selection based on two criteria: adjusted r^2 (computed for all possible models) and stepwise regression (with a cutoff p-value of 0.15). The combined approach allowed us to consider both practical and statistical significance of the covariates.
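
The first of the two criteria, adjusted r^2 computed for all possible models, can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used for the tables: the function names are hypothetical, the inputs are assumed to be log2-transformed already, ordinary least squares is solved through the normal equations, and the stepwise procedure with its p-value cutoff is not implemented.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def adjusted_r2(columns, y):
    """Fit y ~ intercept + columns by least squares and return the adjusted r^2."""
    n, p = len(y), len(columns)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = p + 1
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    # Penalise the fit for the number of covariates used.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def rank_models(covariates, y):
    """Return (adjusted r^2, covariate names) for every non-empty subset, best first."""
    names = sorted(covariates)
    results = []
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            score = adjusted_r2([covariates[name] for name in subset], y)
            results.append((score, subset))
    return sorted(results, reverse=True)
```

Here `covariates` would map names such as expected abundance, length, GC content, and expected fold change to their per-transcript values, and `y` would hold the estimated abundances. Comparing the top-ranked models with the one-covariate benchmark shows how much each extra covariate adds in practical terms.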

The full model for Mix A (Table 1) had the highest adjusted r^2, but it was only 0.01% better than the model consisting of the expected abundance and length only. The latter was also selected by the stepwise procedure, so we nominated it as the best model (Table 2). While the length effect in the best model was statistically significant, the adjusted r^2 was only 0.44% higher than that of the benchmark model (Table 3). Therefore, we found little evidence of the practical significance of the length bias.

The full and the best models for Mix B are shown in Tables 4 and 5 (respectively). Here, the regression failed to find evidence of any kind of bias.

Conclusions

Although we did not report the results obtained by the Cufflinks quantification algorithm, we believe that no extra analysis is necessary. First, the nature of the ERCC approach does not make it a tool sensitive enough to detect the subtle effects that are estimated by the Cufflinks quantification algorithm. Second, we showed that, even if those effects were taken into account, there would be very little room for improvement that could be detected by ERCC tools.

Jiang and colleagues came to a similar conclusion (12). Although they did not include the dose response and fold change plots in their manuscript, in personal correspondence they acknowledged that they had actually constructed the plots but failed to see a significant impact of the bias corrections implemented in Cufflinks. The minimal ERCC transcript length is about 250 bp, and, unless one is interested in comparing isoforms that differ by a much shorter sequence (50 bp is a good estimate), the bias corrections are of little use.

References

  1. Langmead B, Hansen KD, Leek JT. Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol. 2010;11(8):R83.

  2. Li J, Jiang H, Wong WH. Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biol. 2010;11:R50.

  3. Xing Y, Yu T, Wu YN, Roy M, Kim J, Lee C. An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs. Nucleic Acids Res. 2006;34:3150-3160.

  4. Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 2011;12:R22.

  5. Trapnell C, Williams BA, Pertea G, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28:511-515.

  6. Trapnell C, Roberts A, Goff L, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7:562-578.

  7. Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics. 2011;27:2325-2329.

  8. Hansen KD, Brenner SE, Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 2010;38:e131.

  9. Li J, Jiang H, Wong WH. Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biol. 2010;11:R50.

  10. Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics. 2010;26:493-500.

  11. Leng N, Dawson JA, Thomson JA, et al. EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics. 2013;29:1035-1043.

  12. Jiang L, Schlesinger F, Davis CA, et al. Synthetic spike-in standards for RNA-seq experiments. Genome Res. 2011;21:1543-1551.

  13. Dohm JC, Lottaz C, Borodina T, Himmelbauer H. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008;36:e105.

  14. Zheng W, Chung LM, Zhao H. Bias detection and correction in RNA-Sequencing data. BMC Bioinformatics. 2011;12:290.

  15. ERCC RNA spike-in control mixes. http://tools.thermofisher.com/content/sfs/manuals/cms_086340.pdf. Accessed: March 25, 2016.

  16. Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology. Biol Direct. 2009;4:14.

Additional Assistance

If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.

Figure 1. Comparison of expected number of molecules of ERCC Mix A and the estimated number of transcript molecules (log2 of RPKM values), obtained by Partek’s modification of expectation-maximization algorithm. Each dot is an ERCC control. r^2 = 0.97, regression y = 0.98 * x - 74.41
Figure 2. Comparison of expected number of molecules of ERCC Mix B and the estimated number of transcript molecules (log2 of RPKM values), obtained by Partek’s modification of expectation-maximization algorithm. Each dot is an ERCC control. r^2 = 0.97, regression y = 0.98 * x - 74.06
Figure 3. Comparison of expected fold change (log2) vs. observed fold change (log2) in ERCC Mix A vs. Mix B, for the four groups of transcripts with the known fold change values. Each dot is an ERCC control. r^2 = 0.87, regression y = 0.96 * x + 0.09
Table 1. Regression of the observed transcript abundance on the expected transcript abundance, transcript length, GC content, and the expected fold change, subjecting all the variables to log2 transformation. We started with the full model containing four covariates and performed model selection based on two criteria: adjusted r^2 (computed for all possible models) and stepwise regression (with a cutoff p-value of 0.15). The assessment was performed on the Mix A of ERCC, using Partek’s modified expectation-maximization (EM) algorithm for transcript quantification.
Table 2. Regression of the observed transcript abundance on the expected transcript abundance and transcript length (the best model). We started with the full model containing four covariates and performed model selection based on two criteria: adjusted r^2 (computed for all possible models) and stepwise regression (with a cutoff p-value of 0.15). The assessment was performed on the Mix A of ERCC, using Partek’s modified expectation-maximization (EM) algorithm for transcript quantification.
Table 3. Regression of the observed transcript abundance on the expected transcript abundance (the benchmark model). We started with the full model containing four covariates and performed model selection based on two criteria: adjusted r^2 (computed for all possible models) and stepwise regression (with a cutoff p-value of 0.15). The assessment was performed on the Mix A of ERCC, using Partek’s modified expectation-maximization (EM) algorithm for transcript quantification.
Table 4. Regression of the observed transcript abundance on the expected transcript abundance, transcript length, GC content, and the expected fold change, subjecting all the variables to log2 transformation. We started with the full model containing four covariates and performed model selection based on two criteria: adjusted r^2 (computed for all possible models) and stepwise regression (with a cutoff p-value of 0.15). The assessment was performed on the Mix B of ERCC, using Partek’s modified expectation-maximization (EM) algorithm for transcript quantification.
Table 5. Regression of the observed transcript abundance on the expected transcript abundance (the best model). We started with the full model containing four covariates and performed model selection based on two criteria: adjusted r^2 (computed for all possible models) and stepwise regression (with a cutoff p-value of 0.15). The assessment was performed on the Mix B of ERCC, using Partek’s modified expectation-maximization (EM) algorithm for transcript quantification.