arrow-left

All pages
gitbookPowered by GitBook
1 of 8

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Loading...

Quantification

In RNA-seq data analysis, after alignment, the most common step is to estimate gene or/and transcript expression abundance, the expression level is represented by read counts. There are three options in this step:

  • Quantify to annotation model (Partek E/M)

  • Quantify to transcriptome (Cufflinks)

Quantify to transcriptome (Cufflinks)

Cufflinks assembles transcripts and estimates transcript abundances on aligned reads. Implementation details are explained in Trapnell et al. [1]

The Cufflinks task has three options that can be configured (Figure 1):

Figure 1. Cufflinks configuration dialog
  • Novel transcript: this option does not require any annotation reference, it will do de novo assembly to reconstruct transcripts and estimate their abundance

  • Annotation transcript: this option requires an annotation model to quantify the aligned reads to known transcripts based on the annotation file.

  • Novel transcript with annotation as guide: this option requires an annotation file to quantify the aligned reads to known transcripts as well as assemble aligned reads to novel transcripts. The results include all transcripts in the annotation file plus any novel transcripts that are assembled.

When the Use bias correction check box is selected, it will use the genome sequence information to look for overrepresented sequences and improve the accuracy of transcript abundance estimates.

hashtag
References

  1. Trapnell C, Williams B, Pertea G, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotech. 2010; 28:511-515.

hashtag
Additional Assistance

If you need additional assistance, please visit to submit a help ticket or find phone numbers for regional support.

Quantify to reference (Partek E/M)
Quantify regions
HTSeq
Count feature barcodes
Salmon
our support pagearrow-up-right

Quantify to annotation model (Partek E/M)

When the reads are aligned to a genome reference, e.g. hg38, the quantification is performed on transcriptome, you need to provide the annotation model file of the transcriptome.

hashtag
Quantification dialog

If the alignment was generated in Partek Flow, the genome assembly will be displayed as text on the top of the page (Figure 1), you do not have the option to change the reference.

Figure 1. Quantify to annotation model (Partek E/M) dialog

If the bam file is imported, you need to select the assembly with which the reads were aligned to, and which annotation model file you will use to quantify from the drop-down menus (Figure 2).

In the Quantification options section, when the Strict paired-end compatibility check button is selected, paired end reads will be considered compatible with a transcript only if both ends are compatible with the transcript. If it is not selected, reads with only one end have alignment that is compatible with the transcript will also be counted for the transcript .

If the Require junction reads to match introns check button is selected, only junction reads that overlap with exonic regions and match the skipped bases of an intron in the transcript will be included in the calculation. Otherwise, as long as the reads overlap within the exonic region, they will be counted. Detailed information about read compatibility can be found in the white paper.

Minimum read overlap with feature can be specified in percentage of read length or number of bases. By default, a read has to be 100% within a feature. You can allow some overhanging bases outside the exonic region by modifying these parameters.

Filter features option is a filter for minimum reads, by default only the features whose sum of the reads across all samples that are greater than or equal to 10 will be reported. To report all the features in the annotation file, set the value to 0.

Some library preparations reverse transcribe the mRNA into double stranded cDNA, thus losing strand information. In this case, the total transcript count will include all the reads that map to a transcript location. Others will preserve the strand information of the original transcript by only synthesizing the first strand cDNA. Thus, only the reads that have sense compatibility with the transcripts will be included in the calculation. We recommend verifying with the data source how the NGS library was prepared to ensure correct option selection.

In the Advanced options, in Configure dialog, at Strand specificity field, forward means the strand of the read must be the same as the strand of the transcript while reverse means the read must be the complementary strand to the transcript (Figure 3). The options in the drop-down list will be different for paired-end and single-end data. For paired-end reads, the dash separates first- and second-in-pair, determined by the flag information of the read in the BAM file. Briefly, the paired-end Strand specificity options are:

  • No: Reads will be included in the calculation as long as they map to exonic regions, regardless of the direction

  • Auto-detect: The first 200,000 reads will be used to examine the strand compatibility with the transcripts. The following percentages are calculated on paired-end reads:

    • (1) If (first-in-pair same strand + second-in-pair same strand)/Alignments examined > 75%, Forward-Forward will be specified

The single-end Strand specificity options are:

  • No: same as for paired-end reads

  • Auto-detect: same as for paired-end reads. All single-end reads are treated as first-in-pair reads

  • Forward: this option is equivalent to the --fr-secondstrand option in Cufflinks. The single-end reads are the same strand as the transcript

If the Report unexplained regions check button is selected, an additional report will be generated on the reads that are considered not compatible with any transcripts in the annotation provided. Based on the Min reads for unexplained region cutoff, the adjacent regions that meet the criteria are combined and region start and stop information will be reported.

In the annotation file, there might be multiple features in the same location, or one read might have multiple alignments, so the read count of a feature might not be an integer. Our white paper on the has more details on Partek’s implementation the E/M algorithm initially described by Xing et al. [1]

hashtag
Quantify to annotation model (Partek E/M) output

Depending on the annotation file, the output could be one or two data nodes. If the annotation file only contains one level of information, e.g. miRNA annotation file, you will only get one output data node. On the other hand, if the annotation file contains gene level and transcript level information, such as those from the Ensembl database, both gene and transcript level data nodes will be generated. If two nodes are generated, the Task report will also contain two tabs, reporting quantification results from each node. Each report has two tables. The first one is a summary table displaying the coverage information for each sample quantified against the specified transcriptome annotation (Figure 4).

The second table contains feature distribution information on each sample and across all the samples, number of features in the annotation model is displayed on the table title (Figure 5).

The bar chart displaying the distribution of raw read counts is helpful in assessing the expression level distribution within each sample. The X-axis is the read count range, Y axis is the number of features within the range, each bar is a sample. Hovering your mouse over the bar displays the following information (Figure 6):

  • Sample name

  • Range of read counts, “[ “represent inclusive, “)” represent exclusive, e.g. [0,0] means 0 read counts; (0,10] means the range is greater than 0 count but less than and equal to 10 counts.

  • Number of features within the read count range

The coverage breakdown bar chart is a graphical representation of the reads summary table for each sample (Figure 7).

In the box-whisker plot, each box is a sample on X-axis, the box represents 25th and 75th percentile, the whiskers represent 10th and 90th percentile, Y-axis represents the feature counts, when you hover over each box, detailed sample information is displayed (Figure 8).

In sample histogram, each line represents a sample and the range of read counts are divided into 20 bins. Clicking on a sample in the legend will hide the line for that specific sample. Hovering over each circle displays detailed information about the sample and that specific bin (Figure 9). The information includes:

  • Sample name

  • Range of read counts, “[ “represent inclusive, “)” represent exclusive

  • Number of features within the read count range in the sample

The box whisker and sample histogram plots are helpful for understanding the expression level distribution across samples. This may indicate that normalization between samples might be needed prior to downstream analysis. Note that all four visualizations are disabled for results with more than 30 samples.

The output data node contains raw reads of each sample on each feature (gene or transcript or miRNA etc. depends on the annotation used). When click on a output data node, e.g. transcript counts data node, choose Download data on the context sensitive menu on the right, the raw reads of transcripts can be downloaded in three different format (Figure 10):

Partek Genomics Suite project format: it is a zip file, do not manually unzip it, you can choose File>Import>Zipped project in Partek Genomics Suite to import the zip file into PGS.

Features on columns and Features on rows format: it is a .txt file, you can open the text file in any text editor or Microsoft Excel. For Features on columns format, samples will be on rows. For Features on rows format, samples will be at columns.

hashtag
References

  1. Xing Y, Yu T, Wu YN, Roy M, Kim J, Lee C. An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs. Nucleic Acids Res. 2006; 34(10):3150-60.

hashtag
Additional Assistance

If you need additional assistance, please visit to submit a help ticket or find phone numbers for regional support.

Quantify regions

In ChIP-seq or ATAC-seq analysis, a major challenge after detecting enriched regions or peaks is to compare samples and identify differentially enriched regions. In order to compare samples, a common set of regions must be identified and the number of reads mapping to each region quantified. The Quantify regions task addresses this challenge by generating a union set of unique regions and reporting the number of reads from each sample mapping to each region.

To run Quantify regions:

  • Click a Peaks data node

  • Click the Quantification section in the toolbox

  • Click Quantify regions

hashtag
Quantify regions method

The Quantify regions task takes MACS2 output, a Peaks or Annotated Peaks data node, as its input. In a typical ATAC-Seq or ChIP-Seq analysis, MACS2 is configured to output a set of enriched regions or peaks for each experimental sample or group individually. Quantify regions takes these sets of regions and merges them into a union set of unique regions that it saves as a .bed file. To combine the region sets, overlapping regions between samples/groups are merged. Where overlap ends, a break point is created and a new region defined. All non-overlapping or unique regions from each sample/group are also included.

For example, consider an experiment where MACS2 detected enriched regions for two samples, Sample A and Sample B. In Sample A, a region is detected on chromosome 1 from 100bp to 300bp, chr1:100-300. In Sample B, a region is detected at chr1:160-360. The Quantify regions task will give the following union set of unique regions for these partially overlapping regions:

chr1:100-160 (region detected in Sample A only)

chr1:160-300 (region detected in both Sample A and Sample B)

chr1:300-360 (region detected in Sample B only)

After generating a .bed file with the union set of unique regions, Quantify regions performs quantification using the same algorithm as with the .bed file as the annotation model.

hashtag
Configuring Quantify regions

The Quantify regions dialog includes configuration options for generating the union set of unique regions and quantifying reads to the regions (Figure 1).

When regions from multiple samples are combined, a small offset in position between enriched regions in different samples can result in many very short unique regions in the union set. The Minimum region size option lets you filter out these very short regions. If a region is smaller than the specified cutoff, the region is excluded. By default, this is set to 50bp, but may need to be adjusted depending on the size of regions you expect to see in your assay.

Quantification options are the same as in the dialog. The Percent of read length is set to 50% by default to account for small offsets in position between enriched regions in different samples.

hashtag
Quantify regions output

Quantify regions generates a counts data node with the number of counts in each region for each sample. This data node can be annotated with gene information using the Annotate regions task and analyzed using tasks that take counts data as input, such as normalization, PCA, and ANOVA. For ChIP-Seq experiments with input control samples, the task can be used prior to downstream analysis.

Similar to the task report, the Quantify regions task report includes feature distribution information including a descriptive stats table, a distribution bar chart, a sample box plot, and sample histogram (Figure 2).

To download the .bed file with the union set of unique regions, click the Quantify regions task node, click Task details, click the regions.bed file in the Output files section, and click Download.

hashtag
References

  1. Xing Y, Yu T, Wu YN, Roy M, Kim J, Lee C. An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs. Nucleic Acids Res. 2006; 34(10):3150-60.

hashtag
Additional Assistance

If you need additional assistance, please visit to submit a help ticket or find phone numbers for regional support.

Quantify to reference (Partek E/M)

This task does not need an annotation model file, since the annotation is retrieved from the BAM file itself. The sequence names in the BAM files constitute the features with which the reads are quantified against.

This task is generally performed on reads aligned to a transcriptome, e.g when a species does not have a genome reference, and the bam files contain transcriptome information. In this case, the features for this quantification task are the reference sequence names in the input bam files.

There are two parameters in Quantify to reference (Figure 1):

Figure 1. Quantify to reference dialog
  • Min coverage: will filter out any features (sequence names) that have fewer reads across all samples than the value specified

  • Strict paired-end compatibility: this only affects paired end data. When it is checked, only reads that have two ends aligned to the same feature will be counted. Otherwise, reads will still be counted as exonic compatible reads even if the mate is not compatible with the feature

During quantification:

  1. We scan through each of the BAM files and find all the transcripts that meet the minimum coverage threshold.

  2. With those transcripts, we "create" an annotation file that has the transcript name as the sequence name and the Gene ID and the Transcript ID have the same transcript name. The start position is 1 and the end position is the length of the transcript.

  3. Effectively, what the annotation file does is filter out the low coverage transcripts.

The output data node will display a similar Task report as the task.

hashtag
Additional Assistance

If you need additional assistance, please visit to submit a help ticket or find phone numbers for regional support.

Salmon

  • Running Salmon

  • References

Salmon1 is a method for quantifying transcript abundance from raw read sequence fastq files, it will generate transcript level count matrix as output.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5600148/arrow-up-right

hashtag
Running Salmon

Salmon will run on any FASTQ file input.

  • Click the data node containing your FASTQ files

  • Click the Quantification section in the toolbox

  • Click Salmon

Salmon index can't be built on reference assembly, you need to provide transcript annotation file beforehand, the index will be built on transcript annotation model. For more information about adding aligner index based on annotation model, see this in library file management.

The task generates two data nodes: Transcript counts node contains NumReads which is the estimate of number of reads mapping to each transcript; the best estimate is often not integer. Gene counts node contains the sum of all the reads from the corresponding transcripts from each gene.

Note: If you want to perform differential analysis, you need to add offset to deal with 0 values.

hashtag
References

  1. Rob Patro, Geet Duggal, Michael Love, Rafeal A Irizarry, Carl Kingsford. Salmon: fast and bias-aware quantification of transcript expression using dual-phase inference. Published online 2017 Mar 6. doi:

hashtag
Additional Assistance

If you need additional assistance, please visit to submit a help ticket or find phone numbers for regional support.

Count feature barcodes

Since we don't know where the transcripts are in the genome, chromosome view will display only one transcript at a time (i.e., the transcript names are treated like "chromosomes").
Quantify to annotation model
our support pagearrow-up-right

(2) If (first-in-pair same strand + second-in-pair opposite strand)/Alignments examined > 75%, Forward-Reverse will be specified

  • (3) If (first-in-pair opposite strand + second-in-pair same strand)/Alignments examined > 75%, Reverse-Forward will be specified

  • (4) If neither of the percentages exceed 75%, No option will be used

  • Forward - Reverse: this option is equivalent to the --fr-secondstrand option in Cufflinks [1]. First-in-pair is the same strand as the transcript, second-in-pair is the opposite strand to the transcript

  • Reverse - Forward: this option is equivalent to --fr-firststrand option in Cufflinks. First-in-pair is the opposite strand to the transcript, second-in-pair is the same strand as the transcript. The Illumina TruSeq Stranded library prep kit is an example of this configuration

  • Forward - Forward: Both ends of the read are matching the strand of the transcript. Generally colorspace data generated from SOLiD technology would follow this format

  • Reverse: this option is equivalent to --fr-firststrand option in Cufflinks. The single-end reads are the opposite strand to the transcript. The Illumina TruSeq Stranded library prep kit is an example of this configuration

    Percentage of the features within the read count range
    Understanding Reads
    Partek E/M algorithm
    our support pagearrow-up-right
    Figure 2. Specify the genome assembly with which the bam files are generated from and transcriptome annotation from the drop-down menu
    Figure 3. Illustration of the three types of strand specific assays on paired end reads. _R1 and _R2 means read first-in pair and second-in-pair respectively. Arrows indicate strand directions.
    Figure 4. Summary of raw reads mapping to genes based on the RefSeq annotation file provided. Note that the Gene-level tab is selected.
    Figure 5. Summary of feature distribution statistics
    Figure 6. Bar chart on distribution of raw read counts in each sample
    Figure 7. Coverage breakdown bar chart, it is a graphical presentation of summary table on raw reads mapping to transcription based on the annotation file provided
    Figure 8. Box-whisker plot on read count distribution in each sample, when mouse over a box, detailed information on the box is displayed.
    Figure 9. Sample histogram plot, when mouse over each circle, detailed information is displayed
    Figure 10. Download quantification output data dialog
    Choose Assembly and Aligner index on gene annotation model (Figure 1)
    chapter
    10.1038/nmeth.4197arrow-up-right
    our support pagearrow-up-right
    Figure 1. Configure Salmon task--select index file
    hashtag
    What is Count feature barcodes?

    Count feature barcodes is a tool for quantifying the number of feature barcodes per cell from CITE-Seq, cell hashing, or other feature barcoding assays to measure protein expression. The input for Count feature barcodes is FASTQ files.

    hashtag
    Running Count feature barcodes

    Count feature barcodes will run on any unaligned reads data node.

    • Click the data node containing your unaligned reads containing feature barcodes

    • Click the Quantification section in the toolbox

    • Click Count feature barcodes

    The task set up page allows you to configure the settings for your assay (Figure 1).

    • Choose the Prep kit from the drop-down menu

    For more details on adding Prep kits, please see our documentation on . The prep kit should include cell barcode and unique molecular identifiers (UMIs) locations.

    • Check Map feature barcodes box (optional)

    This is only necessary for processing data from 10X Genomics' Feature Barcoding assay (v3+ chemistry), which utilizes BioLegend TotalSeq-B. For all other assays, leave this box unchecked.

    • Choose the Barcode location

    For BioLegend TotalSeq-A, choose bases 1-15. For BioLegend TotalSeq-B/C, choose bases 11-25. For other locations, select Custom and specify the start and stop positions.

    • Choose a Sequences text file

    This tab-delimited text file should have the feature ID in the first column and the nucleotide sequence in the second column. Do not include column headers. See Figure 2 for an example.

    • Check Keep bam files box (optional)

    This option will retain the alignment BAM files instead of automatically deleting them when the task is complete. An extra Aligned reads output data node will be produced on the task graph. This option is unchecked by default to save on disk space.

    • Click Finish to run

    The output of Count feature barcodes is a Single cell counts data node.

    hashtag
    How does Count feature barcodes work?

    Count feature barcodes uses a series of tasks available independently in Partek Flow to process the input FASTQ files. The output files generated by these tasks are not retained in the Count feature barcodes output, with the exception of BAM files if Keep bam files is checked.

    identifies the UMI and cell barcode sequences. The Prep kit is specified using the Prep kit setting.

    trims the insert read to include only the feature barcode sequence. Trim bases is set to Both ends for Trim based on with the start and stop set by the Barcode location preference and the Min read length set to 1.

    is used to align the reads to the sequences specified in the Sequences text file. Bowtie is set to Ignore quality limit for the Alignment mode. Other settings are default.

    consolidates duplicate reads based on UMIs. Deduplicate UMIs is set to Retain only one alignment per UMI.

    filters the cell barcodes to include cells and not empty droplets. Filter barcodes is set to Automatic.

    Quantify barcodes counts the number of UMIs per cell for each feature in the Sequences file. Quantify barcodes uses default settings.

    To perform these steps individually instead of using the Count feature barcodes task, you will need to generate a FASTA and GTF file containing the feature barcode IDs and sequences instead of a text file and build an index file for the Bowtie aligner.

    For more information about library file management, please see .

    hashtag
    Additional Assistance

    If you need additional assistance, please visit to submit a help ticket or find phone numbers for regional support.

    What is Count feature barcodes?
    Running Count feature barcodes
    How does Count feature barcodes work?
    Quantify to annotation model (Partek E/M)
    Quantify to annotation model (Partek E/M)
    Normalize to baseline
    Quantify to annotation model (Partek E/M)
    our support pagearrow-up-right
    Figure 1. Quantify regions dialog
    Figure 2. Quantify regions task report
    Trim tags
    Trim tags
    Trim bases
    Bowtie
    Deduplicate UMIs
    Filter barcodes
    Library File Management
    our support pagearrow-up-right
    Figure 1. Count feature barcodes task set up page
    Figure 2. Example of how the sequences tab-delimited text (.txt) file should be formatted

    HTSeq

    • Configurable options

    • Output

    • References

    HTSeq is set of tools for processing high-throughput sequencing data1. In Partek Flow, we have implemented the htseq-count script from HTSeq for quantifying aligned reads to an annotation model.

    The input for HTSeq is an Aligned reads data node and a Gene/Feature annotation file.

    To run HTSeq:

    • Click an Aligned reads data node

    • Click the Quantification section of the toolbox

    • Click HTSeq

    Please note that HTSeq has not been optimized for performance and can take a very long time to run compared with on the same data.

    hashtag
    Configurable options

    HTSeq includes basic options (Figure 1) and advanced options accessible by clicking Configure (Figure 2).

    hashtag
    Basic Options

    hashtag
    Annotation file

    The annotation file contains the features the aligned reads will be quantified to. For more information about adding an annotation model, please see .

    hashtag
    Strand specificity

    Depending on the library preparation method, information about the strand of the original transcript may be faithfully preserved or lost. This setting controls whether HTSeq considers strand during quantification. Consult your library preparation method user manual if you are unsure about if and how the method preserved strand information.

    If set to no, a read is considered to be overlapping a feature regardless of whether it maps to the same or opposite strand as the feature.

    If set to yes, the read has to be matched to the same strand as the feature if single-end reads and matched to the same strand for the first read and the opposite strand for the second read if paired-end reads.

    If set to reverse, the read has to be matched to the opposite strand as the feature if single-end reads and matched to the opposite strand for the first read and the same strand for the second read if paired-end reads.

    hashtag
    Include features with no counts

    By default, only features (e.g., genes) with one or more aligned read will be included in the output. If this option is selected, all features from the annotation model, including those without any matching aligned reads, will be included.

    hashtag
    Advanced Options

    hashtag
    Min qual

    HTSeq will skip reads with an alignment quality lower than the value specified here. The default is 10.

    hashtag
    Overlap mode

    This option determines how HTSeq handles reads that partially overlap features. The default is union.

    If set to union, any read that is partially overlapped by a feature will be assigned to that feature. Assignment is non-exclusive if multiple feature overlap.

    If set to intersection-strict, only reads that are fully overlapped by a feature will be assigned to that feature. Assignment is non-exclusive if multiple feature fully overlap.

    If set to intersection-nonempty, reads are assigned to the feature that has the greatest overlap. Assignment is non-exclusive if multiple feature overlap the same amount.

    hashtag
    Nonunique mode

    This option determines how HTSeq counts reads that are assigned to more than one feature. The default is none.

    If set to none, reads that are assigned to more than one feature are not counted for any feature.

    If set to all, reads are counted for all features they are assigned to.

    hashtag
    Output

    HTSeq outputs a Gene counts data node (Figure 3). There is no task report.

    The ribosomal reads % column, present when the data is downloaded, is identified by searching ribosomal genes by their gene symbol against a list of 89 L & S ribosomal genes taken from . -; it calculates 'the sum of counts across these genes / the total count' * 100 to give the % of ribosomal counts.

    hashtag
    References

    1. Simon Anders, Paul Theodor Pyl, Wolfgang Huber. HTSeq — A Python framework to work with high-throughput sequencing data. Bioinformatics (2014), in print, online at doi:10.1093/bioinformatics/btu638

    hashtag
    Additional Assistance

    If you need additional assistance, please visit to submit a help ticket or find phone numbers for regional support.

    Quantify to annotation model (Partek E/M)
    Adding an Annotation Model
    HGNCarrow-up-right
    https://help.partek.illumina.com/partek-flow/user-manual/task-menu/qa-qc/single-cell-qa-qcarrow-up-right
    our support pagearrow-up-right
    Figure 1. Basic HTSeq options
    Figure 2. Advanced options for HTSeq
    Figure 3. HTSeq output