Gene Expression Quantification

The DRAGEN RNA pipeline contains a gene expression quantification module that estimates the expression of each transcript and gene in an RNA-seq data set. The module first translates the genomic mapping of each read (or read pair) to the corresponding transcript mappings. It then uses an Expectation-Maximization (EM) algorithm to infer the transcript expression values that best match all the observed reads. The EM algorithm can also model and correct for GC bias in the reported quantification results.
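The idea behind EM-based quantification can be illustrated with a toy example. The following is a minimal sketch of the textbook EM scheme for assigning ambiguous fragments to transcripts, not the DRAGEN implementation. The function name, the input layout, and the fixed iteration count are assumptions made for illustration, and GC-bias modeling is omitted.

```python
def em_abundance(compat, eff_len, n_iter=200):
    """Toy EM for transcript abundance (illustrative only).

    compat  -- one list per fragment holding the indices of the
               transcripts that the fragment is compatible with
    eff_len -- effective length of each transcript
    Returns the estimated number of fragments assigned to each transcript.
    """
    n_tx = len(eff_len)
    # Start from a uniform guess of the fragment share of each transcript.
    theta = [1.0 / n_tx] * n_tx
    counts = [0.0] * n_tx
    for _ in range(n_iter):
        counts = [0.0] * n_tx
        # E-step: split each fragment across its compatible transcripts in
        # proportion to the current share divided by the effective length.
        for txs in compat:
            weights = [theta[t] / eff_len[t] for t in txs]
            total = sum(weights)
            for t, w in zip(txs, weights):
                counts[t] += w / total
        # M-step: update the shares from the fractional counts.
        theta = [c / len(compat) for c in counts]
    return counts


# Four fragments: two unique to transcript 0, one unique to transcript 1,
# and one that is compatible with both.
fragments = [[0], [0], [0, 1], [1]]
estimated = em_abundance(fragments, eff_len=[100.0, 100.0])
print(estimated)
```

In this example, the shared fragment is assigned mostly to transcript 0 because the uniquely mapped fragments already indicate that transcript 0 is more highly expressed.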

To enable the quantification module, set the --enable-rna-quantification option to true in your current RNA-seq command-line scripts. Additionally, you must provide a gene annotation file (GTF/GFF) that contains the genomic position of all transcripts to quantify. You can specify the GTF/GFF file using the -a or --annotation-file option.
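As a minimal sketch, a script that launches DRAGEN from Python could append the two options described above to an existing argument list. The helper name, the `dragen` executable name, the separate option and value tokens, and the file path are assumptions for illustration. Only `--enable-rna-quantification` and `--annotation-file` come from this section.

```python
def with_quantification(cmd, annotation_file):
    """Return a copy of an existing command with quantification enabled.

    cmd             -- argument list of your current RNA-seq command
    annotation_file -- path to the GTF/GFF gene annotation file
    """
    return list(cmd) + [
        "--enable-rna-quantification", "true",
        "--annotation-file", annotation_file,
    ]


# Placeholder for an existing command; your own options go in this list.
base_cmd = ["dragen"]
# Hypothetical annotation path used only for this example.
quant_cmd = with_quantification(base_cmd, "/path/to/annotation.gtf")
print(" ".join(quant_cmd))
```

The helper returns a new list and leaves the original command unchanged, so the same base command can be reused with and without quantification.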

Quantification Options

Option

Description

Quantification Outputs

Transcript quantification results are reported in the <outputPrefix>.quant.sf text file. The file lists results for each transcript. You can use the output file as input for differential gene expression analysis with tools such as tximport and DESeq2.

The following is an example of the file contents:

Name Length EffectiveLength TPM NumReads
ENST00000364415.1 116 12.3238 5.2328 1
ENST00000564138.1 2775 2105.58 1.28293 41.8885
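For downstream scripting, the file can be read with a few lines of Python. The following is a minimal sketch that assumes whitespace-separated columns and the five column names shown in the example above. The function name is hypothetical, and the embedded text is the example from this section rather than real output.

```python
import io


def read_quant_sf(handle):
    """Parse quant.sf-style text into {transcript name: numeric fields}."""
    header = handle.readline().split()
    results = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        record = dict(zip(header, fields))
        name = record.pop("Name")
        # All remaining columns are numeric in the example above.
        results[name] = {key: float(value) for key, value in record.items()}
    return results


example = io.StringIO(
    "Name Length EffectiveLength TPM NumReads\n"
    "ENST00000364415.1 116 12.3238 5.2328 1\n"
    "ENST00000564138.1 2775 2105.58 1.28293 41.8885\n"
)
transcripts = read_quant_sf(example)
print(transcripts["ENST00000564138.1"]["TPM"])
```

To read a real file, pass an open file handle for <outputPrefix>.quant.sf instead of the in-memory example.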

Field

Description

The gene expression quantification module also outputs the files below. For information on the metrics included, see the section Quantification and RNA QC Metrics.

<outputPrefix>.quant.genes.sf—Contains quantification results at the gene level. The results are produced by summing together all transcripts with the same geneID in the annotation file (GTF). Length and EffectiveLength are the expression-weighted means of the individual transcripts in the gene.
<outputPrefix>.quant.metrics.csv—Summary statistics relevant to RNA transcripts and quantification. See Quantification and RNA QC Metrics.
<outputPrefix>.quant.transcript_fragment_lengths.txt—Full fragment length distribution of reads mapped to transcripts, output as length-probability pairs from the minimum length through >999 bases. Summing the products of the two columns yields the average fragment length.
<outputPrefix>.quant.transcript_coverage.txt—Measures coverage uniformity with a normalized average of 5' to 3' coverage pattern along transcripts in increments of 1%. A summation of the 100 coverage bins should yield 100%.
<outputPrefix>.SJ.saturation.txt—Measures sequencing saturation of the library, including the number of unique splice junctions observed as a function of reads processed.
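Two of the calculations described in the list above can be sketched in Python: the gene-level summary and the average fragment length. This is an illustration under stated assumptions, not the DRAGEN implementation. It assumes that TPM is the expression weight, that transcripts of an unexpressed gene are weighted equally, and that a transcript-to-gene mapping is already available. All names are hypothetical.

```python
from collections import defaultdict


def summarize_genes(transcripts, tx_to_gene):
    """Sum transcript results per gene, with expression-weighted lengths.

    transcripts -- {name: {"Length", "EffectiveLength", "TPM", "NumReads"}}
    tx_to_gene  -- {transcript name: gene ID from the annotation file}
    """
    grouped = defaultdict(list)
    for name, record in transcripts.items():
        grouped[tx_to_gene[name]].append(record)

    genes = {}
    for gene, records in grouped.items():
        tpm = sum(r["TPM"] for r in records)
        if tpm > 0:
            weights = [r["TPM"] / tpm for r in records]
        else:
            # Assumption: equal weights when the gene has no expression.
            weights = [1.0 / len(records)] * len(records)
        genes[gene] = {
            "Length": sum(w * r["Length"] for w, r in zip(weights, records)),
            "EffectiveLength": sum(
                w * r["EffectiveLength"] for w, r in zip(weights, records)
            ),
            "TPM": tpm,
            "NumReads": sum(r["NumReads"] for r in records),
        }
    return genes


def mean_fragment_length(pairs):
    """Average fragment length from (length, probability) pairs."""
    return sum(length * prob for length, prob in pairs)


# Hypothetical values chosen only to make the arithmetic easy to follow.
toy_transcripts = {
    "tx1": {"Length": 1000, "EffectiveLength": 800, "TPM": 3.0, "NumReads": 30.0},
    "tx2": {"Length": 2000, "EffectiveLength": 1800, "TPM": 1.0, "NumReads": 20.0},
}
toy_genes = summarize_genes(toy_transcripts, {"tx1": "geneA", "tx2": "geneA"})
toy_mean = mean_fragment_length([(100, 0.25), (200, 0.75)])
print(toy_genes["geneA"], toy_mean)
```

In the toy input, tx1 carries three quarters of the gene expression, so the gene-level Length is closer to the length of tx1 than to the length of tx2.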

Quantification and RNA QC Metrics

The RNA quantification module outputs metrics related to the gene expression results, as well as more general RNA QC metrics that rely on the transcript-level analysis. A summary of the metrics is output to the <outputPrefix>.quant.metrics.csv file.

Only unfiltered and properly paired reads (for paired-end sequencing) are counted in the above metrics. The seven fragment types listed (Forward transcript, Reverse transcript, Strand mismatched, Ambiguous strand, Intron, Intergenic, Unknown transcript) add up to 100% of the counted fragments. Each fragment metric count is reported with its percentage of this total.
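The relationship between the fragment counts and their percentages can be sketched as follows. The counts are invented for illustration and do not come from a real run. The function name is hypothetical, and the sketch does not parse the metrics file.

```python
def fragment_percentages(counts):
    """Return each fragment type as a percentage of all counted fragments."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * value / total for name, value in counts.items()}


# Hypothetical counts for the seven fragment types listed above.
example_counts = {
    "Forward transcript": 50,
    "Reverse transcript": 900000,
    "Strand mismatched": 1000,
    "Ambiguous strand": 2000,
    "Intron": 60000,
    "Intergenic": 30000,
    "Unknown transcript": 6950,
}
percentages = fragment_percentages(example_counts)
for name, pct in percentages.items():
    print(f"{name}: {example_counts[name]} ({pct:.2f}%)")
```

Because the seven types partition the counted fragments, the percentages sum to 100, up to floating-point rounding.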

Metric

Description

Specifies the type of RNA-seq library. The following are the available values:

IU—Paired-end unstranded library.
ISR—Paired-end stranded library in which read2 matches the transcript strand (eg, Illumina Stranded Total RNA Prep).
ISF—Paired-end stranded library in which read1 matches the transcript strand.
U—Single-end unstranded library.
SR—Single-end stranded library in which reads are in reverse orientation to the transcript strand (eg, Illumina Stranded Total RNA Prep).
SF—Single-end stranded library in which reads match the transcript strand.
A—DRAGEN examines the first read pairs in the data set to automatically detect the correct library type. For poly-A tail trimming, the library type is assumed to be unstranded. Autodetect is the default value.
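A wrapper script can validate a library type value before starting a run. The following is a minimal sketch that encodes only the values listed above. The dictionary and function names are hypothetical and are not part of DRAGEN.

```python
# Library type values as described above, with short summaries.
LIBRARY_TYPES = {
    "IU": "paired-end, unstranded",
    "ISR": "paired-end, stranded, read2 matches the transcript strand",
    "ISF": "paired-end, stranded, read1 matches the transcript strand",
    "U": "single-end, unstranded",
    "SR": "single-end, stranded, reads are reverse to the transcript strand",
    "SF": "single-end, stranded, reads match the transcript strand",
    "A": "autodetect from the first read pairs (default)",
}


def describe_library_type(value):
    """Return a summary of a library type value or raise ValueError."""
    try:
        return LIBRARY_TYPES[value]
    except KeyError:
        allowed = ", ".join(LIBRARY_TYPES)
        raise ValueError(
            f"Unknown library type {value!r}; expected one of: {allowed}"
        ) from None


print(describe_library_type("ISR"))
```

Checking the value up front gives a clear error message before any compute time is spent.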