Gene Expression Quantification
The DRAGEN RNA pipeline contains a gene expression quantification module that estimates the expression of each transcript and gene in an RNA-seq data set. The module first internally translates the genomic mapping of each read (read pair) to the corresponding transcript mappings. Then it uses an Expectation-Maximization (EM) algorithm to infer the transcript expression values that best match all the observed reads. The EM algorithm can also model and correct for GC-bias in the reported quantification results.
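The EM idea described above can be illustrated with a toy sketch. This is not DRAGEN's implementation: it omits GC-bias modeling, and the function name, the `compat` input layout, and the fixed iteration count are assumptions made for illustration. Each fragment lists the transcripts it is compatible with. The E-step shares the fragment among those transcripts in proportion to current abundance per effective base, and the M-step sums the shares.

```python
def em_transcript_abundance(compat, eff_lengths, n_iter=200):
    """Toy EM for transcript abundance (illustrative only).

    compat      -- compat[f] is the list of transcript indices that
                   fragment f is compatible with (hypothetical layout)
    eff_lengths -- effective length of each transcript
    Returns the expected number of fragments assigned to each transcript.
    """
    n_tx = len(eff_lengths)
    counts = [len(compat) / n_tx] * n_tx  # uniform starting point
    for _ in range(n_iter):
        new_counts = [0.0] * n_tx
        for txs in compat:
            # E-step: weight each candidate by abundance per effective base.
            weights = [counts[t] / eff_lengths[t] for t in txs]
            total = sum(weights)
            if total == 0.0:
                continue  # no candidate has any support; skip the fragment
            for t, w in zip(txs, weights):
                new_counts[t] += w / total
        # M-step: the new abundance is the sum of fractional assignments.
        counts = new_counts
    return counts


# Two transcripts of equal effective length: 6 fragments unique to
# transcript 0, 2 unique to transcript 1, and 4 compatible with both.
fragments = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
estimates = em_transcript_abundance(fragments, [1000.0, 1000.0])
```

In this example the ambiguous fragments end up split 3:1, following the ratio that the estimates settle on, so the estimates converge toward 9 and 3 fragments.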
To enable the quantification module, set the --enable-rna-quantification option to "true" on the command line. You must also provide a gene annotation file (GTF/GFF) that contains the genomic positions of all transcripts to quantify. Specify the GTF/GFF file using the -a or --annotation-file option. The following is an example command line for running an end-to-end RNA-seq experiment with gene expression quantification.
dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
-a <ANNOTATION_FILE> \
--output-dir <OUT_DIRECTORY> \
--output-file-prefix <OUTPUT_PREFIX> \
--RGID <READ_GROUP_ID> \
--RGSM <SAMPLE_NAME> \
--enable-rna true \
--enable-rna-quantification true
Realignment to transcripts
After initial alignment to the genome, the quantification pipeline realigns each fragment to the transcriptome. Fragments must be contained within transcript boundaries, and skips (Ns in the CIGAR string) must correspond to annotated splice junctions. Short transcripts annotated as microRNA (miRNA) are handled differently: alignments are only required to be inside the transcript, without regard to skips in the CIGAR string. The transcript boundary tolerance for miRNA transcripts can be adjusted using --rna-quantification-mirna-realign-margin. This is similar to the behavior of StringTie.
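The compatibility rules above can be sketched as follows. This is a simplified illustration, not DRAGEN's code: the function names, the 0-based half-open coordinates, and the extra requirement that every aligned block lies inside a single exon are assumptions. The miRNA branch only checks containment within the transcript span, widened by a margin.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def alignment_blocks(pos, cigar):
    """Split an alignment into aligned blocks and skipped (N) intervals.

    pos is the 0-based reference start; intervals are half-open.
    """
    blocks, junctions = [], []
    start = cur = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=XD":          # operations that consume the reference
            cur += length
        elif op == "N":           # skip: close the block, record the junction
            blocks.append((start, cur))
            junctions.append((cur, cur + length))
            cur += length
            start = cur
        # I, S, H and P do not consume reference bases
    blocks.append((start, cur))
    return blocks, junctions


def is_compatible(pos, cigar, exons, mirna=False, margin=0):
    """exons: sorted half-open (start, end) exon intervals of one transcript."""
    blocks, junctions = alignment_blocks(pos, cigar)
    aln_start, aln_end = blocks[0][0], blocks[-1][1]
    tx_start, tx_end = exons[0][0], exons[-1][1]
    if mirna:
        # miRNA: containment only (with tolerance); skips are ignored.
        return aln_start >= tx_start - margin and aln_end <= tx_end + margin
    if aln_start < tx_start or aln_end > tx_end:
        return False
    introns = {(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)}
    if any(j not in introns for j in junctions):
        return False              # skip does not match an annotated junction
    # Assumption: each aligned block must fall inside a single exon.
    return all(any(s >= es and e <= ee for es, ee in exons) for s, e in blocks)


example_exons = [(100, 200), (300, 400)]
spliced_ok = is_compatible(150, "50M100N50M", example_exons)
```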
Quantification Options
--enable-rna-quantification
If set to "true", enables RNA quantification. Requires --enable-rna to be set to "true".
false
--rna-library-type
Specifies the type of RNA-seq library. The following are the available values:
IU: Paired-end unstranded library.
ISR: Paired-end stranded library in which read2 matches the transcript strand (for example, Illumina Stranded Total RNA Prep).
ISF: Paired-end stranded library in which read1 matches the transcript strand.
U: Single-end unstranded library.
SR: Single-end stranded library in which reads are in reverse orientation to the transcript strand (for example, Illumina Stranded Total RNA Prep).
SF: Single-end stranded library in which reads match the transcript strand.
A: DRAGEN examines the first read pairs in the data set to automatically detect the correct library type. For poly-A tail trimming, the library type is assumed to be unstranded.
A
--rna-quantification-gc-bias
GC bias correction estimates the effect of transcript %GC on sequencing coverage and accounts for the effect when estimating expression. To disable GC bias correction, set to "false".
false
--rna-quantification-fld-max --rna-quantification-fld-mean --rna-quantification-fld-sd
Use these options to specify the insert size distribution of the RNA-seq library for single-end runs. These options are relevant for GC bias correction. The maximum allowed value is 1000. To improve accuracy, modify the values to match your library.
250 ± 25 (mean ± standard deviation)
--rna-quantification-min-mapq
Minimum mapping quality for reads to be included in quantification.
0
--rna-quantification-mirna-realign-margin
The number of read bases allowed to align beyond a microRNA transcript exon boundary.
0
--rna-quantification-mirna-min-overlap
The minimum number of overlapping bases required to realign a read to a transcript.
3
--rna-quantification-gene-attributes
Attributes to include in quantification output report for genes, in addition to gene ID. Options are "name", "type" or "name,type".
None
--rna-quantification-transcript-attributes
Attributes to include in quantification output report for transcripts. Options are "gene" (Gene name), "type" or "gene,type".
None
--rna-quantification-include-biotypes
Gene biotypes to include in quantification. Comma-separated list including any of the following: protein_coding, immune, lincRNA, pseudogene, shortRNA, microRNA, rRNA, other, none.
All included
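To make the --rna-library-type values concrete, the sketch below maps a read's alignment strand to the transcript strand it implies under each explicit library type. The function name and signature are hypothetical, and the sketch assumes that the two mates of a pair align to opposite strands. The "A" value is not handled because it must first be resolved to one of the explicit types.

```python
def implied_transcript_strand(library_type, read_strand, read_number=1):
    """Return '+', '-', or None (unstranded) for one aligned read.

    library_type -- one of IU, ISR, ISF, U, SR, SF
    read_strand  -- '+' or '-', the strand the read aligned to
    read_number  -- 1 or 2 for paired-end data (ignored for single-end)
    """
    if library_type in ("IU", "U"):
        return None  # unstranded: the read carries no strand information
    if library_type == "ISF":
        same = read_number == 1   # read1 matches the transcript strand
    elif library_type == "ISR":
        same = read_number == 2   # read2 matches the transcript strand
    elif library_type == "SF":
        same = True               # reads match the transcript strand
    elif library_type == "SR":
        same = False              # reads are reverse to the transcript
    else:
        raise ValueError(f"unsupported library type: {library_type!r}")
    flipped = {"+": "-", "-": "+"}[read_strand]
    return read_strand if same else flipped
```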
Quantification Outputs
Transcript quantification results are reported in the <outputPrefix>.quant.sf text file. The file lists results for each transcript. You can use the output file as input for differential gene expression using tools such as tximport and DESeq2.
The file contains the following columns:
Name
The ID of the transcript.
Length
The length of the (spliced) transcript in base pairs.
EffectiveLength
The length as accessible to RNA-seq, accounting for insert-size and edge effects.
TPM
Transcripts per Million (TPM) represents the expression of the transcript when normalized for transcript length and sequencing depth.
NumReads
The estimated number of reads from the transcript. The values are not normalized.
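The relationship between the NumReads, EffectiveLength, and TPM columns can be sketched as below. The sketch assumes the file is tab-separated with a header row naming the five columns above (the layout used by Salmon-style quant.sf files), and it uses the standard TPM definition, which may differ in detail from DRAGEN's internal calculation. The sample rows are invented for illustration.

```python
import csv
import io

# Invented example content; real files come from <outputPrefix>.quant.sf.
EXAMPLE = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1000\t800.0\t250000.0\t80.0\n"
    "tx2\t500\t300.0\t750000.0\t90.0\n"
)


def read_quant(handle):
    """Parse a quant.sf-style table into a list of dicts with numeric fields."""
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "Name": rec["Name"],
            "Length": float(rec["Length"]),
            "EffectiveLength": float(rec["EffectiveLength"]),
            "TPM": float(rec["TPM"]),
            "NumReads": float(rec["NumReads"]),
        })
    return rows


def tpm_from_counts(rows):
    """Recompute TPM: reads per effective base, rescaled to sum to 1e6."""
    rates = [
        r["NumReads"] / r["EffectiveLength"] if r["EffectiveLength"] > 0 else 0.0
        for r in rows
    ]
    total = sum(rates)
    return [1e6 * x / total if total > 0 else 0.0 for x in rates]


rows = read_quant(io.StringIO(EXAMPLE))
recomputed = tpm_from_counts(rows)
```

Because TPM divides by effective length before rescaling, tx2 receives the higher TPM in this example even though its read count is similar to that of tx1.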
The gene expression quantification module also outputs the files below. For information on the metrics included, see Quantification and RNA QC Metrics.
<outputPrefix>.quant.genes.sf: Contains quantification results at the gene level. The results are produced by summing together all transcripts with the same geneID in the annotation file (GTF). Length and EffectiveLength are the expression-weighted means of the individual transcripts in the gene.
<outputPrefix>.quant.metrics.csv: Summary statistics relevant to RNA transcripts and quantification. See Quantification and RNA QC Metrics.
<outputPrefix>.quant.transcript_fragment_lengths.txt: Full fragment length distribution of reads mapped to transcripts, output in length-probability pairs of length minimum through >999 bases. Summing the products of the two columns will yield the average fragment length.
<outputPrefix>.quant.transcript_coverage.txt: Measures coverage uniformity with a normalized average of 5' to 3' coverage pattern along transcripts in increments of 1%. A summation of the 100 coverage bins should yield 100%.
<outputPrefix>.SJ.saturation.txt: Measures sequencing saturation of the library, including the number of unique splice junctions observed as a function of reads processed.
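Two of the calculations described above, gene-level aggregation and the average fragment length, can be sketched as follows. The function names and input layouts are hypothetical. Using TPM as the expression weight is an assumption, since the text only says "expression-weighted", and the fallback to an unweighted mean for unexpressed genes is also an assumption.

```python
from collections import defaultdict


def aggregate_genes(transcripts, tx_to_gene):
    """Sum transcript results per gene.

    transcripts -- dicts with Name, Length, EffectiveLength, TPM, NumReads
    tx_to_gene  -- mapping from transcript ID to gene ID (from the GTF)
    """
    groups = defaultdict(list)
    for t in transcripts:
        groups[tx_to_gene[t["Name"]]].append(t)

    genes = {}
    for gene, txs in groups.items():
        tpm = sum(t["TPM"] for t in txs)
        if tpm > 0:
            # Expression-weighted means, weighting each transcript by TPM.
            length = sum(t["Length"] * t["TPM"] for t in txs) / tpm
            eff = sum(t["EffectiveLength"] * t["TPM"] for t in txs) / tpm
        else:
            # Assumption: unexpressed genes use a plain mean.
            length = sum(t["Length"] for t in txs) / len(txs)
            eff = sum(t["EffectiveLength"] for t in txs) / len(txs)
        genes[gene] = {
            "Length": length,
            "EffectiveLength": eff,
            "TPM": tpm,
            "NumReads": sum(t["NumReads"] for t in txs),
        }
    return genes


def mean_fragment_length(pairs):
    """pairs: (length, probability) tuples; returns the sum of products."""
    return sum(length * prob for length, prob in pairs)


example_tx = [
    {"Name": "A1", "Length": 1000, "EffectiveLength": 800, "TPM": 300.0, "NumReads": 30.0},
    {"Name": "A2", "Length": 2000, "EffectiveLength": 1800, "TPM": 100.0, "NumReads": 20.0},
]
gene_table = aggregate_genes(example_tx, {"A1": "geneA", "A2": "geneA"})
```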
Quantification and RNA QC metrics
The RNA Quantification module outputs metrics related to the gene expression results and more general RNA QC metrics that rely on the transcript-level analysis. A summary of the metrics is output to the <outputPrefix>.quant_metrics.csv file.
Library orientation
Library orientation of the RNA-seq reads relative to the original transcripts. The library orientation can be automatically detected, or can be explicitly provided. See Quantification Options for more information.
Total Genes
Total number of genes from the gene annotation (GTF/GFF) input used for analysis.
Coding Genes
Number of coding genes from the gene annotation (GTF/GFF), excluding pseudogenes and non-coding biotypes.
Total Transcripts
Number of transcripts from the gene annotation (GTF/GFF) input used for analysis.
Median transcript CV coverage
Median Coefficient of Variation (CV), which is standard deviation divided by mean coverage, of the 1000 most highly expressed transcripts. This metric measures uniformity of RNA-seq read coverage.
Median 5' coverage bias
Median 5 prime bias of the 1000 most highly expressed transcripts, calculated per transcript as mean coverage of the 5'-most 100 bases divided by the mean coverage of the whole transcript.
Median 3' coverage bias
Median 3 prime bias of the 1000 most highly expressed transcripts, calculated per transcript as mean coverage of the 3'-most 100 bases divided by the mean coverage of the whole transcript.
Forward transcript fragments
The number of read pairs that match transcripts on the forward strand. Only reads that align fully within exons are counted.
Reverse transcript fragments
The number of read pairs that match transcripts on the reverse strand. Only reads that align fully within exons are counted.
Strand mismatched fragments
In the case of stranded library orientation, number of read pairs that do not match the expected strand of the transcript. Only reads that align fully within exons are counted.
Ambiguous strand fragments
Read pairs that match transcripts in both forward and reverse orientation. Only reads that align fully within exons are counted.
Unknown transcript fragments
Read pairs that partially align with an exon but overlap non-exonic regions (usually due to alternative splicing).
Intron fragments
Read pairs that overlap with a gene, but do not overlap with any exons.
Intergenic fragments
Read pairs that do not overlap with any gene.
Correctly stranded fragments
Read pairs that match transcripts in the expected orientation for the library type. Based on Picard's CORRECT_STRAND_READS metric.
Number of genes with coverage > 1x, 10x, 30x, 100x
The number of genes for which the most highly expressed transcript has average coverage greater than 1x, 10x, 30x, and 100x.
Fold coverage of all exons
The average sequencing coverage across all annotated exons, determined using the most highly expressed transcript for each gene.
Fold coverage of coding exons
The average sequencing coverage across only exons within coding genes, determined using the most highly expressed transcript for each gene.
Fold coverage of introns
The average sequencing coverage across detected introns.
Fold coverage of intergenic regions
The average sequencing coverage across areas detected outside annotated genes.
Only unfiltered and properly paired reads (for paired-end sequencing) are counted in the above metrics. The seven fragment types that are listed (Forward transcript, Reverse transcript, Strand mismatched, Ambiguous strand, Intron, Intergenic, Unknown transcript) add up to 100% of the counted fragments, and the percentage of this total is provided next to each fragment metric count.
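The coverage uniformity metrics defined above (transcript CV and the 5' and 3' coverage bias) can be sketched as below. This is an illustration under stated assumptions, not DRAGEN's implementation: per-base coverage is assumed to be listed in 5' to 3' transcript orientation, the population standard deviation is used, and transcripts are ranked by TPM to pick the most highly expressed ones. All function names are hypothetical.

```python
from statistics import mean, median, pstdev


def coverage_cv(cov):
    """Coefficient of variation: standard deviation divided by mean coverage."""
    m = mean(cov)
    return pstdev(cov) / m if m > 0 else 0.0


def five_prime_bias(cov, window=100):
    """Mean coverage of the 5'-most bases over the whole-transcript mean."""
    m = mean(cov)
    return mean(cov[:window]) / m if m > 0 else 0.0


def three_prime_bias(cov, window=100):
    """Mean coverage of the 3'-most bases over the whole-transcript mean."""
    m = mean(cov)
    return mean(cov[-window:]) / m if m > 0 else 0.0


def median_over_top(transcripts, metric, top_n=1000):
    """Median of a per-transcript metric over the top_n transcripts by TPM.

    transcripts -- dicts with 'TPM' and 'coverage' (per-base list) keys
    """
    ranked = sorted(transcripts, key=lambda t: t["TPM"], reverse=True)[:top_n]
    return median(metric(t["coverage"]) for t in ranked)


# A 400-base transcript with coverage that decays from the 5' to the 3' end.
example_cov = [2.0] * 100 + [1.0] * 200 + [0.0] * 100
example_bias_5p = five_prime_bias(example_cov)
example_bias_3p = three_prime_bias(example_cov)
```

In this example the mean coverage is 1x, so the 5' bias is 2 and the 3' bias is 0, the pattern expected from a library with strong 5' enrichment.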
