1 of 6

DRAGEN RNA Pipeline

DRAGEN includes an RNA-seq (splicing-aware) aligner, as well as RNA specific analysis components for gene expression quantification, gene fusion detection, splice variant calling, and small variant calling. All of these analysis components require the aligner to be enabled.

Most of the functionality and options described in Host Software Options and DNA Mapping also apply to RNA applications. Additional RNA-specific aspects are described in this section.

Input Files

Gene Annotation File

In addition to the standard input files (reads from fastq or bam, reference genome, etc.), DRAGEN can also take a gene annotations file as input. A gene annotations file aids in the alignment of reads to known splice junctions and is required for gene expression quantification and gene fusion calling.

To specify a gene annotation file, use the -a (--annotation-file) command line option. The input file must conform to the GTF/GFF specification (http://uswest.ensembl.org/info/website/upload/gff.html). The file must contain features of type exon, and the record must contain attributes of type gene_id and transcript_id. An example of a valid GTF file is shown below.

chr1    HAVANA  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972.4"; transcript_id ENST00000456328.2; ...
chr1    HAVANA  exon        11869   12227   .   +   .   gene_id "ENSG00000223972.4"; transcript_id ENST00000456328.2; ...
chr1    HAVANA  exon        12613   12721   .   +   .   gene_id "ENSG00000223972.4"; transcript_id ENST00000456328.2; ...
chr1    HAVANA  exon        13221   14409   .   +   .   gene_id "ENSG00000223972.4"; transcript_id ENST00000456328.2; ...
chr1    ENSEMBL transcript  11872   14412   .   +   .   gene_id "ENSG00000223972.4"; transcript_id ENST00000515242.2; ...
chr1    ENSEMBL exon        11872   12227   .   +   .   gene_id "ENSG00000223972.4"; transcript_id ENST00000515242.2; ...
chr1    ENSEMBL exon        12613   12721   .   +   .   gene_id "ENSG00000223972.4"; transcript_id ENST00000515242.2; ...
chr1    ENSEMBL exon        13225   14412   .   +   .   gene_id "ENSG00000223972.4"; transcript_id ENST00000515242.2; ...
...

Similarly, a GFF file can be used. Each exon feature must have as a Parent a transcript identifier that is used to group exons. An example of a valid GFF file is shown below.

1   ensembl_havana  processed_transcript    11869   14409       .   +   .   ID=transcript:ENST00000456328;
1   havana          exon                    11869   12227       .   +   .   Parent=transcript:ENST00000456328; ...
1   havana          exon                    12613   12721       .   +   .   Parent=transcript:ENST00000456328; ...
1   havana          exon                    13221   14409       .   +   .   Parent=transcript:ENST00000456328; ...
...

NB. For proper handling of genes in the PAR regions of chromosome X and Y, it is required that the gene_id attribute of all exons of the same gene is distinct between the two chromosomes, in order to distinguish exons within the PAR region of chromosome X from the ones within the PAR region of chromosome Y. That is, it is often the case that the gene_id of all exons of a transcript from geneA is equal to gene_id=geneA in chromosome X, and gene_id=geneA_PAR_Y in chromosome Y. This allows the GTF/GFF parser and downstream components to discriminate data associated with PAR genes in chromosome X from data associated with the same PAR genes in chromosome Y.

The DRAGEN host software parses the file for exons within the transcripts and produces splice junctions. The following output displays the number of splice junctions detected.

==================================================================
Generating annotated splice junctions
==================================================================
Input annotations file: ./gencode.v19.annotation.gtf
Splice junctions database file: output/rna.sjdb.annotations.out.tab

Number of genes: 27459

Number of transcripts: 196520
Number of exons: 1196293
Number of splice junctions: 343856

The splice junctions that are detected from the annotation file are also written to *.sjdb.annotations.out.tab. Splice junctions below a minimum length are excluded, which helps filter annotation artifacts. This minimum annotation splice junction length is controlled by the --rna-ann-sj-min-len option, which has a default value of 6.

GFF3 Support

Note that GFF3 is a different file format from GFF. GFF3 files are not officially supported due to inconsistent contig naming conventions between GENCODE and Ensembl.

For the same reference, GENCODE provides all the attributes necessary for DRAGEN to build a hierarchical structure:

#description: evidence-based annotation of the human genome (GRCh38), version 32 (Ensembl 98)
...
chr1    HAVANA  exon    11869   12227   .       +       .       ID=exon:ENST00000456328.2:1;Parent=ENST00000456328.2;gene_id=ENSG00000223972.5;transcript_id=ENST00000456328.2;gene_type=transcribed_unprocessed_pseudogene;gene_name=DDX11L1;transcript_type=lncRNA;transcript_name=DDX11L1-202;exon_number=1;exon_id=ENSE00002234944.1;level=2;transcript_support_level=1;hgnc_id=HGNC:37102;tag=basic;havana_gene=OTTHUMG00000000961.2;havana_transcript=OTTHUMT00000362751.1

Ensembl has a different notation:

#!genome-build Genome Reference Consortium GRCh38.p14
#!genome-version GRCh38
...
1       havana  exon    11869   12227   .       +       .       Parent=transcript:ENST00000456328;Name=ENSE00002234944;constitutive
=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002234944;rank=1;version=1

Ensembl uses different notation for contigs (for GRCh38) than GENCODE. Ensembl contigs do not have the "chr" prefix. The contig identifiers in the annotation file must match the DRAGEN reference in use, and by most conventions GRCh38/hg38 contigs are prefixed with "chr".

If necessary, DRAGEN may support GFF3 files that are GENCODE-compatible with the following annotations present in the attributes of each exon record:

For gene: "gene_name" or "name" or "gene" or "gene_id"
For transcript: "transcript_id" or "Parent"

Due to the flexibility of the GFF3 file format, issues may arise as it continues to evolve.

Two-Pass Splice-junction Alignment

Instead of using a GTF file for annotated splice junctions, the DRAGEN software is also capable of reading in an SJ.out.tab file (see SJ.out.tab). This file enables DRAGEN to run in a two-pass mode, where the splice junctions discovered in the first pass (output as SJ.out.tab file) are used to guide the mapping and alignment reads during a second run through DRAGEN. This mode of operation is useful to increase sensitivity for spliced alignments in cases when a gene annotations file is not readily available for the target genome. If a well curated GTF is already availble for your target genome, then there is no need to run a second pass with the SJ.out.tab.

Please be aware that depending on the characteristics of the input file (i.e. read depth and distribution) the second pass using the first pass SJ.out.tab may take longer than the first pass.

NOTE: Components downstream of aligner like gene expression quantification, gene fusion detection and RNA variant calling require GTF file as the input annotations file and are NOT compatible with two-pass splice-junction alignment mode.

RNA Alignment

The DRAGEN RNA pipeline uses the DRAGEN RNA-Seq spliced aligner. Mapping of short seed sequences from RNA-Seq reads is performed similarly to mapping DNA reads. In addition, splice junctions (the joining of noncontiguous exons in RNA transcripts) near the mapped seeds are detected and incorporated into the full read alignments.

Alignment Output

The output files generated when running DRAGEN in RNA mode are similar to those generated in DNA mode. RNA mode also produces extra information related to spliced alignments. Details regarding the splice junctions are present both in the SAM alignment record and an additional file, the SJ.out.tab file.

BAM

The output BAM file meets the SAM specification and is compatible with downstream RNA-Seq analysis tools.

RNA-Seq BAM Tags

The following BAM tags are emitted alongside spliced alignments.

Tag

Description

If a gene annotations file is used during the map/align stage, and the splice junction is detected as an annotated junction, then 20 is added to its motif value.

Compatibility with Cufflinks

Cufflinks might require spliced alignments to emit the XS:A strand tag. This tag is present in the SAM record if the alignment contains a splice junction. The possible values for XS:A strand tag are as follows:

'.' (undefined), '+' (forward strand), '-' (reverse strand), or '*' (ambiguous).

By default, if the spliced alignment has an undefined strand or an ambiguous (conflicting) strand, then the alignment output is suppressed. These alignments can be output into the output alignment file by setting the --no-ambig-strand option to 1.

Cufflinks also expects that the MAPQ for a uniquely mapped read is a single value. This value is specified by the --rna‑mapq-unique option. To force all uniquely mapped reads to have a MAPQ equal to this value, set ‑‑rna‑mapq‑unique to a nonzero value.

SJ.out.tab

Along with the alignments emitted in the SAM/BAM file, an additional SJ.out.tab file summarizes the high confidence splice junctions in a tab-delimited file. The columns for this file are as follows:

contig name
first base of the splice junction (1-based)
last base of the splice junction (1-based)strand (0: undefined, 1: +, 2:-)
strand (0: undefined, 1: +, 2: -)
intron motif: 0: noncanonical, 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT
0: unannotated, 1: annotated, only if an input gene annotations file was used
number of uniquely mapping reads spanning the splice junction
number of multimapping reads spanning the splice junction
maximum spliced alignment overhang

The maximum spliced alignment overhang (column 9) field in the SJ.out.tab file is the anchoring alignment overhang. For example, if a read is spliced as ACGTACGT------------ACGT, then the overhang is 4. For the same splice junction, across all reads that span this junction, the maximum overhang is reported. The maximum overhang is a confidence indicator that the splice junction is correct based on anchoring alignments.

There are two SJ.out.tab files generated by the DRAGEN host software, an unfiltered version and a filtered version. The records in the unfiltered file are a consolidation of all spliced alignment records from the output SAM/BAM. However, the filtered version has a much higher confidence for being correct due to the use of the following filters.

A splice junction entry in the SJ.out.tab file is filtered out if any of these conditions are met:

SJ is a noncanonical motif and is only supported by < 3 unique mappings.
SJ of length > 50000 and is only supported by < 2 unique mappings.
SJ of length > 100000 and is only supported by < 3 unique mappings.
SJ of length > 200000 and is only supported by < 4 unique mappings.
SJ is a noncanonical motif and the maximum spliced alignment overhang is < 30.
SJ is a canonical motif and the maximum spliced alignment overhang is < 12.

The filtered SJ.out.tab is recommended for use with any downstream analysis or post processing tools. Alternatively, you can use the unfiltered SJ.out.tab and apply your own filters (for example, with basic awk commands).

Note that the filter does not apply to the alignments present in the BAM or SAM file.

Chimeric.out.junction File

If there are chimeric alignments present in the sample, then a supplementary Chimeric.out.junction file is also output. This file contains information about split-reads that can be used to perform downstream gene fusion detection. Each line contains one chimerically aligned read. The columns of the file are as follows:

Chromosome of the donor.
First base of the intron of the donor (1-based).
Strand of the donor.
Chromosome of the acceptor.
First base of the intron of the acceptor (1-based).
Strand of the acceptor.
N/A---not used, but is present to be compatible with other tools. It will always be 1.
N/A---not used, but is present to be compatible with other tools. It will always be *.
N/A---not used, but is present to be compatible with other tools. It will always be *.
Read name.
First base of the first segment, on the + strand.
CIGAR of the first segment.
First base of the second segment.
CIGAR of the second segment.

CIGARs in this file follow the standard CIGAR operations as found in the SAM specification, with the addition of a gap length L that is encoded with the operation p. For paired end reads, the sequence of the second mate is always reverse complemented before determining strandedness.

The following is an example entry that shows two chimerically aligned read pairs, in which one of the mates is split, mapping segments of chr19 to chr12. Also shown are the corresponding SAM records associated with these entries.

chr19 580462 + chr12 120876182 + 1 * * R_15448 571532 49M8799N26M8p49M26S 120876183 49H26M
chr19 580462 + chr12 120876182 + 1 * * R_15459 571552 29M8799N46M8p29M46S 120876183 29H46M

R_15448:1   99    chr19   571531      60  49M8799N26M  =       580413
R_15448:2   147   chr19   580413      60  49M26S       =       571531
R_15448:2   2193  chr12   120876182   15  49H26M       chr19   571531

R_15459:1   99    chr19   571551      60  29M8799N46M  =       580433
R_15459:2   147   chr19   580433      4   29M46S       =       571551
R_15459:2   2193  chr12   120876182   15  29H46M       chr19   571551

Mapping Metrics

The RNA Pipeline reports summary and per read group statistics pertaining to read mapping in the mapping_metrics.csv file. The majority of the matrics are as described in the DNA mapping metrics section, but the metrics that are specific to RNA-seq are listed below.

Filtered rRNA reads---Total number of ribosomal RNA reads that are filtered out with the --rrna-filter-enable option.
Mitochondrial reads excluded---Total number of reads detected to be in ChrM if the --rna-mapping-metrics-exclude-chrm option is enabled.
Mapped reads adjusted for filtered mapping---Adjusted count of mapped reads by adding in the filtered rRNA reads.
Mapped reads adjusted for excluded mapping---Adjusted count of mapped reads by adding in the excluded mitocondrial reads.
Mapped reads adjusted for filtered and excluded mapping---Adjusted count of mapped reads by adding in both the filtered rRNA and excluded mitocondrial reads.
Unmapped reads adjusted for filtered mapping---Adjusted count of unmapped reads by subtracting out the filtered rRNA reads.
Unmapped reads adjusted for excluded mapping---Adjusted count of unmapped reads by subtracting out the excluded mitocondrial reads.
Unmapped reads adjusted for filtered and excluded mapping---Adjusted count of unmapped reads by subtracting out both the filtered rRNA and the excluded mitocondrial reads.
Reads with splice junction---Total number of reads that included a spliced alignment that crosses an intron

RNA Alignment Options

The aligner stage of the RNA spliced aligner uses Smith-Waterman Alignment Scoring options and Splicing Scoring Options.

Smith-Waterman Alignment Scoring Options

Refer to [Smith-Waterman Alignment Scoring Settings]{.underline} for more details about the alignment algorithm used within DRAGEN. The following scoring options are specific to the processing of canonical and noncanonical motifs within introns.

--Aligner.intron-motif12-pen The --Aligner.intron-motif12-pen option controls the penalty for canonical motifs 1/2 (GT/AG, CT/AC). The default value calculated by the host software is 1 * (match-score + mismatch-pen).
--Aligner.intron-motif34-pen The --Aligner.intron-motif34-pen option controls the penalty for canonical motifs 3/4 (GC/AG, CT/GC). The default value calculated by the host software is 3 * (match-score + mismatch-pen).
--Aligner.intron-motif56-pen The --Aligner.intron-motif56-pen option controls the penalty for canonical motifs 5/6 (AT/AC, GT/AT). The default value calculated by the host software is 4 * (match-score + mismatch-pen).
--Aligner.intron-motif0-pen The --Aligner.intron-motif0-pen option controls the penalty for noncanonical motifs. The default value calculated by the host software is 6 * (match-score + mismatch-pen).

Splicing Scoring Options

--Mapper.min-intron-bases For RNA-Seq mapping, a reference alignment gap can be interpreted as a deletion or an intron. In the absence of an annotated splice junction, the min-intron-bases option is a threshold gap length separating this distinction. Reference gaps at least this long are interpreted and scored as introns, and shorter reference gaps are interpreted and scored as deletions. However, alignments can be returned with annotated splice junctions shorter than this threshold.
--Mapper.max-intron-bases The max-intron-bases option controls the largest possible intron that is reported, which useful for preventing false splice junctions that would otherwise be reported. Set this option to a value that is suitable to the species you are mapping against.
--Mapper.ann-sj-max-indel For RNA-seq, seed mapping can discover a reference gap in the position of an annotated intron, but with slightly different length. If the length difference does not exceed this option, the mapper investigates the possibility that the intron is present exactly as annotated, but an indel on one side or the other near the splice junction explains the length difference. Indels longer than this option and very near annotated splice junctions are not likely to be detected. Higher values may increase mapping time and false detections.

Duplicate Marking

DRAGEN RNA can detect duplicate reads, which are defined as fragments that have both ends mapping to the same (clipping-adjusted) position during alignment. In RNA-Seq data, the reads can represent PCR duplicates during library prep or as a result from deep coverage of highly expressed regions.

If --enable-duplicate-marking is set to true, duplicate fragments are marked in the BAM file and the total number of duplicate reads is reported as a mapping metric. Marking of duplicates does not affect gene expression quantification and gene fusion calling.

Downsampling

DRAGEN RNA also supports internal downsampling, which is a process by which a random sub-sample of reads is selected from the dataset after trimming and alignment for downstream analysis. In RNA-Seq, this can be useful in two ways - it can speed up analysis of samples with excessively high coverage, and it can allow for more accurate cross-comparisons between different samples.

If --enable-down-sampler is set to true and a value specified for --down-sampler-reads, DRAGEN will use only that many RNA fragments (including both Read 1 and Read 2) for quantification, fusion, variant calling and output to BAM. Please note the the entire input dataset is still used for the generation of trimming, fastqc, and mapping metrics.

Ribosomal RNA filtering

Ribosomal RNA (rRNA) sequences can contribute a large fraction of reads in some RNA-Seq datasets, depending on the sample type and library prep method. You can use the DRAGEN RNA pipeline to filter rRNA reads during alignment, because the reads are not relevant for downstream analysis. By filtering rRNA, you can reduce run time and file size and avoid deep read alignment pile ups at rRNA repeat loci on the genome to make downstream analysis of RNA BAM files easier.

rRNA filtering relies on a decoy contig with the rRNA sequence included in the reference hash table. Any read that maps to the decoy contig, including multimappers, is tagged with rRNA and is not mapped in the output.

NOTE: The rrna filter option only accepts a single contig by default. In the event multiple contigs need to be provided, they can be concatenated using a 1kb N mask between them, and added to the reference FASTA while creating the hash table.

NOTE: rRNA filtering is not supported with chm13-based references and it will be automatically disabled.

Command-Line Options

The following are the required command-line options for rRNA filtering.

--rrna-filter-enable=true--Enables rRNA filtering. Set to true to enable rRNA filtering. The default value is false.
--rrna-filter-contig--Specify the name of the rRNA sequences to use for filtering. If you do not specify a value, the default gl000220 is provided for human genome alignments by using the reference autodetect feature. gl000220 is an unplaced contig included in hg19 and hg38 genomes, which include a full copy of the rRNA repeat. For other genomes, you must include a rRNA decoy contig when creating a hash table.

Output Files

All rRNA filtering reads are left unaligned in the BAM files and tagged ZS:Z:FLT. The number and percentage of filtered rRNA reads is reported as a mapper metric Adjustment of reads matching filter contigs.

PolyA trimming

PolyA tails may be trimmed by including the settings --read-trimmers polya or --soft-read-trimmers polyg,polya (Note: polyg soft trimming is enabled by default). The minimum number of poly-A/poly-T bases required for trimming may be set using --trim-polya-min-trim. The default threshold is 20 poly-A/poly-T bases. Refer to DNA pipeline read trimmers section for usage of read trimmers options.

PolyA trimming by read orientation

The PolyA trimmer determines which end of the reads to trim for poly-A and poly-T sequences based on the library type. For example, for Illumina forward stranded paired reads the trimmer will trim poly-A sequences at 3' end of read 1 and poly-T sequences at 5' end of read 2. If --rna-library-type is not provided or set to autodetect (A), the trimmer assumes the library is unstranded and trims poly-A sequences from 3' end of each read and poly-T sequences from 5' end of each read. The option --rna-library-type is described in the Gene Expression section.

MAPQ Scoring

By default, the MAPQ calculation for RNA-Seq is identical to DNA-Seq. The primary contributor to MAPQ calculation is the difference between the best and second-best alignment scores. Therefore, adjusting the alignment scoring parameters impacts the MAPQ estimate. These adjustments are outlined in Smith-Waterman Alignment Scoring Settings.

The --mapq-strict-sjs option is specific to RNA, and applies where at least one exon segment is aligned confidently, but there is ambiguity regarding possible splice junctions. When this option is set to 0, a higher MAPQ value is returned, expressing confidence that the alignment is at least partially correct. When this option is set to 1, a lower MAPQ value is returned, reflecting the splice junction ambiguity.

Some downstream tools, such as Cufflinks, expect the MAPQ value to be a unique value for all uniquely mapped reads. This value is specified with the --rna-mapq-unique option. Setting this option to a nonzero value overrides all MAPQ estimates based on alignment score. Instead, all uniquely mapped reads have a MAPQ set to the value of --rna-mapq-unique. All multimapped reads have a MAPQ value of int(-10*log10(1 ‑ 1/NH), where the NH value is the number of hits (primary and secondary alignments) for that read.

Gene Fusion Detection

The DRAGEN Gene Fusion module uses the DRAGEN RNA splice-aware aligner to detect gene fusion events. The supplementary (chimeric) alignments are used to find potential breakpoints and read evidence is accumulated for the resulting fusion event candidates. Then, an ML model is applied to score the putative fusion events to filter potential false positives. The ML scoring model is currently available on human samples and does not support non-human reference genomes.

Running DRAGEN Gene Fusion

You can run the DRAGEN Gene Fusion module together with a regular RNA-Seq map/align job. To enable the DRAGEN Gene Fusion module, set --enable-rna-gene-fusion to true in your current RNA-Seq command-line scripts. The DRAGEN Gene Fusion module requires a gene annotations file in GTF or GFF format.

The following is an example command line for running an end-to-end RNA-Seq experiment.

At the end of a run, a summary of detected gene fusion events is output, which is similar to the following example.

Gene Fusion Output and Filters

The <OUTPUT_PREFIX>.fusion_candidates.features.csv file lists the detected gene fusion events. The output CSV file includes the following columns.

#FusionGene: Parent gene names (in 5' to 3' order of transcript) participating in the fusion; hereafter refer red to as Gene 1 and Gene 2. If a fusion breakpoint overlaps multiple genes, the genes are listed by default as separate candidates (rows). To show them as a semi-colon separated gene list on the same row, the option --rna-gf-merge-calls can be set to true as described in the Gene Fusion Options and Filters section.
Score: Fusion call confidence score predicted by the ML model. If the ML model is used, the score can be 0 (low confidence) to 1 (high-confidence call). Currently the ML model only supports human references. In the case an ML model is not available, the number of supporting reads will be reported as the score.
LeftBreakpoint: Gene 1 breakpoint formatted as <Chromosome>:<Position>:<Strand>.
RightBreakpoint: Gene 2 breakpoint formatted as <Chromosome>:<Position>:<Strand>.
Filter: Semicolon separated list of filter flags. The LOW_SCORE filter is used to filter low confidence fusion candidates. If --rna-gf-enable-post-filtering=true, other confidence filters will also be applied. Informative filters, on the other hand, do not fail the fusion. In the absence of the ML model scoring (i.e. a non-human reference is used), a more aggressive post-filtering will take place and all confidence and informative filters will be applied. The following are the available filters.

Note that the specific features and column values are subject to change in future DRAGEN versions as more RNA data is analyzed.

#SplitScore: Combined count of fusion supporting read pairs reported as split reads and soft-clipped reads
#NumSplitReads: Number of fusion supporting read pairs with at least one split read alignment.
#NumSoftClippedReads: Number of fusion supporting read pairs with no split read alignment, but at least one soft clipped alignment. Included in SplitScore and includes soft-clipped reads for both Gene1 and Gene2
#NumSoftClippedReadsGene1: Number of fusion supporting read pairs with no split read alignment, but at least one soft clipped alignment to Gene 1
#NumSoftClippedReadsGene2: See above (NumSoftClippedReadsGene1) for Gene 2
#NumPairedReads: Number of fusion supporting read pairs such that one of the reads maps to Gene1 and the other maps to Gene2, without any breakpoint overlap
#NumRefSplitReadsGene1: Number of read pairs that map fully within Gene 1 such that at least one of the reads aligns across the breakpoint. These reads support the reference transcript and do not support the fusion.
#NumRefSplitReadsGene2: See above (NumRefSplitReadsGene1) for Gene 2
#NumRefPairedReadsGene1: Number of read pairs such that one of the reads maps on the left side of the Gene1 breakpoint and the other maps on the right side of the Gene1 breakpoint, without overlapping the break. These reads support the reference transcript and do not support the fusion.
#NumRefPairedReadsGene2: See above (NumRefPairedReadsGene1) for Gene 2
#AltToRef-- Ratio of (fusion split + soft clipped reads) / max(NumRefSplitReadsGene1, NumRefSplitReadsGene2); used for the LOW_ALT_TO_REF filter
#UniqueAlignmentsGene1: Unique (start-end) positions of fusion supporting read alignments to Gene 1 (after dedup); used for the LOW_UNIQUE_ALIGNMENTS filter
#UniqueAlignmentsGene2: Unique (start-end) positions of fusion supporting read alignments to Gene 2 (after dedup); used for the LOW_UNIQUE_ALIGNMENTS filter
#MaxMapqGene1: Maximum MAPQ for fusion supporting reads in Gene 1
#AvgMapqGene1: Average MAPQ for fusion supporting reads in Gene 1
#MaxMapqGene2: Maximum MAPQ for fusion supporting reads in Gene 2
#AvgMapqGene2: Average MAPQ for fusion supporting reads in Gene 2
#CoverageBasesGene1: Bases in Gene 1 with read coverage within a certain distance (default 1000 bp) of the breakpoint in the direction of the breakpoint strand which is part of the fusion transcript
#CoverageBasesGene2: See above (CoverageBasesGene1) for Gene 2
#DeltaExonBoundaryGene1: Distance from the Gene 1 breakpoint for the closest fusion supporting alignment (higher distance to boundary lowers score)
#DeltaExonBoundaryGene2: See above (DeltaExonBoundaryGene1) for Gene 2
#IsRestrictedGene1: Indicator variable of whether the Gene 1 is tagged as protein coding in the annotation file
#IsRestrictedGene2: Indicator variable of whether the Gene 2 is tagged as protein coding in the annotation file
#IsEnrichedGene1: If enrichment or amplicon assay, then indicates whether Gene 1 is enriched. If whole transcriptome sequencing, then set to 1
#IsEnrichedGene2: See above (IsEnrichedGene1) for Gene 2
#CisDistance: Distance between breakpoints if they are adjacent to each other and on the same strand. Large value (100M) if not a CIS break; used for the READ_THROUGH filter.
#BreakpointDistance: Distance between breakpoints if they are adjacent. Large value (100M) if not within same chromosome
#GenePairHomologyEval: E-value of pairwise BLAST alignment of the parent genes
#AnchorLength1: Longest alignment of a fusion supporting read to Gene 1
#AnchorLength2: Longest alignment of a fusion supporting read to Gene 2
#FusionLengthGene1: Distance from breakpoint to the end of Gene 1
#FusionLengthGene2: Distance from breakpoint to the end of Gene 2
#NonFusionLengthGene1: Breakpoint distance to the end of transcript not part of the fusion for Gene 1
#NonFusionLengthGene2: Breakpoint distance to the end of transcript not part of the fusion for Gene 2
#Gene1Id: Gene ID reported in the annotation file for Gene 1
#Gene2Id: Gene ID reported in the annotation file for Gene 2
#Gene1Location:
- IntactExon: Breakpoint matches exon boundary,
- BrokenExon: Breakpoint is within an exon but does not match the exon boundary,
- Intron: Breakpoint is within an intron,
- Intergenic: Breakpoint does not overlap any gene
#Gene2Location: See above (Gene1Location) for Gene 2
#Gene1Sense: True if the Gene 1 5' to 3' direction matches the breakpoint order, indicating that the gene is the upstream gene in the fusion transcript
#Gene2Sense: See above (Gene1Sense) for Gene 2

In addition, if --rna-gf-merge-calls is enabled, DRAGEN merges the fusion candidates that overlap the same breakpoint into a single row reporting the feature values for the highest scoring passing candidate (or highest scoring failing candidate if no passing candidate is reported). For each breakpoint, in the column #FusionGene, it reports a semi-colon separated list of names of all overlapping genes with a passing candidate. The following two columns are added to the features.csv output file:

#AdditionalGenes1: If a mix of passing and failing candidates are reported for the same breakpoint of Gene 1, genes with only failing candidates are listed. If no passing candidate exists, then all overlapping genes are reported in the #FusionGene column.
#AdditionalGenes2: See above (AdditionalGenes1) for Gene 2

The <OUTPUT_PREFIX>.fusion_candidates.final output file lists each passing fusion along with the read names that support the fusion, including Split Reads, Soft-clipped reads, and Paired (discordant) Reads and the passing scores. These reads can be extracted from the output BAM file and then used to visualize the fusions (i.e. in IGV). The same information for the non-passing fusions is provided in the <OUTPUT_PREFIX>.filter_info output file.

The <OUTPUT_PREFIX>.fusion_candidates.vcf.gz output file provides the VCF representation for all of the breakpoints for the candidate fusions using structural variant-style BND notation. The VCF header is annotated with ##source=DRAGEN_RNA_GF to indicate the file is generated by the DRAGEN RNA Gene Fusion pipeline. All fusion candidates (passing and failing) are represented in the VCF output with one entry for each side of the fusion breakpoint (Gene 1 and Gene 2).

The <OUTPUT_PREFIX>.fusion_metrics.csv output file provides a simple count of the total number of fusion candidates, those passing the scoring filter, and the number of unique left-right gene combinations that are found.

Gene Fusion Options and Filters

The following thresholds and options may be used to configure the fusion caller:

--rna-gf-enriched-genes For RNA enrichment assays, a list of targeted genes specified as one gene-name per line. Only fusion calls involving at least one gene on the list are reported. The enriched genes list should only contain genes listed in the input annotation file. This option cannot be provided together with --rna-gf-enriched-regions. If RNA amplicon mode is enabled and the amplicon bed file already includes the gene name, then you do not need to set this option; DRAGEN will read the enriched genes names from the amplicon BED file (fifth column). See DRAGEN Amplicon Pipeline for further information.
--rna-gf-enriched-regions Alternative to --rna-gf-enriched-genes, but input is provided as a bed-file with regions coordinates instead of a gene list. All the genes in the provided annotation file that overlap such regions are included. Genes that are extracted in this way are summarized in output in the *.fusion.enriched_genes.txt file. This option cannot be provided together with --rna-gf-enriched-genes.
--rna-repeat-genes Text file that contains the names or IDs (from the annotation file) of targeted repetitive genes for sensitive fusion detection. Exclusive from --rna-repeat-intervals. This option overrides the default BED file. The repeat genes list should only contain genes listed in the input annotation file.
--rna-repeat-intervals BED file that contains a target list of repeat intervals for sensitive fusion detection. Exclusive from --rna-repeat-genes. This option overrides the default files, which contain the genes CIC, DUX4, PSPH, and SEPTIN14 for GRCh38 and hg19 reference genomes.
--enable-variant-annotation=true, --variant-annotation-assembly, and --variant-annotation-data Enable Illumina Annotation Engine (IAE) to report fusion annotations in JSON format. --enable-variant-annotation must be set to true. For more information, see Illumina Annotation Engine.
--rna-gf-restrict-genes When parsing the gene annotations file for use in the DRAGEN Gene Fusion module, you can use this option to restrict the entries of interest to only protein-coding regions. Restricting the annotation to only the protein-coding genes reduces false positive rates in currently studied fusion events. To report non-coding gene fusions such as pseudo genes and lincRNAs, turn off this option. The default value is true.
--rna-gf-merge-calls If multiple genes overlap a fusion breakpoint, DRAGEN generates and scores a separate fusion candidate for each gene pair overlapping the breakpoint. The default value is false so that each reported fusion event only has one left and right gene in the fusion, and overlapping genes are output as separate events.
--rna-gf-allow-overlapping-genes Allows for fusion calls between overlapping genes. The default value is "false".
--rna-gf-enable-post-filters Enable post-filtering of RNA gene fusion candidates by confidence flags. The filter flags are listed in the table above. The default value is "false".
--enable-rna-amplicon A separate fusion filtering model is trained for RNA amplicon mode. Duplicate removal for fusion supporting reads is disabled for RNA amplicon mode and both genes are required to be in the list of enriched genes. By default, the DRAGEN fusion caller filters candidates if a breakpoint overlaps both transcripts (e.g. fusions such as FIP1L1--PDGFRA and GOPC--ROS1). In RNA amplicon mode, such candidates are not filtered. See DRAGEN Amplicon Pipeline for further information. The default is "false".
--rna-gf-sv-vcf Structural Variant VCF file output from DRAGEN DNA structural variant caller run in somatic mode. See the next section for more information.

Running RNA fusion detection with a somatic SV evidence

You can run the DRAGEN Gene Fusion module with a VCF file containing somatic Structural Variant (SV) calls. DRAGEN will report SV events matching each fusion candidate in the *.features.csv output file for informational purposes, but will not use this data in the scoring or filtering of the fusion candidates. The SV events must be run in somatic mode (for more information see DRAGEN Structural Variant Calling pipeline). The following is an example command line for running an end-to-end RNA-Seq experiment with a somatic SV VCF file.

When the SV VCF input is provided to the RNA fusion caller, the following additional features will be reported in the features.csv output file:

#SvEvent: A semi-colon separated string representation of SV events matching the fusion candidate.
#SvType: A semi-colon separated list of type of the matching SV events.
#SomaticScore: The highest SomaticScore value of the matching SV events.
#SvDistance: The maximum distance between any SV breakpoint to any fusion breakpoints (if multiple matching SV events, then minimum of all maximum distances over all SV events).
#LeftSvDistance: The distance between the left fusion breakpoint and the corresponding SV breakpoint (if multiple matching SV events, then minimum over all SV events).
#RightSvDistance: The distance between the right fusion breakpoint and the corresponding SV breakpoint (if multiple matching SV events, then minimum over all SV events).
#SvPresent: Set to 1 if matching SV event is present, otherwise 0.
#SvAbsent: Set to 1 if no matching SV event is present, otherwise 0.

Gene Expression Quantification

The DRAGEN RNA pipeline contains a gene expression quantification module that estimates the expression of each transcript and gene in an RNA-seq data set. The module first internally translates the genomic mapping of each read (read pair) to the corresponding transcript mappings. Then uses an Expectation-Maximization (EM) algorithm to infer the transcript expression values that best match all the observed reads. The EM algorithm can also model and correct for GC-bias in the reported quantification results.

To enable the quantification module, set the --enable-rna-quantification option to true in your current RNA-seq command-line scripts. Additionally, you must provide a gene annotation file (GTF/GFF) that contains the genomic position of all transcripts to quantify. You can specify the GTF/GFF file using the -a or --annotation-file option.

Quantification Options

Option

Description

Quantification Outputs

Transcript quantification results are reported in the <outputPrefix>.quant.sf text file. The file lists results for each transcript. You can use the output file as input for differential gene expression using tools such as tximport and DESeq2.

The following is an example of the file contents:

Name Length EffectiveLength TPM NumReads
ENST00000364415.1 116 12.3238 5.2328 1
ENST00000564138.1 2775 2105.58 1.28293 41.8885

Field

Description

The gene expression quantification module also outputs the files below. For information on the metrics included, see the section Quantification and RNA QC Metrics.

<outputPrefix>.quant.genes.sf—Contains quantification results at the gene level. The results are produced by summing together all transcripts with the same geneID in the annotation file (GTF). Length and EffectiveLength are the expression-weighted means of the individual transcripts in the gene.
<outputPrefix>.quant.metrics.csv—Summary statistics relevant to RNA transcripts and quantification. See Quantification and RNA QC Metrics.
<outputPrefix>.quant.transcript_fragment_lengths.txt —Full fragment length distribution of reads mapped to transcripts, output in length- probability pairs of length minimum through >999 bases. Summing the products of the two columns will yield the average fragment length.
<outputPrefix>.quant.transcript_coverage.txt—Measures coverage uniformity with a normalized average of 5' to 3' coverage pattern along transcripts in increments of 1%. A summation of the 100 coverage bins should yield 100%.
<outputPrefix>.SJ.saturation.txt—Measures sequencing saturation of the library, including the number of unique splice junctions observed as a function of reads processed.

Quantification and RNA QC metrics

The RNA Quantification module outputs metrics related to the gene expression results and more general RNA QC metrics that rely on the transcript-level analysis. A summary of the metrics is output to the <outputPrefix>.quant_metrics.csv file.

Only unfiltered and properly paired reads (for paired-end sequencing) are counted in the above metrics. The seven fragments types that are listed (Forward transcript, Reverse transcript, Strand mismatched, Ambiguous strand, Intron, Intergenic, Unknown transcript) add up to 100% of the counted fragments, and the percentage of this total is provided next to each fragment metric count.

RNA Variant Calling

DRAGEN RNA variant calling uses the DRAGEN Somatic Small Variant Caller to call SNVs and indels. DRAGEN uses somatic variant calling to account for nongermline variant allele frequencies in RNA-seq data caused by differential expression. To perform variant calling, DRAGEN uses a probability model that weighs the evidence of a real variant against evidence for various noise models. If the quality score for a variant exceeds a certain threshold, then the variant is reported in the output VCF with the PASS label. DRAGEN also applies filters, such as weak_evidence and base_quality, that might indicate if the variant does not reach the thresholds required to qualify as a passing call. For more information on DRAGEN DNA somatic variant calling, see Somatic Mode.

DRAGEN RNA also supports forced genotyping (ForceGT). A ForceGT VCF that contains variants of interest can be provided to DRAGEN RNA VC, and the output VCF will contain all variants from the input with annotation. ForceGT might be unable to accurately call complex variants or variants with long deletions (> 50 bp). Complex variants are variants that require more than one substitution, insertion, or deletion event to transform the REF allele into the ALT allele.

DRAGEN RNA does not attempt to accurately genotype variants as heterozygous or homozygous (since it uses the DRAGEN somatic caller and somatic variants do not normally fall into those classes). Instead, a heuristic is applied based on the variant allele frequency: if the AF is at least 85%, then the GT field will be set to 1/1. Otherwise GT will always be reported as 0/1. This behavior and threshold can be adjusted with the following options:

Input Options

You can use a FASTQ, BAM, or CRAM file as input. Optionally, you can provide a GTF annotation file for more accurate split junction mapping.

Use the following command line options for FASTQ input files.

Use the following command line options for a list of FASTQ input files.

Use the following command line options for a BAM input file.

Run RNA Variant Calling

To enable RNA variant calling, set --enable-rna and --enable-variant-caller to true. To enable ForceGT, use --vc-forcegt-vcf <forcegt_vcf_file>.

RNA variant calling outputs a VCF file that includes PASS variants and variants that did not pass, due to filters or weak evidence. For more information on filters and additional command line options, see Somatic Mode.

(NOTE: gVCF mode is not supported with RNA variant calling.)

The following is an example RNA variant calling command line.

Running RNA Variant Calling with other RNA workflows (e.g. RNA quantification and RNA fusion calling)

RNA quantification and/or fusion calling can be performed along with RNA variant calling by adding the appropriate option(s) in addition to --enable-rna=true and --enable-variant-caller=true.

Options:

RNA quantification: --enable-rna-quantification=true
RNA gene fusion calling: --enable-rna-gene-fusion=true

For more information and options related to RNA quantification and fusion calling, see those sections of the user guide.

Splice Variant Caller

The identification of alternatively spliced isoforms (using their constitutive splice variants) and their functional effects is of high importance in the study of genetic variation and diseases, including cancer and neurological disorders. The main types of alternative splicing events resulting in splice variants are:

Exon skipping
Intron retention
Mutually exclusive exons
Alternative 5' splice site
Alternative 3' splice site

Running Splice Variant Caller

When enabled with the --enable-rna-splice-variant=true option added to an RNA Map/Align job, DRAGEN runs a Splice Variant caller by taking advantage of its fast and highly accurate splice-aware read mapper/aligner that aligns to the whole genome to identify novel alternative Splice Junction (SJ) candidates. These candidates can be filtered by additional information provided such as a "normals list" and a "target regions list", or whitelisted with a "knowns" list.

Next during the read sorting phase, evidence for these alternative splices candidates (alts) vs. reference splicing are accumulated. Finally, each of the candidates are scored based on the accumulated read evidence and the results are written to TSV and VCF files for downstream tertiary analysis.

Following is an example command line.

Splice Variant Optional Input Files

In addition to the required inputs listed in the above example (i.e. paired fastq reads, reference hashtable, and annotation), the following 3 optional input resource files can be provided to help provide better precision by reducing FP count.

Normals List

A list of Normal splice variants that will be filtered out of the final output (i.e. operating as a blacklist), as long as they are not in the "knowns" list, using the "--rna-splice-variant-normals" option.

The format of this file should be a tab separated file in the same format as the SJ.out.tab, except only the first 4 columns are used, i.e.

contig name
first base of the splice junction (1-based)
last base of the splice junction (1-based)
strand (0: undefined, 1: +, 2: -)

To create a Normals list file, a collection of DRAGEN RNA mapper output SJ.out.tab files for at least 30 samples can be used along with a simple script to process all the SJs in these files. The pseudo code block below describes the function of this script:

Knowns List

A list of known splice variants that are exempt from being filtered out of the final output (i.e. operating as a whitelist), using the "--rna-splice-variant-knowns" option. The format of the file should be a tab separated file in the same format as the SJ.out.tab with 9 columns present, except only the first 4 columns are evaluated, i.e.

contig name
first base of the splice junction (1-based)
last base of the splice junction (1-based)
strand (0: undefined, 1: +, 2: -)

By default, the caller will not consider any splice variant candidates that are found in the input annotation file since it is looking for denovo variants, unless it is included in the knowns list which directs it not to discard the specified candidate. Note that some newer gene annotation models have added alt transcripts that contain clinically relevant splice variants, which causes the DRAGEN to skip reporting them.

To ensure these are reported, the user may want to pass these in with a knowns file containing these common variants if they are found in the annotation that is used. An example is shown below using hg38 coordinates specifying the MET exon 14 skip, EGFRv3, and ARv7 alt splicing events, respectively.

Target Regions BED

A list of regions that called splice variants must fall within using the "--rna-splice-variant-regions" option. Any splice variant candidates will be excluded if they are not within these regions.

This file should be in BED file format with the following info, except that the regions are 1-based

chromosome id
start position (1-based)
end position (1-based)
region (i.e. gene) name

Splice Variant Output Files

The detected splice variants are output as two separate TSV files for the intragenic and intergenic candidates, and as a VCF for the intragenic candidates.

The following categories are used when accumulating read counts for each alt SJ candidate:

DedupUniqueSupportingReads - Non-duplicate marked reads that are unique and precisely align with the SJ
DupUniqueSupportingReads - Duplicate marked reads that are unique and precisely align with the SJ
DedupUniqueNonsupportingReads - Non-duplicate marked reads that are unique but don't support the splice variant
DupUniqueNonsupportingReads - Duplicate marked reads that are unique but don't support the splice variant

To be counted, a paired read alignment:

must be primary and properly paired
must contain a splice junction (i.e. an alignment gap in the CIGAR containing skip ops)
must have overhangs on either side of the skip that are at least 6 bases
considered to be "unique" only if NH=1 and the MAPQ > 35

Splice Variant TSV Files

These two output files are named:

.splice_variants.tsv which contains the intragenic alt splice junctions that result in transcript variants
.splice_variant_fusions.tsv which contains the intergenic alt splice junctions that result fusions across genes

Each detected splice junction contains the following columns:

gene_start - Gene name(s) at the start of the SJ. Multiple genes are separated by a semicolon
gene_end - Gene name(s) at the end of the SJ. Multiple genes are separated by a semicolon
chromosome - Chromosome containing the SJ
start - SJ's start position (1-based genomic coordinate)
end - SJ's start position (1-based genomic coordinate)
strand - Detected strand for the SJ (1: +, 2: -)
motif - intron motif
annotated - True if annotated, otherwise False
split_unique_reads_ref - DedupUniqueNonsupportingReads count that support reference
split_total_reads_ref - DupUniqueNonsupportingReads + DedupUniqueNonsupportingReads count that support reference
split_unique_reads_alt DedupUniqueSupportingReads count that support variant
split_total_reads_alt - DupUniqueSupportingReads + DedupUniqueSupportingReads count that support variant
max_spliced_alignment_overhang - maximum spliced alignment overhang from all supporting reads
score - The splice junction variant score (ranging from 0.0 to 1.0). Currently, this is just a linear function of the number of split_unique_reads divided by 10, i.e. equals MIN(1.0, split_unique_reads_alt/10)

Note:

In the intragenic output file containing transcript variant splice junctions, the gene_start and gene_end columns must match.
In the intergenic output file containing fusions from splice junctions, the gene_start and gene_end columns must be different.

Splice Variant VCF File

This file contains the detected intra-genic splice junction variants that are not filtered out, and are written into a zipped VCF file titled .splice_variants.vcf.gz, where each splice variant candidate is written as a one-line VCF record containing the fields below:

CHROM - Chromosome of the splice
POS - SJ start position (1-based) i.e. first base of intron
ID - "." (unused)
REF - Base from the reference genome FASTA at the SJ start position
ALT - "<DEL>"
QUAL - The junction score from 0.0 - 1.0
FILTER - Semicolon separated list of filters: LowQ and LowUniqueAlignment
INFO - See the possible "Info fields below"
FORMAT - AD:DP
SAMPLE - Counts for {DedupUniqueSupportingReads}:{DedupUniqueNonsupportingReads}

The following lines of the VCF header describe columns 5 to 10 (last 6 columns)

Note on Filter Thresholds

The passing thresholds for the LowQ and LowUniqueAlignments filters are fixed to the settings below.

Output Merged with Fusion Caller

When the splice variant caller and gene fusion caller are both enabled, the passing and failed intergenic fusion SJ's will also be merged into the relevant fusion output TSV files.

The passing calls get added to the fusion caller's .fusion_candidates.final file. The tab separated fields are described below.

The failing calls get added into the fusion caller's .fusion_candidates.filter_info output file. The output fields are the same as described above for the "final" output file, with the addition of the FILTER_INFO field in the first column. The value in this field will be "RSV_FILTER:" followed by the specific filters that are not passing, as described in the table below.

RNA Alignment

Alignment Output

BAM

The output BAM file meets the SAM specification and is compatible with downstream RNA-Seq analysis tools.

RNA-Seq BAM Tags

The following BAM tags are emitted alongside spliced alignments.

Tag

Description

If a gene annotations file is used during the map/align stage, and the splice junction is detected as an annotated junction, then 20 is added to its motif value.

Compatibility with Cufflinks

'.' (undefined), '+' (forward strand), '-' (reverse strand), or '*' (ambiguous).

SJ.out.tab

Along with the alignments emitted in the SAM/BAM file, an additional SJ.out.tab file summarizes the high confidence splice junctions in a tab-delimited file. The columns for this file are as follows:

contig name
first base of the splice junction (1-based)
last base of the splice junction (1-based)strand (0: undefined, 1: +, 2:-)
strand (0: undefined, 1: +, 2: -)
intron motif: 0: noncanonical, 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT
0: unannotated, 1: annotated, only if an input gene annotations file was used
number of uniquely mapping reads spanning the splice junction
number of multimapping reads spanning the splice junction
maximum spliced alignment overhang

A splice junction entry in the SJ.out.tab file is filtered out if any of these conditions are met:

SJ is a noncanonical motif and is only supported by < 3 unique mappings.
SJ of length > 50000 and is only supported by < 2 unique mappings.
SJ of length > 100000 and is only supported by < 3 unique mappings.
SJ of length > 200000 and is only supported by < 4 unique mappings.
SJ is a noncanonical motif and the maximum spliced alignment overhang is < 30.
SJ is a canonical motif and the maximum spliced alignment overhang is < 12.

Note that the filter does not apply to the alignments present in the BAM or SAM file.

Chimeric.out.junction File

Chromosome of the donor.
First base of the intron of the donor (1-based).
Strand of the donor.
Chromosome of the acceptor.
First base of the intron of the acceptor (1-based).
Strand of the acceptor.
N/A---not used, but is present to be compatible with other tools. It will always be 1.
N/A---not used, but is present to be compatible with other tools. It will always be *.
N/A---not used, but is present to be compatible with other tools. It will always be *.
Read name.
First base of the first segment, on the + strand.
CIGAR of the first segment.
First base of the second segment.
CIGAR of the second segment.

chr19 580462 + chr12 120876182 + 1 * * R_15448 571532 49M8799N26M8p49M26S 120876183 49H26M
chr19 580462 + chr12 120876182 + 1 * * R_15459 571552 29M8799N46M8p29M46S 120876183 29H46M

R_15448:1   99    chr19   571531      60  49M8799N26M  =       580413
R_15448:2   147   chr19   580413      60  49M26S       =       571531
R_15448:2   2193  chr12   120876182   15  49H26M       chr19   571531

R_15459:1   99    chr19   571551      60  29M8799N46M  =       580433
R_15459:2   147   chr19   580433      4   29M46S       =       571551
R_15459:2   2193  chr12   120876182   15  29H46M       chr19   571551

Mapping Metrics

Filtered rRNA reads---Total number of ribosomal RNA reads that are filtered out with the --rrna-filter-enable option.
Mitochondrial reads excluded---Total number of reads detected to be in ChrM if the --rna-mapping-metrics-exclude-chrm option is enabled.
Mapped reads adjusted for filtered mapping---Adjusted count of mapped reads by adding in the filtered rRNA reads.
Mapped reads adjusted for excluded mapping---Adjusted count of mapped reads by adding in the excluded mitocondrial reads.
Mapped reads adjusted for filtered and excluded mapping---Adjusted count of mapped reads by adding in both the filtered rRNA and excluded mitocondrial reads.
Unmapped reads adjusted for filtered mapping---Adjusted count of unmapped reads by subtracting out the filtered rRNA reads.
Unmapped reads adjusted for excluded mapping---Adjusted count of unmapped reads by subtracting out the excluded mitocondrial reads.
Unmapped reads adjusted for filtered and excluded mapping---Adjusted count of unmapped reads by subtracting out both the filtered rRNA and the excluded mitocondrial reads.
Reads with splice junction---Total number of reads that included a spliced alignment that crosses an intron

RNA Alignment Options

The aligner stage of the RNA spliced aligner uses Smith-Waterman Alignment Scoring options and Splicing Scoring Options.

Smith-Waterman Alignment Scoring Options

--Aligner.intron-motif12-pen The --Aligner.intron-motif12-pen option controls the penalty for canonical motifs 1/2 (GT/AG, CT/AC). The default value calculated by the host software is 1 * (match-score + mismatch-pen).
--Aligner.intron-motif34-pen The --Aligner.intron-motif34-pen option controls the penalty for canonical motifs 3/4 (GC/AG, CT/GC). The default value calculated by the host software is 3 * (match-score + mismatch-pen).
--Aligner.intron-motif56-pen The --Aligner.intron-motif56-pen option controls the penalty for canonical motifs 5/6 (AT/AC, GT/AT). The default value calculated by the host software is 4 * (match-score + mismatch-pen).
--Aligner.intron-motif0-pen The --Aligner.intron-motif0-pen option controls the penalty for noncanonical motifs. The default value calculated by the host software is 6 * (match-score + mismatch-pen).

Splicing Scoring Options

--Mapper.min-intron-bases For RNA-Seq mapping, a reference alignment gap can be interpreted as a deletion or an intron. In the absence of an annotated splice junction, the min-intron-bases option is a threshold gap length separating this distinction. Reference gaps at least this long are interpreted and scored as introns, and shorter reference gaps are interpreted and scored as deletions. However, alignments can be returned with annotated splice junctions shorter than this threshold.
--Mapper.max-intron-bases The max-intron-bases option controls the largest possible intron that is reported, which useful for preventing false splice junctions that would otherwise be reported. Set this option to a value that is suitable to the species you are mapping against.
--Mapper.ann-sj-max-indel For RNA-seq, seed mapping can discover a reference gap in the position of an annotated intron, but with slightly different length. If the length difference does not exceed this option, the mapper investigates the possibility that the intron is present exactly as annotated, but an indel on one side or the other near the splice junction explains the length difference. Indels longer than this option and very near annotated splice junctions are not likely to be detected. Higher values may increase mapping time and false detections.

Duplicate Marking

Downsampling

Ribosomal RNA filtering

NOTE: rRNA filtering is not supported with chm13-based references and it will be automatically disabled.

Command-Line Options

The following are the required command-line options for rRNA filtering.

--rrna-filter-enable=true--Enables rRNA filtering. Set to true to enable rRNA filtering. The default value is false.
--rrna-filter-contig--Specify the name of the rRNA sequences to use for filtering. If you do not specify a value, the default gl000220 is provided for human genome alignments by using the reference autodetect feature. gl000220 is an unplaced contig included in hg19 and hg38 genomes, which include a full copy of the rRNA repeat. For other genomes, you must include a rRNA decoy contig when creating a hash table.

Output Files

PolyA trimming

PolyA trimming by read orientation

MAPQ Scoring

Gene Fusion Detection

Running DRAGEN Gene Fusion

The following is an example command line for running an end-to-end RNA-Seq experiment.

At the end of a run, a summary of detected gene fusion events is output, which is similar to the following example.

Gene Fusion Output and Filters

The <OUTPUT_PREFIX>.fusion_candidates.features.csv file lists the detected gene fusion events. The output CSV file includes the following columns.

#FusionGene: Parent gene names (in 5' to 3' order of transcript) participating in the fusion; hereafter refer red to as Gene 1 and Gene 2. If a fusion breakpoint overlaps multiple genes, the genes are listed by default as separate candidates (rows). To show them as a semi-colon separated gene list on the same row, the option --rna-gf-merge-calls can be set to true as described in the Gene Fusion Options and Filters section.
Score: Fusion call confidence score predicted by the ML model. If the ML model is used, the score can be 0 (low confidence) to 1 (high-confidence call). Currently the ML model only supports human references. In the case an ML model is not available, the number of supporting reads will be reported as the score.
LeftBreakpoint: Gene 1 breakpoint formatted as <Chromosome>:<Position>:<Strand>.
RightBreakpoint: Gene 2 breakpoint formatted as <Chromosome>:<Position>:<Strand>.
Filter: Semicolon separated list of filter flags. The LOW_SCORE filter is used to filter low confidence fusion candidates. If --rna-gf-enable-post-filtering=true, other confidence filters will also be applied. Informative filters, on the other hand, do not fail the fusion. In the absence of the ML model scoring (i.e. a non-human reference is used), a more aggressive post-filtering will take place and all confidence and informative filters will be applied. The following are the available filters.

Filter

Type

Description

Option to set threshold

Note that the specific features and column values are subject to change in future DRAGEN versions as more RNA data is analyzed.

#SplitScore: Combined count of fusion supporting read pairs reported as split reads and soft-clipped reads
#NumSplitReads: Number of fusion supporting read pairs with at least one split read alignment.
#NumSoftClippedReads: Number of fusion supporting read pairs with no split read alignment, but at least one soft clipped alignment. Included in SplitScore and includes soft-clipped reads for both Gene1 and Gene2
#NumSoftClippedReadsGene1: Number of fusion supporting read pairs with no split read alignment, but at least one soft clipped alignment to Gene 1
#NumSoftClippedReadsGene2: See above (NumSoftClippedReadsGene1) for Gene 2
#NumPairedReads: Number of fusion supporting read pairs such that one of the reads maps to Gene1 and the other maps to Gene2, without any breakpoint overlap
#NumRefSplitReadsGene1: Number of read pairs that map fully within Gene 1 such that at least one of the reads aligns across the breakpoint. These reads support the reference transcript and do not support the fusion.
#NumRefSplitReadsGene2: See above (NumRefSplitReadsGene1) for Gene 2
#NumRefPairedReadsGene1: Number of read pairs such that one of the reads maps on the left side of the Gene1 breakpoint and the other maps on the right side of the Gene1 breakpoint, without overlapping the break. These reads support the reference transcript and do not support the fusion.
#NumRefPairedReadsGene2: See above (NumRefPairedReadsGene1) for Gene 2
#AltToRef-- Ratio of (fusion split + soft clipped reads) / max(NumRefSplitReadsGene1, NumRefSplitReadsGene2); used for the LOW_ALT_TO_REF filter
#UniqueAlignmentsGene1: Unique (start-end) positions of fusion supporting read alignments to Gene 1 (after dedup); used for the LOW_UNIQUE_ALIGNMENTS filter
#UniqueAlignmentsGene2: Unique (start-end) positions of fusion supporting read alignments to Gene 2 (after dedup); used for the LOW_UNIQUE_ALIGNMENTS filter
#MaxMapqGene1: Maximum MAPQ for fusion supporting reads in Gene 1
#AvgMapqGene1: Average MAPQ for fusion supporting reads in Gene 1
#MaxMapqGene2: Maximum MAPQ for fusion supporting reads in Gene 2
#AvgMapqGene2: Average MAPQ for fusion supporting reads in Gene 2
#CoverageBasesGene1: Bases in Gene 1 with read coverage within a certain distance (default 1000 bp) of the breakpoint in the direction of the breakpoint strand which is part of the fusion transcript
#CoverageBasesGene2: See above (CoverageBasesGene1) for Gene 2
#DeltaExonBoundaryGene1: Distance from the Gene 1 breakpoint for the closest fusion supporting alignment (higher distance to boundary lowers score)
#DeltaExonBoundaryGene2: See above (DeltaExonBoundaryGene1) for Gene 2
#IsRestrictedGene1: Indicator variable of whether the Gene 1 is tagged as protein coding in the annotation file
#IsRestrictedGene2: Indicator variable of whether the Gene 2 is tagged as protein coding in the annotation file
#IsEnrichedGene1: If enrichment or amplicon assay, then indicates whether Gene 1 is enriched. If whole transcriptome sequencing, then set to 1
#IsEnrichedGene2: See above (IsEnrichedGene1) for Gene 2
#CisDistance: Distance between breakpoints if they are adjacent to each other and on the same strand. Large value (100M) if not a CIS break; used for the READ_THROUGH filter.
#BreakpointDistance: Distance between breakpoints if they are adjacent. Large value (100M) if not within same chromosome
#GenePairHomologyEval: E-value of pairwise BLAST alignment of the parent genes
#AnchorLength1: Longest alignment of a fusion supporting read to Gene 1
#AnchorLength2: Longest alignment of a fusion supporting read to Gene 2
#FusionLengthGene1: Distance from breakpoint to the end of Gene 1
#FusionLengthGene2: Distance from breakpoint to the end of Gene 2
#NonFusionLengthGene1: Breakpoint distance to the end of transcript not part of the fusion for Gene 1
#NonFusionLengthGene2: Breakpoint distance to the end of transcript not part of the fusion for Gene 2
#Gene1Id: Gene ID reported in the annotation file for Gene 1
#Gene2Id: Gene ID reported in the annotation file for Gene 2
#Gene1Location:
- IntactExon: Breakpoint matches exon boundary,
- BrokenExon: Breakpoint is within an exon but does not match the exon boundary,
- Intron: Breakpoint is within an intron,
- Intergenic: Breakpoint does not overlap any gene
#Gene2Location: See above (Gene1Location) for Gene 2
#Gene1Sense: True if the Gene 1 5' to 3' direction matches the breakpoint order, indicating that the gene is the upstream gene in the fusion transcript
#Gene2Sense: See above (Gene1Sense) for Gene 2

#AdditionalGenes1: If a mix of passing and failing candidates are reported for the same breakpoint of Gene 1, genes with only failing candidates are listed. If no passing candidate exists, then all overlapping genes are reported in the #FusionGene column.
#AdditionalGenes2: See above (AdditionalGenes1) for Gene 2

Gene Fusion Options and Filters

The following thresholds and options may be used to configure the fusion caller:

--rna-gf-blast-pairs A tab separated file listing gene pairs that have a high level of similarity. The first and second column are the gene names, and the third column is the e-score. This list of gene pairs is used as a homology filter to reduce false positives. For runs on human genome assemblies GRCH38 and hg19, DRAGEN automatically applies a default file generated using annotations for primary chromosomes if no other file is specified using the command-line.
--rna-gf-enriched-genes For RNA enrichment assays, a list of targeted genes specified as one gene-name per line. Only fusion calls involving at least one gene on the list are reported. The enriched genes list should only contain genes listed in the input annotation file. This option cannot be provided together with --rna-gf-enriched-regions. If RNA amplicon mode is enabled and the amplicon bed file already includes the gene name, then you do not need to set this option; DRAGEN will read the enriched genes names from the amplicon BED file (fifth column). See DRAGEN Amplicon Pipeline for further information.
--rna-gf-enriched-regions Alternative to --rna-gf-enriched-genes, but input is provided as a bed-file with regions coordinates instead of a gene list. All the genes in the provided annotation file that overlap such regions are included. Genes that are extracted in this way are summarized in output in the *.fusion.enriched_genes.txt file. This option cannot be provided together with --rna-gf-enriched-genes.
--rna-repeat-genes Text file that contains the names or IDs (from the annotation file) of targeted repetitive genes for sensitive fusion detection. Exclusive from --rna-repeat-intervals. This option overrides the default BED file. The repeat genes list should only contain genes listed in the input annotation file.
--rna-repeat-intervals BED file that contains a target list of repeat intervals for sensitive fusion detection. Exclusive from --rna-repeat-genes. This option overrides the default files, which contain the genes CIC, DUX4, PSPH, and SEPTIN14 for GRCh38 and hg19 reference genomes.
--enable-variant-annotation=true, --variant-annotation-assembly, and --variant-annotation-data Enable Illumina Annotation Engine (IAE) to report fusion annotations in JSON format. --enable-variant-annotation must be set to true. For more information, see Illumina Annotation Engine.
--rna-gf-restrict-genes When parsing the gene annotations file for use in the DRAGEN Gene Fusion module, you can use this option to restrict the entries of interest to only protein-coding regions. Restricting the annotation to only the protein-coding genes reduces false positive rates in currently studied fusion events. To report non-coding gene fusions such as pseudo genes and lincRNAs, turn off this option. The default value is true.
--rna-gf-merge-calls If multiple genes overlap a fusion breakpoint, DRAGEN generates and scores a separate fusion candidate for each gene pair overlapping the breakpoint. The default value is false so that each reported fusion event only has one left and right gene in the fusion, and overlapping genes are output as separate events.
--rna-gf-allow-overlapping-genes Allows for fusion calls between overlapping genes. The default value is "false".
--rna-gf-enable-post-filters Enable post-filtering of RNA gene fusion candidates by confidence flags. The filter flags are listed in the table above. The default value is "false".
--enable-rna-amplicon A separate fusion filtering model is trained for RNA amplicon mode. Duplicate removal for fusion supporting reads is disabled for RNA amplicon mode and both genes are required to be in the list of enriched genes. By default, the DRAGEN fusion caller filters candidates if a breakpoint overlaps both transcripts (e.g. fusions such as FIP1L1--PDGFRA and GOPC--ROS1). In RNA amplicon mode, such candidates are not filtered. See DRAGEN Amplicon Pipeline for further information. The default is "false".
--rna-gf-sv-vcf Structural Variant VCF file output from DRAGEN DNA structural variant caller run in somatic mode. See the next section for more information.

Running RNA fusion detection with a somatic SV evidence

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
-a <ANNOTATION_FILE> \
--output-dir <OUT_DIRECTORY> \
--output-file-prefix <OUTPUT_PREFIX> \
--RGID <READ_GROUP_ID> \
--RGSM <SAMPLE_NAME> \
--enable-rna true \
--enable-rna-gene-fusion true
--rna-gf-sv-vcf <SV_VCF_PATH>

When the SV VCF input is provided to the RNA fusion caller, the following additional features will be reported in the features.csv output file:

#SvEvent: A semi-colon separated string representation of SV events matching the fusion candidate.
#SvType: A semi-colon separated list of type of the matching SV events.
#SomaticScore: The highest SomaticScore value of the matching SV events.
#SvDistance: The maximum distance between any SV breakpoint to any fusion breakpoints (if multiple matching SV events, then minimum of all maximum distances over all SV events).
#LeftSvDistance: The distance between the left fusion breakpoint and the corresponding SV breakpoint (if multiple matching SV events, then minimum over all SV events).
#RightSvDistance: The distance between the right fusion breakpoint and the corresponding SV breakpoint (if multiple matching SV events, then minimum over all SV events).
#SvPresent: Set to 1 if matching SV event is present, otherwise 0.
#SvAbsent: Set to 1 if no matching SV event is present, otherwise 0.