The DRAGEN RNA pipeline uses the DRAGEN RNA-Seq spliced aligner. Mapping of short seed sequences from RNA-Seq reads is performed similarly to mapping DNA reads. In addition, splice junctions (the joining of noncontiguous exons in RNA transcripts) near the mapped seeds are detected and incorporated into the full read alignments.
The output files generated when running DRAGEN in RNA mode are similar to those generated in DNA mode. RNA mode also produces extra information related to spliced alignments. Details regarding the splice junctions are present both in the SAM alignment record and an additional file, the SJ.out.tab file.
The output BAM file meets the SAM specification and is compatible with downstream RNA-Seq analysis tools.
RNA-Seq BAM Tags
The following BAM tags are emitted alongside spliced alignments.
Tag | Description |
---|---|
If a gene annotations file is used during the map/align stage, and the splice junction is detected as an annotated junction, then 20 is added to its motif value.
Cufflinks might require spliced alignments to emit the XS:A
strand tag. This tag is present in the SAM record if the alignment contains a splice junction. The possible values for XS:A
strand tag are as follows:
'.' (undefined), '+' (forward strand), '-' (reverse strand), or '*' (ambiguous).
By default, if the spliced alignment has an undefined strand or an ambiguous (conflicting) strand, then the alignment output is suppressed. These alignments can be output into the output alignment file by setting the --no-ambig-strand
option to 1.
Cufflinks also expects that the MAPQ for a uniquely mapped read is a single value. This value is specified by the --rna‑mapq-unique
option. To force all uniquely mapped reads to have a MAPQ equal to this value, set ‑‑rna‑mapq‑unique
to a nonzero value.
Along with the alignments emitted in the SAM/BAM file, an additional SJ.out.tab file summarizes the high confidence splice junctions in a tab-delimited file. The columns for this file are as follows:
contig name
first base of the splice junction (1-based)
last base of the splice junction (1-based)strand (0: undefined, 1: +, 2:-)
strand (0: undefined, 1: +, 2: -)
intron motif: 0: noncanonical, 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT
0: unannotated, 1: annotated, only if an input gene annotations file was used
number of uniquely mapping reads spanning the splice junction
number of multimapping reads spanning the splice junction
maximum spliced alignment overhang
The maximum spliced alignment overhang (column 9) field in the SJ.out.tab file is the anchoring alignment overhang. For example, if a read is spliced as ACGTACGT------------ACGT
, then the overhang is 4. For the same splice junction, across all reads that span this junction, the maximum overhang is reported. The maximum overhang is a confidence indicator that the splice junction is correct based on anchoring alignments.
There are two SJ.out.tab files generated by the DRAGEN host software, an unfiltered version and a filtered version. The records in the unfiltered file are a consolidation of all spliced alignment records from the output SAM/BAM. However, the filtered version has a much higher confidence for being correct due to the use of the following filters.
A splice junction entry in the SJ.out.tab file is filtered out if any of these conditions are met:
SJ is a noncanonical motif and is only supported by < 3 unique mappings.
SJ of length > 50000 and is only supported by < 2 unique mappings.
SJ of length > 100000 and is only supported by < 3 unique mappings.
SJ of length > 200000 and is only supported by < 4 unique mappings.
SJ is a noncanonical motif and the maximum spliced alignment overhang is < 30.
SJ is a canonical motif and the maximum spliced alignment overhang is < 12.
The filtered SJ.out.tab is recommended for use with any downstream analysis or post processing tools. Alternatively, you can use the unfiltered SJ.out.tab and apply your own filters (for example, with basic awk commands).
Note that the filter does not apply to the alignments present in the BAM or SAM file.
If there are chimeric alignments present in the sample, then a supplementary Chimeric.out.junction file is also output. This file contains information about split-reads that can be used to perform downstream gene fusion detection. Each line contains one chimerically aligned read. The columns of the file are as follows:
Chromosome of the donor.
First base of the intron of the donor (1-based).
Strand of the donor.
Chromosome of the acceptor.
First base of the intron of the acceptor (1-based).
Strand of the acceptor.
N/A---not used, but is present to be compatible with other tools. It will always be 1
.
N/A---not used, but is present to be compatible with other tools. It will always be *
.
N/A---not used, but is present to be compatible with other tools. It will always be *
.
Read name.
First base of the first segment, on the + strand.
CIGAR of the first segment.
First base of the second segment.
CIGAR of the second segment.
CIGARs in this file follow the standard CIGAR operations as found in the SAM specification, with the addition of a gap length L that is encoded with the operation p. For paired end reads, the sequence of the second mate is always reverse complemented before determining strandedness.
The following is an example entry that shows two chimerically aligned read pairs, in which one of the mates is split, mapping segments of chr19 to chr12. Also shown are the corresponding SAM records associated with these entries.
The RNA Pipeline reports summary and per read group statistics pertaining to read mapping in the mapping_metrics.csv
file. The majority of the matrics are as described in the DNA mapping metrics section, but the metrics that are specific to RNA-seq are listed below.
Filtered rRNA reads---Total number of ribosomal RNA reads that are filtered out with the --rrna-filter-enable
option.
Mitochondrial reads excluded---Total number of reads detected to be in ChrM if the --rna-mapping-metrics-exclude-chrm
option is enabled.
Mapped reads adjusted for filtered mapping---Adjusted count of mapped reads by adding in the filtered rRNA reads.
Mapped reads adjusted for excluded mapping---Adjusted count of mapped reads by adding in the excluded mitocondrial reads.
Mapped reads adjusted for filtered and excluded mapping---Adjusted count of mapped reads by adding in both the filtered rRNA and excluded mitocondrial reads.
Unmapped reads adjusted for filtered mapping---Adjusted count of unmapped reads by subtracting out the filtered rRNA reads.
Unmapped reads adjusted for excluded mapping---Adjusted count of unmapped reads by subtracting out the excluded mitocondrial reads.
Unmapped reads adjusted for filtered and excluded mapping---Adjusted count of unmapped reads by subtracting out both the filtered rRNA and the excluded mitocondrial reads.
Reads with splice junction---Total number of reads that included a spliced alignment that crosses an intron
The aligner stage of the RNA spliced aligner uses Smith-Waterman Alignment Scoring options and Splicing Scoring Options.
Refer to [Smith-Waterman Alignment Scoring Settings]{.underline} for more details about the alignment algorithm used within DRAGEN. The following scoring options are specific to the processing of canonical and noncanonical motifs within introns.
--Aligner.intron-motif12-pen
The --Aligner.intron-motif12-pen
option controls the penalty for canonical motifs 1/2 (GT/AG, CT/AC). The default value calculated by the host software is 1 * (match-score + mismatch-pen)
.
--Aligner.intron-motif34-pen
The --Aligner.intron-motif34-pen
option controls the penalty for canonical motifs 3/4 (GC/AG, CT/GC). The default value calculated by the host software is 3 * (match-score + mismatch-pen)
.
--Aligner.intron-motif56-pen
The --Aligner.intron-motif56-pen
option controls the penalty for canonical motifs 5/6 (AT/AC, GT/AT). The default value calculated by the host software is 4 * (match-score + mismatch-pen)
.
--Aligner.intron-motif0-pen
The --Aligner.intron-motif0-pen
option controls the penalty for noncanonical motifs. The default value calculated by the host software is 6 * (match-score + mismatch-pen)
.
--Mapper.min-intron-bases
For RNA-Seq mapping, a reference alignment gap can be interpreted as a deletion or an intron. In the absence of an annotated splice junction, the min-intron-bases option is a threshold gap length separating this distinction. Reference gaps at least this long are interpreted and scored as introns, and shorter reference gaps are interpreted and scored as deletions. However, alignments can be returned with annotated splice junctions shorter than this threshold.
--Mapper.max-intron-bases
The max-intron-bases option controls the largest possible intron that is reported, which useful for preventing false splice junctions that would otherwise be reported. Set this option to a value that is suitable to the species you are mapping against.
--Mapper.ann-sj-max-indel
For RNA-seq, seed mapping can discover a reference gap in the position of an annotated intron, but with slightly different length. If the length difference does not exceed this option, the mapper investigates the possibility that the intron is present exactly as annotated, but an indel on one side or the other near the splice junction explains the length difference. Indels longer than this option and very near annotated splice junctions are not likely to be detected. Higher values may increase mapping time and false detections.
DRAGEN RNA can detect duplicate reads, which are defined as fragments that have both ends mapping to the same (clipping-adjusted) position during alignment. In RNA-Seq data, the reads can represent PCR duplicates during library prep or as a result from deep coverage of highly expressed regions.
If --enable-duplicate-marking
is set to true, duplicate fragments are marked in the BAM file and the total number of duplicate reads is reported as a mapping metric. Marking of duplicates does not affect gene expression quantification and gene fusion calling.
DRAGEN RNA also supports internal downsampling, which is a process by which a random sub-sample of reads is selected from the dataset after trimming and alignment for downstream analysis. In RNA-Seq, this can be useful in two ways - it can speed up analysis of samples with excessively high coverage, and it can allow for more accurate cross-comparisons between different samples.
If --enable-down-sampler
is set to true and a value specified for --down-sampler-reads
, DRAGEN will use only that many RNA fragments (including both Read 1 and Read 2) for quantification, fusion, variant calling and output to BAM. Please note the the entire input dataset is still used for the generation of trimming, fastqc, and mapping metrics.
Ribosomal RNA (rRNA) sequences can contribute a large fraction of reads in some RNA-Seq datasets, depending on the sample type and library prep method. You can use the DRAGEN RNA pipeline to filter rRNA reads during alignment, because the reads are not relevant for downstream analysis. By filtering rRNA, you can reduce run time and file size and avoid deep read alignment pile ups at rRNA repeat loci on the genome to make downstream analysis of RNA BAM files easier.
rRNA filtering relies on a decoy contig with the rRNA sequence included in the reference hash table. Any read that maps to the decoy contig, including multimappers, is tagged with rRNA and is not mapped in the output.
NOTE: The rrna filter option only accepts a single contig by default. In the event multiple contigs need to be provided, they can be concatenated using a 1kb N mask between them, and added to the reference FASTA while creating the hash table.
NOTE: rRNA filtering is not supported with chm13
-based references and it will be automatically disabled.
The following are the required command-line options for rRNA filtering.
--rrna-filter-enable=true
--Enables rRNA filtering. Set to true
to enable rRNA filtering. The default value is false
.
--rrna-filter-contig
--Specify the name of the rRNA sequences to use for filtering. If you do not specify a value, the default gl000220
is provided for human genome alignments by using the reference autodetect feature. gl000220
is an unplaced contig included in hg19 and hg38 genomes, which include a full copy of the rRNA repeat. For other genomes, you must include a rRNA decoy contig when creating a hash table.
All rRNA filtering reads are left unaligned in the BAM files and tagged ZS:Z:FLT
. The number and percentage of filtered rRNA reads is reported as a mapper metric Adjustment of reads matching filter contigs
.
PolyA tails may be trimmed by including the settings --read-trimmers polya
or --soft-read-trimmers polyg,polya
(Note: polyg soft trimming is enabled by default). The minimum number of poly-A/poly-T bases required for trimming may be set using --trim-polya-min-trim
. The default threshold is 20 poly-A/poly-T bases. Refer to DNA pipeline read trimmers section for usage of read trimmers options.
The PolyA trimmer determines which end of the reads to trim for poly-A and poly-T sequences based on the library type. For example, for Illumina forward stranded paired reads the trimmer will trim poly-A sequences at 3' end of read 1 and poly-T sequences at 5' end of read 2. If --rna-library-type
is not provided or set to autodetect (A
), the trimmer assumes the library is unstranded and trims poly-A sequences from 3' end of each read and poly-T sequences from 5' end of each read. The option --rna-library-type
is described in the Gene Expression section.
By default, the MAPQ calculation for RNA-Seq is identical to DNA-Seq. The primary contributor to MAPQ calculation is the difference between the best and second-best alignment scores. Therefore, adjusting the alignment scoring parameters impacts the MAPQ estimate. These adjustments are outlined in Smith-Waterman Alignment Scoring Settings.
The --mapq-strict-sjs
option is specific to RNA, and applies where at least one exon segment is aligned confidently, but there is ambiguity regarding possible splice junctions. When this option is set to 0, a higher MAPQ value is returned, expressing confidence that the alignment is at least partially correct. When this option is set to 1, a lower MAPQ value is returned, reflecting the splice junction ambiguity.
Some downstream tools, such as Cufflinks, expect the MAPQ value to be a unique value for all uniquely mapped reads. This value is specified with the --rna-mapq-unique
option. Setting this option to a nonzero value overrides all MAPQ estimates based on alignment score. Instead, all uniquely mapped reads have a MAPQ set to the value of --rna-mapq-unique
. All multimapped reads have a MAPQ value of int(-10*log10(1 ‑ 1/NH)
, where the NH value is the number of hits (primary and secondary alignments) for that read.
DRAGEN includes an RNA-seq (splicing-aware) aligner, as well as RNA specific analysis components for gene expression quantification, gene fusion detection, splice variant calling, and small variant calling. All of these analysis components require the aligner to be enabled.
Most of the functionality and options described in Host Software Options and DNA Mapping also apply to RNA applications. Additional RNA-specific aspects are described in this section.
In addition to the standard input files (reads from fastq or bam, reference genome, etc.), DRAGEN can also take a gene annotations file as input. A gene annotations file aids in the alignment of reads to known splice junctions and is required for gene expression quantification and gene fusion calling.
To specify a gene annotation file, use the -a
(--annotation-file
) command line option. The input file must conform to the GTF/GFF specification (http://uswest.ensembl.org/info/website/upload/gff.html). The file must contain features of type exon, and the record must contain attributes of type gene_id
and transcript_id
. An example of a valid GTF file is shown below.
Similarly, a GFF file can be used. Each exon feature must have as a Parent a transcript identifier that is used to group exons. An example of a valid GFF file is shown below.
NB. For proper handling of genes in the PAR regions of chromosome X and Y, it is required that the gene_id
attribute of all exons of the same gene is distinct between the two chromosomes, in order to distinguish exons within the PAR region of chromosome X from the ones within the PAR region of chromosome Y. That is, it is often the case that the gene_id
of all exons of a transcript from geneA
is equal to gene_id=geneA
in chromosome X, and gene_id=geneA_PAR_Y
in chromosome Y. This allows the GTF/GFF parser and downstream components to discriminate data associated with PAR genes in chromosome X from data associated with the same PAR genes in chromosome Y.
The DRAGEN host software parses the file for exons within the transcripts and produces splice junctions. The following output displays the number of splice junctions detected.
The splice junctions that are detected from the annotation file are also written to *.sjdb.annotations.out.tab
. Splice junctions below a minimum length are excluded, which helps filter annotation artifacts. This minimum annotation splice junction length is controlled by the --rna-ann-sj-min-len
option, which has a default value of 6
.
Note that GFF3 is a different file format from GFF. GFF3 files are not officially supported due to inconsistent contig naming conventions between GENCODE and Ensembl.
For the same reference, GENCODE provides all the attributes necessary for DRAGEN to build a hierarchical structure:
Ensembl has a different notation:
Ensembl uses different notation for contigs (for GRCh38) than GENCODE. Ensembl contigs do not have the "chr" prefix. The contig identifiers in the annotation file must match the DRAGEN reference in use, and by most conventions GRCh38/hg38 contigs are prefixed with "chr".
If necessary, DRAGEN may support GFF3 files that are GENCODE-compatible with the following annotations present in the attributes of each exon record:
For gene: "gene_name" or "name" or "gene" or "gene_id"
For transcript: "transcript_id" or "Parent"
Due to the flexibility of the GFF3 file format, issues may arise as it continues to evolve.
Please be aware that depending on the characteristics of the input file (i.e. read depth and distribution) the second pass using the first pass SJ.out.tab
may take longer than the first pass.
NOTE: Components downstream of aligner like gene expression quantification, gene fusion detection and RNA variant calling require GTF file as the input annotations file and are NOT compatible with two-pass splice-junction alignment mode.
The DRAGEN RNA pipeline contains a gene expression quantification module that estimates the expression of each transcript and gene in an RNA-seq data set. The module first internally translates the genomic mapping of each read (read pair) to the corresponding transcript mappings. Then uses an Expectation-Maximization (EM) algorithm to infer the transcript expression values that best match all the observed reads. The EM algorithm can also model and correct for GC-bias in the reported quantification results.
To enable the quantification module, set the --enable-rna-quantification
option to true
in your current RNA-seq command-line scripts. Additionally, you must provide a gene annotation file (GTF/GFF) that contains the genomic position of all transcripts to quantify. You can specify the GTF/GFF file using the -a
or --annotation-file
option.
Option | Description |
---|
Transcript quantification results are reported in the <outputPrefix>.quant.sf
text file. The file lists results for each transcript. You can use the output file as input for differential gene expression using tools such as tximport and DESeq2.
The following is an example of the file contents:
Field | Description |
---|
The gene expression quantification module also outputs the files below. For information on the metrics included, see the section Quantification and RNA QC Metrics
.
<outputPrefix>.quant.genes.sf
—Contains quantification results at the gene level. The results are produced by summing together all transcripts with the same geneID in the annotation file (GTF). Length and EffectiveLength are the expression-weighted means of the individual transcripts in the gene.
<outputPrefix>.quant.metrics.csv
—Summary statistics relevant to RNA transcripts and quantification. See Quantification and RNA QC Metrics
.
<outputPrefix>.quant.transcript_fragment_lengths.txt
—Full fragment length distribution of reads mapped to transcripts, output in length- probability pairs of length minimum through >999 bases. Summing the products of the two columns will yield the average fragment length.
<outputPrefix>.quant.transcript_coverage.txt
—Measures coverage uniformity with a normalized average of 5' to 3' coverage pattern along transcripts in increments of 1%. A summation of the 100 coverage bins should yield 100%.
<outputPrefix>.SJ.saturation.txt
—Measures sequencing saturation of the library, including the number of unique splice junctions observed as a function of reads processed.
The RNA Quantification module outputs metrics related to the gene expression results and more general RNA QC metrics that rely on the transcript-level analysis. A summary of the metrics is output to the <outputPrefix>.quant_metrics.csv
file.
Only unfiltered and properly paired reads (for paired-end sequencing) are counted in the above metrics. The seven fragments types that are listed (Forward transcript, Reverse transcript, Strand mismatched, Ambiguous strand, Intron, Intergenic, Unknown transcript) add up to 100% of the counted fragments, and the percentage of this total is provided next to each fragment metric count.
The identification of alternatively spliced isoforms (using their constitutive splice variants) and their functional effects is of high importance in the study of genetic variation and diseases, including cancer and neurological disorders. The main types of alternative splicing events resulting in splice variants are:
Exon skipping
Intron retention
Mutually exclusive exons
Alternative 5' splice site
Alternative 3' splice site
When enabled with the --enable-rna-splice-variant=true
option added to an RNA Map/Align job, DRAGEN runs a Splice Variant caller by taking advantage of its fast and highly accurate splice-aware read mapper/aligner that aligns to the whole genome to identify novel alternative Splice Junction (SJ) candidates. These candidates can be filtered by additional information provided such as a "normals list" and a "target regions list", or whitelisted with a "knowns" list.
Next during the read sorting phase, evidence for these alternative splices candidates (alts) vs. reference splicing are accumulated. Finally, each of the candidates are scored based on the accumulated read evidence and the results are written to TSV and VCF files for downstream tertiary analysis.
Following is an example command line.
In addition to the required inputs listed in the above example (i.e. paired fastq reads, reference hashtable, and annotation), the following 3 optional input resource files can be provided to help provide better precision by reducing FP count.
A list of Normal splice variants that will be filtered out of the final output (i.e. operating as a blacklist), as long as they are not in the "knowns" list, using the "--rna-splice-variant-normals" option.
The format of this file should be a tab separated file in the same format as the SJ.out.tab, except only the first 4 columns are used, i.e.
contig name
first base of the splice junction (1-based)
last base of the splice junction (1-based)
strand (0: undefined, 1: +, 2: -)
To create a Normals list file, a collection of DRAGEN RNA mapper output SJ.out.tab files for at least 30 samples can be used along with a simple script to process all the SJs in these files. The pseudo code block below describes the function of this script:
A list of known splice variants that are exempt from being filtered out of the final output (i.e. operating as a whitelist), using the "--rna-splice-variant-knowns" option. The format of the file should be a tab separated file in the same format as the SJ.out.tab with 9 columns present, except only the first 4 columns are evaluated, i.e.
contig name
first base of the splice junction (1-based)
last base of the splice junction (1-based)
strand (0: undefined, 1: +, 2: -)
By default, the caller will not consider any splice variant candidates that are found in the input annotation file since it is looking for denovo variants, unless it is included in the knowns list which directs it not to discard the specified candidate. Note that some newer gene annotation models have added alt transcripts that contain clinically relevant splice variants, which causes the DRAGEN to skip reporting them.
To ensure these are reported, the user may want to pass these in with a knowns file containing these common variants if they are found in the annotation that is used. An example is shown below using hg38 coordinates specifying the MET exon 14 skip, EGFRv3, and ARv7 alt splicing events, respectively.
A list of regions that called splice variants must fall within using the "--rna-splice-variant-regions" option. Any splice variant candidates will be excluded if they are not within these regions.
This file should be in BED file format with the following info, except that the regions are 1-based
chromosome id
start position (1-based)
end position (1-based)
region (i.e. gene) name
The detected splice variants are output as two separate TSV files for the intragenic and intergenic candidates, and as a VCF for the intragenic candidates.
The following categories are used when accumulating read counts for each alt SJ candidate:
DedupUniqueSupportingReads - Non-duplicate marked reads that are unique and precisely align with the SJ
DupUniqueSupportingReads - Duplicate marked reads that are unique and precisely align with the SJ
DedupUniqueNonsupportingReads - Non-duplicate marked reads that are unique but don't support the splice variant
DupUniqueNonsupportingReads - Duplicate marked reads that are unique but don't support the splice variant
To be counted, a paired read alignment:
must be primary and properly paired
must contain a splice junction (i.e. an alignment gap in the CIGAR containing skip ops)
must have overhangs on either side of the skip that are at least 6 bases
considered to be "unique" only if NH=1 and the MAPQ > 35
These two output files are named:
.splice_variants.tsv which contains the intragenic alt splice junctions that result in transcript variants
.splice_variant_fusions.tsv which contains the intergenic alt splice junctions that result fusions across genes
Each detected splice junction contains the following columns:
gene_start - Gene name(s) at the start of the SJ. Multiple genes are separated by a semicolon
gene_end - Gene name(s) at the end of the SJ. Multiple genes are separated by a semicolon
chromosome - Chromosome containing the SJ
start - SJ's start position (1-based genomic coordinate)
end - SJ's start position (1-based genomic coordinate)
strand - Detected strand for the SJ (1: +, 2: -)
motif - intron motif
annotated - True if annotated, otherwise False
split_unique_reads_ref - DedupUniqueNonsupportingReads count that support reference
split_total_reads_ref - DupUniqueNonsupportingReads + DedupUniqueNonsupportingReads count that support reference
split_unique_reads_alt DedupUniqueSupportingReads count that support variant
split_total_reads_alt - DupUniqueSupportingReads + DedupUniqueSupportingReads count that support variant
max_spliced_alignment_overhang - maximum spliced alignment overhang from all supporting reads
score - The splice junction variant score (ranging from 0.0 to 1.0). Currently, this is just a linear function of the number of split_unique_reads divided by 10, i.e. equals MIN(1.0, split_unique_reads_alt/10)
Note:
In the intragenic output file containing transcript variant splice junctions, the gene_start and gene_end columns must match.
In the intergenic output file containing fusions from splice junctions, the gene_start and gene_end columns must be different.
This file contains the detected intra-genic splice junction variants that are not filtered out, and are written into a zipped VCF file titled .splice_variants.vcf.gz, where each splice variant candidate is written as a one-line VCF record containing the fields below:
CHROM - Chromosome of the splice
POS - SJ start position (1-based) i.e. first base of intron
ID - "." (unused)
REF - Base from the reference genome FASTA at the SJ start position
ALT - "<DEL>"
QUAL - The junction score from 0.0 - 1.0
FILTER - Semicolon separated list of filters: LowQ and LowUniqueAlignment
INFO - See the possible "Info fields below"
FORMAT - AD:DP
SAMPLE - Counts for {DedupUniqueSupportingReads}:{DedupUniqueNonsupportingReads}
The following lines of the VCF header describe columns 5 to 10 (last 6 columns)
Note on Filter Thresholds
The passing thresholds for the LowQ and LowUniqueAlignments filters are fixed to the settings below.
When the splice variant caller and gene fusion caller are both enabled, the passing and failed intergenic fusion SJ's will also be merged into the relevant fusion output TSV files.
The passing calls get added to the fusion caller's .fusion_candidates.final file. The tab separated fields are described below.
The failing calls get added into the fusion caller's .fusion_candidates.filter_info output file. The output fields are the same as described above for the "final" output file, with the addition of the FILTER_INFO field in the first column. The value in this field will be "RSV_FILTER:" followed by the specific filters that are not passing, as described in the table below.
DRAGEN RNA variant calling uses the DRAGEN Somatic Small Variant Caller to call SNVs and indels. DRAGEN uses somatic variant calling to account for nongermline variant allele frequencies in RNA-seq data caused by differential expression. To perform variant calling, DRAGEN uses a probability model that weighs the evidence of a real variant against evidence for various noise models. If the quality score for a variant exceeds a certain threshold, then the variant is reported in the output VCF with the PASS label. DRAGEN also applies filters, such as weak_evidence and base_quality, that might indicate if the variant does not reach the thresholds required to qualify as a passing call. For more information on DRAGEN DNA somatic variant calling, see Somatic Mode.
DRAGEN RNA also supports forced genotyping (ForceGT). A ForceGT VCF that contains variants of interest can be provided to DRAGEN RNA VC, and the output VCF will contain all variants from the input with annotation. ForceGT might be unable to accurately call complex variants or variants with long deletions (> 50 bp). Complex variants are variants that require more than one substitution, insertion, or deletion event to transform the REF allele into the ALT allele.
DRAGEN RNA does not attempt to accurately genotype variants as heterozygous or homozygous (since it uses the DRAGEN somatic caller and somatic variants do not normally fall into those classes). Instead, a heuristic is applied based on the variant allele frequency: if the AF is at least 85%, then the GT field will be set to 1/1. Otherwise GT will always be reported as 0/1. This behavior and threshold can be adjusted with the following options:
You can use a FASTQ, BAM, or CRAM file as input. Optionally, you can provide a GTF annotation file for more accurate split junction mapping.
Use the following command line options for FASTQ input files.
Use the following command line options for a list of FASTQ input files.
Use the following command line options for a BAM input file.
To enable RNA variant calling, set --enable-rna
and --enable-variant-caller
to true. To enable ForceGT, use --vc-forcegt-vcf <forcegt_vcf_file>
.
RNA variant calling outputs a VCF file that includes PASS variants and variants that did not pass, due to filters or weak evidence. For more information on filters and additional command line options, see Somatic Mode.
(NOTE: gVCF mode is not supported with RNA variant calling.)
The following is an example RNA variant calling command line.
RNA quantification and/or fusion calling can be performed along with RNA variant calling by adding the appropriate option(s) in addition to --enable-rna=true
and --enable-variant-caller=true
.
Options:
RNA quantification: --enable-rna-quantification=true
RNA gene fusion calling: --enable-rna-gene-fusion=true
For more information and options related to RNA quantification and fusion calling, see those sections of the user guide.
Instead of using a GTF file for annotated splice junctions, the DRAGEN software is also capable of reading in an SJ.out.tab
file (see ). This file enables DRAGEN to run in a two-pass mode, where the splice junctions discovered in the first pass (output as SJ.out.tab file) are used to guide the mapping and alignment reads during a second run through DRAGEN. This mode of operation is useful to increase sensitivity for spliced alignments in cases when a gene annotations file is not readily available for the target genome. If a well curated GTF is already availble for your target genome, then there is no need to run a second pass with the SJ.out.tab
.
Metric | Description |
---|
XS:A
The XS tag denotes the strand orientation of an intron. See [Compatibility with Cufflinks]{.underline}.
NH:i
A standard SAM tag indicating the number of reported alignments that contains the query in the current record. This tag may be used for downstream tools such as featureCounts.
HI:i
A standard SAM tag denoting the query hit index, with its value indicating that this alignment is the i-th one stored in the SAM. Its value ranges from 1 ... NH. This tag may be used for downstream tools such as featureCounts.
jM:B
The jM tag lists the intron motifs for all junctions in the alignments. It has the following definitions
jM:B
Definition
0
non-canonical
1
GT/AG
2
CT/AC
3
GC/AG
4
CT/GC
5
AT/AC
6
GT/AT
| Library orientation of the RNA-seq reads relative to the original transcripts. The library orientation can be automatically detected, or can be explicitly provided. See |
| Total number of genes from the gene annotation (GTF/GFF) input used for analysis. |
| Number of coding genes from the gene annotation (GTF/GFF) excluding pseudo-genes and biotypes which are non-coding. |
| Number of transcripts from the gene annotation (GTF/GFF) input used for analysis. |
| Median Coefficient of Variation (CV), which is standard deviation divided by mean coverage, of the 1000 most highly expressed transcripts. This metric measures uniformity of RNA-seq read coverage. |
| Median 5 prime bias of the 1000 most highly expressed transcripts, calculated per transcript as mean coverage of the 5'-most 100 bases divided by the mean coverage of the whole transcript. |
| Median 3 prime bias of the 1000 most highly expressed transcripts, calculated per transcript as mean coverage of the 3'-most 100 bases divided by the mean coverage of the whole transcript. |
| The number of read pairs that match transcripts on the forward strand. Only reads that align fully within exons are counted. |
| The number of read pairs that match transcripts on the reverse strand. Only reads that align fully within exons are counted. |
| In the case of stranded library orientation, number of read pairs that do not match the expected strand of the transcript. Only reads that align fully within exons are counted. |
| Read pairs that match transcripts in both forward and reverse orientation. Only reads that align fully within exons are counted. |
| Read pairs that overlap with a gene, but do not overlap with any exons. |
| Read pairs that do not overlap with any gene. |
| Read pairs that partially align with an exon but overlap non-exonic regions (usually due to alternative splicing). |
| The count of the number of genes where the most highly expressed transcript has average coverage greater than 1x, 10x, 20x, and 100x . |
| The average sequencing coverage across all annotated exons, determined using the most highly expressed transcript for each gene. |
| The average sequencing coverage across only exons within coding genes, determined using the most highly expressed transcript for each gene. |
| The average sequencing coverage across detected introns. |
| The average sequencing coverage across areas detected outside annotated genes. |
Filter | Description | Value |
LowQ | Below splice variant score threshold | less than 1.0 |
LowUniqueAlignments | Below unique supporting read count threshold | less than 2 |
Field Names | Description |
FusionGene | Left and Right gene names (separated by '--') |
Score | Value between 0 and 1 |
LeftBreakpoint, RightBreakpoint | The location for left and right sides of the splice with three colon separated fields: chromosome:coordinate:strand(+/-) |
Gene1Location, Gene2Location | Splice Variant caller always outputs "SpliceVar" here instead of Exon/Intron location |
Gene1Sense, Gene2Sense | Always TRUE for by design |
Gene1Id, Gene2Id | Long form ID (i.e. for Gencode it is usually "ENSG.version") |
NumSplitReads | Taken from the dedupUniqueSupportingReads count (i.e. split_unique_reads_alt column value) |
NumSoftClippedReads, NumPairedReads | These values are not used by RSV caller and are set to '0' |
ReadNames | Not provided by this caller and set to 'N/A' |
Filter Names | Description |
LOW_QUAL | Below the "low quality score" threshold |
LOW_UNIQUE_ALIGNS | Number of unique anchors either on left or right side are below the MIN_UNIQUE_ALIGNS=2 threshold |
LOW_EVIDENCE_OR_OVERHANG | Not meeting the SJ.out read count vs. splice length and overhang requirements |
READTHROUGH | Gene partner is the next downstream annotated gene |
--enable-rna-quantification | If set to true, enables RNA quantification. Requires |
--rna-library-type | Specifies the type of RNA-seq library. The following are the available values:
|
--rna-quantification-gc-bias | GC bias correction estimates the effect of transcript %GC on sequencing coverage and accounts for the effect when estimating expression. To disable GC bias correction, set to false. |
--rna-quantification-fld-max --rna-quantification-fld-mean --rna-quantification-fld-sd | Use these options to specify the insert size distribution of the RNA-seq library for single-end runs. These options are relevant for GC bias correction. The defaults are 250 +- 25. The maximum allowed value is 1000. To improve accuracy, modify the values to match your library. |
Name | The ID of the transcript. |
Length | The length of the (spliced) transcript in base pairs. |
EffectiveLength | The length as accessible to RNA-seq, accounting for insert-size and edge effects. |
TPM | Transcripts per Million (TPM) represents the expression of the transcript when normalized for transcript length and sequencing depth. |
NumReads | The estimated number of reads from the transcript. The values are not normalized. |
The DRAGEN Gene Fusion module uses the DRAGEN RNA splice-aware aligner to detect gene fusion events. The supplementary (chimeric) alignments are used to find potential breakpoints and read evidence is accumulated for the resulting fusion event candidates. Then, an ML model is applied to score the putative fusion events to filter potential false positives. The ML scoring model is currently available on human samples and does not support non-human reference genomes.
You can run the DRAGEN Gene Fusion module together with a regular RNA-Seq map/align job. To enable the DRAGEN Gene Fusion module, set --enable-rna-gene-fusion
to true in your current RNA-Seq command-line scripts. The DRAGEN Gene Fusion module requires a gene annotations file in GTF or GFF format.
The following is an example command line for running an end-to-end RNA-Seq experiment.
At the end of a run, a summary of detected gene fusion events is output, which is similar to the following example.
The <OUTPUT_PREFIX>.fusion_candidates.features.csv
file lists the detected gene fusion events. The output CSV file includes the following columns.
#FusionGene
: Parent gene names (in 5' to 3' order of transcript) participating in the fusion; hereafter refer red to as Gene 1 and Gene 2. If a fusion breakpoint overlaps multiple genes, the genes are listed by default as separate candidates (rows). To show them as a semi-colon separated gene list on the same row, the option --rna-gf-merge-calls
can be set to true
as described in the Gene Fusion Options and Filters section.
Score
: Fusion call confidence score predicted by the ML model. If the ML model is used, the score can be 0 (low confidence) to 1 (high-confidence call). Currently the ML model only supports human references. In the case an ML model is not available, the number of supporting reads will be reported as the score.
LeftBreakpoint
: Gene 1 breakpoint formatted as <Chromosome>:<Position>:<Strand>
.
RightBreakpoint
: Gene 2 breakpoint formatted as <Chromosome>:<Position>:<Strand>
.
Filter
: Semicolon separated list of filter flags. The LOW_SCORE
filter is used to filter low confidence fusion candidates. If --rna-gf-enable-post-filtering=true
, other confidence filters will also be applied. Informative filters, on the other hand, do not fail the fusion. In the absence of the ML model scoring (i.e. a non-human reference is used), a more aggressive post-filtering will take place and all confidence and informative filters will be applied.
The following are the available filters.
Note that the specific features and column values are subject to change in future DRAGEN versions as more RNA data is analyzed.
#SplitScore
: Combined count of fusion supporting read pairs reported as split reads and soft-clipped reads
#NumSplitReads
: Number of fusion supporting read pairs with at least one split read alignment.
#NumSoftClippedReads
: Number of fusion supporting read pairs with no split read alignment, but at least one soft clipped alignment. Included in SplitScore
and includes soft-clipped reads for both Gene1 and Gene2
#NumSoftClippedReadsGene1
: Number of fusion supporting read pairs with no split read alignment, but at least one soft clipped alignment to Gene 1
#NumSoftClippedReadsGene2
: See above (NumSoftClippedReadsGene1
) for Gene 2
#NumPairedReads
: Number of fusion supporting read pairs such that one of the reads maps to Gene1 and the other maps to Gene2, without any breakpoint overlap
#NumRefSplitReadsGene1
: Number of read pairs that map fully within Gene 1 such that at least one of the reads aligns across the breakpoint. These reads support the reference transcript and do not support the fusion.
#NumRefSplitReadsGene2
: See above (NumRefSplitReadsGene1
) for Gene 2
#NumRefPairedReadsGene1
: Number of read pairs such that one of the reads maps on the left side of the Gene1 breakpoint and the other maps on the right side of the Gene1 breakpoint, without overlapping the break. These reads support the reference transcript and do not support the fusion.
#NumRefPairedReadsGene2
: See above (NumRefPairedReadsGene1
) for Gene 2
#AltToRef
-- Ratio of (fusion split + soft clipped reads) / max(NumRefSplitReadsGene1, NumRefSplitReadsGene2); used for the LOW_ALT_TO_REF
filter
#UniqueAlignmentsGene1
: Unique (start-end) positions of fusion supporting read alignments to Gene 1 (after dedup); used for the LOW_UNIQUE_ALIGNMENTS
filter
#UniqueAlignmentsGene2
: Unique (start-end) positions of fusion supporting read alignments to Gene 2 (after dedup); used for the LOW_UNIQUE_ALIGNMENTS
filter
#MaxMapqGene1
: Maximum MAPQ for fusion supporting reads in Gene 1
#AvgMapqGene1
: Average MAPQ for fusion supporting reads in Gene 1
#MaxMapqGene2
: Maximum MAPQ for fusion supporting reads in Gene 2
#AvgMapqGene2
: Average MAPQ for fusion supporting reads in Gene 2
#CoverageBasesGene1
: Bases in Gene 1 with read coverage within a certain distance (default 1000 bp) of the breakpoint in the direction of the breakpoint strand which is part of the fusion transcript
#CoverageBasesGene2
: See above (CoverageBasesGene1
) for Gene 2
#DeltaExonBoundaryGene1
: Distance from the Gene 1 breakpoint for the closest fusion supporting alignment (higher distance to boundary lowers score)
#DeltaExonBoundaryGene2
: See above (DeltaExonBoundaryGene1
) for Gene 2
#IsRestrictedGene1
: Indicator variable of whether the Gene 1 is tagged as protein coding in the annotation file
#IsRestrictedGene2
: Indicator variable of whether the Gene 2 is tagged as protein coding in the annotation file
#IsEnrichedGene1
: If enrichment or amplicon assay, then indicates whether Gene 1 is enriched. If whole transcriptome sequencing, then set to 1
#IsEnrichedGene2
: See above (IsEnrichedGene1
) for Gene 2
#CisDistance
: Distance between breakpoints if they are adjacent to each other and on the same strand. Large value (100M) if not a CIS break; used for the READ_THROUGH
filter.
#BreakpointDistance
: Distance between breakpoints if they are adjacent. Large value (100M) if not within same chromosome
#GenePairHomologyEval
: E-value of pairwise BLAST alignment of the parent genes
#AnchorLength1
: Longest alignment of a fusion supporting read to Gene 1
#AnchorLength2
: Longest alignment of a fusion supporting read to Gene 2
#FusionLengthGene1
: Distance from breakpoint to the end of Gene 1
#FusionLengthGene2
: Distance from breakpoint to the end of Gene 2
#NonFusionLengthGene1
: Breakpoint distance to the end of transcript not part of the fusion for Gene 1
#NonFusionLengthGene2
: Breakpoint distance to the end of transcript not part of the fusion for Gene 2
#Gene1Id
: Gene ID reported in the annotation file for Gene 1
#Gene2Id
: Gene ID reported in the annotation file for Gene 2
#Gene1Location
:
IntactExon: Breakpoint matches exon boundary,
BrokenExon: Breakpoint is within an exon but does not match the exon boundary,
Intron: Breakpoint is within an intron,
Intergenic: Breakpoint does not overlap any gene
#Gene2Location
: See above (Gene1Location
) for Gene 2
#Gene1Sense
: True
if the Gene 1 5' to 3' direction matches the breakpoint order, indicating that the gene is the upstream gene in the fusion transcript
#Gene2Sense
: See above (Gene1Sense
) for Gene 2
In addition, if --rna-gf-merge-calls
is enabled, DRAGEN merges the fusion candidates that overlap the same breakpoint into a single row reporting the feature values for the highest scoring passing candidate (or highest scoring failing candidate if no passing candidate is reported). For each breakpoint, in the column #FusionGene
, it reports a semi-colon separated list of names of all overlapping genes with a passing candidate. The following two columns are added to the features.csv
output file:
#AdditionalGenes1
: If a mix of passing and failing candidates are reported for the same breakpoint of Gene 1, genes with only failing candidates are listed. If no passing candidate exists, then all overlapping genes are reported in the #FusionGene
column.
#AdditionalGenes2
: See above (AdditionalGenes1
) for Gene 2
The <OUTPUT_PREFIX>.fusion_candidates.final
output file lists each passing fusion along with the read names that support the fusion, including Split Reads, Soft-clipped reads, and Paired (discordant) Reads and the passing scores. These reads can be extracted from the output BAM file and then used to visualize the fusions (i.e. in IGV). The same information for the non-passing fusions is provided in the <OUTPUT_PREFIX>.filter_info
output file.
The <OUTPUT_PREFIX>.fusion_candidates.vcf.gz
output file provides the VCF representation for all of the breakpoints for the candidate fusions using structural variant-style BND notation. The VCF header is annotated with ##source=DRAGEN_RNA_GF
to indicate the file is generated by the DRAGEN RNA Gene Fusion pipeline. All fusion candidates (passing and failing) are represented in the VCF output with one entry for each side of the fusion breakpoint (Gene 1 and Gene 2).
The <OUTPUT_PREFIX>.fusion_metrics.csv
output file provides a simple count of the total number of fusion candidates, those passing the scoring filter, and the number of unique left-right gene combinations that are found.
The following thresholds and options may be used to configure the fusion caller:
--rna-gf-blast-pairs
A tab separated file listing gene pairs that have a high level of similarity. The first and second column are the gene names, and the third column is the e-score. This list of gene pairs is used as a homology filter to reduce false positives. For runs on human genome assemblies GRCH38 and hg19, DRAGEN automatically applies a default file generated using Gencode Human Release 32 annotations for primary chromosomes if no other file is specified using the command-line.
--rna-gf-enriched-genes
For RNA enrichment assays, a list of targeted genes specified as one gene-name per line. Only fusion calls involving at least one gene on the list are reported. The enriched genes list should only contain genes listed in the input annotation file. This option cannot be provided together with --rna-gf-enriched-regions
. If RNA amplicon mode is enabled and the amplicon bed file already includes the gene name, then you do not need to set this option; DRAGEN will read the enriched genes names from the amplicon BED file (fifth column). See DRAGEN Amplicon Pipeline for further information.
--rna-gf-enriched-regions
Alternative to --rna-gf-enriched-genes
, but input is provided as a bed-file with regions coordinates instead of a gene list. All the genes in the provided annotation file that overlap such regions are included. Genes that are extracted in this way are summarized in output in the *.fusion.enriched_genes.txt
file. This option cannot be provided together with --rna-gf-enriched-genes
.
--rna-repeat-genes
Text file that contains the names or IDs (from the annotation file) of targeted repetitive genes for sensitive fusion detection. Exclusive from --rna-repeat-intervals
. This option overrides the default BED file. The repeat genes list should only contain genes listed in the input annotation file.
--rna-repeat-intervals
BED file that contains a target list of repeat intervals for sensitive fusion detection. Exclusive from --rna-repeat-genes
. This option overrides the default files, which contain the genes CIC, DUX4, PSPH, and SEPTIN14 for GRCh38 and hg19 reference genomes.
--enable-variant-annotation=true
, --variant-annotation-assembly
, and --variant-annotation-data
Enable Illumina Annotation Engine (IAE) to report fusion annotations in JSON format. --enable-variant-annotation
must be set to true. For more information, see Illumina Annotation Engine.
--rna-gf-restrict-genes
When parsing the gene annotations file for use in the DRAGEN Gene Fusion module, you can use this option to restrict the entries of interest to only protein-coding regions. Restricting the annotation to only the protein-coding genes reduces false positive rates in currently studied fusion events. To report non-coding gene fusions such as pseudo genes and lincRNAs, turn off this option. The default value is true
.
--rna-gf-merge-calls
If multiple genes overlap a fusion breakpoint, DRAGEN generates and scores a separate fusion candidate for each gene pair overlapping the breakpoint. The default value is false
so that each reported fusion event only has one left and right gene in the fusion, and overlapping genes are output as separate events.
--rna-gf-allow-overlapping-genes
Allows for fusion calls between overlapping genes. The default value is "false".
--rna-gf-enable-post-filters
Enable post-filtering of RNA gene fusion candidates by confidence flags. The filter flags are listed in the table above. The default value is "false".
--enable-rna-amplicon
A separate fusion filtering model is trained for RNA amplicon mode. Duplicate removal for fusion supporting reads is disabled for RNA amplicon mode and both genes are required to be in the list of enriched genes. By default, the DRAGEN fusion caller filters candidates if a breakpoint overlaps both transcripts (e.g. fusions such as FIP1L1--PDGFRA and GOPC--ROS1). In RNA amplicon mode, such candidates are not filtered. See DRAGEN Amplicon Pipeline for further information. The default is "false".
--rna-gf-sv-vcf
Structural Variant VCF file output from DRAGEN DNA structural variant caller run in somatic mode. See the next section for more information.
You can run the DRAGEN Gene Fusion module with a VCF file containing somatic Structural Variant (SV) calls. DRAGEN will report SV events matching each fusion candidate in the *.features.csv
output file for informational purposes, but will not use this data in the scoring or filtering of the fusion candidates. The SV events must be run in somatic mode (for more information see DRAGEN Structural Variant Calling pipeline). The following is an example command line for running an end-to-end RNA-Seq experiment with a somatic SV VCF file.
When the SV VCF input is provided to the RNA fusion caller, the following additional features will be reported in the features.csv
output file:
#SvEvent
: A semi-colon separated string representation of SV events matching the fusion candidate.
#SvType
: A semi-colon separated list of type of the matching SV events.
#SomaticScore
: The highest SomaticScore value of the matching SV events.
#SvDistance
: The maximum distance between any SV breakpoint to any fusion breakpoints (if multiple matching SV events, then minimum of all maximum distances over all SV events).
#LeftSvDistance
: The distance between the left fusion breakpoint and the corresponding SV breakpoint (if multiple matching SV events, then minimum over all SV events).
#RightSvDistance
: The distance between the right fusion breakpoint and the corresponding SV breakpoint (if multiple matching SV events, then minimum over all SV events).
#SvPresent
: Set to 1 if matching SV event is present, otherwise 0.
#SvAbsent
: Set to 1 if no matching SV event is present, otherwise 0.
Filter | Type | Description | Option to set threshold |
---|---|---|---|
LOW_SCORE
Confidence (always applied)
The fusion candidate has low probabilistic score (< 0.5) as determined by the features of the candidate.
--rna-gf-min-score
MIN_SUPPORT
Confidence (optional)
The fusion candidate has at least one fusion supporting read pairs.
--rna-gf-min-split-support
LOW_UNIQUE_ALIGNMENTS
Confidence (optional)
All fusion supporting read alignments near at least one of the breakpoints have the same start and end position.
--rna-gf-min-unique-alignments
LOW_MAPQ
Confidence (optional)
All fusion supporting read alignments at either breakpoint have MAPQ < 20.
--rna-gf-min-breakpoint-mapq
DOUBLE_BROKEN_EXON
Confidence (optional)
If both breakpoints are 50 bp from annotated exon boundaries, then the number of supporting reads do not satisfy a high threshold requirement (≥10 supporting reads).
--rna-gf-exon-snap
--rna-gf-min-support-be
UNENRICHED_GENES
Confidence (optional)
If enrichment list provided, then neither parent genes is enriched. If amplicon mode is enabled, then at least one parents gene is not enriched (See DRAGEN amplicon pipeline for further information).
--rna-gf-enriched-only
MITOCHONDRIAL_GENES
Confidence (optional)
The fusion candidate involves mitochondrial genes. Set --rna-gf-filter-chrm=false
to disable this filter.
--rna-gf-filter-chrm
READ_THROUGH
Confidence (optional)
The breakpoints are cis neighbors (< 200,000 bp) on the reference genome.
--rna-gf-min-cis-distance
ANCHOR_SUPPORT
Information only
Read alignments of fusion supporting reads are not long enough (less than 12 bp) at either breakpoint.
--rna-gf-min-anchor
HOMOLOGOUS
Information only
The candidate is likely to be a false candidate generated because the two genes involved have high gene homology.
--rna-gf-min-blast-pairs-eval
LOW_ALT_TO_REF
Information only
The number of reads supporting the fusion is < 1% of the number of reads supporting the reference transcript at either breakpoint.
--rna-gf-min-alt-to-ref
LOW_GENE_COVERAGE
Information only
Either breakpoint has less than 125 bp with nonzero read coverage.
--rna-gf-min-covered-bases