# DRAGEN RNA Pipeline

DRAGEN includes an RNA-seq (splice-aware) aligner, as well as RNA-specific analysis components for gene expression quantification, gene fusion detection, splice variant calling, and small variant calling. All of these analysis components require the aligner to be enabled.

![](https://25033470-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FG9szlFZupV6Q2DasL98y%2Fuploads%2Fgit-blob-99fecc70f073838a7fdd86cbeeac381039cb67a8%2Fdragen-rna-pipeline.RNA_Pipeline.whitebg.png?alt=media)

Most of the functionality and options described in Host Software Options and DNA Mapping also apply to RNA applications. Additional RNA-specific aspects are described in this section.

## Input Files

To pass in reads, you can use a FASTQ, BAM, or CRAM file as input. Use the following command line options for FASTQ input files.

```
--fastq-file1=<fastq1_file> \
--fastq-file2=<fastq2_file> \
--RGID=<read_group_id> \
--RGSM=<read_group_sample_name> 
```

Use the following command line options for a list of FASTQ input files.

```
--fastq-list=<fq_list_file> \
--fastq-list-sample-id=<sample_id>
```

Use the following command line options for a BAM input file.

```
--bam-input=<bam_file> \
--enable-map-align=false \
--enable-sort=false \
--enable-duplicate-marking=false
```

### Gene Annotation File

In addition to the standard input files (reads from FASTQ or BAM, the reference genome, and so on), DRAGEN can take a gene annotation file as input. A gene annotation file aids the alignment of reads to known splice junctions and is required for gene expression quantification, splice variant calling, and gene fusion calling. For human data, the annotation files used for validation and benchmarking (associated with the hg19, hg38, and hs37d5 assemblies) are available for download from the [DRAGEN Software Support site page](https://support.illumina.com/sequencing/sequencing_software/dragen-bio-it-platform/product_files.html).

To specify a gene annotation file, use the `-a` (`--annotation-file`) command line option. The input file must conform to the GTF or GFF3 specifications, including the following requirements:

* Each gene record must include a gene\_id attribute
* Each transcript record must include a transcript\_id attribute
* If a GTF annotation file does not include gene records, gene identities are inferred from transcript records, which must then include a gene\_id attribute. If it includes neither gene nor transcript records, both are inferred from exon records, which must include gene\_id and transcript\_id attributes.
* If the annotation is in GFF3 format, the feature hierarchy must be described explicitly: gene records must have an ID attribute, and transcript and exon records must have ID and Parent attributes. These are required in addition to the gene\_id and transcript\_id attributes.

An example of a valid GTF file is shown below.

```
chr1    HAVANA  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; ...
chr1    HAVANA  exon        11869   12227   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; ...
chr1    HAVANA  exon        12613   12721   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; ...
chr1    HAVANA  exon        13221   14409   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; ...
chr1    ENSEMBL transcript  11872   14412   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; ...
chr1    ENSEMBL exon        11872   12227   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; ...
chr1    ENSEMBL exon        12613   12721   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; ...
chr1    ENSEMBL exon        13225   14412   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; ...
...
```
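
The GTF requirements listed above can be checked before a run with a short script. The following is a minimal Python sketch, not part of DRAGEN: the function names `parse_gtf_attributes` and `check_gtf` are hypothetical, the input is assumed to be tab-separated, and only the attribute rules stated in this section are tested.

```python
import re

# Matches `key "value";` and `key value;` pairs in the ninth GTF column.
_ATTRIBUTE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*;')


def parse_gtf_attributes(column9):
    """Return the attributes of one GTF record as a dict."""
    return {key: quoted or bare for key, quoted, bare in _ATTRIBUTE.findall(column9)}


def check_gtf(lines):
    """Return sorted (line_number, message) tuples for rule violations."""
    problems, records = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            problems.append((number, "expected 9 tab-separated columns"))
            continue
        records.append((number, fields[2].lower(), parse_gtf_attributes(fields[8])))

    # Requirements grow when gene or transcript records are absent,
    # because identities are then inferred from the lower-level records.
    present = {feature for _, feature, _ in records}
    required = {"gene": {"gene_id"}, "transcript": {"transcript_id"}, "exon": set()}
    if "gene" not in present:
        required["transcript"].add("gene_id")
        if "transcript" not in present:
            required["exon"] = {"gene_id", "transcript_id"}

    for number, feature, attrs in records:
        for key in sorted(required.get(feature, set()) - set(attrs)):
            problems.append((number, f"{feature} record is missing {key}"))
    return sorted(problems)
```

Calling `check_gtf(open("annotation.gtf"))` on a conforming file (the file name is a placeholder) returns an empty list.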

An example of a valid GFF3 file is shown below.

```
chr4 HAVANA gene 33222 51437 . - . ID=ENSG00000304187.1;gene_id=ENSG00000304187.1;gene_type=lncRNA;gene_name=ENSG00000304187;level=2
chr4 HAVANA transcript 33222 51437 . - . ID=ENST00000800879.1;Parent=ENSG00000304187.1;gene_id=ENSG00000304187.1;transcript_id=ENST00000800879.1;gene_type=lncRNA;gene_name=ENSG00000304187;transcript_type=lncRNA;transcript_name=ENST00000800879;level=2;tag=basic,Ensembl_canonical,GENCODE_Primary,TAGENE
chr4 HAVANA exon 51318 51437 . - . ID=exon:ENST00000800879.1:1;Parent=ENST00000800879.1;gene_id=ENSG00000304187.1;transcript_id=ENST00000800879.1;gene_type=lncRNA;gene_name=ENSG00000304187;transcript_type=lncRNA;transcript_name=ENST00000800879;exon_number=1;exon_id=ENSE00004190061.1;level=2;tag=basic,Ensembl_canonical,GENCODE_Primary,TAGENE
chr4 HAVANA exon 34085 34186 . - . ID=exon:ENST00000800879.1:2;Parent=ENST00000800879.1;gene_id=ENSG00000304187.1;transcript_id=ENST00000800879.1;gene_type=lncRNA;gene_name=ENSG00000304187;transcript_type=lncRNA;transcript_name=ENST00000800879;exon_number=2;exon_id=ENSE00004190059.1;level=2;tag=basic,Ensembl_canonical,GENCODE_Primary,TAGENE
...
```
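
For GFF3 input, the explicit hierarchy described above (ID on every record, Parent on transcripts and exons) can be sanity-checked in the same way. This is an illustrative Python sketch only: `check_gff3_hierarchy` is a hypothetical helper, it looks only at `gene`, `transcript`, and `exon` feature types, and it additionally assumes that every Parent value should match an ID found in the same file.

```python
def parse_gff3_attributes(column9):
    """Split the ninth GFF3 column (`key=value;key=value`) into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def check_gff3_hierarchy(lines):
    """Return (line_number, message) tuples for ID/Parent problems."""
    records = []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9 and fields[2] in ("gene", "transcript", "exon"):
            records.append((number, fields[2], parse_gff3_attributes(fields[8])))

    known_ids = {attrs["ID"] for _, _, attrs in records if "ID" in attrs}
    problems = []
    for number, feature, attrs in records:
        needed = ("ID",) if feature == "gene" else ("ID", "Parent")
        for key in needed:
            if key not in attrs:
                problems.append((number, f"{feature} record is missing {key}"))
        # GFF3 allows several comma-separated parents for one feature.
        for parent in filter(None, attrs.get("Parent", "").split(",")):
            if parent not in known_ids:
                problems.append((number, f"Parent '{parent}' does not match any ID"))
    return problems
```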

For genes in the pseudoautosomal regions (PAR) of chromosomes X and Y to be handled properly, the `gene_id` attribute of a gene's exons must differ between the two chromosomes, so that exons in the PAR of chromosome X can be distinguished from those in the PAR of chromosome Y. Annotations often do this by using `gene_id=geneA` for all exons of a `geneA` transcript on chromosome X and `gene_id=geneA_PAR_Y` for the corresponding exons on chromosome Y. This lets the annotation parser and downstream components separate data for PAR genes on chromosome X from data for the same genes on chromosome Y.
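
One way to confirm that an annotation meets the PAR requirement is to look for `gene_id` values that occur on both sex chromosomes. The sketch below is a hypothetical Python helper, not a DRAGEN feature. It assumes the exon records have already been reduced to `(chromosome, gene_id)` pairs and that the contigs are named `chrX` and `chrY`; adjust the names for assemblies such as hs37d5 that use `X` and `Y`.

```python
from collections import defaultdict


def shared_xy_gene_ids(exons, x_name="chrX", y_name="chrY"):
    """Return gene_ids that are used by exons on both chromosome X and Y.

    `exons` is an iterable of (chromosome, gene_id) pairs. An empty result
    means every gene_id is unique to one of the two chromosomes.
    """
    chromosomes_by_gene = defaultdict(set)
    for chromosome, gene_id in exons:
        chromosomes_by_gene[gene_id].add(chromosome)
    return sorted(
        gene_id
        for gene_id, chromosomes in chromosomes_by_gene.items()
        if x_name in chromosomes and y_name in chromosomes
    )
```

Any `gene_id` returned by this function would need a chromosome-specific name (for example, a `_PAR_Y` suffix on the chromosome Y copy) before the file is used.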

The DRAGEN host software parses the file for exons within the transcripts and produces splice junctions. After parsing is complete, it prints out the number of splice junctions detected.

```
==================================================================
Generating annotated splice junctions
==================================================================
Input annotations file: ./gencode.v19.annotation.gtf
Splice junctions database file: output/rna.sjdb.annotations.out.tab

Number of genes: 27459

Number of transcripts: 196520
Number of exons: 1196293
Number of splice junctions: 343856
```

The splice junctions that are detected from the annotation file are also written to `*.sjdb.annotations.out.tab`. Splice junctions below a minimum length are excluded, which helps filter annotation artifacts. This minimum annotation splice junction length is controlled by the `--rna-ann-sj-min-len` option, which has a default value of `6`.
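
To illustrate how splice junctions follow from exon records, the sketch below derives the introns between consecutive exons of each transcript and drops the short ones. It is a simplified Python illustration, not DRAGEN's implementation: the function name is hypothetical, and it assumes that junction length means the number of intronic bases and that junctions shorter than the threshold are the ones excluded.

```python
def annotated_junctions(transcripts, min_len=6):
    """Derive unique splice junctions from exon coordinates.

    `transcripts` is an iterable of (chromosome, strand, exons) tuples, where
    `exons` is a list of 1-based inclusive (start, end) pairs.
    Returns sorted (chromosome, intron_start, intron_end, strand) tuples.
    """
    junctions = set()
    for chromosome, strand, exons in transcripts:
        ordered = sorted(exons)
        # Each pair of neighboring exons bounds one intron.
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:]):
            start, end = left_end + 1, right_start - 1
            if end - start + 1 >= min_len:
                junctions.add((chromosome, start, end, strand))
    return sorted(junctions)


# Exons of the two transcripts from the GTF example in this section.
example = [
    ("chr1", "+", [(11869, 12227), (12613, 12721), (13221, 14409)]),
    ("chr1", "+", [(11872, 12227), (12613, 12721), (13225, 14412)]),
]
for junction in annotated_junctions(example):
    print(*junction, sep="\t")
```

The two transcripts share their first intron, so it is reported only once.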

#### Annotation Parser Options

You can modify the behavior of the annotation parser using the following optional arguments:

| Option                           | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| -------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| --annotation-gene-features | Gene records to process based on the 3rd column. If not specified, only records named "gene" will be processed. Use a comma-separated list (case-insensitive) to specify allowed values, e.g., "gene,pseudogene". |
| --annotation-transcript-features | Transcript records to process based on the 3rd column. If not specified, the following records (from the RefSeq annotation) will be processed: transcript, primary\_transcript, pseudogenic\_transcript, unconfirmed\_transcript, processed\_transcript, mrna, mirna, snrna, snorna, ncrna, scrna, rrna, trna, telomerase\_rna, antisense\_rna, vault\_rna, v\_gene\_segment, d\_gene\_segment, j\_gene\_segment, c\_gene\_segment, y\_rna, rnase\_mrp\_rna, rnase\_p\_rna, lnc\_rna. Use a comma-separated list (case-insensitive) to specify allowed values, e.g., "transcript,primary\_transcript,mrna". If the parent gene of a transcript is excluded, the transcript will be excluded as well, even if its type is in the allowed list. Note: in most annotation files, only "transcript" is used as a feature type. |
| --annotation-exon-features | Exon records to process based on the 3rd column. If not specified, only records named "exon" will be processed. Use a comma-separated list (case-insensitive) to specify allowed values, e.g., "CDS,start\_codon,stop\_codon". If the parent gene or parent transcript of an exon is excluded, the exon will be excluded as well, even if its type is in the allowed list. |
| --annotation-bed-file            | Name of a .BED file with allowed ranges for parsing. Only features that overlap with any of the ranges described in the file will be included. Default: none.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| --annotation-min-transcript-len  | Restrict annotated features to transcripts of at least this number of bases in length.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| --annotation-max-intron-len      | Restrict annotated features to exclude transcripts with introns longer than this number of bases.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
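
The feature-type options filter records by the 3rd column, and an exclusion cascades from a parent to its children. The Python sketch below illustrates that behavior under simplifying assumptions: it is not DRAGEN code, `select_records` and the record layout (`feature`, `id`, `parent` keys) are hypothetical, and parents are assumed to appear before their children.

```python
def parse_feature_list(value):
    """Turn a comma-separated option value into a lowercase set of names."""
    return {name.strip().lower() for name in value.split(",") if name.strip()}


def select_records(records, gene_features, transcript_features, exon_features):
    """Keep records whose type is allowed and whose parent was kept."""
    genes = parse_feature_list(gene_features)
    transcripts = parse_feature_list(transcript_features)
    exons = parse_feature_list(exon_features)

    kept, kept_genes, kept_transcripts = [], set(), set()
    for record in records:
        feature = record["feature"].lower()  # matching is case-insensitive
        if feature in genes:
            kept_genes.add(record["id"])
        elif feature in transcripts and record["parent"] in kept_genes:
            kept_transcripts.add(record["id"])
        elif feature in exons and record["parent"] in kept_transcripts:
            pass  # exon of a kept transcript
        else:
            continue  # type not allowed, or parent was excluded
        kept.append(record)
    return kept
```

With `gene_features="gene"`, a transcript whose parent is a `pseudogene` record is dropped even when `transcript` is in the allowed list, and its exons are dropped with it.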

### Two-Pass Splice-junction Alignment

Instead of using a GTF file for annotated splice junctions, DRAGEN can also read an `SJ.out.tab` file (see [SJ.out.tab](https://help.connected.illumina.com/dragen/product-guides/dragen-v4.5/rna-alignment#sjouttab)). This file enables DRAGEN to run in a two-pass mode, in which the splice junctions discovered in the first pass (output as an `SJ.out.tab` file) guide the mapping and alignment of reads during a second run through DRAGEN. This mode is useful for increasing sensitivity for spliced alignments when a gene annotation file is not readily available for the target genome. If a well-curated GTF is already available for your target genome, there is no need to run a second pass with the `SJ.out.tab` file.

Depending on the characteristics of the input file (that is, read depth and distribution), the second pass using the first-pass `SJ.out.tab` file can take longer than the first pass.

**NOTE: Components downstream of the aligner, such as gene expression quantification, gene fusion detection, and splice variant calling, require a GTF file as the input annotation file and are NOT compatible with two-pass splice-junction alignment mode.**
