DRAGEN

Overview

Illumina® DRAGEN™ Secondary Analysis

Illumina DRAGEN (Dynamic Read Analysis for GENomics) secondary analysis was developed to address important challenges associated with analyzing NGS (Next Generation Sequencing) data for a range of applications, including genome, exome, transcriptome, and methylome studies. DRAGEN secondary analysis processes NGS data and enables tertiary analysis to drive insights. The available tools make up a highly accurate, comprehensive, and efficient solution that enables labs of all sizes and disciplines to do more with their genomic data.

Product highlights

Accurate results:

Pangenome reference genome and machine learning drive unprecedented accuracy
99.89% accuracy score with the PrecisionFDA Truth Challenge V2 benchmark data (2,3)

Comprehensive platform:

Analyze NGS data from whole genomes, exomes, methylomes, and transcriptomes
Available on platform of choice and scalable based on needs

Efficient analysis:

Process a 34x genome in ~30 minutes with all supported callers on DRAGEN server v4 (1)
Reduce FASTQ file sizes up to 5x with DRAGEN ORA Compression

References:

Illumina data on file, 2022.

Deployment Options

DRAGEN analysis is available on multiple platforms.

Platform

Description

DRAGEN on-premises server

DRAGEN on-premises server offers highly accurate secondary analysis in a fraction of the time compared with a traditional CPU-based system.
- Analyze and store data locally
- Supports varying levels of command line interface
- Replace up to 30 traditional compute instances
- Fully process a 34× whole human genome in ~30 minutes (1)
- One unit supports two NovaSeq 6000 Systems running at full capacity

DRAGEN analysis on Illumina Connected Analytics

Couples the accuracy and speed of DRAGEN with the ability to customize analysis pipelines to operationalize informatics on a secure platform.

DRAGEN on BaseSpace Sequence Hub (BSSH)

Push-button analysis capability in an intuitive, easy-to-use interface with the compliance and storage features of BaseSpace Sequence Hub and Amazon Web Services (AWS).

DRAGEN onboard NovaSeq X Series

- Flexibly runs multiple secondary analysis pipelines in parallel
- Performs up to four simultaneous applications per flow cell in a single run
- Brings up to 5x lossless data compression, and analysis with supported applications
- Provides savings on analysis, which over five years can exceed the price of the sequencer

DRAGEN onboard NextSeq 1000 and NextSeq 2000 Systems

- Provides access to select DRAGEN analysis informatics pipelines
- Enables users to generate results in as little as two hours
- Uses intuitive pipeline algorithms to reduce reliance on external informatics experts

DRAGEN onboard MiSeq i100 Series

Intuitive, ultra-rapid analysis including DRAGEN BCL convert, DRAGEN Library QC, DRAGEN small WGS and DRAGEN Microbial Enrichment Plus.
- Rapid results with comprehensive secondary analysis generated in two hours or less (2)
- Highly efficient workflow with a single user touchpoint to VCF and/or HTML report and no intermediate file transfers
- Exceptionally easy with an intuitive interface for non-expert users

DRAGEN on AWS and Azure

DRAGEN supports the FPGA-enabled instance types of AWS and Azure. The RPM installers and the kernel driver can be installed on images managed by the user, and DRAGEN can be run after purchasing a license.

DRAGEN on AWS and Azure Marketplace

Pre-configured Amazon Machine Images (AMI) and Azure Virtual Machines with DRAGEN installed can be accessed from the respective marketplace offerings in a Pay-As-You-Use model.

DRAGEN on GCP

DRAGEN is made available on the Google Cloud Platform. Pre-configured instances with DRAGEN installed can be accessed through the GCP application interface. Limited availability. Please reach out to your Illumina representative for access.

(1) HG002 from PrecisionFDA truth challenge V2 run with DRAGEN analysis v4.0 on DRAGEN server v4, all callers

(2) When run according to sample recommendations

Product Guides

DRAGEN v4.4

Getting Started

DRAGEN provides tests you can run to make sure that your DRAGEN system is properly installed and configured. Before running the tests, make sure that the DRAGEN server has adequate power and cooling, and is connected to a network that is fast enough to move your data to and from the machine with adequate performance.

On-premises Installation

Installation procedure:

Download the desired installer from the support website and unzip the package
The archive integrity can be checked using: ./<dragen .run file> --check
Install the appropriate release based on your Linux OS with the command: sudo sh <dragen .run file>

The .run file includes a script that manages uninstallation of existing software, integrity checking of the package and files, and installation of the new DRAGEN software version. The DRAGEN software is installed in part by using the Linux RPM Package Manager (rpm); several RPM packages make up the installation of a single DRAGEN software version. The RPM packages also configure the system for dragen, for example by raising user ulimits, and the .run script starts the services needed for operation, such as the licensing daemon dragen_licd and the hugepages daemon dragend_hp.

NOTE: Root privileges are required for the installation.

Single Version Installation

Up to DRAGEN Software v4.2, only one version of the DRAGEN software can be installed at a time. Executing the .run file will remove any existing installed version and (re)install the new version.

After installation, the application and associated files are available at /opt/edico.

The single version installer will add /opt/edico to the Linux $PATH, so that the user can just call dragen without specifying the full path.

Multi-Version Installation

Starting with DRAGEN Software v4.3, multiple compatible versions of the DRAGEN software can be installed at the same time. Executing the .run file adds the new version to the system.

After installation, the application files are available at /opt/dragen/{version}/bin and FPGA files are located at /opt/bitstream/{bitstream version}.

The multi-version installer does NOT add /opt/dragen/{version}/bin to the Linux $PATH, because multiple versions can be present at a given time. Users must set the path to the specific version they want to run. Command line examples in this guide assume that the Linux $PATH is set to the correct dragen version and refer simply to dragen <options>.

Notes on multi-version installation:

Installers released for DRAGEN v4.2 and earlier are single version packages
Single version packages and multi-version packages cannot be mixed
- Installation of a prior single version package will remove all the multi-version packages
- Installation of a multi-version package will remove any installed single version package
After installing a multi-version package, see a list of installed versions at any time by running /usr/bin/dragen_versions
To remove any multi-version package, call yum remove on its Path
Adding PATH="/opt/dragen/{version}/bin:$PATH" to the last line of .bashrc file avoids the need to set the path upon each server login

Example:

$ dragen_versions
The output format of this command may change. Use --json for machine readable output.

Dragen Version           Size (MB)  Install Date         Path
4.3.2                    1378.03    2024-03-10 18:26:17  /opt/dragen/4.3.2
4.4.3                    1381.41    2024-03-18 20:56:39  /opt/dragen/4.4.3
4.3.5                    1379.25    2024-03-11 15:20:24  /opt/dragen/4.3.5

Bitstream Version        Size (MB)  Install Date         Path
07.031.732 (0x18101306)  598.95     2024-03-10 18:26:03  /opt/bitstream/07.031.732
07.031.745 (0x18101306)  598.95     2024-03-18 20:56:18  /opt/bitstream/07.031.745
 
To remove a dragen version, call `yum remove` on its Path.
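
The PATH note above can be wrapped in a small shell function so that switching between installed versions is a single command. This is only a sketch: `use_dragen` and the `DRAGEN_ROOT` variable are hypothetical names introduced here, not part of the DRAGEN software, and the function assumes the /opt/dragen/{version}/bin layout described in this section.

```shell
# Hypothetical helper: put one installed DRAGEN version first on PATH for this session.
# DRAGEN_ROOT is an assumption of this sketch and defaults to /opt/dragen.
use_dragen() {
    dragen_bin="${DRAGEN_ROOT:-/opt/dragen}/$1/bin"
    if [ ! -d "$dragen_bin" ]; then
        echo "use_dragen: $dragen_bin does not exist" >&2
        return 1
    fi
    # Prepend so this version takes precedence over any other on PATH.
    PATH="$dragen_bin:$PATH"
    export PATH
}
```

After defining the function (for example in .bashrc), calling `use_dragen 4.4.3` would select that version for the current shell, provided it appears in the dragen_versions list.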

Location of `dragen` and resource files

| DRAGEN Version | on-premises server | cloud instance |
| --- | --- | --- |
| 4.3 and later | /opt/dragen/{version} | /opt/edico/ |
| 4.2 and earlier | /opt/edico/ | |

Throughout this guide, <INSTALL_PATH> refers to whichever of the locations above applies to your system.

Licensing

Running the System Check

After turning on the server, you can make sure that your DRAGEN server is functioning properly by running <INSTALL_PATH>/self_test/self_test.sh, which does the following:

Automatically indexes chromosome M from the hg19 reference genome
Loads the reference genome and index
Maps and aligns a set of reads
Saves the aligned reads in a BAM file
Asserts that the alignments exactly match the expected results

Each server ships with the test input FASTQ data for this script, which is located in <INSTALL_PATH>/self_test. The system check takes approximately 25 to 30 minutes.

The following example shows how to run the script and shows the output from a successful test.

$ /opt/dragen/4.3.4/self_test/self_test.sh
#############################################################
Logging to /var/log/dragen/self_test.1714627157_160164.0.details.log
Using dragen executables in /opt/dragen/4.3.4/bin
Using board(s): 0 
#############################################################
Running tests for board 0 (u200)
Using scratch directory /tmp/self_test.4BO0pfPST9/0
-------------------------------------------------------------
Board 0 test 1, FPGA MEMORY TEST
Loading DIAG bitstream
Running fpga memory test, this will take ~13 minutes
Board 0 test 1, FPGA MEMORY TEST: PASS
-------------------------------------------------------------
Board 0 test 2, BAR REGISTER ACCESS
Board 0 test 2, BAR REGISTER ACCESS: PASS
-------------------------------------------------------------
Board 0 test 3, FPGA TEMP REG ACCESS
FPGA Temperature: 27C  (Max Temp: 36C, Min Temp: 22C)
Board 0 test 3, FPGA TEMP REG ACCESS: PASS
-------------------------------------------------------------
Board 0 test 4, BOARD SERIAL # REG ACCESS
Serial Number: 2130069BM05V
Board 0 test 4, BOARD SERIAL # REG ACCESS: PASS
-------------------------------------------------------------
Board 0 test 5, DRAGEN GENOME LICENSE
Board 0 test 5, DRAGEN GENOME LICENSE: PASS
-------------------------------------------------------------
Board 0 test 6, CPLD DATE TEST
cpld date is n/a
Board 0 test 6, CPLD DATE TEST: PASS
-------------------------------------------------------------
Board 0 test 7, ENCRYPTION KEY EXISTENCE TEST
Board 0 test 7, ENCRYPTION KEY EXISTENCE TEST: PASS
-------------------------------------------------------------
Board 0 test 8, PARTIAL RECONFIGURATION
DNA-MAPPER: ok
RNA-MAPPER: ok
HMM: ok
ZIP: ok
UNZIP: ok
DIAG: ok
Board 0 test 8, PARTIAL RECONFIGURATION: PASS
-------------------------------------------------------------
Board 0 test 9, HASH TABLE GENERATION
Board 0 test 9, HASH TABLE GENERATION: PASS
-------------------------------------------------------------
Board 0 test 10, MAP AND ALIGNER
running mapper aligner: ok
unmapped input records percentages: ok
md5sum check dbam sorted: pass
Board 0 test 10, MAP AND ALIGNER: PASS
-------------------------------------------------------------
Board 0 test 11, VARIANT CALLER E2E
running variant caller: ok
md5sum check dbam sorted: ok
md5sum check VCF: ok
Board 0 test 11, VARIANT CALLER E2E: PASS
#############################################################
SELF TEST COMPLETED
SELF TEST RESULT : PASS
#############################################################
Log file at /var/log/dragen/self_test.1714627157_160164.0.details.log

If the output BAM file does not match the expected results, the SELF TEST RESULT line of the above output reads as follows:

SELF TEST RESULT : FAIL

If you experience a FAIL result after running this test script immediately after turning on your DRAGEN server, contact Illumina Technical Support.
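
When the system check is run from a script, the result line can be checked automatically instead of by eye. The sketch below assumes only the `SELF TEST RESULT : PASS` line shown in the example output; `self_test_passed` is a hypothetical helper name, and the exit code of self_test.sh itself is not relied on because this guide does not describe it.

```shell
# Returns 0 only if the text on stdin contains the PASS result line shown above.
self_test_passed() {
    grep -q '^SELF TEST RESULT : PASS[ ]*$'
}

# Usage (INSTALL_PATH must point at your DRAGEN install location):
#   "$INSTALL_PATH/self_test/self_test.sh" > self_test.out
#   self_test_passed < self_test.out && echo "DRAGEN self test passed"
```

Saving the output to a file first also keeps a record of which individual test failed if the result is FAIL.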

Running Your Own Test

When you are satisfied that your DRAGEN system is performing as expected, you are ready to run some of your own data through the machine, as follows:

Load the reference table for the reference genome
Determine location of input and output files
Process input data

Loading the Reference Genome

The reference hash table specified on the command line is automatically loaded onto the board the first time you process data with a pipeline. You can manually load the hash table for your reference genome by using the following command:

dragen -r <reference_hash-table_directory>

Make sure that the reference hash table directory is on the fast file IO drive.

The default location for the hash table for hg19 is as follows.

/staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

The command to load reference genome hg19 from the default location is as follows.

dragen -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

This command loads the binary reference genome into memory on the DRAGEN board, where it is used for processing any number of input data sets. You do not need to reload the reference genome unless you restart the system or need to switch to a different reference genome. It can take up to a minute to load a reference genome.

DRAGEN checks whether the specified reference genome is already resident on the board. If it is, then the upload of the reference genome is automatically skipped. You can force reloading of the same reference genome using the force-load-reference (-l) command line option.

The command to load the reference genome prints the software and hardware versions to standard output. For example:

DRAGEN Host Software Version 01.001.035.01.00.30.6682 and Bio-IT Processor Version 0x1001036

After the reference genome has been loaded, the following message is printed to standard output:

DRAGEN finished normally

Determine Input and Output File Locations

The DRAGEN Pipeline is very fast, which requires careful planning for the locations of the input and output files. If the input or output files are on a slow file system, then the overall performance of the system is limited by the throughput of that file system. It is recommended that inputs and outputs are streamed directly from/to a mounted external storage system.

The DRAGEN system is preconfigured with at least one fast file system consisting of a set of fast SSD disks grouped with RAID-0 for performance. This file system is mounted at /staging. This name was chosen to emphasize the fact that this area was built to be large and fast, but is not redundant. Failure of any of the file system's constituent disks leads to the loss of all data stored there.

During processing, DRAGEN generates and reads back temporary files. It is highly recommended to always direct temporary files to the fast SSD file system (/staging) by using the --intermediate-results-dir option. If the --intermediate-results-dir option is not provided, temporary files are written to the --output-directory. Inputs and outputs should be streamed from and to a mounted external storage system.

Process Your Input Data

To analyze FASTQ data, use the dragen command. For example, the following command can be used to analyze a single-ended FASTQ file:

dragen \
-r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
-1 /staging/test/data/SRA056922.fastq \
--output-directory /staging/test/output \
--output-file-prefix SRA056922_dragen \
--RGID DRAGEN_RGID \
--RGSM DRAGEN_RGSM

DRAGEN Host Software

You use the DRAGEN host software program dragen to build and load reference genomes, and then to analyze sequencing data by decompressing the data, mapping, aligning, sorting, duplicate marking with optional removal, and variant calling.

Invoke the software using the dragen command. The command line options are described in the following sections.

Command-line Options

The following are examples of frequently used command lines:

Build Reference/Hash Table

dragen --build-hash-table true --ht-reference <REF_FASTA> \
--output-directory <REF_DIRECTORY>  [options]

Run Map/Align and Variant Caller (*.fastq to *.vcf)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] -1 <FASTQ1> \
[-2 <FASTQ2>] --RGID <RG0> --RGSM <SM0> --enable-variant-caller true

Run Map/Align (*.fastq to *.bam)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] \
-1 <FASTQ1> [-2 <FASTQ2>] \
--RGID <RG0> --RGSM <SM0>

Run Variant Caller Only (*.bam to *.vcf)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] -b <BAM> \
--enable-map-align false \
--enable-variant-caller true

Re-map and Run Variant Caller (*.bam to *.vcf)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] -b <BAM> \
--enable-map-align true \
--enable-variant-caller true

Run BCL Converter (BCL to *.fastq)

dragen --bcl-conversion-only true --bcl-input-directory <BCL_DIRECTORY> \
--output-directory <OUT_DIRECTORY>

Run RNA Map/Align (*.fastq to *.bam)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] -1 <FASTQ1> \
[-2 <FASTQ2>] --enable-rna true

Reference Genome Options

Use the following command to load the reference genome and hash tables to DRAGEN card memory separately from processing reads.

dragen -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

Use the -l (--force-load-reference) option to force the reference genome to load even if it is already loaded.

dragen -l -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

The time needed to load the reference genome depends on the size of the reference, but for typical recommended settings, it takes approximately 30 to 60 seconds.

Operating Modes

DRAGEN has two primary modes of operation, as follows:

Mapper/aligner
Variant caller

DRAGEN is capable of performing each mode independently or as an end-to-end solution. DRAGEN also allows you to enable and disable decompression, sorting, duplicate marking, and compression along the DRAGEN pipeline.

Full pipeline mode: To execute full pipeline mode, set --enable-variant-caller to true and provide input as unmapped reads in *.fastq, *.bam, or *.cram formats. DRAGEN performs decompression, mapping, aligning, sorting, and optional duplicate marking and feeds directly into the variant caller to produce a VCF file. In this mode, DRAGEN uses parallel stages throughout the pipeline to drastically reduce the overall run time.
Map/align mode: Map/align mode is enabled by default. Input is unmapped reads in *.fastq, *.bam, or *.cram format. DRAGEN produces an aligned and sorted BAM or CRAM file. To mark duplicate reads at the same time, set --enable-duplicate-marking to true.
Variant caller mode: To execute variant caller mode, set the --enable-variant-caller option to true, and set the --enable-map-align option to false. The input must be a mapped and aligned BAM/CRAM file. DRAGEN produces a VCF file. DRAGEN force-enables re-sorting of the BAM, because a number of read statistics and estimates are required for the Variant Caller to operate effectively. Setting --enable-sort to false is overridden. BAM files cannot be duplicate marked in the DRAGEN pipeline prior to variant calling if they have not already been marked. Use the end-to-end mode of operation to take advantage of the mark-duplicates feature.
RNA-Seq data: To enable processing of RNA-Seq-based data, set --enable-rna to true. DRAGEN uses the RNA spliced aligner during the mapper/aligner stage. DRAGEN dynamically switches between the required modes of operation.
Bisulfite MethylSeq data: To enable processing of Bisulfite MethylSeq data, set the --enable-methylation-calling option to true. DRAGEN automates the processing of data for Lister (directional) and Cokus (nondirectional) protocols to generate a single BAM with bismark-compatible tags. Alternatively, you can run DRAGEN in a mode that produces a separate BAM file for each combination of the C->T and G->A converted reads and references. To enable this mode of processing, you need to build a set of reference hash tables with --ht-methylated enabled, and run DRAGEN with the appropriate --methylation-protocol setting.

Output Options

The following command line options for output are mandatory:

--output-directory <out_dir>: Specifies the output directory for generated files.
--output-file-prefix <out_prefix>: Specifies the output file prefix. DRAGEN appends the appropriate file extension onto this prefix for each generated file.
-r [ --ref-dir ]: Specifies the reference hash table.

The following examples do not include these mandatory options.

For mapping and aligning, the output is sorted and compressed into BAM format by default before saving to disk. The user can control the output format from the map/align stage with the --output-format <SAM|BAM|CRAM> option. If the output file exists, the software issues a warning and exits. To force overwrite if the output file already exists, use the -f [ --force ] option.

For example, the following commands output to a compressed BAM file and force overwrite of any existing output file:

dragen ... -f

dragen ... -f --output-format bam

To generate a BAI-format BAM index file (*.bai), set --enable-bam-indexing to true.

The following example outputs to a SAM file and forces overwrite:

dragen ... -f --output-format sam

The following example outputs to a CRAM file and forces overwrite:

dragen ... -f --output-format cram

DRAGEN only outputs lossless CRAM files. All QNAMEs and BAM tags are preserved in the CRAM.

Alignment tags

DRAGEN can generate mismatch difference (MD) tags, as described in the BAM standard. The feature is turned off by default because there is a small performance cost to generate these strings. To generate MD tags, set --generate-md-tags to true.

DRAGEN can also annotate additional information about alignments in a ZS:Z tag. The following are valid tag values:

| Tag | Tag meaning |
| --- | --- |
| ZS:Z:R | Multiple alignments with similar score were found. |
| ZS:Z:NM | No alignment was found. |
| ZS:Z:QL | An alignment was found but it was below the quality threshold. |
| ZS:Z:NRD | Alignment is to an auto-added decoy contig (not present in input FASTA). |
| ZS:Z:PAI | Alignment is to an insertion encoded in a population based alternate contig (not present in input FASTA). |

By default, DRAGEN writes a ZS:Z:PAI tag in the output BAM for alignments that map completely inside insertions encoded in population based alternate contigs. To write ZS:Z alignment status tags for all other types described above, set --generate-zs-tags to true (false by default). These tags are only generated in the primary alignment and when a read has suboptimal alignments qualifying for secondary output (even if none were output because --Aligner.sec-aligns was set to 0).

To generate SA:Z tags, set --generate-sa-tags to true (the default). These tags provide alignment information (position, cigar, orientation) of groups of supplementary alignments, which are useful in structural variant calling.

To generate pair score in a ps:i tag, set --generate-ps-tags to true (false by default for DNA, true for RNA). The pair score is used in DRAGEN for computing MAPQ and can be used to check how well alignment candidate pairs score against each other.

DRAGEN can also output mate alignment tags. To generate the mate cigar (in the MC:Z tag), set --generate-mc-tags to true (this is the default). To generate the mate mapping quality (in the MQ:i tag), set --generate-mq-tags to true (this is the default). To generate the mate sequence (in the R2:Z tag) and mate base qualities (in the Q2:Z tag), set --generate-r2-tags to true (default is false) and --generate-q2-tags to true (default is false), respectively. When enabled, R2:Z and Q2:Z tags are emitted only for improperly paired read alignments with a fragment length of at least 1000 bp. The methylation pipelines currently do not support the output of mate alignment tags.

DRAGEN also outputs a graph alignment tag (ga:Z) when applicable, controlled by --generate-ga-tags (true by default for DNA, false for RNA). This tag describes the best alt contig alignment that improved the score of a primary-contig alignment at its liftover position. It can also describe read alignments to alt contigs for which there is no liftover and the primary alignment is unmapped, for example when the read maps best to an alt contig describing a novel long insertion that is not present in the reference. In addition, read alignments that have been marked as unmapped because they map to auto-detected decoy contigs not present in the original user-provided FASTA also have their alignments described in the ga tag.

The ga tag uses the same format as the SA tag used to describe supplementary alignments.

CRAM Output

When CRAM is selected as output, DRAGEN generates a CRAM file with the following features:

CRAM format V3.0 is produced by default, V3.1 can be enabled by using the option --cram-version 3.1
The CRAM is lossless. Lossy compression is never employed and cannot be enabled
Quality score compression is lossless. Read names are preserved
Only the GZIP compression algorithm is employed for maximum compatibility. bzip2 and lzma are not employed. rANS is used for quality scores
All input BAM tags are preserved
The reference used to compress the CRAM file is the DRAGEN Hash Table provided during the map/align run. When decompressing the CRAM with a FASTA file and third-party tools, the FASTA that was used to generate the Hash Table must be used.
A CRAM index is produced in .crai format
CRAM output is only possible when sort is enabled. CRAM alignments will always be positionally sorted

The following default settings are used for CRAM output:

| CRAM option | Value | Description |
| --- | --- | --- |
| SEQS_PER_SLICE | 2000 | Max sequences per slice |
| BASES_PER_SLICE | SEQS_PER_SLICE*500 | Max bases per slice |
| SLICE_PER_CNT | | Max slices per container |
| embed_ref | | Do not embed reference sequence |
| noref | | Do not use non-reference-based encoding |
| multiseq | -1 | Do not use multiple references per slice |
| unsorted | | Do not use unsorted mode |
| use_bz2 | | Do not compress using bzip2 |
| use_lzma | | Do not compress using lzma |
| use_rans | | Use rANS for quality score compression |
| binning | NONE | Qual score binning not used |
| preserve_aux_order | | Preserve all aux tags and order (incl RG,NM,MD) |
| preserve_aux_size | | Aux tag sizes not preserved ('i', 's', 'c') |
| lossy_read_names | | Preserve read names |
| lossy | | Do not enable Illumina 8 quality-binning system |
| ignore_md5 | | Enable all checking of checksums |
| decode_md | | Do not (re)generate MD and NM tags |
| cram_version | 3.0 | Default is CRAM v3.0. |

Input Options

DRAGEN can process reads in FASTQ format or BAM/CRAM format. DRAGEN supports the following compression options for FASTQ input files.

Uncompressed
gzip or bgzip compression
ORA compression. To use ORA compression, you must provide an ORA reference and reference directory. See ORA Compression and Decompression.

If your input FASTQ files are gzipped, DRAGEN automatically decompresses the files using hardware-accelerated decompression, and then streams the reads into the mapper. If your files end in *.ora, DRAGEN automatically decompresses the files using ORA decompression, and then streams the reads into the mapper. The same FASTQ command-line options apply for all compression formats.

FASTQ Input Files

FASTQ input files can be single-ended or paired-end, as shown in the following examples.

Single-ended in one FASTQ file (-1 option)

dragen -r <REF_DIR> -1 <fastq> \
--output-directory <OUT_DIR> --output-file-prefix <OUT_PREFIX> \
--RGID <RGID> --RGSM <RGSM>

Paired-end in two matched FASTQ files (-1 and -2 options)

dragen -r <REF_DIR> -1 <fastq1> -2 <fastq2> \
--output-directory <OUT_DIR> --output-file-prefix <OUT_PREFIX> \
--RGID <RGID> --RGSM <RGSM>

Paired-end in a single interleaved FASTQ file (--interleaved (-i) option)

dragen -r <REF_DIR> -1 <INTERLEAVED_FASTQ> -i \
--RGID <RGID> --RGSM <RGSM>

Both bcl2fastq and the DRAGEN BCL command use a common file naming convention, as follows:

<SampleID>_S<#>_<Lane>_<Read>_<segment#>.fastq.gz

Older versions of bcl2fastq and DRAGEN could segment FASTQ samples into multiple files to limit file size or to decrease the time to generate them.

For example:

RDRS182520_S1_L001_R1_001.fastq.gz

RDRS182520_S1_L001_R1_002.fastq.gz

...

RDRS182520_S1_L001_R1_008.fastq.gz

These files do not need to be concatenated to be processed together by DRAGEN. To map/align any sample, provide the first file in the series (-1 <FileName>_001.fastq). DRAGEN reads all segment files in the sample consecutively for both of the FASTQ file sequences specified using the -1 and -2 options for paired-end input and for compressed fastq.gz files. To turn the behavior off, set --enable-auto-multifile to false on the command line.

DRAGEN can also optionally read multiple files by the sample name given in the file name, which can be used to combine samples that have been distributed across multiple BCL lanes or flow cells. To enable this feature, set the --combine-samples-by-name option to true.

If the FASTQ files specified on the command-line use the Casava 1.8 file naming convention shown above and additional files in the same directory share that sample name, those files and all their segments are processed automatically. Note that sample name, read number, and file extension must match. Index barcode and lane number can differ.
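
To see which fields DRAGEN would compare when grouping files, it can help to split a file name into the parts of the naming convention shown above. The following is a minimal sketch, not DRAGEN's own parsing logic: `parse_fastq_name` is a hypothetical helper, and it assumes lane fields of the form L001 and read fields of the form R1, as in the example file names.

```shell
# Split a FASTQ file name that follows the naming convention above into its fields.
# Prints "sample_id sample_number lane read segment"; returns non-zero if the name does not match.
parse_fastq_name() {
    fq_base=$(basename "$1")
    printf '%s\n' "$fq_base" | sed -n -E \
        's/^(.+)_S([0-9]+)_(L[0-9]+)_(R[0-9]+)_([0-9]+)\.fastq(\.gz)?$/\1 \2 \3 \4 \5/p' | grep .
}

# Usage:
#   parse_fastq_name RDRS182520_S1_L001_R1_001.fastq.gz
```

Running the helper over every file in an input directory gives a quick view of which files share a sample name and read number before enabling --combine-samples-by-name.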

To avoid impacting system performance, input files must be located on a fast file system.

Multiple FASTQ Input Files

To process multiple FASTQ input files as one sample, it is recommended that you use the --fastq-list <csv file name> option to specify the name of a CSV file containing the list of FASTQ files, instead of using the --combine-samples-by-name option.

For example:

dragen -r <ref_dir> --fastq-list <CSV_FILE> \
--fastq-list-sample-id <Sample_ID> --output-directory <OUT_DIR> \
--output-file-prefix <OUT_PREFIX>

Using a CSV file avoids having to concatenate the FASTQ files, for cases where there are multiple FASTQ files for a sample such as top-up scenarios or where FASTQ files are split across lanes. It also allows you to name the FASTQ input files, input from multiple subdirectories, and add BAM tags specified explicitly for each read group. DRAGEN automatically generates a CSV file of the correct format during BCL conversion to FASTQ. The CSV file is named fastq_list.csv and contains an entry for each FASTQ file or paired-end file pair produced during the run.

FASTQ CSV File Format

The first line of the CSV file specifies the title of each column, and is followed by one or more data lines. All lines in the CSV file must contain the same number of comma-separated values and should not contain white space or other extraneous characters.

Column titles are case-sensitive. The following column titles are required:

RGID: Read Group
RGSM: Sample ID
RGLB: Library
Lane: Flow cell lane
Read1File: Full path to a valid FASTQ input file
Read2File: Full path to a valid FASTQ input file. Required for paired-end input. If not using paired-end input, leave empty.

Each FASTQ file referenced in the CSV list can be referenced only once. All values in the Read2File column must be either nonempty and reference valid files, or they must all be empty.

When generating a BAM file using fastq-list input, one read group is generated per unique RGID value. The BAM header contains the following tags for each read group:

ID (from RGID)
SM (from RGSM)
LB (from RGLB)

You can specify additional tags for each read group by adding a column title. The column title must be only four upper-case characters and begin with RG. For example, to add a PU (platform unit) tag, add a column named RGPU and specify the value for each read group in this column. All column titles must be unique.

A fastq-list file can contain files for more than one sample. If a fastq-list file contains only one unique RGSM entry, then no additional options need to be specified, and DRAGEN processes all files listed in the fastq-list file. If there is more than one unique RGSM entry in a fastq-list file, --fastq-list-sample-id <SampleID> must be used in addition to --fastq-list <filename> to process only a specific sample from the CSV file. Only the entries in the fastq-list file with an RGSM value that match the specified SampleID are processed.

Independent processing and output for multiple individual samples in one run is not supported.
To process all listed files together as one sample, regardless of the RGSM value, the option --fastq-list-all-samples=true can be used instead of --fastq-list-sample-id.

Note

For a single run, only one BAM and VCF output file are produced because all input read groups are expected to belong to the same sample. To process multiple samples independently from one BCL conversion run, DRAGEN must be run multiple times using different values for the `--fastq-list-sample-id` option.
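One way to script the repeated runs described in this note is sketched below. The helper names are hypothetical, and writing each sample to its own output subdirectory is an assumption. The second function only prints the commands (a dry run) so that they can be reviewed before anything is executed.

```shell
# list_samples FILE
# Print the unique RGSM values of a fastq-list CSV, in file order.
list_samples() {
    awk -F',' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "RGSM") c = i; next }
        c && !seen[$c]++ { print $c }
    ' "$1"
}

# print_dragen_commands REF_DIR FASTQ_LIST OUT_ROOT
# Hypothetical dry-run helper: prints one dragen command per sample,
# using only options described in this section.
print_dragen_commands() {
    list_samples "$2" | while read -r sample; do
        echo "dragen -r $1 --fastq-list $2 --fastq-list-sample-id $sample" \
             "--output-directory $3/$sample --output-file-prefix $sample"
    done
}

# Example: print_dragen_commands <ref_dir> fastq_list.csv /staging/results
```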

There is no option to specify groupings or subsets of RGSM values for more complex filtering, but the fastq-list file can be modified to achieve the same effect.
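A sketch of such a modification follows. `subset_fastq_list` is a hypothetical helper that keeps the header and only the rows whose RGSM is in a given set. The reduced file could then be passed to --fastq-list together with --fastq-list-all-samples=true to process the selected read groups as one sample.

```shell
# subset_fastq_list FILE SAMPLE...
# Hypothetical helper (not part of DRAGEN). Prints the header of a
# fastq-list CSV plus the rows whose RGSM matches one of the SAMPLE names.
subset_fastq_list() {
    file=$1
    shift
    awk -F',' -v keep="$*" '
        BEGIN { n = split(keep, k, " "); for (i = 1; i <= n; i++) want[k[i]] = 1 }
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "RGSM") c = i; print; next }
        c && ($c in want)
    ' "$file"
}

# Example: subset_fastq_list fastq_list.csv RDSR181520 RDSR181522 > subset.csv
```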

The following is an example FASTQ list CSV file with the required columns:

RGID,RGSM,RGLB,Lane,Read1File,Read2File
CACACTGA.1,RDSR181520,UnknownLibrary,1,/staging/RDSR181520_S1_L001_R1_001.fastq,/staging/RDSR181520_S1_L001_R2_001.fastq
AGAACGGA.1,RDSR181521,UnknownLibrary,1,/staging/RDSR181521_S2_L001_R1_001.fastq,/staging/RDSR181521_S2_L001_R2_001.fastq
TAAGTGCC.1,RDSR181522,UnknownLibrary,1,/staging/RDSR181522_S3_L001_R1_001.fastq,/staging/RDSR181522_S3_L001_R2_001.fastq
AGACTGAG.1,RDSR181523,UnknownLibrary,1,/staging/RDSR181523_S4_L001_R1_001.fastq,/staging/RDSR181523_S4_L001_R2_001.fastq

If you use the --tumor-fastq-list option for somatic input, use the --tumor-fastq-list-sample-id <SampleID> option to specify the sample ID for the corresponding FASTQ list, as shown in the following example:

dragen -r <ref_dir> --tumor-fastq-list <csv_file> \
--tumor-fastq-list-sample-id <Sample_ID> \
--output-directory <out_dir> \
--output-file-prefix <out_prefix> --fastq-list <csv_file_2> \
--fastq-list-sample-id <Sample_ID_2>

Tumor-Normal Pairs Input

If you use fastq_lists or tumor_fastq_lists that contain multiple samples (RGSMs) in somatic mode, you can use a loop to iterate through the two lists to create tumor-normal pairs for testing. Create a *.txt file with the RGSM of each normal sample to be tested (one per line), and then create a separate *.txt file with the RGSM of each tumor sample to be tested. Make sure that the tumor sample RGSMs are listed in the same order as the corresponding normal samples, and include a blank line after the last sample in each file.

You can use the following example script to perform testing in somatic mode. Each iteration takes one entry from the tumor samples list and one entry from the normal samples list (from top to bottom) to create a tumor-normal pair as input for the DRAGEN run.

#!/bin/bash

# Reference hash table directory and FASTQ list CSV files
HT="/staging/HT/"
tumor_fastq_list="/staging/inputs/tumor_fastq_list.csv"
normal_fastq_list="/staging/inputs/normal_fastq_list.csv"

# One RGSM per line; line N of each file forms the Nth tumor-normal pair
tumor_samples_list="/staging/inputs/tumor_samples_list.txt"
normal_samples_list="/staging/inputs/normal_samples_list.txt"

# Read both lists in lockstep on file descriptors 3 and 4
while read -u 3 -r tumor_RGSM && read -u 4 -r normal_RGSM; do
    output_dir="/staging/results/${tumor_RGSM}_${normal_RGSM}"
    mkdir -p "${output_dir}"

    dragen \
        -r "${HT}" \
        --tumor-fastq-list "${tumor_fastq_list}" \
        --tumor-fastq-list-sample-id "${tumor_RGSM}" \
        --fastq-list "${normal_fastq_list}" \
        --fastq-list-sample-id "${normal_RGSM}" \
        --output-directory "${output_dir}" \
        --output-file-prefix "${tumor_RGSM}_${normal_RGSM}"
done 3<"${tumor_samples_list}" 4<"${normal_samples_list}"


The following are examples of the FASTQ lists and samples lists used as input for the script.

Sample normal_fastq_list.csv content:

RGPL,RGID,RGSM,RGLB,Lane,Read1File,Read2File
DRAGEN_RGPL,DRAGEN_RGID_N1.1,normal-1,ILLUMINA,1,/staging/inputs/normal-1_S1_L001_R1_001.fastq.gz,/staging/inputs/normal-1_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N1.2,normal-1,ILLUMINA,2,/staging/inputs/normal-1_S1_L002_R1_001.fastq.gz,/staging/inputs/normal-1_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N2.1,normal-2,ILLUMINA,1,/staging/inputs/normal-2_S1_L001_R1_001.fastq.gz,/staging/inputs/normal-2_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N2.2,normal-2,ILLUMINA,2,/staging/inputs/normal-2_S1_L002_R1_001.fastq.gz,/staging/inputs/normal-2_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N3.1,normal-3,ILLUMINA,1,/staging/inputs/normal-3_S1_L001_R1_001.fastq.gz,/staging/inputs/normal-3_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N3.2,normal-3,ILLUMINA,2,/staging/inputs/normal-3_S1_L002_R1_001.fastq.gz,/staging/inputs/normal-3_S1_L002_R2_001.fastq.gz

Sample tumor_fastq_list.csv content:

RGPL,RGID,RGSM,RGLB,Lane,Read1File,Read2File
DRAGEN_RGPL,DRAGEN_RGID_T1.1,tumor-1,ILLUMINA,1,/staging/inputs/tumor-1_S1_L001_R1_001.fastq.gz,/staging/inputs/tumor-1_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T1.2,tumor-1,ILLUMINA,2,/staging/inputs/tumor-1_S1_L002_R1_001.fastq.gz,/staging/inputs/tumor-1_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T2.1,tumor-2,ILLUMINA,1,/staging/inputs/tumor-2_S1_L001_R1_001.fastq.gz,/staging/inputs/tumor-2_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T2.2,tumor-2,ILLUMINA,2,/staging/inputs/tumor-2_S1_L002_R1_001.fastq.gz,/staging/inputs/tumor-2_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T3.1,tumor-3,ILLUMINA,1,/staging/inputs/tumor-3_S1_L001_R1_001.fastq.gz,/staging/inputs/tumor-3_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T3.2,tumor-3,ILLUMINA,2,/staging/inputs/tumor-3_S1_L002_R1_001.fastq.gz,/staging/inputs/tumor-3_S1_L002_R2_001.fastq.gz

Sample normal_samples_list.txt content:

normal-1
normal-2
normal-3

Sample tumor_samples_list.txt content:

tumor-1
tumor-2
tumor-3
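Because the pairing script matches samples purely by line position, it can help to check the two samples lists first. The sketch below is a hypothetical pre-check, not part of DRAGEN. It confirms that each list ends with a newline (a `while read` loop skips a final line that has none) and that both lists hold the same number of samples, then prints the pairs that would be formed.

```shell
# check_pair_lists TUMOR_LIST NORMAL_LIST
# Hypothetical helper. Returns non-zero if the lists cannot be paired
# line by line; otherwise prints "<tumor_RGSM> <normal_RGSM>" per pair.
check_pair_lists() {
    for f in "$1" "$2"; do
        # tail/wc yields 1 when the last byte is a newline, 0 otherwise.
        if [ -s "$f" ] && [ $(tail -c 1 "$f" | wc -l) -eq 0 ]; then
            echo "$f: last line has no trailing newline and would be skipped" >&2
            return 1
        fi
    done
    t=$(grep -c . "$1")   # number of non-empty lines
    n=$(grep -c . "$2")
    if [ "$t" -ne "$n" ]; then
        echo "sample counts differ: $t tumor, $n normal" >&2
        return 1
    fi
    paste -d ' ' "$1" "$2"
}

# Example:
# check_pair_lists /staging/inputs/tumor_samples_list.txt /staging/inputs/normal_samples_list.txt
```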

FASTQ ORA Input Files

You can use the same options as the other FASTQ input file types for ORA files. To use the ORA file, replace the FASTQ file name with the ORA file name and specify the ORA reference directory using --ora-reference.

See ORA Compression and Decompression for more information on ORA reference files.

The following command specifies paired-end input in two matched ORA files (-1 and -2 options).

dragen -r <REF_DIR> -1 <fastq.ora1> -2 <fastq.ora2> \
--ora-reference <ORADATA_DIR> \
--output-directory <OUT_DIR> --output-file-prefix <OUT_PREFIX> \
--RGID <RGID> --RGSM <RGSM>

BAM Input Files

BAM files can be used as input to the mapper/aligner. By default, --enable-map-align is true. When BAM input is provided with map/align enabled, DRAGEN ignores any alignment or duplicate marking information contained in the input file, re-maps the reads, and feeds the new alignments downstream to the variant callers. Any existing flags in the input BAM are erased when reads are re-mapped. BAM re-mapping is supported for multiple BAM inputs at a time, such as paired tumor-normal input to somatic variant calling. To output the re-mapped BAM files, set --enable-map-align-output=true.

Alternatively, existing alignments in the BAM file can be used as input to the variant callers by setting the --enable-map-align option to false.

If the input file contains paired-end reads, the two reads of each pair must be processed together. Other pipelines would require you to re-sort the input data set by read name. DRAGEN vastly increases the speed of this operation by pairing the input reads itself and sending them on to the mapper/aligner as soon as pairs are identified. Use the --pair-by-name option to enable or disable this feature (the default is true).

Specify single-ended input in one BAM file with the -b and --pair-by-name=false options, as follows:

dragen -r <ref_dir> -b <bam> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --pair-by-name false

Specify paired-end input in one BAM file with the -b and --pair-by-name=true options, as follows:

dragen -r <ref_dir> -b <bam> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --pair-by-name true

CRAM Input

You can use CRAM files as input to the DRAGEN mapper/aligner and variant caller. The DRAGEN functionality available when using CRAM input is the same as when using BAM input. Supported CRAM input file formats are v3.0 and v3.1.

By default, the CRAM compressor and decompressor use the DRAGEN reference specified with the --ref-dir option. CRAM compression is reference based, and the reference used for compression is not part of the CRAM file. Therefore, the CRAM input file must have been created with the same reference as the one provided to DRAGEN for the analysis.

DRAGEN supports the re-alignment of a CRAM input that was created with a different reference in one step. Re-aligning a CRAM file that was created with a different reference requires use of the --cram-reference option. This option will make the CRAM decompressor use the specified reference.

--cram-reference can be either a FASTA file or a DRAGEN hash table folder.
If it points to a FASTA file, the FASTA .fai index file must be present next to the FASTA file.
CRAM output is always compressed using the --ref-dir reference.
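The index requirement can be checked before launching a run. The following is a minimal sketch with a hypothetical helper name. It only tests whether the path is a directory or a file with a sibling .fai file, and does not validate the contents. If the index is missing, one common way to create it is `samtools faidx <fasta>`.

```shell
# check_cram_reference PATH
# Hypothetical helper (not part of DRAGEN). Reports how PATH would be
# interpreted as a --cram-reference value; returns non-zero if unusable.
check_cram_reference() {
    if [ -d "$1" ]; then
        echo "directory: treated as a DRAGEN hash table folder"
    elif [ -f "$1" ] && [ -f "$1.fai" ]; then
        echo "fasta with index: $1.fai"
    elif [ -f "$1" ]; then
        echo "missing index: expected $1.fai" >&2
        return 1
    else
        echo "not found: $1" >&2
        return 1
    fi
}

# Example: check_cram_reference <hg19.fa>
```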

Example: CRAM was created with hg19, re-analysis with hg38

dragen -r <ref_dir HG38> --cram-input <cram> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --cram-reference <ref_dir HG19>

dragen -r <ref_dir HG38> --cram-input <cram> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --cram-reference <hg19.fa>

The following option provides CRAM input to either the mapper/aligner or the variant caller:

--cram-input--The name and path of the CRAM file.

One usage example is paired-end input in a single CRAM file. In that case, also set the --pair-by-name option to true.

dragen -r <ref_dir> --cram-input <cram> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --pair-by-name true

BCL Input Files

BCL is the output format of Illumina sequencing systems. Under limited circumstances, DRAGEN can read directly from BCL for map-align operations, saving the time needed for conversion to FASTQ.

DRAGEN can read directly from BCL in the following circumstances:

Only one lane is input as part of a run (specified on the command-line).
The lane has only a single sample specified in the SampleSheet.csv file.

When converting BCL to FASTQ is required, DRAGEN provides a BCL to FASTQ converter (see DRAGEN BCL Data Conversion).

The following example command is for BCL input with only one lane of input:

dragen --bcl-input-dir <BCL_ROOT> --bcl-only-lane <num> -r <ref_dir> \
--output-directory <out_dir> --output-file-prefix <out_prefix>

For additional BCL conversion options, see Input File Types.

Handling of N bases

One of the techniques that DRAGEN uses to optimize sequence handling can lead to overwriting of the base quality score assigned to N base calls.

When you use the --fastq-n-quality and --fastq-offset options, the base quality scores of N base calls are overwritten with a fixed base quality. The default values for these options are 2 and 33, respectively. Together they match the Illumina minimum quality encoding: quality 2 plus offset 33 gives ASCII code 35 (character ‘#’).
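The arithmetic behind the defaults is quality plus offset: 2 + 33 = 35, which is the ASCII code of `#`. The small sketch below, with a hypothetical helper name, converts any quality and offset into the character that appears in the FASTQ quality string.

```shell
# quality_char QUALITY OFFSET
# Print the FASTQ character that encodes QUALITY under the given ASCII offset.
quality_char() {
    code=$(( $1 + $2 ))
    # Convert the decimal code to octal and let printf emit that byte.
    printf "\\$(printf '%03o' "$code")\n"
}

quality_char 2 33    # prints #
```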

Read Names for Paired-End Reads

By a common convention, read names can include suffixes (such as /1 or /2) that indicate which end of a pair the read represents. For BAM input using the --pair-by-name option, DRAGEN ignores these suffixes to find matching pair names. By default, DRAGEN uses the forward slash character as the delimiter for these suffixes and ignores the /1 and /2 when comparing names. By default, DRAGEN strips these suffixes from the original read names.

DRAGEN has the following options to control how suffixes are used:

To change the delimiter character for suffixes, use the --pair-suffix-delimiter option. Valid values for this option include forward-slash (/), dot (.), and colon (:).
To preserve the entire name, including the suffixes, set --strip-input-qname-suffixes to false.
To append a new set of suffixes to all read names, set --append-read-index-to-name to true. The delimiter is determined by the --pair-suffix-delimiter option. By default, the delimiter is a slash, so /1 and /2 are added to the names.
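To make the suffix convention concrete, the sketch below shows what stripping a pair suffix means for a single read name. It is an illustration only, not DRAGEN's implementation, and `strip_pair_suffix` is a hypothetical name. The delimiter defaults to a forward slash, mirroring the default described above.

```shell
# strip_pair_suffix NAME [DELIM]
# Remove a trailing <DELIM>1 or <DELIM>2 from a read name, if present.
strip_pair_suffix() {
    delim=${2:-/}
    case "$1" in
        *"$delim"1|*"$delim"2) printf '%s\n' "${1%??}" ;;   # drop delimiter + digit
        *)                     printf '%s\n' "$1" ;;
    esac
}

strip_pair_suffix "read7/1"       # prints read7
strip_pair_suffix "read7.2" "."   # prints read7
```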

Gene Annotation Input Files

When processing RNA-Seq data, you can supply a gene annotations file by using the --annotation-file option. Providing this file improves the accuracy of the mapping and aligning stage (see Input Files). The file should conform to the GTF/GFF format specification and should list annotated transcripts that match the reference genome being mapped against. The similar GFF3 format is currently not supported, due to inconsistent contig naming between GENCODE and Ensembl. See the RNA user guide section for more details on potential issues and workarounds.

DRAGEN can take the SJ.out.tab file (see SJ.out.tab) as an annotations file to help guide the aligner in a two-pass mode of operation.

Networked Streaming

AWS S3, Azure Blob Storage, and AWS Presigned URL Input Streaming

DRAGEN can stream input files directly from an AWS S3 bucket, Azure Blob storage account, or by using AWS presigned URLs (presigned URLs are not supported for Azure Blob storage at this time). With streaming, input files are not required to be downloaded locally prior to being processed. The files are streamed over the network directly into the DRAGEN processor.

Input streaming is most beneficial for large input files. DRAGEN supports input streaming for BAMs and compressed FASTQ files. For FASTQ files, input streaming can be used in all the configurations, including single-end FASTQs, paired-end FASTQs, and FASTQ lists.

Input streaming is supported for the following use cases:

Mapping/aligning of FASTQ and BAM.
Germline and somatic small variant calling from BAM (without remapping).

For other file types that are significantly smaller in size, download them locally before running the analysis.

Streaming FASTQ Input Using AWS S3

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 s3://s3-bucket-name/path/to/object_1.fastq.gz \
  -2 s3://s3-bucket-name/path/to/object_2.fastq.gz \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming FASTQ Input Using Azure Blob Storage Account

AZ_ACCOUNT_NAME="storage-account-name" AZ_ACCOUNT_KEY="<account-key>" dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 https://storage-account-name.blob.core.windows.net/path/to/object_1.fastq.gz \
  -2 https://storage-account-name.blob.core.windows.net/path/to/object_2.fastq.gz \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming FASTQ Input Using Presigned URLs (for AWS only)

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 https://bucket-name.amazonaws.com/path/to/object_1.fastq.gz?querystring \
  -2 https://bucket-name.amazonaws.com/path/to/object_2.fastq.gz?querystring \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming BAM Input Using AWS S3

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -b s3://s3-bucket-name/path/to/object_1.bam \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming BAM Input Using Azure Blob Storage Account

AZ_ACCOUNT_NAME="storage-account-name" AZ_ACCOUNT_KEY="<account-key>" dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -b https://storage-account-name.blob.core.windows.net/path/to/object_1.bam \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming BAM Input Using Presigned URLs (for AWS only)

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -b https://bucket-name.amazonaws.com/path/to/object_1.bam?querystring \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

AWS S3 and Azure Blob Storage Output Streaming

DRAGEN can stream its output to an AWS S3 Bucket or an Azure Blob Storage Account Container. Output streaming is beneficial for large output files and for sharing results.

Streaming output to AWS S3

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 SRA056922.fastq \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory s3://s3-bucket-name/path/to/output \
  --intermediate-results-dir /staging/examples \
  --output-file-prefix streaming

Streaming output to Azure Blob Storage Account

AZ_ACCOUNT_NAME="storage-account-name" AZ_ACCOUNT_KEY="<account-key>" dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 SRA056922.fastq \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory https://storage-account-name.blob.core.windows.net/path/to/output \
  --intermediate-results-dir /staging/examples \
  --output-file-prefix streaming

Security and Permissions

To stream input files from, or write output to, a cloud provider's storage, you must have permission to access the remote files.

AWS S3

S3 requires AWS authentication and credentials. The authentication should already be set up on the instance you are running, for example, via IAM policies.

Azure Blob Storage Account

Azure requires authentication and environment variables. DRAGEN supports two cases: (1) Using managed identities and (2) Storage account access keys.

To use managed identities, you must run DRAGEN on an Azure instance. The instance must have Contributor permissions (read/write) on the Storage Account that it reads from and writes to. If the instance has a single managed identity, only the AZ_ACCOUNT_NAME=<azure-storage-account-name> environment variable is required. For multiple managed identities, you must also provide the AZR_IDENT_CLIENT_ID=<client-id> environment variable, with the client ID of the identity that can access your storage account. The client ID can be found on the Azure Portal.

To use storage account access keys, set both the AZ_ACCOUNT_NAME and AZ_ACCOUNT_KEY environment variables, as in the Azure Blob Storage streaming examples above.

Presigned URL (AWS only)

An AWS presigned URL typically has a query string attached to it, which provides the authentication credentials or tokens needed to grant permission to the S3 bucket (e.g., https://bucket-name.amazonaws.com/path/to/folder?querystring). Currently, streaming input to DRAGEN from Azure presigned URLs is not supported.

Sample Sex

Use the --sample-sex command line option to control the sex karyotype input used in downstream components, such as variant callers. If a sample sex karyotype input is not specified using the command line, the sex karyotype is automatically determined. The sex karyotype input is converted to a reference sex karyotype for use in variant calling. Other components might support sex karyotype input. Refer to the corresponding section for the component you are using.

The --sample-sex option supports the following values. Values are not case-sensitive.

none: No sex karyotype input. Components use a default reference sex karyotype.
auto: The sex karyotype is estimated by the Ploidy Estimator. If using CNV calling, sex karyotype is determined using a separate sex estimation module. If DRAGEN cannot estimate the sex karyotype, then components do not have a sex karyotype input. This behavior is then the same as none. auto is the default value.
female: Sex karyotype input is XX.
male: Sex karyotype input is XY.
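If the value comes from a sample manifest or another script, a wrapper can normalize it before building the DRAGEN command line. The sketch below is a hypothetical pre-check, not DRAGEN behavior. It lowercases the input and accepts only the four values listed above.

```shell
# normalize_sample_sex VALUE
# Hypothetical helper. Prints VALUE in lowercase if it is one of the values
# supported by --sample-sex; otherwise reports an error and returns 1.
normalize_sample_sex() {
    v=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$v" in
        none|auto|female|male) printf '%s\n' "$v" ;;
        *) echo "unsupported --sample-sex value: $1" >&2; return 1 ;;
    esac
}

normalize_sample_sex FEMALE   # prints female
```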

The following example command lines use --sample-sex to specify the sex karyotype.

--sample-sex FEMALE

--sample-sex MALE

--sample-sex NONE

If the value is none, female, or male, the Ploidy Estimator could still run and produce output, but variant callers will not use any estimated sex karyotype that is different than the sex karyotype provided via the command-line.

The sex karyotype input is converted to the reference sex karyotype for the different components as follows. See the relevant component section for more information on how --sample-sex is used.

Reference Sex Karyotype

Sex Karyotype Input

CNV Caller

DRAGEN-STR

Ploidy Caller

Small Variant Caller

SV Caller

XXYY

XXX

XXYY

XXXX

XXYY

XXXXX

XXYY

XXY

XXYY

XXXY

XXYY

XXXXY

XXYY

XYY

XXYY

XXXYY

XXYY

XYYY

XXYY

XXYYY

XXYY

XYYYY

XXYY

None

XX/XY

XXYY

For sex karyotype input of None, CNV/Ploidy Caller independently check the coverage ratio of X and Y to determine the reference sex karyotype. Detection of minimal Y coverage will yield XY, otherwise XX.

Preservation or Stripping of BQSR Tags

The Picard Base Quality Score Recalibration (BQSR) tool produces output BAM files that include tags BI and BD. BQSR calculates these tags relative to the exact sequence for a read. If a BAM file with BI and BD tags is used as input to mapper/aligner with hard clipping enabled, the BI and/or BD tags can become invalid.

The recommendation is to strip these tags when using BAM files as input. To remove the BI and BD tags, set the --preserve-bqsr-tags option to false. If you preserve the tags, DRAGEN warns you to disable hard clipping.

Read Group Options

DRAGEN assumes that all the reads in a given FASTQ belong to the same read group. DRAGEN creates a single @RG read group descriptor in the header of the output BAM file, with the ability to specify the following standard BAM attributes:

--RGID--Read group identifier. If you include any of the read group parameters, RGID is required. It is the value written into each output BAM record.
--RGLB--Library.
--RGPL--Platform/technology used to produce the reads. The BAM standard allows for values CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT and PACBIO.
--RGPU--Platform unit, for example, flowcell-barcode.lane.
--RGSM--Sample.
--RGCN--Name of the sequencing center that produced the read.
--RGDS--Description.
--RGDT--Date the run was produced.
--RGPI--Predicted mean insert size.

If any of these arguments are present, DRAGEN adds an RG tag to all the output records to indicate that they are members of a read group. The following example shows a command line that includes read group parameters:

dragen --RGID 1 --RGCN Broad --RGLB Solexa-135852 \
--RGPL Illumina --RGPU 1 --RGSM NA12878 \
-r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
-1 SRA056922.fastq --output-directory /staging/tmp/ \
--output-file-prefix rg_example

When using the --fastq-list option to input multiple read groups, BAM tags (and others) are specified for each read group by adding columns to the fastq_list.csv file. Each column heading consists of four capital letters and each begins with 'RG'. For each column, each read group's values for that column are propagated to the output BAM file in an identically named tag.

License Options

To suppress the license status message at the end of the run, use the --lic-no-print option. The following shows an example of the license status message:

LICENSE_MSG| =====================================================
LICENSE_MSG| License report
LICENSE_MSG|   Genome status [ACxxxxxxxxxxx] : used 1263.9 Gbases since 2018-Feb-15 (1263886160894 bases, unlimited)
LICENSE_MSG|   Genome  bases [ACxxxxxxxxxxx] : 202000000
LICENSE_MSG|   Genome  bases [total]         : 202000000

Autogenerated MD5SUM for BAM and CRAM Output Files

An MD5SUM file is generated automatically for BAM and CRAM output files. The MD5SUM file has the same name as the output file, with an .md5sum extension appended (for example, whole_genome_run_123.bam.md5sum). The MD5SUM file is a single-line text file that contains the md5sum of the output file, which exactly matches the output of the Linux md5sum command.

The MD5SUM calculation is performed as the output file is written, so there is no measurable performance impact (compared to the Linux md5sum command, which can take several minutes for a 30x BAM).
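After copying an output file to another location, the stored checksum can be compared with a freshly computed one. The following is a minimal sketch with a hypothetical helper name. It compares only the first whitespace-separated field of each, which is an assumption that keeps the check independent of whether the .md5sum file also records a file name.

```shell
# verify_md5sum FILE
# Hypothetical helper. Compares the checksum stored in FILE.md5sum with
# the checksum that the Linux md5sum command computes for FILE.
verify_md5sum() {
    expected=$(awk 'NR == 1 { print $1 }' "$1.md5sum")
    actual=$(md5sum "$1" | awk '{ print $1 }')
    if [ "$expected" = "$actual" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1" >&2
        return 1
    fi
}

# Example: verify_md5sum /staging/results/whole_genome_run_123.bam
```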

Configuration Files

Command line options can be stored in a configuration file. The location of the default configuration file is <INSTALL_PATH>/config/dragen-user-defaults.cfg. You can override this file by using the --config-file (-c) option to specify a different file. The configuration file used for a given run supplies the default settings for that run, any of which can be overridden by command line options.

The recommended approach is to use the dragen-user-defaults.cfg file as a template to create default settings for different use cases. Copy dragen-user-defaults.cfg, rename the copy, then modify the new file for the specific use-case. Best practice is to put options that rarely change into the configuration file and to specify options that vary from run to run on the command line.

Licensing

DRAGEN Secondary Analysis

The DRAGEN secondary analysis software utilizes a highly reconfigurable Field Programmable Gate Array (FPGA) card and is available on a preconfigured DRAGEN server that can be seamlessly integrated into bioinformatics workflows. The platform can be loaded with highly optimized algorithms for many different NGS secondary analysis pipelines, including the following:

Whole genome
Exome
RNA-Seq
Methylome
Cancer

All user interaction is accomplished via DRAGEN software that runs on the host server and manages all communication with the FPGA card. This user guide summarizes the technical aspects of the system and provides detailed information for all DRAGEN command line options. If you are working with DRAGEN for the first time, Illumina recommends that you first read the Getting Started section, which provides a short introduction to DRAGEN, including running a test of the server, generating a reference genome, and running example commands.

DNA Pipeline

DRAGEN DNA Pipeline

The DRAGEN DNA Pipeline massively accelerates the secondary analysis of NGS data. For example, the time taken to process an entire human genome at 30x coverage is reduced from approximately 10 hours (using the current industry standard, BWA-MEM+GATK-HC software) to approximately 20 minutes. Time scales linearly with coverage depth.

These pipelines harness the tremendous power of the DRAGEN server and include highly optimized algorithms for mapping, aligning, sorting, duplicate marking, and haplotype variant calling. They also use platform features such as hardware-accelerated compression and optimized BCL conversion, together with the full set of platform tools.

Unlike all other secondary analysis methods, DRAGEN DNA Applications do not reduce accuracy to achieve speed improvements. Accuracy for both SNPs and INDELs is improved over that of BWA-MEM+GATK-HC in side-by-side comparisons.

In addition to haplotype variant calling, the pipeline supports calling of copy number and structural variants as well as detection of repeat expansions.

RNA Pipeline

DRAGEN secondary analysis includes an RNA-Seq (splicing-aware) aligner, as well as RNA-specific analysis components for gene expression quantification and gene fusion detection.

The DRAGEN RNA Pipeline shares many components with the DNA Pipeline. Mapping of short seed sequences from RNA-Seq reads is performed similarly to mapping DNA reads. In addition, splice junctions (the joining of noncontiguous exons in RNA transcripts) near the mapped seeds are detected and incorporated into the full read alignments.

DRAGEN secondary analysis uses hardware-accelerated algorithms to map and align RNA-Seq-based reads faster and more accurately than popular software tools. For instance, it can align 100 million paired-end RNA-Seq-based reads in about three minutes. With simulated benchmark RNA-Seq data sets, its splice junction sensitivity and specificity are unsurpassed.

Methylation Pipeline

The DRAGEN Methylation Pipeline provides support for automating the processing of bisulfite sequencing data to generate a BAM with the tags required for methylation analysis and reports detailing the locations with methylated cytosines.

Clinical Research Workflows

DRAGEN v4.4 introduces support for DRAGEN server apps. These apps, composed of Docker images, Nextflow workflows, a CLI shell script, and packaged resource bundles, can be downloaded and installed on the on-premises server. The packaged resource bundles include all the resource files required to run the application, such as the hash tables, various noise baseline files, and BED files.

Server apps make it easy to run complex workflows such as Tumor Normal somatic analysis by simplifying the management of external resources and applying the correct command line parameters for the selected analysis type. The DRAGEN server can support multiple installed server apps and DRAGEN on-prem for command line use at the same time.

Common Product Features

Sample Sheets

Overview

When running analysis on a standalone DRAGEN server or on ICA, a valid sample sheet can be created with the BaseSpace Run Planning tool.

A minimal sample sheet for starting from FASTQ, BAM, or CRAM can be created by modifying a sample sheet template according to the requirements. See the product-specific templates for more information.

Note: A minimal sample sheet may be invalid for other purposes. It is always advisable to use a valid sample sheet generated from the BaseSpace Run Planner.

See the Run Planning section of this guide for specific instructions on planning a run and setting up a valid sample sheet for the pipeline, where supported.

New Sample Sheet options available in DRAGEN 4.4+ release

Forward orientation for index2

The following [BCLConvert_Settings] fields control index2 orientation:

SoftwareVersion--Required. If SoftwareVersion >=4.4, the index2 orientation must be forward. Otherwise, legacy behavior is supported.
RunInfoIndex2ReverseComplement--Optional. Allowed values are Y and N. If SoftwareVersion >=4.4, this field must be present together with Index2ColumnReverseComplement. In case of conflict, this value overrides the RunInfo.xml IsReverseComplement = Y/N flag for index2 orientation.
Index2ColumnReverseComplement--Optional. Allowed values are Y and N. If SoftwareVersion >=4.4, this field must be present together with RunInfoIndex2ReverseComplement. This value indicates whether the index2 column sequence is reverse complement or not.
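When checking whether the index2 sequences in a sample sheet are in forward or reverse-complement orientation, it can help to compute the reverse complement of an index by hand. The sketch below is a generic illustration, not a DRAGEN or BCL Convert command. `revcomp` is a hypothetical helper and the example sequence is made up.

```shell
# revcomp SEQUENCE
# Print the reverse complement of a DNA sequence (A<->T, C<->G).
# Characters other than A, C, G, T (for example N) are left unchanged.
revcomp() {
    printf '%s\n' "$1" | tr 'ACGTacgt' 'TGCAtgca' |
        awk '{ out = ""; for (i = length($0); i >= 1; i--) out = out substr($0, i, 1); print out }'
}

revcomp ACGTTG    # prints CAACGT
```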

Summary of Valid Settings for Index Orientation

As indicated in the following table, the index2 orientation is always Forward for simplicity. The two new flags are especially useful when custom LPKs are used and when a consistent index2 orientation is desired for all run folders. The IndexOrientation field is present in sample sheets generated by the BaseSpace Run Planner, and indicates that the sample sheet index2/i5 sequences are in Forward orientation.

Look up table for index2 orientations in DRAGEN 4.4+

Bcl-convert SoftwareVersion must be >=4.4.
* indicates the situation where the IsReverseComplement flag in the RunInfo.xml is overridden by the RunInfoIndex2ReverseComplement value. NA means that the IsReverseComplement flag for index2 is not present in the RunInfo.xml file.
** indicates that legacy run folders may use the two paired flags to ensure that index2 Forward orientation is consistently applied.

Instrument Type

IndexOrientation

RunInfoIndex2ReverseComplement

Index2ColumnReverseComplement

IsReverseComplement

Index2 Orientation

Condition

NovaSeq 6000

Forward

N**

Forward

When SbsConsumableVersion <3

Forward

Y**

N**

Forward

When SbsConsumableVersion >=3

NovaSeq 6000Dx

Forward

When non-SP flow cell is used

Forward

When SP flow cell is used and control software is <2.4

NovaSeq X

Forward

Summary of Legacy Settings for index2 orientations

For backward compatibility, when the bcl-convert version specified is less than 4.4, the index2 orientation may vary depending on the instrument. In a sample sheet generated by the BaseSpace Run Planner, the IndexOrientation may still indicate Forward, but it is ignored in this situation.

Look up table for index2 orientations in earlier DRAGEN versions

Bcl-convert SoftwareVersion must be <4.4.
* indicates the situation where the IsReverseComplement flag in the RunInfo.xml is different depending on the control software version.

Instrument Type

IsReverseComplement

Index2 Orientation

Condition

NovaSeq 6000

Forward

When SbsConsumableVersion <3

Reverse

When SbsConsumableVersion >=3

NovaSeq 6000Dx

Forward

When non-SP flow cell is used

Forward

When SP flow cell is used and control software is >2.4

Reverse

When SP flow cell is used and control software is <2.4

NovaSeq X

Forward
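
When index2 sequences recorded in one orientation need to be compared with the other, the Reverse orientation is the reverse complement of the Forward sequence. The following is a minimal shell sketch, assuming index sequences contain only A, C, G, T, and N. The helper name revcomp is illustrative and is not part of DRAGEN or BCL Convert.

```shell
# revcomp: print the reverse complement of a DNA sequence (illustrative helper).
revcomp() {
    printf '%s\n' "$1" |
        tr 'ACGTNacgtn' 'TGCANtgcan' |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }'
}

revcomp "AGGCTATA"   # prints TATAGCCT
```

The complement step (tr) and the reversal step (awk) are kept separate so that either can be checked on its own.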

Run Planning

Sample Sheet Creation in BaseSpace

How to Create Sample Sheets in BaseSpace Run Planning tool

The sections below represent each step in the BaseSpace Run Planning tool.

Note that the NovaSeq X Series has a different run setup configuration screen than other instrument platforms. The software supports multiple analyses per run. To complete run setup on the NovaSeq X Series, enter the appropriate Read 1, Read 2, Index 1, and Index 2 values described in the instructions below.

Step 1: Run Settings

Step 2: Configuration

Note: On NovaSeq X Series, this page is called "Configuration 1". The right-hand corner of the UI displays the Read 1, Read 2, Index 1, and Index 2 values entered on the previous Run Settings screen.

Step 3: Sample Settings

Users can manually enter sample information, or download a template file to bulk upload sample information. Users can import the completed template or a compatible sample sheet.

Step 4: Run Review

Once all details are captured and pass validation, the user can review the details on the Run Review screen. From here they can choose to edit details in previous screens or export the sample sheet. Once completed, press the Cancel button to finish run planning.

Note: After you leave this screen, the run and sample sheet are no longer accessible.

For NovaSeq X Plus users, the run can be saved as a draft or as a planned run (via the “Save as Draft” and “Save as Planned” buttons respectively). Either selection saves the run to the Planned Runs screen on BaseSpace. There is no option to export the sample sheet on this screen.

Planned Runs Screen (NovaSeq X Series only)

The Planned Runs screen lists all planned or drafted runs. Users can set drafted runs to planned, export the sample sheet, and edit or delete a run on this screen.

Once the run is saved as Planned, it will appear on the NovaSeq X Series instrument where it can be selected for sequencing.

Guided Examples based on TSO 500

Please review these guided examples of TSO 500 analysis workflows that include a step of setting up a run in BaseSpace Run Planning tool:

Custom Config Support

BSSH Run Planner Setup

On the BSSH Run Planner, custom parameters and custom resource files can also be specified during Run Planning.

Custom resource files must be uploaded to BaseSpace under the same project to be selectable during run planning. Supported customizable options are described in the Custom Configuration Support section of each application.

BSSH Run Planner UI Example

DRAGEN Server App

Analysis on DRAGEN Server

Prerequisites

DRAGEN Phase 3 or 4 server
DRAGEN License
Network storage server

DRAGEN server

The DRAGEN phase 4 server is recommended, especially for datasets from NovaSeq X instruments. The server has 12 TB of intermediate data storage space for full processing of a NovaSeq X 25B flow cell.

The DRAGEN phase 3 server has 6 TB of intermediate data storage space, which can accommodate flow cells from the NovaSeq 6000 or 6000Dx instruments.

DRAGEN license

The Heme pipeline uses the standard DRAGEN license without requiring any special licenses.

NFS and CIFS file servers

The Heme pipeline is designed to stream data from a network file server onto the DRAGEN server, complete the analysis using the /staging area of the high performance SSD and then stream the analysis output back to the network file server.

The network file server may be mounted to the DRAGEN server using the NFS or CIFS protocol (SMB 1.0). SMB 2.0 or higher is recommended with Active Directory support if the SMB protocol is used.

Starting from BCL Files

If starting from BCL (*.bcl) files, the Heme pipeline requires the run folder to contain certain files and folders.

The run folder contains data from the sequencing run. Make sure that the folder contains the following files:

Starting from FASTQ Files

The following inputs are required for running the Heme pipeline using FASTQ (*.fastq) files.

Full path to an existing FASTQ folder.
The sample sheet is in the FASTQ folder path, or you can set the path to the sample sheet with the --sampleSheet override command line option.

Make sure there is sufficient disk space for the analysis to complete. Refer to the --help command line argument details for disk space requirements.

Use BCL Convert to produce FASTQ files for the Heme pipeline. Using bcl2fastq does not produce the same results and is discouraged.

FASTQ File Organization

Store FASTQ files in individual subfolders that correspond to a specific Sample_ID. Keep file pairs together in the same folder. Alternatively, store all FASTQ files together in a single flat folder.

The Heme pipeline requires separate FASTQ files per sample. Do not merge FASTQ files.

The instrument generates two FASTQ files (Read 1 and Read 2) per sample for each flow cell lane, so a sample sequenced across four lanes has eight FASTQ files.

Sample1_S1_L001_R1_001.fastq.gz

Sample1 represents the Sample ID.
The S in S1 means sample, and the 1 in S1 is based on the order of samples in the sample sheet, so S1 is the first sample.
L001 represents the flow cell lane number.
The R in R1 means Read, so R1 refers to Read 1.
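
The naming convention above can be split into its parts with standard tools. The following is a rough sketch, assuming file names follow the {SampleID}_S{n}_L{lane}_R{read}_001.fastq.gz pattern shown in the example. The function name parse_fastq_name and the key=value output format are illustrative only.

```shell
# parse_fastq_name: print the fields encoded in a FASTQ file name.
# Prints nothing if the name does not follow the expected pattern.
parse_fastq_name() {
    # Strip any leading directory, then split on the naming convention.
    basename "$1" |
        sed -n -E 's/^(.+)_S([0-9]+)_L([0-9]+)_(R[0-9])_[0-9]+\.fastq\.gz$/sample_id=\1 sample_number=\2 lane=\3 read=\4/p'
}

parse_fastq_name "Sample1_S1_L001_R1_001.fastq.gz"
# prints: sample_id=Sample1 sample_number=1 lane=001 read=R1
```

Because the sample ID is matched greedily, IDs that contain underscores or dashes are kept whole.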

Advanced Topics

The pipeline may be downloaded and installed on a local DRAGEN server. A download utility may be obtained from the Illumina download site, and the download utility will manage all the dependencies. Once the required installers are downloaded, the software may be installed by running the installers.

Using NFS for data streaming

With the NovaSeq X 25B flow cells, the amount of data is on the order of terabytes, which may take a few hours or more to copy to the /staging folder on the local DRAGEN server. Using NFS storage directly for input and output is recommended in this case.

Variant Interpretation

Illumina Connected Insights (ICI) supports variant interpretation with advanced visualization capabilities. It is available in the cloud or on a local DRAGEN server.

Data Management

Copying data to local /staging drive

Copy the run or FASTQ folder to the DRAGEN server into the staging folder with the following recommended organization: /staging/runs/{RunID}. You can copy the run folder onto the DRAGEN server using Linux commands such as rsync. The sample sheet within the run folder is used unless otherwise specified through the command line.
Run folder must be intact.
If the analysis output folder path is different from the default, provide the analysis output folder path.
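
The copy step can be sketched as follows. This assumes that the RunID is the name of the run folder, that rsync is installed, and that the caller can write to /staging/runs. The function names and the example source path are hypothetical.

```shell
# staging_dest: derive the recommended /staging/runs/{RunID} path
# from a run folder path (assumes RunID is the run folder name).
staging_dest() {
    printf '/staging/runs/%s\n' "$(basename "$1")"
}

# copy_run_to_staging: copy the contents of a run folder into its
# staging destination, preserving the folder layout.
copy_run_to_staging() {
    src=${1%/}                      # remove one trailing slash, if present
    dest=$(staging_dest "$src")
    mkdir -p "$dest" && rsync -a "$src/" "$dest/"
}

# Example (hypothetical run folder on a mounted file server):
# copy_run_to_staging /mnt/nas/runs/MyRunFolder
```

rsync can be re-run after an interrupted transfer, which helps keep the copied run folder intact.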

Analysis output directory

Before running the analysis, confirm that the output directory for the software to write to is empty and does not include results of previous analyses.
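
One way to check this before launching is sketched below. Treating a directory that does not exist yet as acceptable is an assumption of this sketch, and the function name is illustrative.

```shell
# is_empty_dir: succeed if the path does not exist yet or is a directory
# with no entries (hidden files count as entries).
is_empty_dir() {
    [ ! -e "$1" ] && return 0
    [ -d "$1" ] || return 1
    [ -z "$(ls -A "$1")" ]
}

# Example:
# is_empty_dir /staging/output-folder || echo "output folder is not empty" >&2
```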

Storage Requirements

The DRAGEN server provides an NVMe SSD in the /staging directory to use as the software output directory. Network-attached storage is required for long-term storage.

When running the Heme pipeline, use the default settings or set the --analysisFolder command line option to a directory in /staging to make sure the DRAGEN server processes read and write data on the NVMe SSD.

Before beginning analysis, develop a strategy to copy data from the DRAGEN server to a network‑attached storage. Delete output data on the DRAGEN server as soon as possible.

The following are the run folder output size estimates and the minimum free space requirements for fastq.gz or fastq.ora output format.

When launching the analysis, the software checks that the minimum disk space required is available. If the minimum disk space is not available, the software shows an error message and prevents analysis from starting. If disk space is exhausted during a run, the run shows an error and stops analyzing.
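
The software performs this check itself, but a similar pre-flight check can be run by hand before copying data. The sketch below uses df. The function names are illustrative, and the required size must be taken from the product documentation or the --help output, because no value is assumed here.

```shell
# free_kb: print the available space, in 1024-byte blocks, on the
# file system that holds the given path.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_free_space PATH REQUIRED_GB: succeed if at least REQUIRED_GB
# (counted as 1024 * 1024 blocks of 1 KiB) are available at PATH.
has_free_space() {
    required_kb=$(( $2 * 1024 * 1024 ))
    [ "$(free_kb "$1")" -ge "$required_kb" ]
}

# Example, with REQUIRED_GB set by the user:
# has_free_space /staging "$REQUIRED_GB" || echo "not enough free space" >&2
```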

Moving or modifying files during an analysis may cause the analysis to fail or provide incorrect results.

Data streaming from Network Filesystem

Illumina Connected Insights Local

ICA Cloud App

Analysis in the ICA Cloud environment

Prerequisites

Basic ICA Subscription
Basic ICI Subscription (if desired)

Variant Interpretation with Illumina Connected Insights

Automatic Ingestion of Heme Analysis on ICA to ICI

Access to Illumina Connected Analytics
Access to Illumina Connected Insights

DRAGEN Heme WGS Tumor Only Pipeline

Overview

DRAGEN Heme WGS Tumor Only Pipeline, hereafter referred to as the Heme pipeline, is a comprehensive and unbiased whole genome sequencing solution to replace conventional cytogenetic and panel sequencing approaches for detecting all types of mutation using a limited amount of DNA. It can be applied to Heme samples to detect clinically actionable cancer mutations spanning a wide range of genomic events, including structural variants (SV), copy number alterations (CNA), small variants (SNV, insertion, deletion, delins), internal tandem duplications (ITD), and DUX4 variants.

The Heme pipeline includes a DNA-only workflow designed to analyze whole genome sequencing data generated on supported instruments. It may be run as a local off-instrument solution installable on a DRAGEN server or accessible through the Illumina Connected Analytics (ICA) cloud environment. The Heme pipeline is for Research Use Only (RUO).

Features

Superb performance based on the DRAGEN BioIT platform Release 4.4.4
Supports starting the analysis from BCL, FASTQ, BAM or CRAM as inputs.
Flexible, custom-configurable options on top of well-established DRAGEN recipes for Heme WGS analysis.
Available on local DRAGEN servers and Illumina Connected Analytics (ICA)
Seamless integration with Illumina Connected Insights (ICI) for tertiary interpretation

Supported Library Prep Kits (LPKs)

Illumina DNA PCR Free Prep Kit
Illumina DNA Prep Kit
Custom LPKs

Supported Sequencing Instruments

NovaSeq 6000 or 6000Dx in RUO mode
NovaSeq X or NovaSeq X Plus

Note: Data from unsupported instruments can still be analyzed, but a warning is generated.

Supported Flow Cells

NovaSeq 6000 or 6000Dx S4
NovaSeq X or NovaSeq X Plus 10B, 25B

Quick Start

Quick Start Guide

Table 1. Release Information

Execution Environment

software version

Client program

location

Note

Local DRAGEN Server

4.4.4.62

run_Heme_WGS_TO_{version}.sh

/usr/local/bin

ICA

a11697ba-1144-4dc6-9e22-f21dff29f747

icav2

ICA Pipelines

ICA

urn:ilmn:ica:pipeline:a11697ba-1144-4dc6-9e22-f21dff29f747#Heme_WGS_TO_v4_4_4_62

supported browser

ICA UI

{version} is used to represent the software version number in Table 1 above. Similarly, <pipeline_run_script> is used to indicate the client program name in this document.

Download, Install and Execute on a Local Server

Run analysis on a local DRAGEN Server

The command line program may be used to launch an analysis by using the <pipeline_run_script> with the appropriate options.

Start from BCL

<pipeline_run_script> --help # list all supported parameters
<pipeline_run_script> --inputType bcl \
--inputFolder /staging/input-folder \
--analysisFolder /staging/output-folder

Start from one or more input folders when using FASTQ, BAM, or CRAM files

Multiple input folders may be specified as a comma-separated list when using FASTQ, BAM, or CRAM files as input.

<pipeline_run_script> --inputType <fastq|bam|cram> \
--inputFolder /staging/input-folder-1,/staging/input-folder-2 \
--analysisFolder /staging/output-folder

Pressing Ctrl+C during a DRAGEN step stops the currently running analysis and might cause an FPGA error. To recover from an FPGA error, shut down and restart the server.

Run analysis on ICA using the icav2 client

Here is an example of starting an analysis with the ICA client by providing the necessary command parameters and specifying a particular storage size for the analysis in ICA.

icav2 projectpipelines start nextflow ${PIPELINE_ID} \
--project-id ${ANY_PROJECT_ID} \
--storage-size Large \
-o json \
--input ${ANY_SAMPLE_SHEET} \
--input ${ANY_INPUT_DIR} \
--parameters inputType:'bcl' \
--parameters referenceGenome:'hg38' \
--parameters oraCompressionEnabled:'true' \
--parameters sampleIds:'1267-Prostate-Del-R1,741-Lung-SNV-R1' \
--user-reference ${ANY_USER_REFERENCE}

Run analysis on ICA using UI

Turn Around Time Comparison

ICA

Coming soon.

Local Server

Local Server Only

Coming soon.

Data Streaming from NFS

Coming soon.

Sample Sheet Requirements

The pipeline has fields that are required in addition to general sample sheet requirements. Follow the steps below to create a valid sample sheet.

Standard Sample Sheet Requirements

The following sample sheet requirements describe required and optional fields for the pipeline. Depending on the deployment (standalone DRAGEN server, ICA with auto-launch, ICA with manual launch), certain sections and required values can deviate from the standard requirements. These deviations are noted in the information below.

The analysis fails if the sample sheet requirements are not met.

Use the following steps to create a valid sample sheet.

Download the sample sheet v2 template that matches the instrument & assay run.
In the Sequencing Settings section, enter the following required parameters:

[Sequencing_Settings] Section

Sample Parameter

Required

Details

LibraryPrepKits

Required

Accepted values are: IlluminaDNAPrep or IlluminaDNAPCRFree

In the BCL Convert Settings section, enter the following required parameters:

[BCLConvert_Settings] Section

Sample Parameter

Required

Details

SoftwareVersion

Required

The DRAGEN component software version. The pipeline requires 4.4.4 or higher. To ensure you are using the latest compatible version, refer to the software release notes.

AdapterRead1

Required

If using 10 bp indexes with UDP: CTGTCTCTTATACACATCTCCGAGCCCACGAGAC (analysis fails if incorrect adapter sequences are used).

AdapterRead2

Required

If using 10 bp indexes with UDP: CTGTCTCTTATACACATCTGACGCTGCCGACGA (analysis fails if incorrect adapter sequences are used).

AdapterBehavior

Optional

Enter trim. This indicates that the BCL Convert software trims the specified adapter sequences from each read.

MinimumTrimmedReadLength

Optional

Enter 35. Reads with a length trimmed below this point are masked.

MaskShortReads

Optional

Enter 35. Reads with a length trimmed below this point are masked.

In the BCL Convert Data section, enter the following parameters for each sample.

[BCLConvert_Data] Section

Sample Parameter

Required

Details

Sample_ID

Required

Must match a Sample_ID listed in the [Heme_Data] section.

Index

Required

Index 1 sequence valid for Index_ID assigned to matching Sample_ID in the [Heme_Data] section.

Index2

Required

Index 2 sequence valid for Index_ID assigned to matching Sample_ID in the [Heme_Data] section.

Lane

Only for NovaSeq 6000 XP, NovaSeq 6000Dx, or NovaSeq X workflows

Indicates which lane corresponds to a given sample. Enter a single numeric value per row. Cannot be empty: the analysis fails if the Lane column is present without a value in each row.

In the [Heme_Data] section, enter the following parameters:

The [Heme_Data] section header changes depending on the deployment:

Standalone DRAGEN Server and ICA with Manual Launch: Heme_Data
ICA with Auto-launch: Cloud_Heme_Data

[Heme_Data] Section

Sample Parameter

Required

Details

Sample_ID

Required

The unique ID to identify a sample. The sample ID is included in the output file names. Sample IDs are not case sensitive. Sample IDs must have the following characteristics: - Unique for the run. - 1–70 characters. - No spaces. - Alphanumeric characters with underscores and dashes. If you use an underscore or dash, enter an alphanumeric character before and after the underscore or dash, for example, Sample1-T5B1_022515. - Cannot be called all, default, none, unknown, undetermined, stats, or reports. - Must match a Sample_ID listed in the [BCLConvert_Data] section. Each sample must have a unique combination of Lane (if applicable), sample ID, and index ID, or the analysis fails.

Sample_Type

Optional

Enter DNA

Case_ID

Optional

A unique ID that links the same biological samples from the same individual. It is used for variant interpretation in downstream software such as Illumina Connected Insights.

Sample_Description

Optional

Sample description must meet the following requirements: - 1–50 characters. - Alphanumeric characters with underscores, dashes, and spaces. If you enter an underscore, dash, or space, enter an alphanumeric character before and after it, for example, heme-WGS_213.
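
The Sample_ID character rules can be checked before a run is launched. The following is a minimal sketch of such a check. It covers length, allowed characters, and reserved names only. Uniqueness within the run and agreement with the [BCLConvert_Data] section are not checked. Comparing reserved names without regard to case is an assumption based on sample IDs not being case sensitive, and the function name is illustrative.

```shell
# is_valid_sample_id: succeed if the ID meets the character rules.
is_valid_sample_id() {
    id=$1
    # Length must be 1-70 characters.
    [ "${#id}" -ge 1 ] && [ "${#id}" -le 70 ] || return 1
    # Alphanumeric runs separated by single underscores or dashes; no spaces.
    printf '%s\n' "$id" | grep -Eq '^[A-Za-z0-9]+([_-][A-Za-z0-9]+)*$' || return 1
    # Reserved names, compared without regard to case (assumption).
    case $(printf '%s' "$id" | tr '[:upper:]' '[:lower:]') in
        all|default|none|unknown|undetermined|stats|reports) return 1 ;;
    esac
    return 0
}

# Example:
# is_valid_sample_id "Sample1-T5B1_022515" && echo "ok"
```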

To ensure a successful analysis, follow these guidelines:

Avoid any blank lines at the end of the sample sheet; these can cause the analysis to fail.
When running local analysis using the command line, save the sample sheet in the sequencing run folder with the default name SampleSheet.csv, or choose a different name and specify the path in the command-line options.
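
A quick structural check of a sample sheet can catch some of these problems before launch. The sketch below is for standalone DRAGEN server or ICA manual launch sample sheets, where the section is named [Heme_Data]. It only confirms that the sections described above are present and that the file does not end with a blank line. It does not validate field values, and the function name is illustrative.

```shell
# check_sample_sheet: report missing sections and a trailing blank line.
check_sample_sheet() {
    sheet=$1
    status=0
    for section in Sequencing_Settings BCLConvert_Settings BCLConvert_Data Heme_Data; do
        # Section headers may carry trailing commas when saved from a spreadsheet.
        if ! tr -d '\r' < "$sheet" | grep -Eq "^\[$section\],*\$"; then
            echo "missing section: [$section]" >&2
            status=1
        fi
    done
    # Flag an empty or whitespace-only final line.
    if tail -n 1 "$sheet" | tr -d '\r' | grep -Eq '^[[:space:]]*$'; then
        echo "blank line at end of sample sheet" >&2
        status=1
    fi
    return $status
}

# Example:
# check_sample_sheet /staging/runs/MyRunFolder/SampleSheet.csv
```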

ICA with Auto-launch: Sample Sheet Requirements

To auto-launch analysis from the sequencer run folder, ensure the StartsFromFastq and SampleSheetRequested fields are set to FALSE. To auto-launch analysis from FASTQs after BCL Convert auto-launch, the StartsFromFastq and SampleSheetRequested fields must be set to TRUE.

[Cloud_Heme_Data] Section

[Cloud_Heme_Settings] Section

Parameters

Required

Details

SoftwareVersion

Not Required

The Heme pipeline software version

StartsFromFastq

Required

Set the value to TRUE or FALSE. To auto-launch from BCL files, set to FALSE. To auto-launch from FASTQ files after auto-launch of BCL Convert, set to TRUE.

SampleSheetRequested

Required

Set the value to TRUE or FALSE. To auto-launch from BCL files, set to FALSE. To auto-launch from FASTQ files after auto-launch of BCL Convert, set to TRUE.

[Cloud_Data] Section

Parameters

Required

Details

Sample_ID

Not Required

The same sample ID used in the Cloud_Heme_Data section.

ProjectName

Not Required

The BaseSpace project name.

LibraryName

Not Required

Combination of sample ID and index values in the following format: sampleID_Index_Index2

LibraryPrepKitName

Required

The Library Prep Kit used.

IndexAdapterKitName

Not Required

The Index Adapter Kit used.

[Cloud_Settings] Section

Parameter

Required

Details

GeneratedVersion

Not Required

The cloud GSS version used to create the sample sheet. Optional if manually updating a sample sheet.

CloudWorkflow

Not Required

Ica_workflow_1

Cloud_Heme_Pipeline

Required

BCLConvert_Pipeline

Required

The value is a URN in the following format: urn:ilmn:ica:pipeline:<pipeline-ID>#<pipeline-name>

Templates

Description

Sample Sheet templates for the Heme pipeline for standalone DRAGEN server and ICA manual launch analysis can be found in the table below. For auto-launch compatible sample sheets, use BaseSpace Run Planner.

The Heme pipeline is compatible with several instruments and assay workflows (standard, XP), each of which has implications for the sample sheet.

Templates

Sample sheet templates contain all required fields, including index sequences in the proper orientation for all indexes from a given library prep kit. The templates are provided as a starting point for creating a sample sheet manually when launching analysis on a standalone DRAGEN server or on ICA using manual launch.

*Lane numbers cannot exceed what is supported by the flow cell in use.

DRAGEN Server App

Installation Procedure on DRAGEN Server

Downloader

Choose the downloader appropriate for your platform. When executed, it prompts you for a path to download the assets to. The required software packages are downloaded into the dragen_pipelines directory under the path provided at the prompt. If the path provided was used for a previous execution of the downloader, any incomplete downloads are resumed, existing files are checksummed, and any files with invalid checksums are re-downloaded.

The downloaded directory content may be moved to the installation target DRAGEN server using a USB key with at least 128 GB of free space or by copying to Network Storage which is reachable from the target DRAGEN Server.

Downloader System Requirements

Expected downloaded content

- dragen-app-manager-1.0.14-1.x86_64-el8-offline.run
- README
- - install_Heme_WGS_TO_v4.4.4.62.run
  - Heme_WGS_TO_4.4.4.62.iapp
  - README
- - dpf-core_1.0.0.36.ires
  - dpf-templates_4.4.4.52.ires
  - dpf-docker-images_4.4.4.52.ires
  - dragen-4.4.4-12.multi.el8.x86_64.run
  - heme_wgs_to_resources_4.4.4.2.ires
  - hg38-alt_masked.cnv.hla.methyl_cg.methylated_combined.rna-11-r5.0-1.ires
  - hs37d5_chr-cnv.hla.methyl_cg.methylated_combined.rna-11-r5.0-1.ires
  - variant_annotation_data-tmb_annotations-4.4.4-1.ires

Installer

Installation Requirements

DRAGEN and DRAGEN Application Manager

The pipeline requires DRAGEN v4.4.4 or higher. If this version of DRAGEN (or higher) is not already installed when the app is installed, the installer installs it.

Minimum System Operating Requirements

Hardware

v3 DRAGEN server or v4 DRAGEN server
mkfifo is enabled on the network-attached storage (NAS).

Software

The software installed by default on the DRAGEN server includes the following items:

DRAGEN server software. Refer to sample sheet settings for the DRAGEN version number.
Oracle Linux 8

Storage

DRAGEN server v3 provides a 6.4 TB NVMe SSD. This SSD is located at the /staging directory and is suitable for storing only one or two runs of the analysis pipeline.
DRAGEN server v4 provides 12.8 TB via a 2 x 6.4 TB NVMe U.2 SSD configuration.
Consider the following when making data storage decisions.
- A NovaSeq 6000 sequencing run that uses an S4 flow cell can produce up to 3 TB of output.
  - The Heme pipeline can produce an additional 4-6 TB of analysis output.
- For optimal performance when writing to a non-default directory, specify an analysis folder location on /staging. This ensures that the DRAGEN-related processes read and write data on the DRAGEN server's high-speed NVMe SSD.
- Network-attached storage is required for long-term storage of sequencing runs and Heme pipeline output.
- Managing data storage is your responsibility.
  - Illumina recommends developing a strategy to copy data from the DRAGEN server to network-attached storage.

Installation Instructions

Installing the Heme pipeline requires root privileges.
- Follow the instructions for DRAGEN license installation provided by Illumina Customer Care or refer to the DRAGEN server documentation.
Copy the directory structure from the downloader directory to the target DRAGEN server (or to a path accessible with sudo privileges).
Ensure the installer is executable by running: chmod +x install_Heme_WGS_TO_v{version}.run
Launch the installer with root privileges: sudo /path/to/install_Heme_WGS_TO_v{version}.run
- If DRAGEN Application Manager is not already installed, the installer will exit and direct you to the path to the DRAGEN Application Manager installer

Run Self-Test Script

The self-test script, present after app installation, checks the following functions:

All required services are running.
All resources are in place.
The analysis workflow image can be launched.
The Heme pipeline can run successfully on a test dataset.

To run the self-test script, execute:

The following output will show if installation is completed successfully.

If the self-test prints a failure message, contact Illumina Technical Support, and provide the output file found in /staging/check_Heme_WGS_TO_{timestamp}.tgz.

When running an analysis on the DRAGEN server via SSH, Illumina recommends that you use a terminal multiplexer utility, which allows you to resume analysis in the event of a disconnection from the DRAGEN server.

Uninstall Heme pipeline

To uninstall the Heme pipeline, run the following command:

Executing the uninstall script removes the following assets:

All scripts, including:
- run_Heme_WGS_TO_{version}.sh
- check_Heme_WGS_TO_{version}.sh
- uninstall_Heme_WGS_TO_{version}.sh
- The application installed under DRAGEN Application Manager

If the uninstall script is executed with the -r or --removeResources flag, dependencies of the application being uninstalled will be removed if no other applications depend on them.

You are not required to uninstall DRAGEN Application Manager, Docker, or the DRAGEN server software.

To remove Docker, review the install instructions for your operating system in the Docker documentation

Launching Analysis

Overview

Run on DRAGEN Server

Getting Started

Analysis output is written to /staging/DRAGEN_Heme_WGS_Tumor_Only_Pipeline_{version}_Analysis_{datetimestamp} by default. To write to a different output directory, run the bash script with --analysisFolder <FULL_PATH_TO_ANALYSIS_FOLDER>.

The --demultiplexOnly flag runs the pipeline through FASTQ Generation only, and these outputs can be used for splitting a run into smaller batch analyses with --inputType fastq and the --sampleIDs argument.

Command Line Options

Overview

Command line options

For command-line options, refer to Table 1: Shell Script Command-Line Options for details.

Table 1: Shell Script Command-Line Options

CAUTION: Do not run analyses as the root user as it can lead to permissions issues when managing data generated by the software.

Local Specific Output

Local output management

On DRAGEN server, Nextflow logs are contained in the Work folder in a hierarchical folder structure organized by the tasks in the pipeline_trace.txt. These files are prefixed with "." and hidden from normal view.

Advanced Topics

Demultiplex only option

To break up the workflow, run the software with the --demultiplexOnly option. The pipeline performs FASTQ generation with the default settings or with the settings specified in the sample sheet. The subsequent analysis can then start from the FASTQ files.
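
After a demultiplex-only run, the sample IDs can be grouped into smaller batches for separate FASTQ analyses. The sketch below only prints candidate launch commands and does not run them. It assumes that --sampleIDs accepts a comma-separated list, by analogy with the sampleIds parameter in the icav2 example, which should be confirmed with --help. The folder paths and the function name are hypothetical.

```shell
# batch_sample_ids SIZE ID...: print one comma-separated batch of
# sample IDs per line, with at most SIZE IDs in each batch.
batch_sample_ids() {
    size=$1
    shift
    [ "$#" -gt 0 ] || return 0
    printf '%s\n' "$@" |
        awk -v n="$size" '
            { line = (NR % n == 1 || n == 1) ? $0 : line "," $0 }
            NR % n == 0 { print line; line = "" }
            END { if (line != "") print line }'
}

# Print (rather than run) one launch command per batch of two samples,
# giving each batch its own empty analysis folder.
n=0
batch_sample_ids 2 SampleA SampleB SampleC |
while IFS= read -r ids; do
    n=$((n + 1))
    echo "<pipeline_run_script> --inputType fastq --inputFolder /staging/fastq-folder --analysisFolder /staging/output-batch-$n --sampleIDs $ids"
done
```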

CRAM input

DNA Germline Panel

The DRAGEN recipe includes the recommended pipeline specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN pangenome hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
# Inputs 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
--enable-duplicate-marking true         #default=true 
# Small variant caller 
--enable-variant-caller true 
--vc-target-bed $VC_TARGET_BED 
# Annotation 
--variant-annotation-data PATH 
--enable-variant-annotation true 
# SV 
--enable-sv true 
--sv-exome true 
--sv-call-regions-bed $SV_TARGET_BED 
# CNV 
--enable-cnv true 
--cnv-target-bed $PATH 
--cnv-combined-counts $PATH             #CNV PON 
# HLA genotyper 
--enable-hla true 
--hla-enable-class-2 true               #if panel covers class II HLA regions 
--hla-as-filter-min-threshold 29.0      #panel specific setting 
--hla-as-filter-ratio-threshold 0.85    #panel specific setting

Notes and additional options

Hashtable

For DRAGEN germline runs, it is recommended to use the pangenome hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

--fastq-list $PATH 
--fastq-list-sample-id $STRING

FQ Input

--fastq-file1 $PATH 
--fastq-file2 $PATH 
--RGSM $STRING 
--RGID $STRING

BAM Input

--bam-input $PATH

CRAM Input

--cram-input $PATH

Mapping and Aligning

Option

Description

--enable-map-align true

Optionally disable map & align (default=true).

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

Duplicate Marking

Option

Description

--enable-duplicate-marking true

By default, DRAGEN marks duplicate reads and excludes them from variant calling.

--enable-positional-collapsing true

Alternative to --enable-duplicate-marking=true. Instead of discarding duplicate reads, DRAGEN can optionally perform positional collapsing, merging them into higher-quality consensus reads. This is beneficial for small panels without UMIs and coverage between 300X and 1000X. However, it's slower than standard duplicate marking and less effective on samples with coverage lower than 300X. For very high coverage (1000X+), avoid it due to potential read collisions. For high-sensitivity panels with 1000X+ coverage, consider using UMIs.
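
The coverage guidance above can be written as a small decision helper for panels without UMIs. This is one reading of the thresholds, not a DRAGEN feature. Treating exactly 1000X as part of the "1000X+" range and falling back to standard duplicate marking outside the 300X to 1000X window are assumptions of this sketch.

```shell
# suggest_dup_handling COVERAGE: print a duplicate-handling option for a
# panel without UMIs, given the expected coverage as an integer
# (for example, 500 for 500X).
suggest_dup_handling() {
    if [ "$1" -ge 300 ] && [ "$1" -lt 1000 ]; then
        printf '%s\n' "--enable-positional-collapsing true"
    else
        # Below 300X collapsing is less effective; at 1000X and above,
        # read collisions become a concern and UMIs are the suggested route.
        printf '%s\n' "--enable-duplicate-marking true"
    fi
}

suggest_dup_handling 500   # prints --enable-positional-collapsing true
```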

SNV

DRAGEN SNV VC employs machine learning based variant recalibration (DRAGEN-ML). It processes read and other contextual evidence to remove false positives, recover false negatives and reduce zygosity errors. No additional setup is required. DRAGEN-ML is enabled by default as needed, when running the germline SNV VC on hg19 or hg38.

Note that we do not recommend changing the default QUAL thresholds of 3 for DRAGEN-ML and 10 for DRAGEN without ML. These values differ from each other because DRAGEN-ML improves the calibration of QUAL scores, leading to a change in the scoring range.

Option

Description

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-emit-ref-confidence GVCF

To enable gVCF output.

--vc-enable-vcf-output

To enable VCF file output during a gVCF run, set to true. The default value is false.

Annotation

HLA

Option

Description

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

CNV

Option

Description

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

CNV Panel of Normals (PON)

The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. CNV requires PON files for all targeted analyses (including panels, exomes, germline, tumor-only and tumor-normal workflows). It is recommended to use 30-100 normal samples when building the PON, but fewer may be used. If sample coverage noise is relatively stable, as few as 5 PON samples may yield acceptable results.

Follow the two steps below to generate a CNV PON:

Step 1. Generate target counts of individual normal samples.

Any options used for panel of normals generation (BED file, GC Bias Correction, etc) should be matched when processing the case sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN pangenome hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
--enable-cnv true 
--cnv-target-bed $PATH

Step 2. Combined counts generation.

Individual PON counts can be merged into a single file as a <prefix>.combined.counts.txt.gz file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN pangenome hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--enable-cnv true 
--cnv-generate-combined-counts true 
--cnv-normals-list $CNV_NORMALS_LIST

$CNV_NORMALS_LIST is a text file listing one path per line for each target counts file generated in step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz). The output is a PON file with the suffix .combined.counts.txt.gz. Use the PON file in case sample runs of DRAGEN CNV with the --cnv-combined-counts option.
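As a minimal sketch of preparing $CNV_NORMALS_LIST, the Python helper below collects the step 1 target counts files from a directory and writes one path per line. The function name, the flat directory layout, and the output file name are assumptions for illustration; only the two file suffixes come from the text above.

```python
from pathlib import Path

# Suffixes of the step 1 outputs named in the text above.
RAW_SUFFIX = ".target.counts.gz"
GC_SUFFIX = ".target.counts.gc-corrected.gz"


def write_cnv_normals_list(counts_dir, list_path, gc_corrected=True):
    """Write one absolute target-counts path per line and return the paths.

    Hypothetical helper: assumes all step 1 outputs sit directly in counts_dir.
    Raw and GC-corrected counts are never mixed in one list.
    """
    suffix = GC_SUFFIX if gc_corrected else RAW_SUFFIX
    paths = sorted(
        str(p.resolve())
        for p in Path(counts_dir).iterdir()
        if p.name.endswith(suffix)
    )
    if not paths:
        raise FileNotFoundError(f"no *{suffix} files found in {counts_dir}")
    Path(list_path).write_text("\n".join(paths) + "\n")
    return paths
```

The resulting file can then be passed to --cnv-normals-list in step 2.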

In some cases, an in-run PON containing germline samples from the same batch (i.e. sample source, DNA extraction, library prep and sequencing run) may provide superior normalization.

DNA Somatic Tumor-Normal Solid Panel UMI

The DRAGEN recipe includes the recommended pipeline specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
# Inputs 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
# UMI 
--umi-enable true 
--umi-source STRING                     #Default='qname' 
--umi-library-type STRING               #e.g. random-duplex 
--tumor-normal-has-umi STRING           #Sample(s) containing UMI ['tumor', 'both']. 
--umi-min-supporting-reads 2            #Default=2 
# Small variant caller 
--enable-variant-caller true 
--vc-target-bed $VC_TARGET_BED 
--vc-systematic-noise $PATH             #Optional 
--vc-enable-umi-solid true              #>= 1% VAF 
# SV 
--enable-sv true 
--sv-systematic-noise $PATH             #Optional 
--sv-exome true 
--sv-call-regions-bed $SV_TARGET_BED 
# CNV 
--enable-cnv true 
--cnv-use-somatic-vc-baf true 
--cnv-target-bed $PATH 
--cnv-combined-counts $PATH             #CNV PON 
# Annotation 
--variant-annotation-data $PATH 
--enable-variant-annotation true 
# TMB 
--enable-tmb true 
# HLA genotyper 
--enable-hla true 
--hla-enable-class-2 true               #if panel covers class II HLA regions 
--hla-as-filter-min-threshold 29.0      #panel specific setting 
--hla-as-filter-ratio-threshold 0.85    #panel specific setting 
# Microsatellite Instability (MSI) 
--msi-command tumor-normal 
--msi-microsatellites-file $PATH 
--msi-coverage-threshold 40
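
The recipe above is annotated with per-line comments, so it cannot be pasted into a shell verbatim. One possible way to drive it from a script is sketched below in Python: an ordered mapping of options is flattened into an argument list. The helper name, the placeholder paths and sample IDs, and the assumption that the dragen binary is on PATH are all illustrative; only a subset of the recipe's options is shown.

```python
import shlex


def build_dragen_argv(executable, options):
    """Flatten an ordered {option: value} mapping into an argv list.

    Booleans are rendered as the lowercase true/false strings used in the recipe.
    """
    argv = [executable]
    for name, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        argv += [f"--{name}", str(value)]
    return argv


# Placeholder paths and sample IDs (assumptions); replace with site-specific values.
options = {
    "ref-dir": "/path/to/linear_hashtable",
    "output-directory": "/path/to/output",
    "output-file-prefix": "sample1",
    "tumor-fastq-list": "/path/to/fastq_list.csv",
    "tumor-fastq-list-sample-id": "TUMOR_ID",
    "fastq-list": "/path/to/fastq_list.csv",
    "fastq-list-sample-id": "NORMAL_ID",
    "umi-enable": True,
    "umi-library-type": "random-duplex",
    "tumor-normal-has-umi": "both",
    "enable-variant-caller": True,
    "vc-enable-umi-solid": True,
}

# Assumes dragen is on PATH; the recipe itself uses /opt/dragen/$VERSION/bin/dragen.
argv = build_dragen_argv("dragen", options)
print(shlex.join(argv))
```

On a host with DRAGEN installed, the list could be handed to subprocess.run(argv, check=True); printing it first makes the final command easy to review.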

Notes and additional options

Hashtable

For DRAGEN somatic runs it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: FASTQ list, FASTQ, BAM, or CRAM.

FQ list Input

--tumor-fastq-list $PATH 
--tumor-fastq-list-sample-id $STRING 
--fastq-list $PATH 
--fastq-list-sample-id $STRING

FQ Input

--tumor-fastq1 $PATH 
--tumor-fastq2 $PATH 
--RGSM-tumor $STRING 
--RGID-tumor $STRING 
--fastq-file1 $PATH 
--fastq-file2 $PATH 
--RGSM $STRING 
--RGID $STRING

BAM Input

--tumor-bam-input $PATH 
--bam-input $PATH

CRAM Input

--tumor-cram-input $PATH 
--cram-input $PATH

Mapping and Aligning

Option

Description

--enable-map-align true

In the TN pipeline this must be set to false for BAM/CRAM input.

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

Fractional (Raw Reads) Downsampling

DRAGEN can subsample a random, fractional percentage of reads from an input file using the fractional downsampler. You can use downsampling to subsample data sets in order to simulate different amounts of sequencing. DRAGEN randomly subsamples reads from primary analysis without any modification (e.g. no trimming, no filtering, etc.).

Downsampling may be useful to reduce runtime on very deep samples. For Tumor-Normal analyses it is also recommended to use a normal sample with coverage that is less than the tumor sample. If the matched normal has deeper coverage than the tumor sample, then the fractional downsampler may be used to reduce coverage on the normal sample.

Option

Description

--enable-fractional-down-sampler

Set to true to enable fractional downsampling. The default value is false.

--down-sampler-normal-subsample

Specify the fraction of reads to keep as a subsample of normal input data. The default value is 1.0 (100%).

--down-sampler-tumor-subsample

Specify the fraction of reads to keep as a subsample of tumor input data. The default value is 1.0 (100%).

--down-sampler-random-seed

Specify the random seed for different runs of the same input data. The default value is 42.
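
To make the normal-sample case above concrete, the following Python sketch derives a value for --down-sampler-normal-subsample from estimated coverages. The function names and the target_ratio parameter are hypothetical; the text only states that the normal should have less coverage than the tumor, not by how much.

```python
def normal_subsample_fraction(tumor_coverage, normal_coverage, target_ratio=1.0):
    """Fraction of normal reads to keep so that normal depth is at most
    target_ratio * tumor depth. Never returns more than 1.0 (keep everything).
    """
    if tumor_coverage <= 0 or normal_coverage <= 0:
        raise ValueError("coverages must be positive")
    return min(1.0, target_ratio * tumor_coverage / normal_coverage)


def downsampler_options(tumor_coverage, normal_coverage, target_ratio=1.0):
    """Render the fractional downsampler options documented above."""
    fraction = normal_subsample_fraction(tumor_coverage, normal_coverage, target_ratio)
    return [
        "--enable-fractional-down-sampler", "true",
        "--down-sampler-normal-subsample", f"{fraction:.3f}",
    ]


# A 400x normal paired with a 100x tumor, aiming for half the tumor depth.
print(downsampler_options(100, 400, target_ratio=0.5))
```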

UMI

Option

Description

--umi-source STRING

Specify the input type for the UMI sequence. Options: qname, fastq, bamtag.

--umi-library-type STRING

Set the batch option for different UMI correction modes. Options: random-duplex, random-simplex, nonrandom-duplex.

--umi-nonrandom-whitelist $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The whitelist includes a valid UMI sequence per line.

--umi-correction-table $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The correction table defaults to the table used by TruSight Oncology: <INSTALL_PATH>/resources/umi/umi_correction_table.txt.gz.

--umi-min-supporting-reads INT

Specify the number of matching UMI input reads required to generate a consensus read. Any family with insufficient supporting reads is discarded. The default is 2.

--umi-metrics-interval-file $BED

Target region in BED format.

--umi-emit-multiplicity both

--umi-start-mask-length INT

Number of additional bases to ignore from start of read. The default is 0. To reduce FP optionally set to 1.

--umi-end-mask-length INT

Number of additional bases to ignore from end of read. The default is 0. To reduce FP optionally set to 3.

--tumor-normal-has-umi STRING

Specify if only the tumor, or if both the tumor and normal have UMIs. Options: 'both','tumor'.

SNV

Option

Description

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-systematic-noise $PATH

Systematic noise file. This filter is recommended for removing systematic noise observed in normal samples.

--vc-somatic-hotspots $PATH

DRAGEN has a default set of hotspot variants (positions and alleles) where it will assign an increased prior probability. Use this option to override with a custom hotspots file.

--vc-enable-liquid-tumor-mode true

Tumor-in-normal contamination. Only use if there is some tumor leakage in the normal control.

--vc-override-tumor-pcr-params-with-normal false

Mixed sample preparation. Only use if the tumor and normal samples exhibit different PCR (indel) noise patterns, e.g., due to using different sample preparation.

--vc-sq-filter-threshold $INT

Threshold for sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.

--vc-excluded-regions-bed $BED

High-coverage sequencing panels allow for the detection of low-frequency alleles. DRAGEN supports 3 main settings for improved sensitivity on low VAF variant calls.

High Sensitivity Option

Description

--vc-target-vaf FLOAT

The default is 0.03 (3%). Set to e.g. 0.01 to improve SNV sensitivity on 1% VAF variants (assuming sufficient coverage).

--vc-enable-umi-solid true

Optimized for 1% and higher VAFs on UMI (or read position collapsed) samples with approx 300-1000X coverage.

--vc-enable-umi-liquid true

Optimized for 0.1% and higher VAFs on UMI samples with 1000X or higher coverage as expected in liquid biopsies.

HLA

Option

Description

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

CNV

Option

Description

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

--cnv-normal-cnv-vcf $CNV_NORMAL_VCF

Annotation

TMB

Option

Description

--tmb-vaf-threshold FLOAT

Variant minimum allele frequency for usable variants (default=0.05)

--vc-callability-tumor-thresh INT

Required read coverage to use a site (default=50).

--tmb-enable-proxi-filter BOOL

Use variant VAF information to increase germline filtering. Recommended for tumor-only (TO), but not for tumor-normal (TN). May be overly aggressive at tagging variants as germline (default=false).

MSI

For panels it is recommended to post-process the file by intersecting the WES or WGS sites with the manifest. This will avoid using any off-target reads in the MSI analysis. For small panels it may be required to generate custom site files to ensure the panel covers at least 2000 sites. To generate custom MSI site files refer to the MSI Biomarker section in the user guide.

Option

Description and recommended setting

--msi-coverage-threshold INT

Minimum coverage for a microsatellite: 60 (default)

--msi-distance-threshold FLOAT

Minimum Jensen-Shannon distance between tumor and normal for a microsatellite: 0.1 (default)
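
For readers unfamiliar with the metric behind --msi-distance-threshold, the Python sketch below computes a Jensen-Shannon distance between two repeat-length histograms. How DRAGEN builds the per-site distributions and which logarithm base it uses are not stated in this document; the base-2 logarithm (giving a distance in [0, 1]) and the count-histogram inputs are assumptions.

```python
import math


def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram is empty")
    return [c / total for c in counts]


def js_distance(tumor_counts, normal_counts):
    """Jensen-Shannon distance (square root of the JS divergence, log base 2).

    Both inputs are read counts per repeat length over the same bins.
    """
    p, q = normalize(tumor_counts), normalize(normal_counts)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; bins with zero mass contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(0.0, divergence))


# Normal reads cluster at one repeat length; the tumor shows a shifted spread.
normal = [0, 2, 40, 3, 0]
tumor = [5, 15, 20, 4, 1]
print(js_distance(tumor, normal) >= 0.1)  # compare with the default threshold
```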

SV

Option

Description

--sv-call-regions-bed

Specifies a BED file containing the set of regions to call. Optionally gzip or bgzip format.

--sv-exclusion-bed

Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.

--enable-variant-deduplication true

Relevant when both SV and SNV callers are enabled in somatic workflows. Can increase sensitivity and prevent duplicated variants within genes such as FLT3 and KMT2A. Small indels in the structural variant VCF that also appear as passing calls in the small variant VCF are filtered out. DRAGEN creates a new VCF containing the SV VCF variants that do not match a variant in the SNV VCF file. The deduplicated SV VCF file name uses the prefix passed by --output-file-prefix followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases.

--sv-systematic-noise $BEDPE

Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bgzip compressed). Optional for Tumor-Normal, but strongly recommended for Tumor-Only.

--sv-somatic-ins-tandup-hotspot-regions-bed $BED

Specify a custom BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. The default file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with some padding on both sides (300 bps)

--sv-min-candidate-variant-size

Run SV caller and report all SVs/indels at or above this size. The default value is set to 10.

--sv-min-scored-variant-size

After candidate identification, only score and report SVs/indels at or above this size. The default value is set to 50. This parameter doesn't affect the somatic hotspot region.

Option

Recommended Value for Liquid Tumors (e.g. AML/MLL)

--sv-enable-liquid-tumor-mode true

DRAGEN can account for Tumor-in-Normal (TiN) contamination by running liquid tumor mode.

--sv-tin-contam-tolerance $TIN_CONTAM_TOLERANCE

Set the Tumor-in-Normal (TiN) contamination tolerance level.

Resource Files

DRAGEN requires resource files for components such as SNV, SV, and CNV. The following notes provide references for downloading these files or generating them for custom workflows or assays.

SNV Systematic Noise

Systematic noise files are considered essential in Tumor-Only workflows. They are also recommended for Tumor-Normal workflows.

Prebuilt

Prebuilt WES/WGS noise files

Description

WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FF

FFPE_WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FFPE (only hg38)

WES_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WES FF and FFPE

Custom

Prebuilt systematic noise files are available for WES or WGS applications. For these applications, it is considered optional to build custom noise files. For high-sensitivity applications, including panels, it is required to build custom noise files. For best accuracy, the normal samples should ideally closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30–70 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on each of approximately 30-70 normal samples.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--umi-enable true 
--umi-source STRING                     #default='qname' 
--umi-library-type STRING               #see 'UMI' 
--vc-detect-systematic-noise=true 
--vc-target-bed-padding 500 
--vc-emit-ref-confidence BP_RESOLUTION 
--vc-enable-vcf-output true 
--vc-enable-umi-solid true 
--vc-enable-germline-tagging=true 
--variant-annotation-data $PATH

For panels we create GVCF files. Gather the full paths to the small variant caller hard filtered GVCFs (not VCFs) from step 1 and create an input file ${GVCF_LIST} by specifying 1 file per line.

Step 2. Generate the final noise file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--build-sys-noise-vcfs-list ${GVCF_LIST}

SV Systematic Noise

Systematic noise files are recommended for Tumor-Normal workflows, and are considered essential for reducing FP calls in Tumor-Only workflows.

Prebuilt

Prebuilt WES/WGS noise files

Description

WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For WGS/WES FF/FFPE

IDPF_WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For HEME

Custom

It is considered optional to build a custom systematic noise file for WES or WGS applications, but for high sensitivity applications like panels it is strongly recommended. For best accuracy the normal samples should ideally closely match the sequencer, sample type, library prep and coverage of the tumor samples of interest. It is typically recommended to use 30 - 100 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--umi-enable true 
--umi-source STRING                     #default='qname' 
--umi-library-type STRING               #see 'UMI' 
--sv-detect-systematic-noise true

Step 2. Build the BEDPE file using input VCFs from previous step.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--sv-build-systematic-noise-vcfs-list $VCF_LIST   #one VCF per line

CNV Panel of Normals (PON)

If a matched normal is available it is recommended to include it in the PON.

Follow the two steps below to generate a CNV PON:

Step 1. Generate target counts of individual normal samples.

Any options used for panel of normals generation (BED file, GC Bias Correction, etc) should be matched when processing the case sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--enable-cnv true 
--cnv-target-bed $PATH

Step 2. Combined counts generation.

Individual PON counts can be merged into a single file as a <prefix>.combined.counts.txt.gz file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--enable-cnv true 
--cnv-generate-combined-counts true 
--cnv-normals-list $CNV_NORMALS_LIST

Small Variant Calling

The DRAGEN Small Variant Caller is a high-speed haplotype caller implemented with a hybrid of hardware and software. The caller performs localized de novo assembly in regions of interest to generate candidate haplotypes, and then performs read likelihood calculations using a hidden Markov model (HMM).

Variant calling is disabled by default. To enable variant calling, set the --enable-variant-caller option to true. The VCF header is annotated with ##source=DRAGEN_SNV to indicate the file is generated by the DRAGEN SNV pipeline.

The Variant Caller Algorithm

The DRAGEN Small Variant Caller performs the following steps:

Active Region Identification---Identifies areas where multiple reads disagree with the reference, and selects windows around them (active regions) for processing.

Localized Haplotype Assembly---Assembles all overlapping reads in each active region into a de Bruijn graph (DBG). A DBG is a directed graph based on overlapping K-mers (length K subsequences) in each read or multiple reads. When all reads are identical, the DBG is linear. Where there are differences, the graph forms bubbles of multiple paths that diverge and rejoin. If the local sequence is too repetitive and K is too small, cycles can form, which invalidate the graph. DRAGEN uses K=10 and 25 as the default values. If those values produce an invalid graph, then additional values of K=35, 45, 55, 65 are tried until a cycle-free graph is obtained. From this cycle-free DBG, DRAGEN extracts every possible path to produce a complete list of candidate haplotypes, i.e., hypotheses for what the true DNA sequence might be on at least one strand. In addition to graph assembly, haplotypes are also generated via columnwise detection, with candidate variant events identified directly from BAM alignments. Columnwise detection is enabled by default in all small variant calling pipelines and is supplementary to the DBG, but is especially useful in highly repetitive regions where DBG assembly of reads is more likely to fail.

Haplotype Alignment---Uses the Smith-Waterman algorithm to align each extracted haplotype to the reference genome. The alignments determine what variations from the reference are present.

Read Likelihood Calculation---Tests each read against each haplotype to estimate a probability of observing the read assuming the haplotype was the true original DNA sampled. This calculation is performed by evaluating a pair hidden Markov model (HMM), which accounts for the various possible ways the haplotype might have been modified by PCR or sequencing errors into the read observed. The HMM evaluation uses a dynamic programming method to calculate the total probability of any series of Markov state transitions arriving at the observed read.

Genotyping---Forms the possible diploid combinations of variant events from the candidate haplotypes and, for each combination, calculates the conditional probability of observing the entire read pileup. Calculations use the constituent probabilities of observing each read, given each haplotype from the pair HMM evaluation. These calculations feed into the Bayesian formula to calculate the likelihood that each genotype is the genotype of the sample being analyzed, given the entire read pileup observed. Genotypes with maximum likelihood are reported.
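
The localized haplotype assembly step above can be illustrated with a small Python sketch: build a de Bruijn graph from reads, test it for cycles, and retry with a larger K when a cycle is found. This is a simplified teaching model, not DRAGEN's implementation; the function names are hypothetical, and trying the K values strictly in ascending order is an assumption about how the listed values are used.

```python
from collections import defaultdict


def build_dbg(reads, k):
    """Nodes are (k-1)-mers; every k-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def has_cycle(graph):
    """Kahn's topological sort: nodes left unprocessed imply a cycle."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    indegree = {n: 0 for n in nodes}
    for targets in graph.values():
        for v in targets:
            indegree[v] += 1
    queue = [n for n in nodes if indegree[n] == 0]
    processed = 0
    while queue:
        node = queue.pop()
        processed += 1
        for v in graph.get(node, ()):
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return processed < len(nodes)


def assemble(reads, k_values=(10, 25, 35, 45, 55, 65)):
    """Return (k, graph) for the first K that gives a non-empty, cycle-free graph."""
    for k in k_values:
        graph = build_dbg(reads, k)
        if graph and not has_cycle(graph):
            return k, graph
    return None, None


# A short tandem repeat forces cycles at small K; a larger K resolves it.
# Toy K values are used here because the example read is only 10 bases long.
k, graph = assemble(["ACGTACGTAC"], k_values=(3, 5, 8))
print(k)
```

Candidate haplotypes would then correspond to the source-to-sink paths of the returned graph; path enumeration is omitted here for brevity.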

Read Filtering and Reporting of VCF DP Fields

In most pipelines, DRAGEN reports two types of depth counts, both of which may differ from the information in the BAM pileup due to various filtering steps that are applied throughout variant calling. Briefly:

Unfiltered depth is the number of reads (fragment-based) covering the position, downstream of any read collapsing, deduplication, downsampling and read disqualification, but upstream of informative read determination. Overlapping mate pairs are counted as single reads. When overlapping mate pairs are present, this may cause an apparent discrepancy between the reported depth and the pileup as viewed in a browser such as IGV. To resolve this, use the "View as pairs" option in IGV. Unfiltered depth is reported as INFO/DP, except in the case of gVCF homref calls, where it is reported as FORMAT/DP.
Informative depth is the number of reads (fragment-based) actually used to make the calling decision, where badly mated reads and uninformative reads (reads that could not be assigned to a specific allele) have been excluded. Informative depth is reported as FORMAT/DP, except in the case of gVCF homref calls, where it is not reported. The FORMAT/AD and FORMAT/AF fields are based on informative depth.

The following figure summarizes the different filtering steps in more detail.

Filter 1 acts on the reads present in the BAM input (in UMI pipelines, these are the collapsed reads produced by the read collapsing step, not the raw reads) and filters out the following reads:
- Duplicate reads.
- Soft-clipped bases. DRAGEN filters out soft-clipped bases only when calculating coverage reports.
- [Somatic] Reads with MAPQ=0.
- [Somatic] Reads with MAPQ < vc-min-tumor-read-qual, where vc-min-tumor-read-qual >1. By default, germline runs with machine learning (ML) enabled consider all reads, even those with MAPQ 0, resulting in increased sensitivity. MAPQ read filtering is controlled by --vc-min-read-qual for germline and tumor/normal (T/N) runs, but it does not affect tumor-only (T/O) runs. In contrast, --vc-min-tumor-read-qual controls filtering for tumor samples in T/N and T/O runs and has no effect on normal-only samples.
Filter 2 trims bases with BQ < 10 and filters out the following reads:
- Unmapped reads.
- Secondary reads.
- Reads with bad cigars.
Filter 3 occurs after downsampling and HMM. Filter 3 filters out the following reads:
- Disqualified reads. Reads are disqualified if their HMM score is below a threshold.
Filter 4 occurs after the genotyper runs. The genotyper adds annotation information to the FORMAT field. Filter 4 filters out the following reads:
- Badly mated reads. A badly mated read is a read where the pair is mapped to two different reference contigs.
- Non-informative reads. For example, if the HMM scores of the read against two different haplotypes are almost equal, the read is filtered out because it does not provide enough information to distinguish which of the two haplotypes are more likely.

Mosaic Calling

Since DRAGEN 4.3, the mosaic small variant caller runs downstream of the germline small variant caller. Non-cancer post-zygotic mosaic variants with typical AF lower than 50% detected by the mosaic caller are reported in the output VCF file with a MOSAIC INFO flag. By default, MOSAIC tagged variants with AF smaller than 20% are filtered with the MosaicLowAF filter. To further enhance sensitivity, if the median depth of the sample detected by the ploidy estimator exceeds 100x, a default 10% threshold is applied instead. This is likely to have a greater impact on exome data, which typically has higher coverage. Exome users looking to control the number of low allele frequency (AF) mosaic variants can set the option --vc-mosaic-af-filter-threshold to 0.2 to override the dynamic coverage-based thresholding.
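
The coverage-dependent threshold described above amounts to a small decision rule, sketched here in Python. The function names are hypothetical and the sketch only mirrors the documented defaults (20% at or below 100x median depth, 10% above it, or a user-supplied override).

```python
def mosaic_af_filter_threshold(median_depth, override=None):
    """Effective MosaicLowAF threshold.

    override models a user-supplied --vc-mosaic-af-filter-threshold value.
    """
    if override is not None:
        return override
    return 0.1 if median_depth > 100 else 0.2


def is_mosaic_low_af(af, median_depth, override=None):
    """True when a MOSAIC-tagged variant would receive the MosaicLowAF filter."""
    return af < mosaic_af_filter_threshold(median_depth, override)


# A 15% AF mosaic call survives on a 150x exome but is filtered at 80x.
print(is_mosaic_low_af(0.15, 150), is_mosaic_low_af(0.15, 80))
```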

Variant Caller Options

The following options control the variant caller stage of the DRAGEN host software.

--enable-variant-caller
Set --enable-variant-caller to true to enable the variant caller stage for the DRAGEN pipeline.
--vc-target-bed
[Optional] Restricts processing of the small variant caller, target BED related coverage, and callability metrics to regions specified in a BED file. The BED file is a text file containing at least three tab-delimited columns. The first three columns are chromosome, start position, and end position. The positions are zero-based. For example:

If the reference span of the variant overlaps with any of the regions in the target BED, then the variant is output. If the reference span does not overlap, the variant is not output. For SNPs and Insertions, the reference span is 1 bp. For deletions, the reference span is the length of the deletion.

--vc-target-bed-padding
[Optional] Pads all target BED regions with the specified value. For example, if a BED region is 1:1000–2000 and the specified padding value is 100, the result is equivalent to using a BED region of 1:900–2100 and a padding value of 0. Any padding specified with --vc-target-bed-padding is used by the small variant caller and by the target BED coverage/callability reports. The default padding is 0.
--vc-target-coverage
Specifies the target coverage for downsampling. The default value is 500 for germline mode and 50 for somatic mode.
--vc-remove-all-soft-clips
Set to true to ignore soft-clipped bases during the haplotype assembly step.
--vc-decoy-contigs
Specifies a comma-separated list of contigs to skip during variant calling. This option can be set in the configuration file.
--vc-enable-decoy-contigs
Set to true to enable variant calls on the decoy contigs. The default value is false.
--vc-enable-phasing
Enable variants to be phased when possible. The default value is true.
--vc-combine-phased-variants-distance
Set the maximum distance in base pairs between phased variants to be combined. The default value is 0, which disables the option. If the user wants to enable the combining of phased variants the recommended value of the distance is 15 base pairs. The valid range is [0; 15].
--vc-enable-mosaic-detection
Set to true to enable DRAGEN mosaic detection. Set to false to disable DRAGEN mosaic detection.
--vc-mosaic-af-filter-threshold
Set the allele frequency threshold for the application of the MosaicLowAF filter to mosaic calls. All MOSAIC tagged variants with AF smaller than the AF threshold are filtered with the MosaicLowAF filter. The default mosaic AF filter threshold is set to 0.2 if the median depth of the sample detected by the ploidy caller is <= 100x and 0.1 if the detected depth is > 100x.
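
The --vc-target-bed and --vc-target-bed-padding behavior described in the options above can be sketched in Python as follows. The helper names are hypothetical, and the coordinate handling is an assumption: both BED regions and variant start positions are treated as zero-based, half-open intervals on a single contig.

```python
def parse_bed_line(line):
    """Parse the first three tab-delimited BED columns: chrom, start, end."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)


def pad_region(start, end, padding):
    """Extend a region by padding on both sides, clamping at the contig start."""
    return max(0, start - padding), end + padding


def overlaps_target(var_start, ref_span, regions, padding=0):
    """True if the variant's reference span overlaps any (padded) target region.

    ref_span is 1 for SNPs and insertions and the deletion length for deletions.
    """
    var_end = var_start + ref_span
    for start, end in regions:
        padded_start, padded_end = pad_region(start, end, padding)
        if var_start < padded_end and var_end > padded_start:
            return True
    return False


chrom, start, end = parse_bed_line("1\t1000\t2000\n")
regions = [(start, end)]
print(pad_region(start, end, 100))                    # (900, 2100)
print(overlaps_target(950, 1, regions))               # False: SNP outside the target
print(overlaps_target(950, 1, regions, padding=100))  # True: inside the padded target
print(overlaps_target(995, 10, regions))              # True: deletion spans the boundary
```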

Downsampling Options for Small Variant Calling

You can use the following options for downsampling reads in the small variant calling pipeline.

For mitochondrial small variant calling, the downsampling options can be set separately because the mitochondrial contig contains a higher depth than the rest of the contigs in a WGS data set. The following are the downsampling options for the mitochondrial contig.

--vc-target-coverage-mito
--vc-max-reads-per-active-region-mito
--vc-max-reads-per-raw-region-mito

The target coverage and max/min reads in raw/active region options are not directly related and could be triggered independently.

The following are the default downsampling values for each small variant calling mode.

The target coverage downsampling step runs first and is meant to limit the total coverage at a given position. This step is approximate and the coverage after downsampling at a given position could be a bit higher than the threshold due to the --vc-min-reads-per-start-pos setting.

If the number of reads sharing a start position is equal to or lower than --vc-min-reads-per-start-pos, that position is skipped during downsampling, so that at least that minimum number of reads (default value is 10) is always retained at any start position.

The next downsampling step is to apply the --vc-max-reads-per-raw-region and --vc-max-reads-per-active-region limits. These options are used to limit the total number of reads in an entire region using a leveling downsampling method.

This downsampling mechanism scans each start position from the start boundary of the region and discards one read from that position, then moves on to the next position, until the total number of reads falls below the threshold. It can potentially take several passes across the entire region for the total number of reads in the entire region to fall below the threshold. After the threshold is met, the downsampling step is stopped regardless of which position was considered last in the region.

When downsampling occurs, the choice of which reads to keep or remove is random. However, the random number generator is seeded to a default value to make sure that the generator produces the same set of values in each run. This ensures reproducible results, which means there is no run to run variation when using the same input data.
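
The leveling downsampling pass and its seeded randomness can be modeled with the short Python sketch below. It is a simplified illustration, not DRAGEN's code: the function name, the seed value, and the choice to stop as soon as the read count no longer exceeds the limit are assumptions.

```python
import random


def level_downsample(reads_by_start, max_reads, seed=0):
    """Remove one random read per start position, pass after pass,
    until the region holds at most max_reads reads.

    reads_by_start maps a start position to the list of reads starting there.
    A fixed seed makes the selection reproducible from run to run.
    """
    rng = random.Random(seed)
    # Process start positions in ascending order, from the region's start boundary.
    buckets = {pos: list(reads) for pos, reads in sorted(reads_by_start.items())}
    total = sum(len(reads) for reads in buckets.values())
    while total > max_reads:
        removed_any = False
        for reads in buckets.values():
            if total <= max_reads:
                break  # stop immediately, wherever the scan happens to be
            if reads:
                reads.pop(rng.randrange(len(reads)))
                total -= 1
                removed_any = True
        if not removed_any:
            break  # nothing left to remove
    return buckets


region = {100: ["r1", "r2", "r3"], 101: ["r4"], 105: ["r5", "r6"]}
kept = level_downsample(region, max_reads=4)
print({pos: len(reads) for pos, reads in kept.items()})
```

In this example the first pass removes one read at position 100 and one at position 101, at which point the limit is met and position 105 is left untouched, which matches the "stopped regardless of which position was considered last" behavior described above.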

Small Variant Caller Output

By default, the DRAGEN small variant caller outputs a VCF file (<output-file-prefix>.hard-filtered.vcf.gz) in VCF 4.2 format containing both filtered and PASSing variant records.

Variant Representation

Parsimony means representing a variant in as few nucleotides as possible without reducing the length of any allele to 0.
Left aligning a variant means shifting the start position of that variant to the left until it is no longer possible to do so.
A variant is normalized if and only if it is parsimonious and left aligned.

Additional notes on variant representation in the DRAGEN VCF:

Reference-trimming of alleles: A single padding reference base is used to represent insertions and deletions (i.e. the reference base preceding the insertion or deletion is included).
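
The definitions of parsimony and left alignment above correspond to a widely used normalization procedure, sketched here in Python for a single biallelic variant. This is a generic illustration rather than DRAGEN's implementation; the function name and the in-memory reference string are assumptions.

```python
def normalize_variant(pos, ref, alt, reference):
    """Return a parsimonious, left-aligned (pos, ref, alt).

    pos is 1-based; reference is the contig sequence as a string.
    """
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    changed = True
    while changed:
        changed = False
        # Trim a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # An empty allele is not allowed: extend both alleles one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of the contig start")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases while both alleles keep at least one base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


reference = "GGGCACACACAGGG"
# A CA deletion reported inside the repeat shifts to the repeat's left edge,
# keeping a single padding reference base before the deleted sequence.
print(normalize_variant(8, "CAC", "C", reference))
# A SNP written with a redundant leading base is trimmed.
print(normalize_variant(5, "AC", "AT", reference))
```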

Multi-allelic Variants and Overlapping Variants

A multiallelic site is a specific locus in a genome that contains three or more observed alleles, counting the reference as one, and therefore allowing for two or more variant alleles. Multi-allelic calls are output in a single variant record in the VCF as follows:

Two indels are considered as multi-allelic if they share the same reference base preceding the indel. For example:

DRAGEN employs joint detection of overlapping variants, a feature designed to detect overlapping SNP and INDEL variants and output them in a single VCF record represented as a multi-allelic genotype. However, if a SNP overlaps an INDEL but the SNP does not align with the reference base preceding the indel, the SNP and INDEL are represented as two different variant records, as shown in the example below.

QUAL, QD, and GQ Formulation

In single sample VCF and gVCF, the QUAL follows the definition of the VCF specification. For more information on the VCF specification, see the most current VCF documentation available on samtools/hts-specs GitHub repository.

QUAL is the Phred-scaled probability that the site has no variant and is computed as QUAL = -10*log10(p), where p is the posterior probability of the homozygous reference genotype (GT=0/0).
That is, QUAL = GP(GT=0/0), where GP is the posterior genotype probability in Phred scale. QUAL = 20 means there is a 99% probability that there is a variant at the site. The GP values are also given in Phred scale in the VCF file.
GQ for non-homref calls is the Phred-scaled probability that the call is incorrect. GQ=-10*log10(p), where p is the probability that the call is incorrect. GQ=-10*log10(sum(10.^(-GP(i)/10))) where the sum is taken over the GT that did not win. GQ of 3 indicates a 50 percent chance that the call is incorrect, and GQ of 20 indicates a 1 percent chance that the call is incorrect.
In gVCF mode, the evidence in favor of homozygous reference calls is also assessed. However, the posterior probability is not of interest in this case (with zero evidence, e.g. due to zero coverage, the strong prior in favor of homref would yield a strong posterior in favor of homref). The value of GQ for homref calls therefore reflects the evidence directly and is defined as the likelihood ratio between the homref hypothesis and the strongest competing variant hypothesis: 10*log10[P(D|homref)/P(D|variant)], where D represents the pileup data.
QD is the QUAL normalized by the read depth, DP.
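As a worked illustration of the QUAL and GQ definitions above, the following sketch recomputes both values from the Phred-scaled GP field. The GP values are taken from the mosaic example record shown later in this section (Ploidy Support); the function name is hypothetical.

```python
import math

def gq_from_gp(gp, called_index):
    """GQ = -10*log10(sum of the probabilities of the genotypes that did not win).
    `gp` holds Phred-scaled posteriors, so each probability is 10**(-GP/10)."""
    p_incorrect = sum(10 ** (-g / 10) for i, g in enumerate(gp) if i != called_index)
    return -10 * math.log10(p_incorrect)

# GP for genotypes 0/0, 0/1, 1/1 from the example record (GT=0/1, QUAL=48.84, GQ=48)
gp = [4.8837e+01, 7.4031e-05, 5.4007e+01]

qual = gp[0]                          # QUAL = GP(GT=0/0)
gq = gq_from_gp(gp, called_index=1)   # the called genotype is 0/1

# Probability that a variant is present when QUAL = 20
p_variant_at_qual_20 = 1 - 10 ** (-20 / 10)
```

With these inputs, `qual` matches the QUAL column of the record and `gq` rounds to the reported GQ of 48, while `p_variant_at_qual_20` evaluates to 0.99.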

The QUAL scores generated by DRAGEN differ significantly from those of GATK, as DRAGEN's algorithms for small variant detection provide more realistic scores. This improvement stems from two key factors:

Correlated Errors: DRAGEN accounts for real-world correlated errors, unlike GATK, which assumes errors are uncorrelated, leading to inflated QUAL scores in GATK.
Machine Learning (ML): DRAGEN-ML further recalibrates QUAL scores, making them more accurate than DRAGEN without ML. With ML enabled, QUAL scores tend to not exceed 75, compared to GATK, where they can exceed 1000. Consequently, DRAGEN-ML uses a lower QUAL filtering threshold (3.0103) compared to DRAGEN without ML (10.41 for SNP and 7.83 for Indel).

Our recommendation is to use the default filtering thresholds in DRAGEN: QUAL threshold of 3.0103 with ML enabled.

gVCF Output

A genomic VCF (gVCF) file contains information on variants and positions determined to be homozygous to the reference genome. For homozygous regions, the gVCF file includes statistics that indicate how well reads support the absence of variants or alternative alleles. The gVCF file includes an artificial <NON_REF> allele. Reads that do not support the reference or any variants are assigned the <NON_REF> allele. DRAGEN uses these reads to determine if the position can be called as a homozygous reference, as opposed to remaining uncalled. The resulting score represents the Phred-scaled level of confidence in a homozygous reference call. In germline mode, the score is FORMAT/GQ and in somatic mode the score is FORMAT/SQ.

The following options are available to enable and control gVCF output.

--vc-emit-ref-confidence
To enable gVCF output, set to GVCF. By default, contiguous runs of homozygous reference calls with similar scores are collapsed into blocks (hom-ref blocks). Hom-ref blocks save disk space and processing time of downstream analysis tools. DRAGEN recommends using the default mode.
To produce unbanded output, set --vc-emit-ref-confidence to BP_RESOLUTION.
--vc-enable-vcf-output
To enable VCF file output during a gVCF run, set to true. The default value is false.
--vc-gvcf-bands
If using the default --vc-emit-ref-confidence gvcf (banded mode), DRAGEN collapses gVCF records with a similar GQ or SQ score. By default, the cutoffs are 1 10 20 30 40 60 80 for germline and 1 3 10 20 50 80 for somatic. For example, to define the bands [0, 10), [10, 50), and ≥ 50 use --vc-gvcf-bands 10 50.
--vc-compact-gvcf
This option, when used for germline in conjunction with --vc-emit-ref-confidence gvcf, produces a much smaller gVCF output file than the default. It can be used when the gVCF is destined for ingestion into gVCF Genotyper, offering further savings on disk space and gVCF Genotyper runtime compared to the default. This option implies --vc-gvcf-bands 0 1 10 20 30 and additionally omits certain metrics that are not used by gVCF Genotyper. Note that files generated using this option will be rejected by the Pedigree Caller.
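The effect of --vc-gvcf-bands can be pictured with a short sketch. It is only an illustration of the banding idea, assuming lower-inclusive bands as in the [0, 10), [10, 50), ≥ 50 example above; the function names and the per-position scores are hypothetical, and DRAGEN may apply further conditions when collapsing records.

```python
import bisect
from itertools import groupby

def score_band(score, cutoffs):
    """Return the (low, high) band containing `score`; high is None for the top band."""
    i = bisect.bisect_right(cutoffs, score)
    low = cutoffs[i - 1] if i > 0 else 0
    high = cutoffs[i] if i < len(cutoffs) else None
    return (low, high)

def collapse_into_blocks(scores, cutoffs):
    """Collapse consecutive positions whose scores share a band.
    Returns (first_index, last_index, band) tuples."""
    blocks, start = [], 0
    for band, run in groupby(scores, key=lambda s: score_band(s, cutoffs)):
        length = len(list(run))
        blocks.append((start, start + length - 1, band))
        start += length
    return blocks

cutoffs = [10, 50]                      # as in: --vc-gvcf-bands 10 50
scores = [5, 7, 12, 48, 60, 61]         # hypothetical per-position GQ values
blocks = collapse_into_blocks(scores, cutoffs)
```

Here the six positions collapse into three blocks, one per band. Fewer cutoffs give wider bands and therefore fewer, longer blocks.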

Not all entries in the gVCF are contiguous. The file might contain gaps that are not covered by either a variant line or a hom-ref block. The gaps correspond to regions that are not callable. A region is not callable if there is not at least one read mapped to the region with a MAPQ score above zero.

In germline mode, the thresholds for calling are lower for gVCFs than for VCFs. The gVCF output could show a different number of variants than a VCF run for the same sample. There is likely a different number of biallelic and multiallelic calls because gVCF mode includes all possible alleles at a locus, rather than only the two most likely alleles. This means that a biallelic call in the VCF can be output as a multiallelic call in the gVCF. The genotype in the gVCF still points to the two most likely alleles, so the variant call remains the same.

The following are example gVCF records that include a hom-ref block call and a variant call.

In a single sample gVCF, the FORMAT/DP reported at a homref position is the median DP in the band, and AD is the corresponding value, so the sum of AD equals DP even in a homref band. The minimum DP is also computed and reported as MIN_DP for the band.

Phasing and Phased Variants

DRAGEN supports output of phased variant records in both the germline and the somatic VCF and gVCF files. When two or more variants are phased together, the phasing information is encoded in a sample-level annotation, FORMAT/PS. FORMAT/PS identifies which set the phased variant is in. The value in the field is an integer representing the position of the first phased variant in the set. All records in the same contig with matching PS values belong to the same phase set.

The following is an example of a DRAGEN single sample gVCF, where two SNPs are phased together.

During the genotyping step, all haplotypes and all variants are considered over an active region. For each pair of variants, if both variants occur on all of the same haplotypes or if either is a homozygous variant, they are phased together. If the variants only occur on different haplotypes, they are phased opposite to each other. If any heterozygous variants are present on some of the same haplotypes but not others, phasing is aborted and no phasing information is output for the active region.
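The pairwise rule above can be sketched as follows. This is a simplified illustration, not DRAGEN's code: each variant is represented by the set of haplotype identifiers on which it occurs, and the function name, return labels, and haplotype names are hypothetical.

```python
def phase_pair(haps_a, haps_b, a_is_hom=False, b_is_hom=False):
    """Classify a pair of variants within one active region.

    haps_a / haps_b: identifiers of the haplotypes that carry each variant.
    Returns "together", "opposite", or "abort" (no phasing for the region).
    """
    if a_is_hom or b_is_hom:
        return "together"            # a homozygous variant phases with anything
    a, b = set(haps_a), set(haps_b)
    if a == b:
        return "together"            # both occur on exactly the same haplotypes
    if a.isdisjoint(b):
        return "opposite"            # they only occur on different haplotypes
    return "abort"                   # shared on some haplotypes but not others

same = phase_pair({"h1"}, {"h1"})
opposite = phase_pair({"h1"}, {"h2"})
mixed = phase_pair({"h1", "h2"}, {"h2", "h3"})
with_hom = phase_pair({"h1", "h2"}, {"h2"}, a_is_hom=True)
```

A single "abort" result for any heterozygous pair corresponds to the case where no phasing information is output for the whole active region.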

Combine Phased Variants

The DRAGEN small variant caller supports combining multiple nearby phased variant records into a single VCF record. When the functionality is enabled, the caller will output both multi-nucleotide variants (MNVs; multiple phased SNVs combined into a single VCF record) and complex indels (multiple phased insertions, deletions, and/or SNVs combined into a single VCF record) in the VCF.

For example, assuming reference at position chr2 115035 is A, the following two phased SNVs can be combined into an MNV.

The phased SNVs are combined as follows.

The following two phased indels can be combined into a complex indel.

The phased indels are combined as follows.

Individual variant records existing on the same haplotypes are deemed to be in phase and will be merged if they are within a configurable distance threshold of one another. For each consecutive pair of phased variants in a phase set, the variants will be combined if the difference between their POS fields does not exceed the threshold. For deletions, the number of deleted bases is taken into account and subtracted from the POS difference between the deletion and downstream phased variant when calculating the distance between calls. Please note that variant records without a PS tag may be merged into MNVs and complex indels together with calls having PS tags due to algorithmic differences between variant phasing and variant merging.
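The distance rule above can be written out as a small sketch. The exact arithmetic is an assumption based on the description (POS difference, minus the number of deleted bases when the upstream call is a deletion), a threshold of 0 is treated as disabling merging, and the coordinates and function names are hypothetical.

```python
def deleted_bases(ref, alt):
    """Number of reference bases removed by a call (0 for SNVs and insertions)."""
    return max(len(ref) - len(alt), 0)

def merge_distance(first, second):
    """Distance between two consecutive phased calls, each given as (pos, ref, alt)."""
    pos1, ref1, alt1 = first
    pos2, _, _ = second
    return pos2 - pos1 - deleted_bases(ref1, alt1)

def should_merge(first, second, max_distance):
    """Merge only when merging is enabled and the distance does not exceed the threshold."""
    if max_distance == 0:            # 0 disables merging
        return False
    return merge_distance(first, second) <= max_distance

snv_a = (100, "A", "G")
snv_b = (102, "C", "T")
deletion = (100, "ACGT", "A")        # deletes 3 bases
downstream_snv = (105, "C", "T")

d_snvs = merge_distance(snv_a, snv_b)
d_del = merge_distance(deletion, downstream_snv)
```

In this example both pairs are at distance 2, so both would be combined with a threshold of 2, even though the POS difference for the deletion pair is 5.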

In the somatic pipeline, combining phased variants is enabled by default, consistent with HGVS guidelines. In the germline pipeline, the functionality can be enabled using the command line options detailed below.

Command line options for merging phased variants

--vc-combine-phased-variants-distance-snvs-only Specifies the maximum distance over which phased SNVs will be combined into an MNV. This option applies only to phased variant groups consisting of only SNVs. The default is 2 for somatic and 0 for germline (disabled). For phased variant groups that include both SNVs and indels, the analogous vc-combine-phased-variants-distance option applies.
--vc-combine-phased-variants-distance Specifies the maximum distance over which phased insertions, deletions, and SNVs will be combined into a complex indel. This distance threshold will be applied to any group of phased variants that includes at least one indel. The default is 2 for somatic and 0 for germline (disabled).

For both options, a value of 0 disables merging. When either option is set to a value in the range [1, 15], all phased variants in the group that are within the provided distance of one another are merged into an MNV (for vc-combine-phased-variants-distance-snvs-only) or a complex indel (for vc-combine-phased-variants-distance).

--vc-mnv-emit-component-calls Specifies whether or not to emit the individual component variant records along with the merged variant records. When set to true, all component calls making up an MNV or complex indel will be emitted in the VCF along with the merged variant record. The default is true for somatic and false (disabled) for germline.
--vc-combine-phased-variants-max-vaf-delta Specifies the threshold for filtering MNV component variant calls when the events comprising the MNV have different allele frequencies. The default value is 0.1, which means that any SNV or INDEL with an AF that is more than 0.1 greater than the MNV AF is emitted as a PASSing call, while the remaining components are emitted with the 'mnv_component' FILTER flag. Only applicable when vc-combine-phased-variants-distance is greater than 0 and vc-mnv-emit-component-calls is true.

DRAGEN can output all component SNVs and/or INDELs that make up a merged MNV or complex indel along with the merged call itself. Merged calls and their component calls can be identified and linked to one another by a common value in the INFO.MNVTAG field. This behavior is default in somatic mode and can be enabled in germline mode by setting --vc-mnv-emit-component-calls=true. When vc-mnv-emit-component-calls is enabled and DRAGEN reports an MNV or complex indel call, the component calls that make up the merged call are filtered with the mnv_component filter flag unless the difference in VAF between the component call and merged call is greater than the value of vc-combine-phased-variants-max-vaf-delta (default: 0.1). This avoids component calls being doubly represented in the VCF output. In the case where VAF difference between a given component call and merged call exceeds the threshold value of vc-combine-phased-variants-max-vaf-delta, that is considered evidence for the component call existing both as a standalone variant and as part of the MNV or complex indel and the component call will be emitted as a PASSing VCF record. For example, in the following MNV group, there are two component SNVs making up the MNV. The MNV call is emitted as a PASSing call while one component SNV with AF equal to that of the MNV is filtered with the mnv_component FILTER flag and the other component SNV with VAF greater than that of the MNV by more than 0.1 is emitted as a PASSing call.
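The filtering rule for component calls reduces to a single comparison, sketched below. The function name and the allele frequencies are hypothetical; the default of 0.1 is the documented default of vc-combine-phased-variants-max-vaf-delta.

```python
def component_filter(component_af, mnv_af, max_vaf_delta=0.1):
    """FILTER value for a component call emitted alongside its merged call."""
    if component_af - mnv_af > max_vaf_delta:
        return "PASS"             # evidence that the component also exists on its own
    return "mnv_component"        # avoids representing the same event twice

mnv_af = 0.30                      # hypothetical AF of the merged MNV
snv_matching = component_filter(0.30, mnv_af)
snv_higher = component_filter(0.45, mnv_af)
```

This reproduces the case described above: the component with the same AF as the MNV is filtered, and the component whose AF exceeds the MNV AF by more than 0.1 passes.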

DRAGEN supports phasing of the genotypes listed in the below table. Only the first row in the table is relevant to somatic, since the somatic pipeline only emits 0/1 and 0|1 genotypes. MNV calls can still be phased with other variant calls that fell outside the phased variants distance.

Examples of diploid haplotypes where phasing is supported:

Examples of diploid haplotypes where phasing is not supported:

Ploidy Support

The small variant caller currently only supports either ploidy 1 or 2 on all contigs within the reference except for the mitochondrial contig, which uses a continuous allele frequency approach (see Mitochondrial Calling). The selection of ploidy 1 or 2 for all other contigs is determined as follows.

If --sample-sex is not specified on the command line, the Ploidy Estimator determines the sex. If the Ploidy Estimator cannot determine the sex karyotype or detects sex chromosome aneuploidy, all contigs are processed with ploidy 2.
If --sample-sex is specified on the command line, contigs are processed as follows.
- For female samples, DRAGEN processes all contigs with ploidy 2, and marks variant calls on chrY with a filter PloidyConflict.
- For male samples, DRAGEN processes all contigs with ploidy 2, except for the sex chromosomes. DRAGEN processes chrX with ploidy 1, except in the PAR regions, where it is processed with ploidy 2. chrY is processed with ploidy 1 throughout.
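The ploidy rules above can be summarized in a short sketch. It assumes the sample sex has already been resolved (from --sample-sex or the Ploidy Estimator) and that the caller knows whether a position lies in a PAR region; the function name and argument names are hypothetical, and the mitochondrial contig is outside its scope.

```python
def contig_ploidy(contig, sample_sex, in_par=False):
    """Return (ploidy, filter) for a non-mitochondrial contig.

    sample_sex: "male", "female", or None when the sex could not be determined.
    """
    name = contig[3:] if contig.startswith("chr") else contig  # X/Y or chrX/chrY
    if sample_sex == "male":
        if name == "X":
            return (2 if in_par else 1, None)   # haploid outside the PAR regions
        if name == "Y":
            return (1, None)
        return (2, None)
    if sample_sex == "female" and name == "Y":
        return (2, "PloidyConflict")            # calls on chrY are flagged
    return (2, None)                            # includes undetermined sex

male_x = contig_ploidy("chrX", "male")
male_x_par = contig_ploidy("chrX", "male", in_par=True)
female_y = contig_ploidy("Y", "female")
unknown_x = contig_ploidy("chrX", None)
```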

For male samples in germline calling mode, DRAGEN calls potential mosaic variants in non-PAR regions of sex chromosomes. A variant is called as mosaic when the allele frequency (FORMAT/AF) is below 75% or if multiple alt alleles are called, suggesting incompatibility with the haploid assumption. The GT field for bi-allelic mosaic variants is "0/1", denoting a mixture of reference and alt alleles, as opposed to the regular GT of "1" for haploid variants. The GT field for multi-allelic mosaic variants is "1/2" in VCF. You can disable the calling of mosaic variants by setting --vc-enable-sex-chr-diploid to false.

An example germline VCF record of a mosaic variant in a haploid region: chrX 18622368 . C T 48.84 PASS AC=1;AF=0.500;AN=2;DP=22;FS=4.154;MQ=248.02;MQRankSum=3.272;QD=2.27;ReadPosRankSum=2.671;SOR=1.546;FractionInformativeReads=1.000;MOSAIC GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 0/1:9,13:0.5909:22:1,8:8,5:48:84,0,51:4.8837e+01,7.4031e-05,5.4007e+01:0.00,34.77,37.77:5,4,4,9:3,6,5,8
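The mosaic criterion can be checked against the example record above. The sketch below is illustrative only: the function name is hypothetical, and the assumption that a regular haploid GT of "1" is reported when mosaic calling is disabled follows from the description above rather than from a documented rule.

```python
def haploid_region_gt(af, n_alt_alleles, mosaic_calling=True):
    """GT for a variant in a haploid (non-PAR) region of a male sample."""
    if mosaic_calling and n_alt_alleles > 1:
        return "1/2"      # multi-allelic mosaic variant
    if mosaic_calling and af < 0.75:
        return "0/1"      # bi-allelic mosaic variant
    return "1"            # regular haploid call

# FORMAT/AD from the example record: 9 reference reads, 13 alt reads (DP=22)
ad_ref, ad_alt = 9, 13
af = ad_alt / (ad_ref + ad_alt)
gt = haploid_region_gt(af, n_alt_alleles=1)
```

The computed `af` rounds to the 0.5909 reported in FORMAT/AF, which is below 75%, so the record carries the MOSAIC flag and GT 0/1.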

DRAGEN detects sex chromosomes by the naming convention, either X/Y or chrX/chrY. No other naming convention is supported.

Overlapping Mates in the Small Variant Calling

Instead of treating overlapping mates as independent evidence for a given event, DRAGEN handles overlapping mates in both the germline and somatic pipelines as follows.

When the two overlapping mates agree with each other on the allele with the highest HMM score, the genotyper uses the mate with the greatest difference between the highest and the second highest HMM score. The HMM score of the other mate becomes zero.
When the two overlapping mates disagree, the genotyper sums the HMM score between the two mates, assigns the combined score to the mate that agrees with the combined result, and changes the HMM score of the other mate to zero.
The base qualities of overlapping mates are no longer adjusted.
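A rough sketch of the two rules follows. It rests on several assumptions that the text does not state: each mate carries one HMM score per candidate allele, higher scores are better, a mate whose scores are all zero contributes no evidence, and when the combined result matches neither mate the combined score goes to the second mate. The function names and score values are hypothetical.

```python
def best_and_margin(scores):
    """Best allele and the gap between the highest and second highest score."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0][0], ranked[0][1] - ranked[1][1]

def resolve_overlapping_mates(mate1, mate2):
    """Return adjusted per-allele scores for the two mates of an overlapping pair."""
    best1, margin1 = best_and_margin(mate1)
    best2, margin2 = best_and_margin(mate2)
    zero = {allele: 0.0 for allele in mate1}
    if best1 == best2:
        # Agreement: keep the mate with the larger margin, zero the other.
        return (mate1, zero) if margin1 >= margin2 else (zero, mate2)
    # Disagreement: sum the scores and give them to the mate that agrees
    # with the combined result.
    combined = {allele: mate1[allele] + mate2[allele] for allele in mate1}
    winner = max(combined, key=combined.get)
    return (combined, zero) if winner == best1 else (zero, combined)

agree = resolve_overlapping_mates({"ref": -1.0, "alt": -10.0},
                                  {"ref": -2.0, "alt": -4.0})
disagree = resolve_overlapping_mates({"ref": -1.0, "alt": -6.0},
                                     {"ref": -5.0, "alt": -2.0})
```

In both cases only one mate of the pair ends up contributing evidence, so the overlap is not counted twice.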

Mitochondrial Calling

Typically, there are approximately 100 mitochondria in each mammalian cell. Each mitochondrion harbors 2–10 copies of mitochondrial DNA (mtDNA). For example, if 20% of the chrM copies have a variant, then the allele frequency (AF) is 20%. This is also referred to as continuous allele frequency. The expectation is that the AF of variants on chrM is anywhere between 0% and 100%.

DRAGEN processes chrM through a continuous AF pipeline, which is similar to the somatic variant calling pipeline. In this case, a single ALT allele is considered and the AF is estimated. The estimated AF can be anywhere between 0% and 100%. Default variant AF thresholds are applied to mitochondrial variant calling.

--vc-enable-af-filter-mito Whether to enable the allele frequency filter for mitochondrial variant calling. The default is true.
--vc-af-call-threshold-mito Set the allele frequency threshold for emitting calls in the VCF. The default is 0.01.
--vc-af-filter-threshold-mito Set the allele frequency threshold to mark emitted VCF calls as filtered. The default is 0.02.

QUAL and GQ are not output in the chrM variant records. Instead, the confidence score is FORMAT/SQ, which gives the Phred-scaled confidence that a variant is present at a given locus. A call is made if FORMAT/SQ > vc-sq-call-threshold (default = 0.1).

The following filters can be applied to mitochondrial variant calls.

--vc-sq-call-threshold Set the SQ threshold for emitting calls in the VCF. The default is 0.1.
--vc-sq-filter-threshold Set the SQ threshold to mark emitted VCF calls as filtered. The default is 3.0.
--vc-enable-triallelic-filter Enables the multiallelic filter. The default value is false.

If FORMAT/SQ < vc-sq-call-threshold, the variant is not emitted in the VCF. If FORMAT/SQ > vc-sq-call-threshold but FORMAT/SQ < vc-sq-filter-threshold, the variant is emitted in the VCF but FILTER=weak_evidence.

If FORMAT/SQ> vc-sq-call-threshold, FORMAT/SQ > vc-sq-filter-threshold, and no other filters are triggered, the variant is output in the VCF and FILTER=PASS.
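The two SQ thresholds define three outcomes, sketched below with the defaults from the option list above (0.1 and 3.0). The function name is hypothetical, the handling of values exactly equal to a threshold is an assumption, and other filters are ignored.

```python
def classify_sq(sq, call_threshold=0.1, filter_threshold=3.0):
    """Return None when the variant is not emitted, otherwise its FILTER value."""
    if sq < call_threshold:
        return None                  # below the call threshold: not in the VCF
    if sq < filter_threshold:
        return "weak_evidence"       # emitted, but filtered
    return "PASS"                    # emitted and passing (no other filter triggered)

not_emitted = classify_sq(0.05)
filtered = classify_sq(1.2)
passing = classify_sq(98.6)
```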

The following are example VCF records on the chrM. The examples show one call with very high AF and another with low AF. In both cases FORMAT/SQ > vc-sq-call-threshold. FORMAT/SQ is also > vc-sq-filter-threshold, so the FILTER annotation is PASS.

FORMAT/GT

For homref calls (e.g. in NON_REF regions of gVCF output) the FORMAT/GT is hard-coded to 0/0. The FORMAT/AF yields an estimate on the variant allele frequency, which ranges anywhere within [0,1]. For variant calls with FORMAT/AF < 95%, the FORMAT/GT is set to 0/1. For variants with very high allele frequencies (FORMAT/AF ≥ 95%), the FORMAT/GT is set to 1/1.
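The genotype assignment on chrM follows directly from FORMAT/AF, as in this minimal sketch (the function name and the `is_variant` argument are hypothetical):

```python
def mito_gt(af, is_variant=True):
    """FORMAT/GT for a chrM record, derived from the estimated allele frequency."""
    if not is_variant:
        return "0/0"                       # homref calls are hard-coded to 0/0
    return "1/1" if af >= 0.95 else "0/1"

low_af_gt = mito_gt(0.20)
high_af_gt = mito_gt(0.99)
homref_gt = mito_gt(0.0, is_variant=False)
```

The genotype is therefore only a coarse label on chrM; the quantitative information is carried by FORMAT/AF.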

The following is an example of a variant record on chrM in a trio joint VCF. The variant was detected in the second sample with a confidence score that passed the filter threshold. In the first and third samples GT=0/0, which indicates a tentative hom-ref call (ie, that position for the sample is in a NON_REF region over which no variant was detected with sufficient confidence), but the weak_evidence filter tag indicates that this call is made with low confidence.

Personalized Germline Small Variant Calling

We leverage the new pangenome reference and multi-genome mapper output to compute a personalized 2-haplotype reference for the input sample.

The computed 2-haplotype reference is used to impute variants, adjust prior probabilities for genotypes in the variant caller, and create a new personalized machine learning model, which significantly boosts the accuracy of variant calling. False negatives are reduced by adjusting genotype priors based on imputed phased variants in the computed haplotypes. False positives are reduced by limiting the impact of noise from other population haplotypes.

To enable personalized variant calling, including the personalized machine learning model, set --enable-personalization to true (default: false). This outputs two files in the output directory: .personal_haplotypes.tsv.gz and .personal.vcf.gz.

.personal_haplotypes.tsv.gz describes the personalized 2-haplotype reference. Each line contains the following fields: #CHROM START END HAPS. By default each line represents a 4kbp bin of the reference genome (indicated by the CHROM, START and END fields). For each 4kbp bin, the HAPS field denotes the pair of ancestral haplotypes (from the pangenome reference panel) that are inferred for the sample.

.personal.vcf.gz describes the variants imputed for the sample. Each variant is annotated along with genotype (GT), posterior probabilities (PGP, Personalized Genotype Posterior) and the inferred best haplotype pair (HAPS). The variant quality score (QUAL) is computed as -10 * log10(probability that the imputed genotype is incorrect) and is capped at 999.
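The QUAL definition for imputed variants can be written as a short sketch. The function name is hypothetical, and returning the cap for a probability of zero is an assumption made to keep the function defined for that input.

```python
import math

def imputed_qual(p_incorrect, cap=999.0):
    """QUAL = -10 * log10(probability that the imputed genotype is incorrect), capped."""
    if p_incorrect <= 0:
        return cap                      # assumed behavior for a zero probability
    return min(-10 * math.log10(p_incorrect), cap)

qual_moderate = imputed_qual(0.001)
qual_capped = imputed_qual(1e-120)
```

A probability of 0.001 gives a QUAL of about 30, while extremely small probabilities saturate at 999.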

The .personal.vcf.gz is useful when running in split mode and is beneficial to save along with the BAM/CRAM output. To enable personalized variant calling and machine learning in split-run scenarios, simply provide the personal variant VCF (.personal.vcf.gz) along with the BAM/CRAM input (--enable-personalization true --vc-pg-variants <$OUT_PREFIX.personal.vcf.gz>).

Note that personalization is only available for the germline small variant caller (WGS and WES) when used with a pangenome reference.

Joint Detection of Overlapping Variants

When variants at multiple loci in a single active region are detected jointly, genotyping can benefit. DRAGEN combines loci into a joint detection region if the following conditions are met:

Loci have alleles that overlap each other.
Loci are in an STR region or fewer than 10 bases away from an STR region.
Loci are fewer than 10 bases apart from each other.

Joint detection generates a haplotype list where all possible combinations of the alleles in the joint detection regions are represented. This calculation leads to a larger number of haplotypes. During genotyping, joint detection calculates the likelihoods that each haplotype pair is the truth, given the observed read pileup. Genotype likelihoods are calculated as the sum of the likelihoods of haplotype pairs that support the alleles in the genotypes. Genotypes with maximum likelihood are reported.
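The haplotype enumeration and the genotype likelihood sum can be illustrated as follows. This is a simplified sketch, not DRAGEN's genotyper: the loci, alleles, and function names are hypothetical, and the haplotype pair likelihoods are uniform placeholders rather than values computed from a read pileup.

```python
from itertools import combinations_with_replacement, product

def joint_haplotypes(alleles_per_locus):
    """All combinations of the alleles at the loci in a joint detection region."""
    return list(product(*alleles_per_locus))

def genotype_likelihood(pair_likelihoods, locus_index, genotype):
    """Sum the likelihoods of the haplotype pairs that support `genotype` at one locus."""
    wanted = sorted(genotype)
    return sum(
        likelihood
        for (hap1, hap2), likelihood in pair_likelihoods.items()
        if sorted((hap1[locus_index], hap2[locus_index])) == wanted
    )

# Two hypothetical loci: a SNP (A/G) and a nearby insertion (C/CT)
haplotypes = joint_haplotypes([["A", "G"], ["C", "CT"]])
pairs = list(combinations_with_replacement(haplotypes, 2))

# Placeholder likelihoods; a real genotyper derives them from the pileup
pair_likelihoods = {pair: 1.0 / len(pairs) for pair in pairs}
het_snp = genotype_likelihood(pair_likelihoods, locus_index=0, genotype=("A", "G"))
```

Two loci with two alleles each already give four haplotypes and ten unordered haplotype pairs, which shows why joint detection leads to a larger number of haplotypes.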

Joint detection is enabled by default. To disable joint detection, set --vc-enable-joint-detection to false.

Modeling of Correlated Errors Across Reads

DRAGEN has two algorithms that model correlated errors across reads in a given pileup.

Foreign Read Detection

Foreign read detection (FRD) detects mismapped reads. FRD modifies the probability calculation to account for the possibility that a subset of the reads were mismapped. Instead of assuming that mapping errors occur independently per read, FRD estimates the probability that a burst of reads is mismapped, by incorporating such evidence as MAPQ and skewed AF.

Mapping errors typically occur in bursts, but treating mapping errors as independent error events per read can result in high confidence scores in spite of low MAPQ and/or skewed AF. One possible strategy to mitigate overestimation of confidence scores is to include a threshold on the minimum MAPQ used in the calculation. However, this strategy can discard evidence and result in false positives.

FRD extends the legacy genotyping algorithm by incorporating an additional hypothesis that reads in the pileup might be foreign reads (ie, their true location is elsewhere in the reference genome). The algorithm exploits multiple properties (skewed allele frequency and low MAPQ) and incorporates this evidence into the probability calculation.

Sensitivity is improved by rescuing FN, correcting genotypes, and enabling lowering of the MAPQ threshold for incoming reads into the variant caller. Specificity is improved by removing FP and correcting genotypes.

Base Quality Dropoff

The base quality drop off (BQD) algorithm detects systematic and correlated base call errors caused by the sequencing system. BQD exploits certain properties of those errors (strand bias, position of the error in the read, base quality) to estimate the probability that the alleles are the result of a systematic error event rather than a true variant.

Bursts of errors that occur at a specific locus have distinct characteristics that differentiate them from true variants. BQD is a detection mechanism that exploits those characteristics (strand bias, position of the error in the read, and low mean base quality over the affected subset of reads at the locus of interest) and incorporates them into the probability calculation.

DRAGEN Host Software

Invoke the software using the dragen command. The command line options are described in the following sections.

Command line options can also be set in a configuration file. For more information on configuration files, see . If an option is set in the configuration file and is also specified on the command-line, the command line option overrides the configuration file.

Command-line Options

The following are examples of frequently used command lines:

Build Reference/Hash Table

dragen --build-hash-table true --ht-reference <REF_FASTA> \
--output-directory <REF_DIRECTORY>  [options]

Run Map/Align and Variant Caller (*.fastq to *.vcf)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] -1 <FASTQ1> \
[-2 <FASTQ2>] --RGID <RG0> --RGSM <SM0> --enable-variant-caller true

Run Map/Align (*.fastq to *.bam)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] \
-1 <FASTQ1> [-2 <FASTQ2>]  \
--RGID <RG0> --RGSM <SM0>

Run Variant Caller Only (*.bam to *.vcf)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] -b <BAM> \
--enable-map-align false \
--enable-variant-caller true

Re-map and Run Variant Caller (*.bam to *.vcf)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] -b <BAM> \
--enable-map-align true \
--enable-variant-caller true

Run BCL Converter (BCL to *.fastq)

dragen --bcl-conversion-only true --bcl-input-directory <BCL_DIRECTORY> \
--output-directory <OUT_DIRECTORY>

Run RNA Map/Align (*.fastq to *.bam)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] -1 <FASTQ1> \
[-2 <FASTQ2>] --enable-rna true

For recommended command lines in typical use cases, see .

Reference Genome Options

Before you can use the DRAGEN system for aligning reads, you must load a reference genome and its associated hash tables onto the PCIe card. For information on preprocessing a reference genome's FASTA files into the native DRAGEN binary reference and hash table formats, see . You must also specify the directory containing the preprocessed binary reference and hash tables with the -r [or --ref-dir] option. This argument is always required.

Use the following command to load the reference genome and hash tables to DRAGEN card memory separately from processing reads.

dragen -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

Use the -l (--force-load-reference) option to force the reference genome to load even if it is already loaded.

dragen -l -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

The time needed to load the reference genome depends on the size of the reference, but for typical recommended settings, it takes approximately 30–60 seconds.

Operating Modes

DRAGEN has two primary modes of operation, as follows:

Mapper/aligner
Variant caller

Full pipeline mode: To execute full pipeline mode, set --enable-variant-caller to true and provide input as unmapped reads in *.fastq, *.bam, or *.cram formats. DRAGEN performs decompression, mapping, aligning, sorting, and optional duplicate marking and feeds directly into the variant caller to produce a VCF file. In this mode, DRAGEN uses parallel stages throughout the pipeline to drastically reduce the overall run time.
Map/align mode: Map/align mode is enabled by default. Input is unmapped reads in *.fastq, *.bam, or *.cram format. DRAGEN produces an aligned and sorted BAM or CRAM file. To mark duplicate reads at the same time, set --enable-duplicate-marking to true.
Variant caller mode: To execute variant caller mode, set the --enable-variant-caller option to true, and set the --enable-map-align option to false. The input must be a mapped and aligned BAM/CRAM file. DRAGEN produces a VCF file. DRAGEN force-enables re-sorting of the BAM, because a number of read statistics and estimates are required for the variant caller to operate effectively. Setting --enable-sort to false is overridden. BAM files cannot be duplicate marked in the DRAGEN pipeline prior to variant calling if they have not already been marked. Use the end-to-end mode of operation to take advantage of the mark-duplicates feature.
RNA-Seq data: To enable processing of RNA-Seq-based data, set --enable-rna to true. DRAGEN uses the RNA spliced aligner during the mapper/aligner stage. DRAGEN dynamically switches between the required modes of operation.
Bisulfite MethylSeq data: To enable processing of Bisulfite MethylSeq data, set the --enable-methylation-calling option to true. DRAGEN automates the processing of data for Lister (directional) and Cokus (nondirectional) protocols to generate a single BAM with bismark-compatible tags. Alternatively, you can run DRAGEN in a mode that produces a separate BAM file for each combination of the C->T and G->A converted reads and references. To enable this mode of processing, you need to build a set of reference hash tables with --ht-methylated enabled, and run DRAGEN with the appropriate --methylation-protocol setting.

Output Options

The following command line options for output are mandatory:

--output-directory <out_dir>: Specifies the output directory for generated files.
--output-file-prefix <out_prefix>: Specifies the output file prefix. DRAGEN appends the appropriate file extension onto this prefix for each generated file.
-r [--ref-dir]: Specifies the reference hash table.

The following examples do not include these mandatory options.

For example, the following commands output to a compressed BAM file and force overwriting of any existing output:

dragen ... -f

dragen ... -f --output-format bam

To generate a BAI-format BAM index file (*.bai), set --enable-bam-indexing to true.

The following example outputs to a SAM file and forces overwriting of any existing output:

dragen ... -f --output-format sam

The following example outputs to a CRAM file and forces overwriting of any existing output:

dragen ... -f --output-format cram

DRAGEN only outputs lossless CRAM files. All QNAMEs and BAM tags are preserved in the CRAM.

Alignment tags

DRAGEN can also annotate additional information about alignments in a ZS:Z tag. The following are valid tag values:

Tag | Tag meaning
ZS:Z:R | Multiple alignments with similar score were found.
ZS:Z:NM | No alignment was found.
ZS:Z:QL | An alignment was found but it was below the quality threshold.
ZS:Z:NRD | Alignment is to an auto-added decoy contig (not present in input FASTA).
ZS:Z:PAI | Alignment is to an insertion encoded in a population based alternate contig (not present in input FASTA).

The ga tag uses the same format as the SA tag used to describe supplementary alignments.

CRAM Output

When CRAM is selected as output, DRAGEN generates a CRAM file with the following features:

CRAM format V3.0 is produced by default. V3.1 can be enabled by using the option --cram-version 3.1.
The CRAM is lossless. Lossy compression is never employed and cannot be enabled.
Quality score compression is lossless. Read names are preserved.
Only the GZIP compression algorithm is employed for maximum compatibility. bzip2 and lzma are not employed. rANS is used for quality scores.
All input BAM tags are preserved.
The reference used to compress the CRAM file is the DRAGEN hash table provided during the map/align run. When decompressing the CRAM with a FASTA file and third-party tools, the FASTA that was used to generate the hash table must be used.
A CRAM index is produced in .crai format.
CRAM output is only possible when sort is enabled. CRAM alignments are always positionally sorted.

The following default settings are used for the CRAM output.

CRAM option | Value | Description
SEQS_PER_SLICE | 2000 | Max sequences per slice
BASES_PER_SLICE | SEQS_PER_SLICE*500 | Max bases per slice
SLICE_PER_CNT | | Max slices per container
embed_ref | | Do not embed reference sequence
noref | | Do not use non-reference-based encoding
multiseq | -1 | Do not use multiple references per slice
unsorted | | Do not use unsorted mode
use_bz2 | | Do not compress using bzip2
use_lzma | | Do not compress using lzma
use_rans | | Use rANS for quality score compression
binning | NONE | Qual score binning not used
preserve_aux_order | | Preserve all aux tags and order (incl RG,NM,MD)
preserve_aux_size | | Aux tag sizes not preserved ('i', 's', 'c')
lossy_read_names | | Preserve read names
lossy | | Do not enable Illumina 8 quality-binning system
ignore_md5 | | Enable all checking of checksums
decode_md | | Do not (re)generate MD and NM tags
cram_version | 3.0 | Default is CRAM v3.0.

Input Options

DRAGEN can process reads in FASTQ format or BAM/CRAM format. DRAGEN supports the following compression options for FASTQ input files.

Uncompressed
gzip or bgzip compression
ORA compression. To use ORA compression, you must provide an ORA reference and reference directory. See ORA Compression and Decompression.

FASTQ Input Files

FASTQ input files can be single-ended or paired-end, as shown in the following examples.

Single-ended in one FASTQ file (-1 option)

dragen -r <REF_DIR> -1 <fastq> \
--output-directory <OUT_DIR> --output-file-prefix <OUT_PREFIX> \
--RGID <RGID> --RGSM <RGSM>

Paired-end in two matched FASTQ files (-1 and -2 options)

dragen -r <REF_DIR> -1 <fastq1> -2 <fastq2> \
--output-directory <OUT_DIR> --output-file-prefix <OUT_PREFIX> \
--RGID <RGID> --RGSM <RGSM>

Paired-end in a single interleaved FASTQ file (--interleaved (-i) option)

dragen -r <REF_DIR> -1 <INTERLEAVED_FASTQ> -i \
--RGID <RGID> --RGSM <RGSM>

Both bcl2fastq and the DRAGEN BCL command use a common file naming convention, as follows:

<SampleID>_S<#>_<Lane>_<Read>_<segment#>.fastq.gz

Older versions of bcl2fastq and DRAGEN could segment FASTQ samples into multiple files to limit file size or to decrease the time to generate them.

For example:

RDRS182520_S1_L001_R1_001.fastq.gz

RDRS182520_S1_L001_R1_002.fastq.gz

...

RDRS182520_S1_L001_R1_008.fastq.gz
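
The naming convention above can be parsed mechanically, which helps when segmented files need to be grouped before building an input list. The following Python sketch is illustrative only and is not part of DRAGEN. The regular expression assumes the lane field looks like L001 and the read field looks like R1 or R2, as in the example names; the function names are hypothetical.

```python
import re

# Pattern for <SampleID>_S<#>_<Lane>_<Read>_<segment#>.fastq.gz
# Assumption: lane is "L" plus digits, read is a letter plus digits.
FASTQ_NAME = re.compile(
    r"^(?P<sample_id>.+)_S(?P<sample_number>\d+)"
    r"_(?P<lane>L\d+)_(?P<read>[A-Za-z]\d+)_(?P<segment>\d+)\.fastq\.gz$"
)


def parse_fastq_name(filename):
    """Return the name fields as a dict, or None if the name does not match."""
    match = FASTQ_NAME.match(filename)
    return match.groupdict() if match else None


def group_segments(filenames):
    """Group segment files by (sample, lane, read), ordered by segment number."""
    groups = {}
    for name in filenames:
        fields = parse_fastq_name(name)
        if fields is None:
            continue  # skip files that do not follow the convention
        key = (fields["sample_id"], fields["lane"], fields["read"])
        groups.setdefault(key, []).append((int(fields["segment"]), name))
    return {key: [n for _, n in sorted(items)] for key, items in groups.items()}
```

For example, `group_segments` applied to the segment files listed above returns a single group keyed by `("RDRS182520", "L001", "R1")` with the files in segment order.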

To avoid impacting system performance, input files must be located on a fast file system.

Multiple FASTQ Input Files

For example:

dragen -r <ref_dir> --fastq-list <CSV_FILE> \
--fastq-list-sample-id <Sample_ID> --output-directory <OUT_DIR> \
--output-file-prefix <OUT_PREFIX>

FASTQ CSV File Format

Column titles are case-sensitive. The following column titles are required:

RGID--Read Group
RGSM--Sample ID
RGLB--Library
Lane--Flow cell lane
Read1File--Full path to a valid FASTQ input file
Read2File--Full path to a valid FASTQ input file. Required for paired-end input. If not using paired-end input, leave empty.

Each FASTQ file referenced in the CSV list can be referenced only once. All values in the Read2File column must be either nonempty and reference valid files, or they must all be empty.
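
The rules above can be checked before launching a run. The following Python sketch is a minimal pre-flight check written for illustration; it is not a DRAGEN tool, and the function names are hypothetical. It checks only the rules stated in this section (case-sensitive required columns, each file referenced once, Read2File filled in on all rows or on none) and does not test whether the files exist.

```python
import csv
import io

REQUIRED_COLUMNS = ["RGID", "RGSM", "RGLB", "Lane", "Read1File", "Read2File"]


def validate_fastq_list(csv_text):
    """Return a list of problems found in FASTQ list CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    # Column titles are case-sensitive, so compare them exactly.
    problems = [f"missing column: {name}" for name in REQUIRED_COLUMNS if name not in header]
    if problems:
        return problems
    rows = list(reader)
    seen = set()
    for line_number, row in enumerate(rows, start=2):  # line 1 is the header
        if not (row["Read1File"] or "").strip():
            problems.append(f"line {line_number}: Read1File is empty")
        for column in ("Read1File", "Read2File"):
            path = (row[column] or "").strip()
            if not path:
                continue
            if path in seen:
                problems.append(f"line {line_number}: {path} is referenced more than once")
            seen.add(path)
    has_read2 = [bool((row["Read2File"] or "").strip()) for row in rows]
    if any(has_read2) and not all(has_read2):
        problems.append("Read2File must be filled in on every row or left empty on every row")
    return problems


def read_group_ids(csv_text):
    """Unique RGID values in file order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(dict.fromkeys(row["RGID"] for row in reader))
```

An empty list from `validate_fastq_list` means none of the checked rules were violated.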

When generating a BAM file using fastq-list input, one read group is generated per unique RGID value. The BAM header contains the following RG tags for each read group:

ID (from RGID)
SM (from RGSM)
LB (from RGLB)

Independent processing and output for multiple individual samples in one run is not supported.
To process all listed files together as one sample, regardless of the RGSM value, the option --fastq-list-all-samples=true can be used instead of --fastq-list-sample-id.

Note

There is no option to specify groupings or subsets of RGSM values for more complex filtering, but the fastq-list file can be modified to achieve the same effect.

The following is an example FASTQ list CSV file with the required columns:

RGID,RGSM,RGLB,Lane,Read1File,Read2File
CACACTGA.1,RDSR181520,UnknownLibrary,1,/staging/RDSR181520_S1_L001_R1_001.fastq,/staging/RDSR181520_S1_L001_R2_001.fastq
AGAACGGA.1,RDSR181521,UnknownLibrary,1,/staging/RDSR181521_S2_L001_R1_001.fastq,/staging/RDSR181521_S2_L001_R2_001.fastq
TAAGTGCC.1,RDSR181522,UnknownLibrary,1,/staging/RDSR181522_S3_L001_R1_001.fastq,/staging/RDSR181522_S3_L001_R2_001.fastq
AGACTGAG.1,RDSR181523,UnknownLibrary,1,/staging/RDSR181523_S4_L001_R1_001.fastq,/staging/RDSR181523_S4_L001_R2_001.fastq

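Tumor-Normal Pairs Input

The following example command provides a tumor FASTQ list and a matched normal FASTQ list:

dragen -r <ref_dir> --tumor-fastq-list <csv_file> \
--tumor-fastq-list-sample-id <Sample_ID> \
--output-directory <out_dir> \
--output-file-prefix <out_prefix> --fastq-list <csv_file_2> \
--fastq-list-sample-id <Sample_ID_2>

The following example script runs one DRAGEN analysis per tumor-normal pair: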

#!/bin/bash

# DRAGEN hash table directory and FASTQ list CSV files
HT="/staging/HT/"
tumor_fastq_list="/staging/inputs/tumor_fastq_list.csv"
normal_fastq_list="/staging/inputs/normal_fastq_list.csv"

# One RGSM value per line; line N of each file forms one tumor-normal pair
tumor_samples_list="/staging/inputs/tumor_samples_list.txt"
normal_samples_list="/staging/inputs/normal_samples_list.txt"

# Read both sample lists in parallel on file descriptors 3 and 4
while read -u 3 -r tumor_RGSM && read -u 4 -r normal_RGSM; do
    output_dir="/staging/results/${tumor_RGSM}_${normal_RGSM}"
    mkdir -p "${output_dir}"

    dragen \
        -r "${HT}" \
        --tumor-fastq-list "${tumor_fastq_list}" \
        --tumor-fastq-list-sample-id "${tumor_RGSM}" \
        --fastq-list "${normal_fastq_list}" \
        --fastq-list-sample-id "${normal_RGSM}" \
        --output-directory "${output_dir}" \
        --output-file-prefix "${tumor_RGSM}_${normal_RGSM}"
done 3<"${tumor_samples_list}" 4<"${normal_samples_list}"


The following are examples of the FASTQ lists and samples lists used as input for the script.

Sample normal_fastq_list.csv content:

RGPL,RGID,RGSM,RGLB,Lane,Read1File,Read2File
DRAGEN_RGPL,DRAGEN_RGID_N1.1,normal-1,ILLUMINA,1,/staging/inputs/normal-1_S1_L001_R1_001.fastq.gz,/staging/inputs/normal-1_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N1.2,normal-1,ILLUMINA,2,/staging/inputs/normal-1_S1_L002_R1_001.fastq.gz,/staging/inputs/normal-1_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N2.1,normal-2,ILLUMINA,1,/staging/inputs/normal-2_S1_L001_R1_001.fastq.gz,/staging/inputs/normal-2_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N2.2,normal-2,ILLUMINA,2,/staging/inputs/normal-2_S1_L002_R1_001.fastq.gz,/staging/inputs/normal-2_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N3.1,normal-3,ILLUMINA,1,/staging/inputs/normal-3_S1_L001_R1_001.fastq.gz,/staging/inputs/normal-3_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N3.2,normal-3,ILLUMINA,2,/staging/inputs/normal-3_S1_L002_R1_001.fastq.gz,/staging/inputs/normal-3_S1_L002_R2_001.fastq.gz

Sample tumor_fastq_list.csv content:

RGPL,RGID,RGSM,RGLB,Lane,Read1File,Read2File
DRAGEN_RGPL,DRAGEN_RGID_T1.1,tumor-1,ILLUMINA,1,/staging/inputs/tumor-1_S1_L001_R1_001.fastq.gz,/staging/inputs/tumor-1_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T1.2,tumor-1,ILLUMINA,2,/staging/inputs/tumor-1_S1_L002_R1_001.fastq.gz,/staging/inputs/tumor-1_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T2.1,tumor-2,ILLUMINA,1,/staging/inputs/tumor-2_S1_L001_R1_001.fastq.gz,/staging/inputs/tumor-2_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T2.2,tumor-2,ILLUMINA,2,/staging/inputs/tumor-2_S1_L002_R1_001.fastq.gz,/staging/inputs/tumor-2_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T3.1,tumor-3,ILLUMINA,1,/staging/inputs/tumor-3_S1_L001_R1_001.fastq.gz,/staging/inputs/tumor-3_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T3.2,tumor-3,ILLUMINA,2,/staging/inputs/tumor-3_S1_L002_R1_001.fastq.gz,/staging/inputs/tumor-3_S1_L002_R2_001.fastq.gz

Sample normal_samples_list.txt content:

normal-1
normal-2
normal-3

Sample tumor_samples_list.txt content:

tumor-1
tumor-2
tumor-3

FASTQ ORA Input Files

See ORA Compression and Decompression for more information on ORA reference files.

The following command represents paired-end in two matched ORA FASTQ files (-1 and -2 options).

dragen -r <REF_DIR> -1 <fastq.ora1> -2 <fastq.ora2> \
--ora-reference <ORADATA_DIR> \
--output-directory <OUT_DIR> --output-file-prefix <OUT_PREFIX> \
--RGID <RGID> --RGSM <RGSM>

BAM Input Files

Alternatively, existing alignments in the BAM file can be used as input to the variant callers by setting the --enable-map-align option to false.

Specify single-ended input in one BAM file with the -b and --pair-by-name=false options, as follows:

dragen -r <ref_dir> -b <bam> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --pair-by-name false

Specify paired-end input in one BAM file with the -b and --pair-by-name=true options, as follows:

dragen -r <ref_dir> -b <bam> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --pair-by-name true

CRAM Input

--cram-reference can be either a FASTA file or a DRAGEN hash table folder.
If pointing to a FASTA file, the .fai index file must be present next to the FASTA file.
CRAM output is always compressed using the --ref-dir reference.

Example: CRAM was created with hg19, re-analysis with hg38

dragen -r <ref_dir HG38> --cram-input <cram> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --cram-reference <ref_dir HG19>

dragen -r <ref_dir HG38> --cram-input <cram> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --cram-reference <hg19.fa>

The following option is used to provide CRAM input to either the mapper/aligner or the variant caller:

--cram-input--The name and path for the CRAM file.

One usage example is paired-end input in a single CRAM file. In this case, also set the --pair-by-name option to true.

dragen -r <ref_dir> --cram-input <cram> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --pair-by-name true

BCL Input Files

BCL is the output format of Illumina sequencing systems. Under limited circumstances, DRAGEN can read directly from BCL for map-align operations, saving the time needed for conversion to FASTQ.

DRAGEN can read directly from BCL in the following circumstances:

Only one lane is input as part of a run (specified on the command-line).
The lane has only a single sample specified in the SampleSheet.csv file.

When converting BCL to FASTQ is required, DRAGEN provides a BCL to FASTQ converter (see DRAGEN BCL Data Conversion).

The following example command is for BCL input with only one lane of input:

dragen --bcl-input-dir <BCL_ROOT> --bcl-only-lane <num> -r <ref_dir> \
--output-directory <out_dir> --output-file-prefix <out_prefix>

For additional BCL conversion options, see Input File Types.

Handling of N bases

One of the techniques that DRAGEN uses to optimize the handling of sequences can lead to overwriting the base quality score assigned to N base calls.

Read Names for Paired-End Reads

DRAGEN has the following options to control how suffixes are used:

To change the delimiter character for suffixes, use the --pair-suffix-delimiter option. Valid values for this option include forward-slash (/), dot (.), and colon (:).
To preserve the entire name, including the suffixes, set --strip-input-qname-suffixes to false.
To append a new set of suffixes to all read names, set --append-read-index-to-name to true. The delimiter is determined by the --pair-suffix-delimiter option. By default, the delimiter is a slash, so /1 and /2 are added to the names.
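
The effect of these options on a read name can be pictured with a few lines of Python. This is a sketch for illustration, not DRAGEN code. It assumes that a suffix is the delimiter followed by 1 or 2 at the end of the name, and the function names are hypothetical.

```python
VALID_DELIMITERS = ("/", ".", ":")


def strip_pair_suffix(name, delimiter="/"):
    """Remove a trailing <delimiter>1 or <delimiter>2 from a read name, if present."""
    if delimiter not in VALID_DELIMITERS:
        raise ValueError(f"unsupported delimiter: {delimiter!r}")
    for index in ("1", "2"):
        suffix = delimiter + index
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def append_read_index(name, read_index, delimiter="/"):
    """Append <delimiter><read_index> to a read name (read_index is 1 or 2)."""
    if delimiter not in VALID_DELIMITERS:
        raise ValueError(f"unsupported delimiter: {delimiter!r}")
    return f"{name}{delimiter}{read_index}"
```

With the default slash delimiter, `append_read_index("read7", 1)` gives `read7/1`, matching the /1 and /2 suffixes described above.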

Gene Annotation Input Files

DRAGEN can take the SJ.out.tab file (see SJ.out.tab) as an annotations file to help guide the aligner in a two-pass mode of operation.

Networked Streaming

AWS S3, Azure Blob Storage, and AWS Presigned URL Input Streaming

Input streaming is supported for the following use cases:

Mapping/aligning of FASTQ and BAM.
Germline and somatic small variant calling from BAM (without remapping).

For other file types that are significantly smaller in size, download them locally before running the analysis.

Streaming FASTQ Input Using AWS S3

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 s3://s3-bucket-name/path/to/object_1.fastq.gz \
  -2 s3://s3-bucket-name/path/to/object_2.fastq.gz \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming FASTQ Input Using Azure Blob Storage Account

AZ_ACCOUNT_NAME="storage-account-name" AZ_ACCOUNT_KEY="<account-key>" dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 https://storage-account-name.blob.core.windows.net/path/to/object_1.fastq.gz \
  -2 https://storage-account-name.blob.core.windows.net/path/to/object_2.fastq.gz \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming FASTQ Input Using Presigned URLs (for AWS only)

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 https://bucket-name.amazonaws.com/path/to/object_1.fastq.gz?querystring \
  -2 https://bucket-name.amazonaws.com/path/to/object_2.fastq.gz?querystring \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming BAM Input Using AWS S3

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -b s3://s3-bucket-name/path/to/object_1.bam \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming BAM Input Using Azure Blob Storage Account

AZ_ACCOUNT_NAME="storage-account-name" AZ_ACCOUNT_KEY="<account-key>" dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -b https://storage-account-name.blob.core.windows.net/path/to/object_1.bam \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming BAM Input Using Presigned URLs (for AWS only)

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -b https://bucket-name.amazonaws.com/path/to/object_1.bam?querystring \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

AWS S3 and Azure Blob Storage Output Streaming

DRAGEN can stream its output to an AWS S3 Bucket or an Azure Blob Storage Account Container. Output streaming is beneficial for large output files and for sharing results.

Streaming output to AWS S3

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 SRA056922.fastq \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory s3://s3-bucket-name/path/to/output \
  --intermediate-results-dir /staging/examples \
  --output-file-prefix streaming

Streaming output to Azure Blob Storage Account

AZ_ACCOUNT_NAME="storage-account-name" AZ_ACCOUNT_KEY="<account-key>" dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 SRA056922.fastq \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory https://storage-account-name.blob.core.windows.net/path/to/output \
  --intermediate-results-dir /staging/examples \
  --output-file-prefix streaming

Security and Permissions

To stream input files or write to a cloud provider's storage, you must have permission to access the remote files.

AWS S3

S3 requires AWS authentication and credentials. The authentication should already be set up on the instance you are running, for example, via IAM policies.

Azure Blob Storage Account

Azure requires authentication and environment variables. DRAGEN supports two cases: (1) Using managed identities and (2) Storage account access keys.

With storage account access keys, DRAGEN can write to an Azure bucket both on and off Azure instances. For this use case, set the environment variables AZ_ACCOUNT_NAME=<azure-storage-account-name> and AZ_ACCOUNT_KEY=<account-key>.

Presigned URL (AWS only)

Sample Sex

The --sample-sex option supports the following values. Values are not case-sensitive.

none: No sex karyotype input. Components use a default reference sex karyotype.
auto: The sex karyotype is estimated by the Ploidy Estimator. If using CNV calling, sex karyotype is determined using a separate sex estimation module. If DRAGEN cannot estimate the sex karyotype, then components do not have a sex karyotype input. This behavior is then the same as none. auto is the default value.
female: Sex karyotype input is XX.
male: Sex karyotype input is XY.

The following example command lines use --sample-sex to specify the sex karyotype.

--sample-sex FEMALE

--sample-sex MALE

--sample-sex NONE

The sex karyotype input is converted to the reference sex karyotype for the different components as follows. See the relevant component section for more information on how --sample-sex is used.

Reference Sex Karyotype

Sex Karyotype Input

CNV Caller

DRAGEN-STR

Ploidy Caller

Small Variant Caller

SV Caller

XXYY

XXX

XXYY

XXXX

XXYY

XXXXX

XXYY

XXY

XXYY

XXXY

XXYY

XXXXY

XXYY

XYY

XXYY

XXXYY

XXYY

XYYY

XXYY

XXYYY

XXYY

XYYYY

XXYY

None

XX/XY

XXYY

For sex karyotype input of None, CNV/Ploidy Caller independently check the coverage ratio of X and Y to determine the reference sex karyotype. Detection of minimal Y coverage will yield XY, otherwise XX.

Preservation or Stripping of BQSR Tags

Read Group Options

| Attribute argument | Description |
| --- | --- |
| --RGID | Read group identifier. If you include any of the read group parameters, RGID is required. It is the value written into each output BAM record. |
| --RGLB | Library. |
| --RGPL | Platform/technology used to produce the reads. The BAM standard allows for values CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT and PACBIO. |
| --RGPU | Platform unit, for example, flowcell-barcode.lane. |
| --RGSM | Sample. |
| --RGCN | Name of the sequencing center that produced the read. |
| --RGDS | Description. |
| --RGDT | Date the run was produced. |
| --RGPI | Predicted mean insert size. |

dragen --RGID 1 --RGCN Broad --RGLB Solexa-135852 \
--RGPL Illumina --RGPU 1 --RGSM NA12878 \
-r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
-1 SRA056922.fastq --output-directory /staging/tmp/ \
--output-file-prefix rg_example

License Options

To suppress the license status message at the end of the run, use the --lic-no-print option. The following shows an example of the license status message:

LICENSE_MSG| =====================================================
LICENSE_MSG| License report
LICENSE_MSG|   Genome status [ACxxxxxxxxxxx] : used 1263.9 Gbases since 2018-Feb-15 (1263886160894 bases, unlimited)
LICENSE_MSG|   Genome  bases [ACxxxxxxxxxxx] : 202000000
LICENSE_MSG|   Genome  bases [total]         : 202000000

Autogenerated MD5SUM for BAM and CRAM Output Files

The MD5SUM calculation is performed as the output file is written, so there is no measurable performance impact (compared to the Linux md5sum command, which can take several minutes for a 30x BAM).
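
The general technique of hashing data as it is written, rather than rereading the finished file, can be sketched in Python as follows. This shows the idea only and is not DRAGEN's implementation. The class name `HashingWriter` is hypothetical.

```python
import hashlib
import io


class HashingWriter:
    """Wrap a binary stream and update an MD5 digest with every chunk written."""

    def __init__(self, raw):
        self._raw = raw
        self._md5 = hashlib.md5()

    def write(self, data):
        # Hash the chunk while it is still in memory, then pass it through.
        self._md5.update(data)
        return self._raw.write(data)

    def hexdigest(self):
        return self._md5.hexdigest()


# Usage: write two chunks to an in-memory stream and read back the digest.
buffer = io.BytesIO()
writer = HashingWriter(buffer)
writer.write(b"first chunk of output, ")
writer.write(b"second chunk of output")
digest = writer.hexdigest()
```

Because each chunk is hashed once on its way to the stream, the digest is available as soon as the last write finishes and no second pass over the file is needed.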

Configuration Files

Licensing

DRAGEN utilizes quota based licensing for a majority of features. More information can be found in the .

Small Variant Calling

The Variant Caller Algorithm

The DRAGEN Small Variant Caller performs the following steps:

Active Region Identification---Identifies areas where multiple reads disagree with the reference, and selects windows around them (active regions) for processing.

Haplotype Alignment---Uses the Smith-Waterman algorithm to align each extracted haplotype to the reference genome. The alignments determine what variations from the reference are present.

Read filtering and reporting of vcf DP fields

Unfiltered depth is the number of reads (fragment-based) covering the position, downstream of any read collapsing, deduplication, downsampling and read disqualification, but upstream of informative read determination. Overlapping mate pairs are counted as single reads. When overlapping mate pairs are present, this may cause an apparent discrepancy between the reported depth and the pileup as viewed in a browser such as IGV. To resolve this, use the "View as pairs" option in IGV. Unfiltered depth is reported as INFO/DP, except in the case of gVCF homref calls, where it is reported as FORMAT/DP.
Informative depth is the number of reads (fragment-based) actually used to make the calling decision, where badly mated reads and uninformative reads (reads that could not be assigned to a specific allele) have been excluded. Informative depth is reported as FORMAT/DP, except in the case of gVCF homref calls, where it is not reported. The FORMAT/AD and FORMAT/AF fields are based on informative depth.

The following figure summarizes the different filtering steps in more detail.

Filter 1 acts on the reads present in the BAM input (in UMI pipelines, these are the collapsed reads produced by the read collapsing step, not the raw reads) and filters out the following reads:
- Duplicate reads.
- Soft-clipped bases. DRAGEN filters out soft-clipped bases only when calculating coverage reports.
- [Somatic] Reads with MAPQ=0.
- [Somatic] Reads with MAPQ < vc-min-tumor-read-qual, where vc-min-tumor-read-qual > 1.
By default, germline runs with machine learning (ML) enabled consider all reads, even those with MAPQ 0, which increases sensitivity. The --vc-min-read-qual option controls MAPQ read filtering for germline and tumor/normal (T/N) runs, but it does not affect tumor-only (T/O) runs. In contrast, --vc-min-tumor-read-qual controls filtering for tumor samples in T/N and T/O runs and has no effect on normal-only samples.
Filter 2 trims bases with BQ < 10 and filters out the following reads:
- Unmapped reads.
- Secondary reads.
- Reads with bad cigars.
Filter 3 occurs after downsampling and HMM. Filter 3 filters out the following reads:
- Disqualified reads. Reads are disqualified if their HMM score is below a threshold.
Filter 4 occurs after the genotyper runs. The genotyper adds annotation information to the FORMAT field. Filter 4 filters out the following reads:
- Badly mated reads. A badly mated read is a read where the pair is mapped to two different reference contigs.
- Non-informative reads. For example, if the HMM scores of the read against two different haplotypes are almost equal, the read is filtered out because it does not provide enough information to distinguish which of the two haplotypes is more likely.

Mosaic Calling

Variant Caller Options

The following options control the variant caller stage of the DRAGEN host software.

--enable-variant-caller
Set --enable-variant-caller to true to enable the variant caller stage for the DRAGEN pipeline.
--vc-target-bed
[Optional] Restricts processing of the small variant caller, target BED related coverage, and callability metrics to regions specified in a BED file. The BED file is a text file containing at least three tab-delimited columns. The first three columns are chromosome, start position, and end position. The positions are zero-based. For example:

--vc-target-bed-padding
[Optional] Pads all target BED regions with the specified value. For example, if a BED region is 1:1000–2000 and the specified padding value is 100, the result is equivalent to using a BED region of 1:900–2100 and a padding value of 0. Any padding specified with --vc-target-bed-padding is used by the small variant caller and by the target BED coverage/callability reports. The default padding is 0.
--vc-target-coverage
Specifies the target coverage for downsampling. The default value is 500 for germline mode and 50 for somatic mode.
--vc-remove-all-soft-clips
Set to true to ignore soft-clipped bases during the haplotype assembly step.
--vc-decoy-contigs
Specifies a comma-separated list of contigs to skip during variant calling. This option can be set in the configuration file.
--vc-enable-decoy-contigs
Set to true to enable variant calls on the decoy contigs. The default value is false.
--vc-enable-phasing
Enable variants to be phased when possible. The default value is true.
--vc-combine-phased-variants-distance
Set the maximum distance in base pairs between phased variants to be combined. The default value is 0, which disables the option. To enable combining of phased variants, the recommended distance is 15 base pairs. The valid range is [0; 15].
--vc-enable-mosaic-detection
Set to true to enable DRAGEN mosaic detection. Set to false to disable DRAGEN mosaic detection.
--vc-mosaic-af-filter-threshold
Set the allele frequency threshold for the application of the MosaicLowAF filter to mosaic calls. All MOSAIC tagged variants with AF smaller than the AF threshold are filtered with the MosaicLowAF filter. The default mosaic AF filter threshold is set to 0.2 if the median depth of the sample detected by the ploidy caller is <= 100x and 0.1 if the detected depth is > 100x.
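
The padding arithmetic described for --vc-target-bed-padding can be reproduced with a short Python sketch. It is illustrative only and not part of DRAGEN. Clamping the padded start at 0 is an assumption made here, the sketch does not clip the end to the contig length or merge overlapping regions, and the function names are hypothetical.

```python
def parse_bed_line(line):
    """Parse the first three tab-delimited BED columns: chromosome, start, end."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)


def pad_bed_regions(regions, padding):
    """Extend every (chrom, start, end) region by `padding` bases on both sides."""
    padded = []
    for chrom, start, end in regions:
        # Assumption: a padded start never goes below position 0.
        padded.append((chrom, max(0, start - padding), end + padding))
    return padded
```

Padding the region 1:1000-2000 by 100 yields 1:900-2100, which matches the example in the option description.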

Downsampling Options for Small Variant Calling

You can use the following options for downsampling reads in the small variant calling pipeline.

--vc-target-coverage-mito
--vc-max-reads-per-active-region-mito
--vc-max-reads-per-raw-region-mito

The target coverage and the max/min reads in raw/active region options are not directly related and can be triggered independently.

The following are the default downsampling values for each small variant calling mode.

Small Variant Caller Output

By default, the DRAGEN small variant caller outputs a VCF file (<output-file-prefix>.hard-filtered.vcf.gz) in VCF 4.2 format containing both filtered and PASSing variant records.

Variant Representation

Parsimony means representing a variant in as few nucleotides as possible without reducing the length of any allele to 0.
Left aligning a variant means shifting the start position of that variant to the left until it is no longer possible to do so.
A variant is normalized if and only if it is parsimonious and left aligned.

Additional notes on variant representation in the DRAGEN VCF:

Reference-trimming of alleles: A single padding reference base is used to represent insertions and deletions (i.e. the reference base preceding the insertion or deletion is included).

Multi-allelic Variants and Overlapping Variants

Two indels are considered as multi-allelic if they share the same reference base preceding the indel. For example:

QUAL, QD, and GQ Formulation

QUAL is the Phred-scaled probability that the site has no variant and is computed as: QUAL = -10*log10(P(GT=0/0)), where P(GT=0/0) is the posterior probability of the homozygous-reference genotype.
That is, QUAL = GP (GT=0/0), where GP = posterior genotype probability in Phred scale. QUAL = 20 means there is 99% probability that there is a variant at the site. The GP values are also Phred-scaled in the VCF file.
GQ for non-homref calls is the Phred-scaled probability that the call is incorrect. GQ=-10*log10(p), where p is the probability that the call is incorrect. GQ=-10*log10(sum(10^(-GP(i)/10))) where the sum is taken over the GT that did not win. GQ of 3 indicates a 50 percent chance that the call is incorrect, and GQ of 20 indicates a 1 percent chance that the call is incorrect.
In gvcf mode, the evidence in favor of homozygous reference calls is also assessed. However, the posterior probability is not of interest in this case (with zero evidence, e.g. due to zero coverage, the strong prior in favor of homref would yield a strong posterior in favor of homref), so the value of GQ for homref calls reflects the evidence directly, defined using the likelihood ratio between the likelihoods for the homref hypothesis and the strongest competing variant hypothesis: 10*log10[P(D|homref)/P(D|variant)] where D represents the pileup data.
QD is the QUAL normalized by the read depth, DP.
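
The QUAL, GQ, and QD definitions above can be worked through numerically. The following Python sketch uses a made-up set of posterior genotype probabilities and is not DRAGEN code; the function names and the example values are hypothetical.

```python
import math


def phred(probability):
    """Convert a probability to the Phred scale."""
    return -10 * math.log10(probability)


def from_phred(value):
    """Convert a Phred-scaled value back to a probability."""
    return 10 ** (-value / 10)


def qual_from_gp(gp):
    """QUAL is the Phred-scaled posterior of the homozygous-reference genotype."""
    return gp["0/0"]


def gq_from_gp(gp, called_gt):
    """GQ for a non-homref call: Phred-scaled sum over the genotypes that did not win."""
    p_incorrect = sum(from_phred(value) for gt, value in gp.items() if gt != called_gt)
    return phred(p_incorrect)


def qd(qual, depth):
    """QD is QUAL normalized by read depth."""
    return qual / depth


# Hypothetical posteriors: 0/1 is the winning genotype.
posteriors = {"0/0": 0.01, "0/1": 0.98, "1/1": 0.01}
gp = {gt: phred(p) for gt, p in posteriors.items()}
qual = qual_from_gp(gp)      # posterior of 0/0 is 0.01
gq = gq_from_gp(gp, "0/1")   # probability the call is wrong is 0.02
```

In this example QUAL is 20, because the posterior probability of no variant is 1 percent, and GQ is about 17, because the two losing genotypes together carry a 2 percent probability.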

Correlated Errors: DRAGEN accounts for real-world correlated errors, unlike GATK, which assumes errors are uncorrelated, leading to inflated QUAL scores in GATK.
Machine Learning (ML): DRAGEN-ML further recalibrates QUAL scores, making them more accurate than DRAGEN without ML. With ML enabled, QUAL scores tend to not exceed 75, compared to GATK, where they can exceed 1000. Consequently, DRAGEN-ML uses a lower QUAL filtering threshold (3.0103) compared to DRAGEN without ML (10.41 for SNP and 7.83 for Indel).

Our recommendation is to use the default filtering thresholds in DRAGEN: QUAL threshold of 3.0103 with ML enabled.

gVCF Output

The following options are available to enable and control gVCF output.

--vc-emit-ref-confidence
To enable gVCF output, set to GVCF. By default, contiguous runs of homozygous reference calls with similar scores are collapsed into blocks (hom-ref blocks). Hom-ref blocks save disk space and processing time of downstream analysis tools. DRAGEN recommends using the default mode.
To produce unbanded output, set --vc-emit-ref-confidence to BP_RESOLUTION.
--vc-enable-vcf-output
To enable VCF file output during a gVCF run, set to true. The default value is false.
--vc-gvcf-bands
If using the default --vc-emit-ref-confidence gvcf (banded mode), DRAGEN collapses gVCF records with a similar GQ or SQ score. By default, the cutoffs are 1 10 20 30 40 60 80 for germline and 1 3 10 20 50 80 for somatic. For example, to define the bands [0, 10), [10, 50), and ≥ 50 use --vc-gvcf-bands 10 50.
--vc-compact-gvcf
This option, when used for germline in conjunction with --vc-emit-ref-confidence gvcf, produces a much smaller gVCF output file than the default. It can be used when the gVCF is destined for ingestion into gVCF Genotyper, offering further savings on disk space and gVCF Genotyper runtime compared to the default. This option implies --vc-gvcf-bands 0 1 10 20 30 and additionally omits certain metrics that are not used by gVCF Genotyper. Note that files generated using this option will be rejected by the Pedigree Caller.
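
The banding controlled by --vc-gvcf-bands can be sketched in Python as follows. It maps a score to its band and collapses consecutive hom-ref positions that fall in the same band. This illustrates the banding idea only, not DRAGEN's implementation. The function names are hypothetical, and the lower bound of 0 for the first band follows the example above.

```python
from bisect import bisect_right


def band_label(score, cutoffs):
    """Return the half-open band that contains `score` for ascending cutoffs."""
    index = bisect_right(cutoffs, score)
    low = 0 if index == 0 else cutoffs[index - 1]
    if index == len(cutoffs):
        return f">= {low}"
    return f"[{low}, {cutoffs[index]})"


def collapse_homref(scores, cutoffs):
    """Collapse consecutive positions in the same band into (first, last, band) blocks."""
    blocks = []
    for position, score in enumerate(scores):
        label = band_label(score, cutoffs)
        if blocks and blocks[-1][2] == label:
            # Same band as the previous position: extend the open block.
            blocks[-1] = (blocks[-1][0], position, label)
        else:
            blocks.append((position, position, label))
    return blocks
```

With cutoffs `[10, 50]`, which correspond to `--vc-gvcf-bands 10 50`, the scores `[3, 8, 12, 40, 60]` collapse into three blocks: positions 0 to 1 in `[0, 10)`, positions 2 to 3 in `[10, 50)`, and position 4 in `>= 50`.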

The following are example gVCF records that include a hom-ref block call and a variant call.

Phasing and Phased Variants

The following is an example of a DRAGEN single sample gVCF, where two SNPs are phased together.

Combine Phased Variants

For example, assuming reference at position chr2 115035 is A, the following two phased SNVs can be combined into an MNV.

The phased SNVs are combined as follows.

The following two phased indels can be combined into a complex indel.

The phased indels are combined as follows.

Command line options for merging phased variants

--vc-combine-phased-variants-distance-snvs-only Specifies the maximum distance over which phased SNVs will be combined into an MNV. This option applies only to phased variant groups consisting of only SNVs. The default is 2 for somatic and 0 for germline (disabled). For phased variant groups that include both SNVs and indels, the analogous vc-combine-phased-variants-distance option applies.
--vc-combine-phased-variants-distance Specifies the maximum distance over which phased insertions, deletions, and SNVs will be combined into a complex indel. This distance threshold will be applied to any group of phased variants that includes at least one indel. The default is 2 for somatic and 0 for germline (disabled).

--vc-mnv-emit-component-calls Specifies whether or not to emit the individual component variant records along with the merged variant records. When set to true, all component calls making up an MNV or complex indel will be emitted in the VCF along with the merged variant record. The default is true for somatic and false (disabled) for germline.
--vc-combine-phased-variants-max-vaf-delta Specifies the threshold for filtering MNV component variant calls when the events comprising the MNV have different allele frequencies. The default value is 0.1, which means that any SNV or indel with an AF that is more than 0.1 greater than the MNV AF is emitted as a PASSing call, while the remaining components are emitted with the 'mnv_component' FILTER flag. Only applicable when vc-combine-phased-variants-distance is greater than 0 and vc-mnv-emit-component-calls is true.
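
The rule described for --vc-combine-phased-variants-max-vaf-delta reduces to a single comparison per component call. The Python sketch below restates that rule for illustration and is not DRAGEN code. The function name and the example allele frequencies are hypothetical.

```python
def component_filter(component_af, mnv_af, max_vaf_delta=0.1):
    """Return the FILTER value for one component call of a merged variant.

    A component whose AF exceeds the merged variant's AF by more than
    max_vaf_delta stays PASS; otherwise it is flagged as a component.
    """
    if component_af - mnv_af > max_vaf_delta:
        return "PASS"
    return "mnv_component"


# Hypothetical example: a merged MNV at AF 0.30 with two component SNVs.
filters = [component_filter(af, 0.30) for af in (0.45, 0.32)]
```

Here the first component (AF 0.45) exceeds the MNV AF by 0.15 and remains PASS, while the second (AF 0.32) differs by only 0.02 and is marked `mnv_component`.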

Examples of diploid haplotypes where phasing is supported:

Examples of diploid haplotypes where phasing is not supported:

Ploidy Support

If --sample-sex is not specified on the command line, the Ploidy Estimator determines the sex. If the Ploidy Estimator cannot determine the sex karyotype or detects sex chromosome aneuploidy, all contigs are processed with ploidy 2.
If --sample-sex is specified on the command line, contigs are processed as follows.
- For female samples, DRAGEN processes all contigs with ploidy 2, and marks variant calls on chrY with a filter PloidyConflict.
- For male samples, DRAGEN processes all contigs with ploidy 2, except for the sex chromosomes. DRAGEN processes chrX with ploidy 1, except in the PAR regions, where it is processed with ploidy 2. chrY is processed with ploidy 1 throughout.

DRAGEN detects sex chromosomes by the naming convention, either X/Y or chrX/chrY. No other naming convention is supported.
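
The ploidy rules in this section can be summarized as a small lookup. The Python sketch below is an illustration under stated assumptions and not DRAGEN code. The function names are hypothetical, the PAR intervals are passed in by the caller, and the intervals used in any example are placeholders rather than real genome coordinates.

```python
def sex_chromosome(contig):
    """Return 'X' or 'Y' for contigs named X/Y or chrX/chrY, otherwise None."""
    base = contig[3:] if contig.startswith("chr") else contig
    return base if base in ("X", "Y") else None


def ploidy_for(contig, position, sample_sex, par_regions_x=()):
    """Ploidy used at a position, following the rules described above.

    sample_sex is 'male', 'female', or None (undetermined); values are
    compared case-insensitively. par_regions_x holds half-open
    (start, end) intervals on chrX supplied by the caller.
    """
    sex = (sample_sex or "").lower()
    chrom = sex_chromosome(contig)
    if sex != "male" or chrom is None:
        return 2  # female, undetermined sex, or an autosome
    if chrom == "Y":
        return 1
    in_par = any(start <= position < end for start, end in par_regions_x)
    return 2 if in_par else 1


def has_ploidy_conflict(contig, sample_sex):
    """Variant calls on chrY in a female sample get the PloidyConflict filter."""
    return (sample_sex or "").lower() == "female" and sex_chromosome(contig) == "Y"
```

For a male sample, `ploidy_for("chrX", 150, "male", [(100, 200)])` returns 2 because the position lies inside the supplied placeholder PAR interval, while a position outside it returns 1.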

Overlapping Mates in the Small Variant Calling

Instead of treating overlapping mates as independent evidence for a given event, DRAGEN handles overlapping mates in both the germline and somatic pipelines as follows.

When the two overlapping mates agree with each other on the allele with the highest HMM score, the genotyper uses the mate with the greatest difference between the highest and the second highest HMM score. The HMM score of the other mate becomes zero.
When the two overlapping mates disagree, the genotyper sums the HMM score between the two mates, assigns the combined score to the mate that agrees with the combined result, and changes the HMM score of the other mate to zero.
The base qualities of overlapping mates are no longer adjusted.
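The following sketch illustrates one reading of the overlapping-mate rules above. It assumes each mate carries a per-allele HMM score where higher is better, and it zeroes every allele score of the mate that is not used. The function name, the data layout, and the tie-break applied when neither mate agrees with the combined result are assumptions made for illustration, not documented behavior.

```python
def resolve_overlapping_mates(mate1, mate2):
    """mate1, mate2: dict mapping allele -> HMM score (higher is better).

    Returns the adjusted (mate1, mate2) score dicts.
    """
    def top(scores):
        return max(scores, key=scores.get)

    def margin(scores):
        ranked = sorted(scores.values(), reverse=True)
        return ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]

    zero1 = {allele: 0.0 for allele in mate1}
    zero2 = {allele: 0.0 for allele in mate2}

    if top(mate1) == top(mate2):
        # Agreement: keep the mate with the larger best-vs-second-best margin.
        if margin(mate1) >= margin(mate2):
            return dict(mate1), zero2
        return zero1, dict(mate2)

    # Disagreement: sum the scores per allele and find the combined winner.
    combined = {allele: mate1.get(allele, 0.0) + mate2.get(allele, 0.0)
                for allele in set(mate1) | set(mate2)}
    winner = top(combined)
    if top(mate2) == winner:
        return zero1, combined
    # mate1 agrees with the combined result, or neither mate does
    # (falling back to mate1 in that case is an assumption of this sketch).
    return combined, zero2
```

Either way, only one mate contributes evidence to the genotyper, so the overlapping bases are not counted twice.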

Mitochondrial Calling

--vc-enable-af-filter-mito Whether to enable the allele frequency filter for mitochondrial variant calling. The default is true.
--vc-af-call-threshold-mito Set the allele frequency threshold for emitting calls in the VCF. The default is 0.01.
--vc-af-filter-threshold-mito Set the allele frequency threshold to mark emitted VCF calls as filtered. The default is 0.02.

The following filters can be applied to mitochondrial variant calls.

--vc-sq-call-threshold Set the SQ threshold for emitting calls in the VCF. The default is 0.1.
--vc-sq-filter-threshold Set the SQ threshold to mark emitted VCF calls as filtered. The default is 3.0.
--vc-enable-triallelic-filter Enables the multiallelic filter. The default value is false.

If FORMAT/SQ > vc-sq-call-threshold, FORMAT/SQ > vc-sq-filter-threshold, and no other filters are triggered, the variant is output in the VCF with FILTER=PASS.
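As a rough illustration of how the two SQ thresholds interact, the sketch below classifies a call from its FORMAT/SQ value. The function name and the returned labels are hypothetical; the default thresholds are the ones listed above, and the handling of values exactly equal to a threshold follows the strict comparison in the preceding sentence.

```python
def classify_by_sq(sq, call_threshold=0.1, filter_threshold=3.0,
                   other_filters_triggered=False):
    """Classify a call by FORMAT/SQ (labels are illustrative only)."""
    if not sq > call_threshold:
        return "NOT_EMITTED"   # below the emit threshold: no VCF record
    if not sq > filter_threshold or other_filters_triggered:
        return "FILTERED"      # emitted, but marked as filtered
    return "PASS"
```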

FORMAT/GT

Personalized Germline Small Variant Calling

We leverage the new pangenome reference and multi-genome mapper output to compute a personalized 2-haplotype reference for the input sample.

Note that personalization is only available for the germline small variant caller (WGS and WES) when used with a pangenome reference.

Joint Detection of Overlapping Variants

When variants at multiple loci in a single active region are detected jointly, genotyping can benefit. DRAGEN combines loci into a joint detection region if the following conditions are met:

Loci have alleles that overlap each other.
Loci are in an STR region or less than 10 bases from an STR region.
Loci are less than 10 bases apart from each other.

Joint detection is enabled by default. To disable joint detection, set --vc-enable-joint-detection to false.

Modeling of Correlated Errors Across Reads

DRAGEN has two algorithms that model correlated errors across reads in a given pileup.

Foreign Read Detection

Base Quality Dropoff

DNA Mapping

Seed Density Option

The seed-density option controls how many (normally overlapping) primary seeds from each read the mapper looks up in its hash table for exact matches. The maximum density value of 1.0 generates a seed starting at every position in the read, ie, (L-K+1) K-base seeds from an L-base read.

Seed density must be between 0.0 and 1.0. Internally, an available seed pattern equal or close to the requested density is selected. The sparsest pattern is one seed per 32 positions, or density 0.03125.

Accuracy Considerations--Generally, denser seed lookup patterns improve mapping accuracy. However, for modestly long reads (eg, 50 bp+) and low sequencer error rates, there is little to be gained beyond the default 50% seed lookup density.
Speed Considerations--Denser seed lookup patterns generally slow down mapping, and sparser seed patterns speed it up. However, when the seed mapping stage can run faster than the aligning stage, a sparser seed pattern does not make the mapper much faster.

Relationship to Reference Seed Interval

Functionally, a denser or sparser seed lookup pattern has an impact very similar to a shorter or longer reference seed interval (build hash table option --ht-ref-seed-interval). Populating 100% of reference seed positions and looking up 50% of read seed positions has the same effect as populating 50% of reference seed positions and looking up 100% of read seed positions. Either way, the expected density of seed hits is 50%.

More generally, the expected density of seed hits is the product of the reference seed density (the inverse of the reference seed interval) and the seed lookup density. For example, if 50% of reference seeds are populated and 33.3% (1/3) of read seed positions are looked up, then the expected seed hit density should be 16.7% (1/6).
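The product rule above can be checked with a few lines of arithmetic. This is a minimal sketch; the function name is hypothetical, and exact fractions are used only to avoid floating-point rounding in the example.

```python
from fractions import Fraction

def expected_seed_hit_density(ref_seed_interval, seed_density):
    """Expected seed hit density = (1 / reference seed interval) * lookup density."""
    return Fraction(1, ref_seed_interval) * Fraction(seed_density)

# 50% of reference seeds populated (interval 2), 1/3 of read seeds looked up.
density = expected_seed_hit_density(2, Fraction(1, 3))   # Fraction(1, 6)
```

Swapping the two factors, for example interval 1 with density 0.5 versus interval 2 with density 1.0, gives the same expected hit density of 1/2.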

DRAGEN automatically adjusts its precise seed lookup pattern to ensure it does not systematically miss the seed positions populated from the reference. For example, the mapper does not look up seeds matching only odd positions in the reference when only even positions are populated in the hash table, even if the reference seed interval is 2 and seed-density is 0.5.

Map Orientations Option

The --Mapper.map-orientations option is used in mapping reads for bisulfite methylation analysis. It is set automatically based on the value set for --methylation-protocol.

The --Mapper.map-orientations option can restrict the orientation of read mapping to only forward in the reference genome, or only reverse-complemented. The valid values for --map-orientations are as follows.

0--Either orientation (default)
1--Only forward mapping
2--Only reverse-complemented mapping

If mapping orientations are restricted and paired end reads are used, the expected pair orientation can only be FR, not FF or RF.

Seed-Editing Options

Although DRAGEN primarily maps reads by finding exact reference matches to short seeds, it can also map seeds differing from the reference by one nucleotide by also looking up single-SNP edited seeds. Seed editing is usually not necessary with longer reads (100 bp+), because longer reads have a high probability of containing at least one exact seed match. This is especially true when paired ends are used, because a seed match from either mate can successfully align the pair. But seed editing can, for example, be useful to increase mapping accuracy for short single-ended reads, with some cost in increased mapping time. The following options control seed editing:

Seed Editing Options

edit-mode and edit-chain-limit

The edit-mode and edit-chain-limit options control when seed editing is used. The following four edit-mode values are available:

Edit mode 0 requires all seeds to match exactly. Mode 3 is the most expensive because every seed that fails to match the reference exactly is edited. Modes 1 and 2 employ heuristics to look up edited seeds only for reads most likely to be salvaged to accurate mapping.

The main heuristic in edit modes 1 and 2 is a seed chain length test. Exact seeds are mapped to the reference in a first pass over a given read, and the matching seeds are grouped into chains of similarly aligning seeds. If the longest seed chain (in the read) exceeds a threshold edit-chain-limit, the read is judged not to require seed editing, because there is already a promising mapping position.

Edit mode 1 triggers seed editing for a given read using the seed chain length test. If no seed chain exceeds edit-chain-limit (including if no exact seeds match), then a second seed mapping pass is attempted using edited seeds. Edit mode 2 further optimizes the heuristic for paired-end reads. If either mate has an exact seed chain longer than edit-chain-limit, then seed editing is disabled for the pair, because a rescue scan is likely to recover the mate alignment based on seed matches from one read. Edit mode 2 is the same as mode 1 for single-ended reads.
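The decision logic described for the four edit modes can be sketched as follows. The function and argument names are hypothetical, and chain length is treated as an abstract number because its unit is not specified here. For mode 3 the sketch returns True at the read level, although editing in that mode applies only to the individual seeds that fail to match exactly.

```python
def seed_editing_triggered(edit_mode, edit_chain_limit, longest_chain,
                           mate_longest_chain=None):
    """Decide whether a second, edited-seed pass is attempted for a read.

    longest_chain: longest exact seed chain for this read (0 if no seed matched).
    mate_longest_chain: same for the mate, or None for single-ended reads.
    """
    if edit_mode == 0:
        return False          # exact seed matches only
    if edit_mode == 3:
        return True           # every non-matching seed is edited
    if edit_mode not in (1, 2):
        raise ValueError("edit_mode must be 0, 1, 2, or 3")
    if longest_chain > edit_chain_limit:
        return False          # a promising mapping position already exists
    if (edit_mode == 2 and mate_longest_chain is not None
            and mate_longest_chain > edit_chain_limit):
        return False          # a rescue scan can use the mate's seed matches
    return True
```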

edit-seed-num and edit-read-len

For edit modes 1 and 2, when the heuristic triggers seed editing, these options control how many seed positions are edited in the second pass over the read. Although exact seed mapping can use a densely overlapping seed pattern, such as seeds starting at 50% or 100% of read positions, most of the value of seed editing can be obtained by editing a much sparser pattern of seeds, even a nonoverlapping pattern. Generally, if a user application can afford to spend some additional amount of mapping time on seed editing, a greater increase in mapping accuracy can be obtained for the same time cost by editing seeds in sparse patterns for a large number of reads, than by editing seeds in dense patterns for a small number of reads.

Whenever seed editing is triggered, these two options request edit-seed-num seed editing positions, distributed evenly over the first edit-read-len bases of the read. For example, with 21-base seeds, edit-seed-num=6 and edit-read-len=100, edited seeds can begin at offsets {0, 16, 32, 48, 64, 80} from the 5' end, consecutive seeds overlapping by 5 bases. Because sequencing technologies often yield better base qualities nearer the (5') beginning of each read, this can focus seed editing where it is most likely to succeed. When a particular read is shorter than edit-read-len, fewer seeds are edited.
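The offsets in the example above can be reproduced with a simple spacing rule. The rule used here, a stride of edit-read-len divided by edit-seed-num rounded down, is an assumption chosen because it matches the documented example; the mapper's actual placement may differ. The function name and the `read_len` argument are hypothetical.

```python
def edit_seed_offsets(seed_len, edit_seed_num, edit_read_len, read_len):
    """Start offsets (from the 5' end) of the seeds chosen for editing."""
    stride = edit_read_len // edit_seed_num   # assumed spacing rule
    offsets = [i * stride for i in range(edit_seed_num)]
    # Shorter reads get fewer edited seeds: keep only seeds that fit in the read.
    return [o for o in offsets if o + seed_len <= read_len]

offsets = edit_seed_offsets(seed_len=21, edit_seed_num=6,
                            edit_read_len=100, read_len=150)
# offsets == [0, 16, 32, 48, 64, 80]; consecutive seeds overlap by 21 - 16 = 5 bases
```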

Seed editing is more expensive when the reference seed interval (build hash table option --ht-ref-seed-interval) is greater than 1. For edit modes 1 and 2, additional seed editing positions are automatically generated to avoid missing the populated reference seed positions. For edit mode 3, the time cost can increase dramatically because query seeds matching unpopulated reference positions typically miss and trigger editing.

DNA Aligning

Smith-Waterman Alignment Scoring Settings

The first stage of mapping is to generate seeds from the read and look for exact matches in the reference genome. These results are then refined by running full Smith-Waterman alignments on the locations with the highest density of seed matches. This well-documented algorithm works by comparing each position of the read against all the candidate positions of the reference. These comparisons correspond to a matrix of potential alignments between read and reference. For each of these candidate alignment positions, Smith-Waterman generates scores that are used to evaluate whether the best alignment passing through that matrix cell reaches it by a nucleotide match or mismatch (diagonal movement), a deletion (horizontal movement), or an insertion (vertical movement). A match between read and reference adds a bonus to the score, and a mismatch or indel imposes a penalty. The overall highest scoring path through the matrix is the alignment chosen.

The specific values chosen for scores in this algorithm indicate how to balance, for an alignment with multiple possible interpretations, the possibility of an indel as opposed to one or more SNPs, or the preference for an alignment without clipping. The default DRAGEN scoring values are reasonable for aligning moderate length reads to a whole human reference genome for variant calling applications. But any set of Smith-Waterman scoring parameters represents an imprecise model of genomic mutation and sequencing errors, and differently tuned alignment scoring values can be more appropriate for some applications.

The following alignment options control Smith-Waterman Alignment:

global The global option (value can be 0 or 1) controls whether alignment is forced to be end-to-end in the read. When set to 1, alignments are always end-to-end, as in the Needleman-Wunsch global alignment algorithm (although not end-to-end in the reference), and alignment scores can be positive or negative. When set to 0, alignments can be clipped at either or both ends of the read, as in the Smith-Waterman local alignment algorithm, and alignment scores are nonnegative. Generally, global=0 is preferred for longer reads, so significant read segments after a break of some kind (large indel, structural variant, chimeric read, and so forth) can be clipped without severely decreasing the alignment score. Setting global=1 might not have the desired effect with longer reads because insertions at or near the ends of a read can function as pseudoclipping. Also, with global=0, multiple (chimeric) alignments can be reported when various portions of a read match widely separated reference positions. Using global=1 is sometimes preferable with short reads, which are unlikely to overlap structural breaks, unable to support chimeric alignments, and are suspected of incorrect mapping if they cannot align well end-to-end. Consider using the unclip-score option, or increasing it, instead of setting global=1, to make a soft preference for unclipped alignments.
match-score The match-score option specifies the score for a read nucleotide matching a reference nucleotide (A, C, G, or T), or matching a reference 2–3 nucleotide IUPAC-IUB code. Its value is an unsigned integer, from 0 to 15. match-score=0 can only be used when global=1. A higher match score results in longer alignments and fewer long insertions.
match-n-score The match-n-score option specifies the score for an aligned position where the read position and/or the reference position is an N code. This option is a signed integer, from -16 to 15.
mismatch-pen The mismatch-pen option is the penalty (negative score) for a read nucleotide mismatching any reference nucleotide or IUPAC-IUB code, except N. This option is an unsigned integer, from 0 to 63. A higher mismatch penalty results in alignments with more insertions, deletions, and clipping to avoid SNPs.
gap-open-pen The gap-open-pen option is the penalty (negative score) for opening a gap (ie, an insertion or deletion). This value is only for a 0-base gap. It is always added to the gap length times gap-ext-pen. This option is an unsigned integer, from 0 to 127. A higher gap open penalty causes fewer insertions and deletions of any length in alignment CIGARs, with clipping or alignment through SNPs used instead.
gap-ext-pen The gap-ext-pen option is the penalty (negative score) for extending a gap (ie, an insertion or deletion) by one base. This option is an unsigned integer, from 0 to 15. A higher gap extension penalty causes fewer long insertions and deletions in alignment CIGARs, with short indels, clipping, or alignment through SNPs used instead.
unclip-score The unclip-score option is the score bonus for an alignment reaching the beginning or end of the read. An end-to-end alignment receives twice this bonus. This option is an unsigned integer, from 0 to 127. A higher unclipped bonus causes alignment to reach the beginning and/or end of a read more often, where this can be done without too many SNPs or indels. A nonzero unclip-score is useful when global=0 to make a soft preference for unclipped alignments. Unclipped bonuses have little effect on alignments when global=1, because end-to-end alignments are forced anyway (although 2 × unclip-score does add to every alignment score unless no-unclip-score=1). Note that, especially with longer reads, setting unclip-score much higher than gap-open-pen can have the undesirable effect of insertions at or near one end of a read being utilized as pseudoclipping, as happens with global=1.
no-unclip-score The no-unclip-score option can be 0 or 1. The default is 1. When no-unclip-score is set to 1, any unclipped bonus (unclip-score) contributing to an alignment is removed from the alignment score before further processing, such as comparison with aln-min-score, comparison with other alignment scores, and reporting in AS or XS tags. However, the unclipped bonus still affects the best-scoring alignment found by Smith-Waterman alignment to a given reference segment, biasing toward unclipped alignments. When unclip-score > 0 causes a Smith-Waterman local alignment to extend out to one or both ends of the read, the alignment score stays the same or increases if no-unclip-score=0, whereas it stays the same or decreases if no-unclip-score=1. The default, no-unclip-score=1, is recommended when global=1, because every alignment is end-to-end, and there is no need to add the same bonus to every alignment. When changing no-unclip-score, consider whether aln-min-score should be adjusted. When no-unclip-score=0, unclipped bonuses are included in alignment scores compared to the aln-min-score floor, so the subset of alignments filtered out by aln-min-score can change significantly with no-unclip-score.
aln-min-score The aln-min-score option specifies a minimum acceptable alignment score. Any alignment results scoring lower are discarded. Increasing or decreasing aln-min-score can reduce or increase the percentage of reads mapped. This option is a signed integer (negative alignment scores are possible with global=1). aln-min-score also affects MAPQ estimates. The primary contributor to MAPQ calculation is the difference between the best and second-best alignment scores. A read's best alignment score is saved in the AS SAM tag, and the second-best score (if available) is saved in the XS tag. aln-min-score serves as the suboptimal alignment score if nothing higher was found except the best score. Therefore, increasing aln-min-score can decrease reported MAPQ for some low-scoring alignments. You can use the min-score-coeff option to adjust aln-min-score as a function of read length.
min-score-coeff The min-score-coeff option adjusts aln-min-score per read base. Using the min-score-coeff and aln-min-score options together, you can define the minimum alignment score for each read as an affine function of read length. The minimum score for an N-base read is calculated as follows: (min-score-coeff) × N + (aln-min-score). The min-score-coeff option is a real number ranging from –64 to 63.999. If the value is 0, then the minimum alignment score is fixed at aln-min-score for all read lengths. You can use positive values for min-score-coeff to allow shorter reads to match with lower alignment scores, but require longer reads to achieve higher scores.
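To make the interaction of these options concrete, the sketch below scores an alignment from simple counts and computes the read-length-dependent minimum score. It is a simplified illustration, not the aligner's implementation: it ignores N positions (match-n-score), the function names are hypothetical, and the parameter values in the usage lines are illustrative rather than DRAGEN defaults.

```python
def gap_penalty(length, gap_open_pen, gap_ext_pen):
    """Affine gap cost: the open penalty plus one extension penalty per base."""
    return gap_open_pen + length * gap_ext_pen

def alignment_score(matches, mismatches, gap_lengths, unclipped_ends, *,
                    match_score, mismatch_pen, gap_open_pen, gap_ext_pen,
                    unclip_score, no_unclip_score):
    """Reported score for one alignment.

    gap_lengths: lengths of the insertions and deletions in the alignment.
    unclipped_ends: 0, 1, or 2 read ends reached without clipping.
    """
    score = matches * match_score - mismatches * mismatch_pen
    score -= sum(gap_penalty(n, gap_open_pen, gap_ext_pen) for n in gap_lengths)
    if not no_unclip_score:
        # With no-unclip-score=1 the bonus is removed from the reported score.
        score += unclipped_ends * unclip_score
    return score

def min_alignment_score(read_len, aln_min_score, min_score_coeff=0.0):
    """Minimum acceptable score: (min-score-coeff) x N + (aln-min-score)."""
    return min_score_coeff * read_len + aln_min_score

# Illustrative values only (not defaults): 98 matches, 2 mismatches,
# one 2-base deletion, both read ends unclipped.
params = dict(match_score=1, mismatch_pen=4, gap_open_pen=6, gap_ext_pen=1,
              unclip_score=5)
reported = alignment_score(98, 2, [2], 2, no_unclip_score=1, **params)    # 82
with_bonus = alignment_score(98, 2, [2], 2, no_unclip_score=0, **params)  # 92
```

The 10-point difference between the two results is the 2 × unclip-score bonus, which is why aln-min-score may need adjusting when no-unclip-score changes.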

Paired-End Options

DRAGEN can process paired-end data passed via a pair of FASTQ files or in a single interleaved FASTQ file. The hardware maps the two ends separately, and then determines a set of alignments that seem most likely to form a pair in the expected orientation and having roughly the expected insert size. The alignments for the two ends are evaluated for the quality of their pairing, with larger penalties for insert sizes far from the expected size. The following options control processing of paired-end data:

Reorientation The pe-orientation option specifies the expected paired-end orientation. Only pairs with this orientation can be flagged as proper pairs. Valid values are as follows:
- 0--FR (default)
- 1--RF
- 2--FF
unpaired-pen For paired end reads, best mapping positions are determined jointly for each pair, according to the largest pair score found, considering the various combinations of alignments for each mate. A pair score is the sum of the two alignment scores minus a pairing penalty, which estimates the unlikelihood of insert lengths further from the mean insert than this aligned pair. The unpaired-pen option specifies how much alignment pair scores should be penalized when the two alignments are not in properly paired position or orientation. This option also serves as the maximum pairing penalty for properly paired alignments with extreme insert lengths. The unpaired-pen option is specified in Phred scale, according to its potential impact on MAPQ. Internally, it is scaled into alignment score space based on Smith-Waterman scoring parameters.
pe-max-penalty The pe-max-penalty option limits how much the estimated MAPQ for one read can increase because its mate aligned nearby. A paired alignment is never assigned MAPQ higher than the MAPQ that it would have received mapping single-ended, plus this value. By default, pe-max-penalty = mapq-max = 255, effectively disabling this limit. The key difference between unpaired-pen and pe-max-penalty is that unpaired-pen affects calculated pair scores, and thus which alignments are selected, whereas pe-max-penalty affects only the reported MAPQ for paired alignments.
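The MAPQ limit described for pe-max-penalty amounts to a simple cap. The sketch below is illustrative only: the function and argument names are hypothetical, both limits are passed explicitly rather than assuming defaults, and the mapq-max ceiling (described under Output Options) is included for completeness.

```python
def reported_paired_mapq(paired_mapq, single_ended_mapq, pe_max_penalty, mapq_max):
    """Cap the paired MAPQ at single-ended MAPQ + pe-max-penalty and at mapq-max."""
    return min(paired_mapq, single_ended_mapq + pe_max_penalty, mapq_max)

# A read that would get MAPQ 20 on its own cannot be raised above 30
# by a well-placed mate when pe-max-penalty is 10.
capped = reported_paired_mapq(60, 20, pe_max_penalty=10, mapq_max=60)   # 30
```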

Mean Insert Size Detection

When working with paired-end data, DRAGEN must choose among the highest-quality alignments for the two ends to try to choose likely pairs. To make this choice, DRAGEN uses a skew normal insert model to evaluate the likelihood that a pair of alignments constitutes a pair. This model is based on the observation that common library preparation methods have insert-size distributions that are sometimes close to normal, but also sometimes clearly asymmetric, often skewing toward longer insert sizes. The skew normal insert model is used only for the DNA mode.

If you know the statistics of your library prep for an input file (and the file consists of a single read group), you can specify the characteristics of the insert-length distribution: mean, standard deviation, shape (or skewness) and three quartiles. These characteristics can be specified with the Aligner.pe-stat-mean-insert, Aligner.pe-stat-stddev-insert, Aligner.pe-stat-shape-insert, Aligner.pe-stat-quartiles-insert, and Aligner.pe-stat-mean-read-len options. However, it is typically preferable to allow DRAGEN to detect these characteristics automatically.

DRAGEN automatically samples the insert-length distribution. When the software starts execution, it runs a sample of up to 2,000,000 pairs through the aligner, calculates the distribution, and then uses the resulting statistics for evaluating all pairs in the input set.

The DRAGEN host software reports the statistics in its stdout log in a report, as follows:

Note that the Mean, Standard deviation and Quartiles reported above are the sample mean, standard deviation and quartiles calculated from the initial sample of up to 2,000,000 pairs, assuming a normal distribution. The sample mean and standard deviation are used to fit the parameters of a skew-normal distribution. A skew-normal distribution is defined by starting with an underlying normal distribution (whose mean we call position or xi and standard deviation we call scale or omega) and folding a varying portion of the probability mass from one side of the mean (e.g., left side) to the other (e.g., right) side. The portion folded varies smoothly, from 0% at the original mean, approaching 100% from the left tail to the right tail. A shape parameter which we call alpha controls how rapidly the folded fraction increases, and at alpha=0 there is no folding and the distribution remains normal.

In the standard output, we also include the command line options needed to reproduce the DRAGEN run with the same insert stat settings. Note that when specifying stats on the command line, the skew-normal xi value should be used for Aligner.pe-stat-mean-insert. The omega value should be used for Aligner.pe-stat-stddev-insert, and the alpha value should be used for Aligner.pe-stat-shape-insert. If Aligner.pe-stat-shape-insert is not specified on the command line, a default value of 0 is assumed.
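For reference, the position (xi), scale (omega), and shape (alpha) parameters described above correspond to the standard skew-normal density, sketched below using only the Python standard library. Whether DRAGEN evaluates the density in exactly this form is an assumption; the function names are hypothetical. The second function shows why xi generally differs from the sample mean when alpha is nonzero.

```python
import math

def skew_normal_pdf(x, xi, omega, alpha):
    """Standard skew-normal density: (2/omega) * phi(z) * Phi(alpha * z)."""
    z = (x - xi) / omega
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)      # normal pdf
    Phi = 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))     # normal cdf
    return 2.0 / omega * phi * Phi

def skew_normal_mean(xi, omega, alpha):
    """Mean of the skew-normal distribution; equals xi only when alpha = 0."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    return xi + omega * delta * math.sqrt(2.0 / math.pi)
```

With alpha=0 the density reduces to a normal distribution with mean xi and standard deviation omega. A positive alpha shifts probability mass toward longer inserts, so the distribution mean lies above xi.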

The insert length distribution for each sample is written to fragment_length_hist.csv. Each sample starts with the following lines

These lines are followed by the histogram for the first ~2M read pairs for DNA (~100K read pairs for RNA). The histogram counts are aggregated across all read groups sharing the same sample id (RGSM field).

When the number of sample pairs is very small, there is not enough information to characterize the distribution with high confidence. In this case, DRAGEN applies default statistics that specify a very wide insert distribution, which tends to admit pairs of alignments as proper pairs, even if they may lie tens of thousands of bases apart. In this situation, DRAGEN outputs a message, as follows:

The small samples formula calculates standard deviation as follows:

The default model is "standard deviation = 10000". If the first 2M reads are unmapped or if all pairs are improper pairs, then the standard deviation is set to 10000 and the mean and quartiles are set to 0. Note that the minimum value for standard deviation is 12, which is independent of the number of samples. Also, in DNA mode, when there are fewer than 1000 high-quality alignments, DRAGEN reverts to the normal-distribution-based insert model, because there are too few samples to accurately estimate the parameters of the skew-normal distribution.

For RNA-Seq data, the insert size distribution is not normal due to pairs containing introns. The DRAGEN software estimates the distribution using a kernel density estimator to fit a long tail to the samples. This estimate leads to a more accurate mean and standard deviation for RNA-Seq data and proper pairing.

DRAGEN writes detected paired-end stats into a tab-delimited log file in the output directory called .insert-stats.tab. This file contains the statistical distribution of detected insert sizes for each read group, including quartiles, mean, standard deviation, shape, minimum, and maximum. The information matches the standard-out report above. Additionally, the log file includes the minimum and maximum insert limits that DRAGEN applied for rescue scans. Note that the reported mean and standard deviation in this tab-delimited log file are the xi and omega parameters of the skew-normal distribution.

Rescue Scans

For paired-end reads, where a seed hit is found for one mate but not the other, rescue scans hunt for missing mate alignments within a rescue radius of the mean insert length. Normally, the DRAGEN host software sets the rescue radius to 2.5 standard deviations of the empirical insert distribution. But in cases where the insert standard deviation is large compared to the read length, the rescue radius is restricted to limit mapping slowdowns. In this case, a warning message is displayed, as follows:

Although the user can ignore this warning, or specify an intermediate rescue radius to maintain mapping speed, it is recommended to use 2.5 sigmas for the rescue radius to maintain mapping sensitivity. To disable rescue scanning, set max-rescues to 0.
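As a small worked example, the default rescue radius translates into an insert-length window as follows. The function name is hypothetical, and the sketch does not model the restriction applied when the standard deviation is large relative to the read length, because that rule is not specified here.

```python
def rescue_insert_window(mean_insert, stddev_insert, sigmas=2.5):
    """Insert-length range searched for a missing mate: mean +/- sigmas * stddev."""
    radius = sigmas * stddev_insert
    return max(0.0, mean_insert - radius), mean_insert + radius

window = rescue_insert_window(300, 50)   # (175.0, 425.0)
```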

Output Options

DRAGEN can track multiple independent alignments for each read. These alignments include the optimal (primary) one, as well as those mapping different subsegments of the read, (chimeric/supplementary), and sub-optimal (secondary) mappings of the read to different areas of the reference.

For DNA alignment by default, DRAGEN can emit one primary alignment for each read, up to three chimeric alignments (Aligner.supp-aligns=3), and no secondary alignments (Aligner.sec-aligns=0). The maximum user-specified value for supp-aligns or sec-aligns is 4095.

You can use the following configuration options to control how many of each type of alignment to include in DRAGEN output.

mapq-max The mapq-max option specifies a ceiling on the estimated MAPQ that can be reported for any alignment, from 0 to 255. If the calculated MAPQ is higher, this value is reported instead. The default is 60.
supp-aligns, sec-aligns The supp-aligns and sec-aligns options restrict the maximum number of supplementary (ie, chimeric and SAM FLAG 0x800) alignments and secondary (ie, suboptimal and SAM FLAG 0x100) alignments, respectively, that can be reported for each read. A maximum of 4095 supplementary alignments and 4095 secondary alignments can be reported for any read, in addition to a primary alignment. High settings for these two options impact speed so it is advisable to increase only as needed.
sec-phred-delta The sec-phred-delta option controls which secondary alignments are emitted based on the alignment score relative to the primary reported alignment. Only secondary alignments with likelihood within this Phred value of the primary are reported.
sec-aligns-hard The sec-aligns-hard option suppresses the output of all secondary alignments if there are more secondary alignments than can be emitted. Set sec-aligns-hard to 1 to force the read to be unmapped when not all secondary alignments can be output.
supp-as-sec When the supp-as-sec option is set to 1, then supplementary (chimeric) alignments are reported with SAM FLAG 0x100 instead of 0x800. The default is 0. The supp-as-sec option provides compatibility with tools that do not support FLAG 0x800.
hard-clips The hard-clips option is used as a field of 3 bits, with values ranging from 0 to 7. The bits specify alignments, as follows:
- Bit 0--primary alignments
- Bit 1--supplementary alignments
- Bit 2--secondary alignments

Each bit determines whether local alignments of that type are reported with hard clipping (1) or soft clipping (0). The default is 6, meaning primary alignments use soft clipping and supplementary and secondary alignments use hard clipping.
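The bit field can be decoded as in the following sketch, which uses a hypothetical helper name and simply applies the bit assignments listed above.

```python
def decode_hard_clips(value):
    """Map a hard-clips value (0-7) to the clipping style per alignment type."""
    if not 0 <= value <= 7:
        raise ValueError("hard-clips must be between 0 and 7")
    names = ("primary", "supplementary", "secondary")   # bits 0, 1, 2
    return {name: "hard" if (value >> bit) & 1 else "soft"
            for bit, name in enumerate(names)}

default = decode_hard_clips(6)
# {'primary': 'soft', 'supplementary': 'hard', 'secondary': 'hard'}
```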

Mapping with ALT-contigs

The GRCh38 human reference contains many more alternate haplotypes (ALT contigs) than previous versions of the reference. Generally, including ALT contigs in the mapping reference improves mapping and variant calling specificity, because misalignments are eliminated for reads matching an ALT contig but scoring poorly against the primary assembly. However, mapping with GRCh38's ALT contigs without special treatment can substantially degrade variant calling sensitivity in corresponding regions, because many reads align equally well to an ALT contig and to the corresponding position in the primary assembly.

Masked Based ALT-awareness

The recommended and default approach for dealing with ALT contigs in DRAGEN is masking regions of ALT contigs that are highly similar to their corresponding primary contig. This approach is more accurate than liftover-based ALT-awareness because there are many places where the "correct" or most useful liftover between a long ALT haplotype and the primary assembly is ambiguous. Incorrect liftover can produce dense clusters of mismapped reads and false variant calls. The base masking approach has the benefits of using ALT contigs without the negative consequences.

Masked hash tables are built from a standard hg19 or hg38 FASTA that contains ALT contigs. The hash table builder automatically masks regions of the ALT contigs with Ns.

Liftover Based ALT-awareness

With liftover based ALT-awareness, the mapper and aligner are aware of the liftover relationship between ALT contig positions and corresponding primary assembly positions. Seed matches within ALT contigs are used to obtain corresponding primary assembly alignments, even if the latter score poorly. Liftover groups are formed, each containing a primary assembly alignment candidate, and zero or more ALT alignment candidates that lift to the same location. Each liftover group is scored according to its best-matching alignments, taking properly paired alignments into account. The winning liftover group provides its primary assembly representative as the primary output alignment, with MAPQ calculated based on the score difference to the second-best liftover group. Emitting primary alignments within the primary assembly maintains normal aligned coverage and facilitates variant calling there. If the --Aligner.en-alt-hap-aln option is set to 1 and --Aligner.supp-aligns is greater than 0, then corresponding alternate haplotype alignments can also be output, flagged as supplementary alignments.

DRAGEN requires ALT-Aware hash tables for any hg19 or GRCh38 reference where ALT contigs are detected. To disable this requirement in DRAGEN, set the --ht-alt-aware-validate option to false.

The following is a comparison of alternative options for dealing with alternate haplotypes.

Mapping without ALT contigs in the reference:
- False-positive variant calls result when reads matching an alternate haplotype misalign somewhere else.
- Poor mapping and variant calling sensitivity where reads matching an ALT contig differ greatly from the primary assembly.
Mapping with ALT contigs but no ALT awareness:
- False-positive variant calls from misaligned reads matching ALT contigs are eliminated.
- Low or zero aligned coverage in primary assembly regions covered by alternate haplotypes, due to some reads mapping to ALT contigs.
- Low or zero MAPQ in regions covered by alternate haplotypes, where they are similar or identical to the primary assembly.
- Variant calling sensitivity is dramatically reduced throughout regions covered by alternate haplotypes.
Mapping with ALT contigs and ALT awareness:
- False-positive variant calls from misaligned reads matching ALT contigs are eliminated.
- Normal aligned coverage in regions covered by alternate haplotypes because primary alignments are to the primary assembly.
- Normal MAPQs are assigned because alignment candidates in alternative haplotypes are not considered in competition.
- Good mapping and variant calling sensitivity where reads matching an ALT contig differ greatly from the primary assembly.

DRAGEN Multigenome Mapper

The Multigenome Mapper in DRAGEN significantly improves the accuracy of mapping Illumina reads, particularly in challenging regions such as segmental duplications and other difficult-to-map regions. This method leverages population haplotypes from pangenome references to incorporate additional variant information, constructing alternative haplotype paths that improve read mapping. By offering these alternate paths, the Multigenome Mapper enables reads containing population-specific variants to align directly to their most likely genomic locations, reducing mapping ambiguity. Better mapping in turn improves variant calling accuracy.

Given a set of population variants (VCF) or haplotypes, the pangenome reference is modified in the following ways:

Alternate contigs. Alt contigs represent population haplotypes and can carry a single variant or a combination of nearby phased variants.
Ambiguous (IUPAC) codes. Isolated population SNPs are encoded in the reference FASTA as IUPAC ambiguity codes to improve alignment.
Haplotype database. An additional haplotype database is built and used to augment the reference FASTA with population variants. A multigenome-based mapper algorithm scores read alignments according to the variants in this database.

Prepare a Reference Genome

Before a reference genome can be used with DRAGEN, it must be converted from FASTA format into a custom binary format for use with the DRAGEN hardware. The options used in this preprocessing step offer tradeoffs between performance and mapping quality.

Pre-built DRAGEN reference genomes are available for download in the Illumina customer portal. If you find that performance and mapping quality with these are adequate, there is a good chance that you can simply work with these supplied reference genomes. Depending on your read lengths and other particular aspects of your application, you may be able to improve mapping quality and/or performance by tuning the reference preprocessing options.

Hash Table Background

The DRAGEN mapper extracts many overlapping seeds (subsequences or K-mers) from each read, and looks up those seeds in a hash table residing in memory on its PCIe card, to identify locations in the reference genome where the seeds match. Hash tables are ideal for extremely fast lookups of exact matches. The DRAGEN hash table must be constructed from a chosen reference genome using the --build-hash-table option, which extracts many overlapping seeds from the reference genome, populates them into records in the hash table, and saves the hash table as a binary file.

Automatic Reference Detection

DRAGEN attempts to detect the provided reference in order to automatically apply recommended resources and settings. There are four human references that DRAGEN can detect: hg38, hg19, hs37d5, and chm13v2. DRAGEN can also detect references that contain a subset of the primary contigs from one of these references, as long as the names and lengths of the detected contigs are consistent with the names and lengths from the standard assemblies of these references.

In detail, automatic reference detection operates as follows:

We define a primary contig of a human genome to be an autosome (1-22) or sex chromosome (X,Y). Let F be the input fasta. For each reference genome R in hg38, hg19, hs37d5, and chm13v2, DRAGEN checks if there are any contigs in F that have the same name and length as a primary contig in R, and that there are no contigs in F that have the same name as a contig in R, but with different length. If these conditions hold for exactly one of hg38, hg19, hs37d5, and chm13v2, then that reference is detected and resources may be applied automatically.
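The detection rule above can be sketched in a few lines of Python. This is only an illustration of the stated logic, not DRAGEN's implementation: the contig lengths in `KNOWN_REFERENCES` are made-up toy values, the function names are hypothetical, and each toy reference lists only primary contigs, so the "same name, different length" check is applied to those alone.

```python
# Toy stand-ins for the standard assemblies: contig name -> length.
# The lengths are invented for illustration and are NOT real assembly lengths.
KNOWN_REFERENCES = {
    "hg38": {"chr1": 1000, "chr2": 900, "chrX": 500},
    "hg19": {"chr1": 1010, "chr2": 905, "chrX": 510},
    "hs37d5": {"1": 1010, "2": 905, "X": 510},
}


def matches(fasta_contigs, reference_contigs):
    """True if at least one contig agrees in name and length with the reference,
    and no contig shares a name with the reference while differing in length."""
    shared = [name for name in fasta_contigs if name in reference_contigs]
    if not shared:
        return False
    return all(fasta_contigs[name] == reference_contigs[name] for name in shared)


def detect_reference(fasta_contigs):
    """Return the name of the single matching reference, or None when zero or
    more than one reference satisfies the conditions."""
    hits = [
        name
        for name, contigs in KNOWN_REFERENCES.items()
        if matches(fasta_contigs, contigs)
    ]
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    # A subset of the toy hg38 primary contigs is still detected.
    print(detect_reference({"chr1": 1000, "chr2": 900}))
    # Mixed lengths match no single reference, so nothing is detected.
    print(detect_reference({"chr1": 1000, "chr2": 905}))
```

Requiring exactly one match means a FASTA that is consistent with two assemblies, or with none, is left undetected and receives no automatic resources.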

The DRAGEN hash table builder automatically applies decoy contigs and mask BED files to a detected reference. Other pipelines may also apply automatic resources. For example, variant callers may apply machine learning models and target BED files.

Naming Conventions

In order for DRAGEN to correctly detect the provided reference, it is important to use the standard naming conventions for each of the four human assemblies that DRAGEN detects:

Reference Seed Interval

The size of the DRAGEN hash table is proportional to the number of seeds populated from the reference genome. The default is to populate a seed starting at every position in the reference genome, i.e., roughly 3 billion seeds from a human genome. This default requires at least 32 GB of memory on the DRAGEN PCIe board.

To operate on larger, nonhuman genomes or to reduce hash table congestion, it is possible to populate fewer than all reference seeds using the --ht-ref-seed-interval option to specify an average reference interval. The default interval for 100% population is --ht-ref-seed-interval 1, and 50% population is specified with --ht-ref-seed-interval 2. The population interval does not need to be an integer. For example, --ht-ref-seed-interval 1.2 indicates 83.3% population, with mostly 1-base and some 2-base intervals to achieve a 1.2 base interval on average.
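The arithmetic behind the interval can be checked with a short Python sketch. The population percentage is simply 100 divided by the interval. The `seed_starts` helper shows one possible way to mix 1-base and 2-base steps so that the average spacing is 1.2. The spacing scheme and both function names are assumptions for illustration, and the guide does not state how DRAGEN actually places the seeds.

```python
from fractions import Fraction
from math import floor


def population_percent(interval):
    """Percentage of reference positions that receive a seed."""
    return 100.0 / float(interval)


def seed_starts(reference_length, interval):
    """Seed start positions with an average spacing of `interval` bases.

    Illustrative scheme only: the i-th seed starts at floor(i * interval).
    Fractions avoid floating-point drift for intervals such as 1.2.
    """
    step = Fraction(str(interval))
    starts = []
    i = 0
    while floor(i * step) < reference_length:
        starts.append(floor(i * step))
        i += 1
    return starts


if __name__ == "__main__":
    print(round(population_percent(1.2), 1))
    # Positions 5 and 11 are skipped, giving mostly 1-base and some 2-base steps.
    print(seed_starts(12, 1.2))
```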

Hash Table Occupancy

It is characteristic of hash tables that they are allocated a certain size, but always retain some empty records, so they are less than 100% occupied. A healthy amount of empty space is important for quick access to the DRAGEN hash table. Approximately 90% occupancy is a good upper bound. Empty space is important because records are pseudo-randomly placed in the hash table, resulting in an abnormally high number of records in some places. These congested regions can get quite large as the percentage of empty space approaches zero, and queries by the DRAGEN mapper for some seeds can become increasingly slow.

Hash Table / Seed Length

The hash table is populated with reference seeds of a single common length. This primary seed length is controlled with the --ht-seed-len option, which defaults to 21.

The longest primary seed supported is 27 bases when the table is 8 GB to 31.5 GB in size. Generally, longer seeds are better for run time performance, and shorter seeds are better for mapping quality (success rate and accuracy). A longer seed is more likely to be unique in the reference genome, facilitating fast mapping without needing to check many alternative locations. But a longer seed is also more likely to overlap a deviation from the reference (variant or sequencing error), which prevents successful mapping by an exact match of that seed (although another seed from the read may still map), and there are fewer long seed positions available in each read.

Longer seeds are more appropriate for longer reads, because there are more seed positions available to avoid deviations.

Seed Length Recommendations

Hash Table / Seed Extensions

Due to repetitive sequences, some seeds of any given length match many locations in the reference genome. DRAGEN uses a unique mechanism called seed extension to successfully map such high-frequency seeds. When the software determines that a primary seed occurs at many reference locations, it extends the seed by some number of bases at both ends, to some greater length that is more unique in the reference.

For example, a 21-base primary seed may be extended by 7 bases at each end to a 35-base extended seed. A 21-base primary seed may match 100 places in the reference. But 35-base extensions of these 100 seed positions may divide into 40 groups of 1-3 identical 35-base seeds. Iterative seed extensions are also supported, and are automatically generated when a large set of identical primary seeds contains various subsets that are best resolved by different extension lengths.
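The grouping effect of seed extension can be reproduced on a toy sequence. The Python sketch below uses a 4-base primary seed extended by 2 bases at each end instead of the 21-base and 7-base values in the example, and the reference string and function names are hypothetical. It illustrates the idea only and does not model the DRAGEN hash table.

```python
from collections import defaultdict


def find_seed_positions(reference, seed):
    """All start positions where `seed` matches the reference exactly."""
    return [
        i
        for i in range(len(reference) - len(seed) + 1)
        if reference.startswith(seed, i)
    ]


def group_by_extension(reference, positions, seed_len, wing):
    """Group seed hits by the sequence obtained after extending the seed
    by `wing` bases at both ends."""
    groups = defaultdict(list)
    for pos in positions:
        start, end = pos - wing, pos + seed_len + wing
        if start < 0 or end > len(reference):
            continue  # this toy skips extensions that run off the reference
        groups[reference[start:end]].append(pos)
    return dict(groups)


if __name__ == "__main__":
    reference = "AAACGTTT" + "CCACGTGG" + "AAACGTTT"
    hits = find_seed_positions(reference, "ACGT")
    print(hits)  # three hits for the 4-base primary seed
    # Extending by 2 bases per side splits the hits into smaller groups.
    print(group_by_extension(reference, hits, seed_len=4, wing=2))
```

After extension, the three hits of the primary seed fall into two groups, one with two positions and one with a single position, which is the same kind of subdivision described for the 100 hits above.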

The maximum extended seed length, by default equal to the primary seed length plus 128, can be controlled with the --ht-max-ext-seed-len option. For example, for short reads, it is advisable to set the maximum extended seed shorter than the read length, because extensions longer than the whole read can never match.

It is also possible to tune how aggressively seeds are extended using the following options (advanced usage):

--ht-cost-coeff-seed-len

--ht-cost-coeff-seed-freq

--ht-cost-penalty

--ht-cost-penalty-incr

There is a tradeoff between extension length and hit frequency. Faster mapping can be achieved using longer seed extensions to reduce seed hit frequencies, or more accurate mapping can be achieved by avoiding seed extensions or keeping extensions short, while tolerating the higher hit frequencies that result. Shorter extensions can benefit mapping quality both by fitting seeds better between SNPs, and by finding more candidate mapping locations at which to score alignments. The default extension settings, along with the default seed frequency settings, lean aggressively toward mapping accuracy, with relatively short seed extensions and high hit frequencies.

The defaults for the seed frequency options are as follows:

Seed Frequency Limit and Target

One primary or extended seed can match multiple places in the reference genome. All such matches are populated into the hash table, and retrieved when the DRAGEN mapper looks up a corresponding seed extracted from a read. The multiple reference positions are then considered and compared to generate aligned mapper output. However, the DRAGEN software enforces a limit on the number of matches, or frequency, of each seed, which is controlled with the --ht-max-seed-freq option. By default, the frequency limit is 16. In practice, when the software encounters a seed with higher frequency, it extends it to a sufficiently long secondary seed that the frequency of any particular extended seed pattern falls within the limit. However, if a maximum seed extension would still exceed the limit, the seed is rejected, and not populated into the hash table. Instead, a single High Frequency record is populated.

This seed frequency limit does not tend to impact DRAGEN mapping quality notably, for two reasons. First, because seeds are rejected only when extension fails, only extremely high-frequency primary seeds, typically with many thousands of matches, are rejected. Such seeds are not very useful for mapping. Second, there are other seed positions to check in a given read. If another seed position is unique enough to return one or more matches, the read can still be properly mapped. However, if all seed positions are rejected as high frequency, this often means that the entire read matches similarly well at many reference positions, so even if the read were mapped, the choice of position would be arbitrary, with very low or zero MAPQ.

Thus, the default frequency limit of 16 for --ht-max-seed-freq works well. However, it may be decreased or increased, up to a maximum of 256. A higher frequency limit tends to marginally increase the number of reads mapped (especially for short reads), but commonly the additional mapped reads have very low or zero MAPQ. This also tends to slow down DRAGEN mapping, because correspondingly large numbers of possible mappings are occasionally considered.

In addition to a frequency limit, a target seed frequency can be specified with --ht-target-seed-freq option. This target frequency is used when extensions are generated for high frequency primary seeds. Extension lengths are chosen with a preference toward extended seed frequencies near the target. The default of 4 for --ht-target-seed-freq means that the software is biased toward generating shorter seed extensions than necessary to map seeds uniquely.

References with ALT contigs

When building a reference hash table from a FASTA with ALT contigs, it may be desirable to mask certain regions of high similarity, or to establish liftover relationships between primary and alternate contigs. The recommended approach is masking, as described in the Map-Align section. When hg19 or hg38 ALT contigs are detected, the hash table builder requires a liftover file or a BED file to mask the ALT contigs. If none is provided, a mask BED file from <INSTALL_PATH>/resources/ht_builder/ is used automatically.

Masked References

DRAGEN has adopted a masked approach to handle native reference ALT contigs, where strategic regions are masked to increase accuracy. The hash table builder builds the mapper hash table as if the regions specified in the argument for --ht-mask-bed were masked with N's. The hash table builder allows only one of --ht-mask-bed or --ht-alt-liftover to be set. Each line in the BED file is expected to contain a contig name, start position (0-based), and end position (1-based), separated by a single tab or space. Lines that start with # are ignored by the hash table builder to allow commenting. Any line with a contig name that is not found in the input FASTA is skipped and logged to the DRAGEN log file. Likewise, lines that describe empty intervals are skipped. If all lines are skipped this way, the hash table builder issues an error and aborts, unless the mask BED file was automatically applied (see Automatic masking). The hash table builder always issues an error and aborts if an interval described in the BED file is outside the range of the corresponding contig in the FASTA. Lines that are not skipped are written to a file called mask.bed in the hash table output directory, and the digest of this file appears in hash_table.cfg. This file is used when a reference is loaded to the FPGA card to dynamically mask reference.bin.
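The line-handling rules for a mask BED file can be summarized as a small pre-check. The Python sketch below is a hypothetical helper, not part of DRAGEN, that mirrors the rules as described here. It assumes that blank lines are ignored like comments, that the empty-interval check runs before the range check, and that a file with no usable lines counts as "all lines skipped". None of these three points is stated in this guide.

```python
class MaskBedError(Exception):
    """Raised where the text says the hash table builder would abort."""


def filter_mask_bed(lines, contig_lengths, auto_applied=False):
    """Split BED lines into kept intervals and skipped lines.

    contig_lengths maps each FASTA contig name to its length.
    auto_applied mimics a mask that was applied automatically.
    """
    kept, skipped = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # comments (and, by assumption, blank lines) are ignored
        contig, start, end = line.split()[:3]  # tab or space separated
        start, end = int(start), int(end)
        if contig not in contig_lengths:
            skipped.append(line)  # unknown contig: skipped (and logged by DRAGEN)
            continue
        if end <= start:
            skipped.append(line)  # empty interval: skipped
            continue
        if start < 0 or end > contig_lengths[contig]:
            raise MaskBedError(f"interval outside contig range: {line}")
        kept.append((contig, start, end))
    if not kept and not auto_applied:
        raise MaskBedError("no usable intervals in mask BED file")
    return kept, skipped


if __name__ == "__main__":
    bed = ["# toy mask", "chr1\t10\t20", "chrUn\t0\t5", "chr1 30 30"]
    print(filter_mask_bed(bed, {"chr1": 100}))
```

Running such a check before a build makes it easier to see which lines would end up in mask.bed and which would be dropped.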

Automatic masking

When running from a FASTA for which hg38 or hg19 is detected (see Automatic Reference Detection), and no argument for --ht-mask-bed or --ht-alt-liftover is provided, the hash table builder automatically applies the corresponding BED file for the detected reference from <INSTALL_PATH>/resources/ht_builder/. Note that the hash table builder identifies ALT contigs by name. When running from an input FASTA that contains ALT contigs with standard names but modified base content, it is therefore recommended to suppress automatic masking by setting --ht-suppress-mask=true or by passing a custom mask BED file to --ht-mask-bed.

Handling Decoy Contigs

The behavior of DRAGEN with respect to the handling of decoy contigs in the reference has changed since version 2.6.

Starting with DRAGEN 3.x, the DRAGEN hash table builder automatically detects the absence of the decoy contigs from the reference and adds them to the FASTA file before building the hash table. The decoys file is found at <INSTALL_PATH>/resources/ht_builder/hs_decoys.fa.gz. If the reference is missing the decoy contigs, then the reads that map to the decoy contigs are artificially marked as unmapped in the output BAM (because the original reference does not have the decoy contigs). This results in an artificially lower mapping rate. However, the accuracy of variant calling is improved because false positives caused by decoy reads are removed.

Illumina recommends using this feature by default. However, you can set the --ht-suppress-decoys option to true to suppress adding these decoys to the hash table.

The table below describes the difference in behavior between older DRAGEN versions (2.6 and earlier) and DRAGEN 3.x versions with respect to the handling of decoy contigs in the hash table builder:

Prepare a Pangenome Reference

It is possible to build a custom pangenome reference in order to:

Generate a population-specific pangenome hash table from a pangenome msVCF generated by the BSSH app.
Generate a human or non-human pangenome hash table from a customer-provided msVCF.

Usage

To build a pangenome hash table, use a command of the following form:

dragen
  --build-hash-table true (required)
  --ht-graph-msvcf-file <path to a multi-sample VCF file> (required for pangenome reference)
  --ht-reference <reference.fasta> (required)
  --ht-graph-extra-kmer-bed <graph.bed> (optional)
  --ht-mask-bed <mask.bed> (optional)
  --ht-graph-exclusion-bed <exclusion.bed> (optional)
  --output-directory <DIR> (required)
  [options]

Inputs

Set of population variants, in a multi-sample VCF (msVCF)

The custom pangenome hash table builder tool uses a set of population variants provided by the user to generate a pangenome hash table. The variants must be specified in VCF format, in a single multi-sample VCF (msVCF) file containing the variants for a set of individuals. This multi-sample VCF file must have specific formatting described below.

Specific msVCF input formatting

The custom pangenome hash table builder tool only supports msVCF file input respecting the format described below:

The msVCF must comply with the VCF 4.2 format specification.
Variants must be positionally sorted, in the same contig order as the main FASTA reference genome provided in --ht-reference.
Records must include diploid or haploid GT calls.
Multi-allelic variants are supported, either merged into one record or split across multiple lines.
The following FILTER code must be defined. Non-PASS records are ignored:
- ##FILTER=<ID=PASS,Description="All filters passed">
The following FORMAT field must be defined:
- ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
For better results, variants should be left-aligned.
The recommended maximum number of samples in the msVCF is 256. A higher number may lead to very high memory usage during hash table creation.

Note: INFO/FORMAT subfields must be defined in the header. Events with undefined subfields are ignored.
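Some of these requirements can be checked before starting a long build. The Python sketch below is a hypothetical pre-flight check, not a DRAGEN tool. It looks for the VCF 4.2 file format line, the GT FORMAT definition, the number of sample columns, and the record order relative to the reference contig order. Treating "compliant with 4.2" as an exact `##fileformat=VCFv4.2` line is an assumption, and the check reads plain-text lines rather than a compressed file.

```python
def check_msvcf(lines, contig_order, max_samples=256):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    header = [line for line in lines if line.startswith("##")]
    if not any(line.startswith("##fileformat=VCFv4.2") for line in header):
        problems.append("fileformat line is not VCFv4.2")
    if not any(line.startswith("##FORMAT=<ID=GT,") for line in header):
        problems.append("FORMAT GT is not defined in the header")

    column_line = next((line for line in lines if line.startswith("#CHROM")), None)
    if column_line is None:
        problems.append("missing #CHROM column line")
    else:
        samples = column_line.rstrip("\n").split("\t")[9:]
        if len(samples) > max_samples:
            problems.append(f"{len(samples)} samples exceeds recommended {max_samples}")

    # Records must follow the contig order of the reference FASTA.
    rank = {name: i for i, name in enumerate(contig_order)}
    previous = (-1, 0)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t")[:2]
        if chrom not in rank:
            problems.append(f"contig {chrom} is not in the reference")
            continue
        key = (rank[chrom], int(pos))
        if key < previous:
            problems.append(f"record {chrom}:{pos} is out of order")
        previous = key
    return problems


if __name__ == "__main__":
    vcf = [
        "##fileformat=VCFv4.2",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
        "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1",
    ]
    print(check_msvcf(vcf, ["chr1", "chr2"]))
```

The sample limit is reported as a problem here only because the guide recommends it; it is not described as a hard limit.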

To build a high-performance custom genome, it is highly recommended to use long-read sequencing data. We recommend using external tools such as WhatsHap (https://github.com/whatshap/whatshap) to generate phased input. DRAGEN analysis leverages the phasing information to reconstruct population haplotypes.

Reference genome

Note: the reference genome provided as input must be the same as the one used to generate the input phased msVCF. If the msVCF contains variants from regions not present in the fasta file, the pangenome reference builder will stop with an error.

Exclusion bed file (optional)

A custom exclusion BED file can also be provided. It must be tab-delimited, with the first three columns being contig name, start position, and end position. Any line with a contig name that is not found in the input FASTA is skipped. Any lines that describe empty intervals are skipped.

Note: records of the exclusion bed file provided must be from the same build as the reference genome used to build the pangenome reference.

Extra kmer bed file (optional)

An extra k-mer BED file can also be provided. It must be tab-delimited, with the first three columns being contig name, start position, and end position. Any line with a contig name that is not found in the input FASTA is skipped. Any lines that describe empty intervals are skipped.

Note: records of the Extra-kmer-bed file provided must be from the same build as the reference genome used to build the graph reference.

Mask bed file (recommended)

A custom mask BED file can also be provided. It must be tab-delimited, with the first three columns being contig name, start position, and end position. Any line with a contig name that is not found in the input FASTA is skipped. Any lines that describe empty intervals are skipped.

Note: records of the mask bed file provided must be from the same build as the reference genome used to build the graph reference.

Command line options

Note: The custom graph reference hash table end to end pipeline will return an error if options --ht-alt-liftover or --ht-allow-mask-and-liftover are specified.

Output

The hash table builder generates the following outputs:

Prepare a Linear Reference

Usage

Use the --build-hash-table option to transform a reference FASTA into the hash table for DRAGEN mapping. It takes as input a FASTA file (multiple reference sequences being concatenated) and a preexisting output directory. Build command usage is as follows:

Input

The --ht-reference and --output-directory options are required for building a hash table. The --ht-reference option specifies the path to the reference FASTA file, while --output-directory specifies a preexisting directory where the hash table output files are written. Illumina recommends organizing various hash table builds into different folders. As a best practice, folder names should include any nondefault parameter settings used to generate the contained hash table. The sequence names in the reference FASTA file must be unique.

Command line options

Liftover Based ALT-Aware Hash Tables

While masking is the recommended approach to dealing with ALT contigs, DRAGEN also supports a liftover based method. To enable liftover based ALT-aware mapping in DRAGEN, build the hash table with a liftover file by using the --ht-alt-liftover option. The hash table builder classifies each reference sequence as primary or alternate based on the liftover file, and packs primaries before alternates in reference.bin. SAM liftover files for hg38DH and hg19 are in the <INSTALL_PATH>/resources/ht_builder folder.

Custom Liftover Files

Custom liftover files can be used in place of those provided with DRAGEN. Liftover files must be SAM format, but no SAM header is required. SEQ and QUAL fields can be omitted ('*'). Each alignment record should have an alternate haplotype reference sequence name as QNAME, indicating the RNAME and POS of its liftover alignment in a destination (normally primary assembly) reference sequence.

Reverse-complemented alignments are indicated by bit 0x10 in FLAG. Records flagged unmapped (0x4) or secondary (0x100) are ignored. The CIGAR may include hard or soft clipping, leaving parts of the ALT contig unaligned.

A single reference sequence cannot serve as both an ALT contig (appearing in QNAME) and a liftover destination (appearing in RNAME). Multiple ALT contigs can align to the same primary assembly location. Multiple alignments can also be provided for a single ALT contig (extras can optionally be flagged 0x800 supplementary), such as to align one portion forward and another portion reverse-complemented. However, each base of the ALT contig receives only one liftover image, according to the first alignment record with an M CIGAR operation covering that base.

SAM records with QNAME missing from the reference genome are ignored, so that the same liftover file may be used for various reference subsets, but an error occurs if any alignment has its QNAME present but its RNAME absent.
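The record-level rules for a custom liftover file can be expressed as a small reader. The Python sketch below is a hypothetical checker, not DRAGEN code. It applies the rules in the order shown in the comments, which is an assumption, and it does not interpret the CIGAR or resolve overlapping alignments for the same ALT contig.

```python
def load_liftover(sam_lines, reference_names):
    """Return the usable liftover records as dictionaries.

    Raises ValueError for inputs the text describes as errors.
    """
    records = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # blank lines and optional SAM header lines
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        pos, cigar = int(fields[3]), fields[5]
        if flag & 0x4 or flag & 0x100:
            continue  # unmapped or secondary records are ignored
        if qname not in reference_names:
            continue  # ALT contig absent from this reference subset
        if rname not in reference_names:
            raise ValueError(f"{qname}: destination {rname} is not in the reference")
        records.append(
            {
                "alt": qname,
                "dest": rname,
                "pos": pos,
                "cigar": cigar,
                "reverse": bool(flag & 0x10),  # reverse-complemented alignment
            }
        )
    # A sequence may not be both an ALT contig and a liftover destination.
    both = {r["alt"] for r in records} & {r["dest"] for r in records}
    if both:
        raise ValueError(f"sequences used as both ALT and destination: {sorted(both)}")
    return records


if __name__ == "__main__":
    sam = ["chr1_alt\t0\tchr1\t1000\t60\t50M\t*\t0\t0\t*\t*"]
    print(load_liftover(sam, {"chr1", "chr1_alt"}))
```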

Options for advanced users

Primary Seed Length

The --ht-seed-len option specifies the initial length in nucleotides of seeds from the reference genome to populate into the hash table. At run time, the mapper extracts seeds of this same length from each read, and looks for exact matches (unless seed editing is enabled) in the hash table.

The maximum primary seed length is a function of hash table size. The limit is k=27 for table sizes from 16 GB to 64 GB, covering typical sizes for whole human genome, or k=26 for sizes from 4 GB to 16 GB.

The minimum primary seed length depends mainly on the reference genome size and complexity. It needs to be long enough to resolve most reference positions uniquely. For whole human genome references, hash table construction typically fails with k < 16. The lower bound may be smaller for shorter genomes, or higher for less complex (more repetitive) genomes. The uniqueness threshold of --ht-seed-len 16 for the 3.1 Gbp human genome can be understood intuitively: log4(3.1 × 10^9) ≈ 15.8, so at least 16 choices from 4 nucleotides are required to distinguish 3.1 G reference positions.
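The logarithm in this argument is easy to verify. The short Python sketch below computes the smallest k for which 4^k distinct k-mers could, in principle, label every position of a genome of a given size. The function name is hypothetical, and the result is only the information-theoretic bound from the paragraph above, not a prediction of where hash table construction actually fails for a repetitive genome.

```python
import math


def min_unique_seed_length(genome_size, alphabet_size=4):
    """Smallest k such that alphabet_size ** k >= genome_size."""
    return math.ceil(math.log(genome_size, alphabet_size))


if __name__ == "__main__":
    human = 3.1e9
    print(math.log(human, 4))  # a little under 16
    print(min_unique_seed_length(human))
    # 4**15 is about 1.07 billion (too few); 4**16 is about 4.29 billion.
    print(4**15, 4**16)
```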

Accuracy Considerations

For read mapping to succeed, at least one primary seed must match exactly (or with a single SNP when edited seeds are used). Shorter seeds are more likely to map successfully to the reference, because they are less likely to overlap variants or sequencing errors, and because more of them fit in each read. So for mapping accuracy, shorter seeds are generally better.

However, very short seeds can sometimes reduce mapping accuracy. Very short seeds often map to multiple reference positions, and lead the mapper to consider more false mapping locations. Due to imperfect modeling of mutations and errors by Smith-Waterman alignment scoring and other heuristics, occasionally these noise matches may be reported. Run time quality filters such as --Aligner.aln_min_score can control the accuracy issues with very short seeds.

Speed Considerations

Shorter seeds tend to slow down mapping, because they map to more reference locations, resulting in more work such as Smith-Waterman alignments to determine the best result. This effect is most pronounced when primary seed length approaches the reference genome's uniqueness threshold, e.g., k=16 for whole human genome.

Application Considerations

Read Length: Generally, shorter seeds are appropriate for shorter reads, and longer seeds for longer reads. Within a short read, a few mismatch positions (variants or sequencing errors) can chop the read into only short segments matching the reference, so that only a short seed can fit between the differences and match the reference exactly. For example, in a 36 bp read, just one SNP in the middle can block seeds longer than 18 bp from matching the reference. By contrast, in a 250 bp read, it takes 15 SNPs to exceed a 0.01% chance of blocking even 27 bp seeds.

Paired Ends: The use of paired end reads can make longer seeds yield good mapping accuracy. DRAGEN uses paired end information to improve mapping accuracy, including with rescue scans that search the expected reference window when only one mate has seeds mapping to a given reference region. Thus, paired end reads have essentially twice the opportunity for an exact matching seed to find their correct alignments.

Variant or Error Rate: When read differences from the reference are more frequent, shorter seeds may be required to fit between the difference positions in a given read and match the reference exactly.

Mapping Percentage Requirement: If the application requires a high percentage of reads to be mapped somewhere (even at low MAPQ), short seeds may be helpful. Some reads that do not match the reference well anywhere are more likely to map using short seeds to find partial matches to the reference.
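The read length example above (one SNP in the middle of a 36 bp read blocking seeds longer than 18 bp) follows from the longest mismatch-free stretch in the read. The Python sketch below computes that stretch for given mismatch positions. The function names are hypothetical, positions are 0-based, and only exact seed matching is modeled (no edited seeds, no seed extension).

```python
def longest_clean_segment(read_length, mismatch_positions):
    """Length of the longest run of read bases containing no mismatch."""
    longest = 0
    previous = -1  # position just before the read start
    for pos in sorted(mismatch_positions) + [read_length]:
        longest = max(longest, pos - previous - 1)
        previous = pos
    return longest


def seed_can_match(read_length, mismatch_positions, seed_length):
    """True if an exact seed of seed_length fits between the mismatches."""
    return longest_clean_segment(read_length, mismatch_positions) >= seed_length


if __name__ == "__main__":
    # One SNP in the middle of a 36 bp read leaves segments of 18 and 17 bases.
    print(longest_clean_segment(36, [18]))
    print(seed_can_match(36, [18], 18), seed_can_match(36, [18], 19))
```

The same function shows why longer reads tolerate longer seeds: with the same number of mismatches, the clean segments are longer on average.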

Maximum Seed Length

The --ht-max-ext-seed-len option limits the length of extended seeds populated into the hash table. Primary seeds (length specified by --ht-seed-len) that match many reference positions can be extended to achieve more unique matching, which may be required to map seeds within the maximum hit frequency (--ht-max-seed-freq).

Given a primary seed length k, the maximum seed length can be configured between k and k+128. The default is the upper bound, k+128.

When to Limit Seed Extension

The --ht-max-ext-seed-len option is recommended for short reads, e.g., less than 50 bp. In such cases, it is helpful to limit seed extension to the read length minus a small margin, such as 1-4 bp. For example, with 36 bp reads, setting --ht-max-ext-seed-len to 35 might be appropriate. This ensures that the hash table builder does not plan a seed extension longer than the read, which would cause seed extension and mapping to fail at run time for seeds that could have fit within the read with shorter extensions.

While seed extension can be similarly limited for longer reads, e.g., setting --ht-max-ext-seed-len to 99 for 100 bp reads, there is little utility in this because seeds are extended conservatively in any event. Even with the default k+128 limit, individual seeds are only extended to the lengths required to fit under the maximum hit frequency (--ht-max-seed-freq), and at most a few bases longer to approach the target hit frequency (--ht-target-seed-freq), or to avoid taking too many incremental extension steps.

Maximum Hit Frequency

The --ht-max-seed-freq option sets a firm limit on the number of seed hits (reference genome locations) that can be populated for any primary or extended seed. If a given primary seed maps to more reference positions than this limit, it must be extended long enough that the extended seeds subdivide into smaller groups of identical seeds under the limit. If, even at the maximum extended seed length (--ht-max-ext-seed-len), a group of identical reference seeds is larger than this limit, their reference positions are not populated into the hash table. Instead, a single High Frequency record is populated.

The maximum hit frequency can be configured from 1 to 256. However, if this value is too low, hash table construction can fail because too many seed extensions are needed. The practical minimum for a whole human genome reference, other options being default, is 8.
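The interaction between the hit frequency limit and the maximum extended seed length can be illustrated on a toy reference. The Python sketch below extends a seed symmetrically, one base per side at a time, until every group of identical extended seeds is within the limit, and reports a high-frequency outcome when the length cap is reached first. The one-base step, the function name, and the toy data are assumptions for illustration. The actual builder also weighs the target frequency and extension costs, which are not modeled here.

```python
from collections import Counter

HIGH_FREQUENCY = "HIGH_FREQUENCY"


def plan_extension(reference, positions, seed_len, max_freq, max_ext_len):
    """Return the smallest per-side extension that brings every group of
    identical extended seeds within max_freq, or HIGH_FREQUENCY if none
    fits within max_ext_len."""
    wing = 0
    while seed_len + 2 * wing <= max_ext_len:
        # Count identical extended seeds; hits too close to the ends of this
        # toy reference are dropped instead of being handled specially.
        counts = Counter(
            reference[p - wing : p + seed_len + wing]
            for p in positions
            if p - wing >= 0 and p + seed_len + wing <= len(reference)
        )
        if counts and max(counts.values()) <= max_freq:
            return wing
        wing += 1
    return HIGH_FREQUENCY


if __name__ == "__main__":
    reference = "AAACGTTT" + "CCACGTGG" + "AAACGTTT"
    hits = [2, 10, 18]  # positions of the 4-base seed ACGT
    print(plan_extension(reference, hits, 4, max_freq=2, max_ext_len=8))
    print(plan_extension(reference, hits, 4, max_freq=1, max_ext_len=8))
```

With a limit of 2, a one-base extension per side is enough. With a limit of 1, two of the hits remain identical even at the maximum length, so the seed ends up as a high-frequency case.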

Accuracy Considerations

Generally, a higher maximum hit frequency leads to more successful mapping. There are two reasons for this. First, a higher limit rejects fewer reference positions that cannot map under it. Second, a higher limit allows seed extensions to be shorter, improving the odds of exact seed matching without overlapping variants or sequencing errors.

However, as with very short seeds, allowing high hit counts can sometimes hurt mapping accuracy. Most of the seed hits in a large group are not to the true mapping location, and occasionally one of these noise hits may be reported due to imperfect scoring models. Also, the mapper limits the total number of reference positions it considers, and allowing very high hit counts can potentially crowd out the actual best match from consideration.

Speed Considerations

Higher maximum hit frequencies slow down read mapping, because seed mapping finds more reference locations, resulting in more work, such as Smith-Waterman alignments, to determine the best result.

Pangenome Reference

The DRAGEN Software enables the user to build a custom pangenome hash table from a set of population variants. The population variants are specified in a single multi-sample VCF file.

--ht-graph-msvcf-file: Input file containing list of population variants, in multi-sample VCF format.

This option replaces the options previously used to build a graph reference, which are now deprecated.

Deprecated options:

--ht-pop-alt-contigs: Population based alternate contigs FASTA.
--ht-pop-alt-liftover: Liftover SAM file of population alternate contigs.
--ht-pop-snps: Population based SNPs VCF

ALT-Contigs

The following options control building hash tables from references with ALT-contigs. See References with ALT contigs for more information.

--ht-mask-bed: Set a custom BED file that defines which regions to mask. If not provided, the DRAGEN software automatically applies BED files for hg38 and hg19 from <INSTALL_PATH>/resources/ht_builder.
--ht-alt-liftover: Set a liftover file to build a liftover based ALT-aware hash table. SAM liftover files for hg38DH and hg19 are provided in <INSTALL_PATH>/resources/ht_builder.
--ht-allow-mask-and-liftover: Allow the use of both --ht-mask-bed and --ht-alt-liftover together.
--ht-suppress-mask: Suppress automatic detection of the default mask bed files when building the hash table.

Decoy Contigs

--ht-decoys: The DRAGEN software automatically detects the use of hg19 and hg38 references and adds decoys to the hash table when they are not found in the FASTA file. Use the --ht-decoys option to specify the path to a decoys file. The default is <INSTALL_PATH>/resources/ht_builder/hs_decoys.fa.gz.
--ht-suppress-decoys: Suppress automatic detection of the default decoys file when building the hash table.

Processing Options

--ht-num-threads The --ht-num-threads option determines the maximum number of worker CPU threads that are used to speed up hash table construction. The default for this option is 8, with a maximum of 32 threads allowed. If your server supports execution of more threads, it is recommended that you use the maximum. For example, the DRAGEN servers contain 24 cores that have hyperthreading enabled, so a value of 32 should be used. When using a higher value, --ht-max-table-chunks needs to be adjusted as well. The servers have 128 GB of memory available.
--ht-max-table-chunks The --ht-max-table-chunks option controls the memory footprint during hash table construction by limiting the number of ~1 GB hash table chunks that reside in memory simultaneously. Each additional chunk consumes roughly twice its size (~2 GB) in system memory during construction. The hash table is divided into power-of-two independent chunks, of a fixed chunk size, X, which depends on the hash table size, in the range 0.5 GB < X ≤ 1 GB. For example, a 24 GB hash table contains 32 independent 0.75 GB chunks that can be constructed by parallel threads with enough memory and a 16 GB hash table contains 16 independent 1 GB chunks. The default is --ht-max-table-chunks equal to --ht-num-threads, but with a minimum default --ht-max-table-chunks of 8. It makes sense to have these two options match, because building one hash table chunk requires one chunk space in memory and one thread to work on it. Nevertheless, there are build-speed advantages to raising --ht-max-table-chunks higher than --ht-num-threads, or to raising --ht-num-threads higher than --ht-max-table-chunks.
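The chunk arithmetic described above can be checked with a small sketch. It assumes only what the text states: the table is split into a power-of-two number of chunks, each chunk is at most 1 GB, and each resident chunk costs roughly twice its size in system memory. The function names are hypothetical and are not part of the DRAGEN software.

```python
def chunk_layout(table_gb: float) -> tuple[int, float]:
    """Return (number_of_chunks, chunk_size_gb) for a hash table of table_gb."""
    if table_gb <= 0:
        raise ValueError("table size must be positive")
    chunks = 1
    # Double the chunk count until each chunk is at most 1 GB.
    while table_gb / chunks > 1.0:
        chunks *= 2
    return chunks, table_gb / chunks


def default_max_table_chunks(num_threads: int) -> int:
    """Default described in the text: match the thread count, but never below 8."""
    return max(8, num_threads)


def approx_build_memory_gb(max_table_chunks: int, chunk_gb: float) -> float:
    """Rough estimate only: each resident chunk costs about twice its size."""
    return 2.0 * chunk_gb * max_table_chunks


if __name__ == "__main__":
    for size_gb in (24, 16):
        n, chunk_gb = chunk_layout(size_gb)
        print(f"{size_gb} GB table: {n} chunks of {chunk_gb} GB")
```

With 32 chunks of 0.75 GB resident at once, this estimate gives about 48 GB of system memory, which fits within the 128 GB available on the servers.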

Size Options

--ht-mem-limit Memory Limit. The --ht-mem-limit option controls the generated hash table size by specifying the DRAGEN card memory available for both the hash table and the encoded reference genome. The --ht-mem-limit option defaults to 32 GB when the reference genome approaches WHG size, or to a generous size for smaller references. Normally there is little reason to override these defaults.
--ht-size Hash Table Size. This option specifies the hash table size to generate, rather than calculating an appropriate table size from the reference genome size and the available memory (option --ht-mem-limit). Using default table sizing is recommended and using --ht-mem-limit is the next best choice.

Seed Population Options

--ht-ref-seed-interval Seed Interval. The --ht-ref-seed-interval option defines the step size between positions of seeds in the reference genome populated into the hash table. An interval of 1 (default) means that every seed position is populated, 2 means 50% of positions are populated, and so on. Noninteger values are supported, for example, 2.5 yields 40% populated. Seeds from a whole human reference are easily 100% populated with 32 GB memory on DRAGEN boards. If a substantially larger reference genome is used, change this option.
--ht-soft-seed-freq-cap and --ht-max-dec-factor Soft Frequency Cap and Maximum Decimation Factor for Seed Thinning. Seed thinning is an experimental technique to improve mapping performance in high-frequency regions. When primary seeds have higher frequency than the cap indicated by the --ht-soft-seed-freq-cap option, only a fraction of seed positions are populated to stay under the cap. The --ht-max-dec-factor option specifies a maximum factor by which seeds can be thinned. For example, --ht-max-dec-factor 3 retains at least 1/3 of the original seeds. --ht-max-dec-factor 1 disables any thinning. Seeds are decimated in careful patterns to prevent leaving any long gaps unpopulated. The idea is that seed thinning can achieve mapped seed coverage in high frequency reference regions where the maximum hit frequency would otherwise have been exceeded. Seed thinning can also keep seed extensions shorter, which is also good for successful mapping. Based on testing to date, seed thinning has not proven to be superior to other accuracy optimization methods.
--ht-rand-hit-hifreq and --ht-rand-hit-extend Random Sample Hit with HIFREQ Record and EXTEND Record. Whenever a HIFREQ or EXTEND record is populated into the hash table, it stands in place of a large set of reference hits for a certain seed. Optionally, the hash table builder can choose a random representative of that set, and populate that HIT record alongside the HIFREQ or EXTEND record. Random sample hits provide alternative alignments that are very useful in estimating MAPQ accurately for the alignments that are reported. They are never used outside of this context for reporting alignment positions, because that would result in biased coverage of locations that happened to be selected during hash table construction. To include a sample hit, set --ht-rand-hit-hifreq to 1. The --ht-rand-hit-extend option is a minimum pre-extension hit count to include a sample hit, or zero to disable. Modifying these options is not recommended.
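The relationship between the seed interval, the decimation factor, and the fraction of seed positions populated is simple reciprocal arithmetic. The following sketch reproduces the figures quoted above. The function names are hypothetical.

```python
def populated_fraction(seed_interval: float) -> float:
    """Fraction of reference seed positions populated for a given interval."""
    if seed_interval < 1:
        raise ValueError("seed interval must be at least 1")
    return 1.0 / seed_interval


def min_retained_fraction(max_dec_factor: float) -> float:
    """Smallest fraction of seeds kept by thinning; a factor of 1 keeps all."""
    if max_dec_factor < 1:
        raise ValueError("decimation factor must be at least 1")
    return 1.0 / max_dec_factor


if __name__ == "__main__":
    print(populated_fraction(2.5))      # interval 2.5, as in the text
    print(min_retained_fraction(3))     # --ht-max-dec-factor 3
```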

Seed Extension Control

DRAGEN seed extension is dynamic, applied as needed for particular K-mers that map to too many reference locations. Seeds are incrementally extended in steps of 2 to 14 bases (always even) from a primary seed length to a fully extended length. The bases are appended symmetrically in each extension step, determining the next extension increment if any.

There is a potentially complex seed extension tree associated with each high frequency primary seed. Each full tree is generated during hash table construction and a path from the root is traced by iterative extension steps during seed mapping. The hash table builder employs a dynamic programming algorithm to search the space of all possible seed extension trees for an optimal one, using a cost function that balances mapping accuracy and speed. The following options define that cost function:

--ht-target-seed-freq Target Hit Frequency. The --ht-target-seed-freq option defines the ideal number of hits per seed for which seed extension should aim. Higher values lead to fewer and shorter final seed extensions, because shorter seeds tend to match more reference positions.
--ht-cost-coeff-seed-len Cost Coefficient for Seed Length. The --ht-cost-coeff-seed-len option assigns the cost component for each base by which a seed is extended. Additional bases are considered a cost because longer seeds risk overlapping variants or sequencing errors and losing their correct mappings. Higher values lead to shorter final seed extensions.
--ht-cost-coeff-seed-freq Cost Coefficient for Hit Frequency. The --ht-cost-coeff-seed-freq option assigns the cost component for the difference between the target hit frequency and the number of hits populated for a single seed. Higher values result primarily in high-frequency seeds being extended further to bring their frequencies down toward the target.
--ht-cost-penalty Cost Penalty for Seed Extension. The --ht-cost-penalty option assigns a flat cost for extending beyond the primary seed length. A higher value results in fewer seeds being extended at all. Current testing shows that zero (0) is appropriate for this parameter.
--ht-cost-penalty-incr Cost Increment for Extension Step. The --ht-cost-penalty-incr option assigns a recurring cost for each incremental seed extension step taken from primary to final extended seed length. More steps are considered a higher cost because extending in many small steps requires more hash table space for intermediate EXTEND records, and takes substantially more run time to execute the extensions. A higher value results in seed extension trees with fewer nodes, reaching from the root primary seed length to leaf extended seed lengths in fewer, larger steps.
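To show how the five options interact, the following sketch combines them into a single additive cost. The additive form, the use of an absolute difference for the frequency term, and all numeric values are assumptions made for illustration. The actual cost function used by the hash table builder is not specified here and may differ.

```python
from dataclasses import dataclass


@dataclass
class CostParams:
    """Hypothetical container mirroring the five --ht-* cost options."""
    target_seed_freq: float      # --ht-target-seed-freq
    cost_coeff_seed_len: float   # --ht-cost-coeff-seed-len
    cost_coeff_seed_freq: float  # --ht-cost-coeff-seed-freq
    cost_penalty: float          # --ht-cost-penalty
    cost_penalty_incr: float     # --ht-cost-penalty-incr


def extension_cost(p: CostParams, extended_bases: int, steps: int,
                   hit_freq: float) -> float:
    """Illustrative cost of one leaf of a seed extension tree (assumed form)."""
    # Per-base cost for every base added beyond the primary seed length.
    cost = p.cost_coeff_seed_len * extended_bases
    # Cost for missing the target hit frequency (absolute difference assumed).
    cost += p.cost_coeff_seed_freq * abs(hit_freq - p.target_seed_freq)
    # Flat penalty applied once if the seed is extended at all.
    if extended_bases > 0:
        cost += p.cost_penalty
    # Recurring penalty for each incremental extension step.
    cost += p.cost_penalty_incr * steps
    return cost


if __name__ == "__main__":
    # Placeholder values, not DRAGEN defaults.
    params = CostParams(4.0, 1.0, 0.5, 0.0, 0.7)
    print(extension_cost(params, extended_bases=0, steps=0, hit_freq=20))
    print(extension_cost(params, extended_bases=6, steps=1, hit_freq=4))
```

Under this assumed form, reaching the same extended length in more steps always costs more, which matches the stated preference for fewer, larger steps.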

Pipeline Specific Hash Tables

RNA-Seq

When building a hash table, DRAGEN configures the options for DNA analysis by default. To run RNA-Seq data, you must build an RNA-Seq hash table by setting --ht-build-rna-hashtable to true. If running RNA-Seq alignment, use the original --output-directory instead of the automatically generated subdirectory.

CNV

If using the CNV pipeline, set --ht-build-cnv-hashtable to true. The command generates an additional Kmer hash map that is used in the CNV algorithm. Illumina recommends always using the --ht-build-cnv-hashtable option, so that you can perform CNV calling with the same hash table used for mapping and aligning.

Methylation

To run the methylation pipeline, you must build a methylation-specific hash table. DRAGEN can build a single-pass or a legacy multi-pass methylation hash table. Methylation runs that use a single-pass hash table complete faster than runs that use the legacy multi-pass hash tables. Single-pass hash tables are recommended for both building methylation tables and running analyses.

Single-pass

The following is an example of a single-pass hash table build. The example generates a combined hash table in your reference index folder under the methyl_converted subdirectory.

dragen --build-hash-table true \
--output-directory $REFDIR \
--ht-reference $FASTA \
--ht-num-threads 40 \
--ht-methylated-combined=true \
--ht-seed-len 27

Multi-pass

Multi-pass methylation mapping requires building two special hash tables with reference bases converted from C to T in one table and G to A in the other table. The conversions are performed automatically when using the --ht-methylated command line option. The converted hash tables are generated in two subdirectories under the folder specified using the --output-directory command line option. The subdirectories are named CT_converted and GA_converted, corresponding with the base conversions. When using the hash tables for methylated alignment runs, make sure to refer to the --output-directory folder, not the subdirectories.

The base conversions remove a significant amount of information from the hash tables. You might need to use different hash table parameters than in a conventional hash table build. The following options are recommended for building hash tables for mammalian species.

dragen --build-hash-table=true --output-directory $REFDIR --ht-reference $FASTA --ht-max-seed-freq 16 --ht-seed-len 27 --ht-num-threads 40 --ht-methylated=true

HLA

To run the HLA caller, an HLA-specific anchored reference hash table must be built. Set --ht-build-hla-hashtable to true. The command creates an anchored_hla subdirectory inside the --output-directory. The HLA-specific reference subdirectory can be built at the same time as the primary reference construction.

Somatic Mode

The DRAGEN Somatic Pipeline allows ultrarapid analysis of Next-Generation Sequencing (NGS) data to identify cancer-associated mutations in somatic chromosomes. DRAGEN calls SNVs and indels from both matched tumor-normal pairs and tumor-only samples using a probability model that considers the possibility of somatic variants, germline variants, and various systematic noise artifacts. The model is informed by sample-specific nucleotide and indel noise patterns that are estimated from the data at runtime. When considering somatic variants, DRAGEN does not make any ploidy assumptions, which enables detection of low-frequency alleles. For loci with coverage up to 100x in the tumor sample, DRAGEN can detect variant allele frequencies down to approximately 5%. This limit scales with increasing depth on a per-locus basis. It is recommended to provide DRAGEN with a systematic noise file that contains position- and allele-specific noise frequencies as estimated from a panel of normal samples (see below); DRAGEN uses this noise file to filter calls that can be explained as resulting from position- and allele-specific noise. After multiple filtering steps, the output is generated as a VCF file. Variants that fail the filtering steps are kept in the output VCF. The variants include a FILTER annotation that indicates which filtering steps have failed.

For the tumor-normal pipeline, both samples are analyzed jointly. DRAGEN assumes that germline variants and systematic noise artifacts are shared by both samples, whereas somatic variants are present only in the tumor sample. Only somatic variants are reported. To detect systematic noise artifacts, it is recommended that the coverage in the normal sample be at least half of the coverage in the tumor sample.

Variant Scoring

DRAGEN uses a Bayesian approach to compute the posterior probability that a somatic variant is present and reports this as a phred-scale quantity, "somatic quality" (SQ):

##FORMAT=<ID=SQ,Number=1,Type=Float,Description="Somatic quality">
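For orientation, the conversion between a posterior probability and a phred-scale score can be sketched as follows. This assumes the standard Phred convention, SQ = -10 * log10(1 - P), where P is the posterior probability that the variant is present. How DRAGEN computes the posterior itself is not shown, and the function names are hypothetical.

```python
import math


def phred_from_posterior(p_variant: float) -> float:
    """Phred-scale score for a posterior probability that a variant is present."""
    if not 0.0 <= p_variant < 1.0:
        raise ValueError("posterior must be in [0, 1)")
    # The score is the phred-scaled probability that the call is wrong.
    return -10.0 * math.log10(1.0 - p_variant)


def posterior_from_phred(sq: float) -> float:
    """Inverse transform: posterior probability implied by a phred-scale score."""
    return 1.0 - 10.0 ** (-sq / 10.0)


if __name__ == "__main__":
    for sq in (3.0, 17.5):
        print(sq, round(posterior_from_phred(sq), 4))
```

Under this convention, a score of 17.5 corresponds to a posterior of roughly 0.98, and a score of 3 to roughly 0.5.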

DRAGEN scores variants by computing likelihoods for several hypotheses and noise processes, taking into account many factors such as: the numbers of alt-supporting and ref-supporting reads in the tumor and normal samples (and hence the alt allele frequencies in both samples); mapping qualities and how these are distributed across the reads in the tumor and normal pileups; basecall qualities; forward vs reverse strand support; sample-wide estimates of insertion and deletion error probabilities as functions of repeat period, repeat length, and indel length; sample-wide estimates of nucleotide error biases; whether there are nearby co-phased events; and whether the positions and alleles in question are known somatic hotspots or associated with sequence-specific error patterns. You can use SQ as the primary metric to describe the confidence with which the caller made a somatic call. SQ is reported as a format field for the tumor sample (exception: for homozygous reference calls in gvcf mode it is instead a likelihood ratio, analogous to homref GQ as described in the germline section). Variants with SQ score below the SQ filter threshold are filtered out using the weak_evidence tag. To trade off sensitivity against specificity, adjust the SQ filter threshold. Lower thresholds produce a more sensitive caller and higher thresholds produce a more conservative caller. If performing tumor-normal analysis, the SQ field for the normal sample contains the Phred-scaled posterior probability that a putative call is a germline variant. The somatic caller does not test for diploid genotype candidates and does not output GQ or QUAL values.

If tumor SQ > vc-sq-call-threshold (default is 3 for tumor-normal and 0.1 for tumor-only), then FORMAT/GT is hard-coded to 0/1 for the tumor sample and 0/0 for the normal sample (if present), and the tumor-sample FORMAT/AF yields an estimate of the somatic variant allele frequency, which ranges anywhere within [0,1].

If the value for vc-sq-filter-threshold is lower than vc-sq-call-threshold, the filter threshold value is used instead of the call threshold value.
If tumor SQ < vc-sq-call-threshold, the variant is not emitted in the VCF.
If tumor SQ > vc-sq-call-threshold but tumor SQ < vc-sq-filter-threshold, the variant is emitted in the VCF, but FILTER=weak_evidence.
If tumor SQ > vc-sq-call-threshold and tumor SQ > vc-sq-filter-threshold, the variant is emitted in the VCF and FILTER=PASS (unless the variant is filtered by a different filter).
The default vc-sq-filter-threshold is 17.5 for tumor-normal and 3.0 for tumor-only analysis. The following is an example somatic T/N VCF record. Tumor SQ > vc-sq-call-threshold but tumor SQ < vc-sq-filter-threshold, so the FILTER is marked as weak_evidence.

chr2 593701 . G A . weak_evidence
DP=97;MQ=48.74;SQ=3.86;NLOD=9.83;FractionInformativeReads=1.000
GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB 0/0:9.83:33,0:0.000:14,0:19,0:33
0/1:3.86:61,3:0.047:29,2:32,1:64:35,26,0,3:39,22,1,2
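The emission and filtering rules above can be summarized in a short decision sketch. It uses the default thresholds quoted in the text. How a score exactly equal to a threshold is handled is not stated, so the comparisons at the boundaries are an assumption, and the function name is hypothetical. Other filters can still override a PASS result.

```python
DEFAULT_THRESHOLDS = {
    # Default values quoted in the text.
    "tumor-normal": {"call": 3.0, "filter": 17.5},
    "tumor-only": {"call": 0.1, "filter": 3.0},
}


def classify_sq(tumor_sq: float, call_threshold: float,
                filter_threshold: float) -> str:
    """Return 'not emitted', 'weak_evidence', or 'PASS' for a tumor SQ score."""
    # If the filter threshold is lower than the call threshold,
    # it is used in place of the call threshold.
    effective_call = min(call_threshold, filter_threshold)
    if tumor_sq < effective_call:
        return "not emitted"
    if tumor_sq < filter_threshold:
        return "weak_evidence"
    return "PASS"


if __name__ == "__main__":
    t = DEFAULT_THRESHOLDS["tumor-normal"]
    # SQ=3.86 is the tumor SQ in the example record above.
    print(classify_sq(3.86, t["call"], t["filter"]))
```

For the example record, the tumor SQ of 3.86 lies between the tumor-normal call threshold (3) and filter threshold (17.5), so the sketch returns weak_evidence, in agreement with the FILTER column.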

The clustered-events penalty is an exception to the above rule for emitting variants. By default, the clustered-events penalty replaces the (obsolete) clustered-events filter. Instead of applying a hard filter when too many events are clustered together, DRAGEN applies a penalty to the SQ scores of cophased clustered events. Clustered events with weak evidence are no longer called, but clustered events with strong evidence can still be called. This is equivalent to lowering the prior probability of observing clustered cophased variants. The penalty is applied after the decision to emit variants, so that penalized variants still appear in the VCF if their unpenalized score is high enough. Variants that are combined into an MNV via the --vc-combine-phased-variants-distance option are treated as a single variant for the purposes of the penalty. The penalty is not applied to somatic hotspot variants. To disable the clustered-events penalty, set --vc-clustered-event-penalty=0.

Somatic Mode Options

To run DRAGEN somatic small variant calling, enable the variant caller with --enable-variant-caller=true and pass in tumor, and optionally, matched normal inputs via the command line. FASTQ (both gzipped and Ora-compressed), FASTQ list, BAM and CRAM inputs are all supported input types. For all input types, reads will be aligned by the DRAGEN map/align module and resulting alignments fed into the caller by default. For BAM and CRAM inputs, you can bypass map/align and use existing alignments as variant caller input by setting --enable-map-align=false.

Please see the DRAGEN Recipe sections for recommended command lines in typical workflows. The following command line options are typically used for somatic small-variant calling:

--tumor-fastq1 and --tumor-fastq2

Inputs a pair of FASTQ files into the mapper aligner and somatic variant caller. You can use these options with other FASTQ options to run in tumor-normal mode. For example:

dragen -f -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
--tumor-fastq1 <TUMOR_FASTQ1> \
--tumor-fastq2 <TUMOR_FASTQ2> \
--RGID-tumor <RG0-tumor> --RGSM-tumor <SM0-tumor> \
-1 <NORMAL_FASTQ1> \
-2 <NORMAL_FASTQ2> \
--RGID <RG0> --RGSM <SM0> \
--enable-variant-caller true \
--output-directory /staging/examples/ \
--output-file-prefix SRA056922_30x_e10_50M

--tumor-fastq-list

Inputs a list of FASTQ files into the mapper aligner and somatic variant caller. You can use this option with other FASTQ options to run in tumor-normal mode. For example:

dragen -f \
-r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
--tumor-fastq-list <TUMOR_FASTQ_LIST> \
--fastq-list <NORMAL_FASTQ_LIST> \
--enable-variant-caller true \
--output-directory /staging/examples/ \
--output-file-prefix SRA056922_30x_e10_50M

--tumor-bam-input and --tumor-cram-input Inputs a mapped BAM or CRAM file into the somatic variant caller. You can use these options with other BAM/CRAM options to run in tumor-normal mode. When the mapper is enabled (default), reads from the input BAM/CRAM files are re-mapped and updated alignments are sent to the caller (supported for both tumor-normal and tumor-only BAM/CRAM input). When the mapper is disabled (--enable-map-align=false), the existing BAM/CRAM alignments will be used in the caller.
--vc-sq-call-threshold and --vc-sq-filter-threshold These options control the thresholds for emitting calls in the VCF and applying the weak_evidence filter tag (see above).
--vc-target-vaf This option allows the user to adjust the allele frequencies of haplotypes that will be considered by the caller as potentially appearing in the sample. It is not a hard threshold, but the variant caller will aim to detect variants with allele frequencies larger than this setting. In the case of tumor-normal runs, the frequency is measured with respect to the full set of reads (tumor and normal combined). The default threshold of 0.03 was selected to be as low as possible without incurring an excessive false positive cost; a lower setting may increase sensitivity for low-frequency variants, but may increase false positives and runtime; a higher setting may reduce false positives. Setting the vc-target-vaf to 0 will result in all haplotypes with at least two supporting reads being taken into consideration.
--vc-somatic-hotspots, --vc-use-somatic-hotspots, and --vc-hotspot-log10-prior-boost DRAGEN uses a hotspot VCF to indicate somatic mutations that are expected with increased frequency. The default hotspot file (automatically selected from <INSTALL_PATH>/resources/hotspots/somatic_hotspots_* based on the reference) is mostly based on the Memorial Sloan Kettering Cancer Center (MSKCC) published hotspots and positions in COSMIC with population allele counts (AC) >= 50. It is somewhat conservative and boosts only a few thousand positions. You can specify a custom hotspot file via the --vc-somatic-hotspots option (note: input VCF records must be sorted in the same order as contigs in the selected reference) or disable the hotspots feature with --vc-use-somatic-hotspots=false. The effect of the hotspot file is that the prior probability for hotspot variants is boosted by a factor, up to a maximum prior of 0.5. An SNV is considered to match a hotspot variant only if the allele in question is identical, whereas insertions or deletions are considered to match any insertion or deletion allele, respectively. You can use --vc-hotspot-log10-prior-boost to control the size of the adjustment. The default value is 4 (log10 scale), corresponding to an increase of 40 phred. Reducing this value results in a smaller adjustment.
--vc-systematic-noise This option allows the user to specify the systematic noise file. To run without a systematic noise file (not recommended), specify --vc-systematic-noise=NONE.
--vc-combine-phased-variants-distance This option is the same as in the germline variant caller (see "Combine Phased Variants" in the germline small-variant caller section).
--vc-skip-germline-tagging=true This option disables the germline tagging feature in the tumor-only pipeline (not recommended).
--vc-excluded-regions-bed Optional excluded regions BED file specifying where variants will be hard-filtered. Useful, e.g., to exclude ALU regions that tend to be especially noisy in FFPE samples.
--vc-call-hotspots-in-excluded-regions Do not apply excluded regions filter to hotspot variants (Default=false).
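The hotspot prior boost described in the list above can be illustrated numerically. The sketch assumes that the boost multiplies the prior by 10 raised to the log10 boost value and caps the result at 0.5, as the text describes. The baseline prior values are hypothetical placeholders, not DRAGEN parameters, and the function names are illustrative.

```python
def boost_in_phred(log10_boost: float) -> float:
    """Convert a log10 prior boost into phred units (10 phred per log10 unit)."""
    return 10.0 * log10_boost


def boosted_prior(base_prior: float, log10_boost: float = 4.0,
                  max_prior: float = 0.5) -> float:
    """Prior for a hotspot variant after the boost, capped at max_prior."""
    return min(max_prior, base_prior * 10.0 ** log10_boost)


if __name__ == "__main__":
    print(boost_in_phred(4.0))   # default boost expressed in phred
    print(boosted_prior(1e-6))   # hypothetical baseline prior
    print(boosted_prior(1e-3))   # large enough that the cap applies
```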

Tumor-in-normal contamination and liquid tumor mode

In a tumor-normal analysis, DRAGEN accounts for tumor-in-normal (TiN) contamination by running liquid tumor mode. Liquid tumor mode is disabled by default, but we recommend enabling it with --vc-enable-liquid-tumor-mode=true if TiN contamination is expected. When liquid tumor mode is enabled, DRAGEN is able to call variants in the presence of TiN contamination up to a specified maximum tolerance level (default: 0.15). With the default maximum TiN contamination tolerance, somatic variants are expected to be observed in the normal sample with allele frequencies up to 15% of the corresponding allele frequency in the tumor sample. Setting --vc-tin-contam-tolerance also enables liquid tumor mode and lets you set the maximum TiN contamination tolerance.
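The meaning of the tolerance value can be shown with simple arithmetic. This sketch only restates the relationship described above (the normal-sample allele frequency is bounded by the tolerance times the tumor allele frequency). It is not the probabilistic model used by the caller, and the function names are hypothetical.

```python
def max_expected_normal_af(tumor_af: float, tolerance: float = 0.15) -> float:
    """Highest normal-sample AF consistent with the TiN tolerance."""
    return tolerance * tumor_af


def within_tin_tolerance(tumor_af: float, normal_af: float,
                         tolerance: float = 0.15) -> bool:
    """True if the normal-sample AF can be explained by TiN contamination."""
    return normal_af <= max_expected_normal_af(tumor_af, tolerance)


if __name__ == "__main__":
    # A somatic variant at 40% AF in the tumor, with the default tolerance.
    print(max_expected_normal_af(0.40))
    print(within_tin_tolerance(0.40, 0.05))
    print(within_tin_tolerance(0.40, 0.10))
```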

Liquid tumor mode is not equivalent to liquid biopsy. Liquid tumors in liquid tumor mode refer to hematological cancers, such as leukemia. For liquid tumors, it is not feasible to use blood as a normal control because the tumor is present in the blood. Skin or saliva is typically used as the normal sample. However, skin and saliva samples can still contain blood cells, so that the matched normal control sample contains some traces of the tumor sample and somatic variants are observed at low frequencies in the normal sample. If the contamination is not accounted for, it can severely impact sensitivity by suppressing true somatic variants.

Liquid tumor mode typically uses a WGS or WES library with medium depth (for example, 100x tumor and 40x normal), and the lowest VAF detected at these depths is ~5%. Liquid biopsy typically uses a targeted gene panel (for example, 500 genes) with very high raw depth, and uses UMI indexing (collapsing down to a depth of >2000x) to enable sensitivity at VAF down to 0.1% in some cases (the limit of detection varies depending on coverage and data quality).

Mixing tumor and normal samples from different sequencing protocols

If using different sequencing systems or different library preparation methods for tumor and normal samples, we recommend setting --vc-override-tumor-pcr-params-with-normal=false. In tumor-normal mode, DRAGEN estimates a set of PCR error parameters separately for each of the tumor and normal samples. By default, DRAGEN ignores the tumor-sample parameters and uses normal-sample parameters for analysis of both samples. This default prevents overestimation of tumor-sample error rates that can occur if the somatic variant rate is high.

Allele frequency and related settings

There is no hard limit on the allele frequencies at which DRAGEN can report calls, but there are a number of points in the pipeline where low allele frequency can affect calling. The vc-target-vaf setting affects the threshold used to detect candidate haplotypes during localized haplotype assembly, but does not affect variant scoring. Once a candidate haplotype is detected, all putative variants appearing in the haplotype are scored and calls scoring above the SQ call threshold are emitted regardless of the allele frequency or the number of supporting reads.

The probability calculation in the somatic caller assesses variant and noise hypotheses at fixed allele frequencies defined by a discrete grid (by default at coverages <200: 0, 0.05, 0.1, ... 1.0). This means that the calculation will assess variants with allele frequencies below 0.05 as if the true frequency is equal to 0.05; this strategy does not preclude such variants from being called but may result in lower scores compared to if the true frequency had been considered. At positions with higher coverage, DRAGEN adds extra grid points as in the table below in order to consider hypotheses involving lower allele frequencies and effectively achieve a lower limit of detection (LOD), with the lowest VAF halving every time the coverage doubles:

Coverage    Lowest AF
0-199       0.05
200-399     0.025
400-799     0.0125
...         ...

If calls below a certain VAF are not of interest, you can use --vc-enable-af-filter (see Post Somatic Calling Filtering below) to apply a hard filter on VAF.
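The lowest grid allele frequency for a given coverage can be computed directly from the halving rule. The sketch below reproduces the three rows listed in the table. Values for coverage of 800 and above are extrapolated from the statement that the lowest VAF halves every time the coverage doubles, which is an assumption beyond the rows shown. The function name is hypothetical.

```python
def lowest_grid_af(coverage: int) -> float:
    """Lowest allele frequency on the hypothesis grid at a given coverage."""
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    lowest = 0.05   # lowest grid point for coverage below 200
    upper = 200     # exclusive upper bound of the current coverage band
    # Each doubling of coverage halves the lowest grid frequency.
    while coverage >= upper:
        lowest /= 2
        upper *= 2
    return lowest


if __name__ == "__main__":
    for cov in (100, 200, 399, 400, 799):
        print(cov, lowest_grid_af(cov))
```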

Sample-specific NTD Error Bias Estimation

DRAGEN can compensate for oxidation and deamination artifacts that might exist upstream of the sequencing system, and are common in FFPE samples. DRAGEN does this by estimating nucleotide mutation biases on a per sample basis, taking account of read orientation. During variant calling, DRAGEN then corrects for nucleotide substitution biases by combining the estimated parameters with the basecall quality scores, thus modifying the nucleotide error rates used by the hidden Markov model.

This feature can be disabled by specifying --vc-enable-unequal-ntd-errors=false or set to auto-detect by specifying --vc-enable-unequal-ntd-errors=auto. In auto-detect mode, DRAGEN will run the estimation but then disable the use of the estimated parameters if it determines that the sample does not exhibit nucleotide error bias. When the feature is enabled, DRAGEN will by default estimate a smaller set of parameters in a monomer context. To estimate a larger set of parameters in a trimer context (recommended on sufficiently large panels when coverage is above 1000X), specify --vc-enable-trimer-context=true.

To specify the regions from which to estimate nucleotide substitution biases, use --vc-snp-error-cal-bed. Alternatively, if --vc-target-bed is used to specify the target regions for variant calling, and the total bed regions are sufficiently small (maximum 4 megabases), --vc-snp-error-cal-bed can be omitted and DRAGEN will use the target bed file for bias estimation. Otherwise, DRAGEN will use a default bed file selected to match the reference, and covering a mixture of coding and non-coding regions.

DRAGEN requires a panel size of at least 150 kbp to correctly estimate nucleotide mutation biases when using trimer context, or at least 10 kbp when using monomer context. If this requirement is not met for trimer context, DRAGEN falls back on the monomer model, and if it is not met for monomer context, DRAGEN turns the bias estimation feature off.
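The fallback behavior described above can be expressed as a small decision function. The thresholds come from the text, with 1 kbp taken as 1000 bp. The function name and return labels are hypothetical.

```python
TRIMER_MIN_BP = 150_000   # 150 kbp minimum for trimer context
MONOMER_MIN_BP = 10_000   # 10 kbp minimum for monomer context


def ntd_context(panel_bp: int, trimer_requested: bool = False) -> str:
    """Return the bias-estimation context used for a given panel size."""
    if trimer_requested and panel_bp >= TRIMER_MIN_BP:
        return "trimer"
    # Trimer falls back to monomer when the panel is too small.
    if panel_bp >= MONOMER_MIN_BP:
        return "monomer"
    # Below the monomer minimum, bias estimation is turned off.
    return "disabled"


if __name__ == "__main__":
    print(ntd_context(200_000, trimer_requested=True))
    print(ntd_context(50_000, trimer_requested=True))
    print(ntd_context(5_000))
```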

Unique Molecular Identifier (UMI) Support

DRAGEN provides two specialized UMI-aware variant calling pipelines for running from UMI-collapsed reads. These pipelines are optimized to take account of the increased read and basecall qualities that are typical in simplex- and duplex-collapsed reads. Both pipelines are disabled by default; when running with UMI collapsing enabled (--enable-umi true) or when running from UMI-collapsed bams, you can enable UMI-aware variant calling by setting one of the following options to true:

--vc-enable-umi-solid The VC UMI solid mode is optimized for solid tumors with post collapsed coverage rates of ~200-300X and target allele frequencies of 5% and higher.
--vc-enable-umi-liquid The liquid biopsy pipeline is not equivalent to liquid tumor mode (see above). The liquid biopsy pipeline starts from a regular blood sample and looks for low VAF somatic variants from tumor cell free DNA floating in the blood. This type of test enables tumor profiling (diagnosis/biomarker identification) from plasma rather than from tissue, which requires an invasive biopsy. The VC UMI liquid mode is optimized for a liquid biopsy pipeline with post collapsed coverage rates of >2000X and target allele frequencies of 0.1% and higher.
If your UMI-collapsed reads do not meet the recommended post-collapsed coverage depths for the options listed above, we recommend you run with default settings.

If a third-party tool is used to produce the collapsed reads, then configure the tool so that the base call quality scores quantify the error produced by the sequencing system only. DRAGEN uses Sample-specific NTD Error Bias Estimation (see above) to account for errors upstream of the sequencing system, so such errors should not be included in base call quality scores.

gVCF Output

You can output a gVCF file for tumor-only data sets. A gVCF file reports information on every position of the input genome, including homozygous reference (homref) positions, i.e. positions where no alt allele (either germline or somatic) is present. DRAGEN creates a new <NON_REF> allele, to which reads that do not support the reference allele or any reported variant allele are assigned. In tumors, variants could exist at arbitrarily low allele frequencies and be undetectable. Thus, a somatic homref call cannot guarantee that no somatic variant at any allele frequency exists at the position. Instead, DRAGEN considers a position to be a homozygous reference if there are no somatic variants with an allele frequency at or above the limit of detection (LOD). Whereas the SQ score for an ordinary alt allele is a phred-scale posterior probability, the SQ score for the <NON_REF> allele is a phred-scale ratio between the likelihood of a homref call and the likelihood of a variant call with allele frequency at the LOD (if an alt allele is also reported, the <NON_REF> SQ score is capped at the complement of the posterior probability for the alt allele). If the LOD value is lowered, fewer homref calls are made. If the LOD value is increased, more homref calls are made.

By default the LOD is set to 5%, but you can enter a different value using the --vc-gvcf-homref-lod option.

Post Somatic Calling Filtering

DRAGEN can add a number of filters by populating the FILTER column in the VCF. The output is provided in the <output-file-prefix>.hard-filtered.vcf.gz output file.

Options

The following options are available for post somatic calling filtering:

--vc-sq-call-threshold
Emits calls in the VCF. The default is 3.0 for tumor-normal and 0.1 for tumor-only. If the value for vc-sq-filter-threshold is lower than vc-sq-call-threshold, the filter threshold value is used instead of the call threshold value.
--vc-sq-filter-threshold
Marks emitted VCF calls as filtered. The default is 17.5 for tumor-normal and 3.0 for tumor-only.
--vc-enable-triallelic-filter
Enables the multiallelic filter. The default is true. This filter will not be applied to somatic hotspot variants.
--vc-enable-non-primary-allelic-filter
Similar to the triallelic filter, but filters less aggressively. Keeps the allele with the highest alt AD at each multiallelic position and filters only the remaining alleles. The default is false. This filter is not applied to somatic hotspot variants and cannot be enabled when the triallelic filter is also on.
--vc-enable-af-filter
Enables the allele frequency filter for nuclear chromosomes. The default value is false. When set to true, variants with an allele frequency below the AF call threshold are excluded from the VCF, and variants with an allele frequency below the AF filter threshold are tagged with the low_af filter. The default AF call threshold is 1% and the default AF filter threshold is 5%. To change the threshold values, use the vc-af-call-threshold and vc-af-filter-threshold command-line options. For mitochondrial allele frequency filtering, use vc-enable-af-filter-mito and the corresponding threshold options.
--vc-enable-non-homref-normal-filter
Enables the non-homref normal filter. The default value is true. When set to true, the VCF filters out variants if the normal sample genotype is not a homozygous reference.
--vc-enable-vaf-ratio-filter
Adds one condition to be filtered out by the alt_allele_in_normal filter. The default value is false. When set to true, the VCF filters out variants if the normal sample AF is greater than 20% of tumor sample AF.
--vc-depth-filter-threshold
Filters all somatic variants (alt or homref) with a depth below this threshold. The default value is 0 (no filtering).
--vc-homref-depth-filter-threshold
In gVCF mode, filters all somatic homref variants with a depth below this threshold. The default value is 3.
--vc-depth-annotation-threshold
Filters all non-PASS somatic alt variants with a depth below this threshold. The default value is 0 (no filtering).
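The interaction between the call and filter thresholds can be sketched as follows. This is a minimal illustration rather than DRAGEN code: the function names are hypothetical, allele frequencies are assumed to be expressed as fractions, and the exact comparison operators at the boundaries are assumptions. The default values are the ones listed in the option descriptions above.

```python
# Defaults taken from the option descriptions above.
SQ_DEFAULTS = {
    "tumor-normal": {"call": 3.0, "filter": 17.5},
    "tumor-only": {"call": 0.1, "filter": 3.0},
}

def classify_by_sq(sq, mode, call_threshold=None, filter_threshold=None):
    """Return the outcome of the SQ thresholds alone (other filters are ignored)."""
    defaults = SQ_DEFAULTS[mode]
    call = defaults["call"] if call_threshold is None else call_threshold
    filt = defaults["filter"] if filter_threshold is None else filter_threshold
    # A filter threshold lower than the call threshold replaces the call threshold.
    call = min(call, filt)
    if sq < call:
        return "not emitted"
    if sq < filt:
        return "weak_evidence"
    return "PASS"

def classify_by_af(af, call_threshold=0.01, filter_threshold=0.05):
    """Outcome of the AF filter alone, assuming --vc-enable-af-filter=true."""
    if af < call_threshold:
        return "not emitted"
    if af < filter_threshold:
        return "low_af"
    return "PASS"

print(classify_by_sq(10.0, "tumor-normal"))  # emitted, but below the filter threshold
print(classify_by_af(0.03))                  # emitted, but below the AF filter threshold
```

A call that returns "PASS" here has only cleared the threshold in question; the final FILTER value depends on all enabled filters.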

Filters

| Somatic Mode | Filter ID | Description |
| --- | --- | --- |
| Tumor-Only & Tumor-Normal | weak_evidence | Variant does not meet likelihood threshold. The likelihood ratio for SQ tumor-normal is < 17.5 or < 3.0 for SQ tumor-only. |
| Tumor-Only & Tumor-Normal | multiallelic | Site filtered if there are two or more ALT alleles at this location in the tumor. Not applied to somatic hotspot variants. |
| Tumor-Only & Tumor-Normal | base_quality | Median base quality of ALT reads at this locus is < 20. |
| Tumor-Only & Tumor-Normal | mapping_quality | Median mapping quality of ALT reads at this locus is < 20 (tumor-normal) or < 30 (tumor-only). |
| Tumor-Only & Tumor-Normal | fragment_length | Absolute difference between the median fragment length of alt reads and median fragment length of ref reads at a given locus > 10000. |
| Tumor-Only & Tumor-Normal | read_position | Median of distances between the start and end of read and a given locus < 5 (the variant is too close to edge of all the reads). To output variant read position to the INFO field, use --vc-output-variant-read-position=true. |
| Tumor-Only & Tumor-Normal | low_af | Allele frequency is below the threshold specified with --vc-af-filter-threshold (default is 5%). Enabled only when using --vc-enable-af-filter=true. |
| Tumor-Only & Tumor-Normal | systematic_noise | If AQ score is < 10 (default) for tumor-normal or < 60 (default) for tumor-only, the site is filtered. |
| Tumor-Only & Tumor-Normal | low_frac_info_reads | The fraction of informative reads (denominator excludes filtered_out reads) is below the threshold. The default threshold value is 0.5. |
| Tumor-Only & Tumor-Normal | filtered_reads | More than 50% of reads have been filtered out. |
| Tumor-Only & Tumor-Normal | long_indel | Indel length is more than 100bp. |
| Tumor-Only & Tumor-Normal | low_depth | The site was filtered because the number of reads is too low. The filter is off by default. |
| Tumor-Only & Tumor-Normal | low_tlen | The site was filtered because the fraction of low TLEN ALT supporting reads is above a threshold. The default threshold is 0.4. Reads with TLEN smaller than -2.25 (default) standard deviations from the mean are considered to be low TLEN. This filter is not applied for reads sampled from tight insert distributions, i.e., stddev / mean < 0.1 (default). |
| Tumor-Only & Tumor-Normal | no_reliable_supporting_read | No reliable supporting read was found in the tumor sample. A reliable supporting read is a read supporting the alt allele with mapping quality ≥ 40, fragment length ≤ 10,000, base call quality ≥ 25, and distance from start/end of read ≥ 5. |
| Tumor-Only & Tumor-Normal | too_few_supporting_reads | Variant is supported by < 3 reads in the tumor sample. This filter is not applied in UMI-aware pipelines. |
| Tumor-Normal | noisy_normal | More than three alleles are observed in the normal sample at allele frequency above 9.9%. |
| Tumor-Normal | alt_allele_in_normal | ALT allele frequency in the normal sample is above 0.2 plus the maximum contamination tolerance. For solid tumor mode, the value is 0. For liquid tumor mode, the default value is 0.15. See vc-enable-vaf-ratio-filter for optional conditions. |
| Tumor-Normal | non_homref_normal | Normal sample genotype is not a homozygous reference. |
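To see which of these filters are responsible for most of the rejected calls in a run, the FILTER column of the hard-filtered VCF can be tallied. The sketch below is a minimal example that assumes standard tab-separated VCF records with semicolon-separated FILTER values. The function name and the three records are made up for illustration.

```python
from collections import Counter

def tally_filters(vcf_lines):
    """Count how often each FILTER value occurs across VCF data records."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information, and the header row
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th VCF column; multiple filters are separated by ';'.
        for name in fields[6].split(";"):
            counts[name] += 1
    return counts

# Illustrative records only, not real DRAGEN output.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\tPASS\t.",
    "1\t200\t.\tC\tT\t.\tweak_evidence;base_quality\t.",
    "2\t300\t.\tG\tA\t.\tweak_evidence\t.",
]
for name, count in tally_filters(example).most_common():
    print(name, count)
```

For a real run, the same function can be fed the lines of the compressed output, for example via `gzip.open(path, "rt")`.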

Systematic Noise Filtering

The DRAGEN systematic noise filter significantly improves somatic variant calling precision, especially in tumor-only mode. DRAGEN enforces its use in the tumor-only pipeline by refusing to start a run without a noise file (this requirement can be explicitly disabled). The filter removes noise that consistently appears at specific locations in the reference genome. This noise can arise from:

Mis-mapping in low-complexity regions: Repetitive sequences with low information content can lead to reads mapping to incorrect locations.
PCR noise in homopolymer regions: Regions with long stretches of the same nucleotide (e.g., AAAAA) can introduce errors during PCR amplification.

To determine whether a variant should be filtered, the systematic noise filter compares the observed variant's allele frequency (AF) to the noise level at the matching locus in the systematic noise file. Variants are filtered if their AF is not statistically sufficiently higher than the recorded noise.
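The idea of requiring the observed AF to be statistically higher than the recorded noise can be illustrated with a one-sided binomial test. The sketch below is an assumption-laden stand-in, not DRAGEN's AQ formula: the function names are hypothetical, the score is simply the phred-scaled probability of seeing at least the observed number of alt reads from noise alone, and the cap of 100 mirrors the documented AQ range.

```python
import math

def binomial_tail(alt_reads: int, depth: int, noise_rate: float) -> float:
    """P(X >= alt_reads) for X ~ Binomial(depth, noise_rate).

    Direct summation; intended for modest depths (a few hundred reads).
    """
    return sum(
        math.comb(depth, k) * noise_rate**k * (1.0 - noise_rate) ** (depth - k)
        for k in range(alt_reads, depth + 1)
    )

def toy_noise_score(alt_reads: int, depth: int, noise_rate: float, cap: float = 100.0) -> float:
    """Phred-scaled tail probability, clamped to [0, cap]. Hypothetical stand-in for AQ."""
    p = binomial_tail(alt_reads, depth, noise_rate)
    if p <= 0.0:
        return cap
    return min(cap, max(0.0, -10.0 * math.log10(p)))

# Site with a recorded noise level of 2% and a depth of 200 reads.
near_noise = toy_noise_score(5, 200, 0.02)    # observed AF 2.5%, close to the noise level
well_above = toy_noise_score(20, 200, 0.02)   # observed AF 10%, far above the noise level
print(f"near noise: {near_noise:.1f}, well above noise: {well_above:.1f}")
```

A variant whose AF sits close to the recorded noise receives a low score and would be filtered under any reasonable threshold, while a variant well above the noise level receives a high score.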

Note that the systematic noise filter specifically aims to remove noise, not germline variants; however, it may inadvertently filter some germline variants. For this reason, it is not ideal to evaluate the systematic noise file on germline admixture datasets.

Newer versions of the systematic noise file include allele-specific information along with two columns for noise frequency: one for the "mean" noise and one for the "max" noise. During a VC run, DRAGEN automatically detects the input sample type as either WGS or WES/panel and applies the optimal noise values based on sample type and run context. For WGS data, the "max" noise is used by default; for WES/panel data or whenever UMI is enabled, the "mean" noise is used.
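The selection rule described above amounts to a small decision function. The sketch below restates it in code; the function name and the sample-type labels are hypothetical, and DRAGEN performs this detection automatically, so nothing like this needs to be run by the user.

```python
def select_noise_column(sample_type: str, umi_enabled: bool = False) -> str:
    """Return which noise column ("mean" or "max") applies under the documented defaults."""
    if umi_enabled:
        return "mean"  # UMI runs use the mean noise regardless of sample type
    if sample_type == "WGS":
        return "max"
    if sample_type in ("WES", "panel"):
        return "mean"
    raise ValueError(f"unknown sample type: {sample_type!r}")

print(select_noise_column("WGS"))                    # max
print(select_noise_column("WGS", umi_enabled=True))  # mean
print(select_noise_column("panel"))                  # mean
```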

WES and WGS prebuilt systematic noise files are available for download (see below).

Custom panels will require custom noise files. It is recommended to use normal samples sequenced on the same instrument type and using the same library prep. Building your own noise file is especially helpful for clean UMI samples that tend to have less noise than WGS/WES samples. To generate a noise file it is recommended to use approximately 30-70 normal samples, although fewer normal samples (1-10) can still be used to generate useful noise files.

The systematic noise filter is used in the DRAGEN tumor-only or tumor-normal pipeline by adding --vc-systematic-noise NOISE_FILE_PATH.

| Option | Description |
| --- | --- |
| --vc-systematic-noise | Specifies a systematic noise BED file. If a somatic variant does not pass the AQ threshold, the variant is marked as 'systematic_noise' in the FILTER column of the output VCF. |
| --vc-systematic-noise-filter-threshold | Set the AQ threshold. Higher values filter more aggressively. By default the threshold value is 10 for tumor-normal and 60 for tumor-only. The valid range spans 0-100. For tumor-normal runs the threshold may be set higher (e.g. to 60) to improve specificity at the possible cost of some sensitivity. |
| --vc-systematic-noise-filter-threshold-in-hotspot | Set the AQ threshold to use in hotspot regions, where one may want to filter less aggressively than in the rest of the genome. By default, the threshold value is 10 for tumor-normal and 20 for tumor-only. |
| --vc-allele-specific-systematic-noise | Apply systematic noise in an allele-specific manner when allele information is available. This setting is ignored for v1.x.x noise files (Default=true). |

Prebuilt Systematic Noise BED Files

Somatic Systematic Noise Baseline Collection v2.0.0 noise files include allele specific information to better preserve sensitivity with systematic noise filtering enabled. Each v2.0.0 noise file includes both "mean" and "max" noise in separate columns, with the appropriate noise applied automatically based on auto-detected input type and run context.

The latest noise files (v2.0.0) contain more columns than earlier noise files and are therefore incompatible with versions of DRAGEN prior to v4.3. Older noise files are still supported in the current version of DRAGEN; however, the older noise files lack allele specific information and noise filtering will be applied by position only as was the default in v4.2 and earlier versions of DRAGEN.

Custom Systematic Noise Files

The BaseSpace Sequence Hub DRAGEN Baseline Builder App or the DRAGEN Systematic Noise File Builder Pipeline on ICA can be used to build systematic noise files in the cloud.

| Option | Description |
| --- | --- |
| --build-sys-noise-vcfs-list | Text file containing the paths of normal VCFs. Specify the full VCF file paths. List one file per line. |
| --build-sys-noise-germline-vaf-threshold | Variant calls with VAF higher than this threshold will be considered germline and will not contribute to the noise estimate. This option is disabled by default by setting the threshold to 1. (Default 1) |
| --build-sys-noise-use-germline-tag | This option will ensure that variants tagged by vc-enable-germline-tagging=true will not be counted as noise. (Default true) |
| --build-sys-noise-min-sample-cov | Minimum coverage at a site for a sample to be used towards noise estimation. At low coverages, estimated allele frequencies become less reliable. Accurate AF estimation is important for germline variant detection, and also for noise detection when using MAX noise. (Default 5) |
| --build-sys-noise-min-supporting-samples | Minimum number of samples with noise at a position in order for the position to be considered systematic noise. (Default 1) |
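How these options shape the noise estimate at a single site can be sketched as follows. This is a simplified illustration under stated assumptions, not the builder's actual algorithm: the function name is hypothetical, the input is reduced to (alt reads, depth) pairs per normal sample, and averaging over all usable samples (including those without alt reads) is an assumption.

```python
from statistics import mean

def estimate_site_noise(samples, min_sample_cov=5, germline_vaf_threshold=1.0,
                        min_supporting_samples=1):
    """Return (mean_noise, max_noise) for one site, or None if it is not systematic noise.

    samples: list of (alt_reads, depth) tuples, one per normal sample.
    """
    usable = []
    for alt_reads, depth in samples:
        if depth < min_sample_cov:
            continue  # coverage too low for a reliable AF estimate
        vaf = alt_reads / depth
        if vaf > germline_vaf_threshold:
            continue  # treated as germline; does not contribute to the noise estimate
        usable.append(vaf)
    supporting = [v for v in usable if v > 0]
    if len(supporting) < min_supporting_samples:
        return None
    return mean(usable), max(usable)

# Five hypothetical normal samples at one site.
normals = [(2, 100), (0, 80), (1, 3), (45, 90), (3, 150)]
print(estimate_site_noise(normals, germline_vaf_threshold=0.3))
```

In this example the third sample is dropped for low coverage and the fourth is treated as germline, so only three samples contribute to the mean and max noise values.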

Germline Tagging in the Tumor-Only Pipeline

When DRAGEN is run for tumor-only somatic calling, potential germline variants can be tagged in the INFO field with 'GermlineStatus' using population databases. Current databases include 1KG and gnomAD (both exome and genome sequencing data). The following options are available for this feature:

--vc-enable-germline-tagging Enable germline tagging. The default is 'false'. In a tumor-only analysis, this option must either be set 'true' (recommended) or germline tagging must be explicitly disabled with --vc-skip-germline-tagging=true (not recommended). Once the vc-enable-germline-tagging option is set to 'true', it will require the user to pass in a variant annotation data directory as follows:
- --variant-annotation-data Nirvana annotation database (Downloadable at https://support.illumina.com/content/dam/illumina-support/help/Illumina_DRAGEN_Bio_IT_Platform_v3_7_1000000141465/Content/SW/Informatics/Dragen/Nirvana_DownloadData_fDG.htm)

Additional options to control how to define germline variants.

--germline-tagging-db-threshold The minimum alternative allele count across population databases for a variant to be defined as germline (default=50).
--germline-tagging-pop-af-threshold The minimum population allele frequency for a variant to be defined as germline. Once specified, this will override the input from --germline-tagging-db-threshold.
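The precedence between the two thresholds can be sketched as a small function. This is a minimal sketch, assuming the allele count and population allele frequency have already been looked up for the variant; the function and parameter names are hypothetical, and how DRAGEN aggregates values across databases is not modeled.

```python
def is_tagged_germline(alt_allele_count, population_af,
                       db_threshold=50, pop_af_threshold=None):
    """Apply the documented precedence: a population AF threshold, once specified,
    overrides the allele-count threshold."""
    if pop_af_threshold is not None:
        return population_af >= pop_af_threshold
    return alt_allele_count >= db_threshold

# Default behavior: decided by the allele count (default threshold 50).
print(is_tagged_germline(alt_allele_count=120, population_af=0.0004))
# With a population AF threshold, the allele count is no longer consulted.
print(is_tagged_germline(alt_allele_count=120, population_af=0.0004,
                         pop_af_threshold=0.001))
```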

1    11301714        .       A       G       .       PASS    
DP=3626;MQ=249.61;FractionInformativeReads=0.974;AQ=100.00;GermlineStatus=Germline_DB   
GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB  0/1:64.73:1772,1758:0.498:872,901:900,857:3530:846,926,843,915:894,878,874,884
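The germline tag in a record like the one above can be read back with a few lines of code. The sketch below assumes the record is a single tab-separated VCF line (it is shown wrapped above) and uses a hypothetical helper name; it relies only on the standard VCF layout of the INFO, FORMAT, and sample columns.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a dict; flags without '=' map to True."""
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed

# The example record from above, joined into one tab-separated line.
record = (
    "1\t11301714\t.\tA\tG\t.\tPASS\t"
    "DP=3626;MQ=249.61;FractionInformativeReads=0.974;AQ=100.00;GermlineStatus=Germline_DB\t"
    "GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB\t"
    "0/1:64.73:1772,1758:0.498:872,901:900,857:3530:846,926,843,915:894,878,874,884"
)
fields = record.split("\t")
info = parse_info(fields[7])
# Pair each FORMAT key with the corresponding sample value.
sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
print(info["GermlineStatus"], sample["AF"])  # Germline_DB 0.498
```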

Mutation Annotation Format (MAF) Conversion in Tumor-Only and Tumor-Normal Pipelines

When enabling DRAGEN for tumor-only or tumor-normal pipelines with Nirvana Annotation, the Nirvana JSON output can be converted into a Mutation Annotation Format (MAF) file. The MAF file is a tab-separated values file containing aggregated mutation information and will be saved to the output directory that you specify. You can enable MAF conversion directly as part of the somatic small variant calling workflow (integrated mode) or separately by providing a path to a VCF file or annotated JSON file (standalone mode).

When running MAF conversion as part of the somatic small variant calling workflow, the following options are required for this feature:

Annotation options:

--enable-variant-annotation=true Enable variant annotation

MAF conversion options:

--enable-maf-output=true Enable MAF output
--maf-transcript-source Desired transcript source, RefSeq or Ensembl

Additional standalone options (when running without the variant caller):

--maf-input-vcf Input VCF with the following form: <path>/<file_name>.hard-filtered.vcf.gz
--maf-input-json Input JSON with the following form: <path>/<file_name>.hard-filtered.annotated.json.gz

Please note that when specifying standalone mode with VCF input, you must also enable annotation options to generate the JSON file. Conversely, annotation options should not be specified when running standalone mode with an input annotated JSON file.

Optional settings:

--maf-include-non-pass-variants Outputs all variants, including non-PASS variants, in the MAF output file. By default, the MAF output contains only variants that have the PASS filter in the hard-filtered VCF file.

Example command lines:

MAF output from BAM input and variant caller:

bin/dragen --output-dir=/path/to/output/dir --output-file-prefix=prefix_name --ref-dir=/path/to/ref/dir --enable-map-align=false --enable-sort=false --enable-variant-caller=true -b /path/to/normal/bam --tumor-bam-input /path/to/tumor/bam --enable-variant-annotation=true --variant-annotation-assembly <GRCh37/GRCh38> --variant-annotation-data /path/to/annotation/data --enable-maf-output=true --maf-transcript-source <Ensembl/RefSeq> --maf-include-non-pass-variants <true/false>

MAF output from output directory and output file prefix, where the output directory contains a VCF file prefixed by the output file prefix:

bin/dragen --output-dir=/path/to/output/dir/with/vcf --output-file-prefix=prefix_of_vcf --ref-dir=/path/to/ref-dir --enable-variant-annotation=true --variant-annotation-assembly <GRCh37/GRCh38> --variant-annotation-data /path/to/annotation/data --enable-maf-output=true --maf-transcript-source <Ensembl/RefSeq> --maf-include-non-pass-variants <true/false>

MAF output from source VCF file:

bin/dragen --ref-dir=/path/to/ref/dir --enable-variant-annotation=true --variant-annotation-assembly <GRCh37/GRCh38> --variant-annotation-data /path/to/annotation/data --enable-maf-output=true --maf-input-vcf=/path/to/vcf/file --maf-transcript-source <Ensembl/RefSeq> --maf-include-non-pass-variants <true/false>

Note: This command line will output the MAF file in the same location as the input VCF file. To specify a directory for output, add --output-dir and --output-file-prefix options.

MAF output from source annotated JSON file:

bin/dragen --ref-dir=/path/to/ref/dir --enable-maf-output=true --maf-input-json=/path/to/annotated/json/file --maf-transcript-source <Ensembl/RefSeq> --maf-include-non-pass-variants <true/false>

Note: This command line will output the MAF file in the same location as the input annotated JSON file. To specify a directory for output, add the --output-dir and --output-file-prefix options.
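The standalone-mode rules (VCF input requires the annotation options, annotated JSON input must not use them) are easy to get wrong when commands are assembled by a script. The sketch below builds the argument list and enforces those rules before anything is run. The helper name and its parameters are hypothetical; the flags themselves are the ones shown in the example command lines above, and the function only assembles arguments without invoking DRAGEN.

```python
def build_standalone_maf_args(ref_dir, transcript_source, input_vcf=None, input_json=None,
                              annotation_assembly=None, annotation_data=None,
                              include_non_pass=False):
    """Assemble a standalone MAF conversion command as an argument list."""
    if (input_vcf is None) == (input_json is None):
        raise ValueError("provide exactly one of input_vcf or input_json")
    args = ["bin/dragen", f"--ref-dir={ref_dir}", "--enable-maf-output=true",
            "--maf-transcript-source", transcript_source]
    if input_vcf is not None:
        # VCF input: annotation must run first to produce the JSON file.
        if annotation_assembly is None or annotation_data is None:
            raise ValueError("VCF input requires the annotation options")
        args += ["--enable-variant-annotation=true",
                 "--variant-annotation-assembly", annotation_assembly,
                 "--variant-annotation-data", annotation_data,
                 f"--maf-input-vcf={input_vcf}"]
    else:
        # Annotated JSON input: annotation options should not be specified.
        if annotation_assembly is not None or annotation_data is not None:
            raise ValueError("annotation options should not be used with JSON input")
        args.append(f"--maf-input-json={input_json}")
    if include_non_pass:
        args += ["--maf-include-non-pass-variants", "true"]
    return args

print(" ".join(build_standalone_maf_args(
    "/path/to/ref/dir", "RefSeq",
    input_json="/path/to/annotated/json/file")))
```

The returned list can be passed to `subprocess.run` on a system where DRAGEN is installed. To write the MAF file somewhere other than next to the input, the --output-dir and --output-file-prefix options would need to be appended as described in the notes above.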
