DRAGEN

Overview

Product Guides

DRAGEN v4.4

Sample Sheets

Overview

When running analysis on a standalone DRAGEN server or on ICA, a valid sample sheet can be created with the BaseSpace Run Planner (preferred).

When running analysis on a standalone DRAGEN server or on ICA, a minimal sample sheet for starting from FASTQ, BAM, or CRAM can be created by:

  • Modifying a sample sheet template to meet the requirements; see the product-specific templates for more information.

Note: A minimal sample sheet may be invalid for other purposes. It is always advisable to use a valid sample sheet generated from the BaseSpace Run Planner.

See the Run Planning section of this guide for specific instructions on planning a run and setting up a valid sample sheet for the pipeline, where supported.

New Sample Sheet options available in DRAGEN 4.4+ release

Forward orientation for index2

| [BCLConvert_Settings] | Required | Description |
| --- | --- | --- |
| SoftwareVersion | Required | If SoftwareVersion >=4.4, index2 orientation must be forward; otherwise, legacy behavior is supported. |
| RunInfoIndex2ReverseComplement | Optional | Allowed values: Y/N. If SoftwareVersion >=4.4, this setting must be present together with Index2ColumnReverseComplement. In case of conflict, this value overrides the RunInfo.xml IsReverseComplement = Y/N flag for index2 orientation. |
| Index2ColumnReverseComplement | Optional | Allowed values: Y/N. If SoftwareVersion >=4.4, this setting must be present together with RunInfoIndex2ReverseComplement. This value indicates whether the index2 column sequence is reverse complement. |
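As a rough illustration of the paired-presence rule, the sketch below writes a hypothetical [BCLConvert_Settings] fragment and checks that the two flags appear together or not at all. The key,value line layout, the SoftwareVersion value, and the helper name check_index2_flags are assumptions made for this example; they are not part of the product.

```shell
#!/bin/sh
# Sketch only: check that the two index2 flags are both present or both absent.
# Assumes each setting is written as "Key,Value" on its own line.

check_index2_flags() {
  sheet_file="$1"
  has_run_info=0
  has_column=0
  grep -q '^RunInfoIndex2ReverseComplement,[YN]' "$sheet_file" && has_run_info=1
  grep -q '^Index2ColumnReverseComplement,[YN]' "$sheet_file" && has_column=1
  if [ "$has_run_info" -ne "$has_column" ]; then
    echo "error: RunInfoIndex2ReverseComplement and Index2ColumnReverseComplement must be set together" >&2
    return 1
  fi
  echo "index2 flags are consistent"
}

# Hypothetical fragment for illustration; the values are placeholders.
sheet="$(mktemp)"
cat > "$sheet" <<'EOF'
[BCLConvert_Settings]
SoftwareVersion,4.4.3
RunInfoIndex2ReverseComplement,Y
Index2ColumnReverseComplement,N
EOF

check_index2_flags "$sheet"
```

A check like this only covers paired presence; it does not validate the Y/N combination against the instrument-specific table.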

Summary of Valid Settings for Index Orientation

As the following table shows, with DRAGEN 4.4+ the index2 sequences in the sample sheet are always in Forward orientation, for simplicity. The two new flags are especially useful when custom library prep kits (LPKs) are used and when a consistent index2 orientation is desired across all run folders. The IndexOrientation field is present in sample sheets generated by the BaseSpace Run Planner and indicates that the sample sheet index2/i5 sequences are in Forward orientation.

Look up table for index2 orientations in DRAGEN 4.4+

  • Bcl-convert SoftwareVersion must be >=4.4.

  • * indicates that the IsReverseComplement flag in RunInfo.xml is overridden by the RunInfoIndex2ReverseComplement value. NA means that the IsReverseComplement flag for index2 is not present in the RunInfo.xml file.

  • ** indicates that legacy run folders may use the two paired flags to ensure that index2 Forward orientation is consistently applied.

| Instrument Type | IndexOrientation | RunInfoIndex2ReverseComplement | Index2ColumnReverseComplement | IsReverseComplement | Index2 Orientation | Condition |
| --- | --- | --- | --- | --- | --- | --- |
| NovaSeq 6000 | Forward | N** | N** | NA | Forward | When SbsConsumableVersion <3 |
| NovaSeq 6000 | Forward | Y** | N** | NA | Forward | When SbsConsumableVersion >=3 |
| NovaSeq 6000Dx | Forward | Y | N | Y | Forward | When non-SP flow cell is used |
| NovaSeq 6000Dx | Forward | Y | N | N* | Forward | When SP flow cell is used and control software is <2.4 |
| NovaSeq X | Forward | Y | N | N | Forward | |

Summary of Legacy Settings for index2 orientations

For backward compatibility, when the specified bcl-convert version is lower than 4.4, the index2 orientation may vary depending on the instrument. In a sample sheet generated by the BaseSpace Run Planner, IndexOrientation may still indicate Forward, but it is ignored in this situation.

Look up table for index2 orientations in earlier DRAGEN versions

  • Bcl-convert SoftwareVersion must be <4.4.

  • * indicates that the IsReverseComplement flag in RunInfo.xml differs depending on the control software version.

| Instrument Type | IsReverseComplement | Index2 Orientation | Condition |
| --- | --- | --- | --- |
| NovaSeq 6000 | NA | Forward | When SbsConsumableVersion <3 |
| NovaSeq 6000 | NA | Reverse | When SbsConsumableVersion >=3 |
| NovaSeq 6000Dx | Y | Forward | When non-SP flow cell is used |
| NovaSeq 6000Dx | Y* | Forward | When SP flow cell is used and control software is >2.4 |
| NovaSeq 6000Dx | N* | Reverse | When SP flow cell is used and control software is <2.4 |
| NovaSeq X | Y | Forward | |

A sample sheet is required for each analysis with the pipeline. A sample sheet is a comma-separated values (*.csv) file used by Illumina instruments, platforms, and analysis pipelines to store settings and data for sequencing and analysis. The pipeline is compatible with sample sheet v2. For general information on sample sheet v2, refer to Illumina Connected Software Sample Sheet.

A full sample sheet includes multiple sections, including a [BCLConvert_Settings] section with a list of samples and their index sequences, along with additional information required to run the pipeline in the [{app}_Data] section. For example, the Library Prep Kit is a required field in the sample sheet for the DRAGEN Heme WGS Tumor Only Pipeline. Both Illumina and custom library prep kits are supported.

On the other hand, the DRAGEN Solid WTS Tumor Normal Pipeline may only require a minimal sample sheet with just a [Header] section and a [TN_Data] section when starting the analysis from FASTQ. This partial sample sheet is not valid when starting analysis from a run folder.

A sample sheet can be created in one of two ways:

  • Using the BaseSpace Run Planner (preferred); see BaseSpace Run Planner for details.

  • Downloading and modifying a sample sheet template following the requirements; see Requirements for details.

With the v2 sample sheet and DRAGEN 4.4+, index2 sequences must be specified in forward orientation only. For additional information, see the Index Orientation Guide.

Deployment Options

DRAGEN analysis is available on multiple platforms.

Platform
Description

DRAGEN on-premises server

DRAGEN on-premises server offers highly accurate secondary analysis in a fraction of the time compared with a traditional CPU-based system.

  • Analyze and store data locally

  • Supports varying levels of command line interface

  • Replace up to 30 traditional compute instances

  • Fully process a 34× whole human genome in ~30 minutes (1)

  • One unit supports two NovaSeq 6000 Systems running at full capacity

DRAGEN analysis on Illumina Connected Analytics

Couples the accuracy and speed of DRAGEN with the ability to customize analysis pipelines and operationalize informatics on a secure platform.

DRAGEN on BaseSpace Sequence Hub (BSSH)

Push-button analysis in an intuitive, easy-to-use interface, with the compliance and storage features of BaseSpace Sequence Hub and Amazon Web Services (AWS).

DRAGEN onboard NovaSeq X Series

  • Flexibly runs multiple secondary analysis pipelines in parallel

  • Performs up to four simultaneous applications per flow cell in a single run

  • Brings up to 5x lossless data compression, and analysis with supported applications

  • Provides savings on analysis, which over five years can exceed the price of the sequencer

DRAGEN onboard NextSeq 1000 and NextSeq 2000 Systems

  • Provides access to select DRAGEN analysis informatics pipelines

  • Enables users to generate results in as little as two hours

  • Uses intuitive pipeline algorithms to reduce reliance on external informatics experts

DRAGEN onboard MiSeq i100 Series

Intuitive, ultra-rapid analysis including DRAGEN BCL convert, DRAGEN Library QC, DRAGEN small WGS and DRAGEN Microbial Enrichment Plus.

  • Rapid results with comprehensive secondary analysis generated in two hours or less (2)

  • Highly efficient workflow with a single user touchpoint to VCF and/or HTML report and no intermediate file transfers

  • Exceptionally easy with an intuitive interface for non-expert users

DRAGEN on AWS, Azure

DRAGEN supports the FPGA-enabled instance types of AWS and Azure. RPM installers and the kernel driver can be installed on images managed by the user, and DRAGEN can be run by purchasing a license.

DRAGEN on AWS and Azure Marketplace

Pre-configured Amazon Machine Images (AMI) and Azure Virtual Machines with DRAGEN installed can be accessed from the respective marketplace offerings in a Pay-As-You-Use model.

DRAGEN on GCP

DRAGEN is made available on the Google Cloud Platform. Pre-configured instances with DRAGEN installed can be accessed through the GCP application interface. Limited availability. Please reach out to your Illumina representative for access.

(1) HG002 from PrecisionFDA truth challenge V2 run with DRAGEN analysis v4.0 on DRAGEN server v4, all callers

(2) When run according to sample recommendations

DNA Germline Panel

The DRAGEN recipe includes the recommended pipeline-specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN pangenome hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
# Inputs 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
--enable-duplicate-marking true         #default=true 
# Small variant caller 
--enable-variant-caller true 
--vc-target-bed $VC_TARGET_BED 
# Annotation 
--variant-annotation-data $PATH 
--enable-variant-annotation true 
# SV 
--enable-sv true 
--sv-exome true 
--sv-call-regions-bed $SV_TARGET_BED 
# CNV 
--enable-cnv true 
--cnv-target-bed $PATH 
--cnv-combined-counts $PATH             #CNV PON 
# HLA genotyper 
--enable-hla true 
--hla-enable-class-2 true               #if panel covers class II HLA regions 
--hla-as-filter-min-threshold 29.0      #panel specific setting 
--hla-as-filter-ratio-threshold 0.85    #panel specific setting 
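The recipe above lists one option per line with trailing comments, which a shell will not accept as a single command as written. The following is a minimal sketch of one way to assemble a subset of those options into a command line while keeping the per-option comments. All paths, the version number, and the helper name build_dragen_cmd are placeholders assumed for illustration; the sketch only prints the command rather than running it.

```shell
#!/bin/sh
# Sketch only: assemble recipe options one per line, with comments, then print.
# Every value below is a placeholder; substitute your own paths and IDs.
VERSION="4.4.3"
REF_DIR="/staging/ref/pangenome"
OUTPUT="/staging/out/sample1"
PREFIX="sample1"
FASTQ_LIST="/staging/in/fastq_list.csv"
SAMPLE_ID="sample1"
VC_TARGET_BED="/staging/in/panel.bed"

build_dragen_cmd() {
  set -- "/opt/dragen/$VERSION/bin/dragen"              # DRAGEN install path
  set -- "$@" --ref-dir "$REF_DIR"                      # pangenome hashtable
  set -- "$@" --output-directory "$OUTPUT"
  set -- "$@" --intermediate-results-dir /staging/tmp   # fast SSD scratch space
  set -- "$@" --output-file-prefix "$PREFIX"
  set -- "$@" --fastq-list "$FASTQ_LIST"
  set -- "$@" --fastq-list-sample-id "$SAMPLE_ID"
  set -- "$@" --enable-map-align true
  set -- "$@" --enable-variant-caller true
  set -- "$@" --vc-target-bed "$VC_TARGET_BED"
  # Dry run: print the assembled command instead of executing it.
  echo "$@"
}

build_dragen_cmd
```

To run the analysis rather than print it, the final `echo "$@"` would be replaced with `"$@"`. Further recipe options (annotation, SV, CNV, HLA) can be appended in the same one-per-line pattern.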

Notes and additional options

Hashtable

For DRAGEN germline runs, it is recommended to use the pangenome hashtable.

Input options

DRAGEN input sources include FASTQ list, FASTQ, BAM, or CRAM.

FQ list Input

--fastq-list $PATH 
--fastq-list-sample-id $STRING 

FQ Input

--fastq-file1 $PATH 
--fastq-file2 $PATH 
--RGSM $STRING 
--RGID $STRING 

BAM Input

--bam-input $PATH 

CRAM Input

--cram-input $PATH 
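When a wrapper script has to accept any of these input types, the choice of options can be derived from the file name. The sketch below is one hedged way to do that. The helper name input_args, the file-extension conventions (for example, treating a .csv file as a FASTQ list), and the reuse of the sample ID for both --RGSM and --RGID are assumptions made for this example. It handles only a single FASTQ file, so paired-end input would also need --fastq-file2.

```shell
#!/bin/sh
# Sketch only: map an input file to the matching DRAGEN input options.
# $1 = input path, $2 = sample ID (used for FASTQ list and FASTQ inputs).

input_args() {
  input_path="$1"
  sample_id="$2"
  case "$input_path" in
    *.csv)
      echo "--fastq-list $input_path --fastq-list-sample-id $sample_id" ;;
    *.fastq|*.fastq.gz|*.fq|*.fq.gz)
      echo "--fastq-file1 $input_path --RGSM $sample_id --RGID $sample_id" ;;
    *.bam)
      echo "--bam-input $input_path" ;;
    *.cram)
      echo "--cram-input $input_path" ;;
    *)
      echo "unrecognized input type: $input_path" >&2
      return 1 ;;
  esac
}

input_args /staging/in/fastq_list.csv sample1
input_args /staging/in/sample1.bam sample1
```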

Mapping and Aligning

| Option | Description |
| --- | --- |
| --enable-map-align true | Optionally disable map & align (default=true). |
| --enable-map-align-output true | Optionally save the output BAM (default=false). |
| --Aligner.clip-pe-overhang 2 | Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run. |

Duplicate Marking

| Option | Description |
| --- | --- |
| --enable-duplicate-marking true | By default, DRAGEN marks duplicate reads and excludes them from variant calling. |
| --enable-positional-collapsing true | Alternative to --enable-duplicate-marking=true. Instead of discarding duplicate reads, DRAGEN can optionally perform positional collapsing, merging them into higher-quality consensus reads. This is beneficial for small panels without UMIs and coverage between 300X and 1000X. However, it is slower than standard duplicate marking and less effective on samples with coverage lower than 300X. For very high coverage (1000X+), avoid it due to potential read collisions. For high-sensitivity panels with 1000X+ coverage, consider using UMIs. |
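The coverage guidance above can be expressed as a small decision helper. This is only a sketch for small panels without UMIs: the function name dedup_option is hypothetical, coverage is taken as a whole number, and treating exactly 1000X as "very high coverage" is an assumption, since the text gives both "between 300X and 1000X" and "1000X+".

```shell
#!/bin/sh
# Sketch only: pick a duplicate-handling option from expected coverage (integer X).
# Applies to small panels without UMIs; other designs need their own judgement.

dedup_option() {
  coverage="$1"
  if [ "$coverage" -ge 300 ] && [ "$coverage" -lt 1000 ]; then
    # 300X up to (but not including) 1000X: positional collapsing is beneficial.
    echo "--enable-positional-collapsing true"
  else
    # Below 300X collapsing is less effective; at 1000X+ read collisions are a risk.
    echo "--enable-duplicate-marking true"
  fi
}

dedup_option 500
dedup_option 150
```

For panels at 1000X or more that need high sensitivity, the text recommends UMIs, which this helper does not cover.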

SNV

DRAGEN SNV VC employs machine learning based variant recalibration (DRAGEN-ML). It processes read and other contextual evidence to remove false positives, recover false negatives and reduce zygosity errors. No additional setup is required. DRAGEN-ML is enabled by default as needed, when running the germline SNV VC on hg19 or hg38.

Note that we do not recommend changing the default QUAL thresholds of 3 for DRAGEN-ML and 10 for DRAGEN without ML. These values differ from each other because DRAGEN-ML improves the calibration of QUAL scores, leading to a change in the scoring range.

| Option | Description |
| --- | --- |
| --vc-target-bed | Limit variant calling to region of interest. |
| --vc-combine-phased-variants-distance INT | Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (default=2). |
| --vc-emit-ref-confidence GVCF | To enable gVCF output. |
| --vc-enable-vcf-output | To enable VCF file output during a gVCF run, set to true. The default value is false. |

Annotation

HLA

| Option | Description |
| --- | --- |
| --enable-hla | Enable HLA typer (this setting by default will only genotype class 1 genes). |
| --hla-as-filter-min-threshold | Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels. |
| --hla-as-filter-ratio-threshold | Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels. |
| --hla-enable-class-2 | Extend genotyping to HLA class 2 genes (default=true). |

CNV

| Option | Description |
| --- | --- |
| --cnv-enable-gcbias-correction true | Enable or disable GC bias correction when generating target counts. |
| --cnv-segmentation-mode $SEG_MODE | Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis. |
| --cnv-segmentation-bed $PATH | If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed. |

CNV Panel of Normals (PON)

The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. CNV requires PON files for all targeted analyses (including panels, exomes, germline, tumor-only and tumor-normal workflows). It is recommended to use 30-100 normal samples when building the PON, but fewer may be used. If sample coverage noise is relatively stable, as few as 5 PON samples may yield acceptable results.

Follow the two steps below to generate a CNV PON:

Step 1. Generate target counts of individual normal samples.

Any options used for panel of normals generation (BED file, GC Bias Correction, etc) should be matched when processing the case sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN pangenome hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
--enable-cnv true 
--cnv-target-bed $PATH 

Step 2. Combined counts generation.

Individual PON counts can be merged into a single <prefix>.combined.counts.txt.gz file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN pangenome hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--enable-cnv true 
--cnv-generate-combined-counts true 
--cnv-normals-list $CNV_NORMALS_LIST 

$CNV_NORMALS_LIST is a text file that lists, one per line, the path to each target counts file generated in Step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz). The output is a PON file with the suffix .combined.counts.txt.gz. Pass this PON file to DRAGEN CNV case sample runs with the --cnv-combined-counts option.
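A normals list of this kind can be produced by collecting the Step 1 outputs from disk. The sketch below shows one possible approach, assuming all per-sample outputs sit under one directory. The helper name build_normals_list, the directory layout, and the file names are hypothetical. The suffix is passed explicitly so that only one kind of counts file (GC-corrected or not) ends up in the list.

```shell
#!/bin/sh
# Sketch only: write the paths of Step 1 target counts files, one per line.

build_normals_list() {
  counts_dir="$1"   # directory holding the Step 1 outputs
  suffix="$2"       # .target.counts.gz or .target.counts.gc-corrected.gz
  list_file="$3"    # file to pass to --cnv-normals-list
  find "$counts_dir" -type f -name "*$suffix" | sort > "$list_file"
  if [ ! -s "$list_file" ]; then
    echo "no files ending in $suffix under $counts_dir" >&2
    return 1
  fi
}

# Demo with empty placeholder files standing in for real Step 1 outputs.
demo_dir="$(mktemp -d)"
touch "$demo_dir/normal1.target.counts.gz" "$demo_dir/normal2.target.counts.gz"
build_normals_list "$demo_dir" ".target.counts.gz" "$demo_dir/cnv_normals.txt"
cat "$demo_dir/cnv_normals.txt"
```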

In some cases, an in-run PON containing germline samples from the same batch (i.e. sample source, DNA extraction, library prep and sequencing run) may provide superior normalization.

See: Product Files

For more detail on the small variant caller in somatic mode, refer to Somatic Mode.

For instructions on how to download the Nirvana annotation database, refer to Nirvana.

For more information, see CNV Calling.

Analysis of a full batch of germline samples with an automatically generated in-run PON can be performed using DRAGEN Enrichment on BaseSpace or DRAGEN Germline Enrichment on ICA.

CNV PONs can also be built in the cloud using the DRAGEN Baseline Builder App on BaseSpace or the DRAGEN Systematic Noise File Builder Pipeline on ICA.

Clinical Research Workflows

DRAGEN v4.4 introduces support for DRAGEN server apps. These apps, composed of Docker images, Nextflow workflows, a CLI shell script, and packaged resource bundles, can be downloaded and installed on the on-premises server. The packaged resource bundles include all the resource files required to run the application, such as the hash tables, various noise baseline files, and BED files.

Server apps make it easy to run complex workflows such as Tumor Normal somatic analysis by simplifying the management of external resources and applying the correct command line parameters for the selected analysis type. The DRAGEN server can support multiple installed server apps and DRAGEN on-prem for command line use at the same time.

Advanced Topics

The pipeline may be downloaded and installed on a local DRAGEN server. A download utility may be obtained from the Illumina download site, and the download utility will manage all the dependencies. Once the required installers are downloaded, the software may be installed by running the installers.

Using NFS for data streaming

With the NovaSeq X 25B flow cells, the amount of data is on the order of terabytes, which may take a few hours or more to copy to the /staging folder on the local DRAGEN server. Using NFS storage directly for input and output is recommended in this case.

Variant Interpretation

Illumina Connected Insights (ICI) supports variant interpretation with advanced visualization capabilities. It is available in the cloud or on a local DRAGEN server.

Getting Started

DRAGEN provides tests you can run to make sure that your DRAGEN system is properly installed and configured. Before running the tests, make sure that the DRAGEN server has adequate power and cooling, and is connected to a network that is fast enough to move your data to and from the machine with adequate performance.

On-premises Installation

Installation procedure:

  • Download the desired installer from the support website and unzip the package

  • The archive integrity can be checked using: ./<dragen .run file> --check

  • Install the appropriate release based on your Linux OS with the command: sudo sh <dragen .run file>

The .run file includes a script that manages uninstallation of existing software, integrity checking of the package and files, and installation of the new DRAGEN software version. The DRAGEN software is installed in part using the Linux RPM Package Manager (rpm); several RPM packages make up the installation of a single DRAGEN software version. The RPM packages also configure the system for DRAGEN, for example by raising user ulimits, and the .run script starts the services needed for operation, such as the licensing daemon dragen_licd and the hugepages daemon dragend_hp.

NOTE: Root privileges are required for the installation.
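The installation procedure above can be scripted so that the install step runs only if the integrity check succeeds. The following is a minimal sketch of that ordering. The installer file name is a placeholder, the helper names run and install_dragen are hypothetical, and the script defaults to a dry run that only prints the two commands taken from the procedure.

```shell
#!/bin/sh
# Sketch only: check the archive, then install. DRY_RUN=1 (default) just prints.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

install_dragen() {
  installer="$1"   # path to the downloaded .run file
  if ! run "$installer" --check; then
    echo "integrity check failed for $installer" >&2
    return 1
  fi
  run sudo sh "$installer"
}

# Placeholder file name; use the actual .run file from the download.
install_dragen ./dragen-installer.run
```

Setting DRY_RUN=0 would execute the commands for real, which requires root privileges as noted above.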

Single Version Installation

Up to DRAGEN Software v4.2, only one version of the DRAGEN software can be installed at a time. Executing the .run file will remove any existing installed version and (re)install the new version.

After installation, the application and associated files are available at /opt/edico.

The single version installer will add /opt/edico to the Linux $PATH, so that the user can just call dragen without specifying the full path.

Multi-Version Installation

Starting with DRAGEN Software v4.3, multiple compatible versions of the DRAGEN software can be installed at the same time. Executing the .run file adds the new version to the system.

After installation, the application files are available at /opt/dragen/{version}/bin and FPGA files are located at /opt/bitstream/{bitstream version}.

The multi-version installer does NOT add /opt/dragen/{version}/bin to the Linux $PATH, because multiple versions can be present at a given time. Users should manage the path to the specific version they want to run. Command line examples in this guide assume that the Linux $PATH is set to the correct DRAGEN version and refer simply to dragen <options>.

Notes on multi-version installation:

  • Installers released for DRAGEN v4.2 and earlier are single version packages

  • Single version packages and multi-version packages cannot be mixed

    • Installation of a prior single version package will remove all the multi-version packages

    • Installation of a multi-version package will remove any installed single version package

  • After installing a multi-version package, see a list of installed versions at any time by running /usr/bin/dragen_versions

  • To remove any multi-version package, call yum remove on its Path

  • Adding PATH="/opt/dragen/{version}/bin:$PATH" to the last line of .bashrc file avoids the need to set the path upon each server login

Example:

$ dragen_versions
The output format of this command may change. Use --json for machine readable output.

Dragen Version           Size (MB)  Install Date         Path
4.3.2                    1378.03    2024-03-10 18:26:17  /opt/dragen/4.3.2
4.4.3                    1381.41    2024-03-18 20:56:39  /opt/dragen/4.4.3
4.3.5                    1379.25    2024-03-11 15:20:24  /opt/dragen/4.3.5

Bitstream Version        Size (MB)  Install Date         Path
07.031.732 (0x18101306)  598.95     2024-03-10 18:26:03  /opt/bitstream/07.031.732
07.031.745 (0x18101306)  598.95     2024-03-18 20:56:18  /opt/bitstream/07.031.745
 
To remove a dragen version, call `yum remove` on its Path.
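Because the multi-version installer leaves $PATH untouched, a small shell function can make switching between installed versions less error-prone. This is a sketch only: the function name use_dragen and the DRAGEN_ROOT override are hypothetical conveniences, and the default root follows the /opt/dragen/{version}/bin layout described above.

```shell
#!/bin/sh
# Sketch only: prepend the bin directory of one installed DRAGEN version to PATH.
# DRAGEN_ROOT is a hypothetical override; the default matches the documented layout.
DRAGEN_ROOT="${DRAGEN_ROOT:-/opt/dragen}"

use_dragen() {
  version="$1"
  bin_dir="$DRAGEN_ROOT/$version/bin"
  if [ ! -d "$bin_dir" ]; then
    echo "DRAGEN $version is not installed under $DRAGEN_ROOT" >&2
    return 1
  fi
  PATH="$bin_dir:$PATH"
  export PATH
}

# Example (commented out so the sketch is safe to source on any machine):
# use_dragen 4.4.3 && command -v dragen
```

Placing the function in .bashrc and calling it once per session would serve the same purpose as the static PATH line suggested in the notes, while still allowing a different version to be selected later.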

Location of dragen and resource files

| DRAGEN Version | on-premises server | cloud instance |
| --- | --- | --- |
| 4.3 and later | /opt/dragen/{version} | /opt/edico/ |
| 4.2 and earlier | /opt/edico/ | /opt/edico/ |

Throughout this guide, <INSTALL_PATH> refers to whichever of the locations above applies.

Licensing

Running the System Check

After turning on the server, you can make sure that your DRAGEN server is functioning properly by running <INSTALL_PATH>/self_test/self_test.sh, which does the following:

  • Automatically indexes chromosome M from the hg19 reference genome

  • Loads the reference genome and index

  • Maps and aligns a set of reads

  • Saves the aligned reads in a BAM file

  • Asserts that the alignments exactly match the expected results

Each server ships with the test input FASTQ data for this script, which is located in <INSTALL_PATH>/self_test. The system check takes approximately 25 to 30 minutes.

The following example shows how to run the script and shows the output from a successful test.

$ /opt/dragen/4.3.4/self_test/self_test.sh
#############################################################
Logging to /var/log/dragen/self_test.1714627157_160164.0.details.log
Using dragen executables in /opt/dragen/4.3.4/bin
Using board(s): 0 
#############################################################
Running tests for board 0 (u200)
Using scratch directory /tmp/self_test.4BO0pfPST9/0
-------------------------------------------------------------
Board 0 test 1, FPGA MEMORY TEST
Loading DIAG bitstream
Running fpga memory test, this will take ~13 minutes
Board 0 test 1, FPGA MEMORY TEST: PASS
-------------------------------------------------------------
Board 0 test 2, BAR REGISTER ACCESS
Board 0 test 2, BAR REGISTER ACCESS: PASS
-------------------------------------------------------------
Board 0 test 3, FPGA TEMP REG ACCESS
FPGA Temperature: 27C  (Max Temp: 36C, Min Temp: 22C)
Board 0 test 3, FPGA TEMP REG ACCESS: PASS
-------------------------------------------------------------
Board 0 test 4, BOARD SERIAL # REG ACCESS
Serial Number: 2130069BM05V
Board 0 test 4, BOARD SERIAL # REG ACCESS: PASS
-------------------------------------------------------------
Board 0 test 5, DRAGEN GENOME LICENSE
Board 0 test 5, DRAGEN GENOME LICENSE: PASS
-------------------------------------------------------------
Board 0 test 6, CPLD DATE TEST
cpld date is n/a
Board 0 test 6, CPLD DATE TEST: PASS
-------------------------------------------------------------
Board 0 test 7, ENCRYPTION KEY EXISTENCE TEST
Board 0 test 7, ENCRYPTION KEY EXISTENCE TEST: PASS
-------------------------------------------------------------
Board 0 test 8, PARTIAL RECONFIGURATION
DNA-MAPPER: ok
RNA-MAPPER: ok
HMM: ok
ZIP: ok
UNZIP: ok
DIAG: ok
Board 0 test 8, PARTIAL RECONFIGURATION: PASS
-------------------------------------------------------------
Board 0 test 9, HASH TABLE GENERATION
Board 0 test 9, HASH TABLE GENERATION: PASS
-------------------------------------------------------------
Board 0 test 10, MAP AND ALIGNER
running mapper aligner: ok
unmapped input records percentages: ok
md5sum check dbam sorted: pass
Board 0 test 10, MAP AND ALIGNER: PASS
-------------------------------------------------------------
Board 0 test 11, VARIANT CALLER E2E
running variant caller: ok
md5sum check dbam sorted: ok
md5sum check VCF: ok
Board 0 test 11, VARIANT CALLER E2E: PASS
#############################################################
SELF TEST COMPLETED
SELF TEST RESULT : PASS
#############################################################
Log file at /var/log/dragen/self_test.1714627157_160164.0.details.log

If the output BAM file does not match expected results, then the last line of the above text is as follows:

SELF TEST RESULT : FAIL

If you experience a FAIL result after running this test script immediately after turning on your DRAGEN server, contact Illumina Technical Support.
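For unattended checks, the result line can be tested in a script. The sketch below assumes the self-test output has been captured to a file (for example by piping the script through tee); that capture step and the helper name self_test_passed are assumptions for illustration. The demo uses two lines copied from the sample output above instead of a real run.

```shell
#!/bin/sh
# Sketch only: report whether captured self-test output contains a PASS result.

self_test_passed() {
  # $1: file containing the captured self_test.sh output
  grep -q '^SELF TEST RESULT : PASS' "$1"
}

# Demo input: lines taken from the sample output, not from a real run.
captured="$(mktemp)"
printf '%s\n' 'SELF TEST COMPLETED' 'SELF TEST RESULT : PASS' > "$captured"

if self_test_passed "$captured"; then
  echo "self-test passed"
else
  echo "self-test failed; check the details log" >&2
fi
```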

Running Your Own Test

When you are satisfied that your DRAGEN system is performing as expected, you are ready to run some of your own data through the machine, as follows:

  • Load the reference table for the reference genome

  • Determine location of input and output files

  • Process input data

Loading the Reference Genome

The reference hash table specified on the command line is automatically loaded onto the board the first time you process data with a pipeline. You can manually load the hash table for your reference genome by using the following command:

dragen -r <reference_hash-table_directory>

Make sure that the reference hash table directory is on the fast file IO drive.

The default location for the hash table for hg19 is as follows.

/staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

The command to load reference genome hg19 from the default location is as follows.

dragen -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

This command loads the binary reference genome into memory on the DRAGEN board, where it is used for processing any number of input data sets. You do not need to reload the reference genome unless you restart the system or need to switch to a different reference genome. It can take up to a minute to load a reference genome.

DRAGEN checks whether the specified reference genome is already resident on the board. If it is, then the upload of the reference genome is automatically skipped. You can force reloading of the same reference genome using the force-load-reference (-l) command line option.

The command to load the reference genome prints the software and hardware versions to standard output. For example:

DRAGEN Host Software Version 01.001.035.01.00.30.6682 and

Bio-IT Processor Version 0x1001036

After the reference genome has been loaded, the following message is printed to standard output:

DRAGEN finished normally

Determine Input and Output File Locations

The DRAGEN Pipeline is very fast, which requires careful planning for the locations of the input and output files. If the input or output files are on a slow file system, then the overall performance of the system is limited by the throughput of that file system. It is recommended that inputs and outputs are streamed directly from/to a mounted external storage system.

The DRAGEN system is preconfigured with at least one fast file system consisting of a set of fast SSD disks grouped with RAID-0 for performance. This file system is mounted at /staging. This name was chosen to emphasize the fact that this area was built to be large and fast, but is not redundant. Failure of any of the file system's constituent disks leads to the loss of all data stored there.

During processing, DRAGEN generates and reads back temporary files. It is highly recommended to always direct temporary files to the fast SSD (/staging) by using the --intermediate-results-dir option. If the --intermediate-results-dir option is not provided, temporary files are written to the --output-directory. Streaming inputs and outputs through a mounted external storage system is recommended.

Process Your Input Data

To analyze FASTQ data, use the dragen command. For example, the following command can be used to analyze a single-ended FASTQ file:

dragen \
-r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
-1 /staging/test/data/SRA056922.fastq \
--output-directory /staging/test/output \
--output-file-prefix SRA056922_dragen \
--RGID DRAGEN_RGID \
--RGSM DRAGEN_RGSM

DRAGEN Secondary Analysis

The DRAGEN secondary analysis software utilizes a highly reconfigurable Field Programmable Gate Array (FPGA) card and is available on a preconfigured DRAGEN server that can be seamlessly integrated into bioinformatics workflows. The platform can be loaded with highly optimized algorithms for many different NGS secondary analysis pipelines, including the following:

  • Whole genome

  • Exome

  • RNA-Seq

  • Methylome

  • Cancer

All user interaction is accomplished via DRAGEN software that runs on the host server and manages all communication with the FPGA card. This user guide summarizes the technical aspects of the system and provides detailed information for all DRAGEN command line options. If you are working with DRAGEN for the first time, Illumina recommends that you first read the Getting Started section, which provides a short introduction to DRAGEN, including running a test of the server, generating a reference genome, and running example commands.

DNA Pipeline

DRAGEN DNA Pipeline

The DRAGEN DNA Pipeline massively accelerates the secondary analysis of NGS data. For example, the time taken to process an entire human genome at 30x coverage is reduced from approximately 10 hours (using the current industry standard, BWA-MEM+GATK-HC software) to approximately 20 minutes. Time scales linearly with coverage depth.

These pipelines harness the tremendous power of the DRAGEN server and include highly optimized algorithms for mapping, aligning, sorting, duplicate marking, and haplotype variant calling. They also use platform features such as hardware-accelerated compression and optimized BCL conversion, together with the full set of platform tools.

Unlike all other secondary analysis methods, DRAGEN DNA Applications do not reduce accuracy to achieve speed improvements. Accuracy for both SNPs and INDELs is improved over that of BWA-MEM+GATK-HC in side-by-side comparisons.

In addition to haplotype variant calling, the pipeline supports calling of copy number and structural variants as well as detection of repeat expansions.

RNA Pipeline

DRAGEN secondary analysis includes an RNA-seq (splicing-aware) aligner, as well as RNA-specific analysis components for gene expression quantification and gene fusion detection.

The DRAGEN RNA Pipeline shares many components with the DNA Pipeline. Mapping of short seed sequences from RNA-Seq reads is performed similarly to mapping DNA reads. In addition, splice junctions (the joining of noncontiguous exons in RNA transcripts) near the mapped seeds are detected and incorporated into the full read alignments.

DRAGEN secondary analysis uses hardware-accelerated algorithms to map and align RNA-Seq-based reads faster and more accurately than popular software tools. For instance, it can align 100 million paired-end RNA-Seq-based reads in about three minutes. With simulated benchmark RNA-Seq data sets, its splice junction sensitivity and specificity are unsurpassed.

Methylation Pipeline

The DRAGEN Methylation Pipeline provides support for automating the processing of bisulfite sequencing data to generate a BAM with the tags required for methylation analysis and reports detailing the locations with methylated cytosines.

DRAGEN Host Software

You use the DRAGEN host software program dragen to build and load reference genomes, and then to analyze sequencing data by decompressing the data, mapping, aligning, sorting, duplicate marking with optional removal, and variant calling.

Invoke the software using the dragen command. The command line options are described in the following sections.

Command-line Options

The following are examples of frequently used command lines:

  • Build Reference/Hash Table

    dragen --build-hash-table true --ht-reference <REF_FASTA> \
    --output-directory <REF_DIRECTORY>  [options]
  • Run Map/Align and Variant Caller (*.fastq to *.vcf)

    dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
    --output-file-prefix <FILE_PREFIX> [options] -1 <FASTQ1> \
    [-2 <FASTQ2>] --RGID <RG0> --RGSM <SM0> --enable-variant-caller true
  • Run Map/Align (*.fastq to *.bam)

    dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
    --output-file-prefix <FILE_PREFIX> [options] \
    -1 <FASTQ1> [-2 <FASTQ2>]  \
    --RGID <RG0> --RGSM <SM0>
  • Run Variant Caller Only (*.bam to *.vcf)

    dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
    --output-file-prefix <FILE_PREFIX> [options] -b <BAM> \
    --enable-map-align false \
    --enable-variant-caller true
  • Re-map and Run Variant Caller (*.bam to *.vcf)

    dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
    --output-file-prefix <FILE_PREFIX> [options] -b <BAM> \
    --enable-map-align true \
    --enable-variant-caller true
  • Run BCL Converter (BCL to *.fastq)

    dragen --bcl-conversion-only true --bcl-input-directory <BCL_DIRECTORY> \
    --output-directory <OUT_DIRECTORY>
  • Run RNA Map/Align (*.fastq to *.bam)

    dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
    --output-file-prefix <FILE_PREFIX> [options] -1 <FASTQ1> \
    [-2 <FASTQ2>] --enable-rna true

Reference Genome Options

Use the following command to load the reference genome and hash tables to DRAGEN card memory separately from processing reads.

dragen -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

Use the -l (--force-load-reference) option to force the reference genome to load even if it is already loaded.

dragen -l -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

The time needed to load the reference genome depends on the size of the reference, but for typical recommended settings, it takes approximately 30 to 60 seconds.

Operating Modes

DRAGEN has two primary modes of operation, as follows:

  • Mapper/aligner

  • Variant caller

DRAGEN is capable of performing each mode independently or as an end-to-end solution. DRAGEN also allows you to enable and disable decompression, sorting, duplicate marking, and compression along the DRAGEN pipeline.

  • Full pipeline mode: To execute full pipeline mode, set --enable-variant-caller to true and provide input as unmapped reads in *.fastq, *.bam, or *.cram formats. DRAGEN performs decompression, mapping, aligning, sorting, and optional duplicate marking and feeds directly into the variant caller to produce a VCF file. In this mode, DRAGEN uses parallel stages throughout the pipeline to drastically reduce the overall run time.

  • Map/align mode: Map/align mode is enabled by default. Input is unmapped reads in *.fastq, *.bam, or *.cram format. DRAGEN produces an aligned and sorted BAM or CRAM file. To mark duplicate reads at the same time, set --enable-duplicate-marking to true.

  • Variant caller mode: To execute variant caller mode, set the --enable-variant-caller option to true and the --enable-map-align option to false. The input must be a mapped and aligned BAM/CRAM file. DRAGEN produces a VCF file. DRAGEN force-enables re-sorting of the BAM, because a number of read statistics and estimates are required for the variant caller to operate effectively; setting --enable-sort to false is overridden. BAM files that have not already been duplicate marked cannot be duplicate marked in the DRAGEN pipeline prior to variant calling. Use the end-to-end mode of operation to take advantage of the mark-duplicates feature.

  • RNA-Seq data: To enable processing of RNA-Seq-based data, set --enable-rna to true. DRAGEN uses the RNA spliced aligner during the mapper/aligner stage. DRAGEN dynamically switches between the required modes of operation.

  • Bisulfite MethylSeq data: To enable processing of Bisulfite MethylSeq data, set the --enable-methylation-calling option to true. DRAGEN automates the processing of data for Lister (directional) and Cokus (nondirectional) protocols to generate a single BAM with bismark-compatible tags. Alternatively, you can run DRAGEN in a mode that produces a separate BAM file for each combination of the C->T and G->A converted reads and references. To enable this mode of processing, you need to build a set of reference hash tables with --ht-methylated enabled, and run DRAGEN with the appropriate --methylation-protocol setting.
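The relationship between the first three modes and the two enabling options can be summarized as a small decision function. This is only an illustrative sketch: the function name and the returned labels are hypothetical, it models just --enable-map-align and --enable-variant-caller, and it assumes the variant caller is off unless requested (the text above states only that map/align is on by default).

```python
def dragen_mode(enable_map_align: bool = True,
                enable_variant_caller: bool = False) -> str:
    """Return a label for the operating mode implied by the two options.

    The labels are illustrative and are not printed by DRAGEN itself.
    """
    if enable_map_align and enable_variant_caller:
        # Unmapped reads in, VCF out.
        return "full pipeline"
    if enable_variant_caller:
        # Mapped and aligned BAM/CRAM in, VCF out.
        return "variant caller"
    if enable_map_align:
        # Unmapped reads in, sorted BAM/CRAM out.
        return "map/align"
    return "neither stage enabled"


if __name__ == "__main__":
    print(dragen_mode())             # default options
    print(dragen_mode(True, True))   # end-to-end run
    print(dragen_mode(False, True))  # variant calling from an existing BAM
```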

Output Options

The following command line options for output are mandatory:

  • --output-directory <out_dir>: Specifies the output directory for generated files.

  • --output-file-prefix <out_prefix>: Specifies the output file prefix. DRAGEN appends the appropriate file extension onto this prefix for each generated file.

  • -r [ --ref-dir ]: Specifies the reference hash table.

The following examples do not include these mandatory options.

For mapping and aligning, the output is sorted and compressed into BAM format by default before saving to disk. The user can control the output format from the map/align stage with the --output-format <SAM|BAM|CRAM> option. If the output file exists, the software issues a warning and exits. To force overwrite if the output file already exists, use the -f [ --force ] option.

For example, the following commands output to a compressed BAM file and force overwrite of any existing output:

dragen ... -f

dragen ... -f --output-format bam

To generate a BAI-format BAM index file (*.bai), set --enable-bam-indexing to true.

The following example outputs to a SAM file and forces overwrite of any existing output:

dragen ... -f --output-format sam

The following example outputs to a CRAM file and forces overwrite of any existing output:

dragen ... -f --output-format cram

DRAGEN only outputs lossless CRAM files. All QNAMEs and BAM tags are preserved in the CRAM.

Alignment tags

DRAGEN can generate mismatch difference (MD) tags, as described in the BAM standard. The feature is turned off by default because there is a small performance cost to generate these strings. To generate MD tags, set --generate-md-tags to true.

DRAGEN can also annotate additional information about alignments in a ZS:Z tag. The following are valid tag values:

| Tag | Tag meaning |
| --- | --- |
| ZS:Z:R | Multiple alignments with similar score were found. |
| ZS:Z:NM | No alignment was found. |
| ZS:Z:QL | An alignment was found but it was below the quality threshold. |
| ZS:Z:NRD | Alignment is to an auto-added decoy contig (not present in input FASTA). |
| ZS:Z:PAI | Alignment is to an insertion encoded in a population based alternate contig (not present in input FASTA). |

By default, DRAGEN writes a ZS:Z:PAI tag in the output BAM for alignments that map completely inside insertions encoded in population based alternate contigs. To write ZS:Z alignment status tags for all other types described above, set --generate-zs-tags to true (false by default). These tags are only generated in the primary alignment and when a read has suboptimal alignments qualifying for secondary output (even if none were output because --Aligner.sec-aligns was set to 0).
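When inspecting DRAGEN output as SAM text, the ZS:Z values listed above can be looked up with a few lines of code. The following is a minimal sketch, not part of DRAGEN: the function name is hypothetical, the record in the usage example is invented, and the only assumption about the file layout is the standard SAM rule that optional tags start at the twelfth tab-separated column.

```python
ZS_MEANINGS = {
    "R": "Multiple alignments with similar score were found.",
    "NM": "No alignment was found.",
    "QL": "An alignment was found but it was below the quality threshold.",
    "NRD": "Alignment is to an auto-added decoy contig.",
    "PAI": "Alignment is to an insertion encoded in a population based alternate contig.",
}


def zs_status(sam_line):
    """Return (value, meaning) for the ZS:Z tag of one SAM record, or None."""
    fields = sam_line.rstrip("\n").split("\t")
    for field in fields[11:]:  # the 11 mandatory SAM columns come first
        if field.startswith("ZS:Z:"):
            value = field[len("ZS:Z:"):]
            return value, ZS_MEANINGS.get(value, "unknown value")
    return None


if __name__ == "__main__":
    # Invented unmapped record, for illustration only.
    record = "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF\tZS:Z:NM"
    print(zs_status(record))
```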

To generate SA:Z tags, set --generate-sa-tags to true (the default). These tags provide alignment information (position, cigar, orientation) of groups of supplementary alignments, which are useful in structural variant calling.

To generate pair score in a ps:i tag, set --generate-ps-tags to true (false by default for DNA, true for RNA). The pair score is used in DRAGEN for computing MAPQ and can be used to check how well alignment candidate pairs score against each other.

DRAGEN can also output mate alignment tags. To generate the mate CIGAR (in the MC:Z tag), set --generate-mc-tags to true (this is the default). To generate the mate mapping quality (in the MQ:i tag), set --generate-mq-tags to true (this is the default). To generate the mate sequence (in the R2:Z tag) and mate base qualities (in the Q2:Z tag), set --generate-r2-tags to true and --generate-q2-tags to true, respectively (both are false by default). When enabled, R2:Z and Q2:Z tags are emitted only for improperly paired read alignments with a fragment length of at least 1000 bp. The methylation pipelines currently do not support the output of mate alignment tags.

DRAGEN also outputs a graph alignment tag (ga:Z) when applicable. The tag is controlled by --generate-ga-tags (true by default for DNA, false for RNA). This tag describes the best alt contig alignment that improved the score of a primary-contig alignment at its liftover position. It can also describe read alignments to alt contigs for which there is no liftover and the primary alignment is unmapped, for example when the read maps best to an alt contig describing a novel long insertion that is not present in the reference. In addition, read alignments that have been marked as unmapped because they map to auto-detected decoy contigs not present in the original user-provided FASTA also have their alignments described in the ga tag.

The ga tag uses the same format as the SA tag used to describe supplementary alignments.
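Because the ga tag shares the SA layout, the same parser can read both. The sketch below assumes the SA layout defined in the SAM optional-fields specification, a semicolon-delimited list of rname,pos,strand,CIGAR,mapQ,NM entries. The class name, function name, and example value are hypothetical.

```python
from typing import List, NamedTuple


class TagAlignment(NamedTuple):
    rname: str
    pos: int
    strand: str
    cigar: str
    mapq: int
    nm: int


def parse_sa_style(value: str) -> List[TagAlignment]:
    """Parse the value of an SA-style tag (without the 'SA:Z:' or 'ga:Z:' prefix)."""
    alignments = []
    for chunk in value.split(";"):
        if not chunk:  # the list ends with a trailing semicolon
            continue
        rname, pos, strand, cigar, mapq, nm = chunk.split(",")
        alignments.append(
            TagAlignment(rname, int(pos), strand, cigar, int(mapq), int(nm))
        )
    return alignments


if __name__ == "__main__":
    # Invented value, for illustration only.
    for aln in parse_sa_style("chr1_alt,1000,+,100M50S,60,1;"):
        print(aln)
```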

CRAM Output

When CRAM is selected as output, DRAGEN generates a CRAM file with the following features:

  • CRAM format V3.0 is produced by default. V3.1 can be enabled with the option --cram-version 3.1.

  • The CRAM is lossless. Lossy compression is never employed and cannot be enabled.

  • Quality score compression is lossless. Read names are preserved.

  • Only the GZIP compression algorithm is employed for maximum compatibility. bzip2 and lzma are not employed. rANS is used for quality scores.

  • All input BAM tags are preserved.

  • The reference used to compress the CRAM file is the DRAGEN hash table provided during the map/align run. When decompressing the CRAM with a FASTA file and third-party tools, the FASTA that was used to generate the hash table must be used.

  • A CRAM index is produced in .crai format.

  • CRAM output is only possible when sort is enabled. CRAM alignments are always positionally sorted.

The following default settings are used for CRAM output:

| CRAM option | Value | Description |
| --- | --- | --- |
| SEQS_PER_SLICE | 2000 | Max sequences per slice |
| BASES_PER_SLICE | SEQS_PER_SLICE*500 | Max bases per slice |
| SLICE_PER_CNT | 1 | Max slices per container |
| embed_ref | 0 | Do not embed reference sequence |
| noref | 0 | Do not use non-reference-based encoding |
| multiseq | -1 | Do not use multiple references per slice |
| unsorted | 0 | Do not use unsorted mode |
| use_bz2 | 0 | Do not compress using bzip2 |
| use_lzma | 0 | Do not compress using lzma |
| use_rans | 1 | Use rANS for quality score compression |
| binning | NONE | Qual score binning not used |
| preserve_aux_order | 1 | Preserve all aux tags and order (incl RG,NM,MD) |
| preserve_aux_size | 0 | Aux tag sizes not preserved ('i', 's', 'c') |
| lossy_read_names | 0 | Preserve read names |
| lossy | 0 | Do not enable Illumina 8 quality-binning system |
| ignore_md5 | 0 | Enable all checking of checksums |
| decode_md | 0 | Do not (re)generate MD and NM tags |
| cram_version | 3.0 | Default is CRAM v3.0. |

Input Options

DRAGEN can process reads in FASTQ format or BAM/CRAM format. DRAGEN supports the following compression options for FASTQ input files.

  • Uncompressed

  • gzip or bgzip compression

  • ORA compression. To use ORA compression, you must provide an ORA reference and reference directory. See ORA Compression and Decompression.

If your input FASTQ files are gzipped, DRAGEN automatically decompresses the files using hardware-accelerated decompression, and then streams the reads into the mapper. If your files end in *.ora, DRAGEN automatically decompresses the files using ORA decompression, and then streams the reads into the mapper. The same FASTQ command-line options apply for all compression formats.

FASTQ Input Files

FASTQ input files can be single-ended or paired-end, as shown in the following examples.

  • Single-ended in one FASTQ file (-1 option)

    dragen -r <REF_DIR> -1 <fastq> \
    --output-directory <OUT_DIR> --output-file-prefix <OUTPUT_PREFIX> \
    --RGID <RGID> --RGSM <RGSM>
  • Paired-end in two matched FASTQ files (-1 and -2 options)

    dragen -r <REF_DIR> -1 <fastq1> -2 <fastq2> \
    --output-directory <OUT_DIR> --output-file-prefix <OUT_PREFIX> \
    --RGID <RGID> --RGSM <RGSM>
  • Paired-end in a single interleaved FASTQ file (--interleaved (-i) option)

    dragen -r <REF_DIR> -1 <INTERLEAVED_FASTQ> -i \
    --RGID <RGID> --RGSM <RGSM>

Both bcl2fastq and the DRAGEN BCL command use a common file naming convention, as follows:

<SampleID>_S<#>_<Lane>_<Read>_<segment#>.fastq.gz

Older versions of bcl2fastq and DRAGEN could segment FASTQ samples into multiple files to limit file size or to decrease the time to generate them.

For example:

RDRS182520_S1_L001_R1_001.fastq.gz

RDRS182520_S1_L001_R1_002.fastq.gz

...

RDRS182520_S1_L001_R1_008.fastq.gz

These files do not need to be concatenated to be processed together by DRAGEN. To map/align any sample, provide the first file in the series (-1 <FileName>_001.fastq). DRAGEN then reads all segment files in the sample consecutively. This applies to both FASTQ file sequences specified with the -1 and -2 options for paired-end input, and also to compressed fastq.gz files. To turn the behavior off, set --enable-auto-multifile to false on the command line.

DRAGEN can also optionally read multiple files by the sample name given in the file name, which can be used to combine samples that have been distributed across multiple BCL lanes or flow cells. To enable this feature, set the --combine-samples-by-name option to true.

If the FASTQ files specified on the command-line use the Casava 1.8 file naming convention shown above and additional files in the same directory share that sample name, those files and all their segments are processed automatically. Note that sample name, read number, and file extension must match. Index barcode and lane number can differ.
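To preview which files in a directory would be grouped under one sample name, the naming convention above can be approximated with a regular expression. This is a rough sketch and not DRAGEN's own matching logic: the pattern, the function name, and the restriction to R1/R2 read labels are assumptions made for illustration.

```python
import re
from collections import defaultdict

# <SampleID>_S<#>_<Lane>_<Read>_<segment#>.fastq.gz (the .gz suffix is treated as optional)
CASAVA = re.compile(
    r"^(?P<sample>.+)_S(?P<number>\d+)_L(?P<lane>\d+)"
    r"_(?P<read>R[12])_(?P<segment>\d+)\.fastq(?:\.gz)?$"
)


def group_fastqs(names):
    """Group file names by (sample, read), ordered by lane and then segment."""
    groups = defaultdict(list)
    for name in names:
        match = CASAVA.match(name)
        if match is None:
            continue  # not a file that follows the naming convention
        key = (match["sample"], match["read"])
        groups[key].append((int(match["lane"]), int(match["segment"]), name))
    return {key: [n for _, _, n in sorted(files)] for key, files in groups.items()}


if __name__ == "__main__":
    listing = [
        "RDRS182520_S1_L001_R1_002.fastq.gz",
        "RDRS182520_S1_L001_R1_001.fastq.gz",
        "RDRS182520_S1_L002_R1_001.fastq.gz",
        "notes.txt",
    ]
    for key, files in group_fastqs(listing).items():
        print(key, files)
```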

To avoid impacting system performance, input files must be located on a fast file system.

Multiple FASTQ Input Files

To process multiple FASTQ input files as one sample, it is recommended that you use the --fastq-list <csv file name> option to specify the name of a CSV file containing the list of FASTQ files, instead of using the --combine-samples-by-name option.

For example:

dragen -r <ref_dir> --fastq-list <CSV_FILE> \
--fastq-list-sample-id <Sample_ID> --output-directory <OUT_DIR> \
--output-file-prefix <OUT_PREFIX>

Using a CSV file avoids having to concatenate the FASTQ files, for cases where there are multiple FASTQ files for a sample such as top-up scenarios or where FASTQ files are split across lanes. It also allows you to name the FASTQ input files, input from multiple subdirectories, and add BAM tags specified explicitly for each read group. DRAGEN automatically generates a CSV file of the correct format during BCL conversion to FASTQ. The CSV file is named fastq_list.csv and contains an entry for each FASTQ file or paired-end file pair produced during the run.

FASTQ CSV File Format

The first line of the CSV file specifies the title of each column, and is followed by one or more data lines. All lines in the CSV file must contain the same number of comma-separated values and should not contain white space or other extraneous characters.

Column titles are case-sensitive. The following column titles are required:

  • RGID: Read group

  • RGSM: Sample ID

  • RGLB: Library

  • Lane: Flow cell lane

  • Read1File: Full path to a valid FASTQ input file

  • Read2File: Full path to a valid FASTQ input file. Required for paired-end input. If not using paired-end input, leave empty.

Each FASTQ file referenced in the CSV list can be referenced only once. All values in the Read2File column must be either nonempty and reference valid files, or they must all be empty.

When generating a BAM file using fastq-list input, one read group is generated per unique RGID value. The BAM header contains the following tags for each read group:

  • ID (from RGID)

  • SM (from RGSM)

  • LB (from RGLB)

You can specify additional tags for each read group by adding a column title. The column title must be only four upper-case characters and begin with RG. For example, to add a PU (platform unit) tag, add a column named RGPU and specify the value for each read group in this column. All column titles must be unique.
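The format rules in this section can be checked before starting a run. The following is a minimal sketch of such a check, written only from the rules stated here; the function name and messages are hypothetical, it does not verify that the FASTQ paths exist, and DRAGEN's own validation may differ.

```python
import csv
import io
import re

REQUIRED = ["RGID", "RGSM", "RGLB", "Lane", "Read1File", "Read2File"]
EXTRA_TAG = re.compile(r"^RG[A-Z]{2}$")  # four upper-case characters starting with RG


def check_fastq_list(text):
    """Return a list of problems found in fastq-list CSV text (empty if none)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    header, data = rows[0], rows[1:]
    problems = []
    for title in REQUIRED:
        if title not in header:
            problems.append(f"missing required column {title}")
    if len(set(header)) != len(header):
        problems.append("column titles are not unique")
    for title in header:
        if title not in REQUIRED and not EXTRA_TAG.match(title):
            problems.append(f"unexpected column title {title}")
    if problems:
        return problems
    for number, row in enumerate(data, start=2):
        if len(row) != len(header):
            problems.append(
                f"line {number}: expected {len(header)} values, found {len(row)}"
            )
    if problems:
        return problems
    records = [dict(zip(header, row)) for row in data]
    files = [r["Read1File"] for r in records]
    files += [r["Read2File"] for r in records if r["Read2File"]]
    if len(set(files)) != len(files):
        problems.append("a FASTQ file is referenced more than once")
    read2_present = [bool(r["Read2File"]) for r in records]
    if any(read2_present) and not all(read2_present):
        problems.append("Read2File must be set for all rows or for none")
    return problems


if __name__ == "__main__":
    example = (
        "RGID,RGSM,RGLB,Lane,Read1File,Read2File\n"
        "A.1,s1,lib,1,/d/a_R1.fastq,/d/a_R2.fastq\n"
        "B.1,s1,lib,2,/d/b_R1.fastq,\n"
    )
    print(check_fastq_list(example))
```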

A fastq-list file can contain files for more than one sample. If a fastq-list file contains only one unique RGSM entry, then no additional options need to be specified, and DRAGEN processes all files listed in the fastq-list file. If there is more than one unique RGSM entry in a fastq-list file, --fastq-list-sample-id <SampleID> must be used in addition to --fastq-list <filename> to process only a specific sample from the CSV file. Only the entries in the fastq-list file with an RGSM value that match the specified SampleID are processed.

  • Independent processing and output for multiple individual samples in one run is not supported.

  • To process all listed files together as one sample, regardless of the RGSM value, the option --fastq-list-all-samples=true can be used instead of --fastq-list-sample-id.

Note

For a single run, only one BAM and VCF output file are produced because all input read groups are expected to belong to the same sample. To process multiple samples independently from one BCL conversion run, DRAGEN must be run multiple times using different values for the --fastq-list-sample-id option.

There is no option to specify groupings or subsets of RGSM values for more complex filtering, but the fastq-list file can be modified to achieve the same effect.
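One way to modify the fastq-list file for this purpose is to write a reduced copy that keeps only the rows for a chosen set of RGSM values. The sketch below works on CSV text so that it stays self-contained; the function name and the sample values are hypothetical.

```python
import csv
import io


def filter_fastq_list(text, keep_samples):
    """Return fastq-list CSV text containing only rows whose RGSM is in keep_samples."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row["RGSM"] in keep_samples:
            writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    original = (
        "RGID,RGSM,RGLB,Lane,Read1File,Read2File\n"
        "A.1,s1,lib,1,a1.fastq,a2.fastq\n"
        "B.1,s2,lib,1,b1.fastq,b2.fastq\n"
        "C.1,s3,lib,1,c1.fastq,c2.fastq\n"
    )
    print(filter_fastq_list(original, {"s1", "s3"}), end="")
```

The reduced file could then be passed to --fastq-list together with --fastq-list-all-samples=true so that the remaining rows are processed as one sample.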

The following is an example FASTQ list CSV file with the required columns:

RGID,RGSM,RGLB,Lane,Read1File,Read2File
CACACTGA.1,RDSR181520,UnknownLibrary,1,/staging/RDSR181520_S1_L001_R1_001.fastq,/staging/RDSR181520_S1_L001_R2_001.fastq
AGAACGGA.1,RDSR181521,UnknownLibrary,1,/staging/RDSR181521_S2_L001_R1_001.fastq,/staging/RDSR181521_S2_L001_R2_001.fastq
TAAGTGCC.1,RDSR181522,UnknownLibrary,1,/staging/RDSR181522_S3_L001_R1_001.fastq,/staging/RDSR181522_S3_L001_R2_001.fastq
AGACTGAG.1,RDSR181523,UnknownLibrary,1,/staging/RDSR181523_S4_L001_R1_001.fastq,/staging/RDSR181523_S4_L001_R2_001.fastq

If you use the --tumor-fastq-list option for somatic input, use the --tumor-fastq-list-sample-id <SampleID> option to specify the sample ID for the corresponding FASTQ list, as shown in the following example:

dragen -r <ref_dir> --tumor-fastq-list <csv_file> \
--tumor-fastq-list-sample-id <Sample_ID> \
--output-directory <out_dir> \
--output-file-prefix <out_prefix> --fastq-list <csv_file_2> \
--fastq-list-sample-id <Sample_ID_2>

Tumor-Normal Pairs Input

If you use fastq_lists or tumor_fastq_lists comprising multiple samples (RGSMs) in somatic mode, you can use a loop to iterate through the two lists to create tumor-normal pairs for testing. Create a *.txt file with the RGSM of each normal sample to be tested (one per line), and then create a separate *.txt file with the RGSM of each tumor sample to be tested. Make sure that the tumor sample RGSMs are listed in the same order as the corresponding normal samples, and include a blank line after the last sample.

You can use the following example script to perform testing in somatic mode. Each iteration takes one entry from the tumor samples list and one entry from the normal samples list (from top to bottom) to create a tumor-normal pair as input for the DRAGEN run.

#!/bin/bash

HT="/staging/HT/"
tumor_fastq_list="/staging/inputs/tumor_fastq_list.csv"
normal_fastq_list="/staging/inputs/normal_fastq_list.csv"

tumor_samples_list="/staging/inputs/tumor_samples_list.txt"
normal_samples_list="/staging/inputs/normal_samples_list.txt"

# Read one tumor RGSM (file descriptor 3) and one normal RGSM (file descriptor 4) per iteration.
while read -u 3 -r tumor_RGSM && read -u 4 -r normal_RGSM; do
    # Skip blank lines, such as the one after the last sample.
    [[ -n "${tumor_RGSM}" && -n "${normal_RGSM}" ]] || continue

    output_dir="/staging/results/${tumor_RGSM}_${normal_RGSM}"
    mkdir -p "${output_dir}"

    dragen \
        -r "${HT}" \
        --tumor-fastq-list "${tumor_fastq_list}" \
        --tumor-fastq-list-sample-id "${tumor_RGSM}" \
        --fastq-list "${normal_fastq_list}" \
        --fastq-list-sample-id "${normal_RGSM}" \
        --output-directory "${output_dir}" \
        --output-file-prefix "${tumor_RGSM}_${normal_RGSM}"
done 3<"${tumor_samples_list}" 4<"${normal_samples_list}"

The following are examples of the FASTQ lists and samples lists used as input for the script.

Sample normal_fastq_list.csv content:

RGPL,RGID,RGSM,RGLB,Lane,Read1File,Read2File
DRAGEN_RGPL,DRAGEN_RGID_N1.1,normal-1,ILLUMINA,1,/staging/inputs/normal-1_S1_L001_R1_001.fastq.gz,/staging/inputs/normal-1_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N1.2,normal-1,ILLUMINA,2,/staging/inputs/normal-1_S1_L002_R1_001.fastq.gz,/staging/inputs/normal-1_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N2.1,normal-2,ILLUMINA,1,/staging/inputs/normal-2_S1_L001_R1_001.fastq.gz,/staging/inputs/normal-2_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N2.2,normal-2,ILLUMINA,2,/staging/inputs/normal-2_S1_L002_R1_001.fastq.gz,/staging/inputs/normal-2_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N3.1,normal-3,ILLUMINA,1,/staging/inputs/normal-3_S1_L001_R1_001.fastq.gz,/staging/inputs/normal-3_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N3.2,normal-3,ILLUMINA,2,/staging/inputs/normal-3_S1_L002_R1_001.fastq.gz,/staging/inputs/normal-3_S1_L002_R2_001.fastq.gz

Sample tumor_fastq_list.csv content:

RGPL,RGID,RGSM,RGLB,Lane,Read1File,Read2File
DRAGEN_RGPL,DRAGEN_RGID_T1.1,tumor-1,ILLUMINA,1,/staging/inputs/tumor-1_S1_L001_R1_001.fastq.gz,/staging/inputs/tumor-1_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T1.2,tumor-1,ILLUMINA,2,/staging/inputs/tumor-1_S1_L002_R1_001.fastq.gz,/staging/inputs/tumor-1_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T2.1,tumor-2,ILLUMINA,1,/staging/inputs/tumor-2_S1_L001_R1_001.fastq.gz,/staging/inputs/tumor-2_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T2.2,tumor-2,ILLUMINA,2,/staging/inputs/tumor-2_S1_L002_R1_001.fastq.gz,/staging/inputs/tumor-2_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T3.1,tumor-3,ILLUMINA,1,/staging/inputs/tumor-3_S1_L001_R1_001.fastq.gz,/staging/inputs/tumor-3_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T3.2,tumor-3,ILLUMINA,2,/staging/inputs/tumor-3_S1_L002_R1_001.fastq.gz,/staging/inputs/tumor-3_S1_L002_R2_001.fastq.gz

Sample normal_samples_list.txt content:

normal-1
normal-2
normal-3

Sample tumor_samples_list.txt content:

tumor-1
tumor-2
tumor-3
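Because the script pairs samples purely by line position, it can be worth checking the lists before launching any runs. The following sketch confirms that both samples lists have the same length and that every RGSM appears in the corresponding FASTQ list, and then returns the pairs in order. The function names are hypothetical, and the inputs are passed as text rather than file paths to keep the example self-contained.

```python
import csv
import io


def rgsm_values(fastq_list_text):
    """Return the set of RGSM values found in fastq-list CSV text."""
    return {row["RGSM"] for row in csv.DictReader(io.StringIO(fastq_list_text))}


def make_pairs(tumor_samples, normal_samples, tumor_fastq_list, normal_fastq_list):
    """Return [(tumor_RGSM, normal_RGSM), ...] in list order, or raise ValueError."""
    tumors = [line.strip() for line in tumor_samples.splitlines() if line.strip()]
    normals = [line.strip() for line in normal_samples.splitlines() if line.strip()]
    if len(tumors) != len(normals):
        raise ValueError(
            f"{len(tumors)} tumor samples but {len(normals)} normal samples"
        )
    tumor_known = rgsm_values(tumor_fastq_list)
    normal_known = rgsm_values(normal_fastq_list)
    missing = [t for t in tumors if t not in tumor_known]
    missing += [n for n in normals if n not in normal_known]
    if missing:
        raise ValueError("RGSM not found in FASTQ list: " + ", ".join(missing))
    return list(zip(tumors, normals))


if __name__ == "__main__":
    header = "RGPL,RGID,RGSM,RGLB,Lane,Read1File,Read2File\n"
    tumor_csv = header + "p,T1.1,tumor-1,lib,1,t1_R1.fastq.gz,t1_R2.fastq.gz\n"
    normal_csv = header + "p,N1.1,normal-1,lib,1,n1_R1.fastq.gz,n1_R2.fastq.gz\n"
    print(make_pairs("tumor-1\n\n", "normal-1\n\n", tumor_csv, normal_csv))
```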

FASTQ ORA Input Files

You can use the same options as the other FASTQ input file types for ORA files. To use the ORA file, replace the FASTQ file name with the ORA file name and specify the ORA reference directory using --ora-reference.

See ORA Compression and Decompression for more information on ORA reference files.

The following command represents paired-end in two matched ORA FASTQ files (-1 and -2 options).

dragen -r <REF_DIR> -1 <fastq.ora1> -2 <fastq.ora2> \
--ora-reference <ORADATA_DIR> \
--output-directory <OUT_DIR> --output-file-prefix <OUT_PREFIX> \
--RGID <RGID> --RGSM <RGSM>

BAM Input Files

BAM files can be used as input to the mapper/aligner. By default, --enable-map-align is true. When a BAM file input is provided with map/align enabled, DRAGEN ignores any alignment or duplicate marking information contained in the input file. The reads are re-mapped, and the new alignments are fed downstream to the variant callers. Any existing flags in the input BAM are erased when reads are re-mapped. BAM re-mapping is supported for multiple BAM inputs at a time, such as in paired tumor-normal input to somatic variant calling. To output the re-mapped BAM(s), set --enable-map-align-output=true.

Alternatively, existing alignments in the BAM file can be used as input to the variant callers by setting the --enable-map-align option to false.

If the input file contains paired-end reads, the two reads of each pair must be processed together. Other pipelines would require you to re-sort the input data set by read name. DRAGEN vastly increases the speed of this operation by pairing the input reads itself and sending them on to the mapper/aligner as soon as pairs are identified. Use the --pair-by-name option to enable or disable this feature (the default is true).

Specify single-ended input in one BAM file with the (-b) and --pair-by-name=false options, as follows:

dragen -r <ref_dir> -b <bam> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --pair-by-name false

Specify paired-end input in one BAM file with the (-b) and --pair-by-name=true options, as follows:

dragen -r <ref_dir> -b <bam> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --pair-by-name true

CRAM Input

You can use CRAM files as input to the DRAGEN mapper/aligner and variant caller. The DRAGEN functionality available when using CRAM input is the same as when using BAM input. Supported CRAM input file formats are v3.0 and v3.1.

By default, the CRAM compressor and decompressor use the DRAGEN reference specified with the --ref-dir option. CRAM compression is reference based, and the reference used for compression is not part of the CRAM file. Therefore, the CRAM input file must have been created with the same reference as the one provided to DRAGEN for the analysis.

DRAGEN supports the re-alignment of a CRAM input that was created with a different reference in one step. Re-aligning a CRAM file that was created with a different reference requires use of the --cram-reference option. This option will make the CRAM decompressor use the specified reference.

  • --cram-reference can be either a FASTA file or a DRAGEN hash table folder.

  • If pointing to a FASTA file, the FASTA .fai index file must be present next to the FASTA file.

  • CRAM output is always compressed using the --ref-dir reference.
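A quick pre-flight check of the value intended for --cram-reference might look like the following sketch. It is not part of DRAGEN: the function name is hypothetical, the index is assumed to be named by appending .fai to the FASTA file name, and a directory is simply accepted as a hash table folder without inspecting its contents.

```python
import os


def check_cram_reference(path):
    """Return None if path looks usable as --cram-reference, else a message."""
    if os.path.isdir(path):
        # Assumed to be a DRAGEN hash table folder; contents are not inspected.
        return None
    if not os.path.isfile(path):
        return f"{path} does not exist"
    if not os.path.isfile(path + ".fai"):
        # Assumption: the index sits next to the FASTA as <fasta>.fai.
        return f"missing index {path}.fai"
    return None


if __name__ == "__main__":
    # Hypothetical path, for illustration only.
    print(check_cram_reference("/staging/reference/hg19.fa"))
```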

Example: CRAM was created with hg19, re-analysis with hg38

dragen -r <ref_dir HG38> --cram-input <cram> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --cram-reference <ref_dir HG19>

dragen -r <ref_dir HG38> --cram-input <cram> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --cram-reference <hg19.fa>

The following option is used to provide a CRAM input to either the mapper/aligner or the variant caller:

  • --cram-input: The name and path of the CRAM file.

One usage example is paired-end input in a single CRAM file. In this case, also set the --pair-by-name option to true:

dragen -r <ref_dir> --cram-input <cram> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --pair-by-name true

BCL Input Files

BCL is the output format of Illumina sequencing systems. Under limited circumstances, DRAGEN can read directly from BCL for map-align operations, saving the time needed for conversion to FASTQ.

DRAGEN can read directly from BCL in the following circumstances:

  • Only one lane is input as part of a run (specified on the command-line).

  • The lane has only a single sample specified in the SampleSheet.csv file.

When converting BCL to FASTQ is required, DRAGEN provides a BCL to FASTQ converter (see DRAGEN BCL Data Conversion).

The following example command is for BCL input with only one lane of input:

dragen --bcl-input-dir <BCL_ROOT> --bcl-only-lane <num> -r <ref_dir> \
--output-directory <out_dir> --output-file-prefix <out_prefix>

For additional BCL conversion options, see Input File Types.

Handling of N bases

One of the techniques that DRAGEN uses to optimize the handling of sequences can lead to overwriting of the base quality score assigned to N base calls.

When you use the --fastq-n-quality and --fastq-offset options, the base quality scores of N base calls are overwritten with a fixed base quality. The default values for these options are 2 and 33, respectively. Together they encode the fixed quality as ASCII code 35 (character '#'), which matches the Illumina minimum quality.
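The arithmetic behind the default values is simply the quality value plus the offset, converted to an ASCII character. The sketch below illustrates the effect on a single read; the function names are hypothetical, and this is not DRAGEN's internal implementation.

```python
def n_quality_char(fastq_n_quality=2, fastq_offset=33):
    """Return the quality character written for N base calls."""
    return chr(fastq_n_quality + fastq_offset)


def overwrite_n_qualities(sequence, qualities, fastq_n_quality=2, fastq_offset=33):
    """Replace the quality character at every N position with the fixed value."""
    fixed = n_quality_char(fastq_n_quality, fastq_offset)
    return "".join(
        fixed if base == "N" else qual for base, qual in zip(sequence, qualities)
    )


if __name__ == "__main__":
    print(n_quality_char())                       # '#', ASCII code 2 + 33 = 35
    print(overwrite_n_qualities("ACNT", "FFFF"))  # third quality becomes '#'
```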

Read Names for Paired-End Reads

By a common convention, read names can include suffixes (such as /1 or /2) that indicate which end of a pair the read represents. For BAM input using the --pair-by-name option, DRAGEN ignores these suffixes to find matching pair names. By default, DRAGEN uses the forward slash character as the delimiter for these suffixes and ignores the /1 and /2 when comparing names. By default, DRAGEN also strips these suffixes from the original read names.

DRAGEN has the following options to control how suffixes are used:

  • To change the delimiter character for suffixes, use the --pair-suffix-delimiter option. Valid values for this option include forward-slash (/), dot (.), and colon (:).

  • To preserve the entire name, including the suffixes, set --strip-input-qname-suffixes to false.

  • To append a new set of suffixes to all read names, set --append-read-index-to-name to true. The delimiter is determined by the --pair-suffix-delimiter option. By default, the delimiter is a slash, so /1 and /2 are added to the names.
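The effect of these options on a read name can be illustrated with two small helpers. This is a sketch of the behavior as described above, not DRAGEN code. The function names are hypothetical, and only the suffixes 1 and 2 are recognized.

```python
def strip_pair_suffix(name, delimiter="/"):
    """Remove a trailing <delimiter>1 or <delimiter>2 from a read name."""
    base, sep, suffix = name.rpartition(delimiter)
    if sep and suffix in ("1", "2"):
        return base
    return name  # no recognized suffix, so the name is kept as-is


def append_read_index(name, read_index, delimiter="/"):
    """Append <delimiter><read_index> to a read name."""
    return f"{name}{delimiter}{read_index}"


if __name__ == "__main__":
    # Both ends of a pair compare equal once the suffix is ignored.
    print(strip_pair_suffix("read7/1") == strip_pair_suffix("read7/2"))
    print(strip_pair_suffix("read7.1", delimiter="."))
    print(append_read_index("read7", 2))
```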

Gene Annotation Input Files

When processing RNA-Seq data, you can supply a gene annotations file by using the --annotation-file option. Providing this file improves the accuracy of the mapping and aligning stage (see Input Files). The file should conform to the GTF/GFF format specification and should list annotated transcripts that match the reference genome being mapped against. The similar GFF3 format is currently not supported, due to inconsistent contig naming between GENCODE and Ensembl. See the RNA user guide section for more details on potential issues and workarounds.

DRAGEN can take the SJ.out.tab file (see SJ.out.tab) as an annotations file to help guide the aligner in a two-pass mode of operation.

Networked Streaming

AWS S3, Azure Blob Storage, and AWS Presigned URL Input Streaming

DRAGEN can stream input files directly from an AWS S3 bucket, Azure Blob storage account, or by using AWS presigned URLs (presigned URLs are not supported for Azure Blob storage at this time). With streaming, input files are not required to be downloaded locally prior to being processed. The files are streamed over the network directly into the DRAGEN processor.

Input streaming is most beneficial for large input files. DRAGEN supports input streaming for BAMs and compressed FASTQ files. For FASTQ files, input streaming can be used in all the configurations, including single-end FASTQs, paired-end FASTQs, and FASTQ lists.

Input streaming is supported for the following use cases:

  • Mapping/aligning of FASTQ and BAM.

  • Germline and somatic small variant calling from BAM (without remapping).

For other file types that are significantly smaller in size, download them locally before running the analysis.

Streaming FASTQ Input Using AWS S3

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 s3://s3-bucket-name/path/to/object_1.fastq.gz \
  -2 s3://s3-bucket-name/path/to/object_2.fastq.gz \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming FASTQ Input Using Azure Blob Storage Account

AZ_ACCOUNT_NAME="storage-account-name" AZ_ACCOUNT_KEY="<account-key>" dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 https://storage-account-name.blob.core.windows.net/path/to/object_1.fastq.gz \
  -2 https://storage-account-name.blob.core.windows.net/path/to/object_2.fastq.gz \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming FASTQ Input Using Presigned URLs (for AWS only)

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 "https://bucket-name.amazonaws.com/path/to/object_1.fastq.gz?querystring" \
  -2 "https://bucket-name.amazonaws.com/path/to/object_2.fastq.gz?querystring" \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming BAM Input Using AWS S3

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -b s3://s3-bucket-name/path/to/object_1.bam \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming BAM Input Using Azure Blob Storage Account

AZ_ACCOUNT_NAME="storage-account-name" AZ_ACCOUNT_KEY="<account-key>" dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -b https://storage-account-name.blob.core.windows.net/path/to/object_1.bam \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming BAM Input Using Presigned URLs (for AWS only)

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -b "https://bucket-name.amazonaws.com/path/to/object_1.bam?querystring" \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

AWS S3, Azure Blob Storage, Output Streaming

DRAGEN can stream its output to an AWS S3 Bucket or an Azure Blob Storage Account Container. Output streaming is beneficial for large output files and for sharing results.

Streaming output to AWS S3

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 SRA056922.fastq \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory s3://s3-bucket-name/path/to/output \
  --intermediate-results-dir /staging/examples \
  --output-file-prefix streaming

Streaming output to Azure Blob Storage Account

AZ_ACCOUNT_NAME="storage-account-name" AZ_ACCOUNT_KEY="<account-key>" dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 SRA056922.fastq \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory https://storage-account-name.blob.core.windows.net/path/to/output \
  --intermediate-results-dir /staging/examples \
  --output-file-prefix streaming

Security and Permissions

To stream input files from, or write output to, a cloud provider's storage, you must have permission to access the remote files.

AWS S3

S3 requires AWS authentication and credentials. The authentication should already be set up on the instance you are running, for example, via IAM policies.

Azure Blob Storage Account

Azure requires authentication, which is supplied through environment variables. DRAGEN supports two methods: (1) managed identities and (2) storage account access keys. To use a storage account access key, set the AZ_ACCOUNT_NAME and AZ_ACCOUNT_KEY environment variables, as in the Azure streaming examples in this section.

To use managed identities, you must run DRAGEN on an Azure instance. The instance must have Contributor permissions (read/write) on the Storage Account that it reads from and writes to. If the instance has a single managed identity, only the AZ_ACCOUNT_NAME=<azure-storage-account-name> environment variable is required. For multiple managed identities, you must also provide the AZR_IDENT_CLIENT_ID=<client-id> environment variable, set to the client ID of the identity that can access your storage account. The client ID can be found on the Azure Portal.

Presigned URL (AWS only)

An AWS presigned URL typically has a query string attached to it, which provides the authentication credentials or tokens needed to grant permission to the S3 bucket (e.g., https://bucket-name.amazonaws.com/path/to/folder?querystring). Currently, DRAGEN does not support streaming input from Azure presigned URLs.

Sample Sex

Use the --sample-sex command line option to control the sex karyotype input used in downstream components, such as variant callers. If a sample sex karyotype input is not specified using the command line, the sex karyotype is automatically determined. The sex karyotype input is converted to a reference sex karyotype for use in variant calling. Other components might support sex karyotype input. Refer to the corresponding section for the component you are using.

The --sample-sex option supports the following values. Values are not case-sensitive.

  • none: No sex karyotype input. Components use a default reference sex karyotype.

  • auto: The sex karyotype is estimated by the Ploidy Estimator. If using CNV calling, sex karyotype is determined using a separate sex estimation module. If DRAGEN cannot estimate the sex karyotype, then components do not have a sex karyotype input. This behavior is then the same as none. auto is the default value.

  • female: Sex karyotype input is XX.

  • male: Sex karyotype input is XY.

The following example command lines use --sample-sex to specify the sex karyotype.

--sample-sex FEMALE

--sample-sex MALE

--sample-sex NONE

If the value is none, female, or male, the Ploidy Estimator could still run and produce output, but variant callers will not use any estimated sex karyotype that is different than the sex karyotype provided via the command-line.

The sex karyotype input is converted to the reference sex karyotype for the different components as follows. See the relevant component section for more information on how --sample-sex is used.

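Reference sex karyotype used by each component:

| Sex Karyotype Input | CNV Caller | DRAGEN-STR | Ploidy Caller | Small Variant Caller | SV Caller |
|---|---|---|---|---|---|
| X0 | XX | XY | XX | XXYY | XXYY |
| XX | XX | XX | XX | XX | XXYY |
| XXX | XX | XX | XX | XXYY | XXYY |
| XXXX | XX | XX | XX | XXYY | XXYY |
| XXXXX | XX | XX | XX | XXYY | XXYY |
| XY | XY | XY | XY | XY | XXYY |
| XXY | XY | XX | XY | XXYY | XXYY |
| XXXY | XY | XX | XY | XXYY | XXYY |
| XXXXY | XY | XX | XY | XXYY | XXYY |
| XYY | XY | XY | XY | XXYY | XXYY |
| XXYY | XY | XX | XY | XXYY | XXYY |
| XXXYY | XY | XX | XY | XXYY | XXYY |
| XYYY | XY | XY | XY | XXYY | XXYY |
| XXYYY | XY | XX | XY | XXYY | XXYY |
| XYYYY | XY | XY | XY | XXYY | XXYY |
| None | XX/XY | XX | XX/XY | XXYY | XXYY |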

  • For sex karyotype input of None, CNV/Ploidy Caller independently check the coverage ratio of X and Y to determine the reference sex karyotype. Detection of minimal Y coverage will yield XY, otherwise XX.

Preservation or Stripping of BQSR Tags

The Picard Base Quality Score Recalibration (BQSR) tool produces output BAM files that include tags BI and BD. BQSR calculates these tags relative to the exact sequence for a read. If a BAM file with BI and BD tags is used as input to mapper/aligner with hard clipping enabled, the BI and/or BD tags can become invalid.

The recommendation is to strip these tags when using BAM files as input. To remove the BI and BD tags, set the --preserve-bqsr-tags option to false. If you preserve the tags, DRAGEN warns you to disable hard clipping.

Read Group Options

DRAGEN assumes that all the reads in a given FASTQ belong to the same read group. DRAGEN creates a single @RG read group descriptor in the header of the output BAM file, with the ability to specify the following standard BAM attributes:

| Attribute | Argument | Description |
|---|---|---|
| ID | --RGID | Read group identifier. If you include any of the read group parameters, RGID is required. It is the value written into each output BAM record. |
| LB | --RGLB | Library. |
| PL | --RGPL | Platform/technology used to produce the reads. The BAM standard allows for values CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT and PACBIO. |
| PU | --RGPU | Platform unit, e.g., flowcell-barcode.lane. |
| SM | --RGSM | Sample. |
| CN | --RGCN | Name of the sequencing center that produced the read. |
| DS | --RGDS | Description. |
| DT | --RGDT | Date the run was produced. |
| PI | --RGPI | Predicted mean insert size. |

If any of these arguments are present, DRAGEN adds an RG tag to all the output records to indicate that they are members of a read group. The following example shows a command line that includes read group parameters:

dragen --RGID 1 --RGCN Broad --RGLB Solexa-135852 \
--RGPL Illumina --RGPU 1 --RGSM NA12878 \
-r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
-1 SRA056922.fastq --output-directory /staging/tmp/ \
--output-file-prefix rg_example

When using the --fastq-list option to input multiple read groups, BAM tags (and others) are specified for each read group by adding columns to the fastq_list.csv file. Each column heading consists of four capital letters and each begins with 'RG'. For each column, each read group's values for that column are propagated to the output BAM file in an identically named tag.
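As an illustration, the following sketch writes a small fastq_list.csv with two read groups and then lists the RG columns that would be propagated as tags. The Lane, Read1File, and Read2File columns follow the commonly used fastq list layout and are an assumption here, since this section does not enumerate them. All IDs and paths are placeholders.

```shell
# Write a two-read-group fastq_list.csv (IDs and paths are placeholders).
# Assumption: Lane, Read1File, Read2File follow the usual fastq list layout.
fastq_list=$(mktemp)
cat > "$fastq_list" <<'EOF'
RGID,RGSM,RGLB,RGPU,Lane,Read1File,Read2File
rg1,NA12878,lib1,flowcell.1,1,/staging/fastq/NA12878_L001_R1.fastq.gz,/staging/fastq/NA12878_L001_R2.fastq.gz
rg2,NA12878,lib1,flowcell.2,2,/staging/fastq/NA12878_L002_R1.fastq.gz,/staging/fastq/NA12878_L002_R2.fastq.gz
EOF

# Show the four-letter RG* column headings (the read group tag columns).
head -n 1 "$fastq_list" | tr ',' '\n' | grep '^RG[A-Z][A-Z]$'
```

Each data row describes one read group, so the two rows above would yield two read groups for the same sample.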

License Options

To suppress the license status message at the end of the run, use the --lic-no-print option. The following shows an example of the license status message:

LICENSE_MSG| =====================================================
LICENSE_MSG| License report
LICENSE_MSG|   Genome status [ACxxxxxxxxxxx] : used 1263.9 Gbases since 2018-Feb-15 (1263886160894 bases, unlimited)
LICENSE_MSG|   Genome  bases [ACxxxxxxxxxxx] : 202000000
LICENSE_MSG|   Genome  bases [total]         : 202000000

Autogenerated MD5SUM for BAM and CRAM Output Files

An MD5SUM file is generated automatically for BAM and CRAM output files. The MD5SUM file has the same name as the output file, with an .md5sum extension appended (eg, whole_genome_run_123.bam.md5sum). The MD5SUM file is a single-line text file that contains the md5sum of the output file, which exactly matches the output of the Linux md5sum command.

The MD5SUM calculation is performed as the output file is written, so there is no measurable performance impact (compared to the Linux md5sum command, which can take several minutes for a 30x BAM).
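After copying an output file to other storage, the MD5SUM file can be used to check its integrity. The sketch below is one possible way to do that. The verify_md5sum function is a hypothetical helper, and it reads only the first whitespace-separated field of the .md5sum file, so it works whether or not the file also contains a file name (an assumption, since the exact file content is not shown here).

```shell
# verify_md5sum FILE: compare FILE against the hash stored in FILE.md5sum.
# Returns 0 on match, 1 on mismatch, 2 if either file is missing.
verify_md5sum() {
  [ -f "$1" ] && [ -f "$1.md5sum" ] || return 2
  expected=$(awk 'NR == 1 { print $1 }' "$1.md5sum")
  actual=$(md5sum "$1" | awk '{ print $1 }')
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Self-contained demo on a stand-in file; use a real BAM or CRAM path instead.
demo_dir=$(mktemp -d)
printf 'stand-in for BAM contents\n' > "$demo_dir/example.bam"
md5sum "$demo_dir/example.bam" | awk '{ print $1 }' > "$demo_dir/example.bam.md5sum"
verify_md5sum "$demo_dir/example.bam" && echo "checksum matches"
```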

Configuration Files

Command line options can be stored in a configuration file. The location of the default configuration file is <INSTALL_PATH>/config/dragen-user-defaults.cfg. You can override this file by using the --config-file (-c) option to specify a different file. The configuration file used for a given run supplies the default settings for that run, any of which can be overridden by command line options.

The recommended approach is to use the dragen-user-defaults.cfg file as a template to create default settings for different use cases. Copy dragen-user-defaults.cfg, rename the copy, then modify the new file for the specific use-case. Best practice is to put options that rarely change into the configuration file and to specify options that vary from run to run on the command line.

Licensing

DNA Somatic Tumor-Normal Solid Panel UMI

The DRAGEN recipe includes the recommended pipeline specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
# Inputs 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
# UMI 
--umi-enable true 
--umi-source STRING                     #Default='qname' 
--umi-library-type STRING               #e.g. random-duplex 
--tumor-normal-has-umi STRING           #Sample(s) containing UMI ['tumor', 'both']. 
--umi-min-supporting-reads 2            #Default=2 
# Small variant caller 
--enable-variant-caller true 
--vc-target-bed $VC_TARGET_BED 
--vc-systematic-noise $PATH             #Optional 
--vc-enable-umi-solid true              #>= 1% VAF 
# SV 
--enable-sv true 
--sv-systematic-noise $PATH             #Optional 
--sv-exome true 
--sv-call-regions-bed $SV_TARGET_BED 
# CNV 
--enable-cnv true 
--cnv-use-somatic-vc-baf true 
--cnv-target-bed $PATH 
--cnv-combined-counts $PATH             #CNV PON 
# Annotation 
--variant-annotation-data $PATH 
--enable-variant-annotation true 
# TMB 
--enable-tmb true 
# HLA genotyper 
--enable-hla true 
--hla-enable-class-2 true               #if panel covers class II HLA regions 
--hla-as-filter-min-threshold 29.0      #panel specific setting 
--hla-as-filter-ratio-threshold 0.85    #panel specific setting 
# Microsatellite Instability (MSI) 
--msi-command tumor-normal 
--msi-microsatellites-file $PATH 
--msi-coverage-threshold 40 

Notes and additional options

Hashtable

For DRAGEN somatic runs it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

--tumor-fastq-list $PATH 
--tumor-fastq-list-sample-id $STRING 
--fastq-list $PATH 
--fastq-list-sample-id $STRING 

FQ Input

--tumor-fastq1 $PATH 
--tumor-fastq2 $PATH 
--RGSM-tumor $STRING 
--RGID-tumor $STRING 
--fastq-file1 $PATH 
--fastq-file2 $PATH 
--RGSM $STRING 
--RGID $STRING 

BAM Input

--tumor-bam-input $PATH 
--bam-input $PATH 

CRAM Input

--tumor-cram-input $PATH 
--cram-input $PATH 

Mapping and Aligning

| Option | Description |
|---|---|
| --enable-map-align true | In the TN pipeline this must be set to false for BAM/CRAM input. |
| --enable-map-align-output true | Optionally save the output BAM (default=false). |
| --Aligner.clip-pe-overhang 2 | Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run. |

Fractional (Raw Reads) Downsampling

DRAGEN can subsample a random, fractional percentage of reads from an input file using the fractional downsampler. You can use downsampling to subsample data sets in order to simulate different amounts of sequencing. DRAGEN randomly subsamples reads from primary analysis without any modification (e.g. no trimming, no filtering, etc.).

Downsampling may be useful to reduce runtime on very deep samples. For Tumor-Normal analyses, it is also recommended to use a normal sample with lower coverage than the tumor sample. If the matched normal has deeper coverage than the tumor sample, the fractional downsampler can be used to reduce coverage on the normal sample.

| Option | Description |
|---|---|
| --enable-fractional-down-sampler | Set to true to enable fractional downsampling. The default value is false. |
| --down-sampler-normal-subsample | Specify the fraction of reads to keep as a subsample of normal input data. The default value is 1.0 (100%). |
| --down-sampler-tumor-subsample | Specify the fraction of reads to keep as a subsample of tumor input data. The default value is 1.0 (100%). |
| --down-sampler-random-seed | Specify the random seed for different runs of the same input data. The default value is 42. |
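A subsample fraction can be estimated from the current and desired coverage. The sketch below assumes that coverage scales roughly linearly with the fraction of reads kept, which is an approximation. The subsample_fraction helper and the coverage values are hypothetical.

```shell
# subsample_fraction CURRENT_COVERAGE TARGET_COVERAGE
# Prints the fraction of reads to keep, capped at 1.0 (no upsampling).
subsample_fraction() {
  awk -v cur="$1" -v tgt="$2" 'BEGIN {
    f = (cur > 0) ? tgt / cur : 1.0
    if (f > 1.0) f = 1.0
    printf "%.3f\n", f
  }'
}

# Example: a 120x matched normal reduced to about 60x.
subsample_fraction 120 60   # prints 0.500
```

The printed value could then be passed to --down-sampler-normal-subsample together with --enable-fractional-down-sampler true.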

UMI

| Option | Description |
|---|---|
| --umi-source STRING | Specify the input type for the UMI sequence. Options: qname, fastq, bamtag. |
| --umi-library-type STRING | Set the batch option for the type of UMI correction. Options: random-duplex, random-simplex, nonrandom-duplex. |
| --umi-nonrandom-whitelist $PATH | If UMI is nonrandom, either a whitelist or correction table is required. The whitelist includes a valid UMI sequence per line. |
| --umi-correction-table $PATH | If UMI is nonrandom, either a whitelist or correction table is required. The correction table defaults to the table used by TruSight Oncology: <INSTALL_PATH>/resources/umi/umi_correction_table.txt.gz. |
| --umi-min-supporting-reads INT | Specify the number of matching UMI input reads required to generate a consensus read. Any family with insufficient supporting reads is discarded. The default is 2. |
| --umi-metrics-interval-file $BED | Target region in BED format. |
| --umi-emit-multiplicity both | |
| --umi-start-mask-length INT | Number of additional bases to ignore from start of read. The default is 0. To reduce FP optionally set to 1. |
| --umi-end-mask-length INT | Number of additional bases to ignore from end of read. The default is 0. To reduce FP optionally set to 3. |
| --tumor-normal-has-umi STRING | Specify if only the tumor, or if both the tumor and normal have UMIs. Options: 'both', 'tumor'. |

SNV

| Option | Description |
|---|---|
| --vc-target-bed | Limit variant calling to region of interest. |
| --vc-combine-phased-variants-distance INT | Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2) |
| --vc-systematic-noise $PATH | Systematic noise file. This filter is recommended for removing systematic noise observed in normal samples. |
| --vc-somatic-hotspots $PATH | DRAGEN has a default set of hotspot variants (positions and alleles) where it will assign an increased prior probability. Use this option to override with a custom hotspots file. |
| --vc-enable-liquid-tumor-mode true | Tumor-in-normal contamination. Only use if there is some tumor leakage in the normal control. |
| --vc-override-tumor-pcr-params-with-normal false | Mixed sample preparation. Only use if the tumor and normal samples exhibit different PCR (indel) noise patterns, e.g., due to using different sample preparation. |
| --vc-sq-filter-threshold $INT | Threshold for sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity. |
| --vc-excluded-regions-bed $BED | |

High-coverage sequencing panels allow for the detection of low-frequency alleles. DRAGEN supports 3 main settings for improved sensitivity on low VAF variant calls.

| High Sensitivity Option | Description |
|---|---|
| --vc-target-vaf FLOAT | The default is 0.03 (3%). Set to e.g. 0.01 to improve SNV sensitivity on 1% VAF variants (assuming sufficient coverage). |
| --vc-enable-umi-solid true | Optimized for 1% and higher VAFs on UMI (or read position collapsed) samples with approx 300-1000X coverage. |
| --vc-enable-umi-liquid true | Optimized for 0.1% and higher VAFs on UMI samples with 1000X or higher coverage as expected in liquid biopsies. |

HLA

| Option | Description |
|---|---|
| --enable-hla | Enable HLA typer (this setting by default will only genotype class 1 genes) |
| --hla-as-filter-min-threshold | Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels. |
| --hla-as-filter-ratio-threshold | Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels. |
| --hla-enable-class-2 | Extend genotyping to HLA class 2 genes (default=true). |

CNV

| Option | Description |
|---|---|
| --cnv-enable-gcbias-correction true | Enable or disable GC bias correction when generating target counts. |
| --cnv-segmentation-mode $SEG_MODE | Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis. |
| --cnv-segmentation-bed $PATH | If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed. |
| --cnv-normal-cnv-vcf $CNV_NORMAL_VCF | |

Annotation

TMB

| Option | Description |
|---|---|
| --tmb-vaf-threshold FLOAT | Variant minimum allele frequency for usable variants (default=0.05) |
| --vc-callability-tumor-thresh INT | Required read coverage to use a site (default=50). |
| --tmb-enable-proxi-filter BOOL | Use variant VAF information to increase germline filtering. Recommended for TO, but not for TN. May be overly aggressive at tagging variants as germline (default=false). |

MSI

For panels it is recommended to post-process the file by intersecting the WES or WGS sites with the manifest. This will avoid using any off-target reads in the MSI analysis. For small panels it may be required to generate custom site files to ensure the panel covers at least 2000 sites. To generate custom MSI site files refer to the MSI Biomarker section in the user guide.

| Option | Description and recommended setting |
|---|---|
| --msi-coverage-threshold INT | Minimum coverage for a microsatellite: 60 (default) |
| --msi-distance-threshold FLOAT | Minimum Jensen-Shannon distance between tumor and normal for a microsatellite: 0.1 (default) |

SV

| Option | Description |
|---|---|
| --sv-call-regions-bed | Specifies a BED file containing the set of regions to call. Optionally, you can compress the file in gzip or bgzip format. |
| --sv-exclusion-bed | Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format. |
| --enable-variant-deduplication true | Relevant when both SV and SNV callers are enabled in somatic workflows. Can increase sensitivity and prevent the occurrence of replicated variants within genes such as FLT3 and KMT2A. Filters all small indels in the structural variant VCF that appear and are passing in the small variant VCF. DRAGEN creates a new VCF that contains the variants in the SV VCF that do not match a variant in the SNV VCF file. The new deduplicated SV VCF file has the prefix passed by --output-file-prefix, followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases. |
| --sv-systematic-noise $BEDPE | Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bgzip compressed). Optional for Tumor-Normal, but strongly recommended for Tumor-Only. |
| --sv-somatic-ins-tandup-hotspot-regions-bed $BED | Specify a custom BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. The default file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with some padding on both sides (300 bp). |
| --sv-min-candidate-variant-size | Run SV caller and report all SVs/indels at or above this size. The default value is 10. |
| --sv-min-scored-variant-size | After candidate identification, only score and report SVs/indels at or above this size. The default value is 50. This parameter does not affect the somatic hotspot region. |

| Option | Recommended Value for Liquid Tumors (e.g. AML/MLL) |
|---|---|
| --sv-enable-liquid-tumor-mode true | DRAGEN can account for Tumor-in-Normal (TiN) contamination by running liquid tumor mode. |
| --sv-tin-contam-tolerance $TIN_CONTAM_TOLERANCE | Set the Tumor-in-Normal (TiN) contamination tolerance level. |

Resource Files

DRAGEN requires resource files for components such as SNV, SV, and CNV. The following notes provide references for downloading these files or generating them for custom workflows or assays.

SNV Systematic Noise

Systematic noise files are considered essential in Tumor-Only workflows. They are also recommended for Tumor-Normal workflows.

Prebuilt

| Prebuilt WES/WGS noise files | Description |
|---|---|
| WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz | For WGS FF |
| FFPE_WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz | For WGS FFPE (only hg38) |
| WES_hg38_v2.0.0_systematic_noise.snv.bed.gz | For WES FF and FFPE |

Custom

Prebuilt systematic noise files are available for WES or WGS applications. For these applications, it is considered optional to build custom noise files. For high-sensitivity applications, including panels, it is required to build custom noise files. For best accuracy, the normal samples should ideally closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30–70 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on each of approximately 30-70 normal samples.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--umi-enable true 
--umi-source STRING                     #default='qname' 
--umi-library-type STRING               #see 'UMI' 
--vc-detect-systematic-noise=true 
--vc-target-bed-padding 500 
--vc-emit-ref-confidence BP_RESOLUTION 
--vc-enable-vcf-output true 
--vc-enable-umi-solid true 
--vc-enable-germline-tagging=true 
--variant-annotation-data $PATH 

For panels we create GVCF files. Gather the full paths to the small variant caller hard filtered GVCFs (not VCFs) from step 1 and create an input file ${GVCF_LIST} by specifying 1 file per line.
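One way to assemble ${GVCF_LIST} is sketched below. It assumes that the hard-filtered GVCFs from step 1 end in .hard-filtered.gvcf.gz and sit under a common output root; both the file name pattern and the directory layout are assumptions, so adjust them to match your runs. The demo creates empty stand-in files so that the snippet is self-contained.

```shell
# Demo layout standing in for the step 1 output directories.
out_root=$(mktemp -d)
mkdir -p "$out_root/normal1" "$out_root/normal2"
: > "$out_root/normal1/normal1.hard-filtered.gvcf.gz"
: > "$out_root/normal2/normal2.hard-filtered.gvcf.gz"
: > "$out_root/normal2/normal2.hard-filtered.vcf.gz"   # VCF, must not be listed

# Collect full paths to the GVCFs only, one per line.
GVCF_LIST="$out_root/gvcf_list.txt"
find "$out_root" -type f -name '*.hard-filtered.gvcf.gz' | sort > "$GVCF_LIST"
cat "$GVCF_LIST"
```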

Step 2. Generate the final noise file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--build-sys-noise-vcfs-list ${GVCF_LIST} 

SV Systematic Noise

Systematic noise files are recommended for Tumor-Normal workflows and are considered essential for reducing FP calls in Tumor-Only workflows.

Prebuilt

| Prebuilt WES/WGS noise files | Description |
|---|---|
| WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz | For WGS/WES FF/FFPE |
| IDPF_WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz | For HEME |

Custom

It is considered optional to build a custom systematic noise file for WES or WGS applications, but for high sensitivity applications like panels it is strongly recommended. For best accuracy the normal samples should ideally closely match the sequencer, sample type, library prep and coverage of the tumor samples of interest. It is typically recommended to use 30 - 100 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--umi-enable true 
--umi-source STRING                     #default='qname' 
--umi-library-type STRING               #see 'UMI' 
--sv-detect-systematic-noise true 

Step 2. Build the BEDPE file using input VCFs from previous step.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--sv-build-systematic-noise-vcfs-list $VCF_LIST   #one VCF per line 

CNV Panel of Normals (PON)

The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. CNV requires PON files for all targeted analyses (including panels, exomes, germline, tumor-only and tumor-normal workflows). It is recommended to use 30-100 normal samples when building the PON, but fewer may be used. If sample coverage noise is relatively stable, as few as 5 PON samples may yield acceptable results.

If a matched normal is available it is recommended to include it in the PON.

Follow the two steps below to generate CNV PON:

Step 1. Generate target counts of individual normal samples.

Any options used for panel of normals generation (BED file, GC Bias Correction, etc) should be matched when processing the case sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--enable-cnv true 
--cnv-target-bed $PATH 

Step 2. Combined counts generation.

Individual PON counts can be merged into a single file as a <prefix>.combined.counts.txt.gz file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--enable-cnv true 
--cnv-generate-combined-counts true 
--cnv-normals-list $CNV_NORMALS_LIST 

$CNV_NORMALS_LIST is a text file that lists the path to each target counts file generated in step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz), one path per line. The output is a PON file with the suffix .combined.counts.txt.gz. Use the PON file with the --cnv-combined-counts option when running DRAGEN CNV on case samples.
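The normals list can be assembled with a short script. The sketch below uses a hypothetical build_normals_list helper and picks one of the two suffixes, so that raw and GC-corrected counts are not mixed in the same list. It assumes the target counts files sit under a common directory; the demo uses empty stand-in files.

```shell
# build_normals_list DIR SUFFIX OUT: write matching paths to OUT, one per line.
build_normals_list() {
  find "$1" -type f -name "*$2" | sort > "$3"
}

# Demo directory standing in for the step 1 outputs.
demo=$(mktemp -d)
: > "$demo/n1.target.counts.gc-corrected.gz"
: > "$demo/n2.target.counts.gc-corrected.gz"
: > "$demo/n1.target.counts.gz"

CNV_NORMALS_LIST="$demo/cnv_normals.txt"
build_normals_list "$demo" ".target.counts.gc-corrected.gz" "$CNV_NORMALS_LIST"
cat "$CNV_NORMALS_LIST"
```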

Common Product Features

Run Planning

Sample Sheets

ICA Cloud Applications

ICI Variant Interpretation

Illumina® DRAGEN™ Secondary Analysis

Illumina DRAGEN (Dynamic Read Analysis for GENomics) secondary analysis was developed to address important challenges associated with analyzing NGS (Next Generation Sequencing) data for a range of applications, including genome, exome, transcriptome, and methylome studies. DRAGEN secondary analysis processes NGS data and enables tertiary analysis to drive insights. The available tools make up a highly accurate, comprehensive, and efficient solution that enables labs of all sizes and disciplines to do more with their genomic data.

Product highlights

Accurate results:

  • Pangenome reference genome and machine learning drive unprecedented accuracy

  • 99.89% accuracy score with the Precision FDA Truth Challenge V2 benchmark data (2,3)

Comprehensive platform:

  • Analyze NGS data from whole genomes, exomes, methylomes, and transcriptomes

  • Available on platform of choice and scalable based on needs

Efficient analysis:

  • Process a 34x genome in ~ 30 minutes, with all supported callers with DRAGEN server v4 (1)

  • Reduce FASTQ file sizes up to 5x with DRAGEN ORA Compression

References:

  1. Illumina data on file, 2022.

Run Planning

DRAGEN Server App

Analysis on DRAGEN Server

Prerequisites

  • DRAGEN Phase 3 or 4 server

  • DRAGEN License

  • Network storage server

DRAGEN server

DRAGEN phase 4 server is recommended especially for datasets from NovaSeq X instruments. The server has 12 TB of intermediate data storage space for full processing of a NovaSeq X 25B flow cell.

The DRAGEN phase 3 server has 6 TB of intermediate data storage space, which can accommodate flow cells from the NovaSeq 6000 or 6000 Dx instruments.

DRAGEN license

The Heme pipeline uses the standard DRAGEN license without requiring any special licenses.

NFS and CIFS file servers

The Heme pipeline is designed to stream data from a network file server onto the DRAGEN server, complete the analysis using the /staging area of the high performance SSD and then stream the analysis output back to the network file server.

The network file server may be mounted to the DRAGEN server using the NFS or CIFS protocol (SMB 1.0). SMB 2.0 or higher is recommended with Active Directory support if the SMB protocol is used.

Starting from BCL Files

If starting from BCL (*.bcl) files, the Heme pipeline requires the run folder to contain certain files and folders.

The run folder contains data from the sequencing run. Make sure that the folder contains the following files:

Starting from FASTQ Files

The following inputs are required for running the Heme pipeline using FASTQ (*.fastq) files.

  • Full path to an existing FASTQ folder.

  • A sample sheet in the FASTQ folder, or a path to the sample sheet set with the --sampleSheet override command line option.

Make sure there is sufficient disk space for the analysis to complete. Refer to the --help command line argument details for disk space requirements.

Use BCL Convert to produce FASTQ files for the Heme pipeline. Using bcl2fastq does not produce the same results and is discouraged.

FASTQ File Organization

Store FASTQ files in individual subfolders that correspond to a specific Sample_ID, and keep file pairs together in the same folder. Alternatively, store all FASTQ files together in a single flat folder.

The Heme pipeline requires separate FASTQ files per sample. Do not merge FASTQ files.

The instrument generates two FASTQ files (Read 1 and Read 2) per sample for each flow cell lane, so a sample sequenced across four lanes has eight FASTQ files. File names follow this pattern:

Sample1_S1_L001_R1_001.fastq.gz

  • Sample1 represents the Sample ID.

  • The S in S1 means sample, and the 1 in S1 is based on the order of samples in the sample sheet, so S1 is the first sample.

  • L001 represents the flow cell lane number.

  • The R in R1 means Read, so R1 refers to Read 1.
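The naming convention above can be checked programmatically. The following is a minimal sketch, not part of the pipeline: the function name `parse_fastq_name` is hypothetical, and the trailing `_001` segment and `.fastq.gz` extension are assumed from the single example shown in this guide.

```python
import re
from typing import NamedTuple

# Pattern for names such as Sample1_S1_L001_R1_001.fastq.gz.
# Assumption: the trailing "_001" and ".fastq.gz" are fixed, as in the example.
FASTQ_NAME = re.compile(
    r"(?P<sample_id>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d{3})"
    r"_R(?P<read>[12])_001\.fastq\.gz"
)


class FastqName(NamedTuple):
    sample_id: str       # Sample ID from the sample sheet
    sample_number: int   # order of the sample in the sample sheet (S1 = first)
    lane: int            # flow cell lane number
    read: int            # 1 for Read 1, 2 for Read 2


def parse_fastq_name(filename: str) -> FastqName:
    """Split a FASTQ file name into its components, or raise ValueError."""
    match = FASTQ_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected FASTQ file name: {filename}")
    return FastqName(
        match["sample_id"],
        int(match["sample_number"]),
        int(match["lane"]),
        int(match["read"]),
    )
```

Because the sample ID group is greedy, sample IDs that themselves contain underscores or dashes are still recovered intact.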

Custom Config Support

BSSH Run Planner Setup

On the BSSH Run Planner, custom parameters and custom resource files can also be specified during Run Planning.

Custom resource files must be uploaded to BaseSpace under the same project to be selectable during run planning. Supported customizable options are described in the Custom Configuration Support section of each application.

BSSH Run Planner UI Example

Sample Sheet Creation in BaseSpace

How to Create Sample Sheets in BaseSpace Run Planning tool

The sections below represent each step in the BaseSpace Run Planning tool.

Note that the NovaSeq X Series has a different run setup configuration screen than other instrument platforms. The software supports multiple analyses per run. To complete run setup on the NovaSeq X Series, enter the appropriate Read 1, Read 2, Index 1, and Index 2 values described in the instructions below.

Step 1: Run Settings

Step 2: Configuration

Note: On NovaSeq X Series, this page is called "Configuration 1". The right hand corner of the UI displays the Read 1, Read 2, Index 1 and Index 2 entered on the previous run settings screen.

Step 3: Sample Settings

Users can manually enter sample information, or download a template file to bulk upload sample information. Users can import the completed template or a compatible sample sheet.

Step 4: Run Review

Once all details are captured and pass validation, the user can review the details on the Run Review screen. From here they can choose to edit details in previous screens or export the sample sheet. Once completed, press the Cancel button to finish run planning.

Note: After you leave this screen, the run and sample sheet are no longer accessible.

For NovaSeq X Plus users, the run can be saved as a draft or as a planned run (via the “Save as Draft” and “Save as Planned” buttons respectively). Either selection saves the run to the Planned Runs screen on BaseSpace. There is no option to export the sample sheet on this screen.

Planned Runs Screen (NovaSeq X Series only)

The Planned Runs screen lists all planned or drafted runs. Users can set drafted runs to planned, export the sample sheet, and edit or delete a run on this screen.

Once the run is saved as Planned, it will appear on the NovaSeq X Series instrument where it can be selected for sequencing.

Guided Examples based on TSO 500

Please review these guided examples of TSO 500 analysis workflows that include a step of setting up a run in BaseSpace Run Planning tool:

Data Management

Copying data to local /staging drive

  • Copy the run or FASTQ folder to the DRAGEN server into the staging folder with the following recommended organization: /staging/runs/{RunID}. You can copy the run folder onto the DRAGEN server using Linux commands such as rsync. The sample sheet within the run folder is used unless otherwise specified through the command line.

  • Run folder must be intact.

  • If the analysis output folder path is different from the default, provide the analysis output folder path.

Analysis output directory

Before running the analysis, confirm that the output directory for the software to write to is empty and does not include results of previous analyses.

Storage Requirements

The DRAGEN server provides an NVMe SSD in the /staging directory to use as the software output directory. Network-attached storage is required for long-term storage.

When running the Heme pipeline, use the default settings or set the --analysisFolder command line option to a directory in /staging so that the DRAGEN server reads and writes data on the NVMe SSD.

Before beginning analysis, develop a strategy to copy data from the DRAGEN server to a network‑attached storage. Delete output data on the DRAGEN server as soon as possible.

The following are the run folder output size estimates and the minimum free space requirements for fastq.gz or fastq.ora output format.

When launching the analysis, the software checks that the minimum disk space required is available. If the minimum disk space is not available, the software shows an error message and prevents analysis from starting. If disk space is exhausted during a run, the run shows an error and stops analyzing.
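A similar check can be run by hand before launching. The sketch below is illustrative only and is not the check the software performs: the function names are hypothetical, the minimums are copied from the storage table in this guide, and "Gb" is assumed to mean decimal gigabytes.

```python
import shutil

# Minimum free space (Gb) per sequencing system and FASTQ output format,
# copied from the storage table in this guide.
MIN_FREE_GB = {
    "NovaSeq 6000/6000Dx S4": {"gz": 4000, "ora": 2500},
    "NovaSeq X 10B": {"gz": 4000, "ora": 2500},
    "NovaSeq X 25B": {"gz": 8500, "ora": 5300},
    "Other": {"gz": 4000, "ora": 2500},
}


def free_space_gb(path: str) -> float:
    """Free space on the file system holding `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(free_gb: float, system: str, fastq_format: str = "gz") -> bool:
    """Compare free space against the table minimum for the given system."""
    minimums = MIN_FREE_GB.get(system, MIN_FREE_GB["Other"])
    return free_gb >= minimums[fastq_format]
```

For example, `has_enough_space(free_space_gb("/staging"), "NovaSeq X 25B", "ora")` reports whether the staging drive meets the 5300 Gb minimum listed for that flow cell.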

Moving or modifying files during an analysis may cause the analysis to fail or provide incorrect results.

Data streaming from Network Filesystem

Please refer to the Server Site Prep & Installation Guide when installing a new system.

The software can be installed on an on-premises server by executing the .run installer for the desired version. Installers are made available for all releases at the DRAGEN Software Support Site page.

DRAGEN requires one or more licenses for most functionality. Refer to the Licensing Reference Section for guidance on how to install or review your current licenses.

Before a reference genome can be used with DRAGEN, it must be converted from FASTA format into a custom binary format for use with the DRAGEN hardware. For more information, see Prepare a Reference Genome.

For detailed information on the command line options, see DRAGEN Host Software.

For recommended command lines in typical use cases, see DRAGEN Recipes.

Command line options can also be set in a configuration file. For more information on configuration files, see Configuration Files. If an option is set in the configuration file and is also specified on the command line, the command line option overrides the configuration file.

Before you can use the DRAGEN system for aligning reads, you must load a reference genome and its associated hash tables onto the PCIe card. For information on preprocessing a reference genome's FASTA files into the native DRAGEN binary reference and hash table formats, see Prepare a Reference Genome. You must also specify the directory containing the preprocessed binary reference and hash tables with the -r (or --ref-dir) option. This argument is always required.

With storage account access keys, DRAGEN can write to an Azure bucket both on and off Azure instances. For this use case, find the Storage Account Access Key and set the environment variables AZ_ACCOUNT_NAME=<azure-storage-account-name> and AZ_ACCOUNT_KEY=<account-key>.

DRAGEN uses quota-based licensing for most features. More information can be found in the Licensing Reference Section.

See:

Set the consensus sequence type to output. DRAGEN UMI allows collapsing duplex sequences from the two strands of the original molecules. For more information, see .

For more information see: .

Hard filter variants that overlap with this region. ALU regions comprise approximately 11% of the genome, and are often exceptionally noisy regions in FFPE samples. Optionally filter out ALU regions using the DRAGEN excluded regions filter. ALU BED files can be downloaded as part of the Bed File Collection.

For more detail on the small variant caller in somatic mode, refer to Somatic Mode.

Specify germline CNVs from the matched normal sample. See CNV Calling.

For more information, see .

For instructions on how to download the Nirvana annotation database, refer to Nirvana.

See the user guide: .

The microsatellite sites file can be downloaded here: Product Files.

For more information, see Structural Variant Calling.

Prebuilt systematic noise BED files (WES and WGS) can be downloaded here: Product Files.

The SNV systematic noise files can also be built in the cloud using the DRAGEN Baseline Builder App on BaseSpace or the DRAGEN Systematic Noise File Builder Pipeline on ICA.

Prebuilt SV systematic noise files can be downloaded here: Product Files.

Systematic noise BEDPE files can also be built in the cloud using the DRAGEN Baseline Builder App on BaseSpace or the DRAGEN Systematic Noise File Builder Pipeline on ICA.

For more information, see .

CNV PONs can also be built in the cloud using the DRAGEN Baseline Builder App on BaseSpace or the DRAGEN Systematic Noise File Builder Pipeline on ICA.

Application | Supported in BSSH Run Planner | Note
Heme WGS TO | Yes | Mixed flow cell, auto-launch
Solid WGS TN | No | Only Header and TN_Data sections required in SampleSheet.csv

All the clinical research workflow pipelines support only the v2 sample sheet and require index2 to be in forward orientation, with bcl-convert SoftwareVersion >= 4.4. The pipelines are still compatible with legacy sample sheets where the BCLConvert_Settings section has SoftwareVersion < 4.4. Sample sheet v1 is no longer supported.

The clinical research workflow pipelines in the ICA cloud support post-processing scripts to be executed after the completion of the pipeline analysis.

The clinical research workflow pipelines support automatic data ingestion or manual upload into ICI for variant interpretation after the analysis is completed in ICA, or using a DRAGEN on-premises server locally.

Illumina DRAGEN Secondary Analysis is the first single platform to achieve 99.89% accuracy based on PrecisionFDA v2 Truth Challenge Benchmark Data. Details here: DRAGEN sets new standard for data accuracy in PrecisionFDA benchmark data. Accessed March 22, 2023.

PrecisionFDA Truth Challenge V2: Calling Variants from Short and Long Reads in Difficult-to-Map Regions. precision.fda.gov/challenges/10. Accessed November 3, 2020.

We recommend using the BSSH Run Planner tool to minimize errors in creating the sample sheet. For instruments such as NovaSeq X, sample sheets created in the BSSH Run Planner tool are automatically downloaded into the run folder on the instrument.

Folder/File
Description

The FASTQ folder structure conforms to the folder structure in FASTQ File Organization.

See for additional details.

The BaseSpace Sequence Hub Run Planning tool generates a valid sample sheet in v2 format for use on a supported sequencer, for both ICA and standalone DRAGEN server analysis options. Filling out the form in the user interface produces an exportable sample sheet with the required fields filled in. Refer to ICA Auto-launch Sample Sheet Requirements for descriptions of fields that appear in ICA sample sheets.

Parameter Name
Required
Description
Parameter Name
Required
Description
Parameter Name
Required
Description

For more information on run planning, refer to the BaseSpace Sequence Hub support site page.

Sequencing System
Run Folder Output (Gb)
Minimum Disk Space .gz (Gb)
Minimum Disk Space .ora (Gb)

Analysis of data stored on a network file system may be slow when multiple DRAGEN servers read from and write to it simultaneously. However, streaming large datasets from NFS is advisable when transferring data to local /staging takes a significant amount of time, especially for NovaSeq X 25B flow cells. Discuss the Network Considerations of the DRAGEN server with your system administrator.


Folder/File | Description
Config folder | Configuration files
Data folder | *.bcl files
Images folder | [Optional] Raw sequencing image files.
Interop folder | Interop metric files.
Logs folder | [Optional] Sequencing system log files.
RTALogs folder | Real-Time Analysis (RTA) log files.
RunInfo.xml file | Run information.
RunParameters.xml file | Run parameters.
SampleSheet.csv file | Sample information. If you want to use a sample sheet that is not in the run folder or a sample sheet named something other than SampleSheet.csv, provide the full path.

Run Name

Required

Run Name can contain 255 alphanumeric characters, dashes, underscores, periods, and spaces; and must start with an alphanumeric, a dash or an underscore.

Run Description

Optional

Run Description can contain 255 characters except square brackets, asterisks, and commas.

Instrument Platform

Required

Choose from supported instruments:

  • NovaSeq 6000/6000Dx

  • NovaSeq X Series

Secondary Analysis

Required

  • BaseSpace/Illumina Connected Analytics (to generate sample sheet for cloud analysis)

  • Local

Read 1

Required on Instrument Platform NovaSeq X Series

Fill with value 151 or custom values

Index 1

Required on Instrument Platform NovaSeq X Series

Fill the value depending on the Library Prep Kit used: 10

Index 2

Required on Instrument Platform NovaSeq X Series

Fill the value depending on the Library Prep Kit used: 10

Read 2

Required on Instrument Platform NovaSeq X Series

Fill with value 151 or custom values

Sample Container ID

Optional

Unique Identifier for the container that holds the sample

Application*

Required

the pipeline name

Description

Optional

Optional text field

Library Prep Kit

Required

- Illumina DNA Prep Kit (IDP)

Required

- Illumina DNA PCR Free Prep Kit (IDPFP)

Index Adapter Kit

Optional

- IDT for Illumina DNA/RNA UD Indexes Set A B C D, Tagmentation (both IDP and IDPFP)

Optional

- Illumina DNA/RNA UD Indexes Set A B C D, Tagmentation (IDP)

Read Lengths: Read 1 and Read 2

Required Not applicable on NovaSeq X Series

Auto filled with the standard values, but can be optionally overwritten.

Override Cycles

Required on NovaSeq X Series

Entered based on Run Settings read lengths & index 1 / index 2

Lane Usage

Not applicable on NovaSeq X Series or NextSeq 1000 / 2000

Checkbox allows users to apply the same lane across samples.

Lane

Required if Lane Usage is unchecked Not applicable on NextSeq 1000 / 2000

Specify lanes for each sample. The unmarked checkbox at the top of the dropdown selects all lanes.

Case ID

Optional

The identifier used to pair DNA and RNA samples in a run. The field is mandatory whether a sample is part of a pair, or not.

To note: The Sample ID field in the generated samplesheet will be auto-filled based on the Pair ID values captured. “_dna” and “_rna” (for DNA and RNA samples respectively) will be appended to the Pair ID value to create the Sample ID.

Index ID

Required

Index set ID options are based on selected Index Adapter Kit

Project

Optional

Optional field to describe the associated project

Starts from Fastq

Required

True or False

If auto-launching the pipeline from BCL files, set the value to False. If auto-launching the pipeline from FASTQ after auto-launching BCL Convert, set the value to True.

DNA Barcode Mismatches Index 1**

DNA Barcode Mismatches Index 2**

RNA Barcode Mismatches Index 1**

RNA Barcode Mismatches Index 2**

Required on NovaSeq X

Default value is set to 1.

These fields are required by NovaSeq X and represent BCL Convert settings for index diversity checks when demultiplexing. These values are not used in the pipeline analysis.

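Sequencing System | Run Folder Output (Gb) | Minimum Disk Space .gz (Gb) | Minimum Disk Space .ora (Gb)
NovaSeq 6000/6000Dx (RUO) S4 Flow Cell | ~2000 | 4000 | 2500
NovaSeq X 10B | ~2000 | 4000 | 2500
NovaSeq X 25B | ~4250 | 8500 | 5300
Other Instruments | ~2000 | 4000 | 2500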


Illumina Connected Insights Local

Variant Interpretation with Illumina Connected Insights

Automatic Ingestion of Heme Analysis on ICA to ICI

  • Access to Illumina Connected Analytics

  • Access to Illumina Connected Insights

Quick Start

Quick Start Guide

Table 1. Release Information

Execution Environment | Software Version | Client Program | Location | Note
Local DRAGEN Server | 4.4.4.62 | run_Heme_WGS_TO_{version}.sh | /usr/local/bin |
ICA | a11697ba-1144-4dc6-9e22-f21dff29f747 | icav2 | ICA Pipelines |
ICA | urn:ilmn:ica:pipeline:a11697ba-1144-4dc6-9e22-f21dff29f747#Heme_WGS_TO_v4_4_4_62 | supported browser | ICA UI |

  • {version} is used to represent the software version number in Table 1 above. Similarly, <pipeline_run_script> is used to indicate the client program name in this document.

Download, Install and Execute on a Local Server

Run analysis on a local DRAGEN Server

The command line program may be used to launch an analysis by using the <pipeline_run_script> with the appropriate options.

start from bcl

<pipeline_run_script> --help # list all supported parameters
<pipeline_run_script> --inputType bcl \
--inputFolder /staging/input-folder \
--analysisFolder /staging/output-folder

start from one or more input folders when using FASTQ, BAM or CRAM files

Multiple folders may be specified as input folders in comma separated values when using FASTQ, BAM or CRAM files as input.

<pipeline_run_script> --inputType <fastq|bam|cram> \
--inputFolder /staging/input-folder-1,/staging/input-folder-2 \
--analysisFolder /staging/output-folder
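When launches are scripted, the argument list can be assembled programmatically. The sketch below is a hypothetical helper that uses only the options named in this guide (--inputType, --inputFolder, --analysisFolder, --sampleSheet). Restricting BCL input to a single folder is an assumption, since the guide only describes multiple folders for FASTQ, BAM, or CRAM input.

```python
from typing import List, Optional, Sequence

VALID_INPUT_TYPES = {"bcl", "fastq", "bam", "cram"}


def build_command(
    script: str,
    input_type: str,
    input_folders: Sequence[str],
    analysis_folder: str,
    sample_sheet: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for the pipeline run script."""
    if input_type not in VALID_INPUT_TYPES:
        raise ValueError(f"unsupported input type: {input_type}")
    if not input_folders:
        raise ValueError("at least one input folder is required")
    # Assumption: only FASTQ, BAM, and CRAM input accept several folders.
    if input_type == "bcl" and len(input_folders) != 1:
        raise ValueError("BCL input expects exactly one run folder")
    command = [
        script,
        "--inputType", input_type,
        "--inputFolder", ",".join(input_folders),  # comma separated values
        "--analysisFolder", analysis_folder,
    ]
    if sample_sheet is not None:
        command += ["--sampleSheet", sample_sheet]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`, which avoids shell quoting issues with folder names.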

Pressing Ctrl+C during a DRAGEN step stops the currently running analysis and might cause an FPGA error. To recover from an FPGA error, shut down and restart the server.

Run analysis on ICA using the icav2 client

The following example starts an analysis using the ICA client, providing the necessary command parameters and specifying a particular storage size for analysis in ICA.

icav2 projectpipelines start nextflow ${PIPELINE_ID} \
--project-id ${ANY_PROJECT_ID} \
--storage-size Large \
-o json \
--input ${ANY_SAMPLE_SHEET} \
--input ${ANY_INPUT_DIR} \
--parameters inputType:'bcl' \
--parameters referenceGenome:'hg38' \
--parameters oraCompressionEnabled:'true' \
--parameters sampleIds:'1267-Prostate-Del-R1,741-Lung-SNV-R1' \
--user-reference ${ANY_USER_REFERENCE}

Run analysis on ICA using UI

Turn Around Time Comparison

ICA

Coming soon.

Local Server

Local Server Only

Coming soon.

Data Streaming from NFS

Coming soon.

ICA Cloud App

Analysis in the ICA Cloud environment

Prerequisites

  • Basic ICA Subscription

  • Basic ICI Subscription (if desired)

DRAGEN Heme WGS Tumor Only Pipeline

Overview

The DRAGEN Heme WGS Tumor Only Pipeline, hereafter referred to as the Heme pipeline, is a comprehensive and unbiased whole genome sequencing solution to replace conventional cytogenetic and panel sequencing approaches for detecting all types of mutation using a limited amount of DNA. Using Heme samples, it can be applied to detect clinically actionable mutations for cancer spanning a wide range of genomic events, e.g., structural variants (SV), copy number alterations (CNA), small variants (SNV/insertion/deletion/delins), internal tandem duplications (ITD), and DUX4 variants.

The Heme pipeline includes a DNA-only workflow designed to analyze whole genome sequencing data generated on supported instruments. It may be run as a local off-instrument solution installable on a DRAGEN server or accessible through the Illumina Connected Analytics (ICA) cloud environment. The Heme pipeline is for Research Use Only (RUO).

Features

  • Superb performance based on the DRAGEN BioIT platform Release 4.4.4

  • Supports starting the analysis from BCL, FASTQ, BAM or CRAM as inputs.

  • Flexible custom configurable options on top of well established DRAGEN recipes for Heme WGS analysis.

  • Available on local DRAGEN servers and Illumina Connected Analytics (ICA)

  • Seamless integration with Illumina Connected Insights (ICI) for tertiary interpretation

Supported Library Prep Kits (LPKs)

  • Illumina DNA PCR Free Prep Kit

  • Illumina DNA Prep Kit

  • Custom LPKs

Supported Sequencing Instruments

  • NovaSeq 6000 or 6000Dx in RUO mode

  • NovaSeq X or NovaSeq X Plus

Note: Data from unsupported instruments can still be analyzed, but a warning is generated.

Supported Flow Cells

  • NovaSeq 6000 or 6000Dx S4

  • NovaSeq X or NovaSeq X Plus 10B, 25B

Sample Sheet Requirements

The pipeline has fields that are required in addition to general sample sheet requirements. Follow the steps below to create a valid sample sheet.

Standard Sample Sheet Requirements

The following sample sheet requirements describe required and optional fields for the pipeline. Depending on the deployment (standalone DRAGEN server, ICA with auto-launch, ICA with manual launch), certain sections and required values can deviate from the standard requirements. These deviations are noted in the information below.

The analysis fails if the sample sheet requirements are not met.

Use the following steps to create a valid sample sheet.

  1. Download the sample sheet v2 template that matches the instrument & assay run.

  2. In the Sequencing Settings section, enter the following required parameters:

[Sequencing_Settings] Section

Sample Parameter | Required | Details
LibraryPrepKits | Required | Accepted values are: IlluminaDNAPrep or IlluminaDNAPCRFree

  3. In the BCL Convert Settings section, enter the following required parameters:

[BCLConvert_Settings] Section

Sample Parameter | Required | Details
SoftwareVersion | Required | The DRAGEN component software version. The pipeline requires 4.4.4 or higher. To ensure you are using the latest compatible version, refer to the software release notes.
AdapterRead1 | Required | If using 10 bp indexes with UDP: CTGTCTCTTATACACATCTCCGAGCCCACGAGAC. Analysis fails if the incorrect adapter sequences are used.
AdapterRead2 | Required | If using 10 bp indexes with UDP: CTGTCTCTTATACACATCTGACGCTGCCGACGA. Analysis fails if the incorrect adapter sequences are used.
AdapterBehavior | Optional | Enter trim. This indicates that the BCL Convert software trims the specified adapter sequences from each read.
MinimumTrimmedReadLength | Optional | Enter 35. Reads with a length trimmed below this point are masked.
MaskShortReads | Optional | Enter 35. Reads with a length trimmed below this point are masked.

  4. In the BCL Convert Data section, enter the following parameters for each sample.

[BCLConvert_Data] Section

Sample Parameter | Required | Details
Sample_ID | Required | Must match a Sample_ID listed in the [Heme_Data] section.
Index | Required | Index 1 sequence valid for Index_ID assigned to matching Sample_ID in the [Heme_Data] section.
Index2 | Required | Index 2 sequence valid for Index_ID assigned to matching Sample_ID in the [Heme_Data] section.
Lane | Only for NovaSeq 6000 XP, NovaSeq 6000Dx, or NovaSeq X workflows | Indicates which lane corresponds to a given sample. Enter a single numeric value per row. Cannot be empty, i.e., the analysis fails if the Lane column is present without a value in each row.

  5. In the [Heme_Data] section, enter the following parameters:

The [Heme_Data] section header changes depending on the deployment:

  • Standalone DRAGEN Server and ICA with Manual Launch: Heme_Data

  • ICA with Auto-launch: Cloud_Heme_Data

[Heme_Data] Section

Sample Parameter | Required | Details
Sample_ID | Required | The unique ID to identify a sample. The sample ID is included in the output file names. Sample IDs are not case sensitive. Sample IDs must have the following characteristics: unique for the run; 1–70 characters; no spaces; alphanumeric characters with underscores and dashes (if you use an underscore or dash, enter an alphanumeric character before and after it, e.g., Sample1-T5B1_022515); cannot be called all, default, none, unknown, undetermined, stats, or reports; must match a Sample_ID listed in the [BCLConvert_Data] section. Each sample must have a unique combination of Lane (if applicable), sample ID, and index ID, or the analysis fails.
Sample_Type | Optional | Enter DNA
Case_ID | Optional | A unique ID that links the same biological samples from the same individual. It is used for variant interpretation in downstream software such as the Illumina Connected Insights software.
Sample_Description | Optional | Sample description must meet the following requirements: 1–50 characters; alphanumeric characters with underscores, dashes, and spaces (if you enter an underscore, dash, or space, enter an alphanumeric character before and after it, e.g., heme-WGS_213).
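The Sample_ID rules above can be checked before a run is launched. The following is a minimal sketch, not the pipeline's own validation: the function names are hypothetical, and treating IDs that differ only in case as duplicates is an interpretation of "Sample IDs are not case sensitive".

```python
import re
from typing import Iterable, List

RESERVED_NAMES = {"all", "default", "none", "unknown", "undetermined", "stats", "reports"}

# Alphanumeric runs joined by single underscores or dashes, so every
# underscore or dash has an alphanumeric character before and after it.
SAMPLE_ID_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[_-][A-Za-z0-9]+)*")


def sample_id_problems(sample_id: str) -> List[str]:
    """Return the list of rules that a single Sample_ID violates."""
    problems = []
    if not 1 <= len(sample_id) <= 70:
        problems.append("length must be 1-70 characters")
    if SAMPLE_ID_PATTERN.fullmatch(sample_id) is None:
        problems.append("only alphanumerics joined by single '_' or '-' are allowed")
    if sample_id.lower() in RESERVED_NAMES:
        problems.append("reserved name")
    return problems


def duplicate_sample_ids(sample_ids: Iterable[str]) -> List[str]:
    """Return IDs that repeat an earlier ID, ignoring case (assumption)."""
    seen = set()
    duplicates = []
    for sample_id in sample_ids:
        key = sample_id.lower()
        if key in seen:
            duplicates.append(sample_id)
        seen.add(key)
    return duplicates
```

An empty list from `sample_id_problems` means the ID passes the character, length, and reserved-name rules. Matching against the [BCLConvert_Data] section still has to be checked separately.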

To ensure a successful analysis, follow these guidelines:

  1. Avoid any blank lines at the end of the sample sheet; these can cause the analysis to fail.

  2. When running local analysis using the command line save the sample sheet in the sequencing run folder with the default name SampleSheet.csv, or choose a different name and specify the path in the command-line options.
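A quick pre-flight check of the sample sheet can catch these problems before launch. This sketch assumes the usual v2 layout, with `[Section]` header lines followed by comma-separated rows and a column header as the first row of each data section. The function names and the list of required sections are illustrative, not the pipeline's own validation.

```python
import csv
import io

REQUIRED_SECTIONS = ("Sequencing_Settings", "BCLConvert_Settings", "BCLConvert_Data")


def split_sections(text):
    """Group sample sheet rows under their [Section] header."""
    sections = {}
    current = None
    for row in csv.reader(io.StringIO(text)):
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip blank or comma-only lines
        first = cells[0]
        if first.startswith("[") and first.endswith("]"):
            current = first[1:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(cells)
    return sections


def column(rows, name):
    """Return the values of one named column from a data section."""
    if not rows or name not in rows[0]:
        return []
    index = rows[0].index(name)
    return [row[index] for row in rows[1:] if index < len(row)]


def check_sample_sheet(text, data_section="Heme_Data"):
    """Return a list of problems found in the sample sheet text."""
    problems = []
    lines = text.splitlines()
    if lines and not lines[-1].strip(" ,\t"):
        problems.append("blank line at end of sample sheet")
    sections = split_sections(text)
    for name in REQUIRED_SECTIONS + (data_section,):
        if name not in sections:
            problems.append(f"missing [{name}] section")
    bcl_ids = set(column(sections.get("BCLConvert_Data", []), "Sample_ID"))
    heme_ids = set(column(sections.get(data_section, []), "Sample_ID"))
    for sample_id in sorted(bcl_ids ^ heme_ids):
        problems.append(f"Sample_ID {sample_id} is not present in both data sections")
    return problems
```

For an ICA auto-launch sample sheet, pass `data_section="Cloud_Heme_Data"`.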

ICA with Auto-launch: Sample Sheet Requirements

To auto-launch analysis from the sequencer run folder, ensure the StartsFromFastq and SampleSheetRequested fields are set to FALSE. To auto-launch analysis from FASTQs after BCL Convert auto-launch, the StartsFromFastq and SampleSheetRequested fields must be set to TRUE.

[Cloud_Heme_Data] Section

[Cloud_Heme_Settings] Section

Parameters | Required | Details
SoftwareVersion | Not Required | The Heme pipeline software version
StartsFromFastq | Required | Set the value to TRUE or FALSE. To auto-launch from BCL files, set to FALSE. To auto-launch from FASTQ files after auto-launch of BCL Convert, set to TRUE.
SampleSheetRequested | Required | Set the value to TRUE or FALSE. To auto-launch from BCL files, set to FALSE. To auto-launch from FASTQ files after auto-launch of BCL Convert, set to TRUE.

[Cloud_Data] Section

Parameters | Required | Details
Sample_ID | Not Required | The same sample ID used in the Cloud_Heme_Data section.
ProjectName | Not Required | The BaseSpace project name.
LibraryName | Not Required | Combination of sample ID and index values in the following format: sampleID_Index_Index2
LibraryPrepKitName | Required | The Library Prep Kit used.
IndexAdapterKitName | Not Required | The Index Adapter Kit used.

[Cloud_Settings] Section

Parameter | Required | Details
GeneratedVersion | Not Required | The cloud GSS version used to create the sample sheet. Optional if manually updating a sample sheet.
CloudWorkflow | Not Required | Ica_workflow_1
Cloud_Heme_Pipeline | Required |
BCLConvert_Pipeline | Required | The value is a URN in the following format: urn:ilmn:ica:pipeline:<pipeline-ID>#<pipeline-name>
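The URN format above can be split into its two parts when a sample sheet is checked by script. This is a minimal sketch with a hypothetical function name. It assumes the pipeline ID and name are separated by the first `#` and that no whitespace belongs in the value.

```python
from typing import Tuple

URN_PREFIX = "urn:ilmn:ica:pipeline:"


def parse_pipeline_urn(urn: str) -> Tuple[str, str]:
    """Split an ICA pipeline URN into (pipeline ID, pipeline name)."""
    urn = urn.strip()
    if not urn.startswith(URN_PREFIX):
        raise ValueError(f"not an ICA pipeline URN: {urn}")
    pipeline_id, separator, pipeline_name = urn[len(URN_PREFIX):].partition("#")
    pipeline_id = pipeline_id.strip()
    if not separator or not pipeline_id or not pipeline_name:
        raise ValueError(f"expected <pipeline-ID>#<pipeline-name> in: {urn}")
    return pipeline_id, pipeline_name
```

Applied to the URN listed in Table 1, the function returns the pipeline ID `a11697ba-1144-4dc6-9e22-f21dff29f747` and the name `Heme_WGS_TO_v4_4_4_62`.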

Small Variant Calling

The DRAGEN Small Variant Caller is a high-speed haplotype caller implemented with a hybrid of hardware and software. The caller performs localized de novo assembly in regions of interest to generate candidate haplotypes, and then performs read likelihood calculations using a hidden Markov model (HMM).

Variant calling is disabled by default. To enable variant calling, set the --enable-variant-caller option to true. The VCF header is annotated with ##source=DRAGEN_SNV to indicate the file is generated by the DRAGEN SNV pipeline.

The Variant Caller Algorithm

The DRAGEN Small Variant Caller performs the following steps:

Active Region Identification---Identifies areas where multiple reads disagree with the reference, and selects windows around them (active regions) for processing.

Localized Haplotype Assembly---Assembles all overlapping reads in each active region into a de Bruijn graph (DBG). A DBG is a directed graph based on overlapping K-mers (length K subsequences) in each read or multiple reads. When all reads are identical, the DBG is linear. Where there are differences, the graph forms bubbles of multiple paths that diverge and rejoin. If the local sequence is too repetitive and K is too small, cycles can form, which invalidate the graph. DRAGEN uses K=10 and 25 as the default values. If those values produce an invalid graph, then additional values of K=35, 45, 55, 65 are tried until a cycle-free graph is obtained. From this cycle-free DBG, DRAGEN extracts every possible path to produce a complete list of candidate haplotypes, i.e., hypotheses for what the true DNA sequence might be on at least one strand. In addition to graph assembly, haplotypes are also generated via columnwise detection, with candidate variant events identified directly from BAM alignments. Columnwise detection is enabled by default in all small variant calling pipelines and is supplementary to the DBG, but is especially useful in highly repetitive regions where DBG assembly of reads is more likely to fail.

Haplotype Alignment---Uses the Smith-Waterman algorithm to align each extracted haplotype to the reference genome. The alignments determine what variations from the reference are present.

Read Likelihood Calculation---Tests each read against each haplotype to estimate a probability of observing the read assuming the haplotype was the true original DNA sampled. This calculation is performed by evaluating a pair hidden Markov model (HMM), which accounts for the various possible ways the haplotype might have been modified by PCR or sequencing errors into the read observed. The HMM evaluation uses a dynamic programming method to calculate the total probability of any series of Markov state transitions arriving at the observed read.

Genotyping---Forms the possible diploid combinations of variant events from the candidate haplotypes and, for each combination, calculates the conditional probability of observing the entire read pileup. Calculations use the constituent probabilities of observing each read, given each haplotype from the pair HMM evaluation. These calculations feed into the Bayesian formula to calculate the likelihood that each genotype is the genotype of the sample being analyzed, given the entire read pileup observed. Genotypes with maximum likelihood are reported.
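To make the assembly step more concrete, the toy sketch below builds a K-mer graph from reads and retries with a larger K when the graph contains a cycle. It is an illustration of the general technique only, not DRAGEN's hardware-accelerated implementation. The function names are hypothetical, and trying the listed K values one at a time in increasing order is a simplifying assumption.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Link each K-mer to the K-mer that follows it in any read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def has_cycle(graph):
    """Iterative depth-first search; a cycle exists if an in-progress node is revisited."""
    IN_PROGRESS, DONE = 1, 2
    state = {}
    for start in list(graph):
        if start in state:
            continue
        state[start] = IN_PROGRESS
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if state.get(nxt) == IN_PROGRESS:
                    return True
                if nxt not in state:
                    state[nxt] = IN_PROGRESS
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                state[node] = DONE
                stack.pop()
    return False


def first_acyclic_k(reads, k_values=(10, 25, 35, 45, 55, 65)):
    """Return the first K that yields a cycle-free graph, or None if none does."""
    for k in k_values:
        if not has_cycle(build_graph(reads, k)):
            return k
    return None
```

With the repetitive toy read `ACGTACGTAA`, K=3 and K=5 both produce a cycle because a K-mer repeats within the read, while K=7 gives a linear graph. This shows why a larger K helps in repetitive sequence.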

Read filtering and reporting of VCF DP fields

In most pipelines, DRAGEN reports two types of depth counts, both of which may differ from the information in the BAM pileup due to various filtering steps that are applied throughout variant calling. Briefly:

  • Unfiltered depth is the number of reads (fragment-based) covering the position, downstream of any read collapsing, deduplication, downsampling and read disqualification, but upstream of informative read determination. Overlapping mate pairs are counted as single reads. When overlapping mate pairs are present, this may cause an apparent discrepancy between the reported depth and the pileup as viewed in a browser such as IGV. To resolve this, use the "View as pairs" option in IGV. Unfiltered depth is reported as INFO/DP, except in the case of gVCF homref calls, where it is reported as FORMAT/DP.

  • Informative depth is the number of reads (fragment-based) actually used to make the calling decision, where badly mated reads and uninformative reads (reads that could not be assigned to a specific allele) have been excluded. Informative depth is reported as FORMAT/DP, except in the case of gVCF homref calls, where it is not reported. The FORMAT/AD and FORMAT/AF fields are based on informative depth.

The following figure summarizes the different filtering steps in more detail.

  • Filter 1 acts on the reads present in the BAM input (in UMI pipelines, these are the collapsed reads produced by the read collapsing step, not the raw reads) and filters out the following reads:

    • Duplicate reads.

    • Soft-clipped bases. DRAGEN filters out soft-clipped bases only when calculating coverage reports.

    • [Somatic] Reads with MAPQ=0.

    • [Somatic] Reads with MAPQ < vc-min-tumor-read-qual, where vc-min-tumor-read-qual >1. By default, germline runs with machine learning (ML) enabled consider all reads, even those with MAPQ 0, resulting in increased sensitivity. MAPQ read filtering is controlled by --vc-min-read-qual for germline and tumor/normal (T/N) runs, but it does not affect tumor-only (T/O) runs. In contrast, --vc-min-tumor-read-qual controls filtering for tumor samples in T/N and T/O runs and has no effect on normal-only samples.

  • Filter 2 trims bases with BQ < 10 and filters out the following reads:

    • Unmapped reads.

    • Secondary reads.

    • Reads with bad cigars.

  • Filter 3 occurs after downsampling and HMM. Filter 3 filters out the following reads:

    • Disqualified reads. Reads are disqualified if their HMM score is below a threshold.

  • Filter 4 occurs after the genotyper runs. The genotyper adds annotation information to the FORMAT field. Filter 4 filters out the following reads:

    • Badly mated reads. A badly mated read is a read where the pair is mapped to two different reference contigs.

    • Non-informative reads. For example, if the HMM scores of the read against two different haplotypes are almost equal, the read is filtered out because it does not provide enough information to distinguish which of the two haplotypes is more likely.

Mosaic Calling

Since DRAGEN 4.3, the mosaic small variant caller runs downstream of the germline small variant caller. Non-cancer post-zygotic mosaic variants, which typically have an AF lower than 50%, are reported in the output VCF file with a MOSAIC INFO flag when detected by the mosaic caller. By default, MOSAIC tagged variants with AF smaller than 20% are filtered with the MosaicLowAF filter. To further enhance sensitivity, if the median depth of the sample detected by the ploidy estimator exceeds 100x, a default 10% threshold is applied instead. This is likely to have a greater impact on exome data, which typically has higher coverage. Exome users looking to control the number of low allele frequency (AF) mosaic variants can set the option --vc-mosaic-af-filter-threshold to 0.2 to override the dynamic coverage-based thresholding.
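The coverage-dependent MosaicLowAF threshold described above can be sketched as follows. The function names and the user_threshold parameter (standing in for --vc-mosaic-af-filter-threshold) are hypothetical. The "PASS" return value refers only to this one filter; other filters may still apply to a real call.

```python
def mosaic_af_filter_threshold(median_depth, user_threshold=None):
    """AF threshold below which MOSAIC-tagged calls get MosaicLowAF."""
    if user_threshold is not None:
        # An explicit threshold overrides the coverage-based default.
        return user_threshold
    # Default: 0.1 when median depth exceeds 100x, otherwise 0.2.
    return 0.1 if median_depth > 100 else 0.2


def mosaic_low_af_filter(af, median_depth, user_threshold=None):
    """Return "MosaicLowAF" or "PASS" for a MOSAIC-tagged call (this filter only)."""
    threshold = mosaic_af_filter_threshold(median_depth, user_threshold)
    return "MosaicLowAF" if af < threshold else "PASS"


# A 15% AF mosaic call passes at 150x but is filtered at 30x.
high_depth = mosaic_low_af_filter(0.15, median_depth=150)
low_depth = mosaic_low_af_filter(0.15, median_depth=30)
```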

Variant Caller Options

The following options control the variant caller stage of the DRAGEN host software.

  • --enable-variant-caller

    Set --enable-variant-caller to true to enable the variant caller stage for the DRAGEN pipeline.

  • --vc-target-bed

    [Optional] Restricts processing of the small variant caller, target BED related coverage, and callability metrics to regions specified in a BED file. The BED file is a text file containing at least three tab-delimited columns. The first three columns are chromosome, start position, and end position. The positions are zero-based. For example:

If the reference span of the variant overlaps with any of the regions in the target BED, then the variant is output. If the reference span does not overlap, the variant is not output. For SNPs and Insertions, the reference span is 1 bp. For deletions, the reference span is the length of the deletion.

  • --vc-target-bed-padding

    [Optional] Pads all target BED regions with the specified value. For example, if a BED region is 1:1000–2000 and the specified padding value is 100, the result is equivalent to using a BED region of 1:900–2100 and a padding value of 0. Any padding added to --vc-target-bed-padding is used by the small variant caller and by the target bed coverage/callability reports. The default padding is 0.

  • --vc-target-coverage

    Specifies the target coverage for downsampling. The default value is 500 for germline mode and 50 for somatic mode.

  • --vc-remove-all-soft-clips

    Set to true to ignore soft-clipped bases during the haplotype assembly step.

  • --vc-decoy-contigs

    Specifies a comma-separated list of contigs to skip during variant calling. This option can be set in the configuration file.

  • --vc-enable-decoy-contigs

    Set to true to enable variant calls on the decoy contigs. The default value is false.

  • --vc-enable-phasing

    Enable variants to be phased when possible. The default value is true.

  • --vc-combine-phased-variants-distance

    Set the maximum distance in base pairs between phased variants to be combined. The default value is 0, which disables the option. To enable the combining of phased variants, the recommended distance is 15 base pairs. The valid range is [0, 15].

  • --vc-enable-mosaic-detection

    Set to true to enable DRAGEN mosaic detection. Set to false to disable DRAGEN mosaic detection.

  • --vc-mosaic-af-filter-threshold

    Set the allele frequency threshold for the application of the MosaicLowAF filter to mosaic calls. All MOSAIC tagged variants with AF smaller than the AF threshold are filtered with the MosaicLowAF filter. The default mosaic AF filter threshold is set to 0.2 if the median depth of the sample detected by the ploidy caller is <= 100x and 0.1 if the detected depth is > 100x.
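The --vc-target-bed and --vc-target-bed-padding behavior described in the list above can be sketched as an interval-overlap check. This is an illustration only: the function names are hypothetical, regions are for a single contig, BED intervals are treated as zero-based and half-open, and the reference span is assumed to start at the VCF POS (converted to zero-based). The source does not state where the span of a deletion starts, so that part in particular is an assumption.

```python
def pad_regions(regions, padding=0):
    """Extend each zero-based, half-open (start, end) region by `padding` bases."""
    return [(max(0, start - padding), end + padding) for start, end in regions]


def reference_span_len(ref, alt):
    """1 bp for SNPs and insertions; the deletion length for deletions."""
    return len(ref) - len(alt) if len(ref) > len(alt) else 1


def variant_in_targets(pos, ref, alt, regions, padding=0):
    """True if the variant's reference span overlaps any (padded) region.

    pos is the 1-based VCF POS. Assumption: the span starts at POS.
    """
    start0 = pos - 1
    end0 = start0 + reference_span_len(ref, alt)
    return any(
        start0 < end and start < end0
        for start, end in pad_regions(regions, padding)
    )


targets = [(1000, 2000)]
padded = pad_regions(targets, padding=100)   # [(900, 2100)]
outside = variant_in_targets(951, "A", "G", targets)              # False
inside = variant_in_targets(951, "A", "G", targets, padding=100)  # True
```

The padded result matches the example in the option description: a region of 1:1000-2000 with a padding of 100 behaves like 1:900-2100 with a padding of 0.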

Downsampling Options for Small Variant Calling

You can use the following options for downsampling reads in the small variant calling pipeline.

For mitochondrial small variant calling, the downsampling options can be set separately because the mitochondrial contig contains a higher depth than the rest of the contigs in a WGS data set. The following are the downsampling options for the mitochondrial contig.

  • --vc-target-coverage-mito

  • --vc-max-reads-per-active-region-mito

  • --vc-max-reads-per-raw-region-mito

The target coverage and max/min reads in raw/active region options are not directly related and can be triggered independently.

The following are the default downsampling values for each small variant calling mode.

The target coverage downsampling step runs first and is meant to limit the total coverage at a given position. This step is approximate, and the coverage after downsampling at a given position can be slightly higher than the threshold because of the --vc-min-reads-per-start-pos setting.

If the number of reads sharing a start position is equal to or lower than --vc-min-reads-per-start-pos (default value is 10), that start position is skipped during downsampling, so that the reads at any start position are never reduced below this minimum.

The next downsampling step is to apply the --vc-max-reads-per-raw-region and --vc-max-reads-per-active-region limits. These options are used to limit the total number of reads in an entire region using a leveling downsampling method.

This downsampling mechanism scans each start position from the start boundary of the region and discards one read from that position, then moves on to the next position, until the total number of reads falls below the threshold. It can potentially take several passes across the entire region for the total number of reads in the entire region to fall below the threshold. After the threshold is met, the downsampling step is stopped regardless of which position was considered last in the region.

When downsampling occurs, the choice of which reads to keep or remove is random. However, the random number generator is seeded to a default value to make sure that the generator produces the same set of values in each run. This ensures reproducible results, which means there is no run to run variation when using the same input data.
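The leveling pass described above can be sketched as follows. This is a simplified model, not DRAGEN's implementation: the function name, the dictionary layout, and the seed value are hypothetical, and the limit is treated as the maximum number of reads allowed to remain in the region.

```python
import random


def level_downsample(reads_by_start, max_reads, seed=0):
    """Discard one read per start position per pass until the region fits.

    reads_by_start: dict mapping start position to a list of read names.
    max_reads: assumed to be the largest read count allowed in the region.
    seed: fixed seed so that repeated runs remove the same reads.
    """
    rng = random.Random(seed)
    buckets = {pos: list(reads) for pos, reads in reads_by_start.items()}
    total = sum(len(reads) for reads in buckets.values())
    while total > max_reads:
        removed_any = False
        # Scan start positions from the start boundary of the region.
        for pos in sorted(buckets):
            if total <= max_reads:
                break  # Stop as soon as the limit is met, mid-pass if needed.
            if buckets[pos]:
                buckets[pos].pop(rng.randrange(len(buckets[pos])))
                total -= 1
                removed_any = True
        if not removed_any:
            break  # Nothing left to remove (guards against a negative limit).
    return buckets


region = {100: ["a", "b", "c", "d"], 101: ["e"], 105: ["f", "g", "h"]}
kept = level_downsample(region, max_reads=5)
```

In this example a single pass removes one read at each of the three start positions, leaving five reads. Because the generator is seeded, calling the function again with the same input keeps the same reads.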

Small Variant Caller Output

By default, the DRAGEN small variant caller outputs a VCF file (<output-file-prefix>.hard-filtered.vcf.gz) in VCF 4.2 format containing both filtered and PASSing variant records.

Variant Representation

  • Parsimony means representing a variant in as few nucleotides as possible without reducing the length of any allele to 0.

  • Left aligning a variant means shifting the start position of that variant to the left until it is no longer possible to do so.

  • A variant is normalized if and only if it is parsimonious and left aligned.

Additional notes on variant representation in the DRAGEN VCF:

  • Reference-trimming of alleles: A single padding reference base is used to represent insertions and deletions (i.e. the reference base preceding the insertion or deletion is included).

Multi-allelic Variants and Overlapping Variants

A multiallelic site is a specific locus in a genome that contains three or more observed alleles, counting the reference as one, and therefore allowing for two or more variant alleles. Multi-allelic calls are output in a single variant record in the VCF as follows:

Two indels are considered as multi-allelic if they share the same reference base preceding the indel. For example:

DRAGEN employs joint detection of overlapping variants, a feature designed to detect overlapping SNP and INDEL variants and output them in a single VCF record represented as a multi-allelic genotype. However, if a SNP overlaps an INDEL but the SNP does not align with the reference base preceding the indel, the SNP and INDEL are represented as two different variant records, as shown in the example below.

QUAL, QD, and GQ Formulation

In single sample VCF and gVCF, the QUAL follows the definition of the VCF specification. For more information on the VCF specification, see the most current VCF documentation available on samtools/hts-specs GitHub repository.

  • QUAL is the Phred-scaled probability that the site has no variant and is computed as:

    QUAL = -10*log10(posterior genotype probability of a homozygous-reference genotype (GT=0/0))

    That is, QUAL = GP (GT=0/0), where GP = posterior genotype probability in Phred scale. QUAL = 20 means there is 99% probability that there is a variant at the site. The GP values are also given in Phred scale in the VCF file.

  • GQ for non-homref calls is the Phred-scaled probability that the call is incorrect. GQ=-10*log10(p), where p is the probability that the call is incorrect. GQ=-10*log10(sum(10.^(-GP(i)/10))) where the sum is taken over the GT that did not win. GQ of 3 indicates a 50 percent chance that the call is incorrect, and GQ of 20 indicates a 1 percent chance that the call is incorrect.

  • In gVCF mode, the evidence in favor of homozygous reference calls is also assessed. However, the posterior probability is not of interest in this case (with zero evidence, e.g. due to zero coverage, the strong prior in favor of homref would yield a strong posterior in favor of homref), so the value of GQ for homref calls reflects the evidence directly, defined using the likelihood ratio between the likelihoods for the homref hypothesis and the strongest competing variant hypothesis: 10*log10[P(D|homref)/P(D|variant)] where D represents the pileup data.

  • QD is the QUAL normalized by the read depth, DP.
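The QUAL and GQ definitions above can be checked numerically with a short sketch. The function names are hypothetical, and the code assumes that GP lists the Phred-scaled posteriors in standard VCF genotype order with 0/0 first. The GP values are taken from the mosaic chrX example record shown later in this section (QUAL 48.84, GQ 48).

```python
import math


def qual_from_gp(gp):
    """QUAL = GP(GT=0/0): Phred-scaled posterior of the hom-ref genotype."""
    return gp[0]


def gq_from_gp(gp):
    """GQ for a non-homref call: Phred-scaled probability the call is wrong.

    The winning genotype has the smallest Phred-scaled GP; the probabilities
    of all other genotypes are summed to get the error probability.
    """
    winner = min(range(len(gp)), key=lambda i: gp[i])
    p_wrong = sum(10 ** (-g / 10) for i, g in enumerate(gp) if i != winner)
    return -10 * math.log10(p_wrong)


# GP field of the example record: genotypes 0/0, 0/1, 1/1.
gp = [4.8837e+01, 7.4031e-05, 5.4007e+01]
qual = qual_from_gp(gp)   # 48.837, reported as QUAL 48.84
gq = gq_from_gp(gp)       # about 47.7, reported as GQ 48 after rounding
```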

The QUAL scores generated by DRAGEN differ significantly from those of GATK, as DRAGEN's algorithms for small variant detection provide more realistic scores. This improvement stems from two key factors:

  • Correlated Errors: DRAGEN accounts for real-world correlated errors, unlike GATK, which assumes errors are uncorrelated, leading to inflated QUAL scores in GATK.

  • Machine Learning (ML): DRAGEN-ML further recalibrates QUAL scores, making them more accurate than DRAGEN without ML. With ML enabled, QUAL scores tend to not exceed 75, compared to GATK, where they can exceed 1000. Consequently, DRAGEN-ML uses a lower QUAL filtering threshold (3.0103) compared to DRAGEN without ML (10.41 for SNP and 7.83 for Indel).

Our recommendation is to use the default filtering thresholds in DRAGEN: QUAL threshold of 3.0103 with ML enabled.

gVCF Output

A genomic VCF (gVCF) file contains information on variants and positions determined to be homozygous to the reference genome. For homozygous regions, the gVCF file includes statistics that indicate how well reads support the absence of variants or alternative alleles. The gVCF file includes an artificial <NON_REF> allele. Reads that do not support the reference or any variants are assigned the <NON_REF> allele. DRAGEN uses these reads to determine if the position can be called as a homozygous reference, as opposed to remaining uncalled. The resulting score represents the Phred-scaled level of confidence in a homozygous reference call. In germline mode, the score is FORMAT/GQ and in somatic mode the score is FORMAT/SQ.

The following options are available to enable and control gVCF output.

  • --vc-emit-ref-confidence

    To enable gVCF output, set to GVCF. By default, contiguous runs of homozygous reference calls with similar scores are collapsed into blocks (hom-ref blocks). Hom-ref blocks save disk space and processing time of downstream analysis tools. DRAGEN recommends using the default mode.

    To produce unbanded output, set --vc-emit-ref-confidence to BP_RESOLUTION.

  • --vc-enable-vcf-output

    To enable VCF file output during a gVCF run, set to true. The default value is false.

  • --vc-gvcf-bands

    If using the default --vc-emit-ref-confidence gvcf (banded mode), DRAGEN collapses gVCF records with a similar GQ or SQ score. By default, the cutoffs are 1 10 20 30 40 60 80 for germline and 1 3 10 20 50 80 for somatic. For example, to define the bands [0, 10), [10, 50), and ≥ 50 use --vc-gvcf-bands 10 50.

  • --vc-compact-gvcf

    This option, when used for germline in conjunction with --vc-emit-ref-confidence gvcf, produces a much smaller gVCF output file than the default. It can be used when the gVCF is destined for ingestion into gVCF Genotyper, offering further savings on disk space and gVCF Genotyper runtime compared to the default. This option implies --vc-gvcf-bands 0 1 10 20 30 and additionally omits certain metrics that are not used by gVCF Genotyper. Note that files generated using this option will be rejected by the Pedigree Caller.
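The banding rule for --vc-gvcf-bands can be sketched as follows, using the example cutoffs 10 and 50 from the option description. The function names are hypothetical, and the collapsing step is a simplification that groups consecutive positions purely by band; DRAGEN's block construction may apply further rules.

```python
from bisect import bisect_right
from itertools import groupby


def band_index(score, cutoffs):
    """Index of the band containing `score`; cutoffs must be sorted ascending.

    With cutoffs [10, 50]: band 0 is [0, 10), band 1 is [10, 50), band 2 is >= 50.
    """
    return bisect_right(cutoffs, score)


def collapse_into_blocks(scores, cutoffs):
    """Return (band_index, run_length) for each run of same-band positions."""
    return [
        (band, len(list(run)))
        for band, run in groupby(scores, key=lambda s: band_index(s, cutoffs))
    ]


cutoffs = [10, 50]
per_position_gq = [5, 7, 12, 30, 55, 60, 8]
blocks = collapse_into_blocks(per_position_gq, cutoffs)
# blocks == [(0, 2), (1, 2), (2, 2), (0, 1)]
```

Fewer cutoffs produce wider bands and therefore fewer, longer hom-ref blocks, which is why banded output is smaller than BP_RESOLUTION output.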

Not all entries in the gVCF are contiguous. The file might contain gaps that are not covered by either a variant line or a hom-ref block. The gaps correspond to regions that are not callable. A region is not callable if there is not at least one read mapped to the region with a MAPQ score above zero.

In germline mode, the thresholds for calling are lower for gVCFs than for VCFs. The gVCF output could show a different number of variants than a VCF run for the same sample. There is likely a different number of biallelic and multiallelic calls because gVCF mode includes all possible alleles at a locus, rather than only the two most likely alleles. This means that a biallelic call in the VCF can be output as a multiallelic call in the gVCF. The genotype in the gVCF still points to the two most likely alleles, so the variant call remains the same.

The following are example gVCF records that include a hom-ref block call and a variant call.

In single sample gVCF, the FORMAT/DP reported at a HomRef position is the median DP in the band, and AD is the corresponding value, so the sum of AD equals DP even in a homref band. The minimum is also computed and printed as MIN_DP for the band.

Phasing and Phased Variants

DRAGEN supports output of phased variant records in both the germline and the somatic VCF and gVCF files. When two or more variants are phased together, the phasing information is encoded in a sample-level annotation, FORMAT/PS. FORMAT/PS identifies which set the phased variant is in. The value in the field is an integer representing the position of the first phased variant in the set. All records in the same contig with matching PS values belong to the same phase set.

The following is an example of a DRAGEN single sample gVCF, where two SNPs are phased together.

During the genotyping step, all haplotypes and all variants are considered over an active region. For each pair of variants, if both variants occur on all of the same haplotypes or if either is a homozygous variant, they are phased together. If the variants only occur on different haplotypes, they are phased opposite to each other. If any heterozygous variants are present on some of the same haplotypes but not others, phasing is aborted and no phasing information is output for the active region.

Combine Phased Variants

The DRAGEN small variant caller supports combining multiple nearby phased variant records into a single VCF record. When the functionality is enabled, the caller will output both multi-nucleotide variants (MNVs; multiple phased SNVs combined into a single VCF record) and complex indels (multiple phased insertions, deletions, and/or SNVs combined into a single VCF record) in the VCF.

For example, assuming reference at position chr2 115035 is A, the following two phased SNVs can be combined into an MNV.

The phased SNVs are combined as follows.

The following two phased indels can be combined into a complex indel.

The phased indels are combined as follows.

Individual variant records existing on the same haplotypes are deemed to be in phase and will be merged if they are within a configurable distance threshold of one another. For each consecutive pair of phased variants in a phase set, the variants will be combined if the difference between their POS fields does not exceed the threshold. For deletions, the number of deleted bases is taken into account and subtracted from the POS difference between the deletion and downstream phased variant when calculating the distance between calls. Please note that variant records without a PS tag may be merged into MNVs and complex indels together with calls having PS tags due to algorithmic differences between variant phasing and variant merging.
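The distance test for a consecutive pair of phased variants can be sketched as below. The function names and the (POS, REF, ALT) tuple layout are hypothetical, the positions are made up, and the sketch covers only the POS difference and deletion adjustment described above, not the phasing or merging logic itself.

```python
def gap_between(upstream, downstream):
    """POS difference between two calls, minus bases deleted by the upstream call.

    Each variant is a (pos, ref, alt) tuple with a 1-based VCF POS.
    """
    pos_up, ref_up, alt_up = upstream
    pos_down = downstream[0]
    deleted = max(0, len(ref_up) - len(alt_up))  # 0 for SNVs and insertions
    return pos_down - pos_up - deleted


def should_combine(upstream, downstream, max_distance):
    """True if two phased calls fall within the configured distance."""
    if max_distance == 0:
        return False  # A distance of 0 disables merging.
    return gap_between(upstream, downstream) <= max_distance


# Two adjacent phased SNVs, and a 3-base deletion followed by an SNV.
snv_pair = should_combine((200, "A", "G"), (201, "C", "T"), max_distance=2)
del_then_snv = should_combine((100, "ACGT", "A"), (105, "C", "T"), max_distance=2)
```

In the second pair the raw POS difference is 5, but the 3 deleted bases reduce the effective distance to 2, so the pair still qualifies at a threshold of 2.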

In the somatic pipeline, combining phased variants is enabled by default, consistent with HGVS guidelines. In the germline pipeline, the functionality can be enabled using the command line options detailed below.

Command line options for merging phased variants

  • --vc-combine-phased-variants-distance-snvs-only Specifies the maximum distance over which phased SNVs will be combined into an MNV. This option applies only to phased variant groups consisting of only SNVs. The default is 2 for somatic and 0 for germline (disabled). For phased variant groups that include both SNVs and indels, the analogous vc-combine-phased-variants-distance option applies.

  • --vc-combine-phased-variants-distance Specifies the maximum distance over which phased insertions, deletions, and SNVs will be combined into a complex indel. This distance threshold will be applied to any group of phased variants that includes at least one indel. The default is 2 for somatic and 0 for germline (disabled).

For both options, a value of 0 disables merging. When either option is enabled with a value [1, 15], all phased variants in the group that are within the provided distance value of one another are merged into an MNV (for vc-combine-phased-variants-distance-snvs-only) or complex indel (for vc-combine-phased-variants-distance).

  • --vc-mnv-emit-component-calls Specifies whether or not to emit the individual component variant records along with the merged variant records. When set to true, all component calls making up an MNV or complex indel will be emitted in the VCF along with the merged variant record. The default is true for somatic and false (disabled) for germline.

  • --vc-combine-phased-variants-max-vaf-delta Specifies the threshold for filtering MNV component variant calls when the events comprising the MNV have different allele frequencies. The default value is 0.1, which means that any SNV or INDEL with an AF that is more than 0.1 greater than the MNV AF is emitted as a PASSing call, while the remaining components are emitted with the 'mnv_component' FILTER flag. Only applicable when vc-combine-phased-variants-distance is greater than 0 and vc-mnv-emit-component-calls is true.

DRAGEN can output all component SNVs and/or INDELs that make up a merged MNV or complex indel along with the merged call itself. Merged calls and their component calls can be identified and linked to one another by a common value in the INFO.MNVTAG field. This behavior is default in somatic mode and can be enabled in germline mode by setting --vc-mnv-emit-component-calls=true. When vc-mnv-emit-component-calls is enabled and DRAGEN reports an MNV or complex indel call, the component calls that make up the merged call are filtered with the mnv_component filter flag unless the difference in VAF between the component call and merged call is greater than the value of vc-combine-phased-variants-max-vaf-delta (default: 0.1). This avoids component calls being doubly represented in the VCF output. In the case where VAF difference between a given component call and merged call exceeds the threshold value of vc-combine-phased-variants-max-vaf-delta, that is considered evidence for the component call existing both as a standalone variant and as part of the MNV or complex indel and the component call will be emitted as a PASSing VCF record. For example, in the following MNV group, there are two component SNVs making up the MNV. The MNV call is emitted as a PASSing call while one component SNV with AF equal to that of the MNV is filtered with the mnv_component FILTER flag and the other component SNV with VAF greater than that of the MNV by more than 0.1 is emitted as a PASSing call.
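The component-call filtering rule can be sketched as a single comparison. The function name and the AF values are hypothetical, and the comparison direction (component AF exceeding the merged call AF by more than the delta) follows the option description above.

```python
def component_filter(component_af, merged_af, max_vaf_delta=0.1):
    """FILTER value for a component call of an MNV or complex indel.

    A component whose AF exceeds the merged call's AF by more than
    max_vaf_delta is treated as also existing on its own and passes.
    """
    if component_af - merged_af > max_vaf_delta:
        return "PASS"
    return "mnv_component"


mnv_af = 0.30
first_snv = component_filter(0.30, mnv_af)   # "mnv_component"
second_snv = component_filter(0.45, mnv_af)  # "PASS"
```

This mirrors the two-SNV case described above: the component with the same AF as the MNV is filtered, and the component with a clearly higher AF is kept as a PASSing record.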

DRAGEN supports phasing of the genotypes listed in the below table. Only the first row in the table is relevant to somatic, since the somatic pipeline only emits 0/1 and 0|1 genotypes. MNV calls can still be phased with other variant calls that fell outside the phased variants distance.

Examples of diploid haplotypes where phasing is supported:

Examples of diploid haplotypes where phasing is not supported:

Ploidy Support

The small variant caller currently only supports either ploidy 1 or 2 on all contigs within the reference except for the mitochondrial contig, which uses a continuous allele frequency approach (see Mitochondrial Calling). The selection of ploidy 1 or 2 for all other contigs is determined as follows.

  • If --sample-sex is not specified on the command line, the Ploidy Estimator determines the sex. If the Ploidy Estimator cannot determine the sex karyotype or detects sex chromosome aneuploidy, all contigs are processed with ploidy 2.

  • If --sample-sex is specified on the command line, contigs are processed as follows.

    • For female samples, DRAGEN processes all contigs with ploidy 2, and marks variant calls on chrY with a filter PloidyConflict.

    • For male samples, DRAGEN processes all contigs with ploidy 2, except for the sex chromosomes. DRAGEN processes chrX with ploidy 1, except in the PAR regions, where it is processed with ploidy 2. chrY is processed with ploidy 1 throughout.
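The ploidy selection rules above can be summarized in a small sketch. The function names, the sample_sex encoding ("male", "female", or None when the sex is undetermined or aneuploid), and the in_par flag are hypothetical. PAR coordinates are not given in the source and are left to the caller, and the mitochondrial contig is out of scope because it uses the continuous AF approach.

```python
X_NAMES = {"X", "chrX"}
Y_NAMES = {"Y", "chrY"}


def contig_ploidy(contig, sample_sex, in_par=False):
    """Ploidy used for a non-mitochondrial contig.

    sample_sex: "male", "female", or None if the sex karyotype could not be
    determined or sex chromosome aneuploidy was detected.
    in_par: True if the position lies in a PAR region (caller supplies this).
    """
    if sample_sex == "male":
        if contig in X_NAMES:
            return 2 if in_par else 1
        if contig in Y_NAMES:
            return 1
    # Female samples, undetermined sex, and all autosomes use ploidy 2.
    return 2


def ploidy_filter(contig, sample_sex):
    """Return "PloidyConflict" for chrY calls in female samples, else None."""
    if sample_sex == "female" and contig in Y_NAMES:
        return "PloidyConflict"
    return None


male_x = contig_ploidy("chrX", "male")                   # 1
male_x_par = contig_ploidy("chrX", "male", in_par=True)  # 2
female_y_flag = ploidy_filter("chrY", "female")          # "PloidyConflict"
```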

For male samples in germline calling mode, DRAGEN calls potential mosaic variants in non-PAR regions of sex chromosomes. A variant is called as mosaic when the allele frequency (FORMAT/AF) is below 75% or if multiple alt alleles are called, suggesting incompatibility with the haploid assumption. The GT field for bi-allelic mosaic variants is "0/1", denoting a mixture of reference and alt alleles, as opposed to the regular GT of "1" for haploid variants. The GT field for multi-allelic mosaic variants is "1/2" in VCF. You can disable the calling of mosaic variants by setting --vc-enable-sex-chr-diploid to false.

An example germline VCF record of a mosaic variant in a haploid region: chrX 18622368 . C T 48.84 PASS AC=1;AF=0.500;AN=2;DP=22;FS=4.154;MQ=248.02;MQRankSum=3.272;QD=2.27;ReadPosRankSum=2.671;SOR=1.546;FractionInformativeReads=1.000;MOSAIC GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 0/1:9,13:0.5909:22:1,8:8,5:48:84,0,51:4.8837e+01,7.4031e-05,5.4007e+01:0.00,34.77,37.77:5,4,4,9:3,6,5,8

DRAGEN detects sex chromosomes by the naming convention, either X/Y or chrX/chrY. No other naming convention is supported.

Overlapping Mates in the Small Variant Calling

Instead of treating overlapping mates as independent evidence for a given event, DRAGEN handles overlapping mates in both the germline and somatic pipelines as follows.

  • When the two overlapping mates agree with each other on the allele with the highest HMM score, the genotyper uses the mate with the greatest difference between the highest and the second highest HMM score. The HMM score of the other mate becomes zero.

  • When the two overlapping mates disagree, the genotyper sums the HMM score between the two mates, assigns the combined score to the mate that agrees with the combined result, and changes the HMM score of the other mate to zero.

  • The base qualities of overlapping mates are no longer adjusted.

Mitochondrial Calling

Typically, there are approximately 100 mitochondria in each mammalian cell. Each mitochondrion harbors 2–10 copies of mitochondrial DNA (mtDNA). For example, if 20% of the chrM copies have a variant, then the allele frequency (AF) is 20%. This is also referred to as continuous allele frequency. The expectation is that the AF of variants on chrM is anywhere between 0% and 100%.

DRAGEN processes chrM through a continuous AF pipeline, which is similar to the somatic variant calling pipeline. In this case, a single ALT allele is considered and the AF is estimated. The estimated AF can be anywhere between 0% and 100%. Default variant AF thresholds are applied to mitochondrial variant calling.

  • --vc-enable-af-filter-mito Enables the allele frequency filter for mitochondrial variant calling. The default is true.

  • --vc-af-call-threshold-mito Set the threshold for emitting calls in the VCF. The default is 0.01.

  • --vc-af-filter-threshold-mito Set the threshold to mark emitted VCF calls as filtered. The default is 0.02.

QUAL and GQ are not output in the chrM variant records. Instead, the confidence score is FORMAT/SQ, which gives the Phred-scaled confidence that a variant is present at a given locus. A call is made if FORMAT/SQ > vc-sq-call-threshold.

The following filters can be applied to mitochondrial variant calls.

  • --vc-sq-call-threshold Set the SQ threshold for emitting calls in the VCF. The default is 0.1.

  • --vc-sq-filter-threshold Set the SQ threshold to mark emitted VCF calls as filtered. The default is 3.0.

  • --vc-enable-triallelic-filter Enables the multiallelic filter. The default value is false.

If FORMAT/SQ < vc-sq-call-threshold, the variant is not emitted in the VCF. If FORMAT/SQ > vc-sq-call-threshold but FORMAT/SQ < vc-sq-filter-threshold, the variant is emitted in the VCF but FILTER=weak_evidence.

If FORMAT/SQ > vc-sq-call-threshold, FORMAT/SQ > vc-sq-filter-threshold, and no other filters are triggered, the variant is output in the VCF and FILTER=PASS.
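The SQ-based emit and filter decisions can be sketched as follows, using the defaults listed for --vc-sq-call-threshold (0.1) and --vc-sq-filter-threshold (3.0) in the options above. The function name is hypothetical. The source does not say how a score exactly equal to a threshold is handled, so treating equality as meeting the threshold is an assumption, and other filters are ignored.

```python
def classify_sq(sq, call_threshold=0.1, filter_threshold=3.0):
    """Return None (not emitted), "weak_evidence", or "PASS" for a chrM call.

    Assumption: a score equal to a threshold is treated as meeting it.
    """
    if sq < call_threshold:
        return None  # Below the call threshold: not written to the VCF.
    if sq < filter_threshold:
        return "weak_evidence"  # Emitted, but marked as filtered.
    return "PASS"  # Assuming no other filters are triggered.


not_emitted = classify_sq(0.05)   # None
filtered = classify_sq(1.0)       # "weak_evidence"
passing = classify_sq(10.0)       # "PASS"
```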

The following are example VCF records on the chrM. The examples show one call with very high AF and another with low AF. In both cases FORMAT/SQ > vc-sq-call-threshold. FORMAT/SQ is also > vc-sq-filter-threshold, so the FILTER annotation is PASS.

FORMAT/GT

For homref calls (e.g. in NON_REF regions of gVCF output) the FORMAT/GT is hard-coded to 0/0. The FORMAT/AF yields an estimate on the variant allele frequency, which ranges anywhere within [0,1]. For variant calls with FORMAT/AF < 95%, the FORMAT/GT is set to 0/1. For variants with very high allele frequencies (FORMAT/AF ≥ 95%), the FORMAT/GT is set to 1/1.
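The FORMAT/GT assignment on chrM reduces to a threshold on FORMAT/AF, as this minimal sketch shows. The function name and the is_homref flag are hypothetical.

```python
def mito_genotype(af, is_homref=False):
    """FORMAT/GT for a chrM record, given FORMAT/AF in the range [0, 1]."""
    if is_homref:
        return "0/0"  # Hard-coded for homref calls.
    # Variant calls: 1/1 at AF >= 95%, otherwise 0/1.
    return "1/1" if af >= 0.95 else "0/1"


low_af_call = mito_genotype(0.20)    # "0/1"
high_af_call = mito_genotype(0.99)   # "1/1"
```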

The following is an example of a variant record on chrM in a trio joint VCF. The variant was detected in the second sample with a confidence score that passed the filter threshold. In the first and third samples GT=0/0, which indicates a tentative hom-ref call (i.e., that position for the sample is in a NON_REF region over which no variant was detected with sufficient confidence), but the weak_evidence filter tag indicates that this call is made with low confidence.

Personalized Germline Small Variant Calling

We leverage the new pangenome reference and multi-genome mapper output to compute a personalized 2-haplotype reference for the input sample.

The computed 2-haplotype reference is used to impute variants, adjust prior probabilities for genotypes in the variant caller, and create a new personalized machine learning model, which significantly boosts the accuracy of variant calling. False negatives are reduced by adjusting genotype priors based on imputed phased variants in the computed haplotypes. False positives are reduced by limiting the impact of noise from other population haplotypes.

To enable personalized variant calling, including the personalized machine learning model, set --enable-personalization to true (default: false). This outputs two files in the output directory: .personal_haplotypes.tsv.gz and .personal.vcf.gz.

.personal_haplotypes.tsv.gz describes the personalized 2-haplotype reference. Each line contains the following fields: #CHROM START END HAPS. By default each line represents a 4kbp bin of the reference genome (indicated by the CHROM, START and END fields). For each 4kbp bin, the HAPS field denotes the pair of ancestral haplotypes (from the pangenome reference panel) that are inferred for the sample.

.personal.vcf.gz describes the variants imputed for the sample. Each variant is annotated along with genotype (GT), posterior probabilities (PGP, Personalized Genotype Posterior) and the inferred best haplotype pair (HAPS). The variant quality score (QUAL) is computed as -10 * log10(probability that the imputed genotype is incorrect) and is capped at 999.

The .personal.vcf.gz is useful when running in split mode and is beneficial to save along with the BAM/CRAM output. To enable personalized variant calling and machine learning in split-run scenarios, simply provide the personal variant VCF (.personal.vcf.gz) along with the BAM/CRAM input (--enable-personalization true --vc-pg-variants <$OUT_PREFIX.personal.vcf.gz>).

Note that personalization is only available for the germline small variant caller (WGS and WES) when used with a pangenome reference.

Joint Detection of Overlapping Variants

When variants at multiple loci in a single active region are detected jointly, genotyping can benefit. DRAGEN combines loci into a joint detection region if the following conditions are met:

  • Loci have alleles that overlap each other.

  • Loci are in an STR region or less than 10 bases away from an STR region.

  • Loci are less than 10 bases apart from each other.

Joint detection generates a haplotype list where all possible combinations of the alleles in the joint detection regions are represented. This calculation leads to a larger number of haplotypes. During genotyping, joint detection calculates the likelihoods that each haplotype pair is the truth, given the observed read pileup. Genotype likelihoods are calculated as the sum of the likelihoods of haplotype pairs that support the alleles in the genotypes. Genotypes with maximum likelihood are reported.

Joint detection is enabled by default. To disable joint detection, set --vc-enable-joint-detection to false.

Modeling of Correlated Errors Across Reads

DRAGEN has two algorithms that model correlated errors across reads in a given pileup.

Foreign Read Detection

Foreign read detection (FRD) detects mismapped reads. FRD modifies the probability calculation to account for the possibility that a subset of the reads were mismapped. Instead of assuming that mapping errors occur independently per read, FRD estimates the probability that a burst of reads is mismapped, by incorporating such evidence as MAPQ and skewed AF.

Mapping errors typically occur in bursts, but treating mapping errors as independent error events per read can result in high confidence scores in spite of low MAPQ and/or skewed AF. One possible strategy to mitigate overestimation of confidence scores is to include a threshold on the minimum MAPQ used in the calculation. However, this strategy can discard evidence and result in false positives.

FRD extends the legacy genotyping algorithm by incorporating an additional hypothesis that reads in the pileup might be foreign reads (ie, their true location is elsewhere in the reference genome). The algorithm exploits multiple properties (skewed allele frequency and low MAPQ) and incorporates this evidence into the probability calculation.

Sensitivity is improved by rescuing FN, correcting genotypes, and enabling lowering of the MAPQ threshold for incoming reads into the variant caller. Specificity is improved by removing FP and correcting genotypes.

Base Quality Dropoff

The base quality drop off (BQD) algorithm detects systematic and correlated base call errors caused by the sequencing system. BQD exploits certain properties of those errors (strand bias, position of the error in the read, base quality) to estimate the probability that the alleles are the result of a systematic error event rather than a true variant.

Bursts of errors that occur at a specific locus have distinct characteristics differentiating them from true variants. The base quality drop off (BQD) algorithm is a detection mechanism that exploits certain properties of those errors (strand bias, position of the error in the read, low mean base quality over said subset of reads at the locus of interest) and incorporates them into the probability calculation.

Launching Analysis

Overview

Run on DRAGEN Server

Getting Started

Analysis output is written to /staging/DRAGEN_Heme_WGS_Tumor_Only_Pipeline_{version}_Analysis_{datetimestamp} by default. To write to a different output directory, run the bash script with --analysisFolder <FULL_PATH_TO_ANALYSIS_FOLDER>.

The --demultiplexOnly flag runs the pipeline through FASTQ Generation only, and these outputs can be used for splitting a run into smaller batch analyses with --inputType fastq and the --sampleIDs argument.

DRAGEN Server App

Installation Procedure on DRAGEN Server

Downloader

Choose the downloader appropriate for your platform. When executed, it prompts you for a path to download the assets to. The required software packages are downloaded into the dragen_pipelines directory under that path. If the path was used for a previous execution of the downloader, any incomplete downloads are resumed, existing files are checksummed, and any files with invalid checksums are re-downloaded.

The downloaded directory content may be moved to the installation target DRAGEN server using a USB key with at least 128 GB of free space or by copying to Network Storage which is reachable from the target DRAGEN Server.

Downloader System Requirements

Expected downloaded content

    • dragen-app-manager-1.0.14-1.x86_64-el8-offline.run

    • README

      • install_Heme_WGS_TO_v4.4.4.62.run

      • Heme_WGS_TO_4.4.4.62.iapp

      • README

      • dpf-core_1.0.0.36.ires

      • dpf-templates_4.4.4.52.ires

      • dpf-docker-images_4.4.4.52.ires

      • dragen-4.4.4-12.multi.el8.x86_64.run

      • heme_wgs_to_resources_4.4.4.2.ires

      • hg38-alt_masked.cnv.hla.methyl_cg.methylated_combined.rna-11-r5.0-1.ires

      • hs37d5_chr-cnv.hla.methyl_cg.methylated_combined.rna-11-r5.0-1.ires

      • variant_annotation_data-tmb_annotations-4.4.4-1.ires

Installer

Installation Requirements

DRAGEN and DRAGEN Application Manager

The pipeline requires DRAGEN v4.4.4 or higher. If DRAGEN v4.4.4 or higher is not already installed when the app is installed, the installer installs this version of DRAGEN.

Minimum System Operating Requirements

Hardware

  • v3 DRAGEN server or v4 DRAGEN server

  • mkfifo is enabled on the network-attached storage (NAS).

Software

The software installed by default on the DRAGEN server includes the following items:

  • DRAGEN server software. Refer to sample sheet settings for the DRAGEN version number.

  • Oracle Linux 8

Storage

  • DRAGEN server v3 provides a 6.4 TB NVMe SSD. This SSD is located at the /staging directory and is suitable for storing only one or two runs of the analysis pipeline.

  • DRAGEN server v4 provides 12.8 TB via a 2 x 6.4 TB NVMe U.2 SSD configuration.

  • Consider the following when making data storage decisions.

    • A NovaSeq 6000 sequencing run that uses an S4 flow cell can produce up to 3 TB of output.

      • The Heme pipeline can produce an additional 4-6 TB of analysis output. For optimal performance when writing to a non-default directory, specify an analysis folder location on /staging. This ensures that the DRAGEN-related processes read and write data to the DRAGEN server's high-speed NVMe SSD.

    • Network-attached storage is required for long-term storage of sequencing runs and Heme pipeline output.

    • Managing data storage is your responsibility.

      • Illumina recommends developing a strategy to copy data from the DRAGEN server to network-attached storage.

Installation Instructions

  • Installing the Heme pipeline requires root privileges.

    • Follow the instructions for DRAGEN license installation provided by Illumina Customer Care or refer to the DRAGEN server documentation.

  • Copy the directory structure from the downloader directory to the target DRAGEN server (or a path accessible with sudo privileges).

  • Ensure the installer is executable by running chmod +x install_Heme_WGS_TO_v{version}.run

  • Launch the installer with root privileges: sudo /path/to/install_Heme_WGS_TO_v{version}.run

    • If DRAGEN Application Manager is not already installed, the installer will exit and direct you to the path to the DRAGEN Application Manager installer

Run Self-Test Script

The self-test script, present after app installation, checks the following functions:

  • All required services are running.

  • All resources are in place.

  • The analysis workflow image can be launched.

  • The Heme pipeline can run successfully on a test dataset.

To run the self-test script, execute:

The following output will show if installation is completed successfully.

If the self-test prints a failure message, contact Illumina Technical Support, and provide the output file found in /staging/check_Heme_WGS_TO_{timestamp}.tgz.

When running an analysis on the DRAGEN server via SSH, Illumina recommends that you use a terminal multiplexer utility, which allows you to resume analysis in the event of a disconnection from the DRAGEN server.

Uninstall Heme pipeline

To uninstall the Heme pipeline, run the following command:

Executing the uninstall script removes the following assets:

  • All scripts, including:

    • run_Heme_WGS_TO_{version}.sh

    • check_Heme_WGS_TO_{version}.sh

    • uninstall_Heme_WGS_TO_{version}.sh

    • The application installed under DRAGEN Application Manager

If the uninstall script is executed with the -r or --removeResources flag, dependencies of the application being uninstalled will be removed if no other applications depend on them.

You are not required to uninstall DRAGEN Application Manager, Docker, or the DRAGEN server software.

To remove Docker, review the install instructions for your operating system in the Docker documentation

Templates

Description

Sample Sheet templates for the Heme pipeline for standalone DRAGEN server and ICA manual launch analysis can be found in the table below. For auto-launch compatible sample sheets, use BaseSpace Run Planner.

The Heme pipeline is compatible with several instruments and assay workflows (standard, XP), each of which has implications for the sample sheet.

Templates

Sample sheet templates contain all required fields, including index sequences in the proper orientation for all indexes from a given library prep kit. The templates are provided as a starting point for creating a sample sheet manually when launching analysis on a standalone DRAGEN server or on ICA using manual launch.

*Lane numbers cannot exceed what is supported by the flow cell in use.

Advanced Topics

Demultiplex only option

To break up the workflow, run the software with the --demultiplexOnly option. The pipeline performs FASTQ generation with the default settings or with the settings specified in the sample sheet. Subsequent analysis can then start from FASTQ.

CRAM input

Command Line Options

Overview

Command line options

For command-line options, refer to Table 1: Shell Script Command-Line Options for details.

Table 1: Shell Script Command-Line Options

CAUTION: Do not run analyses as the root user as it can lead to permissions issues when managing data generated by the software.

Local Specific Output

Local output management

On DRAGEN server, Nextflow logs are contained in the Work folder in a hierarchical folder structure organized by the tasks in the pipeline_trace.txt. These files are prefixed with "." and hidden from normal view.

The Illumina Connected Insights (ICI) Local platform can be used to interpret and visualize analysis results from a clinical research workflow pipeline on a local DRAGEN server. See Variant Interpretation on-premises via a local DRAGEN server.

The Illumina Connected Insights (ICI) platform can be used to interpret and visualize analysis results from the Heme pipeline. Analysis results can be provided to ICI via a manual upload for local analyses and via auto-ingestion for Illumina Connected Analytics (ICA) analyses.

Refer to the ICI support site page for information on setting up the data upload from ICA or BSSH.


The software may be downloaded and installed by following the installation guide. It may be executed using a local DRAGEN server or on a local computer which launches the analysis in the ICA cloud environment.

The same analysis example above may be completed using the ICA UI by logging into the appropriate domain of your company and project where the Heme pipeline is set up.

Find more information in the ICA Cloud App Launch Guide.

Additional information is available from the ICA support site.

Refer to the following requirements to create sample sheets for running the analysis on ICA with Auto-launch. For sample sheet requirements common between deployments, see Standard Sample Sheet Requirements. Sample sheets can be created using the BaseSpace Run Planning Tool or manually by downloading and editing a sample sheet template.

Refer to [Heme_Data] Section for this section's requirements.

This value is a universal record number (URN). The valid values are described in the

See Mosaic detection for further details on the mosaic small variant caller and the mosaic detection mode, and for a comparison with DRAGEN 4.2 and DRAGEN 4.3 features.

Option
Description
Mode
Downsampling Option
Default Value

DRAGEN outputs variants in the VCF file following the variant normalization conventions described at https://genome.sph.umich.edu/wiki/Variant_Normalization. The normalization of a variant representation in VCF consists of two parts: parsimony and left alignment, pertaining to the nature of a variant's length and position, respectively.

Allele decomposition: By default, phased variants are represented as contiguous individual variant records in the VCF. When phasing can be determined, the FORMAT/GT is phased and the FORMAT/PS contains the coordinate position of the first variant in the set of phased variants; this information indicates which variants have occurred on the same haplotype. DRAGEN offers functionality to merge phased variant records into a single VCF record; please see the section for details.

In some cases, such as complex variants in repetitive regions, some variants cannot be normalized (i.e. converted into a standard representation) or represented uniquely. To counteract this problem, when comparing two VCFs (e.g. a DRAGEN VCF against a truth set VCF), it is recommended to use the RTG vcfeval tool, which performs variant comparisons using a haplotype-aware approach. RTG vcfeval has been adopted as the standard VCF comparison tool by GA4GH and PrecisionFDA, as described in Best Practices for Benchmarking Germline Small Variant Calls in Human Genomes.

Metric
QUAL
GQ (non-homref)
GQ (homref)
QD
GT variant 1
GT variant 2
GT MNV
Relevant Pipeline
Supported in DRAGEN
--sample-sex
Ploidy Estimation
Sample Sex in Small VC

The DRAGEN Heme WGS Tumor Only Pipeline is launched with the bash script called run_Heme_WGS_TO_{version}.sh, which is installed in the /usr/local/bin directory. The bash script is executed on the command line and runs the software using DRAGEN Application Manager. For a full list of command-line options, refer to Command-Line Options.

To launch an analysis, you must provide the --inputType and --inputFolder arguments. The --inputType argument can be bcl, fastq, bam, or cram. When starting from a sequencing system run folder containing BCL files, --inputType must be bcl and --inputFolder is the absolute path to the full run folder. When starting from FASTQ, BAM, or CRAM files, --inputFolder may also be a comma-separated list of folders. If more than one input folder is specified, the --sampleSheet argument must also be provided with the absolute path to a valid sample sheet (refer to Sample Sheet Requirements). If the --sampleSheet argument is not provided, the software checks for a file named SampleSheet.csv in the input folder.
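The argument rules above can be checked before launching. The following is a minimal sketch, not part of the product: the helper name build_launch_args and the example paths are hypothetical, and only the flags and constraints stated in the paragraph above are encoded.

```python
VALID_INPUT_TYPES = {"bcl", "fastq", "bam", "cram"}


def build_launch_args(input_type, input_folders, sample_sheet=None):
    """Assemble launch arguments, raising ValueError if a stated rule is violated."""
    if input_type not in VALID_INPUT_TYPES:
        raise ValueError(f"--inputType must be one of {sorted(VALID_INPUT_TYPES)}")
    if not input_folders:
        raise ValueError("--inputFolder is required")
    if input_type == "bcl":
        # BCL input takes one run folder, given as an absolute path.
        if len(input_folders) != 1 or not input_folders[0].startswith("/"):
            raise ValueError("bcl input needs one absolute run folder path")
    if len(input_folders) > 1 and sample_sheet is None:
        raise ValueError("--sampleSheet is required with more than one input folder")
    if sample_sheet is not None and not sample_sheet.startswith("/"):
        raise ValueError("--sampleSheet must be an absolute path")

    args = ["--inputType", input_type, "--inputFolder", ",".join(input_folders)]
    if sample_sheet is not None:
        args += ["--sampleSheet", sample_sheet]
    return args


# Hypothetical folders and sample sheet location, for illustration only.
example_args = build_launch_args(
    "fastq",
    ["/staging/fastq_batch_a", "/staging/fastq_batch_b"],
    "/staging/SampleSheet.csv",
)
print(" ".join(example_args))
```

The returned list is meant to be appended to the run_Heme_WGS_TO_{version}.sh command. The sketch deliberately leaves the script version unfilled.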

A separate lightweight downloader for Windows, macOS, and Linux operating systems is available at the DRAGEN Installer Download Site.

Additional download information is available at the download site.

Downloader Name
System Requirements

dragen_pipelines

Heme_WGS_TO_4.4.4.62

common

The pipeline also requires DRAGEN Application Manager to be installed, and an installer is included. DRAGEN Application Manager configuration is controlled by the config.toml file located in the /etc/dragen-app-manager directory. See DRAGEN Application Manager for additional information.

Delete output data on the DRAGEN server as soon as possible. For additional information on data output and storage, refer to .

Contact Illumina Customer Care to request a link to the Downloader or visit the DRAGEN Installer Download Site, and confirm that the Genome DRAGEN license is enabled for your server.

For interactive run planning or to create a sample sheet for ICA Autolaunch, use BaseSpace Run Planner to create valid sample sheets for either local or cloud analysis. To set up a run in BaseSpace Run Planner, refer to Sample Sheet Creation in BaseSpace Run Planner.

Users can visit the Sample Sheet guidelines section to learn additional details on required fields and values as they fill in their sample information. Use the lookup table below to select and download the sample sheet template that matches your instrument, assay, and workflow configuration:

Instrument
Workflow
File

When CRAM is used as input, the reference genome used to generate the CRAM files is required. This may be provided using the custom configuration file.

Argument
Required
Description

Common output files for cloud and local pipelines are described in Analysis Output.

Work — (DRAGEN server only) - Contains information and files related to Nextflow execution

.command.log - Contains Nextflow pipeline step execution log.

.command.out - Contains Nextflow pipeline step standard output log.

.command.err - Contains Nextflow pipeline step standard error log.

.exit.code - Contains Nextflow pipeline step execution exit code.

Variant Interpretation on-premises via a local DRAGEN server
manual upload for local analyses
setting up the data upload from ICA or BSSH
set up
ICA Cloud App Launch Guide
ICA support site
Standard Sample Sheet Requirements
[Heme_Data] Section
# header information
chr11 0 246920
chr11 255660 255661

| Option | Description |
| --- | --- |
| --vc-target-coverage | Specifies the maximum number of reads covering any given position. |
| --vc-max-reads-per-active-region | Specifies the maximum number of reads covering a given active region. |
| --vc-max-reads-per-raw-region | Specifies the maximum number of reads covering a given raw region. |
| --vc-min-reads-per-start-pos | Specifies the minimum number of reads with a start position overlapping any given position. |
| --high-coverage-support-mode | Applies the high coverage mode down-sample options if set to true. Enabling this option is recommended for targeted panels with coverage over 1000x, but will slow down run time. |

| Mode | Downsampling Option | Default Value |
| --- | --- | --- |
| Germline | --vc-target-coverage | 500 |
| Germline | --vc-max-reads-per-active-region | 10000 |
| Germline | --vc-max-reads-per-raw-region | 30000 |
| Somatic | --vc-target-coverage | 1000 |
| Somatic | --vc-max-reads-per-active-region | 10000 |
| Somatic | --vc-max-reads-per-raw-region | 30000 |
| High Coverage | --vc-target-coverage | 100000 |
| High Coverage | --vc-max-reads-per-active-region | 200000 |
| High Coverage | --vc-max-reads-per-raw-region | 200000 |
| Mitochondrial | --vc-target-coverage-mito | 40000 |
| Mitochondrial | --vc-max-reads-per-active-region-mito | 200000 |
| Mitochondrial | --vc-max-reads-per-raw-region-mito | 200000 |

chr1 2656216 . A T,C 107.65 PASS
AC=1,1;AF=0.500,0.500;AN=2;DP=12;FS=0.000;MQ=28.95;QD=8.97;SOR=3.056;FractionInformativeReads=0.750
GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB
1/2:0,5,4:0.556,0.444:9:15:177,144,46,122,0,72:-17.704,-14.420,-4.626,-12.220,0.000,-7.244:1.076e+02,1.096e+02,1.465e+01,8.758e+01,1.520e-01,4.082e+01:0.00,34.77,37.77,34.77,69.54,37.77:0,0,1,8:0,0,4,5
chr1 7392258 . C CT,CTTT 234.76 PASS
AC=1,1;AF=0.500,0.500;AN=2;DP=44;FS=0.000;MQ=199.22;QD=5.34;SOR=2.226;FractionInformativeReads=0.659
GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB
1/2:0,15,14:0.517,0.483:29:50:245,256,55,190,0,55:-24.476,-25.634,-5.492,-18.976,0.000,-5.500:2.348e+02,2.513e+02,5.292e+01,1.848e+02,4.401e-05,5.300e+01:0.00,5.00,8.00,5.00,10.00,8.00:0,0,7,22:0,0,17,12
chr1 1029628 . C CGT 49.88 PASS
AC=1;AF=0.500;AN=2;DP=37;FS=7.791;MQ=105.32;MQRankSum=-1.315;QD=1.35;ReadPosRankSum=1.423;SOR=1.510;FractionInformativeReads=0.892;R2_5P_bias=-19.742
GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB:PS
0|1:17,16:0.485:33:48:81,0,50:-8.088,0.000,-5.000:4.988e+01,6.653e-05,5.300e+01:0.00,31.00,34.00:10,7,5,11:11,6,9,7:1029628

chr1 1029629 . A G 50.00 PASS
AC=1;AF=0.500;AN=2;DP=37;FS=1.289;MQ=105.32;MQRankSum=-0.659;QD=1.35;ReadPosRankSum=-0.199;SOR=0.604;FractionInformativeReads=1.000;R2_5P_bias=-24.923
GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB:PS
0|1:16,21:0.568:37:48:85,0,49:-8.477,0.000,-4.934:5.000e+01,6.886e-05,5.234e+01:0.00,34.77,37.77:9,7,10,11:10,6,13,8:1029628
QUAL = -10*log10(posterior genotype probability of a homozygous-reference genotype (GT=0/0))

| Metric | QUAL | GQ (non-homref) | GQ (homref) | QD |
| --- | --- | --- | --- | --- |
| Description | Probability that the site has no variant | Probability that the call is incorrect | Evidence supporting homref call | QUAL normalized by depth |
| Formulation | QUAL = GP(GT=0/0) | GQ = -10*log10(p) | GQ = 10*log10[P(D\|homref)/P(D\|variant)] | QUAL/DP |
| Scale | Unsigned Phred | Unsigned Phred | Signed Phred | Unsigned Phred |
| Numerical example | QUAL=20: 1% chance that there is no variant at the site. QUAL=50: 1 in 1e5 chance that there is no variant at the site. | GQ=3: 50% chance that the call is incorrect. GQ=20: 1% chance that the call is incorrect. | GQ=0: no evidence. GQ>0: evidence favors homref. | |
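The Phred-scaled formulations above can be verified with a few lines of arithmetic. This is an illustrative sketch with hypothetical function names. The QD example reuses the QUAL and DP values from the chr1:2656216 record shown earlier.

```python
import math


def phred_to_probability(score):
    """Convert an unsigned Phred-scaled score to a probability."""
    return 10 ** (-score / 10)


def probability_to_phred(probability):
    """Convert a probability to an unsigned Phred-scaled score."""
    return -10 * math.log10(probability)


def homref_gq(p_data_given_homref, p_data_given_variant):
    """Signed Phred-scaled likelihood ratio used for homref GQ."""
    return 10 * math.log10(p_data_given_homref / p_data_given_variant)


def quality_by_depth(qual, depth):
    """QD is QUAL normalized by depth."""
    return qual / depth


print(phred_to_probability(20))                 # about 0.01 (1% chance of no variant)
print(phred_to_probability(3))                  # about 0.5
print(round(quality_by_depth(107.65, 12), 2))   # 8.97, matching QD in that record
```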

1 39224 . C <NON_REF> . PASS END=39260
GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT
0/0:2,0:2:3:1:0,3,37:0,3,37:3,0

1 39261 . T C,<NON_REF> 15.59 PASS
DP=3;MQ=12.73;MQRankSum=0.736;ReadPosRankSum=0.736;FractionInformativeReads=1.000
GT:AD:AF:DP:F1R2:F2R1:GQ:PL:SPL:ICNT:GP:PRI:SB:MB
0/1:1,2,0:0.667,0.000:3:1,0,0:0,2,0:5:49,0,1,69,7,75:66,0,8:1,0:1.5592e+01,1.5915e+00,5.5412e+00,7.0100e+01,4.3330e+01,8.0068e+01:0.00,34.77,37.77,34.77,69.54,37.77:0,1,0,2:0,1,2,0
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Physical phasing
ID information, where each unique ID within a given sample (but not
across samples) connects records within a phasing group">
chr1 1947645 . C T,<NON_REF> 48.44 PASS
DP=35;MQ=250.00;MQRankSum=4.983;ReadPosRankSum=3.217;FractionInformativeReads=1.000;R2_5P_bias=0.000
GT:AD:AF:DP:F1R2:F2R1:GQ:PL:SPL:ICNT:GP:PRI:SB:MB:PS
0|1:20,15,0:0.429:35:9,7,0:11,8,0:47:83,0,50,572,758,622:255,0,255:19,0:4.844e+01,8.387e-05,5.300e+01,4.500e+02,4.500e+02,4.500e+02:0.00,34.77,37.77,34.77,69.54,37.77:11,9,10,5:12,8,8,7:1947645

chr1 1947648 . G A,<NON_REF> 50.00 PASS
DP=36;MQ=250.00;MQRankSum=5.078;ReadPosRankSum=2.563;FractionInformativeReads=1.000;R2_5P_bias=0.000
GT:AD:AF:DP:F1R2:F2R1:GQ:PL:SPL:ICNT:GP:PRI:SB:MB:PS
1|0:16,20,0:0.556:36:8,9,0:8,11,0:48:85,0,49,734,613,698:255,0,255:16,0:5.000e+01,7.067e-05,5.204e+01,4.500e+02,4.500e+02,4.500e+02:0.00,34.77,37.77,34.77,69.54,37.77:10,6,11,9:8,8,12,8:1947645
chr2 115034 . G C GT:PS 0|1:115034
chr2 115036 . C T GT:PS 0|1:115034
chr2 115034 . GAC CAT GT:PS 0|1:115034
chr2 61569261 . GCACA G     GT:PS 0|1:61569261
chr2 61569266 . C     CGTGG GT:PS 0|1:61569261
chr2 61569263 . ACAC GTGG GT:PS 0|1:61569261
chr1    1771073 .       T       C       .       mnv_component
DP=65;MQ=250.00;FractionInformativeReads=0.892;SoftClipRatio=0.00;MNVTAG=chr1:1771073_TAT->CAC
GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB  0/1:42.70:17,41:0.7069:10,18:7,23:58:10,7,13,28:9,8,22,19

chr1    1771073 .       TAT     CAC     .       PASS
DP=65;MQ=250.00;FractionInformativeReads=0.892;SoftClipRatio=0.00;MNVTAG=chr1:1771073_TAT->CAC;GermlineStatus=Germline_DB
GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB  0/1:42.70:17,41:0.7069:10,18:7,23:58:10,7,13,28:9,8,22,19

chr1    1771075 .       T       C       .       PASS
DP=67;MQ=250.00;FractionInformativeReads=0.881;SoftClipRatio=0.00;STR;RU=AC;RPA=7;MNVTAG=chr1:1771073_TAT->CAC;GermlineStatus=Germline_DB
GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB  1/1:62.99:0,59:1.0000:0,29:0,30:59:0,0,24,35:0,0,32,27

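| GT variant 1 | GT variant 2 | GT MNV | Relevant Pipeline | Supported in DRAGEN |
| --- | --- | --- | --- | --- |
| 0\|1 | 0\|1 | 0/1 | Germline and Somatic | Yes in 4.0 |
| 0/1 | 1/1 | 1/2 | Germline | No |
| 0/1 | 1/2 | 1/2 | Germline | No |
| 1/1 | 1/1 | 1/1 | Germline | Yes in 4.2 |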

-------------------------------------------------------------- H0 ( REF ) 
-----------------x---------------------------y---------------- H1
-----------------x---------------------------y---------------- H1
-----------------x---------------------------y-----------------H1
-----------------x---------------------------y---------------- H1
---------------------------------------------y---------------- H2
-----------------x----------------------------y--------------- H1
----------------------------------------------z--------------- H2

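| --sample-sex | Ploidy Estimation | Sample Sex in Small VC |
| --- | --- | --- |
| male | Not relevant | Male |
| female | Not relevant | Female |
| none | Not relevant | None |
| auto (default) | XY | Male |
| auto (default) | XX | Female |
| auto (default) | Everything else | None |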
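The mapping from the --sample-sex setting and the ploidy estimate to the sample sex used by the small variant caller can be written as a small lookup. This is a sketch of the listed combinations only. The function name is hypothetical, and "X0" below is just an example of a value that falls under "everything else".

```python
def resolve_sample_sex(sample_sex="auto", ploidy_estimate=None):
    """Return the sample sex used in small variant calling: Male, Female, or None."""
    setting = sample_sex.lower()
    if setting in ("male", "female", "none"):
        # An explicit setting wins; the ploidy estimate is not relevant.
        return setting.capitalize()
    if setting == "auto":
        # Only XY and XX map to a sex; everything else maps to None.
        return {"XY": "Male", "XX": "Female"}.get(ploidy_estimate, "None")
    raise ValueError(f"unsupported --sample-sex value: {sample_sex}")


print(resolve_sample_sex("male", "XX"))   # Male
print(resolve_sample_sex("auto", "XY"))   # Male
print(resolve_sample_sex("auto", "X0"))   # None
```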

##FORMAT=<ID=SQ,Number=A,Type=Float,Description="Somatic quality">
chrM    513     .       GCA     G       .       PASS    DP=4937;MQ=235.28;FractionInformativeReads=0.883
GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB  1/1:95.46:33,4327:0.992:7,1081:26,3246:4360:31,2,2371,1956:10,23,2811,1516

chrM    7028    .       C       T       .       PASS    DP=8868;MQ=60.19;FractionInformativeReads=0.993 
GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB  0/1:21.48:8622,181:0.021:4190,82:4432,99:8803:4344,4278,94,87:5032,3590,101,80
chrM 2623 . A G . PASS DP=18772;MQ=111.77 GT:AD:AF:DP:FT:SQ:F1R2:F2R1 0/0:6841,7:0.001:4334:weak_evidence:0:.:. 0/1:6736,2053:0.234:8789:PASS:21.32:3394,1060:3342,993 0/0:6086,9:0.001:5613:weak_evidence:0:.:.

| Downloader Name | System Requirements |
| --- | --- |
| Heme_WGS_TO_{version}_Downloader_unix | x86_64 platform with glibc 2.25+ |
| Heme_WGS_TO_{version}_Downloader_mac | arm64 macOS |
| Heme_WGS_TO_{version}_Downloader_windows.exe | 64-bit Windows 10+ |

/usr/local/bin/check_Heme_WGS_TO_{version}.sh
Checking system configuration...OK!
Now running a test execution of the pipeline.
This could take up to 15 minutes...
Verifying analysis output.
Successfully validated test analysis results.
SUCCESS!
DRAGEN Heme WGS Tumor Only Pipeline is correctly configured and ready for use.
/usr/local/bin/uninstall_Heme_WGS_TO_{version}.sh
downloaded
installation guide

| Argument | Required | Description |
| --- | --- | --- |
| --inputType | Yes | Possible values include bcl, fastq, bam, cram. |
| --inputFolder | Yes | Input folder containing {input type} files. Multiple {input type, except bcl} folders can be specified as a comma-separated list. |
| --sampleSheet | No | Full path to the sample sheet file. If the sample sheet is named SampleSheet.csv and is located in the run or fastq folder (depending on how the analysis is initiated), this argument is not required. |
| --analysisFolder | No | Full path to the alternative analysis folder. Default is /staging/DRAGEN_Heme_WGS_Tumor_Only_Pipeline_{version}_Analysis_{datetimestamp} if not specified. This folder must have enough available free space for the analysis and be on an NVMe SSD partition to achieve high performance. |
| --sampleIDs | No | The comma-delimited sample IDs that are processed by the run. For example, Sample_1,Sample_2. |
| --referenceGenome | No | Specify the reference genome to use for alignment. Possible values: hg38 or hs37d5_chr. Default is hg38. |
| --disableOraCompression | No | Specify to disable Ora compression. |
| --demultiplexOnly | No | Demultiplex to generate FASTQ files only without further analysis. |
| --customResourceDir | No | Provide custom resource directory path. |
| --customConfig | No | Provide custom config file path. |
| --keepFullWorkDir | No | Copy the entire work directory to the analysis output folder. Default behavior is to copy only the Nextflow logs. |
| --version | No | Displays the version of the software, and then exits. |
| --help | No | Displays the help text. |

Advanced Topics

Overview

The pipeline supports advanced use cases:

  • Selected custom parameters may be configured using a configuration file, and associated custom pipeline resource files.

  • The pipeline supports automatic data streaming from instruments and automatic launch of analysis in ICA, followed by tertiary interpretation in ICI.

  • The pipeline supports mix flow cells where different assays are sequenced in the same flow cell.

Dragen Server
ICA Cloud
ICA Cloud
Mosaic detection
https://genome.sph.umich.edu/wiki/Variant_Normalization
Best Practices for Benchmarking Germline Small Variant Calls in Human Genomes
Command-Line Options
Sample Sheet Requirements
DRAGEN Installer Download Site
DRAGEN Resource Files
DRAGEN Application Manager
Illumina Instrument Control Computer Security and Networking
DRAGEN Installer Download Site
BaseSpace Run Planner
Sample Sheet Creation in BaseSpace Run Planner
Sample Sheet guidelines
Analysis Output
Release Information
custom configuration file

De Novo Small Variant Filtering

The filtering step identifies de novo variant calls from the joint calling workflow that fall in regions with ploidy changes. Because de novo calling can have reduced specificity in regions where at least one of the pedigree members shows non-diploid genotypes, de novo variant filtering marks the relevant variants and can thus improve the specificity of the call set.

Based on the structural and copy number variant calls of the pedigree, the FORMAT/DN field in the proband is changed from the original DeNovo value to DeNovoSV or DeNovoCNV if the de novo variant overlaps with a ploidy-changing SV or CNV, respectively. All other variant details remain unchanged, and all variants of the input VCF are also present in the filtered output VCF. Structural or copy number variants that result in no change of ploidy, such as inversions, are not considered in the filtering. As an example, consider the following de novo SNV call in the input VCF:

chr1 234710899 . T C 44.74 PASS
AC=1;AF=0.167;AN=6;DP=73;FS=4.720;MQ=250.00;MQRankSum=5.310;QD=1.15;ReadPosRankSum=1.366;SOR=0.251
GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GL:GP:PP:DQ:DN
0/1:21,18:0.462:39:48:PASS:14,10:7,8:84,0,50:-8.427,0,-5:4.950e+01,7.041e-05,5.300e+01:15,0,120:3.2280e-01:DeNovo
0/0:13,0:0.000:11:30:PASS:.:.:0,30,450:.:.:10,0,227
0/0:25,0:0.000:22:60:PASS:.:.:0,60,899:.:.:0,33,227

If this call overlaps an SV duplication in the proband, mother, or father, it is represented in the filtered output VCF as follows:

chr1 234710899 . T C 44.74 PASS
AC=1;AF=0.167;AN=6;DP=73;FS=4.720;MQ=250.00;MQRankSum=5.310;QD=1.15;ReadPosRankSum=1.366;SOR=0.251
GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GL:GP:PP:DQ:DN
0/1:21,18:0.462:39:48:PASS:14,10:7,8:84,0,50:-8.427,0,-5:4.950e+01,7.041e-05,5.300e+01:15,0,120:3.2280e-01:DeNovoSV
0/0:13,0:0.000:11:30:PASS:.:.:0,30,450:.:.:10,0,227
0/0:25,0:0.000:22:60:PASS:.:.:0,60,899:.:.:0,33,227
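The relabeling rule can be sketched as an interval overlap check. This is only an illustration under stated assumptions: the function name, the 1-based inclusive interval convention, the example duplication coordinates, and the choice to test SVs before CNVs when both overlap are all hypothetical and not taken from DRAGEN.

```python
def relabel_de_novo(dn_value, chrom, pos, sv_regions, cnv_regions):
    """Return the FORMAT/DN value after checking ploidy-changing SV and CNV overlaps.

    sv_regions and cnv_regions are lists of (chrom, start, end) tuples,
    assumed 1-based and inclusive, covering ploidy-changing calls from any
    pedigree member.
    """
    if dn_value != "DeNovo":
        return dn_value  # only original DeNovo calls are relabeled

    def overlaps(regions):
        return any(c == chrom and start <= pos <= end for c, start, end in regions)

    if overlaps(sv_regions):   # assumption: SV overlap is checked first
        return "DeNovoSV"
    if overlaps(cnv_regions):
        return "DeNovoCNV"
    return dn_value


# Hypothetical duplication spanning the example SNV at chr1:234710899.
duplications = [("chr1", 234700000, 234720000)]
print(relabel_de_novo("DeNovo", "chr1", 234710899, duplications, []))
```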

The following is an example command line for running the de novo filtering, based on the files returned by the joint calling workflows:

dragen \
--dn-enable-denovo-filtering true \
--dn-input-joint-vcf <JOINT_SMALL_VARIANT_VCF> \
--dn-output-joint-vcf <OUTPUT_VCF> \
--dn-sv-vcf <JOINT_SV_VCF> \
--dn-cnv-vcf <JOINT_CNV_VCF> \
--enable-map-align false

De Novo Small Variant Filtering Options

The following options are used for de novo variant filtering:

  • --dn-input-vcf---Joint small variant VCF from the de novo calling step to be filtered.

  • --dn-output-vcf---File location to which the filtered VCF should be written. If not specified, the input VCF is overwritten.

  • --dn-sv-vcf---Joint structural variant VCF from the SV calling step. If omitted, checks with overlapping structural variants are skipped.

  • --dn-cnv-vcf---Joint copy number variant VCF from the CNV calling step. If omitted, checks with overlapping copy number variants are skipped.

Germline Small Variant Hard Filtering

DRAGEN provides post-VCF variant filtering based on annotations present in the VCF records. Default and non-default variant hard filtering are described below. Because DRAGEN's algorithms incorporate the hypothesis of correlated errors within the core of the variant caller, the pipeline is better able to distinguish true variants from noise, and the dependency on post-VCF filtering is substantially reduced. For this reason, the default post-VCF filtering in DRAGEN is very simple.

Default Small Variant Hard Filtering

The default filters in the germline pipeline are as follows:

  • ##FILTER=<ID=DRAGENSnpHardQUAL,Description="Set if true:QUAL < 10.41 (3.0103 when ML recalibration is enabled)">

  • ##FILTER=<ID=DRAGENIndelHardQUAL,Description="Set if true:QUAL < 7.83 (3.0103 when ML recalibration is enabled)">

  • ##FILTER=<ID=MosaicHardQUAL,Description="Set if true:QUAL < 3.0103">

  • ##FILTER=<ID=MosaicLowAF,Description="Set if true:AF < 0.2 (0.1 if depth > 100x)">

  • ##FILTER=<ID=LowDepth,Description="Set if true:DP <= 1">

  • ##FILTER=<ID=PloidyConflict,Description="Genotype call from variant caller not consistent with chromosome ploidy">

  • DRAGENSnpHardQUAL and DRAGENIndelHardQUAL: For all contigs other than the mitochondrial contig, the default hard filtering consists of thresholding the QUAL value only. Different default QUAL thresholds are applied to SNPs and indels.

  • MosaicHardQUAL and MosaicLowAF: For all MOSAIC tagged variants, the default hard filtering consists of thresholding the QUAL value and the FORMAT:AF value.

  • LowDepth: This filter is applied to all variant calls with INFO/DP <= 1.

  • PloidyConflict: This filter is applied to all variant calls on chrY of a female subject, if female is specified on the DRAGEN command line, or if female is detected by the ploidy estimator.

Non-Default Small Variant Hard Filtering

DRAGEN supports basic filtering of variant calls as described in the VCF standard. You can apply any number of filters with the --vc-hard-filter option, which takes a semicolon-delimited list of expressions, as follows:

<filter ID>:<snp|indel|all>:<list of criteria>,

where the list of criteria is itself a list of expressions, delimited by the || (OR) operator in this format:

<annotation ID> <comparison operator> <value>

The meaning of these expression elements is as follows:

  • filterID---The name of the filter, which is entered in the FILTER column of the VCF file for calls that are filtered by that expression. The space character should be avoided.

  • snp/indel/all/homref/mosaic---The subset of variant calls to which the expression should be applied.

  • annotation ID---The variant call record annotation for which values should be checked for the filter. Supported annotations include FS, MQ, MQRankSum, QD, and ReadPosRankSum.

  • comparison operator---The numeric comparison operator to use for comparing to the specified filter value. Supported operators include <, ≤, =, ≠, ≥, and >.

For example, the following expression marks with the label "SNPFilter" any SNPs with FS < 2.1 or with MQ < 100, and marks with "indelFilter" any indels with FS < 2.2 or with MQ < 110:

--vc-hard-filter="SNPFilter:snp:FS < 2.1 || MQ < 100; indelFilter:indel:FS < 2.2 || MQ < 110"

This example is for illustration purposes only and is NOT recommended for use with DRAGEN V3 output. Illumina recommends using the default hard filters. The only supported operation for combining value comparisons is OR, and there is no support for arithmetic combinations of multiple annotations. More complex expressions may be supported in the future.
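To make the expression grammar concrete, the following sketch parses a --vc-hard-filter string into its parts and evaluates the OR-combined criteria against a set of annotation values. It is not DRAGEN code: the function names are hypothetical, and the ASCII operator spellings (<=, >=, !=) accepted by the pattern are an assumption.

```python
import operator
import re

# Assumed ASCII spellings of the comparison operators.
OPERATORS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    "!=": operator.ne, ">=": operator.ge, ">": operator.gt,
}
CRITERION = re.compile(r"^\s*(\w+)\s*(<=|>=|!=|<|>|=)\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_hard_filter(expression):
    """Split a semicolon-delimited filter list into id, subset, and criteria."""
    filters = []
    for part in expression.split(";"):
        if not part.strip():
            continue
        filter_id, subset, criteria = (f.strip() for f in part.split(":", 2))
        parsed = []
        for criterion in criteria.split("||"):
            match = CRITERION.match(criterion)
            if match is None:
                raise ValueError(f"cannot parse criterion: {criterion!r}")
            annotation, op, value = match.groups()
            parsed.append((annotation, op, float(value)))
        filters.append({"id": filter_id, "subset": subset, "criteria": parsed})
    return filters


def is_filtered(criteria, annotations):
    """Criteria are OR-combined: one satisfied comparison marks the call."""
    return any(
        OPERATORS[op](annotations[name], value)
        for name, op, value in criteria
        if name in annotations
    )


filters = parse_hard_filter(
    "SNPFilter:snp:FS < 2.1 || MQ < 100; indelFilter:indel:FS < 2.2 || MQ < 110"
)
print(filters[0]["id"], is_filtered(filters[0]["criteria"], {"FS": 4.72, "MQ": 250.0}))
```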

The --vc-hard-filter value that corresponds to the default hard filters depends on the sample depth. If the sample's depth is less than or equal to 100x, the value is 'DRAGENSnpHardQUAL:snp: QUAL < 3.0103; DRAGENIndelHardQUAL:indel: QUAL < 3.0103; LowDepth:all: DP <= 1; MosaicHardQUAL:mosaic: QUAL < 3.0103; MosaicLowAF:mosaic: AF < 0.2'. Otherwise, the value is 'DRAGENSnpHardQUAL:snp: QUAL < 3.0103; DRAGENIndelHardQUAL:indel: QUAL < 3.0103; LowDepth:all: DP <= 1; MosaicHardQUAL:mosaic: QUAL < 3.0103; MosaicLowAF:mosaic: AF < 0.1'. We strongly recommend not specifying the --vc-hard-filter option unless necessary.
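The filter syntax above can be illustrated with a small Python sketch. This is not DRAGEN code: the function names `parse_hard_filters` and `failed_filters` are hypothetical, and the ASCII operator spellings `<=`, `>=`, and `!=` are assumed to stand in for ≤, ≥, and ≠. The sketch only shows how a semicolon-delimited list of OR-ed criteria can be read and evaluated against one record's annotations.

```python
import operator

# Comparison operators accepted in this sketch (ASCII spellings are an assumption).
OPS = {"<=": operator.le, ">=": operator.ge, "!=": operator.ne,
       "<": operator.lt, ">": operator.gt, "=": operator.eq}

def parse_hard_filters(spec):
    """Parse 'ID:subset:crit || crit; ID:subset:...' into (id, subset, criteria) tuples."""
    filters = []
    for expr in spec.split(";"):
        expr = expr.strip()
        if not expr:
            continue
        filter_id, subset, criteria = (part.strip() for part in expr.split(":", 2))
        parsed = []
        for crit in criteria.split("||"):
            crit = crit.strip()
            # Try two-character operators first so "<=" is not read as "<".
            for sym in ("<=", ">=", "!=", "<", ">", "="):
                if sym in crit:
                    name, value = crit.split(sym, 1)
                    parsed.append((name.strip(), sym, float(value)))
                    break
            else:
                raise ValueError(f"no comparison operator in {crit!r}")
        filters.append((filter_id, subset, parsed))
    return filters

def failed_filters(filters, variant_type, annotations):
    """Return the IDs of filters that apply to variant_type and match any criterion."""
    hits = []
    for filter_id, subset, criteria in filters:
        if subset not in ("all", variant_type):
            continue
        # Criteria within one filter are combined with OR.
        if any(name in annotations and OPS[sym](annotations[name], value)
               for name, sym, value in criteria):
            hits.append(filter_id)
    return hits

spec = "SNPFilter:snp:FS < 2.1 || MQ < 100; indelFilter:indel:FS < 2.2 || MQ < 110"
filters = parse_hard_filters(spec)
print(failed_filters(filters, "snp", {"FS": 5.0, "MQ": 60.0}))  # ['SNPFilter']
```

In this sketch a record that matches no filter returns an empty list, which corresponds to a PASS entry in the FILTER column.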

Orientation Bias Filter

The orientation bias filter is designed to reduce noise typically associated with the following:

  • Pre-adapter artifacts introduced during genomic library preparation (eg, a combination of heat, shearing, and metal contaminants can result in the 8-oxoguanine base pairing with either cytosine or adenine, ultimately leading to G→T transversion mutations during PCR amplification), or

  • FFPE (formalin-fixed paraffin-embedded) artifact. FFPE artifacts stem from formaldehyde deamination of cytosines, which results in C to T transition mutations.

The orientation bias filter can only be used on somatic pipelines. To enable the filter, set the --vc-enable-orientation-bias-filter option to true. The default is false.

The artifact type to be filtered can be specified with the --vc-orientation-bias-filter-artifacts option. The default is C/T,G/T, which corresponds to FFPE and OxoG artifacts, respectively. Valid values include C/T, or G/T, or C/T,G/T,C/A.

An artifact (or an artifact and its reverse complement) cannot be listed twice. For example, C/T,G/A is not valid, because C→T and G→A are reverse complements.

The orientation bias filter adds the following information:

  • ##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">

  • ##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">

  • ##FORMAT=<ID=OBC,Number=1,Type=String,Description="Orientation Bias Filter base context">

  • ##FORMAT=<ID=OBPa,Number=1,Type=String,Description="Orientation Bias prior for artifact">

  • ##FORMAT=<ID=OBParc,Number=1,Type=String,Description="Orientation Bias prior for reverse compliment artifact">

  • ##FORMAT=<ID=OBPsnp,Number=1,Type=String,Description="Orientation Bias prior for real variant">

Please note that the OBF filter runs as a standalone process after DRAGEN is complete. The VC metrics that are computed as part of DRAGEN SNV caller will not be updated and will not reflect the additional variants that are filtered in this stage.

Pedigree Analysis

The gVCF file contains information on the variant positions and positions determined to be homozygous to the reference genome. For homozygous regions, the gVCF file includes statistics that indicate how well reads support the absence of variants or alternative alleles. Contiguous homozygous runs of bases with similar levels of confidence are grouped into blocks, referred to as hom-ref blocks. Not all entries in the gVCF are contiguous. A reference might contain gaps that are not covered by either a variant line or a hom-ref block. Gaps correspond to regions that are not callable. A region is not callable if there is not at least one read mapped to the region with a MAPQ score above zero.
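As an illustration of the gap concept, the following Python sketch lists the intervals of a region that no gVCF record covers. It is a simplified model under stated assumptions: each record is reduced to a (POS, END) pair, where END equals POS for a single-base variant line and INFO/END for a hom-ref block, and the function name `find_gaps` and the example coordinates are hypothetical.

```python
def find_gaps(records, region_start, region_end):
    """Return 1-based inclusive intervals not covered by any gVCF record.

    records: iterable of (pos, end) pairs on one contig.
    """
    gaps = []
    cursor = region_start  # first position not yet known to be covered
    for pos, end in sorted(records):
        if pos > cursor:
            gaps.append((cursor, pos - 1))
        cursor = max(cursor, end + 1)
    if cursor <= region_end:
        gaps.append((cursor, region_end))
    return gaps

# A hom-ref block, a variant line, and a second hom-ref block (illustrative values).
records = [(10288, 10290), (10291, 10291), (10300, 10320)]
print(find_gaps(records, 10280, 10325))
# [(10280, 10287), (10292, 10299), (10321, 10325)]
```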

Joint Analysis Output Format

There are two available joint analysis output files:

  • Multisample VCF--A VCF file containing a column with genotype information for each of the input samples according to the input variants.

  • Multisample gVCF--A gVCF file augmenting the content of a multisample VCF file, similar to how a gVCF file augments a VCF file for a single sample. In between variant sites, the multisample gVCF contains statistics that describe the level of confidence that each sample is homozygous to the reference genome. Multisample gVCF is a convenient format for combining the results from a pedigree or small cohort into a single file. If using a large number of samples, fluctuation in coverage or variation in any of the input samples creates a new hom-ref block, which causes a highly fragmented block structure and a large output file that can be slow to create.

The multisample gVCF output is only available in the pedigree-based analysis.

The following example shows a single line from a multi-sample VCF where one sample has a variant, and the other two samples are in a gVCF gap. Gaps are represented by "./.:.:".

1 605262 . G A 13.41 DRAGENHardQUAL
AC=2;AF=1.000;AN=2;DP=2;FS=0.000;MQ=14.00;QD=6.70;SOR=0.693
GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP ./.:.:.:.:.:LowDepth
1/1:0,2:1.000:2:4:PASS:0,0:0,2:50,6,0:1.383e+01,4.943e+00,1.951e+00
./.:.:.:.:.:LowDepth

Hom-ref Blocks FORMAT Fields

In hom-ref blocks, the following FORMAT fields are calculated uniquely.

  • FORMAT/DP--In a single sample gVCF, the FORMAT/DP reported at a hom-ref position is the median DP in that band. In a multisample gVCF, the FORMAT/DP reported at a hom-ref position is the MIN_DP from hom-ref calls.

  • FORMAT/AD--In single sample gVCF, values represent the position in the band where DP=median DP. In the multisample gVCF, AD values at hom-ref positions are copied from the single sample gVCF.

  • FORMAT/AF--Values are based on FORMAT/AD.

  • FORMAT/PL--Values represent the Phred likelihoods per genotype hypothesis. For hom-ref blocks, each value in FORMAT/PL represents the minimum value across all positions within the band.

  • FORMAT/SPL and FORMAT/ICNT--Parameters reported in the gVCF records, including both hom-ref blocks and variant records. The parameters are used to compute the confidence score of a variant being de novo in the proband of a trio. For SNP, FORMAT/PL and FORMAT/SPL are both used as input to the DeNovo Caller. FORMAT/PL represents Phred likelihoods obtained from the genotyper, if the genotyper is called. FORMAT/SPL represents Phred likelihoods obtained from column-wise estimation, pregraph. Each value in FORMAT/SPL represents the minimum across all positions within the band. For INDEL, the PL value is computed in the joint pedigree calling step based on the FORMAT/ICNT reported in the gVCF file. FORMAT/ICNT consists of two values. The first value is the number of reads with no indels at the position, and the second value is the number of reads with indels at the position. Each value in FORMAT/ICNT represents the maximum of the value across all positions within the band.

In the following example hom-ref block, ICNT provides information on whether each sample contains an Indel at the position of interest. If the proband contains an indel at the position and the ICNT of the parents does not indicate any read supporting an indel, then the confidence score is high for the proband to have an indel de novo call at the position.

chr1 10288 . C <NON_REF> . PASS END=10290
GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT
0/0:131,4:135:69:132:0,69,1035:0,125,255:23,1

chr1 10291 . C
T,<NON_REF> 38.45 PASS
DP=100;MQ=24.72;MQRankSum=0.733;ReadPosRankSum=4.112;FractionInformativeReads=0.600;R2_5P_bias=0.000
GT:AD:AF:DP:F1R2:F2R1:GQ:PL:SPL:ICNT:GP:PRI:SB:MB
0/1:28,32,0:0.533,0.000:60:20,21,0:8,11,0:15:73,0,12,307,157,464:255,0,255:23,10:3.8452e+01,1.3151e-01,1.5275e+01,3.0757e+02,1.9173e+02,4.5000e+02:0.00,34.77,37.77,34.77,69.54,37.77:4,24,7,25:8,20,14,18

SPL and ICNT values are specific to DRAGEN. The GATK variant caller does not output SPL and ICNT values.

In a single sample gVCF, FORMAT/DP reported at a hom-ref position is the median DP in the band. The minimum is also computed and printed as MIN_DP for the band.

In the multisample gVCF, MIN_DP from hom-ref calls is printed as FORMAT/DP, and AD is just copied from the gVCF. Therefore, at a hom-ref position in the multi-sample gVCF output, the DP is not necessarily going to be the sum of AD.
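The difference between DP and the sum of AD can be checked on the example hom-ref block shown earlier. The sketch below is only illustrative; the helper `parse_sample` is a hypothetical name, and it simply pairs the colon-separated FORMAT keys with one sample's values.

```python
def parse_sample(format_keys, sample_value):
    """Pair colon-separated FORMAT keys with one sample's values."""
    return dict(zip(format_keys.split(":"), sample_value.split(":")))

# FORMAT and sample columns of the example hom-ref block at chr1:10288.
fields = parse_sample("GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT",
                      "0/0:131,4:135:69:132:0,69,1035:0,125,255:23,1")

ad_sum = sum(int(x) for x in fields["AD"].split(","))
single_sample_dp = int(fields["DP"])     # median DP of the band
multisample_dp = int(fields["MIN_DP"])   # printed as FORMAT/DP in the multisample gVCF

print(ad_sum, single_sample_dp, multisample_dp)  # 135 135 132
```

Here the AD values still sum to 135, while the multisample gVCF would report a DP of 132 for the same block.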

Pedigree Mode

Use pedigree mode to jointly analyze samples from related individuals and to perform de novo calling.

Pedigree Mode Options

The following parameters are available.

  • --enable-joint-genotyping To run the Joint Genotyper, set to true.

  • --output-directory The output directory. --output-directory is required.

  • --output-file-prefix The prefix used to label all output files. --output-file-prefix is required.

  • -r The directory where the hash table resides.

  • --variant Specifies the path to a single gVCF file. You can specify multiple gVCF files using multiple --variant options. The joint genotyper output depends on the order of the input gVCF files passed by the --variant command line parameter. It is recommended to use the same input order when re-analyzing gVCF files to ensure the output is consistent with previous runs.

  • --pedigree-file Specify the path to a pedigree file that describes the relationship between samples. It is possible to run JointGenotyper without a pedigree file on unrelated samples on versions prior to DRAGEN v3.10. It is not recommended for gVCF variant calls on DRAGEN v3.10 or later.

To invoke pedigree mode, set the --enable-joint-genotyping option to true. Use the --pedigree-file option to specify the path to a pedigree file that describes the relationship between samples.

The pedigree file must be a tab-delimited text file with the file name ending in the .ped extension. The following information is required.

Column Header
Description

Family_ID

The pedigree identifier.

Individual_ID

The ID of the individual.

Paternal_ID

The ID of the individual's father. If the founder, the value is 0.

Maternal_ID

The ID of the individual's mother. If the founder, the value is 0.

Sex

The sex of the sample. If male, the value is 1. If female, the value is 2.

Phenotype

The genetic data of the sample. If unknown, the value is 0. If unaffected, the value is 1. If affected, the value is 2.

The following is an example of an input pedigree file.

#Family_ID Individual_ID Paternal_ID Maternal_ID Sex Phenotype
FAM001 NA12877_Father 0 0 1 1
FAM001 NA12878_Mother 0 0 2 1
FAM001 NA12882_Proband NA12877_Father NA12878_Mother 2 2
FAM001 NA12883_Proband NA12877_Father NA12878_Mother 1 0
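A pedigree file in this layout can be read with a few lines of Python. The sketch below is an illustration, not part of DRAGEN: the function names `read_pedigree` and `find_trios` are hypothetical, and a trio is assumed to be any individual whose father and mother are both listed in the same file.

```python
import csv
import io

# The example pedigree from the text, with tab delimiters.
PED_TEXT = (
    "#Family_ID\tIndividual_ID\tPaternal_ID\tMaternal_ID\tSex\tPhenotype\n"
    "FAM001\tNA12877_Father\t0\t0\t1\t1\n"
    "FAM001\tNA12878_Mother\t0\t0\t2\t1\n"
    "FAM001\tNA12882_Proband\tNA12877_Father\tNA12878_Mother\t2\t2\n"
    "FAM001\tNA12883_Proband\tNA12877_Father\tNA12878_Mother\t1\t0\n"
)

COLUMNS = ("family", "individual", "father", "mother", "sex", "phenotype")

def read_pedigree(handle):
    """Read tab-delimited pedigree rows, skipping blank and '#' header lines."""
    rows = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        if len(row) < len(COLUMNS):
            raise ValueError(f"expected 6 tab-delimited columns, got {len(row)}")
        rows.append(dict(zip(COLUMNS, row)))
    return rows

def find_trios(rows):
    """Return (child, father, mother) for every child with both parents in the file."""
    known = {r["individual"] for r in rows}
    return [(r["individual"], r["father"], r["mother"])
            for r in rows
            if r["father"] in known and r["mother"] in known]

rows = read_pedigree(io.StringIO(PED_TEXT))
trios = find_trios(rows)
for child, father, mother in trios:
    print(child, father, mother)
```

Founders carry 0 in both parent columns, so they never match a listed individual and are not reported as children.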

De Novo Calling

The De Novo Caller identifies all the trios within the pedigree and generates a de novo score for each child. The De Novo Caller supports multiple trios within a single pedigree. Pedigree Mode supports de novo calling for small, structural, and copy number variants.

Pedigree Mode is run in multiple steps. The following is an example workflow for a trio using FASTQ input.

  1. Run single sample alignment and variant calling to generate the following per-sample outputs, which are the inputs for Pedigree Mode.

    • gVCF files for the Small Variant Caller.

    • *.tn.tsv files for the Copy Number Caller.

    • BAM files for the Structural Variant Caller.

Small Variant DeNovo Calling

The Small Variant De Novo Caller considers a trio of samples at a time. The samples are related via a pedigree file. The Small Variant De Novo Caller determines all positions that have a Mendelian conflict based on the genotype from the individual sample gVCFs. Sex chromosomes in males are treated as haploid apart from the PAR regions, which are treated as diploid.

Each of those positions is then processed through the Pedigree Caller to compute a joint posterior probability matrix for the possible genotypes. The probabilities are used to determine whether the proband has a de novo variant with a DQ confidence score. All three subjects are assumed to have an independent error probability.

At positions where the original genotype from the gVCFs shows a double Mendelian conflict (eg, 0/0+0/0->1/1 or 1/1+1/1->0/0), the genotypes of the trio samples can be adjusted to the highest joint posterior probability that has at least one Mendelian conflict.

The DQ formula is DQ = -10log10(1 - Pdenovo).

Pdenovo is the sum of all entries in the joint posterior probability matrix with one or more Mendelian conflicts.
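The DQ formula can be worked through in a short Python sketch. The joint posterior below is an invented three-entry example, not DRAGEN output, and the names `is_conflict`, `p_denovo`, and `de_novo_quality` are hypothetical. Genotypes are modeled as diploid allele pairs, which is a simplifying assumption.

```python
import math

def is_conflict(father, mother, child):
    """True if the child's genotype cannot be formed from one allele of each parent."""
    a, b = child
    return not ((a in father and b in mother) or (b in father and a in mother))

def p_denovo(joint_posterior):
    """Sum the posterior entries that contain a Mendelian conflict."""
    return sum(p for (father, mother, child), p in joint_posterior.items()
               if is_conflict(father, mother, child))

def de_novo_quality(p):
    """DQ = -10 * log10(1 - Pdenovo)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("Pdenovo must be in [0, 1)")
    return -10.0 * math.log10(1.0 - p)

# Hypothetical joint posterior keyed by (father, mother, child) genotypes.
posterior = {
    ((0, 0), (0, 0), (0, 1)): 0.90,  # conflict: alternate allele absent in both parents
    ((0, 0), (0, 0), (0, 0)): 0.06,  # consistent
    ((0, 1), (0, 0), (0, 1)): 0.04,  # consistent: inherited from the father
}

p = p_denovo(posterior)
print(round(p, 2), round(de_novo_quality(p), 2))  # 0.9 10.0
```

Read in the other direction, a DQ value of 0.05 corresponds to a Pdenovo of about 0.011, because Pdenovo = 1 - 10^(-DQ/10).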

In the GT overwrite step, it is possible for the GT of the parents to be overwritten. In the case of multiple trios, the GT of the parents is based on the last trio processed. The trios are processed in the order they are listed in the pedigree file. DRAGEN currently does not add an annotation in the VCF in cases where the GT was overwritten.

The multisample VCF output file is annotated with FORMAT/DQ and FORMAT/DN fields, which represent a de novo quality score and an associated de novo call. The DN field in the VCF is used to indicate the de novo status for each segment.

The following are the possible values:

  • Inherited--The called trio genotype is consistent with Mendelian inheritance.

  • LowDQ--The called trio genotype is inconsistent with Mendelian inheritance and DQ is less than the de novo quality threshold.

  • DeNovo--The called trio genotype is inconsistent with Mendelian inheritance and DQ is greater than or equal to the de novo quality threshold.
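The three DN values follow directly from Mendelian consistency and the DQ threshold. The sketch below restates that rule in Python; `classify_de_novo` is a hypothetical helper, and the threshold of 0.05 is the documented SNP default when ML recalibration is off.

```python
def classify_de_novo(mendelian_consistent, dq, threshold):
    """Return the DN value for one trio genotype call."""
    if mendelian_consistent:
        return "Inherited"
    return "DeNovo" if dq >= threshold else "LowDQ"

SNP_THRESHOLD = 0.05  # default for SNPs with ML recalibration off

print(classify_de_novo(False, 0.67375, SNP_THRESHOLD))  # DeNovo
print(classify_de_novo(False, 0.01, SNP_THRESHOLD))     # LowDQ
print(classify_de_novo(True, 0.0, SNP_THRESHOLD))       # Inherited
```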

The following is an example VCF line for a trio:

1 16355525 . G A 34.46 PASS AC=1;AF=0.167;AN=6;DP=45;FS=6.69;MQ=108.04;MQRankSum=-0.156;QD=2.46;ReadPosRankSum=0;SOR=0.016 GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP:DPL:DN:DQ 0/1:11,3:0.214:14:39:PASS:8,2:3,1:74,0,47:39.454,0.00053613,49.99:0,1,104:74,0,47:DeNovo:0.67375 0/0:18,0:0:16:48:PASS:.:.:0,48,605:.:0,12,224:0,48,255:.:. 0/0:14,0:0:14:42:PASS:.:.:0,42,490:.:0,5,223:0,42,255:.:.

Pedigree Mode Options

The following command line options are available for de novo small variant calling.

  • --enable-joint-genotyping--Run the joint genotyping caller.

  • --pedigree-file--Specify the path to a pedigree file that describes the relationship between samples. It is possible to run JointGenotyper without a pedigree file on unrelated samples, but we do not recommend this anymore for gVCF variant calls from DRAGEN 3.10 or newer.

  • --variant or --variant-list --Specify the gVCF input to the workflow. The pedigree caller can read input gVCF files from an AWS S3 bucket, Azure storage BLOB, or pre-signed URL.

  • --qc-snp-denovo-quality-threshold--Specify the minimum DQ value for a SNP to be considered de novo. The default is 0.05 if ML recalibration is off, 0.0017 if ML recalibration is on.

  • --qc-indel-denovo-quality-threshold--Specify the minimum DQ value for an indel to be considered de novo. The default is 0.4 if ML recalibration is off, 0.04 if ML recalibration is on.

  • --output-directory--The output directory. This is required.

  • --output-file-prefix--The prefix used to label all output files. This is required.

  • -r The directory where the hash table resides.

The output of the joint genotyper depends on the order of input gVCF files passed on the command line using --variant or --variant-list. It is recommended to use the same input order when re-analyzing gVCFs to ensure the output is the same as an earlier run.
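Because the output depends on input order, it can help to build the command from a fixed, explicit list of gVCF paths. The Python sketch below assembles an argument list from the options described above. The executable name `dragen`, the space-separated `option value` form, and all file paths are assumptions for illustration; the helper `joint_genotyping_command` is hypothetical and the command is only printed, not run.

```python
import shlex

def joint_genotyping_command(ref_dir, gvcfs, pedigree_file, out_dir, prefix):
    """Build a pedigree-mode argument list, keeping gVCFs in the given order."""
    args = ["dragen",
            "-r", ref_dir,
            "--enable-joint-genotyping", "true",
            "--output-directory", out_dir,
            "--output-file-prefix", prefix,
            "--pedigree-file", pedigree_file]
    for gvcf in gvcfs:  # one --variant option per gVCF file
        args += ["--variant", gvcf]
    return args

# Hypothetical paths; keep this list in the same order across re-analyses.
gvcfs = ["father.hard-filtered.gvcf.gz",
         "mother.hard-filtered.gvcf.gz",
         "proband.hard-filtered.gvcf.gz"]
cmd = joint_genotyping_command("/staging/ref", gvcfs, "trio.ped", "/staging/out", "trio")
print(shlex.join(cmd))
```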

ICA Specific Output

Output Folder

  • Nextflow output folders differ across platforms.

ICA Cloud App

Requirements

Analysis on ICA requires an account with a valid subscription and a project with the following configuration.

Manual Launch

How to Set up Project

  1. Download Results: After analysis is complete, navigate to results in the configured output location.

Please see the Illumina Support Shorts for guidance on how to set up and run DRAGEN Heme WGS Tumor Only analysis on ICA.

Analysis Parameters on ICA

To launch an analysis via the ICA user interface, configure a DRAGEN Heme WGS Tumor Only pipeline analysis with the following parameters.

Parameter Name
Description

User Reference

The analysis run name

User Tags

Text labels to help index the analysis.

Notify me when task is completed

Option to receive an email notification when analysis is complete.

Output Folder

The path to the analysis output folder. The default path is the project output folder.

Entitlement Bundle

Automatically populated from the project details.

Samplesheet

Select a sample sheet in CSV format for the analysis. Note: Sample sheet selection is optional when starting from a run folder and required when submitting a FASTQ folder.

Input Directory

The run folder or FASTQ folder that contains files to analyze.

Input Type

Select the input type for the analysis. Options are bcl, fastq, bam, and cram.

Sample or Pair IDs

Optional subset of Sample IDs or Pair IDs to analyze.

Reference Genome

Select the reference genome. hs37d5_chr is the hg19 reference genome with the Chromosome Y PAR masked. It includes the NC_012920 mitochondrial genome. The contigs have the chr prefix added, but without the native alternate loci names.

Enable Ora Compression

Enable Ora Compression (True or False). Only applicable when Input Type is bcl.

Enable Post Processing

Enable Post Processing (True or False) to run custom scripts at the end of the pipeline.

Storage Size

The storage size to allocate for the analysis. The default and recommended value is Large.

Custom Parameters Config File

Optional. Select a Custom Parameters Config File that overrides the default configuration.

Custom Resources Directory

Optional. Select Custom Resources Directory to use with Custom Parameters Config File

CAUTION - This parameter ...

Optional. Parameters with this comment apply only to DRAGEN Heme WGS Tumor Only analyses that are auto-launched from FASTQ files after BCL conversion. Do not set them when starting an analysis from the ICA UI.

Launching Analysis

UI Options

Manual Launch of Heme pipeline Analysis Software Analysis

To manually launch an analysis, configure a Heme pipeline Analysis Software pipeline analysis run in ICA with the following parameters.

General

Parameter Name
Description
Required
Default

User Reference

The custom name of the analysis for later identification.

Yes

Empty

User Tags

Tags for the analysis to help with categorization and identification, enhancing organization and searchability.

No

Empty

Notification

Add a user to be notified when the analysis completes.

No

No user selected

Output Folder

The path to the analysis output folder.

No

Project output folder

Input Files

Parameter Name
Description
Required
Default

Samplesheet

The SampleSheet.csv for the analysis

Yes

SampleSheet.csv in Input Folder

Input Directory

The input folder that contains [bcl, fastq, bam, cram] files to analyze. Multiple input [fastq, bam, cram] folders can be specified.

Yes

No folder selected

Custom Parameters Config File

The custom parameters config file for the analysis.

No

No file selected

Custom Resources Directory

The custom resources directory used for the analysis.

No

No folder selected

Settings

Parameter Name
Description
Required
Default

Input Type

The type of files in the Input Folder(s): bcl, fastq, bam, cram.

Yes

bcl

Reference Genome

The reference genome used for the analysis: [hs37d5_chr, hg38].

Yes

hg38

Enable Ora Compression

Compress fastq files using ora compression. [Only applies when Input Type is bcl].

No

true

Enable Post Processing

Use the post-processing scripts at the end of the pipeline analysis.

No

false

Sample IDs

Optional subset of Sample IDs or Pair IDs to analyze. A comma-separated list.

No

Empty

Resources

Parameter Name
Description
Required
Default

Storage Size

The storage size to allocate for the analysis. The minimum required value is Large.

Yes

Large

Other available options for the storage size:

  • "1.2TB" if option selected is Small

  • "2.4TB" if option selected is Medium

  • "7.2TB" if option selected is Large

  • "16TB" if option selected is XLarge

  • "32TB" if option selected is 2XLarge

  • "64TB" if option selected is 3XLarge

Note: It is recommended to reserve storage of twice the size of the BCL run folder or of the input fastq.gz or bam files, four times the size of the cram files (cram is 30-70% of the size of bam), and eight times the size of the fastq.ora files (fastq.ora is about 25% of the size of fastq.gz).
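The sizing rule in the note can be turned into a quick calculation. The Python sketch below applies the multipliers and picks the smallest listed tier that fits. It is an illustration only: `recommend_storage` is a hypothetical helper, sizes are given in TB, and the floor of Large reflects the minimum stated in the Resources table (pass a different `minimum` to relax it).

```python
# Storage tiers and capacities (TB) as listed in the text.
TIERS_TB = [("Small", 1.2), ("Medium", 2.4), ("Large", 7.2),
            ("XLarge", 16.0), ("2XLarge", 32.0), ("3XLarge", 64.0)]

# Multipliers from the note: reserve N times the input size.
MULTIPLIER = {"bcl": 2, "fastq.gz": 2, "bam": 2, "cram": 4, "fastq.ora": 8}

def recommend_storage(input_kind, input_size_tb, minimum="Large"):
    """Return (tier name, required TB) for the smallest tier at or above the floor."""
    needed_tb = MULTIPLIER[input_kind] * input_size_tb
    names = [name for name, _ in TIERS_TB]
    floor = names.index(minimum)
    for name, capacity in TIERS_TB[floor:]:
        if capacity >= needed_tb:
            return name, needed_tb
    raise ValueError(f"{needed_tb} TB exceeds the largest tier")

print(recommend_storage("bam", 1.0))        # ('Large', 2.0)
print(recommend_storage("cram", 2.5))       # ('XLarge', 10.0)
print(recommend_storage("fastq.ora", 3.0))  # ('2XLarge', 24.0)
```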

Using the icav2 client

Customers may use the icav2 client to launch analysis from the CLI. The specific parameters supported may be obtained from the Project Pipeline details under the XML configuration tab.

Custom Workflow

Autolaunch requires additional BaseSpace Sequence Hub and sample sheet settings.

BaseSpace Sequence Hub Requirements for ICA Autolaunch

Autolaunch uses the BaseSpace Sequence Hub (BSSH) run planning tool to create and export a v2 format sample sheet to enable streaming of sequencing run data to the project and requires the following additional settings. See Figure 1 below.

  • Access to BaseSpace Sequence Hub.

  • ICA Run Storage is enabled under BaseSpace Sequence Hub settings.

Illumina Cloud Run Planning and Auto-Launch Workflow

Autolaunch requires a v2 format sample sheet with specific parameters that instruct the BSSH project to automatically initiate a Heme pipeline analysis in ICA. Use the run planning option in BaseSpace Sequence Hub to generate the sample sheet. The exported sample sheet is automatically populated with the required fields. Using an invalid sample sheet can result in failed runs and analyses.

Refer to Table 1 below for descriptions of the added fields. Enter the following required run parameters in BaseSpace Sequence Hub Run Planning:

Parameter Name
Setting

Secondary Analysis

BaseSpace Sequence Hub / Illumina Connected Analytics

Application

DRAGEN Heme App for Whole-genome Sequencing

Figure 1. BSSH Run Planning Enabled End to End Workflow

The BaseSpace Sequence Hub setting for run monitoring and storage must be selected on the instrument to use Heme pipeline Analysis Software analysis Autolaunch. For information on preparing your instrument for DRAGEN Heme App for Whole-Genome Sequencing Analysis Software Autolaunch, refer to the documentation for your instrument.

  1. Use Run Planning in BaseSpace Sequence Hub to create and export a sample sheet.

  2. Import the sample sheet to the instrument and start the sequencing run. Data is uploaded to BaseSpace Sequence Hub and then pushed to ICA. You can monitor the run in BaseSpace Sequence Hub.

  3. When sequencing and the upload are complete, analysis autolaunches in ICA. You can monitor the status of the analysis in BaseSpace Sequence Hub or ICA.

  4. If necessary, requeue the analysis via the run's Summary page in BaseSpace Sequence Hub. Refer to the BaseSpace Sequence Hub support site page for more information on requeuing an analysis.

  5. View the analysis output results in either BaseSpace Sequence Hub or ICA.

Table 1. Additional Sample Sheet Fields for Autolaunch

Autolaunch-compatible sample sheets contain the following fields specific to autolaunch configuration.

Section
Parameter
Details
Required

Cloud_Heme_Data

Sample_ID

The unique ID to identify a sample. Must match a Sample_ID used in the Heme_Data section.

Yes

Sample_Type

Sample type.

No

Sample_Description

Must meet the following requirements:

- 1–70 characters.

- Alphanumeric characters, underscores, dashes, and spaces. If you enter an underscore, dash, or space, enter an alphanumeric character before and after.

No

Cloud_Heme_Settings

SoftwareVersion

The Heme software version

No

StartsFromFastq

Set the value to TRUE or FALSE. If autolaunching from BCL files, this must be set to FALSE.

Yes

Cloud_Data

Sample_ID

The same sample ID used in the Cloud_Heme_Data section.

No

ProjectName

The BaseSpace Sequence Hub project name.

No

LibraryName

Combination of sample ID and index values in the following format: sampleID_Index_Index2.

No

LibraryPrepKitName

The Library Prep Kit used.

No

IndexAdapterKitName

The Index Adapter Kit used.

No

Cloud_Settings

GeneratedVersion

The cloud GSS version used to create the sample sheet. Optional if manually updating a sample sheet.

No

CloudWorkflow

ica_workflow_1

Yes

Cloud_Heme_Pipeline

Yes

Custom Config Support

Local App Setup

Overview

This document describes how to use the Custom Configuration Support feature for the pipeline software. This feature allows users to customize a specific set of DRAGEN command-line options to override the default values pre-defined in the pipeline.

Customization with customConfig and customResourceDir

Users can customize pipeline behavior and file inputs using:

  • --customConfig : path to a custom configuration file listing customized parameter values.

  • --customResourceDir : path to a directory containing custom resource files.

Both options should be used together if file-based overrides are required.

Important note for using File Parameters

  • For file parameters (parameters that require a file), users must specify relative paths in the customConfig file. The software will join customResourceDir and the relative path to form the full file path.

  • Additionally, the value assigned to a file parameter must be enclosed in single quotes ('').

Examples

Command Line

run_Heme_WGS_TO_{version}.sh \
  --inputType bcl \
  --inputFolder /heme_input_bcl \
  --customConfig /path/heme_custom_param.config \
  --customResourceDir custom_resources_Heme_dir

heme_custom_param.config Content

# custom parameters
vc_output_evidence_bam = false
qc_detect_contamination = true
aligner_clip_pe_overhang = 0

# custom reference files
vc_systematic_noise = '/snv/WGS_hg38_v1.0_systematic_noise.snv.bed.gz'
sv_systematic_noise = '/sv/WGS_FF_Heme_hg38_v1.0_systematic_noise.sv.bedpe.gz'
vc_somatic_hotspots = '/snv/somatic_hotspots_GRCh38.vcf.gz'

custom_resources_Heme_dir Folder Structure

custom_resources_Heme_dir/
├── snv
│   ├── WGS_hg38_v1.0_systematic_noise.snv.bed.gz
│   └── somatic_hotspots_GRCh38.vcf.gz
└── sv
    └── WGS_FF_Heme_hg38_v1.0_systematic_noise.sv.bedpe.gz
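The way file parameters resolve against customResourceDir can be sketched in Python. This is a guess at the behavior for illustration, not the pipeline's implementation: the helpers `parse_custom_config` and `resolve_file_params` are hypothetical, single-quoted values are assumed to mark file parameters, and the leading slash in the example values is assumed to be stripped before the path is joined to the resource directory.

```python
import posixpath

CONFIG_TEXT = """# custom parameters
vc_output_evidence_bam = false
aligner_clip_pe_overhang = 0

# custom reference files
vc_systematic_noise = '/snv/WGS_hg38_v1.0_systematic_noise.snv.bed.gz'
"""

def parse_custom_config(text):
    """Read 'name = value' lines, skipping blank lines and '#' comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params

def resolve_file_params(params, resource_dir):
    """Join single-quoted (file) values onto resource_dir; other values are ignored."""
    resolved = {}
    for key, value in params.items():
        if len(value) >= 2 and value[0] == value[-1] == "'":
            relative = value[1:-1].lstrip("/")  # assumption: treat as relative
            if relative:
                resolved[key] = posixpath.join(resource_dir, relative)
    return resolved

params = parse_custom_config(CONFIG_TEXT)
resolved = resolve_file_params(params, "custom_resources_Heme_dir")
print(resolved["vc_systematic_noise"])
# custom_resources_Heme_dir/snv/WGS_hg38_v1.0_systematic_noise.snv.bed.gz
```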

customConfig Template (with default value)

#vc_systematic_noise = ''
#enable_map_align = true
#sv_systematic_noise = ''
#vc_output_evidence_bam = false
#qc_detect_contamination = true
#vc_somatic_hotspots = ''
#sv_somatic_ins_tandup_hotspot_regions_bed = ''
#cram_reference = ''
#aligner_clip_pe_overhang = 0

Supported Parameters

Display Name
Parameter Name
Component
Allowed Values
Default Value
Optional

VC Systematic Noise File

vc_systematic_noise

Variant Caller

file

included

Yes

VC Somatic Hotspots File

vc_somatic_hotspots

Variant Caller

file

included

Yes

CRAM Input Reference Genome

cram_reference

Mapper

file

included

Yes

Aligner Clip Paired End Reads Overhang

aligner_clip_pe_overhang

Mapper

0,1,2

0

Yes

Enable Map Align

enable_map_align

Mapper

true / false

true

Yes

SV Somatic Hotspot BED File

sv_somatic_ins_tandup_hotspot_regions_bed

Structural VC

file

included

Yes

SV Systematic Noise File

sv_systematic_noise

Structural VC

file

included

Yes

Output SNV Evidence BAM

vc_output_evidence_bam

Debug

true / false

false

Yes

QC Detect Contamination

qc_detect_contamination

QC

true / false

true

Yes


Analysis Output

Analysis Output

When the analysis run completes, the software generates an analysis output in a folder named /staging/DRAGEN_Heme_WGS_Tumor_Only_Pipeline_{version}_Analysis_{datetimestamp}, unless a specific location is specified on the command line. In ICA, analysis output is listed in the Output section of the analysis, where the folder name is a combination of user reference, pipeline name, and a UUID. Within the analysis folder, each analysis step generates a subfolder within the Logs_Intermediates folder.

Output Folders

This section describes each output folder generated during analysis and where to find metric and analytic files when the pipeline is executed.

File Overview

This section describes the summary output files generated during analysis.

Metrics Output

File name: MetricsOutput.tsv

The metrics output file is a final combined metrics report that provides sample status, key analysis metrics, and metadata in a tab-separated values (TSV) file. Sample metrics within the report indicate guideline-suggested lower specification limits (LSL) and upper specification limits (USL) for each sample in the run. One metrics output file is generated for the entire run. An additional file is generated for each sample.

Run Metrics

Run metrics from the analysis module indicate the quality of the sequencing run. Review the following metrics to assess run data quality:

The values in the Run Metrics section are listed as NA in the following situations:

  • The analysis was started from FASTQ files.

  • The analysis was started from BCL files, and the InterOp files are missing or corrupt.

Sample QC Metrics

Review the following metrics to assess sample data quality:

DRAGEN Solid WGS Tumor Normal Pipeline

Overview

DRAGEN Solid WGS Tumor Normal Pipeline, henceforth referred to as the Solid WGS TN Pipeline, is a comprehensive and unbiased whole genome sequencing solution for detection of all types of mutation in matched tumor and normal samples. It can be applied to detect clinically actionable mutations for cancer spanning a wide range of genomic events, e.g., structural variants (SV), copy number alterations (CNA), small variants (SNV/insertion/deletion/delins).

The Solid WGS TN pipeline includes a DNA-only workflow designed to analyze whole genome sequencing data generated on supported instruments. It may be run as a local off-instrument solution installable on a DRAGEN server or accessible through the Illumina Connected Analytics (ICA) cloud environment. The Solid WGS TN pipeline is for Research Use Only (RUO).

Features

  • Superb performance based on the DRAGEN BioIT platform Release 4.4.4

  • Supports starting the analysis from FASTQ (.gz or .ora format), BAM or CRAM as inputs

  • Flexible custom configurable options on top of well established DRAGEN recipes for Solid WGS TN analysis.

  • Available on local DRAGEN servers and Illumina Connected Analytics (ICA)

  • Seamless integration with Illumina Connected Insights (ICI) for tertiary interpretation

Supported Library Prep Kits (LPKs)

No specific requirements on LPKs since the pipeline does not support starting from BCL in the current release.

Supported Sequencing Instruments

No specific requirements on instruments since the pipeline does not support starting from BCL in the current release.

Analysis Methods

Reference Genomes

The Heme pipeline supports two reference genomes for the DRAGEN Map/Aligner - hg38 and hs37d5_chr.

The hs37d5_chr genome is the hg19 reference genome with the Chromosome Y PAR masked. It includes the NC_012920 mitochondrial genome. The contigs have the chr prefix added, but without the native alternate loci names.

DRAGEN Map/Aligner

DRAGEN continues to use these final alignments as input for various variant calls such as gene amplification (copy number) calling, small variant calling (SNV, indel, MNV, delin), and DNA library quality control.

Small Variant Calling and Filtering

DRAGEN supports calling SNVs, indels, MNVs, and delins in tumor-only samples by using mapped and aligned DNA reads from a tumor sample as input. Variants are detected via both column-wise pileup analysis and local de novo assembly of haplotypes. The de novo haplotypes allow the detection of much larger insertions and deletions than possible through column-wise pileup analysis only. DRAGEN insertions and deletions are validated for lengths of 0–25 bp, and lengths of more than 25 bp can be supported. In addition, DRAGEN also uses the de novo assembly to detect SNVs, insertions, and deletions that are co-phased and part of the same haplotypes. Any such co-phased variants that are within a window of 15 bp can then be reassembled into complex variants (MNVs and delins). The tumor-only pipeline produces a VCF file containing both germline and somatic variants that can be further analyzed to identify tumor mutations. The pipeline makes no ploidy assumptions, enabling detection of low-frequency alleles.

DRAGEN small variant calling includes the following steps:

  1. Detects regions with sufficient read coverage (callable regions).

  2. Detects regions where the reads deviate from the reference and there is a possibility of a germline or somatic call (active regions).

  3. Assembles de novo graph haplotypes from reads (haplotype assembly).

  4. Extracts possible somatic or germline calls (events) from column-wise pileup analysis.

  5. Calibrates read base qualities to account for background noise.

  6. Computes read likelihoods for each read/haplotype pair.

  7. Performs mutation calling by summing the genotype probabilities across all read/haplotype pairs.

  8. Performs additional filtering to improve variant calling accuracy, including using a systematic noise file. The systematic noise file indicates the statistical probability of noise at specific positions in the genome. This noise file is constructed using clean (normal) samples. Regions where noise is common (eg, difficult-to-map regions) have higher noise values. The small variant caller penalizes those regions to reduce the probability of making false positive calls.
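
Step 1 of the list above (callable regions) can be illustrated with a minimal Python sketch. This is not DRAGEN's implementation: the function name callable_regions, the per-position depth list, and the min_depth cutoff are assumptions made for illustration, because this guide does not state the coverage threshold that DRAGEN applies.

```python
from typing import List, Tuple


def callable_regions(depths: List[int], min_depth: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) intervals where depth >= min_depth.

    `depths` holds one read depth per reference position (hypothetical input).
    """
    regions: List[Tuple[int, int]] = []
    start = None  # start of the interval currently being extended, if any
    for pos, depth in enumerate(depths):
        if depth >= min_depth and start is None:
            start = pos  # coverage became sufficient: open an interval
        elif depth < min_depth and start is not None:
            regions.append((start, pos))  # coverage dropped: close it
            start = None
    if start is not None:
        regions.append((start, len(depths)))  # interval runs to the end
    return regions


# Illustrative depths only; min_depth=10 is an assumed cutoff.
example_depths = [0, 2, 12, 15, 30, 3, 0, 11, 10]
print(callable_regions(example_depths, min_depth=10))
```

Half-open intervals are used so that adjacent regions can be concatenated or converted to BED-style coordinates without off-by-one adjustments.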

Copy Number Variant Calling

The DRAGEN copy number variant caller performs amplification, reference, and deletion calling for CNV targets within the assay. It counts the coverage of each target interval on the panel, uses a preprocessed panel of normal samples to normalize target counts, corrects for GC coverage bias, calculates scores for a CNV event from the observed coverage, and makes copy number calls.
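
The normalization idea in the paragraph above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the DRAGEN algorithm: the function name call_cnv_targets, the precomputed normal_baseline values, and the log2 cutoffs of 0.5 and -0.5 are all hypothetical, and the GC bias correction step is omitted.

```python
import math
from statistics import median
from typing import Dict, Tuple


def call_cnv_targets(
    sample_counts: Dict[str, float],
    normal_baseline: Dict[str, float],
    amp_log2: float = 0.5,   # hypothetical amplification cutoff
    del_log2: float = -0.5,  # hypothetical deletion cutoff
) -> Dict[str, Tuple[str, float]]:
    """Return {target: (call, log2_ratio)} for each target interval.

    `normal_baseline` is assumed to hold the median-scaled coverage of each
    target in a panel of normal samples (hypothetical precomputed input).
    """
    sample_median = median(sample_counts.values())
    if sample_median <= 0:
        raise ValueError("sample has no usable coverage")
    calls: Dict[str, Tuple[str, float]] = {}
    for target, count in sample_counts.items():
        # Scale by the sample median so that library size cancels out,
        # then compare with the panel-of-normals baseline for this target.
        ratio = (count / sample_median) / normal_baseline[target]
        log2_ratio = math.log2(ratio) if ratio > 0 else float("-inf")
        if log2_ratio >= amp_log2:
            call = "AMP"
        elif log2_ratio <= del_log2:
            call = "DEL"
        else:
            call = "REF"
        calls[target] = (call, log2_ratio)
    return calls


# Illustrative counts only.
counts = {"t1": 100.0, "t2": 100.0, "t3": 400.0, "t4": 25.0, "t5": 100.0}
baseline = {name: 1.0 for name in counts}
print(call_cnv_targets(counts, baseline))
```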

Structural Variant Calling

Variant Deduplication

Contamination Detection

The contamination analysis step detects foreign human DNA contamination using the SNP error file and pileup file that are generated during the small variant calling and the TMB trace file. The software uses the contamination score to determine whether a sample contains foreign DNA. In contaminated samples, the variant allele frequencies in SNPs shift from the expected values of 0%, 50%, or 100%. The algorithm collects all positions that overlap with common SNPs and have variant allele frequencies of < 25% or > 75%. Then, the algorithm computes the likelihood that the positions are an error or a real mutation. The contamination score is the sum of the log likelihood scores across the predefined SNP positions that have a minor allele frequency < 25% in the sample and are not likely due to CNV events.

The larger the contamination score, the more likely it is that the sample contains foreign DNA. A sample is considered to be contaminated if the contamination score is above a predefined quality threshold. The contamination score was found to be high in samples with highly rearranged genomes or HRD samples. 1% of HRD samples were found to be above the threshold with no evidence of actual contamination.
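
The scoring rule described above can be sketched in Python. This is an illustration only: the SnpObservation record, its field names, and the assumption that a per-position log likelihood score and a CNV flag are already available are hypothetical, and the guide does not give the likelihood model or the threshold value.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SnpObservation:
    vaf: float             # observed variant allele frequency, 0.0 to 1.0
    log_likelihood: float  # per-position log likelihood score (assumed precomputed)
    likely_cnv: bool       # True if the shift is likely explained by a CNV event


def contamination_score(observations: Iterable[SnpObservation],
                        maf_cutoff: float = 0.25) -> float:
    """Sum log likelihood scores over SNPs with minor allele frequency < 25%."""
    score = 0.0
    for obs in observations:
        # A minor allele frequency below 25% is the same as VAF < 25% or > 75%.
        minor_af = min(obs.vaf, 1.0 - obs.vaf)
        if minor_af < maf_cutoff and not obs.likely_cnv:
            score += obs.log_likelihood
    return score


def is_contaminated(score: float, threshold: float) -> bool:
    """Compare with a predefined threshold (its value is not given in this guide)."""
    return score > threshold


# Illustrative values only.
snps = [
    SnpObservation(vaf=0.02, log_likelihood=1.5, likely_cnv=False),
    SnpObservation(vaf=0.50, log_likelihood=9.0, likely_cnv=False),  # heterozygous, ignored
    SnpObservation(vaf=0.90, log_likelihood=2.0, likely_cnv=False),
    SnpObservation(vaf=0.10, log_likelihood=4.0, likely_cnv=True),   # CNV, ignored
]
print(contamination_score(snps))
```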

Annotation

The Illumina Annotation Engine performs annotation of small variants, CNVs, and exon-level CNVs. The inputs are gVCF files and the outputs are annotated JSON files.

Tumor Mutational Burden

Microsatellite Instability Status

Post Processing

The Post Processing step is a reusable Nextflow component designed for executing various post-processing tasks at the end of pipeline execution. It can be used to enhance, transform, or modify outputs in the Logs_Intermediates and Results folders, which makes it versatile for addressing specific requirements.

This component is highly configurable, supporting fine-tuned control of computational resources (CPU, memory), containerization, and output management. Users can integrate custom containers and scripts to implement their own logic for post-processing, all configured through parameters. Externalized process scripts allow for seamless execution of containerized processes.

Key Features

  • Customizability: Easily adaptable to different post-processing requirements.

  • Reusability: Can be used in multiple pipelines, reducing development effort.

  • Data transformation: Can be used to transform or modify output data in various ways.

What You Need

  1. A config file that contains the post-processing parameters and values

  2. A bash script that implements the desired functionality

  3. Any other custom resources or files required by the bash script

  4. A Docker container that has the dependencies needed to run the bash script

Process

  1. Modify the config file: set postProcessing_container to the uploaded container.

  2. Upload all the required files (config, script, resources) to a project directory, e.g., custom-resources, in ICA using the icav2 client.

  3. Configure the ICA web UI on the 'Start Analysis' page:

    1. Enable post-processing by setting it to 'true'.

    2. Add 'Custom Parameters Config File' and set it to the config file uploaded to the custom-resources directory above.

    3. Add 'Custom Resources Directory' and set it to the custom-resources directory above.

Config File - <file-name>.config

Configurable Parameters in Config file

Allowed values for postProcessing_cpusMemoryConfig in the config file

Post-Processing : Sample Script (bam2cram.sh)

Quick Start

Quick Start Guide

Table 1. Release Information

  • {version} is used to represent the software version number in Table 1 above. Similarly, <pipeline_run_script> is used to indicate the client program name in this document.

Download, Install and Execute on a Local Server

Run analysis on a local DRAGEN Server

An analysis may be launched from the command line by running ${CLI_program} with the appropriate options.

Start from one or more input folders when using FASTQ, BAM or CRAM files

Multiple input folders may be specified as a comma-separated list when using FASTQ, BAM or CRAM files as input.
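
For scripted launches, the comma-separated folder list can be assembled programmatically. The Python sketch below only builds the argument list and does not run anything. The options --inputType, --inputFolder and --analysisFolder are the ones documented in this guide; the helper name build_run_command and the script name run_pipeline.sh are hypothetical stand-ins for <pipeline_run_script>.

```python
import shlex
from typing import List, Sequence

ALLOWED_INPUT_TYPES = {"fastq", "bam", "cram"}


def build_run_command(script: str, input_type: str,
                      input_folders: Sequence[str],
                      analysis_folder: str) -> List[str]:
    """Build the argument list for a local analysis launch."""
    if input_type not in ALLOWED_INPUT_TYPES:
        raise ValueError(f"unsupported input type: {input_type}")
    if not input_folders:
        raise ValueError("at least one input folder is required")
    if any("," in folder for folder in input_folders):
        # A comma inside a path would be read as a folder separator.
        raise ValueError("input folder paths must not contain commas")
    return [
        script,
        "--inputType", input_type,
        "--inputFolder", ",".join(input_folders),
        "--analysisFolder", analysis_folder,
    ]


command = build_run_command(
    "run_pipeline.sh",  # hypothetical name; use the real <pipeline_run_script>
    "fastq",
    ["/staging/input-folder-1", "/staging/input-folder-2"],
    "/staging/output-folder",
)
print(shlex.join(command))
```

Keeping the command as a list rather than a single string avoids shell quoting problems if it is later passed to a process launcher.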

Pressing Ctrl+C during a Solid_WGS_TN_DRAGEN step stops the currently running analysis and might cause an FPGA error. To recover from an FPGA error, shut down and restart the server.

Run analysis on ICA using the icav2 client

The following example starts an analysis with the icav2 client by providing the necessary command parameters and specifying a particular storage size for the analysis in ICA.

Run analysis on ICA using UI

DRAGEN Applications

Applications

DRAGEN analysis offers a large selection of application pipelines.

Analysis Uses

DRAGEN analysis can be used in numerous fields in the biological sciences.

NovaSeq 6000Dx (RUO)

Standard or XP

-

NovaSeq 6000

-

NovaSeq X

Standard or XP

-

-

-

-

For the mitochondrial contig, DRAGEN processes it through a continuous AF pipeline, which is similar to the somatic variant calling pipeline. Please refer to for the filtering details.

DRAGEN supports pedigree-based and population-based germline variant joint analysis for multiple samples. A pedigree-based analysis deals with samples from the same species which are related to each other. A population-based analysis compares samples of the same species which are unrelated to each other. You can find more information about the population-based analysis in the section.

Joint analysis requires a gVCF file for each sample. To create a gVCF file, run the germline small variant caller with the --vc-emit-ref-confidence gVCF option. Since is not supported for joint analysis, set --enable-personalization to false when generating gVCF files.

Run Pedigree Mode for Small Variant Caller. For more information, see .

Run Pedigree Mode for Copy Number Caller. For more information, see .

Run Pedigree Mode for Structural Variant Caller. For more information, see .

Run DeNovo Variant Small Variant Filtering. For more information, see .

This section only describes output files specific to ICA. are described in the common output files.

Heme_Nextflow_logs (ICA only) - Contains information related to the execution of the pipeline as a whole and for specific nodes (when an analysis is split across multiple nodes). It contains files used to execute parts of the workflow on different nodes as well as records of the Nextflow execution on those nodes.

Create a Project: A Project can be specific to the DRAGEN Heme WGS Tumor Only v4.4.4 Pipeline or it can contain multiple Pipelines and/or Tools. For information on creating Projects, refer to the Projects section in the Illumina Connected Analytics help. ICA standard storage is used by default as soon as the Project is saved. To connect a different storage source, set it up before creating your Project. For details and options, refer to the Storage section in the Illumina Connected Analytics help.

Edit Project and Add Bundle: Edit the Project and add the bundle titled "Heme WGS TO v4.4.4 (XX)". XX is a 2-letter code designating the region from which you are launching the analysis. Adding the Bundle automatically adds the pipeline and associated resource files and datasets to the Project. For information on Bundles, refer to the Bundles section in the Illumina Connected Analytics help. After adding the Bundle to the Project, an example dataset becomes available in the Demo_Data folder for the Project.

Upload the sequencing data: For information on viewing and uploading data, refer to the Data section in the Illumina Connected Analytics help.

Start Analysis: In the Project, navigate to Pipelines, select the Heme WGS TO v4_4_4_x  Pipeline, and then select  "Start New Analysis". Set up the new analysis by configuring the parameters listed in the . When the required files are completed, start analysis.

For information about using pipelines, refer to .

For information about using pipelines, refer to the Illumina Connected Analytics documentation.

Refer to the BaseSpace Sequence Hub support site page for information on setting up a BaseSpace Sequence Hub project.

For more information on run planning, refer to the run planning section.

This value is a universal record number (URN). The valid value is defined in

ℹ️ Note: For CRAM Input Reference Genome, a list of commonly-used human reference FASTA files can be downloaded from the Illumina support site:

Results - Contains the final result files from the pipeline.

MetricsOutput.tsv - Contains summary metrics for all samples.

Sample1

Sample1_MetricsOutput.tsv - Contains summary metrics for the specific sample.

Sample1.tumor.baf.bedgraph.gz - Contains the BED graph representation of the B-allele frequency (if available).

Sample1.sv.small_indel_dedup.filtered.vcf.gz - Contains DNA structural variants excluding the indels already present in the hard-filtered.vcf file after applying the DragenSvExtraFilters.

Sample1.hard-filtered.vcf.gz - Contains small variants VCF.

Sample1.cnv.vcf.gz - Contains copy number variants VCF.

Logs_Intermediates - Contains all intermediate files for each step of the pipeline.

SampleSheetValidation

ResourceVerification

RunQc (only when started from BCLs)

FastqGeneration (only when started from BCLs)

FastqValidation

DragenCaller

AdditionalSarjMetrics

SampleAnalysisResults

MetricsOutput

DragenSvExtraFilters

passing_sample_steps.json

work - Contains Nextflow execution details for debugging purposes.

errors - Contains an Errors.tsv file if any pipeline analysis step failed.

SampleSheet.csv - User input sample sheet as provided.

pipeline_trace.txt - Contains Nextflow pipeline step execution status.

timeline_${timestamp}.html - Contains Nextflow pipeline task timeline information.

report_${timestamp}.html - Contains Nextflow pipeline task execution details.

receipt - Contains pipeline analysis CLI parameters and execution environment information.

payload.json - Contains pipeline analysis setup parameters and execution environment information.

nextflow.log - Contains Nextflow pipeline execution log.

analysis.log - Contains Nextflow pipeline standard output.

Metric
Description
Recommended Threshold
Metric (UOM)
Recommended Threshold
Description

The Heme pipeline is a DNA-only analysis software based on the DRAGEN Secondary Analysis Software. Even though it includes some of the default settings from the DNA Somatic Tumor-Only Heme WGS DRAGEN recipe, it uses a distinct recipe with different options. Users can override specific parameters via a custom configuration file.

An example command is provided that highlights the inputs and outputs used in the DragenCaller step of the Heme pipeline; the command may be found in the log file. Any parameter that is not displayed on the command line uses the default value for the DRAGEN variant caller module. The detailed parameters and default arguments for the individual modules within the DragenCaller step may be found in the replay.json output. See DRAGEN Command Line Options for detailed explanations of the parameters.

DNA alignment involves aligning sequencing reads derived from DNA libraries to a reference genome prior to variant calling.

The pipeline currently does not support UMI libraries by default. If desired, use the DRAGEN DNA Pipeline UMI recipe to generate the collapsed BAM as input.

Additional information is available at .

Additional information is available at .

The DRAGEN Structural Variant (SV) Caller is described . The DUX4 rearrangement caller is described .

The Variant Deduplication is described

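The Heme pipeline currently does not support annotation of gVCF files. Please use Illumina Connected Insights to perform tertiary analysis.

Not supported in the current release. Please use the DNA Somatic Tumor-Only Heme WGS DRAGEN recipe.

Not supported in the current release. Please use the DNA Somatic Tumor-Only Heme WGS DRAGEN recipe.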

Upload and configure

Parameter
Description
Value
Description

A Post-Processing bash script is a Nextflow Template, which has access to the paths and variables defined in the parent Nextflow process. In this case, directories and subdirectories such as {params.analysisDir}/Results and {params.analysisDir}/Logs_Intermediates can be accessed from the bash script. The output files generated should be stored in the {params.postProcessing.stepName} directory. Note: For BAM to CRAM conversion, the genome.fa and .fai files must be uploaded to the custom resources directory.

Execution Environment
software version
Client program
location
Note

The software may be downloaded and installed by following the installation guide. It may be executed on a local DRAGEN server, or from a local computer that launches the analysis in the ICA cloud environment.

The same analysis example above may be completed using the ICA UI by logging into the appropriate domain of your company and the project where the pipeline is set up.

Find more information in the ICA Cloud App Launch Guide.

Pipeline
Description
Variant Types Detected
Metrics Provided
Analysis
Description
📂
Cloud
Local
Cloud Mixed Flow Cell
Cloud
Local Mixed Flow Cell
Local
iterative gVCF Genotyper
Standard output files
Illumina Connected Analytics help
Illumina Connected Analytics help
Illumina Connected Analytics help
Illumina Connected Analytics help
Illumina Connected Analytics support site page
Illumina Connected Analytics documentation
setting up a BaseSpace Sequence Hub project
run planning section
Illumina DRAGEN Product Files
table below
Panel of Normals
Panel of Normals

PCT_Q30_R1

Percentage of bases with a quality score ≥ 30 from Read 1.

≥ 80.0 (≥85.0 for NovaSeq X Plus)

PCT_Q30_R2

Percentage of bases with a quality score ≥ 30 from Read 2.

≥ 80.0 (≥85.0 for NovaSeq X Plus)
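
As an illustration of the two PCT_Q30 metrics and their recommended thresholds, the Python sketch below computes the percentage of bases with a quality score of 30 or higher and compares it with the threshold listed above. The function names and the plain list of per-base quality scores are assumptions made for the example; the pipeline reports these metrics itself in MetricsOutput.tsv.

```python
from typing import Sequence


def pct_q30(base_qualities: Sequence[int]) -> float:
    """Percentage of bases with a quality score >= 30."""
    if not base_qualities:
        raise ValueError("no base qualities supplied")
    passing = sum(1 for quality in base_qualities if quality >= 30)
    return 100.0 * passing / len(base_qualities)


def meets_q30_threshold(pct: float, novaseq_x_plus: bool = False) -> bool:
    """Apply the recommended threshold: >= 80.0, or >= 85.0 for NovaSeq X Plus."""
    threshold = 85.0 if novaseq_x_plus else 80.0
    return pct >= threshold


# Illustrative quality scores for a handful of bases.
read1_qualities = [30, 31, 10, 40]
value = pct_q30(read1_qualities)
print(value, meets_q30_threshold(value))
```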

TUMOR_ESTIMATED_SAMPLE_CONTAMINATION (NA)

NA

The estimated fraction of reads in a sample that may be from another human source

TUMOR_MAPPED_READS_PCT (%)

NA

Percent of mapped reads in the tumor sample

TUMOR_INSERT_LENGTH_MEDIAN (count)

NA

Median insert length of tumor sample

TUMOR_Q30_BASES_EXCL_DUPS_AND_CLIPPED_BASES (bp)

NA

Bases with a Phred quality score of 30 or higher, excluding duplicated reads and clipped bases

AVERAGE_AUTOSOMAL_COVERAGE_OVER_GENOME (count)

NA

Average coverage or sequencing depth across the autosomes (chromosomes 1-22)

GC_NORMALIZED_COVERAGE_AT_GCS_20_39 (count)

NA

Normalized sequencing coverage in genomic regions with GC content between 20% and 39%

GC_NORMALIZED_COVERAGE_AT_GCS_60_79 (count)

NA

Normalized sequencing coverage in genomic regions with GC content between 60% and 79%

/opt/edico/bin/dragen \
--ref-dir /staging/dragen-app-manager/resources/Illumina_hg38-alt_masked.cnv.hla.methyl_cg.methylated_combined.rna-11_r5.0-1 \
--output-directory DragenCaller/Sample-001 \
--output-file-prefix Sample-001 \
--events-log-file DragenCaller/Sample-001/events.csv \
--vc-systematic-noise=/staging/dragen-app-manager/resources/Illumina_heme-wgs-to-resources_4.4.4.2/snv/IDPF_WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz  \
--vc-enable-germline-tagging=true \
--variant-annotation-data=/staging/dragen-app-manager/resources/Illumina_variant_annotation_data-tmb_annotations_4.4.4-1/tmb_annotations \
--vc-germline-tag-hotspots=false \
--logging-to-output-dir=true \
--gc-metrics-enable=true \
--enable-metrics-json=true \
--enable-map-align=true  \
--enable-sort=true \
--enable-duplicate-marking=true \
--enable-variant-caller=true \
--heme-sv=true \
--sv-systematic-noise=/staging/dragen-app-manager/resources/Illumina_heme-wgs-to-resources_4.4.4.2/sv/WGS_FF_Heme_hg38_v3.1.0_systematic_noise.sv.bedpe.gz \
--heme-cnv=true \
--cnv-population-b-allele-vcf=/staging/dragen-app-manager/resources/Illumina_heme-wgs-to-resources_4.4.4.2/cnv/hg38_1000G_phase1.snps.high_confidence.vcf.gz \
--enable-variant-deduplication=true \
--vc-output-evidence-bam=false \
--qc-detect-contamination=true \
--enable-dux4-caller=true \
--max-base-quality=63 \
--tumor-fastq-list Sample-001.fastq_list.csv \
--tumor-fastq-list-sample-id Sample-001 \
--force
Note - The Post-Processing feature is available only in the ICA environment.

postProcessing_container = '079623148045.dkr.ecr.us-east-1.amazonaws.com/cp-prod/0f7f12a0-a6c8-4289-86c3-3e5310b97275:latest'
postProcessing_cpusMemoryConfig = 'single_threaded_low_mem'
postProcessing_shellScript = 'bam2cram.sh'

  • postProcessing_container: Docker container URI. The container must be present in (uploaded to) ICA.

  • postProcessing_cpusMemoryConfig: Compute option to use. The allowed values are given below.

  • postProcessing_shellScript: File name of the shell script.

  • single_threaded_low_mem (default): CPUs: 2, Mem(GB): 8

  • single_threaded_medium_mem: CPUs: 4, Mem(GB): 16

  • single_threaded_high_mem: CPUs: 8, Mem(GB): 32

  • multi_threaded_low_mem: CPUs: 16, Mem(GB): 64

  • multi_threaded_medium_mem: CPUs: 32, Mem(GB): 128

  • multi_threaded_high_mem: CPUs: 64, Mem(GB): 128
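
Before uploading a config file, it can help to check it locally against the parameters and allowed values listed above. The Python sketch below is one way to do that. The helper names are hypothetical, the parser only understands simple key = 'value' lines like those in the sample config, and treating postProcessing_container and postProcessing_shellScript as mandatory is an assumption rather than a documented rule.

```python
import re
from typing import Dict, List

# Compute options and their (CPUs, memory in GB), as listed in this guide.
COMPUTE_OPTIONS = {
    "single_threaded_low_mem": (2, 8),
    "single_threaded_medium_mem": (4, 16),
    "single_threaded_high_mem": (8, 32),
    "multi_threaded_low_mem": (16, 64),
    "multi_threaded_medium_mem": (32, 128),
    "multi_threaded_high_mem": (64, 128),
}
DEFAULT_COMPUTE_OPTION = "single_threaded_low_mem"

# Matches lines such as: postProcessing_shellScript = 'bam2cram.sh'
ASSIGNMENT = re.compile(r"^\s*(\w+)\s*=\s*'([^']*)'\s*$")


def parse_config(text: str) -> Dict[str, str]:
    """Collect key = 'value' assignments; other lines are ignored."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def config_problems(settings: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems (empty if none found)."""
    problems: List[str] = []
    # Assumption: these two parameters must always be set.
    for key in ("postProcessing_container", "postProcessing_shellScript"):
        if not settings.get(key):
            problems.append(f"{key} is not set")
    option = settings.get("postProcessing_cpusMemoryConfig", DEFAULT_COMPUTE_OPTION)
    if option not in COMPUTE_OPTIONS:
        problems.append(f"unknown postProcessing_cpusMemoryConfig: {option}")
    return problems


example = """
postProcessing_container = 'registry.example.com/my-tools:latest'
postProcessing_cpusMemoryConfig = 'single_threaded_low_mem'
postProcessing_shellScript = 'bam2cram.sh'
"""
print(config_problems(parse_config(example)))
```

The container URI in the example is a placeholder, not a real image.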


#========================================================#
# This is a SAMPLE Script only for illustration purpose  #
# Modify it, according to your specific Use Case         #
#========================================================#

#must create this folder to save output files
mkdir -p "${params.postProcessing.stepName}"

cd "${params.postProcessing.stepName}"

#BAMs are located in 'analysis/results' folder
resultsdir="${params.analysisDir}/Results"
#this file must be uploaded to custom-resources-dir
genomefa="${params.customResourceDir}/genome.fa"

sleep_interval=30 # seconds
max_attempts=3

#set sample ids
sample_ids=("Mariner_1_Feasibility_Biosample_45-smoke" "sample_id_2")

for sample_id in "\${sample_ids[@]}"; do
    counter=0
    while : ; do
        if [ "\$counter" -eq "\$max_attempts" ]; then
            echo "WARNING! \${sample_id}.bam was NOT found!"
            break
        fi
        counter=\$((counter + 1))
        #take the first match only, and quote paths so that spaces do not break the command
        bam_file=\$(find "\$resultsdir" -type f -name "\${sample_id}.bam" | head -n 1)
        if [ -z "\$bam_file" ]; then
            echo "Attempt \$counter : Waiting for \${sample_id}.bam"
            sleep \$sleep_interval
        else
            #convert the BAM to CRAM in the step directory, then stop retrying
            filename=\$(basename -s .bam "\$bam_file")
            samtools view -C -T "\$genomefa" -o "./\$filename.cram" "\$bam_file"
            break
        fi
    done
done

exit 0
<pipeline_run_script> --inputType <fastq|bam|cram> \
--inputFolder /staging/input-folder-1,/staging/input-folder-2 \
--analysisFolder /staging/output-folder
icav2 projectpipelines start nextflow ${PIPELINE_ID} \
--project-id ${ANY_PROJECT_ID} \
--storage-size Large \
-o json \
--input ${ANY_SAMPLE_SHEET} \
--input ${ANY_INPUT_DIR} \
--parameters inputType:'fastq' \
--parameters referenceGenome:'hg38' \
--parameters sampleIds:'Sample1,Sample2' \
--user-reference ${ANY_USER_REFERENCE}

DRAGEN Demultiplexing

Rapid demultiplexing of NGS analysis

N/A

N/A

DRAGEN ORA Compression

DRAGEN ORA compression is optimized for high compression ratios of FASTQ files, as well as rapid compression and decompression, all while preserving data integrity.

N/A

Compression Ratio Run Time

DRAGEN Map + Align

DRAGEN Map + Align can be run standalone or as part of DRAGEN’s suite of pipelines.

N/A

Mapping metrics Duration Metrics Coverage Metrics

DRAGEN Germline

The DRAGEN Germline Pipeline provides end-to-end NGS analysis, including advanced error model calibration for increased accuracy, and repeat expansion detection and genotyping through Illumina Expansion Hunter.

SNV/Indel CNV SV Repeat Expansions

Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report

DRAGEN Somatic

The DRAGEN Somatic Pipeline includes tumor-only and tumor–normal modes, designed for detecting somatic variants in tumor samples. Both modes make no ploidy assumptions, enabling detection of low-frequency alleles.

SNV/Indel CNV SV TMB MSI HLA

Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report

DRAGEN Enrichment

The DRAGEN Enrichment Pipeline combines DRAGEN’s germline and somatic callers into a pipeline designed specifically for analyzing enrichment samples. Includes a full suite of enrichment metrics and reporting.

SNV/Indel CNV SV

Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report

DRAGEN RNA

The DRAGEN RNA Pipeline performs transcriptome analysis starting with splice junction discovery and alignment, followed by rapid alignment and splice junction mapping and quantification. For differential expression, Illumina recommends the DRAGEN Differential Expression app on BaseSpace Sequence Hub.

Gene fusion SNV/Indel

Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report

DRAGEN Single Cell RNA

The DRAGEN Single Cell RNA pipeline performs demultiplexing, cell-barcode and UMI error correction, sequence alignment, and quantification of gene expression.

N/A

Mapping Metrics Duration Metrics Coverage Metrics Callability Report Cell Metrics

DRAGEN Joint Genotyping

The DRAGEN Joint Genotyping/Population Pipeline calls variants jointly across multiple genomes and scales to large cohorts of samples at expedited speeds with uncompromising accuracy.

SNV/Indel CNV SV Repeat Expansions

Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report

DRAGEN Methylation

The DRAGEN Methylation Pipeline performs alignment, methyl calling, and calculates alignment and methylation metrics.

N/A

Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report

DRAGEN Reference Builder

Accepts FASTA files, and builds the proprietary reference used by the DRAGEN apps.

N/A

N/A

DRAGEN TruSight Oncology 500 ctDNA Analysis Software

Secondary analysis support for Illumina’s TruSight Oncology 500 ctDNA. Available on the local DRAGEN Server version 3 and later.

SNV/Indel CNV DNA fusions MSI TMB

Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report

DRAGEN Imputation

The DRAGEN Imputation pipeline is an end-to-end, user-friendly tool that enables scalable low-pass whole genome sequencing analysis.

N/A

Impute ≤100 samples simultaneously 1.7x faster compared to original GLIMPSE code

Genetic Diseases

Reduce time required for genomic analysis, with high accuracy and comprehensiveness

Oncology

Analyze tumor-only and tumor/normal samples with accuracy, comprehensiveness, and efficiency

Cell and Molecular Biology

Advance understanding of cellular mechanisms with rapid analysis pipelines for bulk and single cell samples

Population Genomics

Accurately and efficiently analyze sequenced genomes at scale. Accelerate re-analysis as computational tools improve over time

Infectious Disease

Detect and characterize infectious diseases with a comprehensive solution

Agrigenomics

Efficiently analyze animals and plants of varying genomic complexities with custom reference

UMI Options
Merge Duplex UMIs

Command Line Options

Overview

Command line options

For command-line options, refer to Table 1 (below) for details.

Table 1: Shell Script Command-Line Options

CAUTION: Do not run analyses as the root user as it can lead to permissions issues when managing data generated by the software.

Argument
Required
Description

--inputType

Yes

Possible values include fastq, bam, cram.

--inputFolder

Yes

Input folder containing {input type} files. Multiple folders can be specified as a comma separated list.

--sampleSheet

No

Full path to the sample sheet file. If the sample sheet is named SampleSheet.csv and is located in the single input folder (depending on how the analysis is initiated), this option is not required.

--analysisFolder

No

Full path to the alternative analysis folder. Default is /staging/DRAGEN_Solid_WGS_Tumor_Normal_Pipeline_{version}_Analysis_{datetimestamp} if not specified. This folder must have enough available free space for the analysis and be on an NVMe SSD partition to achieve high performance.

--sampleOrCaseIDs

No

The comma-delimited sample IDs (or CaseID) that are processed by the run. For example, Sample_1,Sample_2.

--referenceGenome

No

Specify the reference genome to use for alignment. Possible values: hg38 or hs37d5_chr. Default is hg38.

--disableOraCompression

No

Specify to disable Ora compression.

--customResourceDir

No

Provide custom resource directory path.

--customConfig

No

Provide custom config file path.

--keepFullWorkDir

No

Copy entire work dir to analysis output folder. Default behavior is to copy only nextflow logs.

--version

No

Displays the version of the software, and then exits.

--help

No

Displays the help text.

Sample Sheet Requirements

The sample sheet for the TN pipeline may contain additional user-defined fields such as Sex, Tumor Type or Case ID for use with variant interpretation in Illumina Connected Insights (ICI).

Standard Sample Sheet Requirements

The following sample sheet requirements describe the required and optional fields for the TN pipeline. The sample sheet must contain the following sections.

The analysis fails if the sample sheet requirements are not met.

[Header] Section

Parameter
Required
Details

FileFormatVersion

2

v2 sample sheet format

[TN_Data] Section

Sample Parameter
Required
Details

Sample_ID

Required

The unique ID to identify a sample. The sample ID is included in the output file names. Sample IDs are not case sensitive. Sample IDs must have the following characteristics: - Unique for the run. - 1–70 characters. - No spaces. - Alphanumeric characters with underscores and dashes. If you use an underscore or dash, enter an alphanumeric character before and after the underscore or dash. eg, Sample1-T5B1_022515. - Cannot be called all, default, none, unknown, undetermined, stats, or reports. - Must match a Sample_ID listed in the [BCLConvert_Data] section. Each sample must have a unique combination of Lane (if applicable), sample ID, and index ID or the analysis will fail.

Case_ID

Required

A unique ID that links the same biological samples from the same individual. It is used for variant interpretation in downstream software such as the Illumina Connected Insights software

Sample_Type

Required

Possible value is DNA.

Sample_Classification

Required

Possible values are Tumor or Normal.

Specimen_Type

Required

Possible values are FFPE (Formalin-Fixed, Paraffin-Embedded), FF (Fresh Frozen) for Tumor sample classification. No restrictions on a sample classification of Normal

Sex

Optional

Possible values are Male, Female or Unknown

Tumor_Type

Optional

Supports tumor type codes based on the SNOMED ontology.

Sample_Description

Optional

Free text description for the sample
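
The Sample_ID rules in the table above lend themselves to a quick pre-flight check. The Python sketch below covers only the rules that can be checked from the ID strings themselves (length, characters, reserved names, uniqueness); it does not check the match against the [BCLConvert_Data] section or the lane and index combination. The function names and the regular expression are illustrative and are not part of the pipeline.

```python
import re
from typing import Iterable, List

RESERVED_IDS = {"all", "default", "none", "unknown", "undetermined", "stats", "reports"}

# Alphanumeric runs joined by single underscores or dashes, so that every
# underscore or dash has an alphanumeric character before and after it.
ID_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[_-][A-Za-z0-9]+)*")


def sample_id_problems(sample_id: str) -> List[str]:
    """Return the rules that a single Sample_ID violates (empty if none)."""
    problems: List[str] = []
    if not 1 <= len(sample_id) <= 70:
        problems.append("length must be 1-70 characters")
    if " " in sample_id:
        problems.append("spaces are not allowed")
    if not ID_PATTERN.fullmatch(sample_id):
        problems.append("use alphanumerics, with '_' or '-' only between alphanumerics")
    if sample_id.lower() in RESERVED_IDS:
        problems.append("reserved name")
    return problems


def duplicated_ids(sample_ids: Iterable[str]) -> List[str]:
    """Return IDs that repeat an earlier ID; comparison ignores case."""
    seen = set()
    duplicates: List[str] = []
    for sample_id in sample_ids:
        key = sample_id.lower()
        if key in seen:
            duplicates.append(sample_id)
        seen.add(key)
    return duplicates


print(sample_id_problems("Sample1-T5B1_022515"))
print(sample_id_problems("Sample_"))
print(duplicated_ids(["S1", "s1", "S2"]))
```

Uniqueness is compared case-insensitively because the table states that Sample IDs are not case sensitive.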

To ensure a successful analysis, follow these guidelines:

  1. Avoid any blank lines at the end of the sample sheet; these can cause the analysis to fail.

  2. When running a local analysis using the command line, save the sample sheet in the sequencing run folder with the default name SampleSheet.csv, or choose a different name and specify the path in the command-line options.

Launching Analysis

UI Options

Manual Launch of Pipeline Analysis

To manually launch an analysis, configure a pipeline analysis run in ICA with the following parameters.

General

Parameter Name
Description
Required
Default

User Reference

The custom name of the analysis for later identification.

Yes

Empty

User Tags

Tags for the analysis to help with categorization and identification, enhancing organization and searchability.

No

Empty

Notification

Add a user to be notified when the analysis completes.

No

No user selected

Output Folder

The path to the analysis output folder.

No

Project output folder

Input Files

Parameter Name
Description
Required
Default

Samplesheet

The SampleSheet.csv for the analysis

Yes

SampleSheet.csv in Input Folder

Input Directory

The input folder that contains [bcl, fastq, bam, cram] files to analyze. Multiple input [fastq, bam, cram] folders can be specified.

Yes

No folder selected

Custom Parameters Config File

The custom parameters config file for the analysis.

No

No file selected

Custom Resources Directory

The custom resources directory used for the analysis.

No

No folder selected

Settings

Parameter Name
Description
Required
Default

Input Type

The type of files in the Input Folder(s): bcl, fastq, bam, cram.

Yes

bcl

Reference Genome

The reference genome used for the analysis: [hs37d5_chr, hg38].

Yes

hg38

Enable Ora Compression

Compress fastq files using ora compression. [Only applies when Input Type is bcl].

No

true

Enable Post Processing

Use the post-processing scripts at the end of the pipeline analysis.

No

false

Sample IDs

Optional subset of Sample IDs or Pair IDs to analyze. A comma-separated list.

No

Empty

Resources

Parameter Name
Description
Required
Default

Storage Size

The storage size to allocate for the analysis. The minimum required value is Large.

Yes

Large

Other available options for the storage size:

  • "1.2TB" if option selected is Small

  • "2.4TB" if option selected is Medium

  • "7.2TB" if option selected is Large

  • "16TB" if option selected is XLarge

  • "32TB" if option selected is 2XLarge

  • "64TB" if option selected is 3XLarge

Note: It is recommended to reserve storage size twice the size of the BCL run folder, or the input fastq.gz or bam files, four times the size of the cram file (cram is 30-70% of the bam), and 8 times the size of the fastq.ora (fastq.ora is about 25% of fastq.gz).
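
The sizing guidance in the note above can be turned into a small calculation. The Python sketch below multiplies the input size by the recommended factor for its file type and picks the smallest listed storage option that fits, never going below the stated minimum of Large. The function name and the dictionary layout are illustrative; the multipliers and capacities are the ones listed in this section.

```python
from typing import List, Tuple

# Storage options and capacities (TB) as listed above, smallest first.
STORAGE_TIERS_TB: List[Tuple[str, float]] = [
    ("Small", 1.2),
    ("Medium", 2.4),
    ("Large", 7.2),
    ("XLarge", 16.0),
    ("2XLarge", 32.0),
    ("3XLarge", 64.0),
]

# Recommended multiple of the input size to reserve, by input file type.
SIZE_MULTIPLIER = {"bcl": 2, "fastq.gz": 2, "bam": 2, "cram": 4, "fastq.ora": 8}


def recommend_storage(input_size_tb: float, input_kind: str,
                      minimum_tier: str = "Large") -> str:
    """Return the smallest storage option that satisfies the sizing guidance."""
    required_tb = input_size_tb * SIZE_MULTIPLIER[input_kind]
    tier_names = [name for name, _ in STORAGE_TIERS_TB]
    start = tier_names.index(minimum_tier)
    for name, capacity_tb in STORAGE_TIERS_TB[start:]:
        if capacity_tb >= required_tb:
            return name
    raise ValueError(f"{required_tb} TB exceeds the largest storage option")


print(recommend_storage(1.0, "bam"))
print(recommend_storage(3.0, "cram"))
print(recommend_storage(5.0, "fastq.ora"))
```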

For information about using pipelines, refer to the Illumina Connected Analytics documentation.

DRAGEN Secondary Analysis Software
DNA Somatic Tumor-Only Heme WGS DRAGEN recipe
custom configuration file
DRAGEN Command Line Options
DNA alignment
DRAGEN DNA Pipeline UMI recipe
DRAGEN DNA Pipeline Small Variant Calling
DRAGEN DNA Pipeline Small Variant Calling
here
here
here
Illumina Connected Insights
DNA Somatic Tumor-Only Heme WGS DRAGEN recipe
DNA Somatic Tumor-Only Heme WGS DRAGEN recipe
Custom Docker
Nextflow Template
set up
ICA Cloud App Launch Guide
Quick Start
TMB Germline Variants
Germline-aware Mode
downloaded
installation guide

Advanced Topics

Advanced Use Cases

  • Users may be able to rerun a completed analysis using the "rerun" option in ICA.

  • Users may be able to use the icav2 client to complete any analysis performed through the UI.

Local DRAGEN Server

4.4.4.53

run_Solid_WGS_TN_{version}.sh

/usr/local/bin

ICA

c18e9e69-0a74-4c43-a419-a62cb7c6abc0

icav2

ICA Pipelines

ICA

urn:ilmn:ica:pipeline:c18e9e69-0a74-4c43-a419-a62cb7c6abc0#Solid_WGS_TN_v4_4_4_53

supported browser

ICA UI

Templates

The pipeline only supports starting from FASTQ, BAM or CRAM in the current release. The sample sheet below only contains the minimally required sections for starting the analysis. It is not a valid sample sheet for other purposes.

[Header],,,,,,,,,,
FileFormatVersion,2,,,,,,,,,
RunName,DRAGEN TN Start From FASTQ Only,,,,,,,,,
InstrumentType,NovaSeq,,,,,,,,,
InstrumentPlatform,NovaSeq,,,,,,,,,

[TN_Data],,,,,,,,,,
Sample_ID,Specimen_Type,Sample_Type,Case_ID,Sample_Description,Sample_Classification
tumorSample,FFPE,DNA,SampleA,Description1,Tumor
normalSample,FFPE,DNA,SampleA,Description2,Normal
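
A sample sheet like the template above can be checked before launch. The Python sketch below tests only a subset of the documented requirements: FileFormatVersion is 2, the [TN_Data] section has the required columns and at least one sample row, and the file does not end with a blank line. The function names are hypothetical, treating a final line that contains only commas as blank is an assumption, and passing this check does not guarantee that the pipeline will accept the sheet.

```python
import csv
import io
from typing import Dict, List

REQUIRED_COLUMNS = ("Sample_ID", "Case_ID", "Sample_Type",
                    "Sample_Classification", "Specimen_Type")


def read_sections(text: str) -> Dict[str, List[List[str]]]:
    """Split a v2 sample sheet into {section name: rows}."""
    sections: Dict[str, List[List[str]]] = {}
    current = None
    for row in csv.reader(io.StringIO(text)):
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip blank lines and comma-only padding lines
        first = cells[0]
        if first.startswith("[") and first.endswith("]"):
            current = first[1:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(cells)
    return sections


def sample_sheet_problems(text: str) -> List[str]:
    """Return a list of problems found (empty if none)."""
    problems: List[str] = []
    lines = text.splitlines()
    if lines and not lines[-1].strip(", \t"):
        problems.append("blank line at end of sample sheet")
    sections = read_sections(text)
    header = {row[0]: row[1] for row in sections.get("Header", []) if len(row) > 1}
    if header.get("FileFormatVersion") != "2":
        problems.append("[Header] FileFormatVersion must be 2")
    data = sections.get("TN_Data")
    if not data:
        problems.append("[TN_Data] section is missing or empty")
    else:
        columns = [cell for cell in data[0] if cell]
        missing = [name for name in REQUIRED_COLUMNS if name not in columns]
        if missing:
            problems.append("missing [TN_Data] columns: " + ", ".join(missing))
        if len(data) < 2:
            problems.append("[TN_Data] has no sample rows")
    return problems


sheet = "\n".join([
    "[Header],,,,,,,,,,",
    "FileFormatVersion,2,,,,,,,,,",
    "",
    "[TN_Data],,,,,,,,,,",
    "Sample_ID,Specimen_Type,Sample_Type,Case_ID,Sample_Description,Sample_Classification",
    "tumorSample,FFPE,DNA,SampleA,Description1,Tumor",
    "normalSample,FFPE,DNA,SampleA,Description2,Normal",
]) + "\n"
print(sample_sheet_problems(sheet))
```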

Launching Analysis

Overview

Run on DRAGEN Server

Getting Started

Analysis output is written to /staging/DRAGEN_Solid_WGS_Tumor_Normal_Pipeline_{version}_Analysis_{datetimestamp} by default. To write to a different output directory, run the bash script with --analysisFolder <FULL_PATH_TO_ANALYSIS_FOLDER>.

Local Specific Output

Local output management

  • On DRAGEN server, Nextflow logs are contained in the Work folder

DRAGEN Server App

Installation Procedure on DRAGEN Server

Downloader

Choose the downloader appropriate for your platform. When executed, it prompts you for a path to download the assets to. The required software packages are downloaded into the dragen_pipelines directory under the path provided at the prompt. If the path was used for a previous execution of the downloader, any incomplete downloads are resumed, existing files are checksummed, and any files with invalid checksums are re-downloaded.

The downloaded directory content may be moved to the installation target DRAGEN server using a USB key with at least 128 GB of free space or by copying to Network Storage which is reachable from the target DRAGEN Server.

Downloader System Requirements

Downloader Name | System Requirements
Solid_WGS_TN_{version}_Downloader_unix | x86_64 platform with glibc 2.25+
Solid_WGS_TN_{version}_Downloader_mac | arm64 macOS
Solid_WGS_TN_{version}_Downloader_windows.exe | 64-bit Windows 10+

Expected downloaded content

    • dragen-app-manager-1.0.14-1.x86_64-el8-offline.run

    • README

      • install_Solid_WGS_TN_v4.4.4.53.run

      • Solid_WGS_TN_4.4.4.53.iapp

      • README

      • solid-wgs-tn-resources_4.4.4.2.ires

      • dpf-core_1.0.0.36.ires

      • dpf-templates_4.4.4.52.ires

      • dpf-docker-images_4.4.4.52.ires

      • dragen-4.4.4-12.multi.el8.x86_64.run

      • hg38-alt_masked.cnv.graph.hla.methyl_cg.rna-11-r5.0-1.ires

      • hg38-alt_masked.cnv.hla.methyl_cg.methylated_combined.rna-11-r5.0-1.ires

      • hs37d5_chr-cnv.graph.hla.methyl_cg.rna-11-r5.0-1.ires

      • hs37d5_chr-cnv.hla.methyl_cg.methylated_combined.rna-11-r5.0-1.ires

      • variant_annotation_data-tmb_annotations-4.4.4-1.ires

Installer

Installation Requirements

DRAGEN and DRAGEN Application Manager

The pipeline requires DRAGEN v4.4.4 or higher. If this version of DRAGEN (or higher) is not already installed when the app is installed, the installer installs it.

Minimum System Operating Requirements

Hardware

  • v3 DRAGEN server or v4 DRAGEN server

  • mkfifo is enabled on the network-attached storage (NAS).

Software

The software installed by default on the DRAGEN server includes the following items:

  • DRAGEN server software. Refer to sample sheet settings for the DRAGEN version number.

  • Oracle Linux 8

Storage

  • DRAGEN server v3 provides a 6.4 TB NVMe SSD. This SSD is located at the /staging directory and is suitable for storing only one or two runs of the analysis pipeline.

  • DRAGEN server v4 provides 12.8 TB via a 2 x 6.4 TB NVMe U.2 SSD configuration.

  • Consider the following when making data storage decisions.

    • A NovaSeq 6000 sequencing run that uses an S4 flow cell can produce up to 3 TB of output.

      • The pipeline can produce an additional 4-6 TB of analysis output. For optimal performance when writing to a non-default directory, specify an analysis folder location on /staging. This ensures that the DRAGEN-related processes read and write data to the DRAGEN Server's high-speed NVMe SSD.

    • Network-attached storage is required for long-term storage of sequencing runs and pipeline output.

    • Managing data storage is your responsibility.

      • Illumina recommends developing a strategy to copy data from the DRAGEN server to network-attached storage.
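
The storage figures above suggest checking free space before starting a run. The following is a minimal sketch that uses only the Python standard library. The helper name has_room and the 6 TB requirement (the upper end of the analysis output range quoted above) are assumptions chosen for this example, not values enforced by the pipeline.

```python
import os
import shutil

TB = 10**12  # decimal terabyte, as used for drive capacities


def has_room(path, required_tb):
    """Return (enough_space, free_tb) for the file system that holds `path`."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_tb * TB, free_bytes / TB


# /staging is the default analysis location on the DRAGEN server.
# Fall back to the current directory so the sketch also runs elsewhere.
target = "/staging" if os.path.isdir("/staging") else "."
enough, free_tb = has_room(target, required_tb=6)
print(f"{target}: {free_tb:.2f} TB free, enough for one analysis: {enough}")
```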

Installation Instructions

  • Installing the pipeline requires root privileges.

    • Follow the instructions for DRAGEN license installation provided by Illumina Customer Care or refer to the DRAGEN server documentation.

  • Copy the directory structure from the downloader directory to the target DRAGEN server (or a path accessible with sudo privileges)

  • Ensure the installer has the correct privileges by running: chmod +x install_Solid_WGS_TN_v{version}.run

  • Launch the installer with root privileges: sudo /path/to/install_Solid_WGS_TN_v{version}.run

    • If DRAGEN Application Manager is not already installed, the installer exits and directs you to the path of the DRAGEN Application Manager installer.

Run Self-Test Script

The self-test script, present after app installation, checks the following functions:

  • All required services are running.

  • All resources are in place.

  • The analysis workflow image can be launched.

  • The pipeline can run successfully on a test dataset.

To run the self-test script, execute:

/usr/local/bin/check_Solid_WGS_TN_{version}.sh

If the self-test prints a failure message, contact Illumina Technical Support, and provide the output file found in /staging/check_Solid_WGS_TN_{version}_{datetimestamp}.tgz.

When running an analysis on the DRAGEN server via SSH, Illumina recommends that you use a terminal multiplexer utility, which allows you to resume analysis in the event of a disconnection from the DRAGEN server.

Uninstall pipeline

To uninstall the pipeline, run the following command as the root user (or with sudo privileges):

/usr/local/bin/uninstall_Solid_WGS_TN_{version}.sh

Executing the uninstall script removes the following assets:

  • All scripts, including:

    • run_Solid_WGS_TN_{version}.sh

    • check_Solid_WGS_TN_{version}.sh

    • uninstall_Solid_WGS_TN_{version}.sh

  • The application installed under DRAGEN Application Manager

If the uninstall script is executed with the -r or --removeResources flag, dependencies of the application being uninstalled will be removed if no other applications depend on them.

You are not required to uninstall DRAGEN Application Manager, Docker, or the DRAGEN server software.

To remove Docker, review the install instructions for your operating system in the Docker documentation

Custom Config Support

This feature allows users to customize a specific set of DRAGEN command-line options to override the default values pre-defined in the pipeline.

ICA Setup

In the ICA (Illumina Connected Analytics) user interface (UI), you can specify the Custom Parameters Config File and Custom Resources Directory directly. Supported customizable options are described below.

Examples

solid_custom_param.config Content

custom_resources_Heme_dir Folder Structure on ICA

ICA Input Files UI Example

Advanced Topics

CRAM input

Custom Workflow

Autolaunch requires additional BaseSpace Sequence Hub and sample sheet settings.

BaseSpace Sequence Hub Requirements for ICA Autolaunch

Autolaunch uses the BaseSpace Sequence Hub (BSSH) run planning tool to create and export a v2 format sample sheet to enable streaming of sequencing run data to the project and requires the following additional settings. See Figure 1 below.

  • Access to BaseSpace Sequence Hub.

  • ICA Run Storage is enabled under BaseSpace Sequence Hub settings.

Illumina Cloud Run Planning and Auto-Launch Workflow

Autolaunch requires a v2 format sample sheet with specific parameters that instruct the BSSH project to automatically initiate a pipeline analysis in ICA. Use the run planning option in BaseSpace Sequence Hub to generate the sample sheet. The exported sample sheet is automatically populated with the required fields. Using an invalid sample sheet can result in failed runs and analyses.

Refer to Table 1 below for descriptions of the added fields. Enter the following required run parameters in BaseSpace Sequence Hub Run Planning:

Figure 1. BSSH Run Planning Enabled End to End Workflow

The BaseSpace Sequence Hub setting for run monitoring and storage must be selected on the instrument to use analysis Autolaunch. For information on preparing your instrument for DRAGEN App for Whole-Genome Sequencing Autolaunch, refer to the documentation for your instrument.

  1. Use Run Planning in BaseSpace Sequence Hub to create and export a sample sheet.

  2. Import the sample sheet to the instrument and start the sequencing run. Data is uploaded to BaseSpace Sequence Hub and then pushed to ICA. You can monitor the run in BaseSpace Sequence Hub.

  3. When sequencing and the upload are complete, analysis autolaunches in ICA. You can monitor the status of the analysis in BaseSpace Sequence Hub or ICA.

  4. If necessary, requeue the analysis via the run's Summary page in BaseSpace Sequence Hub. Refer to the BaseSpace Sequence Hub support site page for more information on requeuing an analysis.

  5. View the analysis output results in either BaseSpace Sequence Hub or ICA.

Table 1. Additional Sample Sheet Fields for Autolaunch

Autolaunch-compatible sample sheets contain the following fields specific to autolaunch configuration.

ICA Cloud App

Requirements

Analysis on ICA requires an account with a valid subscription and a project with the following configuration.

Manual Launch

How to Launch Analysis

After adding the Bundle to the Project, an example dataset becomes available in the Demo_Data folder for the Project.

  1. Download Results: After analysis is complete, navigate to results in the configured output location.

Please see the Illumina Support Shorts for guidance on how to set up and run DRAGEN Solid WGS Tumor Normal analysis on ICA.

Analysis Parameters on ICA

To launch an analysis via the ICA user interface, configure a DRAGEN Solid WGS Tumor Normal pipeline analysis with the following parameters.

For more information about using ICA and BaseSpace Sequence Hub or running a pipeline analysis on ICA, refer to the relevant support pages on the Illumina support site.

DRAGEN Recipes

Overview

Germline Pipelines

Germline with UMI Pipelines

RNA and scRNA Pipelines

Somatic Pipelines

Somatic with UMI Pipelines

Analysis Methods

DNA Analysis Methods

The software performs germline variant calling on the normal sample, and reports the following variants:

  • SNV (annotated)

  • CNV (annotated)

  • SV (annotated)

  • Targeted callers (cyp2b6, cyp2d6, cyp21a2, gba, hba, lpa, rh and smn)

  • Expansion hunter

  • VNTR

The software performs somatic variant calling on the tumor sample and reports the following variants:

  • SNV (annotated)

  • MNV

  • CNV (annotated, requires germline SNV and CNV VCF)

  • SV (annotated, with variant deduplication)

  • TMB

  • MSI

  • HRD

  • ASCN

  • LOH

  • DUX4

  • HLA

Reference Genomes

The pipeline supports two reference genomes for the DRAGEN Map/Aligner: hg38 and hs37d5_chr.

The hs37d5_chr genome is the hg19 reference genome with the Chromosome Y PAR masked. It includes the NC_012920 mitochondrial genome. The contigs have the chr prefix added, but without the native alternate loci names.

DRAGEN Map/Aligner

DRAGEN uses these final alignments as input for downstream steps such as gene amplification (copy number) calling, small variant calling (SNV, indel, MNV, delins), and DNA library quality control.

Small Variant Calling and Filtering

DRAGEN supports calling SNVs, indels, MNVs, and delins in tumor-only samples by using mapped and aligned DNA reads from a tumor sample as input. Variants are detected via both column-wise pileup analysis and local de novo assembly of haplotypes. The de novo haplotypes allow the detection of much larger insertions and deletions than is possible through column-wise pileup analysis alone. Insertion and deletion calls are validated for lengths of 0–25 bp, and lengths greater than 25 bp can also be supported. In addition, DRAGEN uses the de novo assembly to detect SNVs, insertions, and deletions that are co-phased and part of the same haplotype. Any such co-phased variants that fall within a window of 15 bp can then be reassembled into complex variants (MNVs and delins). The tumor-only pipeline produces a VCF file containing both germline and somatic variants that can be further analyzed to identify tumor mutations. The pipeline makes no ploidy assumptions, enabling detection of low-frequency alleles.

DRAGEN small variant calling includes the following steps:

  1. Detects regions with sufficient read coverage (callable regions).

  2. Detects regions where the reads deviate from the reference and there is a possibility of a germline or somatic call (active regions).

  3. Assembles de novo graph haplotypes from reads (haplotype assembly).

  4. Extracts possible somatic or germline calls (events) from column wise pileup analysis.

  5. Calibrates read base qualities to account for background noise.

  6. Computes read likelihoods for each read/haplotype pair.

  7. Performs mutation calling by summing the genotype probabilities across all reads/haplotype pairs.

  8. Performs additional filtering to improve variant calling accuracy, including using a systematic noise file. The systematic noise file indicates the statistical probability of noise at specific positions in the genome. This noise file is constructed using clean (normal) samples. Regions where noise is common (eg, difficult to map regions) have higher noise values. The small variant caller penalizes those regions to reduce the probability of making false positive calls.

Somatic mode

Copy Number Variant Calling

The DRAGEN copy number variant caller performs amplification, reference, and deletion calling for CNV targets within the assay. It counts the coverage of each target interval on the panel, uses a preprocessed panel of normal samples to normalize target counts, corrects for GC coverage bias, and calculates scores of a CNV event from observed coverage and makes copy number calls.

Absolute Copy Numbers (ABCN)

Loss of Heterozygosity

Structural Variant Calling

Variant Deduplication

Contamination Detection

The contamination analysis step detects foreign human DNA contamination using the SNP error file and pileup file that are generated during small variant calling, and the TMB trace file. The software determines whether a sample has foreign DNA using the contamination score. In contaminated samples, the variant allele frequencies in SNPs shift from the expected values of 0%, 50%, or 100%. The algorithm collects all positions that overlap with common SNPs that have variant allele frequencies of < 25% or > 75%. Then, the algorithm computes the likelihood that the positions are an error or a real mutation. The contamination score is the sum of the log-likelihood scores across the predefined SNP positions that have a minor allele frequency < 25% in the sample and are not likely due to CNV events.

The larger the contamination score, the more likely there is foreign DNA contamination. A sample is considered to be contaminated if the contamination score is above a predefined quality threshold. The contamination score was found to be high in samples with highly rearranged genomes or HRD samples: 1% of HRD samples were found to be above the threshold with no evidence of actual contamination.
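
The selection and summation described above can be sketched as follows. This is only an illustration of the bookkeeping. The per-site log-likelihood is taken as a given input because the text does not specify how DRAGEN computes it. The field names vaf, log_likelihood, and likely_cnv, and all values, are hypothetical.

```python
def contamination_score(sites):
    """Sum per-site log-likelihood scores over SNP sites whose allele
    frequency has shifted away from 50% and that are not explained by a CNV."""
    score = 0.0
    for site in sites:
        vaf = site["vaf"]
        # Keep only sites with VAF < 25% or > 75%, which is the same as
        # a minor allele frequency below 25% at that site.
        if not (vaf < 0.25 or vaf > 0.75):
            continue
        # Skip sites where the shift is likely caused by a CNV event.
        if site["likely_cnv"]:
            continue
        score += site["log_likelihood"]
    return score


# Hypothetical sites at predefined common SNP positions.
example_sites = [
    {"vaf": 0.05, "log_likelihood": 1.5, "likely_cnv": False},  # counted
    {"vaf": 0.50, "log_likelihood": 4.0, "likely_cnv": False},  # heterozygous, skipped
    {"vaf": 0.90, "log_likelihood": 2.0, "likely_cnv": False},  # counted
    {"vaf": 0.10, "log_likelihood": 3.0, "likely_cnv": True},   # CNV, skipped
]
print(contamination_score(example_sites))
```

The resulting score would then be compared with the predefined threshold, which is not stated in this section.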

Annotation

The Illumina Annotation Engine performs annotation of small variants and CNVs. The inputs are gVCF files and the outputs are annotated JSON files.

The Illumina Annotation Engine processes each variant entry and annotates with available information from databases such as dbSNP, gnomAD genome and exome, 1000 genomes, ClinVar, COSMIC, RefSeq, and Ensembl. The header includes version information and general details. Each annotated variant is included as a nested dictionary structure in separate lines following the header.

Biomarkers

Tumor Mutational Burden

DRAGEN is used to compute tumor mutational burden (TMB) in coding regions where there is sufficient coverage.

The following variants are excluded from the TMB calculation:

  • Non-PASS variants.

  • Mitochondrial variants.

  • MNVs.

  • Variants that do not meet a minimum depth threshold.

  • Variants that do not meet the minimum variant allele threshold.

  • Variants that fall outside the eligible regions.

  • Tumor driver mutations. Variants with a population allele count ≥ 50 are treated as tumor driver mutations.

  • Germline variants. Germline variants are not counted towards TMB. Variants are determined as germline based on a database and a proxy filter.

Variants with a population allele count ≥ 10 that are observed in either the 1000 Genomes or gnomAD databases are marked as germline. MNVs, which do not count towards TMB, may be marked as germline when all their component small variants are marked as germline. The proxy filter scans the variants surrounding a specific variant and identifies those variants with similar variant allele frequencies (VAF). If the majority of surrounding variants of similar VAF are germline, then the variant is also marked as germline.

The formula for TMB calculation is:

Outputs are captured in a .tmb.trace.tsv file that contains information on variants used in the TMB calculation and a .tmb.metrics.json file that contains the TMB score calculation and configuration details.
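
As a small worked example of the TMB ratio (filtered variants divided by the eligible region size in Mbp), the sketch below uses made-up counts. The function name tmb and the input values are assumptions for illustration. The actual values come from the .tmb.trace.tsv and .tmb.metrics.json outputs.

```python
def tmb(filtered_variants, eligible_region_bp):
    """Tumor mutational burden in variants per megabase of eligible region."""
    if eligible_region_bp <= 0:
        raise ValueError("eligible region size must be positive")
    eligible_region_mbp = eligible_region_bp / 1_000_000
    return filtered_variants / eligible_region_mbp


# Hypothetical counts: 350 variants pass all filters, 280 of them are
# nonsynonymous, and the eligible region covers 35,000,000 bases.
print("TMB:", tmb(350, 35_000_000))
print("Nonsynonymous TMB:", tmb(280, 35_000_000))
```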

Microsatellite Instability Status

DRAGEN can determine the MSI status of a sample. It uses a normal reference file, which was created from a set of normal samples. During sequencing, normal reference files are generated by tabulating read counts for each microsatellite site. The normal file contains the read count distribution for each microsatellite.

MSI calling for a tumor-only sample is performed by first tabulating tumor counts from the read alignments for each microsatellite site. Then, the Jensen-Shannon distance (JSD) is calculated between each pair of tumor and normal baseline samples. DRAGEN determines unstable sites by performing Chi-square testing of the tumor JSD and normal JSD distributions. A site is called unstable if the mean distance difference of the two JSD distributions is ≥ the distance threshold and the Chi-square p-value is ≤ the p-value threshold. Lastly, DRAGEN produces an MSI status given the assessed site count, the unstable site count, the percentage of unstable sites among all assessed sites, and the sum of the Jensen-Shannon distances of all the unstable sites.
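
To make the distance step concrete, the sketch below computes a Jensen-Shannon distance between two read-count distributions for a single microsatellite site. It is a textbook implementation that uses base-2 logarithms, so the result lies between 0 and 1. Whether DRAGEN uses the same logarithm base or any smoothing is not stated here, so treat those details as assumptions. The read counts are invented.

```python
import math


def jensen_shannon_distance(p_counts, q_counts):
    """Jensen-Shannon distance (square root of the JS divergence, log base 2)
    between two count vectors defined over the same repeat-length bins."""
    if len(p_counts) != len(q_counts):
        raise ValueError("count vectors must have the same length")
    p_total, q_total = sum(p_counts), sum(q_counts)
    if p_total <= 0 or q_total <= 0:
        raise ValueError("each count vector needs at least one read")
    p = [count / p_total for count in p_counts]
    q = [count / q_total for count in q_counts]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; bins with zero mass in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


# Hypothetical read counts per repeat length at one site.
tumor = [2, 10, 40, 30, 18]
normal = [0, 5, 80, 12, 3]
print(round(jensen_shannon_distance(tumor, normal), 3))
```

Identical distributions give a distance of 0 and non-overlapping distributions give 1, which is why a larger tumor-versus-normal distance points towards an unstable site.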

HRD

Homologous Recombination Deficiency (HRD) score is a whole genome signature measurement of genomic instability. The HRD is composed of the sum of three components: loss of heterozygosity (LOH), telomeric allele imbalance (TAI), and large-scale state transition (LST). A panel of normal samples is used for both bias reduction and normalization prior to HRD score estimation. Final HRD results can be found in the *.hrdscore.csv file.
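
The composition of the score can be written as a one-line sum. The sketch below only illustrates that arithmetic with invented component values. The function name and inputs are assumptions, and the reported values should be read from the *.hrdscore.csv file.

```python
def hrd_score(loh, tai, lst):
    """Sum the three genomic-instability components into one HRD score."""
    components = {"LOH": loh, "TAI": tai, "LST": lst}
    for name, value in components.items():
        if value < 0:
            raise ValueError(f"{name} component cannot be negative")
    return loh + tai + lst


# Hypothetical component values for one tumor sample.
print(hrd_score(loh=10, tai=12, lst=20))
```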

HLA Typing

Targeted Callers

Expansion Hunter

Variable Number Tandem Repeat (VNTR)

Troubleshooting

Help

Support Request

For debugging or support request, please include the files from the top level of the analysis output folder, the work directory and the errors directory content, in addition to the MetricsOutput.tsv from the Results folder.

Stopping Analysis

Pressing Ctrl+C during a DRAGEN step stops the currently running analysis and might cause an FPGA error. To recover from an FPGA error, shut down and restart the server.

Using a Non-root User

CAUTION: Do not run analyses as the root user as it can lead to permissions issues when managing data generated by the software.

Using ssh

When running the analysis software using SSH, Illumina recommends using additional software to prevent unexpected termination of analysis. Illumina recommends screen and tmux.

Using DAM

  • Ensure DRAGEN App Manager is running properly.

Using docker

Limitations

CIFS Support

  • In CIFS (SMB 1.0), the mounted volume may have a permission check issue and cause the Nextflow workflow to exit prematurely when a non-root user account is used for analysis, unless the filesystem permission check is disabled. The workaround is to use newer SMB protocols and configure Windows Active Directory for analysis with non-root users.

Using multiple FASTQ files for increased coverage (top-off)

  • If there are more than 16 FASTQ files, then use cat or other command line utility to concatenate the FASTQ files as a single FASTQ file to get around the file number restriction.
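
As an alternative to cat, the concatenation can be sketched in Python. Gzip-compressed FASTQ files can be joined byte for byte because a gzip stream may contain several members. The function name and file names below are hypothetical. For paired-end data, the R1 files and the R2 files must be concatenated in the same order so that read pairs stay aligned.

```python
import gzip
import os
import shutil
import tempfile


def concatenate_fastq(parts, destination):
    """Append the given FASTQ files, in order, into a single output file."""
    with open(destination, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


# Small self-contained demonstration using temporary files.
with tempfile.TemporaryDirectory() as workdir:
    first = os.path.join(workdir, "part1.fastq.gz")
    second = os.path.join(workdir, "part2.fastq.gz")
    merged = os.path.join(workdir, "merged.fastq.gz")
    with gzip.open(first, "wt") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n")
    with gzip.open(second, "wt") as handle:
        handle.write("@read2\nTTGA\n+\nIIII\n")
    concatenate_fastq([first, second], merged)
    with gzip.open(merged, "rt") as handle:
        print(handle.read().count("@read"), "reads in merged file")
```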

See

See

See

Users can visit the Sample Sheet guidelines section to learn additional details on required fields and values as they fill in their sample information, or download a template from Sample Sheet Template.

The DRAGEN Solid WGS Tumor Normal Pipeline is launched with the bash script called run_Solid_WGS_TN_{version}.sh, which is installed in the /usr/local/bin directory. The bash script is executed on the command line and runs the software using DRAGEN Application Manager. For a full list of command-line options, refer to Command-Line Options.

To launch an analysis, you must provide the --inputType and --inputFolder arguments. The --inputType argument can be fastq, bam, or cram. The --inputFolder argument may be the absolute path to the input folder or a comma-separated list of paths. If more than one input folder is specified, the --sampleSheet argument must also be provided with the absolute path to a valid sample sheet (refer to Sample Sheet Requirements). If the --sampleSheet argument is not provided, the software checks for a file named SampleSheet.csv in the input folder.
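
The argument rules above can be sketched as a small helper that assembles the launch command. The flags --inputType, --inputFolder, --sampleSheet, and --analysisFolder and the script location come from this guide. The helper name, the example paths, and the exact form of the {version} string in the script name are assumptions, so check the installed script name in /usr/local/bin before use.

```python
import shlex

RUN_SCRIPT_TEMPLATE = "/usr/local/bin/run_Solid_WGS_TN_{version}.sh"
VALID_INPUT_TYPES = {"fastq", "bam", "cram"}


def build_launch_command(version, input_type, input_folders,
                         sample_sheet=None, analysis_folder=None):
    """Assemble the launch command as an argument list, applying the rules above."""
    if input_type not in VALID_INPUT_TYPES:
        raise ValueError(f"--inputType must be one of {sorted(VALID_INPUT_TYPES)}")
    if not input_folders:
        raise ValueError("at least one input folder is required")
    if len(input_folders) > 1 and sample_sheet is None:
        raise ValueError("--sampleSheet is required with more than one input folder")
    command = [
        RUN_SCRIPT_TEMPLATE.format(version=version),
        "--inputType", input_type,
        "--inputFolder", ",".join(input_folders),
    ]
    if sample_sheet is not None:
        command += ["--sampleSheet", sample_sheet]
    if analysis_folder is not None:
        command += ["--analysisFolder", analysis_folder]
    return command


# Hypothetical version string and paths, for illustration only.
command = build_launch_command(
    version="4.4.4.53",
    input_type="fastq",
    input_folders=["/staging/runs/tumor_fastq", "/staging/runs/normal_fastq"],
    sample_sheet="/staging/runs/SampleSheet.csv",
    analysis_folder="/staging/analysis/case_A",
)
print(shlex.join(command))
```

Building the command as a list keeps paths that contain spaces intact when the list is passed to subprocess.run.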

Work — (DRAGEN server only) - Contains information and files related to Nextflow execution

A separate lightweight downloader for Windows, macOS, and Linux operating systems is available at the .

Additional download information is available at the download site.

dragen_pipelines

Solid_WGS_TN_4.4.4.53

common

The pipeline also requires DRAGEN Application Manager to be installed, and an installer is included. DRAGEN Application Manager configuration is controlled by the config.toml file located in /etc/dragen-app-manager directory. See for additional information.

Delete output data on the DRAGEN server as soon as possible. For additional information on data output and storage, refer to .

Contact Illumina Customer Care to request a link to the Downloader or visit and confirm that the Genome DRAGEN license is enabled for your server.

ℹ️ Note: Custom resource files and the custom configuration file must be uploaded to the same ICA project where the run is created. You can use the icav2 client or other supported methods. See for details.

When CRAM is used as input, the reference genome used to generate the CRAM files is required. This may be provided using the custom configuration file.

Refer to the BaseSpace Sequence Hub support site page for information on setting up a BaseSpace Sequence Hub project.

Parameter Name
Setting

For more information on run planning, refer to the run planning section.

Section
Parameter
Details
Required

Create a Project: The Project can be specific to the DRAGEN Solid WGS Tumor Normal v4.4.4 Pipeline or it can contain multiple Pipelines and/or Tools. For information on creating Projects, refer to the Projects section in Illumina Connected Analytics help.

ICA standard storage is used by default as soon as the Project is saved. To connect a different storage source, set it up before creating your Project. For details and options, refer to the Storage section in Illumina Connected Analytics help.

Edit Project and Add Bundle: Edit the Project and add the bundle titled "Solid WGS TN v4.4.4 (XX)". XX is a 2-letter code designating the region from which you are launching the analysis. Adding the Bundle automatically adds the pipeline and associated resource files and datasets to the Project. For information on Bundles, refer to the Bundles section in Illumina Connected Analytics help.

Upload the sequencing data: For information on viewing and uploading data, refer to the Data section in Illumina Connected Analytics help.

Start Analysis: In the Project, navigate to Pipelines, select the Solid WGS TN v4_4_4_x Pipeline, and then select "Start New Analysis". Set up the new analysis by configuring the parameters listed in the table below. When the required files are completed, start the analysis.

Parameter Name
Description

For information about using pipelines, refer to the Illumina Connected Analytics support site page.

The following sub-pages contain recommended command line options for specific DRAGEN pipelines. For an overview of DRAGEN command line parsing, also see

The software is a DNA-only analysis pipeline based on the DRAGEN Secondary Analysis Software. Even though it includes some of the default settings from the DNA Somatic Tumor-Normal Solid WGS DRAGEN recipe, it uses a distinct recipe with different options. Users can override specific parameters via a custom configuration file.

An example command is provided that highlights the input and output used in the DragenCaller step of the software, which may be found in the DRAGEN run log file. Any parameter not displayed on the command line uses the default value for the DRAGEN variant caller module. The detailed parameters and default arguments for the individual modules within the DragenCaller step may be found in the replay.json output. See DRAGEN Command Line Options for detailed explanations of the parameters.

DNA alignment involves aligning sequencing reads derived from DNA libraries to a reference genome prior to variant calling.

The software currently supports both tumor and normal samples with UMI. Please use the DRAGEN DNA Pipeline UMI documentation to get details on the options.

Additional information is available at DRAGEN DNA Pipeline Small Variant Calling.

The DRAGEN Somatic Pipeline supports both matched tumor-normal pairs and tumor-only samples. The germline mode of the small variant caller is used to analyze the normal sample in the matched pair.

Additional information is available at DRAGEN DNA Pipeline Small Variant Calling.

Absolute copy numbers are calculated by the CNV ASCN Caller. See ASCN Caller.

More information is available at DRAGEN DNA Pipeline - LOH.

The DRAGEN Structural Variant (SV) Caller is described here.

The DUX4 rearrangement caller is described here.

Variant Deduplication is described here.

The database content included with the Nirvana database is described in the Nirvana online documentation.

The pipeline currently does not support annotation of gVCF files. Please use Illumina Connected Insights to perform tertiary analysis.

Please see DRAGEN DNA Pipeline - Biomarkers - TMB for details about the TMB biomarker analysis.

Please see DRAGEN DNA Pipeline - Biomarkers - MSI for details about the MSI biomarker analysis.

Please see DRAGEN DNA Pipeline - Biomarkers - HRD for details about the HRD biomarker analysis.

Please see DRAGEN DNA Pipeline - HLA Typing for details.

Please see DRAGEN DNA Pipeline - Targeted Callers.

Please see DRAGEN DNA Pipeline - Expansion Hunter.

Please see DRAGEN DNA Pipeline - VNTR.

Ensure Docker is running properly. For Docker configuration help, check the DRAGEN Application Manager installation guide and the docker.org documentation.

The pipeline depends on the DRAGEN Application Manager (DAM). For issues related to the DRAGEN Application Manager installation, refer to the DRAGEN Application Manager Installation Guide.

To increase the coverage of a sample using multiple FASTQ files, the FASTQ files must follow the Illumina naming convention. The current limit is up to 16 FASTQ files from 8 lanes based on available flow cell types.

Dragen Server
ICA Cloud
ICA Cloud
Sample Sheet guidelines
Sample Sheet Template
Command-Line Options
Sample Sheet Requirements
DRAGEN Installer Download Site
DRAGEN Resource Files
DRAGEN Application Manager
Illumina Instrument Control Computer Security and Networking
DRAGEN Installer Download Site
Combine Phased Variants
Mitochondrial Calling
personalized variant calling
## custom parameters
somatic_vc_output_evidence_bam = false
germline_qc_detect_contamination = true
germline_aligner_clip_pe_overhang = 0

## custom reference files
somatic_vc_systematic_noise = '/snv/WGS_hg38_v1.0_systematic_noise.snv.bed.gz'
somatic_sv_systematic_noise = '/sv/WGS_FF_solid_hg38_v1.0_systematic_noise.sv.bedpe.gz'
somatic_vc_somatic_hotspots = '/snv/somatic_hotspots_GRCh38.vcf.gz'
custom_resources_Solid/
├── snv
│   ├── WGS_Solid_hg38_v1.0_systematic_noise.snv.bed.gz
│   └── somatic_hotspots_GRCh38.vcf.gz
└── sv
    └── WGS_Solid_FF_solid_hg38_v1.0_systematic_noise.sv.bedpe.gz

Secondary Analysis

BaseSpace Sequence Hub / Illumina Connected Analytics

Application

DRAGEN App for Whole-genome Sequencing

User Reference

The analysis run name

User Tags

Text labels to help index the analysis.

Notify me when task is completed

Option to receive an email notification when analysis is complete.

Output Folder

The path to the analysis output folder. The default path is the project output folder.

Entitlement Bundle

Automatically populated from the project details.

Samplesheet

Select a sample sheet in CSV format for the analysis. Note: Sample sheet selection is optional when starting from a run folder and required when submitting a FASTQ folder.

Input Directory

The run folder or FASTQ folder that contains files to analyze.

Input Type

Select the input type for the analysis. Options are bcl, fastq, bam, and cram.

Sample or Pair IDs

Optional subset of Sample IDs or Pair IDs to analyze.

Reference Genome

Select the reference genome. hs37d5_chr is the hg19 reference genome with the Chromosome Y PAR masked. It includes the NC_012920 mitochondrial genome. The contigs have the chr prefix added, but without the native alternate loci names.

Enable Ora Compression

Enable Ora Compression (True or False). Only applicable when Input Type is bcl

Enable Post Processing

Enable Post Processing (True or False) to run custom scripts at the end of pipeline

Storage Size

The storage size to allocate for the analysis. The default and recommended value is Large.

Custom Parameters Config File

Optional. Select a Custom Parameters Config File that overrides the default configuration.

Custom Resources Directory

Optional. Select a Custom Resources Directory to use with the Custom Parameters Config File.

CAUTION - This parameter ...

Optional. Configuration options marked with this comment apply only to auto-launched DRAGEN Solid WGS Tumor Normal analysis from FASTQs after BCL. Do not set them when starting the analysis from the ICA UI.

/opt/edico/bin/dragen \
--ref-dir /staging/dragen-app-manager/resources/Illumina_hg38-alt_masked.cnv.graph.hla.methyl_cg.rna-11_r5.0-1 \
--output-directory DragenCaller/Sample-001 \
--output-file-prefix Sample-001 \
--events-log-file DragenCaller/Sample-001/events.csv \
--enable-map-align=true \
--enable-map-align-output=true \
--enable-variant-caller=true \
--vc-emit-ref-confidence=GVCF \
--vc-enable-vcf-output=true \
--enable-targeted=true \
--targeted-merge-vc=true \
--enable-star-allele=true \
--enable-cnv=true \
--cnv-enable-self-normalization=true \
--repeat-genotype-enable=true \
--enable-sv=true \
--enable-vntr=true \
--sv-vntr-merge=false \
--enable-hla=true \
--hla-enable-class-2=true \
--vc-output-evidence-bam=false \
--qc-detect-contamination=true \
--qc-coverage-ignore-overlaps=false \
--logging-to-output-dir=true \
--max-base-quality=63 \
--enable-duplicate-marking false \
--tumor-normal-has-umi both \
--umi-source qname \
--umi-library-type nonrandom-duplex \
--umi-min-supporting-reads 1 \
--umi-correction-table /staging/dragen-app-manager/resources/Illumina_solid-wgs-tn-resources_4.4.4.2/umi/umi_correction_table.txt.gz \
--bam-input Sample-001.bam \
--force 
$$\mathrm{TMB} = \frac{\text{Filtered Variants}}{\text{Eligible Region Size (Mbp)}}$$

$$\text{Nonsynonymous TMB} = \frac{\text{Filtered Nonsynonymous Variants}}{\text{Eligible Region Size (Mbp)}}$$

Cloud_TN_Data

Sample_ID

The unique ID to identify a sample. Must match a Sample_ID used in the TN_Data section.

Yes

Sample_Type

Sample type.

No

Sample_Description

Must meet the following requirements:

- 1–70 characters.

- Alphanumeric characters with underscores, dashes, and spaces. If you enter an underscore, dash, or space, enter an alphanumeric character before and after.

No

Cloud_TN_Settings

SoftwareVersion

The software version

No

StartsFromFastq

Set the value to TRUE or FALSE. If autolaunching from BCL files, this must be set to FALSE.

Yes

Cloud_Data

Sample_ID

The same sample ID used in the Cloud_TN_Data section.

No

ProjectName

The BaseSpace Sequence Hub project name.

No

LibraryName

Combination of sample ID and index values in the following format: sampleID_Index_Index2.

No

LibraryPrepKitName

The Library Prep Kit used.

No

IndexAdapterKitName

The Index Adapter Kit used.

No

Cloud_Settings

GeneratedVersion

The cloud GSS version used to create the sample sheet. Optional if manually updating a sample sheet.

No

CloudWorkflow

ica_workflow_1

Yes

Cloud_TN_Pipeline

Yes
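The Sample_Description requirements and the LibraryName format in the table above can be checked before a sample sheet is uploaded. The Python sketch below is one possible reading, not an Illumina validator: it assumes that underscores, dashes, and spaces are all allowed as separators (as the "before and after" rule implies), that "alphanumeric" means ASCII letters and digits, and the function names are hypothetical.

```python
import re

# One or more alphanumerics, optionally followed by groups of a single
# separator (space, underscore, or dash) and more alphanumerics. This forces
# an alphanumeric character before and after every separator.
_DESCRIPTION_RE = re.compile(r"[A-Za-z0-9]+(?:[ _-][A-Za-z0-9]+)*")


def is_valid_sample_description(text: str) -> bool:
    """Check the 1-70 character limit and the separator rule."""
    return 1 <= len(text) <= 70 and _DESCRIPTION_RE.fullmatch(text) is not None


def library_name(sample_id: str, index: str, index2: str) -> str:
    """Build a LibraryName in the sampleID_Index_Index2 format."""
    return f"{sample_id}_{index}_{index2}"


# Hypothetical values for illustration only.
print(is_valid_sample_description("Tumor FFPE_block-2"))
print(library_name("Sample-001", "ATCACG", "CGATGT"))
```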

DNA Germline Panel UMI

The DRAGEN recipe includes the recommended pipeline-specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN pangenome hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
# Inputs 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
# UMI 
--umi-enable true 
--umi-source STRING                     #Default='qname' 
--umi-library-type STRING               #e.g. random-duplex 
--umi-min-supporting-reads 2            #Default=2 
# Small variant caller 
--enable-variant-caller true 
--vc-target-bed $VC_TARGET_BED 
# Annotation 
--variant-annotation-data PATH 
--enable-variant-annotation true 
# SV 
--enable-sv true 
--sv-exome true 
--sv-call-regions-bed $SV_TARGET_BED 
# CNV 
--enable-cnv true 
--cnv-target-bed $PATH 
--cnv-combined-counts $PATH             #CNV PON 
# HLA genotyper 
--enable-hla true 
--hla-enable-class-2 true               #if panel covers class II HLA regions 
--hla-as-filter-min-threshold 29.0      #panel specific setting 
--hla-as-filter-ratio-threshold 0.85    #panel specific setting 
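
The recipe above is written one option per line with trailing # comments, so it cannot be pasted into a shell as a single command without editing. As a sketch of one way to turn such a listing into an argument list, the Python below strips the comments and fills in the $NAME placeholders. The helper name `recipe_to_argv` and the example values are hypothetical, and the shortened recipe is only an excerpt.

```python
import shlex
from string import Template


def recipe_to_argv(recipe_text, values):
    """Convert a recipe listing into an argument list for subprocess.run()."""
    # comments=True drops everything from '#' to the end of each line,
    # including section markers such as '# Mapper'.
    tokens = shlex.split(recipe_text, comments=True)
    # Substitute $NAME placeholders token by token, so a value that contains
    # spaces stays in a single argument. Unknown placeholders are left as-is.
    return [Template(token).safe_substitute(values) for token in tokens]


recipe = """
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path
--ref-dir $REF_DIR                      #path to DRAGEN pangenome hashtable
# Mapper
--enable-map-align true                 #optional with BAM/CRAM input
"""

# Hypothetical values for illustration only.
argv = recipe_to_argv(recipe, {"VERSION": "4.4.4", "REF_DIR": "/staging/ref dir"})
print(argv)
```

Passing the resulting list to `subprocess.run(argv)` avoids shell quoting and line-continuation issues.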

Notes and additional options

Hashtable

For DRAGEN germline runs, it is recommended to use the pangenome hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

--fastq-list $PATH 
--fastq-list-sample-id $STRING 

FQ Input

--fastq-file1 $PATH 
--fastq-file2 $PATH 
--RGSM $STRING 
--RGID $STRING 

BAM Input

--bam-input $PATH 

CRAM Input

--cram-input $PATH 
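
The four input choices above are alternatives to one another. The Python sketch below shows one way to select the matching option set. The function name and its parameters are hypothetical; only the option names come from the listings above, and the FASTQ branch simply mirrors the four options shown there.

```python
from typing import List, Optional


def input_options(kind: str, path: str, *, mate_path: Optional[str] = None,
                  sample: Optional[str] = None,
                  read_group: Optional[str] = None) -> List[str]:
    """Return the DRAGEN input options for one of the four input kinds."""
    if kind == "fastq-list":
        if sample is None:
            raise ValueError("fastq-list input needs a sample ID")
        return ["--fastq-list", path, "--fastq-list-sample-id", sample]
    if kind == "fastq":
        if mate_path is None or sample is None or read_group is None:
            raise ValueError("fastq input needs mate_path, sample and read_group")
        return ["--fastq-file1", path, "--fastq-file2", mate_path,
                "--RGSM", sample, "--RGID", read_group]
    if kind == "bam":
        return ["--bam-input", path]
    if kind == "cram":
        return ["--cram-input", path]
    raise ValueError(f"unknown input kind: {kind}")


# Hypothetical file and sample names.
print(input_options("bam", "Sample-001.bam"))
print(input_options("fastq-list", "fastq_list.csv", sample="Sample-001"))
```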

Mapping and Aligning

Option
Description

--enable-map-align true

Optionally disable map & align (default=true).

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

UMI

Option
Description

--umi-source STRING

Specify the input type for the UMI sequence. Options: qname, fastq, bamtag.

--umi-library-type STRING

Set the batch option for the different types of UMI correction. Options: random-duplex, random-simplex, nonrandom-duplex.

--umi-nonrandom-whitelist $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The whitelist includes a valid UMI sequence per line.

--umi-correction-table $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The correction table defaults to the table used by TruSight Oncology: <INSTALL_PATH>/resources/umi/umi_correction_table.txt.gz.

--umi-min-supporting-reads INT

Specify the number of matching UMI input reads required to generate a consensus read. Any family with insufficient supporting reads is discarded. The default is 2.

--umi-metrics-interval-file $BED

Target region in BED format.

--umi-emit-multiplicity both

--umi-start-mask-length INT

Number of additional bases to ignore from the start of the read. The default is 0. To reduce false positives, optionally set to 1.

--umi-end-mask-length INT

Number of additional bases to ignore from the end of the read. The default is 0. To reduce false positives, optionally set to 3.
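
The effect of --umi-min-supporting-reads described above can be pictured with a toy example. The Python sketch below is not DRAGEN's UMI collapsing algorithm: it only assumes that a family survives when its read count reaches the threshold, and the UMI sequences and counts are hypothetical.

```python
def surviving_families(family_sizes, min_supporting_reads=2):
    """Keep UMI families whose read count reaches the threshold."""
    return {umi: n for umi, n in family_sizes.items()
            if n >= min_supporting_reads}


# Hypothetical families: UMI sequence -> number of supporting reads.
families = {"ACGTAC": 3, "TTGACA": 1, "GGCATC": 2}

print(surviving_families(families))     # default threshold of 2
print(surviving_families(families, 1))  # a threshold of 1 keeps singletons
```

With the default of 2, the single-read family is dropped. A threshold of 1 keeps every family, including singletons.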

SNV

DRAGEN SNV VC employs machine learning based variant recalibration (DRAGEN-ML). It processes read and other contextual evidence to remove false positives, recover false negatives and reduce zygosity errors. No additional setup is required. DRAGEN-ML is enabled by default as needed, when running the germline SNV VC on hg19 or hg38.

Note that we do not recommend changing the default QUAL thresholds of 3 for DRAGEN-ML and 10 for DRAGEN without ML. These values differ from each other because DRAGEN-ML improves the calibration of QUAL scores, leading to a change in the scoring range.

Option
Description

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-emit-ref-confidence GVCF

To enable gVCF output.

--vc-enable-vcf-output

To enable VCF file output during a gVCF run, set to true. The default value is false.

Annotation

HLA

Option
Description

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

CNV

Option
Description

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

CNV Panel of Normals (PON)

The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. CNV requires PON files for all targeted analyses (including panels, exomes, germline, tumor-only and tumor-normal workflows). It is recommended to use 30-100 normal samples when building the PON, but fewer may be used. If sample coverage noise is relatively stable, as few as 5 PON samples may yield acceptable results.

Follow the two steps below to generate a CNV PON:

Step 1. Generate target counts of individual normal samples.

Any options used for panel of normals generation (BED file, GC Bias Correction, etc) should be matched when processing the case sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN pangenome hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
--enable-cnv true 
--cnv-target-bed $PATH 

Step 2. Combined counts generation.

Individual PON counts can be merged into a single file as a <prefix>.combined.counts.txt.gz file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN pangenome hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--enable-cnv true 
--cnv-generate-combined-counts true 
--cnv-normals-list $CNV_NORMALS_LIST 

$CNV_NORMALS_LIST is a text file that lists, one path per line, each target counts file generated in Step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz). The output is a PON file with the suffix .combined.counts.txt.gz. Use this PON file in case-sample runs of DRAGEN CNV with the --cnv-combined-counts option.

In some cases, an in-run PON containing germline samples from the same batch (i.e. sample source, DNA extraction, library prep and sequencing run) may provide superior normalization.
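
The normals list used in Step 2 above can be generated rather than typed by hand. The Python sketch below assumes the Step 1 outputs sit somewhere under one directory and that the list holds one path per line. The function name and directory layout are hypothetical. It collects only one of the two counts file types, so that GC bias correction is consistent across the PON.

```python
from pathlib import Path

# File suffixes named in the PON description above.
RAW_SUFFIX = ".target.counts.gz"
GC_SUFFIX = ".target.counts.gc-corrected.gz"


def write_normals_list(counts_dir, list_path, gc_corrected=False):
    """Write one absolute target-counts path per line and return the paths."""
    suffix = GC_SUFFIX if gc_corrected else RAW_SUFFIX
    paths = sorted(p.resolve() for p in Path(counts_dir).rglob(f"*{suffix}"))
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths


# Example call (hypothetical locations):
#   write_normals_list("/staging/pon_step1", "/staging/pon/normals.txt")
# The resulting file is then passed to --cnv-normals-list.
```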

Analysis Output

When the analysis run completes, the software generates an analysis output in a specified location with the folder name /staging/DRAGEN_Solid_WGS_Tumor_Normal_Pipeline_{version}_Analysis_{datetimestamp}. In ICA, analysis output is listed in the Output section of the analysis, where the folder name is a combination of user reference, pipeline name, and a UUID.

Within the analysis folder, each analysis step generates a subfolder within the Logs_Intermediates folder.

Output Folders

File Overview

This section describes the summary output files generated during analysis.

Metrics Output

The metrics output file is a final combined metrics report that provides sample status, key analysis metrics, and metadata in a tab-separated values (TSV) file. Sample metrics within the report indicate guideline-suggested lower specification limits (LSL) and upper specification limits (USL) for each sample in the run. One metrics output file is generated for the entire run. An additional file is generated for each case.

Normal DNA Input QC Metrics

Metric (UOM) | LSL Guideline | USL Guideline
TOTAL_INPUT_READS (Count) | NA | NA
PCT_MAPPED_READS (%) | 90.00 | NA
PCT_PROPERLY_PAIRED_READS (%) | 90.00 | NA
PCT_Q30_BASES (%) | 80.00 | NA
PCT_SOFT_CLIPPED_BASES_R1 (%) | NA | 10.0
PCT_SOFT_CLIPPED_BASES_R2 (%) | NA | 10.0
PCT_SUPPLEMENTARY_(CHIMERIC)_ALIGNMENTS (%) | NA | 15.0
ESTIMATED_READ_LENGTH (bp) | NA | NA
MEAN_INSERT_LENGTH (bp) | NA | NA
MEDIAN_INSERT_LENGTH (bp) | NA | NA
INPUT_BASES_OVER_REFERENCE_GENOME_SIZE (Count) | NA | NA
ESTIMATED_SAMPLE_CONTAMINATION (%) | NA | 2.00

Normal DNA Dedup/UMI QC Metrics

Metric (UOM) | LSL Guideline | USL Guideline
PCT_DUPLICATE_MARKED_READS (%) | NA | 20.00
PCT_READS_WITH_VALID_OR_CORRECTABLE_UMIS (%) | NA | NA
PCT_READS_IN_DISCARDED_FAMILIES (%) | NA | NA
PCT_READS_FILTERED_OUT (%) | NA | NA
PCT_READS_WITH_UNCORRECTABLE_UMIS (%) | NA | NA
TOTAL_NUMBER_OF_FAMILIES (Count) | NA | NA
FAMILIES_DISCARDED (Count) | NA | NA
DUPLEX_FAMILIES (Count) | NA | NA
MEAN_FAMILY_DEPTH (Count) | NA | NA

Normal DNA Coverage QC Metrics

Metric (UOM) | LSL Guideline | USL Guideline
AVERAGE_GENOME_COVERAGE (Count) | 20.00 | NA
PCT_UNIFORMITY_OF_COVERAGE_OVER_20PCT_OF_MEAN (%) | 50.00 | NA
PCT_GENOME_20X (%) | 80.00 | NA

Tumor DNA Input QC Metrics

Metric (UOM) | LSL Guideline | USL Guideline
TOTAL_INPUT_READS (Count) | NA | NA
PCT_MAPPED_READS (%) | 90.00 | NA
PCT_PROPERLY_PAIRED_READS (%) | 90.00 | NA
PCT_Q30_BASES (%) | 80.00 | NA
PCT_SOFT_CLIPPED_BASES_R1 (%) | NA | 10.0
PCT_SOFT_CLIPPED_BASES_R2 (%) | NA | 10.0
PCT_SUPPLEMENTARY_(CHIMERIC)_ALIGNMENTS (%) | NA | 15.0
ESTIMATED_READ_LENGTH (bp) | NA | NA
MEAN_INSERT_LENGTH (bp) | NA | NA
MEDIAN_INSERT_LENGTH (bp) | NA | NA
INPUT_BASES_OVER_REFERENCE_GENOME_SIZE (Count) | NA | NA
ESTIMATED_SAMPLE_CONTAMINATION (%) | NA | 2.00

Tumor DNA Dedup/UMI QC Metrics

Metric (UOM) | LSL Guideline | USL Guideline
PCT_DUPLICATE_MARKED_READS (%) | NA | 20.00
PCT_READS_WITH_VALID_OR_CORRECTABLE_UMIS (%) | NA | NA
PCT_READS_IN_DISCARDED_FAMILIES (%) | NA | NA
PCT_READS_FILTERED_OUT (%) | NA | NA
PCT_READS_WITH_UNCORRECTABLE_UMIS (%) | NA | NA
TOTAL_NUMBER_OF_FAMILIES (Count) | NA | NA
FAMILIES_DISCARDED (Count) | NA | NA
DUPLEX_FAMILIES (Count) | NA | NA
MEAN_FAMILY_DEPTH (Count) | NA | NA

Tumor DNA Coverage QC Metrics

Metric (UOM) | LSL Guideline | USL Guideline
AVERAGE_GENOME_COVERAGE (Count) | 20.00 | NA
PCT_UNIFORMITY_OF_COVERAGE_OVER_20PCT_OF_MEAN (%) | 50.00 | NA
PCT_GENOME_20X (%) | 80.00 | NA

Tumor DNA T/N Sample Match QC Metrics

Metric (UOM) | LSL Guideline | USL Guideline
OUTLIER_BAF_FRACTION (NA) | NA | 0.90

Tumor DNA Purity QC Metrics

Metric (UOM) | LSL Guideline | USL Guideline
ESTIMATED_PURITY (%) | 20.00 | NA
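
The LSL and USL guidelines in the tables above lend themselves to an automated check. The Python sketch below uses a subset of the limits that appear in both the normal and tumor tables. How the metrics are read from the TSV report is not covered here, so the function takes a plain dictionary. The function name is hypothetical, and treating a value exactly on a limit as passing is an assumption.

```python
# (LSL, USL) guideline pairs copied from the QC tables; None stands for NA.
LIMITS = {
    "PCT_MAPPED_READS": (90.0, None),
    "PCT_PROPERLY_PAIRED_READS": (90.0, None),
    "PCT_Q30_BASES": (80.0, None),
    "PCT_SOFT_CLIPPED_BASES_R1": (None, 10.0),
    "PCT_SOFT_CLIPPED_BASES_R2": (None, 10.0),
    "ESTIMATED_SAMPLE_CONTAMINATION": (None, 2.0),
    "PCT_DUPLICATE_MARKED_READS": (None, 20.0),
    "AVERAGE_GENOME_COVERAGE": (20.0, None),
    "PCT_GENOME_20X": (80.0, None),
}


def out_of_guideline(metrics, limits=LIMITS):
    """Return the names of metrics that fall below LSL or above USL."""
    flagged = []
    for name, value in metrics.items():
        lsl, usl = limits.get(name, (None, None))
        # Assumption: a value equal to a limit is within the guideline.
        if (lsl is not None and value < lsl) or (usl is not None and value > usl):
            flagged.append(name)
    return flagged


# Hypothetical metric values for one sample.
sample = {
    "PCT_MAPPED_READS": 95.2,
    "PCT_Q30_BASES": 78.0,
    "ESTIMATED_SAMPLE_CONTAMINATION": 2.0,
    "TOTAL_INPUT_READS": 1_000_000,
}
print(out_of_guideline(sample))
```

Metrics without a listed limit, such as TOTAL_INPUT_READS, are never flagged.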

DNA Germline WGS

The DRAGEN recipe includes the recommended pipeline-specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN pangenome hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
# Inputs 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
--enable-duplicate-marking true         #default=true 
# Small variant caller 
--enable-variant-caller true 
# Annotation 
--variant-annotation-data PATH 
--enable-variant-annotation true 
# SV 
--enable-sv true 
# CNV 
--enable-cnv true 
--cnv-enable-self-normalization true 
# HLA genotyper 
--enable-hla true 
# Targeted caller 
--enable-targeted true                  #Targeted 
# Star allele 
--enable-star-allele true 
# PGX 
--enable-pgx true                       #PGX 
# Short tandem repeats 
--repeat-genotype-enable true 
# Multi-Region Joint Detection (MRJD) 
--enable-mrjd true 
--mrjd-enable-high-sensitivity-mode true 

Notes and additional options

Hashtable

For DRAGEN germline runs, it is recommended to use the pangenome hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

--fastq-list $PATH 
--fastq-list-sample-id $STRING 

FQ Input

--fastq-file1 $PATH 
--fastq-file2 $PATH 
--RGSM $STRING 
--RGID $STRING 

BAM Input

--bam-input $PATH 

CRAM Input

--cram-input $PATH 

Mapping and Aligning

Option
Description

--enable-map-align true

Optionally disable map & align (default=true).

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

Duplicate Marking

Option
Description

--enable-duplicate-marking true

By default, DRAGEN marks duplicate reads and excludes them from variant calling.

SNV

DRAGEN SNV VC employs machine learning based variant recalibration (DRAGEN-ML). It processes read and other contextual evidence to remove false positives, recover false negatives and reduce zygosity errors. No additional setup is required. DRAGEN-ML is enabled by default as needed, when running the germline SNV VC on hg19 or hg38.

Note that we do not recommend changing the default QUAL thresholds of 3 for DRAGEN-ML and 10 for DRAGEN without ML. These values differ from each other because DRAGEN-ML improves the calibration of QUAL scores, leading to a change in the scoring range.

Option
Description

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-emit-ref-confidence GVCF

To enable gVCF output.

--vc-enable-vcf-output

To enable VCF file output during a gVCF run, set to true. The default value is false.

Annotation

HLA

Option
Description

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

CNV

Option
Description

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

--cnv-population-b-allele-vcf $POP_VCF

--cnv-enable-cyto-output true

--cnv-enable-mosaic-calling true

Multi-Region Joint Detection (MRJD)

Option
Description

--enable-mrjd

If set to true, MRJD is enabled for the DRAGEN pipeline.

--mrjd-enable-high-sensitivity-mode

If set to true, MRJD high-sensitivity mode is enabled for the DRAGEN pipeline (default=false). See the MRJD section in the user guide for information on the variant types reported in MRJD default mode and high-sensitivity mode.

DNA Germline WGS UMI

The DRAGEN recipe includes the recommended pipeline-specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN pangenome hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
# Inputs 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
# UMI 
--umi-enable true 
--umi-source STRING                     #Default='qname' 
--umi-library-type STRING               #e.g. random-duplex 
--umi-min-supporting-reads 1            #Default=2 
# Small variant caller 
--enable-variant-caller true 
# Annotation 
--variant-annotation-data PATH 
--enable-variant-annotation true 
# SV 
--enable-sv true 
# CNV 
--enable-cnv true 
--cnv-enable-self-normalization true 
# HLA genotyper 
--enable-hla true 
# Targeted caller 
--enable-targeted true                  #Targeted 
# Star allele 
--enable-star-allele true 
# PGX 
--enable-pgx true                       #PGX 
# Short tandem repeats 
--repeat-genotype-enable true 
# Multi-Region Joint Detection (MRJD) 
--enable-mrjd true 
--mrjd-enable-high-sensitivity-mode true 

Notes and additional options

Hashtable

For DRAGEN germline runs, it is recommended to use the pangenome hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

--fastq-list $PATH 
--fastq-list-sample-id $STRING 

FQ Input

--fastq-file1 $PATH 
--fastq-file2 $PATH 
--RGSM $STRING 
--RGID $STRING 

BAM Input

--bam-input $PATH 

CRAM Input

--cram-input $PATH 

Mapping and Aligning

Option
Description

--enable-map-align true

Optionally disable map & align (default=true).

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

UMI

Option
Description

--umi-source STRING

Specify the input type for the UMI sequence. Options: qname, fastq, bamtag.

--umi-library-type STRING

Set the batch option for the different types of UMI correction. Options: random-duplex, random-simplex, nonrandom-duplex.

--umi-nonrandom-whitelist $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The whitelist includes a valid UMI sequence per line.

--umi-correction-table $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The correction table defaults to the table used by TruSight Oncology: <INSTALL_PATH>/resources/umi/umi_correction_table.txt.gz.

--umi-min-supporting-reads INT

Specify the number of matching UMI input reads required to generate a consensus read. Any family with insufficient supporting reads is discarded. The default is 2.

--umi-metrics-interval-file $BED

Target region in BED format.

--umi-emit-multiplicity both

--umi-start-mask-length INT

Number of additional bases to ignore from the start of the read. The default is 0. To reduce false positives, optionally set to 1.

--umi-end-mask-length INT

Number of additional bases to ignore from the end of the read. The default is 0. To reduce false positives, optionally set to 3.

SNV

DRAGEN SNV VC employs machine learning based variant recalibration (DRAGEN-ML). It processes read and other contextual evidence to remove false positives, recover false negatives and reduce zygosity errors. No additional setup is required. DRAGEN-ML is enabled by default as needed, when running the germline SNV VC on hg19 or hg38.

Note that we do not recommend changing the default QUAL thresholds of 3 for DRAGEN-ML and 10 for DRAGEN without ML. These values differ from each other because DRAGEN-ML improves the calibration of QUAL scores, leading to a change in the scoring range.

Option
Description

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-emit-ref-confidence GVCF

To enable gVCF output.

--vc-enable-vcf-output

To enable VCF file output during a gVCF run, set to true. The default value is false.

Annotation

HLA

Option
Description

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

CNV

Option
Description

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

--cnv-population-b-allele-vcf $POP_VCF

--cnv-enable-cyto-output true

--cnv-enable-mosaic-calling true

Multi-Region Joint Detection (MRJD)

Option
Description

--enable-mrjd

If set to true, MRJD is enabled for the DRAGEN pipeline.

--mrjd-enable-high-sensitivity-mode

If set to true, MRJD high-sensitivity mode is enabled for the DRAGEN pipeline (default=false). See the MRJD section in the user guide for information on the variant types reported in MRJD default mode and high-sensitivity mode.

DNA Germline WES UMI

The DRAGEN recipe includes the recommended pipeline-specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN pangenome hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
# Inputs 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
# UMI 
--umi-enable true 
--umi-source STRING                     #Default='qname' 
--umi-library-type STRING               #e.g. random-duplex 
--umi-min-supporting-reads 1            #Default=2 
# Small variant caller 
--enable-variant-caller true 
--vc-target-bed $VC_TARGET_BED 
# Annotation 
--variant-annotation-data PATH 
--enable-variant-annotation true 
# SV 
--enable-sv true 
--sv-exome true 
--sv-call-regions-bed $SV_TARGET_BED 
# CNV 
--enable-cnv true 
--cnv-target-bed $PATH 
--cnv-combined-counts $PATH             #CNV PON 
# HLA genotyper 
--enable-hla true 
# Targeted caller (only if using the Illumina CS/PGx Custom Enrichment Research Panel) 
--enable-targeted true                  #Targeted 
--targeted-pon $PATH                    #Targeted in-run PON 
--targeted-systematic-noise $PATH       #Targeted systematic noise file 

Notes and additional options

Hashtable

For DRAGEN germline runs, it is recommended to use the pangenome hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

--fastq-list $PATH 
--fastq-list-sample-id $STRING 

FQ Input

--fastq-file1 $PATH 
--fastq-file2 $PATH 
--RGSM $STRING 
--RGID $STRING 

BAM Input

--bam-input $PATH 

CRAM Input

--cram-input $PATH 

Mapping and Aligning

Option
Description

--enable-map-align true

Optionally disable map & align (default=true).

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

UMI

Option
Description

--umi-source STRING

Specify the input type for the UMI sequence. Options: qname, fastq, bamtag.

--umi-library-type STRING

Set the batch option for the different types of UMI correction. Options: random-duplex, random-simplex, nonrandom-duplex.

--umi-nonrandom-whitelist $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The whitelist includes a valid UMI sequence per line.

--umi-correction-table $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The correction table defaults to the table used by TruSight Oncology: <INSTALL_PATH>/resources/umi/umi_correction_table.txt.gz.

--umi-min-supporting-reads INT

Specify the number of matching UMI input reads required to generate a consensus read. Any family with insufficient supporting reads is discarded. The default is 2.

--umi-metrics-interval-file $BED

Target region in BED format.

--umi-emit-multiplicity both

--umi-start-mask-length INT

Number of additional bases to ignore from the start of the read. The default is 0. To reduce false positives, optionally set to 1.

--umi-end-mask-length INT

Number of additional bases to ignore from the end of the read. The default is 0. To reduce false positives, optionally set to 3.

SNV

DRAGEN SNV VC employs machine learning based variant recalibration (DRAGEN-ML). It processes read and other contextual evidence to remove false positives, recover false negatives and reduce zygosity errors. No additional setup is required. DRAGEN-ML is enabled by default as needed, when running the germline SNV VC on hg19 or hg38.

Note that we do not recommend changing the default QUAL thresholds of 3 for DRAGEN-ML and 10 for DRAGEN without ML. These values differ from each other because DRAGEN-ML improves the calibration of QUAL scores, leading to a change in the scoring range.

Option
Description

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-emit-ref-confidence GVCF

To enable gVCF output.

--vc-enable-vcf-output

To enable VCF file output during a gVCF run, set to true. The default value is false.

Annotation

HLA

Option
Description

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

CNV

Option
Description

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

CNV Panel of Normals (PON)

The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. CNV requires PON files for all targeted analyses (including panels, exomes, germline, tumor-only and tumor-normal workflows). It is recommended to use 30-100 normal samples when building the PON, but fewer may be used. If sample coverage noise is relatively stable, as few as 5 PON samples may yield acceptable results.

Follow the two steps below to generate a CNV PON:

Step 1. Generate target counts of individual normal samples.

Any options used for panel of normals generation (BED file, GC Bias Correction, etc) should be matched when processing the case sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN pangenome hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
--enable-cnv true 
--cnv-target-bed $PATH 

Step 2. Combined counts generation.

Individual PON counts can be merged into a single file as a <prefix>.combined.counts.txt.gz file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN pangenome hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--enable-cnv true 
--cnv-generate-combined-counts true 
--cnv-normals-list $CNV_NORMALS_LIST 

$CNV_NORMALS_LIST is a single text file listing the path to each target counts file generated in Step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz), one path per line. The output is a PON file with the suffix .combined.counts.txt.gz. Use this PON file in case sample runs of DRAGEN CNV with the --cnv-combined-counts option.
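As a minimal sketch, the normals list could be assembled from the Step 1 outputs as shown below. The helper name make_cnv_normals_list and the assumption that all Step 1 output directories sit under one parent directory are illustrative and not part of DRAGEN. The function writes one absolute path per line, prints the number of normals found, and warns when fewer than 5 are listed.

```shell
# Hypothetical helper (not a DRAGEN tool): collect Step 1 target counts files
# into a list file, one absolute path per line.
# usage: make_cnv_normals_list <step1_parent_dir> <list_file> [gc_corrected]
#   gc_corrected = true (default) -> *.target.counts.gc-corrected.gz
#   gc_corrected = false          -> *.target.counts.gz
make_cnv_normals_list() {
    step1_dir=$1
    list_file=$2
    if [ "${3:-true}" = "true" ]; then
        pattern='*.target.counts.gc-corrected.gz'
    else
        pattern='*.target.counts.gz'
    fi
    # Resolve to an absolute directory so the list works from any location.
    abs_dir=$(cd "$step1_dir" && pwd) || return 1
    find "$abs_dir" -type f -name "$pattern" | sort > "$list_file"
    n=$(wc -l < "$list_file" | tr -d ' ')
    if [ "$n" -lt 5 ]; then
        echo "warning: only $n normals found" >&2
    fi
    echo "$n"
}
```

The two file types are never mixed in one list, which keeps the GC bias correction setting consistent across the PON.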

In some cases, an in-run PON containing germline samples from the same batch (i.e. sample source, DNA extraction, library prep and sequencing run) may provide superior normalization.

Targeted Caller

DNA Germline WES

The DRAGEN recipe includes the recommended pipeline specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN pangenome hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
# Inputs 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
--enable-duplicate-marking true         #default=true 
# Small variant caller 
--enable-variant-caller true 
--vc-target-bed $VC_TARGET_BED 
# Annotation 
--variant-annotation-data $PATH 
--enable-variant-annotation true 
# SV 
--enable-sv true 
--sv-exome true 
--sv-call-regions-bed $SV_TARGET_BED 
# CNV 
--enable-cnv true 
--cnv-target-bed $PATH 
--cnv-combined-counts $PATH             #CNV PON 
# HLA genotyper 
--enable-hla true 
# Targeted caller (only if using the Illumina CS/PGx Custom Enrichment Research Panel) 
--enable-targeted true                  #Targeted 
--targeted-pon $PATH                    #Targeted in-run PON 
--targeted-systematic-noise $PATH       #Targeted systematic noise file 
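The listing above shows one option per line with trailing comments, which a shell will not accept as a single command. The sketch below shows one way to assemble a subset of these options in a script while keeping per-line comments. The variable names (INTERMEDIATE_DIR, FASTQ_LIST, SAMPLE_ID, CNV_TARGET_BED, CNV_PON), DRAGEN_BIN, and the DRY_RUN switch are illustrative assumptions. Distinct names are used because $PATH in the recipe is a placeholder, whereas in a real shell it is the executable search path.

```shell
# Sketch only: builds the argument list one option at a time so that each
# line can carry its own comment. DRY_RUN defaults to true and only prints
# the command; set DRY_RUN=false to execute it.
run_germline_wes() {
    dragen_bin=${DRAGEN_BIN:-/opt/dragen/$VERSION/bin/dragen}
    set --                                                  # start from an empty argument list
    set -- "$@" --ref-dir "$REF_DIR"                        # pangenome hashtable
    set -- "$@" --output-directory "$OUTPUT"
    set -- "$@" --intermediate-results-dir "$INTERMEDIATE_DIR"
    set -- "$@" --output-file-prefix "$PREFIX"
    set -- "$@" --fastq-list "$FASTQ_LIST"
    set -- "$@" --fastq-list-sample-id "$SAMPLE_ID"
    set -- "$@" --enable-map-align true
    set -- "$@" --enable-variant-caller true
    set -- "$@" --vc-target-bed "$VC_TARGET_BED"
    set -- "$@" --enable-cnv true
    set -- "$@" --cnv-target-bed "$CNV_TARGET_BED"
    set -- "$@" --cnv-combined-counts "$CNV_PON"            # CNV PON
    if [ "${DRY_RUN:-true}" = "true" ]; then
        printf '%s' "$dragen_bin"
        printf ' %s' "$@"
        printf '\n'
    else
        "$dragen_bin" "$@"
    fi
}
```

Printing the command first makes it easy to review the final option set before committing server time to a run.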

Notes and additional options

Hashtable

For DRAGEN germline runs, it is recommended to use the pangenome hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

--fastq-list $PATH 
--fastq-list-sample-id $STRING 

FQ Input

--fastq-file1 $PATH 
--fastq-file2 $PATH 
--RGSM $STRING 
--RGID $STRING 

BAM Input

--bam-input $PATH 

CRAM Input

--cram-input $PATH 
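As an illustration of how the input options above map to file types, the following sketch picks the option from the file extension. The helper name dragen_input_args is hypothetical, and treating a .csv file as a FASTQ list is an assumption of this example. Paired FASTQ input is left out because it also needs --fastq-file2, --RGSM and --RGID.

```shell
# Hypothetical helper: print the input option(s) for one input file, using
# only the options listed above.
# usage: dragen_input_args <input_file> [sample_id]
dragen_input_args() {
    case "$1" in
        *.bam)  printf '%s\n' "--bam-input $1" ;;
        *.cram) printf '%s\n' "--cram-input $1" ;;
        *.csv)
            # A FASTQ list also needs the sample ID to select rows.
            [ -n "${2:-}" ] || { echo "sample id required for a FASTQ list" >&2; return 1; }
            printf '%s\n' "--fastq-list $1 --fastq-list-sample-id $2" ;;
        *)  echo "unrecognized input type: $1" >&2; return 1 ;;
    esac
}
```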

Mapping and Aligning

Option
Description

--enable-map-align true

Optionally disable map & align (default=true).

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

Duplicate Marking

Option
Description

--enable-duplicate-marking true

By default, DRAGEN marks duplicate reads and excludes them from variant calling.

SNV

DRAGEN SNV VC employs machine learning based variant recalibration (DRAGEN-ML). It processes read and other contextual evidence to remove false positives, recover false negatives and reduce zygosity errors. No additional setup is required. DRAGEN-ML is enabled by default as needed, when running the germline SNV VC on hg19 or hg38.

Note that we do not recommend changing the default QUAL thresholds of 3 for DRAGEN-ML and 10 for DRAGEN without ML. These values differ from each other because DRAGEN-ML improves the calibration of QUAL scores, leading to a change in the scoring range.

Option
Description

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-emit-ref-confidence GVCF

To enable gVCF output.

--vc-enable-vcf-output

To enable VCF file output during a gVCF run, set to true. The default value is false.

Annotation

HLA

Option
Description

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

CNV

Option
Description

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

CNV Panel of Normals (PON)

The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. CNV requires PON files for all targeted analyses (including panels, exomes, germline, tumor-only and tumor-normal workflows). It is recommended to use 30-100 normal samples when building the PON, but fewer may be used. If sample coverage noise is relatively stable, as few as 5 PON samples may yield acceptable results.

Follow the two steps below to generate CNV PON:

Step 1. Generate target counts of individual normal samples.

Any options used for panel of normals generation (BED file, GC Bias Correction, etc) should be matched when processing the case sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN pangenome hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
--enable-cnv true 
--cnv-target-bed $PATH 

Step 2. Combined counts generation.

Individual PON counts can be merged into a single file as a <prefix>.combined.counts.txt.gz file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN pangenome hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--enable-cnv true 
--cnv-generate-combined-counts true 
--cnv-normals-list $CNV_NORMALS_LIST 

$CNV_NORMALS_LIST is a single text file listing the path to each target counts file generated in Step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz), one path per line. The output is a PON file with the suffix .combined.counts.txt.gz. Use this PON file in case sample runs of DRAGEN CNV with the --cnv-combined-counts option.

In some cases, an in-run PON containing germline samples from the same batch (i.e. sample source, DNA extraction, library prep and sequencing run) may provide superior normalization.

Targeted Caller

Custom Config Support

This feature allows users to customize a specific set of DRAGEN command-line options to override the default values pre-defined in the pipeline.

ICA Setup

In the ICA (Illumina Connected Analytics) user interface, you can specify the Custom Parameters Config File and the Custom Resources Directory directly. Supported customizable options are described below.

Examples

heme_custom_param.config Content

## custom parameters
vc_output_evidence_bam = false
qc_detect_contamination = true
aligner_clip_pe_overhang = 0

## custom reference files
vc_systematic_noise = '/snv/WGS_hg38_v1.0_systematic_noise.snv.bed.gz'
sv_systematic_noise = '/sv/WGS_FF_Heme_hg38_v1.0_systematic_noise.sv.bedpe.gz'
vc_somatic_hotspots = '/snv/somatic_hotspots_GRCh38.vcf.gz'

custom_resources_Heme_dir Folder Structure on ICA

custom_resources_Solid/
├── snv
│   ├── WGS_Solid_hg38_v1.0_systematic_noise.snv.bed.gz
│   └── somatic_hotspots_GRCh38.vcf.gz
└── sv
    └── WGS_Solid_FF_solid_hg38_v1.0_systematic_noise.sv.bedpe.gz
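Before uploading, it can help to confirm that every reference file named in the config exists in the custom resources folder. The sketch below assumes that quoted paths in the config are resolved relative to the root of the custom resources directory, which is inferred from the examples above rather than stated. The helper name check_custom_resources is hypothetical, and paths containing spaces are not handled.

```shell
# Hypothetical pre-flight check (not part of DRAGEN or ICA).
# usage: check_custom_resources <config_file> <custom_resources_dir>
check_custom_resources() {
    config=$1
    res_dir=${2%/}
    missing=0
    # Extract every single-quoted value, e.g. '/snv/file.bed.gz' -> /snv/file.bed.gz.
    # Unquoted values such as true, false or 0 are ignored.
    paths=$(sed -n "s/^[^#]*=[[:space:]]*'\([^']*\)'.*/\1/p" "$config")
    for rel in $paths; do
        if [ ! -f "$res_dir$rel" ]; then
            echo "missing: $res_dir$rel" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

The function returns a non-zero status when any referenced file is absent, so it can gate an upload step in a script.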

ICA Input Files UI Example

DNA Somatic Tumor-Normal Solid WES UMI

The DRAGEN recipe includes the recommended pipeline specific commands.

Notes and additional options

Hashtable

For DRAGEN somatic runs it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

FQ Input

BAM Input

CRAM Input

Mapping and Aligning

Fractional (Raw Reads) Downsampling

DRAGEN can subsample a random, fractional percentage of reads from an input file using the fractional downsampler. You can use downsampling to subsample data sets in order to simulate different amounts of sequencing. DRAGEN randomly subsamples reads from primary analysis without any modification (e.g. no trimming, no filtering, etc.).

Downsampling may be useful to reduce runtime on very deep samples. For Tumor-Normal analyses, it is also recommended to use a normal sample with lower coverage than the tumor sample. If the matched normal has deeper coverage than the tumor sample, fractional downsampling can be used to reduce coverage of the normal sample.
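The fraction needed to bring a deep normal down to the tumor coverage is simple arithmetic. The sketch below is illustrative: the helper name normal_subsample_fraction is hypothetical, and the coverage values are assumed to come from an earlier QC step. The result is an upper bound, since the recommendation is for normal coverage below tumor coverage. The downsampler option names are not shown here.

```shell
# usage: normal_subsample_fraction <tumor_coverage> <normal_coverage>
# Prints the fraction of normal reads to keep so that normal coverage does
# not exceed tumor coverage. 1.00 means no downsampling is needed.
normal_subsample_fraction() {
    awk -v t="$1" -v n="$2" 'BEGIN {
        if (t <= 0 || n <= 0) exit 1     # reject non-positive coverage values
        f = t / n
        if (f > 1) f = 1                 # normal is already shallower than tumor
        printf "%.2f\n", f
    }'
}
```

For example, a tumor at 80x with a matched normal at 100x gives 0.80, so keeping at most 80% of the normal reads would bring the normal to the tumor coverage or below.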

UMI

SNV

High-coverage sequencing panels allow for the detection of low-frequency alleles. DRAGEN supports 3 main settings for improved sensitivity on low VAF variant calls.

HLA

CNV

Annotation

TMB

MSI

SV

Resource Files

DRAGEN requires resource files for components such as SNV, SV, and CNV. The following notes provide references for downloading these files or generating them for custom workflows or assays.

SNV Systematic Noise

Systematic noise files are considered essential in Tumor-Only workflows. They are also recommended for Tumor-Normal workflows.

Prebuilt

Custom

Prebuilt systematic noise files are available for WES or WGS applications. For these applications, it is considered optional to build custom noise files. For high-sensitivity applications, including panels, it is required to build custom noise files. For best accuracy, the normal samples should ideally closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30–70 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on each of approximately 30-70 normal samples.

For WES and WGS pipelines gather the full paths to the small variant hard filtered VCFs (not GVCFs) from step 1 and create a lines file ${VCF_LIST} by specifying 1 file per line.
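A quick check of the list file can catch common mistakes before Step 2. The sketch below is illustrative. The helper name validate_vcf_list is hypothetical, and it assumes that gVCF files can be recognized by a .gvcf.gz suffix. It prints the number of listed files and returns a non-zero status if any entry is a relative path, looks like a gVCF, or does not exist.

```shell
# usage: validate_vcf_list <list_file>
validate_vcf_list() {
    bad=0
    n=0
    while IFS= read -r path || [ -n "$path" ]; do
        [ -n "$path" ] || continue            # skip blank lines
        n=$((n + 1))
        case "$path" in
            /*) ;;                            # full path, as required
            *)  echo "not a full path: $path" >&2; bad=1 ;;
        esac
        case "$path" in
            *.gvcf.gz) echo "gVCF listed: $path" >&2; bad=1 ;;
        esac
        [ -f "$path" ] || { echo "missing file: $path" >&2; bad=1; }
    done < "$1"
    echo "$n"
    return "$bad"
}
```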

Step 2. Generate the final noise file.

SV Systematic Noise

Systematic noise files are also recommended for Tumor-Normal workflows, but are considered essential for reducing FP calls in Tumor-Only workflows.

Prebuilt

Custom

It is considered optional to build a custom systematic noise file for WES or WGS applications, but for high sensitivity applications like panels it is strongly recommended. For best accuracy the normal samples should ideally closely match the sequencer, sample type, library prep and coverage of the tumor samples of interest. It is typically recommended to use 30 - 100 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.

Step 2. Build the BEDPE file using input VCFs from previous step.

CNV Panel of Normals (PON)

The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. CNV requires PON files for all targeted analyses (including panels, exomes, germline, tumor-only and tumor-normal workflows). It is recommended to use 30-100 normal samples when building the PON, but fewer may be used. If sample coverage noise is relatively stable, as few as 5 PON samples may yield acceptable results.

If a matched normal is available it is recommended to include it in the PON.

Follow the two steps below to generate CNV PON:

Step 1. Generate target counts of individual normal samples.

Any options used for panel of normals generation (BED file, GC Bias Correction, etc) should be matched when processing the case sample.

Step 2. Combined counts generation.

Individual PON counts can be merged into a single file as a <prefix>.combined.counts.txt.gz file.

$CNV_NORMALS_LIST is a single text file listing the path to each target counts file generated in Step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz), one path per line. The output is a PON file with the suffix .combined.counts.txt.gz. Use this PON file in case sample runs of DRAGEN CNV with the --cnv-combined-counts option.

DNA Somatic Tumor-Normal Solid WGS UMI

The DRAGEN recipe includes the recommended pipeline specific commands.

Notes and additional options

Hashtable

For DRAGEN somatic runs it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

FQ Input

BAM Input

CRAM Input

Mapping and Aligning

Fractional (Raw Reads) Downsampling

DRAGEN can subsample a random, fractional percentage of reads from an input file using the fractional downsampler. You can use downsampling to subsample data sets in order to simulate different amounts of sequencing. DRAGEN randomly subsamples reads from primary analysis without any modification (e.g. no trimming, no filtering, etc.).

Downsampling may be useful to reduce runtime on very deep samples. For Tumor-Normal analyses, it is also recommended to use a normal sample with lower coverage than the tumor sample. If the matched normal has deeper coverage than the tumor sample, fractional downsampling can be used to reduce coverage of the normal sample.

UMI

SNV

HLA

CNV

Annotation

TMB

MSI

SV

Resource Files

DRAGEN requires resource files for components such as SNV, SV, and CNV. The following notes provide references for downloading these files or generating them for custom workflows or assays.

SNV Systematic Noise

Systematic noise files are considered essential in Tumor-Only workflows. They are also recommended for Tumor-Normal workflows.

Prebuilt

Custom

Prebuilt systematic noise files are available for WES or WGS applications. For these applications, it is considered optional to build custom noise files. For high-sensitivity applications, including panels, it is required to build custom noise files. For best accuracy, the normal samples should ideally closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30–70 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on each of approximately 30-70 normal samples.

For WES and WGS pipelines gather the full paths to the small variant hard filtered VCFs (not GVCFs) from step 1 and create a lines file ${VCF_LIST} by specifying 1 file per line.

Step 2. Generate the final noise file.

SV Systematic Noise

Systematic noise files are also recommended for Tumor-Normal workflows, but are considered essential for reducing FP calls in Tumor-Only workflows.

Prebuilt

Custom

It is considered optional to build a custom systematic noise file for WES or WGS applications, but for high sensitivity applications like panels it is strongly recommended. For best accuracy the normal samples should ideally closely match the sequencer, sample type, library prep and coverage of the tumor samples of interest. It is typically recommended to use 30 - 100 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.

Step 2. Build the BEDPE file using input VCFs from previous step.

DNA Somatic Tumor-Normal Solid WES

The DRAGEN recipe includes the recommended pipeline specific commands.

Notes and additional options

Hashtable

For DRAGEN somatic runs it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

FQ Input

BAM Input

CRAM Input

Mapping and Aligning

Duplicate Marking

Fractional (Raw Reads) Downsampling

DRAGEN can subsample a random, fractional percentage of reads from an input file using the fractional downsampler. You can use downsampling to subsample data sets in order to simulate different amounts of sequencing. DRAGEN randomly subsamples reads from primary analysis without any modification (e.g. no trimming, no filtering, etc.).

Downsampling may be useful to reduce runtime on very deep samples. For Tumor-Normal analyses, it is also recommended to use a normal sample with lower coverage than the tumor sample. If the matched normal has deeper coverage than the tumor sample, fractional downsampling can be used to reduce coverage of the normal sample.

SNV

High-coverage sequencing panels allow for the detection of low-frequency alleles. DRAGEN supports 3 main settings for improved sensitivity on low VAF variant calls.

HLA

CNV

Annotation

TMB

MSI

SV

Resource Files

DRAGEN requires resource files for components such as SNV, SV, and CNV. The following notes provide references for downloading these files or generating them for custom workflows or assays.

SNV Systematic Noise

Systematic noise files are considered essential in Tumor-Only workflows. They are also recommended for Tumor-Normal workflows.

Prebuilt

Custom

Prebuilt systematic noise files are available for WES or WGS applications. For these applications, it is considered optional to build custom noise files. For high-sensitivity applications, including panels, it is required to build custom noise files. For best accuracy, the normal samples should ideally closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30–70 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on each of approximately 30-70 normal samples.

For WES and WGS pipelines gather the full paths to the small variant hard filtered VCFs (not GVCFs) from step 1 and create a lines file ${VCF_LIST} by specifying 1 file per line.

Step 2. Generate the final noise file.

SV Systematic Noise

Systematic noise files are also recommended for Tumor-Normal workflows, but are considered essential for reducing FP calls in Tumor-Only workflows.

Prebuilt

Custom

It is considered optional to build a custom systematic noise file for WES or WGS applications, but for high sensitivity applications like panels it is strongly recommended. For best accuracy the normal samples should ideally closely match the sequencer, sample type, library prep and coverage of the tumor samples of interest. It is typically recommended to use 30 - 100 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.

Step 2. Build the BEDPE file using input VCFs from previous step.

CNV Panel of Normals (PON)

The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. CNV requires PON files for all targeted analyses (including panels, exomes, germline, tumor-only and tumor-normal workflows). It is recommended to use 30-100 normal samples when building the PON, but fewer may be used. If sample coverage noise is relatively stable, as few as 5 PON samples may yield acceptable results.

If a matched normal is available it is recommended to include it in the PON.

Follow the two steps below to generate CNV PON:

Step 1. Generate target counts of individual normal samples.

Any options used for panel of normals generation (BED file, GC Bias Correction, etc) should be matched when processing the case sample.

Step 2. Combined counts generation.

Individual PON counts can be merged into a single file as a <prefix>.combined.counts.txt.gz file.

$CNV_NORMALS_LIST is a single text file listing the path to each target counts file generated in Step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz), one path per line. The output is a PON file with the suffix .combined.counts.txt.gz. Use this PON file in case sample runs of DRAGEN CNV with the --cnv-combined-counts option.

DNA Somatic Tumor-Normal Solid WGS

The DRAGEN recipe includes the recommended pipeline specific commands.

Notes and additional options

Hashtable

For DRAGEN somatic runs it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

FQ Input

BAM Input

CRAM Input

Mapping and Aligning

Duplicate Marking

Fractional (Raw Reads) Downsampling

DRAGEN can subsample a random, fractional percentage of reads from an input file using the fractional downsampler. You can use downsampling to subsample data sets in order to simulate different amounts of sequencing. DRAGEN randomly subsamples reads from primary analysis without any modification (e.g. no trimming, no filtering, etc.).

Downsampling may be useful to reduce runtime on very deep samples. For Tumor-Normal analyses, it is also recommended to use a normal sample with lower coverage than the tumor sample. If the matched normal has deeper coverage than the tumor sample, fractional downsampling can be used to reduce coverage of the normal sample.

SNV

HLA

CNV

Annotation

TMB

MSI

SV

Resource Files

DRAGEN requires resource files for components such as SNV, SV, and CNV. The following notes provide references for downloading these files or generating them for custom workflows or assays.

SNV Systematic Noise

Systematic noise files are considered essential in Tumor-Only workflows. They are also recommended for Tumor-Normal workflows.

Prebuilt

Custom

Prebuilt systematic noise files are available for WES or WGS applications. For these applications, it is considered optional to build custom noise files. For high-sensitivity applications, including panels, it is required to build custom noise files. For best accuracy, the normal samples should ideally closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30–70 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on each of approximately 30-70 normal samples.

For WES and WGS pipelines gather the full paths to the small variant hard filtered VCFs (not GVCFs) from step 1 and create a lines file ${VCF_LIST} by specifying 1 file per line.

Step 2. Generate the final noise file.

SV Systematic Noise

Systematic noise files are also recommended for Tumor-Normal workflows, but are considered essential for reducing FP calls in Tumor-Only workflows.

Prebuilt

Custom

It is considered optional to build a custom systematic noise file for WES or WGS applications, but for high sensitivity applications like panels it is strongly recommended. For best accuracy the normal samples should ideally closely match the sequencer, sample type, library prep and coverage of the tumor samples of interest. It is typically recommended to use 30 - 100 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.

Step 2. Build the BEDPE file using input VCFs from previous step.

DNA Somatic Tumor-Only Heme WGS

The DRAGEN recipe includes the recommended pipeline specific commands.

Notes and additional options

Hashtable

For DRAGEN somatic runs it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

FQ Input

BAM Input

CRAM Input

Mapping and Aligning

Duplicate Marking

Fractional (Raw Reads) Downsampling

DRAGEN can subsample a random, fractional percentage of reads from an input file using the fractional downsampler. You can use downsampling to subsample data sets in order to simulate different amounts of sequencing. DRAGEN randomly subsamples reads from primary analysis without any modification (e.g. no trimming, no filtering, etc.).

Downsampling may be useful to reduce runtime on very deep samples. For Tumor-Normal analyses, it is also recommended to use a normal sample with lower coverage than the tumor sample. If the matched normal has deeper coverage than the tumor sample, fractional downsampling can be used to reduce coverage of the normal sample.

SNV

HLA

CNV

Annotation

TMB

The Tumor-Normal pipeline is more effective than the Tumor-Only pipeline at removing or tagging germline variants. As a result, the Tumor-Only pipeline may report somewhat elevated TMB values. The TMB proxi filter is an optional setting applied on top of the regular database germline filter. It aggressively filters additional germline variants based on allele frequencies.

MSI

SV

DUX4

Resource Files

DRAGEN requires resource files for components such as SNV, SV, and CNV. The following notes provide references for downloading these files or generating them for custom workflows or assays.

SNV Systematic Noise

Systematic noise files are considered essential in Tumor-Only workflows. They are also recommended for Tumor-Normal workflows.

Prebuilt

Custom

Prebuilt systematic noise files are available for WES or WGS applications. For these applications, it is considered optional to build custom noise files. For high-sensitivity applications, including panels, it is required to build custom noise files. For best accuracy, the normal samples should ideally closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30–70 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on each of approximately 30-70 normal samples.

For WES and WGS pipelines gather the full paths to the small variant hard filtered VCFs (not GVCFs) from step 1 and create a lines file ${VCF_LIST} by specifying 1 file per line.

Step 2. Generate the final noise file.

SV Systematic Noise

Systematic noise files are also recommended for Tumor-Normal workflows, but are considered essential for reducing FP calls in Tumor-Only workflows.

Prebuilt

Custom

It is considered optional to build a custom systematic noise file for WES or WGS applications, but for high sensitivity applications like panels it is strongly recommended. For best accuracy the normal samples should ideally closely match the sequencer, sample type, library prep and coverage of the tumor samples of interest. It is typically recommended to use 30 - 100 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.

Step 2. Build the BEDPE file using input VCFs from previous step.

This value is a universal record number (URN). The valid value is defined in

See:

Set the consensus sequence type to output. DRAGEN UMI allows collapsing duplex sequences from the two strands of the original molecules. For more information, see .

For more information see: .

For more detail on the small variant caller in somatic mode please refer to

For instructions on how to download the Nirvana annotation database, please refer to

For more information, see .

For more information, see .

Analysis of a full batch of germline samples with an automatically generated in-run PON can be performed using or DRAGEN Germline Enrichment on .

CNV PONs can also be built in the cloud using the or the DRAGEN Systematic Noise File Builder Pipeline on .

Results - Contains the final result files from the pipeline.

MetricsOutput.tsv - Contains summary metrics for all samples.

Case1

Case1_MetricsOutput.tsv - Contains summary metrics for tumor and normal samples for Case1.

TumorSample1

TumorSample1.hard-filtered.vcf.gz - Contains somatic small variant calls.

TumorSample1.cnv.vcf.gz - Contains somatic copy number variant calls.

TumorSample1.sv.vcf.gz - Contains somatic structural variant calls.

TumorSample1_SNV_Tumor_Annotated.json.gz - Contains somatic small variant annotations.

TumorSample1_CNV_Tumor_Annotated.json.gz - Contains somatic copy number variant annotations.

TumorSample1_SV_Tumor_Annotated.json.gz - Contains somatic structural variant annotations.

TumorSample1.tmb.metrics.csv - Contains the TMB result and metrics.

TumorSample1.microsat_output.json - Contains the MSI result and metrics.

TumorSample1.hrdscore.csv - Contains the HRD result and metrics.

TumorSample1.tn.bw - Contains tangent normalized somatic coverage in BigWig format.

TumorSample1.tumor.baf.bedgraph.gz - Contains somatic b-allele frequency in BedGraph format.

TumorSample1.bam - Contains aligned somatic reads in BAM format.

TumorSample1.bam.bai - Contains index of aligned somatic reads in BAI format.

NormalSample1

NormalSample1.hard-filtered.vcf.gz - Contains germline small variant calls.

NormalSample1.cnv.vcf.gz - Contains germline copy number variant calls.

NormalSample1.sv.vcf.gz - Contains germline structural variant calls.

NormalSample1.repeats.vcf.gz - Contains germline short tandem repeat calls.

NormalSample1.vntr.vcf.gz - Contains germline variable number tandem repeat calls.

NormalSample1.targeted.vcf.gz - Contains germline targeted (star allele) calls.

NormalSample1.targeted.json - Contains germline targeted (star allele) data in JSON format.

NormalSample1_SNV_Normal_Annotated.json.gz - Contains germline small variant annotations.

NormalSample1_CNV_Normal_Annotated.json.gz - Contains germline copy number variant annotations.

NormalSample1_SV_Normal_Annotated.json.gz - Contains germline structural variant annotations.

NormalSample1.hla.tsv - Contains germline HLA typing calls.

NormalSample1.bam - Contains aligned germline reads in BAM format.

NormalSample1.bam.bai - Contains index of aligned germline reads in BAI format.

Logs_Intermediates - Contains all intermediate files for each step of the pipeline (BAMs moved to the Results folder).

ResourceVerification

SampleSheetValidation

NormalFastqValidation

TumorFastqValidation

DragenCaller

TumorNormalVariantCaller

Tmb

Annotation

SampleAnalysisResults

AdditionalSarjMetrics

MetricsOutput

Work - (DRAGEN server only) Contains information and files related to Nextflow execution.
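A completed run can be spot-checked against the listing above. The sketch below is illustrative: the helper name check_tumor_results is hypothetical, and it searches by file name so that the exact folder nesting under Results does not matter. It checks only a subset of the tumor files, since which files are produced depends on the callers enabled for the run.

```shell
# usage: check_tumor_results <results_dir> <tumor_sample_id>
# Returns non-zero if any of the checked result files is absent.
check_tumor_results() {
    results_dir=$1
    sample=$2
    missing=0
    for suffix in .hard-filtered.vcf.gz .cnv.vcf.gz .sv.vcf.gz \
                  .tmb.metrics.csv .microsat_output.json .bam .bam.bai; do
        # Look anywhere below the results directory for <sample><suffix>.
        if [ -z "$(find "$results_dir" -type f -name "$sample$suffix" | head -n 1)" ]; then
            echo "missing: $sample$suffix" >&2
            missing=1
        fi
    done
    return "$missing"
}
```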

Specifies a population SNV catalog for . For more information on specifying b-allele loci, see .

Enable Cytogenetics-compatible output (default true), see . Only available with the .

Enable MOSAIC-calling mode (default true). Only available with the .

For more information, see .

For further details refer to .

See:

Set the consensus sequence type to output. DRAGEN UMI allows collapsing duplex sequences from the two strands of the original molecules. For more information, see .

For more information see: .

For more detail on the small variant caller in somatic mode please refer to

For instructions on how to download the Nirvana annotation database, please refer to

For more information, see .

For more information, see .

Analysis of a full batch of germline samples with an automatically generated in-run PON can be performed using DRAGEN Enrichment on BaseSpace or DRAGEN Germline Enrichment on ICA.

CNV PONs can also be built in the cloud using the DRAGEN Baseline Builder App on BaseSpace or the DRAGEN Systematic Noise File Builder Pipeline on ICA.

For PON requirements and generation see .

Analysis of a full batch of germline samples with an automatically generated in-run PON can be performed using DRAGEN Enrichment on BaseSpace or DRAGEN Germline Enrichment on ICA.

A systematic noise file and corresponding pre-built pangenome reference can be downloaded from the DRAGEN Software Support Site page.

ℹ️ Note: Custom resource files and the custom configuration file must be uploaded to the same ICA project where the run is created. You can use the icav2 client or other supported methods. See Data Transfer Options with ICA Platform for details.

See:

Option
Description
Option
Description
Option
Description

For more information see: .

Option
Description
High Sensitivity Option
Description

For more detail on the small variant caller in somatic mode please refer to

Option
Description
Option
Description

For more information, see .

For instructions on how to download the Nirvana annotation database, please refer to

Option
Description

See the user guide: .

Microsatellite sites file can be downloaded here: .

Option
Description and recommended setting
Option
Description
Option
Recommended Value for Liquid Tumors (e.g. AML/MLL)

For more information, see .

Prebuilt systematic noise BED files (WES and WGS) can be downloaded here: .

Prebuilt WES/WGS noise files
Description

The SNV systematic noise files can also be built in the cloud using the or the DRAGEN Systematic Noise File Builder Pipeline on .

Prebuilt SV systematic noise files can be downloaded here: .

Prebuilt WES/WGS noise files
Description

Systematic noise BEDPE files can also be built in the cloud using the or the DRAGEN Systematic Noise File Builder Pipeline on .

For more information, see .

CNV PONs can also be built in the cloud using the or the DRAGEN Systematic Noise File Builder Pipeline on .

See:

Option
Description
Option
Description
Option
Description

For more information see: .

Option
Description

For more detail on the small variant caller in somatic mode please refer to

Option
Description
Option
Description

For more information, see .

For instructions on how to download the Nirvana annotation database, please refer to

Option
Description

See the user guide: .

Microsatellite sites file can be downloaded here: .

Option
Description and recommended setting
Option
Description
Option
Recommended Value for Liquid Tumors (e.g. AML/MLL)

For more information, see .

Prebuilt systematic noise BED files (WES and WGS) can be downloaded here: .

Prebuilt WES/WGS noise files
Description

The SNV systematic noise files can also be built in the cloud using the or the DRAGEN Systematic Noise File Builder Pipeline on .

Prebuilt SV systematic noise files can be downloaded here: .

Prebuilt WES/WGS noise files
Description

Systematic noise BEDPE files can also be built in the cloud using the or the DRAGEN Systematic Noise File Builder Pipeline on .

See:

Option
Description
Option
Description
Option
Description
Option
Description
High Sensitivity Option
Description

For more detail on the small variant caller in somatic mode please refer to

Option
Description
Option
Description

For more information, see .

For instructions on how to download the Nirvana annotation database, please refer to

Option
Description

See the user guide: .

Microsatellite sites file can be downloaded here: .

Option
Description and recommended setting
Option
Description
Option
Recommended Value for Liquid Tumors (e.g. AML/MLL)

For more information, see .

Prebuilt systematic noise BED files (WES and WGS) can be downloaded here: .

Prebuilt WES/WGS noise files
Description

The SNV systematic noise files can also be built in the cloud using the or the DRAGEN Systematic Noise File Builder Pipeline on .

Prebuilt SV systematic noise files can be downloaded here: .

Prebuilt WES/WGS noise files
Description

Systematic noise BEDPE files can also be built in the cloud using the or the DRAGEN Systematic Noise File Builder Pipeline on .

For more information, see .

CNV PONs can also be built in the cloud using the or the DRAGEN Systematic Noise File Builder Pipeline on .

See:

Option
Description
Option
Description
Option
Description
Option
Description

For more detail on the small variant caller in somatic mode please refer to

Option
Description
Option
Description

For more information, see .

For instructions on how to download the Nirvana annotation database, please refer to

Option
Description

See the user guide: .

Microsatellite sites file can be downloaded here: .

Option
Description and recommended setting
Option
Description
Option
Recommended Value for Liquid Tumors (e.g. AML/MLL)

For more information, see .

Prebuilt systematic noise BED files (WES and WGS) can be downloaded here: .

Prebuilt WES/WGS noise files
Description

The SNV systematic noise files can also be built in the cloud using the or the DRAGEN Systematic Noise File Builder Pipeline on .

Prebuilt SV systematic noise files can be downloaded here: .

Prebuilt WES/WGS noise files
Description

Systematic noise BEDPE files can also be built in the cloud using the or the DRAGEN Systematic Noise File Builder Pipeline on .

See:

Option
Description
Option
Description
Option
Description
Option
Description

For more detail on the small variant caller in somatic mode please refer to

Option
Description
Option
Description

For more information, see .

For instructions on how to download the Nirvana annotation database, please refer to

Option
Description

See the user guide: .

Microsatellite sites file can be downloaded here: .

Option
Description and recommended setting
Option
Description
Option
Recommended Value for Liquid Tumors (e.g. AML/MLL)

For more information, see .

Option
Description

For more information, see .

Prebuilt systematic noise BED files (WES and WGS) can be downloaded here: .

Prebuilt WES/WGS noise files
Description

The SNV systematic noise files can also be built in the cloud using the or the DRAGEN Systematic Noise File Builder Pipeline on .

Prebuilt SV systematic noise files can be downloaded here: .

Prebuilt WES/WGS noise files
Description

Systematic noise BEDPE files can also be built in the cloud using the or the DRAGEN Systematic Noise File Builder Pipeline on .

Product Files
Somatic Mode
Nirvana
CNV Calling
DRAGEN Enrichment on BaseSpace
ICA
DRAGEN Baseline Builder App on BaseSpace
ICA
Product Files
Somatic Mode
Nirvana
CNV Calling
MRJD
Product Files
Somatic Mode
Nirvana
CNV Calling
MRJD
Product Files
Somatic Mode
Nirvana
CNV Calling
DRAGEN Enrichment on BaseSpace
ICA
DRAGEN Baseline Builder App on BaseSpace
ICA
DRAGEN Enrichment on BaseSpace
ICA
DRAGEN Software Support Site page
Product Files
Somatic Mode
Nirvana
CNV Calling
DRAGEN Enrichment on BaseSpace
ICA
DRAGEN Baseline Builder App on BaseSpace
ICA
DRAGEN Enrichment on BaseSpace
ICA
DRAGEN Software Support Site page
Data Transfer Options with ICA Platform
Quick Start
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
# Inputs 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
# UMI 
--umi-enable true 
--umi-source STRING                     #Default='qname' 
--umi-library-type STRING               #e.g. random-duplex 
--tumor-normal-has-umi STRING           #Sample(s) containing UMI ['tumor', 'both']. 
--umi-min-supporting-reads 1            #Default=2 
# Small variant caller 
--enable-variant-caller true 
--vc-target-bed $VC_TARGET_BED 
--vc-systematic-noise $PATH             #Optional 
--vc-enable-umi-solid true              #>= 1% VAF 
# SV 
--enable-sv true 
--sv-systematic-noise $PATH             #Optional 
--sv-exome true 
--sv-call-regions-bed $SV_TARGET_BED 
# CNV 
--enable-cnv true 
--cnv-use-somatic-vc-baf true 
--cnv-target-bed $PATH 
--cnv-combined-counts $PATH             #CNV PON 
# HRD Scoring 
--enable-hrd true                       #requires SNV 
# Annotation 
--variant-annotation-data PATH 
--enable-variant-annotation true 
# TMB 
--enable-tmb true 
# HLA genotyper 
--enable-hla true 
# Microsatellite Instability (MSI) 
--msi-command tumor-normal 
--msi-microsatellites-file $PATH 
--msi-coverage-threshold 40 
--tumor-fastq-list $PATH 
--tumor-fastq-list-sample-id $STRING 
--fastq-list $PATH 
--fastq-list-sample-id $STRING 
--tumor-fastq1 $PATH 
--tumor-fastq2 $PATH 
--RGSM-tumor $STRING 
--RGID-tumor $STRING 
--fastq-file1 $PATH 
--fastq-file2 $PATH 
--RGSM $STRING 
--RGID $STRING 
--tumor-bam-input $PATH 
--bam-input $PATH 
--tumor-cram-input $PATH 
--cram-input $PATH 
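
A FASTQ list is the CSV file that --fastq-list and --tumor-fastq-list point to. As a minimal sketch, the Python snippet below writes one with the standard RGID,RGSM,RGLB,Lane,Read1File,Read2File header. The helper name write_fastq_list, the sample and library names, and the file paths are placeholders chosen for illustration; they are not part of DRAGEN.

```python
import csv
import io

# Column order of a DRAGEN FASTQ list file.
FASTQ_LIST_COLUMNS = ["RGID", "RGSM", "RGLB", "Lane", "Read1File", "Read2File"]


def write_fastq_list(rows, handle):
    """Write FASTQ list rows (dicts keyed by column name) as CSV text."""
    writer = csv.DictWriter(handle, fieldnames=FASTQ_LIST_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Placeholder entry: one lane of a paired-end tumor sample.
rows = [
    {
        "RGID": "tumor.1",
        "RGSM": "TumorSample1",
        "RGLB": "LibA",
        "Lane": 1,
        "Read1File": "/data/TumorSample1_L001_R1.fastq.gz",
        "Read2File": "/data/TumorSample1_L001_R2.fastq.gz",
    },
]

buffer = io.StringIO()
write_fastq_list(rows, buffer)
fastq_list_text = buffer.getvalue()
print(fastq_list_text)
```

The RGSM column holds the sample ID that --tumor-fastq-list-sample-id and --fastq-list-sample-id select.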

--enable-map-align true

In the TN pipeline this must be set to false for BAM/CRAM input.

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

--enable-fractional-down-sampler

Set to true to enable fractional downsampling. The default value is false.

--down-sampler-normal-subsample

Specify the fraction of reads to keep as a subsample of normal input data. The default value is 1.0 (100%).

--down-sampler-tumor-subsample

Specify the fraction of reads to keep as a subsample of tumor input data. The default value is 1.0 (100%).

--down-sampler-random-seed

Specify the random seed for different runs of the same input data. The default value is 42.
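
The subsample options above take a fraction of reads to keep rather than a target depth. A minimal sketch of deriving that fraction from a measured and a desired mean coverage follows; the function name and the coverage figures are illustrative assumptions, and the result is only approximate because it treats coverage as proportional to read count.

```python
def subsample_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep so mean coverage drops to about target_coverage."""
    if current_coverage <= 0 or target_coverage <= 0:
        raise ValueError("coverage values must be positive")
    # Never ask for more than 100% of the reads.
    return min(1.0, target_coverage / current_coverage)


# Placeholder coverages: tumor at 240x down to 80x, normal already below target.
tumor_fraction = subsample_fraction(current_coverage=240.0, target_coverage=80.0)
normal_fraction = subsample_fraction(current_coverage=35.0, target_coverage=40.0)

options = [
    "--enable-fractional-down-sampler", "true",
    "--down-sampler-tumor-subsample", f"{tumor_fraction:.3f}",
    "--down-sampler-normal-subsample", f"{normal_fraction:.3f}",
    "--down-sampler-random-seed", "42",
]
print(" ".join(options))
```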

--vc-target-vaf FLOAT

The default is 0.03 (3%). Set to e.g. 0.01 to improve SNV sensitivity on 1% VAF variants (assuming sufficient coverage).

--vc-enable-umi-solid true

Optimized for 1% and higher VAFs on UMI (or read position collapsed) samples with approx 300-1000X coverage.

--vc-enable-umi-liquid true

Optimized for 0.1% and higher VAFs on UMI samples with 1000X or higher coverage as expected in liquid biopsies.
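
The coverage ranges quoted above matter because a low-VAF variant is only detectable if enough reads carry it. The sketch below uses a plain binomial model to estimate the chance of seeing at least a given number of supporting reads. It ignores sequencing error, UMI collapsing and DRAGEN's actual calling model, and the minimum of 3 supporting reads is an illustrative assumption rather than a DRAGEN threshold.

```python
from math import comb


def prob_at_least(k, depth, vaf):
    """P(X >= k) for X ~ Binomial(depth, vaf): at least k variant-supporting reads."""
    below = sum(comb(depth, i) * vaf**i * (1 - vaf) ** (depth - i) for i in range(k))
    return 1.0 - below


# Chance of at least 3 supporting reads for a 1% VAF variant at several depths.
for depth in (100, 300, 1000):
    print(depth, round(prob_at_least(3, depth, 0.01), 3))
```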

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

--tmb-vaf-threshold FLOAT

Minimum variant allele frequency for usable variants (default=0.05).

--vc-callability-tumor-thresh INT

Required read coverage to use a site (default=50).

--tmb-enable-proxi-filter BOOL

Use variant VAF information to increase germline filtering. Recommended for tumor-only (TO), but not for tumor-normal (TN). May be overly aggressive at tagging variants as germline (default=false).

--msi-coverage-threshold INT

Minimum coverage for a microsatellite: 60 (default)

--msi-distance-threshold FLOAT

Minimum Jensen-Shannon distance between tumor and normal for a microsatellite: 0.1 (default)
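
The distance threshold above compares tumor and normal at each microsatellite site. As a rough sketch of what a Jensen-Shannon distance measures, the Python below compares two histograms of observed repeat lengths. It assumes that the compared quantities are per-site repeat-length distributions and that base-2 logarithms are used (so the distance lies between 0 and 1); this section does not spell out either detail, and the counts are made up.

```python
from math import log2, sqrt


def to_distribution(counts):
    """Convert {repeat_length: read_count} into probabilities."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("no observations")
    return {length: n / total for length, n in counts.items()}


def jensen_shannon_distance(tumor_counts, normal_counts):
    """Square root of the Jensen-Shannon divergence (base-2 logs)."""
    p = to_distribution(tumor_counts)
    q = to_distribution(normal_counts)
    lengths = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2.
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in lengths}

    def kl(dist):
        # Kullback-Leibler divergence of dist from the mixture.
        return sum(v * log2(v / mix[k]) for k, v in dist.items() if v > 0)

    return sqrt(max(0.0, 0.5 * kl(p) + 0.5 * kl(q)))


# Placeholder read counts per repeat length at one site.
normal = {12: 48, 13: 2}
tumor = {10: 15, 11: 10, 12: 25}
distance = jensen_shannon_distance(tumor, normal)
unstable = distance >= 0.1  # compare against --msi-distance-threshold
print(round(distance, 3), unstable)
```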

--sv-call-regions-bed

Specifies a BED file containing the set of regions to call. Optionally gzip or bgzip format.

--sv-exclusion-bed

Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.

--enable-variant-deduplication true

Relevant when both the SV and SNV callers are enabled in somatic workflows. Can increase sensitivity and prevent duplicated variants within genes such as FLT3 and KMT2A. Small indels in the structural variant VCF that also appear, and pass filters, in the small variant VCF are filtered out. DRAGEN creates a new VCF containing the SV VCF variants that do not match any variant in the SNV VCF. The deduplicated SV VCF file uses the prefix passed by --output-file-prefix followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases.

--sv-systematic-noise $BEDPE

Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bgzip compressed). Optional for Tumor-Normal, but strongly recommended for Tumor-Only.

--sv-somatic-ins-tandup-hotspot-regions-bed $BED

Specify a custom BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. The default file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with some padding on both sides (300 bps)

--sv-min-candidate-variant-size

Run SV caller and report all SVs/indels at or above this size. The default value is set to 10.

--sv-min-scored-variant-size

After candidate identification, only score and report SVs/indels at or above this size. The default value is set to 50. This parameter doesn't affect the somatic hotspot region.

--sv-enable-liquid-tumor-mode true

DRAGEN can account for Tumor-in-Normal (TiN) contamination by running liquid tumor mode.

--sv-tin-contam-tolerance $TIN_CONTAM_TOLERANCE

Set the Tumor-in-Normal (TiN) contamination tolerance level.
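
The --enable-variant-deduplication description above states that variants are normalized by trimming and left shifting by up to 500 bases before SV and small variant records are compared. DRAGEN's implementation is not shown in this guide; the sketch below is the commonly used trim-and-left-align procedure for a single indel, with a hypothetical function name and a toy reference sequence.

```python
def normalize_indel(reference, pos, ref, alt, max_shift=500):
    """Trim shared bases and left-shift an indel.

    pos is the 0-based start of ref on reference; returns (pos, ref, alt).
    """
    if reference[pos:pos + len(ref)] != ref:
        raise ValueError("ref allele does not match the reference sequence")
    shifted = 0
    # Drop identical trailing bases. When an allele would become empty,
    # prepend the preceding reference base to both alleles (a left shift).
    while ref and alt and ref[-1] == alt[-1] and shifted < max_shift:
        if min(len(ref), len(alt)) == 1:
            if pos == 0:
                break  # nothing further to the left
            pos -= 1
            ref = reference[pos] + ref
            alt = reference[pos] + alt
            shifted += 1
        ref, alt = ref[:-1], alt[:-1]
    # Drop identical leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference with a CA repeat; both records start right-aligned.
reference = "TTGCACACAG"
deletion = normalize_indel(reference, 6, "ACA", "A")
insertion = normalize_indel(reference, 8, "A", "ACA")
print(deletion, insertion)
```

Two records that describe the same event at different positions inside a repeat end up with identical coordinates and alleles after this step, which is what makes a direct comparison between the two VCFs possible.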

WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FF

FFPE_WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FFPE (only hg38)

WES_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WES FF and FFPE

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--umi-enable true 
--umi-source STRING                     #default='qname' 
--umi-library-type STRING               #see 'UMI' 
--vc-detect-systematic-noise=true 
--vc-target-bed-padding 500 
--vc-enable-germline-tagging=true 
--variant-annotation-data $PATH 
--intermediate-results-dir $PATH 
--output-directory $PATH 
--output-file-prefix $STRING 
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--build-sys-noise-vcfs-list ${VCF_LIST} 

WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For WGS/WES FF/FFPE

IDPF_WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For HEME

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--umi-enable true 
--umi-source STRING                     #default='qname' 
--umi-library-type STRING               #see 'UMI' 
--sv-detect-systematic-noise true 
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--sv-build-systematic-noise-vcfs-list $VCF_LIST   #one VCF per line
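
The list passed to --sv-build-systematic-noise-vcfs-list is a text file with one VCF per line. A minimal sketch of generating such a file from a directory of per-sample outputs is shown below. The helper name, the *.sv.vcf.gz file pattern and the file names are assumptions for illustration; the demo runs against empty placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path


def write_vcf_list(vcf_dir, list_path, pattern="*.sv.vcf.gz"):
    """Write the absolute paths of matching VCFs to list_path, one per line."""
    paths = sorted(str(p.resolve()) for p in Path(vcf_dir).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern} in {vcf_dir}")
    Path(list_path).write_text("\n".join(paths) + "\n")
    return paths


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder inputs: two SV VCFs and one unrelated file.
    for name in ("normal_b.sv.vcf.gz", "normal_a.sv.vcf.gz", "notes.txt"):
        (Path(tmp) / name).touch()
    list_file = Path(tmp) / "sv_noise_vcfs.txt"
    listed = write_vcf_list(tmp, list_file)
    list_lines = list_file.read_text().splitlines()

print(len(list_lines))
```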
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--enable-cnv true 
--cnv-target-bed $PATH 
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--enable-cnv true 
--cnv-generate-combined-counts true 
--cnv-normals-list $CNV_NORMALS_LIST 
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
# Inputs 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
# UMI 
--umi-enable true 
--umi-source STRING                     #Default='qname' 
--umi-library-type STRING               #e.g. random-duplex 
--tumor-normal-has-umi STRING           #Sample(s) containing UMI ['tumor', 'both']. 
--umi-min-supporting-reads 1            #Default=2 
# Small variant caller 
--enable-variant-caller true 
--vc-systematic-noise $PATH             #Optional 
--vc-target-vaf 0.03                    #Default is 0.03. 
# SV 
--enable-sv true 
--sv-systematic-noise $PATH             #Optional 
# CNV 
--enable-cnv true 
--cnv-enable-self-normalization true 
--cnv-use-somatic-vc-baf true 
# HRD Scoring 
--enable-hrd true                       #requires SNV 
# Annotation 
--variant-annotation-data PATH 
--enable-variant-annotation true 
# TMB 
--enable-tmb true 
# HLA genotyper 
--enable-hla true 
# Microsatellite Instability (MSI) 
--msi-command tumor-normal 
--msi-microsatellites-file $PATH 
--msi-coverage-threshold 40 
--tumor-fastq-list $PATH 
--tumor-fastq-list-sample-id $STRING 
--fastq-list $PATH 
--fastq-list-sample-id $STRING 
--tumor-fastq1 $PATH 
--tumor-fastq2 $PATH 
--RGSM-tumor $STRING 
--RGID-tumor $STRING 
--fastq-file1 $PATH 
--fastq-file2 $PATH 
--RGSM $STRING 
--RGID $STRING 
--tumor-bam-input $PATH 
--bam-input $PATH 
--tumor-cram-input $PATH 
--cram-input $PATH 
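
The command listings in this guide put one option per line, with $NAME placeholders and trailing # comments, so they cannot be pasted into a shell unchanged. The following sketch shows one way to turn such a listing into a single quoted command line. The function name and all placeholder values (version, reference and output paths) are hypothetical, and the shortened listing is only an example.

```python
import shlex
from string import Template


def listing_to_command(listing, values):
    """Strip comments, fill $NAME placeholders and join into one shell command."""
    argv = []
    for raw in listing.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing or full-line comments
        if not line:
            continue
        for token in shlex.split(line):
            # substitute() raises KeyError if a placeholder has no value.
            argv.append(Template(token).substitute(values))
    return shlex.join(argv)


listing = """
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable
--output-directory $OUTPUT
# CNV
--enable-cnv true
"""

# Placeholder values; replace with real paths before running anything.
values = {"VERSION": "4.4.0", "REF_DIR": "/ref/hg38", "OUTPUT": "/out/run 1"}
command = listing_to_command(listing, values)
print(command)
```

Filling placeholders token by token keeps values that contain spaces intact, and shlex.join quotes them for the shell.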

--enable-map-align true

In the TN pipeline this must be set to false for BAM/CRAM input.

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

--enable-fractional-down-sampler

Set to true to enable fractional downsampling. The default value is false.

--down-sampler-normal-subsample

Specify the fraction of reads to keep as a subsample of normal input data. The default value is 1.0 (100%).

--down-sampler-tumor-subsample

Specify the fraction of reads to keep as a subsample of tumor input data. The default value is 1.0 (100%).

--down-sampler-random-seed

Specify the random seed for different runs of the same input data. The default value is 42.

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

--tmb-vaf-threshold FLOAT

Minimum variant allele frequency for usable variants (default=0.05).

--vc-callability-tumor-thresh INT

Required read coverage to use a site (default=50).

--tmb-enable-proxi-filter BOOL

Use variant VAF information to increase germline filtering. Recommended for tumor-only (TO), but not for tumor-normal (TN). May be overly aggressive at tagging variants as germline (default=false).
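
The two thresholds above decide which variants and sites are usable for TMB. The sketch below illustrates the idea with made-up records: keep variants that meet both thresholds, then divide by the callable region size in megabases. It is a simplification under stated assumptions; DRAGEN applies further filters (for example germline filtering) that are not modeled here, and the function and field names are hypothetical.

```python
def tmb_per_megabase(variants, callable_bases, min_vaf=0.05, min_depth=50):
    """Count variants passing both thresholds per megabase of callable sequence."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    usable = [v for v in variants if v["vaf"] >= min_vaf and v["depth"] >= min_depth]
    return len(usable) / (callable_bases / 1_000_000)


# Placeholder variant records.
variants = [
    {"vaf": 0.21, "depth": 180},
    {"vaf": 0.03, "depth": 220},  # below the VAF threshold
    {"vaf": 0.40, "depth": 35},   # below the coverage threshold
]
tmb = tmb_per_megabase(variants, callable_bases=1_500_000)
print(round(tmb, 3))
```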

--msi-coverage-threshold INT

Minimum coverage for a microsatellite: 60 (default)

--msi-distance-threshold FLOAT

Minimum Jensen-Shannon distance between tumor and normal for a microsatellite: 0.1 (default)

--sv-call-regions-bed

Specifies a BED file containing the set of regions to call. Optionally gzip or bgzip format.

--sv-exclusion-bed

Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.

--enable-variant-deduplication true

Relevant when both the SV and SNV callers are enabled in somatic workflows. Can increase sensitivity and prevent duplicated variants within genes such as FLT3 and KMT2A. Small indels in the structural variant VCF that also appear, and pass filters, in the small variant VCF are filtered out. DRAGEN creates a new VCF containing the SV VCF variants that do not match any variant in the SNV VCF. The deduplicated SV VCF file uses the prefix passed by --output-file-prefix followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases.

--sv-systematic-noise $BEDPE

Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bgzip compressed). Optional for Tumor-Normal, but strongly recommended for Tumor-Only.

--sv-somatic-ins-tandup-hotspot-regions-bed $BED

Specify a custom BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. The default file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with some padding on both sides (300 bps)

--sv-min-candidate-variant-size

Run SV caller and report all SVs/indels at or above this size. The default value is set to 10.

--sv-min-scored-variant-size

After candidate identification, only score and report SVs/indels at or above this size. The default value is set to 50. This parameter doesn't affect the somatic hotspot region.

--sv-enable-liquid-tumor-mode true

DRAGEN can account for Tumor-in-Normal (TiN) contamination by running liquid tumor mode.

--sv-tin-contam-tolerance $TIN_CONTAM_TOLERANCE

Set the Tumor-in-Normal (TiN) contamination tolerance level.

WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FF

FFPE_WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FFPE (only hg38)

WES_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WES FF and FFPE

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--umi-enable true 
--umi-source STRING                     #default='qname' 
--umi-library-type STRING               #see 'UMI' 
--vc-detect-systematic-noise=true 
--vc-enable-germline-tagging=true 
--variant-annotation-data $PATH 
--intermediate-results-dir $PATH 
--output-directory $PATH 
--output-file-prefix $STRING 
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--build-sys-noise-vcfs-list ${VCF_LIST} 

WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For WGS/WES FF/FFPE

IDPF_WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For HEME

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--umi-enable true 
--umi-source STRING                     #default='qname' 
--umi-library-type STRING               #see 'UMI' 
--sv-detect-systematic-noise true 
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--sv-build-systematic-noise-vcfs-list $VCF_LIST   #one VCF per line
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
# Inputs 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
--Aligner.clip-pe-overhang 2            #option to ignore unwanted UMIs in reads 
--enable-duplicate-marking true         #default=true 
# Small variant caller 
--enable-variant-caller true 
--vc-target-bed $VC_TARGET_BED 
--vc-systematic-noise $PATH             #Optional 
--vc-target-vaf 0.03                    #Default is 0.03. 
# SV 
--enable-sv true 
--sv-systematic-noise $PATH             #Optional 
--sv-exome true 
--sv-call-regions-bed $SV_TARGET_BED 
# CNV 
--enable-cnv true 
--cnv-use-somatic-vc-baf true 
--cnv-target-bed $PATH 
--cnv-combined-counts $PATH             #CNV PON 
# HRD Scoring 
--enable-hrd true                       #requires SNV 
# Annotation 
--variant-annotation-data PATH 
--enable-variant-annotation true 
# TMB 
--enable-tmb true 
# HLA genotyper 
--enable-hla true 
# Microsatellite Instability (MSI) 
--msi-command tumor-normal 
--msi-microsatellites-file $PATH 
--msi-coverage-threshold 40 
--tumor-fastq-list $PATH 
--tumor-fastq-list-sample-id $STRING 
--fastq-list $PATH 
--fastq-list-sample-id $STRING 
--tumor-fastq1 $PATH 
--tumor-fastq2 $PATH 
--RGSM-tumor $STRING 
--RGID-tumor $STRING 
--fastq-file1 $PATH 
--fastq-file2 $PATH 
--RGSM $STRING 
--RGID $STRING 
--tumor-bam-input $PATH 
--bam-input $PATH 
--tumor-cram-input $PATH 
--cram-input $PATH 

--enable-map-align true

In the TN pipeline this must be set to false for BAM/CRAM input.

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

--enable-duplicate-marking true

By default, DRAGEN marks duplicate reads and excludes them from variant calling.

--enable-fractional-down-sampler

Set to true to enable fractional downsampling. The default value is false.

--down-sampler-normal-subsample

Specify the fraction of reads to keep as a subsample of normal input data. The default value is 1.0 (100%).

--down-sampler-tumor-subsample

Specify the fraction of reads to keep as a subsample of tumor input data. The default value is 1.0 (100%).

--down-sampler-random-seed

Specify the random seed for different runs of the same input data. The default value is 42.

--vc-target-vaf FLOAT

The default is 0.03 (3%). Set to e.g. 0.01 to improve SNV sensitivity on 1% VAF variants (assuming sufficient coverage).

--vc-enable-umi-solid true

Optimized for 1% and higher VAFs on UMI (or read position collapsed) samples with approx 300-1000X coverage.

--vc-enable-umi-liquid true

Optimized for 0.1% and higher VAFs on UMI samples with 1000X or higher coverage as expected in liquid biopsies.

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

--tmb-vaf-threshold FLOAT

Minimum variant allele frequency for usable variants (default=0.05).

--vc-callability-tumor-thresh INT

Required read coverage to use a site (default=50).

--tmb-enable-proxi-filter BOOL

Use variant VAF information to increase germline filtering. Recommended for tumor-only (TO), but not for tumor-normal (TN). May be overly aggressive at tagging variants as germline (default=false).

--msi-coverage-threshold INT

Minimum coverage for a microsatellite: 60 (default)

--msi-distance-threshold FLOAT

Minimum Jensen-Shannon distance between tumor and normal for a microsatellite: 0.1 (default)

--sv-call-regions-bed

Specifies a BED file containing the set of regions to call. Optionally gzip or bgzip format.

--sv-exclusion-bed

Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.

--enable-variant-deduplication true

Relevant when both the SV and SNV callers are enabled in somatic workflows. Can increase sensitivity and prevent duplicated variants within genes such as FLT3 and KMT2A. Small indels in the structural variant VCF that also appear, and pass filters, in the small variant VCF are filtered out. DRAGEN creates a new VCF containing the SV VCF variants that do not match any variant in the SNV VCF. The deduplicated SV VCF file uses the prefix passed by --output-file-prefix followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases.

--sv-systematic-noise $BEDPE

Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bgzip compressed). Optional for Tumor-Normal, but strongly recommended for Tumor-Only.

--sv-somatic-ins-tandup-hotspot-regions-bed $BED

Specify a custom BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. The default file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with some padding on both sides (300 bps)

--sv-min-candidate-variant-size

Run SV caller and report all SVs/indels at or above this size. The default value is set to 10.

--sv-min-scored-variant-size

After candidate identification, only score and report SVs/indels at or above this size. The default value is set to 50. This parameter doesn't affect the somatic hotspot region.

--sv-enable-liquid-tumor-mode true

DRAGEN can account for Tumor-in-Normal (TiN) contamination by running liquid tumor mode.

--sv-tin-contam-tolerance $TIN_CONTAM_TOLERANCE

Set the Tumor-in-Normal (TiN) contamination tolerance level.
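
The interaction between --sv-min-candidate-variant-size and --sv-min-scored-variant-size can be pictured as a size classification. The following is only an illustrative sketch using the defaults quoted above (10 and 50). classify_sv_size is a hypothetical helper, not a DRAGEN command, and it ignores the somatic hotspot regions, which the scored-size threshold does not affect.

```shell
# classify_sv_size: hypothetical helper, not a DRAGEN command.
# Usage: classify_sv_size SIZE [MIN_CANDIDATE] [MIN_SCORED]
classify_sv_size() {
  local size=$1 min_candidate=${2:-10} min_scored=${3:-50}
  if [ "$size" -lt "$min_candidate" ]; then
    echo "ignored"            # below the candidate threshold
  elif [ "$size" -lt "$min_scored" ]; then
    echo "candidate-only"     # identified as a candidate, but not scored or reported
  else
    echo "scored"             # scored and reported
  fi
}

# Walk three example sizes through the default thresholds.
for size in 8 30 50; do
  echo "$size bp: $(classify_sv_size "$size")"
done
```

Lowering the scored-size threshold moves events from the middle tier into the reported tier, which is the practical effect of changing --sv-min-scored-variant-size.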

WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FF

FFPE_WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FFPE (only hg38)

WES_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WES FF and FFPE

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--vc-detect-systematic-noise=true 
--vc-target-bed-padding 500 
--vc-enable-germline-tagging=true 
--variant-annotation-data $PATH 
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--build-sys-noise-vcfs-list ${VCF_LIST} 
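
The list file passed to --build-sys-noise-vcfs-list can be generated from a directory of per-normal VCFs. The sketch below assumes one VCF path per line (as noted for the SV equivalent of this option), plus a directory layout and a .vcf.gz suffix that are placeholders. make_vcf_list is an illustrative helper, not part of DRAGEN.

```shell
# make_vcf_list: hypothetical helper that writes one absolute VCF path per line.
# Usage: make_vcf_list VCF_DIR OUT_LIST
make_vcf_list() {
  local vcf_dir=$1 out_list=$2
  # Resolve the directory first so the listed paths are absolute.
  vcf_dir=$(cd "$vcf_dir" && pwd) || return 1
  find "$vcf_dir" -type f -name '*.vcf.gz' | sort > "$out_list"
  # Return non-zero if nothing matched, so an empty list is never passed on.
  [ -s "$out_list" ]
}

# Example (paths are placeholders):
# make_vcf_list /data/normals/vcfs /data/normals/vcf_list.txt
# VCF_LIST=/data/normals/vcf_list.txt
```

Sorting the paths keeps the list stable between runs, which makes it easier to compare noise files built from the same set of normals.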

WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For WGS/WES FF/FFPE

IDPF_WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For HEME

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--sv-detect-systematic-noise true 
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--sv-build-systematic-noise-vcfs-list $VCF_LIST  #one VCF per line 
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--enable-cnv true 
--cnv-target-bed $PATH 
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--enable-cnv true 
--cnv-generate-combined-counts true 
--cnv-normals-list $CNV_NORMALS_LIST 
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
# Inputs 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
--Aligner.clip-pe-overhang 2            #option to ignore unwanted UMIs in reads 
--enable-duplicate-marking true         #default=true 
# Small variant caller 
--enable-variant-caller true 
--vc-systematic-noise $PATH             #Optional 
--vc-target-vaf 0.03                    #Default is 0.03. 
# SV 
--enable-sv true 
--sv-systematic-noise $PATH             #Optional 
# CNV 
--enable-cnv true 
--cnv-enable-self-normalization true 
--cnv-use-somatic-vc-baf true 
# HRD Scoring 
--enable-hrd true                       #requires SNV 
# Annotation 
--variant-annotation-data $PATH 
--enable-variant-annotation true 
# TMB 
--enable-tmb true 
# HLA genotyper 
--enable-hla true 
# Microsatellite Instability (MSI) 
--msi-command tumor-normal 
--msi-microsatellites-file $PATH 
--msi-coverage-threshold 40 

# Input options: FASTQ list 
--tumor-fastq-list $PATH 
--tumor-fastq-list-sample-id $STRING 
--fastq-list $PATH 
--fastq-list-sample-id $STRING 

# Input options: FASTQ 
--tumor-fastq1 $PATH 
--tumor-fastq2 $PATH 
--RGSM-tumor $STRING 
--RGID-tumor $STRING 
--fastq-file1 $PATH 
--fastq-file2 $PATH 
--RGSM $STRING 
--RGID $STRING 

# Input options: BAM 
--tumor-bam-input $PATH 
--bam-input $PATH 

# Input options: CRAM 
--tumor-cram-input $PATH 
--cram-input $PATH 
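
The input flags listed above come in FASTQ list, FASTQ, BAM, and CRAM variants. As a sketch, a wrapper script might pick the tumor input flags from the file type. tumor_input_args is a hypothetical helper, the extension-based detection (including treating a .csv file as a FASTQ list) is an assumption of this example, and only flags listed above are used.

```shell
# tumor_input_args: hypothetical helper that prints the tumor input flags
# for one file, chosen by extension (an assumption for this sketch).
# Usage: tumor_input_args FILE [SAMPLE_ID]
tumor_input_args() {
  local file=$1 sample_id=${2:-}
  case "$file" in
    *.bam)  echo "--tumor-bam-input $file" ;;
    *.cram) echo "--tumor-cram-input $file" ;;
    *.csv)  echo "--tumor-fastq-list $file --tumor-fastq-list-sample-id $sample_id" ;;
    *)      echo "unsupported input: $file" >&2; return 1 ;;
  esac
}

# Example (file names are placeholders):
tumor_input_args tumor.bam
tumor_input_args fastq_list.csv TUMOR_SAMPLE
```

The helper prints a plain string, so it only suits paths without spaces; a production wrapper would more likely build a bash array.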

--enable-map-align true

In the TN pipeline this must be set to false for BAM/CRAM input.

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

--enable-duplicate-marking true

By default, DRAGEN marks duplicate reads and excludes them from variant calling.

--enable-fractional-down-sampler

Set to true to enable fractional downsampling. The default value is false.

--down-sampler-normal-subsample

Specify the fraction of reads to keep as a subsample of normal input data. The default value is 1.0 (100%).

--down-sampler-tumor-subsample

Specify the fraction of reads to keep as a subsample of tumor input data. The default value is 1.0 (100%).

--down-sampler-random-seed

Specify the random seed for different runs of the same input data. The default value is 42.

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

--tmb-vaf-threshold FLOAT

Variant minimum allele frequency for usable variants (default=0.05).

--vc-callability-tumor-thresh INT

Required read coverage to use a site (default=50).

--tmb-enable-proxi-filter BOOL

Use variant VAF information to increase germline filtering. Recommended for TO, but not for TN. May be overly aggressive at tagging variants as germline (default=false).

--msi-coverage-threshold INT

Minimum coverage for a microsatellite: 60 (default)

--msi-distance-threshold FLOAT

Minimum Jensen-Shannon distance between tumor and normal for a microsatellite: 0.1 (default)

--sv-call-regions-bed

Specifies a BED file containing the set of regions to call. Optionally gzip or bgzip format.

--sv-exclusion-bed

Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.

--enable-variant-deduplication true

Relevant when both SV and SNV callers are enabled in somatic workflows. Can increase sensitivity and prevent the occurrence of replicated variants within genes such as FLT3 and KMT2A. Filter all small indels in the structural variant VCF that appear and are passing in the small variant VCF. DRAGEN will create a new VCF that contains variants in SV VCF that are not matching a variant from SNV VCF file. The new deduplicated SV VCF file will have the same prefix passed by --output-file-prefix followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases.

--sv-systematic-noise $BEDPE

Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bzip compressed). Optional for Tumor-Normal, but strongly recommended for Tumor-Only.

--sv-somatic-ins-tandup-hotspot-regions-bed $BED

Specify a custom BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. The default file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with some padding on both sides (300 bps)

--sv-min-candidate-variant-size

Run SV caller and report all SVs/indels at or above this size. The default value is set to 10.

--sv-min-scored-variant-size

After candidate identification, only score and report SVs/indels at or above this size. The default value is set to 50. This parameter doesn't affect the somatic hotspot region.

--sv-enable-liquid-tumor-mode true

DRAGEN can account for Tumor-in-Normal (TiN) contamination by running liquid tumor mode.

--sv-tin-contam-tolerance $TIN_CONTAM_TOLERANCE

Set the Tumor-in-Normal (TiN) contamination tolerance level.
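
For intuition about --msi-distance-threshold in the option list above, the following sketch computes a Jensen-Shannon distance between two repeat-length histograms with awk. It assumes the common definition (square root of the base-2 Jensen-Shannon divergence) and made-up input histograms. The exact computation DRAGEN performs internally is not described here, and js_distance is a hypothetical helper.

```shell
# js_distance: hypothetical helper, not part of DRAGEN.
# Args: two comma-separated histograms of equal length (counts or fractions).
# Prints sqrt(JSD) using log base 2, so the result lies in [0, 1].
js_distance() {
  awk -v p="$1" -v q="$2" 'BEGIN {
    n = split(p, P, ","); m = split(q, Q, ",")
    if (n != m) { print "length mismatch" > "/dev/stderr"; exit 1 }
    # Normalize both histograms so they sum to 1.
    for (i = 1; i <= n; i++) { sp += P[i]; sq += Q[i] }
    for (i = 1; i <= n; i++) {
      a = P[i] / sp; b = Q[i] / sq; mid = (a + b) / 2
      # Zero-probability bins contribute nothing to the divergence.
      if (a > 0) d += 0.5 * a * log(a / mid) / log(2)
      if (b > 0) d += 0.5 * b * log(b / mid) / log(2)
    }
    if (d < 0) d = 0   # guard against tiny negative rounding error
    printf "%.4f\n", sqrt(d)
  }'
}

# Example: tumor vs normal repeat-length counts at one site (invented numbers).
js_distance 10,80,10 5,60,35
```

The printed value can then be compared against the 0.1 default to see how a given pair of distributions relates to the threshold.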

WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FF

FFPE_WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FFPE (only hg38)

WES_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WES FF and FFPE

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--vc-detect-systematic-noise=true 
--vc-enable-germline-tagging=true 
--variant-annotation-data $PATH 
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--build-sys-noise-vcfs-list ${VCF_LIST} 

WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For WGS/WES FF/FFPE

IDPF_WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For HEME

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--sv-detect-systematic-noise true 
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--sv-build-systematic-noise-vcfs-list $VCF_LIST  #one VCF per line 
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
# Inputs 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
--Aligner.clip-pe-overhang 2            #option to ignore unwanted UMIs in reads 
--enable-duplicate-marking true         #default=true 
# Small variant caller 
--enable-variant-caller true 
--vc-systematic-noise $PATH             #Required 
--vc-target-vaf 0.03                    #Default is 0.03. 
# SV 
--enable-sv true 
--heme-sv true 
--sv-systematic-noise $PATH             #Recommended 
# DUX4 
--enable-dux4-caller true 
# CNV 
--enable-cnv true 
--cnv-enable-self-normalization true 
--cnv-population-b-allele-vcf $POP_VCF  #optional, add to enable germline ASCN caller 
--heme-cnv true 
# Annotation 
--variant-annotation-data $PATH 
--vc-enable-germline-tagging true 
# TMB 
--enable-tmb true 
--tmb-enable-proxi-filter true          #Optional for Tumor-Only 
# HLA genotyper 
--enable-hla true 
# Microsatellite Instability (MSI) 
--msi-command tumor-only 
--msi-ref-normal-dir $PATH 
--msi-microsatellites-file $PATH 
--msi-coverage-threshold 40 

# Input options: FASTQ list 
--tumor-fastq-list $PATH 
--tumor-fastq-list-sample-id $STRING 

# Input options: FASTQ 
--tumor-fastq1 $PATH 
--tumor-fastq2 $PATH 
--RGSM-tumor $STRING 
--RGID-tumor $STRING 

# Input options: BAM 
--tumor-bam-input $PATH 

# Input options: CRAM 
--tumor-cram-input $PATH 

--enable-map-align true

Optionally disable map & align (default=true).

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

--enable-duplicate-marking true

By default, DRAGEN marks duplicate reads and excludes them from variant calling.

--enable-fractional-down-sampler

Set to true to enable fractional downsampling. The default value is false.

--down-sampler-normal-subsample

Specify the fraction of reads to keep as a subsample of normal input data. The default value is 1.0 (100%).

--down-sampler-tumor-subsample

Specify the fraction of reads to keep as a subsample of tumor input data. The default value is 1.0 (100%).

--down-sampler-random-seed

Specify the random seed for different runs of the same input data. The default value is 42.

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

--tmb-vaf-threshold FLOAT

Variant minimum allele frequency for usable variants (default=0.05).

--vc-callability-tumor-thresh INT

Required read coverage to use a site (default=50).

--tmb-enable-proxi-filter BOOL

Use variant VAF information to increase germline filtering. Recommended for TO, but not for TN. May be overly aggressive at tagging variants as germline (default=false).

--msi-coverage-threshold INT

Minimum coverage for a microsatellite: 60 (default)

--msi-distance-threshold FLOAT

Minimum Jensen-Shannon distance between tumor and normal for a microsatellite: 0.1 (default)

--sv-call-regions-bed

Specifies a BED file containing the set of regions to call. Optionally gzip or bgzip format.

--sv-exclusion-bed

Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.

--enable-variant-deduplication true

Relevant when both SV and SNV callers are enabled in somatic workflows. Can increase sensitivity and prevent the occurrence of replicated variants within genes such as FLT3 and KMT2A. Filter all small indels in the structural variant VCF that appear and are passing in the small variant VCF. DRAGEN will create a new VCF that contains variants in SV VCF that are not matching a variant from SNV VCF file. The new deduplicated SV VCF file will have the same prefix passed by --output-file-prefix followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases.

--sv-systematic-noise $BEDPE

Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bzip compressed). Optional for Tumor-Normal, but strongly recommended for Tumor-Only.

--sv-somatic-ins-tandup-hotspot-regions-bed $BED

Specify a custom BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. The default file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with some padding on both sides (300 bps)

--sv-min-candidate-variant-size

Run SV caller and report all SVs/indels at or above this size. The default value is set to 10.

--sv-min-scored-variant-size

After candidate identification, only score and report SVs/indels at or above this size. The default value is set to 50. This parameter doesn't affect the somatic hotspot region.

--heme-sv true

Configures DRAGEN to use SV settings for Liquid Tumors (e.g., AML/MLL).

--sv-min-scored-variant-size $INT

100000

--dux4-skip-santiy-check true

Bypass the requirements checks if the input datasets don't comply with parameters listed in prerequisites
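
The --tmb-vaf-threshold and --vc-callability-tumor-thresh defaults in the list above can be pictured as a simple filter. The sketch below applies them to a made-up tab-separated table (variant ID, VAF, tumor depth), treating the coverage threshold as a per-site depth requirement. The table layout and the tmb_usable helper are hypothetical, and the actual TMB calculation applies further steps, such as germline filtering, that are not shown here.

```shell
# tmb_usable: hypothetical helper. Reads "id<TAB>vaf<TAB>depth" on stdin and
# prints the IDs that meet both thresholds.
# Usage: tmb_usable [MIN_VAF] [MIN_DEPTH] < variants.tsv
tmb_usable() {
  awk -F '\t' -v min_vaf="${1:-0.05}" -v min_depth="${2:-50}" \
    '$2 + 0 >= min_vaf + 0 && $3 + 0 >= min_depth + 0 { print $1 }'
}

# Example with three invented variants; only v2 meets both defaults.
printf 'v1\t0.04\t120\nv2\t0.05\t50\nv3\t0.30\t49\n' | tmb_usable
```

Passing explicit arguments, for example tmb_usable 0.01 100, shows how the usable set shifts when either threshold changes.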

WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FF

FFPE_WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FFPE (only hg38)

WES_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WES FF and FFPE

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--vc-detect-systematic-noise=true 
--vc-enable-germline-tagging=true 
--variant-annotation-data $PATH 
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--build-sys-noise-vcfs-list ${VCF_LIST} 

WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For WGS/WES FF/FFPE

IDPF_WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For HEME

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--sv-detect-systematic-noise true 
  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--sv-build-systematic-noise-vcfs-list $VCF_LIST  #one VCF per line 

--umi-source STRING

Specify the input type for the UMI sequence. Options: qname, fastq, bamtag.

--umi-library-type STRING

Set the batch option for the type of UMI correction. Options: random-duplex, random-simplex, nonrandom-duplex.

--umi-nonrandom-whitelist $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The whitelist includes a valid UMI sequence per line.

--umi-correction-table $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The correction table defaults to the table used by TruSight Oncology: <INSTALL_PATH>/resources/umi/umi_correction_table.txt.gz.

--umi-min-supporting-reads INT

Specify the number of matching UMI input reads required to generate a consensus read. Any family with insufficient supporting reads is discarded. The default is 2.

--umi-metrics-interval-file $BED

Target region in BED format.

--umi-emit-multiplicity both

--umi-start-mask-length INT

Number of additional bases to ignore from start of read. The default is 0. To reduce FP optionally set to 1.

--umi-end-mask-length INT

Number of additional bases to ignore from end of read. The default is 0. To reduce FP optionally set to 3.

--tumor-normal-has-umi STRING

Specify if only the tumor, or if both the tumor and normal have UMIs. Options: 'both','tumor'.

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-systematic-noise $PATH

Systematic noise file. This filter is recommended for removing systematic noise observed in normal samples.

--vc-somatic-hotspots $PATH

DRAGEN has a default set of hotspot variants (positions and alleles) where it will assign an increased prior probability. Use this option to override with a custom hotspots file.

--vc-enable-liquid-tumor-mode true

Tumor-in-normal contamination. Only use if there is some tumor leakage in the normal control.

--vc-override-tumor-pcr-params-with-normal false

Mixed sample preparation. Only use if the tumor and normal samples exhibit different PCR (indel) noise patterns, e.g., due to using different sample preparation.

--vc-sq-filter-threshold $INT

Threshold for sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.

--vc-excluded-regions-bed $BED

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

--cnv-normal-cnv-vcf $CNV_NORMAL_VCF
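
As an illustration of --umi-min-supporting-reads from the UMI options above, the sketch below keeps only read families with enough supporting reads. The "family ID, read count" input is invented for the example and filter_umi_families is a hypothetical helper. DRAGEN applies this rule itself when generating consensus reads, so the snippet is only a way to reason about the setting.

```shell
# filter_umi_families: hypothetical helper. Reads "family<TAB>count" on stdin
# and prints the families whose count meets the minimum (default 2).
# Usage: filter_umi_families [MIN_READS] < families.tsv
filter_umi_families() {
  awk -F '\t' -v min_reads="${1:-2}" '$2 + 0 >= min_reads + 0'
}

# Example with three invented families; famA has a single read and is dropped.
printf 'famA\t1\nfamB\t2\nfamC\t7\n' | filter_umi_families
```

Raising the minimum discards more small families, which trades usable depth for stronger consensus support.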

--umi-source STRING

Specify the input type for the UMI sequence. Options: qname, fastq, bamtag.

--umi-library-type STRING

Set the batch option for the type of UMI correction. Options: random-duplex, random-simplex, nonrandom-duplex.

--umi-nonrandom-whitelist $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The whitelist includes a valid UMI sequence per line.

--umi-correction-table $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The correction table defaults to the table used by TruSight Oncology: <INSTALL_PATH>/resources/umi/umi_correction_table.txt.gz.

--umi-min-supporting-reads INT

Specify the number of matching UMI input reads required to generate a consensus read. Any family with insufficient supporting reads is discarded. The default is 2.

--umi-metrics-interval-file $BED

Target region in BED format.

--umi-emit-multiplicity both

--umi-start-mask-length INT

Number of additional bases to ignore from start of read. The default is 0. To reduce FP optionally set to 1.

--umi-end-mask-length INT

Number of additional bases to ignore from end of read. The default is 0. To reduce FP optionally set to 3.

--tumor-normal-has-umi STRING

Specify if only the tumor, or if both the tumor and normal have UMIs. Options: 'both','tumor'.

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-systematic-noise $PATH

Systematic noise file. This filter is recommended for removing systematic noise observed in normal samples.

--vc-somatic-hotspots $PATH

DRAGEN has a default set of hotspot variants (positions and alleles) where it will assign an increased prior probability. Use this option to override with a custom hotspots file.

--vc-enable-liquid-tumor-mode true

Tumor-in-normal contamination. Only use if there is some tumor leakage in the normal control.

--vc-override-tumor-pcr-params-with-normal false

Mixed sample preparation. Only use if the tumor and normal samples exhibit different PCR (indel) noise patterns, e.g., due to using different sample preparation.

--vc-sq-filter-threshold $INT

Threshold for sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.

--vc-excluded-regions-bed $BED

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

--cnv-normal-cnv-vcf $CNV_NORMAL_VCF

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-systematic-noise $PATH

Systematic noise file. This filter is recommended for removing systematic noise observed in normal samples.

--vc-somatic-hotspots $PATH

DRAGEN has a default set of hotspot variants (positions and alleles) where it will assign an increased prior probability. Use this option to override with a custom hotspots file.

--vc-enable-liquid-tumor-mode true

Tumor-in-normal contamination. Only use if there is some tumor leakage in the normal control.

--vc-override-tumor-pcr-params-with-normal false

Mixed sample preparation. Only use if the tumor and normal samples exhibit different PCR (indel) noise patterns, e.g., due to using different sample preparation.

--vc-sq-filter-threshold $INT

Threshold for sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.

--vc-excluded-regions-bed $BED

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

--cnv-normal-cnv-vcf $CNV_NORMAL_VCF
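
The effect of --vc-combine-phased-variants-distance in the list above can be sketched as a grouping of nearby positions. The example assumes that all input variants share the same phase and that distance is measured between consecutive variant positions; both are simplifications for illustration. group_phased is a hypothetical helper, not a DRAGEN command.

```shell
# group_phased: hypothetical helper. Reads sorted positions (one per line) of
# variants assumed to share a phase, and prints one comma-separated group per line.
# Usage: group_phased [MAX_DIST] < positions.txt   (MAX_DIST=0 disables combining)
group_phased() {
  awk -v max_dist="${1:-2}" '
    NR == 1 { group = $1; prev = $1; next }
    max_dist > 0 && $1 - prev <= max_dist { group = group "," $1; prev = $1; next }
    { print group; group = $1; prev = $1 }
    END { if (NR > 0) print group }
  '
}

# Example with invented positions: 100, 101 and 103 chain together, 110 stays alone.
printf '100\n101\n103\n110\n' | group_phased
```

With a distance of 0 every position is printed on its own line, mirroring the "set to 0 to disable" behavior described for the option.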

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-systematic-noise $PATH

Systematic noise file. This filter is recommended for removing systematic noise observed in normal samples.

--vc-somatic-hotspots $PATH

DRAGEN has a default set of hotspot variants (positions and alleles) where it will assign an increased prior probability. Use this option to override with a custom hotspots file.

--vc-enable-liquid-tumor-mode true

Tumor-in-normal contamination. Only use if there is some tumor leakage in the normal control.

--vc-override-tumor-pcr-params-with-normal false

Mixed sample preparation. Only use if the tumor and normal samples exhibit different PCR (indel) noise patterns, e.g., due to using different sample preparation.

--vc-sq-filter-threshold $INT

Threshold for sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.

--vc-excluded-regions-bed $BED

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

--cnv-normal-cnv-vcf $CNV_NORMAL_VCF

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-emit-ref-confidence GVCF

To enable gVCF output.

--vc-enable-vcf-output

To enable VCF file output during a gVCF run, set to true. The default value is false.

--vc-systematic-noise $PATH

Systematic noise file. This filter is recommended for removing systematic noise observed in normal samples.

--vc-somatic-hotspots $PATH

DRAGEN has a default set of hotspot variants (positions and alleles) where it will assign an increased prior probability. Use this option to override with a custom hotspots file.

--vc-enable-liquid-tumor-mode true

Tumor-in-normal contamination. Only use if there is some tumor leakage in the normal control.

--vc-override-tumor-pcr-params-with-normal false

Mixed sample preparation. Only use if the tumor and normal samples exhibit different PCR (indel) noise patterns, e.g., due to using different sample preparation.

--vc-sq-filter-threshold $INT

Threshold for sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.

--vc-excluded-regions-bed $BED

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

--cnv-population-b-allele-vcf $POP_VCF

--heme-cnv true

Configures DRAGEN to use CNV settings for HEME.

DNA Somatic Tumor-Normal Solid Panel

The DRAGEN recipe includes the recommended pipeline-specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
# Inputs 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
--Aligner.clip-pe-overhang 2            #option to ignore unwanted UMIs in reads 
--enable-duplicate-marking true         #default=true 
# Small variant caller 
--enable-variant-caller true 
--vc-target-bed $VC_TARGET_BED 
--vc-systematic-noise $PATH             #Optional 
--vc-target-vaf 0.03                    #Default is 0.03. 
# SV 
--enable-sv true 
--sv-systematic-noise $PATH             #Optional 
--sv-exome true 
--sv-call-regions-bed $SV_TARGET_BED 
# CNV 
--enable-cnv true 
--cnv-use-somatic-vc-baf true 
--cnv-target-bed $PATH 
--cnv-combined-counts $PATH             #CNV PON 
# Annotation 
--variant-annotation-data $PATH
--enable-variant-annotation true 
# TMB 
--enable-tmb true 
# HLA genotyper 
--enable-hla true 
--hla-enable-class-2 true               #if panel covers class II HLA regions 
--hla-as-filter-min-threshold 29.0      #panel specific setting 
--hla-as-filter-ratio-threshold 0.85    #panel specific setting 
# Microsatellite Instability (MSI) 
--msi-command tumor-normal 
--msi-microsatellites-file $PATH 
--msi-coverage-threshold 40 

Notes and additional options

Hashtable

For DRAGEN somatic runs it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

--tumor-fastq-list $PATH 
--tumor-fastq-list-sample-id $STRING 
--fastq-list $PATH 
--fastq-list-sample-id $STRING 

FQ Input

--tumor-fastq1 $PATH 
--tumor-fastq2 $PATH 
--RGSM-tumor $STRING 
--RGID-tumor $STRING 
--fastq-file1 $PATH 
--fastq-file2 $PATH 
--RGSM $STRING 
--RGID $STRING 

BAM Input

--tumor-bam-input $PATH 
--bam-input $PATH 

CRAM Input

--tumor-cram-input $PATH 
--cram-input $PATH 
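
As a rough illustration of how the input modes map onto tumor and normal option pairs, the following Python sketch assembles the input options for one Tumor-Normal run. The helper name tumor_normal_inputs and its arguments are hypothetical and not part of DRAGEN; only the option names are taken from the lists above. Plain FASTQ input is left out for brevity.

```python
from typing import List, Optional


def tumor_normal_inputs(kind: str, tumor: str, normal: str,
                        tumor_id: Optional[str] = None,
                        normal_id: Optional[str] = None) -> List[str]:
    """Return DRAGEN input options for one tumor-normal pair (hypothetical helper)."""
    if kind == "fastq-list":
        # The recipe pairs each FASTQ list with a sample ID.
        if not tumor_id or not normal_id:
            raise ValueError("fastq-list input requires both sample IDs")
        return [
            "--tumor-fastq-list", tumor,
            "--tumor-fastq-list-sample-id", tumor_id,
            "--fastq-list", normal,
            "--fastq-list-sample-id", normal_id,
        ]
    if kind in ("bam", "cram"):
        # BAM and CRAM follow the same tumor/normal naming pattern.
        return [f"--tumor-{kind}-input", tumor, f"--{kind}-input", normal]
    raise ValueError(f"unsupported input kind: {kind}")


if __name__ == "__main__":
    print(" ".join(tumor_normal_inputs("cram", "tumor.cram", "normal.cram")))
```

The returned list can be appended to the rest of the recipe options before launching the run.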

Mapping and Aligning

Option
Description

--enable-map-align true

In the TN pipeline this must be set to false for BAM/CRAM input.

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

Duplicate Marking

Option
Description

--enable-duplicate-marking true

By default, DRAGEN marks duplicate reads and excludes them from variant calling.

--enable-positional-collapsing true

Alternative to --enable-duplicate-marking=true. Instead of discarding duplicate reads, DRAGEN can optionally perform positional collapsing, merging them into higher-quality consensus reads. This is beneficial for small panels without UMIs and coverage between 300X and 1000X. However, it's slower than standard duplicate marking and less effective on samples with coverage lower than 300X. For very high coverage (1000X+), avoid it due to potential read collisions. For high-sensitivity panels with 1000X+ coverage, consider using UMIs.
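
The coverage guidance above can be summarized as a small decision helper. This is only a sketch: the function name and the exact handling of the 300X and 1000X boundaries are assumptions, and the choice for a real assay should be validated on representative samples.

```python
def duplicate_handling_option(mean_coverage: float, has_umis: bool) -> str:
    """Suggest a duplicate-handling option from coverage and UMI availability.

    Hypothetical helper; thresholds follow the 300X-1000X guidance in the text.
    """
    if has_umis:
        # Reads carrying UMIs are better served by UMI collapsing.
        return "--umi-enable true"
    if 300 <= mean_coverage < 1000:
        # Small panels without UMIs in the recommended coverage window.
        return "--enable-positional-collapsing true"
    # Below 300X collapsing is less effective; at 1000X+ read collisions are a risk.
    return "--enable-duplicate-marking true"


if __name__ == "__main__":
    for coverage in (150, 500, 2000):
        print(coverage, duplicate_handling_option(coverage, has_umis=False))
```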

Fractional (Raw Reads) Downsampling

DRAGEN can subsample a random, fractional percentage of reads from an input file using the fractional downsampler. You can use downsampling to subsample data sets in order to simulate different amounts of sequencing. DRAGEN randomly subsamples reads from primary analysis without any modification (e.g. no trimming, no filtering, etc.).

Downsampling may be useful to reduce runtime on very deep samples. For Tumor-Normal analyses it is also recommended to use a normal sample with lower coverage than the tumor sample. If the matched normal has deeper coverage than the tumor sample, the fractional downsampler can be used to reduce coverage on the normal sample.

Option
Description

--enable-fractional-down-sampler

Set to true to enable fractional downsampling. The default value is false.

--down-sampler-normal-subsample

Specify the fraction of reads to keep as a subsample of normal input data. The default value is 1.0 (100%).

--down-sampler-tumor-subsample

Specify the fraction of reads to keep as a subsample of tumor input data. The default value is 1.0 (100%).

--down-sampler-random-seed

Specify the random seed for different runs of the same input data. The default value is 42.
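
A minimal sketch of how the normal subsample fraction might be derived from coverage estimates. The helper name and the target_ratio parameter (normal coverage relative to tumor coverage after downsampling) are assumptions for illustration; the text only states that the normal should not be deeper than the tumor.

```python
from typing import List


def normal_downsampling_options(tumor_coverage: float, normal_coverage: float,
                                target_ratio: float = 1.0, seed: int = 42) -> List[str]:
    """Return downsampler options that cap normal coverage (hypothetical helper)."""
    if tumor_coverage <= 0 or normal_coverage <= 0:
        raise ValueError("coverage values must be positive")
    # Fraction of normal reads to keep so that normal <= target_ratio * tumor.
    fraction = min(1.0, target_ratio * tumor_coverage / normal_coverage)
    if fraction == 1.0:
        return []  # the normal is already shallow enough; keep all reads
    return [
        "--enable-fractional-down-sampler", "true",
        "--down-sampler-normal-subsample", f"{fraction:.3f}",
        "--down-sampler-random-seed", str(seed),
    ]


if __name__ == "__main__":
    print(" ".join(normal_downsampling_options(100, 400)))
```

Keeping the seed fixed makes repeated runs on the same input select the same reads.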

SNV

Option
Description

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-systematic-noise $PATH

Systematic noise file. This filter is recommended for removing systematic noise observed in normal samples.

--vc-somatic-hotspots $PATH

DRAGEN has a default set of hotspot variants (positions and alleles) where it will assign an increased prior probability. Use this option to override with a custom hotspots file.

--vc-enable-liquid-tumor-mode true

Tumor-in-normal contamination. Only use if there is some tumor leakage in the normal control.

--vc-override-tumor-pcr-params-with-normal false

Mixed sample preparation. Only use if the tumor and normal samples exhibit different PCR (indel) noise patterns, e.g., due to using different sample preparation.

--vc-sq-filter-threshold $INT

Threshold for sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.

--vc-excluded-regions-bed $BED

High-coverage sequencing panels allow for the detection of low-frequency alleles. DRAGEN supports 3 main settings for improved sensitivity on low VAF variant calls.

High Sensitivity Option
Description

--vc-target-vaf FLOAT

The default is 0.03 (3%). Set to e.g. 0.01 to improve SNV sensitivity on 1% VAF variants (assuming sufficient coverage).

--vc-enable-umi-solid true

Optimized for 1% and higher VAFs on UMI (or read position collapsed) samples with approx 300-1000X coverage.

--vc-enable-umi-liquid true

Optimized for 0.1% and higher VAFs on UMI samples with 1000X or higher coverage as expected in liquid biopsies.
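
The three high-sensitivity settings can be read as a simple selection rule. The sketch below is an interpretation of the table, not DRAGEN logic: the function name, its parameters, and the way the coverage ranges are applied are assumptions.

```python
from typing import List


def high_sensitivity_options(min_vaf: float, coverage: float, has_umis: bool,
                             position_collapsed: bool = False) -> List[str]:
    """Pick one low-VAF sensitivity setting (hypothetical helper)."""
    if has_umis and coverage >= 1000 and min_vaf < 0.01:
        # Liquid-biopsy style data: UMIs, very deep coverage, sub-1% VAF.
        return ["--vc-enable-umi-liquid", "true"]
    if (has_umis or position_collapsed) and coverage >= 300 and min_vaf >= 0.01:
        # Collapsed reads with moderate to deep coverage, 1% VAF and higher.
        return ["--vc-enable-umi-solid", "true"]
    # Otherwise adjust the target VAF directly (the default is 0.03).
    return ["--vc-target-vaf", f"{min_vaf:g}"]


if __name__ == "__main__":
    print(high_sensitivity_options(0.001, 2000, has_umis=True))
```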

HLA

Option
Description

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

CNV

Option
Description

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

--cnv-normal-cnv-vcf $CNV_NORMAL_VCF

Annotation

TMB

Option
Description

--tmb-vaf-threshold FLOAT

Variant minimum allele frequency for usable variants (default=0.05)

--vc-callability-tumor-thresh INT

Required read coverage to use a site (default=50).

--tmb-enable-proxi-filter BOOL

Use variant VAF information to increase germline filtering. Recommended for Tumor-Only (TO), but not for Tumor-Normal (TN). May be overly aggressive at tagging variants as germline (default=false).
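
To make the two numeric thresholds concrete, the sketch below applies them to a toy variant list and reports mutations per megabase. It uses the general definition of TMB (usable variants divided by callable megabases) and is not a description of DRAGEN's internal computation; the function name and the input layout (VAF, depth pairs) are assumptions.

```python
from typing import Iterable, Tuple


def tmb_per_mb(variants: Iterable[Tuple[float, int]], callable_bases: int,
               vaf_threshold: float = 0.05, min_depth: int = 50) -> float:
    """Count usable variants and scale to mutations per megabase (illustrative)."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    usable = sum(
        1 for vaf, depth in variants
        if vaf >= vaf_threshold and depth >= min_depth
    )
    return usable / (callable_bases / 1_000_000)


if __name__ == "__main__":
    # (VAF, read depth) pairs; only the first passes both thresholds.
    calls = [(0.10, 80), (0.03, 200), (0.20, 30)]
    print(tmb_per_mb(calls, callable_bases=2_000_000))
```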

MSI

For panels, it is recommended to post-process the microsatellite sites file by intersecting the WES or WGS sites with the panel manifest. This avoids using off-target reads in the MSI analysis. For small panels, it may be necessary to generate a custom sites file to ensure the panel covers at least 2000 sites. To generate custom MSI site files, refer to the MSI Biomarker section in the user guide.

Option
Description and recommended setting

--msi-coverage-threshold INT

Minimum coverage for a microsatellite: 60 (default)

--msi-distance-threshold FLOAT

Minimum Jensen-Shannon distance between tumor and normal for a microsatellite: 0.1 (default)
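
For readers unfamiliar with the Jensen-Shannon distance, the following sketch computes it for two histograms, for example read counts per repeat length at one microsatellite in the tumor and in the normal. It uses the standard definition (square root of the Jensen-Shannon divergence with base-2 logarithms); how DRAGEN builds the per-site distributions is not covered here, and the function name is hypothetical.

```python
import math
from typing import Sequence


def jensen_shannon_distance(tumor_counts: Sequence[float],
                            normal_counts: Sequence[float]) -> float:
    """Jensen-Shannon distance (base-2 logs, range 0..1) between two histograms."""
    if len(tumor_counts) != len(normal_counts):
        raise ValueError("histograms must cover the same repeat lengths")
    t_total, n_total = sum(tumor_counts), sum(normal_counts)
    if t_total <= 0 or n_total <= 0:
        raise ValueError("each histogram needs at least one read")
    p = [c / t_total for c in tumor_counts]
    q = [c / n_total for c in normal_counts]
    m = [(a + b) / 2 for a, b in zip(p, q)]  # midpoint distribution

    def kl(a, b):
        # Kullback-Leibler divergence; zero-probability bins contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(divergence, 0.0))


if __name__ == "__main__":
    print(jensen_shannon_distance([8, 2, 0], [5, 4, 1]))
```

Identical distributions give a distance of 0 and non-overlapping distributions give 1, which puts the default threshold of 0.1 in context.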

SV

Option
Description

--sv-call-regions-bed

Specifies a BED file containing the set of regions to call. Optionally gzip or bgzip format.

--sv-exclusion-bed

Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.

--enable-variant-deduplication true

Relevant when both SV and SNV callers are enabled in somatic workflows. Can increase sensitivity and prevent the occurrence of replicated variants within genes such as FLT3 and KMT2A. Filter all small indels in the structural variant VCF that appear and are passing in the small variant VCF. DRAGEN will create a new VCF that contains variants in SV VCF that are not matching a variant from SNV VCF file. The new deduplicated SV VCF file will have the same prefix passed by --output-file-prefix followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases.

--sv-systematic-noise $BEDPE

Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bgzip compressed). Optional for Tumor-Normal, but strongly recommended for Tumor-Only.

--sv-somatic-ins-tandup-hotspot-regions-bed $BED

Specify a custom BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. The default file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with some padding on both sides (300 bps)

--sv-min-candidate-variant-size

Run SV caller and report all SVs/indels at or above this size. The default value is set to 10.

--sv-min-scored-variant-size

After candidate identification, only score and report SVs/indels at or above this size. The default value is set to 50. This parameter doesn't affect the somatic hotspot region.

Option
Recommended Value for Liquid Tumors (e.g. AML/MLL)

--sv-enable-liquid-tumor-mode true

DRAGEN can account for Tumor-in-Normal (TiN) contamination by running liquid tumor mode.

--sv-tin-contam-tolerance $TIN_CONTAM_TOLERANCE

Set the Tumor-in-Normal (TiN) contamination tolerance level.

Resource Files

DRAGEN requires resource files for components such as SNV, SV, and CNV. The following notes provide references for downloading these files or generating them for custom workflows or assays.

SNV Systematic Noise

Systematic noise files are considered essential in Tumor-Only workflows. They are also recommended for Tumor-Normal workflows.

Prebuilt

Prebuilt WES/WGS noise files
Description

WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FF

FFPE_WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FFPE (only hg38)

WES_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WES FF and FFPE

Custom

Prebuilt systematic noise files are available for WES or WGS applications. For these applications, it is considered optional to build custom noise files. For high-sensitivity applications, including panels, it is required to build custom noise files. For best accuracy, the normal samples should ideally closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30–70 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on each of approximately 30-70 normal samples.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--vc-detect-systematic-noise=true 
--vc-target-bed-padding 500 
--vc-emit-ref-confidence BP_RESOLUTION 
--vc-enable-vcf-output true 
--vc-enable-germline-tagging=true 
--variant-annotation-data $PATH

For panels we create GVCF files. Gather the full paths to the small variant caller hard filtered GVCFs (not VCFs) from step 1 and create an input file ${GVCF_LIST} by specifying 1 file per line.
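
One way to assemble ${GVCF_LIST} is sketched below. The helper name is hypothetical, and the default file pattern *.hard-filtered.gvcf.gz is an assumption about how the step 1 outputs are named; adjust it to match the files actually produced.

```python
from pathlib import Path
from typing import List


def write_gvcf_list(runs_root, list_path,
                    pattern: str = "*.hard-filtered.gvcf.gz") -> List[str]:
    """Write one absolute GVCF path per line and return the paths (hypothetical helper)."""
    # Search every per-sample output directory below runs_root.
    paths = sorted(str(p.resolve()) for p in Path(runs_root).rglob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern} under {runs_root}")
    Path(list_path).write_text("\n".join(paths) + "\n")
    return paths


if __name__ == "__main__":
    import sys
    # Usage: python make_gvcf_list.py <step1_output_root> <gvcf_list.txt>
    for path in write_gvcf_list(sys.argv[1], sys.argv[2]):
        print(path)
```

Because the pattern requires the gvcf suffix, hard-filtered VCFs in the same directories are not picked up.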

Step 2. Generate the final noise file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--build-sys-noise-vcfs-list ${GVCF_LIST} 

SV Systematic Noise

Systematic noise files are recommended for Tumor-Normal workflows and are considered essential for reducing FP calls in Tumor-Only workflows.

Prebuilt

Prebuilt WES/WGS noise files
Description

WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For WGS/WES FF/FFPE

IDPF_WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For HEME

Custom

It is considered optional to build a custom systematic noise file for WES or WGS applications, but for high sensitivity applications like panels it is strongly recommended. For best accuracy the normal samples should ideally closely match the sequencer, sample type, library prep and coverage of the tumor samples of interest. It is typically recommended to use 30 - 100 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--sv-detect-systematic-noise true 

Step 2. Build the BEDPE file using input VCFs from previous step.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--sv-build-systematic-noise-vcfs-list $VCF_LIST   #one VCF per line

CNV Panel of Normals (PON)

The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. CNV requires PON files for all targeted analyses (including panels, exomes, germline, tumor-only and tumor-normal workflows). It is recommended to use 30-100 normal samples when building the PON, but fewer may be used. If sample coverage noise is relatively stable, as few as 5 PON samples may yield acceptable results.

If a matched normal is available it is recommended to include it in the PON.

Follow the two steps below to generate a CNV PON:

Step 1. Generate target counts of individual normal samples.

Any options used for panel of normals generation (BED file, GC Bias Correction, etc) should be matched when processing the case sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--enable-cnv true 
--cnv-target-bed $PATH 

Step 2. Combined counts generation.

The individual target counts files can be merged into a single <prefix>.combined.counts.txt.gz file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--enable-cnv true 
--cnv-generate-combined-counts true 
--cnv-normals-list $CNV_NORMALS_LIST 

$CNV_NORMALS_LIST is a text file with one path per line, listing each target counts file generated in step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz). The output is a PON file with the suffix .combined.counts.txt.gz. Use the PON file in case sample runs of DRAGEN CNV with the --cnv-combined-counts option.
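
The normals list can be generated with a short script such as the sketch below. The helper name, the gc_corrected switch, and the warning threshold are illustrative assumptions; the two file suffixes are the ones named above. Selecting a single suffix keeps the list consistent with the GC bias setting used in step 1.

```python
import warnings
from pathlib import Path
from typing import List


def write_cnv_normals_list(counts_root, list_path, gc_corrected: bool,
                           min_samples: int = 5) -> List[str]:
    """Write one target counts path per line for PON building (hypothetical helper)."""
    suffix = ".target.counts.gc-corrected.gz" if gc_corrected else ".target.counts.gz"
    paths = sorted(str(p.resolve()) for p in Path(counts_root).rglob("*" + suffix))
    if not paths:
        raise FileNotFoundError(f"no *{suffix} files under {counts_root}")
    if len(paths) < min_samples:
        # The guide recommends 30-100 normals; very small PONs may be noisy.
        warnings.warn(f"only {len(paths)} normal samples found")
    Path(list_path).write_text("\n".join(paths) + "\n")
    return paths


if __name__ == "__main__":
    import sys
    # Usage: python make_normals_list.py <step1_output_root> <normals_list.txt>
    write_cnv_normals_list(sys.argv[1], sys.argv[2], gc_corrected=True)
```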

DNA Somatic Tumor-Only Solid Panel UMI

The DRAGEN recipe includes the recommended pipeline-specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
# Inputs 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
# UMI 
--umi-enable true 
--umi-source STRING                     #Default='qname' 
--umi-library-type STRING               #e.g. random-duplex 
--umi-min-supporting-reads 2            #Default=2 
# Small variant caller 
--enable-variant-caller true 
--vc-target-bed $VC_TARGET_BED 
--vc-systematic-noise $PATH             #Required 
--vc-enable-umi-solid true              #>= 1% VAF 
# SV 
--enable-sv true 
--sv-systematic-noise $PATH             #Recommended 
--sv-exome true 
--sv-call-regions-bed $SV_TARGET_BED 
# CNV 
--enable-cnv true 
--cnv-target-bed $PATH 
--cnv-combined-counts $PATH             #CNV PON 
# Annotation 
--variant-annotation-data $PATH
--vc-enable-germline-tagging true 
# TMB 
--enable-tmb true 
--tmb-enable-proxi-filter true          #Optional for Tumor-Only 
# HLA genotyper 
--enable-hla true 
--hla-enable-class-2 true               #if panel covers class II HLA regions 
--hla-as-filter-min-threshold 29.0      #panel specific setting 
--hla-as-filter-ratio-threshold 0.85    #panel specific setting 
# Microsatellite Instability (MSI) 
--msi-command tumor-only 
--msi-ref-normal-dir $PATH 
--msi-microsatellites-file $PATH 
--msi-coverage-threshold 40 

Notes and additional options

Hashtable

For DRAGEN somatic runs it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

--tumor-fastq-list $PATH 
--tumor-fastq-list-sample-id $STRING 

FQ Input

--tumor-fastq1 $PATH 
--tumor-fastq2 $PATH 
--RGSM-tumor $STRING 
--RGID-tumor $STRING 

BAM Input

--tumor-bam-input $PATH 

CRAM Input

--tumor-cram-input $PATH 

Mapping and Aligning

Option
Description

--enable-map-align true

Optionally disable map & align (default=true).

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

Fractional (Raw Reads) Downsampling

DRAGEN can subsample a random, fractional percentage of reads from an input file using the fractional downsampler. You can use downsampling to subsample data sets in order to simulate different amounts of sequencing. DRAGEN randomly subsamples reads from primary analysis without any modification (e.g. no trimming, no filtering, etc.).

Downsampling may be useful to reduce runtime on very deep samples. For Tumor-Normal analyses it is also recommended to use a normal sample with lower coverage than the tumor sample. If the matched normal has deeper coverage than the tumor sample, the fractional downsampler can be used to reduce coverage on the normal sample.

Option
Description

--enable-fractional-down-sampler

Set to true to enable fractional downsampling. The default value is false.

--down-sampler-normal-subsample

Specify the fraction of reads to keep as a subsample of normal input data. The default value is 1.0 (100%).

--down-sampler-tumor-subsample

Specify the fraction of reads to keep as a subsample of tumor input data. The default value is 1.0 (100%).

--down-sampler-random-seed

Specify the random seed for different runs of the same input data. The default value is 42.

UMI

Option
Description

--umi-source STRING

Specify the input type for the UMI sequence. Options: qname, fastq, bamtag.

--umi-library-type STRING

Set the batch option for the type of UMI correction. Options: random-duplex, random-simplex, nonrandom-duplex.

--umi-nonrandom-whitelist $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The whitelist includes a valid UMI sequence per line.

--umi-correction-table $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The correction table defaults to the table used by TruSight Oncology: <INSTALL_PATH>/resources/umi/umi_correction_table.txt.gz.

--umi-min-supporting-reads INT

Specify the number of matching UMI input reads required to generate a consensus read. Any family with insufficient supporting reads is discarded. The default is 2.

--umi-metrics-interval-file $BED

Target region in BED format.

--umi-emit-multiplicity both

--umi-start-mask-length INT

Number of additional bases to ignore from start of read. The default is 0. To reduce FP optionally set to 1.

--umi-end-mask-length INT

Number of additional bases to ignore from end of read. The default is 0. To reduce FP optionally set to 3.
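
Before launching a batch of runs, the UMI settings can be sanity-checked with a small script. The sketch below is hypothetical and only encodes the constraints listed in the table: the allowed values for --umi-source and --umi-library-type, and the need for a whitelist or correction table with nonrandom UMIs. DRAGEN may fall back to its default correction table, so the last check is deliberately conservative.

```python
from typing import Dict, List

UMI_SOURCES = {"qname", "fastq", "bamtag"}
UMI_LIBRARY_TYPES = {"random-duplex", "random-simplex", "nonrandom-duplex"}


def check_umi_options(options: Dict[str, str]) -> List[str]:
    """Return a list of problems found in a dict of UMI options (hypothetical helper)."""
    problems = []
    source = options.get("--umi-source", "qname")  # the recipe lists qname as default
    if source not in UMI_SOURCES:
        problems.append(f"unknown --umi-source: {source}")
    library = options.get("--umi-library-type")
    if library is not None and library not in UMI_LIBRARY_TYPES:
        problems.append(f"unknown --umi-library-type: {library}")
    if library is not None and library.startswith("nonrandom"):
        has_table = ("--umi-nonrandom-whitelist" in options
                     or "--umi-correction-table" in options)
        if not has_table:
            problems.append("nonrandom UMIs need a whitelist or correction table")
    return problems


if __name__ == "__main__":
    print(check_umi_options({"--umi-library-type": "nonrandom-duplex"}))
```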

SNV

Option
Description

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-emit-ref-confidence GVCF

To enable gVCF output.

--vc-enable-vcf-output

To enable VCF file output during a gVCF run, set to true. The default value is false.

--vc-systematic-noise $PATH

Systematic noise file. This filter is recommended for removing systematic noise observed in normal samples.

--vc-somatic-hotspots $PATH

DRAGEN has a default set of hotspot variants (positions and alleles) where it will assign an increased prior probability. Use this option to override with a custom hotspots file.

--vc-enable-liquid-tumor-mode true

Tumor-in-normal contamination. Only use if there is some tumor leakage in the normal control.

--vc-override-tumor-pcr-params-with-normal false

Mixed sample preparation. Only use if the tumor and normal samples exhibit different PCR (indel) noise patterns, e.g., due to using different sample preparation.

--vc-sq-filter-threshold $INT

Threshold for sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.

--vc-excluded-regions-bed $BED

High-coverage sequencing panels allow for the detection of low-frequency alleles. DRAGEN supports 3 main settings for improved sensitivity on low VAF variant calls.

High Sensitivity Option
Description

--vc-target-vaf FLOAT

The default is 0.03 (3%). Set to e.g. 0.01 to improve SNV sensitivity on 1% VAF variants (assuming sufficient coverage).

--vc-enable-umi-solid true

Optimized for 1% and higher VAFs on UMI (or read position collapsed) samples with approx 300-1000X coverage.

--vc-enable-umi-liquid true

Optimized for 0.1% and higher VAFs on UMI samples with 1000X or higher coverage as expected in liquid biopsies.

HLA

Option
Description

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

CNV

Option
Description

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

--cnv-population-b-allele-vcf $POP_VCF

--heme-cnv true

Configures DRAGEN to use CNV settings for HEME.

Annotation

TMB

The Tumor-Normal pipeline is more effective than the Tumor-Only pipeline at removing or tagging germline variants, so the Tumor-Only pipeline may report somewhat elevated TMB values. The TMB proxi filter is an optional setting applied on top of the regular database germline filter. It aggressively filters additional germline variants based on allele frequencies.

Option
Description

--tmb-vaf-threshold FLOAT

Variant minimum allele frequency for usable variants (default=0.05)

--vc-callability-tumor-thresh INT

Required read coverage to use a site (default=50).

--tmb-enable-proxi-filter BOOL

Use variant VAF information to increase germline filtering. Recommended for Tumor-Only (TO), but not for Tumor-Normal (TN). May be overly aggressive at tagging variants as germline (default=false).

MSI

For panels, it is recommended to post-process the microsatellite sites file by intersecting the WES or WGS sites with the panel manifest. This avoids using off-target reads in the MSI analysis. For small panels, it may be necessary to generate a custom sites file to ensure the panel covers at least 2000 sites. To generate custom MSI site files, refer to the MSI Biomarker section in the user guide.

Option
Description and recommended setting

--msi-coverage-threshold INT

Minimum coverage for a microsatellite: 60 (default)

--msi-distance-threshold FLOAT

Minimum Jensen-Shannon distance between tumor and normal for a microsatellite: 0.1 (default)
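
The manifest intersection recommended for panels can be sketched as follows. The code works on plain tuples rather than on the actual sites file, whose format is not described here: sites are assumed to be (chromosome, 0-based position) pairs and manifest targets are assumed to be BED-style half-open intervals. All names are hypothetical.

```python
import bisect
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MIN_SITES = 2000  # minimum site count mentioned in the guide


def on_target_sites(sites: Iterable[Tuple[str, int]],
                    targets: Iterable[Tuple[str, int, int]]) -> List[Tuple[str, int]]:
    """Keep only the sites that fall inside a manifest interval."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in targets:
        by_chrom[chrom].append((start, end))
    merged: Dict[str, List[Tuple[int, int]]] = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out: List[Tuple[int, int]] = []
        for start, end in intervals:
            if out and start <= out[-1][1]:
                # Overlapping or adjacent targets are merged into one interval.
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    kept = []
    for chrom, pos in sites:
        intervals = merged.get(chrom, [])
        # Index of the last interval whose start is <= pos.
        i = bisect.bisect_right(intervals, (pos, float("inf"))) - 1
        if i >= 0 and intervals[i][0] <= pos < intervals[i][1]:
            kept.append((chrom, pos))
    return kept


if __name__ == "__main__":
    kept = on_target_sites([("chr1", 120), ("chr2", 5)], [("chr1", 100, 200)])
    print(len(kept), "sites kept; enough for MSI:", len(kept) >= MIN_SITES)
```

If the number of retained sites falls below the minimum, a custom sites file is likely needed.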

SV

Option
Description

--sv-call-regions-bed

Specifies a BED file containing the set of regions to call. Optionally gzip or bgzip format.

--sv-exclusion-bed

Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.

--enable-variant-deduplication true

Relevant when both SV and SNV callers are enabled in somatic workflows. Can increase sensitivity and prevent the occurrence of replicated variants within genes such as FLT3 and KMT2A. Filter all small indels in the structural variant VCF that appear and are passing in the small variant VCF. DRAGEN will create a new VCF that contains variants in SV VCF that are not matching a variant from SNV VCF file. The new deduplicated SV VCF file will have the same prefix passed by --output-file-prefix followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases.

--sv-systematic-noise $BEDPE

Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bgzip compressed). Optional for Tumor-Normal, but strongly recommended for Tumor-Only.

--sv-somatic-ins-tandup-hotspot-regions-bed $BED

Specify a custom BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. The default file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with some padding on both sides (300 bps)

--sv-min-candidate-variant-size

Run SV caller and report all SVs/indels at or above this size. The default value is set to 10.

--sv-min-scored-variant-size

After candidate identification, only score and report SVs/indels at or above this size. The default value is set to 50. This parameter doesn't affect the somatic hotspot region.

Option
Recommended Value for Liquid Tumors (e.g. AML/MLL)

--heme-sv true

Configures DRAGEN to use SV settings for Liquid Tumors (e.g., AML/MLL).

--sv-min-scored-variant-size $INT

100000

Resource Files

DRAGEN requires resource files for components such as SNV, SV, and CNV. The following notes provide references for downloading these files or generating them for custom workflows or assays.

SNV Systematic Noise

Systematic noise files are considered essential in Tumor-Only workflows. They are also recommended for Tumor-Normal workflows.

Prebuilt

Prebuilt WES/WGS noise files
Description

WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FF

FFPE_WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FFPE (only hg38)

WES_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WES FF and FFPE

Custom

Prebuilt systematic noise files are available for WES or WGS applications. For these applications, it is considered optional to build custom noise files. For high-sensitivity applications, including panels, it is required to build custom noise files. For best accuracy, the normal samples should ideally closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30–70 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on each of approximately 30-70 normal samples.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--umi-enable true 
--umi-source STRING                     #default='qname' 
--umi-library-type STRING               #see 'UMI' 
--vc-detect-systematic-noise=true 
--vc-target-bed-padding 500 
--vc-emit-ref-confidence BP_RESOLUTION 
--vc-enable-vcf-output true 
--vc-enable-umi-solid true 
--vc-enable-germline-tagging=true 
--variant-annotation-data $PATH

For panels we create GVCF files. Gather the full paths to the small variant caller hard filtered GVCFs (not VCFs) from step 1 and create an input file ${GVCF_LIST} by specifying 1 file per line.

Step 2. Generate the final noise file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--build-sys-noise-vcfs-list ${GVCF_LIST} 

SV Systematic Noise

Systematic noise files are recommended for Tumor-Normal workflows and are considered essential for reducing false positive (FP) calls in Tumor-Only workflows.

Prebuilt

Prebuilt WES/WGS noise files
Description

WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For WGS/WES FF/FFPE

IDPF_WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For HEME

Custom

It is considered optional to build a custom systematic noise file for WES or WGS applications, but for high-sensitivity applications such as panels it is strongly recommended. For best accuracy, the normal samples should closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30–100 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM
--tumor-fastq-list-sample-id $STRING
--umi-enable true
--umi-source $STRING                    #default='qname'
--umi-library-type $STRING              #see 'UMI'
--sv-detect-systematic-noise true 

Step 2. Build the BEDPE file using input VCFs from previous step.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX
--sv-build-systematic-noise-vcfs-list $VCF_LIST   #one VCF per line

CNV Panel of Normals (PON)

The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. CNV requires PON files for all targeted analyses (including panels, exomes, germline, tumor-only and tumor-normal workflows). It is recommended to use 30-100 normal samples when building the PON, but fewer may be used. If sample coverage noise is relatively stable, as few as 5 PON samples may yield acceptable results.

If a matched normal is available it is recommended to include it in the PON.

Follow the two steps below to generate CNV PON:

Step 1. Generate target counts of individual normal samples.

Any options used for panel of normals generation (BED file, GC Bias Correction, etc) should be matched when processing the case sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--enable-cnv true 
--cnv-target-bed $PATH 

Step 2. Combined counts generation.

Individual PON counts can be merged into a single file as a <prefix>.combined.counts.txt.gz file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--enable-cnv true 
--cnv-generate-combined-counts true 
--cnv-normals-list $CNV_NORMALS_LIST 

$CNV_NORMALS_LIST is a text file that lists, one per line, the path to each target counts file generated in Step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz). The output is a PON file with the suffix .combined.counts.txt.gz. Pass this PON file to case sample runs of DRAGEN CNV with the --cnv-combined-counts option.
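
The normals list can be generated rather than typed by hand. The sketch below is an illustration under stated assumptions: the function name make_cnv_normals_list and its gc/raw argument are hypothetical, and it assumes the Step 1 target counts files are all reachable under one directory. It lists only one of the two suffixes so that GC-corrected and uncorrected counts are not mixed in one PON, in line with the note that options should be matched across runs.

```shell
#!/usr/bin/env bash
# Sketch: build $CNV_NORMALS_LIST from the Step 1 target counts files.
# Assumption (not from this guide): all counts files live under COUNTS_DIR.

# make_cnv_normals_list COUNTS_DIR LIST_FILE [gc|raw]
make_cnv_normals_list() {
  local counts_dir list_file suffix
  counts_dir=$(cd "$1" && pwd) || return 1   # resolve to an absolute path
  list_file=$2
  case "${3:-raw}" in
    gc)  suffix='.target.counts.gc-corrected.gz' ;;
    raw) suffix='.target.counts.gz' ;;
    *)   echo "third argument must be 'gc' or 'raw'" >&2; return 2 ;;
  esac
  # One absolute path per line; only files with the chosen suffix are listed.
  find "$counts_dir" -type f -name "*$suffix" | sort > "$list_file"
  # Fail on an empty list so it is caught before the combined counts run.
  [ -s "$list_file" ]
}
```

For example, make_cnv_normals_list /path/to/step1_outputs "$CNV_NORMALS_LIST" gc would list only the GC-corrected counts; the directory is a placeholder.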

DNA Somatic Tumor-Only Solid WES UMI

The DRAGEN recipe includes the recommended pipeline-specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
# Inputs 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
# UMI 
--umi-enable true 
--umi-source $STRING                    #Default='qname'
--umi-library-type $STRING              #e.g. random-duplex
--umi-min-supporting-reads 1            #Default=2 
# Small variant caller 
--enable-variant-caller true 
--vc-target-bed $VC_TARGET_BED 
--vc-systematic-noise $PATH             #Required 
--vc-enable-umi-solid true              #>= 1% VAF 
# SV 
--enable-sv true 
--sv-systematic-noise $PATH             #Recommended 
--sv-exome true 
--sv-call-regions-bed $SV_TARGET_BED 
# CNV 
--enable-cnv true 
--cnv-target-bed $PATH 
--cnv-combined-counts $PATH             #CNV PON 
# HRD Scoring 
--enable-hrd true                       #requires SNV 
# Annotation 
--variant-annotation-data $PATH
--vc-enable-germline-tagging true 
# TMB 
--enable-tmb true 
--tmb-enable-proxi-filter true          #Optional for Tumor-Only 
# HLA genotyper 
--enable-hla true 
# Microsatellite Instability (MSI) 
--msi-command tumor-only 
--msi-ref-normal-dir $PATH 
--msi-microsatellites-file $PATH 
--msi-coverage-threshold 40 
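
The recipe above lists one option per line with trailing comments, which a shell will not accept as a single command when pasted as is. One way to keep that layout in a script is a bash array, sketched below. This is an illustration only: the function names and the DRAGEN_BIN variable are hypothetical, only a subset of the recipe options is shown, and the command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Sketch (bash only): keep recipe options in an array so comment lines can sit
# between them. Nothing is run here; the assembled command is only printed.

# build_tumor_only_cmd REF_DIR OUTPUT_DIR PREFIX FASTQ_LIST SAMPLE_ID
# Fills the global array DRAGEN_CMD with a subset of the recipe options.
build_tumor_only_cmd() {
  DRAGEN_CMD=(
    "${DRAGEN_BIN:-dragen}"          # hypothetical variable for the binary path
    --ref-dir "$1"
    --output-directory "$2"
    --output-file-prefix "$3"
    # Inputs
    --tumor-fastq-list "$4"
    --tumor-fastq-list-sample-id "$5"
    # UMI
    --umi-enable true
    # Small variant caller
    --enable-variant-caller true
    --vc-enable-umi-solid true
  )
}

# print_cmd: show the assembled command on one line, shell-quoted.
print_cmd() {
  printf '%q ' "${DRAGEN_CMD[@]}"
  printf '\n'
}
```

After reviewing the printed command, it could be run with "${DRAGEN_CMD[@]}". Quoting each value keeps paths that contain spaces intact as a single argument.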

Notes and additional options

Hashtable

For DRAGEN somatic runs it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: FASTQ list, FASTQ, BAM, or CRAM.

FQ list Input

--tumor-fastq-list $PATH 
--tumor-fastq-list-sample-id $STRING 

FQ Input

--tumor-fastq1 $PATH 
--tumor-fastq2 $PATH 
--RGSM-tumor $STRING 
--RGID-tumor $STRING 

BAM Input

--tumor-bam-input $PATH 

CRAM Input

--tumor-cram-input $PATH 

Mapping and Aligning

Option
Description

--enable-map-align true

Enables map & align (default=true). Set to false to skip map & align, for example with already aligned BAM/CRAM input.

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

Fractional (Raw Reads) Downsampling

DRAGEN can subsample a random, fractional percentage of reads from an input file using the fractional downsampler. You can use downsampling to subsample data sets in order to simulate different amounts of sequencing. DRAGEN randomly subsamples reads from primary analysis without any modification (e.g. no trimming, no filtering, etc.).

Downsampling may be useful to reduce runtime on very deep samples. For Tumor-Normal analyses it is also recommended to use a normal sample with lower coverage than the tumor sample. If the matched normal has deeper coverage than the tumor sample, the fractional downsampler can be used to reduce coverage on the normal sample.

Option
Description

--enable-fractional-down-sampler

Set to true to enable fractional downsampling. The default value is false.

--down-sampler-normal-subsample

Specify the fraction of reads to keep as a subsample of normal input data. The default value is 1.0 (100%).

--down-sampler-tumor-subsample

Specify the fraction of reads to keep as a subsample of tumor input data. The default value is 1.0 (100%).

--down-sampler-random-seed

Specify the random seed for different runs of the same input data. The default value is 42.
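
As a worked example of choosing a subsample fraction, the sketch below computes the fraction of normal reads to keep so that the normal coverage is brought down to at most the tumor coverage. The function name normal_subsample_fraction is hypothetical, the coverage values are assumed to be known from earlier runs, and treating tumor coverage as the target is an assumption of this sketch. A lower target may suit your data better.

```shell
#!/usr/bin/env bash
# Sketch: derive a value for --down-sampler-normal-subsample.
# Assumption (not from this guide): mean coverages are already known.

# normal_subsample_fraction TUMOR_COV NORMAL_COV
# Prints the fraction of normal reads to keep, rounded to two decimals.
# Prints 1.00 when the normal is not deeper than the tumor.
normal_subsample_fraction() {
  awk -v t="$1" -v n="$2" 'BEGIN {
    if (t <= 0 || n <= 0) exit 1          # reject non-positive coverage
    f = (n > t) ? t / n : 1.0             # never keep more than 100%
    printf "%.2f\n", f
  }'
}
```

With a 100X tumor and a 400X normal, the function prints 0.25, which would then be passed as --enable-fractional-down-sampler true --down-sampler-normal-subsample 0.25.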

UMI

Option
Description

--umi-source STRING

Specify the input type for the UMI sequence. Options: qname, fastq, bamtag.

--umi-library-type STRING

Set the batch mode for UMI correction. Options: random-duplex, random-simplex, nonrandom-duplex.

--umi-nonrandom-whitelist $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The whitelist includes a valid UMI sequence per line.

--umi-correction-table $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The correction table defaults to the table used by TruSight Oncology: <INSTALL_PATH>/resources/umi/umi_correction_table.txt.gz.

--umi-min-supporting-reads INT

Specify the number of matching UMI input reads required to generate a consensus read. Any family with insufficient supporting reads is discarded. The default is 2.

--umi-metrics-interval-file $BED

Target region in BED format.

--umi-emit-multiplicity both

--umi-start-mask-length INT

Number of additional bases to ignore from start of read. The default is 0. To reduce FP optionally set to 1.

--umi-end-mask-length INT

Number of additional bases to ignore from end of read. The default is 0. To reduce FP optionally set to 3.

SNV

Option
Description

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-emit-ref-confidence GVCF

To enable gVCF output.

--vc-enable-vcf-output

To enable VCF file output during a gVCF run, set to true. The default value is false.

--vc-systematic-noise $PATH

Systematic noise file. This filter is recommended for removing systematic noise observed in normal samples.

--vc-somatic-hotspots $PATH

DRAGEN has a default set of hotspot variants (positions and alleles) where it will assign an increased prior probability. Use this option to override with a custom hotspots file.

--vc-enable-liquid-tumor-mode true

Tumor-in-normal contamination. Only use if there is some tumor leakage in the normal control.

--vc-override-tumor-pcr-params-with-normal false

Mixed sample preparation. Only use if the tumor and normal samples exhibit different PCR (indel) noise patterns, e.g., due to using different sample preparation.

--vc-sq-filter-threshold $INT

Threshold for sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.

--vc-excluded-regions-bed $BED

High-coverage sequencing panels allow for the detection of low-frequency alleles. DRAGEN supports 3 main settings for improved sensitivity on low VAF variant calls.

High Sensitivity Option
Description

--vc-target-vaf FLOAT

The default is 0.03 (3%). Set to e.g. 0.01 to improve SNV sensitivity on 1% VAF variants (assuming sufficient coverage).

--vc-enable-umi-solid true

Optimized for 1% and higher VAFs on UMI (or read position collapsed) samples with approx 300-1000X coverage.

--vc-enable-umi-liquid true

Optimized for 0.1% and higher VAFs on UMI samples with 1000X or higher coverage as expected in liquid biopsies.

HLA

Option
Description

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

CNV

Option
Description

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

--cnv-population-b-allele-vcf $POP_VCF

--heme-cnv true

Configures DRAGEN to use CNV settings for HEME.

Annotation

TMB

The Tumor-Normal pipeline is more effective than the Tumor-Only pipeline at removing or tagging germline variants. The Tumor-Only pipeline may consequently report somewhat elevated TMB values. The TMB proxi filter is an optional setting applied on top of the regular database germline filter. It aggressively filters additional germline variants based on allele frequencies.

Option
Description

--tmb-vaf-threshold FLOAT

Variant minimum allele frequency for usable variants (default=0.05).

--vc-callability-tumor-thresh INT

Required read coverage to use a site (default=50).

--tmb-enable-proxi-filter BOOL

Use variant VAF information to increase germline filtering. Recommended for Tumor-Only (TO), but not for Tumor-Normal (TN). May be overly aggressive at tagging variants as germline (default=false).

MSI

Option
Description and recommended setting

--msi-coverage-threshold INT

Minimum coverage for a microsatellite: 60 (default)

--msi-distance-threshold FLOAT

Minimum Jensen-Shannon distance between tumor and normal for a microsatellite: 0.1 (default)

SV

Option
Description

--sv-call-regions-bed

Specifies a BED file containing the set of regions to call. Optionally gzip or bgzip format.

--sv-exclusion-bed

Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.

--enable-variant-deduplication true

Relevant when both SV and SNV callers are enabled in somatic workflows. Can increase sensitivity and prevent duplicated variants within genes such as FLT3 and KMT2A. DRAGEN filters out all small indels in the structural variant VCF that also appear, and are passing, in the small variant VCF, and writes a new VCF containing only the SV VCF variants that do not match a variant from the SNV VCF file. The deduplicated SV VCF file has the prefix passed by --output-file-prefix, followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases.

--sv-systematic-noise $BEDPE

Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bzip compressed). Optional for Tumor-Normal, but strongly recommended for Tumor-Only.

--sv-somatic-ins-tandup-hotspot-regions-bed $BED

Specify a custom BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. The default file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with some padding on both sides (300 bps)

--sv-min-candidate-variant-size

Run SV caller and report all SVs/indels at or above this size. The default value is set to 10.

--sv-min-scored-variant-size

After candidate identification, only score and report SVs/indels at or above this size. The default value is set to 50. This parameter doesn't affect the somatic hotspot region.

Option
Recommended Value for Liquid Tumors (e.g. AML/MLL)

--heme-sv true

Configures DRAGEN to use SV settings for Liquid Tumors (e.g., AML/MLL).

--sv-min-scored-variant-size $INT

100000

Resource Files

DRAGEN requires resource files for components such as SNV, SV, and CNV. The following notes provide references for downloading these files or generating them for custom workflows or assays.

SNV Systematic Noise

Systematic noise files are considered essential in Tumor-Only workflows. They are also recommended for Tumor-Normal workflows.

Prebuilt

Prebuilt WES/WGS noise files
Description

WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FF

FFPE_WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FFPE (only hg38)

WES_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WES FF and FFPE

Custom

Prebuilt systematic noise files are available for WES or WGS applications. For these applications, it is considered optional to build custom noise files. For high-sensitivity applications, including panels, it is required to build custom noise files. For best accuracy, the normal samples should ideally closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30–70 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on each of approximately 30-70 normal samples.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable
--output-directory $OUTPUT
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM
--tumor-fastq-list-sample-id $STRING
--umi-enable true
--umi-source $STRING                    #default='qname'
--umi-library-type $STRING              #see 'UMI'
--vc-detect-systematic-noise true
--vc-target-bed-padding 500
--vc-enable-germline-tagging true
--variant-annotation-data $PATH

For WES and WGS pipelines, gather the full paths to the small variant hard-filtered VCFs (not GVCFs) from Step 1 and create a list file ${VCF_LIST} with one file path per line.
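
Before passing ${VCF_LIST} to Step 2, it can help to confirm that the file is well formed. The check below is a minimal sketch; the function name check_list_file is hypothetical, and the rules it enforces (absolute paths, one existing non-empty file per line, no blank lines) are a conservative reading of "full paths, 1 file per line" rather than a documented DRAGEN requirement.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a list file before using it in Step 2.

# check_list_file LIST_FILE
# Returns 0 when every line is an absolute path to an existing, non-empty file.
check_list_file() {
  local list_file=$1 line n=0 bad=0
  # The "|| [ -n ... ]" part also handles a last line without a newline.
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    case "$line" in
      /*) [ -s "$line" ] || { echo "line $n: missing or empty: $line" >&2; bad=1; } ;;
      *)  echo "line $n: not an absolute path: '$line'" >&2; bad=1 ;;
    esac
  done < "$list_file"
  # An empty list is also treated as an error.
  [ "$n" -gt 0 ] && [ "$bad" -eq 0 ]
}
```

A typical use would be check_list_file "$VCF_LIST" || exit 1 at the top of the script that launches Step 2.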

Step 2. Generate the final noise file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--build-sys-noise-vcfs-list ${VCF_LIST} 

SV Systematic Noise

Systematic noise files are recommended for Tumor-Normal workflows and are considered essential for reducing false positive (FP) calls in Tumor-Only workflows.

Prebuilt

Prebuilt WES/WGS noise files
Description

WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For WGS/WES FF/FFPE

IDPF_WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For HEME

Custom

It is considered optional to build a custom systematic noise file for WES or WGS applications, but for high-sensitivity applications such as panels it is strongly recommended. For best accuracy, the normal samples should closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30–100 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM
--tumor-fastq-list-sample-id $STRING
--umi-enable true
--umi-source $STRING                    #default='qname'
--umi-library-type $STRING              #see 'UMI'
--sv-detect-systematic-noise true 

Step 2. Build the BEDPE file using input VCFs from previous step.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX
--sv-build-systematic-noise-vcfs-list $VCF_LIST   #one VCF per line

CNV Panel of Normals (PON)

The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. CNV requires PON files for all targeted analyses (including panels, exomes, germline, tumor-only and tumor-normal workflows). It is recommended to use 30-100 normal samples when building the PON, but fewer may be used. If sample coverage noise is relatively stable, as few as 5 PON samples may yield acceptable results.

If a matched normal is available it is recommended to include it in the PON.

Follow the two steps below to generate CNV PON:

Step 1. Generate target counts of individual normal samples.

Any options used for panel of normals generation (BED file, GC Bias Correction, etc) should be matched when processing the case sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--enable-cnv true 
--cnv-target-bed $PATH 

Step 2. Combined counts generation.

Individual PON counts can be merged into a single file as a <prefix>.combined.counts.txt.gz file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--enable-cnv true 
--cnv-generate-combined-counts true 
--cnv-normals-list $CNV_NORMALS_LIST 

$CNV_NORMALS_LIST is a text file that lists, one per line, the path to each target counts file generated in Step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz). The output is a PON file with the suffix .combined.counts.txt.gz. Pass this PON file to case sample runs of DRAGEN CNV with the --cnv-combined-counts option.
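
The PON sizing guidance in this section (30-100 normals recommended, as few as 5 possibly acceptable when coverage noise is stable) can be turned into a quick check on the normals list. The sketch below only counts non-empty lines. The function name pon_size_check is hypothetical, and treating fewer than 5 normals as a failure is an assumption of this sketch rather than a DRAGEN rule.

```shell
#!/usr/bin/env bash
# Sketch: compare the number of normals in a list file with the PON guidance.

# pon_size_check LIST_FILE
# Prints the count and a verdict; returns non-zero for fewer than 5 normals.
pon_size_check() {
  local n
  [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
  n=$(grep -c . "$1") || true               # count non-empty lines
  if [ "$n" -ge 30 ]; then
    echo "$n normals: meets the recommended minimum of 30"
  elif [ "$n" -ge 5 ]; then
    echo "$n normals: below 30, may be acceptable if coverage noise is stable"
  else
    echo "$n normals: fewer than the 5 mentioned in the guidance"
    return 1
  fi
}
```

For example, pon_size_check "$CNV_NORMALS_LIST" can be run before the combined counts step.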

DNA Somatic Tumor-Only Solid WES

The DRAGEN recipe includes the recommended pipeline-specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
# Inputs 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
--Aligner.clip-pe-overhang 2            #option to ignore unwanted UMIs in reads 
--enable-duplicate-marking true         #default=true 
# Small variant caller 
--enable-variant-caller true 
--vc-target-bed $VC_TARGET_BED 
--vc-systematic-noise $PATH             #Required 
--vc-target-vaf 0.03                    #Default is 0.03. 
# SV 
--enable-sv true 
--sv-systematic-noise $PATH             #Recommended 
--sv-exome true 
--sv-call-regions-bed $SV_TARGET_BED 
# CNV 
--enable-cnv true 
--cnv-target-bed $PATH 
--cnv-combined-counts $PATH             #CNV PON 
# HRD Scoring 
--enable-hrd true                       #requires SNV 
# Annotation 
--variant-annotation-data $PATH
--vc-enable-germline-tagging true 
# TMB 
--enable-tmb true 
--tmb-enable-proxi-filter true          #Optional for Tumor-Only 
# HLA genotyper 
--enable-hla true 
# Microsatellite Instability (MSI) 
--msi-command tumor-only 
--msi-ref-normal-dir $PATH 
--msi-microsatellites-file $PATH 
--msi-coverage-threshold 40 

Notes and additional options

Hashtable

For DRAGEN somatic runs it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: FASTQ list, FASTQ, BAM, or CRAM.

FQ list Input

--tumor-fastq-list $PATH 
--tumor-fastq-list-sample-id $STRING 

FQ Input

--tumor-fastq1 $PATH 
--tumor-fastq2 $PATH 
--RGSM-tumor $STRING 
--RGID-tumor $STRING 

BAM Input

--tumor-bam-input $PATH 

CRAM Input

--tumor-cram-input $PATH 

Mapping and Aligning

Option
Description

--enable-map-align true

Enables map & align (default=true). Set to false to skip map & align, for example with already aligned BAM/CRAM input.

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

Duplicate Marking

Option
Description

--enable-duplicate-marking true

By default, DRAGEN marks duplicate reads and excludes them from variant calling.

Fractional (Raw Reads) Downsampling

DRAGEN can subsample a random, fractional percentage of reads from an input file using the fractional downsampler. You can use downsampling to subsample data sets in order to simulate different amounts of sequencing. DRAGEN randomly subsamples reads from primary analysis without any modification (e.g. no trimming, no filtering, etc.).

Downsampling may be useful to reduce runtime on very deep samples. For Tumor-Normal analyses it is also recommended to use a normal sample with lower coverage than the tumor sample. If the matched normal has deeper coverage than the tumor sample, the fractional downsampler can be used to reduce coverage on the normal sample.

Option
Description

--enable-fractional-down-sampler

Set to true to enable fractional downsampling. The default value is false.

--down-sampler-normal-subsample

Specify the fraction of reads to keep as a subsample of normal input data. The default value is 1.0 (100%).

--down-sampler-tumor-subsample

Specify the fraction of reads to keep as a subsample of tumor input data. The default value is 1.0 (100%).

--down-sampler-random-seed

Specify the random seed for different runs of the same input data. The default value is 42.

SNV

Option
Description

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-emit-ref-confidence GVCF

To enable gVCF output.

--vc-enable-vcf-output

To enable VCF file output during a gVCF run, set to true. The default value is false.

--vc-systematic-noise $PATH

Systematic noise file. This filter is recommended for removing systematic noise observed in normal samples.

--vc-somatic-hotspots $PATH

DRAGEN has a default set of hotspot variants (positions and alleles) where it will assign an increased prior probability. Use this option to override with a custom hotspots file.

--vc-enable-liquid-tumor-mode true

Tumor-in-normal contamination. Only use if there is some tumor leakage in the normal control.

--vc-override-tumor-pcr-params-with-normal false

Mixed sample preparation. Only use if the tumor and normal samples exhibit different PCR (indel) noise patterns, e.g., due to using different sample preparation.

--vc-sq-filter-threshold $INT

Threshold for sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.

--vc-excluded-regions-bed $BED

High-coverage sequencing panels allow for the detection of low-frequency alleles. DRAGEN supports 3 main settings for improved sensitivity on low VAF variant calls.

High Sensitivity Option
Description

--vc-target-vaf FLOAT

The default is 0.03 (3%). Set to e.g. 0.01 to improve SNV sensitivity on 1% VAF variants (assuming sufficient coverage).

--vc-enable-umi-solid true

Optimized for 1% and higher VAFs on UMI (or read position collapsed) samples with approx 300-1000X coverage.

--vc-enable-umi-liquid true

Optimized for 0.1% and higher VAFs on UMI samples with 1000X or higher coverage as expected in liquid biopsies.

HLA

Option
Description

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

CNV

Option
Description

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

--cnv-population-b-allele-vcf $POP_VCF

--heme-cnv true

Configures DRAGEN to use CNV settings for HEME.

Annotation

TMB

The Tumor-Normal pipeline is more effective than the Tumor-Only pipeline at removing or tagging germline variants. The Tumor-Only pipeline may consequently report somewhat elevated TMB values. The TMB proxi filter is an optional setting applied on top of the regular database germline filter. It aggressively filters additional germline variants based on allele frequencies.

Option
Description

--tmb-vaf-threshold FLOAT

Variant minimum allele frequency for usable variants (default=0.05).

--vc-callability-tumor-thresh INT

Required read coverage to use a site (default=50).

--tmb-enable-proxi-filter BOOL

Use variant VAF information to increase germline filtering. Recommended for Tumor-Only (TO), but not for Tumor-Normal (TN). May be overly aggressive at tagging variants as germline (default=false).

MSI

Option
Description and recommended setting

--msi-coverage-threshold INT

Minimum coverage for a microsatellite: 60 (default)

--msi-distance-threshold FLOAT

Minimum Jensen-Shannon distance between tumor and normal for a microsatellite: 0.1 (default)

SV

Option
Description

--sv-call-regions-bed

Specifies a BED file containing the set of regions to call. Optionally gzip or bgzip format.

--sv-exclusion-bed

Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.

--enable-variant-deduplication true

Relevant when both SV and SNV callers are enabled in somatic workflows. Can increase sensitivity and prevent duplicated variants within genes such as FLT3 and KMT2A. DRAGEN filters out all small indels in the structural variant VCF that also appear, and are passing, in the small variant VCF, and writes a new VCF containing only the SV VCF variants that do not match a variant from the SNV VCF file. The deduplicated SV VCF file has the prefix passed by --output-file-prefix, followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases.

--sv-systematic-noise $BEDPE

Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bzip compressed). Optional for Tumor-Normal, but strongly recommended for Tumor-Only.

--sv-somatic-ins-tandup-hotspot-regions-bed $BED

Specify a custom BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. The default file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with some padding on both sides (300 bps)

--sv-min-candidate-variant-size

Run SV caller and report all SVs/indels at or above this size. The default value is set to 10.

--sv-min-scored-variant-size

After candidate identification, only score and report SVs/indels at or above this size. The default value is set to 50. This parameter doesn't affect the somatic hotspot region.

Option
Recommended Value for Liquid Tumors (e.g. AML/MLL)

--heme-sv true

Configures DRAGEN to use SV settings for Liquid Tumors (e.g., AML/MLL).

--sv-min-scored-variant-size $INT

100000

Resource Files

DRAGEN requires resource files for components such as SNV, SV, and CNV. The following notes provide references for downloading these files or generating them for custom workflows or assays.

SNV Systematic Noise

Systematic noise files are considered essential in Tumor-Only workflows. They are also recommended for Tumor-Normal workflows.

Prebuilt

Prebuilt WES/WGS noise files
Description

WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FF

FFPE_WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FFPE (only hg38)

WES_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WES FF and FFPE

Custom

Prebuilt systematic noise files are available for WES or WGS applications. For these applications, it is considered optional to build custom noise files. For high-sensitivity applications, including panels, it is required to build custom noise files. For best accuracy, the normal samples should ideally closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30–70 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on each of approximately 30-70 normal samples.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--vc-detect-systematic-noise=true 
--vc-target-bed-padding 500 
--vc-enable-germline-tagging=true 
--variant-annotation-data $PATH

For WES and WGS pipelines, gather the full paths to the small variant hard-filtered VCFs (not gVCFs) from step 1 and create a list file ${VCF_LIST} with one path per line.
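The list file can be assembled by hand, but a small helper avoids missed or relative paths. The following is a minimal sketch, not part of DRAGEN: the function name make_vcf_list is hypothetical, and the *.hard-filtered.vcf.gz pattern is an assumption about how the step 1 outputs are named, so adjust it to match your runs.

```shell
# Write the absolute path of every hard-filtered VCF found under a run
# directory to a list file, one path per line.
# Assumption: step 1 outputs end in .hard-filtered.vcf.gz; this pattern
# does not match files ending in .gvcf.gz, so gVCFs are left out.
make_vcf_list() {
    local run_dir=$1 list_file=$2
    # Resolve the directory first so that find prints absolute paths.
    find "$(cd "$run_dir" && pwd)" -type f -name '*.hard-filtered.vcf.gz' \
        | sort > "$list_file"
}
```

For example, `make_vcf_list /path/to/step1_outputs vcf_list.txt` followed by `VCF_LIST=vcf_list.txt` gives a file that can be passed to step 2. Check the line count against the number of normals before building the noise file.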

Step 2. Generate the final noise file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--build-sys-noise-vcfs-list ${VCF_LIST} 

SV Systematic Noise

Systematic noise files are recommended for Tumor-Normal workflows and are considered essential for reducing false positive (FP) calls in Tumor-Only workflows.

Prebuilt

Prebuilt WES/WGS noise files
Description

WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For WGS/WES FF/FFPE

IDPF_WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For HEME

Custom

It is considered optional to build a custom systematic noise file for WES or WGS applications, but for high sensitivity applications like panels it is strongly recommended. For best accuracy the normal samples should ideally closely match the sequencer, sample type, library prep and coverage of the tumor samples of interest. It is typically recommended to use 30 - 100 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--sv-detect-systematic-noise true 
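Because step 1 is repeated for every normal, it can help to generate the command lines from a list of sample IDs and review them before running anything. The sketch below only prints commands and never calls DRAGEN. The function name and the VERSION, REF_DIR, OUT_ROOT, STAGING and FASTQ_LIST variables are hypothetical placeholders. It also assumes one output directory per sample and a single FASTQ list that contains all normals.

```shell
# Print (do not run) one SV noise step 1 command per normal sample ID.
# VERSION, REF_DIR, OUT_ROOT, STAGING and FASTQ_LIST are placeholder
# variables assumed to be set by the caller; they are not DRAGEN settings.
print_sv_noise_cmds() {
    local sample
    for sample in "$@"; do
        # echo joins its arguments with single spaces into one line.
        echo "/opt/dragen/${VERSION}/bin/dragen" \
            "--ref-dir ${REF_DIR}" \
            "--output-directory ${OUT_ROOT}/${sample}" \
            "--intermediate-results-dir ${STAGING}" \
            "--output-file-prefix ${sample}" \
            "--tumor-fastq-list ${FASTQ_LIST}" \
            "--tumor-fastq-list-sample-id ${sample}" \
            "--sv-detect-systematic-noise true"
    done
}
```

Calling `print_sv_noise_cmds NORMAL_01 NORMAL_02 > step1_cmds.sh` (sample IDs are illustrative) writes one command per line for inspection.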

Step 2. Build the BEDPE file using input VCFs from previous step.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX
--sv-build-systematic-noise-vcfs-list $VCF_LIST   #one VCF per line

CNV Panel of Normals (PON)

The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. CNV requires PON files for all targeted analyses (including panels, exomes, germline, tumor-only and tumor-normal workflows). It is recommended to use 30-100 normal samples when building the PON, but fewer may be used. If sample coverage noise is relatively stable, as few as 5 PON samples may yield acceptable results.

If a matched normal is available it is recommended to include it in the PON.

Follow the two steps below to generate CNV PON:

Step 1. Generate target counts of individual normal samples.

Any options used for panel of normals generation (BED file, GC Bias Correction, etc) should be matched when processing the case sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--enable-cnv true 
--cnv-target-bed $PATH 

Step 2. Combined counts generation.

Individual PON counts can be merged into a single file as a <prefix>.combined.counts.txt.gz file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--enable-cnv true 
--cnv-generate-combined-counts true 
--cnv-normals-list $CNV_NORMALS_LIST 

$CNV_NORMALS_LIST is a single list file containing the path to each target counts file generated in step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz), one path per line. The output is a PON file with the suffix .combined.counts.txt.gz. Use the PON file in case sample runs of DRAGEN CNV with the --cnv-combined-counts option.
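A malformed list is easy to produce when paths are collected by hand, so a quick check before step 2 can save a failed run. The following is a minimal sketch and not a DRAGEN tool: the function name check_cnv_normals_list is hypothetical, and the only rules it enforces are the two file suffixes named above and that each file exists.

```shell
# Check a CNV normals list before combined counts generation.
# Every non-blank line must name an existing file ending in
# .target.counts.gz or .target.counts.gc-corrected.gz.
# Prints the number of valid entries; returns non-zero on the first bad line.
check_cnv_normals_list() {
    local list_file=$1 path count=0
    # The extra test keeps a final line that has no trailing newline.
    while IFS= read -r path || [ -n "$path" ]; do
        [ -n "$path" ] || continue            # skip blank lines
        case $path in
            *.target.counts.gz|*.target.counts.gc-corrected.gz) ;;
            *) echo "unexpected suffix: $path" >&2; return 1 ;;
        esac
        [ -f "$path" ] || { echo "missing file: $path" >&2; return 1; }
        count=$((count + 1))
    done < "$list_file"
    echo "$count"
}
```

The printed count can be compared with the intended PON size, for example to confirm that the list holds at least the handful of normals mentioned earlier in this section.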

DNA Somatic Tumor-Only Solid WGS UMI

The DRAGEN recipe includes the recommended pipeline specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
# Inputs 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
# UMI 
--umi-enable true 
--umi-source STRING                     #Default='qname' 
--umi-library-type STRING               #e.g. random-duplex 
--umi-min-supporting-reads 1            #Default=2 
# Small variant caller 
--enable-variant-caller true 
--vc-systematic-noise $PATH             #Required 
--vc-target-vaf 0.03                    #Default is 0.03. 
# SV 
--enable-sv true 
--sv-systematic-noise $PATH             #Recommended 
# CNV 
--enable-cnv true 
--cnv-enable-self-normalization true 
--cnv-population-b-allele-vcf $POP_VCF  #optional, add to enable germline ASCN caller 
# HRD Scoring 
--enable-hrd true                       #requires SNV 
# Annotation 
--variant-annotation-data PATH 
--vc-enable-germline-tagging true 
# TMB 
--enable-tmb true 
--tmb-enable-proxi-filter true          #Optional for Tumor-Only 
# HLA genotyper 
--enable-hla true 
# Microsatellite Instability (MSI) 
--msi-command tumor-only 
--msi-ref-normal-dir $PATH 
--msi-microsatellites-file $PATH 
--msi-coverage-threshold 40 

Notes and additional options

Hashtable

For DRAGEN somatic runs it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: FASTQ list, FASTQ, BAM, or CRAM.

FQ list Input

--tumor-fastq-list $PATH 
--tumor-fastq-list-sample-id $STRING 

FQ Input

--tumor-fastq1 $PATH 
--tumor-fastq2 $PATH 
--RGSM-tumor $STRING 
--RGID-tumor $STRING 

BAM Input

--tumor-bam-input $PATH 

CRAM Input

--tumor-cram-input $PATH 

Mapping and Aligning

Option
Description

--enable-map-align true

Optionally disable map & align (default=true).

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

Fractional (Raw Reads) Downsampling

DRAGEN can subsample a random, fractional percentage of reads from an input file using the fractional downsampler. You can use downsampling to subsample data sets in order to simulate different amounts of sequencing. DRAGEN randomly subsamples reads from primary analysis without any modification (e.g. no trimming, no filtering, etc.).

Downsampling may be useful to reduce runtime on very deep samples. For Tumor-Normal analyses it is also recommended to use a normal sample with coverage that is less than the tumor sample. If the matched normal has deeper coverage than the tumor sample, then the fractional downsampler may be used to reduce coverage on the normal sample.

Option
Description

--enable-fractional-down-sampler

Set to true to enable fractional downsampling. The default value is false.

--down-sampler-normal-subsample

Specify the fraction of reads to keep as a subsample of normal input data. The default value is 1.0 (100%).

--down-sampler-tumor-subsample

Specify the fraction of reads to keep as a subsample of tumor input data. The default value is 1.0 (100%).

--down-sampler-random-seed

Specify the random seed for different runs of the same input data. The default value is 42.
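When the matched normal is deeper than the tumor, the fraction to keep follows from the two coverage values. The sketch below is an illustration, not a DRAGEN feature. It assumes that you already know the mean coverage of both samples and that the goal is to cap the normal at the tumor coverage. The function name normal_subsample_fraction is hypothetical.

```shell
# Fraction of normal reads to keep so that normal coverage does not exceed
# tumor coverage. Both arguments are mean coverage values measured beforehand.
normal_subsample_fraction() {
    local tumor_cov=$1 normal_cov=$2
    awk -v t="$tumor_cov" -v n="$normal_cov" 'BEGIN {
        f = (n > t) ? t / n : 1.0     # keep all reads if the normal is not deeper
        printf "%.2f\n", f
    }'
}
```

For a tumor at 60x and a normal at 100x, `normal_subsample_fraction 60 100` prints 0.60, which could be passed as `--down-sampler-normal-subsample` together with `--enable-fractional-down-sampler true`. Choose a smaller fraction if the normal should end up strictly below the tumor coverage.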

UMI

Option
Description

--umi-source STRING

Specify the input type for the UMI sequence. Options: qname, fastq, bamtag.

--umi-library-type STRING

Set the batch option for the UMI correction mode. Options: random-duplex, random-simplex, nonrandom-duplex.

--umi-nonrandom-whitelist $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The whitelist includes a valid UMI sequence per line.

--umi-correction-table $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The correction table defaults to the table used by TruSight Oncology: <INSTALL_PATH>/resources/umi/umi_correction_table.txt.gz.

--umi-min-supporting-reads INT

Specify the number of matching UMI input reads required to generate a consensus read. Any family with insufficient supporting reads is discarded. The default is 2.

--umi-metrics-interval-file $BED

Target region in BED format.

--umi-emit-multiplicity both

--umi-start-mask-length INT

Number of additional bases to ignore from start of read. The default is 0. To reduce FP optionally set to 1.

--umi-end-mask-length INT

Number of additional bases to ignore from end of read. The default is 0. To reduce FP optionally set to 3.

SNV

Option
Description

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-emit-ref-confidence GVCF

To enable gVCF output.

--vc-enable-vcf-output

To enable VCF file output during a gVCF run, set to true. The default value is false.

--vc-systematic-noise $PATH

Systematic noise file. This filter is recommended for removing systematic noise observed in normal samples.

--vc-somatic-hotspots $PATH

DRAGEN has a default set of hotspot variants (positions and alleles) where it will assign an increased prior probability. Use this option to override with a custom hotspots file.

--vc-enable-liquid-tumor-mode true

Tumor-in-normal contamination. Only use if there is some tumor leakage in the normal control.

--vc-override-tumor-pcr-params-with-normal false

Mixed sample preparation. Only use if the tumor and normal samples exhibit different PCR (indel) noise patterns, e.g., due to using different sample preparation.

--vc-sq-filter-threshold $INT

Threshold for sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.

--vc-excluded-regions-bed $BED

HLA

Option
Description

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

CNV

Option
Description

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

--cnv-population-b-allele-vcf $POP_VCF

--heme-cnv true

Configures DRAGEN to use CNV settings for HEME.

Annotation

TMB

The Tumor-Normal pipeline is more effective than the Tumor-Only pipeline at removing or tagging germline variants. The Tumor-Only may subsequently report somewhat elevated TMB values. The TMB proxi filter is an optional setting on top of the regular database germline filter. It will aggressively filter additional germline variants based on allele frequencies.

Option
Description

--tmb-vaf-threshold FLOAT

Minimum variant allele frequency for usable variants (default=0.05).

--vc-callability-tumor-thresh INT

Required read coverage to use a site (default=50).

--tmb-enable-proxi-filter BOOL

Use VAF information to increase germline filtering. Recommended for Tumor-Only (TO), but not for Tumor-Normal (TN). May be overly aggressive at tagging variants as germline (default=false).

MSI

Option
Description and recommended setting

--msi-coverage-threshold INT

Minimum coverage for a microsatellite: 60 (default)

--msi-distance-threshold FLOAT

Minimum Jensen-Shannon distance between tumor and normal for a microsatellite: 0.1 (default)

SV

Option
Description

--sv-call-regions-bed

Specifies a BED file containing the set of regions to call. Optionally gzip or bgzip format.

--sv-exclusion-bed

Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.

--enable-variant-deduplication true

Relevant when both SV and SNV callers are enabled in somatic workflows. Can increase sensitivity and prevent duplicated variants within genes such as FLT3 and KMT2A. Filters all small indels in the structural variant VCF that also appear, and are passing, in the small variant VCF. DRAGEN creates a new VCF containing the SV VCF variants that do not match a variant in the SNV VCF file. The deduplicated SV VCF file name is the prefix passed by --output-file-prefix followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases.

--sv-systematic-noise $BEDPE

Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bgzip compressed). Optional for Tumor-Normal, but strongly recommended for Tumor-Only.

--sv-somatic-ins-tandup-hotspot-regions-bed $BED

Specify a custom BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. The default file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with 300 bp of padding on both sides.

--sv-min-candidate-variant-size

Run SV caller and report all SVs/indels at or above this size. The default value is set to 10.

--sv-min-scored-variant-size

After candidate identification, only score and report SVs/indels at or above this size. The default value is set to 50. This parameter doesn't affect the somatic hotspot region.

Option
Recommended Value for Liquid Tumors (e.g. AML/MLL)

--heme-sv true

Configures DRAGEN to use SV settings for Liquid Tumors (e.g., AML/MLL).

--sv-min-scored-variant-size $INT

100000

Resource Files

DRAGEN requires resource files for components such as SNV, SV, and CNV. The following notes provide references for downloading these files or generating them for custom workflows or assays.

SNV Systematic Noise

Systematic noise files are considered essential in Tumor-Only workflows. They are also recommended for Tumor-Normal workflows.

Prebuilt

Prebuilt WES/WGS noise files
Description

WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FF

FFPE_WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FFPE (only hg38)

WES_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WES FF and FFPE

Custom

Prebuilt systematic noise files are available for WES or WGS applications. For these applications, it is considered optional to build custom noise files. For high-sensitivity applications, including panels, it is required to build custom noise files. For best accuracy, the normal samples should ideally closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30–70 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on each of approximately 30-70 normal samples.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--umi-enable true 
--umi-source STRING                     #default='qname' 
--umi-library-type STRING               #see 'UMI' 
--vc-detect-systematic-noise=true 
--vc-enable-germline-tagging=true 
--variant-annotation-data $PATH

For WES and WGS pipelines, gather the full paths to the small variant hard-filtered VCFs (not gVCFs) from step 1 and create a list file ${VCF_LIST} with one path per line.

Step 2. Generate the final noise file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--build-sys-noise-vcfs-list ${VCF_LIST} 

SV Systematic Noise

Systematic noise files are recommended for Tumor-Normal workflows and are considered essential for reducing false positive (FP) calls in Tumor-Only workflows.

Prebuilt

Prebuilt WES/WGS noise files
Description

WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For WGS/WES FF/FFPE

IDPF_WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For HEME

Custom

It is considered optional to build a custom systematic noise file for WES or WGS applications, but for high sensitivity applications like panels it is strongly recommended. For best accuracy the normal samples should ideally closely match the sequencer, sample type, library prep and coverage of the tumor samples of interest. It is typically recommended to use 30 - 100 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--umi-enable true 
--umi-source STRING                     #default='qname' 
--umi-library-type STRING               #see 'UMI' 
--sv-detect-systematic-noise true 

Step 2. Build the BEDPE file using input VCFs from previous step.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX
--sv-build-systematic-noise-vcfs-list $VCF_LIST   #one VCF per line

DNA Somatic Tumor-Only Solid WGS

The DRAGEN recipe includes the recommended pipeline specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
# Inputs 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
--Aligner.clip-pe-overhang 2            #option to ignore unwanted UMIs in reads 
--enable-duplicate-marking true         #default=true 
# Small variant caller 
--enable-variant-caller true 
--vc-systematic-noise $PATH             #Required 
--vc-target-vaf 0.03                    #Default is 0.03. 
# SV 
--enable-sv true 
--sv-systematic-noise $PATH             #Recommended 
# CNV 
--enable-cnv true 
--cnv-enable-self-normalization true 
--cnv-population-b-allele-vcf $POP_VCF  #optional, add to enable germline ASCN caller 
# HRD Scoring 
--enable-hrd true                       #requires SNV 
# Annotation 
--variant-annotation-data PATH 
--vc-enable-germline-tagging true 
# TMB 
--enable-tmb true 
--tmb-enable-proxi-filter true          #Optional for Tumor-Only 
# HLA genotyper 
--enable-hla true 
# Microsatellite Instability (MSI) 
--msi-command tumor-only 
--msi-ref-normal-dir $PATH 
--msi-microsatellites-file $PATH 
--msi-coverage-threshold 40 

Notes and additional options

Hashtable

For DRAGEN somatic runs it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: FASTQ list, FASTQ, BAM, or CRAM.

FQ list Input

--tumor-fastq-list $PATH 
--tumor-fastq-list-sample-id $STRING 

FQ Input

--tumor-fastq1 $PATH 
--tumor-fastq2 $PATH 
--RGSM-tumor $STRING 
--RGID-tumor $STRING 

BAM Input

--tumor-bam-input $PATH 

CRAM Input

--tumor-cram-input $PATH 

Mapping and Aligning

Option
Description

--enable-map-align true

Optionally disable map & align (default=true).

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

Duplicate Marking

Option
Description

--enable-duplicate-marking true

By default, DRAGEN marks duplicate reads and excludes them from variant calling.

Fractional (Raw Reads) Downsampling

DRAGEN can subsample a random, fractional percentage of reads from an input file using the fractional downsampler. You can use downsampling to subsample data sets in order to simulate different amounts of sequencing. DRAGEN randomly subsamples reads from primary analysis without any modification (e.g. no trimming, no filtering, etc.).

Downsampling may be useful to reduce runtime on very deep samples. For Tumor-Normal analyses it is also recommended to use a normal sample with coverage that is less than the tumor sample. If the matched normal has deeper coverage than the tumor sample, then the fractional downsampler may be used to reduce coverage on the normal sample.

Option
Description

--enable-fractional-down-sampler

Set to true to enable fractional downsampling. The default value is false.

--down-sampler-normal-subsample

Specify the fraction of reads to keep as a subsample of normal input data. The default value is 1.0 (100%).

--down-sampler-tumor-subsample

Specify the fraction of reads to keep as a subsample of tumor input data. The default value is 1.0 (100%).

--down-sampler-random-seed

Specify the random seed for different runs of the same input data. The default value is 42.

SNV

Option
Description

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-emit-ref-confidence GVCF

To enable gVCF output.

--vc-enable-vcf-output

To enable VCF file output during a gVCF run, set to true. The default value is false.

--vc-systematic-noise $PATH

Systematic noise file. This filter is recommended for removing systematic noise observed in normal samples.

--vc-somatic-hotspots $PATH

DRAGEN has a default set of hotspot variants (positions and alleles) where it will assign an increased prior probability. Use this option to override with a custom hotspots file.

--vc-enable-liquid-tumor-mode true

Tumor-in-normal contamination. Only use if there is some tumor leakage in the normal control.

--vc-override-tumor-pcr-params-with-normal false

Mixed sample preparation. Only use if the tumor and normal samples exhibit different PCR (indel) noise patterns, e.g., due to using different sample preparation.

--vc-sq-filter-threshold $INT

Threshold for sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.

--vc-excluded-regions-bed $BED

HLA

Option
Description

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

CNV

Option
Description

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

--cnv-population-b-allele-vcf $POP_VCF

--heme-cnv true

Configures DRAGEN to use CNV settings for HEME.

Annotation

TMB

The Tumor-Normal pipeline is more effective than the Tumor-Only pipeline at removing or tagging germline variants. The Tumor-Only may subsequently report somewhat elevated TMB values. The TMB proxi filter is an optional setting on top of the regular database germline filter. It will aggressively filter additional germline variants based on allele frequencies.

Option
Description

--tmb-vaf-threshold FLOAT

Minimum variant allele frequency for usable variants (default=0.05).

--vc-callability-tumor-thresh INT

Required read coverage to use a site (default=50).

--tmb-enable-proxi-filter BOOL

Use VAF information to increase germline filtering. Recommended for Tumor-Only (TO), but not for Tumor-Normal (TN). May be overly aggressive at tagging variants as germline (default=false).

MSI

Option
Description and recommended setting

--msi-coverage-threshold INT

Minimum coverage for a microsatellite: 60 (default)

--msi-distance-threshold FLOAT

Minimum Jensen-Shannon distance between tumor and normal for a microsatellite: 0.1 (default)

SV

Option
Description

--sv-call-regions-bed

Specifies a BED file containing the set of regions to call. Optionally gzip or bgzip format.

--sv-exclusion-bed

Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.

--enable-variant-deduplication true

Relevant when both SV and SNV callers are enabled in somatic workflows. Can increase sensitivity and prevent duplicated variants within genes such as FLT3 and KMT2A. Filters all small indels in the structural variant VCF that also appear, and are passing, in the small variant VCF. DRAGEN creates a new VCF containing the SV VCF variants that do not match a variant in the SNV VCF file. The deduplicated SV VCF file name is the prefix passed by --output-file-prefix followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases.

--sv-systematic-noise $BEDPE

Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bgzip compressed). Optional for Tumor-Normal, but strongly recommended for Tumor-Only.

--sv-somatic-ins-tandup-hotspot-regions-bed $BED

Specify a custom BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. The default file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with 300 bp of padding on both sides.

--sv-min-candidate-variant-size

Run SV caller and report all SVs/indels at or above this size. The default value is set to 10.

--sv-min-scored-variant-size

After candidate identification, only score and report SVs/indels at or above this size. The default value is set to 50. This parameter doesn't affect the somatic hotspot region.

Option
Recommended Value for Liquid Tumors (e.g. AML/MLL)

--heme-sv true

Configures DRAGEN to use SV settings for Liquid Tumors (e.g., AML/MLL).

--sv-min-scored-variant-size $INT

100000

Resource Files

DRAGEN requires resource files for components such as SNV, SV, and CNV. The following notes provide references for downloading these files or generating them for custom workflows or assays.

SNV Systematic Noise

Systematic noise files are considered essential in Tumor-Only workflows. They are also recommended for Tumor-Normal workflows.

Prebuilt

Prebuilt WES/WGS noise files
Description

WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FF

FFPE_WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FFPE (only hg38)

WES_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WES FF and FFPE

Custom

Prebuilt systematic noise files are available for WES or WGS applications. For these applications, it is considered optional to build custom noise files. For high-sensitivity applications, including panels, it is required to build custom noise files. For best accuracy, the normal samples should ideally closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30–70 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on each of approximately 30-70 normal samples.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--vc-detect-systematic-noise=true 
--vc-enable-germline-tagging=true 
--variant-annotation-data $PATH

For WES and WGS pipelines, gather the full paths to the small variant hard-filtered VCFs (not gVCFs) from step 1 and create a list file ${VCF_LIST} with one path per line.
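
The list file can be assembled with a short helper. The following is a minimal sketch, not part of DRAGEN: it assumes that all step 1 outputs sit under one directory tree and that the hard-filtered VCFs end in .hard-filtered.vcf.gz. The function name make_vcf_list and the example paths are hypothetical.

```shell
# Sketch: collect the hard-filtered VCFs from the step 1 runs into a list file.
# Assumptions (not from this guide): one results tree, and file names that
# end in ".hard-filtered.vcf.gz" (gVCFs and SV VCFs do not match this pattern).
make_vcf_list() {
    local results_dir="$1" list_file="$2"
    # Write full paths, one per line, in a stable order.
    find "$(cd "$results_dir" && pwd)" -type f -name '*.hard-filtered.vcf.gz' \
        | sort > "$list_file"
    # An empty list usually means the directory or name pattern is wrong.
    [ -s "$list_file" ]
}

# Usage (placeholder paths):
#   make_vcf_list /staging/normals_out vcf_list.txt
#   VCF_LIST=vcf_list.txt
```

The sort step only makes the list reproducible between runs.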

Step 2. Generate the final noise file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--build-sys-noise-vcfs-list ${VCF_LIST} 

SV Systematic Noise

Systematic noise files are recommended for Tumor-Normal workflows and are considered essential for reducing false-positive (FP) calls in Tumor-Only workflows.

Prebuilt

Prebuilt WES/WGS noise files
Description

WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For WGS/WES FF/FFPE

IDPF_WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For HEME

Custom

Building a custom systematic noise file is considered optional for WES or WGS applications, but is strongly recommended for high-sensitivity applications such as panels. For best accuracy, the normal samples should closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30-100 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--sv-detect-systematic-noise true 
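
Because step 1 is repeated for every normal sample, it can help to drive it from a manifest. The following is a rough sketch under stated assumptions: the tab-separated manifest layout, the DRAGEN_BIN and DRY_RUN variables, and the function name are hypothetical, and only options shown in the step 1 command are passed. With DRY_RUN left at its default the commands are only printed.

```shell
# Sketch: run the SV step 1 command once per normal sample.
# Assumptions (not from this guide): a manifest with
# "<sample_id><TAB><fastq_list.csv>" per line, a DRAGEN_BIN variable for the
# install path, and a DRY_RUN switch (default 1) that only prints commands.
run_sv_noise_step1() {
    local manifest="$1" ref_dir="$2" out_root="$3"
    local dragen="${DRAGEN_BIN:-dragen}"
    local tab sample_id fastq_list
    tab="$(printf '\t')"
    while IFS="$tab" read -r sample_id fastq_list || [ -n "$sample_id" ]; do
        [ -n "$sample_id" ] || continue      # skip blank lines
        # Core options from the step 1 command, one output directory per sample.
        set -- "$dragen" \
            --ref-dir "$ref_dir" \
            --output-directory "$out_root/$sample_id" \
            --output-file-prefix "$sample_id" \
            --tumor-fastq-list "$fastq_list" \
            --tumor-fastq-list-sample-id "$sample_id" \
            --sv-detect-systematic-noise true
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$*"
        else
            # Keep the tool from consuming the manifest on stdin.
            mkdir -p "$out_root/$sample_id" && "$@" < /dev/null
        fi
    done < "$manifest"
}

# Usage (placeholder paths):
#   run_sv_noise_step1 normals.tsv "$REF_DIR" /staging/sv_noise
```

Review the printed commands first, then set DRY_RUN=0 to execute them.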

Step 2. Build the BEDPE file using input VCFs from previous step.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--sv-build-systematic-noise-vcfs-list $VCF_LIST    #one VCF per line 

DNA Somatic Tumor-Only Solid Panel

The DRAGEN recipe includes the recommended pipeline-specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
# Inputs 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
--Aligner.clip-pe-overhang 2            #option to ignore unwanted UMIs in reads 
--enable-duplicate-marking true         #default=true 
# Small variant caller 
--enable-variant-caller true 
--vc-target-bed $VC_TARGET_BED 
--vc-systematic-noise $PATH             #Required 
--vc-target-vaf 0.03                    #Default is 0.03. 
# SV 
--enable-sv true 
--sv-systematic-noise $PATH             #Recommended 
--sv-exome true 
--sv-call-regions-bed $SV_TARGET_BED 
# CNV 
--enable-cnv true 
--cnv-target-bed $PATH 
--cnv-combined-counts $PATH             #CNV PON 
# Annotation 
--variant-annotation-data $PATH 
--vc-enable-germline-tagging true 
# TMB 
--enable-tmb true 
--tmb-enable-proxi-filter true          #Optional for Tumor-Only 
# HLA genotyper 
--enable-hla true 
--hla-enable-class-2 true               #if panel covers class II HLA regions 
--hla-as-filter-min-threshold 29.0      #panel specific setting 
--hla-as-filter-ratio-threshold 0.85    #panel specific setting 
# Microsatellite Instability (MSI) 
--msi-command tumor-only 
--msi-ref-normal-dir $PATH 
--msi-microsatellites-file $PATH 
--msi-coverage-threshold 40 

Notes and additional options

Hashtable

For DRAGEN somatic runs it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

--tumor-fastq-list $PATH 
--tumor-fastq-list-sample-id $STRING 

FQ Input

--tumor-fastq1 $PATH 
--tumor-fastq2 $PATH 
--RGSM-tumor $STRING 
--RGID-tumor $STRING 

BAM Input

--tumor-bam-input $PATH 

CRAM Input

--tumor-cram-input $PATH 

Mapping and Aligning

Option
Description

--enable-map-align true

Optionally disable map & align (default=true).

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

Duplicate Marking

Option
Description

--enable-duplicate-marking true

By default, DRAGEN marks duplicate reads and excludes them from variant calling.

--enable-positional-collapsing true

Alternative to --enable-duplicate-marking=true. Instead of discarding duplicate reads, DRAGEN can optionally perform positional collapsing, merging them into higher-quality consensus reads. This is beneficial for small panels without UMIs and coverage between 300X and 1000X. However, it's slower than standard duplicate marking and less effective on samples with coverage lower than 300X. For very high coverage (1000X+), avoid it due to potential read collisions. For high-sensitivity panels with 1000X+ coverage, consider using UMIs.

Fractional (Raw Reads) Downsampling

DRAGEN can subsample a random, fractional percentage of reads from an input file using the fractional downsampler. You can use downsampling to subsample data sets in order to simulate different amounts of sequencing. DRAGEN randomly subsamples reads from primary analysis without any modification (e.g. no trimming, no filtering, etc.).

Downsampling may be useful to reduce runtime on very deep samples. For Tumor-Normal analyses, it is also recommended to use a normal sample with lower coverage than the tumor sample. If the matched normal has deeper coverage than the tumor sample, the fractional downsampler can be used to reduce coverage on the normal sample.

Option
Description

--enable-fractional-down-sampler

Set to true to enable fractional downsampling. The default value is false.

--down-sampler-normal-subsample

Specify the fraction of reads to keep as a subsample of normal input data. The default value is 1.0 (100%).

--down-sampler-tumor-subsample

Specify the fraction of reads to keep as a subsample of tumor input data. The default value is 1.0 (100%).

--down-sampler-random-seed

Specify the random seed for different runs of the same input data. The default value is 42.
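
As a worked example of choosing a subsample value, the following sketch computes the fraction of reads to keep so that a sample lands near a target depth. It assumes that mean coverage scales linearly with the number of reads kept. The function name and the example coverages are hypothetical.

```shell
# Sketch: fraction of reads to keep to reach a target mean coverage.
# Assumption: coverage scales linearly with the number of reads kept.
subsample_fraction() {
    local current_cov="$1" target_cov="$2"
    LC_ALL=C awk -v cur="$current_cov" -v tgt="$target_cov" 'BEGIN {
        if (cur <= 0) exit 1          # coverage must be positive
        f = tgt / cur
        if (f > 1) f = 1              # cannot keep more than 100% of reads
        printf "%.2f\n", f
    }'
}

# Example: reduce a 120X matched normal to roughly 40X.
#   frac=$(subsample_fraction 120 40)    # prints 0.33
#   ... --enable-fractional-down-sampler true \
#       --down-sampler-normal-subsample "$frac"
```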

SNV

Option
Description

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (bp) over which phased variants will be combined. Set to 0 to disable. The valid range is 0-15 bp (default=2).

--vc-emit-ref-confidence GVCF

To enable gVCF output.

--vc-enable-vcf-output

To enable VCF file output during a gVCF run, set to true. The default value is false.

--vc-systematic-noise $PATH

Systematic noise file. This filter is recommended for removing systematic noise observed in normal samples.

--vc-somatic-hotspots $PATH

DRAGEN has a default set of hotspot variants (positions and alleles) where it will assign an increased prior probability. Use this option to override with a custom hotspots file.

--vc-enable-liquid-tumor-mode true

Tumor-in-normal contamination. Only use if there is some tumor leakage in the normal control.

--vc-override-tumor-pcr-params-with-normal false

Mixed sample preparation. Only use if the tumor and normal samples exhibit different PCR (indel) noise patterns, e.g., due to using different sample preparation.

--vc-sq-filter-threshold $INT

Threshold for sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.

--vc-excluded-regions-bed $BED

High-coverage sequencing panels allow for the detection of low-frequency alleles. DRAGEN supports 3 main settings for improved sensitivity on low VAF variant calls.

High Sensitivity Option
Description

--vc-target-vaf FLOAT

The default is 0.03 (3%). Set to e.g. 0.01 to improve SNV sensitivity on 1% VAF variants (assuming sufficient coverage).

--vc-enable-umi-solid true

Optimized for 1% and higher VAFs on UMI (or read position collapsed) samples with approx 300-1000X coverage.

--vc-enable-umi-liquid true

Optimized for 0.1% and higher VAFs on UMI samples with 1000X or higher coverage as expected in liquid biopsies.
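
The two UMI rows of the table above can be read as a simple lookup. The following sketch only restates those rows. Treating the approximate coverage ranges as hard boundaries is a simplification, and the function name and its arguments are hypothetical.

```shell
# Sketch: look up the UMI-related high-sensitivity option from the table above.
# The cut-offs (300X, 1000X) come from the table; using them as hard
# boundaries is an assumption made for illustration.
high_sensitivity_option() {
    local collapsing="$1"   # "umi", "positional" or "none"
    local coverage="$2"     # approximate mean coverage (integer)
    if [ "$collapsing" = "umi" ] && [ "$coverage" -ge 1000 ]; then
        echo "--vc-enable-umi-liquid true"
    elif [ "$collapsing" != "none" ] && [ "$coverage" -ge 300 ] \
            && [ "$coverage" -lt 1000 ]; then
        echo "--vc-enable-umi-solid true"
    fi
    # Otherwise print nothing: consider tuning --vc-target-vaf instead.
}

# Usage:
#   extra_opts=$(high_sensitivity_option umi 1500)
```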

HLA

Option
Description

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

CNV

Option
Description

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

--cnv-population-b-allele-vcf $POP_VCF

--heme-cnv true

Configures DRAGEN to use CNV settings for HEME.

Annotation

TMB

The Tumor-Normal pipeline is more effective than the Tumor-Only pipeline at removing or tagging germline variants. The Tumor-Only may subsequently report somewhat elevated TMB values. The TMB proxi filter is an optional setting on top of the regular database germline filter. It will aggressively filter additional germline variants based on allele frequencies.

Option
Description

--tmb-vaf-threshold FLOAT

Variant minimum allele frequency for usable variants (default=0.05).

--vc-callability-tumor-thresh INT

Required read coverage to use a site (default=50).

--tmb-enable-proxi-filter BOOL

Use variant VAF information to increase germline filtering. Recommended for Tumor-Only (TO), but not for Tumor-Normal (TN). May be overly aggressive at tagging variants as germline (default=false).

MSI

For panels, it is recommended to post-process the microsatellite sites file by intersecting the WES or WGS sites with the panel manifest. This avoids using off-target reads in the MSI analysis. For small panels, it may be necessary to generate custom site files to ensure the panel covers at least 2000 sites. To generate custom MSI site files, refer to the MSI Biomarker section in the user guide.

Option
Description and recommended setting

--msi-coverage-threshold INT

Minimum coverage for a microsatellite: 60 (default)

--msi-distance-threshold FLOAT

Minimum Jensen-Shannon distance between tumor and normal for a microsatellite: 0.1 (default)

SV

Option
Description

--sv-call-regions-bed

Specifies a BED file containing the set of regions to call. Optionally gzip or bgzip format.

--sv-exclusion-bed

Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.

--enable-variant-deduplication true

Relevant when both SV and SNV callers are enabled in somatic workflows. Can increase sensitivity and prevent the occurrence of replicated variants within genes such as FLT3 and KMT2A. Filter all small indels in the structural variant VCF that appear and are passing in the small variant VCF. DRAGEN will create a new VCF that contains variants in SV VCF that are not matching a variant from SNV VCF file. The new deduplicated SV VCF file will have the same prefix passed by --output-file-prefix followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases.

--sv-systematic-noise $BEDPE

Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bgzip compressed). Optional for Tumor-Normal, but strongly recommended for Tumor-Only.

--sv-somatic-ins-tandup-hotspot-regions-bed $BED

Specify a custom BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. The default file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with 300 bp of padding on both sides.

--sv-min-candidate-variant-size

Run SV caller and report all SVs/indels at or above this size. The default value is set to 10.

--sv-min-scored-variant-size

After candidate identification, only score and report SVs/indels at or above this size. The default value is set to 50. This parameter doesn't affect the somatic hotspot region.

Option
Recommended Value for Liquid Tumors (e.g. AML/MLL)

--heme-sv true

Configures DRAGEN to use SV settings for Liquid Tumors (e.g., AML/MLL).

--sv-min-scored-variant-size $INT

100000

Resource Files

DRAGEN requires resource files for components such as SNV, SV, and CNV. The following notes provide references for downloading these files or generating them for custom workflows or assays.

SNV Systematic Noise

Systematic noise files are considered essential in Tumor-Only workflows. They are also recommended for Tumor-Normal workflows.

Prebuilt

Prebuilt WES/WGS noise files
Description

WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FF

FFPE_WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FFPE (only hg38)

WES_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WES FF and FFPE

Custom

Prebuilt systematic noise files are available for WES or WGS applications. For these applications, it is considered optional to build custom noise files. For high-sensitivity applications, including panels, it is required to build custom noise files. For best accuracy, the normal samples should ideally closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30–70 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on each of approximately 30-70 normal samples.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--vc-detect-systematic-noise=true 
--vc-target-bed-padding 500 
--vc-emit-ref-confidence BP_RESOLUTION 
--vc-enable-vcf-output true 
--vc-enable-germline-tagging=true 
--variant-annotation-data $PATH 

For panels, gVCF files are created. Gather the full paths to the small variant caller hard-filtered gVCFs (not VCFs) from step 1 and create an input file ${GVCF_LIST} with one path per line.
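
Before starting step 2, the list file can be checked for common mistakes. The following sketch is an optional sanity check and not a DRAGEN feature: it verifies that every entry is a full path to an existing, non-empty file. The function name is hypothetical.

```shell
# Sketch: verify that a list file holds one full path per line and that
# every listed file exists and is not empty.
check_list_file() {
    local list_file="$1" path failed=0
    while IFS= read -r path || [ -n "$path" ]; do
        [ -n "$path" ] || continue                  # ignore blank lines
        case "$path" in
            /*) ;;                                  # full path, as required
            *)  echo "not a full path: $path" >&2; failed=1 ;;
        esac
        [ -s "$path" ] || { echo "missing or empty: $path" >&2; failed=1; }
    done < "$list_file"
    return "$failed"
}

# Usage:
#   check_list_file "$GVCF_LIST" && echo "list looks usable"
```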

Step 2. Generate the final noise file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--build-sys-noise-vcfs-list ${GVCF_LIST} 

SV Systematic Noise

Systematic noise files are recommended for Tumor-Normal workflows and are considered essential for reducing false-positive (FP) calls in Tumor-Only workflows.

Prebuilt

Prebuilt WES/WGS noise files
Description

WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For WGS/WES FF/FFPE

IDPF_WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For HEME

Custom

Building a custom systematic noise file is considered optional for WES or WGS applications, but is strongly recommended for high-sensitivity applications such as panels. For best accuracy, the normal samples should closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30-100 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--sv-detect-systematic-noise true 

Step 2. Build the BEDPE file using input VCFs from previous step.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--sv-build-systematic-noise-vcfs-list $VCF_LIST    #one VCF per line 

CNV Panel of Normals (PON)

The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. CNV requires PON files for all targeted analyses (including panels, exomes, germline, tumor-only and tumor-normal workflows). It is recommended to use 30-100 normal samples when building the PON, but fewer may be used. If sample coverage noise is relatively stable, as few as 5 PON samples may yield acceptable results.

If a matched normal is available it is recommended to include it in the PON.

Follow the two steps below to generate CNV PON:

Step 1. Generate target counts of individual normal samples.

Any options used for panel of normals generation (BED file, GC Bias Correction, etc) should be matched when processing the case sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--enable-cnv true 
--cnv-target-bed $PATH 

Step 2. Combined counts generation.

Individual PON counts can be merged into a single file as a <prefix>.combined.counts.txt.gz file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--enable-cnv true 
--cnv-generate-combined-counts true 
--cnv-normals-list $CNV_NORMALS_LIST 

$CNV_NORMALS_LIST is a single list file with the path to each target counts file generated in step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz), one path per line. The output is a PON file with the suffix .combined.counts.txt.gz. Use the PON file in case sample runs of DRAGEN CNV with the --cnv-combined-counts option.
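
Since the options used for the normals should match across the PON, one easy mistake to catch is a list that mixes GC-corrected and uncorrected counts. The following is a minimal sketch of such a check. It relies only on the two file suffixes named above, and the function name is hypothetical.

```shell
# Sketch: summarize a CNV normals list and fail if it mixes GC-corrected and
# uncorrected target counts, contains other files, or is empty.
check_cnv_normals_list() {
    awk '
        /\.target\.counts\.gc-corrected\.gz$/ { corrected++; next }
        /\.target\.counts\.gz$/               { raw++; next }
        NF                                    { other++ }
        END {
            printf "gc-corrected=%d raw=%d other=%d\n", corrected, raw, other
            if (other > 0 || corrected + raw == 0 || (corrected > 0 && raw > 0))
                exit 1
        }
    ' "$1"
}

# Usage:
#   check_cnv_normals_list "$CNV_NORMALS_LIST" || echo "fix the list first"
```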

RNA WTS

The DRAGEN recipe includes the recommended pipeline-specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
# Inputs 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
# Mapper 
--enable-rna true 
--annotation-file $GTF                  #GTF or GFF3 format 
--enable-map-align true                 #required for RNA/scRNA 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
--enable-duplicate-marking true         #default=true 
# Small variant caller 
--enable-variant-caller true 
--vc-target-bed $VC_TARGET_BED 
# RNA Quantification 
--enable-rna-quantification true 
--rna-library-type A                    #see 'RNA Quant' 
--rna-quantification-gc-bias true 
# RNA Splice Variants 
--enable-rna-splice-variant true 
# RNA Gene Fusions 
--enable-rna-gene-fusion true 

Notes and additional options

Hashtable

For DRAGEN RNA/scRNA runs, it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

--fastq-list $PATH 
--fastq-list-sample-id $STRING 

FQ Input

--fastq-file1 $PATH 
--fastq-file2 $PATH 
--RGSM $STRING 
--RGID $STRING 

BAM Input

--bam-input $PATH 

CRAM Input

--cram-input $PATH 

Mapping and Aligning

Option
Description

--enable-map-align true

Optionally disable map & align (default=true).

--enable-map-align-output true

Optionally save the output BAM (default=false).

Duplicate Marking

Option
Description

--enable-duplicate-marking true

By default, DRAGEN marks duplicate reads and excludes them from variant calling.

RNA Variant Calling

Option
Description

--vc-target-bed $PATH

Restrict the variants called to a target bed. For WTS, a bed file specifying the gene-coding regions should be provided to avoid calling erroneous variants in non-coding regions due to noisy reads.

RNA Quant

Option
Description

--rna-library-type

Set the library according to the read orientations. Set to 'A' to auto detect the correct read orientation. Alternatively select 'IU', 'ISR', 'ISF', 'U', 'SR', or 'SF'.

RNA Splice

Option
Description

--rna-splice-variant-normals $PATH

Optional list of normal splice variants that will be used to filter false positive calls. The file should be tab-separated with the following first four columns: (1) contig name, (2) first base of the splice junction (1-based), (3) last base of the splice junction (1-based), (4) strand (0: undefined, 1: +, 2: -).

--rna-splice-variant-regions $PATH

Target region bed file. Required for panels. The name of the region must be specified in the fourth column.
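
The four-column layout described for --rna-splice-variant-normals can be checked before a run. The following sketch is an optional validator and not part of DRAGEN. It assumes the file has no header line and that the first base is not after the last base. The function name is hypothetical.

```shell
# Sketch: validate the first four columns of a normal splice variants file
# (contig, first base, last base, strand) as described above.
# Assumptions: no header line, and first base <= last base.
check_splice_normals() {
    awk -F'\t' '
        NF < 4 {
            print "line " NR ": fewer than 4 columns"; bad = 1; next
        }
        $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            print "line " NR ": positions must be integers"; bad = 1; next
        }
        $2 + 0 < 1 || $3 + 0 < $2 + 0 {
            print "line " NR ": positions must be 1-based, first <= last"
            bad = 1; next
        }
        $4 !~ /^[012]$/ {
            print "line " NR ": strand must be 0, 1 or 2"; bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Usage:
#   check_splice_normals normals.tsv && echo "format looks valid"
```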

RNA Fusion

Option
Description

--rna-gf-enriched-regions $PATH

For panels, the list of enriched genes should be set, either as a list of genes or a list of regions in BED format.

Other scRNA prep

The DRAGEN recipe includes the recommended pipeline-specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
# Inputs 
# Mapper 
--enable-rna true 
--annotation-file $GTF                  #GTF or GFF3 format 
--enable-map-align true                 #required for RNA/scRNA 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
# Single Cell 
--enable-single-cell-rna true 
--umi-source qname                      #default='qname' 
--scrna-barcode-position $BARCODE_POS 
--scrna-umi-position $UMI_POS           #see notes 
--scrna-barcode-sequence-list $PATH     #optional 
--single-cell-threshold ratio           #['fixed', 'ratio', 'inflection'] 
--single-cell-threshold-filterby umi    #['umi', 'read'] 

Notes and additional options

Hashtable

For DRAGEN RNA/scRNA runs, it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

--fastq-list $PATH 
--fastq-list-sample-id $STRING 

FQ Input

--fastq-file1 $PATH 
--fastq-file2 $PATH 
--RGSM $STRING 
--RGID $STRING 

BAM Input

--bam-input $PATH 

CRAM Input

--cram-input $PATH 

Mapping and Aligning

Option
Description

--enable-map-align true

Optionally disable map & align (default=true).

--enable-map-align-output true

Optionally save the output BAM (default=false).

Single-cell RNA options

To change the barcode or binning index positions, use --scrna-barcode-position and --scrna-umi-position. These settings should be provided in the form <startPos>_<endPos> for each barcode. Connect multiple barcode sequence positions with a '+'.

For example, a library with the cell-barcode split into three blocks of 9 bp separated by fixed linker sequences and an 8 bp UMI would be set to: --scrna-barcode-position 0_8+21_29+43_51, and --scrna-umi-position 52_59.
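
The position strings can be cross-checked against the expected barcode and UMI lengths. The following sketch assumes that positions are 0-based and inclusive, which is inferred from the example above (0_8 spans 9 bp). The function name is hypothetical.

```shell
# Sketch: total number of bases covered by a position string such as
# "0_8+21_29+43_51". Assumes 0-based, inclusive <startPos>_<endPos> blocks.
position_length() {
    local spec="$1" total=0 block start end
    local IFS='+'
    for block in $spec; do
        start=${block%%_*}
        end=${block##*_}
        total=$(( total + end - start + 1 ))
    done
    echo "$total"
}

# Usage with the example above:
#   position_length '0_8+21_29+43_51'   # prints 27 (three 9 bp blocks)
#   position_length '52_59'             # prints 8  (8 bp UMI)
```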

The following table lists some optional settings:

Option
Description

--enable-single-cell-rna true

Option to enable single-cell RNA mode.

--scrna-barcode-position

--scrna-umi-position

--single-cell-threshold

Cell filtering can be set to ['fixed', 'ratio', or 'inflection'].

--scrna-barcode-sequence-list

A known barcode sequence list can be optionally provided.

--umi-source

Optionally override the default barcode/BI source. Valid options include ['read1', 'read2', 'qname', 'fastq'].

RNA Panel

The DRAGEN recipe includes the recommended pipeline-specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
# Inputs 
--fastq-list $PATH                      #see 'Input Options' for FQ, BAM or CRAM 
--fastq-list-sample-id $STRING 
# Mapper 
--enable-rna true 
--annotation-file $GTF                  #GTF or GFF3 format 
--enable-map-align true                 #required for RNA/scRNA 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
--enable-duplicate-marking true         #default=true 
# Small variant caller 
--enable-variant-caller true 
--vc-target-bed $VC_TARGET_BED 
# RNA Quantification 
--enable-rna-quantification true 
--rna-library-type A                    #see 'RNA Quant' 
--rna-quantification-gc-bias true 
# RNA Splice Variants 
--enable-rna-splice-variant true 
--rna-splice-variant-regions $PATH 
# RNA Gene Fusions 
--enable-rna-gene-fusion true 
--rna-gf-enriched-regions $PATH         #see 'RNA Fusion' 

Notes and additional options

Hashtable

For DRAGEN RNA/scRNA runs, it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

--fastq-list $PATH 
--fastq-list-sample-id $STRING 

FQ Input

--fastq-file1 $PATH 
--fastq-file2 $PATH 
--RGSM $STRING 
--RGID $STRING 

BAM Input

--bam-input $PATH 

CRAM Input

--cram-input $PATH 

Mapping and Aligning

Option
Description

--enable-map-align true

Optionally disable map & align (default=true).

--enable-map-align-output true

Optionally save the output BAM (default=false).

Duplicate Marking

Option
Description

--enable-duplicate-marking true

By default, DRAGEN marks duplicate reads and excludes them from variant calling.

--enable-positional-collapsing true

Alternative to --enable-duplicate-marking=true. Instead of discarding duplicate reads, DRAGEN can optionally perform positional collapsing, merging them into higher-quality consensus reads. This is beneficial for small panels without UMIs and coverage between 300X and 1000X. However, it's slower than standard duplicate marking and less effective on samples with coverage lower than 300X. For very high coverage (1000X+), avoid it due to potential read collisions. For high-sensitivity panels with 1000X+ coverage, consider using UMIs.

RNA Variant Calling

Option
Description

--vc-target-bed $PATH

Restrict the variants called to a target bed. For WTS, a bed file specifying the gene-coding regions should be provided to avoid calling erroneous variants in non-coding regions due to noisy reads.

RNA Quant

Option
Description

--rna-library-type

Set the library according to the read orientations. Set to 'A' to auto detect the correct read orientation. Alternatively select 'IU', 'ISR', 'ISF', 'U', 'SR', or 'SF'.

RNA Splice

Option
Description

--rna-splice-variant-normals $PATH

Optional list of normal splice variants that will be used to filter false positive calls. The file should be tab-separated with the following first four columns: (1) contig name, (2) first base of the splice junction (1-based), (3) last base of the splice junction (1-based), (4) strand (0: undefined, 1: +, 2: -).

--rna-splice-variant-regions $PATH

Target region bed file. Required for panels. The name of the region must be specified in the fourth column.

RNA Fusion

Option
Description

--rna-gf-enriched-regions $PATH

For panels, the list of enriched genes should be set, either as a list of genes or a list of regions in BED format.

RNA Amplicon

To enable RNA amplicon, set:

  • --enable-rna-amplicon true, and

  • --amplicon-target-bed $PATH.

If RNA amplicon mode is enabled and the amplicon BED file already includes the gene names, it is not required to set the enriched regions option (--rna-gf-enriched-regions), since DRAGEN reads the enriched gene names from the fifth column of the amplicon BED file.
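
To see which gene names are present in the fifth column of an amplicon BED file, the column can be listed directly. The following sketch assumes a tab-separated BED file and skips comment, track, and browser lines. The function name is hypothetical.

```shell
# Sketch: list the unique gene names found in the fifth column of a
# tab-separated amplicon BED file.
amplicon_genes() {
    awk -F'\t' '!/^(#|track|browser)/ && NF >= 5 && $5 != "" { print $5 }' "$1" \
        | sort -u
}

# Usage:
#   amplicon_genes amplicons.bed
```

An empty result suggests that the BED file has no gene names in that column.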

Illumina scRNA

The DRAGEN recipe includes the recommended pipeline-specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
# Inputs 
# Mapper 
--enable-rna true 
--annotation-file $GTF                  #GTF or GFF3 format 
--enable-map-align true                 #required for RNA/scRNA 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
# Single Cell PIPseq 
--scrna-enable-pipseq-mode true 
--single-cell-threshold ratio           #['fixed', 'ratio', 'inflection'] 

Notes and additional options

Hashtable

For DRAGEN RNA/scRNA runs, it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: fastq list, fastq, bam, or cram.

FQ list Input

--fastq-list $PATH 
--fastq-list-sample-id $STRING 

FQ Input

--fastq-file1 $PATH 
--fastq-file2 $PATH 
--RGSM $STRING 
--RGID $STRING 

BAM Input

--bam-input $PATH 

CRAM Input

--cram-input $PATH 

Mapping and Aligning

Option
Description

--enable-map-align true

Optionally disable map & align (default=true).

--enable-map-align-output true

Optionally save the output BAM (default=false).

Single-cell RNA PIPseq options

PIPseq mode is a batch option that automatically sets the barcode/BI source, the barcode and binning index positions, and the barcode sequence list options.

By default the barcode/BI is read from read 1 and the transcript is obtained from read 2.

To change the barcode or binning index positions, use --scrna-barcode-position and --scrna-umi-position. These settings should be provided in the form <startPos>_<endPos> for each barcode. Connect multiple barcode sequence positions with a '+'.

For example, a library with the cell-barcode split into three blocks of 9 bp separated by fixed linker sequences and an 8 bp UMI would be set to: --scrna-barcode-position 0_8+21_29+43_51, and --scrna-umi-position 52_59.

The following table lists some optional settings:

Option
Description

--scrna-enable-pipseq-mode

Option to enable PIPseq mode.

--scrna-barcode-position

--scrna-umi-position

--single-cell-threshold

Cell filtering can be set to ['fixed', 'ratio', or 'inflection'].

--scrna-barcode-sequence-list

A known barcode sequence list can be optionally provided.

--umi-source

Optionally override the default barcode/BI source. Valid options include ['read1', 'read2', 'qname', 'fastq'].

DRAGEN Reference Support

DRAGEN supports the construction of reference hash tables for both human and non-human reference genomes. The reference autodetect feature of DRAGEN recognizes reference hash tables built on the four human reference genomes: hg19 (hg19), GRCh37/hs37d5 (hs37d5), GRCh38/hs38d1 (hg38), and T2T-CHM13v2.0 (chm13).

DRAGEN supports pangenome reference hash tables which extend the reference genomes with alternative variant paths from a sample cohort used to construct the pangenome reference. A pangenome-based reference improves the mapping accuracy of Illumina reads in the “Difficult-to-Map Regions” of the genome and the downstream variant calling.

The pangenome reference is recommended for germline human analyses. The accuracy achieved with pangenome references is highlighted in the plot below.

In the following tables we summarize the reference support for each DRAGEN component and the recommended reference type for each component.

Germline

| Component | Recommended Human Reference Type | Recommended Non-Human Reference Type | Human hg19 | Human hs37d5 | Human hg38 | Human chm13 | Non-Human |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SNV | Pangenome | Linear | Yes | Yes | Yes | Yes | Yes |
| CNV | Pangenome | Linear | Yes | Yes | Yes | Yes* | No |
| SV | Pangenome | Linear | Yes | Yes | Yes | Yes* | Yes |
| Expansion Hunter | Pangenome | Linear | Yes | Yes | Yes | No | No |
| Targeted Callers | Pangenome | Linear | Yes | Yes | Yes | No | No |
| RNA | Linear | Linear | Yes | Yes | Yes | Yes* | Yes |
| De Novo | Pangenome | Linear | Yes | Yes | Yes | Yes* | Yes |
| Joint Genotyping | Pangenome | Linear | Yes | Yes | Yes | Yes* | Yes |
| Biomarkers (HLA) | Pangenome | Linear | Yes | Yes | Yes | Yes* | No |
| gVCF genotyper | Pangenome | Linear | Yes | Yes | Yes | Yes* | Yes |

Somatic

| Component | Recommended Human Reference Type | Recommended Non-Human Reference Type | Human hg19 | Human hs37d5 | Human hg38 | Human chm13 | Non-Human |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SNV | Linear | Linear | Yes | Yes | Yes | Yes* | No |
| UMI SNV | Linear | Linear | Yes | Yes | Yes | Yes* | No |
| CNV | Linear | Linear | Yes | Yes | Yes | Yes* | No |
| SV | Linear | Linear | Yes | Yes | Yes | Yes* | No |

Methylation

| Component | Recommended Human Reference Type | Recommended Non-Human Reference Type | Human hg19 | Human hs37d5 | Human hg38 | Human chm13 | Non-Human |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Methylation | Linear | Linear | Yes | Yes | Yes | No | No |

Annotation

| Component | Recommended Human Reference Type | Recommended Non-Human Reference Type | Human hg19 | Human hs37d5 | Human hg38 | Human chm13 | Non-Human |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Nirvana | Pangenome | Linear | Yes | Yes | Yes | No | Yes |

* DRAGEN supports execution of the component; however, the component's accuracy has not been established.

By default, DRAGEN will error out if a linear reference is provided when running a component for which a pangenome reference is recommended, as listed in the tables above. If a linear reference is nevertheless desired, the error can be suppressed by setting --validate-pangenome-reference=false.
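The rule above can be mirrored in a small pre-flight check. The sketch below only transcribes the "Recommended Human Reference Type" column of the Germline table into a lookup; it is not DRAGEN's validation logic, and the dictionary and function names are hypothetical.

```python
# Recommended human reference type per germline component,
# transcribed from the Germline table in this section.
GERMLINE_RECOMMENDED = {
    "SNV": "pangenome",
    "CNV": "pangenome",
    "SV": "pangenome",
    "Expansion Hunter": "pangenome",
    "Targeted Callers": "pangenome",
    "RNA": "linear",
    "De Novo": "pangenome",
    "Joint Genotyping": "pangenome",
    "Biomarkers (HLA)": "pangenome",
    "gVCF genotyper": "pangenome",
}


def components_flagged(components, ref_type):
    """Return the components for which a pangenome reference is recommended
    while a linear reference is being supplied."""
    if ref_type != "linear":
        return []
    return [c for c in components if GERMLINE_RECOMMENDED[c] == "pangenome"]


flagged = components_flagged(["SNV", "RNA", "CNV"], "linear")
```

A non-empty result indicates a run that, per the text above, would be expected to stop unless --validate-pangenome-reference=false is set.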

DNA Somatic Tumor-Only ctDNA Panel UMI

The DRAGEN recipe includes the recommended pipeline specific commands.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
# Inputs 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
# Mapper 
--enable-map-align true                 #optional with BAM/CRAM input 
--enable-map-align-output true          #optionally save the output BAM 
--enable-sort true                      #default=true 
# UMI 
--umi-enable true 
--umi-source $STRING                    #Default='qname' 
--umi-library-type $STRING              #e.g. random-duplex 
--umi-min-supporting-reads 2            #Default=2 
# Small variant caller 
--enable-variant-caller true 
--vc-target-bed $VC_TARGET_BED 
--vc-systematic-noise $PATH             #Required 
--vc-enable-umi-liquid true             #>= 0.1% VAF 
# SV 
--enable-sv true 
--sv-systematic-noise $PATH             #Recommended 
--sv-exome true 
--sv-call-regions-bed $SV_TARGET_BED 
# CNV 
--enable-cnv true 
--cnv-target-bed $PATH 
--cnv-combined-counts $PATH             #CNV PON 
# Annotation 
--variant-annotation-data $PATH 
--vc-enable-germline-tagging true 
# TMB 
--enable-tmb true 
--vc-callability-tumor-thresh 1000 
--tmb-vaf-threshold 0.002 
--tmb-enable-proxi-filter true          #Optional for Tumor-Only 
# HLA genotyper 
--enable-hla true 
--hla-enable-class-2 true               #if panel covers class II HLA regions 
--hla-as-filter-min-threshold 29.0      #panel specific setting 
--hla-as-filter-ratio-threshold 0.85    #panel specific setting 
# Microsatellite Instability (MSI) 
--msi-command tumor-only 
--msi-ref-normal-dir $PATH 
--msi-microsatellites-file $PATH 
--msi-coverage-threshold 40 

Notes and additional options

Hashtable

For DRAGEN somatic runs it is recommended to use the linear hashtable.

Input options

DRAGEN input sources include: FASTQ list, FASTQ, BAM, or CRAM.

FQ list Input

--tumor-fastq-list $PATH 
--tumor-fastq-list-sample-id $STRING 

FQ Input

--tumor-fastq1 $PATH 
--tumor-fastq2 $PATH 
--RGSM-tumor $STRING 
--RGID-tumor $STRING 

BAM Input

--tumor-bam-input $PATH 

CRAM Input

--tumor-cram-input $PATH 

Mapping and Aligning

Option
Description

--enable-map-align true

Optionally disable map & align (default=true).

--enable-map-align-output true

Optionally save the output BAM (default=false).

--Aligner.clip-pe-overhang 2

Clean up any unwanted UMI indexes. Only use when reads contain UMIs, but UMI collapsing was not run.

Fractional (Raw Reads) Downsampling

DRAGEN can subsample a random, fractional percentage of reads from an input file using the fractional downsampler. You can use downsampling to subsample data sets in order to simulate different amounts of sequencing. DRAGEN randomly subsamples reads from primary analysis without any modification (e.g. no trimming, no filtering, etc.).

Downsampling may be useful to reduce runtime on very deep samples. For Tumor-Normal analyses it is also recommended to use a normal sample with lower coverage than the tumor sample. If the matched normal has deeper coverage than the tumor sample, the fractional downsampler may be used to reduce coverage on the normal sample.

Option
Description

--enable-fractional-down-sampler

Set to true to enable fractional downsampling. The default value is false.

--down-sampler-normal-subsample

Specify the fraction of reads to keep as a subsample of normal input data. The default value is 1.0 (100%).

--down-sampler-tumor-subsample

Specify the fraction of reads to keep as a subsample of tumor input data. The default value is 1.0 (100%).

--down-sampler-random-seed

Specify the random seed for different runs of the same input data. The default value is 42.
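The subsample fraction for a sample can be estimated from its current and desired coverage. The helper below is an illustrative sketch that assumes coverage scales linearly with the fraction of reads kept; the function name and the coverage values are hypothetical.

```python
def subsample_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep so that coverage drops to roughly the target.

    Assumes coverage is proportional to the number of reads kept.
    The result is capped at 1.0 because downsampling cannot add reads.
    """
    if current_coverage <= 0 or target_coverage <= 0:
        raise ValueError("coverage values must be positive")
    return min(1.0, target_coverage / current_coverage)


# Hypothetical case: a 120x matched normal reduced to about 40x.
normal_fraction = subsample_fraction(120, 40)
```

The resulting value would be passed to --down-sampler-normal-subsample (or --down-sampler-tumor-subsample for the tumor sample) together with --enable-fractional-down-sampler true.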

UMI

Option
Description

--umi-source STRING

Specify the input type for the UMI sequence. Options: qname, fastq, bamtag.

--umi-library-type STRING

Set the batch option for the type of UMI correction. Options: random-duplex, random-simplex, nonrandom-duplex.

--umi-nonrandom-whitelist $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The whitelist includes a valid UMI sequence per line.

--umi-correction-table $PATH

If UMI is nonrandom, either a whitelist or correction table is required. The correction table defaults to the table used by TruSight Oncology: <INSTALL_PATH>/resources/umi/umi_correction_table.txt.gz.

--umi-min-supporting-reads INT

Specify the number of matching UMI input reads required to generate a consensus read. Any family with insufficient supporting reads is discarded. The default is 2.

--umi-metrics-interval-file $BED

Target region in BED format.

--umi-emit-multiplicity both

--umi-start-mask-length INT

Number of additional bases to ignore from start of read. The default is 0. To reduce FP optionally set to 1.

--umi-end-mask-length INT

Number of additional bases to ignore from end of read. The default is 0. To reduce FP optionally set to 3.
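The effect of --umi-min-supporting-reads can be pictured with a toy family filter. This is a simplified sketch, not DRAGEN's collapsing algorithm: it assumes a family is the set of reads sharing exactly the same (chromosome, position, UMI) key and ignores UMI error correction and mate positions. All names and data are hypothetical.

```python
from collections import Counter


def surviving_families(reads, min_supporting_reads=2):
    """Count reads per family key and keep families with enough support.

    reads: iterable of (chrom, pos, umi) tuples, one per input read.
    """
    counts = Counter(reads)
    return {family: n for family, n in counts.items() if n >= min_supporting_reads}


reads = (
    [("chr1", 100, "ACGT")] * 3   # family with 3 supporting reads
    + [("chr1", 100, "TTTT")]     # singleton, discarded at the default of 2
    + [("chr2", 5, "GGGG")] * 2   # family with exactly 2 supporting reads
)
kept = surviving_families(reads)
```

With the default of 2, singleton families are dropped; raising the value trades consensus-read yield for stronger per-family support.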

SNV

Option
Description

--vc-target-bed

Limit variant calling to region of interest.

--vc-combine-phased-variants-distance INT

Maximum distance in base pairs (BP) over which phased variants will be combined. Set to 0 to disable. Valid range is [0; 15] BP (Default=2)

--vc-emit-ref-confidence GVCF

To enable gVCF output.

--vc-enable-vcf-output

To enable VCF file output during a gVCF run, set to true. The default value is false.

--vc-systematic-noise $PATH

Systematic noise file. This filter is recommended for removing systematic noise observed in normal samples.

--vc-somatic-hotspots $PATH

DRAGEN has a default set of hotspot variants (positions and alleles) where it will assign an increased prior probability. Use this option to override with a custom hotspots file.

--vc-enable-liquid-tumor-mode true

Tumor-in-normal contamination. Only use if there is some tumor leakage in the normal control.

--vc-override-tumor-pcr-params-with-normal false

Mixed sample preparation. Only use if the tumor and normal samples exhibit different PCR (indel) noise patterns, e.g., due to using different sample preparation.

--vc-sq-filter-threshold $INT

Threshold for sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.

--vc-excluded-regions-bed $BED

High-coverage sequencing panels allow for the detection of low-frequency alleles. DRAGEN supports 3 main settings for improved sensitivity on low VAF variant calls.

High Sensitivity Option
Description

--vc-target-vaf FLOAT

The default is 0.03 (3%). Set to e.g. 0.01 to improve SNV sensitivity on 1% VAF variants (assuming sufficient coverage).

--vc-enable-umi-solid true

Optimized for 1% and higher VAFs on UMI (or read position collapsed) samples with approx 300-1000X coverage.

--vc-enable-umi-liquid true

Optimized for 0.1% and higher VAFs on UMI samples with 1000X or higher coverage as expected in liquid biopsies.

HLA

Option
Description

--enable-hla

Enable HLA typer (this setting by default will only genotype class 1 genes)

--hla-as-filter-min-threshold

Internal option to set min alignment score threshold. The default is 59 and works for WES and WGS. Set to 29 for panels.

--hla-as-filter-ratio-threshold

Minimum alignment score of a read mate to be considered. The default is 0.67 and works for WES and WGS. Set to 0.85 for panels.

--hla-enable-class-2

Extend genotyping to HLA class 2 genes (default=true).

CNV

Option
Description

--cnv-enable-gcbias-correction true

Enable or disable GC bias correction when generating target counts.

--cnv-segmentation-mode $SEG_MODE

Option to override the default segmentation algorithm. Defaults include slm for germline WGS, aslm for somatic WGS, and hslm for targeted analysis.

--cnv-segmentation-bed $PATH

If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed.

--cnv-population-b-allele-vcf $POP_VCF

--heme-cnv true

Configures DRAGEN to use CNV settings for HEME.

Annotation

TMB

The Tumor-Normal pipeline is more effective than the Tumor-Only pipeline at removing or tagging germline variants. The Tumor-Only pipeline may consequently report somewhat elevated TMB values. The TMB proxi filter is an optional setting applied on top of the regular database germline filter. It aggressively filters additional germline variants based on allele frequencies.

Option
Description

--tmb-vaf-threshold FLOAT

Variant minimum allele frequency for usable variants. Default=0.05. Set to 0.002 for ctDNA.

--vc-callability-tumor-thresh

Required read coverage to use a site. Default=50. Set to 1000 for ctDNA.

MSI

For panels it is recommended to post-process the microsatellite sites file by intersecting the WES or WGS sites with the manifest. This avoids using off-target reads in the MSI analysis. For small panels it may be necessary to generate custom site files to ensure that the panel covers at least 2000 sites. To generate custom MSI site files, refer to the MSI Biomarker section in the user guide.
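The intersection step can be sketched as follows. This is an illustration only: it assumes both the sites and the manifest are available as half-open BED-style intervals, and it keeps a site if it overlaps any manifest region (whether overlap or full containment is intended is not stated here). The function names and example intervals are hypothetical.

```python
from collections import defaultdict

MIN_SITES = 2000  # minimum number of covered sites mentioned in the text


def intersect_sites(sites, targets):
    """Keep microsatellite sites that overlap at least one manifest region.

    sites, targets: lists of (chrom, start, end) with half-open coordinates.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in targets:
        by_chrom[chrom].append((start, end))
    kept = []
    for chrom, start, end in sites:
        if any(start < t_end and t_start < end for t_start, t_end in by_chrom[chrom]):
            kept.append((chrom, start, end))
    return kept


def enough_sites(kept, minimum=MIN_SITES):
    """True if the panel retains at least the minimum number of sites."""
    return len(kept) >= minimum


sites = [("chr1", 100, 120), ("chr1", 500, 520), ("chr2", 10, 30)]
manifest = [("chr1", 90, 130), ("chr2", 30, 60)]
on_target = intersect_sites(sites, manifest)
```

If enough_sites returns False for a panel, that is the situation in which the text above suggests generating a custom sites file.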

Option
Description and recommended setting for Liquid (cfDNA)

--msi-coverage-threshold INT

Minimum coverage for a microsatellite: 500

--msi-distance-threshold FLOAT

Minimum Jensen-Shannon distance between tumor and normal for a microsatellite: 0.02

SV

Option
Description

--sv-call-regions-bed

Specifies a BED file containing the set of regions to call. Optionally gzip or bgzip format.

--sv-exclusion-bed

Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.

--enable-variant-deduplication true

Relevant when both SV and SNV callers are enabled in somatic workflows. Can increase sensitivity and prevent the occurrence of replicated variants within genes such as FLT3 and KMT2A. Filters all small indels in the structural variant VCF that also appear, and are passing, in the small variant VCF. DRAGEN creates a new VCF that contains the variants in the SV VCF that do not match a variant in the SNV VCF file. The new deduplicated SV VCF file has the prefix passed by --output-file-prefix followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases.

--sv-systematic-noise $BEDPE

Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bgzip compressed). Optional for Tumor-Normal, but strongly recommended for Tumor-Only.

--sv-somatic-ins-tandup-hotspot-regions-bed $BED

Specify a custom BED file of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. The default file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with some padding on both sides (300 bp).

--sv-min-candidate-variant-size

Run SV caller and report all SVs/indels at or above this size. The default value is set to 10.

--sv-min-scored-variant-size

After candidate identification, only score and report SVs/indels at or above this size. The default value is set to 50. This parameter doesn't affect the somatic hotspot region.

Option
Recommended Value for Liquid Tumors (e.g. AML/MLL)

--heme-sv true

Configures DRAGEN to use SV settings for Liquid Tumors (e.g., AML/MLL).

--sv-min-scored-variant-size $INT

100000

Resource Files

DRAGEN requires resource files for components such as SNV, SV, and CNV. The following notes provide references for downloading these files or generating them for custom workflows or assays.

SNV Systematic Noise

Systematic noise files are considered essential in Tumor-Only workflows. They are also recommended for Tumor-Normal workflows.

Prebuilt

Prebuilt WES/WGS noise files
Description

WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FF

FFPE_WGS_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WGS FFPE (only hg38)

WES_hg38_v2.0.0_systematic_noise.snv.bed.gz

For WES FF and FFPE

Custom

Prebuilt systematic noise files are available for WES or WGS applications. For these applications, it is considered optional to build custom noise files. For high-sensitivity applications, including panels, it is required to build custom noise files. For best accuracy, the normal samples should ideally closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30–70 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on each of approximately 30-70 normal samples.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--umi-enable true 
--umi-source $STRING                    #default='qname' 
--umi-library-type $STRING              #see 'UMI' 
--vc-detect-systematic-noise true 
--vc-target-bed-padding 500 
--vc-emit-ref-confidence BP_RESOLUTION 
--vc-enable-vcf-output true 
--vc-enable-umi-liquid true 
--vc-enable-germline-tagging true 
--variant-annotation-data $PATH 

For panels, step 1 produces gVCF files. Gather the full paths to the small variant caller hard-filtered gVCFs (not VCFs) from step 1 and create an input file ${GVCF_LIST} that lists one file per line.
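Building the list file can be scripted. The sketch below is one possible way to do it and rests on an assumption: the glob pattern *.hard-filtered.gvcf.gz for the step 1 outputs is a guess at the file naming and should be adjusted to the actual output names. The function name is hypothetical.

```python
from pathlib import Path


def write_gvcf_list(run_dirs, list_path, pattern="*.hard-filtered.gvcf.gz"):
    """Collect gVCFs from each step 1 output directory and write one full path per line.

    pattern is an assumed naming convention, not taken from this guide.
    """
    paths = []
    for run_dir in run_dirs:
        paths.extend(sorted(p.resolve() for p in Path(run_dir).glob(pattern)))
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

The resulting file is what ${GVCF_LIST} refers to in step 2. Because the pattern requires the gvcf suffix, plain VCFs in the same directories are not picked up.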

Step 2. Generate the final noise file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--build-sys-noise-vcfs-list ${GVCF_LIST} 

SV Systematic Noise

Systematic noise files are recommended for Tumor-Normal workflows and are considered essential for reducing FP calls in Tumor-Only workflows.

Prebuilt

Prebuilt WES/WGS noise files
Description

WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For WGS/WES FF/FFPE

IDPF_WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz

For HEME

Custom

It is considered optional to build a custom systematic noise file for WES or WGS applications, but for high-sensitivity applications such as panels it is strongly recommended. For best accuracy, the normal samples should ideally closely match the sequencer, sample type, library prep, and coverage of the tumor samples of interest. It is typically recommended to use 30-100 normals when building a noise file, but fewer can be used.

Step 1. Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--umi-enable true 
--umi-source $STRING                    #default='qname' 
--umi-library-type $STRING              #see 'UMI' 
--sv-detect-systematic-noise true 

Step 2. Build the BEDPE file using input VCFs from previous step.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--sv-build-systematic-noise-vcfs-list $VCF_LIST   #one VCF per line 

CNV Panel of Normals (PON)

The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. CNV requires PON files for all targeted analyses (including panels, exomes, germline, tumor-only and tumor-normal workflows). It is recommended to use 30-100 normal samples when building the PON, but fewer may be used. If sample coverage noise is relatively stable, as few as 5 PON samples may yield acceptable results.

If a matched normal is available it is recommended to include it in the PON.

Follow the two steps below to generate the CNV PON:

Step 1. Generate target counts of individual normal samples.

Any options used for panel of normals generation (BED file, GC Bias Correction, etc) should be matched when processing the case sample.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--tumor-fastq-list $PATH                #see 'Input Options' for FQ, BAM or CRAM 
--tumor-fastq-list-sample-id $STRING 
--enable-cnv true 
--cnv-target-bed $PATH 

Step 2. Combined counts generation.

Individual PON counts can be merged into a single file as a <prefix>.combined.counts.txt.gz file.

  
/opt/dragen/$VERSION/bin/dragen         #DRAGEN install path 
--ref-dir $REF_DIR                      #path to DRAGEN linear hashtable 
--output-directory $OUTPUT 
--intermediate-results-dir $PATH        #e.g. SSD /staging 
--output-file-prefix $PREFIX 
--enable-cnv true 
--cnv-generate-combined-counts true 
--cnv-normals-list $CNV_NORMALS_LIST 

$CNV_NORMALS_LIST is a text file with one path per line, pointing to each target counts file generated in step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz). The output is a PON file with the suffix .combined.counts.txt.gz. Use the PON file in case sample runs of DRAGEN CNV with the --cnv-combined-counts option.
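The normals list can be assembled with a short script. This sketch assumes all step 1 target counts files were copied or linked into one directory, and it selects either the raw or the GC-corrected files (not a mix) so that the PON matches the GC bias correction setting used for the case sample. The function names are hypothetical; the two suffixes are the ones named above.

```python
from pathlib import Path

RAW_SUFFIX = ".target.counts.gz"
GC_SUFFIX = ".target.counts.gc-corrected.gz"


def pick_counts_files(counts_dir, use_gc_corrected):
    """Return the sorted target counts files of one kind from counts_dir."""
    suffix = GC_SUFFIX if use_gc_corrected else RAW_SUFFIX
    return sorted(str(p) for p in Path(counts_dir).iterdir() if p.name.endswith(suffix))


def write_normals_list(files, list_path):
    """Write one target counts path per line."""
    Path(list_path).write_text("".join(f"{f}\n" for f in files))
```

The written file is then passed as --cnv-normals-list in step 2.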

DRAGEN FASTQC

DRAGEN FastQC is a tool for calculating common metrics used for quality control of high-throughput sequencing data. The tool is modeled after the metrics generated by Babraham Institute's FastQC tool.

The metrics are generated automatically on all DRAGEN map-align workflows with no additional run time and output in a CSV format file called <PREFIX>.fastqc_metrics.csv. All metrics are calculated and reported separately for each mate-pair.

For users who are only interested in sample QC or who want FastQC results only, DRAGEN provides a mode to generate the fastqc_metrics.csv file directly.

By default DRAGEN FastQC and read-trimming are run as preprocessing steps to standard sequence alignment workflows. If DNA alignment is not needed or if QC results are needed more quickly, the mapping and BAM output portions of the workflow can be disabled. The workflow only outputs key metric files and runs ~70% faster. This option is available on the command-line by entering --fastqc-only=true after the DRAGEN command.

If FastQC runs stand-alone, then the license will not be consumed. If FastQC runs with map-align enabled, then the license will be consumed.

Differences from the Babraham Institute's FastQC

DRAGEN FastQC is a complete reimplementation of the original FastQC tool developed by the Babraham Institute (henceforth BI-FastQC). The reimplementation of FastQC in DRAGEN, however, has been modified to take advantage of the hardware-acceleration provided by the DRAGEN Field-Programmable Gate Array (FPGA) for a significant speed improvement. As such, there are some differences in how the values are calculated and the resulting metrics will not be exactly identical between the two tools. The most significant differences are described below.

  • Binning: BI-FastQC uses a customizable binning strategy with a default of 5bp bins, while DRAGEN uses an algorithmic binning strategy based on the Granularity setting described below. In general, this should mean that DRAGEN provides more precise results at default settings.

  • Outputs: BI-FastQC text output contains the same information as its plots in tabular format, while DRAGEN-FastQC outputs its raw data. For example, BI-FastQC both plots and outputs the average base quality per position, while DRAGEN outputs the average base quality by both position and nucleotide. This allows for a more detailed analysis of the data, but requires slightly more work to generate the associated plot.

  • Rounding: DRAGEN consistently rounds its calculations to the nearest integer, while the original FastQC uses a mixture of rounding and taking the mathematical floor, leading DRAGEN-FastQC to provide incrementally higher results for some metrics.

  • Smoothing: Both DRAGEN-FastQC and BI-FastQC utilize smoothing techniques for their distributions of %GC, to account for the fact that 151bp do not divide evenly into 100 percentile bins. However, to take advantage of the speed offered by the FPGA, DRAGEN utilizes a slightly different algorithm than BI-FastQC which results in slightly different results.

Metric Granularity

Due to memory constraints, it is not possible to guarantee single-base resolution for all metrics. DRAGEN provides an algorithmic solution for binning via --fastqc-granularity. DRAGEN allocates 256 bins in memory for each size or position-based metric. A granularity value of 4–7 inclusive determines the bin size. Higher values use smaller bins for greater resolution. Lower values create larger bins for longer read lengths.

If a value for --fastqc-granularity is not provided by the user, DRAGEN will attempt to estimate the read length of the input data and set the granularity accordingly.

Adapter and Kmer Sequence Files

To include metrics for adapter or other sequence content, DRAGEN FastQC needs to be provided with the desired sequences in FASTA format. DRAGEN provides two options for this purpose, --fastqc-adapter-file for adapter sequences and --fastqc-kmer-file for any additional kmers of interest so that users can add sequences of interest without changing the expected adapter results.

DRAGEN FastQC can accept up to a combined total of 16 adapter and kmer sequences. Each sequence can be a maximum of 12 bp in length. By default, DRAGEN uses the adapter file located at <INSTALL_PATH>/config/adapter_sequences.fasta. The file contains the same adapter sequences as Babraham's FastQC v 0.11.10 and later.

  • Illumina Universal Adapter--AGATCGGAAGAG

  • Illumina Small RNA 3' Adapter--TGGAATTCTCGG

  • Illumina Small RNA 5' Adapter--GATCGTCGGACT

  • Nextera Transposase Sequence--CTGTCTCTTATA
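The two limits stated above (at most 16 sequences combined, at most 12 bp each) can be checked before a run. The following is a minimal sketch with hypothetical function names; the FASTA text reproduces the four adapters listed above, but the exact layout of the shipped adapter_sequences.fasta is an assumption.

```python
MAX_SEQUENCES = 16  # combined adapter + kmer limit stated in the text
MAX_LENGTH = 12     # maximum sequence length in bp stated in the text


def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_limits(adapter_records, kmer_records=()):
    """Return a list of human-readable problems; an empty list means the input fits."""
    problems = []
    combined = list(adapter_records) + list(kmer_records)
    if len(combined) > MAX_SEQUENCES:
        problems.append(f"{len(combined)} sequences exceed the limit of {MAX_SEQUENCES}")
    for name, seq in combined:
        if len(seq) > MAX_LENGTH:
            problems.append(f"{name}: {len(seq)} bp exceeds {MAX_LENGTH} bp")
    return problems


ADAPTERS = """>Illumina Universal Adapter
AGATCGGAAGAG
>Illumina Small RNA 3' Adapter
TGGAATTCTCGG
>Illumina Small RNA 5' Adapter
GATCGTCGGACT
>Nextera Transposase Sequence
CTGTCTCTTATA
"""

adapters = read_fasta(ADAPTERS)
problems = check_limits(adapters, kmer_records=[("custom_kmer", "ACGTACGTACGTA")])
```

Here the four default adapters pass, while the hypothetical 13 bp custom kmer is reported as too long.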

FastQC Metrics Output

The FastQC metrics are output to a CSV file format in the run output directory called

  • <PREFIX>.fastqc_metrics.csv

The reported metrics are broken down into eight sections by metric type. Each section is broken down further into separate rows by either the length, position, or other relevant categorical variables. The following are the metric sections.

  • Read Mean Quality---Total number of reads with each average Phred-scale quality value, rounded to the nearest integer.

  • Positional Base Mean Quality---Average Phred-scale quality value of bases with a specific nucleotide and at a given location in the read. Locations are listed first and can be either specific positions or ranges. The nucleotide is listed second and can be A, C, G, or T. N or ambiguous bases are assumed to have the system default value, usually QV2.

  • Positional Base Content---Number of bases of each specific nucleotide at given locations in the read. Locations are given first and can be either specific positions or ranges. The nucleotide is listed second and can be A, C, G, T, N.

  • Read Lengths---Total number of reads with each observed length. Lengths can be either specific sizes or ranges, depending on settings specified using --fastqc-granularity.

  • Read GC Content---Total number of reads with each GC content percentile between 0% and 100%.

  • Read GC Content Quality---Average Phred-scale read mean quality for reads with each GC content percentile between 0% and 100%.

  • Sequence Positions---Number of times an adapter or other kmer sequence is found, starting at a given position in the input reads. Sequences are listed first in the metric description in quotes. Locations are listed second and can be either specific positions or ranges.

  • Positional Quality---Phred-scale quality value for bases at a given location and a given quantile of the distribution. Locations are listed first and can be either specific positions or ranges. Quantiles are listed second and can be any whole integer 0–100.

The following are example rows from each section.

DRAGEN DNA Pipeline

The DRAGEN DNA Pipeline accelerates the secondary analysis of NGS data by harnessing the tremendous power available on the DRAGEN Platform. The pipeline includes highly optimized algorithms for mapping, aligning, sorting, duplicate marking, and haplotype variant calling. In addition to haplotype variant calling, the pipeline supports calling of copy number and structural variants as well as detection of repeat expansions and targeted calls.

DNA Mapping

Seed Density Option

The seed-density option controls how many (normally overlapping) primary seeds from each read the mapper looks up in its hash table for exact matches. The maximum density value of 1.0 generates a seed starting at every position in the read, i.e., (L-K+1) K-base seeds from an L-base read.

Seed density must be between 0.0 and 1.0. Internally, an available seed pattern equal or close to the requested density is selected. The sparsest pattern is one seed per 32 positions, or density 0.03125.

  • Accuracy Considerations--Generally, denser seed lookup patterns improve mapping accuracy. However, for modestly long reads (e.g., 50 bp+) and low sequencer error rates, there is little to be gained beyond the default 50% seed lookup density.

  • Speed Considerations--Denser seed lookup patterns generally slow down mapping, and sparser seed patterns speed it up. However, when the seed mapping stage can run faster than the aligning stage, a sparser seed pattern does not make the mapper much faster.

Relationship to Reference Seed Interval

Functionally, a denser or sparser seed lookup pattern has an impact very similar to a shorter or longer reference seed interval (build hash table option --ht-ref-seed-interval). Populating 100% of reference seed positions and looking up 50% of read seed positions has the same effect as populating 50% of reference seed positions and looking up 100% of read seed positions. Either way, the expected density of seed hits is 50%.

More generally, the expected density of seed hits is the product of the reference seed density (the inverse of the reference seed interval) and the seed lookup density. For example, if 50% of reference seeds are populated and 33.3% (1/3) of read seed positions are looked up, then the expected seed hit density should be 16.7% (1/6).
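The product rule can be checked with exact fractions. This is a small arithmetic sketch; the function name is hypothetical, and the inputs are the two worked cases from the text above.

```python
from fractions import Fraction


def expected_hit_density(ref_seed_interval, seed_density):
    """Expected seed hit density: (1 / reference seed interval) * seed lookup density."""
    ref_density = 1 / Fraction(ref_seed_interval)
    return ref_density * Fraction(seed_density)


# 50% of reference seeds populated (interval 2), 1/3 of read positions looked up.
one_sixth = expected_hit_density(2, Fraction(1, 3))

# The two equivalent configurations described above both give 50%.
dense_ref = expected_hit_density(1, Fraction(1, 2))   # 100% reference, 50% lookup
sparse_ref = expected_hit_density(2, 1)               # 50% reference, 100% lookup
```

Using Fraction avoids floating-point rounding, so the 1/6 result in the text is reproduced exactly.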

DRAGEN automatically adjusts its precise seed lookup pattern to ensure it does not systematically miss the seed positions populated from the reference. For example, the mapper does not look up seeds matching only odd positions in the reference when only even positions are populated in the hash table, even if the reference seed interval is 2 and seed-density is 0.5.

Map Orientations Option

The --Mapper.map-orientations option is used in mapping reads for bisulfite methylation analysis. It is set automatically based on the value set for --methylation-protocol.

The --Mapper.map-orientations option can restrict the orientation of read mapping to only forward in the reference genome, or only reverse-complemented. The valid values for --Mapper.map-orientations are as follows.

  • 0--Either orientation (default)

  • 1--Only forward mapping

  • 2--Only reverse-complemented mapping

If mapping orientations are restricted and paired end reads are used, the expected pair orientation can only be FR, not FF or RF.

Seed-Editing Options

Although DRAGEN primarily maps reads by finding exact reference matches to short seeds, it can also map seeds differing from the reference by one nucleotide by also looking up single-SNP edited seeds. Seed editing is usually not necessary with longer reads (100 bp+), because longer reads have a high probability of containing at least one exact seed match. This is especially true when paired ends are used, because a seed match from either mate can successfully align the pair. But seed editing can, for example, be useful to increase mapping accuracy for short single-ended reads, with some cost in increased mapping time. The following options control seed editing:

edit-mode and edit-chain-limit

The edit-mode and edit-chain-limit options control when seed editing is used. The following four edit-mode values are available:

Edit mode 0 requires all seeds to match exactly. Mode 3 is the most expensive because every seed that fails to match the reference exactly is edited. Modes 1 and 2 employ heuristics to look up edited seeds only for reads most likely to be salvaged to accurate mapping.

The main heuristic in edit modes 1 and 2 is a seed chain length test. Exact seeds are mapped to the reference in a first pass over a given read, and the matching seeds are grouped into chains of similarly aligning seeds. If the longest seed chain (in the read) exceeds a threshold edit-chain-limit, the read is judged not to require seed editing, because there is already a promising mapping position.

Edit mode 1 triggers seed editing for a given read using the seed chain length test. If no seed chain exceeds edit-chain-limit (including if no exact seeds match), then a second seed mapping pass is attempted using edited seeds. Edit mode 2 further optimizes the heuristic for paired-end reads. If either mate has an exact seed chain longer than edit-chain-limit, then seed editing is disabled for the pair, because a rescue scan is likely to recover the mate alignment based on seed matches from one read. Edit mode 2 is the same as mode 1 for single-ended reads.
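The seed chain length test for edit modes 1 and 2 can be sketched as follows. This is a minimal illustration of the decision described above, not DRAGEN's implementation; the function and argument names are hypothetical, and a read with no exact seed matches is represented by a longest chain length of 0.

```python
def seed_editing_triggered(edit_mode, longest_chain, edit_chain_limit,
                           mate_longest_chain=None):
    """Return True if a second, edited-seed pass would run for this read.

    Covers only edit modes 1 and 2, which use the chain length test.
    mate_longest_chain is None for single-ended reads.
    """
    if edit_mode not in (1, 2):
        raise ValueError("the chain length test applies only to edit modes 1 and 2")

    read_has_long_chain = longest_chain > edit_chain_limit
    if edit_mode == 2 and mate_longest_chain is not None:
        # Mode 2: a long exact chain in either mate disables editing for the pair.
        mate_has_long_chain = mate_longest_chain > edit_chain_limit
        return not (read_has_long_chain or mate_has_long_chain)
    # Mode 1, or mode 2 on a single-ended read: only this read's chains count.
    return not read_has_long_chain

print(seed_editing_triggered(1, longest_chain=0, edit_chain_limit=8))
print(seed_editing_triggered(2, longest_chain=3, edit_chain_limit=8,
                             mate_longest_chain=20))
```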

edit-seed-num and edit-read-len

For edit modes 1 and 2, when the heuristic triggers seed editing, these options control how many seed positions are edited in the second pass over the read. Although exact seed mapping can use a densely overlapping seed pattern, such as seeds starting at 50% or 100% of read positions, most of the value of seed editing can be obtained by editing a much sparser pattern of seeds, even a nonoverlapping pattern. Generally, if a user application can afford to spend some additional amount of mapping time on seed editing, a greater increase in mapping accuracy can be obtained for the same time cost by editing seeds in sparse patterns for a large number of reads, than by editing seeds in dense patterns for a small number of reads.

Whenever seed editing is triggered, these two options request edit-seed-num seed editing positions, distributed evenly over the first edit-read-len bases of the read. For example, with 21-base seeds, edit-seed-num=6 and edit-read-len=100, edited seeds can begin at offsets {0, 16, 32, 48, 64, 80} from the 5' end, consecutive seeds overlapping by 5 bases. Because sequencing technologies often yield better base qualities nearer the (5') beginning of each read, this can focus seed editing where it is most likely to succeed. When a particular read is shorter than edit-read-len, fewer seeds are edited.
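The offsets in this example can be reproduced with a simple spacing rule. The rule below (integer step of edit-read-len divided by edit-seed-num) is an assumption chosen only because it matches the numbers quoted above; the exact placement rule DRAGEN uses is not stated here, and the function name is hypothetical.

```python
def edit_seed_offsets(seed_len, edit_seed_num, edit_read_len):
    """Evenly spaced edited-seed start offsets from the 5' end.

    Assumed rule: step = edit_read_len // edit_seed_num.
    Returns (offsets, overlap), where overlap is the number of bases shared
    by consecutive seeds (negative would mean a gap between seeds).
    """
    step = edit_read_len // edit_seed_num
    offsets = [i * step for i in range(edit_seed_num)]
    overlap = seed_len - step
    return offsets, overlap

offsets, overlap = edit_seed_offsets(seed_len=21, edit_seed_num=6, edit_read_len=100)
print(offsets, overlap)
```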

Seed editing is more expensive when the reference seed interval (build hash table option --ht-ref-seed-interval) is greater than 1. For edit modes 1 and 2, additional seed editing positions are automatically generated to avoid missing the populated reference seed positions. For edit mode 3, the time cost can increase dramatically because query seeds matching unpopulated reference positions typically miss and trigger editing.

DNA Aligning

Smith-Waterman Alignment Scoring Settings

The first stage of mapping is to generate seeds from the read and look for exact matches in the reference genome. These results are then refined by running full Smith-Waterman alignments on the locations with the highest density of seed matches. This well-documented algorithm works by comparing each position of the read against all the candidate positions of the reference. These comparisons correspond to a matrix of potential alignments between read and reference. For each of these candidate alignment positions, Smith-Waterman generates scores that are used to evaluate whether the best alignment passing through that matrix cell reaches it by a nucleotide match or mismatch (diagonal movement), a deletion (horizontal movement), or an insertion (vertical movement). A match between read and reference adds a bonus to the score, and a mismatch or indel imposes a penalty. The overall highest scoring path through the matrix is the alignment chosen.

The specific values chosen for scores in this algorithm indicate how to balance, for an alignment with multiple possible interpretations, the possibility of an indel as opposed to one or more SNPs, or the preference for an alignment without clipping. The default DRAGEN scoring values are reasonable for aligning moderate length reads to a whole human reference genome for variant calling applications. But any set of Smith-Waterman scoring parameters represents an imprecise model of genomic mutation and sequencing errors, and differently tuned alignment scoring values can be more appropriate for some applications.
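The matrix recurrence described above can be sketched as a textbook Smith-Waterman local alignment with affine gap penalties. This is a plain software illustration, not the DRAGEN hardware implementation. The scoring values used as defaults are hypothetical placeholders, not DRAGEN defaults, and a gap of length L is assumed to cost gap_open + L × gap_ext.

```python
def smith_waterman_score(read, ref, match=1, mismatch=4, gap_open=6, gap_ext=1):
    """Best local alignment score between read (rows) and ref (columns).

    Diagonal move = match/mismatch, horizontal move = deletion (consumes
    reference only), vertical move = insertion (consumes read only).
    Scoring values are illustrative placeholders.
    """
    neg_inf = float("-inf")
    rows, cols = len(read) + 1, len(ref) + 1
    best_at = [[0] * cols for _ in range(rows)]          # best score ending at cell
    in_deletion = [[neg_inf] * cols for _ in range(rows)]   # path ends in a deletion
    in_insertion = [[neg_inf] * cols for _ in range(rows)]  # path ends in an insertion
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            in_deletion[i][j] = max(best_at[i][j - 1] - gap_open - gap_ext,
                                    in_deletion[i][j - 1] - gap_ext)
            in_insertion[i][j] = max(best_at[i - 1][j] - gap_open - gap_ext,
                                     in_insertion[i - 1][j] - gap_ext)
            step = match if read[i - 1] == ref[j - 1] else -mismatch
            diagonal = best_at[i - 1][j - 1] + step
            # Local alignment: scores never drop below 0 (clipping).
            best_at[i][j] = max(0, diagonal, in_deletion[i][j], in_insertion[i][j])
            best = max(best, best_at[i][j])
    return best

# A read spanning a 1-base deletion: 20 matches minus one gap (6 + 1) scores 13,
# which beats clipping to either 10-base half.
read = "ACGTACGTAC" + "GGATCCGGAT"
ref = "ACGTACGTAC" + "T" + "GGATCCGGAT"
print(smith_waterman_score(read, ref))
```

Raising gap_open in this sketch until the gapped path scores below 10 makes the clipped 10-base alignment win instead, which is the kind of tradeoff the scoring options control.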

The following alignment options control Smith-Waterman Alignment:

  • global The global option (value can be 0 or 1) controls whether alignment is forced to be end-to-end in the read. When set to 1, alignments are always end-to-end, as in the Needleman-Wunsch global alignment algorithm (although not end-to-end in the reference), and alignment scores can be positive or negative. When set to 0, alignments can be clipped at either or both ends of the read, as in the Smith-Waterman local alignment algorithm, and alignment scores are nonnegative. Generally, global=0 is preferred for longer reads, so significant read segments after a break of some kind (large indel, structural variant, chimeric read, and so forth) can be clipped without severely decreasing the alignment score. Setting global=1 might not have the desired effect with longer reads because insertions at or near the ends of a read can function as pseudoclipping. Also, with global=0, multiple (chimeric) alignments can be reported when various portions of a read match widely separated reference positions. Using global=1 is sometimes preferable with short reads, which are unlikely to overlap structural breaks, unable to support chimeric alignments, and are suspected of incorrect mapping if they cannot align well end-to-end. Consider using the unclip-score option, or increasing it, instead of setting global=1, to make a soft preference for unclipped alignments.

  • match-score The match-score option specifies the score for a read nucleotide matching a reference nucleotide (A, C, G, or T), or matching a reference 2–3 nucleotide IUPAC-IUB code. Its value is an unsigned integer, from 0 to 15. match-score=0 can only be used when global=1. A higher match score results in longer alignments, and fewer long insertions.

  • match-n-score The match-n-score option specifies the score for an aligned position where the read position and/or the reference position is an N code. This option is a signed integer, from -16 to 15.

  • mismatch-pen The mismatch-pen option is the penalty (negative score) for a read nucleotide mismatching any reference nucleotide or IUPAC-IUB code, except N. This option is an unsigned integer, from 0 to 63. A higher mismatch penalty results in alignments with more insertions, deletions, and clipping to avoid SNPs.

  • gap-open-pen The gap-open-pen option is the penalty (negative score) for opening a gap (ie, an insertion or deletion). This value is only for a 0-base gap. It is always added to the gap length times gap-ext-pen. This option is an unsigned integer, from 0 to 127. A higher gap open penalty causes fewer insertions and deletions of any length in alignment CIGARs, with clipping or alignment through SNPs used instead.

  • gap-ext-pen The gap-ext-pen option is the penalty (negative score) for extending a gap (ie, an insertion or deletion) by one base. This option is an unsigned integer, from 0 to 15. A higher gap extension penalty causes fewer long insertions and deletions in alignment CIGARs, with short indels, clipping, or alignment through SNPs used instead.

  • unclip-score The unclip-score option is the score bonus for an alignment reaching the beginning or end of the read. An end-to-end alignment receives twice this bonus. This option is an unsigned integer, from 0 to 127. A higher unclipped bonus causes alignment to reach the beginning and/or end of a read more often, where this can be done without too many SNPs or indels. A nonzero unclip-score is useful when global=0 to make a soft preference for unclipped alignments. Unclipped bonuses have little effect on alignments when global=1, because end-to-end alignments are forced anyway (although 2 × unclip-score does add to every alignment score unless no-unclip-score = 1). Note that, especially with longer reads, setting unclip-score much higher than gap-open-pen can have the undesirable effect of insertions at or near one end of a read being utilized as pseudoclipping, as happens with global=1.

  • no-unclip-score The no-unclip-score option can be 0 or 1. The default is 1. When no-unclip-score is set to 1, any unclipped bonus (unclip-score) contributing to an alignment is removed from the alignment score before further processing, such as comparison with aln-min-score, comparison with other alignment scores, and reporting in AS or XS tags. However, the unclipped bonus still affects the best-scoring alignment found by Smith-Waterman alignment to a given reference segment, biasing toward unclipped alignments. When unclip-score > 0 causes a Smith-Waterman local alignment to extend out to one or both ends of the read, the alignment score stays the same or increases if no-unclip-score=0, whereas it stays the same or decreases if no-unclip-score=1. The default, no-unclip-score=1, is recommended when global=1, because every alignment is end-to-end, and there is no need to add the same bonus to every alignment. When changing no-unclip-score, consider whether aln-min-score should be adjusted. When no-unclip-score=0, unclipped bonuses are included in alignment scores compared to the aln-min-score floor, so the subset of alignments filtered out by aln-min-score can change significantly with no-unclip-score.

  • aln-min-score The aln-min-score option specifies a minimum acceptable alignment score. Any alignment results scoring lower are discarded. Increasing or decreasing aln-min-score can reduce or increase the percentage of reads mapped. This option is a signed integer (negative alignment scores are possible with global=1). aln-min-score also affects MAPQ estimates. The primary contributor to MAPQ calculation is the difference between the best and second-best alignment scores. A read's best alignment score is saved in the AS SAM tag, and the second-best score (if available) is saved in the XS tag. aln-min-score serves as the suboptimal alignment score if nothing higher was found except the best score. Therefore, increasing aln-min-score can decrease reported MAPQ for some low-scoring alignments. You can use the min-score-coeff option to adjust aln-min-score as a function of read length.

  • min-score-coeff The min-score-coeff option makes adjustments to aln-min-score per read base. When using the min-score-coeff and aln-min-score options together, you can define the minimum alignment score for each read as an affine function of read length. The minimum score for an N-base read is calculated as follows: (min-score-coeff) × N + (aln-min-score). The min-score-coeff option is a signed number ranging from –64 to 63.999. If the value is 0, then the minimum alignment score is fixed at aln-min-score for all read lengths. You can use positive values for min-score-coeff to allow shorter reads to match with lower alignment scores, but require longer reads to achieve higher scores.
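The way these options combine can be sketched in a few lines of Python. This is a simplified model built only from the descriptions above. The function names and all numeric values are hypothetical (they are not DRAGEN defaults), and the alignment is summarized by counts rather than a CIGAR.

```python
def alignment_score(matches, mismatches, gap_lengths, unclipped_ends, *,
                    match_score, mismatch_pen, gap_open_pen, gap_ext_pen,
                    unclip_score, no_unclip_score):
    """Score an alignment summarized by counts (illustrative model only).

    gap_lengths: one entry per insertion or deletion, in bases.
    unclipped_ends: 0, 1, or 2 read ends reached by the alignment.
    """
    # Each gap costs gap-open-pen plus gap length times gap-ext-pen.
    gap_cost = sum(gap_open_pen + length * gap_ext_pen for length in gap_lengths)
    score = matches * match_score - mismatches * mismatch_pen - gap_cost
    bonus = unclipped_ends * unclip_score
    # With no-unclip-score=1 the bonus is removed before further processing.
    return score if no_unclip_score else score + bonus

def min_alignment_score(read_len, min_score_coeff, aln_min_score):
    """Minimum acceptable score: (min-score-coeff) * N + (aln-min-score)."""
    return min_score_coeff * read_len + aln_min_score

# Hypothetical scoring values for a 100-base read aligned end-to-end with
# 96 matches, 2 mismatches, and one 2-base insertion.
params = dict(match_score=1, mismatch_pen=4, gap_open_pen=6, gap_ext_pen=1,
              unclip_score=5)
reported = alignment_score(96, 2, [2], 2, no_unclip_score=1, **params)
with_bonus = alignment_score(96, 2, [2], 2, no_unclip_score=0, **params)
floor = min_alignment_score(100, min_score_coeff=0.5, aln_min_score=22)
print(reported, with_bonus, floor, reported >= floor)
```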

Paired-End Options

DRAGEN can process paired-end data passed via a pair of FASTQ files or in a single interleaved FASTQ file. The hardware maps the two ends separately, and then determines a set of alignments that seem most likely to form a pair in the expected orientation and having roughly the expected insert size. The alignments for the two ends are evaluated for the quality of their pairing, with larger penalties for insert sizes far from the expected size. The following options control processing of paired-end data:

  • pe-orientation The pe-orientation option specifies the expected paired-end orientation. Only pairs with this orientation can be flagged as proper pairs. Valid values are as follows:

    • 0--FR (default)

    • 1--RF

    • 2--FF

  • unpaired-pen For paired end reads, best mapping positions are determined jointly for each pair, according to the largest pair score found, considering the various combinations of alignments for each mate. A pair score is the sum of the two alignment scores minus a pairing penalty, which estimates the unlikelihood of insert lengths further from the mean insert than this aligned pair. The unpaired-pen option specifies how much alignment pair scores should be penalized when the two alignments are not in properly paired position or orientation. This option also serves as the maximum pairing penalty for properly paired alignments with extreme insert lengths. The unpaired-pen option is specified in Phred scale, according to its potential impact on MAPQ. Internally, it is scaled into alignment score space based on Smith-Waterman scoring parameters.

  • pe-max-penalty The pe-max-penalty option limits how much the estimated MAPQ for one read can increase because its mate aligned nearby. A paired alignment is never assigned MAPQ higher than the MAPQ that it would have received mapping single-ended, plus this value. By default, pe-max-penalty = mapq-max = 255, effectively disabling this limit. The key difference between unpaired-pen and pe-max-penalty is that unpaired-pen affects calculated pair scores, and thus which alignments are selected, whereas pe-max-penalty affects only the reported MAPQ for paired alignments.
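The difference between the two options can be illustrated with a small sketch. It assumes that unpaired-pen has already been scaled from Phred scale into alignment score space (the scaling itself is internal to DRAGEN and not modeled here), and the function names and example values are hypothetical.

```python
def pair_score(score1, score2, pairing_penalty, unpaired_pen, properly_paired):
    """Pair score = sum of mate alignment scores minus a pairing penalty.

    unpaired_pen is assumed to be already expressed in alignment score units.
    It is charged in full for improperly paired alignments and caps the
    insert-size penalty for properly paired ones.
    """
    if properly_paired:
        penalty = min(pairing_penalty, unpaired_pen)
    else:
        penalty = unpaired_pen
    return score1 + score2 - penalty

def reported_pair_mapq(paired_mapq, single_end_mapq, pe_max_penalty, mapq_max):
    """Cap the paired MAPQ at single-ended MAPQ + pe-max-penalty, then at mapq-max."""
    return min(paired_mapq, single_end_mapq + pe_max_penalty, mapq_max)

print(pair_score(90, 85, pairing_penalty=3, unpaired_pen=80, properly_paired=True))
print(pair_score(90, 85, pairing_penalty=0, unpaired_pen=80, properly_paired=False))
print(reported_pair_mapq(60, single_end_mapq=10, pe_max_penalty=20, mapq_max=60))
```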

Mean Insert Size Detection

When working with paired-end data, DRAGEN must choose among the highest-quality alignments for the two ends to try to choose likely pairs. To make this choice, DRAGEN uses a skew normal insert model to evaluate the likelihood that a pair of alignments constitutes a pair. This model is based on the observation that common library preparation methods have insert-size distributions that are sometimes close to normal, but also sometimes clearly asymmetric, often skewing toward longer insert sizes. The skew normal insert model is used only for the DNA mode.

If you know the statistics of your library prep for an input file (and the file consists of a single read group), you can specify the characteristics of the insert-length distribution: mean, standard deviation, shape (or skewness) and three quartiles. These characteristics can be specified with the Aligner.pe-stat-mean-insert, Aligner.pe-stat-stddev-insert, Aligner.pe-stat-shape-insert, Aligner.pe-stat-quartiles-insert, and Aligner.pe-stat-mean-read-len options. However, it is typically preferable to allow DRAGEN to detect these characteristics automatically.

DRAGEN automatically samples the insert-length distribution. When the software starts execution, it runs a sample of up to 2,000,000 pairs through the aligner, calculates the distribution, and then uses the resulting statistics for evaluating all pairs in the input set.

The DRAGEN host software reports the statistics in its stdout log in a report, as follows:

Note that the Mean, Standard deviation and Quartiles reported above are the sample mean, standard deviation and quartiles calculated from the initial sample of up to 2,000,000 pairs, assuming a normal distribution. The sample mean and standard deviation are used to fit the parameters of a skew-normal distribution. A skew-normal distribution is defined by starting with an underlying normal distribution (whose mean we call position or xi and standard deviation we call scale or omega) and folding a varying portion of the probability mass from one side of the mean (e.g., left side) to the other (e.g., right) side. The portion folded varies smoothly, from 0% at the original mean, approaching 100% from the left tail to the right tail. A shape parameter which we call alpha controls how rapidly the folded fraction increases, and at alpha=0 there is no folding and the distribution remains normal.
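For reference, the standard skew-normal density with position xi, scale omega, and shape alpha can be written out directly. The sketch below uses the textbook formulas and is not taken from DRAGEN code; it is meant only to show how the three parameters interact, and the example values are hypothetical.

```python
import math

def skew_normal_pdf(x, xi, omega, alpha):
    """Skew-normal density: (2 / omega) * phi(z) * Phi(alpha * z), z = (x - xi) / omega."""
    z = (x - xi) / omega
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal pdf
    big_phi = 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))  # standard normal cdf
    return 2.0 / omega * phi * big_phi

def skew_normal_mean(xi, omega, alpha):
    """Mean of the skew-normal: xi + omega * delta * sqrt(2 / pi)."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    return xi + omega * delta * math.sqrt(2.0 / math.pi)

# With alpha = 0 the distribution is normal and the mean equals xi.
print(skew_normal_mean(300.0, 50.0, 0.0))
# With alpha > 0 mass shifts toward longer inserts, so the mean exceeds xi.
print(skew_normal_mean(300.0, 50.0, 3.0))
```

Because the mean of a skewed distribution differs from xi, the sample mean printed in the report and the xi value passed on the command line are generally not the same number.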

In the standard output, we also include the command line options needed to reproduce the DRAGEN run with the same insert stat settings. Note that when specifying stats on the command line, the skew-normal xi value should be used for Aligner.pe-stat-mean-insert. The omega value should be used for Aligner.pe-stat-stddev-insert, and the alpha value should be used for Aligner.pe-stat-shape-insert. If Aligner.pe-stat-shape-insert is not specified on the command line, a default value of 0 is assumed.

The insert length distribution for each sample is written to fragment_length_hist.csv. Each sample starts with the following lines:

These lines are followed by the histogram for the first ~2M read pairs for DNA (~100K read pairs for RNA). The histogram counts are aggregated across all read groups sharing the same sample id (RGSM field).

When the number of sample pairs is very small, there is not enough information to characterize the distribution with high confidence. In this case, DRAGEN applies default statistics that specify a very wide insert distribution, which tends to admit pairs of alignments as proper pairs, even if they may lie tens of thousands of bases apart. In this situation, DRAGEN outputs a message, as follows:

The small samples formula calculates standard deviation as follows:

The default model is "standard deviation = 10000". If the first 2M reads are unmapped or if all pairs are improper pairs, then the standard deviation is set to 10000 and the mean and quartiles are set to 0. Note that the minimum value for standard deviation is 12, which is independent of the number of samples. Also, in DNA mode, when fewer than 1000 high-quality alignments are available, DRAGEN reverts to the normal-distribution-based insert model, because there are too few samples to accurately estimate the parameters of the skew-normal distribution.

For RNA-Seq data, the insert size distribution is not normal due to pairs containing introns. The DRAGEN software estimates the distribution using a kernel density estimator to fit a long tail to the samples. This estimate leads to a more accurate mean and standard deviation for RNA-Seq data and proper pairing.

DRAGEN writes detected paired-end stats into a tab-delimited log file in the output directory called .insert-stats.tab. This file contains the statistical distribution of detected insert sizes for each read group, including quartiles, mean, standard deviation, shape, minimum, and maximum. The information matches the standard-out report above. Additionally, the log file includes the minimum and maximum insert limits that DRAGEN applied for rescue scans. Note that the reported mean and standard deviation in this tab-delimited log file are the xi and omega parameters of the skew-normal distribution.

Rescue Scans

For paired-end reads, where a seed hit is found for one mate but not the other, rescue scans hunt for missing mate alignments within a rescue radius of the mean insert length. Normally, the DRAGEN host software sets the rescue radius to 2.5 standard deviations of the empirical insert distribution. But in cases where the insert standard deviation is large compared to the read length, the rescue radius is restricted to limit mapping slowdowns. In this case, a warning message is displayed, as follows:

Although the user can ignore this warning, or specify an intermediate rescue radius to maintain mapping speed, it is recommended to use 2.5 sigmas for the rescue radius to maintain mapping sensitivity. To disable rescue scanning, set max-rescues to 0.
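As a rough illustration, the insert-length window searched by a rescue scan can be computed from the insert statistics. The helper below is hypothetical, and clamping the lower bound at 0 is an assumption made for the sketch. The rule DRAGEN uses to restrict the radius for wide distributions is not modeled.

```python
def rescue_window(mean_insert, stddev_insert, sigmas=2.5):
    """Insert-length window covered by a rescue scan (illustrative only).

    The radius is sigmas * stddev_insert around the mean insert length;
    the lower bound is clamped at 0 as a simplifying assumption.
    """
    radius = sigmas * stddev_insert
    return max(0.0, mean_insert - radius), mean_insert + radius

print(rescue_window(400, 80))
```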

Output Options

DRAGEN can track multiple independent alignments for each read. These alignments include the optimal (primary) one, as well as those mapping different subsegments of the read (chimeric/supplementary), and sub-optimal (secondary) mappings of the read to different areas of the reference.

For DNA alignment by default, DRAGEN can emit one primary alignment for each read, up to three chimeric alignments (Aligner.supp-aligns=3), and no secondary alignments (Aligner.sec-aligns=0). The maximum user-specified value for supp-aligns or sec-aligns is 4095.

You can use the following configuration options to control how many of each type of alignment to include in DRAGEN output.

  • mapq-max The mapq-max option specifies a ceiling on the estimated MAPQ that can be reported for any alignment, from 0 to 255. If the calculated MAPQ is higher, this value is reported instead. The default is 60.

  • supp-aligns, sec-aligns The supp-aligns and sec-aligns options restrict the maximum number of supplementary (ie, chimeric and SAM FLAG 0x800) alignments and secondary (ie, suboptimal and SAM FLAG 0x100) alignments, respectively, that can be reported for each read. A maximum of 4095 supplementary alignments and 4095 secondary alignments can be reported for any read, in addition to a primary alignment. High settings for these two options impact speed so it is advisable to increase only as needed.

  • sec-phred-delta The sec-phred-delta option controls which secondary alignments are emitted based on the alignment score relative to the primary reported alignment. Only secondary alignments with likelihood within this Phred value of the primary are reported.

  • sec-aligns-hard The sec-aligns-hard option suppresses the output of all secondary alignments if there are more secondary alignments than can be emitted. Set sec-aligns-hard to 1 to force the read to be unmapped when not all secondary alignments can be output.

  • supp-as-sec When the supp-as-sec option is set to 1, then supplementary (chimeric) alignments are reported with SAM FLAG 0x100 instead of 0x800. The default is 0. The supp-as-sec option provides compatibility with tools that do not support FLAG 0x800.

  • hard-clips The hard-clips option is used as a field of 3 bits, with values ranging from 0 to 7. The bits specify alignments, as follows:

    • Bit 0--primary alignments

    • Bit 1--supplementary alignments

    • Bit 2--secondary alignments

Each bit determines whether local alignments of that type are reported with hard clipping (1) or soft clipping (0). The default is 6, meaning primary alignments use soft clipping and supplementary and secondary alignments use hard clipping.
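The bit field can be decoded as shown in this short sketch. The function name is hypothetical; the bit assignments are the ones listed above.

```python
# Bit positions as listed above: bit 0 primary, bit 1 supplementary, bit 2 secondary.
ALIGNMENT_BITS = {"primary": 0, "supplementary": 1, "secondary": 2}

def decode_hard_clips(value):
    """Map a hard-clips value (0 to 7) to the clipping style per alignment type."""
    if not 0 <= value <= 7:
        raise ValueError("hard-clips must be between 0 and 7")
    return {
        kind: "hard" if (value >> bit) & 1 else "soft"
        for kind, bit in ALIGNMENT_BITS.items()
    }

print(decode_hard_clips(6))
```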

Mapping with ALT-contigs

The GRCh38 human reference contains many more alternate haplotypes (ALT contigs) than previous versions of the reference. Generally, including ALT contigs in the mapping reference improves mapping and variant calling specificity, because misalignments are eliminated for reads matching an ALT contig but scoring poorly against the primary assembly. However, mapping with GRCh38's ALT contigs without special treatment can substantially degrade variant calling sensitivity in corresponding regions, because many reads align equally well to an ALT contig and to the corresponding position in the primary assembly.

Mask-Based ALT-Awareness

The recommended and default approach for dealing with ALT contigs in DRAGEN is masking regions of ALT contigs that are highly similar to their corresponding primary contig. This approach is more accurate than liftover-based ALT-awareness because there are many places where the "correct" or most useful liftover between a long ALT haplotype and the primary assembly is ambiguous. Incorrect liftover can produce dense clusters of mismapped reads and false variant calls. The base masking approach has the benefits of using ALT contigs without the negative consequences.

Masked hash tables are built from a standard hg19 or hg38 FASTA that contains ALT contigs. The hash table builder automatically masks regions of the ALT contigs with Ns.

Liftover-Based ALT-Awareness

With liftover based ALT-awareness, the mapper and aligner are aware of the liftover relationship between ALT contig positions and corresponding primary assembly positions. Seed matches within ALT contigs are used to obtain corresponding primary assembly alignments, even if the latter score poorly. Liftover groups are formed, each containing a primary assembly alignment candidate, and zero or more ALT alignment candidates that lift to the same location. Each liftover group is scored according to its best-matching alignments, taking properly paired alignments into account. The winning liftover group provides its primary assembly representative as the primary output alignment, with MAPQ calculated based on the score difference to the second-best liftover group. Emitting primary alignments within the primary assembly maintains normal aligned coverage and facilitates variant calling there. If the --Aligner.en-alt-hap-aln option is set to 1 and --Aligner.supp-aligns is greater than 0, then corresponding alternate haplotype alignments can also be output, flagged as supplementary alignments.

DRAGEN requires ALT-Aware hash tables for any hg19 or GRCh38 reference where ALT contigs are detected. To disable this requirement in DRAGEN, set the --ht-alt-aware-validate option to false.

The following is a comparison of alternative options for dealing with alternate haplotypes.

  • Mapping without ALT contigs in the reference:

    • False-positive variant calls result when reads matching an alternate haplotype misalign somewhere else.

    • Poor mapping and variant calling sensitivity where reads matching an ALT contig differ greatly from the primary assembly.

  • Mapping with ALT contigs but no ALT awareness:

    • False-positive variant calls from misaligned reads matching ALT contigs are eliminated.

    • Low or zero aligned coverage in primary assembly regions covered by alternate haplotypes, due to some reads mapping to ALT contigs.

    • Low or zero MAPQ in regions covered by alternate haplotypes, where they are similar or identical to the primary assembly.

    • Variant calling sensitivity is dramatically reduced throughout regions covered by alternate haplotypes.

  • Mapping with ALT contigs and ALT awareness:

    • False-positive variant calls from misaligned reads matching ALT contigs are eliminated.

    • Normal aligned coverage in regions covered by alternate haplotypes because primary alignments are to the primary assembly.

    • Normal MAPQs are assigned because alignment candidates in alternative haplotypes are not considered in competition.

    • Good mapping and variant calling sensitivity where reads matching an ALT contig differ greatly from the primary assembly.

DRAGEN Multigenome Mapper

The Multigenome Mapper in DRAGEN significantly improves the accuracy of mapping Illumina reads, particularly in challenging regions such as segmental duplications and other difficult-to-map regions. This method leverages population haplotypes from pangenome references to incorporate additional variant information, constructing alternative haplotype paths that improve read mapping. By offering these alternate paths, the Multigenome Mapper enables reads containing population-specific variants to align directly to their most likely genomic locations, reducing mapping ambiguity. This improved mapping also results in improved variant calling accuracy.

When given a set of population variants (VCF) or haplotypes, the pangenome reference modifications fall into the following types:

  • Alternate contigs represent population haplotypes. Alt-contigs can have a single variant or a combination of nearby phased variants.

  • Ambiguous codes (IUPAC codes) to represent SNPs. To improve alignment, the reference FASTA is edited at isolated population SNPs.

  • Haplotype database. An additional haplotype database is built and used to augment the reference FASTA with population variants. A multigenome based mapper algorithm is used to score read alignment according to the variants in this database.
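The IUPAC encoding of isolated SNPs can be illustrated with a short sketch. This is not DRAGEN's reference builder. The function names are hypothetical, positions are 0-based, and only the standard two- and three-nucleotide IUPAC codes are handled.

```python
# Standard IUPAC ambiguity codes for sets of two or three nucleotides.
IUPAC_CODES = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
}

def iupac_code(alleles):
    """Return the IUPAC code covering a set of one to three nucleotides."""
    key = frozenset(a.upper() for a in alleles)
    if len(key) == 1:
        return next(iter(key))
    return IUPAC_CODES[key]

def apply_snps(reference, snps):
    """Replace reference bases with IUPAC codes at SNP positions.

    snps maps a 0-based position to an alternate allele (hypothetical format).
    """
    bases = list(reference)
    for pos, alt in snps.items():
        bases[pos] = iupac_code({bases[pos], alt})
    return "".join(bases)

# C/T at position 1 becomes Y; A/G at position 4 becomes R.
print(apply_snps("ACGTAC", {1: "T", 4: "G"}))
```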

Sorting and Duplicate Marking

Sorting

The map/align system produces a BAM file sorted by reference sequence and position by default. Creating this BAM file typically eliminates the requirement to run samtools sort or any equivalent postprocessing command. The --enable-sort option can be used to enable or disable creation of the BAM file, as follows:

  • To enable, set to true.

  • To disable, set to false.

On the reference hardware system, running with sort enabled increases run time for a 30x full genome by about 6--7 minutes.

Duplicate Marking

Marking or removing duplicate aligned reads is a common best practice in whole-genome sequencing. Not doing so can bias variant calling and lead to incorrect results.

The DRAGEN system can mark or remove duplicate reads, and produces a BAM file with duplicates marked in the FLAG field, or with duplicates entirely removed.

In testing, enabling duplicate marking adds minimal run time over and above the time required to produce the sorted BAM file. The additional time is approximately 1--2 minutes for a 30x whole human genome, which is a huge improvement over the long run times of open source tools.

The Duplicate Marking Algorithm

The DRAGEN duplicate-marking algorithm is modeled on the Picard toolkit's MarkDuplicates feature. All the aligned reads are grouped into subsets in which all the members of each subset are potential duplicates.

For two pairs to be duplicates, they must have the following:

  • Identical alignment coordinates (position adjusted for soft- or hard-clips from the CIGAR) at both ends.

  • Identical orientations (direction of the two ends, with the left-most coordinate being first).

In addition, an unpaired read may be marked as a duplicate if it has identical coordinate and orientation with either end of any other read, whether paired or not.

Unmapped read pairs are never marked as duplicates.

When DRAGEN has identified a group of duplicates, it picks one as the best of the group, and marks the others with the BAM duplicate flag (0x400, or decimal 1024). For this comparison, duplicates are scored based on the average sequence Phred quality. Pairs receive the sum of the scores of both ends, while unpaired reads get the score of the one mapped end. The idea of this score is to try, all other things being equal, to preserve the reads with the highest-quality base calls.

If two reads (or pairs) have exactly matching quality scores, DRAGEN breaks the tie by choosing the pair with the higher alignment score. If there are multiple pairs that also tie on this attribute, then DRAGEN chooses a winner arbitrarily.

The score for an unpaired read R is the average Phred quality score per base, calculated as follows:

Where R is a BAM record, QUAL is its array of Phred quality scores, and dedup-min-qual is a DRAGEN configuration option with default value of 15. For a pair, the score is the sum of the scores for the two ends.

This score is stored as a one-byte number, with values rounded down to the nearest one-quarter. This rounding may lead to different duplicate marks from those chosen by Picard, but because the reads were very close in quality this has negligible impact on variant calling results.
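The scoring and rounding described above can be sketched as follows. Two points are assumptions rather than statements from this guide: the average is taken only over bases with quality at or above dedup-min-qual, and the quarter-point rounding is applied to each read before the two ends of a pair are summed. The function names are hypothetical.

```python
import math

def read_dedup_score(quals, dedup_min_qual=15):
    """Average Phred quality over bases at or above dedup_min_qual.

    The result is rounded down to the nearest one-quarter. A read with no
    qualifying bases scores 0.0 (assumption for this sketch).
    """
    kept = [q for q in quals if q >= dedup_min_qual]
    average = sum(kept) / len(kept) if kept else 0.0
    return math.floor(average * 4) / 4

def pair_dedup_score(quals1, quals2, dedup_min_qual=15):
    """Score for a pair: the sum of the scores of its two ends."""
    return (read_dedup_score(quals1, dedup_min_qual)
            + read_dedup_score(quals2, dedup_min_qual))

print(read_dedup_score([30, 31, 10, 32]))   # the Q10 base is excluded
print(read_dedup_score([30, 31, 31]))       # 30.67 rounds down to 30.5
print(pair_dedup_score([30, 31, 10, 32], [30, 31, 31]))
```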

Duplicate Marking Limitations

The limitations to DRAGEN duplicate marking implementation are as follows:

  • When there are two duplicate reads or pairs with very close Phred sequence quality scores, DRAGEN might choose a different winner from that chosen by Picard. These differences have negligible impact on variant calling results.

  • If using a single FASTQ file as input, DRAGEN accepts only a single library ID as a command-line argument (RGLB). For this reason, the FASTQ inputs to the system must be already separated by library ID. Library ID cannot be used as a criterion for distinguishing non-duplicates.

  • DRAGEN does not distinguish between optical and PCR duplicates.

Duplicate Marking Settings

The following options can be used to configure duplicate marking in DRAGEN:

  • --enable-duplicate-marking Set to true to enable duplicate marking. When --enable-duplicate-marking is enabled, the output is sorted, regardless of the value of the --enable-sort option.

  • --remove-duplicates Set to true to suppress the output of duplicate records. If set to false, DRAGEN sets the 0x400 flag in the FLAG field of duplicate BAM records. When --remove-duplicates is enabled, --enable-duplicate-marking is forced on as well.

  • --dedup-min-qual Specifies the Phred quality score below which a base should be excluded from the quality score calculation used for choosing among duplicate reads.

Prepare a Reference Genome

Before a reference genome can be used with DRAGEN, it must be converted from FASTA format into a custom binary format for use with the DRAGEN hardware. The options used in this preprocessing step offer tradeoffs between performance and mapping quality.

Pre-built DRAGEN reference genomes are available for download in the Illumina customer portal. If you find that performance and mapping quality with these are adequate, there is a good chance that you can simply work with these supplied reference genomes. Depending on your read lengths and other particular aspects of your application, you may be able to improve mapping quality and/or performance by tuning the reference preprocessing options.

Hash Table Background

The DRAGEN mapper extracts many overlapping seeds (subsequences or K-mers) from each read, and looks up those seeds in a hash table residing in memory on its PCIe card, to identify locations in the reference genome where the seeds match. Hash tables are ideal for extremely fast lookups of exact matches. The DRAGEN hash table must be constructed from a chosen reference genome using the --build-hash-table option, which extracts many overlapping seeds from the reference genome, populates them into records in the hash table, and saves the hash table as a binary file.

Automatic Reference Detection

DRAGEN will attempt to detect the provided reference in order to automatically apply recommended resources and settings. There are four human references that DRAGEN can detect: hg38, hg19, hs37d5, and chm13v2. DRAGEN is able to detect references that contain a subset of the primary contigs from one of these references, as long as the names and lengths of the detected contigs are consistent with the names and lengths from the standard assemblies of these references.

In detail, automatic reference detection operates as follows:

We define a primary contig of a human genome to be an autosome (1-22) or sex chromosome (X,Y). Let F be the input fasta. For each reference genome R in hg38, hg19, hs37d5, and chm13v2, DRAGEN checks if there are any contigs in F that have the same name and length as a primary contig in R, and that there are no contigs in F that have the same name as a contig in R, but with different length. If these conditions hold for exactly one of hg38, hg19, hs37d5, and chm13v2, then that reference is detected and resources may be applied automatically.
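The detection rule above can be expressed as a short Python sketch. The function name, the dictionary layout, and the idea of passing contig tables as arguments are assumptions made for illustration. Real contig names and lengths would come from the standard assemblies and are not reproduced here.

```python
def detect_reference(fasta_contigs, known_refs):
    """Return the name of the single detected reference, or None.

    fasta_contigs: {contig_name: length} for the input FASTA F.
    known_refs: {ref_name: {"primary": {name: length},
                            "all": {name: length}}} (hypothetical layout).
    """
    matches = []
    for ref_name, ref in known_refs.items():
        # Condition 1: some contig in F matches a primary contig of R
        # in both name and length.
        has_primary = any(
            ref["primary"].get(name) == length
            for name, length in fasta_contigs.items()
        )
        # Condition 2: no contig in F shares a name with a contig of R
        # while having a different length.
        has_conflict = any(
            name in ref["all"] and ref["all"][name] != length
            for name, length in fasta_contigs.items()
        )
        if has_primary and not has_conflict:
            matches.append(ref_name)
    # Detection succeeds only when exactly one reference satisfies both.
    return matches[0] if len(matches) == 1 else None
```

A FASTA that mixes contigs from two assemblies fails the second condition for both and is therefore not detected.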

The DRAGEN hash table builder automatically applies decoy contigs and mask bed files to a detected reference. Other pipelines may also apply automatic resources. For example, variant callers may apply machine learning models and target bed files.

Naming Conventions

In order for DRAGEN to correctly detect the provided reference, it is important to use the standard naming conventions for each of the four human assemblies that DRAGEN detects:

Reference Seed Interval

The size of the DRAGEN hash table is proportionate to the number of seeds populated from the reference genome. The default is to populate a seed starting at every position in the reference genome, ie, roughly 3 billion seeds from a human genome. This default requires at least 32 GB of memory on the DRAGEN PCIe board.

To operate on larger, nonhuman genomes or to reduce hash table congestion, it is possible to populate fewer than all reference seeds using the --ht-ref-seed-interval option to specify an average reference interval. The default interval for 100% population is --ht-ref-seed-interval 1, and 50% population is specified with --ht-ref-seed-interval 2. The population interval does not need to be an integer. For example, --ht-ref-seed-interval 1.2 indicates 83.3% population, with mostly 1-base and some 2-base intervals to achieve a 1.2 base interval on average.
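The arithmetic behind these percentages can be checked with a few lines of Python. The population percentage is simply 100 divided by the interval. The position pattern generated by seed_positions is only one plausible way to reach a fractional average interval and is an assumption, not the documented placement scheme. Both function names are hypothetical.

```python
from fractions import Fraction


def population_percent(interval):
    """Percentage of reference seed positions populated for an interval."""
    return float(100 / Fraction(str(interval)))


def seed_positions(interval, count):
    """First `count` seed start positions for an average interval.

    Assumption: position i is floor(i * interval), which mixes 1-base and
    2-base steps when the interval is between 1 and 2.
    """
    step = Fraction(str(interval))
    return [int(i * step) for i in range(count)]
```

With an interval of 1.2, the first six positions are 0, 1, 2, 3, 4, 6: four 1-base steps followed by one 2-base step.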

Hash Table Occupancy

It is characteristic of hash tables that they are allocated a certain size, but always retain some empty records, so they are less than 100% occupied. A healthy amount of empty space is important for quick access to the DRAGEN hash table. Approximately 90% occupancy is a good upper bound. Empty space is important because records are pseudo-randomly placed in the hash table, resulting in an abnormally high number of records in some places. These congested regions can get quite large as the percentage of empty space approaches zero, and queries by the DRAGEN mapper for some seeds can become increasingly slow.

Hash Table / Seed Length

The hash table is populated with reference seeds of a single common length. This primary seed length is controlled with the --ht-seed-len option, which defaults to 21.

The longest primary seed supported is 27 bases when the table is 8 GB to 31.5 GB in size. Generally, longer seeds are better for run time performance, and shorter seeds are better for mapping quality (success rate and accuracy). A longer seed is more likely to be unique in the reference genome, facilitating fast mapping without needing to check many alternative locations. But a longer seed is also more likely to overlap a deviation from the reference (variant or sequencing error), which prevents successful mapping by an exact match of that seed (although another seed from the read may still map), and there are fewer long seed positions available in each read.

Longer seeds are more appropriate for longer reads, because there are more seed positions available to avoid deviations.

Seed Length Recommendations

Hash Table / Seed Extensions

Due to repetitive sequences, some seeds of any given length match many locations in the reference genome. DRAGEN uses a unique mechanism called seed extension to successfully map such high-frequency seeds. When the software determines that a primary seed occurs at many reference locations, it extends the seed by some number of bases at both ends, to some greater length that is more unique in the reference.

For example, a 21-base primary seed may be extended by 7 bases at each end to a 35-base extended seed. A 21-base primary seed may match 100 places in the reference. But 35-base extensions of these 100 seed positions may divide into 40 groups of 1-3 identical 35-base seeds. Iterative seed extensions are also supported, and are automatically generated when a large set of identical primary seeds contains various subsets that are best resolved by different extension lengths.
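The grouping effect of seed extension can be illustrated on a toy reference. A 21-base seed extended by 7 bases at each end has length 21 + 2 × 7 = 35. The Python sketch below uses a much shorter seed so the result is easy to read. It is an illustration only, the function name is hypothetical, and skipping positions too close to a contig end is a simplification.

```python
from collections import defaultdict


def group_by_extension(ref, positions, seed_len, ext):
    """Group primary seed hits by their symmetric extension.

    ref: reference sequence as a string.
    positions: 0-based start positions where the primary seed matches.
    ext: number of bases added at each end.
    Returns {extended_seed: [positions]}.
    """
    groups = defaultdict(list)
    for pos in positions:
        start, end = pos - ext, pos + seed_len + ext
        if start < 0 or end > len(ref):
            continue  # toy simplification: cannot extend past the ends
        groups[ref[start:end]].append(pos)
    return dict(groups)
```

On the reference TTACGTAAGGACGTCCTTACGTAA, the 4-base seed ACGT occurs at positions 2, 10, and 18. Extending by 2 bases at each end splits those three hits into one group of two (TTACGTAA) and one group of one (GGACGTCC).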

The maximum extended seed length, by default equal to the primary seed length plus 128, can be controlled with the --ht-max-ext-seed-len option. For example, for short reads, it is advisable to set the maximum extended seed shorter than the read length, because extensions longer than the whole read can never match.

It is also possible to tune how aggressively seeds are extended using the following options (advanced usage):

--ht-cost-coeff-seed-len

--ht-cost-coeff-seed-freq

--ht-cost-penalty

--ht-cost-penalty-incr

There is a tradeoff between extension length and hit frequency. Faster mapping can be achieved using longer seed extensions to reduce seed hit frequencies, or more accurate mapping can be achieved by avoiding seed extensions or keeping extensions short, while tolerating the higher hit frequencies that result. Shorter extensions can benefit mapping quality both by fitting seeds better between SNPs, and by finding more candidate mapping locations at which to score alignments. The default extension settings, along with the default seed frequency settings, lean aggressively toward mapping accuracy, with relatively short seed extensions and high hit frequencies.

The defaults for the seed frequency options are as follows:

Seed Frequency Limit and Target

One primary or extended seed can match multiple places in the reference genome. All such matches are populated into the hash table, and retrieved when the DRAGEN mapper looks up a corresponding seed extracted from a read. The multiple reference positions are then considered and compared to generate aligned mapper output. However, the DRAGEN software enforces a limit on the number of matches, or frequency, of each seed, which is controlled with the --ht-max-seed-freq option. By default, the frequency limit is 16. In practice, when the software encounters a seed with higher frequency, it extends it to a sufficiently long secondary seed that the frequency of any particular extended seed pattern falls within the limit. However, if a maximum seed extension would still exceed the limit, the seed is rejected, and not populated into the hash table. Instead, a single High Frequency record is populated.

This seed frequency limit does not tend to impact DRAGEN mapping quality notably, for two reasons. First, because seeds are rejected only when extension fails, only extremely high-frequency primary seeds, typically with many thousands of matches, are rejected. Such seeds are not very useful for mapping. Second, there are other seed positions to check in a given read. If another seed position is unique enough to return one or more matches, the read can still be properly mapped. However, if all seed positions were rejected as high frequency, often this means that the entire read matches similarly well in many reference positions, so even if the read were mapped it would be an arbitrary choice, with very low or zero MAPQ.

Thus, the default frequency limit of 16 for --ht-max-seed-freq works well. However, it may be decreased or increased, up to a maximum of 256. A higher frequency limit tends to marginally increase the number of reads mapped (especially for short reads), but commonly the additional mapped reads have very low or zero MAPQ. This also tends to slow down DRAGEN mapping, because correspondingly large numbers of possible mappings are occasionally considered.

In addition to a frequency limit, a target seed frequency can be specified with --ht-target-seed-freq option. This target frequency is used when extensions are generated for high frequency primary seeds. Extension lengths are chosen with a preference toward extended seed frequencies near the target. The default of 4 for --ht-target-seed-freq means that the software is biased toward generating shorter seed extensions than necessary to map seeds uniquely.

References with ALT contigs

When building a reference hash table from a fasta with ALT contigs, it may be desirable to mask certain regions of high similarity, or to establish liftover relationships between primary and alternate contigs. The recommended approach is masking, as described in the Map-Align section. When hg19 or hg38 alt contigs are detected, the hash table builder requires a liftover file or a bed file to mask the alt contigs. If none is provided, a mask bed file from <INSTALL_PATH>/resources/ht_builder/ is used automatically.

Masked References

DRAGEN has adopted a masked approach to handle native reference ALT contigs, where strategic regions are masked to increase accuracy. The hash table builder builds the mapper hash table as if the regions specified in the argument for --ht-mask-bed were masked with N's. The hash table builder only allows setting one of --ht-mask-bed or --ht-alt-liftover. Each line in the bed file is expected to contain a contig name, start position (0-based), and end position (1-based), separated by a single tab or space. Lines that start with # are ignored by the hash table builder to allow commenting. Any line with a contig name that is not found in the input fasta is skipped and logged to the DRAGEN log file. Likewise, lines that describe empty intervals are skipped. If all lines are skipped this way, the hash table builder issues an error and aborts, unless the mask bed file was automatically applied (see Automatic masking). The hash table builder always issues an error and aborts if an interval described in the BED file is outside of the range of the corresponding contig in the fasta. Lines that are not skipped are written to a file called mask.bed that is present in the hash table output directory, and whose digest appears in hash_table.cfg. This file is used when a reference is loaded to the FPGA card to dynamically mask reference.bin.
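The line-handling rules above can be mirrored in a small pre-check that runs before a build. The Python sketch below is an approximation written from the description only. The function name is hypothetical, and the order of the checks and the silent skipping of blank lines are assumptions.

```python
def filter_mask_bed(lines, contig_lengths, auto_applied=False):
    """Apply the documented skip and error rules to mask BED lines.

    lines: iterable of BED lines.
    contig_lengths: {contig_name: length} taken from the input FASTA.
    Returns (kept, skipped), where kept is a list of (name, start, end).
    """
    kept, skipped = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # comments are ignored; blank lines are an assumption
        fields = line.replace("\t", " ").split(" ")
        name, start, end = fields[0], int(fields[1]), int(fields[2])
        if name not in contig_lengths:
            skipped.append(line)  # contig not in the FASTA: skipped
            continue
        if end <= start:
            skipped.append(line)  # empty interval: skipped
            continue
        if start < 0 or end > contig_lengths[name]:
            raise ValueError(f"interval outside contig range: {line}")
        kept.append((name, start, end))
    if skipped and not kept and not auto_applied:
        raise ValueError("all mask BED lines were skipped")
    return kept, skipped
```

The kept list corresponds to what would be written to mask.bed in the output directory.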

Automatic masking

When running from a fasta for which hg38 or hg19 is detected (see Automatic Reference Detection), and no argument for --ht-mask-bed or --ht-alt-liftover was provided, the hash table builder automatically applies the corresponding bed file for the detected reference from <INSTALL_PATH>/resources/ht_builder/. Note that the hash table builder identifies alt contigs by name. When running from an input fasta that contains alt contigs with standard names but modified base content, it is recommended to suppress automatic masking by setting --ht-suppress-mask=true or by passing a custom mask bed file to --ht-mask-bed.

Handling Decoy Contigs

The behavior of DRAGEN with respect to the handling of decoy contigs in the reference has changed since version 2.6.

Starting with DRAGEN 3.x, DRAGEN's hash table builder automatically detects the absence of the decoy contigs from the reference and adds them to the FASTA file prior to building the hash table. The decoys file is found at <INSTALL_PATH>/resources/ht_builder/hs_decoys.fa.gz. If the reference is missing the decoy contigs, then the reads which map to the decoy contigs are artificially marked as unmapped in the output BAM (because the original reference does not have the decoy contigs). This results in an artificially lower mapping rate. However, the accuracy of variant calling is improved by removing false positives caused by decoy reads.

Illumina recommends using this feature by default. However, you can set the --ht-suppress-decoys option to true to suppress adding these decoys to the hash table.

The table below describes the difference in behavior between older DRAGEN versions (2.6 and earlier) and DRAGEN 3.x versions with respect to the handling of decoy contigs in the hash table builder:

Prepare a Pangenome Reference

It is possible to build a custom pangenome reference in order to:

  • generate a population-specific-pangenome hash table from pangenome msVCF generated from the BSSH app.

  • generate a human or non-human pangenome hash table from customer-provided msVCF.

Usage

To enable the pangenome hash table builder, use a command of the following form:

dragen --build-hash-table true (required)
  --ht-graph-msvcf-file <path to a multi-sample VCF file> (required for pangenome reference)
  --ht-reference <reference.fasta> (required)
  --ht-graph-extra-kmer-bed <graph.bed> (optional)
  --ht-mask-bed <mask.bed> (optional)
  --ht-graph-exclusion-bed <exclusion bed> (optional)
  --output-directory <DIR> (required)
  [options]

Inputs

Set of population variants, in a multi-sample VCF (msVCF)

The custom pangenome hash table builder tool uses a set of population variants provided by the user to generate a pangenome hash table. The variants must be specified in VCF format, in a single multi-sample VCF (msVCF) file containing the variants for a set of individuals. This multi-sample VCF file must have specific formatting described below.

Specific msVCF input formatting

The custom pangenome hash table builder tool only supports msVCF file input respecting the format described below:

  • msVCF compliant with 4.2 VCF format specification

  • with variants positionally sorted in the same contig order as the main FASTA reference genome provided in --ht-reference

  • records shall include diploid or haploid GT calls

  • supports multi-allelic variants merged on a single line or separated across multiple lines

  • with the following FILTER codes, non-PASS records are ignored:

    • ##FILTER=<ID=PASS,Description="All filters passed">

  • with the following FORMAT field :

    • ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

  • for better results, we recommend variants to be left-aligned.

  • maximum number of recommended samples in the msVCF is 256. Higher number may lead to very high memory usage at hash table creation.

Note: INFO/FORMAT subfields must be defined in the header. Events with undefined subfields are ignored.
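Some of the header-level requirements above can be checked before starting a build. The Python sketch below looks only at header lines and uses simple string matching. It is a set of heuristics with a hypothetical function name, not a VCF validator, and it does not verify sorting, left-alignment, or GT ploidy.

```python
MAX_RECOMMENDED_SAMPLES = 256  # recommended upper bound from the list above


def check_msvcf_header(header_lines):
    """Return a list of problems found in msVCF header lines."""
    problems = []
    if not any(l.startswith("##fileformat=VCFv4.2") for l in header_lines):
        problems.append("fileformat is not VCFv4.2")
    if not any(l.startswith("##FORMAT=<ID=GT,") for l in header_lines):
        problems.append("FORMAT/GT is not defined")
    chrom_lines = [l for l in header_lines if l.startswith("#CHROM")]
    if not chrom_lines:
        problems.append("missing #CHROM column line")
        return problems
    # Sample names follow the nine fixed VCF columns (CHROM through FORMAT).
    samples = chrom_lines[0].rstrip("\n").split("\t")[9:]
    if not samples:
        problems.append("no sample columns")
    elif len(samples) > MAX_RECOMMENDED_SAMPLES:
        problems.append(
            f"{len(samples)} samples exceeds the recommended "
            f"{MAX_RECOMMENDED_SAMPLES}"
        )
    return problems
```

An empty list means none of these particular checks failed. It does not guarantee that the builder accepts the file.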

To build a high-performance custom genome, it is highly recommended to use long read sequencing data. We recommend using external tools such as WhatsHap (https://github.com/whatshap/whatshap) to generate phased input. DRAGEN analysis leverages the phasing information to reconstruct population haplotypes.

Reference genome

Note: the reference genome provided as input must be the same as the one used to generate the input phased msVCF. If the msVCF contains variants from regions not present in the fasta file, the pangenome reference builder will stop with an error.

Exclusion bed file (optional)

A custom exclusion bed file can also be provided in the following format: tab-delimited, with the first three columns being contig name, start position, and end position. Any line with a contig name that is not found in the input FASTA is skipped. Any lines that describe empty intervals are skipped.

Note: records of the exclusion bed file provided must be from the same build as the reference genome used to build the pangenome reference.

Extra kmer bed file (optional)

An extra kmer bed file can also be provided in the following format: tab-delimited, with the first three columns being contig name, start position, and end position. Any line with a contig name that is not found in the input FASTA is skipped. Any lines that describe empty intervals are skipped.

Note: records of the Extra-kmer-bed file provided must be from the same build as the reference genome used to build the graph reference.

Mask bed file (recommended)

A custom mask bed file can also be provided in the following format: tab-delimited, with the first three columns being contig name, start position, and end position. Any line with a contig name that is not found in the input FASTA is skipped. Any lines that describe empty intervals are skipped.

Note: records of the mask bed file provided must be from the same build as the reference genome used to build the graph reference.

Command line options

Note: The custom graph reference hash table end to end pipeline will return an error if options --ht-alt-liftover or --ht-allow-mask-and-liftover are specified.

Output

The hash table builder generates the following outputs:

Prepare a Linear Reference

Usage

Use the --build-hash-table option to transform a reference FASTA into the hash table for DRAGEN mapping. It takes as input a FASTA file (which may contain multiple concatenated reference sequences) and a preexisting output directory. Build command usage is as follows:

dragen --build-hash-table true --ht-reference <reference.fasta> --output-directory <DIR> [options]

Input

The --ht-reference and --output-directory options are required for building a hash table. The --ht-reference option specifies the path to the reference FASTA file, while --output-directory specifies a preexisting directory where the hash table output files are written. Illumina recommends organizing various hash table builds into different folders. As a best practice, folder names should include any nondefault parameter settings used to generate the contained hash table. The sequence names in the reference FASTA file must be unique.

Command line options

Liftover Based ALT-Aware Hash Tables

While masking is the recommended approach to dealing with ALT contigs, DRAGEN also supports a liftover based method. To enable liftover based ALT-aware mapping in DRAGEN, build the hash table with a liftover file by using the --ht-alt-liftover option. The hash table builder classifies each reference sequence as primary or alternate based on the liftover file, and packs primaries before alternates in reference.bin. SAM liftover files for hg38DH and hg19 are in the <INSTALL_PATH>/resources/ht_builder folder.

Custom Liftover Files

Custom liftover files can be used in place of those provided with DRAGEN. Liftover files must be SAM format, but no SAM header is required. SEQ and QUAL fields can be omitted ('*'). Each alignment record should have an alternate haplotype reference sequence name as QNAME, indicating the RNAME and POS of its liftover alignment in a destination (normally primary assembly) reference sequence.

Reverse-complemented alignments are indicated by bit 0x10 in FLAG. Records flagged unmapped (0x4) or secondary (0x100) are ignored. The CIGAR may include hard or soft clipping, leaving parts of the ALT contig unaligned.

A single reference sequence cannot serve as both an ALT contig (appearing in QNAME) and a liftover destination (appearing in RNAME). Multiple ALT contigs can align to the same primary assembly location. Multiple alignments can also be provided for a single ALT contig (extras can optionally be flagged 0x800 supplementary), such as to align one portion forward and another portion reverse-complemented. However, each base of the ALT contig only receives one liftover image, according to the first alignment record with an M CIGAR operation covering that base.

SAM records with QNAME missing from the reference genome are ignored, so that the same liftover file may be used for various reference subsets, but an error occurs if any alignment has its QNAME present but its RNAME absent.
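The record-level rules for custom liftover files can be sketched as a reader that filters and validates records. The Python below follows the description above and the standard SAM column order (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, and so on). The function name and returned dictionary keys are hypothetical, skipping header lines that start with @ is an assumption, and the sketch does not resolve overlapping alignments per base.

```python
FLAG_UNMAPPED = 0x4
FLAG_REVERSE = 0x10
FLAG_SECONDARY = 0x100


def read_liftover(lines, reference_names):
    """Parse liftover SAM lines into a list of alignment dictionaries.

    reference_names: set of sequence names present in the reference FASTA.
    """
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # optional SAM header lines (assumption: skipped)
        f = line.split("\t")
        qname, flag, rname, pos, cigar = f[0], int(f[1]), f[2], int(f[3]), f[5]
        if flag & (FLAG_UNMAPPED | FLAG_SECONDARY):
            continue  # unmapped and secondary records are ignored
        if qname not in reference_names:
            continue  # ALT contig absent from this reference subset
        if rname not in reference_names:
            raise ValueError(f"{qname}: destination {rname} not in reference")
        records.append({
            "alt": qname, "dest": rname, "pos": pos, "cigar": cigar,
            "reverse": bool(flag & FLAG_REVERSE),
        })
    # A sequence cannot be both an ALT contig and a liftover destination.
    both = {r["alt"] for r in records} & {r["dest"] for r in records}
    if both:
        raise ValueError(f"used as both ALT and destination: {sorted(both)}")
    return records
```

Records are returned in file order, which matters because the first alignment covering a base with an M operation determines its liftover image.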

Options for advanced users

Primary Seed Length

The --ht-seed-len option specifies the initial length in nucleotides of seeds from the reference genome to populate into the hash table. At run time, the mapper extracts seeds of this same length from each read, and looks for exact matches (unless seed editing is enabled) in the hash table.

The maximum primary seed length is a function of hash table size. The limit is k=27 for table sizes from 16 GB to 64 GB, covering typical sizes for whole human genome, or k=26 for sizes from 4 GB to 16 GB.

The minimum primary seed length depends mainly on the reference genome size and complexity. It needs to be long enough to resolve most reference positions uniquely. For whole human genome references, hash table construction typically fails with k < 16. The lower bound may be smaller for shorter genomes, or higher for less complex (more repetitive) genomes. The uniqueness threshold of --ht-seed-len 16 for the 3.1Gbp human genome can be understood intuitively because log4(3.1 G) ≈ 16, so it requires at least 16 choices from 4 nucleotides to distinguish 3.1 G reference positions.
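The information-theoretic estimate above is easy to reproduce. The Python sketch below computes the smallest k for which 4^k is at least the genome size. The function name is hypothetical, and the result is only a rough lower bound, because real genomes are repetitive and the practical minimum can be higher.

```python
import math


def min_unique_seed_len(genome_bases):
    """Smallest k such that 4**k >= genome_bases (rough lower bound)."""
    return math.ceil(math.log(genome_bases, 4))
```

For a 3.1 Gbp genome, log4(3.1e9) is about 15.8, so the function returns 16.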

Accuracy Considerations

For read mapping to succeed, at least one primary seed must match exactly (or with a single SNP when edited seeds are used). Shorter seeds are more likely to map successfully to the reference, because they are less likely to overlap variants or sequencing errors, and because more of them fit in each read. So for mapping accuracy, shorter seeds are mainly better.

However, very short seeds can sometimes reduce mapping accuracy. Very short seeds often map to multiple reference positions, and lead the mapper to consider more false mapping locations. Due to imperfect modeling of mutations and errors by Smith-Waterman alignment scoring and other heuristics, occasionally these noise matches may be reported. Run time quality filters such as --Aligner.aln_min_score can control the accuracy issues with very short seeds.

Speed Considerations

Shorter seeds tend to slow down mapping, because they map to more reference locations, resulting in more work such as Smith-Waterman alignments to determine the best result. This effect is most pronounced when primary seed length approaches the reference genome's uniqueness threshold, eg, K=16 for whole human genome.

Application Considerations

Read Length: Generally, shorter seeds are appropriate for shorter reads, and longer seeds for longer reads. Within a short read, a few mismatch positions (variants or sequencing errors) can chop the read into only short segments matching the reference, so that only a short seed can fit between the differences and match the reference exactly. For example, in a 36 bp read, just one SNP in the middle can block seeds longer than 18 bp from matching the reference. By contrast, in a 250 bp read, it takes 15 SNPs to exceed a 0.01% chance of blocking even 27 bp seeds.

Paired Ends: The use of paired end reads can make longer seeds yield good mapping accuracy. DRAGEN uses paired end information to improve mapping accuracy, including with rescue scans that search the expected reference window when only one mate has seeds mapping to a given reference region. Thus, paired end reads have essentially twice the opportunity for an exact matching seed to find their correct alignments.

Variant or Error Rate: When read differences from the reference are more frequent, shorter seeds may be required to fit between the difference positions in a given read and match the reference exactly.

Mapping Percentage Requirement: If the application requires a high percentage of reads to be mapped somewhere (even at low MAPQ), short seeds may be helpful. Some reads that do not match the reference well anywhere are more likely to map using short seeds to find partial matches to the reference.
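The read length example above (one SNP in the middle of a 36 bp read blocking seeds longer than 18 bp) can be reproduced with a small helper. The Python sketch below finds the longest mismatch-free stretch of a read, which bounds the longest seed that can still match exactly. The function name is hypothetical, and positions are assumed to be 0-based and within the read.

```python
def longest_exact_segment(read_len, mismatch_positions):
    """Length of the longest run of read bases with no mismatch.

    mismatch_positions: 0-based positions in the read that differ from
    the reference (variants or sequencing errors).
    """
    longest, prev = 0, -1
    # Treat the end of the read as a final boundary.
    for pos in sorted(set(mismatch_positions)) + [read_len]:
        longest = max(longest, pos - prev - 1)
        prev = pos
    return longest
```

A primary seed of length k can match exactly only if k is no greater than the returned value.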

Maximum Seed Length

The --ht-max-ext-seed-len option limits the length of extended seeds populated into the hash table. Primary seeds (length specified by --ht-seed-len) that match many reference positions can be extended to achieve more unique matching, which may be required to map seeds within the maximum hit frequency (--ht-max-seed-freq).

Given a primary seed length k, the maximum seed length can be configured between k and k+128. The default is the upper bound, k+128.

When to Limit Seed Extension

The --ht-max-ext-seed-len option is recommended for short reads, eg, less than 50 bp. In such cases, it is helpful to limit seed extension to the read length minus a small margin, such as 1-4 bp. For example, with 36 bp reads, setting --ht-max-ext-seed-len to 35 might be appropriate. This ensures that the hash table builder does not plan a seed extension longer than the read causing seed extension and mapping to fail at run time, for seeds that could have fit within the read with shorter extensions.

While seed extension can be similarly limited for longer reads, eg, setting --ht-max-ext-seed-len to 99 for 100 bp reads, there is little utility in this because seeds are extended conservatively in any event. Even with the default k+128 limit, individual seeds are only extended to the lengths required to fit under the maximum hit frequency (--ht-max-seed-freq), and at most a few bases longer to approach the target hit frequency (--ht-target-seed-freq), or to avoid taking too many incremental extension steps.
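The short-read guidance in this section reduces to a simple calculation: read length minus a small margin, kept within the allowed range of k to k+128. The Python sketch below is a convenience helper under that reading. The function name and the default margin of 1 are assumptions.

```python
MAX_EXTENSION = 128  # the maximum seed length is at most k + 128


def suggested_max_ext_seed_len(read_len, seed_len=21, margin=1):
    """Candidate value for --ht-max-ext-seed-len for a given read length."""
    upper = seed_len + MAX_EXTENSION
    # Clamp to the configurable range [k, k + 128].
    return max(seed_len, min(read_len - margin, upper))
```

For 36 bp reads this gives 35, and for reads longer than the k+128 limit it simply returns the default upper bound.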

Maximum Hit Frequency

The --ht-max-seed-freq option sets a firm limit on the number of seed hits (reference genome locations) that can be populated for any primary or extended seed. If a given primary seed maps to more reference positions than this limit, it must be extended long enough that the extended seeds subdivide into smaller groups of identical seeds under the limit. If, even at the maximum extended seed length (--ht-max-ext-seed-len), a group of identical reference seeds is larger than this limit, their reference positions are not populated into the hash table. Instead, a single High Frequency record is populated.

The maximum hit frequency can be configured from 1 to 256. However, if this value is too low, hash table construction can fail because too many seed extensions are needed. The practical minimum for a whole human genome reference, other options being default, is 8.
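The interaction between the hit frequency limit and the maximum extended seed length can be illustrated with a toy model. The Python sketch below extends a group of identical seeds symmetrically until every subgroup fits under the limit, and rejects groups that still exceed it at the maximum length. It is a simplified illustration, not the hash table builder's algorithm. The function name, the fixed step size, and the treatment of positions near contig ends are assumptions.

```python
from collections import defaultdict


def resolve_seed(ref, positions, k, max_freq, max_len, step=1):
    """Return (populated, rejected) for one primary seed.

    populated: list of (seed_length, [positions]) groups within max_freq.
    rejected: positions that would be replaced by a High Frequency record.
    """
    populated, rejected = [], []
    work = [(k, 0, list(positions))]  # (seed length, bases per side, hits)
    while work:
        length, ext, group = work.pop()
        if len(group) <= max_freq:
            populated.append((length, sorted(group)))
            continue
        new_ext = ext + step
        new_len = k + 2 * new_ext
        if new_len > max_len:
            rejected.extend(group)  # still too frequent at maximum length
            continue
        subgroups = defaultdict(list)
        for pos in group:
            start, end = pos - new_ext, pos + k + new_ext
            if start < 0 or end > len(ref):
                rejected.append(pos)  # toy simplification at contig ends
            else:
                subgroups[ref[start:end]].append(pos)
        work.extend((new_len, new_ext, g) for g in subgroups.values())
    return populated, sorted(rejected)
```

Lowering max_freq forces more extension steps and more rejections, which is consistent with hash table construction failing when the limit is set too low.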

Accuracy Considerations

Generally, a higher maximum hit frequency leads to more successful mapping. There are two reasons for this. First, a higher limit rejects fewer reference positions that cannot map under it. Second, a higher limit allows seed extensions to be shorter, improving the odds of exact seed matching without overlapping variants or sequencing errors.

However, as with very short seeds, allowing high hit counts can sometimes hurt mapping accuracy. Most of the seed hits in a large group are not to the true mapping location, and occasionally one of these noise hits may be reported due to imperfect scoring models. Also, the mapper limits the total number of reference positions it considers, and allowing very high hit counts can potentially crowd out the actual best match from consideration.

Speed Considerations

Higher maximum hit frequencies slow down read mapping, because seed mapping finds more reference locations, resulting in more work, such as Smith-Waterman alignments, to determine the best result.

Pangenome Reference

The DRAGEN Software enables the user to build a custom pangenome hash table from a set of population variants. The population variants are specified in a single multi-sample VCF file.

  • --ht-graph-msvcf-file: Input file containing list of population variants, in multi-sample VCF format.

This option replaces the options previously used to build a graph reference, which are now deprecated.

List of deprecated options:

  • --ht-pop-alt-contigs: Population based alternate contigs FASTA.

  • --ht-pop-alt-liftover: Liftover SAM file of population alternate contigs.

  • --ht-pop-snps: Population based SNPs VCF

ALT-Contigs

The following options control building hash tables from references with ALT-contigs. See References with ALT contigs for more information.

  • --ht-mask-bed: Set a custom BED file that defines which regions to mask. If not provided, the DRAGEN software automatically applies BED files for hg38 and hg19 from <INSTALL_PATH>/resources/ht_builder.

  • --ht-alt-liftover: Set a liftover file to build a liftover based ALT-aware hash table. SAM liftover files for hg38DH and hg19 are provided in <INSTALL_PATH>/resources/ht_builder.

  • --ht-allow-mask-and-liftover: Allow the use of both --ht-mask-bed and --ht-alt-liftover together.

  • --ht-suppress-mask: Suppress automatic detection of the default mask bed files when building the hash table.

Decoy Contigs

  • --ht-decoys The DRAGEN software automatically detects the use of hg19 and hg38 references and adds decoys to the hash table when they are not found in the FASTA file. Use the --ht-decoys option to specify the path to a decoys file. The default is <INSTALL_PATH>/resources/ht_builder/hs_decoys.fa.gz.

  • --ht-suppress-decoys: Suppress automatic detection of the default decoys file when building the hash table.

Processing Options

  • --ht-num-threads The --ht-num-threads option determines the maximum number of worker CPU threads that are used to speed up hash table construction. The default for this option is 8, with a maximum of 32 threads allowed. If your server supports execution of more threads, it is recommended that you use the maximum. For example, the DRAGEN servers contain 24 cores that have hyperthreading enabled, so a value of 32 should be used. When using a higher value, --ht-max-table-chunks needs to be adjusted as well. The servers have 128 GB of memory available.

  • --ht-max-table-chunks The --ht-max-table-chunks option controls the memory footprint during hash table construction by limiting the number of ~1 GB hash table chunks that reside in memory simultaneously. Each additional chunk consumes roughly twice its size (~2 GB) in system memory during construction. The hash table is divided into power-of-two independent chunks, of a fixed chunk size, X, which depends on the hash table size, in the range 0.5 GB < X ≤ 1 GB. For example, a 24 GB hash table contains 32 independent 0.75 GB chunks that can be constructed by parallel threads with enough memory and a 16 GB hash table contains 16 independent 1 GB chunks. The default is --ht-max-table-chunks equal to --ht-num-threads, but with a minimum default --ht-max-table-chunks of 8. It makes sense to have these two options match, because building one hash table chunk requires one chunk space in memory and one thread to work on it. Nevertheless, there are build-speed advantages to raising --ht-max-table-chunks higher than --ht-num-threads, or to raising --ht-num-threads higher than --ht-max-table-chunks.
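The chunk arithmetic for the processing options above can be worked through in Python. The sketch below derives the power-of-two chunk count and chunk size from the table size, the default for --ht-max-table-chunks, and a rough memory estimate for the resident chunks based on the "roughly twice its size" figure. The function names are hypothetical, and the estimate ignores all other memory used during the build.

```python
def chunk_layout(table_gb):
    """Return (chunk_count, chunk_size_gb) with 0.5 GB < size <= 1 GB.

    The chunk count is the smallest power of two that brings the chunk
    size down to 1 GB or less (intended for tables larger than 0.5 GB).
    """
    chunks = 1
    while table_gb / chunks > 1.0:
        chunks *= 2
    return chunks, table_gb / chunks


def default_max_table_chunks(num_threads):
    """Default equals --ht-num-threads, but never less than 8."""
    return max(num_threads, 8)


def approx_chunk_memory_gb(table_gb, max_table_chunks):
    """Rough memory for resident chunks: about 2x chunk size each."""
    chunks, chunk_gb = chunk_layout(table_gb)
    resident = min(chunks, max_table_chunks)
    return resident * 2 * chunk_gb
```

For a 24 GB table this gives 32 chunks of 0.75 GB, and with 32 resident chunks roughly 48 GB of system memory for chunk construction.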

Size Options

  • --ht-mem-limit Memory Limit. The --ht-mem-limit option controls the generated hash table size by specifying the DRAGEN card memory available for both the hash table and the encoded reference genome. The --ht-mem-limit option defaults to 32 GB when the reference genome approaches whole human genome (WHG) size, or to a generous size for smaller references. Normally there is little reason to override these defaults.

  • --ht-size Hash Table Size. This option specifies the hash table size to generate, rather than calculating an appropriate table size from the reference genome size and the available memory (option --ht-mem-limit). Using default table sizing is recommended and using --ht-mem-limit is the next best choice.

Seed Population Options

  • --ht-ref-seed-interval Seed Interval. The --ht-ref-seed-interval option defines the step size between positions of seeds in the reference genome populated into the hash table. An interval of 1 (default) means that every seed position is populated, 2 means 50% of positions are populated, and so on. Noninteger values are supported, e.g., 2.5 yields 40% populated. Seeds from a whole human reference are easily 100% populated with 32 GB memory on DRAGEN boards. If a substantially larger reference genome is used, change this option.

  • --ht-soft-seed-freq-cap and --ht-max-dec-factor Soft Frequency Cap and Maximum Decimation Factor for Seed Thinning. Seed thinning is an experimental technique to improve mapping performance in high-frequency regions. When primary seeds have higher frequency than the cap indicated by the --ht-soft-seed-freq-cap option, only a fraction of seed positions are populated to stay under the cap. The --ht-max-dec-factor option specifies a maximum factor by which seeds can be thinned. For example, --ht-max-dec-factor 3 retains at least 1/3 of the original seeds. --ht-max-dec-factor 1 disables any thinning. Seeds are decimated in careful patterns to prevent leaving any long gaps unpopulated. The idea is that seed thinning can achieve mapped seed coverage in high frequency reference regions where the maximum hit frequency would otherwise have been exceeded. Seed thinning can also keep seed extensions shorter, which is also good for successful mapping. Based on testing to date, seed thinning has not proven to be superior to other accuracy optimization methods.

  • --ht-rand-hit-hifreq and --ht-rand-hit-extend Random Sample Hit with HIFREQ Record and EXTEND Record. Whenever a HIFREQ or EXTEND record is populated into the hash table, it stands in place of a large set of reference hits for a certain seed. Optionally, the hash table builder can choose a random representative of that set, and populate that HIT record alongside the HIFREQ or EXTEND record. Random sample hits provide alternative alignments that are very useful in estimating MAPQ accurately for the alignments that are reported. They are never used outside of this context for reporting alignment positions, because that would result in biased coverage of locations that happened to be selected during hash table construction. To include a sample hit, set --ht-rand-hit-hifreq to 1. The --ht-rand-hit-extend option is a minimum pre-extension hit count to include a sample hit, or zero to disable. Modifying these options is not recommended.
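The relationship between the seed interval, the decimation factor, and the fraction of seed positions populated is simple reciprocal arithmetic. A minimal sketch, with hypothetical function names, that reproduces the figures quoted above:

```python
def populated_fraction(seed_interval):
    """Fraction of reference seed positions populated for a given
    --ht-ref-seed-interval value (1 populates every position)."""
    if seed_interval <= 0:
        raise ValueError("seed interval must be positive")
    return 1.0 / seed_interval


def min_retained_fraction(max_dec_factor):
    """Lower bound on the fraction of seeds kept after thinning for a given
    --ht-max-dec-factor value (1 disables thinning)."""
    if max_dec_factor <= 0:
        raise ValueError("decimation factor must be positive")
    return 1.0 / max_dec_factor
```

For example, an interval of 2.5 gives 0.4 (40% populated), and a maximum decimation factor of 3 keeps at least one third of the original seeds.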

Seed Extension Control

DRAGEN seed extension is dynamic, applied as needed for particular K-mers that map to too many reference locations. Seeds are incrementally extended in steps of 2–14 bases (always even) from a primary seed length to a fully extended length. The bases are appended symmetrically in each extension step, determining the next extension increment if any.

There is a potentially complex seed extension tree associated with each high frequency primary seed. Each full tree is generated during hash table construction and a path from the root is traced by iterative extension steps during seed mapping. The hash table builder employs a dynamic programming algorithm to search the space of all possible seed extension trees for an optimal one, using a cost function that balances mapping accuracy and speed. The following options define that cost function:

  • --ht-target-seed-freq Target Hit Frequency. The --ht-target-seed-freq option defines the ideal number of hits per seed for which seed extension should aim. Higher values lead to fewer and shorter final seed extensions, because shorter seeds tend to match more reference positions.

  • --ht-cost-coeff-seed-len Cost Coefficient for Seed Length. The --ht-cost-coeff-seed-len option assigns the cost component for each base by which a seed is extended. Additional bases are considered a cost because longer seeds risk overlapping variants or sequencing errors and losing their correct mappings. Higher values lead to shorter final seed extensions.

  • --ht-cost-coeff-seed-freq Cost Coefficient for Hit Frequency. The --ht-cost-coeff-seed-freq option assigns the cost component for the difference between the target hit frequency and the number of hits populated for a single seed. Higher values result primarily in high-frequency seeds being extended further to bring their frequencies down toward the target.

  • --ht-cost-penalty Cost Penalty for Seed Extension. The --ht-cost-penalty option assigns a flat cost for extending beyond the primary seed length. A higher value results in fewer seeds being extended at all. Current testing shows that zero (0) is appropriate for this parameter.

  • --ht-cost-penalty-incr Cost Increment for Extension Step. The --ht-cost-penalty-incr option assigns a recurring cost for each incremental seed extension step taken from primary to final extended seed length. More steps are considered a higher cost because extending in many small steps requires more hash table space for intermediate EXTEND records, and takes substantially more run time to execute the extensions. A higher value results in seed extension trees with fewer nodes, reaching from the root primary seed length to leaf extended seed lengths in fewer, larger steps.
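To make the trade-offs between these options concrete, the sketch below combines the four cost components into a single score for one candidate extension path. This is not DRAGEN's cost function: the linear form, the use of an absolute frequency difference, and the function name are all assumptions made purely for illustration. Only the roles of the five options come from the descriptions above.

```python
def extension_cost(extended_bases, steps, hits, target_seed_freq,
                   coeff_seed_len, coeff_seed_freq, penalty, penalty_incr):
    """Illustrative (assumed) cost of one seed extension path.

    extended_bases   -- bases added beyond the primary seed length
    steps            -- number of incremental extension steps taken
    hits             -- reference hits populated for the final seed
    target_seed_freq -- stands in for --ht-target-seed-freq
    coeff_seed_len   -- stands in for --ht-cost-coeff-seed-len
    coeff_seed_freq  -- stands in for --ht-cost-coeff-seed-freq
    penalty          -- stands in for --ht-cost-penalty
    penalty_incr     -- stands in for --ht-cost-penalty-incr
    """
    length_cost = coeff_seed_len * extended_bases
    freq_cost = coeff_seed_freq * abs(hits - target_seed_freq)
    flat_cost = penalty if extended_bases > 0 else 0.0
    step_cost = penalty_incr * steps
    return length_cost + freq_cost + flat_cost + step_cost
```

With this toy model, a seed left unextended at 64 hits against a target of 4 scores worse than the same seed extended by 8 bases in one step down to 6 hits, which matches the described behavior that a higher frequency coefficient pushes high-frequency seeds toward longer extensions.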

Pipeline Specific Hash Tables

RNA-Seq

When building a hash table, DRAGEN configures the options for DNA analysis by default. To run RNA-Seq data, you must build an RNA-Seq hash table by setting --ht-build-rna-hashtable to true. If running RNA-Seq alignment, use the original --output-directory instead of the automatically generated subdirectory.

CNV

If using the CNV pipeline, set --ht-build-cnv-hashtable to true. The command generates an additional K-mer hash map that is used in the CNV algorithm. Illumina recommends always using the --ht-build-cnv-hashtable option, so you can perform CNV calling with the same hash table used for mapping and aligning.

Methylation

To run the methylation pipeline, you must build a methylation-specific hash table. DRAGEN can build a single-pass or a legacy multi-pass methylation hash table. Methylation runs that use a single-pass hash table complete faster than runs that use legacy multi-pass hash tables. Single-pass hash tables are recommended for building methylation tables and running analyses.

Single-pass

The following is an example of a single-pass hash table build. The example generates a combined hash table in your reference index folder under the methyl_converted subdirectory.

dragen --build-hash-table true \
  --output-directory $REFDIR \
  --ht-reference $FASTA \
  --ht-num-threads 40 \
  --ht-methylated-combined=true \
  --ht-seed-len 27

Multi-pass

Multi-pass methylation mapping requires building two special hash tables with reference bases converted from C to T in one table and G to A in the other table. The conversions are performed automatically when using the --ht-methylated command line option. The converted hash tables are generated in two subdirectories under the folder specified using the --output-directory command line option. The subdirectories are named CT_converted and GA_converted, corresponding with the base conversions. When using the hash tables for methylated alignment runs, make sure to refer to the --output-directory folder, not the subdirectories.

The base conversions remove a significant amount of information from the hash tables. You might need to use different hash table parameters than in a conventional hash table build. The following options are recommended for building hash tables for mammalian species.

dragen --build-hash-table=true --output-directory $REFDIR --ht-reference $FASTA --ht-max-seed-freq 16 --ht-seed-len 27 --ht-num-threads 40 --ht-methylated=true

HLA

To run the HLA caller, an HLA-specific anchored reference hash table must be built. Set --ht-build-hla-hashtable to true. The command creates an anchored_hla subdirectory inside the --output-directory. The HLA-specific reference subdirectory can be built at the same time as the primary reference construction.
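The pipeline-specific switches above can be assembled programmatically when builds are scripted. The following is a minimal sketch: the helper name is hypothetical, it only uses options named in this section, and it only constructs the command line without running DRAGEN. Whether a given combination of switches is accepted in a single build should be checked against your DRAGEN version.

```python
import shlex


def build_hash_table_command(fasta, output_dir, num_threads=8,
                             rna=False, cnv=False, hla=False):
    """Assemble a dragen hash-table build command as an argument list."""
    cmd = [
        "dragen",
        "--build-hash-table", "true",
        "--ht-reference", fasta,
        "--output-directory", output_dir,
        "--ht-num-threads", str(num_threads),
    ]
    if rna:
        cmd += ["--ht-build-rna-hashtable", "true"]
    if cnv:
        cmd += ["--ht-build-cnv-hashtable", "true"]
    if hla:
        cmd += ["--ht-build-hla-hashtable", "true"]
    return cmd


def as_shell(cmd):
    """Render the argument list as a safely quoted shell string."""
    return " ".join(shlex.quote(arg) for arg in cmd)
```

For example, `as_shell(build_hash_table_command("ref.fa", "/refs/hg38", num_threads=32, cnv=True, hla=True))` yields a single command string that can be logged or passed to a job scheduler.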

Read Trimming

DRAGEN can remove artifacts from reads using hardware-accelerated read trimming. Hardware-accelerated read trimming is available on U200 and cloud systems as part of the DRAGEN mapper, and adds no additional run time. DRAGEN provides multiple independent trimming filters that target different types of artifacts or use cases. You can enable and configure each filter independently to tailor the read trimming to your analysis. Read trimming uses two different modes, hard-trimming and soft-trimming.

To enable hard-trimming mode, use --read-trimmers. In hard-trimming mode, potential artifacts are removed from input reads. Reads that are trimmed to fewer than 20 bases are filtered and replaced with a placeholder read that uses 10 N bases. DRAGEN assigns the filtered reads a 0x200 flag set.
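The hard-trimming filter rule can be expressed in a few lines. The sketch below is illustrative and the function name is hypothetical; the 20-base minimum, the 10 N placeholder, and the 0x200 flag come from the description above.

```python
MIN_LENGTH = 20           # reads trimmed below this length are filtered
PLACEHOLDER = "N" * 10    # replacement sequence for filtered reads
FILTERED_FLAG = 0x200     # flag bit set on filtered reads


def apply_hard_trim_filter(trimmed_seq, flag=0):
    """Return (sequence, flag) after applying the minimum-length filter
    to a read that has already been hard-trimmed."""
    if len(trimmed_seq) < MIN_LENGTH:
        return PLACEHOLDER, flag | FILTERED_FLAG
    return trimmed_seq, flag
```

A read trimmed to 12 bases is replaced by the placeholder and gains the 0x200 bit, while a read of exactly 20 bases passes through unchanged.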

DRAGEN contains a novel lossless soft-trimming mode. In soft-trimming mode, reads are mapped as though they had been trimmed, but no bases are removed. To enable the trimmer in soft mode, use --soft-read-trimmers.

Soft-trimming suppresses systematic mismapping of reads that contain trimmable artifacts, without actually losing the trimmed bases in aligned output. Soft-trimming prevents reads with trimmable artifacts, such as Poly-G artifacts, from being mapped to reference G homopolymers, and prevents adapter sequences from being mapped to the matching reference loci. Soft-trimming might map reads to different positions in the reference than they would have been mapped to without soft-trimming. When using soft-trimming, DRAGEN does not filter reads, and does not map reads whose bases would have been trimmed entirely.

Soft-trimming for Poly-G artifacts is enabled by default on supported systems.

Read Trimming Tools

Fixed-Length Trimming

Fixed-length trimming removes a fixed number of bases from the 5' end of each read. If you are analyzing sequencing data from an amplicon of fixed size and expect the read-length to consistently exceed the length of quality sequence data, you can use the expected number in fixed-length trimming.

Poly-G Trimming

Poly-G artifacts appear on two-channel sequencing systems when the dark base G is called after synthesis has terminated. As a result, DRAGEN calls several erroneous high-confidence G bases on the ends of affected reads. For contaminated samples, many affected reads can be mapped to reference regions with high G content. The affected reads can cause problems for processing downstream.
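As a rough illustration of what trimming a Poly-G artifact means at the read level, the sketch below removes a trailing run of G bases from the 3' end. The function name and the `min_run` threshold are assumptions for illustration; the text does not state how DRAGEN detects Poly-G tails, and the hardware trimmer is certainly more sophisticated than this.

```python
def trim_poly_g_tail(read, min_run=10):
    """Remove a trailing run of G bases if it is at least min_run long.

    Assumes the read is given 5' to 3', so the artifact sits at the end
    of the string. Shorter G runs are left untouched because they are
    plausibly genuine sequence.
    """
    run = len(read) - len(read.rstrip("G"))
    if run >= min_run:
        return read[:len(read) - run]
    return read
```

In soft-trimming mode the same run would be ignored for mapping rather than deleted, so the bases would still appear in the aligned output.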

Quality Trimming

Base quality can degrade over the length of a read toward the 5' end, independently of any artifacts from early termination of synthesis. The lower quality bases can affect mapping and alignment results, and might lead to incorrect variant or methylation calls downstream. The quality trimming tool calculates a rolling average of the base quality inward from the 5' end and removes the minimum number of bases, so that the average base quality is above the threshold specified using --trim-min-quality.
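The rolling-average idea can be sketched as follows. This is an assumed, simplified model: the window size, the function name, and the exact stopping rule are not specified in the text and are chosen here only to show how removing the minimum number of bases interacts with a quality threshold.

```python
def bases_to_trim(quals, min_quality, window=4):
    """Return the minimum number of bases to remove from the end of a read
    so that the mean quality of the last `window` remaining bases is at
    least min_quality.

    quals is a list of per-base quality scores ordered so that the end
    being trimmed comes last. The window size is an assumption.
    """
    n = len(quals)
    for k in range(n):
        remaining = quals[:n - k]
        tail = remaining[-window:]
        if sum(tail) / len(tail) >= min_quality:
            return k
    return n  # every window failed, so the whole read would be trimmed
```

Note that a rolling average can leave an individual low-quality base at the new read end, as long as the surrounding window clears the threshold.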

Adapter Trimming

Problems during library preparation, or libraries with smaller inserts, can result in the synthesis of high quality reads containing sequence from the adapters used. If not removed before analysis, noninsert bases can reduce mapping efficiency and downstream accuracy. The adapter trimming tool uses the adapter sequences from the input FASTA file, and then removes all hits greater than a specified size. Adapter trimming allows for a 10% mismatch. For 3' adapters, trimming is from the first matching adapter base to the end of the read. For 5' adapters, trimming is from the first (3') matching adapter base to the beginning (5') of the read.
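The 3' adapter rule can be illustrated with a naive scan. The sketch below is an assumption-laden simplification: the function name, the `min_match` default, and the left-to-right search are hypothetical, and only the 10% mismatch allowance and the "trim from the first matching adapter base to the end of the read" behavior come from the text.

```python
def trim_3prime_adapter(read, adapter, min_match=5, max_mismatch_percent=10):
    """Trim a 3' adapter from a read.

    Scans for the first position where the adapter (or the prefix of it
    that fits before the read ends) matches with at most
    max_mismatch_percent mismatches, and removes everything from there on.
    Matches shorter than min_match bases are ignored.
    """
    for start in range(len(read)):
        overlap = min(len(adapter), len(read) - start)
        if overlap < min_match:
            break
        mismatches = sum(
            1 for a, b in zip(read[start:start + overlap], adapter[:overlap])
            if a != b
        )
        # Integer comparison avoids floating-point rounding on the 10% limit.
        if mismatches * 100 <= overlap * max_mismatch_percent:
            return read[:start]
    return read
```

A read carrying a 13-base adapter with one sequencing error is still trimmed, because one mismatch in 13 bases is below the 10% allowance.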

Ambiguous Base Trimming

If quality trimming is not feasible due to reduced yield or other limitations, an alternative option is to remove only explicitly ambiguous bases from the ends of reads. If enabled, the ambiguous base trimmer applies a simple exact-match search to both ends of all processed reads, regardless of mate-pair status.

Minimum Length Trimming

You can maximize trimmer sensitivity by using the minimum length trimming tool to remove a fixed number of bases from each read after the trimmer tools above have run. For example, if you would like to remove 5 bp from each read, a 7 bp adapter hit could be missed if five of the bases are removed first. To mitigate this issue, DRAGEN provides an optional minimum trim-length filter.

Maximum Length Trimming

If using libraries of fixed-size inserts, such as small PCR amplicons, it is more convenient to specify a length that all reads should be trimmed to rather than the number of bases to remove. You can use the maximum length trimming tool.

PolyA Tail Trimming

Read Trimming Metrics

The trimmer generates a metrics file titled <output prefix>.trimmer_metrics.csv. Metrics are available on an aggregate level over all input data. The metrics units are in reads or bases.

  • Total input reads Total number of reads in the input files.

  • Total input bases Total number of bases in the input reads.

  • Total input bases R1 Total number of bases in R1 reads.

  • Total input bases R2 Total number of bases in R2 reads.

  • Average input read length Total number of input bases divided by the number of input reads.

  • Total trimmed reads Total number of reads trimmed by at least one base, not including soft-trimming.

  • Total trimmed bases Total number of bases trimmed, not including soft-trimming.

  • Average bases trimmed per read The number of trimmed bases divided by the number of input reads.

  • Average bases trimmed per trimmed read The number of trimmed bases divided by the number of trimmed reads.

  • Remaining poly-G K-mers R1 3prime The number of R1 3' read ends that contain likely Poly-G artifacts after trimming.

  • Remaining poly-G K-mers R2 3prime The number of R2 3' read ends that contain likely Poly-G artifacts after trimming.

  • Total filtered reads The number of reads that were filtered out during trimming.

  • Reads filtered for minimum read length R1 The number of R1 reads that were filtered due to being trimmed below the minimum read length.

  • Reads filtered for minimum read length R2 The number of R2 reads that were filtered due to being trimmed below the minimum read length.

  • <Trimmer tool> trimmed reads The number of reads with at least one base trimmed by TRIMMER. DRAGEN reports the metric for both R1 and R2 mates and the filtering status (unfiltered or filtered) of the trimmed read. The metric includes reads that were trimmed during soft-trimming. Each trimming tool above produces the metric.

  • <Trimmer tool> trimmed bases The number of bases trimmed by TRIMMER. The metric is produced for both R1 and R2 mates and the filtering status (unfiltered or filtered) of the trimmed read. The metric includes bases from reads that were trimmed during soft trimming. Each trimming tool above produces the metric.
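The three averaged metrics are derived directly from the totals, as the definitions above state. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the code works on plain numbers rather than parsing the CSV, whose exact column layout is not described here.

```python
def trimming_averages(total_input_reads, total_input_bases,
                      total_trimmed_reads, total_trimmed_bases):
    """Recompute the averaged trimmer metrics from the reported totals."""
    def ratio(numerator, denominator):
        # Guard against empty inputs or runs where nothing was trimmed.
        return numerator / denominator if denominator else 0.0

    return {
        "Average input read length":
            ratio(total_input_bases, total_input_reads),
        "Average bases trimmed per read":
            ratio(total_trimmed_bases, total_input_reads),
        "Average bases trimmed per trimmed read":
            ratio(total_trimmed_bases, total_trimmed_reads),
    }
```

Recomputing these values is a quick consistency check when comparing trimmer metrics across runs.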

Read Trimming Settings

Read trimmer

Filtering after the trimmer's execution

Fixed-length trimming

Quality trimming

Adapter trimming

Bisulfite trimming

Minimum-length trimming

Maximum-length trimming

PolyA trimming

PolyG trimming

PolyX trimming

Set the consensus sequence type to output. DRAGEN UMI allows collapsing duplex sequences from the two strands of the original molecules. For more information, see .

Hard filter variants that overlap with this region. ALU regions comprise approximately 11% of the genome, and are often exceptionally noisy regions in FFPE samples. Optionally filter out ALU regions using the DRAGEN excluded regions filter. ALU bed files can be downloaded as part of the Bed File Collection:

Specify germline CNVs from the matched normal sample. .

Specifies a population SNV catalog for ASCN CNV. For more information on specifying b-allele loci, see .

See:

For more detail on the small variant caller in somatic mode please refer to

For more information, see .

For instructions on how to download the Nirvana annotation database, please refer to

See the user guide: .

Microsatellite sites file can be downloaded here: .

Prebuilt systematic noise BED files (WES and WGS) can be downloaded here: .

The SNV systematic noise files can also be built in the cloud using the or the DRAGEN Systematic Noise File Builder Pipeline on .

Prebuilt SV systematic noise files can be downloaded here: .

Systematic noise BEDPE files can also be built in the cloud using the or the DRAGEN Systematic Noise File Builder Pipeline on .

CNV PONs can also be built in the cloud using the or the DRAGEN Systematic Noise File Builder Pipeline on .

For more information see: .

See example above or refer to

For more details on single-cell RNA options, refer to the .

For more details on PIPseq pipeline options, refer to the

Pre-built human references are available for download at .

See for how to build a custom reference genome.

The DRAGEN pangenome hashtables are available to download from the .

DRAGEN analysis is capable of mapping on a pangenome hash table. The pangenome hash table introduces alternate graph paths to the linear reference hash table to represent more broadly the allelic diversity of the population over the whole genome or in specific regions defined in a bed file. Accuracy gains from this methodology have been described in scientific blogs available on the . Multigenome hash tables for CHM13_v2, hg38, hg19 and hs37d5 assemblies are available on the .

See for information on the multigenome mapping method.

customize the released pangenome hash table with custom bed files or hash table builder options. A set of bed files are available in the resource files on the .

The required input files are a single multi-sample VCF (msVCF) file containing the set of population variants and, optionally, bed files restricting the graph to specific regions. The generated files, including hash_table.cmp and associated files in the specified output directory, can then be used as the reference hash table for the DRAGEN mapper. DRAGEN software supports the tool on human references with files available on the . For non-human references, the user provides the required resource files.

A reference genome in FASTA format must be provided. Reference genomes are available to download from the .

This bed file is used to filter out regions of the msVCF file. Variants that fall within intervals defined in the "Graph exclusion bed" file are ignored and not used in any part of the pangenome reference builder. The result is the same as if the input msVCF did not contain any variants in the regions defined in the exclusion bed. The file is optional; by default, every variant in the msVCF file is used. Exclusion bed files are available to download from .

This file is used to define regions in the genome where extra seeds will be indexed in the hash table. By default, only seeds extracted from the primary reference are saved in the reference hash table for mapping. This option additionally generates seeds from population variants in the defined regions. It is recommended to include the expected difficult regions in this bed file. Extra-kmer-bed files are available to download from for the human hg38, hg19, hs37d5, and chm13 references.

A mask bed file must be provided in order to mask certain regions of high similarity between primary and alternate contigs present in the main genome FASTA. Mask bed files are available to download from the .


An HLA resource file is packaged with DRAGEN and located at the following path after installation: <INSTALL_PATH>/resources/hla/HLA_resource.v1.fasta.gz. This file is used by default when building the HLA-specific anchored hash table. A custom file can be specified with --ht-hla-reference. See the HLA section for more information.

If using RNA libraries, reads overlapping the poly-A tail of the transcripts may contain long poly-A/poly-T sequences at the end of the reads which may result in incorrect alignment. The poly-A trimmer mitigates this by trimming the poly-A tail from the end of the read. See additional description in section.


7 | 1-255 | 1 | <256
6 | 1-128 | 2 | >=256 and <507
5 | 1-64 | 4 | >=507 and <4031
4 | 1-32 | 8 | >=4031

READ MEAN QUALITY | Read1 | Q38 Reads | 965377
...
POSITIONAL BASE MEAN QUALITY | Read1 | ReadPos 145-152 T Average Quality | 34.49
POSITIONAL BASE MEAN QUALITY | Read1 | ReadPos 150 T Average Quality | 34.44
POSITIONAL BASE MEAN QUALITY | Read1 | ReadPos 256+ T Average Quality | 36.99
...
POSITIONAL BASE CONTENT | Read1 | ReadPos 145-152 A Bases | 113362306
POSITIONAL BASE CONTENT | Read1 | ReadPos 150 A Bases | 14300589
POSITIONAL BASE CONTENT | Read1 | ReadPos 256+ A Bases | 13249068
...
READ LENGTHS | Read1 | 150bp Length Reads | 77304421
READ LENGTHS | Read1 | 144-151bp Length Reads | 77304421
READ LENGTHS | Read1 | >=255bp Length Reads | 1000000
...
READ GC CONTENT | Read1 | 50% GC Reads | 140878674373
...
READ GC CONTENT QUALITY | Read1 | 50% GC Reads Average Quality | 36.20
...
SEQUENCE POSITIONS | Read1 | 'AGATCGGAAGAG' 137bp Starts | 20
SEQUENCE POSITIONS | Read1 | 'AGATCGGAAGAG' 137-144bp Starts | 23
...
POSITIONAL QUALITY | Read1 | ReadPos 150 50% Quantile QV | 37
POSITIONAL QUALITY | Read1 | ReadPos 145-152 50% Quantile QV | 37
...

--Mapper.seed-density

seed-density

--Mapper.edit-mode

edit-mode

--Mapper.edit-seed-num

edit-seed-num

--Mapper.edit-read-len

edit-read-len

--Mapper.edit-chain-limit

edit-chain-limit

0

No editing (default)

1

Chain length test

2

Paired chain length test

3

Full seed editing

--Aligner.global

global

--Aligner.match-score

match-score

--Aligner.match-n-score

match-n-score

--Aligner.mismatch-pen

mismatch-pen

--Aligner.gap-open-pen

gap-open-pen

--Aligner.gap-ext-pen

gap-ext-pen

--Aligner.unclip-score

unclip-score

--Aligner.no-unclip-score

no-unclip-score

--Aligner.aln-min-score

aln-min-score

--Aligner.min-score-coeff

min-score-coeff

Initial paired-end statistics detected for read group RGID, based on 39042 high quality pairs for FR orientation
        Quartiles (25 50 75) = 398 409 420
        Mean = 410.192
        Standard deviation = 14.1254
        NOTE: DRAGEN's insert estimates include corrections for clipping (so they are not identical to TLEN)

        Skew-normal insert distribution applied:
          Position (xi) = 424.084
          Scale (omega) = 19.8719
          Shape (alpha) = -1.88125

        To rerun with identical insert stats, specify:
          --Aligner.pe-stat-mean-insert=424.084
          --Aligner.pe-stat-stddev-insert=19.8719
          --Aligner.pe-stat-shape-insert=-1.88125
          --Aligner.pe-stat-quartiles-insert="398 409 420"
          --Aligner.pe-stat-mean-read-len=101
 #Sample: sample name
 FragmentLength,Count
WARNING: Less than 28 high quality pairs found - standard deviation is
calculated from the small samples formula
if samples < 3 then
    standard deviation = 10000
else if samples < 28 then
    standard deviation = 25 * (standard deviation + 1) / (samples - 2)
end if

if standard deviation < 12 then
    standard deviation = 12
end if
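
The small-sample rule in the pseudocode above can be sketched in Python as follows. The function and argument names are illustrative assumptions and are not part of DRAGEN; only the thresholds and the formula are taken from the pseudocode.

```python
def adjusted_insert_stddev(samples: int, stddev: float) -> float:
    """Apply the small-sample correction from the pseudocode (illustrative sketch)."""
    if samples < 3:
        # Too few high quality pairs to estimate anything: use a very wide value.
        stddev = 10000.0
    elif samples < 28:
        # Inflate the estimate when it is based on only a few pairs.
        stddev = 25.0 * (stddev + 1.0) / (samples - 2)
    # A floor of 12 is applied in every case.
    return max(stddev, 12.0)


print(adjusted_insert_stddev(2, 14.1254))      # fewer than 3 pairs
print(adjusted_insert_stddev(10, 14.0))        # fewer than 28 pairs
print(adjusted_insert_stddev(39042, 14.1254))  # enough pairs, value unchanged
```

With enough high quality pairs the measured standard deviation passes through unchanged unless it is below the floor of 12.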
Rescue radius = 220
     Effective rescue sigmas = 0.5
            WARNING: Default rescue sigmas value of 2.5 was overridden by host software!
            The user may wish to set rescue sigmas value explicitly with --Aligner.rescue-sigmas

hg38, hg19, chm13v2 | chr1-chr22, chrX, chrY
hs37d5 | 1-22, X, Y

Value for --ht-seed-len | Read Length
21 | 100 bp to 150 bp
17 to 19 | shorter reads (36 bp)
27 | 250+ bp

--ht-cost-coeff-seed-len | 1
--ht-cost-coeff-seed-freq | 0.5
--ht-cost-penalty | 0
--ht-cost-penalty-incr | 0.7
--ht-max-seed-freq | 16
--ht-target-seed-freq | 4

Reference does not include the decoy contigs (eg, hg19)

Decoy reads mismap elsewhere in the genome due to the lack of contigs in the reference. Artificially higher mapping rate. False positive calls in noisy regions to which the decoy contigs are mismapped.

DRAGEN automatically detects the absence of the decoy contig from the reference and adds it to the FASTA file. Artificially lower mapping rate because decoy reads which map to the decoy contigs are artificially marked as unmapped in the output BAM (because the original reference does not have the decoy contig). False positive calls are avoided thanks to adding the decoy contigs under the hood. Therefore this helps variant calling.

Reference includes the decoy contigs (eg, hs37d5)

Decoy reads map to the decoy contigs. High mapping rate. No false positive calls caused by decoy reads because decoy reads map to the right place

Decoy reads map to the decoy contigs. High mapping rate. No false positive calls caused by decoy reads because decoy reads map to the right place

Option | Required | Description
--build-hash-table | Yes | Set to true
--ht-graph-msvcf-file | Yes | Path to the multi-sample VCF file containing population variants
--ht-reference | Yes | Path to the reference genome FASTA file.
--ht-graph-extra-kmer-bed | No | Path to the extra kmer bed file
--ht-mask-bed | No (but recommended) | Path to the mask bed file
--ht-graph-exclusion-bed | No | Path to the exclusion bed file
--output-directory | Yes | Specify the directory where all related hash table files will be written
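
As an illustration of how the options above fit together, the following Python sketch assembles a pangenome hash table build command and checks that the required options are present. The helper name and all file paths are hypothetical; only the option names come from the option list above, and the "--option value" form follows the command-line syntax shown elsewhere in this guide.

```python
import shlex

# Option names taken from the option list above.
REQUIRED = ("--ht-graph-msvcf-file", "--ht-reference", "--output-directory")
OPTIONAL = ("--ht-graph-extra-kmer-bed", "--ht-mask-bed", "--ht-graph-exclusion-bed")


def build_pangenome_command(options: dict) -> list:
    """Return an argument list for a pangenome hash table build (sketch only)."""
    unknown = [name for name in options if name not in REQUIRED + OPTIONAL]
    if unknown:
        raise ValueError("unknown options: " + ", ".join(unknown))
    missing = [name for name in REQUIRED if not options.get(name)]
    if missing:
        raise ValueError("missing required options: " + ", ".join(missing))
    command = ["dragen", "--build-hash-table", "true"]
    for name in REQUIRED + OPTIONAL:
        value = options.get(name)
        if value:
            command += [name, value]
    return command


# Hypothetical paths, for illustration only.
example = build_pangenome_command({
    "--ht-graph-msvcf-file": "/data/population.msvcf.vcf.gz",
    "--ht-reference": "/data/reference.fasta",
    "--ht-mask-bed": "/data/mask.bed",
    "--output-directory": "/data/pangenome_ht",
})
print(shlex.join(example))
```

The sketch only builds and prints the command; it does not run DRAGEN.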

File | Description
reference.bin | The reference sequences, encoded in 4 bits per base, so the size in bytes is roughly half the reference genome size. In between reference sequences, N are trimmed and padding is automatically inserted. For example, hg19 has 3,137,161,264 bases in 93 sequences. This is encoded in 1,526,285,312 bytes = 1.42 GB, where 1 GB means 1 GiB or 2^30 bytes.
hash_table.cmp | Compressed hash table. The hash table is decompressed and used by the DRAGEN mapper to look up primary seeds with length specified by the --ht-seed-len option and extended seeds of various lengths.
hash_table.cfg | A list of parameters and attributes for the generated hash table, in a text format. This file provides key information about the reference genome and hash table.
hash_table.cfg.bin | A binary version of hash_table.cfg used to configure the DRAGEN hardware.
hash_table_stats.txt | A text file listing extensive internal statistics on the constructed hash table, including the hash table occupancy percentages. This file is for information purposes. It is not used by other tools.
mask.bed | Present only for masked hash tables. A tab-delimited bed file that describes the masked regions. Contains all lines from the input bed file except comment lines, lines that describe empty intervals, and lines with contig names that were not found in the input FASTA.
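
The size arithmetic for reference.bin can be checked with a short Python sketch. It assumes only what is stated above (4 bits per base, 1 GB meaning 2^30 bytes); the function name is illustrative. Half the base count gives an upper bound, and the quoted file size is smaller, which is consistent with N runs being trimmed.

```python
GIB = 2 ** 30  # 1 GB is taken to mean 1 GiB here


def four_bit_size_bytes(num_bases: int) -> int:
    """Bytes needed to store num_bases at 4 bits per base (two bases per byte)."""
    return (num_bases + 1) // 2


hg19_bases = 3_137_161_264
upper_bound = four_bit_size_bytes(hg19_bases)  # before N trimming and padding
quoted_size = 1_526_285_312                    # reference.bin size quoted for hg19

print(upper_bound, round(upper_bound / GIB, 2))
print(quoted_size, round(quoted_size / GIB, 2))
```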

dragen --build-hash-table true [options] --ht-reference <reference.fasta> --output-directory <outdir>

Option | Required | Description
--build-hash-table | Yes | Set to true
--ht-reference | Yes | Path to the reference genome FASTA file.
--ht-mask-bed | No (but recommended) | Path to the mask bed file. If not provided, the DRAGEN software automatically applies BED files for hg38 and hg19 from <INSTALL_PATH>/resources/ht_builder.
--output-directory | Yes | Specify the directory where all related hash table files will be written

Hash Table Type | Hash Table Commands
single-pass | --ht-methylated-combined=true --ht-seed-len 27
multi-pass | --ht-methylated=true --ht-seed-len 27 --ht-max-seed-freq 16

--trim-min-length

Specify a minimum read length allowed after the trimmer execution. DRAGEN filters any reads with a length less than the value after all read-trimming steps are completed (default: 20).

--trim-min-len-read1

Specify a minimum read length allowed for read1 after the trimmer execution. DRAGEN filters any reads with a length of read1 less than the value after all read-trimming steps are completed (default: 20).

--trim-min-len-read2

Specify a minimum read length allowed for read2 after the trimmer execution. DRAGEN filters any reads with a length of read2 less than the value after all read-trimming steps are completed (default: 20).

--trim-filter-dummy-len

Specify the number of N bases in dummy reads that replace filtered reads (default: 10).

--trim-filter-set-flag

If enabled, dummy reads will have their 0x200 SAM flag set (default: true).

--trim-r1-5prime

Specify a fixed number of bases to trim from the 5' end of Read 1 (default: 0).

--trim-r1-3prime

Specify a fixed number of bases to trim from the 3' end of Read 1 (default: 0).

--trim-r2-5prime

Specify a fixed number of bases to trim from the 5' end of Read 2 (default: 0).

--trim-r2-3prime

Specify a fixed number of bases to trim from the 3' end of Read 2 (default: 0).

--trim-min-quality

Specify the minimum read quality. DRAGEN trims bases from the 3' end of reads with a quality below the value.

--trim-quality-r1-5prime

Specify the quality cutoff below which to trim from the 5' end of read 1.

--trim-quality-r1-3prime

Specify the quality cutoff below which to trim from the 3' end of read 1.

--trim-quality-r2-5prime

Specify the quality cutoff below which to trim from the 5' end of read 2.

--trim-quality-r2-3prime

Specify the quality cutoff below which to trim from the 3' end of read 2.

--trim-adapter-read1

Specify the FASTA file that contains adapter sequences to trim from the 3' end of Read 1.

--trim-adapter-read2

Specify the FASTA file that contains adapter sequences to trim from the 3' end of Read 2.

--trim-adapter-r1-5prime

Specify the FASTA file that contains adapter sequences to trim from the 5' end of Read 1. NB: the sequences should be in reverse order (with respect to their appearance in the FASTQ) but not complemented.

--trim-adapter-r2-5prime

Specify the FASTA file that contains adapter sequences to trim from the 5' end of Read 2. NB: the sequences should be in reverse order (with respect to their appearance in the FASTQ) but not complemented.

--trim-adapter-stringency

Specify the minimum number of adapter bases required for trimming (default: 4).

--trim-bisulfite-ends

Enable both 5-Prime and 3-Prime bisulfite trimming.

--trim-bisulfite-5prime

If a 3' adapter was trimmed, trim an additional 2bp from the 3' end, unless the 5' end matches 'CAA' or 'CGA'.

--trim-bisulfite-3prime

If the 5' end matches 'CAA' or 'CGA', trim the first two of these 5' bases.

--trim-min-r1-5prime

Specify the minimum number of bases to trim from the 5' end of Read 1 (default: 0).

--trim-min-r1-3prime

Specify the minimum number of bases to trim from the 3' end of Read 1 (default: 0).

--trim-min-r2-5prime

Specify the minimum number of bases to trim from the 5' end of Read 2 (default: 0).

--trim-min-r2-3prime

Specify the minimum number of bases to trim from the 3' end of Read 2 (default: 0).

--trim-max-length

Specify the maximum number of bases that can be trimmed from the sequences of both reads.

--trim-max-len-read1

Specify the maximum number of bases that can be trimmed from the sequences of read1.

--trim-max-len-read2

Specify the maximum number of bases that can be trimmed from the sequences of read2.

--trim-polya-min-trim

The minimum number of poly-As required for polya trimming (default: 20).

--trim-polyg-kmer-len

How many bases to check at each read end for poly-G artifact detection (default: 25).

--trim-polyg-kmer-non-g

The maximum number of non-G bases in the K-mer for poly-G artifact detection (default: 2).

--trim-polyg-g-score-r1-5prime

The score for G bases on the 5' end of read 1 (default: 0).

--trim-polyg-g-score-r1-3prime

The score for G bases on the 3' end of read 1 (default: 15).

--trim-polyg-g-score-r2-5prime

The score for G bases on the 5' end of read 2 (default: 0).

--trim-polyg-g-score-r2-3prime

The score for G bases on the 3' end of read 2 (default: 15).

--trim-polyg-min-trim-r1-5prime

The minimum number of G's to trim from the 5' end of read 1 (default: 6).

--trim-polyg-min-trim-r1-3prime

The minimum number of G's to trim from the 3' end of read 1 (default: 6).

--trim-polyg-min-trim-r2-5prime

The minimum number of G's to trim from the 5' end of read 2 (default: 6).

--trim-polyg-min-trim-r2-3prime

The minimum number of G's to trim from the 3' end of read 2 (default: 6).

--trim-polyg-early-exit-threshold

The signed score threshold for poly-G trimming to exit early (default: -500).

--trim-polyx-bases-r1-5prime

The bases to trim for polyX trimming from the 5' end of read 1 (default: empty string "" ).

--trim-polyx-bases-r1-3prime

The bases to trim for polyX trimming from the 3' end of read 1 (default: empty string "" ).

--trim-polyx-bases-r2-5prime

The bases to trim for polyX trimming from the 5' end of read 2 (default: empty string "" ).

--trim-polyx-bases-r2-3prime

The bases to trim for polyX trimming from the 3' end of read 2 (default: empty string "" ).

--trim-polyx-min-trim-r1-5prime

The minimum number of X's to trim from the 5' end of read 1 (default: 20).

--trim-polyx-min-trim-r1-3prime

The minimum number of X's to trim from the 3' end of read 1 (default: 20).

--trim-polyx-min-trim-r2-5prime

The minimum number of X's to trim from the 5' end of read 2 (default: 20).

--trim-polyx-min-trim-r2-3prime

The minimum number of X's to trim from the 3' end of read 2 (default: 20).
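
To make the interaction between fixed-length trimming and the minimum-length filter concrete, here is a minimal Python sketch. It is not DRAGEN's implementation: the function names are hypothetical, only the read sequence is modeled (qualities and the 0x200 flag are ignored), and the defaults mirror the documented defaults of --trim-min-length (20) and --trim-filter-dummy-len (10).

```python
def fixed_len_trim(seq: str, trim_5prime: int = 0, trim_3prime: int = 0) -> str:
    """Remove a fixed number of bases from each end of a read sequence."""
    end = len(seq) - trim_3prime
    # If the trims overlap, nothing is left of the read.
    return seq[trim_5prime:max(end, trim_5prime)]


def min_len_filter(seq: str, min_length: int = 20, dummy_len: int = 10) -> str:
    """Replace a read that is too short after trimming with a dummy read of N bases."""
    if len(seq) < min_length:
        return "N" * dummy_len
    return seq


read = "ACGTACGTACGTACGTACGTACGTAC"  # 26 bases
trimmed = fixed_len_trim(read, trim_5prime=3, trim_3prime=5)
print(trimmed, len(trimmed))
print(min_len_filter(trimmed))
```

In this example the read is 18 bases long after trimming, which is below the minimum of 20, so the sketch replaces it with a dummy read of 10 N bases.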

B-Allele Frequency Output

B-Allele frequency (BAF) output is enabled by default in germline and somatic VCF and gVCF runs.

The BAF value is calculated as either AF or (1 - AF), where

  • AF = (alt_count / (ref_count + alt_count))

  • BAF = 1 - AF, only when ref base < alt base, order of priority for bases is A < T < G < C < N.

The B-allele frequency values are often plotted to visually inspect the spread away from a perfectly diploid heterozygous call (BAF=50%). This plot is more easily interpreted if it is symmetric about the BAF=50% line. To ensure the symmetry, a heuristic must be used to determine when BAF = AF or BAF = 1-AF. This definition of B-Allele Frequency is based on the definition that is used for bead arrays, as most users are accustomed to that implementation. There, the choice of the B allele is based on the color of dye attached to each nucleotide. A and T get one color, G and C get the other color. The bead array implementation has a much more complex rule for tie-breaking between A and T or G and C that involves top and bottom strands. That complexity is unnecessary here, so the simpler hierarchical approach of using a priority for the nucleotides A<T<G<C<N is used.
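
The BAF rule described above can be sketched in Python as follows. The function name and the choice to return None when both counts are zero are assumptions made for illustration; the base priority and the AF formula are taken from the text.

```python
# Priority order from the text: A < T < G < C < N
BASE_PRIORITY = {"A": 0, "T": 1, "G": 2, "C": 3, "N": 4}


def b_allele_frequency(ref_base: str, alt_base: str, ref_count: int, alt_count: int):
    """Return BAF for a single-SNP record, or None when there is no read support."""
    total = ref_count + alt_count
    if total == 0:
        # Records with zero ref and alt counts are not written to the BAF output.
        return None
    af = alt_count / total
    # BAF = 1 - AF only when the ref base has lower priority than the alt base.
    if BASE_PRIORITY[ref_base] < BASE_PRIORITY[alt_base]:
        return 1.0 - af
    return af


print(b_allele_frequency("A", "G", 10, 30))
print(b_allele_frequency("G", "A", 10, 30))
```

Swapping the reference and alternate bases flips the result around 0.5, which is what keeps a plot of BAF values symmetric.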

For each small variant VCF entry with exactly one SNP alternate allele, the output contains a corresponding entry in the BAF output file.

  • <NON_REF> lines are excluded

    • ForceGT variants (as marked by the "FGT" tag in the INFO field) are not included in the output, unless the variant also contains the "NML" tag in the INFO field.

    • Variants where the ref_count and alt_count are both zero are not included in the output.

BAF Options

  • --vc-enable-baf Enable or disable B-allele frequency output. Enabled by default.

BAF Output

The BAF outputs are BigWig-compressed files named <output-file-prefix>.baf.bw and <output-file-prefix>.hard-filtered.baf.bw. The hard-filtered file only contains entries for variants that pass the filters defined in the VCF (ie, PASS entries).

Each entry contains the following information: Chromosome Start End BAF

Where:

  • Chromosome is a string matching a reference contig.

  • Start and End define a zero-based, half-open interval.

  • BAF is a floating point value.
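
As a small illustration of the coordinate convention, the sketch below converts a 1-based VCF position into a zero-based, half-open entry. It assumes that each entry covers the single SNP base; the function name is hypothetical.

```python
def baf_entry(chrom: str, vcf_pos: int, baf: float) -> tuple:
    """Build a (Chromosome, Start, End, BAF) entry for a SNP at a 1-based VCF POS."""
    # The zero-based, half-open interval for one base is [POS - 1, POS).
    return (chrom, vcf_pos - 1, vcf_pos, baf)


print(baf_entry("chrX", 100, 0.25))
```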

Force Genotyping

DRAGEN supports force genotyping (ForceGT) for small variant calling. To use ForceGT, use the --vc-forcegt-vcf option with a list of small variants to force genotype. The input list of small variants can be a *.vcf or *.vcf.gz file.

The current limitations of ForceGT are as follows:

  • ForceGT is supported for germline small variant calling in the V3 mode. The V1, V2, and V2+ modes are not supported.

  • ForceGT is also supported for somatic small variant calling.

  • ForceGT variants do not propagate through joint genotyping.

ForceGT Input

DRAGEN supports only a single ForceGT VCF input file, which must meet the following requirements:

  • The input has to be a valid VCF file according to version 4.2 of the VCF standard. For instance, it has to have at least eight tab-delimited columns and records need to be sorted by reference contig and position.

  • The header has to list the same contigs as the reference used for variant calling. All variants must refer to one of these contig names.

  • Variants have to be normalized (parsimonious and left-aligned, see below).

  • It must not contain any multinucleotide or complex variants (AT -> C). These are variants that require more than one substitution / insertion / deletion to go from REF allele to ALT allele and are ignored.

  • Any deletions longer than 50bp are filtered out.

  • Any variant will only be called once. Duplicate entries will be ignored.

The following nonnormalized variant will cause undefined behavior in DRAGEN:

Instead of the non-parsimonious representation:

chrX 153592402 GC GCG

use the parsimonious representation:

chrX 153592403 C CG
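
The trimming step that produces a parsimonious representation can be sketched in Python. This is a generic illustration rather than DRAGEN code: it removes shared trailing and leading bases while keeping at least one base in each allele, and it does not perform left-alignment, which requires the reference sequence.

```python
def make_parsimonious(pos: int, ref: str, alt: str) -> tuple:
    """Trim bases shared by REF and ALT; returns the adjusted (pos, ref, alt)."""
    # Trim shared trailing bases first; the position does not change.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim shared leading bases; each trimmed base moves the position right.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


print(make_parsimonious(153592402, "GC", "GCG"))
```

Applied to the example above, the sketch turns position 153592402 GC GCG into 153592403 C CG.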

ForceGT Operation and Expected Outcome

Force genotyping requires an input VCF and can be used with DRAGEN software in VCF, GVCF or VCF+GVCF mode. In all cases the output file(s) contains all regular calls and the forceGT variants, as follows:

  • For a ForceGT call that was not called by the variant caller (not common), the call is tagged with FGT in the INFO field.

  • For a germline ForceGT call that was also called by the variant caller and filter field is PASS, the call is tagged with NML;FGT in the INFO field (NML denotes normal). In somatic mode, the call is tagged with FGT;SOM.

  • For a normal call (and PASS) by the variant caller, with no ForceGT call (normal), no extra tags are added (no NML tag, no FGT tag).

This scheme distinguishes among calls that are present due to FGT only, common in both ForceGT input and normal calling, and normal calls.
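
A downstream script could separate the three kinds of calls by inspecting the INFO tags. The sketch below is a hypothetical helper that works on the INFO string alone and relies only on the tag combinations described above (FGT, NML;FGT, FGT;SOM); treating SOM like NML is an assumption based on the somatic-mode description.

```python
def classify_forcegt(info: str) -> str:
    """Classify a VCF record by its ForceGT-related INFO tags (illustrative)."""
    keys = {field.split("=", 1)[0] for field in info.split(";") if field}
    if "FGT" not in keys:
        return "regular call"
    if "NML" in keys or "SOM" in keys:
        return "ForceGT and regular call"
    return "ForceGT only"


for info in ("DP=30", "FGT", "NML;FGT", "FGT;SOM"):
    print(info, "->", classify_forcegt(info))
```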

All the variants in the input ForceGT VCF are genotyped and present in the output file. The following table lists the reported GTs for the variants.

Condition | Reported GT
At a position with no coverage | ./. or .
At a position with coverage but no reads supporting ALT allele | 0/0 or 0
At a position with coverage and reads supporting ALT allele | dependent on pipeline (germline/somatic)

If DRAGEN calls a variant that is different from the one specified in the input ForceGT VCF, the output contains the following multiple entries at the same position:

  • One entry for the default DRAGEN variant call

  • One entry each for every variant call present in the input ForceGT VCF at that position

chrX 100 G C [Default DRAGEN variant call]
chrX 100 G A [Variant in ForceGT vcf]

If a target BED file is provided along with the input ForceGT VCF, then the output file only contains ForceGT variants that overlap the BED file positions.

Autogenerated MD5SUM for VCF Files

An MD5SUM file is generated automatically for VCF output files. This file is in the same output directory and has the same name as the VCF output file, but with an .md5sum extension appended. For example, whole_genome_run_123.vcf.md5sum. The MD5SUM file is a single-line text file that contains the md5sum of the VCF output file. This md5sum exactly matches the output of the Linux md5sum command.
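
One way to verify a VCF against its MD5SUM file is sketched below using Python's standard hashlib module. The function names are hypothetical. The sketch reads the first whitespace-separated token of the .md5sum file as the checksum, an assumption that covers both a bare hash and the "hash  filename" layout printed by the Linux md5sum command.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5sum_matches(vcf_path) -> bool:
    """Compare a VCF file against the checksum stored in <vcf_path>.md5sum."""
    recorded = Path(str(vcf_path) + ".md5sum").read_text().split()[0]
    return recorded.lower() == file_md5(vcf_path)
```

For example, md5sum_matches("whole_genome_run_123.vcf") would read whole_genome_run_123.vcf.md5sum from the same directory and return True when the checksums agree.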

Evidence BAM

Overview

The DRAGEN small variant caller is a haplotype-based caller which performs local assembly of all reads in an active region into a de Bruijn graph (DBG). The assembly process uses all the read bases including the soft-clip bases of reads. The soft-clip bases provide evidence for the presence of variants, specifically longer insertions and deletions which are not present in the read cigar and hence cannot be directly viewed in IGV.

The assembly and realignment step (using pair-HMM) performed by variant caller aims to correct mapping errors made by the original aligner and improves the overall variant caller accuracy. Using the evidence BAM, we can view how the variant caller sees the read evidence and how the reads have been realigned making it a very useful debugging tool.

By default, the evidence BAM contains only a subset of regions processed by the small variant caller. Only regions which have candidate indel variants and some percentage of soft-clip reads in the pileup are realigned and output in the evidence BAM. This is done to reduce the run-time overhead needed to generate the evidence BAM.

Please note this feature is only available in DNA germline and somatic modes, and not supported by the RNA variant caller.

Outputs

The output of the VC Evidence BAM feature will match the output format that the customer has selected using --output-format option. The default format is bam.

  • A bam/cram/sam file with the suffix _evidence.bam/cram/sam and the corresponding index file. The evidence BAM can be enabled along with the regular BAM output from the Map-Align step. When multiple BAMs are passed as inputs to the variant caller, for example in Tumor-Normal calling, they are combined in the evidence BAM output and tagged with appropriate read groups.

  • A bed file with regions that were realigned and output in VC Evidence BAM with suffix ".realigned-regions.bed".

Features

The evidence BAM consists of realigned reads, badly mated reads and reads that are disqualified by the variant caller based on the read likelihood scores.

  • Disqualified and Badly Mated reads

    Reads that are badly-mated (when the read and its mate are mapped to different chromosomes) are tagged with a BM tag (integer) and reads that are disqualified (based on read likelihoods) are tagged with the DQ tag (integer). These reads are filtered out by the genotyper in the variant caller.

  • Graph Haplotypes

    When graph haplotypes output is enabled using --vc-evidence-bam-output-haplotypes, all the haplotypes constructed by the de Bruijn graph are output in the evidence BAM as single reads covering the entire active region. The reads and haplotypes are tagged with different read groups, which makes them easy to distinguish in IGV. In IGV, use “Color Alignments By” or “Group Alignments By” > read group to separate the reads from the haplotypes. The haplotypes are tagged with read group EvidenceHaplotype and the reads are part of the EvidenceRead_Normal/Tumor read group.

    The haplotypes are named Haplotype 1, Haplotype 2, and so on, and have an additional ‘HC’ tag (integer). The realigned reads also have an HC tag, which encodes the haplotype that best matches the read based on the likelihood calculation. Only reads that are supported by a single unique haplotype have the HC tag; reads that match more than one haplotype well do not have an HC tag. This tag is primarily intended to enable highlighting of reads in IGV. Go to "Color Alignments By > Tag" and enter "HC" to view which reads are uniquely supported by a given graph haplotype.

Command Line Arguments

Name | Description | Default Value
vc-output-evidence-bam | Enable evidence BAM output | False
vc-evidence-bam-output-haplotypes | Output graph haplotypes in evidence BAM | False
vc-evidence-bam-clipped-read-threshold | Percentage of clipped reads in active region to enable evidence BAM output for that region | 10%
vc-evidence-bam-force-output | Force evidence BAM output for all active regions | False

Machine Learning for Variant Calling

DRAGEN secondary analysis employs machine learning based variant recalibration (DRAGEN-ML) for germline SNV VC. Variant calling accuracy is improved using powerful yet efficient machine learning techniques that augment the variant caller, by exploiting more of the available read and context information that does not easily integrate into the Bayesian processing used by the haplotype variant caller. A supervised machine learning method was developed using truth from the PrecisionFDA v4.2.1 sets to build a model that processes read and other contextual evidence to remove false positives, recover false negatives and reduce zygosity errors, for both SNVs and INDELs.

Setup

No additional setup is required. ML model files for the hg38 and hg19 human references are packaged with the DRAGEN installer. After installation, the files are present at <INSTALL_PATH>/resources/ml_model/<ref>. DRAGEN-ML is enabled by default as needed when running the germline SNV VC. DRAGEN automatically detects the reference used for analysis and uses the correct model files. If neither the hg38 nor the hg19 reference type is detected, ML recalibration is automatically disabled and SNV VC falls back to legacy operation.

Inputs

DRAGEN-ML requires a run with BAM or FASTQ input, since the machine learning model extracts information from the read pile-up. DRAGEN-ML runs concurrently with DRAGEN SNV VC. DRAGEN-ML can be applied to WGS or WES samples. Re-calibration of existing VCF files is not supported.

Outputs

DRAGEN-ML recalibrates all quality scores, changing the values of the QUAL and GQ fields in the output VCF/GVCF.

  • DRAGEN-ML also updates PL and GP in the output VCF/GVCF.

  • The genotypes (GT field) of some variants may be changed by ML e.g., 0/1 to 1/1 or vice versa.

  • DRAGEN-ML PHRED scores (e.g. QUAL) are better calibrated than, and differ significantly from, those with ML disabled. As a consequence, QUAL scores tend not to exceed 75 with ML enabled, whereas QUAL scores with ML disabled can exceed 1000. For this reason, the QUAL filtering threshold for both SNPs and indels is set to 3.0103 when DRAGEN-ML is enabled, compared to 10.41 for SNPs and 7.83 for indels when DRAGEN-ML is disabled.
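
To put the QUAL thresholds in perspective, the following Python sketch converts them to error probabilities. It assumes the standard PHRED scale, QUAL = -10 * log10(P(error)); the function name is illustrative. On that scale, 3.0103 corresponds to an error probability of about 0.5.

```python
def phred_to_error_probability(qual: float) -> float:
    """Convert a PHRED-scaled score to an error probability."""
    return 10.0 ** (-qual / 10.0)


thresholds = [
    ("ML enabled, SNP and indel", 3.0103),
    ("ML disabled, SNP", 10.41),
    ("ML disabled, indel", 7.83),
]
for label, qual in thresholds:
    print(f"{label}: QUAL {qual} -> P(error) = {phred_to_error_probability(qual):.3f}")
```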

The following variant types are recalibrated:

  • Biallelic and multiallelic variants

  • Autosomes and sex chromosomes, including haploid positions

  • Force GT calls

  • Non primary contigs

Accuracy Improvements

DRAGEN-ML typically removes 30-50% of SNP FPs, with smaller gains on INDELS. FN counts are reduced by 10% or more. The output QUAL/GQ of DRAGEN-ML is empirically more accurately calibrated than DRAGEN SNV VC without ML. There are significant gains in accuracy statistics across the entire genome with ML enabled. Note that a small number of variant calls may have degraded accuracy with ML enabled compared to VC without ML.

Run time

DRAGEN-ML adds about 10% to the run time compared to runs without ML.


--read-trimmers

To enable trimming filters in hard-trimming mode, set to a comma-separated list of the trimmer tools you would like to use (in the order of execution). To disable trimming, set to none. During mapping, artifacts are removed from all reads. The following are valid trimmer names:

  • fixed-len—Fixed-length trimming

  • polyg—Poly-G trimming

  • quality—Quality trimming

  • adapter—Adapter trimming

  • n—Ambiguous base trimming

  • min-len—Minimum length trimming

  • cut-end—Maximum length trimming

  • bisulfite—Bisulfite trimming

Read trimming is disabled by default (default: "none").

--soft-read-trimmers

To enable trimming filters in soft-trimming mode, set to a comma-separated list of the trimmer tools you would like to use (in the order of execution). To disable soft trimming, set to none. During mapping, reads are aligned as if trimmed, and bases are not removed from the reads. The following are the valid trimmer names.

  • fixed-len—Fixed-length trimming

  • polyg—Poly-G trimming

  • quality—Quality trimming

  • adapter—Adapter trimming

  • n—Ambiguous base trimming

  • min-len—Minimum length trimming

  • cut-end—Maximum length trimming

  • bisulfite—Bisulfite trimming

Soft-trimming is enabled for the polyg filter by default (default: "polyg").

--trimming-only

Disables mapping and alignment to run read-trimming only.

ROH Caller

Regions of homozygosity (ROH) are detected as part of the small variant caller. The caller detects and outputs the runs of homozygosity from whole genome calls on autosomal human chromosomes. Sex chromosomes are ignored unless the sample sex karyotype is XX, as specified on the command line or determined by the Ploidy Estimator. ROH output allows downstream tools to screen for and predict consanguinity between the parents of the proband subject.

A region is defined as consecutive variant calls on the chromosome with no large gap in between these variants. In other words, regions are broken by chromosome or by large gaps with no SNV calls. The gap size is set to 3 Mbases.

ROH Algorithm

The ROH algorithm runs on the small variant calls. The algorithm excludes variants with multiallelic sites, indels, complex variants, non-PASS filtered calls, and homozygous reference sites. The variant calls are then filtered further using a block list BED, and finally depth filtering is applied after the block list filter. The default value for the fraction of filtered calls is 0.2, which filters the calls with the highest 10% and lowest 10% in DP values. The algorithm then uses the resulting calls to find regions.
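The depth filter described above can be sketched as follows. This is a minimal illustration, not the DRAGEN implementation: the function name, the dictionary layout of a call (`pos`, `DP`), and the handling of ties and rounding are assumptions made for the example.

```python
def depth_filter(calls, fraction=0.2):
    """Drop the calls with the most extreme DP values.

    `fraction` is the total fraction of calls removed; half is taken from
    the low-DP tail and half from the high-DP tail (0.2 -> 10% + 10%).
    Each call is a dict with hypothetical keys "pos" and "DP".
    """
    if not calls:
        return []
    ranked = sorted(calls, key=lambda c: c["DP"])
    cut = int(len(ranked) * fraction / 2)  # number of calls removed per tail
    kept = ranked[cut:len(ranked) - cut]
    # Restore genomic order before region finding.
    return sorted(kept, key=lambda c: c["pos"])


# Hypothetical depths: one very low (5) and one very high (200) outlier.
depths = [30, 5, 31, 29, 200, 32, 28, 33, 27, 34]
calls = [{"pos": i, "DP": dp} for i, dp in enumerate(depths)]
kept_positions = [c["pos"] for c in depth_filter(calls)]
```

With ten calls and the default fraction, one call is removed from each tail, so the calls at positions 1 and 4 are dropped.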

The ROH algorithm first finds seed regions that contain at least 50 consecutive homozygous SNV calls with no heterozygous SNV or gaps of 500,000 bases between the variants. The regions can be extended using a scoring system that functions as follows.

  • The score increases by 0.025 with every additional homozygous variant and decreases by a large penalty (1 - 0.025 = 0.975) for every heterozygous SNV. This provides some tolerance for the presence of heterozygous SNVs in the region.

  • Each region expands on both ends until the regions reach the end of a chromosome, a gap of 500,000 bases between SNVs occurs, or the score becomes too low (0).

Overlapping regions are merged into a single region. Regions can be merged across gaps of 500,000 bases between SNVs if a single region would have been called from the beginning of the first region to the end of the second region without the gap. There is no maximum size for regions, but regions always end at chromosome boundaries.
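The scoring rule above can be illustrated with a short sketch. It is a simplified reading of the description, not the DRAGEN implementation: the function names, the `(position, is_homozygous)` tuple layout, and the exact stopping conditions (score at or below 0, gap strictly larger than the maximum) are assumptions. Seed finding and region merging are not modeled.

```python
def roh_score(n_hom, n_het, error_rate=0.025):
    """Score of a region: +error_rate per hom call, -(1 - error_rate) per het call."""
    return n_hom * error_rate - n_het * (1 - error_rate)


def extend(calls, start_idx, step, score, error_rate=0.025, max_gap=500_000):
    """Extend a region from start_idx in one direction (step = +1 or -1).

    `calls` is a list of (position, is_homozygous) tuples on one chromosome.
    Returns the index of the last call kept and the resulting score.
    """
    i = start_idx
    while 0 <= i + step < len(calls):
        nxt = i + step
        if abs(calls[nxt][0] - calls[i][0]) > max_gap:
            break  # gap between SNVs is too large
        delta = error_rate if calls[nxt][1] else -(1 - error_rate)
        if score + delta <= 0:
            break  # score would become too low
        score += delta
        i = nxt
    return i, score


# A minimal seed of 50 homozygous SNVs followed by two heterozygous SNVs.
calls = [(i * 1000, True) for i in range(50)] + [(50_000, False), (51_000, False)]
seed_score = roh_score(50, 0)                 # 50 * 0.025 = 1.25
last_idx, final_score = extend(calls, 49, +1, seed_score)
```

In this example the seed score of 1.25 absorbs the first heterozygous SNV (1.25 - 0.975 = 0.275) but not the second, so extension stops at index 50.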

ROH Options

  • --vc-enable-roh Set to true to enable the ROH caller. The ROH caller is enabled by default for human autosomes only. Set to false to disable.

  • --vc-roh-blacklist-bed If provided, the ROH caller ignores variants that are contained in any region in the block list BED file. DRAGEN distributes block list files for all popular human genomes and automatically selects a block list to match the genome in use, unless this option is used to select a file.

  • --vc-roh-enable-depth-filter Enable depth filter for ROH. The depth filter is enabled by default. Set to false to disable.

  • --vc-roh-min-seed-size The minimum number of consecutive homozygous SNPs to form a ROH (Default=50)

  • --vc-roh-max-gap-length The maximum gap length (bp) between two homozygous SNPs to be included in the same region (Default=500000)

  • --vc-roh-error-rate The rate of genotyping errors (Default=0.025)

ROH Output

The ROH caller produces an ROH output file named <output-file-prefix>.roh.bed in which each row represents one region of homozygosity. The BED file contains the following columns:

Chromosome Start End Score #Homozygous #Heterozygous

  • Score is a function of the number of homozygous and heterozygous variants, where each homozygous variant increases the score by 0.025, and each heterozygous variant reduces the score by 0.975.

  • Start and end positions are a 0-based, half-open interval.

  • #Homozygous is the number of homozygous variants in the region.

  • #Heterozygous is the number of heterozygous variants in the region.

The caller also produces a metrics file named <output-file-prefix>.roh_metrics.csv that lists the number of large ROH and the percentage of SNPs in large ROH (>3 MB).
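As a minimal sketch of how a downstream tool might read the ROH BED file, the following parses one row into named fields. The function names are hypothetical, the example row is invented for illustration, and the code assumes whitespace-separated columns in the order listed above and that "3 MB" means 3,000,000 bases.

```python
LARGE_ROH_BP = 3_000_000  # assumption: ">3 MB" interpreted as 3,000,000 bases


def parse_roh_bed_line(line):
    """Parse one row of <output-file-prefix>.roh.bed into a dict."""
    chrom, start, end, score, n_hom, n_het = line.split()[:6]
    start, end = int(start), int(end)
    return {
        "chrom": chrom,
        "start": start,
        "end": end,
        "length": end - start,  # 0-based, half-open interval
        "score": float(score),
        "n_hom": int(n_hom),
        "n_het": int(n_het),
    }


def is_large_roh(record):
    """True if the region is longer than the large-ROH size."""
    return record["length"] > LARGE_ROH_BP


# Hypothetical row: 1000 homozygous calls, no heterozygous calls.
rec = parse_roh_bed_line("chr1\t1000000\t4500000\t25.0\t1000\t0")
```

Because the interval is half-open, the region length is simply `end - start`, which is 3,500,000 bases for the row above.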

Concordance with PLINK

The table below demonstrates how the PLINK options can be tuned to behave similarly to the DRAGEN ROH caller default settings (see column DRAGEN default). We observed that PLINK ROH calls (see column PLINK default) in default settings are more conservative compared to DRAGEN default settings. By default, PLINK reports ROH regions of size 1MB or larger (see PLINK option --homozyg-kb ) with at least 100 homozygous SNPs (see PLINK option --homozyg-snp) while DRAGEN ROH caller reports smaller regions with at least 50 homozygous SNPs (see DRAGEN ROH Algorithm section). In addition, PLINK by default allows for only 1 heterozygous SNP per scanning window (specified by PLINK option --homozyg-window-het) while DRAGEN uses a soft score threshold penalty without setting an upper bound on the allowed number of heterozygous SNPs (see DRAGEN ROH Algorithm section). The PLINK ROH calls are largely comparable to the DRAGEN ROH calls after relaxing the default PLINK settings, shown in column PLINK tuned. Prior to PLINK ROH calling, the input DRAGEN hard-filtered VCF files are filtered as per the instructions in DRAGEN ROH Algorithm section.

| PLINK option | PLINK default | PLINK tuned | DRAGEN default | PLINK Definitions |
| --- | --- | --- | --- | --- |
| --homozyg-density | 50 | 50 |  | Minimum required density to call a ROH (1 SNP in 50 kb), can be increased to relax the per SNP density. |
| --homozyg-gap | 1000 | 1000 | 3000 | Maximal interval between two homozygous SNPs in a ROH (in kb) |
| --homozyg-kb | 1000 | 500 | All sizes reported | Minimal length of reported ROH (in kb) |
| --homozyg-snp | 100 | 50 | 50 | Minimal number of homozygous SNPs in the reported ROH |
| --homozyg-window-het | 1 | 2 | Soft score threshold (1-0.025) penalty for a het SNP and 0.025 gain for a hom SNP | Maximum number of heterozygous SNPs allowed in a scanning window |
| --homozyg-window-missing | 5 | 5 |  | Number of missing calls allowed in a scanning window |
| --homozyg-window-snp | 50 | 50 |  | Variants in a scanning window |
| --homozyg-window-threshold | 0.05 | 0.05 |  | For a SNP to be eligible for inclusion in a ROH, the hit rate/overlap of all scanning windows containing the SNP must be at least 0.05 |

Somatic Mode

The DRAGEN Somatic Pipeline allows ultrarapid analysis of Next-Generation Sequencing (NGS) data to identify cancer-associated mutations in somatic chromosomes. DRAGEN calls SNVs and indels from both matched tumor-normal pairs and tumor-only samples using a probability model that considers the possibility of somatic variants, germline variants, and various systematic noise artifacts. The model is informed by sample-specific nucleotide and indel noise patterns that are estimated from the data at runtime. When considering somatic variants, DRAGEN does not make any ploidy assumptions, which enables detection of low-frequency alleles. For loci with coverage up to 100x in the tumor sample, DRAGEN can detect variant allele frequencies down to approximately 5%. This limit scales with increasing depth on a per-locus basis. It is recommended to provide DRAGEN with a systematic noise file that contains position- and allele-specific noise frequencies as estimated from a panel of normal samples (see below); DRAGEN uses this noise file to filter calls that can be explained as resulting from position- and allele-specific noise. After multiple filtering steps, the output is generated as a VCF file. Variants that fail the filtering steps are kept in the output VCF. The variants include a FILTER annotation that indicates which filtering steps have failed.

For the tumor-normal pipeline, both samples are analyzed jointly. DRAGEN assumes that germline variants and systematic noise artifacts are shared by both samples, whereas somatic variants are present only in the tumor sample. Only somatic variants are reported. To detect systematic noise artifacts, DRAGEN recommends that the coverage in the normal sample be at least half of the coverage in the tumor sample.

Variant Scoring

DRAGEN uses a Bayesian approach to compute the posterior probability that a somatic variant is present and reports this as a phred-scale quantity, "somatic quality" (SQ):

##FORMAT=<ID=SQ,Number=1,Type=Float,Description="Somatic quality">

DRAGEN scores variants by computing likelihoods for several hypotheses and noise processes, taking into account many factors such as: the numbers of alt-supporting and ref-supporting reads in the tumor and normal samples (and hence the alt allele frequencies in both samples); mapping qualities and how these are distributed across the reads in the tumor and normal pileups; basecall qualities; forward vs reverse strand support; sample-wide estimates of insertion and deletion error probabilities as functions of repeat period, repeat length, and indel length; sample-wide estimates of nucleotide error biases; whether there are nearby co-phased events; and whether the positions and alleles in question are known somatic hotspots or associated with sequence-specific error patterns. You can use SQ as the primary metric to describe the confidence with which the caller made a somatic call. SQ is reported as a format field for the tumor sample (exception: for homozygous reference calls in gvcf mode it is instead a likelihood ratio, analogous to homref GQ as described in the germline section). Variants with SQ score below the SQ filter threshold are filtered out using the weak_evidence tag. To trade off sensitivity against specificity, adjust the SQ filter threshold. Lower thresholds produce a more sensitive caller and higher thresholds produce a more conservative caller. If performing tumor-normal analysis, the SQ field for the normal sample contains the Phred-scaled posterior probability that a putative call is a germline variant. The somatic caller does not test for diploid genotype candidates and does not output GQ or QUAL values.

If tumor SQ > vc-sq-call-threshold (default is 3 for tumor-normal and 0.1 for tumor-only), then FORMAT/GT is hard-coded to 0/1 for the tumor sample and 0/0 for the normal sample (if present), and the tumor-sample FORMAT/AF yields an estimate of the somatic variant allele frequency, which ranges anywhere within [0,1].

  • If the value for vc-sq-filter-threshold is lower than vc-sq-call-threshold, the filter threshold value is used instead of the call threshold value.

  • If tumor SQ < vc-sq-call-threshold, the variant is not emitted in the VCF.

  • If tumor SQ > vc-sq-call-threshold but tumor SQ < vc-sq-filter-threshold, the variant is emitted in the VCF, but FILTER=weak_evidence.

  • If tumor SQ > vc-sq-call-threshold and tumor SQ > vc-sq-filter-threshold, the variant is emitted in the VCF and FILTER=PASS (unless the variant is filtered by a different filter).

  • The default vc-sq-filter-threshold is 17.5 for tumor-normal and 3.0 for tumor-only analysis.

The following is an example somatic T/N VCF record. Tumor SQ > vc-sq-call-threshold but tumor SQ < vc-sq-filter-threshold, so the FILTER is marked as weak_evidence.

chr2 593701 . G A . weak_evidence
DP=97;MQ=48.74;SQ=3.86;NLOD=9.83;FractionInformativeReads=1.000
GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB 0/0:9.83:33,0:0.000:14,0:19,0:33
0/1:3.86:61,3:0.047:29,2:32,1:64:35,26,0,3:39,22,1,2
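The emit and filter rules above can be summarized in a small sketch. This is an illustration only: the function name and return labels are hypothetical, behavior at exact equality with a threshold is an assumption (the text only states strict inequalities), and the clustered-events penalty and all other filters are ignored.

```python
def classify_sq(tumor_sq, call_threshold=3.0, filter_threshold=17.5):
    """Classify a candidate by tumor SQ using tumor-normal defaults.

    Returns "not emitted", "weak_evidence", or "PASS".
    """
    # If the filter threshold is the lower of the two, it takes the place
    # of the call threshold.
    effective_call = min(call_threshold, filter_threshold)
    if tumor_sq < effective_call:
        return "not emitted"
    if tumor_sq < filter_threshold:
        return "weak_evidence"
    return "PASS"  # may still be filtered by a different filter


example = classify_sq(3.86)                    # the record shown above
tumor_only = classify_sq(1.0, call_threshold=0.1, filter_threshold=3.0)
```

For the example record, SQ=3.86 lies between the tumor-normal call threshold (3) and filter threshold (17.5), which matches its weak_evidence FILTER value.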

The clustered-events penalty is an exception to the above rule for emitting variants. By default, the clustered-events penalty replaces the (obsolete) clustered-events filter. Instead of applying a hard filter when too many events are clustered together, DRAGEN applies a penalty to the SQ scores of cophased clustered events. Clustered events with weak evidence are no longer called, but clustered events with strong evidence can still be called. This is equivalent to lowering the prior probability of observing clustered cophased variants. The penalty is applied after the decision to emit variants, so that penalized variants still appear in the VCF if their unpenalized score is high enough. Variants that are combined into an MNV via the --combine-phased-variants-distance option are treated as a single variant for the purposes of the penalty. The penalty will not be applied to somatic hotspot variants. To disable the clustered-events penalty, set --vc-clustered-event-penalty=0.

Somatic Mode Options

To run DRAGEN somatic small variant calling, enable the variant caller with --enable-variant-caller=true and pass in tumor, and optionally, matched normal inputs via the command line. FASTQ (both gzipped and Ora-compressed), FASTQ list, BAM and CRAM inputs are all supported input types. For all input types, reads will be aligned by the DRAGEN map/align module and resulting alignments fed into the caller by default. For BAM and CRAM inputs, you can bypass map/align and use existing alignments as variant caller input by setting --enable-map-align=false.

Please see the DRAGEN Recipe sections for recommended command lines in typical workflows. The following command line options are typically used for somatic small-variant calling:

  • --tumor-fastq1 and --tumor-fastq2

    Inputs a pair of FASTQ files into the mapper aligner and somatic variant caller. You can use these options with other FASTQ options to run in tumor-normal mode. For example:

    dragen -f -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
    --tumor-fastq1 <TUMOR_FASTQ1> \
    --tumor-fastq2 <TUMOR_FASTQ2> \
    --RGID-tumor <RG0-tumor> --RGSM-tumor <SM0-tumor> \
    -1 <NORMAL_FASTQ1> \
    -2 <NORMAL_FASTQ2> \
    --RGID <RG0> --RGSM <SM0> \
    --enable-variant-caller true \
    --output-directory /staging/examples/ \
    --output-file-prefix SRA056922_30x_e10_50M
  • --tumor-fastq-list

    Inputs a list of FASTQ files into the mapper aligner and somatic variant caller. You can use this option with other FASTQ options to run in tumor-normal mode. For example:

    dragen -f \
    -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
    --tumor-fastq-list <TUMOR_FASTQ_LIST> \
    --fastq-list <NORMAL_FASTQ_LIST> \
    --enable-variant-caller true \
    --output-directory /staging/examples/ \
    --output-file-prefix SRA056922_30x_e10_50M
  • --tumor-bam-input and --tumor-cram-input Inputs a mapped BAM or CRAM file into the somatic variant caller. You can use these options with other BAM/CRAM options to run in tumor-normal mode. When the mapper is enabled (default), reads from the input BAM/CRAM files are re-mapped and updated alignments are sent to the caller (supported for both tumor-normal and tumor-only BAM/CRAM input). When the mapper is disabled (--enable-map-align=false), the existing BAM/CRAM alignments will be used in the caller.

  • --vc-sq-call-threshold and --vc-sq-filter-threshold These options control the thresholds for emitting calls in the VCF and applying the weak_evidence filter tag (see above).

  • --vc-target-vaf This option allows the user to adjust the allele frequencies of haplotypes that will be considered by the caller as potentially appearing in the sample. It is not a hard threshold, but the variant caller will aim to detect variants with allele frequencies larger than this setting. In the case of tumor-normal runs, the frequency is measured with respect to the full set of reads (tumor and normal combined). The default threshold of 0.03 was selected to be as low as possible without incurring an excessive false positive cost; a lower setting may increase sensitivity for low-frequency variants, but may increase false positives and runtime; a higher setting may reduce false positives. Setting the vc-target-vaf to 0 will result in all haplotypes with at least two supporting reads being taken into consideration.

  • --vc-somatic-hotspots, --vc-use-somatic-hotspots, and --vc-hotspot-log10-prior-boost DRAGEN uses a hotspot VCF to indicate somatic mutations that are expected with increased frequency. The default hotspot file (automatically selected from <INSTALL_PATH>/resources/hotspots/somatic_hotspots_* based on the reference) is mostly based on the Memorial Sloan Kettering Cancer Center (MSKCC) published hotspots and positions in COSMIC with population allele counts (AC) >= 50. It is somewhat conservative and boosts only a few thousand positions. You can specify a custom hotspot file via the --vc-somatic-hotspots option (note: input VCF records must be sorted in the same order as contigs in the selected reference) or disable the hotspots feature with --vc-use-somatic-hotspots=false. The effect of the hotspot file is that the prior probability for hotspot variants is boosted by a factor, up to a maximum prior of 0.5. An SNV is considered to match a hotspot variant only if the allele in question is identical, whereas insertions or deletions are considered to match any insertion/deletion allele respectively. You can use --vc-hotspot-log10-prior-boost to control the size of the adjustment. The default value is 4 (log10 scale), corresponding to an increase of 40 phred; reducing this value results in a smaller adjustment.

  • --vc-systematic-noise This option allows the user to specify the systematic noise file. To run without a systematic noise file (not recommended), specify --vc-systematic-noise=NONE.

  • --vc-combine-phased-variants-distance This option is the same as in the germline variant caller (see "Combine Phased Variants" in the germline small-variant caller section).

  • --vc-skip-germline-tagging Set to true to disable the germline tagging feature in the tumor-only pipeline (not recommended).

  • --vc-excluded-regions-bed Optional excluded regions BED file specifying where variants will be hard-filtered. Useful, e.g., to exclude ALU regions that tend to be especially noisy in FFPE samples.

  • --vc-call-hotspots-in-excluded-regions Do not apply excluded regions filter to hotspot variants (Default=false).
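To make the hotspot prior boost described for --vc-hotspot-log10-prior-boost concrete, the following sketch applies a log10 boost to a prior and caps it at 0.5. The function names and the example prior values are hypothetical; the actual priors used by DRAGEN are not stated in this guide, and the way the cap interacts with priors already above 0.5 is an assumption.

```python
def boosted_prior(prior, log10_boost=4.0, max_prior=0.5):
    """Multiply a prior by 10**log10_boost, capped at max_prior.

    A prior that already exceeds max_prior is left unchanged (assumption).
    """
    boosted = min(prior * 10 ** log10_boost, max_prior)
    return max(prior, boosted)


def phred_gain(log10_boost=4.0):
    """Phred-scale size of the boost: 10 phred per factor of 10."""
    return 10 * log10_boost


low = boosted_prior(1e-6)     # hypothetical prior, boosted by 10^4
capped = boosted_prior(1e-3)  # hypothetical prior, hits the 0.5 cap
```

With the default boost of 4, a hypothetical prior of 1e-6 becomes about 1e-2 (a gain of 40 phred), while a prior of 1e-3 is capped at 0.5.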

Tumor-in-normal contamination and liquid tumor mode

In a tumor-normal analysis, DRAGEN accounts for tumor-in-normal (TiN) contamination by running liquid tumor mode. Liquid tumor mode is disabled by default, but we recommend enabling it with --vc-enable-liquid-tumor-mode=true if TiN contamination is expected. When liquid tumor mode is enabled, DRAGEN is able to call variants in the presence of TiN contamination up to a specified maximum tolerance level (default: 0.15). With the default maximum TiN contamination tolerance, somatic variants are expected to be observed in the normal sample with allele frequencies up to 15% of the corresponding allele in the tumor sample. The --vc-tin-contam-tolerance option enables liquid tumor mode and allows you to set the maximum TiN contamination tolerance.

Liquid tumor mode is not equivalent to liquid biopsy. Liquid tumors in liquid tumor mode refer to hematological cancers, such as leukemia. For liquid tumors, it is not feasible to use blood as a normal control because the tumor is present in the blood. Skin or saliva is typically used as the normal sample. However, skin and saliva samples can still contain blood cells, so that the matched normal control sample contains some traces of the tumor sample and somatic variants are observed at low frequencies in the normal sample. If the contamination is not accounted for, it can severely impact sensitivity by suppressing true somatic variants.

Liquid tumor mode typically uses a WGS or WES library with medium depth (for example, 100x tumor / 40x normal), and the lowest VAF detected at these depths is ~5%. Liquid biopsy typically uses a targeted gene panel (e.g., 500 genes) with very high raw depth, and uses UMI indexing (collapsing down to a depth of >2000x) to enable sensitivity at VAF down to 0.1% in some cases (the limit of detection will vary depending on coverage and data quality).

Mixing tumor and normal samples from different sequencing protocols

If using different sequencing systems or different library preparation methods for tumor and normal samples, we recommend setting --vc-override-tumor-pcr-params-with-normal=false. In tumor-normal mode, DRAGEN estimates a set of PCR error parameters separately for each of the tumor and normal samples. By default, DRAGEN ignores the tumor-sample parameters and uses normal-sample parameters for analysis of both samples. This default prevents overestimation of tumor-sample error rates that can occur if the somatic variant rate is high.

Allele frequency and related settings

There is no hard limit on the allele frequencies at which DRAGEN can report calls, but there are a number of points in the pipeline where low allele frequency can affect calling. The vc-target-vaf setting affects the threshold used to detect candidate haplotypes during localized haplotype assembly, but does not affect variant scoring. Once a candidate haplotype is detected, all putative variants appearing in the haplotype are scored and calls scoring above the SQ call threshold are emitted regardless of the allele frequency or the number of supporting reads.

The probability calculation in the somatic caller assesses variant and noise hypotheses at fixed allele frequencies defined by a discrete grid (by default at coverages <200: 0, 0.05, 0.1, ... 1.0). This means that the calculation will assess variants with allele frequencies below 0.05 as if the true frequency is equal to 0.05; this strategy does not preclude such variants from being called but may result in lower scores compared to if the true frequency had been considered. At positions with higher coverage, DRAGEN adds extra grid points as in the table below in order to consider hypotheses involving lower allele frequencies and effectively achieve a lower limit of detection (LOD), with the lowest VAF halving every time the coverage doubles:

| Coverage | Lowest AF |
| --- | --- |
| 0-199 | 0.05 |
| 200-399 | 0.025 |
| 400-799 | 0.0125 |
| ... | ... |
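The relationship between coverage and the lowest grid allele frequency can be sketched as follows. The function name is hypothetical, and values beyond the rows listed in the table are extrapolated from the stated rule that the lowest VAF halves each time the coverage doubles.

```python
def lowest_grid_af(coverage, base_af=0.05, base_coverage=200):
    """Lowest allele frequency on the grid for a given coverage.

    Below base_coverage the lowest AF is base_af; it halves each time the
    coverage reaches the next doubling (200, 400, 800, ...).
    """
    lowest, threshold = base_af, base_coverage
    while coverage >= threshold:
        lowest /= 2
        threshold *= 2
    return lowest


table = {cov: lowest_grid_af(cov) for cov in (150, 200, 399, 400, 799)}
```

The values for coverages 150, 200, and 400 reproduce the three rows of the table (0.05, 0.025, 0.0125).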

If calls below a certain VAF are not of interest, you can use --vc-enable-af-filter (see Post Somatic Calling Filtering below) to apply a hard filter on VAF.

Sample-specific NTD Error Bias Estimation

DRAGEN can compensate for oxidation and deamination artifacts that might exist upstream of the sequencing system, and are common in FFPE samples. DRAGEN does this by estimating nucleotide mutation biases on a per sample basis, taking account of read orientation. During variant calling, DRAGEN then corrects for nucleotide substitution biases by combining the estimated parameters with the basecall quality scores, thus modifying the nucleotide error rates used by the hidden Markov model.

This feature can be disabled by specifying --vc-enable-unequal-ntd-errors=false or set to auto-detect by specifying --vc-enable-unequal-ntd-errors=auto. In auto-detect mode, DRAGEN will run the estimation but then disable the use of the estimated parameters if it determines that the sample does not exhibit nucleotide error bias. When the feature is enabled, DRAGEN will by default estimate a smaller set of parameters in a monomer context. To estimate a larger set of parameters in a trimer context (recommended on sufficiently large panels when coverage is above 1000X), specify --vc-enable-trimer-context=true.

To specify the regions from which to estimate nucleotide substitution biases, use --vc-snp-error-cal-bed. Alternatively, if --vc-target-bed is used to specify the target regions for variant calling, and the total bed regions are sufficiently small (maximum 4 megabases), --vc-snp-error-cal-bed can be omitted and DRAGEN will use the target bed file for bias estimation. Otherwise, DRAGEN will use a default bed file selected to match the reference, and covering a mixture of coding and non-coding regions.

DRAGEN requires a panel size of at least 150kbp to correctly estimate nucleotide mutation biases when using trimer context, or at least 10kbp when using monomer context. If this requirement is not met for trimer context, DRAGEN falls back on the monomer model, and if it is not met for monomer context, DRAGEN turns the bias estimation feature off.
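The fallback behavior based on panel size can be written as a small decision function. This is a sketch of the rule as stated above; the function name and return labels are hypothetical, and the auto-detect mode of --vc-enable-unequal-ntd-errors is not modeled.

```python
TRIMER_MIN_BP = 150_000   # minimum panel size for trimer context
MONOMER_MIN_BP = 10_000   # minimum panel size for monomer context


def select_ntd_context(panel_size_bp, trimer_requested=False):
    """Return the bias-estimation context used for a panel of the given size.

    "trimer" is used only if requested and the panel is large enough;
    otherwise the monomer model is used if the panel is large enough,
    and estimation is turned "off" if it is not.
    """
    if trimer_requested and panel_size_bp >= TRIMER_MIN_BP:
        return "trimer"
    if panel_size_bp >= MONOMER_MIN_BP:
        return "monomer"
    return "off"


small_panel = select_ntd_context(100_000, trimer_requested=True)
```

For example, a 100 kbp panel with trimer context requested falls back to the monomer model.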

Unique Molecular Identifier (UMI) Support

DRAGEN provides two specialized UMI-aware variant calling pipelines for running from UMI-collapsed reads. These pipelines are optimized to take account of the increased read and basecall qualities that are typical in simplex- and duplex-collapsed reads. Both pipelines are disabled by default; when running with UMI collapsing enabled (--enable-umi true) or when running from UMI-collapsed BAMs, you can enable UMI-aware variant calling by setting one of the following options to true:

  • --vc-enable-umi-solid The VC UMI solid mode is optimized for solid tumors with post-collapsed coverage rates of ~200-300X and target allele frequencies of 5% and higher.

  • --vc-enable-umi-liquid The liquid biopsy pipeline is not equivalent to liquid tumor mode (see above). The liquid biopsy pipeline starts from a regular blood sample and looks for low VAF somatic variants from tumor cell-free DNA floating in the blood. This type of test enables tumor profiling (diagnosis/biomarker identification) from plasma rather than from tissue, which requires an invasive biopsy. The VC UMI liquid mode is optimized for a liquid biopsy pipeline with post-collapsed coverage rates of >2000X and target allele frequencies of 0.1% and higher.

  • If your UMI-collapsed reads do not meet the recommended post-collapsed coverage depths for the options listed above, we recommend you run with default settings.

If a third-party tool is used to produce the collapsed reads, then configure the tool so that the base call quality scores quantify the error produced by the sequencing system only. DRAGEN uses Sample-specific NTD Error Bias Estimation (see above) to account for errors upstream of the sequencing system, so such errors should not be included in base call quality scores.

gVCF Output

You can output a gVCF file for tumor-only data sets. A gVCF file reports information on every position of the input genome, including homozygous reference (homref) positions, i.e. positions where no alt allele (either germline or somatic) is present. DRAGEN creates a new <NON_REF> allele, to which reads that do not support the reference allele or any reported variant allele are assigned. In tumors, variants could exist at arbitrarily low allele frequencies and be undetectable. Thus, a somatic homref call cannot guarantee that no somatic variant at any allele frequency exists at the position. Instead, DRAGEN considers a position to be a homozygous reference if there are no somatic variants with an allele frequency at or above the limit of detection (LOD). Whereas the SQ score for an ordinary alt allele is a phred-scale posterior probability, the SQ score for the <NON_REF> allele is a phred-scale ratio between the likelihood of a homref call and the likelihood of a variant call with allele frequency at the LOD (if an alt allele is also reported, the <NON_REF> SQ score is capped at the complement of the posterior probability for the alt allele). If the LOD value is lowered, fewer homref calls are made. If the LOD value is increased, more homref calls are made.

By default the LOD is set to 5%, but you can enter a different value using the --vc-gvcf-homref-lod option.

Post Somatic Calling Filtering

DRAGEN can add a number of filters by populating the FILTER column in the VCF. The output is provided in the <output-file-prefix>.hard-filtered.vcf.gz output file.

Options

The following options are available for post somatic calling filtering:

  • --vc-sq-call-threshold

    Sets the SQ threshold for emitting calls in the VCF. The default is 3.0 for tumor-normal and 0.1 for tumor-only. If the value for vc-sq-filter-threshold is lower than vc-sq-call-threshold, the filter threshold value is used instead of the call threshold value.

  • --vc-sq-filter-threshold

    Sets the SQ threshold below which emitted VCF calls are marked as filtered (weak_evidence). The default is 17.5 for tumor-normal and 3.0 for tumor-only.

  • --vc-enable-triallelic-filter

    Enables the multiallelic filter. The default is true. This filter will not be applied to somatic hotspot variants.

  • --vc-enable-non-primary-allelic-filter

    Similar to the triallelic filter, but filters less aggressively. Keep the allele per multiallelic position with highest alt AD, and only filter the rest (Default=false). This filter will not be applied to somatic hotspot variants. Cannot be enabled when the triallelic filter is also on.

  • --vc-enable-af-filter

    Enables the allele frequency filter for nuclear chromosomes. The default value is false. When set to true, variants with an allele frequency below the AF call threshold are excluded from the VCF, and variants with an allele frequency below the AF filter threshold are tagged with the low_af filter tag. The default AF call threshold is 1% and the default AF filter threshold is 5%. To change the threshold values, use the --vc-af-call-threshold and --vc-af-filter-threshold command-line options. Use --vc-enable-af-filter-mito and the corresponding threshold options for mitochondrial allele frequency filtering.

  • --vc-enable-non-homref-normal-filter

    Enables the non-homref normal filter. The default value is true. When set to true, the VCF filters out variants if the normal sample genotype is not a homozygous reference.

  • --vc-enable-vaf-ratio-filter

    Adds a condition to the alt_allele_in_normal filter. The default value is false. When set to true, variants are filtered out if the normal sample AF is greater than 20% of the tumor sample AF.

  • --vc-depth-filter-threshold

    Filters all somatic variants (alt or homref) with a depth below this threshold. The default value is 0 (no filtering).

  • --vc-homref-depth-filter-threshold

    In gVCF mode, filters all somatic homref variants with a depth below this threshold. The default value is 3.

  • --vc-depth-annotation-threshold

    Filters all non-PASS somatic alt variants with a depth below this threshold. The default value is 0 (no filtering).
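The two-threshold behavior of --vc-enable-af-filter from the list above can be sketched as follows. The function name and return labels are hypothetical, behavior at exact equality with a threshold is an assumption, and the mitochondrial variant of the filter is not modeled.

```python
def apply_af_filter(af, enabled=False, call_threshold=0.01, filter_threshold=0.05):
    """Outcome of the allele frequency filter for one variant.

    Returns "excluded" (not written to the VCF), "low_af" (written with the
    low_af filter tag), or "unchanged" (this filter does not apply).
    """
    if not enabled:
        return "unchanged"
    if af < call_threshold:
        return "excluded"
    if af < filter_threshold:
        return "low_af"
    return "unchanged"


outcome = apply_af_filter(0.047, enabled=True)
```

For example, with the filter enabled and default thresholds, a variant with AF 0.047 is kept in the VCF but tagged low_af, because it lies between the 1% call threshold and the 5% filter threshold.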

Filters

Somatic Mode
Filter ID
Description

Tumor-Only & Tumor-Normal

weak_evidence

Variant does not meet likelihood threshold. The likelihood ratio for SQ tumor-normal is < 17.5 or < 3.0 for SQ tumor-only.

Tumor-Only & Tumor-Normal

multiallelic

Site filtered if there are two or more ALT alleles at this location in the tumor. Not applied to somatic hotspot variants.

Tumor-Only & Tumor-Normal

base_quality

Median base quality of ALT reads at this locus is < 20.

Tumor-Only & Tumor-Normal

mapping_quality

Median mapping quality of ALT reads at this locus is < 20 (tumor-normal) or < 30 (tumor-only).

Tumor-Only & Tumor-Normal

fragment_length

Absolute difference between the median fragment length of alt reads and median fragment length of ref reads at a given locus > 10000.

Tumor-Only & Tumor-Normal

read_position

Median of distances between the start and end of read and a given locus < 5 (the variant is too close to edge of all the reads). To output variant read position to the INFO field, use --vc-output-variant-read-position=true.

Tumor-Only & Tumor-Normal

low_af

Allele frequency is below the threshold specified with --vc-af-filter-threshold (default is 5%). Enabled only when using --vc-enable-af-filter=true.

Tumor-Only & Tumor-Normal

systematic_noise

If AQ score is < 10 (default) for tumor-normal or < 60 (default) for tumor-only, the site is filtered.

Tumor-Only & Tumor-Normal

low_frac_info_reads

The fraction of informative reads (denominator excludes filtered_out reads) is below the threshold. The default threshold value is 0.5.

Tumor-Only & Tumor-Normal

filtered_reads

More than 50% of reads have been filtered out.

Tumor-Only & Tumor-Normal

long_indel

Indel length is more than 100bp.

Tumor-Only & Tumor-Normal

low_depth

The site was filtered because the number of reads is too low. The filter is off by default.

Tumor-Only & Tumor-Normal

low_tlen

The site was filtered because the fraction of low TLEN ALT supporting reads is above a threshold. The default threshold is 0.4. Reads with TLEN smaller than -2.25 (default) standard deviations from the mean are considered to be low TLEN. This filter is not applied for reads sampled from tight insert distributions i.e., stddev / mean < 0.1 (default).

Tumor-Only & Tumor-Normal

no_reliable_supporting_read

No reliable supporting read was found in the tumor sample. A reliable supporting read is a read supporting the alt allele with mapping quality ≥ 40, fragment length ≤ 10,000, base call quality ≥ 25, and distance from start/end of read ≥ 5.

Tumor-Only & Tumor-Normal

too_few_supporting_reads

Variant is supported by < 3 reads in the tumor sample. This filter is not applied in UMI-aware pipelines.

Tumor-Normal

noisy_normal

More than three alleles are observed in the normal sample at allele frequency above 9.9%.

Tumor-Normal

alt_allele_in_normal

ALT allele frequency in the normal sample is above 0.2 plus the maximum contamination tolerance. The maximum contamination tolerance is 0 in solid tumor mode and defaults to 0.15 in liquid tumor mode. See vc-enable-vaf-ratio-filter for optional conditions.

Tumor-Normal

non_homref_normal

Normal sample genotype is not a homozygous reference.
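As an illustration only, the read-level thresholds listed in the table above can be sketched as a small Python function. The dictionary keys are hypothetical names chosen for this example and are not DRAGEN field names; the sketch ignores hotspot exceptions and is not DRAGEN's implementation.

```python
def read_level_filters(site, tumor_only):
    """Return filter IDs from the table that a site would trigger.

    `site` is a hypothetical dict of per-locus summary statistics;
    the key names are illustrative, not DRAGEN output fields.
    """
    filters = []
    # base_quality: median base quality of ALT reads < 20
    if site["median_alt_base_quality"] < 20:
        filters.append("base_quality")
    # mapping_quality: cutoff is 30 for tumor-only, 20 for tumor-normal
    mapq_cutoff = 30 if tumor_only else 20
    if site["median_alt_mapping_quality"] < mapq_cutoff:
        filters.append("mapping_quality")
    # fragment_length: |median ALT length - median REF length| > 10000
    length_diff = abs(site["median_alt_fragment_length"]
                      - site["median_ref_fragment_length"])
    if length_diff > 10000:
        filters.append("fragment_length")
    # read_position: median distance to the read edge < 5
    if site["median_alt_read_edge_distance"] < 5:
        filters.append("read_position")
    # long_indel: indel length more than 100 bp
    if site["indel_length"] > 100:
        filters.append("long_indel")
    # low_frac_info_reads: fraction of informative reads below 0.5
    if site["frac_informative_reads"] < 0.5:
        filters.append("low_frac_info_reads")
    return filters


example_site = {
    "median_alt_base_quality": 32,
    "median_alt_mapping_quality": 25,
    "median_alt_fragment_length": 350,
    "median_ref_fragment_length": 340,
    "median_alt_read_edge_distance": 40,
    "indel_length": 0,
    "frac_informative_reads": 0.97,
}
print(read_level_filters(example_site, tumor_only=True))
print(read_level_filters(example_site, tumor_only=False))
```

The same site can pass in tumor-normal mode and fail in tumor-only mode because the mapping quality cutoff differs between the two pipelines.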

Systematic Noise Filtering

The DRAGEN systematic noise filter significantly improves somatic variant calling precision, especially in tumor-only mode. DRAGEN enforces its use in the tumor-only pipeline by refusing to start a run without a noise file (this option can explicitly be disabled). This filter removes noise that consistently appears at specific locations in the reference genome. This noise can arise from:

  • Mis-mapping in low-complexity regions: Repetitive sequences with low information content can lead to reads mapping to incorrect locations.

  • PCR noise in homopolymer regions: Regions with long stretches of the same nucleotide (e.g., AAAAA) can introduce errors during PCR amplification.

To determine whether a variant should be filtered, the systematic noise filter compares the observed variant's allele frequency (AF) to the noise level at the matching locus in the systematic noise file. A variant is filtered if its AF is not significantly higher, in a statistical sense, than the recorded noise.

Note that the systematic noise filter specifically aims to remove noise, not germline variants; however, it may inadvertently filter some germline variants. For this reason, it is not ideal to evaluate the systematic noise file on germline admixture datasets.

Newer versions of the systematic noise filter will include allele-specific information along with two columns for noise frequency: one for the "mean" noise and one for the "max" noise. During a VC run, DRAGEN will automatically detect the input sample type as either WGS or WES/panel and will apply the optimal noise values based on sample type and run context. For WGS data, the "max" noise is used by default; for WES/panel data or whenever UMI is enabled, the "mean" noise is used.
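The default selection between the "mean" and "max" noise columns described above can be summarized with a minimal sketch. The function name and the sample type labels are assumptions made for this example; DRAGEN performs the detection automatically and the real logic may consider more run context.

```python
def select_noise_column(sample_type, umi_enabled=False):
    """Pick the noise column used by default, per the rule described above.

    sample_type: "WGS", "WES" or "panel" (labels chosen for this sketch).
    """
    if sample_type not in ("WGS", "WES", "panel"):
        raise ValueError(f"unknown sample type: {sample_type}")
    # WGS uses "max" noise unless UMI is enabled; everything else uses "mean".
    if sample_type == "WGS" and not umi_enabled:
        return "max"
    return "mean"


print(select_noise_column("WGS"))
print(select_noise_column("WGS", umi_enabled=True))
print(select_noise_column("panel"))
```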

WES and WGS prebuilt systematic noise files are available for download (see below).

Custom panels will require custom noise files. It is recommended to use normal samples sequenced on the same instrument type and using the same library prep. Building your own noise file is especially helpful for clean UMI samples that tend to have less noise than WGS/WES samples. To generate a noise file it is recommended to use approximately 30-70 normal samples, although fewer normal samples (1-10) can still be used to generate useful noise files.

The systematic noise filter is used in the DRAGEN tumor-only or tumor-normal pipeline by adding --vc-systematic-noise NOISE_FILE_PATH.

Option
Description

--vc-systematic-noise

Specifies a systematic noise BED file. If a somatic variant does not pass the AQ threshold, the variant is marked as 'systematic_noise' in the FILTER column of the output VCF.

--vc-systematic-noise-filter-threshold

Set the AQ threshold. Higher values filter more aggressively. By default the threshold value is 10 for tumor-normal and 60 for tumor-only. The valid range spans 0-100. For tumor-normal runs the threshold may be set higher (e.g. to 60) to improve specificity at the possible cost of some sensitivity.

--vc-systematic-noise-filter-threshold-in-hotspot

Set the AQ threshold to use in hotspot regions, where one may want to filter less aggressively than in the rest of the genome. By default, the threshold value is 10 for tumor-normal and 20 for tumor-only.

--vc-allele-specific-systematic-noise

Apply systematic noise in an allele-specific manner when allele information is available. This setting is ignored for v1.x.x noise files. (Default=true)
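The default AQ thresholds in the table above can be expressed as a short sketch. It assumes only what the table states: a variant is marked 'systematic_noise' when its AQ score is below the applicable threshold, with separate defaults for hotspot regions. The function and argument names are hypothetical.

```python
# Default AQ thresholds keyed by (pipeline mode, in hotspot region).
DEFAULT_AQ_THRESHOLDS = {
    ("tumor-normal", False): 10,
    ("tumor-only", False): 60,
    ("tumor-normal", True): 10,
    ("tumor-only", True): 20,
}


def is_systematic_noise(aq, mode, in_hotspot=False, threshold=None):
    """Return True if a variant with this AQ score would be filtered.

    `threshold` stands in for a user-supplied value such as the one given
    with --vc-systematic-noise-filter-threshold; None means "use the default".
    """
    if threshold is None:
        threshold = DEFAULT_AQ_THRESHOLDS[(mode, in_hotspot)]
    if not 0 <= threshold <= 100:
        raise ValueError("AQ threshold must be within 0-100")
    return aq < threshold


print(is_systematic_noise(30, "tumor-only"))
print(is_systematic_noise(30, "tumor-only", in_hotspot=True))
print(is_systematic_noise(30, "tumor-normal", threshold=60))
```

A variant with AQ 30 is filtered in tumor-only mode outside hotspots, kept inside a hotspot, and filtered in tumor-normal mode only if the threshold is raised above 30.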

Prebuilt Systematic Noise BED Files

Somatic Systematic Noise Baseline Collection v2.0.0 noise files include allele specific information to better preserve sensitivity with systematic noise filtering enabled. Each v2.0.0 noise file includes both "mean" and "max" noise in separate columns, with the appropriate noise applied automatically based on auto-detected input type and run context.

The latest noise files (v2.0.0) contain more columns than earlier noise files and are therefore incompatible with versions of DRAGEN prior to v4.3. Older noise files are still supported in the current version of DRAGEN; however, the older noise files lack allele specific information and noise filtering will be applied by position only as was the default in v4.2 and earlier versions of DRAGEN.

Custom Systematic Noise Files

The BaseSpace Sequence Hub DRAGEN Baseline Builder App or the DRAGEN Systematic Noise File Builder Pipeline on ICA can be used to build systematic noise files in the cloud.

Option
Description

--build-sys-noise-vcfs-list

Text file containing the paths of normal VCFs. Specify the full VCF file paths. List one file per line.

--build-sys-noise-germline-vaf-threshold

Variant calls with VAF higher than this threshold will be considered germline and will not contribute to the noise estimate. This option is disabled by default by setting the threshold to 1. (Default 1)

--build-sys-noise-use-germline-tag

This option will ensure that variants tagged by vc-enable-germline-tagging=true will not be counted as noise. (Default true)

--build-sys-noise-min-sample-cov

Minimum coverage at a site for a sample to be used towards noise estimation. At low coverage, estimated allele frequencies become less reliable. Accurate AF estimation is important for germline variant detection, and also for noise detection when using MAX noise. (Default 5)

--build-sys-noise-min-supporting-samples

Min number of samples with noise at a position in order for a position to be considered systematic-noise (Default 1).
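The file passed to --build-sys-noise-vcfs-list is a plain text file with one full VCF path per line. A minimal sketch for generating it is shown below; the helper name, the directory layout and the "*.vcf.gz" pattern are assumptions for this example.

```python
from pathlib import Path


def write_vcf_list(vcf_dir, list_path, pattern="*.vcf.gz"):
    """Write the full path of every matching VCF in vcf_dir, one per line."""
    vcf_paths = sorted(p.resolve() for p in Path(vcf_dir).glob(pattern))
    if not vcf_paths:
        raise FileNotFoundError(f"no files matching {pattern} in {vcf_dir}")
    # One absolute path per line, as the option description requires.
    Path(list_path).write_text("".join(f"{p}\n" for p in vcf_paths))
    return vcf_paths
```

The resulting file can then be supplied as the value of --build-sys-noise-vcfs-list.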

Germline Tagging in the Tumor-Only Pipeline

When enabling DRAGEN for tumor-only somatic calling, potential germline variants can be tagged in the INFO field with 'GermlineStatus' using population databases. Current databases include 1KG and both exome and genome sequencing data from gnomAD. The following options are available for this feature:

  • --vc-enable-germline-tagging Enable germline tagging. The default is 'false'. In a tumor-only analysis, this option must either be set 'true' (recommended) or germline tagging must be explicitly disabled with --vc-skip-germline-tagging=true (not recommended). Once the vc-enable-germline-tagging option is set to 'true', it will require the user to pass in a variant annotation data directory as follows:

    • --variant-annotation-data Nirvana annotation database (Downloadable at https://support.illumina.com/content/dam/illumina-support/help/Illumina_DRAGEN_Bio_IT_Platform_v3_7_1000000141465/Content/SW/Informatics/Dragen/Nirvana_DownloadData_fDG.htm)

Additional options to control how to define germline variants.

  • --germline-tagging-db-threshold The minimum alternative allele count across population databases for a variant to be defined as germline (default=50).

  • --germline-tagging-pop-af-threshold The minimum population allele frequency for a variant to be defined as germline. Once specified, this will override the input from --germline-tagging-db-threshold.
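How the two thresholds interact can be illustrated with a small sketch. It assumes that "minimum" means greater than or equal to, and that the allele count and population allele frequency have already been looked up in the annotation databases; the function and argument names are hypothetical and this is not DRAGEN's implementation.

```python
def is_tagged_germline(alt_allele_count, population_af,
                       db_threshold=50, pop_af_threshold=None):
    """Decide whether a variant would be defined as germline.

    If a population AF threshold is given, it overrides the allele count
    threshold, mirroring the option descriptions above.
    """
    if pop_af_threshold is not None:
        return population_af >= pop_af_threshold
    return alt_allele_count >= db_threshold


print(is_tagged_germline(120, 0.0004))
print(is_tagged_germline(120, 0.0004, pop_af_threshold=0.01))
```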

1    11301714        .       A       G       .       PASS    
DP=3626;MQ=249.61;FractionInformativeReads=0.974;AQ=100.00;GermlineStatus=Germline_DB   
GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB  0/1:64.73:1772,1758:0.498:872,901:900,857:3530:846,926,843,915:894,878,874,884
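To act on the tag downstream, the GermlineStatus value can be read from the INFO column. The sketch below assumes the example above is a single tab-separated VCF record (it is wrapped across lines here) and uses only the Python standard library.

```python
record = "\t".join([
    "1", "11301714", ".", "A", "G", ".", "PASS",
    "DP=3626;MQ=249.61;FractionInformativeReads=0.974;AQ=100.00;"
    "GermlineStatus=Germline_DB",
    "GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB",
    "0/1:64.73:1772,1758:0.498:872,901:900,857:3530:"
    "846,926,843,915:894,878,874,884",
])


def parse_info(vcf_line):
    """Return the INFO column (8th field) of a VCF record as a dict."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = {}
    for entry in fields[7].split(";"):
        key, _, value = entry.partition("=")
        # Flag-style entries have no value; store them as True.
        info[key] = value if value else True
    return info


info = parse_info(record)
print(info["GermlineStatus"])
```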

Mutation Annotation Format (MAF) Conversion in Tumor-Only and Tumor-Normal Pipelines

When enabling DRAGEN for tumor-only or tumor-normal pipelines with Nirvana Annotation, the Nirvana JSON output can be converted into a Mutation Annotation Format (MAF) file. The MAF file is a tab-separated values file containing aggregated mutation information and will be saved to the output directory that you specify. You can enable MAF conversion directly as part of the somatic small variant calling workflow (integrated mode) or separately by providing a path to a VCF file or annotated JSON file (standalone mode).

When running MAF conversion as part of the somatic small variant calling workflow, the following options are required for this feature:

Annotation options:

  • --enable-variant-annotation=true Enable variant annotation

MAF conversion options:

  • --enable-maf-output=true Enable MAF output

  • --maf-transcript-source Desired transcript source, RefSeq or Ensembl

Additional standalone options (when running without the variant caller):

  • --maf-input-vcf Input VCF with the following form: <path>/<file_name>.hard-filtered.vcf.gz

  • --maf-input-json Input JSON with the following form: <path>/<file_name>.hard-filtered.annotated.json.gz

Please note that when specifying standalone mode with VCF input, you must also enable annotation options to generate the JSON file. Conversely, annotation options should not be specified when running standalone mode with an input annotated JSON file.
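The rule above for standalone mode can be captured in a small helper that assembles the option list and rejects inconsistent combinations. This is a sketch: the helper and its argument names are hypothetical, and it only uses options that appear in this section.

```python
def maf_standalone_args(ref_dir, transcript_source, input_vcf=None,
                        input_json=None, annotation_assembly=None,
                        annotation_data=None):
    """Build DRAGEN options for standalone MAF conversion."""
    if (input_vcf is None) == (input_json is None):
        raise ValueError("provide exactly one of input_vcf or input_json")
    if transcript_source not in ("RefSeq", "Ensembl"):
        raise ValueError("transcript source must be RefSeq or Ensembl")
    args = [f"--ref-dir={ref_dir}", "--enable-maf-output=true",
            "--maf-transcript-source", transcript_source]
    if input_vcf is not None:
        # VCF input: annotation must run to produce the JSON file.
        if annotation_assembly is None or annotation_data is None:
            raise ValueError("VCF input requires annotation options")
        args += [f"--maf-input-vcf={input_vcf}",
                 "--enable-variant-annotation=true",
                 "--variant-annotation-assembly", annotation_assembly,
                 "--variant-annotation-data", annotation_data]
    else:
        # Annotated JSON input: annotation options should not be given.
        if annotation_assembly is not None or annotation_data is not None:
            raise ValueError("do not pass annotation options with JSON input")
        args.append(f"--maf-input-json={input_json}")
    return args


print(" ".join(maf_standalone_args(
    "/path/to/ref/dir", "RefSeq",
    input_json="/path/to/annotated/json/file")))
```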

Optional options:

  • --maf-include-non-pass-variants Enabling this option will output all variants, including non-PASS variants, in the MAF output file.

This option is not required. By default, the MAF output contains only variants that have the PASS filter in the hard-filtered VCF file.

Example command lines:

MAF output from BAM input and variant caller:

bin/dragen --output-dir=/path/to/output/dir --output-file-prefix=prefix_name --ref-dir=/path/to/ref/dir --enable-map-align=false --enable-sort=false --enable-variant-caller=true -b /path/to/normal/bam --tumor-bam-input /path/to/tumor/bam --enable-variant-annotation=true --variant-annotation-assembly <GRCh37/GRCh38> --variant-annotation-data /path/to/annotation/data --enable-maf-output=true --maf-transcript-source <Ensembl/RefSeq> --maf-include-non-pass-variants <true/false>

MAF output from output directory and output file prefix, where the output directory contains a VCF file prefixed by the output file prefix:

bin/dragen --output-dir=/path/to/output/dir/with/vcf --output-file-prefix=prefix_of_vcf --ref-dir=/path/to/ref-dir --enable-variant-annotation=true --variant-annotation-assembly <GRCh37/GRCh38> --variant-annotation-data /path/to/annotation/data --enable-maf-output=true --maf-transcript-source <Ensembl/RefSeq> --maf-include-non-pass-variants <true/false>

MAF output from source VCF file:

bin/dragen --ref-dir=/path/to/ref/dir --enable-variant-annotation=true --variant-annotation-assembly <GRCh37/GRCh38> --variant-annotation-data /path/to/annotation/data --enable-maf-output=true --maf-input-vcf=/path/to/vcf/file --maf-transcript-source <Ensembl/RefSeq> --maf-include-non-pass-variants <true/false>

Note: This command line will output the MAF file in the same location as the input VCF file. To specify a directory for output, add --output-dir and --output-file-prefix options.

MAF output from source annotated JSON file:

bin/dragen --ref-dir=/path/to/ref/dir --enable-maf-output=true --maf-input-json=/path/to/annotated/json/file --maf-transcript-source <Ensembl/RefSeq> --maf-include-non-pass-variants <true/false>

Note: This command line will output the MAF file in the same location as the input annotated JSON file. To specify a directory for output, add the --output-dir and --output-file-prefix options.

polya—RNA Poly-A tail trimming. See additional description in section

An alternative algorithm to detect ROH regions is provided in the Germline WGS ASCN caller; see the relevant section for potential differences between the two approaches.

The tumor-only pipeline produces output that contains both germline and somatic variants and can be further analyzed to identify tumor mutations. The caller does not attempt to distinguish between them: filtering out common germline variants as reported in databases is currently the most reliable way to remove germline variants. The tumor-only pipeline provides a germline tagging feature and requires this feature to be explicitly enabled or disabled. When germline tagging is enabled, a variant annotation data directory must be passed in via the command line; DRAGEN will then tag variants that are common in the gnomAD database as germline so they can be filtered out if desired (see details in Germline Tagging in the Tumor-Only Pipeline). The tumor-only pipeline also requires the presence of a systematic noise file by default. To run without germline tagging and/or systematic noise files, these options need to be disabled explicitly.

--vc-callability-tumor-thresh Specifies the callability threshold for tumor samples. The somatic callable regions report includes all regions with tumor coverage above the tumor threshold. The default value is 50. For more information on the somatic callable regions report, see Somatic Callable Regions Report.

--vc-callability-normal-thresh Specifies the callability threshold for normal samples, if present. If applicable, the somatic callable regions report includes all regions with normal coverage above the normal threshold. The default value is 5. For more information on the somatic callable regions report, see Somatic Callable Regions Report.

Nucleotide (NTD) Error Bias Estimation is on by default and recommended as a replacement for the orientation bias filter. Both methods take account of strand-specific biases (systematic differences between F1R2 and F2R1 reads). In addition, NTD error estimation accounts for non-strand-specific biases such as sample-wide elevation of a certain SNV type, e.g. C->T or any other transition or transversion. This is done by collecting counts (sampled from across the genome, and counted per read orientation) of reads supporting each specific nucleotide substitution (C->T, G->A, etc.). The estimated rate of each substitution is written to a metrics file named "*.allele-transition-noise-metrics.csv". NTD error estimation can also capture these biases in a trinucleotide context, e.g. in the case of C->T it will break down the counts as ACA->ATA, CCA->CTA, GCA->GTA, TCA->TTA, etc.
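For reference, the trinucleotide breakdown of a single substitution can be enumerated as follows. This sketch only lists the 16 possible flanking-base contexts; it does not reproduce how DRAGEN samples or counts reads.

```python
from itertools import product


def trinucleotide_contexts(ref, alt):
    """List the 16 trinucleotide contexts of a ref->alt substitution."""
    bases = "ACGT"
    return [f"{five}{ref}{three}->{five}{alt}{three}"
            for five, three in product(bases, repeat=2)]


contexts = trinucleotide_contexts("C", "T")
print(len(contexts))
print(contexts[:4])
```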

Prebuilt systematic noise files can be downloaded from the DRAGEN Software Support Site page.

The default WES and WGS noise files were generated using a combination of Nextera and TruSeq samples (with and without PCR). There are also hg38 WGS HEME and FFPE specific noise files. For details please refer to SNV Systematic Noise Files.

For example command lines on how to build a custom noise file, please refer to the respective DRAGEN recipes: DRAGEN Recipes.

The total numbers of variants tagged as germline and somatic in the VCF are written to a metrics file named "*.vc_germline_tagging_metrics.csv".

--variant-annotation-data Nirvana annotation database, please see Nirvana.

DRAGEN Software Support Site page
SNV Systematic Noise Files
DRAGEN Recipes
Nirvana
Germline Tagging in the Tumor-Only Pipeline
Targeted Caller | Exome calling
DRAGEN Multigenome Mapper
relevant section for potential differences between the two approaches
RNA alignment
Using Custom HLA Reference Files
Somatic Callable Regions Report
written to a metrics file named "*.allele-transition-noise-metrics.csv"
written to a metrics file named "*.vc_germline_tagging_metrics.csv"
BSSH Setup Screenshot
Figure 1. DRAGEN Heme WGS Tumor Only Workflow
XML Configuration Parameters
BSSH enabled workflow
Solid Tumor Normal Pipeline Workflow
Figure 1. DRAGEN Variant Calling Workflow
ICA Setup Screenshot