Somatic Tumor Only with UMI

DRAGEN Recipe - Somatic UMI Tumor Only

Overview

This recipe processes sequencing data that carries unique molecular identifiers (UMIs) in somatic tumor-only workflows.

Example Command Line

For most scenarios, simply creating the union of the command line options from the single caller scenarios will work.

  • Configure the INPUT options

  • Configure the OUTPUT options

  • Configure MAP/ALIGN depending on whether realignment is desired

  • Configure the VARIANT CALLERs based on the application

  • Configure any additional options

  • Build up the necessary options for each component separately, so that they can be re-used in the final command line.

We recommend using a linear (non-pangenome) reference for somatic analysis. For more details, refer to DRAGEN Reference Support.

The following are partial templates that can be used as starting points. Adjust them accordingly for your specific use case.

#!/bin/bash
set -euo pipefail

# Path to DRAGEN hashtable
DRAGEN_HASH_TABLE=<REF_DIR>

# Path to output directory for the DRAGEN run
OUTPUT=<OUT_DIR>

# File prefix for DRAGEN output files
PREFIX=<OUT_PREFIX>

# Path to VC systematic noise BED file. In tumor-only variant calling, this filter
# is essential for removing systematic noise observed in normal samples. Prebuilt
# systematic noise files can be downloaded from the DRAGEN Software Support Site:
# https://support.illumina.com/sequencing/sequencing_software/dragen-bio-it-platform/product_files.html
# Alternatively, running the somatic TO pipeline on normal samples can generate a
# systematic noise file. We recommend using a systematic noise file based on
# normal samples that match the library prep of the tumor samples.
VC_SYSTEMATIC_NOISE_FILE=<VC_SYSTEMATIC_NOISE_BED_FILE_PATH>

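# Define the input sources: FASTQ list, FASTQ, BAM, or CRAM. Only one is used.
# Because of 'set -u', every variable referenced below must be set; delete the
# input blocks that do not apply.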
INPUT_FASTQ_LIST="
  --tumor-fastq-list $TUMOR_FASTQ_LIST \
  --tumor-fastq-list-sample-id $TUMOR_FASTQ_LIST_SAMPLE_ID \
"

INPUT_FASTQ="
  --tumor-fastq1 $TUMOR_FASTQ1 \
  --tumor-fastq2 $TUMOR_FASTQ2 \
  --RGSM-tumor $RGSM_TUMOR \
  --RGID-tumor $RGID_TUMOR \
"

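# Tumor-only runs take only the tumor alignment file. --bam-input and
# --cram-input supply a matched normal sample, so they are omitted here.
INPUT_BAM="
  --tumor-bam-input $TUMOR_BAM \
"

INPUT_CRAM="
  --tumor-cram-input $TUMOR_CRAM \
"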

# Select the input source; this example uses INPUT_FASTQ_LIST
INPUT_OPTIONS="
  --ref-dir $DRAGEN_HASH_TABLE \
  $INPUT_FASTQ_LIST \
"

OUTPUT_OPTIONS="
  --output-directory $OUTPUT \
  --output-file-prefix $PREFIX \
"

MA_OPTIONS="
  --enable-map-align true \
  --enable-sort true \
"

UMI_OPTIONS="
  --enable-umi true \
  --umi-source $UMI_SOURCE \
  --umi-library-type $UMI_LIBRARY_TYPE \
"

# Choose exactly one UMI mode: --vc-enable-umi-solid true (solid tumor) or
# --vc-enable-umi-liquid true (liquid biopsy). This example uses solid.
# $REF_TYPE is GRCh37 or GRCh38.
SNV_OPTIONS="
  --enable-variant-caller true \
  --vc-enable-umi-solid true \
  --vc-target-bed $VC_TARGET_BED \
  --vc-systematic-noise $VC_SYSTEMATIC_NOISE_FILE \
  --vc-systematic-noise-method mean \
  --vc-enable-germline-tagging true \
  --enable-variant-annotation true \
  --variant-annotation-data $NIRVANA_ANNOTATION_FOLDER \
  --variant-annotation-assembly $REF_TYPE \
"

CNV_OPTIONS="
  --enable-cnv true \
  --cnv-target-bed $CNV_TARGET_BED \
  --cnv-combined-counts $CNV_PANEL_OF_NORMALS \
  --cnv-population-b-allele-vcf $CNV_POP_VCF \
"

# HRD requires enabling CNV
HRD_OPTIONS="
--enable-hrd=true \
"

SV_OPTIONS="
  --enable-sv true \
  --sv-exome true \
  --sv-call-regions-bed $SV_TARGET_BED \
"

# Nirvana annotation settings are required for TMB. $REF_TYPE is GRCh37 or GRCh38.
TMB_OPTIONS="
  --enable-tmb=true \
  --enable-variant-annotation=true \
  --variant-annotation-data=$NIRVANA_ANNOTATION_FOLDER \
  --variant-annotation-assembly=$REF_TYPE \
"

MSI_OPTIONS="
--msi-command=tumor-only \
--msi-coverage-threshold=60 \
--msi-microsatellites-file=$MSI_MICROSATELLITES_FILE \
--msi-ref-normal-dir=$MSI_REFERENCE_NORMAL_FOLDER \
"

# Keep --hla-enable-class-2=true only if the panel has sufficient coverage for
# class II HLA typing.
HLA_OPTIONS="
  --enable-hla=true \
  --hla-as-filter-min-threshold=29.0 \
  --hla-as-filter-ratio-threshold=0.85 \
  --hla-enable-class-2=true \
"

# Construct final command line
CMD="
  dragen \
  $INPUT_OPTIONS \
  $OUTPUT_OPTIONS \
  $MA_OPTIONS \
  $UMI_OPTIONS \
  $SNV_OPTIONS \
  $CNV_OPTIONS \
  $HRD_OPTIONS \
  $SV_OPTIONS \
  $TMB_OPTIONS \
  $MSI_OPTIONS \
  $HLA_OPTIONS \
"

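# Execute. $CMD is intentionally left unquoted so that the shell splits it into
# separate arguments; paths must therefore not contain whitespace.
echo $CMD
$CMD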
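
A command built from strings is split on whitespace when it is expanded, so a path that contains a space turns into two arguments. The following is a minimal sketch of the same assembly using bash arrays, which keep each argument intact. The sample paths and the `DRY_RUN` switch are illustrative assumptions and not part of the recipe.

```bash
#!/bin/bash
# Sketch: assemble the DRAGEN command from bash arrays instead of strings.
# All paths below are hypothetical placeholders.
DRAGEN_HASH_TABLE="/refs/hg38 hashtable"   # note the space in the path
OUTPUT="/results/sample1"
PREFIX="sample1"
TUMOR_FASTQ_LIST="/data/fastq_list.csv"
TUMOR_FASTQ_LIST_SAMPLE_ID="sample1"

INPUT_OPTIONS=(
  --ref-dir "$DRAGEN_HASH_TABLE"
  --tumor-fastq-list "$TUMOR_FASTQ_LIST"
  --tumor-fastq-list-sample-id "$TUMOR_FASTQ_LIST_SAMPLE_ID"
)
OUTPUT_OPTIONS=(--output-directory "$OUTPUT" --output-file-prefix "$PREFIX")
MA_OPTIONS=(--enable-map-align true --enable-sort true)

# Each array element stays one argument, even if it contains whitespace.
CMD=(dragen "${INPUT_OPTIONS[@]}" "${OUTPUT_OPTIONS[@]}" "${MA_OPTIONS[@]}")

# Print a shell-quoted copy of the command for the log.
printf '%q ' "${CMD[@]}"; echo

# Run only when DRY_RUN is explicitly disabled (hypothetical switch).
if [[ "${DRY_RUN:-true}" != "true" ]]; then
  "${CMD[@]}"
fi
```

The remaining option groups (UMI, SNV, CNV, and so on) can be added as further arrays in the same way.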

Additional Notes and Options

Optional settings per component are listed below. The full option list is available at this page.

UMI

| Option | Description |
| --- | --- |
| --umi-source qname/fastq/bamtag | Specify the input type for the UMI sequence. For more information, see UMI Options. |
| --umi-library-type random-duplex/random-simplex/nonrandom-duplex | Set the batch option for the type of UMI correction. For more information, see UMI Options. |
| --umi-nonrandom-whitelist $WHITELIST | If the UMIs are nonrandom, enter the path to a custom file of valid UMI sequences. |
| --umi-min-supporting-reads 2 | Specify the number of matching UMI input reads required to generate a consensus read. Any family with insufficient supporting reads is discarded. For more information, see UMI Options. |
| --umi-metrics-interval-file $UMI_TARGET_BED | Enter the path to the target regions file in BED format. |
| --umi-emit-multiplicity both | Set the consensus sequence type to output. DRAGEN UMI allows you to collapse duplex sequences from the two strands of the original molecules. For more information, see Merge Duplex UMIs. |
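
Because a mistyped UMI value only surfaces once DRAGEN starts, it can help to check the values before building `UMI_OPTIONS`. The sketch below validates against the values listed above. The function name `build_umi_options` is hypothetical, and treating the whitelist as optional is an assumption.

```bash
#!/bin/bash
# Sketch: validate UMI settings and emit the matching option string.
# build_umi_options is a hypothetical helper, not a DRAGEN feature.
build_umi_options() {
  local umi_source="$1" library_type="$2" whitelist="${3:-}"

  case "$umi_source" in
    qname|fastq|bamtag) ;;
    *) echo "invalid --umi-source: $umi_source" >&2; return 1 ;;
  esac

  case "$library_type" in
    random-duplex|random-simplex|nonrandom-duplex) ;;
    *) echo "invalid --umi-library-type: $library_type" >&2; return 1 ;;
  esac

  local opts="--enable-umi true --umi-source $umi_source --umi-library-type $library_type"

  # Nonrandom UMIs can take a custom whitelist of valid sequences.
  if [[ "$library_type" == "nonrandom-duplex" && -n "$whitelist" ]]; then
    opts+=" --umi-nonrandom-whitelist $whitelist"
  fi
  echo "$opts"
}
```

For example, `UMI_OPTIONS="$(build_umi_options qname random-duplex)"` yields the three options used in the template.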

SNV

| Option | Description |
| --- | --- |
| --vc-enable-umi-solid true / --vc-enable-umi-liquid true | When running from UMI data, one of these options is required to let DRAGEN know that the reads have been UMI-collapsed and are therefore more reliable than non-UMI reads. Solid mode is optimized for solid tumors with post-collapsed coverage rates of ~200–300X and target allele frequencies of 5% and higher. Liquid mode is optimized for a liquid biopsy pipeline with post-collapsed coverage rates of ~2000–2500X and target allele frequencies of 0.4% and higher. As a rough rule of thumb, choose solid for coverage below 1000X and liquid for higher coverage. |
| --vc-sq-filter-threshold $THRESHOLD | Threshold for the sensitivity-specificity tradeoff. The default threshold is 4 (solid) or 2 (liquid). Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity. |
| --vc-systematic-noise $SYSTEMATIC_NOISE_FILE | Systematic noise filter. In tumor-only variant calling, this filter is essential for removing systematic noise observed in normal samples. Prebuilt systematic noise files are available for download on the DRAGEN Software Support Site page. Alternatively, a systematic noise file can be generated by running the somatic TO pipeline on normal samples. We recommend using a systematic noise file based on normal samples that match the library prep of the tumor samples. |
| --vc-somatic-hotspots somatic_hotspots_GRCh38.vcf.gz | Hotspots file. By default, DRAGEN treats positions in the COSMIC database as hotspots, assigning an increased prior probability to variants at these positions. Use this option to override the default with a custom hotspots file if a list of positions of interest is available. |
| --vc-combine-phased-variants-distance $DIST | Combining phased variants. By default, DRAGEN does not combine nearby phased calls into MNVs or indels. To combine such calls, set this parameter to a value greater than zero indicating the maximum distance at which calls should be combined. The recommended distance when enabling this feature is 15 base pairs. The valid range is [0; 15]. |
| --vc-enable-germline-tagging true --enable-variant-annotation true --variant-annotation-data $NIRVANA_ANNOTATION_FOLDER --variant-annotation-assembly $REFERENCE | Germline filtering. Enable to tag variants as germline or somatic based on population databases. $REFERENCE can be GRCh37 or GRCh38 (GRCh37 is compatible with hs37d5 and hg19). The Nirvana annotation database is downloadable at this page. |
| --vc-target-vaf FLOAT | Available in v4.2 and later. Selects the variant allele frequencies of interest. The variant caller aims to detect variants with allele frequencies equal to or larger than this setting. The setting does not apply a hard filter, so variants with allele frequencies below the selected threshold can still be detected. On high-coverage, clean datasets, a lower target VAF may help increase sensitivity. On noisy samples (such as FFPE), a higher target VAF may help reduce false positives. Using a low target VAF may also increase runtime. The valid range is [0, 1]. The default is 0.03 (or 0.001 when --vc-enable-umi-liquid=true). |
| --vc-systematic-noise-method | The 'max' method is recommended for WGS and results in a more aggressive filter. The 'mean' method is recommended for UMI/PANELs/WES and results in a less aggressive filter. The default is specified in the noise file header. |
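
The solid-versus-liquid rule of thumb and the valid range for `--vc-target-vaf` can be encoded as small checks. This is a sketch only: both function names are hypothetical, and the 1000X cutoff is the rough rule of thumb quoted above, not a hard limit enforced by DRAGEN.

```bash
#!/bin/bash
# Sketch: choose the UMI-aware SNV mode from the expected post-collapse
# coverage (below 1000X: solid, otherwise liquid).
select_umi_vc_mode() {
  local coverage="$1"   # expected post-collapse coverage, integer X
  if (( coverage < 1000 )); then
    echo "--vc-enable-umi-solid true"
  else
    echo "--vc-enable-umi-liquid true"
  fi
}

# Succeeds only if the value is a plain decimal number in the range [0, 1].
valid_target_vaf() {
  awk -v v="$1" 'BEGIN { exit !(v ~ /^[0-9]*\.?[0-9]+$/ && v >= 0 && v <= 1) }'
}
```

A call such as `select_umi_vc_mode 250` prints the solid option, and `valid_target_vaf 1.5` fails because the value is out of range.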

SNV library specific settings

| Option | FFPE |
| --- | --- |
| --vc-excluded-regions-bed $BED | Some FFPE samples may have a high rate of FP calls in SINE (specifically ALU) regions. Optionally use an ALU BED file to hard filter all calls in these regions. Steps to download an ALU regions BED file are provided below. |

SNV systematic noise file

Generic SNV noise files can be downloaded from the DRAGEN Software Support Site page.

However, for UMI samples and panels it is strongly recommended to build a custom systematic noise file as follows:

Step 1. Run DRAGEN somatic tumor-only on each of approximately 20-50 normal samples:

### choose input either from
### i) BAM
INPUT="--tumor-bam-input ${NORMAL_BAM}"
### ii) FASTQs
INPUT="--tumor-fastq-list ${NORMAL_FASTQ_LIST} \
  --tumor-fastq-list-sample-id ${NORMAL_FASTQ_LIST_SAMPLE_ID}"
###

# --vc-detect-systematic-noise-mode=UMI detects the ultra-low noise levels
# relevant for UMI panels. ${REF_TYPE} is GRCh37 or GRCh38.
dragen \
-r ${REFERENCE} \
${INPUT} \
--vc-detect-systematic-noise=true \
--vc-detect-systematic-noise-mode=UMI \
--vc-enable-germline-tagging=true \
--enable-variant-annotation=true \
--variant-annotation-data ${NIRVANA_ANNOTATION_FOLDER} \
--variant-annotation-assembly ${REF_TYPE} \
--intermediate-results-dir ${TMP} \
--output-directory ${DIR} \
--output-file-prefix ${PREFIX}

List the full paths to the VCFs from step 1 in ${VCF_LIST}, one file per line.
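
One way to build that list is sketched below. It assumes that all step 1 outputs sit under one parent directory and that the VCFs of interest end in `.vcf.gz`. Both the layout and the pattern are assumptions, so adjust the pattern if a run emits more than one VCF per sample. The function name `make_vcf_list` is hypothetical.

```bash
#!/bin/bash
# Sketch: write the absolute path of every step 1 VCF into a list file,
# one path per line.
make_vcf_list() {
  local normals_dir="$1" vcf_list="$2"
  local abs_dir
  # Resolve to an absolute directory so that find prints full paths.
  abs_dir="$(cd "$normals_dir" && pwd)" || return 1
  find "$abs_dir" -type f -name '*.vcf.gz' | sort > "$vcf_list"
  # Fail if no VCF was found, so step 2 never starts with an empty list.
  [[ -s "$vcf_list" ]]
}
```

For example, `make_vcf_list /results/normals vcf_list.txt` followed by `VCF_LIST=vcf_list.txt` prepares the input for step 2 (the directory name is illustrative).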

Step 2. Generate the final noise file with:

# --build-sys-noise-method=mean sets the default noise method for this noise
# file by tagging its header with '##NoiseMethod=mean'.
dragen \
-r ${REF_DIR} \
--build-sys-noise-vcfs-list ${VCF_LIST} \
--build-sys-noise-method=mean \
--output-directory ${DIR} \
--output-file-prefix ${PREFIX}

To download a SINE/ALU regions bed for SNV excluded regions

ALUs comprise approximately 11% of the genome and are common in introns. High rates of deamination FP calls have been observed in some FFPE libraries. If the ALU regions are not clinically significant for a specific analysis, then it is recommended to simply filter out the entire ALU region using the DRAGEN excluded regions filter: --vc-excluded-regions-bed $BED.

The ALU bed file can be downloaded as part of the Bed File Collection: DRAGEN Software Support Site page

CNV

| Option | Description |
| --- | --- |
| --cnv-enable-gcbias-correction true | Enable or disable GC bias correction when generating target counts. For more information, see GC Bias Correction. |
| --cnv-segmentation-mode $SEG_MODE | Specifies the segmentation algorithm to perform. For more information, see Segmentation. |

Generating Panel of Normals (PON)

Somatic WES CNV requires PON files. Follow the two steps below to generate a CNV PON:

  1. Target counts generation (per normal sample): Generate target counts for each normal sample as a baseline. Any options used for panel of normals generation (BED file, GC bias correction, etc.) should match those used when processing the case sample.

CNV_PON_OPTIONS="
  --enable-cnv true \
  --cnv-target-bed $CNV_TARGET_BED \
"

CMD="
  dragen \
  $INPUT_OPTIONS \
  $OUTPUT_OPTIONS \
  $CNV_PON_OPTIONS \
"

  2. Combined counts generation: Merge the individual PON counts into a single <prefix>.combined.counts.txt.gz file.

CNV_COMBINED_COUNTS_OPTIONS="
  --enable-cnv true \
  --cnv-generate-combined-counts true \
  --cnv-normals-list $CNV_NORMALS_LIST \
"

CMD="
  dragen \
  $INPUT_OPTIONS \
  $OUTPUT_OPTIONS \
  $CNV_COMBINED_COUNTS_OPTIONS \
"

$CNV_NORMALS_LIST is a single text file with the path to each target counts file generated in step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz). The output is a PON file with the suffix .combined.counts.txt.gz. Use the PON file in case sample runs of DRAGEN CNV with the --cnv-combined-counts option.
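
The normals list can be generated rather than typed by hand. The sketch below assumes one path per line and a single directory tree holding the step 1 outputs. It selects either the raw or the GC-corrected counts, never both, so that the PON is built from one kind of file. The function name `make_cnv_normals_list` is hypothetical.

```bash
#!/bin/bash
# Sketch: collect target counts files from step 1 into $CNV_NORMALS_LIST.
# Third argument "true" selects the GC-corrected counts instead of the raw ones.
make_cnv_normals_list() {
  local counts_dir="$1" normals_list="$2" gc_corrected="${3:-false}"
  local pattern='*.target.counts.gz'
  if [[ "$gc_corrected" == "true" ]]; then
    pattern='*.target.counts.gc-corrected.gz'
  fi

  local abs_dir
  # Resolve to an absolute directory so that find prints full paths.
  abs_dir="$(cd "$counts_dir" && pwd)" || return 1
  find "$abs_dir" -type f -name "$pattern" | sort > "$normals_list"
  # Fail if nothing matched, so the combined counts run is not started empty.
  [[ -s "$normals_list" ]]
}
```

Use the GC-corrected variant only if the case samples are also processed with GC bias correction, in line with the requirement that PON and case sample options match.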

For more information, see Panel of Normals.

SV

| Option | Description |
| --- | --- |
| --sv-systematic-noise $SYSTEMATIC_NOISE_BEDPE | Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bzip compressed). For more information, see Systematic Noise Filtering. |

TMB library specific settings

| Option | Description | Solid | Liquid |
| --- | --- | --- | --- |
| --tmb-vaf-threshold FLOAT | Minimum variant allele frequency for usable variants. | 0.05 (default) | 0.002 |
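
A small lookup keeps the library-specific TMB threshold in one place. This is a sketch: the function name `tmb_vaf_threshold` and the `solid`/`liquid` keywords are hypothetical conventions, while the values are the ones listed above.

```bash
#!/bin/bash
# Sketch: return the --tmb-vaf-threshold value for a library type.
tmb_vaf_threshold() {
  case "$1" in
    solid)  echo "0.05" ;;    # DRAGEN default
    liquid) echo "0.002" ;;
    *) echo "unknown library type: $1" >&2; return 1 ;;
  esac
}

# Example: extra TMB option for a liquid biopsy library.
TMB_LIBRARY_OPTIONS="--tmb-vaf-threshold $(tmb_vaf_threshold liquid)"
```

`$TMB_LIBRARY_OPTIONS` can then be appended to `TMB_OPTIONS` in the main template.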

MSI

Microsatellite sites file

The microsatellite sites file can be downloaded from the DRAGEN Software Support Site page.

For panels, it is recommended to post-process the file by intersecting the WES or WGS sites with the manifest, which avoids using off-target reads in the MSI analysis. Small panels may require a custom sites file to ensure that the panel covers at least 2000 sites. To generate custom MSI site files, refer to the MSI Biomarker section in the user guide.

Build normal references of microsatellite repeat distribution

Normal reference files can be generated by running collect-evidence mode on a panel of normal samples. This works ONLY with DRAGEN germline mode.

dragen -f \
--msi-command collect-evidence \
--ref-dir ${reference_directory} \
--msi-microsatellites-file ${microsatellite_file} \
--msi-coverage-threshold 60 \
--output-directory ${output_directory} \
--output-file-prefix ${prefix} \
-1 ${normal_fq1} \
-2 ${normal_fq2}

The --msi-microsatellites-file should be the same file used for running tumor-only mode. --msi-coverage-threshold should also be the same value used for running tumor-only mode.

A minimum of 20 normal samples is required for tumor-only mode.
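
The two constraints above (identical sites file and coverage threshold, and at least 20 normals) can be enforced by generating every collect-evidence command from shared variables. The sketch below only prints the commands. The tab-separated sample sheet (prefix, R1, R2), the default sites path, and the function name are hypothetical.

```bash
#!/bin/bash
# Sketch: print one collect-evidence command per normal sample and verify
# that enough normals are listed. Nothing is executed.
MSI_MICROSATELLITES_FILE="${MSI_MICROSATELLITES_FILE:-/path/to/msi_sites.list}"
MSI_COVERAGE_THRESHOLD="${MSI_COVERAGE_THRESHOLD:-60}"
MIN_NORMALS=20

print_collect_evidence_cmds() {
  local sample_sheet="$1" ref_dir="$2" out_dir="$3"
  local count=0 prefix fq1 fq2
  while IFS=$'\t' read -r prefix fq1 fq2; do
    [[ -z "$prefix" ]] && continue   # skip blank lines
    echo "dragen -f --msi-command collect-evidence" \
         "--ref-dir $ref_dir" \
         "--msi-microsatellites-file $MSI_MICROSATELLITES_FILE" \
         "--msi-coverage-threshold $MSI_COVERAGE_THRESHOLD" \
         "--output-directory $out_dir --output-file-prefix $prefix" \
         "-1 $fq1 -2 $fq2"
    count=$((count + 1))
  done < "$sample_sheet"

  if (( count < MIN_NORMALS )); then
    echo "only $count normals listed; tumor-only mode needs at least $MIN_NORMALS" >&2
    return 1
  fi
}
```

Reusing `$MSI_MICROSATELLITES_FILE` and `$MSI_COVERAGE_THRESHOLD` in the tumor-only `MSI_OPTIONS` keeps both runs consistent.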

MSI library specific settings

| Option | Description | Solid | Liquid (cfDNA) |
| --- | --- | --- | --- |
| --msi-coverage-threshold INT | Minimum coverage for a microsatellite | 60 (default) | 500 |
| --msi-distance-threshold FLOAT | Minimum Jensen-Shannon distance between tumor and normal for a microsatellite | 0.1 (default) | 0.02 |
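
The library-specific MSI values can be selected in one place so that the pair of thresholds always changes together. This is a sketch: the function name `msi_library_options` and the `solid`/`liquid` keywords are hypothetical, and the values are those listed above.

```bash
#!/bin/bash
# Sketch: emit the MSI threshold options for a library type.
msi_library_options() {
  case "$1" in
    solid)  echo "--msi-coverage-threshold=60 --msi-distance-threshold=0.1" ;;
    liquid) echo "--msi-coverage-threshold=500 --msi-distance-threshold=0.02" ;;
    *) echo "unknown library type: $1" >&2; return 1 ;;
  esac
}
```

When the liquid values are used, the normal references need to be collected with the same coverage threshold, as noted in the section on building normal references.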

HLA

| Option | Description |
| --- | --- |
| --enable-hla | Enable HLA typer (by default, this setting genotypes only class 1 genes) |
| --hla-as-filter-min-threshold | Internal option to set min alignment score threshold |
| --hla-as-filter-ratio-threshold | Minimum alignment score of a read mate to be considered |
| --hla-enable-class-2 | Extend genotyping to HLA class 2 genes |
