Somatic Mode

The DRAGEN Somatic Pipeline allows ultrarapid analysis of Next-Generation Sequencing (NGS) data to identify cancer-associated mutations in somatic chromosomes. DRAGEN calls SNVs and indels from both matched tumor-normal pairs and tumor-only samples using a probability model that considers the possibility of somatic variants, germline variants, and various systematic noise artifacts. The model is informed by sample-specific nucleotide and indel noise patterns that are estimated from the data at runtime. When considering somatic variants, DRAGEN does not make any ploidy assumptions, which enables detection of low-frequency alleles. For loci with coverage up to 100x in the tumor sample, DRAGEN can detect variant allele frequencies down to approximately 5%. This limit scales with increasing depth on a per-locus basis. It is recommended to provide DRAGEN with a systematic noise file that contains position- and allele-specific noise frequencies as estimated from a panel of normal samples (see below); DRAGEN uses this noise file to filter calls that can be explained as resulting from position- and allele-specific noise. After multiple filtering steps, the output is generated as a VCF file. Variants that fail the filtering steps are kept in the output VCF. The variants include a FILTER annotation that indicates which filtering steps have failed.

For the tumor-normal pipeline, both samples are analyzed jointly. DRAGEN assumes that germline variants and systematic noise artifacts are shared by both samples, whereas somatic variants are present only in the tumor sample. Only somatic variants are reported. To detect systematic noise artifacts, DRAGEN recommends that the coverage in the normal sample be at least half of the coverage in the tumor sample.

The tumor-only pipeline produces output that contains both germline and somatic variants and can be further analyzed to identify tumor mutations. The caller does not attempt to distinguish between them: filtering out common germline variants as reported in databases is currently the most reliable way to remove germline variants. The tumor-only pipeline provides a germline tagging feature and requires this feature to be explicitly enabled or disabled. When germline tagging is enabled, variant annotation must also be enabled; DRAGEN then tags variants that are common in the gnomAD database as germline so that they can be filtered out if desired. The tumor-only pipeline also requires the presence of a systematic noise file by default. To run without germline tagging and/or systematic noise files, these options need to be disabled explicitly.

Variant Scoring

DRAGEN uses a Bayesian approach to compute the posterior probability that a somatic variant is present and reports this as a phred-scale quantity, "somatic quality" (SQ):

##FORMAT=<ID=SQ,Number=1,Type=Float,Description="Somatic quality">
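As a rough illustration of what a phred-scale SQ value means, the sketch below converts between SQ and posterior probability. It assumes the standard Phred convention, SQ = -10 * log10(1 - P), where P is the posterior probability that the somatic variant is present; the function names are hypothetical and are not part of DRAGEN.

```python
import math


def sq_to_posterior(sq: float) -> float:
    """Convert a Phred-scaled score to a posterior probability.

    Assumes the standard Phred convention: SQ = -10 * log10(1 - P).
    """
    return 1.0 - 10.0 ** (-sq / 10.0)


def posterior_to_sq(posterior: float) -> float:
    """Convert a posterior probability (0 <= P < 1) to a Phred-scaled score."""
    return -10.0 * math.log10(1.0 - posterior)


# Print the posterior implied by a few SQ values under this convention.
for sq in (0.1, 3.0, 17.5):
    print(f"SQ={sq:>5}: posterior ~ {sq_to_posterior(sq):.4f}")
```

Under this convention, an SQ of 10 corresponds to a posterior probability of 0.9, and an SQ of 20 to 0.99.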

DRAGEN scores variants by computing likelihoods for several hypotheses and noise processes, taking into account many factors such as: the numbers of alt-supporting and ref-supporting reads in the tumor and normal samples (and hence the alt allele frequencies in both samples); mapping qualities and how these are distributed across the reads in the tumor and normal pileups; basecall qualities; forward vs reverse strand support; sample-wide estimates of insertion and deletion error probabilities as functions of repeat period, repeat length, and indel length; sample-wide estimates of nucleotide error biases; whether there are nearby co-phased events; and whether the positions and alleles in question are known somatic hotspots or associated with sequence-specific error patterns. You can use SQ as the primary metric to describe the confidence with which the caller made a somatic call. SQ is reported as a format field for the tumor sample (exception: for homozygous reference calls in gvcf mode it is instead a likelihood ratio, analogous to homref GQ as described in the germline section). Variants with SQ score below the SQ filter threshold are filtered out using the weak_evidence tag. To trade off sensitivity against specificity, adjust the SQ filter threshold. Lower thresholds produce a more sensitive caller and higher thresholds produce a more conservative caller. If performing tumor-normal analysis, the SQ field for the normal sample contains the Phred-scaled posterior probability that a putative call is a germline variant. The somatic caller does not test for diploid genotype candidates and does not output GQ or QUAL values.

If tumor SQ > vc-sq-call-threshold (default is 3 for tumor-normal and 0.1 for tumor-only), then the FORMAT/GT for the tumor sample is hard-coded to 0/1, and the FORMAT/AF yields an estimate on the somatic variant allele frequency, which ranges anywhere within [0,1].

  • If the value for vc-sq-filter-threshold is lower than vc-sq-call-threshold, the filter threshold value is used instead of the call threshold value.

  • If tumor SQ < vc-sq-call-threshold, the variant is not emitted in the VCF.

  • If tumor SQ > vc-sq-call-threshold but tumor SQ < vc-sq-filter-threshold, the variant is emitted in the VCF, but FILTER=weak_evidence.

  • If tumor SQ > vc-sq-call-threshold and tumor SQ > vc-sq-filter-threshold, the variant is emitted in the VCF and FILTER=PASS (unless the variant is filtered by a different filter).

  • The default vc-sq-filter-threshold is 17.5 for tumor-normal and 3.0 for tumor-only analysis.

The following is an example somatic T/N VCF record. Tumor SQ > vc-sq-call-threshold but tumor SQ < vc-sq-filter-threshold, so the FILTER is marked as weak_evidence.

chr2 593701 . G A . weak_evidence
DP=97;MQ=48.74;SQ=3.86;NLOD=9.83;FractionInformativeReads=1.000
GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB 0/0:9.83:33,0:0.000:14,0:19,0:33
0/1:3.86:61,3:0.047:29,2:32,1:64:35,26,0,3:39,22,1,2
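The emit and filter rules described in this section can be summarized in a short sketch. This is an illustration only, not DRAGEN's implementation: the function name is hypothetical, other filters and the clustered-events penalty are ignored, and because the text does not say how an SQ exactly equal to a threshold is handled, the sketch assumes equality counts as passing.

```python
def classify_call(tumor_sq: float,
                  call_threshold: float = 3.0,
                  filter_threshold: float = 17.5) -> str:
    """Return 'not_emitted', 'weak_evidence', or 'PASS' for a tumor SQ score.

    Defaults are the tumor-normal values quoted in the text
    (tumor-only defaults are 0.1 and 3.0).
    """
    # If the filter threshold is lower than the call threshold,
    # the filter threshold value is used as the call threshold.
    effective_call = min(call_threshold, filter_threshold)
    if tumor_sq < effective_call:
        return "not_emitted"
    if tumor_sq < filter_threshold:
        return "weak_evidence"
    # PASS unless a different filter applies (not modeled here).
    return "PASS"


# The example record has tumor SQ=3.86 in a tumor-normal run.
print(classify_call(3.86))
```

With the tumor-normal defaults, the example record (tumor SQ of 3.86) falls between the two thresholds and is classified as weak_evidence.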

The clustered-events penalty is an exception to the above rule for emitting variants. By default, the clustered-events penalty replaces the (obsolete) clustered-events filter. Instead of applying a hard filter when too many events are clustered together, DRAGEN applies a penalty to the SQ scores of co-phased clustered events. Clustered events with weak evidence are no longer called, but clustered events with strong evidence can still be called. This is equivalent to lowering the prior probability of observing clustered co-phased variants. The penalty is applied after the decision to emit variants, so that penalized variants still appear in the VCF if their unpenalized score is high enough. Variants that are combined into an MNV via the --vc-combine-phased-variants-distance option are treated as a single variant for the purposes of the penalty. The penalty is not applied to somatic hotspot variants. To disable the clustered-events penalty, set --vc-clustered-event-penalty=0.

Somatic Mode Options

Please see the DRAGEN Recipe sections for recommended command lines in typical workflows. The following command line options are typically used for somatic small-variant calling:

  • --tumor-fastq1 and --tumor-fastq2

    Inputs a pair of FASTQ files into the mapper aligner and somatic variant caller. You can use these options with other FASTQ options to run in tumor-normal mode. For example:

    dragen -f -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
    --tumor-fastq1 <TUMOR_FASTQ1> \
    --tumor-fastq2 <TUMOR_FASTQ2> \
    --RGID-tumor <RG0-tumor> --RGSM-tumor <SM0-tumor> \
    -1 <NORMAL_FASTQ1> \
    -2 <NORMAL_FASTQ2> \
    --RGID <RG0> --RGSM <SM0> \
    --enable-variant-caller true \
    --output-directory /staging/examples/ \
    --output-file-prefix SRA056922_30x_e10_50M 
  • --tumor-fastq-list

    Inputs a list of FASTQ files into the mapper aligner and somatic variant caller. You can use this option with other FASTQ options to run in tumor-normal mode. For example:

    dragen -f \
    -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
    --tumor-fastq-list <TUMOR_FASTQ_LIST> \
    --fastq-list <NORMAL_FASTQ_LIST> \
    --enable-variant-caller true \
    --output-directory /staging/examples/ \
    --output-file-prefix SRA056922_30x_e10_50M
  • --tumor-bam-input and --tumor-cram-input Inputs a mapped BAM or CRAM file into the somatic variant caller. You can use these options with other BAM/CRAM options to run in tumor-normal mode.

  • --vc-sq-call-threshold and --vc-sq-filter-threshold These options control the thresholds for emitting calls in the VCF and applying the weak_evidence filter tag (see above).

  • --vc-target-vaf This option allows the user to adjust the allele frequencies of haplotypes that will be considered by the caller as potentially appearing in the sample. It is not a hard threshold, but the variant caller will aim to detect variants with allele frequencies larger than this setting. In the case of tumor-normal runs, the frequency is measured with respect to the full set of reads (tumor and normal combined). The default threshold of 0.03 was selected to be as low as possible without incurring an excessive false positive cost; a lower setting may increase sensitivity for low-frequency variants, but may increase false positives and runtime; a higher setting may reduce false positives. Setting the vc-target-vaf to 0 will result in all haplotypes with at least two supporting reads being taken into consideration.

  • --vc-somatic-hotspots, --vc-use-somatic-hotspots, and --vc-hotspot-log10-prior-boost DRAGEN uses a hotspot VCF to indicate somatic mutations that are expected with increased frequency. The default hotspot file (automatically selected from <INSTALL_PATH>/resources/hotspots/somatic_hotspots_* based on the reference) is mostly based on the Memorial Sloan Kettering Cancer Center (MSKCC) published hotspots and positions in COSMIC with population allele counts (AC) >= 50. It is somewhat conservative and boosts only a few thousand positions. You can specify a custom hotspot file via the --vc-somatic-hotspots option (note: input VCF records must be sorted in the same order as contigs in the selected reference) or disable the hotspots feature with --vc-use-somatic-hotspots=false. The effect of the hotspot file is that the prior probability for hotspot variants is boosted by a factor, up to a maximum prior of 0.5. An SNV is considered to match a hotspot variant only if the allele in question is identical, whereas insertions or deletions are considered to match any insertion or deletion allele, respectively. You can use --vc-hotspot-log10-prior-boost to control the size of the adjustment. The default value is 4 (log10 scale), corresponding to an increase of 40 phred; reducing this value results in a smaller adjustment.

  • --vc-systematic-noise This option allows the user to specify the systematic noise file. To run without a systematic noise file (not recommended), specify --vc-systematic-noise=NONE.

  • --vc-combine-phased-variants-distance This option is the same as in the germline variant caller (see "Combine Phased Variants" in the germline small-variant caller section).

  • --vc-skip-germline-tagging=true This option disables the germline tagging feature in the tumor-only pipeline (not recommended).

  • --vc-callability-tumor-thresh Specifies the callability threshold for tumor samples. The somatic callable regions report includes all regions with tumor coverage above the tumor threshold. The default value is 50. For more information on the somatic callable regions report, see Somatic Callable Regions Report.

  • --vc-callability-normal-thresh Specifies the callability threshold for normal samples, if present. If applicable, the somatic callable regions report includes all regions with normal coverage above the normal threshold. The default value is 5. For more information on the somatic callable regions report, see Somatic Callable Regions Report.
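The hotspot prior boost described under --vc-hotspot-log10-prior-boost can be illustrated with a small sketch. It assumes that "boosted by a factor" means multiplying the prior by 10 raised to the boost value and then capping it at 0.5; the base prior used in the example is a made-up number, and the function name is hypothetical.

```python
def boosted_hotspot_prior(base_prior: float,
                          log10_boost: float = 4.0,
                          max_prior: float = 0.5) -> float:
    """Multiply a prior by 10**log10_boost, capped at max_prior.

    Assumption: the boost is a multiplicative factor on the prior
    probability, which matches the stated default of 4 (log10 scale)
    being equivalent to 40 phred.
    """
    return min(base_prior * 10.0 ** log10_boost, max_prior)


# Hypothetical base prior of 1e-6; the default boost raises it to about 1e-2.
print(boosted_hotspot_prior(1e-6))
# A larger base prior hits the 0.5 cap.
print(boosted_hotspot_prior(1e-3))
# Phred equivalent of the default boost: 10 * 4 = 40.
print(10.0 * 4.0)
```

Lowering the boost value shrinks the multiplier, so a boost of 2 would raise the same hypothetical prior only by a factor of 100.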

Tumor-in-normal contamination and liquid tumor mode

In a tumor-normal analysis, DRAGEN accounts for tumor-in-normal (TiN) contamination by running liquid tumor mode. Liquid tumor mode is disabled by default, but we recommend enabling it with --vc-enable-liquid-tumor-mode=true if TiN contamination is expected. When liquid tumor mode is enabled, DRAGEN is able to call variants in the presence of TiN contamination up to a specified maximum tolerance level (default: 0.15). With the default maximum TiN contamination tolerance, somatic variants are expected to be observed in the normal sample with allele frequencies up to 15% of the corresponding allele frequency in the tumor sample. The --vc-tin-contam-tolerance option enables liquid tumor mode and sets the maximum TiN contamination tolerance.

Liquid tumor mode is not equivalent to liquid biopsy. Liquid tumors in liquid tumor mode refer to hematological cancers, such as leukemia. For liquid tumors, it is not feasible to use blood as a normal control because the tumor is present in the blood. Skin or saliva is typically used as the normal sample. However, skin and saliva samples can still contain blood cells, so that the matched normal control sample contains some traces of the tumor sample and somatic variants are observed at low frequencies in the normal sample. If the contamination is not accounted for, it can severely impact sensitivity by suppressing true somatic variants.

Liquid tumor mode typically uses a WGS or WES library with medium depth (for example, 100x tumor / 40x normal), and the lowest VAF detected at these depths is ~5%. Liquid biopsy typically uses a targeted gene panel (e.g. 500 genes) with very high raw depth and UMI indexing (collapsing down to a depth of >2000x) to enable sensitivity at VAF down to 0.1% in some cases (the limit of detection varies depending on coverage and data quality).

Mixing tumor and normal samples from different sequencing protocols

If using different sequencing systems or different library preparation methods for tumor and normal samples, we recommend setting --vc-override-tumor-pcr-params-with-normal=false. In tumor-normal mode, DRAGEN estimates a set of PCR error parameters separately for each of the tumor and normal samples. By default, DRAGEN ignores the tumor-sample parameters and uses normal-sample parameters for analysis of both samples. This default prevents overestimation of tumor-sample error rates that can occur if the somatic variant rate is high.

Allele frequency and related settings

There is no hard limit on the allele frequencies at which DRAGEN can report calls, but there are a number of points in the pipeline where low allele frequency can affect calling. The vc-target-vaf setting affects the threshold used to detect candidate haplotypes during localized haplotype assembly, but does not affect variant scoring. Once a candidate haplotype is detected, all putative variants appearing in the haplotype are scored and calls scoring above the SQ call threshold are emitted regardless of the allele frequency or the number of supporting reads.

The probability calculation in the somatic caller assesses variant and noise hypotheses at fixed allele frequencies defined by a discrete grid (by default at coverages <200: 0, 0.05, 0.1, ... 1.0). This means that the calculation will assess variants with allele frequencies below 0.05 as if the true frequency is equal to 0.05; this strategy does not preclude such variants from being called but may result in lower scores compared to if the true frequency had been considered. At positions with higher coverage, DRAGEN adds extra grid points as in the table below in order to consider hypotheses involving lower allele frequencies and effectively achieve a lower limit of detection (LOD), with the lowest VAF halving every time the coverage doubles:

| Coverage | Lowest AF |
|---|---|
| 0-199 | 0.05 |
| 200-399 | 0.025 |
| 400-799 | 0.0125 |
| ... | ... |
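The relationship between coverage and the lowest grid allele frequency can be sketched as follows. The first three rows come from the table; continuing the halving rule beyond 799x is an extrapolation from the statement that the lowest VAF halves every time the coverage doubles, and any floor that DRAGEN may apply at very high coverage is not modeled. The function name is hypothetical.

```python
def lowest_grid_af(coverage: int,
                   base_af: float = 0.05,
                   base_coverage: int = 200) -> float:
    """Lowest allele frequency on the grid for a given coverage.

    Starts at base_af below base_coverage and halves each time
    the coverage doubles (200, 400, 800, ...).
    """
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    af = base_af
    threshold = base_coverage
    while coverage >= threshold:
        af /= 2.0
        threshold *= 2
    return af


for cov in (100, 199, 200, 399, 400, 799, 800):
    print(cov, lowest_grid_af(cov))
```

The first six coverages reproduce the table rows (0.05, 0.025, 0.0125); the value at 800x is the extrapolated next step.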

If calls below a certain VAF are not of interest, you can use --vc-enable-af-filter (see Post Somatic Calling Filtering below) to apply a hard filter on VAF.

Sample-specific NTD Error Bias Estimation

DRAGEN can compensate for oxidation and deamination artifacts that might exist upstream of the sequencing system, and are common in FFPE samples. DRAGEN does this by estimating nucleotide mutation biases on a per sample basis, taking account of read orientation. During variant calling, DRAGEN then corrects for nucleotide substitution biases by combining the estimated parameters with the basecall quality scores, thus modifying the nucleotide error rates used by the hidden Markov model.

Nucleotide (NTD) Error Bias Estimation is on by default and recommended as a replacement for the orientation bias filter. Both methods take account of strand-specific biases (systematic differences between F1R2 and F2R1 reads). In addition, NTD error estimation accounts for non-strand-specific biases such as sample-wide elevation of a certain SNV type, e.g. C->T or any other transition or transversion. NTD error estimation can also capture these biases in a trinucleotide context, e.g. in the case of C->T it will break down the counts as ACA->ATA, CCA->CTA, GCA->GTA, TCA->TTA, etc.

This feature can be disabled by specifying --vc-enable-unequal-ntd-errors=false or set to auto-detect by specifying --vc-enable-unequal-ntd-errors=auto. In auto-detect mode, DRAGEN will run the estimation but then disable the use of the estimated parameters if it determines that the sample does not exhibit nucleotide error bias. When the feature is enabled, DRAGEN will by default estimate a smaller set of parameters in a monomer context. To estimate a larger set of parameters in a trimer context (recommended on sufficiently large panels when coverage is above 1000X), specify --vc-enable-trimer-context=true.

To specify the regions from which to estimate nucleotide substitution biases, use --vc-snp-error-cal-bed. Alternatively, if --vc-target-bed is used to specify the target regions for variant calling, and the total bed regions are sufficiently small (maximum 4 megabases), --vc-snp-error-cal-bed can be omitted and DRAGEN will use the target bed file for bias estimation. Otherwise, DRAGEN will use a default bed file selected to match the reference, and covering a mixture of coding and non-coding regions.

DRAGEN requires a panel size of at least 150kbp to correctly estimate nucleotide mutation biases when using trimer context, or at least 10kbp when using monomer context. If this requirement is not met for trimer context, DRAGEN falls back on the monomer model, and if it is not met for monomer context, DRAGEN turns the bias estimation feature off.
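The fallback behavior described above can be written as a small decision function. This is a sketch of the stated rules only; the function name and return labels are hypothetical, and the text does not say how a panel size exactly equal to a limit is treated, so the sketch assumes "at least" includes equality.

```python
TRIMER_MIN_BP = 150_000   # minimum panel size for trimer context (150 kbp)
MONOMER_MIN_BP = 10_000   # minimum panel size for monomer context (10 kbp)


def select_ntd_context(panel_bp: int, trimer_requested: bool = False) -> str:
    """Pick the NTD error model context that the panel size can support.

    Returns 'trimer', 'monomer', or 'disabled'.
    """
    if trimer_requested and panel_bp >= TRIMER_MIN_BP:
        return "trimer"
    if panel_bp >= MONOMER_MIN_BP:
        # Either monomer was requested, or trimer fell back to monomer.
        return "monomer"
    # Too small even for monomer context: bias estimation is turned off.
    return "disabled"


print(select_ntd_context(500_000, trimer_requested=True))
print(select_ntd_context(50_000, trimer_requested=True))
print(select_ntd_context(5_000))
```

The three calls illustrate a panel large enough for trimer context, a panel that falls back to monomer context, and a panel too small for either.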

Unique Molecular Identifier (UMI) Support

DRAGEN provides two specialized UMI-aware variant calling pipelines for running from UMI-collapsed reads. These pipelines are optimized to take account of the increased read and basecall qualities that are typical in simplex- and duplex-collapsed reads. Both pipelines are disabled by default; when running with UMI collapsing enabled (--enable-umi true) or when running from UMI-collapsed BAMs, enable UMI-aware variant calling by setting one of the following options to true:

  • --vc-enable-umi-solid The VC UMI solid mode is optimized for solid tumors with post-collapsed coverage rates of ~200-300X and target allele frequencies of 5% and higher.

  • --vc-enable-umi-liquid The liquid biopsy pipeline is not equivalent to liquid tumor mode (see above). The liquid biopsy pipeline starts from a regular blood sample and looks for low VAF somatic variants from tumor cell free DNA floating in the blood. This type of test enables tumor profiling (diagnosis/biomarker identification) from plasma rather than from tissue, which requires an invasive biopsy. The VC UMI liquid mode is optimized for a liquid biopsy pipeline with post collapsed coverage rates of >2000X and target allele frequencies of 0.1% and higher.

If a third-party tool is used to produce the collapsed reads, then configure the tool so that the base call quality scores quantify the error produced by the sequencing system only. DRAGEN uses Sample-specific NTD Error Bias Estimation (see above) to account for errors upstream of the sequencing system, so such errors should not be included in base call quality scores.

gVCF Output

You can output a gVCF file for tumor-only data sets. A gVCF file reports information on every position of the input genome, including homozygous reference (homref) positions, i.e. positions where no alt allele (either germline or somatic) is present. DRAGEN creates a new <NON_REF> allele, to which reads that do not support the reference allele or any reported variant allele are assigned. In tumors, variants could exist at arbitrarily low allele frequencies and be undetectable. Thus, a somatic homref call cannot guarantee that no somatic variant at any allele frequency exists at the position. Instead, DRAGEN considers a position to be a homozygous reference if there are no somatic variants with an allele frequency at or above the limit of detection (LOD). Whereas the SQ score for an ordinary alt allele is a phred-scale posterior probability, the SQ score for the <NON_REF> allele is a phred-scale ratio between the likelihood of a homref call and the likelihood of a variant call with allele frequency at the LOD (if an alt allele is also reported, the <NON_REF> SQ score is capped at the complement of the posterior probability for the alt allele). If the LOD value is lowered, fewer homref calls are made. If the LOD value is increased, more homref calls are made.

By default the LOD is set to 5%, but you can enter a different value using the --vc-gvcf-homref-lod option.

Post Somatic Calling Filtering

DRAGEN can add a number of filters by populating the FILTER column in the VCF. The output is provided in the *.hard-filtered.vcf.gz output file (note: the *.vcf.gz output file without "hard-filtered" in the filename differs only in that the FILTER column is unpopulated; the file is produced for historical reasons but is to be deprecated).

Options

The following options are available for post somatic calling filtering:

  • --vc-sq-call-threshold

    Sets the SQ threshold for emitting calls in the VCF. The default is 3.0 for tumor-normal and 0.1 for tumor-only. If the value for vc-sq-filter-threshold is lower than vc-sq-call-threshold, the filter threshold value is used instead of the call threshold value.

  • --vc-sq-filter-threshold

    Sets the SQ threshold below which emitted VCF calls are marked as filtered (weak_evidence). The default is 17.5 for tumor-normal and 3.0 for tumor-only.

  • --vc-enable-triallelic-filter

    Enables the multiallelic filter. The default is true. This filter will not be applied to somatic hotspot variants.

  • --vc-enable-non-primary-allelic-filter

    Similar to the triallelic filter, but filters less aggressively: at each multiallelic position, the allele with the highest alt AD is kept and only the remaining alleles are filtered. The default is false. This filter is not applied to somatic hotspot variants and cannot be enabled when the triallelic filter is also on.

  • --vc-enable-af-filter

    Enables the allele frequency filter for nuclear chromosomes. The default value is false. When set to true, variants with an allele frequency below the AF call threshold are excluded from the VCF, and variants with an allele frequency below the AF filter threshold are tagged with the low_af filter. The default AF call threshold is 1% and the default AF filter threshold is 5%. To change the threshold values, use the --vc-af-call-threshold and --vc-af-filter-threshold command-line options. For mitochondrial allele frequency filtering, use --vc-enable-af-filter-mito and the corresponding threshold options.

  • --vc-enable-non-homref-normal-filter

    Enables the non-homref normal filter. The default value is true. When set to true, the VCF filters out variants if the normal sample genotype is not a homozygous reference.

  • --vc-enable-vaf-ratio-filter

    Adds one condition to be filtered out by the alt_allele_in_normal filter. The default value is false. When set to true, the VCF filters out variants if the normal sample AF is greater than 20% of tumor sample AF.

  • --vc-depth-filter-threshold

    Filters all somatic variants (alt or homref) with a depth below this threshold. The default value is 0 (no filtering).

  • --vc-homref-depth-filter-threshold

    In gvcf mode, filters all somatic homref variants with a depth below this threshold. The default value is 3.

  • --vc-depth-annotation-threshold

    Filters all non-PASS somatic alt variants with a depth below this threshold. The default value is 0 (no filtering).
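As an illustration of how the two allele frequency thresholds of --vc-enable-af-filter interact, the following sketch classifies a variant by its AF using the default values quoted above (1% call threshold, 5% filter threshold). The function name is hypothetical, other filters are ignored, and the handling of an AF exactly equal to a threshold is an assumption.

```python
def af_filter_outcome(af: float,
                      af_call_threshold: float = 0.01,
                      af_filter_threshold: float = 0.05) -> str:
    """Classify a variant by allele frequency when the AF filter is enabled.

    Returns 'not_emitted' (below the call threshold), 'low_af'
    (emitted but tagged), or 'PASS' (as far as this filter is concerned).
    """
    if af < af_call_threshold:
        return "not_emitted"
    if af < af_filter_threshold:
        return "low_af"
    return "PASS"


for af in (0.005, 0.03, 0.047, 0.2):
    print(af, af_filter_outcome(af))
```

With these defaults, a variant at AF 0.047 would be emitted but tagged low_af, while a variant at AF 0.005 would not appear in the VCF.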

Filters

| Somatic Mode | Filter ID | Description |
|---|---|---|
| Tumor-Only & Tumor-Normal | weak_evidence | Variant does not meet the likelihood threshold: tumor SQ is below the SQ filter threshold (default 17.5 for tumor-normal, 3.0 for tumor-only). |
| Tumor-Only & Tumor-Normal | multiallelic | Site filtered if there are two or more ALT alleles at this location in the tumor. Not applied to somatic hotspot variants. |
| Tumor-Only & Tumor-Normal | base_quality | Median base quality of ALT reads at this locus is < 20. |
| Tumor-Only & Tumor-Normal | mapping_quality | Median mapping quality of ALT reads at this locus is < 20 (tumor-normal) or < 30 (tumor-only). |
| Tumor-Only & Tumor-Normal | fragment_length | Absolute difference between the median fragment length of ALT reads and the median fragment length of REF reads at a given locus is > 10000. |
| Tumor-Only & Tumor-Normal | read_position | Median of distances between the start and end of read and a given locus is < 5 (the variant is too close to the edge of all the reads). To output variant read position to the INFO field, use --vc-output-variant-read-position=true. |
| Tumor-Only & Tumor-Normal | low_af | Allele frequency is below the threshold specified with --vc-af-filter-threshold (default is 5%). Enabled only when using --vc-enable-af-filter=true. |
| Tumor-Only & Tumor-Normal | systematic_noise | If AQ score is < 10 (default) for tumor-normal or < 60 (default) for tumor-only, the site is filtered. |
| Tumor-Only & Tumor-Normal | low_frac_info_reads | The fraction of informative reads (denominator excludes filtered_out reads) is below the threshold. The default threshold value is 0.5. |
| Tumor-Only & Tumor-Normal | filtered_reads | More than 50% of reads have been filtered out. |
| Tumor-Only & Tumor-Normal | long_indel | Indel length is more than 100bp. |
| Tumor-Only & Tumor-Normal | low_depth | The site was filtered because the number of reads is too low. The filter is off by default. |
| Tumor-Only & Tumor-Normal | low_tlen | The site was filtered because the fraction of low TLEN ALT supporting reads is above a threshold. The default threshold is 0.4. Reads with TLEN smaller than -2.25 (default) standard deviations from the mean are considered to be low TLEN. This filter is not applied for reads sampled from tight insert distributions, i.e. stddev / mean < 0.1 (default). |
| Tumor-Only & Tumor-Normal | no_reliable_supporting_read | No reliable supporting read was found in the tumor sample. A reliable supporting read is a read supporting the ALT allele with mapping quality ≥ 40, fragment length ≤ 10,000, base call quality ≥ 25, and distance from start/end of read ≥ 5. |
| Tumor-Only & Tumor-Normal | too_few_supporting_reads | Variant is supported by < 3 reads in the tumor sample. This filter is not applied in UMI-aware pipelines. |
| Tumor-Normal | noisy_normal | More than three alleles are observed in the normal sample at allele frequency above 9.9%. |
| Tumor-Normal | alt_allele_in_normal | ALT allele frequency in the normal sample is above 0.2 plus the maximum contamination tolerance. For solid tumor mode, the tolerance is 0. For liquid tumor mode, the default tolerance is 0.15. See vc-enable-vaf-ratio-filter for optional conditions. |
| Tumor-Normal | non_homref_normal | Normal sample genotype is not a homozygous reference. |
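The alt_allele_in_normal condition, together with the optional --vc-enable-vaf-ratio-filter condition, can be sketched as below. This follows the thresholds as they are worded in the filter description (normal AF above 0.2 plus the contamination tolerance; optionally, normal AF greater than 20% of tumor AF). The function name is hypothetical and this is not DRAGEN's implementation.

```python
def alt_allele_in_normal(normal_af: float,
                         tumor_af: float,
                         tin_tolerance: float = 0.0,
                         enable_vaf_ratio_filter: bool = False) -> bool:
    """Return True if the variant would be tagged alt_allele_in_normal.

    tin_tolerance is 0 in solid tumor mode; the liquid tumor mode
    default is 0.15.
    """
    # Base condition: normal AF above 0.2 plus the contamination tolerance.
    if normal_af > 0.2 + tin_tolerance:
        return True
    # Optional extra condition: normal AF greater than 20% of tumor AF.
    if enable_vaf_ratio_filter and normal_af > 0.2 * tumor_af:
        return True
    return False


print(alt_allele_in_normal(0.25, 0.50))                                # solid tumor mode
print(alt_allele_in_normal(0.25, 0.50, tin_tolerance=0.15))            # liquid tumor mode
print(alt_allele_in_normal(0.05, 0.10, enable_vaf_ratio_filter=True))  # ratio condition
```

The second call shows how a nonzero contamination tolerance lets a variant through that solid tumor mode would filter.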

Systematic Noise Filtering

The DRAGEN systematic noise filter significantly improves somatic variant calling precision, especially in tumor-only mode. DRAGEN enforces its use in the tumor-only pipeline by refusing to start a run without a noise file (this option can explicitly be disabled). This filter tackles noise that consistently appears at specific locations in the reference genome. This noise can arise from:

  • Mis-mapping in low-complexity regions: Repetitive sequences with low information content can lead to reads mapping to incorrect locations.

  • PCR noise in homopolymer regions: Regions with long stretches of the same nucleotide (e.g., AAAAA) can introduce errors during PCR amplification.

The systematic noise filter offers a significant improvement over the older "panel of normals" method. While the panel of normals simply excluded specific positions, the new filter employs a statistical model. This model compares the variant and its allele frequency (AF) to the noise level associated with that specific position and allele in the reference genome. This allows for a more nuanced filtering approach, reducing false positives without discarding potentially valid variants.

Note that the systematic noise filter specifically aims to remove noise, while the option --vc-enable-germline-tagging is used for identifying germline variants. The systematic noise filter is not recommended for germline admixture datasets, where tumor-normal pairs are simulated by combining germline samples from two different individuals. This is because such datasets contain (simulated) somatic variants at germline variant positions, and those positions may be present in the noise files with the result that desired variants are filtered out.

Newer versions of the systematic noise file include two columns, one for the "mean" noise and one for the "max" noise. The noise file header specifies a "##NoiseMethod", which identifies the column that is used by default during variant calling. For UMI/PANELs/WES it is recommended to use the "mean" noise, and for WGS it is recommended to use the "max" noise.

Prebuilt systematic noise files are available for download (see below), but when possible, it is recommended to build custom noise files from a panel of normal samples sequenced locally. This will ensure that the noise file is specific to the library preparation, sequencing system, and panel in use. Building your own noise file is especially helpful for clean UMI samples that tend to have less noise than WGS/WES samples. To generate a noise file it is recommended to use approximately 20-50 normal samples, although fewer normal samples (1-10) can still be used to generate useful noise files.

The systematic noise filter is used in the DRAGEN tumor-only or tumor-normal pipeline by adding the following commands:

# "max" is more aggressive and recommended for WGS; "mean" preserves better sensitivity and is recommended for UMI/WES/PANELs.
dragen \
[STANDARD INPUT, MAPPING and VC OPTIONS] \
--vc-systematic-noise NOISE_FILE_PATH \
--vc-systematic-noise-method <max|mean>

Prebuilt Systematic Noise BED Files

Prebuilt systematic noise files can be downloaded from the DRAGEN Software Support Site page.

Somatic Systematic Noise Baseline Collection v2.0.0 noise files were generated with V4.3 and for the first time include allele specific information. Each v2.0.0 noise file includes both "mean" and "max" noise in separate columns. A header line "##NoiseMethod=mean/max" specifies which noise column will be used by default.
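To check which noise column a given file uses by default, the "##NoiseMethod" header line can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the noise file is plain text or gzip-compressed, all header lines start with "#" and precede the data rows, and the function name is hypothetical.

```python
import gzip
from pathlib import Path
from typing import Optional


def default_noise_method(noise_file: str) -> Optional[str]:
    """Return the value of the '##NoiseMethod=' header line, or None if absent.

    Assumes header lines start with '#' and come before any data rows.
    """
    path = Path(noise_file)
    # Assumption: a '.gz' suffix means the file is gzip-compressed.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                break  # first data row reached; the header is over
            if line.startswith("##NoiseMethod="):
                return line.strip().split("=", 1)[1]
    return None
```

For example, `default_noise_method("my_noise.bed.gz")` (a hypothetical path) would return "mean" or "max" for a file that carries the header, and None for an older file without it.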

Noise files generated with V4.3 contain extra columns and are not compatible with earlier versions. Older noise files are still supported in the current version of DRAGEN as per the table below.

| Version | Release | Modes | Normal Samples |
|---|---|---|---|
| Somatic Systematic Noise Baseline Collection v2.0.0 | V4.3 | hg19, hg38, hs37d5, WES, WGS | ~50 per cohort, 80-100X coverage |

The default WES and WGS noise files were generated using a combination of Nextera and TruSeq samples (with and without PCR). There are also hg38 WGS HEME and FFPE specific noise files.

Custom Systematic Noise Files

The BaseSpace Sequence Hub DRAGEN CNV Baseline Builder App can be used to build SNV and CNV noise files in the cloud. Alternatively, the following DRAGEN command lines can be used to generate the noise files locally:

First run DRAGEN somatic tumor-only on each of approximately 20-50 normal samples using the following command:

# --vc-detect-systematic-noise=true: very sensitive mode to detect and emit noise; should never be used during real variant calling
# --vc-detect-systematic-noise-mode=UMI: only required for UMI samples; the default works well for WGS/WES/non-UMI PANELs
# --variant-annotation-assembly: GRCh37 or GRCh38
dragen \
-r {REFERENCE} \
[FQ or BAM INPUT options] \
--vc-detect-systematic-noise=true \
--vc-detect-systematic-noise-mode=UMI \
--vc-enable-germline-tagging=true \
--enable-variant-annotation=true \
--variant-annotation-data {NIRVANA_ANNOTATION_FOLDER} \
--variant-annotation-assembly {REF_TYPE} \
--output-directory {DIR} \
--output-file-prefix {PREFIX}

Once the runs on the normal samples have completed, list the paths of the resulting VCFs in the VCF_LIST file (one VCF per line) and use DRAGEN to generate the systematic noise file:

# --build-sys-noise-method sets the default noise method for this noise file by writing '##NoiseMethod=mean' or '##NoiseMethod=max' to the noise file header
dragen \
-r ${REF_DIR} \
--build-sys-noise-vcfs-list ${VCF_LIST} \
--build-sys-noise-method=max \
--output-directory ${DIR} \
--output-file-prefix ${PREFIX}
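The VCF_LIST file is a plain text file with one full VCF path per line. It could be generated with a short script like the sketch below. The function name `write_vcf_list` and the default file name pattern are assumptions; adjust the pattern to match the VCFs that the normal-sample runs actually produced.

```python
from pathlib import Path


def write_vcf_list(normals_dir: str, list_path: str,
                   pattern: str = "*.hard-filtered.vcf.gz") -> int:
    """Write the absolute path of every matching VCF under normals_dir, one per line.

    The default pattern is an assumption about the output file naming.
    Returns the number of VCF paths written.
    """
    vcfs = sorted(p.resolve() for p in Path(normals_dir).rglob(pattern))
    Path(list_path).write_text("".join(f"{p}\n" for p in vcfs))
    return len(vcfs)


if __name__ == "__main__":
    count = write_vcf_list("/path/to/normal/outputs", "vcf_list.txt")
    print(f"listed {count} normal VCFs")
```

The path of the resulting file is what gets passed to `--build-sys-noise-vcfs-list`.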

Detailed settings for running or building the systematic noise filter:

Running the filter during somatic variant calling:

| Option | Description |
| --- | --- |
| --vc-systematic-noise | Specifies a systematic noise BED file. If a somatic variant does not pass the AQ threshold, the variant is marked as 'systematic_noise' in the FILTER column of the output VCF. |
| --vc-systematic-noise-method | Specifies which column in the systematic noise file is used: "max" is more aggressive and recommended for WGS, while "mean" preserves better sensitivity and is recommended for UMI/WES/PANELs. |
| --vc-systematic-noise-filter-threshold | Sets the AQ threshold. Higher values filter more aggressively. The default is 10 for tumor-normal and 60 for tumor-only. The valid range is 0-100. For tumor-normal runs the threshold may be set higher (e.g. to 60) to improve specificity at the possible cost of some sensitivity. |
| --vc-systematic-noise-filter-threshold-in-hotspot | Sets the AQ threshold to use in hotspot regions, where one may want to filter less aggressively than in the rest of the genome. The default is 10 for tumor-normal and 20 for tumor-only. |
| --vc-allele-specific-systematic-noise | Applies systematic noise in an allele-specific manner when allele information is available. This setting is ignored for v1 noise files. (Default true) |
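The interaction between the default thresholds, the hotspot thresholds, and a user-supplied override can be summarized in a small sketch. This is an illustration of the defaults listed above, not DRAGEN's implementation; the function name is hypothetical, and treating "does not pass" as a strict `AQ < threshold` comparison is an assumption.

```python
from typing import Optional

# Default AQ thresholds from the table above, keyed by (pipeline, in_hotspot).
DEFAULT_AQ_THRESHOLD = {
    ("tumor-normal", False): 10,
    ("tumor-only", False): 60,
    ("tumor-normal", True): 10,
    ("tumor-only", True): 20,
}


def is_systematic_noise(aq: float, pipeline: str, in_hotspot: bool = False,
                        threshold: Optional[float] = None) -> bool:
    """Return True if a call with this AQ score would be flagged 'systematic_noise'."""
    if threshold is None:
        threshold = DEFAULT_AQ_THRESHOLD[(pipeline, in_hotspot)]
    if not 0 <= threshold <= 100:
        raise ValueError("AQ threshold must be within 0-100")
    # Assumption: a call fails when its AQ is strictly below the threshold.
    return aq < threshold
```

Under these assumptions, a call with AQ 30 would be flagged in a tumor-only run outside hotspots (default 60) but kept inside a hotspot (default 20).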

Running the tumor-only pipeline on the normals:

| Option | Description |
| --- | --- |
| --vc-detect-systematic-noise | Run the tumor-only pipeline in an ultra sensitive mode and intentionally include noise in the output VCF. WARNING: this option should only be used with normal samples to characterize noise, it is NOT intended for analyzing tumor samples. |
| --vc-detect-systematic-noise-mode | Specify the library type when generating the systematic noise. Only required for UMI samples. The UMI mode generates GVCFs, which are especially useful for capturing very low levels of noise. The default mode works well for WGS/WES and non-UMI panels. Valid values are UMI and DEFAULT. |

Building the noise file:

| Option | Description |
| --- | --- |
| --build-sys-noise-method | Specifies the default value for vc-systematic-noise-method by adding it as part of the header in the systematic noise file. It is recommended to select 'mean' for UMI/PANELS/WES data and 'max' for WGS data. (Default max) |
| --build-sys-noise-vcfs-list | Text file containing the paths of normal VCFs. Specify the full VCF file paths. List one file per line. |
| --build-sys-noise-germline-vaf-threshold | Variant calls with VAF higher than this threshold are considered germline and do not contribute to the noise estimate. This option is disabled by default by setting the threshold to 1. (Default 1) |
| --build-sys-noise-use-germline-tag | Ensures that variants tagged by vc-enable-germline-tagging=true are not counted as noise. (Default true) |
| --build-sys-noise-min-sample-cov | Minimum coverage at a site for a sample to be used towards noise estimation. At low coverage, estimated allele frequencies become less reliable. Accurate AF estimation is important for germline variant detection, and also for noise detection when using MAX noise. (Default 5) |
| --build-sys-noise-min-supporting-samples | Minimum number of samples with noise at a position for the position to be considered systematic noise. (Default 1) |
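To make the roles of these options concrete, the sketch below shows one way a per-site noise value could be derived from a panel of normals. It is purely illustrative and is not DRAGEN's implementation, which this page does not describe: the `Observation` type, the `site_noise` function, and the choice to aggregate over all usable samples are assumptions. Only the option semantics and defaults mirror the table above.

```python
from statistics import mean
from typing import List, NamedTuple, Optional


class Observation(NamedTuple):
    """One normal sample's evidence at a site (hypothetical structure)."""
    depth: int
    vaf: float
    germline_tagged: bool = False


def site_noise(observations: List[Observation], method: str = "max",
               germline_vaf_threshold: float = 1.0, use_germline_tag: bool = True,
               min_sample_cov: int = 5, min_supporting_samples: int = 1) -> Optional[float]:
    """Aggregate per-sample VAFs into one noise value, or None if the site is not noisy."""
    usable = []
    for obs in observations:
        if obs.depth < min_sample_cov:
            continue  # AF estimates are unreliable at low coverage
        if obs.vaf > germline_vaf_threshold:
            continue  # treated as germline, not noise
        if use_germline_tag and obs.germline_tagged:
            continue  # tagged as germline by the tumor-only run
        usable.append(obs.vaf)
    noisy = [v for v in usable if v > 0]
    if not noisy or len(noisy) < min_supporting_samples:
        return None
    # Assumption: the aggregate is taken over all usable samples at the site.
    return max(usable) if method == "max" else mean(usable)
```

In this sketch "max" reports the worst case seen in any normal, which is why it filters more aggressively than "mean".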

Germline Tagging in the Tumor-Only Pipeline

When running DRAGEN for tumor-only somatic calling, potential germline variants can be tagged in the INFO field with 'GermlineStatus' using population databases. Current databases include 1KG and both exome and genome sequencing data from gnomAD. The following options are available for this feature:

  • --vc-enable-germline-tagging Enable germline tagging. The default is 'false'. When this is set to 'true', the following annotation-related options must also be set:

    • --enable-variant-annotation=true

    • --variant-annotation-data Nirvana annotation database (Downloadable at https://support.illumina.com/content/dam/illumina-support/help/Illumina_DRAGEN_Bio_IT_Platform_v3_7_1000000141465/Content/SW/Informatics/Dragen/Nirvana_DownloadData_fDG.htm)

    • --variant-annotation-assembly The genome build, GRCh37 or GRCh38

Additional options control how germline variants are defined:

  • --germline-tagging-db-threshold The minimum alternative allele count across population databases for a variant to be defined as germline (default=50).

  • --germline-tagging-pop-af-threshold The minimum population allele frequency for a variant to be defined as germline. Once specified, this will override the input from --germline-tagging-db-threshold.

Example of a VCF record tagged as germline (shown on one line):

1    11301714    .    A    G    .    PASS    DP=3626;MQ=249.61;FractionInformativeReads=0.974;AQ=100.00;GermlineStatus=Germline_DB    GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB    0/1:64.73:1772,1758:0.498:872,901:900,857:3530:846,926,843,915:894,878,874,884
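Downstream filtering on the tag only requires reading the INFO column. The sketch below assumes a standard tab-separated VCF record with INFO in the eighth column; the function names are hypothetical, and returning `None` for records without a `GermlineStatus` key is an assumption about untagged variants.

```python
from typing import Dict, Optional


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO string into a key/value dictionary (flag keys map to '')."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def germline_status(vcf_line: str) -> Optional[str]:
    """Return the GermlineStatus value of a VCF record, or None if it is not present."""
    columns = vcf_line.rstrip("\n").split("\t")
    return parse_info(columns[7]).get("GermlineStatus")


if __name__ == "__main__":
    record = ("1\t11301714\t.\tA\tG\t.\tPASS\t"
              "DP=3626;MQ=249.61;AQ=100.00;GermlineStatus=Germline_DB\t"
              "GT:AF\t0/1:0.498")
    print(germline_status(record))
```

Records for which the function returns a value can then be excluded or kept depending on whether germline variants are of interest.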

Mutation Annotation Format (MAF) Conversion in Tumor-Only and Tumor-Normal Pipelines

When enabling DRAGEN for tumor-only or tumor-normal pipelines with Nirvana Annotation, the Nirvana JSON output can be converted into a Mutation Annotation Format (MAF) file. The MAF file is a tab-separated values file containing aggregated mutation information and will be saved to the output directory that you specify. You can enable MAF conversion directly as part of the somatic small variant calling workflow (integrated mode) or separately by providing a path to a VCF file or annotated JSON file (standalone mode).
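Because the MAF file is tab-separated, it can be read with standard tooling. The following sketch counts rows per value of a chosen column. It assumes that the first non-comment line is the column header and that comment lines start with `#`; the column name `Hugo_Symbol` used in the example comes from the general MAF convention and is an assumption about DRAGEN's output.

```python
import csv
from collections import Counter
from typing import Iterable


def count_by_column(lines: Iterable[str], column: str) -> Counter:
    """Count MAF rows per distinct value of the given column."""
    # Assumption: lines starting with '#' are comments, e.g. a version line.
    data = (line for line in lines if not line.startswith("#"))
    reader = csv.DictReader(data, delimiter="\t")
    return Counter(row[column] for row in reader)


if __name__ == "__main__":
    example = [
        "#version 2.4\n",
        "Hugo_Symbol\tVariant_Classification\n",
        "TP53\tMissense_Mutation\n",
        "KRAS\tMissense_Mutation\n",
    ]
    print(count_by_column(example, "Hugo_Symbol"))
```

For a real file, an open text handle (for example from `open(path, newline="")`) can be passed in place of the example list.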

When running MAF conversion as part of the somatic small variant calling workflow, the following options are required for this feature:

Annotation options:

  • --enable-variant-annotation=true Enable variant annotation

  • --variant-annotation-data Nirvana annotation database (Downloadable at https://support.illumina.com/content/dam/illumina-support/help/Illumina_DRAGEN_Bio_IT_Platform_v3_7_1000000141465/Content/SW/Informatics/Dragen/Nirvana_DownloadData_fDG.htm)

  • --variant-annotation-assembly Genome build, GRCh37 or GRCh38

MAF conversion options:

  • --enable-maf-output=true Enable MAF output

  • --maf-transcript-source Desired transcript source, RefSeq or Ensembl

Additional standalone options (when running without the variant caller):

  • --maf-input-vcf Input VCF with the following form: <path>/<file_name>.hard-filtered.vcf.gz

  • --maf-input-json Input JSON with the following form: <path>/<file_name>.hard-filtered.annotated.json.gz

Please note that when specifying standalone mode with VCF input, you must also enable annotation options to generate the JSON file. Conversely, annotation options should not be specified when running standalone mode with an input annotated JSON file.

Optional settings:

  • --maf-include-non-pass-variants Include all variants, including non-PASS variants, in the MAF output file. By default, the MAF output contains only variants that have the PASS filter in the hard-filtered VCF file.
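The rules above about which options go together can be checked before launching a run. The sketch below is a hypothetical pre-flight check, not part of DRAGEN: it takes the intended options as a dictionary and reports combinations that contradict the text. It checks only whether options are present, and treating `--maf-transcript-source` as required in every mode is an assumption.

```python
from typing import Dict, List

ANNOTATION_OPTIONS = (
    "--enable-variant-annotation",
    "--variant-annotation-data",
    "--variant-annotation-assembly",
)


def check_maf_options(options: Dict[str, str]) -> List[str]:
    """Return a list of problems with a set of MAF-related options (empty if none)."""
    problems = []
    if options.get("--enable-maf-output") != "true":
        problems.append("--enable-maf-output=true is required")
    if "--maf-transcript-source" not in options:
        problems.append("--maf-transcript-source is required")
    annotation_given = [o for o in ANNOTATION_OPTIONS if o in options]
    if "--maf-input-json" in options:
        # An annotated JSON input already carries the annotation.
        if annotation_given:
            problems.append("annotation options should not be combined with --maf-input-json")
    elif len(annotation_given) < len(ANNOTATION_OPTIONS):
        # Integrated mode and standalone VCF input both need annotation.
        missing = [o for o in ANNOTATION_OPTIONS if o not in options]
        problems.append("missing annotation options: " + ", ".join(missing))
    return problems
```

An empty result means the option set is consistent with the rules described here; it does not validate paths or option values.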

Example command lines:

MAF output from BAM input and variant caller:

bin/dragen --output-dir=/path/to/output/dir --output-file-prefix=prefix_name --ref-dir=/path/to/ref/dir --enable-map-align=false --enable-sort=false --enable-variant-caller=true -b /path/to/normal/bam --tumor-bam-input /path/to/tumor/bam --enable-variant-annotation=true --variant-annotation-assembly <GRCh37/GRCh38> --variant-annotation-data /path/to/annotation/data --enable-maf-output=true --maf-transcript-source <Ensembl/RefSeq> --maf-include-non-pass-variants <true/false>

MAF output from output directory and output file prefix, where the output directory contains a VCF file prefixed by the output file prefix:

bin/dragen --output-dir=/path/to/output/dir/with/vcf --output-file-prefix=prefix_of_vcf --ref-dir=/path/to/ref-dir --enable-variant-annotation=true --variant-annotation-assembly <GRCh37/GRCh38> --variant-annotation-data /path/to/annotation/data --enable-maf-output=true --maf-transcript-source <Ensembl/RefSeq> --maf-include-non-pass-variants <true/false>

MAF output from source VCF file:

bin/dragen --ref-dir=/path/to/ref/dir --enable-variant-annotation=true --variant-annotation-assembly <GRCh37/GRCh38> --variant-annotation-data /path/to/annotation/data --enable-maf-output=true --maf-input-vcf=/path/to/vcf/file --maf-transcript-source <Ensembl/RefSeq> --maf-include-non-pass-variants <true/false>

Note: This command line will output the MAF file in the same location as the input VCF file. To specify a directory for output, add --output-dir and --output-file-prefix options.

MAF output from source annotated JSON file:

bin/dragen --ref-dir=/path/to/ref/dir --enable-maf-output=true --maf-input-json=/path/to/annotated/json/file --maf-transcript-source <Ensembl/RefSeq> --maf-include-non-pass-variants <true/false>

Note: This command line will output the MAF file in the same location as the input annotated JSON file. To specify a directory for output, add the --output-dir and --output-file-prefix options.
