Reference

This reference provides detailed documentation of the DRAGEN CNV pipeline, covering the topics below.

Input

The DRAGEN CNV pipeline supports multiple input formats. To run the DRAGEN CNV pipeline directly with FASTQ input without generating a BAM or CRAM file, see Streaming Alignments for instructions on streaming alignment records directly from the DRAGEN map/align stage.

DRAGEN CNV also supports running from an already mapped and aligned BAM or CRAM file. If you have data that has not yet been mapped and aligned, see Generate an Alignment File.

Reference Hashtable

For the DRAGEN CNV pipeline, the hashtable must be generated with the --ht-build-cnv-hashtable option set to true, in addition to any other options required by other pipelines. When --ht-build-cnv-hashtable is true, DRAGEN generates an additional k-mer uniqueness map that the CNV algorithm uses to counteract mappability biases. You only need to generate the k-mer uniqueness map file one time per reference hashtable. The generation takes about 1.5 hours per whole human genome.

The reference hashtable is a pregenerated binary representation of the reference genome. For information on generating a hashtable, see Prepare a Reference Genome.

The following example command generates a hashtable.

dragen \
--build-hash-table true \
--ht-reference <FASTA> \
--ht-build-cnv-hashtable true \
--output-directory <OUTPUT> \
<OTHER HASHTABLE OPTIONS>

Generate an Alignment File

The following command-line examples show how to run the DRAGEN map/align pipeline depending on your input type. The map/align pipeline generates an alignment file in the form of a BAM or CRAM file that can then be used in the DRAGEN CNV Pipeline.

You need to generate alignment files for all samples that have not already been mapped and aligned, including any samples to be used as references for normalization. Each sample must have a unique sample identifier. Use the --RGSM option to specify the identifier. For BAM and CRAM input files, the sample identifier is taken from the file, so the --RGSM option is not required.

The following example command maps and aligns a FASTQ file:

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
--RGSM <SAMPLE> \
--RGID <RGID> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true

The following example command maps and aligns an existing BAM file:

dragen \
-r <HASHTABLE> \
--bam-input <BAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true

The following example command maps and aligns an existing CRAM file:

dragen \
-r <HASHTABLE> \
--cram-input <CRAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true

Streaming Alignments

DRAGEN can map and align FASTQ samples, and then directly stream them to downstream callers, such as the CNV Caller and the Haplotype Variant Caller. You can use this process to skip generation of a BAM or CRAM file, which bypasses the need to store additional files.

To stream alignments directly to the DRAGEN CNV pipeline, run the FASTQ sample through a regular DRAGEN map/align workflow, and then provide additional arguments to enable CNV. The following example command line maps and aligns a FASTQ file, and then streams the alignments to the Germline CNV WGS pipeline.

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
--RGSM <SAMPLE> \
--RGID <RGID> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true \
--enable-cnv true \
--cnv-enable-self-normalization true

For information on running CNV concurrently with the Haplotype Variant Caller, see Concurrent CNV and Small Variant Calling.

Concurrent CNV and Small Variant Calling

DRAGEN can perform mapping and aligning of FASTQ samples, and then directly stream the data to downstream callers. If the input is a FASTQ sample, a single sample can run through both the CNV and the small VC. This triggers self-normalization by default.

Run the FASTQ sample through a regular DRAGEN map/align workflow, and then provide additional arguments to enable the CNV, VC, or both. The options that apply to CNV in the standalone workflows are also applicable here.

The following examples show different commands.

Map/Align FASTQ With CNV

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
--RGSM <SAMPLE> \
--RGID <RGID> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true \
--enable-cnv true \
--cnv-enable-self-normalization true

Map/Align FASTQ With VC

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
--RGSM <SAMPLE> \
--RGID <RGID> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true \
--enable-variant-caller true

Map/Align FASTQ With CNV and VC

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
--RGSM <SAMPLE> \
--RGID <RGID> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true \
--enable-cnv true \
--cnv-enable-self-normalization true \
--enable-variant-caller true

BAM Input to CNV and VC

dragen \
-r <HASHTABLE> \
--bam-input <BAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-enable-self-normalization true \
--enable-variant-caller true

Preprocessing

Target Counts

The target counts stage is the first processing stage for the DRAGEN CNV pipeline. This stage bins the alignments into intervals. The primary analysis format for CNV processing is the target counts file, which contains the feature signals that are extracted from the alignments to be used in downstream processing. The binning strategy, interval sizes, and their boundaries are controlled by the target counts generation options, and the normalization technique used.

When working with whole genome sequence data, the intervals are autogenerated from the reference hashtable. Only the primary contigs from the reference hashtable are considered for binning. You can specify additional contigs to bypass with the --cnv-skip-contig-list option.

With whole exome sequence data, DRAGEN uses the target BED file supplied with the --cnv-target-bed option to determine the intervals for analysis. The target BED file should contain intervals that match those in the panel of normals file. If the intervals in the target BED file and the panel of normals file do not match, DRAGEN will use the target intervals from the panel of normals file.

The target counts stage generates a *.target.counts.gz file. You can use the file later in place of any BAM or CRAM by specifying the file with the --cnv-input or --cnv-tumor-input option for the normalization stage. The *.target.counts.gz file is an intermediate file for the DRAGEN CNV pipeline and should not be modified.

Further details are available in Output Files - Target Counts File.

Whole Genome

If the samples are whole genome, then the effective target intervals width is specified with the --cnv-interval-width option. The higher the coverage of a sample, the higher the resolution that can be detected. This option is important when running with a panel of normals because all samples must have matching intervals. For self-normalization, the actual width of a given target interval might be larger than the specified value.

The default value for WGS is 1000 bp with a sample coverage of ≥ 30x.

| WGS Coverage per Sample | Recommended Resolution* (bp) |
| --- | --- |
|  | 10000 |
|  | 5000 |
| >= 30 | 1000 |

Using a --cnv-interval-width of less than 250 bp for WGS analysis can drastically increase runtime.

The intervals are autogenerated for every primary contig in the reference. Only references that have the UCSC or GRC convention are supported. For example, chr1, chr2, chr3, ..., chrX, chrY or 1, 2, 3, ..., X, Y. You can specify a list of contigs to skip by using the --cnv-skip-contig-list option. This option takes a comma-separated list of contig identifiers. The contig identifiers must match the reference hashtable that you are using. By default, only the mitochondrial chromosomes are skipped. Non-primary contigs are never processed.

For example, to skip chromosome M, X, and Y, use the following option:

--cnv-skip-contig-list "chrM,chrX,chrY"
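As a rough illustration of how a comma-separated skip list could be applied to a set of primary contigs, consider the sketch below. The function name `contigs_to_bin` and the constant `DEFAULT_SKIP` are hypothetical and are not part of DRAGEN. The sketch also assumes that supplying the option replaces the default skip list rather than extending it, which the documentation suggests but does not state outright.

```python
# Illustrative only: not DRAGEN code. Names here are hypothetical.
DEFAULT_SKIP = "chrM,MT,m,chrm"  # default skip list quoted in this document


def contigs_to_bin(primary_contigs, skip_list=DEFAULT_SKIP):
    """Return the primary contigs that remain after applying a skip list.

    skip_list is a comma-separated string, as passed to
    --cnv-skip-contig-list. Assumption: a user-supplied list replaces
    the default list instead of being added to it.
    """
    skip = {name.strip() for name in skip_list.split(",") if name.strip()}
    return [contig for contig in primary_contigs if contig not in skip]


contigs = ["chr1", "chr2", "chrX", "chrY", "chrM"]
print(contigs_to_bin(contigs))                    # default skip list
print(contigs_to_bin(contigs, "chrM,chrX,chrY"))  # skip list from the example above
```

Because the identifiers are compared as exact strings, a skip list written with the wrong naming convention (for example `X` against a reference that uses `chrX`) would silently skip nothing in this sketch, which is why the identifiers must match the reference hashtable.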

Whole Exome

If the samples are whole exome samples, supply a target BED file with the --cnv-target-bed <TARGET_BED> option. The intervals in the target BED file indicate regions where alignments are expected based on the target capture kit. The BED file intervals are further split into smaller intervals, depending on the value of --cnv-interval-width.

To use a standard BED file, make sure that there is no header present in the file. In this case, all columns after the third column are ignored, similar to the operation of DRAGEN Variant Caller.
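The following sketch illustrates the two points above: only the first three BED columns are read, and a target longer than the interval width is cut into smaller pieces. The exact splitting rule DRAGEN uses is not described here, so the simple left-to-right chunking and the function name `split_bed_line` are assumptions made for illustration.

```python
# Illustrative only: not DRAGEN code. The chunking rule is an assumption.
def split_bed_line(line, width):
    """Split one headerless BED record into intervals of at most `width` bp.

    Only the first three tab-separated columns (chrom, start, end) are
    used; any further columns are ignored. Coordinates are 0-based,
    half-open, as in the BED format.
    """
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    start, end = int(start), int(end)
    return [(chrom, s, min(s + width, end)) for s in range(start, end, width)]


# A 1200 bp target with an extra name column, split at a 500 bp width.
for interval in split_bed_line("chr1\t1000\t2200\tEXON1", 500):
    print(interval)
```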

Target Counts Options

The following options control the generation of target counts.

--cnv-counts-method --- Specifies the counting method for an alignment to be counted in a target bin. Values are midpoint, start, or overlap. The default value is overlap when using the panel of normals approach, which means if an alignment overlaps any part of the target bin, the alignment is counted for that bin. In the self-normalization mode, the default counting method is start.
--cnv-min-mapq --- Specifies the minimum MAPQ for an alignment to be counted during target counts generation. The default value is 3 for self-normalization and 20 otherwise. When generating counts for panel of normals, all MAPQ0 alignments are counted.
--cnv-target-bed --- Specifies a properly formatted BED file that indicates the target intervals to sample coverage over. For use in WES analysis.
--cnv-interval-width --- Specifies the width of the sampling interval for CNV processing. This option controls the effective window size. The default is 1000 for WGS analysis and 500 for WES analysis.
--cnv-skip-contig-list --- Specifies a comma-separated list of contig identifiers to skip when generating intervals for WGS analysis. The default contigs that are skipped, if not specified, are chrM,MT,m,chrm.
--cnv-filter-duplicate-alignments --- Filter duplicate marked alignments during target counts if option is set to true. The default setting is true unless map/align is enabled and duplicate marking is disabled.

Target counts options are recorded in the header of each counts file to facilitate review and validation of the panel of normals. If the PON counts were generated with different count options than the case sample, DRAGEN returns an option validation error.
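To make the difference between the three --cnv-counts-method values concrete, the sketch below assigns a single alignment to fixed-width bins under each method. It is a simplified model, not DRAGEN's implementation: real WGS intervals are not fixed-width, and the function name `counted_bins` and the half-open coordinate convention are assumptions.

```python
# Illustrative only: not DRAGEN code. Bins are modeled as fixed-width.
def counted_bins(aln_start, aln_end, bin_width, method):
    """Return the indices of the bins an alignment is counted in.

    The alignment covers the half-open range [aln_start, aln_end).
    method is one of "start", "midpoint", or "overlap".
    """
    if aln_end <= aln_start:
        raise ValueError("alignment must have positive length")
    if method == "start":
        # Counted once, in the bin containing the alignment start.
        return [aln_start // bin_width]
    if method == "midpoint":
        # Counted once, in the bin containing the alignment midpoint.
        return [((aln_start + aln_end) // 2) // bin_width]
    if method == "overlap":
        # Counted in every bin that the alignment touches.
        first = aln_start // bin_width
        last = (aln_end - 1) // bin_width
        return list(range(first, last + 1))
    raise ValueError(f"unknown counting method: {method}")


# A 150 bp alignment straddling the boundary between bins 0 and 1.
for method in ("start", "midpoint", "overlap"):
    print(method, counted_bins(950, 1100, 1000, method))
```

With overlap, an alignment that spans a boundary contributes to more than one bin, whereas start and midpoint count each alignment exactly once.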

Filter Duplicate Alignments

PCR duplicates are often considered noise in coverage depth information. The --cnv-filter-duplicate-alignments option controls whether duplicate-marked alignments are excluded when counting alignments. This relies on the duplicate bit (0x400) of the SAM flag being set correctly on the alignments.

If --enable-map-align=false, then duplicate marking should be present in the input file (pre-aligned BAM/CRAM). If --enable-map-align=true, then --enable-duplicate-marking=true should be set.

Note that CNV will wait for duplicate marking from the Map/Aligner which may increase overall run time.

| Input format | enable-map-align | Required option |
| --- | --- | --- |
| FASTQ | TRUE | --enable-map-align=true, --enable-duplicate-marking=true |
| BAM | TRUE | --enable-map-align=true, --enable-duplicate-marking=true |
| BAM | FALSE | --enable-map-align=false |
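The duplicate filter depends only on bit 0x400 of the SAM FLAG field. The sketch below shows that bit test in isolation. The function name `keep_alignment` is hypothetical, and the sketch does not model any other filter that DRAGEN applies during counting.

```python
# Illustrative only: not DRAGEN code.
DUPLICATE_FLAG = 0x400  # SAM FLAG bit for PCR or optical duplicates


def keep_alignment(flag, filter_duplicates=True):
    """Return True if an alignment with this SAM FLAG would be counted.

    With filter_duplicates=True, alignments whose duplicate bit is set
    are dropped. With filter_duplicates=False, the bit is ignored.
    """
    is_duplicate = bool(flag & DUPLICATE_FLAG)
    return not (filter_duplicates and is_duplicate)


print(keep_alignment(99))           # properly paired read, not a duplicate
print(keep_alignment(99 | 0x400))   # same read, marked as duplicate
print(keep_alignment(99 | 0x400, filter_duplicates=False))
```

If the input was never duplicate-marked, the bit is unset on every alignment and the filter has no effect, which is why duplicate marking must be present in the input or enabled during map/align.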

Target Counts Dropout Regions

In the WGS case where a BED file is not specified for a given reference, the same intervals should be generated each time. The intervals created take into account the mappability of the reference genome using a k-mer uniqueness map created during hashtable generation.

Due to ambiguity that may arise from non-unique genomic loci, only regions corresponding to unique k-mers are considered. A position in the reference genome is marked as a unique k-mer if the k-mer starting at that position does not show up anywhere else in the reference genome (or non-unique, otherwise). Furthermore, if the k-mer contains any bases other than A, C, T or G, it is marked as non-unique.
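The definition above can be sketched for a toy reference as follows. This is a minimal illustration under stated assumptions: it considers the forward strand only, holds the whole sequence in memory, and uses a hypothetical function name, `kmer_uniqueness`. How DRAGEN builds the map during hashtable generation is not described here.

```python
# Illustrative only: not DRAGEN code. Forward strand only (assumption).
from collections import Counter


def kmer_uniqueness(reference, k):
    """Return one boolean per k-mer start position in `reference`.

    A position is True when the k-mer starting there occurs exactly
    once in the reference and contains only A, C, G, or T.
    """
    reference = reference.upper()
    kmers = [reference[i:i + k] for i in range(len(reference) - k + 1)]
    occurrences = Counter(kmers)
    valid_bases = set("ACGT")
    return [
        occurrences[kmer] == 1 and set(kmer) <= valid_bases
        for kmer in kmers
    ]


# "ACG" occurs twice and "CGN" contains an N, so both are non-unique.
print(kmer_uniqueness("ACGTACGN", 3))
```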

For WGS samples, in the absence of a --cnv-target-bed file, the target intervals are autogenerated from the precomputed k-mer uniqueness map for the input reference hashtable and the --cnv-interval-width option, which defaults to 1000 bp. The --cnv-interval-width option determines the minimum number of unique k-mer positions required in an interval. The interval length has an upper bound: if an interval grows to more than double --cnv-interval-width without reaching the required count of unique k-mer positions, the interval is discarded and the process starts again at the next genomic position. Discarded regions are referred to as "dropout" regions and are listed with the exclusion reason NON_KMER_UNIQUE in the *.cnv.excluded_intervals.bed.gz file.

A dropout region is a complex region that does not count alignments and results in an interval missing from the analysis. Dropout regions include centromeres, telomeres, and low complexity regions. If there is sufficient signal in the flanking regions, an event can still span these dropout regions, even if alignment counting does not occur in the regions. The event is handled by the segmentation stage.
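The interval generation and dropout behavior described above can be sketched as a single pass over a per-position uniqueness vector. This is a simplified model, not DRAGEN's implementation. The function name `generate_intervals` is hypothetical, and the sketch assumes that after a discarded span the scan resumes immediately after that span, and that a contig tail without enough unique positions is also treated as dropout.

```python
# Illustrative only: not DRAGEN code. Restart and tail handling are assumptions.
def generate_intervals(unique, width):
    """Build target intervals from a per-position uniqueness vector.

    unique: sequence of booleans, True where the position starts a
            unique k-mer.
    width:  required number of unique positions per interval.
    Returns (intervals, dropouts) as lists of half-open (start, end).
    """
    intervals, dropouts = [], []
    n = len(unique)
    start = 0
    while start < n:
        count = 0
        end = start
        max_end = min(n, start + 2 * width)  # length may not exceed 2 * width
        while end < max_end and count < width:
            if unique[end]:
                count += 1
            end += 1
        if count == width:
            intervals.append((start, end))
        else:
            dropouts.append((start, end))
        start = end
    return intervals, dropouts


# Four unique positions, a six-position repeat, then two unique positions.
toy_map = [True] * 4 + [False] * 6 + [True] * 2
print(generate_intervals(toy_map, width=2))
```

In the toy example, the interval that recovers after the repeat is wider than the requested width, which matches the note that the actual width of a target interval might be larger than the specified value.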

Some of the excluded intervals can be rescued through the segmental duplication extension to the germline CNV workflow, as described in the following section.

Rescue of target counts in Segmental Duplications

The germline WGS CNV workflow can be extended to call copy number alterations in a curated subset of segmentally duplicated regions. Segmental duplications are large blocks of DNA ≥ 1kb, characterized by a high degree of sequence identity at nucleotide level (> 90%). This poses a challenge for traditional approaches, and such regions are usually excluded.

This extension complements the original germline WGS CNV workflow by using a tailored algorithm to compute the normalized coverage in such regions, which is then injected before the segmentation step and becomes part of the main CNV workflow in downstream steps. We recommend WGS data aligned to a supported human reference genome (currently only hg38) with at least 30x coverage. See below for additional requirements.

Supported duplications

The following pairs of genes defining Segmental Duplications are included:

CYP2A6 / CYP2A7
FCGR3A / FCGR3B
RHD / RHCE
STRC / STRCP1
ACSM2A / ACSM2B
ACTR3B / ACTR3C
AQP12A / AQP12B
ASAH2 / ASAH2B
CCDC74A / CCDC74B
CD177 / CD177p1
CD8B / CD8B2
CFH1 / CFHR1
CYP4A11 / CYP4A22
DHX40 / DHX40P1
EIF5AL1 / EIF5AP4
FCGR2A / FCGR2C
FFAR3 / GPR42
FOLH1 / FOLH1B
FRMPD2 / FRMPD2B
GPAT2 / GPAT2P1
GSTT2B / GSTT2
DDT / DDTL
HCAR2 / HCAR3
HSPA1A / HSPA1B
KRT81 / KRT86
LGALS7 / LGALS7B
MRPL45 / MRPL45P2
MSTO1 / MSTO2p
MUC20 / MUC20P1
MZT2A / MZT2B
OTOA / OTOAp1
PDPR / PDPR2P
PIEZ02 / ENST00000591853.1
ZP3 / POMZP3
PRAMEF7 / PRAMEF8
PROS1 / PROS2P
RMND5A / ANAPC1P2
ROCK1 / ROCK1p1
SERPINB3 / SERPINB4
SYT3 / ZNF473CR
TBC1D26 / TBC1D28
TOP3B / TOP3BP1
TUBA3D / TUBA3E
ZNF443 / ZNF799

Extension requirements

This extension is enabled by default in the germline WGS CNV workflow. However, it requires:

Normalization set to self-normalization (--cnv-enable-self-normalization=true).
GC bias correction enabled (--cnv-enable-gcbias-correction=true).
Counts method set to start (--cnv-counts-method=start).
Interval width not greater than 10kb. However, we recommend using the cnv-interval-width default (1kb) for best performance.
A supported reference genome build as input (currently, only builds based on hg38).

If necessary, the extension can be disabled through setting --cnv-enable-segdups-extension to false.

Algorithm

For each duplicated region, the extension collects all reads falling on top of the two homologous intervals of the pair, and it computes the normalized joint coverage (output to *.cnv.segdups.joint_coverage.tsv.gz).
Through differentiating sites between the two homologous intervals, the extension computes the proportion of coverage to associate to the first and to the second interval (output to *.cnv.segdups.site_ratios.tsv.gz).
Such proportion is used to redistribute the joint normalized coverage between the two homologous intervals.
The rescued intervals are output to the *.cnv.segdups.rescued_intervals.tsv.gz file for inspection and they are automatically injected before the segmentation step.
- During integration with the original intervals from the CNV caller, the rescued intervals are considered higher priority, thus replacing all original intervals that they overlap with.
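The redistribution and injection steps above can be sketched as follows. This is a conceptual illustration only. The function names, the use of read support at differentiating sites as the source of the proportion, and the representation of intervals as (start, end) tuples on one contig are all assumptions, and the actual computation in DRAGEN is not described in this document.

```python
# Illustrative only: not DRAGEN code. All names and inputs are hypothetical.
def redistribute_joint_coverage(joint_coverage, support_a, support_b):
    """Split joint normalized coverage between two homologous intervals.

    support_a / support_b: read support for interval A and interval B
    at sites that differ between the two homologs.
    Returns (coverage_a, coverage_b), or None without any support.
    """
    total = support_a + support_b
    if total == 0:
        return None
    fraction_a = support_a / total
    return joint_coverage * fraction_a, joint_coverage * (1 - fraction_a)


def inject_rescued(original, rescued):
    """Merge rescued intervals into the original ones on a single contig.

    Rescued intervals take priority: every original interval that
    overlaps a rescued interval is dropped. Intervals are half-open.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    kept = [iv for iv in original if not any(overlaps(iv, r) for r in rescued)]
    return sorted(kept + list(rescued))


print(redistribute_joint_coverage(4.0, 30, 10))
print(inject_rescued([(0, 1000), (1000, 2000), (2000, 3000)], [(1500, 2500)]))
```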

See Output Files - Segdups Extension Files for a description of the extension output files.

B-Allele Counts

In workflows supporting B-allele frequency (BAF), a source of heterozygous SNP sites is required to measure B-allele counts of the input sample. The following are the available modes, of which some are only available in somatic workflows.

| Option | Description |
| --- | --- |
| cnv-population-b-allele-vcf | Specify a population SNP VCF. This option is available for both the germline and the somatic workflows. In somatic, it can be used when a matched normal sample is not available and analysis must be performed in tumor-only mode. |
| cnv-normal-b-allele-vcf | (Somatic-specific) Specify a matched normal SNV VCF. Use when a matched normal sample and the matched normal SNV VCF are available. To use this option, you must run the matched normal sample through the DRAGEN Germline workflow. |
| cnv-use-somatic-vc-baf | (Somatic-specific) Set to true to enable DRAGEN to identify germline variants during a tumor/matched-normal run, rather than requiring a separate run on the normal sample. Use if and only if tumor and matched normal input are available. Also enable the Somatic SNV Caller via enable-variant-caller to use this option. |

To specify a population SNP VCF, use the --cnv-population-b-allele-vcf option. To obtain a population SNP VCF, process an appropriate catalog of population variation, such as dbSNP, the 1000 Genomes Project, or other large cohort discovery efforts. A suitable example file for this parameter is "1000G_phase1.snps.high_confidence.vcf.gz" from the GATK resource bundle. Only high-frequency SNPs should be included. For example, include SNPs with minor allele population frequency ≥ 10% to limit run time impact and reduce artifacts. Specify the ALT allele frequency by adding AF=<alt frequency> to the INFO section of each record. Additional INFO fields might be present, but DRAGEN only parses and uses the AF field. Sites specified with --cnv-population-b-allele-vcf can be either heterozygous or homozygous in the germline genome from which the tumor genome derives.

The following is an example valid population SNP record (note: it needs to be tab-delimited):

chr1  51479  .  T  A  1000  PASS  AF=0.3253

DRAGEN considers the following requirements when parsing records from the b-allele VCF:

Only simple SNV sites.
Records must be marked PASS in the FILTER field.
If there are records with the same CHROM and POS values in the VCF, then DRAGEN uses the first record that occurs.
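The record requirements above can be expressed as a small filter over tab-delimited VCF lines. This is a sketch for checking a candidate file, not DRAGEN's parser. The function name `parse_population_sites` is hypothetical, and two behaviors are assumptions: records without an AF entry are skipped, and "first record" is taken to mean the first record at a position that passes the other checks.

```python
# Illustrative only: not DRAGEN code. Skip rules marked below are assumptions.
def parse_population_sites(lines):
    """Collect usable population SNP sites from VCF text lines.

    Returns a dict mapping (chrom, pos) to (ref, alt, alt_frequency).
    """
    sites = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos, _id, ref, alt, _qual, filt, info = line.rstrip("\n").split("\t")[:8]
        if filt != "PASS":
            continue  # records must be marked PASS
        if len(ref) != 1 or len(alt) != 1 or ref not in "ACGT" or alt not in "ACGT":
            continue  # only simple SNV sites
        af = None
        for entry in info.split(";"):
            if entry.startswith("AF="):
                af = float(entry[3:])
        if af is None:
            continue  # assumption: a record without AF is unusable
        key = (chrom, int(pos))
        if key not in sites:  # keep the first record per CHROM and POS
            sites[key] = (ref, alt, af)
    return sites


records = [
    "##fileformat=VCFv4.2",
    "chr1\t51479\t.\tT\tA\t1000\tPASS\tAF=0.3253",
    "chr1\t51479\t.\tT\tC\t1000\tPASS\tAF=0.1",     # duplicate position
    "chr1\t60000\t.\tTA\tT\t1000\tPASS\tAF=0.2",    # deletion, not an SNV
    "chr2\t100\t.\tG\tC\t50\tLowQual\tAF=0.4",      # not PASS
]
print(parse_population_sites(records))
```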

A suitable population B-allele VCF is provided for selected references at this page.

Somatic-specific options

To specify a matched normal sample SNV VCF, use the --cnv-normal-b-allele-vcf option. The VCF file should come from processing the matched normal sample through the DRAGEN germline small variant caller with filters applied. Typically, this file name has a *.hard-filtered.vcf.gz extension. All records marked as PASS that are determined to be heterozygous in the normal sample are used to measure the b-allele counts of the tumor sample. You can also use the equivalent gVCF file (*.hard-filtered.gvcf.gz), but the processing time is significantly longer due to the number of records, most of which are not heterozygous sites.

If a tumor sample and matched normal input are available, you can avoid having to separately process the matched normal with the DRAGEN germline pipeline by specifying --cnv-use-somatic-vc-baf true. If using this option, DRAGEN determines the germline heterozygous sites from the matched normal input and measures the b-allele counts of the tumor sample. The information is passed to the Somatic WGS CNV Caller to simplify the overall somatic workflow.

To enable --cnv-use-somatic-vc-baf, enter the following command line options.

--tumor-bam-input <TUMOR_BAM> --- Specify the tumor input
--bam-input <NORMAL_BAM> --- Specify the matched normal input
--enable-variant-caller true --- Enable the somatic SNV variant caller
--cnv-use-somatic-vc-baf true --- Enable somatic VC BAF

GC Bias Correction

GC bias is the dependence of read coverage on GC content across a genome. Biases can arise from library prep, capture kits, sequencing system differences, and mapping, and they can make CNV events difficult to call. The DRAGEN GC bias correction module attempts to correct these biases.

The GC bias correction module immediately follows the target counts stage and operates on the *.target.counts.gz file. GC bias correction generates a GC bias corrected version of the file, which has a *.target.counts.gc-corrected.gz extension in the file name. The GC bias corrected versions are recommended for any downstream processing when working with WGS data. For WES, if there are enough target regions, then the GC bias corrected counts can also be used. See Output Files - Bias Correction file for further details on GC-corrected target counts files.

Typical whole-exome capture kits have over 200,000 targets spanning the regions of interest. If your BED file has fewer than 200,000 targets, or if the target regions are localized to a specific region in the genome (such that GC bias statistics might be skewed), then GC bias correction should be disabled.

The following options control the GC bias correction module.

--cnv-enable-gcbias-correction --- Enable or disable GC bias correction when generating target counts. The default is true.
--cnv-enable-gcbias-smoothing --- Enable or disable smoothing the GC bias correction across adjacent GC bins with an exponential kernel. The default is true.
--cnv-num-gc-bins --- Specifies the number of bins for GC bias correction. Each bin represents the GC content percentage. Allowed values are 10, 20, 25, 50, or 100. The default is 25.
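To show what binning targets by GC content and correcting per bin can look like, the sketch below applies a generic median-ratio correction. This is a common textbook approach and is not taken from DRAGEN. The function name `gc_correct` is hypothetical, the smoothing across adjacent GC bins is omitted, and the only detail borrowed from the options above is the default of 25 bins.

```python
# Illustrative only: a generic median-ratio GC correction, not DRAGEN's algorithm.
from statistics import median


def gc_correct(counts, gc_fractions, num_bins=25):
    """Scale each target count by (global median / median of its GC bin).

    counts:       raw count per target interval.
    gc_fractions: GC fraction (0.0 to 1.0) of each target interval.
    """
    def bin_index(gc):
        # Clamp so that a GC fraction of exactly 1.0 falls in the last bin.
        return min(int(gc * num_bins), num_bins - 1)

    counts_by_bin = {}
    for count, gc in zip(counts, gc_fractions):
        counts_by_bin.setdefault(bin_index(gc), []).append(count)

    global_median = median(counts)
    corrected = []
    for count, gc in zip(counts, gc_fractions):
        bin_median = median(counts_by_bin[bin_index(gc)])
        corrected.append(count * global_median / bin_median if bin_median > 0 else 0.0)
    return corrected


# Low-GC targets are under-covered relative to high-GC targets.
print(gc_correct([100, 100, 200, 200], [0.30, 0.31, 0.62, 0.63]))
```

A correction of this kind needs enough targets in each GC bin for the per-bin median to be stable, which is consistent with the guidance to disable GC bias correction for small or localized target sets.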

Normalization

The DRAGEN CNV pipeline supports two normalization algorithms:

Self-Normalization --- Estimates the autosomal diploid level from the sample under analysis to determine the baseline level to normalize by. Sex chromosomes and PAR regions are handled based on the sample sex.
Panel of Normals --- A reference-based normalization algorithm that uses additional matched normal samples to determine a baseline level from which to call CNV events. Here, a matched normal sample is one that has undergone the same library prep and sequencing workflow as the case sample.

Which algorithm to use depends on the available data and the application. Use the following guidelines to select the mode of normalization.

Self-Normalization

Whole genome sequencing
Single sample analysis
Additional matched samples are not readily available
Simpler workflow via a single invocation
Only references with chr1, chr2, chr3, ..., chrX, chrY or 1, 2, 3, ..., X, Y naming conventions are supported.

Panel of Normals

Whole genome sequencing
Whole exome sequencing
Targeted panels, including somatic panels
Additional matched samples are available
Nonhuman samples

The table below shows supported normalization methods for CNV workflow:

Platform

Germline

Somatic (T/N)

Somatic (T/O)

Germline (depth-only)

Somatic (depth-only)

WGS

Self / PoN

No workflow

WES

PoN

"No workflow" indicates that no workflow exists for this configuration.

Self Normalization

The DRAGEN CNV pipeline provides the self-normalization mode that does not require a reference sample or a panel of normals. To enable this mode, set --cnv-enable-self-normalization to true. Self-normalization mode bypasses the need to run two stages and can save time. It uses the statistics within the case sample to determine the baseline from which to make a call.

Because self normalization uses the statistics within the case sample, this mode is not recommended for WES or targeted sequencing analysis due to the potential for insufficient data.

The self-normalization mode is the recommended approach for whole-genome sequencing single sample processing. The pipeline continues through to the segmentation and calling stage to produce the final called events.

dragen \
-r <HASHTABLE> \
--bam-input <BAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-enable-self-normalization true

If you are running from a FASTQ sample, then the default mode of operation is self-normalization.

When operating in self-normalization mode, the --cnv-interval-width option used during the target counts stage becomes the effective interval width based on the number of unique k-mer positions. You typically do not have to modify this option.

Self-normalization autogenerates the target intervals to use during the analysis based on the reference genome and is only compatible with standard human references or similar mammalian references (chr1, chr2, chr3, ..., chrX, chrY).

To attempt self-normalization on nonstandard human references, set the override --cnv-bypass-contig-check=true. With this setting, the CNV caller performs a naive median normalization across all contigs in the reference genome. This feature is for experimental and research use only, and no claims or validation are made for its use.
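The idea of normalizing a sample against its own baseline can be sketched with a plain median. This is a conceptual illustration, not DRAGEN's estimator of the autosomal diploid level, which is not described in this document. The function name `median_normalize` is hypothetical, and sex chromosome and PAR handling is left out.

```python
# Illustrative only: plain median normalization, not DRAGEN's estimator.
import math
from statistics import median


def median_normalize(counts_by_contig, baseline_contigs):
    """Convert per-interval counts to log2 ratios against a median baseline.

    counts_by_contig: dict of contig name -> list of interval counts.
    baseline_contigs: contigs used to estimate the baseline, for example
                      the autosomes, or every contig for a naive
                      normalization across the whole reference.
    Intervals with a zero count are reported as None.
    """
    baseline = median(
        count
        for contig in baseline_contigs
        for count in counts_by_contig.get(contig, [])
    )
    return {
        contig: [math.log2(c / baseline) if c > 0 else None for c in counts]
        for contig, counts in counts_by_contig.items()
    }


sample = {"chr1": [100, 100, 50], "chr2": [100, 200], "chrX": [50]}
print(median_normalize(sample, ["chr1", "chr2"]))
```

In this toy sample, intervals at half the baseline come out at a log2 ratio of -1 and intervals at double the baseline at +1. A baseline taken over all contigs of a nonstandard reference can be skewed by contigs that are not diploid, which is one reason to treat the bypass mode with caution.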

Panel of Normals

The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. Proper sample selection and preparation are critical for constructing an accurate and reliable CNV PON. Use high-quality germline samples that meet stringent sequencing quality criteria, such as a high percentage of bases over Q30, sufficient total read depth (yield), appropriate GC content, and minimal adapter contamination. Additionally, all samples should originate from the same sample type (e.g., FFPE, fresh-frozen) and be processed under identical experimental conditions, including the same library preparation kit, sequencing platform, and capture panel version. Even minor variations in hybridization efficiency or read depth distribution can introduce systematic artifacts, leading to inaccurate CNV calls.

Below are the key recommendations for preparing a high-quality PON:

Sample Selection: Normal samples should be sourced from individuals without known chromosomal abnormalities to establish a clean and representative reference baseline. Additionally, normal samples should not be drawn from a cohort that is likely to be enriched for particular CNVs, or enriched for individuals affected by a particular disease or syndrome with a genetic component. Normal samples should ideally be unrelated to each other and to the case samples to be processed. No more than ~6% of the samples in the PON should be related to any case sample. For example, a 50-sample PON containing a quad (mother/father/sibling/proband) may be used to analyze each sample in the quad provided there are no additional related samples in the PON.
Balanced sample sex: The normal sample set should include both male and female samples in similar numbers to ensure a well-represented reference baseline.
Exclude Low-Quality Samples: Samples with unusually uneven target coverage, low sequencing depth, or high technical noise should be removed to minimize variability and ensure consistency in the PON.
Standardized Library Preparation: All samples must be processed using the same library preparation protocol. Any deviations such as differences in hybridization efficiency, incubation time, or temperature can lead to inconsistent coverage patterns, increasing the likelihood of false positive CNV calls.
Adequate Number of Reference Samples: A sufficient number (a minimum of 50 samples is recommended, though not mandatory) of high-quality reference samples is essential for reliable coverage estimation and robust CNV detection.

By following these guidelines, the PON can effectively minimize technical biases, improving the accuracy and reliability of CNV detection.

In PON mode, the DRAGEN CNV Pipeline is broken down into two distinct stages. The target counts stage is performed on each sample (case and normals), to bin the alignments. The normalization and call detection stage is then performed with the case sample against the panel of normals to determine the events.

CNV PONs can also be built in the cloud using the DRAGEN Baseline Builder App on BaseSpace or the DRAGEN Systematic Noise File Builder Pipeline on ICA.

In-run PON for Germline Exome

Some pre-built PONs are available for download from the DRAGEN Software Support Site page. When possible, however, it is recommended to utilize an in-run PON created from samples from the same sequencing run and library prep as the case samples. This will ensure that any biases that may have been introduced during library prep and/or sequencing will be properly normalized.

If the samples in the sequencing run are sufficiently diverse and contain a large majority of copy neutral samples for each target region, then it is recommended that the PON consist of as many samples from the sequencing run as possible, but it can be limited to 96 samples without significantly impacting the accuracy of coverage normalization. When the sequencing run is enriched with samples containing specific CNVs of interest, the in-run PON should be built from only those samples in the run without the enriched events (i.e., "normal samples"). A minimum of 50 such normal samples is recommended, as with a pre-built PON.

The table below summarizes the available options and high-level steps for running CNV using an in-run PON. CNV and Targeted Caller require separate PON files, but the intermediate counts files can be generated in the same DRAGEN command line invocation. For additional details click on the link for each option.

| Analysis option | Steps |
| --- | --- |
| BSSH planned sequencing run (from BCLs) | 1. Create run using the Run Planning tool in BSSH. 2. Start planned run in Control Software on instrument. |
| BSSH existing sequencing run (from BCLs) | Run DRAGEN Germline Enrichment from BCLs App |
| ICA from FASTQs/BAMs/CRAMs | Run DRAGEN Germline Enrichment App |
| BSSH from FASTQs/BAMs/CRAMs | Run DRAGEN Enrichment App |
| Local or AMI | 1. BCL to FASTQ conversion. 2. Generate CNV target counts and Targeted Caller exome counts for each PON sample. 3. Generate CNV combined counts PON file. 4. Generate Targeted Caller PON file. 5. Perform case sample analyses. |

Target Counts Stage

Target counts should be generated for all normal samples used in the panel of normals. The case samples and all panel of normals samples must have identical intervals, so their counts should be generated with identical settings, including reference version, target BED, counting method, duplicate marking/filtering, filtering method/cutoff, and so on. The target counts stage also performs GC bias correction, if enabled. GC bias correction is enabled by default but can be disabled.

The following examples are for WES processing, where a panel of normals is required.

The following is an example command for processing a BAM file.

dragen \
-r <HASHTABLE> \
--bam-input <BAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-target-bed <BED>

The following is an example command for processing a CRAM file.

dragen \
-r <HASHTABLE> \
--cram-input <CRAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-target-bed <BED>

The following example is for WGS processing, where a panel of normals is optional.

dragen \
-r <HASHTABLE> \
--bam-input <BAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-enable-self-normalization true

Generating Panel of Normals (Combined Counts)

When you run an analysis with a panel of normals (a set of target counts files), a column-wise concatenated version of the panel is output as a *.combined.counts.txt.gz file. To generate this file without running the calling step, add the --cnv-generate-combined-counts=true option to the command line. Specify the individual panel of normals target counts files with either --cnv-normals-file (once per file) or --cnv-normals-list (a single text file with the path to each sample's file).

The following is an example command line using a normals list:

dragen \
-r <HASHTABLE> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--cnv-normals-list <NORMALS_LIST> \
--enable-cnv true \
--cnv-generate-combined-counts true

Normalization and Call Detection Stage

The next step in the CNV pipeline when using a panel of normals is to perform the normalization and to make the calls. This involves a separate execution of DRAGEN during which the normalization is performed and calls are made. This step requires the specification of a set of target counts files to be used for reference-based median normalization.

Ideally the panel of normals samples follow library prep and sequencing workflows that are identical to the workflows of the case sample under analysis. In order to be applicable to both male and female case samples, the panel of normals should include a balanced set of both male and female samples. DRAGEN automatically handles calling on sex chromosomes based on the predicted sex of each sample in the panel.

The presence of CNVs in the panel can result in artifactual calls in the test sample at locations where at least some of the panel samples have copy number changes. This leads to two considerations regarding construction of a panel.

Firstly, while it is not generally possible to select samples with no CNVs, panel samples should not be clearly aneuploid or contain large-scale somatic CNVs; further, if there is a region of particular interest, samples should be selected to be normal in that region.

Secondly, for optimal bias correction, a minimum of 50 samples is recommended as a panel. DRAGEN can run with smaller numbers of samples in the panel, down to even just a single sample, but smaller panels increase the likelihood of artifactual calls. Larger panels do not entirely prevent such calls, but they limit them to regions where non-reference copy numbers are common.

The following example lists the files in a PON, which uses a subset of the GC-corrected files from the target counts stage.

/data/output/sample1.target.counts.gc-corrected.gz
/data/output/sample2.target.counts.gc-corrected.gz
/data/output/sample4.target.counts.gc-corrected.gz
/data/output/sample5.target.counts.gc-corrected.gz
/data/output/sample7.target.counts.gc-corrected.gz
/data/output/sample8.target.counts.gc-corrected.gz
...
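
A list like the one above can be assembled with a short script. The following is a minimal sketch in Python, assuming the target counts stage wrote all of its outputs to a single directory. The function name write_normals_list and the directory layout are illustrative and are not part of DRAGEN. The sketch writes absolute paths, one per line.

```python
from pathlib import Path


def write_normals_list(counts_dir, list_path, suffix=".target.counts.gc-corrected.gz"):
    """Write one absolute path per line for every counts file in counts_dir.

    Hypothetical helper: the function name and directory layout are assumptions.
    """
    counts_files = sorted(
        p.resolve() for p in Path(counts_dir).iterdir() if p.name.endswith(suffix)
    )
    if not counts_files:
        raise FileNotFoundError(f"no *{suffix} files found in {counts_dir}")
    # One path per line, in sorted order so the list is reproducible.
    Path(list_path).write_text("".join(f"{p}\n" for p in counts_files))
    return counts_files
```

The sketch collects every matching file in the directory. Remove the lines for any samples that should not be part of the panel before passing the list to DRAGEN.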

DRAGEN accepts three different file formats for a Panel of Normals (PON).

| Option | Description |
| --- | --- |
| --cnv-normals-file | Individual normal file. This option uses a single file name and can be specified multiple times. |
| --cnv-normals-list | List of normal files. A plain text file in which each line contains a path pointing to a *.target.counts.gz or *.target.counts.gc-corrected.gz file generated from the target counts stage. Relative paths are supported if the paths are relative to the current working directory. Absolute paths are recommended in case the workflow is used later or shared with other users. |
| --cnv-combined-counts | PON file that combines all normal files into a single file. The combined counts file (*.combined.counts.txt.gz) can be found in the output folder of a prior DRAGEN run with the same panel of normals. Some prepackaged PON files downloaded directly from the Illumina support site must be specified with this option. |

The CNV caller can also be started from the *.target.counts.gz (raw counts) or *.target.counts.gc-corrected.gz (GC-corrected counts) file of the case sample by specifying the selected file with the --cnv-input or --cnv-tumor-input option, together with the PON options described above. When using GC-corrected counts, set --cnv-enable-gcbias-correction to false to disable the GC-correction stage. GC-corrected inputs are not supported for somatic WGS analysis.

For example, the following command normalizes the case sample against the panel of normals.

dragen \
-r <HASHTABLE> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-input <CASE_COUNTS> \
--cnv-normals-list <NORMALS> \
--cnv-enable-gcbias-correction false

See Output Files - Target Counts File for a description of the target counts files.

Normalization Options

These options control the preconditioning of the panel of normals and the normalization of the case sample.

--cnv-enable-self-normalization --- Enable/disable self normalization mode, which does not require a panel of normals.
--cnv-extreme-percentile --- Specifies the extreme median percentile value at which to filter out samples. The default is 2.5.
--cnv-input --- Specifies a target counts file for the case sample under analysis when using a panel of normals, for germline analysis (see --cnv-tumor-input for somatic analysis).
--cnv-normals-file --- Specifies a target.counts.gz file to be used in the panel of normals. You can use this option multiple times, one time for each file.
--cnv-normals-list --- Specifies a text file that contains paths to the list of reference target counts files to be used as a panel of normals. Absolute paths are recommended in case the workflow is used later or shared with other users. Relative paths are supported if the paths are relative to the current working directory.
--cnv-max-percent-zero-samples --- Specifies the percentage of zero coverage samples allowed for a target. If the target exceeds the specified threshold, the target is filtered out. The default value is 5%. The option is sensitive to the number of normal samples being used, so adjust the threshold accordingly. If your panel of normals is small and the threshold is not adjusted, the option could filter out targets unintentionally.
--cnv-max-percent-zero-targets --- Specifies the percentage of zero coverage targets allowed for a sample. If the sample exceeds the specified threshold, the sample is filtered out. The default value is 2.5%. The option is sensitive to the total number of target intervals, so adjust the threshold accordingly. If the capture kit has a small number of probes and the threshold is not adjusted, the option could filter out samples unintentionally.
--cnv-target-factor-threshold --- Specifies the bottom percentile of panel of normals medians to filter out useable targets. The default is 1% for whole genome processing and 5% for targeted sequencing processing.
--cnv-tumor-input --- Specifies a target counts file for the case sample under analysis when using a panel of normals, for somatic analysis (see --cnv-input for germline analysis).
--cnv-truncate-threshold --- Specifies a percentage threshold for truncating extreme outliers. The default is 0.1%.
--cnv-enable-gender-matched-pon --- Enable/disable gender matched PON normalization. If enabled, DRAGEN uses matched gender PON for sex chromosome normalization. Sex chromosome intervals are filtered if PON has no matched gender sample. The default value is true.
--cnv-enable-cross-gender-adjustments-chrX --- Enable normalization on chrX by adjusting coverage of PON samples according to the expected number of copies of chrX in male and female samples. If the case sample is male, coverage of female PON samples is scaled down by a factor of 2 on chrX. If the case sample is female, coverage of male PON samples is scaled up by a factor of 2 on chrX. If no male PON samples are available, chrY intervals will be filtered. This feature is only supported for germline enrichment runs. The default value is false; if set to true, then --cnv-enable-gender-matched-pon must also be true.
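
The sensitivity of --cnv-max-percent-zero-samples to panel size can be illustrated with a small calculation. The following Python sketch is not DRAGEN code. It assumes that a target is filtered when the percentage of panel samples with zero coverage is strictly greater than the threshold, and the function and target names are hypothetical.

```python
def targets_to_filter(counts_by_target, max_percent_zero_samples=5.0):
    """Return targets whose share of zero-coverage panel samples exceeds the threshold.

    counts_by_target maps a target name to the list of per-sample counts.
    Assumption: strict "greater than" comparison on a percentage.
    """
    filtered = []
    for target, counts in counts_by_target.items():
        percent_zero = 100.0 * sum(1 for c in counts if c == 0) / len(counts)
        if percent_zero > max_percent_zero_samples:
            filtered.append(target)
    return filtered


# With 10 normals, a single zero-coverage sample is 10% and exceeds the 5% default.
small_panel = {"target_A": [0] + [120] * 9, "target_B": [110] * 10}
# With 50 normals, a single zero-coverage sample is 2% and is tolerated.
large_panel = {"target_A": [0] + [120] * 49, "target_B": [110] * 50}

print(targets_to_filter(small_panel))  # ['target_A']
print(targets_to_filter(large_panel))  # []
```

Under these assumptions, one dropout sample is enough to remove a target from a 10-sample panel, which is why the threshold should be raised for small panels.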

Exclude BED Filtering

You can input an exclude BED to the CNV caller to filter out regions from analysis. An exclude BED is useful if certain regions in the genome are known to be problematic due to library prep, sequencing, or mapping issues. You can also exclude intervals that specify common CNVs to aid in downstream analysis. Specify an exclude BED file using --cnv-exclude-bed. DRAGEN does not provide an exclude BED. Format the intervals to exclude in standard three-column BED format.

The intervals in the exclude BED are compared with the original target counts intervals. If the overlap is greater than --cnv-exclude-bed-min-overlap, the target counts interval is excluded from analysis. The *.target.counts.gz file still includes the interval, so you can inspect the original read counts. The normalization stage removes the excluded intervals, so the *.tn.tsv.gz file does not contain the intervals removed during normalization.

An excluded interval does not guarantee that a CNV call does not span the interval. If there is sufficient data flanking the region, the segmentation stage along with any merging might still generate a call spanning the excluded interval. However, the call would not take read counts from excluded intervals into account. You can view explanations for excluded intervals in the *.excluded_intervals.bed.gz file. See Output Files - Excluded interval files for further details.
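
The overlap comparison can be sketched as follows. This Python example is illustrative only. It assumes that the overlap is measured as the fraction of the target interval covered by exclude intervals and that the threshold is a fraction between 0 and 1. The function names and the threshold values used in the example are hypothetical and are not DRAGEN defaults.

```python
def overlap_fraction(target, excludes):
    """Fraction of a target interval (start, end) covered by exclude intervals.

    Coordinates are BED-style half-open [start, end). Overlapping exclude
    intervals are merged so that no base is counted twice.
    """
    t_start, t_end = target
    clipped = sorted(
        (max(s, t_start), min(e, t_end))
        for s, e in excludes
        if s < t_end and e > t_start
    )
    covered, cursor = 0, t_start
    for s, e in clipped:
        s = max(s, cursor)  # skip bases already counted
        if e > s:
            covered += e - s
            cursor = e
    return covered / (t_end - t_start)


def is_excluded(target, excludes, min_overlap):
    # Assumption: the interval is dropped when the overlap exceeds the threshold.
    return overlap_fraction(target, excludes) > min_overlap


print(overlap_fraction((100, 200), [(50, 120), (110, 150)]))  # 0.5
print(is_excluded((100, 200), [(50, 120), (110, 150)], 0.4))  # True
```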

Segmentation

After a case sample has been normalized, the sample goes through a segmentation stage. DRAGEN implements multiple segmentation algorithms, including the following algorithms:

Circular Binary Segmentation (CBS)
Shifting Level Models (SLM)

The SLM algorithm has three variants: SLM, Heterogeneous SLM (HSLM), and Adaptive SLM (ASLM). HSLM is for use in exome analysis and handles target capture kits that are not equally spaced. ASLM includes additional sample-specific estimation of technical variability of depth of coverage, as opposed to changes in copy number. The estimations are based on the median variance within fixed windows or a preliminary set of segments based on b-allele ratios. The ASLM algorithm mitigates oversegmentation due to noisy or wavy samples and is the default mode for somatic WGS analysis.

By default, SLM is the segmentation algorithm for germline whole genome processing, ASLM is the algorithm for somatic whole genome processing, and HSLM is the algorithm for whole exome processing.

If you have specific regions of interest, you can also run with --cnv-segmentation-bed. The option predefines the segments for which copy numbers are estimated, using the regions of interest listed in the BED file. See User-Defined Segmentation (Segment BED) for more information.

--cnv-segmentation-mode --- Specifies the segmentation algorithm to perform. The following values are available.
- bed --- This option is not applicable to the T/N and T/O modes of the somatic WGS and somatic WES workflows.
- cbs
- slm --- The default for germline WGS analysis.
- hslm --- The default for germline WES analysis.
- aslm --- The default for somatic analysis, either WGS or WES sample types.

Circular Binary Segmentation

Circular Binary Segmentation is implemented directly in DRAGEN and is based on "A faster circular binary segmentation algorithm for the analysis of array CGH data"¹, with enhancements to improve sensitivity for NGS data. The following options control Circular Binary Segmentation.

--cnv-cbs-alpha --- Specifies the significance level for the test to accept change points. The default is 0.01.
--cnv-cbs-eta --- Specifies the Type I error rate of the sequential boundary for early stopping when using the permutation method. The default is 0.05.
--cnv-cbs-kmax --- Specifies the maximum width of the smaller segment for permutation. The default is 25.
--cnv-cbs-min-width --- Specifies the minimum number of markers for a changed segment. The default is 2.
--cnv-cbs-nmin --- Specifies the minimum length of data for maximum statistic approximation. The default is 200.
--cnv-cbs-nperm --- Specifies the number of permutations used for p-value computation. The default is 10000.
--cnv-cbs-trim --- Specifies the proportion of data to be trimmed for variance calculations. The default is 0.025.

¹Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23(6):657-663. doi:10.1093/bioinformatics/btl646

Shifting Level Models Segmentation

The Shifting Level Models (SLM) segmentation mode follows from the R implementation of SLMSuite: a suite of algorithms for segmenting genomic profiles². The options relevant for SLM and HSLM mode are described in the germline workflow page. The options for ASLM are described in the somatic workflow page.

²Orlandini V, Provenzano A, Giglio S, Magi A. SLMSuite: a suite of algorithms for segmenting genomic profiles. BMC Bioinformatics. 2017;18(1). doi:10.1186/s12859-017-1734-5

User-Defined Segmentation (Segment BED)

DRAGEN CNV optionally accepts additional regions of interest specified in a --cnv-segmentation-bed file. For example, the specified intervals might correspond to gene boundaries matched to the targeted assay, or might cover entire chromosome arms. Intervals provided by --cnv-segmentation-bed are appended to the CNV VCF with an INFO tag of SEGID taken from the name column of the input BED file.

The recommended format for the BED file includes four columns and a header. The four columns are contig, start, stop, and name. The name column represents the name of the region and must be unique within the BED file. The name is used in the output VCF and annotated as a segment identifier in the INFO/SEGID field. The following example file is in the recommended format with nominal use cases of gene level, arm level, and/or whole chromosome:

contig  start      stop       name
chr1    115245083  115261621  NRAS
chr1    204485504  204526342  MDM4
chr2    16075981   16090656   MYCN
chr2    29416087   30143527   ALK
chr3    12626010   12704516   RAF1
chr3    138374228  138478187  PIK3CB
chr3    178866307  178952154  PIK3CA
chr3    195776751  195806640  TFRC
chr16   29500000   30200000   16p11.2 // pathogenic arm-level CNV related to 16p11.2 Microdeletion Syndrome
chr17   15229779   15265326   PMP22   // pathogenic gene-level CNV related to Charcot-Marie-Tooth Disease Type 1A
chr21   1          46709983   chr21   // pathogenic whole chromosome CNV related to Down Syndrome

If a three-column BED file is used including the contig, start, and stop values, then segment identifiers are autogenerated from the coordinate fields.
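
Because the name column must be unique, it can help to check a segmentation BED before a run. The following Python sketch is a hypothetical pre-flight check and is not part of DRAGEN. It assumes whitespace-separated columns with an optional header line that starts with "contig". The fallback name "contig:start-stop" used for three-column input is an assumption made for illustration and is not the identifier format that DRAGEN autogenerates.

```python
def read_segment_bed(text):
    """Parse segmentation BED text into (contig, start, stop, name) tuples.

    Raises ValueError when a segment name appears more than once.
    """
    segments, seen = [], set()
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "contig":  # skip blank lines and the header
            continue
        contig, start, stop = fields[0], int(fields[1]), int(fields[2])
        # Illustrative fallback name for three-column input (assumption).
        name = fields[3] if len(fields) > 3 else f"{contig}:{start}-{stop}"
        if name in seen:
            raise ValueError(f"duplicate segment name: {name}")
        seen.add(name)
        segments.append((contig, start, stop, name))
    return segments


example = "contig\tstart\tstop\tname\nchr1\t115245083\t115261621\tNRAS\nchr2\t16075981\t16090656\tMYCN\n"
print(read_segment_bed(example))
```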

You can mix user-defined segmentation with standard segmentation modes using the --cnv-segmentation-mode option (cbs, slm, aslm, hslm). For example:

dragen \
--cnv-segmentation-mode slm \
--cnv-segmentation-bed <BED_FILE> \

In this case, the CNV VCF output will include results from the selected segmentation method (SLM in this example), plus additional entries from the user-provided segmentation BED file. If variants are called REF, then they may be filtered based on option --cnv-enable-ref-calls. If you set --cnv-segmentation-mode=bed, the CNV VCF will include only the entries defined in the segmentation BED file.

If some segments in the --cnv-segmentation-bed file are not covered by any target intervals (from --cnv-target-bed), or if all overlapping target intervals are filtered out (e.g., due to k-mer uniqueness filtering), then the associated segments will not be output to the VCF.

Examples of output VCF entries are below:

# Example of REF (only shown if --cnv-enable-ref-calls=true)
chr1    1       DRAGEN:REF:chr1:115245083-115261621     N       .       136     PASS
  END=115261621;REFLEN=16539;SEGID=NRAS GT:SM:CN:BC:PE:LR
  ./.:1.00491:2:32:0,0:0.0

# Examples of DEL/DUP
chr1    204485503       DRAGEN:LOSS:chr1:204485504-204526342    N       <DEL>   55      PASS
  SVLEN=-40839;SVTYPE=CNV;END=204526342;REFLEN=40839;SEGID=MDM4;GCP=0.359;CTP=0.531;ACP=0.475
  GT:SM:CN:BC:PE:LR
  0/1:0.512937:1:64:23,17:122.6

chr2   16075980       DRAGEN:GAIN:chr2:16075981-16090656    N       <DUP>   76      PASS
  SVLEN=14676;SVTYPE=CNV;END=16090656;REFLEN=14676;SEGID=MYCN;GCP=0.401;CTP=0.498;ACP=0.503
  GT:SM:CN:BC:PE:LR
  ./1:1.82095:4:67:0,0:168.1
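
When post-processing the CNV VCF, the SEGID tag can be read from the INFO column with a few lines of code. The following Python sketch parses the INFO string of the MDM4 record shown above and checks that, for this record, REFLEN equals END minus POS. The helper parse_info is illustrative and is not part of DRAGEN.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


pos = 204485503  # POS column of the MDM4 record
info = parse_info(
    "SVLEN=-40839;SVTYPE=CNV;END=204526342;REFLEN=40839;"
    "SEGID=MDM4;GCP=0.359;CTP=0.531;ACP=0.475"
)
print(info["SEGID"])                                  # MDM4
print(int(info["END"]) - pos == int(info["REFLEN"]))  # True
```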

The table below shows CNV workflows supporting the cnv-segmentation-bed option:

Germline

Somatic (T/N)

Somatic (T/O)

Germline (depth-only)

Somatic (depth-only)

WGS

✓

No workflow

WES

✓

ASCN CNV calling requires cnv-segmentation-mode to be set to a value other than bed so that the likelihood of the purity/ploidy model can be calculated from segments derived from the data. The table below shows the CNV workflows that support the cnv-segmentation-mode=bed option:

Germline

Somatic (T/N)

Somatic (T/O)

Germline (depth-only)

Somatic (depth-only)

WGS

✗

✓

No workflow

WES

✗

✓

"No workflow" indicates that no workflow exists for this configuration.

Allele-Specific Copy Number Calling

Selecting a diploid coverage level is a key component of an allele-specific copy number (ASCN) caller. In the somatic case, the caller also needs to identify the most likely tumor purity. DRAGEN CNV ASCN callers use a grid-search approach that evaluates many candidate models to attempt to fit the observed read and b-allele counts across all segments in the input sample. A log likelihood score is emitted for each candidate, and all scores are output (in *.cnv.coverage.models.tsv or *.cnv.purity.coverage.models.tsv, respectively for germline or somatic workflows). The caller chooses the model with the highest log likelihood and then computes several measures of model confidence based on the relative likelihood of the chosen model compared to alternative models.

Note: if BAF data is not sufficient it might be discarded during model estimation, leading to a model based on coverage depth only. In such case, the model will not be able to detect alterations that cannot be easily identified without BAF (e.g., whole-genome trisomy).

Somatic-specific extensions

Default purity/ploidy model

If the confidence in the chosen model is low, the caller returns the default model with the estimated tumor purity set to NA. This can be identified in the output VCF header lines:

##fileformat=VCFv4.4
##ModelSource=DEGENERATE_DIPLOID
##EstimatedTumorPurity=NA
##DiploidCoverage=337.000000
##OverallPloidy=2.008310

The default model provides an alternative methodology to identify large somatic alterations (length of at least 1 Mb): records are filtered by this model based on their segment mean value (SM) or, in the case of copy-neutral LOHs, by their minor allele frequency value (MAF). The threshold values for SM used by the caller are estimated automatically considering the variance of the sample, with larger SM thresholds for DUPs when the variance is higher. For MAF values, PASSing copy-neutral LOHs are called when the MAF is below a certain threshold. The user can use alternative threshold values through the --cnv-filter-del-mean, --cnv-filter-dup-mean and --cnv-filter-cnloh-maf parameters.

Finally, when the caller returns the default model, the fields regarding copy number states based on model estimation (i.e., CN, CNF, CNQ, MCN, MCNF, MCNQ) are omitted from the final VCF output. The following is a set of example calls from the final VCF output:

# LOSS - not PASSing since it is a short alteration (< 1Mb)
chr11   125205167       DRAGEN:LOSS:chr11:125205168-125215362   N       <DEL>   322     lengthDegenerate
  END=125215362;REFLEN=10195;SVLEN=10195;SVTYPE=CNV
  GT:SM:SD:MAF:BC:AS:PE:OBF
  0/1:0.506231:170.6:0.337:10:1:143,143:0

# REF - not PASSing (only GAIN/LOSS/LOH are output by the default model)
chr12   11415425        DRAGEN:REF:chr12:11415425-33369045      N       .       1000    segmentMean
  END=33369045;REFLEN=21953621
  GT:SM:SD:MAF:BC:AS:PE:OBF
  0/0:1.005638:338.9:0.5:19315:16563:13,86:0.00398647

# GAIN - PASSing
chr12   33369045        DRAGEN:GAIN:chr12:33369046-34655426     N       <DUP>   1000    PASS
  END=34655426;REFLEN=1286381;SVLEN=1286381;SVTYPE=CNV
  GT:SM:SD:MAF:BC:AS:PE:OBF
  0/1:1.487834:501.4:0.355:1045:723:86,89:0.00414938

# LOH - PASSing
chr13   18785987        DRAGEN:CNLOH:chr13:18785988-114351403   N       <LOH>   1000    PASS
  END=114351403;REFLEN=95565416;SVLEN=95565416;LOHTYPE=CNLOH;SVTYPE=CNV
  GT:SM:SD:MAF:BC:AS:PE:OBF
  1/1:0.996893:866.3:0.139:85604:167539:28,12:0.00247715

Grid search optimization informed by essential regions

To improve the accuracy of the tumor ploidy model estimation, the somatic WGS CNV caller checks whether the chosen model calls homozygous deletions in regions whose loss is likely to reduce the overall fitness of cells. Such regions are deemed "essential" and under negative selection. Recent efforts in the literature have tried to map such cell-essential genes¹.

The check on essential regions is controlled with --cnv-somatic-enable-lower-ploidy-limit (default true). Default BED files describing the essential regions are provided for hg19, GRCh37, hs37d5, and GRCh38. You can also provide a custom BED file with the --cnv-somatic-essential-genes-bed=<BEDFILE_PATH> parameter, in which case the feature is automatically enabled. A custom essential regions BED file must be tab-separated with four columns. The first three columns identify the coordinates of the essential region (chromosome, 0-based start, exclusive end), and the fourth column is the region ID (string type). The algorithm currently uses only the first three columns, but the fourth can help when manually investigating which regions drove the caller's decisions on model plausibility.

If the somatic WGS CNV caller does not find any overlap between any of the homozygous deletions and any of the essential regions, the model is considered plausible and the model optimization ends. Otherwise, when at least one overlap is found, the model is declared invalid and the model search is repeated on the subset of models that support at least one copy (CN = 1) for the essential region with the lowest coverage among the regions overlapping homozygous deletions.

¹E.g., in 2015 - https://www.science.org/doi/10.1126/science.aac7041

The following is an example taken from the output log where this feature is triggered, leading to an additional iteration of model fitting:

==================================================================
Checking for model plausibility
==================================================================
Call overlap found: chr18 65121873 80261473
- Call segment counts: 272.4
- Call length: 15139600
Model considered not plausible

==================================================================
Performing model fitting
==================================================================
Constraint on lowest non plausible HOMDEL coverage: 272.4
Performing grid search on 23867 models

The following is an example where this feature is not triggered:

==================================================================
Checking for model plausibility
==================================================================
Plausible model identified
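
The plausibility check described in this section reduces to an interval overlap test. The following Python sketch is a simplified illustration and not the DRAGEN implementation. It assumes that the homozygous deletion calls have already been converted to the same 0-based, end-exclusive coordinates as the essential regions BED file. The function names and the region in the usage example are hypothetical.

```python
def overlapping_essential_regions(homdel_calls, essential_regions):
    """Return (call, region) pairs that share at least one base.

    Each item starts with (chromosome, start, end) using a 0-based start and
    an exclusive end. Extra fields, such as a region ID, are ignored.
    """
    hits = []
    for call in homdel_calls:
        for region in essential_regions:
            same_chrom = call[0] == region[0]
            if same_chrom and call[1] < region[2] and region[1] < call[2]:
                hits.append((call, region))
    return hits


def model_is_plausible(homdel_calls, essential_regions):
    # A model is kept only if no homozygous deletion touches an essential region.
    return not overlapping_essential_regions(homdel_calls, essential_regions)


calls = [("chr18", 65121873, 80261473)]
regions = [("chr18", 70000000, 70001000, "REGION_X")]  # hypothetical region
print(model_is_plausible(calls, regions))  # False
```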

Rejection of models calling large portions of chromosome as CN0 (homozygous deletion)

Large chromosomal events are likely to negatively impact genome stability and cell viability. The option --cnv-somatic-homdel-max-fraction sets the maximum fraction of any chromosome that can be called as CN0 (default value: 0.7). If the number of CN0 bases on a chromosome exceeds this fraction of the total number of called bases on that chromosome, the weighted average coverage across all HOMDEL segments is taken as the coverage that needs to be at least CN1 for a model to be considered. Model fitting then restarts from the beginning with the new constraints (and thus a reduced set of alternative models). This feature can be disabled by setting --cnv-somatic-homdel-max-fraction=1, which effectively allows all called bases on each chromosome to be CN0 without rejecting the model.

The following is an example taken from the output log where this feature is triggered:

==================================================================
Checking for model plausibility
==================================================================
Model considered not plausible
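
The fraction check can be sketched as follows. This Python example is an illustration under stated assumptions and not the DRAGEN implementation. It assumes that the fraction is computed per chromosome as CN0 bases over all called bases on that chromosome and that the comparison is strictly greater than the threshold. The function name and the segment tuples are hypothetical.

```python
from collections import defaultdict


def chromosomes_exceeding_homdel_fraction(segments, max_fraction=0.7):
    """Return chromosomes whose CN0 share of called bases exceeds max_fraction.

    segments is a list of (chromosome, start, end, copy_number) with
    half-open coordinates; only called segments should be passed in.
    """
    called = defaultdict(int)
    homdel = defaultdict(int)
    for chrom, start, end, cn in segments:
        length = end - start
        called[chrom] += length
        if cn == 0:
            homdel[chrom] += length
    return sorted(c for c in called if homdel[c] / called[c] > max_fraction)


segments = [
    ("chr18", 0, 80, 0), ("chr18", 80, 100, 2),  # 80% of called bases are CN0
    ("chr1", 0, 10, 0), ("chr1", 10, 100, 2),    # 10% of called bases are CN0
]
print(chromosomes_exceeding_homdel_fraction(segments))       # ['chr18']
print(chromosomes_exceeding_homdel_fraction(segments, 1.0))  # []
```

With a threshold of 1, no chromosome can exceed the limit, which matches the way the feature is disabled.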

Constraining tumor purity

When a minimum and/or a maximum value of tumor purity for the input sample are known from additional evidence, it is possible to constrain the search of models based on either or both of these values. The available input options that can be provided are:

--cnv-somatic-min-purity - float in [0,1]
--cnv-somatic-max-purity - float in [min-purity,1]

Constraining sample ploidy

When a minimum and/or a maximum value of ploidy for the input sample are known from additional evidence, it is possible to constrain the search of models based on either or both of these values. The available input options that can be provided are:

--cnv-ascn-min-ploidy - positive float between 0.5 and --cnv-somatic-max-quartile-copy-number (default 9)
--cnv-ascn-max-ploidy - positive float between min ploidy and --cnv-somatic-max-quartile-copy-number (default 9)

Please note: the sample ploidy constraints are applied to a preliminary estimation of ploidy from sample parameters (which might not be exactly respected in the final estimated ploidy in output), equal to:

2 + (excess coverage with respect to diploid coverage) divided by (coverage of one copy)

For germline ASCN workflows, this is computed as:

$2+\frac{m - c}{c/2}$

while for somatic ASCN workflows, this expression is computed as:

$2+\frac{m - c}{c \cdot p/2}$

where:

m is the mean coverage of the input sample
c is the diploid coverage of the model under consideration
p is the tumor purity (only for somatic workflows)
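
As a worked example of the two expressions, the following Python sketch evaluates the preliminary ploidy for illustrative input values. The function name and the coverage and purity values are made up for the example and do not come from a real sample.

```python
def preliminary_ploidy(mean_coverage, diploid_coverage, purity=None):
    """Preliminary ploidy: 2 + (m - c) / (coverage of one copy).

    purity=None selects the germline form (one copy = c / 2); otherwise the
    somatic form is used (one copy = c * p / 2).
    """
    if purity is None:
        one_copy = diploid_coverage / 2
    else:
        one_copy = diploid_coverage * purity / 2
    return 2 + (mean_coverage - diploid_coverage) / one_copy


# m = 110, c = 100: the excess coverage of 10 is 0.2 copies at 50 per copy.
print(preliminary_ploidy(110.0, 100.0))              # 2.2
# With purity 0.5, one tumor copy contributes only 25, so the same excess is 0.4 copies.
print(preliminary_ploidy(110.0, 100.0, purity=0.5))  # 2.4
```

For the same mean and diploid coverage, a lower purity yields a preliminary ploidy further from 2, because each tumor copy contributes less coverage.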

Subclonal/Mosaic Calling Mode

DRAGEN uses a subclonal/mosaic calling mode for segments with a copy number that is estimated to be heterogeneous among different cells in the sample. Based on a statistical model, a segment is considered to be heterogeneous when the depths or BAF values in a segment are too far away from what is expected for the closest integer-copy number.

Note: in somatic workflows, this setting is honored only when DRAGEN is able to identify a confident model. When a confident model cannot be identified, the caller returns a default model and this feature is always disabled (see the Default purity/ploidy model section for more details and nuances of this approach).

When a segment is considered as heterogeneous, the output for the segment is changed as follows.

The MOSAIC (germline) or HET (somatic) tag is added to the INFO field for the segment.
At least one of the CN and MCN values is given as a non-REF value. Specifically, the values are given as the integer values closest to CNF and MCNF. If the integer values would result in a REF call, then at least one of the CN and MCN values is adjusted to the closest non-REF value.
The ID, ALT, and GT fields are set appropriately for the chosen CN and MCN.
The QUAL score reflects confidence that the segment has nonreference copy number in at least a fraction of the sample.
The CNQ and MCNQ values reflect confidence that the assigned CN and MCN values are true in all of the tumor cells, so at least one of the CNQ and MCNQ values is typically less than five.

To turn on this feature, specify either one of these options:

--cnv-enable-mosaic-calling=true (for the germline ASCN workflow, default true)
--cnv-somatic-enable-het-calling=true (for the somatic workflow, default false)

Note: this calling mode can be disabled for alterations smaller than N bases. It is recommended not to change default thresholds. If necessary, however, they can be changed with:

--cnv-filter-mosaic-length=N (for the germline ASCN workflow, default 100000)
--cnv-somatic-filter-het-length=N (for the somatic workflow, default 0)

The following is an example of MOSAIC GAIN call from the germline ASCN model (from *.cnv.vcf.gz output file):

chr2    89673674        DRAGEN:GAIN:chr2:89673675-89851643      N       <DUP>   1000    PASS
  END=89851643;REFLEN=177969;MOSAIC;SVLEN=177969;SVTYPE=CNV
  GT:CN:MCN:CNQ:MCNQ:CNF:MCNF:SM:SD:MAF:BC:AS:PE
  ./1:4:.:0:.:4.32403:.:2.162016:278.9:.:70:0:13,2

The assigned total CN is 4. However, the CNF annotation (CNF ~ 4.32) shows that the segment deviates further from the diploid state than the assigned integer CN state implies. This can support various hypotheses on the fraction of cells bearing the different CN states, for example:

32% of cells having CN5, 68% of cells having CN4
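
The hypothesis above follows from treating CNF as a weighted average of two integer copy number states: 0.68 × 4 + 0.32 × 5 = 4.32. The following Python sketch reproduces that arithmetic for the two integer states flanking CNF. It is only one of many mixtures consistent with a given CNF value, and the function name is illustrative.

```python
import math


def two_state_mixture(cnf):
    """Split a fractional copy number into the two flanking integer states.

    Returns ((lower_cn, lower_fraction), (upper_cn, upper_fraction)).
    """
    lower = math.floor(cnf)
    upper_fraction = cnf - lower
    return (lower, 1 - upper_fraction), (lower + 1, upper_fraction)


(low_cn, low_frac), (high_cn, high_frac) = two_state_mixture(4.32403)
print(f"{low_frac:.0%} of cells at CN{low_cn}, {high_frac:.0%} at CN{high_cn}")
# 68% of cells at CN4, 32% at CN5
```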

Allele Specific Copy Number Examples

In addition to assigning total copy number based on depth, ASCN Callers make use of BAFs to call allele specific copy numbers. The following table provides examples for a DUP in a reference-diploid region:

| Total Copy Number (CN) | Minor Copy Number (MCN) | ASCN Scenario |
| --- | --- | --- |
| 4 | 2 | 2+2 |
| 4 | 1 | 3+1 |
| 4 | 0 | 4+0* |

*The entry represents an Absence or Loss of Heterozygosity (AOH/LOH) case. The total copy number is still considered a DUP, so the entry is annotated as GAINLOH to distinguish the value from Copy Neutral AOH/LOH (CNLOH), which would be annotated as 2+0.

Call Smoothing

The segmentation stage might produce adjacent or nearby segments that are assigned the same copy number and have similar depth and BAF data. This segmentation can result in a region with consistent true copy number being fragmented into several pieces. The fragmentation might be undesirable for downstream use of copy number estimates. Also, for some uses, it can be preferable to smooth short segments that would be assigned different copy numbers whether due to a true copy number change or an artifact. To reduce undesirable fragmentation, initial segments can be merged during a postcalling segment smoothing step.

After initial calling, segments shorter than the specified value of --cnv-filter-length are deemed negligible. Among the remaining nonnegligible segments, successive pairs are evaluated for merging. On a trial basis, the caller combines two successive segments that are within --cnv-merge-distance (default value of 10000 for WGS Somatic CNV) of one another and have the same CN and MCN assignments, along with any intervening negligible segments, into a single segment that is recalled and rescored. If the merged segment receives the same CN and MCN as its constituent nonnegligible pieces with a sufficiently high quality score, the original segments are replaced with the merged segment. The merged segment might be further merged with other initial or merged segments to either side. Merging proceeds until all segment pairs that meet the criteria are merged. Note: in somatic workflows, when the germline CN information is available and two segments have different germline CN, they are not merged.

QUAL Model

The caller uses a model based on diploid coverage (and purity in somatic workflows) from depth of coverage and B-allele frequency.

Given the most likely diploid coverage (and purity in somatic workflows), for each segment, the algorithm calls the most likely copy number state (complete with total copy number CN, and minor allele copy number MCN).

The probability of the REF state is used as input to the scoring algorithm, which outputs the QUAL value (a PHRED score capped at 1000). The probability of error behind the PHRED score is the probability of REF when an alteration is called, or the probability of a non-REF call when the segment is called REF.

Note: this is different from how QUAL is computed in (legacy) depth-only callers.

Output Files

DRAGEN emits the calls in the standard VCF format. The VCF file includes only copy number gain and loss events. To include copy neutral (REF) calls in the output VCF, set --cnv-enable-ref-calls to true. AOH/LOH events are available in workflows where allele-specific copy number is available.

CNV VCF File

File extension: *.cnv.vcf.gz

The CNV VCF file follows the standard VCF format. Due to the nature of how CNV events are represented versus how structural variants are represented, not all fields are applicable. In general, if more information is available about an event, then the information is annotated. Some fields in the DRAGEN CNV VCF are unique to CNVs. The VCF header is annotated with ##source=DRAGEN_CNV to indicate the file is generated by the DRAGEN CNV pipeline.

VCF format differences between different callers

In the DRAGEN CNV component, two versions of the VCF specification are used for the *.cnv.vcf.gz file:

For ASCN workflows, the format used is VCF v4.4
For depth-only workflows (including multisample CNV calling), the format used is VCF v4.2

The differences between the two formats in output from DRAGEN are the following:

General

Field        VCF v4.2               VCF v4.4
INFO/SVLEN   Positive or Negative   Always Positive

Absence/Loss of Heterozygosity (AOH/LOH)

Field       VCF v4.2      VCF v4.4
ALT         <DEL>,<DUP>   <LOH>
FORMAT/GT   1/2           1/1

Header

The following is an example of some of the header lines that are specific to CNV:

##fileformat=VCFv4.2
##CoverageUniformity=0.402517
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
...
##reference=file:///reference_genomes/Hsapiens/hs37d5/DRAGEN
##ALT=<ID=CNV,Description="Copy number variant region">
##ALT=<ID=DEL,Description="Deletion relative to the reference">
##ALT=<ID=DUP,Description="Region of elevated copy number relative to the reference">
##INFO=<ID=REFLEN,Number=1,Type=Integer,Description="Number of REF positions included in this record">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END">
##FILTER=<ID=cnvQual,Description="CNV with quality below <WORKFLOW-SPECIFIC DEFAULT VALUES>">
##FILTER=<ID=cnvCopyRatio,Description="CNV with copy ratio within +/- 0.2 of 1.0">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=SM,Number=1,Type=Float,Description="Linear copy ratio of the segment mean">
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Estimated copy number">
##FORMAT=<ID=BC,Number=1,Type=Integer,Description="Number of bins in the region">
##FORMAT=<ID=PE,Number=2,Type=Integer,Description="Number of improperly paired end reads at start and stop breakpoints">

The following header lines are specific to the somatic CNV callers (WGS/WES) and the germline WGS CNV caller:

ModelSource The primary basis on which the final model was chosen. The following values can be included:
- DEPTH+BAF: Depth+BAF signal is used to determine model.
DiploidCoverage Expected read count for a target bin in a diploid region. The numeric value is unlimited.
OverallPloidy Length-weighted average of copy number for PASS events (for the tumor fraction in somatic runs). The numeric value is unlimited.
OutlierBafFraction A QC metric that measures the fraction of b-allele frequencies that are incompatible with the segment the BAFs belong to. High values might indicate a mismatched normal, substantial cross-sample contamination, or a different source of a mosaic genome, such as bone marrow transplantation. The range of this field is [0, 1].
HomozygosityIndex Autosomal AOH/LOH percentage, considering only PASS AOH/LOH events with a length greater than or equal to a minimum size. This metric can be used as a proxy for consanguinity in the germline WGS CNV caller. The default minimum size for PASS AOH/LOH to be considered is 2 Mb, since it is often found that shorter ROHs "do not arise from inbreeding in recent generations and are common in all of the populations represented in the HGDP" (Kirin et al., 2010). However, a custom minimum size can be set through the option --cnv-min-length-homozygosity-index. Note: The Cyto VCF (*.cyto.vcf.gz) also provides resolution-specific homozygosity indexes (i.e., computed on each specific resolution's callset). The default minimum size considered is the same as for the main HomozygosityIndex, and for each resolution in output, an additional header line in the Cyto VCF indicates the resulting metric, e.g., ##HomozygosityIndex(25k)=0.001015.

The following header lines are specific to the somatic CNV callers (WGS/WES):

ModelSource can also have the following values (see section below for additional details):
- DEPTH+BAF_DOUBLED: The initial depth+BAF model is duplicated based on VAF signal or excess segments at half the expected depth change.
- DEPTH+BAF_DEDUPLICATED: The depth+BAF model is deduplicated based on VAF signal or insufficient segments supporting a duplication.
- DEPTH+BAF_WEAK: Depth+BAF signal is used to determine tumor model, but with lower confidence than DEPTH+BAF.
- VAF: VAF signal is used to determine tumor model due to insufficient depth+BAF signal.
- SAMPLE_MEDIAN: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. Diploid coverage set to sample median.
- DEGENERATE_DIPLOID: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. The diploid coverage is set to lowest value observed in a substantial number of bases in segments with BAF=50%.
EstimatedTumorPurity Estimated fraction of cells in the sample due to tumor. The range of this field is [0, 1] or NA if a confident model could not be determined.
AlternativeModelDedup An alternative to the best model corresponding to one less whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation if the best model might involve a spurious genome duplication.
AlternativeModelDup An alternative to the best model corresponding to one more whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation where the best model might have missed a true genome duplication.

Understanding ModelSource Annotation (Somatic only)

The ModelSource indicates the type and strength of evidence used to determine the tumor purity and ploidy model for the sample. Possible values are listed in approximate order of decreasing evidence strength, with DEPTH+BAF variants representing the most robust determinations and the degenerate models representing the least confident scenarios.

DEPTH+BAF represents the strongest evidence, where both sequencing depth (read coverage) and B-allele frequency (BAF) signals consistently support the chosen model, confirmed by variant allele frequency (VAF) data (if available).
DEPTH+BAF_DOUBLED indicates that the initial depth and BAF model was adjusted upward by a whole-genome duplication factor, supported by either VAF evidence showing variants at the expected frequencies for a duplicated genome, or an excess of genomic segments with a closely matching state in the model with the WGD. Conversely, DEPTH+BAF_DEDUPLICATED means the model was adjusted downward by removing a whole-genome duplication, based on VAF data inconsistent with duplication or insufficient genomic segments supporting the higher ploidy hypothesis.
DEPTH+BAF_WEAK reflects a scenario where depth and BAF signals provided the model, but with lower confidence than other DEPTH+BAF model sources. A model receiving this model source is found through depth and BAF, but either:
- VAF data is available but the model is not concordant with VAF evidence.
- Several regions of the genome have no closely matching state under the selected model.
VAF indicates that variant allele frequency data from somatic mutations became the primary evidence source because depth and BAF signals were insufficient or conflicting.
Finally, DEGENERATE_DIPLOID and SAMPLE_MEDIAN represent fallback models used when neither depth/BAF nor VAF provide adequate signal, and tumor purity cannot be reliably estimated. These assume the sample is high-purity diploid, with coverage set either to the lowest observed value in BAF-balanced regions (DEGENERATE_DIPLOID) or to the sample's median coverage (SAMPLE_MEDIAN).

Records

All coordinates in the VCF are 1-based.

CHROM

The CHROM column specifies the chromosome (or contig) on which the copy number variant being described occurs.

POS

The POS column is the start position of the variant. According to the VCF specification, if any of the ALT alleles is a symbolic allele, such as <DEL>, then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism.

ID

The ID column is used to represent the event. The ID field encodes the event type and coordinates of the event (1-based, inclusive). In addition to representing GAIN, LOSS, and REF events, in Somatic (WGS/WES) and Germline (WGS) CNV, the ID can include the Copy Neutral Loss/Absence of Heterozygosity (CNLOH) or Copy Number Gain with LOH (GAINLOH) events.

REF

The REF column contains an N for all CNV events.

ALT

The ALT column specifies the type of CNV event. Because the VCF contains only CNV events, only the <DEL>, <DUP>, or <LOH> entries are used. If REF calls are emitted, their ALT is always a single period (.). In workflows where allele-specific copy number (ASCN) is available, if the legacy DRAGEN VCF format (VCF v4.2) has been enabled with --cnv-enable-legacy-vcf-format, the ALT field contains two alleles, <DEL>,<DUP>, in place of <LOH> for AOH/LOH events.

QUAL

The QUAL column contains an estimated quality score for the CNV event, which is used in hard filtering. Each CNV workflow has different defaults and the value used can be found in the VCF header. Note: different workflows (e.g., germline WGS depth-only vs germline WGS) do not share the same underlying model and provide different QUAL score distributions. It is recommended to compare QUAL scores only within results from the same workflow. More details are available on QUAL (depth-only) and QUAL (ASCN).

FILTER

The FILTER column contains PASS if the CNV event passes all filters, otherwise the column contains the name of the failed filter. Default values are defined in the header line for each available FILTER.

FILTER

Germline WGS (depth-only)

Germline WGS

Germline WES

Somatic WGS

Somatic WES (depth-only)

Somatic WES

binCount

✓

chromArmBinCount

✓

cnvBinSupportRatio

✓

cnvCopyRatio

✓

cnvHetLength

✓

cnvLength

✓

cnvLikelihoodRatio

✓

cnvMosaicLength

✓

cnvQual

✓

dinucQual

✓

mosaicFraction

✓

highCN

✓

lengthDegenerate

✓

segmentMean

✓

SqQual

✓

FILTER description

Available FILTERs:

binCount - Filters CNV events with a bin count lower than a threshold.
cnvBinSupportRatio - Indicates that, for CNVs greater than 80 kb, the percent span of supporting target intervals is lower than a threshold.
cnvCopyRatio - Indicates that the segment mean of the CNV is not far enough from copy neutral.
cnvHetLength - Indicates that a HET call below a certain length has been filtered as a candidate false positive.
cnvLength - Indicates that the length of the CNV is lower than a threshold.
cnvLikelihoodRatio - Indicates that the log10 likelihood ratio of ALT to REF is less than a threshold.
cnvMosaicLength - Indicates that a MOSAIC call below a certain length has been filtered as a candidate false positive.
cnvQual - Indicates that the QUAL of the CNV is lower than a threshold.
chromArmBinCount - Indicates that a whole-arm alteration call is based on a minimal portion (default 500 intervals) of the entire arm (e.g., in acrocentric chromosomes, where the short arm consists mainly of poor-mappability regions that are ignored during copy-number calling).
dinucQual - Applied based on the percentage of bases in a segment that belong to a two-base set (GC, CT, or AC), determined by individual occurrences. A CNV call is filtered out if any of these percentages falls outside typical ranges, indicating a likely false positive.
mosaicFraction - Indicates that the mosaic fraction of a germline CNV is below a defined threshold (--cnv-filter-mosaic-fraction). This filter is applied only to small CNVs with lengths shorter than the specified size threshold (--cnv-filter-mosaic-fraction-max-length).
highCN - Indicates a CNV call with implausible copy number (>6).
lengthDegenerate - Marks records as non-PASSing based on each record's length (REFLEN) when the caller returns the default model. Segments shorter than 1 Mb are assigned this filter when the caller returns the default model.
segmentMean - Marks records as non-PASSing based on each record's segment mean (SM) when the caller returns the default model. Segments having insufficient SM in DELs or DUPs are assigned this filter when the caller returns the default model.
SqQual - Marks records as non-PASSing based on each record's somatic quality (SQ) when the caller returns the default model. Segments having insufficient SQ are assigned this filter when the caller returns the default model. SQ is the somatic quality value, a Phred-scaled score of the p-value from a two-sample t-test comparing normalized counts of CASE vs PON.

INFO

The INFO column contains information representing the event.

REFLEN indicates the length of the event.
SVLEN indicates the length of the event and it is only present for non-REF records. Note: if the legacy DRAGEN VCF format (VCF v4.2) has been enabled with --cnv-enable-legacy-vcf-format, SVLEN is a signed representation of REFLEN (e.g., a negative value indicates a deletion).
SVTYPE is always CNV and only present for non-REF records.
END indicates the end position of the event (1-based, inclusive).

The legacy (depth-only) Germline CNV caller also includes the following INFO fields:

ID    Description
GCP   Percentage of bases that are G or C
CTP   Percentage of bases that are C or T
ACP   Percentage of bases that are A or C

If using a segment BED file, the segment identifier is carried over from the input to the SEGID field.

In Germline WGS CNV, the MOSAIC tag identifies mosaic calls. In Somatic CNV, the HET tag identifies subclonal calls. See Subclonal/Mosaic-Calling Mode for more details.

When matching CNV with SV output, additional INFO annotations are added. See CNV With SV Support.

FORMAT

The common FORMAT fields are described in the header:

Description

Genotype

Linear copy ratio of the segment mean

Estimated copy number

Number of bins in the region

Number of improperly paired end reads at start and stop breakpoints

Number of allelic read count sites

Number of read count bins

Estimated total copy number of sample (for the tumor fraction in Somatic callers). This field is not present if the model cannot be estimated with high confidence.

CNF

Floating point estimate of copy number (for the tumor fraction in Somatic callers). This field is not present if the model cannot be estimated with high confidence.

CNQ

Exact total copy number Q-score. This field is not present if the model cannot be estimated with high confidence.

MAF

Estimate for the minor allele frequency

MCN

Estimated minor-haplotype copy number (for the tumor fraction in Somatic callers). This field is not present if the model cannot be estimated with high confidence.

MCNF

Floating point estimate of minor-haplotype copy number (for the tumor fraction in Somatic callers). This field is not present if the model cannot be estimated with high confidence.

MCNQ

Minor copy number Q-score. This field is not present if the model cannot be estimated with high confidence.

OBF

Per-segment Outlier BAF Fraction. Percentage of BAF counts which are considered "outlier" with respect to the chosen segment call. Higher values might indicate segments where BAF counts are problematic.

Best estimate of segment's bias-corrected read count

NCN

Normal-sample copy number. The field is only present in somatic workflows with enabled germline-aware mode.

SCND

Difference between CN and NCN. The field is only present in somatic workflows with enabled germline-aware mode.

Note: legacy depth-only callers (germline WGS/WES and somatic WES) do not include some of the FORMAT fields indicated above, due to limitations of the legacy models. Germline WES (depth-only) also includes the following FORMAT fields:

Description

Log10 likelihood ratio of ALT to REF

Note on genotype annotation in germline copy number calling (depth-only)

Because germline copy number calling determines overall copy number rather than the copy number on each haplotype, the genotype field contains missing values for diploid regions when CN is greater than or equal to 2. The following are examples of the GT field for various VCF entries:

Diploid or Haploid?

ALT

FORMAT:CN

FORMAT:GT

Diploid

./.

Diploid

<DUP>

./1

Diploid

<DEL>

0/1

Diploid

<DEL>

1/1

Haploid

<DUP>

Haploid

<DEL>

Post VCF target BED

The DRAGEN CNV pipeline can take a target BED file as input and emit only the calls that overlap the BED intervals. The post VCF target BED is provided through the --cnv-post-vcf-target-bed option.

Coverage Uniformity

The DRAGEN CNV pipeline provides a measure of the quality of the data for a sample. If using the WGS self-normalization method, the additional CoverageUniformity metric is present in the VCF header. The metric is only available for germline samples. The CNV pipeline assumes that post-normalization target counts are independently and identically distributed (IID). Coverage in most high-quality WGS samples is uniform enough for the CNV caller to produce accurate calls, but some samples violate the IID assumption. Issues during library preparation or sample contamination can lead to several extreme outliers and/or waviness of target counts, which can result in a large number of false positive CNV calls. The CoverageUniformity metric quantifies the degree of local coverage correlation in the sample to help identify poor-quality samples.

A larger value for this metric means the coverage in a sample is less uniform, which indicates that the sample has more nonrandom noise and could be considered poor quality. The CoverageUniformity metric depends on factors other than sample quality, such as the cnv-interval-width setting and sample mean coverage. DRAGEN recommends using this score to compare the quality of samples with similar mean coverage that were run with the same command-line options. Because the metric depends on these factors, DRAGEN CNV only provides the metric and does not take any action based on it.

The CoverageUniformity metric is calculated as follows:

Parse normalized counts from tn.tsv.gz for autosomes.
Split counts into disjoint tiled windows of size 100 bins.
Compute variance for all windows to measure observed distribution.
Shuffle normalized counts and compute variance for similar disjoint windows to create a background distribution.
Perform a two-sample Kolmogorov–Smirnov (KS) test between the observed and background variance distributions per chromosome.
Compute the final CoverageUniformity score as a weighted mean of KS test statistic across chromosomes, where weights correspond to chromosome sizes.
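
The steps above can be sketched in Python as follows. This is an illustrative sketch rather than DRAGEN's code: the function names are hypothetical, and it assumes population variance, a single shuffle per chromosome with a fixed seed, and normalized counts that have already been parsed from tn.tsv.gz into one list per autosome.

```python
import random
from statistics import pvariance
from typing import Dict, List, Sequence


def window_variances(values: Sequence[float], window: int = 100) -> List[float]:
    """Variance of each disjoint, full-size window of `window` bins."""
    return [pvariance(values[i:i + window])
            for i in range(0, len(values) - window + 1, window)]


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def coverage_uniformity(counts_by_chrom: Dict[str, Sequence[float]],
                        chrom_sizes: Dict[str, int],
                        window: int = 100, seed: int = 0) -> float:
    rng = random.Random(seed)
    weighted_sum = 0.0
    total_weight = 0.0
    for chrom, counts in counts_by_chrom.items():
        observed = window_variances(counts, window)
        if not observed:
            continue  # chromosome too short for a single window
        shuffled = list(counts)
        rng.shuffle(shuffled)  # destroys local correlation: background
        background = window_variances(shuffled, window)
        weight = chrom_sizes[chrom]
        weighted_sum += weight * ks_statistic(observed, background)
        total_weight += weight
    return weighted_sum / total_weight if total_weight else float("nan")
```

In this sketch, locally correlated coverage (for example, long stretches that sit above or below the sample mean) yields window variances that differ from the shuffled background, which pushes the score toward 1.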

CNV Metrics File

DRAGEN CNV outputs metrics in CSV format. The output follows the general convention for QC metrics reporting in DRAGEN. The CNV metrics are output to a file with a *.cnv_metrics.csv file extension. The following list summarizes the metrics that are output from a CNV run.

Sex Genotyper:

The estimated sex of the case sample and of all panel of normals samples is reported. For WGS workflows, the estimated sex karyotype is reported; for non-WGS workflows, the gender is reported.
Confidence score (ranging from 0.0 to 1.0). If the sample sex is specified, this metric is 0.0.

DRAGEN Sex Genotyper requires a minimum of 300 target intervals to confidently determine sex genotype; if the panel covers fewer intervals on the sex chromosomes, genotyping will fail and an undetermined genotype is returned. Users may lower this requirement by setting --cnv-sex-genotyper-num-interval-requirement to a smaller value, at the risk of increased false genotype calls.

CNV Summary:

Bases in reference genome in use
Average alignment coverage over genome - The average alignment coverage over the genome is calculated by dividing the total number of bases from processed alignment records (excluding those filtered by the Target Counts stage in DRAGEN CNV) by the genome length. Alignment records are filtered taking into consideration duplicate marking status (if available), MAPQ, and mapping status.
Number of alignment records processed
- Number of filtered records (total)
- Number of filtered records (due to duplicates)
- Number of filtered records (due to MAPQ)
- Number of filtered records (due to being unmapped)
PMAD - Pairwise Median Absolute Deviation measures the variation in read coverage between adjacent bins. It measures variability due to various factors, such as DNA degradation, extraction, amplification, or library preparation. Higher values indicate noisier sample data. PMAD is calculated as follows:
- Define a vector v[i] as normalized counts of i-th interval in log scale, and d[i] as pairwise differences of consecutive normalized counts between i and i+1 intervals, i.e. d[i] = (v[i] - v[i+1])
- PMAD is median absolute deviation of d, i.e. PMAD = Median(|d[i]-Median(d)|)
Coverage MAD - Median absolute deviation of normalized case counts. Higher values indicate noisier sample data.
Median Bin Count - Median of raw counts normalized by interval size.
Number of target intervals
Number of normal samples
Number of segments
Number of amplifications - Note: GAINLOH events (ALT=LOH and CN > 2) are also included here
Number of deletions
Number of CNLOHs (Copy-Neutral LOHs)
Number of PASS amplifications - Note: GAINLOH events (ALT=LOH and CN > 2) are also included here
Number of PASS deletions
Number of PASS CNLOHs (Copy-Neutral LOHs)
Post-Normalization Bin Count Sigma - Standard deviation of post-PoN-normalization median-normalized coverage values.

Coverage MAD and Median Bin Count are only printed for WES germline/somatic CNV. Post-Normalization Bin Count Sigma is only printed when PoN normalization has been applied.
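
The PMAD definition in the list above translates directly into a few lines of Python. The function name pmad is hypothetical, and the input is assumed to be the per-interval normalized counts, already in log scale and in genomic order.

```python
from statistics import median
from typing import Sequence


def pmad(log_counts: Sequence[float]) -> float:
    """Median absolute deviation of differences between consecutive bins."""
    # d[i] = v[i] - v[i+1]
    d = [log_counts[i] - log_counts[i + 1] for i in range(len(log_counts) - 1)]
    if not d:
        raise ValueError("PMAD needs at least two intervals")
    center = median(d)
    # PMAD = Median(|d[i] - Median(d)|)
    return median(abs(x - center) for x in d)
```

A perfectly flat profile gives a PMAD of 0, and a profile that alternates between 0.0 and 0.1 gives a PMAD of about 0.1.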

Intermediate and Visualization Files

Intermediate stages of the pipeline produce various intermediate output files. These files can be useful for visualization of the evidence or results from each stage, and may aid in fine-tuning options.

All files have a structure similar to a BED file with optional header line(s).

Target Counts File

The file *.target.counts.gz is a compressed tab-delimited text file that contains the number of read counts per target interval. This is the raw signal as extracted from the alignments of the BAM or CRAM file. The format is identical for both the case sample and any panel of normals samples. There is also a bigWig representation of a target.counts.diploid file, which is normalized to the normal ploidy level of 2 instead of raw counts.

It has the following columns:

Contig identifier
Start position
End position
Target interval name
Count of alignments in this interval
Count of improperly paired alignments in this interval

Header lines are also included that start with #. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.

An example of a *.target.counts.gz file is shown below.

#TARGET COUNTS FILE
##DRAGENVersion=<VERSION_INFO>
##DRAGENCommandLine=<CommandLineOptions>
...
contig  start  stop   name                <SampleName> improper_pairs
1       565480 565959 target-wgs-1-565480 7          6
1       566837 567182 target-wgs-1-566837 9          0
1       713984 714455 target-wgs-1-713984 34         4
1       721116 721593 target-wgs-1-721116 47         1
1       724219 724547 target-wgs-1-724219 24         21
1       725166 725544 target-wgs-1-725166 43         12
1       726381 726817 target-wgs-1-726381 47         14
1       753243 753655 target-wgs-1-753243 31         2
1       754322 754594 target-wgs-1-754322 27         0
1       754594 755052 target-wgs-1-754594 41         0
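
A file in this layout can be read with a short Python sketch such as the one below. The TargetCount type and the function names are hypothetical. The sketch assumes tab-delimited columns, as stated above, and skips the # header lines and the column-name row. Counts are parsed as floats so that the same sketch also works for files that carry non-integer values in the count column.

```python
import gzip
from typing import Iterable, List, NamedTuple


class TargetCount(NamedTuple):
    contig: str
    start: int
    stop: int
    name: str
    count: float
    improper_pairs: int


def parse_target_counts(lines: Iterable[str]) -> List[TargetCount]:
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank line or header/meta line
        fields = line.split("\t")
        if fields[0] == "contig":
            continue  # column-name row
        contig, start, stop, name, count, improper = fields[:6]
        rows.append(TargetCount(contig, int(start), int(stop), name,
                                float(count), int(improper)))
    return rows


def read_target_counts(path: str) -> List[TargetCount]:
    """Read a gzip-compressed target counts file from disk."""
    with gzip.open(path, "rt") as handle:
        return parse_target_counts(handle)
```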

B-Allele counts

In germline runs, B-allele counts are calculated at bi-allelic sites taken from a collection of high-frequency SNVs in the population. In somatic runs, B-allele counts are calculated at sites in the tumor sample where the normal sample is likely to be heterozygous. When analyzed in conjunction with a matched normal sample, the sites are those that are called as heterozygous SNVs in the normal sample. When analyzed in tumor-only mode, sites are selected from a population collection (similar to germline runs). Each B-allele site consists of a reference allele and a variant allele, and the number of reads in the sample supporting each of these alleles is counted.

B-allele counts are written both to a gzipped tsv file (*.ballele.counts.gz) and to a gzipped bedgraph file (*.baf.bedgraph.gz).

B-allele tsv

The tsv file format is the following:

Contig identifier
Start, BED-style (zero-based inclusive) start position of the reference allele
Stop, BED-style (one-based inclusive) stop position of the reference allele
Base sequence for the reference allele
Base sequence for the first allele being counted
Base sequence for the second allele being counted
The number of qualified reads containing a sequence matching the first allele
The number of qualified reads containing a sequence matching the second allele

For B-allele sites taken from a population VCF, the following two columns are added after the columns listed above:

Population frequency for the first allele
Population frequency for the second allele

An example of a B-allele counts file is provided below:

contig  start   stop    refAllele       allele1 allele2 allele1Count    allele2Count
chr1    11021   11022   G       G       A       4       2
chr1    14463   14464   A       A       T       111     36
chr1    16494   16495   G       G       C       122     262
chr1    38741   38742   C       C       T       9       9
chr1    39014   39015   A       A       C       38      48
chr1    39260   39261   T       T       C       199     143
chr1    48447   48448   C       C       T       8       15
chr1    48517   48518   A       A       G       13      15
chr1    91485   91486   G       G       C       1       4
chr1    91489   91490   A       A       G       1       3
chr1    98944   98945   C       C       T       46      114

B-allele bedgraph file

The bedgraph file format is similar to the BED format and it has the following columns:

Contig identifier
Start
Stop
Ratio of allele counts

The numerator and denominator of the ratio are determined by sorting the allele counts according to the priority of the corresponding bases. The order of the bases in descending priority is {A, T, G, C}.

When the priority of allele1 is higher than the priority of allele2, the output frequency is calculated by:

allele1Count / (allele1Count + allele2Count)

When the priority of allele2 is higher than the priority of allele1, the output frequency is calculated by:

allele2Count / (allele1Count + allele2Count)

By prioritizing the bases in this way, the output frequencies will be deterministically distributed in a roughly equal proportion above and below 0.5. When plotting these B-allele frequencies (e.g., in IGV), this gives an easy way to visually determine significant changes in b-allele frequency between neighboring segments of the genome. It also provides a similar visualization to that typically used for array data.

An example of the bedgraph file is shown below:

chr1    11021   11022   0.333333
chr1    14463   14464   0.755102
chr1    16494   16495   0.317708
chr1    38741   38742   0.5
chr1    39014   39015   0.44186
chr1    39260   39261   0.581871
chr1    48447   48448   0.652174
chr1    48517   48518   0.464286
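
The priority rule can be illustrated with a small Python sketch. The names BASE_PRIORITY and baf_ratio are hypothetical, and the sketch assumes single-base alleles. Applied to the first rows of the example counts file shown earlier, it reproduces the corresponding bedgraph values after rounding to six decimal places.

```python
from typing import Optional

# Lower rank means higher priority: A > T > G > C.
BASE_PRIORITY = {"A": 0, "T": 1, "G": 2, "C": 3}


def baf_ratio(allele1: str, allele2: str,
              allele1_count: int, allele2_count: int) -> Optional[float]:
    """Fraction of reads supporting the higher-priority allele."""
    total = allele1_count + allele2_count
    if total == 0:
        return None  # no qualified reads at this site
    if BASE_PRIORITY[allele1] < BASE_PRIORITY[allele2]:
        return allele1_count / total
    return allele2_count / total


# chr1:11021 has G=4 and A=2; A outranks G, so the ratio is 2 / 6.
ratio = round(baf_ratio("G", "A", 4, 2), 6)
```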

Bias correction file

The file *.target.counts.gc-corrected.gz contains the GC-corrected read count for each target interval. The format is equivalent to the *.target.counts.gz file:

Contig identifier
Start position
End position
Target interval name
GC-corrected read counts in this interval
Count of improperly paired alignments in this interval

Header lines are also included that start with #. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.

An example of a *.target.counts.gc-corrected.gz file is shown below.

#GC CORRECTED FILE
##DRAGENVersion=<VERSION_INFO>
##DRAGENCommandLine=<CommandLineOptions>
...
contig  start   stop    name    <SampleName> improper_pairs
chr1    818022  819840  target-wgs-chr1-818022:819840   1071.353133     6
chr1    819840  821337  target-wgs-chr1-819840:821337   1051.014997     19
chr1    821337  822485  target-wgs-chr1-821337:822485   1098.6502       10
chr1    822485  824431  target-wgs-chr1-822485:824431   1117.28308      7
chr1    830446  832304  target-wgs-chr1-830446:832304   1102.211816     1
chr1    832304  834311  target-wgs-chr1-832304:834311   1004.822683     5
chr1    836677  838659  target-wgs-chr1-836677:838659   1015.973037     7
chr1    841054  843056  target-wgs-chr1-841054:843056   1014.921403     3

Combined counts file

The file *.combined.counts.txt.gz is a column-wise concatenation of individual *.target.counts.gz and *.target.counts.gc-corrected.gz used to form the panel of normals.

Normalization file

The file *.tn.tsv.gz contains the normalized signal of the case sample, per target interval, i.e., the log2-transformed copy ratio signal. A strong signal deviation from 0.0 indicates a potential CNV event. The format is equivalent to the *.target.counts.gz file:

Contig identifier
Start position
End position
Target interval name
Log2-transformed copy ratio in this interval
Count of improperly paired alignments in this interval

Header lines are also included that start with #. In some cases, the normalization counts could be patched internally with intervals from other processes, such as the SegDups extension. In such cases, patches are indicated (sorted in order of application) with header lines starting with #patch:

#patch 1 = <normalized_counts_patch_1_filename>
#patch 2 = <normalized_counts_patch_2_filename>
...

and the original (unpatched) *.tn.tsv.gz is renamed as *.tn.unpatched.tsv.gz. Note: this file is reported in output for inspection, but most use cases will use the (patched) *.tn.tsv.gz file downstream of normalization.

An example of a *.tn.tsv.gz file is shown below.

#title = Normalized coverage profile
#sex = UNDETERMINED
contig  start   stop    name    <SampleName> improper_pairs
chr1    818022  819840  target-wgs-chr1-818022:819840   -0.18479358083014644    6
chr1    819840  821337  target-wgs-chr1-819840:821337   -0.21244441644669046    19
chr1    821337  822485  target-wgs-chr1-821337:822485   -0.14849555308041734    10
chr1    822485  824431  target-wgs-chr1-822485:824431   -0.12423291178926463    7
chr1    830446  832304  target-wgs-chr1-830446:832304   -0.1438261733656668     1
chr1    832304  834311  target-wgs-chr1-832304:834311   -0.27728673450293895    5
chr1    836677  838659  target-wgs-chr1-836677:838659   -0.26136555699676262    7
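
Because the values in this file are log2-transformed, a value of 0.0 corresponds to a linear copy ratio of 1.0. The conversion is a one-line calculation, sketched here with a hypothetical helper name:

```python
def log2_to_linear(log2_ratio: float) -> float:
    """Convert a log2-transformed copy ratio to a linear copy ratio."""
    return 2.0 ** log2_ratio


# First interval in the example above: log2 ratio of about -0.185,
# which corresponds to a linear copy ratio of roughly 0.88.
linear = log2_to_linear(-0.18479358083014644)
```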

Segmentation file

File extension: *.seg, *.seg.called, *.seg.called.merged

Files containing the segments produced by the segmentation algorithm. The Segment_Mean value of a segment is the ratio of the mean of that segment to the whole-sample median, without log transformation (linear copy-ratio). A strong signal deviation from 1.0 indicates a potential for a CNV event.

The *.seg file has the following columns:

Sample name
Contig identifier
Start position
End position
Number of intervals in the segment
Linear copy-ratio of the segment

An example of a *.seg file is shown below.

Sample  Chromosome      Start   End     Num_Probes      Segment_Mean
<SampleName> chr1    818022  1117426 224     0.82500341336435279
<SampleName> chr1    1117426 4063702 2438    0.91726081432236528
<SampleName> chr1    4063702 4067591 3       0.38861386123247205
<SampleName> chr1    4067591 7705829 3302    0.93021316913709917
<SampleName> chr1    7705829 9357003 1405    0.98147825043799442
<SampleName> chr1    9357003 9377365 19      0.50269670724395654
<SampleName> chr1    9377365 12859821        2905    1.0684818476332989
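Because Segment_Mean is a linear copy ratio, it can be compared with the log2 values in *.tn.tsv.gz by taking log2 of it. The sketch below is illustrative only: parse_seg_line and describe_segment are hypothetical helper names, and the min_abs_log2 cutoff of 0.3 is an arbitrary screening value chosen for the example, not a DRAGEN parameter or the caller's decision rule.

```python
import math


def parse_seg_line(line):
    """Split one tab-separated data row of a *.seg file into a typed dict."""
    sample, contig, start, end, num_probes, segment_mean = (
        line.rstrip("\n").split("\t")
    )
    return {
        "sample": sample,
        "contig": contig,
        "start": int(start),
        "end": int(end),
        "num_probes": int(num_probes),
        "segment_mean": float(segment_mean),
    }


def describe_segment(seg, min_abs_log2=0.3):
    """Add length, log2 ratio, and a rough deviation flag to a segment.

    min_abs_log2 is a hypothetical cutoff used only for this illustration.
    """
    mean = seg["segment_mean"]
    # A linear ratio of 0 has no finite log2; report it as -inf.
    log2_ratio = math.log2(mean) if mean > 0 else float("-inf")
    return {
        **seg,
        "length": seg["end"] - seg["start"],
        "log2_ratio": log2_ratio,
        "deviates": abs(log2_ratio) >= min_abs_log2,
    }


# Three rows from the example above, with a placeholder sample name.
lines = [
    "SampleA\tchr1\t818022\t1117426\t224\t0.82500341336435279",
    "SampleA\tchr1\t4063702\t4067591\t3\t0.38861386123247205",
    "SampleA\tchr1\t9377365\t12859821\t2905\t1.0684818476332989",
]
for seg in (describe_segment(parse_seg_line(l)) for l in lines):
    print(seg["contig"], seg["start"], seg["length"],
          round(seg["log2_ratio"], 3), seg["deviates"])
```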

Germline-Specific (Depth-Only) Segmentation Output Files

The *.seg.called file is identical to the *.seg file, with an additional column indicating the initial call for whether the segment is a duplication (+) or a deletion (-).

The *.seg.called.merged file is identical to the *.seg.called file, but with segments potentially merged when they meet internal merging criteria. In addition to the columns described above, this file also has the following columns:

QUAL
FILTER
Copy number assignment
Ploidy
Improper_Pairs count

B-allele segmentation file

In addition to segmentation of target counts, some workflows perform segmentation of B-allele loci. The output file has the suffix *.baf.seg and the same format as the *.seg file, with two modifications. First, the Segment_Mean value is the mean, over B-allele loci, of the smaller observed allele fraction. Second, there is an additional column:

BAF_SLM_STATE: Integer between 0 and 10, indicating bins of minor-allele fraction (low to high), or . when the BAF data are too variable to estimate a minor-allele fraction

An example of a B-allele segmentation output file is shown below:

Sample  Chromosome      Start   End     Num_Probes      Segment_Mean    BAF_SLM_STATE
<SampleName> chr1    820348  1104646 194     0.29301737166888697     6
<SampleName> chr1    1105091 1533754 444     0.26185904799069076     5
<SampleName> chr1    1533810 1534166 9       0.41958837071702065     8
<SampleName> chr1    1534217 9356793 6689    0.26034515815016335     5
<SampleName> chr1    9358304 9376529 27      0.46450553586280602     10
<SampleName> chr1    9378480 12859495        1651    0.24172965924359388     5

Model identification file

In somatic callers the file *.cnv.purity.coverage.models.tsv describes the different tested models and their log-likelihood. It has columns:

Model purity (Cellularity)
Model diploid coverage
Model log-likelihood
Model ploidy
Failed constraints

An example is shown below:

#Purity Coverage        logL    ApproxPloidy    FailedConstraints
0.99    973     -19887608.9043  0.813   VAF_PEAKS
0.99    974     -19887600.1993  0.812   VAF_PEAKS
0.99    975     -19887591.592   0.811   VAF_PEAKS
1       115     -18380016.1561  6.98    VAF_PEAKS
1       116     -18384459.3513  6.91    VAF_PEAKS
1       117     -18390436.957   6.86    VAF_PEAKS

In the germline WGS caller, the file *.cnv.coverage.models.tsv serves the same purpose. However, because germline analysis has no concept of tumor purity, the first column is set to the default value of 1.
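To inspect the tested models, the table can be loaded and ranked by log-likelihood. This is a minimal sketch under stated assumptions: read_models and best_model are hypothetical names, the file is assumed to be tab-separated with the header shown in the example above, and an absent FailedConstraints value is assumed to mean that no constraint failed. Ranking by logL alone is for inspection only and does not reproduce DRAGEN's own model selection.

```python
import csv
import io


def read_models(handle):
    """Parse a purity/coverage models table into a list of typed dicts."""
    reader = csv.reader(handle, delimiter="\t")
    # The header line starts with '#', e.g. '#Purity'.
    header = [name.lstrip("#") for name in next(reader)]
    models = []
    for fields in reader:
        if not fields:
            continue
        row = dict(zip(header, fields))
        models.append({
            "purity": float(row["Purity"]),
            "coverage": float(row["Coverage"]),
            "logL": float(row["logL"]),
            "ploidy": float(row["ApproxPloidy"]),
            # Assumption: a missing value means no failed constraint.
            "failed": row.get("FailedConstraints", ""),
        })
    return models


def best_model(models):
    """Return the model with the highest log-likelihood."""
    return max(models, key=lambda m: m["logL"])


# Four rows taken from the example above.
example = "\n".join([
    "#Purity\tCoverage\tlogL\tApproxPloidy\tFailedConstraints",
    "0.99\t973\t-19887608.9043\t0.813\tVAF_PEAKS",
    "0.99\t975\t-19887591.592\t0.811\tVAF_PEAKS",
    "1\t115\t-18380016.1561\t6.98\tVAF_PEAKS",
    "1\t116\t-18384459.3513\t6.91\tVAF_PEAKS",
])
models = read_models(io.StringIO(example))
top = best_model(models)
print(top["purity"], top["coverage"], top["logL"])
```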

Visualization files

To generate additional equivalent BigWig and GFF files, set the --cnv-enable-tracks option to true. These files can be loaded into IGV along with other available tracks, such as RefSeq genes. Using these tracks alongside publicly available tracks makes calls easier to interpret. When DRAGEN CNV generates tracks, DRAGEN also autogenerates an IGV session XML file. The *.cnv.igv_session.xml file can be loaded directly into IGV for analysis.

The following IGV tracks are automatically populated in the output IGV session file:

*.target.counts.bw --- BigWig representation of the target counts bins. Setting the track view in IGV to barchart or points is recommended. Values are GC-corrected if GC correction was performed.
*.improper_pairs.bw --- BigWig representation of the improper pairs counts. Setting the track view in IGV to barchart is recommended.
*.tn.bw --- BigWig representation of the tangent normalized signal. Setting the track view in IGV to points is recommended.
*.seg.bw --- BigWig representation of the segments. Setting the track view in IGV to points is recommended.
*.baf.seg.bw --- BigWig representation of the BAF segments (if available). Setting the track view in IGV to points is recommended.
*.baf.bedgraph.gz --- BED graph representation of B-allele frequency (if available). Setting the track view in IGV to points is recommended.
*.cnv.gff3 --- GFF3 representation of the CNV events. DEL events show as blue and DUP events show as red. Filtered events show as light gray. If REF events are enabled, they show as green. When the caller can call AOH/LOH events, they show as magenta. An example of a DRAGEN CNV GFF3 file is shown below (different CNV workflows might output different attributes in the ninth column):

##gff-version 3
chr1    DRAGEN  LOSS    12779193        12859821        30      .       .       Alt=DEL;LinearCopyRatio=0.576;CopyNumber=1;Genotype=0/1;Qual=30;Filter=PASS;Start=12779192;Stop=12859821;Length=80629;BinCount=24;ImproperPairsCount=16,7;color=#0000FF;
chr1    DRAGEN  REF     13106280        13122338        19      .       .       Alt=REF;LinearCopyRatio=1.05981;CopyNumber=2;Genotype=./.;Qual=19;Filter=PASS;Start=13106279;Stop=13122338;Length=16059;BinCount=8;ImproperPairsCount=3,1;color=#00FF00;
chr1    DRAGEN  GAIN    13225213        13247040        66      .       .       Alt=DUP;LinearCopyRatio=2.016;CopyNumber=4;Genotype=./1;Qual=66;Filter=PASS;Start=13225212;Stop=13247040;Length=21828;BinCount=9;ImproperPairsCount=7,5;color=#FF0000;

For somatic WGS analyses, the following additional files are included in the IGV session xml:

*.tumor.baf.bedgraph.gz --- Bedgraph representation of the B-allele frequencies. Setting the track view in IGV to points and windowing function to none is recommended.

IGV Session

File extension: *.igv_session.xml

The IGV session XML file is prepopulated with track files generated by DRAGEN. The session file loads the reference genome that best matches the standard reference genomes in an IGV installation, based on the name of the --ref-dir specified on the command line. Standard UCSC human reference genomes are autodetected, but variations from the standard reference genomes might not be. To override the detected genome, set the genome attribute of the Session element to the reference genome you want to use before loading the file into IGV. The reference identifier used by IGV might differ from the actual name of the genome. The following is an example of an edited session file.

<?xml version="1.0" encoding="utf-8"?>
<Session genome="b37" hasGeneTrack="false" hasSequenceTrack="true" version="8">
    <Resources>
        <Resource path="example.cnv.gff3"/>
        <Resource path="example.cnv.excluded_intervals.bed.gz"/>
        <Resource path="example.target.counts.bw"/>
        <Resource path="example.improper.pairs.bw"/>
        <Resource path="example.tn.bw"/>
        <Resource path="example.seg.bw"/>
    </Resources>
    <Panel height="500" width="1200" name="DataPanel">
        ...
    </Panel>
</Session>

Note that depending on the IGV version installed, it may come prepackaged with different flavors of GRCh37. The reference naming conventions have changed so a user may have to edit the genome field in the XML file directly. For example, IGV has traditionally packaged a b37 reference genome, but may also include a 1kg_v37 or a 1kg_b37+decoy, which will appear on the IGV user interface as "1kg, b37" or "1kg, b37+decoy" respectively.

You can determine the correct encoding of a reference genome by going to File > Save Session... in IGV and then inspecting the generated igv_session.xml file.
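The genome attribute can also be changed programmatically instead of by hand. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The function name set_session_genome is hypothetical, and the identifier 1kg_v37 is used only because it is mentioned above as one possible IGV genome name; substitute the identifier that your IGV installation reports.

```python
import xml.etree.ElementTree as ET


def set_session_genome(session_xml: bytes, genome_id: str) -> bytes:
    """Return the IGV session XML with its genome attribute replaced."""
    root = ET.fromstring(session_xml)
    if root.tag != "Session":
        raise ValueError("not an IGV session document")
    root.set("genome", genome_id)
    # Serialize back to bytes, keeping an XML declaration at the top.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


# Shortened version of the example session file above.
session = b"""<?xml version="1.0" encoding="utf-8"?>
<Session genome="b37" hasGeneTrack="false" hasSequenceTrack="true" version="8">
    <Resources>
        <Resource path="example.cnv.gff3"/>
        <Resource path="example.tn.bw"/>
    </Resources>
</Session>"""

updated = set_session_genome(session, "1kg_v37")
print(updated.decode("utf-8"))
```

Only the genome attribute is touched, so the Resource and Panel elements written by DRAGEN are carried over unchanged.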

When the Cytogenetics Modality is enabled, DRAGEN CNV produces an additional IGV session XML file, *.cyto.igv_session.xml. See Cytogenetics Modality for a description of the different tracks in this file.

Creating CNV coverage and BAF plots with third-party tools

DRAGEN CNV outputs can be read with third-party libraries in commonly used languages such as Python or R. The typically used files are:

*.target.counts.gz or *.target.counts.gc-corrected.gz, containing the number of alignments, or corrected alignments, per interval. Used to display the coverage profile across all intervals.
*.tn.tsv.gz, containing the log2-normalized copy ratio per interval.
*.baf.bedgraph.gz, if BAF is available, containing the BAF for each considered site. Used to display the BAF profile across all sites.

All of these files use a BED-like format, so they can be loaded like any other tab-separated file.

Using R, a good starting point is the karyoploteR package. The main workflow involves reading the *.target.counts.gz file as an R data frame, converting it to a GRanges object, and then plotting the target intervals as points with the karyoploteR package. The same workflow can be used to plot the GC-corrected counts, the log2-normalized copy ratios, and the BAF profiles.

Using Python, the workflow is similar to the R workflow, but uses Python libraries such as pandas, to load DRAGEN output files into data frames, and matplotlib, to plot coverage and BAF profiles across the genome.
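As a rough illustration of the Python workflow, the sketch below lays intervals out on a single genome-wide x-axis and plots one value per interval with matplotlib. It is one possible approach, not a DRAGEN utility: genome_offsets, genome_x, and plot_points are hypothetical names, the rows are plain dictionaries (a pandas data frame loaded with pandas.read_csv could be converted to the same shape), and the chr2 row is invented so that the demo has a second contig.

```python
def genome_offsets(rows):
    """Map each contig to the x-offset at which its intervals start.

    Contigs are laid out in order of first appearance. Each contig's width
    is the largest stop coordinate observed for it.
    """
    extent = {}
    for row in rows:
        extent[row["contig"]] = max(extent.get(row["contig"], 0), row["stop"])
    offsets, running = {}, 0
    for contig, width in extent.items():
        offsets[contig] = running
        running += width
    return offsets


def genome_x(rows):
    """Return the genome-wide x-coordinate of each interval midpoint."""
    offsets = genome_offsets(rows)
    return [offsets[r["contig"]] + (r["start"] + r["stop"]) / 2 for r in rows]


def plot_points(rows, value_key, out_png):
    """Scatter rows[value_key] along the genome and save a PNG.

    matplotlib is imported here so the helpers above work without it.
    """
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(12, 3))
    ax.scatter(genome_x(rows), [r[value_key] for r in rows], s=2)
    ax.set_xlabel("genomic position (concatenated contigs)")
    ax.set_ylabel(value_key)
    fig.savefig(out_png, dpi=150)
    plt.close(fig)


rows = [
    {"contig": "chr1", "start": 818022, "stop": 819840, "log2_ratio": -0.18},
    {"contig": "chr1", "start": 819840, "stop": 821337, "log2_ratio": -0.21},
    # Hypothetical row, added only to show the offset of a second contig.
    {"contig": "chr2", "start": 10000, "stop": 12000, "log2_ratio": 0.05},
]
print(genome_offsets(rows))
print(genome_x(rows))
# plot_points(rows, "log2_ratio", "coverage.png")  # requires matplotlib
```

The same helpers apply to target counts and BAF values: only the value_key changes.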

A similar workflow can be used to plot copy number calls (and minor copy number calls, if available) by using the *.cnv.gff3 output file. Some examples of DRAGEN output GFF3 are shown below:

Germline WGS

chr1    DRAGEN  LOSS    12779193        12859821        30      .       .       Alt=DEL;LinearCopyRatio=0.576;CopyNumber=1;Genotype=0/1;Qual=30;Filter=PASS;Start=12779192;Stop=12859821;Length=80629;BinCount=24;ImproperPairsCount=16,7;color=#0000FF;
chr1    DRAGEN  REF     13106280        13122338        19      .       .       Alt=REF;LinearCopyRatio=1.05981;CopyNumber=2;Genotype=./.;Qual=19;Filter=PASS;Start=13106279;Stop=13122338;Length=16059;BinCount=8;ImproperPairsCount=3,1;color=#00FF00;
chr1    DRAGEN  GAIN    13225213        13247040        66      .       .       Alt=DUP;LinearCopyRatio=2.016;CopyNumber=4;Genotype=./1;Qual=66;Filter=PASS;Start=13225212;Stop=13247040;Length=21828;BinCount=9;ImproperPairsCount=7,5;color=#FF0000;

Somatic WGS

chr1    DRAGEN  GAIN    16605768        16949283        237     .       .       Start=16605769;Stop=16949283;Length=343515;Alt=<DUP>;Qual=237;Filter=PASS;Genotype=1/1;CopyNumber=4;MinorCopyNumber=2;CopyNumberQual=1;MinorCopyNumberQual=1;CopyNumberFloat=4.371887;MinorCopyNumberFloat=2.000000;BiasCorrectedReadCount=1182.6;MinorAlleleFrequency=0.5;BinCount=74;ImproperPairsCount=15,17;NumAllelicSites=223;color=#FF0000;
chr1    DRAGEN  CNLOH   16949283        23272950        1000    .       .       Start=16949284;Stop=23272950;Length=6323667;Alt=<LOH>;Qual=1000;Filter=PASS;Genotype=1/1;CopyNumber=2;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=2.090573;MinorCopyNumberFloat=0.000000;BiasCorrectedReadCount=565.5;MinorAlleleFrequency=0;BinCount=5572;ImproperPairsCount=17,84;NumAllelicSites=2517;color=#FF00FF;
chr1    DRAGEN  LOSS    23272950        25394644        1000    .       .       Start=23272951;Stop=25394644;Length=2121694;Alt=<DEL>;Qual=1000;Filter=PASS;Genotype=0/1;CopyNumber=1;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=1.069501;MinorCopyNumberFloat=0.000000;BiasCorrectedReadCount=289.3;MinorAlleleFrequency=0;BinCount=1718;ImproperPairsCount=84,5;NumAllelicSites=872;color=#0000FF;

From the output GFF3, the typical steps are to parse each segment's coordinates and its CopyNumber annotation (or any other annotation you want to plot), and then to plot them using the libraries listed previously for coverage and BAF profiles (or any other library and language of your choice).
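The parsing step can be sketched as follows. This is an illustration under stated assumptions rather than a reference parser: parse_gff3_line and copy_number_track are hypothetical names, and the code assumes tab-separated GFF3 lines whose ninth column is a semicolon-separated list of key=value pairs, as in the examples above.

```python
def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict with an 'attributes' dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    attributes = {}
    for item in fields[8].split(";"):
        if item:  # the trailing ';' in the examples yields an empty item
            key, _, value = item.partition("=")
            attributes[key] = value
    return {
        "contig": fields[0],
        "type": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "attributes": attributes,
    }


def copy_number_track(lines):
    """Collect (contig, start, end, CopyNumber, MinorCopyNumber) tuples."""
    track = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and '##gff-version 3' directives
        feature = parse_gff3_line(line)
        attrs = feature["attributes"]
        if "CopyNumber" not in attrs:
            continue
        minor = attrs.get("MinorCopyNumber")  # absent in germline output
        track.append((
            feature["contig"], feature["start"], feature["end"],
            int(attrs["CopyNumber"]),
            int(minor) if minor is not None else None,
        ))
    return track


germline = "\t".join([
    "chr1", "DRAGEN", "LOSS", "12779193", "12859821", "30", ".", ".",
    "Alt=DEL;LinearCopyRatio=0.576;CopyNumber=1;Genotype=0/1;Qual=30;"
    "Filter=PASS;Start=12779192;Stop=12859821;Length=80629;BinCount=24;"
    "ImproperPairsCount=16,7;color=#0000FF;",
])
somatic = "\t".join([
    "chr1", "DRAGEN", "GAIN", "16605768", "16949283", "237", ".", ".",
    "Start=16605769;Stop=16949283;Length=343515;Alt=<DUP>;Qual=237;"
    "Filter=PASS;Genotype=1/1;CopyNumber=4;MinorCopyNumber=2;"
    "CopyNumberQual=1;MinorCopyNumberQual=1;CopyNumberFloat=4.371887;"
    "MinorCopyNumberFloat=2.000000;BiasCorrectedReadCount=1182.6;"
    "MinorAlleleFrequency=0.5;BinCount=74;ImproperPairsCount=15,17;"
    "NumAllelicSites=223;color=#FF0000;",
])
track = copy_number_track(["##gff-version 3", germline, somatic])
for entry in track:
    print(entry)
```

Each tuple can then be drawn as a horizontal step from start to end at the CopyNumber value.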

Excluded Intervals File

To improve accuracy, the DRAGEN CNV Pipeline excludes genomic intervals if one or more of the target intervals failed at least one quality requirement. The excluded intervals are reported in the *.cnv.excluded_intervals.bed.gz file. The file is in BED format, identifies the regions of the genome that are not callable for CNV analysis, and gives the reason each interval was excluded in the fourth column. The following are the possible reasons for exclusion.

Exclusion Reason | Description | Related DRAGEN Option
NON_KMER_UNIQUE | Non-unique k-mer bases make up more than 50% of the interval. | Not applicable. This reason only applies to self-normalization mode.
EXCLUDE_BED | Interval overlap with the exclude BED is larger than the threshold. | --cnv-exclude-bed-min-overlap
PON_MAX_PERCENT_ZERO_SAMPLES | Number of PON samples with 0 coverage is larger than the threshold. | --cnv-max-percent-zero-samples
PON_TARGET_FACTOR_THRESHOLD | Median coverage of the interval is lower than the threshold of the overall median coverage. | --cnv-target-factor-threshold
PON_MISSING_INTERVAL | Target interval not found in the PON. | Not applicable

An example of a *.cnv.excluded_intervals.bed.gz file is shown below:

chr1    0       818022  NON_KMER_UNIQUE
chr1    824431  830446  NON_KMER_UNIQUE
chr1    834311  836677  NON_KMER_UNIQUE
chr1    838659  841054  NON_KMER_UNIQUE
chr1    850451  853257  NON_KMER_UNIQUE
chr1    855442  860261  NON_KMER_UNIQUE
chr1    866189  868833  NON_KMER_UNIQUE
chr1    881779  884116  NON_KMER_UNIQUE
chr1    1016667 1018959 NON_KMER_UNIQUE
chr1    1075880 1079718 NON_KMER_UNIQUE
chr1    1137942 1140725 NON_KMER_UNIQUE
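A quick way to see how much of the genome is uncallable, and why, is to total the excluded bases per reason. The sketch below is illustrative: excluded_bases_by_reason is a hypothetical name, and the code assumes tab-separated BED lines with the reason in the fourth column, as described above. BED intervals are end-exclusive, so each interval contributes stop minus start bases.

```python
from collections import defaultdict


def excluded_bases_by_reason(lines):
    """Sum interval lengths per exclusion reason from BED-formatted lines."""
    totals = defaultdict(int)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        contig, start, stop, reason = line.rstrip("\n").split("\t")[:4]
        totals[reason] += int(stop) - int(start)  # BED end is exclusive
    return dict(totals)


# First three rows of the example above.
bed_lines = [
    "chr1\t0\t818022\tNON_KMER_UNIQUE",
    "chr1\t824431\t830446\tNON_KMER_UNIQUE",
    "chr1\t834311\t836677\tNON_KMER_UNIQUE",
]
summary = excluded_bases_by_reason(bed_lines)
print(summary)
```

For a real file, pass an open handle instead of a list, for example the result of gzip.open(path, "rt").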

Excluded Samples File

To improve accuracy, the DRAGEN CNV Pipeline excludes panel of normals samples that failed at least one quality requirement. The excluded samples are reported in the *.cnv.excluded_samples.txt.gz file. The file is in TSV (tab-separated) format, identifies the excluded panel of normals samples, and gives the reason each sample was excluded. The following are the possible reasons for exclusion.

Exclusion Reason | Description | Related DRAGEN Option
PON_SAMPLE_NAME_EQUAL_TO_CASE | PON sample name is equal to the case sample name. | Not applicable
PON_SAMPLE_CORRELATION_EQUAL_TO_CASE | PON sample counts are equal to the case sample counts. | Not applicable
PON_MAX_PERCENT_NAN_SAMPLES | Number of NaN values in the sample is higher than the threshold. | --cnv-max-percent-nan-samples (default=50)
MAX_PERCENT_ZERO_TARGETS | Number of 0 target counts in the sample is higher than the threshold. | --cnv-max-percent-zero-targets (default=5)
EXTREME_PERCENTILE:UPPER | Median coverage of the sample is higher than the threshold. | --cnv-extreme-percentile (default=2.5)
EXTREME_PERCENTILE:LOWER | Median coverage of the sample is lower than the threshold. | --cnv-extreme-percentile (default=2.5)

An example of a *.cnv.excluded_samples.txt.gz file is shown below:

#name        reason                                 value        threshold
Sample1      MAX_PERCENT_ZERO_TARGETS               4776         418
Sample2      EXTREME_PERCENTILE:LOWER               0.000812534  0.20065
Sample3      EXTREME_PERCENTILE:UPPER               1.0003       1.00025
Sample4      PON_SAMPLE_NAME_EQUAL_TO_CASE          NA           NA
Sample5      PON_SAMPLE_CORRELATION_EQUAL_TO_CASE   NA           NA

The excluded samples output file may not exist if there are no excluded samples.
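Downstream scripts therefore need to treat a missing file as "no excluded samples" rather than as an error. The following is a minimal sketch: read_excluded_samples is a hypothetical name, the columns are assumed to be separated by tabs or runs of spaces, and sample names are assumed to contain no whitespace. A missing file can also mean a wrong path, so check the output directory if the result is unexpectedly empty.

```python
import gzip
import os
import tempfile


def read_excluded_samples(path):
    """Return exclusion records from a *.cnv.excluded_samples.txt.gz file.

    Returns an empty list when the file is absent, because the file may
    not be written when no panel of normals sample was excluded.
    """
    if not os.path.exists(path):
        return []
    records = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and the '#name ...' header
            name, reason, value, threshold = line.split()
            records.append({
                "name": name,
                "reason": reason,
                # 'NA' marks reasons that have no numeric value or threshold.
                "value": None if value == "NA" else float(value),
                "threshold": None if threshold == "NA" else float(threshold),
            })
    return records


# Round-trip demo on a temporary file built from two example rows above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.cnv.excluded_samples.txt.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write("#name\treason\tvalue\tthreshold\n")
        out.write("Sample1\tMAX_PERCENT_ZERO_TARGETS\t4776\t418\n")
        out.write("Sample4\tPON_SAMPLE_NAME_EQUAL_TO_CASE\tNA\tNA\n")
    records = read_excluded_samples(demo_path)
    missing = read_excluded_samples(os.path.join(tmp, "absent.txt.gz"))

print(records)
print(missing)
```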

Panel of Normals Files

PON Metrics File

The DRAGEN CNV Pipeline generates the PON Metrics File (.cnv.pon_metrics.tsv.gz) if a Panel of Normals is provided and --cnv-generate-pon-metric-file is set to true. If the PON contains fewer than two samples, an empty file is generated.

The PON Metrics File includes basic statistics of the coverage profile for each interval. To remove sample coverage bias, DRAGEN applies sample median normalization, and then computes the following metrics:

Column contents | Description
contig | chromosome name
start | genomic locus of interval start
stop | genomic locus of interval stop
name | interval name
mean | average coverage depth
std | standard deviation
normalizedStd | normalized standard deviation (std/mean)
min | minimum
25% | 25th percentile
50% | median
75% | 75th percentile
max | maximum
intervalSize | interval size (stop-start)
gcContents | percent GC

Example:

contig  start   stop    name    mean    std     normalizedStd min     25%     50%     75%     max     intervalSize    gcContents
1       12098   12178   target-wes-1-12098:12178/1      3.6259044560802365      0.46661435469856077      0.1286890927079175     2.7961783439490446      3.2573018790849675      3.7105263157894739      4.0162683823529415      4.3298969072164946      80      0.49382716049382713
1       12178   12258   target-wes-1-12178:12258/2      5.0685579775753595      0.70638315915955963      0.13936570564740217     3.9044585987261144      4.5225944682508761      5.067708333333333       5.5778115844038769      6.3277777777777775      80      0.46913580246913578
1       12553   12637   target-wes-1-12553:12637/1      4.6990858287992054      0.62537786269786677      0.13308500535681309     3.7417218543046356      4.0305632538350444      5.0382165605095546      5.2151580459770113      5.5773195876288657      84      0.6705882352941176
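The derived columns can be checked against their definitions: normalizedStd is std/mean, and intervalSize is stop-start. The sketch below performs that arithmetic on the first example row. The helper names parse_pon_metrics and check_row are hypothetical, and the tolerance is an arbitrary choice for the illustration.

```python
import math


def parse_pon_metrics(lines):
    """Turn tab-separated PON metrics lines into dicts keyed by column name."""
    header, rows = None, []
    for line in lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if header is None:
            header = fields  # first line names the columns
            continue
        rows.append(dict(zip(header, fields)))
    return rows


def check_row(row, tol=1e-6):
    """Check the derived columns of one row against their definitions."""
    mean, std = float(row["mean"]), float(row["std"])
    quartiles = [float(row[k]) for k in ("min", "25%", "50%", "75%", "max")]
    return {
        # normalizedStd is defined as std/mean
        "normalized_std_ok": mean != 0 and math.isclose(
            float(row["normalizedStd"]), std / mean, abs_tol=tol),
        # intervalSize is defined as stop-start
        "interval_size_ok":
            int(row["intervalSize"]) == int(row["stop"]) - int(row["start"]),
        "quartiles_ordered": quartiles == sorted(quartiles),
    }


header_line = "\t".join([
    "contig", "start", "stop", "name", "mean", "std", "normalizedStd",
    "min", "25%", "50%", "75%", "max", "intervalSize", "gcContents",
])
first_row = "\t".join([
    "1", "12098", "12178", "target-wes-1-12098:12178/1",
    "3.6259044560802365", "0.46661435469856077", "0.1286890927079175",
    "2.7961783439490446", "3.2573018790849675", "3.7105263157894739",
    "4.0162683823529415", "4.3298969072164946", "80", "0.49382716049382713",
])
metrics = parse_pon_metrics([header_line, first_row])
print(check_row(metrics[0]))
```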

PON Correlation File

The DRAGEN CNV Pipeline generates the PON Correlation File (.cnv.pon_correlation.txt.gz) if a Panel of Normals is provided. The PON Correlation File reports the correlation between the case sample and each PON sample.

Example:

Correlation of case sample CASE_SAMPLE_NAME
  PON1: 0.9786
  PON2: 0.9868
  PON3: 0.9912
  ...
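The correlation listing can be turned into a dictionary to find the PON samples that are least similar to the case sample. This is a minimal sketch, assuming the "name: value" layout shown in the example. The function name read_pon_correlations is hypothetical, and lines that do not end in a number after the last colon, such as the title line, are skipped.

```python
def read_pon_correlations(lines):
    """Map each PON sample name to its correlation with the case sample."""
    correlations = {}
    for line in lines:
        # Split on the last ':' so sample names may themselves contain ':'.
        name, sep, value = line.strip().rpartition(":")
        if not sep:
            continue  # title line, '...', or blank line
        try:
            correlations[name.strip()] = float(value)
        except ValueError:
            continue  # text after ':' was not a number
    return correlations


example_lines = [
    "Correlation of case sample CASE_SAMPLE_NAME",
    "  PON1: 0.9786",
    "  PON2: 0.9868",
    "  PON3: 0.9912",
]
correlations = read_pon_correlations(example_lines)
lowest = min(correlations, key=correlations.get)
print(lowest, correlations[lowest])
```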

SegDups Extension Files

The SegDups extension provides intermediate and final outputs. All intervals follow the BED format (0-based, start inclusive, end exclusive) and are stored in gzip-compressed, tab-delimited text files.

The final output has extension .cnv.segdups.rescued_intervals.tsv.gz, and contains the rescued target intervals which can then be injected before segmentation. It has columns:

Chromosome name
Start - 0-based inclusive
Stop - 0-based exclusive
Target interval name prefixed with "target-wgs-"
Sample Counts (in header, identifier taken from RGSM) - log2-scale normalized counts for each interval
Improper pairs - Kept for compatibility with CNV workflow, set to 0 for rescued intervals
Target region ID - ID of the target region (aka pair of rescued target intervals)

Intermediate files

The joint normalized coverage profile (log2-scale) for each region is written to the file .cnv.segdups.joint_coverage.tsv.gz with columns:

Target region ID
Joint normalized coverage (log2-scale) of the two intervals in the target region
Copy Number Float - estimate of joint copy number for the target region (e.g., CNF ~ 4)

The data for the differentiating sites is written to the file .cnv.segdups.site_ratios.tsv.gz with columns:

Differentiating site name
Target (gene A) counts at site
Non-target (gene B) counts at site
Target ratio: gene A counts over total (i.e., gene A + gene B) counts at site
