Reference
This reference provides detailed documentation of the DRAGEN CNV pipeline, covering the topics below.
Input
The DRAGEN CNV pipeline supports multiple input formats. To run the DRAGEN CNV pipeline directly with FASTQ input without generating a BAM or CRAM file, see Streaming Alignments for instructions on streaming alignment records directly from the DRAGEN map/align stage.
DRAGEN CNV also supports running from an already mapped and aligned BAM or CRAM file. If you have data that has not yet been mapped and aligned, see Generate an Alignment File.
Reference Hashtable
For the DRAGEN CNV pipeline, the hashtable must be generated with the --ht-build-cnv-hashtable option set to true, in addition to any other options required by other pipelines. When --ht-build-cnv-hashtable is true, DRAGEN generates an additional k-mer uniqueness map that the CNV algorithm uses to counteract mappability biases. You only need to generate the k-mer uniqueness map file one time per reference hashtable. The generation takes about 1.5 hours per whole human genome.
The reference hashtable is a pregenerated binary representation of the reference genome. For information on generating a hashtable, see Prepare a Reference Genome.
The following example command generates a hashtable.
dragen \
--build-hash-table true \
--ht-reference <FASTA> \
--ht-build-cnv-hashtable true \
--output-directory <OUTPUT> \
<OTHER HASHTABLE OPTIONS>
Generate an Alignment File
The following command-line examples show how to run the DRAGEN map/align pipeline depending on your input type. The map/align pipeline generates an alignment file in the form of a BAM or CRAM file that can then be used in the DRAGEN CNV Pipeline.
You need to generate alignment files for all samples that have not already been mapped and aligned, including any samples to be used as references for normalization. Each sample must have a unique sample identifier. Use the --RGSM option to specify the identifier. For BAM and CRAM input files, the sample identifier is taken from the file, so the --RGSM option is not required.
The following example command maps and aligns a FASTQ file:
The following example command maps and aligns an existing BAM file:
The following example command maps and aligns an existing CRAM file:
Streaming Alignments
DRAGEN can map and align FASTQ samples, and then directly stream them to downstream callers, such as the CNV Caller and the Haplotype Variant Caller. You can use this process to skip generation of a BAM or CRAM file, which bypasses the need to store additional files.
To stream alignments directly to the DRAGEN CNV pipeline, run the FASTQ sample through a regular DRAGEN map/align workflow, and then provide additional arguments to enable CNV. The following example command line maps and aligns a FASTQ file, and then sends the file to the Germline CNV WGS pipeline.
For information on running CNV concurrently with the Haplotype Variant Caller, see Concurrent CNV and Small Variant Calling.
Concurrent CNV and Small Variant Calling
DRAGEN can perform mapping and aligning of FASTQ samples, and then directly stream the data to downstream callers. If the input is a FASTQ sample, a single sample can run through both the CNV and the small VC. This triggers self-normalization by default.
Run the FASTQ sample through a regular DRAGEN map/align workflow, and then provide additional arguments to enable the CNV, VC, or both. The options that apply to CNV in the standalone workflows are also applicable here.
The following examples show different commands.
Map/Align FASTQ With CNV
Map/Align FASTQ With VC
Map/Align FASTQ With CNV and VC
BAM Input to CNV and VC
Preprocessing
Target Counts
The target counts stage is the first processing stage for the DRAGEN CNV pipeline. This stage bins the alignments into intervals. The primary analysis format for CNV processing is the target counts file, which contains the feature signals that are extracted from the alignments to be used in downstream processing. The binning strategy, interval sizes, and their boundaries are controlled by the target counts generation options, and the normalization technique used.
When working with whole genome sequence data, the intervals are autogenerated from the reference hashtable. Only the primary contigs from the reference hashtable are considered for binning. You can specify additional contigs to bypass with the --cnv-skip-contig-list option.
With whole exome sequence data, DRAGEN uses the target BED file supplied with the --cnv-target-bed option to determine the intervals for analysis. The target BED file should contain intervals that match those in the panel of normals file. If the intervals in the target BED file and the panel of normals file do not match, DRAGEN will use the target intervals from the panel of normals file.
The target counts stage generates a *.target.counts.gz file. You can use the file later in place of any BAM or CRAM by specifying the file with the --cnv-input or --cnv-tumor-input option for the normalization stage. The *.target.counts.gz file is an intermediate file for the DRAGEN CNV pipeline and should not be modified.
Further details are available in Output Files - Target Counts File.
Whole Genome
If the samples are whole genome, then the effective target intervals width is specified with the --cnv-interval-width option. The higher the coverage of a sample, the higher the resolution that can be detected. This option is important when running with a panel of normals because all samples must have matching intervals. For self-normalization, the actual width of a given target interval might be larger than the specified value.
The default value for WGS is 1000 bp with a sample coverage of ≥ 30x.
Coverage per sample (x) | Recommended interval width (bp)
5 | 10000
10 | 5000
>= 30 | 1000
Using a --cnv-interval-width of less than 250 bp for WGS analysis can drastically increase runtime.
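The coverage guidance above can be written as a small lookup. The following Python sketch is illustrative only: the function name is hypothetical, and the rule of picking the tier with the highest coverage that the sample reaches is an assumption for coverages between the documented values, not documented DRAGEN behavior.

```python
# Documented tiers: (minimum coverage in x, recommended interval width in bp).
RECOMMENDED_WIDTHS = [(30, 1000), (10, 5000), (5, 10000)]


def recommended_interval_width(coverage: float) -> int:
    """Return the width for the highest coverage tier the sample reaches.

    Assumption: intermediate coverages fall back to the next lower tier.
    """
    for min_coverage, width in RECOMMENDED_WIDTHS:
        if coverage >= min_coverage:
            return width
    raise ValueError("coverage is below the lowest documented tier (5x)")
```

The returned value would be passed to --cnv-interval-width; all samples in a panel of normals must use the same value.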
The intervals are autogenerated for every primary contig in the reference. Only references that use the UCSC or GRC naming convention are supported, for example chr1, chr2, chr3, ..., chrX, chrY or 1, 2, 3, ..., X, Y. You can specify a list of contigs to skip by using the --cnv-skip-contig-list option. This option takes a comma-separated list of contig identifiers. The contig identifiers must match the reference hashtable that you are using. By default, only the mitochondrial chromosomes are skipped. Non-primary contigs are never processed.
For example, to skip chromosome M, X, and Y, use the following option:
Whole Exome
If the samples are whole exome samples, supply a target BED file with the --cnv-target-bed <TARGET_BED> option. The intervals in the target BED file indicate regions where alignments are expected based on the target capture kit. The BED file intervals are further split into intervals of smaller size, depending on the value of cnv-interval-width.
To use a standard BED file, make sure that there is no header present in the file. In this case, all columns after the third column are ignored, similar to the operation of DRAGEN Variant Caller.
Target Counts Options
The following options control the generation of target counts.
--cnv-counts-method --- Specifies the counting method for an alignment to be counted in a target bin. Values are midpoint, start, or overlap. The default value is overlap when using the panel of normals approach, which means if an alignment overlaps any part of the target bin, the alignment is counted for that bin. In the self-normalization mode, the default counting method is start.
--cnv-min-mapq --- Specifies the minimum MAPQ for an alignment to be counted during target counts generation. The default value is 3 for self-normalization and 20 otherwise. When generating counts for panel of normals, all MAPQ0 alignments are counted.
--cnv-target-bed --- Specifies a properly formatted BED file that indicates the target intervals to sample coverage over. For use in WES analysis.
--cnv-interval-width --- Specifies the width of the sampling interval for CNV processing. This option controls the effective window size. The default is 1000 for WGS analysis and 500 for WES analysis.
--cnv-skip-contig-list --- Specifies a comma-separated list of contig identifiers to skip when generating intervals for WGS analysis. The default contigs that are skipped, if not specified, are chrM, MT, m, chrm.
--cnv-filter-duplicate-alignments --- Filter duplicate marked alignments during target counts if option is set to true. The default setting is true unless map/align is enabled and duplicate marking is disabled.
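To make the three counting methods concrete, the following Python sketch decides whether one alignment is counted in one target bin. It is an illustration under stated assumptions, not DRAGEN's implementation: the function name is hypothetical, coordinates are treated as half-open [start, end) intervals, and the midpoint is taken with integer division.

```python
def is_counted(aln_start: int, aln_end: int,
               bin_start: int, bin_end: int, method: str) -> bool:
    """Return True if the alignment is counted in the bin under `method`.

    All coordinates are half-open [start, end) (an assumption of this sketch).
    """
    if method == "start":
        # Count the alignment only in the bin containing its start position.
        return bin_start <= aln_start < bin_end
    if method == "midpoint":
        # Count the alignment only in the bin containing its midpoint.
        midpoint = (aln_start + aln_end) // 2
        return bin_start <= midpoint < bin_end
    if method == "overlap":
        # Count the alignment in every bin it overlaps by at least one base.
        return aln_start < bin_end and aln_end > bin_start
    raise ValueError(f"unknown counting method: {method}")
```

For an alignment spanning 950 to 1100 and bins [0, 1000) and [1000, 2000), start counts it only in the first bin, midpoint only in the second, and overlap in both.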
Target counts options are recorded in the header of each counts file to facilitate review and validation of the panel of normals. If the PON counts were generated with different counts options than the case sample, DRAGEN returns an option validation error.
Filter Duplicate Alignments
PCR duplicates are often considered noise in coverage depth information. The --cnv-filter-duplicate-alignments option controls whether duplicate-marked alignments are excluded when counting alignments. This relies on the alignments having the duplicate bit (0x400) set correctly in the SAM flag.
If --enable-map-align=false, then duplicate marking should be present in the input file (pre-aligned BAM/CRAM). If --enable-map-align=true, then --enable-duplicate-marking=true should be set.
Note that CNV will wait for duplicate marking from the Map/Aligner which may increase overall run time.
Input | Map/Align | Required options
FASTQ | TRUE | --enable-map-align=true, --enable-duplicate-marking=true
BAM | TRUE | --enable-map-align=true, --enable-duplicate-marking=true
BAM | FALSE | --enable-map-align=false
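The per-alignment filtering described in this section can be sketched as follows. This Python example is a minimal illustration, not DRAGEN code: the function name and argument names are hypothetical, and the default MAPQ threshold of 3 is the documented self-normalization default. The 0x400 bit is the standard SAM duplicate flag.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate


def keep_alignment(flag: int, mapq: int, min_mapq: int = 3,
                   filter_duplicates: bool = True) -> bool:
    """Return True if the alignment should contribute to target counts."""
    if mapq < min_mapq:
        return False  # below the minimum mapping quality
    if filter_duplicates and (flag & DUPLICATE_FLAG):
        return False  # duplicate-marked and duplicate filtering is on
    return True
```

The check is only meaningful if duplicates were actually marked upstream, either by the map/align stage or in the pre-aligned input file.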
Target Counts Dropout Regions
In the WGS case where a BED file is not specified for a given reference, the same intervals should be generated each time. The intervals created take into account the mappability of the reference genome using a k-mer uniqueness map created during hashtable generation.
Due to ambiguity that may arise from non-unique genomic loci, only regions corresponding to unique k-mers are considered. A position in the reference genome is marked as a unique k-mer if the k-mer starting at that position does not show up anywhere else in the reference genome (or non-unique, otherwise). Furthermore, if the k-mer contains any bases other than A, C, T or G, it is marked as non-unique.
For WGS samples without a --cnv-target-bed file, the target intervals are autogenerated from the precomputed k-mer uniqueness map for the input reference hashtable and the --cnv-interval-width option, which defaults to 1000 bp. The --cnv-interval-width option determines the minimum number of unique k-mer positions required in the interval. The interval length has an upper bound: if the interval grows to more than double the --cnv-interval-width value without reaching the required count of unique k-mer positions, the interval is discarded and the process starts again at the next genomic position. Discarded regions are called "dropout" regions and are reported with the exclusion reason NON_KMER_UNIQUE in the *.cnv.excluded_intervals.bed.gz file.
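The interval-generation rule above can be sketched in Python. This is a simplified illustration, not DRAGEN's implementation: the function name and the list-of-booleans uniqueness map are hypothetical, and the choices to cap a scan at exactly twice the width and to resume scanning immediately after a discarded span are assumptions.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one contig


def build_intervals(unique: List[bool],
                    width: int) -> Tuple[List[Interval], List[Interval]]:
    """Split one contig into kept intervals and dropout regions.

    unique[i] is True if the k-mer starting at position i is unique.
    An interval is kept once it contains `width` unique positions; it is
    dropped if it reaches 2 * width bases (or the contig end) first.
    """
    kept: List[Interval] = []
    dropped: List[Interval] = []
    start, n = 0, len(unique)
    while start < n:
        count, end = 0, start
        while end < n and count < width and (end - start) < 2 * width:
            count += unique[end]
            end += 1
        if count == width:
            kept.append((start, end))
        else:
            dropped.append((start, end))
        start = end  # assumption: resume right after the current span
    return kept, dropped
```

With width 2 and a map of four unique positions, ten non-unique positions, and four unique positions, the sketch keeps (0, 2), (2, 4), (12, 16), and (16, 18) and reports (4, 8) and (8, 12) as dropout regions. Note that the kept interval (12, 16) is wider than the requested width, which mirrors the statement that the actual width can exceed the specified value.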
A dropout region is a complex region that does not count alignments and results in an interval missing from the analysis. Dropout regions include centromeres, telomeres, and low complexity regions. If there is sufficient signal in the flanking regions, an event can still span these dropout regions, even if alignment counting does not occur in the regions. The event is handled by the segmentation stage.
Some of the excluded intervals can be rescued through the segmental duplication extension to the germline CNV workflow, as described in the following section.
Rescue of target counts in Segmental Duplications
The germline WGS CNV workflow can be extended to call copy number alterations in a curated subset of segmentally duplicated regions. Segmental duplications are large blocks of DNA ≥ 1kb, characterized by a high degree of sequence identity at nucleotide level (> 90%). This poses a challenge for traditional approaches, and such regions are usually excluded.
This extension complements the original germline WGS CNV workflow by using a tailored algorithm to compute the normalized coverage in such regions, which is then injected before the segmentation step and becomes part of the main CNV workflow in downstream steps. We recommend WGS data aligned to a supported human reference genome (currently only hg38) with at least 30x coverage. See below for additional requirements.
Supported duplications
The following pairs of genes defining Segmental Duplications are included:
CYP2A6 / CYP2A7
FCGR3A / FCGR3B
RHD / RHCE
STRC / STRCP1
ACSM2A / ACSM2B
ACTR3B / ACTR3C
AQP12A / AQP12B
ASAH2 / ASAH2B
CCDC74A / CCDC74B
CD177 / CD177p1
CD8B / CD8B2
CFH1 / CFHR1
CYP4A11 / CYP4A22
DHX40 / DHX40P1
EIF5AL1 / EIF5AP4
FCGR2A / FCGR2C
FFAR3 / GPR42
FOLH1 / FOLH1B
FRMPD2 / FRMPD2B
GPAT2 / GPAT2P1
GSTT2B / GSTT2
DDT / DDTL
HCAR2 / HCAR3
HSPA1A / HSPA1B
KRT81 / KRT86
LGALS7 / LGALS7B
MRPL45 / MRPL45P2
MSTO1 / MSTO2p
MUC20 / MUC20P1
MZT2A / MZT2B
OTOA / OTOAp1
PDPR / PDPR2P
PIEZ02 / ENST00000591853.1
ZP3 / POMZP3
PRAMEF7 / PRAMEF8
PROS1 / PROS2P
RMND5A / ANAPC1P2
ROCK1 / ROCK1p1
SERPINB3 / SERPINB4
SYT3 / ZNF473CR
TBC1D26 / TBC1D28
TOP3B / TOP3BP1
TUBA3D / TUBA3E
ZNF443 / ZNF799
Extension requirements
This extension is enabled by default in the germline WGS CNV workflow. However, it requires the following:
Normalization set to self-normalization (--cnv-enable-self-normalization=true).
GC bias correction enabled (--cnv-enable-gcbias-correction=true).
Counts method set to start (--cnv-counts-method=start).
Interval width not greater than 10 kb. However, we recommend using the --cnv-interval-width default (1 kb) for best performance.
A supported reference genome build as input (currently only hg38).
If necessary, the extension can be disabled through setting --cnv-enable-segdups-extension to false.
Algorithm

For each duplicated region, the extension collects all reads that fall on the two homologous intervals of the pair and computes the normalized joint coverage (output to *.cnv.segdups.joint_coverage.tsv.gz).
Using sites that differentiate the two homologous intervals, the extension computes the proportion of coverage to assign to the first and to the second interval (output to *.cnv.segdups.site_ratios.tsv.gz).
This proportion is used to redistribute the joint normalized coverage between the two homologous intervals.
The rescued intervals are output to the *.cnv.segdups.rescued_intervals.tsv.gz file for inspection and are automatically injected before the segmentation step.
During integration with the original intervals from the CNV caller, the rescued intervals take priority and replace all original intervals that they overlap.
See Output Files - Segdups Extension Files for a description of the extension output files.
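The steps of the extension can be sketched in Python. This is a conceptual illustration only and does not reproduce DRAGEN's tailored algorithm: all function names are hypothetical, summarizing the differentiating sites with a median of per-site ratios is an assumption, and intervals are simple (start, end) tuples on a single contig.

```python
from statistics import median
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def site_ratio(site_counts: List[Tuple[int, int]]) -> float:
    """Proportion of coverage assigned to the first interval.

    site_counts holds (reads matching interval 1, reads matching interval 2)
    at each differentiating site. Median aggregation is an assumption.
    """
    ratios = [a / (a + b) for a, b in site_counts if a + b > 0]
    if not ratios:
        raise ValueError("no informative differentiating sites")
    return median(ratios)


def redistribute(joint_coverage: float, ratio_first: float) -> Tuple[float, float]:
    """Split the joint normalized coverage between the two intervals."""
    return joint_coverage * ratio_first, joint_coverage * (1.0 - ratio_first)


def inject_rescued(original: List[Interval],
                   rescued: List[Interval]) -> List[Interval]:
    """Rescued intervals replace every original interval they overlap."""
    kept = [o for o in original
            if not any(o[0] < r[1] and r[0] < o[1] for r in rescued)]
    return sorted(kept + rescued)
```

For example, a joint coverage of 4.0 with a ratio of 0.75 is split into 3.0 for the first interval and 1.0 for the second.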
B-Allele Counts
In workflows supporting B-allele frequency (BAF), a source of heterozygous SNP sites is required to measure B-allele counts of the input sample. The following are the available modes, of which some are only available in somatic workflows.
cnv-population-b-allele-vcf
Specify a population SNP VCF. This option is available for both the germline and the somatic workflows. In somatic, it can be used when a matched normal sample is not available and analysis must be performed in tumor-only mode.
cnv-normal-b-allele-vcf
(Somatic-specific) Specify a matched normal SNV VCF. Use when a matched normal sample and the matched normal SNV VCF are available. To use this option, you must run the matched normal sample through the DRAGEN Germline workflow.
cnv-use-somatic-vc-baf
(Somatic-specific) Set to true to enable DRAGEN to identify germline variants during a tumor/matched-normal run, rather than requiring a separate run on the normal sample. Use if and only if tumor and matched normal input are available. Also enable the Somatic SNV Caller via enable-variant-caller to use this option.
To specify a population SNP VCF, use the --cnv-population-b-allele-vcf option. To obtain a population SNP VCF, process an appropriate catalog of population variation, such as dbSNP, the 1000 Genomes Project, or other large cohort discovery efforts. A suitable example file for this parameter is "1000G_phase1.snps.high_confidence.vcf.gz" from the GATK resource bundle. Only high-frequency SNPs should be included. For example, include SNPs with minor allele population frequency ≥ 10% to limit run time impact and reduce artifacts. Specify the ALT allele frequency by adding AF=<alt frequency> to the INFO section of each record. Additional INFO fields might be present, but DRAGEN only parses and uses the AF field. Sites specified with --cnv-population-b-allele-vcf can be either heterozygous or homozygous in the germline genome from which the tumor genome derives.
The following is an example valid population SNP record (note: it needs to be tab-delimited):
DRAGEN considers the following requirements when parsing records from the b-allele VCF:
Only simple SNV sites.
Records must be marked PASS in the FILTER field.
If there are records with the same CHROM and POS values in the VCF, then DRAGEN uses the first record that occurs.
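The parsing requirements above, together with the recommended minor allele frequency cutoff, can be sketched as a pre-check for a candidate population VCF. This Python example is illustrative and not DRAGEN's parser: the function name is hypothetical, treating the first record at a position as the only candidate even when it fails another check is an assumption, and the 10% cutoff is the recommendation from this section rather than an enforced rule.

```python
from typing import Iterable, List, Tuple

Site = Tuple[str, int, str, str, float]  # CHROM, POS, REF, ALT, AF


def select_b_allele_sites(vcf_lines: Iterable[str],
                          min_maf: float = 0.10) -> List[Site]:
    """Return the records a population b-allele VCF would contribute."""
    seen = set()
    sites: List[Site] = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt, _qual, flt, info = line.rstrip("\n").split("\t")[:8]
        if (chrom, pos) in seen:
            continue  # only the first record at a position is considered
        seen.add((chrom, pos))
        if len(ref) != 1 or len(alt) != 1 or ref not in "ACGT" or alt not in "ACGT":
            continue  # not a simple SNV (indel, multi-allelic, symbolic)
        if flt != "PASS":
            continue
        af_values = [f[3:] for f in info.split(";") if f.startswith("AF=")]
        if not af_values:
            continue  # AF is the only INFO field that is used
        af = float(af_values[0])
        if min(af, 1.0 - af) < min_maf:
            continue  # minor allele too rare to be informative
        sites.append((chrom, int(pos), ref, alt, af))
    return sites
```

Input lines must be tab-delimited, as in a standard VCF body.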
A suitable population B-allele VCF is provided for selected references at this page.
Somatic-specific options
To specify a matched normal sample SNV VCF, use the --cnv-normal-b-allele-vcf option. The VCF file should come from processing the matched normal sample through the DRAGEN germline small variant caller with filters applied. Typically, this file name has a *.hard-filtered.vcf.gz extension. All records marked as PASS that are determined to be heterozygous in the normal sample are used to measure the b-allele counts of the tumor sample. You can also use the equivalent gVCF file (*.hard-filtered.gvcf.gz), but the processing time is significantly longer because of the number of records, most of which are not heterozygous sites.
If a tumor sample and matched normal input are available, you can avoid having to separately process the matched normal with the DRAGEN germline pipeline by specifying --cnv-use-somatic-vc-baf true. If using this option, DRAGEN determines the germline heterozygous sites from the matched normal input and measures the b-allele counts of the tumor sample. The information is passed to the Somatic WGS CNV Caller to simplify the overall somatic workflow.
To enable --cnv-use-somatic-vc-baf, enter the following command line options.
--tumor-bam-input <TUMOR_BAM> --- Specify the tumor input.
--bam-input <NORMAL_BAM> --- Specify the matched normal input.
--enable-variant-caller true --- Enable the somatic SNV variant caller.
--cnv-use-somatic-vc-baf true --- Enable somatic VC BAF.
GC Bias Correction
GC bias describes the relationship between GC content and read coverage across a genome. Biases can arise from library prep, capture kits, sequencing system differences, and mapping, and they can make CNV events difficult to call. The DRAGEN GC bias correction module attempts to correct these biases.
The GC bias correction module immediately follows the target counts stage and operates on the *.target.counts.gz file. GC bias correction generates a GC bias corrected version of the file, which has a *.target.counts.gc-corrected.gz extension in the file name. The GC bias corrected versions are recommended for any downstream processing when working with WGS data. For WES, if there are enough target regions, then the GC bias corrected counts can also be used. See Output Files - Bias Correction file for further details on GC-corrected target counts files.
Typical whole-exome capture kits have over 200,000 targets spanning the regions of interest. If your BED file has fewer than 200,000 targets, or if the target regions are localized to a specific region in the genome (such that GC bias statistics might be skewed), then GC bias correction should be disabled.
The following options control the GC bias correction module.
--cnv-enable-gcbias-correction --- Enable or disable GC bias correction when generating target counts. The default is true.
--cnv-enable-gcbias-smoothing --- Enable or disable smoothing the GC bias correction across adjacent GC bins with an exponential kernel. The default is true.
--cnv-num-gc-bins --- Specifies the number of bins for GC bias correction. Each bin represents the GC content percentage. Allowed values are 10, 20, 25, 50, or 100. The default is 25.
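To illustrate what binning by GC content means, the following Python sketch applies a generic median-ratio correction per GC bin. This is a common textbook approach used here as a stand-in; it is not DRAGEN's correction algorithm, it omits the smoothing step, and the function and parameter names are hypothetical. Only the default of 25 bins comes from the options above.

```python
from collections import defaultdict
from statistics import median
from typing import List


def gc_correct(counts: List[float], gc_fractions: List[float],
               num_bins: int = 25) -> List[float]:
    """Rescale each target so every GC bin has the same median count."""
    def bin_index(gc: float) -> int:
        # Map a GC fraction in [0, 1] to a bin; 1.0 falls in the last bin.
        return min(int(gc * num_bins), num_bins - 1)

    by_bin = defaultdict(list)
    for count, gc in zip(counts, gc_fractions):
        by_bin[bin_index(gc)].append(count)

    global_median = median(counts)
    bin_median = {b: median(values) for b, values in by_bin.items()}

    corrected = []
    for count, gc in zip(counts, gc_fractions):
        m = bin_median[bin_index(gc)]
        corrected.append(count * global_median / m if m > 0 else 0.0)
    return corrected
```

With few targets, or targets concentrated in one region, some bins hold too few values for a stable median, which is one way to see why correction should be disabled for small or localized panels.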
Normalization
The DRAGEN CNV pipeline supports two normalization algorithms:
Self-Normalization --- Estimates the autosomal diploid level from the sample under analysis to determine the baseline level to normalize by. Sex chromosomes and PAR regions are handled based on the sample sex.
Panel of Normals --- A reference-based normalization algorithm that uses additional matched normal samples to determine a baseline level from which to call CNV events. Here, a matched normal sample is one that has undergone the same library prep and sequencing workflow as the case sample.
Which algorithm to use depends on the available data and the application. Use the following guidelines to select the mode of normalization.
Self-Normalization
Whole genome sequencing
Single sample analysis
Additional matched samples are not readily available
Simpler workflow via a single invocation
Only references with chr1, chr2, chr3, ..., chrX, chrY or 1, 2, 3, ..., X, Y naming conventions are supported.
Panel of Normals
Whole genome sequencing
Whole exome sequencing
Targeted panels, including somatic panels
Additional matched samples are available
Nonhuman samples
The table below shows supported normalization methods for CNV workflow:
WGS | Self / PoN | Self / PoN | Self / PoN | Self / PoN | No workflow
WES | PoN | PoN | PoN | PoN | PoN
"No workflow" indicates that no workflow exists for this configuration.
Self Normalization
The DRAGEN CNV pipeline provides the self-normalization mode that does not require a reference sample or a panel of normals. To enable this mode, set --cnv-enable-self-normalization to true. Self-normalization mode bypasses the need to run two stages and can save time. It uses the statistics within the case sample to determine the baseline from which to make a call.
Because self normalization uses the statistics within the case sample, this mode is not recommended for WES or targeted sequencing analysis due to the potential for insufficient data.
The self-normalization mode is the recommended approach for whole-genome sequencing single sample processing. The pipeline continues through to the segmentation and calling stage to produce the final called events.
If you are running from a FASTQ sample, then the default mode of operation is self-normalization.
When operating in self-normalization mode, the --cnv-interval-width option used during the target counts stage becomes the effective interval width based on the number of unique k-mer positions. You typically do not have to modify this option.
Self-normalization autogenerates the target intervals to use during the analysis based on the reference genome and is only compatible with standard human references or similar mammalian references (chr1, chr2, chr3, ..., chrX, chrY).
To attempt self-normalization mode on non-standard human references, set the override --cnv-bypass-contig-check=true. Under this setting, the CNV caller performs a naive median normalization across all of the contigs within the reference genome. This feature is for experimental and research use only; no claims are made for it and it has not been validated.
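The idea of a naive median normalization can be shown with a short Python sketch. It is a conceptual illustration, not DRAGEN's self-normalization: the function name and the dictionary layout are hypothetical, and the handling of sex chromosomes and PAR regions is left out.

```python
from statistics import median
from typing import Dict, List, Optional


def median_normalize(counts_by_contig: Dict[str, List[float]],
                     baseline_contigs: Optional[List[str]] = None
                     ) -> Dict[str, List[float]]:
    """Divide every target count by the median over the baseline contigs.

    With baseline_contigs=None, all contigs are pooled (the naive case).
    """
    if baseline_contigs is None:
        baseline_contigs = list(counts_by_contig)
    pooled = [c for name in baseline_contigs for c in counts_by_contig[name]]
    baseline = median(pooled)
    if baseline <= 0:
        raise ValueError("baseline median must be positive")
    return {name: [c / baseline for c in counts]
            for name, counts in counts_by_contig.items()}
```

After normalization, values near 1.0 sit at the baseline level and values near 0.5 indicate roughly half the baseline coverage, for example a single-copy region in an otherwise diploid sample.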
Panel of Normals
The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. Proper sample selection and preparation are critical for constructing an accurate and reliable CNV PON. High-quality germline samples—meeting stringent sequencing quality criteria such as a high percentage of bases over Q30, sufficient total read depth (yield), appropriate GC content, and minimal adapter contamination—must be used. Additionally, all samples should originate from the same sample type (e.g., FFPE, fresh-frozen) and be processed under identical experimental conditions, including the same library preparation kit, sequencing platform, and capture panel version. Even minor variations in hybridization efficiency or read depth distribution can introduce systematic artifacts, leading to inaccurate CNV calls.
Below are the key recommendations for preparing a high-quality PON:
Sample Selection: Normal samples should be sourced from individuals without known chromosomal abnormalities to establish a clean and representative reference baseline. Additionally, normal samples should not be drawn from a cohort that is likely to be enriched for particular CNVs, or enriched for individuals affected by a particular disease or syndrome with a genetic component. Normal samples should ideally be unrelated to each other and to the case samples to be processed. No more than ~6% of the samples in the PON should be related to any case sample. For example, a 50-sample PON containing a quad (mother/father/sibling/proband) may be used to analyze each sample in the quad provided there are no additional related samples in the PON.
Balanced sample sex: The normal sample set should include both male and female samples in similar numbers to ensure a well-represented reference baseline.
Exclude Low-Quality Samples: Samples with unusually uneven target coverage, low sequencing depth, or high technical noise should be removed to minimize variability and ensure consistency in the PON.
Standardized Library Preparation: All samples must be processed using the same library preparation protocol. Any deviations such as differences in hybridization efficiency, incubation time, or temperature can lead to inconsistent coverage patterns, increasing the likelihood of false positive CNV calls.
Adequate Number of Reference Samples: A sufficient number (a minimum of 50 samples is recommended, though not mandatory) of high-quality reference samples is essential for reliable coverage estimation and robust CNV detection.
By following these guidelines, the PON can effectively minimize technical biases, improving the accuracy and reliability of CNV detection.
In PON mode, the DRAGEN CNV Pipeline is broken down into two distinct stages. The target counts stage is performed on each sample (case and normals), to bin the alignments. The normalization and call detection stage is then performed with the case sample against the panel of normals to determine the events.
CNV PONs can also be built in the cloud using the DRAGEN Baseline Builder App on BaseSpace or the DRAGEN Systematic Noise File Builder Pipeline on ICA.
In-run PON for Germline Exome
Some pre-built PONs are available for download from the DRAGEN Software Support Site page. When possible, however, it is recommended to utilize an in-run PON created from samples from the same sequencing run and library prep as the case samples. This will ensure that any biases that may have been introduced during library prep and/or sequencing will be properly normalized. If the samples in the sequencing run are sufficiently diverse and contain a large majority of copy neutral samples for each target region, then it is recommended that the PON consist of as many samples from the sequencing run as possible, but can be limited to 96 samples without significantly impacting the accuracy of coverage normalization. When the sequencing run is enriched with samples containing specific CNVs of interest, the in-run PON should be built from only those samples in the run without the enriched events (i.e. "normal samples"). A minimum of 50 such normal samples is recommended as with a pre-built PON. The table below summarizes the available options and high-level steps for running CNV using an in-run PON. CNV and Targeted Caller require separate PON files, but the intermediate counts files can be generated in the same DRAGEN command line invocation. For additional details click on the link for each option.
Create run using the Run Planning tool in BSSH
Start planned run in Control Software on instrument
Run DRAGEN Germline Enrichment from BCLs App
Run DRAGEN Germline Enrichment App
Run DRAGEN Enrichment App
BCL to FASTQ conversion
Generate CNV target counts and Targeted Caller exome counts for each PON sample
Generate CNV combined counts PON file
Generate Targeted Caller PON file
Perform case sample analyses
Target Counts Stage
Target counts should be generated for all normal samples used as a panel of normals. The case samples and all panel of normals samples must have identical intervals, and therefore should be generated with identical settings, including reference version, target BED, counting method, duplicate marking/filtering, and filtering method/cutoff. The target counts stage also performs GC bias correction, if enabled. GC bias correction is enabled by default, but can be disabled if desired.
The following examples are for WES processing, where a panel of normals is required.
The following is an example command for processing a BAM file.
The following is an example command for processing a CRAM file.
The following example is for WGS processing, where a panel of normals is optional.
Generating Panel of Normals (Combined Counts)
When running an analysis with a panel of normals (a set of target counts files), a column-wise concatenated version of the panel is output as a *.combined.counts.txt.gz file. To generate this file without running the calling step, add the --cnv-generate-combined-counts=true option to the command line. The individual panel of normals target counts files must be specified either with --cnv-normals-file (one option per file) or with --cnv-normals-list (a single text file with the path to each sample's file).
The following is an example command line using a normals list:
Normalization and Call Detection Stage
The next step in the CNV pipeline when using a panel of normals is to perform the normalization and to make the calls. This involves a separate execution of DRAGEN during which the normalization is performed and calls are made. This step requires the specification of a set of target counts files to be used for reference-based median normalization.
Ideally the panel of normals samples follow library prep and sequencing workflows that are identical to the workflows of the case sample under analysis. In order to be applicable to both male and female case samples, the panel of normals should include a balanced set of both male and female samples. DRAGEN automatically handles calling on sex chromosomes based on the predicted sex of each sample in the panel.
The presence of CNVs in the panel can result in artifactual calls in the test sample at locations where at least some of the panel samples have copy number changes. This leads to two considerations regarding construction of a panel.
Firstly, while it is not generally possible to select samples with no CNVs, panel samples should not be clearly aneuploid or contain large-scale somatic CNVs; further, if there is a region of particular interest, samples should be selected to be normal in that region.
Secondly, for optimal bias correction, a minimum of 50 samples is recommended as a panel. DRAGEN can run with smaller numbers of samples in the panel, down to even just a single sample, but smaller panels increase the likelihood of artifactual calls. Larger panels do not entirely prevent such issues, but they limit it to regions where non-reference copy numbers are common.
The following is an example of PON files, which uses a subset of the GC corrected files from the target counts stage.
DRAGEN accepts 3 different file formats for a Panel of Normals (PON).
--cnv-normals-file
Individual normal file. This option uses a single file name and can be specified multiple times.
--cnv-normals-list
List of normal files. A plain text file in which each line in the file contains a path pointing to a *.target.counts.gz or *.target.counts.gc-corrected.gz file generated from the target counts stage. Relative paths are supported if the paths are relative to the current working directory. Absolute paths are recommended in case the workflow is used later or shared with other users.
--cnv-combined-counts
PON file that combines all normal files into a single file. A combined counts file (*.combined.counts.txt.gz) can be found in the output folder of a prior DRAGEN run that used the same panel of normals. Some prepackaged PON files downloaded directly from the Illumina support site must be supplied with this option.
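As a minimal sketch of preparing the input for --cnv-normals-list, the following Python helper collects target counts files from a directory and writes one absolute path per line. The function name `write_normals_list` and the directory layout are assumptions for illustration only; they are not part of DRAGEN.

```python
from pathlib import Path


def write_normals_list(counts_dir, list_path, gc_corrected=True):
    """Write one absolute path per line for each target counts file in counts_dir.

    Hypothetical helper: only the file suffixes come from the documentation.
    """
    suffix = ".target.counts.gc-corrected.gz" if gc_corrected else ".target.counts.gz"
    paths = sorted(
        p.resolve() for p in Path(counts_dir).iterdir() if p.name.endswith(suffix)
    )
    # Absolute paths keep the list usable from any working directory.
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

The resulting text file can then be passed to --cnv-normals-list. Absolute paths are written because they remain valid if the workflow is rerun later from another directory.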
The CNV caller can also be started from the *.target.counts.gz (raw counts) or *.target.counts.gc-corrected.gz (GC corrected counts) files of the case sample, by specifying the selected file with the --cnv-input or --cnv-tumor-input option and the PON options described above. When selecting GC corrected counts the option --cnv-enable-gcbias-correction should be set to false to disable the GC-correction stage; GC-corrected inputs are not supported for somatic WGS analysis.
For example, the following command normalizes the case sample against the panel of normals.
See Output Files - Target Counts File for a description of the target counts files.
Normalization Options
These options control the preconditioning of the panel of normals and the normalization of the case sample.
--cnv-enable-self-normalization: Enables or disables self-normalization mode, which does not require a panel of normals.
--cnv-extreme-percentile: Specifies the extreme median percentile value at which to filter out samples. The default is 2.5.
--cnv-input: Specifies a target counts file for the case sample under analysis when using a panel of normals, for germline analysis (see --cnv-tumor-input for somatic analysis).
--cnv-normals-file: Specifies a target.counts.gz file to be used in the panel of normals. You can use this option multiple times, one time for each file.
--cnv-normals-list: Specifies a text file that contains paths to the list of reference target counts files to be used as a panel of normals. Absolute paths are recommended in case the workflow is used later or shared with other users. Relative paths are supported if the paths are relative to the current working directory.
--cnv-max-percent-zero-samples: Specifies the maximum percentage of zero-coverage samples allowed for a target. If the target exceeds the specified threshold, the target is filtered out. The default value is 5%. The option is sensitive to the number of normal samples being used, so adjust the threshold accordingly. If the panel of normals is small and the threshold is not adjusted, the option can filter out targets that were not intended to be filtered.
--cnv-max-percent-zero-targets: Specifies the maximum percentage of zero-coverage targets allowed for a sample. If the sample exceeds the specified threshold, the sample is filtered out. The default value is 2.5%. The option is sensitive to the total number of target intervals, so adjust the threshold accordingly. If the capture kit has a small number of probes and the threshold is not adjusted, the option can filter out samples that were not intended to be filtered.
--cnv-target-factor-threshold: Specifies the bottom percentile of panel of normals medians to filter out useable targets. The default is 1% for whole genome processing and 5% for targeted sequencing processing.
--cnv-tumor-input: Specifies a target counts file for the case sample under analysis when using a panel of normals, for somatic analysis (see --cnv-input for germline analysis).
--cnv-truncate-threshold: Specifies a percentage threshold for truncating extreme outliers. The default is 0.1%.
--cnv-enable-gender-matched-pon: Enables or disables gender-matched PON normalization. If enabled, DRAGEN uses the matched-gender PON for sex chromosome normalization. Sex chromosome intervals are filtered if the PON has no matched-gender sample. The default value is true.
--cnv-enable-cross-gender-adjustments-chrX: Enables normalization on chrX by adjusting coverage of PON samples according to the expected number of copies of chrX in male and female samples. If the case sample is male, coverage of female PON samples is scaled down by a factor of 2 on chrX. If the case sample is female, coverage of male PON samples is scaled up by a factor of 2 on chrX. If no male PON samples are available, chrY intervals are filtered. This feature is only supported for germline enrichment runs. The default value is false; if set to true, then --cnv-enable-gender-matched-pon must also be true.
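The sensitivity of --cnv-max-percent-zero-samples to panel size can be seen with a small calculation. The following Python sketch assumes a simple in-memory mapping from target to per-sample counts; the function names and data layout are hypothetical and do not reflect DRAGEN internals.

```python
def zero_sample_percent(target_counts):
    """Percentage of panel samples with zero coverage for one target."""
    return 100.0 * sum(1 for c in target_counts if c == 0) / len(target_counts)


def filter_targets(counts_by_target, max_percent_zero_samples=5.0):
    """Keep targets whose zero-coverage sample percentage does not exceed the threshold."""
    return {
        target: counts
        for target, counts in counts_by_target.items()
        if zero_sample_percent(counts) <= max_percent_zero_samples
    }


# With only 4 panel samples, a single zero-coverage sample is already 25%,
# so the 5% default drops the first target.
panel = {"chr1:100-200": [30, 28, 0, 31], "chr1:300-400": [25, 27, 26, 24]}
print(sorted(filter_targets(panel)))        # ['chr1:300-400']
print(sorted(filter_targets(panel, 25.0)))  # ['chr1:100-200', 'chr1:300-400']
```

With a 50-sample panel, the same single zero-coverage sample would account for 2% and the target would be kept under the default threshold.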
Exclude BED Filtering
You can input an exclude BED to the CNV caller to filter out regions from analysis. An exclude BED is useful if certain regions in the genome are known to be problematic due to library prep, sequencing, or mapping issues. You can also exclude intervals that specify common CNVs to aid in downstream analysis. Specify the exclude BED file using --cnv-exclude-bed. DRAGEN does not provide an exclude BED. Format the intervals to exclude in standard three-column BED format.
The intervals in the exclude BED are compared with the original target counts intervals. If the overlap is greater than --cnv-exclude-bed-min-overlap, the target counts interval is excluded from analysis. The *.target.counts.gz file still includes the interval, so you can inspect the original read counts. The interval is removed during the normalization stage, so the *.tn.tsv.gz file does not include it.
An excluded interval does not guarantee that a CNV call does not span the interval. If there is sufficient data flanking the region, the segmentation stage along with any merging might still generate a call spanning the excluded interval. However, the call would not take read counts from excluded intervals into account. You can view explanations for excluded intervals in the *.excluded_intervals.bed.gz file. See Output Files - Excluded interval files for further details.
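The overlap comparison can be sketched as follows. This Python example assumes that the overlap is measured as the fraction of the target interval covered by exclude intervals, and it uses 0.5 as a placeholder threshold; neither the exact definition nor the default of --cnv-exclude-bed-min-overlap is stated in this section, so treat both as assumptions.

```python
def overlap_fraction(target, excluded):
    """Fraction of a target interval (start, end) covered by excluded intervals.

    Intervals are 0-based, half-open, and on the same contig. Excluded
    intervals are assumed not to overlap one another.
    """
    t_start, t_end = target
    covered = sum(
        max(0, min(t_end, e_end) - max(t_start, e_start))
        for e_start, e_end in excluded
    )
    return covered / (t_end - t_start)


def is_excluded(target, excluded, min_overlap=0.5):
    """True if the covered fraction is greater than the (assumed) threshold."""
    return overlap_fraction(target, excluded) > min_overlap


print(is_excluded((1000, 2000), [(1200, 2000)]))  # True: 80% of the target is covered
print(is_excluded((1000, 2000), [(1900, 2500)]))  # False: only 10% is covered
```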
Segmentation
After a case sample has been normalized, the sample goes through a segmentation stage. DRAGEN implements multiple segmentation algorithms, including the following algorithms:
Circular Binary Segmentation (CBS)
Shifting Level Models (SLM)
The SLM algorithm has three variants: SLM, Heterogeneous SLM (HSLM), and Adaptive SLM (ASLM). HSLM is for use in exome analysis and handles target capture kits that are not equally spaced. ASLM includes additional sample-specific estimation of technical variability of depth of coverage, as opposed to changes in copy number. The estimations are based on the median variance within fixed windows or a preliminary set of segments based on b-allele ratios. The ASLM algorithm mitigates oversegmentation due to noisy or wavy samples; this is the default mode for somatic WGS analysis.
By default, SLM is the segmentation algorithm for germline whole genome processing, ASLM is the algorithm for somatic whole genome processing, and HSLM is the algorithm for whole exome processing.
If you have specific regions of interest, you can also run with --cnv-segmentation-bed. The option predefines the segments for which copy numbers are estimated, using the regions of interest listed in the BED file. See User-Defined Segmentation (Segment BED) for more information.
--cnv-segmentation-mode: Specifies the segmentation algorithm to perform. The following values are available.
bed: This option is not applicable to T/N and T/O of somatic WGS and somatic WES workflows.
cbs
slm: The default for germline WGS analysis.
hslm: The default for germline WES analysis.
aslm: The default for somatic analysis, either WGS or WES sample types.
Circular Binary Segmentation
Circular Binary Segmentation is implemented directly in DRAGEN and is based on A faster circular binary segmentation algorithm for the analysis of array CGH data¹, with enhancements to improve sensitivity for NGS data. The following options control Circular Binary Segmentation.
--cnv-cbs-alpha: Specifies the significance level for the test to accept change points. The default is 0.01.
--cnv-cbs-eta: Specifies the Type I error rate of the sequential boundary for early stopping when using the permutation method. The default is 0.05.
--cnv-cbs-kmax: Specifies the maximum width of the smaller segment for permutation. The default is 25.
--cnv-cbs-min-width: Specifies the minimum number of markers for a changed segment. The default is 2.
--cnv-cbs-nmin: Specifies the minimum length of data for maximum statistic approximation. The default is 200.
--cnv-cbs-nperm: Specifies the number of permutations used for p-value computation. The default is 10000.
--cnv-cbs-trim: Specifies the proportion of data to be trimmed for variance calculations. The default is 0.025.
¹Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23(6):657-663. doi:10.1093/bioinformatics/btl646
Shifting Level Models Segmentation
The Shifting Level Models (SLM) segmentation mode follows from the R implementation of SLMSuite: a suite of algorithms for segmenting genomic profiles². The options relevant for SLM and HSLM mode are described in the germline workflow page. The options for ASLM are described in the somatic workflow page.
²Orlandini V, Provenzano A, Giglio S, Magi A. SLMSuite: a suite of algorithms for segmenting genomic profiles. BMC Bioinformatics. 2017;18(1). doi:10.1186/s12859-017-1734-5
User-Defined Segmentation (Segment BED)
DRAGEN CNV optionally accepts additional regions of interest through a --cnv-segmentation-bed file. For example, the specified intervals might correspond to gene boundaries matched to the targeted assay, or might cover entire chromosome arms. Intervals provided by --cnv-segmentation-bed are appended to the CNV VCF with an INFO tag of SEGID taken from the name column of the input BED file.
The recommended format for the BED file includes four columns and a header. The four columns are contig, start, stop, and name. The name column represents the name of the region and must be unique within the BED file. The name is used in the output VCF and annotated as a segment identifier in the INFO/SEGID field. The following example file is in the recommended format with nominal use cases of gene level, arm level, and/or whole chromosome:
If a three-column BED file is used including the contig, start, and stop values, then segment identifiers are autogenerated from the coordinate fields.
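Because the name column must be unique within the BED file, it can help to validate a segmentation BED before running. The following Python sketch parses three- or four-column lines and rejects duplicate names. The fallback identifier built from the coordinates is a hypothetical format chosen for illustration; the identifiers that DRAGEN autogenerates may look different.

```python
def parse_segment_bed(lines):
    """Parse 3- or 4-column tab-separated BED lines into (contig, start, stop, name)."""
    segments, seen = [], set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not fields[1].isdigit():
            continue  # skip the header row and blank lines
        contig, start, stop = fields[0], int(fields[1]), int(fields[2])
        # Hypothetical fallback identifier; DRAGEN's autogenerated format may differ.
        name = fields[3] if len(fields) > 3 else f"{contig}:{start}-{stop}"
        if name in seen:
            raise ValueError(f"duplicate segment name: {name}")
        seen.add(name)
        segments.append((contig, start, stop, name))
    return segments
```

The gene and arm names in any input passed to this function are the user's own; the sketch only checks their uniqueness.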
You can mix user-defined segmentation with standard segmentation modes using the --cnv-segmentation-mode option (cbs, slm, aslm, hslm). For example:
In this case, the CNV VCF output will include results from the selected segmentation method (SLM in this example), plus additional entries from the user-provided segmentation BED file. If variants are called REF, then they may be filtered based on option --cnv-enable-ref-calls. If you set --cnv-segmentation-mode=bed, the CNV VCF will include only the entries defined in the segmentation BED file.
If some segments in the --cnv-segmentation-bed file are not covered by any target intervals (from --cnv-target-bed), or if all overlapping target intervals are filtered out (e.g., due to k-mer uniqueness filtering), then the associated segments will not be output to the VCF.
Examples of output VCF entries are below:
The table below shows CNV workflows supporting the cnv-segmentation-bed option:
WGS | ✓ | ✓ | ✓ | ✓ | No workflow
WES | ✓ | ✓ | ✓ | ✓ | ✓
ASCN CNV requires cnv-segmentation-mode not equal to bed, because the likelihood of the purity/ploidy model is calculated from segments derived from the data. The table below shows CNV workflows supporting the cnv-segmentation-mode=bed option:
WGS | ✗ | ✗ | ✗ | ✓ | No workflow
WES | ✗ | ✗ | ✗ | ✓ | ✓
"No workflow" indicates that no workflow exists for this configuration.
Allele-Specific Copy Number Calling
Selecting a diploid coverage level is a key component of an allele-specific copy number (ASCN) caller. In the somatic case, the caller also needs to identify the most likely tumor purity. DRAGEN CNV ASCN callers use a grid-search approach that evaluates many candidate models to attempt to fit the observed read and b-allele counts across all segments in the input sample. A log likelihood score is emitted for each candidate, and all scores are output (in *.cnv.coverage.models.tsv or *.cnv.purity.coverage.models.tsv, respectively for germline or somatic workflows). The caller chooses the model with the highest log likelihood and then computes several measures of model confidence based on the relative likelihood of the chosen model compared to alternative models.
Note: if BAF data is not sufficient it might be discarded during model estimation, leading to a model based on coverage depth only. In such case, the model will not be able to detect alterations that cannot be easily identified without BAF (e.g., whole-genome trisomy).
Somatic-specific extensions
Default purity/ploidy model
If the confidence in the chosen model is low, the caller returns the default model with estimated tumor purity set to NA. This can be identified on the output VCF header lines:
The default model provides an alternative methodology to identify large somatic alterations (length of at least 1 Mb): records are filtered by this model based on their segment mean value (SM) or, in the case of copy-neutral LOHs, by their minor allele frequency value (MAF). The threshold values for SM used by the caller are estimated automatically considering the variance of the sample, with larger SM thresholds for DUPs when the variance is higher. For MAF values, PASSing copy-neutral LOHs are called when the MAF is below a certain threshold. The user can use alternative threshold values through the --cnv-filter-del-mean, --cnv-filter-dup-mean and --cnv-filter-cnloh-maf parameters.
Finally, when the caller returns the default model, the fields regarding copy number states based on model estimation (i.e., CN, CNF, CNQ, MCN, MCNF, MCNQ) are omitted from the final VCF output. The following is a set of example calls from the final VCF output:
Grid search optimization informed by essential regions
To improve the accuracy of the tumor ploidy model estimation, the somatic WGS CNV caller checks whether the chosen model calls homozygous deletions on regions that are likely to reduce the overall fitness of cells, which are therefore deemed to be "essential" and under negative selection. Recent efforts in the literature have tried to map such cell-essential genes¹.
The check on essential regions is controlled with --cnv-somatic-enable-lower-ploidy-limit (default true). Default BED files describing the essential regions are provided for hg19, GRCh37, hs37d5, and GRCh38, but a custom BED file can also be provided through the --cnv-somatic-essential-genes-bed=<BEDFILE_PATH> parameter. In that case, the feature is automatically enabled. A custom essential regions BED file must have the following format: four tab-separated columns, where the first three columns identify the coordinates of the essential region (chromosome, 0-based start, excluded end) and the fourth column is the region ID (string type). The algorithm currently uses only the first three columns. However, the fourth column can help when manually investigating which regions drove the caller's decisions on model plausibility.
If the somatic WGS CNV caller does not find any overlap between any of the homozygous deletions and any of the essential regions, the model is considered plausible and the model optimization ends. Otherwise, when at least one overlap is found, the model is declared invalid and the model search is repeated on the subset of models that support at least one copy (CN = 1) for the essential region with the lowest coverage among the regions overlapping homozygous deletions.
¹E.g., in 2015 - https://www.science.org/doi/10.1126/science.aac7041
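The plausibility check described above amounts to an interval overlap test between homozygous deletions and essential regions. The following Python sketch illustrates that test under simplifying assumptions: intervals are plain tuples using the 0-based, end-excluded convention of the essential regions BED file, and the function names are hypothetical rather than DRAGEN's implementation.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def essential_regions_hit(homdel_segments, essential_regions):
    """Return the essential regions (chrom, start, end, region_id) that overlap
    any homozygous deletion (chrom, start, end)."""
    return [
        region
        for region in essential_regions
        if any(overlaps(region[:3], seg) for seg in homdel_segments)
    ]


# Placeholder coordinates and region IDs, for illustration only.
essential = [("chr1", 1000, 2000, "region_a"), ("chr2", 500, 900, "region_b")]
homdels = [("chr1", 1500, 3000), ("chr2", 900, 1200)]
print([r[3] for r in essential_regions_hit(homdels, essential)])  # ['region_a']
```

An empty result corresponds to a plausible model. A non-empty result corresponds to the case where the model search is repeated with the additional constraint. The fourth column is carried along only to make the result easier to inspect.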
The following is an example taken from the output log where this feature is triggered, leading to an additional iteration of model fitting:
and, an example where this feature is not triggered:
Rejection of models calling large portions of chromosome as CN0 (homozygous deletion)
Large chromosomal events are likely to negatively impact genome stability and cell viability. The option --cnv-somatic-homdel-max-fraction sets the maximum fraction of any chromosome that can be called as CN0 (default value: 0.7). If the CN0 bases on a chromosome exceed this fraction of the total number of called bases on that chromosome, the weighted average coverage across all HOMDEL segments is taken as the coverage that needs to be at least CN1 for a model to be considered. Model fitting then restarts from the beginning with the new constraints (and thus a reduced set of alternative models). This feature can be disabled by setting --cnv-somatic-homdel-max-fraction=1, which allows all called bases on a chromosome to be CN0 without rejecting the model.
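As a rough illustration of the fraction check, the following Python sketch sums called bases and CN0 bases per chromosome and reports chromosomes over the limit. The segment tuple layout and function name are assumptions for this example; the subsequent coverage constraint and model refitting are not modeled.

```python
def chromosomes_over_limit(segments, max_fraction=0.7):
    """Return chromosomes whose CN0 fraction of called bases exceeds max_fraction.

    segments: iterable of (chrom, start, end, cn) with half-open coordinates.
    """
    called, homdel = {}, {}
    for chrom, start, end, cn in segments:
        length = end - start
        called[chrom] = called.get(chrom, 0) + length
        if cn == 0:
            homdel[chrom] = homdel.get(chrom, 0) + length
    return [c for c in called if homdel.get(c, 0) / called[c] > max_fraction]


calls = [
    ("chr1", 0, 800, 0), ("chr1", 800, 1000, 2),   # 80% of chr1 is CN0
    ("chr2", 0, 500, 0), ("chr2", 500, 1000, 2),   # 50% of chr2 is CN0
]
print(chromosomes_over_limit(calls))        # ['chr1']
print(chromosomes_over_limit(calls, 1.0))   # []
```

Setting the limit to 1 never flags a chromosome, which matches the documented way to disable the feature.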
The following is an example taken from the output log where this feature is triggered:
Constraining tumor purity
When a minimum and/or a maximum value of tumor purity for the input sample are known from additional evidence, it is possible to constrain the search of models based on either or both of these values. The available input options that can be provided are:
--cnv-somatic-min-purity: float in [0,1]
--cnv-somatic-max-purity: float in [min-purity,1]
Constraining sample ploidy
When a minimum and/or a maximum value of ploidy for the input sample are known from additional evidence, it is possible to constrain the search of models based on either or both of these values. The available input options that can be provided are:
--cnv-ascn-min-ploidy: positive float between 0.5 and --cnv-somatic-max-quartile-copy-number (default 9)
--cnv-ascn-max-ploidy: positive float between the minimum ploidy and --cnv-somatic-max-quartile-copy-number (default 9)
Please note: the sample ploidy constraints are applied to a preliminary estimation of ploidy from sample parameters (which might not be exactly respected in the final estimated ploidy in output), equal to:
2 + (excess coverage with respect to diploid coverage) divided by (coverage of one copy)
For germline ASCN workflows, this is computed as:
$$2 + \frac{m - c}{c / 2}$$
while for the somatic ASCN workflow, this expression is computed as:
$$2 + \frac{m - c}{c \cdot p / 2}$$
where:
m is the mean coverage of the input sample
c is the diploid coverage of the model under consideration
p is the tumor purity (only for somatic workflows)
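The preliminary ploidy can be worked through numerically. The following Python sketch follows the verbal definition given above (2 plus the excess coverage over diploid coverage, divided by the coverage of one copy), taking the coverage of one copy as c/2 for germline and c·p/2 for somatic. The function name is hypothetical, and the result is the preliminary estimate only, not the final ploidy reported by the caller.

```python
def preliminary_ploidy(mean_coverage, diploid_coverage, purity=1.0):
    """2 + (m - c) / (c * p / 2); purity=1.0 gives the germline expression."""
    one_copy = diploid_coverage * purity / 2.0
    return 2.0 + (mean_coverage - diploid_coverage) / one_copy


# Germline: mean coverage 60 with diploid coverage 40 is one extra copy on average.
print(preliminary_ploidy(60, 40))              # 3.0
# Somatic: at 50% purity each tumor copy contributes half as much coverage.
print(preliminary_ploidy(60, 40, purity=0.5))  # 4.0
```

A candidate model whose value falls outside the range set by --cnv-ascn-min-ploidy and --cnv-ascn-max-ploidy would be excluded from the search under this reading.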
Subclonal/Mosaic Calling Mode
DRAGEN uses a subclonal/mosaic calling mode for segments with a copy number that is estimated to be heterogeneous among different cells in the sample. Based on a statistical model, a segment is considered to be heterogeneous when the depths or BAF values in a segment are too far away from what is expected for the closest integer-copy number.
Note: in somatic workflows, this setting is honored only when DRAGEN is able to identify a confident model. When a confident model cannot be identified, the caller returns a default model and this feature is always disabled (see the Default purity/ploidy model section for more details and nuances of this approach).
When a segment is considered as heterogeneous, the output for the segment is changed as follows.
The MOSAIC (germline) or HET (somatic) tag is added to the INFO field for the segment.
At least one of the CN and MCN values is given as a non-REF value. Specifically, the values are given as the integer values closest to CNF and MCNF. If the integer values would result in a REF call, then at least one of the CN and MCN values is adjusted to the closest non-REF value.
The ID, ALT, and GT fields are set appropriately for the chosen CN and MCN.
The QUAL score reflects confidence that the segment has nonreference copy number in at least a fraction of the sample.
The CNQ and MCNQ values reflect confidence that the assigned CN and MCN values are true in all of the tumor cells, so at least one of the CNQ and MCNQ values is typically less than five.
To turn on this feature, specify either one of these options:
--cnv-enable-mosaic-calling=true (for the germline ASCN workflow, default true)
--cnv-somatic-enable-het-calling=true (for the somatic workflow, default false)
Note: this calling mode can be disabled for alterations smaller than N bases. It is recommended not to change default thresholds. If necessary, however, they can be changed with:
--cnv-filter-mosaic-length=N (for the germline ASCN workflow, default 100000)
--cnv-somatic-filter-het-length=N (for the somatic workflow, default 0)
The following is an example of MOSAIC GAIN call from the germline ASCN model (from *.cnv.vcf.gz output file):
The assigned total CN is 4. However, inspecting the CNF annotation (CNF ~ 4.32), we can see that the segment above has a larger deviation from the diploid state with respect to the assigned integer CN state. This can support various hypotheses on the fraction of cells bearing the different CN states, for example:
32% of cells having CN5, 68% of cells having CN4
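The arithmetic behind such a hypothesis can be sketched in Python. The example assumes a two-state mixture in which a fraction of cells carries one copy number and the remaining cells carry another, so that the fractional copy number is their weighted average. The function name is hypothetical, and the second call below is an extra illustrative hypothesis that DRAGEN does not output.

```python
def altered_cell_fraction(cnf, base_cn, altered_cn):
    """Fraction of cells at altered_cn if the rest are at base_cn and the mix averages to cnf."""
    if altered_cn == base_cn:
        raise ValueError("the two copy number states must differ")
    fraction = (cnf - base_cn) / (altered_cn - base_cn)
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("cnf is not between the two copy number states")
    return fraction


# CNF ~ 4.32 as a mix of CN4 and CN5 cells: about 32% of cells at CN5.
print(round(altered_cell_fraction(4.32, 4, 5), 2))  # 0.32
# The same CNF as a mix of CN2 and CN5 cells: about 77% of cells at CN5.
print(round(altered_cell_fraction(4.32, 2, 5), 2))  # 0.77
```

Depth alone cannot distinguish between such mixtures, which is why the CNF value supports several hypotheses.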
Allele Specific Copy Number Examples
In addition to assigning total copy number based on depth, ASCN Callers make use of BAFs to call allele specific copy numbers. The following table provides examples for a DUP in a reference-diploid region:
Total Copy Number (CN) | Minor Copy Number (MCN) | ASCN Case
4 | 2 | 2+2
4 | 1 | 3+1
*4 | 0 | 4+0
*The entry represents an Absence or Loss of Heterozygosity (AOH/LOH) case. The total copy number is still considered a DUP, so the entry is annotated as GAINLOH to distinguish the value from Copy Neutral AOH/LOH (CNLOH), which would be annotated as 2+0.
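The relationship between CN, MCN, and the event annotations can be summarized with a small Python sketch. It assumes a reference-diploid region and uses only the event names mentioned in this reference (GAIN, LOSS, REF, CNLOH, GAINLOH). The function is a hypothetical illustration, not the caller's logic.

```python
def ascn_label(cn, mcn, ref_cn=2):
    """Return (major+minor split, event name) for a segment in a reference-diploid region."""
    major = cn - mcn
    if mcn < 0 or mcn > major:
        raise ValueError("MCN must be between 0 and the major allele copy number")
    if cn > ref_cn:
        event = "GAINLOH" if mcn == 0 else "GAIN"
    elif cn < ref_cn:
        event = "LOSS"
    else:
        event = "CNLOH" if mcn == 0 else "REF"
    return f"{major}+{mcn}", event


for cn, mcn in [(4, 2), (4, 1), (4, 0), (2, 0)]:
    print(cn, mcn, ascn_label(cn, mcn))
# 4 2 ('2+2', 'GAIN')
# 4 1 ('3+1', 'GAIN')
# 4 0 ('4+0', 'GAINLOH')
# 2 0 ('2+0', 'CNLOH')
```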
Call Smoothing
The segmentation stage might produce adjacent or nearby segments that are assigned the same copy number and have similar depth and BAF data. This segmentation can result in a region with consistent true copy number being fragmented into several pieces. The fragmentation might be undesirable for downstream use of copy number estimates. Also, for some uses, it can be preferable to smooth short segments that would be assigned different copy numbers whether due to a true copy number change or an artifact. To reduce undesirable fragmentation, initial segments can be merged during a postcalling segment smoothing step.
After initial calling, segments shorter than the specified value of --cnv-filter-length are deemed negligible. Among the remaining nonnegligible segments, successive pairs are evaluated for merging. On a trial basis, the caller combines two successive segments that are within --cnv-merge-distance (default value of 10000 for WGS Somatic CNV) of one another and have the same CN and MCN assignments, along with any intervening negligible segments, into a single segment that is recalled and rescored. If the merged segment receives the same CN and MCN as its constituent nonnegligible pieces with a sufficiently high quality score, the original segments are replaced with the merged segment. The merged segment might be further merged with other initial or merged segments to either side. Merging proceeds until all segment pairs that meet the criteria are merged. Note: in somatic workflows, when the germline CN information is available and two segments have different germline CN, they are not merged.
QUAL Model
The caller uses a model based on diploid coverage (and purity in somatic workflows) from depth of coverage and B-allele frequency.
Given the most likely diploid coverage (and purity in somatic workflows), for each segment, the algorithm calls the most likely copy number state (complete with total copy number CN, and minor allele copy number MCN).
The probability of the REF state is used as input to the scoring algorithm, which outputs the QUAL value (a PHRED score capped at 1000). The QUAL value is the PHRED score where the probability of error is the probability of REF when an alteration is called, or the probability of having a non-REF call when the segment should be called REF.
Note: this is different from how QUAL is computed in (legacy) depth-only callers.
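The PHRED conversion described for the ASCN QUAL score can be illustrated as follows. This Python sketch applies the standard PHRED definition, -10·log10(probability of error), with the cap of 1000 mentioned above. How the caller derives the REF probability is not shown, and the function names are hypothetical.

```python
import math


def phred_qual(p_error, cap=1000):
    """PHRED-scale an error probability, capped at `cap`."""
    if p_error <= 0.0:
        return float(cap)
    return min(float(cap), -10.0 * math.log10(p_error))


def qual_from_p_ref(p_ref, called_ref):
    """Error is P(REF) for a non-REF call, and 1 - P(REF) for a REF call."""
    return phred_qual(1.0 - p_ref if called_ref else p_ref)


print(round(qual_from_p_ref(0.001, called_ref=False)))  # 30
print(round(qual_from_p_ref(0.99, called_ref=True)))    # 20
```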
Output Files
DRAGEN emits the calls in the standard VCF format. The VCF file includes only copy number gain and loss events. To include copy neutral (REF) calls in the output VCF, set --cnv-enable-ref-calls to true. AOH/LOH events are available in workflows where allele-specific copy number is available.
CNV VCF File
File extension: *.cnv.vcf.gz
The CNV VCF file follows the standard VCF format. Due to the nature of how CNV events are represented versus how structural variants are represented, not all fields are applicable. In general, if more information is available about an event, then the information is annotated. Some fields in the DRAGEN CNV VCF are unique to CNVs. The VCF header is annotated with ##source=DRAGEN_CNV to indicate the file is generated by the DRAGEN CNV pipeline.
VCF format differences between different callers
In the DRAGEN CNV component, two versions of the VCF specification are used for the *.cnv.vcf.gz file:
For ASCN workflows, the format used is VCF v4.4
For depth-only workflows (including multisample CNV calling), the format used is VCF v4.2
The differences between the two formats in output from DRAGEN are the following:
Category | Field | VCF v4.2 | VCF v4.4
General | INFO/SVLEN | Positive or Negative | Always Positive
Absence/Loss of Heterozygosity (AOH/LOH) | ALT | <DEL>,<DUP> | <LOH>
Absence/Loss of Heterozygosity (AOH/LOH) | FORMAT/GT | 1/2 | 1/1
Header
The following is an example of some of the header lines that are specific to CNV:
The following header lines are specific to the somatic CNV callers (WGS/WES) and the germline WGS CNV caller:
ModelSource: The primary basis on which the final model was chosen. The following values can be included:
DEPTH+BAF: Depth+BAF signal is used to determine model.
DiploidCoverage: Expected read count for a target bin in a diploid region. The numeric value is unlimited.
OverallPloidy: Length-weighted average of copy number for PASS events (for the tumor fraction in somatic runs). The numeric value is unlimited.
OutlierBafFraction: A QC metric that measures the fraction of b-allele frequencies that are incompatible with the segment the BAFs belong to. High values might indicate a mismatched normal, substantial cross-sample contamination, or a different source of a mosaic genome, such as bone marrow transplantation. The range of this field is [0, 1].
HomozygosityIndex: Autosomal AOH/LOH percentage, considering only PASS AOH/LOH greater than or equal to a certain threshold. This metric can be used as a proxy for consanguinity in the germline WGS CNV caller. The default minimum size for PASS AOH/LOH to be considered is 2 Mb, since it is often found that shorter ROHs "do not arise from inbreeding in recent generations and are common in all of the populations represented in the HGDP" (Kirin et al., 2010). However, a custom minimum size can be set through the option --cnv-min-length-homozygosity-index. Note: The Cyto VCF (*.cyto.vcf.gz) also provides resolution-specific homozygosity indexes (i.e., computed on each specific resolution's callset). The default minimum size considered is the same as for the main HomozygosityIndex, and for each resolution in output, there is an additional header line on the Cyto VCF indicating the resulting metric, e.g., ##HomozygosityIndex(25k)=0.001015.
The following header lines are specific to the somatic CNV callers (WGS/WES):
ModelSource can also have the following values (see section below for additional details):
DEPTH+BAF_DOUBLED: The initial depth+BAF model is duplicated based on VAF signal or excess segments at half the expected depth change.
DEPTH+BAF_DEDUPLICATED: The depth+BAF model is deduplicated based on VAF signal or insufficient segments supporting a duplication.
DEPTH+BAF_WEAK: Depth+BAF signal is used to determine tumor model, but this is associated with lower confidence than DEPTH+BAF.
VAF: VAF signal is used to determine tumor model due to insufficient depth+BAF signal.
SAMPLE_MEDIAN: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. Diploid coverage is set to the sample median.
DEGENERATE_DIPLOID: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. The diploid coverage is set to the lowest value observed in a substantial number of bases in segments with BAF=50%.
EstimatedTumorPurity: Estimated fraction of cells in the sample due to tumor. The range of this field is [0, 1] or NA if a confident model could not be determined.
AlternativeModelDedup: An alternative to the best model corresponding to one less whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation if the best model might involve a spurious genome duplication.
AlternativeModelDup: An alternative to the best model corresponding to one more whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation where the best model might have missed a true genome duplication.
Understanding ModelSource Annotation (Somatic only)
The ModelSource indicates the type and strength of evidence used to determine the tumor purity and ploidy model for the sample. Possible values are listed in approximate order of decreasing evidence strength, with DEPTH+BAF variants representing the most robust determinations and the degenerate models representing the least confident scenarios.
DEPTH+BAF represents the strongest evidence, where both sequencing depth (read coverage) and B-allele frequency (BAF) signals consistently support the chosen model, confirmed by variant allele frequency (VAF) data (if available).
DEPTH+BAF_DOUBLED indicates that the initial depth and BAF model was adjusted upward by a whole-genome duplication factor, supported by either VAF evidence showing variants at the expected frequencies for a duplicated genome, or an excess of genomic segments with a closely matching state in the model with the WGD. Conversely, DEPTH+BAF_DEDUPLICATED means the model was adjusted downward by removing a whole-genome duplication, based on VAF data inconsistent with duplication or insufficient genomic segments supporting the higher ploidy hypothesis.
DEPTH+BAF_WEAK reflects a scenario where depth and BAF signals provided the model, but with lower confidence than other DEPTH+BAF model sources. A model receiving this model source is found through depth and BAF, but either:
VAF data is available but the model is not concordant with VAF evidence.
Several regions of the genome have no closely matching state under the selected model.
VAF indicates that variant allele frequency data from somatic mutations became the primary evidence source because depth and BAF signals were insufficient or conflicting.
Finally, DEGENERATE_DIPLOID and SAMPLE_MEDIAN represent fallback models used when neither depth/BAF nor VAF provide adequate signal, and tumor purity cannot be reliably estimated. These assume the sample is high-purity diploid, with coverage set either to the lowest observed value in BAF-balanced regions (DEGENERATE_DIPLOID) or to the sample's median coverage (SAMPLE_MEDIAN).
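When reviewing many samples, the model annotations can be pulled from the VCF header programmatically. The following Python sketch collects unstructured ##key=value header lines into a dictionary and interprets EstimatedTumorPurity, treating NA as a missing value. It assumes these annotations appear as simple ##key=value lines, in the same style as the ##HomozygosityIndex(25k)=0.001015 example; the helper names and sample values are hypothetical.

```python
def parse_simple_header(lines):
    """Collect unstructured ##key=value VCF header lines into a dict."""
    meta = {}
    for line in lines:
        if not line.startswith("##") or "=" not in line:
            continue
        key, value = line[2:].rstrip("\n").split("=", 1)
        if value.startswith("<"):
            continue  # structured lines such as ##INFO=<...> or ##FILTER=<...>
        meta[key] = value
    return meta


def tumor_purity(meta):
    """Return the estimated purity as a float, or None when absent or NA."""
    value = meta.get("EstimatedTumorPurity")
    return None if value in (None, "NA") else float(value)


# Illustrative header lines; the values are made up for this example.
header = [
    "##source=DRAGEN_CNV",
    "##ModelSource=SAMPLE_MEDIAN",
    "##EstimatedTumorPurity=NA",
    '##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Length">',
]
meta = parse_simple_header(header)
print(meta["ModelSource"], tumor_purity(meta))  # SAMPLE_MEDIAN None
```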
Records
All coordinates in the VCF are 1-based.
CHROM
The CHROM column specifies the chromosome (or contig) on which the copy number variant being described occurs.
POS
The POS column is the start position of the variant. According to the VCF specification, if any of the ALT alleles is a symbolic allele, such as <DEL>, then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism.
ID
The ID column is used to represent the event. The ID field encodes the event type and coordinates of the event (1-based, inclusive). In addition to representing GAIN, LOSS and REF events, in Somatic (WGS/WES) and Germline (WGS) CNV, the ID could include the Copy Neutral Loss/Absence of Heterozygosity (CNLOH) or Copy Number Gain with LOH (GAINLOH) events.
REF
The REF column contains an N for all CNV events.
ALT
The ALT column specifies the type of CNV event. Because the VCF contains only CNV events, only the <DEL>, <DUP>, or <LOH> entries are used. If REF calls are emitted, their ALT is always "." (a single period). In workflows where allele-specific copy number (ASCN) is available, if the legacy DRAGEN VCF format (VCF v4.2) has been enabled with --cnv-enable-legacy-vcf-format, the ALT field contains two alleles, <DEL>,<DUP>, in place of <LOH>, for AOH/LOH events.
QUAL
The QUAL column contains an estimated quality score for the CNV event, which is used in hard filtering. Each CNV workflow has different defaults and the value used can be found in the VCF header. Note: different workflows (e.g., germline WGS depth-only vs germline WGS) do not share the same underlying model and provide different QUAL score distributions. It is recommended to compare QUAL scores only within results from the same workflow. More details are available on QUAL (depth-only) and QUAL (ASCN).
FILTER
The FILTER column contains PASS if the CNV event passes all filters, otherwise the column contains the name of the failed filter. Default values are defined in the header line for each available FILTER.
binCount
✓
✓
✓
chromArmBinCount
✓
✓
✓
cnvBinSupportRatio
✓
✓
✓
cnvCopyRatio
✓
✓
✓
cnvHetLength
✓
✓
cnvLength
✓
✓
✓
✓
✓
✓
cnvLikelihoodRatio
✓
✓
cnvMosaicLength
✓
✓
cnvQual
✓
✓
✓
✓
✓
✓
dinucQual
✓
✓
mosaicFraction
✓
✓
highCN
✓
lengthDegenerate
✓
✓
segmentMean
✓
✓
SqQual
✓
FILTER description
Available FILTERs:
binCount - Filters CNV events with a bin count lower than a threshold.
cnvBinSupportRatio - Indicates that, for CNVs greater than 80 kb, the percent span of supporting target intervals is lower than a threshold.
cnvCopyRatio - Indicates that the segment mean of the CNV is not far enough from copy neutral.
cnvHetLength - Indicates that a HET call below a certain length has been filtered as a candidate false positive.
cnvLength - Indicates that the length of the CNV is lower than a threshold.
cnvLikelihoodRatio - Indicates that the log10 likelihood ratio of ALT to REF is less than a threshold.
cnvMosaicLength - Indicates that a MOSAIC call below a certain length has been filtered as a candidate false positive.
cnvQual - Indicates that the QUAL of the CNV is lower than a threshold.
chromArmBinCount - Indicates that a whole-arm alteration call is based on a minimal portion (default 500 intervals) of the entire arm (e.g., in acrocentric chromosomes, where the short arm consists mainly of poor-mappability regions that are ignored during copy-number calling).
dinucQual - Applied based on the percentage of bases in a segment that belong to a two-base set (GC, CT, or AC), determined by individual occurrences. A CNV call is filtered out if any of these percentages falls outside typical ranges, indicating a likely false positive.
mosaicFraction - Indicates that the mosaic fraction of a germline CNV is below a defined threshold (--cnv-filter-mosaic-fraction). This filter is applied only to small CNVs with lengths shorter than the specified size threshold (--cnv-filter-mosaic-fraction-max-length).
highCN - Indicates a CNV call with implausible copy number (>6).
lengthDegenerate - Marks records as non-PASSing based on each record's length (REFLEN) when the caller returns the default model. Segments shorter than 1 Mb are assigned this filter when returning the default model.
segmentMean - Marks records as non-PASSing based on each record's segment mean (SM) when the caller returns the default model. Segments having insufficient SM in DELs or DUPs are assigned this filter when returning the default model.
SqQual - Marks records as non-PASSing based on each record's somatic quality (SQ) when the caller returns the default model. Segments having insufficient SQ are assigned this FILTER when returning the default model. SQ is the somatic quality value, a Phred-scaled score of the p-value from a two-sample t-test comparing normalized counts of CASE vs PON.
INFO
The INFO column contains information representing the event.
REFLEN indicates the length of the event.
SVLEN indicates the length of the event and it is only present for non-REF records. Note: if the legacy DRAGEN VCF format (VCF v4.2) has been enabled with --cnv-enable-legacy-vcf-format, SVLEN is a signed representation of REFLEN (e.g., a negative value indicates a deletion).
SVTYPE is always CNV and only present for non-REF records.
END indicates the end position of the event (1-based, inclusive).
The legacy (depth-only) Germline CNV caller also includes the following INFO fields:
GCP - Percentage of bases that are G or C
CTP - Percentage of bases that are C or T
ACP - Percentage of bases that are A or C
If using a segment BED file, then the segment identifier is carried over from the input to SEGID field.
In Germline WGS CNV the MOSAIC tag identifies mosaic calls. In Somatic CNV the HET tag identifies subclonal calls. See Subclonal/Mosaic-Calling Mode for more details.
When matching CNV with SV output, additional INFO annotations are added. See CNV With SV Support.
FORMAT
The common FORMAT fields are described in the header:
GT - Genotype
SM - Linear copy ratio of the segment mean
CN - Estimated copy number
BC - Number of bins in the region
PE - Number of improperly paired end reads at start and stop breakpoints
AS - Number of allelic read count sites
BC - Number of read count bins
CN - Estimated total copy number of sample (for the tumor fraction in Somatic callers). This field is not present if the model cannot be estimated with high confidence.
CNF - Floating point estimate of copy number (for the tumor fraction in Somatic callers). This field is not present if the model cannot be estimated with high confidence.
CNQ - Exact total copy number Q-score. This field is not present if the model cannot be estimated with high confidence.
MAF - Estimate for the minor allele frequency
MCN - Estimated minor-haplotype copy number (for the tumor fraction in Somatic callers). This field is not present if the model cannot be estimated with high confidence.
MCNF - Floating point estimate of minor-haplotype copy number (for the tumor fraction in Somatic callers). This field is not present if the model cannot be estimated with high confidence.
MCNQ - Minor copy number Q-score. This field is not present if the model cannot be estimated with high confidence.
OBF - Per-segment Outlier BAF Fraction. Percentage of BAF counts which are considered "outlier" with respect to the chosen segment call. Higher values might indicate segments where BAF counts are problematic.
SD - Best estimate of segment's bias-corrected read count
NCN - Normal-sample copy number. The field is only present in somatic workflows with enabled germline-aware mode.
SCND - Difference between CN and NCN. The field is only present in somatic workflows with enabled germline-aware mode.
Note: legacy depth-only callers (germline WGS/WES and somatic WES) do not include some of the FORMAT fields listed above because of limitations of the legacy models. Germline WES (depth-only) also includes the following FORMAT field:
LR - Log10 likelihood ratio of ALT to REF
Note on genotype annotation in germline copy number calling (depth-only)
Because germline copy number calling determines overall copy number rather than the copy number on each haplotype, the genotype type field contains missing values for diploid regions when CN is greater than or equal to 2. The following are examples of the GT field for various VCF entries:
Ploidy | ALT | CN | GT
Diploid | . | 2 | ./.
Diploid | <DUP> | >2 | ./1
Diploid | <DEL> | 1 | 0/1
Diploid | <DEL> | 0 | 1/1
Haploid | . | 1 | 0
Haploid | <DUP> | >1 | 1
Haploid | <DEL> | 0 | 1
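The mapping in the table above can be expressed as a small lookup. The following is a minimal Python sketch for illustration only, not DRAGEN code; the function name gt_from_copy_number and the integer ploidy argument are assumptions introduced here.

```python
def gt_from_copy_number(ploidy: int, cn: int) -> str:
    """Return the GT string implied by the table for a depth-only germline call.

    ploidy is the expected (reference) ploidy of the region: 2 or 1.
    cn is the estimated total copy number of the segment.
    """
    if cn < 0:
        raise ValueError("copy number must be non-negative")
    if ploidy == 2:
        if cn == 2:
            return "./."  # copy neutral: per-haplotype state is unknown
        if cn > 2:
            return "./1"  # gain: at least one haplotype carries the event
        return "0/1" if cn == 1 else "1/1"  # one-copy vs two-copy loss
    if ploidy == 1:
        # Haploid regions: reference when CN is 1, altered otherwise.
        return "0" if cn == 1 else "1"
    raise ValueError("the table only covers haploid and diploid regions")
```

For example, gt_from_copy_number(2, 0) returns 1/1, matching the homozygous deletion row.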
Post VCF target BED
The DRAGEN CNV pipeline can take a target BED file as input and emit only the calls that overlap the BED intervals. The post VCF target BED is provided through the --cnv-post-vcf-target-bed option.
Coverage Uniformity
The DRAGEN CNV pipeline provides a measure of the quality of the data for a sample. If using the WGS self-normalization method, the additional CoverageUniformity metric is present in the VCF header. The metric is only available for germline samples. The CNV pipeline assumes that post-normalization target counts are independently and identically distributed (IID). Coverage in most high-quality WGS samples is uniform enough for the CNV caller to produce accurate calls, but some samples violate the IID assumption. Issues during library preparation or sample contamination can lead to several extreme outliers and/or waviness of target counts, which can result in a large number of false positive CNV calls. The CoverageUniformity metric quantifies the degree of local coverage correlation in the sample to help identify poor-quality samples.
A larger value for this metric means the coverage in a sample is less uniform, which indicates that the sample has more nonrandom noise, and could be considered poor quality. The CoverageUniformity metric depends on factors other than sample quality, such as the cnv-interval-width setting and sample mean coverage. DRAGEN recommends using this score only to compare the quality of samples that have similar mean coverage and were run with the same command-line options. Because of this, DRAGEN CNV only provides the metric and does not take any action based on it.
The CoverageUniformity metric is calculated as follows:
Parse normalized counts from tn.tsv.gz for autosomes.
Split counts into disjoint tiled windows of size 100 bins.
Compute variance for all windows to measure observed distribution.
Shuffle normalized counts and compute variance for similar disjoint windows to create a background distribution.
Perform a two-sample Kolmogorov–Smirnov (KS) test between the observed and background variance distributions per chromosome.
Compute the final CoverageUniformity score as a weighted mean of KS test statistic across chromosomes, where weights correspond to chromosome sizes.
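The steps above can be sketched in Python using only the standard library. This is an illustrative approximation, not the DRAGEN implementation: the function names, the use of population variance, the handling of a trailing partial window, and the fixed shuffle seed are all assumptions made here.

```python
import bisect
import random
import statistics


def window_variances(counts, window=100):
    """Variance of each disjoint, full-size window of `window` bins."""
    return [
        statistics.pvariance(counts[i:i + window])
        for i in range(0, len(counts) - window + 1, window)
    ]


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect.bisect_right(a, x) / len(a) - bisect.bisect_right(b, x) / len(b))
        for x in a + b
    )


def coverage_uniformity(counts_by_chrom, chrom_sizes, window=100, seed=0):
    """Size-weighted mean of per-chromosome KS statistics.

    counts_by_chrom maps chromosome name to its normalized counts in
    genomic order; chrom_sizes maps chromosome name to its weight.
    """
    rng = random.Random(seed)
    weighted_sum, total_weight = 0.0, 0.0
    for chrom, counts in counts_by_chrom.items():
        observed = window_variances(counts, window)
        if not observed:
            continue  # chromosome shorter than one window
        shuffled = list(counts)
        rng.shuffle(shuffled)  # destroys local correlation, keeps the values
        background = window_variances(shuffled, window)
        weighted_sum += chrom_sizes[chrom] * ks_statistic(observed, background)
        total_weight += chrom_sizes[chrom]
    return weighted_sum / total_weight if total_weight else float("nan")
```

In this sketch, a sample with slow coverage waves has window variances well below those of its shuffled counterpart, so the KS statistic (and therefore the score) approaches 1, while IID counts give a score closer to 0.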
CNV Metrics File
DRAGEN CNV outputs metrics in CSV format. The output follows the general convention for QC metrics reporting in DRAGEN. The CNV metrics are output to a file with a *.cnv_metrics.csv file extension. The following list summarizes the metrics that are output from a CNV run.
Sex Genotyper:
The estimated sex of the case sample and of all panel of normals samples is reported. For WGS workflows, the estimated sex karyotype is reported; for non-WGS workflows, the gender is reported.
Confidence score (ranging from 0.0 to 1.0). If the sample sex is specified, this metric is 0.0.
DRAGEN Sex Genotyper requires a minimum of 300 target intervals to confidently determine sex genotype; if the panel covers fewer intervals on the sex chromosomes, genotyping will fail and an undetermined genotype is returned. Users may lower this requirement by setting --cnv-sex-genotyper-num-interval-requirement to a smaller value, at the risk of increased false genotype calls.
CNV Summary:
Bases in reference genome in use
Average alignment coverage over genome - The average alignment coverage over the genome is calculated by dividing the total number of bases from processed alignment records (excluding those filtered by the Target Counts stage in DRAGEN CNV) by the genome length. Alignment records are filtered taking into consideration duplicate marking status (if available), MAPQ, and mapping status.
Number of alignment records processed
Number of filtered records (total)
Number of filtered records (due to duplicates)
Number of filtered records (due to MAPQ)
Number of filtered records (due to being unmapped)
PMAD - Pairwise Median Absolute Deviation measures the variation in read coverage between adjacent bins. It measures variability due to various factors, such as DNA degradation, extraction, amplification or library preparation. Higher values indicate noisier sample data. PMAD is calculated as follows:
Define a vector v[i] as normalized counts of i-th interval in log scale, and d[i] as pairwise differences of consecutive normalized counts between i and i+1 intervals, i.e. d[i] = (v[i] - v[i+1])
PMAD is median absolute deviation of d, i.e. PMAD = Median(|d[i]-Median(d)|)
Coverage MAD - Median absolute deviation of normalized case counts. Higher values indicate noisier sample data.
Median Bin Count - Median of raw counts normalized by interval size.
Number of target intervals
Number of normal samples
Number of segments
Number of amplifications - Note: GAINLOH events (ALT=LOH and CN > 2) are also included here
Number of deletions
Number of CNLOHs (Copy-Neutral LOHs)
Number of PASS amplifications - Note: GAINLOH events (ALT=LOH and CN > 2) are also included here
Number of PASS deletions
Number of PASS CNLOHs (Copy-Neutral LOHs)
Post-Normalization Bin Count Sigma - Standard deviation of post-PoN-normalization median-normalized coverage values.
Coverage MAD and Median Bin Count are only printed for WES germline/somatic CNV. Post-Normalization Bin Count Sigma is only printed when PoN normalization has been applied.
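The PMAD definition given in the CNV Summary list can be sketched in a few lines of Python. This is an illustrative sketch under the stated definition only; the function name pmad is hypothetical, and the input is assumed to be the per-interval normalized counts in log scale, in genomic order.

```python
import statistics


def pmad(log_counts):
    """Median absolute deviation of differences between consecutive bins."""
    # d[i] = v[i] - v[i+1] for each pair of adjacent intervals
    diffs = [a - b for a, b in zip(log_counts, log_counts[1:])]
    center = statistics.median(diffs)
    return statistics.median(abs(d - center) for d in diffs)
```

Because PMAD uses differences between neighboring bins, a smooth trend across the genome contributes little, while bin-to-bin noise raises the value.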
Intermediate and Visualization Files
Intermediate stages of the pipeline produce various intermediate output files. These files can be useful for visualization of the evidence or results from each stage, and may aid in fine-tuning options.
All files have a structure similar to a BED file with optional header line(s).
Target Counts File
The file *.target.counts.gz is a compressed tab-delimited text file that contains the number of read counts per target interval. This is the raw signal as extracted from the alignments of the BAM or CRAM file. The format is identical for both the case sample and any panel of normals samples. There is also a bigWig representation of a target.counts.diploid file, which is normalized to the normal ploidy level of 2 instead of raw counts.
It has the following columns:
Contig identifier
Start position
End position
Target interval name
Count of alignments in this interval
Count of improperly paired alignments in this interval
Header lines are also included that start with #. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.
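A file with this layout can be read with the Python standard library. The sketch below is illustrative: the function name read_target_counts is hypothetical, and treating a non-# line with a non-numeric start position as a column-name line is an assumption, not documented behavior.

```python
import gzip


def read_target_counts(path):
    """Parse a target counts file into (meta_lines, column_names, rows)."""
    meta, columns, rows = [], None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            if line.startswith("#"):
                meta.append(line)  # DRAGEN version, command line, etc.
                continue
            fields = line.split("\t")
            try:
                start, stop = int(fields[1]), int(fields[2])
            except ValueError:
                # Assumption: a non-numeric start marks a column-name line.
                columns = fields
                continue
            rows.append({
                "contig": fields[0],
                "start": start,
                "stop": stop,
                "name": fields[3],
                "count": float(fields[4]),
                "improper_pairs": float(fields[5]),
            })
    return meta, columns, rows
```

Counts are parsed as floats so that the same reader also works for files with the same columns whose values are not integers.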
An example of a *.target.counts.gz file is shown below.
B-Allele counts
In germline runs, B-allele counts are calculated at bi-allelic sites taken from a collection of high-frequency SNVs in the population. In somatic runs, B-allele counts are calculated at sites in the tumor sample where the normal sample is likely to be heterozygous. When analyzed in conjunction with a matched normal sample, the sites are those that are called as heterozygous SNVs in the normal sample. When analyzed in tumor-only mode, sites are selected from a population collection (similar to germline runs). Each B-allele site consists of a reference allele and a variant allele, and the number of reads in the sample supporting each of these alleles is counted.
B-allele counts are written both to gzipped tsv file *.ballele.counts.gz and gzipped bedgraph file *.baf.bedgraph.gz.
B-allele tsv
The tsv file format is the following:
Contig identifier
Start, BED-style (zero-based inclusive) start position of the reference allele
Stop, BED-style (one-based inclusive) stop position of the reference allele
Base sequence for the reference allele
Base sequence for the first allele being counted
Base sequence for the second allele being counted
The number of qualified reads containing a sequence matching the first allele
The number of qualified reads containing a sequence matching the second allele
Additionally, in the case of B-allele sites from a population VCF, the following two additional columns are added after the columns listed above:
Population frequency for the first allele
Population frequency for the second allele
An example of B-allele counts file is provided below:
B-allele bedgraph file
The bedgraph file format is similar to the BED format and it has the following columns:
Contig identifier
Start
Stop
Ratio of allele counts
The numerator and denominator of the ratio are determined by sorting the allele counts according to the priority of the corresponding bases. The order of the bases in descending priority is {A, T, G, C}.
When the priority of allele1 is higher than the priority of allele2, the output frequency is calculated by:
When the priority of allele2 is higher than the priority of allele1, the output frequency is calculated by:
By prioritizing the bases in this way, the output frequencies will be deterministically distributed in a roughly equal proportion above and below 0.5. When plotting these B-allele frequencies (e.g., in IGV), this gives an easy way to visually determine significant changes in b-allele frequency between neighboring segments of the genome. It also provides a similar visualization to that typically used for array data.
An example of the bedgraph file is shown below:
Bias correction file
The file *.target.counts.gc-corrected.gz contains the number of GC-corrected read counts per target interval. The format is equivalent to the *.target.counts.gz file:
Contig identifier
Start position
End position
Target interval name
GC-corrected read counts in this interval
Count of improperly paired alignments in this interval
Header lines are also included that start with #. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.
An example of a *.target.counts.gc-corrected.gz file is shown below.
Combined counts file
The file *.combined.counts.txt.gz is a column-wise concatenation of individual *.target.counts.gz and *.target.counts.gc-corrected.gz used to form the panel of normals.
Normalization file
The file *.tn.tsv.gz contains the normalized signal of the case sample, per target interval, i.e., the log2-transformed copy ratio signal. A strong signal deviation from 0.0 indicates a potential CNV event. The format is equivalent to the *.target.counts.gz file:
Contig identifier
Start position
End position
Target interval name
Log2-transformed copy ratio in this interval
Count of improperly paired alignments in this interval
Header lines are also included that start with #. In some cases, the normalization counts could be patched internally with intervals from other processes, such as the SegDups extension. In such cases, patches are indicated (sorted in order of application) with header lines starting with #patch:
and the original (unpatched) *.tn.tsv.gz is renamed as *.tn.unpatched.tsv.gz. Note: this file is reported in output for inspection, but most use cases will use the (patched) *.tn.tsv.gz file downstream of normalization.
An example of a *.tn.tsv.gz file is shown below.
Segmentation file
File extension: *.seg, *.seg.called, *.seg.called.merged
Files containing the segments produced by the segmentation algorithm. The Segment_Mean value of a segment is the ratio of the mean of that segment to the whole-sample median, without log transformation (linear copy-ratio). A strong signal deviation from 1.0 indicates a potential for a CNV event.
The *.seg file has the following columns:
Sample name
Contig identifier
Start position
End position
Number of intervals in the segment
Linear copy-ratio of the segment
An example of a *.seg file is shown below.
Germline-Specific (Depth-Only) Segmentation Output Files
The *.seg.called file is identical to the *.seg file, with an additional column indicating the initial call for whether the segment is a duplication + or a deletion -.
The *.seg.called.merged file is identical to the *.seg.called file but with segments potentially merged when they meet internal merging criteria. In addition to the columns described above, this file also has the following columns:
QUAL
FILTER
Copy number assignment
Ploidy
Improper_Pairs count
B-allele segmentation file
In addition to segmentation of target counts, some workflows perform segmentation of B-allele loci. The output file has suffix *.baf.seg and it has the same format as the *.seg file with two modifications. Firstly, the Segment_Mean value is the mean over B-allele loci of the smaller observed allele fraction. Secondly, there is an additional column:
BAF_SLM_STATE: Integer between 0 and 10, indicating bins of minor-allele fraction (low to high), or "." when the BAF data are too variable to estimate a minor-allele fraction
An example of segmentation output file is shown below:
Model identification file
In somatic callers the file *.cnv.purity.coverage.models.tsv describes the different tested models and their log-likelihood. It has columns:
Model purity (Cellularity)
Model diploid coverage
Model log-likelihood
Model ploidy
Failed constraints
An example is shown below:
In the germline WGS caller the file *.cnv.coverage.models.tsv serves the same purpose. However, since germline analysis has no concept of tumor purity, the first column is set to the default value of 1.
Visualization files
To generate additional equivalent bigWig and gff files, set the --cnv-enable-tracks option to true. These files can be loaded into IGV along with other tracks that are available, such as RefSeq genes. Using these tracks alongside publicly available tracks allows for easier interpretation of calls. DRAGEN autogenerates an IGV session XML file if tracks are generated by DRAGEN CNV. The *.cnv.igv_session.xml can be loaded directly into IGV for analysis.
The following IGV tracks are automatically populated in the output IGV session file:
*.target.counts.bw - BigWig representation of the target counts bins. Setting the track view in IGV to barchart or points is recommended. Values are GC-corrected if GC correction was performed.
*.improper_pairs.bw - BigWig representation of the improper pairs counts. Setting the track view in IGV to barchart is recommended.
*.tn.bw - BigWig representation of the tangent normalized signal. Setting the track view in IGV to points is recommended.
*.seg.bw - BigWig representation of the segments. Setting the track view in IGV to points is recommended.
*.baf.seg.bw - BigWig representation of the BAF segments (if available). Setting the track view in IGV to points is recommended.
*.baf.bedgraph.gz - Bedgraph representation of B-allele frequency (if available). Setting the track view in IGV to points is recommended.
*.cnv.gff3 - GFF3 representation of the CNV events. DEL events show as blue and DUP events show as red. Filtered events are a light gray. If REF events are enabled, they show up as green. When the caller can call AOH/LOH events, they show up as magenta.
An example of DRAGEN CNV gff3 is shown below (different CNV workflows might output different attributes on the 9th column):
For somatic WGS analyses, the following additional files are included in the IGV session xml:
*.tumor.baf.bedgraph.gz - Bedgraph representation of the B-allele frequencies. Setting the track view in IGV to points and windowing function to none is recommended.
IGV Session

File extension: *.igv_session.xml
The IGV session XML file is prepopulated with track files generated by DRAGEN. The session file loads the reference genome that best matches the standard reference genomes in an IGV installation, by comparing the name of the --ref-dir specified on the command-line. Standard UCSC human reference genomes are autodetected, but any variations from the standard reference genomes might not be autodetected. To edit the genome detection, alter the genome attribute in the Session element to the reference genome you would like for analysis before loading into IGV. The reference identifier used by IGV might differ from the actual name of the genome. The following is an example edited session file.
Note that depending on the IGV version installed, it may come prepackaged with different flavors of GRCh37. The reference naming conventions have changed so a user may have to edit the genome field in the XML file directly. For example, IGV has traditionally packaged a b37 reference genome, but may also include a 1kg_v37 or a 1kg_b37+decoy, which will appear on the IGV user interface as "1kg, b37" or "1kg, b37+decoy" respectively.
You can determine the correct encoding of a reference genome by going to File > Save Session... and then inspecting the generated igv_session.xml file.
When the Cytogenetics Modality is enabled, DRAGEN CNV produces an additional IGV session xml *.cyto.igv_session.xml shown below. Please see Cytogenetics Modality for a description of the different tracks on this file.

Creating CNV coverage and BAF plots with third-party tools
DRAGEN CNV outputs can be ingested using third-party libraries in most commonly used languages, such as Python and R. The typically used files are:
*.target.counts.gz or *.target.counts.gc-corrected.gz, containing the number of alignments, or corrected alignments, per interval. Used to display the coverage profile across all intervals.
*.tn.tsv.gz, containing the log2-normalized copy ratio per interval.
*.baf.bedgraph.gz, if BAF is available, containing the BAF for each considered site. Used to display the BAF profile across all sites.
In all previously specified files, the format is similar to BED, allowing them to be loaded as any other tab-separated files.
Using R, a good starting point is the karyoploteR package. The main workflow involves reading the *.target.counts.gz file as an R dataframe, converting it to a GRanges object, and then plotting the target intervals as points with the karyoploteR package. The same workflow can be used to plot the GC-corrected counts, the log2 normalized copy ratios and the BAF profiles.

Using Python, the workflow is similar to R's but using Python's libraries such as pandas, to convert DRAGEN output files to dataframe, and matplotlib, to plot coverage and BAF profiles across the genome.
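As a rough illustration of the Python route, the sketch below loads a BED-like gzipped file with the standard library and draws a genome-wide scatter plot with matplotlib (assumed to be installed; pandas.read_csv with comment="#" would be an alternative loader). The function names, the default value column, and the concatenated x axis are choices made for this example, not DRAGEN conventions.

```python
import gzip
from collections import defaultdict


def load_profile(path, value_column=4):
    """Group (midpoint, value) pairs by contig from a BED-like gzipped file.

    Use value_column=4 for *.tn.tsv.gz or *.target.counts.gz and
    value_column=3 for *.baf.bedgraph.gz.
    """
    profile = defaultdict(list)
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#") or len(fields) <= value_column:
                continue
            try:
                start, stop = int(fields[1]), int(fields[2])
                value = float(fields[value_column])
            except ValueError:
                continue  # column-name line or non-numeric value
            profile[fields[0]].append(((start + stop) / 2, value))
    return profile


def plot_profile(profile, out_png, ylabel="log2 copy ratio"):
    """Scatter every contig side by side along a concatenated x axis."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(12, 3))
    offset = 0.0
    for contig, points in profile.items():
        xs = [offset + x for x, _ in points]
        ax.scatter(xs, [y for _, y in points], s=2, label=contig)
        offset = max(xs) + 1  # next contig starts after this one
    ax.set_xlabel("genomic position (contigs concatenated)")
    ax.set_ylabel(ylabel)
    fig.savefig(out_png, dpi=150)
    plt.close(fig)
```

A typical call would be plot_profile(load_profile("sample.tn.tsv.gz"), "coverage.png"), where the file name is a placeholder.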
A similar workflow can be used to plot copy number calls (and minor copy number calls, if available) by using the *.cnv.gff3 output file. Some examples of DRAGEN output GFF3 are shown below:
Germline WGS
Somatic WGS
From the output GFF3, the typical steps to follow are to parse each segment coordinates and the CopyNumber annotation (or any other annotation the user might want to plot), and to plot them using the libraries listed previously for coverage/BAF profiles (or any other library and language of user's choice).
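The parsing step can be sketched as follows in Python. It relies only on the standard GFF3 layout (nine tab-separated columns, with key=value pairs separated by semicolons in the ninth) and on the CopyNumber annotation named above; the function name is hypothetical and the attribute value is kept as a string because its exact formatting may differ between workflows.

```python
def parse_gff3_segments(lines, attribute="CopyNumber"):
    """Extract segment coordinates and one attribute from GFF3 lines."""
    segments = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF3 directives/comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a feature line
        attrs = dict(
            item.split("=", 1) for item in cols[8].split(";") if "=" in item
        )
        segments.append({
            "contig": cols[0],
            "type": cols[2],
            "start": int(cols[3]),  # GFF3 coordinates are 1-based, inclusive
            "end": int(cols[4]),
            attribute: attrs.get(attribute),  # None if the attribute is absent
        })
    return segments
```

Each returned segment can then be drawn as a horizontal line from start to end at the height of its copy number.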
Excluded Intervals File
To improve accuracy, the DRAGEN CNV Pipeline excludes genomic intervals if one or more of the target intervals failed at least one quality requirement. The excluded intervals are reported to *.cnv.excluded_intervals.bed.gz file. The file has a bed format, identifies the regions of the genome that are not callable for CNV analysis and describes the reason intervals were excluded in the fourth column. The following are the possible reasons for exclusion.
Reason | Description | Option
NON_KMER_UNIQUE | Non-unique k-mer bases are larger than 50% of interval. | Not applicable. This reason only applies to self-normalization mode.
EXCLUDE_BED | Interval overlaps with exclude BED larger than threshold. | --cnv-exclude-bed-min-overlap
PON_MAX_PERCENT_ZERO_SAMPLES | Number of PON samples with 0 coverage is larger than threshold. | --cnv-max-percent-zero-samples
PON_TARGET_FACTOR_THRESHOLD | Median coverage of interval is lower than threshold of overall median coverage. | --cnv-target-factor-threshold
PON_MISSING_INTERVAL | Target interval not found in PON. | Not applicable
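To see how much of the genome each reason removes, the excluded intervals file can be summarized with a short script. This Python sketch assumes BED coordinates (so interval length is stop minus start) and a single reason string in the fourth column; the function name is hypothetical.

```python
import gzip
from collections import Counter


def excluded_bases_by_reason(path):
    """Sum excluded interval lengths per exclusion reason (BED column 4)."""
    totals = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#") or len(fields) < 4:
                continue
            try:
                length = int(fields[2]) - int(fields[1])
            except ValueError:
                continue  # header-like line without numeric coordinates
            totals[fields[3]] += length
    return totals
```

Calling totals.most_common() on the result lists the reasons from the largest excluded span to the smallest.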
An example of a *.cnv.excluded_intervals.bed.gz file is shown below:
Excluded Samples File
To improve accuracy, the DRAGEN CNV Pipeline excludes panel of normals samples if one or more of the samples failed at least one quality requirement. The excluded samples are reported to *.cnv.excluded_samples.txt.gz file. The file has a tsv (tab separated) format, identifies the excluded panel of normals samples and describes the reason. The following are the possible reasons for exclusion.
Reason | Description | Option
PON_SAMPLE_NAME_EQUAL_TO_CASE | PON sample name is equal to case sample name | NA
PON_SAMPLE_CORRELATION_EQUAL_TO_CASE | PON sample counts are equal to case sample counts | NA
PON_MAX_PERCENT_NAN_SAMPLES | Number of NaN values in sample is higher than threshold | --cnv-max-percent-nan-samples (default=50)
MAX_PERCENT_ZERO_TARGETS | Number of 0 target counts in sample is higher than threshold | --cnv-max-percent-zero-targets (default=5)
EXTREME_PERCENTILE:UPPER | Median coverage of sample is higher than threshold | --cnv-extreme-percentile (default=2.5)
EXTREME_PERCENTILE:LOWER | Median coverage of sample is lower than threshold | --cnv-extreme-percentile (default=2.5)
An example of a *.cnv.excluded_samples.txt.gz file is shown below:
The excluded samples output file may not exist if there are no excluded samples.
Panel of Normals Files
PON Metrics File
The DRAGEN CNV Pipeline generates the PON Metrics File (.cnv.pon_metrics.tsv.gz) if a Panel of Normals is provided and --cnv-generate-pon-metric-file is set to true. If PON size is less than 2, then an empty file will be generated.
The PON Metric File includes basic statistics of the coverage profile for each interval. To remove sample coverage bias, DRAGEN applies sample median normalization, and then computes the following metrics:
Column | Name | Description
1 | contig | chromosome name
2 | start | genomic locus of interval start
3 | stop | genomic locus of interval stop
4 | name | interval name
5 | mean | average coverage depth
6 | std | standard deviation
7 | normalizedStd | normalized standard deviation (std/mean)
8 | min | minimum
9 | 25% | 25th percentile
10 | 50% | median
11 | 75% | 75th percentile
12 | max | maximum
13 | intervalSize | interval size (stop-start)
14 | gcContents | percent GC
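The statistical columns can be illustrated with a short Python sketch: each sample is divided by its own median, and the metrics are then computed per interval across samples. This is an approximation for illustration only; the function name, the use of the sample standard deviation, and the quartile interpolation method are assumptions, and intervalSize and gcContents are omitted because they do not depend on the counts.

```python
import statistics


def pon_interval_metrics(counts_by_sample):
    """Per-interval statistics across median-normalized PON samples.

    counts_by_sample is a list with one list of interval counts per sample;
    all samples must cover the same intervals in the same order.
    """
    if len(counts_by_sample) < 2:
        return []  # mirrors the empty file produced for a PON of size < 2
    normalized = []
    for counts in counts_by_sample:
        sample_median = statistics.median(counts)  # assumed to be non-zero
        normalized.append([c / sample_median for c in counts])
    metrics = []
    for values in zip(*normalized):  # one tuple per interval across samples
        mean = statistics.fmean(values)
        std = statistics.stdev(values)
        q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
        metrics.append({
            "mean": mean,
            "std": std,
            "normalizedStd": std / mean if mean else float("nan"),
            "min": min(values),
            "25%": q25,
            "50%": q50,
            "75%": q75,
            "max": max(values),
        })
    return metrics
```

Intervals with a high normalizedStd vary strongly between normal samples even after median normalization.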
Example:
PON Correlation File
The DRAGEN CNV Pipeline generates the PON Correlation File (.cnv.pon_correlation.txt.gz) if a Panel of Normals is provided. The PON Correlation File includes correlation between CASE sample and each PON sample.
Example:
SegDups Extension Files
The SegDups extension provides intermediate and final outputs. All intervals follow the bed format (0-based, start inclusive, end exclusive) and they are in tab-delimited text files (gzip compressed).
The final output has extension .cnv.segdups.rescued_intervals.tsv.gz, and contains the rescued target intervals which can then be injected before segmentation. It has columns:
Chromosome name
Start - 0-based inclusive
Stop - 0-based exclusive
Target interval name prefixed with "target-wgs-"
Sample Counts (in header, identifier taken from RGSM) - log2-scale normalized counts for each interval
Improper pairs - Kept for compatibility with CNV workflow, set to 0 for rescued intervals
Target region ID - ID of the target region (aka pair of rescued target intervals)
Intermediate files
The joint normalized coverage profile (log2-scale) for each region is provided in output to file .cnv.segdups.joint_coverage.tsv.gz with columns:
Target region ID
Joint normalized coverage (log2-scale) of the two intervals in the target region
Copy Number Float - estimate of joint copy number for the target region (e.g., CNF ~ 4)
The differentiating sites' data is provided in output to file .cnv.segdups.site_ratios.tsv.gz with columns:
Differentiating site name
Target (gene A) counts at site
Non-target (gene B) counts at site
Target ratio: gene A counts over total (i.e., gene A + gene B) counts at site