
Copy Number Variant Calling

The DRAGEN Copy Number Variant (CNV) Pipeline can call CNV events using next-generation sequencing (NGS) data. This pipeline supports multiple applications in a single interface via the DRAGEN Host Software, including processing of whole-genome sequencing (WGS) data and whole-exome sequencing (WES) data for germline analysis.

The DRAGEN CNV pipeline supports two normalization modes of operation. The two modes apply different normalization techniques to handle biases that differ based on the application, for example, WGS versus WES. While the default option settings attempt to provide the best trade-off in terms of speed and accuracy, a specific workflow may require more finely tuned option settings.

CNV Workflow

The DRAGEN CNV pipeline follows the workflow shown in the following figure.

DRAGEN CNV Pipeline Workflow

The DRAGEN CNV Pipeline uses many aspects of the DRAGEN secondary analysis available in other pipelines, such as hardware acceleration and efficient I/O processing. To enable CNV processing in the DRAGEN Host Software, set the --enable-cnv command line option to true.

The CNV pipeline has the following processing modules:

Target Counts --- Binning of the read counts and other signals from alignments.
Bias Correction --- Correction of intrinsic system biases.
Normalization --- Detection of normal ploidy levels and normalization of the case sample.
Segmentation --- Breakpoint detection via segmentation of the normalized signal.
Calling / Genotyping --- Thresholding, scoring, qualifying, and filtering of putative events as copy number variants.

The normalization module can optionally take in a panel of normals (PoN), which is used when cohort or population samples are readily available. Note that PoN normalization is not available for somatic WGS analysis. All other modules are shared between the different CNV algorithms.

Signal Flow Analysis

The following figures show a high-level overview of the steps in the DRAGEN CNV Pipeline as the signal traverses through the various stages. These figures are examples and are not identical to the plots that are generated from the DRAGEN CNV Pipeline.

The first step in the DRAGEN CNV Pipeline is the target counts stage. The target counts stage extracts signals such as read count and improper pairs and puts them into target intervals.

Read Count Signal

Improper Pairs Signal

Next, the case sample is normalized against the panel of normals or against the estimated normal ploidy level. Any other biases are subtracted out of the signal to amplify any event level signals.

Normalization

The normalized signal is then segmented using one of the available segmentation algorithms. Events are then called from the segments.

Segments

Called Events

The events are then scored and emitted in the output VCF.

CNV Pipeline Options

The following are the top-level options that are shared with the DRAGEN Host Software to control the CNV pipeline. You can input a BAM or CRAM file into the CNV pipeline. If you are using the DRAGEN mapper and aligner, you can use FASTQ files.

--bam-input --- The BAM file to be processed.
--cram-input --- The CRAM file to be processed.
--enable-cnv --- Enable or disable CNV processing. Set to true to enable CNV processing.
--enable-map-align --- Enables the mapper and aligner module. The default is true, so all input reads are remapped and aligned unless this option is set to false.
--fastq-file1, --fastq-file2 --- FASTQ file or files to be processed.
--output-directory --- Output directory where all results are stored.
--output-file-prefix --- Output file prefix that will be prepended to all result file names.
--ref-dir --- The DRAGEN reference genome hashtable directory.

Output and Filtering Options

The output and filtering options control the CNV output files.

--cnv-exclude-bed --- Specifies a BED file that indicates the intervals to exclude from the CNV analysis. If the fraction of a target interval that overlaps regions in the exclude BED file is greater than cnv-exclude-bed-min-overlap, the target interval is suppressed.
--cnv-exclude-bed-min-overlap --- Specifies the overlap fraction between a target interval and the excluded regions above which the target interval is filtered. The default is 0.5.
--cnv-enable-ref-calls --- Emit copy neutral (REF) calls in the output VCF file. The default is true for single WGS CNV analysis.
--cnv-enable-tracks --- Generate track files that can be imported into IGV for viewing. When this option is enabled, a *.gff file for the output variant calls is generated, as well as *.bw files for the tangent normalized signal. The default is true.
--cnv-filter-bin-support-ratio --- Filters out a candidate event if the span of supporting bins is less than the specified ratio with respect to the overall event length. This filter only applies to records with length greater than cnv-filter-bin-support-ratio-min-len. The default ratio is 0.2 (20% support). For example, if a called event has a length of 100,000 bp, but the target interval bins that support the call span a total of only 15,000 bp (15,000/100,000 = 0.15), then the event is filtered out. If applied, the record will have cnvBinSupportRatio as a filter.
--cnv-filter-bin-support-ratio-min-len --- Minimum length of candidate event at which to apply cnv-filter-bin-support-ratio. Currently only applied to germline WGS workflows, with default value of 80,000 bp.
--cnv-filter-copy-ratio --- Specifies the minimum copy ratio (CR) threshold value centered about 1.0 at which a reported event is marked as PASS in the output VCF file. The default value is 0.2, which leads to calls with CR between 0.8 and 1.2 being filtered. If applied, the record will have cnvCopyRatio as a filter.
--cnv-filter-length --- Specifies the minimum event length in bases at which a reported event is marked as PASS in the output VCF file. The default is 10000. If applied, the record will have cnvLength as a filter.
--cnv-filter-qual --- Specifies the QUAL value at which a reported event is marked as PASS in the output VCF file. You should adjust the parameter value according to your own application data. If applied, the record will have cnvQual as a filter.
--cnv-min-qual --- Specifies the minimum reported QUAL. The default is 3.
--cnv-max-qual --- Specifies the maximum reported QUAL. The default is 200.
--cnv-qual-length-scale --- Specifies the bias weighting factor to adjust QUAL estimates for segments with longer lengths. This is an advanced option and should not need to be modified. The default is 0.9303 (2^-0.1).
--cnv-qual-noise-scale --- Specifies the bias weighting factor to adjust QUAL estimates based on sample variance. This is an advanced option and should not need to be modified. The default is 1.0.
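
The interaction of the filtering options above can be illustrated with a minimal Python sketch. This is not DRAGEN's implementation: the `CnvEvent` class, the `cnv_filters` function, the exact boundary handling (strict versus inclusive comparisons), and the `filter_qual` value of 10 are assumptions made for illustration only. The other defaults and the filter names are taken from the option descriptions.

```python
from dataclasses import dataclass


@dataclass
class CnvEvent:
    """Hypothetical container for one candidate non-reference event."""
    length_bp: int       # event length in bases
    supporting_bp: int   # total span of the bins that support the call
    copy_ratio: float    # linear copy ratio, 1.0 means copy neutral
    qual: float          # QUAL score of the event


def cnv_filters(event, *, min_length=10_000, copy_ratio_width=0.2,
                bin_support_ratio=0.2, bin_support_min_len=80_000,
                filter_qual=10.0):
    """Return the filter names that apply to the event, or ["PASS"].

    filter_qual is an assumed value; the source does not state a default.
    """
    filters = []
    if event.length_bp < min_length:
        filters.append("cnvLength")
    # Copy ratios too close to 1.0 (0.8 to 1.2 with the default) are filtered.
    if abs(event.copy_ratio - 1.0) < copy_ratio_width:
        filters.append("cnvCopyRatio")
    if event.qual < filter_qual:
        filters.append("cnvQual")
    # The bin support filter only applies to sufficiently long events.
    if (event.length_bp > bin_support_min_len
            and event.supporting_bp / event.length_bp < bin_support_ratio):
        filters.append("cnvBinSupportRatio")
    return filters or ["PASS"]


# The 100,000 bp event with 15,000 bp of supporting bins from the text above.
print(cnv_filters(CnvEvent(100_000, 15_000, 1.5, 50.0)))
```

With these assumptions, the worked example from the cnv-filter-bin-support-ratio description receives only the cnvBinSupportRatio filter, because 0.15 is below the 0.2 default ratio.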

CNV Pipeline Input

Reference Hashtable

For the DRAGEN CNV pipeline, the hashtable must be generated with the --enable-cnv option set to true, in addition to any other options required by other pipelines. When --enable-cnv is true, DRAGEN generates an additional k-mer uniqueness map that the CNV algorithm uses to counteract mappability biases. You only need to generate the k-mer uniqueness map file one time per reference hashtable. The generation takes about 1.5 hours per whole human genome.

The following example command generates a hashtable.

Generate an Alignment File

The following command-line examples show how to run the DRAGEN map/align pipeline depending on your input type. The map/align pipeline generates an alignment file in the form of a BAM or CRAM file that can then be used in the DRAGEN CNV Pipeline.

You need to generate alignment files for all samples that have not already been mapped and aligned, including any samples to be used as references for normalization. Each sample must have a unique sample identifier. Use the --RGSM option to specify the identifier. For BAM and CRAM input files, the sample identifier is taken from the file, so the --RGSM option is not required.

The following example command maps and aligns a FASTQ file:

The following example command maps and aligns an existing BAM file:

The following example command maps and aligns an existing CRAM file:

Streaming Alignments

DRAGEN can map and align FASTQ samples, and then directly stream them to downstream callers, such as the CNV Caller and the Haplotype Variant Caller. You can use this process to skip generation of a BAM or CRAM file, which bypasses the need to store additional files.

To stream alignments directly to the DRAGEN CNV pipeline, run the FASTQ sample through a regular DRAGEN map/align workflow, and then provide additional arguments to enable CNV. The following example command line maps and aligns a FASTQ file, and then sends the file to the Germline CNV WGS pipeline.

Target Counts

The target counts stage is the first processing stage for the DRAGEN CNV pipeline. This stage bins the alignments into intervals. The primary analysis format for CNV processing is the target counts file, which contains the feature signals that are extracted from the alignments to be used in downstream processing. The binning strategy, interval sizes, and their boundaries are controlled by the target counts generation options, and the normalization technique used.

When working with whole genome sequence data, the intervals are autogenerated from the reference hashtable. Only the primary contigs from the reference hashtable are considered for binning. You can specify additional contigs to bypass with the --cnv-skip-contig-list option.

With whole exome sequence data, DRAGEN uses the target BED file supplied with the --cnv-target-bed option to determine the intervals for analysis.

The target counts stage generates a .target.counts.gz file. You can use the file later in place of any BAM or CRAM by specifying the file with the --cnv-input option for the normalization stage. The .target.counts.gz file is an intermediate file for the DRAGEN CNV pipeline and should not be modified.

Whole Genome

If the samples are whole genome, then the effective target intervals width is specified with the --cnv-interval-width option. The higher the coverage of a sample, the higher the resolution that can be detected. This option is important when running with a panel of normals because all samples must have matching intervals. For self-normalization, the actual width of a given target interval might be larger than the specified value.

The default value for WGS is 1000 bp with a sample coverage of ≥ 30x.

Using a cnv-interval-width of less than 250 bp for WGS analysis can drastically increase runtime.

The intervals are autogenerated for every primary contig in the reference. Only references that have the UCSC or GRC convention are supported. For example, chr1, chr2, chr3, ..., chrX, chrY or 1, 2, 3, ..., X, Y. You can specify a list of contigs to skip by using the --cnv-skip-contig-list option. This option takes a comma-separated list of contig identifiers. The contig identifiers must match the reference hashtable that you are using. By default, only the mitochondrial chromosomes are skipped. Non-primary contigs are never processed.

For example, to skip chromosomes M, X, and Y on a reference that uses the UCSC naming convention, set --cnv-skip-contig-list to chrM,chrX,chrY.

Whole Exome

If the samples are whole exome samples, supply a target BED file with the --cnv-target-bed $TARGET_BED option. The intervals in the target BED file indicate regions where alignments are expected based on the target capture kit. The BED file intervals are further split into intervals of smaller size, depending on the value of cnv-interval-width.

To use a standard BED file, make sure that there is no header present in the file. In this case, all columns after the third column are ignored, similar to the operation of DRAGEN Variant Caller.

Target Counts Options

The following options control the generation of target counts.

--cnv-counts-method --- Specifies the counting method for an alignment to be counted in a target bin. Values are midpoint, start, or overlap. The default value is overlap when using the panel of normals approach, which means if an alignment overlaps any part of the target bin, the alignment is counted for that bin. In the self-normalization mode, the default counting method is start.
--cnv-min-mapq --- Specifies the minimum MAPQ for an alignment to be counted during target counts generation. The default value is 3 for self-normalization and 20 otherwise. When generating counts for panel of normals, all MAPQ0 alignments are counted.
--cnv-target-bed --- Specifies a properly formatted BED file that indicates the target intervals to sample coverage over. For use in WES analysis.
--cnv-interval-width --- Specifies the width of the sampling interval for CNV processing. This option controls the effective window size. The default is 1000 for WGS analysis and 500 for WES analysis.
--cnv-skip-contig-list --- Specifies a comma-separated list of contig identifiers to skip when generating intervals for WGS analysis. The default contigs that are skipped, if not specified, are chrM,MT,m,chrm.
--cnv-filter-duplicate-alignments --- Filters duplicate-marked alignments during target counts generation if set to true. The default setting is false.
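
The difference between the three --cnv-counts-method values can be sketched as follows. This is an illustrative sketch only: the function name `bins_for_alignment`, the 0-based half-open coordinates, and the integer midpoint are assumptions, not DRAGEN's implementation.

```python
def bins_for_alignment(aln_start, aln_end, bins, method="overlap"):
    """Return the indices of the target bins that count this alignment.

    Coordinates are assumed to be 0-based and half-open: [start, end).
    """
    if method == "overlap":
        # Counted in every bin that the alignment touches.
        return [i for i, (b_start, b_end) in enumerate(bins)
                if aln_start < b_end and b_start < aln_end]
    if method == "start":
        point = aln_start
    elif method == "midpoint":
        point = (aln_start + aln_end) // 2
    else:
        raise ValueError(f"unknown counting method: {method}")
    # Counted only in the bin that contains the chosen position.
    return [i for i, (b_start, b_end) in enumerate(bins)
            if b_start <= point < b_end]


bins = [(0, 1000), (1000, 2000)]
for method in ("start", "midpoint", "overlap"):
    # A 150 bp alignment that straddles the boundary between the two bins.
    print(method, bins_for_alignment(990, 1140, bins, method))
```

An alignment that straddles a bin boundary is counted once under start and midpoint, but in both bins under overlap.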

Target counts options are recorded in the header of each counts file to facilitate review and validation of a panel of normals. If the PON counts are generated with different count options than the case sample, DRAGEN returns an option validation error.

Filter Duplicate Alignments

PCR duplicates are often considered noise in coverage depth information. The --cnv-filter-duplicate-alignments option controls whether duplicate-marked alignments are excluded when counting alignments. This option relies on the duplicate bit (0x400) of the SAM flag being set correctly on the alignments.

If --enable-map-align=false, then duplicate marking should be present in the input file (pre-aligned BAM/CRAM). If --enable-map-align=true, then --enable-duplicate-marking=true should be set.
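
As a small illustration of the flag check, the following sketch tests the duplicate bit of a SAM flag. The bit value 0x400 comes from the SAM format; the function name `is_counted` and its parameter are hypothetical.

```python
DUPLICATE_FLAG = 0x400  # SAM flag bit for a PCR or optical duplicate


def is_counted(sam_flag, *, filter_duplicates=False):
    """Return True if the alignment contributes to the target counts."""
    if filter_duplicates and sam_flag & DUPLICATE_FLAG:
        return False
    return True


# Flag 1107 has the duplicate bit set; flag 99 does not.
print(is_counted(1107, filter_duplicates=True))
print(is_counted(99, filter_duplicates=True))
```

If the input was never duplicate marked, no alignment has the bit set and enabling the filter has no effect.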

Note that the CNV caller waits for duplicate marking from the map/aligner, which may increase overall run time.

Target Counts Dropout Regions

In the WGS case where a BED file is not specified for a given reference, the same intervals should be generated each time. The intervals created take into account the mappability of the reference genome using a k-mer uniqueness map created during hashtable generation.

Due to ambiguity that may arise from non-unique genomic loci, only regions corresponding to unique k-mers are considered. A position in the reference genome is marked as a unique k-mer if the k-mer starting at that position does not show up anywhere else in the reference genome (or non-unique, otherwise). Furthermore, if the k-mer contains any bases other than A, C, T or G, it is marked as non-unique.

For WGS samples and in absence of a cnv-target-bed file, the target intervals are autogenerated based on the precomputed k-mer uniqueness map for a given input reference hashtable, and the cnv-interval-width option, which defaults to 1000 bp. The cnv-interval-width option determines the minimum number of unique k-mer positions required in the interval. There is an upper bound to the length of the interval: when the length of the interval is greater than double the size of cnv-interval-width without reaching the required count of unique k-mer positions, the interval is discarded and the process starts again at the next genomic position. Discarded regions are referred to as "dropout" regions and are recorded with the exclusion reason NON_KMER_UNIQUE in the *.cnv.excluded_intervals.bed.gz file.
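
The interval generation rule can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not DRAGEN's implementation: the uniqueness map is modeled as a list of 0/1 values per position, the function name `generate_intervals` is hypothetical, the search is assumed to resume right after a discarded span, and a trailing span that never reaches the required count is treated as a dropout.

```python
def generate_intervals(unique, width):
    """Split one contig into target intervals and dropout regions.

    unique: sequence of 0/1 flags, 1 where the k-mer starting at that
            position is unique in the reference.
    width:  required number of unique k-mer positions per interval.
    Returns (intervals, dropouts) as lists of half-open (start, end) pairs.
    """
    intervals, dropouts = [], []
    start, n = 0, len(unique)
    while start < n:
        count, end, emitted = 0, start, False
        # Grow the interval, but never beyond twice the requested width.
        while end < n and end - start < 2 * width:
            count += unique[end]
            end += 1
            if count == width:
                intervals.append((start, end))
                emitted = True
                break
        if not emitted:
            dropouts.append((start, end))  # NON_KMER_UNIQUE region
        start = end
    return intervals, dropouts


# Toy contig: unique run, half-unique stretch, non-unique stretch, unique run.
unique = [1] * 4 + [0, 1] * 4 + [0] * 8 + [1] * 4
print(generate_intervals(unique, width=4))
```

In the toy input, the half-unique stretch produces an interval that is twice the requested width, and the non-unique stretch becomes a dropout region.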

A dropout region is a complex region that does not count alignments and results in an interval missing from the analysis. Dropout regions include centromeres, telomeres, and low complexity regions. If there is sufficient signal in the flanking regions, an event can still span these dropout regions, even if alignment counting does not occur in the regions. The event is handled by the segmentation stage.

GC Bias Correction

GC bias describes the relationship between GC content and read coverage across a genome. Biases can arise from library prep, capture kits, sequencing system differences, and mapping, and can make CNV events difficult to call. The DRAGEN GC bias correction module attempts to correct these biases.

Typical whole-exome capture kits have over 200,000 targets spanning the regions of interest. If your BED file has fewer than 200,000 targets, or if the target regions are localized to a specific region in the genome (such that GC bias statistics might be skewed), then GC bias correction should be disabled.

The following options control the GC bias correction module.

--cnv-enable-gcbias-correction --- Enable or disable GC bias correction when generating target counts. The default is true.
--cnv-enable-gcbias-smoothing --- Enable or disable smoothing the GC bias correction across adjacent GC bins with an exponential kernel. The default is true.
--cnv-num-gc-bins --- Specifies the number of bins for GC bias correction. Each bin represents the GC content percentage. Allowed values are 10, 20, 25, 50, or 100. The default is 25.
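
The source does not describe the correction algorithm itself. As a rough illustration of how binning by GC content can be used, the sketch below applies a common median-based correction: each target count is rescaled by the ratio of the global median to the median of its GC bin. The function name `gc_correct` and the whole approach are assumptions for illustration, and the smoothing controlled by --cnv-enable-gcbias-smoothing is not modeled.

```python
from statistics import median


def gc_correct(counts, gc_fractions, num_bins=25):
    """Rescale target counts so that every GC bin has the same median.

    counts:       read count per target interval
    gc_fractions: GC content per target interval, between 0.0 and 1.0
    """
    # Assign each target to a GC bin; a GC fraction of 1.0 goes to the last bin.
    bin_of = [min(int(gc * num_bins), num_bins - 1) for gc in gc_fractions]
    global_median = median(counts)
    by_bin = {}
    for count, b in zip(counts, bin_of):
        by_bin.setdefault(b, []).append(count)
    bin_median = {b: median(values) for b, values in by_bin.items()}
    return [count * global_median / bin_median[b] if bin_median[b] > 0 else 0.0
            for count, b in zip(counts, bin_of)]


# Targets with high GC content have half the coverage of the others.
print(gc_correct([100, 100, 50, 50], [0.42, 0.43, 0.70, 0.71]))
```

The sketch also shows why a small or localized target set is problematic: with few targets per GC bin, the per-bin medians are unreliable and the correction can distort real signal.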

Normalization

The DRAGEN CNV pipeline supports two normalization algorithms:

Self-Normalization --- Estimates the autosomal diploid level from the sample under analysis to determine the baseline level to normalize by. Sex chromosomes and PAR regions are handled based on the sample sex.
Panel of Normals --- A reference-based normalization algorithm that uses additional matched normal samples to determine a baseline level from which to call CNV events. Here, matched normal means that the samples have undergone the same library prep and sequencing workflow as the case sample.

Which algorithm to use depends on the available data and the application. Use the following guidelines to select the mode of normalization.

Self-Normalization

Whole genome sequencing
Single sample analysis
Additional matched samples are not readily available
Simpler workflow via a single invocation
Only references with chr1, chr2, chr3, ..., chrX, chrY or 1, 2, 3, ..., X, Y naming conventions are supported.

Panel of Normals

Whole genome sequencing (non-somatic)
Whole exome sequencing
Targeted panels, including somatic panels
Additional matched samples are available
Nonhuman samples

Self Normalization

The DRAGEN CNV pipeline provides the self-normalization mode that does not require a reference sample or a panel of normals. To enable this mode, set --cnv-enable-self-normalization to true. Self-normalization mode bypasses the need to run two stages and can save time. It uses the statistics within the case sample to determine the baseline from which to make a call.

Because self normalization uses the statistics within the case sample, this mode is not recommended for WES or targeted sequencing analysis due to the potential for insufficient data.

The self-normalization mode is the recommended approach for whole-genome sequencing single sample processing. The pipeline continues through to the segmentation and calling stage to produce the final called events.

If you are running from a FASTQ sample, then the default mode of operation is self-normalization.

When operating in self-normalization mode, the --cnv-interval-width option used during the target counts stage becomes the effective interval width based on the number of unique k-mer positions. You typically do not have to modify this option.

Self-normalization autogenerates the target intervals to use during the analysis based on the reference genome and is only compatible with standard human references.

To attempt self-normalization mode on non-standard human references, you can set the override --cnv-bypass-contig-check=true. Under this setting, the CNV caller performs a naive median normalization across all of the contigs within the reference genome. This feature is experimental and for research use only, and no claims or validation are made for its use.

Panel of Normals

The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. This allows the algorithm to subtract system level biases that are not sample specific. The generation of the target counts for these normal samples should also have identical command line options with the case sample under analysis.

In this mode, the DRAGEN CNV Pipeline is broken down into two distinct stages. The target counts stage is performed on each sample, both the case sample and the normals, to bin the alignments. The normalization and call detection stage is then performed with the case sample against the panel of normals to determine the events.

Target Counts Stage

Target counts should be generated for all samples, whether the samples are to be used as references or are the case samples under analysis. The case samples and all samples to be used as a panel of normals sample must have identical intervals and therefore should be generated with identical settings. The target counts stage also performs GC Bias correction. GC Bias correction is enabled by default.

The following examples are for WES processing, which is a case where a panel of normals is required.

The following is an example command for processing a BAM file.

The following is an example command for processing a CRAM file.

Generating Panel of Normals (Combined Counts)

When you run an analysis with a panel of normals (a set of target counts files), a column-wise concatenated version of the panel is output as a *.combined.counts.txt.gz file. To generate this file without running the actual calling step, add the --cnv-generate-combined-counts=true option to the command line. The individual panel of normals target counts files must be specified either via --cnv-normals-file (one option per file) or --cnv-normals-list (a single text file with paths to each sample).

The following is an example command line using a normals list:

Normalization and Call Detection Stage

The next step in the CNV pipeline when using a panel of normals is to perform the normalization and to make the calls. This involves a separate execution of DRAGEN during which the normalization is performed and calls are made. This step requires the specification of a set of target counts files to be used for reference-based median normalization.
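
The idea of reference-based median normalization can be illustrated with a simplified sketch. This is not DRAGEN's algorithm: the function name `normalize_against_panel` is hypothetical, each sample is assumed to be scaled by its own median count to remove depth differences, and the baseline for a target is assumed to be the median of the scaled panel values.

```python
from statistics import median


def normalize_against_panel(case_counts, panel_counts):
    """Return a linear copy ratio per target for the case sample.

    case_counts:  counts per target for the case sample
    panel_counts: list of per-target count lists, one per normal sample,
                  all using the same target intervals as the case sample
    """
    def scale(sample):
        # Remove sequencing depth differences between samples.
        m = median(sample)
        return [count / m for count in sample]

    case = scale(case_counts)
    panel = [scale(sample) for sample in panel_counts]
    ratios = []
    for t, value in enumerate(case):
        baseline = median(sample[t] for sample in panel)
        ratios.append(value / baseline if baseline > 0 else float("nan"))
    return ratios


case = [100, 100, 150, 100, 100]            # third target looks amplified
panel = [[200] * 5, [50] * 5, [100] * 5]    # normals at different depths
print(normalize_against_panel(case, panel))
```

Because the baseline is computed per target, the case sample and every normal must share identical intervals, which is why the target counts settings have to match.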

Ideally the panel of normals samples follow library prep and sequencing workflows that are identical to the workflows of the case sample under analysis. In order to be applicable to both male and female case samples, the panel of normals should include a balanced set of both male and female samples. DRAGEN automatically handles calling on sex chromosomes based on the predicted sex of each sample in the panel.

For optimal bias correction, a minimum of 50 samples is recommended as a panel. DRAGEN can run with smaller numbers of samples in the panel, down to even just a single sample, but smaller panels can result in artifactual calls in the test sample where at least some of the panel samples have copy number changes. Larger panels do not entirely prevent such issues, but they limit it to regions where non-reference copy numbers are common.

The following is an example of PON files, which uses a subset of the GC corrected files from the target counts stage.

DRAGEN accepts 3 different file formats for a Panel of Normals (PON).

The CNV caller can also be started from the *.target.counts.gz (raw counts) or *.target.counts.gc-corrected.gz (GC corrected counts) files of the case sample, by specifying the selected file with the --cnv-input option and the PON options described above. When selecting GC corrected counts the option --cnv-enable-gcbias-correction should be set to false to disable the GC-correction stage.

For example, the following command normalizes the case sample against the panel of normals.

Normalization Options

These options control the preconditioning of the panel of normals and the normalization of the case sample.

--cnv-enable-self-normalization --- Enable/disable self normalization mode, which does not require a panel of normals.
--cnv-extreme-percentile --- Specifies the extreme median percentile value at which to filter out samples. The default is 2.5.
--cnv-input --- Specifies a target counts file for the case sample under analysis when using a panel of normals.
--cnv-normals-file --- Specifies a target.counts.gz file to be used in the panel of normals. You can use this option multiple times, one time for each file.
--cnv-normals-list --- Specifies a text file that contains paths to the list of reference target counts files to be used as a panel of normals. Absolute paths are recommended in case the workflow is used later or shared with other users. Relative paths are supported if the paths are relative to the current working directory.
--cnv-max-percent-zero-samples --- Specifies the maximum percentage of zero coverage samples allowed for a target. If the target exceeds the specified threshold, then the target is filtered out. The default value is 5%. The option is sensitive to the number of normal samples being used, so adjust the threshold accordingly. If your panel of normals is small and the threshold is not adjusted, the option could filter out targets that were not intended to be filtered.
--cnv-max-percent-zero-targets --- Specifies the maximum percentage of zero coverage targets allowed for a sample. If the sample exceeds the specified threshold, then the sample is filtered out. The default value is 2.5%. The option is sensitive to the total number of target intervals, so adjust the threshold accordingly. If the capture kit has a small number of probes and the threshold is not adjusted, the option could filter out samples that were not intended to be filtered.
--cnv-target-factor-threshold --- Specifies the bottom percentile of panel of normals medians to filter out useable targets. The default is 1% for whole genome processing and 5% for targeted sequencing processing.
--cnv-truncate-threshold --- Specifies a percentage threshold for truncating extreme outliers. The default is 0.1%.
--cnv-enable-gender-matched-pon --- Enable/disable gender matched PON normalization. If enabled, DRAGEN uses matched gender PON for sex chromosome normalization. Sex chromosome intervals are filtered if PON has no matched gender sample. The default value is true.
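
The sensitivity of the two zero coverage thresholds to panel size can be seen in a short sketch. The function name `filter_zero_coverage`, the order of the two filters (samples first, then targets), and the inclusive threshold comparison are assumptions for illustration and may differ from DRAGEN's behavior.

```python
def filter_zero_coverage(panel, max_pct_zero_samples=5.0,
                         max_pct_zero_targets=2.5):
    """Return (kept_samples, kept_target_indices) for a panel of normals.

    panel: dict that maps a sample name to its list of per-target counts.
    """
    n_targets = len(next(iter(panel.values())))
    # Drop samples with too many zero coverage targets.
    kept_samples = [
        name for name, counts in panel.items()
        if 100.0 * sum(c == 0 for c in counts) / n_targets
        <= max_pct_zero_targets
    ]
    if not kept_samples:
        return [], []
    # Drop targets with zero coverage in too many of the remaining samples.
    kept_targets = [
        t for t in range(n_targets)
        if 100.0 * sum(panel[s][t] == 0 for s in kept_samples)
        / len(kept_samples) <= max_pct_zero_samples
    ]
    return kept_samples, kept_targets


# Ten normals, 40 targets; a single sample has zero coverage at target 3.
panel = {f"normal{i}": [10] * 40 for i in range(10)}
panel["normal0"][3] = 0
samples, targets = filter_zero_coverage(panel)
print(len(samples), 3 in targets)
```

With only ten normals, a single zero coverage sample already amounts to 10% and removes the target under the 5% default, which is the situation the option description warns about.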

Segmentation

After a case sample has been normalized, the sample goes through a segmentation stage. DRAGEN implements multiple segmentation algorithms, including the following algorithms:

Circular Binary Segmentation (CBS)
Shifting Level Models (SLM)

The SLM algorithm has three variants: SLM, Heterogeneous SLM (HSLM), and Adaptive SLM (ASLM). HSLM is for use in exome analysis and handles target capture kits that are not equally spaced. ASLM includes additional sample-specific estimation of technical variability of depth of coverage, as opposed to changes in copy number. The estimations are based on the median variance within fixed windows or a preliminary set of segments based on b-allele ratios. The ASLM algorithm mitigates over-segmentation due to noisy or wavy samples; this is the default mode for somatic WGS analysis.

By default, SLM is the segmentation algorithm for germline whole genome processing, ASLM is the algorithm for somatic whole genome processing, and HSLM is the algorithm for whole exome processing.

For the targeted sequencing workflows, you can also run with a --cnv-segmentation-bed. The option pre-defines the segments to estimate copy numbers for and skips the segmentation step of the workflow. See Targeted Segmentation (Segment BED) for more information.

--cnv-segmentation-mode --- Specifies the segmentation algorithm to perform. The following values are available.
- bed
- cbs
- slm --- The default for germline WGS analysis.
- aslm --- The default for somatic WGS analysis.
- hslm --- The default for targeted/WES analysis.
--cnv-merge-distance --- Specifies the maximum number of base pairs between two segments that would allow them to be merged. The default value is 0 for germline WGS, which means the segments must be directly adjacent. For WES analysis, this parameter is disabled by default due to the spacing of targeted intervals.
--cnv-merge-threshold --- Specifies the maximum segment mean difference at which two adjacent segments should be merged. The segment mean is represented as a linear copy ratio value. The default is 0.2 for WGS and 0.4 for WES. To disable merging, set the value to 0.
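
The combined effect of --cnv-merge-distance and --cnv-merge-threshold can be sketched as follows. The function name `merge_segments`, the half-open coordinates, and the length-weighted mean of a merged segment are assumptions for illustration, not DRAGEN's implementation.

```python
def merge_segments(segments, merge_distance=0, merge_threshold=0.2):
    """Merge neighboring segments on one contig.

    segments: sorted list of (start, end, mean_copy_ratio) tuples with
              half-open coordinates.
    """
    if merge_threshold == 0 or not segments:
        return list(segments)  # a threshold of 0 disables merging
    merged = [segments[0]]
    for start, end, mean in segments[1:]:
        p_start, p_end, p_mean = merged[-1]
        close_enough = start - p_end <= merge_distance
        similar = abs(mean - p_mean) <= merge_threshold
        if close_enough and similar:
            p_len, length = p_end - p_start, end - start
            # Assumed: the merged mean is weighted by segment length.
            new_mean = (p_mean * p_len + mean * length) / (p_len + length)
            merged[-1] = (p_start, end, new_mean)
        else:
            merged.append((start, end, mean))
    return merged


segments = [(0, 1000, 1.0), (1000, 2000, 1.1), (2000, 3000, 1.6)]
print(merge_segments(segments))
```

In this example, the first two segments are directly adjacent and differ by less than 0.2, so they merge; the third segment differs too much and stays separate.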

Circular Binary Segmentation

Circular Binary Segmentation is implemented directly in DRAGEN and is based on A faster circular binary segmentation algorithm for the analysis of array CGH data¹ with enhancements to improve sensitivity for NGS data. The following options control Circular Binary Segmentation.

--cnv-cbs-alpha --- Specifies the significance level for the test to accept change points. The default is 0.01.
--cnv-cbs-eta --- Specifies the Type I error rate of the sequential boundary for early stopping when using the permutation method. The default is 0.05.
--cnv-cbs-kmax --- Specifies maximum width of smaller segment for permutation. The default is 25.
--cnv-cbs-min-width --- Specifies the minimum number of markers for a changed segment. The default is 2.
--cnv-cbs-nmin --- Specifies the minimum length of data for maximum statistic approximation. The default is 200.
--cnv-cbs-nperm --- Specifies the number of permutations used for p-value computation. The default is 10000.
--cnv-cbs-trim --- Specifies the proportion of data to be trimmed for variance calculations. The default is 0.025.

¹Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23(6):657-663. doi:10.1093/bioinformatics/btl646

Shifting Level Models Segmentation

The Shifting Level Models (SLM) segmentation mode follows from the R implementation of SLMSuite: a suite of algorithms for segmenting genomic profiles².

--cnv-slm-eta --- Baseline probability that the mean process changes its value. The default is 4e-5.
--cnv-slm-fw --- Minimum number of data points for a CNV to be emitted. The default is 0, which means segments with one design probe could in effect be emitted.
--cnv-slm-omega --- Scaling parameter that modulates relative weight between experimental or biological variance. The default is 0.3.
--cnv-slm-stepeta --- Distance normalization parameter. The default is 10000. This option is only valid for HSLM.

Regardless of segmentation method, initial segments are split across large gaps where depth data is unavailable, such as across centromeres.

²Orlandini V, Provenzano A, Giglio S, Magi A. SLMSuite: a suite of algorithms for segmenting genomic profiles. BMC Bioinformatics. 2017;18(1). doi:10.1186/s12859-017-1734-5

Targeted Segmentation (Segment BED)

In applications for targeted panels, you can limit the segmentation and calling performed on intervals by specifying a --cnv-segmentation-bed. For example, the specified intervals might correspond to gene boundaries matched to the targeted assay. This segmentation mode is only supported with the panel of normals and requires an accompanying --cnv-target-bed. Also specify the --cnv-segmentation-bed during the panel of normals generation step, so that all interval boundaries during analysis are matched. For more information on panel of normals generation, see Panel of Normals.

The recommended format for the BED file includes four columns and a header. The four columns are contig, start, stop, and name. The name column represents the name of the gene and must be unique within the BED file. The name is used in the output VCF and annotated as a segment identifier in the INFO/SEGID field. The following example file is in the recommended format:

If using a three-column BED file, then do not include a header or the name field values. Three-column BED files should only include the contig, start, and stop values. In this case, the segment identifier is autogenerated from the coordinate fields.
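Before running DRAGEN, a segment BED can be checked against the two layouts described above. The following Python sketch is an assumption-laden illustration: the function name read_segment_bed, the header labels, and the gene names are hypothetical, and the identifier built for three-column files is only a placeholder, because this guide does not specify the autogenerated format.

```python
import csv
import io


def read_segment_bed(handle):
    """Read a segment BED and return (contig, start, stop, name) tuples.

    Four-column files are expected to start with a header row; three-column
    files are expected to have no header and no name column.
    """
    rows = [r for r in csv.reader(handle, delimiter="\t") if r]
    if not rows:
        return []
    segments = []
    if len(rows[0]) == 4:
        names = set()
        for contig, start, stop, name in rows[1:]:  # skip the header row
            if name in names:
                raise ValueError(f"duplicate segment name: {name}")
            names.add(name)
            segments.append((contig, int(start), int(stop), name))
    else:
        for contig, start, stop in rows:
            # Placeholder identifier; the DRAGEN autogenerated format may differ.
            name = f"{contig}:{start}-{stop}"
            segments.append((contig, int(start), int(stop), name))
    return segments


# Hypothetical four-column file with a header row.
example = "contig\tstart\tstop\tname\nchr1\t100\t200\tGENE_A\nchr1\t300\t400\tGENE_B\n"
print(read_segment_bed(io.StringIO(example)))
```

The uniqueness check mirrors the requirement that each name appears only once, since the name is later used as the segment identifier in the VCF.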

Quality Scoring

Quality scores are computed using a probabilistic model that uses a mixture of heavy tailed probability distributions (one per integer copy number) with a weighting for event length. Noise variance is estimated. The output VCF contains a Phred-scaled metric that measures confidence in called amplification (CN > 2 for diploid locus), deletion (CN < 2 for diploid locus), or copy neutral (CN=2 for diploid locus) events.

The scoring algorithm also calculates exact copy-number quality scores that are inputs to the DeNovo CNV detection pipeline.
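For readers unfamiliar with Phred scaling, the following Python sketch shows the standard conversion between an error probability and a Phred-scaled score. It illustrates the scale only; it does not reproduce the DRAGEN mixture model, and the function names are illustrative.

```python
import math


def phred(p_error):
    """Phred-scaled quality for the probability that a call is wrong."""
    return -10.0 * math.log10(p_error)


def error_probability(qual):
    """Inverse conversion: Phred-scaled quality back to an error probability."""
    return 10.0 ** (-qual / 10.0)


# A 1-in-1000 chance of error corresponds to a score of about 30.
print(round(phred(0.001), 3))
```

On this scale, every increase of 10 in the score corresponds to a tenfold decrease in the estimated probability of an incorrect call.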

Exclude BED Filtering

You can input an exclude BED to the CNV caller to filter out regions from analysis. An exclude BED is useful if certain regions of the genome are known to be problematic due to library prep, sequencing, or mapping issues. You can also exclude intervals that specify common CNVs to aid in downstream analysis. Specify the exclude BED file using cnv-exclude-bed. DRAGEN does not provide an exclude BED. Format the intervals to exclude in standard three-column BED format.

The intervals in the exclude BED are compared with the original target counts intervals. If the overlap is greater than cnv-exclude-bed-min-overlap, the target counts interval is excluded from analysis. The *.target.counts.gz file still includes the interval, so you can inspect the original read counts. The excluded intervals are removed during the normalization stage, so the *.tn.tsv.gz file does not contain them.
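The overlap comparison can be sketched in Python as follows. This is an illustration under stated assumptions: the overlap is taken to be the fraction of the target interval covered by exclude intervals, the exclude intervals are assumed not to overlap each other, and the threshold value used in the example is hypothetical. DRAGEN may define and compute the overlap differently.

```python
def overlap_fraction(target, excluded):
    """Fraction of a target interval (start, end) covered by exclude intervals.

    Intervals are half-open BED coordinates on the same contig, and the
    exclude intervals are assumed not to overlap each other.
    """
    t_start, t_end = target
    covered = 0
    for e_start, e_end in excluded:
        covered += max(0, min(t_end, e_end) - max(t_start, e_start))
    return covered / (t_end - t_start)


def keep_target(target, excluded, min_overlap):
    """Keep the interval unless its overlap is greater than the threshold."""
    return overlap_fraction(target, excluded) <= min_overlap


# Half of this 1 kb target is covered by the exclude interval.
print(overlap_fraction((0, 1000), [(500, 2000)]))
```

With a hypothetical threshold of 0.5, the target above is kept, because its overlap is equal to, not greater than, the threshold.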

Concurrent CNV and Small Variant Calling

DRAGEN can map and align FASTQ samples, and then stream the data directly to downstream callers. If the input is a FASTQ sample, a single sample can run through both the CNV caller and the small variant caller (VC). This triggers self-normalization by default.

Run the FASTQ sample through a regular DRAGEN map/align workflow, and then provide additional arguments to enable the CNV, VC, or both. The options that apply to CNV in the standalone workflows are also applicable here.

The following examples show different commands.

Map/Align FASTQ With CNV

Map/Align FASTQ With VC

Map/Align FASTQ With CNV and VC

BAM Input to CNV and VC

Sample Correlation and Sex Genotyper

When running the target counts stage or the normalization stage, the DRAGEN CNV pipeline also provides the following information about the samples in the run.

A correlation metric of the read count profile between the case sample and any panel of normals samples. A correlation metric greater than 0.90 is recommended for confident analysis, but there is no hard restriction enforced by the software.
The predicted sex of each sample in the run. The sex is predicted based on the read count information in the sex chromosomes and the autosomal chromosomes. The median value for the counts is printed to the screen for the autosomal chromosomes, the X chromosome, and the Y chromosome. This estimation requires a minimum of 300 target intervals on the sex chromosomes to proceed.

The results are printed to the screen when running the pipeline. For example:

The predicted sexes for samples in use are also printed to the *.cnv_metrics.csv output file. For a panel of normals, the predicted sexes are used to determine which panel samples are leveraged for normalization on sex chromosomes. If the estimated sex of the sample is UNDETERMINED, the sex of the sample is set to FEMALE.

You can override the predicted sex of the sample with the --sample-sex option.

Segmental Duplication Extension

The germline CNV workflow can be extended to call copy number alterations in a curated subset of segmentally duplicated regions. Segmental duplications are large blocks of DNA ≥ 1kb, characterized by a high degree of sequence identity at nucleotide level (> 90%). This poses a challenge for traditional approaches, and such regions are usually excluded.

This extension complements the original germline CNV workflow by using a tailored algorithm to compute the normalized coverage in such regions, which is then injected before the segmentation step and becomes part of the main CNV workflow in downstream steps. We currently recommend WGS data aligned to a supported human reference genome (currently only hg38) with at least 30x coverage. See below for additional requirements.

Supported duplications

The following pairs of genes defining Segmental Duplications are included:

Extension requirements

This extension is enabled by default in the germline CNV workflow. However, it requires:

Normalization set to self-normalization (--cnv-enable-self-normalization=true).
GC bias correction enabled (--cnv-enable-gcbias-correction=true).
Counts method set to start (--cnv-counts-method=start).
Interval width not greater than 10kb. However, we recommend using the cnv-interval-width default (1kb) for best performance.
A supported reference genome build as input (currently only hg38-based builds).

If necessary, the extension can be disabled through setting --cnv-enable-segdups-extension to false.

Algorithm

For each duplicated region, the extension collects all reads falling on top of the two homologous intervals of the pair, and it computes the normalized joint coverage (output to *.cnv.segdups.joint_coverage.tsv.gz).
Through differentiating sites between the two homologous intervals, the extension computes the proportion of coverage to associate to the first and to the second interval (output to *.cnv.segdups.site_ratios.tsv.gz).
Such proportion is used to redistribute the joint normalized coverage between the two homologous intervals.
The rescued intervals are output to the *.cnv.segdups.rescued_intervals.tsv.gz file for inspection and they are automatically injected before the segmentation step.
- During integration with the original intervals from the CNV caller, the rescued intervals are considered higher priority, thus replacing all original intervals that they overlap with.
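The redistribution described in the steps above can be sketched in Python as follows. This is a simplified illustration, not the DRAGEN algorithm: it assumes the ratio is obtained by pooling read counts over all differentiating sites, and it works on a linear joint copy number estimate rather than on the log2-scale coverage. The function names and the counts are hypothetical.

```python
def target_ratio(site_counts):
    """Pooled proportion of reads supporting the first interval.

    site_counts is a list of (first_interval_count, second_interval_count)
    pairs, one per differentiating site. Pooling across sites is an
    assumption made for this sketch.
    """
    first = sum(a for a, _ in site_counts)
    second = sum(b for _, b in site_counts)
    return first / (first + second)


def redistribute(joint_copy_number, ratio):
    """Split a joint (linear) copy number estimate between the two intervals."""
    first = joint_copy_number * ratio
    second = joint_copy_number * (1.0 - ratio)
    return first, second


# Hypothetical counts at two differentiating sites.
sites = [(10, 30), (15, 45)]
ratio = target_ratio(sites)
print(redistribute(4.0, ratio))
```

In this example, a joint estimate of 4 copies and a ratio of 0.25 yield 1 copy for the first interval and 3 copies for the second, which corresponds to a loss in one interval and a gain in its homolog.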

CNV Output

DRAGEN emits the calls in the standard VCF format. By default for analyses other than somatic WGS, the VCF file includes only copy number gain and loss events (and LOH events, where allele-specific copy number is available). To include copy neutral (REF) calls in the output VCF, set --cnv-enable-ref-calls to true.

For more information on how to use the output files to aid in debug and analysis, see .

CNV VCF File

File extension: *.cnv.vcf.gz

The CNV VCF file follows the standard VCF format. Due to the nature of how CNV events are represented versus how structural variants are represented, not all fields are applicable. In general, if more information is available about an event, then the information is annotated. Some fields in the DRAGEN CNV VCF are unique to CNVs. The VCF header is annotated with ##source=DRAGEN_CNV to indicate the file is generated by the DRAGEN CNV pipeline.

Header

The following is an example of some of the header lines that are specific to CNV:

The following header lines are specific to somatic WGS CNV calling:

ModelSource The primary basis on which the final tumor model was chosen. The following values can be included:
- DEPTH+BAF: Depth+BAF signal is used to determine tumor model.
- DEPTH+BAF_DOUBLED: The initial depth+BAF model is duplicated based on VAF signal or excess segments at half the expected depth change.
- DEPTH+BAF_DEDUPLICATED: The depth+BAF model is deduplicated based on VAF signal or insufficient segments supporting a duplication.
- DEPTH+BAF_WEAK: Depth+BAF signal is used to determine lower-confidence tumor model.
- VAF: VAF signal is used to determine tumor model due to insufficient depth+BAF signal.
- DEGENERATE_DIPLOID: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. The diploid coverage is set to lowest value observed in a substantial number of bases in segments with BAF=50%.
- SAMPLE_MEDIAN: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. Diploid coverage set to sample median.
EstimatedTumorPurity Estimated fraction of cells in the sample due to tumor. The range of this field is [0, 1] or NA if a confident model could not be determined.
DiploidCoverage Expected read count for a target bin in a diploid region. The numeric value is unlimited.
OverallPloidy Length weighted average of tumor copy number for PASS events. The numeric value is unlimited.
AlternativeModelDedup An alternative to the best model corresponding to one less whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation if the best model might involve a spurious genome duplication.
AlternativeModelDup An alternative to the best model corresponding to one more whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation where the best model might have missed a true genome duplication.
OutlierBafFraction A QC metric that measures the fraction of b-allele frequencies that are incompatible with the segment the BAFs belong to. High values might indicate a mismatched normal, substantial cross-sample contamination, or a different source of a mosaic genome, such as bone marrow transplantation. The range of this field is [0, 1].
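As an illustration of the OverallPloidy definition above, the following Python sketch computes a length-weighted average copy number over PASS events. It is a minimal sketch: the function name and the input tuples are hypothetical, and the exact set of records DRAGEN includes in the calculation may differ.

```python
def overall_ploidy(events):
    """Length-weighted mean copy number over PASS events.

    events is an iterable of (length, copy_number, filter) tuples.
    Returns None when there are no PASS events.
    """
    passing = [(length, cn) for length, cn, flt in events if flt == "PASS"]
    total = sum(length for length, _ in passing)
    if total == 0:
        return None
    return sum(length * cn for length, cn in passing) / total


# Hypothetical events: two PASS events and one filtered event.
events = [(100, 2, "PASS"), (100, 4, "PASS"), (1000, 6, "cnvQual")]
print(overall_ploidy(events))
```

The filtered event is ignored, so the result is the average of copy numbers 2 and 4 weighted by equal lengths.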

Records

All coordinates in the VCF are 1-based.

CHROM

The CHROM column specifies the chromosome (or contig) on which the copy number variant being described occurs.

POS

The POS column is the start position of the variant. According to the VCF specification, if any of the ALT alleles is a symbolic allele, such as <DEL>, then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism.

ID

The ID column is used to represent the event. The ID field encodes the event type and coordinates of the event (1-based, inclusive). In addition to representing GAIN, LOSS and REF events, in Somatic WGS CNV, the ID could include the Copy Neutral Loss of Heterozygosity (CNLOH) or Copy Number Gain with LOH (GAINLOH) events.

REF

The REF column contains an N for all CNV events.

ALT

The ALT column specifies the type of CNV event. Because the VCF contains only CNV events, only the <DEL> or <DUP> entries are used. If REF calls are emitted, their ALT is always a period (.). In Somatic WGS CNV, the ALT field can contain two alleles, such as <DEL>,<DUP>, which allows representation of allele-specific copy numbers if they differ in copy number states.

QUAL

The QUAL column contains an estimated quality score for the CNV event, which is used in hard filtering. Each CNV workflow has different defaults and the value used can be found in the VCF header.

FILTER

The FILTER column contains PASS if the CNV event passes all filters, otherwise the column contains the name of the failed filter. Default values are defined in the header line for each available FILTER.

Available FILTERs:

cnvLength which indicates that the length of the CNV is lower than a threshold.
cnvQual which indicates that the QUAL of the CNV is lower than a threshold.

Germline CNV has the following additional FILTERs:

cnvCopyRatio which indicates that the segment mean of the CNV is not far enough from copy neutral.

Both Germline CNV workflows have the following additional FILTERs:

dinucQual which indicates a CNV call where some of its dinucleotide percentages are outside typical ranges, and thus the call is likely to be a false positive.

Germline WGS CNV has the following additional FILTERs:

cnvBinSupportRatio which indicates, for CNVs greater than 80kb, the percent span of supporting target intervals is lower than a threshold.
highCN which indicates a CNV call with implausible copy number (>6).

Germline WES CNV has the following additional FILTERs:

cnvLikelihoodRatio which indicates that the log10 likelihood ratio of ALT to REF is less than a threshold.

Both Somatic CNV workflows have the following additional FILTERs:

binCount - Filters CNV events with a bin count lower than a threshold.

Somatic WGS CNV has the following additional FILTERs:

lengthDegenerate - Marks records as non-PASSing based on each record's length (REFLEN) when the caller returns the default model. Segments shorter than 1 Mb are assigned this filter when returning the default model.
segmentMean - Marks records as non-PASSing based on each record's segment mean (SM) when the caller returns the default model. Segments having insufficient SM in DELs or DUPs are assigned this filter when returning the default model.

Somatic WES CNV has the following additional FILTERs:

SqQual - Marks records as non-PASSing based on each record's somatic quality (SQ) when the caller returns the default model. Segments having insufficient SQ are assigned this FILTER when returning the default model. SQ is the somatic quality value, which is a Phred-scaled score of the p-value from a 2-sample t-test comparing normalized counts of CASE vs PON.

INFO

The INFO column contains information representing the event.

REFLEN indicates the length of the event.
SVLEN is a signed representation of REFLEN (e.g., a negative value indicates a deletion), and it is only present for non-REF records.
SVTYPE is always CNV and only present for non-REF records.
END indicates the end position of the event (1-based, inclusive).

If using a segment BED file, then the segment identifier is carried over from the input to SEGID field.
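The INFO annotations described above can be read with a few lines of Python. The sketch below is illustrative only: the INFO string is made up for the example rather than taken from real DRAGEN output, the function names are hypothetical, and the classification ignores LOH event types.

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict; flags without a value map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def classify(info_fields):
    """Classify a record from SVLEN, which is absent for REF records."""
    if "SVLEN" not in info_fields:
        return "REF"
    return "LOSS" if int(info_fields["SVLEN"]) < 0 else "GAIN"


# Made-up INFO string for a deletion.
info = parse_info("END=20000;REFLEN=10000;SVLEN=-10000;SVTYPE=CNV")
print(classify(info), info["END"])
```

Because SVLEN is present only for non-REF records, its absence is used here as the signal for a copy neutral call.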

FORMAT

The common FORMAT fields are described in the header:

Germline WGS CNV includes the following FORMAT fields:

Germline WES CNV includes the following FORMAT fields:

Somatic WGS CNV and Somatic WES CNV with ASCN (allele-specific copy number) support include the following FORMAT fields:

Somatic WES CNV without ASCN support provides only the common FORMAT fields and does not include the CN entry, because it does not estimate the tumor purity fraction and therefore cannot estimate the copy number.

Note on genotype annotation in germline copy number calling

Because germline copy number calling determines overall copy number rather than the copy number on each haplotype, the genotype (GT) field contains missing values for diploid regions when CN is greater than or equal to 2. The following are examples of the GT field for various VCF entries:

Coverage Uniformity

The DRAGEN CNV pipeline provides a measure of the quality of the data for a sample. If using the WGS self-normalization method, the additional CoverageUniformity metric is present in the VCF header. The metric is only available for germline samples. The CNV pipeline assumes that post-normalization target counts are independently and identically distributed (IID). Coverage in most high-quality WGS samples is uniform enough for the CNV caller to produce accurate calls, but some samples violate the IID assumption. Issues during library preparation or sample contamination can lead to several extreme outliers and/or waviness of target counts, which can result in a large number of false positive CNV calls. The CoverageUniformity metric quantifies the degree of local coverage correlation in the sample to help identify poor-quality samples.

A larger value for this metric means the coverage in a sample is less uniform, which indicates that the sample has more nonrandom noise, and could be considered poor quality. The CoverageUniformity metric depends on factors other than sample quality, such as the cnv-interval-width setting and sample mean coverage. DRAGEN recommends using this score to compare the quality of samples from similar mean coverage and the same command line options. Because of this, DRAGEN CNV only provides the metric and does not take any action based on it.

CNV Metrics File

DRAGEN CNV outputs metrics in CSV format. The output follows the general convention for QC metrics reporting in DRAGEN. The CNV metrics are output to a file with a *.cnv_metrics.csv file extension. The following list summarizes the metrics that are output from a CNV run.

Sex Genotyper:

The estimated sex of the case sample and of all panel of normals samples is reported. For WGS workflows, the estimated sex karyotype is reported; for non-WGS workflows, the estimated sex is reported.
Confidence score (ranging from 0.0 to 1.0). If the sample sex is specified, this metric is 0.0.

CNV Summary:

Bases in reference genome in use
Average alignment coverage over genome
Number of alignment records processed
- Number of filtered records (total)
- Number of filtered records (due to duplicates)
- Number of filtered records (due to MAPQ)
- Number of filtered records (due to being unmapped)
Coverage MAD
Median Bin Count
Number of target intervals
Number of normal samples
Number of segments
Number of amplifications
Number of deletions
Number of PASS amplifications
Number of PASS deletions

Coverage MAD and Median Bin Count are only printed for WES germline/somatic CNV. Coverage MAD is the median absolute deviation of normalized case counts. Higher values indicate noisier sample data (poor quality). Median Bin Count is the median of raw counts normalized by interval size.
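The median absolute deviation used for Coverage MAD can be computed as in the following Python sketch. It shows the generic, unscaled statistic on made-up values; whether DRAGEN applies any scaling or preprocessing before reporting the metric is not stated in this guide.

```python
from statistics import median


def coverage_mad(values):
    """Median absolute deviation: median of |value - median(values)|."""
    center = median(values)
    return median(abs(v - center) for v in values)


# Made-up normalized counts with one extreme outlier.
counts = [1, 2, 3, 4, 100]
print(coverage_mad(counts))
```

The outlier barely affects the result, which is why the MAD is a robust measure of noise compared with the standard deviation.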

Intermediate and Visualization Files

Intermediate stages of the pipeline produce various intermediate output files. These files can be useful for visualization of the evidence or results from each stage, and may aid in fine-tuning options.

All files have a structure similar to a BED file with optional header line(s).

Target Counts

The file *.target.counts.gz is a compressed tab-delimited text file that contains the number of read counts per target interval. This is the raw signal as extracted from the alignments of the BAM or CRAM file. The format is identical for both the case sample and any panel of normals samples. There is also a bigWig representation of a target.counts.diploid file, which is normalized to the normal ploidy level of 2 instead of raw counts.

It has the following columns:

Contig identifier
Start position
End position
Target interval name
Count of alignments in this interval
Count of improperly paired alignments in this interval

Header lines are also included that start with #. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.
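A target counts file with the columns listed above can be read with the Python standard library. The sketch below is a minimal example under stated assumptions: the function names are hypothetical, the sample lines are made up, and the parser also skips an unprefixed row of column names in case one is present.

```python
import gzip


def parse_target_counts(lines):
    """Parse target counts lines into (contig, start, end, name, count, improper)."""
    rows = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # meta information or blank line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or not fields[1].isdigit():
            continue  # assumed row of column names, or malformed line
        contig, start, end, name = fields[:4]
        rows.append((contig, int(start), int(end), name,
                     float(fields[4]), float(fields[5])))
    return rows


def load_target_counts(path):
    """Read a gzip-compressed target counts file from disk."""
    with gzip.open(path, "rt") as handle:
        return parse_target_counts(handle)


# Made-up lines that mimic the described layout.
sample_lines = [
    "#illustrative meta line\n",
    "contig\tstart\tstop\tname\tSAMPLE\timproper_pairs\n",
    "chr1\t0\t1000\ttarget-1\t250\t3\n",
]
print(parse_target_counts(sample_lines))
```

Counts are parsed as floating-point values so that the same function can be reused for the GC-corrected file, which shares this layout.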

An example of a *.target.counts.gz file is shown below.

B-Allele counts

B-allele counts are calculated at sites in the tumor sample where the normal sample is likely to be heterozygous. When analyzed in conjunction with a matched normal sample, the sites are those that are called as heterozygous SNVs in the normal sample. When analyzed in tumor-only mode, they are taken from a collection of sites that have high-frequency SNVs in the population. Each B-allele site consists of a reference allele and a variant allele, and the number of reads in the tumor sample supporting each of these alleles is counted.

B-allele counts are written both to gzipped tsv file *.ballele.counts.gz and gzipped bedgraph file *.baf.bedgraph.gz.

B-allele tsv

The tsv file format is the following:

Contig identifier
Start, BED-style (zero-based inclusive) start position of the reference allele
Stop, BED-style (zero-based inclusive) stop position of the reference allele
Base sequence for the reference allele
Base sequence for the first allele being counted
Base sequence for the second allele being counted
The number of qualified reads containing a sequence matching the first allele
The number of qualified reads containing a sequence matching the second allele

Additionally, in the case of B-allele sites from a population VCF, the following two additional columns are added after the columns listed above:

Population frequency for the first allele
Population frequency for the second allele

An example of B-allele counts file is provided below:

B-allele bedgraph

The bedgraph file format is similar to the BED format and it has the following columns:

Contig identifier
Start
Stop
Ratio of allele counts

The numerator and denominator of the ratio are determined by sorting the allele counts according to the priority of the corresponding bases. The order of the bases in descending priority is {A, T, G, C}.

When the priority of allele1 is higher than the priority of allele2, the output frequency is calculated by:

When the priority of allele2 is higher than the priority of allele1, the output frequency is calculated by:

By prioritizing the bases in this way, the output frequencies will be deterministically distributed in a roughly equal proportion above and below 0.5. When plotting these B-allele frequencies (e.g., in IGV), this gives an easy way to visually determine significant changes in b-allele frequency between neighboring segments of the genome. It also provides a similar visualization to that typically used for array data.

An example of the bedgraph file is shown below:

Bias correction

The file *.target.counts.gc-corrected.gz contains the number of GC-corrected read counts per target interval. The format is equivalent to the *.target.counts.gz file:

Contig identifier
Start position
End position
Target interval name
GC-corrected read counts in this interval
Count of improperly paired alignments in this interval

Header lines are also included that start with #. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.

An example of a *.target.counts.gc-corrected.gz file is shown below.

Combined counts

The file *.combined.counts.txt.gz is a column-wise concatenation of individual *.target.counts.gz and *.target.counts.gc-corrected.gz used to form the panel of normals.

Normalization

The file *.tn.tsv.gz contains the normalized signal of the case sample, per target interval, i.e., the log2-normalized copy ratio signal. A strong signal deviation from 0.0 indicates a potential for a CNV event. The format is equivalent to the *.target.counts.gz file:

Contig identifier
Start position
End position
Target interval name
Log2-normalized read counts in this interval
Count of improperly paired alignments in this interval

Header lines are also included that start with #.

An example of a *.tn.tsv.gz file is shown below.

Segmentation

File extension: *.seg, *.seg.called, *.seg.called.merged

Files containing the segments produced by the segmentation algorithm. The Segment_Mean value of a segment is the ratio of the mean of that segment to the whole-sample median, without log transformation (linear copy-ratio). A strong signal deviation from 1.0 indicates a potential for a CNV event.
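The log2-normalized signal in *.tn.tsv.gz and the linear Segment_Mean are related by a base-2 exponent, as the following Python sketch shows. The copy number approximation in the last function is an assumption for illustration (it treats the whole-sample median as the normal ploidy level) and is not the DRAGEN calling model.

```python
import math


def log2_to_linear(log2_ratio):
    """Convert a log2 copy ratio (neutral at 0.0) to a linear ratio (neutral at 1.0)."""
    return 2.0 ** log2_ratio


def linear_to_log2(linear_ratio):
    """Convert a linear copy ratio back to log2 scale."""
    return math.log2(linear_ratio)


def approx_copy_number(linear_ratio, ploidy=2):
    """Rough copy number estimate, assuming the sample median sits at `ploidy`."""
    return ploidy * linear_ratio


# A linear ratio of 0.5 is -1.0 on log2 scale and about 1 copy in a diploid region.
print(linear_to_log2(0.5), approx_copy_number(0.5))
```

Under this approximation, a heterozygous deletion in a diploid region appears near 0.5 on the linear scale and near -1.0 on the log2 scale.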

The *.seg file has the following columns:

Sample name
Contig identifier
Start position
End position
Number of intervals in the segment
Linear copy-ratio of the segment

An example of a *.seg file is shown below.

The *.seg.called file is identical to the *.seg file, with an additional column indicating the initial call for whether the segment is a duplication (+) or a deletion (-).

The *.seg.called.merged file is identical to the *.seg.called file, but with segments potentially merged when they meet internal merging criteria. In addition to the columns described above, this file also has the following columns:

QUAL
FILTER
Copy number assignment
Ploidy
Improper_Pairs count

B-allele segmentation (Somatic)

In addition to segmentation of target counts, some workflows perform segmentation of B-allele loci. The output file has suffix *.baf.seg and it has the same format of the *.seg file with two modifications. Firstly, the Segment_Mean value is the mean over B-allele loci of the smaller observed allele fraction. Secondly, there is an additional column:

BAF_SLM_STATE: Integer between 0 and 10, indicating bins of minor-allele fraction (low to high), or . when the BAF data are too variable to estimate a minor-allele fraction.

An example of segmentation output file is shown below:

Model identification (Somatic)

The file *.cnv.purity.coverage.models.tsv describes the different tested models and their log-likelihood. It has columns:

Model purity (Cellularity)
Model diploid coverage
Model log-likelihood
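The models file can be inspected with a short Python sketch such as the one below, which parses the three columns and reports the row with the highest log-likelihood. The function names and the sample rows are hypothetical. Note that the model DRAGEN finally reports can be chosen on other grounds (see the ModelSource header line), so the top-scoring row is not necessarily the selected model.

```python
import csv


def parse_models(lines):
    """Parse rows into (purity, diploid_coverage, log_likelihood) tuples."""
    models = []
    for fields in csv.reader(lines, delimiter="\t"):
        if len(fields) < 3 or fields[0].startswith("#"):
            continue
        try:
            models.append(tuple(float(x) for x in fields[:3]))
        except ValueError:
            continue  # row of column names
    return models


def best_model(models):
    """Return the model with the highest log-likelihood."""
    return max(models, key=lambda m: m[2])


# Made-up rows that mimic the described columns.
model_lines = [
    "purity\tdiploid_coverage\tlog_likelihood",
    "0.5\t100\t-1200.5",
    "0.8\t90\t-1100.0",
]
print(best_model(parse_models(model_lines)))
```

Sorting the parsed models by log-likelihood instead of taking only the maximum is a simple way to review close alternatives during manual investigation.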

An example is shown below:

Visualization

To generate additional equivalent bigWig and gff files, set the --cnv-enable-tracks option to true. These files can be loaded into IGV along with other tracks that are available, such as RefSeq genes. Using these tracks alongside publicly available tracks allows for easier interpretation of calls. DRAGEN autogenerates an IGV session XML file if tracks are generated by DRAGEN CNV. The *.cnv.igv_session.xml file can be loaded directly into IGV for analysis.

The following IGV tracks are automatically populated in the output IGV session file:

*.target.counts.bw --- Bigwig representation of the target counts bins. Setting the track view in IGV to barchart or points is recommended. Values are gc-corrected if gc-correction was performed.
*.improper_pairs.bw --- BigWig representation of the improper pairs counts. Setting the track view in IGV to barchart is recommended.
*.tn.bw --- BigWig representation of the tangent normalized signal. Setting the track view in IGV to points is recommended.
*.seg.bw --- BigWig representation of the segments. Setting the track view in IGV to points is recommended.
*.baf.bedgraph.gz --- BED graph representation of B-allele frequency (if available). Setting the track view in IGV to points is recommended.
*.cnv.gff3 --- GFF3 representation of the CNV events. DEL events show as blue and DUP events show as red. Filtered events are a light gray. If REF events are enabled, then they will show up as green. An example of DRAGEN CNV gff3 is shown below (different CNV workflows might output different attributes on the 9th column):

For somatic WGS analyses, the following additional files are included in the IGV session xml:

*.baf.seq.bw --- BigWig representation of the BAF segments. Setting the track view in IGV to points is recommended.
*.tumor.baf.bedgraph.gz --- Bedgraph representation of the B-allele frequencies. Setting the track view in IGV to points and windowing function to none is recommended.

IGV Session

File extension: *.igv_session.xml

The IGV session XML file is prepopulated with track files generated by DRAGEN. The session file loads the reference genome that best matches the standard reference genomes in an IGV installation, by comparing the name of the --ref-dir specified on the command-line. Standard UCSC human reference genomes are autodetected, but any variations from the standard reference genomes might not be autodetected. To edit the genome detection, alter the genome attribute in the Session element to the reference genome you would like for analysis before loading into IGV. The reference identifier used by IGV might differ from the actual name of the genome. The following is an example edited session file.

Note that depending on the IGV version installed, it may come prepackaged with different flavors of GRCh37. The reference naming conventions have changed so a user may have to edit the genome field in the XML file directly. For example, IGV has traditionally packaged a b37 reference genome, but may also include a 1kg_v37 or a 1kg_b37+decoy, which will appear on the IGV user interface as "1kg, b37" or "1kg, b37+decoy" respectively.

You can determine the correct encoding of a reference genome by going to File > Save Session... and then inspecting the generated igv_session.xml file.

Creating CNV coverage and BAF plots with third-party tools

DRAGEN CNV outputs can be ingested using third-party libraries in commonly used languages such as Python or R. The typically used files are:

*.target.counts.gz or *.target.counts.gc-corrected.gz, containing the number of alignments, or corrected alignments, per interval. Used to display the coverage profile across all intervals.
*.tn.tsv.gz, containing the log2-normalized copy ratio per interval.
*.baf.bedgraph.gz, if BAF is available, containing the BAF for each considered site. Used to display the BAF profile across all sites.

In all previously specified files, the format is similar to BED, allowing them to be loaded as any other tab-separated files.

A similar workflow can be used to plot copy number calls (and minor copy number calls, if available) by using the *.cnv.gff3 output file. Some examples of DRAGEN output GFF3 are shown below:

Germline WGS

Somatic WGS

From the output GFF3, the typical steps to follow are to parse each segment coordinates and the CopyNumber annotation (or any other annotation the user might want to plot), and to plot them using the libraries listed previously for coverage/BAF profiles (or any other library and language of user's choice).
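The parsing step can be sketched in Python with the standard library only. The example line below is hypothetical and is not real DRAGEN output: only the CopyNumber attribute name comes from this guide, while the source, type, and Name values are placeholders. The function names are also illustrative.

```python
def parse_gff3_line(line):
    """Return (seqid, start, end, attributes) for one GFF3 feature line."""
    fields = line.rstrip("\n").split("\t")
    seqid, start, end = fields[0], int(fields[3]), int(fields[4])
    attributes = {}
    for item in fields[8].split(";"):
        if item:
            key, _, value = item.partition("=")
            attributes[key] = value
    return seqid, start, end, attributes


def copy_number_steps(lines):
    """Collect (seqid, start, end, copy_number) for features with CopyNumber."""
    steps = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # directive, comment, or blank line
        seqid, start, end, attrs = parse_gff3_line(line)
        if "CopyNumber" in attrs:
            steps.append((seqid, start, end, float(attrs["CopyNumber"])))
    return steps


# Hypothetical GFF3 content with one feature.
gff_lines = [
    "##gff-version 3\n",
    "chr1\tDRAGEN\tLOSS\t1000\t2000\t.\t.\t.\tName=example;CopyNumber=1\n",
]
print(copy_number_steps(gff_lines))
```

Each resulting tuple gives the horizontal extent and height of one copy number step, which can be passed to a plotting function that draws horizontal line segments.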

Excluded Intervals File

To improve accuracy, the DRAGEN CNV Pipeline excludes genomic intervals if one or more of the target intervals failed at least one quality requirement. The excluded intervals are reported in the *.excluded_intervals.bed.gz file. The file is in BED format, identifies the regions of the genome that are not callable for CNV analysis, and describes the reason each interval was excluded in the fourth column. The following are the possible reasons for exclusion.

An example of a *.excluded_intervals.bed.gz file is shown below:

Panel of Normals Files

PON Metrics File

The DRAGEN CNV Pipeline generates the PON Metrics File (.cnv.pon_metrics.tsv.gz) if a Panel of Normals is provided and --cnv-generate-pon-metric-file is set to true. If PON size is less than 2, then an empty file will be generated.

The PON Metric File includes basic statistics of the coverage profile for each interval. To remove sample coverage bias, DRAGEN applies sample median normalization, and then computes the following metrics:

Example:

PON Correlation File

The DRAGEN CNV Pipeline generates the PON Correlation File (.cnv.pon_correlation.txt.gz) if a Panel of Normals is provided. The PON Correlation File includes correlation between CASE sample and each PON sample.

Example:

SegDups Extension Files

The SegDups extension provides intermediate and final outputs. All intervals follow the bed format (0-based, start inclusive, end exclusive) and they are in tab-delimited text files (gzip compressed).

The final output has extension .cnv.segdups.rescued_intervals.tsv.gz, and contains the rescued target intervals which can then be injected before segmentation. It has columns:

Chromosome name
Start - 0-based inclusive
Stop - 0-based exclusive
Target interval name prefixed with "target-wgs-"
Sample Counts (in header, identifier taken from RGSM) - log2-scale normalized counts for each interval
Improper pairs - Kept for compatibility with CNV workflow, set to 0 for rescued intervals
Target region ID - ID of the target region (aka pair of rescued target intervals)

Intermediate files

The joint normalized coverage profile (log2-scale) for each region is provided in output to file .cnv.segdups.joint_coverage.tsv.gz with columns:

Target region ID
Joint normalized coverage (log2-scale) of the two intervals in the target region
Copy Number Float - estimate of joint copy number for the target region (e.g., CNF ~ 4)

The differentiating sites' data is provided in output to file .cnv.segdups.site_ratios.tsv.gz with columns:

Differentiating site name
Target (gene A) counts at site
Non-target (gene B) counts at site
Target ratio: gene A counts over total (i.e., gene A + gene B) counts at site

CNV with SV Support

The DRAGEN CNV caller leverages depth as its primary signal for calling copy number variants. Depth alone poses challenges for calling events that are less than 10kbp. The sensitivity of CNVs at lengths less than 10kbp can be improved by leveraging junction signals from the DRAGEN structural variant caller.

When both the DRAGEN CNV and SV caller are executed in a single invocation, then an additional integration step is done at the end of a DRAGEN run to improve the CNV calls. This feature is enabled automatically when DRAGEN detects a germline WGS analysis.

The SV/CNV Integration module takes in DEL and DUP calls from the output data structures of the germline CNV and SV callers, identifies putative matches, updates annotations, filters, scores, and outputs the refined records in a new output VCF. By leveraging junction signals from the SV caller and depth signals from the CNV caller, this approach allows for sensitive CNV detection down to 1kbp while also improving recall and precision across length scales. This is achieved by rescuing previously low quality calls if evidence is found from both callers, and also by adjusting CNV breakends to the more accurate SV breakends. The matching algorithm takes into account the proximity of the events as well as the transition states at the breakends, among other things.

Example command lines

The following is an example command line for running a germline WGS analysis for both CNV and SV.

Other optional CNV or SV parameters can also be added.

Combined CNV/SV VCF Output

The original CNV and SV VCF output files, prior to integration, are available for users in the DRAGEN output directory, as described elsewhere. Additionally, there is an enhanced CNV VCF available with the *.cnv_sv.vcf.gz extension. The VCF header lines in the *.cnv_sv.vcf.gz mostly correspond to a concatenation of the individual header lines from the CNV and SV VCFs, with a few lines deduplicated and some new ones added. For details on the legacy header lines, please refer to the individual CNV and SV user guide sections.

Newly added header lines are described in the following table.

Records that can be matched or rescued will have annotations indicating the breakpoint linkage between a CNV and SV record. If a complete match is found, then the MatchSv annotation will be present in the record, indicating the SV record's ID field for this CNV record. Furthermore, the use of the SVCLAIM field will indicate if the record has evidence arising from depth signal D, or junction signals J, or both DJ.

Because of the mixing of standalone SV records and CNV records, the FORMAT field may have different annotations. For details on the CNV or SV specific annotations, please refer to the individual CNV and SV user guide sections.

Records that can be matched or rescued will have FILTER set to PASS. The original FILTERs are retained for records that were not matched or rescued. For example, the cnvLength FILTER will still be applied to standalone CNV records (those with SVCLAIM=D).

Example records are shown below.

Multisample CNV Calling

Multisample CNV calling is possible starting from tangent normalized counts files (*.tn.tsv.gz) specified with the --cnv-input option (one per sample). Multisample CNV analysis benefits from using joint segmentation to increase the sensitivity of detection of copy number variable segments. For each copy number variable segment identified, the copy number genotype of each sample is emitted in a single VCF entry to facilitate annotation and interpretation.

Multisample CNV analysis is supported for WGS and WES workflows.

The following is an example command line for running a trio analysis:

dragen \
-r <HASHTABLE> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-cnv true \
--cnv-input <FATHER_TN_TSV> \
--cnv-input <MOTHER_TN_TSV> \
--cnv-input <PROBAND_TN_TSV> \
--pedigree-file <PEDIGREE_FILE>

De Novo CNV Calling Options

Make sure all input samples have gone through the same single sample workflow and have identical intervals. If the samples are WES inputs, then you must generate the samples using the same panel of normals, and the autosomal intervals for all samples must match.

The following options are used in DeNovo CNV calling:

--cnv-input For DeNovo CNV calling, this specifies the input tangent-normalized signal files (*.tn.tsv.gz) from the single sample runs. This option can be specified multiple times, once for each input sample.

--cnv-filter-de-novo-qual Phred-scaled threshold at which a putative event in the proband sample is marked as DeNovo. Default value is 0.125.

--pedigree-file Pedigree file specifying the relationship between the input samples.

Joint Segmentation

First, CNV calling is performed on each sample independently. Joint segmentation then uses the copy number variable segments from each single sample analysis to derive a set of joint copy number variable segments. This set of joint segments is determined simply by taking the union of all breakpoints from the copy number variable segments of all samples. This results in the splitting of any partially overlapping segments across different samples. For example:

Following joint segmentation, copy number calling is again performed independently on each sample using the joint segments. Segments can be merged as with the single sample analysis, but each joint segment is emitted in the multisample VCF as a single entry. The quality score (QS in the VCF) from the sample's merged segment, if applicable, is used for filtering the call. Sample calls are filtered using the sample's FT field in the multisample VCF. The QUAL column of the multisample VCF is always missing (i.e., "."). The FILTER column of the multisample VCF is SampleFT if none of the samples' FT fields are PASS, and PASS if any of the samples' FT fields are PASS.

Note, however, that when a single segment in one sample overlaps multiple segments in another sample, the larger segment annotation is replicated across multiple records, e.g. (only relevant VCF fields are printed below):

DRAGEN:REF:chr22:21917617-22385563	GT:SM:CN:BC:PE:QS:FT	./.:1.01773:2:867:0,0:62:PASS	./.:1.00693:2:379:0,0:61:PASS
DRAGEN:LOSS:chr22:22385564-22549952	GT:SM:CN:BC:PE:QS:FT:GC:CT:AC	./.:1.01773:2:867:0,0:62:PASS	0/1:0.695867:1:135:0,0:7:cnvQual:0.427961:0.493883:0.506859
DRAGEN:LOSS:chr22:22549953-23041393	GT:SM:CN:BC:PE:QS:FT:GC:CT:AC	./.:1.01773:2:867:0,0:62:PASS	0/1:0.614398:1:341:0,0:40:PASS:0.457178:0.499478:0.500493
DRAGEN:LOSS:chr22:23041394-23055519	GT:SM:CN:BC:PE:QS:FT:GC:CT:AC	./.:1.01773:2:867:0,0:62:PASS	0/1:0.31226:1:141:0,0:52:PASS:0.452074:0.492297:0.513473
DRAGEN:LOSS:chr22:23055520-23198595	GT:SM:CN:BC:GC:CT:AC:PE:QS:FT	0/1:0.57652:1:168:0.452278:0.489735:0.514792:0,0:41:PASS	0/1:0.31226:1:141:0.452074:0.492297:0.513473:0,0:52:PASS
DRAGEN:LOSS:chr22:23198596-23241095	GT:SM:CN:BC:GC:CT:AC:PE:QS:FT	0/1:0.57652:1:168:0.452278:0.489735:0.514792:0,0:41:PASS	1/1:0.128:0:39:0.466259:0.483365:0.516541:0,0:42:PASS
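
The breakpoint union behind these records can be sketched as follows. This is an illustration only, not the DRAGEN implementation: the function name is hypothetical, coordinates are treated as 1-based and end-inclusive as in the record IDs, and the per-sample segment boundaries are inferred from the repeated annotations in the chr22 records above.

```python
def joint_segments(per_sample_segments):
    """Split segments at the union of all samples' breakpoints.

    per_sample_segments: one list per sample of (start, end) tuples,
    1-based and end-inclusive, all on the same chromosome.
    """
    cuts = set()
    for segments in per_sample_segments:
        for start, end in segments:
            cuts.add(start)
            cuts.add(end + 1)  # first position after the segment
    ordered = sorted(cuts)
    pieces = [(a, b - 1) for a, b in zip(ordered, ordered[1:])]
    # Keep only pieces covered by at least one input segment (drops gaps).
    all_segments = [s for segments in per_sample_segments for s in segments]
    return [
        (a, b)
        for a, b in pieces
        if any(start <= a and b <= end for start, end in all_segments)
    ]


# Segments inferred from the two samples in the chr22 records.
sample_1 = [(21917617, 23055519), (23055520, 23241095)]
sample_2 = [
    (21917617, 22385563), (22385564, 22549952), (22549953, 23041393),
    (23041394, 23198595), (23198596, 23241095),
]
for start, end in joint_segments([sample_1, sample_2]):
    print(f"chr22:{start}-{end}")
```

Under these assumptions, the sketch prints the same six intervals as the records above, which shows why the larger segment from one sample is replicated across several entries.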

De Novo Calling Stage

A de novo event is defined as the existence of a genotype at a particular locus in a proband's genome that did not result from standard Mendelian inheritance from the parents. The de novo calling stage identifies putative de novo events in the proband of each trio of a multisample analysis. In some cases, these putative de novo events may be real, but they can also arise from sequencing or analysis artifacts. Consequently, a de novo quality score is assigned to each putative de novo event and used to filter out low-quality de novo events. Trios are defined by providing a .ped file with the --pedigree-file option. Multiple trios can be specified (e.g., quad analysis), and all valid trios are processed.

For each joint segment in a trio, the de novo caller determines if there is a Mendelian inheritance conflict for the called copy number genotypes. The CNV caller does not identify the copy number for each allele of a given diploid segment, which means assumptions are made about the possible allelic composition of the parent genotypes.

The assumption is that the copy number 0 allele is not present for diploid regions of a parent's genome (sex dependent) when the assigned copy number is 2 or greater. This results in simplifications, as follows:

The following are examples of consistent and inconsistent copy number genotypes for diploid regions using these assumptions:

If a joint segment has a Mendelian inheritance conflict, a Phred-scaled de novo quality score (DQ field in the VCF) is calculated using the likelihoods for each copy number state (see Quality Scoring section) of each sample in the trio, combined with a prior for the trio genotypes:

Where

The DN field in the VCF is used to indicate the de novo status for each segment. Possible values are:

Inherited - the called trio genotype is consistent with Mendelian inheritance
LowDQ - the called trio genotype is inconsistent with Mendelian inheritance and DQ is less than the de novo quality threshold (default 0.125)
DeNovo - the called trio genotype is inconsistent with Mendelian inheritance and DQ is greater than or equal to the de novo quality threshold (default 0.125)
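
The three DN values can be expressed as a small decision function. The following is a minimal sketch, assuming the Mendelian consistency check and the DQ score have already been computed elsewhere; the function and parameter names are hypothetical.

```python
def de_novo_status(mendelian_consistent, dq, threshold=0.125):
    """Return the DN value for one joint segment of a trio."""
    if mendelian_consistent:
        return "Inherited"
    # Inconsistent genotype: DQ decides between a confident and a low-quality call.
    return "DeNovo" if dq >= threshold else "LowDQ"


print(de_novo_status(False, 0.5))   # DeNovo
print(de_novo_status(False, 0.01))  # LowDQ
print(de_novo_status(True, 0.0))    # Inherited
```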

Multisample CNV VCF Output

The records in a multisample CNV VCF differ slightly from the single sample case. The major differences are as follows:

The per-record entries are broken down into segments defined by the union of all the input samples' breakpoints, which means there are more entries in the overall VCF.

The QUAL column is not used and its value is ".". The per-sample quality is carried over into the SAMPLE columns with the QS tag.

The FILTER column indicates PASS if any of the individual SAMPLE columns PASS. Otherwise, it indicates SampleFT.

The per-sample annotations are carried over from their originating calls. The single sample filters are applied at the sample level and are emitted in the FT annotation.
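
The relationship between the per-sample FT values and the record-level FILTER column can be sketched as follows. This is an illustration with hypothetical helper names, using the FORMAT and SAMPLE strings from one of the chr22 example records; it relies on the VCF convention that trailing FORMAT fields may be absent from a sample column.

```python
def sample_fields(format_column, sample_column):
    """Map FORMAT keys to one sample's values; trailing keys may be absent."""
    return dict(zip(format_column.split(":"), sample_column.split(":")))


def record_filter(format_column, sample_columns):
    """PASS if any sample's FT is PASS, otherwise SampleFT."""
    fts = [sample_fields(format_column, s).get("FT") for s in sample_columns]
    return "PASS" if "PASS" in fts else "SampleFT"


fmt = "GT:SM:CN:BC:PE:QS:FT:GC:CT:AC"
samples = [
    "./.:1.01773:2:867:0,0:62:PASS",
    "0/1:0.695867:1:135:0,0:7:cnvQual:0.427961:0.493883:0.506859",
]
print(record_filter(fmt, samples))      # PASS
print(record_filter(fmt, samples[1:]))  # SampleFT
```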

Additionally, if a valid pedigree is used, then de novo calling is performed, which adds the following two annotations to the proband sample.

##FORMAT=<ID=DQ,Number=1,Type=Float,Description="De novo quality">
##FORMAT=<ID=DN,Number=1,Type=String,Description="Possible values are 'Inherited', 'DeNovo' or 'LowDQ'. Threshold for a passing de novo call is DQ > 0.125000">

While the VCF contains many entries, due to the joint segmentation stage, the number of de novo events can be found by extracting entries that have a DN and DQ annotation. These records are also extracted and are converted to GFF3 in the de novo calling case.

Somatic CNV Calling WGS

To detect somatic copy number aberrations and regions with loss of heterozygosity, run the DRAGEN CNV Caller on a tumor sample with a VCF that contains germline SNVs. The output file is a VCF file. Components of the germline CNV caller are reused in the somatic algorithm with the addition of a somatic modeling component, which estimates tumor purity and ploidy.

The germline SNVs are used to compute B-allele ratios in the tumor, which allows for allele-specific copy number calling on the tumor sample. Where possible, use of the small-variant VCF from a matched normal sample is preferred (tumor-normal mode) for best results, but a catalog of population SNPs can be used when a matched normal sample is not available (tumor-only mode).

When a matched normal sample is available, the sample should first be processed using the germline small variant caller. In this case, only germline-heterozygous SNV sites are used for determining B-allele ratios. If no matched normal is available, population SNP B-allele ratios are computed as for matched normal heterozygous loci, but are treated as variants of unknown germline genotype; possible genotype assignments are statistically integrated to determine allele-specific copy number.

In matched normal mode, a VCF containing germline copy number changes for the individual may optionally be input. This makes sure that germline CNVs are output as separate segments in the somatic whole-genome sequencing (WGS) CNV VCF, and annotated with the germline copy number so that it is clear whether there are specifically-somatic copy number changes in the region.

Somatic WGS CNV Calling Options

You can use the following somatic WGS CNV calling command-line options:

Example command lines

The following is an example command line for running tumor-normal somatic WGS CNV calling with a matched normal SNV VCF.

dragen \
-r <HASHTABLE> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--tumor-bam-input <TUMOR_BAM> \
--cnv-normal-b-allele-vcf <SNV_NORMAL_VCF> \
--sample-sex <SEX>

If a matched normal is not available, you must disable CNV calling or run in tumor-only mode. Running with a mismatched normal in tumor-normal mode yields unexpected results. The following example command line runs tumor-only somatic WGS CNV calling with a population SNV VCF.

dragen \
-r <HASHTABLE> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--tumor-bam-input <TUMOR_BAM> \
--cnv-population-b-allele-vcf <SNV_POP_VCF> \
--sample-sex <SEX>

The following example command line runs tumor-normal somatic WGS CNV calling concurrently with the Somatic SNV Caller, which allows you to use the matched normal germline heterozygous sites directly from the SNV Caller by setting the option --cnv-use-somatic-vc-baf true.

dragen \
-r <HASHTABLE> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--tumor-bam-input <TUMOR_BAM> \
--bam-input <NORMAL_BAM> \
--enable-variant-caller true \
--cnv-use-somatic-vc-baf true \
--sample-sex <SEX>

You can enable additional features when a matched normal sample and the outputs from DRAGEN Germline analysis are also available. If a matched normal sample is available, enable germline-aware mode and VAF-aware mode using the following example command line. For more information on germline-aware mode and VAF-aware mode, see Germline-aware Mode and VAF-aware Mode.

dragen \
-r <HASHTABLE> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--tumor-bam-input <TUMOR_BAM> \
--bam-input <NORMAL_BAM> \
--enable-variant-caller true \
--cnv-use-somatic-vc-baf true \
--cnv-normal-cnv-vcf <CNV_NORMAL_VCF> \
--sample-sex <SEX>

Target Counts and B-allele Counts

The target counting stage and its output are the same as for the germline CNV calling case. The target intervals with the read counts are output in a *.target.counts.gz file. If insufficient read depth coverage is detected, processing halts. For low-depth tumor samples, the value of --cnv-interval-width can be increased to capture more alignments. The B-allele counting occurs in parallel with the read counting phase, and the values are output in a *.baf.bedgraph.gz file. This file can be loaded into IGV along with other bigwig files generated by DRAGEN for visualization. See Output Files for more details on output files.

Specification of B-Allele Loci

The Somatic WGS CNV Caller requires a source of heterozygous SNP sites to measure B-allele counts of the tumor sample. The following are the available modes.

To specify a matched normal sample SNV VCF, use the --cnv-normal-b-allele-vcf option. The VCF file should come from processing the matched normal sample through the DRAGEN germline small variant caller with filters applied. Typically, this file name has a *.hard-filtered.vcf.gz extension. All records marked as PASS that are determined to be heterozygous in the normal sample are used to measure the b-allele counts of the tumor sample. You can also use the equivalent gVCF file (*.hard-filtered.gvcf.gz), but the processing time is significantly longer due to the number of records, most of which are not heterozygous sites.

To specify a population SNP VCF, use the --cnv-population-b-allele-vcf option. To obtain a population SNP VCF, process an appropriate catalog of population variation, such as from dbSNP, the 1000 Genomes Project, or other large cohort discovery efforts. A suitable example file for this parameter is "1000G_phase1.snps.high_confidence.vcf.gz" from the GATK resource bundle. Only high-frequency SNPs should be included. For example, include SNPs with minor allele population frequency ≥ 10% to limit run time impact and reduce artifacts. Specify the ALT allele frequency by adding AF=<alt frequency> to the INFO section of each record. Additional INFO fields might be present, but DRAGEN only parses and uses the AF field. Sites specified with --cnv-population-b-allele-vcf can be either heterozygous or homozygous in the germline genome from which the tumor genome derives.

The following is an example valid population SNP record:

chr1 51479 . T A 1000 PASS AF=0.3253

DRAGEN considers the following requirements when parsing records from the b-allele VCF:

Only simple SNV sites.
Records must be marked PASS in the FILTER field.
If there are records with the same CHROM and POS values in the VCF, then DRAGEN uses the first record that occurs.
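
When preparing a population SNP VCF, the requirements above can be applied as a pre-filter. The following is a minimal sketch of such a preprocessing step, not a description of DRAGEN's internal parser: the function name is hypothetical, records are assumed to be tab-delimited, the 10% minor allele frequency cutoff follows the recommendation given earlier in this section, and keeping the first retained record per site is an assumption about how duplicates interact with the other checks.

```python
BASES = {"A", "C", "G", "T"}


def usable_b_allele_records(vcf_lines, min_maf=0.10):
    """Return population SNP records that are simple SNVs, PASS, and common."""
    kept, seen = [], set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue
        chrom, pos, _id, ref, alt, _qual, flt, info = fields[:8]
        if ref not in BASES or alt not in BASES:
            continue  # only simple single-base substitutions
        if flt != "PASS":
            continue
        af = None
        for item in info.split(";"):
            if item.startswith("AF="):
                try:
                    af = float(item[3:])
                except ValueError:
                    af = None
        if af is None or min(af, 1.0 - af) < min_maf:
            continue  # missing AF or minor allele too rare
        if (chrom, pos) in seen:
            continue  # a record at this site was already kept
        seen.add((chrom, pos))
        kept.append(line.rstrip("\n"))
    return kept


records = [
    "chr1\t51479\t.\tT\tA\t1000\tPASS\tAF=0.3253",
    "chr1\t51479\t.\tT\tC\t1000\tPASS\tAF=0.4000",
]
print(usable_b_allele_records(records))
```

In this usage example only the first record at chr1:51479 is returned, because the second record shares the same CHROM and POS.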

If a tumor sample and matched normal input are available, you can avoid having to separately process the matched normal with the DRAGEN germline pipeline by specifying --cnv-use-somatic-vc-baf true. You must also enable the Somatic SNV Caller. With this option, DRAGEN determines the germline heterozygous sites from the matched normal input and measures the b-allele counts of the tumor sample. The information is passed to the Somatic WGS CNV Caller to simplify the overall somatic workflow.

To enable --cnv-use-somatic-vc-baf, enter the following command line options.

--tumor-bam-input <TUMOR_BAM>—Specify the tumor input
--bam-input <NORMAL_BAM>—Specify the matched normal input
--enable-variant-caller true—Enable the somatic SNV variant caller
--cnv-use-somatic-vc-baf true—Enable somatic VC BAF

Germline-aware Mode

To specify germline CNVs from a matched normal sample, use --cnv-normal-cnv-vcf. When specified, CNV records marked as PASS in the normal sample are used during tumor-sample segmentation to make sure that confident germline CNV boundaries are also boundaries in the somatic output. Segments with germline copy number changes that are relative to reference ploidy are excluded from somatic model selection. During somatic copy number calling and scoring, the germline copy number is used to modify the expected depth contribution from the normal contamination fraction of the tumor sample. The process leads to more accurate assignment of somatic copy number in regions of germline CNV. DRAGEN then annotates the somatic WGS CNV VCF entries with germline copy number (NCN) and the somatic copy number difference relative to germline (SCND) for the segments that have germline CNVs.

VAF-aware Mode

If both the small variant caller and the CNV caller are enabled in a tumor-matched normal run, the somatic SNV results can affect the estimated purity and ploidy of the tumor sample. The somatic SNV variant allele frequencies (VAFs) that are captured by the allele depth values from passing somatic SNVs reflect the combination of tumor purity, total tumor copy number at a somatic SNV locus, and the number of tumor copies bearing the somatic allele. Clusters of somatic SNVs with similar allele depths inform the tumor model.

When a tumor has limited copy number variation and/or CNVs are mostly subclonal, such as in many liquid tumors, VAFs can help prevent incorrect or low-confidence estimated tumor models. Incorrect or low-confidence estimated tumor models can lead to wrong or filtered calls. VAF information can also help determine the presence or absence of a genome duplication even in samples from clonal tumors with clear CNVs.

To utilize VAF information, run somatic WGS CNV calling with small variant calling on tumor and matched-normal read alignment inputs. For example, you could use the following command line:

--enable-variant-caller=true --enable-cnv=true --tumor-bam-input <TUMOR_BAM> --bam-input <NORMAL_BAM>

For tumor/matched-normal runs with --enable-variant-caller true, VAF-based modeling is enabled by default. To disable VAF-based modeling, set --cnv-use-somatic-vc-vaf to false.

HET-Calling Mode

DRAGEN uses HET-calling mode for segments with a copy number that is estimated to be heterogeneous (HET) among different subclones. Based on a statistical model, a segment is considered to be heterogeneous when the depths or BAF values in a segment are too far away from what is expected for the closest integer-copy number.

To turn on HET calling, specify --cnv-somatic-enable-het-calling=true on the command line. Note that this setting is honored only when DRAGEN is able to identify a confident purity/ploidy model. When a confident model cannot be identified, the caller returns a default model and HET calling is always disabled (see the Somatic WGS CNV Model section for more details and nuances of this approach).

When a segment is considered as heterogeneous, the output for the segment is changed as follows.

The HET tag is added to the INFO field for the segment.
At least one of the CN and MCN values is given as a non-REF value. Specifically, the values are given as the integer values closest to CNF and MCNF. If the integer values would result in a REF call, then at least one of the CN and MCN values is adjusted to the closest non-REF value.
The ID, ALT, and GT fields are set appropriately for the chosen CN and MCN.
The QUAL score reflects confidence that the segment has nonreference copy number in at least a fraction of the sample.
The CNQ and MCNQ values reflect confidence that the assigned CN and MCN values are true in all of the tumor cells, so at least one of the CNQ and MCNQ values is typically less than five.

Somatic WGS CNV Model

Selecting a tumor purity and diploid coverage level (ploidy) is a key component of the somatic WGS CNV caller. The somatic WGS CNV caller uses a grid-search approach that evaluates many candidate models to attempt to fit the observed read counts and b-allele counts across all segments in the tumor sample. A log likelihood score is emitted for each candidate. The log likelihood scores are output in the *.cnv.purity.coverage.models.tsv file. The somatic WGS CNV caller chooses the (purity, coverage) pair with the highest log likelihood, and then computes several measures of model confidence based on the relative likelihood of the chosen model compared to alternative models.

If the confidence in the chosen model is low, the caller returns the default model with estimated tumor purity set to NA. The default model provides an alternative methodology to identify large somatic alterations (length of at least 1 Mb): records are filtered by this model based on their segment mean value (SM). The threshold values used by the caller are estimated automatically considering the variance of the sample, with larger SM thresholds for DUPs when the variance is higher. The user can use alternative threshold values through the --cnv-filter-del-mean and --cnv-filter-dup-mean parameters. Finally, when the caller returns the default model, the fields regarding copy number states based on model estimation (i.e., CN, CNF, CNQ, MCN, MCNF, MCNQ) are omitted from the final VCF output.

Grid search optimization informed by essential regions

To improve the accuracy of tumor ploidy model estimation, the somatic WGS CNV caller checks whether the chosen model calls homozygous deletions in regions whose loss is likely to reduce the overall fitness of cells. Such regions are deemed "essential" and are under negative selection. Recent efforts in the literature have attempted to map these cell-essential genes¹.

The check on essential regions is controlled with --cnv-somatic-enable-lower-ploidy-limit (default true). Default BED files describing the essential regions are provided for hg19, GRCh37, hs37d5, and GRCh38, but a custom BED file can also be provided as input through the --cnv-somatic-essential-genes-bed=<BEDFILE_PATH> parameter. In that case, the feature is automatically enabled. A custom essential regions BED file must have four tab-separated columns. The first three columns identify the coordinates of the essential region (chromosome, 0-based start, excluded end). The fourth column is the region ID (string type). Currently, the algorithm uses only the first three columns. However, the fourth column can help when manually investigating which regions drove the caller's decisions on model plausibility.

If the somatic WGS CNV caller does not find any overlap between any of the homozygous deletions and any of the essential regions, the model is considered plausible and the model optimization ends. Otherwise, when at least one overlap is found, the model is declared invalid and the model search is repeated on the subset of models that support at least one copy (CN = 1) for the essential region with the lowest coverage among the regions overlapping homozygous deletions.
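
The plausibility check can be sketched as an interval overlap test. The following is an illustration only, with hypothetical function names and made-up coordinates. It assumes that both the homozygous deletions and the essential regions are expressed as BED-style (chromosome, 0-based start, excluded end) tuples, and it does not model the repeated model search.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two 0-based half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def essential_regions_hit(hom_dels, essential_regions):
    """Return the essential regions overlapped by any homozygous deletion."""
    hits = []
    for chrom, start, end in essential_regions:
        if any(c == chrom and overlaps(s, e, start, end) for c, s, e in hom_dels):
            hits.append((chrom, start, end))
    return hits


def model_is_plausible(hom_dels, essential_regions):
    """A model is plausible when no homozygous deletion touches an essential region."""
    return not essential_regions_hit(hom_dels, essential_regions)


# Hypothetical coordinates, for illustration only.
essential = [("chr1", 1000, 2000), ("chr2", 500, 900)]
deletions = [("chr1", 1500, 3000)]
print(model_is_plausible(deletions, essential))     # False
print(essential_regions_hit(deletions, essential))  # [('chr1', 1000, 2000)]
```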

¹E.g., in 2015 - https://www.science.org/doi/10.1126/science.aac7041

Somatic WGS CNV Smoothing

The segmentation stage might produce adjacent or nearby segments that are assigned the same copy number and have similar depth and BAF data. This segmentation can result in a region with consistent true copy number being fragmented into several pieces. The fragmentation might be undesirable for downstream use of copy number estimates. Also, for some uses, it can be preferable to smooth short segments that would be assigned different copy numbers whether due to a true copy number change or an artifact. To reduce undesirable fragmentation, initial segments can be merged during a postcalling segment smoothing step.

After initial calling, segments shorter than the specified value of --cnv-filter-length are deemed negligible. Among the remaining nonnegligible segments, successive pairs are evaluated for merging. On a trial basis, the Somatic WGS CNV Caller combines two successive segments that are within --cnv-merge-distance (default value of 10000 for WGS Somatic CNV) of one another and have the same CN and MCN assignments, along with any intervening negligible segments, into a single segment that is recalled and rescored. If the merged segment receives the same CN and MCN as its constituent nonnegligible pieces with a sufficiently high quality score, the original segments are replaced with the merged segment. The merged segment might be further merged with other initial or merged segments to either side. Merging proceeds until all segment pairs that meet the criteria are merged. Note that when germline CN information is available and two segments have different germline CN, they are not merged.
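
The eligibility test for a trial merge can be sketched as follows. This is a simplified illustration with hypothetical names. It assumes the distance between two segments is measured from the end of the left segment to the start of the right segment, and it covers only the candidate selection: the recalling and rescoring of the merged segment are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    start: int
    end: int
    cn: int                            # total copy number
    mcn: int                           # minor copy number
    germline_cn: Optional[int] = None  # only known in germline-aware mode


def is_merge_candidate(left, right, merge_distance=10000):
    """Check whether two successive nonnegligible segments qualify for a trial merge."""
    if right.start - left.end > merge_distance:
        return False  # too far apart
    if (left.cn, left.mcn) != (right.cn, right.mcn):
        return False  # different copy number assignments
    if (left.germline_cn is not None and right.germline_cn is not None
            and left.germline_cn != right.germline_cn):
        return False  # different germline copy number blocks merging
    return True


a = Segment(0, 50000, cn=3, mcn=1)
b = Segment(55000, 90000, cn=3, mcn=1)
print(is_merge_candidate(a, b))  # True
```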

Allele Specific Copy Number Examples

The Somatic WGS CNV Caller can report the total tumor copy number by estimating tumor purity. The BAF estimations from matched normal SNVs or population SNPs allow for allele specific copy number calling. The following table provides examples for a DUP in a reference-diploid region:

*The entry represents a Loss of Heterozygosity (LOH) case. The total copy number is still considered a DUP, so the entry is annotated as GAINLOH to distinguish the value from Copy Neutral LOH (CNLOH), which would be annotated as 2+0.

Somatic CNV Calling WES

For somatic whole-exome sequencing (WES) and somatic targeted panels, you can use a panel of normals as the reference baseline to provide insight into copy number variants. The reported events are based solely on the normalized copy ratio values and the deviation from the expected reference baseline levels. This workflow can be useful for applications that require only the detection of gains and losses in targeted genes. The somatic WES CNV model is similar to the germline WES CNV model, but utilizes a different quality scoring and calling model.

Use one of the following input options.

--tumor-fastq1 and --tumor-fastq2 --Specify a FASTQ file
--tumor-bam-input --Specify an existing BAM file
--tumor-cram-input --Specify an existing CRAM file

The Somatic WES CNV Caller requires a panel of normals. The panel of normals samples help measure intrinsic biases of the upstream processes to allow for proper normalization. To generate a panel of normals, see Panel of Normals. The panel of normals samples should be well matched to the case sample under analysis.

If a matched normal sample is available, the sample can be included in the panel of normals. The workflow is the same whether or not a matched normal is available.

Example Command Lines

The following example command line runs somatic analysis on WES data.

dragen \
-r <HASHTABLE> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--tumor-bam-input <TUMOR_BAM> \
--cnv-normals-list <NORMALS> \
--cnv-target-bed <BED>

If you are using somatic targeted panels with a set of genes supplied with the capture kit, you can bypass segmentation by specifying --cnv-segmentation-bed and setting --cnv-segmentation-mode to bed. With this option, all events in the segmentation BED are reported in the output VCF. For more information on the segmentation BED file, see Targeted Segmentation (Segment BED).

The following example command line runs somatic analysis on a targeted panel.

dragen \
-r <HASHTABLE> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--tumor-bam-input <TUMOR_BAM> \
--cnv-normals-list <NORMALS> \
--cnv-target-bed <BED> \
--cnv-segmentation-bed <SEGMENT_BED> \
--cnv-segmentation-mode bed

Quality Scoring and Calling

The Somatic WES CNV Caller computes quality scores using a two-sample t-test between the normalized copy ratio of the case sample and the panel of normals samples. The caller computes a p-value per segment. The p-values are then converted to Phred-scaled scores. For copy neutral events, the caller computes quality scores from 1-p.

DUP/DEL event calls are made based on the limit of detection (LoD) threshold, which is set using --cnv-filter-limit-of-detection (default 0.2). For each segment, the caller computes a p-value for hypothetical counts of Case Counts × (1 ± LoD) against the PoN. If the p-value of Case Counts × (1 + LoD) is the highest, the segment is called DUP. If the p-value of Case Counts × (1 - LoD) is the highest, the segment is called DEL. Otherwise, the segment is called REF.
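
The decision rule and the Phred conversion can be sketched as follows. This is a minimal illustration, not the caller's implementation. It assumes the three p-values have already been computed elsewhere, that the "otherwise" case compares against the p-value of the unscaled case counts, and that the Phred scale uses the standard definition Q = -10 log10(p). All function and parameter names are hypothetical.

```python
import math


def phred(p, floor=1e-300):
    """Standard Phred scaling; the floor avoids log10(0) for vanishing p-values."""
    return -10.0 * math.log10(max(p, floor))


def call_segment(p_ref, p_dup, p_del):
    """Pick the call whose hypothetical counts give the highest p-value.

    p_ref: p-value for the unscaled case counts (assumed comparator)
    p_dup: p-value for Case Counts x (1 + LoD)
    p_del: p-value for Case Counts x (1 - LoD)
    """
    if p_dup > p_ref and p_dup > p_del:
        return "DUP"
    if p_del > p_ref and p_del > p_dup:
        return "DEL"
    return "REF"


print(call_segment(p_ref=0.01, p_dup=0.40, p_del=0.001))  # DUP
print(call_segment(p_ref=0.60, p_dup=0.10, p_del=0.05))   # REF
```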

The output VCF contains the quality score in the QUAL field.

Tumor Purity and Fold Change

Tumor purity can be estimated automatically through the ASCN workflow.

The non-ASCN Somatic WES CNV Caller reports only the copy ratio, also known as fold change. Fold change is encoded in the FORMAT/SM field as a linear copy ratio of the segment mean. In that case, if tumor purity is known, you can infer the ploidy of a gene or segment in the sample from the reported fold change using the following calculation.

For example, if the tumor purity is 30% for MET with a fold change of 2.2x, then there are 10 copies of MET DNA in the sample.
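
The MET example can be reproduced with a simple mixture model. The following is a sketch under stated assumptions, not necessarily the exact calculation intended above: it assumes the sample is a linear mixture of tumor cells carrying n copies and normal cells carrying 2 copies, so that fold change = (purity × n + (1 - purity) × 2) / 2, which is then solved for n. The function name is hypothetical.

```python
def copies_from_fold_change(fold_change, tumor_purity, normal_copies=2):
    """Estimate tumor copy number from a linear fold change and tumor purity.

    tumor_purity is a fraction in (0, 1], for example 0.30 for 30%.
    """
    if not 0 < tumor_purity <= 1:
        raise ValueError("tumor_purity must be in (0, 1]")
    normal_part = normal_copies * (1 - tumor_purity)
    return (normal_copies * fold_change - normal_part) / tumor_purity


# MET example: 30% purity and a 2.2x fold change.
print(round(copies_from_fold_change(2.2, 0.30), 6))  # 10.0
```

Under this model, a fold change of 1.0 always maps back to 2 copies regardless of purity, which is a quick sanity check on the formula.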

Allele Specific CNV for Somatic WES CNV

To detect somatic copy number aberrations and regions with loss of heterozygosity, run the DRAGEN CNV Caller on a tumor sample with a VCF that contains germline SNVs from matched normal sample or population SNV VCF. The output file is a VCF file. Components of the germline CNV caller are reused in the somatic algorithm with the addition of a somatic modeling component, which estimates tumor purity and ploidy.

A panel of normals is used as the reference baseline to provide insight into copy number variants. The ASCN somatic WES CNV model is similar to the somatic WGS CNV model (with different internal parameters tuned for WES), but it uses a panel of normals to remove coverage bias in each target region.

The pipeline accepts various input types for matched normal sample or population SNV VCF for B-allele loci. If the normal sample was already processed using the germline small variant caller, the user can provide its output VCF file.

If the normal sample was not processed, the user can provide raw reads or aligned reads and enable the concurrent execution of the small variant caller. In that case, the DRAGEN CNV caller receives the small variant caller's output and uses it to estimate B-allele frequencies from the germline SNVs.

If there is no matched normal sample, the user can provide a population SNV VCF. DRAGEN will intersect the population SNV VCF with the target region provided by the cnv-target-bed and use the resulting SNVs to estimate B-allele frequencies.

ASCN Somatic WES CNV Calling Options

You can use the following somatic WES CNV calling command-line options:

Input requirements:

1 tumor input
1 normal input (either option 1, 2, or 3)
Panel of normals (either option 1, 2, 3 or 4)
Target region

When the normal sample input is not in VCF format (for example, FASTQ, BAM, or CRAM), the normal sample can also be used as part of the PON. However, if the normal sample is already included in the PON, it is not added again.

Example command lines

The following is an example command line for running ASCN tumor-normal somatic WES CNV calling with matched normal SNV VCF.

The following example command line runs ASCN tumor-normal somatic WES CNV calling concurrently with the Somatic SNV Caller, which allows you to use the matched normal germline heterozygous sites directly from the SNV Caller by setting the option cnv-use-somatic-vc-baf to true.

If a matched normal is not available, DRAGEN CNV requires a population SNV VCF to run in tumor-only mode. The following example command line runs tumor-only somatic WGS CNV calling with a population SNV VCF.

Method and Outputs

CNV Output

For more information on how to use the output files to aid in debug and analysis, see .

CNV VCF File

File extension: *.cnv.vcf.gz

Header

The following is an example of some of the header lines that are specific to CNV:

The following header lines are specific to somatic WGS CNV calling:

ModelSource --- The primary basis on which the final tumor model was chosen. The following values can be included:
- DEPTH+BAF: Depth+BAF signal is used to determine the tumor model.
- DEPTH+BAF_DOUBLED: The initial depth+BAF model is duplicated based on VAF signal or excess segments at half the expected depth change.
- DEPTH+BAF_DEDUPLICATED: The depth+BAF model is deduplicated based on VAF signal or insufficient segments supporting a duplication.
- DEPTH+BAF_WEAK: Depth+BAF signal is used to determine a lower-confidence tumor model.
- VAF: VAF signal is used to determine the tumor model due to insufficient depth+BAF signal.
- DEGENERATE_DIPLOID: The sample is treated as high-purity diploid in the absence of adequate signal from depth+BAF and VAF. The diploid coverage is set to the lowest value observed in a substantial number of bases in segments with BAF=50%.
- SAMPLE_MEDIAN: The sample is treated as high-purity diploid in the absence of adequate signal from depth+BAF and VAF. The diploid coverage is set to the sample median.
EstimatedTumorPurity --- Estimated fraction of cells in the sample due to tumor. The range of this field is [0, 1] or NA if a confident model could not be determined.
DiploidCoverage --- Expected read count for a target bin in a diploid region. The numeric value is unlimited.
OverallPloidy --- Length-weighted average of tumor copy number for PASS events. The numeric value is unlimited.
AlternativeModelDedup --- An alternative to the best model corresponding to one less whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation if the best model might involve a spurious genome duplication.
AlternativeModelDup --- An alternative to the best model corresponding to one more whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation where the best model might have missed a true genome duplication.
OutlierBafFraction --- A QC metric that measures the fraction of B-allele frequencies that are incompatible with the segment the BAFs belong to. High values might indicate a mismatched normal, substantial cross-sample contamination, or a different source of a mosaic genome, such as bone marrow transplantation. The range of this field is [0, 1].
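
As an illustration of the OverallPloidy definition, the following Python sketch computes a length-weighted average copy number over PASS events. The event tuples are hypothetical values, and the function is not DRAGEN code. Whether copy-neutral segments contribute to the reported value is not stated here, so the sketch simply averages every PASS record it is given.

```python
def overall_ploidy(events):
    """Length-weighted mean copy number over PASS events.

    `events` is an iterable of (length_bp, copy_number, filter_value) tuples.
    Returns None when no PASS event is present.
    """
    total_length = 0
    weighted_sum = 0.0
    for length_bp, copy_number, filter_value in events:
        if filter_value != "PASS":
            continue  # non-PASS records do not contribute
        total_length += length_bp
        weighted_sum += length_bp * copy_number
    if total_length == 0:
        return None
    return weighted_sum / total_length


# Hypothetical events: (length in bp, tumor copy number, FILTER value).
example_events = [
    (8_000_000, 2, "PASS"),
    (2_000_000, 4, "PASS"),
    (20_000, 0, "cnvLength"),
]
ploidy = overall_ploidy(example_events)
print(ploidy)
```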

Records

All coordinates in the VCF are 1-based.

CHROM

The CHROM column specifies the chromosome (or contig) on which the copy number variant being described occurs.

POS

ID

REF

The REF column contains an N for all CNV events.

ALT

QUAL

The QUAL column contains an estimated quality score for the CNV event, which is used in hard filtering. Each CNV workflow has a different default threshold, and the value used is recorded in the VCF header.

FILTER

Available FILTERs:

cnvLength which indicates that the length of the CNV is lower than a threshold.
cnvQual which indicates that the QUAL of the CNV is lower than a threshold.

Germline CNV has the following additional FILTERs:

cnvCopyRatio which indicates that the segment mean of the CNV is not far enough from copy neutral.

Both Germline CNV workflows have the following additional FILTERs:

dinucQual which indicates a CNV call where some of its dinucleotide percentages are outside typical ranges, and thus the call is likely to be a false positive.

Germline WGS CNV has the following additional FILTERs:

cnvBinSupportRatio which indicates that, for CNVs longer than 80 kb, the percent span of supporting target intervals is lower than a threshold.
highCN which indicates a CNV call with implausible copy number (>6).

Germline WES CNV has the following additional FILTERs:

cnvLikelihoodRatio which indicates that the log10 likelihood ratio of ALT to REF is less than a threshold.

Both Somatic CNV workflows have the following additional FILTERs:

binCount - Filters CNV events with a bin count lower than a threshold.

Somatic WGS CNV has the following additional FILTERs:

lengthDegenerate - Marks records as non-PASSing based on each record's length (REFLEN) when the caller returns the default model. Segments shorter than 1 Mb are assigned this filter when returning the default model.
segmentMean - Marks records as non-PASSing based on each record's segment mean (SM) when the caller returns the default model. Segments having insufficient SM in DELs or DUPs are assigned this filter when returning the default model.

INFO

The INFO column contains information representing the event.

REFLEN indicates the length of the event.
SVLEN is a signed representation of REFLEN (e.g., a negative value indicates a deletion), and it is only present for non-REF records.
SVTYPE is always CNV and only present for non-REF records.
END indicates the end position of the event (1-based, inclusive).

If a segment BED file is used, the segment identifier is carried over from the input to the SEGID field.

In Somatic WGS CNV, the INFO column can also contain the HET tag, when the call is considered sub-clonal. See .

When matching CNV with SV output, additional INFO annotations are added. See .

FORMAT

The common FORMAT fields are described in the header:

Description

Germline WGS CNV includes the following FORMAT fields:

Description

Germline WES CNV includes the following FORMAT fields:

Description

Somatic WGS CNV and Somatic WES CNV with ASCN (allele-specific copy number) support include the following FORMAT fields:

Description

Note on genotype annotation in germline copy number calling

Diploid or Haploid?

ALT

FORMAT:CN

FORMAT:GT

Coverage Uniformity

CNV Metrics File

Sex Genotyper:

The estimated sex of the case sample and of all panel of normals samples is reported. For WGS workflows, the estimated sex karyotype is reported; for non-WGS workflows, the gender is reported.
Confidence score (ranging from 0.0 to 1.0). If the sample sex is specified, this metric is 0.0.

CNV Summary:

Bases in reference genome in use
Average alignment coverage over genome
Number of alignment records processed
- Number of filtered records (total)
- Number of filtered records (due to duplicates)
- Number of filtered records (due to MAPQ)
- Number of filtered records (due to being unmapped)
Coverage MAD
Median Bin Count
Number of target intervals
Number of normal samples
Number of segments
Number of amplifications
Number of deletions
Number of PASS amplifications
Number of PASS deletions

Intermediate and Visualization Files

All files have a structure similar to a BED file with optional header line(s).

Target Counts

The target counts file has the following columns:

Contig identifier
Start position
End position
Target interval name
Count of alignments in this interval
Count of improperly paired alignments in this interval

Header lines are also included that start with #. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.

An example of a *.target.counts.gz file is shown below.

#TARGET COUNTS FILE
##DRAGENVersion=<VERSION_INFO>
##DRAGENCommandLine=<CommandLineOptions>
...
contig  start  stop   name                <SampleName> improper_pairs
1       565480 565959 target-wgs-1-565480 7          6
1       566837 567182 target-wgs-1-566837 9          0
1       713984 714455 target-wgs-1-713984 34         4
1       721116 721593 target-wgs-1-721116 47         1
1       724219 724547 target-wgs-1-724219 24         21
1       725166 725544 target-wgs-1-725166 43         12
1       726381 726817 target-wgs-1-726381 47         14
1       753243 753655 target-wgs-1-753243 31         2
1       754322 754594 target-wgs-1-754322 27         0
1       754594 755052 target-wgs-1-754594 41         0

B-Allele counts

B-allele counts are written to both a gzipped TSV file (*.ballele.counts.gz) and a gzipped bedgraph file (*.baf.bedgraph.gz).

B-allele tsv

The tsv file format is the following:

Contig identifier
Start, BED-style (zero-based inclusive) start position of the reference allele
Stop, BED-style (zero-based exclusive) stop position of the reference allele
Base sequence for the reference allele
Base sequence for the first allele being counted
Base sequence for the second allele being counted
The number of qualified reads containing a sequence matching the first allele
The number of qualified reads containing a sequence matching the second allele

Additionally, in the case of B-allele sites from a population VCF, the following two additional columns are added after the columns listed above:

Population frequency for the first allele
Population frequency for the second allele

An example of a B-allele counts file is provided below:

contig  start   stop    refAllele       allele1 allele2 allele1Count    allele2Count
chr1    11021   11022   G       G       A       4       2
chr1    14463   14464   A       A       T       111     36
chr1    16494   16495   G       G       C       122     262
chr1    38741   38742   C       C       T       9       9
chr1    39014   39015   A       A       C       38      48
chr1    39260   39261   T       T       C       199     143
chr1    48447   48448   C       C       T       8       15
chr1    48517   48518   A       A       G       13      15
chr1    91485   91486   G       G       C       1       4
chr1    91489   91490   A       A       G       1       3
chr1    98944   98945   C       C       T       46      114

B-allele bedgraph

The bedgraph file format is similar to the BED format and it has the following columns:

Contig identifier
Start
Stop
Ratio of allele counts

When the priority of allele1 is higher than the priority of allele2, the output frequency is calculated by:

allele1Count / (allele1Count + allele2Count)

When the priority of allele2 is higher than the priority of allele1, the output frequency is calculated by:

allele2Count / (allele1Count + allele2Count)

An example of the bedgraph file is shown below:

chr1    11021   11022   0.333333
chr1    14463   14464   0.755102
chr1    16494   16495   0.317708
chr1    38741   38742   0.5
chr1    39014   39015   0.44186
chr1    39260   39261   0.581871
chr1    48447   48448   0.652174
chr1    48517   48518   0.464286
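
The relationship between the two B-allele files can be checked with a short Python sketch. It computes both candidate frequencies from the allele counts in the TSV example. How the allele priority is decided is not described in this section, so the sketch returns both ratios instead of choosing one. The function name is hypothetical.

```python
def allele_fractions(allele1_count, allele2_count):
    """Return (allele1 fraction, allele2 fraction), or None without coverage."""
    total = allele1_count + allele2_count
    if total == 0:
        return None
    return allele1_count / total, allele2_count / total


# First rows of the B-allele counts example: contig, start, stop, counts.
rows = [
    ("chr1", 11021, 11022, 4, 2),
    ("chr1", 14463, 14464, 111, 36),
    ("chr1", 16494, 16495, 122, 262),
]
for contig, start, stop, count1, count2 in rows:
    fraction1, fraction2 = allele_fractions(count1, count2)
    print(contig, start, stop, f"{fraction1:.6g}", f"{fraction2:.6g}")
```

Comparing the printed values with the bedgraph example shows that the first site reports the allele2 fraction, while the second and third sites report the allele1 fraction.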

Bias correction

The file *.target.counts.gc-corrected.gz contains the number of GC-corrected read counts per target interval. The format is equivalent to the *.target.counts.gz file:

Contig identifier
Start position
End position
Target interval name
GC-corrected read counts in this interval
Count of improperly paired alignments in this interval

Header lines are also included that start with #. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.

An example of a *.target.counts.gc-corrected.gz file is shown below.

#GC CORRECTED FILE
##DRAGENVersion=<VERSION_INFO>
##DRAGENCommandLine=<CommandLineOptions>
...
contig  start   stop    name    <SampleName> improper_pairs
chr1    818022  819840  target-wgs-chr1-818022:819840   1071.353133     6
chr1    819840  821337  target-wgs-chr1-819840:821337   1051.014997     19
chr1    821337  822485  target-wgs-chr1-821337:822485   1098.6502       10
chr1    822485  824431  target-wgs-chr1-822485:824431   1117.28308      7
chr1    830446  832304  target-wgs-chr1-830446:832304   1102.211816     1
chr1    832304  834311  target-wgs-chr1-832304:834311   1004.822683     5
chr1    836677  838659  target-wgs-chr1-836677:838659   1015.973037     7
chr1    841054  843056  target-wgs-chr1-841054:843056   1014.921403     3

Combined counts

The file *.combined.counts.txt.gz is a column-wise concatenation of individual *.target.counts.gz and *.target.counts.gc-corrected.gz used to form the panel of normals.

Normalization

The *.tn.tsv.gz file has the following columns:

Contig identifier
Start position
End position
Target interval name
Log2-normalized read counts in this interval
Count of improperly paired alignments in this interval

Header lines are also included that start with #.

An example of a *.tn.tsv.gz file is shown below.

#title = Normalized coverage profile
#sex = UNDETERMINED
contig  start   stop    name    <SampleName> improper_pairs
chr1    818022  819840  target-wgs-chr1-818022:819840   -0.18479358083014644    6
chr1    819840  821337  target-wgs-chr1-819840:821337   -0.21244441644669046    19
chr1    821337  822485  target-wgs-chr1-821337:822485   -0.14849555308041734    10
chr1    822485  824431  target-wgs-chr1-822485:824431   -0.12423291178926463    7
chr1    830446  832304  target-wgs-chr1-830446:832304   -0.1438261733656668     1
chr1    832304  834311  target-wgs-chr1-832304:834311   -0.27728673450293895    5
chr1    836677  838659  target-wgs-chr1-836677:838659   -0.26136555699676262    7
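
Because the normalized values are on a log2 scale, they can be converted to linear copy ratios for comparison with the segment means reported later. The following Python sketch uses the first three values from the example above and assumes that a value of 0 corresponds to the copy-neutral level.

```python
# Log2-normalized values taken from the first three example rows.
log2_values = [
    -0.18479358083014644,
    -0.21244441644669046,
    -0.14849555308041734,
]

# Assumption: a log2 value of 0 is copy neutral, so 2 ** value is the linear ratio.
linear_ratios = [2.0 ** value for value in log2_values]

for log2_value, ratio in zip(log2_values, linear_ratios):
    print(f"{log2_value:.4f} -> {ratio:.4f}")
```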

Segmentation

File extension: *.seg, *.seg.called, *.seg.called.merged

The *.seg file has the following columns:

Sample name
Contig identifier
Start position
End position
Number of intervals in the segment
Linear copy-ratio of the segment

An example of a *.seg file is shown below.

Sample  Chromosome      Start   End     Num_Probes      Segment_Mean
<SampleName> chr1    818022  1117426 224     0.82500341336435279
<SampleName> chr1    1117426 4063702 2438    0.91726081432236528
<SampleName> chr1    4063702 4067591 3       0.38861386123247205
<SampleName> chr1    4067591 7705829 3302    0.93021316913709917
<SampleName> chr1    7705829 9357003 1405    0.98147825043799442
<SampleName> chr1    9357003 9377365 19      0.50269670724395654
<SampleName> chr1    9377365 12859821        2905    1.0684818476332989

The *.seg.called file is identical to the *.seg file, with an additional column indicating the initial call for whether the segment is a duplication (+) or a deletion (-).

QUAL
FILTER
Copy number assignment
Ploidy
Improper_Pairs count

B-allele segmentation (Somatic)

BAF_SLM_STATE: Integer between 0 and 10, indicating bins of minor-allele fraction (low to high), or . when the BAF data are too variable to estimate a minor-allele fraction.

An example of a segmentation output file is shown below:

Sample  Chromosome      Start   End     Num_Probes      Segment_Mean    BAF_SLM_STATE
<SampleName> chr1    820348  1104646 194     0.29301737166888697     6
<SampleName> chr1    1105091 1533754 444     0.26185904799069076     5
<SampleName> chr1    1533810 1534166 9       0.41958837071702065     8
<SampleName> chr1    1534217 9356793 6689    0.26034515815016335     5
<SampleName> chr1    9358304 9376529 27      0.46450553586280602     10
<SampleName> chr1    9378480 12859495        1651    0.24172965924359388     5

Model identification (Somatic)

The file *.cnv.purity.coverage.models.tsv describes the tested models and their log-likelihoods. It has the following columns:

Model purity (Cellularity)
Model diploid coverage
Model log-likelihood

An example is shown below:

#Purity Coverage        logL
1       384     -23441740.5209
0.99    566     -22926572.4287
0.99    726     -23281869.1423
0.99    1206    -24075475.1481
0.99    1836    -24334376.579
0.99    2256    -24380290.0335
0.99    2696    -24380616.8655
0.98    449     -23988016.7101
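
For manual inspection, the models file can be scanned for the highest log-likelihood. The following Python sketch does this for the rows shown above, using only the standard library. It is an illustration only: the model that the caller finally reports can differ from the plain maximum, for example when the ModelSource value indicates a doubled or deduplicated model.

```python
import csv
import io

# Rows copied from the example above, joined with tabs.
MODEL_ROWS = [
    "#Purity\tCoverage\tlogL",
    "1\t384\t-23441740.5209",
    "0.99\t566\t-22926572.4287",
    "0.99\t726\t-23281869.1423",
    "0.98\t449\t-23988016.7101",
]


def highest_likelihood_model(handle):
    """Return (purity, coverage, logL) for the row with the largest logL."""
    best = None
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header line
        purity, coverage, log_likelihood = (float(value) for value in row[:3])
        if best is None or log_likelihood > best[2]:
            best = (purity, coverage, log_likelihood)
    return best


best_model = highest_likelihood_model(io.StringIO("\n".join(MODEL_ROWS)))
print(best_model)
```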

Visualization

The following IGV tracks are automatically populated in the output IGV session file:

*.target.counts.bw --- BigWig representation of the target counts bins. Setting the track view in IGV to barchart or points is recommended. Values are GC-corrected if GC correction was performed.
*.improper_pairs.bw --- BigWig representation of the improper pairs counts. Setting the track view in IGV to barchart is recommended.
*.tn.bw --- BigWig representation of the tangent normalized signal. Setting the track view in IGV to points is recommended.
*.seg.bw --- BigWig representation of the segments. Setting the track view in IGV to points is recommended.
*.baf.bedgraph.gz --- BED graph representation of B-allele frequency (if available). Setting the track view in IGV to points is recommended.
*.cnv.gff3 --- GFF3 representation of the CNV events. DEL events show as blue and DUP events show as red. Filtered events are a light gray. If REF events are enabled, then they will show up as green. An example of DRAGEN CNV gff3 is shown below (different CNV workflows might output different attributes on the 9th column):

##gff-version 3
chr1    DRAGEN  CNV     818023  9357003 1000    .       .       Start=818022;Stop=9357003;Length=8538981;Alt=<DEL>,<DUP>;Qual=1000;Filter=PASS;Genotype=1/2;CopyNumber=2;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=2.290614;MinorCopyNumberFloat=0.004158;BiasCorrectedReadCount=1138.9;MinorAlleleFrequency=0.259;BinCount=7372;ImproperPairsCount=6,157;NumAllelicSites=7336;color=#FF00FF;
chr1    DRAGEN  CNV     9357004 9377365 534     .       .       Start=9357003;Stop=9377365;Length=20362;Alt=<DEL>;Qual=534;Filter=cnvLength;Genotype=1/1;CopyNumber=0;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=0.147802;MinorCopyNumberFloat=0.073901;BiasCorrectedReadCount=623.5;MinorAlleleFrequency=0.5;BinCount=19;ImproperPairsCount=157,186;NumAllelicSites=27;color=#DDDDDD;
chr1    DRAGEN  CNV     9377366 36656983        1000    .       .       Start=9377365;Stop=36656983;Length=27279618;Alt=<DEL>,<DUP>;Qual=1000;Filter=PASS;Genotype=1/2;CopyNumber=3;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=2.910093;MinorCopyNumberFloat=0.068221;BiasCorrectedReadCount=1287.9;MinorAlleleFrequency=0.241;BinCount=22591;ImproperPairsCount=186,21;NumAllelicSites=18021;color=#FF0000;

For somatic WGS analyses, the following additional files are included in the IGV session XML:

*.baf.seq.bw --- BigWig representation of the BAF segments. Setting the track view in IGV to points is recommended.
*.tumor.baf.bedgraph.gz --- Bedgraph representation of the B-allele frequencies. Setting the track view in IGV to points and the windowing function to none is recommended.

IGV Session

File extension: *.igv_session.xml

<?xml version="1.0" encoding="utf-8"?>
<Session genome="b37" hasGeneTrack="false" hasSequenceTrack="true" version="8">
    <Resources>
        <Resource path="example.cnv.gff3"/>
        <Resource path="example.cnv.excluded_intervals.bed.gz"/>
        <Resource path="example.target.counts.bw"/>
        <Resource path="example.improper.pairs.bw"/>
        <Resource path="example.tn.bw"/>
        <Resource path="example.seg.bw"/>
    </Resources>
    <Panel height="500" width="1200" name="DataPanel">
        ...
    </Panel>
</Session>

You can determine the correct encoding of a reference genome by going to File > Save Session... in IGV and then inspecting the generated igv_session.xml file.

Creating CNV coverage and BAF plots with third-party tools

DRAGEN CNV outputs can be ingested using third-party libraries in commonly used languages such as Python and R. The typically used files are:

*.target.counts.gz or *.target.counts.gc-corrected.gz, containing the number of alignments, or corrected alignments, per interval. Used to display the coverage profile across all intervals.
*.tn.tsv.gz, containing the log2-normalized copy ratio per interval.
*.baf.bedgraph.gz, if BAF is available, containing the BAF for each considered site. Used to display the BAF profile across all sites.

In all previously specified files, the format is similar to BED, allowing them to be loaded as any other tab-separated files.

Using R, a good starting point is the karyoploteR package. The main workflow involves reading the *.target.counts.gz file as an R data frame, converting it to a GRanges object, and then plotting the target intervals as points with karyoploteR. The same workflow can be used to plot the GC-corrected counts, the log2-normalized copy ratios, and the BAF profiles.

Using Python, the workflow is similar to R's but using Python's libraries such as , to convert DRAGEN output files to dataframe, and , to plot coverage and BAF profiles across the genome.
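
As a starting point that needs no third-party library, the following Python sketch reads a BED-like DRAGEN table into plain dictionaries. It assumes the layout shown in the target counts example: metadata lines that start with #, followed by one line of column names, followed by whitespace-separated data rows. The function name and the SAMPLE column name are placeholders.

```python
import io


def read_dragen_table(handle):
    """Return (metadata_lines, column_names, rows) from a BED-like table.

    Assumptions: metadata lines start with '#', the first other line holds
    the column names, and no field contains embedded whitespace.
    """
    metadata, columns, rows = [], None, []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            metadata.append(line)
        elif columns is None:
            columns = line.split()
        else:
            rows.append(dict(zip(columns, line.split())))
    return metadata, columns, rows


# Two rows from the target counts example; SAMPLE stands in for <SampleName>.
example = io.StringIO(
    "#TARGET COUNTS FILE\n"
    "contig\tstart\tstop\tname\tSAMPLE\timproper_pairs\n"
    "1\t565480\t565959\ttarget-wgs-1-565480\t7\t6\n"
    "1\t566837\t567182\ttarget-wgs-1-566837\t9\t0\n"
)
metadata, columns, rows = read_dragen_table(example)
counts = [float(row["SAMPLE"]) for row in rows]
print(columns)
print(counts)
```

For a gzip-compressed output file, the same function can be given a handle opened in text mode with the standard gzip module. The bedgraph file has no column-name line, so it needs a reader that treats every non-comment line as data.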

A similar workflow can be used to plot copy number calls (and minor copy number calls, if available) by using the *.cnv.gff3 output file. Some examples of DRAGEN output GFF3 are shown below:

Germline WGS

chr1	DRAGEN	CNV	818023	1426288	52	.	.	Alt=REF;LinearCopyRatio=1.03497;CopyNumber=2;Genotype=./.;Qual=52;Filter=PASS;Start=818022;Stop=1426288;Length=608266;BinCount=491;ImproperPairsCount=1,7;color=#00FF00;
chr1	DRAGEN	CNV	1426289	1428354	22	.	.	Alt=DEL;LinearCopyRatio=0.411841;CopyNumber=1;Genotype=0/1;Qual=22;Filter=cnvLength;dinucQual;Start=1426288;Stop=1428354;Length=2066;BinCount=2;ImproperPairsCount=7,16;color=#DDDDDD;

Somatic WGS

chr1    DRAGEN  CNV     818023  9357003 1000    .       .       Start=818022;Stop=9357003;Length=8538981;Alt=<DEL>,<DUP>;Qual=1000;Filter=PASS;Genotype=1/2;CopyNumber=2;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=2.290614;MinorCopyNumberFloat=0.004158;BiasCorrectedReadCount=1138.9;MinorAlleleFrequency=0.259;BinCount=7372;ImproperPairsCount=6,157;NumAllelicSites=7336;color=#FF00FF;
chr1    DRAGEN  CNV     9357004 9377365 534     .       .       Start=9357003;Stop=9377365;Length=20362;Alt=<DEL>;Qual=534;Filter=cnvLength;Genotype=1/1;CopyNumber=0;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=0.147802;MinorCopyNumberFloat=0.073901;BiasCorrectedReadCount=623.5;MinorAlleleFrequency=0.5;BinCount=19;ImproperPairsCount=157,186;NumAllelicSites=27;color=#DDDDDD;
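
The GFF3 records above can be split into coordinates and attributes with a few lines of Python. The sketch below is a hypothetical helper, not a full GFF3 parser. Because the germline example lists two filters separated by a semicolon (Filter=cnvLength;dinucQual), a token without an equals sign is appended to the preceding attribute, which is an assumption about how such records should be read.

```python
def parse_cnv_gff3_line(line):
    """Split one tab-separated GFF3 record into coordinates and attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    attributes = {}
    last_key = None
    for token in fields[8].split(";"):
        if not token:
            continue  # trailing semicolon
        if "=" in token:
            key, value = token.split("=", 1)
            attributes[key] = value
            last_key = key
        elif last_key is not None:
            # Assumed continuation of the previous value, e.g. a second filter.
            attributes[last_key] += ";" + token
    return {
        "contig": fields[0],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "attributes": attributes,
    }


# Second germline WGS record from the example above.
germline_line = "\t".join([
    "chr1", "DRAGEN", "CNV", "1426289", "1428354", "22", ".", ".",
    "Alt=DEL;LinearCopyRatio=0.411841;CopyNumber=1;Genotype=0/1;Qual=22;"
    "Filter=cnvLength;dinucQual;Start=1426288;Stop=1428354;Length=2066;"
    "BinCount=2;ImproperPairsCount=7,16;color=#DDDDDD;",
])
record = parse_cnv_gff3_line(germline_line)
print(record["start"], record["end"], record["attributes"]["CopyNumber"])
print(record["attributes"]["Filter"])
```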

Excluded Intervals File

Exclusion Reason

Description

Related DRAGEN Option

An example of a *.excluded_intervals.bed.gz file is shown below:

chr1    0       818022  NON_KMER_UNIQUE
chr1    824431  830446  NON_KMER_UNIQUE
chr1    834311  836677  NON_KMER_UNIQUE
chr1    838659  841054  NON_KMER_UNIQUE
chr1    850451  853257  NON_KMER_UNIQUE
chr1    855442  860261  NON_KMER_UNIQUE
chr1    866189  868833  NON_KMER_UNIQUE
chr1    881779  884116  NON_KMER_UNIQUE
chr1    1016667 1018959 NON_KMER_UNIQUE
chr1    1075880 1079718 NON_KMER_UNIQUE
chr1    1137942 1140725 NON_KMER_UNIQUE

Panel of Normals Files

PON Metrics File

Column index

Column contents

Description

Example:

contig  start   stop    name    mean    std     normalizedStd min     25%     50%     75%     max     intervalSize    gcContents
1       12098   12178   target-wes-1-12098:12178/1      3.6259044560802365      0.46661435469856077      0.1286890927079175     2.7961783439490446      3.2573018790849675      3.7105263157894739      4.0162683823529415      4.3298969072164946      80      0.49382716049382713
1       12178   12258   target-wes-1-12178:12258/2      5.0685579775753595      0.70638315915955963      0.13936570564740217     3.9044585987261144      4.5225944682508761      5.067708333333333       5.5778115844038769      6.3277777777777775      80      0.46913580246913578
1       12553   12637   target-wes-1-12553:12637/1      4.6990858287992054      0.62537786269786677      0.13308500535681309     3.7417218543046356      4.0305632538350444      5.0382165605095546      5.2151580459770113      5.5773195876288657      84      0.6705882352941176

PON Correlation File

Example:

Correlation of case sample CASE_SAMPLE_NAME
  PON1: 0.9786
  PON2: 0.9868
  PON3: 0.9912
  ...

SegDups Extension Files

The final output has extension .cnv.segdups.rescued_intervals.tsv.gz, and contains the rescued target intervals which can then be injected before segmentation. It has columns:

Chromosome name
Start - 0-based inclusive
Stop - 0-based exclusive
Target interval name prefixed with "target-wgs-"
Sample Counts (in header, identifier taken from RGSM) - log2-scale normalized counts for each interval
Improper pairs - Kept for compatibility with CNV workflow, set to 0 for rescued intervals
Target region ID - ID of the target region (aka pair of rescued target intervals)

Intermediate files

The joint normalized coverage profile (log2-scale) for each region is provided in output to file .cnv.segdups.joint_coverage.tsv.gz with columns:

Target region ID
Joint normalized coverage (log2-scale) of the two intervals in the target region
Copy Number Float - estimate of joint copy number for the target region (e.g., CNF ~ 4)

The differentiating sites' data is provided in output to file .cnv.segdups.site_ratios.tsv.gz with columns:

Differentiating site name
Target (gene A) counts at site
Non-target (gene B) counts at site
Target ratio: gene A counts over total (i.e., gene A + gene B) counts at site

Copy Number Variant Calling

CNV Workflow

The DRAGEN CNV pipeline follows the workflow shown in the following figure.

DRAGEN CNV Pipeline Workflow

The CNV pipeline has the following processing modules:

Target Counts --- Binning of the read counts and other signals from alignments.
Bias Correction --- Correction of intrinsic system biases.
Normalization --- Detection of normal ploidy levels and normalization of the case sample.
Segmentation --- Breakpoint detection via segmentation of the normalized signal.
Calling / Genotyping --- Thresholding, scoring, qualifying, and filtering of putative events as copy number variants.

Signal Flow Analysis

The first step in the DRAGEN CNV Pipeline is the target counts stage. The target counts stage extracts signals such as read count and improper pairs and puts them into target intervals.

Read Count Signal

Improper Pairs Signal

Next, the case sample is normalized against the panel of normals or against the estimated normal ploidy level. Any other biases are subtracted out of the signal to amplify any event level signals.

Normalization

The normalized signal is then segmented using one of the available segmentation algorithms. Events are then called from the segments.

Segments

Called Events

The events are then scored and emitted in the output VCF.

CNV Pipeline Options

--bam-input --- The BAM file to be processed.
--cram-input --- The CRAM file to be processed.
--enable-cnv --- Enable or disable CNV processing. Set to true to enable CNV processing.
--enable-map-align --- Enables the mapper and aligner module. The default is true, so all input reads are remapped and aligned unless this option is set to false.
--fastq-file1, --fastq-file2 --- FASTQ file or files to be processed.
--output-directory --- Output directory where all results are stored.
--output-file-prefix --- Output file prefix that will be prepended to all result file names.
--ref-dir --- The DRAGEN reference genome hashtable directory.

Output and Filtering Options

The output and filtering options control the CNV output files.

--cnv-exclude-bed --- Specifies a BED file that indicates the intervals to exclude from the CNV analysis. If the overlap between a target interval and the regions in the exclude BED file is greater than cnv-exclude-bed-min-overlap, the target interval is suppressed.
--cnv-exclude-bed-min-overlap --- Specifies the fraction of overlap between a target interval and the excluded regions that is used as the filtering threshold. The default is 0.5.
--cnv-enable-ref-calls --- Emit copy neutral (REF) calls in the output VCF file. The default is true for single WGS CNV analysis.
--cnv-enable-tracks --- Generate track files that can be imported into IGV for viewing. When this option is enabled, a *.gff file for the output variant calls is generated, as well as *.bw files for the tangent normalized signal. The default is true.
--cnv-filter-bin-support-ratio --- Filters out a candidate event if the span of supporting bins is less than the specified ratio with respect to the overall event length. This filter applies only to records with a length greater than cnv-filter-bin-support-ratio-min-len. The default ratio is 0.2 (20% support). For example, if an event has a length of 100,000 bp, but the target interval bins that support the call span a total of only 15,000 bp (15,000/100,000 = 0.15), then the event is filtered out. If applied, the record will have cnvBinSupportRatio as a filter.
--cnv-filter-bin-support-ratio-min-len --- Minimum length of candidate event at which to apply cnv-filter-bin-support-ratio. Currently only applied to germline WGS workflows, with default value of 80,000 bp.
--cnv-filter-copy-ratio --- Specifies the minimum copy ratio (CR) threshold value centered about 1.0 at which a reported event is marked as PASS in the output VCF file. The default value is 0.2, which leads to calls with CR between 0.8 and 1.2 being filtered. If applied, the record will have cnvCopyRatio as a filter.
--cnv-filter-length --- Specifies the minimum event length in bases at which a reported event is marked as PASS in the output VCF file. The default is 10000. If applied, the record will have cnvLength as a filter.
--cnv-filter-qual --- Specifies the QUAL value at which a reported event is marked as PASS in the output VCF file. You should adjust the parameter value according to your own application data. If applied, the record will have cnvQual as a filter.
--cnv-min-qual --- Specifies the minimum reported QUAL. The default is 3.
--cnv-max-qual --- Specifies the maximum reported QUAL. The default is 200.
--cnv-qual-length-scale --- Specifies the bias weighting factor to adjust QUAL estimates for segments with longer lengths. This is an advanced option and should not need to be modified. The default is 0.9303 (2^-0.1).
--cnv-qual-noise-scale --- Specifies the bias weighting factor to adjust QUAL estimates based on sample variance. This is an advanced option and should not need to be modified. The default is 1.0.
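
The way several of these thresholds combine can be sketched in Python. The function below is a hypothetical illustration built from the option descriptions and defaults above, not the DRAGEN implementation. The QUAL filter is omitted because its default differs between workflows, and the handling of values exactly on a threshold is an assumption. In practice, the copy ratio and bin support filters apply only to the workflows listed in the FILTER section.

```python
def hard_filters(length_bp, copy_ratio, supporting_bp,
                 min_length=10_000,        # cnv-filter-length
                 copy_ratio_margin=0.2,    # cnv-filter-copy-ratio
                 min_support_ratio=0.2,    # cnv-filter-bin-support-ratio
                 support_min_len=80_000):  # cnv-filter-bin-support-ratio-min-len
    """Return the list of filters an event would receive, or ["PASS"]."""
    filters = []
    if length_bp < min_length:
        filters.append("cnvLength")
    if abs(copy_ratio - 1.0) < copy_ratio_margin:
        filters.append("cnvCopyRatio")
    if length_bp > support_min_len and supporting_bp / length_bp < min_support_ratio:
        filters.append("cnvBinSupportRatio")
    return filters or ["PASS"]


# The 100,000 bp event from the text, supported by only 15,000 bp of bins.
print(hard_filters(100_000, 0.5, 15_000))
# A short event and a near-neutral event (hypothetical values).
print(hard_filters(5_000, 0.5, 5_000))
print(hard_filters(50_000, 1.1, 50_000))
```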

CNV Pipeline Input

The DRAGEN CNV pipeline supports multiple input formats. To run the DRAGEN CNV pipeline directly with FASTQ input without generating a BAM or CRAM file, see for instructions on streaming alignment records directly from the DRAGEN map/align stage.

DRAGEN CNV also supports running from an already mapped and aligned BAM or CRAM file. If you have data that has not yet been mapped and aligned, see .

Reference Hashtable

The reference hashtable is a pregenerated binary representation of the reference genome. For information on generating a hashtable, see .

The following example command generates a hashtable.

dragen \
--build-hash-table true \
--ht-reference <FASTA> \
--output-directory <OUTPUT> \
--enable-cnv true \
<OTHER HASHTABLE OPTIONS>

Generate an Alignment File

The following example command maps and aligns a FASTQ file:

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
--RGSM <SAMPLE> \
--RGID <RGID> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true

The following example command maps and aligns an existing BAM file:

dragen \
-r <HASHTABLE> \
--bam-input <BAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true

The following example command maps and aligns an existing CRAM file:

dragen \
-r <HASHTABLE> \
--cram-input <CRAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true

Streaming Alignments

The following example command maps and aligns FASTQ input and streams the alignment records directly to the CNV caller, using self-normalization:

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
--RGSM <SAMPLE> \
--RGID <RGID> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true \
--enable-cnv true \
--cnv-enable-self-normalization true

For information on running CNV concurrently with the Haplotype Variant Caller, see Concurrent CNV and Small Variant Calling.

Target Counts

With whole exome sequence data, DRAGEN uses the target BED file supplied with the --cnv-target-bed option to determine the intervals for analysis.

Further details are available in the section.

Whole Genome

For WGS, the default --cnv-interval-width value is 1000 bp, which is intended for a sample coverage of ≥ 30x.

WGS Coverage per Sample | Recommended Resolution* (bp)

*Using a --cnv-interval-width of less than 250 bp for WGS analysis can drastically increase runtime.

For example, to skip chromosomes M, X, and Y, use the following option:

--cnv-skip-contig-list "chrM,chrX,chrY"
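
To make the relationship between the interval width and the skipped contigs concrete, the following Python sketch tiles each contig into fixed-width intervals and omits the contigs in a skip list. It is a simplified illustration only: the function name and the contig lengths are hypothetical, and the actual interval generation in DRAGEN also accounts for factors not modeled here.

```python
# Default skip list as documented for --cnv-skip-contig-list.
DEFAULT_SKIP_CONTIGS = {"chrM", "MT", "m", "chrm"}


def generate_intervals(contig_lengths, width=1000, skip_contigs=DEFAULT_SKIP_CONTIGS):
    """Tile each contig into half-open [start, end) intervals of `width` bases.

    contig_lengths: dict mapping contig name to length in bases (hypothetical input).
    The final interval of a contig is truncated at the contig end.
    """
    intervals = []
    for contig, length in contig_lengths.items():
        if contig in skip_contigs:
            continue
        for start in range(0, length, width):
            intervals.append((contig, start, min(start + width, length)))
    return intervals


if __name__ == "__main__":
    toy_genome = {"chr1": 2500, "chrM": 1000, "chrX": 1500}
    for interval in generate_intervals(toy_genome, skip_contigs={"chrM", "chrX", "chrY"}):
        print(interval)
```

Halving the width doubles the number of intervals, which is one reason very small interval widths increase runtime.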

Whole Exome

To use a standard BED file, make sure that there is no header present in the file. In this case, all columns after the third column are ignored, similar to the operation of DRAGEN Variant Caller.
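
Before passing a BED file to the pipeline, it can be useful to confirm that the file has no header and that the first three columns parse cleanly. The following Python sketch is one way to do that check. The function name is hypothetical, and the check assumes tab-separated columns.

```python
def read_target_bed(lines):
    """Parse BED lines into (contig, start, stop) tuples.

    Only the first three columns are used; any further columns are ignored.
    Raises ValueError on a line whose coordinates are not integers, which is
    the typical symptom of a header line.
    """
    targets = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"line {line_number}: expected at least 3 columns")
        contig, start, stop = fields[:3]
        if not (start.isdigit() and stop.isdigit()):
            raise ValueError(f"line {line_number}: non-numeric coordinates (header line?)")
        targets.append((contig, int(start), int(stop)))
    return targets


if __name__ == "__main__":
    example = ["chr1\t100\t200\tGENE_A\n", "chr1\t300\t450\tGENE_B\n"]
    print(read_target_bed(example))
```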

Target Counts Options

The following options control the generation of target counts.

--cnv-counts-method --- Specifies the counting method for an alignment to be counted in a target bin. Values are midpoint, start, or overlap. The default value is overlap when using the panel of normals approach, which means if an alignment overlaps any part of the target bin, the alignment is counted for that bin. In the self-normalization mode, the default counting method is start.
--cnv-min-mapq --- Specifies the minimum MAPQ for an alignment to be counted during target counts generation. The default value is 3 for self-normalization and 20 otherwise. When generating counts for panel of normals, all MAPQ0 alignments are counted.
--cnv-target-bed --- Specifies a properly formatted BED file that defines the target intervals over which coverage is sampled. For use in WES analysis.
--cnv-interval-width --- Specifies the width of the sampling interval for CNV processing. This option controls the effective window size. The default is 1000 for WGS analysis and 500 for WES analysis.
--cnv-skip-contig-list --- Specifies a comma-separated list of contig identifiers to skip when generating intervals for WGS analysis. The default contigs that are skipped, if not specified, are chrM,MT,m,chrm.
--cnv-filter-duplicate-alignments --- Filters duplicate-marked alignments during target counts generation when set to true. The default setting is false.
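
The difference between the three --cnv-counts-method values can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DRAGEN implementation: the function name is hypothetical, coordinates are treated as half-open [start, end), and the midpoint is taken by integer division.

```python
def counted_bins(aln_start, aln_end, bins, method="overlap"):
    """Return the indices of the target bins that an alignment is counted in.

    bins: list of half-open (start, end) intervals on the same contig.
    method: "start", "midpoint", or "overlap".
    """
    if method == "overlap":
        # Counted in every bin the alignment touches.
        return [i for i, (b_start, b_end) in enumerate(bins)
                if aln_start < b_end and aln_end > b_start]
    if method == "start":
        position = aln_start
    elif method == "midpoint":
        position = (aln_start + aln_end) // 2
    else:
        raise ValueError(f"unknown counting method: {method}")
    # Counted only in the bin that contains the chosen position.
    return [i for i, (b_start, b_end) in enumerate(bins)
            if b_start <= position < b_end]


if __name__ == "__main__":
    bins = [(0, 1000), (1000, 2000)]
    for method in ("start", "midpoint", "overlap"):
        print(method, counted_bins(950, 1100, bins, method))
```

With start or midpoint, each alignment contributes to at most one bin, whereas overlap can count the same alignment in adjacent bins.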

Filter Duplicate Alignments

If --enable-map-align=false, then duplicate marking should be present in the input file (pre-aligned BAM/CRAM). If --enable-map-align=true, then --enable-duplicate-marking=true should be set.

Note that CNV waits for duplicate marking from the Map/Aligner, which may increase overall run time.

Input format | enable-map-align | Required option

Target Counts Dropout Regions

Some of the excluded intervals can be rescued through the segmental duplication extension to the germline CNV workflow. See Segmental Duplication Extension for more details.

GC Bias Correction

The GC bias correction module immediately follows the target counts stage and operates on the *.target.counts.gz file. GC bias correction generates a GC bias corrected version of the file, which has a *.target.counts.gc-corrected.gz extension in the file name. The GC bias corrected versions are recommended for any downstream processing when working with WGS data. For WES, if there are enough target regions, then the GC bias corrected counts can also be used. See for further details on GC-corrected target counts files.

The following options control the GC bias correction module.

--cnv-enable-gcbias-correction --- Enable or disable GC bias correction when generating target counts. The default is true.
--cnv-enable-gcbias-smoothing --- Enable or disable smoothing the GC bias correction across adjacent GC bins with an exponential kernel. The default is true.
--cnv-num-gc-bins --- Specifies the number of bins for GC bias correction. Each bin represents the GC content percentage. Allowed values are 10, 20, 25, 50, or 100. The default is 25.
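
To illustrate what the number of GC bins means in practice, the following Python sketch computes the GC percentage of an interval sequence and maps it to a bin index. It assumes equal-width bins over 0 to 100 percent (for example, 25 bins of 4 percentage points each), which this guide does not state explicitly. The function names are hypothetical.

```python
ALLOWED_NUM_BINS = (10, 20, 25, 50, 100)


def gc_percent(sequence):
    """Return the GC content of a non-empty sequence as a percentage."""
    sequence = sequence.upper()
    gc_count = sequence.count("G") + sequence.count("C")
    return 100.0 * gc_count / len(sequence)


def gc_bin_index(percent, num_bins=25):
    """Map a GC percentage to a bin index, assuming equal-width bins."""
    if num_bins not in ALLOWED_NUM_BINS:
        raise ValueError(f"num_bins must be one of {ALLOWED_NUM_BINS}")
    bin_width = 100.0 / num_bins
    # A value of exactly 100% falls into the last bin.
    return min(int(percent / bin_width), num_bins - 1)


if __name__ == "__main__":
    pct = gc_percent("GGCCAATT")
    print(pct, gc_bin_index(pct))
```

Fewer bins pool more intervals per bin, and more bins resolve the GC response more finely at the cost of fewer intervals per bin.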

Normalization

The DRAGEN CNV pipeline supports two normalization algorithms:

Self-Normalization --- Estimates the autosomal diploid level from the sample under analysis to determine the baseline level to normalize by. Sex chromosomes and PAR regions are handled based on the sample sex.
Panel of Normals --- A reference-based normalization algorithm that uses additional matched normal samples to determine a baseline level from which to call CNV events. Here, matched normal samples are samples that have undergone the same library prep and sequencing workflow as the case sample.

Which algorithm to use depends on the available data and the application. Use the following guidelines to select the mode of normalization.

Self-Normalization

Whole genome sequencing
Single sample analysis
Additional matched samples are not readily available
Simpler workflow via a single invocation
Only references with chr1, chr2, chr3, ..., chrX, chrY or 1, 2, 3, ..., X, Y naming conventions are supported.

Panel of Normals

Whole genome sequencing (non-somatic)
Whole exome sequencing
Targeted panels, including somatic panels
Additional matched samples are available
Nonhuman samples

Self-Normalization

Because self-normalization uses the statistics within the case sample, this mode is not recommended for WES or targeted sequencing analysis due to the potential for insufficient data.

The following is an example command for running self-normalization on a BAM file.

dragen \
-r <HASHTABLE> \
--bam-input <BAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-enable-self-normalization true

If you are running from a FASTQ sample, then the default mode of operation is self-normalization.

Self-normalization autogenerates the target intervals to use during the analysis based on the reference genome and is only compatible with standard human references.

Panel of Normals

Target Counts Stage

The following examples are for WES processing, which is the case where a panel of normals is required.

The following is an example command for processing a BAM file.

dragen \
-r <HASHTABLE> \
--bam-input <BAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-target-bed <BED>

The following is an example command for processing a CRAM file.

dragen \
-r <HASHTABLE> \
--cram-input <CRAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-target-bed <BED>

Generating Panel of Normals (Combined Counts)

The following is an example command line using a normals list:

dragen \
-r <HASHTABLE> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--cnv-normals-list <NORMALS_LIST> \
--enable-cnv true \
--cnv-generate-combined-counts true

Normalization and Call Detection Stage

The following is an example list of PON files, which uses a subset of the GC-corrected files from the target counts stage.

/data/output/sample1.target.counts.gc-corrected.gz
/data/output/sample2.target.counts.gc-corrected.gz
/data/output/sample4.target.counts.gc-corrected.gz
/data/output/sample5.target.counts.gc-corrected.gz
/data/output/sample7.target.counts.gc-corrected.gz
/data/output/sample8.target.counts.gc-corrected.gz
...
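
A list like the one above can be assembled by hand or with a short script. The following Python sketch collects the GC-corrected target counts files from an output directory and writes their absolute paths, one per line. The function name and the normals.txt file name are hypothetical. The sketch includes every matching file, so remove any samples that should not be part of the panel.

```python
from pathlib import Path


def write_normals_list(counts_dir, list_path,
                       suffix=".target.counts.gc-corrected.gz"):
    """Write the absolute paths of all matching counts files, one per line."""
    counts_files = sorted(Path(counts_dir).resolve().glob(f"*{suffix}"))
    Path(list_path).write_text("".join(f"{path}\n" for path in counts_files))
    return counts_files


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        # Create empty placeholder files so the example runs anywhere.
        for sample in ("sample1", "sample2"):
            (Path(tmp) / f"{sample}.target.counts.gc-corrected.gz").touch()
        write_normals_list(tmp, Path(tmp) / "normals.txt")
        print((Path(tmp) / "normals.txt").read_text())
```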

DRAGEN accepts three different file formats for a Panel of Normals (PON).

Option | Description

For example, the following command normalizes the case sample against the panel of normals.

dragen \
-r <HASHTABLE> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-input <CASE_COUNTS> \
--cnv-normals-list <NORMALS> \
--cnv-enable-gcbias-correction false

See for a description of the target counts files.

Normalization Options

These options control the preconditioning of the panel of normals and the normalization of the case sample.

--cnv-enable-self-normalization --- Enable/disable self normalization mode, which does not require a panel of normals.
--cnv-extreme-percentile --- Specifies the extreme median percentile value at which to filter out samples. The default is 2.5.
--cnv-input --- Specifies a target counts file for the case sample under analysis when using a panel of normals.
--cnv-normals-file --- Specifies a target.counts.gz file to be used in the panel of normals. You can use this option multiple times, one time for each file.
--cnv-normals-list --- Specifies a text file that contains paths to the list of reference target counts files to be used as a panel of normals. Absolute paths are recommended in case the workflow is used later or shared with other users. Relative paths are supported if the paths are relative to the current working directory.
--cnv-max-percent-zero-samples --- Specifies the maximum percentage of zero-coverage samples allowed for a target. If a target exceeds the specified threshold, the target is filtered out. The default value is 5%. The option is sensitive to the number of normal samples being used, so adjust the threshold accordingly. If the panel of normals is small and the threshold is not adjusted, the option could filter out targets unintentionally.
--cnv-max-percent-zero-targets --- Specifies the maximum percentage of zero-coverage targets allowed for a sample. If a sample exceeds the specified threshold, the sample is filtered out. The default value is 2.5%. The option is sensitive to the total number of target intervals, so adjust the threshold accordingly. If the capture kit has a small number of probes and the threshold is not adjusted, the option could filter out samples unintentionally.
--cnv-target-factor-threshold --- Specifies the bottom percentile of panel of normals medians to filter out useable targets. The default is 1% for whole genome processing and 5% for targeted sequencing processing.
--cnv-truncate-threshold --- Specifies a percentage threshold for truncating extreme outliers. The default is 0.1%.
--cnv-enable-gender-matched-pon --- Enable/disable gender-matched PON normalization. If enabled, DRAGEN uses gender-matched PON samples for sex chromosome normalization. Sex chromosome intervals are filtered out if the PON has no gender-matched sample. The default value is true.
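
The sensitivity of --cnv-max-percent-zero-samples to the panel size is easiest to see with a small calculation. The following Python sketch is a simplified model of the check as described above, with a hypothetical function name, and assumes that "exceeds" means strictly greater than the threshold.

```python
def target_is_filtered(zero_coverage_samples, panel_size,
                       max_percent_zero_samples=5.0):
    """Return True if a target exceeds the allowed percentage of zero-coverage samples."""
    percent_zero = 100.0 * zero_coverage_samples / panel_size
    return percent_zero > max_percent_zero_samples


if __name__ == "__main__":
    # One zero-coverage sample in a panel of 10 is 10%, above the 5% default.
    print(target_is_filtered(1, panel_size=10))
    # The same single sample in a panel of 40 is 2.5%, below the default.
    print(target_is_filtered(1, panel_size=40))
```

Under this model, a panel of 10 normals with the default threshold drops a target as soon as one normal has zero coverage there, which is why small panels usually need a higher threshold.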

Segmentation

After a case sample has been normalized, the sample goes through a segmentation stage. DRAGEN implements multiple segmentation algorithms, including the following algorithms:

Circular Binary Segmentation (CBS)
Shifting Level Models (SLM)

By default, SLM is the segmentation algorithm for germline whole genome processing, ASLM is the algorithm for somatic whole genome processing, and HSLM is the algorithm for whole exome processing.

--cnv-segmentation-mode --- Specifies the segmentation algorithm to perform. The following values are available.
- bed
- cbs
- slm --- The default for germline WGS analysis.
- aslm --- The default for somatic WGS analysis.
- hslm --- The default for targeted/WES analysis.
--cnv-merge-distance --- Specifies the maximum number of base pairs between two segments that would allow them to be merged. The default value is 0 for germline WGS, which means the segments must be directly adjacent. For WES analysis, this parameter is disabled by default due to the spacing of targeted intervals.
--cnv-merge-threshold --- Specifies the maximum segment mean difference at which two adjacent segments should be merged. The segment mean is represented as a linear copy ratio value. The default is 0.2 for WGS and 0.4 for WES. To disable merging, set the value to 0.
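
The interaction between --cnv-merge-distance and --cnv-merge-threshold can be sketched as follows. This Python example is an illustration under stated assumptions rather than the DRAGEN algorithm: segments are sorted and on a single contig, the merged segment mean is a length-weighted average, and both comparisons are inclusive. The function name is hypothetical.

```python
def merge_segments(segments, merge_distance=0, merge_threshold=0.2):
    """Merge neighboring segments that are close together and have similar means.

    segments: sorted list of (start, end, mean) tuples, where mean is a linear
    copy ratio. A merge_threshold of 0 disables merging.
    """
    if merge_threshold == 0 or not segments:
        return list(segments)
    merged = [segments[0]]
    for start, end, mean in segments[1:]:
        prev_start, prev_end, prev_mean = merged[-1]
        close_enough = start - prev_end <= merge_distance
        similar = abs(mean - prev_mean) <= merge_threshold
        if close_enough and similar:
            prev_len, cur_len = prev_end - prev_start, end - start
            # Assumption: combine means weighted by segment length.
            new_mean = (prev_mean * prev_len + mean * cur_len) / (prev_len + cur_len)
            merged[-1] = (prev_start, end, new_mean)
        else:
            merged.append((start, end, mean))
    return merged


if __name__ == "__main__":
    segments = [(0, 1000, 1.0), (1000, 2000, 1.1), (2000, 3000, 1.6)]
    print(merge_segments(segments))
```

In the example, the first two segments differ by about 0.1 and are merged, while the third differs from the merged mean by more than 0.2 and stays separate.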

Circular Binary Segmentation

The following options control Circular Binary Segmentation (CBS)¹.

--cnv-cbs-alpha --- Specifies the significance level for the test to accept change points. The default is 0.01.
--cnv-cbs-eta --- Specifies the Type I error rate of the sequential boundary for early stopping when using the permutation method. The default is 0.05.
--cnv-cbs-kmax --- Specifies the maximum width of the smaller segment for permutation. The default is 25.
--cnv-cbs-min-width --- Specifies the minimum number of markers for a changed segment. The default is 2.
--cnv-cbs-nmin --- Specifies the minimum length of data for maximum statistic approximation. The default is 200.
--cnv-cbs-nperm --- Specifies the number of permutations used for p-value computation. The default is 10000.
--cnv-cbs-trim --- Specifies the proportion of data to be trimmed for variance calculations. The default is 0.025.

¹Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23(6):657-663. doi:10.1093/bioinformatics/btl646

Shifting Level Models Segmentation

The Shifting Level Models (SLM) segmentation mode follows from the R implementation of SLMSuite: a suite of algorithms for segmenting genomic profiles².

--cnv-slm-eta --- Baseline probability that the mean process changes its value. The default is 4e-5.
--cnv-slm-fw --- Minimum number of data points for a CNV to be emitted. The default is 0, which means segments with one design probe could in effect be emitted.
--cnv-slm-omega --- Scaling parameter that modulates the relative weight between experimental and biological variance. The default is 0.3.
--cnv-slm-stepeta --- Distance normalization parameter. The default is 10000. This option is only valid for HSLM.

Regardless of segmentation method, initial segments are split across large gaps where depth data is unavailable, such as across centromeres.
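
The gap-splitting behavior can be pictured as grouping consecutive intervals and starting a new group whenever the distance to the previous interval is too large. The following Python sketch shows that idea. The function name and the max_gap parameter are hypothetical, and this guide does not state what gap size DRAGEN treats as large.

```python
def split_on_gaps(intervals, max_gap):
    """Group sorted (start, end) intervals, starting a new group after any gap > max_gap."""
    groups = []
    for start, end in intervals:
        if groups and start - groups[-1][-1][1] <= max_gap:
            groups[-1].append((start, end))
        else:
            groups.append([(start, end)])
    return groups


if __name__ == "__main__":
    intervals = [(0, 1000), (1000, 2000), (5_000_000, 5_001_000)]
    for group in split_on_gaps(intervals, max_gap=1_000_000):
        print(group)
```

Each resulting group would then be segmented independently, so no segment spans a region without depth data.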

²Orlandini V, Provenzano A, Giglio S, Magi A. SLMSuite: a suite of algorithms for segmenting genomic profiles. BMC Bioinformatics. 2017;18(1). doi:10.1186/s12859-017-1734-5

Targeted Segmentation (Segment BED)

The following is an example of a segment BED file.

contig  start      stop       name
chr1    40356094   40372764   MYCL1
chr1    115245083  115261621  NRAS
chr1    204485504  204526342  MDM4
chr2    16075981   16090656   MYCN
chr2    29416087   30143527   ALK
chr3    12626010   12704516   RAF1
chr3    138374228  138478187  PIK3CB
chr3    178866307  178952154  PIK3CA
chr3    195776751  195806640  TFRC

Quality Scoring

The scoring algorithm also calculates exact copy-number quality scores that are inputs to the DeNovo CNV detection pipeline.

Exclude BED Filtering

An excluded interval does not guarantee that a CNV call does not span the interval. If there is sufficient data flanking the region, the segmentation stage along with any merging might still generate a call spanning the excluded interval. However, the call would not take read counts from excluded intervals into account. You can view explanations for excluded intervals in the *.excluded_intervals.bed.gz file. See for further details.

Concurrent CNV and Small Variant Calling

The following examples show commands for running CNV calling and small variant calling (VC), either separately or concurrently.

Map/Align FASTQ With CNV

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
--RGSM <SAMPLE> \
--RGID <RGID> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true \
--enable-cnv true \
--cnv-enable-self-normalization true

Map/Align FASTQ With VC

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
--RGSM <SAMPLE> \
--RGID <RGID> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true \
--enable-variant-caller true

Map/Align FASTQ With CNV and VC

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
--RGSM <SAMPLE> \
--RGID <RGID> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true \
--enable-cnv true \
--cnv-enable-self-normalization true \
--enable-variant-caller true

BAM Input to CNV and VC

dragen \
-r <HASHTABLE> \
--bam-input <BAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-enable-self-normalization true \
--enable-variant-caller true

Sample Correlation and Sex Genotyper

When running the target counts stage or the normalization stage, the DRAGEN CNV pipeline also provides the following information about the samples in the run.

A correlation metric of the read count profile between the case sample and any panel of normals samples. A correlation metric greater than 0.90 is recommended for confident analysis, but there is no hard restriction enforced by the software.
The predicted sex of each sample in the run. The sex is predicted based on the read count information in the sex chromosomes and the autosomal chromosomes. The median value for the counts is printed to the screen for the autosomal chromosomes, the X chromosome, and the Y chromosome. This estimation requires a minimum of 300 target intervals on the sex chromosomes to proceed.

The results are printed to the screen when running the pipeline. For example:

=============================================
Correlation Table
=============================================
Correlation of case sample PlatinumGenomes_50X_NA12877 against
PlatinumGenomes_50X_NA12878: 0.984092

=============================================
Sex Genotyper
=============================================
Predicted sex of samples
PlatinumGenomes_50X_NA12877: MALE XY 0.99737
PlatinumGenomes_50X_NA12878: FEMALE XX 0.968929

You can override the predicted sex of the sample with the --sample-sex option.
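
To show how a correlation metric might be used to screen a panel, the following Python sketch computes a Pearson correlation between the case count profile and each normal, and lists the normals that fall below the recommended 0.90. This guide does not say which correlation measure DRAGEN uses, so Pearson is an assumption, and the function names and sample data are hypothetical.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation of two equal-length, non-constant count profiles."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


def low_correlation_normals(case_counts, normals, threshold=0.90):
    """Return the names of normals whose correlation with the case is below threshold."""
    return [name for name, counts in normals.items()
            if pearson_correlation(case_counts, counts) < threshold]


if __name__ == "__main__":
    case = [10, 20, 30, 40]
    normals = {"normal_a": [11, 19, 31, 41], "normal_b": [40, 10, 35, 5]}
    print(low_correlation_normals(case, normals))
```

Because the software enforces no hard restriction, a check like this is only a way to decide which normals deserve a closer look.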

Segmental Duplication Extension

Supported duplications

The following pairs of genes defining Segmental Duplications are included:

Extension requirements

This extension is enabled by default in the germline CNV workflow. However, it requires the following:

Normalization set to self-normalization (--cnv-enable-self-normalization=true).
GC bias correction enabled (--cnv-enable-gcbias-correction=true).
Counts method set to start (--cnv-counts-method=start).
Interval width not greater than 10 kb. However, we recommend using the --cnv-interval-width default (1 kb) for best performance.
A supported reference genome build as input (currently, hg38-based references).

If necessary, the extension can be disabled by setting --cnv-enable-segdups-extension to false.

Algorithm

For each duplicated region, the extension collects all reads that fall on the two homologous intervals of the pair and computes the normalized joint coverage (output to *.cnv.segdups.joint_coverage.tsv.gz).
Using sites that differentiate the two homologous intervals, the extension computes the proportion of coverage to assign to the first and to the second interval (output to *.cnv.segdups.site_ratios.tsv.gz).
This proportion is used to redistribute the joint normalized coverage between the two homologous intervals.
The rescued intervals are output to the *.cnv.segdups.rescued_intervals.tsv.gz file for inspection and are automatically injected before the segmentation step.
- During integration with the original intervals from the CNV caller, the rescued intervals take priority and replace all original intervals that they overlap.
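
The redistribution and injection steps above can be sketched in Python as follows. This is a simplified illustration, not the DRAGEN implementation: the function names are hypothetical, the site ratio is treated as a single fraction assigned to the first interval, and intervals are half-open (start, end) tuples on one contig.

```python
def redistribute_joint_coverage(joint_coverage, first_interval_ratio):
    """Split a joint normalized coverage between two homologous intervals."""
    if not 0.0 <= first_interval_ratio <= 1.0:
        raise ValueError("first_interval_ratio must be between 0 and 1")
    first = joint_coverage * first_interval_ratio
    return first, joint_coverage - first


def inject_rescued_intervals(original, rescued):
    """Replace every original interval that overlaps a rescued interval."""
    kept = [o for o in original
            if not any(o[0] < r[1] and r[0] < o[1] for r in rescued)]
    return sorted(kept + list(rescued))


if __name__ == "__main__":
    print(redistribute_joint_coverage(4.0, 0.5))
    original = [(0, 1000), (1000, 2000), (2000, 3000)]
    print(inject_rescued_intervals(original, [(1200, 1800)]))
```

A ratio that departs from an even split shifts coverage toward one of the two intervals, which is how the extension can attribute a gain or loss to a specific copy of the duplicated region.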

See for a description of the extension output files.