Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
The DRAGEN CNV caller leverages depth as its primary signal for calling copy number variants. Depth alone poses challenges for calling events that are less than 10kbp. The sensitivity of CNVs at lengths less than 10kbp can be improved by leveraging junction signals from the DRAGEN structural variant caller.
When both the DRAGEN CNV and SV caller are executed in a single invocation, then an additional integration step is done at the end of a DRAGEN run to improve the CNV calls. This feature is enabled automatically when DRAGEN detects a germline WGS analysis.
The SV/CNV Integration module takes in DEL and DUP calls from the output data structures of the germline CNV and SV callers, identifies putative matches, updates annotations, filters, scores, and outputs the refined records in a new output VCF. By leveraging junction signals from the SV caller and depth signals from the CNV caller, this approach allows for sensitive CNV detection down to 1kbp while also improving recall and precision across length scales. This is achieved by rescuing previously low quality calls if evidence is found from both callers, and also by adjusting CNV breakends to the more accurate SV breakends. The matching algorithm takes into account the proximity of the events as well as the transition states at the breakends, among other things.
The following is an example command line for running a germline WGS analysis for both CNV and SV.
Other optional CNV or SV parameters can also be added.
The original CNV and SV VCF output files, prior to integration, are available for users in the DRAGEN output directory, as described elsewhere. Additionally, there is an enhanced CNV VCF available with the *.cnv_sv.vcf.gz
extension. The VCF header lines in the *.cnv_sv.vcf.gz
mostly correspond to a concatenation of the individual header lines from the CNV and SV VCFs, with a few lines deduplicated and some new ones added. For details on the legacy header lines, please refer to the individual CNV and SV user guide sections.
Newly added header lines are described in the following table.
Records that can be matched or rescued will have annotations indicating the breakpoint linkage between a CNV and SV record. If a complete match is found, then the MatchSv
annotation will be present in the record, indicating the SV record's ID
field for this CNV record. Furthermore, the use of the SVCLAIM
field will indicate if the record has evidence arising from depth signal D
, or junction signals J
, or both DJ
.
Because of the mixing of standalone SV records and CNV records, the FORMAT field may have different annotations. For details on the CNV or SV specific annotations, please refer to the individual CNV and SV user guide sections.
Records that can be matched or rescued will have FILTER set to PASS. The original FILTERs are retained for records that were not matched or rescued. For example, the cnvLength
FILTER will still be applied to standalone CNV records (those with SVCLAIM=D
).
Example records are shown below.
For somatic whole-exome sequencing (WES) and somatic targeted panels, you can use a panel of normals as the reference baseline to provide insight into copy number variants. The reported events are based solely on the normalized copy ratio values and the deviation from the expected reference baseline levels. This workflow can be useful for applications that require only the detection of gains and losses in targeted genes. The somatic WES CNV model is similar to the germline WES CNV model, but utilizes a different quality scoring and calling model.
Use one of the following input options.
--tumor-fastq1
and --tumor-fastq2
--Specify a FASTQ file
--tumor-bam-input
--Specify an existing BAM file
--tumor-cram-input
--Specify an existing CRAM file
The Somatic WES CNV Caller requires a panel of normals. The panel of normals samples help measure instrinsic biases of the upstream processes to allow for proper normalization. To generate a panel of normals, see Panel of Normals. The panel of normals sample should be well matched to the case sample under analysis.
If a matched normal sample is available, the sample can be included in the panel of normals. The workflow does not change if a matched normal is or is not available.
The following example command line runs somatic analysis on WES data.
If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed
and using cnv-segmentation-mode=bed
. If using this option, all events in the segmentation BED are reported in the output VCF. For more information on the segmentation BED file, see [Targeted Segmentation (Segment BED)].
The following example command line runs somatic analysis on a targeted panel.
The Somatic WES CNV Caller computes quality scores using a 2 sample t-test between the normalized copy ratio of the case sample and the panel of normals samples. The caller computes a p-value per segment. The p-values are then converted to Phred-scaled scores. For copy neutral events, the caller computes quality scores as 1-p
.
DUP/DEL events calls are made based on the limit of detection threshold (LoD) which is set using cnv-filter-limit-of-detection
(default 0.2). For each segment, the caller compute a p-value for hypothetical counts by Case Counts X (1 +/- LoD)
against PON. If p-value of Case Counts X (1+LoD)
is highest, then segment is called as DUP. If p-value of Case Counts X (1-LoD)
is highest, then segment is called DEL. Otherwise segment is called REF.
The output VCF contains the quality score in the QUAL
field.
Tumor purity can be estimated automatically through the ASCN workflow.
The non-ASCN Somatic WES CNV Caller only reports copy ratio, also known as fold change. Fold change is encoded in the FORMAT/SM
field as a linear copy ratio of the segment mean. In such case, if tumor purity is known, you can infer the ploidy of a gene or segment in the sample from the reported fold change using the following calculation.
For example, if the tumor purity is 30% for MET with a fold change of 2.2x, then there are 10 copies of MET DNA in the sample.
Multisample CNV calling is possible starting from tangent normalized counts files (*.tn.tsv.gz) specified with the --cnv-input
option (one per sample). Multisample CNV analysis benefits from using joint segmentation to increase the sensitivity of detection of copy number variable segments. For each copy number variable segment identified, the copy number genotype of each sample is emitted in a single VCF entry to facilitate annotation and interpretation.
Multisample CNV analysis is supported for WGS and WES workflows.
The following is an example command line for running a trio analysis:
Make sure all input samples have gone through the same single sample workflow and have identical intervals. If the samples are WES inputs, then you must generate the samples using the same panel of normals, and the autosomal intervals for all samples must match.
The following options are used in DeNovo CNV calling:
--cnv-input
For DeNovo CNV calling, this specifies the input tangent-normalized signal files (*.tn.tsv.gz) from the single sample runs. This option can be specified multiple times, once for each input sample.
--cnv-filter-de-novo-qual
Phred-scaled threshold at which a putative event in the proband sample if marked as DeNovo. Default value is 0.125.
--pedigree-file
Pedigree file specifying the relationship between the input samples.
First, CNV calling is performed on each sample independently. Joint segmentation then uses the copy number variable segments from each single sample analysis to derive a set of joint copy number variable segments. This set of joint segments is determined simply by taking the union of all breakpoints from the copy number variable segments of all samples. This results in the splitting of any partially overlapping segments across different samples. For example:
Following joint segmentation, copy number calling is again performed independently on each sample using the joint segments. Segments can be merged as with the single sample analysis, but each joint segment is emitted in the mutlisample VCF as a single entry. The quality score (QS
in the VCF) from the sample's merged segment, if applicable, is used for filtering the call. Sample calls are filtered using the sample's FT field in the multisample VCF. The QUAL
column of the multisample VCF is always missing (ie, "."). The FILTER
column of the mutlisample VCF is SampleFT
if none of the sample's FT
fields are PASS
, and PASS
if any of the sample's FT
fields are PASS
.
Note, however, that when a single segment in one sample overlaps multiple segments in another sample, the larger segment annotation is replicated across multiple records, e.g. (only relevant VCF fields are printed below):
A de novo event is defined as the existence of a genotype at a particular locus in a proband's genome that did not result from standard Mendelian inheritance from the parents. The de novo calling stage identifies putative de novo events in the proband of each trio of a multisample analysis. In some cases, these putative de novo events may be real, but they can also arise from sequencing or analysis artifacts. Consequently, a de novo quality score is assigned to each putative de novo event and used to filter out low-quality de novo events. Trios are specified by specifying a .ped file with the --pedigree-file
option. Multiple trios can be specified (eg, quad analysis), and all valid trios will be processed.
For each joint segment in a trio, the de novo caller determines if there is a Mendelian inheritance conflict for the called copy number genotypes. The CNV caller does not identify the copy number for each allele of a given diploid segment, which means assumptions are made about the possible allelic composition of the parent genotypes.
The assumption is that the copy number 0 allele is not present for diploid regions of a parent's genome (sex dependent) when the assigned copy number is 2 or greater. This results in simplifications, as follows:
The following are examples of consistent and inconsistent copy number genotypes for diploid regions using these assumptions:
If a joint segment has a Mendelian inheritance conflict, a Phred-scaled de novo quality score (DQ
field in the VCF) is calculated using the likelihoods for each copy number state (see Quality Scoring section) of each sample in the trio, combined with a prior for the trio genotypes:
Where
The DN
field in the VCF is used to indicate the de novo status for each segment. Possible values are:
Inherited
- the called trio genotype is consistent with Mendelian inheritance
LowDQ
- the called trio genotype is inconsistent with Mendelian inheritance and DQ is less than the de novo quality threshold (default 0.125)
DeNovo
- the called trio genotype is inconsistent with Mendelian inheritance and DQ
is greater than or equal to the de novo quality threshold (default 0.125)
The records in a multisample CNV VCF differ slightly from the single sample case. The major differences are as follows:
The per-record entries are broken down into the segments among the union of all the input samples breakpoints, which means there are more entries in the overall VCF.
The QUAL
column is not used and its value is ".". The per-sample quality is carried over into the SAMPLE
columns with the QS
tag.
The FILTER
column indicates PASS
if any of the individual SAMPLE
columns PASS
. Otherwise, it indicates SampleFT
.
The per-sample annotations are carried over from their originating calls. The single sample filters are applied at the sample level and are emitted in the FT
annotation.
Additionally, if a valid pedigree is used, then de novo calling is performed, which adds the following two annotations to the proband sample.
While the VCF contains many entries, due to the joint segmentation stage, the number of de novo events can be found by extracting entries that have a DN
and DQ
annotation. These records are also extracted and are converted to GFF3 in the de novo calling case.
To detect somatic copy number aberrations and regions with loss of heterozygosity, run the DRAGEN CNV Caller on a tumor sample with a VCF that contains germline SNVs. The output file is a VCF file. Components of the germline CNV caller are reused in the somatic algorithm with the addition of a somatic modeling component, which estimates tumor purity and ploidy.
The germline SNVs are used to compute B-allele ratios in the tumor, which allows for allele-specific copy number calling on the tumor sample. Where possible, use of the small-variant VCF from a matched normal sample is preferred (tumor-normal mode) for best results, but a catalog of population SNPs can be used when a matched normal sample is not available (tumor-only mode).
When a matched normal sample is available, the sample should first be processed using the germline small variant caller. In this case, only germline-heterozygous SNV sites are used for determining B-allele ratios. If no matched normal is available, population SNP B-allele ratios are computed as for matched normal heterozygous loci, but are treated as variants of unknown germline genotype; possible genotype assignments are statistically integrated to determine allele-specific copy number.
In matched normal mode, a VCF containing germline copy number changes for the individual may optionally be input. This makes sure that germline CNVs are output as separate segments in the somatic whole-genome sequencing (WGS) CNV VCF, and annotated with the germline copy number so that it is clear whether there are specifically-somatic copy number changes in the region.
You can use the following somatic WGS CNV calling command-line options:
The following is an example command line for running tumor-normal somatic WGS CNV calling with a matched normal SNV VCF.
If a matched normal is not available, you must disable CNV calling or run in tumor-only mode. Running with a mismatched normal in tumor-normal mode yields unexpected results. The following example command line runs tumor-only somatic WGS CNV calling with a population SNV VCF.
The following example command line runs tumor normal somatic WGS CNV calling concurrently with the Somatic SNV Caller, which allows you to use the matched normal germline heterozygous sites directly from the SNV Caller with the command cnv-use-somatic-vc-baf true
.
You can enable additional features when a matched normal sample and the outputs from DRAGEN Germline analysis are also available. If a matched normal sample is available, enable germline-aware mode and VAF-aware mode using the following example command line. For more information on germline-aware mode and VAF-aware mode, see Germline-aware Mode and VAF-aware Mode.
The target counting stage and its output are the same as for the germline CNV calling case. The target intervals with the read counts are output in a *.target.counts.gz
file. If there is insufficient read depth coverage detected, processing will halt. For low depth tumor samples, the value of --cnv-interval-width
can be increased from to capture more alignments. The B-allele counting occurs in parallel with the read counting phase, and the values are output in a *.baf.bedgraph.gz
file. This file can be loaded into IGV along with other bigwig files generated by DRAGEN for visualization. See Output Files for more details on output files.
The Somatic WGS CNV Caller requires a source of heterozygous SNP sites to measure B-allele counts of the tumor sample. The following are the available modes.
To specify a matched normal sample SNV VCF, use the --cnv-normal-b-allele-vcf
option. The VCF file should come from processing the matched normal sample through the DRAGEN germline small variant caller with filters applied. Typically, this file name has a *.hard-filtered.vcf.gz
extension. All records marked as PASS that are determined to be heterozygous in the normal sample are used to measure the b-allele counts of the tumor sample. You can also use equivalent gVCF file (*.hard-filtered.gvcf.gz
), but the processing time is significantly longer due to the number of records, most of which are not heterozygous sites.
To specify a population SNP VCF, use --cnv-population-b-allele-vcf
option. To obtain a population SNP VCF, process an appropriate catalog of population variation, such as from dbSNP, the 1000 genome project, or other large cohort discovery efforts. A suitable example file for this parameter is "1000G_phase1.snps.high_confidence.vcf.gz" from the GATK resource bundle. Only high-frequency SNPs should be included. For example, include SNPs with minor allele population frequency ≥ 10% to limit run time impact and reduce artifacts. Specify the ALT allele frequency by adding AF=<alt frequency>
to the INFO
section of each record. Additional INFO
fields might be present, but DRAGEN only parses and uses the AF field. Sites specified with --cnv-population-b-allele-vcf
can be either heterozygous or homozygous in the germline genome from which the tumor genome derives
The following is an example valid population SNP record:
DRAGEN considers the following requirements when parsing records from the b-allele VCF:
Only simple SNV sites.
Records must be marked PASS
in the FILTER
field.
If there are records with the same CHROM
and POS
values in the VCF
, then DRAGEN uses the first record that occurs.
If a tumor sample and matched normal input are available, use --cnv-use-somatic-vc-baf true
. You must enable the Somatic SNV Caller. If using this option, DRAGEN determines the germline heterozygous sites from the matched normal input and measures the b-allele counts of the tumor sample. The information is passed to the Somatic WGS CNV Caller to simplify the overall somatic workflow.
If a tumor sample and matched normal input are available, you can avoid having to separately process the matched normal with the DRAGEN germline pipeline by specifying --cnv-use-somatic-vc-baf true
. If using this option, DRAGEN determines the germline heterozygous sites from the matched normal input and measures the b-allele counts of the tumor sample. The information is passed to the Somatic WGS CNV Caller to simplify the overall somatic workflow.
To enable --cnv-use-somatic-vc-baf
, enter the following command line options.
--tumor-bam-input <TUMOR_BAM>
—Specify the tumor input
--bam-input <NORMAL_BAM>
—Specify the matched normal input
--enable-variant-caller true
—Enable the somatic SNV variant caller
--cnv-use-somatic-vc-baf true
—Enable somatic VC BAF
To specify germline CNVs from a matched normal sample, use --cnv-normal-cnv-vcf
. When specified, CNV records marked as PASS
in the normal sample are used during tumor-sample segmentation to make sure that confident germline CNV boundaries are also boundaries in the somatic output. Segments with germline copy number changes that are relative to reference ploidy are excluded from somatic model selection. During somatic copy number calling and scoring, the germline copy number is used to modify the expected depth contribution from the normal contamination fraction of the tumor sample. The process leads to more accurate assignment of somatic copy number in regions of germline CNV. DRAGEN then annotates the somatic WGS CNV VCF entries with germline copy number (NCN) and the somatic copy number difference relative to germline (SCND) for the segments that have germline CNVs.
If both the small variant caller and the CNV caller are enabled in a tumor-matched normal run, the somatic SNV results can affect the estimated purity and ploidy of the tumor sample. The somatic SNV variant allele frequencies (VAFs) that are captured by the allele depth values from passing somatic SNVs reflect the combination of tumor purity, total tumor copy number at a somatic SNV locus, and the number of tumor copies bearing the somatic allele. Clusters of somatic SNVs with similar allele depths inform the tumor model.
When a tumor has limited copy number variation and/or CNVs are mostly subclonal, such as in many liquid tumors, VAFs can help prevent incorrect or low-confidence estimated tumor models. Incorrect or low-confidence estimated tumor models can lead to wrong or filtered calls. VAF information can also help determine the presence or absence of a genome duplication even in samples from clonal tumors with clear CNVs.
To utilize VAF information, run somatic WGS CNV calling with small variant calling on tumor and matched-normal read alignment inputs. For example, you could use the following command line:
--enable-variant-caller=true --enable-cnv=true --tumor-bam-input <TUMOR_BAM> --bam-input <NORMAL_BAM>
For tumor/matched-normal runs with --enable-variant-caller true
, VAF-based modeling is enabled by default. To disable VAF-based modeling, set --cnv-use-somatic-vc-vaf
to false
.
DRAGEN uses HET-calling mode for segments with a copy number that is estimated to be heterogeneous (HET) among different subclones. Based on a statistical model, a segment is considered to be heterogeneous when the depths or BAF values in a segment are too far away from what is expected for the closest integer-copy number.
To turn on HET calling, specify --cnv-somatic-enable-het-calling=true
on the command line. N.B., this setting will only be honored when DRAGEN is able to identify a confident purity/ploidy model. When a confident model cannot be identified, the caller will return a default model and HET-calling will always be disabled (see Somatic WGS CNV Model section for more details and nuances of this approach).
When a segment is considered as heterogeneous, the output for the segment is changed as follows.
The HET tag is added to the INFO field for the segment.
At least one of the CN and MCN values is given as a non-REF value. Specifically, the values are given as the integer values closest to CNF and MCNF. If the integer values would result in a REF call, then at least one of the CN and MCN values is adjusted to the closest non-REF value.
The ID, ALT, and GT fields are set appropriately for the chosen CN and MCN.
The QUAL score reflects confidence that the segment has nonreference copy number in at least a fraction of the sample.
The CNQ and MCNQ values reflect confidence that the assigned CN and MCN values are true in all of the tumor cells, so at least one of the CNQ and MCNQ values is typically less than five.
Selecting a tumor purity and diploid coverage level (ploidy) is a key component of the somatic WGS CNV caller. The somatic WGS CNV caller uses a grid-search approach that evaluates many candidate models to attempt to fit the observed read counts and b-allele counts across all segments in the tumor sample. A log likelihood score is emitted for each candidate. The log likelihood scores are output in the *.cnv.purity.coverage.models.tsv
file. The somatic WGS CNV caller chooses the purity, coverage pair with the highest log likelihood, and then computes several measures of model confidence based on the relative likelihood of the chosen model compared to alternative models.
If the confidence in the chosen model is low, the caller returns the default model with estimated tumor purity set to NA
. The default model provides an alternative methodology to identify large somatic alterations (length of at least 1 Mb): records are filtered by this model based on their segment mean value (SM
). The threshold values used by the caller are estimated automatically considering the variance of the sample, with larger SM
thresholds for DUPs when the variance is higher. The user can use alternative threshold values through the --cnv-filter-del-mean
and --cnv-filter-dup-mean
parameters. Finally, when the caller returns the default model, the fields regarding copy number states based on model estimation (i.e., CN
, CNF
, CNQ
, MCN
, MCNF
, MCNQ
) are omitted from the final VCF output.
In order to improve accuracy on the tumor ploidy model estimation, the somatic WGS CNV caller estimates whether the chosen model calls homozygous deletions on regions that are likely to reduce the overall fitness of cells, which are therefore deemed to be "essential" and under negative selection. In the current literature, recent efforts tried to map such cell-essential genes¹.
The check on essential regions is controlled with --cnv-somatic-enable-lower-ploidy-limit
(default true). Default bedfiles describing the essential regions are provided for hg19, GRCh37, hs37d5, GRCh38, but a custom bedfile can also be provided in input through the --cnv-somatic-essential-genes-bed=<BEDFILE_PATH>
parameter. In such case, the feature is automatically enabled. A custom essential regions bedfile needs to have the following format: 4-column, tab-separated, where the first 3 columns identify the coordinates of the essential region (chromosome, 0-based start, excluded end). The fourth column is the region id (string type). For the purpose of the algorithm, currently only the first 3 columns are used. However, the fourth might be helpful to investigate manually which regions drove the decisions on model plausibility made by the caller.
If the somatic WGS CNV caller does not find any overlap between any of the homozygous deletions and any of the essential regions, the model is considered plausible and the model optimization ends. Otherwise, when at least an overlap is found, the model is declared invalid and the model search is repeated on the subset of models that support at least one copy (CN = 1) for the essential region with the lowest coverage among the regions overlapping homozygous deletions.
¹E.g., in 2015 - https://www.science.org/doi/10.1126/science.aac7041
The segmentation stage might produce adjacent or nearby segments that are assigned the same copy number and have similar depth and BAF data. This segmentation can result in a region with consistent true copy number being fragmented into several pieces. The fragmentation might be undesirable for downstream use of copy number estimates. Also, for some uses, it can be preferable to smooth short segments that would be assigned different copy numbers whether due to a true copy number change or an artifact. To reduce undesirable fragmentation, initial segments can be merged during a postcalling segment smoothing step.
After initial calling, segments shorter than the specified value of --cnv-filter-length
are deemed negligible. Among the remaining nonnegligible segments, successive pairs are evaluated for merging. On a trial basis, the Somatic WGS CNV Caller combines two successive segments that are within --cnv-merge-distance
(default value of 10000 for WGS Somatic CNV) of one another and have the same CN and MCN assignments, along with any intervening negligible segments into a single segment that is recalled and rescored. If the merged segment receives the same CN and MCN as its constituent nonneglible pieces with a sufficiently high-quality score, the original segments are replaced with the merged segment. The merged segment might be further merged with other initial or merged segments to either side. Merging proceeds until all segment pairs that meet the criteria are merged. NB. When the germline CN information is available, and two segments have different germline CN, they will not be merged.
The Somatic WGS CNV Caller can report the total tumor copy number by estimating tumor purity. The BAF estimations from matched normal SNVs or population SNPs allow for allele specific copy number calling. The following table provides examples for a DUP in a reference-diploid region:
*The entry represents a Loss of Heterozygosity (LOH) case. The total copy number is still considered a DUP, so the entry is annotated as GAINLOH
to distinguish the value from Copy Neutral LOH (CNLOH
), which would be annotated as 2+0.
DRAGEN emits the calls in the standard VCF format. By default for analyses other than somatic WGS, the VCF file includes only copy number gain and loss events (and LOH events, where allele-specific copy number is available). To include copy neutral (REF) calls in the output VCF, set --cnv-enable-ref-calls
to true.
For more information on how to use the output files to aid in debug and analysis, see .
File extension: *.cnv.vcf.gz
The CNV VCF file follows the standard VCF format. Due to the nature of how CNV events are represented versus how structural variants are represented, not all fields are applicable. In general, if more information is available about an event, then the information is annotated. Some fields in the DRAGEN CNV VCF are unique to CNVs. The VCF header is annotated with ##source=DRAGEN_CNV
to indicate the file is generated by the DRAGEN CNV pipeline
The following is an example of some of the header lines that are specific to CNV:
The following header lines are specific to somatic WGS CNV calling:
ModelSource
The primary basis on which the final tumor model was chosen. The following values can be included:
DEPTH+BAF
: Depth+BAF signal is used to determine tumor model.
DEPTH+BAF_DOUBLED
: The initial depth+BAF model is duplicated based on VAF signal or excess segments at half the expected depth change.
DEPTH+BAF_DEDUPLICATED
: The depth+BAF model is deduplicated based on VAF signal or insufficient segments supporting a duplication.
DEPTH+BAF_WEAK
: Depth+BAF signal is used to determine lower-confidence tumor model.
VAF
: VAF signal is used to determine tumor model due to insufficient depth+BAF signal.
DEGENERATE_DIPLOID
: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. The diploid coverage is set to lowest value observed in a substantial number of bases in segments with BAF=50%.
SAMPLE_MEDIAN
: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. Diploid coverage set to sample median.
EstimatedTumorPurity
Estimated fraction of cells in the sample due to tumor. The range of this field is [0, 1] or NA
if a confident model could not be determined.
DiploidCoverage
Expected read count for a target bin in a diploid region. The numeric value is unlimited.
OverallPloidy
Length weighted average of tumor copy number for PASS events. The numeric value is unlimited.
AlternativeModelDedup
An alternative to the best model corresponding to one less whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation if the best model might involve a spurious genome duplication.
AlternativeModelDup
An alternative to the best model corresponding to one more whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation where the best model might have missed a true genome duplication.
OutlierBafFraction
A QC metric that measures the fraction of b-allele frequencies that are incompatible with the segment the BAFs belong to. High values might indicate a mismatched normal, substantial cross-sample contamination, or a different source of a mosaic genome, such as bone marrow transplantation. The range of this field is [0, 1].
All coordinates in the VCF are 1-based.
The CHROM column specifies the chromosome (or contig) on which the copy number variant being described occurs.
The POS column is the start position of the variant. According to the VCF specification, if any of the ALT alleles is a symbolic allele, such as <DEL>, then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism.
The ID column is used to represent the event. The ID field encodes the event type and coordinates of the event (1-based, inclusive). In addition to representing GAIN
, LOSS
and REF
events, in Somatic WGS CNV, the ID could include the Copy Neutral Loss of Heterozygosity (CNLOH) or Copy Number Gain with LOH (GAINLOH) events.
The REF column contains an N for all CNV events.
The ALT column specifies the type of CNV event. Because the VCF contains only CNV events, only the <DEL> or <DUP> entries are used. If REF calls are emitted, their ALT will always be .
. In Somatic WGS CNV, the ALT
field can contain two alleles, such as <DEL>,<DUP>, which allows representation of allele-specific copy numbers if they differ in copy number states.
The QUAL column contains an estimated quality score for the CNV event, which is used in hard filtering. Each CNV workflow has different defaults and the value used can be found in the VCF header.
The FILTER column contains PASS
if the CNV event passes all filters, otherwise the column contains the name of the failed filter. Default values are defined in the header line for each available FILTER.
Available FILTERs:
cnvLength
which indicates that the length of the CNV is lower than a threshold.
cnvQual
which indicates that the QUAL of the CNV is lower than a threshold.
Germline CNV has the following additional FILTERs:
cnvCopyRatio
which indicates that the segment mean of the CNV is not far enough from copy neutral.
Both Germline CNV workflows have the following additional FILTERs:
dinucQual
which indicates a CNV call where some of its dinucleotide percentages are outside typical ranges, and thus the call is likely to be a false positive.
Germline WGS CNV has the following additional FILTERs:
cnvBinSupportRatio
which indicates, for CNVs greater than 80kb, the percent span of supporting target intervals is lower than a threshold.
highCN
which indicates a CNV call with implausible copy number (>6).
Germline WES CNV has the following additional FILTERs:
cnvLikelihoodRatio
indicates a log10 likelihood ratio of ALT to REF is less than a threshold.
Both Somatic CNV workflows have the following additional FILTERs:
binCount
- Filters CNV events with a bin count lower than a threshold.
Somatic WGS CNV has the following additional FILTERs:
lengthDegenerate
- Marks records as non-PASS
ing based on each record's length (REFLEN
) when the caller returns the default model. Segments having less than 1 Mb are assigned this filter when returning the default model.
segmentMean
- Marks records as non-PASS
ing based on each record's segment mean (SM
) when the caller returns the default model. Segments having insufficient SM
in DEL
s or DUP
s are assigned this filter when returning the default model.
Somatic WES CNV has the following additional FILTERs: -SqQual
- Marks records as non-PASS
ing based on each record's somatic quality (SQ) when the caller returns the default model. Segments having insufficient SQ are assigned this FILTER when returning the default model. SQ is the somatic quality value which is a Phred scale score of p-value from 2-sample t-test comparing normalized counts of CASE vs PON.
The INFO column contains information representing the event.
REFLEN
indicates the length of the event.
SVLEN
is a signed representation of REFLEN
(e.g., a negative value indicates a deletion), and it is only present for non-REF records.
SVTYPE
is always CNV and only present for non-REF records.
END
indicates the end position of the event (1-based, inclusive).
If using a segment BED file, then the segment identifier is carried over from the input to SEGID
field.
The common FORMAT fields are described in the header:
Germline WGS CNV includes the following FORMAT fields:
Germline WES CNV includes the following FORMAT fields:
Somatic WGS CNV and Somatic WES CNV with ASCN (allele-specific copy number) support include the following FORMAT fields:
Somatic WES CNV without ASCN support provides only the common FORMAT fileds and does not include the CN
entry, since it does not estimate the tumor purity fraction and cannot make an estimate of the copy number.
Because germline copy number calling determines overall copy number rather than the copy number on each haplotype, the genotype type field contains missing values for diploid regions when CN is greater than or equal to 2. The following are examples of the GT field for various VCF entries:
The DRAGEN CNV pipeline provides a measure of the quality of the data for a sample. If using the WGS self-normalization method, the additional CoverageUniformity
metric is present in the VCF header. The metric is only available for germline samples. The CNV pipeline assumes that post-normalization target counts are independently and identically distributed (IID). Coverage in most high-quality WGS samples is uniform enough for the CNV caller to produce accurate calls, but some samples violate the IID assumption. Issues during library preparation or sample contamination can lead to several extreme outliers and/or waviness of target counts, which can result in a large number of false positive CNV calls. The CoverageUniformity
metric quantifies the degree of local coverage correlation in the sample to help identify poor-quality samples.
A larger value for this metric means the coverage in a sample is less uniform, which indicates that the sample has more nonrandom noise, and could be considered poor quality. The CoverageUniformity metric depends on factors other than sample quality, such as the cnv-interval-width
setting and sample mean coverage. DRAGEN recommends using this score to compare the quality of samples from similar mean coverage and the same command line options. Because of this, DRAGEN CNV only provides the metric and does not take any action based on it.
DRAGEN CNV outputs metrics in CSV format. The output follows the general convention for QC metrics reporting in DRAGEN. The CNV metrics are output to a file with a *.cnv_metrics.csv
file extension. The following list summarizes the metrics that are output from a CNV run.
Sex Genotyper:
Estimated sex of the case sample as well as that of all panel of normals samples are reported. For WGS workflows, the estimated sex karyotype will be reported and for non-WGS workflows the gender will be reported.
Confidence score (ranging from 0.0 to 1.0). If the sample sex is specified, this metric is 0.0.
CNV Summary:
Bases in reference genome in use
Average alignment coverage over genome
Number of alignment records processed
Number of filtered records (total)
Number of filtered records (due to duplicates)
Number of filtered records (due to MAPQ)
Number of filtered records (due to being unmapped)
Coverage MAD
Median Bin Count
Number of target intervals
Number of normal samples
Number of segments
Number of amplifications
Number of deletions
Number of PASS amplifications
Number of PASS deletions
Coverage MAD and Median Bin Count are only printed for WES germline/somatic CNV. Coverage MAD is the median absolute deviation of normalized case counts. Higher values indicate noiser sample data (poor quality). Median Bin Count is the median of raw counts normalized by interval size.
Intermediate stages of the pipeline stages produce various intermediate output files. These files can be useful for visualization of the evidence or results from each stage, and may aid in fine-tuning options.
All files have a structure similar to a BED file with optional header line(s).
The file *.target.counts.gz
is a compressed tab-delimited text file that contains the number of read counts per target interval. This is the raw signal as extracted from the alignments of the BAM or CRAM file. The format is identical for both the case sample and any panel of normals samples. There is also a bigWig representation of a target.counts.diploid
file, which is normalized to the normal ploidy level of 2 instead of raw counts.
It has the following columns:
Contig identifier
Start position
End position
Target interval name
Count of alignments in this interval
Count of improperly paired alignments in this interval
Header lines are also included that start with #
. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.
An example of a *.target.counts.gz
file is shown below.
B-allele counts are calculated at sites in the tumor sample where the normal sample is likely to be heterozygous. When analyzed in conjunction with a matched normal sample, the sites are those that are called as heterozygous SNVs in the normal sample. When analyzed in tumor-only mode, they are taken from a collection of sites that have high-frequency SNVs in the population. Each B-allele site consists of a reference allele and a variant allele, and the number of reads in the tumor sample supporting each of these alleles is counted.
B-allele counts are written both to gzipped tsv file *.ballele.counts.gz
and gzipped bedgraph file *.baf.bedgraph.gz
.
The tsv file format is the following:
Contig identifier
Start, BED-style (zero-based inclusive) start position of the reference allele
Stop, BED-style (zero-based inclusive) stop position of the reference allele
Base sequence for the reference allele
Base sequence for the the first allele being counted
Base sequence for the second allele being counted
The number of qualified reads containing a sequence matching the first allele
The number of qualified reads containing a sequence matching the second allele
Additionally, in the case of B-allele sites from a population VCF, the following two additional columns are added after the columns listed above:
Population frequency for the first allele
Population frequency for the second allele
An example of B-allele counts file is provided below:
The bedgraph file format is similar to the BED format and it has the following columns:
Contig identifier
Start
Stop
Ratio of allele counts
The numerator and denominator of thw ratio is determined by sorting the allele counts according to the priority of the corresponding bases. The order of the bases in descending priority is {A, T, G, C}.
When the priority of allele1 is higher than the priority of allele2, the output frequency is calculated by:
When the priority of allele2 is higher than the priority of allele1, the output frequency is calculated by:
By prioritizing the bases in this way, the output frequencies will be deterministically distributed in a roughly equal proportion above and below 0.5. When plotting these B-allele frequencies (e.g., in IGV), this gives an easy way to visually determine significant changes in b-allele frequency between neighboring segments of the genome. It also provides a similar visualization to that typically used for array data.
An example of the bedgraph file is shown below:
The file *.target.counts.gc-corrected.gz
contains the number of GC-corrected read counts per target interval. The format is equivalent to the *target.counts.gz
file:
Contig identifier
Start position
End position
Target interval name
GC-corrected read counts in this interval
Count of improperly paired alignments in this interval
Header lines are also included that start with #
. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.
An example of a *.target.counts.gc-corrected.gz
file is shown below.
The file *.combined.counts.txt.gz
is a column-wise concatenation of individual *.target.counts.gz
and *.target.counts.gc-corrected.gz
used to form the panel of normals.
The file *.tn.tsv.gz
contains the normalized signal of the case sample, per target interval, i.e., the log2-normalized copy ratio signal. A strong signal deviation from 0.0 indicates a potential for a CNV event. The format is equivalent to the *target.counts.gz
file:
Contig identifier
Start position
End position
Target interval name
Log2-normalized read counts in this interval
Count of improperly paired alignments in this interval
Header lines are also included that start with #
.
An example of a *.tn.tsv.gz
file is shown below.
File extension: *.seg
, *.seg.called
, *.seg.called.merged
Files containing the segments produced by the segmentation algorithm. The Segment_Mean
value of a segment is the ratio of the mean of that segment to the whole-sample median, without log transformation (linear copy-ratio). A strong signal deviation from 1.0 indicates a potential for a CNV event.
The *.seg
file has the following columns:
Sample name
Contig identified
Start position
End position
Number of intervals in the segment
Linear copy-ratio of the segment
An example of a *.seg
file is shown below.
The *.seg.called
file is identical to the *.seg
file, with an additional column indicating the initial call for whether the segment is a duplication +
ir a deletion -
.
The *.seg.called.merged
file is identical to the *.seg.called
file but with segments potentially merged when they meet internal merging criteria. In addition to the columns described above, this file has also the following columns:
QUAL
FILTER
Copy number assignment
Ploidy
Improper_Pairs count
In addition to segmentation of target counts, some workflows perform segmentation of B-allele loci. The output file has suffix *.baf.seg
and it has the same format of the *.seg
file with two modifications. Firstly, the Segment_Mean
value is the mean over B-allele loci of the smaller observed allele fraction. Secondly, there is an additional column:
BAF_SLM_STATE
: Integer between 0 and 10, indicating bins of minor-allele fraction (low to high), or .
when the BAF data are too variable to estimate a minor-allele fraction"
An example of segmentation output file is shown below:
The file *.cnv.purity.coverage.models.tsv
describes the different tested models and their log-likelihood. It has columns:
Model purity (Cellularity)
Model diploid coverage
Model log-likelihood
An example is shown below:
To generate additional equivalent bigWig and gff files, set the --cnv-enable-tracks
option to true. These files can be loaded into IGV along with other tracks that are available, such as RefSeq genes. Using these tracks alongside publicly available tracks allows for easier interpretation of calls. DRAGEN autogenerates IGV session XML file if tracks are generated by DRAGEN CNV. The *.cnv.igv_session.xml
can be loaded directly into IGV for analysis.
The following IGV tracks are automatically populated in the output IGV session file:
*.target.counts.bw
--- Bigwig representation of the target counts bins. Setting the track view in IGV to barchart or points is recommended. Values are gc-corrected if gc-correction was performed.
*.improper_pairs.bw
--- BigWig representation of the improper pairs counts. Setting the track view in IGV to barchart is recommended.
*.tn.bw
--- BigWig representation of the tangent normalized signal. Setting the track view in IGV to points is recommended.
*.seg.bw
--- BigWig representation of the segments. Setting the track view in IGV to points is recommended.
*.baf.bedgraph.gz
--- BED graph representation of B-allele frequency (if available). Setting the track view in IGV to points is recommended.
*.cnv.gff3
--- GFF3 representation of the CNV events. DEL events show as blue and DUP events show as red. Filtered events are a light gray. If REF events are enabled, then they will show up as green. An example of DRAGEN CNV gff3 is shown below (different CNV workflows might output different attributes on the 9th column):
For somatic WGS analyses, the following additional files are included in the IGV session xml:
*.baf.seq.bw
--- BigWig representation of the BAF segments. Setting the track view in IGV to points is recommended.
*.tumor.baf.bedgraph.gz
--- Bedgraph represengation of the B-allele frequencies. Setting the track view in IGV to points and windowing function to none is recommended.
File extension: *.igv_session.xml
The IGV session XML file is prepopulated with track files generated by DRAGEN. The session file loads the reference genome that best matches the standard reference genomes in an IGV installation, by comparing the name of the --ref-dir
specified on the command-line. Standard UCSC human reference genomes are autodetected, but any variations from the standard reference genomes might not be autodetected. To edit the genome detection, alter the genome
attribute in the Session
element to the reference genome you would like for analysis before loading into IGV. The reference identifier used by IGV might differ from the actual name of the genome. The following is an example edited session file.
Note that depending on the IGV version installed, it may come prepackaged with different flavors of GRCh37. The reference naming conventions have changed so a user may have to edit the genome
field in the XML file directly. For example, IGV has traditionally packaged a b37
reference genome, but may also include a 1kg_v37
or a 1kg_b37+decoy
, which will appear on the IGV user interface as "1kg, b37" or "1kg, b37+decoy" respectively.
You can determine what the correct encoding of a reference genome by going to File > Save Session...
and then inspecting the generated igv_session.xml file.
DRAGEN CNV outputs can be ingested using third-party libraries on most commonly used languages such as Python/R. The typically used files are:
*.target.counts.gz
or *.target.counts.gc-corrected.gz
, containing the number of alignments, or corrected alignments, per interval. Used to display the coverage profile across all intervals.
*.tn.tsv.gz
, containing the log2-normalized copy ratio per interval.
*.baf.bedgraph.gz
, if BAF is available, containing the BAF for each considered site. Used to display the BAF profile across all sites.
In all previously specified files, the format is similar to BED, allowing them to be loaded as any other tab-separated files.
A similar workflow can be used to plot copy number calls (and minor copy number calls, if available) by using the *.cnv.gff3
output file. Some examples of DRAGEN output GFF3 are shown below:
Germline WGS
Somatic WGS
From the output GFF3, the typical steps to follow are to parse each segment coordinates and the CopyNumber
annotation (or any other annotation the user might want to plot), and to plot them using the libraries listed previously for coverage/BAF profiles (or any other library and language of user's choice).
To improve accuracy, the DRAGEN CNV Pipeline excludes genomic intervals if one or more of the target intervals failed at least one quality requirement. The excluded intervals are reported to *.excluded_intervals.bed.gz
file. The file has a bed format, identifies the regions of the genome that are not callable for CNV analysis and describes the reason intervals were excluded in the fourth column. The following are the possible reasons for exclusion.
An example of a *.excluded_intervals.bed.gz
file is shown below:
The DRAGEN CNV Pipeline generates the PON Metrics File (.cnv.pon_metrics.tsv.gz
) if a Panel of Normals is provided and --cnv-generate-pon-metric-file
is set to true
. If PON size is less than 2, then an empty file will be generated.
The PON Metric File includes basic statistics of the coverage profile for each interval. To remove sample coverage bias, DRAGEN applies sample median normalization, and then computes the following metrics:
Example:
The DRAGEN CNV Pipeline generates the PON Correlation File (.cnv.pon_correlation.txt.gz
) if a Panel of Normals is provided. The PON Correlation File includes correlation between CASE sample and each PON sample.
Example:
The SegDups extension provides intermediate and final outputs. All intervals follow the bed format (0-based, start inclusive, end exclusive) and they are in tab-delimited text files (gzip compressed).
The final output has extension .cnv.segdups.rescued_intervals.tsv.gz
, and contains the rescued target intervals which can then be injected before segmentation. It has columns:
Chromosome name
Start - 0-based inclusive
Stop - 0-based exclusive
Target interval name prefixed with "target-wgs-"
Sample Counts (in header, identifier taken from RGSM) - log2-scale normalized counts for each interval
Improper pairs - Kept for compatibility with CNV workflow, set to 0 for rescued intervals
Target region ID - ID of the target region (aka pair of rescued target intervals)
The joint normalized coverage profile (log2-scale) for each region is provided in output to file .cnv.segdups.joint_coverage.tsv.gz
with columns:
Target region ID
Joint normalized coverage (log2-scale) of the two intervals in the target region
Copy Number Float - estimate of joint copy number for the target region (e.g., CNF ~ 4)
The differentiating sites' data is provided in output to file .cnv.segdups.site_ratios.tsv.gz
with columns:
Differentiating site name
Target (gene A) counts at site
Non-target (gene B) counts at site
Target ratio: gene A counts over total (i.e., gene A + gene B) counts at site
The DRAGEN Copy Number Variant (CNV) Pipeline can call CNV events using next-generation sequencing (NGS) data. This pipeline supports multiple applications in a single interface via the DRAGEN Host Software, including processing of whole-genome sequencing (WGS) data and whole-exome sequencing (WES) data for germline analysis.
The DRAGEN CNV pipeline supports two normalization modes of operation. The two modes apply different normalization techniques to handle biases that differ based on the application, for example, WGS versus WES. While the default option settings attempt to provide the best trade-off in terms of speed and accuracy, a specific workflow may require more finely tuned option settings.
The DRAGEN CNV pipeline follows the workflow shown in the following figure.
DRAGEN CNV Pipeline Workflow
The DRAGEN CNV Pipeline uses many aspects of the DRAGEN secondary analysis available in other pipelines, such as hardware acceleration and efficient I/O processing. To enable CNV processing in the DRAGEN Host Software, set the --enable-cnv
command line option to true.
The CNV pipeline has the following processing modules:
Target Counts --- Binning of the read counts and other signals from alignments.
Bias Correction --- Correction of intrinsic system biases.
Normalization --- Detection of normal ploidy levels and normalization of the case sample.
Segmentation --- Breakpoint detection via segmentation of the normalized signal.
Calling / Genotyping --- Thresholding, scoring, qualifying, and filtering of putative events as copy number variants.
The normalization module can optionally take in a panel of normals (PoN), which is used when a cohort or population samples are readily available. Note that PoN normalization is not available for somatic WGS analysis. All other modules are shared between the different CNV algorithms.
The following figures show a high-level overview of the steps in the DRAGEN CNV Pipeline as the signal traverses through the various stages. These figures are examples and are not identical to the plots that are generated from the DRAGEN CNV Pipeline.
The first step in the DRAGEN CNV Pipeline is the target counts stage. The target counts stage extracts signals such as read count and improper pairs and puts them into target intervals.
Read Count Signal
Improper Pairs Signal
Next, the case sample is normalized against the panel of normals or against the estimated normal ploidy level. Any other biases are subtracted out of the signal to amplify any event level signals.
Normalization
The normalized signal is then segmented using one of the available segmentation algorithms. Events are then called from the segments.
Segments
Called Events
The events are then scored and emitted in the output VCF.
The following are the top-level options that are shared with the DRAGEN Host Software to control the CNV pipeline. You can input a BAM or CRAM file into the CNV pipeline. If you are using the DRAGEN mapper and aligner, you can use FASTQ files.
--bam-input
--- The BAM file to be processed.
--cram-input
--- The CRAM file to be processed.
--enable-cnv
--- Enable or disable CNV processing. Set to true to enable CNV processing.
--enable-map-align
--- Enables the mapper and aligner module. The default is true, so all input reads are remapped and aligned unless this option is set to false.
--fastq-file1
, --fastq-file2
--- FASTQ file or files to be processed.
--output-directory
--- Output directory where all results are stored.
--output-file-prefix
--- Output file prefix that will be prepended to all result file names.
--ref-dir
--- The DRAGEN reference genome hashtable directory.
The output and filtering options control the CNV output files.
--cnv-exclude-bed
--- Specifies a BED file that indicates the intervals to exclude from the CNV analysis. If a target interval overlaps regions specified from exclude BED file more than cnv-exclude-bed-min-overlap
, the target interval is suppressed.
--cnv-exclude-bed-min-overlap
--- Specifies a fraction for filtering threshold of overlap amount between a target interval and the excluded region (0.5).
--cnv-enable-ref-calls
--- Emit copy neutral (REF) calls in the output VCF file. The default is true for single WGS CNV analysis.
--cnv-enable-tracks
--- Generate track files that can be imported into IGV for viewing. When this option is enabled, a \*.gff
file for the output variant calls is generated, as well as \*.bw
files for the tangent normalized signal. The default is true.
--cnv-filter-bin-support-ratio
--- Filters out a candidate event if the span of supporting bins is less than the specified ratio with respect to the overall event length. This filter only applies to records with length greater than cnv-filter-bin-support-ratio-min-len
. The default ratio is 0.2 (20% support). As an example, if an event is called and has a length of 100,000 bp, but the target interval bins that support the call only spans a total of 15,000 bp (15,000/100,000 = 0.15), then the interval is filtered out. If applied, the record will have cnvBinSupportRatio
as a filter.
--cnv-filter-bin-support-ratio-min-len
--- Minimum length of candidate event at which to apply cnv-filter-bin-support-ratio
. Currently only applied to germline WGS workflows, with default value of 80,000 bp.
--cnv-filter-copy-ratio
--- Specifies the minimum copy ratio (CR) threshold value centered about 1.0 at which a reported event is marked as PASS in the output VCF file. The default value is 0.2, which leads to calls with CR between 0.8 and 1.2 being filtered. If applied, the record will have cnvCopyRatio
as a filter.
--cnv-filter-length
--- Specifies the minimum event length in bases at which a reported event is marked as PASS in the output VCF file. The default is 10000. If applied, the record will have cnvLength
as a filter.
--cnv-filter-qual
--- Specifies the QUAL value at which a reported event is marked as PASS in the output VCF file. You should adjust the parameter value according to your own application data. If applied, the record will have cnvQual
as a filter.
--cnv-min-qual
--- Specifies the minimum reported QUAL. The default is 3.
--cnv-max-qual
--- Specifies the maximum reported QUAL. The default is 200.
--cnv-qual-length-scale
--- Specifies the bias weighting factor to adjust QUAL estimates for segments with longer lengths. This is an advanced option and should not need to be modified. The default is 0.9303 (2-0.1).
--cnv-qual-noise-scale
--- Specifies the bias weighting factor to adjust QUAL estimates based on sample variance. This is an advanced option and should not need to be modified. The default is 1.0.
For the DRAGEN CNV pipeline, the hashtable must be generated with the --enable-cnv option
set to true, in addition to any other options required by other pipelines. When --enable-cnv
is true, DRAGEN generates an additional k-mer uniqueness map that the CNV algorithm uses to counteract mappability biases. You only need to generate the k-mer uniqueness map file one time per reference hashtable. The generation takes about 1.5 hours per whole human genome.
The following example command generates a hashtable.
The following command-line examples show how to run the DRAGEN map/align pipeline depending on your input type. The map/align pipeline generates an alignment file in the form of a BAM or CRAM file that can then be used in the DRAGEN CNV Pipeline.
You need to generate alignment files for all samples that have not already been mapped and aligned, including any samples to be used as references for normalization. Each sample must have a unique sample identifier. Use the --RGSM
option to specify the identifier. For BAM and CRAM input files, the sample identifier is taken from the file, so the --RGSM
option is not required.
The following example command maps and aligns a FASTQ file:
The following example command maps and aligns an existing BAM file:
The following example command maps and aligns an existing CRAM file:
DRAGEN can map and align FASTQ samples, and then directly stream them to downstream callers, such as the CNV Caller and the Haplotype Variant Caller. You can use this process to skip generation of a BAM or CRAM file, which bypasses the need to store additional files.
To stream alignments directly to the DRAGEN CNV pipeline, run the FASTQ sample through a regular DRAGEN map/align workflow, and then provide additional arguments to enable CNV. The following example command line maps and aligns a FASTQ file, and then sends the file to the Germline CNV WGS pipeline.
The target counts stage is the first processing stage for the DRAGEN CNV pipeline. This stage bins the alignments into intervals. The primary analysis format for CNV processing is the target counts file, which contains the feature signals that are extracted from the alignments to be used in downstream processing. The binning strategy, interval sizes, and their boundaries are controlled by the target counts generation options, and the normalization technique used.
When working with whole genome sequence data, the intervals are autogenerated from the reference hashtable. Only the primary contigs from the reference hashtable are considered for binning. You can specify additional contigs to bypass with the --cnv-skip-contig-list
option.
With whole exome sequence data, DRAGEN uses the target BED file supplied with the --cnv-target-bed
option to determine the intervals for analysis.
The target counts stage generates a .target.counts.gz
file. You can use the file later in place of any BAM or CRAM by specifying the file with the --cnv-input
option for the normalization stage. The .target.counts.gz
file is an intermediate file for the DRAGEN CNV pipeline and should not be modified.
If the samples are whole genome, then the effective target intervals width is specified with the --cnv-interval-width
option. The higher the coverage of a sample, the higher the resolution that can be detected. This option is important when running with a panel of normals because all samples must have matching intervals. For self-normalization, the actual width of a given target interval might be larger than the specified value.
The default value for WGS is 1000 bp with a sample coverage of ≥ 30x.
Using a cnv-interval-width
of less than 250 bp for WGS analysis can drastically increase runtime.
The intervals are autogenerated for every primary contig in the reference. Only references that have the UCSC or GRC convention are supported. For example, chr1, chr2, chr3, ..., chrX, chrY
or 1, 2, 3, ..., X, Y
. You can specify a list of contigs to skip by using the --cnv-skip-contig-list
option. This option takes a comma-separated list of contig identifiers. The contig identifiers must match the reference hashtable that you are using. By default, only the mitochondrial chromosomes are skipped. Non-primary contigs are never processed.
For example, to skip chromosome M, X, and Y, use the following option:
If the samples are whole exome samples, supply a target BED file with the --cnv-target-bed $TARGET_BED
option. The intervals in the target BED file indicate regions where alignments are expected based on the target capture kit. The BED file intervals are further split into intervals of smaller size, depending on the value of cnv-interval-width
.
To use a standard BED file, make sure that there is no header present in the file. In this case, all columns after the third column are ignored, similar to the operation of DRAGEN Variant Caller.
The following options control the generation of target counts.
--cnv-counts-method
--- Specifies the counting method for an alignment to be counted in a target bin. Values are midpoint, start, or overlap. The default value is overlap when using the panel of normals approach, which means if an alignment overlaps any part of the target bin, the alignment is counted for that bin. In the self-normalization mode, the default counting method is start.
--cnv-min-mapq
--- Specifies the minimum MAPQ for an alignment to be counted during target counts generation. The default value is 3 for self-normalization and 20 otherwise. When generating counts for panel of normals, all MAPQ0 alignments are counted.
--cnv-target-bed
--- Specifies a properly formatted BED file that indicates the target intervals to sample coverage over. For use in WES analysis.
--cnv-interval-width
--- Specifies the width of the sampling interval for CNV processing. This option controls the effective window size. The default is 1000 for WGS analysis and 500 for WES analysis.
--cnv-skip-contig-list
--- Specifies a comma-separated list of contig identifiers to skip when generating intervals for WGS analysis. The default contigs that are skipped, if not specified, are chrM,MT,m,chrm
.
--cnv-filter-duplicate-alignments
--- Filter duplicate marked alignments during target counts if option is set to true
. The deafult setting is false
.
Target counts options are recorded in the header of each counts file, to facilitate review and validation of panel of normals. If PON counts are generated with different count options than CASE sample, then DRAGEN will return an option validation error.
PCR duplicates are often considered as noise in coverage depth information. DRAGEN CNV has an option to include/exclude duplicate marked alignments: --cnv-filter-duplicate-alignments
when counting alignments. This relies on the alignments having the duplicate-marked bit (0x400) in the SAM flag set correctly.
If --enable-map-align=false
, then duplicate marking should be present in the input file (pre-aligned BAM/CRAM). If --enable-map-align=true
, then --enable-duplicate-marking=true
should be set.
Note that CNV will wait for duplicate marking from the Map/Aligner which may increase overall run time.
In the WGS case where a BED file is not specified for a given reference, the same intervals should be generated each time. The intervals created take into account the mappability of the reference genome using a k-mer uniqueness map created during hashtable generation.
Due to ambiguity that may arise from non-unique genomic loci, only regions corresponding to unique k-mers are considered. A position in the reference genome is marked as a unique k-mer if the k-mer starting at that position does not show up anywhere else in the reference genome (or non-unique, otherwise). Furthermore, if the k-mer contains any bases other than A, C, T or G, it is marked as non-unique.
For WGS samples and in absence of a cnv-target-bed
file, the target intervals are auto generated based on the pre-computed k-mer-uniqueness map for a given input reference hashtable, and the cnv-interval-width
option, which defaults to 1000bp. The cnv-interval-width
option determines the minimum number of unique k-mer positions required in the interval. There is an upper bound to the length of the interval: when the length of the interval is greater than double the size of cnv-interval-width
, without reaching the required count of unique k-mer positions, the interval is discarded and the process starts again at the next genomic position. Regions that are discarded are denoted as "dropout" regions, and denoted with exclusion reason NON_KMER_UNIQUE
in the *.cnv.excluded_intervals.bed.gz
file.
A dropout region is a complex region that does not count alignments and results in an interval missing from the analysis. Dropout regions include centromeres, telomeres, and low complexity regions. If there is sufficient signal in the flanking regions, an event can still span these dropout regions, even if alignment counting does not occur in the regions. The event is handled by the segmentation stage.
GC Biases measure the relationship between GC content and read coverage across a genome. Biases can occur in library prep, capture kits, sequencing system differences, and mapping. Biases can result in difficulties calling CNV events. The DRAGEN GC bias correction module attempts to correct these biases.
Typical whole-exome capture kits have over 200,000 targets spanning the regions of interest. If your BED file has fewer than 200,000 targets, or if the target regions are localized to a specific region in the genome (such that GC bias statistics might be skewed), then GC bias correction should be disabled.
The following options control the GC bias correction module.
--cnv-enable-gcbias-correction
--- Enable or disable GC bias correction when generating target counts. The default is true.
--cnv-enable-gcbias-smoothing
--- Enable or disable smoothing the GC bias correction across adjacent GC bins with an exponential kernel. The default is true.
--cnv-num-gc-bins
--- Specifies the number of bins for GC bias correction. Each bin represents the GC content percentage. Allowed values are 10, 20, 25, 50, or 100. The default is 25.
The DRAGEN CNV pipeline supports two normalization algorithms:
Self-Normalization --- Estimates the autosomal diploid level from the sample under analysis to determine the baseline level to normalize by. Sex chromosomes and PAR regions are handled based on the sample sex.
Panel of Normals --- A reference-based normalization algorithm that uses additional matched normal samples to determine a baseline level from which to call CNV events. The matched normal samples here means it has undergone the same library prep and sequencing workflow as the case sample.
Which algorithm to use depends on the available data and the application. Use the following guidelines to select the mode of normalization.
Whole genome sequencing
Single sample analysis
Additional matched samples are not readily available
Simpler workflow via a single invocation
Only references with chr1, chr2, chr3, ..., chrX, chrY
or 1, 2, 3, ..., X, Y
naming conventions are supported.
Whole genome sequencing (non-somatic)
Whole exome sequencing
Targeted panels, including somatic panels
Additional matched samples are available
Nonhuman samples
The DRAGEN CNV pipeline provides the self-normalization mode that does not require a reference sample or a panel of normals. To enable this mode, set --cnv-enable-self-normalization
to true. Self-normalization mode bypasses the need to run two stages and can save time. It uses the statistics within the case sample to determine the baseline from which to make a call.
Because self normalization uses the statistics within the case sample, this mode is not recommended for WES or targeted sequencing analysis due to the potential for insufficient data.
The self-normalization mode is the recommended approach for whole-genome sequencing single sample processing. The pipeline continues through to the segmentation and calling stage to produce the final called events.
If you are running from a FASTQ sample, then the default mode of operation is self-normalization.
When operating in self-normalization mode, the --cnv-interval-width
option used during the target counts stage becomes the effective interval width based on the number of unique k-mer positions. You typically do not have to modify this option.
Self-normalization autogenerates the target intervals to use during the analysis based on the reference genome and is only compatible with standard human references.
If the user wishes to attempt self-normalization mode on non-standard human references, an override can be set via --cnv-bypass-contig-check=true
. Under this setting, the CNV caller will do a naive median normalization across all of the contigs within the reference genome. This feature is purely for experimental and for research use only, and no claims or validation is made for the use of this feature.
The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. This allows the algorithm to subtract system level biases that are not sample specific. The generation of the target counts for these normal samples should also have identical command line options with the case sample under analysis.
In this mode, the DRAGEN CNV Pipeline is broken down into two distinct stages. The target counts stage is performed on each sample, case, and normals, to bin the alignments. The normalization and call detection stage is then performed with the case sample against the panel of normals to determine the events.
Target counts should be generated for all samples, whether the samples are to be used as references or are the case samples under analysis. The case samples and all samples to be used as a panel of normals sample must have identical intervals and therefore should be generated with identical settings. The target counts stage also performs GC Bias correction. GC Bias correction is enabled by default.
The following examples are for WES processing, which is the case in where a panel of normals is required.
The following is an example command for processing a BAM file.
The following is an example command for processing a CRAM file.
When running an analysis with a panel of normals (set of target counts), then a column wise concatenated version of the panel is output as a *.combined.counts.txt.gz file. If the user wishes to generate this file without running the actual calling step, then this can be done by adding the --cnv-generate-combined-counts=true
option to the command line. The individual panel of normals target counts file must be specified either via --cnv-normals-file
(one per file) or --cnv-normals-list
(single text file with paths to each sample).
The following is an example command line using a normals list:
The next step in the CNV pipeline when using a panel of normals is to perform the normalization and to make the calls. This involves a separate execution of DRAGEN during which the normalization is performed and calls are made. This step requires the specification of a set of target counts files to be used for reference-based median normalization.
Ideally the panel of normals samples follow library prep and sequencing workflows that are identical to the workflows of the case sample under analysis. In order to be applicable to both male and female case samples, the panel of normals should include a balanced set of both male and female samples. DRAGEN automatically handles calling on sex chromosomes based on the predicted sex of each sample in the panel.
For optimal bias correction, a minimum of 50 samples is recommended as a panel. DRAGEN can run with smaller numbers of samples in the panel, down to even just a single sample, but smaller panels can result in artifactual calls in the test sample where at least some of the panel samples have copy number changes. Larger panels do not entirely prevent such issues, but they limit it to regions where non-reference copy numbers are common.
The following is an example of PON files, which uses a subset of the GC corrected files from the target counts stage.
DRAGEN accepts 3 different file formats for a Panel of Normals (PON).
The CNV caller can also be started from the *.target.counts.gz
(raw counts) or *.target.counts.gc-corrected.gz
(GC corrected counts) files of the case sample, by specifying the selected file with the --cnv-input
option and the PON options described above. When selecting GC corrected counts the option --cnv-enable-gcbias-correction
should be set to false to disable the GC-correction stage.
For example, the following command normalizes the case sample against the panel of normals.
These options control the preconditioning of the panel of normals and the normalization of the case sample.
--cnv-enable-self-normalization
--- Enable/disable self normalization mode, which does not require a panel of normals.
--cnv-extreme-percentile
--- Specifies the extreme median percentile value at which to filter out samples. The default is 2.5.
--cnv-input
--- Specifies a target counts file for the case sample under analysis when using a panel of normals.
--cnv-normals-file
--- Specifies a target.counts.gz file to be used in the panel of normals. You can use this option multiple times, one time for each file.
--cnv-normals-list
--- Specifies a text file that contains paths to the list of reference target counts files to be used as a panel of normals. Absolute paths are recommended in case the workflow is used later or shared with other users. Relative paths are supported if the paths are relative to the current working directory.
--cnv-max-percent-zero-samples
--- Specifies the number of zero coverage samples allowed for a target. If the target exceeds the specified threshold, then the target is filtered out. The default value is 5%. The option is sensitive to the number of normal samples being used. Make sure you adjust the threshold accordingly. If your panel of normals size is small and the threshold not adjusted, the option could filter out targets that were not intended to be.
--cnv-max-percent-zero-targets
--- Specifies the number of zero coverage targets allowed for a sample. If sample exceeds the specified threshold, then the sample is filtered out. The default value is 2.5%. The option is sensitive to the total number of target intervals. Make sure you adjust the threshold accordingly. If the capture kit has a small number of probes and the threshold not adjusted, the option could filter out targets that were not intended to be.
--cnv-target-factor-threshold
--- Specifies the bottom percentile of panel of normals medians to filter out useable targets. The default is 1% for whole genome processing and 5% for targeted sequencing processing.
--cnv-truncate-threshold
--- Specifies a percentage threshold for truncating extreme outliers. The default is 0.1%.
--cnv-enable-gender-matched-pon
--- Enable/disable gender matched PON normalization. If enabled, DRAGEN uses matched gender PON for sex chromosome normalization. Sex chromosome intervals are filtered if PON has no matched gender sample. The default value is true.
After a case sample has been normalized, the sample goes through a segmentation stage. DRAGEN implements multiple segmentation algorithms, including the following algorithms:
Circular Binary Segmentation (CBS)
Shifting Level Models (SLM)
The SLM algorithm has three variants, SLM, Heterogeneous SLM (HSLM), and Adaptive SLM (ASLM). HSLM is for use in exome analysis and handles target capture kits that are not equally spaced. ASLM includes additional sample-specific estimation of technical variability of depth of coverage, as opposed to changes in copy number. The estimations are based on the median variance within fixed windows or a preliminary set of segments based on b-allele ratios. The ASLM algorithm mitigates over segmentation due to noisy or wavy samples; this is the default mode for somatic GWGS analysis.
By default, SLM is the segmentation algorithm for germline whole genome processing, ASLM is the algorithm for somatic whole genome processing, and HSLM is the algorithm for whole exome processing
For the targeted sequencing workflows, you can also run with a --cnv-segmentation-bed
. The option pre-defines the segments to estimate copy numbers for and skips the segmentation step of the workflow. See Targeted Segmentation (Segment BED) for more information.
--cnv-segmentation-mode
--- Specifies the segmentation algorithm to perform. The following values are available.
bed
cbs
slm
--- The default for germline WGS analysis.
aslm
--- The default for somatic WGS analysis.
hslm
--- The default for targeted/WES analysis.
--cnv-merge-distance
--- Specifies the maximum number of base pairs between two segments that would allow them to be merged. The default value is 0 for germline WGS, which means the segments must be directly adjacent. For WES analysis, this parameter is disabled by default due to the spacing of targeted intervals.
--cnv-merge-threshold
--- Specifies the maximum segment mean difference at which two adjacent segments should be merged. The segment mean is represented as a linear copy ratio value. The default is 0.2 for WGS and 0.4 for WES. To disable merging, set the value to 0.
Circular Binary Segmentation is implemented directly in DRAGEN and is based on A faster circular binary segmentation for the analysis of array CGH data¹ with enhancements to improve sensitivity for NGS data. The following options control Circular Binary Segmentation.
--cnv-cbs-alpha
--- Specifies the significance level for the test to accept change points. The default is 0.01.
--cnv-cbs-eta
--- Specifies the Type I error rate of the sequential boundary for early stopping when using the permutation method. The default is 0.05.
--cnv-cbs-kmax
--- Specifies maximum width of smaller segment for permutation. The default is 25.
--cnv-cbs-min-width
--- Specifies the minimum number of markers for a changed segment. The default is 2.
--cnv-cbs-nmin
--- Specifies the minimum length of data for maximum statistic approximation. The default is 200.
--cnv-cbs-nperm
--- Specifies the number of permutations used for p-value computation. The default is 10000.
--cnv-cbs-trim
--- Specifies the proportion of data to be trimmed for variance calculations. The default is 0.025.
¹Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23(6):657-663. doi:10.1093/bioinformatics/btl646
The Shifting Level Models (SLM) segmentation mode follows from the R implementation of SLMSuite: a suite of algorithms for segmenting genomic profiles².
--cnv-slm-eta
--- Baseline probability that the mean process changes its value. The default is 4e-5.
--cnv-slm-fw
--- Minimum number of data points for a CNV to be emitted. The default is 0, which means segments with one design probe could in effect be emitted.
--cnv-slm-omega
--- Scaling parameter that modulates relative weight between experimental or biological variance. The default is 0.3.
--cnv-slm-stepeta
--- Distance normalization parameter. The default is 10000. This option is only valid for HSLM.
Regardless of segmentation method, initial segments are split across large gaps where depth data is unavailable, such as across centromeres.
²Orlandini V, Provenzano A, Giglio S, Magi A. SLMSuite: a suite of algorithms for segmenting genomic profiles. BMC Bioinformatics. 2017;18(1). doi:10.1186/s12859-017-1734-5
In applications for targeted panels, you can limit the segmentation and calling performed on intervals by specifying a --cnv-segmentation-bed
. For example, the specified intervals might correspond to gene boundaries matched to the targeted assay. This segmentation mode is only supported with the panel of normals and requires an accompanying --cnv-target-bed
. Also specify the --cnv-segmentation-bed
during the panel of normals generation step, so that all interval boundaries during analysis are matched. For more information on panel of normals generation, see Panel of Normals
The recommended format for the BED file includes four columns and a header. The four columns are contig
, start
, stop
, and name
. The name column represents the name of the gene and must be unique within the BED file. The name is used in the output VCF and annotated as a segment identifier in the INFO/SEGID
field. The following example file is in the recommended format:
If using a three-column BED file, then do not include a header or the name field values. Three-column BED files should only include the contig
, start
, and stop
values. In this case, the segment identifier is autogenerated from the coordinate fields.
Quality scores are computed using a probabilistic model that uses a mixture of heavy tailed probability distributions (one per integer copy number) with a weighting for event length. Noise variance is estimated. The output VCF contains a Phred-scaled metric that measures confidence in called amplification (CN > 2 for diploid locus), deletion (CN < 2 for diploid locus), or copy neutral (CN=2 for diploid locus) events.
The scoring algorithm also calculates exact copy-number quality scores that are inputs to the DeNovo CNV detection pipeline.
You can input an exclude BED to the CNV caller to filter out regions from analysis. Inputting an exclude bed is useful if there are certain regions in the genome that are known to be problematic due to library prep, sequencing, or mapping issues. You can also exclude intervals that specify common CNVs to aid in downstream analysis. You can specify an exclude BED file using cnv-exclude-bed
. DRAGEN does not provide an exclude BED. The intervals to exclude should be formatted in standard three-column BED format.
The intervals in the exclude BED are compared with the original target counts intervals. If the overlap is greater than cnv-exclude-bed-min-overlap, the target counts interval are excluded from analysis. The *.target.counts.gz
file still includes the interval, so you can inspect the original read counts. The normalization stage removes intervals. The *.tn.tsv.gz
file excludes the intervals removed during normalization.
DRAGEN can perform mapping and aligning of FASTQ samples, and then directly stream the data to downstream callers. If the input is a FASTQ sample, a single sample can run through both the CNV and the small VC. This triggers self-normalization by default.
Run the FASTQ sample through a regular DRAGEN map/align workflow, and then provide additional arguments to enable the CNV, VC, or both. The options that apply to CNV in the standalone workflows are also applicable here.
The following examples show different commands.
When running the target counts stage or the normalization stage, the DRAGEN CNV pipeline also provides the following information about the samples in the run.
A correlation metric of the read count profile between the case sample and any panel of normals samples. A correlation metric greater than 0.90 is recommended for confident analysis, but there is no hard restriction enforced by the software.
The predicted sex of each sample in the run. The sex is predicted based on the read count information in the sex chromosomes and the autosomal chromosomes. The median value for the counts is printed to the screen for the autosomal chromosomes, the X chromosome, and the Y chromosome. This estimation requires a minimum of 300 target intervals on the sex chromosomes to proceed.
The results are printed to the screen when running the pipeline. For example:
The predicted sexes for samples in use are also printed to the *.cnv_metrics.csv output file. For a panel of normals, the predicted sexes are used to determine which panel samples are leveraged for normalization on sex chromosomes. If the estimated sex of the sample is UNDETERMINED, the sex of the sample is set to FEMALE.
You can override the predicted sex of the sample with the --sample-sex
option.
The germline CNV workflow can be extended to call copy number alterations in a curated subset of segmentally duplicated regions. Segmental duplications are large blocks of DNA ≥ 1kb, characterized by a high degree of sequence identity at nucleotide level (> 90%). This poses a challenge for traditional approaches, and such regions are usually excluded.
This extension complements the original germline CNV workflow by using a tailored algorithm to compute the normalized coverage in such regions, which is then injected before the segmentation step and becomes part of the main CNV workflow in downstream steps. We currently recommend WGS data aligned to a supported human reference genome (currently only hg38
) with at least 30x coverage. See below for additional requirements.
The following pairs of genes defining Segmental Duplications are included:
This extension is enabled by default in the germline CNV workflow. However, it requires:
Normalization set to self-normalization (--cnv-enable-self-normalization=true
).
GC bias correction enabled (--cnv-enable-gcbias-correction=true
).
Counts method set to start
(--cnv-counts-method=start
).
Interval width not greater than 10kb. However, we recommend using the cnv-interval-width
default (1kb) for best performance.
A supported reference genome builds in input (currently supported based on: hg38
).
If necessary, the extension can be disabled through setting --cnv-enable-segdups-extension
to false.
For each duplicated region, the extension collects all reads falling on top of the two homologous intervals of the pair, and it computes the normalized joint coverage (output to *.cnv.segdups.joint_coverage.tsv.gz
).
Through differentiating sites between the two homologous intervals, the extension computes the proportion of coverage to associate to the first and to the second interval (output to *.cnv.segdups.site_ratios.tsv.gz
).
Such proportion is used to redistribute the joint normalized coverage between the two homologous intervals.
The rescued intervals are output to the *.cnv.segdups.rescued_intervals.tsv.gz
file for inspection and they are automatically injected before the segmentation step.
During integration with the original intervals from the CNV caller, the rescued intervals are considered higher priority, thus replacing all original intervals that they overlap with.j
Header Field | Number | Type | Description |
---|---|---|---|
The previous can be visualized as:
Parent Copy Number Genotype | Possible Copy Number Alleles | Assumed Possible Copy Number Alleles |
---|---|---|
Mother Copy Number | Father Copy Number | Proband Copy Number | Mendelian Consistent? |
---|---|---|---|
is the set of all genotypes
is the set of conflicting genotypes
is the Mother copy number
is the Father copy number
is the Proband copy number
is the the prior for the trio genotype
Option | Description |
---|---|
Option | Description |
---|---|
Total Copy Number (CN) | Minor Copy Number (MCN) | ASCN Scenario |
---|---|---|
In Somatic WGS CNV, the INFO column can also contain the HET
tag, when the call is considered sub-clonal. See .
When matching CNV with SV output, additional INFO annotations are added. See .
ID | Description |
---|
ID | Description |
---|
ID | Description |
---|
ID | Description |
---|
Diploid or Haploid? | ALT | FORMAT:CN | FORMAT:GT |
---|
Using R, a good starting point is the package. The main workflow involves reading the *.target.counts.gz
file as an R dataframe, convert this to a GRanges object then plot the target intervals as points with the karyoploteR
package. The same workflow can be used to plot the GC-corrected counts, the log2 normalized copy ratios and the BAF profiles.
Using Python, the workflow is similar to R's but using Python's libraries such as , to convert DRAGEN output files to dataframe, and , to plot coverage and BAF profiles across the genome.
Exclusion Reason | Description | Related DRAGEN Option |
---|
Column index | Column contents | Description |
---|
The DRAGEN CNV pipeline supports multiple input formats. To run the DRAGEN CNV pipeline directly with FASTQ input without generating a BAM or CRAM file, see for instructions on streaming alignment records directly from the DRAGEN map/align stage.
DRAGEN CNV also supports running from an already mapped and aligned BAM or CRAM file. If you have data that has not yet been mapped and aligned, see .
The reference hashtable is a pregenerated binary representation of the reference genome. For information on generating a hashtable, see .
For information on running CNV concurrently with the Haplotype Variant Caller, see .
Further details are available in the section.
WGS Coverage per Sample | Recommended Resolution* (bp) |
---|
Input format | enable-map-align | Required option |
---|
Some of the excluded intervals can be rescued through the segmental duplication extension to the germline CNV workflow. See below on Section for more details.
The GC bias correction module immediately follows the target counts stage and operates on the *.target.counts.gz
file. GC bias correction generates a GC bias corrected version of the file, which has a *.target.counts.gc-corrected.gz
extension in the file name. The GC bias corrected versions are recommended for any downstream processing when working with WGS data. For WES, if there are enough target regions, then the GC bias corrected counts can also be used. See for further details on GC-corrected target counts files.
Option | Description |
---|
See for a description of the target counts files.
An excluded interval does not guarantee that a CNV call does not span the interval. If there is sufficient data flanking the region, the segmentation stage along with any merging might still generate a call spanning the excluded interval. However, the call would not take read counts from excluded intervals into account. You can view explanations for excluded intervals in the *.excluded_intervals.bed.gz file. See for further details.
See for a description of the extension output files.
--tumor-fastq1
,--tumor-fastq2
,--tumor-bam-input
, --tumor-cram-input
Specify a tumor input file.
--cnv-normal-b-allele-vcf
Specify a matched normal SNV VCF. For more information on specifying b-allele loci, see Specification of B-Allele Loci.
--cnv-population-b-allele-vcf
Specify a population SNP catalog. For more information on specifying b-allele loci, see Specification of B-Allele Loci.
--cnv-use-somatic-vc-baf
If running in tumor-normal mode with the SNV caller enabled, use this option to specify the germline heterozygous sites. For more information on specifying b-allele loci, see Specification of B-Allele Loci.
--sample-sex
If known, specify the sex of the sample. If the sample sex is not specified, the caller attempts to estimate the sample sex from tumor alignments.
--cnv-normal-cnv-vcf
Specify germline CNVs from the matched normal sample. For more information, see Germline-aware Mode.
--cnv-use-somatic-vc-vaf
Use the variant allele frequencies (VAFs) from the somatic SNVs to help select the tumor model for the sample. For more information, see VAF-aware Mode.
--cnv-somatic-enable-het-calling
Enable HET-calling mode for heterogeneous segments. For more information, see HET-Calling Mode.
cnv-normal-b-allele-vcf
Specify a matched normal SNV VCF. Use when a matched normal sample and the matched normal SNV VCF are available. To use this option, you must run the matched normal sample through the DRAGEN Germline workflow.
cnv-population-b-allele-vcf
Specify a population SNP VCF. Use when a matched normal sample is not available and analysis must be performed in tumor-only mode.
cnv-use-somatic-vc-baf
Set to true
to enable DRAGEN to identify germline variants during a tumor/matched-normal run, rather than requiring a separate run on the normal sample. Use if and only if tumor and matched normal input are available. Also enable the Somatic SNV Caller via enable-variant-caller
to use this option.
4
2
2+2
4
1
3+1
*[LOH]4
0
4+0
END_LEFT_BND_OF
1
String
ID of CNV whose left end is matched to the end of SV
END_RIGHT_BND_OF
1
String
ID of CNV whose right end is matched to the end of SV
LEFT_BND
1
String
ID of SV that matches the left end of CNV record
LEFT_BND_OF
1
String
ID of CNV whose left end is matched to SV
MatchSv
1
Integer
ID of original SV that was merged with CNV record
OrigCnvEnd
1
Integer
Coordinate of original CNV end
OrigCnvPos
1
Integer
Coordinate of original CNV pos
RIGHT_BND
1
String
ID of SV that matches the right end of CNV record
RIGHT_BND_OF
1
String
ID of CNV whose right end is matched to SV
SVCLAIM
A
String
Claim made by the structural variant call. Valid values are D, J, DJ for abundance, adjacency and both respectively
2
0/2, 1/1
1/1
3
0/3, 1/2
1/2
4
0/4, 1/3, 2/2
1/3, 2/2
N
x/(N-x) for x <= N/2
x/(N-x) for 1 <= x <= N/2
2
2
2
Yes
2
2
1
No
3
2
4
No
3
2
2
Yes
2
0
2
No
GT | Genotype |
SM | Linear copy ratio of the segment mean |
CN | Estimated copy number |
BC | Number of bins in the region |
PE | Number of improperly paired end reads at start and stop breakpoints |
GC | GC dinucleotide percentage |
CT | CT dinucleotide percentage |
AC | AC dinucleotide percentage |
LR | Log10 likelihood ratio of ALT to REF |
AS | Number of allelic read count sites |
BC | Number of read count bins |
CN | Estimated total copy number in tumor fraction of sample. This field is not present if the model cannot be estimated with high confidence. |
CNF | Floating point estimate of tumor copy number. This field is not present if the model cannot be estimated with high confidence. |
CNQ | Exact total copy number Q-score. This field is not present if the model cannot be estimated with high confidence. |
MAF | Maximum estimate of the minor allele frequency |
MCN | Estimated minor-haplotype copy number. This field is not present if the model cannot be estimated with high confidence. |
MCNF | Floating point estimate of tumor minor-haplotype copy number. This field is not present if the model cannot be estimated with high confidence. |
MCNQ | Minor copy number Q-score. This field is not present if the model cannot be estimated with high confidence. |
NCN | Normal-sample copy number. The field is only present in germline-aware mode. |
SCND | Difference between CN and GCN. The field is only present in germline-aware mode. |
SD | Best estimate of segment's bias-corrected read count |
Diploid | . | 2 | ./. |
Diploid | <DUP> | >2 | ./1 |
Diploid | <DEL> | 1 | 0/1 |
Diploid | <DEL> | 0 | 1/1 |
Haploid | . | 1 | 0 |
Haploid | <DUP> | >1 | 1 |
Haploid | <DEL> | 0 | 1 |
NON_KMER_UNIQUE | Non-unique Kmer bases are larger than 50% of interval. | Not applicable. This reason only applies to self-normalization mode. |
EXCLUDE_BED | Interval overlaps with exclude BED larger than threshold. |
|
PON_MAX_PERCENT_ZERO_SAMPLES | Number of PON samples with 0 coverage is larger than threshold. |
|
PON_TARGET_FACTOR_THRESHOLD | Median coverage of interval is lower than threshold of overall median coverage. |
|
PON_MISSING_INTERVAL | Target interval not found in PON. | Not applicable |
1 | contig | chromosome name |
2 | start | genomic locus of interval start |
3 | stop | genomic locus of interval stop |
4 | name | interval name |
5 | mean | average coverage depth |
6 | std | standard deviation |
7 | normalizedStd | normalized standard deviation (std/mean) |
8 | min | minimum |
9 | 25% | 25 percentile |
10 | 50% | median |
11 | 75% | 75 percentile |
12 | max | maximum |
13 | intervalSize | interval size (stop-start) |
14 | gcContents | percent GC |
5 | 10000 |
10 | 5000 |
>= 30 | 1000 |
Fastq | TRUE |
|
BAM | TRUE |
|
BAM | FALSE |
|
| Individual normal file. This option uses a single file name and can be specified multiple times. |
| List of normal files. A plain text file in which each line in the file contains a path pointing to a |
| PON file which combines all normal files in a single file. Combined counts file can be found from output folder of prior DRAGEN run with same panel of normals ( |
CYP2A6 | CYP2A7 |
FCGR3A | FCGR3B |
RHD | RHCE |
STRC | STRCP1 |
ACSM2A | ACSM2B |
ACTR3B | ACTR3C |
AQP12A | AQP12B |
ASAH2 | ASAH2B |
CCDC74A | CCDC74B |
CD177 | CD177p1 |
CD8B | CD8B2 |
CFH1 | CFHR1 |
CYP4A11 | CYP4A22 |
DHX40 | DHX40P1 |
EIF5AL1 | EIF5AP4 |
FCGR2A | FCGR2C |
FFAR3 | GPR42 |
FOLH1 | FOLH1B |
FRMPD2 | FRMPD2B |
GPAT2 | GPAT2P1 |
GSTT2B | GSTT2 |
DDT | DDTL |
HCAR2 | HCAR3 |
HSPA1A | HSPA1B |
KRT81 | KRT86 |
LGALS7 | LGALS7B |
MRPL45 | MRPL45P2 |
MSTO1 | MSTO2p |
MUC20 | MUC20P1 |
MZT2A | MZT2B |
OTOA | OTOAp1 |
PDPR | PDPR2P |
PIEZ02 | ENST00000591853.1 |
ZP3 | POMZP3 |
PRAMEF7 | PRAMEF8 |
PROS1 | PROS2P |
RMND5A | ANAPC1P2 |
ROCK1 | ROCK1p1 |
SERPINB3 | SERPINB4 |
SYT3 | ZNF473CR |
TBC1D26 | TBC1D28 |
TOP3B | TOP3BP1 |
TUBA3D | TUBA3E |
ZNF443 | ZNF799 |
To detect somatic copy number aberrations and regions with loss of heterozygosity, run the DRAGEN CNV Caller on a tumor sample with a VCF that contains germline SNVs from matched normal sample or population SNV VCF. The output file is a VCF file. Components of the germline CNV caller are reused in the somatic algorithm with the addition of a somatic modeling component, which estimates tumor purity and ploidy.
The germline SNVs are used to compute B-allele ratios in the tumor, which allows for allele-specific copy number calling on the tumor sample. Where possible, use of the small-variant VCF from a matched normal sample.
Panel of normals are used for the reference baseline to provide insight into copy number variants. The ASCN somatic WES CNV model is similar to the somatic WGS CNV model (with different internal parameters tuned for WES), but it uses a panel of normals to remove coverage bias in each target region.
The pipeline accepts various input types for matched normal sample or population SNV VCF for B-allele loci. If the normal sample was already processed using the germline small variant caller, the user can provide its output VCF file.
If the normal sample was not processed, the user can provide raw reads or aligned reads and enable the concurrent execution of the small variant caller. In such case the DRAGEN CNV will receive the small variant caller's output, and use it to estimate B-allele frequencies from the germline SNVs.
If there is no matched normal sample, the user can provide a population SNV VCF. DRAGEN will intersect the population SNV VCF with the target region provided by the cnv-target-bed
and use the resulting SNVs to estimate B-allele frequencies.
You can use following somatic WES CNV calling command-line options:
1 tumor input
1 normal input (either option 1, 2, or 3)
Panel of normals (either option 1, 2, 3 or 4)
Target region
When the normal sample input is not in VCF format (e.g., FASTQ/BAM/CRAM), then the normal sample shall be capable of being used as PON. However, if the normal sample is already included in the PON, then it will not be added.
The following is an example command line for running ASCN tumor-normal somatic WES CNV calling with matched normal SNV VCF.
The following example command line runs ASCN tumor-normal somatic WES CNV calling concurrently with the Somatic SNV Caller, which allows you to use the matched normal germline heterozygous sites directly from the SNV Caller with the command cnv-use-somatic-vc-baf true
.
If a matched normal is not available, DRAGEN CNV requires population SNV VCF to run in tumor-only mode. The following example command line runs tumor-only somatic WGS CNV calling with a population SNV VCF.
ASCN somatic WES CNV pipeline utilize same methods and workflow of DRAGEN Somatic WGS CNV pipeline. Please see Somatic WGS CNV Calling for more details.
Input | Option | Argument | Description |
---|---|---|---|
Tumor input
--tumor-fastq1
,--tumor-fastq2
,--tumor-bam-input
, --tumor-cram-input
file
Specify a tumor input file.
Normal input Option 1
--fastq1
,--fastq2
,--bam-input
, --cram-input
file
Specify a normal input file (if normal VCF is not ready).
Normal input Option 1
--cnv-use-somatic-vc-baf
true
/false
If running in tumor-normal mode with the SNV caller enabled, use this option to specify the germline heterozygous sites. For more information on specifying b-allele loci, see Specification of B-Allele Loci.
Normal input Option 2
--cnv-normal-b-allele-vcf
vcf file
Specify a matched normal SNV VCF. For more information on specifying b-allele loci, see Specification of B-Allele Loci.
Normal input Option 3
--cnv-population-b-allele-vcf
vcf file
Specify a population SNP catalog. For more information on specifying b-allele loci, see Specification of B-Allele Loci.
PON option 1
--cnv-normals-file
normal count file
Specify individual normal counts file (target.counts.gz or target.counts.gc-corrected.gz) for PON. You can use this option multiple times, one time for each file.
PON option 2
--cnv-normals-list
text file indicating normal count files per line
Specify text file that contains paths to the list of reference target counts files to be used as a panel of normals (new line separated).
PON option 3
--cnv-combined-counts
file
Specify combined PON file (.combined.counts.txt.gz).
PON option 4
NA
If no PON sample is specified, then DRAGEN utilizes matched normal sample as single sample PON. Available for Normal input Option 1
Target region
--cnv-target-bed
bed file
Specify target region bed file
Sample sex
--sample-sex
male
/female
/auto
/none
If known, specify the sex of the sample. If the sample sex is not specified, the caller attempts to estimate the sample sex from tumor alignments.