CNV Output

DRAGEN emits the calls in the standard VCF format. By default for analyses other than somatic WGS, the VCF file includes only copy number gain and loss events (and LOH events, where allele-specific copy number is available). To include copy neutral (REF) calls in the output VCF, set --cnv-enable-ref-calls to true.

For more information on how to use the output files to aid in debug and analysis, see Signal Flow Analysis.

CNV VCF File

File extension: *.cnv.vcf.gz

The CNV VCF file follows the standard VCF format. Due to the nature of how CNV events are represented versus how structural variants are represented, not all fields are applicable. In general, if more information is available about an event, then the information is annotated. Some fields in the DRAGEN CNV VCF are unique to CNVs. The VCF header is annotated with ##source=DRAGEN_CNV to indicate the file is generated by the DRAGEN CNV pipeline

The following is an example of some of the header lines that are specific to CNV:

##fileformat=VCFv4.2
##CoverageUniformity=0.402517
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
...
##reference=file:///reference_genomes/Hsapiens/hs37d5/DRAGEN
##ALT=<ID=CNV,Description="Copy number variant region">
##ALT=<ID=DEL,Description="Deletion relative to the reference">
##ALT=<ID=DUP,Description="Region of elevated copy number relative to the reference">
##INFO=<ID=REFLEN,Number=1,Type=Integer,Description="Number of REF positions included in this record">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END">
##FILTER=<ID=cnvQual,Description="CNV with quality below <WORKFLOW-SPECIFIC DEFAULT VALUES>">
##FILTER=<ID=cnvCopyRatio,Description="CNV with copy ratio within +/- 0.2 of 1.0">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=SM,Number=1,Type=Float,Description="Linear copy ratio of the segment mean">
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Estimated copy number">
##FORMAT=<ID=BC,Number=1,Type=Integer,Description="Number of bins in the region">
##FORMAT=<ID=PE,Number=2,Type=Integer,Description="Number of improperly paired end reads at start and stop breakpoints">

The following header lines are specific to somatic WGS CNV calling:

  • ModelSource The primary basis on which the final tumor model was chosen. The following values can be included:

    • DEPTH+BAF: Depth+BAF signal is used to determine tumor model.

    • DEPTH+BAF_DOUBLED: The initial depth+BAF model is duplicated based on VAF signal or excess segments at half the expected depth change.

    • DEPTH+BAF_DEDUPLICATED: The depth+BAF model is deduplicated based on VAF signal or insufficient segments supporting a duplication.

    • DEPTH+BAF_WEAK: Depth+BAF signal is used to determine lower-confidence tumor model.

    • VAF: VAF signal is used to determine tumor model due to insufficient depth+BAF signal.

    • DEGENERATE_DIPLOID: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. The diploid coverage is set to lowest value observed in a substantial number of bases in segments with BAF=50%.

    • SAMPLE_MEDIAN: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. Diploid coverage set to sample median.

  • EstimatedTumorPurity Estimated fraction of cells in the sample due to tumor. The range of this field is [0, 1] or NA if a confident model could not be determined.

  • DiploidCoverage Expected read count for a target bin in a diploid region. The numeric value is unlimited.

  • OverallPloidy Length weighted average of tumor copy number for PASS events. The numeric value is unlimited.

  • AlternativeModelDedup An alternative to the best model corresponding to one less whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation if the best model might involve a spurious genome duplication.

  • AlternativeModelDup An alternative to the best model corresponding to one more whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation where the best model might have missed a true genome duplication.

  • OutlierBafFraction A QC metric that measures the fraction of b-allele frequencies that are incompatible with the segment the BAFs belong to. High values might indicate a mismatched normal, substantial cross-sample contamination, or a different source of a mosaic genome, such as bone marrow transplantation. The range of this field is [0, 1].

Records

All coordinates in the VCF are 1-based.

CHROM

The CHROM column specifies the chromosome (or contig) on which the copy number variant being described occurs.

POS

The POS column is the start position of the variant. According to the VCF specification, if any of the ALT alleles is a symbolic allele, such as <DEL>, then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism.

ID

The ID column is used to represent the event. The ID field encodes the event type and coordinates of the event (1-based, inclusive). In addition to representing GAIN, LOSS and REF events, in Somatic WGS CNV, the ID could include the Copy Neutral Loss of Heterozygosity (CNLOH) or Copy Number Gain with LOH (GAINLOH) events.

REF

The REF column contains an N for all CNV events.

ALT

The ALT column specifies the type of CNV event. Because the VCF contains only CNV events, only the <DEL> or <DUP> entries are used. If REF calls are emitted, their ALT will always be .. In Somatic WGS CNV, the ALT field can contain two alleles, such as <DEL>,<DUP>, which allows representation of allele-specific copy numbers if they differ in copy number states.

QUAL

The QUAL column contains an estimated quality score for the CNV event, which is used in hard filtering. Each CNV workflow has different defaults and the value used can be found in the VCF header.

FILTER

The FILTER column contains PASS if the CNV event passes all filters, otherwise the column contains the name of the failed filter. Default values are defined in the header line for each available FILTER.

Available FILTERs:

  • cnvLength which indicates that the length of the CNV is lower than a threshold.

  • cnvQual which indicates that the QUAL of the CNV is lower than a threshold.

Germline CNV has the following additional FILTERs:

  • cnvCopyRatio which indicates that the segment mean of the CNV is not far enough from copy neutral.

Both Germline CNV workflows have the following additional FILTERs:

  • dinucQual which indicates a CNV call where some of its dinucleotide percentages are outside typical ranges, and thus the call is likely to be a false positive.

Germline WGS CNV has the following additional FILTERs:

  • cnvBinSupportRatio which indicates, for CNVs greater than 80kb, the percent span of supporting target intervals is lower than a threshold.

  • highCN which indicates a CNV call with implausible copy number (>6).

Germline WES CNV has the following additional FILTERs:

  • cnvLikelihoodRatio indicates a log10 likelihood ratio of ALT to REF is less than a threshold.

Both Somatic CNV workflows have the following additional FILTERs:

  • binCount - Filters CNV events with a bin count lower than a threshold.

Somatic WGS CNV has the following additional FILTERs:

  • lengthDegenerate - Marks records as non-PASSing based on each record's length (REFLEN) when the caller returns the default model. Segments having less than 1 Mb are assigned this filter when returning the default model.

  • segmentMean - Marks records as non-PASSing based on each record's segment mean (SM) when the caller returns the default model. Segments having insufficient SM in DELs or DUPs are assigned this filter when returning the default model.

Somatic WES CNV has the following additional FILTERs: -SqQual - Marks records as non-PASSing based on each record's somatic quality (SQ) when the caller returns the default model. Segments having insufficient SQ are assigned this FILTER when returning the default model. SQ is the somatic quality value which is a Phred scale score of p-value from 2-sample t-test comparing normalized counts of CASE vs PON.

INFO

The INFO column contains information representing the event.

  • REFLEN indicates the length of the event.

  • SVLEN is a signed representation of REFLEN (e.g., a negative value indicates a deletion), and it is only present for non-REF records.

  • SVTYPE is always CNV and only present for non-REF records.

  • END indicates the end position of the event (1-based, inclusive).

If using a segment BED file, then the segment identifier is carried over from the input to SEGID field.

In Somatic WGS CNV, the INFO column can also contain the HET tag, when the call is considered sub-clonal. See HET-Calling Mode.

When matching CNV with SV output, additional INFO annotations are added. See CNV With SV Support.

FORMAT

The common FORMAT fields are described in the header:

Germline WGS CNV includes the following FORMAT fields:

Germline WES CNV includes the following FORMAT fields:

Somatic WGS CNV and Somatic WES CNV with ASCN (allele-specific copy number) support include the following FORMAT fields:

Somatic WES CNV without ASCN support provides only the common FORMAT fileds and does not include the CN entry, since it does not estimate the tumor purity fraction and cannot make an estimate of the copy number.

Note on genotype annotation in germline copy number calling

Because germline copy number calling determines overall copy number rather than the copy number on each haplotype, the genotype type field contains missing values for diploid regions when CN is greater than or equal to 2. The following are examples of the GT field for various VCF entries:

Coverage Uniformity

The DRAGEN CNV pipeline provides a measure of the quality of the data for a sample. If using the WGS self-normalization method, the additional CoverageUniformity metric is present in the VCF header. The metric is only available for germline samples. The CNV pipeline assumes that post-normalization target counts are independently and identically distributed (IID). Coverage in most high-quality WGS samples is uniform enough for the CNV caller to produce accurate calls, but some samples violate the IID assumption. Issues during library preparation or sample contamination can lead to several extreme outliers and/or waviness of target counts, which can result in a large number of false positive CNV calls. The CoverageUniformity metric quantifies the degree of local coverage correlation in the sample to help identify poor-quality samples.

A larger value for this metric means the coverage in a sample is less uniform, which indicates that the sample has more nonrandom noise, and could be considered poor quality. The CoverageUniformity metric depends on factors other than sample quality, such as the cnv-interval-width setting and sample mean coverage. DRAGEN recommends using this score to compare the quality of samples from similar mean coverage and the same command line options. Because of this, DRAGEN CNV only provides the metric and does not take any action based on it.

CNV Metrics File

DRAGEN CNV outputs metrics in CSV format. The output follows the general convention for QC metrics reporting in DRAGEN. The CNV metrics are output to a file with a *.cnv_metrics.csv file extension. The following list summarizes the metrics that are output from a CNV run.

Sex Genotyper:

  • Estimated sex of the case sample as well as that of all panel of normals samples are reported. For WGS workflows, the estimated sex karyotype will be reported and for non-WGS workflows the gender will be reported.

  • Confidence score (ranging from 0.0 to 1.0). If the sample sex is specified, this metric is 0.0.

CNV Summary:

  • Bases in reference genome in use

  • Average alignment coverage over genome

  • Number of alignment records processed

    • Number of filtered records (total)

    • Number of filtered records (due to duplicates)

    • Number of filtered records (due to MAPQ)

    • Number of filtered records (due to being unmapped)

  • Coverage MAD

  • Median Bin Count

  • Number of target intervals

  • Number of normal samples

  • Number of segments

  • Number of amplifications

  • Number of deletions

  • Number of PASS amplifications

  • Number of PASS deletions

Coverage MAD and Median Bin Count are only printed for WES germline/somatic CNV. Coverage MAD is the median absolute deviation of normalized case counts. Higher values indicate noiser sample data (poor quality). Median Bin Count is the median of raw counts normalized by interval size.

Intermediate and Visualization Files

Intermediate stages of the pipeline stages produce various intermediate output files. These files can be useful for visualization of the evidence or results from each stage, and may aid in fine-tuning options.

All files have a structure similar to a BED file with optional header line(s).

Target Counts

The file *.target.counts.gz is a compressed tab-delimited text file that contains the number of read counts per target interval. This is the raw signal as extracted from the alignments of the BAM or CRAM file. The format is identical for both the case sample and any panel of normals samples. There is also a bigWig representation of a target.counts.diploid file, which is normalized to the normal ploidy level of 2 instead of raw counts.

It has the following columns:

  1. Contig identifier

  2. Start position

  3. End position

  4. Target interval name

  5. Count of alignments in this interval

  6. Count of improperly paired alignments in this interval

Header lines are also included that start with #. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.

An example of a *.target.counts.gz file is shown below.

#TARGET COUNTS FILE
##DRAGENVersion=<VERSION_INFO>
##DRAGENCommandLine=<CommandLineOptions>
...
contig  start  stop   name                <SampleName> improper_pairs
1       565480 565959 target-wgs-1-565480 7          6
1       566837 567182 target-wgs-1-566837 9          0
1       713984 714455 target-wgs-1-713984 34         4
1       721116 721593 target-wgs-1-721116 47         1
1       724219 724547 target-wgs-1-724219 24         21
1       725166 725544 target-wgs-1-725166 43         12
1       726381 726817 target-wgs-1-726381 47         14
1       753243 753655 target-wgs-1-753243 31         2
1       754322 754594 target-wgs-1-754322 27         0
1       754594 755052 target-wgs-1-754594 41         0

B-Allele counts

B-allele counts are calculated at sites in the tumor sample where the normal sample is likely to be heterozygous. When analyzed in conjunction with a matched normal sample, the sites are those that are called as heterozygous SNVs in the normal sample. When analyzed in tumor-only mode, they are taken from a collection of sites that have high-frequency SNVs in the population. Each B-allele site consists of a reference allele and a variant allele, and the number of reads in the tumor sample supporting each of these alleles is counted.

B-allele counts are written both to gzipped tsv file *.ballele.counts.gz and gzipped bedgraph file *.baf.bedgraph.gz.

B-allele tsv

The tsv file format is the following:

  1. Contig identifier

  2. Start, BED-style (zero-based inclusive) start position of the reference allele

  3. Stop, BED-style (zero-based inclusive) stop position of the reference allele

  4. Base sequence for the reference allele

  5. Base sequence for the the first allele being counted

  6. Base sequence for the second allele being counted

  7. The number of qualified reads containing a sequence matching the first allele

  8. The number of qualified reads containing a sequence matching the second allele

Additionally, in the case of B-allele sites from a population VCF, the following two additional columns are added after the columns listed above:

  1. Population frequency for the first allele

  2. Population frequency for the second allele

An example of B-allele counts file is provided below:

contig  start   stop    refAllele       allele1 allele2 allele1Count    allele2Count
chr1    11021   11022   G       G       A       4       2
chr1    14463   14464   A       A       T       111     36
chr1    16494   16495   G       G       C       122     262
chr1    38741   38742   C       C       T       9       9
chr1    39014   39015   A       A       C       38      48
chr1    39260   39261   T       T       C       199     143
chr1    48447   48448   C       C       T       8       15
chr1    48517   48518   A       A       G       13      15
chr1    91485   91486   G       G       C       1       4
chr1    91489   91490   A       A       G       1       3
chr1    98944   98945   C       C       T       46      114

B-allele bedgraph

The bedgraph file format is similar to the BED format and it has the following columns:

  1. Contig identifier

  2. Start

  3. Stop

  4. Ratio of allele counts

The numerator and denominator of thw ratio is determined by sorting the allele counts according to the priority of the corresponding bases. The order of the bases in descending priority is {A, T, G, C}.

When the priority of allele1 is higher than the priority of allele2, the output frequency is calculated by:

allele1Count / (allele1Count + allele2Count)

When the priority of allele2 is higher than the priority of allele1, the output frequency is calculated by:

allele2Count / (allele1Count + allele2Count)

By prioritizing the bases in this way, the output frequencies will be deterministically distributed in a roughly equal proportion above and below 0.5. When plotting these B-allele frequencies (e.g., in IGV), this gives an easy way to visually determine significant changes in b-allele frequency between neighboring segments of the genome. It also provides a similar visualization to that typically used for array data.

An example of the bedgraph file is shown below:

chr1    11021   11022   0.333333
chr1    14463   14464   0.755102
chr1    16494   16495   0.317708
chr1    38741   38742   0.5
chr1    39014   39015   0.44186
chr1    39260   39261   0.581871
chr1    48447   48448   0.652174
chr1    48517   48518   0.464286

Bias correction

The file *.target.counts.gc-corrected.gz contains the number of GC-corrected read counts per target interval. The format is equivalent to the *target.counts.gz file:

  1. Contig identifier

  2. Start position

  3. End position

  4. Target interval name

  5. GC-corrected read counts in this interval

  6. Count of improperly paired alignments in this interval

Header lines are also included that start with #. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.

An example of a *.target.counts.gc-corrected.gz file is shown below.

#GC CORRECTED FILE
##DRAGENVersion=<VERSION_INFO>
##DRAGENCommandLine=<CommandLineOptions>
...
contig  start   stop    name    <SampleName> improper_pairs
chr1    818022  819840  target-wgs-chr1-818022:819840   1071.353133     6
chr1    819840  821337  target-wgs-chr1-819840:821337   1051.014997     19
chr1    821337  822485  target-wgs-chr1-821337:822485   1098.6502       10
chr1    822485  824431  target-wgs-chr1-822485:824431   1117.28308      7
chr1    830446  832304  target-wgs-chr1-830446:832304   1102.211816     1
chr1    832304  834311  target-wgs-chr1-832304:834311   1004.822683     5
chr1    836677  838659  target-wgs-chr1-836677:838659   1015.973037     7
chr1    841054  843056  target-wgs-chr1-841054:843056   1014.921403     3

Combined counts

The file *.combined.counts.txt.gz is a column-wise concatenation of individual *.target.counts.gz and *.target.counts.gc-corrected.gz used to form the panel of normals.

Normalization

The file *.tn.tsv.gz contains the normalized signal of the case sample, per target interval, i.e., the log2-normalized copy ratio signal. A strong signal deviation from 0.0 indicates a potential for a CNV event. The format is equivalent to the *target.counts.gz file:

  1. Contig identifier

  2. Start position

  3. End position

  4. Target interval name

  5. Log2-normalized read counts in this interval

  6. Count of improperly paired alignments in this interval

Header lines are also included that start with #.

An example of a *.tn.tsv.gz file is shown below.

#title = Normalized coverage profile
#sex = UNDETERMINED
contig  start   stop    name    <SampleName> improper_pairs
chr1    818022  819840  target-wgs-chr1-818022:819840   -0.18479358083014644    6
chr1    819840  821337  target-wgs-chr1-819840:821337   -0.21244441644669046    19
chr1    821337  822485  target-wgs-chr1-821337:822485   -0.14849555308041734    10
chr1    822485  824431  target-wgs-chr1-822485:824431   -0.12423291178926463    7
chr1    830446  832304  target-wgs-chr1-830446:832304   -0.1438261733656668     1
chr1    832304  834311  target-wgs-chr1-832304:834311   -0.27728673450293895    5
chr1    836677  838659  target-wgs-chr1-836677:838659   -0.26136555699676262    7

Segmentation

File extension: *.seg, *.seg.called, *.seg.called.merged

Files containing the segments produced by the segmentation algorithm. The Segment_Mean value of a segment is the ratio of the mean of that segment to the whole-sample median, without log transformation (linear copy-ratio). A strong signal deviation from 1.0 indicates a potential for a CNV event.

The *.seg file has the following columns:

  1. Sample name

  2. Contig identified

  3. Start position

  4. End position

  5. Number of intervals in the segment

  6. Linear copy-ratio of the segment

An example of a *.seg file is shown below.

Sample  Chromosome      Start   End     Num_Probes      Segment_Mean
<SampleName> chr1    818022  1117426 224     0.82500341336435279
<SampleName> chr1    1117426 4063702 2438    0.91726081432236528
<SampleName> chr1    4063702 4067591 3       0.38861386123247205
<SampleName> chr1    4067591 7705829 3302    0.93021316913709917
<SampleName> chr1    7705829 9357003 1405    0.98147825043799442
<SampleName> chr1    9357003 9377365 19      0.50269670724395654
<SampleName> chr1    9377365 12859821        2905    1.0684818476332989

The *.seg.called file is identical to the *.seg file, with an additional column indicating the initial call for whether the segment is a duplication + ir a deletion -.

The *.seg.called.merged file is identical to the *.seg.called file but with segments potentially merged when they meet internal merging criteria. In addition to the columns described above, this file has also the following columns:

  1. QUAL

  2. FILTER

  3. Copy number assignment

  4. Ploidy

  5. Improper_Pairs count

B-allele segmentation (Somatic)

In addition to segmentation of target counts, some workflows perform segmentation of B-allele loci. The output file has suffix *.baf.seg and it has the same format of the *.seg file with two modifications. Firstly, the Segment_Mean value is the mean over B-allele loci of the smaller observed allele fraction. Secondly, there is an additional column:

  1. BAF_SLM_STATE: Integer between 0 and 10, indicating bins of minor-allele fraction (low to high), or . when the BAF data are too variable to estimate a minor-allele fraction"

An example of segmentation output file is shown below:

Sample  Chromosome      Start   End     Num_Probes      Segment_Mean    BAF_SLM_STATE
<SampleName> chr1    820348  1104646 194     0.29301737166888697     6
<SampleName> chr1    1105091 1533754 444     0.26185904799069076     5
<SampleName> chr1    1533810 1534166 9       0.41958837071702065     8
<SampleName> chr1    1534217 9356793 6689    0.26034515815016335     5
<SampleName> chr1    9358304 9376529 27      0.46450553586280602     10
<SampleName> chr1    9378480 12859495        1651    0.24172965924359388     5

Model identification (Somatic)

The file *.cnv.purity.coverage.models.tsv describes the different tested models and their log-likelihood. It has columns:

  1. Model purity (Cellularity)

  2. Model diploid coverage

  3. Model log-likelihood

An example is shown below:

#Purity Coverage        logL
1       384     -23441740.5209
0.99    566     -22926572.4287
0.99    726     -23281869.1423
0.99    1206    -24075475.1481
0.99    1836    -24334376.579
0.99    2256    -24380290.0335
0.99    2696    -24380616.8655
0.98    449     -23988016.7101

Visualization

To generate additional equivalent bigWig and gff files, set the --cnv-enable-tracks option to true. These files can be loaded into IGV along with other tracks that are available, such as RefSeq genes. Using these tracks alongside publicly available tracks allows for easier interpretation of calls. DRAGEN autogenerates IGV session XML file if tracks are generated by DRAGEN CNV. The *.cnv.igv_session.xml can be loaded directly into IGV for analysis.

The following IGV tracks are automatically populated in the output IGV session file:

  • *.target.counts.bw --- Bigwig representation of the target counts bins. Setting the track view in IGV to barchart or points is recommended. Values are gc-corrected if gc-correction was performed.

  • *.improper_pairs.bw --- BigWig representation of the improper pairs counts. Setting the track view in IGV to barchart is recommended.

  • *.tn.bw --- BigWig representation of the tangent normalized signal. Setting the track view in IGV to points is recommended.

  • *.seg.bw --- BigWig representation of the segments. Setting the track view in IGV to points is recommended.

  • *.baf.bedgraph.gz --- BED graph representation of B-allele frequency (if available). Setting the track view in IGV to points is recommended.

  • *.cnv.gff3 --- GFF3 representation of the CNV events. DEL events show as blue and DUP events show as red. Filtered events are a light gray. If REF events are enabled, then they will show up as green. An example of DRAGEN CNV gff3 is shown below (different CNV workflows might output different attributes on the 9th column):

##gff-version 3
chr1    DRAGEN  CNV     818023  9357003 1000    .       .       Start=818022;Stop=9357003;Length=8538981;Alt=<DEL>,<DUP>;Qual=1000;Filter=PASS;Genotype=1/2;CopyNumber=2;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=2.290614;MinorCopyNumberFloat=0.004158;BiasCorrectedReadCount=1138.9;MinorAlleleFrequency=0.259;BinCount=7372;ImproperPairsCount=6,157;NumAllelicSites=7336;color=#FF00FF;
chr1    DRAGEN  CNV     9357004 9377365 534     .       .       Start=9357003;Stop=9377365;Length=20362;Alt=<DEL>;Qual=534;Filter=cnvLength;Genotype=1/1;CopyNumber=0;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=0.147802;MinorCopyNumberFloat=0.073901;BiasCorrectedReadCount=623.5;MinorAlleleFrequency=0.5;BinCount=19;ImproperPairsCount=157,186;NumAllelicSites=27;color=#DDDDDD;
chr1    DRAGEN  CNV     9377366 36656983        1000    .       .       Start=9377365;Stop=36656983;Length=27279618;Alt=<DEL>,<DUP>;Qual=1000;Filter=PASS;Genotype=1/2;CopyNumber=3;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=2.910093;MinorCopyNumberFloat=0.068221;BiasCorrectedReadCount=1287.9;MinorAlleleFrequency=0.241;BinCount=22591;ImproperPairsCount=186,21;NumAllelicSites=18021;color=#FF0000;

For somatic WGS analyses, the following additional files are included in the IGV session xml:

  • *.baf.seq.bw --- BigWig representation of the BAF segments. Setting the track view in IGV to points is recommended.

  • *.tumor.baf.bedgraph.gz --- Bedgraph represengation of the B-allele frequencies. Setting the track view in IGV to points and windowing function to none is recommended.

IGV Session

File extension: *.igv_session.xml

The IGV session XML file is prepopulated with track files generated by DRAGEN. The session file loads the reference genome that best matches the standard reference genomes in an IGV installation, by comparing the name of the --ref-dir specified on the command-line. Standard UCSC human reference genomes are autodetected, but any variations from the standard reference genomes might not be autodetected. To edit the genome detection, alter the genome attribute in the Session element to the reference genome you would like for analysis before loading into IGV. The reference identifier used by IGV might differ from the actual name of the genome. The following is an example edited session file.

<?xml version="1.0" encoding="utf-8"?>
<Session genome="b37" hasGeneTrack="false" hasSequenceTrack="true" version="8">
    <Resources>
        <Resource path="example.cnv.gff3"/>
        <Resource path="example.cnv.excluded_intervals.bed.gz"/>
        <Resource path="example.target.counts.bw"/>
        <Resource path="example.improper.pairs.bw"/>
        <Resource path="example.tn.bw"/>
        <Resource path="example.seg.bw"/>
    </Resources>
    <Panel height="500" width="1200" name="DataPanel">
        ...
    </Panel>
</Session>

Note that depending on the IGV version installed, it may come prepackaged with different flavors of GRCh37. The reference naming conventions have changed so a user may have to edit the genome field in the XML file directly. For example, IGV has traditionally packaged a b37 reference genome, but may also include a 1kg_v37 or a 1kg_b37+decoy, which will appear on the IGV user interface as "1kg, b37" or "1kg, b37+decoy" respectively.

You can determine what the correct encoding of a reference genome by going to File > Save Session... and then inspecting the generated igv_session.xml file.

Creating CNV coverage and BAF plots with third-party tools

DRAGEN CNV outputs can be ingested using third-party libraries on most commonly used languages such as Python/R. The typically used files are:

  • *.target.counts.gz or *.target.counts.gc-corrected.gz, containing the number of alignments, or corrected alignments, per interval. Used to display the coverage profile across all intervals.

  • *.tn.tsv.gz, containing the log2-normalized copy ratio per interval.

  • *.baf.bedgraph.gz, if BAF is available, containing the BAF for each considered site. Used to display the BAF profile across all sites.

In all previously specified files, the format is similar to BED, allowing them to be loaded as any other tab-separated files.

Using R, a good starting point is the karyoploteR package. The main workflow involves reading the *.target.counts.gz file as an R dataframe, convert this to a GRanges object then plot the target intervals as points with the karyoploteR package. The same workflow can be used to plot the GC-corrected counts, the log2 normalized copy ratios and the BAF profiles.

Using Python, the workflow is similar to R's but using Python's libraries such as pandas, to convert DRAGEN output files to dataframe, and matplotlib, to plot coverage and BAF profiles across the genome.


A similar workflow can be used to plot copy number calls (and minor copy number calls, if available) by using the *.cnv.gff3 output file. Some examples of DRAGEN output GFF3 are shown below:

Germline WGS

chr1	DRAGEN	CNV	818023	1426288	52	.	.	Alt=REF;LinearCopyRatio=1.03497;CopyNumber=2;Genotype=./.;Qual=52;Filter=PASS;Start=818022;Stop=1426288;Length=608266;BinCount=491;ImproperPairsCount=1,7;color=#00FF00;
chr1	DRAGEN	CNV	1426289	1428354	22	.	.	Alt=DEL;LinearCopyRatio=0.411841;CopyNumber=1;Genotype=0/1;Qual=22;Filter=cnvLength;dinucQual;Start=1426288;Stop=1428354;Length=2066;BinCount=2;ImproperPairsCount=7,16;color=#DDDDDD;

Somatic WGS

chr1    DRAGEN  CNV     818023  9357003 1000    .       .       Start=818022;Stop=9357003;Length=8538981;Alt=<DEL>,<DUP>;Qual=1000;Filter=PASS;Genotype=1/2;CopyNumber=2;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=2.290614;MinorCopyNumberFloat=0.004158;BiasCorrectedReadCount=1138.9;MinorAlleleFrequency=0.259;BinCount=7372;ImproperPairsCount=6,157;NumAllelicSites=7336;color=#FF00FF;
chr1    DRAGEN  CNV     9357004 9377365 534     .       .       Start=9357003;Stop=9377365;Length=20362;Alt=<DEL>;Qual=534;Filter=cnvLength;Genotype=1/1;CopyNumber=0;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=0.147802;MinorCopyNumberFloat=0.073901;BiasCorrectedReadCount=623.5;MinorAlleleFrequency=0.5;BinCount=19;ImproperPairsCount=157,186;NumAllelicSites=27;color=#DDDDDD;

From the output GFF3, the typical steps to follow are to parse each segment coordinates and the CopyNumber annotation (or any other annotation the user might want to plot), and to plot them using the libraries listed previously for coverage/BAF profiles (or any other library and language of user's choice).

Excluded Intervals File

To improve accuracy, the DRAGEN CNV Pipeline excludes genomic intervals if one or more of the target intervals failed at least one quality requirement. The excluded intervals are reported to *.excluded_intervals.bed.gz file. The file has a bed format, identifies the regions of the genome that are not callable for CNV analysis and describes the reason intervals were excluded in the fourth column. The following are the possible reasons for exclusion.

An example of a *.excluded_intervals.bed.gz file is shown below:

chr1    0       818022  NON_KMER_UNIQUE
chr1    824431  830446  NON_KMER_UNIQUE
chr1    834311  836677  NON_KMER_UNIQUE
chr1    838659  841054  NON_KMER_UNIQUE
chr1    850451  853257  NON_KMER_UNIQUE
chr1    855442  860261  NON_KMER_UNIQUE
chr1    866189  868833  NON_KMER_UNIQUE
chr1    881779  884116  NON_KMER_UNIQUE
chr1    1016667 1018959 NON_KMER_UNIQUE
chr1    1075880 1079718 NON_KMER_UNIQUE
chr1    1137942 1140725 NON_KMER_UNIQUE

Panel of Normals Files

PON Metrics File

The DRAGEN CNV Pipeline generates the PON Metrics File (.cnv.pon_metrics.tsv.gz) if a Panel of Normals is provided and --cnv-generate-pon-metric-file is set to true. If PON size is less than 2, then an empty file will be generated.

The PON Metric File includes basic statistics of the coverage profile for each interval. To remove sample coverage bias, DRAGEN applies sample median normalization, and then computes the following metrics:

Example:

contig  start   stop    name    mean    std     normalizedStd min     25%     50%     75%     max     intervalSize    gcContents
1       12098   12178   target-wes-1-12098:12178/1      3.6259044560802365      0.46661435469856077      0.1286890927079175     2.7961783439490446      3.2573018790849675      3.7105263157894739      4.0162683823529415      4.3298969072164946      80      0.49382716049382713
1       12178   12258   target-wes-1-12178:12258/2      5.0685579775753595      0.70638315915955963      0.13936570564740217     3.9044585987261144      4.5225944682508761      5.067708333333333       5.5778115844038769      6.3277777777777775      80      0.46913580246913578
1       12553   12637   target-wes-1-12553:12637/1      4.6990858287992054      0.62537786269786677      0.13308500535681309     3.7417218543046356      4.0305632538350444      5.0382165605095546      5.2151580459770113      5.5773195876288657      84      0.6705882352941176

PON Correlation File

The DRAGEN CNV Pipeline generates the PON Correlation File (.cnv.pon_correlation.txt.gz) if a Panel of Normals is provided. The PON Correlation File includes correlation between CASE sample and each PON sample.

Example:

Correlation of case sample CASE_SAMPLE_NAME
  PON1: 0.9786
  PON2: 0.9868
  PON3: 0.9912
  ...

SegDups Extension Files

The SegDups extension provides intermediate and final outputs. All intervals follow the bed format (0-based, start inclusive, end exclusive) and they are in tab-delimited text files (gzip compressed).

The final output has extension .cnv.segdups.rescued_intervals.tsv.gz, and contains the rescued target intervals which can then be injected before segmentation. It has columns:

  1. Chromosome name

  2. Start - 0-based inclusive

  3. Stop - 0-based exclusive

  4. Target interval name prefixed with "target-wgs-"

  5. Sample Counts (in header, identifier taken from RGSM) - log2-scale normalized counts for each interval

  6. Improper pairs - Kept for compatibility with CNV workflow, set to 0 for rescued intervals

  7. Target region ID - ID of the target region (aka pair of rescued target intervals)

Intermediate files

The joint normalized coverage profile (log2-scale) for each region is provided in output to file .cnv.segdups.joint_coverage.tsv.gz with columns:

  1. Target region ID

  2. Joint normalized coverage (log2-scale) of the two intervals in the target region

  3. Copy Number Float - estimate of joint copy number for the target region (e.g., CNF ~ 4)

The differentiating sites' data is provided in output to file .cnv.segdups.site_ratios.tsv.gz with columns:

  1. Differentiating site name

  2. Target (gene A) counts at site

  3. Non-target (gene B) counts at site

  4. Target ratio: gene A counts over total (i.e., gene A + gene B) counts at site

Last updated