DRAGEN emits the calls in the standard VCF format. By default, for analyses other than somatic WGS, the VCF file includes only copy number gain and loss events (and LOH events, where allele-specific copy number is available). To include copy neutral (REF) calls in the output VCF, set --cnv-enable-ref-calls to true.
For more information on how to use the output files to aid debugging and analysis, see Signal Flow Analysis.
File extension: *.cnv.vcf.gz
The CNV VCF file follows the standard VCF format. Because CNV events are represented differently from structural variants, not all fields are applicable. In general, if more information is available about an event, then the information is annotated. Some fields in the DRAGEN CNV VCF are unique to CNVs. The VCF header is annotated with ##source=DRAGEN_CNV to indicate that the file is generated by the DRAGEN CNV pipeline.
The following is an example of some of the header lines that are specific to CNV:
The following header lines are specific to somatic WGS CNV calling:
ModelSource
The primary basis on which the final tumor model was chosen. The following values can be included:
DEPTH+BAF: Depth+BAF signal is used to determine tumor model.
DEPTH+BAF_DOUBLED: The initial depth+BAF model is duplicated based on VAF signal or excess segments at half the expected depth change.
DEPTH+BAF_DEDUPLICATED: The depth+BAF model is deduplicated based on VAF signal or insufficient segments supporting a duplication.
DEPTH+BAF_WEAK: Depth+BAF signal is used to determine lower-confidence tumor model.
VAF: VAF signal is used to determine tumor model due to insufficient depth+BAF signal.
DEGENERATE_DIPLOID: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. The diploid coverage is set to lowest value observed in a substantial number of bases in segments with BAF=50%.
SAMPLE_MEDIAN: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. Diploid coverage set to sample median.
EstimatedTumorPurity
Estimated fraction of cells in the sample due to tumor. The range of this field is [0, 1], or NA if a confident model could not be determined.
DiploidCoverage
Expected read count for a target bin in a diploid region. The numeric value is unlimited.
OverallPloidy
Length weighted average of tumor copy number for PASS events. The numeric value is unlimited.
AlternativeModelDedup
An alternative to the best model corresponding to one less whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation if the best model might involve a spurious genome duplication.
AlternativeModelDup
An alternative to the best model corresponding to one more whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation where the best model might have missed a true genome duplication.
OutlierBafFraction
A QC metric that measures the fraction of b-allele frequencies that are incompatible with the segment the BAFs belong to. High values might indicate a mismatched normal, substantial cross-sample contamination, or a different source of a mosaic genome, such as bone marrow transplantation. The range of this field is [0, 1].
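The OverallPloidy definition above (a length-weighted average of tumor copy number over PASS events) can be illustrated with a short Python sketch. The function name, the tuple layout, and the example values are hypothetical, and the sketch does not attempt to reproduce which records DRAGEN actually includes in the average.

```python
def overall_ploidy(events):
    """Length-weighted mean copy number over PASS events.

    `events` is an iterable of (length_bp, copy_number, filter_value) tuples.
    Returns None when there is no PASS event to average.
    """
    total_length = 0
    weighted_sum = 0.0
    for length_bp, copy_number, filter_value in events:
        if filter_value != "PASS":
            continue  # filtered events do not contribute
        total_length += length_bp
        weighted_sum += length_bp * copy_number
    if total_length == 0:
        return None
    return weighted_sum / total_length


# Made-up events for illustration only.
example_events = [
    (1_000_000, 2, "PASS"),
    (500_000, 4, "PASS"),
    (250_000, 1, "cnvQual"),  # ignored because it is not PASS
]
print(overall_ploidy(example_events))
```

Longer events dominate the result, which is why a few short high-copy amplifications change the value only slightly.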
All coordinates in the VCF are 1-based.
The CHROM column specifies the chromosome (or contig) on which the copy number variant being described occurs.
The POS column is the start position of the variant. According to the VCF specification, if any of the ALT alleles is a symbolic allele, such as <DEL>, then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism.
The ID column is used to represent the event. The ID field encodes the event type and coordinates of the event (1-based, inclusive). In addition to representing GAIN, LOSS, and REF events, in Somatic WGS CNV the ID can include the Copy Neutral Loss of Heterozygosity (CNLOH) or Copy Number Gain with LOH (GAINLOH) events.
The REF column contains an N for all CNV events.
The ALT column specifies the type of CNV event. Because the VCF contains only CNV events, only the <DEL> or <DUP> entries are used. If REF calls are emitted, their ALT is always the missing value (.). In Somatic WGS CNV, the ALT field can contain two alleles, such as <DEL>,<DUP>, which allows representation of allele-specific copy numbers if they differ in copy number states.
The QUAL column contains an estimated quality score for the CNV event, which is used in hard filtering. Each CNV workflow has different defaults and the value used can be found in the VCF header.
The FILTER column contains PASS if the CNV event passes all filters; otherwise, the column contains the name of the failed filter. Default values are defined in the header line for each available FILTER.
Available FILTERs:
cnvLength: indicates that the length of the CNV is lower than a threshold.
cnvQual: indicates that the QUAL of the CNV is lower than a threshold.
Germline CNV has the following additional FILTERs:
cnvCopyRatio: indicates that the segment mean of the CNV is not far enough from copy neutral.
Both Germline CNV workflows have the following additional FILTERs:
dinucQual: indicates a CNV call where some of its dinucleotide percentages are outside typical ranges, and thus the call is likely to be a false positive.
Germline WGS CNV has the following additional FILTERs:
cnvBinSupportRatio: indicates that, for CNVs greater than 80 kb, the percent span of supporting target intervals is lower than a threshold.
highCN: indicates a CNV call with implausible copy number (>6).
Germline WES CNV has the following additional FILTERs:
cnvLikelihoodRatio: indicates that the log10 likelihood ratio of ALT to REF is less than a threshold.
Both Somatic CNV workflows have the following additional FILTERs:
binCount: filters CNV events with a bin count lower than a threshold.
Somatic WGS CNV has the following additional FILTERs:
lengthDegenerate: marks records as non-PASSing based on each record's length (REFLEN) when the caller returns the default model. Segments shorter than 1 Mb are assigned this filter when returning the default model.
segmentMean: marks records as non-PASSing based on each record's segment mean (SM) when the caller returns the default model. Segments having insufficient SM in DELs or DUPs are assigned this filter when returning the default model.
Somatic WES CNV has the following additional FILTERs:
SqQual: marks records as non-PASSing based on each record's somatic quality (SQ) when the caller returns the default model. Segments having insufficient SQ are assigned this FILTER when returning the default model. SQ is the somatic quality value, a Phred-scaled score of the p-value from a two-sample t-test comparing normalized counts of CASE vs PON.
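Because SQ is described as a Phred-scaled p-value, the conventional Phred transform Q = -10 * log10(p) gives a feel for the scale. The Python sketch below shows only that standard conversion; the function names and the floor parameter are hypothetical, and the sketch makes no claim about how DRAGEN computes the underlying t-test or caps the score.

```python
import math


def phred_from_p_value(p_value, floor=1e-300):
    """Standard Phred scaling, Q = -10 * log10(p).

    `floor` avoids log10(0) for p-values that underflow to zero.
    """
    return -10.0 * math.log10(max(p_value, floor))


def p_value_from_phred(quality):
    """Inverse transform, p = 10 ** (-Q / 10)."""
    return 10.0 ** (-quality / 10.0)


for p in (0.05, 0.001, 1e-6):
    print(p, round(phred_from_p_value(p), 2))
```

On this scale, every additional 10 points corresponds to a tenfold smaller p-value.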
The INFO column contains information representing the event.
REFLEN indicates the length of the event.
SVLEN is a signed representation of REFLEN (e.g., a negative value indicates a deletion), and it is only present for non-REF records.
SVTYPE is always CNV and is only present for non-REF records.
END indicates the end position of the event (1-based, inclusive).
If using a segment BED file, then the segment identifier is carried over from the input to the SEGID field.
In Somatic WGS CNV, the INFO column can also contain the HET tag when the call is considered sub-clonal. See HET-Calling Mode.
When matching CNV with SV output, additional INFO annotations are added. See CNV With SV Support.
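As a rough illustration of how the columns and INFO keys described above fit together, the following Python sketch splits one VCF data line and pulls out END, REFLEN, SVLEN, and the HET flag. The record is made up for this example (including its ID value, which does not follow the real DRAGEN ID encoding), and the helper names are hypothetical.

```python
def parse_info(info_field):
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, separator, value = item.partition("=")
        info[key] = value if separator else True
    return info


def event_summary(vcf_line):
    """Summarize the first eight columns of one CNV VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _event_id, _ref, alt, _qual, filter_value, info_field = fields[:8]
    info = parse_info(info_field)
    return {
        "chrom": chrom,
        "pos": int(pos),                # padding base when ALT is symbolic
        "end": int(info["END"]),        # 1-based, inclusive
        "reflen": int(info["REFLEN"]),
        "svlen": int(info["SVLEN"]) if "SVLEN" in info else None,  # absent for REF
        "is_ref_call": alt == ".",
        "sub_clonal": "HET" in info,
        "filter": filter_value,
    }


# Hypothetical record, not real DRAGEN output.
record = "chr1\t1000\texample_id\tN\t<DEL>\t50\tPASS\tSVTYPE=CNV;END=5000;REFLEN=4000;SVLEN=-4000"
print(event_summary(record))
```

For production use, an established VCF parser is usually a better choice than manual string splitting.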
The common FORMAT fields are described in the header:
Germline WGS CNV includes the following FORMAT fields:
Germline WES CNV includes the following FORMAT fields:
Somatic WGS CNV and Somatic WES CNV with ASCN (allele-specific copy number) support include the following FORMAT fields:
Somatic WES CNV without ASCN support provides only the common FORMAT fields and does not include the CN entry, because it does not estimate the tumor purity fraction and therefore cannot estimate the copy number.
Because germline copy number calling determines overall copy number rather than the copy number on each haplotype, the genotype (GT) field contains missing values for diploid regions when CN is greater than or equal to 2. The following are examples of the GT field for various VCF entries:
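The same GT convention can also be written as a small lookup. The Python sketch below is illustrative only: the function name germline_gt and its diploid flag are hypothetical, and it covers just the ploidy and copy-number combinations documented for germline calls.

```python
def germline_gt(copy_number, diploid=True):
    """Return the GT string for a germline CNV record.

    Diploid regions: CN=0 -> 1/1, CN=1 -> 0/1, CN=2 -> ./., CN>2 -> ./1
    Haploid regions: CN=1 -> 0 (reference), any other CN -> 1
    """
    if copy_number < 0:
        raise ValueError("copy number cannot be negative")
    if diploid:
        if copy_number == 0:
            return "1/1"
        if copy_number == 1:
            return "0/1"
        if copy_number == 2:
            return "./."
        return "./1"  # duplication: haplotype assignment is unknown
    return "0" if copy_number == 1 else "1"


for cn in (0, 1, 2, 3):
    print("diploid", cn, germline_gt(cn))
```

The missing alleles for CN of 2 or more in diploid regions reflect that the caller cannot tell how the copies are distributed across haplotypes.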
The DRAGEN CNV pipeline provides a measure of the quality of the data for a sample. If using the WGS self-normalization method, the additional CoverageUniformity metric is present in the VCF header. The metric is only available for germline samples. The CNV pipeline assumes that post-normalization target counts are independently and identically distributed (IID). Coverage in most high-quality WGS samples is uniform enough for the CNV caller to produce accurate calls, but some samples violate the IID assumption. Issues during library preparation or sample contamination can lead to several extreme outliers and/or waviness of target counts, which can result in a large number of false positive CNV calls. The CoverageUniformity metric quantifies the degree of local coverage correlation in the sample to help identify poor-quality samples.
A larger value for this metric means the coverage in a sample is less uniform, which indicates that the sample has more nonrandom noise and could be considered poor quality. The CoverageUniformity metric depends on factors other than sample quality, such as the cnv-interval-width setting and sample mean coverage. DRAGEN recommends using this score to compare the quality of samples with similar mean coverage that were run with the same command-line options. For this reason, DRAGEN CNV only provides the metric and does not take any action based on it.
DRAGEN CNV outputs metrics in CSV format. The output follows the general convention for QC metrics reporting in DRAGEN. The CNV metrics are output to a file with a *.cnv_metrics.csv file extension. The following list summarizes the metrics that are output from a CNV run.
Sex Genotyper:
The estimated sex of the case sample and of all panel of normals samples is reported. For WGS workflows, the estimated sex karyotype is reported; for non-WGS workflows, the gender is reported.
Confidence score (ranging from 0.0 to 1.0). If the sample sex is specified, this metric is 0.0.
CNV Summary:
Bases in reference genome in use
Average alignment coverage over genome
Number of alignment records processed
Number of filtered records (total)
Number of filtered records (due to duplicates)
Number of filtered records (due to MAPQ)
Number of filtered records (due to being unmapped)
Coverage MAD
Median Bin Count
Number of target intervals
Number of normal samples
Number of segments
Number of amplifications
Number of deletions
Number of PASS amplifications
Number of PASS deletions
Coverage MAD and Median Bin Count are only printed for WES germline/somatic CNV. Coverage MAD is the median absolute deviation of normalized case counts. Higher values indicate noisier sample data (poor quality). Median Bin Count is the median of raw counts normalized by interval size.
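For readers unfamiliar with the statistic, a median absolute deviation can be computed as in the following Python sketch. It shows the textbook definition on made-up normalized counts; whether DRAGEN applies any additional scaling to its Coverage MAD is not stated here, so the numbers are not expected to match the reported metric.

```python
import statistics


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median (unscaled)."""
    values = list(values)
    center = statistics.median(values)
    return statistics.median(abs(value - center) for value in values)


# Made-up normalized counts: a quiet sample and a noisy one.
quiet_sample = [0.98, 1.00, 1.02, 0.99, 1.01]
noisy_sample = [0.70, 1.00, 1.30, 0.85, 1.20]
print(median_absolute_deviation(quiet_sample))
print(median_absolute_deviation(noisy_sample))
```

Unlike the standard deviation, this statistic is barely affected by a handful of extreme bins, which makes it a reasonable summary of overall noise.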
Intermediate stages of the pipeline produce various intermediate output files. These files can be useful for visualization of the evidence or results from each stage, and may aid in fine-tuning options.
All files have a structure similar to a BED file with optional header line(s).
The file *.target.counts.gz is a compressed tab-delimited text file that contains the number of read counts per target interval. This is the raw signal as extracted from the alignments of the BAM or CRAM file. The format is identical for both the case sample and any panel of normals samples. There is also a bigWig representation of a target.counts.diploid file, which is normalized to the normal ploidy level of 2 instead of raw counts.
It has the following columns:
Contig identifier
Start position
End position
Target interval name
Count of alignments in this interval
Count of improperly paired alignments in this interval
Header lines that start with # are also included. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.
An example of a *.target.counts.gz file is shown below.
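Independently of the file example, the six-column layout described above can be read with the Python standard library alone. The sketch below is a minimal reader under stated assumptions: lines starting with # are meta information, and any row whose second column is not an integer (for instance a column-title row, if one is present) is skipped. The function name and the demo content are hypothetical and written to a temporary file only to exercise the reader.

```python
import csv
import gzip
import os
import tempfile


def read_interval_table(path):
    """Yield (contig, start, end, name, value, improper_pairs) per interval row."""
    with gzip.open(path, "rt", newline="") as handle:
        # Drop the "#" meta-information lines before handing rows to csv.
        data_lines = (line for line in handle if not line.startswith("#"))
        for row in csv.reader(data_lines, delimiter="\t"):
            # Skip blank lines and any column-title row (assumed possible).
            if len(row) < 6 or not row[1].isdigit():
                continue
            yield row[0], int(row[1]), int(row[2]), row[3], float(row[4]), float(row[5])


# Made-up content, not real DRAGEN output.
demo_text = (
    "#hypothetical meta-information line\n"
    "contig\tstart\tstop\tname\tSAMPLE\timproper_pairs\n"
    "chr1\t0\t1000\ttarget-1\t250\t3\n"
    "chr1\t1000\t2000\ttarget-2\t310\t0\n"
)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.target.counts.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write(demo_text)
    intervals = list(read_interval_table(demo_path))

print(intervals[0])
```

The fifth column is parsed as a float so that the same reader can be reused for files that share this layout but hold non-integer values.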
B-allele counts are calculated at sites in the tumor sample where the normal sample is likely to be heterozygous. When analyzed in conjunction with a matched normal sample, the sites are those that are called as heterozygous SNVs in the normal sample. When analyzed in tumor-only mode, they are taken from a collection of sites that have high-frequency SNVs in the population. Each B-allele site consists of a reference allele and a variant allele, and the number of reads in the tumor sample supporting each of these alleles is counted.
B-allele counts are written both to a gzipped TSV file, *.ballele.counts.gz, and to a gzipped bedgraph file, *.baf.bedgraph.gz.
The tsv file format is the following:
Contig identifier
Start, BED-style (zero-based inclusive) start position of the reference allele
Stop, BED-style (zero-based inclusive) stop position of the reference allele
Base sequence for the reference allele
Base sequence for the first allele being counted
Base sequence for the second allele being counted
The number of qualified reads containing a sequence matching the first allele
The number of qualified reads containing a sequence matching the second allele
Additionally, in the case of B-allele sites from a population VCF, the following two additional columns are added after the columns listed above:
Population frequency for the first allele
Population frequency for the second allele
An example of B-allele counts file is provided below:
The bedgraph file format is similar to the BED format and it has the following columns:
Contig identifier
Start
Stop
Ratio of allele counts
The numerator and denominator of the ratio are determined by sorting the allele counts according to the priority of the corresponding bases. The order of the bases in descending priority is {A, T, G, C}.
When the priority of allele1 is higher than the priority of allele2, the output frequency is calculated by:
When the priority of allele2 is higher than the priority of allele1, the output frequency is calculated by:
By prioritizing the bases in this way, the output frequencies will be deterministically distributed in a roughly equal proportion above and below 0.5. When plotting these B-allele frequencies (e.g., in IGV), this gives an easy way to visually determine significant changes in b-allele frequency between neighboring segments of the genome. It also provides a similar visualization to that typically used for array data.
An example of the bedgraph file is shown below:
The file *.target.counts.gc-corrected.gz contains the number of GC-corrected read counts per target interval. The format is equivalent to the *.target.counts.gz file:
Contig identifier
Start position
End position
Target interval name
GC-corrected read counts in this interval
Count of improperly paired alignments in this interval
Header lines that start with # are also included. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.
An example of a *.target.counts.gc-corrected.gz file is shown below.
The file *.combined.counts.txt.gz is a column-wise concatenation of the individual *.target.counts.gz and *.target.counts.gc-corrected.gz files used to form the panel of normals.
The file *.tn.tsv.gz contains the normalized signal of the case sample, per target interval, i.e., the log2-normalized copy ratio signal. A strong signal deviation from 0.0 indicates a potential CNV event. The format is equivalent to the *.target.counts.gz file:
Contig identifier
Start position
End position
Target interval name
Log2-normalized read counts in this interval
Count of improperly paired alignments in this interval
Header lines that start with # are also included.
An example of a *.tn.tsv.gz file is shown below.
File extension: *.seg, *.seg.called, *.seg.called.merged
These files contain the segments produced by the segmentation algorithm. The Segment_Mean value of a segment is the ratio of the mean of that segment to the whole-sample median, without log transformation (linear copy-ratio). A strong signal deviation from 1.0 indicates a potential CNV event.
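The per-interval signal in *.tn.tsv.gz is on a log2 scale (neutral at 0.0), whereas Segment_Mean is linear (neutral at 1.0). When comparing the two, a conversion such as the following Python sketch can help. The function names are hypothetical, and the example ratios assume a pure diploid sample in which the copy ratio equals the copy number divided by 2.

```python
import math


def log2_to_linear(log2_ratio):
    """Convert a log2 copy ratio to a linear copy ratio."""
    return 2.0 ** log2_ratio


def linear_to_log2(linear_ratio):
    """Convert a linear copy ratio (> 0) to a log2 copy ratio."""
    return math.log2(linear_ratio)


# Illustrative values for a pure diploid sample: copy ratio = CN / 2.
for copy_number in (1, 2, 3):
    linear = copy_number / 2
    print(copy_number, linear, round(linear_to_log2(linear), 3))
```

In samples with low tumor purity, the observed ratios move toward the neutral value, so the same copy-number change produces a smaller deviation.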
The *.seg file has the following columns:
Sample name
Contig identifier
Start position
End position
Number of intervals in the segment
Linear copy-ratio of the segment
An example of a *.seg file is shown below.
The *.seg.called file is identical to the *.seg file, with an additional column indicating the initial call for whether the segment is a duplication (+) or a deletion (-).
The *.seg.called.merged file is identical to the *.seg.called file, but with segments potentially merged when they meet internal merging criteria. In addition to the columns described above, this file also has the following columns:
QUAL
FILTER
Copy number assignment
Ploidy
Improper_Pairs count
In addition to segmentation of target counts, some workflows perform segmentation of B-allele loci. The output file has the suffix *.baf.seg and the same format as the *.seg file, with two modifications. First, the Segment_Mean value is the mean over B-allele loci of the smaller observed allele fraction. Second, there is an additional column:
BAF_SLM_STATE: Integer between 0 and 10, indicating bins of minor-allele fraction (low to high), or . when the BAF data are too variable to estimate a minor-allele fraction.
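One plausible reading of "the mean over B-allele loci of the smaller observed allele fraction" is sketched below in Python: take the smaller of the two allele fractions at each site, then average over the sites in the segment. This is an interpretation for illustration, not DRAGEN's implementation; the function names and counts are hypothetical.

```python
def smaller_allele_fraction(count1, count2):
    """Smaller of the two observed allele fractions at one site."""
    total = count1 + count2
    if total == 0:
        return None  # no qualified reads at this site
    return min(count1, count2) / total


def mean_smaller_allele_fraction(sites):
    """Average the per-site smaller allele fraction over (count1, count2) pairs."""
    fractions = [
        fraction
        for fraction in (smaller_allele_fraction(a, b) for a, b in sites)
        if fraction is not None
    ]
    return sum(fractions) / len(fractions) if fractions else None


# Made-up allele counts for three sites in one segment.
segment_sites = [(10, 30), (25, 25), (0, 0)]
print(mean_smaller_allele_fraction(segment_sites))
```

Because the smaller fraction is always at most 0.5, sampling noise alone pulls this mean below 0.5 even in a balanced heterozygous segment.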
An example of segmentation output file is shown below:
The file *.cnv.purity.coverage.models.tsv describes the different tested models and their log-likelihood. It has the following columns:
Model purity (Cellularity)
Model diploid coverage
Model log-likelihood
An example is shown below:
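When inspecting this file by hand, it can be convenient to rank the tested models by log-likelihood. The Python sketch below does this for tab-separated text with the three columns listed above. The rows are made-up values rather than DRAGEN output, the header row is an assumption, and the highest-likelihood row is not necessarily the model DRAGEN finally reports, because the final choice also depends on the logic summarized under ModelSource.

```python
import csv
import io


def best_model(tsv_text):
    """Return (purity, diploid_coverage, log_likelihood) with the highest likelihood."""
    best = None
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        try:
            purity, coverage, log_likelihood = (float(value) for value in row[:3])
        except ValueError:
            continue  # header, comment, or malformed row
        if best is None or log_likelihood > best[2]:
            best = (purity, coverage, log_likelihood)
    return best


# Hypothetical table content for illustration only.
models_text = (
    "purity\tdiploid_coverage\tlog_likelihood\n"
    "0.40\t95.0\t-1200.5\n"
    "0.65\t110.0\t-980.2\n"
    "0.80\t60.0\t-1500.0\n"
)
print(best_model(models_text))
```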
To generate additional equivalent bigWig and GFF files, set the --cnv-enable-tracks option to true. These files can be loaded into IGV along with other tracks that are available, such as RefSeq genes. Using these tracks alongside publicly available tracks allows for easier interpretation of calls. DRAGEN autogenerates an IGV session XML file if tracks are generated by DRAGEN CNV. The *.cnv.igv_session.xml file can be loaded directly into IGV for analysis.
The following IGV tracks are automatically populated in the output IGV session file:
*.target.counts.bw: BigWig representation of the target counts bins. Setting the track view in IGV to barchart or points is recommended. Values are GC-corrected if GC correction was performed.
*.improper_pairs.bw: BigWig representation of the improper pairs counts. Setting the track view in IGV to barchart is recommended.
*.tn.bw: BigWig representation of the tangent normalized signal. Setting the track view in IGV to points is recommended.
*.seg.bw: BigWig representation of the segments. Setting the track view in IGV to points is recommended.
*.baf.bedgraph.gz: Bedgraph representation of B-allele frequency (if available). Setting the track view in IGV to points is recommended.
*.cnv.gff3: GFF3 representation of the CNV events. DEL events show as blue and DUP events show as red. Filtered events are light gray. If REF events are enabled, they show as green. An example of DRAGEN CNV GFF3 is shown below (different CNV workflows might output different attributes in the 9th column):
For somatic WGS analyses, the following additional files are included in the IGV session xml:
*.baf.seq.bw: BigWig representation of the BAF segments. Setting the track view in IGV to points is recommended.
*.tumor.baf.bedgraph.gz: Bedgraph representation of the B-allele frequencies. Setting the track view in IGV to points and the windowing function to none is recommended.
File extension: *.igv_session.xml
The IGV session XML file is prepopulated with track files generated by DRAGEN. The session file loads the reference genome that best matches the standard reference genomes in an IGV installation, by comparing the name of the --ref-dir specified on the command line. Standard UCSC human reference genomes are autodetected, but any variations from the standard reference genomes might not be autodetected. To edit the genome detection, alter the genome attribute in the Session element to the reference genome you would like for analysis before loading into IGV. The reference identifier used by IGV might differ from the actual name of the genome. The following is an example edited session file.
Depending on the version, an IGV installation may come prepackaged with different flavors of GRCh37. The reference naming conventions have changed, so you might have to edit the genome field in the XML file directly. For example, IGV has traditionally packaged a b37 reference genome, but may also include a 1kg_v37 or a 1kg_b37+decoy, which appear in the IGV user interface as "1kg, b37" or "1kg, b37+decoy", respectively.
You can determine the correct encoding of a reference genome by going to File > Save Session... and then inspecting the generated igv_session.xml file.
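The genome attribute can be edited in any text editor, or programmatically as in the following Python sketch, which uses the standard xml.etree.ElementTree module. The session snippet is a stripped-down, hypothetical stand-in for a real session file, and the function name is an assumption for this example.

```python
import xml.etree.ElementTree as ET


def set_session_genome(xml_text, genome_id):
    """Return the session XML with the Session element's genome attribute replaced."""
    root = ET.fromstring(xml_text)
    if root.tag != "Session":
        raise ValueError("expected a <Session> root element")
    root.set("genome", genome_id)
    return ET.tostring(root, encoding="unicode")


# Minimal made-up session content; real files contain track resources.
session_xml = '<Session genome="hg19" version="8"><Resources/></Session>'
edited_xml = set_session_genome(session_xml, "1kg_v37")
print(edited_xml)
```

Note that ET.tostring with encoding="unicode" does not emit an XML declaration, so add one back when writing the file if your workflow requires it.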
DRAGEN CNV outputs can be ingested using third-party libraries in commonly used languages such as Python and R. The typically used files are:
*.target.counts.gz or *.target.counts.gc-corrected.gz, containing the number of alignments, or corrected alignments, per interval. Used to display the coverage profile across all intervals.
*.tn.tsv.gz, containing the log2-normalized copy ratio per interval.
*.baf.bedgraph.gz, if BAF is available, containing the BAF for each considered site. Used to display the BAF profile across all sites.
In all previously specified files, the format is similar to BED, allowing them to be loaded as any other tab-separated files.
Using R, a good starting point is the karyoploteR package. The main workflow involves reading the *.target.counts.gz file as an R dataframe, converting it to a GRanges object, and then plotting the target intervals as points with the karyoploteR package. The same workflow can be used to plot the GC-corrected counts, the log2-normalized copy ratios, and the BAF profiles.
Using Python, the workflow is similar to R's but using Python's libraries such as pandas, to convert DRAGEN output files to dataframe, and matplotlib, to plot coverage and BAF profiles across the genome.
A similar workflow can be used to plot copy number calls (and minor copy number calls, if available) by using the *.cnv.gff3 output file. Some examples of DRAGEN output GFF3 are shown below:
Germline WGS
Somatic WGS
From the output GFF3, the typical steps are to parse the coordinates and the CopyNumber annotation of each segment (or any other annotation you want to plot), and then to plot them using the libraries listed previously for coverage/BAF profiles (or any other library and language of your choice).
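The parsing step described above can be sketched in Python as follows. The sketch relies only on the generic GFF3 layout (nine tab-separated columns, key=value attributes in the ninth) and on the CopyNumber attribute named in the text. The example line, including its source, type, and ID values, is made up, and the helper names are hypothetical.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Parse the GFF3 attributes column into a dict of decoded strings."""
    attributes = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    return attributes


def copy_number_segments(gff3_lines):
    """Return (seqid, start, end, copy_number) for records carrying CopyNumber."""
    segments = []
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue
        attributes = parse_gff3_attributes(columns[8])
        if "CopyNumber" not in attributes:
            continue
        # GFF3 start and end are 1-based and inclusive.
        segments.append(
            (columns[0], int(columns[3]), int(columns[4]), float(attributes["CopyNumber"]))
        )
    return segments


# Hypothetical GFF3 content for illustration only.
gff3_example = [
    "##gff-version 3\n",
    "chr1\tDRAGEN\tCNV\t1001\t5000\t.\t.\t.\tID=example;CopyNumber=1\n",
]
print(copy_number_segments(gff3_example))
```

The resulting tuples can be drawn as horizontal line segments with any plotting library, one segment per call.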
To improve accuracy, the DRAGEN CNV Pipeline excludes genomic intervals if one or more of the target intervals failed at least one quality requirement. The excluded intervals are reported in the *.excluded_intervals.bed.gz file. The file is in BED format, identifies the regions of the genome that are not callable for CNV analysis, and describes the reason each interval was excluded in the fourth column. The following are the possible reasons for exclusion.
An example of a *.excluded_intervals.bed.gz file is shown below:
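To get a quick overview of how much of the genome was excluded and why, the intervals can be summed per reason. The Python sketch below assumes standard half-open BED coordinates (so length is stop minus start) and a single reason string in the fourth column; the lines are made-up examples and the function name is hypothetical.

```python
from collections import Counter


def excluded_bases_by_reason(bed_lines):
    """Sum interval lengths per exclusion reason (fourth BED column)."""
    totals = Counter()
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue
        _contig, start, stop, reason = line.rstrip("\n").split("\t")[:4]
        totals[reason] += int(stop) - int(start)
    return totals


# Hypothetical file content for illustration only.
excluded_example = [
    "chr1\t0\t1000\tEXCLUDE_BED\n",
    "chr1\t5000\t5500\tEXCLUDE_BED\n",
    "chr2\t100\t400\tNON_KMER_UNIQUE\n",
]
print(excluded_bases_by_reason(excluded_example))
```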
The DRAGEN CNV Pipeline generates the PON Metrics File (.cnv.pon_metrics.tsv.gz) if a Panel of Normals is provided and --cnv-generate-pon-metric-file is set to true. If the PON size is less than 2, an empty file is generated.
The PON Metrics File includes basic statistics of the coverage profile for each interval. To remove sample coverage bias, DRAGEN applies sample median normalization, and then computes the following metrics:
Example:
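As a rough sketch of the computation described above (not DRAGEN output and not its implementation), the following Python code normalizes each PON sample by its own median and then computes a subset of the per-interval statistics. The function name and input layout are hypothetical, and the use of the sample standard deviation is an assumption because the estimator is not specified.

```python
import statistics


def pon_interval_metrics(coverage_by_sample):
    """Per-interval statistics across PON samples after median normalization.

    coverage_by_sample: one list per PON sample, each holding one coverage
    value per interval, with intervals in the same order in every sample.
    """
    if len(coverage_by_sample) < 2:
        return []  # mirrors the empty file produced for a PON smaller than 2
    normalized = []
    for sample in coverage_by_sample:
        sample_median = statistics.median(sample)
        if sample_median == 0:
            raise ValueError("cannot normalize a sample whose median coverage is 0")
        normalized.append([value / sample_median for value in sample])
    metrics = []
    for values in zip(*normalized):  # one tuple of sample values per interval
        mean = statistics.fmean(values)
        std = statistics.stdev(values)  # sample std; an assumption
        metrics.append({
            "mean": mean,
            "std": std,
            "normalizedStd": std / mean if mean else float("nan"),
            "min": min(values),
            "50%": statistics.median(values),
            "max": max(values),
        })
    return metrics


# Made-up coverage values: three samples, three intervals.
pon_coverage = [[100, 200, 300], [50, 100, 150], [120, 200, 260]]
pon_metrics = pon_interval_metrics(pon_coverage)
print(pon_metrics[0])
```

Intervals with a large normalizedStd vary strongly between normal samples and are therefore less reliable as a baseline.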
The DRAGEN CNV Pipeline generates the PON Correlation File (.cnv.pon_correlation.txt.gz) if a Panel of Normals is provided. The PON Correlation File includes the correlation between the CASE sample and each PON sample.
Example:
The SegDups extension provides intermediate and final outputs. All intervals follow the bed format (0-based, start inclusive, end exclusive) and they are in tab-delimited text files (gzip compressed).
The final output has the extension .cnv.segdups.rescued_intervals.tsv.gz and contains the rescued target intervals, which can then be injected before segmentation. It has the following columns:
Chromosome name
Start - 0-based inclusive
Stop - 0-based exclusive
Target interval name prefixed with "target-wgs-"
Sample Counts (in header, identifier taken from RGSM) - log2-scale normalized counts for each interval
Improper pairs - Kept for compatibility with CNV workflow, set to 0 for rescued intervals
Target region ID - ID of the target region (aka pair of rescued target intervals)
The joint normalized coverage profile (log2-scale) for each region is written to the file .cnv.segdups.joint_coverage.tsv.gz, which has the following columns:
Target region ID
Joint normalized coverage (log2-scale) of the two intervals in the target region
Copy Number Float - estimate of joint copy number for the target region (e.g., CNF ~ 4)
The differentiating sites' data is written to the file .cnv.segdups.site_ratios.tsv.gz, which has the following columns:
Differentiating site name
Target (gene A) counts at site
Non-target (gene B) counts at site
Target ratio: gene A counts over total (i.e., gene A + gene B) counts at site
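The target ratio in the last column follows directly from the two count columns, as in this minimal Python sketch. The function name is hypothetical, and returning None for a site without reads is a choice made for this example rather than documented DRAGEN behavior.

```python
def target_ratio(target_count, non_target_count):
    """Gene A counts over total (gene A + gene B) counts at one site."""
    total = target_count + non_target_count
    if total == 0:
        return None  # ratio is undefined without any reads at the site
    return target_count / total


# Made-up counts at three differentiating sites.
for gene_a, gene_b in [(30, 10), (20, 20), (0, 40)]:
    print(gene_a, gene_b, target_ratio(gene_a, gene_b))
```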
Common FORMAT fields:

ID | Description |
---|---|
GT | Genotype |
SM | Linear copy ratio of the segment mean |
CN | Estimated copy number |
BC | Number of bins in the region |
PE | Number of improperly paired end reads at start and stop breakpoints |

Germline WGS CNV FORMAT fields:

ID | Description |
---|---|
GC | GC dinucleotide percentage |
CT | CT dinucleotide percentage |
AC | AC dinucleotide percentage |

Germline WES CNV FORMAT fields:

ID | Description |
---|---|
LR | Log10 likelihood ratio of ALT to REF |

Somatic WGS CNV and Somatic WES CNV with ASCN FORMAT fields:

ID | Description |
---|---|
AS | Number of allelic read count sites |
BC | Number of read count bins |
CN | Estimated total copy number in tumor fraction of sample. This field is not present if the model cannot be estimated with high confidence. |
CNF | Floating point estimate of tumor copy number. This field is not present if the model cannot be estimated with high confidence. |
CNQ | Exact total copy number Q-score. This field is not present if the model cannot be estimated with high confidence. |
MAF | Maximum estimate of the minor allele frequency |
MCN | Estimated minor-haplotype copy number. This field is not present if the model cannot be estimated with high confidence. |
MCNF | Floating point estimate of tumor minor-haplotype copy number. This field is not present if the model cannot be estimated with high confidence. |
MCNQ | Minor copy number Q-score. This field is not present if the model cannot be estimated with high confidence. |
NCN | Normal-sample copy number. The field is only present in germline-aware mode. |
SCND | Difference between CN and GCN. The field is only present in germline-aware mode. |
SD | Best estimate of segment's bias-corrected read count |

GT field examples:

Diploid or Haploid? | ALT | FORMAT:CN | FORMAT:GT |
---|---|---|---|
Diploid | . | 2 | ./. |
Diploid | <DUP> | >2 | ./1 |
Diploid | <DEL> | 1 | 0/1 |
Diploid | <DEL> | 0 | 1/1 |
Haploid | . | 1 | 0 |
Haploid | <DUP> | >1 | 1 |
Haploid | <DEL> | 0 | 1 |

Reasons for interval exclusion:

Exclusion Reason | Description | Related DRAGEN Option |
---|---|---|
NON_KMER_UNIQUE | Non-unique Kmer bases are larger than 50% of interval. | Not applicable. This reason only applies to self-normalization mode. |
EXCLUDE_BED | Interval overlaps with exclude BED larger than threshold. | --cnv-exclude-bed-min-overlap |
PON_MAX_PERCENT_ZERO_SAMPLES | Number of PON samples with 0 coverage is larger than threshold. | --cnv-max-percent-zero-samples |
PON_TARGET_FACTOR_THRESHOLD | Median coverage of interval is lower than threshold of overall median coverage. | --cnv-target-factor-threshold |
PON_MISSING_INTERVAL | Target interval not found in PON. | Not applicable |

PON Metrics File columns:

Column index | Column contents | Description |
---|---|---|
1 | contig | chromosome name |
2 | start | genomic locus of interval start |
3 | stop | genomic locus of interval stop |
4 | name | interval name |
5 | mean | average coverage depth |
6 | std | standard deviation |
7 | normalizedStd | normalized standard deviation (std/mean) |
8 | min | minimum |
9 | 25% | 25th percentile |
10 | 50% | median |
11 | 75% | 75th percentile |
12 | max | maximum |
13 | intervalSize | interval size (stop-start) |
14 | gcContents | percent GC |