Small Variant Calling
Last updated
Last updated
The DRAGEN Small Variant Caller is a high-speed haplotype caller implemented with a hybrid of hardware and software. The caller performs localized de novo assembly in regions of interest to generate candidate haplotypes, and then performs read likelihood calculations using a hidden Markov model (HMM).
Variant calling is disabled by default. To enable variant calling, set the --enable-variant-caller
option to true. The VCF header is annotated with ##source=DRAGEN_SNV
to indicate the file is generated by the DRAGEN SNV pipeline.
The DRAGEN Small Variant Caller performs the following steps:
Active Region Identification---Identifies areas where multiple reads disagree with the reference are identified, and selects windows around them (active regions) for processing.
Localized Haplotype Assembly--- Assembles all overlapping reads in each active region into a de Bruijn graph (DBG). A DBG is a directed graph based on overlapping K-mers (length K subsequences) in each read or multiple reads. When all reads are identical, the DBG is linear. Where there are differences, the graph forms bubbles of multiple paths that diverge and rejoin. If the local sequence is too repetitive and K is too small, cycles can form, which invalidate the graph. DRAGEN uses K=10 and 25 as the default values. If those values produce an invalid graph, then additional values of K=35, 45, 55, 65 are tried until a cycle-free graph is obtained. From this cycle-free DBG, DRAGEN extracts every possible path to produce a complete list of candidate haplotypes, ie, hypotheses for what the true DNA sequence might be on at least one strand. In addition to graph assembly, haplotypes are also generated via columnwise detection, with candidate variant events identified directly from BAM alignments. Columnwise detection is enabled by default in all small variant calling pipelines and is supplementary to the DBG, but is especially useful in highly repetitive regions where DBG assembly of reads is more likely to fail.
Haplotype Alignment---Uses the Smith-Waterman algorithm to align each extracted haplotype to the reference genome. The alignments determine what variations from the reference are present.
Read Likelihood Calculation---Tests each read against each haplotype to estimate a probability of observing the read assuming the haplotype was the true original DNA sampled. This calculation is performed by evaluating a pair hidden Markov model (HMM), which accounts for the various possible ways the haplotype might have been modified by PCR or sequencing errors into the read observed. The HMM evaluation uses a dynamic programming method to calculate the total probability of any series of Markov state transitions arriving at the observed read.
Genotyping---Forms the possible diploid combinations of variant events from the candidate haplotypes and, for each combination, calculates the conditional probability of observing the entire read pileup. Calculations use the constituent probabilities of observing each read, given each haplotype from the pair HMM evaluation. These calculations feed into the Bayesian formula to calculate the likelihood that each genotype is the genotype of the sample being analyzed, given the entire read pileup observed. Genotypes with maximum likelihood are reported.
In most pipelines, DRAGEN reports two types of depth counts, both of which may differ from the information in the BAM pileup due to various filtering steps that are applied throughout variant calling. Briefly:
Unfiltered depth is the number of reads covering the position, downstream of any read collapsing or deduplication that may have preceded the variant calling step, but upstream of most read filtering and overlapping mate handling. Unfiltered depth is reported as INFO/DP, except in the case of gVCF homref calls, where it is reported as FORMAT/DP.
Informative depth is the number of reads actually used to make the calling decision, where filtered reads and uninformative reads (reads that could not be assigned to a specific allele) have been excluded, and overlapping mate pairs are counted as single reads. When overlapping mate pairs are present, this may cause an apparent discrepancy between the reported depth and the pileup as viewed in a browser such as IGV. To resolve this, use the "View as pairs" option in IGV. Informative depth is reported as FORMAT/DP, except in the case of gvcf homref calls, where it is not reported. The FORMAT/AD and FORMAT/AF fields are based on informative depth.
The following figure summarizes the different filtering steps in more detail.
Filter 1 acts on the reads present in the BAM input (in UMI pipelines, these are the collapsed reads produced by the read collapsing step, not the raw reads) and filters out the following reads:
Duplicate reads.
Soft-clipped bases. DRAGEN filters out soft-clipped bases only when calculating coverage reports.
[Somatic] Reads with MAPQ=0.
[Somatic] Reads with MAPQ < vc-min-tumor-read-qual, where vc-min-tumor-read-qual >1.
Filter 2 trims bases with BQ < 10 and filters out the following reads:
Unmapped reads.
Secondary reads.
Reads with bad cigars.
Filter 3 occurs after downsampling and HMM. Filter 3 filters out the following reads:
Reads that are badly mated. A badly mated read is a read where the pair is mapped to two different reference contigs.
Disqualified reads. Reads are disqualified if their HMM score is below a threshold.
Filter 4 occurs after the genotyper runs. The genotyper adds annotation information to the FORMAT field. Filter 4 filters out reads that are not informative. For example, if the HMM scores of the read against two different haplotypes are almost equal, the read is filtered out because it does not provide enough information to distinguish which of the two haplotypes are more likely.
Since DRAGEN 4.3 the mosaic small variant caller runs downstream of the germline small variant caller. Non-cancer post-zygotic mosaic variants with typical AF lower than 50% detected by the mosaic caller are reported in the output VCF file with a MOSAIC
INFO flag. As default, MOSAIC
tagged variants with AF
smaller than 20% are filtered with the MosaicLowAF
filter.
See Mosaic detection for further details on the mosaic small variant caller and the mosaic detection mode and a comparison with DRAGEN 4.2 features.
The following options control the variant caller stage of the DRAGEN host software.
--enable-variant-caller
Set --enable-variant-caller
to true to enable the variant caller stage for the DRAGEN pipeline.
--vc-target-bed
[Optional] Restricts processing of the small variant caller, target BED related coverage, and callability metrics to regions specified in a BED file. The BED file is a text file containing at least three tab-delimited columns. The first three columns are chromosome, start position, and end position. The positions are zero-based. For example:
If the reference span of the variant overlaps with any of the regions in the target BED, then the variant is output. If the reference span does not overlap, the variant is not output. For SNPs and Insertions, the reference span is 1 bp. For deletions, the reference span is the length of the deletion.
--vc-target-bed-padding
[Optional] Pads all target BED regions with the specified value. For example, if a BED region is 1:1000–2000 and the specified padding value is 100, the result is equivalent to using a BED region of 1:900–2100 and a padding value of 0. Any padding added to --vc-target-bed-padding is used by the small variant caller and by the target bed coverage/callability reports. The default padding is 0.
--vc-target-coverage
Specifies the target coverage for downsampling. The default value is 500 for germline mode and 50 for somatic mode.
--vc-remove-all-soft-clips
Set to true to ignore soft-clipped bases during the haploytype assembly step.
--vc-decoy-contigs
Specifies a comma-separated list of contigs to skip during variant calling. This option can be set in the configuration file.
--vc-enable-decoy-contigs
Set to true to enable variant calls on the decoy contigs. The default value is false.
--vc-enable-phasing
Enable variants to be phased when possible. The default value is true.
--vc-combine-phased-variants-distance
Set the maximum distance in base pairs between phased variants to be combined. The default value is 0, which disables the option. If the user wants to enable the combining of phased variants the recommended value of the distance is 15 base pairs. The valid range is [0; 15].
--vc-enable-mosaic-detection
Set to true to enable DRAGEN mosaic detection with mosaic AF filter threshold set to 0.0
. Set to false to disable DRAGEN mosaic detection. The default is true with mosaic AF filter threshold set to 0.2
.
--vc-mosaic-af-filter-threshold
Set the allele frequency threshold for the application of the MosaicLowAF
filter to mosaic calls. All MOSAIC
tagged variants with AF
smaller than the AF
threshold are filtered with the MosaicLowAF
filter. The default mosaic AF
filter threshold is set to 0.2
when the germline variant caller is enabled. The AF default threshold is set to 0.0
when the mosaic detection mode is enabled with --vc-enable-mosaic-detection=true
.
You can use the following options for downsampling reads in the small variant calling pipeline.
Option | Description |
---|---|
--vc-target-coverage | Specifies the maximum number of reads covering any given position. |
--vc-max-reads-per-active-region | Specifies the maximum number of reads covering a given active region. |
--vc-max-reads-per-raw-region | Specifies the maximum number of reads covering a given raw region. |
--vc-min-reads-per-start-pos | Specifies the minimum number of reads with a start position overlapping any given position. |
--high-coverage-support-mode | Applies the high coverage mode down-sample options if set to true. Enabling this option is recommended for targeted panels with coverage over 1000x, but will slow down run time. |
For mitochondrial small variant calling, the downsampling options can be set separately because the mitochondrial contig contains a higher depth than the rest of the contigs in a WGS data set. The following are the downsampling options for the mitochondrial contig.
--vc-target-coverage-mito
--vc-max-reads-per-active-region-mito
--vc-max-reads-per-raw-region-mito
The target coverage and max/min reads in raw/active region options are not directly related and could be triggered independently.
The following are the default downsampling values for each small variant calling mode.
Mode | Downsampling Option | Default Value |
---|---|---|
Germline |
| 500 |
Germline |
| 10000 |
Germline |
| 30000 |
Somatic |
| 1000 |
Somatic |
| 10000 |
Somatic |
| 30000 |
High Coverage |
| 100000 |
High Coverage |
| 200000 |
High Coverage |
| 200000 |
Mitochondrial |
| 40000 |
Mitochondrial |
| 200000 |
Mitochondrial |
| 200000 |
The target coverage downsampling step runs first and is meant to limit the the total coverage at a given position. This step is approximate and the coverage after downsampling at a given position could be a bit higher than the threshold due to the --vc-min-reads-per-start-pos
setting.
If the number of reads at any position with same start position is equal to or lower than the --vc-min-reads-per-start-pos
, that position is skipped for downsampling to make sure that there is always at least a minimum number of reads (set to --vc-min-reads-per-start-pos
, default value is 10) at any start position.
The next downsampling step is to apply the --vc-max-reads-per-raw-region
and --vc-max-reads-per-active-region
limits. These options are used to limit the total number of reads in an entire region using a leveling downsampling method.
This downsampling mechanism scans each start position from the start boundary of the region and discards one read from that position, then moves on to the next position, until the total number of reads falls below the threshold. It can potentially take several passes across the entire region for the total number of reads in the entire region to fall below the threshold. After the threshold is met, the downsampling step is stopped regardless of which position was considered last in the region.
When downsampling occurs, the choice of which reads to keep or remove is random. However, the random number generator is seeded to a default value to make sure that the generator produces the same set of values in each run. This ensures reproducible results, which means there is no run to run variation when using the same input data.
A genomic VCF (gVCF) file contains information on variants and positions determined to be homozygous to the reference genome. For homozygous regions, the gVCF file includes statistics that indicate how well reads support the absence of variants or alternative alleles. The gVCF file includes an artificial <NON_REF>
allele. Reads that do not support the reference or any variants are assigned the <NON_REF>
allele. DRAGEN uses these reads to determine if the position can be called as a homozygous reference, as opposed to remaining uncalled. The resulting score represents the Phred-scaled level of confidence in a homozygous reference call. In germline mode, the score is FORMAT/GQ
and in somatic mode the score is FORMAT/SQ
.
The following options are available to enable and control gVCF output.
--vc-emit-ref-confidence
To enable gVCF output, set to GVCF
. By default, contiguous runs of homozygous reference calls with similar scores are collapsed into blocks (hom-ref blocks). Hom-ref blocks save disk space and processing time of downstream analysis tools. DRAGEN recommends using the default mode.
To produce unbanded output, set --vc-emit-ref-confidence
to BP_RESOLUTION
.
--vc-enable-vcf-output
To enable VCF file output during a gVCF run, set to true. The default value is false.
--vc-gvcf-bands
If using the default --vc-emit-ref-confidence gvcf
(banded mode), DRAGEN collapses gVCF records with a similar GQ or SQ score. By default, the cutoffs are 1 10 20 30 40 60 80
for germline and 1 3 10 20 50 80
for somatic. For example, to define the bands [0, 10), [10, 50), and ≥ 50 use --vc-gvcf-bands 10 50
.
--vc-compact-gvcf
This option, when used for germline in conjunction with --vc-emit-ref-confidence gvcf
, produces a much smaller gVCF output file than the default. It can be used when the gVCF is destined for ingestion into gVCF Genotyper, offering further savings on disk space and gVCF Genotyper runtime compared to the default. This option implies --vc-gvcf-bands 0 1 10 20 30
and additionally omits certain metrics that are not used by gVCF Genotyper. Note that files generated using this option will be rejected by the Pedigree Caller.
Not all entries in the gVCF are contiguous. The file might contain gaps that are not covered by either a variant line or a hom-ref block. The gaps correspond to regions that are not callable. A region is not callable if there is not at least one read mapped to the region with a MAPQ score above zero.
In germline mode, the thresholds for calling are lower for gVCFs than for VCFs. The gVCF output could show a different number of variants than a VCF run for the same sample. There is likely a different number of biallelic and multiallelic calls because gVCF mode includes all possible alleles at a locus, rather than only the two most likely alleles. This means that a biallelic call in the VCF can be output as a multiallelic call in the gVCF. The genotype in the gVCF still points to the two most likely alleles, so the variant call remains the same.
The following are example gVCF records that include a hom-ref block call and a variant call.
In single sample gVCF, FORMAT/DP reported at a HomRef position is the median DP in the band and AD is the corresponding value, so sum of AD will be DP even in a homref band. The minimum is also computed and printed as MIN_DP for the band.
In single sample VCF and gVCF, the QUAL follows the definition of the VCF specification. For more information on the VCF specification, see the most current VCF documentation available on samtools/hts-specs GitHub repository.
QUAL is the Phred-scaled probability that the site has no variant and is computed as:
That is, QUAL = GP (GT=0/0), where GP = posterior genotype probability in Phred scale. QUAL = 20 means there is 99% probability that there is a variant at the site. The GP values are also given in Phred-scaled in the VCF file.
GQ for non-homref calls is the Phred-scaled probability that the call is incorrect. GQ=-10*log10(p), where p is the probability that the call is incorrect. GQ=-10*log10(sum(10.^(-GP(i)/10))) where the sum is taken over the GT that did not win. GQ of 3 indicates a 50 percent chance that the call is incorrect, and GQ of 20 indicates a 1 percent chance that the call is incorrect.
In gvcf mode, the evidence in favor of homozygous reference calls is also assessed. However, the posterior probability is not of interest in this case (with zero evidence, e.g. due to zero coverage, the strong prior in favor of homref would yield a strong posterior in favor of homref), so the value of GQ for homref calls reflects the evidence directly, defined using the likelihood ratio between the likelihoods for the homref hypothesis and the strongest competing variant hypothesis: 10*log10[P(D|homref)/P(D|variant)] where D represents the pileup data.
QD is the QUAL normalized by the read depth, DP.
Metric | QUAL | GQ (non-homref) | GQ (homref) | QD |
---|---|---|---|---|
Description | Probability that the site has no variant | Probability that the call is incorrect | Evidence supporting homref call | Qual normalized by depth |
Formulation | QUAL = GP(GT=0/0) | GQ =-10*log10(p) | GQ = 10*log10[P(D|homref)/P(D|variant)] | QUAL/DP |
Scale | Unsigned Phred | Unsigned Phred | Signed Phred | Unsigned Phred |
Numerical example | QUAL=20: 1 % chance that there is no variant at the site. Qual=50: 1 in 1e5 chance that there is no variant at the site. | GQ=3, 50% that the call is incorrect. GQ=20, 1% change that the call is incorrect. | GQ=0: no evidence. GQ>0: evidence favors homref. |
The QUAL scores generated by DRAGEN differ significantly from those of GATK, as DRAGEN's algorithms for small variant detection provide more realistic scores. This improvement stems from two key factors:
Correlated Errors: DRAGEN accounts for real-world correlated errors, unlike GATK, which assumes errors are uncorrelated, leading to inflated QUAL scores in GATK.
Machine Learning (ML): DRAGEN-ML further recalibrates QUAL scores, making them more accurate than DRAGEN without ML. With ML enabled, QUAL scores tend to not exceed 75, compared to GATK, where they can exceed 1000. Consequently, DRAGEN-ML uses a lower QUAL filtering threshold (3) compared to DRAGEN without ML (10).
Our recommendation is to use the default filtering thresholds in DRAGEN: QUAL threshold of 3 with ML enabled.
DRAGEN supports output of phased variant records in both the germline and the somatic VCF and gVCF files. When two or more variants are phased together, the phasing information is encoded in a sample-level annotation, FORMAT/PS. FORMAT/PS identifies which set the phased variant is in. The value in the field in an integer representing the position of the first phased variant in the set. All records in the same contig with matching PS values belong to the same set.
The following is an example of a DRAGEN single sample gVCF, where two SNPs are phased together.
During the genotyping step, all haplotypes and all variants are considered over an active region. For each pair of variants, if both variants occur on all of the same haplotypes or if either is a homozygous variant, then they are phased together. If the variants only occur on different haplotypes, then they are phased opposite to each other. If any heterozygous variants are present on some of the same haplotypes but not others, phasing is aborted and no phasing information is output for the active region.
Phased variant records that belong to the same phasing set can be combined into a single VCF record. For example, assuming reference at position chr2 115035
is A
, the following two phased variants are combined.
The phased variants are combined as follows.
The command-line option --vc-combine-phased-variants-distance
specifies the maximum distance over which phased variants will be combined. The default value 0 disables the feature. When enabled, the option combines all phased variants in the phasing set that are within the provided distance value.
DRAGEN supports phasing of the genotypes listed in the below table. Only the first row in the table is relevant to somatic, since the somatic pipeline only emits 0/1 and 0|1 genotypes. MNV calls can still be phased with other variant calls that fell outside the phased variants distance.
GT variant 1 | GT variant 2 | GT MNV | Relevant Pipeline | Supported in DRAGEN |
---|---|---|---|---|
0|1 | 0|1 | 0/1 | Germline and Somatic | Yes in 4.0 |
0/1 | 1/1 | 1/2 | Germline | No |
0/1 | 1/2 | 1/2 | Germline | No |
1/1 | 1/1 | 1/1 | Germline | Yes in 4.2 |
Examples of diploid haplotypes where phasing is supported:
Examples of diploid haplotypes where phasing is not supported:
By default in somatic mode, DRAGEN will output all component SNVs and INDELs that make up an MNV along with the MNV call itself. MNVs and their component calls can be identified and linked to one another by a common value in the INFO.MNVTAG field. Setting --vc-mnv-emit-component-calls=false
can be used to restrict which component calls are reported. When DRAGEN reports an MNV call, it considers the difference between the VAF of the MNV call and the VAF of each component call, and reports any given component call in addition to the MNV call if this difference is greater than --vc-combine-phased-variants-max-vaf-delta
(default: 0.1). The --vc-mnv-emit-component-calls
and --vc-combine-phased-variants-max-vaf-delta
options are only applicable in somatic mode and are not supported in germline mode. In germline mode, functionality to output component calls is not available and MNV calls are emitted only without component calls.
DRAGEN outputs variants in a VCF file following variant normalization as described here https://genome.sph.umich.edu/wiki/Variant_Normalization. The normalization of a variant representation in VCF consists of two parts: parsimony and left alignment pertaining to the nature of a variant's length and position respectively.
Parsimony means representing a variant in as few nucleotides as possible without reducing the length of any allele to 0.
Left aligning a variant means shifting the start position of that variant to the left till it is no longer possible to do so.
A variant is normalized if and only if it is parsimonious and left aligned
Additional notes on variant representation in the DRAGEN VCF:
Reference-trimming of alleles: A single padding reference base is used to represent insertions and deletions (i.e. the reference base preceding the insertion or deletion is included).
Allele decomposition: by default, multi-nucleotide polymorphisms (MNPs) are represented as separate, contiguous individual SNVs records in the VCF. If phasing can be determined, the FORMAT/GT is phased and the FORMAT/PS contains the coordinate position of the first variant in the set of phased variants. This determines which variant have occurred on the same haplotype. Phased variant records that belong to the same phasing set can be combined into a single VCF record by using the --vc-combine-phased-variants-distance
command-line option and set it to a non-zero value. When enabled, the option combines all phased variants in the phasing set that are within the provided distance value (specified in the number of basepairs).
In some cases, such as complex variants in repetitive regions, some variants cannot be normalized (i.e. converted into a standard representation) or represented uniquely. To counteract this problem, when comparing two VCFs (e.g. a DRAGEN VCF against a truth set VCF), it is recommended to use the RTG vcfeval tool which performs variant comparisons using a haplotype-aware approach. RTG vcfeval has been adopted as the standard VCF comparison tool by GA4GH and PrecisionFDA https://www.biorxiv.org/content/biorxiv/early/2018/02/23/270157.full.pdf.
A multiallelic site is a specific locus in a genome that contains three or more observed alleles, counting the reference as one, and therefore allowing for two or more variant alleles. Multi-allelic calls are output in a single variant record in the VCF as follows:
chr1 2656216 . A T,C 107.65 PASS AC=1,1;AF=0.500,0.500;AN=2;DP=12;FS=0.000;MQ=28.95;QD=8.97;SOR=3.056;FractionInformativeReads=0.750 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB 1/2:0,5,4:0.556,0.444:9:15:177,144,46,122,0,72:-17.704,-14.420,-4.626,-12.220,0.000,-7.244:1.076e+02,1.096e+02,1.465e+01,8.758e+01,1.520e-01,4.082e+01:0.00,34.77,37.77,34.77,69.54,37.77:0,0,1,8:0,0,4,5
Two indels are considered as multi-allelic if they share the same reference base preceding the indel. chr1 7392258 . C CT,CTTT 234.76 PASS AC=1,1;AF=0.500,0.500;AN=2;DP=44;FS=0.000;MQ=199.22;QD=5.34;SOR=2.226;FractionInformativeReads=0.659 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB 1/2:0,15,14:0.517,0.483:29:50:245,256,55,190,0,55:-24.476,-25.634,-5.492,-18.976,0.000,-5.500:2.348e+02,2.513e+02,5.292e+01,1.848e+02,4.401e-05,5.300e+01:0.00,5.00,8.00,5.00,10.00,8.00:0,0,7,22:0,0,17,12
If a SNP overlaps an INDEL, but the SNP does not align with the reference base preceding the indel, the SNP and INDEL are represented as two different variant records, as shown in the example below. However DRAGEN has the joint detection of overlaping variants feature which is designed to detect overlapping SNP and INDEL and output them in a single VCF variant record, represented as a multi-allelic genotype.
chr1 1029628 . C CGT 49.88 PASS AC=1;AF=0.500;AN=2;DP=37;FS=7.791;MQ=105.32;MQRankSum=-1.315;QD=1.35;ReadPosRankSum=1.423;SOR=1.510;FractionInformativeReads=0.892;R2_5P_bias=-19.742 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB:PS 0|1:17,16:0.485:33:48:81,0,50:-8.088,0.000,-5.000:4.988e+01,6.653e-05,5.300e+01:0.00,31.00,34.00:10,7,5,11:11,6,9,7:1029628 chr1 1029629 . A G 50.00 PASS AC=1;AF=0.500;AN=2;DP=37;FS=1.289;MQ=105.32;MQRankSum=-0.659;QD=1.35;ReadPosRankSum=-0.199;SOR=0.604;FractionInformativeReads=1.000;R2_5P_bias=-24.923 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB:PS 0|1:16,21:0.568:37:48:85,0,49:-8.477,0.000,-4.934:5.000e+01,6.886e-05,5.234e+01:0.00,34.77,37.77:9,7,10,11:10,6,13,8:1029628
The small variant caller currently only supports either ploidy 1 or 2 on all contigs within the reference except for the mitochondrial contig, which uses a continuous allele frequency approach (see Mitochondrial Calling). The selection of ploidy 1 or 2 for all other contigs is determined as follows.
If --sample-sex
is not specified on the command line, the Ploidy Estimator determines the sex. If the Ploidy Estimator cannot determine the sex karyotype or detects sex chromosome aneuploidy, all contigs are processed with ploidy 2.
If --sample-sex
is specified on the command line, contigs are processed as follows.
For female samples, DRAGEN processes all contigs with ploidy 2, and marks variant calls on chrY with a filter PloidyConflict.
For male samples, DRAGEN processes all contigs with ploidy 2, except for the sex chromosomes. DRAGEN processes chrX with ploidy 1, except in the PAR regions, where it is processed with ploidy 2. chrY is processed with ploidy 1 throughout.
For male samples in germline calling mode, DRAGEN calls potential mosaic variants in non-PAR regions of sex chromosomes. A variant is called as mosaic when the allele frequency (FORMAT/AF) is below 85% or if multiple alt alleles are called, suggesting incompatibility with the haploid assumption. The GT field for bi-allelic mosaic variants is "0/1", denoting a mixture of reference and alt alleles, as opposed to the regular GT of "1" for haploid variants. The GT field for multi-allelic mosaic variants is "1/2" in VCF. You can disable the calling of mosaic variants by setting --vc-enable-sex-chr-diploid
to false.
An example germline VCF record of a mosaic variant in a haploid region: chrX 18622368 . C T 48.84 PASS AC=1;AF=0.500;AN=2;DP=22;FS=4.154;MQ=248.02;MQRankSum=3.272;QD=2.27;ReadPosRankSum=2.671;SOR=1.546;FractionInformativeReads=1.000;MOSAIC
GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 0/1:9,13:0.5909
:22:1,8:8,5:48:84,0,51:4.8837e+01,7.4031e-05,5.4007e+01:0.00,34.77,37.77:5,4,4,9:3,6,5,8
DRAGEN detects sex chromosomes by the naming convention, either X/Y or chrX/chrY. No other naming convention is supported.
--sample-sex | Ploidy Estimation | Sample Sex in Small VC |
---|---|---|
male | Not relevant | Male |
female | Not relevant | Female |
none | Not relevant | None |
auto (default) | XY | Male |
auto (default) | XX | Female |
auto (default) | Everything else | None |
Instead of treating overlapping mates as independent evidence for a given event, DRAGEN handles overlapping mates in both the germline and somatic pipelines as follows.
When the two overlapping mates agree with each other on the allele with the highest HMM score, the genotyper uses the mate with the greatest difference between the highest and the second highest HMM score. The HMM score of the other mate becomes zero.
When the two overlapping mates disagree, the genotyper sums the HMM score between the two mates, assigns the combined score to the mate that agrees with the combined result, and changes the HMM score of the other mate to zero.
The base qualities of overlapping mates are no longer adjusted.
Typically, there are approximately 100 mitochondria in each mammalian cell. Each mitochondrion harbors 2–10 copies of mitochondrial DNA (mtDNA). For example, if 20% of the chrM copies have a variant, then the allele frequency (AF) is 20%. This is also referred to as continuous allele frequency. The expectation is that the AF of variants on chrM is anywhere between 0% and 100%.
DRAGEN processes chrM through a continuous AF pipeline, which is similar to the somatic variant calling pipeline. In this case, a single ALT allele is considered and the AF is estimated. The estimated AF can be anywhere between 0% and 100%. Default variant AF thresholds are applied to mitochondrial variant calling.
--vc-enable-af-filter-mito
Whether to enable the allele frequency for mitochondrial variant calling. The default is true.
--vc-af-call-threshold-mito
Set the threshold for emitting calls in the VCF. The default is 0.01.
--vc-af-filter-threshold-mito
Set the threshold to mark emitted vcf call as filtered. The default is 0.02.
QUAL and GQ are not output in the chrM variant records. Instead, the confidence score is FORMAT/SQ, which gives the Phred-scaled confidence that a variant is present at a given locus. A call is made if FORMAT/SQ> vc-sq-call-threshold (default = 3.0).
The following filters can be applied to mitochondrial variant calls.
--vc-sq-call-threshold
Set the SQ threshold for emitting calls in the VCF. The default is 0.1.
--vc-sq-filter-threshold
Set the SQ threshold to mark emitted VCF calls as filtered. The default is 3.0
--vc-enable-triallelic-filter
Enables the multiallelic filter. The default value is false.
If FORMAT/SQ < vc-sq-call-threshold, the variant is not emitted in the VCF. If FORMAT/SQ > vc-sq-call-threshold but FORMAT/SQ < vc-sq-filter-threshold, the variant is emitted in the VCF but FILTER=weak_evidence.
If FORMAT/SQ> vc-sq-call-threshold, FORMAT/SQ > vc-sq-filter-threshold, and no other filters are triggered, the variant is output in the VCF and FILTER=PASS.
The following are example VCF records on the chrM. The examples show one call with very high AF and another with low AF. In both cases FORMAT/SQ > vc-sq-call-threshold. FORMAT/SQ is also > vc-sq-filter-threshold, so the FILTER annotation is PASS.
For homref calls (e.g. in NON_REF regions of gVCF output) the FORMAT/GT is hard-coded to 0/0. The FORMAT/AF yields an estimate on the variant allele frequency, which ranges anywhere within [0,1]. For variant calls with FORMAT/AF < 95%, the FORMAT/GT is set to 0/1. For variants with very high allele frequencies (FORMAT/AF ≥ 95%), the FORMAT/GT is set to 1/1.
The following is an example of a variant record on chrM in a trio joint VCF. The variant was detected in the second sample with a confidence score that passed the filter threshold. In the first and third samples GT=0/0, which indicates a tentative hom-ref call (ie, that position for the sample is in a NON_REF region over which no variant was detected with sufficient confidence), but the weak_evidence filter tag indicates that this call is made with low confidence.
We leverage the new pangenome reference and multi-genome mapper output to compute a personalized 2-haplotype reference for the input sample.
The computed 2-haplotype reference is used to impute variants, adjust priors probabilities for genotypes in the variant caller, create a new personalized machine learning model and significantly boosts accuracy of variant calling. False negatives are reduced by adjusting genotype priors based on imputed phased variants in the computed haplotypes. False positives are reduced by limiting the impact of noise from other population haplotypes.
To enable personalized variant calling and machine learning, set --enable-personalization
to true (default: false).
Note that this is a beta feature and available only for the germline small variant caller when run with a V4 pangenome reference.
When variants at multiple loci in a single active region are detected jointly, genotyping can benefit. DRAGEN combines loci into a joint detection region if the following conditions are met:
Loci have alleles that overlap each other.
Loci are in the STR region or less than 10 bases apart of the STR region.
Loci are less than 10 bases apart of each other.
Joint detection generates a haplotype list where all possible combinations of the alleles in the joint detection regions are represented. This calculation leads to a larger number of haplotypes. During genotyping, joint detection calculates the likelihoods that each haplotype pair is the truth, given the observed read pileup. Genotype likelihoods are calculated as the sum of the likelihoods of haplotype pairs that support the alleles in the genotypes. Genotypes with maximum likelihood are reported.
Joint detection is enabled by default. To disable joint detection, set --vc-enable-joint-detection
to false.
DRAGEN has two algorithms that model correlated errors across reads in a given pileup.
Foreign read detection (FRD) detects mismapped reads. FRD modifies the probability calculation to account for the possibility that a subset of the reads were mismapped. Instead of assuming that mapping errors occur independently per read, FRD estimates the probability that a burst of reads is mismapped, by incorporating such evidence as MAPQ and skewed AF.
Mapping errors typically occur in bursts, but treating mapping errors as independent error events per read can result in high confidence scores in spite of low MAPQ and/or skewed AF. One possible strategy to mitigate overestimation of confidence scores is to include a threshold on the minimum MAPQ used in the calculation. However, this strategy can discard evidence and result in false positives.
FRD extends the legacy genotyping algorithm by incorporating an additional hypothesis that reads in the pileup might be foreign reads (ie, their true location is elsewhere in the reference genome). The algorithm exploits multiple properties (skewed allele frequency and low MAPQ) and incorporates this evidence into the probability calculation.
Sensitivity is improved by rescuing FN, correcting genotypes, and enabling lowering of the MAPQ threshold for incoming reads into the variant caller. Specificity is improved by removing FP and correcting genotypes.
The base quality drop off (BQD) algorithm detects systematic and correlated base call errors caused by the sequencing system. BQD exploits certain properties of those errors (strand bias, position of the error in the read, base quality) to estimate the probability that the alleles are the result of a systematic error event rather than a true variant.
Bursts of errors that occur at a specific locus have distinct characteristics differentiating them from true variants. The base quality drop off (BQD) algorithm is a detection mechanism that exploits certain properties of those errors (strand bias, position of the error in the read, low mean base quality over said subset of reads at the locus of interest) and incorporates them into the probability calculation.