VNTR Calling
Last updated
Last updated
The DRAGEN Variable Number Tandem Repeat (VNTR) Caller detects expansions and contractions in tandem repeat (TR) regions. For specified TR regions in the genome, the DRAGEN VNTR Caller estimates the size of the haplotypes in each region and provides variant calls, including the number of copies of the repeat for the sample in question. The DRAGEN VNTR Caller only considers TR regions included in a pre-specified VNTR catalog file.
For each region in the VNTR catalog, the VNTR Caller performs the following steps:
Read fragment collection, including wrap-around alignment and read classification;
Genotyping, including the scoring of candidate haplotype lengths using a Bayesian likelihood model.
The output of the VNTR Caller is the total length of sequence present in each TR region for the sample in question, resolved for each haplotype if possible; the copy number for each region is calculated from the length. Calls are reported in a VNTR output VCF file following the VCFv4.4 spec.
The DRAGEN VNTR Caller can be enabled by setting the --enable-vntr
option to true
. The VNTR Caller requires whole-genome sequencing (WGS) data aligned to a human reference genome with at least 30x coverage.
This diagram illustrates the overall workflow of the VNTR Caller. The VNTR Caller takes as input a set of aligned reads from the sample in question (either from the DRAGEN mapper or from an input BAM/CRAM) and a VNTR catalog file.
The VNTR catalog is a bed file specifying the TR regions for the VNTR Caller to act upon. Each region in the bed file is expected to be the start and the end of a tandem repeat sequence with no additional buffer sequence. The catalog also includes a unique TR ID for each region and the sequence of the repeat unit/pattern (see below for more details on the VNTR catalog file format).
The VNTR Caller processes each TR region in parallel, starting with read fragment collection. The VNTR Caller considers read fragments (i.e. paired-end reads as a single unit) rather than individual reads. To obtain all of the relevant read fragments for each region, all of the reads that overlap the region are found, and then all of their mates are collected as well.
Due to the repetitive nature of TR regions, existing read-alignments may be unreliable. Therefore spanning reads, unmapped reads, and reads with soft-clips undergo a specialized wrap-around alignment algorithm, which allows for a read to align to the same pattern sequence multiple times without penalty (mirroring the structure of the tandem repeat). This algorithm produces more reliable alignments of the read fragments to the TR region. Additionally, another rule to virtually extend the boundaries of the repetitive region into the flanks is applied to resolve some alignment ambiguities arising from the wrap-around alignment.
Once reliable alignments of the read fragments have been obtained, the next step is to classify each read. Reads are classified as non-overlapping, flanking, spanning, and contained relative to the TR region based on the following figure.
The output of fragment collection is the set of all read fragments in each TR region, re-aligned as necessary, with each read given a classification. This collection of read fragments is referred to as a pileup and acts as the input to the genotyper.
The genotyper determines the top-scoring haplotypes based on the read fragment evidence for each TR region. Given a pileup, the genotyper further classifies each read fragment into fragment classes.
The number of fragments in each class acts as evidence for the haplotype lengths of the TR region. A Bayesian likelihood model is used to evaluate what pair of haplotypes have the highest likelihood of generating the observation of these fragment class counts. A set of candidate haplotype lengths is generated based on fractional increments of the repeat pattern length, and each pair of haplotype lengths is evaluated as a candidate diploid genotype. If the caller detects that individual haplotype lengths cannot be resolved, the total length is considered as a candidate genotype instead (referred to as a total call). In subsequent steps, these total call candidates are assessed as if they were homozygous diploid genotypes.
For a Bayesian model, the posterior probability of each candidate diploid genotype must be considered. The posterior probability is made up of two parts: the genotype prior and the pileup likelihood. Three types of priors are currently supported:
No prior (all alleles weighted equally, referred to as model 0)
Het/hom priors (four classes with different weights: homozygous reference, ref/alt, homozygous alt, and alt1/alt2, referred to as model 1)
Population haplotype frequencies (region-specific haplotype frequencies over high-quality population data sets, referred to as model 3 and used by default)
The priors model can be chosen by setting the option --vntr-priors-model
to 0
, 1
, or 3
(the default being 3
).
The pileup likelihood is calculated as the likelihood of observing the fragment class counts given the candidate diploid genotype (based on an underlying model for how fragments are generated from a TR region haplotype of a given length). With the prior and the pileup likelihood, the posterior probability of each candidate diploid genotype can be computed. The diploid genotype with the highest posterior probability is chosen as the resulting call for each region.
The VNTR Caller is disabled by default. To enable the VNTR Caller, set --enable-vntr
to true
. The VNTR Caller can run directly from FASTQ input with the mapper or from prealigned BAM/CRAM input. You can also enable the VNTR Caller in parallel with any other germline variant callers as part of a WGS germline analysis workflow. For more information on other variant callers, see the DRAGEN DNA Pipeline.
FASTQ input example:
BAM input example:
Additional Options:
The number of threads used for the DRAGEN VNTR caller can be adjusted using the --vntr-num-threads <number_of_threads>
option.
The VNTR catalog is a bed file with the following required fields:
chromosome (or contig)
start position (0-based inclusive)
end position (0-based exclusive = 1-based inclusive)
TR ID (unique ID for TR region)
repeat unit sequence (sequence of repeat pattern/motif)
The reference haplotype length is calculated by subtracting the start position from the end position, and the number of repeat units in the reference can be found by dividing the reference haplotype length by the length of the repeat unit.
When using a standard reference (hg38
, hg19
, or GRCh37
), DRAGEN will automatically use a matching pre-packaged catalog by default. A custom catalog can be provided by adding in the option, --vntr-catalog-bed <custom_catalog_bed_file>
.
For references other than the ones mentioned above, a catalog must be provided. Furthermore the caller requires a set of normalization regions (--vntr-normalization-regions-bed <bed file>
. These regions should be well-behaved and free of any VNTRs or other large variants. We recommend using a few thousand regions of at least 2kb each. These two files are enough to run the genotyper without priors or the aforementioned flat priors model (--vntr-priors-model 0
or 1
). To enable population priors 3
, one additional file has to be provided: --vntr-priors-file <json file>
. The json file contains data obtained from a population analysis, structured like the following example with one entry per catalog region:
The output of the DRAGEN VNTR Caller includes a VNTR VCF file, a table output TSV file, and a VNTR metrics file.
The VNTR VCF file follows version 4.4 of the VCF spec. The VCF includes a call for every TR region provided in the VNTR catalog unless it was hard-filtered in the fragment collection or the genotyping (the filter annotation can be found in the table output file).
Each call is an estimate of the lengths of the haplotypes present in that region for the sample in question. If the individual haplotypes lengths can be distinguished, then a diploid call is reported including the lengths and copy number of each haplotype (in the INFO RB and RUC fields, respectively). Otherwise, a total call is made, only reporting the total length and total copy number for the region (in the FORMAT TOTALRB and TOTALRUC fields). For total calls, GT = ./.
.
If the length of a haplotype is within a certain threshold of the reference array length for the region, then it is reported as a reference allele (the default reference threshold is 10%). If both haplotypes are reference alleles or if the total length of a total call is within the total reference threshold, then a reference call is reported in the VCF, with HomRef
in the FILTER
field and GT = 0/0
. A symbolic <CNV:TR>
is reported in the ALT
field for each non-reference allele in the call.
The following fields are included for each VCF entry:
INFO:SVTYPE
: set to "CNV" for all VNTR calls
INFO:SVLEN
: set to the reference array length
INFO:EVENTTYPE
: set to "VNTR"
INFO:RUS
: the sequence of the repeat unit (i.e. the repeat pattern or motif)
INFO:RUL
: the length of the repeat unit
INFO:REFRUC
: the number of copies of the repeat in the reference haplotype
INFO:RB
: the length of each ALT haplotype being reported
INFO:CN
: the copy number per ALT haplotype relative to the reference (equal to RB / SVLEN
)
INFO:CNVTRLEN
: the change in length of each ALT haplotype compared to the reference (equal to RB - SVLEN
)
INFO:RUC
: the number of repeat unit copies for each ALT haplotype being reported (equal to RB / RUL
)
FORMAT:SVFT
: any filters that will be applied only in the merged SV + VNTR VCF
FORMAT:GQ
: Genotype quality score
FORMAT:CN
: the total copy number relative to the reference (equal to TOTALRB / SVLEN
)
FORMAT:TOTALRB
: the total length of all haplotypes (including reference haplotypes if present)
FORMAT:TOTALCNVTRLEN
: the total change in length of all haplotypes relative to the reference (equal to TOTALRB - 2*SVLEN
)
FORMAT:TOTALRUC
: the total number of repeat unit copies of all haplotypes (including reference haplotypes if present; equal to TOTALRB / RUL
)
Additional fields:
INFO:RUCCHANGE
: the change in the RUC compared to the reference for each ALT haplotype (equal to RUC - REFRUC
)
INFO:LOGPROB
: the log probability of the called alleles from the genotyper
INFO:VNTRCLASSFIT
: the score of how well the fragment classes fit the expected distribution
INFO:TOTALFRAGCOUNT
: the number of fragments used to make the call
The table output file provides a simple summary of the VNTR Caller output. Every single region included in the VNTR catalog bed will also be included in the table output file; regions where a hard-filter was applied in the fragment collection or the genotyping will still be included with the reason for the filter annotated (these regions will not appear in the VCF file).
For each region, the following information is provided:
trId
: the unique ID of the TR region
patternSize
: the length of the repeat unit (i.e. the repeat pattern/motif length)
refArraySize
: the length of the reference haplotype for this region
Hap1Size
: the length of the first haplotype of the call (Hap1Size <= Hap2Size
; NA
for total calls)
Hap2Size
: the length of the second haplotype of the call (NA
for total calls)
TotalSize
: the total length of all haplotypes in the call (equal to Hap1Size + Hap2Size
for diploid calls)
Likelihood
: the log likelihood of the called alleles from the genotyper (equal to INFO:LOGPROB
in the VCF)
QUAL
: QUAL score (matches QUAL field in the VCF)
GQScore
: Genotype quality score (equal to FORMAT:GQ
in the VCF)
ClassDistributionFit
: the score of how well the fragment classes fit the expected distribution (equal to INFO:VNTRCLASSFIT
in the VCF)
FragmentCount
: the number of fragments used to make the call (equal to INFO:TOTALFRAGCOUNT
in the VCF)
Flags
: the flags and filters applied to the call
The VNTR metrics file reports summary statistics for the VNTR caller including region counts, read class counts, and call counts.
Region counts include the number of normalization regions, the number of prior regions, the number of TR regions with nonzero coverage, and the total number of TR regions.
Read class counts include the total number of reads in each class (strictly left, left-flanking, spanning, contained, right-flanking, strictly right, and unmapped).
Call counts include the total number of uncalled TR regions, as well as the total number of deletion, insertion, and reference calls for diploid and total calls (note that for a region where a diploid call is made, two calls are reported, but for a total call region, only one call is reported).
DRAGEN supports the automatic merger of the VNTR VCF calls with the DRAGEN Structural Variant (SV) Caller output VCF. By default, if both the DRAGEN VNTR Caller and the DRAGEN SV Caller are enabled (with the options --enable-vntr true
and --enable-sv true
, respectively), then calls made by the VNTR Caller will also be included in the DRAGEN SV VCF (<output_prefix>.sv.vcf.gz
). This behavior can be disabled by adding the option --sv-vntr-merge false
. The VNTR VCF does not change even if DRAGEN SV is enabled.
When VNTR calls are added to the SV VCF, the following changes are applied:
VNTR diploid calls with GT = 1/2
are split into two separate VCF entries, each of which is reported as GT = 0/1
.
A lt50bp
filter is applied to all VNTR calls with INFO:CNVTRLEN < 50
(the min SV length parameter is set to 50 bp by default). For total calls with no INFO:CNVTRLEN
, FORMAT:TOTALCNVTRLEN
is used instead.
A TotalCall
filter is applied to all VNTR total calls (calls with GT = ./.
). This behavior can be disabled by adding the option --sv-vntr-filter-total-calls false
.
A LowPopulationVariance
filter is applied to all VNTR calls with the FORMAT:SVFT
field equal to LowPopulationVariance
. The filter indicates that there are few population samples with an SV variant for this region.
An OverlapsVNTR
filter is applied to any SV call that overlaps with a VNTR call (even with a HomRef filter) UNLESS the VNTR call has a TotalCall
or a LowPopulationVariance
filter.