Targeted Caller
Repetitive regions in the human genome pose a challenge for general variant calling approaches which typically cannot make use of potentially misplaced MAPQ0 reads. Furthermore, high sequence homology of some genes with a pseudogene paralog can lead to a wide variety of common structural variants (SVs) in the population, requiring specialized targeted calling approaches. DRAGEN supports targeted calling for a number of genes/targets as described in subsequent target-specific sections.
The targeted caller can be enabled using the command line option --enable-targeted=true, or a subset of targets can be enabled by providing a space-separated list of target names. The supported target names are: cyp2b6, cyp2d6, cyp21a2, gba, hba, lpa, rh, and smn. For a list of all supported targeted caller options along with their default values, see Targeted Caller Options.
The targeted caller produces a <output-file-prefix>.targeted.json file containing a summary of the variant caller results for each target. Additional details of individual variant calls are reported in VCF format in the <output-file-prefix>.targeted.vcf.gz output file.
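As a minimal sketch of the two ways to enable the caller described above, the following Python helper assembles the relevant arguments and rejects unknown target names. The helper name is hypothetical, and the space-separated form `--enable-targeted <name> <name>` is an assumption based on the description above and on the `--targeted-merge-vc rh hba` example later on this page.

```python
# Target names listed on this page.
SUPPORTED_TARGETS = ("cyp2b6", "cyp2d6", "cyp21a2", "gba", "hba", "lpa", "rh", "smn")


def enable_targeted_args(targets=None):
    """Return the argument list that enables the targeted caller.

    targets=None enables every target with --enable-targeted=true.
    Otherwise the names are validated and passed as a space-separated list
    (assumed syntax, see the lead-in above).
    """
    if targets is None:
        return ["--enable-targeted=true"]
    if not targets:
        raise ValueError("provide at least one target name, or None for all targets")
    unknown = [name for name in targets if name not in SUPPORTED_TARGETS]
    if unknown:
        raise ValueError("unsupported target(s): " + ", ".join(unknown))
    return ["--enable-targeted", *targets]


print(" ".join(enable_targeted_args()))
print(" ".join(enable_targeted_args(["cyp21a2", "gba"])))
```

Validating the names before launching a run is a design choice of this sketch rather than documented behavior.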
Input Data
The targeted caller requires WGS data aligned to a human reference genome with at least 30x coverage. The caller may be less reliable at lower coverage. Human reference genome builds based on hg19, hs37d5 (including GRCh37), or hg38 are supported. The targeted caller should not be enabled with low-coverage, exome or enrichment sequencing data.
Output Files
Targeted JSON File
The targeted caller generates a <output-file-prefix>.targeted.json file in the output directory. The output file is a JSON-formatted file containing the fields below.
Fields in JSON | Explanation | Type and Possible Values | Present |
---|---|---|---|
sampleId | The sample name. | string | always |
softwareVersion | The version of DRAGEN. | string | always |
phenotypeDatabaseSources | Resources used for calling metabolism status (phenotype). | json array of strings | CYP2B6 or CYP2D6 is enabled |
cyp2b6 | The CYP2B6 caller fields. | dictionary | CYP2B6 caller is enabled |
cyp2d6 | The CYP2D6 caller fields. | dictionary | CYP2D6 caller is enabled |
cyp21a2 | The CYP21A2 caller fields. | dictionary | CYP21A2 caller is enabled |
gba | The GBA caller fields. | dictionary | GBA caller is enabled |
hba | The HBA caller fields. | dictionary | HBA caller is enabled |
lpa | The LPA caller fields. | dictionary | LPA caller is enabled |
rh | The RH caller fields. | dictionary | RH caller is enabled |
smn | The SMN caller fields. | dictionary | SMN caller is enabled |
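The table above can be read as: the two string fields are always present, and a per-target dictionary appears only when that caller was enabled. The sketch below uses that to list the enabled targets. The function name and the example report are hypothetical, and the contents of the per-target dictionaries are deliberately left empty because they are described in the target-specific sections.

```python
import json

# Per-target keys from the table above.
TARGET_KEYS = ("cyp2b6", "cyp2d6", "cyp21a2", "gba", "hba", "lpa", "rh", "smn")


def summarize_targeted_json(text):
    """Return (sampleId, softwareVersion, enabled target keys) from JSON text."""
    report = json.loads(text)
    enabled = [key for key in TARGET_KEYS if key in report]
    return report["sampleId"], report["softwareVersion"], enabled


# Hypothetical, heavily trimmed report used only for illustration.
example = json.dumps(
    {"sampleId": "sample1", "softwareVersion": "x.y.z", "gba": {}, "smn": {}}
)
print(summarize_targeted_json(example))
```

For a real run, the same function could be applied to the text read from the <output-file-prefix>.targeted.json file.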
Targeted VCF File
The targeted caller generates a <output-file-prefix>.targeted.vcf.gz file in the output directory. The output file is a VCFv4.2 formatted file. The targets that have VCF output are: cyp21a2, gba, hba, lpa, rh, and smn.
Small variants, structural variants, and copy number variants are reported in the same VCF file.
The <output-file-prefix>.targeted.vcf.gz file includes the following source header line:
For lpa, rh and smn targets, the EVENT and EVENTTYPE INFO fields are used to identify the called variants.
The EVENT and EVENTTYPE INFO fields are formally introduced in VCFv4.4 to enable the representation of complex rearrangements. This is achieved using the EVENT field to group all the related VCF records together, and the EVENTTYPE field to classify the event. The corresponding header lines are the following.
However, the use of EVENT is not limited to complex rearrangements and can be used to associate nonsymbolic alleles, for example in cases of variant position ambiguity in high homology regions.
Since the EVENTTYPE values are implementation-defined, custom EVENTTYPE header lines are included to describe each EVENTTYPE.
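To illustrate how EVENT groups related records, the following sketch collects VCF records by their INFO/EVENT value and keeps the associated EVENTTYPE. It parses plain tab-separated VCF lines with the standard library only. The function names, event identifier, and event type in the example are hypothetical and do not reflect actual DRAGEN output.

```python
from collections import defaultdict


def parse_info(info):
    """Parse a VCF INFO column into a dict; flag fields map to True."""
    fields = {}
    if info == ".":
        return fields
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def group_by_event(vcf_lines):
    """Map each INFO/EVENT identifier to (EVENTTYPE, [(CHROM, POS), ...])."""
    events = defaultdict(lambda: [None, []])
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        info = parse_info(cols[7])
        if "EVENT" not in info:
            continue  # record is not part of any event
        entry = events[info["EVENT"]]
        entry[0] = info.get("EVENTTYPE", entry[0])
        entry[1].append((cols[0], int(cols[1])))
    return {name: (etype, sites) for name, (etype, sites) in events.items()}


# Hypothetical records: two sites sharing one event, one unrelated site.
records = [
    "##fileformat=VCFv4.2",
    "\t".join(["chr1", "100", ".", "A", "G", "50", "PASS", "EVENT=ev1;EVENTTYPE=demo"]),
    "\t".join(["chr1", "200", ".", "C", "T", "50", "PASS", "EVENT=ev1;EVENTTYPE=demo"]),
    "\t".join(["chr1", "300", ".", "G", "A", "50", "PASS", "DP=10"]),
]
print(group_by_event(records))
```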
For cyp21a2, gba, and hba targets, the ALLELE_ID INFO field is used to identify the called variant alleles.
The missing value "." is used when no identifier is available (e.g. a wild type allele) or applicable (e.g. allele index 0 for a structural variant record).
Nonrecombinant-like Variants In High Homology Regions
In the case of target variants in a high homology region, each variant is reported ambiguously at all corresponding homologous positions (i.e. in both the pseudogene and in the target gene). Additional analysis for these variants can be performed if absolute certainty that these variants are located in the target gene (e.g. in gba or cyp21a2) is required.
For lpa and smn the ploidy of the called genotype (FORMAT/GT field) corresponds to the combined copy number from all the homologous positions. For cyp21a2, gba and hba, this "joint" genotype from all the homologous positions is instead reported in a separate FORMAT/JGT field which is then collapsed into a diploid genotype and reported in the FORMAT/GT field. The following fields are reported for "joint" calls:
Note that the FORMAT/GQ and FORMAT/JGQ fields contain the unconditional genotype quality, unlike the VCF spec where FORMAT/GQ is defined as the genotype quality conditioned on the site being variant.
In the depicted example there are two genes A and B that include a high homology region. The usual process to call variants in these regions is to make a joint pileup of the reads aligning in both genes A and B and call the variants using a model with a ploidy proportional to the total copy number of the regions. This generates divergent possible genotypes that are equally likely since the variant cannot be confidently placed in either gene A or gene B. For lpa and smn the variant would be reported as follows:
Given the unconventional ploidy of the FORMAT/GT field used in this representation, a TargetedRepeatConflict filter is applied to these records. The header line for the filter is the following.
For cyp21a2, gba and hba, a conventional diploid FORMAT/GT is reported and so no TargetedRepeatConflict filter is applied. Due to the ambiguity in placing target variants in high homology regions, the corresponding QUAL and FORMAT/GQ fields can be much lower than conventional small variant calls (i.e. Phred 3 for a single variant allele copy across two homologous diploid positions). Therefore, instead of filtering on QUAL and FORMAT/GQ for these records, the records are filtered based on the FORMAT/JVQL and FORMAT/JGQ fields:
Since the wild type alleles at homologous positions may be different from each other or different from the reference alleles, an additional filter is applied when only wild type alleles are detected across the homologous positions. This avoids making ambiguous variant calls when no target variant of interest is detected.
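One plausible reading of the Phred 3 figure quoted above: if a single variant copy is equally likely to sit at either of the two homologous positions, then placing it at one of them is wrong with probability 0.5, and the Phred scale maps that to roughly 3. The sketch below only reproduces this arithmetic. The equal-probability assumption is an interpretation of the text, not a description of the caller's internal model.

```python
import math


def phred(error_probability):
    """Convert an error probability into a Phred-scaled quality."""
    return -10.0 * math.log10(error_probability)


# Assumption: the variant is equally likely to belong to either homologous
# position, so committing to one placement has a 0.5 chance of being wrong.
placement_error = 0.5
print(round(phred(placement_error), 2))  # 3.01
```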
Rh Gene Conversion Events
In the case of an identified gene conversion event in rh, a small variant is reported at each differentiating site in the acceptor region.
In the depicted example there are two genes A and B and gene A is the acceptor of a gene conversion from gene B (green box in the figure). Gene conversions are identified by observing variations in copy number at differentiating sites (blue and pink bars in the figure) in consecutive regions. Copy number variations between regions define the breakends of the gene conversion. An equivalent VCF representation for gene conversion would use CNV and SV entries with breakends corresponding to the donor/acceptor regions; however, only the small variant representation is currently supported.
In the case of a detected gene conversion event, there may be differentiating sites with a genotype that is inconsistent with that gene conversion event. In these cases the RecombinantConflict filter is applied. The RecombinantConflict filter is defined by the following header line.
In the example, the resulting representation is as follows.
Nonallelic Homologous Recombination
For cyp21a2 and gba, nonallelic homologous recombination can result in gene deletion or duplication in the case of reciprocal recombination or gene conversion in the case of nonreciprocal recombination. Both gene deletion and gene conversion can introduce loss-of-function variants and in both cases the targeted caller will report these variants in the target gene. In the case of gene deletion, the differentiating sites at the nontarget (i.e. pseudogene) positions will contain the overlapping deletion allele "*" while the differentiating sites in the target will contain any variant alleles. Although an equivalent VCF representation would be to simply report the deletion with a single structural variant VCF record, reporting small variant VCF records in the target gene allows for identification of the specific mutations that may occur in a gene transcript and matches well with annotation using HGVS nomenclature. Similarly, for gene conversions, variants are reported at differentiating sites in the target gene, rather than as pairs of structural variant breakends.
Calls at differentiating sites within the recombinant variant calling region will contain the same "joint" fields as are reported for nonrecombinant-like variants in high homology regions (see Nonrecombinant-like Variants In High Homology Regions). However, the collapsed diploid FORMAT/GT will be based on any detected recombination events. Because detected recombinant variants are placed in the target gene, these records are filtered differently than the ambiguously placed, nonrecombinant-like variants in high homology regions. The INFO/Recombinant flag is added to calls derived from recombinant variant calling to distinguish them from nonrecombinant-like variant calls in high homology regions. The FORMAT/VQL field is used to apply the RecombinantLowVQL filter for low quality recombinant variants and the RecombinantREF filter is applied when the collapsed diploid FORMAT/GT contains only reference alleles.
Overlapping Structural Variant Representation
The use of GT=0 for symbolic structural variant alleles is formally disambiguated in VCFv4.4, specifying that "GT=0 indicates the absence of any of the ALT symbolic structural variants defined in the record". With this convention we can report compound overlapping heterozygous structural variants.
In the hba genotype depicted above, two overlapping SVs can be represented as follows:
The relevant header lines for the VCF records above are:
Variable Number Tandem Repeat Representation
In the depicted example there is a Variable Number Tandem Repeat (VNTR) region composed of three repeat units in the reference. The CN INFO field is used to report the allele copy number, the CN FORMAT field is used to report the region total copy number given by the sum of the allele copy numbers, and the REPCN FORMAT field is used to report the repeat unit copy number equal to the allele copy number multiplied by the number of repeat units in the reference.
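The relationship between the three values can be sketched as follows. The reference repeat count of three comes from the example above. The function names and the allele copy numbers are hypothetical and chosen only to show the arithmetic.

```python
def repeat_unit_copy_number(allele_cn, reference_repeat_units):
    """REPCN: allele copy number times the repeat units in the reference."""
    return allele_cn * reference_repeat_units


def region_total_copy_number(allele_cns):
    """FORMAT/CN: sum of the allele copy numbers."""
    return sum(allele_cns)


REFERENCE_REPEAT_UNITS = 3  # the depicted VNTR has three repeat units
allele_cns = [1, 2]  # hypothetical allele copy numbers (INFO/CN per allele)

repcn = [repeat_unit_copy_number(cn, REFERENCE_REPEAT_UNITS) for cn in allele_cns]
print(repcn)  # [3, 6]
print(region_total_copy_number(allele_cns))  # 3
```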
This VNTR can be represented as follows:
The REPCN and CN header lines are:
Additional Filters
For lpa, rh and smn, the TargetedLowQual filter is applied if the QUAL of a target variant is less than 3.00.
Similarly, for cyp21a2 and gba the TargetedLowVQL filter is applied if the VQL of a target variant in a low-homology region is less than 3.00.
The TargetedLowGQ filter is applied if the targeted variant has a GQ less than 3.
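The thresholds above can be summarized in a short sketch. The function and its arguments are hypothetical. It assumes that VQL is only passed for records in low-homology regions and that TargetedLowGQ applies to every target, since the text does not restrict it to specific targets.

```python
QUAL_FILTERED_TARGETS = {"lpa", "rh", "smn"}
VQL_FILTERED_TARGETS = {"cyp21a2", "gba"}


def additional_filters(target, qual=None, vql=None, gq=None):
    """Return the additional filter names implied by the thresholds above.

    Pass vql only for records in low-homology regions (assumption).
    Values left as None are not checked.
    """
    filters = []
    if target in QUAL_FILTERED_TARGETS and qual is not None and qual < 3.00:
        filters.append("TargetedLowQual")
    if target in VQL_FILTERED_TARGETS and vql is not None and vql < 3.00:
        filters.append("TargetedLowVQL")
    if gq is not None and gq < 3:  # assumed to apply to any target
        filters.append("TargetedLowGQ")
    return filters


print(additional_filters("smn", qual=2.5, gq=10))  # ['TargetedLowQual']
print(additional_filters("gba", vql=1.0, gq=2))  # ['TargetedLowVQL', 'TargetedLowGQ']
```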
Merging Targeted Calls In The hard-filtered Files
When the small variant caller is enabled, the targeted small variant VCF calls can be merged into the <output-file-prefix>.hard-filtered.vcf.gz and <output-file-prefix>.hard-filtered.gvcf.gz files (hereafter, the hard-filtered files). The --targeted-merge-vc command line option controls which targets have their small variant VCF records merged into the hard-filtered files. For example, --targeted-merge-vc rh enables merging of the calls from the rh caller into the hard-filtered files, and --targeted-merge-vc rh hba enables merging of the calls from the rh and hba targets into the hard-filtered files. The true value merges calls from all supported targets into the hard-filtered files, while the false value merges no calls into the hard-filtered files.
The targeted calls merged into the hard-filtered files are marked with a TARGETED INFO flag.
When enabled, targeted small variants are merged into the hard-filtered files regardless of any regions that may be provided using the --vc-target-bed option.
Merging Strategy
The merging strategy for targeted small variant calls is to prioritize the targeted calls over small variant calls from the germline small variant caller. When a germline small variant call overlaps a targeted caller call, the small variant call is filtered with a TargetedConflict filter if any of the following holds:
The targeted caller call is PASS.
The small variant call and targeted caller call have incompatible genotypes and the targeted caller call is not filtered with the TargetedLowGQ filter.
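The two conditions above can be sketched as a small decision function. The function name and inputs are hypothetical. How overlap and genotype compatibility are determined is not specified here, so both are taken as precomputed booleans.

```python
def small_variant_gets_targeted_conflict(overlaps, targeted_filters, genotypes_compatible):
    """Decide whether the germline small variant call is marked TargetedConflict.

    overlaps: True if the small variant call overlaps the targeted call.
    targeted_filters: filter names on the targeted call; empty means PASS.
    genotypes_compatible: True if the two calls have compatible genotypes.
    """
    if not overlaps:
        return False  # the rule only concerns overlapping calls
    if not targeted_filters:
        return True  # the targeted call is PASS
    return (not genotypes_compatible) and "TargetedLowGQ" not in targeted_filters


# Targeted call is PASS.
print(small_variant_gets_targeted_conflict(True, [], True))  # True
# Calls do not overlap.
print(small_variant_gets_targeted_conflict(False, [], False))  # False
# Targeted call has TargetedLowQual and a discordant genotype.
print(small_variant_gets_targeted_conflict(True, ["TargetedLowQual"], False))  # True
# Targeted call has TargetedLowGQ and a discordant genotype.
print(small_variant_gets_targeted_conflict(True, ["TargetedLowGQ"], False))  # False
```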
The strategy is summarized in the following examples.
The TARGETED call is PASS.
The TARGETED call and the small variant call are not overlapping.
The TARGETED call is filtered with TargetedLowQual and has a discordant variant representation with the overlapping small variant call.
The TARGETED call is filtered with TargetedLowQual and has a discordant genotype with the overlapping small variant call.
The TARGETED call is filtered with TargetedLowGQ and has a discordant genotype with the overlapping small variant call.
Command-Line Examples
The targeted caller can be enabled in parallel with other components as part of a human WGS germline analysis workflow (see DRAGEN Recipe - Germline WGS).
FASTQ Input Example
The following command-line example runs the targeted caller from FASTQ input:
Prealigned BAM Input Example
The following command-line example runs cyp21a2 only using BAM input without realignment: