1 of 9

Targeted Caller

Repetitive regions in the human genome pose a challenge for general variant calling approaches which typically cannot make use of potentially misplaced MAPQ0 reads. Furthermore, high sequence homology of some genes with a pseudogene paralog can lead to a wide variety of common structural variants (SVs) in the population, requiring specialized targeted calling approaches. DRAGEN supports targeted calling for a number of genes/targets as described in subsequent target-specific sections.

The targeted caller can be enabled using the command line option --enable-targeted=true or a subset of targets can be enabled by providing a space-separated list of target names. The supported target names are: cyp2b6, cyp2d6, cyp21a2, gba, hba, lpa, rh, and smn. For a list of all supported targeted caller options along with their default values, see Targeted Caller Options. The targeted caller produces a <output-file-prefix>.targeted.json file containing a summary of the variant caller results for each target. Additional detail of individual variant calls are reported in VCF format in the <output-file-prefix>.targeted.vcf.gz output file.

Input Data

The targeted caller requires WGS data aligned to a human reference genome with at least 30x coverage. The caller may be less reliable at lower coverage. Human reference genome builds based on hg19, hs37d5 (including GRCh37), or hg38 are supported. The targeted caller should not be enabled with low-coverage, exome or enrichment sequencing data.

Output Files

Targeted JSON File

The targeted caller generates a <output-file-prefix>.targeted.json file in the output directory. The output file is a JSON formatted file containing the fields below.

Fields in JSON

Explanation

Type and Possible Values

Present

Targeted VCF File

The targeted caller generates a <output-file-prefix>.targeted.vcf.gz file in the output directory. The output file is a VCFv4.2 formatted file. The targets that have VCF output are: cyp21a2, gba, hba, lpa, rh, and smn.

Small variants, structural variants, and copy number variants are reported in the same VCF file.

The <output-file-prefix>.targeted.vcf.gz file includes the following source header line:

##source=DRAGEN_TARGETED

For lpa, rh and smn targets, the EVENT and EVENTTYPE INFO fields are used to identify the called variants.

The EVENT and EVENTTYPE INFO fields are formally introduced in VCFv4.4 to enable the representation of complex rearrangements. This is achieved using the EVENT field to group all the related VCF records together, and the EVENTTYPE to classify the event. The corresponding header lines are the following.

##INFO=<ID=EVENT,Number=A,Type=String,Description="Event name">
##INFO=<ID=EVENTTYPE,Number=A,Type=String,Description="Type of associated event">

However, the use of EVENT is not limited to complex rearrangements and can be used to associate nonsymbolic alleles, for example in cases of variant position ambiguity in high homology regions.

Since the EVENTTYPE values are implementation-defined, custom EVENTTYPE header lines are included to describe each EVENTTYPE.

##EVENTTYPE=<ID=GENE_CONVERSION,Description="Gene conversion event">
##EVENTTYPE=<ID=VARIANT_IN_HOMOLOGY_REGION,Description="Variant in homology region">
##EVENTTYPE=<ID=VNTR,Description="Variable number tandem repeat">

For cyp21a2, gba, and hba targets, the ALLELE_ID INFO field is used to identify the called variant alleles.

##INFO=<ID=ALLELE_ID,Number=R,Type=String,Description="Identifier for each allele">

The missing value . is used when no identifier is available (e.g. a wild type allele) or applicable (e.g. allele index 0 for a structural variant record).

Nonrecombinant-like Variants In High Homology Regions

In the case of target variants in a high homology region, each variant is reported ambiguously at all corresponding homologous positions (i.e. in both the pseudogene and in the target gene). Additional analysis for these variants can be performed if absolute certainty that these variants are located in the target gene (e.g. in gba or cyp21a2) is required.

For lpa and smn the ploidy of the called genotype (FORMAT/GT field) corresponds to the combined copy number from all the homologous positions. For cyp21a2, gba and hba, this "joint" genotype from all the homologous positions is instead reported in a separate FORMAT/JGT field which is then collapsed into a diploid genotype and reported in the FORMAT/GT field. The following fields are reported for "joint" calls:

##INFO=<ID=JIDS,Number=.,Type=String,Description="IDs (from ID column) of calls associated with a joint genotype call in duplicated regions">
##FORMAT=<ID=JGT,Number=1,Type=String,Description="Joint genotype in duplicated regions">
##FORMAT=<ID=JGQ,Number=1,Type=Integer,Description="Quality of joint genotype in duplicated regions">
##FORMAT=<ID=JPL,Number=.,Type=Integer,Description="Normalized, Phred-scaled likelihoods for joint genotypes as defined in the VCF specification">
##FORMAT=<ID=JQL,Number=1,Type=Float,Description="Phred-scaled likelihood for homozygous reference joint genotype call">
##FORMAT=<ID=JVQL,Number=1,Type=Float,Description="Phred-scaled likelihood for nonvariant joint genotype call where overlapping deletion (*) ALT alleles are not considered to be variant.">
##FORMAT=<ID=JDP,Number=1,Type=Integer,Description="Total depth from all alleles in duplicated regions">
##FORMAT=<ID=JAD,Number=R,Type=Integer,Description="Total read depth for each allele in duplicated regions">
##FORMAT=<ID=JAF,Number=A,Type=Float,Description="Allele frequency for each alt allele in duplicated regions">

Note that the FORMAT/GQ and FORMAT/JGQ fields contain the unconditional genotype quality, unlike the VCF spec where FORMAT/GQ is defined as the genotype quality conditioned on the site being variant.

In the depicted example there are two genes A and B that include a high homology region. The usual process to call variants in this regions is to make a joint pileup of the reads aligning in both genes A and B and call the variants using a model with a ploidy proportional to the total copy number of the regions. This generates divergent possible genotypes that are equally likely since the variant cannot be confidently placed in either gene A or gene B. For lpa and smn the variant would be reported as follows:

chr1 100 . A T . TargetedRepeatConflict EVENT=GeneA-B:50A>T;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION GT 0/0/0/1
chr1 200 . A T . TargetedRepeatConflict EVENT=GeneA-B:50A>T;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION GT 0/0/0/1

Given the unconventional ploidy of the FORMAT/GT field used in this representation, a TargetedRepeatConflict filter is applied to these records. The header line for the filter is the following.

##FILTER=<ID=TargetedRepeatConflict,Description="Set if call is in a targeted repeat region that cannot be placed">

For cyp21a2, gba and hba, a conventional diploid FORMAT/GT is reported and so no TargetedRepeatConflict filter is applied. Due to the ambiguity in placing target variants in high homology regions, the corresponding QUAL and FORMAT/GQ fields can be much lower than conventional small variant calls (i.e. Phred 3 for a single variant allele copy across two homologous diploid positions). Therefore, instead of filtering on QUAL and FORMAT/GQ for these records, the records are filtered based on the FORMAT/JVQL and FORMAT/JGQ fields:

##FILTER=<ID=TargetedLowJGQ,Description="Set if call has JGQ < 3.">
##FILTER=<ID=TargetedLowJVQL,Description="Set if call has JVQL < 3.00.">

Since the wild type alleles at homologous positions may be different from each other or different from the reference alleles, an additional filter is applied when only wild type alleles are detected across the homologous positions. This avoids making ambiguous variant calls when no target variant of interest is detected.

##FILTER=<ID=TargetedWT,Description="Region-ambiguous targeted call with GT containing only wild type alleles, ignoring any overlapping deletions.">

Rh Gene Conversion Events

In the case of an identified gene conversion even in rh, a small variant is reported at each differentiating site in the acceptor region.

In the depicted example there are two genes A and B and gene A is the acceptor of a gene conversion from gene B (green box in the figure). Gene conversion are identified by observing variations in copy number at differentiating sites (blue and pink bars in the figure) in consecutive regions. Copy number variations between regions define the breakends of the gene conversion. An equivalent VCF representation for gene conversion would be using CNV and SV entries with breakends corresponding to the donor/acceptor regions, however, only the small variant representation is currently supported.

chr1 121 .   A T    . PASS EVENT=GC_AB;EVENTTYPE=GENE_CONVERSION; GT:PS 0|1:121
...
chr1 280 .   G A    . PASS EVENT=GC_AB;EVENTTYPE=GENE_CONVERSION; GT:PS 0|1:121

In the case of a detected gene conversion event, there may be differentiating sites with a genotype that is inconsistent with that gene conversion event. In these cases the RecombinantConflict filter is applied. The RecombinantConflict is defined by the following header line.

##FILTER=<ID=RecombinantConflict,Description="Set if call has a copy number that conflicts with a recombinant variant">

In the example, the resulting representation is as follows.

chr1 121 .   A T    . PASS EVENT=GC_AB;EVENTTYPE=GENE_CONVERSION; GT:PS 0|1:121
...
chr1 144 .   C T    . RecombinantConflict EVENT=GC_AB;EVENTTYPE=GENE_CONVERSION; GT:PS 1|1:121
chr1 153 .   A G    . RecombinantConflict EVENT=GC_AB;EVENTTYPE=GENE_CONVERSION; GT 0/0
...
chr1 280 .   G A    . PASS EVENT=GC_AB;EVENTTYPE=GENE_CONVERSION; GT:PS 0|1:121

Nonallelic Homologous Recombination

For cyp21a2 and gba, nonallelic homologous recombination can result in gene deletion or duplication in the case of reciprocal recombination or gene conversion in the case of nonreciprocal recombination. Both gene deletion and gene conversion can introduce loss-of-function variants and in both cases the targeted caller will report these variants in the target gene. In the case of gene deletion, the differentiating sites at the nontarget (i.e. pseudogene) positions will contain the overlapping deletion allele * while the differentiating sites in the target will contain any variant alleles. Although an equivalent VCF representation would be to simply report the deletion with a single structural variant VCF record, reporting small variant VCF records in the target gene allows for identification of the specific mutations that may occur in a gene transcript and matches well with annotation using HGVS nomenclature. Similarly, for gene conversions, variants are reported at differentiating sites in the target gene, rather than as pairs of structural variant breakends.

Calls at differentiating sites within the recombinant variant calling region will contain the same "joint" fields as are reported for nonrecombinant-like variants in high homology regions ( see Nonrecombinant-like Variants In High Homology Regions). However, the collapsed diploid FORMAT/GT will be based on any detected recombination events. Because detected recombinant variants are placed in the target gene, these records are filtered differently than the ambiguously placed, nonrecombinant-like variants in high homology regions. The INFO/Recombinant flag is added to calls derived from recombinant variant calling to distinguish them from nonrecombinant-like variant calls in high homology regions. The FORMAT/VQL field is used to apply the RecombinantLowVQL filter for low quality recombinant variants and the RecombinantREF filter is applied when the collapsed diploid FORMAT/GT contains only reference alleles.

##FORMAT=<ID=VQL,Number=1,Type=Float,Description="Phred-scaled likelihood for nonvariant genotype call where overlapping deletion (*) ALT alleles are not considered to be variant.">
##FILTER=<ID=RecombinantLowVQL,Description="Region-ambiguous targeted call at recombinant site with VQL below 0.50.">
##FILTER=<ID=RecombinantREF,Description="Region-ambiguous targeted call at recombinant site with GT containing only reference alleles, ignoring any overlapping deletions.">

Overlapping Structural Variant Representation

The use of GT=0 for symbolic structural variant alleles is formally disambiguated in VCFv4.4, specifying that "GT=0 indicates the absence of any of the ALT symbolic structural variants defined in the record". With this convention we can report compound overlapping heterozygous structural variants.

In the hba genotype depicted above, two overlapping SVs can be represented as follows:

chr16	170262	.	G	<DEL>,<DUP>	.	.	END=174517;IMPRECISE;SVLEN=4255,4255;SVCLAIM=DJ,DJ;ALLELE_ID=.,-a4.2,aaa4.2	GT	0/2
chr16	173301	.	A	<DEL>,<DUP>	.	.	END=177104;IMPRECISE;SVLEN=3804,3804;SVCLAIM=DJ,DJ;ALLELE_ID=.,-a3.7,aaa3.7	GT	0/1

The relevant header lines for the VCF records above are:

##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=SVLEN,Number=A,Type=Integer,Description="Length of structural variant">
##INFO=<ID=SVCLAIM,Number=A,Type=String,Description="Claim made by the structural variant call. Valid values are D, J, DJ for abundance, adjacency and both respectively.">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">

Variable Number Tandem Repeat Representation

In the depicted example there is a Variable Number Tandem Repeat (VNTR) region composed of three repeat units in the reference. The CN INFO field is used to report the allele copy number, the CN FORMAT field to is used report the region total copy number given by the sum of the allele copy numbers, and the REPCN FORMAT field is used to report the repeat unit copy number equal to the allele copy number multiplied by the number of repeat units in the reference.

This VNTR can be represented as follows:

chr1 100 . A <DUP>,<DUP> . . END=400;EVENT=A;EVENTTYPE=VNTR;SVCLAIM=D;SVLEN=300;CN=2.6,4.3   GT:CN:REPCN 1|2:6.9:8|13

The REPCN and CN header lines are:

##FORMAT=<ID=REPCN,Number=1,Type=String,Description="Number of repeat units spanned by the allele">
##INFO=<ID=CN,Number=A,Type=Float,Description="Copy number of CNV / breakpoint">
##FORMAT=<ID=CN,Number=1,Type=Float,Description="Estimated copy number">

Additional Filters

For lpa, rh and smn, the TargetedLowQual filter is applied if the QUAL of a target variant is less than 3.00.

##FILTER=<ID=TargetedLowQual,Description="Set if call has QUAL < 3.00">

Similarly, for cyp21a2 and gba the TargetedLowVQL filter is applied if the VQL of a target variant in low-homology region is less than 3.00.

##FORMAT=<ID=VQL,Number=1,Type=Float,Description="Phred-scaled likelihood for nonvariant genotype call where overlapping deletion (*) ALT alleles are not considered to be variant.">
##FILTER=<ID=TargetedLowVQL,Description="Set if call has VQL < 3.00.">

The TargetedLowGQ filter is applied if the targeted variant has GQ smaller than 3.

##FILTER=<ID=TargetedLowGQ,Description="Set if call has GQ < 3 and JGQ is not present.">

Merging Targeted Calls In The `hard-filtered` Files

When the small variant caller is enabled, the targeted small variant VCF calls can be merged into the <output-file-prefix>.hard-filtered.vcf.gz and <output-file-prefix>.hard-filtered.gvcf.gz files, briefly hard-filtered files. The --targeted-merge-vc command line option can be used to control which targets will have their small variant VCF records merged into the hard-filtered files. For example, --targeted-merge-vc rh will enable merging of the calls from the rh caller into the hard-filtered files and --targeted-merge-vc rh hba will enable merging of the calls from the rh and hba targets into the hard-filtered files. The true value will merge all calls from all supported targets into the hard-filtered files, while the false value will merge no calls into the hard-filtered files.

The targeted calls merged into the hard-filtered files are marked with a TARGETED INFO flag.

When enabled, targeted small variants are merged into the hard-filtered files regardless of any regions that may be provided using the --vc-target-bed option.

Merging Strategy

The merging strategy for targeted small variant calls is to prioritize the targeted calls over small variant calls from the germline small variant caller. When a germline small variant call overlaps a targeted caller call, then the small variant call is filtered with a TargetedConflict filter if any of the following holds:

The targeted caller call is PASS.
The small variant call and targeted caller call have incompatible genotypes and the targeted caller call is not filtered with the TargetedLowGQ filter.

The strategy is summarized in the following examples.

The TARGETED call is PASS.

chr1 100 . A	C	. TargetedConflict 	.			GT 0/1
chr1 100 . A	C	. PASS 				TARGETED 	GT 1/1

The TARGETED call and the small variant call are not overlapping

chr1 110 . T	TCA	. PASS 				. 			GT 0/1
chr1 111 . G	A	. PASS 				TARGETED 	GT 0/1

The TARGETED call is filtered with TargetedLowQual and has a discordant variant representation with the overlapping small variant call.

chr1 120 . ATTC A	. TargetedConflict	.			GT 0/1
chr1 121 . T	A	. TargetedLowQual	TARGETED 	GT 0/1
chr1 125 . TCAC T	. TargetedLowQual	TARGETED	GT 0/1
chr1 126 . C	G	. TargetedConflict	.			GT 0/1

The TARGETED call is filtered with TargetedLowQual and has a discordant genotype with the overlapping small variant call.

chr1 130 . C	G	. TargetedConflict	.			GT 0/1
chr1 130 . C	G	. TargetedLowQual	TARGETED 	GT 1/1

The TARGETED call is filtered with TargetedLowGQ and has a discordant genotype with the overlapping small variant call.

chr1 140 . AC	A	. PASS			.			GT:GQ 0/1:5
chr1 140 . A	T	. TargetedLowGQ	TARGETED 	GT:GQ 1/1:2

Command-Line Examples

The targeted caller can be enabled in parallel with other components as part of a human WGS germline analysis workflow (see DRAGEN Recipe - Germline WGS).

FASTQ Input Example

The following command-line example runs the targeted caller from FASTQ input:

dragen \
	-r /staging/human/reference/hg38_alt_aware/DRAGEN/${HASH_TABLE_VERSION} \
	--fastq-file1 /staging/test/data/NA12878_R1.fastq \
	--fastq-file2 /staging/test/data/NA12878_R2.fastq \
	--output-directory /staging/test/output \
	--output-file-prefix NA12878_dragen \
	--RGID DRAGEN_RGID \
	--RGSM NA12878 \
	--enable-targeted=true

Prealigned BAM Input Example

The following command-line example runs cyp21a2 only using BAM input without realignment:

dragen \
	-r /staging/human/reference/hg38_alt_aware/DRAGEN/${HASH_TABLE_VERSION} \
	--bam-input /staging/test/data/NA12878.bam \
	--output-directory /staging/test/output \
	--output-file-prefix NA12878_dragen \
	--enable-map-align=false \
	--enable-targeted=cyp21a2

CYPDB6 Caller

The CYP2B6 Caller is capable of genotyping the CYP2B6 gene from whole-genome sequencing (WGS) data. Due to high sequence similarity with its pseudogene paralog CYP2B7 and a wide variety of common structural variants (SVs), a specialized caller is necessary to resolve variants and identify likely star allele haplotypes.

The CYP2B6 Caller performs the following steps:

Determines total CYP2B6 and CYP2B7 copy number from read depth.
Determines CYP2B6-derived copy number at CYP2B6/CYP2B7 differentiating sites.
Detects SV breakpoints by calculating the changes in CYP2B6-derived copy number along the CYP2B6 gene.
Calls small variants in CYP2B6 copies.
Identifies star alleles from the detected SV breakpoints and small variants.
Identifies the most likely genotype for the called star alleles.

Total CYP2B6 and CYP2B7 Copy Number

The first step of CYP2B6 calling is to determine the combined copy number of CYP2B6 and CYP2B7. Reads aligned to regions in either CYP2B6 or CYP2B7 are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples. The combined CYP2B6 and CYP2B7 copy number is then calculated from the average sequencing depth across the CYP2B6 and CYP2B7 regions.

Differentiating Sites

The CYP2B6-derived copy number is calculated at 99 predefined differentiating sites across the CYP2B6 gene. The differentiating sites are selected at positions with sequence differences in CYP2B6 and CYP2B7 where calling the CYP2B6-derived copy number shows an accuracy of greater than 98% based on sequencing data from the 1000 Genomes Project.

For each differentiating site, CYP2B6-specific and CYP2B7-specific alleles are counted in reads mapping to either CYP2B6 or the homologous region in CYP2B7. The CYP2B6-derived copy number is then calculated from the two gene-specific allele counts using the total CYP2B6 and CYP2B7 copy number calculated from the previous step.

Structural Variant Calling

The CYP2B6-derived copy number along the CYP2B6 gene is used to identify known population structural variants (SVs), including whole gene deletions and duplications as well as certain gene conversions and gene fusions. The following fusion variants are detected:

Small Variant Calling

35 small variants that define various star alleles are detected from the read alignments. All of these variants are in unique (nonhomologous) regions of CYP2B6 with high mapping quality. Only reads mapping to CYP2B6 are used for calling variants in nonhomologous regions.

Samples with poor sequencing quality or greater than five copies of CYP2B6 will have allele counts with higher variance. This elevated variance increases the chance that the most likely variant copy number is wrong. To handle these cases, the small variant caller also indicates alternate, less likely variant copy numbers.

Recombinant Variant Calling

The recombinant (gene conversion) variant 18053A>G is detected by phasing the variant site with five flanking differentiating sites. When the haplotypes formed from phasing these sites supports the gene conversion in CYP2B6, a read depth analysis at the gene conversion breakpoints (transitions from either CYP2B6->CYP2B7 or CYP2B7->CYP2B6) is performed. When the posterior probability that there is at least one gene conversion variant is above 0.7 then DRAGEN uses the variant for star allele identification.

Star Allele Identification

The called SVs and small variant genotypes are matched against the definitions of 39 different star alleles. This might result in different sets of star alleles matching the called variant genotypes, such as with *1, *6 and *4, *49 where both sets of star alleles contain the same two small variants. When the small variant caller emits alternate, less likely variant copy numbers in addition to the most likely variant copy numbers this might result in different sets of star alleles being identified, since these alternate sets of variant copy numbers are also matched to the star allele definitions. The number of matched star alleles must match the number of CYP2B6-derived gene copies determined from previous steps. If no variant genotypes can be matched to a set of star alleles, the CYP2B6 Caller returns a no call during the genotyping step with filter value No_call.

Genotyping

Given a possible set of star alleles, the genotyping step attempts to identify the two likely haplotypes that contain all star alleles in the set. The likelihood of any given genotype is determined from a table of population frequencies determined from the 1000 Genomes Project and the genotype with the highest population frequency is selected. When two or more possible genotypes are identified with similar population frequencies, then all genotypes are emitted. This results in a call with filter value More_than_one_possible_genotype.

CYP2B6 Output File

The caller prints out its calls in the targeted caller output file, <output-file-prefix>.targeted.json that also contains calls from other targets (see Targeted JSON File).

An example of the CYP2B6 caller content in the output is as follows:

{
    "cyp2b6": {
        "genotype": "*17/*2",
        "genotypeFilter": "PASS",
        "phenotypeDatabaseAnnotation": "Normal Metabolizer"
    }
}

For CYP2B6 caller, the fields are defined as follows.

CYP2D6 Caller

The CYP2D6 Caller is capable of genotyping the CYP2D6 gene from whole-genome sequencing (WGS) data and is derived from the method implemented in Cyrius¹. Due to high sequence similarity with its pseudogene paralog CYP2D7 and a wide variety of common structural variants (SVs), a specialized caller is necessary to resolve variants and identify likely star allele haplotypes.

The CYP2D6 Caller performs the following steps:

Determines total CYP2D6 and CYP2D7 copy number from read depth.
Determines CYP2D6-derived copy number at CYP2D6/CYP2D7 differentiating sites.
Detects SV breakpoints by calculating the changes in CYP2D6-derived copy number along the CYP2D6 gene.
Calls small variants in CYP2D6 copies.
Identifies star alleles from the detected SV breakpoints and small variants.
Identifies the most likely genotype for the called star alleles.

Total CYP2D6 and CYP2D7 Copy Number

The first step of CYP2D6 calling is to determine the combined copy number of CYP2D6 and CYP2D7. Reads aligned to regions in either CYP2D6 or CYP2D7 are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples. The combined CYP2D6 and CYP2D7 copy number is then calculated from the average sequencing depth across the CYP2D6 and CYP2D7 regions.

Differentiating Sites

The CYP2D6-derived copy number is calculated at 117 predefined differentiating sites across the CYP2D6 gene. The differentiating sites are selected at positions with sequence differences in CYP2D6 and CYP2D7 where calling the CYP2D6-derived copy number shows an accuracy of greater than 98% based on sequencing data from the 1000 Genomes Project.

For each differentiating site, CYP2D6-specific and CYP2D7-specific alleles are counted in reads mapping to either CYP2D6 or the homologous region in CYP2D7. The CYP2D6-derived copy number is then calculated from the two gene-specific allele counts using the total CYP2D6 and CYP2D7 copy number calculated from the previous step.

Structural Variant Calling

The CYP2D6-derived copy number along the CYP2D6 gene is used to identify known population structural variants (SVs), including whole gene deletions and duplications as well as certain gene conversions and gene fusions. The following fusion variants are detected:

In addition to the exon 9 fusion breakpoints, exon 9 can participate in CYP2D7 gene conversion resulting in an embedded CYP2D7 sequence instead of a true hybrid. The structural variant caller also detects exon 9 gene conversions. Because only changes in CYP2D6-derived copy number yield structural variant calls, there might be rare cases where two hybrid copies result in no structural variant calls. For example, when both *36 and *13 with fusion breakpoint in exon 9 are present. However, the structural variant caller is capable of detecting multiple copies of the same fusion type (eg, *36x2) or cases where both an exon 9 gene conversion copy and an exon 9 2D6-2D7 hybrid are present.

Small Variant Calling

118 small variants that define various star alleles are detected from the read alignments. 96 of these variants are in unique (nonhomologous) regions of CYP2D6 with high mapping quality. Only reads mapping to CYP2D6 are used for calling variants in nonhomologous regions. The other 22 variants occur in homologous regions of CYP2D6 where reads mapping to either CYP2D6 or CYP2D7 are used for variant calling.

For each variant, reads containing either the variant allele or the nonvariant alleles are counted. A binomial model that incorporates the sequencing errors is then used to determine the most likely variant copy number (0 for nonvariant). A strand bias filter is applied to a small subset of variants that would otherwise tend to have false positive calls in the population.

Samples with poor sequencing quality or greater than five copies of CYP2D6 will have allele counts with higher variance. This elevated variance increases the chance that the most likely variant copy number is wrong. To handle these cases, the small variant caller also indicates alternate, less likely variant copy numbers.

Star Allele Identification

The called SVs and small variant genotypes are matched against the definitions of 128 different star alleles. This might result in different sets of star alleles matching the called variant genotypes, such as with *1, *46 and *43, *45 where both sets of star alleles contain the same 4 small variants. When the small variant caller emits alternate, less likely variant copy numbers in addition to the most likely variant copy numbers this might result in different sets of star alleles being identified, since these alternate sets of variant copy numbers are also matched to the star allele definitions. The number of matched star alleles must match the number of CYP2D6-derived gene copies determined from previous steps. When there are fewer than two CYP2D6-derived gene copies, then one or more *5 deletion haplotypes are included in the output set of star alleles. If all variant genotypes cannot be matched to a set of star alleles, the CYP2D6 Caller returns a no call during the genotyping step with filter value No_call.

Genotyping

Given a possible set of star alleles, the genotyping step attempts to identify the two likely haplotypes that contain all star alleles in the set. The deletion haplotype (*5) is considered as a possible haplotype during this process. The likelihood of any given genotype is determined from a table of population frequencies determined from the 1000 Genomes Project and the genotype with the highest population frequency is selected. When two or more possible genotypes are identified with similar population frequencies, then all genotypes are emitted. This results in a call with filter value More_than_one_possible_genotype.

CYP2D6 Output File

The CYP2D6 Caller prints out its calls in the targeted caller output file, <output-file-prefix>.targeted.json that also contains calls from other targets (see Targeted JSON File). An example of the CYP2D6 caller content in the output is as follows:

{
    "cyp2d6": {
        "genotype": "*17/*2",
        "genotypeFilter": "PASS",
        "phenotypeDatabaseAnnotation": "Normal Metabolizer"
    }
}

Each CYP2D6 genotype contains two haplotypes separated by a slash (eg *1/*2). Each haplotype consists of one or more star alleles separated by a plus sign (eg *10+*36). When a haplotype contains more than one copy of the same star allele, that star allele only appears once and is followed by a multiplication sign, and then the number of copies (eg *1x2 for two copies of *1).

¹Chen X, Shen F, Gonzaludo N, et al. Cyrius: accurate CYP2D6 genotyping using whole-genome sequencing data. The Pharmacogenomics Journal. 2021;21(2):251-261. doi:10.1038/s41397-020-00205-5

CYP21A2 Caller

The CYP21A2 Caller is capable of genotyping the CYP21A2 gene from whole-genome sequencing (WGS) data. Due to high sequence similarity with its pseudogene paralog CYP21A1P and a wide variety of common structural variants (SVs), a specialized caller is necessary to resolve variants.

The CYP21A2 calling workflow is broken up into the following major stages:

Loading input configuration
Processing read data
Analyzing read data

Read data analysis is further split into the following steps:

Determine total CYP21A2 and CYP21A1P copy number from read depth.
Call small variants in CYP21A2 copies.
Phase reads to detect common variants and recombination events.
Identify most likely haplotypes.

The CYP21A2 Caller requires WGS data aligned to a human reference genome with at least 30x coverage.

Total CYP21A2 and CYP21A1P Copy Number

The first step of CYP21A2 calling is to determine the combined copy number of CYP21A2 and CYP21A1P. Reads aligned to regions in either CYP21A2 or CYP21A1P are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples. The combined CYP21A2 and CYP21A1P copy number is then calculated from the average sequencing depth across the CYP21A2 and CYP21A1P regions.

Nonrecombinant-like Variant Calling

Of the known nonrecombinant-like variants, some are in unique (nonhomologous) regions of CYP21A2 with high mapping quality. Only reads mapping to CYP21A2 are used for calling variants in nonhomologous regions. The other variants occur in homologous regions of CYP21A2/CYP21A1P where reads mapping to either are used for variant calling.

For each variant, reads containing either the variant allele or the nonvariant allele is counted. A binomial model that incorporates the sequencing error rate is then used to determine the most likely variant copy number (0 for nonvariant).

For a list of the supported nonrecombinant-like variants, refer to the targeted/cyp21a2/target_variants_*.tsv files located in the resources directory of the DRAGEN install location.

Nonallelic Homologous Recombination Variant Calling

To analyze the homologous region even further, DRAGEN phases reads covering differentiating sites and known variant sites. Whenever a detected haplotype has a CYP21A2->CYP21A1P or CYP21A1P->CYP21A2 transition that is consistent with one of the known recombinant-like variants, the transition is considered as a candidate breakpoint for calling those variants. Reads containing phasing information for the two sites flanking each candidate breakpoint are used for variant calling. When the read data supports the hypothesis that the sample contains at least one copy of a candidate breakpoint, the associated haplotype is a recombinant haplotype candidate. Recombinant haplotype candidates are sorted by likelihood and the number of variant sites. If no wild type haplotype was detected, DRAGEN reports any detected homozygous recombinant haplotype, or up to two different recombinant haplotypes (i.e. compound het) if detected. If any wild type haplotype was found, DRAGEN reports a maximum of one recombinant haplotype. When no recombinant haplotypes are detected two wild type haplotypes are reported.

For a list of recombinant variant sites, refer to the targeted/cyp21a2/recombinant_variants_*.tsv files located in the resources directory of the DRAGEN install location.

Note that NM_000500.9:c.710_719delinsACGAGGAGAA will be reported as the following three variants on the same haplotype: NM_000500.9:c.710T>A NM_000500.9:c.713T>A NM_000500.9:c.719T>A

CYP21A2 Output File

The CYP21A2 Caller generates its output in the targeted caller output file <output-file-prefix>.targeted.json that also contains calls from other targets (see Targeted JSON File).

Note: A deletion-like recombinant variant haplotype (as opposed to a gene conversion-like recombinant variant haplotype) is defined as a haplotype with one or fewer switch sites (transitions from a CYP21A1P allele to a CYP21A2 allele) after excluding some sites with common gene conversions in CYP21A1P.

Each nonrecombinant-like variant reported in the variants array will have the fields below.

Recombinant-like and nonrecombinant-like variants are reported in VCF format. See Targeted VCF File for details about how these variants are reported in VCF.

Output File Example

An example of the CYP21A2 caller content in the <output-file-prefix>.targeted.json output file is shown below.

{        
    "cyp21a2": {
            "totalCopyNumber": 2,
            "deletionBreakpointInGene": true,
            "recombinantHaplotypes": [
                "NM_000500.9:c.92C>T+NM_000500.9:c.293-13C>G+NM_000500.9:c.332_339del+NM_000500.9:c.518T>A+NM_000500.9:c.710T>A+NM_000500.9:c.713T>A+NM_000500.9:c.719T>A+NM_000500.9:c.923dupT",
                "NM_000500.9:c.92C>T+NM_000500.9:c.293-13C>G+NM_000500.9:c.332_339del+NM_000500.9:c.518T>A+NM_000500.9:c.710T>A+NM_000500.9:c.713T>A+NM_000500.9:c.719T>A+NM_000500.9:c.923dupT+NM_000500.9:c.955C>T"
            ],
            "variants": [
                {
                    "alleleId": "NM_000500.9:c.1360C>T",
                    "alleleCopyNumber": 2,
                    "genotypeQuality": 18,
                    "filter": "PASS"
                }
            ]
    }
}

LPA Caller

The LPA Caller is capable of identifying the LPA Kringle-IV-2 (KIV-2) VNTR unit copy number from whole-genome sequencing (WGS) data. Due to high sequence similarity between the genes, a specialized caller is necessary to resolve the VNTR unit copy number.

The LPA Caller performs the following steps:

Determines total LPA KIV-2 VNTR unit copy number.
Determines the heterozygous LPA KIV-2 VNTR unit copy number if heterozygous KIV-2 markers are present.
Calls small variants in the LPA KIV-2 VNTR region based on the unit copy number along with allele counts from read information.

The LPA Caller requires WGS data aligned to a human reference genome with at least 30x coverage. Reference genome builds must be based on hg19, GRCh37, or hg38.

Total LPA KIV-2 VNTR Unit Copy Number

The first step of LPA calling is to determine the unit copy number of LPA KIV-2. Reads aligned to the LPA KIV-2 are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples.

Heterozygous LPA KIV-2 VNTR Unit Copy Number

The second step of LPA calling is to determine the heterozygous unit copy numbert of LPA KIV-2. Heterozygous unit copy number is determined using two specific linked SNV sites that have been identified as a combined marker allele that is always present in every copy of the repeat unit concordantly. That is, if any copy of the repeat unit in an LPA haplotype contains the ALT alleles at those two SNV sites, then every copy of the repeat unit in that LPA haplotype contains the ALT alleles at those two sites. The relative read coverage for the ALT and REF cases at these sites can therefore be used to determine the proportions of overall copy numbers across the KIV repeat array that belong to each haplotype.

Small Variant Calling

2 small variants are detected from the read alignments. These variants occur in the LPA KIV-2 VNTR region where reads mapping to either of the 6 units in the reference are used for variant calling.

For each variant, reads containing either the variant allele or the nonvariant allele are counted and a binomial model is used to determine the likelihood for each possible variant allele copy number up to the maximum possible as determined from the LPA KIV-2 VNTR unit copy number.

LPA Output File

The LPA Caller generates a <output-file-prefix>.targeted.json file in the output directory. The output file is a JSON formatted file containing the fields below.

The lpa fields are defined as below.

For the variants the fields are defined as below.

The LPA Caller also generates a <output-file-prefix>.targeted.vcf[.gz] file in the output directory. The output file is a VCFv4.2 formatted file possibly compressed.

JSON Output File Examples

Examples of the LPA Caller content in the output json file are shown below.

{
  "lpa": {
    "kiv2CopyNumber": 39.236019179048256,
    "refMarkerAlleleCopyNumber": 17.84675290975861,
    "altMarkerAlleleCopyNumber": 21.389266269289646,
    "type": "Heterozygous markers call",
    "variants": [
      {
        "hgvs": "LPA:4925G>A",
        "qual": 0.0,
        "altCopyNumber": 0,
        "altCopyNumberQuality": 150.0
      },
      {
        "hgvs": "LPA:4733G>A",
        "qual": 135.16839207370617,
        "altCopyNumber": 1,
        "altCopyNumberQuality": 18.641166763477106
      }
    ]
  }
}

{
  "lpa":  {
    "kiv2CopyNumber": 28.709054093341653,
    "refMarkerAlleleCopyNumber": null,
    "altMarkerAlleleCopyNumber": null,
    "type": "Homozygous REF markers call",
        "variants": [
      {
        "hgvs": "LPA:4925G>A",
        "qual": 0.0,
        "altCopyNumber": 0,
        "altCopyNumberQuality": 150.0
      },
      {
        "hgvs": "LPA:4733G>A",
        "qual": 132.8341692747592,
        "altCopyNumber": 1,
        "altCopyNumberQuality": 15.776967125542765
      }
    ]
  }
}

VCF Output File Examples

The following are example output files:

##fileformat=VCFv4.2
...
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG00096
chr6	160613491	.	A	<DUP>,<DUP>	.	PASS	SVCLAIM=D,D;END=160646772;SVLEN=33281,33281;CN=2.974459,3.564878;EVENT=LPA:KIV2,.;EVENTTYPE=VNTR,.	GT:CN:REPCN:PS	1|2:6.539337:18|21:160613491
chr6	160613786	.	T	G	.	PASS	EVENT=LPA:296T>G;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:PS	0|1:160613491
chr6	160614754	.	C	G	.	PASS	EVENT=LPA:1264C>G;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:PS	0|1:160613491
chr6 160618484 . C T 0.00 TargetedLowQual;TargetedRepeatConflict EVENT=LPA:4925G>A;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION GT:GQ 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0:150
chr6	160618676	.	C	T	135.17	TargetedRepeatConflict	EVENT=LPA:4733G>A;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:GQ	0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1:19
...
chr6	160641520	.	T	G	.	PASS	EVENT=LPA:296T>G;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:PS	0|1:160613491
chr6	160642488	.	C	G	.	PASS	EVENT=LPA:1264C>G;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:PS	0|1:160613491
chr6	160646214	.	C	T	0.00	TargetedLowQual;TargetedRepeatConflict	EVENT=LPA:4925G>A;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:GQ	0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0:150
chr6	160646406	.	C	T	135.17	TargetedRepeatConflict	EVENT=LPA:4733G>A;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:GQ	0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1:19

##fileformat=VCFv4.2
...
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG00097
chr6	160613491	.	A	<DUP>,<CNV>	.	PASS	SVCLAIM=D,D;END=160646772;SVLEN=33281,33281;CN=.,.;EVENT=LPA:KIV2,.;EVENTTYPE=VNTR,.	GT:CN:REPCN:PS	1|2:4.784842:.|.:160613491
chr6	160613786	.	T	G	.	PASS	EVENT=LPA:296T>G;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:PS	0/0:160613491
chr6	160614754	.	C	G	.	PASS	EVENT=LPA:1264C>G;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:PS	0/0:160613491
chr6	160618484	.	C	T	0.00	TargetedLowQual;TargetedRepeatConflict	EVENT=LPA:4925G>A;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:GQ	0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0:150
chr6	160618676	.	C	T	132.83	TargetedRepeatConflict	EVENT=LPA:4733G>A;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:GQ	0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1:16
...
chr6	160641520	.	T	G	.	PASS	EVENT=LPA:296T>G;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:PS	0/0:160613491
chr6	160642488	.	C	G	.	PASS	EVENT=LPA:1264C>G;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:PS	0/0:160613491
chr6	160646214	.	C	T	0.00	TargetedLowQual;TargetedRepeatConflict	EVENT=LPA:4925G>A;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:GQ	0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0:150
chr6	160646406	.	C	T	132.83	TargetedRepeatConflict	EVENT=LPA:4733G>A;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:GQ	0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1:16

Rh Caller

The Rh Caller is capable of identifying a common gene conversion between RHD and RHCE genes from whole-genome sequencing (WGS) data, that is referred to as RHCE Exon2 gene conversion. Due to high sequence similarity between the genes, a specialized caller is necessary to resolve the gene conversion between the pair of genes. We consider 798 loci, called differentiating sites, that represents differences between the RHD and RHCE genes, that are well preserved in the population.

The Rh Caller performs the following steps:

Determines total copy number from read depth of the RHD and RHCE regions.
Detect RHD -> RHCE breakpoints that are consistent with the RHCE Exon2 gene conversion.

The Rh Caller requires WGS data aligned to a human reference genome with at least 30x coverage. Reference genome builds must be based on hg19, GRCh37, or hg38.

The Rh Caller is run by default when the small variant caller is enabled, the sample is a not a tumor sample, and the sample is detected as WGS by the Ploidy Estimator.

Total Combined RHD and RHCE Copy Number

The first step of Rh calling is to determine the copy number of RHD and RHCE regions. Reads aligned to the RHD and RHCE regions are counted according to their support of the differentiating sits. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples.

Haplotype Phasing in the Exon 2 Region

A collection of 4 differentiating sites in the exon 2 region of RHD and RHCE are used to detect the presence of the RHCE Exon2 gene conversion in the sample. An iterative phasing algorithm is used to build up haplotypes that are supported by the read data. The phasing algorithm starts with candidate haplotypes formed from all possible bases at the first differentiating site. The haplotypes are then extended at the next differentiating site by considering all reads that can be uniquely assigned to a single candidate haplotype. If these reads support only a single base at the next differentiating site for a given candidate haplotype, then the haplotype is extended with that base. When a candidate haplotype can be extended by both bases at the next differentiating site then both possible extended haplotypes are included in the set of candidate haplotypes, growing the set by 1. Subsequent extension steps are performed at neighboring differentiating sites until all sites have been processed. Some haplotypes may have sites that are unresolved (i.e. ambiguous), but these haplotypes can still participate in RHD -> RHCE breakpoint detection.

Recombinant-like Variant Calling

When the phased haplotypes support the RHCE Exon2 gene conversion. We visit all the differentiating sites ad report them as variants in the output VCF file with ploidy identified using the copy number estimated from the read depth of the differentiating site.

Rh Output File

The Rh Caller generates a <output-file-prefix>.targeted.json file in the output directory. The output file is a JSON formatted file containing the fields below.

The rh fields are defined as below.

For the variants the fields are defined as below.

Examples of the Rh Caller content in the output json file are shown below.

{
  "rh": {
    "totalCopyNumber": 4,
    "rhdCopyNumber": 2,
    "rhceCopyNumber": 2,
    "variants": [
      {
        "hgvs": "NC_000001.11g.25405596_25409676con25283766_25287797",
        "qual": 63.42360372675952,
        "altCopyNumber": 2,
        "altCopyNumberQuality": 74.27989942583051
      }
    ]
  }
}

{
  "rh": {
    "totalCopyNumber": 4,
    "rhdCopyNumber": 2,
    "rhceCopyNumber": 2,
    "variants": [
      {
        "hgvs": "NC_000001.11g.25405596_25409676con25283766_25287797",
        "qual": 0.0,
        "altCopyNumber": 0,
        "altCopyNumberQuality": 0.026368888668874203
      }
    ]
  }
}

The Rh Caller also generates a <output-file-prefix>.targeted.vcf[.gz] file in the output directory. The output file is a VCFv4.2 formatted file, possibly compressed.

The following are example output files:

##fileformat=VCFv4.2
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG002
chr1 25405596 . G A 128.19 PASS EVENT=NC_000001.11g.25405596_25409676con25283766_25287797;EVENTTYPE=GENE_CONVERSION GT:GQ 1/1:128
chr1 25405674 . A T 134.41 PASS EVENT=NC_000001.11g.25405596_25409676con25283766_25287797;EVENTTYPE=GENE_CONVERSION GT:GQ 1/1:134
...
chr1 25409655 . G C 133.16 PASS EVENT=NC_000001.11g.25405596_25409676con25283766_25287797;EVENTTYPE=GENE_CONVERSION GT:GQ 1/1:133
chr1 25409676 . G A 104.46 PASS EVENT=NC_000001.11g.25405596_25409676con25283766_25287797;EVENTTYPE=GENE_CONVERSION GT:GQ 1/1:104

##fileformat=VCFv4.2
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00110
chr1 25405596 . G A 44.96 PASS EVENT=NC_000001.11g.25405596_25409676con25283766_25287797;EVENTTYPE=GENE_CONVERSION GT:GQ 0/0:45
chr1 25405674 . A T 35.71 PASS EVENT=NC_000001.11g.25405596_25409676con25283766_25287797;EVENTTYPE=GENE_CONVERSION GT:GQ 0/0:36
...
chr1 25409655 . G C 32.72 PASS EVENT=NC_000001.11g.25405596_25409676con25283766_25287797;EVENTTYPE=GENE_CONVERSION GT:GQ 0/0:33
chr1 25409676 . G A 27.11 PASS EVENT=NC_000001.11g.25405596_25409676con25283766_25287797;EVENTTYPE=GENE_CONVERSION GT:GQ 0/0:27

SMN Caller

Disruption of all copies of the SMN1 gene in an individual causes spinal muscular atrophy (SMA). SMN1 has a high identity paralog, SMN2. SMN2 differs only in approximately 10 SNVs and small indels. For example, hg19 chr5:70247773 C-> T affects splicing and largely disrupts the production of functional SMN protein from SMN2. Due to the high-similarity duplication combined with common-copy number variation, standard whole-genome sequencing (WGS) analysis does not produce complete variant calling results for SMN. Since 95% of SMA cases result from the absence of the functional C (SMN1) allele in any copy of SMN¹, a targeted calling solution can be effective in detecting SMA.

DRAGEN offers the following two independent components that can call the SMN1 copy number using WGS data from a germline sample.

ExpansionHunter
SMN Caller

SMA Calling With ExpansionHunter

SMA calling is implemented together with repeat expansion detection using sequence-graph realignment to align reads to a single reference that represents SMN1 and SMN2.

In addition to the standard diploid genotype call, SMA Calling with ExpansionHunter uses a direct statistical test to check for presence of any C allele. If a C allele is not detected, the sample is called affected, otherwise unaffected.

SMA calling is only supported for human whole-genome sequencing with PCR-free libraries.

To enable SMA calling along with repeat expansion detection, set the --repeat-genotype-enable option to true. For information on graph-alignment options, see Repeat Expansion Detection with ExpansionHunter.

To activate SMA calling, the variant specification catalog file must include a description of the targeted SMN1/SMN2 variant. The <INSTALL_PATH>/resources/repeat-specs/experimental folder contains example files.

The <output-file-prefix>.repeat.vcf file includes SMN output along with any targeted repeats. SMN output is represented as a single SNV call at the splice-affecting position in SMN1 with SMA status in the following custom fields.

SMA Result in repeat.vcf Output File

Field

Description

SMN Caller

The SMN Caller calls SMN1 and SMN2 copy numbers and detects the presence of a SNP, NM_000344.4:c.*3+80T>G that is associated with the two-copy SMN1 allele. The caller is derived from the method implemented in Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data.²

To enable the SMN Caller, use --enable-smn=true as part of a germline-only WGS analysis workflow. Additionally, it can also be enabled along with other targets from the targeted caller by using the option --enable-targeted=true. The SMN Caller is disabled by default.

The SMN Caller performs the following steps:

Determines total and intact SMN copy numbers
Calls SMN1 copy number at eight differentiating sites
Determines copy number for NM_000344.4:c.*3+80T>G

The SMN Caller requires WGS data aligned to a human reference genome with at least 30x coverage

Total and Intact SMN Copy Number

Two common copy-number variants (CNVs) in SMN1 and SMN2 include whole gene CNV and a partial gene deletion of exons 7 and 8. Reads that align to either SMN1 or SMN2 are counted. The read counts in exon 1 through exon 6 are used to determine total SMN copy number. The read counts in exon 7 and 8 are used to determine the SMN copies that do not have the exon 7 and 8 deletion (intact SMN copy number). To estimate the SMN copy number for these two regions, read counts are normalized to a diploid baseline derived from 3000 preselected 2 kb regions across the genome. The 3000 normalization regions are randomly selected from the portion of the reference genome that has stable coverage across population samples. The SMN Caller then calculates the number of SMN copies that have the exon 7 and 8 deletion by subtracting the intact SMN copy number from the total SMN copy number.

SMN1 Copy Number at Differentiating Sites

To calculate the SMN1 copy number, the caller uses eight predefined differentiating sites in exons 7 and 8 of SMN1 and SMN2. One of these sites is the splice site variant used for SMA calling with ExpansionHunter (see SMA Calling With ExpansionHunter). The caller selects differentiating sites at positions that have sequence differences between SMN1 and SMN2 where calling the SMN1 copy number is most likely to be correct based on sequencing data from the 1000 Genomes Project.

For each differentiating site, the SMN1-specific and SMN2-specific alleles are counted in reads mapping to either SMN1 or the homologous region in SMN2. The caller uses a binomial model to calculate the likelihood of each possible SMN1 copy number from the two gene-specific counts given the intact SMN copy number calculated in the previous step.

Copy number call for `NM_000344.4:c.*3+80T>G`

The SNP NM_000344.4:c.*3+80T>G (also referred to as g.27134T>G) has been reported in the literature to be associated with the two-copy SMN1 allele.

For this high-homology region SNP, reads mapping to either SMN1 or SMN2 are used for variant calling. The number of reads containing the variant allele and the nonvariant allele are counted and then a binomial model that incorporates the sequencing error rate is used to determine the most likely variant allele copy number (0 for nonvariant).

SMN Output File

The SMN Caller prints out its calls in the targeted caller output file, <output-file-prefix>.targeted.json that also contains calls from other targets (see Targeted JSON File). An example of the SMN caller content in this file is shown below.

"smn": {
    "smn1CopyNumber": 3,
    "smn2CopyNumber": 1,
    "smn2Delta78CopyNumber": 0,
    "totalCopyNumber": "3.89",
    "fullLengthCopyNumber": "4.10",
    "variants": [
    {
        "hgvs": "NM_000344.4:c.*3+80T>G",
        "qual": null,
        "altCopyNumber": 1,
        "altCopyNumberQuality": 54.63854344932561
    }
    ]
}

For SMN caller, the fields are defined as follows.

Each variant reported in the variants array will have the fields below.

The variant NM_000344.4:c.*3+80T>G is also reported in a <output-file-prefix>.targeted.vcf[.gz] file in the output directory. The output file is a VCFv4.2 formatted file and possibly compressed. The variant is reported with the VARIANT_IN_HOMOLOGY_REGION flag in the INFO field and also filtered with the TargetedRepeatConflict filter. This variant lies in a region of homology between SMN1 and SMN2 and hence this variant is reported twice - once for each SMN1 and SMN2 regions - and is connected by the same EVENT in the INFO field. The ploidy of the variant is reported in concordance with the identified genotype.

An example of the vcf entry for the variant NM_000344.4:c.*3+80T>G is as follows.

##fileformat=VCFv4.2
...
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG04038
chr5	70076654	.	T	G	150	TargetedRepeatConflict	EVENT=NM_000344.4:c.*3+80T>G;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:GQ	0/0/0/1:55
chr5	70952074	.	T	G	150	TargetedRepeatConflict	EVENT=NM_000344.4:c.*3+80T>G;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:GQ	0/0/0/1:55
...

The variant NM_000344.4:c.*3+80T>G in the <output-file-prefix>.targeted.vcf[.gz] file can also be included in the <output-file-prefix>.hard-filtered.vcf[.gz] by including smn in the --targeted-merge-vc list, i.e. --targeted-merge-vc smn. The output file <output-file-prefix>.targeted.vcf[.gz] is compressed by default. This option can be disabled using --enable-vcf-compression=false.

References

¹Wirth B. An update of the mutation spectrum of the survival motor neuron gene (SMN1) in autosomal recessive spinal muscular atrophy (SMA). Human Mutation. 2000;15(3):228-237. doi:10.1002/(SICI)1098-1004(200003)15:3<228::AID-HUMU3> 3.0.CO;2-9

²Chen X, Sanchis-Juan A, French CE, et al. Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data. Genetics in Medicine. 2020;22(5):945-953. doi: 10.1038/s41436-020-0754-0

Targeted Caller

Input Data

Output Files

Targeted JSON File

The targeted caller generates a <output-file-prefix>.targeted.json file in the output directory. The output file is a JSON formatted file containing the fields below.

Fields in JSON

Explanation

Type and Possible Values

Present

Targeted VCF File

Small variants, structural variants, and copy number variants are reported in the same VCF file.

The <output-file-prefix>.targeted.vcf.gz file includes the following source header line:

##source=DRAGEN_TARGETED

For lpa, rh and smn targets, the EVENT and EVENTTYPE INFO fields are used to identify the called variants.

##INFO=<ID=EVENT,Number=A,Type=String,Description="Event name">
##INFO=<ID=EVENTTYPE,Number=A,Type=String,Description="Type of associated event">

However, the use of EVENT is not limited to complex rearrangements and can be used to associate nonsymbolic alleles, for example in cases of variant position ambiguity in high homology regions.

Since the EVENTTYPE values are implementation-defined, custom EVENTTYPE header lines are included to describe each EVENTTYPE.

##EVENTTYPE=<ID=GENE_CONVERSION,Description="Gene conversion event">
##EVENTTYPE=<ID=VARIANT_IN_HOMOLOGY_REGION,Description="Variant in homology region">
##EVENTTYPE=<ID=VNTR,Description="Variable number tandem repeat">

For cyp21a2, gba, and hba targets, the ALLELE_ID INFO field is used to identify the called variant alleles.

##INFO=<ID=ALLELE_ID,Number=R,Type=String,Description="Identifier for each allele">

The missing value . is used when no identifier is available (e.g. a wild type allele) or applicable (e.g. allele index 0 for a structural variant record).

Nonrecombinant-like Variants In High Homology Regions

##INFO=<ID=JIDS,Number=.,Type=String,Description="IDs (from ID column) of calls associated with a joint genotype call in duplicated regions">
##FORMAT=<ID=JGT,Number=1,Type=String,Description="Joint genotype in duplicated regions">
##FORMAT=<ID=JGQ,Number=1,Type=Integer,Description="Quality of joint genotype in duplicated regions">
##FORMAT=<ID=JPL,Number=.,Type=Integer,Description="Normalized, Phred-scaled likelihoods for joint genotypes as defined in the VCF specification">
##FORMAT=<ID=JQL,Number=1,Type=Float,Description="Phred-scaled likelihood for homozygous reference joint genotype call">
##FORMAT=<ID=JVQL,Number=1,Type=Float,Description="Phred-scaled likelihood for nonvariant joint genotype call where overlapping deletion (*) ALT alleles are not considered to be variant.">
##FORMAT=<ID=JDP,Number=1,Type=Integer,Description="Total depth from all alleles in duplicated regions">
##FORMAT=<ID=JAD,Number=R,Type=Integer,Description="Total read depth for each allele in duplicated regions">
##FORMAT=<ID=JAF,Number=A,Type=Float,Description="Allele frequency for each alt allele in duplicated regions">

chr1 100 . A T . TargetedRepeatConflict EVENT=GeneA-B:50A>T;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION GT 0/0/0/1
chr1 200 . A T . TargetedRepeatConflict EVENT=GeneA-B:50A>T;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION GT 0/0/0/1

Given the unconventional ploidy of the FORMAT/GT field used in this representation, a TargetedRepeatConflict filter is applied to these records. The header line for the filter is the following.

##FILTER=<ID=TargetedRepeatConflict,Description="Set if call is in a targeted repeat region that cannot be placed">

##FILTER=<ID=TargetedLowJGQ,Description="Set if call has JGQ < 3.">
##FILTER=<ID=TargetedLowJVQL,Description="Set if call has JVQL < 3.00.">

##FILTER=<ID=TargetedWT,Description="Region-ambiguous targeted call with GT containing only wild type alleles, ignoring any overlapping deletions.">

Rh Gene Conversion Events

In the case of an identified gene conversion even in rh, a small variant is reported at each differentiating site in the acceptor region.

chr1 121 .   A T    . PASS EVENT=GC_AB;EVENTTYPE=GENE_CONVERSION; GT:PS 0|1:121
...
chr1 280 .   G A    . PASS EVENT=GC_AB;EVENTTYPE=GENE_CONVERSION; GT:PS 0|1:121

##FILTER=<ID=RecombinantConflict,Description="Set if call has a copy number that conflicts with a recombinant variant">

In the example, the resulting representation is as follows.

chr1 121 .   A T    . PASS EVENT=GC_AB;EVENTTYPE=GENE_CONVERSION; GT:PS 0|1:121
...
chr1 144 .   C T    . RecombinantConflict EVENT=GC_AB;EVENTTYPE=GENE_CONVERSION; GT:PS 1|1:121
chr1 153 .   A G    . RecombinantConflict EVENT=GC_AB;EVENTTYPE=GENE_CONVERSION; GT 0/0
...
chr1 280 .   G A    . PASS EVENT=GC_AB;EVENTTYPE=GENE_CONVERSION; GT:PS 0|1:121

Nonallelic Homologous Recombination

##FORMAT=<ID=VQL,Number=1,Type=Float,Description="Phred-scaled likelihood for nonvariant genotype call where overlapping deletion (*) ALT alleles are not considered to be variant.">
##FILTER=<ID=RecombinantLowVQL,Description="Region-ambiguous targeted call at recombinant site with VQL below 0.50.">
##FILTER=<ID=RecombinantREF,Description="Region-ambiguous targeted call at recombinant site with GT containing only reference alleles, ignoring any overlapping deletions.">

Overlapping Structural Variant Representation

In the hba genotype depicted above, two overlapping SVs can be represented as follows:

chr16	170262	.	G	<DEL>,<DUP>	.	.	END=174517;IMPRECISE;SVLEN=4255,4255;SVCLAIM=DJ,DJ;ALLELE_ID=.,-a4.2,aaa4.2	GT	0/2
chr16	173301	.	A	<DEL>,<DUP>	.	.	END=177104;IMPRECISE;SVLEN=3804,3804;SVCLAIM=DJ,DJ;ALLELE_ID=.,-a3.7,aaa3.7	GT	0/1

The relevant header lines for the VCF records above are:

##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=SVLEN,Number=A,Type=Integer,Description="Length of structural variant">
##INFO=<ID=SVCLAIM,Number=A,Type=String,Description="Claim made by the structural variant call. Valid values are D, J, DJ for abundance, adjacency and both respectively.">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">

Variable Number Tandem Repeat Representation

This VNTR can be represented as follows:

chr1 100 . A <DUP>,<DUP> . . END=400;EVENT=A;EVENTTYPE=VNTR;SVCLAIM=D;SVLEN=300;CN=2.6,4.3   GT:CN:REPCN 1|2:6.9:8|13

The REPCN and CN header lines are:

##FORMAT=<ID=REPCN,Number=1,Type=String,Description="Number of repeat units spanned by the allele">
##INFO=<ID=CN,Number=A,Type=Float,Description="Copy number of CNV / breakpoint">
##FORMAT=<ID=CN,Number=1,Type=Float,Description="Estimated copy number">

Additional Filters

For lpa, rh and smn, the TargetedLowQual filter is applied if the QUAL of a target variant is less than 3.00.

##FILTER=<ID=TargetedLowQual,Description="Set if call has QUAL < 3.00">

Similarly, for cyp21a2 and gba the TargetedLowVQL filter is applied if the VQL of a target variant in low-homology region is less than 3.00.

##FORMAT=<ID=VQL,Number=1,Type=Float,Description="Phred-scaled likelihood for nonvariant genotype call where overlapping deletion (*) ALT alleles are not considered to be variant.">
##FILTER=<ID=TargetedLowVQL,Description="Set if call has VQL < 3.00.">

The TargetedLowGQ filter is applied if the targeted variant has GQ smaller than 3.

##FILTER=<ID=TargetedLowGQ,Description="Set if call has GQ < 3 and JGQ is not present.">

Merging Targeted Calls In The `hard-filtered` Files

The targeted calls merged into the hard-filtered files are marked with a TARGETED INFO flag.

When enabled, targeted small variants are merged into the hard-filtered files regardless of any regions that may be provided using the --vc-target-bed option.

Merging Strategy

The targeted caller call is PASS.
The small variant call and targeted caller call have incompatible genotypes and the targeted caller call is not filtered with the TargetedLowGQ filter.

The strategy is summarized in the following examples.

The TARGETED call is PASS.

chr1 100 . A	C	. TargetedConflict 	.			GT 0/1
chr1 100 . A	C	. PASS 				TARGETED 	GT 1/1

The TARGETED call and the small variant call are not overlapping

chr1 110 . T	TCA	. PASS 				. 			GT 0/1
chr1 111 . G	A	. PASS 				TARGETED 	GT 0/1

The TARGETED call is filtered with TargetedLowQual and has a discordant variant representation with the overlapping small variant call.

chr1 120 . ATTC A	. TargetedConflict	.			GT 0/1
chr1 121 . T	A	. TargetedLowQual	TARGETED 	GT 0/1
chr1 125 . TCAC T	. TargetedLowQual	TARGETED	GT 0/1
chr1 126 . C	G	. TargetedConflict	.			GT 0/1

The TARGETED call is filtered with TargetedLowQual and has a discordant genotype with the overlapping small variant call.

chr1 130 . C	G	. TargetedConflict	.			GT 0/1
chr1 130 . C	G	. TargetedLowQual	TARGETED 	GT 1/1

The TARGETED call is filtered with TargetedLowGQ and has a discordant genotype with the overlapping small variant call.

chr1 140 . AC	A	. PASS			.			GT:GQ 0/1:5
chr1 140 . A	T	. TargetedLowGQ	TARGETED 	GT:GQ 1/1:2

Command-Line Examples

The targeted caller can be enabled in parallel with other components as part of a human WGS germline analysis workflow (see DRAGEN Recipe - Germline WGS).

FASTQ Input Example

The following command-line example runs the targeted caller from FASTQ input:

dragen \
	-r /staging/human/reference/hg38_alt_aware/DRAGEN/${HASH_TABLE_VERSION} \
	--fastq-file1 /staging/test/data/NA12878_R1.fastq \
	--fastq-file2 /staging/test/data/NA12878_R2.fastq \
	--output-directory /staging/test/output \
	--output-file-prefix NA12878_dragen \
	--RGID DRAGEN_RGID \
	--RGSM NA12878 \
	--enable-targeted=true

Prealigned BAM Input Example

The following command-line example runs cyp21a2 only using BAM input without realignment:

dragen \
	-r /staging/human/reference/hg38_alt_aware/DRAGEN/${HASH_TABLE_VERSION} \
	--bam-input /staging/test/data/NA12878.bam \
	--output-directory /staging/test/output \
	--output-file-prefix NA12878_dragen \
	--enable-map-align=false \
	--enable-targeted=cyp21a2

Targeted Caller

Input Data

Output Files

Targeted JSON File

Targeted VCF File

Nonrecombinant-like Variants In High Homology Regions

Rh Gene Conversion Events

Nonallelic Homologous Recombination

Overlapping Structural Variant Representation

Variable Number Tandem Repeat Representation

Additional Filters

Merging Targeted Calls In The hard-filtered Files

Merging Strategy

Command-Line Examples

FASTQ Input Example

Prealigned BAM Input Example

CYPDB6 Caller

Total CYP2B6 and CYP2B7 Copy Number

Differentiating Sites

Structural Variant Calling

Small Variant Calling

Recombinant Variant Calling

Star Allele Identification

Genotyping

CYP2B6 Output File

CYP2D6 Caller

Total CYP2D6 and CYP2D7 Copy Number

Differentiating Sites

Structural Variant Calling

Small Variant Calling

Star Allele Identification

Genotyping

CYP2D6 Output File

CYP21A2 Caller

Total CYP21A2 and CYP21A1P Copy Number

Nonrecombinant-like Variant Calling

Nonallelic Homologous Recombination Variant Calling

CYP21A2 Output File

Output File Example

LPA Caller

Total LPA KIV-2 VNTR Unit Copy Number

Heterozygous LPA KIV-2 VNTR Unit Copy Number

Small Variant Calling

LPA Output File

JSON Output File Examples

VCF Output File Examples

Rh Caller

Total Combined RHD and RHCE Copy Number

Haplotype Phasing in the Exon 2 Region

Recombinant-like Variant Calling

Rh Output File

SMN Caller

SMA Calling With ExpansionHunter

SMA Result in repeat.vcf Output File

SMN Caller

Total and Intact SMN Copy Number

SMN1 Copy Number at Differentiating Sites

Copy number call for NM_000344.4:c.*3+80T>G

SMN Output File

References

CYPDB6 Caller

Total CYP2B6 and CYP2B7 Copy Number

Differentiating Sites

Structural Variant Calling

Small Variant Calling

Recombinant Variant Calling

Star Allele Identification

Genotyping

CYP2B6 Output File

LPA Caller

Total LPA KIV-2 VNTR Unit Copy Number

Heterozygous LPA KIV-2 VNTR Unit Copy Number

Small Variant Calling

LPA Output File

JSON Output File Examples

VCF Output File Examples

CYP2D6 Caller

Total CYP2D6 and CYP2D7 Copy Number

Differentiating Sites

Structural Variant Calling

Merging Targeted Calls In The `hard-filtered` Files

Copy number call for `NM_000344.4:c.*3+80T>G`

Merging Targeted Calls In The `hard-filtered` Files

Copy number call for `NM_000344.4:c.*3+80T>G`