Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
The CYP2B6 Caller is capable of genotyping the CYP2B6 gene from whole-genome sequencing (WGS) data. Due to high sequence similarity with its pseudogene paralog CYP2B7 and a wide variety of common structural variants (SVs), a specialized caller is necessary to resolve variants and identify likely star allele haplotypes.
The CYP2B6 Caller performs the following steps:
Determines total CYP2B6 and CYP2B7 copy number from read depth.
Determines CYP2B6-derived copy number at CYP2B6/CYP2B7 differentiating sites.
Detects SV breakpoints by calculating the changes in CYP2B6-derived copy number along the CYP2B6 gene.
Calls small variants in CYP2B6 copies.
Identifies star alleles from the detected SV breakpoints and small variants.
Identifies the most likely genotype for the called star alleles.
The first step of CYP2B6 calling is to determine the combined copy number of CYP2B6 and CYP2B7. Reads aligned to regions in either CYP2B6 or CYP2B7 are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples. The combined CYP2B6 and CYP2B7 copy number is then calculated from the average sequencing depth across the CYP2B6 and CYP2B7 regions.
The CYP2B6-derived copy number is calculated at 99 predefined differentiating sites across the CYP2B6 gene. The differentiating sites are selected at positions with sequence differences in CYP2B6 and CYP2B7 where calling the CYP2B6-derived copy number shows an accuracy of greater than 98% based on sequencing data from the 1000 Genomes Project.
For each differentiating site, CYP2B6-specific and CYP2B7-specific alleles are counted in reads mapping to either CYP2B6 or the homologous region in CYP2B7. The CYP2B6-derived copy number is then calculated from the two gene-specific allele counts using the total CYP2B6 and CYP2B7 copy number calculated from the previous step.
The CYP2B6-derived copy number along the CYP2B6 gene is used to identify known population structural variants (SVs), including whole gene deletions and duplications as well as certain gene conversions and gene fusions. The following fusion variants are detected:
35 small variants that define various star alleles are detected from the read alignments. All of these variants are in unique (nonhomologous) regions of CYP2B6 with high mapping quality. Only reads mapping to CYP2B6 are used for calling variants in nonhomologous regions.
For each variant, reads containing either the variant allele or the nonvariant alleles are counted. A binomial model that incorporates the sequencing errors is then used to determine the most likely variant copy number (0 for nonvariant).
Samples with poor sequencing quality or greater than five copies of CYP2B6 will have allele counts with higher variance. This elevated variance increases the chance that the most likely variant copy number is wrong. To handle these cases, the small variant caller also indicates alternate, less likely variant copy numbers.
The recombinant (gene conversion) variant 18053A>G is detected by phasing the variant site with five flanking differentiating sites. When the haplotypes formed from phasing these sites supports the gene conversion in CYP2B6, a read depth analysis at the gene conversion breakpoints (transitions from either CYP2B6->CYP2B7 or CYP2B7->CYP2B6) is performed. When the posterior probability that there is at least one gene conversion variant is above 0.7 then DRAGEN uses the variant for star allele identification.
The called SVs and small variant genotypes are matched against the definitions of 39 different star alleles. This might result in different sets of star alleles matching the called variant genotypes, such as with *1
, *6
and *4
, *49
where both sets of star alleles contain the same two small variants. When the small variant caller emits alternate, less likely variant copy numbers in addition to the most likely variant copy numbers this might result in different sets of star alleles being identified, since these alternate sets of variant copy numbers are also matched to the star allele definitions. The number of matched star alleles must match the number of CYP2B6-derived gene copies determined from previous steps. If no variant genotypes can be matched to a set of star alleles, the CYP2B6 Caller returns a no call during the genotyping step with filter value No_call
.
Given a possible set of star alleles, the genotyping step attempts to identify the two likely haplotypes that contain all star alleles in the set. The likelihood of any given genotype is determined from a table of population frequencies determined from the 1000 Genomes Project and the genotype with the highest population frequency is selected. When two or more possible genotypes are identified with similar population frequencies, then all genotypes are emitted. This results in a call with filter value More_than_one_possible_genotype
.
The caller prints out its calls in the targeted caller output file, <output-file-prefix>.targeted.json
that also contains calls from other targets (see Targeted JSON File).
An example of the CYP2B6 caller content in the output is as follows:
For CYP2B6 caller, the fields are defined as follows.
The Rh Caller is capable of identifying a common gene conversion between RHD and RHCE genes from whole-genome sequencing (WGS) data, that is referred to as RHCE Exon2 gene conversion. Due to high sequence similarity between the genes, a specialized caller is necessary to resolve the gene conversion between the pair of genes. We consider 798 loci, called differentiating sites, that represents differences between the RHD and RHCE genes, that are well preserved in the population.
The Rh Caller performs the following steps:
Determines total copy number from read depth of the RHD and RHCE regions.
Detect RHD -> RHCE breakpoints that are consistent with the RHCE Exon2 gene conversion.
The Rh Caller requires WGS data aligned to a human reference genome with at least 30x coverage. Reference genome builds must be based on hg19
, GRCh37
, or hg38
.
The Rh Caller is run by default when the small variant caller is enabled, the sample is a not a tumor sample, and the sample is detected as WGS by the Ploidy Estimator.
The first step of Rh calling is to determine the copy number of RHD and RHCE regions. Reads aligned to the RHD and RHCE regions are counted according to their support of the differentiating sits. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples.
A collection of 4 differentiating sites in the exon 2 region of RHD and RHCE are used to detect the presence of the RHCE Exon2 gene conversion in the sample. An iterative phasing algorithm is used to build up haplotypes that are supported by the read data. The phasing algorithm starts with candidate haplotypes formed from all possible bases at the first differentiating site. The haplotypes are then extended at the next differentiating site by considering all reads that can be uniquely assigned to a single candidate haplotype. If these reads support only a single base at the next differentiating site for a given candidate haplotype, then the haplotype is extended with that base. When a candidate haplotype can be extended by both bases at the next differentiating site then both possible extended haplotypes are included in the set of candidate haplotypes, growing the set by 1. Subsequent extension steps are performed at neighboring differentiating sites until all sites have been processed. Some haplotypes may have sites that are unresolved (i.e. ambiguous), but these haplotypes can still participate in RHD -> RHCE breakpoint detection.
When the phased haplotypes support the RHCE Exon2 gene conversion. We visit all the differentiating sites ad report them as variants in the output VCF file with ploidy identified using the copy number estimated from the read depth of the differentiating site.
The Rh Caller generates a <output-file-prefix>.targeted.json
file in the output directory. The output file is a JSON formatted file containing the fields below.
The rh
fields are defined as below.
For the variants
the fields are defined as below.
Examples of the Rh Caller content in the output json file are shown below.
The Rh Caller also generates a <output-file-prefix>.targeted.vcf[.gz]
file in the output directory. The output file is a VCFv4.2
formatted file, possibly compressed.
The following are example output files:
The CYP2D6 Caller is capable of genotyping the CYP2D6 gene from whole-genome sequencing (WGS) data and is derived from the method implemented in Cyrius¹. Due to high sequence similarity with its pseudogene paralog CYP2D7 and a wide variety of common structural variants (SVs), a specialized caller is necessary to resolve variants and identify likely star allele haplotypes.
The CYP2D6 Caller performs the following steps:
Determines total CYP2D6 and CYP2D7 copy number from read depth.
Determines CYP2D6-derived copy number at CYP2D6/CYP2D7 differentiating sites.
Detects SV breakpoints by calculating the changes in CYP2D6-derived copy number along the CYP2D6 gene.
Calls small variants in CYP2D6 copies.
Identifies star alleles from the detected SV breakpoints and small variants.
Identifies the most likely genotype for the called star alleles.
The first step of CYP2D6 calling is to determine the combined copy number of CYP2D6 and CYP2D7. Reads aligned to regions in either CYP2D6 or CYP2D7 are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples. The combined CYP2D6 and CYP2D7 copy number is then calculated from the average sequencing depth across the CYP2D6 and CYP2D7 regions.
The CYP2D6-derived copy number is calculated at 117 predefined differentiating sites across the CYP2D6 gene. The differentiating sites are selected at positions with sequence differences in CYP2D6 and CYP2D7 where calling the CYP2D6-derived copy number shows an accuracy of greater than 98% based on sequencing data from the 1000 Genomes Project.
For each differentiating site, CYP2D6-specific and CYP2D7-specific alleles are counted in reads mapping to either CYP2D6 or the homologous region in CYP2D7. The CYP2D6-derived copy number is then calculated from the two gene-specific allele counts using the total CYP2D6 and CYP2D7 copy number calculated from the previous step.
The CYP2D6-derived copy number along the CYP2D6 gene is used to identify known population structural variants (SVs), including whole gene deletions and duplications as well as certain gene conversions and gene fusions. The following fusion variants are detected:
In addition to the exon 9 fusion breakpoints, exon 9 can participate in CYP2D7 gene conversion resulting in an embedded CYP2D7 sequence instead of a true hybrid. The structural variant caller also detects exon 9 gene conversions. Because only changes in CYP2D6-derived copy number yield structural variant calls, there might be rare cases where two hybrid copies result in no structural variant calls. For example, when both *36
and *13
with fusion breakpoint in exon 9 are present. However, the structural variant caller is capable of detecting multiple copies of the same fusion type (eg, *36x2
) or cases where both an exon 9 gene conversion copy and an exon 9 2D6-2D7 hybrid are present.
118 small variants that define various star alleles are detected from the read alignments. 96 of these variants are in unique (nonhomologous) regions of CYP2D6 with high mapping quality. Only reads mapping to CYP2D6 are used for calling variants in nonhomologous regions. The other 22 variants occur in homologous regions of CYP2D6 where reads mapping to either CYP2D6 or CYP2D7 are used for variant calling.
For each variant, reads containing either the variant allele or the nonvariant alleles are counted. A binomial model that incorporates the sequencing errors is then used to determine the most likely variant copy number (0 for nonvariant). A strand bias filter is applied to a small subset of variants that would otherwise tend to have false positive calls in the population.
Samples with poor sequencing quality or greater than five copies of CYP2D6 will have allele counts with higher variance. This elevated variance increases the chance that the most likely variant copy number is wrong. To handle these cases, the small variant caller also indicates alternate, less likely variant copy numbers.
The called SVs and small variant genotypes are matched against the definitions of 128 different star alleles. This might result in different sets of star alleles matching the called variant genotypes, such as with *1
, *46
and *43
, *45
where both sets of star alleles contain the same 4 small variants. When the small variant caller emits alternate, less likely variant copy numbers in addition to the most likely variant copy numbers this might result in different sets of star alleles being identified, since these alternate sets of variant copy numbers are also matched to the star allele definitions. The number of matched star alleles must match the number of CYP2D6-derived gene copies determined from previous steps. When there are fewer than two CYP2D6-derived gene copies, then one or more *5
deletion haplotypes are included in the output set of star alleles. If all variant genotypes cannot be matched to a set of star alleles, the CYP2D6 Caller returns a no call during the genotyping step with filter value No_call
.
Given a possible set of star alleles, the genotyping step attempts to identify the two likely haplotypes that contain all star alleles in the set. The deletion haplotype (*5
) is considered as a possible haplotype during this process. The likelihood of any given genotype is determined from a table of population frequencies determined from the 1000 Genomes Project and the genotype with the highest population frequency is selected. When two or more possible genotypes are identified with similar population frequencies, then all genotypes are emitted. This results in a call with filter value More_than_one_possible_genotype
.
Each CYP2D6 genotype contains two haplotypes separated by a slash (eg *1/*2
). Each haplotype consists of one or more star alleles separated by a plus sign (eg *10+*36
). When a haplotype contains more than one copy of the same star allele, that star allele only appears once and is followed by a multiplication sign, and then the number of copies (eg *1x2
for two copies of *1
).
¹Chen X, Shen F, Gonzaludo N, et al. Cyrius: accurate CYP2D6 genotyping using whole-genome sequencing data. The Pharmacogenomics Journal. 2021;21(2):251-261. doi:10.1038/s41397-020-00205-5
The CYP21A2 Caller is capable of genotyping the CYP21A2 gene from whole-genome sequencing (WGS) data. Due to high sequence similarity with its pseudogene paralog CYP21A1P and a wide variety of common structural variants (SVs), a specialized caller is necessary to resolve variants.
The CYP21A2 calling workflow is broken up into the following major stages:
Loading input configuration
Processing read data
Analyzing read data
Read data analysis is further split into the following steps:
Determine total CYP21A2 and CYP21A1P copy number from read depth.
Call small variants in CYP21A2 copies.
Phase reads to detect common variants and recombination events.
Identify most likely haplotypes.
The CYP21A2 Caller requires WGS data aligned to a human reference genome with at least 30x coverage.
The first step of CYP21A2 calling is to determine the combined copy number of CYP21A2 and CYP21A1P. Reads aligned to regions in either CYP21A2 or CYP21A1P are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples. The combined CYP21A2 and CYP21A1P copy number is then calculated from the average sequencing depth across the CYP21A2 and CYP21A1P regions.
Of the known nonrecombinant-like variants, some are in unique (nonhomologous) regions of CYP21A2 with high mapping quality. Only reads mapping to CYP21A2 are used for calling variants in nonhomologous regions. The other variants occur in homologous regions of CYP21A2/CYP21A1P where reads mapping to either are used for variant calling.
For each variant, reads containing either the variant allele or the nonvariant allele is counted. A binomial model that incorporates the sequencing error rate is then used to determine the most likely variant copy number (0 for nonvariant).
For a list of the supported nonrecombinant-like variants, refer to the targeted/cyp21a2/target_variants_*.tsv
files located in the resources
directory of the DRAGEN install location.
To analyze the homologous region even further, DRAGEN phases reads covering differentiating sites and known variant sites. Whenever a detected haplotype has a CYP21A2->CYP21A1P or CYP21A1P->CYP21A2 transition that is consistent with one of the known recombinant-like variants, the transition is considered as a candidate breakpoint for calling those variants. Reads containing phasing information for the two sites flanking each candidate breakpoint are used for variant calling. When the read data supports the hypothesis that the sample contains at least one copy of a candidate breakpoint, the associated haplotype is a recombinant haplotype candidate. Recombinant haplotype candidates are sorted by likelihood and the number of variant sites. If no wild type haplotype was detected, DRAGEN reports any detected homozygous recombinant haplotype, or up to two different recombinant haplotypes (i.e. compound het) if detected. If any wild type haplotype was found, DRAGEN reports a maximum of one recombinant haplotype. When no recombinant haplotypes are detected two wild type haplotypes are reported.
For a list of recombinant variant sites, refer to the targeted/cyp21a2/recombinant_variants_*.tsv
files located in the resources
directory of the DRAGEN install location.
Note that NM_000500.9:c.710_719delinsACGAGGAGAA will be reported as the following three variants on the same haplotype: NM_000500.9:c.710T>A NM_000500.9:c.713T>A NM_000500.9:c.719T>A
Note: A deletion-like recombinant variant haplotype (as opposed to a gene conversion-like recombinant variant haplotype) is defined as a haplotype with one or fewer switch sites (transitions from a CYP21A1P allele to a CYP21A2 allele) after excluding some sites with common gene conversions in CYP21A1P.
Each nonrecombinant-like variant reported in the variants
array will have the fields below.
An example of the CYP21A2 caller content in the <output-file-prefix>.targeted.json
output file is shown below.
The GBA Caller is capable of detecting both recombinant-like and nonrecombinant-like variants in the GBA gene from whole-genome sequencing (WGS) data. Disruption of all copies of the GBA gene in an individual causes the autosomal recessive disorder Gaucher disease, and carriers are at increased risk of Parkinson's disease and Lewy body dementia. Due to high sequence similarity with its pseudogene paralog GBAP1, calling recombinant-like variants in GBA requires a specialized caller.
To enable the GBA Caller, use --enable-gba=true
as part of a germline-only WGS analysis workflow. The GBA Caller is disabled by default and requires WGS data aligned to a human reference genome with at least 30x coverage.
The GBA Caller performs the following steps:
Determine the total combined GBA and GBAP1 copy number
Detect nonrecombinant-like variants from a set of 111 known variants
Assemble phased haplotypes in the exon 9-11 region where recombinant variants occur
Detect any GBAP1 -> GBA breakpoints that are consistent with one of the 7 known recombinant-like variants
A 10 kb region of unique sequence in between GBA and GBAP1 is used to compute the copy number change due to reciprocal recombination events. Reads that align to this 10 kb region are counted and the count is normalized to a diploid baseline derived from 3000 preselected 2 kb regions across the genome. The 3000 normalization regions are randomly selected from the portion of the reference genome that has stable coverage across population samples. The total combined GBA and GBAP1 copy number is then calculated as two more than the copy number of this 10 kb region.
Of the known nonrecombinant-like variants, some are in unique (nonhomologous) regions of GBA with high mapping quality. Only reads mapping to GBA are used for calling variant in nonhomologous regions. The other variants occur in homologous regions of GBA/GBAP1 where reads mapping to either GBA or GBAP1 are used for variant calling.
For each variant, reads containing the variant allele and the nonvariant alleles are counted. A binomial model that incorporates the sequencing error rate is then used to determine the most likely variant allele copy number (0 for nonvariant).
For a list of the supported nonrecombinant-like variants, refer to the targeted/gba/target_variants_*.tsv
files located in the resources
directory of the DRAGEN install location.
A collection of 10 differentiating sites in the exon 9-11 region of GBA are used to detect the GBA and GBAP1 haplotypes present in the sample. An iterative phasing algorithm is used to build up haplotypes that are supported by the read data. The phasing algorithm starts with seed sites which are then iteratively extended to neighboring sites. At each iteration, reads that can be unambiguously assigned to one of the detected partial haplotypes are used to extend the next neighboring site for each partial haplotype. Iteration continues until all sites have been extended. Some haplotypes may have sites that are unresolved (i.e. ambiguous), but these haplotypes can still participate in GBA -> GBAP1 breakpoint detection.
If any of the 10 differentiating sites in exon 9-11 indicate that there is no wild type GBA allele copies, then the sample is called as homozygous variant and the recombinant-like variant that best matches the depth calls at the 10 sites is reported.
When the sample is not homozygous variant, the phased haplotypes are used to detect heterozygous variants. The detected haplotypes are compared against a set of 7 known recombinant-like variants: A495P, L483P, D448H, c.1263del, RecNciI, RecTL, c.1263del+RecTL). Whenever a detected haplotype has a GBA->GBAP1 or GBAP1->GBA transition that is consistent with one of these 7 known recombinant-like variants, the transition is considered as a candidate breakpoint for calling that recombinant-like variant. Reads containing phasing information for the two sites flanking each candidate breakpoint are used for variant calling. When the read data supports the hypothesis that the sample contains at least one copy of a candidate breakpoint , the associated haplotype is a recombinant haplotype candidate. Recombinant haplotype candidates are sorted by likelihood and the number of variant sites. If no wild type haplotype was detected, DRAGEN reports any detected homozygous recombinant haplotype, or up to two different recombinant haplotypes (i.e. compound het) if detected. If any wild type haplotype was found, DRAGEN reports a maximum of one recombinant haplotype. When no recombinant haplotypes are detected two wild type haplotypes are reported.
The caller can detect the following recombinant variant haplotypes: A495P, L483P, D448H, 1263del, RecNciI, RecTL, and c.1263del+RecTL. Note: RecNciI, RecTL, and c.1263del+RecTL maye be deletion-like recombinant variants. A deletion-like recombinant variant haplotype (as opposed to a gene conversion-like recombinant variant haplotype) is defined as a haplotype with one or fewer switch sites (transitions from a GBAP1 allele to a GBA allele).
The table below shows the HGVS identifiers associated with each recombinant variant haplotype.
Each nonrecombinant-like variant reported in the variants
array will have the fields below.
An example of the GBA caller content in the <output-file-prefix>.targeted.json
output file is shown below.
The LPA Caller is capable of identifying the LPA Kringle-IV-2 (KIV-2) VNTR unit copy number from whole-genome sequencing (WGS) data. Due to high sequence similarity between the genes, a specialized caller is necessary to resolve the VNTR unit copy number.
The LPA Caller performs the following steps:
Determines total LPA KIV-2 VNTR unit copy number.
Determines the heterozygous LPA KIV-2 VNTR unit copy number if heterozygous KIV-2 markers are present.
Calls small variants in the LPA KIV-2 VNTR region based on the unit copy number along with allele counts from read information.
The LPA Caller requires WGS data aligned to a human reference genome with at least 30x coverage. Reference genome builds must be based on hg19
, GRCh37
, or hg38
.
The first step of LPA calling is to determine the unit copy number of LPA KIV-2. Reads aligned to the LPA KIV-2 are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples.
The second step of LPA calling is to determine the heterozygous unit copy numbert of LPA KIV-2. Heterozygous unit copy number is determined using two specific linked SNV sites that have been identified as a combined marker allele that is always present in every copy of the repeat unit concordantly. That is, if any copy of the repeat unit in an LPA haplotype contains the ALT alleles at those two SNV sites, then every copy of the repeat unit in that LPA haplotype contains the ALT alleles at those two sites. The relative read coverage for the ALT and REF cases at these sites can therefore be used to determine the proportions of overall copy numbers across the KIV repeat array that belong to each haplotype.
2 small variants are detected from the read alignments. These variants occur in the LPA KIV-2 VNTR region where reads mapping to either of the 6 units in the reference are used for variant calling.
For each variant, reads containing either the variant allele or the nonvariant allele are counted and a binomial model is used to determine the likelihood for each possible variant allele copy number up to the maximum possible as determined from the LPA KIV-2 VNTR unit copy number.
The LPA Caller generates a <output-file-prefix>.targeted.json
file in the output directory. The output file is a JSON formatted file containing the fields below.
The lpa
fields are defined as below.
For the variants
the fields are defined as below.
The LPA Caller also generates a <output-file-prefix>.targeted.vcf[.gz]
file in the output directory. The output file is a VCFv4.2
formatted file possibly compressed.
Examples of the LPA Caller content in the output json file are shown below.
The following are example output files:
Disruption of all copies of the SMN1 gene in an individual causes spinal muscular atrophy (SMA). SMN1 has a high identity paralog, SMN2. SMN2 differs only in approximately 10 SNVs and small indels. For example, hg19 chr5:70247773 C-> T affects splicing and largely disrupts the production of functional SMN protein from SMN2. Due to the high-similarity duplication combined with common-copy number variation, standard whole-genome sequencing (WGS) analysis does not produce complete variant calling results for SMN. Since 95% of SMA cases result from the absence of the functional C (SMN1) allele in any copy of SMN¹, a targeted calling solution can be effective in detecting SMA.
DRAGEN offers the following two independent components that can call the SMN1 copy number using WGS data from a germline sample.
ExpansionHunter
SMN Caller
SMA calling is implemented together with repeat expansion detection using sequence-graph realignment to align reads to a single reference that represents SMN1 and SMN2.
In addition to the standard diploid genotype call, SMA Calling with ExpansionHunter uses a direct statistical test to check for presence of any C allele. If a C allele is not detected, the sample is called affected, otherwise unaffected.
SMA calling is only supported for human whole-genome sequencing with PCR-free libraries.
To enable SMA calling along with repeat expansion detection, set the --repeat-genotype-enable
option to true
. For information on graph-alignment options, see .
To activate SMA calling, the variant specification catalog file must include a description of the targeted SMN1/SMN2 variant. The <INSTALL_PATH>/resources/repeat-specs/experimental
folder contains example files.
The <output-file-prefix>.repeat.vcf
file includes SMN output along with any targeted repeats. SMN output is represented as a single SNV call at the splice-affecting position in SMN1 with SMA status in the following custom fields.
Field | Description |
---|
The SMN Caller calls SMN1 and SMN2 copy numbers and detects the presence of a SNP, NM_000344.4:c.*3+80T>G
that is associated with the two-copy SMN1 allele. The caller is derived from the method implemented in Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data.²
To enable the SMN Caller, use --enable-smn=true
as part of a germline-only WGS analysis workflow. Additionally, it can also be enabled along with other targets from the targeted caller by using the option --enable-targeted=true
. The SMN Caller is disabled by default.
The SMN Caller performs the following steps:
Determines total and intact SMN copy numbers
Calls SMN1 copy number at eight differentiating sites
Determines copy number for NM_000344.4:c.*3+80T>G
The SMN Caller requires WGS data aligned to a human reference genome with at least 30x coverage
Two common copy-number variants (CNVs) in SMN1 and SMN2 include whole gene CNV and a partial gene deletion of exons 7 and 8. Reads that align to either SMN1 or SMN2 are counted. The read counts in exon 1 through exon 6 are used to determine total SMN copy number. The read counts in exon 7 and 8 are used to determine the SMN copies that do not have the exon 7 and 8 deletion (intact SMN copy number). To estimate the SMN copy number for these two regions, read counts are normalized to a diploid baseline derived from 3000 preselected 2 kb regions across the genome. The 3000 normalization regions are randomly selected from the portion of the reference genome that has stable coverage across population samples. The SMN Caller then calculates the number of SMN copies that have the exon 7 and 8 deletion by subtracting the intact SMN copy number from the total SMN copy number.
To calculate the SMN1 copy number, the caller uses eight predefined differentiating sites in exons 7 and 8 of SMN1 and SMN2. One of these sites is the splice site variant used for SMA calling with ExpansionHunter (see SMA Calling With ExpansionHunter). The caller selects differentiating sites at positions that have sequence differences between SMN1 and SMN2 where calling the SMN1 copy number is most likely to be correct based on sequencing data from the 1000 Genomes Project.
For each differentiating site, the SMN1-specific and SMN2-specific alleles are counted in reads mapping to either SMN1 or the homologous region in SMN2. The caller uses a binomial model to calculate the likelihood of each possible SMN1 copy number from the two gene-specific counts given the intact SMN copy number calculated in the previous step.
NM_000344.4:c.*3+80T>G
For this high-homology region SNP, reads mapping to either SMN1 or SMN2 are used for variant calling. The number of reads containing the variant allele and the nonvariant allele are counted and then a binomial model that incorporates the sequencing error rate is used to determine the most likely variant allele copy number (0 for nonvariant).
For SMN caller, the fields are defined as follows.
Each variant reported in the variants
array will have the fields below.
The variant NM_000344.4:c.*3+80T>G
is also reported in a <output-file-prefix>.targeted.vcf[.gz]
file in the output directory. The output file is a VCFv4.2
formatted file and possibly compressed. The variant is reported with the VARIANT_IN_HOMOLOGY_REGION
flag in the INFO
field and also filtered with the TargetedRepeatConflict
filter. This variant lies in a region of homology between SMN1 and SMN2 and hence this variant is reported twice - once for each SMN1 and SMN2 regions - and is connected by the same EVENT
in the INFO
field. The ploidy of the variant is reported in concordance with the identified genotype.
An example of the vcf entry for the variant NM_000344.4:c.*3+80T>G is as follows.
The variant NM_000344.4:c.*3+80T>G in the <output-file-prefix>.targeted.vcf[.gz]
file can also be included in the <output-file-prefix>.hard-filtered.vcf[.gz]
by including smn
in the --targeted-merge-vc
list, i.e. --targeted-merge-vc smn
. The output file <output-file-prefix>.targeted.vcf[.gz]
is compressed by default. This option can be disabled using --enable-vcf-compression=false
.
¹Wirth B. An update of the mutation spectrum of the survival motor neuron gene (SMN1) in autosomal recessive spinal muscular atrophy (SMA). Human Mutation. 2000;15(3):228-237. doi:10.1002/(SICI)1098-1004(200003)15:3<228::AID-HUMU3> 3.0.CO;2-9
²Chen X, Sanchis-Juan A, French CE, et al. Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data. Genetics in Medicine. 2020;22(5):945-953. doi: 10.1038/s41436-020-0754-0
Repetitive regions in the human genome pose a challenge for general variant calling approaches which typically cannot make use of potentially misplaced MAPQ0 reads. Furthermore, high sequence homology of some genes with a pseudogene paralog can lead to a wide variety of common structural variants (SVs) in the population, requiring specialized targeted calling approaches. DRAGEN supports targeted calling for a number of genes/targets as described in subsequent target-specific sections.
The targeted caller can be enabled using the command line option --enable-targeted=true
or a subset of targets can be enabled by providing a space-separated list of target names. The supported target names are: cyp2b6
, cyp2d6
, cyp21a2
, gba
, hba
, lpa
, rh
, and smn
. For a list of all supported targeted caller options along with their default values, see . The targeted caller produces a <output-file-prefix>.targeted.json
file containing a summary of the variant caller results for each target. Additional detail of individual variant calls are reported in VCF format in the <output-file-prefix>.targeted.vcf.gz
output file.
The targeted caller requires WGS data aligned to a human reference genome with at least 30x coverage. The caller may be less reliable at lower coverage. Human reference genome builds based on hg19
, hs37d5
(including GRCh37
), or hg38
are supported. The targeted caller should not be enabled with low-coverage, exome or enrichment sequencing data.
The targeted caller generates a <output-file-prefix>.targeted.json
file in the output directory. The output file is a JSON formatted file containing the fields below.
Fields in JSON | Explanation | Type and Possible Values | Present |
---|
The targeted caller generates a <output-file-prefix>.targeted.vcf.gz
file in the output directory. The output file is a VCFv4.2
formatted file. The targets that have VCF output are: cyp21a2, gba, hba, lpa, rh, and smn.
Small variants, structural variants, and copy number variants are reported in the same VCF file.
The <output-file-prefix>.targeted.vcf.gz
file includes the following source
header line:
For lpa, rh and smn targets, the EVENT
and EVENTTYPE
INFO fields are used to identify the called variants.
The EVENT
and EVENTTYPE
INFO fields are formally introduced in VCFv4.4
to enable the representation of complex rearrangements. This is achieved using the EVENT
field to group all the related VCF records together, and the EVENTTYPE
to classify the event. The corresponding header lines are the following.
However, the use of EVENT
is not limited to complex rearrangements and can be used to associate nonsymbolic alleles, for example in cases of variant position ambiguity in high homology regions.
Since the EVENTTYPE
values are implementation-defined, custom EVENTTYPE
header lines are included to describe each EVENTTYPE
.
For cyp21a2, gba, and hba targets, the ALLELE_ID
INFO field is used to identify the called variant alleles.
The missing value .
is used when no identifier is available (e.g. a wild type allele) or applicable (e.g. allele index 0 for a structural variant record).
In the case of target variants in a high homology region, each variant is reported ambiguously at all corresponding homologous positions (i.e. in both the pseudogene and in the target gene). Additional analysis for these variants can be performed if absolute certainty that these variants are located in the target gene (e.g. in gba or cyp21a2) is required.
For lpa and smn the ploidy of the called genotype (FORMAT/GT
field) corresponds to the combined copy number from all the homologous positions. For cyp21a2, gba and hba, this "joint" genotype from all the homologous positions is instead reported in a separate FORMAT/JGT
field which is then collapsed into a diploid genotype and reported in the FORMAT/GT
field. The following fields are reported for "joint" calls:
Note that the FORMAT/GQ
and FORMAT/JGQ
fields contain the unconditional genotype quality, unlike the VCF spec where FORMAT/GQ
is defined as the genotype quality conditioned on the site being variant.
In the depicted example there are two genes A and B that include a high homology region. The usual process to call variants in this regions is to make a joint pileup of the reads aligning in both genes A and B and call the variants using a model with a ploidy proportional to the total copy number of the regions. This generates divergent possible genotypes that are equally likely since the variant cannot be confidently placed in either gene A or gene B. For lpa and smn the variant would be reported as follows:
Given the unconventional ploidy of the FORMAT/GT
field used in this representation, a TargetedRepeatConflict
filter is applied to these records. The header line for the filter is the following.
For cyp21a2, gba and hba, a conventional diploid FORMAT/GT
is reported and so no TargetedRepeatConflict
filter is applied. Due to the ambiguity in placing target variants in high homology regions, the corresponding QUAL
and FORMAT/GQ
fields can be much lower than conventional small variant calls (i.e. Phred 3 for a single variant allele copy across two homologous diploid positions). Therefore, instead of filtering on QUAL
and FORMAT/GQ
for these records, the records are filtered based on the FORMAT/JVQL
and FORMAT/JGQ
fields:
Since the wild type alleles at homologous positions may be different from each other or different from the reference alleles, an additional filter is applied when only wild type alleles are detected across the homologous positions. This avoids making ambiguous variant calls when no target variant of interest is detected.
In the case of an identified gene conversion even in rh, a small variant is reported at each differentiating site in the acceptor region.
In the depicted example there are two genes A and B and gene A is the acceptor of a gene conversion from gene B (green box in the figure). Gene conversion are identified by observing variations in copy number at differentiating sites (blue and pink bars in the figure) in consecutive regions. Copy number variations between regions define the breakends of the gene conversion. An equivalent VCF representation for gene conversion would be using CNV and SV entries with breakends corresponding to the donor/acceptor regions, however, only the small variant representation is currently supported.
In the case of a detected gene conversion event, there may be differentiating sites with a genotype that is inconsistent with that gene conversion event. In these cases the RecombinantConflict
filter is applied. The RecombinantConflict
is defined by the following header line.
In the example, the resulting representation is as follows.
For cyp21a2 and gba, nonallelic homologous recombination can result in gene deletion or duplication in the case of reciprocal recombination or gene conversion in the case of nonreciprocal recombination. Both gene deletion and gene conversion can introduce loss-of-function variants and in both cases the targeted caller will report these variants in the target gene. In the case of gene deletion, the differentiating sites at the nontarget (i.e. pseudogene) positions will contain the overlapping deletion allele *
while the differentiating sites in the target will contain any variant alleles. Although an equivalent VCF representation would be to simply report the deletion with a single structural variant VCF record, reporting small variant VCF records in the target gene allows for identification of the specific mutations that may occur in a gene transcript and matches well with annotation using HGVS nomenclature. Similarly, for gene conversions, variants are reported at differentiating sites in the target gene, rather than as pairs of structural variant breakends.
The use of GT=0
for symbolic structural variant alleles is formally disambiguated in VCFv4.4
, specifying that "GT=0 indicates the absence of any of the ALT symbolic structural variants defined in the record". With this convention we can report compound overlapping heterozygous structural variants.
In the hba genotype depicted above, two overlapping SVs can be represented as follows:
The relevant header lines for the VCF records above are:
In the depicted example there is a Variable Number Tandem Repeat (VNTR) region composed of three repeat units in the reference. The CN
INFO field is used to report the allele copy number, the CN
FORMAT field to is used report the region total copy number given by the sum of the allele copy numbers, and the REPCN
FORMAT field is used to report the repeat unit copy number equal to the allele copy number multiplied by the number of repeat units in the reference.
This VNTR can be represented as follows:
The REPCN
and CN
header lines are:
For lpa, rh and smn, the TargetedLowQual
filter is applied if the QUAL
of a target variant is less than 3.00
.
Similarly, for cyp21a2 and gba the TargetedLowVQL
filter is applied if the VQL
of a target variant in low-homology region is less than 3.00
.
The TargetedLowGQ
filter is applied if the targeted variant has GQ
smaller than 3
.
hard-filtered
FilesWhen the small variant caller is enabled, the targeted small variant VCF calls can be merged into the <output-file-prefix>.hard-filtered.vcf.gz
and <output-file-prefix>.hard-filtered.gvcf.gz
files, briefly hard-filtered
files. The --targeted-merge-vc
command line option can be used to control which targets will have their small variant VCF records merged into the hard-filtered
files. For example, --targeted-merge-vc rh
will enable merging of the calls from the rh
caller into the hard-filtered
files and --targeted-merge-vc rh hba
will enable merging of the calls from the rh
and hba
targets into the hard-filtered
files. The true
value will merge all calls from all supported targets into the hard-filtered
files, while the false
value will merge no calls into the hard-filtered
files.
The targeted calls merged into the hard-filtered
files are marked with a TARGETED
INFO flag.
When enabled, targeted small variants are merged into the hard-filtered
files regardless of any regions that may be provided using the --vc-target-bed
option.
The merging strategy for targeted small variant calls is to prioritize the targeted calls over small variant calls from the germline small variant caller. When a germline small variant call overlaps a targeted caller call, then the small variant call is filtered with a TargetedConflict
filter if any of the following holds:
The targeted caller call is PASS
.
The small variant call and targeted caller call have incompatible genotypes and the targeted caller call is not filtered with the TargetedLowGQ
filter.
The strategy is summarized in the following examples.
The TARGETED
call is PASS
.
The TARGETED
call and the small variant call are not overlapping
The TARGETED
call is filtered with TargetedLowQual
and has a discordant variant representation with the overlapping small variant call.
The TARGETED
call is filtered with TargetedLowQual
and has a discordant genotype with the overlapping small variant call.
The TARGETED
call is filtered with TargetedLowGQ
and has a discordant genotype with the overlapping small variant call.
The following command-line example runs the targeted caller from FASTQ input:
The following command-line example runs cyp21a2 only using BAM input without realignment:
Fusion Breakpoint | Hybrid Gene Structure | Star-Allele Designation |
---|---|---|
Fields in JSON | Explanation | Type and Possible Values |
---|---|---|
Fields in JSON | Explanation | Type and Possible Values |
---|
Fields in JSON | Explanation | Type and Possible Values |
---|
Fields in JSON | Explanation | Type and Possible Values |
---|
Fusion Breakpoint | Hybrid Gene Structure | Star-Allele Designation |
---|
The CYP2D6 Caller prints out its calls in the targeted caller output file, <output-file-prefix>.targeted.json
that also contains calls from other targets (see ). An example of the CYP2D6 caller content in the output is as follows:
Fields in JSON | Explanation | Type and Possible Values |
---|
The CYP21A2 Caller generates its output in the targeted caller output file <output-file-prefix>.targeted.json
that also contains calls from other targets (see ).
Fields in JSON | Explanation | Type and Possible Values |
---|
Fields in JSON | Explanation | Type and Possible Values |
---|
Recombinant-like and nonrecombinant-like variants are reported in VCF format. See for details about how these variants are reported in VCF.
Recombinant variant haplotype | HGVS identifiers |
---|
The GBA Caller generates its output in the targeted caller output file <output-file-prefix>.targeted.json
that also contains calls from other targets (see ).
Fields in JSON | Explanation | Type and Possible Values |
---|
Fields in JSON | Explanation | Type and Possible Values |
---|
Recombinant-like and nonrecombinant-like variants are reported in VCF format. See for details about how these variants are reported in VCF.
Fields in JSON | Explanation | Type and Possible Values |
---|
Fields in JSON | Explanation | Type and Possible Values |
---|
Fields in JSON | Explanation | Type and Possible Values |
---|
The SNP (also referred to as g.27134T>G) has been reported in the literature to be associated with the two-copy SMN1 allele.
The SMN Caller prints out its calls in the targeted caller output file, <output-file-prefix>.targeted.json
that also contains calls from other targets (see ). An example of the SMN caller content in this file is shown below.
Fields in JSON | Explanation | Type and Possible Values |
---|
Fields in JSON | Explanation | Type and Possible Values |
---|
Calls at differentiating sites within the recombinant variant calling region will contain the same "joint" fields as are reported for nonrecombinant-like variants in high homology regions ( see ). However, the collapsed diploid FORMAT/GT
will be based on any detected recombination events. Because detected recombinant variants are placed in the target gene, these records are filtered differently than the ambiguously placed, nonrecombinant-like variants in high homology regions. The INFO/Recombinant
flag is added to calls derived from recombinant variant calling to distinguish them from nonrecombinant-like variant calls in high homology regions. The FORMAT/VQL
field is used to apply the RecombinantLowVQL
filter for low quality recombinant variants and the RecombinantREF
filter is applied when the collapsed diploid FORMAT/GT
contains only reference alleles.
The targeted caller can be enabled in parallel with other components as part of a human WGS germline analysis workflow (see ).
intron 4-exon 5
2B7-2B6
*29
intron 4-exon 5
2B6-2B7
*30
genotype
star allele genotype identified for sample
string
genotypeFilter
The filter status for the genotype call
string (The value can include: PASS, No_call, or More_than_one_possible_genotype)
phenotypeDatabaseAnnotation
The metabolism status corresponding to the genotype, mapped from phenotypeDatabaseSources
string
smn1CopyNumber | Copy number of intact SMN1 | nonnegative integer or null |
smn2CopyNumber | Copy number of intact SMN2 | nonnegative integer or null |
smn2Delta78CopyNumber | Copy number of SMN2Δ7–8 (deletion of exon 7 and 8) | nonnegative integer |
totalCopyNumber | Raw normalized depth of total SMN (exons 1 to 6) | nonnegative floating point number |
fullLengthCopyNumber | Raw normalized depth of intact SMN (exons 7 & 8) | nonnegative floating point number |
variants | a json array containing info about specific SMN variants | json-array |
hgvs | HGVS id of the variant being reported | string |
qual | Phred quality that at least one copy of the variant allele is found | nonnegative floating point number |
altCopyNumber | detected copy number of the variant allele | nonnegative floating point number |
altCopyNumberQuality | Phred quality of the detected copy number | nonnegative floating point number |
sample | The sample name. | string |
dragenVersion | The version of DRAGEN. | string |
rh | The RH targeted caller specific fields. | dictionary |
totalCopyNumber | Total RHD/RHCE copy number | integer |
rhdCopyNumber | RHD gene copy number | integer |
rhceCopyNumber | RHCE copy number | integer |
variants | List of known variants from recombination that were detected in RHD/RHCE. | list of variants |
hgvs | HGVS identifier of the variant | string, "NC_000001.11g.25405596_25409676con25283766_25287797" |
qual | Phred QUAL score of the variant | double |
altCopyNumber | Copy number of the ALT variant | double |
altCopyNumberQuality | Phred QUAL copy number of the ALT variant | double |
exon 9 | 2D6-2D7 |
|
exon 9 | 2D7-2D6 |
|
intron 4 | 2D7-2D6 |
|
intron 1 | 2D7-2D6 |
|
intron 1 | 2D6-2D7 |
|
genotype | called star allele genotype | string (semi-colon delimited list of possible genotypes with haplotypes separated by |
genotypeFilter | The filter status for the genotype call | string (The value can include: PASS, No_call, or More_than_one_possible_genotype) |
phenotypeDatabaseAnnotation | The metabolism status corresponding to the genotype, mapped from phenotypeDatabaseSources | string |
totalCopyNumber | Total copy number of CYP21A2 and CYP21A1P genes including hybrids | nonnegative integer |
deletionBreakpointInGene | null (i.e. unknown) if totalCopyNumber > 3 | true, false, null |
true if CN <= 3 and a deletion-like recombinant variant haplotype is detected |
false if CN <=3 and no deletion-like recombinant variant is detected |
recombinantHaplotypes | List of detected haplotypes arising from nonallelic homologous recombination variant calling | Array of two strings. Each string consists of all associated allele IDs (if any) within the haplotype. Consecutive IDs in the same haplotype are separated by a '+'. |
variants | List of single site, nonrecombinant-like variants (i.e. not arising from nonallelic homologous recombination). An empty list if no variants are detected. | Array of nonrecombinant-like variants. |
alleleId | HGVS identifier of the variant allele | string |
alleleCopyNumber | Copy number of the allele in the called genotype | nonnegative integer |
genotypeQuality | Phred-scaled quality for the called genotype | nonnegative integer |
filter | Filter for the called genotype | string. "PASS" when not filtered |
A495P | NM_000157.4:c.1483G>C |
L483P | NM_000157.4:c.1448T>C |
D448H | NM_000157.4:c.1342G>C |
c.1263del | NM_000157.4:c.1265_1319del |
RecNciI | NM_000157.4:c.1483G>C, NM_000157.4:c.1448T>C |
RecTL | NM_000157.4:c.1483G>C, NM_000157.4:c.1448T>C, NM_000157.4:c.1342G>C |
c.1263del+RecTL | NM_000157.4:c.1483G>C, NM_000157.4:c.1448T>C, NM_000157.4:c.1342G>C, NM_000157.4:c.1265_1319del |
totalCopyNumber | Total copy number of all GBA and GBAP1 genes including hybrids | nonnegative integer |
deletionBreakpointInGene | null (i.e. unknown) if totalCopyNumber > 3 | true, false, null |
true if CN <= 3 and a deletion-like recombinant variant haplotype is detected |
false if CN <=3 and no deletion-like recombinant variant is detected |
recombinantHaplotypes | List of detected haplotypes arising from nonallelic homologous recombination variant calling | Array of two strings. Each string consists of all associated allele IDs (if any) within the haplotype. Consecutive IDs in the same haplotype are separated by a '+'. |
variants | List of single site, nonrecombinant-like variants (i.e. not arising from nonallelic homologous recombination). An empty list if no variants are detected. | Array of nonrecombinant-like variants. |
alleleId | HGVS identifier of the variant allele | string |
alleleCopyNumber | Copy number of the allele in the called genotype | nonnegative integer |
genotypeQuality | Phred-scaled quality for the called genotype | nonnegative integer |
filter | Filter for the called genotype | string. "PASS" when not filtered |
sample | The sample name. | string |
dragenVersion | The version of DRAGEN. | string |
lpa | The LPA targeted caller specific fields. | dictionary |
kiv2CopyNumber | Total KIV-2 unit copy number | float |
refMarkerAlleleCopyNumber | Null if Homozygous REF/ALT markers call | float, null |
Float if Heterozygous markers call and stores the KIV-2 unit copy number of the allele having REF markers |
altMarkerAlleleCopyNumber | Null if Homozygous REF/ALT markers call | float, null |
Float if Heterozygous markers call and stores the KIV-2 unit copy number of the allele having ALT markers |
type | "Heterozygous markers call" if we observe both REF and ALT markers | string, "Heterozygous markers call", "Homozygous REF markers call", "Homozygous ALT markers call" |
"Homozygous REF markers call" if we observe only REF markers |
"Homozygous ALT markers call" if we observe only ALT markers |
variants | List of known variants that were detected in the KIV-2 region. | list of variants |
hgvs | HGVS identifier of the variant | string |
qual | Phred QUAL score of the variant | double |
altCopyNumber | Copy number of the ALT variant | double |
altCopyNumberQuality | Phred QUAL copy number of the ALT variant | double |
VARID | SMN marks the SMN call. |
GT | Genotype call at this position using a normal (diploid) genotype model. |
DST | SMA status call: + indicates detected - indicates undetected ? indicates undetermined. |
AD | Total read counts supporting the C and T allele. |
RPL | Log10 likelihood ratio between the unaffected and affected models. Positive scores indicate the unaffected model is more likely. |
sampleId | The sample name. | string | always |
softwareVersion | The version of DRAGEN. | string | always |
phenotypeDatabaseSources | Resources used for calling metabolism status (phenotype). | json array of strings | CYP2B6 or CYP2D6 is enabled |
cyp2b6 | The CYP2B6 caller fields. | dictionary | CYP2B6 caller is enabled |
cyp2d6 | The CYP2D6 caller fields. | dictionary | CYP2D6 caller is enabled |
cyp21a2 | The CYP21A2 caller fields. | dictionary | CYP21A2 caller is enabled |
gba | The GBA caller fields. | dictionary | GBA caller is enabled |
hba | The HBA caller fields. | dictionary | HBA caller is enabled |
lpa | The LPA caller fields. | dictionary | LPA caller is enabled |
rh | The RH caller fields. | dictionary | RH caller is enabled |
smn | The SMN caller fields. | dictionary | SMN caller is enabled |
The HBA Caller is capable of genotyping the HBA1 and HBA2 genes from whole-genome sequencing (WGS) data. Due to high sequence similarity between the genes, a specialized caller is necessary to resolve the possible genotypes of the pair of genes. We consider regions surrounding the HBA1 and HBA2 sites to resolve the possible HBA1 and HBA2 genotypes.
The HBA Caller performs the following steps:
Determines total copy number from read depth of the regions surrounding the HBA1 and HBA2 sites.
Determines HBA genotypes based on the copy number of the regions surrounding the HBA1 and HBA2 sites.
Calls small variants in the HBA1 and HBA2 regions based on the region copy number derived from the genotype along with allele counts from read information.
The HBA Caller requires WGS data aligned to a human reference genome with at least 30x coverage. Reference genome builds must be based on hg19
, GRCh37
, or hg38
.
For a comprehensive evaluation of the HBA caller, see HBA targeted caller blog post.
The first step of HBA calling is to determine the copy number of the regions sorrounding the HBA1 and HBA2 sites. Reads aligned to the regions are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples. Finally, a Gaussian Mixture Model (GMM) is used to obtain the integer region copy number from the region normalized counts.
The genotyping step attempts to identify the two likely haplotypes described in the following table, where "a" stands for a functional copy of either HBA1 or HBA2, "-" stands for a nonfunctional/missing copy of either HBA1 or HBA2, while "3.7" and "4.2" describe the recombinant event that likely caused the deletion/duplication of the functional HBA copy. The second column of the following table reports the interpretation of the genotype.
Genotype | Interpretation |
---|---|
If none of the previous genotype is identified, then no call is made and the caller reports a "None" genotype.
18 small variants are detected from the read alignments. These variants occur in homologous regions of HBA1 and HBA2 where reads mapping to either HBA1 or HBA2 are used for variant calling.
For each variant, reads containing either the variant allele or the nonvariant allele are counted and a binomial model is used to determine the likelihood for each possible variant allele copy number up to the maximum possible as determined from the HBA1/HBA2 genotyping.
The HBA Caller generates its output in the targeted caller output file <output-file-prefix>.targeted.json
that also contains calls from other targets (see Targeted JSON File).
Each variant reported in the variants
array will have the fields below.
Structural variant and homology region variants are reported in VCF format. See Targeted VCF File for details about how these variants are reported in VCF.
An example of the HBA caller content in the <output-file-prefix>.targeted.json
output file is shown below.
Fields in JSON | Explanation | Type and Possible Values |
---|---|---|
Fields in JSON | Explanation | Type and Possible Values |
---|---|---|
aaa3.7/aa
alpha-globin triplication
aaa4.2/aa
alpha-globin triplication
aa/aa
Normal
-a3.7/aa
Silent Carrier
-a4.2/aa
Silent Carrier
--/aaa3.7
Carrier
--/aaa4.2
Carrier
-a3.7/-a3.7
Carrier
-a4.2/-a4.2
Carrier
-a3.7/-a4.2
Carrier
--/aa
Carrier
--/-a3.7
HbH
--/-a4.2
HbH
--/--
Hb Bart's
genotype
The HBA genotype.
string
genotypeFilter
The HBA genotype filter.
string, [PASS, HBALowGQ, HBALowPValue, No_call]
genotypeQual
The HBA Phred genotype quality.
double
minPValue
The minimum copy number p-value of regions used to determine copy number genotype of the HBA locus.
double
variants
List of detected homology region variants in HBA1/HBA2.
Array of variants
alleleId
HGVS identifier of the variant allele
string
alleleCopyNumber
Copy number of the allele in the called genotype
nonnegative integer
genotypeQuality
Phred-scaled quality for the called genotype
nonnegative integer
filter
Filter for the called genotype
string. "PASS" when not filtered