Disruption of all copies of the SMN1 gene in an individual causes spinal muscular atrophy (SMA). SMN1 has a high-identity paralog, SMN2, which differs from it at only approximately 10 SNVs and small indels. For example, hg19 chr5:70247773 C>T affects splicing and largely disrupts the production of functional SMN protein from SMN2. Because of this high-similarity duplication combined with common copy-number variation, standard whole-genome sequencing (WGS) analysis does not produce complete variant calling results for SMN. Since 95% of SMA cases result from the absence of the functional C (SMN1) allele in any copy of SMN¹, a targeted calling solution can be effective in detecting SMA.
DRAGEN offers the following two independent components that can call the SMN1 copy number using WGS data from a germline sample:
- ExpansionHunter
- SMN Caller
With ExpansionHunter, SMA calling is implemented together with repeat expansion detection, using sequence-graph realignment to align reads to a single reference that represents both SMN1 and SMN2.
In addition to the standard diploid genotype call, SMA Calling with ExpansionHunter uses a direct statistical test to check for presence of any C allele. If a C allele is not detected, the sample is called affected, otherwise unaffected.
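The source does not spell out the statistical test, so the following is only a minimal sketch of one way a "presence of any C allele" check could work. It compares an affected model (C reads arise only from sequencing errors) with an unaffected model (at least one C copy) and reports a log10 likelihood ratio in which positive values favor the unaffected model. The function names, the `total_copies` default, and the `error_rate` default are all illustrative assumptions, not DRAGEN parameters.

```python
import math


def log10_binom_pmf(k, n, p):
    """log10 of the binomial probability of k successes in n trials."""
    p = min(max(p, 1e-12), 1 - 1e-12)  # keep log() finite at p = 0 or 1
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )
    return log_pmf / math.log(10)


def sma_status(c_reads, t_reads, total_copies=4, error_rate=0.01):
    """Return (status, log10 likelihood ratio of unaffected vs. affected).

    total_copies and error_rate are assumed values for illustration only.
    """
    n = c_reads + t_reads
    if n == 0:
        return "undetermined", 0.0
    # Affected model: no C allele exists, so C reads are sequencing errors.
    affected = log10_binom_pmf(c_reads, n, error_rate)
    # Unaffected model: best-fitting case with 1..total_copies C copies.
    unaffected = max(
        log10_binom_pmf(c_reads, n, k / total_copies)
        for k in range(1, total_copies + 1)
    )
    ratio = unaffected - affected
    return ("unaffected" if ratio > 0 else "affected"), ratio
```

For instance, `sma_status(0, 40)` favors the affected model, while `sma_status(20, 20)` favors the unaffected model.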
SMA calling is only supported for human whole-genome sequencing with PCR-free libraries.
To enable SMA calling along with repeat expansion detection, set the `--repeat-genotype-enable` option to `true`. For information on graph-alignment options, see Repeat Expansion Detection with ExpansionHunter.
To activate SMA calling, the variant specification catalog file must include a description of the targeted SMN1/SMN2 variant. The `<INSTALL_PATH>/resources/repeat-specs/experimental` folder contains example files.
The `<output-file-prefix>.repeat.vcf` file includes SMN output along with any targeted repeats. SMN output is represented as a single SNV call at the splice-affecting position in SMN1, with SMA status reported in the custom fields VARID, GT, DST, AD, and RPL. These fields are described in the field listing at the end of this section.
The SMN Caller calls SMN1 and SMN2 copy numbers and detects the presence of a SNP, `NM_000344.4:c.*3+80T>G`, that is associated with the two-copy SMN1 allele. The caller is derived from the method described in Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data.²
To enable the SMN Caller, use `--enable-smn=true` as part of a germline-only WGS analysis workflow. It can also be enabled together with the other targets of the targeted caller by using the option `--enable-targeted=true`. The SMN Caller is disabled by default.
The SMN Caller performs the following steps:
1. Determines total and intact SMN copy numbers.
2. Calls SMN1 copy number at eight differentiating sites.
3. Determines copy number for `NM_000344.4:c.*3+80T>G`.
The SMN Caller requires WGS data aligned to a human reference genome with at least 30x coverage.
The two common copy-number variants (CNVs) affecting SMN1 and SMN2 are whole-gene CNV and a partial gene deletion of exons 7 and 8. Reads that align to either SMN1 or SMN2 are counted. The read counts in exons 1 through 6 are used to determine the total SMN copy number. The read counts in exons 7 and 8 are used to determine the number of SMN copies that do not have the exon 7 and 8 deletion (the intact SMN copy number). To estimate the SMN copy number for these two regions, read counts are normalized to a diploid baseline derived from 3000 preselected 2 kb regions across the genome. The 3000 normalization regions are randomly selected from the portion of the reference genome that has stable coverage across population samples. The SMN Caller then calculates the number of SMN copies that have the exon 7 and 8 deletion by subtracting the intact SMN copy number from the total SMN copy number.
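The arithmetic in this step can be sketched as follows. This is a simplified illustration rather than the caller's actual implementation: it assumes the inputs are already length-normalized depths (reads per base), uses the median of the normalization regions as the diploid baseline, and rounds to the nearest integer. The function name and those three choices are assumptions. The output keys reuse the JSON field names `totalCopyNumber` and `fullLengthCopyNumber` for the raw normalized depths.

```python
from statistics import median


def smn_copy_numbers(exon1_6_depth, exon7_8_depth, normalization_depths):
    """Estimate total, intact, and exon 7-8 deleted SMN copy numbers.

    All depths are assumed to be length-normalized (reads per base), with
    SMN1 and SMN2 reads pooled. The median baseline is an assumption.
    """
    baseline = median(normalization_depths)  # depth expected for 2 copies
    if baseline <= 0:
        raise ValueError("normalization baseline must be positive")
    total_raw = 2.0 * exon1_6_depth / baseline    # exons 1 to 6
    intact_raw = 2.0 * exon7_8_depth / baseline   # exons 7 and 8
    total = round(total_raw)
    intact = round(intact_raw)
    # Copies carrying the exon 7-8 deletion: total minus intact.
    deleted = max(total - intact, 0)
    return {
        "totalCopyNumber": total_raw,
        "fullLengthCopyNumber": intact_raw,
        "total": total,
        "intact": intact,
        "deleted": deleted,
    }
```

With a baseline depth of 30, an exon 1 to 6 depth of 75, and an exon 7 and 8 depth of 60, this gives five total copies, four intact copies, and one copy with the deletion.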
To calculate the SMN1 copy number, the caller uses eight predefined differentiating sites in exons 7 and 8 of SMN1 and SMN2. One of these sites is the splice site variant used for SMA calling with ExpansionHunter (see SMA Calling With ExpansionHunter). The differentiating sites are positions where the SMN1 and SMN2 sequences differ and where the SMN1 copy number call is most likely to be correct, based on sequencing data from the 1000 Genomes Project.
For each differentiating site, the SMN1-specific and SMN2-specific alleles are counted in reads mapping to either SMN1 or the homologous region in SMN2. The caller uses a binomial model to calculate the likelihood of each possible SMN1 copy number from the two gene-specific counts given the intact SMN copy number calculated in the previous step.
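A minimal sketch of the binomial step at a single differentiating site is shown below. It assumes that, with `k` SMN1 copies out of `intact_copies` intact copies, the expected fraction of SMN1-specific reads is `k / intact_copies`. The small `error_rate` floor, which keeps a stray read from ruling out a copy number entirely, and the function names are assumptions. How the caller combines the eight sites is not described here and is left out.

```python
import math


def smn1_copy_number_likelihoods(smn1_reads, smn2_reads, intact_copies,
                                 error_rate=0.01):
    """Binomial likelihood of each SMN1 copy number 0..intact_copies."""
    if intact_copies == 0:
        return [1.0]  # no intact copies, so SMN1 copy number can only be 0
    n = smn1_reads + smn2_reads
    likelihoods = []
    for k in range(intact_copies + 1):
        # Expected SMN1 read fraction, kept away from exactly 0 and 1.
        p = min(max(k / intact_copies, error_rate), 1 - error_rate)
        likelihoods.append(
            math.comb(n, smn1_reads) * p ** smn1_reads * (1 - p) ** smn2_reads
        )
    return likelihoods


def call_smn1_copy_number(smn1_reads, smn2_reads, intact_copies):
    """Return the SMN1 copy number with the highest likelihood."""
    likelihoods = smn1_copy_number_likelihoods(
        smn1_reads, smn2_reads, intact_copies
    )
    return max(range(len(likelihoods)), key=likelihoods.__getitem__)
```

For example, with four intact copies, 30 SMN1-specific reads and 30 SMN2-specific reads are best explained by two SMN1 copies.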
NM_000344.4:c.*3+80T>G
The SNP NM_000344.4:c.*3+80T>G (also referred to as g.27134T>G) has been reported in the literature to be associated with the two-copy SMN1 allele.
For this high-homology region SNP, reads mapping to either SMN1 or SMN2 are used for variant calling. Reads containing the variant allele and reads containing the nonvariant allele are counted, and a binomial model that incorporates the sequencing error rate is then used to determine the most likely variant allele copy number (0 for nonvariant).
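The following sketch illustrates how a most likely variant copy number and a Phred-scaled quality for "at least one variant copy" could be derived from the two read counts. It is an illustration under stated assumptions, not the caller's code: the uniform prior over copy numbers, the `error_rate` value, the use of a `total_copies` argument as the number of SMN copies covering the site, and the function name are all hypothetical.

```python
import math


def variant_copy_number(alt_reads, ref_reads, total_copies, error_rate=0.01):
    """Return (most likely variant copy number, Phred quality of >= 1 copy)."""
    n = alt_reads + ref_reads
    weights = []
    for k in range(total_copies + 1):
        fraction = k / total_copies if total_copies else 0.0
        # The error rate bounds the expected fraction away from 0 and 1.
        p = min(max(fraction, error_rate), 1 - error_rate)
        weights.append(
            math.comb(n, alt_reads) * p ** alt_reads * (1 - p) ** ref_reads
        )
    total = sum(weights)
    posteriors = [w / total for w in weights]  # uniform prior (assumption)
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    # Phred-scaled probability that the variant copy number is actually 0.
    qual = -10.0 * math.log10(max(posteriors[0], 1e-300))
    return best, qual
```

A copy number of 0 corresponds to the nonvariant call, in which case the quality value is close to zero.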
The SMN Caller prints out its calls in the targeted caller output file, `<output-file-prefix>.targeted.json`, which also contains calls from other targets (see Targeted JSON File). An example of the SMN caller content in this file is shown below.
The SMN Caller fields, their meanings, and their types are listed at the end of this section.
Each variant reported in the `variants` array has the fields `hgvs`, `qual`, `altCopyNumber`, and `altCopyNumberQuality`.
The variant `NM_000344.4:c.*3+80T>G` is also reported in a `<output-file-prefix>.targeted.vcf[.gz]` file in the output directory. The output file is in VCFv4.2 format and may be compressed. The variant is reported with the `VARIANT_IN_HOMOLOGY_REGION` flag in the `INFO` field and is filtered with the `TargetedRepeatConflict` filter. Because the variant lies in a region of homology between SMN1 and SMN2, it is reported twice, once for the SMN1 region and once for the SMN2 region, and the two records are connected by the same `EVENT` value in the `INFO` field. The ploidy of the variant is reported in concordance with the identified genotype.
An example of the vcf entry for the variant NM_000344.4:c.*3+80T>G is as follows.
The variant `NM_000344.4:c.*3+80T>G` in the `<output-file-prefix>.targeted.vcf[.gz]` file can also be included in the `<output-file-prefix>.hard-filtered.vcf[.gz]` file by including `smn` in the `--targeted-merge-vc` list, for example `--targeted-merge-vc smn`. The output file `<output-file-prefix>.targeted.vcf[.gz]` is compressed by default. Compression can be disabled using `--enable-vcf-compression=false`.
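Because the homology-region variant is written as two records that share an `EVENT` value, downstream scripts may want to pair them. The sketch below groups records by `EVENT` using only standard VCF column positions. The function names are hypothetical, and the example lines use placeholder positions and a placeholder `EVENT` value rather than real DRAGEN output.

```python
from collections import defaultdict


def parse_info(info):
    """Parse a VCF INFO column into a dict; flag entries map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def group_homology_records(vcf_lines):
    """Group records flagged VARIANT_IN_HOMOLOGY_REGION by their EVENT value."""
    events = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        info = parse_info(cols[7])
        if "VARIANT_IN_HOMOLOGY_REGION" in info and "EVENT" in info:
            events[info["EVENT"]].append({
                "chrom": cols[0],
                "pos": int(cols[1]),
                "filters": cols[6].split(";"),
            })
    return dict(events)


# Placeholder records: positions and the EVENT value are made up.
example_lines = [
    "##fileformat=VCFv4.2",
    "chr5\t100\t.\tT\tG\t.\tTargetedRepeatConflict\t"
    "VARIANT_IN_HOMOLOGY_REGION;EVENT=example_event",
    "chr5\t200\t.\tT\tG\t.\tTargetedRepeatConflict\t"
    "VARIANT_IN_HOMOLOGY_REGION;EVENT=example_event",
]
paired = group_homology_records(example_lines)
```

For a compressed file, the lines can be read with `gzip.open(path, "rt")` from the Python standard library before being passed to `group_homology_records`.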
¹Wirth B. An update of the mutation spectrum of the survival motor neuron gene (SMN1) in autosomal recessive spinal muscular atrophy (SMA). Human Mutation. 2000;15(3):228-237. doi:10.1002/(SICI)1098-1004(200003)15:3<228::AID-HUMU3>3.0.CO;2-9
²Chen X, Sanchis-Juan A, French CE, et al. Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data. Genetics in Medicine. 2020;22(5):945-953. doi:10.1038/s41436-020-0754-0
SMA calling with ExpansionHunter, custom fields in the repeat VCF:

Field | Description |
---|---|
VARID | SMN marks the SMN call. |
GT | Genotype call at this position using a normal (diploid) genotype model. |
DST | SMA status call: + indicates detected, - indicates undetected, ? indicates undetermined. |
AD | Total read counts supporting the C and T allele. |
RPL | Log10 likelihood ratio between the unaffected and affected models. Positive scores indicate the unaffected model is more likely. |

SMN Caller, fields in the targeted JSON file:

Fields in JSON | Explanation | Type and Possible Values |
---|---|---|
smn1CopyNumber | Copy number of intact SMN1 | nonnegative integer or null |
smn2CopyNumber | Copy number of intact SMN2 | nonnegative integer or null |
smn2Delta78CopyNumber | Copy number of SMN2Δ7–8 (deletion of exon 7 and 8) | nonnegative integer |
totalCopyNumber | Raw normalized depth of total SMN (exons 1 to 6) | nonnegative floating point number |
fullLengthCopyNumber | Raw normalized depth of intact SMN (exons 7 & 8) | nonnegative floating point number |
variants | A JSON array containing info about specific SMN variants | JSON array |

SMN Caller, fields of each entry in the variants array:

Fields in JSON | Explanation | Type and Possible Values |
---|---|---|
hgvs | HGVS ID of the variant being reported | string |
qual | Phred quality that at least one copy of the variant allele is found | nonnegative floating point number |
altCopyNumber | Detected copy number of the variant allele | nonnegative floating point number |
altCopyNumberQuality | Phred quality of the detected copy number | nonnegative floating point number |
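The field types listed above can be checked programmatically before the values are used downstream. The sketch below validates a dictionary holding the SMN Caller fields against those types. It deliberately makes no assumption about where that dictionary sits inside the targeted JSON file, and the function names are hypothetical. A missing `smn1CopyNumber` or `smn2CopyNumber` key is treated the same as null, which is a simplification.

```python
def _is_nonneg_int(value):
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0


def _is_nonneg_number(value):
    return (isinstance(value, (int, float))
            and not isinstance(value, bool) and value >= 0)


def validate_smn_call(call):
    """Return a list of problems found in a dict of SMN Caller fields."""
    problems = []
    for key in ("smn1CopyNumber", "smn2CopyNumber"):
        value = call.get(key)
        if value is not None and not _is_nonneg_int(value):
            problems.append(f"{key}: expected nonnegative integer or null")
    if not _is_nonneg_int(call.get("smn2Delta78CopyNumber")):
        problems.append("smn2Delta78CopyNumber: expected nonnegative integer")
    for key in ("totalCopyNumber", "fullLengthCopyNumber"):
        if not _is_nonneg_number(call.get(key)):
            problems.append(f"{key}: expected nonnegative number")
    variants = call.get("variants")
    if not isinstance(variants, list):
        problems.append("variants: expected a JSON array")
        variants = []
    for index, variant in enumerate(variants):
        if not isinstance(variant.get("hgvs"), str):
            problems.append(f"variants[{index}].hgvs: expected string")
        for key in ("qual", "altCopyNumber", "altCopyNumberQuality"):
            if not _is_nonneg_number(variant.get(key)):
                problems.append(
                    f"variants[{index}].{key}: expected nonnegative number"
                )
    return problems
```

An empty list means every field matched its documented type.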