Joint Analysis
DRAGEN supports pedigree-based and population-based germline variant joint analysis for multiple samples. A pedigree-based analysis deals with samples from the same species which are related to each other. A population-based analysis compares samples of the same species which are unrelated to each other.
Joint analysis requires a gVCF file for each sample. To create a gVCF file, run the germline small variant caller with the --vc-emit-ref-confidence gVCF
option. There is also the option to write a germline gVCF with reduced size using the option --vc-compact-gvcf
. This results in a significant speed up for a downstream analysis using gVCF Genotyper. Please note that this compact format is not compatible with a pedigree analysis.
The gVCF file contains information on the variant positions and positions determined to be homozygous to the reference genome. For homozygous regions, the gVCF file includes statistics that indicate how well reads support the absence of variants or alternative alleles. Contiguous homozygous runs of bases with similar levels of confidence are grouped into blocks, referred to as hom-ref blocks. Not all entries in the gVCF are contiguous. A reference might contain gaps that are not covered by either variant line or a hom-ref block. Gaps correspond to regions that are not callable. A region is not callable if there is not at least one read mapped to the region with a MAPQ score above zero.
Combined phased variants in the gVCF input
The DRAGEN germline variant caller has an option --vc-combine-phased-variants-distance
to combine phased variants in the gVCF output. Input gVCF files created with this option cannot be processed in a population-based analysis using gVCF Genotyper.
The option to combine phased variants is switched off by default, for details please refer to the section on germline small variant calling in this user guide.
Force-genotyped and targeted variants in the gVCF input
If force genotyping was enabled for any input file, any ForceGT calls that are not also called by the variant caller will be ignored.
Similarly, targeted variant calls (option --targeted-merge-vc
) in any gVCF file that are not also called by the variant caller will be ignored as well.
Processing GATK gVCF Files
Both pedigree- and population-based joint analysis can process gVCF files written by the GATK v4.1 variant caller.
Joint Analysis Output Format
There are two available joint analysis output files:
Multisample VCF--A VCF file containing a column with genotype information for each of the input samples according to the input variants.
Multisample gVCF--A gVCF file augmenting the content of a multisample VCF file, similar to how a gVCF file augments a VCF file for a single sample. In between variant sites, the multisample gVCF contains statistics that describe the level of confidence that each sample is homozygous to the reference genome. Multisample gVCF is a convenient format for combining the results from a pedigree or small cohort into a single file. If using a large number of samples, fluctuation in coverage or variation in any of the input samples creates a new hom-ref block, which causes a highly fragmented block structure and a large output file that can be slow to create.
The multisample gVCF output is only available in the pedigree-based analysis.
The following example shows a single line from a multi-sample VCF where one sample has a variant, and the other two samples are in a gVCF gap. Gaps are represented by "./.:.:".
Hom-ref Blocks FORMAT Fields
In hom-ref blocks, the following FORMAT fields are calculated uniquely.
FORMAT/DP--In a single sample gVCF, the FORMAT/DP reported at a hom-ref position is the median DP in that band. In a multisample gVCF, the FORMAT/DP reported at a hom-ref position is the MIN_DP from hom-ref calls.
FORMAT/AD--In single sample gVCF, values represent the position in the band where DP=median DP. In the multisample gVCF, AD values at hom-ref positions are copied from the single sample gVCF.
FORMAT/AF--Values are based on FORMAT/AD.
FORMAT/PL--Values represent the Phred likelihoods per genotype hypothesis. For hom-ref blocks, each value in FORMAT/PL represents the minimum value across all positions within the band.
FORMAT/SPL and FORMAT/ICNT--Parameters reported in the gVCF records, including both hom-ref blocks and variant records. The parameters are used to compute the confidence score of a variant being de novo in the proband of a trio. For SNP, FORMAT/PL and FORMAT/SPL are both used as input to the DeNovo Caller. FORMAT/PL represents Phred likelihoods obtained from the genotyper, if the genotyper is called. FORMAT/SPL represents Phred likelihoods obtained from column-wise estimation, pregraph. Each value in FORMAT/SPL represents the minimum across all positions within the band. For INDEL, the PL value is computed in the joint pedigree calling step based on the FORMAT/ICNT reported in the gVCF file. FORMAT/ICNT consist of two values. The first value is the number of reads with no indels at the position, and the second value is the number of reads with indels at the position. Each value in FORMAT/ICNT represents the maximum of the value across all positions within the band.
In the following example hom-ref block, ICNT provides information on whether each sample contains an Indel at the position of interest. If the proband contains an indel at the position and the ICNT of the parents does not indicate any read supporting an indel, then the confidence score is high for the proband to have an indel de novo call at the position.
SPL and ICNT values are specific to DRAGEN. The GATK variant caller does not output SPL and ICNT values.
In a single sample gVCF, FORMAT/DP reported at a hom-ref position is the median DP in the band. The minimum is also computed and printed as MIN_DP for the band.
In the multisample gVCF, MIN_DP from hom-ref calls is printed as FORMAT/DP, and AD is just copied from the gVCF. Therefore, at a hom-ref position in the multi-sample gVCF output, the DP is not necessarily going be the sum of AD.
Pedigree Mode
Use pedigree mode to jointly analyze samples from related individuals and to perform de novo calling.
To invoke pedigree mode, set the --enable-joint-genotyping
option to true. Use the --pedigree-file
option to specify the path to a pedigree file that describes the relationship between panels.
The pedigree file must be a tab-delimited text file with the file name ending in the .ped extension. The following information is required.
Column Header | Description |
---|---|
Family_ID | The pedigree identifier. |
Individual_ID | The ID of the individual. |
Paternal_ID | The ID of the individual's father. If the founder, the value is 0. |
Maternal_ID | The ID of the individual's mother. If the founder, the value is 0. |
Sex | The sex of the sample. If male, the value is 1. If female, the value is 2. |
Phenotype | The genetic data of the sample. If unknown, the value is 0. If unaffected, the value is 1. If affected, the value is 2. |
The following is an example of an input pedigree file.
De Novo Calling
The De Novo Caller identifies all the trios within the pedigree and generate a de novo score for each child. The De Novo Caller supports multiple trios within a single pedigree. Pedigree Mode supports de novo calling for small, structural, and copy number variants.
Pedigree Mode is run in multiple steps. The following is an example workflow for a trio using FASTQ input.
Run single sample alignment and variant calling to generate per sample output using the following inputs for Pedigree Mode.
gVCF files for the Small Variant Caller.
*.tn.tsv files for the Copy Number Caller.
BAM files for the Structural Variant Caller.
Run Pedigree Mode for Small Variant Caller. For more information, see Small Variant DeNovo Calling.
Run Pedigree Mode for Copy Number Caller. For more information, see Multisample CNV Calling.
Run Pedigree Mode for Structural Variant Caller. For more information, see Structural Variant De Novo Quality Scoring.
Run DeNovo Variant Small Variant Filtering. For more information, see De Novo Small Variant Filtering.
Small Variant DeNovo Calling
The Small Variant De Novo Caller considers a trio of samples at a time. The samples are related via a pedigree file. The Small Variant De Novo Caller determines all positions that have a Mendelian conflict based on the genotype from the individual sample gVCFs. Sex chromosomes in males are treated as haploid apart from the PAR regions, which are treated as diploid.
Each of those positions is then processed through the Pedigree Caller to compute a joint posterior probability matrix for the possible genotypes. The probabilities are used to determine whether the proband has a de novo variant with a DQ confidence score. All three subjects are assumed to have an independent error probability.
At positions where the original genotype from the gVCFs shows a double Mendelian conflict (eg, 0/0+0/0->1/1 or 1/1+1/1->0/0), the genotypes of the trio samples can be adjusted to the highest joint posterior probability that has at least one Mendelian conflict.
The DQ formula is DQ = -10log10(1 - Pdenovo).
Pdenovo is the sum of all indexes in the joint posterior probability matrix with one of more Mendelian conflicts.
In the GT overwrite step, it is possible for the GT of the parents to be overwritten. In the case of multiple trios, the GT of the parents is based on the last trio processed. The trios are processed in the order they are listed in the pedigree file. DRAGEN currently does not add an annotation in the VCF in cases where the GT was overwritten.
The multisample VCF file is annotated with FORMAT/DQ and FORMAT/DN fields to the output a VCF file that represents a de novo quality score and an associated de novo call. The DN field in the VCF is used to indicate the de novo status for each segment.
The following are the possible values:
Inherited--The called trio genotype is consistent with Mendelian inheritance.
LowDQ--The called trio genotype is inconsistent with Mendelian inheritance and DQ is less than the de novo quality threshold.
DeNovo--The called trio genotype is inconsistent with Mendelian inheritance and DQ is greater than or equal to the de novo quality threshold.
The following is an example VCF line for a trio:
1 16355525 . G A 34.46 PASS AC=1;AF=0.167;AN=6;DP=45;FS=6.69;MQ=108.04;MQRankSum=-0.156;QD=2.46;ReadPosRankSum=0;SOR=0.016 GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP:DPL:DN:DQ 0/1:11,3:0.214:14:39:PASS:8,2:3,1:74,0,47:39.454,0.00053613,49.99:0,1,104:74,0,47:DeNovo:0.67375 0/0:18,0:0:16:48:PASS:.:.:0,48,605:.:0,12,224:0,48,255:.:. 0/0:14,0:0:14:42:PASS:.:.:0,42,490:.:0,5,223:0,42,255:.:.
Pedigree Mode Options
The following command line options are available for de novo small variant calling.
--enable-joint-genotyping
--Run the joint genotyping caller.--pedigree-file
--Specify the path to a pedigree file that describes the relationship between samples. It is possible to run JointGenotyper without a pedigree file on unrelated samples, but we do not recommend this anymore for gVCF variant calls from DRAGEN 3.10 or newer.--variant
or--variant-list
--Specify the gVCF input to the workflow. The pedigree caller can read input gVCF files from an AWS S3 bucket, Azure storage BLOB, or pre-signed URL.--qc-snp-denovo-quality-threshold
--Specify the minimum DQ value for a SNP to be considered de novo. The default is 0.05 if ML recalibration is off, 0.0017 if ML recalibration is on.--qc-indel-denovo-quality-threshold
--Specify the minimum DQ value for an indel to be considered de novo. The default is 0.4 if ML recalibration is off, 0.04 if ML recalibration is on.--output-directory
--The output directory. This is required.--output-file-prefix
--The prefix used to label all output files. This is required.-r
The directory where the hash table resides.
The output of the joint genotyper depends on the order of input gVCF files passed on the command line using --variant
or --variant-list
. It is recommended to use the same input order when re-analyzing gVCFs to ensure the output is the same as an earlier run.
Population Mode
DRAGEN provides a population-based analysis option to jointly analyze samples from unrelated individuals.
The tool for population-based analysis is the iterative gVCF Genotyper. Its input is a set of single or multisample gVCFs. The output is a multisample VCF that contains one entry for any variant seen in any of the input gVCFs. The variants are genotyped across all input samples using information from the hom-ref blocks as necessary. The iterative gVCF Genotyper does not adjust genotypes based on population information but it provides means to filter variant sites based on information leveraged from the population. See Iterative gVCF Genotyper analysis for information on the available command line options.
To compare multiple pedigrees, you can run gVCF Genotyper on the output of a pedigree analysis and merge multiple joint-called pedigrees into a single multisample VCF. To enable, run the pedigree analysis using the --enable-multi-sample-gvcf=true
option to write a multisample gVCF.
Iterative gVCF Genotyper analysis
gVCF Genotyper offers an iterative workflow to aggregate new samples into an existing cohort. The iterative workflow allows users to incrementally aggregate new batches of samples with existing batches, without having to redo the analysis from scratch across all samples, every time when new samples are available. The workflow takes single sample gVCF files as input, and can be performed in a "step-by-step mode" if multiple batches of samples are available, or "end-to-end mode", if only a single batch of samples is available. Multi-sample gVCF files output from the Pedigree Caller (described above) are also accepted as input. gVCF Genotyper can accept input gVCF files generated using DRAGEN version 3.2.6 or later.
Step 1 (gVCF aggregation): the user can use iterative gVCF Genotyper to aggregate a batch of gVCF files into a cohort file and a census file. The cohort file is a condensed data format to store gVCF data in multiple samples, similar to a multi-sample gVCF. The census file stores summary statistics of all the variants and hom-ref blocks among samples in the cohort. As part of this step, adjacent hom-ref blocks with matching FILTER columns are further merged to reduce the disk footprint of the intermediate files, FORMAT field values being base-pair weight averaged in the process.
When a large number of samples are available, the user can divide samples into multiple batches each with similar sample size (e.g. 1000 samples), and repeat Step 1 for every batch.
Step 2 (census aggregation): after all per batch census files are generated, the user can aggregate them into a single global census file. This step scales to aggregate thousands of batches, in a much more efficient way than aggregating gVCFs from all batches. When a new batch of samples becomes available, the user only needs to perform Step 1 on that batch, then aggregate the census file from the new batch with the global census file from all previous batches in order to generate an updated global census file.
Step 3 (msVCF generation): every time a global census file is updated, with new variant sites discovered and/or variant statistics updated at existing variant sites, the user can take a per-batch cohort file, per-batch census file and the global census file as input, and generate a multi-sample VCF for one batch of samples. The output multi-sample VCF contains the variants and alleles discovered in all samples from all batches, and also includes global statistics such as allele frequencies, the number of samples with or without genotypes, and the number of samples without coverage. Similar statistics among samples in the batch are also included. This step can be repeated for every batch of samples, and the number of records in each output multi-sample VCF is the same across all batches.
To facilitate parallel processing on distributed compute nodes, for every step above, the user can choose to split the genome into shards of equal size, and process each shard using one instance of iterative gVCF Genotyper on each compute node. See option --shard
below.
There is a special treatment of alternative or unaligned contigs when the --shard
option is enabled: all contigs that are not autosomes, X, Y or chrM are included in the last shard. No other contigs will be assigned to the last shard. The mitochondrial contig will always be on its own in the second to last shard.
If a combined msVCF of all batches is required, an additional step can be separately run to merge all of the batch msVCF files into a single msVCF containing all samples.
Commandline arguments common to all steps:
--enable-gvcf-genotyper-iterative
: set to true to run the iterative gVCF Genotyper (always required).--ht-reference
: The file containing the reference sequence in FASTA format (always required).--output-directory
: The output directory (always required).--output-file-prefix
: The prefix used to label all output files (optional, default valuedragen
).--shard
: Use this option to process only a portion ('shard') of the genome, when distributing the work across multiple compute nodes in a production workflow. Provide the index (1-based) of the shard to process and the total number of shards, in the format ofn/N
(e.g. 1/50 means shard 1 of total 50 shards). To facilitate concurrent processing within each job, the shard will by default be split into 10x the number of available threads. This option assumes a Human reference genome and might not work for non-Human reference genomes.--gg-regions
: Use this option to test iterative gVCF Genotyper only for a subset of regions in the genome. The value is a list of regions (chr:start-end) delimited by comma. Contig names must match those in the reference and no region may overlap another. If a single region larger than 1Mb is selected, multiple threads are enabled. Otherwise, one thread is launched per region. This assumes that the--shard
option is not given. It is important that the same regions are chosen for each step 1,2 and 3.--gg-regions-bed
: If a path to a BED file is provided as value, this option, like the one above, will limit the iterative gVCF Genotyper processing to the genome regions specified therein, which must be non-overlapping. This option is intended for exome input data. It results in faster processing times and is compatible with sharding. This option will only take effect in step 1 or end-to-end mode. It differs from the option above in that, if the number of regions exceeds 10 times the number of available threads, they will not necessarily be processed by independent threads.--gg-discard-ac-zero
If set to true, the gVCF Genotyper does not print variant alleles that are not called (hom-ref genotype) in any sample. The default value is true.--gg-remove-nonref
If set to true, the <NON_REF> symbolic allele is removed in the process of reading in input gVCF files. The default value is true.--gg-vc-filter
Discard input variants that failed filters in the upstream caller. The default is false. Affected records will have their genotype set to hom-ref and the filter string "ggf" added to FORMAT/FT.--gg-hard-filter
Specifies a filtering expression to be applied to the output msVCF records. See msVCF hard filtering below. The default is to apply no filters.--gg-skip-filtered-sites
Omits msVCF records that fail the given hard filter. The default is false.--gg-msvcf-format-fields
Can be used to override the default set of sample genotype fields in the output msVCF. See msVCF metric customization below.--gg-msvcf-info-fields
Can be used to override the default set of site-wise INFO fields in the output msVCF. See msVCF metric customization below.--gg-squeeze-msvcf
Set to omit genotype fields other than GT from the output msVCF for confidently called hom-ref sample records.--gg-gq-squeezing-threshold
Use in conjunction with the previous option to adjust the threshold on GQ (default 30) that signifies a confident hom-ref call.--gg-output-type
Set tospvcf
to write the output in spVCF format rather then the default msVCF. See File size optimizations below for details.--gg-diploidify
In the output msVCF file, convert haploid calls to diploid. The diploidified genotype is homozygous in the haploid call e.g.1
becomes1/1
. The LPL field is also diploidified for these samples. Site metrics, such as allele counts, are calculated before diploidification. Diploidifying genotypes may ease the ingestion of msVCF files into downstream analysis tool, such as Hail and Plink. When this option is enabled, it is possible to include theDF
FORMAT field (included by default) that signifies whether or not a genotype has been diploidified, see msVCF metric customization below.
Commandline arguments for Step1 (step-by-step mode)
--gvcfs-to-cohort-census
: set to true to aggregate gVCF files from one batch of samples into a cohort file and a census file.--variant-list
: the path to a file containing a list of input gVCF files, with the absolute path to each file on a separate line.--variant
: if--variant-list
is not given, use this option for each input gVCF file. Absolute file paths must be provided.
Commandline arguments for Step2 (step-by-step mode)
--aggregate-censuses
: set to true to aggregate a list of per batch census files into a global census file.--input-census-list
: the path to a file containing a list of input per batch census files (from Step1), with the absolute path to each file on a separate line.
Commandline arguments for Step3 (step-by-step mode)
--generate-msvcf
: set to true to generate a multi-sample VCF for one batch of samples.--input-cohort-file
: the path to the per batch cohort file (from Step1).--input-census-file
: the path to the per batch census file (from Step1).--input-global-census-file
: the path to the global census file (from Step2).
Commandline arguments for running Step1 + Step2 + Step3 (end-to-end mode for a single batch)
--gvcfs-to-msvcf
: set to true to enable the end-to-end mode. This is the default is none of the steps 1,2 or 3 above is selected.--variant-list
: the path to a file containing a list of input gVCF files, with the absolute path to each file on a separate line.--variant
: if--variant-list
is not given, use this option for each input gVCF file. Absolute file paths must be provided.
Commandline arguments for merging per-batch msVCF files
--merge-batches
: set to true to merge msVCF files for a set of batches.--input-batch-list
: the path to a file containing a list of msVCF files to be merged, with the path to each file on a separate line. All the files listed must have been generated from the same global census file, with the same set of options, and by default all batches pertaining to that global census must be included in the merge.--gg-enable-indexing
: set to true (the default) to generate a tabix index for the merged msVCF.--gg-merge-subset
: set to override the restriction that all batches must be included in the merge.
Enabling use of mimalloc library for enhanced performance
Mimalloc is a custom memory allocation library that can yield a significant speed-up in the iterative gVCF Genotyper workflow. In some deployments, e.g. cloud, it is automatically and seamlessly used, but in other contexts it requires special user intervention to be activated, as at present it cannot be included in standard DRAGEN by default.
For this purpose, the convenience script mi_dragen.sh
is provided, which loads the bundled library and can be transparently used in the same way as the DRAGEN executable. Please note that its use is only intended and supported for use with the iterative gVCF Genotyper component, although it can in principle be applied for any other DRAGEN workflow too. Its use for other purposes is known to possibly lead to undesirable memory overuse and thus should be undertaken at the user's own risk.
The multi-sample VCF output of the iterative gVCF Genotyper
The output of gVCF Genotyper is a multi-sample VCF (msVCF) that contains metrics computed for all samples in the cohort.
The msVCF can become a very large file with increasing cohort size. In some cases, the file might need more storage than can be allocated by VCF parsers. This is caused by VCF entries such as FORMAT/PL which store a value for each combination of alleles. We therefore decided to replace FORMAT/PL with a tag FORMAT/LPL which stores a value only for the alleles that actually occur in the sample. Similarly, the msVCF also contains FORMAT/LAD which stores the allelic depth only for the alleles occurring in the sample.
We also added a new FORMAT/LAA field which lists 1-based indices of the alternate alleles that occur in the current sample. The allele order of other local fields is the same as that of LAA.
This approach is also referred to as local alleles and is also used by open source software such as bcftools and Hail.
Iterative gVCF Genotyper with Mitochondrial Variant Calls
When processing mitochondrial variant calls, which may contain separate records for each allele, iterative gVCF Genotyper processing differs in the following ways:
Only the record with the highest FORMAT/AF sum is kept.
The FORMAT/AF field will be additionally collected, and used to generate the FORMAT/LAF field in the output msVCF
QUAL column in msVCF
The value displayed in the QUAL column of the msVCF is the maximum of the input QUAL values for the site across the global cohort. The QUAL value will be missing if any of the batch census files used to create the global census were generated with a version of DRAGEN earlier than v4.2.
Measures of Hardy-Weinberg Equilibrium in the msVCF output
The Hardy-Weinberg Equilibrium (HWE) states that, given certain conditions, genotype and allele frequencies should remain constant between generations. Deviations from HWE can results from violations of the underlying HWE assumptions in the population, non-random sampling or may be artifacts of variant calling. Adherance to HWE can be assessed by comparing the observed frequencies of genotypes to those expected under HWE given the observed allele counts.
Iterative gVCF Genotyper offers several metrics for assessing adherence to HWE. It calculates both allele-wise and site-wise HWE P-values, an allele-wise excess heterozygosity (ExcHet) P-value and the site-wise inbreeding coefficient (IC). These metrics are calculated only for diploid sites and missing values are excluded from the calculations. These values are included as fields in the INFO column of the output msVCF file. Both batch-wise and global values are included, where the field names for the global values are prefixed with G
.
Metric | Description | Scope | Number of values |
---|---|---|---|
HWE | Hardy-Weinberg Equilibrium P-value | Allele-wise | One for each alt allele |
ExcHet | Excess Heterozygosity P-value | Allele-wise | One for each alt allele |
HWEc2 | Hardy-Weinberg Equilibrium P-value | Site-wise | 1 |
IC | Inbreeding Coefficient | Site-wise | 1 |
Care should be taken when interpreting these metrics for small cohorts and/or low frequency alleles, as small changes in inputs can lead to large changes in their values. Further, violations of the underlying HWE assumptions (such as inbreeding), and non-random sampling (such as the presence of consanguineous samples), can adversely affect results, making identification of poorly called variants more difficult.
Where it is not possible to calculate the metric, they are represented as missing (i.e., ".") in the msVCF file. This can vary between the metrics, but may occur if non-diploid genotypes are encountered, if there is only one allele present at a site, or if no samples are genotyped at a site.
Allele-wise vs site-wise Hardy-Weinberg metrics
Iterative gVCF Genotyper offers both allele-wise and site-wise HWE P-values. The allele-wise P-values are based on the exact-conditional method Am J Hum Genet. 2005 May; 76(5): 887–893 the site-wise P-values are based on Pearson's chi-squared method. For bi-allelic sites, although both are measuring the same property, their values may differ. The differences between the methods are explored in Am J Hum Genet. 2005 May; 76(5): 887–893. Care should be taken when deciding which to use.
Allele-Wise Hardy-Weinberg Equilibrium and Excess Heterozygosity.
Iterative gVCF Genotyper calculates allele-wise HWE and the ExcHet P-values. The values are calculated using the exact-conditional method described in Am J Hum Genet. 2005 May; 76(5): 887–893. The implementation does not use a mid P-value correction.
For HWE a P-value of ≈ 1 suggests that the distribution of heterozygotes and homozygotes is close to that expected under HWE, while a P-value of ≈ 0 suggests a deviation from it. For ExcHet a P-value of ≈ 0.5 suggests that the number of heterozygotes is close to the number expected under HWE, while a value ≈ 1 suggests that there are more heterozygotes than expected and a value ≈ 0 suggests that there are fewer heterozygotes than expected.
For a bi-allelic site the HWE P-values is based on the numbers of homozygotes and heterozygotes comparing the observed to expected. For a multi-allelic site, P-values are calculated per ALT allele as if it were bi-allelic. Genotypes composed of only the ALT allele being considered are counted as alternative homozygous, any other genotype containing a copy of the ALT allele being considered are counted as a heterozygous, and any genotype with no copies of the ALT allele being considered are counted as reference homozygous (this may include genotypes containing other ALT alleles).
Site-Wise Hardy-Weinberg Equilibrium.
Iterative gVCF Genotyper calculates a site-wise HWE P-value. The value is calculated using the Pearson's chi-squared method, comparing the genotype counts expected under HWE to those observed. The chi-squared test statistic is calculated as
𝜒2 = ∑gt (Egt - Ogt)2 / Egt
where the summation is over gt
is over all genotypes possible at the site given the alleles present, and Egt and Ogt are the expected and observed counts for genotype gt
, respectively. From the chi-squared test statistic the P-value is then calculated from a chi-squared distribution where the number of degrees of freedom is the number of possible genotypes minus the number of alleles, which is
where n
is the number of alleles.
The batch-wise value uses only the alleles present in the batch. Alleles with AC=0 are not included in the calculation.
A P-value of ≈ 1 suggests that the distribution of heterozygotes and homozygotes is close to that expected under HWE, while a P-value of ≈ 0 suggests a deviation from it.
The Inbreeding Coefficient
Iterative gVCF Genotyper calculates the inbreeding coefficient (IC) (sometimes called the Fixation index and denoted by F
). It is defined as the proportion of the population that is inbred. The value of IC can be estimated by looking at the observed number of heterozygotes in comparison to the number expected under HWE:
where O(het)
and E(het)
are the observed and expected number of heterozygotes in the cohort, respectively. Although initially conceived for studying inbreeding and defined as a non-negative value, it is also commonly used to look for deviations from HWE and can take values in the range [-1, 1].
Values of IC ≈ 0 suggest that the cohort is in HWE. Negative values suggest an excess of heterozygosity and a deviation from HWE, which can be symptomatic of poor variant calling. Positive values suggest a deficit of heterozygotes and the possible presence of inbreeding.
Using the above definition, IC should be a property of the population, and so would be expected to be drawn from the same distribution for all sites and for all variants at a site. Deviations from this distribution can suggest issues in calling a site correctly. Violations of HWE assumptions and/or non-random sampling may adversely affect the distribution of IC, causing it to be shifted. However, outliers can still be identified, although thresholds may need to be adjusted accordingly.
Allelic Balance
Allelic balance (AB) describes the proportion of reads that support each allele within a called genotype and can be calculated from the allelic depth (FORMAT/AD or FORMAT/LAD). For homozygotes this is taken as
AB = ADi / ∑j ADj
where i
is the index of the called allele and j
runs over all alleles. For heterozygotes this is taken as
ABi = ADi / (ADa + ADb)
where a
and b
are the indices of the called alleles and i
can have values a
or b
. For homozygous genotypes AB is expected to be ≈ 1 and for heterozygous genotypes it is expected to be ≈ 0.5 for each allele. Deviations from the expected values can be indicative of an error.
DRAGEN's iterative gVCF Genotyper calculates site-wise AB values for each allele based on the read depths among all samples. Only diploid genotypes are included in the calculations. Values are calculated separately among homozygous (ABHom) and heterozygous (ABHet) genotypes. ABHet is calculated using the counts among all heterozygous calls that contains the allele under consideration. P-values for ABHet are also calculated (ABHetP) based on a binomial test with an expected probability of 0.5. A P-value of ≈ 1 signifies that results are in line with expectation while ≈ 0 signifies a deviation from expectation. Values are written to the INFO fields ABHom, ABHet and ABHetP, with one value for each allele (including the reference allele). Values should be in the range [0, 1]. Missing values are coded by -1, for example where there are no homozygous calls for an allele. If AD is not present in any input gVCF file, the values are not calculated and the fields will be omitted from the output msVCF file.
It is also possible to filter based on the maximum ABHetP value, see msVCF hard filtering.
msVCF hard filtering
Sites in the output msVCF can be filtered on the following global metrics:
QUAL
Number of samples with called genotypes (GNS_GT)
Inbreeding coefficient (GIC)
𝜒2 Hardy-Weinberg Equilibrium P-value (GHWEc2)
The maximum P-value for heterozygous allelic balance (GABHetP)
The syntax of a filtering expression is the same as that used by the small variant caller (see Germline Small Variant Hard Filtering). Filters are always applied to the globally-computed metrics, not the values for the current batch. Records failing filter will have the specified filter ID(s) written to the FILTER column of the msVCF, or will be omitted entirely if the --gg-skip-filtered-sites option is specified. Since filtering is on a per-site basis, filters cannot be applied separately to SNPs or indels as they can in the variant caller.
msVCF metric customization
The per-sample genotype metrics in the output msVCF can be customized by providing a colon-separated list of metrics, analogous to that of the VCF FORMAT column, to the --gg-msvcf-format-fields
option, e.g. --gg-msvcf-format-fields=GT:LAD:LPL:LAA:QL
. Supported metrics are GT, GQ, AD, LAD, FT, LPL, LAA, LA, LGT, QL, MQR, LAF and DF (N.B. LAF will only appear on the MT contig and DF will only appear if the --gg-diploidify
option is enabled). Sample genotype (GT) is always present and always shown first, regardless of whether it is included in the option string or not. Alternatively, an msVCF containing only site statistics and no per-sample genotype fields can be generated using the option --gg-msvcf-format-fields=None
.
Metric | Description | Number of values1 | Type |
---|---|---|---|
GT | Genotype | 1 | String |
GQ | Genotype quality | 1 | Integer |
AD | Allelic depths | R | Integer |
LAD | Localized allelic depths | . | Integer |
FT | Sample filter | 1 | String |
LPL | Local normalized, Phred-scaled likelihoods for genotypes as in original gVCF | . | Integer |
LAA | Mapping of local alt allele index from original gVCF to msVCF excluding the reference allele | . | Integer |
LA | Mapping of local allele indices from original gVCF to msVCF including the reference allele | . | Integer |
LGT | Local GT value as in original gVCF | 1 | String |
QL | Phred-scaled probability that the site has no variant in this sample (original gVCF QUAL) | 1 | Float |
MQR | Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities (original gVCF INFO/MQRankSum) | 1 | Float |
LAF | Allele fractions for the local alt alleles | . | Float |
DF | Diploidified, 1 represents a genotype that was originally haploid, 0 represents originally diploid | 1 | Integer |
The per-site INFO metrics in the output msVCF can be customized by providing a semicolon-separated list of metrics, analogous to that of the VCF INFO column, to the --gg-msvcf-info-fields
option, e.g. --gg-msvcf-info-fields=AC;AN;NS;NS_GT;NS_NOGT;NS_NODATA;AF
. Supported metrics are AC, AN, NS, NS_GT, NS_NOGT, NS_NODATA, IC, HWE, ExcHet, HWEc2, AF. The default set of metrics is AC, AN, NS, NS_GT, NS_NOGT, NS_NODATA, IC, HWE, ExcHet and HWEc2. All INFO fields can be included using the option --gg-msvcf-info-fields=All
. All INFO fields can be dropped using the option --gg-msvcf-info-fields=None
, in which case the INFO field will contain the missing symbol, .
. For each specified metric, the value for the current batch and the global value are written. For global values, the metric names are prepended by G
.
INFO fields that have a missing value, .
, at a site are omitted from the msVCF for that site, so sites may contain different sets of fields.
Metric | Description | Number of values1 | Type |
---|---|---|---|
AC | Allele count in genotypes | A | Integer |
AN | Total number of alleles in called genotypes | 1 | Integer |
NS | Total number of samples | 1 | Integer |
NS_GT | Total number of samples with called genotypes | 1 | Integer |
NS_NOGT | Total number of samples with unknown genotypes ./. | 1 | Integer |
NS_NODATA | Total number of samples with no coverage | 1 | Integer |
IC | The inbreeding coefficient | 1 | Float |
HWE | The exact conditional Hardy-Weinberg Equilibrium P-value | A | Float |
ExcHet | he exact conditional Excess Heterozygosity P-value | A | Float |
HWEc2 | The chi-squared Hardy-Weinberg Equilibrium P-value | 1 | Float |
AF | The ALT allele frequencies (AC/AN) | A | Float |
ABHom | The allelic balance among homozygotes | R | Float |
ABHet | The allelic balance among heterozygotes | R | Float |
ABHetP | The P-value for allelic balance among heterozygotes | R | Float |
File size optimizations
For sizable cohorts, the file outputs from gVCF Genotyper can become extremely large. However, there are a number of options within the component which can mitigate this. As well as reduced footprint on disk, these options can lead to faster runtimes owing to the diminished I/O demands.
The following options have applicability to this:
The small variant caller's
--vc-compact-gvcf
, described previously. This doesn't reduce output file sizes, but the smaller input gVCFs reduce gVCF Genotyper runtime and could reduce data storage costs.The removal of the NON_REF symbolic allele when ingesting the input gVCF files, which is the default behaviour. Doing this reduces the size not only of the final msVCF output, but also the intermediate cohort and census files.
Several options exist that reduce the volume of data written to the final msVCF file:
Outputting local allele values, as described above.
Use of the msVCF metric customization options to output only those metrics required for the downstream analysis.
Omitting records that fail filters (
--gg-skip-filtered-sites
option).Dropping trailing genotype fields for hom-ref records (
--gg-squeeze-msvcf
option). This behaviour is explicitly permitted by the VCF specification.
The option that can have the biggest impact on the final output file size is that to write it directly in spVCF format. This is a lossless encoding and the space saving can be dramatic: file size reductions of multiple tens of times have been observed for large cohorts with sparsely distributed variants. Files output as spVCF at step 3 (--generate-msvcf
) can be directly merged via the --merge-batches
subcommand to produce a single spVCF file. spVCF-encoded files are likely to require decoding back to full msVCF for use with downstream tools, and a binary for this is available for download. The decoding will take time, but this is offset by the reduced time required within gVCF Genotyper to initially write the smaller spVCF files. Users are recommended to, if possible, directly pipe the decoded data into the downstream tool rather than first writing the full msVCF file to disk.
1: The number of values is coded as per the VCF specification, with A
denoting one value per alt allele, R
one value per possible allele (including the reference allele), G
one value per possible genotype and .
an unspecified number of values that may vary between site and sample. The number of elements in localised array FORMAT fields that depend on the number of local alleles will vary between samples and so are specified as .
.
Last updated