Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
B-Allele frequency (BAF) output is enabled by default in germline and somatic VCF and gVCF runs.
The BAF value is calculated as either AF
or (1 - AF)
, where
AF = (alt_count / (ref_count + alt_count))
BAF = 1 - AF
, only when ref base < alt base, order of priority for bases is A < T < G < C < N
.
The B-allele frequency values are often plotted to visually inspect the spread away from a perfectly diploid heterozygous call (BAF=50%). This plot is more easily interpreted if it is symmetric about the BAF=50% line. To ensure the symmetry, a heuristic must be used to determine when BAF = AF
or BAF = 1-AF
. This definition of B-Allele Frequency is based on the definition that is used for bead arrays, as most users are accustomed to that implementation. Here, the choice of the B allele is based on the color of dye attached to each nucleotide. A and T get one color, G and C get the other color. The bead array implementation has much more complex rule for tie-breaking between A and T or G and C that involves top and bottom strands. This is unnecessary and so the simpler hierarchical approach of using a priority for the nucleotides A<T<G<C<N
is used.
For each small variant VCF entry with exactly one SNP alternate allele, the output contains a corresponding entry in the BAF output file.
<NON_REF>
lines are excluded
ForceGT variants (as marked by the "FGT" tag in the INFO field) are not included in the output, unless the variant also contains the "NML" tag in the INFO field.
Variants where the ref_count and alt_count are both zero are not included in the output.
--vc-enable-baf
Enable or disable B-allele frequency output. Enabled by default.
The BF generates are BigWig-compressed files, named <output-file-prefix>.baf.bw
and <output-file-prefix>.hard-filtered.baf.bw
. The hard-filtered file only contains entries for variants that pass the filters defined in the VCF (ie, PASS entries).
Each entry contains the following information: Chromosome Start End BAF
Where:
Chromosome is a string matching a reference contig.
Start and end values are zero-based, half open intervals.
BAF is a floating point value.
The filtering step identifies de novo variants calls of the joint calling workflow in regions with ploidy changes. Since de novo calling can have reduced specificity in regions where at least one of the pedigree members shows non-diploid genotypes, the de novo variant filtering marks relevant variants and thus can improve specificity of the call set.
Based on the structural and copy number variant calls of the pedigree, the FORMAT/DN field in the proband is changed from the original DeNovo value to DeNovoSV or DeNovoCNV if the de novo variant overlaps with a ploidy-changing SV or CNV, respectively. All other variant details remain unchanged, and all variants of the input VCF will also be present in the filtered output VCF. Structural or copy number variants which result in no change of ploidy, such as inversions, are not considered in the filtering. As an example, a de novo SNV calls in the input VCF
Overlapping with an SV duplication in the proband, mother or father would be represented in the filtered output VCF as follows:
The following is an example command line for running the de novo filtering, based on the files returned by the joint calling workflows:
The following options are used for de novo variant filtering:
--dn-input-vcf
---Joint small variant VCF from the de novo calling step to be filtered.
--dn-output-vcf
---File location to which the filtered VCF should be written. If not specified, the input VCF is overwritten.
--dn-sv-vcf
---Joint structural variant VCF from the SV calling step. If omitted, checks with overlapping structural variants are skipped.
--dn-cnv-vcf
--- Joint structural variant VCF from the CNV calling step. If omitted, checks with overlapping copy number variants are skipped.
DRAGEN provides post-VCF variant filtering based on annotations present in the VCF records. Default and non-default variant hard filtering are described below. However, due to the nature of DRAGEN's algorithms, which incorporate the hypothesis of correlated errors from within the core of variant caller, the pipeline has improved capabilities in distinguishing the true variants from noise, and therefore the dependency on post-VCF filtering is substantially reduced. For this reason, the default post-VCF filtering in DRAGEN is very simple.
The default filters in the germline pipeline are as follows:
##FILTER=<ID=DRAGENSnpHardQUAL,Description="Set if true:QUAL < 10.41 (3 when ML recalibration is enabled)">
##FILTER=<ID=DRAGENIndelHardQUAL,Description="Set if true:QUAL < 7.83 (3 when ML recalibration is enabled)">
##FILTER=<ID=LowDepth,Description="Set if true:DP <= 1">
##FILTER=<ID=PloidyConflict,Description="Genotype call from variant caller not consistent with chromosome ploidy">
DRAGENSnpHardQUAL and DRAGENIndelHardQUAL: For all contigs other than the mitochondrial contig, the default hard filtering consists of thresholding the QUAL value only. A different default QUAL threshold value is applied to SNP and INDEL
LowDepth: This filter is applied to all variants calls with INFO/DP <= 1
PloidyConflict: This filter is applied to all variant calls on chrY of a female subject, if female is specified on the DRAGEN command line, of if female is detected by the ploidy estimator.
For the mitochondrial contig, DRAGEN processes it through a continuous AF pipeline, which is similar to the somatic variant calling pipeline. Please refer to Mitochondrial Calling for the filtering details.
DRAGEN supports basic filtering of variant calls as described in the VCF standard. You can apply any number of filters with the --vc-hard-filter
option, which takes a semicolon-delimited list of expressions, as follows:
where the list of criteria is itself a list of expressions, delimited by the || (OR) operator in this format:
The meaning of these expression elements is as follows:
filterID---The name of the filter, which is entered in the FILTER column of the VCF file for calls that are filtered by that expression.
snp/indel/all---The subset of variant calls to which the expression should be applied.
annotation ID---The variant call record annotation for which values should be checked for the filter. Supported annotations include FS, MQ, MQRankSum, QD, and ReadPosRankSum.
comparison operator---The numeric comparison operator to use for comparing to the specified filter value. Supported operators include <, ≤, =, ≠, ≥, and >. For example, the following expression would mark with the label "SNP filter" any SNPs with FS < 2.1 or with MQ < 100, and would mark with "indel filter" any records with FS < 2.2 or with MQ < 110:
This example is for illustration purposes only and is NOT recommended for use with DRAGEN V3 output. Illumina recommends using the default hard filters. The only supported operation for combining value comparisons is OR, and there is no support for arithmetic combinations of multiple annotations. More complex expressions may be supported in the future.
The orientation bias filter is designed to reduce noise typically associated with the following:
Pre-adapter artifacts introduced during genomic library preparation (eg, a combination of heat, shearing, and metal contaminates can result in the 8-oxoguanine base pairing with either cytosine or adenine, ultimately leading to G→T transversion mutations during PCR amplification), or
FFPE (formalin-fixed paraffin-embedded) artifact. FFPE artifacts stem from formaldehyde deamination of cytosines, which results in C to T transition mutations. The orientation bias filter can only be used on somatic pipelines. To enable the filter, set the --vc-enable-orientation-bias-filter
option to true. The default is false.
The artifact type to be filtered can be specified with the --vc-orientation-bias-filter-artifacts
option. The default is C/T,G/T, which correspond to OxoG and FFPE artifacts. Valid values include C/T, or G/T, or C/T,G/T,C/A.
An artifact (or an artifact and its reverse compliment) cannot be listed twice. For example, C/T,G/A is not valid, because C→G and T→A are reverse compliments.
The orientation bias filter adds the following information:
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=OBC,Number=1,Type=String,Description="Orientation Bias Filter base context">
##FORMAT=<ID=OBPa,Number=1,Type=String,Description="Orientation Bias prior for artifact">
##FORMAT=<ID=OBParc,Number=1,Type=String,Description="Orientation Bias prior for reverse compliment artifact">
##FORMAT=<ID=OBPsnp,Number=1,Type=String,Description="Orientation Bias prior for real variant">
Please note that the OBF filter runs as a standalone process after DRAGEN is complete. The VC metrics that are computed as part of DRAGEN SNV caller will not be updated and will not reflect the additional variants that are filtered in this stage.
DRAGEN secondary analysis employs machine learning based variant recalibration (DRAGEN-ML) for germline SNV VC. Variant calling accuracy is improved using powerful yet efficient machine learning techniques that augment the variant caller, by exploiting more of the available read and context information that does not easily integrate into the Bayesian processing used by the haplotype variant caller. A supervised machine learning method was developed using truth from the PrecisionFDA v4.2.1 sets to build a model that processes read and other contextual evidence to remove false positives, recover false negatives and reduce zygosity errors, for both SNVs and INDELs.
No additional setup is required. ML model files for the hg38 and hg19 human references are packaged with the DRAGEN installer. After installation, the files are present at <INSTALL_PATH>/resources/ml_model/<ref>
DRAGEN-ML is enabled by default as needed, when running the germline SNV VC. DRAGEN will automatically detect the reference used for analysis, and use the correct model files. It either hg38 or hg19 reference type is not detected, ML recalibration will automatically be disabled and SNV VC falls back to legacy operation.
DRAGEN-ML requires a run with BAM or FASTQ input, since the machine learning model extracts information from the read pile-up. DRAGEN-ML runs concurrently with DRAGEN SNV VC. DRAGEN-ML can be applied to WGS or WES samples. Re-calibration of existing VCF files is not supported.
DRAGEN-ML recalibrates all quality scores, changing the values of the QUAL and GQ fields in the output VCF/GVCF.
DRAGEN-ML also updates PL and GP in the output VCF/GVCF.
The genotypes (GT field) of some variants may be changed by ML e.g., 0/1 to 1/1 or vice versa.
DRAGEN-ML PHRED scores (e.g. QUAL) are better calibrated than and differ significantly from those with ML disabled and, as a consequence, QUAL scores tend to not exceed 75. By comparison, QUAL scores with ML disabled can exceed 1000. For this reason, the QUAL filtering threshold is set to 3 when DRAGEN-ML is enabled, compared to 10 for DRAGEN-VC when DRAGEN-ML is disabled.
The following variants types are recalibrated:
Biallelic and multiallelic variants
Autosomes and sex chromosomes, including haploid positions
Force GT calls
Non primary contigs
DRAGEN-ML typically removes 30-50% of SNP FPs, with smaller gains on INDELS. FN counts are reduced by 10% or more. The output QUAL/GQ of DRAGEN-ML is empirically more accurately calibrated than DRAGEN SNV VC without ML. There are significant gains in accuracy statistics across the entire genome with ML enabled. Note that a small number of variant calls may have degraded accuracy with ML enabled compared to VC without ML.
DRAGEN-ML adds about 10% to the run time compared to runs without ML.
An MD5SUM file is generated automatically for VCF output files. This file is in the same output directory and has the same name as the VCF output file, but with an .md5sum extension appended. For example, whole_genome_run_123.vcf.md5sum. The MD5SUM files is a single-line text file that contains the md5sum of the VCF output file. This md5sum exactly matches the output of the Linux md5sum command.
DRAGEN supports force genotyping (ForceGT) for small variant calling. To use ForceGT, use the --vc-forcegt-vcf
option with a list of small variants to force genotype. The input list of small variants can be a *.vcf or *.vcf.gz file.
The current limitations of ForceGT are as follows:
ForceGT is supported for germline small variant calling in the V3 mode. The V1, V2, and V2+ modes are not supported.
ForceGT is also supported for somatic small variant calling.
ForceGT variants do not propagate through joint genotyping.
DRAGEN supports only a single ForceGT VCF input file, which must meet the following requirements:
The input has to be a valid VCF file according to version 4.2 of the VCF standard. For instance, it has to have at least eight tab-delimited columns and records need to be sorted by reference contig and position.
The header has to list the same contigs as the reference used for variant calling. All variants must refer to one of these contig names.
Variants have to be normalized (parsimonious and left-aligned, see below).
It must not contain any multinucleotide or complex variants (AT -> C). These are variants that require more than one substitution / insertion / deletion to go from REF allele to ALT allele and are ignored.
Any deletions longer than 50bp are filtered out.
Any variant will only be called once. Duplicate entries will be ignored.
The following nonnormalized variant will cause undefined behavior in DRAGEN:
Instead of…
use…
Force genotyping requires an input VCF and can be used with DRAGEN software in VCF, GVCF or VCF+GVCF mode. In all cases the output file(s) contains all regular calls and the forceGT variants, as follows:
For a ForceGT call that was not called by the variant caller (not common), the call is tagged with FGT in the INFO field.
For a germline ForceGT call that was also called by the variant caller and filter field is PASS, the call is tagged with NML;FGT in the INFO field (NML denotes normal). In somatic mode, the call is tagged with FGT;SOM.
For a normal call (and PASS) by the variant caller, with no ForceGT call (normal), no extra tags are added (no NML tag, no FGT tag).
This scheme distinguishes among calls that are present due to FGT only, common in both ForceGT input and normal calling, and normal calls.
All the variants in the input ForceGT VCF are genotyped and present in the output file. The following table lists the reported GTs for the variants.
If DRAGEN calls a variant that is different from the one specified in the input ForceGT VCF, the output contains the following multiple entries at the same position:
One entry for the default DRAGEN variant call
One entry each for every variant call present in the input ForceGT-VCF at that position
If a target BED file is provided along with the input ForceGT VCF, then the output file only contains ForceGT variants that overlap the BED file positions.
DRAGEN supports pedigree-based and population-based germline variant joint analysis for multiple samples. A pedigree-based analysis deals with samples from the same species which are related to each other. A population-based analysis compares samples of the same species which are unrelated to each other.
Joint analysis requires a gVCF file for each sample. To create a gVCF file, run the germline small variant caller with the --vc-emit-ref-confidence gVCF
option. There is also the option to write a germline gVCF with reduced size using the option --vc-compact-gvcf
. This results in a significant speed up for a downstream analysis using gVCF Genotyper. Please note that this compact format is not compatible with a pedigree analysis.
The gVCF file contains information on the variant positions and positions determined to be homozygous to the reference genome. For homozygous regions, the gVCF file includes statistics that indicate how well reads support the absence of variants or alternative alleles. Contiguous homozygous runs of bases with similar levels of confidence are grouped into blocks, referred to as hom-ref blocks. Not all entries in the gVCF are contiguous. A reference might contain gaps that are not covered by either variant line or a hom-ref block. Gaps correspond to regions that are not callable. A region is not callable if there is not at least one read mapped to the region with a MAPQ score above zero.
The DRAGEN germline variant caller has an option --vc-combine-phased-variants-distance
to combine phased variants in the gVCF output. Input gVCF files created with this option cannot be processed in a population-based analysis using gVCF Genotyper.
The option to combine phased variants is switched off by default, for details please refer to the section on germline small variant calling in this user guide.
If force genotyping was enabled for any input file, any ForceGT calls that are not also called by the variant caller will be ignored.
Similarly, targeted variant calls (option --targeted-merge-vc
) in any gVCF file that are not also called by the variant caller will be ignored as well.
Both pedigree- and population-based joint analysis can process gVCF files written by the GATK v4.1 variant caller.
There are two available joint analysis output files:
Multisample VCF--A VCF file containing a column with genotype information for each of the input samples according to the input variants.
Multisample gVCF--A gVCF file augmenting the content of a multisample VCF file, similar to how a gVCF file augments a VCF file for a single sample. In between variant sites, the multisample gVCF contains statistics that describe the level of confidence that each sample is homozygous to the reference genome. Multisample gVCF is a convenient format for combining the results from a pedigree or small cohort into a single file. If using a large number of samples, fluctuation in coverage or variation in any of the input samples creates a new hom-ref block, which causes a highly fragmented block structure and a large output file that can be slow to create.
The multisample gVCF output is only available in the pedigree-based analysis.
The following example shows a single line from a multi-sample VCF where one sample has a variant, and the other two samples are in a gVCF gap. Gaps are represented by "./.:.:".
In hom-ref blocks, the following FORMAT fields are calculated uniquely.
FORMAT/DP--In a single sample gVCF, the FORMAT/DP reported at a hom-ref position is the median DP in that band. In a multisample gVCF, the FORMAT/DP reported at a hom-ref position is the MIN_DP from hom-ref calls.
FORMAT/AD--In single sample gVCF, values represent the position in the band where DP=median DP. In the multisample gVCF, AD values at hom-ref positions are copied from the single sample gVCF.
FORMAT/AF--Values are based on FORMAT/AD.
FORMAT/PL--Values represent the Phred likelihoods per genotype hypothesis. For hom-ref blocks, each value in FORMAT/PL represents the minimum value across all positions within the band.
FORMAT/SPL and FORMAT/ICNT--Parameters reported in the gVCF records, including both hom-ref blocks and variant records. The parameters are used to compute the confidence score of a variant being de novo in the proband of a trio. For SNP, FORMAT/PL and FORMAT/SPL are both used as input to the DeNovo Caller. FORMAT/PL represents Phred likelihoods obtained from the genotyper, if the genotyper is called. FORMAT/SPL represents Phred likelihoods obtained from column-wise estimation, pregraph. Each value in FORMAT/SPL represents the minimum across all positions within the band. For INDEL, the PL value is computed in the joint pedigree calling step based on the FORMAT/ICNT reported in the gVCF file. FORMAT/ICNT consist of two values. The first value is the number of reads with no indels at the position, and the second value is the number of reads with indels at the position. Each value in FORMAT/ICNT represents the maximum of the value across all positions within the band.
In the following example hom-ref block, ICNT provides information on whether each sample contains an Indel at the position of interest. If the proband contains an indel at the position and the ICNT of the parents does not indicate any read supporting an indel, then the confidence score is high for the proband to have an indel de novo call at the position.
SPL and ICNT values are specific to DRAGEN. The GATK variant caller does not output SPL and ICNT values.
In a single sample gVCF, FORMAT/DP reported at a hom-ref position is the median DP in the band. The minimum is also computed and printed as MIN_DP for the band.
In the multisample gVCF, MIN_DP from hom-ref calls is printed as FORMAT/DP, and AD is just copied from the gVCF. Therefore, at a hom-ref position in the multi-sample gVCF output, the DP is not necessarily going be the sum of AD.
Use pedigree mode to jointly analyze samples from related individuals and to perform de novo calling.
To invoke pedigree mode, set the --enable-joint-genotyping
option to true. Use the --pedigree-file
option to specify the path to a pedigree file that describes the relationship between panels.
The pedigree file must be a tab-delimited text file with the file name ending in the .ped extension. The following information is required.
The following is an example of an input pedigree file.
The De Novo Caller identifies all the trios within the pedigree and generate a de novo score for each child. The De Novo Caller supports multiple trios within a single pedigree. Pedigree Mode supports de novo calling for small, structural, and copy number variants.
Pedigree Mode is run in multiple steps. The following is an example workflow for a trio using FASTQ input.
Run single sample alignment and variant calling to generate per sample output using the following inputs for Pedigree Mode.
gVCF files for the Small Variant Caller.
*.tn.tsv files for the Copy Number Caller.
BAM files for the Structural Variant Caller.
The Small Variant De Novo Caller considers a trio of samples at a time. The samples are related via a pedigree file. The Small Variant De Novo Caller determines all positions that have a Mendelian conflict based on the genotype from the individual sample gVCFs. Sex chromosomes in males are treated as haploid apart from the PAR regions, which are treated as diploid.
Each of those positions is then processed through the Pedigree Caller to compute a joint posterior probability matrix for the possible genotypes. The probabilities are used to determine whether the proband has a de novo variant with a DQ confidence score. All three subjects are assumed to have an independent error probability.
At positions where the original genotype from the gVCFs shows a double Mendelian conflict (eg, 0/0+0/0->1/1 or 1/1+1/1->0/0), the genotypes of the trio samples can be adjusted to the highest joint posterior probability that has at least one Mendelian conflict.
The DQ formula is DQ = -10log10(1 - Pdenovo).
Pdenovo is the sum of all indexes in the joint posterior probability matrix with one of more Mendelian conflicts.
In the GT overwrite step, it is possible for the GT of the parents to be overwritten. In the case of multiple trios, the GT of the parents is based on the last trio processed. The trios are processed in the order they are listed in the pedigree file. DRAGEN currently does not add an annotation in the VCF in cases where the GT was overwritten.
The multisample VCF file is annotated with FORMAT/DQ and FORMAT/DN fields to the output a VCF file that represents a de novo quality score and an associated de novo call. The DN field in the VCF is used to indicate the de novo status for each segment.
The following are the possible values:
Inherited--The called trio genotype is consistent with Mendelian inheritance.
LowDQ--The called trio genotype is inconsistent with Mendelian inheritance and DQ is less than the de novo quality threshold.
DeNovo--The called trio genotype is inconsistent with Mendelian inheritance and DQ is greater than or equal to the de novo quality threshold.
The following is an example VCF line for a trio:
1 16355525 . G A 34.46 PASS AC=1;AF=0.167;AN=6;DP=45;FS=6.69;MQ=108.04;MQRankSum=-0.156;QD=2.46;ReadPosRankSum=0;SOR=0.016 GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP:DPL:DN:DQ 0/1:11,3:0.214:14:39:PASS:8,2:3,1:74,0,47:39.454,0.00053613,49.99:0,1,104:74,0,47:DeNovo:0.67375 0/0:18,0:0:16:48:PASS:.:.:0,48,605:.:0,12,224:0,48,255:.:. 0/0:14,0:0:14:42:PASS:.:.:0,42,490:.:0,5,223:0,42,255:.:.
The following command line options are available for de novo small variant calling.
--enable-joint-genotyping
--Run the joint genotyping caller.
--pedigree-file
--Specify the path to a pedigree file that describes the relationship between samples. It is possible to run JointGenotyper without a pedigree file on unrelated samples, but we do not recommend this anymore for gVCF variant calls from DRAGEN 3.10 or newer.
--variant
or --variant-list
--Specify the gVCF input to the workflow. The pedigree caller can read input gVCF files from an AWS S3 bucket, Azure storage BLOB, or pre-signed URL.
--qc-snp-denovo-quality-threshold
--Specify the minimum DQ value for a SNP to be considered de novo. The default is 0.05 if ML recalibration is off, 0.0017 if ML recalibration is on.
--qc-indel-denovo-quality-threshold
--Specify the minimum DQ value for an indel to be considered de novo. The default is 0.4 if ML recalibration is off, 0.04 if ML recalibration is on.
--output-directory
--The output directory. This is required.
--output-file-prefix
--The prefix used to label all output files. This is required.
-r
The directory where the hash table resides.
The output of the joint genotyper depends on the order of input gVCF files passed on the command line using --variant
or --variant-list
. It is recommended to use the same input order when re-analyzing gVCFs to ensure the output is the same as an earlier run.
DRAGEN provides a population-based analysis option to jointly analyze samples from unrelated individuals.
To compare multiple pedigrees, you can run gVCF Genotyper on the output of a pedigree analysis and merge multiple joint-called pedigrees into a single multisample VCF. To enable, run the pedigree analysis using the --enable-multi-sample-gvcf=true
option to write a multisample gVCF.
gVCF Genotyper offers an iterative workflow to aggregate new samples into an existing cohort. The iterative workflow allows users to incrementally aggregate new batches of samples with existing batches, without having to redo the analysis from scratch across all samples, every time when new samples are available. The workflow takes single sample gVCF files as input, and can be performed in a "step-by-step mode" if multiple batches of samples are available, or "end-to-end mode", if only a single batch of samples is available. Multi-sample gVCF files output from the Pedigree Caller (described above) are also accepted as input. gVCF Genotyper can accept input gVCF files generated using DRAGEN version 3.2.6 or later.
Step 1 (gVCF aggregation): the user can use iterative gVCF Genotyper to aggregate a batch of gVCF files into a cohort file and a census file. The cohort file is a condensed data format to store gVCF data in multiple samples, similar to a multi-sample gVCF. The census file stores summary statistics of all the variants and hom-ref blocks among samples in the cohort. As part of this step, adjacent hom-ref blocks with matching FILTER columns are further merged to reduce the disk footprint of the intermediate files, FORMAT field values being base-pair weight averaged in the process.
When a large number of samples are available, the user can divide samples into multiple batches each with similar sample size (e.g. 1000 samples), and repeat Step 1 for every batch.
Step 2 (census aggregation): after all per batch census files are generated, the user can aggregate them into a single global census file. This step scales to aggregate thousands of batches, in a much more efficient way than aggregating gVCFs from all batches. When a new batch of samples becomes available, the user only needs to perform Step 1 on that batch, then aggregate the census file from the new batch with the global census file from all previous batches in order to generate an updated global census file.
Step 3 (msVCF generation): every time a global census file is updated, with new variant sites discovered and/or variant statistics updated at existing variant sites, the user can take a per-batch cohort file, per-batch census file and the global census file as input, and generate a multi-sample VCF for one batch of samples. The output multi-sample VCF contains the variants and alleles discovered in all samples from all batches, and also includes global statistics such as allele frequencies, the number of samples with or without genotypes, and the number of samples without coverage. Similar statistics among samples in the batch are also included. This step can be repeated for every batch of samples, and the number of records in each output multi-sample VCF is the same across all batches.
To facilitate parallel processing on distributed compute nodes, for every step above, the user can choose to split the genome into shards of equal size, and process each shard using one instance of iterative gVCF Genotyper on each compute node. See option --shard
below.
There is a special treatment of alternative or unaligned contigs when the --shard
option is enabled: all contigs that are not autosomes, X, Y or chrM are included in the last shard. No other contigs will be assigned to the last shard. The mitochondrial contig will always be on its own in the second to last shard.
If a combined msVCF of all batches is required, an additional step can be separately run to merge all of the batch msVCF files into a single msVCF containing all samples.
--enable-gvcf-genotyper-iterative
: set to true to run the iterative gVCF Genotyper (always required).
--ht-reference
: The file containing the reference sequence in FASTA format (always required).
--output-directory
: The output directory (always required).
--output-file-prefix
: The prefix used to label all output files (optional, default value dragen
).
--shard
: Use this option to process only a portion ('shard') of the genome, when distributing the work across multiple compute nodes in a production workflow. Provide the index (1-based) of the shard to process and the total number of shards, in the format of n/N
(e.g. 1/50 means shard 1 of total 50 shards). To facilitate concurrent processing within each job, the shard will by default be split into 10x the number of available threads. This option assumes a Human reference genome and might not work for non-Human reference genomes.
--gg-regions
: Use this option to test iterative gVCF Genotyper only for a subset of regions in the genome. The value is a list of regions (chr:start-end) delimited by comma. Contig names must match those in the reference and no region may overlap another. If a single region larger than 1Mb is selected, multiple threads are enabled. Otherwise, one thread is launched per region. This assumes that the --shard
option is not given. It is important that the same regions are chosen for each step 1,2 and 3.
--gg-regions-bed
: If a path to a BED file is provided as value, this option, like the one above, will limit the iterative gVCF Genotyper processing to the genome regions specified therein, which must be non-overlapping. This option is intended for exome input data. It results in faster processing times and is compatible with sharding. This option will only take effect in step 1 or end-to-end mode. It differs from the option above in that, if the number of regions exceeds 10 times the number of available threads, they will not necessarily be processed by independent threads.
--gg-discard-ac-zero
If set to true, the gVCF Genotyper does not print variant alleles that are not called (hom-ref genotype) in any sample. The default value is true.
--gg-remove-nonref
If set to true, the <NON_REF> symbolic allele is removed in the process of reading in input gVCF files. The default value is true.
--gg-vc-filter
Discard input variants that failed filters in the upstream caller. The default is false. Affected records will have their genotype set to hom-ref and the filter string "ggf" added to FORMAT/FT.
--gg-skip-filtered-sites
Omits msVCF records that fail the given hard filter. The default is false.
--gg-squeeze-msvcf
Set to omit genotype fields other than GT from the output msVCF for confidently called hom-ref sample records.
--gg-gq-squeezing-threshold
Use in conjunction with the previous option to adjust the threshold on GQ (default 30) that signifies a confident hom-ref call.
--gvcfs-to-cohort-census
: set to true to aggregate gVCF files from one batch of samples into a cohort file and a census file.
--variant-list
: the path to a file containing a list of input gVCF files, with the absolute path to each file on a separate line.
--variant
: if --variant-list
is not given, use this option for each input gVCF file. Absolute file paths must be provided.
--aggregate-censuses
: set to true to aggregate a list of per batch census files into a global census file.
--input-census-list
: the path to a file containing a list of input per batch census files (from Step1), with the absolute path to each file on a separate line.
--generate-msvcf
: set to true to generate a multi-sample VCF for one batch of samples.
--input-cohort-file
: the path to the per batch cohort file (from Step1).
--input-census-file
: the path to the per batch census file (from Step1).
--input-global-census-file
: the path to the global census file (from Step2).
--gvcfs-to-msvcf
: set to true to enable the end-to-end mode. This is the default is none of the steps 1,2 or 3 above is selected.
--variant-list
: the path to a file containing a list of input gVCF files, with the absolute path to each file on a separate line.
--variant
: if --variant-list
is not given, use this option for each input gVCF file. Absolute file paths must be provided.
--merge-batches
: set to true to merge msVCF files for a set of batches.
--input-batch-list
: the path to a file containing a list of msVCF files to be merged, with the path to each file on a separate line. All the files listed must have been generated from the same global census file, with the same set of options, and by default all batches pertaining to that global census must be included in the merge.
--gg-enable-indexing
: set to true (the default) to generate a tabix index for the merged msVCF.
--gg-merge-subset
: set to override the restriction that all batches must be included in the merge.
Mimalloc is a custom memory allocation library that can yield a significant speed-up in the iterative gVCF Genotyper workflow. In some deployments, e.g. cloud, it is automatically and seamlessly used, but in other contexts it requires special user intervention to be activated, as at present it cannot be included in standard DRAGEN by default.
For this purpose, the convenience script mi_dragen.sh
is provided, which loads the bundled library and can be transparently used in the same way as the DRAGEN executable. Please note that its use is only intended and supported for use with the iterative gVCF Genotyper component, although it can in principle be applied for any other DRAGEN workflow too. Its use for other purposes is known to possibly lead to undesirable memory overuse and thus should be undertaken at the user's own risk.
The output of gVCF Genotyper is a multi-sample VCF (msVCF) that contains metrics computed for all samples in the cohort.
The msVCF can become a very large file with increasing cohort size. In some cases, the file might need more storage than can be allocated by VCF parsers. This is caused by VCF entries such as FORMAT/PL which store a value for each combination of alleles. We therefore decided to replace FORMAT/PL with a tag FORMAT/LPL which stores a value only for the alleles that actually occur in the sample. Similarly, the msVCF also contains FORMAT/LAD which stores the allelic depth only for the alleles occurring in the sample.
We also added a new FORMAT/LAA field which lists 1-based indices of the alternate alleles that occur in the current sample. The allele order of other local fields is the same as that of LAA.
When processing mitochondrial variant calls, which may contain separate records for each allele, iterative gVCF Genotyper processing differs in the following ways:
Only the record with the highest FORMAT/AF sum is kept.
The FORMAT/AF field will be additionally collected, and used to generate the FORMAT/LAF field in the output msVCF
The value displayed in the QUAL column of the msVCF is the maximum of the input QUAL values for the site across the global cohort. The QUAL value will be missing if any of the batch census files used to create the global census were generated with a version of DRAGEN earlier than v4.2.
Iterative gVCF Genotyper offers several metrics for assessing adherence to HWE. It calculates both allele-wise and site-wise HWE P-values, an allele-wise excess heterozygosity (ExcHet) P-value and the site-wise inbreeding coefficient (IC). These metrics are calculated only for diploid sites and missing values are excluded from the calculations. These values are included as fields in the INFO column of the output msVCF file. Both batch-wise and global values are included, where the field names for the global values are prefixed with G
.
Care should be taken when interpreting these metrics for small cohorts and/or low frequency alleles, as small changes in inputs can lead to large changes in their values. Further, violations of the underlying HWE assumptions (such as inbreeding), and non-random sampling (such as the presence of consanguineous samples), can adversely affect results, making identification of poorly called variants more difficult.
Where it is not possible to calculate the metric, they are represented as missing (i.e., ".") in the msVCF file. This can vary between the metrics, but may occur if non-diploid genotypes are encountered, if there is only one allele present at a site, or if no samples are genotyped at a site.
For HWE a P-value of ≈ 1 suggests that the distribution of heterozygotes and homozygotes is close to that expected under HWE, while a P-value of ≈ 0 suggests a deviation from it. For ExcHet a P-value of ≈ 0.5 suggests that the number of heterozygotes is close to the number expected under HWE, while a value ≈ 1 suggests that there are more heterozygotes than expected and a value ≈ 0 suggests that there are fewer heterozygotes than expected.
For a bi-allelic site the HWE P-values is based on the numbers of homozygotes and heterozygotes comparing the observed to expected. For a multi-allelic site, P-values are calculated per ALT allele as if it were bi-allelic. Genotypes composed of only the ALT allele being considered are counted as alternative homozygous, any other genotype containing a copy of the ALT allele being considered are counted as a heterozygous, and any genotype with no copies of the ALT allele being considered are counted as reference homozygous (this may include genotypes containing other ALT alleles).
Iterative gVCF Genotyper calculates a site-wise HWE P-value. The value is calculated using the Pearson's chi-squared method, comparing the genotype counts expected under HWE to those observed. The chi-squared test statistic is calculated as
𝜒2 = ∑gt (Egt - Ogt)2 / Egt
where the summation is over gt
is over all genotypes possible at the site given the alleles present, and Egt and Ogt are the expected and observed counts for genotype gt
, respectively. From the chi-squared test statistic the P-value is then calculated from a chi-squared distribution where the number of degrees of freedom is the number of possible genotypes minus the number of alleles, which is
where n
is the number of alleles.
The batch-wise value uses only the alleles present in the batch. Alleles with AC=0 are not included in the calculation.
A P-value of ≈ 1 suggests that the distribution of heterozygotes and homozygotes is close to that expected under HWE, while a P-value of ≈ 0 suggests a deviation from it.
Iterative gVCF Genotyper calculates the inbreeding coefficient (IC) (sometimes called the Fixation index and denoted by F
). It is defined as the proportion of the population that is inbred. The value of IC can be estimated by looking at the observed number of heterozygotes in comparison to the number expected under HWE:
where O(het)
and E(het)
are the observed and expected number of heterozygotes in the cohort, respectively. Although initially conceived for studying inbreeding and defined as a non-negative value, it is also commonly used to look for deviations from HWE and can take values in the range [-1, 1].
Values of IC ≈ 0 suggest that the cohort is in HWE. Negative values suggest an excess of heterozygosity and a deviation from HWE, which can be symptomatic of poor variant calling. Positive values suggest a deficit of heterozygotes and the possible presence of inbreeding.
Using the above definition, IC should be a property of the population, and so would be expected to be drawn from the same distribution for all sites and for all variants at a site. Deviations from this distribution can suggest issues in calling a site correctly. Violations of HWE assumptions and/or non-random sampling may adversely affect the distribution of IC, causing it to be shifted. However, outliers can still be identified, although thresholds may need to be adjusted accordingly.
Allelic balance (AB) describes the proportion of reads that support each allele within a called genotype and can be calculated from the allelic depth (FORMAT/AD or FORMAT/LAD). For homozygotes this is taken as
AB = ADi / ∑j ADj
where i
is the index of the called allele and j
runs over all alleles. For heterozygotes this is taken as
ABi = ADi / (ADa + ADb)
where a
and b
are the indices of the called alleles and i
can have values a
or b
. For homozygous genotypes AB is expected to be ≈ 1 and for heterozygous genotypes it is expected to be ≈ 0.5 for each allele. Deviations from the expected values can be indicative of an error.
DRAGEN's iterative gVCF Genotyper calculates site-wise AB values for each allele based on the read depths among all samples. Only diploid genotypes are included in the calculations. Values are calculated separately among homozygous (ABHom) and heterozygous (ABHet) genotypes. ABHet is calculated using the counts among all heterozygous calls that contains the allele under consideration. P-values for ABHet are also calculated (ABHetP) based on a binomial test with an expected probability of 0.5. A P-value of ≈ 1 signifies that results are in line with expectation while ≈ 0 signifies a deviation from expectation. Values are written to the INFO fields ABHom, ABHet and ABHetP, with one value for each allele (including the reference allele). Values should be in the range [0, 1]. Missing values are coded by -1, for example where there are no homozygous calls for an allele. If AD is not present in any input gVCF file, the values are not calculated and the fields will be omitted from the output msVCF file.
Sites in the output msVCF can be filtered on the following global metrics:
QUAL
Number of samples with called genotypes (GNS_GT)
Inbreeding coefficient (GIC)
𝜒2 Hardy-Weinberg Equilibrium P-value (GHWEc2)
The maximum P-value for heterozygous allelic balance (GABHetP)
The per-sample genotype metrics in the output msVCF can be customized by providing a colon-separated list of metrics, analogous to that of the VCF FORMAT column, to the --gg-msvcf-format-fields
option, e.g. --gg-msvcf-format-fields=GT:LAD:LPL:LAA:QL
. Supported metrics are GT, GQ, AD, LAD, FT, LPL, LAA, LA, LGT, QL, MQR, LAF and DF (N.B. LAF will only appear on the MT contig and DF will only appear if the --gg-diploidify
option is enabled). Sample genotype (GT) is always present and always shown first, regardless of whether it is included in the option string or not. Alternatively, an msVCF containing only site statistics and no per-sample genotype fields can be generated using the option --gg-msvcf-format-fields=None
.
The per-site INFO metrics in the output msVCF can be customized by providing a semicolon-separated list of metrics, analogous to that of the VCF INFO column, to the --gg-msvcf-info-fields
option, e.g. --gg-msvcf-info-fields=AC;AN;NS;NS_GT;NS_NOGT;NS_NODATA;AF
. Supported metrics are AC, AN, NS, NS_GT, NS_NOGT, NS_NODATA, IC, HWE, ExcHet, HWEc2, AF. The default set of metrics is AC, AN, NS, NS_GT, NS_NOGT, NS_NODATA, IC, HWE, ExcHet and HWEc2. All INFO fields can be included using the option --gg-msvcf-info-fields=All
. All INFO fields can be dropped using the option --gg-msvcf-info-fields=None
, in which case the INFO field will contain the missing symbol, .
. For each specified metric, the value for the current batch and the global value are written. For global values, the metric names are prepended by G
.
INFO fields that have a missing value, .
, at a site are omitted from the msVCF for that site, so sites may contain different sets of fields.
For sizable cohorts, the file outputs from gVCF Genotyper can become extremely large. However, there are a number of options within the component which can mitigate this. As well as reduced footprint on disk, these options can lead to faster runtimes owing to the diminished I/O demands.
The following options have applicability to this:
The small variant caller's --vc-compact-gvcf
, described previously. This doesn't reduce output file sizes, but the smaller input gVCFs reduce gVCF Genotyper runtime and could reduce data storage costs.
The removal of the NON_REF symbolic allele when ingesting the input gVCF files, which is the default behaviour. Doing this reduces the size not only of the final msVCF output, but also the intermediate cohort and census files.
Several options exist that reduce the volume of data written to the final msVCF file:
Omitting records that fail filters (--gg-skip-filtered-sites
option).
Dropping trailing genotype fields for hom-ref records (--gg-squeeze-msvcf
option). This behaviour is explicitly permitted by the VCF specification.
1: The number of values is coded as per the VCF specification, with A
denoting one value per alt allele, R
one value per possible allele (including the reference allele), G
one value per possible genotype and .
an unspecified number of values that may vary between site and sample. The number of elements in localised array FORMAT fields that depend on the number of local alleles will vary between samples and so are specified as .
.
The DRAGEN small variant caller is a haplotype-based caller which performs local assembly of all reads in an active region into a de Bruijn graph (DBG). The assembly process uses all the read bases including the soft-clip bases of reads. The soft-clip bases provide evidence for the presence of variants, specifically longer insertions and deletions which are not present in the read cigar and hence cannot be directly viewed in IGV.
The assembly and realignment step (using pair-HMM) performed by variant caller aims to correct mapping errors made by the original aligner and improves the overall variant caller accuracy. Using the evidence BAM, we can view how the variant caller sees the read evidence and how the reads have been realigned making it a very useful debugging tool.
By default, the evidence BAM contains only a subset of regions processed by the small variant caller. Only regions which have candidate indel variants and some percentage of soft-clip reads in the pileup are realgned and output in the evidence BAM. This is done to reduce the run-time overhead needed to generate the evidence BAM.
The output of the VC Evidence BAM feature will match the output format that the customer has selected using --output-format option. The default format is bam.
A bam/cram/sam file with the suffix _evidence.bam/cram/sam
and the corresponding index file. The evidence BAM can be enabled along with the regular BAM output from the Map-Align step. When multiple BAM are passed as inputs to the variant caller, for e.g., in Tumor-Normal calling, then they will be combined in the evidence BAM output and tagged with appropriate read groups.
A bed file with regions that were realigned and output in VC Evidence BAM with suffix ".realigned-regions.bed".
The evidence BAM consists of realigned reads, badly mated reads and reads that are disqualified by the variant caller based on the read likelihood scores.
Disqualified and Badly Mated reads
Reads that are badly-mated (when the read and its mate are mapped to different chromosmes) are tagged with a BM tag (integer) and reads that are disqualified (based on read likelihoods) are tagged with the DQ tag (integer). These reads are filtered out by the genotyper in the variant caller. The alignment score tag AS is forced to 0 for such reads in the evidence BAM and hence, they can be filtered from the IGV pileup by setting the minimum AS score to be 1 instead of 0.
Graph Haplotypes
When enabling graph haplotypes output using --vc-evidence-bam-output-haplotypes
, all the haplotypes constructed by the de Bruijn graph are output in the evidence BAM as single reads covering the entire active region. The reads and haplotypes are tagged with different read groups which makes it easily distinguishable in IGV. In IGV, we can use “Color Alignments By” or “Group Alignments By” > read group to separate out the reads from the haplotypes. The haplotypes are tagged with read group EvidenceHaplotype
and the reads are part of the EvidenceRead_Normal/Tumor
read group.
The haplotypes are named as Haplotype 1, Haplotype 2 and so on and have an additional ‘HC’ tag (integer). The realigned reads also have an HC tag which encodes which haplotype best matches the read based on the likelihood calculation. Only reads which are supported by a single unique haplotype have the HC tag, reads which match more than one haplotype well do not have an HC tag. The use of this tag is primarily intended to enable highlighting of reads in IGV. Go to "Color Alignments By > Tag" and enter "HC" to view which reads are uniquely supported by a certain graph haplotypes.
Regions of homozygosity (ROH) are detected as part of the small variant caller. The caller detects and outputs the runs of homozygosity from whole genome calls on autosomal human chromosomes. Sex chromosomes are ignored unless the sample sex karyotype is XX, as specified on the command line or determined by the Ploidy Estimator. ROH output allows downstream tools to screen for and predict consanguinity between the parents of the proband subject.
A region is defined as consecutive variant calls on the chromosome with no large gap in between these variants. In other words, regions are broken by chromosome or by large gaps with no SNV calls. The gap size is set to 3 Mbases.
ROH Algorithm
The ROH algorithm runs on the small variant calls. The algorithm excludes variants with multiallelic sites, indels, complex variants, non-PASS filtered calls, and homozygous reference sites. The variant calls are then filtered further using a block list BED, and finally depth filtering is applied after the block list filter. The default value for the fraction of filtered calls is 0.2, which filters the calls with the highest 10% and lowest 10% in DP values. The algorithm then uses the resulting calls to find regions.
The ROH algorithm first finds seed regions that contain at least 50 consecutive homozygous SNV calls with no heterozygous SNV or gaps of 500,000 bases between the variants. The regions can be extended using a scoring system that functions as follows.
Score increases with every additional homozygous variant (0.025) and decreases with a large penalty (1-0.025) for every heterozygous SNV. This provides some tolerance of presence of heterozygous SNV in the region.
Each region expands on both ends until the regions reach the end of a chromosome, a gap of 500,000 bases between SNVs occurs, or the score becomes too low (0).
Overlapping regions are merged into a single region. Regions can be merged across gaps of 500,000 bases between SNVs if a single region would have been called from the beginning of the first region to the end of the second region without the gap. There is no maximum size for regions, but regions always end at chromosome boundaries.
ROH Options
--vc-enable-roh
Set to true to enable the ROH caller. The ROH caller is enabled by default for human autosomes only. Set to false to disable.
--vc-roh-blacklist-bed
If provided, the ROH caller ignores variants that are contained in any region in the block list BED file. DRAGEN distributes block list files for all popular human genomes and automatically selects a block list to match the genome in use, unless this option is used to select a file.
ROH Output
The ROH caller produces an ROH output file named <output-file-prefix>.roh.bed
in which each row represents one region of homozygosity. The BED file contains the following columns:
Chromosome Start End Score #Homozygous #Heterozygous
Score is a function of the number of homozygous and heterozygous variants, where each homozygous variant increases the score by 0.025, and each heterozygous variant reduces the score by 0.975.
Start and end positions are a 0-based, half-open interval.
#Homozygous is number of homozygous variants in the region.
#Heterozygous is number of heterozygous variants in the region. The caller also produces a metrics file named <output-file-prefix>.roh_metrics.csv
that lists the number of large ROH and percentage of SNPs in large ROH (>3 MB).
The table below demonstrates how the PLINK options can be tuned to behave similarly to the DRAGEN ROH caller default settings (see column DRAGEN default). We observed that PLINK ROH calls (see column PLINK default) in default settings are more conservative compared to DRAGEN default settings. By default, PLINK reports ROH regions of size 1MB or larger (see PLINK option --homozyg-kb ) with at least 100 homozygous SNPs (see PLINK option --homozyg-snp) while DRAGEN ROH caller reports smaller regions with at least 50 homozygous SNPs (see DRAGEN ROH Algorithm section). In addition, PLINK by default allows for only 1 heterozygous SNP per scanning window (specified by PLINK option --homozyg-window-het) while DRAGEN uses a soft score threshold penalty without setting an upper bound on the allowed number of heterozygous SNPs (see DRAGEN ROH Algorithm section). The PLINK ROH calls are largely comparable to the DRAGEN ROH calls after relaxing the default PLINK settings, shown in column PLINK tuned. Prior to PLINK ROH calling, the input DRAGEN hard-filtered VCF files are filtered as per the instructions in DRAGEN ROH Algorithm section.
The DRAGEN Small Variant Caller is a high-speed haplotype caller implemented with a hybrid of hardware and software. The caller performs localized de novo assembly in regions of interest to generate candidate haplotypes, and then performs read likelihood calculations using a hidden Markov model (HMM).
Variant calling is disabled by default. To enable variant calling, set the --enable-variant-caller
option to true. The VCF header is annotated with ##source=DRAGEN_SNV
to indicate the file is generated by the DRAGEN SNV pipeline.
The DRAGEN Small Variant Caller performs the following steps:
Active Region Identification---Identifies areas where multiple reads disagree with the reference are identified, and selects windows around them (active regions) for processing.
Localized Haplotype Assembly--- Assembles all overlapping reads in each active region into a de Bruijn graph (DBG). A DBG is a directed graph based on overlapping K-mers (length K subsequences) in each read or multiple reads. When all reads are identical, the DBG is linear. Where there are differences, the graph forms bubbles of multiple paths that diverge and rejoin. If the local sequence is too repetitive and K is too small, cycles can form, which invalidate the graph. DRAGEN uses K=10 and 25 as the default values. If those values produce an invalid graph, then additional values of K=35, 45, 55, 65 are tried until a cycle-free graph is obtained. From this cycle-free DBG, DRAGEN extracts every possible path to produce a complete list of candidate haplotypes, ie, hypotheses for what the true DNA sequence might be on at least one strand. In addition to graph assembly, haplotypes are also generated via columnwise detection, with candidate variant events identified directly from BAM alignments. Columnwise detection is enabled by default in all small variant calling pipelines and is supplementary to the DBG, but is especially useful in highly repetitive regions where DBG assembly of reads is more likely to fail.
Haplotype Alignment---Uses the Smith-Waterman algorithm to align each extracted haplotype to the reference genome. The alignments determine what variations from the reference are present.
Read Likelihood Calculation---Tests each read against each haplotype to estimate a probability of observing the read assuming the haplotype was the true original DNA sampled. This calculation is performed by evaluating a pair hidden Markov model (HMM), which accounts for the various possible ways the haplotype might have been modified by PCR or sequencing errors into the read observed. The HMM evaluation uses a dynamic programming method to calculate the total probability of any series of Markov state transitions arriving at the observed read.
Genotyping---Forms the possible diploid combinations of variant events from the candidate haplotypes and, for each combination, calculates the conditional probability of observing the entire read pileup. Calculations use the constituent probabilities of observing each read, given each haplotype from the pair HMM evaluation. These calculations feed into the Bayesian formula to calculate the likelihood that each genotype is the genotype of the sample being analyzed, given the entire read pileup observed. Genotypes with maximum likelihood are reported.
In most pipelines, DRAGEN reports two types of depth counts, both of which may differ from the information in the BAM pileup due to various filtering steps that are applied throughout variant calling. Briefly:
Unfiltered depth is the number of reads covering the position, downstream of any read collapsing or deduplication that may have preceded the variant calling step, but upstream of most read filtering and overlapping mate handling. Unfiltered depth is reported as INFO/DP, except in the case of gVCF homref calls, where it is reported as FORMAT/DP.
Informative depth is the number of reads actually used to make the calling decision, where filtered reads and uninformative reads (reads that could not be assigned to a specific allele) have been excluded, and overlapping mate pairs are counted as single reads. When overlapping mate pairs are present, this may cause an apparent discrepancy between the reported depth and the pileup as viewed in a browser such as IGV. To resolve this, use the "View as pairs" option in IGV. Informative depth is reported as FORMAT/DP, except in the case of gvcf homref calls, where it is not reported. The FORMAT/AD and FORMAT/AF fields are based on informative depth.
The following figure summarizes the different filtering steps in more detail.
Filter 1 acts on the reads present in the BAM input (in UMI pipelines, these are the collapsed reads produced by the read collapsing step, not the raw reads) and filters out the following reads:
Duplicate reads.
Soft-clipped bases. DRAGEN filters out soft-clipped bases only when calculating coverage reports.
[Somatic] Reads with MAPQ=0.
[Somatic] Reads with MAPQ < vc-min-tumor-read-qual, where vc-min-tumor-read-qual >1.
Filter 2 trims bases with BQ < 10 and filters out the following reads:
Unmapped reads.
Secondary reads.
Reads with bad cigars.
Filter 3 occurs after downsampling and HMM. Filter 3 filters out the following reads:
Reads that are badly mated. A badly mated read is a read where the pair is mapped to two different reference contigs.
Disqualified reads. Reads are disqualified if their HMM score is below a threshold.
Filter 4 occurs after the genotyper runs. The genotyper adds annotation information to the FORMAT field. Filter 4 filters out reads that are not informative. For example, if the HMM scores of the read against two different haplotypes are almost equal, the read is filtered out because it does not provide enough information to distinguish which of the two haplotypes are more likely.
Since DRAGEN 4.3 the mosaic small variant caller runs downstream of the germline small variant caller. Non-cancer post-zygotic mosaic variants with typical AF lower than 50% detected by the mosaic caller are reported in the output VCF file with a MOSAIC
INFO flag. As default, MOSAIC
tagged variants with AF
smaller than 20% are filtered with the MosaicLowAF
filter.
See Mosaic detection for further details on the mosaic small variant caller and the mosaic detection mode and a comparison with DRAGEN 4.2 features.
The following options control the variant caller stage of the DRAGEN host software.
--enable-variant-caller
Set --enable-variant-caller
to true to enable the variant caller stage for the DRAGEN pipeline.
--vc-target-bed
[Optional] Restricts processing of the small variant caller, target BED related coverage, and callability metrics to regions specified in a BED file. The BED file is a text file containing at least three tab-delimited columns. The first three columns are chromosome, start position, and end position. The positions are zero-based. For example:
If the reference span of the variant overlaps with any of the regions in the target BED, then the variant is output. If the reference span does not overlap, the variant is not output. For SNPs and Insertions, the reference span is 1 bp. For deletions, the reference span is the length of the deletion.
--vc-target-bed-padding
[Optional] Pads all target BED regions with the specified value. For example, if a BED region is 1:1000–2000 and the specified padding value is 100, the result is equivalent to using a BED region of 1:900–2100 and a padding value of 0. Any padding added to --vc-target-bed-padding is used by the small variant caller and by the target bed coverage/callability reports. The default padding is 0.
--vc-target-coverage
Specifies the target coverage for downsampling. The default value is 500 for germline mode and 50 for somatic mode.
--vc-remove-all-soft-clips
Set to true to ignore soft-clipped bases during the haploytype assembly step.
--vc-decoy-contigs
Specifies a comma-separated list of contigs to skip during variant calling. This option can be set in the configuration file.
--vc-enable-decoy-contigs
Set to true to enable variant calls on the decoy contigs. The default value is false.
--vc-enable-phasing
Enable variants to be phased when possible. The default value is true.
--vc-combine-phased-variants-distance
Set the maximum distance in base pairs between phased variants to be combined. The default value is 0, which disables the option. If the user wants to enable the combining of phased variants the recommended value of the distance is 15 base pairs. The valid range is [0; 15].
--vc-enable-mosaic-detection
Set to true to enable DRAGEN mosaic detection with mosaic AF filter threshold set to 0.0
. Set to false to disable DRAGEN mosaic detection. The default is true with mosaic AF filter threshold set to 0.2
.
--vc-mosaic-af-filter-threshold
Set the allele frequency threshold for the application of the MosaicLowAF
filter to mosaic calls. All MOSAIC
tagged variants with AF
smaller than the AF
threshold are filtered with the MosaicLowAF
filter. The default mosaic AF
filter threshold is set to 0.2
when the germline variant caller is enabled. The AF default threshold is set to 0.0
when the mosaic detection mode is enabled with --vc-enable-mosaic-detection=true
.
You can use the following options for downsampling reads in the small variant calling pipeline.
For mitochondrial small variant calling, the downsampling options can be set separately because the mitochondrial contig contains a higher depth than the rest of the contigs in a WGS data set. The following are the downsampling options for the mitochondrial contig.
--vc-target-coverage-mito
--vc-max-reads-per-active-region-mito
--vc-max-reads-per-raw-region-mito
The target coverage and max/min reads in raw/active region options are not directly related and could be triggered independently.
The following are the default downsampling values for each small variant calling mode.
The target coverage downsampling step runs first and is meant to limit the the total coverage at a given position. This step is approximate and the coverage after downsampling at a given position could be a bit higher than the threshold due to the --vc-min-reads-per-start-pos
setting.
If the number of reads at any position with same start position is equal to or lower than the --vc-min-reads-per-start-pos
, that position is skipped for downsampling to make sure that there is always at least a minimum number of reads (set to --vc-min-reads-per-start-pos
, default value is 10) at any start position.
The next downsampling step is to apply the --vc-max-reads-per-raw-region
and --vc-max-reads-per-active-region
limits. These options are used to limit the total number of reads in an entire region using a leveling downsampling method.
This downsampling mechanism scans each start position from the start boundary of the region and discards one read from that position, then moves on to the next position, until the total number of reads falls below the threshold. It can potentially take several passes across the entire region for the total number of reads in the entire region to fall below the threshold. After the threshold is met, the downsampling step is stopped regardless of which position was considered last in the region.
When downsampling occurs, the choice of which reads to keep or remove is random. However, the random number generator is seeded to a default value to make sure that the generator produces the same set of values in each run. This ensures reproducible results, which means there is no run to run variation when using the same input data.
A genomic VCF (gVCF) file contains information on variants and positions determined to be homozygous to the reference genome. For homozygous regions, the gVCF file includes statistics that indicate how well reads support the absence of variants or alternative alleles. The gVCF file includes an artificial <NON_REF>
allele. Reads that do not support the reference or any variants are assigned the <NON_REF>
allele. DRAGEN uses these reads to determine if the position can be called as a homozygous reference, as opposed to remaining uncalled. The resulting score represents the Phred-scaled level of confidence in a homozygous reference call. In germline mode, the score is FORMAT/GQ
and in somatic mode the score is FORMAT/SQ
.
The following options are available to enable and control gVCF output.
--vc-emit-ref-confidence
To enable gVCF output, set to GVCF
. By default, contiguous runs of homozygous reference calls with similar scores are collapsed into blocks (hom-ref blocks). Hom-ref blocks save disk space and processing time of downstream analysis tools. DRAGEN recommends using the default mode.
To produce unbanded output, set --vc-emit-ref-confidence
to BP_RESOLUTION
.
--vc-enable-vcf-output
To enable VCF file output during a gVCF run, set to true. The default value is false.
--vc-gvcf-bands
If using the default --vc-emit-ref-confidence gvcf
(banded mode), DRAGEN collapses gVCF records with a similar GQ or SQ score. By default, the cutoffs are 1 10 20 30 40 60 80
for germline and 1 3 10 20 50 80
for somatic. For example, to define the bands [0, 10), [10, 50), and ≥ 50 use --vc-gvcf-bands 10 50
.
--vc-compact-gvcf
This option, when used for germline in conjunction with --vc-emit-ref-confidence gvcf
, produces a much smaller gVCF output file than the default. It can be used when the gVCF is destined for ingestion into gVCF Genotyper, offering further savings on disk space and gVCF Genotyper runtime compared to the default. This option implies --vc-gvcf-bands 0 1 10 20 30
and additionally omits certain metrics that are not used by gVCF Genotyper. Note that files generated using this option will be rejected by the Pedigree Caller.
Not all entries in the gVCF are contiguous. The file might contain gaps that are not covered by either a variant line or a hom-ref block. The gaps correspond to regions that are not callable. A region is not callable if there is not at least one read mapped to the region with a MAPQ score above zero.
In germline mode, the thresholds for calling are lower for gVCFs than for VCFs. The gVCF output could show a different number of variants than a VCF run for the same sample. There is likely a different number of biallelic and multiallelic calls because gVCF mode includes all possible alleles at a locus, rather than only the two most likely alleles. This means that a biallelic call in the VCF can be output as a multiallelic call in the gVCF. The genotype in the gVCF still points to the two most likely alleles, so the variant call remains the same.
The following are example gVCF records that include a hom-ref block call and a variant call.
In single sample gVCF, FORMAT/DP reported at a HomRef position is the median DP in the band and AD is the corresponding value, so sum of AD will be DP even in a homref band. The minimum is also computed and printed as MIN_DP for the band.
In single sample VCF and gVCF, the QUAL follows the definition of the VCF specification. For more information on the VCF specification, see the most current VCF documentation available on samtools/hts-specs GitHub repository.
QUAL is the Phred-scaled probability that the site has no variant and is computed as:
That is, QUAL = GP (GT=0/0), where GP = posterior genotype probability in Phred scale. QUAL = 20 means there is 99% probability that there is a variant at the site. The GP values are also given in Phred-scaled in the VCF file.
GQ for non-homref calls is the Phred-scaled probability that the call is incorrect. GQ=-10*log10(p), where p is the probability that the call is incorrect. GQ=-10*log10(sum(10.^(-GP(i)/10))) where the sum is taken over the GT that did not win. GQ of 3 indicates a 50 percent chance that the call is incorrect, and GQ of 20 indicates a 1 percent chance that the call is incorrect.
In gvcf mode, the evidence in favor of homozygous reference calls is also assessed. However, the posterior probability is not of interest in this case (with zero evidence, e.g. due to zero coverage, the strong prior in favor of homref would yield a strong posterior in favor of homref), so the value of GQ for homref calls reflects the evidence directly, defined using the likelihood ratio between the likelihoods for the homref hypothesis and the strongest competing variant hypothesis: 10*log10[P(D|homref)/P(D|variant)] where D represents the pileup data.
QD is the QUAL normalized by the read depth, DP.
The QUAL scores generated by DRAGEN differ significantly from those of GATK, as DRAGEN's algorithms for small variant detection provide more realistic scores. This improvement stems from two key factors:
Correlated Errors: DRAGEN accounts for real-world correlated errors, unlike GATK, which assumes errors are uncorrelated, leading to inflated QUAL scores in GATK.
Machine Learning (ML): DRAGEN-ML further recalibrates QUAL scores, making them more accurate than DRAGEN without ML. With ML enabled, QUAL scores tend to not exceed 75, compared to GATK, where they can exceed 1000. Consequently, DRAGEN-ML uses a lower QUAL filtering threshold (3) compared to DRAGEN without ML (10).
Our recommendation is to use the default filtering thresholds in DRAGEN: QUAL threshold of 3 with ML enabled.
DRAGEN supports output of phased variant records in both the germline and the somatic VCF and gVCF files. When two or more variants are phased together, the phasing information is encoded in a sample-level annotation, FORMAT/PS. FORMAT/PS identifies which set the phased variant is in. The value in the field in an integer representing the position of the first phased variant in the set. All records in the same contig with matching PS values belong to the same set.
The following is an example of a DRAGEN single sample gVCF, where two SNPs are phased together.
During the genotyping step, all haplotypes and all variants are considered over an active region. For each pair of variants, if both variants occur on all of the same haplotypes or if either is a homozygous variant, then they are phased together. If the variants only occur on different haplotypes, then they are phased opposite to each other. If any heterozygous variants are present on some of the same haplotypes but not others, phasing is aborted and no phasing information is output for the active region.
Phased variant records that belong to the same phasing set can be combined into a single VCF record. For example, assuming reference at position chr2 115035
is A
, the following two phased variants are combined.
The phased variants are combined as follows.
The command-line option --vc-combine-phased-variants-distance
specifies the maximum distance over which phased variants will be combined. The default value 0 disables the feature. When enabled, the option combines all phased variants in the phasing set that are within the provided distance value.
DRAGEN supports phasing of the genotypes listed in the below table. Only the first row in the table is relevant to somatic, since the somatic pipeline only emits 0/1 and 0|1 genotypes. MNV calls can still be phased with other variant calls that fell outside the phased variants distance.
Examples of diploid haplotypes where phasing is supported:
Examples of diploid haplotypes where phasing is not supported:
By default in somatic mode, DRAGEN will output all component SNVs and INDELs that make up an MNV along with the MNV call itself. MNVs and their component calls can be identified and linked to one another by a common value in the INFO.MNVTAG field. Setting --vc-mnv-emit-component-calls=false
can be used to restrict which component calls are reported. When DRAGEN reports an MNV call, it considers the difference between the VAF of the MNV call and the VAF of each component call, and reports any given component call in addition to the MNV call if this difference is greater than --vc-combine-phased-variants-max-vaf-delta
(default: 0.1). The --vc-mnv-emit-component-calls
and --vc-combine-phased-variants-max-vaf-delta
options are only applicable in somatic mode and are not supported in germline mode. In germline mode, functionality to output component calls is not available and MNV calls are emitted only without component calls.
DRAGEN outputs variants in a VCF file following variant normalization as described here https://genome.sph.umich.edu/wiki/Variant_Normalization. The normalization of a variant representation in VCF consists of two parts: parsimony and left alignment pertaining to the nature of a variant's length and position respectively.
Parsimony means representing a variant in as few nucleotides as possible without reducing the length of any allele to 0.
Left aligning a variant means shifting the start position of that variant to the left till it is no longer possible to do so.
A variant is normalized if and only if it is parsimonious and left aligned
Additional notes on variant representation in the DRAGEN VCF:
Reference-trimming of alleles: A single padding reference base is used to represent insertions and deletions (i.e. the reference base preceding the insertion or deletion is included).
Allele decomposition: by default, multi-nucleotide polymorphisms (MNPs) are represented as separate, contiguous individual SNVs records in the VCF. If phasing can be determined, the FORMAT/GT is phased and the FORMAT/PS contains the coordinate position of the first variant in the set of phased variants. This determines which variant have occurred on the same haplotype. Phased variant records that belong to the same phasing set can be combined into a single VCF record by using the --vc-combine-phased-variants-distance
command-line option and set it to a non-zero value. When enabled, the option combines all phased variants in the phasing set that are within the provided distance value (specified in the number of basepairs).
In some cases, such as complex variants in repetitive regions, some variants cannot be normalized (i.e. converted into a standard representation) or represented uniquely. To counteract this problem, when comparing two VCFs (e.g. a DRAGEN VCF against a truth set VCF), it is recommended to use the RTG vcfeval tool which performs variant comparisons using a haplotype-aware approach. RTG vcfeval has been adopted as the standard VCF comparison tool by GA4GH and PrecisionFDA https://www.biorxiv.org/content/biorxiv/early/2018/02/23/270157.full.pdf.
A multiallelic site is a specific locus in a genome that contains three or more observed alleles, counting the reference as one, and therefore allowing for two or more variant alleles. Multi-allelic calls are output in a single variant record in the VCF as follows:
chr1 2656216 . A T,C 107.65 PASS AC=1,1;AF=0.500,0.500;AN=2;DP=12;FS=0.000;MQ=28.95;QD=8.97;SOR=3.056;FractionInformativeReads=0.750 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB 1/2:0,5,4:0.556,0.444:9:15:177,144,46,122,0,72:-17.704,-14.420,-4.626,-12.220,0.000,-7.244:1.076e+02,1.096e+02,1.465e+01,8.758e+01,1.520e-01,4.082e+01:0.00,34.77,37.77,34.77,69.54,37.77:0,0,1,8:0,0,4,5
Two indels are considered as multi-allelic if they share the same reference base preceding the indel. chr1 7392258 . C CT,CTTT 234.76 PASS AC=1,1;AF=0.500,0.500;AN=2;DP=44;FS=0.000;MQ=199.22;QD=5.34;SOR=2.226;FractionInformativeReads=0.659 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB 1/2:0,15,14:0.517,0.483:29:50:245,256,55,190,0,55:-24.476,-25.634,-5.492,-18.976,0.000,-5.500:2.348e+02,2.513e+02,5.292e+01,1.848e+02,4.401e-05,5.300e+01:0.00,5.00,8.00,5.00,10.00,8.00:0,0,7,22:0,0,17,12
If a SNP overlaps an INDEL, but the SNP does not align with the reference base preceding the indel, the SNP and INDEL are represented as two different variant records, as shown in the example below. However DRAGEN has the joint detection of overlaping variants feature which is designed to detect overlapping SNP and INDEL and output them in a single VCF variant record, represented as a multi-allelic genotype.
chr1 1029628 . C CGT 49.88 PASS AC=1;AF=0.500;AN=2;DP=37;FS=7.791;MQ=105.32;MQRankSum=-1.315;QD=1.35;ReadPosRankSum=1.423;SOR=1.510;FractionInformativeReads=0.892;R2_5P_bias=-19.742 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB:PS 0|1:17,16:0.485:33:48:81,0,50:-8.088,0.000,-5.000:4.988e+01,6.653e-05,5.300e+01:0.00,31.00,34.00:10,7,5,11:11,6,9,7:1029628 chr1 1029629 . A G 50.00 PASS AC=1;AF=0.500;AN=2;DP=37;FS=1.289;MQ=105.32;MQRankSum=-0.659;QD=1.35;ReadPosRankSum=-0.199;SOR=0.604;FractionInformativeReads=1.000;R2_5P_bias=-24.923 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB:PS 0|1:16,21:0.568:37:48:85,0,49:-8.477,0.000,-4.934:5.000e+01,6.886e-05,5.234e+01:0.00,34.77,37.77:9,7,10,11:10,6,13,8:1029628
The small variant caller currently only supports either ploidy 1 or 2 on all contigs within the reference except for the mitochondrial contig, which uses a continuous allele frequency approach (see Mitochondrial Calling). The selection of ploidy 1 or 2 for all other contigs is determined as follows.
If --sample-sex
is not specified on the command line, the Ploidy Estimator determines the sex. If the Ploidy Estimator cannot determine the sex karyotype or detects sex chromosome aneuploidy, all contigs are processed with ploidy 2.
If --sample-sex
is specified on the command line, contigs are processed as follows.
For female samples, DRAGEN processes all contigs with ploidy 2, and marks variant calls on chrY with a filter PloidyConflict.
For male samples, DRAGEN processes all contigs with ploidy 2, except for the sex chromosomes. DRAGEN processes chrX with ploidy 1, except in the PAR regions, where it is processed with ploidy 2. chrY is processed with ploidy 1 throughout.
For male samples in germline calling mode, DRAGEN calls potential mosaic variants in non-PAR regions of sex chromosomes. A variant is called as mosaic when the allele frequency (FORMAT/AF) is below 85% or if multiple alt alleles are called, suggesting incompatibility with the haploid assumption. The GT field for bi-allelic mosaic variants is "0/1", denoting a mixture of reference and alt alleles, as opposed to the regular GT of "1" for haploid variants. The GT field for multi-allelic mosaic variants is "1/2" in VCF. You can disable the calling of mosaic variants by setting --vc-enable-sex-chr-diploid
to false.
An example germline VCF record of a mosaic variant in a haploid region: chrX 18622368 . C T 48.84 PASS AC=1;AF=0.500;AN=2;DP=22;FS=4.154;MQ=248.02;MQRankSum=3.272;QD=2.27;ReadPosRankSum=2.671;SOR=1.546;FractionInformativeReads=1.000;MOSAIC
GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 0/1:9,13:0.5909
:22:1,8:8,5:48:84,0,51:4.8837e+01,7.4031e-05,5.4007e+01:0.00,34.77,37.77:5,4,4,9:3,6,5,8
DRAGEN detects sex chromosomes by the naming convention, either X/Y or chrX/chrY. No other naming convention is supported.
Instead of treating overlapping mates as independent evidence for a given event, DRAGEN handles overlapping mates in both the germline and somatic pipelines as follows.
When the two overlapping mates agree with each other on the allele with the highest HMM score, the genotyper uses the mate with the greatest difference between the highest and the second highest HMM score. The HMM score of the other mate becomes zero.
When the two overlapping mates disagree, the genotyper sums the HMM score between the two mates, assigns the combined score to the mate that agrees with the combined result, and changes the HMM score of the other mate to zero.
The base qualities of overlapping mates are no longer adjusted.
Typically, there are approximately 100 mitochondria in each mammalian cell. Each mitochondrion harbors 2–10 copies of mitochondrial DNA (mtDNA). For example, if 20% of the chrM copies have a variant, then the allele frequency (AF) is 20%. This is also referred to as continuous allele frequency. The expectation is that the AF of variants on chrM is anywhere between 0% and 100%.
DRAGEN processes chrM through a continuous AF pipeline, which is similar to the somatic variant calling pipeline. In this case, a single ALT allele is considered and the AF is estimated. The estimated AF can be anywhere between 0% and 100%. Default variant AF thresholds are applied to mitochondrial variant calling.
--vc-enable-af-filter-mito
Whether to enable the allele frequency for mitochondrial variant calling. The default is true.
--vc-af-call-threshold-mito
Set the threshold for emitting calls in the VCF. The default is 0.01.
--vc-af-filter-threshold-mito
Set the threshold to mark emitted vcf call as filtered. The default is 0.02.
QUAL and GQ are not output in the chrM variant records. Instead, the confidence score is FORMAT/SQ, which gives the Phred-scaled confidence that a variant is present at a given locus. A call is made if FORMAT/SQ> vc-sq-call-threshold (default = 3.0).
The following filters can be applied to mitochondrial variant calls.
--vc-sq-call-threshold
Set the SQ threshold for emitting calls in the VCF. The default is 0.1.
--vc-sq-filter-threshold
Set the SQ threshold to mark emitted VCF calls as filtered. The default is 3.0
--vc-enable-triallelic-filter
Enables the multiallelic filter. The default value is false.
If FORMAT/SQ < vc-sq-call-threshold, the variant is not emitted in the VCF. If FORMAT/SQ > vc-sq-call-threshold but FORMAT/SQ < vc-sq-filter-threshold, the variant is emitted in the VCF but FILTER=weak_evidence.
If FORMAT/SQ> vc-sq-call-threshold, FORMAT/SQ > vc-sq-filter-threshold, and no other filters are triggered, the variant is output in the VCF and FILTER=PASS.
The following are example VCF records on the chrM. The examples show one call with very high AF and another with low AF. In both cases FORMAT/SQ > vc-sq-call-threshold. FORMAT/SQ is also > vc-sq-filter-threshold, so the FILTER annotation is PASS.
For homref calls (e.g. in NON_REF regions of gVCF output) the FORMAT/GT is hard-coded to 0/0. The FORMAT/AF yields an estimate on the variant allele frequency, which ranges anywhere within [0,1]. For variant calls with FORMAT/AF < 95%, the FORMAT/GT is set to 0/1. For variants with very high allele frequencies (FORMAT/AF ≥ 95%), the FORMAT/GT is set to 1/1.
The following is an example of a variant record on chrM in a trio joint VCF. The variant was detected in the second sample with a confidence score that passed the filter threshold. In the first and third samples GT=0/0, which indicates a tentative hom-ref call (ie, that position for the sample is in a NON_REF region over which no variant was detected with sufficient confidence), but the weak_evidence filter tag indicates that this call is made with low confidence.
We leverage the new pangenome reference and multi-genome mapper output to compute a personalized 2-haplotype reference for the input sample.
The computed 2-haplotype reference is used to impute variants, adjust priors probabilities for genotypes in the variant caller, create a new personalized machine learning model and significantly boosts accuracy of variant calling. False negatives are reduced by adjusting genotype priors based on imputed phased variants in the computed haplotypes. False positives are reduced by limiting the impact of noise from other population haplotypes.
To enable personalized variant calling and machine learning, set --enable-personalization
to true (default: false).
Note that this is a beta feature and available only for the germline small variant caller when run with a V4 pangenome reference.
When variants at multiple loci in a single active region are detected jointly, genotyping can benefit. DRAGEN combines loci into a joint detection region if the following conditions are met:
Loci have alleles that overlap each other.
Loci are in the STR region or less than 10 bases apart of the STR region.
Loci are less than 10 bases apart of each other.
Joint detection generates a haplotype list where all possible combinations of the alleles in the joint detection regions are represented. This calculation leads to a larger number of haplotypes. During genotyping, joint detection calculates the likelihoods that each haplotype pair is the truth, given the observed read pileup. Genotype likelihoods are calculated as the sum of the likelihoods of haplotype pairs that support the alleles in the genotypes. Genotypes with maximum likelihood are reported.
Joint detection is enabled by default. To disable joint detection, set --vc-enable-joint-detection
to false.
DRAGEN has two algorithms that model correlated errors across reads in a given pileup.
Foreign read detection (FRD) detects mismapped reads. FRD modifies the probability calculation to account for the possibility that a subset of the reads were mismapped. Instead of assuming that mapping errors occur independently per read, FRD estimates the probability that a burst of reads is mismapped, by incorporating such evidence as MAPQ and skewed AF.
Mapping errors typically occur in bursts, but treating mapping errors as independent error events per read can result in high confidence scores in spite of low MAPQ and/or skewed AF. One possible strategy to mitigate overestimation of confidence scores is to include a threshold on the minimum MAPQ used in the calculation. However, this strategy can discard evidence and result in false positives.
FRD extends the legacy genotyping algorithm by incorporating an additional hypothesis that reads in the pileup might be foreign reads (ie, their true location is elsewhere in the reference genome). The algorithm exploits multiple properties (skewed allele frequency and low MAPQ) and incorporates this evidence into the probability calculation.
Sensitivity is improved by rescuing FN, correcting genotypes, and enabling lowering of the MAPQ threshold for incoming reads into the variant caller. Specificity is improved by removing FP and correcting genotypes.
The base quality drop off (BQD) algorithm detects systematic and correlated base call errors caused by the sequencing system. BQD exploits certain properties of those errors (strand bias, position of the error in the read, base quality) to estimate the probability that the alleles are the result of a systematic error event rather than a true variant.
Bursts of errors that occur at a specific locus have distinct characteristics differentiating them from true variants. The base quality drop off (BQD) algorithm is a detection mechanism that exploits certain properties of those errors (strand bias, position of the error in the read, low mean base quality over said subset of reads at the locus of interest) and incorporates them into the probability calculation.
The default mode of the small variant caller has been optimized to detect germline variants with typical AFs of 0%, 50% or 100%. On the other hand, non-cancer post-zygotic mosaic variants have typical allele frequencies (AFs) lower than 50% and therefore more challenging to find with the default small variant caller. To improve sensitivity of low AF calls, a new machine learning (ML) model trained using read and context evidence from low AF calls is used. This allows the model to identify variants down to approximately 5% AF on 35x WGS and 3% AF on 300x WGS. The mosaic ML model is applied to all calls that are rejected by the germline model and variants detected with the mosaic detection are ideintified by a MOSAIC
flag in the VCF INFO field.
When the mosaic detection is enabled, the hard filter QUAL
threshold for both SNPs and INDELs is lowered to 0.4
in this mode to allow low AF calls to be set as PASS
in the FILTER field. MOSAIC
tagged variants with QUAL
smaller than 3
are filtered with the MosaicHardQUAL
filter.
We provide an optional MosaicLowAF
filtering option to filter MOSAIC
tagged variants with AF
smaller than the AF
threshold. The threshold for this filter can be set with the --vc-mosaic-af-filter-threshold
option.
Furthermore, the output of MOSAIC
tagged calls can be restricted using an optional target BED provided with the --vc-mosaic-target-bed
option.
--vc-enable-mosaic-detection
Set to true to enable mosaic detection with mosaic AF filter threshold set to 0.0
. Set to false to disable mosaic detection. The default is true with mosaic AF filter threshold set to 0.2
.
--vc-mosaic-af-filter-threshold
Set the allele frequency threshold for the application of the MosaicLowAF
filter to mosaic calls. All MOSAIC
tagged variants with AF
smaller than the AF
threshold are filtered with the MosaicLowAF
filter. The default mosaic AF
filter threshold is set to 0.2
when the germline variant caller is enabled. The AF default threshold is set to 0.0
when the mosaic detection mode is enabled with --vc-enable-mosaic-detection=true
.
--vc-mosaic-qual-filter-threshold
Set the QUAL
threshold for the application of the MosaicHardQUAL
filter to mosaic calls. All MOSAIC
tagged variants with QUAL
smaller than the threshold QUAL
are filtered with the MosaicHardQUAL
filter. The default mosaic QUAL
filter threshold is set to 3.0
.
--vc-mosaic-target-bed
Optional target BED file to restrict the output of MOSAIC
tagged variant calls only in the specified regions.
Small variant calling features comparison between default germline small variant caller and mosaic detection mode in DRAGEN 4.2 and DRAGEN 4.3
The DRAGEN Somatic Pipeline allows ultrarapid analysis of Next-Generation Sequencing (NGS) data to identify cancer-associated mutations in somatic chromosomes. DRAGEN calls SNVs and indels from both matched tumor-normal pairs and tumor-only samples using a probability model that considers the possibility of somatic variants, germline variants, and various systematic noise artifacts. The model is informed by sample-specific nucleotide and indel noise patterns that are estimated from the data at runtime. When considering somatic variants, DRAGEN does not make any ploidy assumptions, which enables detection of low-frequency alleles. For loci with coverage up to 100x in the tumor sample, DRAGEN can detect variant allele frequencies down to approximately 5%. This limit scales with increasing depth on a per-locus basis. It is recommended to provide DRAGEN with a systematic noise file that contains position- and allele-specific noise frequencies as estimated from a panel of normal samples (see below); DRAGEN uses this noise file to filter calls that can be explained as resulting from position- and allele-specific noise. After multiple filtering steps, the output is generated as a VCF file. Variants that fail the filtering steps are kept in the output VCF. The variants include a FILTER annotation that indicates which filtering steps have failed.
For the tumor-normal pipeline, both samples are analyzed jointly. DRAGEN assumes that germline variants and systematic noise artifacts are shared by both samples, whereas somatic variants are present only in the tumor sample. Only somatic variants are reported. To detect systematic noise artifacts, DRAGEN recommends that the coverage in the normal sample be at least half of the coverage in the tumor sample.
The tumor-only pipeline produces output that contains both germline and somatic variants and can be further analyzed to identify tumor mutations. The caller does not attempt to distinguish between them: filtering out common germline variants as reported in databases is currently the most reliable way to remove germline variants. The tumor-only pipeline provides a germline tagging feature and requires this feature to be explicitly enabled or disabled. When germline tagging is enabled, variant annotation must also be enabled; DRAGEN then tags variants that are common in the gnomAD database as germline so that they can be filtered out if desired. The tumor-only pipeline also requires the presence of a systematic noise file by default. To run without germline tagging and/or systematic noise files, these options need to be disabled explicitly.
DRAGEN uses a Bayesian approach to compute the posterior probability that a somatic variant is present and reports this as a phred-scale quantity, "somatic quality" (SQ):
##FORMAT=<ID=SQ,Number=1,Type=Float,Description="Somatic quality">
DRAGEN scores variants by computing likelihoods for several hypotheses and noise processes, taking into account many factors such as: the numbers of alt-supporting and ref-supporting reads in the tumor and normal samples (and hence the alt allele frequencies in both samples); mapping qualities and how these are distributed across the reads in the tumor and normal pileups; basecall qualities; forward vs reverse strand support; sample-wide estimates of insertion and deletion error probabilities as functions of repeat period, repeat length, and indel length; sample-wide estimates of nucleotide error biases; whether there are nearby co-phased events; and whether the positions and alleles in question are known somatic hotspots or associated with sequence-specific error patterns. You can use SQ as the primary metric to describe the confidence with which the caller made a somatic call. SQ is reported as a format field for the tumor sample (exception: for homozygous reference calls in gvcf mode it is instead a likelihood ratio, analogous to homref GQ as described in the germline section). Variants with SQ score below the SQ filter threshold are filtered out using the weak_evidence
tag. To trade off sensitivity against specificity, adjust the SQ filter threshold. Lower thresholds produce a more sensitive caller and higher thresholds produce a more conservative caller. If performing tumor-normal analysis, the SQ field for the normal sample contains the Phred-scaled posterior probability that a putative call is a germline variant. The somatic caller does not test for diploid genotype candidates and does not output GQ or QUAL values.
If tumor SQ > vc-sq-call-threshold
(default is 3 for tumor-normal and 0.1 for tumor-only), then the FORMAT/GT for the tumor sample is hard-coded to 0/1, and the FORMAT/AF yields an estimate on the somatic variant allele frequency, which ranges anywhere within [0,1].
If the value for vc-sq-filter-threshold
is lower than vc-sq-call-threshold
, the filter threshold value is used instead of the call threshold value.
If tumor SQ < vc-sq-call-threshold
, the variant is not emitted in the VCF.
If tumor SQ > vc-sq-call-threshold
but tumor SQ < vc-sq-filter-threshold
, the variant is emitted in the VCF, but FILTER=weak_evidence.
If tumor SQ > vc-sq-call-threshold
and tumor SQ > vc-sq-filter-threshold
, the variant is emitted in the VCF and FILTER=PASS (unless the variant is filtered by a different filter).
The default vc-sq-filter-threshold is 17.5 for tumor-normal and 3.0 for tumor-only analysis. The following is an example somatic T/N VCF record. Tumor SQ > vc-sq-call-threshold
but tumor SQ < vc-sq-filter-threshold
, so the FILTER is marked as weak_evidence.
The clustered-events penalty is an exception to the above rule for emitting variants. By default, the clustered-events penalty replaces the (obsolete) clustered-events filter. Instead of applying a hard filter when too many events are clustered together, DRAGEN applies a penalty to the SQ scores of cophased clustered events. Clustered events with weak evidence are no longer called, but clustered events with strong evidence can still be called. This is equivalent to lowering the prior probability of observing clustered cophased variants. The penalty is applied after the decision to emit variants, so that penalized variants still appear in the VCF if their unpenalized score is high enough. Variants that are combined into an MNV via the --combine-phased-variants-distance
option are treated as a single variant for the purposes of the penalty. The penalty will not be applied to somatic hotspot variants. To disable the clustered-events penalty, set --vc-clustered-event-penalty=0
.
Please see the DRAGEN Recipe sections for recommended command lines in typical workflows. The following command line options are typically used for somatic small-variant calling:
--tumor-fastq1 and --tumor-fastq2
Inputs a pair of FASTQ files into the mapper aligner and somatic variant caller. You can use these options with OTHER FASTQ options to run in tumor-normal mode. For example:
--tumor-fastq-list
Inputs a list of FASTQ files into the mapper aligner and somatic variant caller. You can use these options with other FASTQ options to run in tumor-normal mode. For example:
--tumor-bam-input
and --tumor-cram-input
Inputs a mapped BAM or CRAM file into the somatic variant caller. You can use these options with other BAM/CRAM options to run in tumor-normal mode.
--vc-sq-call-threshold
and --vc-sq-filter-threshold
These options control the thresholds for emitting calls in the VCF and applying the weak_evidence
filter tag (see above).
--vc-target-vaf
This option allows the user to adjust the allele frequencies of haplotypes that will be considered by the caller as potentially appearing in the sample. It is not a hard threshold, but the variant caller will aim to detect variants with allele frequencies larger than this setting. In the case of tumor-normal runs, the frequency is measured with respect to the full set of reads (tumor and normal combined). The default threshold of 0.03 was selected to be as low as possible without incurring an excessive false positive cost; a lower setting may increase sensitivity for low-frequency variants, but may increase false positives and runtime; a higher setting may reduce false positives. Setting the vc-target-vaf to 0 will result in all haplotypes with at least two supporting reads being taken into consideration.
--vc-somatic-hotspots
, --vc-use-somatic-hotspots
, and --vc-hotspot-log10-prior-boost
DRAGEN uses a hotspot VCF to indicate somatic mutations that are expected with increased frequency. The default hotspot file (automatically selected from <INSTALL_PATH>/resources/hotspots/somatic_hotspots_*
based on the reference) is mostly based on the Memorial Sloan Kettering Cancer Center (MSKCC) published hotspots and positions in COSMIC with population allele counts (AC) >= 50. It is somewhat conservative and boosts only a few thousand positions. You can specify a custom hotspot file via the --vc-somatic-hotspots
option (note: input VCF records must be sorted in the same order as contigs in the selected reference) or disable the hotspots feature with vc-use-somatic-hotspots=false
. The effect of the hotspot file is that the prior probability for hotspot variants is boosted by a factor, up to a maximum prior of 0.5. An SNV is considered to match a hotspot variant only if the allele in question is identical, whereas insertions or deletions are considered to match any insertion/deletion allele respectively. You can use vc-hotspotlog10-prior-boost
to control the size of the adjustment. The default value is 4 (log10 scale) corresponding to an increase of 40 phred, and reducing this value will result in a smaller adjustment.
vc-systematic-noise
This option allows the user to specify the systematic noise file. To run without a systematic noise file (not recommended), specify vc-systematic-noise=NONE
.
--vc-combine-phased-variants-distance
This option is the same as in the germline variant caller (see "Combine Phased Variants" in the germline small-variant caller section).
vc-skip-germline-tagging=true
This option disables the germline tagging feature in the tumor-only pipeline (not recommended).
In a tumor-normal analysis, DRAGEN accounts for tumor-in-normal (TiN) contamination by running liquid tumor mode. Liquid tumor mode is disabled by default, but we recommend enabling it with --vc-enable-liquid-tumor-mode=true
if TiN contamination is expected. When liquid tumor mode is enabled, DRAGEN is able to call variants in the presence of TiN contamination up to a specified maximum tolerance level (default: 0.15). If using the default maximum contamination TiN tolerance, somatic variants are expected to be observed in the normal sample with allele frequencies up to 15% of the corresponding allele in the tumor sample. vc-tin-contam-tolerance
enables liquid tumor mode and allows you to set the maximum contamination TiN tolerance.
Liquid tumor mode is not equivalent to liquid biopsy. Liquid tumors in liquid tumor mode refer to hematological cancers, such as leukemia. For liquid tumors, it is not feasible to use blood as a normal control because the tumor is present in the blood. Skin or saliva is typically used as the normal sample. However, skin and saliva samples can still contain blood cells, so that the matched normal control sample contains some traces of the tumor sample and somatic variants are observed at low frequencies in the normal sample. If the contamination is not accounted for, it can severely impact sensitivity by suppressing true somatic variants.
Liquid tumor mode typically uses a library that is WGS or WES with medium depth for example (100x T/ 40xN), and the lowest VAF detected for these types of depths is ~5%. Liquid biopsy typically uses a targeted gene panel (eg 500 genes), with very high raw depth, and uses UMI indexing (collapsing down to a depth of >2000x) to enable sensitivity at VAF down to 0.1 % in some cases (the limit of detection will vary depending on coverage and data quality).
If using different sequencing systems or different library preparation methods for tumor and normal samples, we recommend setting --vc-override-tumor-pcr-params-with-normal=false
. In tumor-normal mode, DRAGEN estimates a set of PCR error parameters separately for each of the tumor and normal samples. By default, DRAGEN ignores the tumor-sample parameters and uses normal-sample parameters for analysis of both samples. This default prevents overestimation of tumor-sample error rates that can occur if the somatic variant rate is high.
Allele frequency and related settings
There is no hard limit on the allele frequencies at which DRAGEN can report calls, but there are a number of points in the pipeline where low allele frequency can affect calling. The vc-target-vaf
setting affects the threshold used to detect candidate haplotypes during localized haplotype assembly, but does not affect variant scoring. Once a candidate haplotype is detected, all putative variants appearing in the haplotype are scored and calls scoring above the SQ call threshold are emitted regardless of the allele frequency or the number of supporting reads.
The probability calculation in the somatic caller assesses variant and noise hypotheses at fixed allele frequencies defined by a discrete grid (by default at coverages <200: 0, 0.05, 0.1, ... 1.0). This means that the calculation will assess variants with allele frequencies below 0.05 as if the true frequency is equal to 0.05; this strategy does not preclude such variants from being called but may result in lower scores compared to if the true frequency had been considered. At positions with higher coverage, DRAGEN adds extra grid points as in the table below in order to consider hypotheses involving lower allele frequencies and effectively achieve a lower limit of detection (LOD), with the lowest VAF halving every time the coverage doubles:
If calls below a certain VAF are not of interest, you can use --vc-enable-af-filter
(see Post Somatic Calling Filtering below) to apply a hard filter on VAF.
DRAGEN can compensate for oxidation and deamination artifacts that might exist upstream of the sequencing system, and are common in FFPE samples. DRAGEN does this by estimating nucleotide mutation biases on a per sample basis, taking account of read orientation. During variant calling, DRAGEN then corrects for nucleotide substitution biases by combining the estimated parameters with the basecall quality scores, thus modifying the nucleotide error rates used by the hidden Markov model.
Nucleotide (NTD) Error Bias Estimation is on by default and recommended as a replacement for the orientation bias filter. Both methods take account of strand-specific biases (systematic differences between F1R2 and F2R1 reads). In addition, NTD error estimation accounts for non-strand-specific biases such as sample-wide elevation of a certain snv type, e.g. C->T or any other transition or transversion. NTD error estimation can also capture these biases in a trinucleotide context, e.g. in the case of C->T it will break down the counts as ACA->ATA, CCA->CTA, GCA->CTA, TCA->TTA, etc.
This feature can be disabled by specifying --vc-enable-unequal-ntd-errors=false
or set to auto-detect by specifying --vc-enable-unequal-ntd-errors=auto
. In auto-detect mode, DRAGEN will run the estimation but then disable the use of the estimated parameters if it determines that the sample does not exhibit nucleotide error bias. When the feature is enabled, DRAGEN will by default estimate a smaller set of parameters in a monomer context. To estimate a larger set of parameters in a trimer context (recommended on sufficiently large panels when coverage is above 1000X), specify --vc-enable-trimer-context=true
.
To specify the regions from which to estimate nucleotide substitution biases, use --vc-snp-error-cal-bed
. Alternatively, if --vc-target-bed
is used to specify the target regions for variant calling, and the total bed regions are sufficiently small (maximum 4 megabases), --vc-snp-error-cal-bed
can be omitted and DRAGEN will use the target bed file for bias estimation. Otherwise, DRAGEN will use a default bed file selected to match the reference, and covering a mixture of coding and non-coding regions.
DRAGEN requires a panel size of at least 150kbp to correctly estimate nucleotide mutation biases when using trimer context, or at least 10kbp when using monomer context. If this requirement is not met for trimer context, DRAGEN falls back on the monomer model, and if it is not met for monomer context, DRAGEN turns the bias estimation feature off.
DRAGEN provides two specialized UMI-aware variant calling pipelines for running from UMI-collapsed reads. These pipelines are optimized to take account of the increased read and basecall qualities that are typical in simplex- and duplex-collapsed reads. Both pipelines are disabled by default; when running with UMI collapsing enabled (--enable-umi true
) or when running from UMI-collapsed bams, enable UMI-aware variant calling by setting one of the following options to true:
--vc-enable-umi-solid
The VC UMI solid mode is optimized for solid tumors with post collapsed coverage rates of ~200—300X and target allele frequencies of 5% and higher.
--vc-enable-umi-liquid
The liquid biopsy pipeline is not equivalent to liquid tumor mode (see above). The liquid biopsy pipeline starts from a regular blood sample and looks for low VAF somatic variants from tumor cell free DNA floating in the blood. This type of test enables tumor profiling (diagnosis/biomarker identification) from plasma rather than from tissue, which requires an invasive biopsy. The VC UMI liquid mode is optimized for a liquid biopsy pipeline with post collapsed coverage rates of >2000X and target allele frequencies of 0.1% and higher.
If a third-party tool is used to produce the collapsed reads, then configure the tool so that the base call quality scores quantify the error produced by the sequencing system only. DRAGEN uses Sample-specific NTD Error Bias Estimation (see above) to account for errors upstream of the sequencing system, so such errors should not be included in base call quality scores.
You can output a gVCF file for tumor-only data sets. A gVCF file reports information on every position of the input genome, including homozygous reference (homref) positions, i.e. positions where no alt allele (either germline or somatic) is present. DRAGEN creates a new <NON_REF> allele, to which reads that do not support the reference allele or any reported variant allele are assigned. In tumors, variants could exist at arbitrarily low allele frequencies and be undetectable. Thus, a somatic homref call cannot guarantee that no somatic variant at any allele frequency exists at the position. Instead, DRAGEN considers a position to be a homozygous reference if there are no somatic variants with an allele frequency at or above the limit of detection (LOD). Whereas the SQ score for an ordinary alt allele is a phred-scale posterior probability, the SQ score for the <NON_REF> allele is a phred-scale ratio between the likelihood of a homref call and the likelihood of a variant call with allele frequency at the LOD (if an alt allele is also reported, the <NON_REF> SQ score is capped at the complement of the posterior probability for the alt allele). If the LOD value is lowered, fewer homref calls are made. If the LOD value is increased, more homref calls are made.
By default the LOD is set to 5%, but you can enter a different value using the --vc-gvcf-homref-lod
option.
DRAGEN can add a number of filters by populating the FILTER column in the vcf. The output is provided in the *.hard-filtered.vcf.gz
output file (note: the *.vcf.gz
output file without "hard-filtered" in the filename differs only in that the filter column is unpopulated; the file is produced for historical reasons but is to be deprecated).
Options
The following options are available for post somatic calling filtering:
--vc-sq-call-threshold
Emits calls in the VCF. The default is 3.0 for tumor-normal and 0.1 for tumor-only. If the value for vc-sq-filter-threshold
is lower than vc-sq-call-threshold
, the filter threshold value is used instead of the call threshold value.
--vc-sq-filter-threshold
Marks emitted VCF calls as filtered. The default is 17.5 for tumor-normal and 3.0 for tumor-only.
--vc-enable-triallelic-filter
Enables the multiallelic filter. The default is true. This filter will not be applied to somatic hotspot variants.
--vc-enable-non-primary-allelic-filter
Similar to the triallelic filter, but filters less aggressively. Keep the allele per multiallelic position with highest alt AD, and only filter the rest (Default=false). This filter will not be applied to somatic hotspot variants. Cannot be enabled when the triallelic filter is also on.
--vc-enable-af-filter
Enables the allele frequency filter for nuclear chromosomes. The default value is false. When set to true, the VCF excludes variants with allele frequencies below the AF call threshold or variants with an allele frequency below the AF filter threshold and tagged with low AF filter tag. The default AF call threshold is 1% and the default AF filter threshold is 5%. To change the threshold values, use the vc-af-call-threshold
and vc-af-filter-threshold
command-line options. Please use vc-enable-af-filter-mito
and corresponding threshold options for mitochondrial allele frequency filtering.
--vc-enable-non-homref-normal-filter
Enables the non-homref normal filter. The default value is true. When set to true, the VCF filters out variants if the normal sample genotype is not a homozygous reference.
--vc-enable-vaf-ratio-filter
Adds one condition to be filtered out by the alt_allele_in_normal filter. The default value is false. When set to true, the VCF filters out variants if the normal sample AF is greater than 20% of tumor sample AF.
--vc-depth-filter-threshold
Filters all somatic variants (alt or homref) with a depth below this threshold. The default value is 0 (no filtering).
vc-homref-depth-filter-threshold
In gvcf mode, filters all somatic homref variants with a depth below this threshold. The default value is 3.
vc-depth-annotation-threshold
Filters all non-PASS somatic alt variants with a depth below this threshold. The default value is 0 (no filtering).
Filters
The DRAGEN systematic noise filter significantly improves somatic variant calling precision, especially in tumor-only mode. DRAGEN enforces its use in the tumor-only pipeline by refusing to start a run without a noise file (this option can explicitly be disabled). This filter tackles noise that consistently appears at specific locations in the reference genome. This noise can arise from:
Mis-mapping in low-complexity regions: Repetitive sequences with low information content can lead to reads mapping to incorrect locations.
PCR noise in homopolymer regions: Regions with long stretches of the same nucleotide (e.g., AAAAA) can introduce errors during PCR amplification.
The systematic noise filter offers a significant improvement over the older "panel of normals" method. While the panel of normals simply excluded specific positions, the new filter employs a statistical model. This model compares the variant and its allele frequency (AF) to the noise level associated with that specific position and allele in the reference genome. This allows for a more nuanced filtering approach, reducing false positives without discarding potentially valid variants.
Note that the systematic noise filter specifically aims to remove noise, while the option --vc-enable-germline-tagging
is used for identifying germline variants. The systematic noise filter is not recommended for germline admixture datasets, where tumor-normal pairs are simulated by combining germline samples from two different individuals. This is because such datasets contain (simulated) somatic variants at germline variant positions, and those positions may be present in the noise files with the result that desired variants are filtered out.
Newer versions of the systematic noise will include two columns, one for the "mean" noise and one for the "max" noise. The noise file header will specify a "##NoiseMethod". This is the column that will be used by default during variant calling. For UMI/PANELs/WES is is recommended to use the "mean" noise, and for WGS it is recommended to use the "max" noise.
Prebuilt systematic noise files are available for download (see below), but when possible, it is recommended to build custom noise files from a panel of normal samples sequenced locally. This will ensure that the noise file is specific to the library preparation, sequencing system, and panel in use. Building your own noise file is especially helpful for clean UMI samples that tend to have less noise than WGS/WES samples. To generate a noise file it is recommended to use approximately 20-50 normal samples, although fewer normal samples (1-10) can still be used to generate useful noise files.
The systematic noise filter is used in the DRAGEN tumor-only or tumor-normal pipeline by adding the following commands:
Somatic Systematic Noise Baseline Collection v2.0.0 noise files were generated with V4.3 and for the first time include allele specific information. Each v2.0.0 noise file includes both "mean" and "max" noise in separate columns. A header line "##NoiseMethod=mean/max" specifies which noise column will be used by default.
Noise files generated with V4.3 contain extra columns and are not compatible with earlier versions. Older noise files are still supported in the current version of DRAGEN as per the table below.
The default WES and WGS noise files were generated using a combination of Nextera and TruSeq samples (with and without PCR). There are also hg38 WGS HEME and FFPE specific noise files.
The BaseSpace Sequence Hub DRAGEN CNV Baseline Builder App can be used to build SNV and CNV noise files in the cloud. Alternatively the following DRAGEN CMD lines can be used to generate the noise files locally:
First run DRAGEN somatic tumor-only on each of approximately 20-50 normal samples using the following command:
Once the normal samples have completed, collect the normal VCFs in the VCF_LIST file (one vcf per line) and use DRAGEN to generate the systematic noise file:
Running the filter during somatic variant calling:
Running the tumor-only pipeline on the normals:
Building the noise file:
When enabling DRAGEN for tumor-only somatic calling, potential germline variants can be tagged in the INFO field with 'GermlineStatus' using population databases. Current databases include 1KG, both exome and genome sequencing data from gnomAD. The following options are available for this feature:
--vc-enable-germline-tagging
Enable germline tagging. The default is 'false'. Once this is set to 'true', it will require user to set annotation related parameters as follows:
--enable-variant-annotation=true
--variant-annotation-data
Nirvana annotation database (Downloadable at https://support.illumina.com/content/dam/illumina-support/help/Illumina_DRAGEN_Bio_IT_Platform_v3_7_1000000141465/Content/SW/Informatics/Dragen/Nirvana_DownloadData_fDG.htm)
--variant-annotation-assembly
The genome build, GRCh37 or GRCh38
Additional options to control how to define germline variants.
--germline-tagging-db-threshold
The minimum alternative allele count across population databases for a variant to be defined as germline (default=50).
--germline-tagging-pop-af-threshold
The minimum population allele frequency for a variant to be defined as germline. Once specified, this will override the input from --germline-tagging-db-threshold.
When enabling DRAGEN for tumor-only or tumor-normal pipelines with Nirvana Annotation, the Nirvana JSON output can be converted into a Mutation Annotation Format (MAF) file. The MAF file is a tab-separated values file containing aggregated mutation information and will be saved to the output directory that you specify. You can enable MAF conversion directly as part of the somatic small variant calling workflow (integrated mode) or separately by providing a path to a VCF file or annotated JSON file (standalone mode).
When running MAF conversion as part of the somatic small variant calling workflow, the following options are required for this feature:
Annotation options:
--enable-variant-annotation=true
Enable variant annotation
--variant--annotation-data
Nirvana annotation database (Downloadable at https://support.illumina.com/content/dam/illumina-support/help/Illumina_DRAGEN_Bio_IT_Platform_v3_7_1000000141465/Content/SW/Informatics/Dragen/Nirvana_DownloadData_fDG.htm)
--variant-annotation-assembly
Genome build, GRCh37 or GRCh38
MAF conversion options:
--enable-maf-output=true
Enable MAF output
--maf-transcript-source
Desired transcript source, RefSeq or Ensembl
Additional standalone options (when running without the variant caller):
--maf-input-vcf
Input VCF with the following form: <path>/<file_name>.hard-filtered.vcf.gz
--maf-input-json
Input JSON with the following form: <path>/<file_name>.hard-filtered.annotated.json.gz
Please note that when specifying standalone mode with VCF input, you must also enable annotation options to generate the JSON file. Conversely, annotation options should not be specified when running standalone mode with an input annotated JSON file.
Optional options:
--maf-include-non-pass-variants
Enabling this option will output all variants, including non-PASS variants, in the MAF output file.
Please note that this is an optional option that gives you all variant output. By default, MAF output will only contain variants that have the PASS filter from the hard-filtered VCF file.
Example command lines:
MAF output from BAM input and variant caller:
MAF output from output directory and output file prefix, where the output directory contains a VCF file prefixed by the output file prefix:
MAF output from source VCF file:
Note: This command line will output the MAF file in the same location as the input VCF file. To specify a directory for output, add --output-dir
and --output-file-prefix
options.
MAF output from source annotated VCF file:
Note: This command line will output the MAF file in the same location as the input annotated JSON file. To specify a directory for output, add the --output-dir
and --output-file-prefix
options.
DRAGEN Multi-region Joint Detection (MRJD) is a de novo germline small variant caller for paralogous regions. In DRAGEN v4.3, MRJD covers regions that include six clinically relevant genes: NEB, TTN, SMN1/2, PMS2, STRC, and IKBKG. MRJD is compatible with hg38, hg19 and GRCh37 reference genome. The table below includes hg38 region coordinates covered by MRJD.
Chromosome | Start | End | Description |
---|
MRJD is a variant calling method that is designed to detect de novo germline small variants in paralogous regions of the genome. A conventional variant caller relies on the read aligner to determine which reads likely originated from a given location. This method works well when the region of interest does not resemble any other region of the genome over the span of a single read (or a pair of reads for paired-end sequencing). However, a significant fraction of the human genome does not meet this criterion. At least 5% of the human genome consists of segmental duplications. Many regions of the genome have near-identical copies elsewhere, and as a result, the true source location of a read might be subject to considerable uncertainty. If a group of reads is mapped with low confidence, a conventional variant caller might ignore the reads, even though they contain useful information. If a read is mismapped (i.e., the primary alignment is not the true source of the read), it can result in variant detection errors.
MRJD is designed in attempt to tackle the complexities raised by segmental duplication regions. Basically, instead of considering each region in isolation, MRJD considers all locations from which a group of reads may have originated and attempts to detect the underlying sequences jointly across all paralogous regions in the sample of interest.
Below is a diagram showing the general workflow of MRJD in PMS2 and PMS2CL regions. MRJD takes primary alignments in all paralogous regions, regardless of mapping quality, builds and places haplotypes based on reads and prior knowledge, and computes joint genotypes to call small variants.
Figure 1. MRJD Caller workflow.
As shown in the diagram above, there are two modes of the DRAGEN MRJD Caller, default mode and high sensitivity mode. Here are details on the differences between the two modes.
With --enable-mrjd=true
, the MRJD Caller will report the following two types of variants:
Uniquely placed variants, which means the variant is found and placed in one of the paralogous regions without ambiguity. See variants labeled with “type 1” in Figure 2.
Region-ambiguous variants. In this case, the aggregated genotype contains a variant allele with high confidence, but MRJD Caller is unable to place the variant allele in one of the paralogous regions with high confidence. The MRJD Caller will report the variant allele in all paralogous regions. See variants labeled with “type 2” in Figure 2.
With both --enable-mrjd=true
and --mrjd-enable-high-sensitivity-mode=true
, the MRJD Caller reports the same variants as from the default mode, plus two other types of variants.
Positions where the reference alleles in all paralogous regions are not the same. It is well established that gene conversion, including reciprocal crossover, is a common event between paralogous regions (such as PMS2 and PMS2CL). When reciprocal crossover event occurs, the prior model, without nearby information on phasing, might end up placing the converted haplotype in the source region instead of the destination region, resulting in no variant. The high sensitivity mode compensates for this event by reporting the variant in corresponding positions in all paralogous regions. See variants labeled with “type 3” in Figure 2.
Variants that have been placed uniquely in one of the paralogous regions and no variant in the corresponding position in the other region. The high sensitivity mode reports the variant in the rest of the paralogous regions. This is to compensate the fact that sometimes the prior knowledge that is used to help place the variant is not sufficient or is estimated incorrectly. In those cases, the variant allele still exists but is placed in the wrong paralog region. Therefore, reporting the variant in the other paralogous regions can help maximize sensitivity even with the lack of prior. See variants labeled with “type 4” in Figure 2.
Figure 2. Different variant types reported by MRJD Caller default mode and high sensitivity mode.
The MRJD Caller is disabled by default and requires WGS data aligned to a human reference genome build 38, 19, or GRCh37.
Here is the list of options related to MRJD.
--enable-mrjd
If set to true, MRJD is enabled for the DRAGEN pipeline. Note that MRJD cannot run together with SNV caller in the current version of DRAGEN (default = ‘false’).
--mrjd-enable-high-sensitivity-mode
If set to true, MRJD high sensitivity mode is enabled for the DRAGEN pipeline. See previous section on what variant types are reported in MRJD default mode and high sensitivity mode (default = ‘false’).
The following command-line example uses FASTQ input and runs MRJD Caller with high sensitivity mode:
The following command-line example uses BAM input that has already been aligned and runs MRJD Caller with high sensitivity mode:
Here are the example command lines to first run DNA Mapping and Small Variant Calling workflow using FASTQ files as input, and then run MRJD using BAM file generated by the DNA Mapping workflow as input.
The MRJD Caller generates a .mrjd.hard-filtered.vcf.gz file in the output directory. The output file is a compressed VCFv4.2 formatted file that contains the VCF representation of the small variants from the identified genotype.
The following are example output format for uniquely placed variant. The DRAGENHardQual filter is applied to the records if the variant has a QUAL < 3.00.
Figure 3. VCF output format example for uniquely placed call.
For variant that are not uniquely placed (type 2-4 variant in Figure 2), the MRJD Caller will also report variants under diploid genotype format, which can be interpreted the same way as uniquely placed variant (the genotype is region-specific instead of an aggregate across all regions). Under this format, The QUAL presents phred-scaled quality score for the assertion made in ALT (i.e. −10log10 prob(GT==0/0)). Note that the QUAL score will be equal to or less than 3 (if the QUAL > 3, then the call should be uniquely placed).
The QUAL, GT, GQ and PL will be reported similar to the DRAGEN germline VC. To avoid losing information about the aggregated genotype across paralogous regions, the MRJD Caller reports genotype, phred-scaled quality score, and the phred-scaled genotype likelihoods for aggregated genotype using JGT, JQL, and JPL in the FORMAT column.
Figure 4. VCF output format example for non-uniquely-placed call.
Condition | Reported GT |
---|---|
Column Header | Description |
---|
Run Pedigree Mode for Small Variant Caller. For more information, see .
Run Pedigree Mode for Copy Number Caller. For more information, see .
Run Pedigree Mode for Structural Variant Caller. For more information, see .
Run DeNovo Variant Small Variant Filtering. For more information, see .
The tool for population-based analysis is the iterative gVCF Genotyper. Its input is a set of single or multisample gVCFs. The output is a multisample VCF that contains one entry for any variant seen in any of the input gVCFs. The variants are genotyped across all input samples using information from the hom-ref blocks as necessary. The iterative gVCF Genotyper does not adjust genotypes based on population information but it provides means to filter variant sites based on information leveraged from the population. See for information on the available command line options.
--gg-hard-filter
Specifies a filtering expression to be applied to the output msVCF records. See below. The default is to apply no filters.
--gg-msvcf-format-fields
Can be used to override the default set of sample genotype fields in the output msVCF. See below.
--gg-msvcf-info-fields
Can be used to override the default set of site-wise INFO fields in the output msVCF. See below.
--gg-output-type
Set to spvcf
to write the output in spVCF format rather then the default msVCF. See below for details.
--gg-diploidify
In the output msVCF file, convert haploid calls to diploid. The diploidified genotype is homozygous in the haploid call e.g. 1
becomes 1/1
. The LPL field is also diploidified for these samples. Site metrics, such as allele counts, are calculated before diploidification. Diploidifying genotypes may ease the ingestion of msVCF files into downstream analysis tool, such as Hail and Plink. When this option is enabled, it is possible to include the DF
FORMAT field (included by default) that signifies whether or not a genotype has been diploidified, see below.
This approach is also referred to as local alleles and is also used by open source software such as and .
The (HWE) states that, given certain conditions, genotype and allele frequencies should remain constant between generations. Deviations from HWE can results from violations of the underlying HWE assumptions in the population, non-random sampling or may be artifacts of variant calling. can be assessed by comparing the observed frequencies of genotypes to those expected under HWE given the observed allele counts.
Metric | Description | Scope | Number of values |
---|
Iterative gVCF Genotyper offers both allele-wise and site-wise HWE P-values. The allele-wise P-values are based on the exact-conditional method the site-wise P-values are based on Pearson's chi-squared method. For bi-allelic sites, although both are measuring the same property, their values may differ. The differences between the methods are explored in . Care should be taken when deciding which to use.
Iterative gVCF Genotyper calculates allele-wise HWE and the ExcHet P-values. The values are calculated using the exact-conditional method described in . The implementation does not use a mid P-value correction.
It is also possible to filter based on the maximum ABHetP value, see .
The syntax of a filtering expression is the same as that used by the small variant caller (see ). Filters are always applied to the globally-computed metrics, not the values for the current batch. Records failing filter will have the specified filter ID(s) written to the FILTER column of the msVCF, or will be omitted entirely if the --gg-skip-filtered-sites option is specified. Since filtering is on a per-site basis, filters cannot be applied separately to SNPs or indels as they can in the variant caller.
Metric | Description | Number of values1 | Type |
---|
Metric | Description | Number of values1 | Type |
---|
Outputting local allele values, as described .
Use of the options to output only those metrics required for the downstream analysis.
The option that can have the biggest impact on the final output file size is that to write it directly in . This is a lossless encoding and the space saving can be dramatic: file size reductions of multiple tens of times have been observed for large cohorts with sparsely distributed variants. Files output as spVCF at step 3 (--generate-msvcf
) can be directly merged via the --merge-batches
subcommand to produce a single spVCF file. spVCF-encoded files are likely to require decoding back to full msVCF for use with downstream tools, and a binary for this is available for . The decoding will take time, but this is offset by the reduced time required within gVCF Genotyper to initially write the smaller spVCF files. Users are recommended to, if possible, directly pipe the decoded data into the downstream tool rather than first writing the full msVCF file to disk.
Name | Description | Default Value |
---|
PLINK option | PLINK default | PLINK tuned | DRAGEN default | PLINK Definitions |
---|
Option | Description |
---|---|
Mode | Downsampling Option | Default Value |
---|---|---|
Metric | QUAL | GQ (non-homref) | GQ (homref) | QD |
---|---|---|---|---|
GT variant 1 | GT variant 2 | GT MNV | Relevant Pipeline | Supported in DRAGEN |
---|---|---|---|---|
--sample-sex | Ploidy Estimation | Sample Sex in Small VC |
---|---|---|
Tag | Name | Command line options | QUAL threshold | MAPQ0 | Mosaic Detection | Mosaic AF filter threshold |
---|
--vc-callability-tumor-thresh
Specifies the callability threshold for tumor samples. The somatic callable regions report includes all regions with tumor coverage above the tumor threshold. The default value is 50. For more information on the somatic callable regions report, see .
--vc-callability-normal-thresh
Specifies the callability threshold for normal samples, if present. If applicable, the somatic callable regions report includes all regions with normal coverage above the normal threshold. The default value is 5. For more information on the somatic callable regions report, see .
Coverage | Lowest AF |
---|
Somatic Mode | Filter ID | Description |
---|
Prebuilt systematic noise files can be downloaded here:
Version | Release | Modes | Normal Samples |
---|
Option | Description |
---|
Option | Description |
---|
Option | Description |
---|
It is important to note that MRJD cannot run together with the DRAGEN Small Variant Caller in this DRAGEN version. We recommend users to run DNA Mapping and Small Variant Calling workflow first, and then run MRJD using the aligned BAM file generated from DNA Mapping workflow as input. Using this workflow, two VCF files will be created (.hard-filtered.vcf.gz by DRAGEN Small Variant Caller and .mrjd.hard-filtered.vcf.gz by DRAGEN MRJD). To help user get a single VCF file for downstream anlaysis, we prepared a utility tool that replaces the DRAGEN Small Variant Caller output in the homology region of the six medically relevant and challenging genes with MRJD caller output. The tool also annotates the calls made by MRJD (with "MRJD" tag in the INFO column). Please refer to the to download the utility tool.
--vc-target-coverage
Specifies the maximum number of reads covering any given position.
--vc-max-reads-per-active-region
Specifies the maximum number of reads covering a given active region.
--vc-max-reads-per-raw-region
Specifies the maximum number of reads covering a given raw region.
--vc-min-reads-per-start-pos
Specifies the minimum number of reads with a start position overlapping any given position.
--high-coverage-support-mode
Applies the high coverage mode down-sample options if set to true. Enabling this option is recommended for targeted panels with coverage over 1000x, but will slow down run time.
Germline
--vc-target-coverage
500
Germline
--vc-max-reads-per-active-region
10000
Germline
--vc-max-reads-per-raw-region
30000
Somatic
--vc-target-coverage
1000
Somatic
--vc-max-reads-per-active-region
10000
Somatic
--vc-max-reads-per-raw-region
30000
High Coverage
--vc-target-coverage
100000
High Coverage
--vc-max-reads-per-active-region
200000
High Coverage
--vc-max-reads-per-raw-region
200000
Mitochondrial
--vc-target-coverage-mito
40000
Mitochondrial
--vc-max-reads-per-active-region-mito
200000
Mitochondrial
--vc-max-reads-per-raw-region-mito
200000
Description
Probability that the site has no variant
Probability that the call is incorrect
Evidence supporting homref call
Qual normalized by depth
Formulation
QUAL = GP(GT=0/0)
GQ =-10*log10(p)
GQ = 10*log10[P(D|homref)/P(D|variant)]
QUAL/DP
Scale
Unsigned Phred
Unsigned Phred
Signed Phred
Unsigned Phred
Numerical example
QUAL=20: 1 % chance that there is no variant at the site. Qual=50: 1 in 1e5 chance that there is no variant at the site.
GQ=3, 50% that the call is incorrect. GQ=20, 1% change that the call is incorrect.
GQ=0: no evidence. GQ>0: evidence favors homref.
0|1
0|1
0/1
Germline and Somatic
Yes in 4.0
0/1
1/1
1/2
Germline
No
0/1
1/2
1/2
Germline
No
1/1
1/1
1/1
Germline
Yes in 4.2
male
Not relevant
Male
female
Not relevant
Female
none
Not relevant
None
auto (default)
XY
Male
auto (default)
XX
Female
auto (default)
Everything else
None
At a position with no coverage
./. or .
At a position with coverage but no reads supporting ALT allele
0/0 or 0
At a position with coverage and reads supporting ALT allele
dependent on pipeline (germline/somatic)
4.2 | DRAGEN 4.2 default Small Variant Caller | --enable-variant-caller=true | 3 | No | No | N/A |
4.2 HSM | DRAGEN 4.2 High Sensitivity Mode | --enable-variant-caller=true --vc-enable-high-sensitivity-mode=true | 0.4 | Yes | Yes (Alpha) | N/A |
4.3 | DRAGEN 4.3 default Small Variant Caller | --enable-variant-caller=true | 3 | Yes | Yes (Full) | 20% |
4.3 Mosaic | DRAGEN 4.3 Mosaic Detection Mode | --enable-variant-caller=true --vc-enable-mosaic-detection=true | 0.4 | Yes | Yes (Full) | 0% |
0-199 | 0.05 |
200-399 | 0.025 |
400-799 | 0.0125 |
... | ... |
Tumor-Only & Tumor-Normal | weak_evidence | Variant does not meet likelihood threshold. The likelihood ratio for SQ tumor-normal is < 17.5 or < 3.0 for SQ tumor-only. |
Tumor-Only & Tumor-Normal | multiallelic | Site filtered if there are two or more ALT alleles at this location in the tumor. Not applied to somatic hotspot variants. |
Tumor-Only & Tumor-Normal | base_quality | Median base quality of ALT reads at this locus is < 20. |
Tumor-Only & Tumor-Normal | mapping_quality | Median mapping quality of ALT reads at this locus is < 20 (tumor-normal) or < 30 (tumor-only). |
Tumor-Only & Tumor-Normal | fragment_length | Absolute difference between the median fragment length of alt reads and median fragment length of ref reads at a given locus > 10000. |
Tumor-Only & Tumor-Normal | read_position | Median of distances between the start and end of read and a given locus < 5 (the variant is too close to edge of all the reads). To output variant read position to the INFO field, use |
Tumor-Only & Tumor-Normal | low_af | Allele frequency is below the threshold specified with |
Tumor-Only & Tumor-Normal | systematic_noise | If AQ score is < 10 (default) for tumor-normal or < 60 (default) for tumor-only, the site is filtered. |
Tumor-Only & Tumor-Normal | low_frac_info_reads | The fraction of informative reads (denominator excludes filtered_out reads) is below the threshold. The default threshold value is 0.5. |
Tumor-Only & Tumor-Normal | filtered_reads | More than 50% of reads have been filtered out. |
Tumor-Only & Tumor-Normal | long_indel | Indel length is more than 100bp. |
Tumor-Only & Tumor-Normal | low_depth | The site was filtered because the number of reads is too low. The filter is off by default. |
Tumor-Only & Tumor-Normal | low_tlen | The site was filtered because the fraction of low TLEN ALT supporting reads is above a threshold. The default threshold is 0.4. Reads with TLEN smaller than -2.25 (default) standard deviations from the mean are considered to be low TLEN. This filter is not applied for reads sampled from tight insert distributions i.e., stddev / mean < 0.1 (default). |
Tumor-Only and Tumor-Normal | no_reliable_supporting_read | No reliable supporting read was found in the tumor sample. A reliable supporting read is a read supporting the alt allele with mapping quality ≥ 40, fragment length ≤ 10,000, base call quality ≥ 25, and distance from start/end of read ≥ 5. |
Tumor-Only & Tumor-Normal | too_few_supporting_reads | Variant is supported by < 3 reads in the tumor sample. This filter is not applied in UMI-aware pipelines. |
Tumor-Normal | noisy_normal | More than three alleles are observed in the normal sample at allele frequency above 9.9%. |
Tumor-Normal | alt_allele_in_normal | ALT allele frequency in the normal sample is above 0.2 plus the maximum contamination tolerance. For solid tumor mode, the value is 0. For liquid tumor mode, the default value is 0.15. See |
Tumor-Normal | non_homref_normal | Normal sample genotype is not a homozygous reference. |
Somatic Systematic Noise Baseline Collection v2.0.0 | V4.3 | hg19, hg38, hs37d5, WES, WGS | ~50 per cohort, 80-100X coverage |
--vc-systematic-noise | Specifies a systematic noise BED file. If a somatic variant does not pass the AQ threshold, the variant is marked as 'systematic_noise' in the FILTER column of the output VCF. |
--vc-systematic-noise-method | Specifies which column in the systematic noise file will be used: "max" is more aggressive and recommended for WGS, while "mean" preserves better sensitivity and is recommended for WES/PANELs. |
--vc-systematic-noise-filter-threshold | Set the AQ threshold. Higher values filter more aggressively. By default the threshold value is 10 for tumor-normal and 60 for tumor-only. The valid range spans 0-100. For tumor-normal runs the threshold may be set higher (e.g. to 60) to improve specificity at the possible cost of some sensitivity. |
--vc-systematic-noise-filter-threshold-in-hotspot | Set the AQ threshold to use in hotspot regions, where one may want to filter less aggressively than in the rest of the genome. By default, the threshold value is 10 for tumor-normal and 20 for tumor-only. |
--vc-allele-specific-systematic-noise | Apply systematic noise in an allele-specific manner when allele information is available. This setting is ignored for v1 noise files (Default=true)) |
--vc-detect-systematic-noise | Run the tumor-only pipeline in an ultra sensitive mode and intentionally include noise in the output VCF. WARNING: this option should only be used with normal samples to characterize noise, it is NOT intended for analyzing tumor samples. |
--vc-detect-systematic-noise-mode | Specify the library type when generating the systematic noise. Only required for UMI samples. This mode will generate GVCFs which are especially useful for capturing very low levels of noise. The default mode will work well for WGS/WES and non-UMI panels. Valid options include [UMI, DEFAULT] |
--build-sys-noise-method | Specifies the default value for vc-systematic-noise-method by adding it as part of the header in the systematic noise file. It is recommended to select 'mean' for UMI/PANELS/WES data and 'max' for WGS data (default is 'max')." |
--build-sys-noise-vcfs-list | Text file containing the paths of normal VCFs. Specify the full VCF file paths. List one file per line. |
--build-sys-noise-germline-vaf-threshold | Variant calls with VAF higher than this threshold will be considered germline and will not contribute to the noise estimate. This option is disabled by default by setting the threshold to 1. (Default 1) |
--build-sys-noise-use-germline-tag | This option will ensure that variants tagged by |
--build-sys-noise-min-sample-cov | Min coverage at a site for a sample to be used towards noise estimation. At low coverages estimated allele frequencies become less reliable. Accurate AF estimation is imporant for germline variant detection, and also for noise detection when using MAX noise. (Default 5) |
--build-sys-noise-min-supporting-samples | Min number of samples with noise at a position in order for a position to be considered systematic-noise (Default 1). |
Family_ID | The pedigree identifier. |
Individual_ID | The ID of the individual. |
Paternal_ID | The ID of the individual's father. If the founder, the value is 0. |
Maternal_ID | The ID of the individual's mother. If the founder, the value is 0. |
Sex | The sex of the sample. If male, the value is 1. If female, the value is 2. |
Phenotype | The genetic data of the sample. If unknown, the value is 0. If unaffected, the value is 1. If affected, the value is 2. |
HWE | Hardy-Weinberg Equilibrium P-value | Allele-wise | One for each alt allele |
ExcHet | Excess Heterozygosity P-value | Allele-wise | One for each alt allele |
HWEc2 | Hardy-Weinberg Equilibrium P-value | Site-wise | 1 |
IC | Inbreeding Coefficient | Site-wise | 1 |
GT | Genotype | 1 | String |
GQ | Genotype quality | 1 | Integer |
AD | Allelic depths | R | Integer |
LAD | Localized allelic depths | . | Integer |
FT | Sample filter | 1 | String |
LPL | Local normalized, Phred-scaled likelihoods for genotypes as in original gVCF | . | Integer |
LAA | Mapping of local alt allele index from original gVCF to msVCF excluding the reference allele | . | Integer |
LA | Mapping of local allele indices from original gVCF to msVCF including the reference allele | . | Integer |
LGT | Local GT value as in original gVCF | 1 | String |
QL | Phred-scaled probability that the site has no variant in this sample (original gVCF QUAL) | 1 | Float |
MQR | Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities (original gVCF INFO/MQRankSum) | 1 | Float |
LAF | Allele fractions for the local alt alleles | . | Float |
DF | Diploidified, 1 represents a genotype that was originally haploid, 0 represents originally diploid | 1 | Integer |
AC | Allele count in genotypes | A | Integer |
AN | Total number of alleles in called genotypes | 1 | Integer |
NS | Total number of samples | 1 | Integer |
NS_GT | Total number of samples with called genotypes | 1 | Integer |
NS_NOGT | Total number of samples with unknown genotypes ./. | 1 | Integer |
NS_NODATA | Total number of samples with no coverage | 1 | Integer |
IC | The inbreeding coefficient | 1 | Float |
HWE | The exact conditional Hardy-Weinberg Equilibrium P-value | A | Float |
ExcHet | he exact conditional Excess Heterozygosity P-value | A | Float |
HWEc2 | The chi-squared Hardy-Weinberg Equilibrium P-value | 1 | Float |
AF | The ALT allele frequencies (AC/AN) | A | Float |
ABHom | The allelic balance among homozygotes | R | Float |
ABHet | The allelic balance among heterozygotes | R | Float |
ABHetP | The P-value for allelic balance among heterozygotes | R | Float |
chr2 | 151578759 | 151588523 | NEB exon 98-105 |
chr2 | 151589318 | 151599076 | NEB exon 90-97 |
chr2 | 151599871 | 151609628 | NEB exon 82-89 |
chr2 | 178653238 | 178654995 | TTN exon 172-180 |
chr2 | 178657498 | 178659255 | TTN exon 181-189 |
chr2 | 178661759 | 178663516 | TTN exon 190-198 |
chr5 | 70049522 | 70077596 | SMN2 |
chr5 | 70924940 | 70953013 | SMN1 |
chr7 | 5970924 | 5980896 | PMS2 exon 13-15 |
chr7 | 5980968 | 5987689 | PMS2 exon 11-12 |
chr7 | 6737007 | 6743712 | PMS2CL exon 2-3 |
chr7 | 6743880 | 6753867 | PMS2CL exon 4-6 |
chr15 | 43599563 | 43602630 | STRC exon 24-29 |
chr15 | 43602982 | 43611000 | STRC exon 14-23 |
chr15 | 43611040 | 43618800 | STRC exon 1-13 |
chr15 | 43699379 | 43702452 | STRCP1 exon 23-28 |
chr15 | 43702488 | 43710472 | STRCP1 exon 13-22 |
chr15 | 43710502 | 43718262 | STRCP1 exon 1-12 |
chrX | 154555884 | 154565047 | IKBKG exon 3-10 |
chrX | 154639390 | 154648553 | IKBKGP1 |
| Enable evidence BAM output | False |
| Output graph haplotypes in evidence BAM | False |
| Percentage of clipped reads in active region to enable evidence BAM output for that region | 10% |
| Force evidence BAM output for all active regions | False |
--homozyg-density | 50 | 50 | Minimum required density to call a ROH (1 SNP in 50 kb), can be increased to relax the per SNP density. |
--homozyg-gap | 1000 | 1000 | 3000 | Maximal interval between two homozygous SNPs in a ROH (in kb) |
--homozyg-kb | 1000 | 500 | All sizes reported | Minimal length of reported ROH (in kb) |
--homozyg-snp | 100 | 50 | 50 | Minimal number of homozygous SNPs in the reported ROH |
--homozyg-window-het | 1 | 2 | Soft score threshold (1-0.025) penalty for a het SNP and 0.025 gain for a hom SNP | Maximum number of heterozygous SNPs allowed in a scanning window |
--homozyg-window-missing | 5 | 5 | Number of missing calls allowed in a scanning window |
--homozyg-window-snp | 50 | 50 | Variants in a scanning window |
--homozyg-window-threshold | 0.05 | 0.05 | For a SNP to be eligible for inclusion in a ROH, the hit rate/overlap of all scanning windows containing the SNP must be at least 0.05 |
The VCF imputation tool can infer multi-allelic SNP and INDEL variants from low-coverage sequencing samples by packaging the GLIMPSE software (2020, Olivier Delaneau & Simone Rubinacci). The DRAGEN implementation of the GLIMPSE software allows for scalability of variant imputation:
with an end-to-end pipeline where the 3 phases of the GLIMPSE software (Chunk, Phase and Ligate) get executed in a single command, on one chromosome or on multiple chromosomes
with accceleration supported with Advanced Vector Extensions (AVX)
The DRAGEN VCF imputation tool infers variants on autosomes and chromosome X of haploid and diploid species.
Upon completion, the tool generates imputed variants based on a reference panel, a genetic map, and input samples provided. The DRAGEN secondary analysis software supports VCF imputation on human data and provides a reference panel and a genetic map for the hg38 reference build accessible on the DRAGEN Software Support site page.
For data other than human data (reference build hg38) the user needs to provide its own reference panel and genetic map. A custom reference panel can be built with the DRAGEN Population Haplotyping tool.
Notes:
The output is in biallelic format, one line per ALT.
The VCF imputation tool only supports input sample data generated with the DRAGEN secondary analysis software.
The following is an example of commands to impute vcf on a single chromosome:
The following is an example of commands to impute vcf on chromosome X:
The imputation tool infers multi-allelic SNP and INDEL variants from low-coverage sequencing samples that are provided by the user. To maximize the accuracy of the imputed variant per sample, the tool leverages the information from all provided samples.
The sample(s) to be imputed must have the following format:
VCF, multi-sample VCF, BCF or multi-sample BCF (zip or unzipped). gVCF is not supported
Must contain GL (Genotype Likelihoods) or PL (phred-scaled genotype likelihoods) information
To achieve more accurate results, it is recommended to use input VCF generated with the force genotyping capability of the DRAGEN secondary analysis software so that it contains all the positions that are present in the reference panel. A file to be used as input of the force genotyping run of the DRAGEN variant caller, with all sites present in the IRP reference panel (built from human reference genome hg38) is provided in the Imputation files accessible in the DRAGEN Software Support Site page. When running the force genotype option (of DRAGEN variant caller) for imputation, it is recommended to disable the machine learning tool (--vc-ml-enable-recalibration=false).
To impute INDELs and get the best accuracy on INDELs, it is recommended:
to force genotype the input VCF with a SNPs-only sites.vcf file using DRAGEN argument --vc-forcegt-vcf. This SNPs-only sites.vcf file contains only the SNPs sites present in the reference panel. A SNPs-only VCF file is also available in the IRP reference panel (built from human reference genome hg38) in the Imputation files accessible in the DRAGEN Software Support Site page.
and to set the command --imputation-phase-impute-reference-only-variants
to true.
A per-chromosome reference panel in BCF format that lists all the imputation positions in the targeted regions along with the corresponding haplotypes must be provided. A reference panel (with prefix IRPv{x}) is available in the Imputation files accessible in the DRAGEN Software Support Site page. IRPv2.0 is a multi-allelic SNP, INDELs reference panel containing 3202 samples from the 1000 Genomes Project, which have been variant called using DRAGEN 4.0 against hg38.
Notes: IRPv1.x does nor support chrX, IRPv2.x supports chrX, chrY and chrM are not supported
A custom reference panel can be built with the DRAGEN Population Haplotyping tool. When providing a custom reference panel ensure the chromosome of mixed ploidy chromosome is divided into the PAR and non-PAR regions that exist, and the basename matches the subregions names defined in the JSON config file. The format should be <PREFIX>.basename
. Examples: IRPv2.0.chrX_par1, IRPv2.0.chrX_par2, and IRPv2.0.chrX_nonpar.
A genetic map per chromosome is required to obtain the imputed variants. You can use your own genetic map computed from the recombination rate of the species and its reference genome, or use a prebuilt genetic map corresponding to the human hg38 reference genome. A prebuilt map is available as part of the Imputation files, accessible at the DRAGEN Software Support Site page. DRAGEN does not generate custom genetic map files. The genetic map should follow the format:
<chromosome name>.gmap.gz
3 columns: position, chromosome number, distance (cM)
compliant with the reference genome used to generate the sample input
This config file allows the proper handling of haploid/diploid chromosomes. This file is present in the same directory of the input reference panel with PREFIX and is available in the DRAGEN Software Support Site page. It must follow the naming convention: {$DIR}/{$PREFIX}.config.json. When the config file is not present in the directory, the tool assumes that the imputation is done on all diploid chromosomes.
In the IRP reference panel folder available on DRAGEN support page, the JSON config file corresponds to human data. The user can edit this file if imputation is done on another species.
For imputing VCF on human data with typename “M” for Male and “F” for Female (“M” and “F” are the values used in the sample type file):
The JSON config file is made of two fields as defined in the table below
Note: ensure the subregion names match the genetic map name. Example: if "chrX_nonpar" is defined in the "region" field of the JSON config file, then the genetic map corresponding to chromosome X non PAR region in the Reference Panel folder must be named "chrX_nonpar".gmap.gz.
The sample type file is required when haplotyping is performed on non-PAR regions of mixed ploidy chromosomes to define the typename of each sample.
The sample type file is a txt file with the following format
2 columns, tabs or space delimited
First column: list of all sample names present in the input sample
Second column: typename value for each sample. This typename value should match the typenames used in the JSON config file.
The VCF imputation tool generates several outputs:
The imputed variant file with concatenated imputed variants: one single VCF or msVCF file for all specified regions/chromosomes with name <prefix>.impute.vcf.gz
The intermediate files:
chunk regions to be passed along to the internal Phase step with name <prefix>.impute.chunk.out.txt
imputed variants per chunks identified: VCF or msVCF depending on the input sample format with name <prefix>_chr_start-end.impute.phase.vcf.gz
text file with path to all the <prefix>_chr_start-end.impute.phase.vcf.gz
generated with name <prefix>.impute.phase.out.txt
Note: while the imputation tool can impute multi-allelic positions, the output is in biallelic format, one line per ALT. The bcftools tool can be used to post-collapse all ALT in one line with the command: bcftools norm -m +snps
Note: with this end-to-end implementation of the GLIMPSE software, the parameters window_size and buffer_size are respectively set to 2 Mb and 200 kb.
Fields | Required/Optional | Purpose | Type |
---|---|---|---|
Option | Type | Required | Description |
---|---|---|---|
regions
Required only when a chromosome of mixed ploidy is present in the Reference Panel folder
Define contig name and subregion name of mixed ploidy chromosome
Dictionary in the form: contigname_of_mixed_ploidy :[contigname_of_mixed_ploidy"_par1", contigname_of_mixed_ploidy"_par2", contigname_of_mixed_ploidy"_nonpar1", contig_name_of_mixed_ploidy"_nonpar2"...]
ploidy
“default” is a required name
contigname_of_mixed_ploidy_"nonpar" is required only when a chromosome of mixed ploidy is present in the Reference Panel folder
Define:
ploidy behavior when different from “default”
default ploidy behavior
Dictionary in the form: contigname_of_mixed_ploidy_"nonpar": { typename1 : 1, typename2 : 2} "default" : { "typename1": 2, "typename 2": 2} typename is used in the Sample Type file input
--enable-imputation
NA
Yes
Set to true
to enable vcf imputation pipeline
--imputation-ref-panel-dir
STRING
Yes
Directory containing per-chromosome reference panel VCF and optionally the JSON config file
--imputation-ref-panel-prefix
STRING
Yes
Prefix for reference panel files and the JSON config file
--imputation-genome-map-dir
STRING
Yes
Directory containing per-chromosome genome map files
--imputation-chunk-input-region
STRING
Yes for single region
Target region, usually a full chromosome (e.g. chr20:1000000-2000000 or chr20).
--imputation-chunk-input-region-list
STRING
Yes for list of regions
Text file listing chromosomes or regions to be processed, one chromosome/region per line.
--imputation-phase-input
STRING
Yes for single VCF file
Sample input file with VCF/BCF format. Single VCF or multi-sample VCF
--imputation-phase-input-list
STRING
Yes for multiple VCF files
Text file listing sample input in VCF/BCF format, one input file per line
--imputation-phase-sample-type
STRING
Yes when imputing on a non PAR region of mixed ploidy chromosome AND a single VCF file
Define typename of the VCF file imputed. The typename must match one of the two typenames defined in the JSON config file
--imputation-phase-sample-type-list
STRING
Yes when imputing on a non PAR region of mixed ploidy chromosome AND a list of VCF files
Path to the Sample Type file
--output-directory
STRING
Yes
Output directory
--output-file-prefix
STRING
Yes
Output files prefix
--imputation-phase-threads
INT
No
Specify the number of threads to use. Default is the number of system threads
--imputation-phase-filter-input-sample-in-ref
NA
No
Default is true
: if sample ID matches between reference panel and sample input, then the corresponding samples are ignored from the reference panel to avoid imputation against itself. To be turned to false
if all samples from the reference panel should be kept regardless of their presence in the sample input.
--imputation-phase-impute-reference-only-variants
STRING
No
Default is false
. If set to true
, allows imputation at variants only present in the reference panel. The use of this option is intended only to allow imputation at sporadic missing variants. If the number of missing variants is non-sporadic, please re-run the genotype likelihood computation at all reference variants and avoid using this option, since data from the reads should be used.
When the input sample variant calling was performed using --vc-forcegt-vcf
with SNPs-only sites.vcf file, it is recommended to set this option to true to also impute INDELs positions from the reference panel.
--imputation-phase-input-independently
STRING
No
Default is false
. If set to true
, allows to treat each sample input independently without using them in the reference panel calculation