B-Allele frequency (BAF) output is enabled by default in germline and somatic VCF and gVCF runs.
The BAF value is calculated as either AF or (1 - AF), where:
AF = alt_count / (ref_count + alt_count)
BAF = 1 - AF only when ref base < alt base; otherwise BAF = AF. The order of priority for bases is A < T < G < C < N.
The B-allele frequency values are often plotted to visually inspect the spread away from a perfectly diploid heterozygous call (BAF=50%). This plot is more easily interpreted if it is symmetric about the BAF=50% line. To ensure the symmetry, a heuristic must be used to determine when BAF = AF or BAF = 1-AF. This definition of B-Allele Frequency is based on the definition that is used for bead arrays, as most users are accustomed to that implementation. There, the choice of the B allele is based on the color of dye attached to each nucleotide. A and T get one color, G and C get the other color. The bead array implementation has a much more complex rule for tie-breaking between A and T or G and C that involves top and bottom strands. That rule is unnecessary here, so the simpler hierarchical approach of using a priority for the nucleotides (A < T < G < C < N) is used.
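The rule above can be illustrated with a minimal sketch. The function name and signature are hypothetical and exist only for illustration; the priority order and the AF formula come from the text.

```python
# Priority order taken from the text: A < T < G < C < N.
BASE_PRIORITY = {"A": 0, "T": 1, "G": 2, "C": 3, "N": 4}


def b_allele_frequency(ref_base, alt_base, ref_count, alt_count):
    """Return the BAF for a single-SNP site, or None when there is no coverage.

    Hypothetical helper for illustration; not part of DRAGEN.
    """
    total = ref_count + alt_count
    if total == 0:
        return None  # AF is undefined when both counts are zero
    af = alt_count / total
    # BAF = 1 - AF only when the ref base sorts before the alt base.
    if BASE_PRIORITY[ref_base] < BASE_PRIORITY[alt_base]:
        return 1.0 - af
    return af
```

For example, with 30 reference reads and 10 alternate reads, an A>G site gives BAF = 0.75, while a G>A site with the same counts gives BAF = 0.25. The two cases mirror each other about the 50% line.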
For each small variant VCF entry with exactly one SNP alternate allele, the output contains a corresponding entry in the BAF output file.
<NON_REF> lines are excluded.
ForceGT variants (as marked by the "FGT" tag in the INFO field) are not included in the output, unless the variant also contains the "NML" tag in the INFO field.
Variants where the ref_count and alt_count are both zero are not included in the output.
--vc-enable-baf
Enable or disable B-allele frequency output. Enabled by default.
The BAF output files are BigWig-compressed files named <output-file-prefix>.baf.bw and <output-file-prefix>.hard-filtered.baf.bw. The hard-filtered file only contains entries for variants that pass the filters defined in the VCF (ie, PASS entries).
Each entry contains the following information: Chromosome Start End BAF
Where:
Chromosome is a string matching a reference contig.
Start and end values are zero-based, half open intervals.
BAF is a floating point value.
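As a rough sketch of the inclusion rules and the entry layout described above, the following code decides whether a variant gets a BAF entry and builds the (Chromosome, Start, End, BAF) tuple. Both function names are hypothetical. The conversion from a 1-based VCF POS to a zero-based, half-open interval is the usual BED-style convention and is assumed here. A gVCF record that lists a real SNP allele alongside <NON_REF> is not modeled.

```python
def include_in_baf(ref, alts, info_keys, ref_count, alt_count):
    """Return True if a variant would get a BAF entry under the stated rules.

    Hypothetical helper: `alts` is the list of ALT alleles and `info_keys`
    is the set of INFO tags present on the record.
    """
    # Exactly one alternate allele, and it must be a single-base SNP.
    # A lone <NON_REF> allele fails the length check and is excluded.
    if len(alts) != 1 or len(ref) != 1 or len(alts[0]) != 1:
        return False
    # ForceGT variants are skipped unless they also carry the NML tag.
    if "FGT" in info_keys and "NML" not in info_keys:
        return False
    # Sites where ref_count and alt_count are both zero are skipped.
    return (ref_count + alt_count) > 0


def baf_entry(chrom, vcf_pos, baf):
    """Build one entry; assumes vcf_pos is the 1-based VCF POS of the SNP."""
    start = vcf_pos - 1  # zero-based start
    end = vcf_pos        # half-open end
    return (chrom, start, end, baf)
```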
The map/align system produces a BAM file sorted by reference sequence and position by default. Creating this BAM file typically eliminates the requirement to run samtools sort or any equivalent postprocessing command. The --enable-sort option can be used to enable or disable creation of the sorted BAM file, as follows:
To enable, set to true.
To disable, set to false.
On the reference hardware system, running with sort enabled increases run time for a 30x full genome by about 6 to 7 minutes.
Marking or removing duplicate aligned reads is a common best practice in whole-genome sequencing. Not doing so can bias variant calling and lead to incorrect results.
The DRAGEN system can mark or remove duplicate reads, and produces a BAM file with duplicates marked in the FLAG field, or with duplicates entirely removed.
In testing, enabling duplicate marking adds minimal run time over and above the time required to produce the sorted BAM file. The additional time is approximately 1 to 2 minutes for a 30x whole human genome, which is a huge improvement over the long run times of open source tools.
The DRAGEN duplicate-marking algorithm is modeled on the Picard toolkit's MarkDuplicates feature. All the aligned reads are grouped into subsets in which all the members of each subset are potential duplicates.
For two pairs to be duplicates, they must have the following:
Identical alignment coordinates (position adjusted for soft- or hard-clips from the CIGAR) at both ends.
Identical orientations (direction of the two ends, with the left-most coordinate being first).
In addition, an unpaired read may be marked as a duplicate if it has identical coordinate and orientation with either end of any other read, whether paired or not.
Unmapped read pairs are never marked as duplicates.
When DRAGEN has identified a group of duplicates, it picks one as the best of the group, and marks the others with the BAM duplicate flag (0x400, or decimal 1024). For this comparison, duplicates are scored based on the average sequence Phred quality. Pairs receive the sum of the scores of both ends, while unpaired reads get the score of the one mapped end. The idea of this score is to try, all other things being equal, to preserve the reads with the highest-quality base calls.
If two reads (or pairs) have exactly matching quality scores, DRAGEN breaks the tie by choosing the pair with the higher alignment score. If there are multiple pairs that also tie on this attribute, then DRAGEN chooses a winner arbitrarily.
The score for an unpaired read R is the average Phred quality score per base, calculated as follows:
Where R is a BAM record, QUAL is its array of Phred quality scores, and dedup-min-qual is a DRAGEN configuration option with default value of 15. For a pair, the score is the sum of the scores for the two ends.
This score is stored as a one-byte number, with values rounded down to the nearest one-quarter. This rounding may lead to different duplicate marks from those chosen by Picard, but because the reads were very close in quality this has negligible impact on variant calling results.
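The scoring and tie-breaking rules above can be sketched as follows. This is an illustration under stated assumptions, not the DRAGEN implementation. The exact formula is not shown in the text, so the sketch assumes that bases below dedup-min-qual contribute zero to the sum while the divisor remains the full read length. It also assumes that the rounding to one-quarter is applied to the final (summed) score. All function names are hypothetical.

```python
import math


def read_score(quals, dedup_min_qual=15):
    """Average Phred quality per base, ignoring bases below dedup_min_qual.

    Assumption: excluded bases add 0 to the sum and the divisor is the
    full read length; the source text does not show the exact formula.
    """
    if not quals:
        return 0.0
    kept = sum(q for q in quals if q >= dedup_min_qual)
    return kept / len(quals)


def quantize(score):
    """Round a score down to the nearest one-quarter."""
    return math.floor(score * 4) / 4


def pick_best(candidates):
    """Pick the duplicate to keep.

    `candidates` is a list of (name, ends, alignment_score), where `ends`
    holds one quality array for an unpaired read or two for a pair.
    Ties on the quantized quality score are broken by alignment score.
    """
    def key(candidate):
        _name, ends, alignment_score = candidate
        total = sum(read_score(q) for q in ends)  # a pair sums both ends
        return (quantize(total), alignment_score)

    return max(candidates, key=key)[0]
```

Because scores are quantized, two candidates whose average qualities differ by less than 0.25 can tie, and the alignment score then decides. This is one way the winner can differ from the one Picard chooses.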
The limitations to DRAGEN duplicate marking implementation are as follows:
When there are two duplicate reads or pairs with very close Phred sequence quality scores, DRAGEN might choose a different winner from that chosen by Picard. These differences have negligible impact on variant calling results.
If using a single FASTQ file as input, DRAGEN accepts only a single library ID as a command-line argument (RGLB). For this reason, the FASTQ inputs to the system must be already separated by library ID. Library ID cannot be used as a criterion for distinguishing non-duplicates.
DRAGEN does not distinguish between optical and PCR duplicates.
The following options can be used to configure duplicate marking in DRAGEN:
--enable-duplicate-marking
Set to true to enable duplicate marking. When --enable-duplicate-marking is enabled, the output is sorted, regardless of the value of the enable-sort option.
--remove-duplicates
Set to true to suppress the output of duplicate records. If set to false, DRAGEN sets the 0x400 flag in the FLAG field of duplicate BAM records. When --remove-duplicates is enabled, --enable-duplicate-marking is forced on as well.
--dedup-min-qual
Specifies the Phred quality score below which a base should be excluded from the quality score calculation used for choosing among duplicate reads.
DRAGEN can remove artifacts from reads using hardware accelerated read trimming. Hardware accelerated read trimming is available on U200 and cloud systems, as part of the DRAGEN mapper and adds no additional run time. DRAGEN provides multiple independent trimming filters that target different types of artifacts or use cases. You can enable and configure the artifacts or use cases independently to tailor the read-trimming to your analysis. Read trimming uses two different modes, hard-trimming and soft-trimming.
To enable hard-trimming mode, use --read-trimmers. In hard-trimming mode, potential artifacts are removed from input reads. Reads that are trimmed to fewer than 20 bases are filtered and replaced with a placeholder read that uses 10 N bases. DRAGEN sets the 0x200 flag on the filtered reads.
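The filtering rule for hard-trimmed reads can be sketched as below. The function and constant names are hypothetical. The thresholds (20 bases, 10 N bases, flag 0x200) come from the text. Handling of the matching quality string is omitted.

```python
MIN_TRIMMED_LENGTH = 20  # reads trimmed to fewer bases than this are filtered
PLACEHOLDER_LENGTH = 10  # filtered reads are replaced by 10 N bases
FLAG_FILTERED = 0x200    # SAM FLAG bit for reads not passing filters


def finalize_hard_trimmed(seq, flag):
    """Return (sequence, flag) for a read after hard-trimming.

    Hypothetical helper for illustration; not part of DRAGEN.
    """
    if len(seq) < MIN_TRIMMED_LENGTH:
        return "N" * PLACEHOLDER_LENGTH, flag | FLAG_FILTERED
    return seq, flag
```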
DRAGEN contains a novel lossless soft-trimming mode. In soft-trimming mode, reads are mapped as though they had been trimmed, but no bases are removed. To enable the trimmer in soft mode, use --soft-read-trimmers.
Soft-trimming suppresses systematic mismapping of reads that contain trimmable artifacts, without actually losing the trimmed bases in aligned output. Soft-trimming prevents reads with trimmable artifacts, such as Poly-G artifacts, from being mapped to reference G homopolymers, and prevents adapter sequences from being mapped to the matching reference loci. With soft-trimming, reads might map to different positions in the reference than they would without it. When using soft-trimming, DRAGEN does not filter reads, and reads whose bases would have been trimmed entirely are not mapped.
Soft-trimming for Poly-G artifacts is enabled by default on supported systems.
Fixed-length trimming removes a fixed number of bases from the 5' end of each read. If you are analyzing sequencing data from an amplicon of fixed size and expect the read-length to consistently exceed the length of quality sequence data, you can use the expected number in fixed-length trimming.
Poly-G artifacts appear on two-channel sequencing systems when the dark base G is called after synthesis has terminated. As a result, DRAGEN calls several erroneous high-confidence G bases on the ends of affected reads. For contaminated samples, many affected reads can be mapped to reference regions with high G content. The affected reads can cause problems for processing downstream.
Base quality can degrade over the length of a read toward the 5' end and separate from any artifacts from early termination of synthesis. The lower quality bases can affect mapping and alignment results, and might lead to incorrect variant or methylation calls downstream. The quality trimming tool calculates a rolling average of the base quality inward from the 5' end and removes the minimum number of bases, so that the average quality is above the threshold specified using --trim-min-quality.
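One possible reading of the rolling-average rule is sketched below. The window size of 4 and the function name are assumptions made for illustration; the text does not state how the rolling average is windowed. The input lists qualities starting at the end being trimmed and moving inward, so the sketch does not depend on which end that is.

```python
def bases_to_trim(quals_from_end, min_quality, window=4):
    """Return how many bases to remove from the end being trimmed.

    `quals_from_end` lists Phred scores starting at that end and moving
    inward. Hypothetical sketch: trims the fewest bases such that the
    average quality of the next `window` bases reaches `min_quality`.
    """
    n = len(quals_from_end)
    for trimmed in range(n):
        win = quals_from_end[trimmed:trimmed + window]
        if sum(win) / len(win) >= min_quality:
            return trimmed
    return n  # no window reaches the threshold; the whole read is trimmed
```

Note that a windowed average can leave an isolated low-quality base in place when its neighbors are good. That behavior is a property of this sketch and may not match the product.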
Problems during library preparation, or libraries with smaller inserts can result in the synthesis of high quality reads containing sequence from the adapters used. If not removed before analysis, noninsert bases can reduce mapping efficiency and downstream accuracy. The adapter trimming tool uses the adapter sequences from the input FASTA file, and then removes all hits greater than a specified size. Adapter trimming allows for a 10% mismatch. For 3' adapters, trimming is from the first matching adapter base to the end of the read. For 5' adapters, trimming is from the first (3') matching adapter base to the beginning (5') of the read.
If quality trimming is not feasible due to reduced yield or other limitations, an alternative option is to remove only explicitly ambiguous bases from the ends of reads. If enabled, the ambiguous base trimmer applies a simple exact-match search to both ends of all processed reads, regardless of mate-pair status.
You can maximize trimmer sensitivity by using the minimum length trimming tool to remove a fixed number of bases from each read after the trimmer tools above have run. For example, if you would like to remove 5 bp from each read, a 7 bp adapter hit could be missed if five of the bases are removed first. To mitigate this issue, DRAGEN provides an optional minimum trim-length filter.
If using libraries of fixed-size inserts, such as small PCR amplicons, it is more convenient to specify a length that all reads should be trimmed to rather than the number of bases to remove. You can use the maximum length trimming tool.
If using RNA libraries, reads overlapping the poly-A tail of the transcripts may contain long poly-A/poly-T sequences at the end of the reads which may result in incorrect alignment. The poly-A trimmer mitigates this by trimming the poly-A tail from the end of the read. See additional description in RNA alignment section.
The trimmer generates a metrics file titled <output prefix>.trimmer_metrics.csv. Metrics are available on an aggregate level over all input data. The metrics units are in reads or bases.
Total input reads--Total number of reads in the input files.
Total input bases--Total number of bases in the input reads.
Total input bases R1--Total number of bases in R1 reads.
Total input bases R2--Total number of bases in R2 reads.
Average input read length--Total number of input bases divided by the number of input reads.
Total trimmed reads--Total number of reads trimmed by at least one base, not including soft-trimming.
Total trimmed bases--Total number of bases trimmed, not including soft-trimming.
Average bases trimmed per read--The number of trimmed bases divided by the number of input reads.
Average bases trimmed per trimmed read--The number of trimmed bases divided by the number of trimmed reads.
Remaining poly-G K-mers R1 3prime--The number of R1 3' read ends that contain likely Poly-G artifacts after trimming.
Remaining poly-G K-mers R2 3prime--The number of R2 3' read ends that contain likely Poly-G artifacts after trimming.
Total filtered reads--The number of reads that were filtered out during trimming.
Reads filtered for minimum read length R1--The number of R1 reads that were filtered due to being trimmed below the minimum read length.
Reads filtered for minimum read length R2--The number of R2 reads that were filtered due to being trimmed below the minimum read length.
<Trimmer tool> trimmed reads--The number of reads with at least one base trimmed by TRIMMER. DRAGEN reports the metric for both R1 and R2 mates and the filtering status (unfiltered or filtered) of the trimmed read. The metric includes reads that were trimmed during soft-trimming. Each trimming tool above produces the metric.
<Trimmer tool> trimmed bases--The number of bases trimmed by TRIMMER. The metric is produced for both R1 and R2 mates and the filtering status (unfiltered or filtered) of the trimmed read. The metric includes bases from reads that were trimmed during soft-trimming. Each trimming tool above produces the metric.
Regions of homozygosity (ROH) are detected as part of the small variant caller. The caller detects and outputs the runs of homozygosity from whole genome calls on autosomal human chromosomes. Sex chromosomes are ignored unless the sample sex karyotype is XX, as specified on the command line or determined by the Ploidy Estimator. ROH output allows downstream tools to screen for and predict consanguinity between the parents of the proband subject.
A region is defined as consecutive variant calls on the chromosome with no large gap in between these variants. In other words, regions are broken by chromosome or by large gaps with no SNV calls. The gap size is set to 3 Mbases.
ROH Algorithm
The ROH algorithm runs on the small variant calls. The algorithm excludes variants with multiallelic sites, indels, complex variants, non-PASS filtered calls, and homozygous reference sites. The variant calls are then filtered further using a block list BED, and finally depth filtering is applied after the block list filter. The default value for the fraction of filtered calls is 0.2, which filters the calls with the highest 10% and lowest 10% in DP values. The algorithm then uses the resulting calls to find regions.
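The depth filter can be sketched as follows, assuming the filtered fraction is split evenly between the two extremes, as the 10%/10% description suggests. The function name and the way ties and rounding are handled are assumptions for illustration.

```python
def depth_filter(calls, filtered_fraction=0.2):
    """Drop the calls with the lowest and highest DP values.

    `calls` is a list of (position, dp). Half of `filtered_fraction` is
    removed at each extreme; the original order of the kept calls is
    preserved. Hypothetical sketch; rounding rules are an assumption.
    """
    n = len(calls)
    k = int(n * filtered_fraction / 2)  # calls dropped at each extreme
    ranked = sorted(range(n), key=lambda i: calls[i][1])
    keep = set(ranked[k:n - k])
    return [calls[i] for i in range(n) if i in keep]
```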
The ROH algorithm first finds seed regions that contain at least 50 consecutive homozygous SNV calls with no heterozygous SNV or gaps of 500,000 bases between the variants. The regions can be extended using a scoring system that functions as follows.
Score increases with every additional homozygous variant (0.025) and decreases with a large penalty (1-0.025) for every heterozygous SNV. This provides some tolerance of presence of heterozygous SNV in the region.
Each region expands on both ends until the regions reach the end of a chromosome, a gap of 500,000 bases between SNVs occurs, or the score becomes too low (0).
Overlapping regions are merged into a single region. Regions can be merged across gaps of 500,000 bases between SNVs if a single region would have been called from the beginning of the first region to the end of the second region without the gap. There is no maximum size for regions, but regions always end at chromosome boundaries.
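The seed and extension steps above can be sketched for a single chromosome. The sketch covers only seed finding and extension toward one end; region merging is omitted. The constants come from the text. The function names are hypothetical, and the choice to end a region on a homozygous call is an assumption.

```python
HOM_REWARD = 0.025
HET_PENALTY = 1 - 0.025  # 0.975
MAX_GAP = 500_000        # a gap this large between SNVs breaks a region
MIN_SEED = 50            # consecutive homozygous SNVs needed for a seed


def region_score(n_hom, n_het):
    """Score as a function of homozygous and heterozygous SNV counts."""
    return HOM_REWARD * n_hom - HET_PENALTY * n_het


def find_seeds(snvs):
    """Find seed regions in `snvs`, a position-sorted list of (pos, is_hom).

    Returns inclusive (start_index, end_index) pairs.
    """
    seeds, start = [], None
    for i, (pos, is_hom) in enumerate(snvs):
        gap = i > 0 and pos - snvs[i - 1][0] >= MAX_GAP
        if start is not None and (not is_hom or gap):
            if i - start >= MIN_SEED:
                seeds.append((start, i - 1))
            start = None
        if is_hom and start is None:
            start = i
    if start is not None and len(snvs) - start >= MIN_SEED:
        seeds.append((start, len(snvs) - 1))
    return seeds


def extend_right(snvs, start, end):
    """Extend a seed to the right while the running score stays above 0."""
    n_hom, n_het = end - start + 1, 0
    best_end, i = end, end + 1
    while i < len(snvs) and snvs[i][0] - snvs[i - 1][0] < MAX_GAP:
        if snvs[i][1]:
            n_hom += 1
            best_end = i  # assumption: a region ends on a homozygous call
        else:
            n_het += 1
        if region_score(n_hom, n_het) <= 0:
            break
        i += 1
    return best_end
```

With these constants, a 50-SNV seed starts at a score of 1.25, so it tolerates one heterozygous SNV (score 0.275) but not a second one unless enough homozygous SNVs have accumulated in between.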
ROH Options
--vc-enable-roh
Set to true to enable the ROH caller. The ROH caller is enabled by default for human autosomes only. Set to false to disable.
--vc-roh-blacklist-bed
If provided, the ROH caller ignores variants that are contained in any region in the block list BED file. DRAGEN distributes block list files for all popular human genomes and automatically selects a block list to match the genome in use, unless this option is used to select a file.
ROH Output
The ROH caller produces an ROH output file named <output-file-prefix>.roh.bed, in which each row represents one region of homozygosity. The BED file contains the following columns:
Chromosome Start End Score #Homozygous #Heterozygous
Score is a function of the number of homozygous and heterozygous variants, where each homozygous variant increases the score by 0.025, and each heterozygous variant reduces the score by 0.975.
Start and end positions are a 0-based, half-open interval.
#Homozygous is the number of homozygous variants in the region.
#Heterozygous is the number of heterozygous variants in the region.
The caller also produces a metrics file named <output-file-prefix>.roh_metrics.csv that lists the number of large ROH and the percentage of SNPs in large ROH (>3 MB).
The table below demonstrates how the PLINK options can be tuned to behave similarly to the DRAGEN ROH caller default settings (see column DRAGEN default). We observed that PLINK ROH calls (see column PLINK default) in default settings are more conservative compared to DRAGEN default settings. By default, PLINK reports ROH regions of size 1MB or larger (see PLINK option --homozyg-kb ) with at least 100 homozygous SNPs (see PLINK option --homozyg-snp) while DRAGEN ROH caller reports smaller regions with at least 50 homozygous SNPs (see DRAGEN ROH Algorithm section). In addition, PLINK by default allows for only 1 heterozygous SNP per scanning window (specified by PLINK option --homozyg-window-het) while DRAGEN uses a soft score threshold penalty without setting an upper bound on the allowed number of heterozygous SNPs (see DRAGEN ROH Algorithm section). The PLINK ROH calls are largely comparable to the DRAGEN ROH calls after relaxing the default PLINK settings, shown in column PLINK tuned. Prior to PLINK ROH calling, the input DRAGEN hard-filtered VCF files are filtered as per the instructions in DRAGEN ROH Algorithm section.
DRAGEN supports pedigree-based and population-based germline variant joint analysis for multiple samples. A pedigree-based analysis deals with samples from the same species which are related to each other. A population-based analysis compares samples of the same species which are unrelated to each other.
Joint analysis requires a gVCF file for each sample. To create a gVCF file, run the germline small variant caller with the --vc-emit-ref-confidence gVCF option. You can also write a germline gVCF with reduced size using the --vc-compact-gvcf option. This results in a significant speedup for a downstream analysis using gVCF Genotyper. Note that this compact format is not compatible with a pedigree analysis.
The gVCF file contains information on the variant positions and positions determined to be homozygous to the reference genome. For homozygous regions, the gVCF file includes statistics that indicate how well reads support the absence of variants or alternative alleles. Contiguous homozygous runs of bases with similar levels of confidence are grouped into blocks, referred to as hom-ref blocks. Not all entries in the gVCF are contiguous. A gVCF might contain gaps that are not covered by either a variant line or a hom-ref block. Gaps correspond to regions that are not callable. A region is not callable if there is not at least one read mapped to the region with a MAPQ score above zero.
The DRAGEN germline variant caller has an option, --vc-combine-phased-variants-distance, to combine phased variants in the gVCF output. Input gVCF files created with this option cannot be processed in a population-based analysis using gVCF Genotyper.
The option to combine phased variants is switched off by default. For details, refer to the section on germline small variant calling in this user guide.
If force genotyping was enabled for any input file, any ForceGT calls that are not also called by the variant caller will be ignored.
Similarly, targeted variant calls (option --targeted-merge-vc) in any gVCF file that are not also called by the variant caller will be ignored as well.
Both pedigree- and population-based joint analysis can process gVCF files written by the GATK v4.1 variant caller.
There are two available joint analysis output files:
Multisample VCF--A VCF file containing a column with genotype information for each of the input samples according to the input variants.
Multisample gVCF--A gVCF file augmenting the content of a multisample VCF file, similar to how a gVCF file augments a VCF file for a single sample. In between variant sites, the multisample gVCF contains statistics that describe the level of confidence that each sample is homozygous to the reference genome. Multisample gVCF is a convenient format for combining the results from a pedigree or small cohort into a single file. If using a large number of samples, fluctuation in coverage or variation in any of the input samples creates a new hom-ref block, which causes a highly fragmented block structure and a large output file that can be slow to create.
The multisample gVCF output is only available in the pedigree-based analysis.
The following example shows a single line from a multi-sample VCF where one sample has a variant, and the other two samples are in a gVCF gap. Gaps are represented by "./.:.:".
In hom-ref blocks, the following FORMAT fields are calculated uniquely.
FORMAT/DP--In a single sample gVCF, the FORMAT/DP reported at a hom-ref position is the median DP in that band. In a multisample gVCF, the FORMAT/DP reported at a hom-ref position is the MIN_DP from hom-ref calls.
FORMAT/AD--In single sample gVCF, values represent the position in the band where DP=median DP. In the multisample gVCF, AD values at hom-ref positions are copied from the single sample gVCF.
FORMAT/AF--Values are based on FORMAT/AD.
FORMAT/PL--Values represent the Phred likelihoods per genotype hypothesis. For hom-ref blocks, each value in FORMAT/PL represents the minimum value across all positions within the band.
FORMAT/SPL and FORMAT/ICNT--Parameters reported in the gVCF records, including both hom-ref blocks and variant records. The parameters are used to compute the confidence score of a variant being de novo in the proband of a trio. For SNP, FORMAT/PL and FORMAT/SPL are both used as input to the DeNovo Caller. FORMAT/PL represents Phred likelihoods obtained from the genotyper, if the genotyper is called. FORMAT/SPL represents Phred likelihoods obtained from column-wise estimation, pregraph. Each value in FORMAT/SPL represents the minimum across all positions within the band. For INDEL, the PL value is computed in the joint pedigree calling step based on the FORMAT/ICNT reported in the gVCF file. FORMAT/ICNT consist of two values. The first value is the number of reads with no indels at the position, and the second value is the number of reads with indels at the position. Each value in FORMAT/ICNT represents the maximum of the value across all positions within the band.
In the following example hom-ref block, ICNT provides information on whether each sample contains an Indel at the position of interest. If the proband contains an indel at the position and the ICNT of the parents does not indicate any read supporting an indel, then the confidence score is high for the proband to have an indel de novo call at the position.
SPL and ICNT values are specific to DRAGEN. The GATK variant caller does not output SPL and ICNT values.
In a single sample gVCF, FORMAT/DP reported at a hom-ref position is the median DP in the band. The minimum is also computed and printed as MIN_DP for the band.
In the multisample gVCF, MIN_DP from hom-ref calls is printed as FORMAT/DP, and AD is copied from the single sample gVCF. Therefore, at a hom-ref position in the multisample gVCF output, the DP is not necessarily going to be the sum of AD.
Use pedigree mode to jointly analyze samples from related individuals and to perform de novo calling.
To invoke pedigree mode, set the --enable-joint-genotyping option to true. Use the --pedigree-file option to specify the path to a pedigree file that describes the relationship between samples.
The pedigree file must be a tab-delimited text file with the file name ending in the .ped extension. The following information is required.
The following is an example of an input pedigree file.
The De Novo Caller identifies all the trios within the pedigree and generates a de novo score for each child. The De Novo Caller supports multiple trios within a single pedigree. Pedigree Mode supports de novo calling for small, structural, and copy number variants.
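Trio identification from a pedigree file can be sketched as follows. The column layout (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) is the commonly used PED convention and is assumed here, because the list of required fields is not reproduced in this text. The function name and the sample names in the usage note are hypothetical.

```python
def find_trios(ped_text):
    """Return (child, father, mother) for each sample with both parents listed.

    Assumes tab-delimited rows whose first four columns are family ID,
    individual ID, paternal ID, and maternal ID. Trios are returned in
    the order the children appear in the file.
    """
    rows = []
    for line in ped_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        rows.append(line.split("\t"))
    samples = {row[1] for row in rows}
    trios = []
    for _family, individual, father, mother, *_rest in rows:
        if father in samples and mother in samples:
            trios.append((individual, father, mother))
    return trios
```

For a three-row file describing `child`, `dad`, and `mom`, where only the `child` row names parents that appear as samples, the function returns a single trio.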
Pedigree Mode is run in multiple steps. The following is an example workflow for a trio using FASTQ input.
Run single sample alignment and variant calling to generate per sample output using the following inputs for Pedigree Mode.
gVCF files for the Small Variant Caller.
*.tn.tsv files for the Copy Number Caller.
BAM files for the Structural Variant Caller.
The Small Variant De Novo Caller considers a trio of samples at a time. The samples are related via a pedigree file. The Small Variant De Novo Caller determines all positions that have a Mendelian conflict based on the genotype from the individual sample gVCFs. Sex chromosomes in males are treated as haploid apart from the PAR regions, which are treated as diploid.
Each of those positions is then processed through the Pedigree Caller to compute a joint posterior probability matrix for the possible genotypes. The probabilities are used to determine whether the proband has a de novo variant with a DQ confidence score. All three subjects are assumed to have an independent error probability.
At positions where the original genotype from the gVCFs shows a double Mendelian conflict (eg, 0/0+0/0->1/1 or 1/1+1/1->0/0), the genotypes of the trio samples can be adjusted to the highest joint posterior probability that has at least one Mendelian conflict.
The DQ formula is DQ = -10 * log10(1 - Pdenovo).
Pdenovo is the sum of all entries in the joint posterior probability matrix with one or more Mendelian conflicts.
In the GT overwrite step, it is possible for the GT of the parents to be overwritten. In the case of multiple trios, the GT of the parents is based on the last trio processed. The trios are processed in the order they are listed in the pedigree file. DRAGEN currently does not add an annotation in the VCF in cases where the GT was overwritten.
The multisample VCF output file is annotated with FORMAT/DQ and FORMAT/DN fields, which represent a de novo quality score and an associated de novo call. The DN field in the VCF indicates the de novo status for each segment.
The following are the possible values:
Inherited--The called trio genotype is consistent with Mendelian inheritance.
LowDQ--The called trio genotype is inconsistent with Mendelian inheritance and DQ is less than the de novo quality threshold.
DeNovo--The called trio genotype is inconsistent with Mendelian inheritance and DQ is greater than or equal to the de novo quality threshold.
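The DQ formula and the three DN values can be tied together in a short sketch. The function names are hypothetical. The threshold is passed in as a parameter because it depends on the variant type and on the configured quality thresholds.

```python
import math


def de_novo_quality(p_denovo):
    """DQ = -10 * log10(1 - Pdenovo)."""
    if p_denovo >= 1.0:
        return math.inf  # log10(0) is undefined; treat as unbounded confidence
    return -10.0 * math.log10(1.0 - p_denovo)


def de_novo_status(mendelian_conflict, dq, threshold):
    """Map a trio call to the DN value described above."""
    if not mendelian_conflict:
        return "Inherited"
    return "DeNovo" if dq >= threshold else "LowDQ"
```

A Pdenovo of 0.9 corresponds to a DQ of about 10, and a Pdenovo of 0.5 to a DQ of about 3.01.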
The following is an example VCF line for a trio:
1 16355525 . G A 34.46 PASS AC=1;AF=0.167;AN=6;DP=45;FS=6.69;MQ=108.04;MQRankSum=-0.156;QD=2.46;ReadPosRankSum=0;SOR=0.016 GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP:DPL:DN:DQ 0/1:11,3:0.214:14:39:PASS:8,2:3,1:74,0,47:39.454,0.00053613,49.99:0,1,104:74,0,47:DeNovo:0.67375 0/0:18,0:0:16:48:PASS:.:.:0,48,605:.:0,12,224:0,48,255:.:. 0/0:14,0:0:14:42:PASS:.:.:0,42,490:.:0,5,223:0,42,255:.:.
The following command line options are available for de novo small variant calling.
--enable-joint-genotyping
Run the joint genotyping caller.
--pedigree-file
Specify the path to a pedigree file that describes the relationship between samples. It is possible to run JointGenotyper without a pedigree file on unrelated samples, but this is no longer recommended for gVCF variant calls from DRAGEN 3.10 or newer.
--variant or --variant-list
Specify the gVCF input to the workflow. The pedigree caller can read input gVCF files from an AWS S3 bucket, Azure storage BLOB, or pre-signed URL.
--qc-snp-denovo-quality-threshold
Specify the minimum DQ value for a SNP to be considered de novo. The default is 0.05 if ML recalibration is off, 0.0017 if ML recalibration is on.
--qc-indel-denovo-quality-threshold
Specify the minimum DQ value for an indel to be considered de novo. The default is 0.4 if ML recalibration is off, 0.04 if ML recalibration is on.
--output-directory
The output directory. This is required.
--output-file-prefix
The prefix used to label all output files. This is required.
-r
The directory where the hash table resides.
The output of the joint genotyper depends on the order of input gVCF files passed on the command line using --variant or --variant-list. It is recommended to use the same input order when re-analyzing gVCFs to ensure the output is the same as an earlier run.
DRAGEN provides a population-based analysis option to jointly analyze samples from unrelated individuals.
To compare multiple pedigrees, you can run gVCF Genotyper on the output of a pedigree analysis and merge multiple joint-called pedigrees into a single multisample VCF. To enable, run the pedigree analysis using the --enable-multi-sample-gvcf=true option to write a multisample gVCF.
gVCF Genotyper offers an iterative workflow to aggregate new samples into an existing cohort. The iterative workflow allows users to incrementally aggregate new batches of samples with existing batches, without having to redo the analysis from scratch across all samples, every time when new samples are available. The workflow takes single sample gVCF files as input, and can be performed in a "step-by-step mode" if multiple batches of samples are available, or "end-to-end mode", if only a single batch of samples is available. Multi-sample gVCF files output from the Pedigree Caller (described above) are also accepted as input. gVCF Genotyper can accept input gVCF files generated using DRAGEN version 3.2.6 or later.
Step 1 (gVCF aggregation): the user can use iterative gVCF Genotyper to aggregate a batch of gVCF files into a cohort file and a census file. The cohort file is a condensed data format to store gVCF data in multiple samples, similar to a multi-sample gVCF. The census file stores summary statistics of all the variants and hom-ref blocks among samples in the cohort. As part of this step, adjacent hom-ref blocks with matching FILTER columns are further merged to reduce the disk footprint of the intermediate files, FORMAT field values being base-pair weight averaged in the process.
When a large number of samples are available, the user can divide samples into multiple batches each with similar sample size (e.g. 1000 samples), and repeat Step 1 for every batch.
Step 2 (census aggregation): after all per batch census files are generated, the user can aggregate them into a single global census file. This step scales to aggregate thousands of batches, in a much more efficient way than aggregating gVCFs from all batches. When a new batch of samples becomes available, the user only needs to perform Step 1 on that batch, then aggregate the census file from the new batch with the global census file from all previous batches in order to generate an updated global census file.
Step 3 (msVCF generation): every time a global census file is updated, with new variant sites discovered and/or variant statistics updated at existing variant sites, the user can take a per-batch cohort file, per-batch census file and the global census file as input, and generate a multi-sample VCF for one batch of samples. The output multi-sample VCF contains the variants and alleles discovered in all samples from all batches, and also includes global statistics such as allele frequencies, the number of samples with or without genotypes, and the number of samples without coverage. Similar statistics among samples in the batch are also included. This step can be repeated for every batch of samples, and the number of records in each output multi-sample VCF is the same across all batches.
To facilitate parallel processing on distributed compute nodes, for every step above, the user can choose to split the genome into shards of equal size, and process each shard using one instance of iterative gVCF Genotyper on each compute node. See option --shard below.
There is a special treatment of alternative or unaligned contigs when the --shard option is enabled: all contigs that are not autosomes, X, Y or chrM are included in the last shard. No other contigs will be assigned to the last shard. The mitochondrial contig will always be on its own in the second to last shard.
If a combined msVCF of all batches is required, an additional step can be separately run to merge all of the batch msVCF files into a single msVCF containing all samples.
--enable-gvcf-genotyper-iterative
: set to true to run the iterative gVCF Genotyper (always required).
--ht-reference
: The file containing the reference sequence in FASTA format (always required).
--output-directory
: The output directory (always required).
--output-file-prefix
: The prefix used to label all output files (optional, default value dragen).
--shard
: Use this option to process only a portion ('shard') of the genome, when distributing the work across multiple compute nodes in a production workflow. Provide the index (1-based) of the shard to process and the total number of shards, in the format of n/N (e.g. 1/50 means shard 1 of total 50 shards). To facilitate concurrent processing within each job, the shard will by default be split into 10x the number of available threads. This option assumes a Human reference genome and might not work for non-Human reference genomes.
--gg-regions
: Use this option to test iterative gVCF Genotyper only for a subset of regions in the genome. The value is a comma-delimited list of regions (chr:start-end). Contig names must match those in the reference and no region may overlap another. If a single region larger than 1Mb is selected, multiple threads are enabled. Otherwise, one thread is launched per region. This assumes that the --shard option is not given. It is important that the same regions are chosen for each of steps 1, 2 and 3.
--gg-regions-bed
: If a path to a BED file is provided as value, this option, like the one above, will limit the iterative gVCF Genotyper processing to the genome regions specified therein, which must be non-overlapping. This option is intended for exome input data. It results in faster processing times and is compatible with sharding. This option will only take effect in step 1 or end-to-end mode. It differs from the option above in that, if the number of regions exceeds 10 times the number of available threads, they will not necessarily be processed by independent threads.
--gg-discard-ac-zero
If set to true, the gVCF Genotyper does not print variant alleles that are not called (hom-ref genotype) in any sample. The default value is true.
--gg-remove-nonref
If set to true, the <NON_REF> symbolic allele is removed in the process of reading in input gVCF files. The default value is true.
--gg-vc-filter
Discard input variants that failed filters in the upstream caller. The default is false. Affected records will have their genotype set to hom-ref and the filter string "ggf" added to FORMAT/FT.
--gg-skip-filtered-sites
Omits msVCF records that fail the given hard filter. The default is false.
--gg-squeeze-msvcf
Set to omit genotype fields other than GT from the output msVCF for confidently called hom-ref sample records.
--gg-gq-squeezing-threshold
Use in conjunction with the previous option to adjust the threshold on GQ (default 30) that signifies a confident hom-ref call.
--gvcfs-to-cohort-census
: set to true to aggregate gVCF files from one batch of samples into a cohort file and a census file.
--variant-list
: the path to a file containing a list of input gVCF files, with the absolute path to each file on a separate line.
--variant
: if --variant-list is not given, use this option for each input gVCF file. Absolute file paths must be provided.
--aggregate-censuses
: set to true to aggregate a list of per batch census files into a global census file.
--input-census-list
: the path to a file containing a list of input per batch census files (from Step1), with the absolute path to each file on a separate line.
--generate-msvcf
: set to true to generate a multi-sample VCF for one batch of samples.
--input-cohort-file
: the path to the per batch cohort file (from Step1).
--input-census-file
: the path to the per batch census file (from Step1).
--input-global-census-file
: the path to the global census file (from Step2).
--gvcfs-to-msvcf
: set to true to enable the end-to-end mode. This is the default if none of steps 1, 2 or 3 above is selected.
--variant-list
: the path to a file containing a list of input gVCF files, with the absolute path to each file on a separate line.
--variant
: if --variant-list is not given, use this option for each input gVCF file. Absolute file paths must be provided.
--merge-batches
: set to true to merge msVCF files for a set of batches.
--input-batch-list
: the path to a file containing a list of msVCF files to be merged, with the path to each file on a separate line. All the files listed must have been generated from the same global census file, with the same set of options, and by default all batches pertaining to that global census must be included in the merge.
--gg-enable-indexing
: set to true (the default) to generate a tabix index for the merged msVCF.
--gg-merge-subset
: set to override the restriction that all batches must be included in the merge.
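As a rough illustration of how the options above combine across the three steps, the following sketch assembles one command line per step for a given shard. It is not an official recipe: the executable name `dragen`, the `--option=value` spelling for every option, the helper names, and all file paths are assumptions made for the example, and the actual names of the cohort and census files written by Step 1 are not stated here, so they are passed in as parameters.

```python
from typing import List

DRAGEN = "dragen"  # assumed executable name


def base_cmd(reference: str, out_dir: str, prefix: str,
             shard: int, n_shards: int) -> List[str]:
    """Options shared by every step; --shard takes a 1-based index as n/N."""
    if not 1 <= shard <= n_shards:
        raise ValueError("shard index is 1-based and cannot exceed the shard count")
    return [
        DRAGEN,
        "--enable-gvcf-genotyper-iterative=true",
        f"--ht-reference={reference}",
        f"--output-directory={out_dir}",
        f"--output-file-prefix={prefix}",
        f"--shard={shard}/{n_shards}",
    ]


def step1_cmd(reference, out_dir, prefix, shard, n_shards, variant_list) -> List[str]:
    """Step 1: aggregate one batch of gVCFs into a cohort file and a census file."""
    return base_cmd(reference, out_dir, prefix, shard, n_shards) + [
        "--gvcfs-to-cohort-census=true",
        f"--variant-list={variant_list}",
    ]


def step2_cmd(reference, out_dir, prefix, shard, n_shards, census_list) -> List[str]:
    """Step 2: aggregate per-batch census files into a global census file."""
    return base_cmd(reference, out_dir, prefix, shard, n_shards) + [
        "--aggregate-censuses=true",
        f"--input-census-list={census_list}",
    ]


def step3_cmd(reference, out_dir, prefix, shard, n_shards,
              cohort_file, census_file, global_census_file) -> List[str]:
    """Step 3: generate the msVCF for one batch against the global census."""
    return base_cmd(reference, out_dir, prefix, shard, n_shards) + [
        "--generate-msvcf=true",
        f"--input-cohort-file={cohort_file}",
        f"--input-census-file={census_file}",
        f"--input-global-census-file={global_census_file}",
    ]


if __name__ == "__main__":
    # Print the Step 1 command for the first two of 50 shards (hypothetical paths).
    for shard in (1, 2):
        print(" ".join(step1_cmd("/ref/genome.fa", "/out/batch1", "batch1",
                                 shard, 50, "/data/batch1_gvcfs.txt")))
```

Keeping the shard index and count as explicit parameters makes it harder to run different steps with mismatched shard settings.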
Mimalloc is a custom memory allocation library that can yield a significant speed-up in the iterative gVCF Genotyper workflow. In some deployments, e.g. cloud, it is automatically and seamlessly used, but in other contexts it requires special user intervention to be activated, as at present it cannot be included in standard DRAGEN by default.
For this purpose, the convenience script mi_dragen.sh is provided, which loads the bundled library and can be used transparently in the same way as the DRAGEN executable. Note that it is intended and supported only for the iterative gVCF Genotyper component, although it can in principle be applied to any other DRAGEN workflow. Using it for other purposes is known to risk excessive memory use and should be undertaken at the user's own risk.
The output of gVCF Genotyper is a multi-sample VCF (msVCF) that contains metrics computed for all samples in the cohort.
The msVCF can become a very large file with increasing cohort size. In some cases, the file might need more storage than can be allocated by VCF parsers. This is caused by VCF entries such as FORMAT/PL which store a value for each combination of alleles. We therefore decided to replace FORMAT/PL with a tag FORMAT/LPL which stores a value only for the alleles that actually occur in the sample. Similarly, the msVCF also contains FORMAT/LAD which stores the allelic depth only for the alleles occurring in the sample.
We also added a new FORMAT/LAA field which lists 1-based indices of the alternate alleles that occur in the current sample. The allele order of other local fields is the same as that of LAA.
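To make the relationship between the site-level and local fields concrete, here is a minimal sketch that derives LAA and LAD for one sample from its genotype and full allelic depths. The function name is hypothetical, and placing the reference depth first in LAD, followed by the alleles in LAA order, is an assumption based on the convention for local-allele fields in the VCF specification rather than something stated above.

```python
from typing import List, Optional, Sequence, Tuple


def localize(gt: Sequence[Optional[int]], ad: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return (LAA, LAD) for one sample.

    gt: called allele indices (0 = REF, None = missing allele).
    ad: allelic depths for every allele at the site, REF first.
    """
    # LAA: 1-based indices of the ALT alleles that occur in this sample.
    laa = sorted({a for a in gt if a is not None and a > 0})
    # LAD: depth of REF (assumed first), then each local ALT in LAA order.
    lad = [ad[0]] + [ad[a] for a in laa]
    return laa, lad


if __name__ == "__main__":
    # Site with REF plus four ALT alleles; the sample is heterozygous for ALT 3.
    print(localize((0, 3), [10, 0, 0, 7, 0]))
```

In this example only two depths are stored for the sample instead of five, which is where the size reduction comes from as the number of alleles at a site grows.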
When processing mitochondrial variant calls, which may contain separate records for each allele, iterative gVCF Genotyper processing differs in the following ways:
Only the record with the highest FORMAT/AF sum is kept.
The FORMAT/AF field will be additionally collected, and used to generate the FORMAT/LAF field in the output msVCF
The value displayed in the QUAL column of the msVCF is the maximum of the input QUAL values for the site across the global cohort. The QUAL value will be missing if any of the batch census files used to create the global census were generated with a version of DRAGEN earlier than v4.2.
Iterative gVCF Genotyper offers several metrics for assessing adherence to HWE. It calculates both allele-wise and site-wise HWE P-values, an allele-wise excess heterozygosity (ExcHet) P-value and the site-wise inbreeding coefficient (IC). These metrics are calculated only for diploid sites and missing values are excluded from the calculations. These values are included as fields in the INFO column of the output msVCF file. Both batch-wise and global values are included, where the field names for the global values are prefixed with G.
Care should be taken when interpreting these metrics for small cohorts and/or low frequency alleles, as small changes in inputs can lead to large changes in their values. Further, violations of the underlying HWE assumptions (such as inbreeding), and non-random sampling (such as the presence of consanguineous samples), can adversely affect results, making identification of poorly called variants more difficult.
Where a metric cannot be calculated, it is represented as missing (i.e., ".") in the msVCF file. The conditions vary between metrics, but this may occur if non-diploid genotypes are encountered, if there is only one allele present at a site, or if no samples are genotyped at a site.
For HWE a P-value of ≈ 1 suggests that the distribution of heterozygotes and homozygotes is close to that expected under HWE, while a P-value of ≈ 0 suggests a deviation from it. For ExcHet a P-value of ≈ 0.5 suggests that the number of heterozygotes is close to the number expected under HWE, while a value ≈ 1 suggests that there are more heterozygotes than expected and a value ≈ 0 suggests that there are fewer heterozygotes than expected.
For a bi-allelic site, the HWE P-value is based on comparing the observed numbers of homozygotes and heterozygotes to the expected numbers. For a multi-allelic site, P-values are calculated per ALT allele as if it were bi-allelic. Genotypes composed of only the ALT allele being considered are counted as alternative homozygous, any other genotype containing a copy of the ALT allele being considered is counted as heterozygous, and any genotype with no copies of the ALT allele being considered is counted as reference homozygous (this may include genotypes containing other ALT alleles).
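The per-allele collapsing rule for multi-allelic sites can be sketched as follows. This only illustrates the counting step; the function name and the tuple representation of genotypes are assumptions, and the test that turns these counts into an allele-wise P-value is not shown.

```python
from typing import Iterable, Optional, Tuple


def collapse_to_biallelic(genotypes: Iterable[Optional[Tuple[Optional[int], ...]]],
                          alt: int) -> Tuple[int, int, int]:
    """Count (hom_ref, het, hom_alt) for one ALT allele index.

    Non-diploid and missing genotypes are skipped, in line with the statement
    that only diploid sites are used and missing values are excluded.
    """
    counts = [0, 0, 0]  # indexed by the number of copies of `alt`
    for gt in genotypes:
        if gt is None or len(gt) != 2 or None in gt:
            continue
        copies = sum(1 for allele in gt if allele == alt)
        counts[copies] += 1
    return counts[0], counts[1], counts[2]


if __name__ == "__main__":
    gts = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (0, 2), None, (1,)]
    # For ALT 1, genotypes (2, 2) and (0, 2) are counted as reference homozygous.
    print(collapse_to_biallelic(gts, alt=1))
```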
Iterative gVCF Genotyper calculates a site-wise HWE P-value. The value is calculated using Pearson's chi-squared method, comparing the genotype counts expected under HWE to those observed. The chi-squared test statistic is calculated as
$$\chi^2 = \sum_{gt} \frac{(E_{gt} - O_{gt})^2}{E_{gt}}$$
where the summation is over all genotypes gt possible at the site given the alleles present, and $E_{gt}$ and $O_{gt}$ are the expected and observed counts for genotype gt, respectively. The P-value is then calculated from a chi-squared distribution whose number of degrees of freedom is the number of possible genotypes minus the number of alleles, which is
$$\frac{n(n+1)}{2} - n = \frac{n(n-1)}{2}$$
where n is the number of alleles.
The batch-wise value uses only the alleles present in the batch. Alleles with AC=0 are not included in the calculation.
A P-value of ≈ 1 suggests that the distribution of heterozygotes and homozygotes is close to that expected under HWE, while a P-value of ≈ 0 suggests a deviation from it.
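The site-wise test can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not DRAGEN's implementation: allele frequencies are estimated from the observed genotype counts, the degrees of freedom are taken as the number of possible diploid genotypes minus the number of alleles, n(n-1)/2, and the chi-squared tail probability uses the closed-form series for integer degrees of freedom. All function names are hypothetical.

```python
import math
from itertools import combinations_with_replacement
from typing import Dict, Optional, Tuple


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-squared distribution, integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2-1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc(chi/sqrt(2)) + 2*Z(chi) * sum_r chi^(2r-1) / (1*3*...*(2r-1))
    chi = math.sqrt(x)
    z = math.exp(-x / 2) / math.sqrt(2 * math.pi)  # standard normal density
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(chi / math.sqrt(2)) + 2 * z * total


def sitewise_hwe_pvalue(genotype_counts: Dict[Tuple[int, int], int]) -> Optional[float]:
    """genotype_counts maps diploid genotypes (a, b) with a <= b to sample counts."""
    n_samples = sum(genotype_counts.values())
    allele_counts: Dict[int, int] = {}
    for (a, b), count in genotype_counts.items():
        allele_counts[a] = allele_counts.get(a, 0) + count
        allele_counts[b] = allele_counts.get(b, 0) + count
    # Alleles with AC=0 are left out of the calculation.
    alleles = sorted(a for a, c in allele_counts.items() if c > 0)
    if n_samples == 0 or len(alleles) < 2:
        return None  # reported as missing
    freq = {a: allele_counts[a] / (2 * n_samples) for a in alleles}
    chi2 = 0.0
    for a, b in combinations_with_replacement(alleles, 2):
        prob = freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]
        expected = n_samples * prob
        observed = genotype_counts.get((a, b), 0)
        chi2 += (expected - observed) ** 2 / expected
    df = len(alleles) * (len(alleles) - 1) // 2
    return chi2_sf(chi2, df)
```

For example, counts of 25, 50 and 25 for genotypes 0/0, 0/1 and 1/1 match the HWE expectation exactly and give a P-value of 1, whereas 50 homozygotes of each kind with no heterozygotes give a P-value very close to 0.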
Iterative gVCF Genotyper calculates the inbreeding coefficient (IC) (sometimes called the Fixation index and denoted by F). It is defined as the proportion of the population that is inbred. The value of IC can be estimated by comparing the observed number of heterozygotes to the number expected under HWE:
$$IC = 1 - \frac{O(het)}{E(het)}$$
where O(het) and E(het) are the observed and expected number of heterozygotes in the cohort, respectively. Although initially conceived for studying inbreeding and defined as a non-negative value, it is also commonly used to look for deviations from HWE and can take values in the range [-1, 1].
Values of IC ≈ 0 suggest that the cohort is in HWE. Negative values suggest an excess of heterozygosity and a deviation from HWE, which can be symptomatic of poor variant calling. Positive values suggest a deficit of heterozygotes and the possible presence of inbreeding.
Using the above definition, IC should be a property of the population, and so would be expected to be drawn from the same distribution for all sites and for all variants at a site. Deviations from this distribution can suggest issues in calling a site correctly. Violations of HWE assumptions and/or non-random sampling may adversely affect the distribution of IC, causing it to be shifted. However, outliers can still be identified, although thresholds may need to be adjusted accordingly.
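A minimal sketch of the estimate for a bi-allelic site follows, assuming the usual estimator IC = 1 - O(het)/E(het) with E(het) = 2pqN computed from the observed allele frequencies. The function name is hypothetical, and how DRAGEN combines alleles at multi-allelic sites is not covered here.

```python
from typing import Optional


def inbreeding_coefficient(hom_ref: int, het: int, hom_alt: int) -> Optional[float]:
    """Estimate IC from diploid genotype counts at a bi-allelic site."""
    n = hom_ref + het + hom_alt
    if n == 0:
        return None  # no genotyped samples: reported as missing
    p = (2 * hom_ref + het) / (2 * n)   # reference allele frequency
    expected_het = 2 * p * (1 - p) * n  # heterozygotes expected under HWE
    if expected_het == 0:
        return None  # only one allele present: reported as missing
    return 1 - het / expected_het


if __name__ == "__main__":
    print(inbreeding_coefficient(25, 50, 25))   # counts match HWE
    print(inbreeding_coefficient(50, 0, 50))    # deficit of heterozygotes
    print(inbreeding_coefficient(0, 100, 0))    # excess of heterozygotes
```

The three calls illustrate the interpretation given above: a value of 0 when the counts match HWE, a positive value for a deficit of heterozygotes, and a negative value for an excess.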
Allelic balance (AB) describes the proportion of reads that support each allele within a called genotype and can be calculated from the allelic depth (FORMAT/AD or FORMAT/LAD). For homozygotes this is taken as
$$AB = \frac{AD_i}{\sum_j AD_j}$$
where i is the index of the called allele and j runs over all alleles. For heterozygotes this is taken as
$$AB_i = \frac{AD_i}{AD_a + AD_b}$$
where a and b are the indices of the called alleles and i can have values a or b. For homozygous genotypes AB is expected to be ≈ 1 and for heterozygous genotypes it is expected to be ≈ 0.5 for each allele. Deviations from the expected values can be indicative of an error.
DRAGEN's iterative gVCF Genotyper calculates site-wise AB values for each allele based on the read depths among all samples. Only diploid genotypes are included in the calculations. Values are calculated separately among homozygous (ABHom) and heterozygous (ABHet) genotypes. ABHet is calculated using the counts among all heterozygous calls that contain the allele under consideration. P-values for ABHet are also calculated (ABHetP) based on a binomial test with an expected probability of 0.5. A P-value of ≈ 1 signifies that results are in line with expectation while ≈ 0 signifies a deviation from expectation. Values are written to the INFO fields ABHom, ABHet and ABHetP, with one value for each allele (including the reference allele). Values should be in the range [0, 1]. Missing values are coded by -1, for example where there are no homozygous calls for an allele. If AD is not present in any input gVCF file, the values are not calculated and the fields will be omitted from the output msVCF file.
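As an illustration of the heterozygous case, the sketch below pools read depths over all heterozygous calls that contain an allele and applies an exact binomial test with an expected probability of 0.5. Pooling reads across samples and using a two-sided test are assumptions, since the exact aggregation and test variant are not spelled out above, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def ab_het(het_calls: List[Tuple[int, int]]) -> float:
    """Pooled heterozygous allelic balance for one allele.

    Each tuple holds (depth of the allele of interest, depth of the other
    called allele) for one heterozygous sample. Returns -1.0, the missing
    code, when there are no reads to use.
    """
    allele_reads = sum(a for a, _ in het_calls)
    total_reads = sum(a + b for a, b in het_calls)
    return allele_reads / total_reads if total_reads else -1.0


def ab_het_p(allele_reads: int, total_reads: int) -> float:
    """Two-sided exact binomial P-value against an expected probability of 0.5."""
    if total_reads == 0:
        return -1.0
    # With p = 0.5 the distribution is symmetric, so double the smaller tail.
    k = min(allele_reads, total_reads - allele_reads)
    tail = sum(math.comb(total_reads, i) for i in range(k + 1)) / 2 ** total_reads
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    calls = [(10, 10), (5, 15)]  # two heterozygous samples
    print(ab_het(calls))         # 15 of 40 pooled reads support the allele
    print(ab_het_p(15, 40))
```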
Sites in the output msVCF can be filtered on the following global metrics:
QUAL
Number of samples with called genotypes (GNS_GT)
Inbreeding coefficient (GIC)
𝜒2 Hardy-Weinberg Equilibrium P-value (GHWEc2)
The maximum P-value for heterozygous allelic balance (GABHetP)
The per-sample genotype metrics in the output msVCF can be customized by providing a colon-separated list of metrics, analogous to that of the VCF FORMAT column, to the --gg-msvcf-format-fields option, e.g. --gg-msvcf-format-fields=GT:LAD:LPL:LAA:QL. Supported metrics are GT, GQ, AD, LAD, FT, LPL, LAA, LA, LGT, QL, MQR, LAF and DF (N.B. LAF will only appear on the MT contig and DF will only appear if the --gg-diploidify option is enabled). Sample genotype (GT) is always present and always shown first, regardless of whether it is included in the option string or not. Alternatively, an msVCF containing only site statistics and no per-sample genotype fields can be generated using the option --gg-msvcf-format-fields=None.
The per-site INFO metrics in the output msVCF can be customized by providing a semicolon-separated list of metrics, analogous to that of the VCF INFO column, to the --gg-msvcf-info-fields option, e.g. --gg-msvcf-info-fields=AC;AN;NS;NS_GT;NS_NOGT;NS_NODATA;AF. Supported metrics are AC, AN, NS, NS_GT, NS_NOGT, NS_NODATA, IC, HWE, ExcHet, HWEc2, AF. The default set of metrics is AC, AN, NS, NS_GT, NS_NOGT, NS_NODATA, IC, HWE, ExcHet and HWEc2. All INFO fields can be included using the option --gg-msvcf-info-fields=All. All INFO fields can be dropped using the option --gg-msvcf-info-fields=None, in which case the INFO field will contain the missing symbol (.). For each specified metric, the value for the current batch and the global value are written. For global values, the metric names are prefixed with G.
INFO fields that have a missing value (.) at a site are omitted from the msVCF for that site, so sites may contain different sets of fields.
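Because each metric is written twice (batch and global) and missing fields are omitted per site, downstream code has to parse the INFO column defensively. The sketch below splits an INFO string into batch-wise and global dictionaries using the G prefix. The function name and the example INFO string are made up for illustration, and the list of base metric names is an assumption compiled from the fields mentioned in this section.

```python
from typing import Dict, Tuple

# Base metric names mentioned in this section; treated here as an assumption.
BASE_METRICS = {
    "AC", "AN", "NS", "NS_GT", "NS_NOGT", "NS_NODATA",
    "IC", "HWE", "ExcHet", "HWEc2", "AF", "ABHom", "ABHet", "ABHetP",
}


def split_info(info: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split an msVCF INFO string into (batch, global) metric dictionaries."""
    batch: Dict[str, str] = {}
    global_: Dict[str, str] = {}
    if info == ".":
        return batch, global_  # all INFO fields dropped or missing
    for item in info.split(";"):
        key, _, value = item.partition("=")
        if key.startswith("G") and key[1:] in BASE_METRICS:
            global_[key[1:]] = value  # strip the G prefix
        else:
            batch[key] = value
    return batch, global_


if __name__ == "__main__":
    batch, global_ = split_info("AC=2;AN=10;GAC=7;GAN=200;GIC=0.01")
    # Fields omitted at this site simply do not appear, so use .get().
    print(batch.get("IC"), global_.get("IC"))
```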
For sizable cohorts, the file outputs from gVCF Genotyper can become extremely large. However, there are a number of options within the component which can mitigate this. As well as reduced footprint on disk, these options can lead to faster runtimes owing to the diminished I/O demands.
The following options have applicability to this:
The small variant caller's --vc-compact-gvcf option, described previously. This doesn't reduce output file sizes, but the smaller input gVCFs reduce gVCF Genotyper runtime and could reduce data storage costs.
The removal of the NON_REF symbolic allele when ingesting the input gVCF files, which is the default behaviour. Doing this reduces the size not only of the final msVCF output, but also the intermediate cohort and census files.
Several options exist that reduce the volume of data written to the final msVCF file:
Omitting records that fail filters (--gg-skip-filtered-sites option).
Dropping trailing genotype fields for hom-ref records (--gg-squeeze-msvcf option). This behaviour is explicitly permitted by the VCF specification.
1: The number of values is coded as per the VCF specification, with A denoting one value per alt allele, R one value per possible allele (including the reference allele), G one value per possible genotype and . an unspecified number of values that may vary between site and sample. The number of elements in localised array FORMAT fields that depend on the number of local alleles will vary between samples, so the count for these fields is given as the unspecified code (.).
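The effect of these codes on record size can be seen with a small helper that returns the number of values implied by each code for a given number of ALT alleles. The function name is hypothetical; the genotype count uses the standard combinatorial formula for unordered genotypes of a given ploidy.

```python
import math
from typing import Optional


def n_values(code: str, n_alt: int, ploidy: int = 2) -> Optional[int]:
    """Number of values implied by a VCF Number code at a site with n_alt ALT alleles."""
    n_alleles = n_alt + 1  # ALT alleles plus the reference allele
    if code == "A":
        return n_alt
    if code == "R":
        return n_alleles
    if code == "G":
        # Unordered genotypes: choose `ploidy` alleles with repetition.
        return math.comb(n_alleles + ploidy - 1, ploidy)
    if code == ".":
        return None  # unspecified, may vary between site and sample
    return int(code)  # fixed count such as "1"


if __name__ == "__main__":
    for n_alt in (1, 9, 49):
        print(n_alt, n_values("R", n_alt), n_values("G", n_alt))
```

The G column grows quadratically with the number of alleles for diploid samples, which is the growth that local fields such as LPL are meant to avoid.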
The DRAGEN DNA Pipeline accelerates the secondary analysis of NGS data by harnessing the tremendous power available on the DRAGEN Platform. The pipeline includes highly optimized algorithms for mapping, aligning, sorting, duplicate marking, and haplotype variant calling. In addition to haplotype variant calling, the pipeline supports calling of copy number and structural variants as well as detection of repeat expansions and targeted calls.
DRAGEN FastQC is a tool for calculating common metrics used for quality control of high-throughput sequencing data. The tool is modeled after the metrics generated by Babraham Institute's FastQC tool.
The metrics are generated automatically on all DRAGEN map-align workflows with no additional run time and are output in a CSV format file called <PREFIX>.fastqc_metrics.csv. All metrics are calculated and reported separately for each mate-pair.
For users who are only interested in sample QC or who want FastQC results only, DRAGEN provides a mode to generate the fastqc_metrics.csv file directly.
By default DRAGEN FastQC and read-trimming are run as preprocessing steps to standard sequence alignment workflows. If DNA alignment is not needed or if QC results are needed more quickly, the mapping and BAM output portions of the workflow can be disabled. The workflow only outputs key metric files and runs ~70% faster. This option is available on the command-line by entering --fastqc-only=true after the DRAGEN command.
If FastQC runs stand-alone, then the license will not be consumed. If FastQC runs with map-align enabled, then the license will be consumed.
DRAGEN FastQC is a complete reimplementation of the original FastQC tool developed by the Babraham Institute (henceforth BI-FastQC). The reimplementation of FastQC in DRAGEN, however, has been modified to take advantage of the hardware-acceleration provided by the DRAGEN Field-Programmable Gate Array (FPGA) for a significant speed improvement. As such, there are some differences in how the values are calculated and the resulting metrics will not be exactly identical between the two tools. The most significant differences are described below.
Binning: BI-FastQC uses a customizable binning strategy with a default of 5bp bins, while DRAGEN uses an algorithmic binning strategy based on the Granularity setting described below. In general, this should mean that DRAGEN provides more precise results at default settings.
Outputs: BI-FastQC text output contains the same information as its plots in tabular format, while DRAGEN-FastQC outputs its raw data. For example, BI-FastQC both plots and outputs the average base quality per position, while DRAGEN outputs the average base quality by both position and nucleotide. This allows for a more detailed analysis of the data, but requires slightly more work to generate the associated plot.
Rounding: DRAGEN consistently rounds its calculations to the nearest integer, while the original FastQC uses a mixture of rounding and taking the mathematical floor, leading DRAGEN-FastQC to report marginally higher results for some metrics.
Smoothing: Both DRAGEN-FastQC and BI-FastQC utilize smoothing techniques for their distributions of %GC, to account for the fact that 151bp do not divide evenly into 100 percentile bins. However, to take advantage of the speed offered by the FPGA, DRAGEN utilizes a slightly different algorithm than BI-FastQC which results in slightly different results.
Memory constraints make it impossible to guarantee single-base resolution for all metrics. DRAGEN provides an algorithmic solution for binning via --fastqc-granularity. DRAGEN allocates 256 bins in memory for each size or position-based metric. A granularity value from 4 to 7 inclusive determines the bin size. Higher values use smaller bins for greater resolution. Lower values create larger bins to accommodate longer read lengths.
Granularity | Single Base Resolution (bp) | Resolution at 150 (bp) | Recommended Read-Lengths (bp) |
---|
If a value for --fastqc-granularity is not provided by the user, DRAGEN will attempt to estimate the read length of the input data and set the granularity accordingly.
To include metrics for adapter or other sequence content, DRAGEN FastQC needs to be provided with the desired sequences in FASTA format. DRAGEN provides two options for this purpose: --fastqc-adapter-file for adapter sequences and --fastqc-kmer-file for any additional kmers of interest, so that users can add sequences of interest without changing the expected adapter results.
DRAGEN FastQC can accept up to a combined total of 16 adapter and kmer sequences. Each sequence can be a maximum of 12 bp in length. By default, DRAGEN uses the adapter file located at <INSTALL_PATH>/config/adapter_sequences.fasta. The file contains the same adapter sequences as Babraham's FastQC v 0.11.10 and later:
Illumina Universal Adapter--AGATCGGAAGAG
Illumina Small RNA 3' Adapter--TGGAATTCTCGG
Illumina Small RNA 5' Adapter--GATCGTCGGACT
Nextera Transposase Sequence--CTGTCTCTTATA
The FastQC metrics are output in CSV format to a file in the run output directory called <PREFIX>.fastqc_metrics.csv.
The reported metrics are broken down into eight sections by metric type. Each section is broken down further into separate rows by either the length, position, or other relevant categorical variables. The following are the metric sections.
Read Mean Quality---Total number of reads with each average Phred-scale quality value, rounded to the nearest integer.
Positional Base Mean Quality---Average Phred-scale quality value of bases with a specific nucleotide and at a given location in the read. Locations are listed first and can be either specific positions or ranges. The nucleotide is listed second and can be A, C, G, or T. N or ambiguous bases are assumed to have the system default value, usually QV2.
Positional Base Content---Number of bases of each specific nucleotide at given locations in the read. Locations are given first and can be either specific positions or ranges. The nucleotide is listed second and can be A, C, G, T, N.
Read Lengths---Total number of reads with each observed length. Lengths can be either specific sizes or ranges, depending on settings specified using --fastqc-granularity.
Read GC Content---Total number of reads with each GC content percentile between 0 % and 100 %.
Read GC Content Quality---Average Phred-scale read mean quality for reads with each GC content percentile between 0% and 100%.
Sequence Positions---Number of times an adapter or other kmer sequence is found, starting at a given position in the input reads. Sequences are listed first in the metric description in quotes. Locations are listed second and can be either specific positions or ranges.
Positional Quality---Phred-scale quality value for bases at a given location and a given quantile of the distribution. Locations are listed first and can be either specific positions or ranges. Quantiles are listed second and can be any whole integer 0–100.
The following are example rows from each section.
The DRAGEN Somatic Pipeline allows ultrarapid analysis of Next-Generation Sequencing (NGS) data to identify cancer-associated mutations in somatic chromosomes. DRAGEN calls SNVs and indels from both matched tumor-normal pairs and tumor-only samples using a probability model that considers the possibility of somatic variants, germline variants, and various systematic noise artifacts. The model is informed by sample-specific nucleotide and indel noise patterns that are estimated from the data at runtime. When considering somatic variants, DRAGEN does not make any ploidy assumptions, which enables detection of low-frequency alleles. For loci with coverage up to 100x in the tumor sample, DRAGEN can detect variant allele frequencies down to approximately 5%. This limit scales with increasing depth on a per-locus basis. It is recommended to provide DRAGEN with a systematic noise file that contains position- and allele-specific noise frequencies as estimated from a panel of normal samples (see below); DRAGEN uses this noise file to filter calls that can be explained as resulting from position- and allele-specific noise. After multiple filtering steps, the output is generated as a VCF file. Variants that fail the filtering steps are kept in the output VCF. The variants include a FILTER annotation that indicates which filtering steps have failed.
For the tumor-normal pipeline, both samples are analyzed jointly. DRAGEN assumes that germline variants and systematic noise artifacts are shared by both samples, whereas somatic variants are present only in the tumor sample. Only somatic variants are reported. To detect systematic noise artifacts, DRAGEN recommends that the coverage in the normal sample be at least half of the coverage in the tumor sample.
The tumor-only pipeline produces output that contains both germline and somatic variants and can be further analyzed to identify tumor mutations. The caller does not attempt to distinguish between them: filtering out common germline variants as reported in databases is currently the most reliable way to remove germline variants. The tumor-only pipeline provides a germline tagging feature and requires this feature to be explicitly enabled or disabled. When germline tagging is enabled, variant annotation must also be enabled; DRAGEN then tags variants that are common in the gnomAD database as germline so that they can be filtered out if desired. The tumor-only pipeline also requires the presence of a systematic noise file by default. To run without germline tagging and/or systematic noise files, these options need to be disabled explicitly.
DRAGEN uses a Bayesian approach to compute the posterior probability that a somatic variant is present and reports this as a phred-scale quantity, "somatic quality" (SQ):
##FORMAT=<ID=SQ,Number=1,Type=Float,Description="Somatic quality">
DRAGEN scores variants by computing likelihoods for several hypotheses and noise processes, taking into account many factors such as: the numbers of alt-supporting and ref-supporting reads in the tumor and normal samples (and hence the alt allele frequencies in both samples); mapping qualities and how these are distributed across the reads in the tumor and normal pileups; basecall qualities; forward vs reverse strand support; sample-wide estimates of insertion and deletion error probabilities as functions of repeat period, repeat length, and indel length; sample-wide estimates of nucleotide error biases; whether there are nearby co-phased events; and whether the positions and alleles in question are known somatic hotspots or associated with sequence-specific error patterns. You can use SQ as the primary metric to describe the confidence with which the caller made a somatic call. SQ is reported as a format field for the tumor sample (exception: for homozygous reference calls in gvcf mode it is instead a likelihood ratio, analogous to homref GQ as described in the germline section). Variants with SQ score below the SQ filter threshold are filtered out using the weak_evidence
tag. To trade off sensitivity against specificity, adjust the SQ filter threshold. Lower thresholds produce a more sensitive caller and higher thresholds produce a more conservative caller. If performing tumor-normal analysis, the SQ field for the normal sample contains the Phred-scaled posterior probability that a putative call is a germline variant. The somatic caller does not test for diploid genotype candidates and does not output GQ or QUAL values.
If tumor SQ > vc-sq-call-threshold (default is 3 for tumor-normal and 0.1 for tumor-only), then the FORMAT/GT for the tumor sample is hard-coded to 0/1, and the FORMAT/AF yields an estimate of the somatic variant allele frequency, which can range anywhere within [0,1].
If the value for vc-sq-filter-threshold is lower than vc-sq-call-threshold, the filter threshold value is used instead of the call threshold value.
If tumor SQ < vc-sq-call-threshold, the variant is not emitted in the VCF.
If tumor SQ > vc-sq-call-threshold but tumor SQ < vc-sq-filter-threshold, the variant is emitted in the VCF, but FILTER=weak_evidence.
If tumor SQ > vc-sq-call-threshold and tumor SQ > vc-sq-filter-threshold, the variant is emitted in the VCF and FILTER=PASS (unless the variant is filtered by a different filter).
The default vc-sq-filter-threshold is 17.5 for tumor-normal and 3.0 for tumor-only analysis. The following is an example somatic T/N VCF record. Tumor SQ > vc-sq-call-threshold but tumor SQ < vc-sq-filter-threshold, so the FILTER is marked as weak_evidence.
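The threshold rules can be summarized in a short decision function. This is a sketch of the logic as described, not DRAGEN code: the function name and return labels are hypothetical, and treating an SQ exactly equal to a threshold as passing it is an assumption, since the rules above only use strict inequalities.

```python
def classify_sq(sq: float, call_threshold: float, filter_threshold: float) -> str:
    """Return 'not_emitted', 'weak_evidence' or 'PASS' for a tumor SQ score."""
    # A filter threshold below the call threshold replaces the call threshold.
    effective_call = min(call_threshold, filter_threshold)
    if sq < effective_call:
        return "not_emitted"
    if sq < filter_threshold:
        return "weak_evidence"
    # Other filters may still apply to the record; only the SQ rule is modelled.
    return "PASS"


if __name__ == "__main__":
    # Tumor-normal defaults: call threshold 3, filter threshold 17.5.
    for sq in (2.0, 10.0, 20.0):
        print(sq, classify_sq(sq, call_threshold=3.0, filter_threshold=17.5))
```

With the tumor-normal defaults, an SQ of 2 is not emitted, an SQ of 10 is emitted with weak_evidence, and an SQ of 20 passes the SQ filter.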
The clustered-events penalty is an exception to the above rule for emitting variants. By default, the clustered-events penalty replaces the (obsolete) clustered-events filter. Instead of applying a hard filter when too many events are clustered together, DRAGEN applies a penalty to the SQ scores of cophased clustered events. Clustered events with weak evidence are no longer called, but clustered events with strong evidence can still be called. This is equivalent to lowering the prior probability of observing clustered cophased variants. The penalty is applied after the decision to emit variants, so that penalized variants still appear in the VCF if their unpenalized score is high enough. Variants that are combined into an MNV via the --combine-phased-variants-distance
option are treated as a single variant for the purposes of the penalty. The penalty will not be applied to somatic hotspot variants. To disable the clustered-events penalty, set --vc-clustered-event-penalty=0
.
Please see the DRAGEN Recipe sections for recommended command lines in typical workflows. The following command line options are typically used for somatic small-variant calling:
--tumor-fastq1 and --tumor-fastq2
Inputs a pair of FASTQ files into the mapper aligner and somatic variant caller. You can use these options with other FASTQ options to run in tumor-normal mode. For example:
--tumor-fastq-list
Inputs a list of FASTQ files into the mapper aligner and somatic variant caller. You can use these options with other FASTQ options to run in tumor-normal mode. For example:
--tumor-bam-input
and --tumor-cram-input
Inputs a mapped BAM or CRAM file into the somatic variant caller. You can use these options with other BAM/CRAM options to run in tumor-normal mode.
--vc-sq-call-threshold
and --vc-sq-filter-threshold
These options control the thresholds for emitting calls in the VCF and applying the weak_evidence
filter tag (see above).
--vc-target-vaf
This option allows the user to adjust the allele frequencies of haplotypes that will be considered by the caller as potentially appearing in the sample. It is not a hard threshold, but the variant caller will aim to detect variants with allele frequencies larger than this setting. In the case of tumor-normal runs, the frequency is measured with respect to the full set of reads (tumor and normal combined). The default threshold of 0.03 was selected to be as low as possible without incurring an excessive false positive cost; a lower setting may increase sensitivity for low-frequency variants, but may increase false positives and runtime; a higher setting may reduce false positives. Setting the vc-target-vaf to 0 will result in all haplotypes with at least two supporting reads being taken into consideration.
--vc-somatic-hotspots
, --vc-use-somatic-hotspots
, and --vc-hotspot-log10-prior-boost
DRAGEN uses a hotspot VCF to indicate somatic mutations that are expected with increased frequency. The default hotspot file (automatically selected from <INSTALL_PATH>/resources/hotspots/somatic_hotspots_*
based on the reference) is mostly based on the Memorial Sloan Kettering Cancer Center (MSKCC) published hotspots and positions in COSMIC with population allele counts (AC) >= 50. It is somewhat conservative and boosts only a few thousand positions. You can specify a custom hotspot file via the --vc-somatic-hotspots
option (note: input VCF records must be sorted in the same order as contigs in the selected reference) or disable the hotspots feature with vc-use-somatic-hotspots=false
. The effect of the hotspot file is that the prior probability for hotspot variants is boosted by a factor, up to a maximum prior of 0.5. An SNV is considered to match a hotspot variant only if the allele in question is identical, whereas insertions or deletions are considered to match any insertion/deletion allele respectively. You can use vc-hotspot-log10-prior-boost
to control the size of the adjustment. The default value is 4 (log10 scale) corresponding to an increase of 40 phred, and reducing this value will result in a smaller adjustment.
vc-systematic-noise
This option allows the user to specify the systematic noise file. To run without a systematic noise file (not recommended), specify vc-systematic-noise=NONE
.
--vc-combine-phased-variants-distance
This option is the same as in the germline variant caller (see "Combine Phased Variants" in the germline small-variant caller section).
vc-skip-germline-tagging=true
This option disables the germline tagging feature in the tumor-only pipeline (not recommended).
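The arithmetic behind the hotspot prior boost described above can be sketched as follows. This assumes that "boosted by a factor" means multiplying the prior by 10 raised to the boost value, which is consistent with a boost of 4 corresponding to 40 phred. The function names are hypothetical, and leaving a prior that already exceeds the cap unchanged is an assumption.

```python
def boosted_hotspot_prior(prior, log10_boost=4.0, max_prior=0.5):
    """Multiply a prior probability by 10**log10_boost, capped at max_prior."""
    if not 0.0 <= prior <= 1.0:
        raise ValueError("prior must be a probability in [0, 1]")
    if prior >= max_prior:
        # Assumption: the boost never lowers a prior that is already high.
        return prior
    return min(prior * 10.0 ** log10_boost, max_prior)


def boost_in_phred(log10_boost=4.0):
    """Express a log10-scale boost on the phred scale (10 phred per log10 unit)."""
    return 10.0 * log10_boost


if __name__ == "__main__":
    print(boosted_hotspot_prior(1e-6))  # boosted by four orders of magnitude
    print(boosted_hotspot_prior(1e-3))  # hits the 0.5 cap
    print(boost_in_phred(4.0))
```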
In a tumor-normal analysis, DRAGEN accounts for tumor-in-normal (TiN) contamination by running liquid tumor mode. Liquid tumor mode is disabled by default, but we recommend enabling it with --vc-enable-liquid-tumor-mode=true
if TiN contamination is expected. When liquid tumor mode is enabled, DRAGEN is able to call variants in the presence of TiN contamination up to a specified maximum tolerance level (default: 0.15). If using the default maximum contamination TiN tolerance, somatic variants are expected to be observed in the normal sample with allele frequencies up to 15% of the corresponding allele in the tumor sample. vc-tin-contam-tolerance
enables liquid tumor mode and allows you to set the maximum contamination TiN tolerance.
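As a rough illustration of what the tolerance means, the sketch below computes the highest normal-sample allele frequency that would still be attributed to TiN contamination for a given tumor allele frequency. DRAGEN's actual scoring is probabilistic; this only reproduces the "up to 15% of the tumor allele frequency" arithmetic, and the function names are hypothetical.

```python
def max_expected_normal_af(tumor_af, tin_tolerance=0.15):
    """Upper bound on the normal-sample AF explained by TiN contamination."""
    return tumor_af * tin_tolerance


def within_tin_tolerance(tumor_af, normal_af, tin_tolerance=0.15):
    """True if the normal-sample AF is at or below the contamination bound."""
    return normal_af <= max_expected_normal_af(tumor_af, tin_tolerance)


if __name__ == "__main__":
    # With a tumor AF of 0.40 and the default tolerance, the bound is about 0.06.
    print(max_expected_normal_af(0.40))
    print(within_tin_tolerance(0.40, 0.05))
    print(within_tin_tolerance(0.40, 0.10))
```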
Liquid tumor mode is not equivalent to liquid biopsy. Liquid tumors in liquid tumor mode refer to hematological cancers, such as leukemia. For liquid tumors, it is not feasible to use blood as a normal control because the tumor is present in the blood. Skin or saliva is typically used as the normal sample. However, skin and saliva samples can still contain blood cells, so that the matched normal control sample contains some traces of the tumor sample and somatic variants are observed at low frequencies in the normal sample. If the contamination is not accounted for, it can severely impact sensitivity by suppressing true somatic variants.
Liquid tumor mode typically uses a WGS or WES library with medium depth (for example, 100x tumor / 40x normal), and the lowest VAF detected at these depths is ~5%. Liquid biopsy typically uses a targeted gene panel (e.g., 500 genes) with very high raw depth, and uses UMI indexing (collapsing down to a depth of >2000x) to enable sensitivity at VAF down to 0.1% in some cases (the limit of detection varies depending on coverage and data quality).
If using different sequencing systems or different library preparation methods for tumor and normal samples, we recommend setting --vc-override-tumor-pcr-params-with-normal=false
. In tumor-normal mode, DRAGEN estimates a set of PCR error parameters separately for each of the tumor and normal samples. By default, DRAGEN ignores the tumor-sample parameters and uses normal-sample parameters for analysis of both samples. This default prevents overestimation of tumor-sample error rates that can occur if the somatic variant rate is high.
Allele frequency and related settings
There is no hard limit on the allele frequencies at which DRAGEN can report calls, but there are a number of points in the pipeline where low allele frequency can affect calling. The vc-target-vaf
setting affects the threshold used to detect candidate haplotypes during localized haplotype assembly, but does not affect variant scoring. Once a candidate haplotype is detected, all putative variants appearing in the haplotype are scored and calls scoring above the SQ call threshold are emitted regardless of the allele frequency or the number of supporting reads.
The probability calculation in the somatic caller assesses variant and noise hypotheses at fixed allele frequencies defined by a discrete grid (by default at coverages <200: 0, 0.05, 0.1, ... 1.0). This means that the calculation will assess variants with allele frequencies below 0.05 as if the true frequency is equal to 0.05; this strategy does not preclude such variants from being called but may result in lower scores compared to if the true frequency had been considered. At positions with higher coverage, DRAGEN adds extra grid points as in the table below in order to consider hypotheses involving lower allele frequencies and effectively achieve a lower limit of detection (LOD), with the lowest VAF halving every time the coverage doubles:
If calls below a certain VAF are not of interest, you can use --vc-enable-af-filter
(see Post Somatic Calling Filtering below) to apply a hard filter on VAF.
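The halving rule for the lowest grid point can be sketched as below. The coverage at which the first extra grid point is added is not stated in the text above, so `base_coverage=200` is an assumption taken from the "<200" default-grid boundary, and the function name is hypothetical. The actual grid values used by DRAGEN may differ.

```python
def lowest_grid_vaf(coverage, base_vaf=0.05, base_coverage=200):
    """Lowest non-zero allele frequency on the grid at a given coverage.

    Below base_coverage the lowest point is base_vaf; it then halves each
    time the coverage doubles (assumed to start at base_coverage).
    """
    lowest = base_vaf
    threshold = base_coverage
    while coverage >= threshold:
        lowest /= 2.0
        threshold *= 2
    return lowest


if __name__ == "__main__":
    for cov in (100, 200, 400, 800, 1600):
        print(cov, lowest_grid_vaf(cov))
```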
DRAGEN can compensate for oxidation and deamination artifacts that might exist upstream of the sequencing system, and are common in FFPE samples. DRAGEN does this by estimating nucleotide mutation biases on a per sample basis, taking account of read orientation. During variant calling, DRAGEN then corrects for nucleotide substitution biases by combining the estimated parameters with the basecall quality scores, thus modifying the nucleotide error rates used by the hidden Markov model.
Nucleotide (NTD) Error Bias Estimation is on by default and recommended as a replacement for the orientation bias filter. Both methods take account of strand-specific biases (systematic differences between F1R2 and F2R1 reads). In addition, NTD error estimation accounts for non-strand-specific biases such as sample-wide elevation of a certain SNV type, e.g. C->T or any other transition or transversion. NTD error estimation can also capture these biases in a trinucleotide context, e.g. in the case of C->T it will break down the counts as ACA->ATA, CCA->CTA, GCA->GTA, TCA->TTA, etc.
This feature can be disabled by specifying --vc-enable-unequal-ntd-errors=false
or set to auto-detect by specifying --vc-enable-unequal-ntd-errors=auto
. In auto-detect mode, DRAGEN will run the estimation but then disable the use of the estimated parameters if it determines that the sample does not exhibit nucleotide error bias. When the feature is enabled, DRAGEN will by default estimate a smaller set of parameters in a monomer context. To estimate a larger set of parameters in a trimer context (recommended on sufficiently large panels when coverage is above 1000X), specify --vc-enable-trimer-context=true
.
To specify the regions from which to estimate nucleotide substitution biases, use --vc-snp-error-cal-bed
. Alternatively, if --vc-target-bed
is used to specify the target regions for variant calling, and the total bed regions are sufficiently small (maximum 4 megabases), --vc-snp-error-cal-bed
can be omitted and DRAGEN will use the target bed file for bias estimation. Otherwise, DRAGEN will use a default bed file selected to match the reference, and covering a mixture of coding and non-coding regions.
DRAGEN requires a panel size of at least 150kbp to correctly estimate nucleotide mutation biases when using trimer context, or at least 10kbp when using monomer context. If this requirement is not met for trimer context, DRAGEN falls back on the monomer model, and if it is not met for monomer context, DRAGEN turns the bias estimation feature off.
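The fallback behavior described above can be summarized in a few lines. This is a sketch of the decision only; the function name and return labels are hypothetical, and treating a panel of exactly the minimum size as sufficient follows the "at least" wording.

```python
TRIMER_MIN_BP = 150_000   # minimum panel size for trimer context (150 kbp)
MONOMER_MIN_BP = 10_000   # minimum panel size for monomer context (10 kbp)


def select_ntd_model(panel_size_bp, trimer_requested=False):
    """Return 'trimer', 'monomer', or 'off' for a given panel size in bp."""
    if trimer_requested and panel_size_bp >= TRIMER_MIN_BP:
        return "trimer"
    # Trimer not requested, or the panel is too small: try the monomer model.
    if panel_size_bp >= MONOMER_MIN_BP:
        return "monomer"
    # Too small even for the monomer model: bias estimation is turned off.
    return "off"


if __name__ == "__main__":
    print(select_ntd_model(500_000, trimer_requested=True))
    print(select_ntd_model(50_000, trimer_requested=True))
    print(select_ntd_model(5_000))
```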
DRAGEN provides two specialized UMI-aware variant calling pipelines for running from UMI-collapsed reads. These pipelines are optimized to take account of the increased read and basecall qualities that are typical in simplex- and duplex-collapsed reads. Both pipelines are disabled by default; when running with UMI collapsing enabled (--enable-umi true
) or when running from UMI-collapsed bams, enable UMI-aware variant calling by setting one of the following options to true:
--vc-enable-umi-solid
The VC UMI solid mode is optimized for solid tumors with post-collapsed coverage rates of ~200-300X and target allele frequencies of 5% and higher.
--vc-enable-umi-liquid
The liquid biopsy pipeline is not equivalent to liquid tumor mode (see above). The liquid biopsy pipeline starts from a regular blood sample and looks for low VAF somatic variants from tumor cell free DNA floating in the blood. This type of test enables tumor profiling (diagnosis/biomarker identification) from plasma rather than from tissue, which requires an invasive biopsy. The VC UMI liquid mode is optimized for a liquid biopsy pipeline with post collapsed coverage rates of >2000X and target allele frequencies of 0.1% and higher.
If a third-party tool is used to produce the collapsed reads, then configure the tool so that the base call quality scores quantify the error produced by the sequencing system only. DRAGEN uses Sample-specific NTD Error Bias Estimation (see above) to account for errors upstream of the sequencing system, so such errors should not be included in base call quality scores.
You can output a gVCF file for tumor-only data sets. A gVCF file reports information on every position of the input genome, including homozygous reference (homref) positions, i.e. positions where no alt allele (either germline or somatic) is present. DRAGEN creates a new <NON_REF> allele, to which reads that do not support the reference allele or any reported variant allele are assigned. In tumors, variants could exist at arbitrarily low allele frequencies and be undetectable. Thus, a somatic homref call cannot guarantee that no somatic variant at any allele frequency exists at the position. Instead, DRAGEN considers a position to be a homozygous reference if there are no somatic variants with an allele frequency at or above the limit of detection (LOD). Whereas the SQ score for an ordinary alt allele is a phred-scale posterior probability, the SQ score for the <NON_REF> allele is a phred-scale ratio between the likelihood of a homref call and the likelihood of a variant call with allele frequency at the LOD (if an alt allele is also reported, the <NON_REF> SQ score is capped at the complement of the posterior probability for the alt allele). If the LOD value is lowered, fewer homref calls are made. If the LOD value is increased, more homref calls are made.
By default the LOD is set to 5%, but you can enter a different value using the --vc-gvcf-homref-lod
option.
DRAGEN can add a number of filters by populating the FILTER column in the vcf. The output is provided in the *.hard-filtered.vcf.gz
output file (note: the *.vcf.gz
output file without "hard-filtered" in the filename differs only in that the filter column is unpopulated; the file is produced for historical reasons but is to be deprecated).
Options
The following options are available for post somatic calling filtering:
--vc-sq-call-threshold
Emits calls in the VCF. The default is 3.0 for tumor-normal and 0.1 for tumor-only. If the value for vc-sq-filter-threshold
is lower than vc-sq-call-threshold
, the filter threshold value is used instead of the call threshold value.
--vc-sq-filter-threshold
Marks emitted VCF calls as filtered. The default is 17.5 for tumor-normal and 3.0 for tumor-only.
--vc-enable-triallelic-filter
Enables the multiallelic filter. The default is true. This filter will not be applied to somatic hotspot variants.
--vc-enable-non-primary-allelic-filter
Similar to the triallelic filter, but filters less aggressively. At each multiallelic position, keeps the allele with the highest alt AD and filters only the rest. The default is false. This filter will not be applied to somatic hotspot variants. Cannot be enabled when the triallelic filter is also on.
--vc-enable-af-filter
Enables the allele frequency filter for nuclear chromosomes. The default value is false. When set to true, variants with an allele frequency below the AF call threshold are excluded from the VCF, and variants with an allele frequency below the AF filter threshold are emitted but tagged with the low AF filter tag. The default AF call threshold is 1% and the default AF filter threshold is 5%. To change the threshold values, use the vc-af-call-threshold
and vc-af-filter-threshold
command-line options. Please use vc-enable-af-filter-mito
and corresponding threshold options for mitochondrial allele frequency filtering.
--vc-enable-non-homref-normal-filter
Enables the non-homref normal filter. The default value is true. When set to true, the VCF filters out variants if the normal sample genotype is not a homozygous reference.
--vc-enable-vaf-ratio-filter
Adds one condition to be filtered out by the alt_allele_in_normal filter. The default value is false. When set to true, the VCF filters out variants if the normal sample AF is greater than 20% of tumor sample AF.
--vc-depth-filter-threshold
Filters all somatic variants (alt or homref) with a depth below this threshold. The default value is 0 (no filtering).
vc-homref-depth-filter-threshold
In gvcf mode, filters all somatic homref variants with a depth below this threshold. The default value is 3.
vc-depth-annotation-threshold
Filters all non-PASS somatic alt variants with a depth below this threshold. The default value is 0 (no filtering).
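Two of the threshold-based options above, the allele frequency filter and the VAF ratio filter, can be illustrated with a short sketch. It reads the AF filter as "below the call threshold: excluded; below the filter threshold: tagged", and uses the default values quoted above. The function names and return labels are hypothetical and are not DRAGEN filter IDs.

```python
def apply_af_filter(af, call_threshold=0.01, filter_threshold=0.05):
    """Classify a variant by allele frequency (nuclear chromosomes)."""
    if af < call_threshold:
        return "excluded"        # not written to the VCF
    if af < filter_threshold:
        return "low_af_tagged"   # written, but marked with the low AF tag
    return "unchanged"


def fails_vaf_ratio(tumor_af, normal_af, max_ratio=0.20):
    """True if the normal-sample AF exceeds 20% of the tumor-sample AF."""
    return normal_af > max_ratio * tumor_af


if __name__ == "__main__":
    for af in (0.005, 0.03, 0.10):
        print(af, apply_af_filter(af))
    print(fails_vaf_ratio(tumor_af=0.5, normal_af=0.2))
    print(fails_vaf_ratio(tumor_af=0.5, normal_af=0.05))
```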
Filters
The DRAGEN systematic noise filter significantly improves somatic variant calling precision, especially in tumor-only mode. DRAGEN enforces its use in the tumor-only pipeline by refusing to start a run without a noise file (this option can explicitly be disabled). This filter tackles noise that consistently appears at specific locations in the reference genome. This noise can arise from:
Mis-mapping in low-complexity regions: Repetitive sequences with low information content can lead to reads mapping to incorrect locations.
PCR noise in homopolymer regions: Regions with long stretches of the same nucleotide (e.g., AAAAA) can introduce errors during PCR amplification.
The systematic noise filter offers a significant improvement over the older "panel of normals" method. While the panel of normals simply excluded specific positions, the new filter employs a statistical model. This model compares the variant and its allele frequency (AF) to the noise level associated with that specific position and allele in the reference genome. This allows for a more nuanced filtering approach, reducing false positives without discarding potentially valid variants.
Note that the systematic noise filter specifically aims to remove noise, while the option --vc-enable-germline-tagging
is used for identifying germline variants. The systematic noise filter is not recommended for germline admixture datasets, where tumor-normal pairs are simulated by combining germline samples from two different individuals. This is because such datasets contain (simulated) somatic variants at germline variant positions, and those positions may be present in the noise files with the result that desired variants are filtered out.
Newer versions of the systematic noise file include two columns, one for the "mean" noise and one for the "max" noise. The noise file header specifies a "##NoiseMethod", which identifies the column that is used by default during variant calling. For UMI, panel, and WES data it is recommended to use the "mean" noise, and for WGS it is recommended to use the "max" noise.
Prebuilt systematic noise files are available for download (see below), but when possible, it is recommended to build custom noise files from a panel of normal samples sequenced locally. This will ensure that the noise file is specific to the library preparation, sequencing system, and panel in use. Building your own noise file is especially helpful for clean UMI samples that tend to have less noise than WGS/WES samples. To generate a noise file it is recommended to use approximately 20-50 normal samples, although fewer normal samples (1-10) can still be used to generate useful noise files.
The systematic noise filter is used in the DRAGEN tumor-only or tumor-normal pipeline by adding the following commands:
Somatic Systematic Noise Baseline Collection v2.0.0 noise files were generated with V4.3 and for the first time include allele specific information. Each v2.0.0 noise file includes both "mean" and "max" noise in separate columns. A header line "##NoiseMethod=mean/max" specifies which noise column will be used by default.
Noise files generated with V4.3 contain extra columns and are not compatible with earlier versions. Older noise files are still supported in the current version of DRAGEN as per the table below.
The default WES and WGS noise files were generated using a combination of Nextera and TruSeq samples (with and without PCR). There are also hg38 WGS HEME and FFPE specific noise files.
The BaseSpace Sequence Hub DRAGEN CNV Baseline Builder App can be used to build SNV and CNV noise files in the cloud. Alternatively, the following DRAGEN command lines can be used to generate the noise files locally:
First run DRAGEN somatic tumor-only on each of approximately 20-50 normal samples using the following command:
Once the normal samples have completed, collect the normal VCFs in the VCF_LIST file (one vcf per line) and use DRAGEN to generate the systematic noise file:
Running the filter during somatic variant calling:
Running the tumor-only pipeline on the normals:
Building the noise file:
When enabling DRAGEN for tumor-only somatic calling, potential germline variants can be tagged in the INFO field with 'GermlineStatus' using population databases. Current databases include 1KG and both exome and genome sequencing data from gnomAD. The following options are available for this feature:
--vc-enable-germline-tagging
Enables germline tagging. The default is 'false'. When this is set to 'true', you must also set the following annotation-related parameters:
--enable-variant-annotation=true
--variant-annotation-data
Nirvana annotation database (Downloadable at https://support.illumina.com/content/dam/illumina-support/help/Illumina_DRAGEN_Bio_IT_Platform_v3_7_1000000141465/Content/SW/Informatics/Dragen/Nirvana_DownloadData_fDG.htm)
--variant-annotation-assembly
The genome build, GRCh37 or GRCh38
Additional options to control how to define germline variants.
--germline-tagging-db-threshold
The minimum alternative allele count across population databases for a variant to be defined as germline (default=50).
--germline-tagging-pop-af-threshold
The minimum population allele frequency for a variant to be defined as germline. Once specified, this will override the input from --germline-tagging-db-threshold.
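The interaction between the two germline tagging thresholds can be sketched as follows. The function and argument names are hypothetical, and the sketch assumes that a variant at exactly the threshold qualifies, in line with the "minimum" wording above.

```python
def is_tagged_germline(pop_allele_count, pop_allele_freq,
                       db_threshold=50, pop_af_threshold=None):
    """Decide whether a variant would be treated as germline.

    If a population AF threshold is given, it overrides the allele-count
    threshold; otherwise the allele count across databases is used.
    """
    if pop_af_threshold is not None:
        return pop_allele_freq >= pop_af_threshold
    return pop_allele_count >= db_threshold


if __name__ == "__main__":
    # Allele-count rule (default threshold of 50).
    print(is_tagged_germline(pop_allele_count=120, pop_allele_freq=0.0004))
    # The AF rule overrides the count rule once it is specified.
    print(is_tagged_germline(120, 0.0004, pop_af_threshold=0.001))
```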
When enabling DRAGEN for tumor-only or tumor-normal pipelines with Nirvana Annotation, the Nirvana JSON output can be converted into a Mutation Annotation Format (MAF) file. The MAF file is a tab-separated values file containing aggregated mutation information and will be saved to the output directory that you specify. You can enable MAF conversion directly as part of the somatic small variant calling workflow (integrated mode) or separately by providing a path to a VCF file or annotated JSON file (standalone mode).
When running MAF conversion as part of the somatic small variant calling workflow, the following options are required for this feature:
Annotation options:
--enable-variant-annotation=true
Enable variant annotation
--variant-annotation-data
Nirvana annotation database (Downloadable at https://support.illumina.com/content/dam/illumina-support/help/Illumina_DRAGEN_Bio_IT_Platform_v3_7_1000000141465/Content/SW/Informatics/Dragen/Nirvana_DownloadData_fDG.htm)
--variant-annotation-assembly
Genome build, GRCh37 or GRCh38
MAF conversion options:
--enable-maf-output=true
Enable MAF output
--maf-transcript-source
Desired transcript source, RefSeq or Ensembl
Additional standalone options (when running without the variant caller):
--maf-input-vcf
Input VCF with the following form: <path>/<file_name>.hard-filtered.vcf.gz
--maf-input-json
Input JSON with the following form: <path>/<file_name>.hard-filtered.annotated.json.gz
Please note that when specifying standalone mode with VCF input, you must also enable annotation options to generate the JSON file. Conversely, annotation options should not be specified when running standalone mode with an input annotated JSON file.
Optional options:
--maf-include-non-pass-variants
Enabling this option will output all variants, including non-PASS variants, in the MAF output file.
By default, the MAF output contains only variants that have the PASS filter in the hard-filtered VCF file.
Example command lines:
MAF output from BAM input and variant caller:
MAF output from output directory and output file prefix, where the output directory contains a VCF file prefixed by the output file prefix:
MAF output from source VCF file:
Note: This command line will output the MAF file in the same location as the input VCF file. To specify a directory for output, add --output-dir
and --output-file-prefix
options.
MAF output from source annotated JSON file:
Note: This command line will output the MAF file in the same location as the input annotated JSON file. To specify a directory for output, add the --output-dir
and --output-file-prefix
options.
The filtering step identifies de novo variant calls from the joint calling workflow that lie in regions with ploidy changes. Since de novo calling can have reduced specificity in regions where at least one of the pedigree members shows non-diploid genotypes, the de novo variant filtering marks the relevant variants and thus can improve the specificity of the call set.
Based on the structural and copy number variant calls of the pedigree, the FORMAT/DN field in the proband is changed from the original DeNovo value to DeNovoSV or DeNovoCNV if the de novo variant overlaps with a ploidy-changing SV or CNV, respectively. All other variant details remain unchanged, and all variants of the input VCF will also be present in the filtered output VCF. Structural or copy number variants which result in no change of ploidy, such as inversions, are not considered in the filtering. As an example, a de novo SNV call in the input VCF
Overlapping with an SV duplication in the proband, mother or father would be represented in the filtered output VCF as follows:
The following is an example command line for running the de novo filtering, based on the files returned by the joint calling workflows:
The following options are used for de novo variant filtering:
--dn-input-vcf
---Joint small variant VCF from the de novo calling step to be filtered.
--dn-output-vcf
---File location to which the filtered VCF should be written. If not specified, the input VCF is overwritten.
--dn-sv-vcf
---Joint structural variant VCF from the SV calling step. If omitted, checks with overlapping structural variants are skipped.
--dn-cnv-vcf
---Joint copy number variant VCF from the CNV calling step. If omitted, checks with overlapping copy number variants are skipped.
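The relabeling of the FORMAT/DN field described above can be sketched as an interval-overlap check. This is a simplified illustration: it assumes all intervals are on the same contig as the variant, use 1-based inclusive coordinates, and already contain only ploidy-changing events. The order in which SV and CNV overlaps are checked is also an assumption, and the function names are hypothetical.

```python
def overlaps(pos, intervals):
    """True if pos falls inside any (start, end) interval, inclusive."""
    return any(start <= pos <= end for start, end in intervals)


def filtered_dn_value(dn_value, pos, sv_intervals=(), cnv_intervals=()):
    """Return the FORMAT/DN value after de novo filtering.

    Only records originally marked 'DeNovo' are relabeled; every other
    value is passed through unchanged.
    """
    if dn_value != "DeNovo":
        return dn_value
    if overlaps(pos, sv_intervals):
        return "DeNovoSV"
    if overlaps(pos, cnv_intervals):
        return "DeNovoCNV"
    return dn_value


if __name__ == "__main__":
    duplications = [(1_000, 5_000)]   # hypothetical SV duplication
    cnv_losses = [(20_000, 30_000)]   # hypothetical CNV deletion
    print(filtered_dn_value("DeNovo", 2_500, duplications, cnv_losses))
    print(filtered_dn_value("DeNovo", 25_000, duplications, cnv_losses))
    print(filtered_dn_value("DeNovo", 50_000, duplications, cnv_losses))
```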
DRAGEN provides post-VCF variant filtering based on annotations present in the VCF records. Default and non-default variant hard filtering are described below. However, because DRAGEN's algorithms incorporate the hypothesis of correlated errors within the core of the variant caller, the pipeline has improved capabilities in distinguishing true variants from noise, and the dependency on post-VCF filtering is therefore substantially reduced. For this reason, the default post-VCF filtering in DRAGEN is very simple.
The default filters in the germline pipeline are as follows:
##FILTER=<ID=DRAGENSnpHardQUAL,Description="Set if true:QUAL < 10.41 (3 when ML recalibration is enabled)">
##FILTER=<ID=DRAGENIndelHardQUAL,Description="Set if true:QUAL < 7.83 (3 when ML recalibration is enabled)">
##FILTER=<ID=LowDepth,Description="Set if true:DP <= 1">
##FILTER=<ID=PloidyConflict,Description="Genotype call from variant caller not consistent with chromosome ploidy">
DRAGENSnpHardQUAL and DRAGENIndelHardQUAL: For all contigs other than the mitochondrial contig, the default hard filtering consists of thresholding the QUAL value only. Different default QUAL threshold values are applied to SNPs and indels.
LowDepth: This filter is applied to all variant calls with INFO/DP <= 1.
PloidyConflict: This filter is applied to all variant calls on chrY of a female subject, if female is specified on the DRAGEN command line, or if female is detected by the ploidy estimator.
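The QUAL and depth rules above can be expressed as a short sketch for non-mitochondrial contigs. The thresholds are taken from the FILTER header lines. The function name is hypothetical, and the PloidyConflict filter is left out because it depends on the sample sex rather than on the record annotations.

```python
def default_germline_filters(is_snp, qual, depth, ml_recalibration=False):
    """Return the list of default hard filters a call would receive."""
    if ml_recalibration:
        qual_threshold = 3.0          # same value for SNPs and indels
    else:
        qual_threshold = 10.41 if is_snp else 7.83

    filters = []
    if qual < qual_threshold:
        filters.append("DRAGENSnpHardQUAL" if is_snp else "DRAGENIndelHardQUAL")
    if depth <= 1:
        filters.append("LowDepth")
    return filters or ["PASS"]


if __name__ == "__main__":
    print(default_germline_filters(is_snp=True, qual=9.0, depth=30))
    print(default_germline_filters(is_snp=False, qual=9.0, depth=30))
    print(default_germline_filters(is_snp=False, qual=50.0, depth=1))
```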
DRAGEN supports basic filtering of variant calls as described in the VCF standard. You can apply any number of filters with the --vc-hard-filter
option, which takes a semicolon-delimited list of expressions, as follows:
where the list of criteria is itself a list of expressions, delimited by the || (OR) operator in this format:
The meaning of these expression elements is as follows:
filterID---The name of the filter, which is entered in the FILTER column of the VCF file for calls that are filtered by that expression.
snp/indel/all---The subset of variant calls to which the expression should be applied.
annotation ID---The variant call record annotation for which values should be checked for the filter. Supported annotations include FS, MQ, MQRankSum, QD, and ReadPosRankSum.
comparison operator---The numeric comparison operator to use for comparing to the specified filter value. Supported operators include <, ≤, =, ≠, ≥, and >.
For example, the following expression would mark with the label "SNP filter" any SNPs with FS < 2.1 or with MQ < 100, and would mark with "indel filter" any records with FS < 2.2 or with MQ < 110:
This example is for illustration purposes only and is NOT recommended for use with DRAGEN V3 output. Illumina recommends using the default hard filters. The only supported operation for combining value comparisons is OR, and there is no support for arithmetic combinations of multiple annotations. More complex expressions may be supported in the future.
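The OR-only semantics of the filter expressions can be illustrated with a small evaluator. The tuple-based representation below is a hypothetical stand-in and is not the `--vc-hard-filter` string syntax. The operators are written in ASCII (`<=`, `!=`, `>=`), and the sketch assumes that an annotation missing from a record never triggers a criterion and that "indel filter" applies to the indel subset.

```python
import operator

OPERATORS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    "!=": operator.ne, ">=": operator.ge, ">": operator.gt,
}


def matches_any(record, criteria):
    """True if at least one (annotation, op, value) criterion holds (OR)."""
    for annotation, op, value in criteria:
        if annotation in record and OPERATORS[op](record[annotation], value):
            return True
    return False


def apply_filters(record, variant_type, filters):
    """Return the filter IDs that apply to a record of the given type."""
    return [
        filter_id
        for filter_id, subset, criteria in filters
        if subset in (variant_type, "all") and matches_any(record, criteria)
    ]


# The illustrative thresholds from the text (not recommended settings).
EXAMPLE_FILTERS = [
    ("SNP filter", "snp", [("FS", "<", 2.1), ("MQ", "<", 100)]),
    ("indel filter", "indel", [("FS", "<", 2.2), ("MQ", "<", 110)]),
]

if __name__ == "__main__":
    print(apply_filters({"FS": 3.0, "MQ": 90}, "snp", EXAMPLE_FILTERS))
    print(apply_filters({"FS": 3.0, "MQ": 120}, "indel", EXAMPLE_FILTERS))
```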
The orientation bias filter is designed to reduce noise typically associated with the following:
Pre-adapter artifacts introduced during genomic library preparation (e.g., a combination of heat, shearing, and metal contaminants can result in 8-oxoguanine base pairing with either cytosine or adenine, ultimately leading to G→T transversion mutations during PCR amplification), or
FFPE (formalin-fixed paraffin-embedded) artifact. FFPE artifacts stem from formaldehyde deamination of cytosines, which results in C to T transition mutations. The orientation bias filter can only be used on somatic pipelines. To enable the filter, set the --vc-enable-orientation-bias-filter
option to true. The default is false.
The artifact type to be filtered can be specified with the --vc-orientation-bias-filter-artifacts
option. The default is C/T,G/T, which corresponds to FFPE and OxoG artifacts, respectively. Valid values include C/T, or G/T, or C/T,G/T,C/A.
An artifact (or an artifact and its reverse complement) cannot be listed twice. For example, C/T,G/A is not valid, because C→T and G→A are reverse complements.
The orientation bias filter adds the following information:
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=OBC,Number=1,Type=String,Description="Orientation Bias Filter base context">
##FORMAT=<ID=OBPa,Number=1,Type=String,Description="Orientation Bias prior for artifact">
##FORMAT=<ID=OBParc,Number=1,Type=String,Description="Orientation Bias prior for reverse compliment artifact">
##FORMAT=<ID=OBPsnp,Number=1,Type=String,Description="Orientation Bias prior for real variant">
Please note that the OBF filter runs as a standalone process after DRAGEN is complete. The VC metrics that are computed as part of DRAGEN SNV caller will not be updated and will not reflect the additional variants that are filtered in this stage.
Run Pedigree Mode for Small Variant Caller.
Run Pedigree Mode for Copy Number Caller.
Run Pedigree Mode for Structural Variant Caller.
Run DeNovo Variant Small Variant Filtering.
The tool for population-based analysis is the iterative gVCF Genotyper. Its input is a set of single-sample or multisample gVCFs. The output is a multisample VCF that contains one entry for any variant seen in any of the input gVCFs. The variants are genotyped across all input samples using information from the hom-ref blocks as necessary. The iterative gVCF Genotyper does not adjust genotypes based on population information, but it provides means to filter variant sites based on information leveraged from the population. The following command line options are among those available:
--gg-hard-filter
Specifies a filtering expression to be applied to the output msVCF records. See below. The default is to apply no filters.
--gg-msvcf-format-fields
Can be used to override the default set of sample genotype fields in the output msVCF. See below.
--gg-msvcf-info-fields
Can be used to override the default set of site-wise INFO fields in the output msVCF. See below.
--gg-output-type
Set to spvcf to write the output in spVCF format rather than the default msVCF. See below for details.
--gg-diploidify
In the output msVCF file, convert haploid calls to diploid. The diploidified genotype is homozygous for the allele in the haploid call, e.g. 1 becomes 1/1. The LPL field is also diploidified for these samples. Site metrics, such as allele counts, are calculated before diploidification. Diploidifying genotypes may ease the ingestion of msVCF files into downstream analysis tools, such as Hail and PLINK. When this option is enabled, it is possible to include the DF FORMAT field (included by default), which signifies whether or not a genotype has been diploidified; see below.
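The genotype conversion described for --gg-diploidify can be illustrated with a minimal Python sketch. The function name `diploidify` is hypothetical and this is not DRAGEN's implementation; it only mirrors the stated rule that a haploid call such as 1 becomes 1/1 and that DF records whether the call was originally haploid. LPL handling is not covered.

```python
from typing import Tuple


def diploidify(gt: str) -> Tuple[str, int]:
    """Return (genotype, DF) for one sample genotype string.

    DF is 1 when the input was haploid and has been duplicated into a
    homozygous diploid call, and 0 when the input was already diploid.
    """
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) == 1:
        # Haploid call: repeat the single allele, e.g. "1" -> "1/1".
        return f"{alleles[0]}/{alleles[0]}", 1
    # Diploid calls are passed through unchanged.
    return gt, 0


if __name__ == "__main__":
    for call in ("1", "0", "0/1"):
        print(call, "->", diploidify(call))
```

Because site metrics are calculated before diploidification, a conversion like this would leave allele counts such as AC and AN unaffected.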
This approach is also referred to as local alleles and is also used by open source software such as and .
The Hardy-Weinberg Equilibrium (HWE) principle states that, given certain conditions, genotype and allele frequencies should remain constant between generations. Deviations from HWE can result from violations of the underlying HWE assumptions in the population, from non-random sampling, or from artifacts of variant calling. Conformance to HWE can be assessed by comparing the observed frequencies of genotypes to those expected under HWE given the observed allele counts.
Metric | Description | Scope | Number of values |
---|
Iterative gVCF Genotyper offers both allele-wise and site-wise HWE P-values. The allele-wise P-values are based on the exact-conditional method, whereas the site-wise P-values are based on Pearson's chi-squared method. For bi-allelic sites, although both measure the same property, their values may differ. The differences between the methods are explored in . Care should be taken when deciding which to use.
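As an illustration of the chi-squared approach, the following Python sketch computes a Pearson chi-squared HWE P-value for a single bi-allelic site from genotype counts. It is a textbook formulation with one degree of freedom, not DRAGEN's implementation; the function name and the handling of monomorphic sites are assumptions.

```python
import math


def hwe_chi2_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Pearson chi-squared HWE P-value for a bi-allelic site (1 d.f.)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    # Allele frequencies estimated from the observed genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    if min(expected) == 0:
        # Monomorphic site: no deviation can be measured (assumption).
        return 1.0
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-squared distribution with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    print(hwe_chi2_pvalue(25, 50, 25))  # counts exactly at HWE proportions
    print(hwe_chi2_pvalue(50, 0, 50))   # complete absence of heterozygotes
```

The chi-squared approximation is known to be unreliable when expected genotype counts are small, which is one reason the exact and chi-squared P-values can differ.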
Iterative gVCF Genotyper calculates allele-wise HWE and the ExcHet P-values. The values are calculated using the exact-conditional method described in . The implementation does not use a mid P-value correction.
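A minimal Python sketch of an exact-conditional test for one bi-allelic allele is shown below. It enumerates every heterozygote count compatible with the observed allele counts and uses the standard conditional probability of each; no mid P-value correction is applied. The function names are hypothetical, and defining ExcHet as the probability of observing at least as many heterozygotes as seen is an assumption rather than a statement about DRAGEN's internals.

```python
import math
from typing import Tuple


def _log_prob(n_ab: int, n_a: int, n_b: int) -> float:
    """Log probability of n_ab heterozygotes given allele counts n_a, n_b."""
    n = (n_a + n_b) // 2
    n_aa = (n_a - n_ab) // 2
    n_bb = (n_b - n_ab) // 2
    return (
        n_ab * math.log(2.0)
        + math.lgamma(n + 1)
        - math.lgamma(n_aa + 1)
        - math.lgamma(n_ab + 1)
        - math.lgamma(n_bb + 1)
        + math.lgamma(n_a + 1)
        + math.lgamma(n_b + 1)
        - math.lgamma(2 * n + 1)
    )


def exact_hwe(n_aa: int, n_ab: int, n_bb: int) -> Tuple[float, float]:
    """Return (HWE P-value, ExcHet P-value) for genotype counts."""
    n_a = 2 * n_aa + n_ab
    n_b = 2 * n_bb + n_ab
    # The heterozygote count always has the same parity as n_a.
    het_counts = range(n_a % 2, min(n_a, n_b) + 1, 2)
    probs = {h: math.exp(_log_prob(h, n_a, n_b)) for h in het_counts}
    p_obs = probs[n_ab]
    # Two-sided HWE: all outcomes no more likely than the observed one.
    hwe = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))
    # One-sided excess heterozygosity: observed or more heterozygotes.
    exc_het = sum(p for h, p in probs.items() if h >= n_ab)
    return min(hwe, 1.0), min(exc_het, 1.0)


if __name__ == "__main__":
    print(exact_hwe(25, 50, 25))
    print(exact_hwe(50, 0, 50))
```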
It is also possible to filter based on the maximum ABHetP value, see .
The syntax of a filtering expression is the same as that used by the small variant caller (see ). Filters are always applied to the globally-computed metrics, not the values for the current batch. Records failing filter will have the specified filter ID(s) written to the FILTER column of the msVCF, or will be omitted entirely if the --gg-skip-filtered-sites option is specified. Since filtering is on a per-site basis, filters cannot be applied separately to SNPs or indels as they can in the variant caller.
Metric | Description | Number of values1 | Type |
---|
Metric | Description | Number of values1 | Type |
---|
Outputting local allele values, as described .
Use of the options to output only those metrics required for the downstream analysis.
The option that can have the biggest impact on the final output file size is to write it directly in spVCF format. This is a lossless encoding and the space saving can be dramatic: file size reductions of multiple tens of times have been observed for large cohorts with sparsely distributed variants. Files output as spVCF at step 3 (--generate-msvcf) can be directly merged via the --merge-batches subcommand to produce a single spVCF file. spVCF-encoded files are likely to require decoding back to full msVCF for use with downstream tools, and a binary for this is available for . The decoding takes time, but this is offset by the reduced time required within gVCF Genotyper to initially write the smaller spVCF files. Where possible, pipe the decoded data directly into the downstream tool rather than first writing the full msVCF file to disk.
Section | Mate | Metric | Value |
---|
--vc-callability-tumor-thresh
Specifies the callability threshold for tumor samples. The somatic callable regions report includes all regions with tumor coverage above the tumor threshold. The default value is 50. For more information on the somatic callable regions report, see .
--vc-callability-normal-thresh
Specifies the callability threshold for normal samples, if present. If applicable, the somatic callable regions report includes all regions with normal coverage above the normal threshold. The default value is 5. For more information on the somatic callable regions report, see .
Coverage | Lowest AF |
---|
Somatic Mode | Filter ID | Description |
---|
Prebuilt systematic noise files can be downloaded here:
Version | Release | Modes | Normal Samples |
---|
Option | Description |
---|
Option | Description |
---|
Option | Description |
---|
For the mitochondrial contig, DRAGEN processes it through a continuous AF pipeline, which is similar to the somatic variant calling pipeline. Please refer to for the filtering details.
--read-trimmers
To enable trimming filters in hard-trimming mode, set to a comma-separated list of the trimmer tools you would like to use (in the order of execution). To disable trimming, set to none. During mapping, artifacts are removed from all reads. The following are valid trimmer names:
fixed-len: Fixed-length trimming
polyg: Poly-G trimming
quality: Quality trimming
adapter: Adapter trimming
n: Ambiguous base trimming
min-len: Minimum length trimming
cut-end: Maximum length trimming
polya: RNA Poly-A tail trimming. See additional description in RNA alignment section.
bisulfite: Bisulfite trimming
Read trimming is disabled by default (default: "none").
--soft-read-trimmers
To enable trimming filters in soft-trimming mode, set to a comma-separated list of the trimmer tools you would like to use (in the order of execution). To disable soft trimming, set to none. During mapping, reads are aligned as if trimmed, and bases are not removed from the reads. The following are the valid trimmer names:
fixed-len: Fixed-length trimming
polyg: Poly-G trimming
quality: Quality trimming
adapter: Adapter trimming
n: Ambiguous base trimming
min-len: Minimum length trimming
cut-end: Maximum length trimming
polya: RNA Poly-A tail trimming. See additional description in RNA alignment section.
bisulfite: Bisulfite trimming
Soft-trimming is enabled for the polyg filter by default (default: "polyg").
--trimming-only
Disables mapping and alignment to run read-trimming only.
--trim-min-length
Specify a minimum read length allowed after the trimmer execution. DRAGEN filters any reads with a length less than the value after all read-trimming steps are completed (default: 20).
--trim-min-len-read1
Specify a minimum read length allowed for read1 after the trimmer execution. DRAGEN filters any reads with a length of read1 less than the value after all read-trimming steps are completed (default: 20).
--trim-min-len-read2
Specify a minimum read length allowed for read2 after the trimmer execution. DRAGEN filters any reads with a length of read2 less than the value after all read-trimming steps are completed (default: 20).
--trim-filter-dummy-len
Specify the number of N bases in dummy reads that replace filtered reads (default: 10).
--trim-filter-set-flag
If enabled, dummy reads will have their 0x200 SAM flag set (default: true).
--trim-r1-5prime
Specify a fixed number of bases to trim from the 5' end of Read 1 (default: 0).
--trim-r1-3prime
Specify a fixed number of bases to trim from the 3' end of Read 1 (default: 0).
--trim-r2-5prime
Specify a fixed number of bases to trim from the 5' end of Read 2 (default: 0).
--trim-r2-3prime
Specify a fixed number of bases to trim from the 3' end of Read 2 (default: 0).
--trim-min-quality
Specify the minimum read quality. DRAGEN trims bases from the 3' end of reads with a quality below the value.
--trim-quality-r1-5prime
Specify the quality cutoff below which to trim from the 5' end of read 1.
--trim-quality-r1-3prime
Specify the quality cutoff below which to trim from the 3' end of read 1.
--trim-quality-r2-5prime
Specify the quality cutoff below which to trim from the 5' end of read 2.
--trim-quality-r2-3prime
Specify the quality cutoff below which to trim from the 3' end of read 2.
--trim-adapter-read1
Specify the FASTA file that contains adapter sequences to trim from the 3' end of Read 1.
--trim-adapter-read2
Specify the FASTA file that contains adapter sequences to trim from the 3' end of Read 2.
--trim-adapter-r1-5prime
Specify the FASTA file that contains adapter sequences to trim from the 5' end of Read 1. NB: the sequences should be in reverse order (with respect to their appearance in the FASTQ) but not complemented.
--trim-adapter-r2-5prime
Specify the FASTA file that contains adapter sequences to trim from the 5' end of Read 2. NB: the sequences should be in reverse order (with respect to their appearance in the FASTQ) but not complemented.
--trim-adapter-stringency
Specify the minimum number of adapter bases required for trimming (default: 4).
--trim-bisulfite-ends
Enable both 5-Prime and 3-Prime bisulfite trimming.
--trim-bisulfite-5prime
If a 3' adapter was trimmed, trim an additional 2bp from the 3' end, unless the 5' end matches 'CAA' or 'CGA'.
--trim-bisulfite-3prime
If the 5' end matches 'CAA' or 'CGA', trim the first two of these 5' bases.
--trim-min-r1-5prime
Specify the minimum number of bases to trim from the 5' end of Read 1 (default: 0).
--trim-min-r1-3prime
Specify the minimum number of bases to trim from the 3' end of Read 1 (default: 0).
--trim-min-r2-5prime
Specify the minimum number of bases to trim from the 5' end of Read 2 (default: 0).
--trim-min-r2-3prime
Specify the minimum number of bases to trim from the 3' end of Read 2 (default: 0).
--trim-max-length
Specify the maximum number of bases that can be trimmed from the sequences of both reads.
--trim-max-len-read1
Specify the maximum number of bases that can be trimmed from the sequences of read1.
--trim-max-len-read2
Specify the maximum number of bases that can be trimmed from the sequences of read2.
--trim-polya-min-trim
The minimum number of poly-As required for polya trimming (default: 20).
--trim-polyg-kmer-len
How many bases to check at each read end for poly-G artifact detection (default: 25).
--trim-polyg-kmer-non-g
The maximum number of non-G bases in the K-mer for poly-G artifact detection (default: 2).
--trim-polyg-g-score-r1-5prime
The score for G bases on the 5' end of read 1 (default: 0).
--trim-polyg-g-score-r1-3prime
The score for G bases on the 3' end of read 1 (default: 15).
--trim-polyg-g-score-r2-5prime
The score for G bases on the 5' end of read 2 (default: 0).
--trim-polyg-g-score-r2-3prime
The score for G bases on the 3' end of read 2 (default: 15).
--trim-polyg-min-trim-r1-5prime
The minimum number of G's to trim from the 5' end of read 1 (default: 6).
--trim-polyg-min-trim-r1-3prime
The minimum number of G's to trim from the 3' end of read 1 (default: 6).
--trim-polyg-min-trim-r2-5prime
The minimum number of G's to trim from the 5' end of read 2 (default: 6).
--trim-polyg-min-trim-r2-3prime
The minimum number of G's to trim from the 3' end of read 2 (default: 6).
--trim-polyg-early-exit-threshold
The signed score threshold for poly-G trimming to exit early (default: -500).
--trim-polyx-bases-r1-5prime
The bases to trim for polyX trimming from the 5' end of read 1 (default: empty string "" ).
--trim-polyx-bases-r1-3prime
The bases to trim for polyX trimming from the 3' end of read 1 (default: empty string "" ).
--trim-polyx-bases-r2-5prime
The bases to trim for polyX trimming from the 5' end of read 2 (default: empty string "" ).
--trim-polyx-bases-r2-3prime
The bases to trim for polyX trimming from the 3' end of read 2 (default: empty string "" ).
--trim-polyx-min-trim-r1-5prime
The minimum number of X's to trim from the 5' end of read 1 (default: 20).
--trim-polyx-min-trim-r1-3prime
The minimum number of X's to trim from the 3' end of read 1 (default: 20).
--trim-polyx-min-trim-r2-5prime
The minimum number of X's to trim from the 5' end of read 2 (default: 20).
--trim-polyx-min-trim-r2-3prime
The minimum number of X's to trim from the 3' end of read 2 (default: 20).
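To make the interaction between --trim-min-quality, --trim-min-length, --trim-filter-dummy-len, and --trim-filter-set-flag more concrete, here is a minimal Python sketch. It assumes the simplest reading of the descriptions above: trailing bases below the quality cutoff are removed, and a read left shorter than the minimum length is replaced by a dummy read of N bases with the 0x200 SAM flag set. The `Read` class, the function names, and the quality values given to dummy bases are hypothetical; DRAGEN's actual trimming logic may differ.

```python
from dataclasses import dataclass
from typing import List

QC_FAIL_FLAG = 0x200  # SAM flag: read fails quality checks


@dataclass
class Read:
    seq: str
    quals: List[int]  # Phred scores, one per base
    flag: int = 0


def quality_trim_3prime(read: Read, min_quality: int) -> Read:
    """Drop trailing bases whose quality is below min_quality."""
    end = len(read.seq)
    while end > 0 and read.quals[end - 1] < min_quality:
        end -= 1
    return Read(read.seq[:end], read.quals[:end], read.flag)


def apply_min_length(read: Read, min_length: int = 20,
                     dummy_len: int = 10, set_flag: bool = True) -> Read:
    """Replace a too-short read with a dummy read made of N bases."""
    if len(read.seq) >= min_length:
        return read
    flag = read.flag | QC_FAIL_FLAG if set_flag else read.flag
    # Quality 0 for the dummy bases is an assumption of this sketch.
    return Read("N" * dummy_len, [0] * dummy_len, flag)


if __name__ == "__main__":
    raw = Read("ACGT" * 6 + "A", [30] * 15 + [2] * 10)
    trimmed = quality_trim_3prime(raw, min_quality=20)
    final = apply_min_length(trimmed)
    print(len(trimmed.seq), final.seq, hex(final.flag))
```

Replacing a filtered read with a dummy rather than dropping it keeps the number of records unchanged, which is presumably why a placeholder length and flag are configurable.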

PLINK option | PLINK default | PLINK tuned | DRAGEN default | PLINK Definitions |
---|---|---|---|---|
--homozyg-density | 50 | 50 | | Minimum required density to call a ROH (1 SNP in 50 kb), can be increased to relax the per SNP density. |
--homozyg-gap | 1000 | 1000 | 3000 | Maximal interval between two homozygous SNPs in a ROH (in kb) |
--homozyg-kb | 1000 | 500 | All sizes reported | Minimal length of reported ROH (in kb) |
--homozyg-snp | 100 | 50 | 50 | Minimal number of homozygous SNPs in the reported ROH |
--homozyg-window-het | 1 | 2 | Soft score threshold (1-0.025) penalty for a het SNP and 0.025 gain for a hom SNP | Maximum number of heterozygous SNPs allowed in a scanning window |
--homozyg-window-missing | 5 | 5 | | Number of missing calls allowed in a scanning window |
--homozyg-window-snp | 50 | 50 | | Variants in a scanning window |
--homozyg-window-threshold | 0.05 | 0.05 | | For a SNP to be eligible for inclusion in a ROH, the hit rate/overlap of all scanning windows containing the SNP must be at least 0.05 |

7 | 1-255 | 1 | <256 |
6 | 1-128 | 2 | >=256 and <507 |
5 | 1-64 | 4 | >=507 and <4031 |
4 | 1-32 | 8 | >=4031 |

Section | Mate | Metric | Value |
---|---|---|---|
READ MEAN QUALITY | Read1 | Q38 Reads | 965377 |
... | | | |
POSITIONAL BASE MEAN QUALITY | Read1 | ReadPos 145-152 T Average Quality | 34.49 |
POSITIONAL BASE MEAN QUALITY | Read1 | ReadPos 150 T Average Quality | 34.44 |
POSITIONAL BASE MEAN QUALITY | Read1 | ReadPos 256+ T Average Quality | 36.99 |
... | | | |
POSITIONAL BASE CONTENT | Read1 | ReadPos 145-152 A Bases | 113362306 |
POSITIONAL BASE CONTENT | Read1 | ReadPos 150 A Bases | 14300589 |
POSITIONAL BASE CONTENT | Read1 | ReadPos 256+ A Bases | 13249068 |
... | | | |
READ LENGTHS | Read1 | 150bp Length Reads | 77304421 |
READ LENGTHS | Read1 | 144-151bp Length Reads | 77304421 |
READ LENGTHS | Read1 | >=255bp Length Reads | 1000000 |
... | | | |
READ GC CONTENT | Read1 | 50% GC Reads | 140878674373 |
... | | | |
READ GC CONTENT QUALITY | Read1 | 50% GC Reads Average Quality | 36.20 |
... | | | |
SEQUENCE POSITIONS | Read1 | 'AGATCGGAAGAG' 137bp Starts | 20 |
SEQUENCE POSITIONS | Read1 | 'AGATCGGAAGAG' 137-144bp Starts | 23 |
... | | | |
POSITIONAL QUALITY | Read1 | ReadPos 150 50% Quantile QV | 37 |
POSITIONAL QUALITY | Read1 | ReadPos 145-152 50% Quantile QV | 37 |
... | | | |


Coverage | Lowest AF |
---|---|
0-199 | 0.05 |
200-399 | 0.025 |
400-799 | 0.0125 |
... | ... |

Somatic Mode | Filter ID | Description |
---|---|---|
Tumor-Only & Tumor-Normal | weak_evidence | Variant does not meet likelihood threshold. The likelihood ratio for SQ tumor-normal is < 17.5 or < 3.0 for SQ tumor-only. |
Tumor-Only & Tumor-Normal | multiallelic | Site filtered if there are two or more ALT alleles at this location in the tumor. Not applied to somatic hotspot variants. |
Tumor-Only & Tumor-Normal | base_quality | Median base quality of ALT reads at this locus is < 20. |
Tumor-Only & Tumor-Normal | mapping_quality | Median mapping quality of ALT reads at this locus is < 20 (tumor-normal) or < 30 (tumor-only). |
Tumor-Only & Tumor-Normal | fragment_length | Absolute difference between the median fragment length of alt reads and median fragment length of ref reads at a given locus > 10000. |
Tumor-Only & Tumor-Normal | read_position | Median of distances between the start and end of read and a given locus < 5 (the variant is too close to edge of all the reads). To output variant read position to the INFO field, use |
Tumor-Only & Tumor-Normal | low_af | Allele frequency is below the threshold specified with |
Tumor-Only & Tumor-Normal | systematic_noise | If AQ score is < 10 (default) for tumor-normal or < 60 (default) for tumor-only, the site is filtered. |
Tumor-Only & Tumor-Normal | low_frac_info_reads | The fraction of informative reads (denominator excludes filtered_out reads) is below the threshold. The default threshold value is 0.5. |
Tumor-Only & Tumor-Normal | filtered_reads | More than 50% of reads have been filtered out. |
Tumor-Only & Tumor-Normal | long_indel | Indel length is more than 100bp. |
Tumor-Only & Tumor-Normal | low_depth | The site was filtered because the number of reads is too low. The filter is off by default. |
Tumor-Only & Tumor-Normal | low_tlen | The site was filtered because the fraction of low TLEN ALT supporting reads is above a threshold. The default threshold is 0.4. Reads with TLEN smaller than -2.25 (default) standard deviations from the mean are considered to be low TLEN. This filter is not applied for reads sampled from tight insert distributions i.e., stddev / mean < 0.1 (default). |
Tumor-Only & Tumor-Normal | no_reliable_supporting_read | No reliable supporting read was found in the tumor sample. A reliable supporting read is a read supporting the alt allele with mapping quality ≥ 40, fragment length ≤ 10,000, base call quality ≥ 25, and distance from start/end of read ≥ 5. |
Tumor-Only & Tumor-Normal | too_few_supporting_reads | Variant is supported by < 3 reads in the tumor sample. This filter is not applied in UMI-aware pipelines. |
Tumor-Normal | noisy_normal | More than three alleles are observed in the normal sample at allele frequency above 9.9%. |
Tumor-Normal | alt_allele_in_normal | ALT allele frequency in the normal sample is above 0.2 plus the maximum contamination tolerance. For solid tumor mode, the value is 0. For liquid tumor mode, the default value is 0.15. See |
Tumor-Normal | non_homref_normal | Normal sample genotype is not a homozygous reference. |

Version | Release | Modes | Normal Samples |
---|---|---|---|
Somatic Systematic Noise Baseline Collection v2.0.0 | V4.3 | hg19, hg38, hs37d5, WES, WGS | ~50 per cohort, 80-100X coverage |

Option | Description |
---|---|
--vc-systematic-noise | Specifies a systematic noise BED file. If a somatic variant does not pass the AQ threshold, the variant is marked as 'systematic_noise' in the FILTER column of the output VCF. |
--vc-systematic-noise-method | Specifies which column in the systematic noise file will be used: "max" is more aggressive and recommended for WGS, while "mean" preserves better sensitivity and is recommended for WES/PANELs. |
--vc-systematic-noise-filter-threshold | Set the AQ threshold. Higher values filter more aggressively. By default the threshold value is 10 for tumor-normal and 60 for tumor-only. The valid range spans 0-100. For tumor-normal runs the threshold may be set higher (e.g. to 60) to improve specificity at the possible cost of some sensitivity. |
--vc-systematic-noise-filter-threshold-in-hotspot | Set the AQ threshold to use in hotspot regions, where one may want to filter less aggressively than in the rest of the genome. By default, the threshold value is 10 for tumor-normal and 20 for tumor-only. |
--vc-allele-specific-systematic-noise | Apply systematic noise in an allele-specific manner when allele information is available. This setting is ignored for v1 noise files (default: true). |
--vc-detect-systematic-noise | Run the tumor-only pipeline in an ultra sensitive mode and intentionally include noise in the output VCF. WARNING: this option should only be used with normal samples to characterize noise, it is NOT intended for analyzing tumor samples. |
--vc-detect-systematic-noise-mode | Specify the library type when generating the systematic noise. Only required for UMI samples. This mode will generate GVCFs which are especially useful for capturing very low levels of noise. The default mode will work well for WGS/WES and non-UMI panels. Valid options include [UMI, DEFAULT] |
--build-sys-noise-method | Specifies the default value for vc-systematic-noise-method by adding it as part of the header in the systematic noise file. It is recommended to select 'mean' for UMI/PANELS/WES data and 'max' for WGS data (default is 'max'). |
--build-sys-noise-vcfs-list | Text file containing the paths of normal VCFs. Specify the full VCF file paths. List one file per line. |
--build-sys-noise-germline-vaf-threshold | Variant calls with VAF higher than this threshold will be considered germline and will not contribute to the noise estimate. This option is disabled by default by setting the threshold to 1. (Default 1) |
--build-sys-noise-use-germline-tag | This option will ensure that variants tagged by |
--build-sys-noise-min-sample-cov | Minimum coverage at a site for a sample to be used towards noise estimation. At low coverages, estimated allele frequencies become less reliable. Accurate AF estimation is important for germline variant detection, and also for noise detection when using MAX noise. (Default 5) |
--build-sys-noise-min-supporting-samples | Min number of samples with noise at a position in order for a position to be considered systematic-noise (Default 1). |

Column Header | Description |
---|---|
Family_ID | The pedigree identifier. |
Individual_ID | The ID of the individual. |
Paternal_ID | The ID of the individual's father. If the founder, the value is 0. |
Maternal_ID | The ID of the individual's mother. If the founder, the value is 0. |
Sex | The sex of the sample. If male, the value is 1. If female, the value is 2. |
Phenotype | The genetic data of the sample. If unknown, the value is 0. If unaffected, the value is 1. If affected, the value is 2. |
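
The six pedigree columns described above can be read with a short Python sketch. The whitespace-delimited layout and the example trio are assumptions for illustration; the class and function names are hypothetical.

```python
from typing import NamedTuple


class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: int        # 1 = male, 2 = female
    phenotype: int  # 0 = unknown, 1 = unaffected, 2 = affected


def parse_ped_line(line: str) -> PedRecord:
    """Parse one whitespace-delimited pedigree line into a record."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, found {len(fields)}")
    fam, ind, pat, mat, sex, pheno = fields
    return PedRecord(fam, ind, pat, mat, int(sex), int(pheno))


def is_founder(record: PedRecord) -> bool:
    """A founder has 0 for both parent IDs."""
    return record.paternal_id == "0" and record.maternal_id == "0"


if __name__ == "__main__":
    # Hypothetical trio: two founders and one affected child.
    example = [
        "FAM1 father 0 0 1 1",
        "FAM1 mother 0 0 2 1",
        "FAM1 child father mother 1 2",
    ]
    for row in example:
        rec = parse_ped_line(row)
        print(rec.individual_id, "founder" if is_founder(rec) else "offspring")
```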

Metric | Description | Scope | Number of values |
---|---|---|---|
HWE | Hardy-Weinberg Equilibrium P-value | Allele-wise | One for each alt allele |
ExcHet | Excess Heterozygosity P-value | Allele-wise | One for each alt allele |
HWEc2 | Hardy-Weinberg Equilibrium P-value | Site-wise | 1 |
IC | Inbreeding Coefficient | Site-wise | 1 |

Metric | Description | Number of values | Type |
---|---|---|---|
GT | Genotype | 1 | String |
GQ | Genotype quality | 1 | Integer |
AD | Allelic depths | R | Integer |
LAD | Localized allelic depths | . | Integer |
FT | Sample filter | 1 | String |
LPL | Local normalized, Phred-scaled likelihoods for genotypes as in original gVCF | . | Integer |
LAA | Mapping of local alt allele index from original gVCF to msVCF excluding the reference allele | . | Integer |
LA | Mapping of local allele indices from original gVCF to msVCF including the reference allele | . | Integer |
LGT | Local GT value as in original gVCF | 1 | String |
QL | Phred-scaled probability that the site has no variant in this sample (original gVCF QUAL) | 1 | Float |
MQR | Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities (original gVCF INFO/MQRankSum) | 1 | Float |
LAF | Allele fractions for the local alt alleles | . | Float |
DF | Diploidified, 1 represents a genotype that was originally haploid, 0 represents originally diploid | 1 | Integer |

Metric | Description | Number of values | Type |
---|---|---|---|
AC | Allele count in genotypes | A | Integer |
AN | Total number of alleles in called genotypes | 1 | Integer |
NS | Total number of samples | 1 | Integer |
NS_GT | Total number of samples with called genotypes | 1 | Integer |
NS_NOGT | Total number of samples with unknown genotypes ./. | 1 | Integer |
NS_NODATA | Total number of samples with no coverage | 1 | Integer |
IC | The inbreeding coefficient | 1 | Float |
HWE | The exact conditional Hardy-Weinberg Equilibrium P-value | A | Float |
ExcHet | The exact conditional Excess Heterozygosity P-value | A | Float |
HWEc2 | The chi-squared Hardy-Weinberg Equilibrium P-value | 1 | Float |
AF | The ALT allele frequencies (AC/AN) | A | Float |
ABHom | The allelic balance among homozygotes | R | Float |
ABHet | The allelic balance among heterozygotes | R | Float |
ABHetP | The P-value for allelic balance among heterozygotes | R | Float |
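
Several of the site-wise fields above follow directly from the sample genotypes. The Python sketch below derives AC, AN, AF, NS, and NS_GT from a list of GT strings. It is a simplified illustration: the function name is hypothetical, and treating a partially called genotype as called is an assumption.

```python
from typing import Dict, List


def site_metrics(genotypes: List[str], n_alt: int) -> Dict[str, object]:
    """Compute AC, AN, AF, NS and NS_GT from per-sample GT strings."""
    ac = [0] * n_alt
    an = 0
    ns_gt = 0
    for gt in genotypes:
        # Keep only called alleles; "." marks a missing allele.
        alleles = [a for a in gt.replace("|", "/").split("/") if a != "."]
        if alleles:
            ns_gt += 1
        an += len(alleles)
        for allele in alleles:
            index = int(allele)
            if index > 0:
                ac[index - 1] += 1
    af = [count / an if an else 0.0 for count in ac]
    return {"AC": ac, "AN": an, "AF": af, "NS": len(genotypes), "NS_GT": ns_gt}


if __name__ == "__main__":
    print(site_metrics(["0/1", "1/1", "./.", "0/2", "1"], n_alt=2))
```

In this example the haploid call contributes one allele to AN, which is consistent with site metrics being calculated before any diploidification.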
DRAGEN secondary analysis employs machine learning based variant recalibration (DRAGEN-ML) for germline SNV VC. Variant calling accuracy is improved using powerful yet efficient machine learning techniques that augment the variant caller, by exploiting more of the available read and context information that does not easily integrate into the Bayesian processing used by the haplotype variant caller. A supervised machine learning method was developed using truth from the PrecisionFDA v4.2.1 sets to build a model that processes read and other contextual evidence to remove false positives, recover false negatives and reduce zygosity errors, for both SNVs and INDELs.
No additional setup is required. ML model files for the hg38 and hg19 human references are packaged with the DRAGEN installer. After installation, the files are present at <INSTALL_PATH>/resources/ml_model/<ref>.
DRAGEN-ML is enabled by default as needed, when running the germline SNV VC. DRAGEN automatically detects the reference used for analysis and uses the correct model files. If neither an hg38 nor an hg19 reference type is detected, ML recalibration is automatically disabled and SNV VC falls back to legacy operation.
DRAGEN-ML requires a run with BAM or FASTQ input, since the machine learning model extracts information from the read pile-up. DRAGEN-ML runs concurrently with DRAGEN SNV VC. DRAGEN-ML can be applied to WGS or WES samples. Re-calibration of existing VCF files is not supported.
DRAGEN-ML recalibrates all quality scores, changing the values of the QUAL and GQ fields in the output VCF/GVCF.
DRAGEN-ML also updates PL and GP in the output VCF/GVCF.
The genotypes (GT field) of some variants may be changed by ML e.g., 0/1 to 1/1 or vice versa.
DRAGEN-ML PHRED scores are limited to a maximum value of around 60-70. Therefore, the QUAL filtering threshold is set to 3 when DRAGEN-ML is enabled, compared to 10 for DRAGEN-VC when DRAGEN-ML is disabled.
The following variant types are recalibrated:
Biallelic and multiallelic variants
Autosomes and sex chromosomes, including haploid positions
Force GT calls
Non primary contigs
DRAGEN-ML typically removes 30-50% of SNP FPs, with smaller gains on INDELS. FN counts are reduced by 10% or more. The output QUAL/GQ of DRAGEN-ML is empirically more accurately calibrated than DRAGEN SNV VC without ML. There are significant gains in accuracy statistics across the entire genome with ML enabled. Note that a small number of variant calls may have degraded accuracy with ML enabled compared to VC without ML.
DRAGEN-ML adds about 10% to the run time compared to runs without ML.
An MD5SUM file is generated automatically for VCF output files. This file is in the same output directory and has the same name as the VCF output file, but with an .md5sum extension appended. For example, whole_genome_run_123.vcf.md5sum. The MD5SUM file is a single-line text file that contains the md5sum of the VCF output file. This md5sum exactly matches the output of the Linux md5sum command.
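A downstream script can use the .md5sum file to confirm that a VCF was copied intact. The Python sketch below is one way to do this; the function names are hypothetical, and it assumes the checksum is the first whitespace-separated token on the single line of the file.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file without loading it whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5sum(vcf_path: str) -> bool:
    """Compare a VCF against the checksum in <vcf_path>.md5sum."""
    recorded = Path(vcf_path + ".md5sum").read_text().split()[0].lower()
    return recorded == md5_of_file(vcf_path)
```

For example, `verify_md5sum("whole_genome_run_123.vcf")` would read whole_genome_run_123.vcf.md5sum from the same directory and return True only when the digests match.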
DRAGEN supports force genotyping (ForceGT) for small variant calling. To use ForceGT, use the --vc-forcegt-vcf option with a list of small variants to force genotype. The input list of small variants can be a *.vcf or *.vcf.gz file.
The current limitations of ForceGT are as follows:
ForceGT is supported for germline small variant calling in the V3 mode. The V1, V2, and V2+ modes are not supported.
ForceGT is also supported for somatic small variant calling.
ForceGT variants do not propagate through joint genotyping.
DRAGEN supports only a single ForceGT VCF input file, which must meet the following requirements:
The input has to be a valid VCF file according to version 4.2 of the VCF standard. For instance, it has to have at least eight tab-delimited columns and records need to be sorted by reference contig and position.
The header has to list the same contigs as the reference used for variant calling. All variants must refer to one of these contig names.
Variants have to be normalized (parsimonious and left-aligned, see below).
It must not contain any multinucleotide or complex variants (AT -> C). These are variants that require more than one substitution / insertion / deletion to go from REF allele to ALT allele and are ignored.
Any deletions longer than 50bp are filtered out.
Any variant will only be called once. Duplicate entries will be ignored.
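Some of these requirements can be checked before a run. The Python sketch below classifies REF/ALT pairs and reports which records would be ignored under the rules listed above (complex variants, deletions longer than 50bp, and duplicates). It is a pre-flight helper under stated assumptions, not DRAGEN's validation logic: the function names are hypothetical, variants are assumed to be already normalized, and left-alignment is not checked because that requires the reference sequence.

```python
from typing import Iterable, List, Tuple

Variant = Tuple[str, int, str, str]  # (contig, position, REF, ALT)

MAX_DELETION_LENGTH = 50


def classify_variant(ref: str, alt: str) -> str:
    """Classify a normalized REF/ALT pair as snv, insertion, deletion or complex."""
    if len(ref) == 1 and len(alt) == 1:
        return "snv" if ref != alt else "invalid"
    if len(ref) == 1 and alt[0] == ref:
        return "insertion"
    if len(alt) == 1 and ref[0] == alt:
        return "deletion"
    # More than one edit is needed to turn REF into ALT, e.g. AT -> C.
    return "complex"


def usable_forcegt_variants(variants: Iterable[Variant]) -> List[Variant]:
    """Return the variants that pass the checks, preserving input order."""
    seen = set()
    kept = []
    for variant in variants:
        _, _, ref, alt = variant
        kind = classify_variant(ref, alt)
        if kind in ("complex", "invalid"):
            continue
        if kind == "deletion" and len(ref) - 1 > MAX_DELETION_LENGTH:
            continue
        if variant in seen:
            continue  # duplicate entries are only used once
        seen.add(variant)
        kept.append(variant)
    return kept


if __name__ == "__main__":
    candidates = [
        ("chr1", 100, "A", "G"),
        ("chr1", 100, "A", "G"),
        ("chr1", 200, "AT", "C"),
        ("chr1", 300, "A" + "T" * 60, "A"),
        ("chr1", 400, "A", "ATT"),
    ]
    print(usable_forcegt_variants(candidates))
```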
The following nonnormalized variant will cause undefined behavior in DRAGEN:
Instead of…
use…
Force genotyping requires an input VCF and can be used with DRAGEN software in VCF, GVCF or VCF+GVCF mode. In all cases the output file(s) contains all regular calls and the forceGT variants, as follows:
For a ForceGT call that was not called by the variant caller (not common), the call is tagged with FGT in the INFO field.
For a germline ForceGT call that was also called by the variant caller and filter field is PASS, the call is tagged with NML;FGT in the INFO field (NML denotes normal). In somatic mode, the call is tagged with FGT;SOM.
For a normal call (and PASS) by the variant caller, with no ForceGT call (normal), no extra tags are added (no NML tag, no FGT tag).
This scheme distinguishes among calls that are present due to FGT only, common in both ForceGT input and normal calling, and normal calls.
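The tagging scheme can be decoded from the INFO column with a few lines of Python. This sketch only inspects the FGT, NML, and SOM tags named above; the function name and the returned labels are hypothetical.

```python
def call_origin(info: str) -> str:
    """Classify a VCF record by the ForceGT-related tags in its INFO field."""
    # INFO entries are separated by ';' and may be flags or KEY=VALUE pairs.
    tags = {entry.split("=", 1)[0] for entry in info.split(";") if entry}
    if "FGT" in tags and ("NML" in tags or "SOM" in tags):
        return "forcegt_and_called"
    if "FGT" in tags:
        return "forcegt_only"
    return "called_only"


if __name__ == "__main__":
    for info in ("DP=30;FGT", "DP=30;NML;FGT", "DP=30;FGT;SOM", "DP=30"):
        print(info, "->", call_origin(info))
```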
All the variants in the input ForceGT VCF are genotyped and present in the output file. The following table lists the reported GTs for the variants.
If DRAGEN calls a variant that is different from the one specified in the input ForceGT VCF, the output contains the following multiple entries at the same position:
One entry for the default DRAGEN variant call
One entry each for every variant call present in the input ForceGT-VCF at that position
If a target BED file is provided along with the input ForceGT VCF, then the output file only contains ForceGT variants that overlap the BED file positions.
The seed-density option controls how many (normally overlapping) primary seeds from each read the mapper looks up in its hash table for exact matches. The maximum density value of 1.0 generates a seed starting at every position in the read, i.e., (L-K+1) K-base seeds from an L-base read.
Seed density must be between 0.0 and 1.0. Internally, an available seed pattern equal or close to the requested density is selected. The sparsest pattern is one seed per 32 positions, or density 0.03125.
Accuracy Considerations--Generally, denser seed lookup patterns improve mapping accuracy. However, for modestly long reads (e.g., 50 bp+) and low sequencer error rates, there is little to be gained beyond the default 50% seed lookup density.
Speed Considerations--Denser seed lookup patterns generally slow down mapping, and sparser seed patterns speed it up. However, when the seed mapping stage can run faster than the aligning stage, a sparser seed pattern does not make the mapper much faster.
Relationship to Reference Seed Interval
Functionally, a denser or sparser seed lookup pattern has an impact very similar to a shorter or longer reference seed interval (build hash table option --ht-ref-seed-interval). Populating 100% of reference seed positions and looking up 50% of read seed positions has the same effect as populating 50% of reference seed positions and looking up 100% of read seed positions. Either way, the expected density of seed hits is 50%.
More generally, the expected density of seed hits is the product of the reference seed density (the inverse of the reference seed interval) and the seed lookup density. For example, if 50% of reference seeds are populated and 33.3% (1/3) of read seed positions are looked up, then the expected seed hit density should be 16.7% (1/6).
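The arithmetic in this section can be reproduced with a small Python sketch. The function names are hypothetical, and the seed length of 21 used in the example is an arbitrary illustrative value rather than a documented default.

```python
from fractions import Fraction


def max_seeds(read_len: int, seed_len: int) -> int:
    """Number of K-base seeds in an L-base read at density 1.0: L - K + 1."""
    return max(read_len - seed_len + 1, 0)


def expected_hit_density(ref_seed_interval: int, seed_density) -> Fraction:
    """Reference seed density (1 / interval) times the read lookup density."""
    return Fraction(1, ref_seed_interval) * Fraction(seed_density)


if __name__ == "__main__":
    print(max_seeds(read_len=150, seed_len=21))
    # 100% of reference seeds populated, 50% of read seeds looked up.
    print(expected_hit_density(1, 0.5))
    # 50% of reference seeds populated, 100% of read seeds looked up.
    print(expected_hit_density(2, 1))
    # 50% of reference seeds populated, 1/3 of read seeds looked up.
    print(expected_hit_density(2, Fraction(1, 3)))
```

The second and third calls give the same result, which is the equivalence between reference seed interval and seed lookup density described above.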
DRAGEN automatically adjusts its precise seed lookup pattern to ensure it does not systematically miss the seed positions populated from the reference. For example, the mapper does not look up seeds matching only odd positions in the reference when only even positions are populated in the hash table, even if the reference seed interval is 2 and seed-density is 0.5.
The --Mapper.map-orientations option is used in mapping reads for bisulfite methylation analysis. It is set automatically based on the value set for --methylation-protocol.
The --Mapper.map-orientations option can restrict the orientation of read mapping to only forward in the reference genome, or only reverse-complemented. The valid values for --map-orientations are as follows.
0--Either orientation (default)
1--Only forward mapping
2--Only reverse-complemented mapping
If mapping orientations are restricted and paired end reads are used, the expected pair orientation can only be FR, not FF or RF.
Although DRAGEN primarily maps reads by finding exact reference matches to short seeds, it can also map seeds differing from the reference by one nucleotide by also looking up single-SNP edited seeds. Seed editing is usually not necessary with longer reads (100 bp+), because longer reads have a high probability of containing at least one exact seed match. This is especially true when paired ends are used, because a seed match from either mate can successfully align the pair. But seed editing can, for example, be useful to increase mapping accuracy for short single-ended reads, with some cost in increased mapping time. The following options control seed editing:
Seed Editing Options
edit-mode and edit-chain-limit
The edit-mode and edit-chain-limit options control when seed editing is used. The following four edit-mode values are available:
Edit mode 0 requires all seeds to match exactly. Mode 3 is the most expensive because every seed that fails to match the reference exactly is edited. Modes 1 and 2 employ heuristics to look up edited seeds only for reads most likely to be salvaged to accurate mapping.
The main heuristic in edit modes 1 and 2 is a seed chain length test. Exact seeds are mapped to the reference in a first pass over a given read, and the matching seeds are grouped into chains of similarly aligning seeds. If the longest seed chain (in the read) exceeds a threshold edit-chain-limit, the read is judged not to require seed editing, because there is already a promising mapping position.
Edit mode 1 triggers seed editing for a given read using the seed chain length test. If no seed chain exceeds edit-chain-limit (including if no exact seeds match), then a second seed mapping pass is attempted using edited seeds. Edit mode 2 further optimizes the heuristic for paired-end reads. If either mate has an exact seed chain longer than edit-chain-limit, then seed editing is disabled for the pair, because a rescue scan is likely to recover the mate alignment based on seed matches from one read. Edit mode 2 is the same as mode 1 for single-ended reads.
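The decision logic described for the edit modes can be summarized in a short sketch. It assumes that chain lengths are plain integers and that a read with no exact seed match has a longest chain of 0; the function and argument names are hypothetical and do not correspond to DRAGEN internals.

```python
from typing import Optional


def needs_seed_editing(
    edit_mode: int,
    edit_chain_limit: int,
    longest_chain: int,
    mate_longest_chain: Optional[int] = None,
) -> bool:
    """Decide whether edited seeds are looked up for a read (illustrative only)."""
    if edit_mode == 0:
        # Mode 0: all seeds must match exactly, so no editing.
        return False
    if edit_mode == 3:
        # Mode 3: every seed that fails to match exactly is edited.
        return True
    if edit_mode not in (1, 2):
        raise ValueError("edit_mode must be 0, 1, 2, or 3")

    # Seed chain length test: a chain exceeding the limit is a promising mapping.
    read_is_promising = longest_chain > edit_chain_limit
    if edit_mode == 2 and mate_longest_chain is not None:
        # Mode 2, paired-end: a promising chain on either mate disables editing,
        # because a rescue scan is likely to recover the other mate.
        mate_is_promising = mate_longest_chain > edit_chain_limit
        return not (read_is_promising or mate_is_promising)
    # Mode 1, or mode 2 with a single-ended read.
    return not read_is_promising


print(needs_seed_editing(1, 10, 0))                          # True: no exact seeds matched
print(needs_seed_editing(2, 10, 3, mate_longest_chain=20))   # False: the mate can anchor a rescue
```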
edit-seed-num and edit-read-len
For edit modes 1 and 2, when the heuristic triggers seed editing, these options control how many seed positions are edited in the second pass over the read. Although exact seed mapping can use a densely overlapping seed pattern, such as seeds starting at 50% or 100% of read positions, most of the value of seed editing can be obtained by editing a much sparser pattern of seeds, even a nonoverlapping pattern. Generally, if a user application can afford to spend some additional amount of mapping time on seed editing, a greater increase in mapping accuracy can be obtained for the same time cost by editing seeds in sparse patterns for a large number of reads, than by editing seeds in dense patterns for a small number of reads.
Whenever seed editing is triggered, these two options request edit-seed-num seed editing positions, distributed evenly over the first edit-read-len bases of the read. For example, with 21-base seeds, edit-seed-num=6 and edit-read-len=100, edited seeds can begin at offsets {0, 16, 32, 48, 64, 80} from the 5' end, consecutive seeds overlapping by 5 bases. Because sequencing technologies often yield better base qualities nearer the (5') beginning of each read, this can focus seed editing where it is most likely to succeed. When a particular read is shorter than edit-read-len, fewer seeds are edited.
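The exact placement rule is not stated here. The following sketch uses one spacing rule, a step of ceil((edit-read-len - seed length) / (edit-seed-num - 1)), chosen only because it reproduces the offsets in the example above. Both the rule and the function name are assumptions, not the documented DRAGEN behavior.

```python
import math
from typing import List


def edited_seed_offsets(seed_len: int, edit_seed_num: int, edit_read_len: int, read_len: int) -> List[int]:
    """Return 5'-relative start offsets of edited seeds (assumed spacing rule)."""
    if edit_seed_num < 1 or read_len < seed_len:
        return []
    if edit_seed_num == 1:
        return [0]
    # Assumed rule: spread the seeds evenly over the first edit_read_len bases.
    step = max(1, math.ceil((edit_read_len - seed_len) / (edit_seed_num - 1)))
    offsets = [i * step for i in range(edit_seed_num)]
    # Seeds that would run past the end of a short read are dropped.
    return [off for off in offsets if off + seed_len <= read_len]


print(edited_seed_offsets(21, 6, 100, 150))  # [0, 16, 32, 48, 64, 80]
print(edited_seed_offsets(21, 6, 100, 50))   # [0, 16]
```

With a step of 16 and 21-base seeds, consecutive seeds overlap by 5 bases, as in the example.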
Seed editing is more expensive when the reference seed interval (build hash table option --ht-ref-seed-interval) is greater than 1. For edit modes 1 and 2, additional seed editing positions are automatically generated to avoid missing the populated reference seed positions. For edit mode 3, the time cost can increase dramatically because query seeds matching unpopulated reference positions typically miss and trigger editing.
The first stage of mapping is to generate seeds from the read and look for exact matches in the reference genome. These results are then refined by running full Smith-Waterman alignments on the locations with the highest density of seed matches. This well-documented algorithm works by comparing each position of the read against all the candidate positions of the reference. These comparisons correspond to a matrix of potential alignments between read and reference. For each of these candidate alignment positions, Smith-Waterman generates scores that are used to evaluate whether the best alignment passing through that matrix cell reaches it by a nucleotide match or mismatch (diagonal movement), a deletion (horizontal movement), or an insertion (vertical movement). A match between read and reference adds a bonus to the score, and a mismatch or indel imposes a penalty. The highest-scoring path through the matrix is the alignment chosen.
The specific values chosen for scores in this algorithm indicate how to balance, for an alignment with multiple possible interpretations, the possibility of an indel as opposed to one or more SNPs, or the preference for an alignment without clipping. The default DRAGEN scoring values are reasonable for aligning moderate length reads to a whole human reference genome for variant calling applications. But any set of Smith-Waterman scoring parameters represents an imprecise model of genomic mutation and sequencing errors, and differently tuned alignment scoring values can be more appropriate for some applications.
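To make the scoring trade-offs concrete, the following is a minimal software sketch of Smith-Waterman local alignment scoring with affine gap penalties. It is a textbook formulation, not the DRAGEN hardware implementation. The argument names mirror the option names described in this section, and the default values are illustrative assumptions only.

```python
def smith_waterman_score(read: str, ref: str, match_score: int = 1, mismatch_pen: int = 4,
                         gap_open_pen: int = 6, gap_ext_pen: int = 1) -> int:
    """Best local alignment score; a gap of length L costs gap_open_pen + L * gap_ext_pen."""
    neg_inf = float("-inf")
    rows, cols = len(read) + 1, len(ref) + 1
    # best[i][j]: best score of an alignment ending at read base i and reference base j.
    best = [[0] * cols for _ in range(rows)]
    # dele[i][j]: alignment ends in a deletion (horizontal move, consumes reference only).
    dele = [[neg_inf] * cols for _ in range(rows)]
    # ins[i][j]: alignment ends in an insertion (vertical move, consumes read only).
    ins = [[neg_inf] * cols for _ in range(rows)]
    top = 0
    for i in range(1, rows):
        for j in range(1, cols):
            dele[i][j] = max(dele[i][j - 1] - gap_ext_pen,
                             best[i][j - 1] - gap_open_pen - gap_ext_pen)
            ins[i][j] = max(ins[i - 1][j] - gap_ext_pen,
                            best[i - 1][j] - gap_open_pen - gap_ext_pen)
            # Diagonal move: match bonus or mismatch penalty.
            step = match_score if read[i - 1] == ref[j - 1] else -mismatch_pen
            diag = best[i - 1][j - 1] + step
            # The floor of 0 is what allows clipping (local alignment).
            best[i][j] = max(0, diag, dele[i][j], ins[i][j])
            top = max(top, best[i][j])
    return top


# Exact match inside a longer reference: 4 matches.
print(smith_waterman_score("ACGT", "TTACGTTT"))  # 4
# A 2-base deletion is bridged because 20 matches - (6 + 2 * 1) beats clipping to 10 matches.
print(smith_waterman_score("A" * 10 + "C" * 10, "A" * 10 + "GG" + "C" * 10))  # 12
```

Raising gap_open_pen in the second call (for example to 20) makes clipping the cheaper choice, which is the kind of balance the scoring options control.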
The following alignment options control Smith-Waterman Alignment:
global
The global option (value can be 0 or 1) controls whether alignment is forced to be end-to-end in the read. When set to 1, alignments are always end-to-end, as in the Needleman-Wunsch global alignment algorithm (although not end-to-end in the reference), and alignment scores can be positive or negative. When set to 0, alignments can be clipped at either or both ends of the read, as in the Smith-Waterman local alignment algorithm, and alignment scores are nonnegative. Generally, global=0 is preferred for longer reads, so significant read segments after a break of some kind (large indel, structural variant, chimeric read, and so forth) can be clipped without severely decreasing the alignment score. Setting global=1 might not have the desired effect with longer reads because insertions at or near the ends of a read can function as pseudoclipping. Also, with global=0, multiple (chimeric) alignments can be reported when various portions of a read match widely separated reference positions. Using global=1 is sometimes preferable with short reads, which are unlikely to overlap structural breaks, unable to support chimeric alignments, and are suspected of incorrect mapping if they cannot align well end-to-end. Consider using the unclip-score option, or increasing it, instead of setting global=1, to make a soft preference for unclipped alignments.
match-score
The match-score option specifies the score for a read nucleotide matching a reference nucleotide (A, C, G, or T), or matching a reference 2–3 nucleotide IUPAC-IUB code. Its value is an unsigned integer, from 0 to 15. match-score=0 can only be used when global=1. A higher match score results in longer alignments, and fewer long insertions.
match-n-score
The match-n-score option specifies the score for an aligned position where the read position and/or the reference position is an N code. This option is a signed integer, from -16 to 15.
mismatch-pen
The mismatch-pen option is the penalty (negative score) for a read nucleotide mismatching any reference nucleotide or IUPAC-IUB code, except N. This option is an unsigned integer, from 0 to 63. A higher mismatch penalty results in alignments with more insertions, deletions, and clipping to avoid SNPs.
gap-open-pen
The gap-open-pen option is the penalty (negative score) for opening a gap (ie, an insertion or deletion). This value is the penalty for a 0-base gap only. The total gap penalty is always gap-open-pen plus the gap length times gap-ext-pen. This option is an unsigned integer, from 0 to 127. A higher gap open penalty causes fewer insertions and deletions of any length in alignment CIGARs, with clipping or alignment through SNPs used instead.
gap-ext-pen
The gap-ext-pen option is the penalty (negative score) for extending a gap (ie, an insertion or deletion) by one base. This option is an unsigned integer, from 0 to 15. A higher gap extension penalty causes fewer long insertions and deletions in alignment CIGARs, with short indels, clipping, or alignment through SNPs used instead.
unclip-score
The unclip-score option is the score bonus for an alignment reaching the beginning or end of the read. An end-to-end alignment receives twice this bonus. This option is an unsigned integer, from 0 to 127. A higher unclipped bonus causes alignment to reach the beginning and/or end of a read more often, where this can be done without too many SNPs or indels. A nonzero unclip-score is useful when global=0 to make a soft preference for unclipped alignments. Unclipped bonuses have little effect on alignments when global=1, because end-to-end alignments are forced anyway (although 2 × unclip-score does add to every alignment score unless no-unclip-score=1). Note that, especially with longer reads, setting unclip-score much higher than gap-open-pen can have the undesirable effect of insertions at or near one end of a read being used as pseudoclipping, as happens with global=1.
no-unclip-score
The no-unclip-score option can be 0 or 1. The default is 1. When no-unclip-score is set to 1, any unclipped bonus (unclip-score) contributing to an alignment is removed from the alignment score before further processing, such as comparison with aln-min-score, comparison with other alignment scores, and reporting in AS or XS tags. However, the unclipped bonus still affects the best-scoring alignment found by Smith-Waterman alignment to a given reference segment, biasing toward unclipped alignments. When unclip-score > 0 causes a Smith-Waterman local alignment to extend out to one or both ends of the read, the alignment score stays the same or increases if no-unclip-score=0, whereas it stays the same or decreases if no-unclip-score=1. The default, no-unclip-score=1, is recommended when global=1, because every alignment is end-to-end, and there is no need to add the same bonus to every alignment. When changing no-unclip-score, consider whether aln-min-score should be adjusted. When no-unclip-score=0, unclipped bonuses are included in alignment scores compared to the aln-min-score floor, so the subset of alignments filtered out by aln-min-score can change significantly with no-unclip-score.
aln-min-score
The aln-min-score option specifies a minimum acceptable alignment score. Any alignment results scoring lower are discarded. Increasing aln-min-score can reduce the percentage of reads mapped, and decreasing it can increase that percentage. This option is a signed integer (negative alignment scores are possible with global=1). aln-min-score also affects MAPQ estimates. The primary contributor to MAPQ calculation is the difference between the best and second-best alignment scores. A read's best alignment score is saved in the AS SAM tag, and the second-best score (if available) is saved in the XS tag. aln-min-score serves as the suboptimal alignment score if nothing higher was found except the best score. Therefore, increasing aln-min-score can decrease reported MAPQ for some low-scoring alignments. You can use the min-score-coeff option to adjust aln-min-score as a function of read length.
min-score-coeff
The min-score-coeff option makes adjustments to aln-min-score per read base. When using the min-score-coeff and aln-min-score options together, you can define the minimum alignment score for each read as an affine function of read length. The minimum score for an N-base read is calculated as follows: (min-score-coeff) × N + (aln-min-score)
The min-score-coeff option is a signed number ranging from –64 to 63.999. If the value is 0, then the minimum alignment score is fixed at aln-min-score for all read lengths. You can use positive values for min-score-coeff to allow shorter reads to match with lower alignment scores, but require longer reads to achieve higher scores.
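The affine minimum-score rule can be written directly as a function. The numeric values below are hypothetical and chosen only to show the effect of a positive coefficient; they are not DRAGEN defaults.

```python
def min_alignment_score(read_len: int, aln_min_score: int, min_score_coeff: float = 0.0) -> float:
    """Minimum acceptable score for an N-base read: min-score-coeff * N + aln-min-score."""
    return min_score_coeff * read_len + aln_min_score


# Hypothetical settings: aln-min-score=22, min-score-coeff=0.5.
for n in (36, 100, 150):
    print(n, min_alignment_score(n, aln_min_score=22, min_score_coeff=0.5))
# With min-score-coeff=0, the floor is aln-min-score for every read length.
print(min_alignment_score(150, aln_min_score=22))
```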
DRAGEN can process paired-end data passed via a pair of FASTQ files or in a single interleaved FASTQ file. The hardware maps the two ends separately, and then determines a set of alignments that seem most likely to form a pair in the expected orientation and having roughly the expected insert size. The alignments for the two ends are evaluated for the quality of their pairing, with larger penalties for insert sizes far from the expected size. The following options control processing of paired-end data:
Reorientation
The pe-orientation option specifies the expected paired-end orientation. Only pairs with this orientation can be flagged as proper pairs. Valid values are as follows:
0--FR (default)
1--RF
2--FF
unpaired-pen
For paired-end reads, best mapping positions are determined jointly for each pair, according to the largest pair score found, considering the various combinations of alignments for each mate. A pair score is the sum of the two alignment scores minus a pairing penalty, which estimates the unlikelihood of insert lengths further from the mean insert than this aligned pair. The unpaired-pen option specifies how much alignment pair scores should be penalized when the two alignments are not in properly paired position or orientation. This option also serves as the maximum pairing penalty for properly paired alignments with extreme insert lengths. The unpaired-pen option is specified in Phred scale, according to its potential impact on MAPQ. Internally, it is scaled into alignment score space based on Smith-Waterman scoring parameters.
pe-max-penalty
The pe-max-penalty option limits how much the estimated MAPQ for one read can increase because its mate aligned nearby. A paired alignment is never assigned MAPQ higher than the MAPQ that it would have received mapping single-ended, plus this value. By default, pe-max-penalty = mapq-max = 255, effectively disabling this limit. The key difference between unpaired-pen and pe-max-penalty is that unpaired-pen affects calculated pair scores, and thus which alignments are selected, whereas pe-max-penalty affects only reported MAPQ for paired alignments.
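The difference between the two options can be illustrated with a small sketch. It assumes that the pairing penalty and unpaired-pen have already been scaled into alignment score space, because the scaling itself is not described here. All function and argument names are hypothetical.

```python
def pair_score(score1: int, score2: int, pairing_penalty: int,
               unpaired_pen_scaled: int, properly_paired: bool) -> int:
    """Sum of the two alignment scores minus a pairing penalty (illustrative)."""
    if properly_paired:
        # unpaired-pen also caps the penalty for proper pairs with extreme inserts.
        penalty = min(pairing_penalty, unpaired_pen_scaled)
    else:
        penalty = unpaired_pen_scaled
    return score1 + score2 - penalty


def capped_paired_mapq(paired_mapq: int, single_ended_mapq: int,
                       pe_max_penalty: int, mapq_max: int) -> int:
    """pe-max-penalty limits the MAPQ gain from pairing; mapq-max is the overall ceiling."""
    return min(paired_mapq, single_ended_mapq + pe_max_penalty, mapq_max)


# unpaired-pen changes which alignment pair wins.
print(pair_score(100, 90, pairing_penalty=5, unpaired_pen_scaled=40, properly_paired=True))   # 185
print(pair_score(100, 90, pairing_penalty=0, unpaired_pen_scaled=40, properly_paired=False))  # 150
# pe-max-penalty changes only the reported MAPQ.
print(capped_paired_mapq(paired_mapq=60, single_ended_mapq=10, pe_max_penalty=20, mapq_max=60))  # 30
```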
When working with paired-end data, DRAGEN must choose among the highest-quality alignments for the two ends to try to choose likely pairs. To make this choice, DRAGEN uses a skew normal insert model to evaluate the likelihood that a pair of alignments constitutes a pair. This model is based on the observation that common library preparation methods have insert-size distributions that are sometimes close to normal, but also sometimes clearly asymmetric, often skewing toward longer insert sizes. The skew normal insert model is used only for the DNA mode.
If you know the statistics of your library prep for an input file (and the file consists of a single read group), you can specify the characteristics of the insert-length distribution: mean, standard deviation, shape (or skewness), and three quartiles. These characteristics can be specified with the Aligner.pe-stat-mean-insert, Aligner.pe-stat-stddev-insert, Aligner.pe-stat-shape-insert, Aligner.pe-stat-quartiles-insert, and Aligner.pe-stat-mean-read-len options. However, it is typically preferable to allow DRAGEN to detect these characteristics automatically.
DRAGEN automatically samples the insert-length distribution. When the software starts execution, it runs a sample of up to 2,000,000 pairs through the aligner, calculates the distribution, and then uses the resulting statistics for evaluating all pairs in the input set.
The DRAGEN host software reports the statistics in its stdout log in a report, as follows:
Note that the Mean, Standard deviation, and Quartiles reported above are the sample mean, standard deviation, and quartiles calculated from the initial sample of up to 2,000,000 pairs, assuming a normal distribution. The sample mean and standard deviation are used to fit the parameters of a skew-normal distribution. A skew-normal distribution is defined by starting with an underlying normal distribution (whose mean we call position or xi and whose standard deviation we call scale or omega) and folding a varying portion of the probability mass from one side of the mean (e.g., the left side) to the other (e.g., the right side). The portion folded varies smoothly, from 0% at the original mean and approaching 100% far out in the tail. A shape parameter, which we call alpha, controls how rapidly the folded fraction increases. At alpha=0 there is no folding and the distribution remains normal.
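For reference, the standard skew-normal density with position xi, scale omega, and shape alpha is (2/omega) · phi(z) · Phi(alpha · z), where z = (x - xi)/omega and phi and Phi are the standard normal density and cumulative distribution. The sketch below evaluates that textbook formula; it is not DRAGEN's fitting code, and the parameter values are hypothetical.

```python
import math


def normal_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def skew_normal_pdf(x: float, xi: float, omega: float, alpha: float) -> float:
    """Skew-normal density with position xi, scale omega, and shape alpha."""
    z = (x - xi) / omega
    # The factor 2 * Phi(alpha * z) is 1 at the position xi (no folding) and,
    # for alpha > 0, moves mass from the left side to the right side.
    return (2.0 / omega) * normal_pdf(z) * normal_cdf(alpha * z)


# Hypothetical insert model: xi=300, omega=50, alpha=3 (skewed toward longer inserts).
print(skew_normal_pdf(350, 300, 50, 3) > skew_normal_pdf(250, 300, 50, 3))  # True
# With alpha=0 the density reduces to the underlying normal distribution.
print(math.isclose(skew_normal_pdf(320, 300, 50, 0), normal_pdf(0.4) / 50))  # True
```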
In the standard output, we also include the command line options needed to reproduce the DRAGEN run with the same insert stat settings. Note that when specifying stats on the command line, the skew-normal xi value should be used for Aligner.pe-stat-mean-insert, the omega value should be used for Aligner.pe-stat-stddev-insert, and the alpha value should be used for Aligner.pe-stat-shape-insert. If Aligner.pe-stat-shape-insert is not specified on the command line, a default value of 0 is assumed.
The insert length distribution for each sample is written to fragment_length_hist.csv. Each sample starts with the following lines:
These lines are followed by the histogram for the first ~2M read pairs for DNA (~100K read pairs for RNA). The histogram counts are aggregated across all read groups sharing the same sample ID (RGSM field).
When the number of sample pairs is very small, there is not enough information to characterize the distribution with high confidence. In this case, DRAGEN applies default statistics that specify a very wide insert distribution, which tends to admit pairs of alignments as proper pairs, even if they may lie tens of thousands of bases apart. In this situation, DRAGEN outputs a message, as follows:
The small samples formula calculates standard deviation as follows:
The default model is "standard deviation = 10000". If the first 2M reads are unmapped or if all pairs are improper pairs, then the standard deviation is set to 10000 and the mean and quartiles are set to 0. Note that the minimum value for standard deviation is 12, which is independent of the number of samples. Also, in DNA mode, when there are fewer than 1000 high-quality alignments, DRAGEN reverts to the normal-distribution-based insert model, because there are too few samples to accurately estimate the parameters of the skew-normal distribution.
For RNA-Seq data, the insert size distribution is not normal due to pairs containing introns. The DRAGEN software estimates the distribution using a kernel density estimator to fit a long tail to the samples. This estimate leads to a more accurate mean and standard deviation for RNA-Seq data and proper pairing.
DRAGEN writes detected paired-end stats into a tab-delimited log file in the output directory called .insert-stats.tab. This file contains the statistical distribution of detected insert sizes for each read group, including quartiles, mean, standard deviation, shape, minimum, and maximum. The information matches the standard-out report above. Additionally, the log file includes the minimum and maximum insert limits that DRAGEN applied for rescue scans. Note that the reported mean and standard deviation in this tab-delimited log file are the xi and omega parameters of the skew-normal distribution.
For paired-end reads, where a seed hit is found for one mate but not the other, rescue scans hunt for missing mate alignments within a rescue radius of the mean insert length. Normally, the DRAGEN host software sets the rescue radius to 2.5 standard deviations of the empirical insert distribution. But in cases where the insert standard deviation is large compared to the read length, the rescue radius is restricted to limit mapping slowdowns. In this case, a warning message is displayed, as follows:
Although the user can ignore this warning, or specify an intermediate rescue radius to maintain mapping speed, it is recommended to use 2.5 sigmas for the rescue radius to maintain mapping sensitivity. To disable rescue scanning, set max-rescues to 0.
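As a rough illustration of the default rescue radius, the sketch below computes the insert-length window searched around the mean. The clamp at 0 and the function name are assumptions, and the restricted-radius case that triggers the warning is not modeled because its rule is not given here.

```python
from typing import Tuple


def rescue_window(mean_insert: float, stddev_insert: float, sigmas: float = 2.5) -> Tuple[float, float]:
    """Insert-length range covered by a rescue scan: mean +/- sigmas * stddev."""
    radius = sigmas * stddev_insert
    # Assumed: insert lengths cannot be negative, so the lower bound is clamped at 0.
    return max(0.0, mean_insert - radius), mean_insert + radius


# Hypothetical library: mean insert 400, standard deviation 80.
print(rescue_window(400, 80))  # (200.0, 600.0)
```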
DRAGEN can track multiple independent alignments for each read. These alignments include the optimal (primary) one, as well as those mapping different subsegments of the read (chimeric/supplementary), and suboptimal (secondary) mappings of the read to different areas of the reference.
For DNA alignment by default, DRAGEN can emit one primary alignment for each read, up to three chimeric alignments (Aligner.supp-aligns=3), and no secondary alignments (Aligner.sec-aligns=0). The maximum user-specified value for supp-aligns or sec-aligns is 4095.
You can use the following configuration options to control how many of each type of alignment to include in DRAGEN output.
mapq-max
The mapq-max option specifies a ceiling on the estimated MAPQ that can be reported for any alignment, from 0 to 255. If the calculated MAPQ is higher, this value is reported instead. The default is 60.
supp-aligns, sec-aligns
The supp-aligns and sec-aligns options restrict the maximum number of supplementary (ie, chimeric and SAM FLAG 0x800) alignments and secondary (ie, suboptimal and SAM FLAG 0x100) alignments, respectively, that can be reported for each read. A maximum of 4095 supplementary alignments and 4095 secondary alignments can be reported for any read, in addition to a primary alignment. High settings for these two options reduce speed, so increase them only as needed.
sec-phred-delta
The sec-phred-delta option controls which secondary alignments are emitted based on the alignment score relative to the primary reported alignment. Only secondary alignments with likelihood within this Phred value of the primary are reported.
sec-aligns-hard
The sec-aligns-hard option suppresses the output of all secondary alignments if there are more secondary alignments than can be emitted. Set sec-aligns-hard to 1 to force the read to be unmapped when not all secondary alignments can be output.
supp-as-sec
When the supp-as-sec option is set to 1, supplementary (chimeric) alignments are reported with SAM FLAG 0x100 instead of 0x800. The default is 0. The supp-as-sec option provides compatibility with tools that do not support FLAG 0x800.
hard-clips
The hard-clips option is used as a field of 3 bits, with values ranging from 0 to 7. The bits specify alignments, as follows:
Bit 0--primary alignments
Bit 1--supplementary alignments
Bit 2--secondary alignments
Each bit determines whether local alignments of that type are reported with hard clipping (1) or soft clipping (0). The default is 6, meaning primary alignments use soft clipping and supplementary and secondary alignments use hard clipping.
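The bit field can be decoded as follows. This is a small illustrative helper, not part of DRAGEN; it only restates the bit assignments listed above.

```python
from typing import Dict


def decode_hard_clips(value: int) -> Dict[str, str]:
    """Map a hard-clips value (0 to 7) to the clipping style of each alignment type."""
    if not 0 <= value <= 7:
        raise ValueError("hard-clips must be between 0 and 7")
    # Bit 0 = primary, bit 1 = supplementary, bit 2 = secondary.
    names = ("primary", "supplementary", "secondary")
    return {name: "hard" if (value >> bit) & 1 else "soft" for bit, name in enumerate(names)}


print(decode_hard_clips(6))
# {'primary': 'soft', 'supplementary': 'hard', 'secondary': 'hard'}
```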
The GRCh38 human reference contains many more alternate haplotypes (ALT contigs) than previous versions of the reference. Generally, including ALT contigs in the mapping reference improves mapping and variant calling specificity, because misalignments are eliminated for reads matching an ALT contig but scoring poorly against the primary assembly. However, mapping with GRCh38's ALT contigs without special treatment can substantially degrade variant calling sensitivity in corresponding regions, because many reads align equally well to an ALT contig and to the corresponding position in the primary assembly.
The recommended and default approach for dealing with ALT contigs in DRAGEN is masking regions of ALT contigs that are highly similar to their corresponding primary contig. This approach is more accurate than liftover-based ALT awareness because there are many places where the "correct" or most useful liftover between a long ALT haplotype and the primary assembly is ambiguous. Incorrect liftover can produce dense clusters of mismapped reads and false variant calls. The base masking approach has the benefits of using ALT contigs without the negative consequences.
Masked hash tables are built from a standard hg19 or hg38 FASTA that contains ALT contigs. The hash table builder automatically masks regions of the ALT contigs with Ns.
With liftover based ALT-awareness, the mapper and aligner are aware of the liftover relationship between ALT contig positions and corresponding primary assembly positions. Seed matches within ALT contigs are used to obtain corresponding primary assembly alignments, even if the latter score poorly. Liftover groups are formed, each containing a primary assembly alignment candidate, and zero or more ALT alignment candidates that lift to the same location. Each liftover group is scored according to its best-matching alignments, taking properly paired alignments into account. The winning liftover group provides its primary assembly representative as the primary output alignment, with MAPQ calculated based on the score difference to the second-best liftover group. Emitting primary alignments within the primary assembly maintains normal aligned coverage and facilitates variant calling there. If the --Aligner.en-alt-hap-aln option is set to 1 and --Aligner.supp-aligns is greater than 0, then corresponding alternate haplotype alignments can also be output, flagged as supplementary alignments.
DRAGEN requires ALT-Aware hash tables for any hg19 or GRCh38 reference where ALT contigs are detected. To disable this requirement in DRAGEN, set the --ht-alt-aware-validate option to false.
The following is a comparison of alternative options for dealing with alternate haplotypes.
Mapping without ALT contigs in the reference:
False-positive variant calls result when reads matching an alternate haplotype misalign somewhere else.
Poor mapping and variant calling sensitivity where reads matching an ALT contig differ greatly from the primary assembly.
Mapping with ALT contigs but no ALT awareness:
False-positive variant calls from misaligned reads matching ALT contigs are eliminated.
Low or zero aligned coverage in primary assembly regions covered by alternate haplotypes, due to some reads mapping to ALT contigs.
Low or zero MAPQ in regions covered by alternate haplotypes, where they are similar or identical to the primary assembly.
Variant calling sensitivity is dramatically reduced throughout regions covered by alternate haplotypes.
Mapping with ALT contigs and ALT awareness:
False-positive variant calls from misaligned reads matching ALT contigs are eliminated.
Normal aligned coverage in regions covered by alternate haplotypes because primary alignments are to the primary assembly.
Normal MAPQs are assigned because alignment candidates in alternative haplotypes are not considered in competition.
Good mapping and variant calling sensitivity where reads matching an ALT contig differ greatly from the primary assembly.
To improve variant calling accuracy in segmental duplications and other regions difficult to map with Illumina reads, you can use the Multigenome (Graph) mapper in DRAGEN. The graph-based method uses additional variants from population haplotypes to establish alternate graph paths that reads could seed-map and align to. The Multigenome (Graph) mapper reduces mapping ambiguity because reads that contain population variants are attracted to the specific regions where the variants are observed.
When given a set of population variants (VCF) or haplotypes, graph reference modifications fall into the following types:
Alternate contigs represent population haplotypes. Alt-contigs can have a single variant or a combination of nearby phased variants.
Ambiguous codes (IUPAC codes) represent SNPs. To improve alignment, the reference FASTA is edited with isolated population SNPs.
Haplotype database. An additional haplotype database is built and used to augment the reference FASTA with population variants. A graph-mapper algorithm is used to score read alignments according to the variants in this database.
The DRAGEN graph hash tables are available to download from the DRAGEN Software Support Site page.
The DRAGEN small variant caller is a haplotype-based caller that performs local assembly of all reads in an active region into a de Bruijn graph (DBG). The assembly process uses all the read bases, including the soft-clipped bases of reads. The soft-clipped bases provide evidence for the presence of variants, specifically longer insertions and deletions that are not present in the read CIGAR and hence cannot be directly viewed in IGV.
The assembly and realignment step (using pair-HMM) performed by the variant caller aims to correct mapping errors made by the original aligner and improves overall variant caller accuracy. The evidence BAM shows how the variant caller sees the read evidence and how the reads have been realigned, which makes it a very useful debugging tool.
By default, the evidence BAM contains only a subset of the regions processed by the small variant caller. Only regions that have candidate indel variants and some percentage of soft-clipped reads in the pileup are realigned and output in the evidence BAM. This is done to reduce the run-time overhead needed to generate the evidence BAM.
The output of the VC Evidence BAM feature matches the output format selected using the --output-format option. The default format is bam.
A bam/cram/sam file with the suffix _evidence.bam/cram/sam and the corresponding index file. The evidence BAM can be enabled along with the regular BAM output from the Map-Align step. When multiple BAMs are passed as inputs to the variant caller, for example in Tumor-Normal calling, they are combined in the evidence BAM output and tagged with appropriate read groups.
A bed file with regions that were realigned and output in VC Evidence BAM with suffix ".realigned-regions.bed".
The evidence BAM consists of realigned reads, badly mated reads and reads that are disqualified by the variant caller based on the read likelihood scores.
Disqualified and Badly Mated reads
Reads that are badly mated (when the read and its mate are mapped to different chromosomes) are tagged with a BM tag (integer), and reads that are disqualified (based on read likelihoods) are tagged with the DQ tag (integer). These reads are filtered out by the genotyper in the variant caller. The alignment score tag AS is forced to 0 for such reads in the evidence BAM, and hence they can be filtered from the IGV pileup by setting the minimum AS score to 1 instead of 0.
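Outside IGV, the same filtering can be applied to SAM-formatted text. The sketch below parses the optional fields of a SAM line using only the standard library. It assumes that the mere presence of a BM or DQ tag marks a read as filtered, because the meaning of the tag values is not described here. The read names and records are made up for illustration.

```python
from typing import Dict, Union


def optional_tags(sam_line: str) -> Dict[str, Union[int, str]]:
    """Parse TAG:TYPE:VALUE optional fields, which follow the 11 mandatory SAM columns."""
    tags: Dict[str, Union[int, str]] = {}
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        tag, type_code, value = field.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    return tags


def is_filtered_by_genotyper(sam_line: str) -> bool:
    """Assumed rule: a read carrying a BM or DQ tag was badly mated or disqualified."""
    tags = optional_tags(sam_line)
    return "BM" in tags or "DQ" in tags


def passes_min_as(sam_line: str, min_as: int = 1) -> bool:
    """Mimic a minimum-AS display filter; reads without an AS tag count as 0."""
    return int(optional_tags(sam_line).get("AS", 0)) >= min_as


# Hypothetical evidence-BAM records, written as SAM text.
core = ["chr1", "1000", "60", "10M", "=", "1200", "210", "ACGTACGTAC", "FFFFFFFFFF"]
kept = "\t".join(["read_a", "99"] + core + ["AS:i:10"])
dropped = "\t".join(["read_b", "99"] + core + ["AS:i:0", "DQ:i:1"])
print(is_filtered_by_genotyper(kept), passes_min_as(kept))        # False True
print(is_filtered_by_genotyper(dropped), passes_min_as(dropped))  # True False
```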
Graph Haplotypes
When graph haplotype output is enabled using --vc-evidence-bam-output-haplotypes, all the haplotypes constructed by the de Bruijn graph are output in the evidence BAM as single reads covering the entire active region. The reads and haplotypes are tagged with different read groups, which makes them easy to distinguish in IGV. In IGV, use “Color Alignments By” or “Group Alignments By” > read group to separate the reads from the haplotypes. The haplotypes are tagged with read group EvidenceHaplotype and the reads are part of the EvidenceRead_Normal/Tumor read group.
The haplotypes are named Haplotype 1, Haplotype 2, and so on, and have an additional ‘HC’ tag (integer). The realigned reads also have an HC tag, which encodes the haplotype that best matches the read based on the likelihood calculation. Only reads that are supported by a single unique haplotype have the HC tag. Reads that match more than one haplotype well do not have an HC tag. This tag is primarily intended to enable highlighting of reads in IGV. Go to "Color Alignments By > Tag" and enter "HC" to view which reads are uniquely supported by a given graph haplotype.
The DRAGEN CNV caller leverages depth as its primary signal for calling copy number variants. Depth alone poses challenges for calling events that are less than 10kbp. The sensitivity of CNVs at lengths less than 10kbp can be improved by leveraging junction signals from the DRAGEN structural variant caller.
When both the DRAGEN CNV and SV caller are executed in a single invocation, then an additional integration step is done at the end of a DRAGEN run to improve the CNV calls. This feature is enabled automatically when DRAGEN detects a germline WGS analysis.
The SV/CNV Integration module takes in DEL and DUP calls from the output data structures of the germline CNV and SV callers, identifies putative matches, updates annotations, filters, scores, and outputs the refined records in a new output VCF. By leveraging junction signals from the SV caller and depth signals from the CNV caller, this approach allows for sensitive CNV detection down to 1kbp while also improving recall and precision across length scales. This is achieved by rescuing previously low quality calls if evidence is found from both callers, and also by adjusting CNV breakends to the more accurate SV breakends. The matching algorithm takes into account the proximity of the events as well as the transition states at the breakends, among other things.
The following is an example command line for running a germline WGS analysis for both CNV and SV.
Other optional CNV or SV parameters can also be added.
The original CNV and SV VCF output files, prior to integration, are available for users in the DRAGEN output directory, as described elsewhere. Additionally, there is an enhanced CNV VCF available with the *.cnv_sv.vcf.gz
extension. The VCF header lines in the *.cnv_sv.vcf.gz
mostly correspond to a concatenation of the individual header lines from the CNV and SV VCFs, with a few lines deduplicated and some new ones added. For details on the legacy header lines, please refer to the individual CNV and SV user guide sections.
Newly added header lines are described in the following table.
Records that can be matched or rescued will have annotations indicating the breakpoint linkage between a CNV and SV record. If a complete match is found, then the MatchSv
annotation will be present in the record, indicating the SV record's ID
field for this CNV record. Furthermore, the use of the SVCLAIM
field will indicate if the record has evidence arising from depth signal D
, or junction signals J
, or both DJ
.
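As an illustration of how these annotations might be consumed downstream, the sketch below parses a VCF INFO string and reports the evidence class from SVCLAIM together with the matched SV ID from MatchSv. The INFO strings in the usage note are invented for illustration, and the helper names are hypothetical.

```python
EVIDENCE = {
    "D": "depth only",
    "J": "junction only",
    "DJ": "depth and junction",
}


def parse_info(info):
    """Parse a VCF INFO string into a dict; flag entries map to True."""
    result = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def describe_evidence(info):
    """Return (evidence description, matched SV ID or None) for one record."""
    fields = parse_info(info)
    claim = fields.get("SVCLAIM")
    return EVIDENCE.get(claim, "unknown"), fields.get("MatchSv")
```

For example, `describe_evidence("END=2000;SVCLAIM=DJ;MatchSv=sv_1")` yields the depth-and-junction class with the SV ID `sv_1`, while a standalone CNV record with `SVCLAIM=D` yields no matched ID.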
Because of the mixing of standalone SV records and CNV records, the FORMAT field may have different annotations. For details on the CNV or SV specific annotations, please refer to the individual CNV and SV user guide sections.
Records that can be matched or rescued will have FILTER set to PASS. The original FILTERs are retained for records that were not matched or rescued. For example, the cnvLength
FILTER will still be applied to standalone CNV records (those with SVCLAIM=D
).
Example records are shown below.
DRAGEN emits the calls in the standard VCF format. By default for analyses other than somatic WGS, the VCF file includes only copy number gain and loss events (and LOH events, where allele-specific copy number is available). To include copy neutral (REF) calls in the output VCF, set --cnv-enable-ref-calls
to true.
For more information on how to use the output files to aid in debugging and analysis, see Signal Flow Analysis.
File extension: *.cnv.vcf.gz
The CNV VCF file follows the standard VCF format. Due to the nature of how CNV events are represented versus how structural variants are represented, not all fields are applicable. In general, if more information is available about an event, then the information is annotated. Some fields in the DRAGEN CNV VCF are unique to CNVs. The VCF header is annotated with ##source=DRAGEN_CNV
to indicate that the file is generated by the DRAGEN CNV pipeline.
The following is an example of some of the header lines that are specific to CNV:
The following header lines are specific to somatic WGS CNV calling:
ModelSource
The primary basis on which the final tumor model was chosen. The following values can be included:
DEPTH+BAF
: Depth+BAF signal is used to determine tumor model.
DEPTH+BAF_DOUBLED
: The initial depth+BAF model is duplicated based on VAF signal or excess segments at half the expected depth change.
DEPTH+BAF_DEDUPLICATED
: The depth+BAF model is deduplicated based on VAF signal or insufficient segments supporting a duplication.
DEPTH+BAF_WEAK
: Depth+BAF signal is used to determine lower-confidence tumor model.
VAF
: VAF signal is used to determine tumor model due to insufficient depth+BAF signal.
DEGENERATE_DIPLOID
: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. The diploid coverage is set to lowest value observed in a substantial number of bases in segments with BAF=50%.
SAMPLE_MEDIAN
: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. Diploid coverage set to sample median.
EstimatedTumorPurity
Estimated fraction of cells in the sample due to tumor. The range of this field is [0, 1] or NA
if a confident model could not be determined.
DiploidCoverage
Expected read count for a target bin in a diploid region. The numeric value is unlimited.
OverallPloidy
Length weighted average of tumor copy number for PASS events. The numeric value is unlimited.
AlternativeModelDedup
An alternative to the best model corresponding to one less whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation if the best model might involve a spurious genome duplication.
AlternativeModelDup
An alternative to the best model corresponding to one more whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation where the best model might have missed a true genome duplication.
OutlierBafFraction
A QC metric that measures the fraction of b-allele frequencies that are incompatible with the segment the BAFs belong to. High values might indicate a mismatched normal, substantial cross-sample contamination, or a different source of a mosaic genome, such as bone marrow transplantation. The range of this field is [0, 1].
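The OverallPloidy definition above is a length-weighted average, which can be written out as a short calculation. This is only a sketch of the arithmetic: the input tuples and the function name are hypothetical, and details such as which records DRAGEN counts beyond "PASS events" are not specified here.

```python
def overall_ploidy(segments):
    """Length-weighted mean copy number over PASS segments.

    segments: iterable of (length_bp, copy_number, filter_value) tuples.
    Returns None when no PASS segment is present.
    """
    total_length = 0
    weighted_sum = 0.0
    for length, copy_number, filter_value in segments:
        if filter_value != "PASS":
            continue  # only PASS events contribute
        total_length += length
        weighted_sum += length * copy_number
    if total_length == 0:
        return None
    return weighted_sum / total_length
```

Two PASS segments of equal length with copy numbers 2 and 4 give a value of 3.0, regardless of any filtered segments alongside them.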
All coordinates in the VCF are 1-based.
The CHROM column specifies the chromosome (or contig) on which the copy number variant being described occurs.
The POS column is the start position of the variant. According to the VCF specification, if any of the ALT alleles is a symbolic allele, such as <DEL>, then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism.
The ID column is used to represent the event. The ID field encodes the event type and coordinates of the event (1-based, inclusive). In addition to representing GAIN
, LOSS
and REF
events, in Somatic WGS CNV, the ID could include the Copy Neutral Loss of Heterozygosity (CNLOH) or Copy Number Gain with LOH (GAINLOH) events.
The REF column contains an N for all CNV events.
The ALT column specifies the type of CNV event. Because the VCF contains only CNV events, only the <DEL> or <DUP> entries are used. If REF calls are emitted, their ALT will always be .
. In Somatic WGS CNV, the ALT
field can contain two alleles, such as <DEL>,<DUP>, which allows representation of allele-specific copy numbers if they differ in copy number states.
The QUAL column contains an estimated quality score for the CNV event, which is used in hard filtering. Each CNV workflow has different defaults and the value used can be found in the VCF header.
The FILTER column contains PASS
if the CNV event passes all filters, otherwise the column contains the name of the failed filter. Default values are defined in the header line for each available FILTER.
Available FILTERs:
cnvLength
which indicates that the length of the CNV is lower than a threshold.
cnvQual
which indicates that the QUAL of the CNV is lower than a threshold.
Germline CNV has the following additional FILTERs:
cnvCopyRatio
which indicates that the segment mean of the CNV is not far enough from copy neutral.
Both Germline CNV workflows have the following additional FILTERs:
dinucQual
which indicates a CNV call where some of its dinucleotide percentages are outside typical ranges, and thus the call is likely to be a false positive.
Germline WGS CNV has the following additional FILTERs:
cnvBinSupportRatio
which indicates, for CNVs greater than 80kb, the percent span of supporting target intervals is lower than a threshold.
highCN
which indicates a CNV call with implausible copy number (>6).
Germline WES CNV has the following additional FILTERs:
cnvLikelihoodRatio
which indicates that the log10 likelihood ratio of ALT to REF is less than a threshold.
Both Somatic CNV workflows have the following additional FILTERs:
binCount
- Filters CNV events with a bin count lower than a threshold.
Somatic WGS CNV has the following additional FILTERs:
lengthDegenerate
- Marks records as non-PASS
ing based on each record's length (REFLEN
) when the caller returns the default model. Segments shorter than 1 Mb are assigned this filter when returning the default model.
segmentMean
- Marks records as non-PASS
ing based on each record's segment mean (SM
) when the caller returns the default model. Segments having insufficient SM
in DEL
s or DUP
s are assigned this filter when returning the default model.
Somatic WES CNV has the following additional FILTERs:
SqQual
- Marks records as non-PASS
ing based on each record's somatic quality (SQ) when the caller returns the default model. Segments having insufficient SQ are assigned this FILTER when returning the default model. SQ is a Phred-scaled score of the p-value from a two-sample t-test comparing the normalized counts of CASE vs PON.
The INFO column contains information representing the event.
REFLEN
indicates the length of the event.
SVLEN
is a signed representation of REFLEN
(e.g., a negative value indicates a deletion), and it is only present for non-REF records.
SVTYPE
is always CNV and only present for non-REF records.
END
indicates the end position of the event (1-based, inclusive).
If using a segment BED file, then the segment identifier is carried over from the input to SEGID
field.
In Somatic WGS CNV, the INFO column can also contain the HET
tag, when the call is considered sub-clonal. See HET-Calling Mode.
When matching CNV with SV output, additional INFO annotations are added. See CNV With SV Support.
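The relationship between REFLEN, SVLEN, and the coordinates can be checked with a few lines of code. This sketch handles single-allele records only and rests on two assumptions: that a duplication carries a positive SVLEN (the text above only states that deletions are negative), and that REFLEN equals END minus POS because POS is the padding base before the event. Both helper names are hypothetical.

```python
def expected_svlen(alt, reflen):
    """Signed length implied by REFLEN, or None for REF records (ALT '.')."""
    if alt == ".":
        return None  # SVLEN is only present for non-REF records
    if alt == "<DEL>":
        return -reflen
    if alt == "<DUP>":
        return reflen  # assumption: gains are reported as positive lengths
    raise ValueError(f"unsupported ALT for this sketch: {alt}")


def reflen_from_coordinates(pos, end):
    """Event length assuming POS is the padding base and END is inclusive."""
    return end - pos
```

Under these assumptions, a deletion with POS 1000 and END 2000 has REFLEN 1000 and SVLEN -1000.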
The common FORMAT fields are described in the header:
Germline WGS CNV includes the following FORMAT fields:
Germline WES CNV includes the following FORMAT fields:
Somatic WGS CNV and Somatic WES CNV with ASCN (allele-specific copy number) support include the following FORMAT fields:
Somatic WES CNV without ASCN support provides only the common FORMAT fields and does not include the CN
entry, since it does not estimate the tumor purity fraction and cannot make an estimate of the copy number.
Because germline copy number calling determines overall copy number rather than the copy number on each haplotype, the genotype type field contains missing values for diploid regions when CN is greater than or equal to 2. The following are examples of the GT field for various VCF entries:
The DRAGEN CNV pipeline provides a measure of the quality of the data for a sample. If using the WGS self-normalization method, the additional CoverageUniformity
metric is present in the VCF header. The metric is only available for germline samples. The CNV pipeline assumes that post-normalization target counts are independently and identically distributed (IID). Coverage in most high-quality WGS samples is uniform enough for the CNV caller to produce accurate calls, but some samples violate the IID assumption. Issues during library preparation or sample contamination can lead to several extreme outliers and/or waviness of target counts, which can result in a large number of false positive CNV calls. The CoverageUniformity
metric quantifies the degree of local coverage correlation in the sample to help identify poor-quality samples.
A larger value for this metric means the coverage in a sample is less uniform, which indicates that the sample has more nonrandom noise, and could be considered poor quality. The CoverageUniformity metric depends on factors other than sample quality, such as the cnv-interval-width
setting and sample mean coverage. DRAGEN recommends using this score to compare the quality of samples that have similar mean coverage and were run with the same command-line options. Because of this, DRAGEN CNV only provides the metric and does not take any action based on it.
DRAGEN CNV outputs metrics in CSV format. The output follows the general convention for QC metrics reporting in DRAGEN. The CNV metrics are output to a file with a *.cnv_metrics.csv
file extension. The following list summarizes the metrics that are output from a CNV run.
Sex Genotyper:
The estimated sex of the case sample and of each panel of normals sample is reported. For WGS workflows, the estimated sex karyotype is reported; for non-WGS workflows, the gender is reported.
Confidence score (ranging from 0.0 to 1.0). If the sample sex is specified, this metric is 0.0.
CNV Summary:
Bases in reference genome in use
Average alignment coverage over genome
Number of alignment records processed
Number of filtered records (total)
Number of filtered records (due to duplicates)
Number of filtered records (due to MAPQ)
Number of filtered records (due to being unmapped)
Coverage MAD
Median Bin Count
Number of target intervals
Number of normal samples
Number of segments
Number of amplifications
Number of deletions
Number of PASS amplifications
Number of PASS deletions
Coverage MAD and Median Bin Count are only printed for WES germline/somatic CNV. Coverage MAD is the median absolute deviation of normalized case counts. Higher values indicate noisier sample data (poor quality). Median Bin Count is the median of raw counts normalized by interval size.
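The median absolute deviation described above can be computed as follows. This is a plain MAD without any scaling constant; whether DRAGEN applies additional scaling or preprocessing is not stated here, so treat the function as an illustration of the statistic rather than a reproduction of the reported value.

```python
from statistics import median


def coverage_mad(normalized_counts):
    """Median absolute deviation of a sequence of normalized counts."""
    values = list(normalized_counts)
    center = median(values)
    return median(abs(value - center) for value in values)
```

A single extreme interval barely moves the result, which is why the metric reflects overall noise rather than isolated outliers.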
Intermediate stages of the pipeline produce various intermediate output files. These files can be useful for visualizing the evidence or results from each stage, and may aid in fine-tuning options.
All files have a structure similar to a BED file with optional header line(s).
The file *.target.counts.gz
is a compressed tab-delimited text file that contains the number of read counts per target interval. This is the raw signal as extracted from the alignments of the BAM or CRAM file. The format is identical for both the case sample and any panel of normals samples. There is also a bigWig representation of a target.counts.diploid
file, which is normalized to the normal ploidy level of 2 instead of raw counts.
It has the following columns:
Contig identifier
Start position
End position
Target interval name
Count of alignments in this interval
Count of improperly paired alignments in this interval
Header lines are also included that start with #
. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.
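A file with this layout can be read with the Python standard library. The sketch below assumes tab-separated columns in the order listed above, skips lines starting with `#`, and also skips a column-name row if one is present (detected by a non-numeric start field). The `TargetCount` type and `read_target_counts` function are hypothetical names.

```python
import gzip
from typing import NamedTuple


class TargetCount(NamedTuple):
    contig: str
    start: int
    end: int
    name: str
    count: float
    improper_pairs: float


def read_target_counts(path):
    """Read a gzipped, tab-delimited target counts file into a list."""
    records = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # meta information (version, command line)
            row = line.rstrip("\n").split("\t")
            if len(row) < 6 or not row[1].isdigit():
                continue  # blank, short, or column-name row
            records.append(
                TargetCount(row[0], int(row[1]), int(row[2]), row[3],
                            float(row[4]), float(row[5]))
            )
    return records
```

Counts are parsed as floats so that the same reader also works for files whose fifth column holds non-integer values.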
An example of a *.target.counts.gz
file is shown below.
B-allele counts are calculated at sites in the tumor sample where the normal sample is likely to be heterozygous. When analyzed in conjunction with a matched normal sample, the sites are those that are called as heterozygous SNVs in the normal sample. When analyzed in tumor-only mode, they are taken from a collection of sites that have high-frequency SNVs in the population. Each B-allele site consists of a reference allele and a variant allele, and the number of reads in the tumor sample supporting each of these alleles is counted.
B-allele counts are written both to gzipped tsv file *.ballele.counts.gz
and gzipped bedgraph file *.baf.bedgraph.gz
.
The tsv file format is the following:
Contig identifier
Start, BED-style (zero-based inclusive) start position of the reference allele
Stop, BED-style (zero-based inclusive) stop position of the reference allele
Base sequence for the reference allele
Base sequence for the first allele being counted
Base sequence for the second allele being counted
The number of qualified reads containing a sequence matching the first allele
The number of qualified reads containing a sequence matching the second allele
Additionally, in the case of B-allele sites from a population VCF, the following two additional columns are added after the columns listed above:
Population frequency for the first allele
Population frequency for the second allele
An example of B-allele counts file is provided below:
The bedgraph file format is similar to the BED format and it has the following columns:
Contig identifier
Start
Stop
Ratio of allele counts
The numerator and denominator of the ratio are determined by sorting the allele counts according to the priority of the corresponding bases. The order of the bases in descending priority is {A, T, G, C}.
When the priority of allele1 is higher than the priority of allele2, the output frequency is calculated by:
When the priority of allele2 is higher than the priority of allele1, the output frequency is calculated by:
By prioritizing the bases in this way, the output frequencies will be deterministically distributed in a roughly equal proportion above and below 0.5. When plotting these B-allele frequencies (e.g., in IGV), this gives an easy way to visually determine significant changes in b-allele frequency between neighboring segments of the genome. It also provides a similar visualization to that typically used for array data.
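The priority-based orientation can be sketched in code. Because the exact formulas are not reproduced in this text, the function below does not decide which allele goes in the numerator; it returns both fractions, ordered by base priority, and leaves the choice to the caller. The names `PRIORITY` and `oriented_fractions` are hypothetical, and only single-base alleles are handled.

```python
# Rank of each base; a lower rank means a higher priority (A first, C last).
PRIORITY = {"A": 0, "T": 1, "G": 2, "C": 3}


def oriented_fractions(allele1, count1, allele2, count2):
    """Return (higher-priority allele fraction, lower-priority allele fraction).

    Returns None when neither allele has supporting reads.
    """
    total = count1 + count2
    if total == 0:
        return None
    if PRIORITY[allele1] < PRIORITY[allele2]:
        high, low = count1, count2
    else:
        high, low = count2, count1
    return high / total, low / total
```

The orientation depends only on the bases involved, not on which allele has more reads, which is what makes the resulting frequencies deterministic and roughly balanced around 0.5.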
An example of the bedgraph file is shown below:
The file *.target.counts.gc-corrected.gz
contains the number of GC-corrected read counts per target interval. The format is equivalent to the *.target.counts.gz
file:
Contig identifier
Start position
End position
Target interval name
GC-corrected read counts in this interval
Count of improperly paired alignments in this interval
Header lines are also included that start with #
. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.
An example of a *.target.counts.gc-corrected.gz
file is shown below.
The file *.combined.counts.txt.gz
is a column-wise concatenation of individual *.target.counts.gz
and *.target.counts.gc-corrected.gz
used to form the panel of normals.
The file *.tn.tsv.gz
contains the normalized signal of the case sample, per target interval, i.e., the log2-normalized copy ratio signal. A strong signal deviation from 0.0 indicates a potential CNV event. The format is equivalent to the *.target.counts.gz
file:
Contig identifier
Start position
End position
Target interval name
Log2-normalized read counts in this interval
Count of improperly paired alignments in this interval
Header lines are also included that start with #
.
An example of a *.tn.tsv.gz
file is shown below.
File extension: *.seg
, *.seg.called
, *.seg.called.merged
Files containing the segments produced by the segmentation algorithm. The Segment_Mean
value of a segment is the ratio of the mean of that segment to the whole-sample median, without log transformation (linear copy-ratio). A strong signal deviation from 1.0 indicates a potential for a CNV event.
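The two scales used by these files are related by a base-2 logarithm: a neutral signal is 0.0 on the log2 scale of the *.tn.tsv.gz file and 1.0 on the linear scale of Segment_Mean. The sketch below only converts between the scales; it does not reproduce how DRAGEN derives Segment_Mean from the whole-sample median. The function names are hypothetical.

```python
import math


def log2_to_linear(log2_ratio):
    """Convert a log2 copy ratio to a linear copy ratio."""
    return 2.0 ** log2_ratio


def linear_to_log2(ratio):
    """Convert a linear copy ratio (> 0) to a log2 copy ratio."""
    return math.log2(ratio)
```

For instance, a log2 value of -1.0 corresponds to a linear ratio of 0.5, and a linear ratio of 1.5 corresponds to a log2 value of about 0.585.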
The *.seg
file has the following columns:
Sample name
Contig identifier
Start position
End position
Number of intervals in the segment
Linear copy-ratio of the segment
An example of a *.seg
file is shown below.
The *.seg.called
file is identical to the *.seg
file, with an additional column indicating the initial call for whether the segment is a duplication +
or a deletion -
.
The *.seg.called.merged
file is identical to the *.seg.called
file but with segments potentially merged when they meet internal merging criteria. In addition to the columns described above, this file has also the following columns:
QUAL
FILTER
Copy number assignment
Ploidy
Improper_Pairs count
In addition to segmentation of target counts, some workflows perform segmentation of B-allele loci. The output file has suffix *.baf.seg
and it has the same format as the *.seg
file with two modifications. Firstly, the Segment_Mean
value is the mean over B-allele loci of the smaller observed allele fraction. Secondly, there is an additional column:
BAF_SLM_STATE
: Integer between 0 and 10, indicating bins of minor-allele fraction (low to high), or .
when the BAF data are too variable to estimate a minor-allele fraction.
An example of segmentation output file is shown below:
The file *.cnv.purity.coverage.models.tsv
describes the different tested models and their log-likelihood. It has columns:
Model purity (Cellularity)
Model diploid coverage
Model log-likelihood
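To inspect the tested models, the rows can be ranked by log-likelihood. The sketch below assumes whitespace- or tab-separated columns in the order listed above and tolerates comment lines and a column-name row. The function name is hypothetical, and the top-ranked row is not necessarily the model that DRAGEN finally reports, since the final choice can also depend on the ModelSource logic.

```python
def best_model(lines):
    """Return (purity, diploid_coverage, log_likelihood) with the highest
    log-likelihood, or None if no numeric row is found."""
    best = None
    for line in lines:
        parts = line.split()
        if line.startswith("#") or len(parts) < 3:
            continue
        try:
            purity, coverage, loglik = (float(p) for p in parts[:3])
        except ValueError:
            continue  # column-name row
        if best is None or loglik > best[2]:
            best = (purity, coverage, loglik)
    return best
```

Sorting all rows by the third column instead of keeping only the maximum is a simple extension when several near-equivalent models need to be compared.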
An example is shown below:
To generate additional equivalent bigWig and gff files, set the --cnv-enable-tracks
option to true. These files can be loaded into IGV along with other tracks that are available, such as RefSeq genes. Using these tracks alongside publicly available tracks allows for easier interpretation of calls. DRAGEN autogenerates an IGV session XML file if tracks are generated by DRAGEN CNV. The *.cnv.igv_session.xml
can be loaded directly into IGV for analysis.
The following IGV tracks are automatically populated in the output IGV session file:
*.target.counts.bw
--- BigWig representation of the target counts bins. Setting the track view in IGV to barchart or points is recommended. Values are GC-corrected if GC correction was performed.
*.improper_pairs.bw
--- BigWig representation of the improper pairs counts. Setting the track view in IGV to barchart is recommended.
*.tn.bw
--- BigWig representation of the tangent normalized signal. Setting the track view in IGV to points is recommended.
*.seg.bw
--- BigWig representation of the segments. Setting the track view in IGV to points is recommended.
*.baf.bedgraph.gz
--- BED graph representation of B-allele frequency (if available). Setting the track view in IGV to points is recommended.
*.cnv.gff3
--- GFF3 representation of the CNV events. DEL events show as blue and DUP events show as red. Filtered events are a light gray. If REF events are enabled, then they will show up as green. An example of DRAGEN CNV gff3 is shown below (different CNV workflows might output different attributes on the 9th column):
For somatic WGS analyses, the following additional files are included in the IGV session xml:
*.baf.seg.bw
--- BigWig representation of the BAF segments. Setting the track view in IGV to points is recommended.
*.tumor.baf.bedgraph.gz
--- Bedgraph representation of the B-allele frequencies. Setting the track view in IGV to points and windowing function to none is recommended.
File extension: *.igv_session.xml
The IGV session XML file is prepopulated with track files generated by DRAGEN. The session file loads the reference genome that best matches the standard reference genomes in an IGV installation, by comparing the name of the --ref-dir
specified on the command-line. Standard UCSC human reference genomes are autodetected, but any variations from the standard reference genomes might not be autodetected. To edit the genome detection, alter the genome
attribute in the Session
element to the reference genome you would like for analysis before loading into IGV. The reference identifier used by IGV might differ from the actual name of the genome. The following is an example edited session file.
Note that depending on the IGV version installed, it may come prepackaged with different flavors of GRCh37. The reference naming conventions have changed so a user may have to edit the genome
field in the XML file directly. For example, IGV has traditionally packaged a b37
reference genome, but may also include a 1kg_v37
or a 1kg_b37+decoy
, which will appear on the IGV user interface as "1kg, b37" or "1kg, b37+decoy" respectively.
You can determine the correct encoding of a reference genome by going to File > Save Session...
and then inspecting the generated igv_session.xml file.
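The genome attribute can also be changed programmatically instead of editing the XML by hand. The following is a minimal sketch using the Python standard library; it assumes the root element of the session file is `Session`, as described above, and the function name and the genome identifier in the usage note are illustrative only.

```python
import xml.etree.ElementTree as ET


def set_session_genome(xml_in, xml_out, genome_id):
    """Rewrite the genome attribute of the Session element of an IGV session."""
    tree = ET.parse(xml_in)
    root = tree.getroot()
    if root.tag != "Session":
        raise ValueError(f"unexpected root element: {root.tag}")
    root.set("genome", genome_id)
    tree.write(xml_out, encoding="UTF-8", xml_declaration=True)
```

A call such as `set_session_genome("sample.cnv.igv_session.xml", "sample.edited.xml", "1kg_v37")` writes a copy of the session that points at the chosen genome identifier and leaves the original file untouched.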
DRAGEN CNV outputs can be ingested using third-party libraries in commonly used languages such as Python and R. The typically used files are:
*.target.counts.gz
or *.target.counts.gc-corrected.gz
, containing the number of alignments, or corrected alignments, per interval. Used to display the coverage profile across all intervals.
*.tn.tsv.gz
, containing the log2-normalized copy ratio per interval.
*.baf.bedgraph.gz
, if BAF is available, containing the BAF for each considered site. Used to display the BAF profile across all sites.
In all previously specified files, the format is similar to BED, allowing them to be loaded as any other tab-separated files.
Using R, a good starting point is the karyoploteR package. The main workflow involves reading the *.target.counts.gz
file as an R dataframe, converting it to a GRanges object, and then plotting the target intervals as points with the karyoploteR
package. The same workflow can be used to plot the GC-corrected counts, the log2 normalized copy ratios and the BAF profiles.
Using Python, the workflow is similar to the R workflow, but uses Python libraries such as pandas, to convert DRAGEN output files to dataframes, and matplotlib, to plot coverage and BAF profiles across the genome.
A similar workflow can be used to plot copy number calls (and minor copy number calls, if available) by using the *.cnv.gff3
output file. Some examples of DRAGEN output GFF3 are shown below:
Germline WGS
Somatic WGS
From the output GFF3, the typical steps to follow are to parse each segment coordinates and the CopyNumber
annotation (or any other annotation the user might want to plot), and to plot them using the libraries listed previously for coverage/BAF profiles (or any other library and language of user's choice).
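The parsing step can be sketched with the standard library alone. The function below assumes standard nine-column GFF3 records with `key=value` pairs separated by semicolons in the ninth column, and extracts the coordinates together with one numeric attribute (CopyNumber by default). The function name is hypothetical, and the record in the usage note is invented for illustration.

```python
def parse_gff3_copy_numbers(lines, attribute="CopyNumber"):
    """Return (contig, start, end, value) for records carrying the attribute.

    Coordinates are returned as found in the file (GFF3 is 1-based, inclusive).
    """
    segments = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF3 directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = dict(
            item.split("=", 1) for item in cols[8].split(";") if "=" in item
        )
        if attribute not in attrs:
            continue
        segments.append(
            (cols[0], int(cols[3]), int(cols[4]), float(attrs[attribute]))
        )
    return segments
```

For a record such as `chr1  DRAGEN  CNV  1001  2000  .  .  .  ID=x;CopyNumber=1` (tab-separated), the function returns one tuple with the contig, both coordinates, and the copy number, ready to be drawn as a horizontal segment by the plotting library of choice.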
To improve accuracy, the DRAGEN CNV Pipeline excludes genomic intervals if one or more of the target intervals failed at least one quality requirement. The excluded intervals are reported to *.excluded_intervals.bed.gz
file. The file is in BED format. It identifies the regions of the genome that are not callable for CNV analysis and gives the reason each interval was excluded in the fourth column. The following are the possible reasons for exclusion.
An example of a *.excluded_intervals.bed.gz
file is shown below:
The DRAGEN CNV Pipeline generates the PON Metrics File (.cnv.pon_metrics.tsv.gz
) if a Panel of Normals is provided and --cnv-generate-pon-metric-file
is set to true
. If PON size is less than 2, then an empty file will be generated.
The PON Metric File includes basic statistics of the coverage profile for each interval. To remove sample coverage bias, DRAGEN applies sample median normalization, and then computes the following metrics:
Example:
The DRAGEN CNV Pipeline generates the PON Correlation File (.cnv.pon_correlation.txt.gz
) if a Panel of Normals is provided. The PON Correlation File includes correlation between CASE sample and each PON sample.
Example:
The SegDups extension provides intermediate and final outputs. All intervals follow the bed format (0-based, start inclusive, end exclusive) and they are in tab-delimited text files (gzip compressed).
The final output has extension .cnv.segdups.rescued_intervals.tsv.gz
, and contains the rescued target intervals which can then be injected before segmentation. It has columns:
Chromosome name
Start - 0-based inclusive
Stop - 0-based exclusive
Target interval name prefixed with "target-wgs-"
Sample Counts (in header, identifier taken from RGSM) - log2-scale normalized counts for each interval
Improper pairs - Kept for compatibility with CNV workflow, set to 0 for rescued intervals
Target region ID - ID of the target region (aka pair of rescued target intervals)
The joint normalized coverage profile (log2-scale) for each region is provided in output to file .cnv.segdups.joint_coverage.tsv.gz
with columns:
Target region ID
Joint normalized coverage (log2-scale) of the two intervals in the target region
Copy Number Float - estimate of joint copy number for the target region (e.g., CNF ~ 4)
The differentiating sites' data is provided in output to file .cnv.segdups.site_ratios.tsv.gz
with columns:
Differentiating site name
Target (gene A) counts at site
Non-target (gene B) counts at site
Target ratio: gene A counts over total (i.e., gene A + gene B) counts at site
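The target ratio in the last column follows directly from the two count columns. A small sketch of that arithmetic is shown below; the function name is hypothetical, and the handling of sites with no counts is an assumption, because the file description does not say how such sites are reported.

```python
def target_ratio(target_counts, non_target_counts):
    """Gene A counts over total (gene A + gene B) counts at one site."""
    total = target_counts + non_target_counts
    if total == 0:
        return None  # assumption: ratio is undefined without any counts
    return target_counts / total
```

A site with 30 target counts and 10 non-target counts has a target ratio of 0.75.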
The VCF imputation tool can infer multi-allelic SNP and INDEL variants from low-coverage sequencing samples by packaging the GLIMPSE software (2020, Olivier Delaneau & Simone Rubinacci). The DRAGEN implementation of the GLIMPSE software allows for scalability of variant imputation:
with an end-to-end pipeline where the 3 phases of the GLIMPSE software (Chunk, Phase and Ligate) get executed in a single command, on one chromosome or on multiple chromosomes
with acceleration supported by Advanced Vector Extensions (AVX)
The DRAGEN VCF imputation tool infers variants on autosomes and chromosome X of haploid and diploid species.
Upon completion, the tool generates imputed variants based on a reference panel, a genetic map, and input samples provided. The DRAGEN secondary analysis software supports VCF imputation on human data and provides a reference panel and a genetic map for the hg38 reference build accessible on the DRAGEN Software Support site page.
For data other than human data (reference build hg38), users must provide their own reference panel and genetic map. A custom reference panel can be built with the DRAGEN Population Haplotyping tool.
Notes:
The output is in biallelic format, one line per ALT.
The VCF imputation tool only supports input sample data generated with the DRAGEN secondary analysis software.
The following is an example of commands to impute vcf on a single chromosome:
The following is an example of commands to impute vcf on chromosome X:
The imputation tool infers multi-allelic SNP and INDEL variants from low-coverage sequencing samples that are provided by the user. To maximize the accuracy of the imputed variant per sample, the tool leverages the information from all provided samples.
The sample(s) to be imputed must have the following format:
VCF, multi-sample VCF, BCF or multi-sample BCF (zip or unzipped). gVCF is not supported
Must contain GL (Genotype Likelihoods) or PL (phred-scaled genotype likelihoods) information
To achieve more accurate results, it is recommended to use input VCF generated with the force genotyping capability of the DRAGEN secondary analysis software so that it contains all the positions that are present in the reference panel. A file to be used as input of the force genotyping run of the DRAGEN variant caller, with all sites present in the IRP reference panel (built from human reference genome hg38) is provided in the Imputation files accessible in the DRAGEN Software Support Site page. When running the force genotype option (of DRAGEN variant caller) for imputation, it is recommended to disable the machine learning tool (--vc-ml-enable-recalibration=false).
To impute INDELs and get the best accuracy on INDELs, it is recommended:
to force genotype the input VCF with a SNPs-only sites.vcf file using DRAGEN argument --vc-forcegt-vcf. This SNPs-only sites.vcf file contains only the SNPs sites present in the reference panel. A SNPs-only VCF file is also available in the IRP reference panel (built from human reference genome hg38) in the Imputation files accessible in the DRAGEN Software Support Site page.
and to set the command --imputation-phase-impute-reference-only-variants
to true.
A per-chromosome reference panel in BCF format that lists all the imputation positions in the targeted regions along with the corresponding haplotypes must be provided. A reference panel (with prefix IRPv{x}) is available in the Imputation files accessible in the DRAGEN Software Support Site page. IRPv2.0 is a multi-allelic SNP, INDELs reference panel containing 3202 samples from the 1000 Genomes Project, which have been variant called using DRAGEN 4.0 against hg38.
Notes: IRPv1.x does not support chrX. IRPv2.x supports chrX. chrY and chrM are not supported.
A custom reference panel can be built with the DRAGEN Population Haplotyping tool. When providing a custom reference panel, ensure that each mixed-ploidy chromosome is divided into its PAR and non-PAR regions, and that the basename matches the subregion names defined in the JSON config file. The format should be <PREFIX>.basename
. Examples: IRPv2.0.chrX_par1, IRPv2.0.chrX_par2, and IRPv2.0.chrX_nonpar.
A genetic map per chromosome is required to obtain the imputed variants. You can use your own genetic map computed from the recombination rate of the species and its reference genome, or use a prebuilt genetic map corresponding to the human hg38 reference genome. A prebuilt map is available as part of the Imputation files, accessible at the DRAGEN Software Support Site page. DRAGEN does not generate custom genetic map files. The genetic map should follow the format:
<chromosome name>.gmap.gz
3 columns: position, chromosome number, distance (cM)
compliant with the reference genome used to generate the sample input
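Before running imputation with a custom genetic map, it can be useful to sanity-check the file. The sketch below assumes whitespace-separated columns in the order listed above and tolerates a column-name row. The monotonicity check reflects a general property of genetic maps (position and cumulative distance should not decrease along a chromosome) and is not a documented DRAGEN requirement. Both function names are hypothetical.

```python
import gzip


def read_genetic_map(path):
    """Read a gzipped genetic map into (position, chromosome, cM) tuples."""
    entries = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) != 3:
                continue
            try:
                position, centimorgans = int(parts[0]), float(parts[2])
            except ValueError:
                continue  # column-name row
            entries.append((position, parts[1], centimorgans))
    return entries


def is_monotonic(entries):
    """True if positions and cM distances never decrease from row to row."""
    return all(
        a[0] <= b[0] and a[2] <= b[2] for a, b in zip(entries, entries[1:])
    )
```

A map that fails the check is usually unsorted or mixes rows from more than one chromosome.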
This config file allows the proper handling of haploid/diploid chromosomes. This file is present in the same directory of the input reference panel with PREFIX and is available in the DRAGEN Software Support Site page. It must follow the naming convention: {$DIR}/{$PREFIX}.config.json. When the config file is not present in the directory, the tool assumes that the imputation is done on all diploid chromosomes.
In the IRP reference panel folder available on DRAGEN support page, the JSON config file corresponds to human data. The user can edit this file if imputation is done on another species.
For imputing VCF on human data with typename “M” for Male and “F” for Female (“M” and “F” are the values used in the sample type file):
The JSON config file is made of two fields as defined in the table below
Note: ensure the subregion names match the genetic map name. Example: if "chrX_nonpar" is defined in the "region" field of the JSON config file, then the genetic map corresponding to chromosome X non PAR region in the Reference Panel folder must be named "chrX_nonpar".gmap.gz.
The sample type file is required when haplotyping is performed on non-PAR regions of mixed ploidy chromosomes to define the typename of each sample.
The sample type file is a txt file with the following format
2 columns, tabs or space delimited
First column: list of all sample names present in the input sample
Second column: typename value for each sample. This typename value should match the typenames used in the JSON config file.
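As an illustration of the format described above, the following sketch parses a sample type file and checks each typename against the set used in the JSON config file. The function name and the sample names are hypothetical; only the two-column, tab- or space-delimited layout comes from this guide.

```python
def parse_sample_types(text, allowed_typenames):
    """Parse a two-column sample type file into {sample_name: typename}."""
    sample_types = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # splits on tabs or spaces
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(fields)}")
        sample, typename = fields
        if typename not in allowed_typenames:
            raise ValueError(f"line {line_no}: unknown typename {typename!r}")
        if sample in sample_types:
            raise ValueError(f"line {line_no}: duplicate sample {sample!r}")
        sample_types[sample] = typename
    return sample_types


# Hypothetical sample names; "M" and "F" are the typenames used for human data.
content = "sampleA\tM\nsampleB F\n"
sample_types = parse_sample_types(content, {"M", "F"})
```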
The VCF imputation tool generates several outputs:
The imputed variant file with concatenated imputed variants: one single VCF or msVCF file for all specified regions/chromosomes with name <prefix>.impute.vcf.gz
The intermediate files:
chunk regions to be passed along to the internal Phase step with name <prefix>.impute.chunk.out.txt
imputed variants per chunks identified: VCF or msVCF depending on the input sample format with name <prefix>_chr_start-end.impute.phase.vcf.gz
text file with path to all the <prefix>_chr_start-end.impute.phase.vcf.gz
generated with name <prefix>.impute.phase.out.txt
Note: while the imputation tool can impute multi-allelic positions, the output is in biallelic format, one line per ALT. The bcftools tool can be used to post-collapse all ALT in one line with the command: bcftools norm -m +snps
Note: with this end-to-end implementation of the GLIMPSE software, the parameters window_size and buffer_size are respectively set to 2 Mb and 200 kb.
Short tandem repeats (STRs) are regions of the genome consisting of repetitions of short DNA segments called repeat units. STRs can expand to lengths beyond the normal range and cause mutations called repeat expansions. Repeat expansions are responsible for many diseases, including Fragile X syndrome, amyotrophic lateral sclerosis, and Huntington's disease.
ExpansionHunter de novo allows the discovery of expanded STR regions from paired-end reads across a cohort of samples. It is designed to work with PCR-free samples of 100-200bp paired-end reads at >30X coverage.
Note:
STRs shorter than the read length are ignored; the program is appropriate only for detecting expansions that exceed the read length.
The location of each reported STR is approximate (up to about 500bp-1Kbp)
STRs are not genotyped; the program reports a depth-normalized count of reads originating inside each STR; this count can be used as a very approximate measure of the repeat length
To achieve best results all samples must be sequenced on the same instrument to similar coverage, have the same read and fragment lengths, and be subjected to the same computational pre-processing (e.g. reads must be aligned by the same aligner)
For more information refer to:
Dolzhenko, E., Bennett, M.F., Richmond, P.A. et al. ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data. Genome Biol 21, 102 (2020)
Briefly, the workflow can be separated in two distinct steps: profiling and analysis. In the profiling step, repetitive reads are found and used to infer the location of potential STR regions. The regions and the respective read counts are then saved in a "profile" on disk. The profiling step is run for each sample and the resulting profiles are merged into a single dataset for the analysis. In the analysis step the user needs to provide a table describing the experimental design to run either an outlier analysis which tests one sample against the rest or a case-control analysis where the samples are split in two groups.
In DRAGEN, the analysis is more streamlined than the standalone EHdn tool and has considerable performance improvements, while retaining the same output.
Note: The output in the case of outlier analysis might not be exactly identical because it involves bootstrapping. In DRAGEN, the random sampling function used for bootstrapping differs from the NumPy implementation used in the standalone EHdn.
Note: The DRAGEN implementation is based on EHdn version v0.9.1
The two steps of the workflow, profiling and analysis, are performed by two separate DRAGEN commands.
In the first step we compute the profiles which are going to be saved as ProtoBuf messages (<out_prefix>.data
). The profile can be saved in a specific directory with the --str-profiler-output-directory
flag. The sample name will be saved in the profile and can be specified at the profiling stage with the flag --str-profiler-sample-name
. If not specified, the sample name in the RGSM field will be used instead.
DRAGEN has to be called once for each sample, for example with the command:
After all the profiles are computed, they have to be divided in 'cases' and 'controls' directories. This can be achieved while computing the profiles by passing the directory with the --str-profiler-output-directory
flag. The input can be a list of samples with the --fastq-list
option. DRAGEN can take as input a list of FASTQ files and save each profile in the directory specified with --str-profiler-output-directory
. A list of cases and a list of controls can be run in this manner.
Example command:
The analysis is performed with a separate DRAGEN command, which takes as input the path to the two directories.
Two analysis types can be specified:
outlier
= bootstraps the sampling distribution of the 95% quantile and then calculates the z-scores for the case samples
casecontrol
= case and control counts are compared with a one-sided Wilcoxon rank-sum test and a Bonferroni correction is applied to the resulting p-values
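To make the outlier analysis more concrete, here is a minimal sketch of bootstrapping the 95% quantile and deriving z-scores for the case samples. It is not the DRAGEN or EHdn code: the function names, the pooling of case and control counts for resampling, the linear-interpolation quantile, and the `min_sigma` floor are all assumptions made for illustration.

```python
import random
import statistics


def quantile(values, q):
    """Return the q-th quantile using linear interpolation."""
    ordered = sorted(values)
    idx = q * (len(ordered) - 1)
    lo = int(idx)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (idx - lo)


def outlier_z_scores(case_counts, control_counts, rounds=1000, seed=0, min_sigma=1e-6):
    """Bootstrap the 95% quantile of the counts and score each case against it."""
    rng = random.Random(seed)
    pooled = list(case_counts) + list(control_counts)  # assumption: resample all samples
    resampled_quantiles = []
    for _ in range(rounds):
        resample = rng.choices(pooled, k=len(pooled))  # sample with replacement
        resampled_quantiles.append(quantile(resample, 0.95))
    mu = statistics.mean(resampled_quantiles)
    # Assumed floor on sigma to avoid dividing by zero when all resamples agree.
    sigma = max(statistics.pstdev(resampled_quantiles), min_sigma)
    return [(count - mu) / sigma for count in case_counts]


# Made-up depth-normalized read counts for one locus.
controls = [1.0 + 0.05 * i for i in range(20)]
cases = [8.0, 1.2]
z_scores = outlier_z_scores(cases, controls)
```

A larger `rounds` value gives a more stable estimate of the quantile distribution at a proportionally higher compute cost.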
Providing the --str-profile-analysis flag
will trigger the analysis workflow. Example command:
The standalone version of EHdn performs 100 rounds of resampling during bootstrapping due to computational constraints. In DRAGEN the resampling has been increased to 1000 by default thanks to the much faster computation. This number can be adjusted with the flag --str-profiler-resampling-rounds
. Increasing the number of resampling cycles will improve the precision of the estimate but also linearly increase the compute times.
DRAGEN will spread the computation across 48 threads by default, but the number can be adjusted on the command line with the flag --str-profiler-threads
.
The output (as in the standalone EHdn implementation) is composed of two tables, one for the "locus" level analysis and one for the "motif" level analysis, which will be saved as <output-prefix>.str_profiler_locus.tsv
and <output-prefix>.str_profiler_motif.tsv
respectively. Below is a description of the locus analysis output. The motif table is the same as the locus table but without the contig, start and end columns.
The LPA Caller is capable of identifying the LPA Kringle-IV-2 (KIV-2) VNTR unit copy number from whole-genome sequencing (WGS) data. Due to high sequence similarity between the genes, a specialized caller is necessary to resolve the VNTR unit copy number.
The LPA Caller performs the following steps:
Determines total LPA KIV-2 VNTR unit copy number.
Determines the heterozygous LPA KIV-2 VNTR unit copy number if heterozygous KIV-2 markers are present.
Calls small variants in the LPA KIV-2 VNTR region based on the unit copy number along with allele counts from read information.
The LPA Caller requires WGS data aligned to a human reference genome with at least 30x coverage. Reference genome builds must be based on hg19
, GRCh37
, or hg38
.
The first step of LPA calling is to determine the unit copy number of LPA KIV-2. Reads aligned to the LPA KIV-2 are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples.
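The normalization to a diploid baseline can be sketched as follows. This is a simplified illustration, not the DRAGEN implementation: the function name is hypothetical, GC-bias correction is omitted, the baseline is taken as the median depth of the normalization regions, and the KIV-2 depth is assumed to be already expressed per single repeat unit.

```python
import statistics


def normalized_copy_number(region_depth, normalization_depths, baseline_ploidy=2):
    """Scale a region's depth so that the normalization regions equal the baseline ploidy."""
    baseline_depth = statistics.median(normalization_depths)
    if baseline_depth <= 0:
        raise ValueError("normalization regions have no coverage")
    return baseline_ploidy * region_depth / baseline_depth


# Made-up depths: the normalization regions sit at ~30x (two copies),
# while the per-unit KIV-2 depth is 600x.
kiv2_copy_number = normalized_copy_number(600.0, [29.0, 30.0, 31.0])
```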
The second step of LPA calling is to determine the heterozygous unit copy number of LPA KIV-2. Heterozygous unit copy number is determined using two specific linked SNV sites that have been identified as a combined marker allele that is always present in every copy of the repeat unit concordantly. That is, if any copy of the repeat unit in an LPA haplotype contains the ALT alleles at those two SNV sites, then every copy of the repeat unit in that LPA haplotype contains the ALT alleles at those two sites. The relative read coverage for the ALT and REF cases at these sites can therefore be used to determine the proportions of overall copy numbers across the KIV repeat array that belong to each haplotype.
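A minimal sketch of this proportional split is shown below. The function name is hypothetical, and the sketch assumes the REF and ALT read counts have already been summed over the two marker sites; the actual caller may weight or model these counts differently.

```python
def split_copy_number(total_copy_number, ref_reads, alt_reads):
    """Split the total unit copy number between the REF and ALT marker haplotypes."""
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("no reads cover the marker sites")
    alt_copies = total_copy_number * alt_reads / depth
    return total_copy_number - alt_copies, alt_copies


# Made-up counts: a quarter of the marker reads carry the ALT alleles.
ref_copies, alt_copies = split_copy_number(40, ref_reads=300, alt_reads=100)
```

With a total of 40 units, these counts assign 30 units to the REF haplotype and 10 units to the ALT haplotype.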
Two small variants are detected from the read alignments. These variants occur in the LPA KIV-2 VNTR region, where reads mapping to any of the 6 units in the reference are used for variant calling.
For each variant, reads containing either the variant allele or the nonvariant allele are counted and a binomial model is used to determine the likelihood for each possible variant allele copy number up to the maximum possible as determined from the LPA KIV-2 VNTR unit copy number.
The LPA Caller generates a <output-file-prefix>.targeted.json
file in the output directory. The output file is a JSON formatted file containing the fields below.
The lpa
fields are defined as below.
For the variants
the fields are defined as below.
The LPA Caller also generates a <output-file-prefix>.targeted.vcf[.gz]
file in the output directory. The output file is a VCFv4.2
formatted file, which may be compressed.
Examples of the LPA Caller content in the output json file are shown below.
The following are example output files:
The CYP2B6 Caller is capable of genotyping the CYP2B6 gene from whole-genome sequencing (WGS) data. Due to high sequence similarity with its pseudogene paralog CYP2B7 and a wide variety of common structural variants (SVs), a specialized caller is necessary to resolve variants and identify likely star allele haplotypes.
The CYP2B6 Caller performs the following steps:
Determines total CYP2B6 and CYP2B7 copy number from read depth.
Determines CYP2B6-derived copy number at CYP2B6/CYP2B7 differentiating sites.
Detects SV breakpoints by calculating the changes in CYP2B6-derived copy number along the CYP2B6 gene.
Calls small variants in CYP2B6 copies.
Identifies star alleles from the detected SV breakpoints and small variants.
Identifies the most likely genotype for the called star alleles.
The first step of CYP2B6 calling is to determine the combined copy number of CYP2B6 and CYP2B7. Reads aligned to regions in either CYP2B6 or CYP2B7 are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples. The combined CYP2B6 and CYP2B7 copy number is then calculated from the average sequencing depth across the CYP2B6 and CYP2B7 regions.
The CYP2B6-derived copy number is calculated at 99 predefined differentiating sites across the CYP2B6 gene. The differentiating sites are selected at positions with sequence differences in CYP2B6 and CYP2B7 where calling the CYP2B6-derived copy number shows an accuracy of greater than 98% based on sequencing data from the 1000 Genomes Project.
For each differentiating site, CYP2B6-specific and CYP2B7-specific alleles are counted in reads mapping to either CYP2B6 or the homologous region in CYP2B7. The CYP2B6-derived copy number is then calculated from the two gene-specific allele counts using the total CYP2B6 and CYP2B7 copy number calculated from the previous step.
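The per-site calculation can be illustrated with a simple proportional estimate. This is a sketch under stated assumptions: the function name is hypothetical, and rounding the proportional value to the nearest integer stands in for whatever statistical model the caller actually uses.

```python
def gene_derived_copy_numbers(total_copy_number, site_allele_counts):
    """Estimate the gene-derived copy number at each differentiating site.

    site_allele_counts holds (gene_specific_reads, paralog_specific_reads)
    per site; the total copy number comes from the read-depth step.
    """
    calls = []
    for gene_reads, paralog_reads in site_allele_counts:
        depth = gene_reads + paralog_reads
        if depth == 0:
            calls.append(None)  # no informative reads at this site
            continue
        calls.append(round(total_copy_number * gene_reads / depth))
    return calls


# Made-up counts for four sites with a total CYP2B6 + CYP2B7 copy number of 4.
site_counts = [(31, 29), (28, 33), (45, 16), (0, 0)]
site_calls = gene_derived_copy_numbers(4, site_counts)
```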
The CYP2B6-derived copy number along the CYP2B6 gene is used to identify known population structural variants (SVs), including whole gene deletions and duplications as well as certain gene conversions and gene fusions. The following fusion variants are detected:
35 small variants that define various star alleles are detected from the read alignments. All of these variants are in unique (nonhomologous) regions of CYP2B6 with high mapping quality. Only reads mapping to CYP2B6 are used for calling variants in nonhomologous regions.
For each variant, reads containing either the variant allele or the nonvariant alleles are counted. A binomial model that incorporates the sequencing errors is then used to determine the most likely variant copy number (0 for nonvariant).
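The binomial model can be sketched as follows. The function name, the symmetric error model, and the default error rate of 0.01 are assumptions for illustration; the guide does not specify how sequencing errors enter the model.

```python
from math import comb


def rank_variant_copy_numbers(alt_reads, ref_reads, gene_copies, error_rate=0.01):
    """Rank candidate variant copy numbers (0..gene_copies) by binomial likelihood."""
    if gene_copies < 1:
        raise ValueError("gene_copies must be at least 1")
    total_reads = alt_reads + ref_reads
    ranked = []
    for copies in range(gene_copies + 1):
        # Expected fraction of variant reads, allowing for miscalled bases.
        fraction = copies / gene_copies
        p = fraction * (1 - error_rate) + (1 - fraction) * error_rate
        likelihood = comb(total_reads, alt_reads) * p**alt_reads * (1 - p) ** ref_reads
        ranked.append((likelihood, copies))
    ranked.sort(reverse=True)  # most likely copy number first
    return [copies for _, copies in ranked]


# Made-up read counts: the first entry of each ranking is the most likely call.
heterozygous = rank_variant_copy_numbers(alt_reads=15, ref_reads=15, gene_copies=2)
nonvariant = rank_variant_copy_numbers(alt_reads=0, ref_reads=30, gene_copies=2)
two_of_three = rank_variant_copy_numbers(alt_reads=20, ref_reads=10, gene_copies=3)
```

The remaining entries of each ranking are the less likely copy numbers, ordered by decreasing likelihood.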
Samples with poor sequencing quality or greater than five copies of CYP2B6 will have allele counts with higher variance. This elevated variance increases the chance that the most likely variant copy number is wrong. To handle these cases, the small variant caller also indicates alternate, less likely variant copy numbers.
The recombinant (gene conversion) variant 18053A>G is detected by phasing the variant site with five flanking differentiating sites. When the haplotypes formed from phasing these sites support the gene conversion in CYP2B6, a read depth analysis at the gene conversion breakpoints (transitions from either CYP2B6->CYP2B7 or CYP2B7->CYP2B6) is performed. When the posterior probability that there is at least one gene conversion variant is above 0.7, DRAGEN uses the variant for star allele identification.
The called SVs and small variant genotypes are matched against the definitions of 39 different star alleles. This might result in different sets of star alleles matching the called variant genotypes, such as with *1
, *6
and *4
, *49
where both sets of star alleles contain the same two small variants. When the small variant caller emits alternate, less likely variant copy numbers in addition to the most likely variant copy numbers this might result in different sets of star alleles being identified, since these alternate sets of variant copy numbers are also matched to the star allele definitions. The number of matched star alleles must match the number of CYP2B6-derived gene copies determined from previous steps. If no variant genotypes can be matched to a set of star alleles, the CYP2B6 Caller returns a no call during the genotyping step with filter value No_call
.
Given a possible set of star alleles, the genotyping step attempts to identify the two likely haplotypes that contain all star alleles in the set. The likelihood of any given genotype is determined from a table of population frequencies determined from the 1000 Genomes Project and the genotype with the highest population frequency is selected. When two or more possible genotypes are identified with similar population frequencies, then all genotypes are emitted. This results in a call with filter value More_than_one_possible_genotype
.
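The selection rule can be sketched as below. The function name, the frequencies, and the `min_ratio` threshold that decides when two frequencies count as "similar" are all hypothetical; the guide does not state how similarity is measured.

```python
def select_genotypes(candidate_genotypes, population_frequency, min_ratio=0.9):
    """Keep the most frequent genotype and any candidates with a similar frequency.

    Returns (kept_genotypes, ambiguous). An ambiguous result corresponds to a
    call with filter value More_than_one_possible_genotype.
    """
    scored = [(population_frequency.get(g, 0.0), g) for g in candidate_genotypes]
    best = max(freq for freq, _ in scored)
    kept = sorted(g for freq, g in scored if freq >= min_ratio * best)
    return kept, len(kept) > 1


# Made-up frequencies for two genotypes built from the same small variants.
clear_call = select_genotypes(["*1/*6", "*4/*49"], {"*1/*6": 0.2, "*4/*49": 0.001})
tied_call = select_genotypes(["*1/*6", "*4/*49"], {"*1/*6": 0.02, "*4/*49": 0.019})
```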
An example of the CYP2B6 caller content in the output is as follows:
For CYP2B6 caller, the fields are defined as follows.
The CYP2D6 Caller is capable of genotyping the CYP2D6 gene from whole-genome sequencing (WGS) data and is derived from the method implemented in Cyrius¹. Due to high sequence similarity with its pseudogene paralog CYP2D7 and a wide variety of common structural variants (SVs), a specialized caller is necessary to resolve variants and identify likely star allele haplotypes.
The CYP2D6 Caller performs the following steps:
Determines total CYP2D6 and CYP2D7 copy number from read depth.
Determines CYP2D6-derived copy number at CYP2D6/CYP2D7 differentiating sites.
Detects SV breakpoints by calculating the changes in CYP2D6-derived copy number along the CYP2D6 gene.
Calls small variants in CYP2D6 copies.
Identifies star alleles from the detected SV breakpoints and small variants.
Identifies the most likely genotype for the called star alleles.
The first step of CYP2D6 calling is to determine the combined copy number of CYP2D6 and CYP2D7. Reads aligned to regions in either CYP2D6 or CYP2D7 are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples. The combined CYP2D6 and CYP2D7 copy number is then calculated from the average sequencing depth across the CYP2D6 and CYP2D7 regions.
The CYP2D6-derived copy number is calculated at 117 predefined differentiating sites across the CYP2D6 gene. The differentiating sites are selected at positions with sequence differences in CYP2D6 and CYP2D7 where calling the CYP2D6-derived copy number shows an accuracy of greater than 98% based on sequencing data from the 1000 Genomes Project.
For each differentiating site, CYP2D6-specific and CYP2D7-specific alleles are counted in reads mapping to either CYP2D6 or the homologous region in CYP2D7. The CYP2D6-derived copy number is then calculated from the two gene-specific allele counts using the total CYP2D6 and CYP2D7 copy number calculated from the previous step.
The CYP2D6-derived copy number along the CYP2D6 gene is used to identify known population structural variants (SVs), including whole gene deletions and duplications as well as certain gene conversions and gene fusions. The following fusion variants are detected:
In addition to the exon 9 fusion breakpoints, exon 9 can participate in CYP2D7 gene conversion resulting in an embedded CYP2D7 sequence instead of a true hybrid. The structural variant caller also detects exon 9 gene conversions. Because only changes in CYP2D6-derived copy number yield structural variant calls, there might be rare cases where two hybrid copies result in no structural variant calls, for example when both *36
and *13
with fusion breakpoint in exon 9 are present. However, the structural variant caller is capable of detecting multiple copies of the same fusion type (eg, *36x2
) or cases where both an exon 9 gene conversion copy and an exon 9 2D6-2D7 hybrid are present.
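The dependence on copy number changes can be illustrated with a small sketch. The function names and the per-site copy number profiles are hypothetical; the example only shows why two complementary hybrid copies can cancel out and leave no detectable change.

```python
def copy_number_changes(site_copy_numbers):
    """Return (site_index, previous_cn, new_cn) wherever the copy number changes."""
    changes = []
    for i in range(1, len(site_copy_numbers)):
        previous, current = site_copy_numbers[i - 1], site_copy_numbers[i]
        if current != previous:
            changes.append((i, previous, current))
    return changes


def add_profiles(*profiles):
    """Sum per-site gene-derived copy number contributions from several gene copies."""
    return [sum(values) for values in zip(*profiles)]


# Hypothetical contributions at five differentiating sites along the gene.
two_full_copies = [2, 2, 2, 2, 2]
hybrid_a = [1, 1, 1, 0, 0]  # gene-derived sequence before the breakpoint only
hybrid_b = [0, 0, 0, 1, 1]  # gene-derived sequence after the breakpoint only

one_hybrid = copy_number_changes(add_profiles(two_full_copies, hybrid_a))
both_hybrids = copy_number_changes(add_profiles(two_full_copies, hybrid_a, hybrid_b))
```

With one hybrid the profile drops from 3 to 2 at the breakpoint, while with both hybrids the profile is flat at 3 and no change is found.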
118 small variants that define various star alleles are detected from the read alignments. 96 of these variants are in unique (nonhomologous) regions of CYP2D6 with high mapping quality. Only reads mapping to CYP2D6 are used for calling variants in nonhomologous regions. The other 22 variants occur in homologous regions of CYP2D6 where reads mapping to either CYP2D6 or CYP2D7 are used for variant calling.
For each variant, reads containing either the variant allele or the nonvariant alleles are counted. A binomial model that incorporates the sequencing errors is then used to determine the most likely variant copy number (0 for nonvariant). A strand bias filter is applied to a small subset of variants that would otherwise tend to have false positive calls in the population.
Samples with poor sequencing quality or greater than five copies of CYP2D6 will have allele counts with higher variance. This elevated variance increases the chance that the most likely variant copy number is wrong. To handle these cases, the small variant caller also indicates alternate, less likely variant copy numbers.
The called SVs and small variant genotypes are matched against the definitions of 128 different star alleles. This might result in different sets of star alleles matching the called variant genotypes, such as with *1
, *46
and *43
, *45
where both sets of star alleles contain the same 4 small variants. When the small variant caller emits alternate, less likely variant copy numbers in addition to the most likely variant copy numbers this might result in different sets of star alleles being identified, since these alternate sets of variant copy numbers are also matched to the star allele definitions. The number of matched star alleles must match the number of CYP2D6-derived gene copies determined from previous steps. When there are fewer than two CYP2D6-derived gene copies, then one or more *5
deletion haplotypes are included in the output set of star alleles. If all variant genotypes cannot be matched to a set of star alleles, the CYP2D6 Caller returns a no call during the genotyping step with filter value No_call
.
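The matching step can be sketched as an exhaustive search over sets of star alleles. The allele and variant names below are placeholders rather than real star allele definitions, and the function name is hypothetical; structural variants and alternate copy numbers are left out for brevity.

```python
from collections import Counter
from itertools import combinations_with_replacement


def matching_star_allele_sets(called_variant_copies, star_allele_definitions, gene_copies):
    """Find every set of star alleles that reproduces the called variant copy numbers."""
    target = Counter({v: n for v, n in called_variant_copies.items() if n > 0})
    matches = []
    # One star allele per gene copy; the same allele may appear more than once.
    for allele_set in combinations_with_replacement(sorted(star_allele_definitions), gene_copies):
        combined = Counter()
        for allele in allele_set:
            combined.update(star_allele_definitions[allele])
        if combined == target:
            matches.append(allele_set)
    return matches


# Placeholder definitions: each allele is defined by the variants it carries.
definitions = {"*A": [], "*B": ["v1"], "*C": ["v2"], "*D": ["v1", "v2"]}
called = {"v1": 1, "v2": 1}
star_allele_sets = matching_star_allele_sets(called, definitions, gene_copies=2)
```

Here two different sets explain the same variant calls, which is the kind of ambiguity the genotyping step has to resolve.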
Given a possible set of star alleles, the genotyping step attempts to identify the two likely haplotypes that contain all star alleles in the set. The deletion haplotype (*5
) is considered as a possible haplotype during this process. The likelihood of any given genotype is determined from a table of population frequencies determined from the 1000 Genomes Project and the genotype with the highest population frequency is selected. When two or more possible genotypes are identified with similar population frequencies, then all genotypes are emitted. This results in a call with filter value More_than_one_possible_genotype
.
Each CYP2D6 genotype contains two haplotypes separated by a slash (eg *1/*2
). Each haplotype consists of one or more star alleles separated by a plus sign (eg *10+*36
). When a haplotype contains more than one copy of the same star allele, that star allele only appears once and is followed by a multiplication sign, and then the number of copies (eg *1x2
for two copies of *1
).
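Downstream scripts often need to unpack this notation. The sketch below is one possible parser; the function names are hypothetical, and it assumes the multiplication sign is written as a lowercase ASCII "x", as in the examples above.

```python
def parse_haplotype(haplotype):
    """Expand a haplotype such as '*10+*36' or '*1x2' into a list of star alleles."""
    alleles = []
    for token in haplotype.split("+"):
        name, has_multiplier, copies = token.partition("x")
        alleles.extend([name] * (int(copies) if has_multiplier else 1))
    return alleles


def parse_genotype(genotype):
    """Split a genotype such as '*1/*2' into its two haplotypes."""
    first, second = genotype.split("/")
    return parse_haplotype(first), parse_haplotype(second)


parsed = parse_genotype("*1x2/*10+*36")
```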
¹Chen X, Shen F, Gonzaludo N, et al. Cyrius: accurate CYP2D6 genotyping using whole-genome sequencing data. The Pharmacogenomics Journal. 2021;21(2):251-261. doi:10.1038/s41397-020-00205-5
The CYP21A2 Caller is capable of genotyping the CYP21A2 gene from whole-genome sequencing (WGS) data. Due to high sequence similarity with its pseudogene paralog CYP21A1P and a wide variety of common structural variants (SVs), a specialized caller is necessary to resolve variants.
The CYP21A2 calling workflow is broken up into the following major stages:
Loading input configuration
Processing read data
Analyzing read data
Read data analysis is further split into the following steps:
Determine total CYP21A2 and CYP21A1P copy number from read depth.
Call small variants in CYP21A2 copies.
Phase reads to detect common variants and recombination events.
Identify most likely haplotypes.
The CYP21A2 Caller requires WGS data aligned to a human reference genome with at least 30x coverage.
The first step of CYP21A2 calling is to determine the combined copy number of CYP21A2 and CYP21A1P. Reads aligned to regions in either CYP21A2 or CYP21A1P are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples. The combined CYP21A2 and CYP21A1P copy number is then calculated from the average sequencing depth across the CYP21A2 and CYP21A1P regions.
Of the known nonrecombinant-like variants, some are in unique (nonhomologous) regions of CYP21A2 with high mapping quality. Only reads mapping to CYP21A2 are used for calling variants in nonhomologous regions. The other variants occur in homologous regions of CYP21A2/CYP21A1P where reads mapping to either are used for variant calling.
For each variant, reads containing either the variant allele or the nonvariant allele are counted. A binomial model that incorporates the sequencing error rate is then used to determine the most likely variant copy number (0 for nonvariant).
For a list of the supported nonrecombinant-like variants, refer to the targeted/cyp21a2/target_variants_*.tsv
files located in the resources
directory of the DRAGEN install location.
To analyze the homologous region even further, DRAGEN phases reads covering differentiating sites and known variant sites. Whenever a detected haplotype has a CYP21A2->CYP21A1P or CYP21A1P->CYP21A2 transition that is consistent with one of the known recombinant-like variants, the transition is considered as a candidate breakpoint for calling those variants. Reads containing phasing information for the two sites flanking each candidate breakpoint are used for variant calling. When the read data supports the hypothesis that the sample contains at least one copy of a candidate breakpoint, the associated haplotype is a recombinant haplotype candidate. Recombinant haplotype candidates are sorted by likelihood and the number of variant sites. If no wild type haplotype was detected, DRAGEN reports any detected homozygous recombinant haplotype, or up to two different recombinant haplotypes (i.e. compound het) if detected. If any wild type haplotype was found, DRAGEN reports a maximum of one recombinant haplotype. When no recombinant haplotypes are detected two wild type haplotypes are reported.
For a list of recombinant variant sites, refer to the targeted/cyp21a2/recombinant_variants_*.tsv
files located in the resources
directory of the DRAGEN install location.
Note that NM_000500.9:c.710_719delinsACGAGGAGAA will be reported as the following three variants on the same haplotype: NM_000500.9:c.710T>A, NM_000500.9:c.713T>A, and NM_000500.9:c.719T>A.
Note: A deletion-like recombinant variant haplotype (as opposed to a gene conversion-like recombinant variant haplotype) is defined as a haplotype with one or fewer switch sites (transitions from a CYP21A1P allele to a CYP21A2 allele) after excluding some sites with common gene conversions in CYP21A1P.
Each nonrecombinant-like variant reported in the variants
array will have the fields below.
An example of the CYP21A2 caller content in the <output-file-prefix>.targeted.json
output file is shown below.
The GBA Caller is capable of detecting both recombinant-like and nonrecombinant-like variants in the GBA gene from whole-genome sequencing (WGS) data. Disruption of all copies of the GBA gene in an individual causes the autosomal recessive disorder Gaucher disease, and carriers are at increased risk of Parkinson's disease and Lewy body dementia. Due to high sequence similarity with its pseudogene paralog GBAP1, calling recombinant-like variants in GBA requires a specialized caller.
To enable the GBA Caller, use --enable-gba=true
as part of a germline-only WGS analysis workflow. The GBA Caller is disabled by default and requires WGS data aligned to a human reference genome with at least 30x coverage.
The GBA Caller performs the following steps:
Determine the total combined GBA and GBAP1 copy number
Detect nonrecombinant-like variants from a set of 111 known variants
Assemble phased haplotypes in the exon 9-11 region where recombinant variants occur
Detect any GBAP1 -> GBA breakpoints that are consistent with one of the 7 known recombinant-like variants
A 10 kb region of unique sequence in between GBA and GBAP1 is used to compute the copy number change due to reciprocal recombination events. Reads that align to this 10 kb region are counted and the count is normalized to a diploid baseline derived from 3000 preselected 2 kb regions across the genome. The 3000 normalization regions are randomly selected from the portion of the reference genome that has stable coverage across population samples. The total combined GBA and GBAP1 copy number is then calculated as two more than the copy number of this 10 kb region.
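The arithmetic in this step can be sketched as follows. The function name is hypothetical, the baseline is taken as the median depth of the normalization regions, and rounding to an integer copy number is an assumption made for illustration.

```python
import statistics


def gba_total_copy_number(spacer_depth, normalization_depths):
    """Estimate the combined GBA + GBAP1 copy number from the 10 kb spacer region."""
    baseline_depth = statistics.median(normalization_depths)
    spacer_copy_number = round(2 * spacer_depth / baseline_depth)
    # The total is two more than the copy number of the spacer region.
    return spacer_copy_number + 2


# Made-up depths: a spacer at full diploid depth, and one at half depth.
normal_sample = gba_total_copy_number(30.0, [29.0, 30.0, 31.0])
reduced_sample = gba_total_copy_number(15.0, [29.0, 30.0, 31.0])
```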
Of the known nonrecombinant-like variants, some are in unique (nonhomologous) regions of GBA with high mapping quality. Only reads mapping to GBA are used for calling variants in nonhomologous regions. The other variants occur in homologous regions of GBA/GBAP1 where reads mapping to either GBA or GBAP1 are used for variant calling.
For each variant, reads containing the variant allele and the nonvariant alleles are counted. A binomial model that incorporates the sequencing error rate is then used to determine the most likely variant allele copy number (0 for nonvariant).
For a list of the supported nonrecombinant-like variants, refer to the targeted/gba/target_variants_*.tsv
files located in the resources
directory of the DRAGEN install location.
A collection of 10 differentiating sites in the exon 9-11 region of GBA are used to detect the GBA and GBAP1 haplotypes present in the sample. An iterative phasing algorithm is used to build up haplotypes that are supported by the read data. The phasing algorithm starts with seed sites which are then iteratively extended to neighboring sites. At each iteration, reads that can be unambiguously assigned to one of the detected partial haplotypes are used to extend the next neighboring site for each partial haplotype. Iteration continues until all sites have been extended. Some haplotypes may have sites that are unresolved (i.e. ambiguous), but these haplotypes can still participate in GBA -> GBAP1 breakpoint detection.
If any of the 10 differentiating sites in exon 9-11 indicate that there are no wild type GBA allele copies, then the sample is called as homozygous variant and the recombinant-like variant that best matches the depth calls at the 10 sites is reported.
When the sample is not homozygous variant, the phased haplotypes are used to detect heterozygous variants. The detected haplotypes are compared against a set of 7 known recombinant-like variants: A495P, L483P, D448H, c.1263del, RecNciI, RecTL, and c.1263del+RecTL. Whenever a detected haplotype has a GBA->GBAP1 or GBAP1->GBA transition that is consistent with one of these 7 known recombinant-like variants, the transition is considered as a candidate breakpoint for calling that recombinant-like variant. Reads containing phasing information for the two sites flanking each candidate breakpoint are used for variant calling. When the read data supports the hypothesis that the sample contains at least one copy of a candidate breakpoint, the associated haplotype is a recombinant haplotype candidate. Recombinant haplotype candidates are sorted by likelihood and the number of variant sites. If no wild type haplotype was detected, DRAGEN reports any detected homozygous recombinant haplotype, or up to two different recombinant haplotypes (i.e. compound het) if detected. If any wild type haplotype was found, DRAGEN reports a maximum of one recombinant haplotype. When no recombinant haplotypes are detected, two wild type haplotypes are reported.
The caller can detect the following recombinant variant haplotypes: A495P, L483P, D448H, c.1263del, RecNciI, RecTL, and c.1263del+RecTL. Note: RecNciI, RecTL, and c.1263del+RecTL may be deletion-like recombinant variants. A deletion-like recombinant variant haplotype (as opposed to a gene conversion-like recombinant variant haplotype) is defined as a haplotype with one or fewer switch sites (transitions from a GBAP1 allele to a GBA allele).
The table below shows the HGVS identifiers associated with each recombinant variant haplotype.
Each nonrecombinant-like variant reported in the variants
array will have the fields below.
An example of the GBA caller content in the <output-file-prefix>.targeted.json
output file is shown below.
The Rh Caller is capable of identifying a common gene conversion between the RHD and RHCE genes from whole-genome sequencing (WGS) data, referred to as the RHCE Exon2 gene conversion. Due to high sequence similarity between the genes, a specialized caller is necessary to resolve the gene conversion between the pair of genes. The caller considers 798 loci, called differentiating sites, that represent differences between the RHD and RHCE genes and are well preserved in the population.
The Rh Caller performs the following steps:
Determines total copy number from read depth of the RHD and RHCE regions.
Detects RHD -> RHCE breakpoints that are consistent with the RHCE Exon2 gene conversion.
The Rh Caller requires WGS data aligned to a human reference genome with at least 30x coverage. Reference genome builds must be based on hg19
, GRCh37
, or hg38
.
The Rh Caller is run by default when the small variant caller is enabled, the sample is not a tumor sample, and the sample is detected as WGS by the Ploidy Estimator.
The first step of Rh calling is to determine the copy number of the RHD and RHCE regions. Reads aligned to the RHD and RHCE regions are counted according to their support of the differentiating sites. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples.
A collection of 4 differentiating sites in the exon 2 region of RHD and RHCE are used to detect the presence of the RHCE Exon2 gene conversion in the sample. An iterative phasing algorithm is used to build up haplotypes that are supported by the read data. The phasing algorithm starts with candidate haplotypes formed from all possible bases at the first differentiating site. The haplotypes are then extended at the next differentiating site by considering all reads that can be uniquely assigned to a single candidate haplotype. If these reads support only a single base at the next differentiating site for a given candidate haplotype, then the haplotype is extended with that base. When a candidate haplotype can be extended by both bases at the next differentiating site then both possible extended haplotypes are included in the set of candidate haplotypes, growing the set by 1. Subsequent extension steps are performed at neighboring differentiating sites until all sites have been processed. Some haplotypes may have sites that are unresolved (i.e. ambiguous), but these haplotypes can still participate in RHD -> RHCE breakpoint detection.
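The iterative extension described above can be illustrated with a small Python sketch. This is not DRAGEN's implementation: the read representation (a dict mapping a differentiating-site index to the observed base), the function name `phase_haplotypes`, and the use of `?` for an unresolved site are all assumptions made for illustration.

```python
def phase_haplotypes(reads, n_sites):
    """Build candidate haplotypes site by site (illustrative sketch only).

    reads: list of dicts {site_index: base}; n_sites: number of differentiating sites.
    """
    # Start with one candidate per base observed at the first site.
    candidates = [[b] for b in sorted({r[0] for r in reads if 0 in r})]

    def compatible(read, hap):
        # The read must cover at least one resolved site of the haplotype
        # and agree with the haplotype at every such site.
        shared = [s for s in read if s < len(hap) and hap[s] != "?"]
        return bool(shared) and all(read[s] == hap[s] for s in shared)

    for site in range(1, n_sites):
        extended = []
        for hap in candidates:
            bases = set()
            for read in reads:
                if site not in read:
                    continue
                matches = [h for h in candidates if compatible(read, h)]
                # Only reads assignable to exactly one candidate are used.
                if len(matches) == 1 and matches[0] is hap:
                    bases.add(read[site])
            if not bases:
                extended.append(hap + ["?"])  # assumption: mark site as unresolved
            else:
                # One base extends the haplotype; two bases fork it.
                extended.extend(hap + [b] for b in sorted(bases))
        candidates = extended
    return ["".join(h) for h in candidates]


reads = [
    {0: "A", 1: "C"},
    {0: "A", 1: "T"},
    {1: "C", 2: "G", 3: "T"},
    {1: "T", 2: "G"},
    {1: "T", 3: "A"},
]
haplotypes = phase_haplotypes(reads, 4)
```

In this toy input the single starting candidate forks at the second site, and the two resulting haplotypes are then extended independently by the reads that can be assigned to only one of them.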
When the phased haplotypes support the RHCE Exon2 gene conversion, the caller visits all the differentiating sites and reports them as variants in the output VCF file, with ploidy identified using the copy number estimated from the read depth of the differentiating site.
The Rh Caller generates a <output-file-prefix>.targeted.json
file in the output directory. The output file is a JSON formatted file containing the fields below.
The rh
fields are defined as below.
For the variants
the fields are defined as below.
Examples of the Rh Caller content in the output json file are shown below.
The Rh Caller also generates a <output-file-prefix>.targeted.vcf[.gz]
file in the output directory. The output file is a VCFv4.2
formatted file, possibly compressed.
The following are example output files:
The default mode of the small variant caller has been optimized to detect germline variants with typical AFs of 0%, 50% or 100%. On the other hand, non-cancer post-zygotic mosaic variants have typical allele frequencies (AFs) lower than 50% and are therefore more challenging to find with the default small variant caller. To improve sensitivity for low AF calls, a new machine learning (ML) model trained using read and context evidence from low AF calls is used. This allows the model to identify variants down to approximately 5% AF on 35x WGS and 3% AF on 300x WGS. The mosaic ML model is applied to all calls that are rejected by the germline model, and variants detected with the mosaic detection are identified by a MOSAIC
flag in the VCF INFO field.
When the mosaic detection is enabled, the hard filter QUAL
threshold for both SNPs and INDELs is lowered to 0.4
to allow low AF calls to be set as PASS
in the FILTER field. MOSAIC
tagged variants with QUAL
smaller than 3
are filtered with the MosaicHardQUAL
filter.
We provide an optional MosaicLowAF
filtering option to filter MOSAIC
tagged variants with AF
smaller than the AF
threshold. The threshold for this filter can be set with the --vc-mosaic-af-filter-threshold
option.
Furthermore, the output of MOSAIC
tagged calls can be restricted using an optional target BED provided with the --vc-mosaic-target-bed
option.
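The two mosaic filters described above can be summarized in a short Python sketch. The function name `mosaic_filters` and its arguments are hypothetical; the default thresholds (QUAL 3.0, AF 0.2) are the values stated in this section, and the sketch models only these two filters, not the rest of DRAGEN's filtering.

```python
def mosaic_filters(info_flags, qual, af, qual_threshold=3.0, af_threshold=0.2):
    """Return the mosaic-specific filters that apply to one variant record.

    info_flags: collection of INFO flags on the record (e.g. {"MOSAIC"}).
    An empty list means neither mosaic filter applies.
    """
    filters = []
    if "MOSAIC" not in info_flags:
        return filters  # both filters act only on MOSAIC tagged variants
    if qual < qual_threshold:
        filters.append("MosaicHardQUAL")
    if af < af_threshold:
        filters.append("MosaicLowAF")
    return filters


low_af_call = mosaic_filters({"MOSAIC"}, qual=12.5, af=0.08)
# Setting the AF threshold to 0.0 leaves no AF below the threshold.
detection_mode_call = mosaic_filters({"MOSAIC"}, qual=12.5, af=0.08, af_threshold=0.0)
```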
--vc-enable-mosaic-detection
Set to true to enable mosaic detection with mosaic AF filter threshold set to 0.0
. Set to false to disable mosaic detection. The default is true with mosaic AF filter threshold set to 0.2
.
--vc-mosaic-af-filter-threshold
Set the allele frequency threshold for the application of the MosaicLowAF
filter to mosaic calls. All MOSAIC
tagged variants with AF
smaller than the AF
threshold are filtered with the MosaicLowAF
filter. The default mosaic AF
filter threshold is set to 0.2
when the germline variant caller is enabled. The AF default threshold is set to 0.0
when the mosaic detection mode is enabled with --vc-enable-mosaic-detection=true
.
--vc-mosaic-qual-filter-threshold
Set the QUAL
threshold for the application of the MosaicHardQUAL
filter to mosaic calls. All MOSAIC
tagged variants with QUAL
smaller than the threshold QUAL
are filtered with the MosaicHardQUAL
filter. The default mosaic QUAL
filter threshold is set to 3.0
.
--vc-mosaic-target-bed
Optional target BED file to restrict the output of MOSAIC
tagged variant calls only in the specified regions.
Small variant calling features comparison between default germline small variant caller and mosaic detection mode in DRAGEN 4.2 and DRAGEN 4.3
The DRAGEN Small Variant Caller is a high-speed haplotype caller implemented with a hybrid of hardware and software. The caller performs localized de novo assembly in regions of interest to generate candidate haplotypes, and then performs read likelihood calculations using a hidden Markov model (HMM).
Variant calling is disabled by default. To enable variant calling, set the --enable-variant-caller
option to true. The VCF header is annotated with ##source=DRAGEN_SNV
to indicate the file is generated by the DRAGEN SNV pipeline.
The DRAGEN Small Variant Caller performs the following steps:
Active Region Identification---Identifies areas where multiple reads disagree with the reference, and selects windows around them (active regions) for processing.
Localized Haplotype Assembly---Assembles all overlapping reads in each active region into a de Bruijn graph (DBG). A DBG is a directed graph based on overlapping K-mers (length K subsequences) in each read or multiple reads. When all reads are identical, the DBG is linear. Where there are differences, the graph forms bubbles of multiple paths that diverge and rejoin. If the local sequence is too repetitive and K is too small, cycles can form, which invalidate the graph. DRAGEN uses K=10 and 25 as the default values. If those values produce an invalid graph, then additional values of K=35, 45, 55, 65 are tried until a cycle-free graph is obtained. From this cycle-free DBG, DRAGEN extracts every possible path to produce a complete list of candidate haplotypes, i.e., hypotheses for what the true DNA sequence might be on at least one strand. In addition to graph assembly, haplotypes are also generated via columnwise detection, with candidate variant events identified directly from BAM alignments. Columnwise detection is enabled by default in all small variant calling pipelines and is supplementary to the DBG, but is especially useful in highly repetitive regions where DBG assembly of reads is more likely to fail.
Haplotype Alignment---Uses the Smith-Waterman algorithm to align each extracted haplotype to the reference genome. The alignments determine what variations from the reference are present.
Read Likelihood Calculation---Tests each read against each haplotype to estimate a probability of observing the read assuming the haplotype was the true original DNA sampled. This calculation is performed by evaluating a pair hidden Markov model (HMM), which accounts for the various possible ways the haplotype might have been modified by PCR or sequencing errors into the read observed. The HMM evaluation uses a dynamic programming method to calculate the total probability of any series of Markov state transitions arriving at the observed read.
Genotyping---Forms the possible diploid combinations of variant events from the candidate haplotypes and, for each combination, calculates the conditional probability of observing the entire read pileup. Calculations use the constituent probabilities of observing each read, given each haplotype from the pair HMM evaluation. These calculations feed into the Bayesian formula to calculate the likelihood that each genotype is the genotype of the sample being analyzed, given the entire read pileup observed. Genotypes with maximum likelihood are reported.
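The localized assembly step can be illustrated with a minimal de Bruijn graph sketch in Python. It is a simplification under stated assumptions, not DRAGEN's hardware-accelerated implementation: the function names are hypothetical, K values are simply tried in order until the graph is cycle-free, and no pruning of low-support paths is modeled.

```python
def build_dbg(reads, k):
    """Map each K-mer to the set of K-mers that follow it in some read."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())
        for a, b in zip(kmers, kmers[1:]):
            graph[a].add(b)
    return graph


def has_cycle(graph):
    """Iterative depth-first search; reaching a node still on the stack is a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    for start in graph:
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            nxt = next(successors, None)
            if nxt is None:
                color[node] = BLACK
                stack.pop()
            elif color[nxt] == GRAY:
                return True
            elif color[nxt] == WHITE:
                color[nxt] = GRAY
                stack.append((nxt, iter(graph[nxt])))
    return False


def extract_haplotypes(graph):
    """Spell out every source-to-sink path of an acyclic graph."""
    with_incoming = {b for successors in graph.values() for b in successors}
    paths = []

    def walk(node, sequence):
        if not graph[node]:
            paths.append(sequence)
        for nxt in sorted(graph[node]):
            walk(nxt, sequence + nxt[-1])

    for source in sorted(n for n in graph if n not in with_incoming):
        walk(source, source)
    return sorted(paths)


def assemble(reads, k_values=(10, 25, 35, 45, 55, 65)):
    """Try each K in turn and return (K, haplotypes) for the first cycle-free graph."""
    for k in k_values:
        graph = build_dbg(reads, k)
        if graph and not has_cycle(graph):
            return k, extract_haplotypes(graph)
    return None, []


# Two short reads differing by one SNP form a bubble with two paths (small K for the toy data).
k_used, candidates = assemble(["ACGTTGCAAT", "ACGTAGCAAT"], k_values=(4,))
```

A repetitive read such as `ATATATGC` produces a cycle for small K in this sketch, which is why larger K values are tried next.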
In most pipelines, DRAGEN reports two types of depth counts, both of which may differ from the information in the BAM pileup due to various filtering steps that are applied throughout variant calling. Briefly:
Unfiltered depth is the number of reads covering the position, downstream of any read collapsing or deduplication that may have preceded the variant calling step, but upstream of most read filtering and overlapping mate handling. Unfiltered depth is reported as INFO/DP, except in the case of gVCF homref calls, where it is reported as FORMAT/DP.
Informative depth is the number of reads actually used to make the calling decision, where filtered reads and uninformative reads (reads that could not be assigned to a specific allele) have been excluded, and overlapping mate pairs are counted as single reads. When overlapping mate pairs are present, this may cause an apparent discrepancy between the reported depth and the pileup as viewed in a browser such as IGV. To resolve this, use the "View as pairs" option in IGV. Informative depth is reported as FORMAT/DP, except in the case of gVCF homref calls, where it is not reported. The FORMAT/AD and FORMAT/AF fields are based on informative depth.
The following figure summarizes the different filtering steps in more detail.
Filter 1 acts on the reads present in the BAM input (in UMI pipelines, these are the collapsed reads produced by the read collapsing step, not the raw reads) and filters out the following reads:
Duplicate reads.
Soft-clipped bases. DRAGEN filters out soft-clipped bases only when calculating coverage reports.
[Somatic] Reads with MAPQ=0.
[Somatic] Reads with MAPQ < vc-min-tumor-read-qual, where vc-min-tumor-read-qual >1.
Filter 2 trims bases with BQ < 10 and filters out the following reads:
Unmapped reads.
Secondary reads.
Reads with bad cigars.
Filter 3 occurs after downsampling and HMM. Filter 3 filters out the following reads:
Reads that are badly mated. A badly mated read is a read where the pair is mapped to two different reference contigs.
Disqualified reads. Reads are disqualified if their HMM score is below a threshold.
Filter 4 occurs after the genotyper runs. The genotyper adds annotation information to the FORMAT field. Filter 4 filters out reads that are not informative. For example, if the HMM scores of the read against two different haplotypes are almost equal, the read is filtered out because it does not provide enough information to distinguish which of the two haplotypes is more likely.
Since DRAGEN 4.3, the mosaic small variant caller runs downstream of the germline small variant caller. Non-cancer post-zygotic mosaic variants with typical AF lower than 50% detected by the mosaic caller are reported in the output VCF file with a MOSAIC
INFO flag. By default, MOSAIC
tagged variants with AF
smaller than 20% are filtered with the MosaicLowAF
filter.
See Mosaic detection for further details on the mosaic small variant caller and the mosaic detection mode and a comparison with DRAGEN 4.2 features.
The following options control the variant caller stage of the DRAGEN host software.
--enable-variant-caller
Set --enable-variant-caller
to true to enable the variant caller stage for the DRAGEN pipeline.
--vc-target-bed
[Optional] Restricts processing of the small variant caller, target BED related coverage, and callability metrics to regions specified in a BED file. The BED file is a text file containing at least three tab-delimited columns. The first three columns are chromosome, start position, and end position. The positions are zero-based. For example:
If the reference span of the variant overlaps with any of the regions in the target BED, then the variant is output. If the reference span does not overlap, the variant is not output. For SNPs and Insertions, the reference span is 1 bp. For deletions, the reference span is the length of the deletion.
--vc-target-bed-padding
[Optional] Pads all target BED regions with the specified value. For example, if a BED region is 1:1000–2000 and the specified padding value is 100, the result is equivalent to using a BED region of 1:900–2100 and a padding value of 0. Any padding added to --vc-target-bed-padding is used by the small variant caller and by the target bed coverage/callability reports. The default padding is 0.
--vc-target-coverage
Specifies the target coverage for downsampling. The default value is 500 for germline mode and 50 for somatic mode.
--vc-remove-all-soft-clips
Set to true to ignore soft-clipped bases during the haplotype assembly step.
--vc-decoy-contigs
Specifies a comma-separated list of contigs to skip during variant calling. This option can be set in the configuration file.
--vc-enable-decoy-contigs
Set to true to enable variant calls on the decoy contigs. The default value is false.
--vc-enable-phasing
Enable variants to be phased when possible. The default value is true.
--vc-combine-phased-variants-distance
Set the maximum distance in base pairs between phased variants to be combined. The default value is 0, which disables the option. To enable combining of phased variants, the recommended distance is 15 base pairs. The valid range is [0, 15].
--vc-enable-mosaic-detection
Set to true to enable DRAGEN mosaic detection with mosaic AF filter threshold set to 0.0
. Set to false to disable DRAGEN mosaic detection. The default is true with mosaic AF filter threshold set to 0.2
.
--vc-mosaic-af-filter-threshold
Set the allele frequency threshold for the application of the MosaicLowAF
filter to mosaic calls. All MOSAIC
tagged variants with AF
smaller than the AF
threshold are filtered with the MosaicLowAF
filter. The default mosaic AF
filter threshold is set to 0.2
when the germline variant caller is enabled. The AF default threshold is set to 0.0
when the mosaic detection mode is enabled with --vc-enable-mosaic-detection=true
.
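The overlap rule and padding described above for --vc-target-bed and --vc-target-bed-padding can be sketched in Python. The helper names are hypothetical, and two details are assumptions not stated in this guide: the 1 bp span of a SNP or insertion is placed at the VCF POS, and the span of a deletion starts immediately after the padding base.

```python
def parse_bed(text):
    """Read (chrom, start, end) from the first three tab-delimited BED columns."""
    regions = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        regions.append((chrom, int(start), int(end)))  # zero-based positions
    return regions


def reference_span(pos, ref, alt):
    """Zero-based half-open reference span of a variant with 1-based VCF POS."""
    if len(ref) > len(alt):
        deleted = len(ref) - len(alt)
        return pos, pos + deleted  # assumption: deleted bases follow the padding base
    return pos - 1, pos  # SNPs and insertions span 1 bp


def is_output(chrom, span, regions, padding=0):
    """True if the span overlaps any (padded) target region on the same contig."""
    start, end = span
    return any(
        chrom == r_chrom and start < r_end + padding and end > max(0, r_start - padding)
        for r_chrom, r_start, r_end in regions
    )


regions = parse_bed("chr1\t1000\t2000\n")
snp_span = reference_span(950, "A", "G")
deletion_span = reference_span(995, "ACGTACGTACG", "A")
```

With these inputs the SNP at position 950 falls outside the unpadded region but inside it once a padding of 100 extends the region to 900-2100, matching the padding example given for --vc-target-bed-padding.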
You can use the following options for downsampling reads in the small variant calling pipeline.
For mitochondrial small variant calling, the downsampling options can be set separately because the mitochondrial contig contains a higher depth than the rest of the contigs in a WGS data set. The following are the downsampling options for the mitochondrial contig.
--vc-target-coverage-mito
--vc-max-reads-per-active-region-mito
--vc-max-reads-per-raw-region-mito
The target coverage and max/min reads in raw/active region options are not directly related and could be triggered independently.
The following are the default downsampling values for each small variant calling mode.
The target coverage downsampling step runs first and is meant to limit the total coverage at a given position. This step is approximate and the coverage after downsampling at a given position could be a bit higher than the threshold due to the --vc-min-reads-per-start-pos
setting.
If the number of reads sharing a start position is equal to or lower than --vc-min-reads-per-start-pos
, that start position is skipped for downsampling, so that a start position is never reduced below the minimum number of reads (set by --vc-min-reads-per-start-pos
, default value 10).
The next downsampling step is to apply the --vc-max-reads-per-raw-region
and --vc-max-reads-per-active-region
limits. These options are used to limit the total number of reads in an entire region using a leveling downsampling method.
This downsampling mechanism scans each start position from the start boundary of the region and discards one read from that position, then moves on to the next position, until the total number of reads falls below the threshold. It can potentially take several passes across the entire region for the total number of reads in the entire region to fall below the threshold. After the threshold is met, the downsampling step is stopped regardless of which position was considered last in the region.
When downsampling occurs, the choice of which reads to keep or remove is random. However, the random number generator is seeded to a default value to make sure that the generator produces the same set of values in each run. This ensures reproducible results, which means there is no run to run variation when using the same input data.
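The leveling downsampling and the seeded random choice described above can be sketched as follows. This is an illustrative Python model, not DRAGEN code: the function name `level_downsample`, the dictionary input keyed by start position, the seed value, and the choice to stop as soon as the total is at or below the limit are assumptions.

```python
import random


def level_downsample(reads_by_start, max_reads, seed=0):
    """Discard one read per start position per pass until the region fits the limit.

    reads_by_start: dict {start_position: [read, ...]} for one region.
    A fixed seed makes the random choice of discarded reads reproducible.
    """
    rng = random.Random(seed)
    max_reads = max(max_reads, 0)
    buckets = {pos: list(reads) for pos, reads in reads_by_start.items()}
    total = sum(len(reads) for reads in buckets.values())
    while total > max_reads:
        # One pass over the region, from its start boundary onward.
        for pos in sorted(buckets):
            if total <= max_reads:
                break  # stop regardless of which position was considered last
            if buckets[pos]:
                buckets[pos].pop(rng.randrange(len(buckets[pos])))
                total -= 1
    return buckets


region = {100: ["r1", "r2", "r3", "r4", "r5"], 101: ["r6"], 102: ["r7", "r8", "r9"]}
kept = level_downsample(region, max_reads=5)
```

In this example the first pass removes one read at each of the three start positions and the second pass stops after the first position, because the region total has reached the limit.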
A genomic VCF (gVCF) file contains information on variants and positions determined to be homozygous to the reference genome. For homozygous regions, the gVCF file includes statistics that indicate how well reads support the absence of variants or alternative alleles. The gVCF file includes an artificial <NON_REF>
allele. Reads that do not support the reference or any variants are assigned the <NON_REF>
allele. DRAGEN uses these reads to determine if the position can be called as a homozygous reference, as opposed to remaining uncalled. The resulting score represents the Phred-scaled level of confidence in a homozygous reference call. In germline mode, the score is FORMAT/GQ
and in somatic mode the score is FORMAT/SQ
.
The following options are available to enable and control gVCF output.
--vc-emit-ref-confidence
To enable gVCF output, set to GVCF
. By default, contiguous runs of homozygous reference calls with similar scores are collapsed into blocks (hom-ref blocks). Hom-ref blocks save disk space and processing time of downstream analysis tools. DRAGEN recommends using the default mode.
To produce unbanded output, set --vc-emit-ref-confidence
to BP_RESOLUTION
.
--vc-enable-vcf-output
To enable VCF file output during a gVCF run, set to true. The default value is false.
--vc-gvcf-bands
If using the default --vc-emit-ref-confidence gvcf
(banded mode), DRAGEN collapses gVCF records with a similar GQ or SQ score. By default, the cutoffs are 1 10 20 30 40 60 80
for germline and 1 3 10 20 50 80
for somatic. For example, to define the bands [0, 10), [10, 50), and ≥ 50 use --vc-gvcf-bands 10 50
.
--vc-compact-gvcf
This option, when used for germline in conjunction with --vc-emit-ref-confidence gvcf
, produces a much smaller gVCF output file than the default. It can be used when the gVCF is destined for ingestion into gVCF Genotyper, offering further savings on disk space and gVCF Genotyper runtime compared to the default. This option implies --vc-gvcf-bands 0 1 10 20 30
and additionally omits certain metrics that are not used by gVCF Genotyper. Note that files generated using this option will be rejected by the Pedigree Caller.
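The banding controlled by --vc-gvcf-bands can be illustrated with a short Python sketch. The function names and the per-position list of scores are assumptions for illustration; the band boundaries follow the example above, where the cutoffs 10 and 50 define [0, 10), [10, 50), and ≥ 50.

```python
import bisect
from itertools import groupby


def band_for(score, cutoffs):
    """Return (lower, upper) of the band containing score; upper is None for the top band."""
    i = bisect.bisect_right(cutoffs, score)
    lower = cutoffs[i - 1] if i > 0 else 0
    upper = cutoffs[i] if i < len(cutoffs) else None
    return lower, upper


def collapse_into_blocks(scores, cutoffs, start=1):
    """Collapse contiguous positions whose scores share a band into (first, last, band) blocks."""
    blocks = []
    pos = start
    for band, group in groupby(scores, key=lambda s: band_for(s, cutoffs)):
        length = len(list(group))
        blocks.append((pos, pos + length - 1, band))
        pos += length
    return blocks


# Hypothetical per-position GQ values for eight consecutive hom-ref positions.
blocks = collapse_into_blocks([5, 7, 12, 30, 49, 60, 80, 3], cutoffs=[10, 50])
```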
Not all entries in the gVCF are contiguous. The file might contain gaps that are not covered by either a variant line or a hom-ref block. The gaps correspond to regions that are not callable. A region is not callable if there is not at least one read mapped to the region with a MAPQ score above zero.
In germline mode, the thresholds for calling are lower for gVCFs than for VCFs. The gVCF output could show a different number of variants than a VCF run for the same sample. There is likely a different number of biallelic and multiallelic calls because gVCF mode includes all possible alleles at a locus, rather than only the two most likely alleles. This means that a biallelic call in the VCF can be output as a multiallelic call in the gVCF. The genotype in the gVCF still points to the two most likely alleles, so the variant call remains the same.
The following are example gVCF records that include a hom-ref block call and a variant call.
In single sample gVCF, FORMAT/DP reported at a HomRef position is the median DP in the band and AD is the corresponding value, so sum of AD will be DP even in a homref band. The minimum is also computed and printed as MIN_DP for the band.
In single sample VCF and gVCF, the QUAL follows the definition of the VCF specification. For more information on the VCF specification, see the most current VCF documentation available on samtools/hts-specs GitHub repository.
QUAL is the Phred-scaled probability that the site has no variant and is computed as:
That is, QUAL = GP (GT=0/0), where GP = posterior genotype probability in Phred scale. QUAL = 20 means there is a 99% probability that there is a variant at the site. The GP values are also given in Phred scale in the VCF file.
GQ for non-homref calls is the Phred-scaled probability that the call is incorrect. GQ=-10*log10(p), where p is the probability that the call is incorrect. GQ=-10*log10(sum(10.^(-GP(i)/10))) where the sum is taken over the GT that did not win. GQ of 3 indicates a 50 percent chance that the call is incorrect, and GQ of 20 indicates a 1 percent chance that the call is incorrect.
In gVCF mode, the evidence in favor of homozygous reference calls is also assessed. However, the posterior probability is not of interest in this case (with zero evidence, e.g. due to zero coverage, the strong prior in favor of homref would yield a strong posterior in favor of homref), so the value of GQ for homref calls reflects the evidence directly, defined using the likelihood ratio between the likelihoods for the homref hypothesis and the strongest competing variant hypothesis: 10*log10[P(D|homref)/P(D|variant)] where D represents the pileup data.
QD is the QUAL normalized by the read depth, DP.
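The QUAL, GQ, and QD definitions above can be checked numerically with a few lines of Python. The function names are hypothetical. The GP values and depth are taken from the chr1:1029628 example record elsewhere in this guide (GP=4.988e+01,6.653e-05,5.300e+01, DP=37), which reports QUAL=49.88, GQ=48, and QD=1.35.

```python
import math


def qual_from_gp(gp):
    """QUAL is the Phred-scaled posterior of the 0/0 genotype, i.e. the first GP value."""
    return gp[0]


def gq_from_gp(gp):
    """GQ = -10*log10 of the summed posterior of all genotypes that did not win."""
    winner = min(range(len(gp)), key=gp.__getitem__)  # smallest Phred value wins
    p_wrong = sum(10 ** (-g / 10) for i, g in enumerate(gp) if i != winner)
    return -10 * math.log10(p_wrong)


def homref_gq(log10_lik_homref, log10_lik_variant):
    """gVCF homref GQ: 10*log10[P(D|homref)/P(D|variant)], given log10 likelihoods."""
    return 10 * (log10_lik_homref - log10_lik_variant)


def qd(qual, dp):
    """QD is QUAL normalized by the read depth DP."""
    return qual / dp


gp = [4.988e+01, 6.653e-05, 5.300e+01]
qual = qual_from_gp(gp)
gq = gq_from_gp(gp)
qual_by_depth = qd(qual, 37)
```

Rounded, these reproduce the values in the example record. The same conversion shows why a GQ of 3 corresponds to roughly a 50 percent chance of an incorrect call, since 10^(-3/10) is about 0.5.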
DRAGEN supports output of phased variant records in both the germline and the somatic VCF and gVCF files. When two or more variants are phased together, the phasing information is encoded in a sample-level annotation, FORMAT/PS. FORMAT/PS identifies which set the phased variant is in. The value in the field is an integer representing the position of the first phased variant in the set. All records in the same contig with matching PS values belong to the same set.
The following is an example of a DRAGEN single sample gVCF, where two SNPs are phased together.
During the genotyping step, all haplotypes and all variants are considered over an active region. For each pair of variants, if both variants occur on all of the same haplotypes or if either is a homozygous variant, then they are phased together. If the variants only occur on different haplotypes, then they are phased opposite to each other. If any heterozygous variants are present on some of the same haplotypes but not others, phasing is aborted and no phasing information is output for the active region.
Phased variant records that belong to the same phasing set can be combined into a single VCF record. For example, assuming reference at position chr2 115035
is A
, the following two phased variants are combined.
The phased variants are combined as follows.
The command-line option --vc-combine-phased-variants-distance
specifies the maximum distance over which phased variants will be combined. The default value 0 disables the feature. When enabled, the option combines all phased variants in the phasing set that are within the provided distance value.
DRAGEN supports phasing of the genotypes listed in the below table. Only the first row in the table is relevant to somatic, since the somatic pipeline only emits 0/1 and 0|1 genotypes. MNV calls can still be phased with other variant calls that fell outside the phased variants distance.
Examples of diploid haplotypes where phasing is supported:
Examples of diploid haplotypes where phasing is not supported:
By default in somatic mode, DRAGEN will output all component SNVs and INDELs that make up an MNV along with the MNV call itself. MNVs and their component calls can be identified and linked to one another by a common value in the INFO.MNVTAG field. Setting --vc-mnv-emit-component-calls=false
can be used to restrict which component calls are reported. When DRAGEN reports an MNV call, it considers the difference between the VAF of the MNV call and the VAF of each component call, and reports any given component call in addition to the MNV call if this difference is greater than --vc-combine-phased-variants-max-vaf-delta
(default: 0.1). The --vc-mnv-emit-component-calls
and --vc-combine-phased-variants-max-vaf-delta
options are only applicable in somatic mode and are not supported in germline mode. In germline mode, functionality to output component calls is not available and MNV calls are emitted only without component calls.
DRAGEN outputs variants in a VCF file following variant normalization as described here https://genome.sph.umich.edu/wiki/Variant_Normalization. The normalization of a variant representation in VCF consists of two parts: parsimony and left alignment pertaining to the nature of a variant's length and position respectively.
Parsimony means representing a variant in as few nucleotides as possible without reducing the length of any allele to 0.
Left aligning a variant means shifting the start position of that variant to the left until it is no longer possible to do so.
A variant is normalized if and only if it is parsimonious and left aligned.
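The two normalization properties can be made concrete with a small Python sketch of the commonly used trim-and-shift procedure (as described on the Variant Normalization page linked above). It illustrates the definition and is not DRAGEN's internal code; the function name, the in-memory reference string, and the toy sequence are assumptions.

```python
def normalize(reference, pos, alleles):
    """Return (pos, alleles) made parsimonious and left aligned.

    reference: contig sequence as a string; pos: 1-based position of the alleles;
    alleles: [ref, alt1, ...], all non-empty and not all identical.
    """
    alleles = list(alleles)
    # Trim shared trailing bases; when an allele becomes empty, extend all
    # alleles one base to the left, which shifts the variant leftwards.
    while len({a[-1] for a in alleles}) == 1 and (min(map(len, alleles)) > 1 or pos > 1):
        alleles = [a[:-1] for a in alleles]
        if any(not a for a in alleles):
            pos -= 1
            alleles = [reference[pos - 1] + a for a in alleles]
    # Trim shared leading bases while every allele keeps at least one base.
    while min(map(len, alleles)) >= 2 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


reference = "GGGCACACACAGGG"
# A CA deletion written at the right end of the repeat is shifted to its left end.
deletion = normalize(reference, 8, ["CAC", "C"])
# A SNP padded with flanking bases is reduced to a single base.
snp = normalize(reference, 4, ["CAC", "CGC"])
```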
Additional notes on variant representation in the DRAGEN VCF:
Reference-trimming of alleles: A single padding reference base is used to represent insertions and deletions (i.e. the reference base preceding the insertion or deletion is included).
Allele decomposition: by default, multi-nucleotide polymorphisms (MNPs) are represented as separate, contiguous individual SNV records in the VCF. If phasing can be determined, the FORMAT/GT is phased and the FORMAT/PS contains the coordinate position of the first variant in the set of phased variants. This determines which variants have occurred on the same haplotype. Phased variant records that belong to the same phasing set can be combined into a single VCF record by setting the --vc-combine-phased-variants-distance
command-line option to a non-zero value. When enabled, the option combines all phased variants in the phasing set that are within the provided distance value (specified in base pairs).
In some cases, such as complex variants in repetitive regions, some variants cannot be normalized (i.e. converted into a standard representation) or represented uniquely. To counteract this problem, when comparing two VCFs (e.g. a DRAGEN VCF against a truth set VCF), it is recommended to use the RTG vcfeval tool which performs variant comparisons using a haplotype-aware approach. RTG vcfeval has been adopted as the standard VCF comparison tool by GA4GH and PrecisionFDA https://www.biorxiv.org/content/biorxiv/early/2018/02/23/270157.full.pdf.
A multiallelic site is a specific locus in a genome that contains three or more observed alleles, counting the reference as one, and therefore allowing for two or more variant alleles. Multi-allelic calls are output in a single variant record in the VCF as follows:
chr1 2656216 . A T,C 107.65 PASS AC=1,1;AF=0.500,0.500;AN=2;DP=12;FS=0.000;MQ=28.95;QD=8.97;SOR=3.056;FractionInformativeReads=0.750 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB 1/2:0,5,4:0.556,0.444:9:15:177,144,46,122,0,72:-17.704,-14.420,-4.626,-12.220,0.000,-7.244:1.076e+02,1.096e+02,1.465e+01,8.758e+01,1.520e-01,4.082e+01:0.00,34.77,37.77,34.77,69.54,37.77:0,0,1,8:0,0,4,5
Two indels are considered as multi-allelic if they share the same reference base preceding the indel.
chr1 7392258 . C CT,CTTT 234.76 PASS AC=1,1;AF=0.500,0.500;AN=2;DP=44;FS=0.000;MQ=199.22;QD=5.34;SOR=2.226;FractionInformativeReads=0.659 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB 1/2:0,15,14:0.517,0.483:29:50:245,256,55,190,0,55:-24.476,-25.634,-5.492,-18.976,0.000,-5.500:2.348e+02,2.513e+02,5.292e+01,1.848e+02,4.401e-05,5.300e+01:0.00,5.00,8.00,5.00,10.00,8.00:0,0,7,22:0,0,17,12
If a SNP overlaps an INDEL, but the SNP does not align with the reference base preceding the indel, the SNP and INDEL are represented as two different variant records, as shown in the example below. However, DRAGEN has a joint detection of overlapping variants feature, which is designed to detect an overlapping SNP and INDEL and output them in a single VCF variant record, represented as a multi-allelic genotype.
chr1 1029628 . C CGT 49.88 PASS AC=1;AF=0.500;AN=2;DP=37;FS=7.791;MQ=105.32;MQRankSum=-1.315;QD=1.35;ReadPosRankSum=1.423;SOR=1.510;FractionInformativeReads=0.892;R2_5P_bias=-19.742 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB:PS 0|1:17,16:0.485:33:48:81,0,50:-8.088,0.000,-5.000:4.988e+01,6.653e-05,5.300e+01:0.00,31.00,34.00:10,7,5,11:11,6,9,7:1029628
chr1 1029629 . A G 50.00 PASS AC=1;AF=0.500;AN=2;DP=37;FS=1.289;MQ=105.32;MQRankSum=-0.659;QD=1.35;ReadPosRankSum=-0.199;SOR=0.604;FractionInformativeReads=1.000;R2_5P_bias=-24.923 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB:PS 0|1:16,21:0.568:37:48:85,0,49:-8.477,0.000,-4.934:5.000e+01,6.886e-05,5.234e+01:0.00,34.77,37.77:9,7,10,11:10,6,13,8:1029628
The small variant caller currently only supports either ploidy 1 or 2 on all contigs within the reference except for the mitochondrial contig, which uses a continuous allele frequency approach (see Mitochondrial Calling). The selection of ploidy 1 or 2 for all other contigs is determined as follows.
If --sample-sex
is not specified on the command line, the Ploidy Estimator determines the sex. If the Ploidy Estimator cannot determine the sex karyotype or detects sex chromosome aneuploidy, all contigs are processed with ploidy 2.
If --sample-sex
is specified on the command line, contigs are processed as follows.
For female samples, DRAGEN processes all contigs with ploidy 2, and marks variant calls on chrY with a filter PloidyConflict.
For male samples, DRAGEN processes all contigs with ploidy 2, except for the sex chromosomes. DRAGEN processes chrX with ploidy 1, except in the PAR regions, where it is processed with ploidy 2. chrY is processed with ploidy 1 throughout.
For male samples in germline calling mode, DRAGEN calls potential mosaic variants in non-PAR regions of sex chromosomes. A variant is called as mosaic when the allele frequency (FORMAT/AF) is below 85% or if multiple alt alleles are called, suggesting incompatibility with the haploid assumption. The GT field for bi-allelic mosaic variants is "0/1", denoting a mixture of reference and alt alleles, as opposed to the regular GT of "1" for haploid variants. The GT field for multi-allelic mosaic variants is "1/2" in VCF. You can disable the calling of mosaic variants by setting --vc-enable-sex-chr-diploid
to false.
An example germline VCF record of a mosaic variant in a haploid region:
chrX 18622368 . C T 48.84 PASS AC=1;AF=0.500;AN=2;DP=22;FS=4.154;MQ=248.02;MQRankSum=3.272;QD=2.27;ReadPosRankSum=2.671;SOR=1.546;FractionInformativeReads=1.000;MOSAIC GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 0/1:9,13:0.5909:22:1,8:8,5:48:84,0,51:4.8837e+01,7.4031e-05,5.4007e+01:0.00,34.77,37.77:5,4,4,9:3,6,5,8
DRAGEN detects sex chromosomes by the naming convention, either X/Y or chrX/chrY. No other naming convention is supported.
Instead of treating overlapping mates as independent evidence for a given event, DRAGEN handles overlapping mates in both the germline and somatic pipelines as follows.
When the two overlapping mates agree with each other on the allele with the highest HMM score, the genotyper uses the mate with the greatest difference between the highest and the second highest HMM score. The HMM score of the other mate becomes zero.
When the two overlapping mates disagree, the genotyper sums the HMM score between the two mates, assigns the combined score to the mate that agrees with the combined result, and changes the HMM score of the other mate to zero.
The base qualities of overlapping mates are no longer adjusted.
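The two cases for overlapping mates can be sketched as follows. This is a simplified model under stated assumptions, not the DRAGEN implementation: each mate is represented as a dictionary from allele to HMM score, a higher score is taken to mean stronger support, and when the combined result agrees with neither mate the combined score is assigned to the first mate. The function name is hypothetical.

```python
def resolve_overlapping_mates(mate1, mate2):
    """Return adjusted (mate1, mate2) score dicts for an overlapping pair."""
    def top(scores):
        return max(scores, key=scores.get)

    def margin(scores):
        # Difference between the highest and second-highest score.
        ordered = sorted(scores.values(), reverse=True)
        return ordered[0] - ordered[1] if len(ordered) > 1 else ordered[0]

    zero1 = {allele: 0.0 for allele in mate1}
    zero2 = {allele: 0.0 for allele in mate2}

    if top(mate1) == top(mate2):
        # Mates agree: keep the mate with the larger margin, zero the other.
        if margin(mate1) >= margin(mate2):
            return dict(mate1), zero2
        return zero1, dict(mate2)

    # Mates disagree: sum the scores and give the combined score to the
    # mate that agrees with the combined result.
    combined = {a: mate1.get(a, 0.0) + mate2.get(a, 0.0)
                for a in set(mate1) | set(mate2)}
    if top(mate2) == top(combined):
        return zero1, combined
    return combined, zero2  # assumption: mate1 also wins when neither agrees


print(resolve_overlapping_mates({"A": 10.0, "C": 2.0}, {"A": 3.0, "C": 5.0}))
```

Either way, only one mate of the pair contributes evidence, so the overlap is not counted twice.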
Typically, there are approximately 100 mitochondria in each mammalian cell. Each mitochondrion harbors 2–10 copies of mitochondrial DNA (mtDNA). For example, if 20% of the chrM copies have a variant, then the allele frequency (AF) is 20%. This is also referred to as continuous allele frequency. The expectation is that the AF of variants on chrM is anywhere between 0% and 100%.
DRAGEN processes chrM through a continuous AF pipeline, which is similar to the somatic variant calling pipeline. In this case, a single ALT allele is considered and the AF is estimated. The estimated AF can be anywhere between 0% and 100%. Default variant AF thresholds are applied to mitochondrial variant calling.
--vc-enable-af-filter-mito
Whether to enable the allele frequency filter for mitochondrial variant calling. The default is true.
--vc-af-call-threshold-mito
Set the threshold for emitting calls in the VCF. The default is 0.01.
--vc-af-filter-threshold-mito
Set the threshold to mark emitted VCF calls as filtered. The default is 0.02.
QUAL and GQ are not output in the chrM variant records. Instead, the confidence score is FORMAT/SQ, which gives the Phred-scaled confidence that a variant is present at a given locus. A call is made if FORMAT/SQ > vc-sq-call-threshold (default = 0.1).
The following filters can be applied to mitochondrial variant calls.
--vc-sq-call-threshold
Set the SQ threshold for emitting calls in the VCF. The default is 0.1.
--vc-sq-filter-threshold
Set the SQ threshold to mark emitted VCF calls as filtered. The default is 3.0
--vc-enable-triallelic-filter
Enables the multiallelic filter. The default value is false.
If FORMAT/SQ < vc-sq-call-threshold, the variant is not emitted in the VCF. If FORMAT/SQ > vc-sq-call-threshold but FORMAT/SQ < vc-sq-filter-threshold, the variant is emitted in the VCF with FILTER=weak_evidence.
If FORMAT/SQ > vc-sq-call-threshold and FORMAT/SQ > vc-sq-filter-threshold, and no other filters are triggered, the variant is output in the VCF with FILTER=PASS.
The following are example VCF records on chrM. The examples show one call with very high AF and another with low AF. In both cases FORMAT/SQ > vc-sq-call-threshold. FORMAT/SQ is also > vc-sq-filter-threshold, so the FILTER annotation is PASS.
For homref calls (e.g. in NON_REF regions of gVCF output), the FORMAT/GT is hard-coded to 0/0. The FORMAT/AF gives an estimate of the variant allele frequency, which can be anywhere within [0,1]. For variant calls with FORMAT/AF < 95%, the FORMAT/GT is set to 0/1. For variants with very high allele frequencies (FORMAT/AF ≥ 95%), the FORMAT/GT is set to 1/1.
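As a rough illustration of the SQ thresholds and the GT rule for chrM variant calls, the following sketch classifies a single call. The function name is hypothetical, and the defaults are taken from the --vc-sq-call-threshold and --vc-sq-filter-threshold option descriptions. The text does not say what happens when SQ equals a threshold, so the comparisons at the boundaries are an assumption. The AF filters and the triallelic filter are not modeled.

```python
def classify_chrm_call(sq, af, call_threshold=0.1, filter_threshold=3.0):
    """Return None if the call is not emitted, else (GT, FILTER)."""
    if sq < call_threshold:
        return None  # below the call threshold: not written to the VCF
    gt = "1/1" if af >= 0.95 else "0/1"
    flt = "weak_evidence" if sq < filter_threshold else "PASS"
    return gt, flt


for sq, af in [(0.05, 0.50), (2.0, 0.20), (50.0, 0.99)]:
    print(sq, af, classify_chrm_call(sq, af))
```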
The following is an example of a variant record on chrM in a trio joint VCF. The variant was detected in the second sample with a confidence score that passed the filter threshold. In the first and third samples GT=0/0, which indicates a tentative hom-ref call (ie, that position for the sample is in a NON_REF region over which no variant was detected with sufficient confidence), but the weak_evidence filter tag indicates that this call is made with low confidence.
We leverage the new multigenome graph reference and graph mapper output to compute a personalized 2-haplotype reference for the input sample.
The computed 2-haplotype reference is used to impute variants, adjust prior probabilities for genotypes in the variant caller, and create a new personalized machine learning model, which significantly boosts the accuracy of variant calling. False negatives are reduced by adjusting genotype priors based on imputed phased variants in the computed haplotypes. False positives are reduced by limiting the impact of noise from other population haplotypes.
To enable personalized variant calling and machine learning, set --enable-personalization
to true (default: false).
Note that this is a beta feature and available only for the germline small variant caller when run with a V4 multigenome graph reference.
When variants at multiple loci in a single active region are detected jointly, genotyping can benefit. DRAGEN combines loci into a joint detection region if the following conditions are met:
Loci have alleles that overlap each other.
Loci are in an STR region or less than 10 bases away from an STR region.
Loci are less than 10 bases apart from each other.
Joint detection generates a haplotype list where all possible combinations of the alleles in the joint detection regions are represented. This calculation leads to a larger number of haplotypes. During genotyping, joint detection calculates the likelihoods that each haplotype pair is the truth, given the observed read pileup. Genotype likelihoods are calculated as the sum of the likelihoods of haplotype pairs that support the alleles in the genotypes. Genotypes with maximum likelihood are reported.
Joint detection is enabled by default. To disable joint detection, set --vc-enable-joint-detection
to false.
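The distance condition and the haplotype enumeration can be sketched as follows. This is a simplified illustration, not the DRAGEN implementation: each locus is reduced to a single position, the allele-overlap and STR conditions are not modeled, and both function names are hypothetical.

```python
from itertools import product


def group_loci(positions, max_gap=10):
    """Group loci that are less than max_gap bases from their neighbor."""
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] < max_gap:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups


def enumerate_haplotypes(alleles_per_locus):
    """All combinations of one allele per locus in a joint detection region."""
    return list(product(*alleles_per_locus))


print(group_loci([100, 105, 130, 138, 200]))
# Two loci with 2 and 3 alleles produce 2 * 3 = 6 candidate haplotypes.
print(len(enumerate_haplotypes([["A", "T"], ["G", "GA", "GAA"]])))
```

The product grows multiplicatively with the number of loci and alleles, which is why joint detection leads to a larger haplotype list.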
DRAGEN has two algorithms that model correlated errors across reads in a given pileup.
Foreign read detection (FRD) detects mismapped reads. FRD modifies the probability calculation to account for the possibility that a subset of the reads were mismapped. Instead of assuming that mapping errors occur independently per read, FRD estimates the probability that a burst of reads is mismapped, by incorporating such evidence as MAPQ and skewed AF.
Mapping errors typically occur in bursts, but treating mapping errors as independent error events per read can result in high confidence scores in spite of low MAPQ and/or skewed AF. One possible strategy to mitigate overestimation of confidence scores is to include a threshold on the minimum MAPQ used in the calculation. However, this strategy can discard evidence and result in false positives.
FRD extends the legacy genotyping algorithm by incorporating an additional hypothesis that reads in the pileup might be foreign reads (ie, their true location is elsewhere in the reference genome). The algorithm exploits multiple properties (skewed allele frequency and low MAPQ) and incorporates this evidence into the probability calculation.
Sensitivity is improved by rescuing FN, correcting genotypes, and enabling lowering of the MAPQ threshold for incoming reads into the variant caller. Specificity is improved by removing FP and correcting genotypes.
The base quality drop off (BQD) algorithm detects systematic and correlated base call errors caused by the sequencing system. BQD exploits certain properties of those errors (strand bias, position of the error in the read, base quality) to estimate the probability that the alleles are the result of a systematic error event rather than a true variant.
Bursts of errors that occur at a specific locus have distinct characteristics differentiating them from true variants. The base quality drop off (BQD) algorithm is a detection mechanism that exploits certain properties of those errors (strand bias, position of the error in the read, low mean base quality over said subset of reads at the locus of interest) and incorporates them into the probability calculation.
The DRAGEN Copy Number Variant (CNV) Pipeline can call CNV events using next-generation sequencing (NGS) data. This pipeline supports multiple applications in a single interface via the DRAGEN Host Software, including processing of whole-genome sequencing (WGS) data and whole-exome sequencing (WES) data for germline analysis.
The DRAGEN CNV pipeline supports two normalization modes of operation. The two modes apply different normalization techniques to handle biases that differ based on the application, for example, WGS versus WES. While the default option settings attempt to provide the best trade-off in terms of speed and accuracy, a specific workflow may require more finely tuned option settings.
The DRAGEN CNV pipeline follows the workflow shown in the following figure.
DRAGEN CNV Pipeline Workflow
The DRAGEN CNV Pipeline uses many aspects of the DRAGEN secondary analysis available in other pipelines, such as hardware acceleration and efficient I/O processing. To enable CNV processing in the DRAGEN Host Software, set the --enable-cnv
command line option to true.
The CNV pipeline has the following processing modules:
Target Counts --- Binning of the read counts and other signals from alignments.
Bias Correction --- Correction of intrinsic system biases.
Normalization --- Detection of normal ploidy levels and normalization of the case sample.
Segmentation --- Breakpoint detection via segmentation of the normalized signal.
Calling / Genotyping --- Thresholding, scoring, qualifying, and filtering of putative events as copy number variants.
The normalization module can optionally take in a panel of normals (PoN), which is used when cohort or population samples are readily available. Note that PoN normalization is not available for somatic WGS analysis. All other modules are shared between the different CNV algorithms.
The following figures show a high-level overview of the steps in the DRAGEN CNV Pipeline as the signal traverses through the various stages. These figures are examples and are not identical to the plots that are generated from the DRAGEN CNV Pipeline.
The first step in the DRAGEN CNV Pipeline is the target counts stage. The target counts stage extracts signals such as read count and improper pairs and puts them into target intervals.
Read Count Signal
Improper Pairs Signal
Next, the case sample is normalized against the panel of normals or against the estimated normal ploidy level. Any other biases are subtracted out of the signal to amplify any event level signals.
Normalization
The normalized signal is then segmented using one of the available segmentation algorithms. Events are then called from the segments.
Segments
Called Events
The events are then scored and emitted in the output VCF.
The following are the top-level options that are shared with the DRAGEN Host Software to control the CNV pipeline. You can input a BAM or CRAM file into the CNV pipeline. If you are using the DRAGEN mapper and aligner, you can use FASTQ files.
--bam-input
--- The BAM file to be processed.
--cram-input
--- The CRAM file to be processed.
--enable-cnv
--- Enable or disable CNV processing. Set to true to enable CNV processing.
--enable-map-align
--- Enables the mapper and aligner module. The default is true, so all input reads are remapped and aligned unless this option is set to false.
--fastq-file1
, --fastq-file2
--- FASTQ file or files to be processed.
--output-directory
--- Output directory where all results are stored.
--output-file-prefix
--- Output file prefix that will be prepended to all result file names.
--ref-dir
--- The DRAGEN reference genome hashtable directory.
The output and filtering options control the CNV output files.
--cnv-exclude-bed
--- Specifies a BED file that indicates the intervals to exclude from the CNV analysis. If a target interval overlaps the regions in the exclude BED file by more than cnv-exclude-bed-min-overlap
, the target interval is suppressed.
--cnv-exclude-bed-min-overlap
--- Specifies the threshold, as a fraction of overlap between a target interval and the excluded region, above which the target interval is filtered. The default is 0.5.
--cnv-enable-ref-calls
--- Emit copy neutral (REF) calls in the output VCF file. The default is true for single WGS CNV analysis.
--cnv-enable-tracks
--- Generate track files that can be imported into IGV for viewing. When this option is enabled, a *.gff
file for the output variant calls is generated, as well as *.bw
files for the tangent normalized signal. The default is true.
--cnv-filter-bin-support-ratio
--- Filters out a candidate event if the span of supporting bins is less than the specified ratio with respect to the overall event length. This filter only applies to records with length greater than cnv-filter-bin-support-ratio-min-len
. The default ratio is 0.2 (20% support). As an example, if an event is called and has a length of 100,000 bp, but the target interval bins that support the call span a total of only 15,000 bp (15,000/100,000 = 0.15), then the event is filtered out. If applied, the record will have cnvBinSupportRatio
as a filter.
--cnv-filter-bin-support-ratio-min-len
--- Minimum length of candidate event at which to apply cnv-filter-bin-support-ratio
. Currently only applied to germline WGS workflows, with default value of 80,000 bp.
--cnv-filter-copy-ratio
--- Specifies the minimum copy ratio (CR) threshold value centered about 1.0 at which a reported event is marked as PASS in the output VCF file. The default value is 0.2, which leads to calls with CR between 0.8 and 1.2 being filtered. If applied, the record will have cnvCopyRatio
as a filter.
--cnv-filter-length
--- Specifies the minimum event length in bases at which a reported event is marked as PASS in the output VCF file. The default is 10000. If applied, the record will have cnvLength
as a filter.
--cnv-filter-qual
--- Specifies the QUAL value at which a reported event is marked as PASS in the output VCF file. You should adjust the parameter value according to your own application data. If applied, the record will have cnvQual
as a filter.
--cnv-min-qual
--- Specifies the minimum reported QUAL. The default is 3.
--cnv-max-qual
--- Specifies the maximum reported QUAL. The default is 200.
--cnv-qual-length-scale
--- Specifies the bias weighting factor to adjust QUAL estimates for segments with longer lengths. This is an advanced option and should not need to be modified. The default is 0.9303 (2^-0.1).
--cnv-qual-noise-scale
--- Specifies the bias weighting factor to adjust QUAL estimates based on sample variance. This is an advanced option and should not need to be modified. The default is 1.0.
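To make the filter options concrete, the following sketch applies the four filters to one candidate event. It is a hedged illustration rather than DRAGEN code: the function name and argument names are hypothetical, the defaults are taken from the option descriptions above, and `qual_threshold` has no default because none is stated for --cnv-filter-qual. How values exactly on a threshold are treated is an assumption.

```python
def cnv_filters(length, supporting_bp, copy_ratio, qual, qual_threshold,
                bin_support_ratio=0.2, bin_support_min_len=80_000,
                copy_ratio_threshold=0.2, min_length=10_000):
    """Return the list of filter tags for an event, or ["PASS"]."""
    filters = []
    # Bin support is checked only for events longer than the minimum length.
    if (length > bin_support_min_len
            and supporting_bp / length < bin_support_ratio):
        filters.append("cnvBinSupportRatio")
    # Copy ratios too close to 1.0 (0.8 to 1.2 by default) are filtered.
    if abs(copy_ratio - 1.0) < copy_ratio_threshold:
        filters.append("cnvCopyRatio")
    if length < min_length:
        filters.append("cnvLength")
    if qual < qual_threshold:
        filters.append("cnvQual")
    return filters or ["PASS"]


# The 100,000 bp event with 15,000 bp of supporting bins from the text.
print(cnv_filters(100_000, 15_000, copy_ratio=1.5, qual=50, qual_threshold=10))
```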
The DRAGEN CNV pipeline supports multiple input formats. To run the DRAGEN CNV pipeline directly with FASTQ input without generating a BAM or CRAM file, see Streaming Alignments for instructions on streaming alignment records directly from the DRAGEN map/align stage.
DRAGEN CNV also supports running from an already mapped and aligned BAM or CRAM file. If you have data that has not yet been mapped and aligned, see Generate an Alignment File.
For the DRAGEN CNV pipeline, the hashtable must be generated with the --enable-cnv option
set to true, in addition to any other options required by other pipelines. When --enable-cnv
is true, DRAGEN generates an additional k-mer uniqueness map that the CNV algorithm uses to counteract mappability biases. You only need to generate the k-mer uniqueness map file one time per reference hashtable. The generation takes about 1.5 hours per whole human genome.
The reference hashtable is a pregenerated binary representation of the reference genome. For information on generating a hashtable, see Prepare a Reference Genome.
The following example command generates a hashtable.
The following command-line examples show how to run the DRAGEN map/align pipeline depending on your input type. The map/align pipeline generates an alignment file in the form of a BAM or CRAM file that can then be used in the DRAGEN CNV Pipeline.
You need to generate alignment files for all samples that have not already been mapped and aligned, including any samples to be used as references for normalization. Each sample must have a unique sample identifier. Use the --RGSM
option to specify the identifier. For BAM and CRAM input files, the sample identifier is taken from the file, so the --RGSM
option is not required.
The following example command maps and aligns a FASTQ file:
The following example command maps and aligns an existing BAM file:
The following example command maps and aligns an existing CRAM file:
DRAGEN can map and align FASTQ samples, and then directly stream them to downstream callers, such as the CNV Caller and the Haplotype Variant Caller. You can use this process to skip generation of a BAM or CRAM file, which bypasses the need to store additional files.
To stream alignments directly to the DRAGEN CNV pipeline, run the FASTQ sample through a regular DRAGEN map/align workflow, and then provide additional arguments to enable CNV. The following example command line maps and aligns a FASTQ file, and then sends the file to the Germline CNV WGS pipeline.
For information on running CNV concurrently with the Haplotype Variant Caller, see Concurrent CNV and Small Variant Calling.
The target counts stage is the first processing stage for the DRAGEN CNV pipeline. This stage bins the alignments into intervals. The primary analysis format for CNV processing is the target counts file, which contains the feature signals that are extracted from the alignments to be used in downstream processing. The binning strategy, interval sizes, and their boundaries are controlled by the target counts generation options, and the normalization technique used.
When working with whole genome sequence data, the intervals are autogenerated from the reference hashtable. Only the primary contigs from the reference hashtable are considered for binning. You can specify additional contigs to bypass with the --cnv-skip-contig-list
option.
With whole exome sequence data, DRAGEN uses the target BED file supplied with the --cnv-target-bed
option to determine the intervals for analysis.
The target counts stage generates a .target.counts.gz
file. You can use the file later in place of any BAM or CRAM by specifying the file with the --cnv-input
option for the normalization stage. The .target.counts.gz
file is an intermediate file for the DRAGEN CNV pipeline and should not be modified.
Further details are available in the Output Files section.
If the samples are whole genome, then the effective target intervals width is specified with the --cnv-interval-width
option. The higher the coverage of a sample, the higher the resolution that can be detected. This option is important when running with a panel of normals because all samples must have matching intervals. For self-normalization, the actual width of a given target interval might be larger than the specified value.
The default value for WGS is 1000 bp with a sample coverage of ≥ 30x.
Using a cnv-interval-width
of less than 250 bp for WGS analysis can drastically increase runtime.
The intervals are autogenerated for every primary contig in the reference. Only references that have the UCSC or GRC convention are supported. For example, chr1, chr2, chr3, ..., chrX, chrY
or 1, 2, 3, ..., X, Y
. You can specify a list of contigs to skip by using the --cnv-skip-contig-list
option. This option takes a comma-separated list of contig identifiers. The contig identifiers must match the reference hashtable that you are using. By default, only the mitochondrial chromosomes are skipped. Non-primary contigs are never processed.
For example, to skip chromosome M, X, and Y, use the following option:
If the samples are whole exome samples, supply a target BED file with the --cnv-target-bed $TARGET_BED
option. The intervals in the target BED file indicate regions where alignments are expected based on the target capture kit. The BED file intervals are further split into intervals of smaller size, depending on the value of cnv-interval-width
.
To use a standard BED file, make sure that there is no header present in the file. In this case, all columns after the third column are ignored, similar to the operation of DRAGEN Variant Caller.
The following options control the generation of target counts.
--cnv-counts-method
--- Specifies the counting method for an alignment to be counted in a target bin. Values are midpoint, start, or overlap. The default value is overlap when using the panel of normals approach, which means if an alignment overlaps any part of the target bin, the alignment is counted for that bin. In the self-normalization mode, the default counting method is start.
--cnv-min-mapq
--- Specifies the minimum MAPQ for an alignment to be counted during target counts generation. The default value is 3 for self-normalization and 20 otherwise. When generating counts for panel of normals, all MAPQ0 alignments are counted.
--cnv-target-bed
--- Specifies a properly formatted BED file that indicates the target intervals to sample coverage over. For use in WES analysis.
--cnv-interval-width
--- Specifies the width of the sampling interval for CNV processing. This option controls the effective window size. The default is 1000 for WGS analysis and 500 for WES analysis.
--cnv-skip-contig-list
--- Specifies a comma-separated list of contig identifiers to skip when generating intervals for WGS analysis. The default contigs that are skipped, if not specified, are chrM,MT,m,chrm
.
--cnv-filter-duplicate-alignments
--- Filters duplicate-marked alignments during target counts generation if the option is set to true
. The default setting is false
.
Target counts options are recorded in the header of each counts file to facilitate review and validation of the panel of normals. If the PON counts were generated with different count options than the case sample, DRAGEN returns an option validation error.
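The three values of --cnv-counts-method and the --cnv-min-mapq cutoff can be illustrated with a small sketch. This is not DRAGEN code: the function name is hypothetical, bins and alignments are assumed to be half-open (start, end) coordinates on a single contig, and the midpoint is rounded down.

```python
def count_alignments(bins, alignments, method="overlap", min_mapq=20):
    """Count alignments per bin.

    bins: list of (start, end); alignments: list of (start, end, mapq).
    """
    counts = [0] * len(bins)
    for a_start, a_end, mapq in alignments:
        if mapq < min_mapq:
            continue  # alignment is below the MAPQ cutoff
        for i, (b_start, b_end) in enumerate(bins):
            if method == "overlap":
                hit = a_start < b_end and a_end > b_start
            elif method == "start":
                hit = b_start <= a_start < b_end
            elif method == "midpoint":
                hit = b_start <= (a_start + a_end) // 2 < b_end
            else:
                raise ValueError(f"unknown counting method: {method}")
            if hit:
                counts[i] += 1
    return counts


bins = [(0, 1000), (1000, 2000)]
reads = [(900, 1050, 60), (1500, 1650, 60), (100, 250, 0)]
for method in ("overlap", "start", "midpoint"):
    print(method, count_alignments(bins, reads, method=method))
```

With overlap counting, the read that straddles position 1000 contributes to both bins, whereas start and midpoint counting assign each read to a single bin.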
PCR duplicates are often considered as noise in coverage depth information. DRAGEN CNV has an option to include/exclude duplicate marked alignments: --cnv-filter-duplicate-alignments
when counting alignments. This relies on the alignments having the duplicate-marked bit (0x400) in the SAM flag set correctly.
If --enable-map-align=false
, then duplicate marking should be present in the input file (pre-aligned BAM/CRAM). If --enable-map-align=true
, then --enable-duplicate-marking=true
should be set.
Note that CNV will wait for duplicate marking from the Map/Aligner which may increase overall run time.
In the WGS case where a BED file is not specified for a given reference, the same intervals should be generated each time. The intervals created take into account the mappability of the reference genome using a k-mer uniqueness map created during hashtable generation.
Due to ambiguity that may arise from non-unique genomic loci, only regions corresponding to unique k-mers are considered. A position in the reference genome is marked as unique if the k-mer starting at that position does not appear anywhere else in the reference genome; otherwise, it is marked as non-unique. Furthermore, if the k-mer contains any bases other than A, C, T, or G, it is marked as non-unique.
For WGS samples and in absence of a cnv-target-bed
file, the target intervals are auto generated based on the pre-computed k-mer-uniqueness map for a given input reference hashtable, and the cnv-interval-width
option, which defaults to 1000bp. The cnv-interval-width
option determines the minimum number of unique k-mer positions required in the interval. There is an upper bound to the length of the interval: when the length of the interval is greater than double the size of cnv-interval-width
, without reaching the required count of unique k-mer positions, the interval is discarded and the process starts again at the next genomic position. Regions that are discarded are referred to as "dropout" regions and are recorded with exclusion reason NON_KMER_UNIQUE
in the *.cnv.excluded_intervals.bed.gz
file.
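The interval generation described above can be approximated with the following sketch. It is an illustration under several assumptions, not the DRAGEN algorithm: the uniqueness map is modeled as a list of 0/1 flags for one contig, an interval is capped at twice the requested width, the scan restarts immediately after a discarded span, and an incomplete interval at the end of the contig is treated as a dropout. The function name is hypothetical.

```python
def build_intervals(unique, width):
    """Return (intervals, dropouts) as lists of half-open (start, end).

    unique: sequence of 0/1 flags, 1 where the k-mer starting at that
    position is unique. width: required number of unique positions.
    """
    intervals, dropouts = [], []
    n = len(unique)
    start = 0
    while start < n:
        count = 0
        pos = start
        # Extend until enough unique positions are found or the
        # interval reaches twice the requested width.
        while pos < n and count < width and pos - start < 2 * width:
            count += unique[pos]
            pos += 1
        if count == width:
            intervals.append((start, pos))
        else:
            dropouts.append((start, pos))
        start = pos
    return intervals, dropouts


flags = [1] * 4 + [0] * 10 + [1] * 4
print(build_intervals(flags, width=4))
```

In this toy input, the second interval is 6 positions long even though the width is 4, which mirrors the statement that the actual width of a target interval can be larger than the specified value.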
A dropout region is a complex region that does not count alignments and results in an interval missing from the analysis. Dropout regions include centromeres, telomeres, and low complexity regions. If there is sufficient signal in the flanking regions, an event can still span these dropout regions, even if alignment counting does not occur in the regions. The event is handled by the segmentation stage.
Some of the excluded intervals can be rescued through the segmental duplication extension to the germline CNV workflow. See Segmental Duplication Extension for more details.
GC bias describes the relationship between GC content and read coverage across a genome. Biases can arise from library prep, capture kits, sequencing system differences, and mapping, and can make it difficult to call CNV events. The DRAGEN GC bias correction module attempts to correct these biases.
The GC bias correction module immediately follows the target counts stage and operates on the *.target.counts.gz
file. GC bias correction generates a GC bias corrected version of the file, which has a *.target.counts.gc-corrected.gz
extension in the file name. The GC bias corrected versions are recommended for any downstream processing when working with WGS data. For WES, if there are enough target regions, then the GC bias corrected counts can also be used. See Output Files for further details on GC-corrected target counts files.
Typical whole-exome capture kits have over 200,000 targets spanning the regions of interest. If your BED file has fewer than 200,000 targets, or if the target regions are localized to a specific region in the genome (such that GC bias statistics might be skewed), then GC bias correction should be disabled.
The following options control the GC bias correction module.
--cnv-enable-gcbias-correction
--- Enable or disable GC bias correction when generating target counts. The default is true.
--cnv-enable-gcbias-smoothing
--- Enable or disable smoothing the GC bias correction across adjacent GC bins with an exponential kernel. The default is true.
--cnv-num-gc-bins
--- Specifies the number of bins for GC bias correction. Each bin represents the GC content percentage. Allowed values are 10, 20, 25, 50, or 100. The default is 25.
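The idea of correcting counts per GC bin can be illustrated with a generic median-scaling sketch. This is a common textbook approach and is not claimed to be the method DRAGEN uses. The function name is hypothetical, the GC content is assumed to be given as a fraction per target interval, and the smoothing controlled by --cnv-enable-gcbias-smoothing is not modeled.

```python
from statistics import median


def gc_correct(counts, gc_fractions, num_bins=25):
    """Scale each count by (global median / median of its GC bin)."""
    # Assign each interval to a GC bin; a GC fraction of 1.0 goes in the last bin.
    bin_of = [min(int(gc * num_bins), num_bins - 1) for gc in gc_fractions]
    per_bin = {}
    for b, c in zip(bin_of, counts):
        per_bin.setdefault(b, []).append(c)
    global_median = median(counts)
    bin_median = {b: median(values) for b, values in per_bin.items()}
    corrected = []
    for b, c in zip(bin_of, counts):
        if bin_median[b] > 0:
            corrected.append(c * global_median / bin_median[b])
        else:
            corrected.append(float(c))  # leave empty bins unscaled
    return corrected


# GC-rich intervals have twice the coverage; correction levels them out.
print(gc_correct([100, 100, 200, 200], [0.30, 0.31, 0.62, 0.63]))
```

A per-bin median is only stable when each bin holds enough intervals, which is consistent with the recommendation to disable the correction for small or localized target sets.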
The DRAGEN CNV pipeline supports two normalization algorithms:
Self-Normalization --- Estimates the autosomal diploid level from the sample under analysis to determine the baseline level to normalize by. Sex chromosomes and PAR regions are handled based on the sample sex.
Panel of Normals --- A reference-based normalization algorithm that uses additional matched normal samples to determine a baseline level from which to call CNV events. Here, a matched normal sample is one that has undergone the same library prep and sequencing workflow as the case sample.
Which algorithm to use depends on the available data and the application. Use the following guidelines to select the mode of normalization.
Self-Normalization:
Whole genome sequencing
Single sample analysis
Additional matched samples are not readily available
Simpler workflow via a single invocation
Only references with chr1, chr2, chr3, ..., chrX, chrY
or 1, 2, 3, ..., X, Y
naming conventions are supported.
Panel of Normals:
Whole genome sequencing (non-somatic)
Whole exome sequencing
Targeted panels, including somatic panels
Additional matched samples are available
Nonhuman samples
The DRAGEN CNV pipeline provides the self-normalization mode that does not require a reference sample or a panel of normals. To enable this mode, set --cnv-enable-self-normalization
to true. Self-normalization mode bypasses the need to run two stages and can save time. It uses the statistics within the case sample to determine the baseline from which to make a call.
Because self normalization uses the statistics within the case sample, this mode is not recommended for WES or targeted sequencing analysis due to the potential for insufficient data.
The self-normalization mode is the recommended approach for whole-genome sequencing single sample processing. The pipeline continues through to the segmentation and calling stage to produce the final called events.
If you are running from a FASTQ sample, then the default mode of operation is self-normalization.
When operating in self-normalization mode, the --cnv-interval-width
option used during the target counts stage becomes the effective interval width based on the number of unique k-mer positions. You typically do not have to modify this option.
Self-normalization autogenerates the target intervals to use during the analysis based on the reference genome and is only compatible with standard human references.
If the user wishes to attempt self-normalization mode on non-standard human references, an override can be set via --cnv-bypass-contig-check=true
. Under this setting, the CNV caller performs a naive median normalization across all of the contigs within the reference genome. This feature is experimental and for research use only. No claims are made for its use, and it has not been validated.
The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. This allows the algorithm to subtract system level biases that are not sample specific. The generation of the target counts for these normal samples should also have identical command line options with the case sample under analysis.
In this mode, the DRAGEN CNV Pipeline is broken down into two distinct stages. The target counts stage is performed on each sample, both case and normals, to bin the alignments. The normalization and call detection stage is then performed with the case sample against the panel of normals to determine the events.
Target counts should be generated for all samples, whether the samples are to be used as references or are the case samples under analysis. The case samples and all samples to be used in the panel of normals must have identical intervals and therefore should be generated with identical settings. The target counts stage also performs GC bias correction. GC bias correction is enabled by default.
The following examples are for WES processing, which is the case where a panel of normals is required.
The following is an example command for processing a BAM file.
The following is an example command for processing a CRAM file.
When running an analysis with a panel of normals (a set of target counts files), a column-wise concatenated version of the panel is output as a *.combined.counts.txt.gz file. To generate this file without running the actual calling step, add the --cnv-generate-combined-counts=true
option to the command line. The individual panel of normals target counts files must be specified either via --cnv-normals-file
(one option per file) or --cnv-normals-list
(single text file with paths to each sample).
The following is an example command line using a normals list:
The next step in the CNV pipeline when using a panel of normals is to perform the normalization and to make the calls. This involves a separate execution of DRAGEN during which the normalization is performed and calls are made. This step requires the specification of a set of target counts files to be used for reference-based median normalization.
Ideally the panel of normals samples follow library prep and sequencing workflows that are identical to the workflows of the case sample under analysis. In order to be applicable to both male and female case samples, the panel of normals should include a balanced set of both male and female samples. DRAGEN automatically handles calling on sex chromosomes based on the predicted sex of each sample in the panel.
For optimal bias correction, a minimum of 50 samples is recommended as a panel. DRAGEN can run with smaller numbers of samples in the panel, down to even just a single sample, but smaller panels can result in artifactual calls in the test sample where at least some of the panel samples have copy number changes. Larger panels do not entirely prevent such issues, but they limit them to regions where non-reference copy numbers are common.
The following is an example of PON files, which uses a subset of the GC corrected files from the target counts stage.
DRAGEN accepts 3 different file formats for a Panel of Normals (PON).
The CNV caller can also be started from the *.target.counts.gz (raw counts) or *.target.counts.gc-corrected.gz (GC corrected counts) file of the case sample by specifying the selected file with the --cnv-input option together with the PON options described above. When using GC corrected counts, set --cnv-enable-gcbias-correction to false to disable the GC-correction stage.
For example, the following command normalizes the case sample against the panel of normals.
See Output Files for a description of the target counts files.
These options control the preconditioning of the panel of normals and the normalization of the case sample.
--cnv-enable-self-normalization
--- Enable/disable self normalization mode, which does not require a panel of normals.
--cnv-extreme-percentile
--- Specifies the extreme median percentile value at which to filter out samples. The default is 2.5.
--cnv-input
--- Specifies a target counts file for the case sample under analysis when using a panel of normals.
--cnv-normals-file
--- Specifies a target.counts.gz file to be used in the panel of normals. You can use this option multiple times, one time for each file.
--cnv-normals-list
--- Specifies a text file that contains paths to the list of reference target counts files to be used as a panel of normals. Absolute paths are recommended in case the workflow is used later or shared with other users. Relative paths are supported if the paths are relative to the current working directory.
--cnv-max-percent-zero-samples
--- Specifies the maximum percentage of zero coverage samples allowed for a target. If the target exceeds the specified threshold, the target is filtered out. The default value is 5%. The option is sensitive to the number of normal samples being used, so adjust the threshold accordingly. If your panel of normals is small and the threshold is not adjusted, the option could filter out targets that were not intended to be filtered.
--cnv-max-percent-zero-targets
--- Specifies the maximum percentage of zero coverage targets allowed for a sample. If the sample exceeds the specified threshold, the sample is filtered out. The default value is 2.5%. The option is sensitive to the total number of target intervals, so adjust the threshold accordingly. If the capture kit has a small number of probes and the threshold is not adjusted, the option could filter out samples that were not intended to be filtered.
--cnv-target-factor-threshold
--- Specifies the bottom percentile of panel of normals medians to filter out useable targets. The default is 1% for whole genome processing and 5% for targeted sequencing processing.
--cnv-truncate-threshold
--- Specifies a percentage threshold for truncating extreme outliers. The default is 0.1%.
--cnv-enable-gender-matched-pon
--- Enable/disable gender matched PON normalization. If enabled, DRAGEN uses matched gender PON for sex chromosome normalization. Sex chromosome intervals are filtered if PON has no matched gender sample. The default value is true.
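The sensitivity of --cnv-max-percent-zero-samples to panel size, noted in the option list above, can be seen in a small sketch. It assumes the percentage is the share of panel samples with a zero count at a target, and that a target is dropped only when that share is strictly greater than the threshold. The function name and the toy count matrices are illustrative and are not part of DRAGEN.

```python
def targets_passing_zero_sample_filter(counts, max_percent_zero_samples=5.0):
    """Return indices of targets that survive the zero-coverage filter.

    counts: one row per target, one column per panel-of-normals sample.
    Assumption: a target is removed when the percentage of samples with a
    zero count strictly exceeds the threshold.
    """
    kept = []
    for index, row in enumerate(counts):
        percent_zero = 100.0 * sum(1 for c in row if c == 0) / len(row)
        if percent_zero <= max_percent_zero_samples:
            kept.append(index)
    return kept


# Toy panels: the second target has exactly one zero-coverage sample.
small_panel = [[120] * 10, [0] + [95] * 9]    # 1 of 10 samples = 10%
large_panel = [[120] * 40, [0] + [95] * 39]   # 1 of 40 samples = 2.5%

print(targets_passing_zero_sample_filter(small_panel))  # [0]
print(targets_passing_zero_sample_filter(large_panel))  # [0, 1]
```

With 10 samples, a single zero-coverage sample already exceeds the 5% default, which is why the threshold may need to be raised for small panels.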
After a case sample has been normalized, the sample goes through a segmentation stage. DRAGEN implements multiple segmentation algorithms, including the following algorithms:
Circular Binary Segmentation (CBS)
Shifting Level Models (SLM)
The SLM algorithm has three variants: SLM, Heterogeneous SLM (HSLM), and Adaptive SLM (ASLM). HSLM is for use in exome analysis and handles target capture kits that are not equally spaced. ASLM includes additional sample-specific estimation of technical variability of depth of coverage, as opposed to changes in copy number. The estimations are based on the median variance within fixed windows or a preliminary set of segments based on b-allele ratios. The ASLM algorithm mitigates over-segmentation due to noisy or wavy samples; this is the default mode for somatic WGS analysis.
By default, SLM is the segmentation algorithm for germline whole genome processing, ASLM is the algorithm for somatic whole genome processing, and HSLM is the algorithm for whole exome processing.
For the targeted sequencing workflows, you can also run with a --cnv-segmentation-bed. The option predefines the segments for which copy numbers are estimated and skips the segmentation step of the workflow. See Targeted Segmentation (Segment BED) for more information.
--cnv-segmentation-mode
--- Specifies the segmentation algorithm to perform. The following values are available.
bed
cbs
slm
--- The default for germline WGS analysis.
aslm
--- The default for somatic WGS analysis.
hslm
--- The default for targeted/WES analysis.
--cnv-merge-distance
--- Specifies the maximum number of base pairs between two segments that would allow them to be merged. The default value is 0 for germline WGS, which means the segments must be directly adjacent. For WES analysis, this parameter is disabled by default due to the spacing of targeted intervals.
--cnv-merge-threshold
--- Specifies the maximum segment mean difference at which two adjacent segments should be merged. The segment mean is represented as a linear copy ratio value. The default is 0.2 for WGS and 0.4 for WES. To disable merging, set the value to 0.
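The interaction of --cnv-merge-distance and --cnv-merge-threshold can be sketched as follows. This is an illustration under stated assumptions, not DRAGEN's implementation: merging is assumed to proceed greedily from left to right, coordinates are half-open, and the merged segment mean is taken as the bin-weighted average of the two means. The function name and the segment dictionaries are hypothetical.

```python
def merge_segments(segments, merge_threshold=0.2, merge_distance=0):
    """Greedily merge neighbouring segments on one contig.

    segments: list of dicts with 'start', 'end' (half-open), 'mean'
    (linear copy ratio) and 'bins' (number of intervals), sorted by start.
    Assumptions: merging proceeds left to right, and the merged mean is the
    bin-weighted average of the two means.
    """
    if merge_threshold == 0 or not segments:
        return [dict(s) for s in segments]  # a threshold of 0 disables merging

    merged = [dict(segments[0])]
    for seg in segments[1:]:
        prev = merged[-1]
        gap = seg["start"] - prev["end"]
        close_enough = gap <= merge_distance
        similar = abs(seg["mean"] - prev["mean"]) <= merge_threshold
        if close_enough and similar:
            total = prev["bins"] + seg["bins"]
            prev["mean"] = (prev["mean"] * prev["bins"] + seg["mean"] * seg["bins"]) / total
            prev["bins"] = total
            prev["end"] = seg["end"]
        else:
            merged.append(dict(seg))
    return merged


segments = [
    {"start": 0, "end": 1000, "mean": 1.0, "bins": 10},
    {"start": 1000, "end": 2000, "mean": 1.1, "bins": 10},
    {"start": 2000, "end": 3000, "mean": 1.5, "bins": 10},
]
result = merge_segments(segments)
print([(s["start"], s["end"]) for s in result])  # [(0, 2000), (2000, 3000)]
```

The first two segments differ by 0.1 in copy ratio and are directly adjacent, so they merge under the WGS defaults; the third differs by more than 0.2 from the merged mean and stays separate.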
Circular Binary Segmentation is implemented directly in DRAGEN and is based on "A faster circular binary segmentation algorithm for the analysis of array CGH data"¹, with enhancements to improve sensitivity for NGS data. The following options control Circular Binary Segmentation.
--cnv-cbs-alpha
--- Specifies the significance level for the test to accept change points. The default is 0.01.
--cnv-cbs-eta
--- Specifies the Type I error rate of the sequential boundary for early stopping when using the permutation method. The default is 0.05.
--cnv-cbs-kmax
--- Specifies the maximum width of the smaller segment for permutation. The default is 25.
--cnv-cbs-min-width
--- Specifies the minimum number of markers for a changed segment. The default is 2.
--cnv-cbs-nmin
--- Specifies the minimum length of data for maximum statistic approximation. The default is 200.
--cnv-cbs-nperm
--- Specifies the number of permutations used for p-value computation. The default is 10000.
--cnv-cbs-trim
--- Specifies the proportion of data to be trimmed for variance calculations. The default is 0.025.
¹Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23(6):657-663. doi:10.1093/bioinformatics/btl646
The Shifting Level Models (SLM) segmentation mode follows from the R implementation of SLMSuite: a suite of algorithms for segmenting genomic profiles².
--cnv-slm-eta
--- Baseline probability that the mean process changes its value. The default is 4e-5.
--cnv-slm-fw
--- Minimum number of data points for a CNV to be emitted. The default is 0, which means segments with one design probe could in effect be emitted.
--cnv-slm-omega
--- Scaling parameter that modulates the relative weight between experimental and biological variance. The default is 0.3.
--cnv-slm-stepeta
--- Distance normalization parameter. The default is 10000. This option is only valid for HSLM.
Regardless of segmentation method, initial segments are split across large gaps where depth data is unavailable, such as across centromeres.
²Orlandini V, Provenzano A, Giglio S, Magi A. SLMSuite: a suite of algorithms for segmenting genomic profiles. BMC Bioinformatics. 2017;18(1). doi:10.1186/s12859-017-1734-5
In applications for targeted panels, you can limit the segmentation and calling to specific intervals by specifying a --cnv-segmentation-bed. For example, the specified intervals might correspond to gene boundaries matched to the targeted assay. This segmentation mode is only supported with the panel of normals and requires an accompanying --cnv-target-bed. Also specify the --cnv-segmentation-bed during the panel of normals generation step, so that all interval boundaries during analysis are matched. For more information on panel of normals generation, see Panel of Normals.
The recommended format for the BED file includes four columns and a header. The four columns are contig, start, stop, and name. The name column represents the name of the gene and must be unique within the BED file. The name is used in the output VCF and annotated as a segment identifier in the INFO/SEGID field. The following example file is in the recommended format:
If using a three-column BED file, do not include a header or the name field values. Three-column BED files should only include the contig, start, and stop values. In this case, the segment identifier is autogenerated from the coordinate fields.
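The two BED layouts described above can be checked before a run with a short script. The following is a minimal sketch, assuming tab-delimited input; the function name, the gene names, and the coordinates are made up for illustration, and the placeholder identifier used for three-column files is hypothetical because the format of the autogenerated identifier is not specified here.

```python
def parse_segmentation_bed(text):
    """Parse a three- or four-column segmentation BED given as a string.

    Returns a list of (contig, start, stop, name) tuples. For three-column
    input the name is a placeholder built from the coordinates; the real
    autogenerated identifier format is not specified here.
    """
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    if not rows:
        return []
    # A four-column file starts with a header, so its second field is not a number.
    if len(rows[0]) == 4 and not rows[0][1].isdigit():
        rows = rows[1:]

    segments, seen = [], set()
    for fields in rows:
        if len(fields) == 4:
            contig, start, stop, name = fields
        elif len(fields) == 3:
            contig, start, stop = fields
            name = f"{contig}:{start}-{stop}"  # hypothetical placeholder ID
        else:
            raise ValueError(f"expected 3 or 4 columns, got {len(fields)}")
        if name in seen:
            raise ValueError(f"duplicate segment name: {name}")
        seen.add(name)
        segments.append((contig, int(start), int(stop), name))
    return segments


# Illustrative input only; names and coordinates are not from a real assay.
four_col = "contig\tstart\tstop\tname\nchr1\t1000\t2000\tGENE_A\nchr2\t500\t900\tGENE_B\n"
print(parse_segmentation_bed(four_col))
# [('chr1', 1000, 2000, 'GENE_A'), ('chr2', 500, 900, 'GENE_B')]
```

The uniqueness check mirrors the requirement that each name appear only once in the BED file.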
Quality scores are computed using a probabilistic model that uses a mixture of heavy tailed probability distributions (one per integer copy number) with a weighting for event length. Noise variance is estimated. The output VCF contains a Phred-scaled metric that measures confidence in called amplification (CN > 2 for diploid locus), deletion (CN < 2 for diploid locus), or copy neutral (CN=2 for diploid locus) events.
The scoring algorithm also calculates exact copy-number quality scores that are inputs to the DeNovo CNV detection pipeline.
You can input an exclude BED to the CNV caller to filter out regions from analysis. An exclude BED is useful if certain regions in the genome are known to be problematic due to library prep, sequencing, or mapping issues. You can also exclude intervals that specify common CNVs to aid in downstream analysis. Specify an exclude BED file using --cnv-exclude-bed. DRAGEN does not provide an exclude BED. The intervals to exclude should be formatted in standard three-column BED format.
The intervals in the exclude BED are compared with the original target counts intervals. If the overlap is greater than --cnv-exclude-bed-min-overlap, the target counts interval is excluded from analysis. The *.target.counts.gz file still includes the interval, so you can inspect the original read counts. The normalization stage removes the excluded intervals, so the *.tn.tsv.gz file does not contain the intervals removed during normalization.
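The overlap test can be pictured with the following sketch. It assumes that the overlap is measured as the fraction of the target counts interval covered by exclude-BED intervals and that the interval is dropped when this fraction is strictly greater than the threshold; both points are assumptions, as are the function names. The threshold values used in the example are arbitrary.

```python
def overlap_fraction(target, excluded_intervals):
    """Fraction of a target interval covered by excluded intervals.

    target: (start, stop); excluded_intervals: list of (start, stop) on the
    same contig, half-open and not overlapping each other.
    """
    t_start, t_stop = target
    covered = 0
    for e_start, e_stop in excluded_intervals:
        covered += max(0, min(t_stop, e_stop) - max(t_start, e_start))
    return covered / (t_stop - t_start)


def is_excluded(target, excluded_intervals, min_overlap):
    """Assumption: min_overlap is a fraction of the target interval length."""
    return overlap_fraction(target, excluded_intervals) > min_overlap


target = (1000, 2000)
excluded = [(1500, 3000)]
print(overlap_fraction(target, excluded))   # 0.5
print(is_excluded(target, excluded, 0.4))   # True
print(is_excluded(target, excluded, 0.5))   # False
```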
An excluded interval does not guarantee that a CNV call does not span the interval. If there is sufficient data flanking the region, the segmentation stage along with any merging might still generate a call spanning the excluded interval. However, the call would not take read counts from excluded intervals into account. You can view explanations for excluded intervals in the *.excluded_intervals.bed.gz file. See Output Files for further details.
DRAGEN can perform mapping and aligning of FASTQ samples, and then directly stream the data to downstream callers. If the input is a FASTQ sample, a single sample can run through both the CNV and the small VC. This triggers self-normalization by default.
Run the FASTQ sample through a regular DRAGEN map/align workflow, and then provide additional arguments to enable the CNV, VC, or both. The options that apply to CNV in the standalone workflows are also applicable here.
The following examples show different commands.
When running the target counts stage or the normalization stage, the DRAGEN CNV pipeline also provides the following information about the samples in the run.
A correlation metric of the read count profile between the case sample and any panel of normals samples. A correlation metric greater than 0.90 is recommended for confident analysis, but there is no hard restriction enforced by the software.
The predicted sex of each sample in the run. The sex is predicted based on the read count information in the sex chromosomes and the autosomal chromosomes. The median value for the counts is printed to the screen for the autosomal chromosomes, the X chromosome, and the Y chromosome. This estimation requires a minimum of 300 target intervals on the sex chromosomes to proceed.
The results are printed to the screen when running the pipeline. For example:
The predicted sexes for samples in use are also printed to the *.cnv_metrics.csv output file. For a panel of normals, the predicted sexes are used to determine which panel samples are leveraged for normalization on sex chromosomes. If the estimated sex of the sample is UNDETERMINED, the sex of the sample is set to FEMALE.
You can override the predicted sex of the sample with the --sample-sex option.
The germline CNV workflow can be extended to call copy number alterations in a curated subset of segmentally duplicated regions. Segmental duplications are large blocks of DNA ≥ 1kb, characterized by a high degree of sequence identity at nucleotide level (> 90%). This poses a challenge for traditional approaches, and such regions are usually excluded.
This extension complements the original germline CNV workflow by using a tailored algorithm to compute the normalized coverage in such regions, which is then injected before the segmentation step and becomes part of the main CNV workflow in downstream steps. We currently recommend WGS data aligned to a supported human reference genome (currently only hg38) with at least 30x coverage. See below for additional requirements.
The following pairs of genes defining Segmental Duplications are included:
This extension is enabled by default in the germline CNV workflow. However, it requires:
Normalization set to self-normalization (--cnv-enable-self-normalization=true).
GC bias correction enabled (--cnv-enable-gcbias-correction=true).
Counts method set to start (--cnv-counts-method=start).
Interval width not greater than 10kb. However, we recommend using the --cnv-interval-width default (1kb) for best performance.
A supported reference genome build as input (currently only hg38).
If necessary, the extension can be disabled by setting --cnv-enable-segdups-extension to false.
For each duplicated region, the extension collects all reads that fall on the two homologous intervals of the pair and computes the normalized joint coverage (output to *.cnv.segdups.joint_coverage.tsv.gz).
Using sites that differentiate the two homologous intervals, the extension computes the proportion of coverage to attribute to the first and to the second interval (output to *.cnv.segdups.site_ratios.tsv.gz).
This proportion is used to redistribute the joint normalized coverage between the two homologous intervals.
The rescued intervals are output to the *.cnv.segdups.rescued_intervals.tsv.gz file for inspection and are automatically injected before the segmentation step.
During integration with the original intervals from the CNV caller, the rescued intervals take priority and replace all original intervals that they overlap.
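The redistribution step can be illustrated with a small sketch. It is not the extension's actual algorithm: the proportion is assumed here to be the pooled fraction of reads that support the first interval at the differentiating sites, and the function name and read counts are hypothetical.

```python
def split_joint_coverage(joint_coverage, site_counts):
    """Split joint normalized coverage between two homologous intervals.

    joint_coverage: normalized coverage of the pair taken together.
    site_counts: list of (reads_first, reads_second) at differentiating sites,
    i.e. reads carrying the base specific to the first or second interval.
    Assumption: the proportion is the pooled fraction of reads supporting
    the first interval.
    """
    first = sum(a for a, _ in site_counts)
    second = sum(b for _, b in site_counts)
    if first + second == 0:
        raise ValueError("no informative reads at differentiating sites")
    proportion_first = first / (first + second)
    return joint_coverage * proportion_first, joint_coverage * (1 - proportion_first)


# Illustrative read counts at three differentiating sites.
print(split_joint_coverage(2.0, [(10, 30), (12, 28), (8, 32)]))  # (0.5, 1.5)
```

In this example a quarter of the informative reads support the first interval, so it receives a quarter of the joint coverage.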
See Output Files for a description of the extension output files.
DRAGEN Multi-region Joint Detection (MRJD) is a de novo germline small variant caller for paralogous regions. In DRAGEN v4.3, MRJD covers regions that include six clinically relevant genes: NEB, TTN, SMN1/2, PMS2, STRC, and IKBKG. MRJD is compatible with the hg38, hg19, and GRCh37 reference genomes. The table below includes hg38 region coordinates covered by MRJD.
Chromosome | Start | End | Description |
---|---|---|---|
MRJD is a variant calling method that is designed to detect de novo germline small variants in paralogous regions of the genome. A conventional variant caller relies on the read aligner to determine which reads likely originated from a given location. This method works well when the region of interest does not resemble any other region of the genome over the span of a single read (or a pair of reads for paired-end sequencing). However, a significant fraction of the human genome does not meet this criterion. At least 5% of the human genome consists of segmental duplications. Many regions of the genome have near-identical copies elsewhere, and as a result, the true source location of a read might be subject to considerable uncertainty. If a group of reads is mapped with low confidence, a conventional variant caller might ignore the reads, even though they contain useful information. If a read is mismapped (i.e., the primary alignment is not the true source of the read), it can result in variant detection errors.
MRJD is designed to address the complexities raised by segmental duplications. Instead of considering each region in isolation, MRJD considers all locations from which a group of reads may have originated and attempts to detect the underlying sequences jointly across all paralogous regions in the sample of interest.
Below is a diagram showing the general workflow of MRJD in PMS2 and PMS2CL regions. MRJD takes primary alignments in all paralogous regions, regardless of mapping quality, builds and places haplotypes based on reads and prior knowledge, and computes joint genotypes to call small variants.
Figure 1. MRJD Caller workflow.
As shown in the diagram above, there are two modes of the DRAGEN MRJD Caller, default mode and high sensitivity mode. Here are details on the differences between the two modes.
With --enable-mrjd=true, the MRJD Caller reports the following two types of variants:
Uniquely placed variants, which means the variant is found and placed in one of the paralogous regions without ambiguity. See variants labeled with “type 1” in Figure 2.
Region-ambiguous variants. In this case, the aggregated genotype contains a variant allele with high confidence, but MRJD Caller is unable to place the variant allele in one of the paralogous regions with high confidence. The MRJD Caller will report the variant allele in all paralogous regions. See variants labeled with “type 2” in Figure 2.
With both --enable-mrjd=true and --mrjd-enable-high-sensitivity-mode=true, the MRJD Caller reports the same variants as in the default mode, plus two other types of variants.
Positions where the reference alleles in all paralogous regions are not the same. It is well established that gene conversion, including reciprocal crossover, is a common event between paralogous regions (such as PMS2 and PMS2CL). When a reciprocal crossover event occurs, the prior model, without nearby phasing information, might place the converted haplotype in the source region instead of the destination region, resulting in no variant. The high sensitivity mode compensates for this event by reporting the variant at the corresponding positions in all paralogous regions. See variants labeled with “type 3” in Figure 2.
Variants that have been placed uniquely in one of the paralogous regions, with no variant at the corresponding position in the other region. The high sensitivity mode also reports the variant in the remaining paralogous regions. This compensates for cases where the prior knowledge used to help place the variant is insufficient or estimated incorrectly. In those cases, the variant allele still exists but is placed in the wrong paralogous region. Reporting the variant in the other paralogous regions therefore helps maximize sensitivity even when the prior is lacking. See variants labeled with “type 4” in Figure 2.
Figure 2. Different variant types reported by MRJD Caller default mode and high sensitivity mode.
The MRJD Caller is disabled by default and requires WGS data aligned to a human reference genome (hg38, hg19, or GRCh37).
Here is the list of options related to MRJD.
--enable-mrjd
If set to true, MRJD is enabled for the DRAGEN pipeline. Note that MRJD cannot run together with SNV caller in the current version of DRAGEN (default = ‘false’).
--mrjd-enable-high-sensitivity-mode
If set to true, MRJD high sensitivity mode is enabled for the DRAGEN pipeline. See previous section on what variant types are reported in MRJD default mode and high sensitivity mode (default = ‘false’).
The following command-line example uses FASTQ input and runs MRJD Caller with high sensitivity mode:
The following command-line example uses BAM input that has already been aligned and runs MRJD Caller with high sensitivity mode:
MRJD cannot run together with the DRAGEN Small Variant Caller in this DRAGEN version. We recommend running the DNA Mapping and Small Variant Calling workflow first, and then running MRJD using the aligned BAM file generated by the DNA Mapping workflow as input. This workflow creates two VCF files (.hard-filtered.vcf.gz from the DRAGEN Small Variant Caller and .mrjd.hard-filtered.vcf.gz from DRAGEN MRJD). To help you obtain a single VCF file for downstream analysis, a utility tool is available that replaces the DRAGEN Small Variant Caller output in the homology regions of the six medically relevant and challenging genes with the MRJD Caller output. The tool also annotates the calls made by MRJD (with the "MRJD" tag in the INFO column). Refer to the DRAGEN Software Support Site page to download the utility tool.
Here are the example command lines to first run DNA Mapping and Small Variant Calling workflow using FASTQ files as input, and then run MRJD using BAM file generated by the DNA Mapping workflow as input.
The MRJD Caller generates a .mrjd.hard-filtered.vcf.gz file in the output directory. The output file is a compressed VCFv4.2 formatted file that contains the VCF representation of the small variants from the identified genotype.
The following is an example of the output format for a uniquely placed variant. The DRAGENHardQual filter is applied to a record if the variant has a QUAL < 3.00.
Figure 3. VCF output format example for uniquely placed call.
For variants that are not uniquely placed (type 2-4 variants in Figure 2), the MRJD Caller also reports variants in diploid genotype format, which can be interpreted in the same way as for a uniquely placed variant (the genotype is region-specific instead of an aggregate across all regions). In this format, QUAL is the Phred-scaled quality score for the assertion made in ALT (i.e., −10 log10 prob(GT==0/0)). Note that the QUAL score is equal to or less than 3 (if QUAL > 3, the call should be uniquely placed).
QUAL, GT, GQ, and PL are reported in the same way as by the DRAGEN germline VC. To avoid losing information about the aggregated genotype across paralogous regions, the MRJD Caller reports the genotype, Phred-scaled quality score, and Phred-scaled genotype likelihoods for the aggregated genotype using JGT, JQL, and JPL in the FORMAT column.
Figure 4. VCF output format example for non-uniquely-placed call.
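The Phred scaling used for QUAL can be made concrete with a short sketch. The two helper names are illustrative; the conversion itself is the standard −10 log10 transform quoted above, applied to the probability that the genotype is homozygous reference.

```python
import math


def phred_qual(prob_hom_ref):
    """QUAL = -10 * log10 P(GT == 0/0)."""
    return -10.0 * math.log10(prob_hom_ref)


def prob_from_qual(qual):
    """Invert the Phred scale: P(GT == 0/0) = 10 ** (-QUAL / 10)."""
    return 10.0 ** (-qual / 10.0)


print(round(phred_qual(0.5), 2))      # 3.01
print(round(prob_from_qual(3.0), 3))  # 0.501
```

A QUAL of 3 therefore corresponds to roughly even odds that the site is homozygous reference in that region, which is consistent with region-ambiguous calls sitting at or below that value.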
Short tandem repeats (STRs) are regions of the genome consisting of repetitions of short DNA segments called repeat units. STRs can expand to lengths beyond the normal range and cause mutations called repeat expansions. Repeat expansions are responsible for many diseases, including Fragile X syndrome, amyotrophic lateral sclerosis, and Huntington's disease.
DRAGEN includes a repeat expansion detection method called ExpansionHunter. ExpansionHunter performs sequence-graph based realignment of reads that originate inside and around each target repeat. ExpansionHunter then genotypes the length of the repeat in each allele based on these graph alignments.
ExpansionHunter is designed for PCR-free whole genome samples. Repeats are only genotyped if the coverage at the locus is at least 10x, but a minimum of 30x is recommended. Sequencing reads must be paired-end with a minimum read length of 100 (2x100bp). ExpansionHunter cannot be run on multiple FASTQ files that are assigned to different library IDs in the fastq_list.csv file.
ExpansionHunter does not support somatic analysis.
More information and analysis is available in the following ExpansionHunter papers:
Dolzhenko et al., Detection of long repeat expansions from PCR-free whole-genome sequence data 2017
Dolzhenko et al., ExpansionHunter: A sequence-graph based tool to analyze variation in short tandem repeat regions 2019
To enable DRAGEN repeat expansion detection, the following command-line options are required.
--repeat-genotype-enable=true
--repeat-genotype-specs=<path to specification file>
You can use the --sample-sex option to specify the sex of the sample. The following options are optional.
--repeat-genotype-region-extension-length=<length of region around repeat to examine> (default 1000 bp)
--repeat-genotype-min-baseq=<Minimum base quality for high confidence bases> (default 20)
For more information on the specification file passed with the --repeat-genotype-specs option, see the description of repeat-specification files that follows.
The main output of repeat expansion detection is a VCF file that contains the variants found via this analysis.
The repeat-specification (also called variant catalog) JSON file defines the repeat regions for ExpansionHunter to analyze. Default repeat-specification files for some pathogenic and polymorphic repeats are in the <INSTALL_PATH>/resources/repeat-specs/ directory, organized by the reference genome used with DRAGEN.
You can create specification files for new repeat regions by using one of the provided specification files as a template. See the ExpansionHunter documentation for details on the format.
--repeat-genotype-specs is required for ExpansionHunter. If the option is not provided, DRAGEN attempts to autodetect the applicable catalog file from <INSTALL_PATH>/resources/repeat-specs/ based on the reference provided.
ExpansionHunter can detect pathogenic expansions of the FXN, ATXN3, ATN1, AR, DMPK, HTT, FMR1, ATXN1, and C9ORF72 repeats with high accuracy (see the ExpansionHunter papers above). The pathogenicity status of some repeats might depend on the presence of sequence interruptions or motif changes that ExpansionHunter does not call. If you would like to visually inspect the relevant read alignments, you can use a Repeat Expansion Viewer third-party tool.
Included below are the repeat unit expansion thresholds (normal, pre-mutation and expansion) for some common repeats.
The results of repeat genotyping are output as a separate VCF file, which provides the length of each allele at each callable repeat defined in the repeat-specification catalog file. The file name is <outputPrefix>.repeats.vcf (*.gz). The VCF output file begins with the following fields.
Table 2 Core VCF Fields
Table 3 Additional INFO Fields
Table 4 GENOTYPE (Per Sample) Fields
For example, the following VCF entry describes the ATXN1 repeat in a sample NA13537.
In this example, the first allele spans 33 repeat units while the second allele spans 58 repeat units. The repeat unit is TGC (RU INFO field), so the sequence of the first allele is TGC x 33 and the sequence of the second allele is TGC x 58. The repeat spans 30 repeat units in the reference (REF INFO field).
The length of the short allele was estimated from spanning reads (SPANNING), while the length of the expanded allele was estimated from in-repeat reads (INREPEAT). The confidence interval for the size of the expanded allele is (52,71). There are 4 spanning and 69 flanking reads consistent with the repeat allele of size 33; that is, 4 reads fully contain the repeat of size 33 and 69 flanking reads overlap at most 33 repeat units. There are 83 flanking and 4 in-repeat reads consistent with the repeat allele of size 58. The average coverage of this locus is 37.46x.
The sequence-graph alignments of reads in the targeted repeat regions are output in a BAM file. You can use a specialized GraphAlignmentViewer tool available on GitHub to visualize the alignments. Programs like Integrative Genomics Viewer (IGV) are not designed for displaying graph-aligned reads and cannot visualize these BAMs.
The BAMs store graph alignments in custom XG tags using the format <LocusName>,<StartPosition>,<GraphCIGAR>.
LocusName---A locus identifier that matches the corresponding entry in the repeat expansion specification file.
StartPosition---The starting alignment position of a read on the first graph node.
GraphCIGAR---The alignment of a read against the graph starting from that position. GraphCIGAR consists of a sequence of graph node identifiers and linear CIGARs describing the alignment of the read to each node.
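The three comma-separated fields of the XG tag can be split apart as in the sketch below. The split into LocusName, StartPosition, and GraphCIGAR follows the format given above. The further breakdown of the GraphCIGAR into `node[CIGAR]` pairs is an assumption about its textual layout, and the function name and the example tag value are made up for illustration.

```python
import re

# Assumed layout of a GraphCIGAR: node identifier followed by a bracketed
# linear CIGAR, repeated once per visited node, e.g. 0[38M]1[3M]2[7M].
NODE_PATTERN = re.compile(r"(\d+)\[([^\]]*)\]")


def parse_xg_tag(value):
    """Split an XG tag value into locus name, start position and GraphCIGAR."""
    locus, start, graph_cigar = value.split(",", 2)
    nodes = [(int(node), cigar) for node, cigar in NODE_PATTERN.findall(graph_cigar)]
    return {
        "locus": locus,
        "start": int(start),
        "graph_cigar": graph_cigar,
        "nodes": nodes,
    }


# Illustrative tag value, not taken from a real BAM.
parsed = parse_xg_tag("ATXN1,12,0[38M]1[3M]1[3M]2[7M]")
print(parsed["locus"], parsed["start"])  # ATXN1 12
print(parsed["nodes"])  # [(0, '38M'), (1, '3M'), (1, '3M'), (2, '7M')]
```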
Quality scores in the BAM file are binary. High-scoring bases are assigned a score of 40, and low-scoring bases are assigned a score of 0.
For somatic whole-exome sequencing (WES) and somatic targeted panels, you can use a panel of normals as the reference baseline to provide insight into copy number variants. The reported events are based solely on the normalized copy ratio values and the deviation from the expected reference baseline levels. This workflow can be useful for applications that require only the detection of gains and losses in targeted genes. The somatic WES CNV model is similar to the germline WES CNV model, but utilizes a different quality scoring and calling model.
Use one of the following input options.
--tumor-fastq1 and --tumor-fastq2
--- Specify FASTQ files.
--tumor-bam-input
--- Specify an existing BAM file.
--tumor-cram-input
--- Specify an existing CRAM file.
The Somatic WES CNV Caller requires a panel of normals. The panel of normals samples help measure intrinsic biases of the upstream processes to allow for proper normalization. To generate a panel of normals, see Panel of Normals. The panel of normals samples should be well matched to the case sample under analysis.
If a matched normal sample is available, the sample can be included in the panel of normals. The workflow is the same whether or not a matched normal is available.
The following example command line runs somatic analysis on WES data.
If you are using somatic targeted panels with a set of genes supplied with the capture kit, you can bypass segmentation by specifying a --cnv-segmentation-bed and using --cnv-segmentation-mode=bed. With this option, all events in the segmentation BED are reported in the output VCF. For more information on the segmentation BED file, see Targeted Segmentation (Segment BED).
The following example command line runs somatic analysis on a targeted panel.
The Somatic WES CNV Caller computes quality scores using a two-sample t-test between the normalized copy ratio of the case sample and the panel of normals samples. The caller computes a p-value per segment. The p-values are then converted to Phred-scaled scores. For copy neutral events, the caller computes quality scores from 1-p.
DUP/DEL event calls are made based on the limit of detection (LoD) threshold, which is set using --cnv-filter-limit-of-detection (default 0.2). For each segment, the caller computes a p-value for hypothetical counts of Case Counts × (1 ± LoD) against the PON. If the p-value of Case Counts × (1 + LoD) is highest, the segment is called DUP. If the p-value of Case Counts × (1 − LoD) is highest, the segment is called DEL. Otherwise, the segment is called REF.
The output VCF contains the quality score in the QUAL field.
The non-ASCN Somatic WES CNV Caller only reports copy ratio, also known as fold change. Fold change is encoded in the FORMAT/SM field as a linear copy ratio of the segment mean. In this case, if tumor purity is known, you can infer the ploidy of a gene or segment in the sample from the reported fold change using the following calculation.
For example, if the tumor purity is 30% for MET with a fold change of 2.2x, then there are 10 copies of MET DNA in the sample.
To detect somatic copy number aberrations and regions with loss of heterozygosity, run the DRAGEN CNV Caller on a tumor sample with a VCF that contains germline SNVs. The output file is a VCF file. Components of the germline CNV caller are reused in the somatic algorithm with the addition of a somatic modeling component, which estimates tumor purity and ploidy.
The germline SNVs are used to compute B-allele ratios in the tumor, which allows for allele-specific copy number calling on the tumor sample. Where possible, use of the small-variant VCF from a matched normal sample is preferred (tumor-normal mode) for best results, but a catalog of population SNPs can be used when a matched normal sample is not available (tumor-only mode).
When a matched normal sample is available, the sample should first be processed using the germline small variant caller. In this case, only germline-heterozygous SNV sites are used for determining B-allele ratios. If no matched normal is available, population SNP B-allele ratios are computed as for matched normal heterozygous loci, but are treated as variants of unknown germline genotype; possible genotype assignments are statistically integrated to determine allele-specific copy number.
In matched normal mode, a VCF containing germline copy number changes for the individual may optionally be input. This makes sure that germline CNVs are output as separate segments in the somatic whole-genome sequencing (WGS) CNV VCF, and annotated with the germline copy number so that it is clear whether there are specifically-somatic copy number changes in the region.
You can use the following somatic WGS CNV calling command-line options:
The following is an example command line for running tumor-normal somatic WGS CNV calling with a matched normal SNV VCF.
If a matched normal is not available, you must disable CNV calling or run in tumor-only mode. Running with a mismatched normal in tumor-normal mode yields unexpected results. The following example command line runs tumor-only somatic WGS CNV calling with a population SNV VCF.
The following example command line runs tumor normal somatic WGS CNV calling concurrently with the Somatic SNV Caller, which allows you to use the matched normal germline heterozygous sites directly from the SNV Caller with the command cnv-use-somatic-vc-baf true
.
The Somatic WGS CNV Caller requires a source of heterozygous SNP sites to measure B-allele counts of the tumor sample. The following are the available modes.
To specify a matched normal sample SNV VCF, use the --cnv-normal-b-allele-vcf
option. The VCF file should come from processing the matched normal sample through the DRAGEN germline small variant caller with filters applied. Typically, this file name has a *.hard-filtered.vcf.gz
extension. All records marked as PASS that are determined to be heterozygous in the normal sample are used to measure the b-allele counts of the tumor sample. You can also use the equivalent gVCF file (*.hard-filtered.gvcf.gz
), but the processing time is significantly longer due to the number of records, most of which are not heterozygous sites.
To specify a population SNP VCF, use the --cnv-population-b-allele-vcf
option. To obtain a population SNP VCF, process an appropriate catalog of population variation, such as dbSNP, the 1000 Genomes Project, or other large cohort discovery efforts. A suitable example file for this parameter is "1000G_phase1.snps.high_confidence.vcf.gz" from the GATK resource bundle. Only high-frequency SNPs should be included. For example, include SNPs with minor allele population frequency ≥ 10% to limit run time impact and reduce artifacts. Specify the ALT allele frequency by adding AF=<alt frequency>
to the INFO
section of each record. Additional INFO
fields might be present, but DRAGEN only parses and uses the AF field. Sites specified with --cnv-population-b-allele-vcf
can be either heterozygous or homozygous in the germline genome from which the tumor genome derives.
The following is an example valid population SNP record:
DRAGEN considers the following requirements when parsing records from the b-allele VCF:
Only simple SNV sites.
Records must be marked PASS
in the FILTER
field.
If there are records with the same CHROM
and POS
values in the VCF
, then DRAGEN uses the first record that occurs.
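The requirements above can be illustrated with a short Python sketch that pre-screens a b-allele VCF. This is not DRAGEN's parser. It assumes that "simple SNV" means a single REF base and a single ALT base, and that a duplicate position is resolved by looking only at the first record encountered, even if that record is later rejected. The function name `select_b_allele_sites` and the sample records are made up for illustration.

```python
def select_b_allele_sites(vcf_lines):
    """Return (chrom, pos, ref, alt) for records that meet the listed requirements."""
    bases = {"A", "C", "G", "T"}
    seen = set()
    sites = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # malformed record
        chrom, pos, _id, ref, alt, _qual, flt = fields[:7]
        if (chrom, pos) in seen:
            continue  # assumption: only the first record at a CHROM/POS is considered
        seen.add((chrom, pos))
        if ref.upper() not in bases or alt.upper() not in bases:
            continue  # assumption: a simple SNV has one REF base and one ALT base
        if flt != "PASS":
            continue  # FILTER must be PASS
        sites.append((chrom, int(pos), ref.upper(), alt.upper()))
    return sites


example_records = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tG\t.\tPASS\tAF=0.25",
    "chr1\t100\t.\tA\tT\t.\tPASS\tAF=0.30",     # duplicate position, ignored
    "chr1\t200\t.\tAT\tA\t.\tPASS\tAF=0.40",    # deletion, not a simple SNV
    "chr1\t300\t.\tC\tT\t.\tLowQual\tAF=0.40",  # not PASS
    "chr1\t400\t.\tG\tA,C\t.\tPASS\tAF=0.20",   # multiallelic, not a simple SNV
]
usable_sites = select_b_allele_sites(example_records)
print(usable_sites)  # [('chr1', 100, 'A', 'G')]
```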
If a tumor sample and matched normal input are available, you can avoid separately processing the matched normal with the DRAGEN germline pipeline by specifying --cnv-use-somatic-vc-baf true
. You must enable the Somatic SNV Caller. With this option, DRAGEN determines the germline heterozygous sites from the matched normal input and measures the b-allele counts of the tumor sample. The information is passed to the Somatic WGS CNV Caller to simplify the overall somatic workflow.
To enable --cnv-use-somatic-vc-baf
, enter the following command line options.
--tumor-bam-input <TUMOR_BAM>
—Specify the tumor input
--bam-input <NORMAL_BAM>
—Specify the matched normal input
--enable-variant-caller true
—Enable the somatic SNV variant caller
--cnv-use-somatic-vc-baf true
—Enable somatic VC BAF
To specify germline CNVs from a matched normal sample, use --cnv-normal-cnv-vcf
. When specified, CNV records marked as PASS
in the normal sample are used during tumor-sample segmentation to make sure that confident germline CNV boundaries are also boundaries in the somatic output. Segments with germline copy number changes that are relative to reference ploidy are excluded from somatic model selection. During somatic copy number calling and scoring, the germline copy number is used to modify the expected depth contribution from the normal contamination fraction of the tumor sample. The process leads to more accurate assignment of somatic copy number in regions of germline CNV. DRAGEN then annotates the somatic WGS CNV VCF entries with germline copy number (NCN) and the somatic copy number difference relative to germline (SCND) for the segments that have germline CNVs.
If both the small variant caller and the CNV caller are enabled in a tumor-matched normal run, the somatic SNV results can affect the estimated purity and ploidy of the tumor sample. The somatic SNV variant allele frequencies (VAFs) that are captured by the allele depth values from passing somatic SNVs reflect the combination of tumor purity, total tumor copy number at a somatic SNV locus, and the number of tumor copies bearing the somatic allele. Clusters of somatic SNVs with similar allele depths inform the tumor model.
When a tumor has limited copy number variation and/or CNVs are mostly subclonal, such as in many liquid tumors, VAFs can help prevent incorrect or low-confidence estimated tumor models. Incorrect or low-confidence estimated tumor models can lead to wrong or filtered calls. VAF information can also help determine the presence or absence of a genome duplication even in samples from clonal tumors with clear CNVs.
To utilize VAF information, run somatic WGS CNV calling with small variant calling on tumor and matched-normal read alignment inputs. For example, you could use the following command line:
--enable-variant-caller=true --enable-cnv=true --tumor-bam-input <TUMOR_BAM> --bam-input <NORMAL_BAM>
For tumor/matched-normal runs with --enable-variant-caller true
, VAF-based modeling is enabled by default. To disable VAF-based modeling, set --cnv-use-somatic-vc-vaf
to false
.
DRAGEN uses HET-calling mode for segments with a copy number that is estimated to be heterogeneous (HET) among different subclones. Based on a statistical model, a segment is considered to be heterogeneous when the depths or BAF values in a segment are too far away from what is expected for the closest integer-copy number.
When a segment is considered as heterogeneous, the output for the segment is changed as follows.
The HET tag is added to the INFO field for the segment.
At least one of the CN and MCN values is given as a non-REF value. Specifically, the values are given as the integer values closest to CNF and MCNF. If the integer values would result in a REF call, then at least one of the CN and MCN values is adjusted to the closest non-REF value.
The ID, ALT, and GT fields are set appropriately for the chosen CN and MCN.
The QUAL score reflects confidence that the segment has nonreference copy number in at least a fraction of the sample.
The CNQ and MCNQ values reflect confidence that the assigned CN and MCN values are true in all of the tumor cells, so at least one of the CNQ and MCNQ values is typically less than five.
Selecting a tumor purity and diploid coverage level (ploidy) is a key component of the somatic WGS CNV caller. The somatic WGS CNV caller uses a grid-search approach that evaluates many candidate models to attempt to fit the observed read counts and b-allele counts across all segments in the tumor sample. A log likelihood score is emitted for each candidate. The log likelihood scores are output in the *.cnv.purity.coverage.models.tsv
file. The somatic WGS CNV caller chooses the (purity, coverage) pair with the highest log likelihood, and then computes several measures of model confidence based on the relative likelihood of the chosen model compared to alternative models.
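The selection step can be pictured with a minimal Python sketch. It assumes the candidates are available as (purity, coverage, log likelihood) tuples. The layout of the models file is not described here, so no parsing is shown. The log likelihood gap to the runner-up is an illustrative stand-in for a confidence measure and is not one of the measures DRAGEN computes.

```python
def pick_best_model(models):
    """Return the highest-likelihood model and its gap to the runner-up.

    models: iterable of (purity, coverage, log_likelihood) tuples.
    """
    ranked = sorted(models, key=lambda m: m[2], reverse=True)
    if not ranked:
        raise ValueError("no candidate models")
    best = ranked[0]
    # Illustrative confidence proxy only: difference in log likelihood
    # between the best model and the next best candidate.
    gap = best[2] - ranked[1][2] if len(ranked) > 1 else float("inf")
    return best, gap


candidates = [(0.3, 40.0, -1200.5), (0.6, 38.0, -1100.0), (0.9, 35.0, -1150.25)]
best_model, likelihood_gap = pick_best_model(candidates)
print(best_model, likelihood_gap)  # (0.6, 38.0, -1100.0) 50.25
```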
If the confidence in the chosen model is low, the caller returns the default model with estimated tumor purity set to NA
. The default model provides an alternative methodology to identify large somatic alterations (length of at least 1 Mb): records are filtered by this model based on their segment mean value (SM
). The threshold values used by the caller are estimated automatically considering the variance of the sample, with larger SM
thresholds for DUPs when the variance is higher. You can specify alternative threshold values with the --cnv-filter-del-mean
and --cnv-filter-dup-mean
parameters. Finally, when the caller returns the default model, the fields regarding copy number states based on model estimation (i.e., CN
, CNF
, CNQ
, MCN
, MCNF
, MCNQ
) are omitted from the final VCF output.
To improve the accuracy of tumor ploidy model estimation, the somatic WGS CNV caller checks whether the chosen model calls homozygous deletions in regions whose loss is likely to reduce the overall fitness of cells. Such regions are deemed "essential" and are expected to be under negative selection. Recent studies have attempted to map these cell-essential genes¹.
The check on essential regions is controlled with --cnv-somatic-enable-lower-ploidy-limit
(default true). Default BED files describing the essential regions are provided for hg19, GRCh37, hs37d5, and GRCh38, but you can also provide a custom BED file with the --cnv-somatic-essential-genes-bed=<BEDFILE_PATH>
parameter. In this case, the feature is automatically enabled. A custom essential regions BED file must have four tab-separated columns. The first three columns identify the coordinates of the essential region (chromosome, 0-based start, exclusive end). The fourth column is the region ID (string). The algorithm currently uses only the first three columns, but the fourth can help when manually investigating which regions drove the caller's decisions on model plausibility.
If the somatic WGS CNV caller does not find any overlap between any of the homozygous deletions and any of the essential regions, the model is considered plausible and the model optimization ends. Otherwise, when at least one overlap is found, the model is declared invalid and the model search is repeated on the subset of models that support at least one copy (CN = 1) for the essential region with the lowest coverage among the regions overlapping homozygous deletions.
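The overlap check can be sketched in Python as follows. The sketch parses a BED file in the 4-column format described above and reports which essential regions overlap a homozygous deletion. It assumes the deletions are also given as 0-based, half-open intervals. The function names, gene labels, and coordinates are made up, and the repeated model search is not shown.

```python
def parse_essential_bed(lines):
    """Parse 4-column, tab-separated BED lines into (chrom, start, end, region_id)."""
    regions = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        region_id = cols[3] if len(cols) > 3 else ""
        regions.append((cols[0], int(cols[1]), int(cols[2]), region_id))
    return regions


def overlapping_essential_regions(hom_dels, regions):
    """Return IDs of essential regions that overlap any homozygous deletion.

    hom_dels: list of (chrom, start, end), 0-based start and exclusive end.
    """
    hits = []
    for chrom, start, end, region_id in regions:
        for d_chrom, d_start, d_end in hom_dels:
            # Half-open intervals overlap when each starts before the other ends.
            if chrom == d_chrom and start < d_end and d_start < end:
                hits.append(region_id)
                break
    return hits


bed_lines = ["chr1\t1000\t2000\tGENE_A", "chr2\t500\t900\tGENE_B"]
deletions = [("chr1", 1500, 3000), ("chr2", 900, 1200)]
hits = overlapping_essential_regions(deletions, parse_essential_bed(bed_lines))
print(hits)  # ['GENE_A']
```

In this example the chr2 deletion starts exactly at the exclusive end of GENE_B, so the two do not overlap. Under the description above, a model producing these calls would be rejected because of GENE_A alone.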
¹E.g., in 2015 - https://www.science.org/doi/10.1126/science.aac7041
The segmentation stage might produce adjacent or nearby segments that are assigned the same copy number and have similar depth and BAF data. This segmentation can result in a region with consistent true copy number being fragmented into several pieces. The fragmentation might be undesirable for downstream use of copy number estimates. Also, for some uses, it can be preferable to smooth short segments that would be assigned different copy numbers whether due to a true copy number change or an artifact. To reduce undesirable fragmentation, initial segments can be merged during a postcalling segment smoothing step.
After initial calling, segments shorter than the specified value of --cnv-filter-length
are deemed negligible. Among the remaining nonnegligible segments, successive pairs are evaluated for merging. On a trial basis, the Somatic WGS CNV Caller combines two successive segments that are within --cnv-merge-distance
(default value of 10000 for WGS Somatic CNV) of one another and have the same CN and MCN assignments, along with any intervening negligible segments, into a single segment that is recalled and rescored. If the merged segment receives the same CN and MCN as its constituent nonnegligible pieces with a sufficiently high quality score, the original segments are replaced with the merged segment. The merged segment might be further merged with other initial or merged segments to either side. Merging proceeds until all segment pairs that meet the criteria are merged. Note that when germline CN information is available and two segments have different germline CN, they are not merged.
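The test for whether two successive nonnegligible segments are candidates for a trial merge can be sketched in Python. The sketch covers only the candidacy criteria listed above. Recalling and rescoring the merged segment is not shown. The `Segment` class is hypothetical, and the sketch assumes that "within" the merge distance means a gap less than or equal to it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    chrom: str
    start: int
    end: int
    cn: int
    mcn: int
    germline_cn: Optional[int] = None  # None when no germline CN is available


def is_merge_candidate(left: Segment, right: Segment, merge_distance: int = 10000) -> bool:
    """Check the stated criteria for a trial merge of two successive segments."""
    if left.chrom != right.chrom:
        return False
    if right.start - left.end > merge_distance:
        return False  # assumption: a gap equal to the distance still qualifies
    if (left.cn, left.mcn) != (right.cn, right.mcn):
        return False
    both_known = left.germline_cn is not None and right.germline_cn is not None
    if both_known and left.germline_cn != right.germline_cn:
        return False  # different germline CN blocks the merge
    return True


a = Segment("chr1", 1_000_000, 2_000_000, cn=3, mcn=1)
b = Segment("chr1", 2_004_000, 3_000_000, cn=3, mcn=1)
c = Segment("chr1", 3_050_000, 4_000_000, cn=3, mcn=1)
print(is_merge_candidate(a, b), is_merge_candidate(b, c))  # True False
```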
The Somatic WGS CNV Caller can report the total tumor copy number by estimating tumor purity. The BAF estimations from matched normal SNVs or population SNPs allow for allele specific copy number calling. The following table provides examples for a DUP in a reference-diploid region:
*The entry represents a Loss of Heterozygosity (LOH) case. The total copy number is still considered a DUP, so the entry is annotated as GAINLOH
to distinguish the value from Copy Neutral LOH (CNLOH
), which would be annotated as 2+0.
Multisample CNV calling is possible starting from tangent normalized counts files (*.tn.tsv.gz) specified with the --cnv-input
option (one per sample). Multisample CNV analysis benefits from using joint segmentation to increase the sensitivity of detection of copy number variable segments. For each copy number variable segment identified, the copy number genotype of each sample is emitted in a single VCF entry to facilitate annotation and interpretation.
Multisample CNV analysis is supported for WGS and WES workflows.
The following is an example command line for running a trio analysis:
Make sure all input samples have gone through the same single sample workflow and have identical intervals. If the samples are WES inputs, then you must generate the samples using the same panel of normals, and the autosomal intervals for all samples must match.
The following options are used in DeNovo CNV calling:
--cnv-input
For DeNovo CNV calling, this specifies the input tangent-normalized signal files (*.tn.tsv.gz) from the single sample runs. This option can be specified multiple times, once for each input sample.
--cnv-filter-de-novo-qual
Phred-scaled threshold at which a putative event in the proband sample is marked as DeNovo. Default value is 0.125.
--pedigree-file
Pedigree file specifying the relationship between the input samples.
First, CNV calling is performed on each sample independently. Joint segmentation then uses the copy number variable segments from each single sample analysis to derive a set of joint copy number variable segments. This set of joint segments is determined simply by taking the union of all breakpoints from the copy number variable segments of all samples. This results in the splitting of any partially overlapping segments across different samples. For example:
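The breakpoint union can be illustrated with a short Python sketch. It assumes each sample's copy number variable segments are given as (start, end) pairs on a single chromosome. The function name `joint_segments` and the coordinates are made up for illustration.

```python
def joint_segments(per_sample_segments):
    """Split segments at the union of all breakpoints across samples.

    per_sample_segments: one list of (start, end) pairs per sample,
    all on the same chromosome. Returns the joint segments covered by
    at least one sample's segment, in coordinate order.
    """
    breakpoints = sorted(
        {p for segs in per_sample_segments for s, e in segs for p in (s, e)}
    )
    joint = []
    for start, end in zip(breakpoints, breakpoints[1:]):
        # An interval between consecutive breakpoints lies either fully
        # inside or fully outside every input segment.
        covered = any(
            s <= start and end <= e
            for segs in per_sample_segments
            for s, e in segs
        )
        if covered:
            joint.append((start, end))
    return joint


# Two samples with partially overlapping segments.
sample_a = [(100, 500)]
sample_b = [(300, 800)]
joint = joint_segments([sample_a, sample_b])
print(joint)  # [(100, 300), (300, 500), (500, 800)]
```

The partially overlapping segments are split into three joint segments, and each sample is then called again on all three.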
Following joint segmentation, copy number calling is again performed independently on each sample using the joint segments. Segments can be merged as with the single sample analysis, but each joint segment is emitted in the multisample VCF as a single entry. The quality score (QS
in the VCF) from the sample's merged segment, if applicable, is used for filtering the call. Sample calls are filtered using the sample's FT field in the multisample VCF. The QUAL
column of the multisample VCF is always missing (i.e., "."). The FILTER
column of the multisample VCF is SampleFT
if none of the samples' FT
fields are PASS
, and PASS
if any of the samples' FT
fields are PASS
.
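The record-level FILTER rule amounts to the following Python sketch. The function name `record_filter` is illustrative, and the filter names other than PASS and SampleFT in the example are placeholders.

```python
def record_filter(sample_ft_values):
    """Derive the FILTER column from the per-sample FT values of one record."""
    return "PASS" if any(ft == "PASS" for ft in sample_ft_values) else "SampleFT"


print(record_filter(["someFilter", "PASS", "someFilter"]))  # PASS
print(record_filter(["someFilter", "otherFilter"]))         # SampleFT
```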
Note, however, that when a single segment in one sample overlaps multiple segments in another sample, the larger segment annotation is replicated across multiple records, e.g. (only relevant VCF fields are printed below):
A de novo event is defined as the existence of a genotype at a particular locus in a proband's genome that did not result from standard Mendelian inheritance from the parents. The de novo calling stage identifies putative de novo events in the proband of each trio of a multisample analysis. In some cases, these putative de novo events may be real, but they can also arise from sequencing or analysis artifacts. Consequently, a de novo quality score is assigned to each putative de novo event and used to filter out low-quality de novo events. Specify trios by providing a .ped file with the --pedigree-file
option. Multiple trios can be specified (e.g., quad analysis), and all valid trios are processed.
For each joint segment in a trio, the de novo caller determines if there is a Mendelian inheritance conflict for the called copy number genotypes. The CNV caller does not identify the copy number for each allele of a given diploid segment, which means assumptions are made about the possible allelic composition of the parent genotypes.
The assumption is that the copy number 0 allele is not present for diploid regions of a parent's genome (sex dependent) when the assigned copy number is 2 or greater. This results in simplifications, as follows:
The following are examples of consistent and inconsistent copy number genotypes for diploid regions using these assumptions:
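One reading of the assumption above is sketched in Python below. The sketch decomposes each parent's total copy number into allele pairs, excluding the copy number 0 allele when the total is 2 or greater, and checks whether the proband copy number can be formed from one allele of each parent. It applies to diploid regions only. The function names are hypothetical, and the results printed are produced by this sketch rather than taken from the caller.

```python
def allele_pairs(cn):
    """Possible (allele_1, allele_2) copy numbers for a diploid total CN."""
    if cn == 0:
        return {(0, 0)}
    if cn == 1:
        return {(0, 1)}
    # Assumption from the text: no CN 0 allele when the total CN is 2 or more.
    return {(a, cn - a) for a in range(1, cn // 2 + 1)}


def transmissible_alleles(cn):
    """Allele copy numbers a parent with total CN `cn` could pass on."""
    return {allele for pair in allele_pairs(cn) for allele in pair}


def is_mendelian_consistent(mother_cn, father_cn, proband_cn):
    """True if the proband CN is the sum of one allele from each parent."""
    return any(
        m + f == proband_cn
        for m in transmissible_alleles(mother_cn)
        for f in transmissible_alleles(father_cn)
    )


for trio in [(2, 2, 2), (2, 2, 1), (1, 2, 1), (3, 2, 3), (3, 2, 4)]:
    print(trio, is_mendelian_consistent(*trio))
# (2, 2, 2) True
# (2, 2, 1) False
# (1, 2, 1) True
# (3, 2, 3) True
# (3, 2, 4) False
```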
If a joint segment has a Mendelian inheritance conflict, a Phred-scaled de novo quality score (DQ
field in the VCF) is calculated using the likelihoods for each copy number state (see Quality Scoring section) of each sample in the trio, combined with a prior for the trio genotypes:
Where
The DN
field in the VCF is used to indicate the de novo status for each segment. Possible values are:
Inherited
- the called trio genotype is consistent with Mendelian inheritance
LowDQ
- the called trio genotype is inconsistent with Mendelian inheritance and DQ is less than the de novo quality threshold (default 0.125)
DeNovo
- the called trio genotype is inconsistent with Mendelian inheritance and DQ
is greater than or equal to the de novo quality threshold (default 0.125)
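The three DN values follow directly from the Mendelian consistency check and the DQ threshold, as in this Python sketch. The function name `de_novo_status` is illustrative. The default threshold is the documented default of 0.125.

```python
def de_novo_status(is_consistent, dq=None, threshold=0.125):
    """Map a trio call to the DN value described in the text."""
    if is_consistent:
        return "Inherited"
    if dq is None:
        raise ValueError("DQ is required for a Mendelian-inconsistent call")
    return "DeNovo" if dq >= threshold else "LowDQ"


print(de_novo_status(True))          # Inherited
print(de_novo_status(False, 0.05))   # LowDQ
print(de_novo_status(False, 0.125))  # DeNovo
```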
The records in a multisample CNV VCF differ slightly from the single sample case. The major differences are as follows:
The per-record entries are broken down into the segments among the union of all the input samples' breakpoints, which means there are more entries in the overall VCF.
The QUAL
column is not used and its value is ".". The per-sample quality is carried over into the SAMPLE
columns with the QS
tag.
The FILTER
column indicates PASS
if any of the individual SAMPLE
columns PASS
. Otherwise, it indicates SampleFT
.
The per-sample annotations are carried over from their originating calls. The single sample filters are applied at the sample level and are emitted in the FT
annotation.
Additionally, if a valid pedigree is used, then de novo calling is performed, which adds the following two annotations to the proband sample.
While the VCF contains many entries, due to the joint segmentation stage, the number of de novo events can be found by extracting entries that have a DN
and DQ
annotation. These records are also extracted and are converted to GFF3 in the de novo calling case.
To detect somatic copy number aberrations and regions with loss of heterozygosity, run the DRAGEN CNV Caller on a tumor sample with a VCF that contains germline SNVs from matched normal sample or population SNV VCF. The output file is a VCF file. Components of the germline CNV caller are reused in the somatic algorithm with the addition of a somatic modeling component, which estimates tumor purity and ploidy.
The germline SNVs are used to compute B-allele ratios in the tumor, which allows for allele-specific copy number calling on the tumor sample. Where possible, using the small-variant VCF from a matched normal sample is preferred.
A panel of normals (PON) is used as the reference baseline to provide insight into copy number variants. The ASCN somatic WES CNV model is similar to the somatic WGS CNV model (with different internal parameters tuned for WES), but it uses a panel of normals to remove coverage bias in each target region.
The pipeline accepts various input types for matched normal sample or population SNV VCF for B-allele loci. If the normal sample was already processed using the germline small variant caller, the user can provide its output VCF file.
If the normal sample was not processed, the user can provide raw reads or aligned reads and enable the concurrent execution of the small variant caller. In this case, the DRAGEN CNV caller receives the small variant caller's output and uses it to estimate B-allele frequencies from the germline SNVs.
If there is no matched normal sample, the user can provide a population SNV VCF. DRAGEN will intersect the population SNV VCF with the target region provided by the cnv-target-bed
and use the resulting SNVs to estimate B-allele frequencies.
You can use the following somatic WES CNV calling command-line options:
1 tumor input
1 normal input (either option 1, 2, or 3)
Panel of normals (either option 1, 2, 3 or 4)
Target region
When the normal sample input is not in VCF format (e.g., FASTQ/BAM/CRAM), the normal sample must be suitable for use as a PON sample. However, if the normal sample is already included in the PON, it is not added again.
The following is an example command line for running ASCN tumor-normal somatic WES CNV calling with matched normal SNV VCF.
The following example command line runs ASCN tumor-normal somatic WES CNV calling concurrently with the Somatic SNV Caller, which allows you to use the matched normal germline heterozygous sites directly from the SNV Caller with the command cnv-use-somatic-vc-baf true
.
If a matched normal is not available, DRAGEN CNV requires a population SNV VCF to run in tumor-only mode. The following example command line runs ASCN tumor-only somatic WES CNV calling with a population SNV VCF.
Repetitive regions in the human genome pose a challenge for general variant calling approaches which typically cannot make use of potentially misplaced MAPQ0 reads. Furthermore, high sequence homology of some genes with a pseudogene paralog can lead to a wide variety of common structural variants (SVs) in the population, requiring specialized targeted calling approaches. DRAGEN supports targeted calling for a number of genes/targets as described in subsequent target-specific sections.
The targeted caller can be enabled using the command line option --enable-targeted=true
or a subset of targets can be enabled by providing a space-separated list of target names. The supported target names are: cyp2b6
, cyp2d6
, cyp21a2
, gba
, hba
, lpa
, rh
, and smn
. For a list of all supported targeted caller options along with their default values, see . The targeted caller produces a <output-file-prefix>.targeted.json
file containing a summary of the variant caller results for each target. Additional details of individual variant calls are reported in VCF format in the <output-file-prefix>.targeted.vcf.gz
output file.
The targeted caller requires WGS data aligned to a human reference genome with at least 30x coverage. The caller may be less reliable at lower coverage. Human reference genome builds based on hg19
, hs37d5
(including GRCh37
), or hg38
are supported. The targeted caller should not be enabled with low-coverage, exome or enrichment sequencing data.
The targeted caller generates a <output-file-prefix>.targeted.json
file in the output directory. The output file is a JSON formatted file containing the fields below.
Fields in JSON | Explanation | Type and Possible Values | Present |
---|---|---|---|
The targeted caller generates a <output-file-prefix>.targeted.vcf.gz
file in the output directory. The output file is a VCFv4.2
formatted file. The targets that have VCF output are: cyp21a2, gba, hba, lpa, rh, and smn.
Small variants, structural variants, and copy number variants are reported in the same VCF file.
The <output-file-prefix>.targeted.vcf.gz
file includes the following source
header line:
For lpa, rh and smn targets, the EVENT
and EVENTTYPE
INFO fields are used to identify the called variants.
The EVENT
and EVENTTYPE
INFO fields are formally introduced in VCFv4.4
to enable the representation of complex rearrangements. This is achieved using the EVENT
field to group all the related VCF records together, and the EVENTTYPE
to classify the event. The corresponding header lines are the following.
However, the use of EVENT
is not limited to complex rearrangements and can be used to associate nonsymbolic alleles, for example in cases of variant position ambiguity in high homology regions.
Since the EVENTTYPE
values are implementation-defined, custom EVENTTYPE
header lines are included to describe each EVENTTYPE
.
For cyp21a2, gba, and hba targets, the ALLELE_ID
INFO field is used to identify the called variant alleles.
The missing value .
is used when no identifier is available (e.g. a wild type allele) or applicable (e.g. allele index 0 for a structural variant record).
In the case of target variants in a high homology region, each variant is reported ambiguously at all corresponding homologous positions (i.e. in both the pseudogene and in the target gene). Additional analysis for these variants can be performed if absolute certainty that these variants are located in the target gene (e.g. in gba or cyp21a2) is required.
For lpa and smn the ploidy of the called genotype (FORMAT/GT
field) corresponds to the combined copy number from all the homologous positions. For cyp21a2, gba and hba, this "joint" genotype from all the homologous positions is instead reported in a separate FORMAT/JGT
field which is then collapsed into a diploid genotype and reported in the FORMAT/GT
field. The following fields are reported for "joint" calls:
Note that the FORMAT/GQ
and FORMAT/JGQ
fields contain the unconditional genotype quality, unlike the VCF spec where FORMAT/GQ
is defined as the genotype quality conditioned on the site being variant.
In the depicted example there are two genes A and B that include a high homology region. The usual process to call variants in these regions is to make a joint pileup of the reads aligning in both genes A and B and call the variants using a model with a ploidy proportional to the total copy number of the regions. This generates divergent possible genotypes that are equally likely since the variant cannot be confidently placed in either gene A or gene B. For lpa and smn the variant would be reported as follows:
Given the unconventional ploidy of the FORMAT/GT
field used in this representation, a TargetedRepeatConflict
filter is applied to these records. The header line for the filter is the following.
For cyp21a2, gba and hba, a conventional diploid FORMAT/GT
is reported and so no TargetedRepeatConflict
filter is applied. Due to the ambiguity in placing target variants in high homology regions, the corresponding QUAL
and FORMAT/GQ
fields can be much lower than conventional small variant calls (i.e. Phred 3 for a single variant allele copy across two homologous diploid positions). Therefore, instead of filtering on QUAL
and FORMAT/GQ
for these records, the records are filtered based on the FORMAT/JVQL
and FORMAT/JGQ
fields:
Since the wild type alleles at homologous positions may be different from each other or different from the reference alleles, an additional filter is applied when only wild type alleles are detected across the homologous positions. This avoids making ambiguous variant calls when no target variant of interest is detected.
In the case of an identified gene conversion event in rh, a small variant is reported at each differentiating site in the acceptor region.
In the depicted example there are two genes A and B and gene A is the acceptor of a gene conversion from gene B (green box in the figure). Gene conversions are identified by observing variations in copy number at differentiating sites (blue and pink bars in the figure) in consecutive regions. Copy number variations between regions define the breakends of the gene conversion. An equivalent VCF representation for gene conversion would use CNV and SV entries with breakends corresponding to the donor/acceptor regions. However, only the small variant representation is currently supported.
In the case of a detected gene conversion event, there may be differentiating sites with a genotype that is inconsistent with that gene conversion event. In these cases the RecombinantConflict
filter is applied. The RecombinantConflict
is defined by the following header line.
In the example, the resulting representation is as follows.
For cyp21a2 and gba, nonallelic homologous recombination can result in gene deletion or duplication in the case of reciprocal recombination or gene conversion in the case of nonreciprocal recombination. Both gene deletion and gene conversion can introduce loss-of-function variants and in both cases the targeted caller will report these variants in the target gene. In the case of gene deletion, the differentiating sites at the nontarget (i.e. pseudogene) positions will contain the overlapping deletion allele *
while the differentiating sites in the target will contain any variant alleles. Although an equivalent VCF representation would be to simply report the deletion with a single structural variant VCF record, reporting small variant VCF records in the target gene allows for identification of the specific mutations that may occur in a gene transcript and matches well with annotation using HGVS nomenclature. Similarly, for gene conversions, variants are reported at differentiating sites in the target gene, rather than as pairs of structural variant breakends.
The use of GT=0
for symbolic structural variant alleles is formally disambiguated in VCFv4.4
, specifying that "GT=0 indicates the absence of any of the ALT symbolic structural variants defined in the record". With this convention we can report compound overlapping heterozygous structural variants.
In the hba genotype depicted above, two overlapping SVs can be represented as follows:
The relevant header lines for the VCF records above are:
In the depicted example there is a Variable Number Tandem Repeat (VNTR) region composed of three repeat units in the reference. The CN
INFO field is used to report the allele copy number, the CN
FORMAT field is used to report the region total copy number given by the sum of the allele copy numbers, and the REPCN
FORMAT field is used to report the repeat unit copy number equal to the allele copy number multiplied by the number of repeat units in the reference.
This VNTR can be represented as follows:
The REPCN
and CN
header lines are:
For lpa, rh and smn, the TargetedLowQual
filter is applied if the QUAL
of a target variant is less than 3.00
.
Similarly, for cyp21a2 and gba the TargetedLowVQL
filter is applied if the VQL
of a target variant in low-homology region is less than 3.00
.
The TargetedLowGQ
filter is applied if the targeted variant has GQ
smaller than 3
.
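The three thresholds above can be summarized in a short Python sketch. It covers only the filters named in this passage and does not handle joint calls in high-homology regions, which are filtered on other fields. The function name `targeted_quality_filters` and its arguments are hypothetical.

```python
def targeted_quality_filters(target, qual=None, vql=None, gq=None,
                             low_homology=True, threshold=3.0):
    """Return the quality filters from this passage that apply to one call."""
    filters = []
    if target in {"lpa", "rh", "smn"} and qual is not None and qual < threshold:
        filters.append("TargetedLowQual")
    if (target in {"cyp21a2", "gba"} and low_homology
            and vql is not None and vql < threshold):
        filters.append("TargetedLowVQL")
    if gq is not None and gq < threshold:
        filters.append("TargetedLowGQ")
    return filters


print(targeted_quality_filters("smn", qual=2.5, gq=10))   # ['TargetedLowQual']
print(targeted_quality_filters("gba", vql=1.0, gq=2.0))   # ['TargetedLowVQL', 'TargetedLowGQ']
print(targeted_quality_filters("rh", qual=3.0, gq=3.0))   # []
```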
hard-filtered
Files
When the small variant caller is enabled, the targeted small variant VCF calls can be merged into the <output-file-prefix>.hard-filtered.vcf.gz
and <output-file-prefix>.hard-filtered.gvcf.gz
files, referred to collectively as the hard-filtered
files. The --targeted-merge-vc
command line option can be used to control which targets will have their small variant VCF records merged into the hard-filtered
files. For example, --targeted-merge-vc rh
will enable merging of the calls from the rh
caller into the hard-filtered
files and --targeted-merge-vc rh hba
will enable merging of the calls from the rh
and hba
targets into the hard-filtered
files. The true
value will merge all calls from all supported targets into the hard-filtered
files, while the false
value will merge no calls into the hard-filtered
files.
The targeted calls merged into the hard-filtered
files are marked with a TARGETED
INFO flag.
When enabled, targeted small variants are merged into the hard-filtered
files regardless of any regions that may be provided using the --vc-target-bed
option.
The merging strategy for targeted small variant calls is to prioritize the targeted calls over small variant calls from the germline small variant caller. When a germline small variant call overlaps a targeted caller call, then the small variant call is filtered with a TargetedConflict
filter if any of the following holds:
The targeted caller call is PASS
.
The small variant call and targeted caller call have incompatible genotypes and the targeted caller call is not filtered with the TargetedLowGQ
filter.
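The two conditions above can be written as a small predicate. This is a sketch under stated assumptions: the function name and its inputs are hypothetical, the targeted call's filters are passed as a collection of strings (empty or `PASS` meaning unfiltered), and genotype compatibility is assumed to have been decided elsewhere.

```python
def small_variant_gets_targeted_conflict(overlaps, targeted_filters, genotypes_compatible):
    """Decide whether an overlapping small variant call is marked TargetedConflict."""
    if not overlaps:
        # Non-overlapping calls are never in conflict.
        return False
    targeted_filters = set(targeted_filters)
    targeted_is_pass = not (targeted_filters - {"PASS"})
    if targeted_is_pass:
        # Condition 1: the targeted caller call is PASS.
        return True
    # Condition 2: incompatible genotypes, unless the targeted call is TargetedLowGQ.
    return (not genotypes_compatible) and "TargetedLowGQ" not in targeted_filters
```

Under this reading, a targeted call filtered only with `TargetedLowGQ` never displaces the overlapping small variant call, even when the genotypes disagree.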
The strategy is summarized in the following examples.
The TARGETED
call is PASS
.
The TARGETED
call and the small variant call do not overlap.
The TARGETED
call is filtered with TargetedLowQual
and has a discordant variant representation with the overlapping small variant call.
The TARGETED
call is filtered with TargetedLowQual
and has a discordant genotype with the overlapping small variant call.
The TARGETED
call is filtered with TargetedLowGQ
and has a discordant genotype with the overlapping small variant call.
The following command-line example runs the targeted caller from FASTQ input:
The following command-line example runs cyp21a2 only using BAM input without realignment:
The HBA Caller is capable of genotyping the HBA1 and HBA2 genes from whole-genome sequencing (WGS) data. Due to high sequence similarity between the genes, a specialized caller is necessary to resolve the possible genotypes of the pair of genes. We consider regions surrounding the HBA1 and HBA2 sites to resolve the possible HBA1 and HBA2 genotypes.
The HBA Caller performs the following steps:
Determines total copy number from read depth of the regions surrounding the HBA1 and HBA2 sites.
Determines HBA genotypes based on the copy number of the regions surrounding the HBA1 and HBA2 sites.
Calls small variants in the HBA1 and HBA2 regions based on the region copy number derived from the genotype along with allele counts from read information.
The HBA Caller requires WGS data aligned to a human reference genome with at least 30x coverage. Reference genome builds must be based on hg19
, GRCh37
, or hg38
.
For a comprehensive evaluation of the HBA caller, see .
The first step of HBA calling is to determine the copy number of the regions surrounding the HBA1 and HBA2 sites. Reads aligned to the regions are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples. Finally, a Gaussian Mixture Model (GMM) is used to obtain the integer region copy number from the region normalized counts.
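The normalization and copy number assignment described above can be illustrated with a simplified Python sketch. It is not the DRAGEN implementation: GC-bias correction is omitted, the function names are hypothetical, and the mixture uses fixed, assumed component widths (`sd_per_copy`, `sd_floor`) rather than a fitted GMM.

```python
import math
from statistics import median


def normalized_depth(region_count, region_len, norm_counts, norm_len=2000):
    """Scale a region's per-base read count so that a diploid region is about 2.0."""
    baseline = median(count / norm_len for count in norm_counts)
    return 2.0 * (region_count / region_len) / baseline


def call_copy_number(depth, max_cn=6, sd_per_copy=0.1, sd_floor=0.05):
    """Pick the integer copy number whose Gaussian component best explains depth."""
    best_cn, best_ll = None, -math.inf
    for cn in range(max_cn + 1):
        sd = max(sd_floor, sd_per_copy * cn)  # assumed spread of this component
        ll = -0.5 * ((depth - cn) / sd) ** 2 - math.log(sd)
        if ll > best_ll:
            best_cn, best_ll = cn, ll
    return best_cn
```

The widening of each component with copy number is a design choice of this sketch; with equal widths the procedure would reduce to rounding the normalized depth.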
The genotyping step attempts to identify the two likely haplotypes described in the following table, where "a" stands for a functional copy of either HBA1 or HBA2, "-" stands for a nonfunctional/missing copy of either HBA1 or HBA2, while "3.7" and "4.2" describe the recombinant event that likely caused the deletion/duplication of the functional HBA copy. The second column of the following table reports the interpretation of the genotype.
Genotype | Interpretation |
---|
If none of the previous genotypes is identified, then no call is made and the caller reports a "None" genotype.
18 small variants are detected from the read alignments. These variants occur in homologous regions of HBA1 and HBA2 where reads mapping to either HBA1 or HBA2 are used for variant calling.
For each variant, reads containing either the variant allele or the nonvariant allele are counted and a binomial model is used to determine the likelihood for each possible variant allele copy number up to the maximum possible as determined from the HBA1/HBA2 genotyping.
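A minimal Python sketch of such a binomial model follows. The function name, the `error_rate` parameter, and its default value are assumptions made for illustration; the text above does not specify how sequencing error enters the model for this caller.

```python
import math


def variant_copy_number(alt_reads, ref_reads, max_copies, error_rate=0.01):
    """Return (most likely variant copy number, per-copy-number log-likelihoods)."""
    if max_copies < 1:
        raise ValueError("max_copies must be at least 1")
    log_liks = []
    for k in range(max_copies + 1):
        frac = k / max_copies  # expected variant allele fraction for k copies
        # Assumed error model: an allele is misread with probability error_rate.
        p_alt = frac * (1.0 - error_rate) + (1.0 - frac) * error_rate
        # The binomial coefficient is the same for every k, so it is omitted.
        log_liks.append(alt_reads * math.log(p_alt) + ref_reads * math.log(1.0 - p_alt))
    best = max(range(max_copies + 1), key=log_liks.__getitem__)
    return best, log_liks
```

Here `max_copies` plays the role of the maximum copy number determined by the HBA1/HBA2 genotyping step, for example 4 for a sample with four functional copies.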
Each variant reported in the variants
array will have the fields below.
An example of the HBA caller content in the <output-file-prefix>.targeted.json
output file is shown below.
DRAGEN can find and remove variants that are common to separate VCF files. DRAGEN supports the following modes:
Small indel deduplication—If using a structural variant VCF and a small variant VCF, DRAGEN filters all small indels in the structural variant VCF that appear and are passing in the small variant VCF (PASS
in the FILTER
column of the small variant VCF file). DRAGEN leaves the input SV and small variant VCF files unchanged and writes a new VCF that contains only the SV VCF variants that do not match a variant in the small variant VCF. The name of the deduplicated SV VCF file consists of the prefix passed by --output-file-prefix
followed by the suffix sv.small_indel_dedup.vcf.gz
. The diagram below describes the small indel deduplication pipeline. You must provide the reference genome that was used to generate the VCF files so that DRAGEN can normalize the variants. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases. This feature is useful, for example, when both the SV and small variant callers are used in somatic workflows: combining the callers can increase sensitivity, and deduplication prevents the same variant from being reported twice in genes such as FLT3 and KMT2A.
SMN deduplication—If using a small variant VCF and an ExpansionHunter VCF, DRAGEN filters any lines in the small variant VCF that have the same chromosome and position as lines in the ExpansionHunter VCF with the INFO tag VARID=SMN
. A reference genome is not required.
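The SMN deduplication rule can be sketched in Python on plain VCF text lines. This is an illustration, not DRAGEN code: the function names are hypothetical, and "filters" is interpreted here as dropping the matching small variant records from the output.

```python
def smn_positions(eh_vcf_lines):
    """Collect (chromosome, position) of ExpansionHunter records tagged VARID=SMN."""
    positions = set()
    for line in eh_vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        if "VARID=SMN" in fields[7].split(";"):
            positions.add((fields[0], fields[1]))
    return positions


def dedup_small_variants(small_vcf_lines, eh_vcf_lines):
    """Keep header lines and every small variant record not at an SMN position."""
    blocked = smn_positions(eh_vcf_lines)
    kept = []
    for line in small_vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if (chrom, pos) not in blocked:
            kept.append(line)
    return kept
```

Because the comparison uses only chromosome and position, no reference genome or variant normalization is needed, which matches the statement above.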
Use the following command line options to input VCF or gVCF files. The input files are not altered.
vd-sv-vcf
—Specify a structural variant VCF or gVCF.
vd-small-variant-vcf
—Specify a small variant VCF or gVCF.
vd-eh-vcf
—Specify an ExpansionHunter VCF or gVCF.
DRAGEN determines the name and type of the output file as follows.
You can use the following command line options for variant deduplication.
The following is an example command for an SMN deduplication standalone run:
You can also run small indel deduplication automatically on outputs from the DRAGEN joint caller where both structural variant and small variant callers are enabled. To run small indel deduplication automatically, set enable-variant-deduplication
to true
, and make sure the vd-sv-vcf
, vd-small-variant-vcf
, and vd-eh-vcf
input options are not set. Only small indel deduplication can be run automatically.
The following is an example command for an automatic small indel deduplication run.
Disruption of all copies of the SMN1 gene in an individual causes spinal muscular atrophy (SMA). SMN1 has a high-identity paralog, SMN2, which differs from it by only approximately 10 SNVs and small indels. For example, hg19 chr5:70247773 C>T affects splicing and largely disrupts the production of functional SMN protein from SMN2. Due to the high-similarity duplication combined with common copy number variation, standard whole-genome sequencing (WGS) analysis does not produce complete variant calling results for SMN. Since 95% of SMA cases result from the absence of the functional C (SMN1) allele in any copy of SMN¹, a targeted calling solution can be effective in detecting SMA.
DRAGEN offers the following two independent components that can call the SMN1 copy number using WGS data from a germline sample.
ExpansionHunter
SMN Caller
SMA calling is implemented together with repeat expansion detection using sequence-graph realignment to align reads to a single reference that represents SMN1 and SMN2.
In addition to the standard diploid genotype call, SMA Calling with ExpansionHunter uses a direct statistical test to check for presence of any C allele. If a C allele is not detected, the sample is called affected, otherwise unaffected.
SMA calling is only supported for human whole-genome sequencing with PCR-free libraries.
To enable SMA calling along with repeat expansion detection, set the --repeat-genotype-enable
option to true
. For information on graph-alignment options, see .
To activate SMA calling, the variant specification catalog file must include a description of the targeted SMN1/SMN2 variant. The <INSTALL_PATH>/resources/repeat-specs/experimental
folder contains example files.
The <output-file-prefix>.repeat.vcf
file includes SMN output along with any targeted repeats. SMN output is represented as a single SNV call at the splice-affecting position in SMN1 with SMA status in the following custom fields.
Field | Description |
---|
The SMN Caller calls SMN1 and SMN2 copy numbers and detects the presence of a SNP, NM_000344.4:c.*3+80T>G
that is associated with the two-copy SMN1 allele. The caller is derived from the method implemented in Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data.²
To enable the SMN Caller, use --enable-smn=true
as part of a germline-only WGS analysis workflow. Additionally, it can also be enabled along with other targets from the targeted caller by using the option --enable-targeted=true
. The SMN Caller is disabled by default.
The SMN Caller performs the following steps:
Determines total and intact SMN copy numbers
Calls SMN1 copy number at eight differentiating sites
Determines copy number for NM_000344.4:c.*3+80T>G
The SMN Caller requires WGS data aligned to a human reference genome with at least 30x coverage.
Two common copy-number variants (CNVs) in SMN1 and SMN2 include whole gene CNV and a partial gene deletion of exons 7 and 8. Reads that align to either SMN1 or SMN2 are counted. The read counts in exon 1 through exon 6 are used to determine total SMN copy number. The read counts in exon 7 and 8 are used to determine the SMN copies that do not have the exon 7 and 8 deletion (intact SMN copy number). To estimate the SMN copy number for these two regions, read counts are normalized to a diploid baseline derived from 3000 preselected 2 kb regions across the genome. The 3000 normalization regions are randomly selected from the portion of the reference genome that has stable coverage across population samples. The SMN Caller then calculates the number of SMN copies that have the exon 7 and 8 deletion by subtracting the intact SMN copy number from the total SMN copy number.
To calculate the SMN1 copy number, the caller uses eight predefined differentiating sites in exons 7 and 8 of SMN1 and SMN2. One of these sites is the splice site variant used for SMA calling with ExpansionHunter (see SMA Calling With ExpansionHunter). The caller selects differentiating sites at positions that have sequence differences between SMN1 and SMN2 where calling the SMN1 copy number is most likely to be correct based on sequencing data from the 1000 Genomes Project.
For each differentiating site, the SMN1-specific and SMN2-specific alleles are counted in reads mapping to either SMN1 or the homologous region in SMN2. The caller uses a binomial model to calculate the likelihood of each possible SMN1 copy number from the two gene-specific counts given the intact SMN copy number calculated in the previous step.
NM_000344.4:c.*3+80T>G
For this high-homology region SNP, reads mapping to either SMN1 or SMN2 are used for variant calling. The number of reads containing the variant allele and the nonvariant allele are counted and then a binomial model that incorporates the sequencing error rate is used to determine the most likely variant allele copy number (0 for nonvariant).
For the SMN Caller, the fields are defined as follows.
Each variant reported in the variants
array will have the fields below.
The variant NM_000344.4:c.*3+80T>G
is also reported in a <output-file-prefix>.targeted.vcf[.gz]
file in the output directory. The output file is a VCFv4.2
formatted file and may be compressed. The variant is reported with the VARIANT_IN_HOMOLOGY_REGION
flag in the INFO
field and also filtered with the TargetedRepeatConflict
filter. This variant lies in a region of homology between SMN1 and SMN2 and is therefore reported twice, once for each of the SMN1 and SMN2 regions, with both records connected by the same EVENT
in the INFO
field. The ploidy of the variant is reported in concordance with the identified genotype.
An example of the VCF entry for the variant NM_000344.4:c.*3+80T>G is as follows.
The variant NM_000344.4:c.*3+80T>G in the <output-file-prefix>.targeted.vcf[.gz]
file can also be included in the <output-file-prefix>.hard-filtered.vcf[.gz]
by including smn
in the --targeted-merge-vc
list, i.e. --targeted-merge-vc smn
. The output file <output-file-prefix>.targeted.vcf[.gz]
is compressed by default. This option can be disabled using --enable-vcf-compression=false
.
¹Wirth B. An update of the mutation spectrum of the survival motor neuron gene (SMN1) in autosomal recessive spinal muscular atrophy (SMA). Human Mutation. 2000;15(3):228-237. doi:10.1002/(SICI)1098-1004(200003)15:3<228::AID-HUMU3>3.0.CO;2-9
²Chen X, Sanchis-Juan A, French CE, et al. Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data. Genetics in Medicine. 2020;22(5):945-953. doi:10.1038/s41436-020-0754-0
You can enable de novo structural variant quality scoring in DRAGEN.
To enable de novo scoring for structural variant joint diploid calling, set --sv-denovo-scoring
to true. To adjust the threshold value for which variants are classified as de novo, use the --sv-denovo-threshold
command line option. See DN Field for more information.
De novo scoring requires the following two files:
A pedigree file that specifies the relationship of all samples in the pedigree.
The VCF output from germline structural variant calling analysis run jointly over all samples in the pedigree.
The pedigree file is required for de novo scoring. Use the same file format as required for joint small variant calling analysis and de novo scoring. For information on the file format, see . The file specifies which sample in the trio is the proband, mother, or father. If there are multiple trios specified in the pedigree file (eg, multigeneration pedigree or siblings), DRAGEN automatically detects the trios and provides the de novo scores on the proband sample of each detected trio.
DRAGEN applies de novo scoring to the VCF output from germline structural variant analysis for all samples specified in the pedigree file. You can supply the VCF file directly using the command line or produce the file as part of the DRAGEN run where de novo scoring is enabled.
De novo scoring adds the de novo quality score (DQ
) and de novo call (DN
) fields for each sample in the output VCF file.
The DQ
field is defined as follows.
The DQ
field represents the Phred-scaled posterior probability of the variant being de novo in the proband. For example, DQ scores of 13 and 20 correspond to a posterior probability of a de novo variant of 0.95 and 0.99. If DRAGEN can calculate the DQ score, the score is added to the proband samples. If the DQ score cannot be calculated, the field is set to ".".
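The relationship between a Phred-scaled DQ score and the posterior probability can be checked with a short Python sketch. The function names are illustrative and not part of DRAGEN; only the standard Phred definition is assumed.

```python
import math


def dq_to_probability(dq):
    """Posterior probability that the variant is de novo for a Phred-scaled DQ."""
    return 1.0 - 10.0 ** (-dq / 10.0)


def probability_to_dq(probability):
    """Phred-scaled DQ for a posterior de novo probability below 1."""
    return -10.0 * math.log10(1.0 - probability)
```

With these definitions, a DQ of 20 corresponds to a probability of 0.99, and a DQ of 13 to approximately 0.95, in line with the example above.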
The DN
field is defined as follows.
DRAGEN compares valid (> 0) DQ scores against a threshold value. You can set the threshold value using the --sv-denovo-threshold
command line option. For example, to set the threshold value to 10, add --sv-denovo-threshold 10
to the command line. The default threshold value is 20.
If a DQ score is greater than or equal to the threshold value, the DN
field is set to DeNovo
. If the DQ score is below the threshold value, the DN
field is set to LowDQ
. If the DQ is 0 or ".", the DQ score is invalid and the DN
field is set to ".".
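The DN assignment rules above amount to the following decision function, shown as a Python sketch. The function name is hypothetical, and a missing DQ value (".") is represented here by `None`.

```python
def classify_dn(dq, threshold=20.0):
    """Return the DN value for a DQ score, following the rules described above."""
    if dq is None or dq <= 0:
        # A DQ of 0 or "." is invalid, so no de novo call is made.
        return "."
    return "DeNovo" if dq >= threshold else "LowDQ"
```

For example, a DQ of 15 gives `LowDQ` with the default threshold of 20 but `DeNovo` when the threshold is lowered to 10 with `--sv-denovo-threshold 10`.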
You can use de novo structural variant scoring in the following workflows.
Perform de novo scoring in two DRAGEN runs. In the first, run germline structural variant analysis jointly over all samples in the pedigree file. In the second, apply de novo structural variant scoring to the joint germline VCF output. See Two-Run Workflow.
Perform de novo scoring in one DRAGEN run. Run germline structural variant analysis jointly over all samples in the pedigree file, and then apply de novo scoring to the joint germline structural variant calls. See One-Run Workflow.
In the two-run workflow, first run a standard DRAGEN joint germline analysis over multiple samples as shown in the following example.
In the second run, use the VCF output (<OUT_DIR1>/<PREFIX1>.sv.vcf.gz
) as input for de novo scoring. You can provide the VCF input using the --variant
option. The following command line provides an example of the second run.
The resulting output VCF file (<OUT_DIR2>/<PREFIX2>.sv.vcf.gz
) includes all de novo scoring annotations.
Run a standard DRAGEN joint germline analysis over multiple samples with all required de novo scoring options. The following example shows the one-run workflow.
The resulting output VCF file (<OUT_DIR>/<PREFIX>.sv.vcf.gz
) includes all de novo scoring annotations.
Condition | Reported GT |
---|---|
Command-Line Option Name | Configuration File Option Name |
---|---|
Mode | Description |
---|---|
Command-Line Option Name | Configuration File Option Name |
---|---|
Name | Description | Default Value |
---|---|---|
Header Field | Number | Type | Description |
---|---|---|---|
ID | Description |
---|---|
ID | Description |
---|---|
ID | Description |
---|---|
ID | Description |
---|---|
Diploid or Haploid? | ALT | FORMAT:CN | FORMAT:GT |
---|---|---|---|
Exclusion Reason | Description | Related DRAGEN Option |
---|---|---|
Column index | Column contents | Description |
---|---|---|
Fields | Required/Optional | Purpose | Type |
---|---|---|---|
Option | Type | Required | Description |
---|---|---|---|
Column | Description |
---|
Column | Description |
---|
Fields in JSON | Explanation | Type and Possible Values |
---|
Fields in JSON | Explanation | Type and Possible Values |
---|
Fields in JSON | Explanation | Type and Possible Values |
---|
Fusion Breakpoint | Hybrid Gene Structure | Star-Allele Designation |
---|
The caller prints out its calls in the targeted caller output file, <output-file-prefix>.targeted.json
that also contains calls from other targets (see ).
Fields in JSON | Explanation | Type and Possible Values |
---|
Fusion Breakpoint | Hybrid Gene Structure | Star-Allele Designation |
---|
The CYP2D6 Caller prints out its calls in the targeted caller output file, <output-file-prefix>.targeted.json
that also contains calls from other targets (see ). An example of the CYP2D6 caller content in the output is as follows:
Fields in JSON | Explanation | Type and Possible Values |
---|
The CYP21A2 Caller generates its output in the targeted caller output file <output-file-prefix>.targeted.json
that also contains calls from other targets (see ).
Fields in JSON | Explanation | Type and Possible Values |
---|
Fields in JSON | Explanation | Type and Possible Values |
---|
Recombinant-like and nonrecombinant-like variants are reported in VCF format. See for details about how these variants are reported in VCF.
Recombinant variant haplotype | HGVS identifiers |
---|
The GBA Caller generates its output in the targeted caller output file <output-file-prefix>.targeted.json
that also contains calls from other targets (see ).
Fields in JSON | Explanation | Type and Possible Values |
---|
Fields in JSON | Explanation | Type and Possible Values |
---|
Recombinant-like and nonrecombinant-like variants are reported in VCF format. See for details about how these variants are reported in VCF.
Fields in JSON | Explanation | Type and Possible Values |
---|
Fields in JSON | Explanation | Type and Possible Values |
---|
Fields in JSON | Explanation | Type and Possible Values |
---|
Tag | Name | Command line options | QUAL threshold | MAPQ0 | Mosaic Detection | Mosaic AF filter threshold |
---|---|---|---|---|---|---|
Option | Description |
---|---|
Mode | Downsampling Option | Default Value |
---|---|---|
Metric | QUAL | GQ (non-homref) | GQ (homref) | QD |
---|---|---|---|---|
GT variant 1 | GT variant 2 | GT MNV | Relevant Pipeline | Supported in DRAGEN |
---|---|---|---|---|
--sample-sex | Ploidy Estimation | Sample Sex in Small VC |
---|---|---|
WGS Coverage per Sample | Recommended Resolution* (bp) |
---|---|
Input format | enable-map-align | Required option |
---|---|---|
Option | Description |
---|---|
Users can choose between any of the three default repeat-specification files packaged with DRAGEN using the command line option: --repeat-genotype-use-catalog=<default|default_plus_smn|expanded>
. The default
option includes ~60 repeats. The default_plus_smn
option includes the SMN repeat in addition to all the repeats in the default
catalog. The expanded catalog includes ~174K repeats, see . If --repeat-genotype-use-catalog
is not specified on the command line, then the default
catalog is used.
The repeat genotyping results will be incorrect if the selected reference genome is not compatible with the repeat specification file. When this occurs, many repeats may be marked as "LowDepth" in the VCF output file or estimated to have zero length. This can be further confirmed by visualizing read alignments with the .
The default
variant catalog contains specifications on disease-causing repeats located in AFF2, AR, ARX_1, ARX_2, ATN1, ATXN1, ATXN10, ATXN2, ATXN3, ATXN7, ATXN8OS, BEAN1, C9ORF72, CACNA1A, CBL, CNBP, COMP, CSTB, DAB1, DIP2B, DMD, DMPK, EIF4A3, FMR1, FOXL2, FXN, GIPC1, GLS, HOXA13_1, HOXA13_2, HOXA13_3, HOXD13, HTT, JPH3, LRP12, MARCHF6, NIPA1, NOP56, NOTCH2NLC, NUTM2B-AS1, PABPN1, PHOX2B, PPP2R2B, PRDM12, PRNP, RAPGEF2, RFC1, RUNX2, SAMD12, SOX3, STARD7, TBP, TBX1, TCF4, TNRC6A, VWA1, XYLT1, YEATS2, ZIC2 and ZIC3 genes. More information about disease-causing repeats can also be found .
For the expanded
variant catalog, apart from the aforementioned disease-causing repeats, there are ~174K additional polymorphic repeats. They are initially detected using STR-Finder from the 1000 Genomes Project. After that, the candidate repeats are filtered out based on a customized quality control pipeline, see details .
Gene containing repeat | Repeat number (normal) | Repeat number (pre-mutation) | Repeat number (expansion) |
---|
Field | Description |
---|
Field | Description |
---|
Field | Description |
---|
Tumor purity can be estimated automatically through the workflow.
Option | Description |
---|
You can enable additional features when a matched normal sample and the outputs from DRAGEN Germline analysis are also available. If a matched normal sample is available, enable germline-aware mode and VAF-aware mode using the following example command line. For more information on germline-aware mode and VAF-aware mode, see and .
The target counting stage and its output are the same as for the germline CNV calling case. The target intervals with the read counts are output in a *.target.counts.gz
file. If there is insufficient read depth coverage detected, processing will halt. For low depth tumor samples, the value of --cnv-interval-width
can be increased to capture more alignments. The B-allele counting occurs in parallel with the read counting phase, and the values are output in a *.baf.bedgraph.gz
file. This file can be loaded into IGV along with other bigwig files generated by DRAGEN for visualization. See for more details on output files.
Option | Description |
---|
To turn on HET calling, specify --cnv-somatic-enable-het-calling=true
on the command line. N.B., this setting will only be honored when DRAGEN is able to identify a confident purity/ploidy model. When a confident model cannot be identified, the caller will return a default model and HET-calling will always be disabled (see section for more details and nuances of this approach).
Total Copy Number (CN) | Minor Copy Number (MCN) | ASCN Scenario |
---|
The previous can be visualized as:
Parent Copy Number Genotype | Possible Copy Number Alleles | Assumed Possible Copy Number Alleles |
---|
Mother Copy Number | Father Copy Number | Proband Copy Number | Mendelian Consistent? |
---|
is the set of all genotypes
is the set of conflicting genotypes
is the Mother copy number
is the Father copy number
is the Proband copy number
is the prior for the trio genotype
Input | Option | Argument | Description |
---|
The ASCN somatic WES CNV pipeline uses the same methods and workflow as the DRAGEN somatic WGS CNV pipeline. Please see for more details.
Calls at differentiating sites within the recombinant variant calling region will contain the same "joint" fields as are reported for nonrecombinant-like variants in high homology regions (see ). However, the collapsed diploid FORMAT/GT
will be based on any detected recombination events. Because detected recombinant variants are placed in the target gene, these records are filtered differently than the ambiguously placed, nonrecombinant-like variants in high homology regions. The INFO/Recombinant
flag is added to calls derived from recombinant variant calling to distinguish them from nonrecombinant-like variant calls in high homology regions. The FORMAT/VQL
field is used to apply the RecombinantLowVQL
filter for low quality recombinant variants and the RecombinantREF
filter is applied when the collapsed diploid FORMAT/GT
contains only reference alleles.
The targeted caller can be enabled in parallel with other components as part of a human WGS germline analysis workflow (see ).
The HBA Caller generates its output in the targeted caller output file <output-file-prefix>.targeted.json
that also contains calls from other targets (see ).
Fields in JSON | Explanation | Type and Possible Values |
---|
Fields in JSON | Explanation | Type and Possible Values |
---|
Structural variant and homology region variants are reported in VCF format. See for details about how these variants are reported in VCF.
Component | Description |
---|
Option | Description |
---|
The SNP (also referred to as g.27134T>G) has been reported in the literature to be associated with the two-copy SMN1 allele.
The SMN Caller prints out its calls in the targeted caller output file, <output-file-prefix>.targeted.json
that also contains calls from other targets (see ). An example of the SMN caller content in this file is shown below.
Fields in JSON | Explanation | Type and Possible Values |
---|
Fields in JSON | Explanation | Type and Possible Values |
---|
Tag | Name | Command line options | QUAL threshold | MAPQ0 | Mosaic Detection | Mosaic AF filter threshold |
---|---|---|---|---|---|---|
4.2 | DRAGEN 4.2 default Small Variant Caller | --enable-variant-caller=true | 3 | No | No | N/A |
4.2 HSM | DRAGEN 4.2 High Sensitivity Mode | --enable-variant-caller=true --vc-enable-high-sensitivity-mode=true | 0.4 | Yes | Yes (Alpha) | N/A |
4.3 | DRAGEN 4.3 default Small Variant Caller | --enable-variant-caller=true | 3 | Yes | Yes (Full) | 20% |
4.3 Mosaic | DRAGEN 4.3 Mosaic Detection Mode | --enable-variant-caller=true --vc-enable-mosaic-detection=true | 0.4 | Yes | Yes (Full) | 0% |
Condition | Reported GT |
---|---|
At a position with no coverage | ./. or . |
At a position with coverage but no reads supporting ALT allele | 0/0 or 0 |
At a position with coverage and reads supporting ALT allele | dependent on pipeline (germline/somatic) |
Command-Line Option Name | Configuration File Option Name |
---|---|
--Mapper.seed-density | seed-density |
--Mapper.edit-mode | edit-mode |
--Mapper.edit-seed-num | edit-seed-num |
--Mapper.edit-read-len | edit-read-len |
--Mapper.edit-chain-limit | edit-chain-limit |
Mode | Description |
---|---|
0 | No editing (default) |
1 | Chain length test |
2 | Paired chain length test |
3 | Full seed editing |
Command-Line Option Name | Configuration File Option Name |
---|---|
--Aligner.global | global |
--Aligner.match-score | match-score |
--Aligner.match-n-score | match-n-score |
--Aligner.mismatch-pen | mismatch-pen |
--Aligner.gap-open-pen | gap-open-pen |
--Aligner.gap-ext-pen | gap-ext-pen |
--Aligner.unclip-score | unclip-score |
--Aligner.no-unclip-score | no-unclip-score |
--Aligner.aln-min-score | aln-min-score |
--Aligner.min-score-coeff | min-score-coeff |
Name | Description | Default Value |
---|---|---|
vc-output-evidence-bam | Enable evidence BAM output | False |
vc-evidence-bam-output-haplotypes | Output graph haplotypes in evidence BAM | False |
vc-evidence-bam-clipped-read-threshold | Percentage of clipped reads in active region to enable evidence BAM output for that region | 10% |
vc-evidence-bam-force-output | Force evidence BAM output for all active regions | False |
Header Field | Number | Type | Description |
---|---|---|---|
END_LEFT_BND_OF | 1 | String | ID of CNV whose left end is matched to the end of SV |
END_RIGHT_BND_OF | 1 | String | ID of CNV whose right end is matched to the end of SV |
LEFT_BND | 1 | String | ID of SV that matches the left end of CNV record |
LEFT_BND_OF | 1 | String | ID of CNV whose left end is matched to SV |
MatchSv | 1 | Integer | ID of original SV that was merged with CNV record |
OrigCnvEnd | 1 | Integer | Coordinate of original CNV end |
OrigCnvPos | 1 | Integer | Coordinate of original CNV pos |
RIGHT_BND | 1 | String | ID of SV that matches the right end of CNV record |
RIGHT_BND_OF | 1 | String | ID of CNV whose right end is matched to SV |
SVCLAIM | A | String | Claim made by the structural variant call. Valid values are D, J, DJ for abundance, adjacency and both respectively |
Option | Description |
---|---|
--vc-target-coverage | Specifies the maximum number of reads covering any given position. |
--vc-max-reads-per-active-region | Specifies the maximum number of reads covering a given active region. |
--vc-max-reads-per-raw-region | Specifies the maximum number of reads covering a given raw region. |
--vc-min-reads-per-start-pos | Specifies the minimum number of reads with a start position overlapping any given position. |
--high-coverage-support-mode | Applies the high coverage mode down-sample options if set to true. Enabling this option is recommended for targeted panels with coverage over 1000x, but will slow down run time. |
Mode | Downsampling Option | Default Value |
---|---|---|
Germline | --vc-target-coverage | 500 |
Germline | --vc-max-reads-per-active-region | 10000 |
Germline | --vc-max-reads-per-raw-region | 30000 |
Somatic | --vc-target-coverage | 1000 |
Somatic | --vc-max-reads-per-active-region | 10000 |
Somatic | --vc-max-reads-per-raw-region | 30000 |
High Coverage | --vc-target-coverage | 100000 |
High Coverage | --vc-max-reads-per-active-region | 200000 |
High Coverage | --vc-max-reads-per-raw-region | 200000 |
Mitochondrial | --vc-target-coverage-mito | 40000 |
Mitochondrial | --vc-max-reads-per-active-region-mito | 200000 |
Mitochondrial | --vc-max-reads-per-raw-region-mito | 200000 |
Metric | QUAL | GQ (non-homref) | GQ (homref) | QD |
---|---|---|---|---|
Description | Probability that the site has no variant | Probability that the call is incorrect | Evidence supporting homref call | QUAL normalized by depth |
Formulation | QUAL = GP(GT=0/0) | GQ = -10*log10(p) | GQ = 10*log10[P(D\|homref)/P(D\|variant)] | QUAL/DP |
Scale | Unsigned Phred | Unsigned Phred | Signed Phred | Unsigned Phred |
Numerical example | QUAL=20: 1% chance that there is no variant at the site. QUAL=50: 1 in 1e5 chance that there is no variant at the site. | GQ=3: 50% chance that the call is incorrect. GQ=20: 1% chance that the call is incorrect. | GQ=0: no evidence. GQ>0: evidence favors homref. | |
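GT variant 1 | GT variant 2 | GT MNV | Relevant Pipeline | Supported in DRAGEN |
---|---|---|---|---|
0\|1 | 0\|1 | 0/1 | Germline and Somatic | Yes in 4.0 |
0/1 | 1/1 | 1/2 | Germline | No |
0/1 | 1/2 | 1/2 | Germline | No |
1/1 | 1/1 | 1/1 | Germline | Yes in 4.2 |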
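--sample-sex | Ploidy Estimation | Sample Sex in Small VC |
---|---|---|
male | Not relevant | Male |
female | Not relevant | Female |
none | Not relevant | None |
auto (default) | XY | Male |
auto (default) | XX | Female |
auto (default) | Everything else | None |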
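The mapping above, from the --sample-sex value and the ploidy estimation result to the sample sex used by the small variant caller, can be expressed as a small Python function. The function name and argument names are hypothetical and serve only to illustrate the lookup.

```python
def small_vc_sample_sex(sample_sex="auto", ploidy_estimate=None):
    """Resolve the sample sex used in small variant calling."""
    explicit = {"male": "Male", "female": "Female", "none": "None"}
    if sample_sex in explicit:
        # An explicit --sample-sex value overrides the ploidy estimation.
        return explicit[sample_sex]
    if sample_sex != "auto":
        raise ValueError(f"unsupported --sample-sex value: {sample_sex!r}")
    # In auto mode only XY and XX estimates map to a sex; anything else is None.
    return {"XY": "Male", "XX": "Female"}.get(ploidy_estimate, "None")
```

For example, `small_vc_sample_sex("auto", "XXY")` returns `"None"`, while `small_vc_sample_sex("male", "XX")` returns `"Male"` because the explicit setting takes precedence.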
GT
Genotype
SM
Linear copy ratio of the segment mean
CN
Estimated copy number
BC
Number of bins in the region
PE
Number of improperly paired end reads at start and stop breakpoints
GC
GC dinucleotide percentage
CT
CT dinucleotide percentage
AC
AC dinucleotide percentage
LR
Log10 likelihood ratio of ALT to REF
AS
Number of allelic read count sites
BC
Number of read count bins
CN
Estimated total copy number in tumor fraction of sample. This field is not present if the model cannot be estimated with high confidence.
CNF
Floating point estimate of tumor copy number. This field is not present if the model cannot be estimated with high confidence.
CNQ
Exact total copy number Q-score. This field is not present if the model cannot be estimated with high confidence.
MAF
Maximum estimate of the minor allele frequency
MCN
Estimated minor-haplotype copy number. This field is not present if the model cannot be estimated with high confidence.
MCNF
Floating point estimate of tumor minor-haplotype copy number. This field is not present if the model cannot be estimated with high confidence.
MCNQ
Minor copy number Q-score. This field is not present if the model cannot be estimated with high confidence.
NCN
Normal-sample copy number. The field is only present in germline-aware mode.
SCND
Difference between CN and GCN. The field is only present in germline-aware mode.
SD
Best estimate of segment's bias-corrected read count
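Diploid or Haploid? | ALT | FORMAT:CN | FORMAT:GT |
---|---|---|---|
Diploid | . | 2 | ./. |
Diploid | <DUP> | >2 | ./1 |
Diploid | <DEL> | 1 | 0/1 |
Diploid | <DEL> | 0 | 1/1 |
Haploid | . | 1 | 0 |
Haploid | <DUP> | >1 | 1 |
Haploid | <DEL> | 0 | 1 |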
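The rows above relate the called copy number to the ALT and FORMAT:GT values for diploid and haploid regions. A Python sketch of that lookup follows; the function name and the use of an integer `ploidy` argument (2 for diploid, 1 for haploid) are illustrative assumptions.

```python
def cnv_alt_and_gt(cn, ploidy):
    """Return (ALT, GT) for a copy number call in a diploid or haploid region."""
    if cn < 0:
        raise ValueError("copy number cannot be negative")
    if ploidy == 2:
        if cn == 2:
            return ".", "./."      # copy-neutral reference call
        if cn > 2:
            return "<DUP>", "./1"
        if cn == 1:
            return "<DEL>", "0/1"  # heterozygous deletion
        return "<DEL>", "1/1"      # cn == 0, homozygous deletion
    if ploidy == 1:
        if cn == 1:
            return ".", "0"
        return ("<DUP>" if cn > 1 else "<DEL>"), "1"
    raise ValueError("ploidy must be 1 or 2")
```

For instance, `cnv_alt_and_gt(3, 2)` gives `("<DUP>", "./1")` and `cnv_alt_and_gt(0, 1)` gives `("<DEL>", "1")`.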
Exclusion Reason | Description | Related DRAGEN Option |
---|---|---|
NON_KMER_UNIQUE | Non-unique Kmer bases are larger than 50% of interval. | Not applicable. This reason only applies to self-normalization mode. |
EXCLUDE_BED | Interval overlaps with exclude BED larger than threshold. | --cnv-exclude-bed-min-overlap |
PON_MAX_PERCENT_ZERO_SAMPLES | Number of PON samples with 0 coverage is larger than threshold. | --cnv-max-percent-zero-samples |
PON_TARGET_FACTOR_THRESHOLD | Median coverage of interval is lower than threshold of overall median coverage. | --cnv-target-factor-threshold |
PON_MISSING_INTERVAL | Target interval not found in PON. | Not applicable |
1 | contig | chromosome name |
2 | start | genomic locus of interval start |
3 | stop | genomic locus of interval stop |
4 | name | interval name |
5 | mean | average coverage depth |
6 | std | standard deviation |
7 | normalizedStd | normalized standard deviation (std/mean) |
8 | min | minimum |
9 | 25% | 25th percentile |
10 | 50% | median |
11 | 75% | 75th percentile |
12 | max | maximum |
13 | intervalSize | interval size (stop-start) |
14 | gcContents | percent GC |
5
10000
10
5000
>= 30
1000
Fastq
TRUE
--enable-map-align=true
, --enable-duplicate-marking=true
BAM
TRUE
--enable-map-align=true
, --enable-duplicate-marking=true
BAM
FALSE
--enable-map-align=false
--cnv-normals-file | Individual normal file. This option takes a single file name and can be specified multiple times. |
--cnv-normals-list | List of normal files. A plain text file in which each line contains a path to a *.target.counts.gz or *.target.counts.gc-corrected.gz file generated by the target counts stage. Relative paths are supported if they are relative to the current working directory. Absolute paths are recommended in case the workflow is reused later or shared with other users. |
--cnv-combined-counts | PON file that combines all normal files into a single file. The combined counts file (*.combined.counts.txt.gz) can be found in the output folder of a prior DRAGEN run with the same panel of normals. Some prepackaged PON files downloaded directly from the Illumina support site require this option. |
CYP2A6
CYP2A7
FCGR3A
FCGR3B
RHD
RHCE
STRC
STRCP1
ACSM2A
ACSM2B
ACTR3B
ACTR3C
AQP12A
AQP12B
ASAH2
ASAH2B
CCDC74A
CCDC74B
CD177
CD177p1
CD8B
CD8B2
CFH1
CFHR1
CYP4A11
CYP4A22
DHX40
DHX40P1
EIF5AL1
EIF5AP4
FCGR2A
FCGR2C
FFAR3
GPR42
FOLH1
FOLH1B
FRMPD2
FRMPD2B
GPAT2
GPAT2P1
GSTT2B
GSTT2
DDT
DDTL
HCAR2
HCAR3
HSPA1A
HSPA1B
KRT81
KRT86
LGALS7
LGALS7B
MRPL45
MRPL45P2
MSTO1
MSTO2p
MUC20
MUC20P1
MZT2A
MZT2B
OTOA
OTOAp1
PDPR
PDPR2P
PIEZO2
ENST00000591853.1
ZP3
POMZP3
PRAMEF7
PRAMEF8
PROS1
PROS2P
RMND5A
ANAPC1P2
ROCK1
ROCK1p1
SERPINB3
SERPINB4
SYT3
ZNF473CR
TBC1D26
TBC1D28
TOP3B
TOP3BP1
TUBA3D
TUBA3E
ZNF443
ZNF799
chr2 | 151578759 | 151588523 | NEB exon 98-105 |
chr2 | 151589318 | 151599076 | NEB exon 90-97 |
chr2 | 151599871 | 151609628 | NEB exon 82-89 |
chr2 | 178653238 | 178654995 | TTN exon 172-180 |
chr2 | 178657498 | 178659255 | TTN exon 181-189 |
chr2 | 178661759 | 178663516 | TTN exon 190-198 |
chr5 | 70049522 | 70077596 | SMN2 |
chr5 | 70924940 | 70953013 | SMN1 |
chr7 | 5970924 | 5980896 | PMS2 exon 13-15 |
chr7 | 5980968 | 5987689 | PMS2 exon 11-12 |
chr7 | 6737007 | 6743712 | PMS2CL exon 2-3 |
chr7 | 6743880 | 6753867 | PMS2CL exon 4-6 |
chr15 | 43599563 | 43602630 | STRC exon 24-29 |
chr15 | 43602982 | 43611000 | STRC exon 14-23 |
chr15 | 43611040 | 43618800 | STRC exon 1-13 |
chr15 | 43699379 | 43702452 | STRCP1 exon 23-28 |
chr15 | 43702488 | 43710472 | STRCP1 exon 13-22 |
chr15 | 43710502 | 43718262 | STRCP1 exon 1-12 |
chrX | 154555884 | 154565047 | IKBKG exon 3-10 |
chrX | 154639390 | 154648553 | IKBKGP1 |
regions
Required only when a chromosome of mixed ploidy is present in the Reference Panel folder
Define contig name and subregion name of mixed ploidy chromosome
Dictionary in the form: contigname_of_mixed_ploidy :[contigname_of_mixed_ploidy"_par1", contigname_of_mixed_ploidy"_par2", contigname_of_mixed_ploidy"_nonpar1", contig_name_of_mixed_ploidy"_nonpar2"...]
ploidy
“default” is a required name
contigname_of_mixed_ploidy_"nonpar" is required only when a chromosome of mixed ploidy is present in the Reference Panel folder
Define:
ploidy behavior when different from “default”
default ploidy behavior
Dictionary in the form: contigname_of_mixed_ploidy_"nonpar": { typename1 : 1, typename2 : 2} "default" : { "typename1": 2, "typename 2": 2} typename is used in the Sample Type file input
--enable-imputation | NA | Yes | Set to true to enable the VCF imputation pipeline. |
--imputation-ref-panel-dir | STRING | Yes | Directory containing the per-chromosome reference panel VCF and, optionally, the JSON config file. |
--imputation-ref-panel-prefix | STRING | Yes | Prefix for the reference panel files and the JSON config file. |
--imputation-genome-map-dir | STRING | Yes | Directory containing the per-chromosome genome map files. |
--imputation-chunk-input-region | STRING | Yes for single region | Target region, usually a full chromosome (e.g. chr20:1000000-2000000 or chr20). |
--imputation-chunk-input-region-list | STRING | Yes for list of regions | Text file listing the chromosomes or regions to be processed, one chromosome/region per line. |
--imputation-phase-input | STRING | Yes for single VCF file | Sample input file in VCF/BCF format. Can be a single-sample or multi-sample VCF. |
--imputation-phase-input-list | STRING | Yes for multiple VCF files | Text file listing the sample inputs in VCF/BCF format, one input file per line. |
--imputation-phase-sample-type | STRING | Yes when imputing on a non-PAR region of a mixed ploidy chromosome AND a single VCF file | Defines the typename of the imputed VCF file. The typename must match one of the two typenames defined in the JSON config file. |
--imputation-phase-sample-type-list | STRING | Yes when imputing on a non-PAR region of a mixed ploidy chromosome AND a list of VCF files | Path to the Sample Type file. |
--output-directory | STRING | Yes | Output directory. |
--output-file-prefix | STRING | Yes | Output file prefix. |
--imputation-phase-threads | INT | No | Number of threads to use. Default is the number of system threads. |
--imputation-phase-filter-input-sample-in-ref | NA | No | Default is true: if a sample ID matches between the reference panel and the sample input, the corresponding samples are ignored in the reference panel to avoid imputing a sample against itself. Set to false if all samples from the reference panel should be kept regardless of their presence in the sample input. |
--imputation-phase-impute-reference-only-variants | STRING | No | Default is false. If set to true, allows imputation at variants that are only present in the reference panel. This option is intended only for imputation at sporadic missing variants. If the number of missing variants is non-sporadic, re-run the genotype likelihood computation at all reference variants and avoid using this option, since data from the reads should be used. When variant calling for the input sample was performed using --vc-forcegt-vcf with a SNPs-only sites.vcf file, it is recommended to set this option to true so that INDEL positions from the reference panel are also imputed. |
--imputation-phase-input-independently | STRING | No | Default is false. If set to true, each sample input is treated independently and is not used in the reference panel calculation. |
DMPK | < 37 | 37-50 | > 50 |
FXN | < 33 | 33-66 | > 66 |
HTT | < 35 | 35-40 | > 40 |
ATN1 | < 35 | 35-49 | > 49 |
ATXN1 | < 40 | 40-41 | > 41 |
AR | < 35 | 35-36 | > 36 |
FMR1 | < 55 | 55 - 200 | > 200 |
ATXN10 | < 32 | 32-33 | > 33 |
ATXN2 | < 31 | 31-32 | > 32 |
ATXN7 | < 27 | 27-33 | > 33 |
CACNA1A | < 19 | 19-20 | > 20 |
CBL | < 80 | 80-81 | > 81 |
CSTB | < 29 | 29-30 | > 30 |
JPH3 | < 28 | 28-40 | > 40 |
PPP2R2B | < 32 | 32-65 | > 65 |
C9ORF72 | < 30 | 30-31 | > 31 |
ATXN3 | < 41 | 41-52 | > 52 |
CHROM | Chromosome identifier |
POS | Position of the first base before the repeat region in the reference |
ID | Always |
REF | The reference base at position POS |
ALT | List of repeat alleles in format |
QUAL | Always |
FILTER | LowDepth filter is applied when the overall locus depth is below 10x or number of reads that span one or both breakends is below 5. |
END | Position of the last base of the repeat region in the reference |
REF | Number of repeat units spanned by the repeat in the reference |
RL | Reference length in bp |
VARID | Variant ID from the variant catalog |
RU | Repeat unit in the reference orientation |
REPID | Variant ID from the variant catalog |
GT | Genotype |
SO | Type of reads that support the allele. Values can be SPANNING, FLANKING, or INREPEAT. These values indicate if the reads span, flank, or are fully contained in the repeat. |
REPCN | Number of repeat units spanned by the allele |
REPCI | Confidence interval for REPCN |
ADSP | Number of spanning reads consistent with the allele |
ADFL | Number of flanking reads consistent with the allele |
ADIR | Number of in-repeat reads consistent with the allele |
LC | Locus Coverage |
contig | Contig of the repeat region |
start | Approximate start of the repeat |
end | Approximate end of the repeat |
motif | Inferred repeat motif |
top_case_zscore | Top z-score of a case sample |
high_case_counts | Counts of case samples corresponding to z-score greater than 1.0 |
counts | Nonzero counts for all samples |
contig | Contig of the repeat region |
start | Approximate start of the repeat |
end | Approximate end of the repeat |
motif | Inferred repeat motif |
pvalue | P-value from Wilcoxon rank-sum test |
bonf_pvalue | P-value after Bonferroni correction |
counts | Depth-normalized counts of anchored in-repeat reads for each sample (omitting samples with zero count) |
| Specify a matched normal SNV VCF. Use when a matched normal sample and the matched normal SNV VCF are available. To use this option, you must run the matched normal sample through the DRAGEN Germline workflow. |
| Specify a population SNP VCF. Use when a matched normal sample is not available and analysis must be performed in tumor-only mode. |
| Set to |
4 | 2 | 2+2 |
4 | 1 | 3+1 |
*[LOH]4 | 0 | 4+0 |
2 | 0/2, 1/1 | 1/1 |
3 | 0/3, 1/2 | 1/2 |
4 | 0/4, 1/3, 2/2 | 1/3, 2/2 |
N | x/(N-x) for x <= N/2 | x/(N-x) for 1 <= x <= N/2 |
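The general row above (x/(N-x) for x <= N/2) can be checked with a short enumeration. This is a minimal sketch: the function name and the `include_loh` flag are hypothetical, and it assumes that the third column is the second column with the x = 0 (loss of heterozygosity) state removed, which is what the listed rows for N = 2, 3, and 4 suggest.

```python
def allelic_states(total_cn, include_loh=True):
    """Enumerate allele-specific copy number states x/(N-x) with x <= N/2.

    total_cn is the total copy number N. When include_loh is False, the
    x = 0 state (all copies on one haplotype) is skipped.
    """
    if total_cn < 0:
        raise ValueError("total_cn must be nonnegative")
    first = 0 if include_loh else 1
    return [f"{x}/{total_cn - x}" for x in range(first, total_cn // 2 + 1)]


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, allelic_states(n), allelic_states(n, include_loh=False))
```

For N = 4 the two calls reproduce the table entries 0/4, 1/3, 2/2 and 1/3, 2/2.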
2 | 2 | 2 | Yes |
2 | 2 | 1 | No |
3 | 2 | 4 | No |
3 | 2 | 2 | Yes |
2 | 0 | 2 | No |
sample | The sample name. | string |
dragenVersion | The version of DRAGEN. | string |
lpa | The LPA targeted caller specific fields. | dictionary |
kiv2CopyNumber | Total KIV-2 unit copy number | float |
refMarkerAlleleCopyNumber | Null if Homozygous REF/ALT markers call | float, null |
Float if Heterozygous markers call and stores the KIV-2 unit copy number of the allele having REF markers |
altMarkerAlleleCopyNumber | Null if Homozygous REF/ALT markers call | float, null |
Float if Heterozygous markers call and stores the KIV-2 unit copy number of the allele having ALT markers |
type | "Heterozygous markers call" if we observe both REF and ALT markers | string, "Heterozygous markers call", "Homozygous REF markers call", "Homozygous ALT markers call" |
"Homozygous REF markers call" if we observe only REF markers |
"Homozygous ALT markers call" if we observe only ALT markers |
variants | List of known variants that were detected in the KIV-2 region. | list of variants |
hgvs | HGVS identifier of the variant | string |
qual | Phred QUAL score of the variant | double |
altCopyNumber | Copy number of the ALT variant | double |
altCopyNumberQuality | Phred QUAL copy number of the ALT variant | double |
intron 4-exon 5 | 2B7-2B6 |
|
intron 4-exon 5 | 2B6-2B7 |
|
genotype | star allele genotype identified for sample | string |
genotypeFilter | The filter status for the genotype call | string (The value can include: PASS, No_call, or More_than_one_possible_genotype) |
phenotypeDatabaseAnnotation | The metabolism status corresponding to the genotype, mapped from phenotypeDatabaseSources | string |
exon 9 | 2D6-2D7 |
|
exon 9 | 2D7-2D6 |
|
intron 4 | 2D7-2D6 |
|
intron 1 | 2D7-2D6 |
|
intron 1 | 2D6-2D7 |
|
genotype | called star allele genotype | string (semi-colon delimited list of possible genotypes with haplotypes separated by |
genotypeFilter | The filter status for the genotype call | string (The value can include: PASS, No_call, or More_than_one_possible_genotype) |
phenotypeDatabaseAnnotation | The metabolism status corresponding to the genotype, mapped from phenotypeDatabaseSources | string |
totalCopyNumber | Total copy number of CYP21A2 and CYP21A1P genes including hybrids | nonnegative integer |
deletionBreakpointInGene | null (i.e. unknown) if totalCopyNumber > 3 | true, false, null |
true if CN <= 3 and a deletion-like recombinant variant haplotype is detected |
false if CN <=3 and no deletion-like recombinant variant is detected |
recombinantHaplotypes | List of detected haplotypes arising from nonallelic homologous recombination variant calling | Array of two strings. Each string consists of all associated allele IDs (if any) within the haplotype. Consecutive IDs in the same haplotype are separated by a '+'. |
variants | List of single site, nonrecombinant-like variants (i.e. not arising from nonallelic homologous recombination). An empty list if no variants are detected. | Array of nonrecombinant-like variants. |
alleleId | HGVS identifier of the variant allele | string |
alleleCopyNumber | Copy number of the allele in the called genotype | nonnegative integer |
genotypeQuality | Phred-scaled quality for the called genotype | nonnegative integer |
filter | Filter for the called genotype | string. "PASS" when not filtered |
A495P | NM_000157.4:c.1483G>C |
L483P | NM_000157.4:c.1448T>C |
D448H | NM_000157.4:c.1342G>C |
c.1263del | NM_000157.4:c.1265_1319del |
RecNciI | NM_000157.4:c.1483G>C, NM_000157.4:c.1448T>C |
RecTL | NM_000157.4:c.1483G>C, NM_000157.4:c.1448T>C, NM_000157.4:c.1342G>C |
c.1263del+RecTL | NM_000157.4:c.1483G>C, NM_000157.4:c.1448T>C, NM_000157.4:c.1342G>C, NM_000157.4:c.1265_1319del |
totalCopyNumber | Total copy number of all GBA and GBAP1 genes including hybrids | nonnegative integer |
deletionBreakpointInGene | null (i.e. unknown) if totalCopyNumber > 3 | true, false, null |
true if CN <= 3 and a deletion-like recombinant variant haplotype is detected |
false if CN <=3 and no deletion-like recombinant variant is detected |
recombinantHaplotypes | List of detected haplotypes arising from nonallelic homologous recombination variant calling | Array of two strings. Each string consists of all associated allele IDs (if any) within the haplotype. Consecutive IDs in the same haplotype are separated by a '+'. |
variants | List of single site, nonrecombinant-like variants (i.e. not arising from nonallelic homologous recombination). An empty list if no variants are detected. | Array of nonrecombinant-like variants. |
alleleId | HGVS identifier of the variant allele | string |
alleleCopyNumber | Copy number of the allele in the called genotype | nonnegative integer |
genotypeQuality | Phred-scaled quality for the called genotype | nonnegative integer |
filter | Filter for the called genotype | string. "PASS" when not filtered |
genotype | The HBA genotype. | string |
genotypeFilter | The HBA genotype filter. | string, [PASS, HBALowGQ, HBALowPValue, No_call] |
genotypeQual | The HBA Phred genotype quality. | double |
minPValue | The minimum copy number p-value of regions used to determine copy number genotype of the HBA locus. | double |
variants | List of detected homology region variants in HBA1/HBA2. | Array of variants |
alleleId | HGVS identifier of the variant allele | string |
alleleCopyNumber | Copy number of the allele in the called genotype | nonnegative integer |
genotypeQuality | Phred-scaled quality for the called genotype | nonnegative integer |
filter | Filter for the called genotype | string. "PASS" when not filtered |
Output prefix | If a value is specified for |
Deduplication mode | The prefix is followed by |
File type | The output file type matches the input file type (VCF or gVCF). If |
| To enable variant deduplication, set to |
| To generate tabix index files, set to 'true'. The default is 'true'. |
| To log matching lines to a text file, set to true. The default is false. For each match, the two matching lines follow each other, then by a new line. |
smn1CopyNumber | Copy number of intact SMN1 | nonnegative integer or null |
smn2CopyNumber | Copy number of intact SMN2 | nonnegative integer or null |
smn2Delta78CopyNumber | Copy number of SMN2Δ7–8 (deletion of exon 7 and 8) | nonnegative integer |
totalCopyNumber | Raw normalized depth of total SMN (exons 1 to 6) | nonnegative floating point number |
fullLengthCopyNumber | Raw normalized depth of intact SMN (exons 7 & 8) | nonnegative floating point number |
variants | a json array containing info about specific SMN variants | json-array |
hgvs | HGVS id of the variant being reported | string |
qual | Phred quality that at least one copy of the variant allele is found | nonnegative floating point number |
altCopyNumber | detected copy number of the variant allele | nonnegative floating point number |
altCopyNumberQuality | Phred quality of the detected copy number | nonnegative floating point number |
sample | The sample name. | string |
dragenVersion | The version of DRAGEN. | string |
rh | The RH targeted caller specific fields. | dictionary |
totalCopyNumber | Total RHD/RHCE copy number | integer |
rhdCopyNumber | RHD gene copy number | integer |
rhceCopyNumber | RHCE copy number | integer |
variants | List of known variants from recombination that were detected in RHD/RHCE. | list of variants |
hgvs | HGVS identifier of the variant | string, "NC_000001.11g.25405596_25409676con25283766_25287797" |
qual | Phred QUAL score of the variant | double |
altCopyNumber | Copy number of the ALT variant | double |
altCopyNumberQuality | Phred QUAL copy number of the ALT variant | double |
sampleId | The sample name. | string | always |
softwareVersion | The version of DRAGEN. | string | always |
phenotypeDatabaseSources | Resources used for calling metabolism status (phenotype). | json array of strings | CYP2B6 or CYP2D6 is enabled |
cyp2b6 | The CYP2B6 caller fields. | dictionary | CYP2B6 caller is enabled |
cyp2d6 | The CYP2D6 caller fields. | dictionary | CYP2D6 caller is enabled |
cyp21a2 | The CYP21A2 caller fields. | dictionary | CYP21A2 caller is enabled |
gba | The GBA caller fields. | dictionary | GBA caller is enabled |
hba | The HBA caller fields. | dictionary | HBA caller is enabled |
lpa | The LPA caller fields. | dictionary | LPA caller is enabled |
rh | The RH caller fields. | dictionary | RH caller is enabled |
smn | The SMN caller fields. | dictionary | SMN caller is enabled |
aaa3.7/aa | alpha-globin triplication |
aaa4.2/aa | alpha-globin triplication |
aa/aa | Normal |
-a3.7/aa | Silent Carrier |
-a4.2/aa | Silent Carrier |
--/aaa3.7 | Carrier |
--/aaa4.2 | Carrier |
-a3.7/-a3.7 | Carrier |
-a4.2/-a4.2 | Carrier |
-a3.7/-a4.2 | Carrier |
--/aa | Carrier |
--/-a3.7 | HbH |
--/-a4.2 | HbH |
--/-- | Hb Bart's |
VARID | SMN marks the SMN call. |
GT | Genotype call at this position using a normal (diploid) genotype model. |
DST | SMA status call: + indicates detected - indicates undetected ? indicates undetermined. |
AD | Total read counts supporting the C and T allele. |
RPL | Log10 likelihood ratio between the unaffected and affected models. Positive scores indicate the unaffected model is more likely. |
Large genomic rearrangements affecting one or more exons account for approximately 5-10% of all disease-causing mutations in the BRCA1 and BRCA2 genes in patients with hereditary breast and ovarian cancer syndrome. DRAGEN LR can detect within-gene large genomic rearrangements in tumor-only mode for targeted panels such as TruSight Oncology 500. Performance has been verified for BRCA1/2 with the TruSight Oncology 500 Assay.
Use the following command-line options to run large rearrangement detection. The same command-line options can be tested on other tumor-only pipelines.
--tso500-solid-brca-lr=true
Set to true to enable large rearrangement parameters. This option is not limited to the TruSight Oncology 500 Assay.
--cnv-normals-list
Specify the panel of normal samples used to measure the intrinsic biases of the upstream processes so that the case sample can be normalized properly. To generate a panel of normals, see the example command line. The panel of normal samples should be well matched to the case sample under analysis.
--cnv-target-bed
Specify the targeted regions of the panel.
--cnv-within-gene-lr-bed
Specify the gene regions, in BED format, in which to perform large rearrangement calling. Example file:
Run the following command on each normal sample to generate a .target.counts.gc-corrected.gz file.
List the paths to the generated .target.counts.gc-corrected.gz files in a text file, one path per line. Pass this text file to --cnv-normals-list.
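The list file can also be assembled with a short script. The following is a minimal sketch, not a DRAGEN tool: the function name and the directory and file names in the usage comment are hypothetical. It writes absolute paths, which the --cnv-normals-list description recommends over relative paths.

```python
from pathlib import Path


def write_normals_list(counts_dir, list_path,
                       suffix=".target.counts.gc-corrected.gz"):
    """Write one absolute path per line for every counts file under counts_dir."""
    files = sorted(p.resolve() for p in Path(counts_dir).rglob("*" + suffix))
    if not files:
        raise FileNotFoundError(f"no *{suffix} files found under {counts_dir}")
    # One path per line, newline separated.
    Path(list_path).write_text("".join(f"{p}\n" for p in files))
    return files


# Hypothetical usage:
#   write_normals_list("/data/normals_output", "/data/normals_list.txt")
# The resulting text file is the value passed to --cnv-normals-list.
```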
The output file .cnv.LR.json
contains the breakpoints detected for each specified gene region. The following is an example output file.
Note that the coordinates follow the BED convention [start, stop):
start: segment start coordinate (0-based, inclusive). The first base on the chromosome is numbered 0, and the start coordinate is included in the interval.
stop: segment stop coordinate (0-based, exclusive). The first base on the chromosome is numbered 0, and the stop coordinate is not included in the interval.
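Because many downstream tools and genome browsers display 1-based inclusive positions, the half-open convention is a common source of off-by-one errors. The sketch below shows the conversion; the helper names are illustrative and not part of DRAGEN.

```python
def bed_to_one_based(start, stop):
    """Convert a 0-based half-open [start, stop) interval to 1-based inclusive."""
    if start < 0 or stop <= start:
        raise ValueError("expected 0 <= start < stop")
    # Only the start shifts; the exclusive 0-based stop equals the
    # inclusive 1-based end.
    return start + 1, stop


def bed_length(start, stop):
    """Number of bases covered by a 0-based half-open interval."""
    return stop - start


if __name__ == "__main__":
    # The first 100 bases of a chromosome: BED [0, 100) is positions 1..100.
    print(bed_to_one_based(0, 100), bed_length(0, 100))
```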
While DRAGEN secondary analysis is capable of supporting up to 1000x coverage, its default settings are tuned for a more typical sample size in the ~100x range. So if you find that the processing of your large sample doesn't complete, or gives unexpected results, there are options available to improve the behavior.
Users may want to analyze high amounts of data using the DRAGEN secondary analysis. For instance, in somatic contexts it can be beneficial to sequence the tumor at a very high depth to detect mutations at even lower frequencies. DRAGEN reliably supports a total average coverage of up to 1000x. As the input read data can grow excessively, but the system memory is limited, DRAGEN can only keep a subset of the input in RAM at the same time. The area reserved for the read data is called bin_memory. Higher bin_memory size means that bigger chunks can be processed simultaneously, but less memory is available to the rest of DRAGEN or for other processes.
After the map-align step, reads are loaded into the bin_memory, looking for regions of zero coverage. A set of reads that spans two such zero-coverage loci is a callable region. The memory used by a callable region is determined by the number of reads and their length. For instance, a long region with few reads per position uses the same amount of memory as a short region with a spike in coverage. The size of a callable region must stay well below the size of the bin_memory. To this end, any callable region that surpasses the --vc-max-callable-region-memory-usage
threshold is cut into smaller regions. Due to these cuts, the accuracy of variant calls in the vicinity may be affected.
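The idea of callable regions can be illustrated with a simplified sketch. This is not DRAGEN's implementation: it assumes reads are given as 0-based half-open (start, end) intervals, uses read length as a stand-in for memory usage, and applies a hypothetical `max_bytes` limit in place of the real threshold.

```python
def callable_regions(reads, max_bytes):
    """Group reads into regions separated by zero-coverage gaps.

    A region is also cut when adding the next read would push its
    (approximate) memory usage above max_bytes.
    """
    regions = []
    current = []
    current_end = None
    current_bytes = 0
    for start, end in sorted(reads):
        size = end - start  # memory proxy: one byte per base
        gap = current_end is not None and start > current_end
        over = bool(current) and current_bytes + size > max_bytes
        if current and (gap or over):
            regions.append(current)
            current, current_end, current_bytes = [], None, 0
        current.append((start, end))
        current_bytes += size
        current_end = end if current_end is None else max(current_end, end)
    if current:
        regions.append(current)
    return regions


if __name__ == "__main__":
    reads = [(0, 10), (5, 15), (30, 40), (35, 45), (36, 46)]
    for region in callable_regions(reads, max_bytes=25):
        print(region)
```

In this toy input the first cut comes from a zero-coverage gap, while the second is forced by the memory limit inside a covered stretch. The second kind of cut is the one that may affect variant calls near the cut point.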
The following options can be used to change the bin_memory and callable region size.
--bin_memory
Set the amount of memory reserved for read data. Defaults to at least 20GB for germline and 40GB for a somatic run.
--vc-max-callable-region-memory-usage
Set the maximum size of a single callable region. Default is 13GB.
The Star Allele Caller identifies the genotypes and metabolism status of the following PGx genes that are included in FDA's PGx recommendations or have CPIC Level A designation: CACNA1S, CFTR, CYP2C19, CYP2C9, CYP3A4, CYP3A5, CYP4F2, IFNL3, RYR1, NUDT15, SLCO1B1, TPMT, UGT1A1, VKORC1, DPYD, G6PD, MT-RNR1, BCHE, ABCG2, NAT2, F5, and UGT2B17. It finds optimal genotypes for these genes based on star allele definitions from the resources listed below. It calls metabolism status based on a PharmCAT resource file that provides mappings between genotypes and phenotypes. The file is here. The Star Allele Caller primarily supports the human reference hg38, for which it supports all of the genes listed above. It also supports the following genes on the hg19 and GRCh37 references: CACNA1S, CYP2C19, CYP2C9, CYP3A4, CYP3A5, CYP4F2, IFNL3, NUDT15, SLCO1B1, VKORC1, DPYD, ABCG2, F5.
For genes CACNA1S, CFTR, CYP2C19, CYP2C9, CYP3A4, CYP3A5, CYP4F2, IFNL3, RYR1, NUDT15, SLCO1B1, TPMT, UGT1A1, VKORC1, DPYD, G6PD, MT-RNR1, ABCG2 the allele definitions are sourced from PharmGKB which are found here. For BCHE and NAT2, the alleles are sourced from this paper and this website, respectively. For UGT2B17, the star alleles are defined here. Note that since BCHE does not have defined star alleles, the Star Allele Caller checks if a sample is positive for any of the variants that are reported in the paper.
For genes CYP2C19, CYP2C9, CYP3A4, CYP3A5, CYP4F2, NUDT15, SLCO1B1, DPYD, the definitions are sourced from PharmVAR and can be found here. For the remaining hg19/GRCh37 genes, i.e., ABCG2, CACNA1S, IFNL3, F5 and VKORC1 - the allele definitions have been lifted from their corresponding definitions for hg38 (which are sourced from PharmGKB as noted above).
The Star Allele Caller has the following features.
It calls star allele genotypes from different types of genomic data like FASTQ, BAM, gVCF, VCF.
It provides additional details about the genotype call, including a confidence score.
It assumes genotypes for missing positions to be ref - these positions are listed in the output.
It assumes filtered genotype calls to be ref - these records are also listed in the output.
If multiple optimal diplotypes are satisfied, then it lists them all.
It supports different versions of the human reference hg38, hg19 and GRCh37.
For the genes UGT2B17 and CYP2C19, the caller analyzes CNV calls to detect star alleles.
The Star Allele Caller accepts different forms of sequence data as input, such as FASTQ files, BAM/CRAM files, or gVCF/VCF files.
If small variant VCF/gVCF and CNV-VCF files are used as input, they should meet the following specifications.
Must be aligned to the same human reference that is passed through the -r option.
Variants should follow a parsimonious left aligned variant representation format.
Complex variants - for example, representing closely located, independent variants, in a single record - are NOT supported.
Note that a VCF/gVCF file can also be supplied as a compressed GZ file (i.e., <file_name>.vcf.gz or <file_name>.gvcf.gz).
For running the caller, the human reference needs to be always passed as a command line option. The Star Allele Caller detects the reference version (i.e., hg19, GRCh37 or hg38) and accordingly reads in the correct allele definitions.
The Star allele caller can be enabled in parallel with other components as part of a WGS germline analysis workflow using the option --enable-pgx
(see DRAGEN Recipe - Germline WGS)
In the simplest case, the caller takes DRAGEN gVCF and DRAGEN CNV-VCF files as input. The following is an example of the command line for the basic use case.
Contrary to a variant-only VCF file, a DRAGEN gVCF file contains the genotypes for all positions in a genome. Although the gVCF format is the preferred format for the caller, it can also accept a standard variant-only VCF file as input. The command line for this case will be the same as above, with the VCF file passed instead of a gVCF file. Also, the CNV-VCF file is optional - in this case the Star Allele Caller will not call star alleles that are detected through CNV analysis. An example of this use case, with only a variant only VCF file as input, is as follows.
To run the Star Allele Caller from BAM input, the variant caller must also be enabled. Enabling the CNV caller is optional but recommended, so that CNV star alleles can be analyzed. An example of the command line for this use case is as follows.
Note that the Star Allele Caller supports the force genotyping option of the variant caller (set by --vc-forcegt-vcf), but other variant caller options, such as combining phased variants (set using --vc-combine-phased-variants-distance), are not supported at this time.
If a FASTQ file is used as input, the additional options --RGID and --RGSM must be set on the command line. An example of the command line for this use case is as follows.
Following completion of the DRAGEN Star Allele Caller run, the following output files are produced.
When the Star Allele Caller is run with small variant calling, or directly from genome VCF input, then the main output file, <prefix>.targeted.json
contains the complete and detailed results for all genes. This is an example output for one gene DPYD
and for one sample NA19374
.
The fields in the json file are as follows.
"genomeBuild": Reference version being used
"softwareVersion": Version of DRAGEN being run
"sampleId": Sample name
"phenotypeDatabaseSources": Resources used for calling metabolism status (phenotype)
"starAlleleDatabaseSources": Resources used for identifying star alleles (genotype)
"locusAnnotations": List of star allele caller results, one for each gene
"gene": Gene name
"geneId": HGNC or Ensembl id of the gene that is static
"starAlleleDatabaseSource": Resource for the star allele definitions file
"genotype": The detected star allele diplotype (or haplotype for haploid gene)
"genotypeQuality": Phred scaled quality score for the genotype
"phenotypeDatabaseAnnotation": Metabolism status corresponding to the genotype called
"supportingVariants": List of star alleles that are satisfied by found variants. The id field denotes the name of the star allele. Each non-ref star allele has a list of supportingVariants which displays the variant details (same as from the small variant vcf file. The quality field denotes the gq field from the vcf record)
"missingVariantSites": List of relevant gene sites for which vcf records are missing or filtered
"variantStarAllelesFound": List of star allele haplotypes that are satisfied by the found variants
Each Star allele genotype contains one or two haplotypes (a haplotype for chrM gene MT-RNR1 and chrX gene G6PD for male samples, and a diplotype for all other genes) separated by a slash (e.g. *1/*2
). Each haplotype is a pre-defined star allele and the definitions can be found under the allele definitions URL. Note that there may be some variance to star allele definitions and notations based on the resource and when it was last updated. When the Star Allele Caller cannot identify an optimal genotype for a gene, a no-call (./.
or .
) is made. In certain cases, more than one genotype is optimally satisfied, in that case all satisfied genotypes are listed, separated by a semi-colon (e.g. *1/*2;*3/*4
).
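The genotype notation described above is straightforward to parse downstream. The following is a minimal sketch based only on the separators and no-call symbols described in the text; the function name and the returned dictionary layout are illustrative, not part of the DRAGEN output.

```python
def parse_star_genotype(field):
    """Split a genotype string into its candidate genotypes and haplotypes.

    Candidate genotypes are separated by ';', haplotypes within a genotype
    by '/'. A genotype made up only of '.' entries is a no-call.
    """
    candidates = []
    for genotype in field.split(";"):
        haplotypes = genotype.strip().split("/")
        candidates.append({
            "haplotypes": haplotypes,
            "no_call": all(h == "." for h in haplotypes),
        })
    return candidates


if __name__ == "__main__":
    for example in ("*1/*2", "*1/*2;*3/*4", "./.", "."):
        print(example, parse_star_genotype(example))
```

A result with more than one candidate corresponds to the case in which several genotypes are optimally satisfied.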
Tsv and json files (<prefix>.star_allele.tsv
and <prefix>.star_allele.json
, respectively) are produced when the Star Allele Caller is run stand-alone from a gvcf or vcf file or if the option --targeted-enable-legacy-output
is set. The json file has the same format as <prefix>.targeted.json
(shown above) while the tsv file contains summarized star allele calls for each gene. This is an example for one gene from the tsv output. The fields are gene name and genotype.
| Specify a tumor input file. |
|
|
|
| If known, specify the sex of the sample. If the sample sex is not specified, the caller attempts to estimate the sample sex from tumor alignments. |
|
|
|
Tumor input |
| file | Specify a tumor input file. |
Normal input Option 1 |
| file | Specify a normal input file (if normal VCF is not ready). |
Normal input Option 1 |
|
|
Normal input Option 2 |
| vcf file |
Normal input Option 3 |
| vcf file |
PON option 1 |
| normal count file | Specify individual normal counts file (target.counts.gz or target.counts.gc-corrected.gz) for PON. You can use this option multiple times, one time for each file. |
PON option 2 |
| text file indicating normal count files per line | Specify text file that contains paths to the list of reference target counts files to be used as a panel of normals (new line separated). |
PON option 3 |
| file | Specify combined PON file (.combined.counts.txt.gz). |
PON option 4 | NA | If no PON sample is specified, then DRAGEN utilizes matched normal sample as single sample PON. Available for Normal input Option 1 |
Target region |
| bed file | Specify target region bed file |
Sample sex |
|
| If known, specify the sex of the sample. If the sample sex is not specified, the caller attempts to estimate the sample sex from tumor alignments. |
DRAGEN supports Tumor Mutational Burden (TMB) in Tumor-Only or Tumor-Normal Mode.
It is important to note that in T/O mode germline variants must be identified and filtered using database information and optionally also allele frequency information. These germline filtering techniques are generally not as accurate as tumor normal subtraction. When using databases only to subtract germline variants, the TMB may be slightly higher than the more accurate T/N estimate. When using database and allele frequency information to remove germline variants, the TMB may be slightly underestimated for high purity tumor samples.
DRAGEN TMB comprises the following steps:
Please refer to "Somatic mode" for detailed variant calling options.
TMB is computed over protein coding regions with sufficient coverage. If DRAGEN detects an hg19/hg38, GRCh37/GRCh38, or hs37d5 reference, it automatically selects the appropriate coding region based on the bed files available in "<INSTALL_PATH>/resources/tmb/". By default, the coverage threshold for eligible regions is 50.
The protein coding region bed file and the coverage settings can be specified explicitly using the qc-coverage options listed below in [QC coverage settings to override the default eligible region]. If DRAGEN does not automatically detect the reference, these settings must be specified.
The following variants are excluded from the TMB calculation:
Non-PASS variants
Mitochondrial variants
MNVs
Variants that do not meet the minimum depth (DP) threshold. Use the --vc-callability-tumor-thresh command line option to specify the threshold value.
Variants that do not meet the minimum variant allele frequency threshold. Use the --tmb-vaf-threshold command line option to specify the threshold value.
Variants that fall outside the eligible regions.
Tumor driver mutations. Variants with a COSMIC count ≥ 50 are treated as tumor driver mutations. You can specify the COSMIC driver threshold using the --tmb-cosmic-count-threshold command line option. The tumor driver mutations filter relies on Nirvana annotations and additionally requires settings for --enable-variant-annotation=true, --variant-annotation-assembly, and --variant-annotation-data.
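As a rough illustration of the exclusion rules above, the following Python sketch applies them to a simplified variant record. The `Variant` fields, the `exclusion_reason` helper, the mitochondrial contig names, and the MNV test are assumptions made for this example and do not describe DRAGEN's internal implementation. The default thresholds mirror the documented defaults.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # Hypothetical, simplified record (not DRAGEN's internal representation).
    chrom: str
    pos: int             # 1-based position
    ref: str
    alt: str
    filter_status: str   # VCF FILTER value
    depth: int           # DP
    vaf: float           # variant allele frequency
    cosmic_count: int    # COSMIC observations taken from the annotation


def in_regions(variant, regions):
    """regions: list of (chrom, start, end) in BED style (0-based, half-open)."""
    return any(
        chrom == variant.chrom and start < variant.pos <= end
        for chrom, start, end in regions
    )


def exclusion_reason(variant, regions, min_depth=50, min_vaf=0.05,
                     cosmic_threshold=50):
    """Return why a variant is excluded from TMB, or None if it is counted."""
    if variant.filter_status != "PASS":
        return "non-PASS"
    if variant.chrom in ("chrM", "MT", "M"):  # assumed contig names
        return "mitochondrial"
    # Assumption: an MNV is a same-length substitution longer than one base.
    if len(variant.ref) > 1 and len(variant.ref) == len(variant.alt):
        return "MNV"
    if variant.depth < min_depth:
        return "low depth"
    if variant.vaf < min_vaf:
        return "low VAF"
    if not in_regions(variant, regions):
        return "outside eligible regions"
    if variant.cosmic_count >= cosmic_threshold:
        return "tumor driver"
    return None
```

Returning the reason rather than a boolean makes it easy to tabulate how many variants each rule removes, which is similar in spirit to the per-variant trace file.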
By default, germline variants are not counted towards TMB. Variants are determined as germline based on a database or a proxi filter. The database germline filter can be disabled with tmb-skip-db-filter. Disabling the database germline filter will effectively also disable the germline proxi filter.
Database filter
Variants with a population allele count ≥ 50 that are observed in either the 1000 Genomes or gnomAD database are marked as germline. Use germline-tagging-db-threshold to change the population allele count threshold. The database germline filter relies on Nirvana annotations and requires settings for --enable-variant-annotation=true, --variant-annotation-assembly, and --variant-annotation-data.
Proxi filter
The proxi filter can be enabled with tmb-enable-proxi-filter. The proxi filter flags any variant with VAF > 0.9 as germline. It also scans the variants surrounding a specific variant and identifies those with similar VAFs. The proxi window size, which determines the number of surrounding variants, can be specified with tmb-proxi-window-size. If at least 95% (default value for tmb-proxi-fraction-threshold) and no fewer than 5 (tmb-proxi-count-threshold) of the surrounding variants of similar VAF are germline, the current variant is also marked as germline.
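The proxi filter decision can be sketched in Python as follows. This is a simplified reading of the description above, not DRAGEN's implementation. In particular, the `vaf_tolerance` parameter that defines "similar VAF" is a hypothetical value chosen for illustration, and the count threshold is assumed to apply to the germline neighbors.

```python
def proxi_is_germline(index, vafs, db_germline, window=500,
                      fraction_threshold=0.95, count_threshold=5,
                      vaf_tolerance=0.05):
    """Decide whether the variant at `index` is flagged germline by proximity.

    vafs        : VAF of each variant, ordered by genomic position
    db_germline : parallel list of booleans from the database filter
    window      : number of variants inspected before and after the target
    """
    if vafs[index] > 0.9:
        return True

    lo = max(0, index - window)
    hi = min(len(vafs), index + window + 1)

    # Neighbors whose VAF is close to the target VAF (tolerance is assumed).
    similar = [
        j for j in range(lo, hi)
        if j != index and abs(vafs[j] - vafs[index]) <= vaf_tolerance
    ]
    germline = sum(1 for j in similar if db_germline[j])

    if germline < count_threshold:
        return False
    return germline / len(similar) >= fraction_threshold
```

Because the decision depends on the database labels of the neighbors, skipping the database filter leaves the proxi filter with nothing to work from.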
The proxi filter can also use a probabilistic approach, which can be enabled with tmb-enable-prob-proxi. It estimates the expected germline allele frequency using the surrounding germline variants and then tests whether the allele frequency of the target variant is similar to the expected germline allele frequency. The P value threshold can be set with tmb-prob-proxi-p-value (the default value of 1e-15 is set for ultra-deep sequenced samples, e.g. cfDNA).
Note that proxi filters can be too aggressive for 100% pure cell lines. The probabilistic proxi filter can be problematic for mixed or contaminated samples, as these samples do not have clear germline variant allele frequency distributions.
CH filter
When processing ctDNA samples it may be beneficial to also remove CH (clonal hematopoiesis) variants. Circulating tumor DNA generally has a shorter fragment size, so CH variants can be identified based on the insert size of the reads supporting the call. To capture the insert-size distribution for each variant call, vc-log-insert-size must be specified during variant calling (step 1). Once specified, potential CH variants based on insert size distribution are labeled in the output. Additionally, CH variants can be labeled via a bed file supplied to tmb-ch-bed. Variants other than germline variants that overlap the region are labeled as CH.
Nonsynonymous consequences are detected based on the Nirvana annotations. Nirvana variants that are annotated with the following consequences are labeled as nonsynonymous:
feature_elongation, feature_truncation, frameshift_variant, incomplete_terminal_codon_variant, inframe_deletion, inframe_insertion, missense_variant, protein_altering_variant, splice_acceptor_variant, splice_donor_variant, start_lost, stop_gained, stop_lost, transcript_truncation
TMB outputs a tmb.trace.csv file with detailed information on each variant used in the TMB score. The trace file contains a column "Nonsynonymous" that indicates the appropriate status for each variant.
The subset of filtered variants that are nonsynonymous are used as numerator in the "Filtered Nonsyn Variant Count" metric.
TMB = Filtered Variants / Eligible Region (Mbp)
Nonsynonymous TMB = Filtered Nonsynonymous Variants / Eligible Region (Mbp)
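As a worked example of the two formulas above, the following Python sketch converts variant counts and an eligible region size into TMB values. The function name and the input numbers are made up for illustration.

```python
def tmb_per_mbp(variant_count, eligible_region_bp):
    """Variants per megabase pair of eligible region."""
    if eligible_region_bp <= 0:
        raise ValueError("eligible region must be larger than 0 bp")
    return variant_count / (eligible_region_bp / 1_000_000)


# Hypothetical sample: 340 filtered variants, of which 170 are
# nonsynonymous, over a 34 Mbp eligible region.
eligible_bp = 34_000_000
tmb = tmb_per_mbp(340, eligible_bp)
nonsyn_tmb = tmb_per_mbp(170, eligible_bp)
print(tmb, nonsyn_tmb)  # 10.0 5.0
```

Because the denominator is the eligible region rather than the full target, raising the coverage threshold shrinks the denominator along with the number of countable variants.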
The maximum somatic allele frequency (MSAF) mode outputs the estimated maximum somatic allele frequency of the sample. This is done by finding the confident somatic variants with the highest allele frequency. MSAF is a rough approximation of the tumor fraction of cfDNA in peripheral blood samples. The MSAF mode can be enabled with tmb-enable-msaf.
[Required]
--enable-tmb true
Enables TMB. If set, the small variant caller, Illumina Annotation Engine, and the related callability report are enabled.
--enable-variant-annotation=true, --variant-annotation-assembly, and --variant-annotation-data
Enable Nirvana, the Illumina Annotation Engine. For more information on selecting the correct assembly and downloading reference files, see Illumina Annotation Engine.
[Recommended]
[QC coverage settings to override the default eligible region]
The protein coding region and the coverage settings can be specified explicitly using the qc-coverage options listed below. All four settings must be specified to override the defaults. If DRAGEN does not automatically detect the reference, these settings must be specified.
--qc-coverage-region-1
Specify the coding regions bed file to use.
--qc-coverage-tag-1=tmb
Required to associate these coverage settings with TMB. If this setting is not specified then DRAGEN will revert to default coding regions.
--vc-callability-tumor-thresh
Specify the somatic_callable bed minimum threshold, this will limit the regions over which TMB will be computed (default is 50).
--qc-coverage-reports-1=callability
The callability report is required whenever the default TMB coverage settings are overridden.
[Optional settings]
--tmb-vaf-threshold
Specify the minimum VAF threshold for a variant. Variants that do not meet the threshold are filtered out (default=0.05)
--tmb-cosmic-count-threshold
The minimum number of observations in COSMIC for a variant to be considered a driver mutation. Driver mutations are not counted in TMB. This setting has very little impact on WES/WGS, but can help avoid bias in small panels (default=50)
--tmb-skip-db-filter
Skip database germline filtering. The database germline filter is required for tumor-only samples, but can be skipped for tumor-normal (default=false)
--germline-tagging-db-threshold
Specify the minimum allele count (total number of observations) for an allele in gnomAD or 1000 Genomes to be considered a germline variant. Variant calls that have the same position and allele are excluded from the TMB calculation (default=50)
--tmb-germline-max-cosmic-count
Restrict the db-filter. Variants with cosmic allele count higher than this threshold will never be marked as germline. Set to 0 to disable. (default=0, range=[0;1000]).
--tmb-germline-min-vaf
Restrict the db-filter. Variants with a variant allele frequency lower than this threshold will never be marked as germline. Set to 0 to disable. (default=0, range=[0;1])
--tmb-enable-proxi-filter
Enable proxi filter functionality in germline filtering. This is an optional feature that may be appropriate for T/O runs. In T/O mode the DB germline filter may not be able to detect all germline variants, especially for ethnicity groups that are not well represented in germline databases. The proxi filter uses allele frequency information to help remove germline variants missed by the DB, and can help to obtain more accurate (lower) TMB values on samples with low tumor purity. In samples with high tumor purity this filter may be too aggressive and mark some somatic variants as germline, resulting in TMB scores that are too low. (default=false)
--tmb-proxi-count-threshold
Proxi filter surrounding variant count threshold in germline filtering (default=5)
--tmb-proxi-fraction-threshold
Proxi filter surrounding variant db filter fraction threshold in germline filtering (default=0.95)
--tmb-proxi-window-size
Number of surrounding variants before and after the target variant for proxi filter (default=500)
--tmb-ch-bed
Variants in the region will be labeled as clonal hematopoiesis (CH) variants.
--tmb-ch-insert-p-value
Minimum P value to classify a variant as CH using insert size (default=0.1)
--tmb-ch-insert-min-len
Minimum fragment size to test for CH using insert size (default=100)
--tmb-ch-insert-max-len
Maximum fragment size to test for CH using insert size (default=200)
--tmb-ch-insert-min-num
Minimum number of fragment size record to test for CH using insert size (default=50)
--tmb-enable-msaf
Enable MSAF output (default=false)
--tmb-msaf-p-value
Maximum P value (from insert size) to call confident somatic variant (default=1e-5)
--tmb-msaf-rank-num
If no confident somatic variant is found, the variant at the specified rank is used (default=4)
The TMB values are output to <output prefix>.tmb.metrics.csv. The file format uses the following CSV column convention, similar to other metric CSV files.
The TMB module also outputs a tmb.trace.csv file that provides detailed information on each variant that was included in the TMB calculation.
When enabling MSAF, the information is output to <output prefix>.tmb.msaf.csv.
DRAGEN includes a dedicated human leukocyte antigen (HLA) genotyper for calling HLA class I and class II alleles with two-field resolution (a.k.a. four-digit resolution). At this resolution, DRAGEN HLA genotyper is able to discern and report HLA alleles based on their protein sequences. For more information on HLA nomenclature, see Nomenclature for factors of the HLA system¹.
Class I HLA typing is enabled by setting the --enable-hla flag to true. Additionally, class II HLA typing is enabled by setting the --hla-enable-class-2 flag to true. For TSO500-solid or TSO500-liquid runs, HLA typing should instead be enabled through the following batch options: --tso500-solid-hla=true and --tso500-liquid-hla=true, respectively. NOTE: class II HLA typing is not supported for TSO500 runs.
The HLA Caller primarily executes the following four steps:
Extract reads mapped to the HLA genes. These are the HLA-A, -B, and -C loci for class I, and the HLA-DQA1, -DQB1, and -DRB1 loci for class II. The human reference version is auto-detected during this step. The human reference builds hg19, hs37d5, and GRCh38 are fully supported; the CHM13 build is enabled but not supported.
Align the extracted HLA reads to a reference set of 9,086 HLA alleles using the DRAGEN map-align processor. Only full-sequence alleles from the IMGT/HLA database (v3.45) that have also been reported on the Allele Frequency Net database were selected in building the default HLA reference resource.
Filter out HLA-specific alignments with sub-maximal alignment scores, and optimize the read distribution using Expectation-Maximization.
Select the most likely genotype for each HLA locus from a short list of candidate alleles using a homozygosity threshold set at 20%.
The reference directory that is supplied at command-line with --ref-dir must contain anchored_hla, a specific subdirectory with HLA-specific reference files. The DRAGEN default reference directories have been updated to contain the anchored_hla subdirectory.
An HLA-specific reference subdirectory can be built by executing
This command will create anchored_hla as a subdirectory of the target {REF-DIR} supplied as an argument to --output-directory as above.
The HLA-specific reference subdirectory can be built at the same time as the primary reference construction. An example command-line for this mode is
An HLA resource file, HLA_resource.v2.fasta.gz, is packaged with DRAGEN. It is located at <INSTALL_PATH>/resources/hla/HLA_resource.v2.fasta.gz
This file is used by default when building the HLA-specific hash-table as above, see Building the HLA-Specific Reference Subdirectory.
An HLA allele reference FASTA file can be used as input to the hash-table building option --ht-hla-reference.
Note: Using custom HLA reference files to generate the HLA-specific reference subdirectory anchored_hla is not recommended, as accuracy cannot be guaranteed.
Custom input FASTA files (which can be zipped or unzipped) must contain only HLA allele sequences, and all allele names must adhere to the HLA star-allele nomenclature¹, where the first character of each allele name indicates the HLA locus, e.g. A*02:01:01:01. Allele names extracted from such a custom input file start at the first character of the allele name (to be preceded by character '>') and end at the last character of the name or until the first delimiter character '-' is reached.
The following is an illustration of a valid HLA reference input file to option --ht-hla-reference:
Custom HLA reference files might require customized memory allocation, which can be specified with an argument to the command-line option --ht-hla-ext-table-alloc.
The HLA component has no additional user-settable command-line options.
Note: this HLA component replaces prior workflows. See the appropriate guide for the DRAGEN software version being used in order to determine valid parameters.
The HLA Caller requires the DRAGEN mapper-aligner to be enabled (via option --enable-map-align=true, or through TSO500-batch options).
The HLA Caller generates a tab-delimited output file for class I and, if enabled, class II alleles. Class I results contain six class I alleles, with two alleles per class I HLA gene (HLA-A, -B and -C), and class II results contain six class II alleles, with two alleles per class II HLA gene (HLA-DQA1, -DQB1, and -DRB1). Homozygous calls show identical alleles at the respective loci.
The genotype output file is <prefix>.hla.tsv, and it is located in the user-specified output directory. In tumor-only mode the output is stored in the <prefix>.hla.tumor.tsv file. In tumor-normal mode, two output genotype files are generated from tumor and normal samples: <prefix>.hla.tumor.tsv and <prefix>.hla.tsv.
In all cases, the genotype file contains a header row with one column for each of the class I and/or class II alleles and a body row with the HLA type of each allele at two-field resolution.
The following is an example output file produced by DRAGEN class I and II HLA typing:
The HLA Caller generates two additional HLA files.
<prefix>.hla_metrics.csv: Contains the number of reads supporting each allele result (individual reads may support multiple alleles), and the total number of HLA reads analyzed.
<prefix>.hla_2field_EM.tsv: Contains the maximal likelihood output from the Expectation-Maximization step: a list of candidate alleles at two-field resolution and corresponding intermediate posterior probability.
Internal checks for sufficient coverage at each HLA locus will trigger a warning message when fewer than 50 reads support any given allele call, or when fewer than 300 HLA reads are detected overall. In both settings, an allele call will still be attempted, but the results may be unreliable.
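The two coverage checks described above can be expressed as a short Python sketch. The function name, the input layout (a mapping from allele name to supporting read count), and the warning texts are assumptions made for illustration. Only the thresholds of 50 and 300 reads come from the text.

```python
def hla_coverage_warnings(allele_support, total_hla_reads,
                          min_allele_reads=50, min_total_reads=300):
    """Return warning strings for low-coverage HLA calls.

    allele_support  : dict mapping allele name -> number of supporting reads
    total_hla_reads : total number of HLA reads analyzed
    """
    warnings = []
    for allele, reads in allele_support.items():
        if reads < min_allele_reads:
            warnings.append(
                f"{allele}: only {reads} supporting reads "
                f"(fewer than {min_allele_reads})"
            )
    if total_hla_reads < min_total_reads:
        warnings.append(
            f"only {total_hla_reads} HLA reads overall "
            f"(fewer than {min_total_reads})"
        )
    return warnings
```

A call is still reported in both cases, so a check like this is best used to flag results for review rather than to discard them.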
An empty genotype call at a given HLA locus is returned when there are no reads supporting that locus. In this scenario, a warning message will indicate missing coverage.
Map-align must be enabled for HLA (see Map-Align DRAGEN Requirement for HLA). As such, tumor-normal paired file inputs from BAM are not currently supported for HLA calling.
No HLA genotype will be returned with single-end DNA read inputs.
By default, DRAGEN only genotypes HLA alleles that have full-nucleotide sequence data in IMGT/HLA v3.45 and that have also been reported on the Allele Frequency Net database. As such, no partial alleles are currently called using the supplied resource reference FASTA file HLA_resource.v2.fasta.
The HLA Caller accepts standard input files in FASTQ or BAM format.
The following example command line uses FASTQ file inputs.
The following example command line uses BAM file inputs (with map-align enabled). NOTE: the --hla-enable-class-2 option enables class II HLA typing.
The following example command line uses tumor-normal paired file inputs from FASTQ.
The following example command line activates HLA typing in a TSO500-solid run from FASTQ input. A TSO500-compatible reference_directory is one which uses the same reference genome as in TSO i.e. hg19.
The following example command line activates HLA typing in a TSO500-liquid run from FASTQ input. A TSO500-compatible reference_directory is one which uses the same reference genome as in TSO i.e. hg19.
¹Marsh SG, et al. Nomenclature for factors of the HLA system, 2010. Tissue Antigens. 2010 75:291-455.
Microsatellites are genomic regions of short DNA motifs that are repeated 5–50 times and are associated with high mutation rates. Microsatellite Instability (MSI) results from deficiencies in the DNA mismatch repair pathway and can be used as a critical biomarker to predict immunotherapy responses in multiple tumor types.
DRAGEN MSI supports running in tumor-normal and tumor-only modes. Tumor-normal is generally expected to generate more accurate results. The tumor-only mode requires a panel of normals, which is generated using the collect-evidence mode.
The following is an example command for tumor-normal mode. Default resource files are available for WES and WGS. Please note that the WES and WGS tumor-normal modes are fully supported and tested. Custom panels may require more extensive validation and possibly require generating a new sites file.
The following is an example command for the tumor-only mode. Please note that the WES and WGS tumor-only modes are not as extensively tested as the tumor-normal modes. The TSO500 panels do not have normal controls, and are only tested and validated in tumor-only mode.
TSO500 Solid microsatellite instability is defined as all samples with "PercentageUnstableSites >= 20". It is generally recommended to use "PercentageUnstableSites" as metric for determining the MSI status. This metric is normalized, and is expected to be more consistent for different pipelines and with different input site files. The exact thresholds for other assays may still depend on the sample noise characteristics (PCR / UMI etc) and may need some empirical calibration.
The following is an example of a microsatellite file:
Default WES and WGS Microsatellite site files can be downloaded here: DRAGEN Software Support Site page
For panels it is recommended to post-process the file by intersecting the WES or WGS sites with the panel of interest. This will avoid using any off-target reads in the MSI analysis.
Custom Microsatellite site files may be required if a small panel is targeted and the default site files do not have sufficient overlapping sites.
Custom Microsatellite site files can be generated by using msisensor-pro [https://github.com/xjtu-omics/msisensor-pro/wiki/Best-Practices].
A subsequent post-processing step is recommended:
only keep microsatellites sites with a repeat unit of length 1
keep sites with 10 - 50bp repeats (a max length of 100bp repeats is supported)
remove any sites containing Ns in the left or right anchors
downsample the remaining sites to contain at least 2000 sites, but no more than 1 million sites (to avoid excessive run time)
Please note that an error occurs if long (>100bp) microsatellite sites are present in the file.
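The post-processing steps above can be sketched in Python as follows. The record layout (dictionaries with `repeat_unit_length`, `repeat_times`, `left_flank`, and `right_flank` keys) is a hypothetical stand-in for a parsed site file, and random downsampling is one possible way to cap the number of sites, not necessarily the recommended one.

```python
import random


def postprocess_sites(sites, min_len=10, max_len=50,
                      min_sites=2000, max_sites=1_000_000, seed=0):
    """Filter microsatellite sites and cap their number.

    Returns (kept_sites, enough_sites), where enough_sites is False when
    fewer than min_sites remain after filtering.
    """
    kept = []
    for site in sites:
        # Keep only homopolymer-style sites (repeat unit of length 1).
        if site["repeat_unit_length"] != 1:
            continue
        # Keep repeats spanning 10 to 50 bp.
        length_bp = site["repeat_unit_length"] * site["repeat_times"]
        if not (min_len <= length_bp <= max_len):
            continue
        # Drop sites whose left or right anchor contains an N.
        flanks = (site["left_flank"] + site["right_flank"]).upper()
        if "N" in flanks:
            continue
        kept.append(site)

    # Downsample reproducibly, preserving the original site order.
    if len(kept) > max_sites:
        rng = random.Random(seed)
        chosen = sorted(rng.sample(range(len(kept)), max_sites))
        kept = [kept[i] for i in chosen]

    return kept, len(kept) >= min_sites
```

The filter on repeat length also removes every site longer than 100 bp, which avoids the error mentioned above.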
Normal reference files can be generated by running collect-evidence mode on a panel of normal samples.
Please note:
The collect-evidence mode MUST be run in DRAGEN germline mode.
The --msi-microsatellites-file and --msi-coverage-threshold settings used in collect-evidence mode must be consistent with the settings used during tumor-only MSI calling.
At least 20 normal samples are required.
The output containing the MSI score (PercentageUnstableSites) is stored in <output prefix>.microsat_output.json.
The "SumDistance" is the sum of the Jensen-Shannon distances of all unstable sites, based on the distances of the T vs N distributions. The "SumDistance" depends on the size of the microsatellite file and is not normalized. In general it is recommended to set MSI thresholds based on "PercentageUnstableSites" rather than "SumDistance".
In TSO500, Solid microsatellite instability is defined as all samples with "PercentageUnstableSites >= 20". The exact thresholds for other assays with different site files and noise characteristics may need some empirical calibration.
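A minimal Python sketch of applying this threshold to the JSON report is shown below. It assumes that `PercentageUnstableSites` is a top-level key of the report, which is not confirmed here. The example report content and the `is_msi_unstable` helper are made up for illustration.

```python
import json


def is_msi_unstable(report, threshold=20.0):
    """Apply the TSO500 Solid rule: PercentageUnstableSites >= threshold."""
    # float() tolerates values stored either as numbers or as strings.
    return float(report["PercentageUnstableSites"]) >= threshold


# Hypothetical report content; real files may contain additional fields.
example = json.loads('{"PercentageUnstableSites": "23.5"}')
print(is_msi_unstable(example))  # True
```

For other assays the `threshold` argument would be replaced by an empirically calibrated value.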
There are two other output files (*_diffs.txt and *.dist) that are useful for debugging.
Here is an example of a *_diffs.txt file:
The fourth column (Assessed) is the coverage filter. It is true for any site with coverage >= 60.
The sixth column (PassFilter) is an internal flag used for the left allele filter. It removes low-quality sites that have no coverage and helps to increase prediction accuracy. It is true when the following conditions are met.
The *.dist file stores the read counts for each repeat length of the microsatellite site.
The coverage of the site can be obtained by summing all counts in the last column.
<output prefix>.microsat_output.json (described above)
<output prefix>.microsat_tumor.dist. This file contains the repeat length array for every microsatellite. Column length_dis is the repeat length array.
<output prefix>.microsat_diffs.txt. This file contains the distance metrics for every microsatellite between tumor/normal or tumor/reference normals. Column Assessed indicates if a site passes the coverage filter (msi-coverage-threshold). Column PassFilter is an internal metric and currently is not used for filtering microsatellites.
The MSI algorithm performs the following steps:
Tabulates tumor and normal counts from the read alignments for each microsatellite site.
Calculates Jensen-Shannon distance of tumor and normal distribution for each microsatellite site (tumor-normal mode), or Jensen-Shannon distance of two normal baseline samples (tumor-only mode).
Determines unstable sites by performing chi-square testing of tumor and normal distribution. Unstable sites have repeat length distributions that are significantly shifted between tumor and normal, as measured by Jensen-Shannon distance (JSD) (tumor-normal mode). In tumor-only mode, JSD is calculated for each pair of tumor and normal reference samples, as well as each pair of normal-normal samples. The two sets of JSD values are then compared to derive a mean distance difference and a p-value calculated from Student's t-test. Microsatellite instability is called if the mean distance difference is greater than or equal to the distance threshold (default 0.1) and the p-value is less than or equal to the p-value threshold (default 0.01).
Produces a report given assessed site count, unstable site count, the percentage of unstable sites in all assessed sites and the sum of the Jensen-Shannon distance of all the unstable sites.
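To make the distance step more concrete, the following Python sketch computes a Jensen-Shannon distance between two repeat-length count arrays and applies the tumor-only decision rule. It uses the textbook definition (square root of the Jensen-Shannon divergence with base-2 logarithms). Whether DRAGEN uses the same logarithm base and normalization is an assumption, and the function names are hypothetical.

```python
import math


def _normalize(counts):
    total = sum(counts)
    if total <= 0:
        raise ValueError("count array must contain at least one read")
    return [c / total for c in counts]


def js_distance(tumor_counts, normal_counts):
    """Jensen-Shannon distance between two repeat-length count arrays.

    Both arrays must be indexed by the same repeat lengths.
    """
    p = _normalize(tumor_counts)
    q = _normalize(normal_counts)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence in bits; zero-probability terms add 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


def tumor_only_site_unstable(mean_distance_difference, p_value,
                             distance_threshold=0.1, p_value_threshold=0.01):
    """Decision rule from the text, using the documented default thresholds."""
    return (mean_distance_difference >= distance_threshold
            and p_value <= p_value_threshold)
```

With base-2 logarithms the distance ranges from 0 for identical distributions to 1 for distributions with no overlap, which makes a fixed threshold such as 0.1 easy to interpret.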
The DRAGEN offering encompasses a multitude of bioinformatics tools and allows for rapid end-to-end analysis of NGS data. The most common workflow is running FASTQ data through the DRAGEN map/align component and streaming directly to the small variant caller. This eliminates the need for a user to construct a workflow from off-the-shelf tools and to deal with interfaces, incompatibility issues, and external library dependencies. In this section, we expand on the capabilities of DRAGEN to ease the workflow needs of common bioinformatics analyses.
Most components in DRAGEN can be enabled or disabled independently. These are controlled by enable-<component> flags on the command line. Based on which components are enabled, DRAGEN will resolve any inconsistencies (if applicable) and construct the desired workflow. Where possible, DRAGEN will run components in parallel to save time and compute costs. Some examples of the top level options are listed here:
enable-map-align
enable-sort
enable-duplicate-marking
enable-variant-caller
enable-cnv
enable-sv
Each component has its own set of options which are used to configure the behavior of the component. These options typically control specific input settings, internal algorithm parameters, or output files and filtering criteria. Refer to the individual component sections for more details. As an example, a different BED file may be provided separately for each caller:
cnv-target-bed
sv-call-regions-bed
vc-target-bed
Additionally, some options are shared amongst callers, such as output-directory and sample-sex. Each variant caller will also produce its own set of VCFs and metric output files.
DRAGEN accepts the following common standard NGS input formats:
FASTQ (fastq-file1 and fastq-file2)
FASTQ List (fastq-list)
BAM (bam-input)
CRAM (cram-input)
Somatic workflows can use tumor equivalent input files (e.g., tumor-bam-input).
When running from unaligned reads, the reads first go through the map/align component to produce alignments which continue downstream to the variant callers. When running from prealigned reads, the user has the choice to re-align with the DRAGEN map/align component or to use the existing alignments from the source input. It is common to run with enable-map-align false if you already have DRAGEN alignments available in BAM or CRAM format.
For most scenarios, simply creating the union of the command line options from the single caller scenarios will work. In this section we outline some best practices for doing so.
Configure the INPUT options
Configure the OUTPUT options
Configure MAP/ALIGN depending on if realignment is desired or not
Configure the VARIANT CALLERs based on the application
Build up the necessary options for each component separately, so that it can be re-used in the final command line.
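The idea of building each option group separately and then taking their union can be sketched in Python as shown below. The `build_command` helper, the file paths, and the particular option groups are hypothetical. The sketch only merges options and detects conflicting values, and it does not validate that the resulting combination is complete or supported by DRAGEN (individual callers typically need further options).

```python
def build_command(*option_groups, executable="dragen"):
    """Merge option groups into one argument list, rejecting conflicts."""
    merged = {}
    for group in option_groups:
        for name, value in group.items():
            if name in merged and merged[name] != value:
                raise ValueError(f"conflicting values for --{name}")
            merged[name] = value

    args = [executable]
    for name, value in merged.items():
        args += [f"--{name}", str(value)]
    return args


# Placeholder paths; replace with real locations.
input_opts = {
    "ref-dir": "/path/to/reference",
    "fastq-file1": "/path/to/sample_R1.fastq.gz",
    "fastq-file2": "/path/to/sample_R2.fastq.gz",
}
output_opts = {"output-directory": "/path/to/output"}
map_align_opts = {"enable-map-align": "true"}
caller_opts = {"enable-variant-caller": "true", "enable-sv": "true"}

command = build_command(input_opts, output_opts, map_align_opts, caller_opts)
print(" ".join(command))
```

Keeping the groups as separate dictionaries makes it straightforward to reuse, for example, the same input and output groups with a different set of callers.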
The following are partial templates that can be used as starting points. Adjust them accordingly for your specific use case.
The following table summarizes the support for different input formats and variant callers.
For brevity, other features and callers are not listed in the table even though they may be supported. Examples include repeat genotyping, SMA, CYP2D6, and ploidy calling. DRAGEN can run all germline callers for WGS analysis in a single command line (CNV + SNV + SV + ...). Similar support also exists for WES analysis, if the component is supported in single caller mode and there is no conflict with the input configurations.
The somatic workflows can be constructed in a similar manner by specifying tumor and normal inputs. The need for potentially two input files (tumor and matched normal) as well as the need for a matched normal SNV VCF for the Somatic CNV caller means extra care has to be taken.
One recommended tumor/normal workflow first starts with running the matched normal through the Germline Workflow.
Run matched normal through Germline workflow (CNV + SNV + SV + ...). This is required to first generate the matched normal SNV VCF. See the Somatic CNV section for more details.
Run tumor and matched normal through Somatic workflow (CNV + SNV + SV + ...)
Optionally, a full tumor/normal analysis can be done in a single execution if both the SNV and CNV modules are enabled, by leveraging the BAF information directly from the small variant caller. See the Somatic CNV section for more details. In brief, this requires the use of --enable-variant-caller true and --cnv-use-somatic-vc-baf true.
The following table lists the various combinations that are supported under the tumor/normal mode of operation.
Running in tumor only mode just requires removing the matched normal input from the INPUT options and configuring each individual caller to run in tumor only mode (for example, CNV uses a population B-allele VCF instead of the matched normal SNP VCF).
The following table lists the combinations that are supported under the tumor only mode of operation.
These modes are for WGS analysis. Similar support also exists for WES analysis, if the mode is supported in single caller mode and there is no conflict in the input configurations. For WES analysis, note that CNV requires a panel of normals regardless of whether it is Tumor Normal or Tumor Only analysis.
The DRAGEN Structural Variant (SV) Caller integrates and extends Manta structural variant calling methods to provide SV and indel calls 50 bases or larger. SVs and indels are called from mapped paired-end sequencing reads. The SV caller is optimized for analysis of diploid germline variation in small sets of individuals and somatic variation in tumor-normal sample pairs.
The SV caller performs the following actions:
Discovers, assembles, and scores large-scale SVs, medium-sized indels, and large insertions within a single efficient workflow.
Combines paired and split-read evidence during SV discovery and scoring to improve accuracy, but does not require split-reads or successful breakpoint assemblies to report a variant in cases where there is strong evidence otherwise.
Scores known SV deletions and insertions from an input VCF file against one or more input samples, either as a standalone procedure or together with standard SV discovery.
Provides scoring models for 1) germline variants in small sets of diploid samples, 2) somatic variants in matched tumor-normal sample pairs, and 3) somatic and germline variants in tumor-only samples.
All SV and indel inferences are output in VCF 4.2 format.
The DRAGEN SV Caller divides the SV and indel discovery process into the following steps.
Reads input files to estimate alignment statistics, including fragment size distribution and chromosome level depth. For more information on the SV Caller input options, see Command Line Options.
Scans the genome or a subset of the genome (specified by the call regions) to build various genome-wide data structures, including a breakend association graph of all SV associated regions. The graph contains edges that connect all regions of the genome that have a possible breakend association. Edges can connect two different regions of the genome to represent evidence of a long-range association, or an edge can connect to a region to capture a local indel/small SV association. These associations are more general than a specific SV hypothesis and multiple breakend candidates might be found on one edge. Typically only one or two candidates are found per edge. Instead of passing an inclusion region BED file, an exclusion region BED file can be passed to DRAGEN so that any SV breakend that overlaps with these regions gets removed from downstream analyses. The excluded regions are removed from the graph building process, but active regions can get extended and present in the excluded regions in the refinement step. This can happen for the active regions that are close to the boundaries of the excluded regions. Hence, the final SV calls may still get extended to these regions.
Analyzes the breakend association graph to discover candidate SVs, then scores discovered candidate SVs and any known SVs from the input. Analysis and scoring are performed as follows.
Infers SV candidates that are associated with the given graph edge.
Assembles the SV breakends.
Merges discovered SV candidates with any known SV candidates included in the input data.
Scores/genotypes and filters each SV candidate under various biological models (currently germline, tumor-normal, and tumor-only).
Outputs scored SVs to VCF.
The DRAGEN SV Caller can discover all identifiable structural variant types in the absence of copy number analysis and large-scale de novo assembly. For more information on detectable types, see Detected Variant Classes.
For each structural variant and indel, the SV Caller attempts to assemble the breakends by gathering nearby evidential reads, and to call SV events to base pair resolution by aligning assemblies against the reference genome. The SV Caller then reports the left-shifted breakend coordinate (per the VCF 4.2 SV reporting guidelines), together with any breakend homology sequence and/or inserted sequence between the breakends. As a result, the reported coordinates of an SV event might not directly match what the read alignments show in an IGV view. The assembly often fails to provide a confident explanation of the data, especially in repeat regions. In such cases, the SV Caller does not provide single-base resolution breakpoints or the associated split read support. Instead, it approximates the event breakpoints, scores the event under the same unified likelihood model as in regular cases, and reports the variant as IMPRECISE.
You can provide known SVs as input for forced genotyping. This known SV input can be scored either standalone or together with the standard SV discovery workflow, in which case the known and discovered SVs are merged.
The sequencing reads provided as input to the SV Caller are expected to be from a paired-end sequencing assay that results in an "innie" orientation between the two reads of each sequence fragment, each presenting a read from the outer edge of the fragment insert inward.
The SV Caller is primarily tested for whole-genome and whole-exome (or other targeted enrichment) sequencing assays on DNA. For these assays the following applications are supported:
Joint analysis of 5 or fewer diploid individuals
Subtractive analysis of a matched tumor-normal sample pair
Analysis of an individual tumor sample
For joint analysis, there is no specific restriction against larger cohorts, but there might be stability or call quality issues.
When performing somatic calling on liquid tumor samples, the matched normal sample might be contaminated with tumor cells. The contamination can substantially reduce somatic variant recall. To account for Tumor-in-Normal (TiN) contamination, you can enable liquid tumor mode. For more information, see Liquid Tumor Calling.
Tumor samples can be analyzed without a matched normal sample. In this case, both germline and somatic variants are scored and reported in the output.
The SV Caller can discover all variation classes that can be explained as novel DNA adjacencies in the genome. Novel DNA adjacencies are classified into the following categories based on the breakend pattern:
Deletions
Insertions
Insertions in the result are divided into the following two subclasses, depending on whether the inserted sequence can be fully assembled: 1) fully-assembled insertions; 2) partially-assembled (inferred) insertions.
Mobile Element Insertions that are not called by the general purpose SV routine are rescued by the MEI specific routine, based on similarity between assembled contigs and known sequences in the MEI catalog described in the file <INSTALL_PATH>/config/sv_mobile_element_sequences.fa.
Tandem Duplications
Inversions
Unclassified breakend pairs corresponding to intra- and inter-chromosomal translocations, or complex structural variants.
The SV Caller cannot directly discover the following variant types:
Dispersed duplications.
Dispersed duplications may be indirectly called as insertions or unclassified breakends.
Most expansion/contraction variants of a reference tandem repeat.
Breakends corresponding to small inversions.
The limiting size is not tested, but in theory, detection falls off below ~200 bases. Micro-inversions might be detected indirectly as combined insertion/deletion variants.
Fully-assembled large insertions.
The maximum fully-assembled insertion size should correspond to approximately twice the read-pair fragment size, but power to fully assemble the insertion should fall off to impractical levels before this size.
The SV Caller does detect and report very large insertions when the breakend signature of such an event is found, even though the inserted sequence cannot be fully assembled.
More general repeat-based limitations exist for all variant types:
Power to assemble variants to breakend resolution falls to zero as breakend repeat length approaches the read size.
Power to detect any breakend falls to (nearly) zero as the breakend repeat length approaches the fragment size.
While the SV Caller classifies certain novel DNA-adjacencies into variant classes, it has a limited ability to infer high-level events resulting from complex rearrangements, so certain calls summarized as deletions, duplications, and insertions might be better described by looking at the full system of breakends and copy number changes associated with a given event.
The DRAGEN SV caller is capable of forced genotyping a set of SVs input from a VCF file. Forced genotyping means that the input SVs are scored and emitted in the output of the SV Caller even if the variant is not supported in the sample data. For example, given a germline analysis, the input variants are processed and written to the output VCF, even if the variant quality falls below the threshold normally required for an SV to be emitted.
Forced genotyping typically enables known SVs to be detected at higher recall than standard SV discovery (particularly for SV discovery on a lower-depth sample). Forced genotyping can also be useful to assert against the presence of an SV allele. For example, you can use forced genotyping to distinguish a confident homozygous reference genotype from a lack of sequencing coverage over the SV locus.
Forced genotyping SVs are processed according to the current SV analysis being run. For example, if a germline analysis is configured by providing one or more normal samples as input, then the input SVs are scored under a germline model.
Forced genotyping alleles are always emitted in the output and might have modified scoring and filtering rules applied compared to SVs only discovered from the sample data.
Forced Genotyping can be run in two modes.
Standalone --- Only the SVs described in an input VCF are scored and emitted.
Integrated --- The standard SV discovery analysis is run and the results are merged with SVs scored from the forced genotyping input. The workflow outputs the union of SVs discovered from the sample data and any additional forced genotyping alleles. The workflow is run whenever the --sv-discovery option is true.
You can specify forced genotyping input using the --sv-forcegt-vcf option. The input must be a VCF of SV alleles. The SV allele types are restricted to insertions, deletions, tandem duplications, and breakends that are not labeled with the INFO/IMPRECISE flag. A VCF record must meet all of the following criteria to be processed as an input SV allele. If any of the criteria are not met, the VCF record is removed from the set of input SVs for forced genotyping. When a forced genotyping VCF is specified on the command line, the SV Caller reports the total number of SV records used as input SVs and the total number of records filtered (if any) due to the following criteria.
Describes an insertion, deletion, tandem duplication, or breakend record.
Cannot contain the INFO/IMPRECISE flag.
Cannot contain multiple ALT alleles.
Has a FILTER value of PASS or unknown (.).
All indels are at least the minimum scored variant size (default is 50).
Cannot repeat an SV allele previously described in the same file.
The REF field cannot be empty or unknown (.).
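The criteria above can be read as a sequence of checks applied to each input record. The following is a minimal sketch under stated assumptions, not the SV Caller's implementation: the `InputSv` class, the type labels, and the `rejection_reason` function are hypothetical names introduced for illustration, and the duplicate check uses a simplified allele key.

```python
from dataclasses import dataclass

# Hypothetical type labels for this sketch; a real input encodes the type in the VCF record.
ALLOWED_TYPES = {"INS", "DEL", "TANDEM_DUP", "BND"}


@dataclass(frozen=True)
class InputSv:
    """Hypothetical, simplified view of one forced genotyping input record."""
    chrom: str
    pos: int
    ref: str
    alts: tuple           # ALT alleles as strings
    filter_value: str     # FILTER column
    sv_type: str          # one of ALLOWED_TYPES
    size: int = 0         # indel length in bases (unused for breakends)
    imprecise: bool = False


def rejection_reason(rec, seen, min_scored_size=50):
    """Return None if the record passes every criterion, else the first failure."""
    if rec.sv_type not in ALLOWED_TYPES:
        return "unsupported SV type"
    if rec.imprecise:
        return "INFO/IMPRECISE flag set"
    if len(rec.alts) != 1:
        return "not exactly one ALT allele"
    if rec.filter_value not in ("PASS", "."):
        return "FILTER is not PASS or ."
    if rec.sv_type in ("INS", "DEL") and rec.size < min_scored_size:
        return "indel below minimum scored variant size"
    if rec.ref in ("", "."):
        return "REF is empty or unknown"
    # Simplified allele identity; a real comparison would also consider INFO/END.
    key = (rec.chrom, rec.pos, rec.ref, rec.alts)
    if key in seen:
        return "repeats an earlier SV allele"
    seen.add(key)
    return None
```

Passing the same `seen` set across all records of one file lets the sketch count accepted and filtered records, which mirrors the totals that the SV Caller reports.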
You must describe insertions using the VCF small indel format, including an ALT entry that describes the complete insertion sequence. Using <INS> as a symbolic alt allele is not accepted. You can describe deletions using either the VCF small indel format or the <DEL> symbolic alt allele. For any variant described using a symbolic alt allele, you must also provide a value for INFO/END. Inversions represented in a single VCF record using the <INV> alt allele are not accepted, but the inversion can be genotyped if converted to a set of breakend records. Each breakpoint is described by a pair of breakend VCF records. If the forced genotyping input contains just one record of the pair and the input conditions above are met, the input is still accepted for forced genotyping, and the distal breakend is inferred from the local record.
You can describe breakpoint insertions for non-insertion SV alleles using one of the following two methods. Both methods correspond to the format used to describe breakpoint insertions in the SV VCF output.
For SVs described using the symbolic ALT format, such as <DEL>, the INFO/SVINSSEQ field is parsed to read the breakpoint insertion sequence.
For smaller indels described directly in the REF and ALT fields, the contents of the ALT field describe the breakend sequence.
Forced genotyping SVs are always output to the standard VCF output of the SV Caller, regardless of whether the forced genotyping is standalone or integrated with SV calling. When the same SV allele is independently discovered from the sample data, only the discovered SV appears in the final output. The discovered SV allele is annotated to indicate the match to a forced genotyping input SV, and the scoring and filtration rules are changed to match.
VCF output records influenced by forced genotyping have the following associated fields.
The flag INFO/NotDiscovered is set for any VCF record that was not independently discovered from the sample data. When forced genotyping is run standalone, all output records contain the flag. When integrated with SV calling, the flag can distinguish the SV alleles that would not have been discovered in a standard SV analysis.
For these variants only, the usual SV Caller ID field generated from the SV Locus graph is not available. Instead, the ID is taken from the corresponding user input VCF. The suffix UserInput${InputVCFRecordNumber} is appended to the ID, separated by an underscore. If your input VCF contains only one of the two VCF records that comprise a breakend variant, then the ID is taken from the mate breakend record and the _Mate suffix is added.
Any output VCF record that corresponds to a forced genotyping input VCF record has the value INFO/UserInputId=${ID} set to reflect the VCF ID value of the input VCF record. The corresponding record might have also been discovered independently from the sample data, in which case it does not have the INFO/NotDiscovered flag set.
Any output VCF record whose SV allele exactly matches a forced genotyping input SV has the flag INFO/KnownSVScoring set. VCF records with this flag are always emitted in the output of the SV Caller. Several filters, such as MaxDepth, are not applied to these records.
When DRAGEN-SV is used in somatic mode (tumor-only or tumor-normal), you can specify a file of paired regions in the BEDPE file format to filter out sequencing/systematic noise and recurrent germline calls. Any variant that overlaps one of the systematic noise paired regions (with a population count of at least 2) and has the same orientation is marked as SystematicNoise in the final VCF file. You can pass this BEDPE file using the command line option --sv-systematic-noise.
The systematic noise BEDPE file is built using VCFs that were generated by the DRAGEN-SV tumor-only pipeline when run on normal samples that do not necessarily match to the subject the tumor sample was taken from. The file might contain several dozen samples.
You can generate systematic noise BEDPE files from normal samples collected using your own library prep, sequencing system, and panels.
To generate a BEDPE file, do as follows.
Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.
Build the BEDPE file from the VCFs using the --sv-build-systematic-noise-vcfs-list option. The option takes a list of the input VCFs from the previous step, with one VCF per line. An example command line is provided below.
You can also build systematic noise BEDPE files in the cloud using the DRAGEN Baseline Builder App on BaseSpace.
The following prebuilt systematic noise files for WGS are available for download on the DRAGEN Software Support Site page. To generate these noise files, we used 100 unrelated normal samples from the 1000 Genomes Project. Each systematic noise file contains a version string that DRAGEN uses to check the compatibility by default and exits early if a wrong systematic noise file is provided.
The systematic noise BEDPE should follow a particular format
The SV Caller applies a diploid scoring model for one or more diploid samples (treated as unrelated), as well as a somatic scoring model when a tumor and matched normal sample pair are given.
The germline scoring model produces diploid genotype probabilities for each candidate structural variant. Most candidates are approximated as independent for scoring purposes and modeled under a bi-allelic, diploid genotype likelihood setting. DRAGEN solves for the posterior probability over possible genotypes given the sequencing data by approximating it as proportional to the product of the prior probability of a genotype and the conditional probability of observing a set of read fragments (post-filtering) given the underlying genotype. DRAGEN treats each read fragment independently and represents the conditional probability of the set of read fragments as the product of the conditional probabilities of the individual read fragments. For each individual read fragment, DRAGEN combines the paired-read and split-read evidence components and approximates their contributions as independent by representing the conditional probability as a product of these components. The paired-read component is weighted by a linear ramp from one to zero that depends on the candidate event type and size, because a tiny event does not significantly affect paired-read mapping status.
The paired-read component is modeled as a function measuring the deviation of the inferred fragment length from the overall distribution.
The split-read component is modeled as a function measuring the correctness of a read alignment to a breakend by multiplying, across all non-gap bases, the probability of observing a certain base call given the corresponding base of the evaluated allele.
Each read fragment may contribute only paired-read support, only split-read support, or both. Where a fragment contributes split-read support, this support may come from either or both reads in the read pair.
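In symbols, the germline model described above can be summarized as follows. The notation is introduced here for illustration only: G is a candidate genotype, D is the set of N post-filtering read fragments f_i, and w is the ramp weight between zero and one. The text does not state how the ramp weight enters the paired-read term, so the exponent form below is an assumption.

```latex
P(G \mid D) \;\propto\; P(G)\,\prod_{i=1}^{N} P(f_i \mid G),
\qquad
P(f_i \mid G) \;\approx\; P_{\mathrm{pair}}(f_i \mid G)^{\,w}\; P_{\mathrm{split}}(f_i \mid G),
\qquad 0 \le w \le 1 .
```

Under this reading, w = 1 gives the paired-read evidence full weight for large events, and w = 0 removes it for tiny events so that only split-read evidence contributes.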
The somatic scoring model is a Bayesian probabilistic model using a tumor-normal joint genotyping approach. It aims to call somatic structural variants in tumors while avoiding germline variants and noisy variants. In this model, the tumor and normal allele frequencies are treated as nonindependent random variables. DRAGEN calculates posterior probabilities for a range of genotype hypotheses, under the assumption that the normal sample conforms to the diploid germline genotype considering homozygous reference, heterozygous, and homozygous states. The tumor sample is a mixture of the germline genotype and, if present, the somatic allele. For the somatic genotype, we consider only two states referring to the absence and presence of the somatic variant in the tumor sample. In cases where the somatic variant is not present, we account for unsystematic independent noise, while assuming an error-free scenario when the somatic variant is considered. To calculate the genotype likelihoods, the model integrates allele frequency likelihoods over the joint tumor and normal allele frequencies and applies modifications to address liquid tumors with Tumor-in-Normal (TiN) contamination. The integration is approximated with a discrete summation. In these calculations, the likelihood for each read to support a given allele is shared with the germline scoring model. The tumor-only somatic scoring model is seen as a special case of the somatic scoring model in the absence of normal data (zero coverage). The posterior probability is converted into a Phred quality score and reported in the VCF output INFO/SOMATICSCORE field.
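The conversion from a posterior probability to a Phred quality score can be sketched as follows. This assumes the standard Phred scaling, Q = -10 log10(1 - P). The text does not describe any rounding or capping that DRAGEN applies, so the function name and the optional `max_q` cap are hypothetical.

```python
import math


def phred_from_posterior(p_somatic, max_q=None):
    """Phred-scale the probability that the somatic call is wrong (1 - posterior)."""
    if not 0.0 <= p_somatic <= 1.0:
        raise ValueError("posterior must be in [0, 1]")
    p_error = 1.0 - p_somatic
    if p_error == 0.0:
        # A posterior of exactly 1 has no finite Phred value; use the cap if given.
        return math.inf if max_q is None else float(max_q)
    q = -10.0 * math.log10(p_error)
    return q if max_q is None else min(q, float(max_q))
```

For example, a posterior of 0.99 corresponds to a score of about 20, and a posterior of 0.999 to a score of about 30.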
When running the SV Caller, the input sequencing reads must be from a standard Illumina paired-end sequencing assay with an FR read pair orientation, where for each sequence fragment, a read proceeds from each end of the fragment inwards. For more information, see DRAGEN SV Caller Capabilities.
The SV Caller is optimized for paired-end libraries where the fragment size is typically larger than the size of both reads. Overlapping read pairs can be used to discover SVs, but might not always be handled optimally. For libraries where the typical fragment size is less than the read length, the SV caller attempts to differentiate reads sequencing into adapter sequence from the variant signal. In such cases, the SV Caller's input quality checks may fail and cause SV analysis to be skipped.
If using the standalone mode, your BAM/CRAM inputs must first be mapped. If you have not mapped and aligned your data yet, you can generate an alignment file.
If running from a mapped and aligned BAM, then the contigs specified in the header must strictly match those in the DRAGEN hashtable specified in the current analysis. Missing or extra contigs will lead to a "Reference genome mismatch" configuration error and the analysis will not proceed. If such an error is observed, it is recommended to regenerate the alignment file with the intended DRAGEN hashtable, or to run with the DRAGEN map/align module enabled.
The SV Caller runs quality checks on the input sequencing reads for each sample to make sure that the input corresponds to a paired read assay with the expected FR orientation, before estimating the fragment size distribution. To check consensus read pair orientation, a subset of high-quality read pairs is sampled. At least 90% of these must have the expected FR orientation for SV analysis to continue. Otherwise, the SV Caller issues a warning, skips any further analysis, and writes empty results to its output files.
The SV Caller can tolerate nonpaired reads in the input, if sufficient paired-end reads exist to estimate the fragment size distribution. To estimate the fragment size distribution, the SV Caller requires at least 100 read pairs that meet the quality requirements of the estimation routine. Both reads of the pair must have a non-zero mapping quality to the same chromosome, must not be filtered or part of a split read mapping, and must not contain indels or soft-clipping. If a sample does not contain a sufficient number of such read pairs, the SV Caller issues a warning, skips any further analysis, and writes empty results to its output files.
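The two thresholds described above can be sketched as a simple pre-check. This is an illustration only: the function name and inputs are hypothetical, and the way DRAGEN samples read pairs is not described in the text.

```python
MIN_FR_FRACTION = 0.90        # from the text: at least 90% of sampled pairs are FR
MIN_QUALIFYING_PAIRS = 100    # from the text: at least 100 usable read pairs


def sv_input_problems(sampled_orientations, qualifying_pairs):
    """Return a list of reasons why SV analysis would be skipped (empty if none).

    sampled_orientations: orientation labels such as "FR", "RF", or "FF"
    for the sampled high-quality read pairs.
    qualifying_pairs: number of read pairs usable for fragment size estimation.
    """
    problems = []
    if sampled_orientations:
        fr_count = sum(1 for o in sampled_orientations if o == "FR")
        fr_fraction = fr_count / len(sampled_orientations)
    else:
        fr_fraction = 0.0
    if fr_fraction < MIN_FR_FRACTION:
        problems.append("fewer than 90% of sampled read pairs are FR")
    if qualifying_pairs < MIN_QUALIFYING_PAIRS:
        problems.append("fewer than 100 qualifying read pairs")
    return problems
```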
The SV Caller disregards any read group labels applied to the input sequences. Each input sample is treated as a separate library with a single fragment size distribution.
In standalone mode, input sequencing reads must be mapped and provided as input in either BAM or CRAM format. Each input file must be coordinate sorted and indexed to produce an htslib-style index in a file named to match the input BAM or CRAM file with an additional .bai, .crai, or .csi file name extension. For more information on standalone mode, see Modes of Operation.
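The index naming convention can be checked before a run with a short script. The sketch below is a hypothetical pre-flight helper, not part of DRAGEN. It only looks for a file whose name is the alignment file name plus one of the three extensions.

```python
from pathlib import Path

INDEX_EXTENSIONS = (".bai", ".crai", ".csi")


def candidate_index_paths(alignment_path):
    """List the index file names that would match the given BAM or CRAM file."""
    path = Path(alignment_path)
    return [path.with_name(path.name + ext) for ext in INDEX_EXTENSIONS]


def find_index(alignment_path):
    """Return the first matching index file that exists, or None."""
    for candidate in candidate_index_paths(alignment_path):
        if candidate.exists():
            return candidate
    return None
```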
At least one BAM or CRAM file must be provided for the normal or tumor sample. A matched tumor-normal sample pair can be provided as well. If multiple input files are provided for the normal sample, each file is treated as a separate sample as part of a joint diploid sample analysis.
In standalone mode, input BAM or CRAM files have the following limitations:
Alignments cannot have an unknown read sequence (SEQ="*").
Alignments cannot contain the "=" character in the SEQ field.
Alignments cannot use the sequence match/mismatch ("="/"X") CIGAR notation.
RG (read group) tags in the alignment records are ignored. Each alignment file is treated as representing one sample.
Alignments with base call quality values greater than 70 are rejected. These values are not supported on the assumption that they indicate an offset error.
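The record-level limitations above can be sketched as a check on the SAM text fields of one alignment. The function name is hypothetical, and the quality check assumes the standard Phred+33 encoding of the SAM QUAL field.

```python
def alignment_problems(seq, cigar, qual, phred_offset=33, max_quality=70):
    """Return the limitations from the list above that one alignment record violates."""
    problems = []
    if seq == "*":
        problems.append("unknown read sequence")
    elif "=" in seq:
        problems.append("'=' character in SEQ")
    if "=" in cigar or "X" in cigar:
        problems.append("match/mismatch CIGAR notation")
    # QUAL of "*" means qualities are absent, so there is nothing to check.
    if qual != "*" and any(ord(c) - phred_offset > max_quality for c in qual):
        problems.append("base call quality greater than 70")
    return problems
```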
The following command-line examples show how to run the DRAGEN map/align pipeline depending on your input type. The map/align pipeline generates an alignment file in the form of a BAM or CRAM file that can then be used in the pipeline.
You need to generate alignment files for all samples that have not already been mapped and aligned. Each sample must have a unique sample identifier. Use the --RGSM option to specify the identifier. For BAM and CRAM input files, the sample identifier is taken from the file, so the --RGSM option is not required.
The following example command maps and aligns a FASTQ file:
The following example command maps and aligns an existing BAM file:
The following example command maps and aligns an existing CRAM file:
The SV Caller can be configured for targeted sequencing inputs, which disables high-depth filters. Exome mode can be directly set to true or false with the command line option --sv-exome. If not directly set, exome mode defaults to false unless you run the SV Caller in integrated mode and there is no more than 50 Gb of sequencing input.
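The default described above can be written as a small decision function. This is a sketch of the stated rule only: the function and parameter names are hypothetical, and the unit of the 50 Gb threshold is taken as written in the text.

```python
def resolve_exome_mode(sv_exome=None, integrated_mode=False, input_gb=0.0):
    """Return the effective exome mode.

    sv_exome: True or False when --sv-exome is set explicitly, None otherwise.
    integrated_mode: True when the SV Caller runs on DRAGEN map/align output.
    input_gb: amount of sequencing input, in the "Gb" unit used by the text.
    """
    if sv_exome is not None:
        return sv_exome                    # an explicit setting always wins
    return integrated_mode and input_gb <= 50
```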
You can use the --sv-somatic-ins-tandup-hotspot-regions-bed ${BEDFILE} option to specify ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. By default, DRAGEN SV automatically selects a reference-specific hotspots BED file from <INSTALL_PATH>/config/sv_somatic_ins_tandup_hotspot_*.bed. The file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with some padding on both sides (300 bp). To disable this feature, enter --sv-enable-somatic-ins-tandup-hotspot-regions false.
In liquid tumor mode, liquid tumors refer to hematological cancers, such as leukemia. In tumor-normal analysis, DRAGEN accounts for Tumor-in-Normal (TiN) contamination when liquid tumor mode is enabled. Liquid tumor mode allows a nonzero variant allele frequency in the matched normal when calculating the posterior probability of the somatic state. If the contamination is not accounted for, it can severely reduce sensitivity by suppressing true somatic variants.
Use the following two options to control liquid tumor mode behavior.
--sv-enable-liquid-tumor-mode --- Enable liquid tumor mode. Liquid tumor mode is disabled by default.
--sv-tin-contam-tolerance --- Set the TiN contamination tolerance level. DRAGEN calls variants in the presence of TiN contamination up to a specified maximum tolerance level. You can enter any value between 0 and 1. The default maximum TiN contamination tolerance is 0.15. If using the default value, somatic variants are expected to be observed in the normal sample with allele frequencies up to 15% of the corresponding allele in the tumor sample.
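As a worked example of the tolerance, the sketch below computes the highest allele frequency in the normal sample that is still consistent with a somatic call, given the allele frequency in the tumor. The function name is hypothetical, and the simple product is an illustration of the "up to 15% of the corresponding allele in the tumor sample" statement, not the scoring model itself.

```python
def max_normal_allele_frequency(tumor_af, tin_tolerance=0.15):
    """Upper bound on the normal-sample allele frequency tolerated as TiN contamination."""
    if not 0.0 <= tin_tolerance <= 1.0:
        raise ValueError("tolerance must be between 0 and 1")
    if not 0.0 <= tumor_af <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    return tumor_af * tin_tolerance
```

With the default tolerance, a variant at 40% allele frequency in the tumor can appear at up to about 6% in the matched normal.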
The following command line options are supported for the Structural Variant Caller.
The following are the top-level options that are shared with the DRAGEN Host Software to control the SV pipeline. You can use BAM and CRAM files as input. Alternatively, if using read mapping with the SV calling in a single run, you can use all of the DRAGEN input options, including FASTQ, BAM, and CRAM files.
--cram-input --- The CRAM file to be processed.
--tumor-cram-input --- If performing tumor-normal or tumor-only analysis, the tumor CRAM file to be processed.
--fastq-file1, --fastq-file2, --fastq-list --- Input FASTQ files or a list of files to be processed.
--tumor-fastq1, --tumor-fastq2, --tumor-fastq-list --- Input tumor FASTQ file or list of files to be processed.
--enable-map-align --- Enables DRAGEN map/align. The default is true, so all input reads are remapped and aligned unless the option is set to false.
--output-directory --- Output directory where all results are stored.
--output-file-prefix --- Output file prefix that will be prepended to all result file names.
--ref-dir --- The DRAGEN reference genome hashtable directory. For more information about the reference genome hashtable, see Prepare a Reference Genome.
--bam-input --- The BAM file to be processed.
--tumor-bam-input --- If performing tumor-normal or tumor-only analysis, the tumor BAM file to be processed.
--enable-sv --- Enable or disable the structural variant caller. The default is false.
--sv-call-regions-bed --- Specifies a BED file containing the set of regions to call. Optionally, you can compress the file in gzip or bgzip format.
--sv-exclusion-bed --- Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.
--sv-region --- Limit the analysis to a specified region of the genome for debugging purposes. This option can be specified multiple times to build a list of regions. The value must be in the format "chr:startPos-endPos".
--sv-exome --- Set to true to configure the variant caller for targeted sequencing inputs, which includes disabling high depth filters. In integrated mode, the default is to autodetect targeted sequencing input, and in standalone mode the default is false.
--sv-output-contigs --- Set to true to have assembled contig sequences output in a VCF file. The default is false.
--sv-forcegt-vcf --- Specify a VCF of structural variants for forced genotyping. The variants are scored and emitted in the output VCF even if not found in the sample data. The variants are merged with any additional variants discovered directly from the sample data.
--sv-discovery --- Enable SV discovery. This flag can be set to false only when --sv-forcegt-vcf is used. When set to false, SV discovery is disabled and only the forced genotyping input variants are processed. The default is true.
--sv-use-overlap-pair-evidence --- Allow overlapping read pairs to be considered as evidence. The default is false.
--sv-somatic-ins-tandup-hotspot-regions-bed --- Specify a BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. By default, DRAGEN SV automatically selects a reference-specific hotspots BED file from <INSTALL_PATH>/config/sv_somatic_ins_tandup_hotspot_*.bed.
--sv-enable-somatic-ins-tandup-hotspot-regions --- Enable or disable the ITD hotspot region input. The default is true in somatic variant analysis.
--sv-enable-liquid-tumor-mode --- Enable liquid tumor mode. See Liquid Tumor Calling.
--sv-tin-contam-tolerance --- Set the Tumor-in-Normal (TiN) contamination tolerance level. See Liquid Tumor Calling for more information.
--sv-systematic-noise --- Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bzip compressed). For more information, see Systematic Noise Filtering.
--sv-detect-systematic-noise --- Set to true to generate VCF output per normal sample. For more information, see Systematic Noise Filtering.
--sv-build-systematic-noise-vcfs-list --- List of input VCFs from previous step. Enter one VCF per line. For more information, see Systematic Noise Filtering.
--sv-min-edge-observations --- Remove all edges from the graph with fewer than this many observations. The default value is set to 3.
--sv-min-candidate-spanning-count --- Run SV caller and report all large SVs with at least this many spanning support observations. The default value is set to 3.
--sv-min-candidate-variant-size --- Run SV caller and report all SVs/indels at or above this size. The default value is set to 10.
--sv-min-scored-variant-size --- After candidate identification, only score and report SVs/indels at or above this size. The default value is set to 50. This parameter doesn't affect the somatic hotspot region.
--sv-hotspot-min-scored-variant-size --- After candidate identification, only score and report SVs/indels at or above this size inside the SV hotspot region, which includes FLT3, ARHGEF7, and KMT2A genes by default. The default value is set to 25.
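The interaction of the three size options can be sketched as follows, using the default values listed above. The function name is hypothetical, and the sketch only covers the size thresholds, not any other scoring or filtering rule.

```python
def is_scored(size, in_hotspot=False,
              min_candidate_size=10, min_scored_size=50, hotspot_min_scored_size=25):
    """Return True if an SV/indel of the given size is scored and reported.

    The defaults mirror --sv-min-candidate-variant-size, --sv-min-scored-variant-size,
    and --sv-hotspot-min-scored-variant-size.
    """
    if size < min_candidate_size:
        return False                       # never becomes a candidate
    threshold = hotspot_min_scored_size if in_hotspot else min_scored_size
    return size >= threshold
```

For example, a 30-base indel is scored inside a hotspot region but not elsewhere.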
Structural Variant calling can run in the following modes:
Standalone --- Uses mapped BAM/CRAM input files. If you have not mapped and aligned your data yet, see Input Requirements. This mode requires the following options:
--enable-map-align false
--enable-sv true
Integrated --- Automatically runs on the output of the DRAGEN mapper/aligner. This mode requires the following options:
--enable-map-align true
--enable-sv true
--enable-map-align-output true
--output-format bam
You can also enable Structural Variant calling with any other caller.
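The option sets required by the two modes can be assembled programmatically, for example when launching runs from a script. The sketch below uses only option names that appear in this section. The executable name `dragen`, the helper function names, and all paths are assumptions for illustration.

```python
def sv_mode_options(mode):
    """Return the options that each SV calling mode requires, as listed above."""
    if mode == "standalone":
        return ["--enable-map-align", "false", "--enable-sv", "true"]
    if mode == "integrated":
        return ["--enable-map-align", "true", "--enable-sv", "true",
                "--enable-map-align-output", "true", "--output-format", "bam"]
    raise ValueError(f"unknown mode: {mode}")


def build_sv_command(mode, ref_dir, output_directory, output_file_prefix, input_options):
    """Assemble an argument list; "dragen" as the executable name is an assumption."""
    command = ["dragen",
               "--ref-dir", ref_dir,
               "--output-directory", output_directory,
               "--output-file-prefix", output_file_prefix]
    command += list(input_options)         # for example ["--bam-input", "sample.bam"]
    command += sv_mode_options(mode)
    return command
```

A call such as `build_sv_command("standalone", "/ref/hash", "/out", "sample1", ["--bam-input", "sample1.bam"])` returns a list that can be passed to `subprocess.run`. The paths shown here are placeholders.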
The following is an example command line for Integrated mode:
The following is an example command line for joint diploid calling in standalone mode:
The structural variants VCF output file is available in the output directory. The file is named <output-file-prefix>.sv.vcf.gz. The contents of the file depend on the type of analysis.
For each major analysis category (germline, tumor-normal, and tumor-only), the VCF output file reflects the variant calls made under the variant calling mode that corresponds to the given analysis type.
VCF output follows the VCF 4.2 specification for describing structural variants. It uses standard field names wherever possible. All custom fields are described in the VCF header. The following sections provide information on the variant representation details and the primary VCF field values.
Sample names in the VCF output are extracted from the first read group (@RG) record found in the header of each input alignment file. If no sample name is found, a default (SAMPLE1, SAMPLE2, etc.) label is used instead.
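The sample naming rule can be sketched against the text of a SAM header. The function name is hypothetical, and the sketch assumes that the sample name is read from the standard SM tag of the first @RG line.

```python
def sample_name_from_header(header_text, sample_index=1):
    """Return the SM value of the first @RG record, or a default SAMPLE<n> label."""
    for line in header_text.splitlines():
        if line.startswith("@RG"):
            for field in line.split("\t")[1:]:
                if field.startswith("SM:") and len(field) > 3:
                    return field[3:]
            break                          # only the first @RG record is consulted
    return f"SAMPLE{sample_index}"
```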
All variants are reported in the VCF using symbolic alleles unless they are classified as a small indel, in which case full sequences are provided for the VCF REF and ALT allele fields. A variant is classified as a small indel if all of the following criteria are met:
The variant can be entirely expressed as a combination of inserted and deleted sequences.
The deletion or insertion length is less than 1000.
The variant breakends and/or the inserted sequence are not imprecise.
The variant has not been converted from a deletion to intra-chromosomal breakends by the depth-based SV classification routine.
When VCF records are output in the small indel format, they also include the CIGAR INFO tag describing the combined insertion and deletion event.
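The small indel criteria listed above can be combined into a single predicate. This is a sketch only: the function and parameter names are hypothetical, and each argument stands for one criterion from the list.

```python
def is_small_indel(ins_del_only, deleted_length, inserted_length,
                   imprecise, converted_to_breakends, max_length=1000):
    """Return True if a variant is reported with full REF and ALT sequences.

    ins_del_only: the variant is fully expressed as inserted and deleted sequence.
    imprecise: the breakends and/or the inserted sequence are imprecise.
    converted_to_breakends: the depth-based classification changed a deletion
    to intra-chromosomal breakends.
    """
    return (ins_del_only
            and deleted_length < max_length
            and inserted_length < max_length
            and not imprecise
            and not converted_to_breakends)
```

Any variant for which the predicate is False is reported using a symbolic allele.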
Large insertions are reported in some cases even when the insert sequence cannot be fully assembled. In this case, the SV Caller reports the insertion using the <INS> symbolic allele and includes the special INFO fields LEFT_SVINSSEQ and RIGHT_SVINSSEQ to describe the assembled left and right ends of the insert sequence. The following is an example of such a record from the joint diploid analysis of NA12878, NA12891 and NA12892 mapped to hg19:
The SV caller can also represent tandem duplications as insertions. This representation creates ambiguity in how the variants are presented in the VCF output, especially for small tandem duplications. The representation can lead to complications, such as unrecognized call duplication.
To better normalize the SV Caller output, so that the same variant type is not represented in two different VCF formats, small tandem duplications (< 1000 bases) are converted to insertions in the VCF output. Insertions converted from such tandem duplications have a formatting similar to incomplete insertions, using the symbolic allele <INS> for the ALT field. The following example shows an insertion that was converted from a tandem duplication during this normalization process.
Converted insertions include copies of certain output fields. The fields appear the same as in a tandem duplication record. For example, INFO/DUPSVINSSEQ provides a copy of the breakpoint insertion value computed for the duplication. In the context of a duplication, the breakpoint insertion value would normally be written to INFO/SVINSSEQ. The following example shows a converted insertion with a breakpoint insertion value:
For more information about copied INFO
fields, see VCF INFO Fields. All INFO
fields use the same DUP
prefix.
Inversions are reported as a set of breakends. For example, given a simple reciprocal inversion, four breakends are reported, sharing the same EVENT INFO
tag. The following are example breakend records representing a simple reciprocal inversion:
In the germline calling model, when SV candidates are discovered from the sample data and have sufficient paired and split read evidence to be reported in the output, the SV caller applies additional depth-based tests to more accurately classify certain SV candidate types. Candidate breakpoints that are consistent with a deletion are tested for the lower read depth that is expected inside the deleted region. Candidate breakpoints consistent with a tandem duplication are tested for the higher read depth expected in the duplicated region. Candidate SV calls that fail the depth-based tests are still reported in the output, but changed to intrachromosomal breakends. Candidate SV calls that pass continue to be reported in the standard deletion and tandem duplication output formats.
SVs frequently include a small sequence insertion at the breakpoint. Breakpoint insertions are represented differently depending on the SV type. The INFO/SVINSSEQ
field in the VCF output provides the most general description of breakpoint insertions by describing the insertion sequence itself. The corresponding INFO/SVINSLEN
field describes the length of the insertion sequence. For example, the following VCF record describes a large (~8.8 kb) deletion, which includes a single base insertion (C) between the left and right deletion breakends.
The INFO/SVINSSEQ
field is also used to describe breakpoint insertions for tandem duplication and breakend records. The field can also be used to describe the insertion sequence of a large SV insertion.
Breakpoint insertions are represented differently in the VCF small indel format. The SV caller represents small deletions and insertions using the VCF small indel format instead of symbolic ALT alleles. Any breakpoint insertion that occurs in the VCF small indel format is represented as part of the VCF ALT field. See Small Indel Representation for the conditions under which this format is used for SVs.
In the following small indel format example, the VCF record describes a 57 base deletion that includes a single base insertion (A) between the left and right deletion breakends:
Breakend records include an additional encoding of breakpoint insertion sequence, as described in the VCF specification for the breakend ALT
field. The SV caller also provides the information to the INFO/SVINSSEQ
field for consistency with other SV record types.
The following example shows a breakend connecting a region of chromosomes 1 and 12 in the sample with a breakend insertion sequence of CA
between the two breakends. The insertion sequence is described in both the ALT
and INFO/SVINSSEQ
fields.
SV Breakpoint Insertion Orientation
The breakpoint insertion sequence is always provided with respect to the strand of the current SV record. Some breakend records have inverted orientation. For inverted orientations, the pair of breakend records contains an insertion sequence that is reverse complemented compared to the mated record.
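As a rough illustration of the orientation rule above, the following Python sketch reverse complements an insertion sequence, which is how the mate of an inverted breakend would be expected to report it. The helper name is an assumption made for this example and is not part of DRAGEN.

```python
# Illustrative helper (not part of DRAGEN): reverse complement a
# breakpoint insertion sequence, as expected for the mate record of a
# breakend pair with inverted orientation.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# If one record of an inverted pair reports the insertion CA,
# the mated record would be expected to report TG.
print(reverse_complement("CA"))
```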
The following breakend pair example demonstrates an inverted orientation.
Each VCF record output by the SV caller is shifted to the left-most position of the exact homology range of the breakpoint. The exact homology range of the breakpoint is the continuous range of positions over which the SV could be represented while still describing the same SV haplotype. The exact homology range is described in the VCF output with the INFO/HOMSEQ
field, which describes the sequence of the exact homology range and the corresponding INFO/HOMLEN
field, which describes the length of the range.
The following example shows a 62 base deletion with an 11 base breakend homology region. Without left-shifting, the SV has an equivalent representation anywhere from position 39497639 to 39497650.
The following examples illustrate simplified exact breakend homology. The example displays one three base deletion and another three base insertion. In both the insertion and deletion, the variant is left-shifted, so that the corresponding VCF record position is 2.
Deletion
Reference: GTCAGCGA
Variant: GT---CGA
Insertion
Reference: GT---CAG
Variant: GTCGGCAG
In both the insertion and deletion, there is a single base of exact breakend homology C
, so that the same variant can be represented one base to the right.
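The left-shifting and exact homology described above can be sketched in a few lines of Python for the deletion case. This is a minimal illustration under simplifying assumptions (a plain string reference, a precise deletion, 0-based input coordinates); the function name is hypothetical and the code is not the SV caller's implementation.

```python
def deletion_homology(ref: str, start: int, length: int):
    """Left-shift a deletion of ref[start:start + length].

    `start` is the 0-based index of the first deleted base.
    Returns (vcf_pos, homseq), where vcf_pos is the 1-based position of
    the base preceding the left-shifted deletion and homseq is the exact
    homology sequence (how far the deletion could slide to the right).
    """
    # Slide left while the base entering the deletion on the left equals
    # the base leaving it on the right (same resulting haplotype).
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    # Count how many bases the left-shifted deletion could slide right.
    hom = 0
    while (start + hom + length < len(ref)
           and ref[start + hom] == ref[start + hom + length]):
        hom += 1
    return start, ref[start:start + hom]


# Deletion example from the text: GTCAGCGA -> GT---CGA
print(deletion_homology("GTCAGCGA", 2, 3))
```

Starting from the right-shifted representation (`start=3`) gives the same result, because both describe the same haplotype.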
Germline
The following table lists the VCF FILTER fields applied to germline VCF output.
The following table lists the VCF FILTER fields applied to tumor-normal somatic VCF output.
The following table lists the VCF FILTER fields applied to tumor-only VCF output.
There are two levels of VCF filters: record level (FILTER
) and sample level (FORMAT/FT
). Most record-level filters are independent of those at the sample-level. However, in a germline analysis, if none of the samples pass all sample-level filters, the SampleFT
record-level filter is applied.
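The relationship between the two filter levels in a germline analysis can be sketched as follows. This is only an illustration of the rule stated above: the function name and the placeholder sample-level filter `LowQ` are assumptions for the example, and the real caller applies many other filters.

```python
def apply_sample_ft(record_filters, sample_ft_values):
    """Add the record-level SampleFT filter when no sample passes.

    record_filters: record-level FILTER values (e.g. ["PASS"]).
    sample_ft_values: one FORMAT/FT value per sample.
    """
    filters = [f for f in record_filters if f != "PASS"]
    no_sample_passes = not any(ft == "PASS" for ft in sample_ft_values)
    if no_sample_passes and "SampleFT" not in filters:
        filters.append("SampleFT")
    return filters or ["PASS"]


# One sample passes, so the record-level filter is unchanged.
print(apply_sample_ft(["PASS"], ["PASS", "LowQ"]))
# No sample passes all sample-level filters, so SampleFT is applied.
print(apply_sample_ft(["PASS"], ["LowQ", "LowQ"]))
```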
Some structural variants reported in the VCF, such as translocations, represent a single novel sequence junction in the sample. The INFO/EVENT
field indicates that two or more such junctions are hypothesized to occur together as part of a single variant event. All individual variant records belonging to the same event share the same INFO/EVENT
string. Note that although such an inference could be applied after SV calling by analyzing the relative distance and orientation of the called variant breakpoints, the SV Caller incorporates this event mechanism into the calling process to increase sensitivity towards such larger-scale events. Given that at least one junction in the event has already passed standard variant candidacy thresholds, sensitivity is improved by lowering the evidence thresholds for additional junctions which occur in a pattern consistent with a multijunction event (such as a reciprocal translocation pair).
Although this mechanism could generalize to events including an arbitrary number of junctions, it is currently limited to two. Thus, at present it is most useful for identifying and improving sensitivity towards reciprocal translocation pairs.
Because some evidential read pairs can provide both PR and SR support, the VF field is defined as an additional field that reports the number of sequence fragments (read pairs) that strongly support the REF or ALT allele, in the listed order. VF facilitates an unbiased calculation of the Variant Allele Fraction (VAF), where VAF = VF_ALT/(VF_ALT+VF_REF).
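The VAF formula can be applied directly to the two VF counts. The short Python sketch below shows the arithmetic; the function name and the choice to return `None` when there is no supporting evidence are assumptions made for this example.

```python
from typing import Optional


def variant_allele_fraction(vf_ref: int, vf_alt: int) -> Optional[float]:
    """Compute VAF = VF_ALT / (VF_ALT + VF_REF).

    Returns None when no fragment supports either allele, because the
    fraction is undefined in that case (an assumption of this sketch).
    """
    total = vf_ref + vf_alt
    if total == 0:
        return None
    return vf_alt / total


# Example: VF listed as REF,ALT = 30,10
print(variant_allele_fraction(30, 10))  # 0.25
```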
The VCF ID
, or identifier, field can be used for annotation, or in the case of BND
(breakend) records for translocations, the ID
value is used to link breakend mates or partners. The following is an example of a VCF ID
field from the SV caller:
The value provided in the ID
field reflects the SV association graph edge(s) from which the SV or indel was discovered. The value is guaranteed to be unique within any single VCF output file produced by the SV Caller. These values are therefore used to link associated breakend records using the standard VCF MATEID
key. The exact structure of this identifier may change in the future. You can use the entire value as a unique key, but parsing the key could lead to incompatibility with future DRAGEN versions. See the DRAGEN Software Support Site for information on the latest version of DRAGEN.
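Because the identifier should be treated as an opaque key, breakend mates can be linked by comparing whole ID and MATEID values without parsing them. The following Python sketch shows one way to do this on plain VCF text lines; the function name and the record IDs `bnd_A` and `bnd_B` are placeholders invented for the example and do not reflect the real identifier format.

```python
def link_mates(vcf_lines):
    """Map each breakend ID to its mate ID using INFO/MATEID.

    IDs are used as opaque keys and are never split or parsed.
    Only links whose mate is present in the same input are returned.
    """
    mate_of = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        rec_id, info = fields[2], fields[7]
        info_map = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        if "MATEID" in info_map:
            mate_of[rec_id] = info_map["MATEID"]
    return {rid: mate for rid, mate in mate_of.items() if mate in mate_of}


example = [
    "##fileformat=VCFv4.2",
    "chr1\t100\tbnd_A\tG\tG]chr12:500]\t.\tPASS\tSVTYPE=BND;MATEID=bnd_B",
    "chr12\t500\tbnd_B\tT\tT]chr1:100]\t.\tPASS\tSVTYPE=BND;MATEID=bnd_A",
]
print(link_mates(example))
```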
It can sometimes be convenient to express structural variants in BEDPE format. For such applications, DRAGEN recommends the script vcfToBedpe available on GitHub. The repository is forked from @hall-lab with modifications to support VCF 4.2 SV format.
BEDPE format greatly reduces structural variant information compared to the SV Caller VCF output. In particular, breakend orientation, breakend homology, and insertion sequence are lost, in addition to the ability to define fields for locus and sample specific information. For this reason, Illumina only recommends BEDPE as a temporary output for applications that require it.
The DRAGEN Indel Re-aligner is a consensus-based re-alignment step, independent from other DRAGEN callers and pipelines. Re-aligned reads are reflected in the output BAM file, and their original alignment is described in an OA tag. The implementation is similar to the Indel Re-aligner tool that was found in GATK3. The tool is designed to reduce false positive SNPs by considering evidence of nearby indels.
The pipeline consists of two concurrent steps: interval creation and re-alignment. The interval creation step identifies genomic intervals for which there is evidence of insertions or deletions in the CIGARs of properly paired (if paired) reads aligned with positive MAPQ. To output these intervals as a text file, use the command line argument --ir-write-intervals-file=true
. Each line will describe a genomic interval as chrom:start-end, or chrom:start for intervals of length one. The start and end positions are both inclusive and 1-based. The intervals file will be written to the DRAGEN output directory, with the suffix realign-intervals.txt
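A line of the intervals file can be read with a few lines of Python. This is a minimal sketch based only on the format described above (`chrom:start-end` or `chrom:start`, 1-based and inclusive); the function name is hypothetical.

```python
def parse_interval(line: str):
    """Parse 'chrom:start-end' or 'chrom:start' into (chrom, start, end).

    Coordinates stay 1-based and inclusive, as in the file.
    The last colon is used as the separator so that contig names that
    themselves contain colons are kept intact.
    """
    chrom, _, span = line.strip().rpartition(":")
    start_text, _, end_text = span.partition("-")
    start = int(start_text)
    end = int(end_text) if end_text else start
    return chrom, start, end


chrom, start, end = parse_interval("chr1:100-105")
print(chrom, start, end, "length:", end - start + 1)
```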
For each genomic interval, the re-alignment step groups all aligned reads that intersect the interval. If there are more than ir-max-num-reads
reads that intersect the interval, it is skipped. The following reads are then discarded from the re-alignment analysis:
Non-primary aligned reads.
Reads whose mapping quality is zero.
Paired end reads that mapped to different contigs.
Paired end reads that mapped to the same contig with start positions more than ir-max-distance-between-mates
apart.
Reads that have not been discarded are candidates for re-alignment. If there are more than ir-max-num-candidates
candidates, the interval is skipped. A consensus read is generated from each re-alignment candidate that has a single indel that is not the first or last CIGAR operation, excluding clip operations. If there are more than ir-max-number-consensus
consensus reads, the interval is skipped. Each re-alignment candidate is then scored against each consensus to determine the winning consensus. If the combined score for the interval against the winning consensus is better than the score against the reference by a difference of at least ir-realignment-threshold
, the read's start position, CIGAR, and NM tag are updated to reflect the re-alignment. The score is a Hamming distance weighted by base qualities. OA tags that describe the original alignment are added to any re-aligned reads. Mate positions of reads whose mate was re-aligned are updated as well.
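The scoring idea can be illustrated with a simple quality-weighted Hamming distance. The sketch below assumes that each mismatching base contributes its base quality to the score and that a lower score is better; the exact weighting and score combination used by DRAGEN are not documented here, and both function names are hypothetical.

```python
def weighted_hamming(read: str, quals, target: str) -> int:
    """Sum the base qualities at positions where read and target differ."""
    if not (len(read) == len(quals) == len(target)):
        raise ValueError("read, quals and target must have equal length")
    return sum(q for base, q, ref in zip(read, quals, target) if base != ref)


def should_realign(score_vs_reference: int, score_vs_consensus: int,
                   threshold: int) -> bool:
    """Re-align only if the consensus beats the reference by >= threshold."""
    return score_vs_reference - score_vs_consensus >= threshold


# One mismatch at the third base (quality 10).
print(weighted_hamming("ACGT", [30, 20, 10, 40], "ACTT"))
print(should_realign(score_vs_reference=50, score_vs_consensus=10, threshold=30))
```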
When the re-alignment step is complete, a summary is printed to standard output. It describes the number of intervals found, the sum of the lengths of all intervals, the number of reads that intersected intervals, the number of reads that were re-aligned, and the number of reads that were skipped due to memory constraints. Skipped reads are documented in the DRAGEN log. Skipping may happen in regions with very deep coverage.
The DRAGEN Indel Re-aligner is designed to improve the quality of the DRAGEN BAM output for downstream analysis. The DRAGEN small variant caller pipeline does not read the output BAM, and has its own internal haplotype assembly step, which usually recovers most of the artifacts found during Indel Re-alignment. Limited testing has shown that there may be a small improvement in DRAGEN small variant calls when Indel Re-alignment is enabled. However, Indel Re-alignment slows down a DRAGEN Map/Align + VC run by roughly a factor of two. For that reason, enabling Indel Re-alignment with the DRAGEN VC is not recommended, and it is not enabled by default.
The Indel Re-alignment pipeline cannot run with:
The UMI pipeline.
The Methylation pipelines.
--qc-coverage-ignore-overlaps=true
.
SA tag generation (--generate-sa-tags=true
).
The Expansion Hunter pipeline.
CheckFingerprint is broadly based on Picard CheckFingerprint. CheckFingerprint outputs a LOD score that indicates whether the genetic data in two files comes from the same individual.
If the LOD score is positive, the two samples come from the same individual. Otherwise, the two samples come from different individuals.
In general, the sign of the LOD in the summary file should be consistent with the Picard CheckFingerprint summary file, but the exact values may differ.
Validation was done on whole-genome sequencing (WGS) data and on a mix of WGS samples and whole-exome sequencing data.
The checks can run in one of two modes:
Read comparison mode. Aligned reads are compared with the expected VCF
VCF comparison mode. Output VCF is compared with the expected VCF
To enable the CheckFingerprint module, the following command-line options are required.
--enable-checkfingerprint true
--checkfingerprint-expected-vcf [path_to_expected_sample_vcf]
Read comparison mode is enabled by default. Read comparison mode is recommended for small datasets or whole-exome sequencing data.
To switch to VCF comparison mode, use the following options:
--checkfingerprint-enable-vcf-comparison true
--enable-variant-caller true
VCF comparison mode is recommended for larger samples, such as whole-genome sequencing data with an average coverage of 30x or whole-exome sequencing data.
Read mode. Input BAM/FASTQ/CRAM, examine the individual reads in input sample, and compare individual reads with expected VCF file.
VCF mode. Input BAM/FASTQ/CRAM, generate a VCF file first, and compare the VCF file with expected VCF file
VCF mode. Input an observed VCF file, and compare observed VCF file with expected VCF file
The input files used by DRAGEN CheckFingerprint are: a) haplotype map (configuration files), b) FASTQ/BAM/CRAM (user input) or observed VCF file (user input), c) expected VCF file (user input).
a) Haplotype Map
Haplotype maps for hg19, hg38 and chm13 are files that are packaged with DRAGEN and automatically selected by the software. The haplotype map is a set of SNPs grouped into haplotype blocks (also known as linkage disequilibrium blocks). The SNPs in the haplotype map are used for fingerprinting.
The following columns are of interest:
b) Sample Input
Samples are input from BAM/CRAM/FASTQ or observed VCF files.
The following command-line example uses FASTQ input:
The following command-line example uses VCF input:
c) Expected VCF Input
VCF output from DRAGEN is recommended. The file can contain multiple samples. Multiple-sample VCFs can be combined and provided with --checkfingerprint-expected-vcf
CheckFingerprint calculates the LOD between the input sample (BAM/CRAM/FASTQ or VCF) and each sample in the expected VCF file.
There are two main output files:
[output-file-prefix].CheckFingerprint.summary.txt : contains LOD scores between input sample and expected sample
[output-file-prefix].CheckFingerprint.detail.txt : contains LOD scores between individual SNPs.
CheckFingerprint.detail.txt example
CheckFingerprint.summary.txt example
LOD_EXPECTED_SAMPLE is the LOD score between the two samples.
CheckFingerprint calculates the LOD score to identify whether two samples are from the same individual. A positive value indicates that the two samples are from the same individual. A negative value indicates that the two samples do not match. LOD is on a logarithmic scale (base 10). Thus, a LOD of 4 indicates that it is 10,000 times more likely that the data matches the genotypes than not. A score close to 0 is inconclusive and can result from low coverage or missing informative genotypes. The identity check takes advantage of the haplotype blocks defined in the configuration file (hg38_nochr.map, hg19_nochr.map). Checking SNPs in haplotype blocks can improve the statistical power of identity detection.
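The base-10 scale can be made concrete with a small Python sketch that converts a LOD score to a likelihood ratio and labels it. The function names and the width of the "inconclusive" band around zero are assumptions chosen for illustration; DRAGEN does not document a specific cutoff in this section.

```python
def lod_to_likelihood_ratio(lod: float) -> float:
    """Convert a base-10 LOD score to a likelihood ratio."""
    return 10.0 ** lod


def interpret_lod(lod: float, inconclusive_band: float = 1.0) -> str:
    """Label a LOD score; the band around zero is a hypothetical cutoff."""
    if lod > inconclusive_band:
        return "match"
    if lod < -inconclusive_band:
        return "mismatch"
    return "inconclusive"


print(lod_to_likelihood_ratio(4))   # 10000.0
print(interpret_lod(4.0), interpret_lod(-6.5), interpret_lod(0.2))
```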
In VCF mode, CheckFingerprint uses PL to estimate genotype probabilities.
Limitations: Currently, VCF mode is designed for whole-genome sequencing samples with 30x coverage; read mode is designed for whole-exome sequencing. Larger datasets may encounter timeout errors. VCF mode is recommended for general use. Read mode should be used in isolation without other components enabled and should only be used if VCF mode does not provide sufficient accuracy.
DRAGEN implements a beta version of the Population Haplotyping tool. This tool supports the estimation of haplotypes from a population scale dataset via the packaging of the SHAPEIT5 Software (2022, Hofmeister RJ, Ribeiro DM, Rubinacci S., Delaneau O). It is designed to phase common variants as well as rare variants in a step-by-step mode. The following step-by-step workflow must be reproduced to phase each chromosome of the studied genome.
Step 1: Phase Common step to estimate the haplotypes of common variants (variants with allele frequency above a given allele frequency threshold) on defined regions.
Step 2: Common Ligate step to ligate the phased common variants from step 1 into a single chromosome.
Step 3: Phase Rare step to add the haplotypes of rare variants (variants with allele frequency below a given allele frequency threshold) on defined regions to the common variant scaffold obtained in step 2.
Step 4: Concat All step to concatenate the haplotype regions obtained in step 3 into a single chromosome.
This tool provides the best accuracy on population-scale datasets with thousands of samples. Running it on multiple nodes to parallelize processes is recommended. A common use case of the Population Haplotyping tool is the generation of a custom reference panel to be used for the VCF Imputation pipeline.
The tool supports autosomes and mixed ploidy chromosomes for diploid species only. It does not use FPGA acceleration and can run on generic, software-only compute nodes.
Note: the Population Haplotyping tool only supports input msVCF produced with the DRAGEN gVCF Genotyper tool.
The following is an example of the commands required to generate haplotypes on common and rare variants (with the default allele frequency threshold) on a population-scale dataset:
To generate per chromosome haplotypes:
To generate per genome haplotyped sites:
For the Phase Common step (step 1), it is recommended to provide msVCF generated with the DRAGEN gVCF Genotyper tool. This first step takes as input a .txt file with path to a single msVCF or a list of msVCF, one line per path. The msVCF must comply with the following requirements:
per chromosome msVCF OR positionally sorted msVCF shards spanning a whole chromosome without overlap. See below for shard definition
generated from the same reference build
compressed and indexed
with unphased GT calls
with no duplicates
with header ##contig "ID" and "length" fields for all contigs present in the studied genome
Note: for mixed ploidy chromosomes each PAR and non-PAR regions of the chromosome must be treated as a single chromosome. For example, on human data, the sample input msVCF for chrX must be divided into chrX_par1, chrX_par2, and chrX_nonpar.
The msVCF input list provided at step 1 is pre-processed to generate a formatted msVCF called <prefix>.preprocess.vcf.gz
. This formatted msVCF is generated in the directory and must be used as input of the Phase Rare step (step 3).
To facilitate parallel processing on distributed compute nodes, and to avoid the overhead of downloading and uploading a chromosome-level multisample VCF for each sub-chromosome process, chromosome portions of equal size (shards) can be used as input. The gVCF Genotyper tool, with the proper option, can generate these equal-size shards. Note: streaming from the cloud is not supported. Instead, download the input in advance and process it locally to achieve maximum I/O efficiency and stability.
A per chromosome genetic map corresponding to the studied species and to the reference build used for the msVCF input is required. You can use your own genetic map computed from the recombination rate of the species and its reference genome, or use the genetic map corresponding to the human hg38 reference genome available to download from the DRAGEN Software Support Site page. DRAGEN does not generate custom genetic map files.
The genetic map should follow the format:
3 columns: position, chromosome number, distance (cM), in this order and tab separated
The genetic map for a mixed ploidy chromosome must be separated into its PAR and non-PAR regions (e.g. for human, chromosome X is split into PAR1 chrX_par1
, PAR2 chrX_par2
and non PAR chrX_nonpar
regions)
Genetic map for region in which all samples are haploid is not needed (e.g. for human, chromosome Y chrY
)
The user must ensure the genetic maps provided are from the same reference build as the reference used to generate the msVCF input.
This configuration file is a required text file. It allows proper handling of haploid and diploid chromosomes and verification of concordance between the genetic maps, msVCF input, and sample type file information. The current configuration supports binary gender (male or female) and ploidy 2 or 1. When a region has different ploidies in male and female samples, the region is considered a mixed ploidy region (e.g. for human, the non-PAR region on chromosome X chrX_nonpar
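A genetic map in the three-column layout described above can be loaded with a short Python sketch. The function name is hypothetical, and the handling of an optional header line is an assumption, because the presence of a header is not stated here.

```python
def parse_genetic_map(lines):
    """Parse tab-separated rows of (position, chromosome, distance in cM).

    A first line whose position column is not numeric is treated as a
    header and skipped (an assumption of this sketch).
    """
    rows = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 tab-separated columns")
        if number == 1 and not fields[0].isdigit():
            continue  # assumed header line
        rows.append((int(fields[0]), fields[1], float(fields[2])))
    return rows


example = ["pos\tchr\tcM\n", "100\t1\t0.0\n", "200\t1\t0.5\n"]
print(parse_genetic_map(example))
```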
).
The user can provide their own file or use the one available to download from the DRAGEN Software Support Site page.
The config file is a text file with the headers:
##version
##ref_build indicating the reference build used for the study.
The config file is a txt file that contains 4 tab-delimited columns. Each column must be populated.
Note: for mixed ploidy chromosomes, ensure the genetic map is separated into PAR and non-PAR regions with no overlap. Example: for human data, the prefixes should be chrX_par1
, chrX_nonpar
, and chrX_par2
.
The sample type file is a required file. The number of samples and name of samples in the input multisample VCF and sample type file should match.
The sample type file is a txt file with the following format:
2 columns, tab or space delimited
First column: list of all sample names present in the input multisample VCF
Second column: 1 or 2. 1 for subject with mixed ploidy chromosomes, 2 for subject with all diploid chromosomes.
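The two-column format and the requirement that sample names match the multisample VCF can be checked before a run with a sketch like the following. The function names are hypothetical, and the sample names in the example are placeholders.

```python
def read_sample_types(lines):
    """Parse 'sample<TAB or space>code' rows, where code is 1 or 2."""
    types = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        fields = line.split()  # accepts tabs or spaces
        if len(fields) != 2 or fields[1] not in ("1", "2"):
            raise ValueError(f"line {number}: expected '<sample> <1|2>'")
        types[fields[0]] = int(fields[1])
    return types


def samples_match(sample_types, vcf_sample_names) -> bool:
    """True when both inputs contain exactly the same sample names."""
    names = list(vcf_sample_names)
    return len(names) == len(sample_types) and set(names) == set(sample_types)


types = read_sample_types(["sampleA\t1\n", "sampleB 2\n"])
print(types, samples_match(types, ["sampleA", "sampleB"]))
```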
The Phase common step (step 1) is run on a defined region, and outputs:
a single scaffold msVCF and related msVCF index with phased common variants for that region. The default name is dragen.ph_phase_common.vcf.gz
.
a single formatted msVCF called <prefix>.preprocess.vcf.gz
and related index. This formatted msVCF is generated in the directory and must be used as input of the Phase Rare step (step 3).
The Ligate Common step (step 2) ligates the regions phased in step 1 and outputs a single scaffold msVCF and related msVCF index with phased common variants for a single chromosome. The default name is “dragen.ph_ligate_common.vcf.gz”.
The Phase Rare step (step 3) is run on a defined region on a chromosome with preprocessed unphased msVCF from step 1 and phased scaffold msVCF from step 2, and outputs:
a single phased msVCF and related msVCF index with phased common and rare variants for that region. The default name is “dragen.ph_rare_common.vcf.gz”.
a single 8-column VCF and related index listing all sites that have been phased for that region. The default name is “dragen.ph_rare_common.sites.vcf.gz”. This output is used at the Concat All step to generate a VCF file with all phased sites across the genome.
The Concat All processing is used to generate 2 types of output:
Phased common and rare variants for a chromosome
The Concat All step (step 4) concatenates the regions phased in step 3 and outputs an msVCF and related index with phased common and rare variants for a single chromosome. The default name is “dragen.ph_concat_all.vcf.gz”.
List of phased sites
This output is useful as input to the ForceGT tool. The Concat All step lists all sites that have been phased in an 8-column VCF format and outputs a VCF and related index with the list of phased sites. This output can be generated either directly from the list of phased site VCFs across the genome from step 3, or in a second step after the per chromosome site lists have been generated. The default name is “dragen.ph_concat_all.sites.vcf.gz”.
An additional module of the Population Haplotyping tool checks for the quality of the haplotypes produced based on a phased truth set provided as input.
DRAGEN can process data from whole genome and hybrid-capture assays with unique molecular identifiers (UMI). UMIs are molecular tags added to DNA fragments before amplification to determine the original input DNA molecule of the amplified fragments. UMIs help reduce errors and biases introduced by DNA damage such as deamination before library prep, PCR error, or sequencing errors.
To use the UMI Pipeline, the input reads files must be from a paired-end run. Input can be pairs of FASTQ files or aligned/unaligned BAM input. DRAGEN supports the following UMI types:
Dual, nonrandom UMIs, such as TruSight Oncology (TSO) UMI Reagents or IDT xGen Prism.
Dual, random UMIs, such as Agilent SureSelect XT HS2 molecular barcodes (MBC) or IDT xGen Duplex Seq Adapters.
Single-ended, random UMIs, such as Agilent SureSelect XT HS molecular barcodes (MBC) or IDT xGen dual index UMI Adapters.
DRAGEN uses the UMI sequence to group the read pairs by their original input fragment and generates a consensus read pair for each such group, or family. The consensus reduces error rates to detect rare and low frequency somatic variants in DNA samples with high accuracy. DRAGEN generates a consensus as follows.
Aligns reads.
Groups reads into groups with matching UMI and pair alignments. These groups are referred to as families.
Generates a single consensus read pair for each read family.
These generated reads have higher quality scores than the input reads and reflect the increased confidence gained by combining multiple observations into each base call. The UMI workflow is compatible only with small variant calling and SV calling in DRAGEN.
Enter UMIs in one of the following formats:
Read name—The UMI sequence is located in the eighth colon-delimited field of the read name (QNAME). For example, NDX550136:7:H2MTNBDXX:1:13302:3141:10799:AAGGATG+TCGGAGA
BAM tag—The UMI is present as an RX tag in a prealigned or aligned BAM file (standard SAM format).
FASTQ file—The UMI is located in a third FASTQ file using the same read order as the read pairs.
To create FASTQ, append the UMI to the read name, and then specify the appropriate OverrideCycles setting in the BCL conversion tool (see Illumina BCL Data Conversion). DRAGEN supports UMIs with two parts each with a maximum of 8 bp and separated by +, or a single UMI with a maximum of 15 bp.
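For the read name format, the UMI can be extracted and checked against the stated length limits with a short Python sketch. The function name is hypothetical, and the validation only reflects the limits given above (two parts of at most 8 bp separated by +, or a single UMI of at most 15 bp).

```python
def umi_from_qname(qname: str):
    """Return the UMI parts from the eighth colon-delimited QNAME field."""
    fields = qname.split(":")
    if len(fields) < 8:
        raise ValueError("QNAME has fewer than eight colon-delimited fields")
    parts = fields[7].split("+")
    if len(parts) == 2:
        valid = all(1 <= len(p) <= 8 for p in parts)
    elif len(parts) == 1:
        valid = 1 <= len(parts[0]) <= 15
    else:
        valid = False
    if not valid:
        raise ValueError(f"unsupported UMI layout: {fields[7]!r}")
    return parts


print(umi_from_qname("NDX550136:7:H2MTNBDXX:1:13302:3141:10799:AAGGATG+TCGGAGA"))
```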
The UMI workflow must be executed using a set of reads that correspond to a unique set of RGSM/RGLB. DRAGEN supports multiple lanes if all lanes correspond to the same RGSM/RGLB set.
DRAGEN UMI does not support a tumor-normal analysis, because a tumor-normal run corresponds to two different RGSM. In a tumor-normal run, one sample name is used for tumor and one sample name is used for normal. DRAGEN UMI supports one sample in a run.
If using a BAM file or a list of FASTQ files as the input, the input might contain multiple samples. DRAGEN checks if only one sample is included in the run and if the sample uses only a single, unique RGLB library. DRAGEN also accepts a library that was spread across multiple lanes. If there is a single sample and single library, DRAGEN processes all included reads. If there are multiple samples or multiple libraries, DRAGEN aborts analysis with an error.
For dual, nonrandom UMIs, you can provide a predefined UMI correction table or a list of valid UMI sequences as input. To create the UMI correction table, use a tab-delimited file, include a header, and add the following fields.
If a customized correction table is not specified, DRAGEN uses the default table for TruSight Oncology (TSO) UMI Reagents that is located at <INSTALL_PATH>/resources/umi/umi_correction_table.txt.gz
. Alternatively, you can provide a whitelist file of nonrandom UMIs with one valid UMI sequence per line. DRAGEN then autogenerates a UMI correction table with a Hamming distance of one.
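The idea of deriving a correction table from a whitelist can be sketched as follows: every sequence within Hamming distance one of exactly one valid UMI is mapped to that UMI. This is an illustration only. The function name, the in-memory dictionary layout, and the choice to drop sequences that are equally close to two valid UMIs are assumptions, and the sketch does not reproduce the file format of the DRAGEN correction table.

```python
def build_correction_map(whitelist):
    """Map observed UMI sequences to valid UMIs at Hamming distance <= 1."""
    valid = set(whitelist)
    table = {umi: umi for umi in valid}
    ambiguous = set()
    for umi in valid:
        for i, base in enumerate(umi):
            for alt in "ACGTN":
                if alt == base:
                    continue
                neighbor = umi[:i] + alt + umi[i + 1:]
                if neighbor in valid:
                    continue  # a valid UMI is never corrected
                if neighbor in table and table[neighbor] != umi:
                    ambiguous.add(neighbor)  # close to two valid UMIs
                else:
                    table[neighbor] = umi
    for neighbor in ambiguous:
        table.pop(neighbor, None)
    return table


table = build_correction_map(["AAA", "CCC"])
print(table["AAC"], table["CCA"])
```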
--umi-library-type
—Set the batch option for different UMI correction. Three batch modes are available that optimize collapsing configurations for different UMI types. Use one of the following modes:
random-duplex
—Dual, random UMIs.
random-simplex
—Single-ended, random UMIs.
nonrandom-duplex
—Dual, nonrandom UMIs. To use this option, provide either --umi-nonrandom-whitelist
or --umi-correction-table
.
--umi-min-supporting-reads
—Specify the number of matching UMI input reads required to generate a consensus read. Any family with insufficient supporting reads is discarded. For example, the following are the recommended settings for FFPE and ctDNA.
[FFPE] If the variant > 1%, use --umi-min-supporting-reads=1
with the --vc-enable-umi-solid
variant caller parameter. For more information on variant caller options, see Variant Caller Options.
[ctDNA] If the variant < 1%, use --umi-min-supporting-reads=2
with the --vc-enable-umi-liquid
variant caller parameter. For more information on variant caller options, see Variant Caller Options.
--umi-enable
—To enable read collapsing, set the --umi-enable option
to true. This option is not compatible with --enable-duplicate-marking
because the UMI pipeline generates a consensus read from a set of candidate input reads, rather than choosing the best nonduplicate read. If using the --umi-library-type
option, --umi-enable
is not required.
--umi-emit-multiplicity
—Set the consensus sequence type to output. DRAGEN UMI allows you to collapse duplex sequences from the two strands of the original molecules. Duplex sequence is typically ~20–60% of total library, depending on library kit, input material, and sequencing depth. Enter one of the following consensus sequence types:
both
—Output both simplex and duplex sequences. This option is the default.
simplex
—Output only simplex sequences.
duplex
—Output only duplex sequences.
--umi-source
—Specify the input type for the UMI sequence. The following are valid values: qname
, bamtag
, fastq
. If using --umi-source=fastq
, provide the UMI sequence from FASTQ file using --umi-fastq
.
--umi-correction-table
—Enter the path to a customized correction table. By default, DRAGEN uses lookup correction with a built-in table for the Illumina TruSight Oncology and Illumina for IDT UMI Index Anchor kits.
--umi-nonrandom-whitelist
—Enter the path to a customized file of valid UMI sequences.
--umi-metrics-interval-file
—Enter the path for target region in BED format.
--umi-output-uncollapsed-bam
—Output the map/align results for uncollapsed (raw) reads to a separate BAM with the filename <output_prefix>.uncollapsed.bam.
DRAGEN processes UMIs by grouping reads by UMI and alignment position. If there are sequencing errors in the UMIs, DRAGEN can correct and detect small sequencing errors by using a lookup table or by using sequence similarity and read counts. You specify the type of correction with the --umi-library-type
or --umi-correction-scheme
option using the values lookup
, random
, or none
.
For sparse sets of nonrandom UMIs, it is possible to create a lookup table that specifies which sequences can be corrected and how to correct them. This correction scheme works best on UMI sets where sequences have a minimum Hamming/edit distance between them. By default, DRAGEN uses lookup correction with a built-in correction table for the Illumina TruSight Oncology and Illumina for IDT UMI Index Anchor kits. Specify the path for your correction file using the --umi-correction-table
option. If you are using a different set of nonrandom UMIs, contact Illumina Technical Support for information on generating the corresponding correction file.
In the random UMI correction scheme, DRAGEN must infer which UMIs at a given position are likely to be errors relative to other UMIs observed at the same position. The error modes include small UMI errors, such as a single mismatch, and UMI jumping or hopping artifacts from library prep. DRAGEN accomplishes this as follows.
Groups reads by fragment alignment position.
Within a small fuzzy window at each position, groups the reads first by exact UMI sequence, which forms a family.
Estimates the UMI jumping or hopping probability from the insert size distribution and the number of distinct UMIs at certain positions.
Within a fuzzy window, calculates a pairwise likelihood ratio to assess whether two families with different UMI sequences and genomic positions are derived from the same original molecule.
Merges families with a likelihood ratio lower than the threshold. The default threshold is 1.
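The first two steps above (grouping by position, then by exact UMI) can be illustrated with a minimal Python sketch. This is not DRAGEN's implementation: the `(position, umi)` record layout, the function name, and the rule that a position within `fuzzy_window` bases of the previous sorted position joins the same cluster are all assumptions made for illustration. The likelihood-based merging steps are not modeled.

```python
from collections import defaultdict

def group_into_families(reads, fuzzy_window=3):
    """Group (position, umi) records into families.

    `position` stands in for the fragment alignment position. A position
    within `fuzzy_window` bases of the previous sorted position joins the
    same position cluster (an illustrative rule, not DRAGEN's exact one).
    Reads in the same cluster with an identical UMI form one family.
    """
    families = defaultdict(list)
    cluster_start = None
    previous = None
    for position, umi in sorted(reads):
        if previous is None or position - previous > fuzzy_window:
            cluster_start = position  # start a new position cluster
        previous = position
        families[(cluster_start, umi)].append(position)
    return dict(families)

# Hypothetical reads: three near position 100 and one far away.
reads = [(100, "ACGT"), (101, "ACGT"), (102, "TTGA"), (250, "ACGT")]
families = group_into_families(reads)
for (cluster, umi), members in sorted(families.items()):
    print(cluster, umi, len(members))
```

In this toy input, the two ACGT reads near position 100 form one family, the TTGA read forms a second family in the same cluster, and the read at position 250 forms a third.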
Duplex UMI adapters simultaneously tag both strands of double-stranded DNA fragments. It is then possible to identify reads resulting from amplification of each strand of the original fragment.
DRAGEN considers two collapsed read pairs to be the sequences of the two strands of the same original DNA fragment if they have the same alignment position (within a fuzzy window), complementary orientations, and UMIs that are swapped between Read 1 and Read 2. If there is only a single-ended UMI, DRAGEN compares the start and end positions of the families from the two strands and computes a pairwise likelihood to determine whether they likely originated from two distinct families or should be merged as a duplex sequence. By default, DRAGEN outputs both simplex and duplex consensus sequences. To change the consensus sequence output type, use --umi-emit-multiplicity
.
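The three duplex conditions described above (same position within a fuzzy window, complementary orientations, swapped UMIs) can be sketched as a simple predicate. The dictionary keys, the function name, and the default window of 3 are assumptions for illustration. The single-ended UMI case, which relies on a likelihood computation, is not modeled.

```python
def is_duplex_pair(fam_a, fam_b, fuzzy_window=3):
    """Return True if two collapsed families look like opposite strands.

    Each family is a dict with hypothetical keys:
      start, end      fragment alignment coordinates
      r1_forward      True if Read 1 maps to the forward strand
      umi_r1, umi_r2  UMI observed on Read 1 and Read 2
    """
    same_position = (
        abs(fam_a["start"] - fam_b["start"]) <= fuzzy_window
        and abs(fam_a["end"] - fam_b["end"]) <= fuzzy_window
    )
    complementary = fam_a["r1_forward"] != fam_b["r1_forward"]
    umis_swapped = (
        fam_a["umi_r1"] == fam_b["umi_r2"]
        and fam_a["umi_r2"] == fam_b["umi_r1"]
    )
    return same_position and complementary and umis_swapped

# Hypothetical families from the two strands of one molecule.
top = {"start": 1000, "end": 1300, "r1_forward": True,
       "umi_r1": "AAC", "umi_r2": "GTT"}
bottom = {"start": 1001, "end": 1300, "r1_forward": False,
          "umi_r1": "GTT", "umi_r2": "AAC"}
print(is_duplex_pair(top, bottom))
```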
The following is an example DRAGEN command for generating a consensus BAM file from input reads with Illumina UMIs:
To run with another random UMI library type, change --umi-library-type
to random-simplex
or random-duplex
.
If you enable BAM output, DRAGEN generates a <output_prefix>.bam that includes all UMI consensus reads. The QNAMEs for the reads are generated based on the following convention.
refID1—The reference ID of Read 1.
pos1—The genomic position of Read 1.
refID2—The reference ID of Read 2.
pos2—The genomic position of Read 2.
orientation—The orientation of Read 1 and Read 2. Orientation can be one of the following values. Position refers to the outermost aligned position of the read and is adjusted for soft clips.
1—Read 1 is forward and Read 2 is reverse. The starting position for Read 1 is less than or equal to the Read 2 end position.
2—Read 1 is reverse and Read 2 is forward. The starting position for Read 2 is greater than or equal to the Read 1 end position.
3—Read 1 is forward and Read 2 is reverse. The starting position for Read 1 is greater than the Read 2 end position.
4—Read 1 is reverse and Read 2 is forward. The starting position for Read 2 is greater than the Read 1 end position.
5—Read 1 and Read 2 are forward.
6—Read 1 and Read 2 are reverse.
XV and XW tags are added to consensus reads to specify the number of supporting reads in a collapsed family. The XV tag indicates the number of fragments, and the XW tag indicates the number of duplex fragments.
DRAGEN outputs an <output_prefix>.umi_metrics.csv file that describes the statistics for UMI collapsing. This file summarizes statistics on input reads, how they were grouped into families, how UMIs were corrected, and how families generated consensus reads. The following metrics can be useful when tuning the pipeline for your application:
Discarded families---Any families having fewer than --umi-min-supporting-reads
input or having a different duplex/simplex status than specified by --umi-emit-multiplicity
are discarded. These reads are logged as Reads filtered out. The families are logged as Families discarded.
UMI correction---Families may be combined in various ways. The number of such corrections are reported as follows.
Families shifted---Families with fragment alignment coordinates that differ by up to the distance specified by the umi-fuzzy-window-size parameter are merged. The default umi-fuzzy-window-size value is 3.
Families contextually corrected---Families with exactly the same fragment alignment coordinates and compatible UMIs are merged.
Duplex families---Families with close alignment coordinates and complementary UMIs are merged.
When you specify a valid path for --umi-metrics-interval-file, DRAGEN outputs a separate set of on target UMI statistics that contains only families within the specified BED file.
If you need to analyze the extent to which the observed UMIs cover the full space of possible UMI sequences, the histogram of unique UMIs per fragment position metric may be helpful. It is a zero-based histogram, where the index indicates a count of unique UMIs at a particular fragment position and the value represents the number of positions with that count.
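To make the histogram layout concrete, the following Python sketch builds a zero-based histogram of unique UMIs per fragment position from a list of `(position, umi)` observations. The input layout and function name are hypothetical. The sketch only illustrates how the index and value are interpreted, not how DRAGEN computes the metric.

```python
from collections import Counter

def unique_umi_histogram(observations):
    """Return a list where index i is the number of fragment positions
    at which exactly i unique UMIs were observed."""
    umis_by_position = {}
    for position, umi in observations:
        umis_by_position.setdefault(position, set()).add(umi)
    counts = Counter(len(umis) for umis in umis_by_position.values())
    max_count = max(counts, default=0)
    return [counts.get(i, 0) for i in range(max_count + 1)]

# Hypothetical observations: positions 100 and 300 each have two unique
# UMIs, position 200 has one.
observations = [
    (100, "AAC"), (100, "AAC"), (100, "GTT"),
    (200, "AAC"),
    (300, "CCA"), (300, "TGA"),
]
hist = unique_umi_histogram(observations)
print(hist)  # index 0 is always 0 for positions that were observed
```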
The following figures and table describe available UMI metrics.
Fig1) Read pairs with duplex UMI
Fig2) UMI error correction
Fig3) UMI collapsible regions
The DUX4 Rearrangement Caller identifies the events of potential structural rearrangements between DUX4 and other genes (including IGH). The primary support for the DUX4 Rearrangement Caller is for human reference hg38.
The DUX4 Rearrangement Caller has the following features:
Calls DUX4 rearrangement events from genomic data in various formats, such as FASTQ, BAM, and CRAM.
Scans the whole genome and identifies potential DUX4 rearrangement events.
Runs in parallel with the host DRAGEN software with minimal overhead.
A tumor-only, paired-end, whole-genome sequencing dataset
A sequencing dataset with mean coverage between 25X and 120X
A sequencing dataset with mean fragment length between 300 and 500 bp
A sequencing dataset with mean read length between 100 and 151 bp
A reference genome that is compatible with DRAGEN software. You can download prebuilt reference genomes from our website or build your own customized version with: dragen --build-hash-table true --output-directory <HASHTABLE_DIR> --ht-reference <REF_FASTA> [options]
The DRAGEN DUX4 caller has been validated with a cohort of samples that fall within the above defined parameters. If you have datasets that don't comply with the above parameters, you can bypass the requirements check by specifying --dux4-skip-santiy-check true
to obtain experimental results.
The basic syntax of the DRAGEN command line is:
dragen [global options] [pipeline options] [output options]
The global options are common to all pipelines and control the general behavior of DRAGEN, such as the input and output files/directories, the reference genome, and the license file.
The pipeline options are specific to each pipeline and control the parameters and features of the analysis, such as the variant callers, the filters and the annotations.
The output options control the format and content of the output files, such as the VCF, BAM, and the metrics files.
For the DUX4 caller, a simple example is:
In this example, DRAGEN takes sequencing data in FASTQ format (BAM, CRAM, and ORA are also accepted) and maps and aligns the reads to the reference genome. The DUX4 caller then consumes the mapped and sorted reads.
Alternatively, the DRAGEN DUX4 caller can start from BAM input and skip the map/align step (assuming the BAM file is sorted and duplicates are marked):
The DUX4 caller can also run in parallel with other variant callers:
The DUX4 VCF results are written to the directory specified by --output-dir, with the prefix specified by --output-file-prefix.
The DUX4 VCF contains positive calls that represent translocation events across gene pairs. Each event consists of a set of four VCF breakend records that describe the potential translocation event. Each record contains PR:SR:SRPB tags that describe the number of fragments supporting the event, where PR is the number of spanning paired reads, SR is the number of spanning split reads, and SRPB is the number of supporting read pairs per billion reads processed. Two sets of genomic target regions are predefined to optimize event detection: "CoreDUX4" regions and "ExtendedDUX4" regions, where the "CoreDUX4" regions are a subset of the "ExtendedDUX4" regions.
An output VCF example will look like this:
The Ploidy Estimator runs by default. The Ploidy Estimator uses reads from the mapper/aligner to calculate the sequencing depth of coverage for each autosome and allosome in the human genome. The sex karyotype of the sample is then estimated using the ratios of the median sex chromosome coverages to the median autosomal coverage. The sex karyotype is estimated based on the range the ratios fall in. If the ratios are outside all expected ranges, then the Ploidy Estimator does not determine a sex karyotype.
Sex Karyotype | X Ratio Min | X Ratio Max | Y Ratio Min | Y Ratio Max |
---|---|---|---|---|
Ploidy estimation can fail if the type of input sequencing data cannot be determined to be either WGS or WES. When ploidy estimation fails the estimated median coverage values will be zero. The type of input sequencing data is determined using coverage skewness.
skewness = std::abs(autosomeMean - autosomeMedian) / autosomeMean
When skewness is <= 0.2 the data is determined to be WGS. Note that a minimum of 2x coverage is required for WGS. WGS with coverage lower than 2x may not be detected properly or may be detected as WES. When skewness is >=0.6 the data is determined to be WES. Skewness between 0.2 and 0.6 will have undefined input sequencing data type and the reported estimated median coverage values will be zero.
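The skewness rule above can be expressed as a short Python sketch. The function name and the returned labels are illustrative assumptions. Only the formula and the 0.2 and 0.6 thresholds come from the text, and the separate 2x minimum-coverage requirement is not modeled here.

```python
def classify_input_type(autosome_mean, autosome_median):
    """Classify input data from coverage skewness.

    Returns "WGS", "WES", or "undetermined" (illustrative labels).
    """
    if autosome_mean <= 0:
        return "undetermined"  # skewness is undefined without coverage
    skewness = abs(autosome_mean - autosome_median) / autosome_mean
    if skewness <= 0.2:
        return "WGS"
    if skewness >= 0.6:
        return "WES"
    return "undetermined"  # between the thresholds: type is undefined

# Hypothetical coverage values.
print(classify_input_type(30.0, 29.0))  # mean and median nearly equal
print(classify_input_type(10.0, 2.0))   # mean far above median
print(classify_input_type(10.0, 6.0))   # falls between the thresholds
```

In the undetermined case, the text states that the reported estimated median coverage values are zero.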
For WES data, the median exome coverage is estimated using the 99th percentile of coverage bins across each contig. This estimated median exome coverage is then reported by the Ploidy Estimator and used for sex estimation.
If there is not sufficient sequencing coverage in the autosomes (at least 2x for either WGS or WES) then the Ploidy Estimator does not determine a sex karyotype.
When both tumor and matched normal reads are provided as input, the Ploidy Estimator only estimates sequencing coverage and sex karyotype for the matched normal sample and ignores the tumor reads. If only tumor reads are provided as input, the Ploidy Estimator estimates sequencing coverage and sex karyotype for the tumor sample.
The Ploidy Estimator results, including each normalized per-contig median coverage, are reported in the <output-file-prefix>.ploidy_estimation_metrics.csv
file and in standard output.
The following is an example of the results.
The DRAGEN Variable Number Tandem Repeat (VNTR) Caller detects expansions and contractions in tandem repeat (TR) regions. For specified TR regions in the genome, the DRAGEN VNTR Caller estimates the size of the haplotypes in each region and provides variant calls, including the number of copies of the repeat for the sample in question. The DRAGEN VNTR Caller only considers TR regions included in a pre-specified VNTR catalog file.
For each region in the VNTR catalog, the VNTR Caller performs the following steps:
Read fragment collection, including wrap-around alignment and read classification;
Genotyping, including the scoring of candidate haplotype lengths using a Bayesian likelihood model.
The output of the VNTR Caller is the total length of sequence present in each TR region for the sample in question, resolved for each haplotype if possible; the copy number for each region is calculated from the length. Calls are reported in a VNTR output VCF file following the VCFv4.4 spec.
The DRAGEN VNTR Caller can be enabled by setting the --enable-vntr
option to true
. The VNTR Caller requires whole-genome sequencing (WGS) data aligned to a human reference genome with at least 30x coverage.
This diagram illustrates the overall workflow of the VNTR Caller. The VNTR Caller takes as input a set of aligned reads from the sample in question (either from the DRAGEN mapper or from an input BAM/CRAM) and a VNTR catalog file.
The VNTR catalog is a bed file specifying the TR regions for the VNTR Caller to act upon. Each region in the bed file is expected to be the start and the end of a tandem repeat sequence with no additional buffer sequence. The catalog also includes a unique TR ID for each region and the sequence of the repeat unit/pattern (see below for more details on the VNTR catalog file format).
The VNTR Caller processes each TR region in parallel, starting with read fragment collection. The VNTR Caller considers read fragments (i.e. paired-end reads as a single unit) rather than individual reads. To obtain all of the relevant read fragments for each region, all of the reads that overlap the region are found, and then all of their mates are collected as well.
Due to the repetitive nature of TR regions, existing read-alignments may be unreliable. Therefore spanning reads, unmapped reads, and reads with soft-clips undergo a specialized wrap-around alignment algorithm, which allows for a read to align to the same pattern sequence multiple times without penalty (mirroring the structure of the tandem repeat). This algorithm produces more reliable alignments of the read fragments to the TR region. Additionally, another rule to virtually extend the boundaries of the repetitive region into the flanks is applied to resolve some alignment ambiguities arising from the wrap-around alignment.
Once reliable alignments of the read fragments have been obtained, the next step is to classify each read. Reads are classified as non-overlapping, flanking, spanning, and contained relative to the TR region based on the following figure.
The output of fragment collection is the set of all read fragments in each TR region, re-aligned as necessary, with each read given a classification. This collection of read fragments is referred to as a pileup and acts as the input to the genotyper.
The genotyper determines the top-scoring haplotypes based on the read fragment evidence for each TR region. Given a pileup, the genotyper further classifies each read fragment into fragment classes.
The number of fragments in each class acts as evidence for the haplotype lengths of the TR region. A Bayesian likelihood model is used to evaluate what pair of haplotypes have the highest likelihood of generating the observation of these fragment class counts. A set of candidate haplotype lengths is generated based on fractional increments of the repeat pattern length, and each pair of haplotype lengths is evaluated as a candidate diploid genotype. If the caller detects that individual haplotype lengths cannot be resolved, the total length is considered as a candidate genotype instead (referred to as a total call). In subsequent steps, these total call candidates are assessed as if they were homozygous diploid genotypes.
For a Bayesian model, the posterior probability of each candidate diploid genotype must be considered. The posterior probability is made up of two parts: the genotype prior and the pileup likelihood. Three types of priors are currently supported:
No prior (all alleles weighted equally, referred to as model 0)
Het/hom priors (four classes with different weights: homozygous reference, ref/alt, homozygous alt, and alt1/alt2, referred to as model 1)
Population haplotype frequencies (region-specific haplotype frequencies over high-quality population data sets, referred to as model 3 and used by default)
The priors model can be chosen by setting the option --vntr-priors-model
to 0
, 1
, or 3
(the default being 3
).
The pileup likelihood is calculated as the likelihood of observing the fragment class counts given the candidate diploid genotype (based on an underlying model for how fragments are generated from a TR region haplotype of a given length). With the prior and the pileup likelihood, the posterior probability of each candidate diploid genotype can be computed. The diploid genotype with the highest posterior probability is chosen as the resulting call for each region.
The VNTR Caller is disabled by default. To enable the VNTR Caller, set --enable-vntr
to true
. The VNTR Caller can run directly from FASTQ input with the mapper or from prealigned BAM/CRAM input. You can also enable the VNTR Caller in parallel with any other germline variant callers as part of a WGS germline analysis workflow. For more information on other variant callers, see the DRAGEN DNA Pipeline.
FASTQ input example:
BAM input example:
Additional Options:
The number of threads used for the DRAGEN VNTR caller can be adjusted using the --vntr-num-threads <number_of_threads>
option.
The VNTR catalog is a bed file with the following required fields:
chromosome (or contig)
start position (0-based inclusive)
end position (0-based exclusive = 1-based inclusive)
TR ID (unique ID for TR region)
repeat unit sequence (sequence of repeat pattern/motif)
The reference haplotype length is calculated by subtracting the start position from the end position, and the number of repeat units in the reference can be found by dividing the reference haplotype length by the length of the repeat unit.
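The calculation can be sketched in Python for a single catalog line. The example line, its TR ID, and the output key names are hypothetical, and the sketch assumes the five required fields are tab-separated in the order listed above.

```python
def parse_catalog_line(line):
    """Parse one VNTR catalog BED line and derive the reference values."""
    chrom, start, end, tr_id, repeat_unit = line.rstrip("\n").split("\t")[:5]
    start, end = int(start), int(end)
    ref_length = end - start  # 0-based inclusive start, exclusive end
    return {
        "chrom": chrom,
        "start": start,
        "end": end,
        "tr_id": tr_id,
        "repeat_unit": repeat_unit,
        "ref_length": ref_length,
        "ref_repeat_count": ref_length / len(repeat_unit),
    }

# Hypothetical catalog entry: a 6 bp motif spanning 60 reference bases.
region = parse_catalog_line("chr1\t1000\t1060\tTR_example\tACGTAC\n")
print(region["ref_length"], region["ref_repeat_count"])
```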
When using a standard reference (hg38
, hg19
, or GRCh37
), DRAGEN will automatically use a matching pre-packaged catalog by default. A custom catalog can be provided by adding in the option, --vntr-catalog-bed <custom_catalog_bed_file>
.
For references other than the ones mentioned above, a catalog must be provided. Furthermore, the caller requires a set of normalization regions (--vntr-normalization-regions-bed <bed file>). These regions should be well-behaved and free of any VNTRs or other large variants. We recommend using a few thousand regions of at least 2 kb each. These two files are enough to run the genotyper without priors or with the het/hom priors model (--vntr-priors-model 0 or 1). To enable population priors (--vntr-priors-model 3), one additional file must be provided: --vntr-priors-file <json file>. The JSON file contains data obtained from a population analysis, structured like the following example with one entry per catalog region:
The output of the DRAGEN VNTR Caller includes a VNTR VCF file, a table output TSV file, and a VNTR metrics file.
The VNTR VCF file follows version 4.4 of the VCF spec. The VCF includes a call for every TR region provided in the VNTR catalog unless it was hard-filtered in the fragment collection or the genotyping (the filter annotation can be found in the table output file).
Each call is an estimate of the lengths of the haplotypes present in that region for the sample in question. If the individual haplotype lengths can be distinguished, then a diploid call is reported including the lengths and copy number of each haplotype (in the INFO RB and RUC fields, respectively). Otherwise, a total call is made, only reporting the total length and total copy number for the region (in the FORMAT TOTALRB and TOTALRUC fields). For total calls, GT = ./.
.
If the length of a haplotype is within a certain threshold of the reference array length for the region, then it is reported as a reference allele (the default reference threshold is 10%). If both haplotypes are reference alleles or if the total length of a total call is within the total reference threshold, then a reference call is reported in the VCF, with HomRef
in the FILTER
field and GT = 0/0
. A symbolic <CNV:TR>
is reported in the ALT
field for each non-reference allele in the call.
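The reference-allele rule for diploid calls can be sketched as follows. The sketch assumes the 10% threshold is applied symmetrically, as a fraction of the reference array length, and with an inclusive boundary. None of these details are stated in the text, and the function names are hypothetical. The total-call threshold is not modeled.

```python
def is_reference_allele(hap_length, ref_length, threshold=0.10):
    """True if the haplotype length is within `threshold` (as a fraction
    of the reference array length) of the reference array length."""
    return abs(hap_length - ref_length) <= threshold * ref_length

def is_reference_call(hap1_length, hap2_length, ref_length):
    """A diploid call is a reference call when both haplotypes are
    reference alleles."""
    return (is_reference_allele(hap1_length, ref_length)
            and is_reference_allele(hap2_length, ref_length))

# Hypothetical region with a 60 bp reference array.
print(is_reference_call(58, 63, 60))  # both haplotypes within 10% of 60
print(is_reference_call(60, 90, 60))  # second haplotype is an expansion
```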
The following fields are included for each VCF entry:
INFO:SVTYPE
: set to "CNV" for all VNTR calls
INFO:SVLEN
: set to the reference array length
INFO:EVENTTYPE
: set to "VNTR"
INFO:RUS
: the sequence of the repeat unit (i.e. the repeat pattern or motif)
INFO:RUL
: the length of the repeat unit
INFO:REFRUC
: the number of copies of the repeat in the reference haplotype
INFO:RB
: the length of each ALT haplotype being reported
INFO:CN
: the copy number per ALT haplotype relative to the reference (equal to RB / SVLEN
)
INFO:CNVTRLEN
: the change in length of each ALT haplotype compared to the reference (equal to RB - SVLEN
)
INFO:RUC
: the number of repeat unit copies for each ALT haplotype being reported (equal to RB / RUL
)
FORMAT:SVFT
: any filters that will be applied only in the merged SV + VNTR VCF
FORMAT:GQ
: Genotype quality score
FORMAT:CN
: the total copy number relative to the reference (equal to TOTALRB / SVLEN
)
FORMAT:TOTALRB
: the total length of all haplotypes (including reference haplotypes if present)
FORMAT:TOTALCNVTRLEN
: the total change in length of all haplotypes relative to the reference (equal to TOTALRB - 2*SVLEN
)
FORMAT:TOTALRUC
: the total number of repeat unit copies of all haplotypes (including reference haplotypes if present; equal to TOTALRB / RUL
)
Additional fields:
INFO:RUCCHANGE
: the change in the RUC compared to the reference for each ALT haplotype (equal to RUC - REFRUC
)
INFO:LOGPROB
: the log probability of the called alleles from the genotyper
INFO:VNTRCLASSFIT
: the score of how well the fragment classes fit the expected distribution
INFO:TOTALFRAGCOUNT
: the number of fragments used to make the call
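The arithmetic relationships between these fields can be checked with a small Python sketch. The function names and input values are hypothetical. The formulas are the ones given in the field descriptions, with REFRUC taken as SVLEN / RUL (the reference array length divided by the repeat unit length).

```python
def derived_allele_fields(rb, svlen, rul):
    """Per-ALT-haplotype INFO values derived from RB, SVLEN, and RUL."""
    ruc = rb / rul
    refruc = svlen / rul
    return {
        "CN": rb / svlen,
        "CNVTRLEN": rb - svlen,
        "RUC": ruc,
        "RUCCHANGE": ruc - refruc,
    }

def derived_total_fields(total_rb, svlen, rul):
    """Per-call FORMAT values derived from TOTALRB, SVLEN, and RUL."""
    return {
        "CN": total_rb / svlen,
        "TOTALCNVTRLEN": total_rb - 2 * svlen,
        "TOTALRUC": total_rb / rul,
    }

# Hypothetical region: 60 bp reference array, 6 bp repeat unit, one
# reference haplotype (60 bp) and one expanded haplotype (90 bp).
allele = derived_allele_fields(rb=90, svlen=60, rul=6)
total = derived_total_fields(total_rb=60 + 90, svlen=60, rul=6)
print(allele)
print(total)
```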
The table output file provides a simple summary of the VNTR Caller output. Every single region included in the VNTR catalog bed will also be included in the table output file; regions where a hard-filter was applied in the fragment collection or the genotyping will still be included with the reason for the filter annotated (these regions will not appear in the VCF file).
For each region, the following information is provided:
trId
: the unique ID of the TR region
patternSize
: the length of the repeat unit (i.e. the repeat pattern/motif length)
refArraySize
: the length of the reference haplotype for this region
Hap1Size
: the length of the first haplotype of the call (Hap1Size <= Hap2Size
; NA
for total calls)
Hap2Size
: the length of the second haplotype of the call (NA
for total calls)
TotalSize
: the total length of all haplotypes in the call (equal to Hap1Size + Hap2Size
for diploid calls)
Likelihood
: the log likelihood of the called alleles from the genotyper (equal to INFO:LOGPROB
in the VCF)
QUAL
: QUAL score (matches QUAL field in the VCF)
GQScore
: Genotype quality score (equal to FORMAT:GQ
in the VCF)
ClassDistributionFit
: the score of how well the fragment classes fit the expected distribution (equal to INFO:VNTRCLASSFIT
in the VCF)
FragmentCount
: the number of fragments used to make the call (equal to INFO:TOTALFRAGCOUNT
in the VCF)
Flags
: the flags and filters applied to the call
The VNTR metrics file reports summary statistics for the VNTR caller including region counts, read class counts, and call counts.
Region counts include the number of normalization regions, the number of prior regions, the number of TR regions with nonzero coverage, and the total number of TR regions.
Read class counts include the total number of reads in each class (strictly left, left-flanking, spanning, contained, right-flanking, strictly right, and unmapped).
Call counts include the total number of uncalled TR regions, as well as the total number of deletion, insertion, and reference calls for diploid and total calls (note that for a region where a diploid call is made, two calls are reported, but for a total call region, only one call is reported).
DRAGEN supports the automatic merger of the VNTR VCF calls with the DRAGEN Structural Variant (SV) Caller output VCF. By default, if both the DRAGEN VNTR Caller and the DRAGEN SV Caller are enabled (with the options --enable-vntr true
and --enable-sv true
, respectively), then calls made by the VNTR Caller will also be included in the DRAGEN SV VCF (<output_prefix>.sv.vcf.gz
). This behavior can be disabled by adding the option --sv-vntr-merge false
. The VNTR VCF does not change even if DRAGEN SV is enabled.
When VNTR calls are added to the SV VCF, the following changes are applied:
VNTR diploid calls with GT = 1/2
are split into two separate VCF entries, each of which is reported as GT = 0/1
.
A lt50bp
filter is applied to all VNTR calls with INFO:CNVTRLEN < 50
(the min SV length parameter is set to 50 bp by default). For total calls with no INFO:CNVTRLEN
, FORMAT:TOTALCNVTRLEN
is used instead.
A TotalCall
filter is applied to all VNTR total calls (calls with GT = ./.
). This behavior can be disabled by adding the option --sv-vntr-filter-total-calls false
.
A LowPopulationVariance
filter is applied to all VNTR calls with the FORMAT:SVFT
field equal to LowPopulationVariance
. The filter indicates that there are few population samples with an SV variant for this region.
An OverlapsVNTR
filter is applied to any SV call that overlaps with a VNTR call (even with a HomRef filter) UNLESS the VNTR call has a TotalCall
or a LowPopulationVariance
filter.
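The filters applied to a VNTR call when it is added to the SV VCF can be sketched as below. The dictionary layout and function name are hypothetical. Comparing the absolute length change against the minimum SV length is an assumption (the text states the condition as INFO:CNVTRLEN < 50). The GT = 1/2 splitting and the OverlapsVNTR filter on SV calls are not modeled.

```python
def merged_sv_filters(call, min_sv_length=50, filter_total_calls=True):
    """Return the filters a VNTR call would carry in the merged SV VCF.

    `call` is a dict with hypothetical keys:
      gt              genotype string, "./." for total calls
      cnvtrlen        INFO:CNVTRLEN, or None for total calls
      total_cnvtrlen  FORMAT:TOTALCNVTRLEN
      svft            FORMAT:SVFT value, or None
    """
    filters = []
    length_change = call["cnvtrlen"]
    if length_change is None:  # total calls have no INFO:CNVTRLEN
        length_change = call["total_cnvtrlen"]
    # Assumption: the magnitude of the change is what is compared.
    if abs(length_change) < min_sv_length:
        filters.append("lt50bp")
    if call["gt"] == "./." and filter_total_calls:
        filters.append("TotalCall")
    if call["svft"] == "LowPopulationVariance":
        filters.append("LowPopulationVariance")
    return filters

# Hypothetical calls.
small = {"gt": "0/1", "cnvtrlen": 30, "total_cnvtrlen": 30, "svft": None}
total_call = {"gt": "./.", "cnvtrlen": None, "total_cnvtrlen": 120, "svft": None}
print(merged_sv_filters(small))
print(merged_sv_filters(total_call))
print(merged_sv_filters(total_call, filter_total_calls=False))
```

The `filter_total_calls` argument mirrors the effect of --sv-vntr-filter-total-calls false described above.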
DRAGEN generates multiple pipeline-specific metrics including:
Mapping and Aligning metrics
Variant calling metrics
Biomarker metrics
Coverage (or enrichment) metrics and reports
Duration (or run time) metrics
Figure 10: Generation of Metrics and Reports
The QC metrics are printed to the standard output. In addition, CSV files are written to the run output directory:
<output prefix>.mapping_metrics.csv
<output prefix>.vc_metrics.csv
<output prefix>.<coverage region prefix>_coverage_metrics.csv
<output prefix>.time_metrics.csv
<output prefix>.<other coverage reports>.csv
Each CSV file has five columns: Section, Subsection (for example, read group or sample), Metric, Value 1 (count, ratio, or minutes), and Value 2 (percentage or seconds).
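A minimal Python sketch for reading such a five-column metrics file follows. The sample rows, including the section name and the values, are invented placeholders rather than real DRAGEN output. The sketch assumes there is no header row and that rows may omit Value 2.

```python
import csv
import io

COLUMNS = ["section", "subsection", "metric", "value1", "value2"]

def read_metrics(handle):
    """Read a five-column metrics CSV into a list of dicts."""
    rows = []
    for record in csv.reader(handle):
        if not record:
            continue  # skip blank lines
        # Pad short rows so that every row has all five keys.
        record = (record + [""] * len(COLUMNS))[:len(COLUMNS)]
        rows.append(dict(zip(COLUMNS, record)))
    return rows

# Placeholder content standing in for <output prefix>.mapping_metrics.csv.
sample = io.StringIO(
    "EXAMPLE SECTION,,Total input reads,1000,100.00\n"
    "EXAMPLE SECTION,,Mapped reads,950,95.00\n"
    "EXAMPLE SECTION,,Estimated read length,150.00\n"
)
rows = read_metrics(sample)
for row in rows:
    print(row["metric"], row["value1"], row["value2"])
```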
DRAGEN computes mapping and aligning metrics similar to Samtools Flagstat.
Mapping metrics are:
available both on an aggregate level and on a per read group level.
in germline and somatic tumor-only mode only one set of mapping metrics is available.
in somatic tumor-normal mode, the mapping and aligning metrics are generated separately for the tumor and normal samples, with each line beginning with TUMOR or NORMAL to indicate the sample. The metrics for the tumor sample are output first, followed by the metrics for the normal sample. Metrics per read group are also separated into tumor and normal read groups.
unless explicitly stated, the metrics units are in reads (not in terms of pairs).
Definitions:
Total input reads---Total number of reads in the input FASTQ files.
Number of duplicate marked reads---Reads marked as duplicates as a result of the --enable-duplicate-marking
option being set to true.
Number of duplicate marked and mate reads removed---Reads marked as duplicates, along with any mate reads, that are removed when the --remove-duplicates
option is set to true.
Number of unique reads---Total number of reads minus the duplicate marked reads.
Reads with mate sequenced---Number of reads with a mate.
Reads without mate sequenced---Total number of reads minus number of reads with mate sequenced.
QC-failed reads---Reads that did not pass platform/vendor quality checks (SAM flag 0x200).
Mapped reads---Total number of mapped reads
Mapped reads with filtered mapping---Total number of mapped reads plus reads mapped to non-reference decoy contigs plus reads mapped to the rRNA filter contig.
Mapped reads adjusted for excluded mapping---Total number of mapped reads plus reads mapped to the excluded RNA mitochondrial contig.
Mapped reads adjusted for filtered and excluded mapping---Total number of mapped reads plus reads mapped to the rRNA filter contig plus reads mapped to the excluded RNA mitochondrial contig.
Number of unique and mapped reads---Number of mapped reads minus number of duplicate marked reads.
Unmapped reads---Total number of reads that could not be mapped.
Unmapped reads minus filtered mapping---Total number of unmapped reads minus reads mapped to non-reference decoy contigs minus reads mapped to the rRNA filter contig.
Unmapped reads adjusted for excluded mapping---Total number of unmapped reads minus reads mapped to the excluded RNA mitochondrial contig.
Unmapped reads adjusted for filtered and excluded mapping---Total number of unmapped reads minus reads mapped to the rRNA filter contig minus reads mapped to the excluded RNA mitochondrial contig.
Singleton reads---Number of reads where the read could be mapped, but the paired mate could not be read.
Paired reads---Count of reads in which both reads in the pair are mapped.
Properly paired reads---Both reads in the pair are mapped and fall within an acceptable range from each other based on the estimated insert length distribution.
Not properly paired reads (discordant)---The number of paired reads minus the number of properly paired reads.
Paired reads mapped to different chromosomes---The number of reads with a mate, where the mate was mapped to a different chromosome.
Paired reads mapped to different chromosomes (MAPQ >= 10)---The number of reads with a MAPQ >= 10 and with a mate, where the mate was mapped to a different chromosome.
Reads with indel R1---The percentage of R1 reads containing at least 1 indel.
Reads with indel R2---The percentage of R2 reads containing at least 1 indel.
Soft-clipped bases R1---The percentage of bases in R1 reads that are soft-clipped.
Soft-clipped bases R2---The percentage of bases in R2 reads that are soft-clipped.
Mismatched bases R1---The number of mismatched bases on R1, which is the sum of SNP count and indel lengths. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Mismatched bases R2---The number of mismatched bases on R2, which is the sum of SNP count and indel lengths. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Mismatched bases R1 (excluding indels)---The number of mismatched bases on R1. Indel lengths are ignored. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Mismatched bases R2 (excluding indels)---The number of mismatched bases on R2. Indel lengths are ignored. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Q30 Bases---The total number of bases with a BQ >= 30. Includes mapped & unmapped reads & bases. Excludes duplicate marked reads & secondary alignments.
Q30 Bases R1---The total number of bases on R1 with a BQ >= 30.
Q30 Bases R2---The total number of bases on R2 with a BQ >= 30.
Q30 Bases (excluding dups and clipped bases)---The number of bases on non-duplicate and non-clipped bases with a BQ >= 30.
Histogram of reads map qualities
Reads with MAPQ [40:inf)
Reads with MAPQ [30:40)
Reads with MAPQ [20:30)
Reads with MAPQ [10:20)
Reads with MAPQ [0:10)
Total alignments---Total number of loci to which reads aligned with quality > 0.
Secondary alignments---Number of secondary alignment loci.
Supplementary (chimeric) alignments---A chimeric read is split over multiple loci (possibly due to structural variants). One alignment is referred to as the representative alignment. The others are supplementary.
Estimated read length---Total number of input bases divided by the number of reads.
Insert length: mean---Mean insert size estimated for the read group
Insert length: median---Median insert size estimated for the read group
Insert length: standard deviation---Standard deviation of insert size estimated for the read group
Note: The insert length metrics reported above are computed using high quality (MAPQ >= 20) properly paired read pairs, considering all the read pairs for the read group. It may be different from the standard output log reported during insert stats sampling which reports these metrics only for the first ~2M read pairs for DNA (~100K read pairs for RNA).
Whole read group insert length estimation for RNA datasets is currently not supported. For RNA runs, the reported insert length metrics are computed using up to the first ~100K high quality read pairs for the read group from the input FASTQ/BAM/CRAM file.
Input bases divided by reference genome size---Raw coverage as computed by summing all read lengths (including duplicate marked reads, but excluding secondary and supplementary alignments) and dividing by the reference genome size.
Input bases divided by target bed size---Raw coverage as computed by summing all read lengths (including duplicate marked reads, but excluding secondary and supplementary alignments) and dividing by the target bed size.
Estimated sample contamination---The estimated fraction of reads in a sample that may be from another human source.
The DRAGEN cross-sample contamination module uses a probabilistic mixture model to estimate the fraction of reads in a sample that may be from another human source. DRAGEN supports separate modes for germline and somatic samples.
The germline model, like VerifyBamID, assumes that a sample can be modeled as a DNA mixture from 2 or more individuals. Pileup analysis is used to investigate loci where variants are common in the general population. Variants with high allele frequencies are likely to be real germline variants in the individual of interest, while low allele frequency variants at these common germline loci are likely noise or germline variants from a contaminating sample B. The probabilistic mixture model accounts for noise and then tries to detect consistent allele frequency distributions. As an example, if the pileups show consistent low allele frequencies of 1% or 2%, then the mixture model will likely infer 2% contamination from sample B, where the 1% and 2% AF variants correspond to heterozygous and homozygous germline calls in sample B.
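The arithmetic behind the 1% and 2% example can be sketched as follows. This is only an illustration, not the DRAGEN mixture model: it assumes the sample of interest is homozygous reference at the marker site and that a single contaminant contributes a fixed fraction of reads. The function name `expected_contaminant_af` is hypothetical.

```python
def expected_contaminant_af(contamination: float, contaminant_alt_copies: int) -> float:
    """Expected ALT allele fraction at a site where only the contaminant carries ALT.

    Assumes the sample of interest is homozygous reference at the site.
    contaminant_alt_copies is 1 for a heterozygous call in the contaminant
    and 2 for a homozygous call.
    """
    if not 0.0 <= contamination <= 1.0:
        raise ValueError("contamination must be a fraction between 0 and 1")
    if contaminant_alt_copies not in (0, 1, 2):
        raise ValueError("a diploid contaminant carries 0, 1, or 2 ALT copies")
    # The contaminant supplies `contamination` of the reads, and a fraction
    # alt_copies / 2 of those reads carry the ALT allele.
    return contamination * contaminant_alt_copies / 2.0


# 2% contamination: heterozygous sites show about 1% AF, homozygous sites about 2%.
het_af = expected_contaminant_af(0.02, 1)
hom_af = expected_contaminant_af(0.02, 2)
```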
The germline cross-contamination metric is enabled by using the following setting and pointing to a VCF that includes marker sites (RSIDs) with population allele frequencies that are close to 0.5.
--qc-cross-cont-vcf <INSTALL_PATH>/resources/qc/sample_cross_contamination_resource_[hg19 or GRCh37 or GRCh38].vcf
The somatic model, like GATK CalculateContamination, supports tumor-only or tumor-normal runs. The somatic model is more advanced than the germline model in the way that it accounts for somatic CNVs or LoH regions where the diploid assumptions may be broken. The algorithm also tries to account for FFPE deamination and oxidation noise by empirically adjusting base qualities prior to estimation.
The somatic cross-contamination metric is enabled by pointing to the VCF that includes the marker sites (RSIDs) with high population allele frequencies.
--qc-somatic-contam-vcf <INSTALL_PATH>/resources/qc/somatic_sample_cross_contamination_resource_[hg19 or GRCh37 or GRCh38].vcf.gz
The metric value is printed as a fraction, so a value of 0.011 represents 1.1% contamination from another sample.
MAPPING/ALIGNING SUMMARY Estimated sample contamination 0.011
The precision of variant calling, particularly for somatic variants, can be significantly impacted by cross-sample contamination. To ensure safe usage of a sample, the level of cross-sample contamination must be considerably lower than the minimum allele frequencies of interest. For instance, if a sample has 1% contamination, it may be necessary to disregard all variants with less than 5% allele frequency. The cross-contamination metric for a sample reaches saturation near 30% contamination.
The contamination module requires a minimum of 100 valid pileups for contamination estimation, where a pileup is considered valid if it has at least 10X coverage and 95% or more reads are deemed valid. Soft clipped reads that could indicate INDELs or structural variants are not considered valid, and datasets with untrimmed adapters may lead to most reads being soft clipped and classified as invalid. If the contamination module reports "NA," even for high-coverage samples, it is recommended to inspect a few pileup locations in IGV for evidence of untrimmed bases.
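The validity criteria above can be expressed as a small check. This is a minimal sketch under stated assumptions: the `Pileup` record and function names are hypothetical, and how DRAGEN decides that an individual read is valid is not modeled here.

```python
from dataclasses import dataclass

MIN_VALID_PILEUPS = 100          # minimum number of valid pileups (from the text)
MIN_DEPTH = 10                   # minimum coverage for a valid pileup
MIN_VALID_READ_FRACTION = 0.95   # minimum fraction of valid reads in a pileup


@dataclass
class Pileup:
    depth: int        # total reads overlapping the marker site
    valid_reads: int  # reads deemed valid (for example, not soft clipped)


def is_valid_pileup(p: Pileup) -> bool:
    """A pileup is valid if it has enough coverage and enough valid reads."""
    return p.depth >= MIN_DEPTH and p.valid_reads / p.depth >= MIN_VALID_READ_FRACTION


def can_estimate_contamination(pileups) -> bool:
    """True when enough valid pileups exist; otherwise the metric would be NA."""
    return sum(is_valid_pileup(p) for p in pileups) >= MIN_VALID_PILEUPS
```

A dataset with untrimmed adapters would show up here as many pileups with a low `valid_reads` count, which is why the estimate can be NA even at high coverage.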
Optional Contamination Settings:
The generated variant calling metrics are similar to the metrics computed by RTG vcfstats. Metrics are reported for each sample in multi sample VCF and gVCF files and in a csv file with the file name ending in "vc_metrics.csv". Based on the run case, metrics are reported either as standard VARIANT CALLER or JOINT CALLER. Metrics are reported both for the raw (PREFILTER) and hard filtered (POSTFILTER) VCF file.
Panel of Normals (PON) and COSMIC filtered variants are counted as PASS variants in the POSTFILTER VCF metrics. These PASS variants can cause higher than expected variant counts in the POSTFILTER VCF metrics.
Number of samples---Number of samples in the population/ joint VCF.
Reads Processed---The number of reads used for variant calling, excluding any duplicate marked reads and reads falling outside of the target region.
Total---The total number of variants (SNPs + MNPs + indels).
Biallelic---Number of sites in a genome that contain two observed alleles. The reference is counted as one allele, which allows for one variant allele.
Multiallelic---Number of sites in the VCF that contain three or more observed alleles. The reference is counted as one, which allows for two or more variant alleles.
SNPs---A variant is counted as an SNP when the reference, allele 1, and allele 2 are all length 1.
Insertions (Hom)---Number of variants that contain homozygous insertions.
Insertions (Het)---Number of variants where both alleles are insertions, but not homozygous.
Deletions (Hom)---Number of variants that contain homozygous deletions.
Deletions (Het)---Number of variants where both alleles are deletions, but not homozygous.
INDELS (Het)---Number of variants where genotypes are either [insertion+deletion], [insertion+SNP], or [deletion+SNP].
De Novo SNPs---De novo marked SNPs with DQ greater than the threshold. Set the --qc-snp-denovo-quality-threshold option to the required threshold. The default is 0.05 if ML recalibration is off, 0.0017 if ML recalibration is on.
De Novo INDELs---De novo marked indels with DQ greater than the threshold. Set the --qc-indel-denovo-quality-threshold option to the required threshold. The default is 0.4 if ML recalibration is off, 0.04 if ML recalibration is on.
De Novo MNPs---De novo marked MNPs with DQ greater than the threshold. Set the --qc-snp-denovo-quality-threshold option to the required threshold. The default is 0.05 if ML recalibration is off, 0.0017 if ML recalibration is on.
(Chr X SNPs)/(Chr Y SNPs) ratio in the genome (or the target region) ---Number of SNPs in chromosome X (or in the intersection of chromosome X with the target region) divided by the number of SNPs in chromosome Y (or in the intersection of chromosome Y with the target region). If there was no alignment to either chromosome X or chromosome Y, this metric shows as NA.
SNP Transitions---An interchange of two purines (A<->G) or two pyrimidines (C<->T).
SNP Transversions---An interchange of purine and pyrimidine bases.
Ti/Tv ratio---Ratio of transitions to transversions.
Heterozygous---Number of heterozygous variants.
Homozygous---Number of homozygous variants.
Het/Hom ratio---Heterozygous/ homozygous ratio.
In dbSNP---Number of variants detected that are present in the dbSNP reference file. If no dbSNP file is provided via the --dbsnp option, then both the In dbSNP and Novel metrics show as NA.
Novel---Total number of variants minus number of variants in dbSNP.
Percent Callability---Available in germline and somatic modes with gVCF output. The percentage of non-N reference positions having a PASSing genotype call. Multiallelic variants are not counted. Deletions are counted for all the deleted reference positions only for homozygous calls. Only autosomes and chromosomes X, Y, and M are considered. To produce this metric for non-human references, set --qc-callability-autosome-contigs to specify the autosome contig names. Optionally, --qc-callability-xym-contigs allows setting X, Y and M contig names.
Percent Autosome Callability---Only autosomes are considered. To produce this metric for non-human references, set --qc-callability-autosome-contigs to specify the autosome contig names.
Percent QC Region Callability in Region i (i is equivalent to regions 1, 2, or 3)---Available if callability for custom regions is requested via the --qc-coverage-region-i option and the callability output is specified with --qc-coverage-reports-i. All contigs are considered. Setting --qc-callability-autosome-contigs enables outputting this metric for non-human references.
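The transition and transversion definitions in the list above can be made concrete with a short sketch. The function names are hypothetical, and the input is assumed to be a list of (REF, ALT) base pairs for biallelic SNPs; this is not how DRAGEN computes the metric internally.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def snp_class(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a SNP between A/C/G/T: {ref}>{alt}")
    # Transition: both bases are purines or both are pyrimidines.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ti_tv_ratio(snps):
    """Ratio of transitions to transversions, or None if there are no transversions."""
    classes = [snp_class(ref, alt) for ref, alt in snps]
    ti = classes.count("transition")
    tv = classes.count("transversion")
    return ti / tv if tv else None
```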
When the germline small variant caller is executed, DRAGEN calculates a het/hom ratio per contig.
The het/hom ratio values can be used as an indication of whole chromosome uniparental disomy (UPD). UPD of certain chromosomes is associated with genetic syndromes known as imprinting disorders. Chromosomes with whole chromosome UPD have het/hom ratios close to 0.0. Ratios for other chromosomes vary, but are usually between 1.0 and 2.0. The het/hom ratios should be interpreted in the context of the specific assay.
DRAGEN reports the ratios for both the raw (PREFILTER) and hard-filtered (POSTFILTER) VCF. The metrics are output to the .vc_hethom_ratio_metrics.csv file.
The file contains the following values for each primary contig processed.
Contig
Number of heterozygous variants
Number of homozygous variants
Het/Hom ratio
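The per-contig values listed above can be derived from genotype calls roughly as follows. This sketch assumes genotypes are given as tuples of allele indices (0 is the reference), counts any call with two different alleles as heterozygous, and counts non-reference calls with identical alleles as homozygous. The function name is hypothetical and the exact counting rules DRAGEN applies may differ.

```python
from collections import defaultdict


def het_hom_ratio_by_contig(calls):
    """Return {contig: (het_count, hom_count, ratio)} from (contig, alleles) pairs."""
    counts = defaultdict(lambda: [0, 0])  # contig -> [het, hom]
    for contig, alleles in calls:
        if len(set(alleles)) > 1:
            counts[contig][0] += 1        # two different alleles: heterozygous
        elif alleles[0] != 0:
            counts[contig][1] += 1        # identical non-reference alleles: homozygous
    return {
        contig: (het, hom, het / hom if hom else None)
        for contig, (het, hom) in counts.items()
    }
```

A contig whose ratio is close to 0.0 while the other contigs sit in their usual range would be the pattern described above for whole chromosome UPD.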
The following example shows a section of the metrics.
DRAGEN supports a number of reports dedicated to coverage metrics. Some other DRAGEN components, including the mapper and aligner, ploidy caller, and variant callers, may emit limited coverage-related metrics. The metrics from these other components may not always exactly match the metrics in the DRAGEN coverage reports. The following table lists some important differences.
Table 6 Coverage reported in files other than the main coverage reports
The coverage reports listed in Table 7 all follow the same default rules for counting or excluding reads:
Duplicate reads are ignored.
Soft and/or hard-clipped bases are ignored.
Overlapping mates are double-counted.
Reads with MAPQ > 0 are included (i.e. MAPQ=0 reads are filtered).
Bases with BQ >= 0 are included.
Table 7 DRAGEN Coverage Reports
DRAGEN coverage reports will by default be generated over the whole genome, and if provided also over a target region. DRAGEN additionally supports the ability to specify custom regions and report types of interest.
In somatic tumor-normal mode, DRAGEN generates separate reports for the tumor and normal samples. Each report is labeled according to the sample type. Tumor sample reports include tumor at the end of the file name, and normal sample reports include normal at the end of the file name. To include both tumor and normal sample results in one file, set the --vc-enable-separate-t-n-metrics option to false. DRAGEN then reports on the aggregate of both samples.
The coverage reports do not require the mapper or variant callers. However, they are not compatible with --enable-sort=false.
The following command shows an example use case for specifying custom coverage reports:
The settings --qc-coverage-region-i and --qc-coverage-reports-i work as a pair (i can be 1, 2, or 3). The former setting specifies the region, while the latter specifies the report type for that region. The number i links the settings. Up to 3 such region and report pairs may be specified.
The --qc-coverage-region-i option requires a BED file argument (i can be 1, 2, or 3).
Regions in each BED file can optionally be padded using the --qc-coverage-region-padding-i option (by default, no padding is applied).
A set of default reports is generated for each region.
Additional reports can be specified for each region by using the --qc-coverage-reports-i option.
If multiple report types are selected per region, they should be space-separated, e.g. --qc-coverage-reports-1 callability full_res.
Default settings used for all DRAGEN coverage reports:
Duplicate reads are ignored.
Soft and/or hard-clipped bases are ignored.
Overlapping mates are double-counted.
Reads with MAPQ > 0 are included. MAPQ=0 reads are filtered.
Bases with BQ >= 0 are included.
Non-default setting
As an example, the following options are used to enable full (base pair) resolution coverage output with more stringent MAPQ and BQ filtering:
The argument syntax mapq<value,bq<value implies that reads with a mapping quality less than the specified value, or bases with a read base call quality below the specified value, will be ignored.
Valid filter arguments are mapq and bq only. Either, or both, can be specified.
Only one operator < is supported. <=, >, >=, = are not supported.
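The syntax rules above can be illustrated with a small parser. This is a sketch of the grammar as described in this section, not DRAGEN's own argument handling; the function name and the choice to reject duplicate keys are assumptions.

```python
import re

# One filter term: the key mapq or bq, the only supported operator "<", an integer.
_TERM = re.compile(r"^(mapq|bq)<(\d+)$")


def parse_coverage_filter(arg: str) -> dict:
    """Parse a filter string such as 'mapq<20,bq<30' into {'mapq': 20, 'bq': 30}."""
    thresholds = {}
    for term in arg.split(","):
        match = _TERM.match(term.strip())
        if match is None:
            raise ValueError(f"unsupported filter term: {term!r}")
        key, value = match.group(1), int(match.group(2))
        if key in thresholds:
            raise ValueError(f"filter {key!r} specified more than once")
        thresholds[key] = value
    return thresholds
```

With the thresholds above, reads with MAPQ below 20 and bases with BQ below 30 would be ignored. A term such as `mapq<=20` is rejected because only `<` is accepted.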
By default DRAGEN will emit a _coverage_metrics.csv file for each available region type, including the full genome, target region, and any additionally specified QC regions.
The _coverage_metrics.csv file is generally the most useful of all the coverage reports and will probably be the first file to inspect when performing coverage based QC.
The first column of the output file contains the section name COVERAGE SUMMARY and the second column (the subsection) is empty for all entries in the file.
The following metrics are calculated:
Aligned bases in region---Number of uniquely mapped bases to region and the percentage relative to the number of uniquely mapped bases to the genome.
Average alignment coverage over region---Number of uniquely mapped bases to region divided by the number of sites in region.
Uniformity of coverage (PCT > 0.2*mean) over region---Percentage of sites with coverage greater than 20% of the mean coverage in region.
PCT of region with coverage [ix, inf)---Percentage of sites in region with at least ix coverage, where i can equal 100, 50, 20, 15, 10, 3, 1 and 0.
PCT of region with coverage [ix, jx)---Percentage of sites in region with at least ix but less than jx coverage, where (i, j) can equal (50, 100), (20, 50), (15, 20), (10, 15), (3, 10), (1, 3) and (0, 1).
Average chromosome X coverage over region---Total number of bases that aligned to the intersection of chromosome X with region divided by the total number of loci in the intersection of chromosome X with region. If there is no chromosome X in the reference genome or the region does not intersect chromosome X, this metric shows as NA.
Average chromosome Y coverage over region---Total number of bases that aligned to the intersection of chromosome Y with region divided by the total number of loci in the intersection of chromosome Y with region. If there is no chromosome Y in the reference genome or the region does not intersect chromosome Y, this metric shows as NA.
XAvgCov/YAvgCov ratio over genome/target region---Average chromosome X alignment coverage in region divided by the average chromosome Y alignment coverage in region. If there is no chromosome X or chromosome Y in the reference genome or the region does not intersect chromosome X or Y, this metric shows as NA.
Average mitochondrial coverage over region---Total number of bases that aligned to the intersection of the mitochondrial chromosome with region divided by the total number of loci in the intersection of the mitochondrial chromosome with region. If there is no mitochondrial chromosome in the reference genome or the region does not intersect mitochondrial chromosome, this metric shows as NA.
Average autosomal coverage over region---Total number of bases that aligned to the autosomal loci in region divided by the total number of loci in the autosomal loci in region. If there is no autosome in the reference genome, or the region does not intersect autosomes, this metric shows as NA.
Median autosomal coverage over region---Median alignment coverage over the autosomal loci in region. If there is no autosome in the reference genome or the region does not intersect autosomes, this metric shows as NA.
Mean/Median autosomal coverage ratio over region---Mean autosomal coverage in region divided by the median autosomal coverage in region. If there is no autosome in the reference genome or the region does not intersect autosomes, this metric shows as NA.
Aligned reads in region---Number of uniquely mapped reads to region and percentage relative to the number of uniquely mapped reads to the genome. Only reads with MAPQ ≥ 1 are included. Secondary and supplementary alignments are ignored.
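Several of the metrics above are simple functions of the per-site depth. The following sketch computes the average coverage, the uniformity metric, and the `[ix, inf)` percentages from a list of depths. The function name is hypothetical, and the depths are assumed to be already filtered according to the default counting rules.

```python
def coverage_summary(depths, levels=(100, 50, 20, 15, 10, 3, 1, 0)):
    """Return (mean coverage, uniformity percentage, {i: percentage of sites >= i})."""
    n = len(depths)
    if n == 0:
        raise ValueError("region has no sites")
    mean = sum(depths) / n
    # Uniformity: percentage of sites with coverage greater than 20% of the mean.
    uniformity = 100.0 * sum(d > 0.2 * mean for d in depths) / n
    # PCT of region with coverage [ix, inf) for each reported level i.
    pct_at_least = {i: 100.0 * sum(d >= i for d in depths) / n for i in levels}
    return mean, uniformity, pct_at_least
```

The `[ix, jx)` metrics follow by subtraction, for example `pct_at_least[10] - pct_at_least[15]` for the `[10x, 15x)` range.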
The following is an example of the contents of the _coverage_metrics.csv file:
The fine histogram report outputs a _fine_hist.csv file, which contains two columns: Depth and Overall. The value in the Depth column ranges from 0 to 2000+ and the Overall column indicates the number of loci covered at the corresponding depth.
Masked regions in the FASTA are ignored and no depth is reported for these regions.
The histogram report outputs a _hist.csv file, which provides the following:
Percentage of bases in the coverage BED/target BED/WGS region that fall within a certain range of coverage.
Duplicate reads are ignored if DRAGEN is run with --enable-duplicate-marking true.
The following ranges are used: "[100x:inf)", "[1x:3x)", "[0x:1x)"
Masked regions in the FASTA are ignored and no depth is reported for these regions.
The Overall Mean Coverage report generates an _overall_mean_cov.csv file, which contains the average alignment coverage over the coverage BED/target BED/WGS, as applicable.
The following is an example of the contents of the _overall_mean_cov.csv file:
Average alignment coverage over target_bed,80.69
Masked regions in the FASTA are ignored and no depth is reported for these regions.
The Contig Mean Coverage report generates a _contig_mean_cov.csv file, which contains the estimated coverage for all contigs and an autosomal estimated coverage. The file includes the following three columns:
Masked regions in the FASTA are ignored and no depth is reported for these regions.
The Full Res Report outputs a _full_res.bed file in tab-delimited format. The first three columns are the standard BED fields, and the fourth column is the depth. Each record in the file is for a given interval that has a constant depth. If the depth changes, then a new record is written to the file. Alignments that have a mapping quality value of 0, duplicate reads, and clipped bases are not counted towards the depth.
Only base positions that fall under the user-specified coverage-region bed regions are present in the _full_res.bed output file.
The _full_res.bed file structure is similar to the output file of bedtools genomecov -bg. The contents are identical if the bedtools command line is executed after filtering out alignments with mapping quality value of 0, and possibly filtering by a target BED (if specified).
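The constant-depth interval layout can be illustrated by collapsing a per-base depth vector into runs. This is a sketch only: the function name is hypothetical, coordinates are assumed to be 0-based half-open as in standard BED, and zero-depth runs are emitted here, which is an assumption about the report rather than a documented behavior.

```python
def depth_to_bed(contig, start, depths):
    """Collapse per-base depths into (contig, start, end, depth) records.

    A new record begins whenever the depth changes.
    """
    records = []
    run_start = 0
    for i in range(1, len(depths) + 1):
        # Close the current run at the end of the data or when the depth changes.
        if i == len(depths) or depths[i] != depths[run_start]:
            records.append((contig, start + run_start, start + i, depths[run_start]))
            run_start = i
    return records


for record in depth_to_bed("chr1", 100, [5, 5, 6, 6, 6, 0]):
    print("\t".join(str(field) for field in record))
```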
The following is an example of the contents of the _full_res.bed file:
Coverage is reported for all positions specified by qc-coverage-region-i. Masked regions in the FASTA are not ignored.
When --enable-metrics-compression is set to true, the 1 bp resolution coverage metrics output BED file (_full_res.bed) is compressed to bigwig format.
The cov_report report generates a _cov_report.bed file in a tab-delimited format. This report includes summary coverage statistics per BED region. The first three columns are standard BED fields. The DRAGEN Amplicon pipeline includes a fourth column for name and a fifth column for gene_id. The remaining column fields are statistics calculated over the interval region specified on the same record line.
The following table lists the appended columns.
total_cvg---The total coverage value.
mean_cvg---The mean coverage value.
Q1_cvg---The lower quartile (25th percentile) coverage value.
median_cvg---The median coverage value.
Q3_cvg---The upper quartile (75th percentile) coverage value.
min_cvg---The minimum coverage value.
max_cvg---The maximum coverage value.
pct_above_X---Indicates the percentage of bases over the specified interval region that had a depth coverage greater than X.
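The appended columns can be approximated from the per-base depths of one interval as shown below. This is a sketch with assumptions: the function name is hypothetical, and the quartile interpolation method (Python's `inclusive` method) is a choice made for the example, since the method DRAGEN uses is not stated here.

```python
import statistics


def interval_stats(depths, thresholds=(5, 15, 20, 30, 50, 100, 200, 300, 400, 500, 1000)):
    """Summary statistics for one BED interval, keyed by the column names above."""
    if len(depths) < 2:
        raise ValueError("this sketch needs at least two positions per interval")
    q1, median, q3 = statistics.quantiles(depths, n=4, method="inclusive")
    stats = {
        "total_cvg": sum(depths),
        "mean_cvg": sum(depths) / len(depths),
        "Q1_cvg": q1,
        "median_cvg": median,
        "Q3_cvg": q3,
        "min_cvg": min(depths),
        "max_cvg": max(depths),
    }
    for x in thresholds:
        # Percentage of bases in the interval with depth strictly greater than x.
        stats[f"pct_above_{x}"] = 100.0 * sum(d > x for d in depths) / len(depths)
    return stats
```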
By default, if an interval has a total coverage of 0, then the record is written to the output file. To filter out intervals with zero coverage, set vc-emit-zero-coverage-intervals to false in the configuration file.
By default, if --qc-coverage-region-i-thresholds are not set, the thresholds will default to 5, 15, 20, 30, 50, 100, 200, 300, 400, 500, 1000.
The following is an example of the contents of the _cov_report.bed file:
The read_cov_report report generates a _read_cov_report.bed file in a tab-delimited format. The first five columns are the chrom, start, end, name, and gene_id BED fields. The following additional columns represent statistics that are calculated over the interval region specified on the same record line.
total_cvg---The total coverage value.
read1_cvg---The total Read 1 coverage value.
read2_cvg---The total Read 2 coverage value.
If an alignment overlaps more than one region, the alignment is counted toward the region with the largest overlap. If the alignment overlaps equally with more than one region, the alignment is counted toward the first intersecting region.
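The assignment rule can be sketched as follows. The function name is hypothetical, and regions are assumed to be given as 0-based half-open `(start, end)` pairs on the same contig, in file order.

```python
def assign_region(aln_start, aln_end, regions):
    """Return the index of the region an alignment is counted toward, or None."""
    best_index, best_overlap = None, 0
    for index, (region_start, region_end) in enumerate(regions):
        overlap = min(aln_end, region_end) - max(aln_start, region_start)
        # Strict comparison: on a tie, the first intersecting region is kept.
        if overlap > best_overlap:
            best_index, best_overlap = index, overlap
    return best_index
```

For example, with regions `[(0, 100), (100, 200)]`, an alignment spanning 90 to 110 overlaps both regions by 10 bases and is counted toward the first, while an alignment spanning 95 to 110 is counted toward the second.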
The following shows the contents of the _read_cov_report.bed file:
Callability is defined as the fraction of positions in the genome or target region having a gVCF PASSing genotype call. The callability report can be interpreted as the fraction of sites in the genome or target BED where the small variant caller had sufficient information (enough good quality reads) to confidently call either a variant or a HOM-REF region.
The callability report requires DRAGEN to be run in gVCF mode. When gVCF mode is enabled, DRAGEN will automatically generate a callability report as part of variant caller metrics.
The following criteria are used to calculate callability metrics:
Callability is calculated over all positions included in the gVCF.
Decoy contigs are ignored.
Unplaced and unlocalized contigs are ignored.
Masked regions in the FASTA (bases set to N) are ignored.
For regions where no variant calling was performed, callability is 0.
A homozygous deletion counts as a PASSing genotype call for all the reference positions spanned by the deletion.
If the --vc-target-bed option is specified, the output is a target_bed_callability.bed file that contains the overall and autosome callability over the input target bed region. The padding size specified by the --vc-target-bed-padding option is used and overlapping regions are merged.
Callability can also be output over custom regions. If the --qc-coverage-region-i option is used with --qc-coverage-reports-i (where i is 1, 2, or 3), callability can be added as a report type for that region. The output is a qc-coverage-region-i_callability.bed file. For each specified qc-coverage-region-i file, the average callability is reported in the variant calling metrics file. The padding size specified by the --qc-coverage-region-padding-i option is used and overlapping regions are merged.
The optional min MAPQ and min BQ filters only influence read and base counting and do not influence the callability reports. The callability reports depend only on the gVCF PASS variants.
The following table shows which outputs are generated when default options (--vc-target-bed) versus optional coverage region options (--coverage-region) are used.
The GC bias report provides information on GC content and the associated read coverage across a genome. The DRAGEN GC bias metric is modeled after the Picard implementation and adapted to preexisting internal measures. The DRAGEN GC bias correction module attempts to correct these biases following the target count stage. For more information, see GC Bias Correction.
The GC bias metric is computed as follows.
Calculates GC content using a 100 bp wide, per-base rolling window over all chromosomes in the reference genome, excluding any decoys and alternate contigs. Windows containing more than four masked (N) bases in the reference are discarded.
Calculates the average coverage for each window, excluding any non-PF, duplicate, secondary, and supplementary reads.
Calculates the average global coverage across the whole genome.
Groups valid windows based on the percentage of GC content, both at individual percentages and five 20% ranges as summary.
Calculates the normalized coverage for each group by dividing the average coverage for the bin by the global average coverage across the genome. Values below 1.0 indicate a lower than expected coverage at the given GC percent or range. Coverages significantly deviating from 1.0 at greater GC values are an expected result.
Calculates dropout metrics as the sum of all positive values of (percentage of windows at GC X - percentage of aligned reads at GC X), taken over each GC ≤ 50% for AT dropout and over each GC > 50% for GC dropout.
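The steps above can be sketched on a toy scale as follows. This is not the DRAGEN or Picard implementation. The function name is hypothetical, and several simplifications are assumed: per-base depth stands in for aligned reads, GC percentage is computed over the full window length, the global average is taken over the valid windows only, and the 20% summary ranges are omitted.

```python
from collections import defaultdict


def gc_bias(reference, depths, window=100, max_n=4):
    """Return (normalized coverage by GC percent, AT dropout, GC dropout)."""
    window_count = defaultdict(int)     # GC percent -> number of valid windows
    coverage_sum = defaultdict(float)   # GC percent -> sum of window mean coverage
    for start in range(len(reference) - window + 1):
        seq = reference[start:start + window].upper()
        if seq.count("N") > max_n:
            continue                    # discard windows with too many masked bases
        gc = round(100 * (seq.count("G") + seq.count("C")) / window)
        window_count[gc] += 1
        coverage_sum[gc] += sum(depths[start:start + window]) / window

    total_windows = sum(window_count.values())
    total_coverage = sum(coverage_sum.values())
    if total_windows == 0 or total_coverage == 0:
        return {}, 0.0, 0.0

    global_mean = total_coverage / total_windows
    normalized = {
        gc: (coverage_sum[gc] / window_count[gc]) / global_mean for gc in window_count
    }

    at_dropout = gc_dropout = 0.0
    for gc in window_count:
        pct_windows = 100.0 * window_count[gc] / total_windows
        pct_coverage = 100.0 * coverage_sum[gc] / total_coverage
        diff = pct_windows - pct_coverage
        if diff > 0:                    # only positive differences count as dropout
            if gc <= 50:
                at_dropout += diff
            else:
                gc_dropout += diff
    return normalized, at_dropout, gc_dropout
```

A normalized value below 1.0 for a GC bin indicates lower than expected coverage at that GC percentage, matching the interpretation given above.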
By default, the GC bias metric report is not calculated. To enable GC Bias calculations, enter the --gc-metrics-enable command line option. The following is an example command:
$ dragen -b <BAM file> -r <reference genome> --gc-metrics-enable=true
The GC metrics report generates a gc_metrics.csv file. The file is structured as follows.
The GC bias report also includes the following command line options, but they are not recommended.
| Setting | Description |
|:---|:---|
| --gc-metrics-window-size | Overrides the default rolling window size of 100 bp. |
| --gc-metrics-num-bins | Overrides the number of summary bins. |
In somatic mode, DRAGEN automatically generates a somatic callable regions report as a bed file. The somatic callable regions report includes all regions with tumor coverage at least as high as the tumor threshold and (if applicable) normal coverage at least as high as the normal threshold. If only the tumor sample is provided, then the report includes all regions with tumor coverage at least as high as the tumor threshold. Each line in the bed output file is formatted as follows.
chromosome region_start region_end
You can specify the threshold values using the --vc-callability-tumor-thresh or --vc-callability-normal-thresh options. The default value for the tumor threshold is 15. The default value for the normal threshold is 5. For more information on each option, see Somatic Mode Options.
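The thresholding rule can be sketched as follows, assuming per-base depth vectors for one contig and 0-based half-open output coordinates. The function name is hypothetical, and how DRAGEN computes the underlying depths is not modeled.

```python
def somatic_callable_regions(chrom, tumor_depths, normal_depths=None,
                             tumor_thresh=15, normal_thresh=5):
    """Return (chrom, start, end) runs where the coverage thresholds are met.

    If normal_depths is None (tumor-only), only the tumor threshold applies.
    """
    regions = []
    run_start = None
    for pos, tumor_depth in enumerate(tumor_depths):
        callable_here = tumor_depth >= tumor_thresh and (
            normal_depths is None or normal_depths[pos] >= normal_thresh
        )
        if callable_here and run_start is None:
            run_start = pos                           # open a new callable run
        elif not callable_here and run_start is not None:
            regions.append((chrom, run_start, pos))   # close the current run
            run_start = None
    if run_start is not None:
        regions.append((chrom, run_start, len(tumor_depths)))
    return regions
```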
If the target bed or the --qc-coverage-region-i option (where i is 1, 2, or 3) is included in the run, DRAGEN generates corresponding somatic callable regions bed files in addition to the whole genome somatic callable region bed file.
The duration metrics section includes a breakdown of the run duration for each process. For example, the following metrics are generated for the mapper and variant caller pipeline:
Time loading reference
Time aligning reads
Time sorting and marking duplicates
Time DRAGStr calibration
Time partial reconfiguration
Time variant calling
Total run time
DRAGEN Homologous Recombination Deficiency (HRD) Scoring takes in allele-specific copy number calls in either VCF format or directly streamed from somatic copy number callers. DRAGEN HRD then calculates scores for Loss of Heterozygosity (LOH), Telomeric Allelic Imbalance (TAI), and Large-Scale State Transition (LST). The three scores are output to the .hrdscore.csv file. You can only use DRAGEN HRD when inputting results from WGS somatic CNV calling or ASCN WES somatic CNV calling.
Use the following command-line options to run HRD scoring. You can run HRD scoring with somatic CNV calling or after using somatic CNV calling results.
To run HRD scoring together with somatic CNV calling, use the following options. For more CNV parameters, please refer to CNV calling.
--enable-hrd=true
Set to true to enable HRD scoring to quantify genomic instability.
--enable-cnv=true
Set to true to enable CNV calling to run together with HRD scoring.
To run HRD scoring after somatic CNV calling, use the following options:
--enable-hrd=true
Set to true to enable HRD scoring to quantify genomic instability.
--hrd-input-ascn
Specify the allele-specific copy number file (*cnv.vcf.gz). The CNV VCF file should include REF calls for proper HRD segmentation. See the option --cnv-enable-ref-calls in the CNV section.
--hrd-input-tn
Specify the tumor normalized bin count file (*.tn.tsv.gz).
If the reference cannot be auto-detected, specify the centromere and blacklist files with the following options:
--hrd-input-centromere
Centromere locations per chromosome in tsv format
--hrd-input-blacklist
Blacklist bed file
The following metrics are included in the .hrdscore.csv output file. The following is an example output file.
The following example command runs the HRD end-to-end workflow with CNV. This is an example of somatic WGS T/N. See the Somatic CNV section for other use cases. HRD is supported for any CNV workflow that supports ASCN and only requires adding --enable-hrd=true to the CNV command line.
The following example command runs HRD standalone.
The Ploidy Caller uses the per contig median coverage values from the Ploidy Estimator to detect aneuploidy and chromosomal mosaicism in mammalian germline samples from whole genome sequencing data.
The Ploidy Caller runs by default except in the following circumstances:
The Ploidy Estimator cannot determine if the input data is from whole genome sequencing. For example, data from exome or targeted sequencing.
The reference genome does not contain any autosomes following the expected naming convention (e.g. chr1 or 1).
There is no germline sample. For example, tumor-only analysis.
Chromosomal mosaicism is detected when there is a significant shift in median coverage of a chromosome compared to the overall autosomal median coverage.
The following table displays some examples of expected shifts in coverage for a given aneuploidy and mosaic fraction.
Neutral Copy Number | Variant Copy Number | Mosaic Fraction | Expected Coverage Shift |
---|---|---|---|
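The relationship between these columns can be sketched with simple mixture arithmetic. This assumes the sample is a mixture of cells at the neutral copy number and cells at the variant copy number, and that coverage is proportional to the average copy number. The function name is hypothetical and this is not taken from the Ploidy Caller implementation.

```python
def expected_coverage_shift(neutral_cn: int, variant_cn: int, mosaic_fraction: float) -> float:
    """Expected relative coverage shift of a chromosome (for example, -0.05 for -5%).

    mosaic_fraction is the fraction of cells that carry the variant copy number.
    """
    if neutral_cn <= 0:
        raise ValueError("neutral copy number must be positive")
    if not 0.0 <= mosaic_fraction <= 1.0:
        raise ValueError("mosaic fraction must be between 0 and 1")
    # Average copy number across cells, relative to the neutral copy number.
    mean_cn = (1.0 - mosaic_fraction) * neutral_cn + mosaic_fraction * variant_cn
    return mean_cn / neutral_cn - 1.0
```

Under these assumptions, a monosomy in 10% of cells on a diploid chromosome gives a shift of about -5%, and a trisomy in 20% of cells gives about +10%.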
The Ploidy Caller models coverage as a normal distribution for both the null (neutral) and the alternative (mosaic) hypotheses. The two normal distributions have equal mean at the median autosomal coverage for the sample, but the variance of the alternative normal distribution is greater than that of the null normal distribution. The baseline variance of the two models at 30x coverage was determined empirically from a cohort of ~2500 WGS samples. The actual variance used for the two models is calculated from the baseline variance at 30x coverage, adjusting for the median autosomal coverage of the sample. Below are the likelihood distributions for the null and alternative hypotheses for a sample with 35x median autosomal coverage.
After applying an empirically estimated prior for chromosomal mosaicism the Ploidy Caller generates ploidy calls according to the posterior probability of the null and alternative hypotheses as shown below for a sample with 35x median autosomal sequencing coverage.
At 35x median autosomal coverage, the threshold for deciding between a neutral (REF) and an alternative (DEL or DUP) call is roughly at +/- 5% shift in coverage for an autosome. At 100x median autosomal coverage, the threshold is at roughly +/- 3% shift in coverage for an autosome. A Q20 threshold is used to filter low quality calls.
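The two-hypothesis model can be illustrated with a small Bayesian calculation on the relative coverage shift. Everything numeric in this sketch is hypothetical: the standard deviations and the prior passed to the functions are placeholders, not the empirically determined DRAGEN values, and the quality score is assumed to be the Phred-scaled posterior probability of the hypothesis that was not chosen.

```python
import math


def normal_pdf(x, sd):
    """Density of a zero-mean normal distribution with standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def posterior_alt(shift, sd_null, sd_alt, prior_alt):
    """Posterior probability of the mosaic hypothesis for a relative coverage shift.

    Both hypotheses are centered on zero shift (the median autosomal coverage);
    the alternative has the larger standard deviation.
    """
    like_null = normal_pdf(shift, sd_null)
    like_alt = normal_pdf(shift, sd_alt)
    evidence = prior_alt * like_alt + (1.0 - prior_alt) * like_null
    if evidence == 0.0:
        return 1.0  # both densities underflowed; the wider distribution dominates
    return prior_alt * like_alt / evidence


def call_ploidy(shift, sd_null, sd_alt, prior_alt):
    """Return (call, quality), where call is REF, DEL, or DUP."""
    p_alt = posterior_alt(shift, sd_null, sd_alt, prior_alt)
    if p_alt < 0.5:
        call, p_error = "REF", p_alt
    else:
        call, p_error = ("DUP" if shift > 0 else "DEL"), 1.0 - p_alt
    quality = -10.0 * math.log10(max(p_error, 1e-30))
    return call, quality


# Hypothetical parameters: 2% and 10% relative standard deviations, 1% prior.
call, quality = call_ploidy(-0.10, sd_null=0.02, sd_alt=0.10, prior_alt=0.01)
low_quality = quality < 20  # would correspond to the Q20 filter described above
```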
In addition to detecting aneuploidy and chromosomal mosaicism in autosomes where the expected reference ploidy is 2, the Ploidy Caller can also detect these variants in allosomes. The reference sex karyotype used for making calls on the allosomes is determined from the sex karyotype of the sample either provided on the command line using the --sample-sex option or from the Ploidy Estimator. If the sex karyotype of the sample is not provided on the command line and not determined by the Ploidy Estimator, then the sex karyotype is assumed to be XX. Whenever the sex karyotype contains at least one Y chromosome, the reference sex karyotype is XY. If the sex karyotype does not contain at least one Y chromosome, then the reference sex karyotype is XX. The following table displays each of the possible sex karyotypes for a sample. If the Y chromosome reference ploidy is zero, then ploidy calling is not performed on the Y chromosome.
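The rule for choosing the reference sex karyotype can be written as a short function. This is a sketch of the rule as stated above; the function name and the string representation of karyotypes (such as "XXY" or "X0") are assumptions.

```python
from typing import Optional


def reference_sex_karyotype(sample_karyotype: Optional[str]) -> str:
    """Map a sample sex karyotype to the reference karyotype used for calling.

    None or an empty string means the karyotype was neither provided on the
    command line nor determined by the Ploidy Estimator.
    """
    if not sample_karyotype:
        return "XX"  # undetermined karyotypes are assumed to be XX
    # Any karyotype with at least one Y chromosome uses XY as the reference.
    return "XY" if "Y" in sample_karyotype.upper() else "XX"
```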
The Ploidy Caller generates a <output-file-prefix>.ploidy.vcf.gz file in the output directory. The output file follows the VCF 4.2 Specification. A single record is reported for each reference autosome and allosome, except for the Y chromosome if the reference sex karyotype is XX. Calls are not made for other sequences in the reference genome, such as mitochondrial DNA, unlocalized or unplaced sequences, alternate contigs, decoy contigs, or the Epstein-Barr virus sequence. The VCF header is annotated with ##source=DRAGEN_PLOIDY to indicate the file is generated by the DRAGEN PLOIDY pipeline.
The following information is provided in the VCF file.
Meta-information--The VCF output file contains common meta-information such as `DRAGENVersion` and `DRAGEN CommandLine`, as well as Ploidy Caller specific information. The VCF header contains the meta-information for median autosome depth of coverage, the provided sex karyotype if available, the estimated sex karyotype from the Ploidy Estimator if available, and the reference sex karyotype. The following is an example of the header lines:
FILTER Fields--The VCF output file includes the LowQual filter, which filters results with quality score below 20.
INFO Fields--The VCF output INFO fields include the following:
END—End position of the variant described in this record.
SVTYPE—Type of structural variant.
FORMAT Fields--The VCF output file includes the following format fields. There is no `GT` FORMAT field. A variant call in the VCF displays either `<DUP>` or `<DEL>` in the ALT column. A non-variant call displays `.` in the ALT column. If using the output file for downstream use, a GT field can be added for variant calls using `./1` for a diploid contig and `1` for a haploid contig. For non-variant calls, use `0/0` for diploid and `0` for haploid.
DC—Depth of coverage.
NDC—Normalized depth of coverage.
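The GT guidance for downstream use can be sketched as a small helper. This is a minimal illustration under stated assumptions: `ploidy_gt` is a hypothetical name, and the caller is assumed to already know the reference ploidy (2 or 1) of the contig for each record.

```python
def ploidy_gt(alt, contig_reference_ploidy):
    """Return a GT string to attach to a ploidy VCF record for downstream tools.

    alt is the ALT column value: "<DUP>" or "<DEL>" for a variant call and
    "." for a non-variant call. contig_reference_ploidy is 2 or 1.
    """
    is_variant = alt in ("<DUP>", "<DEL>")
    if contig_reference_ploidy == 2:
        return "./1" if is_variant else "0/0"
    if contig_reference_ploidy == 1:
        return "1" if is_variant else "0"
    raise ValueError("expected a reference ploidy of 1 or 2")


print(ploidy_gt("<DUP>", 2))  # diploid contig, variant call
print(ploidy_gt(".", 1))      # haploid contig, non-variant call
```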
The following is an example output file for a sample with mosaic loss of the Y chromosome.
The following is an example output file for a sample with trisomy 21.
Samples derived from cell lines frequently have coverage artifacts that might result in variant ploidy calls on some chromosomes. Chromosomes 17, 19, and 22 are the most common for the cell line coverage artifacts. When performing accuracy assessments of ploidy calls on cell line samples, filter out chromosomes with known cell line artifacts.
DRAGEN can subsample a random fraction of reads from an input file using the fractional downsampler. You can use downsampling to subsample data sets in order to simulate different amounts of sequencing. DRAGEN randomly subsamples reads from primary analysis without any modification (for example, no trimming and no filtering).
To enable fractional downsampling, set the `--enable-fractional-down-sampler` command line option to `true`.
Any valid sequencing data format that is compatible with the DRAGEN Host Software can be used. For more information on compatible input options, see Input Options.
In addition to enabling the fractional downsampling command line option, you must set the subsample fraction to downsample. To set the subsample fraction, use `--down-sampler-normal-subsample` and/or `--down-sampler-tumor-subsample` depending on the input files.
You can also specify a seed using `--down-sampler-random-seed` to generate different subsamples of the input data set.
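The idea of seeded fractional subsampling can be illustrated with a short Python sketch. It is not DRAGEN's implementation: the function name is hypothetical, and reads are modeled as an in-memory list of records that are kept or dropped independently. The default seed of 42 mirrors the documented default of `--down-sampler-random-seed`.

```python
import random


def fractional_downsample(reads, fraction, seed=42):
    """Keep each read with probability `fraction`, reproducibly for a given seed."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    rng = random.Random(seed)  # private generator, so the global state is untouched
    return [read for read in reads if rng.random() < fraction]


reads = [f"read_{i}" for i in range(1000)]
subset_a = fractional_downsample(reads, 0.25, seed=42)
subset_b = fractional_downsample(reads, 0.25, seed=7)
print(len(subset_a), len(subset_b))  # two different subsamples of about 25%
```

Running the function twice with the same seed yields the same subset, while a different seed yields a different subsample of the same input. For paired-end data, a real implementation would keep or drop both mates of a fragment together.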
Option | Description |
---|---|
Fragmentomics is the study of fragmentation patterns of cell-free DNA or circulating tumor DNA (ctDNA). DNA molecules are released into the plasma from various tissues and cell types. Fragmentation features of cell-free DNA, such as fragment sizes and end motifs, carry the characteristics of their tissue of origin. Studies have shown that fragmentation features differ between cell-free DNA derived from cancer cells and from noncancer cells. The genome-wide fragment profile of cell-free DNA has proven to be a powerful tool to infer cancer status and tissue of origin. The DRAGEN fragmentomics component computes the following three fragmentomics metrics.[1]
Fragment profile
End motif frequency
Window protection score (WPS)
The fragmentomics component works by taking aligned reads from the mapper, calculating per-read metrics, and finally tabulating them into per-bin or per-target-region metrics. DRAGEN first gets the chromosome sizes from the reference genome. Only autosomes and the X and Y chromosomes are considered for fragment profile calculation. The genome is binned with the bin size specified by the user. Each aligned read is processed sequentially. Only reads that satisfy all of the following criteria are considered: 1) mapped, 2) mate mapped, 3) not a PCR duplicate, 4) primary alignment, 5) mapping quality no less than the minimum MAPQ specified by the user. Reads with a template length within the short fragment size range are counted as short fragments. Reads with a template length within the long fragment size range are counted as long fragments. The fragment profile is calculated as the ratio of short-to-long fragment counts for each genomic bin. Genome-wide short fragment counts, long fragment counts, and their ratio are normalized against the GC bias of each genomic bin using the GC correction module from the DRAGEN CNV component.
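The per-bin counting step can be sketched as follows. This is a simplified illustration, not the DRAGEN implementation: reads are modeled as plain dictionaries with hypothetical keys, the size ranges are assumed to be inclusive, and the GC correction step is omitted.

```python
from collections import defaultdict


def fragment_profile(reads, bin_size, short_range, long_range, min_mapq):
    """Count short and long fragments per genomic bin and return their ratio.

    Each read is a dict with the hypothetical keys chrom, pos, tlen, mapq,
    mapped, mate_mapped, primary, and duplicate.
    """
    counts = defaultdict(lambda: {"short": 0, "long": 0})
    for read in reads:
        usable = (read["mapped"] and read["mate_mapped"] and read["primary"]
                  and not read["duplicate"] and read["mapq"] >= min_mapq)
        if not usable:
            continue
        tlen = abs(read["tlen"])  # template length is signed in SAM/BAM
        if short_range[0] <= tlen <= short_range[1]:
            kind = "short"
        elif long_range[0] <= tlen <= long_range[1]:
            kind = "long"
        else:
            continue
        counts[(read["chrom"], read["pos"] // bin_size)][kind] += 1

    profile = {}
    for key, c in sorted(counts.items()):
        # The ratio is undefined for a bin without long fragments.
        ratio = c["short"] / c["long"] if c["long"] else None
        profile[key] = (c["short"], c["long"], ratio)
    return profile
```

The result maps `(chromosome, bin index)` to `(short count, long count, ratio)`. Suitable size ranges and the bin size depend on the assay and are supplied by the caller.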
End motif frequency calculation is enabled when `--fragmentomics-end-motif-len` is set to a positive integer. Unmapped, duplicate, and secondary alignments are excluded from end motif frequency calculation. The first x bases at the 5' end of each read (x is specified by `--fragmentomics-end-motif-len`) are tabulated into a frequency dictionary whose keys are the sequences and whose values are the frequencies. If the first x bases contain any 'N', the read is ignored. After all reads are processed, the frequency table is sorted by sequence in alphabetical order.
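A minimal sketch of the end motif tabulation is shown below. It is an illustration rather than DRAGEN code: the function name is hypothetical, the input is assumed to be read sequences already filtered and oriented 5' to 3', and reporting frequencies as fractions of all counted motifs is an assumption.

```python
from collections import Counter


def end_motif_frequencies(read_sequences, motif_len):
    """Tabulate the first `motif_len` bases of each read, sorted alphabetically."""
    if motif_len <= 0:
        raise ValueError("motif_len must be a positive integer")
    counts = Counter()
    for seq in read_sequences:
        motif = seq[:motif_len].upper()
        # Skip reads that are too short or whose motif contains an N.
        if len(motif) < motif_len or "N" in motif:
            continue
        counts[motif] += 1
    total = sum(counts.values())
    # Assumption: frequencies are fractions of all counted motifs.
    return {motif: counts[motif] / total for motif in sorted(counts)}


print(end_motif_frequencies(["ACGTAA", "ACGTTT", "NCGTAA", "TTTTGG"], 4))
```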
Window protection score (WPS) calculation is enabled when a target region is provided with `--fragmentomics-wps-target-file`. This file must be a BED format text file with three columns. Each row in the file represents a 120-bp region for which WPS is calculated. An interval tree is constructed for the target regions. Then each aligned read is processed sequentially, and unmapped, duplicate, and secondary alignments are excluded. Any read with its 5' end falling in a target region increments the read count for the region by one. Forward and reverse reads are counted separately. If a read fully spans the region, the fully spanning read count for the region increments by one. After all reads are processed, WPS is calculated for each target region. Two ways of calculating WPS are supported: 1) the number of fully spanning reads minus the number of reads with a 5' end in the region, and 2) the percentage of reads ending in the region out of all reads mapped to the region.
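Both WPS variants can be sketched for a single target region. This is a simplified illustration under stated assumptions, not the DRAGEN implementation: reads are modeled as half-open `(start, end, is_reverse)` intervals, "reads mapped to this region" is interpreted as reads overlapping the region, forward and reverse counts are pooled, and no interval tree is used.

```python
def window_protection_scores(region, reads):
    """Return (spanning minus ending, percentage ending) for one target region.

    region is a half-open (start, end) interval. reads is an iterable of
    (start, end, is_reverse) tuples, also half-open.
    """
    region_start, region_end = region
    spanning = ending = overlapping = 0
    for start, end, is_reverse in reads:
        if end <= region_start or start >= region_end:
            continue  # the read does not overlap the region
        overlapping += 1
        # The 5' end is the leftmost base of a forward read and the
        # rightmost base of a reverse read.
        five_prime = end - 1 if is_reverse else start
        if region_start <= five_prime < region_end:
            ending += 1
        if start <= region_start and end >= region_end:
            spanning += 1
    difference = spanning - ending
    percentage = 100.0 * ending / overlapping if overlapping else 0.0
    return difference, percentage


reads = [(50, 300, False), (150, 400, False), (0, 180, True), (300, 400, False)]
print(window_protection_scores((100, 220), reads))
```

In this example one read spans the region, two reads have their 5' end inside it, and one read does not overlap it, so the first score is 1 - 2 = -1 and the second is two out of three overlapping reads.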
DRAGEN Fragmentomics currently supports `Tumor-only` and `Normal-only` sequencing data from TSO500/WES/WGS ctDNA assays. The results for `Tumor-Normal` pair data are undefined because ctDNA data are derived from a mixture of tumor and normal DNA. Therefore, users should avoid running Fragmentomics in `Tumor-Normal` mode.
Enable the Fragmentomics component:
The target regions file is used only in window protection score calculation. The target regions file is in BED format with three columns.
Users can provide a blocklist of regions, such as low mappability regions, to exclude reads from the fragment profile calculation. This file is in BED format with three columns.
DRAGEN outputs the fragment profile file and, if enabled, the end motif frequency file and the WPS file.
The fragment profile file is in the following format:
The end motif frequency file is in the following format:
The WPS file is in the following format:
[1] Y. M. Dennis Lo, Diana S. C. Han, Peiyong Jiang, Rossa W. K. Chiu. Epigenetics, fragmentomics, and topology of cell-free DNA in liquid biopsies. Science. 2021. DOI: 10.1126/science.aaw3616
DRAGEN can reserve a random subset of fragments that are separate from the normal alignment outputs using downsampling. You can use downsampling to generate data sets for performing comparisons between samples or between replicates. DRAGEN samples fragments after performing any hardware accelerated trimming or filtering functions, which enables DRAGEN to rapidly create analysis-read test data sets.
To enable downsampling, set the `--enable-down-sampler` command line option to `true`.
You can use any valid sequencing data format that is compatible with the DRAGEN Host Software. For more information on compatible input options, see Input Options.
DRAGEN downsampling outputs the reserved subset of data in FASTQ format. If the input is paired-end, DRAGEN outputs two FASTQ files that contain subsampled data. If the input is unpaired, DRAGEN outputs one FASTQ file.
In addition to enabling the downsampling command line option, you must set the quantity of fragments to downsample. To set the quantity of fragments, use either `--down-sampler-fragments` or `--down-sampler-coverage`.
If you specify a coverage level, you must also specify a genome using `--ref-dir` or manually specify the genome size using `--down-sampler-genome-size`. If you specify both a fragment limit and a coverage limit, DRAGEN applies both quantity limits and keeps whichever result is smaller.
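The interaction between the two limits can be sketched as follows. This is an illustration only: the function name is hypothetical, a value of 0 is treated as "not set" to match the documented defaults, and the conversion from coverage to a fragment count through an assumed `bases_per_fragment` (300, as for 2x150 bp reads) is an assumption, not a documented DRAGEN formula.

```python
import math


def target_fragment_count(fragments=0, coverage=0.0, genome_size=0,
                          bases_per_fragment=300):
    """Return the number of fragments to keep, honoring both limits if set."""
    limits = []
    if fragments > 0:
        limits.append(fragments)
    if coverage > 0:
        if genome_size <= 0:
            raise ValueError("a coverage limit needs a genome size")
        # Assumed conversion: coverage = fragments * bases_per_fragment / genome_size
        limits.append(math.floor(coverage * genome_size / bases_per_fragment))
    if not limits:
        raise ValueError("set a fragment limit or a coverage limit")
    return min(limits)  # the smaller result wins when both limits are set


print(target_fragment_count(fragments=1_000_000, coverage=1.0, genome_size=3_000_000))
```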
Option | Description |
---|---|
Specify a matched normal SNV VCF. For more information on specifying b-allele loci, see .
Specify a population SNP catalog. For more information on specifying b-allele loci, see .
If running in tumor-normal mode with the SNV caller enabled, use this option to specify the germline heterozygous sites. For more information on specifying b-allele loci, see .
Specify germline CNVs from the matched normal sample. For more information, see .
Use the variant allele frequencies (VAFs) from the somatic SNVs to help select the tumor model for the sample. For more information, see .
Enable HET-calling mode for heterogeneous segments. For more information, see .
If running in tumor-normal mode with the SNV caller enabled, use this option to specify the germline heterozygous sites. For more information on specifying b-allele loci, see .
Specify a matched normal SNV VCF. For more information on specifying b-allele loci, see .
Specify a population SNP catalog. For more information on specifying b-allele loci, see .
Setting | Description | Tumor-Normal Panel/WES/WGS | Tumor-Only Panel/WES/WGS | High coverage bTMB |
---|---|---|---|---|
Metric | Definition |
---|---|
A1 | A2 | B1 | B2 | C1 | C2 | DQA11 | DQA12 | DQB11 | DQB12 | DRB11 | DRB12 |
---|---|---|---|---|---|---|---|---|---|---|---|
Option | Description |
---|---|
Sample Type | Assay | Microsatellite file | Specific Settings | PercentageUnstableSites Threshold |
---|---|---|---|---|
GERMLINE | FASTQ w/ Map/Align | BAM/CRAM | BAM/CRAM w/ Map/Align |
---|---|---|---|
TUMOR NORMAL | FASTQ w/ Map/Align | BAM/CRAM | BAM/CRAM w/ Map/Align |
---|---|---|---|
TUMOR ONLY | FASTQ w/ Map/Align | BAM/CRAM | BAM/CRAM w/ Map/Align |
---|---|---|---|
Pre-built Systematic Noise File | Comment | Systematic Noise Version | DRAGEN Compatibility |
---|---|---|---|
ID | Description |
---|---|
ID | Description |
---|---|
ID | Description |
---|---|
ID | Level | Description |
---|---|---|
ID | Level | Description |
---|---|---|
ID | Level | Description |
---|---|---|
Name | Description | Default Value |
---|---|---|
Field | Description |
---|---|
Column information | Description |
---|---|
Option | Required | Description |
---|---|---|
Option | Required | Description |
---|---|---|
Option | Required | Description |
---|---|---|
Option | Required | Description |
---|---|---|
Option | Required | Description |
---|---|---|
Field | Value |
---|---|
Metric | Description | Denominator of percentile | Example |
---|---|---|---|
Setting | Description |
---|---|
Dragen output | Description |
---|---|
Report Name | Output File | Notes |
---|---|---|
Optional Report type | Enabled with |
---|---|
Filtering rules | Description |
---|---|
Column 1 | Column 2 | Column3 |
---|---|---|
Sample | LOH_Score | TAI_Score | LST_Score | HRD_Score |
---|---|---|---|---|
Sex Karyotype | X Reference Ploidy | Y Reference Ploidy |
---|---|---|
XX
0.75
1.25
0.00
0.25
XY
0.25
0.75
0.25
0.75
XXY
0.75
1.25
0.25
0.75
XYY
0.25
0.75
0.75
1.25
X0
0.25
0.75
0.00
0.25
XXXY
1.25
1.75
0.25
0.75
XXX
1.25
1.75
0.00
0.25
Option | Description |
---|---|
qc-contam-min-cov | The minimum read coverage required for a pileup to be used in contamination detection (default is 10). Lower coverage may produce unreliable results. |
qc-contam-min-valid-read-ratio | The minimum ratio of valid reads in a pileup for it to be considered valid. The default setting is 0.95, meaning 95% of the reads in a pileup must be valid. This value may be lowered to 0.75 and still yield accurate contamination estimates. If many reads are classified as invalid, it is likely due to untrimmed adapters that are being systematically soft clipped. It is recommended to fix the BAM file rather than force the contamination module to use these reads. |
Dragen output | Description |
---|---|
DRAGEN SNV VCF INFO DP field | The SNV VCF INFO DP field is computed after excluding unmapped reads, secondary alignments, bases with BQ<10, and bad quality reads (badly mated reads and reads with bad CIGARs). It is generally equal to or lower than the coverage reported in the fine_hist or other coverage reports. It is also expected to be lower than the unfiltered coverage track reported in IGV. |
DRAGEN SNV VCF FORMAT DP field | The SNV VCF FORMAT DP is similar to the INFO DP field, but it also excludes non-informative reads that match more than one haplotype equally well. In general the following pattern is expected: FORMAT DP <= INFO DP <= per position coverage in full_res report. |
Input bases divided by reference genome size. | Available in the mapping_metrics.csv file. This metric is a useful indication of the raw sequencing coverage. All primary reads (including duplicates) are considered. Secondary and supplementary alignments are ignored, but no other filters are applied. |
Autosomal Median Coverage | Available in ploidy_estimation_metrics.csv. This is an internal development metric that makes various assumptions about which regions are treated as callable. This metric is not consistent with "Median autosomal coverage over genome" in "wgs_coverage_metrics.csv". It is not recommended for any QC. |
Report Name | Output File | Notes |
---|---|---|
Coverage metrics | _coverage_metrics.csv | Important coverage summary statistics. On by default. |
Fine histogram coverage | _fine_hist.csv | Detailed coverage histogram. On by default. |
Histogram coverage | _hist.csv | Binned coverage histogram. On by default. |
Overall mean coverage | _overall_mean_cov.csv | Redundant subset of information available in _coverage_metrics.csv. On by default. |
Per contig mean coverage | _contig_mean_cov.csv | On by default. |
Read-level coverage report | _read_cov_report.bed | On by default. |
Basepair full resolution | _full_res.bed | Optionally enabled with custom reports. |
Per BED region cov_report | _cov_report.bed | Optionally enabled with custom reports. |
GVCF callability | _callability.bed | Optionally enabled with custom reports. |
Optional Report type | Enabled with |
---|---|
Basepair full resolution | `--qc-coverage-region-1=BED_FILE_PATH --qc-coverage-reports-1 full_res` |
Per BED region cov_report | `--qc-coverage-region-1=BED_FILE_PATH --qc-coverage-reports-1 cov_report` |
GVCF callability | `--qc-coverage-region-1=BED_FILE_PATH --qc-coverage-reports-1 callability` |
Filtering rules | Description |
---|---|
Handling of overlapping mates | By default, overlapping mates are double counted. Set `--qc-coverage-ignore-overlaps=true` to resolve all of the alignments for each fragment and avoid double counting any overlapping bases. This might result in marginally longer run times. This option also requires setting `--enable-map-align=true`. `--qc-coverage-ignore-overlaps` is a global setting and updates all qc-coverage reports. |
Soft-clipped bases | By default, soft-clipped bases are not counted towards coverage. Set `--qc-coverage-count-soft-clipped-bases=true` to also include those bases in the coverage calculations. `--qc-coverage-count-soft-clipped-bases` is a global setting and updates all qc-coverage reports. |
MAPQ and BQ filters | The `--qc-coverage-filters-i` setting can be used to override the minimum MAPQ and BQ filters. A coverage filter is enabled by using one of the `--qc-coverage-filters-i` options (where i is 1, 2, or 3) in combination with the associated `--qc-coverage-region-i` option. The default value for `--qc-coverage-filters-i` is `mapq<1,bq<0`. The default includes all BQ values but filters reads with MAPQ=0. |
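The meaning of a filter string such as `mapq<1,bq<0` can be sketched with a small parser. This is an illustration, not DRAGEN's parser: the function names are hypothetical, and only the `mapq<N` and `bq<N` terms shown in the default value are assumed to exist. Each term names the values that are filtered out, so `mapq<1` drops reads with MAPQ=0 and `bq<0` drops nothing.

```python
def parse_coverage_filters(spec="mapq<1,bq<0"):
    """Parse a filter string into minimum-accepted MAPQ and BQ thresholds."""
    thresholds = {}
    for term in spec.split(","):
        name, sep, value = term.strip().partition("<")
        if sep != "<" or name not in ("mapq", "bq"):
            raise ValueError(f"unrecognized filter term: {term!r}")
        thresholds[name] = int(value)
    return thresholds


def base_is_counted(mapq, bq, thresholds):
    """A base counts toward coverage unless its MAPQ or BQ is below a threshold."""
    return mapq >= thresholds.get("mapq", 0) and bq >= thresholds.get("bq", 0)


defaults = parse_coverage_filters()
print(defaults, base_is_counted(0, 30, defaults), base_is_counted(60, 2, defaults))
```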
Column 1 | Column 2 | Column 3 |
---|---|---|
Contig name | Number of bases aligned to that contig, which excludes bases from duplicate marked reads, reads with MAPQ=0, and clipped bases. | Estimated coverage, as follows: <number of bases aligned to the contig (ie, Col2)> divided by <length of the contig or (if a target BED is used) the total length of the target region spanning that contig>. |
--vc-target-bed specified? Y/N | --qc-coverage-region-i (i equal to 1, 2, or 3) specified? Y/N | Expected Output Files |
---|---|---|
N | N | wgs_coverage_metrics.csv, wgs_fine_hist.csv, wgs_hist.csv, wgs_overall_mean_cov.csv, wgs_contig_mean_cov.csv |
N | Y | wgs_coverage_metrics.csv, wgs_fine_hist.csv, wgs_hist.csv, wgs_overall_mean_cov.csv, wgs_contig_mean_cov.csv. For each coverage region specified by the user: qc-coverage-region-i_coverage_metrics.csv, qc-coverage-region-i_fine_hist.csv, qc-coverage-region-i_hist.csv, qc-coverage-region-i_overall_mean_cov.csv, qc-coverage-region-i_contig_mean_cov.csv, qc-coverage-region-i_full_res.bed (if the full_res report type is requested for qc-coverage-region-i), qc-coverage-region-i_cov_report.bed (if the cov_report report type is requested for qc-coverage-region-i), qc-coverage-region-i_callability.bed (if GVCF mode is enabled and the callability or exome-callability report type is requested) |
Y | N | wgs_coverage_metrics.csv, wgs_fine_hist.csv, wgs_hist.csv, wgs_overall_mean_cov.csv, wgs_contig_mean_cov.csv, target_bed_coverage_metrics.csv, target_bed_fine_hist.csv, target_bed_hist.csv, target_bed_overall_mean_cov.csv, target_bed_contig_mean_cov.csv, target_bed_callability.bed (if GVCF mode is enabled) |
Y | Y | wgs_coverage_metrics.csv, wgs_fine_hist.csv, wgs_hist.csv, wgs_overall_mean_cov.csv, wgs_contig_mean_cov.csv, target_bed_coverage_metrics.csv, target_bed_fine_hist.csv, target_bed_hist.csv, target_bed_overall_mean_cov.csv, target_bed_contig_mean_cov.csv, target_bed_callability.bed (if GVCF mode is enabled). For each coverage region specified by the user: qc-coverage-region-i_coverage_metrics.csv, qc-coverage-region-i_fine_hist.csv, qc-coverage-region-i_hist.csv, qc-coverage-region-i_overall_mean_cov.csv, qc-coverage-region-i_contig_mean_cov.csv, qc-coverage-region-i_full_res.bed (if the full_res report type is requested for qc-coverage-region-i), qc-coverage-region-i_cov_report.bed (if the cov_report report type is requested for qc-coverage-region-i), qc-coverage-region-i_callability.bed (if GVCF mode is enabled and the callability or exome-callability report type is requested) |
Setting | Description | Tumor-Normal Panel/WES/WGS | Tumor-Only Panel/WES/WGS | High coverage bTMB |
---|---|---|---|---|
--vc-callability-tumor-thresh | The minimum coverage for usable coding regions | 50 (default) | 50 (default) | 1000 (not default) |
--tmb-vaf-threshold | Variant minimum allele frequency for usable variants | 0.05 (default) | 0.05 (default) | 0.002 (not default) |
--tmb-cosmic-count-threshold | Number of observations in COSMIC for a variant to be considered a driver mutation | 50 (default) | 50 (default) | 50 (default) |
--tmb-skip-db-filter | Do not use the Nirvana database to filter germline variants | TRUE (default:T/N) | FALSE (default:T/O) | FALSE (not default) |
--tmb-enable-proxi-filter | Use allele frequency information to filter germline variants | OPTIONAL (default is FALSE) | FALSE (not default) | TRUE (not default) |
Metric | Definition |
---|---|
Eligible Region (Mbp) | The specified custom regions (in Mbp) that meet the minimum coverage threshold. |
Filtered Variant Count | Remaining variants after variant and germline filters. |
Filtered Nonsyn Variant Count | Subset of filtered variants that are nonsynonymous. |
TMB | Filtered variants normalized by the eligible regions (Mbp). |
Nonsyn TMB | Filtered nonsynonymous variants normalized by the eligible regions (Mbp). |
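The normalization in the TMB and Nonsyn TMB definitions amounts to a division by the eligible region size. The sketch below is illustrative only: the function name and the example numbers are hypothetical.

```python
def tumor_mutational_burden(filtered_variant_count, eligible_region_mbp):
    """Return variants per megabase of eligible region."""
    if eligible_region_mbp <= 0:
        raise ValueError("eligible region must be greater than 0 Mbp")
    return filtered_variant_count / eligible_region_mbp


# Hypothetical values: 30 filtered variants, 18 of them nonsynonymous, over 1.5 Mbp.
print(tumor_mutational_burden(30, 1.5))  # TMB
print(tumor_mutational_burden(18, 1.5))  # Nonsyn TMB
```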
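A1 | A2 | B1 | B2 | C1 | C2 | DQA11 | DQA12 | DQB11 | DQB12 | DRB11 | DRB12 |
---|---|---|---|---|---|---|---|---|---|---|---|
A*26:01 | A*29:02 | B*44:02 | B*44:03 | C*05:01 | C*16:01 | DQA1*01:03 | DQA1*01:02 | DQB1*06:03 | DQB1*06:02 | DRB1*15:01 | DRB1*15:01 |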
Sample | LOH_Score | TAI_Score | LST_Score | HRD_Score |
---|---|---|---|---|
Sample | 16 | 17 | 28 | 61 |
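The example row is consistent with HRD_Score being the sum of the three component scores (16 + 17 + 28 = 61). Assuming that relationship, which is inferred from the example values rather than stated here, the combination can be sketched as follows. The function name is hypothetical.

```python
def hrd_score(loh_score, tai_score, lst_score):
    """Combine the LOH, TAI, and LST component scores (assumed to be a plain sum)."""
    return loh_score + tai_score + lst_score


print(hrd_score(16, 17, 28))  # matches the HRD_Score of the example row
```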
Option | Description |
---|---|
msi-command tumor-only/tumor-normal/collect-evidence | Mode of execution: tumor-only, tumor-normal, or collect-evidence. |
msi-microsatellites-file | Specify the file containing the microsatellites. You can generate this file by scanning the genome for microsatellites using an MSI-sensor. DRAGEN has tested with ≥ 10 bp homopolymers for solid samples, and 6-7 bp homopolymers for liquid samples. |
msi-ref-normal-dir | Full name of the directory containing files with the normal reference repeat length distribution. Used only in `tumor-only` mode. These files can be generated by running `collect-evidence` on each normal sample. At least 20 normal samples are required. |
msi-coverage-threshold | Specify the minimum spanning read coverage for a microsatellite. Microsatellites that do not meet the specified threshold are not included in analysis. DRAGEN recommends using 60 as the value for solid samples. For TSO500 liquid, a value of 500 is recommended. |
msi-distance-threshold | Threshold for distance distributions to be considered different. Default is 0.1. For liquid samples, a value of 0.02 is recommended. |
Sample Type | Assay | Microsatellite file | Specific Settings | PercentageUnstableSites Threshold |
---|---|---|---|---|
Solid | TSO500 | Part of TSO500 resource bundle. Repeats 10 - 50. 130 sites. | msi-distance-threshold=0.1 | 20 |
Heme | TSO500 | N/A | N/A | N/A |
Liquid (cfDNA) | TSO500 | Part of TSO500 resource bundle. Repeats 6,7. 2344 sites. | msi-distance-threshold=0.02 | TBD |
Solid, Heme | WES | Available for download. Repeats 10 - 50. Approx. 3.5K sites. | msi-distance-threshold=0.1 | TBD |
Liquid (cfDNA) | WES | Available for download. Repeats 10 - 50. Approx. 3.5K sites. | msi-distance-threshold=0.02 | TBD |
Solid, Heme | WGS | Available for download. Repeats 10 - 50. Approx. 1 mil sites. | msi-distance-threshold=0.1 | TBD |
Liquid (cfDNA) | WGS | Available for download. Repeats 10 - 50. Approx. 1 mil sites. | msi-distance-threshold=0.02 | TBD |
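Sex Karyotype | X Reference Ploidy | Y Reference Ploidy |
---|---|---|
XX | 2 | 0 |
XY | 1 | 1 |
XXY | 1 | 1 |
XYY | 1 | 1 |
X0 | 2 | 0 |
XXXY | 1 | 1 |
XXX | 2 | 0 |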
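GERMLINE | FASTQ w/ Map/Align | BAM/CRAM | BAM/CRAM w/ Map/Align |
---|---|---|---|
CNV+SNV | Supported | Supported | Supported |
CNV+SV | Supported | Supported | Supported |
SNV+SV | Supported | Supported | Supported |
CNV+SNV+SV | Supported | Supported | Supported |

TUMOR NORMAL | FASTQ w/ Map/Align | BAM/CRAM | BAM/CRAM w/ Map/Align |
---|---|---|---|
CNV+SNV | Supported | Supported | Not Supported |
CNV+SV | Supported | Supported | Not Supported |
SNV+SV | Supported | Supported | Not Supported |
CNV+SNV+SV | Supported | Supported | Not Supported |

TUMOR ONLY | FASTQ w/ Map/Align | BAM/CRAM | BAM/CRAM w/ Map/Align |
---|---|---|---|
CNV+SNV | Supported | Supported | Supported |
CNV+SV | Supported | Supported | Supported |
SNV+SV | Supported | Supported | Supported |
CNV+SNV+SV | Supported | Supported | Supported |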
Pre-built Systematic Noise File | Comment | Systematic Noise Version | DRAGEN Compatibility |
---|---|---|---|
WGS_hg19_v3.0.0_systematic_noise.sv.bedpe.gz | >30x coverage using the Illumina NovaSeq 6000 system with 2x150bp reads for the HG19 reference | 3.0.0 | 4.3.* |
WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz | >30x coverage using the Illumina NovaSeq 6000 system with 2x150bp reads for the HG38 reference | 3.0.0 | 4.3.* |
WGS_hs37d5_v3.0.0_systematic_noise.sv.bedpe.gz | >30x coverage using the Illumina NovaSeq 6000 system with 2x150bp reads for the HS37D5 reference | 3.0.0 | 4.3.* |
ID | Description |
---|---|
contig1 | chromosome of the first region (string) |
start1 | start position of the first region (0-based left-most position of the first breakpoint containing genomic interval, integer) |
end1 | end position of the first region (0-based left-most position of the first breakpoint containing genomic interval, integer) |
contig2 | chromosome of the second region (string) |
start2 | start position of the second region (0-based left-most position of the second breakpoint containing genomic interval, integer) |
end2 | end position of the second region (0-based left-most position of the second breakpoint containing genomic interval, integer) |
event_id | The paired region unique ID (string) |
score | The number of occurrences in the cohort |
orientation1 | direction of breakpoint1 relative to the reference; "+" indicates to the right, "-" to the left (string, "+", or "-") |
orientation2 | direction of breakpoint2 relative to the reference; "+" indicates to the right, "-" to the left (string, "+", or "-") |
assembly-status | If all variants used to generate the noise candidate have end-to-end local assemblies, noise candidate is "precise", otherwise it is "imprecise" (string, "precise", or "imprecise") |
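A record with these eleven columns can be read with a short parser. This is a sketch under stated assumptions: the class and function names are hypothetical, columns are assumed to be tab separated in the order listed above, and the example line is made up for illustration rather than taken from a real noise file.

```python
from dataclasses import dataclass


@dataclass
class NoiseRecord:
    contig1: str
    start1: int
    end1: int
    contig2: str
    start2: int
    end2: int
    event_id: str
    score: int
    orientation1: str
    orientation2: str
    assembly_status: str


def parse_noise_line(line):
    """Parse one tab-separated systematic noise BEDPE line into a NoiseRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 columns, found {len(fields)}")
    c1, s1, e1, c2, s2, e2, event_id, score, o1, o2, status = fields
    if o1 not in ("+", "-") or o2 not in ("+", "-"):
        raise ValueError("orientation must be '+' or '-'")
    if status not in ("precise", "imprecise"):
        raise ValueError("assembly status must be 'precise' or 'imprecise'")
    return NoiseRecord(c1, int(s1), int(e1), c2, int(s2), int(e2),
                       event_id, int(score), o1, o2, status)


# Made-up example line, for illustration only.
example = "chr1\t1000\t1050\tchr5\t20000\t20040\tevent_1\t7\t+\t-\tprecise"
print(parse_noise_line(example))
```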
ID | Description |
---|---|
IMPRECISE | Flag indicating that the structural variation is imprecise, ie, the exact breakpoint location is not found |
SVTYPE | Type of structural variant |
SVLEN | Difference in length between REF and ALT alleles |
END | End position of the variant described in this record |
CIPOS | Confidence interval around POS |
CIEND | Confidence interval around END |
CIGAR | CIGAR alignment for each alternate indel allele |
MATEID | ID of mate breakend |
EVENT | ID of event associated to breakend |
HOMLEN | Length of base pair identical homology at event breakpoints |
HOMSEQ | Sequence of base pair identical homology at event breakpoints |
SVINSLEN | Length of insertion |
SVINSSEQ | Sequence of insertion |
LEFT_SVINSSEQ | Known left side of insertion for an insertion of unknown length |
RIGHT_SVINSSEQ | Known right side of insertion for an insertion of unknown length |
PAIR_COUNT | Read pairs supporting this variant where both reads are confidently mapped |
BND_PAIR_COUNT | Confidently mapped reads supporting this variant at this breakend (mapping may not be confident at remote breakend) |
UPSTREAM_PAIR_COUNT | Confidently mapped reads supporting this variant at the upstream breakend (mapping may not be confident at downstream breakend) |
DOWNSTREAM_PAIR_COUNT | Confidently mapped reads supporting this variant at this downstream breakend (mapping may not be confident at upstream breakend) |
BND_DEPTH | Read depth at local translocation breakend |
MATE_BND_DEPTH | Read depth at remote translocation mate breakend |
JUNCTION_QUAL | If the SV junction is part of an EVENT (ie, a multi-adjacency variant), this field provides the QUAL value for the adjacency in question only |
SOMATIC | Flag indicating a somatic variant |
SOMATICSCORE | Somatic variant quality score |
SOMATIC_EVENT | If the probability of the SV being a germline variant is greater than the probability of the SV being a somatic variant, this is 0. Otherwise, this is 1. |
JUNCTION_SOMATICSCORE | If the SV junction is part of an EVENT (ie, a multi-adjacency variant), this field provides the SOMATICSCORE value for the adjacency in question only |
CONTIG | Assembled contig sequence, if the variant is not imprecise (with `--outputContig`) |
DUPSVLEN | Length of duplicated reference sequence |
DUPHOMLEN | Length of base pair identical homology at event breakpoints excluding duplicated reference sequence |
DUPHOMSEQ | Sequence of base pair identical homology at event breakpoints excluding duplicated reference sequence |
DUPSVINSLEN | Length of inserted sequence after duplicated reference sequence |
DUPSVINSSEQ | Inserted sequence after duplicated reference sequence |
NotDiscovered | Variant candidate specified by the user and not discovered from input sequencing data |
UserInputId | Variant ID from user input VCF |
KnownSVScoring | Variant is associated with a user specified input variant, therefore scoring and filtration criteria are relaxed under a stronger prior assumption of truth |
ID | Description |
---|---|
GT | Genotype |
FT | Sample filter, 'PASS' indicates that all filters have passed for this sample |
GQ | Genotype Quality |
PL | Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification |
PR | Number of spanning read pairs which strongly support the REF or ALT alleles |
SR | Number of split-reads which strongly support the REF or ALT alleles |
VF | Number of fragments which strongly support the REF or ALT alleles |
ID | Level | Description |
---|---|---|
MinQUAL | Record | QUAL score is less than a threshold. The filter is not applied to records with the `KnownSVScoring` flag. |
Ploidy | Record | For DEL and DUP variants, the genotypes of overlapping variants with similar size are inconsistent with diploid expectation. The filter is not applied to records with the `KnownSVScoring` flag. |
MaxDepth | Record | Depth is greater than 3x the median chromosome depth near one or both variant breakends. The filter is not applied to records with the `KnownSVScoring` flag. |
MaxMQ0Frac | Record | For a small variant (<1000 bases), the fraction of reads in all samples with MAPQ0 around either breakend exceeds 0.4. The filter is not applied to records with the `KnownSVScoring` flag. |
NoPairSupport | Record | For variants significantly larger than the paired read fragment size, no paired reads support the alternate allele in any sample. The filter is not applied to records with the `KnownSVScoring` flag. |
SampleFT | Record | No sample passes all the sample-level filters. |
MinGQ | Sample | GQ score is less than 15. The filter is applied at the sample level and is not applied to records with the `KnownSVScoring` flag. |
HomRef | Sample | Homozygous reference call. The filter is applied at the sample level. |
ID | Level | Description |
---|---|---|
MinSomaticScore | Record | SOMATICSCORE is less than a threshold. |
MaxDepth | Record | Normal sample site depth is greater than 3x the median chromosome depth near one or both variant breakends. The filter is not applied to records with the `KnownSVScoring` flag. |
MaxMQ0Frac | Record | For a small variant (< 1000 bases) in the normal sample, the fraction of reads with MAPQ0 around either breakend exceeds 0.4. The filter is not applied to records with the `KnownSVScoring` flag. |
SystematicNoise | Record | Variant overlaps with one of the paired regions in the systematic noise BEDPE file with matched orientation. The filter is not applied to records with the `KnownSVScoring` flag. |
ID | Level | Description |
---|---|---|
MinSomaticScore | Record | SOMATICSCORE is less than a threshold. |
SystematicNoise | Record | Variant overlaps with one of the paired regions in the systematic noise BEDPE file with matched orientation. The filter is not applied to records with the `KnownSVScoring` flag. |
MaxDepth | Record | Normal sample site depth is greater than 3x the median chromosome depth near one or both variant breakends. The filter is not applied to records with the `KnownSVScoring` flag. |
MaxMQ0Frac | Record | For a small variant (<1000 bases), the fraction of reads with MAPQ0 around either breakend exceeds 0.4. The filter is not applied to records with the `KnownSVScoring` flag. |
Name | Description | Default Value |
---|---|---|
enable-indel-realigner | Enable indel re-alignment | False |
ir-write-intervals-file | Output a file with the reference intervals that contain evidence for re-alignment. | False |
ir-max-num-reads | Max number of reads in an interval for re-alignment. | 20,000 |
ir-max-num-candidates | Max number of re-alignment candidates in an interval for re-alignment. | 256 |
ir-max-num-consensus | Max number of consensus reads in an interval for re-alignment. | 256 |
ir-max-distance-between-mates | Max number of re-alignment candidates in an interval for re-alignment. | 100,000 |
ir-realignment-threshold | Minimal improvement of sum of mismatching base qualities to merit realignment. | 50 |
Option | Description |
---|---|
--enable-fractional-down-sampler | Set to `true` to enable fractional downsampling. The default value is false. |
--down-sampler-normal-subsample | Specify the fraction of reads to keep as a subsample of normal input data. The default value is 1.0 (100%). |
--down-sampler-tumor-subsample | Specify the fraction of reads to keep as a subsample of tumor input data. The default value is 1.0 (100%). |
--down-sampler-random-seed | Specify the random seed for different runs of the same input data. The default value is 42. |
Field | Description |
---|---|
NAME | SNP identifier |
MAF | minor allele frequency |
ANCHOR_SNP | refers to the NAME of a SNP. SNPs with the same ANCHOR_SNP have high linkage disequilibrium with each other. |
Option | Description |
---|---|
--enable-down-sampler | Set to `true` to enable downsampling. The default value is false. If enabled, you must set either `--down-sampler-fragments` or `--down-sampler-coverage`. |
--down-sampler-num-threads | Specify the number of threads to use for down-sampled reads. The default value is 8. |
--down-sampler-random-seed | Set random seed for down-sampled fragments. The default value is 42. |
--down-sampler-genome-size | Set target genome size for downsampling coverage. The default value is 0. The `--down-sampler-genome-size` option is not compatible with the `--ref-dir` option. |
--down-sampler-fragments | Specify the target number of fragments for downsampling. The default value is 0. |
--down-sampler-coverage | Set target genomic coverage for downsampling. The default value is 0. If enabled, you must set either `--ref-dir` or `--down-sampler-genome-size`. |
First column: filename
Specifies the genetic map basename, 1 name per line. Mixed ploidy chromosomes must be separated into par and non-par regions. Basenames must match genetic map basenames.
Second column: region
Specifies the start and end positions of the chromosome or sub-chromosome region with format <contig_name>:<start_position>-<end_position>
. For chromosomes without mixed ploidy regions, the start position is 1, end position is the length of the chromosome (1-based, inclusive). For chromosomes with mixed ploidy regions, for each region, the start and end positions are those of the region (1-based, inclusive).
Third column: mixed ploidy subject
Specifies 2 for diploid chromosomes and PAR regions, and 1 for non-PAR regions.
Fourth column: diploid subject
Specifies 2 for all chromosomes
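The four config columns described above can be checked with a short parser. This is a sketch under stated assumptions: the columns are assumed to be whitespace-separated (the separator is not specified here), and the names `ConfigRow` and `parse_config_line` as well as the example values are hypothetical.

```python
import re
from typing import NamedTuple

# <contig_name>:<start_position>-<end_position>, 1-based and inclusive
REGION_RE = re.compile(r"^([^:\s]+):(\d+)-(\d+)$")


class ConfigRow(NamedTuple):
    filename: str      # genetic map basename
    contig: str
    start: int
    end: int
    mixed_ploidy: int  # 2 for diploid and PAR regions, 1 for non-PAR regions
    diploid: int       # always 2


def parse_config_line(line):
    """Parse one config line; assumes whitespace-separated columns."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}")
    filename, region, mixed, diploid = fields
    match = REGION_RE.match(region)
    if match is None:
        raise ValueError(f"bad region: {region!r}")
    contig, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start < 1 or end < start:
        raise ValueError(f"bad coordinates in region: {region!r}")
    if int(mixed) not in (1, 2):
        raise ValueError("mixed ploidy column must be 1 or 2")
    if int(diploid) != 2:
        raise ValueError("diploid column must be 2")
    return ConfigRow(filename, contig, start, end, int(mixed), int(diploid))


# Hypothetical row; the coordinates are placeholders, not real PAR boundaries.
row = parse_config_line("chrX_nonpar chrX:1001-5000 1 2")
```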
--enable-population-haplotyping
Yes
Set to true to enable population haplotyping tool.
--enable-phase-common
Yes
Set to true to enable the Phase Common step.
--ph-phase-common-input-list
Yes
Provides a .txt file listing the sample input for one chromosome: the path to a single msVCF or to several msVCF files, one path per line. Note: for a mixed ploidy chromosome, each PAR and non-PAR region must be treated as a separate chromosome.
--ph-phase-common-input-region
Yes
Specifies the target region to be phased. String in the format contigname: startposition-endposition. Consecutive regions must overlap so that the downstream Ligate Common step can join them. Example input region length for human data: 10 Mbp.
Note: for a chromosome with both mixed ploidy and diploid regions, run the command with one region at a time (for example, three runs with the regions chrX_par1, chrX_nonpar, and chrX_par2, instead of one run with region chrX).
--ph-phase-common-map
Yes
Provides path to the chromosome genetic map. Note: in the case of mixed ploidy chromosome, the genetic map name must be divided into PAR and non-PAR regions.
--ph-phase-common-config
Yes
Provides path to the txt config file.
--ph-phase-common-reference
No
Provides the path to a reference panel of haplotypes in msVCF format. Useful for iterative haplotyping to accelerate the process.
--ph-phase-common-scaffold
No
Provides the path to a scaffold of haplotypes in msVCF format. Useful for iterative haplotyping to accelerate the process.
--ph-phase-common-sample-type
Yes
Provides the path to the Sample type file.
--ph-phase-common-filter-maf
No
Default 0.001. Sets the minimum minor allele frequency (MAF) threshold. All variants with an allele frequency equal to or above this threshold are phased during the Phase Common step.
--ph-phase-common-max-miss-gt-rate
No
Default 0.1. Set the threshold for variants to be skipped if the rate of missing GT is higher than this value.
--output-directory
Yes
Specifies the output directory.
--output-file-prefix
No
Outputs filename with the defined prefix for the file generated by the pipeline.
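Phase Common expects overlapping input regions, whereas Phase Rare expects regions without overlaps or gaps. One way to generate both kinds of region strings is sketched below. The helper `make_regions` is hypothetical, and the overlap size is an assumption chosen by the user: the text above only gives 10 Mbp as an example region length for human data.

```python
def make_regions(contig, length, window, overlap=0):
    """Tile a contig of `length` bases (1-based, inclusive) into regions.

    overlap > 0 gives overlapping regions (Phase Common style);
    overlap == 0 gives contiguous regions with no gaps (Phase Rare style).
    """
    if length < 1 or window < 1 or not 0 <= overlap < window:
        raise ValueError("need length >= 1, window >= 1, 0 <= overlap < window")
    regions = []
    start = 1
    while True:
        end = min(start + window - 1, length)
        regions.append(f"{contig}:{start}-{end}")
        if end == length:
            break
        # The next region starts `overlap` bases before the current one ends.
        start = end + 1 - overlap
    return regions


# Toy coordinates for readability; real runs would use e.g. window=10_000_000.
overlapping = make_regions("chr1", 25, 10, overlap=2)
contiguous = make_regions("chr1", 25, 10)
```

With the toy values above, `overlapping` is `chr1:1-10`, `chr1:9-18`, `chr1:17-25`, and `contiguous` is `chr1:1-10`, `chr1:11-20`, `chr1:21-25`.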
--enable-population-haplotyping
Yes
Set to true to enable population haplotyping tool.
--enable-ligate-common
Yes
Set to true to enable the Ligate Common step.
--ph-ligate-common-input-list
Yes
Provides a .txt file listing the phased msVCF files for a single chromosome. The msVCF files are the output files of the Phase Common step, in ascending position order. Note: for a mixed ploidy chromosome, each PAR and non-PAR region must be treated as a separate chromosome.
--output-directory
Yes
Specifies the output directory.
--output-file-prefix
No
Outputs filename with the defined prefix for the file generated by the pipeline.
--enable-population-haplotyping
Yes
Set to true to enable population haplotyping tool.
--enable-phase-rare
Yes
Set to true to enable the Phase Rare step.
--ph-phase-rare-input
Yes
Provides the path to the preprocessed unphased msVCF generated from Phase Common step covering the phase rare region.
--ph-phase-rare-input-region
Yes
Specifies the target region to be phased. String in the format contigname: startposition-endposition. Regions must not overlap or have gaps between them.
Note: for a chromosome with both mixed ploidy and diploid regions, run the command with one region at a time (for example, three runs with the regions chrX_par1, chrX_nonpar, and chrX_par2, instead of one run with region chrX).
--ph-phase-rare-map
Yes
Provides the path to the genetic map of the chromosome. Note: in the case of mixed ploidy chromosome, the genetic map name must be divided into PAR and non-PAR regions.
--ph-phase-rare-config
Yes
Provides the path to the txt config file.
--ph-phase-rare-scaffold
Yes
Provides the path to the scaffold of haplotypes in msVCF format generated from Ligate Common step.
--ph-phase-rare-scaffold-region
Yes
Specifies the scaffold region to be phased. String in the format contigname: startposition-endposition. This scaffold region needs to cover the Input region and to allow buffer between regions. The buffer length impacts the accuracy and speed of the process: longer length is slower but improves accuracy.
--ph-phase-rare-sample-type
Yes
Provides the path to the Sample type file.
--ph-phase-rare-filter-maf
No
Default 0.001. Sets the maximum minor allele frequency (MAF) threshold. All variants with an allele frequency below this threshold are phased during the Phase Rare step. This value must match the one provided to --ph-phase-common-filter-maf. If the values differ, not all variants will be phased.
--output-directory
Yes
Specifies the output directory.
--output-file-prefix
No
Outputs filename with the defined prefix for the file generated by the pipeline.
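The Phase Rare scaffold region has to cover the input region plus a buffer on each side. A minimal sketch of deriving it is shown below. The helper name and the buffer value are hypothetical: no buffer length is recommended in the table above, only that a longer buffer is slower but more accurate.

```python
def scaffold_region(contig, start, end, buffer, contig_length):
    """Pad an input region by `buffer` bases on each side, clamped to the contig.

    Coordinates are 1-based and inclusive.
    """
    if not 1 <= start <= end <= contig_length or buffer < 0:
        raise ValueError("invalid region or buffer")
    padded_start = max(1, start - buffer)
    padded_end = min(contig_length, end + buffer)
    return f"{contig}:{padded_start}-{padded_end}"


# Toy example: input region chr1:1001-2000 on a 2200-base contig, buffer 500.
example = scaffold_region("chr1", 1001, 2000, 500, 2200)
```

In the toy example the right edge is clamped to the contig length, giving `chr1:501-2200`.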
--enable-population-haplotyping
Yes
Set to true to enable population haplotyping tool.
--enable-concat-all
Yes
Set to true to enable the Concat All step.
--ph-concat-all-input-list
Yes when --ph-concat-all-input-list-sites-only is not provided
Provides a .txt file with list of phased msVCF pertaining to a single chromosome. The msVCF files provided are the output files of Phase Rare step, in ascending position order. Note: in the case of mixed ploidy chromosome each PAR and non-PAR regions must be treated as a single chromosome.
--ph-concat-all-input-list-sites-only
Yes when --ph-concat-all-input-list is not provided
Provides a .txt file with list of VCF containing all the haplotyped sites. The VCF files provided are the output files of Phase Rare step, in ascending position order, sex chromosomes at the end.
--output-directory
Yes
Specifies the output directory.
--output-file-prefix
No
Outputs filename with the defined prefix for the file generated by the pipeline.
--enable-population-haplotyping
Yes
Set to true to enable population haplotyping tool.
--enable-phase-qc
Yes
Set to true to enable the quality control module.
--ph-phase-qc-validation
Yes
Provides the path to the phased truth set msVCF. Note: the validation msVCF must have the same samples as in the estimation msVCF for which the phasing accuracy is to be estimated.
--ph-phase-qc-estimation
Yes
Provides the path to the phased msVCF, output of Concat All to be validated.
--ph-phase-qc-input-region
Yes
Specifies the target region to be phased. String in the format contigname: startposition-endposition (startposition-endposition is optional). Regions must not overlap or have gaps between them.
--output-directory
Yes
Specifies the output directory.
--output-file-prefix
No
Outputs filename with the defined prefix for the file generated by the pipeline.
UMI
The UMI sequence. For example, ACGTAC
IsValid
Specify if the UMI sequence is valid. Enter either TRUE or FALSE.
NearestCodes
Colon-separated list of nearest UMI sequences. For example, ACGTAA:ACGTAT
SecondNearestCodes
Colon-separated list of second nearest sequences. For example, ACGGAA:ACGGAT
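The four columns above can be read with a few lines of code. The sketch below assumes tab-separated columns, which is not stated here, and the row it parses is hypothetical, assembled from the per-column examples above. The Hamming-distance helper only illustrates that, in those examples, the nearest codes differ from the UMI at one position and the second nearest codes at two. Whether Hamming distance is the metric used to build the table is an assumption.

```python
from typing import List, NamedTuple


class UmiRow(NamedTuple):
    umi: str
    is_valid: bool
    nearest: List[str]
    second_nearest: List[str]


def parse_umi_row(line, sep="\t"):
    """Parse one correction-table row; the separator is an assumption."""
    umi, is_valid, nearest, second = line.rstrip("\n").split(sep)
    if is_valid not in ("TRUE", "FALSE"):
        raise ValueError(f"IsValid must be TRUE or FALSE, got {is_valid!r}")
    return UmiRow(
        umi,
        is_valid == "TRUE",
        nearest.split(":") if nearest else [],
        second.split(":") if second else [],
    )


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical row built from the example values in the column descriptions.
row = parse_umi_row("ACGTAC\tFALSE\tACGTAA:ACGTAT\tACGGAA:ACGGAT")
nearest_distances = [hamming(row.umi, code) for code in row.nearest]
second_distances = [hamming(row.umi, code) for code in row.second_nearest]
```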
Number of reads
Total number of reads.
NA
Fig1) 14 pairs of read X 2 = 28 reads
Number of reads with valid or correctable UMIs
Number of reads for which the UMIs could be corrected based on the lookup table.
Number of reads
Fig2) Valid UMI read count (Exact match+Correctable UMI)
Number of reads in discarded families
Number of reads in discarded families. Families are discarded when there are not enough raw reads to support the family (family size less than "--umi-min-supporting-reads"). For "--umi-emit-multiplicity=duplex" option, simplex families will be discarded.
Number of reads
Fig1) Number of reads in Families discarded (See "Families discarded" for more detail)
Reads filtered out
Number of reads filtered out in total, either for properties or in a discarded family.
Number of reads
Number of reads in discarded families + Reads with all-G UMIs + Number of unpaired reads
Reads with all-G UMIs filtered out
Number of reads filtered out due to all-G in UMI sequence.
Number of reads
Fig2) PolyG UMI read count
Reads with uncorrectable UMIs
Number of reads where the UMI could not be corrected.
Number of reads
Fig2) Uncorrectable + Ambiguous correction + PolyG
Total number of families
Number of simplex collapsed reads.
NA
Fig1) F1~F10.
Families contextually corrected
Number of families that have some contextual correction. Contextual correction is based on other families at the same mapping location including UMI sequencing error and UMI jumping.
Total number of families
Fig2) Family count of correctable UMI
Families shifted
Number of families that have some shift correction. Shift correction merges families with fragment alignment coordinates up to the distance specified by the umi-fuzzy-window-size parameter.
Total number of families
Fig1) First read pair of DF1 (If shifted distance <= "umi-fuzzy-window-size")
Families discarded
Number of families filtered out by failing min supporting reads criteria or umi-emit type of simplex/duplex.
Total number of families
Fig1) Families discarded by min-support-reads + Families discarded by duplex/simplex (See below for detail)
Families discarded by min-support-reads
Number of families filtered out by failing min supporting reads criteria.
Total number of families
Fig1) Number of families size less than "umi-min-supporting-reads" option Size 1: F6, F10 Size 2: DF3, F5, F9 Size 3: DF1, DF2
Families discarded by duplex/simplex
Number of families filtered out by failing umi-emit type of simplex/duplex.
Total number of families
Fig1) Number of simplex families (F5, F6, F9, F10) filtered. Note that simplex reads are only filtered if umi-emit-multiplicity=duplex (default: both)
Families with ambiguous correction
Number of families where the UMI cannot be corrected because more than one possible UMI correction exists.
Total number of families
Fig2) Number of families of ambiguous correction UMI
Duplex families
Number of families that are merged as duplex (both strands).
Consensus pairs emitted
Fig1) DF1, DF2, DF3
Consensus pairs emitted
Number of collapsed reads in output BAM.
NA
Fig1) Depends on umi-emit-multiplicity=simplex/duplex/both, umi-min-supporting-reads=x simplex=F1~F10 (F2, F3, F6, F7, F8, F10 filtered if x>=2) duplex=DF1, DF2, DF3 both=DF1, DF2, DF3, F5, F6, F9, F10 (F6, F10 filtered if x>=2)
Mean family depth
Average number of read pairs per family. Filtered reads and families are excluded.
NA
Fig1) Number of reads per family: DF1=3, DF2=3, DF3=2, F5=2, F6=1, F9=2, F10=1 Mean family depth = (3+3+2+2+1+2+1)/7 = 2
Histogram of num supporting fragments
Number of families with zero raw reads, one raw read, two raw reads, three raw reads, etc
NA
Fig1) 0 reads: None 1 reads: F6, F10 = 2 (0 if umi-min-supporting-reads=2) 2 reads: DF3, F5, F9 = 3 3 reads: DF1, DF2 = 2 Histogram = {0|0|3|2}
Number of collapsible regions
Number of regions.
NA
Fig3) R1~R7
Min collapsible region size (num reads)
Number of reads in the least populated region.
NA
Fig3) 2 reads (R4)
Max collapsible region size (num reads)
Number of reads in the most populated region.
NA
Fig3) 18 reads (R2)
Mean collapsible region size (num reads)
Average number of reads per region.
NA
Fig3) 8.3
Collapsible region size standard deviation
Standard deviation of the number of reads per region.
NA
Fig3) 5.8
On target number of reads
Number of reads that overlapped with the UMI target interval (--umi-metrics-interval-file).
NA
Fig1, Fig3) Each On target metric is the same as its corresponding metric, but only considers fragments that overlap the target intervals. That is, DF3, F9, and F10 in Fig1 and R1, R3, R4, R6, and R7 in Fig3 are excluded from the metric.
On target number of bases
Number of bases that overlapped with the UMI target interval (--umi-metrics-interval-file).
NA
On target number of reads with valid or correctable UMIs
Number of reads with a UMI that matched a UMI in the lookup table, including error allowance, and overlapped with the UMI target interval.
On target number of reads
On target number of reads in discarded families
Number of reads in discarded families that overlapped with the UMI target interval.
On target number of reads
On target duplex families
Number of families that are merged as duplex among the families that overlapped with the UMI target interval.
On target consensus pairs emitted
On target mean family depth
Average number of reads per family that overlapped with UMI target interval.
NA
On target families discarded
Number of families that overlapped with UMI target interval filtered out by failing min supporting reads criteria or umi-emit type of simplex/duplex.
On target number of families
On target families discarded by min-support-reads
Number of families that overlapped with UMI target interval filtered out by failing min supporting reads criteria.
On target number of families
On target families discarded by duplex/simplex
Number of families that overlapped with UMI target interval filtered out by failing umi-emit type of simplex/duplex.
On target number of families
On target families with ambiguous correction
Number of families that overlapped with the UMI target interval where the UMI cannot be corrected because more than one possible UMI correction exists.
On target number of families
Histogram of unique UMIs per fragment position
Number of positions with zero UMI sequences, one UMI sequence, two UMI sequences, etc.
NA
Fig1) 0 UMI sequence: None 1 UMI sequences: ins2 (F5), ins3 (F6) 2 UMI sequences: ins1 (DF1, DF2) 3 UMI sequences: ins4 (DF3, F9, F10) Histogram = {0|2|1|1}
Total Families in Probability Model Estimation
Total number of families used in estimation of UMI jumping rate and fragment size distribution used for probabilistic family merging.
NA
Number of potential Jumping Families
Total number of families that are potential UMI jumping candidates and the corresponding ratio.
Total Families in Probability Model Estimation
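The Fig1 arithmetic given for "Mean family depth" and "Histogram of num supporting fragments" can be reproduced with a short script. The per-family read counts are the ones listed in those rows. The function names are hypothetical, and the code only restates the worked example rather than the tool's implementation.

```python
# Read counts per family as listed in the Fig1 example.
family_reads = {"DF1": 3, "DF2": 3, "DF3": 2, "F5": 2, "F6": 1, "F9": 2, "F10": 1}


def mean_family_depth(sizes):
    """Average number of reads per family."""
    sizes = list(sizes)
    if not sizes:
        raise ValueError("no families")
    return sum(sizes) / len(sizes)


def support_histogram(sizes, min_supporting_reads=1):
    """Count families by size; index i holds the number of families of size i.

    Families smaller than `min_supporting_reads` are left out, mirroring the
    note that size-1 families drop to 0 when umi-min-supporting-reads=2.
    """
    sizes = list(sizes)
    if not sizes:
        raise ValueError("no families")
    hist = [0] * (max(sizes) + 1)
    for size in sizes:
        if size >= min_supporting_reads:
            hist[size] += 1
    return hist


depth = mean_family_depth(family_reads.values())        # (3+3+2+2+1+2+1)/7
hist_all = support_histogram(family_reads.values())     # sizes 0, 1, 2, 3
hist_min2 = support_histogram(family_reads.values(), 2)
```

Here `depth` is 2.0, `hist_all` is `[0, 2, 3, 2]`, and `hist_min2` is `[0, 0, 3, 2]`, which matches the `{0|0|3|2}` histogram quoted for umi-min-supporting-reads=2.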
2
1
10%
-5%
2
1
5%
-2.5%
2
3
5%
+2.5%
2
3
10%
+5%