
DRAGEN DNA Pipeline

The DRAGEN DNA Pipeline accelerates the secondary analysis of NGS data by harnessing the tremendous power available on the DRAGEN Platform. The pipeline includes highly optimized algorithms for mapping, aligning, sorting, duplicate marking, and haplotype variant calling. In addition to haplotype variant calling, the pipeline supports calling of copy number and structural variants as well as detection of repeat expansions and targeted calls.

DNA Mapping

Seed Density Option

The seed-density option controls how many (normally overlapping) primary seeds from each read the mapper looks up in its hash table for exact matches. The maximum density value of 1.0 generates a seed starting at every position in the read, ie, (L-K+1) K-base seeds from an L-base read.

Seed density must be between 0.0 and 1.0. Internally, an available seed pattern equal or close to the requested density is selected. The sparsest pattern is one seed per 32 positions, or density 0.03125.

Accuracy Considerations--Generally, denser seed lookup patterns improve mapping accuracy. However, for modestly long reads (eg, 50 bp+) and low sequencer error rates, there is little to be gained beyond the default 50% seed lookup density.
Speed Considerations--Denser seed lookup patterns generally slow down mapping, and sparser seed patterns speed it up. However, when the seed mapping stage can run faster than the aligning stage, a sparser seed pattern does not make the mapper much faster.

Relationship to Reference Seed Interval

Functionally, a denser or sparser seed lookup pattern has an impact very similar to a shorter or longer reference seed interval (build hash table option --ht-ref-seed-interval). Populating 100% of reference seed positions and looking up 50% of read seed positions has the same effect as populating 50% of reference seed positions and looking up 100% of read seed positions. Either way, the expected density of seed hits is 50%.

More generally, the expected density of seed hits is the product of the reference seed density (the inverse of the reference seed interval) and the seed lookup density. For example, if 50% of reference seeds are populated and 33.3% (1/3) of read seed positions are looked up, then the expected seed hit density should be 16.7% (1/6).
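As a quick arithmetic check of the product rule above, the following Python sketch computes the expected seed hit density from a reference seed interval and a seed lookup density. The function name is illustrative only and is not part of DRAGEN.

```python
from fractions import Fraction


def expected_hit_density(ref_seed_interval: int, seed_density: Fraction) -> Fraction:
    """Expected seed hit density = (1 / reference seed interval) * lookup density."""
    if ref_seed_interval < 1:
        raise ValueError("reference seed interval must be at least 1")
    if not (0 < seed_density <= 1):
        raise ValueError("seed density must be greater than 0 and at most 1")
    return Fraction(1, ref_seed_interval) * seed_density


# 50% of reference seeds populated (interval 2), 1/3 of read positions looked up.
print(expected_hit_density(2, Fraction(1, 3)))  # 1/6
```

Exact fractions are used so that the 1/6 result is not blurred by floating-point rounding.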

DRAGEN automatically adjusts its precise seed lookup pattern to ensure it does not systematically miss the seed positions populated from the reference. For example, the mapper does not look up seeds matching only odd positions in the reference when only even positions are populated in the hash table, even if the reference seed interval is 2 and seed-density is 0.5.

Map Orientations Option

The --Mapper.map-orientations option is used in mapping reads for bisulfite methylation analysis. It is set automatically based on the value set for --methylation-protocol.

The --Mapper.map-orientations option can restrict the orientation of read mapping to only forward in the reference genome, or only reverse-complemented. The valid values for --Mapper.map-orientations are as follows.

0--Either orientation (default)
1--Only forward mapping
2--Only reverse-complemented mapping

If mapping orientations are restricted and paired end reads are used, the expected pair orientation can only be FR, not FF or RF.

Seed-Editing Options

Although DRAGEN primarily maps reads by finding exact reference matches to short seeds, it can also map seeds differing from the reference by one nucleotide by also looking up single-SNP edited seeds. Seed editing is usually not necessary with longer reads (100 bp+), because longer reads have a high probability of containing at least one exact seed match. This is especially true when paired ends are used, because a seed match from either mate can successfully align the pair. But seed editing can, for example, be useful to increase mapping accuracy for short single-ended reads, with some cost in increased mapping time. The following options control seed editing:


edit-mode and edit-chain-limit

The edit-mode and edit-chain-limit options control when seed editing is used. Four edit-mode values are available: 0, 1, 2, and 3.

Edit mode 0 requires all seeds to match exactly. Mode 3 is the most expensive because every seed that fails to match the reference exactly is edited. Modes 1 and 2 employ heuristics to look up edited seeds only for reads most likely to be salvaged to accurate mapping.

The main heuristic in edit modes 1 and 2 is a seed chain length test. Exact seeds are mapped to the reference in a first pass over a given read, and the matching seeds are grouped into chains of similarly aligning seeds. If the longest seed chain (in the read) exceeds a threshold edit-chain-limit, the read is judged not to require seed editing, because there is already a promising mapping position.

Edit mode 1 triggers seed editing for a given read using the seed chain length test. If no seed chain exceeds edit-chain-limit (including if no exact seeds match), then a second seed mapping pass is attempted using edited seeds. Edit mode 2 further optimizes the heuristic for paired-end reads. If either mate has an exact seed chain longer than edit-chain-limit, then seed editing is disabled for the pair, because a rescue scan is likely to recover the mate alignment based on seed matches from one read. Edit mode 2 is the same as mode 1 for single-ended reads.
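The decision logic described above can be sketched in Python as follows. This is a read-level simplification under stated assumptions, not DRAGEN's implementation: the function and argument names are hypothetical, chain lengths are passed in as plain numbers, and mode 3 (which edits individual non-matching seeds) is reduced to "editing is always allowed".

```python
def needs_seed_editing(edit_mode, longest_chain, edit_chain_limit,
                       mate_longest_chain=None):
    """Return True if a second mapping pass with edited seeds would be attempted.

    longest_chain: longest exact-seed chain in the read (0 if no exact seed matched).
    mate_longest_chain: same for the mate, or None for single-ended reads.
    """
    if edit_mode == 0:
        return False  # all seeds must match exactly
    if edit_mode == 3:
        return True   # simplification: every non-matching seed is edited
    if edit_mode not in (1, 2):
        raise ValueError("edit_mode must be 0, 1, 2, or 3")

    read_has_long_chain = longest_chain > edit_chain_limit
    if edit_mode == 2 and mate_longest_chain is not None:
        # Paired-end heuristic: a long chain in either mate disables editing.
        mate_has_long_chain = mate_longest_chain > edit_chain_limit
        return not (read_has_long_chain or mate_has_long_chain)
    return not read_has_long_chain


# Illustrative values only; 8 is not a documented default for edit-chain-limit.
print(needs_seed_editing(1, longest_chain=5, edit_chain_limit=8))
print(needs_seed_editing(2, longest_chain=5, edit_chain_limit=8, mate_longest_chain=20))
```

With these inputs, mode 1 triggers editing for the read, while mode 2 skips it because the mate already has a chain longer than the limit.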

edit-seed-num and edit-read-len

For edit modes 1 and 2, when the heuristic triggers seed editing, these options control how many seed positions are edited in the second pass over the read. Although exact seed mapping can use a densely overlapping seed pattern, such as seeds starting at 50% or 100% of read positions, most of the value of seed editing can be obtained by editing a much sparser pattern of seeds, even a nonoverlapping pattern. Generally, if a user application can afford to spend some additional amount of mapping time on seed editing, a greater increase in mapping accuracy can be obtained for the same time cost by editing seeds in sparse patterns for a large number of reads, than by editing seeds in dense patterns for a small number of reads.

Whenever seed editing is triggered, these two options request edit-seed-num seed editing positions, distributed evenly over the first edit-read-len bases of the read. For example, with 21-base seeds, edit-seed-num=6 and edit-read-len=100, edited seeds can begin at offsets {0, 16, 32, 48, 64, 80} from the 5' end, consecutive seeds overlapping by 5 bases. Because sequencing technologies often yield better base qualities nearer the (5') beginning of each read, this can focus seed editing where it is most likely to succeed. When a particular read is shorter than edit-read-len, fewer seeds are edited.
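The example offsets above can be reproduced with a short Python sketch. The spacing rule used here (edit-read-len divided by edit-seed-num, rounded down) is an assumption chosen because it matches the documented example; DRAGEN's actual placement logic is internal and may differ. The function name is hypothetical.

```python
def edit_seed_offsets(read_len, seed_len=21, edit_seed_num=6, edit_read_len=100):
    """Start offsets (from the 5' end) of the seeds selected for editing.

    Assumption: seeds are spaced edit_read_len // edit_seed_num bases apart.
    Seeds that would run past the end of the read are dropped.
    """
    if edit_seed_num < 1:
        raise ValueError("edit_seed_num must be at least 1")
    step = edit_read_len // edit_seed_num
    offsets = [i * step for i in range(edit_seed_num)]
    return [offset for offset in offsets if offset + seed_len <= read_len]


print(edit_seed_offsets(read_len=150))  # [0, 16, 32, 48, 64, 80]
print(edit_seed_offsets(read_len=60))   # [0, 16, 32]
```

The second call illustrates the last sentence of the paragraph: a read shorter than edit-read-len receives fewer edited seeds.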

Seed editing is more expensive when the reference seed interval (build hash table option --ht-ref-seed-interval) is greater than 1. For edit modes 1 and 2, additional seed editing positions are automatically generated to avoid missing the populated reference seed positions. For edit mode 3, the time cost can increase dramatically because query seeds matching unpopulated reference positions typically miss and trigger editing.

DNA Aligning

Smith-Waterman Alignment Scoring Settings

The first stage of mapping is to generate seeds from the read and look for exact matches in the reference genome. These results are then refined by running full Smith-Waterman alignments on the locations with the highest density of seed matches. This well-documented algorithm works by comparing each position of the read against all the candidate positions of the reference. These comparisons correspond to a matrix of potential alignments between read and reference. For each of these candidate alignment positions, Smith-Waterman generates scores that are used to evaluate whether the best alignment passing through that matrix cell reaches it by a nucleotide match or mismatch (diagonal movement), a deletion (horizontal movement), or an insertion (vertical movement). A match between read and reference provides a bonus, on the score, and a mismatch or indel imposes a penalty. The overall highest scoring path through the matrix is the alignment chosen.

The specific values chosen for scores in this algorithm indicate how to balance, for an alignment with multiple possible interpretations, the possibility of an indel as opposed to one or more SNPs, or the preference for an alignment without clipping. The default DRAGEN scoring values are reasonable for aligning moderate length reads to a whole human reference genome for variant calling applications. But any set of Smith-Waterman scoring parameters represents an imprecise model of genomic mutation and sequencing errors, and differently tuned alignment scoring values can be more appropriate for some applications.
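To make the matrix description concrete, here is a minimal Python sketch of textbook Smith-Waterman local alignment with affine gap penalties. It is not DRAGEN's hardware implementation, it returns only the best score (no CIGAR), and the default parameter values are illustrative rather than DRAGEN defaults. Rows correspond to read positions and columns to reference positions, so horizontal moves are deletions and vertical moves are insertions.

```python
def local_alignment_score(read, ref, match_score=1, mismatch_pen=2,
                          gap_open_pen=6, gap_ext_pen=1):
    """Best local alignment score; a gap of n bases costs open + n * ext."""
    neg_inf = float("-inf")
    cols = len(ref) + 1
    # h: best score of an alignment ending at a cell (never below 0).
    # f: best score ending at a cell with a vertical move (insertion).
    # Only the previous row is kept in memory.
    h_prev = [0] * cols
    f_prev = [neg_inf] * cols
    best = 0
    for i in range(1, len(read) + 1):
        h_cur = [0] * cols
        f_cur = [neg_inf] * cols
        e = neg_inf  # best score ending with a horizontal move (deletion)
        for j in range(1, cols):
            e = max(e - gap_ext_pen,
                    h_cur[j - 1] - gap_open_pen - gap_ext_pen)
            f_cur[j] = max(f_prev[j] - gap_ext_pen,
                           h_prev[j] - gap_open_pen - gap_ext_pen)
            step = match_score if read[i - 1] == ref[j - 1] else -mismatch_pen
            h_cur[j] = max(0, h_prev[j - 1] + step, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


# One mismatch in a 10-base read: 9 matches minus one mismatch penalty.
print(local_alignment_score("ACGTACGTAC", "ACGTTCGTAC"))  # 7
```

Raising mismatch_pen relative to the gap penalties makes the same function prefer gapped or clipped alignments over alignments through SNPs, which is the trade-off the scoring options control.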

The following alignment options control Smith-Waterman Alignment:

global The global option (value can be 0 or 1) controls whether alignment is forced to be end-to-end in the read. When set to 1, alignments are always end-to-end, as in the Needleman-Wunsch global alignment algorithm (although not end-to-end in the reference), and alignment scores can be positive or negative. When set to 0, alignments can be clipped at either or both ends of the read, as in the Smith-Waterman local alignment algorithm, and alignment scores are nonnegative. Generally, global=0 is preferred for longer reads, so significant read segments after a break of some kind (large indel, structural variant, chimeric read, and so forth) can be clipped without severely decreasing the alignment score. Setting global=1 might not have the desired effect with longer reads because insertions at or near the ends of a read can function as pseudoclipping. Also, with global=0, multiple (chimeric) alignments can be reported when various portions of a read match widely separated reference positions. Using global=1 is sometimes preferable with short reads, which are unlikely to overlap structural breaks, unable to support chimeric alignments, and are suspected of incorrect mapping if they cannot align well end-to-end. Consider using the unclip-score option, or increasing it, instead of setting global=1, to make a soft preference for unclipped alignments.
match-score The match-score option specifies the score for a read nucleotide matching a reference nucleotide (A, C, G, or T), or matching a reference 2–3 nucleotide IUPAC-IUB code. Its value is an unsigned integer, from 0 to 15. match-score=0 can only be used when global=1. A higher match score results in longer alignments and fewer long insertions.
match-n-score The match-n-score option specifies the score for an aligned position where the read position and/or the reference position is an N code. This option is a signed integer, from -16 to 15.
mismatch-pen The mismatch-pen option is the penalty (negative score) for a read nucleotide mismatching any reference nucleotide or IUPAC-IUB code, except N. This option is an unsigned integer, from 0 to 63. A higher mismatch penalty results in alignments with more insertions, deletions, and clipping to avoid SNPs.
gap-open-pen The gap-open-pen option is the penalty (negative score) for opening a gap (ie, an insertion or deletion). This value is only for a 0-base gap. It is always added to the gap length times gap-ext-pen. This option is an unsigned integer, from 0 to 127. A higher gap open penalty causes fewer insertions and deletions of any length in alignment CIGARs, with clipping or alignment through SNPs used instead.
gap-ext-pen The gap-ext-pen option is the penalty (negative score) for extending a gap (ie, an insertion or deletion) by one base. This option is an unsigned integer, from 0 to 15. A higher gap extension penalty causes fewer long insertions and deletions in alignment CIGARs, with short indels, clipping, or alignment through SNPs used instead.
unclip-score The unclip-score option is the score bonus for an alignment reaching the beginning or end of the read. An end-to-end alignment receives twice this bonus. This option is an unsigned integer, from 0 to 127. A higher unclipped bonus causes alignment to reach the beginning and/or end of a read more often, where this can be done without too many SNPs or indels. A nonzero unclip-score is useful when global=0 to make a soft preference for unclipped alignments. Unclipped bonuses have little effect on alignments when global=1, because end-to-end alignments are forced anyway (although 2 × unclip-score does add to every alignment score unless no-unclip-score=1). Note that, especially with longer reads, setting unclip-score much higher than gap-open-pen can have the undesirable effect of insertions at or near one end of a read being used as pseudoclipping, as happens with global=1.
no-unclip-score The no-unclip-score option can be 0 or 1. The default is 1. When no-unclip-score is set to 1, any unclipped bonus (unclip-score) contributing to an alignment is removed from the alignment score before further processing, such as comparison with aln-min-score, comparison with other alignment scores, and reporting in AS or XS tags. However, the unclipped bonus still affects the best-scoring alignment found by Smith-Waterman alignment to a given reference segment, biasing toward unclipped alignments. When unclip-score > 0 causes a Smith-Waterman local alignment to extend out to one or both ends of the read, the alignment score stays the same or increases if no-unclip-score=0, whereas it stays the same or decreases if no-unclip-score=1. The default, no-unclip-score=1, is recommended when global=1, because every alignment is end-to-end, and there is no need to add the same bonus to every alignment. When changing no-unclip-score, consider whether aln-min-score should be adjusted. When no-unclip-score=0, unclipped bonuses are included in alignment scores compared to the aln-min-score floor, so the subset of alignments filtered out by aln-min-score can change significantly with no-unclip-score.
aln-min-score The aln-min-score option specifies a minimum acceptable alignment score. Any alignment results scoring lower are discarded. Increasing aln-min-score can reduce the percentage of reads mapped, and decreasing it can increase that percentage. This option is a signed integer (negative alignment scores are possible with global=1). aln-min-score also affects MAPQ estimates. The primary contributor to MAPQ calculation is the difference between the best and second-best alignment scores. A read's best alignment score is saved in the AS SAM tag, and the second-best score (if available) is saved in the XS tag. aln-min-score serves as the suboptimal alignment score if nothing higher was found except the best score. Therefore, increasing aln-min-score can decrease reported MAPQ for some low-scoring alignments. You can use the min-score-coeff option to adjust aln-min-score as a function of read length.
min-score-coeff The min-score-coeff option adjusts aln-min-score per read base. Used together, the min-score-coeff and aln-min-score options define the minimum alignment score for each read as an affine function of read length. The minimum score for an N-base read is calculated as follows: (min-score-coeff) * N + (aln-min-score). The min-score-coeff option is a signed number ranging from –64 to 63.999. If the value is 0, then the minimum alignment score is fixed at aln-min-score for all read lengths. You can use positive values for min-score-coeff to allow shorter reads to match with lower alignment scores, but require longer reads to achieve higher scores.
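Two of the formulas in the option descriptions above are simple enough to write out directly. The Python sketch below computes the total penalty of a gap (gap-open-pen plus gap length times gap-ext-pen) and the read-length-dependent minimum score. The function names are hypothetical, and the numeric values in the example are illustrative, not DRAGEN defaults.

```python
def gap_penalty(gap_len, gap_open_pen, gap_ext_pen):
    """Total penalty for one insertion or deletion of gap_len bases."""
    return gap_open_pen + gap_len * gap_ext_pen


def min_alignment_score(read_len, aln_min_score, min_score_coeff=0.0):
    """Minimum acceptable score for a read: min-score-coeff * N + aln-min-score."""
    return min_score_coeff * read_len + aln_min_score


def passes_min_score(score, read_len, aln_min_score, min_score_coeff=0.0):
    """Alignments scoring below the minimum are discarded."""
    return score >= min_alignment_score(read_len, aln_min_score, min_score_coeff)


print(gap_penalty(3, gap_open_pen=6, gap_ext_pen=1))                   # 9
print(min_alignment_score(50, aln_min_score=20, min_score_coeff=0.5))   # 45.0
print(min_alignment_score(100, aln_min_score=20, min_score_coeff=0.5))  # 70.0
```

With a positive coefficient, the 100-base read must reach a higher score than the 50-base read, as described for min-score-coeff.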

Paired-End Options

DRAGEN can process paired-end data passed via a pair of FASTQ files or in a single interleaved FASTQ file. The hardware maps the two ends separately, and then determines a set of alignments that seem most likely to form a pair in the expected orientation and having roughly the expected insert size. The alignments for the two ends are evaluated for the quality of their pairing, with larger penalties for insert sizes far from the expected size. The following options control processing of paired-end data:

pe-orientation The pe-orientation option specifies the expected paired-end orientation. Only pairs with this orientation can be flagged as proper pairs. Valid values are as follows:
- 0--FR (default)
- 1--RF
- 2--FF
unpaired-pen For paired end reads, best mapping positions are determined jointly for each pair, according to the largest pair score found, considering the various combinations of alignments for each mate. A pair score is the sum of the two alignment scores minus a pairing penalty, which estimates the unlikelihood of insert lengths further from the mean insert than this aligned pair. The unpaired-pen option specifies how much alignment pair scores should be penalized when the two alignments are not in properly paired position or orientation. This option also serves as the maximum pairing penalty for properly paired alignments with extreme insert lengths. The unpaired-pen option is specified in Phred scale, according to its potential impact on MAPQ. Internally, it is scaled into alignment score space based on Smith-Waterman scoring parameters.
pe-max-penalty The pe-max-penalty option limits how much the estimated MAPQ for one read can increase because its mate aligned nearby. A paired alignment is never assigned MAPQ higher than the MAPQ that it would have received mapping single-ended, plus this value. By default, pe-max-penalty = mapq-max = 255, effectively disabling this limit. The key difference between unpaired-pen and pe-max-penalty is that unpaired-pen affects calculated pair scores, and thus which alignments are selected, whereas pe-max-penalty affects only reported MAPQ for paired alignments.
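The difference between the two options can be illustrated with a small Python sketch. It is a simplified model under stated assumptions: all penalties are already expressed in alignment-score space (DRAGEN's internal scaling of unpaired-pen from Phred scale is not modeled), the pairing penalty for a given insert length is supplied by the caller, and the function names are hypothetical.

```python
def pair_score(score1, score2, pairing_penalty, unpaired_pen, properly_paired=True):
    """Sum of the two mate alignment scores minus a pairing penalty.

    For proper pairs the penalty is capped at unpaired_pen; for improper
    pairs the full unpaired_pen applies.
    """
    if properly_paired:
        penalty = min(pairing_penalty, unpaired_pen)
    else:
        penalty = unpaired_pen
    return score1 + score2 - penalty


def capped_paired_mapq(paired_mapq, single_end_mapq, pe_max_penalty, mapq_max):
    """Limit how much pairing can raise MAPQ over the single-ended estimate."""
    return min(paired_mapq, single_end_mapq + pe_max_penalty, mapq_max)


# unpaired-pen changes which candidate pair wins ...
print(pair_score(100, 90, pairing_penalty=5, unpaired_pen=80))    # 185
print(pair_score(100, 90, pairing_penalty=200, unpaired_pen=80))  # 110
# ... whereas pe-max-penalty only limits the reported MAPQ.
print(capped_paired_mapq(60, single_end_mapq=10, pe_max_penalty=20, mapq_max=60))  # 30
```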

Mean Insert Size Detection

When working with paired-end data, DRAGEN must choose among the highest-quality alignments for the two ends to try to choose likely pairs. To make this choice, DRAGEN uses a skew normal insert model to evaluate the likelihood that a pair of alignments constitutes a pair. This model is based on the observation that common library preparation methods have insert-size distributions that are sometimes close to normal, but also sometimes clearly asymmetric, often skewing toward longer insert sizes. The skew normal insert model is used only for the DNA mode.

If you know the statistics of your library prep for an input file (and the file consists of a single read group), you can specify the characteristics of the insert-length distribution: mean, standard deviation, shape (or skewness) and three quartiles. These characteristics can be specified with the Aligner.pe-stat-mean-insert, Aligner.pe-stat-stddev-insert, Aligner.pe-stat-shape-insert, Aligner.pe-stat-quartiles-insert, and Aligner.pe-stat-mean-read-len options. However, it is typically preferable to allow DRAGEN to detect these characteristics automatically.

DRAGEN automatically samples the insert-length distribution. When the software starts execution, it runs a sample of up to 2,000,000 pairs through the aligner, calculates the distribution, and then uses the resulting statistics for evaluating all pairs in the input set.

The DRAGEN host software reports the statistics in its stdout log in a report, as follows:

Initial paired-end statistics detected for read group RGID, based on 39042 high quality pairs for FR orientation
        Quartiles (25 50 75) = 398 409 420
        Mean = 410.192
        Standard deviation = 14.1254
        NOTE: DRAGEN's insert estimates include corrections for clipping (so they are not identical to TLEN)

        Skew-normal insert distribution applied:
          Position (xi) = 424.084
          Scale (omega) = 19.8719
          Shape (alpha) = -1.88125

        To rerun with identical insert stats, specify:
          --Aligner.pe-stat-mean-insert=424.084
          --Aligner.pe-stat-stddev-insert=19.8719
          --Aligner.pe-stat-shape-insert=-1.88125
          --Aligner.pe-stat-quartiles-insert="398 409 420"
          --Aligner.pe-stat-mean-read-len=101

Note that the Mean, Standard deviation and Quartiles reported above are the sample mean, standard deviation and quartiles calculated from the initial sample of up to 2,000,000 pairs, assuming a normal distribution. The sample mean and standard deviation are used to fit the parameters of a skew-normal distribution. A skew-normal distribution is defined by starting with an underlying normal distribution (whose mean we call position or xi and standard deviation we call scale or omega) and folding a varying portion of the probability mass from one side of the mean (e.g., left side) to the other (e.g., right) side. The portion folded varies smoothly, from 0% at the original mean, approaching 100% from the left tail to the right tail. A shape parameter which we call alpha controls how rapidly the folded fraction increases, and at alpha=0 there is no folding and the distribution remains normal.

In the standard output, we also include the command line options needed to reproduce the DRAGEN run with the same insert stat settings. Note that when specifying stats on the command line, the skew-normal xi value should be used for Aligner.pe-stat-mean-insert. The omega value should be used for Aligner.pe-stat-stddev-insert, and the alpha value should be used for Aligner.pe-stat-shape-insert. If Aligner.pe-stat-shape-insert is not specified on the command line, a default value of 0 is assumed.
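The relationship between the skew-normal parameters (xi, omega, alpha) and the reported sample mean and standard deviation can be checked numerically. The Python sketch below assumes the standard skew-normal parameterization, which is consistent with the parameter names used here but is not confirmed as DRAGEN's exact internal model. The function names are illustrative.

```python
import math


def skew_normal_pdf(x, xi, omega, alpha):
    """Density of the standard skew-normal distribution at x."""
    z = (x - xi) / omega
    normal_pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    normal_cdf = 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))
    return 2.0 / omega * normal_pdf * normal_cdf


def skew_normal_mean_std(xi, omega, alpha):
    """Mean and standard deviation implied by (xi, omega, alpha)."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    mean = xi + omega * delta * math.sqrt(2.0 / math.pi)
    std = omega * math.sqrt(1.0 - 2.0 * delta * delta / math.pi)
    return mean, std


# Parameters taken from the example log above.
mean, std = skew_normal_mean_std(xi=424.084, omega=19.8719, alpha=-1.88125)
print(round(mean, 1), round(std, 1))
```

Under this assumption, the implied mean and standard deviation land close to the sample values in the example log (Mean = 410.192, Standard deviation = 14.1254), which shows why xi and omega, rather than the sample statistics, must be passed back on the command line.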

The insert length distribution for each sample is written to fragment_length_hist.csv. Each sample starts with the following lines:

 #Sample: sample name
 FragmentLength,Count

These lines are followed by the histogram for the first ~2M read pairs for DNA (~100K read pairs for RNA). The histogram counts are aggregated across all read groups sharing the same sample id (RGSM field).
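A file with this layout can be read with a few lines of Python. The sketch below assumes that each histogram row is a "FragmentLength,Count" pair of integers, as the header suggests; the function name and the example data are made up for illustration.

```python
def parse_fragment_length_hist(text):
    """Return {sample_name: {fragment_length: count}} from the CSV text."""
    samples = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("#Sample:"):
            current = line[len("#Sample:"):].strip()
            samples[current] = {}
        elif line == "FragmentLength,Count":
            continue  # column header repeated for each sample
        elif current is not None:
            length, count = line.split(",")
            samples[current][int(length)] = int(count)
    return samples


# Made-up example content, not real DRAGEN output.
EXAMPLE = """#Sample: sampleA
FragmentLength,Count
398,120
409,300
#Sample: sampleB
FragmentLength,Count
150,7
"""

hist = parse_fragment_length_hist(EXAMPLE)
print(sorted(hist))      # ['sampleA', 'sampleB']
print(hist["sampleA"])   # {398: 120, 409: 300}
```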

When the number of sample pairs is very small, there is not enough information to characterize the distribution with high confidence. In this case, DRAGEN applies default statistics that specify a very wide insert distribution, which tends to admit pairs of alignments as proper pairs, even if they may lie tens of thousands of bases apart. In this situation, DRAGEN outputs a message, as follows:

WARNING: Less than 28 high quality pairs found - standard deviation is
calculated from the small samples formula

The small samples formula calculates standard deviation as follows:

 if samples < 3 then
     standard deviation = 10000
 else if samples < 28 then
     standard deviation = 25 * (standard deviation + 1) / (samples - 2)
 end if

 if standard deviation < 12 then
     standard deviation = 12
 end if

The default model is "standard deviation = 10000". If the first 2M reads are unmapped or if all pairs are improper pairs, then the standard deviation is set to 10000 and the mean and quartiles are set to 0. Note that the minimum value for standard deviation is 12, which is independent of the number of samples. Also, in DNA mode, when fewer than 1000 high-quality alignments are available, DRAGEN reverts to the normal-distribution-based insert model, because there are too few samples to estimate the parameters of the skew-normal distribution accurately.
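The small samples formula translates directly into Python. The sketch below mirrors the pseudocode above; only the function name is invented.

```python
def adjusted_insert_stddev(stddev, samples):
    """Apply the small samples formula and the floor of 12 to a standard deviation."""
    if samples < 3:
        stddev = 10000.0
    elif samples < 28:
        stddev = 25.0 * (stddev + 1.0) / (samples - 2)
    # The floor applies regardless of the number of samples.
    return max(stddev, 12.0)


print(adjusted_insert_stddev(5.0, samples=2))    # 10000.0
print(adjusted_insert_stddev(5.0, samples=10))   # 18.75
print(adjusted_insert_stddev(5.0, samples=100))  # 12.0
```

The inflation factor 25 / (samples - 2) shrinks toward 1 as the sample count approaches 28, so the adjustment fades out smoothly as more high quality pairs become available.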

For RNA-Seq data, the insert size distribution is not normal due to pairs containing introns. The DRAGEN software estimates the distribution using a kernel density estimator to fit a long tail to the samples. This estimate leads to a more accurate mean and standard deviation for RNA-Seq data and proper pairing.

DRAGEN writes detected paired-end stats into a tab-delimited log file in the output directory called .insert-stats.tab. This file contains the statistical distribution of detected insert sizes for each read group, including quartiles, mean, standard deviation, shape, minimum, and maximum. The information matches the standard-out report above. Additionally, the log file includes the minimum and maximum insert limits that DRAGEN applied for rescue scans. Note that the reported mean and standard deviation in this tab-delimited log file are the xi and omega parameters of the skew-normal distribution.

Rescue Scans

For paired-end reads, where a seed hit is found for one mate but not the other, rescue scans hunt for missing mate alignments within a rescue radius of the mean insert length. Normally, the DRAGEN host software sets the rescue radius to 2.5 standard deviations of the empirical insert distribution. But in cases where the insert standard deviation is large compared to the read length, the rescue radius is restricted to limit mapping slowdowns. In this case, a warning message is displayed, as follows:

Rescue radius = 220
     Effective rescue sigmas = 0.5
            WARNING: Default rescue sigmas value of 2.5 was overridden by host software!
            The user may wish to set rescue sigmas value explicitly with --Aligner.rescue-sigmas

Although the user can ignore this warning, or specify an intermediate rescue radius to maintain mapping speed, it is recommended to use 2.5 sigmas for the rescue radius to maintain mapping sensitivity. To disable rescue scanning, set max-rescues to 0.
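The relationship between rescue radius, rescue sigmas, and insert standard deviation is a simple product, sketched below in Python. How DRAGEN decides to restrict the radius is not described here, so the sketch covers only the arithmetic; the function names are hypothetical, and the standard deviation of 440 is inferred from the example warning (radius 220 at 0.5 sigmas), not taken from a real run.

```python
def rescue_window(mean_insert, stddev_insert, rescue_sigmas=2.5):
    """Insert-length range searched by a rescue scan: mean +/- sigmas * stddev."""
    radius = rescue_sigmas * stddev_insert
    return mean_insert - radius, mean_insert + radius


def effective_rescue_sigmas(rescue_radius, stddev_insert):
    """Number of standard deviations covered by a given rescue radius."""
    return rescue_radius / stddev_insert


print(rescue_window(410.0, 14.0))           # (375.0, 445.0)
print(effective_rescue_sigmas(220, 440.0))  # 0.5
```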

Output Options

DRAGEN can track multiple independent alignments for each read. These alignments include the optimal (primary) one, as well as those mapping different subsegments of the read (chimeric/supplementary), and suboptimal (secondary) mappings of the read to different areas of the reference.

For DNA alignment by default, DRAGEN can emit one primary alignment for each read, up to three chimeric alignments (Aligner.supp-aligns=3), and no secondary alignments (Aligner.sec-aligns=0). The maximum user-specified value for supp-aligns or sec-aligns is 4095.

You can use the following configuration options to control how many of each type of alignment to include in DRAGEN output.

mapq-max The mapq-max option specifies a ceiling on the estimated MAPQ that can be reported for any alignment, from 0 to 255. If the calculated MAPQ is higher, this value is reported instead. The default is 60.
supp-aligns, sec-aligns The supp-aligns and sec-aligns options restrict the maximum number of supplementary (ie, chimeric and SAM FLAG 0x800) alignments and secondary (ie, suboptimal and SAM FLAG 0x100) alignments, respectively, that can be reported for each read. A maximum of 4095 supplementary alignments and 4095 secondary alignments can be reported for any read, in addition to a primary alignment. High settings for these two options impact speed so it is advisable to increase only as needed.
sec-phred-delta The sec-phred-delta option controls which secondary alignments are emitted based on the alignment score relative to the primary reported alignment. Only secondary alignments with likelihood within this Phred value of the primary are reported.
sec-aligns-hard The sec-aligns-hard option suppresses the output of all secondary alignments if there are more secondary alignments than can be emitted. Set sec-aligns-hard to 1 to force the read to be unmapped when not all secondary alignments can be output.
supp-as-sec When the supp-as-sec option is set to 1, then supplementary (chimeric) alignments are reported with SAM FLAG 0x100 instead of 0x800. The default is 0. The supp-as-sec option provides compatibility with tools that do not support FLAG 0x800.
hard-clips The hard-clips option is used as a field of 3 bits, with values ranging from 0 to 7. The bits specify alignments, as follows:
- Bit 0--primary alignments
- Bit 1--supplementary alignments
- Bit 2--secondary alignments

Each bit determines whether local alignments of that type are reported with hard clipping (1) or soft clipping (0). The default is 6, meaning primary alignments use soft clipping and supplementary and secondary alignments use hard clipping.
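The hard-clips bit field can be decoded with a short Python sketch. The function name is illustrative; the bit assignments follow the list above.

```python
def decode_hard_clips(value):
    """Map a hard-clips value (0 to 7) to the clipping style per alignment type."""
    if not 0 <= value <= 7:
        raise ValueError("hard-clips must be between 0 and 7")
    kinds = ("primary", "supplementary", "secondary")  # bits 0, 1, 2
    return {
        kind: "hard" if (value >> bit) & 1 else "soft"
        for bit, kind in enumerate(kinds)
    }


print(decode_hard_clips(6))
# {'primary': 'soft', 'supplementary': 'hard', 'secondary': 'hard'}
```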

Mapping with ALT-contigs

The GRCh38 human reference contains many more alternate haplotypes (ALT contigs) than previous versions of the reference. Generally, including ALT contigs in the mapping reference improves mapping and variant calling specificity, because misalignments are eliminated for reads matching an ALT contig but scoring poorly against the primary assembly. However, mapping with GRCh38's ALT contigs without special treatment can substantially degrade variant calling sensitivity in corresponding regions, because many reads align equally well to an ALT contig and to the corresponding position in the primary assembly.

Mask-Based ALT-Awareness

The recommended and default approach for dealing with ALT contigs in DRAGEN is to mask regions of ALT contigs that are highly similar to their corresponding primary contig. This approach is more accurate than liftover-based ALT-awareness because there are many places where the "correct" or most useful liftover between a long ALT haplotype and the primary assembly is ambiguous. Incorrect liftover can produce dense clusters of mismapped reads and false variant calls. The base masking approach has the benefits of using ALT contigs without the negative consequences.

Masked hash tables are built from a standard hg19 or hg38 FASTA that contains ALT contigs. The hash table builder automatically masks regions of the ALT contigs with Ns.

Liftover-Based ALT-Awareness

With liftover based ALT-awareness, the mapper and aligner are aware of the liftover relationship between ALT contig positions and corresponding primary assembly positions. Seed matches within ALT contigs are used to obtain corresponding primary assembly alignments, even if the latter score poorly. Liftover groups are formed, each containing a primary assembly alignment candidate, and zero or more ALT alignment candidates that lift to the same location. Each liftover group is scored according to its best-matching alignments, taking properly paired alignments into account. The winning liftover group provides its primary assembly representative as the primary output alignment, with MAPQ calculated based on the score difference to the second-best liftover group. Emitting primary alignments within the primary assembly maintains normal aligned coverage and facilitates variant calling there. If the --Aligner.en-alt-hap-aln option is set to 1 and --Aligner.supp-aligns is greater than 0, then corresponding alternate haplotype alignments can also be output, flagged as supplementary alignments.

DRAGEN requires ALT-Aware hash tables for any hg19 or GRCh38 reference where ALT contigs are detected. To disable this requirement in DRAGEN, set the --ht-alt-aware-validate option to false.

The following is a comparison of alternative options for dealing with alternate haplotypes.

Mapping without ALT contigs in the reference:
- False-positive variant calls result when reads matching an alternate haplotype misalign somewhere else.
- Poor mapping and variant calling sensitivity where reads matching an ALT contig differ greatly from the primary assembly.

Mapping with ALT contigs but no ALT awareness:
- False-positive variant calls from misaligned reads matching ALT contigs are eliminated.
- Low or zero aligned coverage in primary assembly regions covered by alternate haplotypes, due to some reads mapping to ALT contigs.
- Low or zero MAPQ in regions covered by alternate haplotypes, where they are similar or identical to the primary assembly.
- Variant calling sensitivity is dramatically reduced throughout regions covered by alternate haplotypes.

Mapping with ALT contigs and ALT awareness:
- False-positive variant calls from misaligned reads matching ALT contigs are eliminated.
- Normal aligned coverage in regions covered by alternate haplotypes because primary alignments are to the primary assembly.
- Normal MAPQs are assigned because alignment candidates in alternative haplotypes are not considered in competition.
- Good mapping and variant calling sensitivity where reads matching an ALT contig differ greatly from the primary assembly.

DRAGEN Multigenome Graph Mapper

To improve variant calling accuracy in segmental duplications and other regions difficult to map with Illumina reads, you can use the Multigenome (Graph) mapper in DRAGEN. The graph-based method uses additional variants from population haplotypes to establish alternate graph paths that reads could seed-map and align to. The Multigenome (Graph) mapper reduces mapping ambiguity because reads that contain population variants are attracted to the specific regions where the variants are observed.

When given a set of population variants (VCF) or haplotypes, the graph reference modifications fall into the following types:

Alternate contigs. Alt-contigs represent population haplotypes and can have a single variant or a combination of nearby phased variants.
Ambiguous codes (IUPAC codes). To improve alignment, isolated population SNPs are written into the reference FASTA as IUPAC ambiguity codes.
Haplotype database. An additional haplotype database is built and used to augment the reference FASTA with population variants. A graph-mapper algorithm is used to score read alignments according to the variants in this database.

The DRAGEN graph hashtables are available to download from the DRAGEN Software Support Site page.

Read Trimming

DRAGEN can remove artifacts from reads using hardware-accelerated read trimming. Hardware-accelerated read trimming is available on U200 and cloud systems as part of the DRAGEN mapper, and adds no additional run time. DRAGEN provides multiple independent trimming filters that target different types of artifacts or use cases. You can enable and configure each filter independently to tailor the read trimming to your analysis. Read trimming uses two different modes, hard-trimming and soft-trimming.

To enable hard-trimming mode, use --read-trimmers. In hard-trimming mode, potential artifacts are removed from input reads. Reads that are trimmed to fewer than 20 bases are filtered and replaced with a placeholder read that uses 10 N bases. DRAGEN assigns the filtered reads a 0x200 flag set.
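
The minimum-length rule for hard-trimming can be illustrated with a short Python sketch. The function name and signature are hypothetical and only mirror the behavior described above (20-base minimum, 10-base N placeholder, 0x200 flag); they are not part of DRAGEN.

```python
from typing import Tuple

MIN_READ_LENGTH = 20     # reads trimmed below this length are filtered (from the text)
PLACEHOLDER_LENGTH = 10  # filtered reads are replaced with 10 N bases (from the text)
FILTERED_FLAG = 0x200    # flag set on filtered reads (from the text)


def apply_hard_trim_filter(trimmed_seq: str, flag: int) -> Tuple[str, int]:
    """Return (sequence, flag) after the minimum-length filter.

    Hypothetical helper for illustration only.
    """
    if len(trimmed_seq) < MIN_READ_LENGTH:
        # Too short after trimming: emit a placeholder read and set 0x200.
        return "N" * PLACEHOLDER_LENGTH, flag | FILTERED_FLAG
    return trimmed_seq, flag
```

For example, a read trimmed to 16 bases becomes a 10-base N placeholder with 0x200 set, while a read trimmed to exactly 20 bases is kept unchanged.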

DRAGEN contains a novel lossless soft-trimming mode. In soft-trimming mode, reads are mapped as though they had been trimmed, but no bases are removed. To enable the trimmer in soft mode, use --soft-read-trimmers.

Soft-trimming suppresses systematic mismapping of reads that contain trimmable artifacts, without actually losing the trimmed bases in aligned output. Soft-trimming prevents reads with trimmable artifacts, such as Poly-G artifacts, from being mapped to reference G homopolymers, and prevents adapter sequences from being mapped to the matching reference loci. Soft-trimming might map reads to different positions in the reference than they would have been mapped to without soft-trimming. When using soft-trimming, DRAGEN does not filter reads, and does not map reads whose bases would have been trimmed entirely.

Soft-trimming for Poly-G artifacts is enabled by default on supported systems.

Read Trimming Tools

Fixed-Length Trimming

Fixed-length trimming removes a fixed number of bases from the 5' end of each read. If you are analyzing sequencing data from an amplicon of fixed size and expect the read-length to consistently exceed the length of quality sequence data, you can use the expected number in fixed-length trimming.

Poly-G Trimming

Poly-G artifacts appear on two-channel sequencing systems when the dark base G is called after synthesis has terminated. As a result, DRAGEN calls several erroneous high-confidence G bases on the ends of affected reads. For contaminated samples, many affected reads can be mapped to reference regions with high G content. The affected reads can cause problems for processing downstream.

Quality Trimming

Base quality can degrade over the length of a read toward the 3' end, separately from any artifacts caused by early termination of synthesis. The lower quality bases can affect mapping and alignment results, and might lead to incorrect variant or methylation calls downstream. The quality trimming tool calculates a rolling average of the base quality inward from the 3' end and removes the minimum number of bases needed for the average quality to be above the threshold specified using --trim-min-quality.

Adapter Trimming

Problems during library preparation, or libraries with smaller inserts can result in the synthesis of high quality reads containing sequence from the adapters used. If not removed before analysis, noninsert bases can reduce mapping efficiency and downstream accuracy. The adapter trimming tool uses the adapter sequences from the input FASTA file, and then removes all hits greater than a specified size. Adapter trimming allows for a 10% mismatch. For 3' adapters, trimming is from the first matching adapter base to the end of the read. For 5' adapters, trimming is from the first (3') matching adapter base to the beginning (5') of the read.

Ambiguous Base Trimming

If quality trimming is not feasible due to reduced yield or other limitations, an alternative option is to remove only explicitly ambiguous bases from the ends of reads. If enabled, the ambiguous base trimmer applies a simple exact-match search to both ends of all processed reads, regardless of mate-pair status.

Minimum Length Trimming

You can maximize trimmer sensitivity by using the minimum length trimming tool to remove a fixed number of bases from each read after the trimmer tools above have run. For example, if you would like to remove 5 bp from each read, a 7 bp adapter hit could be missed if five of the bases are removed first. To mitigate this issue, DRAGEN provides an optional minimum trim-length filter.

Maximum Length Trimming

If using libraries of fixed-size inserts, such as small PCR amplicons, it is more convenient to specify a length that all reads should be trimmed to rather than the number of bases to remove. You can use the maximum length trimming tool.

PolyA Tail Trimming

If using RNA libraries, reads overlapping the poly-A tail of the transcripts may contain long poly-A/poly-T sequences at the end of the reads which may result in incorrect alignment. The poly-A trimmer mitigates this by trimming the poly-A tail from the end of the read. See additional description in RNA alignment section.

Read Trimming Metrics

The trimmer generates a metrics file titled <output prefix>.trimmer_metrics.csv. Metrics are available on an aggregate level over all input data. The metrics units are in reads or bases.

Total input reads--Total number of reads in the input files.
Total input bases--Total number of bases in the input reads.
Total input bases R1--Total number of bases in R1 reads.
Total input bases R2--Total number of bases in R2 reads.
Average input read length--Total number of input bases divided by the number of input reads.
Total trimmed reads--Total number of reads trimmed by at least one base, not including soft-trimming.
Total trimmed bases--Total number of bases trimmed, not including soft-trimming.
Average bases trimmed per read--The number of trimmed bases divided by the number of input reads.
Average bases trimmed per trimmed read--The number of trimmed bases divided by the number of trimmed reads.
Remaining poly-G K-mers R1 3prime--The number of R1 3' read ends that contain likely Poly-G artifacts after trimming.
Remaining poly-G K-mers R2 3prime--The number of R2 3' read ends that contain likely Poly-G artifacts after trimming.
Total filtered reads--The number of reads that were filtered out during trimming.
Reads filtered for minimum read length R1--The number of R1 reads that were filtered due to being trimmed below the minimum read length.
Reads filtered for minimum read length R2--The number of R2 reads that were filtered due to being trimmed below the minimum read length.
<Trimmer tool> trimmed reads--The number of reads with at least one base trimmed by the named trimmer tool. DRAGEN reports the metric for both R1 and R2 mates and the filtering status (unfiltered or filtered) of the trimmed read. The metric includes reads that were trimmed during soft-trimming. Each trimming tool above produces the metric.
<Trimmer tool> trimmed bases--The number of bases trimmed by the named trimmer tool. The metric is produced for both R1 and R2 mates and the filtering status (unfiltered or filtered) of the trimmed read. The metric includes bases from reads that were trimmed during soft-trimming. Each trimming tool above produces the metric.

Read Trimming Settings

Read trimmer

Filtering after the trimmer's execution

Fixed-length trimming

Quality trimming

Adapter trimming

Bisulfite trimming

Minimum-length trimming

Maximum-length trimming

PolyA trimming

PolyG trimming

PolyX trimming

DRAGEN FastQC

DRAGEN FastQC is a tool for calculating common metrics used for quality control of high-throughput sequencing data. The tool is modeled after the metrics generated by Babraham Institute's FastQC tool.

The metrics are generated automatically on all DRAGEN map-align workflows with no additional run time and output in a CSV format file called <PREFIX>.fastqc_metrics.csv. All metrics are calculated and reported separately for each mate-pair.

For users who are interested only in sample QC or who want FastQC results alone, DRAGEN provides a mode to generate the fastqc_metrics.csv file directly.

By default DRAGEN FastQC and read-trimming are run as preprocessing steps to standard sequence alignment workflows. If DNA alignment is not needed or if QC results are needed more quickly, the mapping and BAM output portions of the workflow can be disabled. The workflow only outputs key metric files and runs ~70% faster. This option is available on the command-line by entering --fastqc-only=true after the DRAGEN command.

If FastQC runs stand-alone, then the license will not be consumed. If FastQC runs with map-align enabled, then the license will be consumed.

Differences from the Babraham Institute's FastQC

DRAGEN FastQC is a complete reimplementation of the original FastQC tool developed by the Babraham Institute (henceforth BI-FastQC). The reimplementation of FastQC in DRAGEN, however, has been modified to take advantage of the hardware-acceleration provided by the DRAGEN Field-Programmable Gate Array (FPGA) for a significant speed improvement. As such, there are some differences in how the values are calculated and the resulting metrics will not be exactly identical between the two tools. The most significant differences are described below.

Binning: BI-FastQC uses a customizable binning strategy with a default of 5bp bins, while DRAGEN uses an algorithmic binning strategy based on the Granularity setting described below. In general, this should mean that DRAGEN provides more precise results at default settings.
Outputs: BI-FastQC text output contains the same information as its plots in tabular format, while DRAGEN-FastQC outputs its raw data. For example, BI-FastQC both plots and outputs the average base quality per position, while DRAGEN outputs the average base quality by both position and nucleotide. This allows for a more detailed analysis of the data, but requires slightly more work to generate the associated plot.
Rounding: DRAGEN consistently rounds its calculations to the nearest integer, while the original FastQC uses a mixture of rounding and taking the mathematical floor, leading DRAGEN-FastQC to provide incrementally higher results for some metrics.
Smoothing: Both DRAGEN-FastQC and BI-FastQC utilize smoothing techniques for their distributions of %GC, to account for the fact that 151bp do not divide evenly into 100 percentile bins. However, to take advantage of the speed offered by the FPGA, DRAGEN utilizes a slightly different algorithm than BI-FastQC which results in slightly different results.

Metric Granularity

Due to memory constraints, it is not possible to guarantee single-base resolution for all metrics. DRAGEN provides an algorithmic solution for binning via --fastqc-granularity. DRAGEN allocates 256 bins in memory for each size or position-based metric. A granularity value from 4 to 7 inclusive determines the bin size. Higher values use smaller bins for greater resolution. Lower values create larger bins to accommodate longer read lengths.

Granularity

Single Base Resolution (bp)

Resolution at 150 (bp)

Adapter and Kmer Sequence Files

To include metrics for adapter or other sequence content, DRAGEN FastQC needs to be provided with the desired sequences in FASTA format. DRAGEN provides two options for this purpose, --fastqc-adapter-file for adapter sequences and --fastqc-kmer-file for any additional kmers of interest so that users can add sequences of interest without changing the expected adapter results.

DRAGEN FastQC can accept up to a combined total of 16 adapters and kmer sequences. Each sequence can be a maximum of 12 bp in length. By default, DRAGEN uses the adapter file located at <INSTALL_PATH>/config/adapter_sequences.fasta. The file contains the same adapter sequences as Babraham's FastQC v 0.11.10 and later, listed as follows.

Illumina Universal Adapter--AGATCGGAAGAG
Illumina Small RNA 3' Adapter--TGGAATTCTCGG
Illumina Small RNA 5' Adapter--GATCGTCGGACT
Nextera Transposase Sequence--CTGTCTCTTATA

FastQC Metrics Output

The FastQC metrics are output to a CSV file format in the run output directory called

<PREFIX>.fastqc_metrics.csv

The reported metrics are broken down into eight sections by metric type. Each section is broken down further into separate rows by either the length, position, or other relevant categorical variables. The following are the metric sections.

Read Mean Quality---Total number of reads with each average Phred-scale quality value, rounded to the nearest integer.
Positional Base Mean Quality---Average Phred-scale quality value of bases with a specific nucleotide and at a given location in the read. Locations are listed first and can be either specific positions or ranges. The nucleotide is listed second and can be A, C, G, or T. N or ambiguous bases are assumed to have the system default value, usually QV2.
Positional Base Content---Number of bases of each specific nucleotide at given locations in the read. Locations are given first and can be either specific positions or ranges. The nucleotide is listed second and can be A, C, G, T, N.
Read Lengths---Total number of reads with each observed length. Lengths can be either specific sizes or ranges, depending on settings specified using --fastqc-granularity.
Read GC Content---Total number of reads with each GC content percentile between 0% and 100%.
Read GC Content Quality---Average Phred-scale read mean quality for reads with each GC content percentile between 0% and 100%.
Sequence Positions---Number of times an adapter or other kmer sequence is found, starting at a given position in the input reads. Sequences are listed first in the metric description in quotes. Locations are listed second and can be either specific positions or ranges.
Positional Quality---Phred-scale quality value for bases at a given location and a given quantile of the distribution. Locations are listed first and can be either specific positions or ranges. Quantiles are listed second and can be any whole integer 0–100.

The following are example rows from each section.

Sorting and Duplicate Marking

Sorting

The map/align system produces a BAM file sorted by reference sequence and position by default. Creating this BAM file typically eliminates the requirement to run samtools sort or any equivalent postprocessing command. The --enable-sort option can be used to enable or disable creation of the BAM file, as follows:

To enable, set to true.
To disable, set to false.

On the reference hardware system, running with sort enabled increases run time for a 30x full genome by about 6 to 7 minutes.

Duplicate Marking

Marking or removing duplicate aligned reads is a common best practice in whole-genome sequencing. Not doing so can bias variant calling and lead to incorrect results.

The DRAGEN system can mark or remove duplicate reads, and produces a BAM file with duplicates marked in the FLAG field, or with duplicates entirely removed.

In testing, enabling duplicate marking adds minimal run time over and above the time required to produce the sorted BAM file. The additional time is approximately 1 to 2 minutes for a 30x whole human genome, which is a substantial improvement over the long run times of open source tools.

The Duplicate Marking Algorithm

The DRAGEN duplicate-marking algorithm is modeled on the Picard toolkit's MarkDuplicates feature. All the aligned reads are grouped into subsets in which all the members of each subset are potential duplicates.

For two pairs to be duplicates, they must have the following:

Identical alignment coordinates (position adjusted for soft- or hard-clips from the CIGAR) at both ends.
Identical orientations (direction of the two ends, with the left-most coordinate being first).

In addition, an unpaired read may be marked as a duplicate if it has identical coordinate and orientation with either end of any other read, whether paired or not.

Unmapped read pairs are never marked as duplicates.

When DRAGEN has identified a group of duplicates, it picks one as the best of the group, and marks the others with the BAM duplicate flag (0x400, or decimal 1024). For this comparison, duplicates are scored based on the average sequence Phred quality. Pairs receive the sum of the scores of both ends, while unpaired reads get the score of the one mapped end. The idea of this score is to try, all other things being equal, to preserve the reads with the highest-quality base calls.

If two reads (or pairs) have exactly matching quality scores, DRAGEN breaks the tie by choosing the pair with the higher alignment score. If there are multiple pairs that also tie on this attribute, then DRAGEN chooses a winner arbitrarily.

The score for an unpaired read R is the average Phred quality score per base, calculated as follows:

Where R is a BAM record, QUAL is its array of Phred quality scores, and dedup-min-qual is a DRAGEN configuration option with default value of 15. For a pair, the score is the sum of the scores for the two ends.

This score is stored as a one-byte number, with values rounded down to the nearest one-quarter. This rounding may lead to different duplicate marks from those chosen by Picard, but because the reads were very close in quality this has negligible impact on variant calling results.
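
The scoring description above can be sketched in Python. The formula itself is not reproduced in this section, so the denominator used here (all bases of the read, with bases below the threshold contributing nothing to the sum) is an assumption, as are the function names. Whether DRAGEN quantizes before or after summing the two ends of a pair is also not stated, so quantization is kept as a separate step.

```python
import math
from typing import Sequence

DEDUP_MIN_QUAL = 15  # default value of dedup-min-qual (from the text)


def read_score(quals: Sequence[int], min_qual: int = DEDUP_MIN_QUAL) -> float:
    """Average Phred quality per base, ignoring bases below min_qual.

    Assumption: excluded bases add 0 to the sum but still count toward
    the read length in the denominator.
    """
    if not quals:
        return 0.0
    total = sum(q for q in quals if q >= min_qual)
    return total / len(quals)


def pair_score(quals_end1: Sequence[int], quals_end2: Sequence[int],
               min_qual: int = DEDUP_MIN_QUAL) -> float:
    """Score of a pair: the sum of the scores of its two ends."""
    return read_score(quals_end1, min_qual) + read_score(quals_end2, min_qual)


def quantize(score: float) -> float:
    """Round a score down to the nearest one-quarter, as described above."""
    return math.floor(score * 4) / 4
```

Under these assumptions, a read with qualities 30, 30, 10, 40 scores (30 + 30 + 40) / 4 = 25.0, because the base with quality 10 falls below the default threshold of 15.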

Duplicate Marking Limitations

The limitations to DRAGEN duplicate marking implementation are as follows:

When there are two duplicate reads or pairs with very close Phred sequence quality scores, DRAGEN might choose a different winner from that chosen by Picard. These differences have negligible impact on variant calling results.
If using a single FASTQ file as input, DRAGEN accepts only a single library ID as a command-line argument (RGLB). For this reason, the FASTQ inputs to the system must be already separated by library ID. Library ID cannot be used as a criterion for distinguishing non-duplicates.
DRAGEN does not distinguish between optical and PCR duplicates.

Duplicate Marking Settings

The following options can be used to configure duplicate marking in DRAGEN:

--enable-duplicate-marking Set to true to enable duplicate marking. When --enable-duplicate-marking is enabled, the output is sorted, regardless of the value of the enable-sort option.
--remove-duplicates Set to true to suppress the output of duplicate records. If set to false, set the 0x400 flag in the FLAG field of duplicate BAM records. When --remove-duplicates is enabled, then enable-duplicate-marking is forced to enabled as well.
--dedup-min-qual Specifies the Phred quality score below which a base should be excluded from the quality score calculation used for choosing among duplicate reads.

ROH Caller

Regions of homozygosity (ROH) are detected as part of the small variant caller. The caller detects and outputs the runs of homozygosity from whole genome calls on autosomal human chromosomes. Sex chromosomes are ignored unless the sample sex karyotype is XX, as specified on the command line or determined by the Ploidy Estimator. ROH output allows downstream tools to screen for and predict consanguinity between the parents of the proband subject.

A region is defined as consecutive variant calls on the chromosome with no large gap in between these variants. In other words, regions are broken by chromosome or by large gaps with no SNV calls. The gap size is set to 3 Mbases.

ROH Algorithm

The ROH algorithm runs on the small variant calls. The algorithm excludes variants with multiallelic sites, indels, complex variants, non-PASS filtered calls, and homozygous reference sites. The variant calls are then filtered further using a block list BED, and finally depth filtering is applied after the block list filter. The default value for the fraction of filtered calls is 0.2, which filters the calls with the highest 10% and lowest 10% in DP values. The algorithm then uses the resulting calls to find regions.
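
As an illustration of the depth filter, the following Python sketch removes the lowest and highest DP tails from a list of calls. The function name is hypothetical, and rounding each tail size down is an assumption; the text only states that a fraction of 0.2 removes the highest 10% and lowest 10% of DP values.

```python
from typing import List


def depth_filter(dps: List[int], filtered_fraction: float = 0.2) -> List[int]:
    """Drop the calls with the lowest and highest DP values.

    Half of filtered_fraction is removed from each tail. Assumption:
    the tail size is rounded down to a whole number of calls.
    """
    n_tail = int(len(dps) * filtered_fraction / 2)
    # Rank call indices by depth, then keep everything between the two tails.
    ranked = sorted(range(len(dps)), key=lambda i: dps[i])
    keep = set(ranked[n_tail:len(dps) - n_tail])
    # Preserve the original (positional) order of the surviving calls.
    return [dp for i, dp in enumerate(dps) if i in keep]
```

With ten calls and the default fraction, one call is removed from each tail and the remaining eight are kept in their original order.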

The ROH algorithm first finds seed regions that contain at least 50 consecutive homozygous SNV calls with no heterozygous SNV or gaps of 500,000 bases between the variants. The regions can be extended using a scoring system that functions as follows.

Score increases with every additional homozygous variant (0.025) and decreases with a large penalty (1-0.025) for every heterozygous SNV. This provides some tolerance of presence of heterozygous SNV in the region.
Each region expands on both ends until the regions reach the end of a chromosome, a gap of 500,000 bases between SNVs occurs, or the score becomes too low (0).

Overlapping regions are merged into a single region. Regions can be merged across gaps of 500,000 bases between SNVs if a single region would have been called from the beginning of the first region to the end of the second region without the gap. There is no maximum size for regions, but regions always end at chromosome boundaries.
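
The scoring and extension rules above can be sketched as follows. This is a simplified, one-directional illustration in Python, not the DRAGEN implementation: the function names and the input layout (a position-sorted list of SNVs for one chromosome) are assumptions, and region merging is not modeled.

```python
from typing import List, Tuple

HOM_REWARD = 0.025            # added per homozygous variant (from the text)
HET_PENALTY = 1 - HOM_REWARD  # subtracted per heterozygous SNV (from the text)
MAX_GAP = 500_000             # gap in bases that stops extension (from the text)


def region_score(n_hom: int, n_het: int) -> float:
    """Score of a region given its homozygous and heterozygous counts."""
    return n_hom * HOM_REWARD - n_het * HET_PENALTY


def extend_right(calls: List[Tuple[int, bool]], seed_end: int,
                 score: float) -> Tuple[int, float]:
    """Extend a region rightward from index seed_end.

    calls holds (position, is_homozygous) tuples sorted by position.
    Extension stops at the end of the list, at a gap of MAX_GAP or more,
    or when accepting the next call would bring the score to 0 or below.
    """
    end = seed_end
    while end + 1 < len(calls):
        pos, is_hom = calls[end + 1]
        if pos - calls[end][0] >= MAX_GAP:
            break
        new_score = score + (HOM_REWARD if is_hom else -HET_PENALTY)
        if new_score <= 0:
            break
        end, score = end + 1, new_score
    return end, score
```

A seed of 50 homozygous SNVs starts with a score of 1.25, so it can absorb one heterozygous SNV (leaving 0.275) but not a second one until roughly 28 further homozygous calls have accumulated.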

ROH Options

--vc-enable-roh Set to true to enable the ROH caller. The ROH caller is enabled by default for human autosomes only. Set to false to disable.
--vc-roh-blacklist-bed If provided, the ROH caller ignores variants that are contained in any region in the block list BED file. DRAGEN distributes block list files for all popular human genomes and automatically selects a block list to match the genome in use, unless this option is used to select a file.

ROH Output

The ROH caller produces an ROH output file named <output-file-prefix>.roh.bed in which each row represents one region of homozygosity. The BED file contains the following columns:

Chromosome Start End Score #Homozygous #Heterozygous

Score is a function of the number of homozygous and heterozygous variants, where each homozygous variant increases the score by 0.025, and each heterozygous variant reduces the score by 0.975.
Start and end positions are a 0-based, half-open interval.
#Homozygous is number of homozygous variants in the region.
#Heterozygous is number of heterozygous variants in the region.

The caller also produces a metrics file named <output-file-prefix>.roh_metrics.csv that lists the number of large ROH and percentage of SNPs in large ROH (>3 MB).

Concordance with PLINK

The table below demonstrates how the PLINK options can be tuned to behave similarly to the DRAGEN ROH caller default settings (see column DRAGEN default). We observed that PLINK ROH calls (see column PLINK default) in default settings are more conservative compared to DRAGEN default settings. By default, PLINK reports ROH regions of size 1MB or larger (see PLINK option --homozyg-kb ) with at least 100 homozygous SNPs (see PLINK option --homozyg-snp) while DRAGEN ROH caller reports smaller regions with at least 50 homozygous SNPs (see DRAGEN ROH Algorithm section). In addition, PLINK by default allows for only 1 heterozygous SNP per scanning window (specified by PLINK option --homozyg-window-het) while DRAGEN uses a soft score threshold penalty without setting an upper bound on the allowed number of heterozygous SNPs (see DRAGEN ROH Algorithm section). The PLINK ROH calls are largely comparable to the DRAGEN ROH calls after relaxing the default PLINK settings, shown in column PLINK tuned. Prior to PLINK ROH calling, the input DRAGEN hard-filtered VCF files are filtered as per the instructions in DRAGEN ROH Algorithm section.

B-Allele Frequency Output

B-Allele frequency (BAF) output is enabled by default in germline and somatic VCF and gVCF runs.

The BAF value is calculated as either AF or (1 - AF), where

AF = (alt_count / (ref_count + alt_count))
BAF = 1 - AF only when ref base < alt base; otherwise BAF = AF. The order of priority for bases is A < T < G < C < N.

The B-allele frequency values are often plotted to visually inspect the spread away from a perfectly diploid heterozygous call (BAF=50%). This plot is more easily interpreted if it is symmetric about the BAF=50% line. To ensure the symmetry, a heuristic must be used to determine when BAF = AF or BAF = 1-AF. This definition of B-Allele Frequency is based on the definition that is used for bead arrays, as most users are accustomed to that implementation. Here, the choice of the B allele is based on the color of dye attached to each nucleotide. A and T get one color, G and C get the other color. The bead array implementation has a much more complex rule for tie-breaking between A and T or G and C that involves top and bottom strands. This is unnecessary here, so the simpler hierarchical approach of using a priority for the nucleotides A<T<G<C<N is used.
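
The BAF rule can be written as a small Python function. This is a sketch of the definition given above for single-base ref and alt alleles; the function name and the choice to return None when both counts are zero are illustrative assumptions.

```python
from typing import Optional

# Priority order from the text: A < T < G < C < N
BASE_PRIORITY = {"A": 0, "T": 1, "G": 2, "C": 3, "N": 4}


def b_allele_frequency(ref: str, alt: str,
                       ref_count: int, alt_count: int) -> Optional[float]:
    """Return BAF for a SNP, or None when there are no supporting reads."""
    total = ref_count + alt_count
    if total == 0:
        return None  # no ref or alt reads, so no BAF value can be computed
    af = alt_count / total
    # BAF = 1 - AF only when the ref base has lower priority rank than alt.
    if BASE_PRIORITY[ref] < BASE_PRIORITY[alt]:
        return 1 - af
    return af
```

For an A>G SNP with 30 ref reads and 10 alt reads, AF is 0.25 and BAF is 0.75 because A < G. For a C>T SNP with the same counts, BAF equals AF (0.25) because C does not precede T in the priority order.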

For each small variant VCF entry with exactly one SNP alternate allele, the output contains a corresponding entry in the BAF output file.

- <NON_REF> lines are excluded.
- ForceGT variants (as marked by the "FGT" tag in the INFO field) are not included in the output, unless the variant also contains the "NML" tag in the INFO field.
- Variants where the ref_count and alt_count are both zero are not included in the output.

BAF Options

--vc-enable-baf Enable or disable B-allele frequency output. Enabled by default.

BAF Output

The BAF outputs are BigWig-compressed files, named <output-file-prefix>.baf.bw and <output-file-prefix>.hard-filtered.baf.bw. The hard-filtered file only contains entries for variants that pass the filters defined in the VCF (ie, PASS entries).

Each entry contains the following information: Chromosome Start End BAF

Where:

Chromosome is a string matching a reference contig.
Start and end values are zero-based, half open intervals.
BAF is a floating point value.

Somatic Mode

The DRAGEN Somatic Pipeline allows ultrarapid analysis of Next-Generation Sequencing (NGS) data to identify cancer-associated mutations in somatic chromosomes. DRAGEN calls SNVs and indels from both matched tumor-normal pairs and tumor-only samples using a probability model that considers the possibility of somatic variants, germline variants, and various systematic noise artifacts. The model is informed by sample-specific nucleotide and indel noise patterns that are estimated from the data at runtime. When considering somatic variants, DRAGEN does not make any ploidy assumptions, which enables detection of low-frequency alleles. For loci with coverage up to 100x in the tumor sample, DRAGEN can detect variant allele frequencies down to approximately 5%. This limit scales with increasing depth on a per-locus basis. It is recommended to provide DRAGEN with a systematic noise file that contains position- and allele-specific noise frequencies as estimated from a panel of normal samples (see below); DRAGEN uses this noise file to filter calls that can be explained as resulting from position- and allele-specific noise. After multiple filtering steps, the output is generated as a VCF file. Variants that fail the filtering steps are kept in the output VCF. The variants include a FILTER annotation that indicates which filtering steps have failed.

For the tumor-normal pipeline, both samples are analyzed jointly. DRAGEN assumes that germline variants and systematic noise artifacts are shared by both samples, whereas somatic variants are present only in the tumor sample. Only somatic variants are reported. To detect systematic noise artifacts, DRAGEN recommends that the coverage in the normal sample be at least half of the coverage in the tumor sample.

The tumor-only pipeline produces output that contains both germline and somatic variants and can be further analyzed to identify tumor mutations. The caller does not attempt to distinguish between them: filtering out common germline variants as reported in databases is currently the most reliable way to remove germline variants. The tumor-only pipeline provides a germline tagging feature and requires this feature to be explicitly enabled or disabled. When germline tagging is enabled, variant annotation must also be enabled; DRAGEN then tags variants that are common in the gnomAD database as germline so that they can be filtered out if desired. The tumor-only pipeline also requires the presence of a systematic noise file by default. To run without germline tagging and/or systematic noise files, these options need to be disabled explicitly.

Variant Scoring

DRAGEN uses a Bayesian approach to compute the posterior probability that a somatic variant is present and reports this as a Phred-scaled quantity, "somatic quality" (SQ):

##FORMAT=<ID=SQ,Number=1,Type=Float,Description="Somatic quality">

DRAGEN scores variants by computing likelihoods for several hypotheses and noise processes, taking into account many factors such as: the numbers of alt-supporting and ref-supporting reads in the tumor and normal samples (and hence the alt allele frequencies in both samples); mapping qualities and how these are distributed across the reads in the tumor and normal pileups; basecall qualities; forward vs reverse strand support; sample-wide estimates of insertion and deletion error probabilities as functions of repeat period, repeat length, and indel length; sample-wide estimates of nucleotide error biases; whether there are nearby co-phased events; and whether the positions and alleles in question are known somatic hotspots or associated with sequence-specific error patterns. You can use SQ as the primary metric to describe the confidence with which the caller made a somatic call. SQ is reported as a format field for the tumor sample (exception: for homozygous reference calls in gvcf mode it is instead a likelihood ratio, analogous to homref GQ as described in the germline section). Variants with SQ score below the SQ filter threshold are filtered out using the weak_evidence tag. To trade off sensitivity against specificity, adjust the SQ filter threshold. Lower thresholds produce a more sensitive caller and higher thresholds produce a more conservative caller. If performing tumor-normal analysis, the SQ field for the normal sample contains the Phred-scaled posterior probability that a putative call is a germline variant. The somatic caller does not test for diploid genotype candidates and does not output GQ or QUAL values.

If tumor SQ > vc-sq-call-threshold (default is 3 for tumor-normal and 0.1 for tumor-only), then the FORMAT/GT for the tumor sample is hard-coded to 0/1, and the FORMAT/AF yields an estimate on the somatic variant allele frequency, which ranges anywhere within [0,1].

If the value for vc-sq-filter-threshold is lower than vc-sq-call-threshold, the filter threshold value is used instead of the call threshold value.
If tumor SQ < vc-sq-call-threshold, the variant is not emitted in the VCF.
If tumor SQ > vc-sq-call-threshold but tumor SQ < vc-sq-filter-threshold, the variant is emitted in the VCF, but FILTER=weak_evidence.
If tumor SQ > vc-sq-call-threshold and tumor SQ > vc-sq-filter-threshold, the variant is emitted in the VCF and FILTER=PASS (unless the variant is filtered by a different filter).
The default vc-sq-filter-threshold is 17.5 for tumor-normal and 3.0 for tumor-only analysis. The following is an example somatic T/N VCF record. Tumor SQ > vc-sq-call-threshold but tumor SQ < vc-sq-filter-threshold, so the FILTER is marked as weak_evidence.
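The threshold rules listed above can be summarized in a short Python sketch. This is an illustration of the documented logic, not DRAGEN code: the function name and return labels are hypothetical, and the behavior when SQ is exactly equal to a threshold is an assumption, because the rules above use strict inequalities only.

```python
def classify_tumor_sq(sq, call_threshold=3.0, filter_threshold=17.5):
    """Return 'not_emitted', 'weak_evidence', or 'PASS' for a tumor SQ score.

    Defaults are the tumor-normal values from the text
    (tumor-only defaults are 0.1 and 3.0).
    """
    # If the filter threshold is lower than the call threshold, the
    # filter threshold value is used in place of the call threshold.
    effective_call = min(call_threshold, filter_threshold)
    if sq < effective_call:
        return "not_emitted"      # variant does not appear in the VCF
    if sq < filter_threshold:
        return "weak_evidence"    # emitted, FILTER=weak_evidence
    # ASSUMPTION: SQ equal to a threshold is treated as meeting it.
    return "PASS"                 # other filters may still apply


print(classify_tumor_sq(10.0))            # tumor-normal defaults
print(classify_tumor_sq(1.0, 0.1, 3.0))   # tumor-only defaults
```

Both calls fall between the call and filter thresholds, so both correspond to an emitted record marked weak_evidence.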

The clustered-events penalty is an exception to the above rule for emitting variants. By default, the clustered-events penalty replaces the (obsolete) clustered-events filter. Instead of applying a hard filter when too many events are clustered together, DRAGEN applies a penalty to the SQ scores of cophased clustered events. Clustered events with weak evidence are no longer called, but clustered events with strong evidence can still be called. This is equivalent to lowering the prior probability of observing clustered cophased variants. The penalty is applied after the decision to emit variants, so that penalized variants still appear in the VCF if their unpenalized score is high enough. Variants that are combined into an MNV via the --vc-combine-phased-variants-distance option are treated as a single variant for the purposes of the penalty. The penalty is not applied to somatic hotspot variants. To disable the clustered-events penalty, set --vc-clustered-event-penalty=0.

Somatic Mode Options

Please see the DRAGEN Recipe sections for recommended command lines in typical workflows. The following command line options are typically used for somatic small-variant calling:

--tumor-fastq1 and --tumor-fastq2
Inputs a pair of FASTQ files into the mapper aligner and somatic variant caller. You can use these options with other FASTQ options to run in tumor-normal mode. For example:

--tumor-fastq-list
Inputs a list of FASTQ files into the mapper aligner and somatic variant caller. You can use this option with other FASTQ options to run in tumor-normal mode. For example:

--tumor-bam-input and --tumor-cram-input Inputs a mapped BAM or CRAM file into the somatic variant caller. You can use these options with other BAM/CRAM options to run in tumor-normal mode.
--vc-sq-call-threshold and --vc-sq-filter-threshold These options control the thresholds for emitting calls in the VCF and applying the weak_evidence filter tag (see above).
--vc-target-vaf This option allows the user to adjust the allele frequencies of haplotypes that will be considered by the caller as potentially appearing in the sample. It is not a hard threshold, but the variant caller will aim to detect variants with allele frequencies larger than this setting. In the case of tumor-normal runs, the frequency is measured with respect to the full set of reads (tumor and normal combined). The default threshold of 0.03 was selected to be as low as possible without incurring an excessive false positive cost; a lower setting may increase sensitivity for low-frequency variants, but may increase false positives and runtime; a higher setting may reduce false positives. Setting the vc-target-vaf to 0 will result in all haplotypes with at least two supporting reads being taken into consideration.
--vc-somatic-hotspots, --vc-use-somatic-hotspots, and --vc-hotspot-log10-prior-boost DRAGEN uses a hotspot VCF to indicate somatic mutations that are expected with increased frequency. The default hotspot file (automatically selected from <INSTALL_PATH>/resources/hotspots/somatic_hotspots_* based on the reference) is mostly based on the Memorial Sloan Kettering Cancer Center (MSKCC) published hotspots and positions in COSMIC with population allele counts (AC) >= 50. It is somewhat conservative and boosts only a few thousand positions. You can specify a custom hotspot file via the --vc-somatic-hotspots option (note: input VCF records must be sorted in the same order as contigs in the selected reference) or disable the hotspots feature with --vc-use-somatic-hotspots=false. The effect of the hotspot file is that the prior probability for hotspot variants is boosted by a factor, up to a maximum prior of 0.5. An SNV is considered to match a hotspot variant only if the allele in question is identical, whereas insertions or deletions are considered to match any insertion or deletion allele, respectively. You can use --vc-hotspot-log10-prior-boost to control the size of the adjustment. The default value is 4 (log10 scale), corresponding to an increase of 40 phred, and reducing this value results in a smaller adjustment.
--vc-systematic-noise This option allows the user to specify the systematic noise file. To run without a systematic noise file (not recommended), specify --vc-systematic-noise=NONE.
--vc-combine-phased-variants-distance This option is the same as in the germline variant caller (see "Combine Phased Variants" in the germline small-variant caller section).
--vc-skip-germline-tagging=true This option disables the germline tagging feature in the tumor-only pipeline (not recommended).
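As a worked illustration of the hotspot prior boost described in the option list above, the following Python sketch multiplies a prior by 10 raised to the log10 boost and caps the result at 0.5. The function names and the example prior values are hypothetical, and reading the boost as a simple capped multiplication is an assumption about how the documented behavior could be expressed, not a description of DRAGEN internals.

```python
def boosted_hotspot_prior(prior, log10_boost=4.0, max_prior=0.5):
    """Boost a prior probability by 10**log10_boost, capped at max_prior.

    ASSUMPTION: the boost is a plain multiplication followed by a cap.
    """
    return min(prior * 10.0 ** log10_boost, max_prior)


def boost_in_phred(log10_boost=4.0):
    """Express a log10 boost on the phred scale (10 phred per factor of 10)."""
    return 10.0 * log10_boost


print(boost_in_phred())               # default boost of 4 on the log10 scale
print(boosted_hotspot_prior(1e-3))    # large enough to hit the 0.5 cap
```

Reducing log10_boost shrinks the adjustment, which matches the documented effect of lowering --vc-hotspot-log10-prior-boost.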

Tumor-in-normal contamination and liquid tumor mode

In a tumor-normal analysis, DRAGEN accounts for tumor-in-normal (TiN) contamination by running liquid tumor mode. Liquid tumor mode is disabled by default, but we recommend enabling it with --vc-enable-liquid-tumor-mode=true if TiN contamination is expected. When liquid tumor mode is enabled, DRAGEN is able to call variants in the presence of TiN contamination up to a specified maximum tolerance level (default: 0.15). With the default maximum TiN contamination tolerance, somatic variants are expected to be observed in the normal sample with allele frequencies up to 15% of the corresponding allele frequency in the tumor sample. The --vc-tin-contam-tolerance option enables liquid tumor mode and allows you to set the maximum TiN contamination tolerance.
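The tolerance arithmetic can be checked with a small Python sketch. The function names and the example allele frequencies are hypothetical and only illustrate the "up to 15% of the tumor allele frequency" statement above.

```python
def max_expected_normal_af(tumor_af, tin_tolerance=0.15):
    """Highest normal-sample AF still consistent with TiN contamination."""
    return tin_tolerance * tumor_af


def within_tin_tolerance(normal_af, tumor_af, tin_tolerance=0.15):
    """True if the normal-sample AF can be explained by TiN contamination."""
    return normal_af <= max_expected_normal_af(tumor_af, tin_tolerance)


# A somatic variant at 40% AF in the tumor may appear at up to 6% AF
# in a contaminated normal sample under the default tolerance.
print(max_expected_normal_af(0.40))
print(within_tin_tolerance(0.05, 0.40))
print(within_tin_tolerance(0.10, 0.40))
```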

Liquid tumor mode is not equivalent to liquid biopsy. Liquid tumors in liquid tumor mode refer to hematological cancers, such as leukemia. For liquid tumors, it is not feasible to use blood as a normal control because the tumor is present in the blood. Skin or saliva is typically used as the normal sample. However, skin and saliva samples can still contain blood cells, so that the matched normal control sample contains some traces of the tumor sample and somatic variants are observed at low frequencies in the normal sample. If the contamination is not accounted for, it can severely impact sensitivity by suppressing true somatic variants.

Liquid tumor mode typically uses a WGS or WES library with medium depth (for example, 100x tumor / 40x normal), and the lowest VAF detected at these depths is ~5%. Liquid biopsy typically uses a targeted gene panel (eg, 500 genes) with very high raw depth, and uses UMI indexing (collapsing down to a depth of >2000x) to enable sensitivity at VAF down to 0.1% in some cases (the limit of detection varies depending on coverage and data quality).

Mixing tumor and normal samples from different sequencing protocols

If using different sequencing systems or different library preparation methods for tumor and normal samples, we recommend setting --vc-override-tumor-pcr-params-with-normal=false. In tumor-normal mode, DRAGEN estimates a set of PCR error parameters separately for each of the tumor and normal samples. By default, DRAGEN ignores the tumor-sample parameters and uses normal-sample parameters for analysis of both samples. This default prevents overestimation of tumor-sample error rates that can occur if the somatic variant rate is high.

Allele frequency and related settings

There is no hard limit on the allele frequencies at which DRAGEN can report calls, but there are a number of points in the pipeline where low allele frequency can affect calling. The vc-target-vaf setting affects the threshold used to detect candidate haplotypes during localized haplotype assembly, but does not affect variant scoring. Once a candidate haplotype is detected, all putative variants appearing in the haplotype are scored and calls scoring above the SQ call threshold are emitted regardless of the allele frequency or the number of supporting reads.

The probability calculation in the somatic caller assesses variant and noise hypotheses at fixed allele frequencies defined by a discrete grid (by default at coverages <200: 0, 0.05, 0.1, ... 1.0). This means that the calculation will assess variants with allele frequencies below 0.05 as if the true frequency is equal to 0.05; this strategy does not preclude such variants from being called but may result in lower scores compared to if the true frequency had been considered. At positions with higher coverage, DRAGEN adds extra grid points as in the table below in order to consider hypotheses involving lower allele frequencies and effectively achieve a lower limit of detection (LOD), with the lowest VAF halving every time the coverage doubles:
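The following Python sketch illustrates the grid behavior described in the paragraph above. It does not reproduce the table from the guide. The default grid (0, 0.05, ..., 1.0 below 200x coverage) and the halving rule come from the text, but the exact coverage break points are an assumption: the sketch supposes that the first halving occurs at 200x and each further halving at the next doubling. The function names are hypothetical.

```python
def default_af_grid():
    """Default grid stated in the text for coverage below 200."""
    return [i / 20 for i in range(21)]


def lowest_nonzero_grid_af(coverage, base_coverage=200, base_af=0.05):
    """Lowest nonzero allele frequency on the grid at a given coverage.

    ASSUMPTION: the lowest grid point halves at base_coverage and again
    at every doubling of coverage after that. The real break points are
    defined by the table in the guide.
    """
    lowest = base_af
    threshold = base_coverage
    while coverage >= threshold:
        lowest /= 2        # lowest VAF halves ...
        threshold *= 2     # ... every time the coverage doubles
    return lowest


for cov in (100, 200, 400, 800):
    print(cov, lowest_nonzero_grid_af(cov))
```

Under these assumptions, a variant whose true allele frequency is below the returned value is assessed as if its frequency were equal to that value, which can lower its score without preventing the call.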

If calls below a certain VAF are not of interest, you can use --vc-enable-af-filter (see Post Somatic Calling Filtering below) to apply a hard filter on VAF.

Sample-specific NTD Error Bias Estimation

DRAGEN can compensate for oxidation and deamination artifacts that might exist upstream of the sequencing system, and are common in FFPE samples. DRAGEN does this by estimating nucleotide mutation biases on a per sample basis, taking account of read orientation. During variant calling, DRAGEN then corrects for nucleotide substitution biases by combining the estimated parameters with the basecall quality scores, thus modifying the nucleotide error rates used by the hidden Markov model.

Nucleotide (NTD) Error Bias Estimation is on by default and recommended as a replacement for the orientation bias filter. Both methods take account of strand-specific biases (systematic differences between F1R2 and F2R1 reads). In addition, NTD error estimation accounts for non-strand-specific biases such as sample-wide elevation of a certain SNV type, e.g. C->T or any other transition or transversion. NTD error estimation can also capture these biases in a trinucleotide context, e.g. in the case of C->T it will break down the counts as ACA->ATA, CCA->CTA, GCA->GTA, TCA->TTA, etc.

This feature can be disabled by specifying --vc-enable-unequal-ntd-errors=false or set to auto-detect by specifying --vc-enable-unequal-ntd-errors=auto. In auto-detect mode, DRAGEN will run the estimation but then disable the use of the estimated parameters if it determines that the sample does not exhibit nucleotide error bias. When the feature is enabled, DRAGEN will by default estimate a smaller set of parameters in a monomer context. To estimate a larger set of parameters in a trimer context (recommended on sufficiently large panels when coverage is above 1000X), specify --vc-enable-trimer-context=true.

To specify the regions from which to estimate nucleotide substitution biases, use --vc-snp-error-cal-bed. Alternatively, if --vc-target-bed is used to specify the target regions for variant calling, and the total bed regions are sufficiently small (maximum 4 megabases), --vc-snp-error-cal-bed can be omitted and DRAGEN will use the target bed file for bias estimation. Otherwise, DRAGEN will use a default bed file selected to match the reference, and covering a mixture of coding and non-coding regions.

DRAGEN requires a panel size of at least 150kbp to correctly estimate nucleotide mutation biases when using trimer context, or at least 10kbp when using monomer context. If this requirement is not met for trimer context, DRAGEN falls back on the monomer model, and if it is not met for monomer context, DRAGEN turns the bias estimation feature off.
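The fallback behavior can be written as a small decision function. This Python sketch only restates the panel-size rules from the paragraph above. The function name and return labels are hypothetical, and treating a panel of exactly the minimum size as sufficient follows the "at least" wording.

```python
TRIMER_MIN_BP = 150_000    # at least 150 kbp for trimer context
MONOMER_MIN_BP = 10_000    # at least 10 kbp for monomer context


def select_ntd_context(panel_size_bp, trimer_requested=False):
    """Return 'trimer', 'monomer', or 'off' for NTD error bias estimation."""
    if trimer_requested and panel_size_bp >= TRIMER_MIN_BP:
        return "trimer"
    if panel_size_bp >= MONOMER_MIN_BP:
        # Either monomer context was requested (the default), or the
        # panel is too small for trimer context and DRAGEN falls back.
        return "monomer"
    return "off"  # panel too small even for monomer context


print(select_ntd_context(200_000, trimer_requested=True))
print(select_ntd_context(50_000, trimer_requested=True))
print(select_ntd_context(5_000, trimer_requested=True))
```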

Unique Molecular Identifier (UMI) Support

DRAGEN provides two specialized UMI-aware variant calling pipelines for running from UMI-collapsed reads. These pipelines are optimized to take account of the increased read and basecall qualities that are typical in simplex- and duplex-collapsed reads. Both pipelines are disabled by default; when running with UMI collapsing enabled (--enable-umi true) or when running from UMI-collapsed bams, enable UMI-aware variant calling by setting one of the following options to true:

--vc-enable-umi-solid The VC UMI solid mode is optimized for solid tumors with post-collapse coverage of ~200-300X and target allele frequencies of 5% and higher.
--vc-enable-umi-liquid The liquid biopsy pipeline is not equivalent to liquid tumor mode (see above). The liquid biopsy pipeline starts from a regular blood sample and looks for low VAF somatic variants from tumor cell free DNA floating in the blood. This type of test enables tumor profiling (diagnosis/biomarker identification) from plasma rather than from tissue, which requires an invasive biopsy. The VC UMI liquid mode is optimized for a liquid biopsy pipeline with post collapsed coverage rates of >2000X and target allele frequencies of 0.1% and higher.

If a third-party tool is used to produce the collapsed reads, then configure the tool so that the base call quality scores quantify the error produced by the sequencing system only. DRAGEN uses Sample-specific NTD Error Bias Estimation (see above) to account for errors upstream of the sequencing system, so such errors should not be included in base call quality scores.

gVCF Output

You can output a gVCF file for tumor-only data sets. A gVCF file reports information on every position of the input genome, including homozygous reference (homref) positions, i.e. positions where no alt allele (either germline or somatic) is present. DRAGEN creates a new <NON_REF> allele, to which reads that do not support the reference allele or any reported variant allele are assigned. In tumors, variants could exist at arbitrarily low allele frequencies and be undetectable. Thus, a somatic homref call cannot guarantee that no somatic variant at any allele frequency exists at the position. Instead, DRAGEN considers a position to be a homozygous reference if there are no somatic variants with an allele frequency at or above the limit of detection (LOD). Whereas the SQ score for an ordinary alt allele is a phred-scale posterior probability, the SQ score for the <NON_REF> allele is a phred-scale ratio between the likelihood of a homref call and the likelihood of a variant call with allele frequency at the LOD (if an alt allele is also reported, the <NON_REF> SQ score is capped at the complement of the posterior probability for the alt allele). If the LOD value is lowered, fewer homref calls are made. If the LOD value is increased, more homref calls are made.

By default the LOD is set to 5%, but you can enter a different value using the --vc-gvcf-homref-lod option.

Post Somatic Calling Filtering

DRAGEN can add a number of filters by populating the FILTER column in the VCF. The output is provided in the *.hard-filtered.vcf.gz output file (note: the *.vcf.gz output file without "hard-filtered" in the filename differs only in that the FILTER column is unpopulated; the file is produced for historical reasons and will be deprecated).

Options

The following options are available for post somatic calling filtering:

--vc-sq-call-threshold
Sets the SQ threshold for emitting calls in the VCF. The default is 3.0 for tumor-normal and 0.1 for tumor-only. If the value for vc-sq-filter-threshold is lower than vc-sq-call-threshold, the filter threshold value is used instead of the call threshold value.
--vc-sq-filter-threshold
Sets the SQ threshold below which emitted VCF calls are marked as filtered with the weak_evidence tag. The default is 17.5 for tumor-normal and 3.0 for tumor-only.
--vc-enable-triallelic-filter
Enables the multiallelic filter. The default is true. This filter will not be applied to somatic hotspot variants.
--vc-enable-non-primary-allelic-filter
Similar to the triallelic filter, but filters less aggressively. At each multiallelic position, keeps the allele with the highest alt AD and filters only the rest. The default is false. This filter is not applied to somatic hotspot variants. It cannot be enabled when the triallelic filter is also on.
--vc-enable-af-filter
Enables the allele frequency filter for nuclear chromosomes. The default value is false. When set to true, variants with an allele frequency below the AF call threshold are excluded from the VCF, and variants with an allele frequency below the AF filter threshold are tagged with the low AF filter tag. The default AF call threshold is 1% and the default AF filter threshold is 5%. To change the threshold values, use the --vc-af-call-threshold and --vc-af-filter-threshold command-line options. For mitochondrial allele frequency filtering, use --vc-enable-af-filter-mito and the corresponding threshold options.
--vc-enable-non-homref-normal-filter
Enables the non-homref normal filter. The default value is true. When set to true, the VCF filters out variants if the normal sample genotype is not a homozygous reference.
--vc-enable-vaf-ratio-filter
Adds one condition to be filtered out by the alt_allele_in_normal filter. The default value is false. When set to true, the VCF filters out variants if the normal sample AF is greater than 20% of tumor sample AF.
--vc-depth-filter-threshold
Filters all somatic variants (alt or homref) with a depth below this threshold. The default value is 0 (no filtering).
--vc-homref-depth-filter-threshold
In gVCF mode, filters all somatic homref variants with a depth below this threshold. The default value is 3.
--vc-depth-annotation-threshold
Filters all non-PASS somatic alt variants with a depth below this threshold. The default value is 0 (no filtering).
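Two of the filters in the option list above reduce to simple threshold comparisons, sketched here in Python. The sketch reads the --vc-enable-af-filter description as: below the AF call threshold the variant is excluded, and below the AF filter threshold it is emitted with the low AF filter tag. The function names and return labels are hypothetical, and the handling of values exactly equal to a threshold is an assumption.

```python
def apply_af_filter(af, call_threshold=0.01, filter_threshold=0.05):
    """Classify a variant under the allele frequency filter (when enabled)."""
    if af < call_threshold:
        return "excluded"          # not written to the VCF
    if af < filter_threshold:
        return "low_af_tagged"     # written, tagged with the low AF filter
    return "unaffected"


def fails_vaf_ratio_filter(normal_af, tumor_af, ratio=0.20):
    """True if the normal AF is greater than 20% of the tumor AF."""
    return normal_af > ratio * tumor_af


print(apply_af_filter(0.005), apply_af_filter(0.03), apply_af_filter(0.10))
print(fails_vaf_ratio_filter(normal_af=0.05, tumor_af=0.20))
```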

Filters

Systematic Noise Filtering

The DRAGEN systematic noise filter significantly improves somatic variant calling precision, especially in tumor-only mode. DRAGEN enforces its use in the tumor-only pipeline by refusing to start a run without a noise file (this option can explicitly be disabled). This filter tackles noise that consistently appears at specific locations in the reference genome. This noise can arise from:

Mis-mapping in low-complexity regions: Repetitive sequences with low information content can lead to reads mapping to incorrect locations.
PCR noise in homopolymer regions: Regions with long stretches of the same nucleotide (e.g., AAAAA) can introduce errors during PCR amplification.

The systematic noise filter offers a significant improvement over the older "panel of normals" method. While the panel of normals simply excluded specific positions, the new filter employs a statistical model. This model compares the variant and its allele frequency (AF) to the noise level associated with that specific position and allele in the reference genome. This allows for a more nuanced filtering approach, reducing false positives without discarding potentially valid variants.

Note that the systematic noise filter specifically aims to remove noise, while the option --vc-enable-germline-tagging is used for identifying germline variants. The systematic noise filter is not recommended for germline admixture datasets, where tumor-normal pairs are simulated by combining germline samples from two different individuals. This is because such datasets contain (simulated) somatic variants at germline variant positions, and those positions may be present in the noise files with the result that desired variants are filtered out.

Newer versions of the systematic noise file include two columns, one for the "mean" noise and one for the "max" noise. The noise file header specifies a "##NoiseMethod", which identifies the column that is used by default during variant calling. For UMI, panel, and WES data, it is recommended to use the "mean" noise, and for WGS it is recommended to use the "max" noise.

Prebuilt systematic noise files are available for download (see below), but when possible, it is recommended to build custom noise files from a panel of normal samples sequenced locally. This will ensure that the noise file is specific to the library preparation, sequencing system, and panel in use. Building your own noise file is especially helpful for clean UMI samples that tend to have less noise than WGS/WES samples. To generate a noise file it is recommended to use approximately 20-50 normal samples, although fewer normal samples (1-10) can still be used to generate useful noise files.

The systematic noise filter is used in the DRAGEN tumor-only or tumor-normal pipeline by adding the following commands:

Prebuilt Systematic Noise BED Files

Somatic Systematic Noise Baseline Collection v2.0.0 noise files were generated with V4.3 and for the first time include allele specific information. Each v2.0.0 noise file includes both "mean" and "max" noise in separate columns. A header line "##NoiseMethod=mean/max" specifies which noise column will be used by default.

Noise files generated with V4.3 contain extra columns and are not compatible with earlier versions. Older noise files are still supported in the current version of DRAGEN as per the table below.

The default WES and WGS noise files were generated using a combination of Nextera and TruSeq samples (with and without PCR). There are also hg38 WGS HEME and FFPE specific noise files.

Custom Systematic Noise Files

The BaseSpace Sequence Hub DRAGEN CNV Baseline Builder App can be used to build SNV and CNV noise files in the cloud. Alternatively, the following DRAGEN command lines can be used to generate the noise files locally:

First run DRAGEN somatic tumor-only on each of approximately 20-50 normal samples using the following command:

Once the normal samples have completed, collect the normal VCFs in the VCF_LIST file (one vcf per line) and use DRAGEN to generate the systematic noise file:

Detailed settings for running or building the systematic noise filter:

Running the filter during somatic variant calling:

Running the tumor-only pipeline on the normals:

Building the noise file:

Germline Tagging in the Tumor-Only Pipeline

When running DRAGEN for tumor-only somatic calling, potential germline variants can be tagged in the INFO field with 'GermlineStatus' using population databases. Current databases include 1KG and both exome and genome sequencing data from gnomAD. The following options are available for this feature:

--vc-enable-germline-tagging Enables germline tagging. The default is 'false'. When this option is set to 'true', you must also set the following annotation-related parameters:
- --enable-variant-annotation=true
- --variant-annotation-data Nirvana annotation database (Downloadable at https://support.illumina.com/content/dam/illumina-support/help/Illumina_DRAGEN_Bio_IT_Platform_v3_7_1000000141465/Content/SW/Informatics/Dragen/Nirvana_DownloadData_fDG.htm)
- --variant-annotation-assembly The genome build, GRCh37 or GRCh38

The following additional options control how germline variants are defined:

--germline-tagging-db-threshold The minimum alternative allele count across population databases for a variant to be defined as germline (default=50).
--germline-tagging-pop-af-threshold The minimum population allele frequency for a variant to be defined as germline. Once specified, this will override the input from --germline-tagging-db-threshold.
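The interaction between the two thresholds can be sketched in Python as follows. The function name and arguments are hypothetical. The sketch assumes that "minimum" means greater than or equal to, and that a specified population allele frequency threshold fully replaces the allele count test, as the override statement above suggests.

```python
def is_tagged_germline(allele_count, population_af,
                       db_threshold=50, pop_af_threshold=None):
    """Decide whether a variant would be defined as germline.

    allele_count:  alt allele count across the population databases
    population_af: population allele frequency of the variant
    """
    if pop_af_threshold is not None:
        # Once specified, the AF threshold overrides the count threshold.
        return population_af >= pop_af_threshold
    return allele_count >= db_threshold


print(is_tagged_germline(allele_count=120, population_af=0.0004))
print(is_tagged_germline(allele_count=120, population_af=0.0004,
                         pop_af_threshold=0.001))
```

The same variant is tagged under the default count threshold but not under the stricter, hypothetical frequency threshold of 0.001.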

Mutation Annotation Format (MAF) Conversion in Tumor-Only and Tumor-Normal Pipelines

When enabling DRAGEN for tumor-only or tumor-normal pipelines with Nirvana Annotation, the Nirvana JSON output can be converted into a Mutation Annotation Format (MAF) file. The MAF file is a tab-separated values file containing aggregated mutation information and will be saved to the output directory that you specify. You can enable MAF conversion directly as part of the somatic small variant calling workflow (integrated mode) or separately by providing a path to a VCF file or annotated JSON file (standalone mode).

When running MAF conversion as part of the somatic small variant calling workflow, the following options are required for this feature:

Annotation options:

--enable-variant-annotation=true Enable variant annotation
--variant-annotation-data Nirvana annotation database (Downloadable at https://support.illumina.com/content/dam/illumina-support/help/Illumina_DRAGEN_Bio_IT_Platform_v3_7_1000000141465/Content/SW/Informatics/Dragen/Nirvana_DownloadData_fDG.htm)
--variant-annotation-assembly Genome build, GRCh37 or GRCh38

MAF conversion options:

--enable-maf-output=true Enable MAF output
--maf-transcript-source Desired transcript source, RefSeq or Ensembl

Additional standalone options (when running without the variant caller):

--maf-input-vcf Input VCF with the following form: <path>/<file_name>.hard-filtered.vcf.gz
--maf-input-json Input JSON with the following form: <path>/<file_name>.hard-filtered.annotated.json.gz

Please note that when specifying standalone mode with VCF input, you must also enable annotation options to generate the JSON file. Conversely, annotation options should not be specified when running standalone mode with an input annotated JSON file.

Optional settings:

--maf-include-non-pass-variants Outputs all variants, including non-PASS variants, in the MAF output file. By default, the MAF output contains only variants that have the PASS filter in the hard-filtered VCF file.

Example command lines:

MAF output from BAM input and variant caller:

MAF output from output directory and output file prefix, where the output directory contains a VCF file prefixed by the output file prefix:

MAF output from source VCF file:

Note: This command line will output the MAF file in the same location as the input VCF file. To specify a directory for output, add --output-dir and --output-file-prefix options.

MAF output from source annotated JSON file:

Note: This command line will output the MAF file in the same location as the input annotated JSON file. To specify a directory for output, add the --output-dir and --output-file-prefix options.

Joint Analysis

DRAGEN supports pedigree-based and population-based germline variant joint analysis for multiple samples. A pedigree-based analysis deals with samples from the same species which are related to each other. A population-based analysis compares samples of the same species which are unrelated to each other.

Joint analysis requires a gVCF file for each sample. To create a gVCF file, run the germline small variant caller with the --vc-emit-ref-confidence gVCF option. There is also the option to write a germline gVCF with reduced size using the option --vc-compact-gvcf. This results in a significant speed up for a downstream analysis using gVCF Genotyper. Please note that this compact format is not compatible with a pedigree analysis.

The gVCF file contains information on the variant positions and positions determined to be homozygous to the reference genome. For homozygous regions, the gVCF file includes statistics that indicate how well reads support the absence of variants or alternative alleles. Contiguous homozygous runs of bases with similar levels of confidence are grouped into blocks, referred to as hom-ref blocks. Not all entries in the gVCF are contiguous. A reference might contain gaps that are not covered by either a variant line or a hom-ref block. Gaps correspond to regions that are not callable. A region is not callable if there is not at least one read mapped to the region with a MAPQ score above zero.
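To make the notion of gaps concrete, the following Python sketch finds the positions on a contig that are covered by neither a variant line nor a hom-ref block. It assumes the gVCF records have already been reduced to 1-based inclusive (start, end) intervals sorted by start. The function name and the example intervals are hypothetical, and no gVCF parsing is shown.

```python
def find_gaps(records, contig_length):
    """Return uncovered (start, end) intervals on a contig.

    records: list of (start, end) tuples, 1-based inclusive, sorted by
             start; each one is a variant line or a hom-ref block.
    """
    gaps = []
    next_pos = 1  # first position not yet covered by any record
    for start, end in records:
        if start > next_pos:
            gaps.append((next_pos, start - 1))
        next_pos = max(next_pos, end + 1)
    if next_pos <= contig_length:
        gaps.append((next_pos, contig_length))
    return gaps


# Hom-ref block 1-100, a variant at 101, another block at 150-200.
print(find_gaps([(1, 100), (101, 101), (150, 200)], contig_length=250))
```

The intervals returned correspond to the regions that the text describes as not callable.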

Combined phased variants in the gVCF input

The DRAGEN germline variant caller has an option --vc-combine-phased-variants-distance to combine phased variants in the gVCF output. Input gVCF files created with this option cannot be processed in a population-based analysis using gVCF Genotyper.

The option to combine phased variants is switched off by default, for details please refer to the section on germline small variant calling in this user guide.

Force-genotyped and targeted variants in the gVCF input

If force genotyping was enabled for any input file, any ForceGT calls that are not also called by the variant caller will be ignored.

Similarly, targeted variant calls (option --targeted-merge-vc) in any gVCF file that are not also called by the variant caller will be ignored as well.

Processing GATK gVCF Files

Both pedigree- and population-based joint analysis can process gVCF files written by the GATK v4.1 variant caller.

Joint Analysis Output Format

There are two available joint analysis output files:

Multisample VCF--A VCF file containing a column with genotype information for each of the input samples according to the input variants.
Multisample gVCF--A gVCF file augmenting the content of a multisample VCF file, similar to how a gVCF file augments a VCF file for a single sample. In between variant sites, the multisample gVCF contains statistics that describe the level of confidence that each sample is homozygous to the reference genome. Multisample gVCF is a convenient format for combining the results from a pedigree or small cohort into a single file. If using a large number of samples, fluctuation in coverage or variation in any of the input samples creates a new hom-ref block, which causes a highly fragmented block structure and a large output file that can be slow to create.

The multisample gVCF output is only available in the pedigree-based analysis.

The following example shows a single line from a multi-sample VCF where one sample has a variant, and the other two samples are in a gVCF gap. Gaps are represented by "./.:.:".

Hom-ref Blocks FORMAT Fields

In hom-ref blocks, the following FORMAT fields are calculated uniquely.

FORMAT/DP--In a single sample gVCF, the FORMAT/DP reported at a hom-ref position is the median DP in that band. In a multisample gVCF, the FORMAT/DP reported at a hom-ref position is the MIN_DP from hom-ref calls.
FORMAT/AD--In single sample gVCF, values represent the position in the band where DP=median DP. In the multisample gVCF, AD values at hom-ref positions are copied from the single sample gVCF.
FORMAT/AF--Values are based on FORMAT/AD.
FORMAT/PL--Values represent the Phred likelihoods per genotype hypothesis. For hom-ref blocks, each value in FORMAT/PL represents the minimum value across all positions within the band.
FORMAT/SPL and FORMAT/ICNT--Parameters reported in the gVCF records, including both hom-ref blocks and variant records. The parameters are used to compute the confidence score of a variant being de novo in the proband of a trio. For SNP, FORMAT/PL and FORMAT/SPL are both used as input to the DeNovo Caller. FORMAT/PL represents Phred likelihoods obtained from the genotyper, if the genotyper is called. FORMAT/SPL represents Phred likelihoods obtained from column-wise estimation, pregraph. Each value in FORMAT/SPL represents the minimum across all positions within the band. For INDEL, the PL value is computed in the joint pedigree calling step based on the FORMAT/ICNT reported in the gVCF file. FORMAT/ICNT consist of two values. The first value is the number of reads with no indels at the position, and the second value is the number of reads with indels at the position. Each value in FORMAT/ICNT represents the maximum of the value across all positions within the band.

In the following example hom-ref block, ICNT provides information on whether each sample contains an Indel at the position of interest. If the proband contains an indel at the position and the ICNT of the parents does not indicate any read supporting an indel, then the confidence score is high for the proband to have an indel de novo call at the position.

SPL and ICNT values are specific to DRAGEN. The GATK variant caller does not output SPL and ICNT values.

In a single sample gVCF, FORMAT/DP reported at a hom-ref position is the median DP in the band. The minimum is also computed and printed as MIN_DP for the band.

In the multisample gVCF, MIN_DP from hom-ref calls is printed as FORMAT/DP, and AD is copied from the single sample gVCF. Therefore, at a hom-ref position in the multisample gVCF output, the DP is not necessarily going to be the sum of AD.
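The per-band summaries described in this section can be sketched in Python. The function name, the dictionary keys, and the example values are hypothetical. The sketch uses an odd number of positions so the median is unambiguous, and how DRAGEN resolves the median for an even number of positions is not covered by the text.

```python
from statistics import median


def summarize_band(dps, pls, icnts):
    """Summarize per-position values of one hom-ref band.

    dps:   depth at each position
    pls:   PL tuple (one value per genotype hypothesis) at each position
    icnts: (reads_without_indel, reads_with_indel) at each position
    """
    return {
        "DP": median(dps),                                  # median depth
        "MIN_DP": min(dps),                                 # minimum depth
        "PL": tuple(min(col) for col in zip(*pls)),         # per-value minimum
        "ICNT": tuple(max(col) for col in zip(*icnts)),     # per-value maximum
    }


band = summarize_band(
    dps=[30, 32, 28, 35, 31],
    pls=[(0, 60, 700), (0, 54, 650), (0, 66, 720), (0, 57, 680), (0, 63, 710)],
    icnts=[(30, 0), (31, 1), (28, 0), (35, 0), (31, 0)],
)
print(band)

# In the multisample gVCF, the band's MIN_DP is what appears as FORMAT/DP.
multisample_dp = band["MIN_DP"]
```

Because the multisample DP is taken from MIN_DP while AD is copied from the position with median depth, the two values need not agree.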

Pedigree Mode

Use pedigree mode to jointly analyze samples from related individuals and to perform de novo calling.

To invoke pedigree mode, set the --enable-joint-genotyping option to true. Use the --pedigree-file option to specify the path to a pedigree file that describes the relationship between samples.

The pedigree file must be a tab-delimited text file with the file name ending in the .ped extension. The following information is required.

The following is an example of an input pedigree file.

De Novo Calling

The De Novo Caller identifies all the trios within the pedigree and generates a de novo score for each child. The De Novo Caller supports multiple trios within a single pedigree. Pedigree Mode supports de novo calling for small, structural, and copy number variants.

Pedigree Mode is run in multiple steps. The following is an example workflow for a trio using FASTQ input.

Run single sample alignment and variant calling to generate per sample output using the following inputs for Pedigree Mode.
- gVCF files for the Small Variant Caller.
- *.tn.tsv files for the Copy Number Caller.
- BAM files for the Structural Variant Caller.

Small Variant DeNovo Calling

The Small Variant De Novo Caller considers a trio of samples at a time. The samples are related via a pedigree file. The Small Variant De Novo Caller determines all positions that have a Mendelian conflict based on the genotype from the individual sample gVCFs. Sex chromosomes in males are treated as haploid apart from the PAR regions, which are treated as diploid.

Each of those positions is then processed through the Pedigree Caller to compute a joint posterior probability matrix for the possible genotypes. The probabilities are used to determine whether the proband has a de novo variant with a DQ confidence score. All three subjects are assumed to have an independent error probability.

At positions where the original genotype from the gVCFs shows a double Mendelian conflict (eg, 0/0+0/0->1/1 or 1/1+1/1->0/0), the genotypes of the trio samples can be adjusted to the highest joint posterior probability that has at least one Mendelian conflict.

The DQ formula is DQ = -10log10(1 - Pdenovo).

Pdenovo is the sum of all entries in the joint posterior probability matrix that have one or more Mendelian conflicts.
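As a rough illustration of the DQ formula, the sketch below sums the posterior probability of every trio genotype combination that conflicts with Mendelian inheritance and converts the result to a Phred-like score. It assumes diploid genotypes only and a sparse dictionary in place of the full posterior matrix; the function names, the data layout, and the clamp that avoids log10(0) are all assumptions made for this example, not DRAGEN behavior.

```python
import math


def is_mendelian_consistent(father, mother, child):
    """True if the diploid child genotype can take one allele from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def de_novo_quality(joint_posterior):
    """joint_posterior maps (father_gt, mother_gt, child_gt) -> probability."""
    p_denovo = sum(
        p
        for (father, mother, child), p in joint_posterior.items()
        if not is_mendelian_consistent(father, mother, child)
    )
    p_denovo = min(p_denovo, 1.0 - 1e-12)  # assumption: clamp to avoid log10(0)
    return -10.0 * math.log10(1.0 - p_denovo)


HOM_REF, HET = (0, 0), (0, 1)
# Hypothetical posterior with three non-zero entries.
posterior = {
    (HOM_REF, HOM_REF, HET): 0.90,      # conflict: the child carries a new allele
    (HOM_REF, HET, HET): 0.08,          # consistent with inheritance
    (HOM_REF, HOM_REF, HOM_REF): 0.02,  # consistent with inheritance
}
dq = de_novo_quality(posterior)
print(round(dq, 3))
```

With Pdenovo = 0.9, the score is -10log10(0.1), which is approximately 10.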

In the GT overwrite step, it is possible for the GT of the parents to be overwritten. In the case of multiple trios, the GT of the parents is based on the last trio processed. The trios are processed in the order they are listed in the pedigree file. DRAGEN currently does not add an annotation in the VCF in cases where the GT was overwritten.

The output multisample VCF file is annotated with FORMAT/DQ and FORMAT/DN fields, which represent the de novo quality score and the associated de novo call. The DN field indicates the de novo status of each call.

The following are the possible values:

Inherited--The called trio genotype is consistent with Mendelian inheritance.
LowDQ--The called trio genotype is inconsistent with Mendelian inheritance and DQ is less than the de novo quality threshold.
DeNovo--The called trio genotype is inconsistent with Mendelian inheritance and DQ is greater than or equal to the de novo quality threshold.
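The three DN values can be read as a simple decision rule. The following sketch is hypothetical code, not part of DRAGEN; the threshold of 0.05 used in the example is the SNP default given for --qc-snp-denovo-quality-threshold when ML recalibration is off.

```python
def classify_de_novo(mendelian_conflict, dq, threshold):
    """Return the DN label for one proband call.

    mendelian_conflict : True if the called trio genotype violates Mendelian inheritance
    dq                 : de novo quality score of the call
    threshold          : de novo quality threshold for this variant type
    """
    if not mendelian_conflict:
        return "Inherited"
    return "DeNovo" if dq >= threshold else "LowDQ"


print(classify_de_novo(True, 0.67375, 0.05))
print(classify_de_novo(True, 0.01, 0.05))
print(classify_de_novo(False, 0.0, 0.05))
```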

The following is an example VCF line for a trio:

1 16355525 . G A 34.46 PASS AC=1;AF=0.167;AN=6;DP=45;FS=6.69;MQ=108.04;MQRankSum=-0.156;QD=2.46;ReadPosRankSum=0;SOR=0.016 GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP:DPL:DN:DQ 0/1:11,3:0.214:14:39:PASS:8,2:3,1:74,0,47:39.454,0.00053613,49.99:0,1,104:74,0,47:DeNovo:0.67375 0/0:18,0:0:16:48:PASS:.:.:0,48,605:.:0,12,224:0,48,255:.:. 0/0:14,0:0:14:42:PASS:.:.:0,42,490:.:0,5,223:0,42,255:.:.
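To read the DN and DQ values out of such a record, the FORMAT keys can be zipped with each sample column. The sketch below parses the example line above with plain Python; the helper name is hypothetical. A real VCF is tab-delimited, and split() without arguments handles both tabs and the spaces shown here.

```python
line = (
    "1 16355525 . G A 34.46 PASS "
    "AC=1;AF=0.167;AN=6;DP=45;FS=6.69;MQ=108.04;MQRankSum=-0.156;QD=2.46;ReadPosRankSum=0;SOR=0.016 "
    "GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP:DPL:DN:DQ "
    "0/1:11,3:0.214:14:39:PASS:8,2:3,1:74,0,47:39.454,0.00053613,49.99:0,1,104:74,0,47:DeNovo:0.67375 "
    "0/0:18,0:0:16:48:PASS:.:.:0,48,605:.:0,12,224:0,48,255:.:. "
    "0/0:14,0:0:14:42:PASS:.:.:0,42,490:.:0,5,223:0,42,255:.:."
)


def sample_fields(vcf_line):
    """Return one {FORMAT key: value} dictionary per sample column."""
    cols = vcf_line.split()
    keys = cols[8].split(":")  # column 9 is FORMAT
    return [dict(zip(keys, sample.split(":"))) for sample in cols[9:]]


samples = sample_fields(line)
for sample in samples:
    print(sample["GT"], sample["DN"], sample["DQ"])
```

Only the first sample carries DN and DQ values; the other two samples report the missing value "." for both fields.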

Pedigree Mode Options

The following command line options are available for de novo small variant calling.

--enable-joint-genotyping--Run the joint genotyping caller.
--pedigree-file--Specify the path to a pedigree file that describes the relationship between samples. It is possible to run JointGenotyper without a pedigree file on unrelated samples, but we do not recommend this anymore for gVCF variant calls from DRAGEN 3.10 or newer.
--variant or --variant-list--Specify the gVCF input to the workflow. The pedigree caller can read input gVCF files from an AWS S3 bucket, Azure storage BLOB, or pre-signed URL.
--qc-snp-denovo-quality-threshold--Specify the minimum DQ value for a SNP to be considered de novo. The default is 0.05 if ML recalibration is off, 0.0017 if ML recalibration is on.
--qc-indel-denovo-quality-threshold--Specify the minimum DQ value for an indel to be considered de novo. The default is 0.4 if ML recalibration is off, 0.04 if ML recalibration is on.
--output-directory--The output directory. This is required.
--output-file-prefix--The prefix used to label all output files. This is required.
-r--The directory where the hash table resides.

The output of the joint genotyper depends on the order of input gVCF files passed on the command line using --variant or --variant-list. It is recommended to use the same input order when re-analyzing gVCFs to ensure the output is the same as an earlier run.

Population Mode

DRAGEN provides a population-based analysis option to jointly analyze samples from unrelated individuals.

To compare multiple pedigrees, you can run gVCF Genotyper on the output of a pedigree analysis and merge multiple joint-called pedigrees into a single multisample VCF. To enable, run the pedigree analysis using the --enable-multi-sample-gvcf=true option to write a multisample gVCF.

Iterative gVCF Genotyper analysis

gVCF Genotyper offers an iterative workflow to aggregate new samples into an existing cohort. The iterative workflow allows users to incrementally aggregate new batches of samples with existing batches, without redoing the analysis from scratch across all samples every time new samples become available. The workflow takes single sample gVCF files as input, and can be performed in a "step-by-step mode" if multiple batches of samples are available, or in an "end-to-end mode" if only a single batch of samples is available. Multi-sample gVCF files output from the Pedigree Caller (described above) are also accepted as input. gVCF Genotyper can accept input gVCF files generated using DRAGEN version 3.2.6 or later.

Step 1 (gVCF aggregation): the user can use iterative gVCF Genotyper to aggregate a batch of gVCF files into a cohort file and a census file. The cohort file is a condensed data format to store gVCF data in multiple samples, similar to a multi-sample gVCF. The census file stores summary statistics of all the variants and hom-ref blocks among samples in the cohort. As part of this step, adjacent hom-ref blocks with matching FILTER columns are further merged to reduce the disk footprint of the intermediate files. In the process, FORMAT field values are averaged, weighted by the number of base pairs in each block.

When a large number of samples are available, the user can divide samples into multiple batches each with similar sample size (e.g. 1000 samples), and repeat Step 1 for every batch.

Step 2 (census aggregation): after all per batch census files are generated, the user can aggregate them into a single global census file. This step scales to aggregate thousands of batches, in a much more efficient way than aggregating gVCFs from all batches. When a new batch of samples becomes available, the user only needs to perform Step 1 on that batch, then aggregate the census file from the new batch with the global census file from all previous batches in order to generate an updated global census file.

Step 3 (msVCF generation): every time a global census file is updated, with new variant sites discovered and/or variant statistics updated at existing variant sites, the user can take a per-batch cohort file, per-batch census file and the global census file as input, and generate a multi-sample VCF for one batch of samples. The output multi-sample VCF contains the variants and alleles discovered in all samples from all batches, and also includes global statistics such as allele frequencies, the number of samples with or without genotypes, and the number of samples without coverage. Similar statistics among samples in the batch are also included. This step can be repeated for every batch of samples, and the number of records in each output multi-sample VCF is the same across all batches.

To facilitate parallel processing on distributed compute nodes, for every step above, the user can choose to split the genome into shards of equal size, and process each shard using one instance of iterative gVCF Genotyper on each compute node. See option --shard below.

There is a special treatment of alternative or unaligned contigs when the --shard option is enabled: all contigs that are not autosomes, X, Y or chrM are included in the last shard. No other contigs will be assigned to the last shard. The mitochondrial contig will always be on its own in the second to last shard.

If a combined msVCF of all batches is required, an additional step can be separately run to merge all of the batch msVCF files into a single msVCF containing all samples.

Commandline arguments common to all steps:

--enable-gvcf-genotyper-iterative: set to true to run the iterative gVCF Genotyper (always required).
--ht-reference: The file containing the reference sequence in FASTA format (always required).
--output-directory: The output directory (always required).
--output-file-prefix: The prefix used to label all output files (optional, default value dragen).
--shard: Use this option to process only a portion ('shard') of the genome, when distributing the work across multiple compute nodes in a production workflow. Provide the index (1-based) of the shard to process and the total number of shards, in the format of n/N (e.g. 1/50 means shard 1 of total 50 shards). To facilitate concurrent processing within each job, the shard will by default be split into 10x the number of available threads. This option assumes a Human reference genome and might not work for non-Human reference genomes.
--gg-regions: Use this option to test iterative gVCF Genotyper only for a subset of regions in the genome. The value is a list of regions (chr:start-end) delimited by comma. Contig names must match those in the reference and no region may overlap another. If a single region larger than 1Mb is selected, multiple threads are enabled. Otherwise, one thread is launched per region. This assumes that the --shard option is not given. It is important that the same regions are chosen for each step 1,2 and 3.
--gg-regions-bed: If a path to a BED file is provided as value, this option, like the one above, will limit the iterative gVCF Genotyper processing to the genome regions specified therein, which must be non-overlapping. This option is intended for exome input data. It results in faster processing times and is compatible with sharding. This option will only take effect in step 1 or end-to-end mode. It differs from the option above in that, if the number of regions exceeds 10 times the number of available threads, they will not necessarily be processed by independent threads.
--gg-discard-ac-zero: If set to true, the gVCF Genotyper does not print variant alleles that are not called (hom-ref genotype) in any sample. The default value is true.
--gg-remove-nonref: If set to true, the <NON_REF> symbolic allele is removed in the process of reading in input gVCF files. The default value is true.
--gg-vc-filter: Discard input variants that failed filters in the upstream caller. The default is false. Affected records will have their genotype set to hom-ref and the filter string "ggf" added to FORMAT/FT.
--gg-skip-filtered-sites: Omits msVCF records that fail the given hard filter. The default is false.
--gg-squeeze-msvcf: Set to omit genotype fields other than GT from the output msVCF for confidently called hom-ref sample records.
--gg-gq-squeezing-threshold: Use in conjunction with the previous option to adjust the threshold on GQ (default 30) that signifies a confident hom-ref call.
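Because --gg-regions requires a comma-delimited list of non-overlapping chr:start-end regions, it can be useful to check a region string before launching a run. The following is a minimal sketch of such a check; the helper names are hypothetical and the coordinates are assumed to be inclusive, which the description above does not state.

```python
def parse_regions(value):
    """Parse 'chr:start-end,chr:start-end,...' into (contig, start, end) tuples."""
    regions = []
    for item in value.split(","):
        # rpartition keeps contig names that themselves contain ':' intact
        contig, _, span = item.strip().rpartition(":")
        start, _, end = span.partition("-")
        regions.append((contig, int(start), int(end)))
    return regions


def has_overlap(regions):
    """True if any two regions on the same contig overlap (inclusive ends assumed)."""
    ordered = sorted(regions)
    for prev, cur in zip(ordered, ordered[1:]):
        if prev[0] == cur[0] and cur[1] <= prev[2]:
            return True
    return False


ok = parse_regions("chr1:1-1000,chr2:500-900")
bad = parse_regions("chr1:1-1000,chr1:900-2000")
print(has_overlap(ok), has_overlap(bad))
```

Checking only neighbors after sorting is sufficient to detect whether any overlap exists, because the region that sorts directly after an overlapped region must itself start inside it.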

Commandline arguments for Step1 (step-by-step mode)

--gvcfs-to-cohort-census: set to true to aggregate gVCF files from one batch of samples into a cohort file and a census file.
--variant-list: the path to a file containing a list of input gVCF files, with the absolute path to each file on a separate line.
--variant: if --variant-list is not given, use this option for each input gVCF file. Absolute file paths must be provided.

Commandline arguments for Step2 (step-by-step mode)

--aggregate-censuses: set to true to aggregate a list of per batch census files into a global census file.
--input-census-list: the path to a file containing a list of input per batch census files (from Step1), with the absolute path to each file on a separate line.

Commandline arguments for Step3 (step-by-step mode)

--generate-msvcf: set to true to generate a multi-sample VCF for one batch of samples.
--input-cohort-file: the path to the per batch cohort file (from Step1).
--input-census-file: the path to the per batch census file (from Step1).
--input-global-census-file: the path to the global census file (from Step2).
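The three steps can be chained from a small driver script. The sketch below only assembles argument lists from the options documented above and does not run anything. It assumes that the executable is invoked as dragen and that options are passed as --option=value; every path is a hypothetical placeholder, and the names of the cohort and census files written by Step 1 are not specified here, so they must be taken from the actual Step 1 output.

```python
def gg_command(step_args, reference, out_dir, prefix, shard=None):
    """Build one iterative gVCF Genotyper command line as a list of arguments."""
    cmd = [
        "dragen",  # assumption: executable name
        "--enable-gvcf-genotyper-iterative=true",
        f"--ht-reference={reference}",
        f"--output-directory={out_dir}",
        f"--output-file-prefix={prefix}",
    ]
    if shard:
        cmd.append(f"--shard={shard}")
    return cmd + list(step_args)


REF = "/refs/genome.fa"  # hypothetical path

step1 = gg_command(
    ["--gvcfs-to-cohort-census=true", "--variant-list=/data/batch1/gvcfs.txt"],
    REF, "/out/batch1", "batch1", shard="1/50",
)
step2 = gg_command(
    ["--aggregate-censuses=true", "--input-census-list=/data/census_files.txt"],
    REF, "/out/global", "global", shard="1/50",
)
step3 = gg_command(
    [
        "--generate-msvcf=true",
        "--input-cohort-file=/out/batch1/COHORT_FILE",          # placeholder name
        "--input-census-file=/out/batch1/CENSUS_FILE",          # placeholder name
        "--input-global-census-file=/out/global/GLOBAL_CENSUS",  # placeholder name
    ],
    REF, "/out/batch1_msvcf", "batch1", shard="1/50",
)
for cmd in (step1, step2, step3):
    print(" ".join(cmd))
```

Each list could be passed to subprocess.run once the placeholder paths are replaced. The same shard value is used for all three steps in this example.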

Commandline arguments for running Step1 + Step2 + Step3 (end-to-end mode for a single batch)

--gvcfs-to-msvcf: set to true to enable the end-to-end mode. This is the default if none of the steps 1, 2 or 3 above is selected.
--variant-list: the path to a file containing a list of input gVCF files, with the absolute path to each file on a separate line.
--variant: if --variant-list is not given, use this option for each input gVCF file. Absolute file paths must be provided.

Commandline arguments for merging per-batch msVCF files

--merge-batches: set to true to merge msVCF files for a set of batches.
--input-batch-list: the path to a file containing a list of msVCF files to be merged, with the path to each file on a separate line. All the files listed must have been generated from the same global census file, with the same set of options, and by default all batches pertaining to that global census must be included in the merge.
--gg-enable-indexing: set to true (the default) to generate a tabix index for the merged msVCF.
--gg-merge-subset: set to override the restriction that all batches must be included in the merge.

Enabling use of mimalloc library for enhanced performance

Mimalloc is a custom memory allocation library that can yield a significant speed-up in the iterative gVCF Genotyper workflow. In some deployments, e.g. cloud, it is automatically and seamlessly used, but in other contexts it requires special user intervention to be activated, as at present it cannot be included in standard DRAGEN by default.

For this purpose, the convenience script mi_dragen.sh is provided. It loads the bundled library and can be used in the same way as the DRAGEN executable. The script is intended and supported only for the iterative gVCF Genotyper component. Although it can in principle be applied to any other DRAGEN workflow, doing so is known to potentially cause excessive memory use and is undertaken at the user's own risk.

The multi-sample VCF output of the iterative gVCF Genotyper

The output of gVCF Genotyper is a multi-sample VCF (msVCF) that contains metrics computed for all samples in the cohort.

The msVCF can become a very large file with increasing cohort size. In some cases, the file might need more storage than can be allocated by VCF parsers. This is caused by VCF entries such as FORMAT/PL which store a value for each combination of alleles. We therefore decided to replace FORMAT/PL with a tag FORMAT/LPL which stores a value only for the alleles that actually occur in the sample. Similarly, the msVCF also contains FORMAT/LAD which stores the allelic depth only for the alleles occurring in the sample.

We also added a new FORMAT/LAA field which lists 1-based indices of the alternate alleles that occur in the current sample. The allele order of other local fields is the same as that of LAA.
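To make the relationship between the global and local fields concrete, the sketch below derives LAD and LPL from site-level AD and PL arrays for a diploid sample, given its LAA. It is an illustration only: the function names are hypothetical, the reference allele is assumed to come first in every local array, LAA is assumed to be sorted in ascending order, and PL is assumed to follow the genotype ordering of the VCF specification.

```python
def pl_index(j, k):
    """Index of diploid genotype j/k (j <= k) in a VCF-ordered PL array."""
    return k * (k + 1) // 2 + j


def localize(ad, pl, laa):
    """Reduce site-level AD and PL to the alleles that occur in one sample.

    laa holds 1-based ALT indices; allele 0 (the reference) is always kept.
    """
    alleles = [0] + list(laa)
    lad = [ad[a] for a in alleles]
    lpl = [
        pl[pl_index(alleles[j], alleles[k])]
        for k in range(len(alleles))
        for j in range(k + 1)
    ]
    return lad, lpl


# Hypothetical site with REF and three ALT alleles; the sample carries ALT 2 only.
ad = [12, 0, 9, 0]
pl = [90, 120, 300, 0, 110, 250, 130, 310, 260, 400]  # 10 genotype hypotheses
lad, lpl = localize(ad, pl, laa=[2])
print(lad, lpl)
```

In this example the ten PL values shrink to three and the four AD values to two, which is the source of the size reduction described above.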

Iterative gVCF Genotyper with Mitochondrial Variant Calls

When processing mitochondrial variant calls, which may contain separate records for each allele, iterative gVCF Genotyper processing differs in the following ways:

Only the record with the highest FORMAT/AF sum is kept.
The FORMAT/AF field is additionally collected and used to generate the FORMAT/LAF field in the output msVCF.

QUAL column in msVCF

The value displayed in the QUAL column of the msVCF is the maximum of the input QUAL values for the site across the global cohort. The QUAL value will be missing if any of the batch census files used to create the global census were generated with a version of DRAGEN earlier than v4.2.

Measures of Hardy-Weinberg Equilibrium in the msVCF output

Iterative gVCF Genotyper offers several metrics for assessing adherence to HWE. It calculates both allele-wise and site-wise HWE P-values, an allele-wise excess heterozygosity (ExcHet) P-value and the site-wise inbreeding coefficient (IC). These metrics are calculated only for diploid sites and missing values are excluded from the calculations. These values are included as fields in the INFO column of the output msVCF file. Both batch-wise and global values are included, where the field names for the global values are prefixed with G.

Care should be taken when interpreting these metrics for small cohorts and/or low frequency alleles, as small changes in inputs can lead to large changes in their values. Further, violations of the underlying HWE assumptions (such as inbreeding), and non-random sampling (such as the presence of consanguineous samples), can adversely affect results, making identification of poorly called variants more difficult.

Where a metric cannot be calculated, it is represented as missing (i.e., ".") in the msVCF file. The conditions vary between the metrics, but this may occur if non-diploid genotypes are encountered, if there is only one allele present at a site, or if no samples are genotyped at a site.

Allele-wise vs site-wise Hardy-Weinberg metrics

Allele-Wise Hardy-Weinberg Equilibrium and Excess Heterozygosity.

For HWE a P-value of ≈ 1 suggests that the distribution of heterozygotes and homozygotes is close to that expected under HWE, while a P-value of ≈ 0 suggests a deviation from it. For ExcHet a P-value of ≈ 0.5 suggests that the number of heterozygotes is close to the number expected under HWE, while a value ≈ 1 suggests that there are more heterozygotes than expected and a value ≈ 0 suggests that there are fewer heterozygotes than expected.

For a bi-allelic site, the HWE P-value is based on comparing the observed numbers of homozygotes and heterozygotes to the expected numbers. For a multi-allelic site, P-values are calculated per ALT allele as if it were bi-allelic. Genotypes composed of only the ALT allele being considered are counted as alternative homozygous, any other genotype containing a copy of the ALT allele being considered is counted as heterozygous, and any genotype with no copies of the ALT allele being considered is counted as reference homozygous (this may include genotypes containing other ALT alleles).
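The per-allele collapsing of a multi-allelic site can be sketched as a counting step. The function below is hypothetical and covers only the genotype classification described above, not the P-value calculation; genotypes are given as pairs of allele indices, and missing genotypes are represented by None and skipped.

```python
def allele_genotype_counts(genotypes, alt):
    """Count genotypes for one ALT allele as if the site were bi-allelic.

    Returns (hom_ref, het, hom_alt) where "ref" means "any allele other than alt".
    """
    hom_ref = het = hom_alt = 0
    for gt in genotypes:
        if gt is None or len(gt) != 2:
            continue  # missing and non-diploid genotypes are excluded
        copies = sum(1 for allele in gt if allele == alt)
        if copies == 2:
            hom_alt += 1
        elif copies == 1:
            het += 1
        else:
            hom_ref += 1
    return hom_ref, het, hom_alt


genotypes = [(0, 1), (1, 1), (1, 2), (2, 2), (0, 0), None]
print(allele_genotype_counts(genotypes, alt=1))
print(allele_genotype_counts(genotypes, alt=2))
```

For ALT allele 1, the 2/2 genotype is counted as reference homozygous even though it contains another ALT allele.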

Site-Wise Hardy-Weinberg Equilibrium.

Iterative gVCF Genotyper calculates a site-wise HWE P-value. The value is calculated using the Pearson's chi-squared method, comparing the genotype counts expected under HWE to those observed. The chi-squared test statistic is calculated as

$$\chi^2 = \sum_{gt} \frac{(E_{gt} - O_{gt})^2}{E_{gt}}$$

where the summation is over all genotypes gt possible at the site given the alleles present, and $E_{gt}$ and $O_{gt}$ are the expected and observed counts for genotype gt, respectively. From the chi-squared test statistic, the P-value is then calculated from a chi-squared distribution where the number of degrees of freedom is the number of possible genotypes minus the number of alleles, which for a diploid site is

$$\frac{n(n+1)}{2} - n = \frac{n(n-1)}{2}$$

where n is the number of alleles.

The batch-wise value uses only the alleles present in the batch. Alleles with AC=0 are not included in the calculation.

A P-value of ≈ 1 suggests that the distribution of heterozygotes and homozygotes is close to that expected under HWE, while a P-value of ≈ 0 suggests a deviation from it.
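The site-wise calculation can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not DRAGEN's implementation: allele frequencies are estimated from the observed genotype counts, expected counts are N p_i^2 for homozygotes and 2 N p_i p_j for heterozygotes, and the chi-squared tail probability uses the standard closed-form series for integer degrees of freedom. All function names are hypothetical.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared distribution (integer df >= 1)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term = total = 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(chi / math.sqrt(2)) + math.sqrt(2 / math.pi) * math.exp(-half) * total


def site_hwe_pvalue(genotype_counts):
    """genotype_counts maps diploid genotypes (i, j), i <= j, to observed counts."""
    allele_counts = Counter()
    for (i, j), count in genotype_counts.items():
        allele_counts[i] += count
        allele_counts[j] += count
    alleles = sorted(a for a, c in allele_counts.items() if c > 0)  # drop AC=0
    n = len(alleles)
    total_alleles = sum(allele_counts.values())
    if n < 2:
        return None  # cannot be calculated; reported as missing
    samples = total_alleles / 2
    chi2 = 0.0
    for idx, b in enumerate(alleles):
        for a in alleles[: idx + 1]:
            pa = allele_counts[a] / total_alleles
            pb = allele_counts[b] / total_alleles
            expected = samples * (pa * pa if a == b else 2 * pa * pb)
            observed = genotype_counts.get((a, b), 0)
            chi2 += (expected - observed) ** 2 / expected
    df = n * (n + 1) // 2 - n  # possible genotypes minus number of alleles
    return chi2_sf(chi2, df)


print(site_hwe_pvalue({(0, 0): 25, (0, 1): 50, (1, 1): 25}))  # matches HWE
print(site_hwe_pvalue({(0, 0): 50, (0, 1): 0, (1, 1): 50}))   # no heterozygotes
```

The first site matches the HWE expectation and yields a P-value of 1, while the second site has no heterozygotes at all and yields a P-value close to 0.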

The Inbreeding Coefficient

Iterative gVCF Genotyper calculates the inbreeding coefficient (IC), sometimes called the fixation index and denoted by F. It is defined as the proportion of the population that is inbred. The value of IC can be estimated by comparing the observed number of heterozygotes to the number expected under HWE:

$$IC = 1 - \frac{O(het)}{E(het)}$$

where O(het) and E(het) are the observed and expected number of heterozygotes in the cohort, respectively. Although initially conceived for studying inbreeding and defined as a non-negative value, it is also commonly used to look for deviations from HWE and can take values in the range [-1, 1].

Values of IC ≈ 0 suggest that the cohort is in HWE. Negative values suggest an excess of heterozygosity and a deviation from HWE, which can be symptomatic of poor variant calling. Positive values suggest a deficit of heterozygotes and the possible presence of inbreeding.

Using the above definition, IC should be a property of the population, and so would be expected to be drawn from the same distribution for all sites and for all variants at a site. Deviations from this distribution can suggest issues in calling a site correctly. Violations of HWE assumptions and/or non-random sampling may adversely affect the distribution of IC, causing it to be shifted. However, outliers can still be identified, although thresholds may need to be adjusted accordingly.
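The following sketch estimates a site-wise IC from genotype counts, assuming the common estimator IC = 1 - O(het)/E(het) with E(het) = N(1 - sum of squared allele frequencies). The function name and the input layout are hypothetical, and DRAGEN's exact handling of edge cases is not modeled.

```python
def inbreeding_coefficient(genotype_counts):
    """genotype_counts maps diploid genotypes (i, j) to observed counts."""
    allele_counts = {}
    observed_het = 0
    for (i, j), count in genotype_counts.items():
        allele_counts[i] = allele_counts.get(i, 0) + count
        allele_counts[j] = allele_counts.get(j, 0) + count
        if i != j:
            observed_het += count
    total_alleles = sum(allele_counts.values())
    if total_alleles == 0:
        return None  # no genotyped samples
    samples = total_alleles / 2
    homozygosity = sum((c / total_alleles) ** 2 for c in allele_counts.values())
    expected_het = samples * (1 - homozygosity)
    if expected_het == 0:
        return None  # only one allele present
    return 1 - observed_het / expected_het


print(inbreeding_coefficient({(0, 0): 25, (0, 1): 50, (1, 1): 25}))  # in HWE
print(inbreeding_coefficient({(0, 0): 50, (1, 1): 50}))              # het deficit
print(inbreeding_coefficient({(0, 1): 100}))                         # het excess
```

The three examples give 0, 1 and -1, covering the middle and both ends of the range [-1, 1].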

Allelic Balance

Allelic balance (AB) describes the proportion of reads that support each allele within a called genotype and can be calculated from the allelic depth (FORMAT/AD or FORMAT/LAD). For homozygotes this is taken as

$$AB = \frac{AD_i}{\sum_j AD_j}$$

where i is the index of the called allele and j runs over all alleles. For heterozygotes this is taken as

$$AB_i = \frac{AD_i}{AD_a + AD_b}$$

where a and b are the indices of the called alleles and i can have values a or b. For homozygous genotypes AB is expected to be ≈ 1 and for heterozygous genotypes it is expected to be ≈ 0.5 for each allele. Deviations from the expected values can be indicative of an error.

DRAGEN's iterative gVCF Genotyper calculates site-wise AB values for each allele based on the read depths among all samples. Only diploid genotypes are included in the calculations. Values are calculated separately among homozygous (ABHom) and heterozygous (ABHet) genotypes. ABHet is calculated using the counts among all heterozygous calls that contain the allele under consideration. P-values for ABHet are also calculated (ABHetP) based on a binomial test with an expected probability of 0.5. A P-value of ≈ 1 signifies that results are in line with expectation while ≈ 0 signifies a deviation from expectation. Values are written to the INFO fields ABHom, ABHet and ABHetP, with one value for each allele (including the reference allele). Values should be in the range [0, 1]. Missing values are coded by -1, for example where there are no homozygous calls for an allele. If AD is not present in any input gVCF file, the values are not calculated and the fields will be omitted from the output msVCF file.
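The per-genotype AB formulas and a binomial test against an expected probability of 0.5 can be sketched as follows. The helper names are hypothetical. The text above does not say whether the test is one-sided or two-sided, so the two-sided exact test used here is an assumption, and the site-wise pooling across samples is not shown.

```python
import math


def allelic_balance(ad, gt):
    """Return {allele index: AB} for one diploid genotype given its AD array."""
    a, b = gt
    if a == b:
        total = sum(ad)  # homozygote: called allele over all alleles
        return {a: ad[a] / total} if total else {}
    total = ad[a] + ad[b]  # heterozygote: each called allele over both
    return {a: ad[a] / total, b: ad[b] / total} if total else {}


def binomial_two_sided_p(k, n):
    """Two-sided exact binomial P-value for k of n reads with p = 0.5."""
    if n == 0:
        return None
    tail = sum(math.comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(allelic_balance([11, 3], (0, 1)))
print(allelic_balance([0, 20], (1, 1)))
print(binomial_two_sided_p(7, 14), binomial_two_sided_p(0, 10))
```

A perfectly balanced heterozygote (7 of 14 reads) gives a P-value of 1, whereas 0 of 10 reads gives a P-value of about 0.002.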

msVCF hard filtering

Sites in the output msVCF can be filtered on the following global metrics:

QUAL
Number of samples with called genotypes (GNS_GT)
Inbreeding coefficient (GIC)
𝜒2 Hardy-Weinberg Equilibrium P-value (GHWEc2)
The maximum P-value for heterozygous allelic balance (GABHetP)

msVCF metric customization

The per-sample genotype metrics in the output msVCF can be customized by providing a colon-separated list of metrics, analogous to that of the VCF FORMAT column, to the --gg-msvcf-format-fields option, e.g. --gg-msvcf-format-fields=GT:LAD:LPL:LAA:QL. Supported metrics are GT, GQ, AD, LAD, FT, LPL, LAA, LA, LGT, QL, MQR, LAF and DF (N.B. LAF will only appear on the MT contig and DF will only appear if the --gg-diploidify option is enabled). Sample genotype (GT) is always present and always shown first, regardless of whether it is included in the option string or not. Alternatively, an msVCF containing only site statistics and no per-sample genotype fields can be generated using the option --gg-msvcf-format-fields=None.

The per-site INFO metrics in the output msVCF can be customized by providing a semicolon-separated list of metrics, analogous to that of the VCF INFO column, to the --gg-msvcf-info-fields option, e.g. --gg-msvcf-info-fields=AC;AN;NS;NS_GT;NS_NOGT;NS_NODATA;AF. Supported metrics are AC, AN, NS, NS_GT, NS_NOGT, NS_NODATA, IC, HWE, ExcHet, HWEc2, AF. The default set of metrics is AC, AN, NS, NS_GT, NS_NOGT, NS_NODATA, IC, HWE, ExcHet and HWEc2. All INFO fields can be included using the option --gg-msvcf-info-fields=All. All INFO fields can be dropped using the option --gg-msvcf-info-fields=None, in which case the INFO field will contain the missing symbol, .. For each specified metric, the value for the current batch and the global value are written. For global values, the metric names are prepended by G. INFO fields that have a missing value, ., at a site are omitted from the msVCF for that site, so sites may contain different sets of fields.

File size optimizations

For sizable cohorts, the file outputs from gVCF Genotyper can become extremely large. However, there are a number of options within the component which can mitigate this. As well as reduced footprint on disk, these options can lead to faster runtimes owing to the diminished I/O demands.

The following options have applicability to this:

The small variant caller's --vc-compact-gvcf, described previously. This doesn't reduce output file sizes, but the smaller input gVCFs reduce gVCF Genotyper runtime and could reduce data storage costs.
The removal of the NON_REF symbolic allele when ingesting the input gVCF files, which is the default behaviour. Doing this reduces the size not only of the final msVCF output, but also the intermediate cohort and census files.
Several options exist that reduce the volume of data written to the final msVCF file:
- Omitting records that fail filters (--gg-skip-filtered-sites option).
- Dropping trailing genotype fields for hom-ref records (--gg-squeeze-msvcf option). This behaviour is explicitly permitted by the VCF specification.

1: The number of values is coded as per the VCF specification, with A denoting one value per alt allele, R one value per possible allele (including the reference allele), G one value per possible genotype and . an unspecified number of values that may vary between site and sample. The number of elements in localised array FORMAT fields that depend on the number of local alleles will vary between samples and so are specified as ..

De Novo Small Variant Filtering

The filtering step identifies de novo variant calls of the joint calling workflow in regions with ploidy changes. Since de novo calling can have reduced specificity in regions where at least one of the pedigree members shows non-diploid genotypes, the de novo variant filtering marks relevant variants and thus can improve specificity of the call set.

Based on the structural and copy number variant calls of the pedigree, the FORMAT/DN field in the proband is changed from the original DeNovo value to DeNovoSV or DeNovoCNV if the de novo variant overlaps with a ploidy-changing SV or CNV, respectively. All other variant details remain unchanged, and all variants of the input VCF will also be present in the filtered output VCF. Structural or copy number variants which result in no change of ploidy, such as inversions, are not considered in the filtering. As an example, a de novo SNV call in the input VCF

Overlapping with an SV duplication in the proband, mother or father would be represented in the filtered output VCF as follows:

The following is an example command line for running the de novo filtering, based on the files returned by the joint calling workflows:

De Novo Small Variant Filtering Options

The following options are used for de novo variant filtering:

--dn-input-vcf--Joint small variant VCF from the de novo calling step to be filtered.
--dn-output-vcf--File location to which the filtered VCF should be written. If not specified, the input VCF is overwritten.
--dn-sv-vcf--Joint structural variant VCF from the SV calling step. If omitted, checks with overlapping structural variants are skipped.
--dn-cnv-vcf--Joint copy number variant VCF from the CNV calling step. If omitted, checks with overlapping copy number variants are skipped.
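The relabeling performed by the de novo filtering step can be pictured as an interval overlap check. The sketch below is hypothetical and simplified: it assumes the ploidy-changing SV and CNV calls have already been reduced to 1-based inclusive intervals per contig, and it checks SVs before CNVs, an ordering chosen only for this example because the behavior for a variant overlapping both is not described here.

```python
def overlaps(pos, intervals):
    """True if pos lies inside any (start, end) interval, ends inclusive."""
    return any(start <= pos <= end for start, end in intervals)


def relabel_dn(dn, chrom, pos, sv_regions, cnv_regions):
    """Return the filtered FORMAT/DN value for one proband call.

    sv_regions / cnv_regions map contig name -> list of (start, end) intervals
    covering ploidy-changing calls in any pedigree member.
    """
    if dn != "DeNovo":
        return dn  # only DeNovo calls are relabeled
    if overlaps(pos, sv_regions.get(chrom, [])):
        return "DeNovoSV"
    if overlaps(pos, cnv_regions.get(chrom, [])):
        return "DeNovoCNV"
    return dn


sv = {"1": [(16_350_000, 16_360_000)]}  # hypothetical SV duplication
cnv = {"2": [(5_000, 9_000)]}           # hypothetical CNV
print(relabel_dn("DeNovo", "1", 16_355_525, sv, cnv))
print(relabel_dn("DeNovo", "2", 6_000, sv, cnv))
print(relabel_dn("DeNovo", "3", 100, sv, cnv))
print(relabel_dn("LowDQ", "1", 16_355_525, sv, cnv))
```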

Germline Small Variant Hard Filtering

DRAGEN provides post-VCF variant filtering based on annotations present in the VCF records. Default and non-default variant hard filtering are described below. However, due to the nature of DRAGEN's algorithms, which incorporate the hypothesis of correlated errors from within the core of variant caller, the pipeline has improved capabilities in distinguishing the true variants from noise, and therefore the dependency on post-VCF filtering is substantially reduced. For this reason, the default post-VCF filtering in DRAGEN is very simple.

Default Small Variant Hard Filtering

The default filters in the germline pipeline are as follows:

##FILTER=<ID=DRAGENSnpHardQUAL,Description="Set if true:QUAL < 10.41 (3 when ML recalibration is enabled)">
##FILTER=<ID=DRAGENIndelHardQUAL,Description="Set if true:QUAL < 7.83 (3 when ML recalibration is enabled)">
##FILTER=<ID=LowDepth,Description="Set if true:DP <= 1">
##FILTER=<ID=PloidyConflict,Description="Genotype call from variant caller not consistent with chromosome ploidy">
DRAGENSnpHardQUAL and DRAGENIndelHardQUAL: For all contigs other than the mitochondrial contig, the default hard filtering consists of thresholding the QUAL value only. A different default QUAL threshold value is applied to SNP and INDEL.
LowDepth: This filter is applied to all variant calls with INFO/DP <= 1.
PloidyConflict: This filter is applied to all variant calls on chrY of a female subject, if female is specified on the DRAGEN command line, or if female is detected by the ploidy estimator.
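The default QUAL and depth filters amount to a few threshold comparisons. The sketch below restates the header definitions above in code; the function and parameter names are hypothetical, PloidyConflict is left out because it depends on sex information, and only the QUAL thresholds are skipped on the mitochondrial contig, as that is the only exception stated above.

```python
def default_hard_filters(variant_type, qual, dp, ml_recalibration=False, is_mito=False):
    """Return the default FILTER labels for one germline call.

    variant_type : "snp" or "indel"
    qual         : QUAL value of the record
    dp           : INFO/DP of the record
    """
    filters = []
    if not is_mito:
        if variant_type == "snp":
            threshold = 3.0 if ml_recalibration else 10.41
            if qual < threshold:
                filters.append("DRAGENSnpHardQUAL")
        elif variant_type == "indel":
            threshold = 3.0 if ml_recalibration else 7.83
            if qual < threshold:
                filters.append("DRAGENIndelHardQUAL")
    if dp <= 1:
        filters.append("LowDepth")
    return filters or ["PASS"]


print(default_hard_filters("snp", qual=34.46, dp=45))
print(default_hard_filters("snp", qual=10.0, dp=20))
print(default_hard_filters("snp", qual=10.0, dp=20, ml_recalibration=True))
print(default_hard_filters("indel", qual=5.0, dp=1))
```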

Non-Default Small Variant Hard Filtering

DRAGEN supports basic filtering of variant calls as described in the VCF standard. You can apply any number of filters with the --vc-hard-filter option, which takes a semicolon-delimited list of expressions, as follows:

where the list of criteria is itself a list of expressions, delimited by the || (OR) operator in this format:

The meaning of these expression elements is as follows:

filterID---The name of the filter, which is entered in the FILTER column of the VCF file for calls that are filtered by that expression.
snp/indel/all---The subset of variant calls to which the expression should be applied.
annotation ID---The variant call record annotation for which values should be checked for the filter. Supported annotations include FS, MQ, MQRankSum, QD, and ReadPosRankSum.
comparison operator---The numeric comparison operator to use for comparing to the specified filter value. Supported operators include <, ≤, =, ≠, ≥, and >. For example, the following expression would mark with the label "SNP filter" any SNPs with FS < 2.1 or with MQ < 100, and would mark with "indel filter" any records with FS < 2.2 or with MQ < 110:

This example is for illustration purposes only and is NOT recommended for use with DRAGEN V3 output. Illumina recommends using the default hard filters. The only supported operation for combining value comparisons is OR, and there is no support for arithmetic combinations of multiple annotations. More complex expressions may be supported in the future.
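The semantics of the illustrative filters described above (criteria joined by OR, applied to SNPs, indels, or all calls) can be mimicked in a few lines. This sketch models only the evaluation logic with hypothetical Python data structures. It is not the --vc-hard-filter syntax and does not parse it, and the thresholds are the illustrative ones from the text, which are not recommended values.

```python
import operator

OPS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    "!=": operator.ne, ">=": operator.ge, ">": operator.gt,
}

# (filter ID, subset, [(annotation ID, comparison operator, value), ...])
FILTERS = [
    ("SNP filter", "snp", [("FS", "<", 2.1), ("MQ", "<", 100)]),
    ("indel filter", "indel", [("FS", "<", 2.2), ("MQ", "<", 110)]),
]


def apply_filters(variant_type, annotations, filters=FILTERS):
    """Return the IDs of all filters whose OR-ed criteria match the record."""
    hits = []
    for filter_id, subset, criteria in filters:
        if subset not in ("all", variant_type):
            continue
        if any(
            name in annotations and OPS[op](annotations[name], value)
            for name, op, value in criteria
        ):
            hits.append(filter_id)
    return hits


record = {"FS": 6.69, "MQ": 108.04}
print(apply_filters("snp", record))
print(apply_filters("indel", record))
```

With these annotation values, the record passes the SNP criteria but would be marked by the indel filter because MQ is below 110.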

Orientation Bias Filter

The orientation bias filter is designed to reduce noise typically associated with the following:

Pre-adapter artifacts introduced during genomic library preparation (eg, a combination of heat, shearing, and metal contaminants can result in the 8-oxoguanine base pairing with either cytosine or adenine, ultimately leading to G→T transversion mutations during PCR amplification), or
FFPE (formalin-fixed paraffin-embedded) artifacts. FFPE artifacts stem from formaldehyde deamination of cytosines, which results in C→T transition mutations.

The orientation bias filter can only be used on somatic pipelines. To enable the filter, set the --vc-enable-orientation-bias-filter option to true. The default is false.

The artifact type to be filtered can be specified with the --vc-orientation-bias-filter-artifacts option. The default is C/T,G/T, which corresponds to FFPE (C/T) and OxoG (G/T) artifacts. Valid values include C/T, G/T, or C/T,G/T,C/A.

An artifact (or an artifact and its reverse complement) cannot be listed twice. For example, C/T,G/A is not valid, because G/A is the reverse complement of C/T (C pairs with G, and T pairs with A).

The orientation bias filter adds the following information:

##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=OBC,Number=1,Type=String,Description="Orientation Bias Filter base context">
##FORMAT=<ID=OBPa,Number=1,Type=String,Description="Orientation Bias prior for artifact">
##FORMAT=<ID=OBParc,Number=1,Type=String,Description="Orientation Bias prior for reverse compliment artifact">
##FORMAT=<ID=OBPsnp,Number=1,Type=String,Description="Orientation Bias prior for real variant">

Note that the orientation bias filter (OBF) runs as a standalone process after DRAGEN completes. The VC metrics computed by the DRAGEN SNV caller are not updated and do not reflect the additional variants that are filtered in this stage.

Autogenerated MD5SUM for VCF Files

An MD5SUM file is generated automatically for VCF output files. This file is in the same output directory and has the same name as the VCF output file, but with an .md5sum extension appended, for example, whole_genome_run_123.vcf.md5sum. The MD5SUM file is a single-line text file that contains the md5sum of the VCF output file. This md5sum exactly matches the output of the Linux md5sum command.
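
A VCF file can be checked against its MD5SUM file with a few lines of Python. The following is a minimal sketch using only the standard library; the function names are hypothetical, and reading only the first whitespace-delimited token is an assumption made so that the check works whether or not the line also carries a file name.

```python
import hashlib
from pathlib import Path

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def vcf_matches_md5sum(vcf_path):
    """Compare a VCF file against its sibling <name>.md5sum file."""
    md5_path = Path(str(vcf_path) + ".md5sum")
    # Take the first whitespace-delimited token of the single-line file
    # (an assumption about the exact layout of that line).
    expected = md5_path.read_text().split()[0].lower()
    return md5_of_file(vcf_path) == expected
```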

Force Genotyping

DRAGEN supports force genotyping (ForceGT) for small variant calling. To use ForceGT, use the --vc-forcegt-vcf option with a list of small variants to force genotype. The input list of small variants can be a *.vcf or *.vcf.gz file.

The current limitations of ForceGT are as follows:

ForceGT is supported for germline small variant calling in the V3 mode. The V1, V2, and V2+ modes are not supported.
ForceGT is also supported for somatic small variant calling.
ForceGT variants do not propagate through joint genotyping.

ForceGT Input

DRAGEN supports only a single ForceGT VCF input file, which must meet the following requirements:

The input has to be a valid VCF file according to version 4.2 of the VCF standard. For instance, it has to have at least eight tab-delimited columns and records need to be sorted by reference contig and position.
The header has to list the same contigs as the reference used for variant calling. All variants must refer to one of these contig names.
Variants have to be normalized (parsimonious and left-aligned, see below).
It must not contain any multinucleotide or complex variants (AT -> C). These are variants that require more than one substitution / insertion / deletion to go from REF allele to ALT allele and are ignored.
Any deletions longer than 50bp are filtered out.
Any variant will only be called once. Duplicate entries will be ignored.

The following nonnormalized variant will cause undefined behavior in DRAGEN:

Instead of…

not parsimonious: chrX 153592402 GC GCG

use…

parsimonious representation: chrX 153592403 C CG
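
The trimming that turns the first representation into the second can be sketched as follows. This is an illustrative helper, not part of DRAGEN: the function name is hypothetical, and it only removes bases shared by REF and ALT. Full left-alignment also requires the reference sequence and is not attempted here.

```python
def make_parsimonious(pos, ref, alt):
    """Trim bases shared by REF and ALT, keeping at least one base in each.

    Only trimming is performed; full left-alignment also needs the
    reference sequence and is out of scope for this sketch.
    """
    # Drop shared trailing bases first.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then drop shared leading bases, advancing POS for each base removed.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

Applied to the example above, make_parsimonious(153592402, "GC", "GCG") returns (153592403, "C", "CG").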

ForceGT Operation and Expected Outcome

Force genotyping requires an input VCF and can be used with DRAGEN software in VCF, GVCF or VCF+GVCF mode. In all cases the output file(s) contains all regular calls and the forceGT variants, as follows:

For a ForceGT call that was not called by the variant caller (not common), the call is tagged with FGT in the INFO field.
For a germline ForceGT call that was also called by the variant caller and whose FILTER field is PASS, the call is tagged with NML;FGT in the INFO field (NML denotes normal). In somatic mode, the call is tagged with FGT;SOM.
For a normal PASS call by the variant caller with no corresponding ForceGT call, no extra tags are added (no NML tag, no FGT tag).

This scheme distinguishes among calls that are present due to FGT only, common in both ForceGT input and normal calling, and normal calls.
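
The tagging scheme can be read back from an output record as in the following sketch. It assumes that FGT, NML, and SOM appear as valueless flags in the semicolon-delimited INFO column; the function name and the returned labels are hypothetical.

```python
def classify_call(info):
    """Classify a VCF record from the flags in its INFO column."""
    # Flags are INFO entries without a key=value form.
    flags = {item for item in info.split(";") if item and "=" not in item}
    if "FGT" not in flags:
        return "variant caller only"
    if "NML" in flags or "SOM" in flags:
        return "ForceGT and variant caller"
    return "ForceGT only"
```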

All the variants in the input ForceGT VCF are genotyped and present in the output file. The following table lists the reported GTs for the variants.

If DRAGEN calls a variant that is different from the one specified in the input ForceGT VCF, the output contains the following multiple entries at the same position:

One entry for the default DRAGEN variant call
One entry each for every variant call present in the input ForceGT-VCF at that position

chrX 100 G C [Default DRAGEN variant call]
chrX 100 G A [Variant in ForceGT vcf]

If a target BED file is provided along with the input ForceGT VCF, then the output file only contains ForceGT variants that overlap the BED file positions.

Machine Learning for Variant Calling

DRAGEN secondary analysis employs machine learning based variant recalibration (DRAGEN-ML) for germline SNV VC. Variant calling accuracy is improved using powerful yet efficient machine learning techniques that augment the variant caller, by exploiting more of the available read and context information that does not easily integrate into the Bayesian processing used by the haplotype variant caller. A supervised machine learning method was developed using truth from the PrecisionFDA v4.2.1 sets to build a model that processes read and other contextual evidence to remove false positives, recover false negatives and reduce zygosity errors, for both SNVs and INDELs.

Setup

No additional setup is required. ML model files for the hg38 and hg19 human references are packaged with the DRAGEN installer. After installation, the files are present at <INSTALL_PATH>/resources/ml_model/<ref>. DRAGEN-ML is enabled by default as needed when running the germline SNV VC. DRAGEN automatically detects the reference used for analysis and uses the correct model files. If neither the hg38 nor the hg19 reference type is detected, ML recalibration is automatically disabled and SNV VC falls back to legacy operation.

Inputs

DRAGEN-ML requires a run with BAM or FASTQ input, since the machine learning model extracts information from the read pile-up. DRAGEN-ML runs concurrently with DRAGEN SNV VC. DRAGEN-ML can be applied to WGS or WES samples. Re-calibration of existing VCF files is not supported.

Outputs

DRAGEN-ML recalibrates all quality scores, changing the values of the QUAL and GQ fields in the output VCF/GVCF.

DRAGEN-ML also updates PL and GP in the output VCF/GVCF.
The genotypes (GT field) of some variants may be changed by ML, e.g., 0/1 to 1/1 or vice versa.
DRAGEN-ML PHRED scores are limited to a maximum value of around 60-70. Therefore, the QUAL filtering threshold is set to 3 when DRAGEN-ML is enabled, compared to 10 for DRAGEN-VC when DRAGEN-ML is disabled.
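
To put the two thresholds in perspective, a Phred-scaled score Q corresponds to an error probability of 10^(-Q/10). The sketch below applies this standard definition; the function name is hypothetical, and treating QUAL as a Phred-scaled probability that the call is wrong follows the VCF convention rather than anything specific to DRAGEN-ML.

```python
def phred_to_error_probability(qual):
    """Convert a Phred-scaled score to the probability that the call is wrong."""
    return 10 ** (-qual / 10)
```

Under this definition, a QUAL of 10 corresponds to an error probability of 0.1, a QUAL of 3 to roughly 0.5, and a QUAL of 60 to one in a million.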

The following variant types are recalibrated:

Biallelic and multiallelic variants
Autosomes and sex chromosomes, including haploid positions
Force GT calls
Non primary contigs

Accuracy Improvements

DRAGEN-ML typically removes 30-50% of SNP FPs, with smaller gains on INDELs. FN counts are reduced by 10% or more. The output QUAL/GQ of DRAGEN-ML is empirically more accurately calibrated than DRAGEN SNV VC without ML. There are significant gains in accuracy statistics across the entire genome with ML enabled. Note that a small number of variant calls may have degraded accuracy with ML enabled compared to VC without ML.

Run time

DRAGEN-ML adds about 10% to the run time compared to runs without ML.

Evidence BAM

Overview

The DRAGEN small variant caller is a haplotype-based caller which performs local assembly of all reads in an active region into a de Bruijn graph (DBG). The assembly process uses all the read bases including the soft-clip bases of reads. The soft-clip bases provide evidence for the presence of variants, specifically longer insertions and deletions which are not present in the read cigar and hence cannot be directly viewed in IGV.

The assembly and realignment step (using pair-HMM) performed by the variant caller aims to correct mapping errors made by the original aligner and improves the overall variant caller accuracy. The evidence BAM shows how the variant caller sees the read evidence and how the reads have been realigned, which makes it a very useful debugging tool.

By default, the evidence BAM contains only a subset of regions processed by the small variant caller. Only regions that have candidate indel variants and some percentage of soft-clipped reads in the pileup are realigned and output in the evidence BAM. This is done to reduce the run-time overhead needed to generate the evidence BAM.

Outputs

The output of the VC Evidence BAM feature matches the output format selected with the --output-format option. The default format is bam.

A bam/cram/sam file with the suffix _evidence.bam/cram/sam and the corresponding index file. The evidence BAM can be enabled along with the regular BAM output from the Map-Align step. When multiple BAMs are passed as inputs to the variant caller, for example in Tumor-Normal calling, they are combined in the evidence BAM output and tagged with appropriate read groups.
A bed file with regions that were realigned and output in VC Evidence BAM with suffix ".realigned-regions.bed".

Features

The evidence BAM consists of realigned reads, badly mated reads and reads that are disqualified by the variant caller based on the read likelihood scores.

Disqualified and Badly Mated reads
Reads that are badly mated (when the read and its mate are mapped to different chromosomes) are tagged with a BM tag (integer), and reads that are disqualified (based on read likelihoods) are tagged with the DQ tag (integer). These reads are filtered out by the genotyper in the variant caller. The alignment score tag AS is forced to 0 for such reads in the evidence BAM; hence, they can be filtered from the IGV pileup by setting the minimum AS score to 1 instead of 0.
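
The same AS-based filtering can be applied outside IGV to the text (SAM) form of the evidence output. The following is a minimal sketch that works on SAM lines with the standard library only; the function names are hypothetical, and keeping reads that have no AS tag is an assumption of the sketch.

```python
def alignment_score(sam_line):
    """Return the integer AS tag of a SAM record, or None if it is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:          # optional fields start at column 12
        if tag.startswith("AS:i:"):
            return int(tag[5:])
    return None

def drop_filtered_reads(sam_lines, min_as=1):
    """Keep header lines and reads whose AS tag is at least min_as."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):     # SAM header line
            kept.append(line)
            continue
        score = alignment_score(line)
        # Reads without an AS tag are kept (an assumption of this sketch).
        if score is None or score >= min_as:
            kept.append(line)
    return kept
```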
Graph Haplotypes
When enabling graph haplotypes output using --vc-evidence-bam-output-haplotypes, all the haplotypes constructed by the de Bruijn graph are output in the evidence BAM as single reads covering the entire active region. The reads and haplotypes are tagged with different read groups, which makes them easily distinguishable in IGV. In IGV, use “Color Alignments By” or “Group Alignments By” > read group to separate the reads from the haplotypes. The haplotypes are tagged with read group EvidenceHaplotype and the reads are part of the EvidenceRead_Normal/Tumor read group.
The haplotypes are named Haplotype 1, Haplotype 2, and so on, and have an additional ‘HC’ tag (integer). The realigned reads also have an HC tag, which encodes the haplotype that best matches the read based on the likelihood calculation. Only reads that are supported by a single unique haplotype have the HC tag; reads that match more than one haplotype well do not have an HC tag. This tag is primarily intended to enable highlighting of reads in IGV. Go to "Color Alignments By > Tag" and enter "HC" to view which reads are uniquely supported by a given graph haplotype.

Command Line Arguments

CNV Output

DRAGEN emits the calls in the standard VCF format. By default for analyses other than somatic WGS, the VCF file includes only copy number gain and loss events (and LOH events, where allele-specific copy number is available). To include copy neutral (REF) calls in the output VCF, set --cnv-enable-ref-calls to true.

For more information on how to use the output files to aid in debug and analysis, see Signal Flow Analysis.

CNV VCF File

File extension: *.cnv.vcf.gz

The CNV VCF file follows the standard VCF format. Due to the nature of how CNV events are represented versus how structural variants are represented, not all fields are applicable. In general, if more information is available about an event, then the information is annotated. Some fields in the DRAGEN CNV VCF are unique to CNVs. The VCF header is annotated with ##source=DRAGEN_CNV to indicate that the file is generated by the DRAGEN CNV pipeline.

Header

The following is an example of some of the header lines that are specific to CNV:

##fileformat=VCFv4.2
##CoverageUniformity=0.402517
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
...
##reference=file:///reference_genomes/Hsapiens/hs37d5/DRAGEN
##ALT=<ID=CNV,Description="Copy number variant region">
##ALT=<ID=DEL,Description="Deletion relative to the reference">
##ALT=<ID=DUP,Description="Region of elevated copy number relative to the reference">
##INFO=<ID=REFLEN,Number=1,Type=Integer,Description="Number of REF positions included in this record">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END">
##FILTER=<ID=cnvQual,Description="CNV with quality below <WORKFLOW-SPECIFIC DEFAULT VALUES>">
##FILTER=<ID=cnvCopyRatio,Description="CNV with copy ratio within +/- 0.2 of 1.0">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=SM,Number=1,Type=Float,Description="Linear copy ratio of the segment mean">
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Estimated copy number">
##FORMAT=<ID=BC,Number=1,Type=Integer,Description="Number of bins in the region">
##FORMAT=<ID=PE,Number=2,Type=Integer,Description="Number of improperly paired end reads at start and stop breakpoints">

The following header lines are specific to somatic WGS CNV calling:

ModelSource The primary basis on which the final tumor model was chosen. The following values can be included:
- DEPTH+BAF: Depth+BAF signal is used to determine tumor model.
- DEPTH+BAF_DOUBLED: The initial depth+BAF model is duplicated based on VAF signal or excess segments at half the expected depth change.
- DEPTH+BAF_DEDUPLICATED: The depth+BAF model is deduplicated based on VAF signal or insufficient segments supporting a duplication.
- DEPTH+BAF_WEAK: Depth+BAF signal is used to determine lower-confidence tumor model.
- VAF: VAF signal is used to determine tumor model due to insufficient depth+BAF signal.
- DEGENERATE_DIPLOID: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. The diploid coverage is set to lowest value observed in a substantial number of bases in segments with BAF=50%.
- SAMPLE_MEDIAN: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. Diploid coverage set to sample median.
EstimatedTumorPurity Estimated fraction of cells in the sample due to tumor. The range of this field is [0, 1] or NA if a confident model could not be determined.
DiploidCoverage Expected read count for a target bin in a diploid region. The numeric value is unlimited.
OverallPloidy Length weighted average of tumor copy number for PASS events. The numeric value is unlimited.
AlternativeModelDedup An alternative to the best model corresponding to one less whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation if the best model might involve a spurious genome duplication.
AlternativeModelDup An alternative to the best model corresponding to one more whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation where the best model might have missed a true genome duplication.
OutlierBafFraction A QC metric that measures the fraction of b-allele frequencies that are incompatible with the segment the BAFs belong to. High values might indicate a mismatched normal, substantial cross-sample contamination, or a different source of a mosaic genome, such as bone marrow transplantation. The range of this field is [0, 1].
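
The OverallPloidy definition (a length-weighted average of tumor copy number over PASS events) can be illustrated with a short sketch. The tuple layout and the function name are hypothetical, and the sketch simply averages whichever PASS records it is given; which records DRAGEN includes in the calculation is not specified here.

```python
def overall_ploidy(segments):
    """Length-weighted mean copy number over PASS segments.

    Each segment is a (length, copy_number, filter) tuple.
    """
    passing = [(length, cn) for length, cn, flt in segments if flt == "PASS"]
    total_length = sum(length for length, _ in passing)
    if total_length == 0:
        return None                  # no PASS segments to average
    return sum(length * cn for length, cn in passing) / total_length
```

For example, two PASS segments of equal length with copy numbers 2 and 4 give an overall ploidy of 3.0, regardless of any non-PASS segments.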

Records

All coordinates in the VCF are 1-based.

CHROM

The CHROM column specifies the chromosome (or contig) on which the copy number variant being described occurs.

POS

The POS column is the start position of the variant. According to the VCF specification, if any of the ALT alleles is a symbolic allele, such as <DEL>, then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism.

ID

The ID column is used to represent the event. The ID field encodes the event type and coordinates of the event (1-based, inclusive). In addition to representing GAIN, LOSS and REF events, in Somatic WGS CNV, the ID could include the Copy Neutral Loss of Heterozygosity (CNLOH) or Copy Number Gain with LOH (GAINLOH) events.

REF

The REF column contains an N for all CNV events.

ALT

The ALT column specifies the type of CNV event. Because the VCF contains only CNV events, only the <DEL> or <DUP> entries are used. If REF calls are emitted, their ALT is always a period (.). In Somatic WGS CNV, the ALT field can contain two alleles, such as <DEL>,<DUP>, which allows representation of allele-specific copy numbers if they differ in copy number states.

QUAL

The QUAL column contains an estimated quality score for the CNV event, which is used in hard filtering. Each CNV workflow has different defaults and the value used can be found in the VCF header.

FILTER

The FILTER column contains PASS if the CNV event passes all filters, otherwise the column contains the name of the failed filter. Default values are defined in the header line for each available FILTER.

Available FILTERs:

cnvLength which indicates that the length of the CNV is lower than a threshold.
cnvQual which indicates that the QUAL of the CNV is lower than a threshold.

Germline CNV has the following additional FILTERs:

cnvCopyRatio which indicates that the segment mean of the CNV is not far enough from copy neutral.

Both Germline CNV workflows have the following additional FILTERs:

dinucQual which indicates a CNV call where some of its dinucleotide percentages are outside typical ranges, and thus the call is likely to be a false positive.

Germline WGS CNV has the following additional FILTERs:

cnvBinSupportRatio which indicates, for CNVs greater than 80kb, the percent span of supporting target intervals is lower than a threshold.
highCN which indicates a CNV call with implausible copy number (>6).

Germline WES CNV has the following additional FILTERs:

cnvLikelihoodRatio indicates a log10 likelihood ratio of ALT to REF is less than a threshold.

Both Somatic CNV workflows have the following additional FILTERs:

binCount - Filters CNV events with a bin count lower than a threshold.

Somatic WGS CNV has the following additional FILTERs:

lengthDegenerate - Marks records as non-PASSing based on each record's length (REFLEN) when the caller returns the default model. Segments having less than 1 Mb are assigned this filter when returning the default model.
segmentMean - Marks records as non-PASSing based on each record's segment mean (SM) when the caller returns the default model. Segments having insufficient SM in DELs or DUPs are assigned this filter when returning the default model.

Somatic WES CNV has the following additional FILTERs:

SqQual - Marks records as non-PASSing based on each record's somatic quality (SQ) when the caller returns the default model. Segments having insufficient SQ are assigned this FILTER when returning the default model. SQ is the somatic quality value, which is a Phred-scaled score of the p-value from a 2-sample t-test comparing normalized counts of CASE vs PON.
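
As an illustration of how threshold-based FILTERs of this kind behave, the sketch below evaluates two of the germline WGS filters whose thresholds are stated in this section: cnvCopyRatio (copy ratio within +/- 0.2 of 1.0, per the example header) and highCN (copy number above 6). The function name, the inclusive handling of the copy ratio boundary, and the returned list form are assumptions; the remaining filters use workflow-specific thresholds and are not modeled.

```python
def germline_wgs_filters(segment_mean, copy_number,
                         copy_ratio_margin=0.2, max_copy_number=6):
    """Return the names of the failed filters, or ["PASS"] if none fail."""
    failed = []
    # cnvCopyRatio: segment mean is not far enough from copy neutral (1.0).
    if abs(segment_mean - 1.0) <= copy_ratio_margin:
        failed.append("cnvCopyRatio")
    # highCN: implausibly high copy number.
    if copy_number > max_copy_number:
        failed.append("highCN")
    return failed or ["PASS"]
```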

INFO

The INFO column contains information representing the event.

REFLEN indicates the length of the event.
SVLEN is a signed representation of REFLEN (e.g., a negative value indicates a deletion), and it is only present for non-REF records.
SVTYPE is always CNV and only present for non-REF records.
END indicates the end position of the event (1-based, inclusive).

If using a segment BED file, then the segment identifier is carried over from the input to SEGID field.

In Somatic WGS CNV, the INFO column can also contain the HET tag, when the call is considered sub-clonal. See HET-Calling Mode.

When matching CNV with SV output, additional INFO annotations are added. See CNV With SV Support.

FORMAT

The common FORMAT fields are described in the header:

Germline WGS CNV includes the following FORMAT fields:

Germline WES CNV includes the following FORMAT fields:

Somatic WGS CNV and Somatic WES CNV with ASCN (allele-specific copy number) support include the following FORMAT fields:

Somatic WES CNV without ASCN support provides only the common FORMAT fields and does not include the CN entry, because it does not estimate the tumor purity fraction and cannot make an estimate of the copy number.

Note on genotype annotation in germline copy number calling

Because germline copy number calling determines overall copy number rather than the copy number on each haplotype, the genotype type field contains missing values for diploid regions when CN is greater than or equal to 2. The following are examples of the GT field for various VCF entries:

Coverage Uniformity

The DRAGEN CNV pipeline provides a measure of the quality of the data for a sample. If using the WGS self-normalization method, the additional CoverageUniformity metric is present in the VCF header. The metric is only available for germline samples. The CNV pipeline assumes that post-normalization target counts are independently and identically distributed (IID). Coverage in most high-quality WGS samples is uniform enough for the CNV caller to produce accurate calls, but some samples violate the IID assumption. Issues during library preparation or sample contamination can lead to several extreme outliers and/or waviness of target counts, which can result in a large number of false positive CNV calls. The CoverageUniformity metric quantifies the degree of local coverage correlation in the sample to help identify poor-quality samples.

A larger value for this metric means the coverage in a sample is less uniform, which indicates that the sample has more nonrandom noise and could be considered poor quality. The CoverageUniformity metric depends on factors other than sample quality, such as the cnv-interval-width setting and sample mean coverage. DRAGEN recommends using this score only to compare the quality of samples with similar mean coverage that were run with the same command line options. Because of this, DRAGEN CNV only provides the metric and does not take any action based on it.

CNV Metrics File

DRAGEN CNV outputs metrics in CSV format. The output follows the general convention for QC metrics reporting in DRAGEN. The CNV metrics are output to a file with a *.cnv_metrics.csv file extension. The following list summarizes the metrics that are output from a CNV run.

Sex Genotyper:

The estimated sex of the case sample and of all panel of normals samples is reported. For WGS workflows, the estimated sex karyotype is reported; for non-WGS workflows, the gender is reported.
Confidence score (ranging from 0.0 to 1.0). If the sample sex is specified, this metric is 0.0.

CNV Summary:

Bases in reference genome in use
Average alignment coverage over genome
Number of alignment records processed
- Number of filtered records (total)
- Number of filtered records (due to duplicates)
- Number of filtered records (due to MAPQ)
- Number of filtered records (due to being unmapped)
Coverage MAD
Median Bin Count
Number of target intervals
Number of normal samples
Number of segments
Number of amplifications
Number of deletions
Number of PASS amplifications
Number of PASS deletions

Coverage MAD and Median Bin Count are only printed for WES germline/somatic CNV. Coverage MAD is the median absolute deviation of normalized case counts. Higher values indicate noisier sample data (poor quality). Median Bin Count is the median of raw counts normalized by interval size.

Intermediate and Visualization Files

Intermediate stages of the pipeline produce various intermediate output files. These files can be useful for visualization of the evidence or results from each stage, and may aid in fine-tuning options.

All files have a structure similar to a BED file with optional header line(s).

Target Counts

The file *.target.counts.gz is a compressed tab-delimited text file that contains the number of read counts per target interval. This is the raw signal as extracted from the alignments of the BAM or CRAM file. The format is identical for both the case sample and any panel of normals samples. There is also a bigWig representation of a target.counts.diploid file, which is normalized to the normal ploidy level of 2 instead of raw counts.

It has the following columns:

Contig identifier
Start position
End position
Target interval name
Count of alignments in this interval
Count of improperly paired alignments in this interval

Header lines are also included that start with #. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.

An example of a *.target.counts.gz file is shown below.

#TARGET COUNTS FILE
##DRAGENVersion=<VERSION_INFO>
##DRAGENCommandLine=<CommandLineOptions>
...
contig  start  stop   name                <SampleName> improper_pairs
1       565480 565959 target-wgs-1-565480 7          6
1       566837 567182 target-wgs-1-566837 9          0
1       713984 714455 target-wgs-1-713984 34         4
1       721116 721593 target-wgs-1-721116 47         1
1       724219 724547 target-wgs-1-724219 24         21
1       725166 725544 target-wgs-1-725166 43         12
1       726381 726817 target-wgs-1-726381 47         14
1       753243 753655 target-wgs-1-753243 31         2
1       754322 754594 target-wgs-1-754322 27         0
1       754594 755052 target-wgs-1-754594 41         0
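
A file with this layout can be read with the Python standard library, as in the following sketch. It assumes tab-delimited columns in the order listed above, treats lines starting with # as metadata, and treats the first remaining line as the column header. The function name and the dictionary keys are hypothetical. Counts are parsed as floats so that the same reader also accepts non-integer values.

```python
import gzip

def read_target_counts(path):
    """Yield one dict per target interval from a *.target.counts.gz file."""
    with gzip.open(path, "rt") as handle:
        header_seen = False
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue              # metadata lines
            if not header_seen:
                header_seen = True    # first non-# line names the columns
                continue
            contig, start, stop, name, count, improper = line.split("\t")[:6]
            yield {
                "contig": contig,
                "start": int(start),
                "stop": int(stop),
                "name": name,
                "count": float(count),
                "improper_pairs": int(improper),
            }
```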

B-Allele counts

B-allele counts are calculated at sites in the tumor sample where the normal sample is likely to be heterozygous. When analyzed in conjunction with a matched normal sample, the sites are those that are called as heterozygous SNVs in the normal sample. When analyzed in tumor-only mode, they are taken from a collection of sites that have high-frequency SNVs in the population. Each B-allele site consists of a reference allele and a variant allele, and the number of reads in the tumor sample supporting each of these alleles is counted.

B-allele counts are written both to gzipped tsv file *.ballele.counts.gz and gzipped bedgraph file *.baf.bedgraph.gz.

B-allele tsv

The tsv file format is the following:

Contig identifier
Start, BED-style (zero-based, inclusive) start position of the reference allele
Stop, BED-style (zero-based, exclusive) stop position of the reference allele
Base sequence for the reference allele
Base sequence for the first allele being counted
Base sequence for the second allele being counted
The number of qualified reads containing a sequence matching the first allele
The number of qualified reads containing a sequence matching the second allele

Additionally, in the case of B-allele sites from a population VCF, the following two additional columns are added after the columns listed above:

Population frequency for the first allele
Population frequency for the second allele

An example of B-allele counts file is provided below:

contig  start   stop    refAllele       allele1 allele2 allele1Count    allele2Count
chr1    11021   11022   G       G       A       4       2
chr1    14463   14464   A       A       T       111     36
chr1    16494   16495   G       G       C       122     262
chr1    38741   38742   C       C       T       9       9
chr1    39014   39015   A       A       C       38      48
chr1    39260   39261   T       T       C       199     143
chr1    48447   48448   C       C       T       8       15
chr1    48517   48518   A       A       G       13      15
chr1    91485   91486   G       G       C       1       4
chr1    91489   91490   A       A       G       1       3
chr1    98944   98945   C       C       T       46      114

B-allele bedgraph

The bedgraph file format is similar to the BED format and it has the following columns:

Contig identifier
Start
Stop
Ratio of allele counts

The numerator and denominator of the ratio are determined by sorting the allele counts according to the priority of the corresponding bases. The order of the bases in descending priority is {A, T, G, C}.

When the priority of allele1 is higher than the priority of allele2, the output frequency is calculated by:

allele1Count / (allele1Count + allele2Count)

When the priority of allele2 is higher than the priority of allele1, the output frequency is calculated by:

allele2Count / (allele1Count + allele2Count)

By prioritizing the bases in this way, the output frequencies will be deterministically distributed in a roughly equal proportion above and below 0.5. When plotting these B-allele frequencies (e.g., in IGV), this gives an easy way to visually determine significant changes in b-allele frequency between neighboring segments of the genome. It also provides a similar visualization to that typically used for array data.

An example of the bedgraph file is shown below:

chr1    11021   11022   0.333333
chr1    14463   14464   0.755102
chr1    16494   16495   0.317708
chr1    38741   38742   0.5
chr1    39014   39015   0.44186
chr1    39260   39261   0.581871
chr1    48447   48448   0.652174
chr1    48517   48518   0.464286
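
The priority rule can be checked against the two example files with a short sketch. The function name is hypothetical, the sketch handles single-base alleles only, and returning None when no reads are counted is an assumption.

```python
# Bases in descending priority, as given in the text.
PRIORITY = {"A": 0, "T": 1, "G": 2, "C": 3}

def bedgraph_ratio(allele1, allele2, allele1_count, allele2_count):
    """Return the allele-count ratio written to the bedgraph file."""
    total = allele1_count + allele2_count
    if total == 0:
        return None                  # undefined without reads (assumption)
    # A lower rank means higher priority; that allele is the numerator.
    if PRIORITY[allele1] < PRIORITY[allele2]:
        return allele1_count / total
    return allele2_count / total
```

For the first record of the tsv example (alleles G and A with counts 4 and 2), A has the higher priority, so the ratio is 2 / 6, which matches the 0.333333 shown in the bedgraph example.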

Bias correction

The file *.target.counts.gc-corrected.gz contains the number of GC-corrected read counts per target interval. The format is equivalent to the *.target.counts.gz file:

Contig identifier
Start position
End position
Target interval name
GC-corrected read counts in this interval
Count of improperly paired alignments in this interval

Header lines are also included that start with #. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.

An example of a *.target.counts.gc-corrected.gz file is shown below.

#GC CORRECTED FILE
##DRAGENVersion=<VERSION_INFO>
##DRAGENCommandLine=<CommandLineOptions>
...
contig  start   stop    name    <SampleName> improper_pairs
chr1    818022  819840  target-wgs-chr1-818022:819840   1071.353133     6
chr1    819840  821337  target-wgs-chr1-819840:821337   1051.014997     19
chr1    821337  822485  target-wgs-chr1-821337:822485   1098.6502       10
chr1    822485  824431  target-wgs-chr1-822485:824431   1117.28308      7
chr1    830446  832304  target-wgs-chr1-830446:832304   1102.211816     1
chr1    832304  834311  target-wgs-chr1-832304:834311   1004.822683     5
chr1    836677  838659  target-wgs-chr1-836677:838659   1015.973037     7
chr1    841054  843056  target-wgs-chr1-841054:843056   1014.921403     3

Combined counts

The file *.combined.counts.txt.gz is a column-wise concatenation of individual *.target.counts.gz and *.target.counts.gc-corrected.gz used to form the panel of normals.

Normalization

The file *.tn.tsv.gz contains the normalized signal of the case sample, per target interval, i.e., the log2-normalized copy ratio signal. A strong signal deviation from 0.0 indicates a potential for a CNV event. The format is equivalent to the *.target.counts.gz file:

Contig identifier
Start position
End position
Target interval name
Log2-normalized read counts in this interval
Count of improperly paired alignments in this interval

Header lines are also included that start with #.

An example of a *.tn.tsv.gz file is shown below.

#title = Normalized coverage profile
#sex = UNDETERMINED
contig  start   stop    name    <SampleName> improper_pairs
chr1    818022  819840  target-wgs-chr1-818022:819840   -0.18479358083014644    6
chr1    819840  821337  target-wgs-chr1-819840:821337   -0.21244441644669046    19
chr1    821337  822485  target-wgs-chr1-821337:822485   -0.14849555308041734    10
chr1    822485  824431  target-wgs-chr1-822485:824431   -0.12423291178926463    7
chr1    830446  832304  target-wgs-chr1-830446:832304   -0.1438261733656668     1
chr1    832304  834311  target-wgs-chr1-832304:834311   -0.27728673450293895    5
chr1    836677  838659  target-wgs-chr1-836677:838659   -0.26136555699676262    7
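To relate these values to a linear copy ratio, the log2 value can be exponentiated. The sketch below assumes the fifth column is a plain base-2 logarithm of the copy ratio; the threshold `min_abs_log2` is an arbitrary illustration value, not a DRAGEN default.

```python
def log2_to_linear(log2_ratio):
    """Convert a log2 copy ratio to a linear copy ratio."""
    return 2.0 ** log2_ratio


def flag_deviating(intervals, min_abs_log2=0.5):
    """Return names of intervals whose log2 signal is far from 0.0.

    `intervals` is an iterable of (name, log2_ratio) pairs.
    """
    return [name for name, value in intervals if abs(value) >= min_abs_log2]


intervals = [
    ("target-wgs-chr1-818022:819840", -0.18479358083014644),
    ("hypothetical-loss", -1.0),  # made-up value: half the expected coverage
    ("hypothetical-gain", 0.585),  # made-up value: roughly 1.5x coverage
]
flagged = flag_deviating(intervals)
```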

Segmentation

File extension: *.seg, *.seg.called, *.seg.called.merged

Files containing the segments produced by the segmentation algorithm. The Segment_Mean value of a segment is the ratio of the mean of that segment to the whole-sample median, without log transformation (linear copy-ratio). A strong signal deviation from 1.0 indicates a potential for a CNV event.

The *.seg file has the following columns:

Sample name
Contig identifier
Start position
End position
Number of intervals in the segment
Linear copy-ratio of the segment

An example of a *.seg file is shown below.

Sample  Chromosome      Start   End     Num_Probes      Segment_Mean
<SampleName> chr1    818022  1117426 224     0.82500341336435279
<SampleName> chr1    1117426 4063702 2438    0.91726081432236528
<SampleName> chr1    4063702 4067591 3       0.38861386123247205
<SampleName> chr1    4067591 7705829 3302    0.93021316913709917
<SampleName> chr1    7705829 9357003 1405    0.98147825043799442
<SampleName> chr1    9357003 9377365 19      0.50269670724395654
<SampleName> chr1    9377365 12859821        2905    1.0684818476332989

The *.seg.called file is identical to the *.seg file, with an additional column indicating the initial call for whether the segment is a duplication (+) or a deletion (-).

The *.seg.called.merged file is identical to the *.seg.called file, except that segments are merged when they meet internal merging criteria. In addition to the columns described above, this file also has the following columns:

QUAL
FILTER
Copy number assignment
Ploidy
Improper_Pairs count

B-allele segmentation (Somatic)

In addition to segmentation of target counts, some workflows perform segmentation of B-allele loci. The output file has the suffix *.baf.seg and the same format as the *.seg file, with two modifications. First, the Segment_Mean value is the mean over B-allele loci of the smaller observed allele fraction. Second, there is an additional column:

BAF_SLM_STATE: Integer between 0 and 10, indicating bins of minor-allele fraction (low to high), or . when the BAF data are too variable to estimate a minor-allele fraction

An example of segmentation output file is shown below:

Sample  Chromosome      Start   End     Num_Probes      Segment_Mean    BAF_SLM_STATE
<SampleName> chr1    820348  1104646 194     0.29301737166888697     6
<SampleName> chr1    1105091 1533754 444     0.26185904799069076     5
<SampleName> chr1    1533810 1534166 9       0.41958837071702065     8
<SampleName> chr1    1534217 9356793 6689    0.26034515815016335     5
<SampleName> chr1    9358304 9376529 27      0.46450553586280602     10
<SampleName> chr1    9378480 12859495        1651    0.24172965924359388     5

Model identification (Somatic)

The file *.cnv.purity.coverage.models.tsv describes the different tested models and their log-likelihood. It has columns:

Model purity (Cellularity)
Model diploid coverage
Model log-likelihood

An example is shown below:

#Purity Coverage        logL
1       384     -23441740.5209
0.99    566     -22926572.4287
0.99    726     -23281869.1423
0.99    1206    -24075475.1481
0.99    1836    -24334376.579
0.99    2256    -24380290.0335
0.99    2696    -24380616.8655
0.98    449     -23988016.7101
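A table like this can be ranked by log-likelihood to see which tested model fits best. The following sketch only sorts by the logL column; it does not claim to reproduce how DRAGEN makes its final model choice, and the function name `read_models` is illustrative.

```python
import io


def read_models(handle):
    """Return a list of (purity, diploid_coverage, log_likelihood) tuples."""
    models = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip the "#Purity Coverage logL" header
        purity, coverage, logl = line.split()[:3]
        models.append((float(purity), float(coverage), float(logl)))
    return models


example = io.StringIO(
    "#Purity\tCoverage\tlogL\n"
    "1\t384\t-23441740.5209\n"
    "0.99\t566\t-22926572.4287\n"
    "0.99\t726\t-23281869.1423\n"
    "0.98\t449\t-23988016.7101\n"
)
models = read_models(example)
# Highest log-likelihood first.
ranked = sorted(models, key=lambda m: m[2], reverse=True)
best = ranked[0]
```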

Visualization

To generate additional equivalent bigWig and gff files, set the --cnv-enable-tracks option to true. These files can be loaded into IGV along with other tracks that are available, such as RefSeq genes. Using these tracks alongside publicly available tracks allows for easier interpretation of calls. DRAGEN automatically generates an IGV session XML file if tracks are generated by DRAGEN CNV. The *.cnv.igv_session.xml file can be loaded directly into IGV for analysis.

The following IGV tracks are automatically populated in the output IGV session file:

*.target.counts.bw --- BigWig representation of the target counts bins. Setting the track view in IGV to barchart or points is recommended. Values are GC-corrected if GC correction was performed.
*.improper_pairs.bw --- BigWig representation of the improper pairs counts. Setting the track view in IGV to barchart is recommended.
*.tn.bw --- BigWig representation of the tangent normalized signal. Setting the track view in IGV to points is recommended.
*.seg.bw --- BigWig representation of the segments. Setting the track view in IGV to points is recommended.
*.baf.bedgraph.gz --- BED graph representation of B-allele frequency (if available). Setting the track view in IGV to points is recommended.
*.cnv.gff3 --- GFF3 representation of the CNV events. DEL events show as blue and DUP events show as red. Filtered events are a light gray. If REF events are enabled, then they will show up as green. An example of DRAGEN CNV gff3 is shown below (different CNV workflows might output different attributes on the 9th column):

##gff-version 3
chr1    DRAGEN  CNV     818023  9357003 1000    .       .       Start=818022;Stop=9357003;Length=8538981;Alt=<DEL>,<DUP>;Qual=1000;Filter=PASS;Genotype=1/2;CopyNumber=2;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=2.290614;MinorCopyNumberFloat=0.004158;BiasCorrectedReadCount=1138.9;MinorAlleleFrequency=0.259;BinCount=7372;ImproperPairsCount=6,157;NumAllelicSites=7336;color=#FF00FF;
chr1    DRAGEN  CNV     9357004 9377365 534     .       .       Start=9357003;Stop=9377365;Length=20362;Alt=<DEL>;Qual=534;Filter=cnvLength;Genotype=1/1;CopyNumber=0;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=0.147802;MinorCopyNumberFloat=0.073901;BiasCorrectedReadCount=623.5;MinorAlleleFrequency=0.5;BinCount=19;ImproperPairsCount=157,186;NumAllelicSites=27;color=#DDDDDD;
chr1    DRAGEN  CNV     9377366 36656983        1000    .       .       Start=9377365;Stop=36656983;Length=27279618;Alt=<DEL>,<DUP>;Qual=1000;Filter=PASS;Genotype=1/2;CopyNumber=3;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=2.910093;MinorCopyNumberFloat=0.068221;BiasCorrectedReadCount=1287.9;MinorAlleleFrequency=0.241;BinCount=22591;ImproperPairsCount=186,21;NumAllelicSites=18021;color=#FF0000;

For somatic WGS analyses, the following additional files are included in the IGV session xml:

*.baf.seq.bw --- BigWig representation of the BAF segments. Setting the track view in IGV to points is recommended.
*.tumor.baf.bedgraph.gz --- Bedgraph representation of the B-allele frequencies. Setting the track view in IGV to points and windowing function to none is recommended.

IGV Session

File extension: *.igv_session.xml

The IGV session XML file is prepopulated with track files generated by DRAGEN. The session file loads the reference genome that best matches the standard reference genomes in an IGV installation, by comparing the name of the --ref-dir specified on the command-line. Standard UCSC human reference genomes are autodetected, but any variations from the standard reference genomes might not be autodetected. To edit the genome detection, alter the genome attribute in the Session element to the reference genome you would like for analysis before loading into IGV. The reference identifier used by IGV might differ from the actual name of the genome. The following is an example edited session file.

<?xml version="1.0" encoding="utf-8"?>
<Session genome="b37" hasGeneTrack="false" hasSequenceTrack="true" version="8">
    <Resources>
        <Resource path="example.cnv.gff3"/>
        <Resource path="example.cnv.excluded_intervals.bed.gz"/>
        <Resource path="example.target.counts.bw"/>
        <Resource path="example.improper.pairs.bw"/>
        <Resource path="example.tn.bw"/>
        <Resource path="example.seg.bw"/>
    </Resources>
    <Panel height="500" width="1200" name="DataPanel">
        ...
    </Panel>
</Session>

Note that depending on the IGV version installed, it may come prepackaged with different flavors of GRCh37. The reference naming conventions have changed so a user may have to edit the genome field in the XML file directly. For example, IGV has traditionally packaged a b37 reference genome, but may also include a 1kg_v37 or a 1kg_b37+decoy, which will appear on the IGV user interface as "1kg, b37" or "1kg, b37+decoy" respectively.

You can determine the correct encoding of a reference genome by going to File > Save Session... and then inspecting the generated igv_session.xml file.
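The genome attribute can also be changed programmatically rather than by hand. The following is a minimal sketch using the Python standard library; the helper name `set_session_genome` is hypothetical, and the genome identifier passed in must be one that the local IGV installation recognizes.

```python
import xml.etree.ElementTree as ET


def set_session_genome(xml_bytes, genome_id):
    """Return the session XML with the Session element's genome attribute replaced."""
    root = ET.fromstring(xml_bytes)
    if root.tag != "Session":
        raise ValueError("not an IGV session document")
    root.set("genome", genome_id)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


example_session = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<Session genome="b37" hasGeneTrack="false" hasSequenceTrack="true" version="8">\n'
    b'    <Resources>\n'
    b'        <Resource path="example.cnv.gff3"/>\n'
    b'        <Resource path="example.tn.bw"/>\n'
    b'    </Resources>\n'
    b'</Session>\n'
)
edited = set_session_genome(example_session, "1kg_v37")
```

The returned bytes can be written to a new *.igv_session.xml file so that the original session file is left untouched.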

Creating CNV coverage and BAF plots with third-party tools

DRAGEN CNV outputs can be ingested using third-party libraries in most commonly used languages, such as Python and R. The files typically used are:

*.target.counts.gz or *.target.counts.gc-corrected.gz, containing the number of alignments, or corrected alignments, per interval. Used to display the coverage profile across all intervals.
*.tn.tsv.gz, containing the log2-normalized copy ratio per interval.
*.baf.bedgraph.gz, if BAF is available, containing the BAF for each considered site. Used to display the BAF profile across all sites.

In all previously specified files, the format is similar to BED, allowing them to be loaded as any other tab-separated files.

Using R, a good starting point is the karyoploteR package. The main workflow involves reading the *.target.counts.gz file as an R dataframe, converting it to a GRanges object, and then plotting the target intervals as points with the karyoploteR package. The same workflow can be used to plot the GC-corrected counts, the log2-normalized copy ratios, and the BAF profiles.

Using Python, the workflow is similar: use a library such as pandas to load the DRAGEN output files into dataframes, and matplotlib to plot the coverage and BAF profiles across the genome.

A similar workflow can be used to plot copy number calls (and minor copy number calls, if available) by using the *.cnv.gff3 output file. Some examples of DRAGEN output GFF3 are shown below:

Germline WGS

chr1	DRAGEN	CNV	818023	1426288	52	.	.	Alt=REF;LinearCopyRatio=1.03497;CopyNumber=2;Genotype=./.;Qual=52;Filter=PASS;Start=818022;Stop=1426288;Length=608266;BinCount=491;ImproperPairsCount=1,7;color=#00FF00;
chr1	DRAGEN	CNV	1426289	1428354	22	.	.	Alt=DEL;LinearCopyRatio=0.411841;CopyNumber=1;Genotype=0/1;Qual=22;Filter=cnvLength;dinucQual;Start=1426288;Stop=1428354;Length=2066;BinCount=2;ImproperPairsCount=7,16;color=#DDDDDD;

Somatic WGS

chr1    DRAGEN  CNV     818023  9357003 1000    .       .       Start=818022;Stop=9357003;Length=8538981;Alt=<DEL>,<DUP>;Qual=1000;Filter=PASS;Genotype=1/2;CopyNumber=2;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=2.290614;MinorCopyNumberFloat=0.004158;BiasCorrectedReadCount=1138.9;MinorAlleleFrequency=0.259;BinCount=7372;ImproperPairsCount=6,157;NumAllelicSites=7336;color=#FF00FF;
chr1    DRAGEN  CNV     9357004 9377365 534     .       .       Start=9357003;Stop=9377365;Length=20362;Alt=<DEL>;Qual=534;Filter=cnvLength;Genotype=1/1;CopyNumber=0;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=0.147802;MinorCopyNumberFloat=0.073901;BiasCorrectedReadCount=623.5;MinorAlleleFrequency=0.5;BinCount=19;ImproperPairsCount=157,186;NumAllelicSites=27;color=#DDDDDD;

From the output GFF3, the typical steps are to parse the coordinates of each segment and the CopyNumber annotation (or any other annotation of interest), and then to plot them using the libraries listed previously for coverage/BAF profiles (or any other library and language of the user's choice).
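The parsing step can be sketched as follows. The function name `parse_gff3_line` is illustrative. Note one assumption: in the germline example above, the Filter attribute appears as `Filter=cnvLength;dinucQual`, so this sketch treats a token without `=` as a continuation of the previous attribute value. That handling is a guess based on the example, not documented behavior.

```python
def parse_gff3_line(line):
    """Parse one GFF3 record into coordinates plus an attribute dictionary."""
    cols = line.rstrip("\n").split(None, 8)  # 9 columns, tabs or spaces
    if len(cols) != 9:
        raise ValueError("expected 9 GFF3 columns")
    attrs, last_key = {}, None
    for token in cols[8].split(";"):
        token = token.strip()
        if not token:
            continue
        if "=" in token:
            key, value = token.split("=", 1)
            attrs[key] = value
            last_key = key
        elif last_key is not None:
            # Assumption: a bare token continues the previous value.
            attrs[last_key] += ";" + token
    return {
        "contig": cols[0],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "attributes": attrs,
    }


example = (
    "chr1\tDRAGEN\tCNV\t1426289\t1428354\t22\t.\t.\t"
    "Alt=DEL;LinearCopyRatio=0.411841;CopyNumber=1;Genotype=0/1;Qual=22;"
    "Filter=cnvLength;dinucQual;Start=1426288;Stop=1428354;Length=2066;"
    "BinCount=2;ImproperPairsCount=7,16;color=#DDDDDD;"
)
segment = parse_gff3_line(example)
copy_number = int(segment["attributes"]["CopyNumber"])
```

When reading a whole file, lines starting with `#` (such as `##gff-version 3`) should be skipped before calling the parser. Each parsed segment can then be drawn as a horizontal line from `start` to `end` at height `copy_number`.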

Excluded Intervals File

To improve accuracy, the DRAGEN CNV Pipeline excludes genomic intervals if one or more of the target intervals failed at least one quality requirement. The excluded intervals are reported in the *.excluded_intervals.bed.gz file. The file is in BED format, identifies the regions of the genome that are not callable for CNV analysis, and describes the reason each interval was excluded in the fourth column. The following are the possible reasons for exclusion.

An example of a *.excluded_intervals.bed.gz file is shown below:

chr1    0       818022  NON_KMER_UNIQUE
chr1    824431  830446  NON_KMER_UNIQUE
chr1    834311  836677  NON_KMER_UNIQUE
chr1    838659  841054  NON_KMER_UNIQUE
chr1    850451  853257  NON_KMER_UNIQUE
chr1    855442  860261  NON_KMER_UNIQUE
chr1    866189  868833  NON_KMER_UNIQUE
chr1    881779  884116  NON_KMER_UNIQUE
chr1    1016667 1018959 NON_KMER_UNIQUE
chr1    1075880 1079718 NON_KMER_UNIQUE
chr1    1137942 1140725 NON_KMER_UNIQUE
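One simple use of this file is to total the excluded bases per reason. The sketch below assumes the four whitespace-separated columns shown above; the function name `excluded_bases_by_reason` is illustrative.

```python
import io
from collections import defaultdict


def excluded_bases_by_reason(handle):
    """Sum interval lengths (end - start) for each exclusion reason."""
    totals = defaultdict(int)
    for line in handle:
        fields = line.split()
        if len(fields) < 4 or fields[0].startswith("#"):
            continue
        _contig, start, end, reason = fields[:4]
        # BED intervals are 0-based, end-exclusive, so length is end - start.
        totals[reason] += int(end) - int(start)
    return dict(totals)


example = io.StringIO(
    "chr1\t0\t818022\tNON_KMER_UNIQUE\n"
    "chr1\t824431\t830446\tNON_KMER_UNIQUE\n"
)
totals = excluded_bases_by_reason(example)
```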

Panel of Normals Files

PON Metrics File

The DRAGEN CNV Pipeline generates the PON Metrics File (.cnv.pon_metrics.tsv.gz) if a Panel of Normals is provided and --cnv-generate-pon-metric-file is set to true. If PON size is less than 2, then an empty file will be generated.

The PON Metrics File includes basic statistics of the coverage profile for each interval. To remove sample coverage bias, DRAGEN applies sample median normalization, and then computes the following metrics:

Example:

contig  start   stop    name    mean    std     normalizedStd min     25%     50%     75%     max     intervalSize    gcContents
1       12098   12178   target-wes-1-12098:12178/1      3.6259044560802365      0.46661435469856077      0.1286890927079175     2.7961783439490446      3.2573018790849675      3.7105263157894739      4.0162683823529415      4.3298969072164946      80      0.49382716049382713
1       12178   12258   target-wes-1-12178:12258/2      5.0685579775753595      0.70638315915955963      0.13936570564740217     3.9044585987261144      4.5225944682508761      5.067708333333333       5.5778115844038769      6.3277777777777775      80      0.46913580246913578
1       12553   12637   target-wes-1-12553:12637/1      4.6990858287992054      0.62537786269786677      0.13308500535681309     3.7417218543046356      4.0305632538350444      5.0382165605095546      5.2151580459770113      5.5773195876288657      84      0.6705882352941176

PON Correlation File

The DRAGEN CNV Pipeline generates the PON Correlation File (.cnv.pon_correlation.txt.gz) if a Panel of Normals is provided. The PON Correlation File reports the correlation between the case sample and each PON sample.

Example:

Correlation of case sample CASE_SAMPLE_NAME
  PON1: 0.9786
  PON2: 0.9868
  PON3: 0.9912
  ...
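A file in this layout can be parsed to spot PON samples that correlate poorly with the case sample. This is a sketch under the assumption that each data line has the form `name: value`; the cutoff of 0.98 is an arbitrary illustration value, not a DRAGEN recommendation.

```python
import io


def read_pon_correlations(handle):
    """Return {pon_sample_name: correlation} from a PON correlation file."""
    correlations = {}
    for line in handle:
        name, sep, value = line.strip().rpartition(":")
        if not sep:
            continue  # title line, or a line without a value
        try:
            correlations[name.strip()] = float(value)
        except ValueError:
            continue  # value is not numeric
    return correlations


example = io.StringIO(
    "Correlation of case sample CASE_SAMPLE_NAME\n"
    "  PON1: 0.9786\n"
    "  PON2: 0.9868\n"
    "  PON3: 0.9912\n"
)
correlations = read_pon_correlations(example)
low = sorted(name for name, r in correlations.items() if r < 0.98)
```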

SegDups Extension Files

The SegDups extension provides intermediate and final outputs. All intervals follow the bed format (0-based, start inclusive, end exclusive) and they are in tab-delimited text files (gzip compressed).

The final output has extension .cnv.segdups.rescued_intervals.tsv.gz, and contains the rescued target intervals which can then be injected before segmentation. It has columns:

Chromosome name
Start - 0-based inclusive
Stop - 0-based exclusive
Target interval name prefixed with "target-wgs-"
Sample Counts (in header, identifier taken from RGSM) - log2-scale normalized counts for each interval
Improper pairs - Kept for compatibility with CNV workflow, set to 0 for rescued intervals
Target region ID - ID of the target region (aka pair of rescued target intervals)

Intermediate files

The joint normalized coverage profile (log2-scale) for each region is provided in output to file .cnv.segdups.joint_coverage.tsv.gz with columns:

Target region ID
Joint normalized coverage (log2-scale) of the two intervals in the target region
Copy Number Float - estimate of joint copy number for the target region (e.g., CNF ~ 4)

The differentiating sites' data is provided in output to file .cnv.segdups.site_ratios.tsv.gz with columns:

Differentiating site name
Target (gene A) counts at site
Non-target (gene B) counts at site
Target ratio: gene A counts over total (i.e., gene A + gene B) counts at site
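The target ratio in the last column follows directly from the two count columns. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def target_ratio(target_counts, non_target_counts):
    """Gene A counts over total (gene A + gene B) counts at a site.

    Returns None when there is no coverage at the site.
    """
    total = target_counts + non_target_counts
    if total == 0:
        return None
    return target_counts / total


ratio = target_ratio(30, 10)  # made-up counts for illustration
```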

CNV with SV Support

The DRAGEN CNV caller leverages depth as its primary signal for calling copy number variants. Depth alone poses challenges for calling events that are less than 10kbp. The sensitivity of CNVs at lengths less than 10kbp can be improved by leveraging junction signals from the DRAGEN structural variant caller.

When both the DRAGEN CNV and SV caller are executed in a single invocation, then an additional integration step is done at the end of a DRAGEN run to improve the CNV calls. This feature is enabled automatically when DRAGEN detects a germline WGS analysis.

The SV/CNV Integration module takes in DEL and DUP calls from the output data structures of the germline CNV and SV callers, identifies putative matches, updates annotations, filters, scores, and outputs the refined records in a new output VCF. By leveraging junction signals from the SV caller and depth signals from the CNV caller, this approach allows for sensitive CNV detection down to 1kbp while also improving recall and precision across length scales. This is achieved by rescuing previously low quality calls if evidence is found from both callers, and also by adjusting CNV breakends to the more accurate SV breakends. The matching algorithm takes into account the proximity of the events as well as the transition states at the breakends, among other things.

Example command lines

The following is an example command line for running a germline WGS analysis for both CNV and SV.

dragen \
-r <HASHTABLE> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--bam-input <BAM> \
--enable-map-align false \
--enable-cnv true \
--cnv-enable-self-normalization true \
--enable-sv true

Other optional CNV or SV parameters can also be added.

Combined CNV/SV VCF Output

The original CNV and SV VCF output files, prior to integration, are available for users in the DRAGEN output directory, as described elsewhere. Additionally, there is an enhanced CNV VCF available with the *.cnv_sv.vcf.gz extension. The VCF header lines in the *.cnv_sv.vcf.gz mostly correspond to a concatenation of the individual header lines from the CNV and SV VCFs, with a few lines deduplicated and some new ones added. For details on the legacy header lines, please refer to the individual CNV and SV user guide sections.

Newly added header lines are described in the following table.

Records that can be matched or rescued will have annotations indicating the breakpoint linkage between a CNV and SV record. If a complete match is found, then the MatchSv annotation will be present in the record, indicating the SV record's ID field for this CNV record. Furthermore, the use of the SVCLAIM field will indicate if the record has evidence arising from depth signal D, or junction signals J, or both DJ.

Because of the mixing of standalone SV records and CNV records, the FORMAT field may have different annotations. For details on the CNV or SV specific annotations, please refer to the individual CNV and SV user guide sections.

Records that can be matched or rescued will have FILTER set to PASS. The original FILTERs are retained for records that were not matched or rescued. For example, the cnvLength FILTER will still be applied to standalone CNV records (those with SVCLAIM=D).

Example records are shown below.

# Merged record, note presence of SVCLAIM=DJ and MatchSv
1   869444  DRAGEN:LOSS:1:869444-870284 N   <DEL>   150  PASS    SVLEN=-840;SVTYPE=CNV;END=870284;REFLEN=840;OrigCnvPos=869000;OrigCnvEnd=871000;SVCLAIM=DJ;MatchSv=DRAGEN:DEL:41710:0:0:0:2:0   GT:SM:CN:BC:GC:CT:AC:PE 1/1:0.649442:1:2:0.6785:0.408:0.3705:10,3
 
# CNV record that did not match, note presence of SVCLAIM=D
1   13472000    DRAGEN:LOSS:1:13472001-13663000 N   <DEL>   69  PASS    SVLEN=-191000;SVTYPE=CNV;END=13663000;REFLEN=191000;SVCLAIM=D   GT:SM:CN:BC:GC:CT:AC:PE 0/1:0.427273:1:141:0.467603:0.501092:0.498667:7,10
  
# SV record that did not match, note presence of SVCLAIM=J
1   14657708    DRAGEN:LOSS:1:14657708-14658485 N   <DEL>   150 PASS    END=14658485;SVTYPE=DEL;SVLEN=-776;CIGAR=1M776D;CIPOS=0,2;HOMLEN=2;HOMSEQ=TC;SVCLAIM=J  GT:FT:GQ:PL:PR:SR   0/1:PASS:671:908,0,668:36,13:27,18
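Records like these can be grouped by their evidence source by reading the SVCLAIM and MatchSv INFO keys. The sketch below assumes whitespace-separated VCF columns; the helper names are illustrative.

```python
def parse_info(info):
    """Split a VCF INFO string into a dictionary (flags map to True)."""
    out = {}
    for token in info.split(";"):
        if not token:
            continue
        key, _, value = token.partition("=")
        out[key] = value if value else True
    return out


EVIDENCE = {"D": "depth only", "J": "junction only", "DJ": "depth and junction"}


def describe_record(vcf_line):
    """Summarize the evidence annotations of one *.cnv_sv.vcf.gz record."""
    cols = vcf_line.split()
    info = parse_info(cols[7])
    return {
        "id": cols[2],
        "filter": cols[6],
        "evidence": EVIDENCE.get(info.get("SVCLAIM"), "unknown"),
        "match_sv": info.get("MatchSv"),  # None when no SV record matched
    }


merged = describe_record(
    "1\t869444\tDRAGEN:LOSS:1:869444-870284\tN\t<DEL>\t150\tPASS\t"
    "SVLEN=-840;SVTYPE=CNV;END=870284;REFLEN=840;OrigCnvPos=869000;"
    "OrigCnvEnd=871000;SVCLAIM=DJ;MatchSv=DRAGEN:DEL:41710:0:0:0:2:0\t"
    "GT:SM:CN:BC:GC:CT:AC:PE\t1/1:0.649442:1:2:0.6785:0.408:0.3705:10,3"
)
```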

Ploidy Calling

Biomarkers

Downsampling

Joint Analysis

Combined phased variants in the gVCF input

The option to combine phased variants is switched off by default. For details, refer to the section on germline small variant calling in this user guide.

Force-genotyped and targeted variants in the gVCF input

If force genotyping was enabled for any input file, any ForceGT calls that are not also called by the variant caller will be ignored.

Similarly, targeted variant calls (option --targeted-merge-vc) in any gVCF file that are not also called by the variant caller will be ignored as well.

Processing GATK gVCF Files

Both pedigree- and population-based joint analysis can process gVCF files written by the GATK v4.1 variant caller.

Joint Analysis Output Format

There are two available joint analysis output files:

Multisample VCF--A VCF file containing a column with genotype information for each of the input samples according to the input variants.
Multisample gVCF--A gVCF file augmenting the content of a multisample VCF file, similar to how a gVCF file augments a VCF file for a single sample. In between variant sites, the multisample gVCF contains statistics that describe the level of confidence that each sample is homozygous to the reference genome. Multisample gVCF is a convenient format for combining the results from a pedigree or small cohort into a single file. If using a large number of samples, fluctuation in coverage or variation in any of the input samples creates a new hom-ref block, which causes a highly fragmented block structure and a large output file that can be slow to create.

The multisample gVCF output is only available in the pedigree-based analysis.

The following example shows a single line from a multi-sample VCF where one sample has a variant, and the other two samples are in a gVCF gap. Gaps are represented by "./.:.:".

1 605262 . G A 13.41 DRAGENHardQUAL
AC=2;AF=1.000;AN=2;DP=2;FS=0.000;MQ=14.00;QD=6.70;SOR=0.693
GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP ./.:.:.:.:.:LowDepth
1/1:0,2:1.000:2:4:PASS:0,0:0,2:50,6,0:1.383e+01,4.943e+00,1.951e+00
./.:.:.:.:.:LowDepth

Hom-ref Blocks FORMAT Fields

In hom-ref blocks, the following FORMAT fields are calculated uniquely.

FORMAT/DP--In a single sample gVCF, the FORMAT/DP reported at a hom-ref position is the median DP in that band. In a multisample gVCF, the FORMAT/DP reported at a hom-ref position is the MIN_DP from hom-ref calls.
FORMAT/AD--In single sample gVCF, values represent the position in the band where DP=median DP. In the multisample gVCF, AD values at hom-ref positions are copied from the single sample gVCF.
FORMAT/AF--Values are based on FORMAT/AD.
FORMAT/PL--Values represent the Phred likelihoods per genotype hypothesis. For hom-ref blocks, each value in FORMAT/PL represents the minimum value across all positions within the band.
FORMAT/SPL and FORMAT/ICNT--Parameters reported in the gVCF records, including both hom-ref blocks and variant records. The parameters are used to compute the confidence score of a variant being de novo in the proband of a trio. For SNPs, FORMAT/PL and FORMAT/SPL are both used as input to the DeNovo Caller. FORMAT/PL represents Phred likelihoods obtained from the genotyper, if the genotyper is called. FORMAT/SPL represents Phred likelihoods obtained from column-wise estimation, pregraph. Each value in FORMAT/SPL represents the minimum across all positions within the band. For indels, the PL value is computed in the joint pedigree calling step based on the FORMAT/ICNT reported in the gVCF file. FORMAT/ICNT consists of two values. The first value is the number of reads with no indels at the position, and the second value is the number of reads with indels at the position. Each value in FORMAT/ICNT represents the maximum of the value across all positions within the band.

chr1 10288 . C <NON_REF> . PASS END=10290
GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT
0/0:131,4:135:69:132:0,69,1035:0,125,255:23,1

chr1 10291 . C
T,<NON_REF> 38.45 PASS
DP=100;MQ=24.72;MQRankSum=0.733;ReadPosRankSum=4.112;FractionInformativeReads=0.600;R2_5P_bias=0.000
GT:AD:AF:DP:F1R2:F2R1:GQ:PL:SPL:ICNT:GP:PRI:SB:MB
0/1:28,32,0:0.533,0.000:60:20,21,0:8,11,0:15:73,0,12,307,157,464:255,0,255:23,10:3.8452e+01,1.3151e-01,1.5275e+01,3.0757e+02,1.9173e+02,4.5000e+02:0.00,34.77,37.77,34.77,69.54,37.77:4,24,7,25:8,20,14,18

SPL and ICNT values are specific to DRAGEN. The GATK variant caller does not output SPL and ICNT values.

In a single sample gVCF, FORMAT/DP reported at a hom-ref position is the median DP in the band. The minimum is also computed and printed as MIN_DP for the band.
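The band-level rules described in this section (median DP, minimum DP, per-genotype minimum for PL and SPL, per-value maximum for ICNT) can be illustrated with a small sketch. The per-position input values below are made up, and the handling of bands with an even number of positions (where `statistics.median` averages the two middle values) is an assumption that may differ from DRAGEN.

```python
from statistics import median


def summarize_band(positions):
    """Collapse per-position values into single-sample hom-ref band values.

    `positions` is a list of dicts with keys dp, pl, spl, and icnt.
    """
    dps = [p["dp"] for p in positions]
    return {
        "DP": median(dps),
        "MIN_DP": min(dps),
        # Each PL/SPL entry is the minimum across positions in the band.
        "PL": [min(v) for v in zip(*(p["pl"] for p in positions))],
        "SPL": [min(v) for v in zip(*(p["spl"] for p in positions))],
        # Each ICNT entry is the maximum across positions in the band.
        "ICNT": [max(v) for v in zip(*(p["icnt"] for p in positions))],
    }


# Made-up per-position values for a three-base band.
band = summarize_band([
    {"dp": 135, "pl": [0, 69, 1035], "spl": [0, 125, 255], "icnt": [23, 1]},
    {"dp": 132, "pl": [0, 72, 1100], "spl": [0, 130, 255], "icnt": [20, 0]},
    {"dp": 140, "pl": [0, 80, 1050], "spl": [0, 128, 255], "icnt": [22, 1]},
])
```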

Pedigree Mode

Use pedigree mode to jointly analyze samples from related individuals and to perform de novo calling.

To invoke pedigree mode, set the --enable-joint-genotyping option to true. Use the --pedigree-file option to specify the path to a pedigree file that describes the relationship between samples.

The pedigree file must be a tab-delimited text file with the file name ending in the .ped extension. The following information is required.

Column Header

Description

The following is an example of an input pedigree file.

#Family_ID Individual_ID Paternal_ID Maternal_ID Sex Phenotype
FAM001 NA12877_Father 0 0 1 1
FAM001 NA12878_Mother 0 0 2 1
FAM001 NA12882_Proband NA12877_Father NA12878_Mother 2 2
FAM001 NA12883_Proband NA12877_Father NA12878_Mother 1 0
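A pedigree file of this form can be checked before a run, for example to list the trios it defines. The sketch below splits on any whitespace so that it also accepts the example as printed here, and it relies on the usual PED convention that a parent ID of 0 means the parent is not in the file. The function names are illustrative.

```python
import io


def read_ped(handle):
    """Parse a pedigree file into a list of per-individual dictionaries."""
    rows = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        family, individual, father, mother, sex, phenotype = line.split()[:6]
        rows.append({
            "family": family,
            "individual": individual,
            "father": father,
            "mother": mother,
            "sex": sex,
            "phenotype": phenotype,
        })
    return rows


def find_trios(rows):
    """Return (child, father, mother) for children whose parents are both listed."""
    ids = {r["individual"] for r in rows}
    return [
        (r["individual"], r["father"], r["mother"])
        for r in rows
        if r["father"] in ids and r["mother"] in ids
    ]


example = io.StringIO(
    "#Family_ID\tIndividual_ID\tPaternal_ID\tMaternal_ID\tSex\tPhenotype\n"
    "FAM001\tNA12877_Father\t0\t0\t1\t1\n"
    "FAM001\tNA12878_Mother\t0\t0\t2\t1\n"
    "FAM001\tNA12882_Proband\tNA12877_Father\tNA12878_Mother\t2\t2\n"
    "FAM001\tNA12883_Proband\tNA12877_Father\tNA12878_Mother\t1\t0\n"
)
trios = find_trios(read_ped(example))
```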

De Novo Calling

Pedigree Mode is run in multiple steps. The following is an example workflow for a trio using FASTQ input.

Run single sample alignment and variant calling to generate per sample output using the following inputs for Pedigree Mode.
- gVCF files for the Small Variant Caller.
- *.tn.tsv files for the Copy Number Caller.
- BAM files for the Structural Variant Caller.
Run Pedigree Mode for Small Variant Caller. For more information, see .
Run Pedigree Mode for Copy Number Caller. For more information, see .
Run Pedigree Mode for Structural Variant Caller. For more information, see .
Run DeNovo Variant Small Variant Filtering. For more information, see .

Small Variant DeNovo Calling

The DQ formula is DQ = -10log10(1 - Pdenovo).

Pdenovo is the sum of all entries in the joint posterior probability matrix with one or more Mendelian conflicts.

The following are the possible values:

Inherited--The called trio genotype is consistent with Mendelian inheritance.
LowDQ--The called trio genotype is inconsistent with Mendelian inheritance and DQ is less than the de novo quality threshold.
DeNovo--The called trio genotype is inconsistent with Mendelian inheritance and DQ is greater than or equal to the de novo quality threshold.
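The DQ formula and the three labels above can be worked through in a short sketch. The function names are hypothetical, and the threshold is passed in explicitly because it depends on the variant type and on whether ML recalibration is enabled.

```python
import math


def denovo_quality(p_denovo):
    """DQ = -10 * log10(1 - Pdenovo)."""
    if not 0.0 <= p_denovo <= 1.0:
        raise ValueError("Pdenovo must be a probability")
    if p_denovo == 1.0:
        return math.inf  # log10(0) is undefined; treat as unbounded quality
    return -10.0 * math.log10(1.0 - p_denovo)


def classify(mendelian_consistent, dq, threshold):
    """Apply the Inherited / LowDQ / DeNovo rules listed above."""
    if mendelian_consistent:
        return "Inherited"
    return "DeNovo" if dq >= threshold else "LowDQ"


dq = denovo_quality(0.99)  # Pdenovo of 0.99 gives a DQ of about 20
label = classify(False, dq, threshold=0.05)
```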

The following is an example VCF line for a trio:

Pedigree Mode Options

The following command line options are available for de novo small variant calling.

--enable-joint-genotyping--Run the joint genotyping caller.
--pedigree-file--Specify the path to a pedigree file that describes the relationship between samples. It is possible to run JointGenotyper without a pedigree file on unrelated samples, but this is no longer recommended for gVCF variant calls from DRAGEN 3.10 or newer.
--variant or --variant-list --Specify the gVCF input to the workflow. The pedigree caller can read input gVCF files from an AWS S3 bucket, Azure storage BLOB, or pre-signed URL.
--qc-snp-denovo-quality-threshold--Specify the minimum DQ value for a SNP to be considered de novo. The default is 0.05 if ML recalibration is off, 0.0017 if ML recalibration is on.
--qc-indel-denovo-quality-threshold--Specify the minimum DQ value for an indel to be considered de novo. The default is 0.4 if ML recalibration is off, 0.04 if ML recalibration is on.
--output-directory--The output directory. This is required.
--output-file-prefix--The prefix used to label all output files. This is required.
-r The directory where the hash table resides.

Population Mode

DRAGEN provides a population-based analysis option to jointly analyze samples from unrelated individuals.

The tool for population-based analysis is the iterative gVCF Genotyper. Its input is a set of single or multisample gVCFs. The output is a multisample VCF that contains one entry for any variant seen in any of the input gVCFs. The variants are genotyped across all input samples using information from the hom-ref blocks as necessary. The iterative gVCF Genotyper does not adjust genotypes based on population information but it provides means to filter variant sites based on information leveraged from the population. See for information on the available command line options.

Iterative gVCF Genotyper analysis

When a large number of samples are available, the user can divide samples into multiple batches each with similar sample size (e.g. 1000 samples), and repeat Step 1 for every batch.

If a combined msVCF of all batches is required, an additional step can be separately run to merge all of the batch msVCF files into a single msVCF containing all samples.

Commandline arguments common to all steps:

--enable-gvcf-genotyper-iterative: set to true to run the iterative gVCF Genotyper (always required).
--ht-reference: The file containing the reference sequence in FASTA format (always required).
--output-directory: The output directory (always required).
--output-file-prefix: The prefix used to label all output files (optional, default value dragen).
--shard: Use this option to process only a portion ('shard') of the genome, when distributing the work across multiple compute nodes in a production workflow. Provide the index (1-based) of the shard to process and the total number of shards, in the format of n/N (e.g. 1/50 means shard 1 of total 50 shards). To facilitate concurrent processing within each job, the shard will by default be split into 10x the number of available threads. This option assumes a Human reference genome and might not work for non-Human reference genomes.
--gg-regions: Use this option to test iterative gVCF Genotyper only for a subset of regions in the genome. The value is a comma-delimited list of regions (chr:start-end). Contig names must match those in the reference and no region may overlap another. If a single region larger than 1Mb is selected, multiple threads are enabled. Otherwise, one thread is launched per region. This assumes that the --shard option is not given. It is important that the same regions are chosen for each of steps 1, 2, and 3.
--gg-regions-bed: If a path to a BED file is provided as value, this option, like the one above, will limit the iterative gVCF Genotyper processing to the genome regions specified therein, which must be non-overlapping. This option is intended for exome input data. It results in faster processing times and is compatible with sharding. This option will only take effect in step 1 or end-to-end mode. It differs from the option above in that, if the number of regions exceeds 10 times the number of available threads, they will not necessarily be processed by independent threads.
--gg-discard-ac-zero: If set to true, the gVCF Genotyper does not print variant alleles that are not called (hom-ref genotype) in any sample. The default value is true.
--gg-remove-nonref: If set to true, the <NON_REF> symbolic allele is removed in the process of reading in input gVCF files. The default value is true.
--gg-vc-filter: Discard input variants that failed filters in the upstream caller. The default is false. Affected records will have their genotype set to hom-ref and the filter string "ggf" added to FORMAT/FT.
--gg-hard-filter: Specifies a filtering expression to be applied to the output msVCF records. See below. The default is to apply no filters.
--gg-skip-filtered-sites: Omits msVCF records that fail the given hard filter. The default is false.
--gg-msvcf-format-fields: Can be used to override the default set of sample genotype fields in the output msVCF. See below.
--gg-msvcf-info-fields: Can be used to override the default set of site-wise INFO fields in the output msVCF. See below.
--gg-squeeze-msvcf: Set to omit genotype fields other than GT from the output msVCF for confidently called hom-ref sample records.
--gg-gq-squeezing-threshold: Use in conjunction with the previous option to adjust the threshold on GQ (default 30) that signifies a confident hom-ref call.
--gg-output-type: Set to spvcf to write the output in spVCF format rather than the default msVCF. See below for details.
--gg-diploidify: In the output msVCF file, convert haploid calls to diploid. The diploidified genotype is homozygous for the haploid allele, e.g. 1 becomes 1/1. The LPL field is also diploidified for these samples. Site metrics, such as allele counts, are calculated before diploidification. Diploidifying genotypes may ease the ingestion of msVCF files into downstream analysis tools, such as Hail and Plink. When this option is enabled, it is possible to include the DF FORMAT field (included by default), which signifies whether or not a genotype has been diploidified, see below.

Commandline arguments for Step1 (step-by-step mode)

--gvcfs-to-cohort-census: set to true to aggregate gVCF files from one batch of samples into a cohort file and a census file.
--variant-list: the path to a file containing a list of input gVCF files, with the absolute path to each file on a separate line.
--variant: if --variant-list is not given, use this option for each input gVCF file. Absolute file paths must be provided.

Commandline arguments for Step2 (step-by-step mode)

--aggregate-censuses: set to true to aggregate a list of per batch census files into a global census file.
--input-census-list: the path to a file containing a list of input per batch census files (from Step1), with the absolute path to each file on a separate line.

Commandline arguments for Step3 (step-by-step mode)

--generate-msvcf: set to true to generate a multi-sample VCF for one batch of samples.
--input-cohort-file: the path to the per batch cohort file (from Step1).
--input-census-file: the path to the per batch census file (from Step1).
--input-global-census-file: the path to the global census file (from Step2).

Commandline arguments for running Step1 + Step2 + Step3 (end-to-end mode for a single batch)

--gvcfs-to-msvcf: set to true to enable the end-to-end mode. This is the default if none of the steps 1, 2, or 3 above is selected.
--variant-list: the path to a file containing a list of input gVCF files, with the absolute path to each file on a separate line.
--variant: if --variant-list is not given, use this option for each input gVCF file. Absolute file paths must be provided.

Commandline arguments for merging per-batch msVCF files

--merge-batches: set to true to merge msVCF files for a set of batches.
--input-batch-list: the path to a file containing a list of msVCF files to be merged, with the path to each file on a separate line. All the files listed must have been generated from the same global census file, with the same set of options, and by default all batches pertaining to that global census must be included in the merge.
--gg-enable-indexing: set to true (the default) to generate a tabix index for the merged msVCF.
--gg-merge-subset: set to override the restriction that all batches must be included in the merge.
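
The step-by-step options above can be assembled programmatically when many shards or batches have to be launched. The following is a minimal sketch, not an official wrapper: it assumes the executable is called dragen, that boolean options are passed as a separate true token, and that the file paths shown are hypothetical placeholders. Only option names listed above are used.

```python
from typing import List, Optional

DRAGEN_BIN = "dragen"  # assumption: name of the DRAGEN executable on PATH


def common_args(reference: str, out_dir: str, prefix: str,
                shard: Optional[str] = None) -> List[str]:
    """Arguments shared by every iterative gVCF Genotyper step."""
    args = [
        DRAGEN_BIN,
        "--enable-gvcf-genotyper-iterative", "true",
        "--ht-reference", reference,
        "--output-directory", out_dir,
        "--output-file-prefix", prefix,
    ]
    if shard is not None:
        args += ["--shard", shard]  # n/N with a 1-based index, e.g. "1/50"
    return args


def step1(common: List[str], variant_list: str) -> List[str]:
    """Step 1: aggregate the gVCFs of one batch into cohort and census files."""
    return common + ["--gvcfs-to-cohort-census", "true",
                     "--variant-list", variant_list]


def step2(common: List[str], census_list: str) -> List[str]:
    """Step 2: aggregate per-batch census files into a global census."""
    return common + ["--aggregate-censuses", "true",
                     "--input-census-list", census_list]


def step3(common: List[str], cohort: str, census: str,
          global_census: str) -> List[str]:
    """Step 3: generate the msVCF for one batch."""
    return common + ["--generate-msvcf", "true",
                     "--input-cohort-file", cohort,
                     "--input-census-file", census,
                     "--input-global-census-file", global_census]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    base = common_args("/ref/genome.fa", "/out/batch1", "batch1", shard="1/50")
    print(" ".join(step1(base, "/in/batch1.gvcfs.txt")))
```

The returned lists can be passed to subprocess.run unchanged, which avoids shell quoting problems with paths.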

Enabling use of mimalloc library for enhanced performance

The multi-sample VCF output of the iterative gVCF Genotyper

The output of gVCF Genotyper is a multi-sample VCF (msVCF) that contains metrics computed for all samples in the cohort.

We also added a new FORMAT/LAA field which lists 1-based indices of the alternate alleles that occur in the current sample. The allele order of other local fields is the same as that of LAA.

This approach is also referred to as local alleles and is also used by other open source software.

Iterative gVCF Genotyper with Mitochondrial Variant Calls

When processing mitochondrial variant calls, which may contain separate records for each allele, iterative gVCF Genotyper processing differs in the following ways:

Only the record with the highest FORMAT/AF sum is kept.
The FORMAT/AF field will be additionally collected, and used to generate the FORMAT/LAF field in the output msVCF

QUAL column in msVCF

Measures of Hardy-Weinberg Equilibrium in the msVCF output

The Hardy-Weinberg Equilibrium (HWE) principle states that, given certain conditions, genotype and allele frequencies should remain constant between generations. Deviations from HWE can result from violations of the underlying HWE assumptions in the population, from non-random sampling, or may be artifacts of variant calling. Consistency with HWE can be assessed by comparing the observed frequencies of genotypes to those expected under HWE given the observed allele counts.

Metric

Description

Scope

Number of values

Allele-wise vs site-wise Hardy-Weinberg metrics

Iterative gVCF Genotyper offers both allele-wise and site-wise HWE P-values. The allele-wise P-values are based on the exact-conditional method, while the site-wise P-values are based on Pearson's chi-squared method. For bi-allelic sites, both measure the same property, but their values may differ because the methods differ. Care should be taken when deciding which to use.

Allele-Wise Hardy-Weinberg Equilibrium and Excess Heterozygosity.

Iterative gVCF Genotyper calculates allele-wise HWE and ExcHet P-values. The values are calculated using the exact-conditional method. The implementation does not use a mid P-value correction.

Site-Wise Hardy-Weinberg Equilibrium.

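$$\chi^2 = \sum_{gt} \frac{(E_{gt} - O_{gt})^2}{E_{gt}}$$

$$\mathrm{dof} = \frac{n(n-1)}{2}$$

where $E_{gt}$ and $O_{gt}$ are the expected and observed counts of genotype $gt$, and $n$ is the number of alleles.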

The batch-wise value uses only the alleles present in the batch. Alleles with AC=0 are not included in the calculation.

A P-value of ≈ 1 suggests that the distribution of heterozygotes and homozygotes is close to that expected under HWE, while a P-value of ≈ 0 suggests a deviation from it.
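
The site-wise calculation can be illustrated with a short sketch. It is not the DRAGEN implementation: it assumes diploid genotypes, takes the expected counts from the observed allele frequencies (N p_i^2 for homozygotes, 2 N p_i p_j for heterozygotes), and evaluates the chi-squared upper tail with closed-form series for integer degrees of freedom. The function names are hypothetical.

```python
import math
from itertools import combinations_with_replacement
from typing import Dict, Tuple


def chi2_sf(x: float, dof: int) -> float:
    """Upper-tail probability of a chi-squared variable with integer dof >= 1."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        # Even dof: exp(-x/2) * sum_{i < dof/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, dof // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd dof: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * sum_r x^(r-1) / (1*3*...*(2r-1))
    term, total = 1.0, 0.0
    for r in range(1, (dof - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total)


def site_hwe(genotype_counts: Dict[Tuple[int, int], int]) -> Tuple[float, int, float]:
    """Return (chi2, dof, P-value) for observed diploid genotype counts.

    Keys are allele-index pairs such as (0, 1). Alleles with AC=0 never
    enter the calculation, mirroring the note in the text.
    """
    n_samples = sum(genotype_counts.values())
    if n_samples == 0:
        raise ValueError("no called genotypes")
    observed: Dict[Tuple[int, int], int] = {}
    allele_counts: Dict[int, int] = {}
    for (a, b), count in genotype_counts.items():
        key = (min(a, b), max(a, b))
        observed[key] = observed.get(key, 0) + count
        allele_counts[a] = allele_counts.get(a, 0) + count
        allele_counts[b] = allele_counts.get(b, 0) + count
    alleles = sorted(a for a, ac in allele_counts.items() if ac > 0)
    freq = {a: allele_counts[a] / (2.0 * n_samples) for a in alleles}

    chi2 = 0.0
    for a, b in combinations_with_replacement(alleles, 2):
        expected = n_samples * freq[a] * freq[b] * (1.0 if a == b else 2.0)
        chi2 += (expected - observed.get((a, b), 0)) ** 2 / expected
    dof = len(alleles) * (len(alleles) - 1) // 2
    p_value = chi2_sf(chi2, dof) if dof > 0 else 1.0
    return chi2, dof, p_value
```

For example, site_hwe({(0, 0): 25, (0, 1): 50, (1, 1): 25}) matches the HWE expectation exactly and yields a P-value of 1, whereas a site with only homozygotes for two alleles yields a P-value close to 0.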

The Inbreeding Coefficient

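$$IC = 1 - \frac{O(\mathrm{het})}{E(\mathrm{het})}$$

where $O(\mathrm{het})$ and $E(\mathrm{het})$ are the observed and expected numbers of heterozygous genotypes.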
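
As a worked illustration of the formula, the sketch below computes the coefficient for a bi-allelic site. It assumes that the expected heterozygote count is the HWE expectation 2Np(1-p) derived from the observed allele frequency; the text does not state how the expectation is obtained for multi-allelic sites, so that case is not covered. The function name is hypothetical.

```python
from typing import Optional


def inbreeding_coefficient(n_hom_ref: int, n_het: int, n_hom_alt: int) -> Optional[float]:
    """IC = 1 - O(het) / E(het) for a bi-allelic site with diploid genotypes."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        return None  # no called genotypes
    p = (2 * n_hom_ref + n_het) / (2.0 * n)  # reference allele frequency
    expected_het = 2.0 * n * p * (1.0 - p)   # assumed HWE expectation
    if expected_het == 0.0:
        return None  # monomorphic site: the ratio is undefined
    return 1.0 - n_het / expected_het
```

A value near 0 indicates the expected number of heterozygotes, a positive value indicates a deficit, and a negative value indicates an excess.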

Allelic Balance

$$AB = \frac{AD_i}{\sum_j AD_j}$$

where $i$ is the index of the called allele and $j$ runs over all alleles. For heterozygotes this is taken as

$$AB_i = \frac{AD_i}{AD_a + AD_b}$$

where $a$ and $b$ are the two called alleles.

It is also possible to filter based on the maximum ABHetP value (see msVCF hard filtering).
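
The two allelic balance definitions can be sketched as follows. This is an illustration only: the function name and argument layout are hypothetical, ad is assumed to hold the per-allele depths in allele-index order, and called is assumed to hold the allele indices of a diploid genotype.

```python
from typing import Dict, Sequence


def allelic_balance(ad: Sequence[int], called: Sequence[int]) -> Dict[int, float]:
    """Return the allelic balance of each distinct called allele.

    Homozygous call: AD_i divided by the depth summed over all alleles.
    Heterozygous call: AD_i divided by the depth of the two called alleles.
    """
    distinct = sorted(set(called))
    if len(distinct) == 1:
        denominator = sum(ad)
    else:
        denominator = sum(ad[i] for i in distinct)
    if denominator == 0:
        return {}  # no supporting reads, balance undefined
    return {i: ad[i] / denominator for i in distinct}
```

For a heterozygous 0/1 call with AD=10,30 this gives 0.25 for the reference allele and 0.75 for the alternate allele.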

msVCF hard filtering

Sites in the output msVCF can be filtered on the following global metrics:

QUAL
Number of samples with called genotypes (GNS_GT)
Inbreeding coefficient (GIC)
𝜒2 Hardy-Weinberg Equilibrium P-value (GHWEc2)
The maximum P-value for heterozygous allelic balance (GABHetP)

The syntax of a filtering expression is the same as that used by the small variant caller. Filters are always applied to the globally-computed metrics, not the values for the current batch. Records failing a filter will have the specified filter ID(s) written to the FILTER column of the msVCF, or will be omitted entirely if the --gg-skip-filtered-sites option is specified. Since filtering is on a per-site basis, filters cannot be applied separately to SNPs or indels as they can in the variant caller.

msVCF metric customization

Metric

Description

Number of values1

Type

Metric

Description

Number of values1

Type

File size optimizations

The following options can help reduce file sizes:

The small variant caller's --vc-compact-gvcf, described previously. This doesn't reduce output file sizes, but the smaller input gVCFs reduce gVCF Genotyper runtime and could reduce data storage costs.
The removal of the NON_REF symbolic allele when ingesting the input gVCF files, which is the default behaviour. Doing this reduces the size not only of the final msVCF output, but also the intermediate cohort and census files.
Several options exist that reduce the volume of data written to the final msVCF file:
- Outputting local allele values, as described above.
- Use of the --gg-msvcf-format-fields and --gg-msvcf-info-fields options to output only those metrics required for the downstream analysis.
- Omitting records that fail filters (--gg-skip-filtered-sites option).
- Dropping trailing genotype fields for hom-ref records (--gg-squeeze-msvcf option). This behaviour is explicitly permitted by the VCF specification.

The option that can have the biggest impact on the final output file size is writing it directly in spVCF format (--gg-output-type spvcf). This is a lossless encoding and the space saving can be dramatic: file size reductions of multiple tens of times have been observed for large cohorts with sparsely distributed variants. Files output as spVCF at step 3 (--generate-msvcf) can be directly merged via the --merge-batches subcommand to produce a single spVCF file. spVCF-encoded files are likely to require decoding back to full msVCF for use with downstream tools, and a binary for this is available. The decoding takes time, but this is offset by the reduced time required within gVCF Genotyper to initially write the smaller spVCF files. Where possible, pipe the decoded data directly into the downstream tool rather than first writing the full msVCF file to disk.

CNV Output

For more information on how to use the output files to aid in debug and analysis, see Signal Flow Analysis.

CNV VCF File

File extension: *.cnv.vcf.gz

Header

The following is an example of some of the header lines that are specific to CNV:

##fileformat=VCFv4.2
##CoverageUniformity=0.402517
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
...
##reference=file:///reference_genomes/Hsapiens/hs37d5/DRAGEN
##ALT=<ID=CNV,Description="Copy number variant region">
##ALT=<ID=DEL,Description="Deletion relative to the reference">
##ALT=<ID=DUP,Description="Region of elevated copy number relative to the reference">
##INFO=<ID=REFLEN,Number=1,Type=Integer,Description="Number of REF positions included in this record">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END">
##FILTER=<ID=cnvQual,Description="CNV with quality below <WORKFLOW-SPECIFIC DEFAULT VALUES>">
##FILTER=<ID=cnvCopyRatio,Description="CNV with copy ratio within +/- 0.2 of 1.0">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=SM,Number=1,Type=Float,Description="Linear copy ratio of the segment mean">
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Estimated copy number">
##FORMAT=<ID=BC,Number=1,Type=Integer,Description="Number of bins in the region">
##FORMAT=<ID=PE,Number=2,Type=Integer,Description="Number of improperly paired end reads at start and stop breakpoints">

The following header lines are specific to somatic WGS CNV calling:

ModelSource: The primary basis on which the final tumor model was chosen. The following values can be included:
- DEPTH+BAF: Depth+BAF signal is used to determine tumor model.
- DEPTH+BAF_DOUBLED: The initial depth+BAF model is duplicated based on VAF signal or excess segments at half the expected depth change.
- DEPTH+BAF_DEDUPLICATED: The depth+BAF model is deduplicated based on VAF signal or insufficient segments supporting a duplication.
- DEPTH+BAF_WEAK: Depth+BAF signal is used to determine lower-confidence tumor model.
- VAF: VAF signal is used to determine tumor model due to insufficient depth+BAF signal.
- DEGENERATE_DIPLOID: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. The diploid coverage is set to lowest value observed in a substantial number of bases in segments with BAF=50%.
- SAMPLE_MEDIAN: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. Diploid coverage set to sample median.
EstimatedTumorPurity: Estimated fraction of cells in the sample due to tumor. The range of this field is [0, 1] or NA if a confident model could not be determined.
DiploidCoverage: Expected read count for a target bin in a diploid region. The numeric value is unlimited.
OverallPloidy: Length weighted average of tumor copy number for PASS events. The numeric value is unlimited.
AlternativeModelDedup: An alternative to the best model corresponding to one less whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation if the best model might involve a spurious genome duplication.
AlternativeModelDup: An alternative to the best model corresponding to one more whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation where the best model might have missed a true genome duplication.
OutlierBafFraction: A QC metric that measures the fraction of b-allele frequencies that are incompatible with the segment the BAFs belong to. High values might indicate a mismatched normal, substantial cross-sample contamination, or a different source of a mosaic genome, such as bone marrow transplantation. The range of this field is [0, 1].

Records

All coordinates in the VCF are 1-based.

CHROM

The CHROM column specifies the chromosome (or contig) on which the copy number variant being described occurs.

POS

ID

REF

The REF column contains an N for all CNV events.

ALT

QUAL

The QUAL column contains an estimated quality score for the CNV event, which is used in hard filtering. Each CNV workflow has different defaults and the value used can be found in the VCF header.

FILTER

Available FILTERs:

cnvLength which indicates that the length of the CNV is lower than a threshold.
cnvQual which indicates that the QUAL of the CNV is lower than a threshold.

Germline CNV has the following additional FILTERs:

cnvCopyRatio which indicates that the segment mean of the CNV is not far enough from copy neutral.

Both Germline CNV workflows have the following additional FILTERs:

dinucQual which indicates a CNV call where some of its dinucleotide percentages are outside typical ranges, and thus the call is likely to be a false positive.

Germline WGS CNV has the following additional FILTERs:

cnvBinSupportRatio which indicates, for CNVs greater than 80kb, the percent span of supporting target intervals is lower than a threshold.
highCN which indicates a CNV call with implausible copy number (>6).

Germline WES CNV has the following additional FILTERs:

cnvLikelihoodRatio which indicates that the log10 likelihood ratio of ALT to REF is less than a threshold.

Both Somatic CNV workflows have the following additional FILTERs:

binCount - Filters CNV events with a bin count lower than a threshold.

Somatic WGS CNV has the following additional FILTERs:

lengthDegenerate - Marks records as non-PASSing based on each record's length (REFLEN) when the caller returns the default model. Segments shorter than 1 Mb are assigned this filter when returning the default model.
segmentMean - Marks records as non-PASSing based on each record's segment mean (SM) when the caller returns the default model. Segments having insufficient SM in DELs or DUPs are assigned this filter when returning the default model.

INFO

The INFO column contains information representing the event.

REFLEN indicates the length of the event.
SVLEN is a signed representation of REFLEN (e.g., a negative value indicates a deletion), and it is only present for non-REF records.
SVTYPE is always CNV and only present for non-REF records.
END indicates the end position of the event (1-based, inclusive).

If a segment BED file is used, the segment identifier is carried over from the input to the SEGID field.

In Somatic WGS CNV, the INFO column can also contain the HET tag, when the call is considered sub-clonal. See HET-Calling Mode.

When matching CNV with SV output, additional INFO annotations are added. See CNV With SV Support.

FORMAT

The common FORMAT fields are described in the header:

Germline WGS CNV includes the following FORMAT fields:

Germline WES CNV includes the following FORMAT fields:

Somatic WGS CNV and Somatic WES CNV with ASCN (allele-specific copy number) support include the following FORMAT fields:

Note on genotype annotation in germline copy number calling

Coverage Uniformity

CNV Metrics File

Sex Genotyper:

The estimated sex of the case sample and of all panel of normals samples is reported. For WGS workflows, the estimated sex karyotype is reported; for non-WGS workflows, the gender is reported.
Confidence score (ranging from 0.0 to 1.0). If the sample sex is specified, this metric is 0.0.

CNV Summary:

Bases in reference genome in use
Average alignment coverage over genome
Number of alignment records processed
- Number of filtered records (total)
- Number of filtered records (due to duplicates)
- Number of filtered records (due to MAPQ)
- Number of filtered records (due to being unmapped)
Coverage MAD
Median Bin Count
Number of target intervals
Number of normal samples
Number of segments
Number of amplifications
Number of deletions
Number of PASS amplifications
Number of PASS deletions

Intermediate and Visualization Files

All files have a structure similar to a BED file with optional header line(s).

Target Counts

The *.target.counts.gz file has the following columns:

Contig identifier
Start position
End position
Target interval name
Count of alignments in this interval
Count of improperly paired alignments in this interval

Header lines are also included that start with #. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.

An example of a *.target.counts.gz file is shown below.

#TARGET COUNTS FILE
##DRAGENVersion=<VERSION_INFO>
##DRAGENCommandLine=<CommandLineOptions>
...
contig  start  stop   name                <SampleName> improper_pairs
1       565480 565959 target-wgs-1-565480 7          6
1       566837 567182 target-wgs-1-566837 9          0
1       713984 714455 target-wgs-1-713984 34         4
1       721116 721593 target-wgs-1-721116 47         1
1       724219 724547 target-wgs-1-724219 24         21
1       725166 725544 target-wgs-1-725166 43         12
1       726381 726817 target-wgs-1-726381 47         14
1       753243 753655 target-wgs-1-753243 31         2
1       754322 754594 target-wgs-1-754322 27         0
1       754594 755052 target-wgs-1-754594 41         0
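
A file with this layout can be read with a few lines of Python. The sketch below is an illustration under stated assumptions: lines starting with # are treated as header lines, the first remaining line is assumed to be the column header, and fields are split on any whitespace. The function name is hypothetical, and the embedded text reuses two rows from the example above.

```python
from typing import Dict, List, Union

EXAMPLE = """\
#TARGET COUNTS FILE
contig  start  stop   name                <SampleName> improper_pairs
1       565480 565959 target-wgs-1-565480 7          6
1       566837 567182 target-wgs-1-566837 9          0
"""

Row = Dict[str, Union[str, int, float]]


def parse_target_counts(text: str) -> List[Row]:
    """Parse target-counts text into one dictionary per target interval."""
    rows: List[Row] = []
    header_seen = False
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line or meta-information header
        if not header_seen:
            header_seen = True  # column header line, not a data row
            continue
        fields = line.split()
        rows.append({
            "contig": fields[0],
            "start": int(fields[1]),
            "stop": int(fields[2]),
            "name": fields[3],
            # float so the same parser also accepts GC-corrected counts
            "count": float(fields[4]),
            "improper_pairs": int(fields[5]),
        })
    return rows


if __name__ == "__main__":
    for row in parse_target_counts(EXAMPLE):
        print(row["name"], row["count"], row["improper_pairs"])
```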

B-Allele counts

B-allele counts are written both to gzipped tsv file *.ballele.counts.gz and gzipped bedgraph file *.baf.bedgraph.gz.

B-allele tsv

The tsv file format is the following:

Contig identifier
Start, BED-style (zero-based inclusive) start position of the reference allele
Stop, BED-style (zero-based exclusive) stop position of the reference allele
Base sequence for the reference allele
Base sequence for the first allele being counted
Base sequence for the second allele being counted
The number of qualified reads containing a sequence matching the first allele
The number of qualified reads containing a sequence matching the second allele

Additionally, in the case of B-allele sites from a population VCF, the following two additional columns are added after the columns listed above:

Population frequency for the first allele
Population frequency for the second allele

An example of B-allele counts file is provided below:

contig  start   stop    refAllele       allele1 allele2 allele1Count    allele2Count
chr1    11021   11022   G       G       A       4       2
chr1    14463   14464   A       A       T       111     36
chr1    16494   16495   G       G       C       122     262
chr1    38741   38742   C       C       T       9       9
chr1    39014   39015   A       A       C       38      48
chr1    39260   39261   T       T       C       199     143
chr1    48447   48448   C       C       T       8       15
chr1    48517   48518   A       A       G       13      15
chr1    91485   91486   G       G       C       1       4
chr1    91489   91490   A       A       G       1       3
chr1    98944   98945   C       C       T       46      114

B-allele bedgraph

The bedgraph file format is similar to the BED format and it has the following columns:

Contig identifier
Start
Stop
Ratio of allele counts

When the priority of allele1 is higher than the priority of allele2, the output frequency is calculated by:

allele1Count / (allele1Count + allele2Count)

When the priority of allele2 is higher than the priority of allele1, the output frequency is calculated by:

allele2Count / (allele1Count + allele2Count)

An example of the bedgraph file is shown below:

chr1    11021   11022   0.333333
chr1    14463   14464   0.755102
chr1    16494   16495   0.317708
chr1    38741   38742   0.5
chr1    39014   39015   0.44186
chr1    39260   39261   0.581871
chr1    48447   48448   0.652174
chr1    48517   48518   0.464286
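
The relationship between the two files can be sketched as below. The function name is hypothetical, and which allele has priority is passed in by the caller because the rule that assigns priority is not described in this section.

```python
from typing import Optional


def bedgraph_ratio(allele1_count: int, allele2_count: int,
                   allele1_has_priority: bool) -> Optional[float]:
    """Ratio written to the bedgraph for one B-allele site."""
    total = allele1_count + allele2_count
    if total == 0:
        return None  # no qualified reads at this site
    chosen = allele1_count if allele1_has_priority else allele2_count
    return chosen / total


if __name__ == "__main__":
    # Counts taken from the B-allele tsv example; the priority flags are
    # chosen so that the results match the bedgraph example.
    print(round(bedgraph_ratio(111, 36, True), 6))
    print(round(bedgraph_ratio(4, 2, False), 6))
```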

Bias correction

The file *.target.counts.gc-corrected.gz contains the GC-corrected read counts per target interval. The format is equivalent to that of the *.target.counts.gz file:

Contig identifier
Start position
End position
Target interval name
GC-corrected read counts in this interval
Count of improperly paired alignments in this interval

Header lines are also included that start with #. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.

An example of a *.target.counts.gc-corrected.gz file is shown below.

#GC CORRECTED FILE
##DRAGENVersion=<VERSION_INFO>
##DRAGENCommandLine=<CommandLineOptions>
...
contig  start   stop    name    <SampleName> improper_pairs
chr1    818022  819840  target-wgs-chr1-818022:819840   1071.353133     6
chr1    819840  821337  target-wgs-chr1-819840:821337   1051.014997     19
chr1    821337  822485  target-wgs-chr1-821337:822485   1098.6502       10
chr1    822485  824431  target-wgs-chr1-822485:824431   1117.28308      7
chr1    830446  832304  target-wgs-chr1-830446:832304   1102.211816     1
chr1    832304  834311  target-wgs-chr1-832304:834311   1004.822683     5
chr1    836677  838659  target-wgs-chr1-836677:838659   1015.973037     7
chr1    841054  843056  target-wgs-chr1-841054:843056   1014.921403     3

Combined counts

The file *.combined.counts.txt.gz is a column-wise concatenation of individual *.target.counts.gz and *.target.counts.gc-corrected.gz used to form the panel of normals.

Normalization

The *.tn.tsv.gz file has the following columns:

Contig identifier
Start position
End position
Target interval name
Log2-normalized read counts in this interval
Count of improperly paired alignments in this interval

Header lines are also included that start with #.

An example of a *.tn.tsv.gz file is shown below.

#title = Normalized coverage profile
#sex = UNDETERMINED
contig  start   stop    name    <SampleName> improper_pairs
chr1    818022  819840  target-wgs-chr1-818022:819840   -0.18479358083014644    6
chr1    819840  821337  target-wgs-chr1-819840:821337   -0.21244441644669046    19
chr1    821337  822485  target-wgs-chr1-821337:822485   -0.14849555308041734    10
chr1    822485  824431  target-wgs-chr1-822485:824431   -0.12423291178926463    7
chr1    830446  832304  target-wgs-chr1-830446:832304   -0.1438261733656668     1
chr1    832304  834311  target-wgs-chr1-832304:834311   -0.27728673450293895    5
chr1    836677  838659  target-wgs-chr1-836677:838659   -0.26136555699676262    7

Segmentation

File extension: *.seg, *.seg.called, *.seg.called.merged

The *.seg file has the following columns:

Sample name
Contig identifier
Start position
End position
Number of intervals in the segment
Linear copy-ratio of the segment

An example of a *.seg file is shown below.

Sample  Chromosome      Start   End     Num_Probes      Segment_Mean
<SampleName> chr1    818022  1117426 224     0.82500341336435279
<SampleName> chr1    1117426 4063702 2438    0.91726081432236528
<SampleName> chr1    4063702 4067591 3       0.38861386123247205
<SampleName> chr1    4067591 7705829 3302    0.93021316913709917
<SampleName> chr1    7705829 9357003 1405    0.98147825043799442
<SampleName> chr1    9357003 9377365 19      0.50269670724395654
<SampleName> chr1    9377365 12859821        2905    1.0684818476332989

The *.seg.called file is identical to the *.seg file, with an additional column indicating the initial call for whether the segment is a duplication (+) or a deletion (-).

QUAL
FILTER
Copy number assignment
Ploidy
Improper_Pairs count

B-allele segmentation (Somatic)

BAF_SLM_STATE: Integer between 0 and 10, indicating bins of minor-allele fraction (low to high), or . when the BAF data are too variable to estimate a minor-allele fraction.

An example of segmentation output file is shown below:

Sample  Chromosome      Start   End     Num_Probes      Segment_Mean    BAF_SLM_STATE
<SampleName> chr1    820348  1104646 194     0.29301737166888697     6
<SampleName> chr1    1105091 1533754 444     0.26185904799069076     5
<SampleName> chr1    1533810 1534166 9       0.41958837071702065     8
<SampleName> chr1    1534217 9356793 6689    0.26034515815016335     5
<SampleName> chr1    9358304 9376529 27      0.46450553586280602     10
<SampleName> chr1    9378480 12859495        1651    0.24172965924359388     5

Model identification (Somatic)

The file *.cnv.purity.coverage.models.tsv describes the different tested models and their log-likelihood. It has columns:

Model purity (Cellularity)
Model diploid coverage
Model log-likelihood

An example is shown below:

#Purity Coverage        logL
1       384     -23441740.5209
0.99    566     -22926572.4287
0.99    726     -23281869.1423
0.99    1206    -24075475.1481
0.99    1836    -24334376.579
0.99    2256    -24380290.0335
0.99    2696    -24380616.8655
0.98    449     -23988016.7101

Visualization

The following IGV tracks are automatically populated in the output IGV session file:

*.target.counts.bw --- BigWig representation of the target counts bins. Setting the track view in IGV to barchart or points is recommended. Values are GC-corrected if GC correction was performed.
*.improper_pairs.bw --- BigWig representation of the improper pairs counts. Setting the track view in IGV to barchart is recommended.
*.tn.bw --- BigWig representation of the tangent normalized signal. Setting the track view in IGV to points is recommended.
*.seg.bw --- BigWig representation of the segments. Setting the track view in IGV to points is recommended.
*.baf.bedgraph.gz --- BED graph representation of B-allele frequency (if available). Setting the track view in IGV to points is recommended.
*.cnv.gff3 --- GFF3 representation of the CNV events. DEL events show as blue and DUP events show as red. Filtered events are a light gray. If REF events are enabled, then they will show up as green. An example of DRAGEN CNV gff3 is shown below (different CNV workflows might output different attributes on the 9th column):

##gff-version 3
chr1    DRAGEN  CNV     818023  9357003 1000    .       .       Start=818022;Stop=9357003;Length=8538981;Alt=<DEL>,<DUP>;Qual=1000;Filter=PASS;Genotype=1/2;CopyNumber=2;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=2.290614;MinorCopyNumberFloat=0.004158;BiasCorrectedReadCount=1138.9;MinorAlleleFrequency=0.259;BinCount=7372;ImproperPairsCount=6,157;NumAllelicSites=7336;color=#FF00FF;
chr1    DRAGEN  CNV     9357004 9377365 534     .       .       Start=9357003;Stop=9377365;Length=20362;Alt=<DEL>;Qual=534;Filter=cnvLength;Genotype=1/1;CopyNumber=0;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=0.147802;MinorCopyNumberFloat=0.073901;BiasCorrectedReadCount=623.5;MinorAlleleFrequency=0.5;BinCount=19;ImproperPairsCount=157,186;NumAllelicSites=27;color=#DDDDDD;
chr1    DRAGEN  CNV     9377366 36656983        1000    .       .       Start=9377365;Stop=36656983;Length=27279618;Alt=<DEL>,<DUP>;Qual=1000;Filter=PASS;Genotype=1/2;CopyNumber=3;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=2.910093;MinorCopyNumberFloat=0.068221;BiasCorrectedReadCount=1287.9;MinorAlleleFrequency=0.241;BinCount=22591;ImproperPairsCount=186,21;NumAllelicSites=18021;color=#FF0000;

For somatic WGS analyses, the following additional files are included in the IGV session xml:

*.baf.seq.bw --- BigWig representation of the BAF segments. Setting the track view in IGV to points is recommended.
*.tumor.baf.bedgraph.gz --- Bedgraph representation of the B-allele frequencies. Setting the track view in IGV to points and windowing function to none is recommended.

IGV Session

File extension: *.igv_session.xml

<?xml version="1.0" encoding="utf-8"?>
<Session genome="b37" hasGeneTrack="false" hasSequenceTrack="true" version="8">
    <Resources>
        <Resource path="example.cnv.gff3"/>
        <Resource path="example.cnv.excluded_intervals.bed.gz"/>
        <Resource path="example.target.counts.bw"/>
        <Resource path="example.improper.pairs.bw"/>
        <Resource path="example.tn.bw"/>
        <Resource path="example.seg.bw"/>
    </Resources>
    <Panel height="500" width="1200" name="DataPanel">
        ...
    </Panel>
</Session>

You can determine the correct encoding of a reference genome by going to File > Save Session... in IGV and then inspecting the generated igv_session.xml file.

Creating CNV coverage and BAF plots with third-party tools

DRAGEN CNV outputs can be ingested using third-party libraries in commonly used languages such as Python or R. The files typically used are:

*.target.counts.gz or *.target.counts.gc-corrected.gz, containing the number of alignments, or corrected alignments, per interval. Used to display the coverage profile across all intervals.
*.tn.tsv.gz, containing the log2-normalized copy ratio per interval.
*.baf.bedgraph.gz, if BAF is available, containing the BAF for each considered site. Used to display the BAF profile across all sites.

In all previously specified files, the format is similar to BED, allowing them to be loaded as any other tab-separated files.
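
As a sketch of how such a file could be loaded for plotting, the following uses only the Python standard library. The function names are hypothetical, and the loader assumes the layout shown for *.tn.tsv.gz: # header lines, one column header line, then whitespace-separated rows with the signal in the fifth column. The returned interval midpoints and values can be handed to any plotting library.

```python
import gzip
import os
import tempfile
from typing import List, Tuple

Point = Tuple[str, float, float]  # (contig, interval midpoint, value)


def load_profile(path: str) -> List[Point]:
    """Read a gzipped, BED-like DRAGEN CNV file into plottable points."""
    points: List[Point] = []
    header_seen = False
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue  # blank line or meta-information header
            if not header_seen:
                header_seen = True  # column header line
                continue
            fields = line.split()
            start, stop = int(fields[1]), int(fields[2])
            points.append((fields[0], (start + stop) / 2.0, float(fields[4])))
    return points


def demo() -> List[Point]:
    """Write two rows from the *.tn.tsv.gz example to a temporary file and reload them."""
    text = (
        "#title = Normalized coverage profile\n"
        "contig\tstart\tstop\tname\tSampleName\timproper_pairs\n"
        "chr1\t818022\t819840\ttarget-wgs-chr1-818022:819840\t-0.18479358083014644\t6\n"
        "chr1\t819840\t821337\ttarget-wgs-chr1-819840:821337\t-0.21244441644669046\t19\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.tn.tsv.gz")
        with gzip.open(path, "wt") as handle:
            handle.write(text)
        return load_profile(path)


if __name__ == "__main__":
    for contig, midpoint, value in demo():
        print(contig, midpoint, value)
```

The same loader works for the target counts files, since the value of interest is also in the fifth column there; the bedgraph file has its value in the fourth column and no column header, so it needs a small variation.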

A similar workflow can be used to plot copy number calls (and minor copy number calls, if available) by using the *.cnv.gff3 output file. Some examples of DRAGEN output GFF3 are shown below:

Germline WGS

chr1	DRAGEN	CNV	818023	1426288	52	.	.	Alt=REF;LinearCopyRatio=1.03497;CopyNumber=2;Genotype=./.;Qual=52;Filter=PASS;Start=818022;Stop=1426288;Length=608266;BinCount=491;ImproperPairsCount=1,7;color=#00FF00;
chr1	DRAGEN	CNV	1426289	1428354	22	.	.	Alt=DEL;LinearCopyRatio=0.411841;CopyNumber=1;Genotype=0/1;Qual=22;Filter=cnvLength;dinucQual;Start=1426288;Stop=1428354;Length=2066;BinCount=2;ImproperPairsCount=7,16;color=#DDDDDD;

Somatic WGS

chr1    DRAGEN  CNV     818023  9357003 1000    .       .       Start=818022;Stop=9357003;Length=8538981;Alt=<DEL>,<DUP>;Qual=1000;Filter=PASS;Genotype=1/2;CopyNumber=2;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=2.290614;MinorCopyNumberFloat=0.004158;BiasCorrectedReadCount=1138.9;MinorAlleleFrequency=0.259;BinCount=7372;ImproperPairsCount=6,157;NumAllelicSites=7336;color=#FF00FF;
chr1    DRAGEN  CNV     9357004 9377365 534     .       .       Start=9357003;Stop=9377365;Length=20362;Alt=<DEL>;Qual=534;Filter=cnvLength;Genotype=1/1;CopyNumber=0;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=0.147802;MinorCopyNumberFloat=0.073901;BiasCorrectedReadCount=623.5;MinorAlleleFrequency=0.5;BinCount=19;ImproperPairsCount=157,186;NumAllelicSites=27;color=#DDDDDD;
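As a minimal sketch of this workflow, the following Python splits one *.cnv.gff3 record into its fixed columns and a dictionary of attributes. The function name and the returned keys are illustrative only. One assumption is worth noting: in the germline example above, the Filter value itself contains a semicolon (cnvLength;dinucQual), so the sketch appends any token without "=" to the value of the preceding key.

```python
def parse_cnv_gff3_line(line):
    """Parse one GFF3 record into fixed columns plus an attribute dict.

    Hypothetical helper, not part of DRAGEN. Columns are split on any
    whitespace so that both tab- and space-separated examples work.
    """
    cols = line.split(None, 8)
    if len(cols) != 9:
        raise ValueError("expected 9 GFF3 columns")
    seqid, _source, _feature, start, end, score, _strand, _phase, attr_text = cols

    attrs = {}
    last_key = None
    for token in attr_text.strip().strip(";").split(";"):
        if "=" in token:
            last_key, value = token.split("=", 1)
            attrs[last_key] = value
        elif last_key is not None and token:
            # Assumption: a bare token continues the previous value,
            # for example Filter=cnvLength;dinucQual
            attrs[last_key] += ";" + token
    return {
        "contig": seqid,
        "start": int(start),
        "end": int(end),
        "score": float(score),
        "attributes": attrs,
    }


example = (
    "chr1\tDRAGEN\tCNV\t1426289\t1428354\t22\t.\t.\t"
    "Alt=DEL;LinearCopyRatio=0.411841;CopyNumber=1;Genotype=0/1;Qual=22;"
    "Filter=cnvLength;dinucQual;Start=1426288;Stop=1428354;Length=2066;"
    "BinCount=2;ImproperPairsCount=7,16;color=#DDDDDD;"
)
record = parse_cnv_gff3_line(example)
print(record["contig"], record["start"], record["end"],
      record["attributes"]["CopyNumber"], record["attributes"]["Filter"])
```

The parsed start, end, and CopyNumber values can then be passed to any plotting library to draw the copy number segments.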

Excluded Intervals File

An example of a *.excluded_intervals.bed.gz file is shown below:

chr1    0       818022  NON_KMER_UNIQUE
chr1    824431  830446  NON_KMER_UNIQUE
chr1    834311  836677  NON_KMER_UNIQUE
chr1    838659  841054  NON_KMER_UNIQUE
chr1    850451  853257  NON_KMER_UNIQUE
chr1    855442  860261  NON_KMER_UNIQUE
chr1    866189  868833  NON_KMER_UNIQUE
chr1    881779  884116  NON_KMER_UNIQUE
chr1    1016667 1018959 NON_KMER_UNIQUE
chr1    1075880 1079718 NON_KMER_UNIQUE
chr1    1137942 1140725 NON_KMER_UNIQUE

Panel of Normals Files

PON Metrics File

Example:

contig  start   stop    name    mean    std     normalizedStd min     25%     50%     75%     max     intervalSize    gcContents
1       12098   12178   target-wes-1-12098:12178/1      3.6259044560802365      0.46661435469856077      0.1286890927079175     2.7961783439490446      3.2573018790849675      3.7105263157894739      4.0162683823529415      4.3298969072164946      80      0.49382716049382713
1       12178   12258   target-wes-1-12178:12258/2      5.0685579775753595      0.70638315915955963      0.13936570564740217     3.9044585987261144      4.5225944682508761      5.067708333333333       5.5778115844038769      6.3277777777777775      80      0.46913580246913578
1       12553   12637   target-wes-1-12553:12637/1      4.6990858287992054      0.62537786269786677      0.13308500535681309     3.7417218543046356      4.0305632538350444      5.0382165605095546      5.2151580459770113      5.5773195876288657      84      0.6705882352941176

PON Correlation File

Example:

Correlation of case sample CASE_SAMPLE_NAME
  PON1: 0.9786
  PON2: 0.9868
  PON3: 0.9912
  ...

SegDups Extension Files

The final output has extension .cnv.segdups.rescued_intervals.tsv.gz, and contains the rescued target intervals which can then be injected before segmentation. It has columns:

Chromosome name
Start - 0-based inclusive
Stop - 0-based exclusive
Target interval name prefixed with "target-wgs-"
Sample Counts (in header, identifier taken from RGSM) - log2-scale normalized counts for each interval
Improper pairs - Kept for compatibility with CNV workflow, set to 0 for rescued intervals
Target region ID - ID of the target region (aka pair of rescued target intervals)

Intermediate files

The joint normalized coverage profile (log2-scale) for each region is written to the file .cnv.segdups.joint_coverage.tsv.gz with columns:

Target region ID
Joint normalized coverage (log2-scale) of the two intervals in the target region
Copy Number Float - estimate of joint copy number for the target region (e.g., CNF ~ 4)

The data for the differentiating sites is written to the file .cnv.segdups.site_ratios.tsv.gz with columns:

Differentiating site name
Target (gene A) counts at site
Non-target (gene B) counts at site
Target ratio: gene A counts over total (i.e., gene A + gene B) counts at site

Small Variant Calling

The DRAGEN Small Variant Caller is a high-speed haplotype caller implemented with a hybrid of hardware and software. The caller performs localized de novo assembly in regions of interest to generate candidate haplotypes, and then performs read likelihood calculations using a hidden Markov model (HMM).

Variant calling is disabled by default. To enable variant calling, set the --enable-variant-caller option to true. The VCF header is annotated with ##source=DRAGEN_SNV to indicate the file is generated by the DRAGEN SNV pipeline.

The Variant Caller Algorithm

The DRAGEN Small Variant Caller performs the following steps:

Active Region Identification---Identifies areas where multiple reads disagree with the reference, and selects windows around them (active regions) for processing.

Localized Haplotype Assembly--- Assembles all overlapping reads in each active region into a de Bruijn graph (DBG). A DBG is a directed graph based on overlapping K-mers (length K subsequences) in each read or multiple reads. When all reads are identical, the DBG is linear. Where there are differences, the graph forms bubbles of multiple paths that diverge and rejoin. If the local sequence is too repetitive and K is too small, cycles can form, which invalidate the graph. DRAGEN uses K=10 and 25 as the default values. If those values produce an invalid graph, then additional values of K=35, 45, 55, 65 are tried until a cycle-free graph is obtained. From this cycle-free DBG, DRAGEN extracts every possible path to produce a complete list of candidate haplotypes, ie, hypotheses for what the true DNA sequence might be on at least one strand. In addition to graph assembly, haplotypes are also generated via columnwise detection, with candidate variant events identified directly from BAM alignments. Columnwise detection is enabled by default in all small variant calling pipelines and is supplementary to the DBG, but is especially useful in highly repetitive regions where DBG assembly of reads is more likely to fail.

Haplotype Alignment---Uses the Smith-Waterman algorithm to align each extracted haplotype to the reference genome. The alignments determine what variations from the reference are present.

Read Likelihood Calculation---Tests each read against each haplotype to estimate a probability of observing the read assuming the haplotype was the true original DNA sampled. This calculation is performed by evaluating a pair hidden Markov model (HMM), which accounts for the various possible ways the haplotype might have been modified by PCR or sequencing errors into the read observed. The HMM evaluation uses a dynamic programming method to calculate the total probability of any series of Markov state transitions arriving at the observed read.

Genotyping---Forms the possible diploid combinations of variant events from the candidate haplotypes and, for each combination, calculates the conditional probability of observing the entire read pileup. Calculations use the constituent probabilities of observing each read, given each haplotype from the pair HMM evaluation. These calculations feed into the Bayesian formula to calculate the likelihood that each genotype is the genotype of the sample being analyzed, given the entire read pileup observed. Genotypes with maximum likelihood are reported.
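The retry logic in the Localized Haplotype Assembly step can be illustrated with a small software sketch. This is not the DRAGEN implementation, which is hardware accelerated; the function names are hypothetical, graph nodes are plain K-mers, and the toy example uses very small K values so that the cycle is easy to see. The sketch builds a de Bruijn graph for each K in turn and returns the first K that yields a cycle-free graph.

```python
from collections import defaultdict


def build_dbg(reads, k):
    """Build a de Bruijn graph: nodes are K-mers, edges link consecutive K-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def has_cycle(graph):
    """Detect a directed cycle with an iterative depth-first search."""
    white, gray, black = 0, 1, 2
    color = defaultdict(int)
    for root in list(graph):
        if color[root] != white:
            continue
        color[root] = gray
        stack = [(root, iter(graph.get(root, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if color[nxt] == gray:
                    return True  # back edge: the graph contains a cycle
                if color[nxt] == white:
                    color[nxt] = gray
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                color[node] = black
                stack.pop()
    return False


def first_acyclic_k(reads, k_values):
    """Return (k, graph) for the first K that gives a cycle-free graph, else (None, None)."""
    for k in k_values:
        graph = build_dbg(reads, k)
        if not has_cycle(graph):
            return k, graph
    return None, None


# Toy read containing the repeat ACGT twice; K=3 produces a cycle, K=5 does not.
chosen_k, chosen_graph = first_acyclic_k(["ACGTACGTTT"], (3, 5))
print(chosen_k)
```

Candidate haplotypes would then correspond to the paths through the returned graph.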

Read Filtering and Reporting of VCF DP Fields

In most pipelines, DRAGEN reports two types of depth counts, both of which may differ from the information in the BAM pileup due to various filtering steps that are applied throughout variant calling. Briefly:

Unfiltered depth is the number of reads covering the position, downstream of any read collapsing or deduplication that may have preceded the variant calling step, but upstream of most read filtering and overlapping mate handling. Unfiltered depth is reported as INFO/DP, except in the case of gVCF homref calls, where it is reported as FORMAT/DP.
Informative depth is the number of reads actually used to make the calling decision, where filtered reads and uninformative reads (reads that could not be assigned to a specific allele) have been excluded, and overlapping mate pairs are counted as single reads. When overlapping mate pairs are present, this may cause an apparent discrepancy between the reported depth and the pileup as viewed in a browser such as IGV. To resolve this, use the "View as pairs" option in IGV. Informative depth is reported as FORMAT/DP, except in the case of gVCF homref calls, where it is not reported. The FORMAT/AD and FORMAT/AF fields are based on informative depth.

The following figure summarizes the different filtering steps in more detail.

Filter 1 acts on the reads present in the BAM input (in UMI pipelines, these are the collapsed reads produced by the read collapsing step, not the raw reads) and filters out the following reads:
- Duplicate reads.
- Soft-clipped bases. DRAGEN filters out soft-clipped bases only when calculating coverage reports.
- [Somatic] Reads with MAPQ=0.
- [Somatic] Reads with MAPQ < vc-min-tumor-read-qual, where vc-min-tumor-read-qual >1.
Filter 2 trims bases with BQ < 10 and filters out the following reads:
- Unmapped reads.
- Secondary reads.
- Reads with bad cigars.
Filter 3 occurs after downsampling and HMM. Filter 3 filters out the following reads:
- Reads that are badly mated. A badly mated read is a read where the pair is mapped to two different reference contigs.
- Disqualified reads. Reads are disqualified if their HMM score is below a threshold.
Filter 4 occurs after the genotyper runs. The genotyper adds annotation information to the FORMAT field. Filter 4 filters out reads that are not informative. For example, if the HMM scores of the read against two different haplotypes are almost equal, the read is filtered out because it does not provide enough information to distinguish which of the two haplotypes is more likely.

Mosaic Calling

Since DRAGEN 4.3, the mosaic small variant caller runs downstream of the germline small variant caller. Non-cancer post-zygotic mosaic variants detected by the mosaic caller, which typically have an AF lower than 50%, are reported in the output VCF file with a MOSAIC INFO flag. By default, MOSAIC-tagged variants with AF smaller than 20% are filtered with the MosaicLowAF filter.

See Mosaic detection for further details on the mosaic small variant caller and the mosaic detection mode and a comparison with DRAGEN 4.2 features.

Variant Caller Options

The following options control the variant caller stage of the DRAGEN host software.

--enable-variant-caller
Set --enable-variant-caller to true to enable the variant caller stage for the DRAGEN pipeline.
--vc-target-bed
[Optional] Restricts processing of the small variant caller, target BED related coverage, and callability metrics to regions specified in a BED file. The BED file is a text file containing at least three tab-delimited columns. The first three columns are chromosome, start position, and end position. The positions are zero-based. For example:

# header information
chr11 0 246920
chr11 255660 255661

If the reference span of the variant overlaps with any of the regions in the target BED, then the variant is output. If the reference span does not overlap, the variant is not output. For SNPs and Insertions, the reference span is 1 bp. For deletions, the reference span is the length of the deletion.

--vc-target-bed-padding
[Optional] Pads all target BED regions with the specified value. For example, if a BED region is 1:1000–2000 and the specified padding value is 100, the result is equivalent to using a BED region of 1:900–2100 and a padding value of 0. The padding specified with --vc-target-bed-padding is used by the small variant caller and by the target BED coverage/callability reports. The default padding is 0.
--vc-target-coverage
Specifies the target coverage for downsampling. The default value is 500 for germline mode and 50 for somatic mode.
--vc-remove-all-soft-clips
Set to true to ignore soft-clipped bases during the haplotype assembly step.
--vc-decoy-contigs
Specifies a comma-separated list of contigs to skip during variant calling. This option can be set in the configuration file.
--vc-enable-decoy-contigs
Set to true to enable variant calls on the decoy contigs. The default value is false.
--vc-enable-phasing
Enable variants to be phased when possible. The default value is true.
--vc-combine-phased-variants-distance
Set the maximum distance in base pairs between phased variants to be combined. The default value is 0, which disables the option. To enable the combining of phased variants, the recommended distance is 15 base pairs. The valid range is [0; 15].
--vc-enable-mosaic-detection
Set to true to enable DRAGEN mosaic detection with mosaic AF filter threshold set to 0.0. Set to false to disable DRAGEN mosaic detection. The default is true with mosaic AF filter threshold set to 0.2.
--vc-mosaic-af-filter-threshold
Set the allele frequency threshold for the application of the MosaicLowAF filter to mosaic calls. All MOSAIC tagged variants with AF smaller than the AF threshold are filtered with the MosaicLowAF filter. The default mosaic AF filter threshold is set to 0.2 when the germline variant caller is enabled. The AF default threshold is set to 0.0 when the mosaic detection mode is enabled with --vc-enable-mosaic-detection=true.
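The overlap rule described for --vc-target-bed and --vc-target-bed-padding can be sketched as follows. This is only an illustration under stated assumptions: BED intervals are treated as zero-based and half-open (the usual BED convention), VCF positions are 1-based, and the reference span of a deletion is taken to be the deleted bases only, excluding the padding base. The function name is hypothetical.

```python
def variant_in_targets(chrom, pos, ref, alt, regions, padding=0):
    """Return True if the variant's reference span overlaps any padded target region.

    chrom, pos, ref, alt: VCF-style fields (pos is 1-based).
    regions: iterable of (chrom, start, end), zero-based, end exclusive.
    """
    if len(ref) > len(alt):
        # Deletion: span covers the deleted bases (assumption: anchor base excluded).
        span_start = pos
        span_end = pos + (len(ref) - len(alt))
    else:
        # SNP or insertion: the reference span is 1 bp.
        span_start = pos - 1
        span_end = pos
    for region_chrom, start, end in regions:
        if region_chrom != chrom:
            continue
        padded_start = max(0, start - padding)
        padded_end = end + padding
        if span_start < padded_end and span_end > padded_start:
            return True
    return False


targets = [("chr11", 0, 246920), ("chr11", 255660, 255661)]
print(variant_in_targets("chr11", 255661, "A", "G", targets))             # SNP inside the 1 bp target
print(variant_in_targets("chr11", 255662, "A", "G", targets))             # SNP just outside
print(variant_in_targets("chr11", 255662, "A", "G", targets, padding=1))  # rescued by padding
```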

Downsampling Options for Small Variant Calling

You can use the following options for downsampling reads in the small variant calling pipeline.

For mitochondrial small variant calling, the downsampling options can be set separately because the mitochondrial contig contains a higher depth than the rest of the contigs in a WGS data set. The following are the downsampling options for the mitochondrial contig.

--vc-target-coverage-mito
--vc-max-reads-per-active-region-mito
--vc-max-reads-per-raw-region-mito

The target coverage and max/min reads in raw/active region options are not directly related and can be triggered independently.

The following are the default downsampling values for each small variant calling mode.

The target coverage downsampling step runs first and is meant to limit the total coverage at a given position. This step is approximate, and the coverage after downsampling at a given position can be slightly higher than the threshold because of the --vc-min-reads-per-start-pos setting.

If the number of reads sharing a start position is equal to or lower than --vc-min-reads-per-start-pos (default value 10), that start position is skipped for downsampling. This ensures that at least the minimum number of reads is kept at any start position.

The next downsampling step is to apply the --vc-max-reads-per-raw-region and --vc-max-reads-per-active-region limits. These options are used to limit the total number of reads in an entire region using a leveling downsampling method.

This downsampling mechanism scans each start position from the start boundary of the region and discards one read from that position, then moves on to the next position, until the total number of reads falls below the threshold. It can potentially take several passes across the entire region for the total number of reads in the entire region to fall below the threshold. After the threshold is met, the downsampling step is stopped regardless of which position was considered last in the region.

When downsampling occurs, the choice of which reads to keep or remove is random. However, the random number generator is seeded to a default value to make sure that the generator produces the same set of values in each run. This ensures reproducible results, which means there is no run to run variation when using the same input data.
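The leveling method described above can be sketched in a few lines of Python. This is an illustrative model, not DRAGEN code: the function name and the data layout (a dictionary from start position to a list of read names) are assumptions, and the loop stops as soon as the total no longer exceeds the limit. A fixed seed makes the random choice of discarded reads reproducible between runs.

```python
import random


def level_downsample(reads_by_start, max_reads, seed=0):
    """Discard one random read per start position, pass after pass,
    until the region holds at most max_reads reads."""
    if max_reads < 0:
        raise ValueError("max_reads must be non-negative")
    rng = random.Random(seed)  # fixed seed: same reads are removed on every run
    buckets = {pos: list(reads) for pos, reads in sorted(reads_by_start.items())}
    total = sum(len(reads) for reads in buckets.values())
    while total > max_reads:
        for pos in buckets:  # scan from the start boundary of the region
            if total <= max_reads:
                break  # stop immediately, wherever the scan currently is
            if buckets[pos]:
                buckets[pos].pop(rng.randrange(len(buckets[pos])))
                total -= 1
    return buckets


region = {100: ["a", "b", "c", "d"], 101: ["e"], 105: ["f", "g", "h"]}
kept = level_downsample(region, max_reads=5)
print({pos: len(reads) for pos, reads in kept.items()})
```

In this toy region, a single pass removes one read at each of the three start positions, taking the total from 8 to 5.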

gVCF Output

A genomic VCF (gVCF) file contains information on variants and positions determined to be homozygous to the reference genome. For homozygous regions, the gVCF file includes statistics that indicate how well reads support the absence of variants or alternative alleles. The gVCF file includes an artificial <NON_REF> allele. Reads that do not support the reference or any variants are assigned the <NON_REF> allele. DRAGEN uses these reads to determine if the position can be called as a homozygous reference, as opposed to remaining uncalled. The resulting score represents the Phred-scaled level of confidence in a homozygous reference call. In germline mode, the score is FORMAT/GQ and in somatic mode the score is FORMAT/SQ.

The following options are available to enable and control gVCF output.

--vc-emit-ref-confidence
To enable gVCF output, set to GVCF. By default, contiguous runs of homozygous reference calls with similar scores are collapsed into blocks (hom-ref blocks). Hom-ref blocks save disk space and processing time of downstream analysis tools. DRAGEN recommends using the default mode.
To produce unbanded output, set --vc-emit-ref-confidence to BP_RESOLUTION.
--vc-enable-vcf-output
To enable VCF file output during a gVCF run, set to true. The default value is false.
--vc-gvcf-bands
If using the default --vc-emit-ref-confidence gvcf (banded mode), DRAGEN collapses gVCF records with a similar GQ or SQ score. By default, the cutoffs are 1 10 20 30 40 60 80 for germline and 1 3 10 20 50 80 for somatic. For example, to define the bands [0, 10), [10, 50), and ≥ 50 use --vc-gvcf-bands 10 50.
--vc-compact-gvcf
This option, when used for germline in conjunction with --vc-emit-ref-confidence gvcf, produces a much smaller gVCF output file than the default. It can be used when the gVCF is destined for ingestion into gVCF Genotyper, offering further savings on disk space and gVCF Genotyper runtime compared to the default. This option implies --vc-gvcf-bands 0 1 10 20 30 and additionally omits certain metrics that are not used by gVCF Genotyper. Note that files generated using this option will be rejected by the Pedigree Caller.
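To make the banding rule concrete, the following sketch maps a GQ or SQ score to its band for a given list of cutoffs, using the --vc-gvcf-bands 10 50 example above. The function name is hypothetical, and the sketch only illustrates how cutoffs partition the score range; it does not reproduce how DRAGEN merges adjacent records.

```python
from bisect import bisect_right


def band_label(score, cutoffs):
    """Return the half-open band that contains score, given sorted cutoffs."""
    index = bisect_right(cutoffs, score)
    lower = 0 if index == 0 else cutoffs[index - 1]
    if index == len(cutoffs):
        return f">= {lower}"
    return f"[{lower}, {cutoffs[index]})"


cutoffs = [10, 50]
for score in (9, 10, 49, 50):
    print(score, band_label(score, cutoffs))
```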

Not all entries in the gVCF are contiguous. The file might contain gaps that are not covered by either a variant line or a hom-ref block. The gaps correspond to regions that are not callable. A region is not callable if there is not at least one read mapped to the region with a MAPQ score above zero.

In germline mode, the thresholds for calling are lower for gVCFs than for VCFs. The gVCF output could show a different number of variants than a VCF run for the same sample. There is likely a different number of biallelic and multiallelic calls because gVCF mode includes all possible alleles at a locus, rather than only the two most likely alleles. This means that a biallelic call in the VCF can be output as a multiallelic call in the gVCF. The genotype in the gVCF still points to the two most likely alleles, so the variant call remains the same.

The following are example gVCF records that include a hom-ref block call and a variant call.

1 39224 . C <NON_REF> . PASS END=39260
GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT
0/0:2,0:2:3:1:0,3,37:0,3,37:3,0

1 39261 . T C,<NON_REF> 15.59 PASS
DP=3;MQ=12.73;MQRankSum=0.736;ReadPosRankSum=0.736;FractionInformativeReads=1.000
GT:AD:AF:DP:F1R2:F2R1:GQ:PL:SPL:ICNT:GP:PRI:SB:MB
0/1:1,2,0:0.667,0.000:3:1,0,0:0,2,0:5:49,0,1,69,7,75:66,0,8:1,0:1.5592e+01,1.5915e+00,5.5412e+00,7.0100e+01,4.3330e+01,8.0068e+01:0.00,34.77,37.77,34.77,69.54,37.77:0,1,0,2:0,1,2,0

In a single-sample gVCF, the FORMAT/DP reported at a homref position is the median DP in the band, and AD is the corresponding value, so the sum of AD equals DP even in a homref band. The minimum DP is also computed and reported as MIN_DP for the band.

QUAL, QD, and GQ Formulation

In single sample VCF and gVCF, the QUAL follows the definition of the VCF specification. For more information on the VCF specification, see the most current VCF documentation available on samtools/hts-specs GitHub repository.

QUAL is the Phred-scaled probability that the site has no variant and is computed as:
```
QUAL = -10*log10 (posterior genotype probability of a
homozygous-reference genotype (GT=0/0))
```
That is, QUAL = GP (GT=0/0), where GP = posterior genotype probability in Phred scale. QUAL = 20 means there is a 99% probability that there is a variant at the site. The GP values are also given in Phred scale in the VCF file.
GQ for non-homref calls is the Phred-scaled probability that the call is incorrect. GQ=-10*log10(p), where p is the probability that the call is incorrect. GQ=-10*log10(sum(10.^(-GP(i)/10))) where the sum is taken over the GT that did not win. GQ of 3 indicates a 50 percent chance that the call is incorrect, and GQ of 20 indicates a 1 percent chance that the call is incorrect.
In gVCF mode, the evidence in favor of homozygous reference calls is also assessed. However, the posterior probability is not of interest in this case (with zero evidence, e.g. due to zero coverage, the strong prior in favor of homref would yield a strong posterior in favor of homref), so the value of GQ for homref calls reflects the evidence directly. It is defined using the ratio between the likelihoods of the homref hypothesis and the strongest competing variant hypothesis: 10*log10[P(D|homref)/P(D|variant)], where D represents the pileup data.
QD is the QUAL normalized by the read depth, DP.
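The formulas above can be checked against the variant record shown in the gVCF example (chromosome 1, position 39261, GT 0/1), whose FORMAT/GP values are already Phred-scaled. The sketch below assumes that the GP list is ordered with the homozygous-reference genotype first and that the called genotype 0/1 is the second entry; the function names are illustrative.

```python
import math


def qual_from_gp(gp):
    """QUAL is the Phred-scaled posterior of the homozygous-reference genotype."""
    return gp[0]


def gq_from_gp(gp, called_index):
    """GQ = -10*log10 of the summed posterior of every genotype that did not win."""
    p_wrong = sum(10 ** (-value / 10) for i, value in enumerate(gp) if i != called_index)
    return -10 * math.log10(p_wrong)


def qd_from_qual(qual, depth):
    """QD is QUAL normalized by the read depth DP."""
    return qual / depth


# GP values copied from the example gVCF variant record (GT=0/1, QUAL=15.59, GQ=5).
gp_values = [1.5592e+01, 1.5915e+00, 5.5412e+00, 7.0100e+01, 4.3330e+01, 8.0068e+01]
qual = qual_from_gp(gp_values)
gq = gq_from_gp(gp_values, called_index=1)
print(round(qual, 2), round(gq))

# QUAL and DP copied from a record in the multi-allelic section (QUAL=49.88, DP=37, QD=1.35).
print(round(qd_from_qual(49.88, 37), 2))
```

The rounded results agree with the QUAL, GQ, and QD values reported in those records.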

Phasing and Phased Variants

DRAGEN supports output of phased variant records in both the germline and the somatic VCF and gVCF files. When two or more variants are phased together, the phasing information is encoded in a sample-level annotation, FORMAT/PS. FORMAT/PS identifies which set the phased variant is in. The value in the field is an integer representing the position of the first phased variant in the set. All records in the same contig with matching PS values belong to the same set.

##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Physical phasing
ID information, where each unique ID within a given sample (but not
across samples) connects records within a phasing group">

The following is an example of a DRAGEN single sample gVCF, where two SNPs are phased together.

chr1 1947645 . C T,<NON_REF> 48.44 PASS
DP=35;MQ=250.00;MQRankSum=4.983;ReadPosRankSum=3.217;FractionInformativeReads=1.000;R2_5P_bias=0.000
GT:AD:AF:DP:F1R2:F2R1:GQ:PL:SPL:ICNT:GP:PRI:SB:MB:PS
0|1:20,15,0:0.429:35:9,7,0:11,8,0:47:83,0,50,572,758,622:255,0,255:19,0:4.844e+01,8.387e-05,5.300e+01,4.500e+02,4.500e+02,4.500e+02:0.00,34.77,37.77,34.77,69.54,37.77:11,9,10,5:12,8,8,7:1947645

chr1 1947648 . G A,<NON_REF> 50.00 PASS
DP=36;MQ=250.00;MQRankSum=5.078;ReadPosRankSum=2.563;FractionInformativeReads=1.000;R2_5P_bias=0.000
GT:AD:AF:DP:F1R2:F2R1:GQ:PL:SPL:ICNT:GP:PRI:SB:MB:PS
1|0:16,20,0:0.556:36:8,9,0:8,11,0:48:85,0,49,734,613,698:255,0,255:16,0:5.000e+01,7.067e-05,5.204e+01,4.500e+02,4.500e+02,4.500e+02:0.00,34.77,37.77,34.77,69.54,37.77:10,6,11,9:8,8,12,8:1947645

During the genotyping step, all haplotypes and all variants are considered over an active region. For each pair of variants, if both variants occur on all of the same haplotypes or if either is a homozygous variant, then they are phased together. If the variants only occur on different haplotypes, then they are phased opposite to each other. If any heterozygous variants are present on some of the same haplotypes but not others, phasing is aborted and no phasing information is output for the active region.

Combine Phased Variants

Phased variant records that belong to the same phasing set can be combined into a single VCF record. For example, assuming reference at position chr2 115035 is A, the following two phased variants are combined.

chr2 115034 . G C GT:PS 0|1:115034
chr2 115036 . C T GT:PS 0|1:115034

The phased variants are combined as follows.

chr2 115034 . GAC CAT GT:PS 0|1:115034

The command-line option --vc-combine-phased-variants-distance specifies the maximum distance over which phased variants will be combined. The default value 0 disables the feature. When enabled, the option combines all phased variants in the phasing set that are within the provided distance value.
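The chr2 example above can be reproduced with a short sketch. It is a simplified model rather than the DRAGEN algorithm: it handles only SNVs that are already known to lie on the same haplotype of one phasing set, it measures distance from the end of the previous record to the POS of the next (an assumption), and it takes the reference bases between records from a hypothetical position-to-base lookup.

```python
def combine_phased_snvs(snvs, reference, max_distance):
    """Merge same-haplotype SNVs that lie within max_distance of each other.

    snvs: list of (pos, ref, alt) sorted by pos, all from one phasing set.
    reference: mapping from 1-based position to reference base (hypothetical lookup).
    max_distance: 0 disables combining, mirroring the option's default.
    """
    if max_distance == 0:
        return list(snvs)
    combined = []
    for pos, ref, alt in snvs:
        if combined:
            last_pos, last_ref, last_alt = combined[-1]
            last_end = last_pos + len(last_ref) - 1
            if pos - last_end <= max_distance:
                # Fill the gap with reference bases so the record stays contiguous.
                gap = "".join(reference[p] for p in range(last_end + 1, pos))
                combined[-1] = (last_pos, last_ref + gap + ref, last_alt + gap + alt)
                continue
        combined.append((pos, ref, alt))
    return combined


phased = [(115034, "G", "C"), (115036, "C", "T")]
ref_bases = {115035: "A"}
print(combine_phased_snvs(phased, ref_bases, max_distance=15))
```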

DRAGEN supports phasing of the genotypes listed in the table below. Only the first row in the table is relevant to somatic, since the somatic pipeline only emits 0/1 and 0|1 genotypes. MNV calls can still be phased with other variant calls that fall outside the phased variants distance.

Examples of diploid haplotypes where phasing is supported:

-------------------------------------------------------------- H0 ( REF ) 
-----------------x---------------------------y---------------- H1

-----------------x---------------------------y---------------- H1
-----------------x---------------------------y---------------- H1

Examples of diploid haplotypes where phasing is not supported:

-----------------x---------------------------y---------------- H1
---------------------------------------------y---------------- H2

-----------------x----------------------------y--------------- H1
----------------------------------------------z--------------- H2

By default in somatic mode, DRAGEN outputs all component SNVs and INDELs that make up an MNV along with the MNV call itself. MNVs and their component calls can be identified and linked to one another by a common value in the INFO.MNVTAG field. Setting --vc-mnv-emit-component-calls=false restricts which component calls are reported. In that case, when DRAGEN reports an MNV call, it considers the difference between the VAF of the MNV call and the VAF of each component call, and reports a given component call in addition to the MNV call only if this difference is greater than --vc-combine-phased-variants-max-vaf-delta (default: 0.1). The --vc-mnv-emit-component-calls and --vc-combine-phased-variants-max-vaf-delta options are only applicable in somatic mode and are not supported in germline mode. In germline mode, component calls cannot be output, and MNV calls are emitted without their component calls.

Variant Representation

DRAGEN outputs variants in a VCF file following variant normalization as described here https://genome.sph.umich.edu/wiki/Variant_Normalization. The normalization of a variant representation in VCF consists of two parts: parsimony and left alignment pertaining to the nature of a variant's length and position respectively.

Parsimony means representing a variant in as few nucleotides as possible without reducing the length of any allele to 0.
Left aligning a variant means shifting the start position of that variant to the left until it is no longer possible to do so.
A variant is normalized if and only if it is parsimonious and left aligned.
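The two properties can be illustrated with the widely used normalization procedure from the page linked above, applied to a single biallelic variant. This sketch is not DRAGEN code; the function name is hypothetical, the reference is passed as a plain string for one contig, and variants that would need to shift past the first base of the contig are rejected for simplicity.

```python
def normalize_variant(reference, pos, ref, alt):
    """Left-align and trim a biallelic variant.

    reference: contig sequence as a string; pos: 1-based position of ref[0].
    Returns (pos, ref, alt) in normalized form.
    """
    if ref == alt:
        raise ValueError("ref and alt are identical")
    changed = True
    while changed:
        changed = False
        # Trim a shared trailing base (parsimony on the right).
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # An empty allele is not allowed: extend both alleles one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of the contig start")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases while both alleles keep at least one base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# A CA deletion inside a CA repeat shifts to the leftmost equivalent position.
print(normalize_variant("GGGCACACACAGGG", 7, "ACA", "A"))
# A padded SNV is trimmed to a single base.
print(normalize_variant("ACGTACGT", 2, "CGT", "CTT"))
```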

Additional notes on variant representation in the DRAGEN VCF:

Reference-trimming of alleles: A single padding reference base is used to represent insertions and deletions (i.e. the reference base preceding the insertion or deletion is included).
Allele decomposition: by default, multi-nucleotide polymorphisms (MNPs) are represented as separate, contiguous individual SNV records in the VCF. If phasing can be determined, the FORMAT/GT is phased and the FORMAT/PS contains the coordinate position of the first variant in the set of phased variants. This determines which variants occurred on the same haplotype. Phased variant records that belong to the same phasing set can be combined into a single VCF record by setting the --vc-combine-phased-variants-distance command-line option to a non-zero value. When enabled, the option combines all phased variants in the phasing set that are within the provided distance value (specified in base pairs).

In some cases, such as complex variants in repetitive regions, some variants cannot be normalized (i.e. converted into a standard representation) or represented uniquely. To counteract this problem, when comparing two VCFs (e.g. a DRAGEN VCF against a truth set VCF), it is recommended to use the RTG vcfeval tool which performs variant comparisons using a haplotype-aware approach. RTG vcfeval has been adopted as the standard VCF comparison tool by GA4GH and PrecisionFDA https://www.biorxiv.org/content/biorxiv/early/2018/02/23/270157.full.pdf.

Multi-allelic Variants and Overlapping Variants

A multiallelic site is a specific locus in a genome that contains three or more observed alleles, counting the reference as one, and therefore allowing for two or more variant alleles. Multi-allelic calls are output in a single variant record in the VCF as follows:

chr1 2656216 . A T,C 107.65 PASS AC=1,1;AF=0.500,0.500;AN=2;DP=12;FS=0.000;MQ=28.95;QD=8.97;SOR=3.056;FractionInformativeReads=0.750 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB 1/2:0,5,4:0.556,0.444:9:15:177,144,46,122,0,72:-17.704,-14.420,-4.626,-12.220,0.000,-7.244:1.076e+02,1.096e+02,1.465e+01,8.758e+01,1.520e-01,4.082e+01:0.00,34.77,37.77,34.77,69.54,37.77:0,0,1,8:0,0,4,5

Two indels are considered multi-allelic if they share the same reference base preceding the indel.

chr1 7392258 . C CT,CTTT 234.76 PASS AC=1,1;AF=0.500,0.500;AN=2;DP=44;FS=0.000;MQ=199.22;QD=5.34;SOR=2.226;FractionInformativeReads=0.659 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB 1/2:0,15,14:0.517,0.483:29:50:245,256,55,190,0,55:-24.476,-25.634,-5.492,-18.976,0.000,-5.500:2.348e+02,2.513e+02,5.292e+01,1.848e+02,4.401e-05,5.300e+01:0.00,5.00,8.00,5.00,10.00,8.00:0,0,7,22:0,0,17,12

If a SNP overlaps an INDEL, but the SNP does not align with the reference base preceding the indel, the SNP and INDEL are represented as two different variant records, as shown in the example below. However, DRAGEN has a joint detection of overlapping variants feature, which is designed to detect an overlapping SNP and INDEL and output them in a single VCF variant record, represented as a multi-allelic genotype.

chr1 1029628 . C CGT 49.88 PASS AC=1;AF=0.500;AN=2;DP=37;FS=7.791;MQ=105.32;MQRankSum=-1.315;QD=1.35;ReadPosRankSum=1.423;SOR=1.510;FractionInformativeReads=0.892;R2_5P_bias=-19.742 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB:PS 0|1:17,16:0.485:33:48:81,0,50:-8.088,0.000,-5.000:4.988e+01,6.653e-05,5.300e+01:0.00,31.00,34.00:10,7,5,11:11,6,9,7:1029628
chr1 1029629 . A G 50.00 PASS AC=1;AF=0.500;AN=2;DP=37;FS=1.289;MQ=105.32;MQRankSum=-0.659;QD=1.35;ReadPosRankSum=-0.199;SOR=0.604;FractionInformativeReads=1.000;R2_5P_bias=-24.923 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB:PS 0|1:16,21:0.568:37:48:85,0,49:-8.477,0.000,-4.934:5.000e+01,6.886e-05,5.234e+01:0.00,34.77,37.77:9,7,10,11:10,6,13,8:1029628

Ploidy Support

The small variant caller currently only supports either ploidy 1 or 2 on all contigs within the reference except for the mitochondrial contig, which uses a continuous allele frequency approach (see Mitochondrial Calling). The selection of ploidy 1 or 2 for all other contigs is determined as follows.

If --sample-sex is not specified on the command line, the Ploidy Estimator determines the sex. If the Ploidy Estimator cannot determine the sex karyotype or detects sex chromosome aneuploidy, all contigs are processed with ploidy 2.
If --sample-sex is specified on the command line, contigs are processed as follows.
- For female samples, DRAGEN processes all contigs with ploidy 2, and marks variant calls on chrY with a filter PloidyConflict.
- For male samples, DRAGEN processes all contigs with ploidy 2, except for the sex chromosomes. DRAGEN processes chrX with ploidy 1, except in the PAR regions, where it is processed with ploidy 2. chrY is processed with ploidy 1 throughout.
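The selection rules above can be summarized in a small decision function. This is a sketch under stated assumptions, not DRAGEN code: the function name is hypothetical, the sex argument stands for either the --sample-sex value or an unambiguous Ploidy Estimator result (None when the sex is undetermined), and the PAR intervals are supplied by the caller because this section does not list them. The mitochondrial contig is handled by a separate pipeline and is not modeled.

```python
def contig_ploidy(contig, pos, sex, par_regions=()):
    """Return the ploidy (1 or 2) used for a position on a non-mitochondrial contig.

    sex: "male", "female", or None when the sex could not be determined.
    par_regions: iterable of (start, end) 1-based inclusive intervals on chrX
                 (caller-supplied; no real coordinates are assumed here).
    """
    # Only the X/Y and chrX/chrY naming conventions are recognized.
    name = contig[3:] if contig.startswith("chr") else contig
    if sex == "male":
        if name == "Y":
            return 1
        if name == "X":
            in_par = any(start <= pos <= end for start, end in par_regions)
            return 2 if in_par else 1
    # Female samples, undetermined sex, and all autosomes use ploidy 2.
    return 2


# Toy PAR interval for illustration only; these are not real PAR boundaries.
toy_par = [(1, 1000)]
print(contig_ploidy("chrX", 500, "male", toy_par))
print(contig_ploidy("chrX", 5000, "male", toy_par))
print(contig_ploidy("chrX", 5000, "female", toy_par))
```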

For male samples in germline calling mode, DRAGEN calls potential mosaic variants in non-PAR regions of sex chromosomes. A variant is called as mosaic when the allele frequency (FORMAT/AF) is below 85% or if multiple alt alleles are called, suggesting incompatibility with the haploid assumption. The GT field for bi-allelic mosaic variants is "0/1", denoting a mixture of reference and alt alleles, as opposed to the regular GT of "1" for haploid variants. The GT field for multi-allelic mosaic variants is "1/2" in VCF. You can disable the calling of mosaic variants by setting --vc-enable-sex-chr-diploid to false.

An example germline VCF record of a mosaic variant in a haploid region:

chrX 18622368 . C T 48.84 PASS AC=1;AF=0.500;AN=2;DP=22;FS=4.154;MQ=248.02;MQRankSum=3.272;QD=2.27;ReadPosRankSum=2.671;SOR=1.546;FractionInformativeReads=1.000;MOSAIC GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 0/1:9,13:0.5909:22:1,8:8,5:48:84,0,51:4.8837e+01,7.4031e-05,5.4007e+01:0.00,34.77,37.77:5,4,4,9:3,6,5,8

DRAGEN detects sex chromosomes by the naming convention, either X/Y or chrX/chrY. No other naming convention is supported.

Overlapping Mates in the Small Variant Calling

Instead of treating overlapping mates as independent evidence for a given event, DRAGEN handles overlapping mates in both the germline and somatic pipelines as follows.

When the two overlapping mates agree with each other on the allele with the highest HMM score, the genotyper uses the mate with the greatest difference between the highest and the second highest HMM score. The HMM score of the other mate becomes zero.
When the two overlapping mates disagree, the genotyper sums the HMM scores of the two mates, assigns the combined score to the mate that agrees with the combined result, and changes the HMM score of the other mate to zero.
The base qualities of overlapping mates are no longer adjusted.
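
The two cases above can be sketched in Python as follows. This is a simplified illustration, not the DRAGEN implementation: it assumes each mate carries one score per candidate allele, that higher scores indicate better support, and that both mates score the same set of at least two alleles. All names are hypothetical.

```python
from typing import Dict, Tuple

Scores = Dict[str, float]  # allele -> HMM score; higher means better support (assumption)


def best_allele(scores: Scores) -> str:
    return max(scores, key=scores.get)


def margin(scores: Scores) -> float:
    """Difference between the highest and the second highest score."""
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] - ranked[1]


def zeroed(scores: Scores) -> Scores:
    return {allele: 0.0 for allele in scores}


def merge_overlapping_mates(mate1: Scores, mate2: Scores) -> Tuple[Scores, Scores]:
    """Return the adjusted scores for (mate1, mate2)."""
    best1, best2 = best_allele(mate1), best_allele(mate2)
    if best1 == best2:
        # Agreement: keep the mate with the larger margin, zero the other.
        if margin(mate1) >= margin(mate2):
            return dict(mate1), zeroed(mate2)
        return zeroed(mate1), dict(mate2)
    # Disagreement: sum the scores per allele.
    combined = {allele: mate1[allele] + mate2[allele] for allele in mate1}
    if best_allele(combined) == best2:
        return zeroed(mate1), combined
    # mate1 agrees with the combined result, or neither mate does; the
    # second case is not covered by the text and falls back to mate1 here.
    return combined, zeroed(mate2)
```

Either way, only one mate of the pair contributes evidence to the genotyper, so the overlap is not counted twice.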

Mitochondrial Calling

Typically, there are approximately 100 mitochondria in each mammalian cell. Each mitochondrion harbors 2–10 copies of mitochondrial DNA (mtDNA). For example, if 20% of the chrM copies have a variant, then the allele frequency (AF) is 20%. This is also referred to as continuous allele frequency. The expectation is that the AF of variants on chrM is anywhere between 0% and 100%.

DRAGEN processes chrM through a continuous AF pipeline, which is similar to the somatic variant calling pipeline. In this case, a single ALT allele is considered and the AF is estimated. The estimated AF can be anywhere between 0% and 100%. Default variant AF thresholds are applied to mitochondrial variant calling.

--vc-enable-af-filter-mito Whether to enable the allele frequency filter for mitochondrial variant calling. The default is true.
--vc-af-call-threshold-mito Set the allele frequency threshold for emitting calls in the VCF. The default is 0.01.
--vc-af-filter-threshold-mito Set the allele frequency threshold to mark emitted VCF calls as filtered. The default is 0.02.

QUAL and GQ are not output in the chrM variant records. Instead, the confidence score is FORMAT/SQ, which gives the Phred-scaled confidence that a variant is present at a given locus. A call is made if FORMAT/SQ > vc-sq-call-threshold (default = 0.1).

##FORMAT=<ID=SQ,Number=A,Type=Float,Description="Somatic quality">

The following filters can be applied to mitochondrial variant calls.

--vc-sq-call-threshold Set the SQ threshold for emitting calls in the VCF. The default is 0.1.
--vc-sq-filter-threshold Set the SQ threshold to mark emitted VCF calls as filtered. The default is 3.0.
--vc-enable-triallelic-filter Enables the multiallelic filter. The default value is false.

If FORMAT/SQ < vc-sq-call-threshold, the variant is not emitted in the VCF. If FORMAT/SQ > vc-sq-call-threshold but FORMAT/SQ < vc-sq-filter-threshold, the variant is emitted in the VCF with FILTER=weak_evidence.

If FORMAT/SQ > vc-sq-call-threshold, FORMAT/SQ > vc-sq-filter-threshold, and no other filters are triggered, the variant is emitted in the VCF with FILTER=PASS.
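
The SQ decision logic above can be written as a short Python sketch. The function name is hypothetical, the defaults are taken from the option list above, other filters are ignored, and the behavior when SQ exactly equals a threshold is an assumption because the text only states strict inequalities.

```python
from typing import Optional


def classify_mito_call(sq: float,
                       call_threshold: float = 0.1,
                       filter_threshold: float = 3.0) -> Optional[str]:
    """Return None if the call is not emitted, otherwise the FILTER value."""
    if sq < call_threshold:
        return None  # not emitted in the VCF
    if sq < filter_threshold:
        return "weak_evidence"  # emitted but filtered
    return "PASS"  # emitted, assuming no other filter is triggered
```

For example, an SQ of 21.48 yields "PASS", while an SQ of 1.0 yields "weak_evidence".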

The following are example VCF records on chrM. The examples show one call with very high AF and another with low AF. In both cases, FORMAT/SQ is greater than both vc-sq-call-threshold and vc-sq-filter-threshold, so the FILTER annotation is PASS.

chrM    513     .       GCA     G       .       PASS    DP=4937;MQ=235.28;FractionInformativeReads=0.883
GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB  1/1:95.46:33,4327:0.992:7,1081:26,3246:4360:31,2,2371,1956:10,23,2811,1516

chrM    7028    .       C       T       .       PASS    DP=8868;MQ=60.19;FractionInformativeReads=0.993 
GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB  0/1:21.48:8622,181:0.021:4190,82:4432,99:8803:4344,4278,94,87:5032,3590,101,80

FORMAT/GT

For homref calls (eg, in NON_REF regions of gVCF output), the FORMAT/GT is hard-coded to 0/0. The FORMAT/AF gives an estimate of the variant allele frequency, which can be anywhere within [0,1]. For variant calls with FORMAT/AF < 95%, the FORMAT/GT is set to 0/1. For variants with very high allele frequencies (FORMAT/AF ≥ 95%), the FORMAT/GT is set to 1/1.
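
The GT assignment described above amounts to a single threshold on FORMAT/AF. A minimal Python sketch, with a hypothetical function name and an is_variant flag standing in for the homref case:

```python
def mito_genotype(af: float, is_variant: bool = True) -> str:
    """Map an estimated allele frequency on chrM to a FORMAT/GT string."""
    if not is_variant:
        return "0/0"  # homref calls are hard-coded to 0/0
    return "1/1" if af >= 0.95 else "0/1"
```

An AF of 0.992 maps to 1/1, and an AF of 0.021 maps to 0/1.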

The following is an example of a variant record on chrM in a trio joint VCF. The variant was detected in the second sample with a confidence score that passed the filter threshold. In the first and third samples GT=0/0, which indicates a tentative hom-ref call (ie, that position for the sample is in a NON_REF region over which no variant was detected with sufficient confidence), but the weak_evidence filter tag indicates that this call is made with low confidence.

chrM 2623 . A G . PASS DP=18772;MQ=111.77 GT:AD:AF:DP:FT:SQ:F1R2:F2R1 0/0:6841,7:0.001:4334:weak_evidence:0:.:. 0/1:6736,2053:0.234:8789:PASS:21.32:3394,1060:3342,993 0/0:6086,9:0.001:5613:weak_evidence:0:.:.

Personalized Germline Small Variant Calling

DRAGEN leverages the multigenome graph reference and graph mapper output to compute a personalized 2-haplotype reference for the input sample.

The computed 2-haplotype reference is used to impute variants, adjust the prior probabilities for genotypes in the variant caller, and create a personalized machine learning model, which together improve the accuracy of variant calling. False negatives are reduced by adjusting genotype priors based on imputed phased variants in the computed haplotypes. False positives are reduced by limiting the impact of noise from other population haplotypes.

To enable personalized variant calling and machine learning, set --enable-personalization to true (default: false).

This is a beta feature that is available only for the germline small variant caller when run with a V4 multigenome graph reference.

Joint Detection of Overlapping Variants

When variants at multiple loci in a single active region are detected jointly, genotyping can benefit. DRAGEN combines loci into a joint detection region if the following conditions are met:

Loci have alleles that overlap each other.
Loci are in an STR region or less than 10 bases away from an STR region.
Loci are less than 10 bases apart from each other.

Joint detection generates a haplotype list where all possible combinations of the alleles in the joint detection regions are represented. This calculation leads to a larger number of haplotypes. During genotyping, joint detection calculates the likelihoods that each haplotype pair is the truth, given the observed read pileup. Genotype likelihoods are calculated as the sum of the likelihoods of haplotype pairs that support the alleles in the genotypes. Genotypes with maximum likelihood are reported.
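
To make the haplotype enumeration and the summation over haplotype pairs concrete, the following Python sketch builds every allele combination across the loci of a joint detection region and sums pair likelihoods per genotype at one locus. It is a toy illustration under stated assumptions: the function names are hypothetical, haplotype pairs are unordered, and the pair likelihoods are supplied by the caller rather than computed from a read pileup.

```python
from itertools import combinations_with_replacement, product
from typing import Dict, List, Sequence, Tuple

Haplotype = Tuple[str, ...]
HaplotypePair = Tuple[Haplotype, Haplotype]


def joint_haplotypes(loci_alleles: Sequence[Sequence[str]]) -> List[Haplotype]:
    """All combinations of one allele per locus."""
    return list(product(*loci_alleles))


def haplotype_pairs(haplotypes: Sequence[Haplotype]) -> List[HaplotypePair]:
    """All unordered haplotype pairs, including homozygous pairs."""
    return list(combinations_with_replacement(haplotypes, 2))


def locus_genotype_likelihoods(
    locus: int, pair_likelihoods: Dict[HaplotypePair, float]
) -> Dict[Tuple[str, str], float]:
    """Sum the likelihoods of the haplotype pairs that support each genotype."""
    totals: Dict[Tuple[str, str], float] = {}
    for (hap1, hap2), likelihood in pair_likelihoods.items():
        genotype = tuple(sorted((hap1[locus], hap2[locus])))
        totals[genotype] = totals.get(genotype, 0.0) + likelihood
    return totals
```

Two biallelic loci already produce 4 haplotypes and 10 haplotype pairs, which shows why joint detection leads to a larger number of haplotypes.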

Joint detection is enabled by default. To disable joint detection, set --vc-enable-joint-detection to false.

Modeling of Correlated Errors Across Reads

DRAGEN has two algorithms that model correlated errors across reads in a given pileup.

Foreign Read Detection

Foreign read detection (FRD) detects mismapped reads. FRD modifies the probability calculation to account for the possibility that a subset of the reads were mismapped. Instead of assuming that mapping errors occur independently per read, FRD estimates the probability that a burst of reads is mismapped, by incorporating such evidence as MAPQ and skewed AF.

Mapping errors typically occur in bursts, but treating mapping errors as independent error events per read can result in high confidence scores in spite of low MAPQ and/or skewed AF. One possible strategy to mitigate overestimation of confidence scores is to include a threshold on the minimum MAPQ used in the calculation. However, this strategy can discard evidence and result in false positives.

FRD extends the legacy genotyping algorithm by incorporating an additional hypothesis that reads in the pileup might be foreign reads (ie, their true location is elsewhere in the reference genome). The algorithm exploits multiple properties (skewed allele frequency and low MAPQ) and incorporates this evidence into the probability calculation.

Sensitivity is improved by rescuing FN, correcting genotypes, and enabling lowering of the MAPQ threshold for incoming reads into the variant caller. Specificity is improved by removing FP and correcting genotypes.

Base Quality Dropoff

The base quality drop off (BQD) algorithm detects systematic and correlated base call errors caused by the sequencing system. Bursts of such errors at a specific locus have distinct characteristics that differentiate them from true variants. BQD exploits these properties (strand bias, position of the error in the read, and low mean base quality over the affected subset of reads at the locus of interest) and incorporates them into the probability calculation to estimate the probability that the alleles are the result of a systematic error event rather than a true variant.

Copy Number Variant Calling

The DRAGEN Copy Number Variant (CNV) Pipeline can call CNV events using next-generation sequencing (NGS) data. This pipeline supports multiple applications in a single interface via the DRAGEN Host Software, including processing of whole-genome sequencing (WGS) data and whole-exome sequencing (WES) data for germline analysis.

The DRAGEN CNV pipeline supports two normalization modes of operation. The two modes apply different normalization techniques to handle biases that differ based on the application, for example, WGS versus WES. While the default option settings attempt to provide the best trade-off in terms of speed and accuracy, a specific workflow may require more finely tuned option settings.

CNV Workflow

The DRAGEN CNV pipeline follows the workflow shown in the following figure.

[Figure: DRAGEN CNV Pipeline Workflow]

The DRAGEN CNV Pipeline uses many aspects of the DRAGEN secondary analysis available in other pipelines, such as hardware acceleration and efficient I/O processing. To enable CNV processing in the DRAGEN Host Software, set the --enable-cnv command line option to true.

The CNV pipeline has the following processing modules:

Target Counts --- Binning of the read counts and other signals from alignments.
Bias Correction --- Correction of intrinsic system biases.
Normalization --- Detection of normal ploidy levels and normalization of the case sample.
Segmentation --- Breakpoint detection via segmentation of the normalized signal.
Calling / Genotyping --- Thresholding, scoring, qualifying, and filtering of putative events as copy number variants.

The normalization module can optionally take in a panel of normals (PoN), which is used when cohort or population samples are readily available. Note that PoN normalization is not available for somatic WGS analysis. All other modules are shared between the different CNV algorithms.

Signal Flow Analysis

The following figures show a high-level overview of the steps in the DRAGEN CNV Pipeline as the signal traverses through the various stages. These figures are examples and are not identical to the plots that are generated from the DRAGEN CNV Pipeline.

The first step in the DRAGEN CNV Pipeline is the target counts stage. The target counts stage extracts signals such as read count and improper pairs and puts them into target intervals.

[Figure: Read Count Signal]

[Figure: Improper Pairs Signal]

Next, the case sample is normalized against the panel of normals or against the estimated normal ploidy level. Any other biases are subtracted out of the signal to amplify any event level signals.

[Figure: Normalization]

The normalized signal is then segmented using one of the available segmentation algorithms. Events are then called from the segments.

[Figure: Segments]

[Figure: Called Events]

The events are then scored and emitted in the output VCF.

CNV Pipeline Options

The following are the top-level options that are shared with the DRAGEN Host Software to control the CNV pipeline. You can input a BAM or CRAM file into the CNV pipeline. If you are using the DRAGEN mapper and aligner, you can use FASTQ files.

--bam-input --- The BAM file to be processed.
--cram-input --- The CRAM file to be processed.
--enable-cnv --- Enable or disable CNV processing. Set to true to enable CNV processing.
--enable-map-align --- Enables the mapper and aligner module. The default is true, so all input reads are remapped and aligned unless this option is set to false.
--fastq-file1, --fastq-file2 --- FASTQ file or files to be processed.
--output-directory --- Output directory where all results are stored.
--output-file-prefix --- Output file prefix that will be prepended to all result file names.
--ref-dir --- The DRAGEN reference genome hashtable directory.

Output and Filtering Options

The output and filtering options control the CNV output files.

--cnv-exclude-bed --- Specifies a BED file that indicates the intervals to exclude from the CNV analysis. If the fraction of a target interval that overlaps the regions in the exclude BED file is greater than cnv-exclude-bed-min-overlap, the target interval is suppressed.
--cnv-exclude-bed-min-overlap --- Specifies the overlap fraction between a target interval and the excluded regions that is used as the filtering threshold. The default is 0.5.
--cnv-enable-ref-calls --- Emit copy neutral (REF) calls in the output VCF file. The default is true for single WGS CNV analysis.
--cnv-enable-tracks --- Generate track files that can be imported into IGV for viewing. When this option is enabled, a \*.gff file for the output variant calls is generated, as well as \*.bw files for the tangent normalized signal. The default is true.
--cnv-filter-bin-support-ratio --- Filters out a candidate event if the span of supporting bins is less than the specified ratio with respect to the overall event length. This filter only applies to records with length greater than cnv-filter-bin-support-ratio-min-len. The default ratio is 0.2 (20% support). As an example, if an event is called with a length of 100,000 bp, but the target interval bins that support the call span a total of only 15,000 bp (15,000/100,000 = 0.15), then the event is filtered out. If applied, the record will have cnvBinSupportRatio as a filter.
--cnv-filter-bin-support-ratio-min-len --- Minimum length of candidate event at which to apply cnv-filter-bin-support-ratio. Currently only applied to germline WGS workflows, with default value of 80,000 bp.
--cnv-filter-copy-ratio --- Specifies the minimum copy ratio (CR) threshold value centered about 1.0 at which a reported event is marked as PASS in the output VCF file. The default value is 0.2, which leads to calls with CR between 0.8 and 1.2 being filtered. If applied, the record will have cnvCopyRatio as a filter.
--cnv-filter-length --- Specifies the minimum event length in bases at which a reported event is marked as PASS in the output VCF file. The default is 10000. If applied, the record will have cnvLength as a filter.
--cnv-filter-qual --- Specifies the QUAL value at which a reported event is marked as PASS in the output VCF file. You should adjust the parameter value according to your own application data. If applied, the record will have cnvQual as a filter.
--cnv-min-qual --- Specifies the minimum reported QUAL. The default is 3.
--cnv-max-qual --- Specifies the maximum reported QUAL. The default is 200.
--cnv-qual-length-scale --- Specifies the bias weighting factor to adjust QUAL estimates for segments with longer lengths. This is an advanced option and should not need to be modified. The default is 0.9303 (2^-0.1).
--cnv-qual-noise-scale --- Specifies the bias weighting factor to adjust QUAL estimates based on sample variance. This is an advanced option and should not need to be modified. The default is 1.0.
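
As an illustration of how the length, copy ratio, bin support, and QUAL filters described above combine, the following Python sketch returns the filter names for one candidate event. The function and parameter names are hypothetical, the defaults are the documented ones, no default is assumed for the QUAL threshold, and the handling of values exactly at a threshold is an assumption.

```python
from typing import List


def cnv_filters(length_bp: int,
                supporting_bp: int,
                copy_ratio: float,
                qual: float,
                qual_threshold: float,
                bin_support_ratio: float = 0.2,
                bin_support_min_len: int = 80_000,
                copy_ratio_margin: float = 0.2,
                min_length: int = 10_000) -> List[str]:
    """Return the filters that apply to an event, or ["PASS"] if none apply."""
    filters = []
    # The bin support filter only applies to sufficiently long records.
    if length_bp > bin_support_min_len and supporting_bp / length_bp < bin_support_ratio:
        filters.append("cnvBinSupportRatio")
    # Copy ratios too close to 1.0 are filtered (0.8 to 1.2 with the default).
    if abs(copy_ratio - 1.0) < copy_ratio_margin:
        filters.append("cnvCopyRatio")
    if length_bp < min_length:
        filters.append("cnvLength")
    if qual < qual_threshold:
        filters.append("cnvQual")
    return filters or ["PASS"]
```

With the documented example (a 100,000 bp event supported by 15,000 bp of bins), the support ratio is 0.15 and the event receives the cnvBinSupportRatio filter.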

CNV Pipeline Input

The DRAGEN CNV pipeline supports multiple input formats. To run the DRAGEN CNV pipeline directly with FASTQ input without generating a BAM or CRAM file, see Streaming Alignments for instructions on streaming alignment records directly from the DRAGEN map/align stage.

DRAGEN CNV also supports running from an already mapped and aligned BAM or CRAM file. If you have data that has not yet been mapped and aligned, see Generate an Alignment File.

Reference Hashtable

For the DRAGEN CNV pipeline, the hashtable must be generated with the --enable-cnv option set to true, in addition to any other options required by other pipelines. When --enable-cnv is true, DRAGEN generates an additional k-mer uniqueness map that the CNV algorithm uses to counteract mappability biases. You only need to generate the k-mer uniqueness map file one time per reference hashtable. The generation takes about 1.5 hours per whole human genome.

The reference hashtable is a pregenerated binary representation of the reference genome. For information on generating a hashtable, see Prepare a Reference Genome.

The following example command generates a hashtable.

dragen \
--build-hash-table true \
--ht-reference <FASTA> \
--output-directory <OUTPUT> \
--enable-cnv true \
<OTHER HASHTABLE OPTIONS>

Generate an Alignment File

The following command-line examples show how to run the DRAGEN map/align pipeline depending on your input type. The map/align pipeline generates an alignment file in the form of a BAM or CRAM file that can then be used in the DRAGEN CNV Pipeline.

You need to generate alignment files for all samples that have not already been mapped and aligned, including any samples to be used as references for normalization. Each sample must have a unique sample identifier. Use the --RGSM option to specify the identifier. For BAM and CRAM input files, the sample identifier is taken from the file, so the --RGSM option is not required.

The following example command maps and aligns a FASTQ file:

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
--RGSM <SAMPLE> \
--RGID <RGID> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true

The following example command maps and aligns an existing BAM file:

dragen \
-r <HASHTABLE> \
--bam-input <BAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true

The following example command maps and aligns an existing CRAM file:

dragen \
-r <HASHTABLE> \
--cram-input <CRAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true

Streaming Alignments

DRAGEN can map and align FASTQ samples, and then directly stream them to downstream callers, such as the CNV Caller and the Haplotype Variant Caller. You can use this process to skip generation of a BAM or CRAM file, which bypasses the need to store additional files.

To stream alignments directly to the DRAGEN CNV pipeline, run the FASTQ sample through a regular DRAGEN map/align workflow, and then provide additional arguments to enable CNV. The following example command line maps and aligns a FASTQ file, and then streams the alignments to the Germline CNV WGS pipeline.

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
--RGSM <SAMPLE> \
--RGID <RGID> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true \
--enable-cnv true \
--cnv-enable-self-normalization true

For information on running CNV concurrently with the Haplotype Variant Caller, see Concurrent CNV and Small Variant Calling.

Target Counts

The target counts stage is the first processing stage for the DRAGEN CNV pipeline. This stage bins the alignments into intervals. The primary analysis format for CNV processing is the target counts file, which contains the feature signals that are extracted from the alignments to be used in downstream processing. The binning strategy, interval sizes, and their boundaries are controlled by the target counts generation options, and the normalization technique used.

When working with whole genome sequence data, the intervals are autogenerated from the reference hashtable. Only the primary contigs from the reference hashtable are considered for binning. You can specify additional contigs to bypass with the --cnv-skip-contig-list option.

With whole exome sequence data, DRAGEN uses the target BED file supplied with the --cnv-target-bed option to determine the intervals for analysis.

The target counts stage generates a .target.counts.gz file. You can use the file later in place of any BAM or CRAM by specifying the file with the --cnv-input option for the normalization stage. The .target.counts.gz file is an intermediate file for the DRAGEN CNV pipeline and should not be modified.

Further details are available in the Output Files section.

Whole Genome

If the samples are whole genome, then the effective target intervals width is specified with the --cnv-interval-width option. The higher the coverage of a sample, the higher the resolution that can be detected. This option is important when running with a panel of normals because all samples must have matching intervals. For self-normalization, the actual width of a given target interval might be larger than the specified value.

The default value for WGS is 1000 bp with a sample coverage of ≥ 30x.

Using a cnv-interval-width of less than 250 bp for WGS analysis can drastically increase runtime.

The intervals are autogenerated for every primary contig in the reference. Only references that have the UCSC or GRC convention are supported. For example, chr1, chr2, chr3, ..., chrX, chrY or 1, 2, 3, ..., X, Y. You can specify a list of contigs to skip by using the --cnv-skip-contig-list option. This option takes a comma-separated list of contig identifiers. The contig identifiers must match the reference hashtable that you are using. By default, only the mitochondrial chromosomes are skipped. Non-primary contigs are never processed.

For example, to skip chromosome M, X, and Y, use the following option:

--cnv-skip-contig-list "chrM,chrX,chrY"

Whole Exome

If the samples are whole exome samples, supply a target BED file with the --cnv-target-bed $TARGET_BED option. The intervals in the target BED file indicate regions where alignments are expected based on the target capture kit. The BED file intervals are further split into intervals of smaller size, depending on the value of cnv-interval-width.

To use a standard BED file, make sure that there is no header present in the file. In this case, all columns after the third column are ignored, similar to the operation of DRAGEN Variant Caller.

Target Counts Options

The following options control the generation of target counts.

--cnv-counts-method --- Specifies the counting method for an alignment to be counted in a target bin. Values are midpoint, start, or overlap. The default value is overlap when using the panel of normals approach, which means if an alignment overlaps any part of the target bin, the alignment is counted for that bin. In the self-normalization mode, the default counting method is start.
--cnv-min-mapq --- Specifies the minimum MAPQ for an alignment to be counted during target counts generation. The default value is 3 for self-normalization and 20 otherwise. When generating counts for panel of normals, all MAPQ0 alignments are counted.
--cnv-target-bed --- Specifies a properly formatted BED file that indicates the target intervals to sample coverage over. For use in WES analysis.
--cnv-interval-width --- Specifies the width of the sampling interval for CNV processing. This option controls the effective window size. The default is 1000 for WGS analysis and 500 for WES analysis.
--cnv-skip-contig-list --- Specifies a comma-separated list of contig identifiers to skip when generating intervals for WGS analysis. The default contigs that are skipped, if not specified, are chrM,MT,m,chrm.
--cnv-filter-duplicate-alignments --- Filters duplicate-marked alignments during target counts generation if set to true. The default setting is false.

Target counts options are recorded in the header of each counts file to facilitate review and validation of the panel of normals. If the panel of normals counts were generated with different count options than the case sample, DRAGEN returns an option validation error.
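
The three values of cnv-counts-method can be pictured with the following Python sketch. Only the overlap behavior is defined in the option description; the start and midpoint rules shown here are inferred from the option names and are assumptions, as are the function name and the use of half-open [start, end) coordinates.

```python
def counts_in_bin(aln_start: int, aln_end: int,
                  bin_start: int, bin_end: int,
                  method: str = "overlap") -> bool:
    """Decide whether an alignment is counted for a target bin."""
    if method == "overlap":
        # Counted if any part of the alignment overlaps the bin.
        return aln_start < bin_end and aln_end > bin_start
    if method == "start":
        # Assumed: counted if the alignment start falls in the bin.
        return bin_start <= aln_start < bin_end
    if method == "midpoint":
        # Assumed: counted if the alignment midpoint falls in the bin.
        midpoint = (aln_start + aln_end) // 2
        return bin_start <= midpoint < bin_end
    raise ValueError(f"unknown counting method: {method}")
```

Under overlap, one alignment can be counted in two adjacent bins, whereas the start and midpoint rules as sketched here assign each alignment to at most one bin.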

Filter Duplicate Alignments

PCR duplicates are often considered noise in coverage depth information. Use the --cnv-filter-duplicate-alignments option to control whether duplicate-marked alignments are excluded when counting alignments. The option relies on the duplicate bit (0x400) being set correctly in the SAM flag of each alignment.

If --enable-map-align=false, then duplicate marking should be present in the input file (pre-aligned BAM/CRAM). If --enable-map-align=true, then --enable-duplicate-marking=true should be set.

Note that the CNV caller waits for duplicate marking from the map/align stage, which may increase overall run time.

Target Counts Dropout Regions

In the WGS case where a BED file is not specified for a given reference, the same intervals should be generated each time. The intervals created take into account the mappability of the reference genome using a k-mer uniqueness map created during hashtable generation.

Due to ambiguity that may arise from non-unique genomic loci, only regions corresponding to unique k-mers are considered. A position in the reference genome is marked as a unique k-mer if the k-mer starting at that position does not show up anywhere else in the reference genome (or non-unique, otherwise). Furthermore, if the k-mer contains any bases other than A, C, T or G, it is marked as non-unique.

For WGS samples, in the absence of a cnv-target-bed file, the target intervals are autogenerated from the precomputed k-mer uniqueness map for the input reference hashtable and the cnv-interval-width option, which defaults to 1000 bp. The cnv-interval-width option determines the minimum number of unique k-mer positions required in an interval. The interval length has an upper bound: if an interval grows to more than double cnv-interval-width without reaching the required count of unique k-mer positions, the interval is discarded and the process starts again at the next genomic position. Discarded regions are referred to as "dropout" regions and are listed with the exclusion reason NON_KMER_UNIQUE in the *.cnv.excluded_intervals.bed.gz file.
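
The interval generation rule can be sketched in Python as follows. This is a simplified model, not the DRAGEN implementation: the uniqueness map is represented as a list of booleans for one contig, "the next genomic position" is taken to mean the position after the discarded span, and a trailing partial interval is ignored. The function name is hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def generate_intervals(unique: Sequence[bool], width: int) -> Tuple[List[Interval], List[Interval]]:
    """Return (kept intervals, dropout intervals) for one contig.

    unique[i] is True when the k-mer starting at position i is unique.
    width plays the role of cnv-interval-width.
    """
    kept: List[Interval] = []
    dropped: List[Interval] = []
    start, n_unique = 0, 0
    for pos, is_unique in enumerate(unique):
        n_unique += bool(is_unique)
        length = pos - start + 1
        if n_unique >= width:
            # Enough unique positions: close the interval.
            kept.append((start, pos + 1))
            start, n_unique = pos + 1, 0
        elif length > 2 * width:
            # Too long without enough unique positions: dropout region.
            dropped.append((start, pos + 1))
            start, n_unique = pos + 1, 0
    return kept, dropped
```

In this model a kept interval can be longer than width when it contains non-unique positions, which matches the earlier remark that the actual width of a target interval might be larger than the specified value.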

A dropout region is a complex region that does not count alignments and results in an interval missing from the analysis. Dropout regions include centromeres, telomeres, and low complexity regions. If there is sufficient signal in the flanking regions, an event can still span these dropout regions, even if alignment counting does not occur in the regions. The event is handled by the segmentation stage.

Some of the excluded intervals can be rescued through the segmental duplication extension to the germline CNV workflow. See Segmental Duplication Extension for more details.

GC Bias Correction

GC bias is the dependence of read coverage on GC content across a genome. Biases can arise from library prep, capture kits, sequencing system differences, and mapping, and can make CNV events difficult to call. The DRAGEN GC bias correction module attempts to correct these biases.

The GC bias correction module immediately follows the target counts stage and operates on the *.target.counts.gz file. GC bias correction generates a GC bias corrected version of the file, which has a *.target.counts.gc-corrected.gz extension in the file name. The GC bias corrected versions are recommended for any downstream processing when working with WGS data. For WES, if there are enough target regions, then the GC bias corrected counts can also be used. See Output Files for further details on GC-corrected target counts files.

Typical whole-exome capture kits have over 200,000 targets spanning the regions of interest. If your BED file has fewer than 200,000 targets, or if the target regions are localized to a specific region in the genome (such that GC bias statistics might be skewed), then GC bias correction should be disabled.

The following options control the GC bias correction module.

--cnv-enable-gcbias-correction --- Enable or disable GC bias correction when generating target counts. The default is true.
--cnv-enable-gcbias-smoothing --- Enable or disable smoothing the GC bias correction across adjacent GC bins with an exponential kernel. The default is true.
--cnv-num-gc-bins --- Specifies the number of bins for GC bias correction. Each bin represents the GC content percentage. Allowed values are 10, 20, 25, 50, or 100. The default is 25.
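
The document does not specify the correction algorithm, so the following Python sketch shows a generic median-ratio GC correction only to illustrate what binning by GC content means. It is not the DRAGEN method: the function names are hypothetical, smoothing across adjacent bins is omitted, and only the default of 25 bins is taken from the option list above.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Sequence


def gc_bin(gc_fraction: float, num_bins: int = 25) -> int:
    """Map a GC fraction in [0, 1] to a bin index in [0, num_bins - 1]."""
    return min(int(gc_fraction * num_bins), num_bins - 1)


def gc_correct(counts: Sequence[float],
               gc_fractions: Sequence[float],
               num_bins: int = 25) -> List[float]:
    """Scale each interval count by global median / median of its GC bin."""
    by_bin: Dict[int, List[float]] = defaultdict(list)
    for count, gc in zip(counts, gc_fractions):
        by_bin[gc_bin(gc, num_bins)].append(count)
    global_median = median(counts)
    bin_median = {b: median(values) for b, values in by_bin.items()}
    corrected = []
    for count, gc in zip(counts, gc_fractions):
        m = bin_median[gc_bin(gc, num_bins)]
        # Leave counts unchanged in bins whose median is zero.
        corrected.append(count * global_median / m if m else count)
    return corrected
```

After a correction of this kind, intervals with different GC content have comparable expected counts, so coverage differences are more likely to reflect copy number than GC content.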

Normalization

The DRAGEN CNV pipeline supports two normalization algorithms:

Self-Normalization --- Estimates the autosomal diploid level from the sample under analysis to determine the baseline level to normalize by. Sex chromosomes and PAR regions are handled based on the sample sex.
Panel of Normals --- A reference-based normalization algorithm that uses additional matched normal samples to determine a baseline level from which to call CNV events. Here, a matched normal sample is one that has undergone the same library prep and sequencing workflow as the case sample.

Which algorithm to use depends on the available data and the application. Use the following guidelines to select the mode of normalization.

Self-Normalization

Whole genome sequencing
Single sample analysis
Additional matched samples are not readily available
Simpler workflow via a single invocation
Only references with chr1, chr2, chr3, ..., chrX, chrY or 1, 2, 3, ..., X, Y naming conventions are supported.

Panel of Normals

Whole genome sequencing (non-somatic)
Whole exome sequencing
Targeted panels, including somatic panels
Additional matched samples are available
Nonhuman samples

Self Normalization

The DRAGEN CNV pipeline provides the self-normalization mode that does not require a reference sample or a panel of normals. To enable this mode, set --cnv-enable-self-normalization to true. Self-normalization mode bypasses the need to run two stages and can save time. It uses the statistics within the case sample to determine the baseline from which to make a call.

Because self normalization uses the statistics within the case sample, this mode is not recommended for WES or targeted sequencing analysis due to the potential for insufficient data.

The self-normalization mode is the recommended approach for whole-genome sequencing single sample processing. The pipeline continues through to the segmentation and calling stage to produce the final called events.

dragen \
-r <HASHTABLE> \
--bam-input <BAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-enable-self-normalization true

If you are running from a FASTQ sample, then the default mode of operation is self-normalization.

When operating in self-normalization mode, the --cnv-interval-width option used during the target counts stage becomes the effective interval width based on the number of unique k-mer positions. You typically do not have to modify this option.

Self-normalization autogenerates the target intervals to use during the analysis based on the reference genome and is only compatible with standard human references.

To attempt self-normalization on a nonstandard human reference, set the override --cnv-bypass-contig-check=true. With this setting, the CNV caller performs a naive median normalization across all contigs in the reference genome. This feature is experimental and for research use only; no claims are made and no validation has been performed for its use.
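
One plausible reading of a naive median normalization is to divide every interval count by the median count taken over all contigs. The following Python sketch shows that reading only; the exact computation used by the CNV caller is not described here, and the function name is hypothetical.

```python
from statistics import median
from typing import List, Sequence


def median_normalize(counts: Sequence[float]) -> List[float]:
    """Divide every interval count by the median over all intervals."""
    baseline = median(counts)
    if baseline == 0:
        raise ValueError("median count is zero; cannot normalize")
    return [count / baseline for count in counts]
```

Under this reading, values near 1.0 correspond to the baseline level, and a value near 0.5 or 1.5 suggests a loss or gain of one copy relative to a diploid baseline.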

Panel of Normals

The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. This allows the algorithm to subtract system level biases that are not sample specific. The generation of the target counts for these normal samples should also have identical command line options with the case sample under analysis.

In this mode, the DRAGEN CNV Pipeline is broken down into two distinct stages. The target counts stage is performed on every sample, both case and normals, to bin the alignments. The normalization and call detection stage is then performed with the case sample against the panel of normals to determine the events.

Target Counts Stage

Target counts should be generated for all samples, whether the samples are to be used as references or are the case samples under analysis. The case samples and all samples to be used as a panel of normals sample must have identical intervals and therefore should be generated with identical settings. The target counts stage also performs GC Bias correction. GC Bias correction is enabled by default.

The following examples are for WES processing, which is the case where a panel of normals is required.

The following is an example command for processing a BAM file.

dragen \
-r <HASHTABLE> \
--bam-input <BAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-target-bed <BED>

The following is an example command for processing a CRAM file.

dragen \
-r <HASHTABLE> \
--cram-input <CRAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-target-bed <BED>

Generating Panel of Normals (Combined Counts)

When you run an analysis with a panel of normals (a set of target counts files), a column-wise concatenated version of the panel is output as a *.combined.counts.txt.gz file. To generate this file without running the calling step, add the --cnv-generate-combined-counts=true option to the command line. Specify the individual panel of normals target counts files using either --cnv-normals-file (one option per file) or --cnv-normals-list (a single text file with the path to each sample).

The following is an example command line using a normals list:

dragen \
-r <HASHTABLE> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--cnv-normals-list <NORMALS_LIST> \
--enable-cnv true \
--cnv-generate-combined-counts true

Normalization and Call Detection Stage

The next step in the CNV pipeline when using a panel of normals is to perform the normalization and to make the calls. This involves a separate execution of DRAGEN during which the normalization is performed and calls are made. This step requires the specification of a set of target counts files to be used for reference-based median normalization.

Ideally the panel of normals samples follow library prep and sequencing workflows that are identical to the workflows of the case sample under analysis. In order to be applicable to both male and female case samples, the panel of normals should include a balanced set of both male and female samples. DRAGEN automatically handles calling on sex chromosomes based on the predicted sex of each sample in the panel.

For optimal bias correction, a minimum of 50 samples is recommended as a panel. DRAGEN can run with smaller numbers of samples in the panel, down to even just a single sample, but smaller panels can result in artifactual calls in the test sample where at least some of the panel samples have copy number changes. Larger panels do not entirely prevent such issues, but they limit them to regions where non-reference copy numbers are common.

The following is an example of PON files, which uses a subset of the GC corrected files from the target counts stage.

/data/output/sample1.target.counts.gc-corrected.gz
/data/output/sample2.target.counts.gc-corrected.gz
/data/output/sample4.target.counts.gc-corrected.gz
/data/output/sample5.target.counts.gc-corrected.gz
/data/output/sample7.target.counts.gc-corrected.gz
/data/output/sample8.target.counts.gc-corrected.gz
...
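
A list file like the one above can be assembled with a short script. The following is a minimal sketch, not part of DRAGEN: the function name write_normals_list and the exclude_samples parameter are hypothetical, and it assumes every GC-corrected counts file sits in a single directory and follows the <prefix>.target.counts.gc-corrected.gz naming shown above. It writes absolute paths, one per line.

```python
from pathlib import Path

# Suffix of the GC-corrected target counts files shown in the listing above.
GC_SUFFIX = ".target.counts.gc-corrected.gz"


def write_normals_list(counts_dir, list_path, exclude_samples=()):
    """Write one absolute path per line for each GC-corrected counts file.

    exclude_samples holds sample prefixes to leave out of the panel,
    for example the case sample itself (hypothetical helper parameter).
    """
    counts_dir = Path(counts_dir).resolve()
    paths = sorted(
        p for p in counts_dir.iterdir()
        if p.name.endswith(GC_SUFFIX)
        and p.name[: -len(GC_SUFFIX)] not in exclude_samples
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return [str(p) for p in paths]
```

For example, write_normals_list("/data/output", "normals.txt", exclude_samples={"sample3"}) would list every GC-corrected file in /data/output except the one for sample3.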

DRAGEN accepts 3 different file formats for a Panel of Normals (PON).

The CNV caller can also be started from the *.target.counts.gz (raw counts) or *.target.counts.gc-corrected.gz (GC-corrected counts) file of the case sample by specifying the selected file with the --cnv-input option together with the PON options described above. When you select GC-corrected counts, set --cnv-enable-gcbias-correction to false to disable the GC-correction stage.

For example, the following command normalizes the case sample against the panel of normals.

dragen \
-r <HASHTABLE> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-input <CASE_COUNTS> \
--cnv-normals-list <NORMALS> \
--cnv-enable-gcbias-correction false
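
When the command is generated by a script, the rule about GC-corrected input can be encoded once. The sketch below is an illustration only: the function name build_normalization_command is hypothetical, it uses only options that appear in the command above, and it assumes the GC-corrected file is recognized by its *.target.counts.gc-corrected.gz suffix. For raw counts it leaves --cnv-enable-gcbias-correction unset so that the default applies.

```python
def build_normalization_command(hashtable, output_dir, prefix, case_counts, normals_list):
    """Return the argument list for the normalization and call detection stage."""
    cmd = [
        "dragen",
        "-r", hashtable,
        "--output-directory", output_dir,
        "--output-file-prefix", prefix,
        "--enable-map-align", "false",
        "--enable-cnv", "true",
        "--cnv-input", case_counts,
        "--cnv-normals-list", normals_list,
    ]
    # GC-corrected counts must not be corrected a second time.
    if case_counts.endswith(".target.counts.gc-corrected.gz"):
        cmd += ["--cnv-enable-gcbias-correction", "false"]
    return cmd
```

The returned list can be passed to subprocess.run on a system where DRAGEN is installed.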

See Output Files for a description of the target counts files.

Normalization Options

These options control the preconditioning of the panel of normals and the normalization of the case sample.

--cnv-enable-self-normalization --- Enable/disable self normalization mode, which does not require a panel of normals.
--cnv-extreme-percentile --- Specifies the extreme median percentile value at which to filter out samples. The default is 2.5.
--cnv-input --- Specifies a target counts file for the case sample under analysis when using a panel of normals.
--cnv-normals-file --- Specifies a target.counts.gz file to be used in the panel of normals. You can use this option multiple times, one time for each file.
--cnv-normals-list --- Specifies a text file that contains paths to the list of reference target counts files to be used as a panel of normals. Absolute paths are recommended in case the workflow is used later or shared with other users. Relative paths are supported if the paths are relative to the current working directory.
--cnv-max-percent-zero-samples --- Specifies the percentage of zero coverage samples allowed for a target. If the target exceeds the specified threshold, then the target is filtered out. The default value is 5%. The option is sensitive to the number of normal samples being used, so adjust the threshold accordingly. If the panel of normals is small and the threshold is not adjusted, the option could filter out targets that were not intended to be filtered.
--cnv-max-percent-zero-targets --- Specifies the percentage of zero coverage targets allowed for a sample. If the sample exceeds the specified threshold, then the sample is filtered out. The default value is 2.5%. The option is sensitive to the total number of target intervals, so adjust the threshold accordingly. If the capture kit has a small number of probes and the threshold is not adjusted, the option could filter out samples that were not intended to be filtered.
--cnv-target-factor-threshold --- Specifies the bottom percentile of panel of normals medians to filter out useable targets. The default is 1% for whole genome processing and 5% for targeted sequencing processing.
--cnv-truncate-threshold --- Specifies a percentage threshold for truncating extreme outliers. The default is 0.1%.
--cnv-enable-gender-matched-pon --- Enable/disable gender matched PON normalization. If enabled, DRAGEN uses matched gender PON for sex chromosome normalization. Sex chromosome intervals are filtered if PON has no matched gender sample. The default value is true.
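
The sensitivity of the two zero-coverage thresholds to panel size can be seen with a small calculation. The following sketch is not DRAGEN's implementation: the function name and data layout are hypothetical, and it assumes both filters are evaluated independently on the raw count matrix, because the order in which DRAGEN applies them is not described here.

```python
def zero_coverage_filters(counts, max_pct_zero_samples=5.0, max_pct_zero_targets=2.5):
    """Return (dropped_targets, dropped_samples).

    counts maps a sample name to a list of read counts, one per target.
    A target is dropped when the percentage of samples with zero coverage
    at that target exceeds max_pct_zero_samples; a sample is dropped when
    the percentage of its targets with zero coverage exceeds
    max_pct_zero_targets.
    """
    samples = list(counts)
    n_targets = len(counts[samples[0]])
    dropped_targets = [
        t for t in range(n_targets)
        if 100.0 * sum(counts[s][t] == 0 for s in samples) / len(samples)
        > max_pct_zero_samples
    ]
    dropped_samples = [
        s for s in samples
        if 100.0 * sum(c == 0 for c in counts[s]) / n_targets
        > max_pct_zero_targets
    ]
    return dropped_targets, dropped_samples
```

With a panel of 10 samples, a single sample with zero coverage at a target already amounts to 10%, which exceeds the 5% default and removes the target. With 40 samples, the same single zero is 2.5% and the target is kept.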

Segmentation

After a case sample has been normalized, the sample goes through a segmentation stage. DRAGEN implements multiple segmentation algorithms, including the following algorithms:

Circular Binary Segmentation (CBS)
Shifting Level Models (SLM)

The SLM algorithm has three variants: SLM, Heterogeneous SLM (HSLM), and Adaptive SLM (ASLM). HSLM is for use in exome analysis and handles target capture kits that are not equally spaced. ASLM includes additional sample-specific estimation of technical variability of depth of coverage, as opposed to changes in copy number. The estimations are based on the median variance within fixed windows or a preliminary set of segments based on b-allele ratios. The ASLM algorithm mitigates over-segmentation due to noisy or wavy samples; this is the default mode for somatic WGS analysis.

By default, SLM is the segmentation algorithm for germline whole genome processing, ASLM is the algorithm for somatic whole genome processing, and HSLM is the algorithm for whole exome processing.

For the targeted sequencing workflows, you can also run with a --cnv-segmentation-bed. The option pre-defines the segments to estimate copy numbers for and skips the segmentation step of the workflow. See Targeted Segmentation (Segment BED) for more information.

--cnv-segmentation-mode --- Specifies the segmentation algorithm to perform. The following values are available.
- bed
- cbs
- slm --- The default for germline WGS analysis.
- aslm --- The default for somatic WGS analysis.
- hslm --- The default for targeted/WES analysis.
--cnv-merge-distance --- Specifies the maximum number of base pairs between two segments that would allow them to be merged. The default value is 0 for germline WGS, which means the segments must be directly adjacent. For WES analysis, this parameter is disabled by default due to the spacing of targeted intervals.
--cnv-merge-threshold --- Specifies the maximum segment mean difference at which two adjacent segments should be merged. The segment mean is represented as a linear copy ratio value. The default is 0.2 for WGS and 0.4 for WES. To disable merging, set the value to 0.
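
The interaction of the two merge options can be sketched as a greedy pass over sorted segments. This is an illustration under stated assumptions, not DRAGEN's algorithm: segments are (start, stop, mean) tuples on a single contig with half-open coordinates, so directly adjacent segments have a gap of 0, and the merged mean is taken as a length-weighted average, which is a choice made only for this example.

```python
def merge_segments(segments, merge_distance=0, merge_threshold=0.2):
    """Greedily merge neighboring segments that are close and similar.

    segments: list of (start, stop, mean) sorted by start, one contig.
    A merge_threshold of 0 disables merging, as described above.
    """
    if merge_threshold == 0 or not segments:
        return list(segments)
    merged = [segments[0]]
    for start, stop, mean in segments[1:]:
        p_start, p_stop, p_mean = merged[-1]
        gap = start - p_stop
        if gap <= merge_distance and abs(mean - p_mean) <= merge_threshold:
            p_len, length = p_stop - p_start, stop - start
            # Length-weighted mean (assumption made for this sketch).
            new_mean = (p_mean * p_len + mean * length) / (p_len + length)
            merged[-1] = (p_start, stop, new_mean)
        else:
            merged.append((start, stop, mean))
    return merged
```

For instance, two adjacent segments with linear copy ratio means of 1.0 and 1.1 differ by less than 0.2 and are merged, while a following segment with a mean of 1.6 stays separate.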

Circular Binary Segmentation

Circular Binary Segmentation is implemented directly in DRAGEN and is based on "A faster circular binary segmentation algorithm for the analysis of array CGH data"¹, with enhancements to improve sensitivity for NGS data. The following options control Circular Binary Segmentation.

--cnv-cbs-alpha --- Specifies the significance level for the test to accept change points. The default is 0.01.
--cnv-cbs-eta --- Specifies the Type I error rate of the sequential boundary for early stopping when using the permutation method. The default is 0.05.
--cnv-cbs-kmax --- Specifies maximum width of smaller segment for permutation. The default is 25.
--cnv-cbs-min-width --- Specifies the minimum number of markers for a changed segment. The default is 2.
--cnv-cbs-nmin --- Specifies the minimum length of data for maximum statistic approximation. The default is 200.
--cnv-cbs-nperm --- Specifies the number of permutations used for p-value computation. The default is 10000.
--cnv-cbs-trim --- Specifies the proportion of data to be trimmed for variance calculations. The default is 0.025.

¹Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23(6):657-663. doi:10.1093/bioinformatics/btl646

Shifting Level Models Segmentation

The Shifting Level Models (SLM) segmentation mode follows from the R implementation of SLMSuite: a suite of algorithms for segmenting genomic profiles².

--cnv-slm-eta --- Baseline probability that the mean process changes its value. The default is 4e-5.
--cnv-slm-fw --- Minimum number of data points for a CNV to be emitted. The default is 0, which means segments with one design probe could in effect be emitted.
--cnv-slm-omega --- Scaling parameter that modulates relative weight between experimental or biological variance. The default is 0.3.
--cnv-slm-stepeta --- Distance normalization parameter. The default is 10000. This option is only valid for HSLM.

Regardless of segmentation method, initial segments are split across large gaps where depth data is unavailable, such as across centromeres.

²Orlandini V, Provenzano A, Giglio S, Magi A. SLMSuite: a suite of algorithms for segmenting genomic profiles. BMC Bioinformatics. 2017;18(1). doi:10.1186/s12859-017-1734-5

Targeted Segmentation (Segment BED)

In applications for targeted panels, you can limit segmentation and calling to specific intervals by specifying a --cnv-segmentation-bed. For example, the specified intervals might correspond to gene boundaries matched to the targeted assay. This segmentation mode is only supported with the panel of normals and requires an accompanying --cnv-target-bed. Also specify the --cnv-segmentation-bed during the panel of normals generation step, so that all interval boundaries during analysis are matched. For more information on panel of normals generation, see Panel of Normals.

The recommended format for the BED file includes four columns and a header. The four columns are contig, start, stop, and name. The name column represents the name of the gene and must be unique within the BED file. The name is used in the output VCF and annotated as a segment identifier in the INFO/SEGID field. The following example file is in the recommended format:

contig  start      stop       name
chr1    40356094   40372764   MYCL1
chr1    115245083  115261621  NRAS
chr1    204485504  204526342  MDM4
chr2    16075981   16090656   MYCN
chr2    29416087   30143527   ALK
chr3    12626010   12704516   RAF1
chr3    138374228  138478187  PIK3CB
chr3    178866307  178952154  PIK3CA
chr3    195776751  195806640  TFRC

If using a three-column BED file, then do not include a header or the name field values. Three-column BED files should only include the contig, start, and stop values. In this case, the segment identifier is autogenerated from the coordinate fields.
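
Before passing a four-column segmentation BED to DRAGEN, the constraints described above can be checked with a few lines of code. The sketch below is a hypothetical helper, not a DRAGEN tool. It assumes whitespace-separated columns and the exact header names from the example file, and it enforces unique names and start < stop.

```python
def read_segment_bed(lines):
    """Parse a four-column segmentation BED with a header line.

    Returns a list of (contig, start, stop, name) tuples and raises
    ValueError on a missing header, a malformed row, or a repeated name.
    """
    rows = [line.split() for line in lines if line.strip()]
    if not rows or rows[0] != ["contig", "start", "stop", "name"]:
        raise ValueError("expected header: contig start stop name")
    names, segments = set(), []
    for row in rows[1:]:
        if len(row) != 4:
            raise ValueError(f"expected 4 columns, got {len(row)}: {row}")
        contig, start, stop, name = row[0], int(row[1]), int(row[2]), row[3]
        if start >= stop:
            raise ValueError(f"start must be less than stop for {name}")
        if name in names:
            raise ValueError(f"duplicate segment name: {name}")
        names.add(name)
        segments.append((contig, start, stop, name))
    return segments
```

A typical use is read_segment_bed(open("segments.bed")), where segments.bed is a placeholder file name.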

Quality Scoring

Quality scores are computed using a probabilistic model that uses a mixture of heavy tailed probability distributions (one per integer copy number) with a weighting for event length. Noise variance is estimated. The output VCF contains a Phred-scaled metric that measures confidence in called amplification (CN > 2 for diploid locus), deletion (CN < 2 for diploid locus), or copy neutral (CN=2 for diploid locus) events.
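
As a reminder of what Phred scaling means, a score Q corresponds to an error probability of 10^(-Q/10). The sketch below shows only this standard conversion. It does not reproduce the DRAGEN scoring model, and the function names are illustrative.

```python
import math


def phred(p_error):
    """Convert an error probability to a Phred-scaled score."""
    if p_error <= 0.0:
        return math.inf
    return -10.0 * math.log10(p_error)


def error_probability(q):
    """Convert a Phred-scaled score back to an error probability."""
    return 10.0 ** (-q / 10.0)
```

Under this scaling, a score of 20 corresponds to a 1 in 100 chance that the call is wrong, and a score of 30 to a 1 in 1000 chance.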

The scoring algorithm also calculates exact copy-number quality scores that are inputs to the DeNovo CNV detection pipeline.

Exclude BED Filtering

You can input an exclude BED to the CNV caller to filter out regions from analysis. Inputting an exclude BED is useful if there are certain regions in the genome that are known to be problematic due to library prep, sequencing, or mapping issues. You can also exclude intervals that specify common CNVs to aid in downstream analysis. You can specify an exclude BED file using --cnv-exclude-bed. DRAGEN does not provide an exclude BED. The intervals to exclude should be formatted in standard three-column BED format.

The intervals in the exclude BED are compared with the original target counts intervals. If the overlap is greater than --cnv-exclude-bed-min-overlap, the target counts interval is excluded from analysis. The *.target.counts.gz file still includes the interval, so you can inspect the original read counts. The intervals are removed during the normalization stage, so the *.tn.tsv.gz file does not include them.
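
The overlap comparison can be sketched as follows. This is an illustration, not DRAGEN's code: it assumes that the overlap is measured as the fraction of the target interval covered by exclude intervals on the same contig, with half-open coordinates, and the function names are hypothetical.

```python
def overlap_fraction(target, exclude_intervals):
    """Fraction of target (start, stop) covered by the exclude intervals.

    Overlapping exclude intervals are not counted twice.
    """
    t_start, t_stop = target
    clipped = sorted(
        (max(s, t_start), min(e, t_stop))
        for s, e in exclude_intervals
        if s < t_stop and e > t_start
    )
    covered, cursor = 0, t_start
    for s, e in clipped:
        s = max(s, cursor)
        if e > s:
            covered += e - s
            cursor = e
    return covered / (t_stop - t_start)


def is_excluded(target, exclude_intervals, min_overlap):
    """True when the covered fraction is greater than min_overlap."""
    return overlap_fraction(target, exclude_intervals) > min_overlap
```

For a target interval of 1000-2000 and exclude intervals 900-1300 and 1200-1500, the covered fraction is 0.5, so the target is excluded with a minimum overlap of 0.4 but kept with a minimum overlap of 0.5.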

An excluded interval does not guarantee that a CNV call does not span the interval. If there is sufficient data flanking the region, the segmentation stage along with any merging might still generate a call spanning the excluded interval. However, the call would not take read counts from excluded intervals into account. You can view explanations for excluded intervals in the *.excluded_intervals.bed.gz file. See Output Files for further details.

Concurrent CNV and Small Variant Calling

DRAGEN can perform mapping and aligning of FASTQ samples, and then directly stream the data to downstream callers. If the input is a FASTQ sample, a single sample can run through both the CNV and the small VC. This triggers self-normalization by default.

Run the FASTQ sample through a regular DRAGEN map/align workflow, and then provide additional arguments to enable the CNV, VC, or both. The options that apply to CNV in the standalone workflows are also applicable here.

The following examples show different commands.

Map/Align FASTQ With CNV

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
--RGSM <SAMPLE> \
--RGID <RGID> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true \
--enable-cnv true \
--cnv-enable-self-normalization true

Map/Align FASTQ With VC

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
--RGSM <SAMPLE> \
--RGID <RGID> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true \
--enable-variant-caller true

Map/Align FASTQ With CNV and VC

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
--RGSM <SAMPLE> \
--RGID <RGID> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true \
--enable-cnv true \
--cnv-enable-self-normalization true \
--enable-variant-caller true

BAM Input to CNV and VC

dragen \
-r <HASHTABLE> \
--bam-input <BAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-enable-self-normalization true \
--enable-variant-caller true

Sample Correlation and Sex Genotyper

When running the target counts stage or the normalization stage, the DRAGEN CNV pipeline also provides the following information about the samples in the run.

A correlation metric of the read count profile between the case sample and any panel of normals samples. A correlation metric greater than 0.90 is recommended for confident analysis, but there is no hard restriction enforced by the software.
The predicted sex of each sample in the run. The sex is predicted based on the read count information in the sex chromosomes and the autosomal chromosomes. The median value for the counts is printed to the screen for the autosomal chromosomes, the X chromosome, and the Y chromosome. This estimation requires a minimum of 300 target intervals on the sex chromosomes to proceed.

The results are printed to the screen when running the pipeline. For example:

=============================================
Correlation Table
=============================================
Correlation of case sample PlatinumGenomes_50X_NA12877 against
PlatinumGenomes_50X_NA12878: 0.984092

=============================================
Sex Genotyper
=============================================
Predicted sex of samples
PlatinumGenomes_50X_NA12877: MALE XY 0.99737
PlatinumGenomes_50X_NA12878: FEMALE XX 0.968929

The predicted sexes for samples in use are also printed to the *.cnv_metrics.csv output file. For a panel of normals, the predicted sexes are used to determine which panel samples are leveraged for normalization on sex chromosomes. If the estimated sex of the sample is UNDETERMINED, the sex of the sample is set to FEMALE.

You can override the predicted sex of the sample with the --sample-sex option.
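
If the screen output is captured to a log, the Sex Genotyper lines can be collected for review. The following is a sketch that assumes the exact line layout of the example above (sample name, colon, sex, karyotype, number). The layout of an UNDETERMINED line is not shown in this guide, so the pattern might need adjusting, and the field name score is a label chosen for this example.

```python
import re

# Matches lines such as "PlatinumGenomes_50X_NA12877: MALE XY 0.99737".
SEX_LINE = re.compile(
    r"^(?P<sample>\S+):\s+(?P<sex>[A-Z]+)\s+(?P<karyotype>\S+)\s+(?P<score>[\d.]+)$"
)


def parse_sex_genotyper(log_text):
    """Return {sample: (sex, karyotype, score)} for each matching line."""
    calls = {}
    for line in log_text.splitlines():
        match = SEX_LINE.match(line.strip())
        if match:
            calls[match["sample"]] = (
                match["sex"],
                match["karyotype"],
                float(match["score"]),
            )
    return calls


def sex_used_by_caller(predicted_sex):
    """Apply the fallback described above: UNDETERMINED is treated as FEMALE."""
    return "FEMALE" if predicted_sex == "UNDETERMINED" else predicted_sex
```

Correlation Table lines do not match the pattern, so a full log can be passed in without trimming.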

Segmental Duplication Extension

The germline CNV workflow can be extended to call copy number alterations in a curated subset of segmentally duplicated regions. Segmental duplications are large blocks of DNA ≥ 1kb, characterized by a high degree of sequence identity at nucleotide level (> 90%). This poses a challenge for traditional approaches, and such regions are usually excluded.

This extension complements the original germline CNV workflow by using a tailored algorithm to compute the normalized coverage in such regions, which is then injected before the segmentation step and becomes part of the main CNV workflow in downstream steps. We currently recommend WGS data aligned to a supported human reference genome (currently only hg38) with at least 30x coverage. See below for additional requirements.

Supported duplications

The following pairs of genes defining Segmental Duplications are included:

Extension requirements

This extension is enabled by default in the germline CNV workflow. However, it requires:

Normalization set to self-normalization (--cnv-enable-self-normalization=true).
GC bias correction enabled (--cnv-enable-gcbias-correction=true).
Counts method set to start (--cnv-counts-method=start).
Interval width not greater than 10kb. However, we recommend using the --cnv-interval-width default (1kb) for best performance.
A supported reference genome build as input (currently only hg38-based references).

If necessary, the extension can be disabled through setting --cnv-enable-segdups-extension to false.

Algorithm

For each duplicated region, the extension collects all reads falling on top of the two homologous intervals of the pair, and it computes the normalized joint coverage (output to *.cnv.segdups.joint_coverage.tsv.gz).
Through differentiating sites between the two homologous intervals, the extension computes the proportion of coverage to associate to the first and to the second interval (output to *.cnv.segdups.site_ratios.tsv.gz).
Such proportion is used to redistribute the joint normalized coverage between the two homologous intervals.
The rescued intervals are output to the *.cnv.segdups.rescued_intervals.tsv.gz file for inspection and they are automatically injected before the segmentation step.
- During integration with the original intervals from the CNV caller, the rescued intervals are considered higher priority, thus replacing all original intervals that they overlap with.
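
The steps above can be sketched in a few lines. This is a simplified illustration, not the DRAGEN implementation: all function names are hypothetical, the per-site ratios are summarized with a median (an assumption made for this example), and intervals are (start, stop) pairs on one contig with half-open coordinates.

```python
from statistics import median


def first_interval_ratio(site_counts):
    """Estimate the share of coverage that belongs to the first interval.

    site_counts: list of (reads_first, reads_second) at differentiating sites.
    """
    ratios = [a / (a + b) for a, b in site_counts if a + b > 0]
    if not ratios:
        raise ValueError("no informative differentiating sites")
    return median(ratios)


def redistribute(joint_coverage, ratio_first):
    """Split the joint normalized coverage between the two intervals."""
    return joint_coverage * ratio_first, joint_coverage * (1.0 - ratio_first)


def inject_rescued(original, rescued):
    """Rescued intervals replace every original interval they overlap."""
    kept = [
        o for o in original
        if not any(o[0] < r[1] and r[0] < o[1] for r in rescued)
    ]
    return sorted(kept + list(rescued))
```

With site counts of (10, 10), (12, 8), and (9, 11), the median ratio is 0.5, so a joint coverage of 2.0 is split evenly between the two homologous intervals.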

See Output Files for a description of the extension output files.

Structural Variant Calling

The DRAGEN Structural Variant (SV) Caller integrates and extends Manta structural variant calling methods to provide SV and indel calls 50 bases or larger. SVs and indels are called from mapped paired-end sequencing reads. The SV caller is optimized for analysis of diploid germline variation in small sets of individuals and somatic variation in tumor-normal sample pairs.

The SV caller performs the following actions:

Discovers, assembles, and scores large-scale SVs, medium-sized indels, and large insertions within a single efficient workflow.
Combines paired and split-read evidence during SV discovery and scoring to improve accuracy, but does not require split-reads or successful breakpoint assemblies to report a variant in cases where there is strong evidence otherwise.
Scores known SV deletions and insertions from an input VCF file against one or more input samples, either as a standalone procedure or together with standard SV discovery.
Provides scoring models for 1) germline variants in small sets of diploid samples, 2) somatic variants in matched tumor-normal sample pairs, and 3) somatic and germline variants in tumor-only samples.

All SV and indel inferences are output in VCF 4.2 format.

DRAGEN SV Caller Overview

The DRAGEN SV Caller divides the SV and indel discovery process into the following steps.

Reads input files to estimate alignment statistics, including fragment size distribution and chromosome level depth. For more information on the SV Caller input options, see Command Line Options.
Scans the genome or a subset of the genome (specified by the call regions) to build various genome-wide data structures, including a breakend association graph of all SV associated regions. The graph contains edges that connect all regions of the genome that have a possible breakend association. Edges can connect two different regions of the genome to represent evidence of a long-range association, or an edge can connect to a region to capture a local indel/small SV association. These associations are more general than a specific SV hypothesis and multiple breakend candidates might be found on one edge. Typically only one or two candidates are found per edge. Instead of passing an inclusion region BED file, you can pass an exclusion region BED file to DRAGEN so that any SV breakend that overlaps these regions is removed from downstream analyses. The excluded regions are removed from the graph-building process, but active regions close to the boundaries of excluded regions can be extended into them during the refinement step. As a result, the final SV calls might still extend into the excluded regions.
Analyzes the breakend association graph to discover candidate SVs, then scores discovered candidate SVs and any known SVs from the input. Analysis and scoring are performed as follows.
1. Infers SV candidates that are associated with the given graph edge.
2. Assembles the SV breakends.
3. Merges discovered SV candidates with any known SV candidates included in the input data.
4. Scores/genotypes and filters each SV candidate under various biological models (currently germline, tumor-normal, and tumor-only).
5. Outputs scored SVs to VCF.

DRAGEN SV Caller Capabilities

The DRAGEN SV Caller can discover all identifiable structural variant types in the absence of copy number analysis and large-scale de novo assembly. For more information on detectable types, see Detected Variant Classes.

For each structural variant and indel, the SV Caller attempts to assemble the breakends by gathering nearby evidential reads, and to call SV events to base pair resolution by aligning assemblies against the reference genome. The SV Caller then reports the left-shifted breakend coordinate (per the VCF 4.2 SV reporting guidelines), together with any breakend homology sequence and/or inserted sequence between the breakends. As a result, the reported coordinates of an SV event might not directly match the read alignments shown in IGV. The assembly often fails to provide a confident explanation of the data, especially in repeat regions. In such cases, the SV Caller does not provide single-base resolution breakpoints or the associated split read support. Instead, it approximates the event breakpoints, scores the event under the same unified likelihood model used for regular cases, and reports the variant as IMPRECISE.

You can provide known SVs as input for forced genotyping. This known SV input can be scored either standalone or together with the standard SV discovery workflow, in which case the known and discovered SVs are merged.

The sequencing reads provided as input to the SV Caller are expected to be from a paired-end sequencing assay that results in an "innie" orientation between the two reads of each sequence fragment, each presenting a read from the outer edge of the fragment insert inward.

The SV Caller is primarily tested for whole-genome and whole-exome (or other targeted enrichment) sequencing assays on DNA. For these assays the following applications are supported:

Joint analysis of 5 or fewer diploid individuals
Subtractive analysis of a matched tumor-normal sample pair
Analysis of an individual tumor sample

For joint analysis, there is no specific restriction against larger cohorts, but there might be stability or call quality issues.

When performing somatic calling on liquid tumor samples, the matched normal sample might be contaminated with tumor cells. The contamination can substantially reduce somatic variant recall. To account for Tumor-in-Normal (TiN) contamination, you can enable liquid tumor mode. For more information, see Liquid Tumor Calling.

Tumor samples can be analyzed without a matched normal sample. In this case, both germline and somatic variants are scored and reported in the output.

Detected Variant Classes

The SV Caller can discover all variation classes that can be explained as novel DNA adjacencies in the genome. Novel DNA adjacencies are classified into the following categories based on the breakend pattern:

Deletions
Insertions
- Insertions in the result are divided into two subclasses, depending on whether the inserted sequence can be fully assembled: 1) fully-assembled insertions and 2) partially-assembled (inferred) insertions.
- Mobile Element Insertions that are not called by the general purpose SV routine will be rescued by the MEI specific routine based on similarity between assembled contigs and known sequences in the MEI catalog described in the file <INSTALL_PATH>/config/sv_mobile_element_sequences.fa.
Tandem Duplications
Inversions
Unclassified breakend pairs corresponding to intra- and inter-chromosomal translocations, or complex structural variants.

Known Limitations

The SV Caller cannot directly discover the following variant types:

Dispersed duplications.
- Dispersed duplications may be indirectly called as insertions or unclassified breakends.
Most expansion/contraction variants of a reference tandem repeat.
Breakends corresponding to small inversions.
- The limiting size is not tested, but in theory, detection falls off below ~200 bases. Micro-inversions might be detected indirectly as combined insertion/deletion variants.
Fully-assembled large insertions.
- The maximum fully-assembled insertion size should correspond to approximately twice the read-pair fragment size, but power to fully assemble the insertion should fall off to impractical levels before this size.
- The SV Caller does detect and report very large insertions when the breakend signature of such an event is found, even though the inserted sequence cannot be fully assembled.

More general repeat-based limitations exist for all variant types:

Power to assemble variants to breakend resolution falls to zero as breakend repeat length approaches the read size.
Power to detect any breakend falls to (nearly) zero as the breakend repeat length approaches the fragment size.

While the SV Caller classifies certain novel DNA-adjacencies into variant classes, it has a limited ability to infer high-level events resulting from complex rearrangements, so certain calls summarized as deletions, duplications, and insertions might be better described by looking at the full system of breakends and copy number changes associated with a given event.

Forced Genotyping Capability

The DRAGEN SV caller is capable of forced genotyping a set of SVs input from a VCF file. Forced genotyping means that the input SVs are scored and emitted in the output of the SV Caller even if the variant is not supported in the sample data. For example, given a germline analysis, the input variants are processed and written to the output VCF, even if the variant quality falls below the threshold normally required for an SV to be emitted.

Forced genotyping typically enables known SVs to be detected at higher recall than standard SV discovery (particularly for SV discovery on a lower-depth sample). Forced genotyping can also be useful to assert against the presence of an SV allele. For example, you can use forced genotyping to distinguish a confident homozygous reference genotype from a lack of sequencing coverage over the SV locus.

Forced genotyping SVs are processed according to the current SV analysis being run. For example, if a germline analysis is configured by providing one or more normal samples as input, then the input SVs are scored under a germline model.

Forced genotyping alleles are always emitted in the output and might have modified scoring and filtering rules applied compared to SVs only discovered from the sample data.

Forced Genotyping Modes

Forced Genotyping can be run in two modes.

Standalone --- Only the SVs described in an input VCF are scored and emitted.
Integrated --- The standard SV discovery analysis is run and the results are merged with SVs scored from the forced genotyping input. The workflow outputs the union of SVs discovered from the sample data and any additional forced genotyping alleles. The workflow is run whenever the --sv-discovery option is true.

Forced Genotyping Inputs

You can specify forced genotyping input using the --sv-forcegt-vcf option. The input must be a VCF of SV alleles. The SV allele types are restricted to insertions, deletions, tandem duplications, and breakends, which are not labeled with the INFO/IMPRECISE flag. The following are the filtering criteria required for the VCF record to be processed as an input SV allele. If any of the criteria are not met, the VCF record is removed from the set of input SVs for forced genotyping. When a forced genotyping VCF is specified on the command line, the SV caller reports the total number of SV records used as input SVs and the total number of records filtered (if any) due to the following criteria.

Describes an insertion, deletion, tandem duplication, or breakend record.
Cannot contain the INFO/IMPRECISE flag.
Cannot contain multiple ALT alleles.
Has a FILTER value of PASS or unknown (.).
All indels are at least the minimum scored variant size (default is 50).
Cannot repeat an SV allele previously described in the same file.
The REF field cannot be empty or unknown (.).
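
The criteria above can be read as a sequence of checks applied to each input record. The following Python sketch is illustrative only: the SvRecord type, the rejection_reason function, the SV type labels, and the key used to detect repeated alleles are all assumptions made for this example and are not part of DRAGEN.

```python
from dataclasses import dataclass, field

# Illustrative type labels; the real caller classifies records internally.
ALLOWED_TYPES = {"INS", "DEL", "DUP:TANDEM", "BND"}

@dataclass
class SvRecord:
    chrom: str
    pos: int
    ref: str
    alts: list            # ALT alleles as strings
    filt: str             # FILTER column
    sv_type: str          # assumed to be classified already
    size: int             # indel length in bases (unused for breakends)
    end: int = None       # INFO/END for symbolic alleles
    info_flags: set = field(default_factory=set)

def rejection_reason(rec, seen, min_scored_size=50):
    """Return why a record is dropped, or None if it is accepted."""
    if rec.sv_type not in ALLOWED_TYPES:
        return "unsupported SV type"
    if "IMPRECISE" in rec.info_flags:
        return "IMPRECISE flag set"
    if len(rec.alts) != 1:
        return "multiple ALT alleles"
    if rec.filt not in ("PASS", "."):
        return "FILTER is not PASS or ."
    if rec.sv_type in ("INS", "DEL") and rec.size < min_scored_size:
        return "indel below minimum scored variant size"
    if rec.ref in ("", "."):
        return "REF empty or unknown"
    # Assumption: two records describe the same allele if these fields match.
    key = (rec.chrom, rec.pos, rec.end, rec.ref, rec.alts[0])
    if key in seen:
        return "duplicate SV allele"
    seen.add(key)
    return None
```

Pass the same `seen` set for every record of one input file so that a repeated allele is detected on its second occurrence.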

You must describe insertions using the VCF small indel format, including an ALT entry that describes the complete insertion sequence. Using <INS> as a symbolic alt allele is not accepted. You can describe deletions using either the VCF small indel format or the <DEL> symbolic alt allele. For any variant described using a symbolic alt allele, you must also provide a value for INFO/END. Inversions represented in a single VCF record using the <INV> alt allele are not accepted, but the inversion can be genotyped if converted to a set of breakend records. Each breakpoint is described by a pair of breakend VCF records. If the forced genotyping input contains just one record of the pair and the input conditions above are met, the input is still accepted for forced genotyping, and the distal breakend is inferred from the local record.

You can describe breakpoint insertions for non-insertion SV alleles using one of the following two methods. Both methods correspond to the format used to describe breakpoint insertions in the SV VCF output.

For SVs described using the symbolic ALT format, such as <DEL>, the INFO/SVINSSEQ field is parsed to read the breakpoint insertion sequence.
For smaller indels described directly in the REF and ALT fields, the contents of the ALT field describe the breakend sequence.

Forced Genotyping Output

Forced genotyping SVs are always output to the standard VCF output of the SV Caller, regardless of whether the forced genotyping is standalone or integrated with SV calling. When the same SV allele is independently discovered from the sample data, only the discovered SV appears in the final output. The discovered SV allele is annotated to indicate the match to a forced genotyping input SV, and the scoring and filtration rules are changed to match.

VCF output records influenced by forced genotyping have the following associated fields.

The flag INFO/NotDiscovered is set for any VCF record that was not independently discovered from the sample data. When forced genotyping is run standalone, all output records contain the flag. When integrated with SV calling, the flag can distinguish the SV alleles that would not have been discovered in a standard SV analysis.
- For these variants only, the usual SV caller ID field generated from the SV Locus graph is not available. Instead, the ID is taken from the corresponding user input VCF. The suffix UserInput${InputVCFRecordNumber} is appended to the ID, separated by an underscore. If your input VCF contains only one of the two VCF records that comprise a breakend variant, then the ID is taken from the mate breakend record and the _Mate suffix is added.
Any output VCF record that corresponds to a forced genotyping input VCF record has the value INFO/UserInputId=${ID} set to reflect the VCF ID value of the input VCF record. The corresponding record might have also been discovered independently from the sample data and might not have the INFO/NotDiscovered flag set.
Any output VCF record whose forced genotyping alleles match an input SV exactly has the flag INFO/KnownSVScoring. VCF records with this flag are always emitted in the output of the SV Caller, and several filters, such as MaxDepth, are not applied to them.
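
As a minimal sketch of how a downstream script might read these annotations, the following Python function summarizes the three INFO fields described above for one record. The function name and the returned dictionary keys are hypothetical; only the INFO field names come from this guide.

```python
def forced_gt_status(info):
    """Summarize forced genotyping annotations from a VCF INFO string."""
    keys = {}
    for item in info.split(";"):
        name, _, value = item.partition("=")
        keys[name] = value
    return {
        # Record corresponds to a forced genotyping input record.
        "from_user_input": "UserInputId" in keys,
        "user_input_id": keys.get("UserInputId"),
        # NotDiscovered is set only when the sample data did not yield the SV.
        "independently_discovered": "NotDiscovered" not in keys,
        "known_sv_scoring": "KnownSVScoring" in keys,
    }
```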

Systematic Noise Filtering

When DRAGEN-SV is used in somatic mode (tumor-only or tumor-normal), you can specify a BEDPE file containing a set of paired regions to filter out sequencing and systematic noise as well as recurrent germline calls. Any variant that overlaps one of the systematic noise paired regions (with a population count of at least 2) and has the same orientation is marked as SystematicNoise in the final VCF file. Pass the BEDPE file using the command line option --sv-systematic-noise.

The systematic noise BEDPE file is built using VCFs that were generated by the DRAGEN-SV tumor-only pipeline when run on normal samples. These normal samples do not need to match the subject from whom the tumor sample was taken. The file might contain several dozen samples.

Generating systematic noise BEDPE file

You can generate systematic noise BEDPE files from normal samples collected using your library prep, sequencing system, and panels.

To generate a BEDPE file, do as follows.

Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.
Build the BEDPE file from the VCFs using the --sv-build-systematic-noise-vcfs-list option, which takes a list of the input VCFs from the previous step, one VCF per line. An example command line is provided below.

dragen \
-r <HASHTABLE> \
--sv-build-systematic-noise-vcfs-list <LIST OF VCF FILES> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE>
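
The list file passed to --sv-build-systematic-noise-vcfs-list contains one VCF path per line. The following Python sketch shows one way to assemble such a file. The write_vcf_list helper is hypothetical, and the default file name pattern assumes the per-sample SV VCFs follow the <output-file-prefix>.sv.vcf.gz naming and were collected into a single directory.

```python
from pathlib import Path

def write_vcf_list(vcf_dir, list_path, pattern="*.sv.vcf.gz"):
    """Write one VCF path per line and return the number of paths written."""
    vcfs = sorted(str(p) for p in Path(vcf_dir).glob(pattern))
    Path(list_path).write_text("".join(v + "\n" for v in vcfs))
    return len(vcfs)
```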

You can also build systematic noise BEDPE files in the cloud using the DRAGEN Baseline Builder App on BaseSpace.

Pre-built SV systematic noise BEDPE files

The following prebuilt systematic noise files for WGS are available for download on the DRAGEN Software Support Site page. To generate these noise files, we used 100 unrelated normal samples from the 1000 Genomes Project. Each systematic noise file contains a version string that DRAGEN uses to check the compatibility by default and exits early if a wrong systematic noise file is provided.

The systematic noise BEDPE should follow a particular format

SV Scoring

The SV caller applies a diploid scoring model for one or more diploid samples (treated as unrelated), as well as a somatic scoring model when a tumor and matched normal sample pair is given.

Germline scoring model

The germline scoring model produces diploid genotype probabilities for each candidate structural variant. Most candidates are approximated as independent for scoring purposes and modeled under a bi-allelic, diploid genotype likelihood setting. DRAGEN solves for the posterior probability over possible genotypes given the sequencing data by approximating it as proportional to the product of the prior probability of a genotype and the conditional probability of observing the set of read fragments (post-filtering) given the underlying genotype. DRAGEN treats each read fragment independently and represents the conditional probability of the set of read fragments as the product of the conditional probabilities of the individual read fragments. For each individual read fragment, DRAGEN combines paired-read and split-read evidence components and approximates their contributions as independent by representing the fragment's conditional probability as a product of these components. The paired-read component is weighted by a linear ramp from one to zero that depends on the candidate event type and size, because a tiny event does not significantly affect paired-read mapping status.

The paired-read component is modeled as a function measuring the deviation of the inferred fragment length from the overall distribution.
The split-read component is modeled as a function measuring the correctness of a read alignment to a breakend, computed by multiplying, across all non-gap bases, the probability of observing a given base call given the corresponding base of the evaluated allele.

Each read fragment may contribute only paired-read support, only split-read support, or both. Where a fragment contributes split-read support, this support may come from either or both reads in the read pair.
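
The structure of this calculation (prior times a product of per-fragment likelihoods, normalized over genotypes) can be sketched in Python as follows. This is not DRAGEN's implementation. The expected alt allele fractions per genotype, the mixture form of the per-fragment likelihood, the use of the ramp weight as an exponent on the paired-read term, and the ramp thresholds are all assumptions chosen to make the example concrete. All likelihood values are assumed to be strictly positive.

```python
import math

# Assumed expected alt allele fraction for each diploid genotype.
GENOTYPES = {"0/0": 0.0, "0/1": 0.5, "1/1": 1.0}

def pair_weight(sv_size, ramp_start, ramp_end):
    """Linear ramp: 0 for tiny events, 1 for large ones (thresholds are hypothetical)."""
    if sv_size <= ramp_start:
        return 0.0
    if sv_size >= ramp_end:
        return 1.0
    return (sv_size - ramp_start) / (ramp_end - ramp_start)

def fragment_likelihood(split, pair, weight):
    # Assumption: the ramp weight acts as an exponent on the paired-read term.
    return split * pair ** weight

def genotype_posteriors(fragments, priors, weight):
    """fragments: dicts mapping 'ref' and 'alt' to (split, pair) likelihoods."""
    log_post = {}
    for gt, alt_frac in GENOTYPES.items():
        total = math.log(priors[gt])
        for frag in fragments:
            l_ref = fragment_likelihood(*frag["ref"], weight)
            l_alt = fragment_likelihood(*frag["alt"], weight)
            total += math.log((1 - alt_frac) * l_ref + alt_frac * l_alt)
        log_post[gt] = total
    # Normalize in log space to avoid underflow with many fragments.
    top = max(log_post.values())
    unnorm = {gt: math.exp(v - top) for gt, v in log_post.items()}
    z = sum(unnorm.values())
    return {gt: v / z for gt, v in unnorm.items()}
```

With an equal number of fragments supporting each allele, the heterozygous genotype receives the highest posterior under this sketch.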

Somatic scoring model

The somatic scoring model is a Bayesian probabilistic model using a tumor-normal joint genotyping approach. It aims to call somatic structural variants in tumors while avoiding germline variants and noisy variants. In this model, the tumor and normal allele frequencies are treated as nonindependent random variables. DRAGEN calculates posterior probabilities for a range of genotype hypotheses, under the assumption that the normal sample conforms to the diploid germline genotype considering homozygous reference, heterozygous, and homozygous states. The tumor sample is a mixture of the germline genotype and, if present, the somatic allele. For the somatic genotype, we consider only two states referring to the absence and presence of the somatic variant in the tumor sample. In cases where the somatic variant is not present, we account for unsystematic independent noise, while assuming an error-free scenario when the somatic variant is considered. To calculate the genotype likelihoods, the model integrates allele frequency likelihoods over the joint tumor and normal allele frequencies and applies modifications to address liquid tumors with Tumor-in-Normal (TiN) contamination. The integration is approximated with a discrete summation. In these calculations, the likelihood for each read to support a given allele is shared with the germline scoring model. The tumor-only somatic scoring model is seen as a special case of the somatic scoring model in the absence of normal data (zero coverage). The posterior probability is converted into a Phred quality score and reported in the VCF output INFO/SOMATICSCORE field.
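
For reference, a Phred-scaled score expresses an error probability p as -10 * log10(p). The following Python sketch applies that standard conversion to a somatic posterior. The rounding and the cap value are assumptions for illustration and are not taken from this guide.

```python
import math

def phred_score(posterior_somatic, cap=999):
    """Phred-scale the probability that the somatic call is wrong."""
    error = 1.0 - posterior_somatic
    if error <= 0.0:
        return cap  # hypothetical cap for a posterior of exactly 1
    return min(cap, int(round(-10.0 * math.log10(error))))
```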

Input Requirements

When running the SV Caller, the input sequencing reads must be from a standard Illumina paired-end sequencing assay with an FR read pair orientation, where for each sequence fragment, a read proceeds from each end of the fragment inwards. For more information, see DRAGEN SV Caller Capabilities.

The SV Caller is optimized for paired-end libraries where the fragment size is typically larger than the size of both reads. Overlapping read pairs can be used to discover SVs, but might not always be handled optimally. For libraries where the typical fragment size is less than the read length, the SV caller attempts to differentiate reads sequencing into adapter sequence from the variant signal. In such cases, the SV Caller's input quality checks may fail and cause SV analysis to be skipped.

If using the standalone mode, your BAM/CRAM inputs must first be mapped. If you have not mapped and aligned your data yet, you can generate an alignment file.

Alignment Contig Checks

If running from a mapped and aligned BAM, then the contigs specified in the header must strictly match those in the DRAGEN hashtable specified in the current analysis. Missing or extra contigs will lead to a "Reference genome mismatch" configuration error and the analysis will not proceed. If such an error is observed, it is recommended to regenerate the alignment file with the intended DRAGEN hashtable, or to run with the DRAGEN map/align module enabled.

Input Quality Checks

The SV Caller runs quality checks on the input sequencing reads for each sample to make sure that the input corresponds to a paired read assay with the expected FR orientation, before estimating the fragment size distribution. To check consensus read pair orientation, a subset of high-quality read pairs is sampled. At least 90% of these must have the expected FR orientation for SV analysis to continue. Otherwise, the SV caller issues a warning, skips any further analysis, and writes empty results to its output files.

The SV Caller can tolerate nonpaired reads in the input if sufficient paired-end reads exist to estimate the fragment size distribution. To estimate the fragment size distribution, the SV Caller requires at least 100 read pairs that meet the quality requirements of the estimation routine. Both reads of the pair must have a non-zero mapping quality to the same chromosome, must not be filtered or part of a split read mapping, and must not contain indels or soft-clipping. If a sample does not contain a sufficient number of such read pairs, the SV Caller issues a warning, skips any further analysis, and writes empty results to its output files.
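
The two thresholds described above can be summarized in a short Python sketch. The function name and its inputs are hypothetical: it assumes the orientations of the sampled high-quality pairs and the count of pairs usable for fragment size estimation have already been computed elsewhere.

```python
MIN_FR_FRACTION = 0.90   # at least 90% of sampled pairs must be FR
MIN_USABLE_PAIRS = 100   # pairs needed to estimate the fragment size distribution

def input_passes_quality_checks(sampled_orientations, usable_pair_count):
    """sampled_orientations: labels such as "FR" for each sampled high-quality pair."""
    if not sampled_orientations:
        return False
    fr_fraction = sampled_orientations.count("FR") / len(sampled_orientations)
    return fr_fraction >= MIN_FR_FRACTION and usable_pair_count >= MIN_USABLE_PAIRS
```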

Read Groups

The SV Caller disregards any read group labels applied to the input sequences. Each input sample is treated as a separate library with a single fragment size distribution.

File Format

In standalone mode, input sequencing reads must be mapped and provided as input in either BAM or CRAM format. Each input file must be coordinate sorted and indexed to produce an htslib-style index in a file named to match the input BAM or CRAM file with an additional .bai, .crai, or .csi file name extension. For more information on standalone mode, see Modes of Operation.

At least one BAM or CRAM file must be provided for the normal or tumor sample. A matched tumor-normal sample pair can be provided as well. If multiple input files are provided for the normal sample, each file is treated as a separate sample as part of a joint diploid sample analysis.

In standalone mode, input BAM or CRAM files have the following limitations:

Alignments cannot have an unknown read sequence (SEQ="*")
Alignments cannot contain the "=" character in the SEQ field.
Alignments cannot use the sequence match/mismatch ("="/"X") CIGAR notation.
RG (read group) tags in the alignment records are ignored. Each alignment file is treated as representing one sample.
Alignments with base call quality values greater than 70 are rejected. These are not supported on the assumption that this indicates an offset error.
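
To screen alignments against these limitations before running the SV Caller, a check along the following lines could be applied to SAM-formatted text records (for example, records decoded from a BAM file by another tool). This is a hedged sketch: the function name is hypothetical, the field positions follow the SAM column order, and the quality decoding assumes the standard Phred+33 encoding.

```python
import re

def alignment_problems(sam_line, max_qual=70, phred_offset=33):
    """Return the limitations violated by one tab-separated SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    cigar, seq, qual = fields[5], fields[9], fields[10]
    problems = []
    if seq == "*":
        problems.append("unknown read sequence")
    if "=" in seq:
        problems.append("'=' in SEQ")
    if re.search(r"[=X]", cigar):
        problems.append("=/X CIGAR operators")
    # QUAL of "*" means qualities are absent, so there is nothing to check.
    if qual != "*" and any(ord(c) - phred_offset > max_qual for c in qual):
        problems.append("base quality above 70")
    return problems
```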

Generate an Alignment File

You need to generate alignment files for all samples that have not already been mapped and aligned. Each sample must have a unique sample identifier. Use the --RGSM option to specify the identifier. For BAM and CRAM input files, the sample identifier is taken from the file, so the --RGSM option is not required.

The following example command maps and aligns a FASTQ file:

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
--RGSM <SAMPLE> \
--RGID <RGID> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true

The following example command maps and aligns an existing BAM file:

dragen \
-r <HASHTABLE> \
--bam-input <BAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true

The following example command maps and aligns an existing CRAM file:

dragen \
-r <HASHTABLE> \
--cram-input <CRAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true

Exome/Targeted Calling

The SV caller can be configured for targeted sequencing inputs, which disables high-depth filters. Exome mode can be directly set to true or false with the command line option --sv-exome. If not directly set, exome mode defaults to false unless you run the SV caller in integrated mode and there is not more than 50 Gb of sequencing input.
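
The default described above can be restated as a small decision function. The following Python sketch is only a reading of that paragraph: the function and parameter names are hypothetical, and input_gb stands for the amount of sequencing input in the same "Gb" unit used in the text.

```python
def resolve_exome_mode(sv_exome=None, integrated=False, input_gb=None):
    """Return the effective exome-mode setting."""
    if sv_exome is not None:
        return sv_exome            # an explicit --sv-exome value always wins
    if integrated and input_gb is not None:
        return input_gb <= 50      # "not more than 50 Gb" of sequencing input
    return False                   # default when not set directly
```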

Internal Tandem Duplications Calling

You can use the --sv-somatic-ins-tandup-hotspot-regions-bed ${BEDFILE} option to specify ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. By default, DRAGEN SV automatically selects a reference-specific hotspots BED file from <INSTALL_PATH>/config/sv_somatic_ins_tandup_hotspot_*.bed. The file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with 300 bp of padding on both sides. To disable this feature, enter --sv-enable-somatic-ins-tandup-hotspot-regions false.

Liquid Tumor Calling

Liquid tumors refer to hematological cancers, such as leukemia. In tumor-normal analysis, liquid tumor mode accounts for Tumor-in-Normal (TiN) contamination by allowing a nonzero variant allele frequency for the matched normal when calculating the posterior probability of the somatic state. If the contamination is not accounted for, it can severely impact sensitivity by suppressing true somatic variants.

Use the following two options to control liquid tumor mode behavior.

--sv-enable-liquid-tumor-mode --- Enable liquid tumor mode. Liquid tumor mode is disabled by default.
--sv-tin-contam-tolerance --- Set the TiN contamination tolerance level. DRAGEN calls variants in the presence of TiN contamination up to a specified maximum tolerance level. You can enter any value between 0 and 1. The default maximum TiN contamination tolerance is 0.15. With the default value, somatic variants are expected to be observed in the normal sample with allele frequencies up to 15% of the allele frequency of the corresponding allele in the tumor sample.
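
As a worked example of the tolerance arithmetic, the following Python sketch computes the highest normal-sample allele frequency that is still consistent with TiN contamination for a given tumor allele frequency. The function name is hypothetical, and the simple proportional relationship is a direct reading of the description above rather than the caller's internal model.

```python
def max_expected_normal_af(tumor_af, tin_tolerance=0.15):
    """Upper bound on the normal-sample allele frequency attributable to TiN."""
    if not 0.0 <= tin_tolerance <= 1.0:
        raise ValueError("tin_tolerance must be between 0 and 1")
    return tumor_af * tin_tolerance
```

For a somatic allele at 40% in the tumor, the default tolerance allows up to 6% in the matched normal.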

Command Line Options

The following command line options are supported for the Structural Variant Caller.

Input and Output Options

The following are the top-level options that are shared with the DRAGEN Host Software to control the SV pipeline. You can use BAM and CRAM files as input. Alternatively, if using read mapping with the SV calling in a single run, you can use all of the DRAGEN input options, including FASTQ, BAM, and CRAM files.

--cram-input---The CRAM file to be processed.
--tumor-cram-input---If performing tumor-normal or tumor-only analysis, the tumor CRAM file to be processed.
--fastq-file1, --fastq-file2, --fastq-list---Input FASTQ files or a list of files to be processed.
--tumor-fastq1, --tumor-fastq2, --tumor-fastq-list---Input tumor FASTQ file or list of files to be processed.
--enable-map-align---Enables DRAGEN map/align. The default is true, so all input reads are remapped and aligned unless the option is set to false.
--output-directory---Output directory where all results are stored.
--output-file-prefix---Output file prefix that will be prepended to all result file names.
--ref-dir---The DRAGEN reference genome hashtable directory. For more information about the reference genome hashtable, see Prepare a Reference Genome.
--bam-input---The BAM file to be processed.
--tumor-bam-input---If performing tumor-normal or tumor-only analysis, the tumor BAM file to be processed.

Structural Variant Caller Pipeline Options

--enable-sv --- Enable or disable the structural variant caller. The default is false.
--sv-call-regions-bed --- Specifies a BED file containing the set of regions to call. Optionally, you can compress the file in gzip or bgzip format.
--sv-exclusion-bed --- Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.
--sv-region --- Limit the analysis to a specified region of the genome for debugging purposes. This option can be specified multiple times to build a list of regions. The value must be in the format "chr:startPos-endPos".
--sv-exome --- Set to true to configure the variant caller for targeted sequencing inputs, which includes disabling high depth filters. In integrated mode, the default is to autodetect targeted sequencing input, and in standalone mode the default is false.
--sv-output-contigs --- Set to true to have assembled contig sequences output in a VCF file. The default is false.
--sv-forcegt-vcf --- Specify a VCF of structural variants for forced genotyping. The variants are scored and emitted in the output VCF even if not found in the sample data. The variants are merged with any additional variants discovered directly from the sample data.
--sv-discovery --- Enable SV discovery. This flag can be set to false only when --sv-forcegt-vcf is used. When set to false, SV discovery is disabled and only the forced genotyping input variants are processed. The default is true.
--sv-use-overlap-pair-evidence --- Allow overlapping read pairs to be considered as evidence. The default is false.
--sv-somatic-ins-tandup-hotspot-regions-bed --- Specify a BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. By default, DRAGEN SV automatically selects a reference-specific hotspots BED file from <INSTALL_PATH>/config/sv_somatic_ins_tandup_hotspot_*.bed.
--sv-enable-somatic-ins-tandup-hotspot-regions --- Enable or disable the ITD hotspot region input. The default is true in somatic variant analysis.
--sv-enable-liquid-tumor-mode --- Enable liquid tumor mode. See Liquid Tumor Calling.
--sv-tin-contam-tolerance --- Set the Tumor-in-Normal (TiN) contamination tolerance level. See Liquid Tumor Calling for more information.
--sv-systematic-noise --- Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bzip compressed). For more information, see Systematic Noise Filtering.
--sv-detect-systematic-noise --- Set to true to generate VCF output per normal sample. For more information, see Systematic Noise Filtering.
--sv-build-systematic-noise-vcfs-list --- List of input VCFs, one per line, generated by running with --sv-detect-systematic-noise set to true. The list is used to build the systematic noise BEDPE file. For more information, see Systematic Noise Filtering.
--sv-min-edge-observations --- Remove all edges from the graph with fewer than this many observations. The default value is 3.
--sv-min-candidate-spanning-count --- Run SV caller and report all large SVs with at least this many spanning support observations. The default value is 3.
--sv-min-candidate-variant-size --- Run SV caller and report all SVs/indels at or above this size. The default value is 10.
--sv-min-scored-variant-size --- After candidate identification, only score and report SVs/indels at or above this size. The default value is 50. This parameter does not affect the somatic hotspot region.
--sv-hotspot-min-scored-variant-size --- After candidate identification, only score and report SVs/indels at or above this size inside the SV hotspot region, which includes FLT3, ARHGEF7, and KMT2A genes by default. The default value is 25.

Modes of Operation

Structural Variant calling can run in the following modes:

Standalone --- Uses mapped BAM/CRAM input files. If you have not mapped and aligned your data yet, see Input Requirements. This mode requires the following options:
- --enable-map-align false
- --enable-sv true
Integrated --- Automatically runs on the output of the DRAGEN mapper/aligner. This mode requires the following options:
- --enable-map-align true
- --enable-sv true
- --enable-map-align-output true
- --output-format bam

You can also enable Structural Variant calling with any other caller.

The following is an example command line for Integrated mode:

dragen -f \
--ref-dir=<HASH_TABLE> \
--enable-map-align true \
--enable-map-align-output true \
--enable-sv true \
--output-directory <OUT_DIR> \
--output-file-prefix <PREFIX> \
--RGID Illumina_RGID \
--RGSM <sample name> \
-1 <FASTQ1> \
-2 <FASTQ2>

The following is an example command line for joint diploid calling in standalone mode:

dragen -f \
--ref-dir <HASH_TABLE> \
--bam-input <BAM1> \
--bam-input <BAM2> \
--bam-input <BAM3> \
--enable-map-align false \
--enable-sv true \
--output-directory <OUT_DIR> \
--output-file-prefix <PREFIX>

Structural Variant VCF Output

The structural variants VCF output file is available in the output directory. The file is named <output-file-prefix>.sv.vcf.gz. The contents of the file depend on the type of analysis.

For each major analysis category (germline, tumor-normal, and tumor-only), the VCF output file reflects variant calls made under the variant calling mode corresponding to that analysis type.

VCF Output

VCF output follows the VCF 4.2 specification for describing structural variants. It uses standard field names wherever possible. All custom fields are described in the VCF header. The following sections provide information on the variant representation details and the primary VCF field values.

VCF Sample Names

Sample names in the VCF output are extracted from the first read group (@RG) record found in the header of each input alignment file. If no sample name is found, a default label (SAMPLE1, SAMPLE2, etc.) is used instead.
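
The lookup described above can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that the sample name is stored in the SM tag of the @RG header line, as is conventional in SAM headers, and that the default label is numbered by the position of the input file.

```python
def vcf_sample_name(header_lines, file_index):
    """Return the sample name from the first @RG line, or a SAMPLE<n> default."""
    for line in header_lines:
        if line.startswith("@RG"):
            for tag in line.rstrip("\n").split("\t")[1:]:
                if tag.startswith("SM:"):
                    return tag[3:]
            break  # only the first @RG record is consulted
    return f"SAMPLE{file_index}"
```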

Small Indel Representation

All variants are reported in the VCF using symbolic alleles unless they are classified as a small indel, in which case full sequences are provided for the VCF REF and ALT allele fields. A variant is classified as a small indel if all of the following criteria are met:

The variant can be entirely expressed as a combination of inserted and deleted sequences.
The deletion or insertion length is less than 1000 bases.
The variant breakends and/or the inserted sequence are not imprecise.
The variant has not been converted from a deletion to intra-chromosomal breakends by the depth-based SV classification routine.

When VCF records are output in the small indel format, they also include the CIGAR INFO tag describing the combined insertion and deletion event.

Insertions with Incomplete Insert Sequence Assembly

Large insertions are reported in some cases even when the insert sequence cannot be fully assembled. In this case, the SV Caller reports the insertion using the <INS> symbolic allele and includes the special INFO fields LEFT_SVINSSEQ and RIGHT_SVINSSEQ to describe the assembled left and right ends of the insert sequence. The following is an example of such a record from the joint diploid analysis of NA12878, NA12891 and NA12892 mapped to hg19:

chr1	11830208	DRAGEN:INS:1577:0:0:0:3:0	T	<INS>	999	PASS	END=11830208;SVTYPE=INS;CIPOS=0,12;CIEND=0,12;HOMLEN=12;HOMSEQ=TAAATTTTTCTT;LEFT_SVINSSEQ=TAAATTTTTCTTTTTTCTTTTTTTTTTAAATTTATTTTTTTATTGATAATTCTTGGGTGTTTCTCACAGAGGGGGATTTGGCAGGGTCACGGGACAACAGTGGAGGGAAGGTCAGCAGACAAACAAGTGAACAAAGGTCTCTGGTTTTCCCAGGCAGAGGACCCTGCGGCCTTCCGCAGTGTTCGTGTCCCTGATTACCTGAGATTAGGGATTTGTGATGACTCCCAACGAGCATGCTGCCTTCAAGCATCTGTTCAACAAAGCACATCTTGCACTGCCCTTAATTCATTTAACCCCGAGTGGACACAGCACATGTTTCAAAGAG;RIGHT_SVINSSEQ=GGGGCAGAGGCGCTCCCCACATCTCAGATGATGGGCGGCCAGGCAGAGACGCTCCTCACTTCCTAGATGTGATGGCGGCTGGGAAGAGGCGCTCCTCACTTCCTAGATGGGACGGCGGCCGGGCGGAGACGCTCCTCACTTTCCAGACTGGGCAGCCAGGCAGAGGGGCTCCTCACATCCCAGACGATGGGCGGCCAGGCAGAGACACTCCCCACTTCCCAGACGGGGTGGCGGCCGGGCAGAGGCTGCAATCTCGGCACTTTGGGAGGCCAAGGCAGGCGGCTGCTCCTTGCCCTCGGGCCCCGCGGGGCCCGTCCGCTCCTCCAGCCGCTGCCTCC	GT:FT:GQ:PL:PR:SR	0/1:PASS:999:999,0,999:22,24:22,32	0/1:PASS:999:999,0,999:18,25:24,20	0/0:PASS:230:0,180,999:39,0:34,0

Normalizing Small Tandem Duplications

The SV caller can also represent tandem duplications as insertions. This representation creates ambiguity in how the variants are presented in the VCF output, especially for small tandem duplications. The representation can lead to complications, such as unrecognized call duplication.

To better normalize the SV caller output, so that the same variant type is not represented in two different VCF formats, small tandem duplications (< 1000 bases) are converted to insertions in the VCF output. Insertions converted from such tandem duplications have a formatting similar to incomplete insertions, using the symbolic allele <INS> for the ALT field. The following example shows an insertion, which was converted from a tandem duplication during this normalization process.

chr2    2520057 DRAGEN:DUP:TANDEM:53645:0:1:0:0:0 T   <INS>   813 PASS    END=2520057;SVTYPE=INS;SVLEN=52;DUPSVLEN=52 GT:FT:GQ:PL:PR:SR   0/1:PASS:393:863,0,390:25,0:19,25

Converted insertions include copies of certain output fields. The fields appear the same as in a tandem duplication record. For example, INFO/DUPSVINSSEQ provides a copy of the breakpoint insertion value computed for the duplication. In the context of a duplication, the breakpoint insertion value would normally be written to INFO/SVINSSEQ. The following example shows a converted insertion with a breakpoint insertion value:

chr2    2645730 DRAGEN:DUP:TANDEM:53649:0:1:0:0:0 C   <INS>   367 PASS    END=2645730;SVTYPE=INS;SVLEN=97;DUPSVLEN=86;DUPSVINSLEN=11;DUPSVINSSEQ=CTCACCTTCAT  GT:FT:GQ:PL:PR:SR   0/1:PASS:367:417,0,386:19,0:20,15

For more information about copied INFO fields, see VCF INFO Fields. All INFO fields use the same DUP prefix.
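
In both converted-insertion examples above, INFO/SVLEN equals INFO/DUPSVLEN plus INFO/DUPSVINSLEN (52 = 52 + 0 and 97 = 86 + 11). The following Python sketch parses an INFO string and recomputes that sum. The helper names are hypothetical, and the additive relationship is inferred from the two example records rather than stated as a rule in this guide.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flags map to True."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out

def converted_insertion_length(info):
    """Insertion length implied by the copied duplication fields."""
    fields = parse_info(info)
    # DUPSVINSLEN is absent when the duplication has no breakpoint insertion.
    return int(fields["DUPSVLEN"]) + int(fields.get("DUPSVINSLEN", 0))
```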

Inversions

Inversions are reported as a set of breakends. For example, given a simple reciprocal inversion, four breakends are reported, sharing the same EVENT INFO tag. The following are example breakend records representing a simple reciprocal inversion:

chr1	17124941	DRAGEN:BND:1445:0:1:1:3:0:0	T	[chr1:234919886[T	999	PASS	SVTYPE=BND;MATEID=DRAGEN:BND:1445:0:1:1:3:0:1;CIPOS=0,1;HOMLEN=1;HOMSEQ=T;EVENT=DRAGEN:BND:1445:0:1:0:0:0:0;JUNCTION_QUAL=254;BND_DEPTH=107;MATE_BND_DEPTH=100	GT:FT:GQ:PL:PR:SR	0/1:PASS:999:999,0,999:65,8:15,51
chr1	17124948	DRAGEN:BND:1445:0:1:0:0:0:0	T	T]chr1:234919824]	999	PASS	SVTYPE=BND;MATEID=DRAGEN:BND:1445:0:1:0:0:0:1;EVENT=DRAGEN:BND:1445:0:1:0:0:0:0;JUNCTION_QUAL=999;BND_DEPTH=109;MATE_BND_DEPTH=83	GT:FT:GQ:PL:PR:SR	0/1:PASS:999:999,0,999:60,2:0,46
chr1	234919824	DRAGEN:BND:1445:0:1:0:0:0:1	G	G]chr1:17124948]	999	PASS	SVTYPE=BND;MATEID=DRAGEN:BND:1445:0:1:0:0:0:0;EVENT=DRAGEN:BND:1445:0:1:0:0:0:0;JUNCTION_QUAL=999;BND_DEPTH=83;MATE_BND_DEPTH=109	GT:FT:GQ:PL:PR:SR	0/1:PASS:999:999,0,999:60,2:0,46
chr1	234919885	DRAGEN:BND:1445:0:1:1:3:0:1	A	[chr1:17124942[A	999	PASS	SVTYPE=BND;MATEID=DRAGEN:BND:1445:0:1:1:3:0:0;CIPOS=0,1;HOMLEN=1;HOMSEQ=A;EVENT=DRAGEN:BND:1445:0:1:0:0:0:0;JUNCTION_QUAL=254;BND_DEPTH=100;MATE_BND_DEPTH=107	GT:FT:GQ:PL:PR:SR	0/1:PASS:999:999,0,999:65,8:15,51

Depth-Based SV Type Classification

In the germline calling model, when SV candidates are discovered from the sample data and have sufficient paired and split read evidence to be reported in the output, the SV caller applies additional depth-based tests to more accurately classify certain SV candidate types. Candidate breakpoints that are consistent with a deletion are tested for the lower read depth that is expected inside the deleted region. Candidate breakpoints consistent with a tandem duplication are tested for the higher read depth expected in the duplicated region. Candidate SV calls that fail the depth-based tests are still reported in the output, but changed to intrachromosomal breakends. Candidate SV calls that pass continue to be reported in the standard deletion and tandem duplication output formats.

SV Breakpoint Insertions

SVs frequently include a small sequence insertion at the breakpoint. Breakpoint insertions are represented differently depending on the SV type. The INFO/SVINSSEQ field in the VCF output provides the most general description of breakpoint insertions by describing the insertion sequence itself. The corresponding INFO/SVINSLEN field describes the length of the insertion sequence. For example, the following VCF record describes a large (~8.8 kb) deletion, which includes a single base insertion (C) between the left and right deletion breakends.

chr22   17770350        DRAGEN:DEL:101:0:1:0:0:0  C       <DEL>   687     PASS    END=17779108;SVTYPE=DEL;SVLEN=-8758;SVINSLEN=1;SVINSSEQ=C       GT:FT:GQ:PL:PR:SR       0/1:PASS:687:737,0,858:39,20:32,8

The INFO/SVINSSEQ field is also used to describe breakpoint insertions for tandem duplication and breakend records. The field can also be used to describe the insertion sequence of a large SV insertion.

Breakpoint insertions are represented differently in the VCF small indel format. The SV caller represents small deletions and insertions using the VCF small indel format instead of symbolic ALT alleles. Any breakpoint insertion that occurs in the VCF small indel format is represented as part of the VCF ALT field. See Small Indel Representation for the conditions under which this format is used for SVs.

In the following small indel format example, the VCF record describes a 57 base deletion that includes a single base insertion (A) between the left and right deletion breakends.

chr22   32981929        DRAGEN:DEL:1136:0:0:0:0:0 TGTATACATATATGTGTATATACGTATATATGTATATATGTATGTATACGTATATATG      TA      537     PASS    END=32981986;SVTYPE=DEL;SVLEN=-57;CIGAR=1M1I57D GT:FT:GQ:PL:PR:SR       0/1:PASS:308:587,0,305:8,0:23,15
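
The INFO/CIGAR value in a small indel record determines the REF and ALT lengths: M and D operations consume reference bases, and M and I operations contribute to the alternate allele. The following Python sketch derives both lengths from a CIGAR string. The function name is hypothetical, and the sketch handles only the M, I, and D operations that appear in the examples in this guide.

```python
import re

def indel_lengths_from_cigar(cigar):
    """Return (ref_len, alt_len) implied by a CIGAR made of M, I, and D operations."""
    ref_len = alt_len = 0
    for count, op in re.findall(r"(\d+)([MID])", cigar):
        n = int(count)
        if op in "MD":
            ref_len += n   # bases present in REF
        if op in "MI":
            alt_len += n   # bases present in ALT
    return ref_len, alt_len
```

For CIGAR=1M1I57D, this gives a 58 base REF and a 2 base ALT, consistent with the ALT value TA in the record above (the anchor base T followed by the inserted A).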

Breakend records include an additional encoding of breakpoint insertion sequence, as described in the VCF specification for the breakend ALT field. The SV caller also provides the information to the INFO/SVINSSEQ field for consistency with other SV record types.

The following example shows a breakend connecting a region of chromosomes 1 and 12 in the sample with a breakend insertion sequence of CA between the two breakends. The insertion sequence is described in both the ALT and INFO/SVINSSEQ fields.

1       39604587        DRAGEN:BND:31780:1:3:0:0:0:1      T       TCA[12:6472102[ 774     PASS    SVTYPE=BND;MATEID=DRAGEN:BND:31780:1:3:0:0:0:0;SVINSLEN=2;SVINSSEQ=CA;BND_DEPTH=67;MATE_BND_DEPTH=55      GT:FT:GQ:PL:PR:SR       0/1:PASS:774:824,0,999:63,3:36,33
12      6472102 DRAGEN:BND:31780:1:3:0:0:0:0      G       ]1:39604587]CAG 774     PASS    SVTYPE=BND;MATEID=DRAGEN:BND:31780:1:3:0:0:0:1;SVINSLEN=2;SVINSSEQ=CA;BND_DEPTH=55;MATE_BND_DEPTH=67      GT:FT:GQ:PL:PR:SR       0/1:PASS:774:824,0,999:63,3:36,33
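The relationship between the breakend ALT field and INFO/SVINSSEQ can be checked with a short script. The following Python sketch is illustrative only and is not part of DRAGEN. It assumes the four breakend ALT shapes defined in the VCF specification (t[p[, t]p], ]p]t, [p[t) and recovers the inserted bases by stripping the REF base from the sequence part of ALT. The function name parse_bnd_alt is hypothetical.

```python
import re

# Breakend ALT shapes from the VCF specification: t[p[, t]p], ]p]t, [p[t
# t = REF base plus any inserted bases, p = mate position (contig:pos).
_BND_ALT = re.compile(
    r"^(?P<lead>[ACGTNacgtn]*)"
    r"(?P<open>[\[\]])(?P<chrom>[^\[\]]+):(?P<pos>\d+)(?P<close>[\[\]])"
    r"(?P<trail>[ACGTNacgtn]*)$"
)


def parse_bnd_alt(ref, alt):
    """Return (mate_chrom, mate_pos, inserted_sequence) for a breakend ALT."""
    m = _BND_ALT.match(alt)
    if m is None or m.group("open") != m.group("close"):
        raise ValueError(f"not a breakend ALT: {alt!r}")
    lead, trail = m.group("lead"), m.group("trail")
    if bool(lead) == bool(trail):
        raise ValueError(f"sequence must be on exactly one side: {alt!r}")
    if lead:
        # Sequence precedes the bracket: REF base first, then inserted bases.
        if not lead.upper().startswith(ref.upper()):
            raise ValueError("ALT sequence does not start with REF")
        inserted = lead[len(ref):]
    else:
        # Sequence follows the bracket: inserted bases first, then REF base.
        if not trail.upper().endswith(ref.upper()):
            raise ValueError("ALT sequence does not end with REF")
        inserted = trail[: len(trail) - len(ref)]
    return m.group("chrom"), int(m.group("pos")), inserted
```

Applied to the two records above, parse_bnd_alt("T", "TCA[12:6472102[") and parse_bnd_alt("G", "]1:39604587]CAG") both yield the inserted sequence CA, which matches SVINSSEQ=CA in each record.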

SV Breakpoint Insertion Orientation

The breakpoint insertion sequence is always provided with respect to the strand of the current SV record. Some breakend records have inverted orientation. For inverted orientations, the insertion sequence in each breakend record is the reverse complement of the insertion sequence in its mated record.

The following breakend pair example demonstrates an inverted orientation.

1       210891730       DRAGEN:BND:43882:0:2:0:2:0:1      A       AATG]19:45732595]       999     PASS    SVTYPE=BND;MATEID=DRAGEN:BND:43882:0:2:0:2:0:0;SVINSLEN=3;SVINSSEQ=ATG;BND_DEPTH=76;MATE_BND_DEPTH=106    GT:FT:GQ:PL:PR:SR       0/1:PASS:999:999,0,999:69,16:43,55
19      45732595        DRAGEN:BND:43882:0:2:0:2:0:0      G       GCAT]1:210891730]       999     PASS    SVTYPE=BND;MATEID=DRAGEN:BND:43882:0:2:0:2:0:1;SVINSLEN=3;SVINSSEQ=CAT;BND_DEPTH=106;MATE_BND_DEPTH=76    GT:FT:GQ:PL:PR:SR       0/1:PASS:999:999,0,999:69,16:43,55
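The reverse complement relationship in the pair above (ATG in one record, CAT in its mate) can be verified as follows. This is a minimal Python sketch, not DRAGEN code. The helper names are hypothetical, and whether a pair is inverted is passed in by the caller rather than derived from the records.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def insertions_consistent(svinsseq, mate_svinsseq, inverted):
    """Check that two mated SVINSSEQ values describe the same insertion.

    inverted: True when the breakend pair has inverted orientation
    (supplied by the caller; this sketch does not infer it).
    """
    expected = reverse_complement(svinsseq) if inverted else svinsseq
    return expected.upper() == mate_svinsseq.upper()
```

For the inverted pair above, insertions_consistent("ATG", "CAT", inverted=True) holds. For the earlier chromosome 1 and 12 pair, both records carry CA and the check holds with inverted=False.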

SV Breakpoint Homology

Each VCF record output by the SV caller is shifted to the left-most position of the exact homology range of the breakpoint. The exact homology range of the breakpoint is the continuous range of positions over which the SV could be represented while still describing the same SV haplotype. The exact homology range is described in the VCF output with the INFO/HOMSEQ field, which describes the sequence of the exact homology range and the corresponding INFO/HOMLEN field, which describes the length of the range.

The following example shows a 62 base deletion with an 11 base breakend homology region. Without left-shifting, the SV has an equivalent representation anywhere from position 39497639 to 39497650.

chr22   39497639        DRAGEN:DEL:34:85:85:1:0:0 GGGGGGTGGGGGCGGGTTGGAGGAGGTTGGCGGGGGGCGGGGGCGGGTTGGAGGAGGTTGGCA G       187     PASS    END=39497701;SVTYPE=DEL;SVLEN=-62;CIGAR=1M62D;CIPOS=0,11;HOMLEN=11;HOMSEQ=GGGGGTGGGGG   GT:FT:GQ:PL:PR:SR     0/1:PASS:12:237,0,8:4,0:2,8

The following simplified examples illustrate exact breakend homology for a three-base deletion and a three-base insertion. In both cases, the variant is left-shifted so that the corresponding VCF record position is 2.

Deletion

Reference: GTCAGCGA

Variant: GT---CGA

Insertion

Reference: GT---CAG

Variant: GTCGGCAG

In both the insertion and the deletion, there is a single base of exact breakend homology (C), so the same variant can also be represented one base to the right.
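The homology length in these simplified examples can be reproduced by counting how far a left-shifted indel can slide to the right without changing the resulting haplotype. The following Python sketch is an illustration under that assumption, not the SV caller's implementation. Function names are hypothetical and coordinates are 0-based offsets into the reference string (offset 2 is the first base after VCF position 2).

```python
def deletion_homology(ref, start, length):
    """Homology for deleting ref[start:start + length] (0-based start).

    The deletion can shift right by one base each time the base leaving
    the deleted window equals the base entering it.
    """
    k = 0
    while start + length + k < len(ref) and ref[start + k] == ref[start + length + k]:
        k += 1
    return k, ref[start:start + k]


def insertion_homology(ref, pos, inserted):
    """Homology for inserting `inserted` before ref[pos] (0-based pos).

    The insertion can shift right while the reference bases after the
    insertion point match the start of (inserted + following reference).
    """
    tail = inserted + ref[pos:]
    k = 0
    while pos + k < len(ref) and tail[k] == ref[pos + k]:
        k += 1
    return k, ref[pos:pos + k]
```

With the sequences above, deletion_homology("GTCAGCGA", 2, 3) and insertion_homology("GTCAG", 2, "CGG") each report a homology length of 1 with homology sequence C, corresponding to HOMLEN=1 and HOMSEQ=C.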

VCF INFO Fields

VCF FORMAT Fields

VCF FILTER Fields

Germline

The following table lists the VCF FILTER fields applied to germline VCF output.

Tumor-Normal Somatic

The following table lists the VCF FILTER fields applied to tumor-normal somatic VCF output.

Tumor-Only

The following table lists the VCF FILTER fields applied to tumor-only VCF output.

Interpretation of VCF Filters

There are two levels of VCF filters: record level (FILTER) and sample level (FORMAT/FT). Most record-level filters are independent of the sample-level filters. However, in a germline analysis, if none of the samples pass all sample-level filters, the SampleFT record-level filter is applied.

Interpretation of INFO/EVENT Field

Some structural variants reported in the VCF, such as translocations, represent a single novel sequence junction in the sample. The INFO/EVENT field indicates that two or more such junctions are hypothesized to occur together as part of a single variant event. All individual variant records belonging to the same event share the same INFO/EVENT string. Note that although such an inference could be applied after SV calling by analyzing the relative distance and orientation of the called variant breakpoints, the SV Caller incorporates this event mechanism into the calling process to increase sensitivity towards such larger-scale events. Given that at least one junction in the event has already passed standard variant candidacy thresholds, sensitivity is improved by lowering the evidence thresholds for additional junctions which occur in a pattern consistent with a multijunction event (such as a reciprocal translocation pair).

Although this mechanism could generalize to events including an arbitrary number of junctions, it is currently limited to two. Thus, at present it is most useful for identifying and improving sensitivity towards reciprocal translocation pairs.

SV Variant Allele Fraction (VAF) Calculation

Some supporting read pairs provide both PR and SR evidence, so adding the PR and SR counts can count the same fragment twice. To support an unbiased calculation of variant allele fraction (VAF), the VF field reports the number of sequence fragments (read pairs) that strongly support the REF and ALT alleles, in that order. VAF is calculated as VAF = VF_ALT/(VF_ALT+VF_REF).
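The VAF formula can be applied directly to the two VF values. The following Python sketch is illustrative. It assumes the VF sample value is a comma-separated REF,ALT pair, like the PR and SR values shown in the examples earlier in this section. The function names are hypothetical.

```python
def sv_vaf(vf_ref, vf_alt):
    """VAF = VF_ALT / (VF_ALT + VF_REF); None when there is no evidence."""
    total = vf_ref + vf_alt
    if total == 0:
        return None
    return vf_alt / total


def vaf_from_vf_value(value):
    """Parse a 'REF,ALT' count pair (assumed layout) and return the VAF."""
    vf_ref, vf_alt = (int(x) for x in value.split(","))
    return sv_vaf(vf_ref, vf_alt)
```

For example, a VF value of 30,10 gives a VAF of 10/(10+30) = 0.25.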

VCF ID Field

The VCF ID, or identifier, field can be used for annotation. For BND (breakend) records for translocations, the ID value is also used to link breakend mates or partners. The following is an example of a VCF ID field from the SV caller:

DRAGEN:INS:1577:0:0:0:3:0

The value provided in the ID field reflects the SV association graph edge(s) from which the SV or indel was discovered. The value is guaranteed to be unique within any single VCF output file produced by the SV Caller. These values are therefore used to link associated breakend records using the standard VCF MATEID key. The exact structure of this identifier may change in the future. You can use the entire value as a unique key, but parsing the key could lead to incompatibility with future DRAGEN versions. See the DRAGEN Software Support Site for information on the latest version of DRAGEN.

Convert SV VCF to BEDPE Format

It can sometimes be convenient to express structural variants in BEDPE format. For such applications, DRAGEN recommends the script vcfToBedpe available on GitHub. The repository is forked from @hall-lab with modifications to support VCF 4.2 SV format.

BEDPE format greatly reduces structural variant information compared to the SV Caller VCF output. In particular, breakend orientation, breakend homology, and insertion sequence are lost, in addition to the ability to define fields for locus and sample specific information. For this reason, Illumina only recommends BEDPE as a temporary output for applications that require it.

QC Metrics Reporting

DRAGEN generates multiple pipeline-specific metrics including:

Mapping and Aligning metrics
Variant calling metrics
Biomarker metrics
Coverage (or enrichment) metrics and reports
Duration (or run time) metrics

Figure 10: Generation of Metrics and Reports

The QC metrics are printed to the standard output. In addition, CSV files are written to the run output directory:

<output prefix>.mapping_metrics.csv
<output prefix>.vc_metrics.csv
<output prefix>.<coverage region prefix>_coverage_metrics.csv
<output prefix>.time_metrics.csv
<output prefix>.<other coverage reports>.csv

Each CSV file has five columns: Section, Subsection (e.g. read group or sample), Metric, Value 1 (Count/Ratio/Minutes), and Value 2 (Percentage/Seconds).
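Because every metrics file shares this column layout, a single reader can load any of them. The following Python sketch is an illustration, not a DRAGEN utility. It assumes that rows may omit Value 2, as in some of the example rows shown later in this section, and it keeps values as strings because some metrics are reported as NA. The function name read_metrics is hypothetical.

```python
import csv
import io


def read_metrics(text):
    """Map (section, subsection, metric) to (value1, value2) strings.

    value2 is None for rows that have only four columns.
    """
    metrics = {}
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        section, subsection, metric, value1 = fields[:4]
        value2 = fields[4] if len(fields) > 4 else None
        metrics[(section, subsection, metric)] = (value1, value2)
    return metrics


example = (
    "COVERAGE SUMMARY,,Aligned bases in genome,148169295474,100.00\n"
    "COVERAGE SUMMARY,,Average alignment coverage over genome,46.08\n"
)
metrics = read_metrics(example)
```

To read a file from the run output directory, pass the file contents, for example read_metrics(open(path).read()), where path is the location of one of the CSV files listed above.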

Mapping and Aligning Metrics

DRAGEN computes mapping and aligning metrics similar to Samtools Flagstat.

The following applies to the mapping metrics:

Metrics are available both at an aggregate level and at a per read group level.
In germline and somatic tumor-only modes, only one set of mapping metrics is available.
In somatic tumor-normal mode, the mapping and aligning metrics are generated separately for the tumor and normal samples, with each line beginning with TUMOR or NORMAL to indicate the sample. The metrics for the tumor sample are output first, followed by the metrics for the normal sample. Metrics per read group are also separated into tumor and normal read groups.
Unless explicitly stated, the metrics are in units of reads (not pairs).

Definitions:

Total input reads---Total number of reads in the input FASTQ files.
Number of duplicate marked reads---Reads marked as duplicates as a result of the --enable-duplicate-marking option being set to true.
Number of duplicate marked and mate reads removed---Reads marked as duplicates, along with any mate reads, that are removed when the --remove-duplicates option is set to true.
Number of unique reads---Total number of reads minus the duplicate marked reads.
Reads with mate sequenced---Number of reads with a mate.
Reads without mate sequenced---Total number of reads minus number of reads with mate sequenced.
QC-failed reads---Reads that did not pass platform/vendor quality checks (SAM flag 0x200).
Mapped reads---Total number of mapped reads.
Mapped reads with filtered mapping---Total number of mapped reads plus reads mapped to non-reference decoy contigs plus reads mapped to the rRNA filter contig.
Mapped reads adjusted for excluded mapping---Total number of mapped reads plus reads mapped to the excluded RNA mitochondrial contig.
Mapped reads adjusted for filtered and excluded mapping---Total number of mapped reads plus reads mapped to the rRNA filter contig plus reads mapped to the excluded RNA mitochondrial contig.
Number of unique and mapped reads---Number of mapped reads minus number of duplicate marked reads.
Unmapped reads---Total number of reads that could not be mapped.
Unmapped reads minus filtered mapping---Total number of unmapped reads minus reads mapped to non-reference decoy contigs minus reads mapped to the rRNA filter contig.
Unmapped reads adjusted for excluded mapping---Total number of unmapped reads minus reads mapped to the excluded RNA mitochondrial contig.
Unmapped reads adjusted for filtered and excluded mapping---Total number of unmapped reads minus reads mapped to the rRNA filter contig minus reads mapped to the excluded RNA mitochondrial contig.
Singleton reads---Number of reads where the read could be mapped, but the paired mate could not be mapped.
Paired reads---Count of reads in which both reads in the pair are mapped.
Properly paired reads---Both reads in the pair are mapped and fall within an acceptable range from each other based on the estimated insert length distribution.
Not properly paired reads (discordant)---The number of paired reads minus the number of properly paired reads.
Paired reads mapped to different chromosomes---The number of reads with a mate, where the mate was mapped to a different chromosome.
Paired reads mapped to different chromosomes (MAPQ >= 10)---The number of reads with a MAPQ >= 10 and with a mate, where the mate was mapped to a different chromosome.
Reads with indel R1---The percentage of R1 reads containing at least 1 indel.
Reads with indel R2---The percentage of R2 reads containing at least 1 indel.
Soft-clipped bases R1---The percentage of bases in R1 reads that are soft-clipped.
Soft-clipped bases R2---The percentage of bases in R2 reads that are soft-clipped.
Mismatched bases R1---The number of mismatched bases on R1, which is the sum of SNP count and indel lengths. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Mismatched bases R2---The number of mismatched bases on R2, which is the sum of SNP count and indel lengths. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Mismatched bases R1 (excluding indels)---The number of mismatched bases on R1. Indel lengths are ignored. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Mismatched bases R2 (excluding indels)---The number of mismatched bases on R2. Indel lengths are ignored. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Q30 Bases---The total number of bases with a BQ >= 30. Includes both mapped and unmapped reads and bases. Excludes duplicate marked reads and secondary alignments.
Q30 Bases R1---The total number of bases on R1 with a BQ >= 30.
Q30 Bases R2---The total number of bases on R2 with a BQ >= 30.
Q30 Bases (excluding dups and clipped bases)---The number of bases on non-duplicate and non-clipped bases with a BQ >= 30.
Histogram of read mapping qualities
Reads with MAPQ [40:inf)
Reads with MAPQ [30:40)
Reads with MAPQ [20:30)
Reads with MAPQ [10:20)
Reads with MAPQ [0:10)
Total alignments---Total number of loci that reads aligned to with quality > 0.
Secondary alignments---Number of secondary alignment loci.
Supplementary (chimeric) alignments---A chimeric read is split over multiple loci (possibly due to structural variants). One alignment is referred to as the representative alignment. The others are supplementary.
Estimated read length---Total number of input bases divided by the number of reads.
Insert length: mean---Mean insert size estimated for the read group.
Insert length: median---Median insert size estimated for the read group.
Insert length: standard deviation---Standard deviation of insert size estimated for the read group.

Note: The insert length metrics reported above are computed using high quality (MAPQ >= 20) properly paired read pairs, considering all the read pairs for the read group. They may differ from the values in the standard output log reported during insert stats sampling, which covers only the first ~2M read pairs for DNA (~100K read pairs for RNA).

Whole read group insert length estimation for RNA datasets is currently not supported. For RNA runs, the reported insert length metrics are computed using up to the first ~100K high quality read pairs for the read group from the input FASTQ/BAM/CRAM file.

Input bases divided by reference genome size---Raw coverage as computed by summing all read lengths (including duplicate marked reads, but excluding secondary and supplementary alignments) and dividing by the reference genome size.
Input bases divided by target bed size---Raw coverage as computed by summing all read lengths (including duplicate marked reads, but excluding secondary and supplementary alignments) and dividing by the target bed size.
Estimated sample contamination---The estimated fraction of reads in a sample that may be from another human source.

Cross-sample contamination

The DRAGEN cross-sample contamination module uses a probabilistic mixture model to estimate the fraction of reads in a sample that may be from another human source. DRAGEN supports separate modes for germline and somatic samples.

The germline model, like VerifyBamID, assumes that a sample can be modeled as a DNA mixture from 2 or more individuals. Pileup analysis is used to investigate loci where variants are common in the general population. Variants with high allele frequencies are likely to be real germline variants in the individual of interest, while low allele frequency variants at these common germline loci are likely noise or germline variants from a contaminating sample B. The probabilistic mixture model accounts for noise and then tries to detect consistent allele frequency distributions. For example, if the pileups show consistent low allele frequencies of 1% or 2%, then the mixture model will likely infer 2% contamination from sample B, where the 1% and 2% AF variants correspond to heterozygous and homozygous germline calls in sample B.

The germline cross-contamination metric is enabled by using the following setting and pointing to a VCF that includes marker sites (RSIDs) with population allele frequencies that are close to 0.5.

--qc-cross-cont-vcf <INSTALL_PATH>/resources/qc/sample_cross_contamination_resource_[hg19 or GRCh37 or GRCh38].vcf

The somatic model, like GATK CalculateContamination, supports tumor-only or tumor-normal runs. The somatic model is more advanced than the germline model in that it accounts for somatic CNVs or LoH regions where the diploid assumptions may be broken. The algorithm also tries to account for FFPE deamination and oxidation noise by empirically adjusting base qualities prior to estimation.

The somatic cross-contamination metric is enabled by pointing to the VCF that includes the marker sites (RSIDs) with high population allele frequencies.

--qc-somatic-contam-vcf <INSTALL_PATH>/resources/qc/somatic_sample_cross_contamination_resource_[hg19 or GRCh37 or GRCh38].vcf.gz

The metric value is printed as a fraction, so a value of 0.011 represents 1.1% contamination from another sample.

MAPPING/ALIGNING SUMMARY Estimated sample contamination 0.011

The precision of variant calling, particularly for somatic variants, can be significantly impacted by cross-sample contamination. To ensure safe usage of a sample, the level of cross-sample contamination must be considerably lower than the minimum allele frequencies of interest. For instance, if a sample has 1% contamination, it may be necessary to disregard all variants with less than 5% allele frequency. The cross-contamination metric for a sample reaches saturation near 30% contamination.

The contamination module requires a minimum of 100 valid pileups for contamination estimation, where a pileup is considered valid if it has at least 10X coverage and 95% or more reads are deemed valid. Soft clipped reads that could indicate INDELs or structural variants are not considered valid, and datasets with untrimmed adapters may lead to most reads being soft clipped and classified as invalid. If the contamination module reports "NA," even for high-coverage samples, it is recommended to inspect a few pileup locations in IGV for evidence of untrimmed bases.
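The thresholds in the previous paragraph can be expressed as a simple pre-check. The following Python sketch is only an illustration of those stated thresholds. It is not how DRAGEN determines read validity, and the input layout (a list of (depth, valid_reads) pairs per marker site) and the function name are assumptions made for this example.

```python
def can_estimate_contamination(pileups, min_pileups=100, min_depth=10,
                               min_valid_fraction=0.95):
    """Return True when enough marker-site pileups are valid.

    pileups: iterable of (depth, valid_reads) pairs (hypothetical layout).
    A pileup is valid with at least 10X coverage and 95% or more valid reads;
    at least 100 valid pileups are needed for an estimate.
    """
    valid = 0
    for depth, valid_reads in pileups:
        if depth >= min_depth and valid_reads / depth >= min_valid_fraction:
            valid += 1
    return valid >= min_pileups
```

A dataset in which most reads are soft clipped, for example because of untrimmed adapters, fails this check even at high coverage, which corresponds to the "NA" result described above.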

Optional Contamination Settings:

Variant Calling Metrics

The generated variant calling metrics are similar to the metrics computed by RTG vcfstats. Metrics are reported for each sample in multi sample VCF and gVCF files and in a CSV file with the file name ending in "vc_metrics.csv". Based on the run case, metrics are reported either as standard VARIANT CALLER or JOINT CALLER. Metrics are reported for both the raw (PREFILTER) and hard filtered (POSTFILTER) VCF files.

Panel of Normals (PON) and COSMIC filtered variants are counted as PASS variants in the POSTFILTER VCF metrics. These PASS variants can cause higher than expected variant counts in the POSTFILTER VCF metrics.

Number of samples---Number of samples in the population/joint VCF.
Reads Processed---The number of reads used for variant calling, excluding any duplicate marked reads and reads falling outside of the target region.
Total---The total number of variants (SNPs + MNPs + indels).
Biallelic---Number of sites in a genome that contain two observed alleles. The reference is counted as one allele, which allows for one variant allele.
Multiallelic---Number of sites in the VCF that contain three or more observed alleles. The reference is counted as one, which allows for two or more variant alleles.
SNPs---A variant is counted as an SNP when the reference, allele 1, and allele 2 are all length 1.
Insertions (Hom)---Number of variants that contain homozygous insertions.
Insertions (Het)---Number of variants where both alleles are insertions, but not homozygous.
Deletions (Hom)---Number of variants that contain homozygous deletions.
INDELS (Het)---Number of variants where genotypes are either [insertion+deletion], [insertion+SNP], or [deletion+SNP].
De Novo SNPs---De novo marked SNPs with DQ greater than the threshold. Set the --qc-snp-denovo-quality-threshold option to the required threshold. The default is 0.05 if ML recalibration is off, 0.0017 if ML recalibration is on.
De Novo INDELs---De novo marked indels with DQ values greater than the threshold. This DQ threshold can be specified by setting the --qc-indel-denovo-quality-threshold option to the required DQ threshold. The default is 0.4 if ML recalibration is off, 0.04 if ML recalibration is on.
De Novo MNPs---De novo marked MNPs with DQ greater than the threshold. Set the --qc-snp-denovo-quality-threshold option to the required threshold. The default is 0.05 if ML recalibration is off, 0.0017 if ML recalibration is on.
(Chr X SNPs)/(Chr Y SNPs) ratio in the genome (or the target region) ---Number of SNPs in chromosome X (or in the intersection of chromosome X with the target region) divided by the number of SNPs in chromosome Y (or in the intersection of chromosome Y with the target region). If there was no alignment to either chromosome X or chromosome Y, this metric shows as NA.
SNP Transitions---An interchange of two purines (A<->G) or two pyrimidines (C<->T).
SNP Transversions---An interchange of purine and pyrimidine bases.
Ti/Tv ratio---Ratio of transitions to transversions.
Heterozygous---Number of heterozygous variants.
Homozygous---Number of homozygous variants.
Het/Hom ratio---Heterozygous/homozygous ratio.
In dbSNP---Number of variants detected that are present in the dbSNP reference file. If no dbSNP file is provided via the --dbsnp option, then both the In dbSNP and Novel metrics show as NA.
Novel---Total number of variants minus number of variants in dbSNP.
Percent Callability---Available in germline and somatic modes with gVCF output. The percentage of non-N reference positions having a PASSing genotype call. Multiallelic variants are not counted. Deletions are counted for all the deleted reference positions only for homozygous calls. Only autosomes and chromosomes X, Y, and M are considered. To produce this metric for non-human references, set --qc-callability-autosome-contigs to specify the autosome contig names. Optionally, --qc-callability-xym-contigs allows setting X, Y and M contig names.
Percent Autosome Callability---Only autosomes are considered. To produce this metric for non-human references, set --qc-callability-autosome-contigs to specify the autosome contig names.
Percent QC Region Callability in Region i (i can be 1, 2, or 3)---Available if callability for custom regions is requested via the --qc-coverage-region-i option and the callability output is specified with --qc-coverage-reports-i. All contigs are considered. Setting --qc-callability-autosome-contigs enables outputting this metric for non-human references.
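The transition and transversion definitions in the list above can be illustrated with a small classifier. The following Python sketch is illustrative only and does not reproduce how DRAGEN counts variants (for example, it does not handle multiallelic sites). The function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a SNP: {ref}>{alt}")
    same_class = ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"


def titv_ratio(snps):
    """Ti/Tv for an iterable of (ref, alt) pairs; None without transversions."""
    kinds = [classify_snp(ref, alt) for ref, alt in snps]
    ti = kinds.count("transition")
    tv = kinds.count("transversion")
    return ti / tv if tv else None
```

For example, A>G and C>T are transitions and A>T is a transversion, so those three SNPs together give a Ti/Tv ratio of 2.0.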

Per Contig Het/Hom Ratio

When the germline small variant caller is executed, DRAGEN calculates a het/hom ratio per contig.

The het/hom ratio values can be used as an indication of whole chromosome uniparental disomy (UPD). UPD of certain chromosomes is associated with genetic syndromes known as imprinting disorders. Chromosomes with whole chromosome UPD have het/hom ratios close to 0.0. Otherwise, ratios vary but are usually between 1.0 and 2.0. The het/hom ratios should be interpreted in the context of the specific assay.

DRAGEN reports the ratios for both the raw (PREFILTER) and hard-filtered (POSTFILTER) VCF. The metrics are output to the .vc_hethom_ratio_metrics.csv file.

The file contains the following values for each primary contig processed.

Contig
Number of heterozygous variants
Number of homozygous variants
Het/Hom ratio

The following example shows a section of the metrics.

VARIANT CALLER POSTFILTER,HG04070,1 Heterozygous,185733
VARIANT CALLER POSTFILTER,HG04070,1 Homozygous,182928
VARIANT CALLER POSTFILTER,HG04070,1 Het/Hom ratio,1.015
VARIANT CALLER POSTFILTER,HG04070,2 Heterozygous,203946
VARIANT CALLER POSTFILTER,HG04070,2 Homozygous,174294
VARIANT CALLER POSTFILTER,HG04070,2 Het/Hom ratio,1.170
VARIANT CALLER POSTFILTER,HG04070,3 Heterozygous,192861
VARIANT CALLER POSTFILTER,HG04070,3 Homozygous,130087
VARIANT CALLER POSTFILTER,HG04070,3 Het/Hom ratio,1.483
VARIANT CALLER POSTFILTER,HG04070,4 Heterozygous,178389
VARIANT CALLER POSTFILTER,HG04070,4 Homozygous,157062
VARIANT CALLER POSTFILTER,HG04070,4 Het/Hom ratio,1.136
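The rows above can be grouped by contig to recompute the ratio and to look for contigs with unusually low values. The following Python sketch is an illustration that assumes the row layout shown in the example, where the metric name is the contig followed by Heterozygous or Homozygous. The function names are hypothetical, and the threshold is left to the caller because an appropriate cutoff depends on the assay.

```python
EXAMPLE = [
    "VARIANT CALLER POSTFILTER,HG04070,1 Heterozygous,185733",
    "VARIANT CALLER POSTFILTER,HG04070,1 Homozygous,182928",
    "VARIANT CALLER POSTFILTER,HG04070,3 Heterozygous,192861",
    "VARIANT CALLER POSTFILTER,HG04070,3 Homozygous,130087",
]


def hethom_by_contig(lines, section="VARIANT CALLER POSTFILTER"):
    """Map contig -> (het, hom, het/hom ratio) for one section."""
    counts = {}
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 4 or fields[0] != section:
            continue
        metric, value = fields[2], fields[3]
        for label, key in ((" Heterozygous", "het"), (" Homozygous", "hom")):
            if metric.endswith(label):
                contig = metric[: -len(label)]
                counts.setdefault(contig, {})[key] = int(value)
    table = {}
    for contig, c in counts.items():
        if "het" in c and "hom" in c:
            ratio = c["het"] / c["hom"] if c["hom"] else None
            table[contig] = (c["het"], c["hom"], ratio)
    return table


def contigs_below(table, threshold):
    """Contigs whose ratio is below a caller-chosen threshold."""
    return [c for c, (_, _, r) in table.items() if r is not None and r < threshold]


table = hethom_by_contig(EXAMPLE)
```

For the example rows, the recomputed ratios round to 1.015 for contig 1 and 1.483 for contig 3, matching the Het/Hom ratio rows in the file.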

Coverage and Callability Reports

DRAGEN supports a number of reports dedicated to coverage metrics. Some other DRAGEN components, including the mapper and aligner, ploidy caller, and variant callers, may emit limited coverage-related metrics. The metrics from these other components may not always exactly match the metrics in the DRAGEN coverage reports. The following table lists some important differences.

Table 6 Coverage reported in files other than the main coverage reports

The coverage reports listed in Table 7 all follow the same default rules for counting or excluding reads:

Duplicate reads are ignored.
Soft and/or hard-clipped bases are ignored.
Overlapping mates are double-counted.
Reads with MAPQ > 0 are included (i.e. MAPQ=0 reads are filtered).
Bases with BQ >= 0 are included.

Table 7 DRAGEN Coverage Reports

Custom coverage reports

By default, DRAGEN generates coverage reports over the whole genome and, if a target region is provided, over the target region as well. DRAGEN additionally supports the ability to specify custom regions and report types of interest.

In somatic tumor-normal mode, DRAGEN generates separate reports for the tumor and normal samples. Each report is labeled according to the sample type. Tumor sample reports include tumor at the end of the file name, and normal sample reports include normal at the end of the file name. To include both tumor and normal sample results in one file, set the --vc-enable-separate-t-n-metrics option to false. DRAGEN then reports on the aggregate of both samples.

The coverage reports do not require the mapper or variant callers. However, they are not compatible with --enable-sort=false.

The following command shows an example use case for specifying custom coverage reports:

dragen ... \
--qc-coverage-region-1 <bed file 1> \
--qc-coverage-reports-1 full_res \
--qc-coverage-region-2 <bed file 2> \
--qc-coverage-region-3 <bed file 3> \
--qc-coverage-reports-3 full_res cov_report

The settings --qc-coverage-region-i and --qc-coverage-reports-i work as a pair (i can be 1, 2, or 3). The former setting specifies the region, while the latter specifies the report type for that region. The number i links the settings. Up to 3 such region and report pairs may be specified.

The --qc-coverage-region-i option requires a BED file argument (i can be 1, 2, or 3).
Regions in each BED file can be optionally padded using --qc-coverage-region-padding-i option (by default 0 padding is applied).
A set of default reports is generated for each region.
Additional reports can be specified for each region by using the --qc-coverage-reports-i option.

If multiple report types are selected per region, they should be space-separated, e.g. --qc-coverage-reports-1 callability full_res.

Rules for including reads and bases in the coverage calculations

Default settings used for all DRAGEN coverage reports:

Duplicate reads are ignored.
Soft and/or hard-clipped bases are ignored.
Overlapping mates are double-counted.
Reads with MAPQ > 0 are included. MAPQ=0 reads are filtered.
Bases with BQ >= 0 are included.

Non-default setting

For example, the following options are used to enable full (basepair) resolution coverage output with more stringent MAPQ and BQ filtering:

--qc-coverage-region-1 <custom_regions.bed>
--qc-coverage-filters-1 'mapq<10,bq<30'
--qc-coverage-reports-1 full_res

The argument syntax mapq<value,bq<value implies that reads with a mapping quality less than the specified value, or bases with a read base call quality below the specified value, will be ignored.
Valid filter arguments are mapq and bq only. Either, or both, can be specified.
Only one operator < is supported. <=, >, >=, = are not supported.
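The filter semantics described above can be sketched as follows. This Python example is illustrative only and is not DRAGEN's parser. It accepts the same syntax (mapq and bq terms with the < operator only) and applies only the thresholds that are explicitly given. It does not model the default filtering rules. The function names are hypothetical.

```python
def parse_coverage_filters(spec):
    """Parse a filter string such as 'mapq<10,bq<30' into thresholds."""
    thresholds = {}
    for term in spec.split(","):
        name, sep, value = term.strip().partition("<")
        if sep != "<" or name not in ("mapq", "bq") or not value.isdigit():
            raise ValueError(f"unsupported filter term: {term!r}")
        thresholds[name] = int(value)
    return thresholds


def is_ignored(read_mapq, base_bq, thresholds):
    """True when a base is dropped by any explicitly specified threshold."""
    if "mapq" in thresholds and read_mapq < thresholds["mapq"]:
        return True
    if "bq" in thresholds and base_bq < thresholds["bq"]:
        return True
    return False
```

With the filter string mapq<10,bq<30, a base from a read with MAPQ 9 is ignored regardless of its base quality, a base with BQ 29 is ignored regardless of MAPQ, and a base with MAPQ 10 and BQ 30 is counted. Terms such as mapq<=10 are rejected because only the < operator is supported.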

Coverage Metrics

By default, DRAGEN emits a _coverage_metrics.csv file for each available region type, including the full genome, target region, and any additionally specified QC regions.

The _coverage_metrics.csv file is generally the most useful of all the coverage reports and will probably be the first file to inspect when performing coverage based QC.

The first column of the output file contains the section name COVERAGE SUMMARY and the second column (the subsection) is empty for all entries in the file.

The following metrics are calculated:

Aligned bases in region---Number of uniquely mapped bases to region and the percentage relative to the number of uniquely mapped bases to the genome.
Average alignment coverage over region---Number of uniquely mapped bases to region divided by the number of sites in region.
Uniformity of coverage (PCT > 0.2*mean) over region---Percentage of sites with coverage greater than 20% of the mean coverage in region.
PCT of region with coverage [ix, inf)---Percentage of sites in region with at least ix coverage, where i can equal 100, 50, 20, 15, 10, 3, 1 and 0.
PCT of region with coverage [ix, jx)---Percentage of sites in region with at least ix but less than jx coverage, where (i, j) can equal (50, 100), (20, 50), (15, 20), (10, 15), (3, 10), (1, 3) and (0, 1).
Average chromosome X coverage over region---Total number of bases that aligned to the intersection of chromosome X with region divided by the total number of loci in the intersection of chromosome X with region. If there is no chromosome X in the reference genome or the region does not intersect chromosome X, this metric shows as NA.
Average chromosome Y coverage over region---Total number of bases that aligned to the intersection of chromosome Y with region divided by the total number of loci in the intersection of chromosome Y with region. If there is no chromosome Y in the reference genome or the region does not intersect chromosome Y, this metric shows as NA.
XAvgCov/YAvgCov ratio over genome/target region---Average chromosome X alignment coverage in region divided by the average chromosome Y alignment coverage in region. If there is no chromosome X or chromosome Y in the reference genome or the region does not intersect chromosome X or Y, this metric shows as NA.
Average mitochondrial coverage over region---Total number of bases that aligned to the intersection of the mitochondrial chromosome with region divided by the total number of loci in the intersection of the mitochondrial chromosome with region. If there is no mitochondrial chromosome in the reference genome or the region does not intersect mitochondrial chromosome, this metric shows as NA.
Average autosomal coverage over region---Total number of bases that aligned to the autosomal loci in region divided by the total number of loci in the autosomal loci in region. If there is no autosome in the reference genome, or the region does not intersect autosomes, this metric shows as NA.
Median autosomal coverage over region---Median alignment coverage over the autosomal loci in region. If there is no autosome in the reference genome or the region does not intersect autosomes, this metric shows as NA.
Mean/Median autosomal coverage ratio over region---Mean autosomal coverage in region divided by the median autosomal coverage in region. If there is no autosome in the reference genome or the region does not intersect autosomes, this metric shows as NA.
Aligned reads in region---Number of uniquely mapped reads to region and percentage relative to the number of uniquely mapped reads to the genome. Only reads with MAPQ ≥ 1 are included. Secondary and supplementary alignments are ignored.
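Several of the metrics above follow directly from the per-site depth over a region. The following Python sketch computes the mean coverage, the uniformity metric, and the cumulative [ix, inf) percentages from a list of per-site depths. It is a simplified illustration, not DRAGEN's implementation. The input list and function names are assumptions, and the read and base filtering rules are assumed to have been applied already.

```python
def coverage_summary(depths, thresholds=(100, 50, 20, 15, 10, 3, 1, 0)):
    """Return (mean, uniformity_pct, {i: pct of sites with depth >= i})."""
    n = len(depths)
    if n == 0:
        raise ValueError("region has no sites")
    mean = sum(depths) / n
    # Uniformity: percentage of sites with coverage > 20% of the mean.
    uniformity = 100.0 * sum(d > 0.2 * mean for d in depths) / n
    pct_at_least = {t: 100.0 * sum(d >= t for d in depths) / n for t in thresholds}
    return mean, uniformity, pct_at_least


def pct_in_bin(pct_at_least, low, high):
    """Percentage of sites with low <= depth < high, i.e. the [ix, jx) bin."""
    return pct_at_least[low] - pct_at_least[high]


mean, uniformity, pct = coverage_summary([0, 2, 10, 20, 50, 120, 30, 8])
```

Each [ix, jx) bin is the difference between two cumulative [ix, inf) values, so the binned and cumulative rows of a coverage metrics file can be cross-checked against each other.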

The following is an example of the contents of the _coverage_metrics.csv file:

COVERAGE SUMMARY,,Aligned bases,148169295474
COVERAGE SUMMARY,,Aligned bases in genome,148169295474,100.00
COVERAGE SUMMARY,,Average alignment coverage over genome,46.08
COVERAGE SUMMARY,,Uniformity of coverage (PCT > 0.2*mean) over genome,91.01
COVERAGE SUMMARY,,PCT of genome with coverage [100x: inf),0.25
COVERAGE SUMMARY,,PCT of genome with coverage [ 50x: inf),50.01
COVERAGE SUMMARY,,PCT of genome with coverage [ 20x: inf),89.46
COVERAGE SUMMARY,,PCT of genome with coverage [ 15x: inf),90.51
COVERAGE SUMMARY,,PCT of genome with coverage [ 10x: inf),91.01
COVERAGE SUMMARY,,PCT of genome with coverage [ 3x: inf),91.69
COVERAGE SUMMARY,,PCT of genome with coverage [ 1x: inf),92.10
COVERAGE SUMMARY,,PCT of genome with coverage [ 0x: inf),100.00
COVERAGE SUMMARY,,PCT of genome with coverage [ 50x:100x),49.76
COVERAGE SUMMARY,,PCT of genome with coverage [ 20x: 50x),39.45
COVERAGE SUMMARY,,PCT of genome with coverage [ 15x: 20x),1.04
COVERAGE SUMMARY,,PCT of genome with coverage [ 10x: 15x),0.51
COVERAGE SUMMARY,,PCT of genome with coverage [ 3x: 10x),0.67
COVERAGE SUMMARY,,PCT of genome with coverage [ 1x: 3x),0.42
COVERAGE SUMMARY,,PCT of genome with coverage [ 0x: 1x),7.90
COVERAGE SUMMARY,,Average chr X coverage over genome,24.70
COVERAGE SUMMARY,,Average chr Y coverage over genome,20.96
COVERAGE SUMMARY,,Average mitochondrial coverage over genome,20682.19
COVERAGE SUMMARY,,Average autosomal coverage over genome,47.81
COVERAGE SUMMARY,,Median autosomal coverage over genome,48.62
COVERAGE SUMMARY,,Mean/Median autosomal coverage ratio over genome,0.98
COVERAGE SUMMARY,,XAvgCov/YAvgCov ratio over genome,1.18
COVERAGE SUMMARY,,XAvgCov/AutosomalAvgCov ratio over genome,0.52
COVERAGE SUMMARY,,YAvgCov/AutosomalAvgCov ratio over genome,0.44
COVERAGE SUMMARY,,Aligned reads,1477121058
COVERAGE SUMMARY,,Aligned reads in genome,1477121058,100.00
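
The rows above can be read with a standard CSV parser. The following is a minimal sketch, not part of DRAGEN: the function name parse_coverage_metrics is hypothetical, and the code assumes that every row follows the layout shown in the example (section, empty field, metric name, value, optional percentage) and that a missing metric appears as the literal text NA.

```python
import csv
import io


def parse_coverage_metrics(text):
    """Map each metric name to its list of values (floats, or None for NA)."""
    metrics = {}
    for row in csv.reader(io.StringIO(text)):
        # Assumed layout: section, empty field, metric name, value[, percentage].
        if len(row) < 4:
            continue
        values = [None if v.strip() == "NA" else float(v) for v in row[3:]]
        metrics[row[2].strip()] = values
    return metrics


sample = """\
COVERAGE SUMMARY,,Average autosomal coverage over genome,47.81
COVERAGE SUMMARY,,Median autosomal coverage over genome,48.62
COVERAGE SUMMARY,,Aligned reads in genome,1477121058,100.00
"""

metrics = parse_coverage_metrics(sample)
mean_cov = metrics["Average autosomal coverage over genome"][0]
median_cov = metrics["Median autosomal coverage over genome"][0]
# Reproduces the Mean/Median autosomal coverage ratio from the example rows.
print(round(mean_cov / median_cov, 2))  # 0.98
```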

Fine Histogram Coverage Report

The fine histogram report outputs a _fine_hist.csv file, which contains two columns: Depth and Overall. The value in the Depth column ranges from 0 to 2000+ and the Overall column indicates the number of loci covered at the corresponding depth.

Masked regions in the FASTA are ignored, and no depth is reported for these regions.
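
A per-depth histogram of this kind can be summarized into coverage ranges. The following sketch is illustrative only: the helper names are hypothetical, the header row Depth,Overall is assumed to be present literally, and the final 2000+ bin is stored under depth 2000. DRAGEN's own range-based reports may apply additional rules that are not modeled here.

```python
def parse_fine_hist(text):
    """Parse Depth,Overall rows into {depth: loci}; '2000+' is stored as 2000."""
    hist = {}
    for line in text.strip().splitlines():
        depth, loci = line.split(",")
        if not depth.rstrip("+").isdigit():
            continue  # skip a header row such as Depth,Overall
        hist[int(depth.rstrip("+"))] = int(loci)
    return hist


def pct_in_range(hist, low, high=None):
    """Percentage of loci with low <= depth < high (high=None: no upper bound)."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    hits = sum(n for d, n in hist.items()
               if d >= low and (high is None or d < high))
    return 100.0 * hits / total


# Toy histogram, not real DRAGEN output.
toy = "Depth,Overall\n0,10\n1,20\n2,30\n3,40\n"
hist = parse_fine_hist(toy)
print(pct_in_range(hist, 0, 1))  # 10.0
print(pct_in_range(hist, 1, 3))  # 50.0
print(pct_in_range(hist, 3))     # 40.0
```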

Histogram Coverage Report

The histogram report outputs a _hist.csv file, which provides the following:

Percentage of bases in the coverage BED/target BED/WGS region that fall within a certain range of coverage.
Duplicate reads are ignored if DRAGEN is run with --enable-duplicate-marking true.

The following ranges are used: "[100x:inf)", "[1x:3x)", "[0x:1x)"

Masked regions in the FASTA are ignored, and no depth is reported for these regions.

Overall Mean Coverage Report

The Overall Mean Coverage report generates an _overall_mean_cov.csv file, which contains the average alignment coverage over the coverage BED/target BED/WGS, as applicable.

The following is an example of the contents of the _overall_mean_cov.csv file:

Average alignment coverage over target_bed,80.69

Masked regions in the FASTA are ignored, and no depth is reported for these regions.

Per Contig Mean Coverage Report

The Contig Mean Coverage report generates a _contig_mean_cov.csv file, which contains the estimated coverage for all contigs and an autosomal estimated coverage. The file includes the following three columns:

Masked regions in the FASTA are ignored, and no depth is reported for these regions.

Full Res Report

The Full Res Report outputs a _full_res.bed file in tab-delimited format. The first three columns are the standard BED fields, and the fourth column is the depth. Each record in the file is for a given interval that has a constant depth. If the depth changes, then a new record is written to the file. Alignments that have a mapping quality value of 0, duplicate reads, and clipped bases are not counted towards the depth.

Only base positions that fall under the user-specified coverage-region bed regions are present in the _full_res.bed output file.

The _full_res.bed file structure is similar to the output file of bedtools genomecov -bg. The contents are identical if the bedtools command line is executed after filtering out alignments with mapping quality value of 0, and possibly filtering by a target BED (if specified).

The following is an example of the contents of the _full_res.bed file:

chr1 121483984 121483985 10
chr1 121483985 121483986 9
chr1 121483986 121483989 8
chr1 121483989 121483991 7
chr1 121483991 121483992 6
chr1 121483992 121483993 4
chr1 121483993 121483994 2
chr1 121483994 121484039 1
chr1 121484039 121484043 2
chr1 121484043 121484048 3
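
The constant-depth record structure can be illustrated by collapsing a per-base depth array into runs. This is a sketch of the idea only, with a hypothetical function name; it assumes 0-based, half-open BED coordinates and does not model the MAPQ, duplicate, or clipping filters described above.

```python
def depth_to_records(chrom, start, depths):
    """Collapse per-base depths into (chrom, start, end, depth) runs."""
    records = []
    run_start = start
    for i in range(1, len(depths) + 1):
        # Close the current run at the end of the array or when depth changes.
        if i == len(depths) or depths[i] != depths[i - 1]:
            records.append((chrom, run_start, start + i, depths[run_start - start]))
            run_start = start + i
    return records


# Per-base depths matching the first five records of the example.
depths = [10, 9, 8, 8, 8, 7, 7, 6]
for rec in depth_to_records("chr1", 121483984, depths):
    print(*rec)
```

With these inputs, the printed records match the first five lines of the example: a new record starts each time the depth changes.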

Coverage is reported for all positions specified by qc-coverage-region-i. Masked regions in the FASTA are not ignored.

When --enable-metrics-compression is set to true, the 1 bp resolution coverage metrics output BED file (_full_res.bed) is compressed to bigwig format.

Per Region Coverage Report

The cov_report report generates a _cov_report.bed file in tab-delimited format. This report includes summary coverage statistics per BED region. The first three columns are standard BED fields. The DRAGEN Amplicon pipeline includes a fourth column for name and a fifth column for gene_id. The remaining columns are statistics calculated over the interval region specified on the same record line.

The following table lists the appended columns.

total_cvg---The total coverage value.
mean_cvg---The mean coverage value.
Q1_cvg---The lower quartile (25th percentile) coverage value.
median_cvg---The median coverage value.
Q3_cvg---The upper quartile (75th percentile) coverage value.
min_cvg---The minimum coverage value.
max_cvg---The maximum coverage value.
pct_above_X---Indicates the percentage of bases over the specified interval region that had a depth coverage greater than X.

By default, if an interval has a total coverage of 0, then the record is written to the output file. To filter out intervals with zero coverage, set vc-emit-zero-coverage-intervals to false in the configuration file.

If --qc-coverage-region-i-thresholds is not set, the thresholds default to 5, 15, 20, 30, 50, 100, 200, 300, 400, 500, and 1000.

The following is an example of the contents of the _cov_report.bed file:

chrom start end total_cvg mean_cvg Q1_cvg median_cvg Q3_cvg min_cvg max_cvg pct_above_5 ...
chr5 34190121 34191570 76636 52.89 44.00 54.00 60.00 32 76 100.00 ...
chr5 34191751 34192380 39994 63.58 57.00 61.00 69.00 50 85 100.00 ...
chr5 34192440 34192642 10074 49.87 47.00 49.00 51.00 44 62 100.00 ...
chr9 66456991 66457682 31926 46.20 39.00 45.00 52.00 27 65 100.00 ...
chr9 68426500 68426601 4870 48.22 42.00 48.00 54.00 39 58 100.00 ...
chr17 41465818 41466180 24470 67.60 4.00 66.00 124.00 2 153 66.30 ...
chr20 29652081 29652203 5738 47.03 40.00 49.00 52.00 34 58 100.00 ...
chr21 9826182 9826283 4160 41.19 23.00 52.00 58.00 5 60 99.01 ...
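
The per-region columns can be approximated from a per-base depth array as follows. This is a hedged sketch with a hypothetical function name. The quartile interpolation rule that DRAGEN uses is not stated here, so the sketch uses the inclusive method of Python's statistics.quantiles as an assumption, which requires at least two positions in the region. pct_above_X counts positions with depth strictly greater than X, following the column description.

```python
import statistics


def region_stats(depths, thresholds=(5, 15, 20)):
    """Summary statistics over the per-base depths of one BED interval."""
    # Assumption: inclusive quartiles; DRAGEN's exact rule may differ.
    q1, median, q3 = statistics.quantiles(depths, n=4, method="inclusive")
    stats = {
        "total_cvg": sum(depths),
        "mean_cvg": sum(depths) / len(depths),
        "Q1_cvg": q1,
        "median_cvg": median,
        "Q3_cvg": q3,
        "min_cvg": min(depths),
        "max_cvg": max(depths),
    }
    for x in thresholds:
        stats[f"pct_above_{x}"] = 100.0 * sum(d > x for d in depths) / len(depths)
    return stats


# Toy depths, not real DRAGEN output.
stats = region_stats([2, 4, 6, 8, 10])
print(stats["mean_cvg"], stats["Q1_cvg"], stats["Q3_cvg"], stats["pct_above_5"])
# 6.0 4.0 8.0 60.0
```

In the example file, the first record is consistent with mean_cvg being total_cvg divided by the interval length: 76636 / (34191570 - 34190121) rounds to 52.89.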

Read Coverage Report

The read_cov_report report generates a _read_cov_report.bed file in a tab-delimited format. The first five columns are chrom, start, end, name, and gene_id BED fields. The following additional columns represent statistics that are calculated over the interval region specified on the same record line.

total_cvg---The total coverage value.
read1_cvg---The total Read 1 coverage value.
read2_cvg---The total Read 2 coverage value.

If an alignment overlaps more than one region, the alignment is counted toward the region with the largest overlap. If the alignment overlaps equally with more than one region, the alignment is counted toward the first intersecting region.
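
The assignment rule can be sketched as below. The function name is hypothetical, coordinates are assumed to be 0-based and half-open, and "first intersecting region" is assumed to mean the first region in input order.

```python
def assign_region(aln_start, aln_end, regions):
    """Return the index of the region with the largest overlap, or None.

    regions is a list of (start, end) tuples in input order.
    """
    best_idx, best_overlap = None, 0
    for idx, (reg_start, reg_end) in enumerate(regions):
        overlap = min(aln_end, reg_end) - max(aln_start, reg_start)
        # Strict '>' keeps the earliest region when overlaps are equal.
        if overlap > best_overlap:
            best_idx, best_overlap = idx, overlap
    return best_idx


regions = [(100, 200), (200, 300)]
print(assign_region(150, 250, regions))  # 0 (tie of 50 bases each)
print(assign_region(180, 260, regions))  # 1 (20 bases versus 60 bases)
print(assign_region(300, 400, regions))  # None (no overlap)
```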

The following shows the contents of the _read_cov_report.bed file:

#chrom    start    end    name    gene_id    total_cvg    read1_cvg    read2_cvg
chr21    10033000    10034919            48    24    24
chr21    10034919    10034920            0    0    0
chr21    10034920    10034921            0    0    0

Callability

Callability is defined as the fraction of positions in the genome or target region that have a PASSing genotype call in the gVCF. The callability report can be interpreted as the fraction of sites in the genome or target BED where the small variant caller had sufficient information (enough good-quality reads) to confidently call either a variant or a HOM-REF region.

The callability report requires DRAGEN to be run in gVCF mode. When gVCF mode is enabled, DRAGEN will automatically generate a callability report as part of variant caller metrics.

The following criteria are used to calculate callability metrics:

Callability is calculated over all positions included in the gVCF.
Decoy contigs are ignored.
Unplaced and unlocalized contigs are ignored.
Masked regions in the FASTA (bases set to N) are ignored.
For regions where no variant calling was performed, callability is 0.
A homozygous deletion counts as a PASSing genotype call for all the reference positions spanned by the deletion.

If the --vc-target-bed option is specified, the output is a target_bed_callability.bed file that contains the overall and autosome callability over the input target bed region. The padding size specified by the --vc-target-bed-padding option is used and overlapping regions are merged.
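
As a simplified illustration of the definition, callability over a set of regions can be computed as the number of region positions covered by PASSing calls divided by the total number of region positions. The sketch below is not DRAGEN's implementation: the function names are hypothetical, the PASSing intervals are assumed to have been extracted from the gVCF already (0-based, half-open), and padding and contig exclusions are not modeled. Overlapping regions are merged first so that no position is counted twice.

```python
def merge(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def callability(target, passing):
    """Fraction of target positions covered by PASSing intervals."""
    target, passing = merge(target), merge(passing)
    total = sum(end - start for start, end in target)
    called = 0
    for t_start, t_end in target:
        for p_start, p_end in passing:
            called += max(0, min(t_end, p_end) - max(t_start, p_start))
    # Target positions without a PASSing call contribute 0 to the numerator.
    return called / total if total else 0.0


print(callability([(0, 50), (40, 100)], [(0, 40), (30, 60), (90, 120)]))  # 0.7
```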

Callability can also be output over custom regions. If the --qc-coverage-region-i option is used with --qc-coverage-reports-i (where i is 1, 2, or 3), callability can be added as a report type for that region. The output is a qc-coverage-region-i_callability.bed file. For each specified qc-coverage-region-i file, the average callability is reported in the variant calling metrics file. The padding size specified by the --qc-coverage-region-padding-i is used and overlapping regions are merged.

The optional min MAPQ and min BQ filters only influence read and base counting and do not influence the callability reports. The callability reports depend only on the gVCF PASS variants.

Coverage/Callability Reports Use Cases and Expected Output

The following table shows which outputs are generated when default options (--vc-target-bed) versus optional coverage region options (--coverage-region) are used.

GC Bias Report

The GC bias report provides information on GC content and the associated read coverage across a genome. The DRAGEN GC bias metric is modeled after the Picard implementation and adapted to preexisting internal measures. The DRAGEN GC bias correction module attempts to correct these biases following the target count stage. For more information, see GC Bias Correction.

The GC bias metric is computed as follows.

Calculates GC content using a 100 bp wide, per-base rolling window over all chromosomes in the reference genome, excluding any decoys and alternate contigs. Windows containing more than four masked (N) bases in the reference are discarded.
Calculates the average coverage for each window, excluding any non-PF, duplicate, secondary, and supplementary reads.
Calculates the average global coverage across the whole genome.
Groups valid windows based on the percentage of GC content, both at individual percentages and five 20% ranges as summary.
Calculates the normalized coverage for each group by dividing the average coverage for the bin by the global average coverage across the genome. Values below 1.0 indicate a lower than expected coverage at the given GC percent or range. Coverages significantly deviating from 1.0 at greater GC values are an expected result.
Calculates dropout metrics as the sum of all positive values of (percentage of windows at GC X - percentage of aligned reads at GC X). The sum over GC ≤ 50% is the AT dropout, and the sum over GC > 50% is the GC dropout.
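
The steps above can be sketched on toy data as follows. This is an illustration under stated assumptions, not DRAGEN's implementation: the function name is hypothetical, GC percent is computed over the full window size, the global mean is taken over valid windows, and each bin's share of the summed window coverage stands in for the percentage of aligned reads at that GC value. Read-level filters (non-PF, duplicate, secondary, supplementary) and the 20% summary ranges are not modeled.

```python
from collections import defaultdict


def gc_bias(reference, coverage, window=100, max_masked=4):
    """Return (normalized coverage per GC percent, AT dropout, GC dropout, discarded)."""
    n_windows = defaultdict(int)   # GC percent -> number of valid windows
    cov_sum = defaultdict(float)   # GC percent -> sum of window mean coverages
    discarded = 0
    for start in range(len(reference) - window + 1):
        seq = reference[start:start + window].upper()
        if seq.count("N") > max_masked:
            discarded += 1
            continue
        gc_pct = 100 * (seq.count("G") + seq.count("C")) // window
        n_windows[gc_pct] += 1
        cov_sum[gc_pct] += sum(coverage[start:start + window]) / window
    total_windows = sum(n_windows.values())
    total_cov = sum(cov_sum.values())
    if total_windows == 0 or total_cov == 0:
        raise ValueError("no valid windows or no coverage")
    global_mean = total_cov / total_windows
    normalized = {gc: cov_sum[gc] / n_windows[gc] / global_mean for gc in n_windows}
    at_dropout = gc_dropout = 0.0
    for gc in n_windows:
        # Assumption: coverage share approximates the aligned-read percentage.
        diff = 100 * n_windows[gc] / total_windows - 100 * cov_sum[gc] / total_cov
        if diff > 0:
            if gc <= 50:
                at_dropout += diff
            else:
                gc_dropout += diff
    return normalized, at_dropout, gc_dropout, discarded


# Toy input with a 4-base window; coverage rises with GC content.
norm, at_drop, gc_drop, dropped = gc_bias(
    "AAAAGGGG", [10, 10, 10, 10, 30, 30, 30, 30], window=4, max_masked=0)
print(norm[0], norm[50], norm[100])  # 0.5 1.0 1.5
print(at_drop, gc_drop)              # 15.0 0.0
```

In the toy input, low-GC windows are under-covered, so the positive differences accumulate in the AT dropout and the GC dropout stays at zero.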

By default, the GC bias metric report is not calculated. To enable GC Bias calculations, enter the --gc-metrics-enable command line option. The following is an example command:

$ dragen -b <BAM file> -r <reference genome> --gc-metrics-enable=true

The GC metrics report generates a gc_metrics.csv file. The file is structured as follows.

GC BIAS DETAILS,,Windows at GC [0-100],<number of windows>,<fraction of all windows>
GC BIAS DETAILS,,Normalized coverage at GC [0-100],<average coverage of all windows at given GC divided by average coverage of whole genome>
GC METRICS SUMMARY,,Window size,<window size in bases, typically 100>
GC METRICS SUMMARY,,Number of valid windows,<total number of windows used in calculations>
GC METRICS SUMMARY,,Number of discarded windows,<total number of windows discarded due to more than 4 masked bases>
GC METRICS SUMMARY,,Average reference GC,<average GC content over all valid windows>
GC METRICS SUMMARY,,Mean global coverage,<average genome coverage over all valid windows>
GC METRICS SUMMARY,,Normalized coverage at GCs <GC range>,<average coverage of all windows at given GC range divided by average coverage of whole genome>
GC METRICS SUMMARY,,AT Dropout,<Calculated AT dropout value>
GC METRICS SUMMARY,,GC Dropout,<Calculated GC dropout value>

The GC bias report also includes the following command line options, but they are not recommended.

| Setting | Description |
|:-------------------------|:-----------------------------------------------------|
| --gc-metrics-window-size | Overrides the default rolling window size of 100 bp. |
| --gc-metrics-num-bins | Overrides the number of summary bins. |

Somatic Callable Regions Report

In somatic mode, DRAGEN automatically generates a somatic callable regions report as a bed file. The somatic callable regions report includes all regions with tumor coverage at least as high as the tumor threshold and (if applicable) normal coverage at least as high as the normal threshold. If only the tumor sample is provided, then the report includes all regions with tumor coverage at least as high as the tumor threshold. Each line in the bed output file is formatted as follows.

chromosome region_start region_end

You can specify the threshold values using the --vc-callability-tumor-thresh and --vc-callability-normal-thresh options. The default value for the tumor threshold is 15. The default value for the normal threshold is 5. For more information on each option, see Somatic Mode Options.
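
The threshold rule can be illustrated with per-base depth arrays. The sketch below uses a hypothetical function name and the default thresholds stated above. It assumes 0-based, half-open BED coordinates and is not DRAGEN's implementation.

```python
def callable_regions(chrom, tumor_depth, normal_depth=None,
                     tumor_thresh=15, normal_thresh=5):
    """Return (chrom, start, end) runs where the coverage thresholds are met."""
    regions = []
    run_start = None
    for pos, tumor in enumerate(tumor_depth):
        ok = tumor >= tumor_thresh and (
            normal_depth is None or normal_depth[pos] >= normal_thresh)
        if ok and run_start is None:
            run_start = pos                       # open a new callable run
        elif not ok and run_start is not None:
            regions.append((chrom, run_start, pos))  # close the current run
            run_start = None
    if run_start is not None:
        regions.append((chrom, run_start, len(tumor_depth)))
    return regions


tumor = [20, 15, 14, 30, 30]
normal = [5, 4, 9, 9, 9]
print(callable_regions("chr1", tumor, normal))  # [('chr1', 0, 1), ('chr1', 3, 5)]
print(callable_regions("chr1", tumor))          # [('chr1', 0, 2), ('chr1', 3, 5)]
```

In the tumor-normal case, position 1 meets the tumor threshold but not the normal threshold, so it is excluded. In the tumor-only case, only the tumor threshold applies.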

If the target BED or the --qc-coverage-region-i (where i is 1, 2, or 3) option is included in the run, DRAGEN generates corresponding somatic callable regions BED files in addition to the whole genome somatic callable regions BED file.

Duration Metrics

The duration metrics section includes a breakdown of the run duration for each process. For example, the following metrics are generated for the mapper and variant caller pipeline:

Time loading reference
Time aligning reads
Time sorting and marking duplicates
Time DRAGStr calibration
Time partial reconfiguration
Time variant calling
Total run time