Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
The map/align system produces a BAM file sorted by reference sequence and position by default. Creating this BAM file typically eliminates the requirement to run samtools sort or any equivalent postprocessing command. The ‑‑enable-sort option
can be used to enable or disable creation of the BAM file, as follows:
To enable, set to true.
To disable, set to false.
On the reference hardware system, running with sort enabled increases run time for a 30x full genome by about 6--7 minutes.
Marking or removing duplicate aligned reads is a common best practice in whole-genome sequencing. Not doing so can bias variant calling and lead to incorrect results.
The DRAGEN system can mark or remove duplicate reads, and produces a BAM file with duplicates marked in the FLAG field, or with duplicates entirely removed.
In testing, enabling duplicate marking adds minimal run time over and above the time required to produce the sorted BAM file. The additional time is approximately 1--2 minutes for a 30x whole human genome, which is a huge improvement over the long run times of open source tools.
The DRAGEN duplicate-marking algorithm is modeled on the Picard toolkit's MarkDuplicates feature. All the aligned reads are grouped into subsets in which all the members of each subset are potential duplicates.
For two pairs to be duplicates, they must have the following:
Identical alignment coordinates (position adjusted for soft- or hard-clips from the CIGAR) at both ends.
Identical orientations (direction of the two ends, with the left-most coordinate being first).
In addition, an unpaired read may be marked as a duplicate if it has identical coordinate and orientation with either end of any other read, whether paired or not.
Unmapped read pairs are never marked as duplicates.
When DRAGEN has identified a group of duplicates, it picks one as the best of the group, and marks the others with the BAM duplicate flag (0x400, or decimal 1024). For this comparison, duplicates are scored based on the average sequence Phred quality. Pairs receive the sum of the scores of both ends, while unpaired reads get the score of the one mapped end. The idea of this score is to try, all other things being equal, to preserve the reads with the highest-quality base calls.
If two reads (or pairs) have exactly matching quality scores, DRAGEN breaks the tie by choosing the pair with the higher alignment score. If there are multiple pairs that also tie on this attribute, then DRAGEN chooses a winner arbitrarily.
The score for an unpaired read R is the average Phred quality score per base, calculated as follows:
Where R is a BAM record, QUAL is its array of Phred quality scores, and dedup-min-qual is a DRAGEN configuration option with default value of 15. For a pair, the score is the sum of the scores for the two ends.
This score is stored as a one-byte number, with values rounded down to the nearest one-quarter. This rounding may lead to different duplicate marks from those chosen by Picard, but because the reads were very close in quality this has negligible impact on variant calling results.
The limitations to DRAGEN duplicate marking implementation are as follows:
When there are two duplicate reads or pairs with very close Phred sequence quality scores, DRAGEN might choose a different winner from that chosen by Picard. These differences have negligible impact on variant calling results.
If using a single FASTQ file as input, DRAGEN accepts only a single library ID as a command-line argument (RGLB). For this reason, the FASTQ inputs to the system must be already separated by library ID. Library ID cannot be used as a criterion for distinguishing non-duplicates.
DRAGEN does not distinguish between optical and PCR duplicates.
The following options can be used to configure duplicate marking in DRAGEN:
--enable-duplicate-marking
Set to true to enable duplicate marking. When \--enable-duplicate-marking is enabled
, the output is sorted, regardless of the value of the enable-sort option.
--remove-duplicates
Set to true to suppress the output of duplicate records. If set to false, set the 0x400 flag in the FLAG field of duplicate BAM records. When --remove-duplicates is enabled, then enable- duplicate-marking is forced to enabled as well.
--dedup-min-qual
Specifies the Phred quality score below which a base should be excluded from the quality score calculation used for choosing among duplicate reads.
The DRAGEN DNA Pipeline accelerates the secondary analysis of NGS data by harnessing the tremendous power available on the DRAGEN Platform. The pipeline includes highly optimized algorithms for mapping, aligning, sorting, duplicate marking, and haplotype variant calling. In addition to haplotype variant calling, the pipeline supports calling of copy number and structural variants as well as detection of repeat expansions and targeted calls.
DRAGEN can remove artifacts from reads using hardware accelerated read trimming. Hardware accelerated read trimming is available on U200 and cloud systems, as part of the DRAGEN mapper and adds no additional run time. DRAGEN provides multiple independent trimming filters that target different types of artifacts or use cases. You can enable and configure the artifacts or use cases independently to tailor the read-trimming to your analysis. Read trimming uses two different modes, hard-trimming and soft-trimming.
To enable hard-trimming mode, use --read-trimmers
. In hard-trimming mode, potential artifacts are removed from input reads. Reads that are trimmed to fewer than 20 bases are filtered and replaced with a placeholder read that uses 10 N bases. DRAGEN assigns the filtered reads a 0x200 flag set.
DRAGEN contains a novel lossless soft-trimming mode. In soft-trimming mode, reads are mapped as though they had been trimmed, but no bases are removed. To enable the trimmer in soft mode, use --soft-read-trimmers
.
Soft-trimming suppresses systematic mismapping of reads that contain trimmable artifacts, without actually losing the trimmed bases in aligned output. Soft-trimming prevents reads with trimmable artifacts, such as Poly-G artifacts, from being mapped to reference G homopolymers, or prevents adapter sequences from being mapped to the matching reference loci. Soft-trimming might map reads to different positions in the reference than they would have been if not using soft-trimming. When using soft-trimmed, DRAGEN does not filter reads and does not map reads with bases that would have been trimmed entirely.
Soft-trimming for Poly-G artifacts is enabled by default on supported systems.
Fixed-length trimming removes a fixed number of bases from the 5' end of each read. If you are analyzing sequencing data from an amplicon of fixed size and expect the read-length to consistently exceed the length of quality sequence data, you can use the expected number in fixed-length trimming.
Poly-G artifacts appear on two-channel sequencing systems when the dark base G is called after synthesis has terminated. As a result, DRAGEN calls several erroneous high-confidence G bases on the ends of affected reads. For contaminated samples, many affected reads can be mapped to reference regions with high G content. The affected reads can cause problems for processing downstream.
Base quality can degrade over the length of a read toward the 5' end and separate from any artifacts from early termination of synthesis. The lower quality bases can affect mapping and alignment results, and might lead to incorrect variant or methylation calls downstream. The quality trimming tool calculates a rolling average of the base quality inward from the 5' end and removes the minimum number of bases, so the average number of bases is above the threshold specified using --trim-min-quality
.
Problems during library preparation, or libraries with smaller inserts can result in the synthesis of high quality reads containing sequence from the adapters used. If not removed before analysis, noninsert bases can reduce mapping efficiency and downstream accuracy. The adapter trimming tool uses the adapter sequences from the input FASTA file, and then removes all hits greater than a specified size. Adapter trimming allows for a 10% mismatch. For 3' adapters, trimming is from the first matching adapter base to the end of the read. For 5' adapters, trimming is from the first (3') matching adapter base to the beginning (5') of the read.
If quality trimming is not feasible due to reduced yield or other limitations, an alternative option is to remove only explicitly ambiguous bases from the ends of read. If enabled the ambiguous base trimmer applies a simple exact-match search to both ends of all processed reads, regardless of mate-pair status.
You can maximize trimmer sensitivity, by using the minimum length trimming tool to remove a fixed number of bases from each read after the trimmer tools above have run. For example, if you would like to remove 5 bp from each read, a 7 bp adapter hit could be missed if five of the bases are removed first. To mitigate this issue, DRAGEN provides an optional minimum trim-length filter.
If using libraries of fixed-size inserts, such as small PCR amplicons, it is more convenient to specify a length that all reads should be trimmed to rather than the number of bases to remove. You can use the maximum length trimming tool.
If using RNA libraries, reads overlapping the poly-A tail of the transcripts may contain long poly-A/poly-T sequences at the end of the reads which may result in incorrect alignment. The poly-A trimmer mitigates this by trimming the poly-A tail from the end of the read. See additional description in RNA alignment section.
The trimmer generates a metrics file titled \<output prefix\>.trimmer_metrics.csv
. Metrics are available on an aggregate level over all input data. The metrics units are in reads or bases.
Total input reads Total number of reads in the input files.
Total input bases Total number of bases in the input reads.
Total input bases R1 Total number of bases in R1 reads.
Total input bases R2 Total number of bases in R2 reads.
Average input read length Total number of input bases divided by the number of input reads.
Total trimmed reads Total number of reads trimmed by at least one base, not including soft-trimming.
Total trimmed bases Total number of bases trimmed, not including soft-trimming.
Average bases trimmed per read The number of trimmed bases divided by the number of input reads.
Average bases trimmed per trimmed read The number of trimmed bases divided by the number of trimmed reads.
Remaining poly-G K-mers R1 3prime The number of R1 3' read ends that contain likely Poly-G artifacts after trimming.
Remaining poly-G K-mers R2 3prime The number of R2 3' read ends that contain likely Poly-G artifacts after trimming.
Total filtered reads The number of reads that were filtered out during trimming.
Reads filtered for minimum read length R1 The number of R1 reads that were filtered due to being trimmed below the minimum read length.
Reads filtered for minimum read length R2 The number of R2 reads that were filtered due to being trimmed below the minimum read length.
<Trimmer tool> trimmed reads The number of reads with at least one base trimmed by TRIMMER. DRAGEN reports the metric for both R1 and R2 mates and the filtering status (unfiltered or filtered) of the trimmed read. The metric includes reads that were trimmed during soft-trimming. Each trimming tool above produces the metric.
<Trimmer tool> trimmed bases The number of bases trimmed by TRIMMER. The metric is produced for both R1 and R2 mates and the filtering status (unfiltered or filtered) of the trimmed read. The metric includes bases from reads that were trimmed during soft trimming. Each trimming tool above produces the metric.
--read-trimmers
To enable trimming filters in hard-trimming mode, set to a comma-separated list of the trimmer tools you would like to use (in the order of execution). To disable trimming, set to none
. During mapping, artifacts are removed from all reads. The following are valid trimmer names:
fixed-len
—Fixed-length trimming
polyg
—Poly-G trimming
quality
—Quality trimming
adapter
—Adapter trimming
n
—Ambiguous base trimming
min-len
—Minimum length trimming
cut-end
—Maximum length trimming
bisulfite
—Bisulfite trimming
Read trimming is disabled by default (default: "none").
--soft-read-trimmers
To enable trimming filters in soft-trimming mode, set to a comma-separated list of the trimmer tools you would like to use (in the order of execution). To disable soft trimming, set to none
. During mapping, reads are aligned as if trimmed, and bases are not removed from the reads. The following are the valid trimmer names.
fixed-len
—Fixed-length trimming
polyg
—Poly-G trimming
quality
—Quality trimming
adapter
—Adapter trimming
n
—Ambiguous base trimming
min-len
—Minimum length trimming
cut-end
—Maximum length trimming
bisulfite
—Bisulfite trimming
Soft-trimming is enabled for the polyg
filter by default (default: "polyg").
--trimming-only
Disables mapping and alignment to run read-trimming only.
--trim-min-length
Specify a minimum read length allowed after the trimmer execution. DRAGEN filters any reads with a length less than the value after all read-trimming steps are completed (default: 20).
--trim-min-len-read1
Specify a minimum read length allowed for read1 after the trimmer execution. DRAGEN filters any reads with a length of read1 less than the value after all read-trimming steps are completed (default: 20).
--trim-min-len-read2
Specify a minimum read length allowed for read2 after the trimmer execution. DRAGEN filters any reads with a length of read2 less than the value after all read-trimming steps are completed (default: 20).
--trim-filter-dummy-len
Specify the number of N bases in dummy reads that replace filtered reads (default: 10).
--trim-filter-set-flag
If enabled, dummy reads will have their 0x200 SAM flag set (default: true).
--trim-r1-5prime
Specify a fixed number of bases to trim from the 5' end of Read 1 (default: 0).
--trim-r1-3prime
Specify a fixed number of bases to trim from the 3' end of Read 1 (default: 0).
--trim-r2-5prime
Specify a fixed number of bases to trim from the 5' end of Read 2 (default: 0).
--trim-r2-3prime
Specify a fixed number of bases to trim from the 3' end of Read 2 (default: 0).
--trim-min-quality
Specify the minimum read quality. DRAGEN trims bases from the 3' end of reads with a quality below the value.
--trim-quality-r1-5prime
Specify the quality cutoff below which to trim from the 5' end of read 1.
--trim-quality-r1-3prime
Specify the quality cutoff below which to trim from the 3' end of read 1.
--trim-quality-r2-5prime
Specify the quality cutoff below which to trim from the 5' end of read 2.
--trim-quality-r2-3prime
Specify the quality cutoff below which to trim from the 3' end of read 2.
--trim-adapter-read1
Specify the FASTA file that contains adapter sequences to trim from the 3' end of Read 1.
--trim-adapter-read2
Specify the FASTA file that contains adapter sequences to trim from the 3' end of Read 2.
--trim-adapter-r1-5prime
Specify the FASTA file that contains adapter sequences to trim from the 5' end of Read 1. NB: the sequences should be in reverse order (with respect to their appearance in the FASTQ) but not complemented.
--trim-adapter-r2-5prime
Specify the FASTA file that contains adapter sequences to trim from the 5' end of Read 2. NB: the sequences should be in reverse order (with respect to their appearance in the FASTQ) but not complemented.
--trim-adapter-stringency
Specify the minimum number of adapter bases required for trimming (default: 4).
--trim-bisulfite-ends
Enable both 5-Prime and 3-Prime bisulfite trimming.
--trim-bisulfite-5prime
If a 3' adapter was trimmed, trim an additional 2bp from the 3' end, unless the 5' end matches 'CAA' or 'CGA'".
--trim-bisulfite-3prime
If the 5' end matches 'CAA' or 'CGA', trim the first two of these 5' bases.
--trim-min-r1-5prime
Specify the minimum number of bases to trim from the 5' end of Read 1 (default: 0).
--trim-min-r1-3prime
Specify the minimum number of bases to trim from the 3' end of Read 1 (default: 0).
--trim-min-r2-5prime
Specify the minimum number of bases to trim from the 5' end of Read 2 (default: 0).
--trim-min-r2-3prime
Specify the minimum number of bases to trim from the 3' end of Read 2 (default: 0).
--trim-max-length
Specify the maximum number of bases that can be trimmed from the sequences of both reads.
--trim-max-len-read1
Specify the maximum number of bases that can be trimmed from the sequences of read1.
--trim-max-len-read2
Specify the maximum number of bases that can be trimmed from the sequences of read2.
--trim-polya-min-trim
The minimum number of poly-As required for polya trimming (default: 20).
--trim-polyg-kmer-len
How many bases to check at each read end for poly-G artifact detection (default: 25).
--trim-polyg-kmer-non-g
The maximum number of non-G bases in the K-mer for poly-G artifact detection (default: 2).
--trim-polyg-g-score-r1-5prime
The score for G bases on the 5' end of read 1 (default: 0).
--trim-polyg-g-score-r1-3prime
The score for G bases on the 3' end of read 1 (default: 15).
--trim-polyg-g-score-r2-5prime
The score for G bases on the 5' end of read 2 (default: 0).
--trim-polyg-g-score-r2-3prime
The score for G bases on the 3' end of read 2 (default: 15).
--trim-polyg-min-trim-r1-5prime
The minimum number of G's to trim from the 5' end of read 1 (default: 6).
--trim-polyg-min-trim-r1-3prime
The minimum number of G's to trim from the 3' end of read 1 (default: 6).
--trim-polyg-min-trim-r2-5prime
The minimum number of G's to trim from the 5' end of read 2 (default: 6).
--trim-polyg-min-trim-r2-3prime
The minimum number of G's to trim from the 3' end of read 2 (default: 6).
--trim-polyg-early-exit-threshold
The signed score threshold for poly-G trimming to exit early (default: -500).
--trim-polyx-bases-r1-5prime
The bases to trim for polyX trimming from the 5' end of read 1 (default: empty string "" ).
--trim-polyx-bases-r1-3prime
The bases to trim for polyX trimming from the 3' end of read 1 (default: empty string "" ).
--trim-polyx-bases-r2-5prime
The bases to trim for polyX trimming from the 5' end of read 2 (default: empty string "" ).
--trim-polyx-bases-r2-3prime
The bases to trim for polyX trimming from the 3' end of read 2 (default: empty string "" ).
--trim-polyx-min-trim-r1-5prime
The minimum number of X's to trim from the 5' end of read 1 (default: 20).
--trim-polyx-min-trim-r1-3prime
The minimum number of X's to trim from the 3' end of read 1 (default: 20).
--trim-polyx-min-trim-r2-5prime
The minimum number of X's to trim from the 5' end of read 2 (default: 20).
--trim-polyx-min-trim-r2-3prime
The minimum number of X's to trim from the 3' end of read 2 (default: 20).
polya
—RNA Poly-A tail trimming. See additional description in section
polya
—RNA Poly-A tail trimming. See additional description in section
An MD5SUM file is generated automatically for VCF output files. This file is in the same output directory and has the same name as the VCF output file, but with an .md5sum extension appended. For example, whole_genome_run_123.vcf.md5sum. The MD5SUM files is a single-line text file that contains the md5sum of the VCF output file. This md5sum exactly matches the output of the Linux md5sum command.
Regions of homozygosity (ROH) are detected as part of the small variant caller. The caller detects and outputs the runs of homozygosity from whole genome calls on autosomal human chromosomes. Sex chromosomes are ignored unless the sample sex karyotype is XX, as specified on the command line or determined by the Ploidy Estimator. ROH output allows downstream tools to screen for and predict consanguinity between the parents of the proband subject.
A region is defined as consecutive variant calls on the chromosome with no large gap in between these variants. In other words, regions are broken by chromosome or by large gaps with no SNV calls. The gap size is set to 3 Mbases.
ROH Algorithm
The ROH algorithm runs on the small variant calls. The algorithm excludes variants with multiallelic sites, indels, complex variants, non-PASS filtered calls, and homozygous reference sites. The variant calls are then filtered further using a block list BED, and finally depth filtering is applied after the block list filter. The default value for the fraction of filtered calls is 0.2, which filters the calls with the highest 10% and lowest 10% in DP values. The algorithm then uses the resulting calls to find regions.
The ROH algorithm first finds seed regions that contain at least 50 consecutive homozygous SNV calls with no heterozygous SNV or gaps of 500,000 bases between the variants. The regions can be extended using a scoring system that functions as follows.
Score increases with every additional homozygous variant (0.025) and decreases with a large penalty (1-0.025) for every heterozygous SNV. This provides some tolerance of presence of heterozygous SNV in the region.
Each region expands on both ends until the regions reach the end of a chromosome, a gap of 500,000 bases between SNVs occurs, or the score becomes too low (0).
Overlapping regions are merged into a single region. Regions can be merged across gaps of 500,000 bases between SNVs if a single region would have been called from the beginning of the first region to the end of the second region without the gap. There is no maximum size for regions, but regions always end at chromosome boundaries.
ROH Options
--vc-enable-roh
Set to true to enable the ROH caller. The ROH caller is enabled by default for human autosomes only. Set to false to disable.
--vc-roh-blacklist-bed
If provided, the ROH caller ignores variants that are contained in any region in the block list BED file. DRAGEN distributes block list files for all popular human genomes and automatically selects a block list to match the genome in use, unless this option is used to select a file.
ROH Output
The ROH caller produces an ROH output file named <output-file-prefix>.roh.bed
in which each row represents one region of homozygosity. The BED file contains the following columns:
Chromosome Start End Score #Homozygous #Heterozygous
Score is a function of the number of homozygous and heterozygous variants, where each homozygous variant increases the score by 0.025, and each heterozygous variant reduces the score by 0.975.
Start and end positions are a 0-based, half-open interval.
#Homozygous is number of homozygous variants in the region.
#Heterozygous is number of heterozygous variants in the region. The caller also produces a metrics file named <output-file-prefix>.roh_metrics.csv
that lists the number of large ROH and percentage of SNPs in large ROH (>3 MB).
The table below demonstrates how the PLINK options can be tuned to behave similarly to the DRAGEN ROH caller default settings (see column DRAGEN default). We observed that PLINK ROH calls (see column PLINK default) in default settings are more conservative compared to DRAGEN default settings. By default, PLINK reports ROH regions of size 1MB or larger (see PLINK option --homozyg-kb ) with at least 100 homozygous SNPs (see PLINK option --homozyg-snp) while DRAGEN ROH caller reports smaller regions with at least 50 homozygous SNPs (see DRAGEN ROH Algorithm section). In addition, PLINK by default allows for only 1 heterozygous SNP per scanning window (specified by PLINK option --homozyg-window-het) while DRAGEN uses a soft score threshold penalty without setting an upper bound on the allowed number of heterozygous SNPs (see DRAGEN ROH Algorithm section). The PLINK ROH calls are largely comparable to the DRAGEN ROH calls after relaxing the default PLINK settings, shown in column PLINK tuned. Prior to PLINK ROH calling, the input DRAGEN hard-filtered VCF files are filtered as per the instructions in DRAGEN ROH Algorithm section.
--homozyg-density
50
50
Minimum required density to call a ROH (1 SNP in 50 kb), can be increased to relax the per SNP density.
--homozyg-gap
1000
1000
3000
Maximal interval between two homozygous SNPs in a ROH (in kb)
--homozyg-kb
1000
500
All sizes reported
Minimal length of reported ROH (in kb)
--homozyg-snp
100
50
50
Minimal number of homozygous SNPs in the reported ROH
--homozyg-window-het
1
2
Soft score threshold (1-0.025) penalty for a het SNP and 0.025 gain for a hom SNP
Maximum number of heterozygous SNPs allowed in a scanning window
--homozyg-window-missing
5
5
Number of missing calls allowed in a scanning window
--homozyg-window-snp
50
50
Variants in a scanning window
--homozyg-window-threshold
0.05
0.05
For a SNP to be eligible for inclusion in a ROH, the hit rate/overlap of all scanning windows containing the SNP must be at least 0.05
B-Allele frequency (BAF) output is enabled by default in germline and somatic VCF and gVCF runs.
The BAF value is calculated as either AF
or (1 - AF)
, where
AF = (alt_count / (ref_count + alt_count))
BAF = 1 - AF
, only when ref base < alt base, order of priority for bases is A < T < G < C < N
.
The B-allele frequency values are often plotted to visually inspect the spread away from a perfectly diploid heterozygous call (BAF=50%). This plot is more easily interpreted if it is symmetric about the BAF=50% line. To ensure the symmetry, a heuristic must be used to determine when BAF = AF
or BAF = 1-AF
. This definition of B-Allele Frequency is based on the definition that is used for bead arrays, as most users are accustomed to that implementation. Here, the choice of the B allele is based on the color of dye attached to each nucleotide. A and T get one color, G and C get the other color. The bead array implementation has much more complex rule for tie-breaking between A and T or G and C that involves top and bottom strands. This is unnecessary and so the simpler hierarchical approach of using a priority for the nucleotides A<T<G<C<N
is used.
For each small variant VCF entry with exactly one SNP alternate allele, the output contains a corresponding entry in the BAF output file.
<NON_REF>
lines are excluded
ForceGT variants (as marked by the "FGT" tag in the INFO field) are not included in the output, unless the variant also contains the "NML" tag in the INFO field.
Variants where the ref_count and alt_count are both zero are not included in the output.
--vc-enable-baf
Enable or disable B-allele frequency output. Enabled by default.
The BF generates are BigWig-compressed files, named <output-file-prefix>.baf.bw
and <output-file-prefix>.hard-filtered.baf.bw
. The hard-filtered file only contains entries for variants that pass the filters defined in the VCF (ie, PASS entries).
Each entry contains the following information: Chromosome Start End BAF
Where:
Chromosome is a string matching a reference contig.
Start and end values are zero-based, half open intervals.
BAF is a floating point value.
The filtering step identifies de novo variants calls of the joint calling workflow in regions with ploidy changes. Since de novo calling can have reduced specificity in regions where at least one of the pedigree members shows non-diploid genotypes, the de novo variant filtering marks relevant variants and thus can improve specificity of the call set.
Based on the structural and copy number variant calls of the pedigree, the FORMAT/DN field in the proband is changed from the original DeNovo value to DeNovoSV or DeNovoCNV if the de novo variant overlaps with a ploidy-changing SV or CNV, respectively. All other variant details remain unchanged, and all variants of the input VCF will also be present in the filtered output VCF. Structural or copy number variants which result in no change of ploidy, such as inversions, are not considered in the filtering. As an example, a de novo SNV calls in the input VCF
Overlapping with an SV duplication in the proband, mother or father would be represented in the filtered output VCF as follows:
The following is an example command line for running the de novo filtering, based on the files returned by the joint calling workflows:
The following options are used for de novo variant filtering:
--dn-input-vcf
---Joint small variant VCF from the de novo calling step to be filtered.
--dn-output-vcf
---File location to which the filtered VCF should be written. If not specified, the input VCF is overwritten.
--dn-sv-vcf
---Joint structural variant VCF from the SV calling step. If omitted, checks with overlapping structural variants are skipped.
--dn-cnv-vcf
--- Joint structural variant VCF from the CNV calling step. If omitted, checks with overlapping copy number variants are skipped.
DRAGEN provides post-VCF variant filtering based on annotations present in the VCF records. Default and non-default variant hard filtering are described below. However, due to the nature of DRAGEN's algorithms, which incorporate the hypothesis of correlated errors from within the core of variant caller, the pipeline has improved capabilities in distinguishing the true variants from noise, and therefore the dependency on post-VCF filtering is substantially reduced. For this reason, the default post-VCF filtering in DRAGEN is very simple.
The default filters in the germline pipeline are as follows:
##FILTER=<ID=DRAGENSnpHardQUAL,Description="Set if true:QUAL < 10.41 (3 when ML recalibration is enabled)">
##FILTER=<ID=DRAGENIndelHardQUAL,Description="Set if true:QUAL < 7.83 (3 when ML recalibration is enabled)">
##FILTER=<ID=LowDepth,Description="Set if true:DP <= 1">
##FILTER=<ID=PloidyConflict,Description="Genotype call from variant caller not consistent with chromosome ploidy">
DRAGENSnpHardQUAL and DRAGENIndelHardQUAL: For all contigs other than the mitochondrial contig, the default hard filtering consists of thresholding the QUAL value only. A different default QUAL threshold value is applied to SNP and INDEL
LowDepth: This filter is applied to all variants calls with INFO/DP <= 1
PloidyConflict: This filter is applied to all variant calls on chrY of a female subject, if female is specified on the DRAGEN command line, of if female is detected by the ploidy estimator.
For the mitochondrial contig, DRAGEN processes it through a continuous AF pipeline, which is similar to the somatic variant calling pipeline. Please refer to Mitochondrial Calling for the filtering details.
DRAGEN supports basic filtering of variant calls as described in the VCF standard. You can apply any number of filters with the --vc-hard-filter
option, which takes a semicolon-delimited list of expressions, as follows:
where the list of criteria is itself a list of expressions, delimited by the || (OR) operator in this format:
The meaning of these expression elements is as follows:
filterID---The name of the filter, which is entered in the FILTER column of the VCF file for calls that are filtered by that expression.
snp/indel/all---The subset of variant calls to which the expression should be applied.
annotation ID---The variant call record annotation for which values should be checked for the filter. Supported annotations include FS, MQ, MQRankSum, QD, and ReadPosRankSum.
comparison operator---The numeric comparison operator to use for comparing to the specified filter value. Supported operators include <, ≤, =, ≠, ≥, and >. For example, the following expression would mark with the label "SNP filter" any SNPs with FS < 2.1 or with MQ < 100, and would mark with "indel filter" any records with FS < 2.2 or with MQ < 110:
This example is for illustration purposes only and is NOT recommended for use with DRAGEN V3 output. Illumina recommends using the default hard filters. The only supported operation for combining value comparisons is OR, and there is no support for arithmetic combinations of multiple annotations. More complex expressions may be supported in the future.
The orientation bias filter is designed to reduce noise typically associated with the following:
Pre-adapter artifacts introduced during genomic library preparation (eg, a combination of heat, shearing, and metal contaminates can result in the 8-oxoguanine base pairing with either cytosine or adenine, ultimately leading to G→T transversion mutations during PCR amplification), or
FFPE (formalin-fixed paraffin-embedded) artifact. FFPE artifacts stem from formaldehyde deamination of cytosines, which results in C to T transition mutations. The orientation bias filter can only be used on somatic pipelines. To enable the filter, set the --vc-enable-orientation-bias-filter
option to true. The default is false.
The artifact type to be filtered can be specified with the --vc-orientation-bias-filter-artifacts
option. The default is C/T,G/T, which correspond to OxoG and FFPE artifacts. Valid values include C/T, or G/T, or C/T,G/T,C/A.
An artifact (or an artifact and its reverse compliment) cannot be listed twice. For example, C/T,G/A is not valid, because C→G and T→A are reverse compliments.
The orientation bias filter adds the following information:
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=OBC,Number=1,Type=String,Description="Orientation Bias Filter base context">
##FORMAT=<ID=OBPa,Number=1,Type=String,Description="Orientation Bias prior for artifact">
##FORMAT=<ID=OBParc,Number=1,Type=String,Description="Orientation Bias prior for reverse compliment artifact">
##FORMAT=<ID=OBPsnp,Number=1,Type=String,Description="Orientation Bias prior for real variant">
Please note that the OBF filter runs as a standalone process after DRAGEN is complete. The VC metrics that are computed as part of DRAGEN SNV caller will not be updated and will not reflect the additional variants that are filtered in this stage.
DRAGEN secondary analysis employs machine learning based variant recalibration (DRAGEN-ML) for germline SNV VC. Variant calling accuracy is improved using powerful yet efficient machine learning techniques that augment the variant caller, by exploiting more of the available read and context information that does not easily integrate into the Bayesian processing used by the haplotype variant caller. A supervised machine learning method was developed using truth from the PrecisionFDA v4.2.1 sets to build a model that processes read and other contextual evidence to remove false positives, recover false negatives and reduce zygosity errors, for both SNVs and INDELs.
No additional setup is required. ML model files for the hg38 and hg19 human references are packaged with the DRAGEN installer. After installation, the files are present at <INSTALL_PATH>/resources/ml_model/<ref>
DRAGEN-ML is enabled by default as needed, when running the germline SNV VC. DRAGEN will automatically detect the reference used for analysis, and use the correct model files. It either hg38 or hg19 reference type is not detected, ML recalibration will automatically be disabled and SNV VC falls back to legacy operation.
DRAGEN-ML requires a run with BAM or FASTQ input, since the machine learning model extracts information from the read pile-up. DRAGEN-ML runs concurrently with DRAGEN SNV VC. DRAGEN-ML can be applied to WGS or WES samples. Re-calibration of existing VCF files is not supported.
DRAGEN-ML recalibrates all quality scores, changing the values of the QUAL and GQ fields in the output VCF/GVCF.
DRAGEN-ML also updates PL and GP in the output VCF/GVCF.
The genotypes (GT field) of some variants may be changed by ML e.g., 0/1 to 1/1 or vice versa.
DRAGEN-ML PHRED scores (e.g. QUAL) are better calibrated than and differ significantly from those with ML disabled and, as a consequence, QUAL scores tend to not exceed 75. By comparison, QUAL scores with ML disabled can exceed 1000. For this reason, the QUAL filtering threshold is set to 3 when DRAGEN-ML is enabled, compared to 10 for DRAGEN-VC when DRAGEN-ML is disabled.
The following variants types are recalibrated:
Biallelic and multiallelic variants
Autosomes and sex chromosomes, including haploid positions
Force GT calls
Non primary contigs
DRAGEN-ML typically removes 30-50% of SNP FPs, with smaller gains on INDELS. FN counts are reduced by 10% or more. The output QUAL/GQ of DRAGEN-ML is empirically more accurately calibrated than DRAGEN SNV VC without ML. There are significant gains in accuracy statistics across the entire genome with ML enabled. Note that a small number of variant calls may have degraded accuracy with ML enabled compared to VC without ML.
DRAGEN-ML adds about 10% to the run time compared to runs without ML.
The VCF imputation tool can infer multi-allelic SNP and INDEL variants from low-coverage sequencing samples by packaging the GLIMPSE software (2020, Olivier Delaneau & Simone Rubinacci). The DRAGEN implementation of the GLIMPSE software allows for scalability of variant imputation:
with an end-to-end pipeline where the 3 phases of the GLIMPSE software (Chunk, Phase and Ligate) get executed in a single command, on one chromosome or on multiple chromosomes
with accceleration supported with Advanced Vector Extensions (AVX)
The DRAGEN VCF imputation tool infers variants on autosomes and chromosome X of haploid and diploid species.
For data other than human data (reference build hg38) the user needs to provide its own reference panel and genetic map. A custom reference panel can be built with the DRAGEN Population Haplotyping tool.
Notes:
The output is in biallelic format, one line per ALT.
The VCF imputation tool only supports input sample data generated with the DRAGEN secondary analysis software.
The following is an example of commands to impute vcf on a single chromosome:
The following is an example of commands to impute vcf on chromosome X:
The imputation tool infers multi-allelic SNP and INDEL variants from low-coverage sequencing samples that are provided by the user. To maximize the accuracy of the imputed variant per sample, the tool leverages the information from all provided samples.
The sample(s) to be imputed must have the following format:
VCF, multi-sample VCF, BCF or multi-sample BCF (zip or unzipped). gVCF is not supported
Must contain GL (Genotype Likelihoods) or PL (phred-scaled genotype likelihoods) information
To impute INDELs and get the best accuracy on INDELs, it is recommended:
and to set the command --imputation-phase-impute-reference-only-variants
to true.
Notes: IRPv1.x does nor support chrX, IRPv2.x supports chrX, chrY and chrM are not supported
A custom reference panel can be built with the DRAGEN Population Haplotyping tool. When providing a custom reference panel ensure the chromosome of mixed ploidy chromosome is divided into the PAR and non-PAR regions that exist, and the basename matches the subregions names defined in the JSON config file. The format should be <PREFIX>.basename
. Examples: IRPv2.0.chrX_par1, IRPv2.0.chrX_par2, and IRPv2.0.chrX_nonpar.
<chromosome name>.gmap.gz
3 columns: position, chromosome number, distance (cM)
compliant with the reference genome used to generate the sample input
In the IRP reference panel folder available on DRAGEN support page, the JSON config file corresponds to human data. The user can edit this file if imputation is done on another species.
For imputing VCF on human data with typename “M” for Male and “F” for Female (“M” and “F” are the values used in the sample type file):
The JSON config file is made of two fields as defined in the table below
Note: ensure the subregion names match the genetic map name. Example: if "chrX_nonpar" is defined in the "region" field of the JSON config file, then the genetic map corresponding to chromosome X non PAR region in the Reference Panel folder must be named "chrX_nonpar".gmap.gz.
The sample type file is required when haplotyping is performed on non-PAR regions of mixed ploidy chromosomes to define the typename of each sample.
The sample type file is a txt file with the following format
2 columns, tabs or space delimited
First column: list of all sample names present in the input sample
Second column: typename value for each sample. This typename value should match the typenames used in the JSON config file.
The VCF imputation tool generates several outputs:
The imputed variant file with concatenated imputed variants: one single VCF or msVCF file for all specified regions/chromosomes with name <prefix>.impute.vcf.gz
The intermediate files:
chunk regions to be passed along to the internal Phase step with name <prefix>.impute.chunk.out.txt
imputed variants per chunks identified: VCF or msVCF depending on the input sample format with name <prefix>_chr_start-end.impute.phase.vcf.gz
text file with path to all the <prefix>_chr_start-end.impute.phase.vcf.gz
generated with name <prefix>.impute.phase.out.txt
Note: while the imputation tool can impute multi-allelic positions, the output is in biallelic format, one line per ALT. The bcftools tool can be used to post-collapse all ALT in one line with the command: bcftools norm -m +snps
Note: with this end-to-end implementation of the GLIMPSE software, the parameters window_size and buffer_size are respectively set to 2 Mb and 200 kb.
The DRAGEN small variant caller is a haplotype-based caller which performs local assembly of all reads in an active region into a de Bruijn graph (DBG). The assembly process uses all the read bases including the soft-clip bases of reads. The soft-clip bases provide evidence for the presence of variants, specifically longer insertions and deletions which are not present in the read cigar and hence cannot be directly viewed in IGV.
The assembly and realignment step (using pair-HMM) performed by variant caller aims to correct mapping errors made by the original aligner and improves the overall variant caller accuracy. Using the evidence BAM, we can view how the variant caller sees the read evidence and how the reads have been realigned making it a very useful debugging tool.
By default, the evidence BAM contains only a subset of regions processed by the small variant caller. Only regions which have candidate indel variants and some percentage of soft-clip reads in the pileup are realgned and output in the evidence BAM. This is done to reduce the run-time overhead needed to generate the evidence BAM.
The output of the VC Evidence BAM feature will match the output format that the customer has selected using --output-format option. The default format is bam.
A bam/cram/sam file with the suffix _evidence.bam/cram/sam
and the corresponding index file. The evidence BAM can be enabled along with the regular BAM output from the Map-Align step. When multiple BAM are passed as inputs to the variant caller, for e.g., in Tumor-Normal calling, then they will be combined in the evidence BAM output and tagged with appropriate read groups.
A bed file with regions that were realigned and output in VC Evidence BAM with suffix ".realigned-regions.bed".
The evidence BAM consists of realigned reads, badly mated reads and reads that are disqualified by the variant caller based on the read likelihood scores.
Disqualified and Badly Mated reads
Reads that are badly-mated (when the read and its mate are mapped to different chromosmes) are tagged with a BM tag (integer) and reads that are disqualified (based on read likelihoods) are tagged with the DQ tag (integer). These reads are filtered out by the genotyper in the variant caller. The alignment score tag AS is forced to 0 for such reads in the evidence BAM and hence, they can be filtered from the IGV pileup by setting the minimum AS score to be 1 instead of 0.
Graph Haplotypes
When enabling graph haplotypes output using --vc-evidence-bam-output-haplotypes
, all the haplotypes constructed by the de Bruijn graph are output in the evidence BAM as single reads covering the entire active region. The reads and haplotypes are tagged with different read groups which makes it easily distinguishable in IGV. In IGV, we can use “Color Alignments By” or “Group Alignments By” > read group to separate out the reads from the haplotypes. The haplotypes are tagged with read group EvidenceHaplotype
and the reads are part of the EvidenceRead_Normal/Tumor
read group.
The haplotypes are named as Haplotype 1, Haplotype 2 and so on and have an additional ‘HC’ tag (integer). The realigned reads also have an HC tag which encodes which haplotype best matches the read based on the likelihood calculation. Only reads which are supported by a single unique haplotype have the HC tag, reads which match more than one haplotype well do not have an HC tag. The use of this tag is primarily intended to enable highlighting of reads in IGV. Go to "Color Alignments By > Tag" and enter "HC" to view which reads are uniquely supported by a certain graph haplotypes.
DRAGEN supports force genotyping (ForceGT) for small variant calling. To use ForceGT, use the --vc-forcegt-vcf
option with a list of small variants to force genotype. The input list of small variants can be a *.vcf or *.vcf.gz file.
The current limitations of ForceGT are as follows:
ForceGT is supported for germline small variant calling in the V3 mode. The V1, V2, and V2+ modes are not supported.
ForceGT is also supported for somatic small variant calling.
ForceGT variants do not propagate through joint genotyping.
DRAGEN supports only a single ForceGT VCF input file, which must meet the following requirements:
The input has to be a valid VCF file according to version 4.2 of the VCF standard. For instance, it has to have at least eight tab-delimited columns and records need to be sorted by reference contig and position.
The header has to list the same contigs as the reference used for variant calling. All variants must refer to one of these contig names.
Variants have to be normalized (parsimonious and left-aligned, see below).
It must not contain any multinucleotide or complex variants (AT -> C). These are variants that require more than one substitution / insertion / deletion to go from REF allele to ALT allele and are ignored.
Any deletions longer than 50bp are filtered out.
Any variant will only be called once. Duplicate entries will be ignored.
The following nonnormalized variant will cause undefined behavior in DRAGEN:
Instead of…
use…
Force genotyping requires an input VCF and can be used with DRAGEN software in VCF, GVCF or VCF+GVCF mode. In all cases the output file(s) contains all regular calls and the forceGT variants, as follows:
For a ForceGT call that was not called by the variant caller (not common), the call is tagged with FGT in the INFO field.
For a germline ForceGT call that was also called by the variant caller and filter field is PASS, the call is tagged with NML;FGT in the INFO field (NML denotes normal). In somatic mode, the call is tagged with FGT;SOM.
For a normal call (and PASS) by the variant caller, with no ForceGT call (normal), no extra tags are added (no NML tag, no FGT tag).
This scheme distinguishes among calls that are present due to FGT only, common in both ForceGT input and normal calling, and normal calls.
All the variants in the input ForceGT VCF are genotyped and present in the output file. The following table lists the reported GTs for the variants.
If DRAGEN calls a variant that is different from the one specified in the input ForceGT VCF, the output contains the following multiple entries at the same position:
One entry for the default DRAGEN variant call
One entry each for every variant call present in the input ForceGT-VCF at that position
If a target BED file is provided along with the input ForceGT VCF, then the output file only contains ForceGT variants that overlap the BED file positions.
DRAGEN Multi-region Joint Detection (MRJD) is a de novo germline small variant caller for paralogous regions. In DRAGEN v4.3, MRJD covers regions that include six clinically relevant genes: NEB, TTN, SMN1/2, PMS2, STRC, and IKBKG. MRJD is compatible with hg38, hg19 and GRCh37 reference genome. The table below includes hg38 region coordinates covered by MRJD.
MRJD is a variant calling method that is designed to detect de novo germline small variants in paralogous regions of the genome. A conventional variant caller relies on the read aligner to determine which reads likely originated from a given location. This method works well when the region of interest does not resemble any other region of the genome over the span of a single read (or a pair of reads for paired-end sequencing). However, a significant fraction of the human genome does not meet this criterion. At least 5% of the human genome consists of segmental duplications. Many regions of the genome have near-identical copies elsewhere, and as a result, the true source location of a read might be subject to considerable uncertainty. If a group of reads is mapped with low confidence, a conventional variant caller might ignore the reads, even though they contain useful information. If a read is mismapped (i.e., the primary alignment is not the true source of the read), it can result in variant detection errors.
MRJD is designed in attempt to tackle the complexities raised by segmental duplication regions. Basically, instead of considering each region in isolation, MRJD considers all locations from which a group of reads may have originated and attempts to detect the underlying sequences jointly across all paralogous regions in the sample of interest.
Below is a diagram showing the general workflow of MRJD in PMS2 and PMS2CL regions. MRJD takes primary alignments in all paralogous regions, regardless of mapping quality, builds and places haplotypes based on reads and prior knowledge, and computes joint genotypes to call small variants.
Figure 1. MRJD Caller workflow.
As shown in the diagram above, there are two modes of the DRAGEN MRJD Caller, default mode and high sensitivity mode. Here are details on the differences between the two modes.
With --enable-mrjd=true
, the MRJD Caller will report the following two types of variants:
Uniquely placed variants, which means the variant is found and placed in one of the paralogous regions without ambiguity. See variants labeled with “type 1” in Figure 2.
Region-ambiguous variants. In this case, the aggregated genotype contains a variant allele with high confidence, but MRJD Caller is unable to place the variant allele in one of the paralogous regions with high confidence. The MRJD Caller will report the variant allele in all paralogous regions. See variants labeled with “type 2” in Figure 2.
With both --enable-mrjd=true
and --mrjd-enable-high-sensitivity-mode=true
, the MRJD Caller reports the same variants as from the default mode, plus two other types of variants.
Positions where the reference alleles in all paralogous regions are not the same. It is well established that gene conversion, including reciprocal crossover, is a common event between paralogous regions (such as PMS2 and PMS2CL). When reciprocal crossover event occurs, the prior model, without nearby information on phasing, might end up placing the converted haplotype in the source region instead of the destination region, resulting in no variant. The high sensitivity mode compensates for this event by reporting the variant in corresponding positions in all paralogous regions. See variants labeled with “type 3” in Figure 2.
Variants that have been placed uniquely in one of the paralogous regions and no variant in the corresponding position in the other region. The high sensitivity mode reports the variant in the rest of the paralogous regions. This is to compensate the fact that sometimes the prior knowledge that is used to help place the variant is not sufficient or is estimated incorrectly. In those cases, the variant allele still exists but is placed in the wrong paralog region. Therefore, reporting the variant in the other paralogous regions can help maximize sensitivity even with the lack of prior. See variants labeled with “type 4” in Figure 2.
Figure 2. Different variant types reported by MRJD Caller default mode and high sensitivity mode.
The MRJD Caller is disabled by default and requires WGS data aligned to a human reference genome build 38, 19, or GRCh37.
Here is the list of options related to MRJD.
--enable-mrjd
If set to true, MRJD is enabled for the DRAGEN pipeline. Note that MRJD cannot run together with SNV caller in the current version of DRAGEN (default = ‘false’).
--mrjd-enable-high-sensitivity-mode
If set to true, MRJD high sensitivity mode is enabled for the DRAGEN pipeline. See previous section on what variant types are reported in MRJD default mode and high sensitivity mode (default = ‘false’).
The following command-line example uses FASTQ input and runs MRJD Caller with high sensitivity mode:
The following command-line example uses BAM input that has already been aligned and runs MRJD Caller with high sensitivity mode:
Here are the example command lines to first run DNA Mapping and Small Variant Calling workflow using FASTQ files as input, and then run MRJD using BAM file generated by the DNA Mapping workflow as input.
The MRJD Caller generates a .mrjd.hard-filtered.vcf.gz file in the output directory. The output file is a compressed VCFv4.2 formatted file that contains the VCF representation of the small variants from the identified genotype.
The following are example output format for uniquely placed variant. The DRAGENHardQual filter is applied to the records if the variant has a QUAL < 3.00.
Figure 3. VCF output format example for uniquely placed call.
For variant that are not uniquely placed (type 2-4 variant in Figure 2), the MRJD Caller will also report variants under diploid genotype format, which can be interpreted the same way as uniquely placed variant (the genotype is region-specific instead of an aggregate across all regions). Under this format, The QUAL presents phred-scaled quality score for the assertion made in ALT (i.e. −10log10 prob(GT==0/0)). Note that the QUAL score will be equal to or less than 3 (if the QUAL > 3, then the call should be uniquely placed).
The QUAL, GT, GQ and PL will be reported similar to the DRAGEN germline VC. To avoid losing information about the aggregated genotype across paralogous regions, the MRJD Caller reports genotype, phred-scaled quality score, and the phred-scaled genotype likelihoods for aggregated genotype using JGT, JQL, and JPL in the FORMAT column.
Figure 4. VCF output format example for non-uniquely-placed call.
Upon completion, the tool generates imputed variants based on a reference panel, a genetic map, and input samples provided. The DRAGEN secondary analysis software supports VCF imputation on human data and provides a reference panel and a genetic map for the hg38 reference build accessible on the .
To achieve more accurate results, it is recommended to use input VCF generated with the force genotyping capability of the DRAGEN secondary analysis software so that it contains all the positions that are present in the reference panel. A file to be used as input of the force genotyping run of the DRAGEN variant caller, with all sites present in the IRP reference panel (built from human reference genome hg38) is provided in the Imputation files accessible in the . When running the force genotype option (of DRAGEN variant caller) for imputation, it is recommended to disable the machine learning tool (--vc-ml-enable-recalibration=false).
to force genotype the input VCF with a SNPs-only sites.vcf file using DRAGEN argument --vc-forcegt-vcf. This SNPs-only sites.vcf file contains only the SNPs sites present in the reference panel. A SNPs-only VCF file is also available in the IRP reference panel (built from human reference genome hg38) in the Imputation files accessible in the .
A per-chromosome reference panel in BCF format that lists all the imputation positions in the targeted regions along with the corresponding haplotypes must be provided. A reference panel (with prefix IRPv{x}) is available in the Imputation files accessible in the . IRPv2.0 is a multi-allelic SNP, INDELs reference panel containing 3202 samples from the 1000 Genomes Project, which have been variant called using DRAGEN 4.0 against hg38.
A genetic map per chromosome is required to obtain the imputed variants. You can use your own genetic map computed from the recombination rate of the species and its reference genome, or use a prebuilt genetic map corresponding to the human hg38 reference genome. A prebuilt map is available as part of the Imputation files, accessible at the . DRAGEN does not generate custom genetic map files. The genetic map should follow the format:
This config file allows the proper handling of haploid/diploid chromosomes. This file is present in the same directory of the input reference panel with PREFIX and is available in the . It must follow the naming convention: {$DIR}/{$PREFIX}.config.json. When the config file is not present in the directory, the tool assumes that the imputation is done on all diploid chromosomes.
It is important to note that MRJD cannot run together with the DRAGEN Small Variant Caller in this DRAGEN version. We recommend users to run DNA Mapping and Small Variant Calling workflow first, and then run MRJD using the aligned BAM file generated from DNA Mapping workflow as input. Using this workflow, two VCF files will be created (.hard-filtered.vcf.gz by DRAGEN Small Variant Caller and .mrjd.hard-filtered.vcf.gz by DRAGEN MRJD). To help user get a single VCF file for downstream anlaysis, we prepared a utility tool that replaces the DRAGEN Small Variant Caller output in the homology region of the six medically relevant and challenging genes with MRJD caller output. The tool also annotates the calls made by MRJD (with "MRJD" tag in the INFO column). Please refer to the to download the utility tool.
regions
Required only when a chromosome of mixed ploidy is present in the Reference Panel folder
Define contig name and subregion name of mixed ploidy chromosome
Dictionary in the form: contigname_of_mixed_ploidy :[contigname_of_mixed_ploidy"_par1", contigname_of_mixed_ploidy"_par2", contigname_of_mixed_ploidy"_nonpar1", contig_name_of_mixed_ploidy"_nonpar2"...]
ploidy
“default” is a required name
contigname_of_mixed_ploidy_"nonpar" is required only when a chromosome of mixed ploidy is present in the Reference Panel folder
Define:
ploidy behavior when different from “default”
default ploidy behavior
Dictionary in the form: contigname_of_mixed_ploidy_"nonpar": { typename1 : 1, typename2 : 2} "default" : { "typename1": 2, "typename 2": 2} typename is used in the Sample Type file input
--enable-imputation
NA
Yes
Set to true
to enable vcf imputation pipeline
--imputation-ref-panel-dir
STRING
Yes
Directory containing per-chromosome reference panel VCF and optionally the JSON config file
--imputation-ref-panel-prefix
STRING
Yes
Prefix for reference panel files and the JSON config file
--imputation-genome-map-dir
STRING
Yes
Directory containing per-chromosome genome map files
--imputation-chunk-input-region
STRING
Yes for single region
Target region, usually a full chromosome (e.g. chr20:1000000-2000000 or chr20).
--imputation-chunk-input-region-list
STRING
Yes for list of regions
Text file listing chromosomes or regions to be processed, one chromosome/region per line.
--imputation-phase-input
STRING
Yes for single VCF file
Sample input file with VCF/BCF format. Single VCF or multi-sample VCF
--imputation-phase-input-list
STRING
Yes for multiple VCF files
Text file listing sample input in VCF/BCF format, one input file per line
--imputation-phase-sample-type
STRING
Yes when imputing on a non PAR region of mixed ploidy chromosome AND a single VCF file
Define typename of the VCF file imputed. The typename must match one of the two typenames defined in the JSON config file
--imputation-phase-sample-type-list
STRING
Yes when imputing on a non PAR region of mixed ploidy chromosome AND a list of VCF files
Path to the Sample Type file
--output-directory
STRING
Yes
Output directory
--output-file-prefix
STRING
Yes
Output files prefix
--imputation-phase-threads
INT
No
Specify the number of threads to use. Default is the number of system threads
--imputation-phase-filter-input-sample-in-ref
NA
No
Default is true
: if sample ID matches between reference panel and sample input, then the corresponding samples are ignored from the reference panel to avoid imputation against itself. To be turned to false
if all samples from the reference panel should be kept regardless of their presence in the sample input.
--imputation-phase-impute-reference-only-variants
STRING
No
Default is false
. If set to true
, allows imputation at variants only present in the reference panel. The use of this option is intended only to allow imputation at sporadic missing variants. If the number of missing variants is non-sporadic, please re-run the genotype likelihood computation at all reference variants and avoid using this option, since data from the reads should be used.
When the input sample variant calling was performed using --vc-forcegt-vcf
with SNPs-only sites.vcf file, it is recommended to set this option to true to also impute INDELs positions from the reference panel.
--imputation-phase-input-independently
STRING
No
Default is false
. If set to true
, allows to treat each sample input independently without using them in the reference panel calculation
vc-output-evidence-bam
Enable evidence BAM output
False
vc-evidence-bam-output-haplotypes
Output graph haplotypes in evidence BAM
False
vc-evidence-bam-clipped-read-threshold
Percentage of clipped reads in active region to enable evidence BAM output for that region
10%
vc-evidence-bam-force-output
Force evidence BAM output for all active regions
False
At a position with no coverage
./. or .
At a position with coverage but no reads supporting ALT allele
0/0 or 0
At a position with coverage and reads supporting ALT allele
dependent on pipeline (germline/somatic)
chr2
151578759
151588523
NEB exon 98-105
chr2
151589318
151599076
NEB exon 90-97
chr2
151599871
151609628
NEB exon 82-89
chr2
178653238
178654995
TTN exon 172-180
chr2
178657498
178659255
TTN exon 181-189
chr2
178661759
178663516
TTN exon 190-198
chr5
70049522
70077596
SMN2
chr5
70924940
70953013
SMN1
chr7
5970924
5980896
PMS2 exon 13-15
chr7
5980968
5987689
PMS2 exon 11-12
chr7
6737007
6743712
PMS2CL exon 2-3
chr7
6743880
6753867
PMS2CL exon 4-6
chr15
43599563
43602630
STRC exon 24-29
chr15
43602982
43611000
STRC exon 14-23
chr15
43611040
43618800
STRC exon 1-13
chr15
43699379
43702452
STRCP1 exon 23-28
chr15
43702488
43710472
STRCP1 exon 13-22
chr15
43710502
43718262
STRCP1 exon 1-12
chrX
154555884
154565047
IKBKG exon 3-10
chrX
154639390
154648553
IKBKGP1