DRAGEN generates multiple pipeline-specific metrics including:
Mapping and Aligning metrics
Variant calling metrics
Biomarker metrics
Coverage (or enrichment) metrics and reports
Duration (or run time) metrics
Figure 10: Generation of Metrics and Reports
The QC metrics are printed to the standard output. In addition, CSV files are written to the run output directory:
<output prefix>.mapping_metrics.csv
<output prefix>.vc_metrics.csv
<output prefix>.<coverage region prefix>_coverage_metrics.csv
<output prefix>.time_metrics.csv
<output prefix>.<other coverage reports>.csv
Each CSV file has five columns: Section, Subsection (e.g., read group or sample), Metric, Value 1 (Count/Ratio/Minutes), and Value 2 (Percentage/Seconds).
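The five-column layout can be read with a few lines of Python. This is a minimal sketch, assuming comma-delimited rows with no header row and an optional fifth column; the function name `parse_metrics` and the sample rows are hypothetical and are not real DRAGEN output.

```python
import csv
import io

def parse_metrics(text):
    """Parse metrics CSV text into a list of dicts, one per row."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue
        # Assumption: Value 2 may be absent, so pad short rows to five fields.
        fields = fields + [""] * (5 - len(fields))
        section, subsection, metric, value1, value2 = fields[:5]
        rows.append({
            "section": section,
            "subsection": subsection,
            "metric": metric,
            "value1": value1,
            "value2": value2,
        })
    return rows

# Hypothetical sample rows for illustration only.
SAMPLE = (
    "MAPPING/ALIGNING SUMMARY,,Total input reads,1000,100.00\n"
    "MAPPING/ALIGNING SUMMARY,,Mapped reads,980,98.00\n"
)

for row in parse_metrics(SAMPLE):
    print(row["metric"], row["value1"], row["value2"])
```

Values are kept as strings here because a metric can be reported as NA.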
DRAGEN computes mapping and aligning metrics similar to Samtools Flagstat.
The following applies to mapping metrics:
Metrics are available both on an aggregate level and on a per read group level.
In germline and somatic tumor-only mode, only one set of mapping metrics is available.
In somatic tumor-normal mode, the mapping and aligning metrics are generated separately for the tumor and normal samples, with each line beginning with TUMOR or NORMAL to indicate the sample. The metrics for the tumor sample are output first, followed by the metrics for the normal sample. Metrics per read group are also separated into tumor and normal read groups.
Unless explicitly stated, the metrics are in units of reads (not pairs).
Definitions:
Total input reads---Total number of reads in the input FASTQ files.
Number of duplicate marked reads---Reads marked as duplicates as a result of the --enable-duplicate-marking option being set to true.
Number of duplicate marked and mate reads removed---Reads marked as duplicates, along with any mate reads, that are removed when the --remove-duplicates option is set to true.
Number of unique reads---Total number of reads minus the duplicate marked reads.
Reads with mate sequenced---Number of reads with a mate.
Reads without mate sequenced---Total number of reads minus number of reads with mate sequenced.
QC-failed reads---Reads that did not pass platform/vendor quality checks (SAM flag 0x200).
Mapped reads---Total number of mapped reads.
Mapped reads with filtered mapping---Total number of mapped reads plus reads mapped to non-reference decoy contigs plus reads mapped to the rRNA filter contig.
Mapped reads adjusted for excluded mapping---Total number of mapped reads plus reads mapped to the excluded RNA mitochondrial contig.
Mapped reads adjusted for filtered and excluded mapping---Total number of mapped reads plus reads mapped to the rRNA filter contig plus reads mapped to the excluded RNA mitochondrial contig.
Number of unique and mapped reads---Number of mapped reads minus number of duplicate marked reads.
Unmapped reads---Total number of reads that could not be mapped.
Unmapped reads minus filtered mapping---Total number of unmapped reads minus reads mapped to non-reference decoy contigs minus reads mapped to the rRNA filter contig.
Unmapped reads adjusted for excluded mapping---Total number of unmapped reads minus reads mapped to the excluded RNA mitochondrial contig.
Unmapped reads adjusted for filtered and excluded mapping---Total number of unmapped reads minus reads mapped to the rRNA filter contig minus reads mapped to the excluded RNA mitochondrial contig.
Singleton reads---Number of reads where the read could be mapped, but the paired mate could not be mapped.
Paired reads---Count of reads in which both reads in the pair are mapped.
Properly paired reads---Both reads in the pair are mapped and fall within an acceptable range from each other based on the estimated insert length distribution.
Not properly paired reads (discordant)---The number of paired reads minus the number of properly paired reads.
Paired reads mapped to different chromosomes---The number of reads with a mate, where the mate was mapped to a different chromosome.
Paired reads mapped to different chromosomes (MAPQ >= 10)---The number of reads with a MAPQ >= 10 and with a mate, where the mate was mapped to a different chromosome.
Reads with indel R1---The percentage of R1 reads containing at least 1 indel.
Reads with indel R2---The percentage of R2 reads containing at least 1 indel.
Soft-clipped bases R1---The percentage of bases in R1 reads that are soft-clipped.
Soft-clipped bases R2---The percentage of bases in R2 reads that are soft-clipped.
Mismatched bases R1---The number of mismatched bases on R1, which is the sum of SNP count and indel lengths. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Mismatched bases R2---The number of mismatched bases on R2, which is the sum of SNP count and indel lengths. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Mismatched bases R1 (excluding indels)---The number of mismatched bases on R1. Indel lengths are ignored. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Mismatched bases R2 (excluding indels)---The number of mismatched bases on R2. Indel lengths are ignored. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Q30 Bases---The total number of bases with a BQ >= 30. Includes mapped & unmapped reads & bases. Excludes duplicate marked reads & secondary alignments.
Q30 Bases R1---The total number of bases on R1 with a BQ >= 30.
Q30 Bases R2---The total number of bases on R2 with a BQ >= 30.
Q30 Bases (excluding dups and clipped bases)---The number of bases on non-duplicate and non-clipped bases with a BQ >= 30.
Histogram of reads map qualities
Reads with MAPQ [40:inf)
Reads with MAPQ [30:40)
Reads with MAPQ [20:30)
Reads with MAPQ [10:20)
Reads with MAPQ [0:10)
Total alignments---Total number of loci that reads aligned to with quality > 0.
Secondary alignments---Number of secondary alignment loci.
Supplementary (chimeric) alignments---A chimeric read is split over multiple loci (possibly due to structural variants). One alignment is referred to as the representative alignment. The others are supplementary.
Estimated read length---Total number of input bases divided by the number of reads.
Insert length: mean---Mean insert size estimated for the read group.
Insert length: median---Median insert size estimated for the read group.
Insert length: standard deviation---Standard deviation of insert size estimated for the read group.
Note: The insert length metrics reported above are computed using high quality (MAPQ >= 20) properly paired read pairs, considering all the read pairs for the read group. These values may differ from those in the standard output log reported during insert stats sampling, which reports these metrics only for the first ~2M read pairs for DNA (~100K read pairs for RNA).
Whole read group insert length estimation for RNA datasets is currently not supported. For RNA runs, the reported insert length metrics are computed using up to the first ~100K high quality read pairs for the read group from the input FASTQ/BAM/CRAM file.
Input bases divided by reference genome size---Raw coverage as computed by summing all read lengths (including duplicate marked reads, but excluding secondary and supplementary alignments) and dividing by the reference genome size.
Input bases divided by target bed size---Raw coverage as computed by summing all read lengths (including duplicate marked reads, but excluding secondary and supplementary alignments) and dividing by the target bed size.
Estimated sample contamination---The estimated fraction of reads in a sample that may be from another human source.
The DRAGEN cross-sample contamination module uses a probabilistic mixture model to estimate the fraction of reads in a sample that may be from another human source. DRAGEN supports separate modes for germline and somatic samples.
The germline model, like VerifyBamID, assumes that a sample can be modeled as a DNA mixture from 2 or more individuals. Pileup analysis is used to investigate loci where variants are common in the general population. Variants with high allele frequencies are likely to be real germline variants in the individual of interest, while low allele frequency variants at these common germline loci are likely noise or germline variants from a contaminating sample B. The probabilistic mixture model accounts for noise and then tries to detect consistent allele frequency distributions. For example, if the pileups show consistent low allele frequencies of 1% or 2%, then the mixture model will likely infer 2% contamination from sample B, where the 1% and 2% AF variants correspond to heterozygous and homozygous germline calls in sample B.
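The arithmetic behind that example can be sketched as follows. This is an illustration only, not the DRAGEN mixture model: it assumes a site where the sample of interest is homozygous reference, so every alternate read comes from the contaminating sample B. The function name is hypothetical.

```python
def expected_contaminant_afs(contamination):
    """Return (het_af, hom_af) expected from a contaminant at a given fraction.

    Assumption: the main sample is homozygous reference at the site, so
    alternate reads come only from the contaminating sample.
    """
    het_af = contamination / 2  # half of the contaminant reads carry the allele
    hom_af = contamination      # all of the contaminant reads carry the allele
    return het_af, hom_af

het_af, hom_af = expected_contaminant_afs(0.02)
print(f"het: {het_af:.2%}, hom: {hom_af:.2%}")
```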
The germline cross-contamination metric is enabled by using the following setting and pointing to a VCF that includes marker sites (RSIDs) with population allele frequencies that are close to 0.5.
--qc-cross-cont-vcf <INSTALL_PATH>/resources/qc/sample_cross_contamination_resource_[hg19 or GRCh37 or GRCh38].vcf
The somatic model, like GATK CalculateContamination, supports tumor-only or tumor-normal runs. The somatic model is more advanced than the germline model in the way that it accounts for somatic CNVs or LoH regions where the diploid assumptions may be broken. The algorithm also tries to account for FFPE deamination and oxidation noise by empirically adjusting base qualities prior to estimation.
The somatic cross-contamination metric is enabled by pointing to the VCF that includes the marker sites (RSIDs) with high population allele frequencies.
--qc-somatic-contam-vcf <INSTALL_PATH>/resources/qc/somatic_sample_cross_contamination_resource_[hg19 or GRCh37 or GRCh38].vcf.gz
The metric value is printed as a fraction, so a value of 0.011 represents 1.1% contamination from another sample.
MAPPING/ALIGNING SUMMARY Estimated sample contamination 0.011
The precision of variant calling, particularly for somatic variants, can be significantly impacted by cross-sample contamination. To ensure safe usage of a sample, the level of cross-sample contamination must be considerably lower than the minimum allele frequencies of interest. For instance, if a sample has 1% contamination, it may be necessary to disregard all variants with less than 5% allele frequency. The cross-contamination metric for a sample reaches saturation near 30% contamination.
The contamination module requires a minimum of 100 valid pileups for contamination estimation, where a pileup is considered valid if it has at least 10X coverage and 95% or more reads are deemed valid. Soft clipped reads that could indicate INDELs or structural variants are not considered valid, and datasets with untrimmed adapters may lead to most reads being soft clipped and classified as invalid. If the contamination module reports "NA," even for high-coverage samples, it is recommended to inspect a few pileup locations in IGV for evidence of untrimmed bases.
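The validity rules above can be expressed as a small check. This is a sketch under stated assumptions: a pileup is summarized as a (depth, valid read count) pair, and depth counts all reads at the site. The function and constant names are hypothetical, and the real module may evaluate pileups differently.

```python
MIN_VALID_PILEUPS = 100   # minimum number of valid pileups (from the text)
MIN_DEPTH = 10            # minimum coverage for a valid pileup (from the text)
MIN_VALID_PERCENT = 95    # minimum percentage of valid reads (from the text)

def is_valid_pileup(depth, valid_reads):
    """Return True if the pileup has >= 10X coverage and >= 95% valid reads."""
    if depth < MIN_DEPTH:
        return False
    # Integer comparison avoids floating-point rounding at the 95% boundary.
    return valid_reads * 100 >= MIN_VALID_PERCENT * depth

def can_estimate_contamination(pileups):
    """Return True if enough pileups are valid for an estimate."""
    valid = sum(1 for depth, valid_reads in pileups if is_valid_pileup(depth, valid_reads))
    return valid >= MIN_VALID_PILEUPS
```

With untrimmed adapters, most reads would fail the validity test, so few pileups would pass and the estimate would be reported as NA.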
Optional Contamination Settings:
The generated variant calling metrics are similar to the metrics computed by RTG vcfstats. Metrics are reported for each sample in multi-sample VCF and gVCF files and in a CSV file with the file name ending in "vc_metrics.csv". Depending on the run case, metrics are reported either as standard VARIANT CALLER or JOINT CALLER. Metrics are reported for both the raw (PREFILTER) and hard-filtered (POSTFILTER) VCF file.
Panel of Normals (PON) and COSMIC filtered variants are counted as PASS variants in the POSTFILTER VCF metrics. These PASS variants can cause higher than expected variant counts in the POSTFILTER VCF metrics.
Number of samples---Number of samples in the population/joint VCF.
Reads Processed---The number of reads used for variant calling, excluding any duplicate marked reads and reads falling outside of the target region.
Total---The total number of variants (SNPs + MNPs + indels).
Biallelic---Number of sites in a genome that contain two observed alleles. The reference is counted as one allele, which allows for one variant allele.
Multiallelic---Number of sites in the VCF that contain three or more observed alleles. The reference is counted as one, which allows for two or more variant alleles.
SNPs---A variant is counted as an SNP when the reference, allele 1, and allele 2 are all length 1.
Insertions (Hom)---Number of variants that contain homozygous insertions.
Insertions (Het)---Number of variants where both alleles are insertions, but not homozygous.
Deletions (Hom)---Number of variants that contain homozygous deletions.
INDELS (Het)---Number of variants where genotypes are either [insertion+deletion], [insertion+SNP], or [deletion+SNP].
De Novo SNPs---De novo marked SNPs with DQ greater than the threshold. Set the --qc-snp-denovo-quality-threshold option to the required threshold. The default is 0.05 if ML recalibration is off, 0.0017 if ML recalibration is on.
De Novo INDELs---De novo marked indels with DQ values greater than the threshold. This DQ threshold can be specified by setting the --qc-indel-denovo-quality-threshold option to the required DQ threshold. The default is 0.4 if ML recalibration is off, 0.04 if ML recalibration is on.
De Novo MNPs---De novo marked MNPs with DQ greater than the threshold. Set the --qc-snp-denovo-quality-threshold option to the required threshold. The default is 0.05 if ML recalibration is off, 0.0017 if ML recalibration is on.
(Chr X SNPs)/(Chr Y SNPs) ratio in the genome (or the target region) ---Number of SNPs in chromosome X (or in the intersection of chromosome X with the target region) divided by the number of SNPs in chromosome Y (or in the intersection of chromosome Y with the target region). If there was no alignment to either chromosome X or chromosome Y, this metric shows as NA.
SNP Transitions---An interchange of two purines (A<->G) or two pyrimidines (C<->T).
SNP Transversions---An interchange of purine and pyrimidine bases.
Ti/Tv ratio---Ratio of transitions to transversions.
Heterozygous---Number of heterozygous variants.
Homozygous---Number of homozygous variants.
Het/Hom ratio---Heterozygous/homozygous ratio.
In dbSNP---Number of variants detected that are present in the dbSNP reference file. If no dbSNP file is provided via the --dbsnp option, then both the In dbSNP and Novel metrics show as NA.
Novel---Total number of variants minus number of variants in dbSNP.
Percent Callability---Available in germline and somatic modes with gVCF output. The percentage of non-N reference positions having a PASSing genotype call. Multiallelic variants are not counted. Deletions are counted for all the deleted reference positions only for homozygous calls. Only autosomes and chromosomes X, Y, and M are considered. To produce this metric for non-human references, set --qc-callability-autosome-contigs to specify the autosome contig names. Optionally, --qc-callability-xym-contigs allows setting X, Y and M contig names.
Percent Autosome Callability---Only autosomes are considered. To produce this metric for non-human references, set --qc-callability-autosome-contigs to specify the autosome contig names.
Percent QC Region Callability in Region i (where i is 1, 2, or 3)---Available if callability for custom regions is requested via the --qc-coverage-region-i option and the callability output is specified with --qc-coverage-reports-i. All contigs are considered. Setting --qc-callability-autosome-contigs enables outputting this metric for non-human references.
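The SNP transition and transversion definitions above can be sketched as follows. This is a minimal illustration with hypothetical function names, assuming each SNP is given as a (reference base, alternate base) pair with two different bases; it is not how DRAGEN computes the metric internally.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """Return True for a purine<->purine or pyrimidine<->pyrimidine change."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES

def ti_tv_ratio(snps):
    """Return transitions / transversions, or None if there are no transversions."""
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ti
    return ti / tv if tv else None

print(ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```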
When the germline small variant caller is executed, DRAGEN calculates a het/hom ratio per contig.
The het/hom ratio values can be used as an indication of whole chromosome uniparental disomy (UPD). UPD of certain chromosomes is associated with genetic syndromes known as imprinting disorders. Chromosomes affected by whole chromosome UPD have het/hom ratios close to 0.0. Normal ranges vary, but are usually between 1.0 and 2.0. The het/hom ratios should be interpreted in the context of the specific assay.
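A per-contig screen for unusually low ratios might look like the following sketch. The cutoff of 0.1 is an arbitrary placeholder for "close to 0.0" and is not a DRAGEN default; the function names and the counts in the usage line are hypothetical.

```python
def het_hom_ratio(n_het, n_hom):
    """Return the het/hom ratio, or None when there are no homozygous variants."""
    return n_het / n_hom if n_hom else None

def flag_low_het_hom(per_contig, cutoff=0.1):
    """Return contigs whose het/hom ratio is below the (assumed) cutoff.

    per_contig maps contig name -> (heterozygous count, homozygous count).
    """
    flagged = []
    for contig, (n_het, n_hom) in per_contig.items():
        ratio = het_hom_ratio(n_het, n_hom)
        if ratio is not None and ratio < cutoff:
            flagged.append(contig)
    return flagged

print(flag_low_het_hom({"chr1": (1500, 1000), "chr15": (20, 1000)}))
```

Any cutoff would need to be calibrated for the specific assay, as noted above.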
DRAGEN reports the ratios for both the raw (PREFILTER) and hard-filtered (POSTFILTER) VCF. The metrics are output to the .vc_hethom_ratio_metrics.csv file.
The file contains the following values for each primary contig processed.
Contig
Number of heterozygous variants
Number of homozygous variants
Het/Hom ratio
The following example shows a section of the metrics.
DRAGEN supports a number of reports dedicated to coverage metrics. Some other DRAGEN components, including the mapper and aligner, ploidy caller, and variant callers, may emit limited coverage-related metrics. The metrics from these other components may not always exactly match the metrics in the DRAGEN coverage reports. The following table lists some important differences.
Table 6 Coverage reported in files other than the main coverage reports
The coverage reports listed in Table 7 all follow the same default rules for counting or excluding reads:
Duplicate reads are ignored.
Soft and/or hard-clipped bases are ignored.
Overlapping mates are double-counted.
Reads with MAPQ > 0 are included (i.e., MAPQ=0 reads are filtered).
Bases with BQ >= 0 are included.
Table 7 DRAGEN Coverage Reports
By default, DRAGEN coverage reports are generated over the whole genome and, if provided, also over a target region. DRAGEN additionally supports the ability to specify custom regions and report types of interest.
In somatic tumor-normal mode, DRAGEN generates separate reports for the tumor and normal samples. Each report is labeled according to the sample type. Tumor sample reports include tumor at the end of the file name, and normal sample reports include normal at the end of the file name. To include both tumor and normal sample results in one file, set the --vc-enable-separate-t-n-metrics option to false. DRAGEN then reports on the aggregate of both samples.
The coverage reports do not require the mapper or variant callers, but they are not compatible with --enable-sort=false.
The following command shows an example use case for specifying custom coverage reports:
The settings --qc-coverage-region-i and --qc-coverage-reports-i work as a pair (i can be 1, 2, or 3). The former specifies the region, while the latter specifies the report types for that region. The number i links the settings. Up to three such region and report pairs may be specified.
The --qc-coverage-region-i option requires a BED file argument (i can be 1, 2, or 3).
Regions in each BED file can optionally be padded using the --qc-coverage-region-padding-i option (by default, no padding is applied).
A set of default reports is generated for each region.
Additional reports can be specified for each region by using the --qc-coverage-reports-i option.
If multiple report types are selected per region, they should be space-separated, e.g., --qc-coverage-reports-1 callability full_res.
Default settings used for all DRAGEN coverage reports:
Duplicate reads are ignored.
Soft and/or hard-clipped bases are ignored.
Overlapping mates are double-counted.
Reads with MAPQ > 0 are included. MAPQ=0 reads are filtered.
Bases with BQ >= 0 are included.
Non-default setting
For example, the following options are used to enable full (base pair) resolution coverage output with more stringent MAPQ and BQ filtering:
The argument syntax mapq<value,bq<value implies that reads with a mapping quality less than the specified value, or bases with a read base call quality below the specified value, will be ignored.
Valid filter arguments are mapq and bq only. Either, or both, can be specified.
Only one operator < is supported. <=, >, >=, = are not supported.
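The filter syntax rules above can be illustrated with a small validator. This is a sketch of the stated rules only (names mapq and bq, operator <, either or both filters); the function name is hypothetical and DRAGEN's own parsing may accept or reject inputs differently.

```python
def parse_coverage_filters(arg):
    """Parse a filter string such as 'mapq<20,bq<20' into a dict of thresholds."""
    filters = {}
    for part in arg.split(","):
        name, sep, value = part.partition("<")
        # Only the '<' operator and the names 'mapq' and 'bq' are valid;
        # '<=' leaves '=' in the value, which fails the digit check.
        if sep != "<" or name not in ("mapq", "bq") or not value.isdigit():
            raise ValueError(f"invalid coverage filter: {part!r}")
        filters[name] = int(value)
    return filters

print(parse_coverage_filters("mapq<20,bq<20"))
```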
By default, DRAGEN emits a _coverage_metrics.csv file for each available region type, including the full genome, target region, and any additionally specified QC regions.
The _coverage_metrics.csv file is generally the most useful of all the coverage reports and is usually the first file to inspect when performing coverage-based QC.
The first column of the output file contains the section name COVERAGE SUMMARY and the second column (the subsection) is empty for all entries in the file.
The following metrics are calculated:
Aligned bases in region---Number of uniquely mapped bases to region and the percentage relative to the number of uniquely mapped bases to the genome.
Average alignment coverage over region---Number of uniquely mapped bases to region divided by the number of sites in region.
Uniformity of coverage (PCT > 0.2*mean) over region---Percentage of sites with coverage greater than 20% of the mean coverage in region.
PCT of region with coverage [ix, inf)---Percentage of sites in region with at least ix coverage, where i can equal 100, 50, 20, 15, 10, 3, 1 and 0.
PCT of region with coverage [ix, jx)---Percentage of sites in region with at least ix but less than jx coverage, where (i, j) can equal (50, 100), (20, 50), (15, 20), (10, 15), (3, 10), (1, 3) and (0, 1).
Average chromosome X coverage over region---Total number of bases that aligned to the intersection of chromosome X with region divided by the total number of loci in the intersection of chromosome X with region. If there is no chromosome X in the reference genome or the region does not intersect chromosome X, this metric shows as NA.
Average chromosome Y coverage over region---Total number of bases that aligned to the intersection of chromosome Y with region divided by the total number of loci in the intersection of chromosome Y with region. If there is no chromosome Y in the reference genome or the region does not intersect chromosome Y, this metric shows as NA.
XAvgCov/YAvgCov ratio over genome/target region---Average chromosome X alignment coverage in region divided by the average chromosome Y alignment coverage in region. If there is no chromosome X or chromosome Y in the reference genome or the region does not intersect chromosome X or Y, this metric shows as NA.
Average mitochondrial coverage over region---Total number of bases that aligned to the intersection of the mitochondrial chromosome with region divided by the total number of loci in the intersection of the mitochondrial chromosome with region. If there is no mitochondrial chromosome in the reference genome or the region does not intersect mitochondrial chromosome, this metric shows as NA.
Average autosomal coverage over region---Total number of bases that aligned to the autosomal loci in region divided by the total number of loci in the autosomal loci in region. If there is no autosome in the reference genome, or the region does not intersect autosomes, this metric shows as NA.
Median autosomal coverage over region---Median alignment coverage over the autosomal loci in region. If there is no autosome in the reference genome or the region does not intersect autosomes, this metric shows as NA.
Mean/Median autosomal coverage ratio over region---Mean autosomal coverage in region divided by the median autosomal coverage in region. If there is no autosome in the reference genome or the region does not intersect autosomes, this metric shows as NA.
Aligned reads in region---Number of uniquely mapped reads to region and percentage relative to the number of uniquely mapped reads to the genome. Only reads with MAPQ ≥ 1 are included. Secondary and supplementary alignments are ignored.
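Several of the metrics above are simple functions of the per-site depth. The following sketch computes the average coverage, the uniformity metric, and the "[ix, inf)" percentages from a list of depths. It assumes the depths have already been filtered according to the counting rules, and the function name is hypothetical.

```python
def coverage_summary(depths, thresholds=(100, 50, 20, 15, 10, 3, 1, 0)):
    """Return (mean coverage, uniformity percent, {threshold: percent >= threshold})."""
    n = len(depths)
    mean = sum(depths) / n
    # Uniformity: percentage of sites with coverage greater than 20% of the mean.
    uniformity = 100.0 * sum(1 for d in depths if d > 0.2 * mean) / n
    # Percentage of sites with at least ix coverage, for each threshold i.
    pct_at_least = {
        t: 100.0 * sum(1 for d in depths if d >= t) / n for t in thresholds
    }
    return mean, uniformity, pct_at_least

mean, uniformity, pct = coverage_summary([0, 10, 20, 30])
print(mean, uniformity, pct[10])
```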
The following is an example of the contents of the _coverage_metrics.csv file:
The fine histogram report outputs a _fine_hist.csv file, which contains two columns: Depth and Overall. The value in the Depth column ranges from 0 to 2000+ and the Overall column indicates the number of loci covered at the corresponding depth.
Masked regions in the FASTA are ignored and no depth is reported for these regions.
The histogram report outputs a _hist.csv file, which provides the following:
Percentage of bases in the coverage BED/target BED/WGS region that fall within a certain range of coverage.
Duplicate reads are ignored if DRAGEN is run with --enable-duplicate-marking true.
The following ranges are used: "[100x:inf)", "[1x:3x)", "[0x:1x)"
Masked regions in the FASTA are ignored and no depth is reported for these regions.
The Overall Mean Coverage report generates an _overall_mean_cov.csv file, which contains the average alignment coverage over the coverage BED/target BED/WGS, as applicable.
The following is an example of the contents of the _overall_mean_cov.csv file:
Average alignment coverage over target_bed,80.69
Masked regions in the FASTA are ignored and no depth is reported for these regions.
The Contig Mean Coverage report generates a _contig_mean_cov.csv file, which contains the estimated coverage for all contigs and an autosomal estimated coverage. The file includes the following three columns:
Masked regions in the FASTA are ignored and no depth is reported for these regions.
The Full Res Report outputs a _full_res.bed file in tab-delimited format. The first three columns are the standard BED fields, and the fourth column is the depth. Each record in the file is for a given interval that has a constant depth. If the depth changes, then a new record is written to the file. Alignments that have a mapping quality value of 0, duplicate reads, and clipped bases are not counted towards the depth.
Only base positions that fall under the user-specified coverage-region bed regions are present in the _full_res.bed output file.
The _full_res.bed file structure is similar to the output file of bedtools genomecov -bg. The contents are identical if the bedtools command line is executed after filtering out alignments with mapping quality value of 0, and possibly filtering by a target BED (if specified).
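The "one record per constant-depth interval" structure amounts to run-length encoding of the per-base depth. The following is a minimal sketch with a hypothetical function name, assuming 0-based, half-open BED coordinates and a depth list that starts at the given position.

```python
def depth_to_intervals(chrom, start, depths):
    """Collapse per-base depths into (chrom, start, end, depth) records."""
    records = []
    run_start = start
    for i in range(1, len(depths) + 1):
        # Close the current run at the end of the data or when the depth changes.
        if i == len(depths) or depths[i] != depths[i - 1]:
            records.append((chrom, run_start, start + i, depths[i - 1]))
            run_start = start + i
    return records

for record in depth_to_intervals("chr1", 100, [5, 5, 7, 7, 7, 0]):
    print("\t".join(str(field) for field in record))
```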
The following is an example of the contents of the _full_res.bed file:
Coverage is reported for all positions specified by qc-coverage-region-i. Masked regions in the FASTA are not ignored.
When --enable-metrics-compression is set to true, the 1 bp resolution coverage metrics output BED file (_full_res.bed) is compressed to bigwig format.
The cov_report report generates a _cov_report.bed file in a tab-delimited format. This report includes summary coverage statistics per BED region. The first three columns are standard BED fields. The DRAGEN Amplicon pipeline includes a fourth column for name and a fifth column for gene_id. The remaining column fields are statistics calculated over the interval region specified on the same record line.
The following table lists the appended columns.
total_cvg---The total coverage value.
mean_cvg---The mean coverage value.
Q1_cvg---The lower quartile (25th percentile) coverage value.
median_cvg---The median coverage value.
Q3_cvg---The upper quartile (75th percentile) coverage value.
min_cvg---The minimum coverage value.
max_cvg---The maximum coverage value.
pct_above_X---Indicates the percentage of bases over the specified interval region that had a depth coverage greater than X.
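The appended columns can be reproduced from the per-base depths of one interval, as in the sketch below. The quartile method (`statistics.quantiles` with `method="inclusive"`) is an assumption, since the text does not say how DRAGEN interpolates quartiles; the function name and default thresholds shown here are illustrative.

```python
import statistics

def interval_stats(depths, thresholds=(5, 15)):
    """Return a dict of per-interval coverage statistics."""
    # Assumption: inclusive quartiles; DRAGEN's interpolation may differ.
    q1, median, q3 = statistics.quantiles(depths, n=4, method="inclusive")
    stats = {
        "total_cvg": sum(depths),
        "mean_cvg": sum(depths) / len(depths),
        "Q1_cvg": q1,
        "median_cvg": median,
        "Q3_cvg": q3,
        "min_cvg": min(depths),
        "max_cvg": max(depths),
    }
    for x in thresholds:
        # Percentage of bases in the interval with depth greater than X.
        above = sum(1 for d in depths if d > x)
        stats[f"pct_above_{x}"] = 100.0 * above / len(depths)
    return stats

print(interval_stats([10, 20, 30, 40, 50]))
```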
By default, if an interval has a total coverage of 0, then the record is written to the output file. To filter out intervals with zero coverage, set vc-emit-zero-coverage-intervals to false in the configuration file.
If --qc-coverage-region-i-thresholds is not set, the thresholds default to 5, 15, 20, 30, 50, 100, 200, 300, 400, 500, 1000.
The following is an example of the contents of the _cov_report.bed file:
The read_cov_report report generates a _read_cov_report.bed file in a tab-delimited format. The first five columns are the chrom, start, end, name, and gene_id BED fields. The following additional columns represent statistics that are calculated over the interval region specified on the same record line.
total_cvg---The total coverage value.
read1_cvg---The total Read 1 coverage value.
read2_cvg---The total Read 2 coverage value.
If an alignment overlaps more than one region, the alignment is counted toward the region with the largest overlap. If the alignment overlaps equally with more than one region, the alignment is counted toward the first intersecting region.
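The assignment rule can be sketched as a single pass over the regions. This is an illustration with a hypothetical function name, assuming half-open coordinates on one contig and regions listed in file order.

```python
def assign_region(aln_start, aln_end, regions):
    """Return the index of the region an alignment is counted toward, or None."""
    best_index, best_overlap = None, 0
    for index, (start, end) in enumerate(regions):
        overlap = min(aln_end, end) - max(aln_start, start)
        # Strict '>' keeps the first region when overlaps are equal.
        if overlap > best_overlap:
            best_index, best_overlap = index, overlap
    return best_index

regions = [(0, 100), (100, 200)]
print(assign_region(50, 150, regions))  # equal overlap with both regions
print(assign_region(80, 150, regions))  # larger overlap with the second region
```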
The following shows the contents of the _read_cov_report.bed file:
Callability is defined as the fraction of positions in the genome or target region having a gVCF PASSing genotype call. The callability report can be interpreted as the fraction of sites in the genome or target BED where the small variant caller had sufficient information (enough good quality reads) to confidently call either a variant or a HOM-REF region.
The callability report requires DRAGEN to be run in gVCF mode. When gVCF mode is enabled, DRAGEN will automatically generate a callability report as part of variant caller metrics.
The following criteria are used to calculate callability metrics:
Callability is calculated over all positions included in the gVCF.
Decoy contigs are ignored.
Unplaced and unlocalized contigs are ignored.
Masked regions in the FASTA (bases set to N) are ignored.
For regions where no variant calling was performed, callability is 0.
A homozygous deletion counts as a PASSing genotype call for all the reference positions spanned by the deletion.
If the --vc-target-bed option is specified, the output is a target_bed_callability.bed file that contains the overall and autosome callability over the input target BED region. The padding size specified by the --vc-target-bed-padding option is used and overlapping regions are merged.
Callability can also be output over custom regions. If the --qc-coverage-region-i option is used with --qc-coverage-reports-i (where i is 1, 2, or 3), callability can be added as a report type for that region. The output is a qc-coverage-region-i_callability.bed file. For each specified qc-coverage-region-i file, the average callability is reported in the variant calling metrics file. The padding size specified by the --qc-coverage-region-padding-i option is used and overlapping regions are merged.
The optional min MAPQ and min BQ filters only influence read and base counting and do not influence the callability reports. The callability reports depend only on the gVCF PASS variants.
The following table shows which outputs are generated when default options (--vc-target-bed) versus optional coverage region options (--coverage-region) are used.
The GC bias report provides information on GC content and the associated read coverage across a genome. The DRAGEN GC bias metric is modeled after the Picard implementation and adapted to preexisting internal measures. The DRAGEN GC bias correction module attempts to correct these biases following the target count stage. For more information, see GC Bias Correction.
The GC bias metric is computed as follows.
Calculates GC content using a 100 bp wide, per-base rolling window over all chromosomes in the reference genome, excluding any decoys and alternate contigs. Windows containing more than four masked (N) bases in the reference are discarded.
Calculates the average coverage for each window, excluding any non-PF, duplicate, secondary, and supplementary reads.
Calculates the average global coverage across the whole genome.
Groups valid windows based on the percentage of GC content, both at individual percentages and five 20% ranges as summary.
Calculates the normalized coverage for each group by dividing the average coverage for the bin by the global average coverage across the genome. Values below 1.0 indicate a lower than expected coverage at the given GC percent or range. Coverages significantly deviating from 1.0 at greater GC values are an expected result.
Calculates dropout metrics as the sum of all positive values of (percentage of windows at GC X minus percentage of aligned reads at GC X), summed over GC ≤ 50% for AT dropout and over GC > 50% for GC dropout.
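The windowing and normalization steps above can be sketched as follows. This is a simplified illustration, not the DRAGEN or Picard implementation: GC percent is taken over the full window length, the global average is taken over the supplied windows rather than the whole genome, and the function names are hypothetical.

```python
def gc_windows(seq, window=100, max_n=4):
    """Yield (start, gc_percent) for each valid per-base rolling window."""
    seq = seq.upper()
    for start in range(len(seq) - window + 1):
        bases = seq[start:start + window]
        # Discard windows with more than max_n masked (N) bases.
        if bases.count("N") > max_n:
            continue
        gc = bases.count("G") + bases.count("C")
        yield start, round(100 * gc / window)

def normalized_coverage(window_gc, window_cov):
    """Map GC percent -> (mean coverage of its windows) / (global mean coverage)."""
    global_mean = sum(window_cov) / len(window_cov)
    sums, counts = {}, {}
    for gc, cov in zip(window_gc, window_cov):
        sums[gc] = sums.get(gc, 0.0) + cov
        counts[gc] = counts.get(gc, 0) + 1
    return {gc: (sums[gc] / counts[gc]) / global_mean for gc in sums}

# Tiny window size used only to keep the illustration readable.
print(list(gc_windows("GGCCAATT", window=4)))
print(normalized_coverage([40, 40, 60], [10, 10, 40]))
```

A normalized value below 1.0 for a GC bin indicates lower than expected coverage at that GC percentage.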
By default, the GC bias metric report is not calculated. To enable GC Bias calculations, enter the --gc-metrics-enable command line option. The following is an example command:
$ dragen -b <BAM file> -r <reference genome> --gc-metrics-enable=true
The GC metrics report generates a gc_metrics.csv file. The file is structured as follows.
The GC bias report also supports the following command line options, but using them is not recommended.
| Setting | Description |
|:---|:---|
| --gc-metrics-window-size | Overrides the default rolling window size of 100 bp. |
| --gc-metrics-num-bins | Overrides the number of summary bins. |
In somatic mode, DRAGEN automatically generates a somatic callable regions report as a BED file. In tumor-normal mode, the report includes all regions where tumor coverage is at least the tumor threshold and normal coverage is at least the normal threshold. If only the tumor sample is provided, the report includes all regions where tumor coverage is at least the tumor threshold. Each line in the BED output file is formatted as follows.
chromosome region_start region_end
You can specify the threshold values using the --vc-callability-tumor-thresh or --vc-callability-normal-thresh options. The default value for the tumor threshold is 15. The default value for the normal threshold is 5. For more information on each option, see Somatic Mode Options.
If the target BED or the --qc-coverage-region-i option (where i is 1, 2, or 3) is included in the run, DRAGEN generates corresponding somatic callable regions BED files in addition to the whole genome somatic callable regions BED file.
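The thresholding rule for callable regions can be illustrated with a short sketch. This is not how DRAGEN computes the report internally; the function `callable_regions` and its per-base coverage lists are hypothetical inputs chosen for illustration. The defaults mirror the documented thresholds (15 for tumor, 5 for normal), and the output uses standard BED coordinates (0-based start, exclusive end).

```python
def callable_regions(chrom, tumor_cov, normal_cov=None,
                     tumor_thresh=15, normal_thresh=5):
    """Return a list of (chrom, start, end) BED-style intervals.

    tumor_cov / normal_cov: per-base coverage lists for one contig.
    Pass normal_cov=None for a tumor-only run.
    """
    regions = []
    start = None  # start of the currently open callable interval, if any
    for pos, t_cov in enumerate(tumor_cov):
        ok = t_cov >= tumor_thresh and (
            normal_cov is None or normal_cov[pos] >= normal_thresh
        )
        if ok and start is None:
            start = pos  # open a new interval
        elif not ok and start is not None:
            regions.append((chrom, start, pos))  # close the interval
            start = None
    if start is not None:
        regions.append((chrom, start, len(tumor_cov)))
    return regions


tumor = [20, 20, 10, 15, 15, 30]
normal = [5, 4, 9, 9, 9, 0]
for chrom, start, end in callable_regions("chr1", tumor, normal):
    print(f"{chrom}\t{start}\t{end}")
```

In the tumor-normal case, position 1 is excluded because normal coverage (4) is below 5, and position 5 is excluded because normal coverage is 0 even though tumor coverage is high. Dropping the `normal` argument shows the tumor-only behavior.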
The duration metrics section includes a breakdown of the run duration for each process. For example, the following metrics are generated for the mapper and variant caller pipeline:
Time loading reference
Time aligning reads
Time sorting and marking duplicates
Time DRAGStr calibration
Time partial reconfiguration
Time variant calling
Total run time
| Setting | Description |
|---|---|
| qc-contam-min-cov | The minimum read coverage required for a pileup to be used in contamination detection (default is 10). Lower coverage may produce unreliable results. |
| qc-contam-min-valid-read-ratio | The minimum ratio of valid reads in a pileup for it to be considered valid. The default setting is 0.95, meaning 95% of the reads in a pileup must be valid. This value may be lowered to 0.75 and still yield accurate contamination estimates. If many reads are classified as invalid, it is likely due to untrimmed adapters that are being systematically soft clipped. It is recommended to fix the BAM file rather than force the contamination module to use these reads. |
| DRAGEN output | Description |
|---|---|
| DRAGEN SNV VCF INFO DP field | The SNV VCF INFO DP field is computed after excluding unmapped reads, secondary alignments, bases with BQ<10, and bad quality reads (badly mated reads and reads with bad CIGARs). It is generally equal to or lower than the coverage reported in the fine_hist or other coverage reports. It is also expected to be lower than the unfiltered coverage track reported in IGV. |
| DRAGEN SNV VCF FORMAT DP field | The SNV VCF FORMAT DP field is similar to the INFO DP field, but it also excludes non-informative reads that match more than one haplotype equally well. In general, the following pattern is expected: FORMAT DP <= INFO DP <= per position coverage in the full_res report. |
| Input bases divided by reference genome size. | Available in the mapping_metrics.csv file. This metric is a useful indication of the raw sequencing coverage. All primary reads (including duplicates) are considered. Secondary and supplementary alignments are ignored, but no other filters are applied. |
| Autosomal Median Coverage | Available in ploidy_estimation_metrics.csv. This is an internal development metric that makes various assumptions about which regions are treated as callable. This metric is not consistent with "Median autosomal coverage over genome" in wgs_coverage_metrics.csv. It is not recommended for any QC. |
| Report Name | Output File | Notes |
|---|---|---|
| Coverage metrics | _coverage_metrics.csv | Important coverage summary statistics. On by default. |
| Fine histogram coverage | _fine_hist.csv | Detailed coverage histogram. On by default. |
| Histogram coverage | _hist.csv | Binned coverage histogram. On by default. |
| Overall mean coverage | _overall_mean_cov.csv | Redundant subset of information available in _coverage_metrics.csv. On by default. |
| Per contig mean coverage | _contig_mean_cov.csv | On by default. |
| Read-level coverage report | _read_cov_report.bed | On by default. |
| Basepair full resolution | _full_res.bed | Optionally enabled with custom reports. |
| Per BED region cov_report | _cov_report.bed | Optionally enabled with custom reports. |
| GVCF callability | _callability.bed | Optionally enabled with custom reports. |
| Optional Report type | Enabled with |
|---|---|
| Basepair full resolution | --qc-coverage-region-1=BED_FILE_PATH --qc-coverage-reports-1 full_res |
| Per BED region cov_report | --qc-coverage-region-1=BED_FILE_PATH --qc-coverage-reports-1 cov_report |
| GVCF callability | --qc-coverage-region-1=BED_FILE_PATH --qc-coverage-reports-1 callability |
| Filtering rules | Description |
|---|---|
| Handling of overlapping mates | By default, overlapping mates are double counted. Set --qc-coverage-ignore-overlaps=true to resolve all of the alignments for each fragment and avoid double-counting any overlapping bases. This might result in marginally longer run times. This option also requires setting --enable-map-align=true. --qc-coverage-ignore-overlaps is a global setting and updates all qc-coverage reports. |
| Soft-clipped bases | By default, soft-clipped bases are not counted towards coverage. Set --qc-coverage-count-soft-clipped-bases=true to also include those bases in the coverage calculations. --qc-coverage-count-soft-clipped-bases is a global setting and updates all qc-coverage reports. |
| MAPQ and BQ filters | The --qc-coverage-filters-i setting can be used to override the min MAPQ and BQ filters. A coverage filter is enabled by using one of the --qc-coverage-filters-i options (where i is 1, 2, or 3), in combination with the associated --qc-coverage-region-i option. The default value for --qc-coverage-filters-i is mapq<1,bq<0. The default includes all BQ, but filters reads with MAPQ=0. |
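To make the filter string concrete, the following sketch parses a value such as mapq<1,bq<0 and applies it to a single base. It is a reading of the documented default rather than DRAGEN's own parser: the helper names `parse_coverage_filters` and `counts_toward_coverage` are hypothetical, and the interpretation (a read or base is excluded when its value is strictly below the stated number) is inferred from the statement that the default filters reads with MAPQ=0 and includes all BQ.

```python
def parse_coverage_filters(spec):
    """Parse a filter string like 'mapq<1,bq<0' into {'mapq': 1, 'bq': 0}."""
    thresholds = {}
    for term in spec.split(","):
        name, sep, value = term.strip().partition("<")
        if not sep:
            raise ValueError(f"expected '<' in filter term: {term!r}")
        thresholds[name.strip().lower()] = int(value)
    return thresholds


def counts_toward_coverage(read_mapq, base_quality, thresholds):
    """Return True if a base passes both filters.

    Assumption: 'mapq<N' excludes reads with MAPQ below N,
    and 'bq<N' excludes bases with base quality below N.
    """
    return (read_mapq >= thresholds.get("mapq", 0)
            and base_quality >= thresholds.get("bq", 0))


default = parse_coverage_filters("mapq<1,bq<0")
print(counts_toward_coverage(0, 30, default))  # read with MAPQ=0 is filtered
print(counts_toward_coverage(1, 0, default))   # any BQ is kept at MAPQ>=1
```

A stricter setting such as mapq<20,bq<10 would, under the same reading, count only bases with BQ of at least 10 on reads with MAPQ of at least 20.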
| Column 1 | Column 2 | Column 3 |
|---|---|---|
| Contig name | Number of bases aligned to that contig, which excludes bases from duplicate marked reads, reads with MAPQ=0, and clipped bases. | Estimated coverage, as follows: <number of bases aligned to the contig (that is, Column 2)> divided by <length of the contig or (if a target BED is used) the total length of the target region spanning that contig>. |
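As a worked example of the Column 3 calculation, the sketch below divides the aligned base count by the appropriate length. The function `estimated_coverage` and the numbers used are hypothetical and serve only to illustrate the arithmetic described above.

```python
def estimated_coverage(aligned_bases, contig_length, target_length=None):
    """Estimated per-contig coverage (Column 3).

    aligned_bases: Column 2 value (bases aligned to the contig after filtering).
    contig_length: length of the contig in bases.
    target_length: total length of the target region on this contig,
                   used instead of contig_length when a target BED is given.
    """
    denominator = target_length if target_length is not None else contig_length
    if denominator <= 0:
        raise ValueError("length must be positive")
    return aligned_bases / denominator


# Whole-contig case: 3,000,000 aligned bases over a 100,000 bp contig.
print(estimated_coverage(3_000_000, 100_000))
# Targeted case: the same bases over a 50,000 bp target region.
print(estimated_coverage(3_000_000, 100_000, target_length=50_000))
```

The same aligned base count gives a higher estimate in the targeted case because the denominator shrinks to the target region length.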
| --vc-target-bed specified? Y/N | --qc-coverage-region-i (i equal to 1, 2, or 3) specified? Y/N | Expected Output Files |
|---|---|---|
| N | N | wgs_coverage_metrics.csv, wgs_fine_hist.csv, wgs_hist.csv, wgs_overall_mean_cov.csv, wgs_contig_mean_cov.csv |
| N | Y | wgs_coverage_metrics.csv, wgs_fine_hist.csv, wgs_hist.csv, wgs_overall_mean_cov.csv, wgs_contig_mean_cov.csv. For each coverage region specified by the user: qc-coverage-region-i_coverage_metrics.csv, qc-coverage-region-i_fine_hist.csv, qc-coverage-region-i_hist.csv, qc-coverage-region-i_overall_mean_cov.csv, qc-coverage-region-i_contig_mean_cov.csv, qc-coverage-region-i_full_res.bed (if the full_res report type is requested for qc-coverage-region-i), qc-coverage-region-i_cov_report.bed (if the cov_report report type is requested for qc-coverage-region-i), qc-coverage-region-i_callability.bed (if GVCF mode is enabled and the callability or exome-callability report type is requested) |
| Y | N | wgs_coverage_metrics.csv, wgs_fine_hist.csv, wgs_hist.csv, wgs_overall_mean_cov.csv, wgs_contig_mean_cov.csv, target_bed_coverage_metrics.csv, target_bed_fine_hist.csv, target_bed_hist.csv, target_bed_overall_mean_cov.csv, target_bed_contig_mean_cov.csv, target_bed_callability.bed (if GVCF mode is enabled) |
| Y | Y | wgs_coverage_metrics.csv, wgs_fine_hist.csv, wgs_hist.csv, wgs_overall_mean_cov.csv, wgs_contig_mean_cov.csv, target_bed_coverage_metrics.csv, target_bed_fine_hist.csv, target_bed_hist.csv, target_bed_overall_mean_cov.csv, target_bed_contig_mean_cov.csv, target_bed_callability.bed (if GVCF mode is enabled). For each coverage region specified by the user: qc-coverage-region-i_coverage_metrics.csv, qc-coverage-region-i_fine_hist.csv, qc-coverage-region-i_hist.csv, qc-coverage-region-i_overall_mean_cov.csv, qc-coverage-region-i_contig_mean_cov.csv, qc-coverage-region-i_full_res.bed (if the full_res report type is requested for qc-coverage-region-i), qc-coverage-region-i_cov_report.bed (if the cov_report report type is requested for qc-coverage-region-i), qc-coverage-region-i_callability.bed (if GVCF mode is enabled and the callability or exome-callability report type is requested) |