Output Files
The following section describes the outputs produced by DRAGEN Array.
CNV VCF File
DRAGEN Array produces one CNV variant call file (VCF) (*.cnv.vcf) per sample. The file reports the copy number (CN) status at the gene and sub-gene level, along with the CN events for PGx targets.
The CNV VCF output file follows the standard VCF format. The QUAL field reports the CNV call quality as a Phred-scaled score that ranges from 0 to 60 (capped). Low-quality calls (QUAL < 7) are flagged by the Q7 filter. Low-quality samples, with LogRDev greater than the 0.2 threshold, are flagged with the SampleQuality filter.
The CNV VCF files are bgzipped (Block GZIP) by default and have the ".gz" extension. The compression saves storage space and, when the file is indexed with the TBI Index File, enables efficient lookup. To view these files as plain text, uncompress them with bgzip (distributed with HTSlib from the Samtools project) or another third-party tool. The CNV VCF must be bgzipped and indexed to be used in downstream DRAGEN Array commands, such as star allele calling.
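Because a bgzipped file is a series of concatenated gzip blocks, it can also be read directly by tools that understand gzip. The following is a minimal Python sketch, not part of DRAGEN Array, that reads a bgzipped VCF and separates header lines from records. The function names are illustrative, and the in-memory demo data stands in for a real *.cnv.vcf.gz file.

```python
import gzip
import io
from typing import BinaryIO, Iterator, List, Tuple, Union


def iter_vcf_lines(source: Union[str, BinaryIO]) -> Iterator[str]:
    """Yield text lines from a bgzipped (or plain gzip) VCF path or file object."""
    # gzip.open reads concatenated gzip members as one continuous stream,
    # which is how BGZF blocks are laid out.
    with gzip.open(source, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def split_header_and_records(source: Union[str, BinaryIO]) -> Tuple[List[str], List[str]]:
    """Separate '#'-prefixed header lines from data records."""
    header: List[str] = []
    records: List[str] = []
    for line in iter_vcf_lines(source):
        (header if line.startswith("#") else records).append(line)
    return header, records


# Demo: two concatenated gzip members stand in for a real bgzipped VCF.
demo = io.BytesIO(
    gzip.compress(b"##fileformat=VCFv4.1\n")
    + gzip.compress(b"#CHROM\tPOS\n1\t100\n")
)
header, records = split_header_and_records(demo)
```

Passing a file path instead of the in-memory object works the same way. Random access by genomic region still requires the TBI index and a tabix-aware tool.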
The CNV VCF output file includes the following content.
##fileformat=VCFv4.1
##source=dragena 1.1.0
##genomeBuild=38
##reference=file:///hg38_with_alt/hg38_nochr_MT.fa
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number genotype for imprecise events. CN=5 indicates 5 or 5+">
##FORMAT=<ID=NR,Number=1,Type=Float,Description="Aggregated normalized intensity">
##ALT=<ID=CNV,Description="Copy number variant region">
##FILTER=<ID=Q7,Description="Quality below 7">
##FILTER=<ID=SampleQuality,Description="Sample was flagged as potentially low-quality due to high noise levels.">
##INFO=<ID=CNVLEN,Number=1,Type=Integer,Description="Number of bases in CNV hotspot">
##INFO=<ID=PROBE,Number=1,Type=Integer,Description="Number of probes assayed for CNV hotspot">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of CNV hotspot">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Structural Variant Type">
##CNVOverallPloidy=1.8
##CNVGCCorrect=True
##contig=<ID=1,length=248956422>
##contig=<ID=4,length=190214555>
##contig=<ID=10,length=133797422>
##contig=<ID=16,length=90338345>
##contig=<ID=19,length=58617616>
##contig=<ID=22,length=50818468>
##contig=<ID=22_KI270879v1_alt,length=304135>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 204619760001_R01C01
1 109687842 CNV:GSTM1:chr1:109687842:109693526 N <CNV> 60 PASS CNVLEN=5685;PROBE=124;END=109693526;SVTYPE=CNV CN:NR 2:0.966631132771593
4 68537222 CNV:UGT2B17:chr4:68537222:68568499 N <CNV> 60 PASS CNVLEN=31278;PROBE=383;END=68568499;SVTYPE=CNV CN:NR 0:0.376696837881692
10 133527374 CNV:CYP2E1:chr10:133527374:133539096 N <CNV> 60 PASS CNVLEN=11723;PROBE=194;END=133539096;SVTYPE=CNV CN:NR 2:0.980059731860893
16 28615068 CNV:SULT1A1:chr16:28603587:28613544 N <CNV> 57 PASS CNVLEN=8315;PROBE=164;END=28623382;SVTYPE=CNV CN:NR 2:0.980552325552963
19 40844791 CNV:CYP2A6.intron.7:chr19:40844791:40845293 N <CNV> 60 PASS CNVLEN=503;PROBE=38;END=40845293;SVTYPE=CNV CN:NR 2:0.9663775484762
19 40850267 CNV:CYP2A6.exon.1:chr19:40850267:40850414 N <CNV> 60 PASS CNVLEN=148;PROBE=21;END=40850414;SVTYPE=CNV CN:NR 2:0.9663775484762
22 42126498 CNV:CYP2D6.exon.9:chr22:42126498:42126752 N <CNV> 48 PASS CNVLEN=255;PROBE=370;END=42126752;SVTYPE=CNV CN:NR 2:0.981703411438716
22 42129188 CNV:CYP2D6.intron.2:chr22:42129188:42129734 N <CNV> 10 PASS CNVLEN=547;PROBE=333;END=42129734;SVTYPE=CNV CN:NR 2:0.965498002434641
22 42130886 CNV:CYP2D6.p5:chr22:42130886:42131379 N <CNV> 60 PASS CNVLEN=494;PROBE=172;END=42131379;SVTYPE=CNV CN:NR 2:0.970341562236357
22_KI270879v1_alt 270316 CNV:GSTT1:chr22_KI270879v1_alt:270316:278477 N <CNV> 60 PASS CNVLEN=8162;PROBE=91;END=278477;SVTYPE=CNV CN:NR 2:1.01191145130511
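As an illustration of how the record fields fit together, the sketch below parses one data line from the listing above into a dictionary. It is a hedged example rather than a DRAGEN Array utility: the function name is hypothetical, it assumes a single-sample file, and the error-probability conversion assumes QUAL follows the usual Phred convention.

```python
from typing import Any, Dict


def parse_cnv_record(line: str) -> Dict[str, Any]:
    """Parse one single-sample CNV VCF data line into a dictionary."""
    # split() accepts the tabs of a real VCF and the spaces shown in the listing.
    chrom, pos, vid, _ref, _alt, qual, flt, info, fmt, sample = line.split()
    info_map = dict(item.split("=", 1) for item in info.split(";"))
    sample_map = dict(zip(fmt.split(":"), sample.split(":")))
    q = float(qual)
    return {
        "chrom": chrom,
        "start": int(pos),
        "end": int(info_map["END"]),
        "id": vid,
        "qual": q,
        # Phred scale: Q = -10 * log10(P(error)), so P = 10 ** (-Q / 10).
        "error_prob": 10 ** (-q / 10),
        "filters": [] if flt == "PASS" else flt.split(";"),
        "probes": int(info_map["PROBE"]),
        "cnvlen": int(info_map["CNVLEN"]),
        "cn": int(sample_map["CN"]),
        "nr": float(sample_map["NR"]),
    }


record = parse_cnv_record(
    "4 68537222 CNV:UGT2B17:chr4:68537222:68568499 N <CNV> 60 PASS "
    "CNVLEN=31278;PROBE=383;END=68568499;SVTYPE=CNV CN:NR 0:0.376696837881692"
)
```

For this UGT2B17 record, the parsed copy number is 0 and the filter list is empty because the call passed all filters.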
SNV VCF File
The software produces one genotyping variant call file (*.snv.vcf) per sample, covering single nucleotide variants (SNVs) and indels. It reports the GenCall score (GS), B Allele Frequency (BAF), and Log R Ratio (LRR) per variant. The output follows the VCFv4.1 format.
Some additional details:
Genotypes are adjusted to reflect the sample ploidy. Calls are haploid for loci on Y, MT, and non-PAR chromosome X for males.
Multiple SNPs in the input manifest that map to the same chromosomal coordinate (e.g. tri-allelic loci or duplicated sites) are collapsed into one VCF entry, and a combined genotype is generated. To produce the combined genotype, the set of all possible genotypes is enumerated based on the queried alleles. Genotypes that are not possible given the called alleles and assay design limitations (e.g. Infinium II designs cannot distinguish between A/T and C/G calls) are filtered out. If exactly one consistent genotype remains after filtering, the site is assigned this genotype. Otherwise, the genotype is ambiguous (more than one consistent genotype remains) or inconsistent (none remains), and a no-call is returned.
Certain SNV and indel calls can be omitted from the VCF. Skipped data can include unmapped loci, intensity-only probes used for CNV identification, and indels that do not map back to the genome. See Warning/Error Messages and Logs for messages that DRAGEN Array Local may report about the skipped data.
The BAF and LRR are oriented with Ref as A and Alt as B relative to the reference genome, while GS is agnostic to the reference genome. Users familiar with GenomeStudio may observe BAF and LRR reported in the VCF as 1 minus the value reported in GenomeStudio depending on the Ref Alt allele orientation with the reference genome. GenomeStudio reports these values based on the information in the manifest without knowledge of the reference genome.
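The enumerate-and-filter approach described above for combining probes at the same coordinate can be sketched as follows. This is a simplified model for illustration only, not the DRAGEN Array implementation: it assumes diploid calls, represents each probe as a pair of allele sets (queried, called), treats a probe as seeing only the alleles it queries, and ignores assay design limitations such as the Infinium II A/T and C/G restriction.

```python
from itertools import combinations_with_replacement
from typing import List, Optional, Set, Tuple

# (alleles the probe queries, alleles the probe called); a hypothetical model.
Probe = Tuple[Set[str], Set[str]]


def combine_genotypes(probes: List[Probe]) -> Optional[Tuple[str, str]]:
    """Return the single diploid genotype consistent with every probe, else None."""
    queried = sorted(set().union(*(q for q, _ in probes)))
    consistent = []
    # Enumerate every unordered diploid genotype over the queried alleles.
    for genotype in combinations_with_replacement(queried, 2):
        # Keep the genotype only if each probe would have reported its call.
        if all(set(genotype) & q == called for q, called in probes):
            consistent.append(genotype)
    # Exactly one survivor is the combined call. Zero (inconsistent) or
    # several (ambiguous) survivors give a no-call, represented as None.
    return consistent[0] if len(consistent) == 1 else None


# A tri-allelic site covered by an A/C probe calling AA and an A/G probe calling AG.
resolved = combine_genotypes([({"A", "C"}, {"A"}), ({"A", "G"}, {"A", "G"})])

# Two probes for the same A/T variant that disagree (AA versus TT).
conflict = combine_genotypes([({"A", "T"}, {"A"}), ({"A", "T"}, {"T"})])
```

In this model the first site resolves to A/G, while the conflicting probes leave no consistent genotype and produce a no-call.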
The SNV VCF files are bgzipped (Block GZIP) by default and have the ".gz" extension. The compression saves storage space and, when the file is indexed with the TBI Index File, enables efficient lookup. To view these files as plain text, uncompress them with bgzip (distributed with HTSlib from the Samtools project) or another third-party tool. The SNV VCF must be bgzipped and indexed to be used in downstream DRAGEN Array commands, such as star allele calling.
The SNV VCF output file includes the following content. The last row shows an example variant call.
##fileformat=VCFv4.1
##source=dragena 1.1.0
##genomeBuild=38
##reference=file:///genomes/38/genome.fa
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GS,Number=1,Type=Float,Description="GenCall score. For merged multi-assay or multi-allelic records, min GenCall score is reported.">
##FORMAT=<ID=BAF,Number=1,Type=Float,Description="B Allele Frequency">
##FORMAT=<ID=LRR,Number=1,Type=Float,Description="LogR ratio">
##contig=<ID=1,length=248956422>
##contig=<ID=2,length=242193529>
##contig=<ID=3,length=198295559>
##contig=<ID=4,length=190214555>
##contig=<ID=5,length=181538259>
##contig=<ID=6,length=170805979>
##contig=<ID=7,length=159345973>
##contig=<ID=8,length=145138636>
##contig=<ID=9,length=138394717>
##contig=<ID=10,length=133797422>
##contig=<ID=11,length=135086622>
##contig=<ID=12,length=133275309>
##contig=<ID=13,length=114364328>
##contig=<ID=14,length=107043718>
##contig=<ID=15,length=101991189>
##contig=<ID=16,length=90338345>
##contig=<ID=17,length=83257441>
##contig=<ID=18,length=80373285>
##contig=<ID=19,length=58617616>
##contig=<ID=20,length=64444167>
##contig=<ID=21,length=46709983>
##contig=<ID=22,length=50818468>
##contig=<ID=MT,length=16569>
##contig=<ID=X,length=156040895>
##contig=<ID=Y,length=57227415>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 202937470021_R06C01
1 2290399 rs878093 G A . PASS . GT:GS:BAF:LRR 0/1:0.7923:0.50724137:0.14730307
Genotype Call (GTC) File
The genotype call algorithm produces one genotype call file (.gtc) per sample analyzed. The Genotype Call (GTC) file contains the small variant (SNV and indel) genotype for each marker specified by the product, along with sample quality metrics. Marker locations are not included and must be extracted from the manifest file. The GTC file uses a proprietary binary format that can be parsed with the Illumina open-source tool BeadArray Library File Parser.
BedGraph File
The BedGraph file contains the log R ratios from the genotyping algorithm for use in visual tools.
Star Allele CSV File
The Star Allele CSV file is an intermediate file generated by the star-allele call command and serves as the input to the star-allele annotate command. It contains all the star allele calls for all samples in a run. Each row in the file provides either a star allele diplotype or simple variant call for a PGx-related gene. Star allele diplotype calls for a sample and a gene may span multiple lines where alternative solutions can be listed.
The Star Allele CSV file also contains meta information marked by # at the top of the file for the genome build and PGx database used for the star allele calling.
The star_allele.csv file contains the following details per sample:
Field | Description |
---|---|
Sample | Sentrix barcode and position of the sample. |
Rank | Rank of a single star allele solution for a gene. The top solution based on quality score is ranked as 1 with the alternative solutions ranked lower. |
Gene or Variant | The gene symbol, or gene symbol plus rsID for variants. |
Type | ‘Haplotype’ (star allele) or ‘Variant’ PGx calling type. |
Solution | Star allele or variant solution. If diploid, variant solutions have the format of Allele1/Allele2. |
Solution Long | Long format solution for star alleles. The field has the following format: Structural Variant Type: Underlying Star allele. An example of a long solution is: Complete: CYP2D6*4, Complete: CYP2D6*10, CYP2D6*68: CYP2D6*4, where there are two complete alleles that have CYP2D6*4 and CYP2D6*10 haplotypes and one CYP2D6*68 structural variant that has a CYP2D6*4 haplotype configuration. |
Supporting Variants | All variants present in the array that support the star allele solution. The field has the following format: Long Solution Star Allele: (Supporting Variants). Each supporting variant is listed with essential information extracted from the SNV VCF to assist with troubleshooting, including Chromosome, Location, Reference allele, Alternative allele, Genotype, GenCall score (GS), and B-allele frequency (BAF). |
Missing/Masked Core Variants | All variants not present in the array or not called in the SNV VCF file for the star allele. The field has the following format: Long Solution Star-Allele: (Missing Variants). |
All Missing Variants in Array | All core definition variants that are not on the array or are not called in the SNV VCF along with the associated star alleles that are impacted. The field has the following format: Missing Variant: (List of impacted star alleles). |
Collapsed Star-Alleles | Star alleles that cannot be distinguished from the solution star allele given the input array’s content. The field has the following format: Long Solution Star-Allele: (List of collapsed star alleles). The most frequent star allele based on the population frequency of PGx alleles will be the star allele in the solution. |
Score | Quality score of the solution including the population frequency of PGx alleles. The score ranges from 0 to 1. |
Raw Score | Raw quality score of the solution without including the population frequency of PGx alleles. The score ranges from 0 to 1. |
Copy Number Solution | Estimated copy number for each gene region. The field has the following format: Gene Region: Copy Number. |
Below is an example of the first 5 columns from a star allele CSV file:
Sample,Rank,Gene or Variant,Type,Solution
204650490282_R02C01,1,CYP2C9,Haplotype,*9/*11
204650490282_R02C01,1,CYP2C19,Haplotype,*2/*10
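A file laid out this way can be read with standard CSV tooling once the # meta lines are set aside. The sketch below is one hedged way to do that and to pick the rank 1 solution per sample and gene. The function names are illustrative, and the meta line in the example text is a placeholder, since the exact wording of the real meta lines is not shown here.

```python
import csv
import io
from typing import Dict, List, Tuple

EXAMPLE = """\
#placeholder meta line (real files list the genome build and PGx database)
Sample,Rank,Gene or Variant,Type,Solution
204650490282_R02C01,1,CYP2C9,Haplotype,*9/*11
204650490282_R02C01,1,CYP2C19,Haplotype,*2/*10
"""


def read_star_allele_csv(text: str) -> Tuple[List[str], List[Dict[str, str]]]:
    """Split '#'-prefixed meta lines from the CSV table and parse the rows."""
    lines = text.splitlines()
    meta = [ln for ln in lines if ln.startswith("#")]
    body = [ln for ln in lines if ln and not ln.startswith("#")]
    rows = list(csv.DictReader(io.StringIO("\n".join(body))))
    return meta, rows


def top_solutions(rows: List[Dict[str, str]]) -> Dict[Tuple[str, str], str]:
    """Map (sample, gene or variant) to its rank 1 solution."""
    return {
        (row["Sample"], row["Gene or Variant"]): row["Solution"]
        for row in rows
        if row["Rank"] == "1"
    }


meta, rows = read_star_allele_csv(EXAMPLE)
best = top_solutions(rows)
```

Rows with a rank other than 1 are alternative solutions and are left out of the mapping.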
Genotype Summary Files
The software produces genotype summary files (gt_sample_summary.csv and gt_sample_summary.json) that contain the following details per sample:
Sample ID
Sample Name
Sample Folder
Autosomal Call Rate
Call Rate
Log R Ratio Std Dev
Sex Estimate
TGA_Ctrl_5716 Norm R
The TGA_Ctrl_5716 Norm R field is specific to PGx products (e.g., Global Diversity Array with enhanced PGx). The field value is the Normalized R value of one probe and serves as an assay control: a value < 1 indicates that the sample failed the TGA (Targeted Gene Amplification) process. If the product does not have this probe, the field is not included in the gt_sample_summary files.
Final Report
DRAGEN Array Cloud produces a Final Report (gtc_final_report.csv) per analysis batch similar to the one available in GenomeStudio. It contains the following details per locus per sample:
Field | Description |
---|---|
SNP Name | SNP identifier. |
SNP | SNP alleles as reported by assay probes. Alleles on the Design strand (the ILMN strand) are listed in order of Allele A/B. |
Sample ID | Sample identifier. |
Allele 1 – Top | Allele 1 corresponds to Allele A and is reported on the Top strand. |
Allele 2 – Top | Allele 2 corresponds to Allele B and is reported on the Top strand. |
Allele 1 – Forward | Allele 1 corresponds to Allele A and is reported on the Forward strand. |
Allele 2 – Forward | Allele 2 corresponds to Allele B and is reported on the Forward strand. |
Allele 1 – Plus | Allele 1 corresponds to Allele A and is reported on the Plus strand. |
Allele 2 – Plus | Allele 2 corresponds to Allele B and is reported on the Plus strand. |
GC Score | Quality metric calculated for each genotype (data point), and ranges from 0 to 1. |
GT Score | The SNP cluster quality. Score for a SNP from the GenTrain clustering algorithm. |
Log R Ratio | Base-2 log of the normalized R value over the expected R value for the theta value (interpolated from the R-values of the clusters). For loci categorized as intensity only, the value is adjusted so that the expected R value is the mean of the cluster. |
B Allele Freq | B allele frequency for this sample, interpolated from the known B allele frequencies of the 3 canonical clusters (0, 0.5, and 1). The value is 1 if the sample theta is equal to or greater than the theta mean of the BB cluster. B Allele Freq is between 0 and 1, or set to NaN for loci categorized as intensity only. |
Chr | Chromosome containing the SNP. |
Position | SNP chromosomal position. |
Note: Analyses on products with large numbers of loci (>1 million) and large numbers of samples (>100) yield a large (50+ gigabyte) Final Report that is difficult to download and review. If large batches are desired, it is recommended to create analysis configurations that do not produce this report.
For more information on interpreting DNA strand and allele information, see Illumina Knowledge article How to interpret DNA strand and allele information for Infinium genotyping array data.
Locus Summary
DRAGEN Array Cloud produces a Locus Summary (locus_summary.csv) per analysis batch similar to the one available in GenomeStudio. It contains the following details per locus:
Field | Description |
---|---|
Locus_Name | Locus name from the manifest file. |
Illumicode_Name | Locus ID from the manifest file. |
#No_Calls | Number of loci with GenCall scores below the call region threshold. |
#Calls | Number of loci with GenCall scores above the call region threshold. |
Call_Freq | Call frequency or call rate calculated as follows: #Calls/(#No_Calls + #Calls) |
A/A_Freq | Frequency of homozygote allele A calls. |
A/B_Freq | Frequency of heterozygote calls. |
B/B_Freq | Frequency of homozygote allele B calls. |
Minor_Freq | Frequency of the minor allele. |
Gentrain_Score | Quality score for samples clustered for this locus. |
50%_GC_Score | 50th percentile GenCall score for all samples. |
10%_GC_Score | 10th percentile GenCall score for all samples. |
Het_Excess_Freq | Heterozygote excess frequency, calculated as (Observed -Expected)/Expected for the heterozygote class. If $f_{ab}$ is the heterozygote frequency observed at a locus, and p and q are the major and minor allele frequencies, then het excess calculation is the following: $(f_{ab} - 2pq)/(2pq + \varepsilon)$ |
ChiTest_P100 | Hardy-Weinberg p-value estimate calculated using genotype frequency. The value is calculated with 1 degree of freedom and is normalized to 100 individuals. |
Cluster_Sep | Cluster separation score. |
AA_T_Mean | Normalized theta angles mean for the AA genotype. |
AA_T_Std | Normalized theta angles standard deviation for the AA genotype. |
AB_T_Mean | Normalized theta angles mean for the AB genotype. |
AB_T_Std | Standard deviation of the normalized theta angles for the AB genotype. |
BB_T_Mean | Normalized theta angles mean for the BB genotypes. |
BB_T_Std | Standard deviation of the normalized theta angles for the BB genotypes. |
AA_R_Mean | Normalized R value mean for the AA genotypes. |
AA_R_Std | Standard deviation of the normalized R value for the AA genotypes. |
AB_R_Mean | Normalized R value mean for the AB genotypes. |
AB_R_Std | Standard deviation of the normalized R value for the AB genotypes. |
BB_R_Mean | Normalized R value mean for the BB genotypes. |
BB_R_Std | Standard deviation of the normalized R value for the BB genotypes. |
Plus/Minus Strand | Designated "+" or "-" with respect to the reference genome strand. "U" designates unknown. |
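The Call_Freq and Het_Excess_Freq formulas in the table can be worked through with a short Python sketch. The function names are illustrative, and the value of the small constant ε is not stated in this document, so the default used below is an assumption chosen only to avoid division by zero.

```python
def call_freq(n_calls: int, n_no_calls: int) -> float:
    """Call frequency: #Calls / (#No_Calls + #Calls)."""
    return n_calls / (n_no_calls + n_calls)


def het_excess(f_aa: float, f_ab: float, f_bb: float, eps: float = 1e-9) -> float:
    """Heterozygote excess: (f_ab - 2pq) / (2pq + eps).

    eps is an assumed value; the software's actual constant is not documented here.
    """
    # Allele frequencies implied by the observed genotype frequencies.
    freq_a = f_aa + f_ab / 2
    freq_b = f_bb + f_ab / 2
    # 2pq is symmetric, so which allele is major or minor does not matter.
    expected_het = 2 * freq_a * freq_b
    return (f_ab - expected_het) / (expected_het + eps)


# Genotype frequencies in Hardy-Weinberg proportions give an excess near 0.
balanced = het_excess(0.25, 0.50, 0.25)
# A locus where every sample is heterozygous gives an excess near 1.
all_het = het_excess(0.0, 1.0, 0.0)
```

A positive value indicates more heterozygotes than expected from the allele frequencies, and a negative value indicates fewer.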
CN Summary File
The CN summary file contains key statistics for each sample in a batch. It includes the following details per sample:
Sample ID
Sample Name
Sample Folder
Copy Number Batch File
The copy number batch summary file (cn_batch_summary.csv) shows the total copy number gain, loss, and neutral (CN=2) values for each target region across all the samples in the analysis.
Example copy number batch summary file content:
Target Region,Total CN gain,Total CN loss,Total CN neutral
CYP2A6.exon.1,0,1,47
CYP2A6.intron.7,0,1,47
CYP2D6.exon.9,2,4,42
CYP2D6.intron.2,7,2,39
CYP2D6.p5,13,2,33
CYP2E1,2,0,46
GSTM1,0,42,6
GSTT1,0,33,15
SULT1A1,0,0,48
UGT2B17,0,34,14
All Target Regions,24,119,337
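The All Target Regions row is the column-wise sum of the individual target regions. The following sketch, which is illustrative and not a DRAGEN Array tool, recomputes those totals from the example content above.

```python
import csv
import io
from typing import Dict

SUMMARY = """\
Target Region,Total CN gain,Total CN loss,Total CN neutral
CYP2A6.exon.1,0,1,47
CYP2A6.intron.7,0,1,47
CYP2D6.exon.9,2,4,42
CYP2D6.intron.2,7,2,39
CYP2D6.p5,13,2,33
CYP2E1,2,0,46
GSTM1,0,42,6
GSTT1,0,33,15
SULT1A1,0,0,48
UGT2B17,0,34,14
All Target Regions,24,119,337
"""

COLUMNS = ["Total CN gain", "Total CN loss", "Total CN neutral"]


def column_totals(text: str) -> Dict[str, int]:
    """Sum each CN column over the individual target regions."""
    rows = list(csv.DictReader(io.StringIO(text)))
    # Leave out the summary row so that it is not counted twice.
    regions = [r for r in rows if r["Target Region"] != "All Target Regions"]
    return {col: sum(int(r[col]) for r in regions) for col in COLUMNS}


def samples_per_region(text: str) -> Dict[str, int]:
    """Gain + loss + neutral for each region (the number of samples counted)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return {
        r["Target Region"]: sum(int(r[col]) for col in COLUMNS)
        for r in rows
        if r["Target Region"] != "All Target Regions"
    }


totals = column_totals(SUMMARY)
per_region = samples_per_region(SUMMARY)
```

In this example each target region row adds up to 48, and the recomputed totals match the All Target Regions row.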
Warning/Error Messages and Logs
The following scenarios result in a warning or error message:
Manifest file used to generate GTC is not the same as the manifest file used to generate the CN model.
FASTA files and FASTA index files do not match.
For the following scenarios, the software reports messages to the terminal output (as either a warning or an error):
Indel processing for GTC to VCF conversion failed.
The input folder does not contain the required input files.
An input file is corrupt.
Examples of such notifications can include the following:
Error | Type | Cause |
---|---|---|
Failed to normalize and gencall sample: {sample_id}, it will be skipped. Error: The given key '{loci_id}' was not present in the dictionary. | Warning | This generally occurs because of a mismatch between the manifest (bpm) and cluster file (egt) (i.e., the cluster file was generated via a different manifest). To remedy the issue, use the manifest and cluster files intended for use together. |
Reference allele is not queried for locus: {identifier} | Warning | True reference allele does not match any alleles in the manifest. The error is common for MNVs and will be addressed in future versions of the software. |
Skipping non-mapped locus: {identifier} | Warning | Locus has no chromosome position (usually 0). These loci may be used for quality purposes or CNV calling only. |
Skipping intensity only locus: {identifier} | Warning | Similar to non-mapped loci, intensity only probes have applications outside creating variants for SNV VCFs such as CNV calling. |
Skipping indel: {identifier} | Warning | Indel context (deletion/insertion) could not be determined. |
Failed to process entry for record: {identifier} | Warning | Unable to determine reference allele for indel. |
Incomplete match of source sequence to genome for indel: {identifier} | Warning | Indel not properly mapped to the reference genome. |
Failed to combine genotypes due to ambiguity - exm1068284 (InfiniumII): TT, ilmnseq_rs1131690890_mnv (InfiniumII): AA, rs1131690890_mnv (InfiniumII): AA | Warning | Detailed information about a NoCall ("./.") in the VCF that results from combining multiple probes that assay the same variant but give conflicting results. The example here is two probes with homozygous REF genotypes (AA) and one probe with a homozygous ALT genotype (TT). |
Cluster file ({GTC.egt}) is not the same as CN Model Cluster file ({CN_Model.egt}). | Warning | The cluster file used to generate the GTCs for copy number calling is not the same as the one used to generate the GTCs for the copy number training that created the input CN model. Although the CNV model is robust to minor cluster file updates, CNV retraining should be considered when there are significant updates to the cluster file. To remove the warning, do one of the following: re-run copy number training with new GTCs generated during genotyping with the new cluster file, use a different CN model built with the expected cluster file, or use GTCs for copy number calling that were generated with the same cluster file used to generate the input CN model. |
{numPassingSamples} sample(s) passed QC. Requires at least {minPassingSamples} samples to proceed. | Error | CNV calling is batch dependent and requires a minimum number of high-quality samples to make accurate calls. Add more high-quality samples to the analysis batch to resolve the error. |
Invalid manifest file path {manifestPath} | Error | Application could not find manifest file provided or user error. |
Failed to load cluster file: {e.Message} | Error | Corrupted file or unsupported version. |
Star allele JSON File
The star allele JSON file is produced per sample. It contains the fields present in the star allele CSV file as well as additional metadata and annotations.
Fields included in the star allele JSON header are described below.
Field | Description |
---|---|
softwareVersion | DRAGEN Array software version, e.g. dragena 1.0.0. |
genomeBuild | Genome build, e.g. hg38. |
starAlleleDatabaseSources | Public databases with versions used as the sources of the star allele definitions and population frequencies. |
phenotypeDatabaseSources | Public databases with versions used as the sources of the star allele phenotypes. |
mappingFile | The PGx database file used for the star allele calling. |
pgxGuideline | The PGx guidelines used for metabolizer status/phenotype annotations, e.g. CPIC or DPWG. |
sampleId | Sentrix barcode and position of the sample. |
locusAnnotations | The star allele call information. |
Fields included in the star allele call (locusAnnotations) information are described below.
Field | Description |
---|---|
gene | The gene symbol. |
callType | ‘Star Allele’ or ‘Variant’ PGx calling type. |
genotype | Most likely star allele or variant solution. If diploid, variant solutions have the format of Allele1/Allele2. |
activityScore | Activity score annotation for the determined genotype of the gene, based on the public PGx guidelines CPIC or DPWG. |
phenotypeDatabaseAnnotation | Metabolizer status and function annotations of the determined genotype of the gene based on lookup into public PGx guidelines CPIC or DPWG per user choice. |
qualityScore | Quality score of the solution including the population frequency of PGx alleles. The score ranges from 0 to 1. |
rawScore | Raw quality score of the solution without including the population frequency of PGx alleles. The score ranges from 0 to 1. |
supportingVariants | All variants present in the array that support the star allele solution. The field provides an array (list) of supporting Variants. Each supporting variant is listed with essential information extracted from the SNV VCF to assist with troubleshooting, including Chromosome (chrom), Location (pos), Reference allele (ref), Alternative allele (alt), Genotype (gt), GenCall score (gs), B-allele frequency (baf), the variant ID (id), and the associated star allele IDs (alleleIds). |
candidateSolutions | The set of alternative star allele calling solutions. This field is only relevant for genes of the ‘Star Allele’ call type. |
missingVariantSites | All core variants that are not available (e.g. not on the array, or no calls in the SNV VCF) for star allele calling for this gene. For star alleles, the field provides an array (list) of variant "id" and impacted "alleleIds" pairs. |
allelesTested | Alleles that are covered by the star allele caller. The capability to call star alleles is also dependent on array content coverage and data quality. This field is defined by the array's content and will be the same across all samples. |
Fields included in the candidateSolutions section, which is only available for the star allele call type, are described below.
Field | Description |
---|---|
rank | Rank of a single star allele solution for a gene. The top solution based on quality score is ranked as 1 with the alternative solutions ranked lower. |
genotype | Star allele or variant solution. If diploid, variant solutions have the format of Allele1/Allele2. |
activityScore | Activity score annotation for the determined genotype of the gene, based on the public PGx guidelines CPIC or DPWG. |
phenotype | Metabolizer status and function annotations of the determined genotype of the gene based on lookup into public PGx guidelines CPIC or DPWG per user choice. |
qualityScore | Quality score of the solution including the population frequency of PGx alleles. The score ranges from 0 to 1. |
rawScore | Raw quality score of the solution without including the population frequency of PGx alleles. The score ranges from 0 to 1. |
alleles | The composite alleles of the candidate genotype solution. |
solutionLong | Long format solution for star alleles. The field has the following format: Structural Variant Type: Underlying Star allele. An example of a long solution is: Complete: CYP2D6*4, Complete: CYP2D6*10, CYP2D6*68: CYP2D6*4, where there are two complete alleles that have CYP2D6*4 and CYP2D6*10 haplotypes and one CYP2D6*68 structural variant that has a CYP2D6*4 haplotype configuration. |
supportingVariants | All variants present in the array that support the star allele solution. The field provides an array (list) of supporting Variants. Each supporting variant is listed with essential information extracted from the SNV VCF to assist with troubleshooting, including Chromosome (chrom), Location (pos), Reference allele (ref), Alternative allele (alt), Genotype (gt), GenCall score (gs), B-allele frequency (baf), and the variant ID (id). |
missingVariantSites | All variants not present in the array or not called in the SNV VCF file for the star allele solution. The field provides an array (list) of missing variants. |
collapsedAlleles | Star alleles that cannot be distinguished from the solution star allele given the input array’s content. The field has the following format: Long Solution Star-Allele: (List of collapsed star alleles). The most frequent star allele based on the population frequency of PGx alleles will be the star allele in the solution. |
copyNumberRegions | Gene regions for the copy numbers listed in copyNumberSolution. |
copyNumberSolution | Estimated copy number for each gene region listed in copyNumberRegions. |
Example of JSON file content:
TBI Index File
The TBI (TABIX) index file is associated with the bgzipped VCF files. It allows for data line lookup in VCF files for quick data retrieval. The format is a tab-delimited genome index file developed by Samtools as part of the HTSlib utilities. For more information, visit the Samtools website.
Methylation Control Probe Output File
The software produces a control probe output file ({BeadChipBarcode}_{Position}_ctrl.tsv.gz) per sample that includes the raw methylated and unmethylated values for each control probe.
Each control probe has an address, type, color channel, name, and probe ID. It also provides the raw signal for methylated green (MG), methylated red (MR), unmethylated green (UG) and unmethylated red (UR).
The file can help identify which probes are available on a given BeadChip.
Methylation CG Output File
The software produces a CG output file ({BeadChipBarcode}_{Position}_cgs.tsv.gz) per sample that includes beta values, m-values and detection p-values for each CG site.
Beta values measure methylation levels in a linear fashion for easy interpretation. Unmethylated probes are close to zero and methylated probes are close to 1.
M-values are log-transformed beta values, which provide a more representative measure of methylation.
Detection p-values measure the likelihood that the signal is background noise. It is recommended to exclude probes with a detection p-value > 0.05 from analysis because they are likely background noise.
For further detail on the calculation of these metrics, see the tech note High-throughput Infinium methylation array QC using DRAGEN Array Methylation QC software.
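To illustrate how these values relate, the sketch below converts a beta value to an M-value using the commonly used logit transform, M = log2(beta / (1 - beta)), and applies the recommended detection p-value cutoff. This is a hedged example: the exact calculation in the software is described in the tech note, and the function names and the clamping constant are assumptions made for illustration.

```python
import math


def beta_to_m(beta: float, eps: float = 1e-6) -> float:
    """Convert a beta value in [0, 1] to an M-value with the logit (base 2) transform."""
    # Clamp away from exactly 0 and 1 so that the logarithm stays finite.
    # eps is an assumed constant, not a documented software parameter.
    clamped = min(max(beta, eps), 1 - eps)
    return math.log2(clamped / (1 - clamped))


def keep_site(detection_p: float, max_p: float = 0.05) -> bool:
    """Keep a CG site unless its detection p-value exceeds the cutoff."""
    return detection_p <= max_p


half_methylated = beta_to_m(0.5)   # equal methylated and unmethylated signal
mostly_methylated = beta_to_m(0.8)
```

Under this transform a beta value of 0.5 maps to an M-value of 0, values above 0.5 map to positive M-values, and values below 0.5 map to negative M-values.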
Methylation Sample QC Summary Files
The software produces methylation sample QC summary in .xlsx and .tsv file formats (sample_qc_summary.xlsx and sample_qc_summary.tsv) per analysis batch, which provides per sample QC data for all samples in the batch.
The QC summary provides details on 21 control metrics (see the tables below), which are computed in the same way as in the BeadArray Controls Reporter software from Illumina. In addition, it provides average red and green raw and normalized signals, time of scanning, proportion of probes passing, overall sample pass/fail status, and the failure codes for control metrics that did not pass. A sample passes only if all 21 control metrics pass. The QC summary .xlsx file also highlights failing parameters for easy viewing.
The QC summary files contain the following fields:
Field | Description |
---|---|
Sentrix_ID | 12-digit BeadChip Barcode associated with the sample. |
Sentrix_Position | Row and column on the BeadChip, e.g. R01C01. |
Sample_ID | Optional field that can be indicated using IDAT Sample Sheet |
User Defined Meta Data | Optional field(s) that can be indicated using IDAT Sample Sheet. Any number of fields indicated will appear in this output file. |
restoration | |
staining_green, staining_red | |
extension_green, extension_red | |
hybridization_high_medium, hybridization_medium_low | |
target_removal1, target_removal2 | |
bisulfite_conversion1_green, bisulfite_conversion1_background_green, bisulfite_conversion1_red, bisulfite_conversion1_background_red | |
bisulfite_conversion2, bisulfite_conversion2_background | |
specificity1_green, specificity1_red | |
specificity2, specificity2_background | |
nonpolymorphic_green, nonpolymorphic_red | |
avg_green_raw, avg_red_raw | |
avg_green_norm, avg_red_norm | |
ScanTime | |
NProbes | |
NPassDetection | |
prop_probes_passing | |
passQC | |
failCodes | |
The control metrics in the QC summary files are calculated as follows. The default background correction offset (x) of 3,000 can be modified and applies to all background calculations indicated with (bkg + x). The table uses the default thresholds for EPIC arrays as an example; the default thresholds vary by methylation array. See the Threshold Adjustment section for additional details.
Control | Calculation | Additional Information |
---|---|---|
Restoration Green > bkg | (Green/(bkg + x)) > | |
Staining Green Biotin High > Biotin Bkg | (High/Biotin Bkg) > 5 | |
Staining Red DNP High > DNP Bkg | (High/DNP Bkg) > 5 | |
Extension Green Lowest CG/Highest AT | (C or G/A or T) > 5 | Green channel: lowest C or G intensity is used; highest A or T intensity is used. |
Extension Red Lowest AT/Highest CG | (A or T/C or G) > 5 | Red channel: lowest A or T intensity is used; highest C or G intensity is used. |
Hybridization Green High > Medium > Low | (High/Med) > 1; (Med/Low) > 1 | |
Target Removal Green ctrl 1 ≤ bkg | ((bkg + x)/ctrl) > 1 | bkg = Extension Green highest A or T intensity |
Target Removal Green ctrl 2 ≤ bkg | ((bkg + x)/ctrl) > 1 | bkg = Extension Green highest A or T intensity |
Bisulfite Conversion I Green C1, 2 > U1, 2 | (C/U) > 1 | |
Bisulfite Conversion I Green U ≤ bkg | ((bkg + x)/U) > | |
Bisulfite Conversion I Red C3, 4, 5 > U3, 4, 5 | (C/U) > 1 | |
Bisulfite Conversion I Red U ≤ bkg | ((bkg + x)/U) > | |
Bisulfite Conversion II C Red > C Green | (C Red/C Green) > | |
Bisulfite Conversion II C Green ≤ bkg | ((bkg + x)/C Green) > | |
Specificity I Green PM > MM | (PM/MM) > 1 | |
Specificity I Red PM > MM | (PM/MM) > 1 | |
Specificity II S Red > S Green | (S Red/S Green) > 1 | |
Specificity II S Green ≤ bkg | ((bkg + x)/S Green) > 1 | |
Nonpolymorphic Green Lowest CG/Highest AT | (C or G/A or T) > | |
Nonpolymorphic Red Lowest AT/Highest CG | (A or T/C or G) > | |
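The ratio calculations in the table above can be illustrated with a minimal Python sketch. This is not the software's implementation: the function names and the intensity values are hypothetical, and only the formulas and the default offset of 3,000 come from the table and the text before it.

```python
DEFAULT_OFFSET = 3000  # default background correction offset (x)


def simple_ratio(numerator, denominator):
    """Metrics without an offset, e.g. Staining Green: High / Biotin Bkg."""
    return numerator / denominator


def signal_over_background(signal, bkg, offset=DEFAULT_OFFSET):
    """Metrics of the form signal / (bkg + x), e.g. Restoration Green."""
    return signal / (bkg + offset)


def background_over_control(bkg, ctrl, offset=DEFAULT_OFFSET):
    """Metrics of the form (bkg + x) / ctrl, e.g. Target Removal Green."""
    return (bkg + offset) / ctrl


def passes(metric, threshold):
    """A control passes when its metric is strictly greater than the threshold."""
    return metric > threshold


# Hypothetical intensities, for illustration only.
extension_green_highest_at = 500  # serves as bkg for Target Removal Green
target_removal_ctrl1 = 700

metric = background_over_control(extension_green_highest_at, target_removal_ctrl1)
print(metric, passes(metric, threshold=1))  # (500 + 3000) / 700 = 5.0, passes
```

The sketch does not guard against a zero denominator; how the software handles that case is not described here.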
Methylation Sample QC Summary Plots
The software produces one methylation sample QC summary plot file (sample_qc_summary.pdf) per analysis batch. The file contains two QC summary plots for quick visual review.
The file contains the following control plots:
Control Plot | Description |
---|---|
Proportion of Probes Passing Threshold | Histogram of the proportion of probes passing the p-value detection threshold. Samples passing QC are shown in one color, and samples failing QC are shown in another color. |
Principal Component Analysis (PCA) | Uses beta values for all analytical probes to compare samples. PCA is applied to the beta values to reduce the dimensionality of the data to two “principal components” that reflect the most variation across samples. If more than 100 samples are used in the analysis, a random subset of 10,000 probes is used for the PCA to reduce the computational burden. The PCA control plot assigns a unique color to each sample group defined in the IDAT Sample Sheet. If no groups were assigned, all samples appear in the same color. Sample groups may cluster together, and grouping can help explain some of the variation. The coordinates used to plot each sample in the PCA control plot are provided in the pcs.tsv.gz output file (see below). |
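The probe subsetting rule described for the PCA plot can be sketched in Python as follows. This only illustrates the rule (more than 100 samples, random subset of 10,000 probes); the function name, the seed parameter, and the sampling method are assumptions, since the software's actual sampling scheme is not documented here.

```python
import random


def select_probes_for_pca(probe_ids, n_samples, sample_limit=100,
                          subset_size=10_000, seed=None):
    """Return the probes to use for PCA.

    All probes are used unless the batch has more than `sample_limit`
    samples, in which case a random subset of `subset_size` probes is drawn.
    """
    probe_ids = list(probe_ids)
    if n_samples > sample_limit and len(probe_ids) > subset_size:
        rng = random.Random(seed)  # seed is hypothetical, for reproducibility
        return rng.sample(probe_ids, subset_size)
    return probe_ids


# Hypothetical probe identifiers, for illustration only.
all_probes = [f"probe_{i}" for i in range(50_000)]
print(len(select_probes_for_pca(all_probes, n_samples=150, seed=1)))
print(len(select_probes_for_pca(all_probes, n_samples=96)))
```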
Methylation Principal Component Summary
The software produces one methylation principal component summary file (pcs.tsv.gz) per analysis batch, which provides principal component data for each sample in the batch. This file can be used to identify the specific samples associated with points on the PCA control plot in the Methylation Sample QC Summary Plots output file.
The file contains the following fields:
Field | Description |
---|---|
blank | BeadChip barcode and position, for example 123456789101_R01C01 |
principal component 1 | The variable of the first axis for the Principal Component Analysis |
principal component 2 | The variable of the second axis for the Principal Component Analysis |
Sample_Group | Sample group defined by the user in the IDAT Sample Sheet. If no sample group was defined, all samples will show NA. |
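To look up which sample a point on the PCA plot belongs to, the file can be read with the Python standard library. The sketch below assumes the file is tab-delimited, has one header row, and lists its columns in the order shown in the table above; it reads columns by position because the exact header text is not specified here. The function name is hypothetical.

```python
import csv
import gzip


def read_pcs(path):
    """Map each sample ID to (PC1, PC2, Sample_Group)."""
    samples = {}
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header row (assumed to be present)
        for row in reader:
            if not row:
                continue  # ignore blank lines
            sample_id, pc1, pc2, group = row[:4]
            samples[sample_id] = (float(pc1), float(pc2), group)
    return samples
```

For example, `read_pcs("pcs.tsv.gz")["123456789101_R01C01"]` would return the two coordinates and the sample group for that BeadChip position, provided the file matches the assumed layout.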
Methylation Manifest Files
The software produces two methylation manifest files:
Manifest in Sesame format (probes.csv)
Additional information for control probes (controls.csv)
The probes.csv file has the following columns:
Field | Description |
---|---|
Probe_ID | This is a unique identifier for each probe. It corresponds to the IlmnID column in the standard Illumina manifest format or ctl_[AddressA_ID] for control probes. |
U | This corresponds to the AddressA_ID column in the standard Illumina manifest format. |
M | This corresponds to the AddressB_ID column in the standard Illumina manifest format. |
col | This is the color channel for Infinium I probes (R/G). For Infinium II probes, this column will be NA. |
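As a quick check on a probes.csv file, the probes can be tallied by the value in the col column. The sketch below is illustrative only: the probe IDs and addresses are made up, the function name is hypothetical, and it assumes a standard comma-separated file with the four column names listed above in its header row. It assumes that R or G marks an Infinium I probe and that NA marks an Infinium II probe.

```python
import csv
import io
from collections import Counter


def count_by_color_channel(csv_text):
    """Count probes per value of the `col` column (R, G, or NA)."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["col"].strip()] += 1
    return counts


# Hypothetical manifest content, for illustration only.
example = """Probe_ID,U,M,col
example_probe_1,1001,1002,R
example_probe_2,1003,1004,G
example_probe_3,1005,NA,NA
"""

counts = count_by_color_channel(example)
print(counts["R"], counts["G"], counts["NA"])
```

To run this against a real file, pass the file's text content, for example the result of `open("probes.csv").read()`.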
The controls.csv file has the following columns:
Field | Description |
---|---|
Address | The address of the probe |
Type | The control probe type |
Color_Channel | A color used to denote certain control probes in legacy software |
Name | A human readable identifier for certain control probes |
Probe_ID | This is a unique identifier for each probe. It corresponds to the IlmnID column in the standard Illumina manifest format or ctl_[AddressA_ID] for control probes. |
Methylation Warning/Error Messages and Logs
The following scenarios result in a warning or error message:
Missing IDATs or manifest
Incorrect sample sheet formatting
Duplicate BeadChip Barcode and Position within the sample sheet
Missing control or assay probes
Missing required columns in the manifest
Unable to compute certain metrics
Examples of such notifications can include the following:
Log | Error | Type | Cause |
---|---|---|---|
write_samplesheet.log | No IDATs found | Error | No IDATs provided for analysis |
format_samplesheet.log | No samples in sample sheet | Error | No samples in user’s sample sheet input |
format_samplesheet.log | Sample sheet not correctly formatted | Error | Sample sheet is not in CSV format or header lines do not start with “<” |
format_samplesheet.log | beadChipName and sampleSectionName columns are required for the sample sheet. | Error | Sample sheet does not contain required columns: beadChipName and sampleSectionName. |
format_samplesheet.log | Warning: <Number> samples have duplicate Sample_ID | Warning | X lines in the sample sheet have duplicate <beadChipName>_<sampleSectionName>. Duplicates are dropped from analysis. |
convert_manifest_ilmn_sesame.log | Missing control probes in manifest | Error | Missing “[Controls]” line in CSV manifest |
convert_manifest_ilmn_sesame.log | Probe section not found | Error | Missing “[Assay]” line in CSV manifest |
convert_manifest_ilmn_sesame.log | Missing required columns: IlmnID, AddressA_ID, AddressB_ID, Color_Channel | Error | Missing one of required columns in Assay section of manifest |
convert_manifest_ilmn_sesame.log | Controls not formatted correctly. Must have 4 columns (Address,Type,Color_Channel,Name) | Error | Missing one of required columns in Control section of manifest |
run_sesame_gs.log | Missing sample: <Sample_ID> | Error | Missing IDATs for a particular sample |
run_sesame_gs.log | No scan time available | Warning | No scan time in IDAT |
run_sesame_gs.log | Prep failed | Error | Dye bias correction or noob failure for sample |
run_sesame_gs.log | Warning: missing control probe types <Missing probes> | Warning | Missing control probe types to compute a BACR metric. Metric will be set to NA. |
run_sesame_gs.log | Warning: missing control probe names <Missing probe types> | Warning | Missing control probes to compute a BACR metric. Metric will be set to NA. |
qc.log | No features, skipping PCA plot | Warning | No common betas found in all samples. This may occur if a sample has no signal intensity in the IDAT files. |
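Duplicate sample entries can be caught before analysis by checking the sample sheet rows directly. The sketch below is a hypothetical pre-check, not part of the software: it builds the identifier as <beadChipName>_<sampleSectionName>, as described in the table above, and reports every identifier that appears more than once. It assumes the rows have already been parsed into dictionaries keyed by column name.

```python
from collections import Counter


def find_duplicate_sample_ids(rows):
    """Return the sample IDs that occur more than once, in first-seen order."""
    ids = [f"{row['beadChipName']}_{row['sampleSectionName']}" for row in rows]
    counts = Counter(ids)
    duplicates = []
    for sample_id in ids:
        if counts[sample_id] > 1 and sample_id not in duplicates:
            duplicates.append(sample_id)
    return duplicates


# Hypothetical sample sheet rows, for illustration only.
rows = [
    {"beadChipName": "123456789101", "sampleSectionName": "R01C01"},
    {"beadChipName": "123456789101", "sampleSectionName": "R02C01"},
    {"beadChipName": "123456789101", "sampleSectionName": "R01C01"},
]
print(find_duplicate_sample_ids(rows))
```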