Output Files

The following section describes the outputs produced by DRAGEN Array.

PGx CNV VCF File

DRAGEN Array produces one PGx CNV variant call file (VCF) (*.cnv.vcf) per sample to report the CN status at the gene and sub-gene level, along with the CN events for PGx targets.

The PGx CNV VCF output file follows the standard VCF format. The QUAL field in the VCF file measures the CNV call quality, a Phred-scaled score with a minimum of 0 and a cap of 60. Low-quality calls (QUAL<7) are flagged by the Q7 filter. Low-quality samples with LogRDev greater than the 0.2 threshold are flagged with the SampleQuality flag.
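As a rough illustration of the rules above, the following sketch derives the FILTER value that would be expected for a call. The function name and arguments are hypothetical, and joining multiple flags with a semicolon follows the general VCF convention; whether DRAGEN Array writes combined flags this way is an assumption.

```python
def expected_filters(qual, log_r_dev, qual_threshold=7.0, log_r_dev_threshold=0.2):
    """Return the FILTER value implied by the documented thresholds.

    Hypothetical helper: it mirrors the rules in the text, not the
    DRAGEN Array implementation itself.
    """
    if not 0 <= qual <= 60:
        raise ValueError("QUAL is documented as a score between 0 and 60")
    flags = []
    if qual < qual_threshold:
        # Low-quality call, e.g. Q7 for the PGx CNV VCF
        flags.append(f"Q{int(qual_threshold)}")
    if log_r_dev > log_r_dev_threshold:
        # Noisy sample
        flags.append("SampleQuality")
    # Assumption: several flags are joined with ';' as in the VCF convention.
    return ";".join(flags) if flags else "PASS"
```

Passing qual_threshold=10 reproduces the Q10 rule that the cytogenetics VCF uses.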

The PGx CNV VCF files are by default bgzipped (Block GZIP) and have the “.gz” extension. The compression saves storage space and facilitates efficient lookup when indexed with the TBI Index File. To view these files as plain text, they can be uncompressed with bgzip from Samtools or other third-party tools. The CNV VCF must be bgzipped and indexed to be used in downstream DRAGEN Array commands, such as star allele calling.
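For quick inspection without decompressing to disk, a bgzipped VCF can also be read with a general gzip reader, because the BGZF format is a series of standard gzip blocks. The sketch below uses only the Python standard library; the function names are illustrative and are not part of DRAGEN Array.

```python
import gzip


def read_vcf_lines(path):
    """Yield decoded lines from a bgzipped (or plain gzip) VCF file."""
    # BGZF is a concatenation of gzip members, which gzip.open reads in sequence.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def split_header_and_records(lines):
    """Separate '#'-prefixed header lines from data records."""
    header, records = [], []
    for line in lines:
        if not line:
            continue  # ignore blank lines
        if line.startswith("#"):
            header.append(line)
        else:
            records.append(line)
    return header, records
```

This only reads the file sequentially. Random access by region still requires the TBI index and a tool that understands it.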

The PGx CNV VCF output file includes the following content.

##fileformat=VCFv4.1

##source=dragena 1.3.0

##genomeBuild=38

##reference=file:///hg38_with_alt/hg38_nochr_MT.fa

##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number genotype for imprecise events. CN=5 indicates 5 or 5+">

##FORMAT=<ID=NR,Number=1,Type=Float,Description="Aggregated normalized intensity">

##ALT=<ID=CNV,Description="Copy number variant region">

##FILTER=<ID=Q7,Description="Quality below 7">

##FILTER=<ID=SampleQuality,Description="Sample was flagged as potentially low-quality due to high noise levels.">

##INFO=<ID=CNVLEN,Number=1,Type=Integer,Description="Number of bases in CNV hotspot">

##INFO=<ID=PROBE,Number=1,Type=Integer,Description="Number of probes assayed for CNV hotspot">

##INFO=<ID=END,Number=1,Type=Integer,Description="End position of CNV hotspot">

##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Structural Variant Type">

##OverallPloidy=1.8

##GCCorrect=True

##contig=<ID=1,length=248956422>

##contig=<ID=4,length=190214555>

##contig=<ID=10,length=133797422>

##contig=<ID=16,length=90338345>

##contig=<ID=19,length=58617616>

##contig=<ID=22,length=50818468>

##contig=<ID=22_KI270879v1_alt,length=304135>

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 204619760001_R01C01

1 109687842 CNV:GSTM1:chr1:109687842:109693526 N <CNV> 60 PASS CNVLEN=5685;PROBE=124;END=109693526;SVTYPE=CNV CN:NR 2:0.966631132771593

4 68537222 CNV:UGT2B17:chr4:68537222:68568499 N <CNV> 60 PASS CNVLEN=31278;PROBE=383;END=68568499;SVTYPE=CNV CN:NR 0:0.376696837881692

10 133527374 CNV:CYP2E1:chr10:133527374:133539096 N <CNV> 60 PASS CNVLEN=11723;PROBE=194;END=133539096;SVTYPE=CNV CN:NR 2:0.980059731860893

16 28615068 CNV:SULT1A1:chr16:28603587:28613544 N <CNV> 57 PASS CNVLEN=8315;PROBE=164;END=28623382;SVTYPE=CNV CN:NR 2:0.980552325552963

19 40844791 CNV:CYP2A6.intron.7:chr19:40844791:40845293 N <CNV> 60 PASS CNVLEN=503;PROBE=38;END=40845293;SVTYPE=CNV CN:NR 2:0.9663775484762

19 40850267 CNV:CYP2A6.exon.1:chr19:40850267:40850414 N <CNV> 60 PASS CNVLEN=148;PROBE=21;END=40850414;SVTYPE=CNV CN:NR 2:0.9663775484762

22 42126498 CNV:CYP2D6.exon.9:chr22:42126498:42126752 N <CNV> 48 PASS CNVLEN=255;PROBE=370;END=42126752;SVTYPE=CNV CN:NR 2:0.981703411438716

22 42129188 CNV:CYP2D6.intron.2:chr22:42129188:42129734 N <CNV> 10 PASS CNVLEN=547;PROBE=333;END=42129734;SVTYPE=CNV CN:NR 2:0.965498002434641

22 42130886 CNV:CYP2D6.p5:chr22:42130886:42131379 N <CNV> 60 PASS CNVLEN=494;PROBE=172;END=42131379;SVTYPE=CNV CN:NR 2:0.970341562236357

22_KI270879v1_alt 270316 CNV:GSTT1:chr22_KI270879v1_alt:270316:278477 N <CNV> 60 PASS CNVLEN=8162;PROBE=91;END=278477;SVTYPE=CNV CN:NR 2:1.01191145130511
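The record layout shown above can be unpacked with a few lines of Python. This is a minimal sketch, assuming the ten columns of the example and the INFO and FORMAT keys declared in the header; parse_cnv_record is a hypothetical helper name.

```python
def parse_cnv_record(line):
    """Parse one data line of the PGx CNV VCF into a dictionary."""
    # The real file is tab-delimited; split() also accepts the spaces used in this example.
    fields = line.split()
    chrom, pos, var_id, ref, alt, qual, flt, info, fmt, sample = fields[:10]
    info_map = dict(item.split("=", 1) for item in info.split(";"))
    sample_map = dict(zip(fmt.split(":"), sample.split(":")))
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "qual": float(qual),
        "filter": flt,
        "end": int(info_map["END"]),
        "cnvlen": int(info_map["CNVLEN"]),
        "probes": int(info_map["PROBE"]),
        "cn": int(sample_map["CN"]),
        "nr": float(sample_map["NR"]),
    }


record = parse_cnv_record(
    "1 109687842 CNV:GSTM1:chr1:109687842:109693526 N <CNV> 60 PASS "
    "CNVLEN=5685;PROBE=124;END=109693526;SVTYPE=CNV CN:NR 2:0.966631132771593"
)
# For this record, CNVLEN equals END - POS + 1.
span = record["end"] - record["pos"] + 1
```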

Cytogenetics VCF File

DRAGEN Array produces one cytogenetics Variant Call File (VCF) (*.cnv.vcf) per sample to report the CN and LOH status of the detected variants.

The cytogenetics CNV VCF output file follows the standard VCF format. The QUAL field in the VCF file measures the CNV/LOH call quality, a Phred-scaled score with a minimum of 0 and a cap of 60. Low-quality calls (QUAL<10) are flagged by the Q10 filter. Low-quality samples with LogRDev greater than the 0.2 threshold are flagged with the SampleQuality flag.

The cytogenetics CNV VCF files are by default bgzipped (Block GZIP) and have the “.gz” extension. The compression saves storage space and facilitates efficient lookup when indexed with the TBI Index File. To view these files as plain text, they can be uncompressed with bgzip from Samtools or other third-party tools. The CNV VCF must be bgzipped and indexed to be used in downstream DRAGEN Array commands, such as cyto annotate.

One example file can be found below:

##fileformat=VCFv4.1

##source=dragena 1.3.0 Cyto

##genomeBuild=37

##product=GDACyto-8v1-0_A

##reference=file://genome.fa

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number genotype. CN=4 indicates 4 or 4+">

##FORMAT=<ID=NR,Number=1,Type=Float,Description="Aggregated normalized intensity">

##FORMAT=<ID=LRD,Number=1,Type=Float,Description="Standard deviation of logR ratios">

##platform=cytoplatform

##ALT=<ID=DEL,Description="Copy number loss region">

##ALT=<ID=DUP,Description="Copy number gain heterozygous region">

##ALT=<ID=LOH,Description="AOH/LOH/ROH, absence of heterozygosity region, or, loss of heterozygosity region">

##FILTER=<ID=Q10,Description="Quality below 10">

##FILTER=<ID=SampleQuality,Description="Sample was flagged as potentially low-quality due to high noise levels.">

##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Number of bases in CNV/LOH region">

##INFO=<ID=PROBE,Number=1,Type=Integer,Description="Number of probes assayed for CNV/LOH region">

##INFO=<ID=END,Number=1,Type=Integer,Description="End position of CNV/LOH region">

##INFO=<ID=LOHTYPE,Number=A,Type=String,Description="Type of LOH (Loss/absence of heterozygosity). Valid values are AOH (germline, copy number neutral or gain LOH), CNLOH (somatic, copy number neutral LOH), GAINLOH (somatic, copy number gain LOH)">

##OverallPloidy=1.9

##GCCorrect=True

##contig=<ID=1,length=249250621>

##contig=<ID=2,length=243199373>

##contig=<ID=3,length=198022430>

##contig=<ID=4,length=191154276>

##contig=<ID=5,length=180915260>

##contig=<ID=6,length=171115067>

##contig=<ID=7,length=159138663>

##contig=<ID=8,length=146364022>

##contig=<ID=9,length=141213431>

##contig=<ID=10,length=135534747>

##contig=<ID=11,length=135006516>

##contig=<ID=12,length=133851895>

##contig=<ID=13,length=115169878>

##contig=<ID=14,length=107349540>

##contig=<ID=15,length=102531392>

##contig=<ID=16,length=90354753>

##contig=<ID=17,length=81195210>

##contig=<ID=18,length=78077248>

##contig=<ID=19,length=59128983>

##contig=<ID=20,length=63025520>

##contig=<ID=21,length=48129895>

##contig=<ID=22,length=51304566>

##contig=<ID=X,length=155270560>

##contig=<ID=Y,length=59373566>

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 208588190001_R02C01

1 109687842 DEL:chr1:109687842:109693526 N <DEL> 60 PASS SVLEN=5685;PROBE=99;END=109693526 GT:CN:NR:LRD 1/1:1:0.8860:0.21

16 28603587 DUP:chr16:28603587:28613544 N <DUP> 60 PASS SVLEN=9958;PROBE=197;END=28613544 GT:CN:NR:LRD 1/1:3:1.1666:0.11

22 42129188 AOH:chr22:42129188:42129734 N <LOH> 37 PASS SVLEN=547;PROBE=198;END=42129734;LOHTYPE=AOH GT:CN:NR:LRD 1/1:2:1.0208:0.25

SNV VCF File

The software produces one genotyping variant call file (*.snv.vcf) per sample, covering single nucleotide variants (SNV) and indels for the sample. It reports GenCall score (GS), B Allele Frequency (BAF), and Log R Ratio (LRR) per variant. The VCF file output follows the VCF 4.1 format.

Some additional details:

The FILTER column is hardcoded to PASS and is not dependent on the GT value. It does not reflect the underlying quality of the call. Refer to the GS value for quality information.
Genotypes are adjusted to reflect the sample ploidy. Calls are haploid for loci on Y, MT, and non-PAR chromosome X for males.
Multiple SNPs in the input manifest which are mapped to the same chromosomal coordinate (e.g. tri-allelic loci or duplicated sites) are collapsed into one VCF entry and a combined genotype generated. To produce the combined genotype, the set of all possible genotypes is enumerated based on the queried alleles. Genotypes which are not possible based on called alleles and assay design limitations (e.g. Infinium II designs cannot distinguish between A/T and C/G calls) are filtered. If only one consistent genotype remains after the filtering process, then the site is assigned this genotype. Otherwise, the genotype is ambiguous (more than 1) or inconsistent (less than 1) and a no-call is returned.
Certain SNV and indel calls are skipped and not reported in the VCF. Skipped data can include unmapped loci (i.e., Chr is 0 in the manifest), intensity-only probes used for CNV identification, and indels that do not map back to the genome. See Warning/Error Messages and Logs for messages that may be seen with DRAGEN Array Local related to the skipped data.
The BAF and LRR are oriented with Ref as A and Alt as B relative to the reference genome, while GS is agnostic to the reference genome. Users familiar with GenomeStudio may observe BAF and LRR reported in the VCF as 1 minus the value reported in GenomeStudio depending on the Ref Alt allele orientation with the reference genome. GenomeStudio reports these values based on the information in the manifest without knowledge of the reference genome.
The SNV VCF files are by default bgzipped (Block GZIP) and have the “.gz” extension. The compression saves storage space and facilitates efficient lookup when indexed with the TBI Index File. To view these files as plain text, they can be uncompressed with bgzip from Samtools or other third-party tools. The SNV VCF must be bgzipped and indexed to be used in downstream DRAGEN Array commands, such as star allele calling.
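The BAF orientation difference described in the list above can be sketched as follows. The helper and its allele_a_is_ref flag are hypothetical; the sketch covers only BAF and assumes that the only difference from GenomeStudio is which manifest allele is treated as the reference.

```python
def vcf_baf_from_genomestudio(baf_gs, allele_a_is_ref):
    """Convert a GenomeStudio BAF to the Ref=A / Alt=B orientation used in the VCF.

    allele_a_is_ref is a hypothetical flag: True when manifest allele A is
    the reference allele, False when manifest allele B is the reference.
    """
    if not 0.0 <= baf_gs <= 1.0:
        raise ValueError("BAF must lie between 0 and 1")
    # When the manifest orientation is flipped relative to the reference,
    # the VCF value is 1 minus the GenomeStudio value.
    return baf_gs if allele_a_is_ref else 1.0 - baf_gs
```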

The SNV VCF output file includes the following content. The last row shows an example variant call.

##fileformat=VCFv4.1

##source=dragena 1.3.0

##genomeBuild=38

##reference=file:///genomes/38/genome.fa

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##FORMAT=<ID=GS,Number=1,Type=Float,Description="GenCall score. For merged multi-assay or multi-allelic records, min GenCall score is reported.">

##FORMAT=<ID=BAF,Number=1,Type=Float,Description="B Allele Frequency">

##FORMAT=<ID=LRR,Number=1,Type=Float,Description="LogR ratio">

##contig=<ID=1,length=248956422>

##contig=<ID=2,length=242193529>

##contig=<ID=3,length=198295559>

##contig=<ID=4,length=190214555>

##contig=<ID=5,length=181538259>

##contig=<ID=6,length=170805979>

##contig=<ID=7,length=159345973>

##contig=<ID=8,length=145138636>

##contig=<ID=9,length=138394717>

##contig=<ID=10,length=133797422>

##contig=<ID=11,length=135086622>

##contig=<ID=12,length=133275309>

##contig=<ID=13,length=114364328>

##contig=<ID=14,length=107043718>

##contig=<ID=15,length=101991189>

##contig=<ID=16,length=90338345>

##contig=<ID=17,length=83257441>

##contig=<ID=18,length=80373285>

##contig=<ID=19,length=58617616>

##contig=<ID=20,length=64444167>

##contig=<ID=21,length=46709983>

##contig=<ID=22,length=50818468>

##contig=<ID=MT,length=16569>

##contig=<ID=X,length=156040895>

##contig=<ID=Y,length=57227415>

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 202937470021_R06C01

1 2290399 rs878093 G A . PASS . GT:GS:BAF:LRR 0/1:0.7923:0.50724137:0.14730307
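Because the FILTER column is always PASS, quality screening has to rely on the GS value. The sketch below parses the sample column of a record like the one above and applies a GenCall cutoff. The helper names are hypothetical, and the default cutoff of 0.15 is an illustrative assumption, not a value taken from this documentation.

```python
def _to_float(value):
    """Convert a VCF field to float, treating '.' (missing) as None."""
    return None if value in (".", "") else float(value)


def parse_snv_sample(line):
    """Extract the genotype and per-variant metrics from one SNV VCF data line."""
    # The real file is tab-delimited; split() also accepts the spaces used in this example.
    fields = line.split()
    values = dict(zip(fields[8].split(":"), fields[9].split(":")))
    return {
        "chrom": fields[0],
        "pos": int(fields[1]),
        "id": fields[2],
        "ref": fields[3],
        "alt": fields[4],
        "gt": values["GT"],
        "gs": _to_float(values["GS"]),
        "baf": _to_float(values["BAF"]),
        "lrr": _to_float(values["LRR"]),
    }


def passes_gencall(call, min_gs=0.15):
    """Return True for called genotypes whose GS meets the chosen cutoff."""
    # min_gs is an assumed, user-chosen cutoff for this sketch.
    if "." in call["gt"] or call["gs"] is None:
        return False
    return call["gs"] >= min_gs


call = parse_snv_sample(
    "1 2290399 rs878093 G A . PASS . GT:GS:BAF:LRR 0/1:0.7923:0.50724137:0.14730307"
)
```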

Note on Multi-Allelic Variants (MAV) calling limitations

DRAGEN Array can combine multiple assays with different target bases but the same genomic position to make MAV calls. However, Illumina Microarrays are inherently bi-allelic assays made up of Infinium I or Infinium II probe designs which require special design considerations and have some inherent limitations.

The MAV calling algorithm currently filters across all overlapping assays, retaining only genotypes whose alleles are present in the intersection of all assays. If multiple genotypes remain after filtering, the result is considered ambiguous and reported as a NoCall to avoid false positives. This ambiguity often arises when one assay is a NoCall due to presumed probe failure rather than missing signal. In such cases, its potential genotypes are not excluded, contributing to ambiguity. When DRAGEN Array outputs a NoCall because of the described behaviors, it is logged as a warning (e.g., Failed to combine genotypes due to ambiguity...).

Overall, the current algorithm errs on the side of caution to ensure quality calls, but produces some idiosyncratic behavior and potential false NoCalls when genotypes are biologically consistent but differ due to probe designs. We hope to improve this behavior in future versions of DRAGEN Array.

Some illustrative examples are below to help understand the current limitations:

Scenario

Expected MAV Call

Actual MAV Call

Explanation

Inf II [A/G] -> AA + Inf I [T/G] -> TT

NoCall

The Inf II assay cannot differentiate A versus T alleles, hence AA call for the Inf II assay is consistent with AA, AT, or TT genotypes. The Inf I assay TT call, however, is consistent with AT or TT genotypes. Combining the two assays results in a NoCall due to the AT or TT ambiguity.

Inf I [T/A] -> AA + Inf II [A/G] -> NC

NoCall

NoCall for Inf II probe leaves possibility of AG. Ambiguity between AG or AA leads to NoCall.
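The filtering step can be pictured as a set intersection over the genotypes each assay is consistent with. The sketch below is a simplification under stated assumptions: the candidate genotype sets are supplied by hand (here taken from the first scenario above), and combine_assays is a hypothetical function, not the DRAGEN Array implementation.

```python
NO_CALL = "NoCall"


def combine_assays(candidate_sets):
    """Intersect per-assay candidate genotypes and return one call or NoCall."""
    sets = [set(candidates) for candidates in candidate_sets]
    if not sets:
        return NO_CALL
    remaining = set.intersection(*sets)
    if len(remaining) == 1:
        return next(iter(remaining))
    # More than one genotype left is ambiguous; none left is inconsistent.
    return NO_CALL


# First scenario above: the Inf II [A/G] call of AA cannot rule out T alleles,
# while the Inf I [T/G] call of TT is consistent with AT or TT.
inf2_ag_called_aa = {"AA", "AT", "TT"}
inf1_tg_called_tt = {"AT", "TT"}
result = combine_assays([inf2_ag_called_aa, inf1_tg_called_tt])
```

Here two genotypes (AT and TT) survive the intersection, so the combined result is a NoCall, matching the explanation in the first scenario.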

Note on delimiters in the "ID" field

By default, when multiple probes are present for a given variant, all probe names are included in the "ID" field of the resulting VCF file.

For SNP entries, probe names are separated by commas (,).
For Indel entries, probe names are separated by semicolons (;).

Note on REF/ALT "flipping" for INDELs

Expected REF and ALT alleles for indels may not match the dbSNP annotations in rare cases. For example, an expected "Deletion" with REF="ATCG" and ALT="A" may be "flipped" to an "Insertion" variant with REF="A" and ALT="ATCG". The corresponding genotype output takes this into account, so the actual VCF is still correct. This is simply a notation issue in some of the manifest files.

Note on PLINK compatibility

It is possible to make DRAGEN Array genotype VCF files compatible for conversion to PED/MAP format with preprocessing using tabix (v1.19.1), BCFtools (v1.21) and PLINK (v1.9). The following three commands demonstrate the basic process.

bcftools merge -l vcf_list.txt -Oz -o merged.vcf.gz creates a single compressed VCF from the individual sample VCFs listed in the vcf_list.txt text file.
tabix -p vcf merged.vcf.gz creates a binary index for the merged file.
plink --vcf merged.vcf.gz --recode --out merged creates .ped and .map files with the prefix provided to --out.
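The vcf_list.txt file passed to bcftools merge -l is a plain text file with one VCF path per line. A minimal sketch for generating it is shown below; the function name and the *.snv.vcf.gz glob pattern are assumptions based on the file naming described earlier, so adjust them to the actual output folder layout.

```python
from pathlib import Path


def write_vcf_list(vcf_dir, list_path="vcf_list.txt", pattern="*.snv.vcf.gz"):
    """Write one VCF path per line, sorted, for use with `bcftools merge -l`."""
    # Assumption: per-sample SNV VCFs sit directly in vcf_dir and match pattern.
    paths = sorted(str(p) for p in Path(vcf_dir).glob(pattern))
    Path(list_path).write_text("".join(f"{p}\n" for p in paths), encoding="utf-8")
    return paths
```

bcftools merge generally expects each listed input to be bgzipped and indexed, which matches the default output described for the SNV VCF files.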

Some optional arguments may be provided to PLINK depending on the content of the VCFs to be converted and the downstream analysis.

For VCFs containing non-standard human chromosomes (e.g. haplotype chromosomes or unplaced contigs), the --allow-extra-chr flag can be used.
If using non-human data, refer to the PLINK documentation for the --chr-set argument and supported options.
By default, PLINK will only consider the most common ALT allele for multi-allelic variants. The --biallelic-only argument can be provided to exclude multi-allelic variants altogether. As an alternative, bcftools norm -m - in.vcf.gz -Oz -o out.vcf.gz can be run upstream of PLINK to split multi-allelic variants into bi-allelic records and retain them for downstream processing.

For more information on the options described and others, refer to the PLINK VCF conversion documentation.

Genotype Call (GTC) File

The genotype call algorithm produces one genotype call file (.gtc) per sample analyzed. The Genotype Call (GTC) file contains the small variant (SNV and indel) genotype for each marker specified by the product, along with sample quality metrics. The sample marker location is not included and must be extracted from the manifest file. The proprietary binary format can be parsed using the Illumina open-source tool BeadArray Library File Parser.

Note on lack of i18n: GTCs are binary, fixed-format files designed before modern internationalization and localization tools. A related known issue can make GTCs unusable in downstream analyses. Refer to that known issue for a workaround.

Note on legacy GTCs: Other Illumina software (such as AutoConvert and Beeline) also produces GTC files. These "legacy GTC" files work in DRAGEN Array genotyping commands such as genotype gtc-to-vcf, but they do not work with other downstream analyses such as Cytogenetics analysis and PGx. We recommend using DRAGEN Array end-to-end starting from IDATs for these analyses.

BedGraph Files

The BedGraph files contain the Log R Ratios (LRR.bedgraph) and B-Allele Frequencies (BAF.bedgraph) from the genotyping algorithm for use in visualization tools.

Star Allele CSV File

The Star Allele CSV file is an intermediate file generated by the pgx star-allele call command and serves as the input to the pgx star-allele annotate command. It contains all the star allele calls for all samples in a run. Each row in the file provides either a star allele diplotype or simple variant call for a PGx-related gene. Star allele diplotype calls for a sample and a gene may span multiple lines where alternative solutions can be listed.

The Star Allele CSV file also contains meta information marked by # at the top of the file for the genome build and PGx database used for the star allele calling.

The star_allele.csv file contains the following details per sample:

Field

Description

Sample

Sentrix barcode and position of the sample.

Rank

Rank of a single star allele solution for a gene. The top solution based on quality score is ranked as 1 with the alternative solutions ranked lower.

Gene or Variant

The gene symbol, or gene symbol plus rsID for variants.

Type

‘Haplotype’ (star allele) or ‘Variant’ PGx calling type.

Solution

Star allele or variant solution. If diploid, variant solutions have the format of Allele1/Allele2.

Solution Long

Long format solution for star alleles. The field has the following format: Structural Variant Type: Underlying Star allele.

An example of a long solution is: Complete: CYP2D6*4, Complete: CYP2D6*10, CYP2D6*68: CYP2D6*4 where there are two complete alleles that have CYP2D6*4 and CYP2D6*10 haplotypes and one CYP2D6*68 structural variant that has a CYP2D6*4 haplotype configuration.

Supporting Variants

All variants present in the array that support the star allele solution. The field has the following format: Long Solution Star Allele: (Supporting Variants).

Each supporting variant is listed with essential information extracted from the SNV VCF to assist with troubleshooting, including Chromosome, Location, Reference allele, Alternative allele, Genotype, GenCall score (GS), and B-allele frequency (BAF).

Missing/Masked Core Variants

All variants not present in the array or not called in the SNV VCF file for the star allele. The field has the following format: Long Solution Star-Allele: (Missing Variants).

All Missing Variants in Array

All core definition variants that are not on the array or are not called in the SNV VCF along with the associated star alleles that are impacted. The field has the following format: Missing Variant: (List of impacted star alleles).

Collapsed Star-Alleles

Star alleles that cannot be distinguished from the solution star allele given the input array’s content. The field has the following format: Long Solution Star-Allele: (List of collapsed star alleles).

The most frequent star allele based on the population frequency of PGx alleles will be the star allele in the solution.

Score

Quality score of the solution including the population frequency of PGx alleles. The score ranges from 0 to 1.

Raw Score

Raw quality score of the solution without including the population frequency of PGx alleles. The score ranges from 0 to 1.

Copy Number Solution

Estimated copy number for each gene region. The field has the following format: Gene Region: Copy Number.

Below is an example of the first 5 columns from a star allele CSV file:

Sample,Rank,Gene or Variant,Type,Solution

204650490282_R02C01,1,CYP2C9,Haplotype,*9/*11

204650490282_R02C01,1,CYP2C19,Haplotype,*2/*10
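A file with this layout can be read with the standard csv module once the leading meta lines are set aside. The sketch below is illustrative: the function names are hypothetical, and it assumes meta lines start with # and that the column names match the example above.

```python
import csv
import io


def read_star_allele_csv(text):
    """Split '#' meta lines from the table and parse the rows as dictionaries."""
    meta, data_lines = [], []
    for line in text.splitlines():
        if line.startswith("#"):
            meta.append(line.lstrip("#").strip())
        elif line.strip():
            data_lines.append(line)
    rows = list(csv.DictReader(io.StringIO("\n".join(data_lines))))
    return meta, rows


def top_solutions(rows):
    """Map (sample, gene or variant) to its rank-1 solution."""
    # If several rank-1 rows exist for the same key, the last one wins in this sketch.
    return {
        (row["Sample"], row["Gene or Variant"]): row["Solution"]
        for row in rows
        if row["Rank"] == "1"
    }
```

Alternative solutions, which carry a rank greater than 1, are left out by top_solutions and remain available in the parsed rows.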

Genotype Summary Files

The software produces genotype summary files (gt_sample_summary.csv and gt_sample_summary.json) that contain the following details per sample:

Sample ID
Sample Name
Sample Folder
Autosomal Call Rate
Call Rate
Log R Ratio Std Dev
Sex Estimate
TGA_Ctrl_5716 Norm R
(Optional) User defined fields from the samplesheet

The TGA_Ctrl_5716 Norm R field is specific to PGx products (e.g., Global Diversity Array with enhanced PGx). The field value is the Normalized R value of one probe and is meant as an assay control, where a value below 1 indicates that the sample failed in the TGA (Targeted Gene Amplification) process. If the product does not have this probe, the field is not included in the gt_sample_summary files.

The user-defined fields from the samplesheet appear as-is in the gt_sample_summary files. For example, given the following samplesheet:

SentrixBarcode_A,SentrixPosition_A,Sample_ID,Sample_Group,MetaData1
204753010023,R01C01,NA1231,Group1,F
204753010024,R01C01,NA1233,Group2,M

The software would produce something like the following gt_sample_summary.csv:

Sample ID,Sample Name,Sample Folder,Autosomal Call Rate,Call Rate,Log R Ratio Std Dev,Sex Estimate,SentrixBarcode_A,SentrixPosition_A,Sample_Group,MetaData1
204753010023_R01C01,204753010023_R01C01,/sample/folder,0.99414575,0.98843694,0.14829777,F,204753010023,R01C01,Group1,F
204753010024_R01C01,204753010024_R01C01,/sample/folder,0.99415575,0.98943694,0.14929777,M,204753010024,R01C01,Group2,M

And something like the following gt_sample_summary.json:

[
  {
    "Sample ID": "204753010023_R01C01",
    "Sample Name": "204753010023_R01C01",
    "Sample Folder": "/sample/folder",
    "Autosomal Call Rate": 0.99414575,
    "Call Rate": 0.98843694,
    "Log R Ratio Std Dev": 0.14829777,
    "Sex Estimate": "F",
    "SentrixBarcode_A": "204753010023",
    "SentrixPosition_A": "R01C01",
    "Sample_Group": "Group1",
    "MetaData1": "F"
  },
  {
    "Sample ID": "204753010024_R01C01",
    "Sample Name": "204753010024_R01C01",
    "Sample Folder": "/sample/folder",
    "Autosomal Call Rate": 0.99415575,
    "Call Rate": 0.98943694,
    "Log R Ratio Std Dev": 0.14929777,
    "Sex Estimate": "M",
    "SentrixBarcode_A": "204753010024",
    "SentrixPosition_A": "R01C01",
    "Sample_Group": "Group2",
    "MetaData1": "M"
  }
]
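A short sketch of how the CSV summary might be loaded for downstream checks is shown below. The function names are hypothetical; the numeric column names come from the field list above, and the TGA check applies the "below 1" rule described for TGA_Ctrl_5716 Norm R only when that column is present.

```python
import csv
import io

NUMERIC_FIELDS = (
    "Autosomal Call Rate",
    "Call Rate",
    "Log R Ratio Std Dev",
    "TGA_Ctrl_5716 Norm R",
)


def load_summary(csv_text):
    """Parse gt_sample_summary.csv content, converting known numeric columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for key in NUMERIC_FIELDS:
            if row.get(key) not in (None, ""):
                row[key] = float(row[key])
        rows.append(row)
    return rows


def tga_failed(row):
    """Return True when the TGA control value is present and below 1."""
    value = row.get("TGA_Ctrl_5716 Norm R")
    if value is None or value == "":
        return False  # product without this probe, or value not reported
    return value < 1
```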

Note: As of v1.3, samples that fail during genotyping will still be present in this file. See the details in the release notes.

Final Report

DRAGEN Array Cloud produces a Final Report (gtc_final_report.csv) per analysis batch similar to the one available in GenomeStudio. It contains the following details per locus per sample:

Field

Description

SNP Name

SNP identifier.

SNP

SNP alleles as reported by assay probes. Alleles on the Design strand (the ILMN strand) are listed in order of Allele A/B.

Sample ID

Sample identifier.

Allele 1 – Top

Allele 1 corresponds to Allele A and is reported on the Top strand.

Allele 2 – Top

Allele 2 corresponds to Allele B and is reported on the Top strand.

Allele 1 – Forward

Allele 1 corresponds to Allele A and is reported on the Forward strand.

Allele 2 – Forward

Allele 2 corresponds to Allele B and is reported on the Forward strand.

Allele 1 – Plus

Allele 1 corresponds to Allele A and is reported on the Plus strand.

Allele 2 – Plus

Allele 2 corresponds to Allele B and is reported on the Plus strand.

GC Score

Quality metric calculated for each genotype (data point), and ranges from 0 to 1.

GT Score

The SNP cluster quality. Score for a SNP from the GenTrain clustering algorithm.

Log R Ratio

Base-2 log of the normalized R value over the expected R value for the theta value (interpolated from the R-values of the clusters). For loci categorized as intensity only, the value is adjusted so that the expected R value is the mean of the cluster.

B Allele Freq

B allele frequency for this sample, interpolated from the known B allele frequencies of the three canonical clusters (0, 0.5, and 1). The value is 1 if the sample's theta is equal to or greater than the theta mean of the BB cluster. B Allele Freq is between 0 and 1, or set to NaN for loci categorized as intensity only.

Chr

Chromosome containing the SNP.

Position

SNP chromosomal position.
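The Log R Ratio and B Allele Freq definitions in the table can be illustrated with a short sketch. The LRR function follows the definition given above. The BAF function uses the linear interpolation between cluster theta means that is commonly described for Infinium data; it is an assumption here and may differ in detail from what DRAGEN Array computes. All function and parameter names are hypothetical.

```python
import math


def log_r_ratio(r_observed, r_expected):
    """Base-2 log of the normalized R value over the expected R value."""
    return math.log2(r_observed / r_expected)


def b_allele_freq(theta, theta_aa, theta_ab, theta_bb):
    """Interpolate BAF from the theta means of the AA, AB, and BB clusters.

    Assumed scheme: 0 at or below the AA mean, 0.5 at the AB mean,
    1 at or above the BB mean, linear in between.
    """
    if theta < theta_aa:
        return 0.0
    if theta >= theta_bb:
        return 1.0
    if theta < theta_ab:
        return 0.5 * (theta - theta_aa) / (theta_ab - theta_aa)
    return 0.5 + 0.5 * (theta - theta_ab) / (theta_bb - theta_ab)
```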

Note: Analyses on products with large numbers of loci (>1 Million) and large numbers of samples (>100) yield a large (50+ Gigabyte) Final Report that is difficult to download and review. It is recommended to create analysis configurations that do not produce this report if large batches are desired.

For more information on interpreting DNA strand and allele information, see Illumina Knowledge article How to interpret DNA strand and allele information for Infinium genotyping array data.

Locus Summary

DRAGEN Array Cloud produces a Locus Summary (locus_summary.csv) per analysis batch similar to the one available in GenomeStudio. It contains the following details per locus:

Field

Description

Locus_Name

Locus name from the manifest file.

Illumicode_Name

Locus ID from the manifest file.

#No_Calls

Number of loci with GenCall scores below the call region threshold.

#Calls

Number of loci with GenCall scores above the call region threshold.

Call_Freq

Call frequency or call rate calculated as follows: #Calls/(#No_Calls + #Calls)

A/A_Freq

Frequency of homozygote allele A calls.

A/B_Freq

Frequency of heterozygote calls.

B/B_Freq

Frequency of homozygote allele B calls.

Minor_Freq

Frequency of the minor allele.

Gentrain_Score

Quality score for samples clustered for this locus.

50%_GC_Score

50th percentile GenCall score for all samples.

10%_GC_Score

10th percentile GenCall score for all samples.

Het_Excess_Freq

Heterozygote excess frequency, calculated as (Observed - Expected)/Expected for the heterozygote class. If $f_{ab}$ is the heterozygote frequency observed at a locus, and p and q are the major and minor allele frequencies, then the het excess is calculated as follows: $(f_{ab} - 2pq)/(2pq + \varepsilon)$

ChiTest_P100

Hardy-Weinberg p-value estimate calculated using genotype frequency. The value is calculated with 1 degree of freedom and is normalized to 100 individuals.

Cluster_Sep

Cluster separation score.

AA_T_Mean

Normalized theta angles mean for the AA genotype.

AA_T_Std

Normalized theta angles standard deviation for the AA genotype.

AB_T_Mean

Normalized theta angles mean for the AB genotype.

AB_T_Std

Standard deviation of the normalized theta angles for the AB genotype.

BB_T_Mean

Normalized theta angles mean for the BB genotypes.

BB_T_Std

Standard deviation of the normalized theta angles for the BB genotypes.

AA_R_Mean

Normalized R value mean for the AA genotypes.

AA_R_Std

Standard deviation of the normalized R value for the AA genotypes.

AB_R_Mean

Normalized R value mean for the AB genotypes.

AB_R_Std

Standard deviation of the normalized R value for the AB genotypes.

BB_R_Mean

Normalized R value mean for the BB genotypes.

BB_R_Std

Standard deviation of the normalized R value for the BB genotypes.

Plus/Minus Strand

Designated "+" or "-" with respect to the reference genome strand. "U" designates unknown.
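The Call_Freq and Het_Excess_Freq formulas in the table can be worked through with genotype counts for a single locus. This is a sketch under stated assumptions: genotype frequencies are taken over called genotypes only, and the value of epsilon is a placeholder, since the documentation does not state the value used by the software. The function name is hypothetical.

```python
def locus_stats(n_aa, n_ab, n_bb, n_no_calls, epsilon=1e-9):
    """Compute call frequency, minor allele frequency, and het excess for one locus."""
    calls = n_aa + n_ab + n_bb
    if calls == 0:
        raise ValueError("at least one called genotype is required")
    call_freq = calls / (n_no_calls + calls)
    # Assumption: genotype frequencies are computed over called genotypes only.
    f_aa, f_ab, f_bb = (n / calls for n in (n_aa, n_ab, n_bb))
    freq_a = f_aa + f_ab / 2
    freq_b = f_bb + f_ab / 2
    p, q = max(freq_a, freq_b), min(freq_a, freq_b)
    # epsilon guards against division by zero; its value here is a placeholder.
    het_excess = (f_ab - 2 * p * q) / (2 * p * q + epsilon)
    return {
        "Call_Freq": call_freq,
        "Minor_Freq": q,
        "Het_Excess_Freq": het_excess,
    }
```

A locus in Hardy-Weinberg proportions gives a het excess near 0, while a locus with no heterozygotes at all gives a value near -1.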

CN Summary File

The CN summary file contains key per-sample statistics for each sample in a batch, including the following details:

Sample ID
Sample Name
Sample Folder

Copy Number Batch File

The copy number batch summary file (cn_batch_summary.csv) shows the total copy number gain, loss, and neutral (CN=2) values for each target region across all the samples in the analysis.

Example copy number batch summary file content:

Target Region,Total CN gain,Total CN loss,Total CN neutral

CYP2A6.exon.1,0,1,47

CYP2A6.intron.7,0,1,47

CYP2D6.exon.9,2,4,42

CYP2D6.intron.2,7,2,39

CYP2D6.p5,13,2,33

CYP2E1,2,0,46

GSTM1,0,42,6

GSTT1,0,33,15

SULT1A1,0,0,48

UGT2B17,0,34,14

All Target Regions,24,119,337

Warning/Error Messages and Logs

The following scenarios result in a warning or error message:

Manifest file used to generate GTC is not the same as the manifest file used to generate the CN model.
FASTA files and FASTA index files do not match.

For the following scenarios, the software reports messages to the terminal output (as either a warning or an error):

Indel processing for GTC to VCF conversion failed.
The input folder does not contain the required input files.
An input file is corrupt.

Examples of such notifications can include the following:

Error

Type

Cause

Failed to normalize and gencall sample: {sample_id}, it will be skipped. Error: The given key '{loci_id}' was not present in the dictionary.

Warning

This generally occurs because of a mismatch between the manifest (bpm) and cluster file (egt) (i.e., the cluster file was generated via a different manifest). To remedy the issue, use the manifest and cluster files intended for use together.

Reference allele is not queried for locus: {identifier}

Warning

True reference allele does not match any alleles in the manifest. The error is common for MNVs and will be addressed in future versions of the software.

Skipping non-mapped locus: {identifier}

Warning

Locus has no chromosome position (usually 0). These loci may be used for quality purposes or CNV calling only.

Skipping intensity only locus: {identifier}

Warning

Similar to non-mapped loci, intensity only probes have applications outside creating variants for SNV VCFs such as CNV calling.

Skipping indel: {identifier}

Warning

Indel context (deletion/insertion) could not be determined.

Failed to process entry for record: {identifier}

Warning

Unable to determine reference allele for indel.

Incomplete match of source sequence to genome for indel: {identifier}

Warning

Indel not properly mapped to the reference genome.

Failed to combine genotypes due to ambiguity - exm1068284 (InfiniumII): TT, ilmnseq_rs1131690890_mnv (InfiniumII): AA, rs1131690890_mnv (InfiniumII): AA

Warning

Detailed information about a NoCall ("./.") in the VCF that results from combining multiple probes that assay the same variant with conflicting results. The example here is two probes with homozygous REF genotypes (AA) and one probe with a homozygous ALT genotype (TT).

Cluster file ({GTC.egt}) is not the same as CN Model Cluster file ({CN_Model.egt}).

Warning

The cluster file used to generate the GTCs for copy number calling is not the same as the one used for the GTCs during the copy number training that created the input CN model. Although the CNV model is robust to minor cluster file updates, CNV training should be considered when the cluster file has significant updates. To remove the warning, do one of the following: re-run copy number training with new GTCs generated via the new cluster file during genotyping, use a different CN model built with the expected cluster file, or use GTCs for copy number calling that were generated with the same cluster file as the input CN model.

{numPassingSamples} sample(s) passed QC.

Requires at least {minPassingSamples} samples to proceed.

Error

CNV calling is batch dependent and requires a certain number of high-quality samples to make accurate calls. More high-quality samples need to be added to the analysis batch to resolve the error.

Invalid manifest file path {manifestPath}

Error

The application could not find the manifest file at the provided path, for example because the path was entered incorrectly.

Failed to load cluster file: {e.Message}

Error

Corrupted file or unsupported version.

System.IO.EndOfStreamException: Unable to read beyond the end of the stream.

Error

Likely a failure to read a GTC file. See the related known issue for more details on the root cause and a workaround.

Star allele JSON File

The star allele JSON file is produced per sample. It contains the fields present in the star allele CSV file as well as additional metadata and annotations.

Fields included in the star allele JSON header are described below.

Field

Description

softwareVersion

DRAGEN Array software version, e.g. dragena 1.0.0.

genomeBuild

Genome build, e.g. hg38.

starAlleleDatabaseSources

Public databases with versions used as the sources of the star allele definitions and population frequencies.

phenotypeDatabaseSources

Public databases with versions used as the sources of the star allele phenotypes.

mappingFile

The PGx database file used for the star allele calling.

pgxGuideline

The PGx guidelines used for metabolizer status/phenotype annotations, e.g. CPIC or DPWG.

sampleId

Sentrix barcode and position of the sample.

locusAnnotations

The star allele call information.

Fields included in the star allele call (locusAnnotations) information are described below.

Field

Description

gene

The gene symbol.

callType

‘Star Allele’ or ‘Variant’ PGx calling type.

genotype

Most likely star allele or variant solution. If diploid, variant solutions have the format Allele1/Allele2. More than one solution meeting the threshold requirements can be reported. Multiple top solutions are separated by a semicolon.

activityScore

Activity score annotation for the determined genotype of the gene, based on the public PGx guidelines (CPIC or DPWG).

phenotypeDatabaseAnnotation

Metabolizer status and function annotations for the determined genotype of the gene, based on a lookup in the user-selected public PGx guideline (CPIC or DPWG).

qualityScore

Quality score of the solution including the population frequency of PGx alleles. The score ranges from 0 to 1.

rawScore

Raw quality score of the solution without including the population frequency of PGx alleles. The score ranges from 0 to 1.

supportingVariants

All variants present on the array that support the star allele solution. The field provides an array (list) of supporting variants.

Each supporting variant is listed with essential information extracted from the SNV VCF to assist with troubleshooting, including Chromosome (chrom), Location (pos), Reference allele (ref), Alternative allele (alt), Genotype (gt), GenCall score (gs), B-allele frequency (baf), the variant ID (id), and the associated star allele IDs (alleleIds).

candidateSolutions

The set of alternative star allele calling solutions. This field is only relevant for genes of the ‘Star Allele’ call type.

missingVariantSites

All core variants that are not available (e.g. not on the array, or no calls in the SNV VCF) for star allele calling for this gene. For star alleles, the field provides an array (list) of variant "id" and impacted "alleleIds" pairs.

allelesTested

Alleles that are covered by the star allele caller. The capability to call star alleles is also dependent on array content coverage and data quality. This field is defined by the array's content and will be the same across all samples.

Fields included in the candidateSolutions section, which is only available for the star allele call type, are described below.

Field

Description

rank

Rank of a single star allele solution for a gene. The top solution based on quality score is ranked as 1 with the alternative solutions ranked lower.

genotype

Star allele or variant solution. If diploid, variant solutions have the format of Allele1/Allele2.

activityScore

Activity score annotation for the determined genotype of the gene, based on the public PGx guidelines (CPIC or DPWG).

phenotype

Metabolizer status and function annotations for the determined genotype of the gene, based on a lookup in the user-selected public PGx guideline (CPIC or DPWG).

qualityScore

Quality score of the solution including the population frequency of PGx alleles. The score ranges from 0 to 1.

rawScore

Raw quality score of the solution without including the population frequency of PGx alleles. The score ranges from 0 to 1.

alleles

The composite alleles of the candidate genotype solution.

solutionLong

Long format solution for star alleles. The field has the following format: Structural Variant Type: Underlying Star allele.

supportingVariants

All variants present on the array that support the star allele solution. The field provides an array (list) of supporting variants.

missingVariantSites

All variants not present in the array or not called in the SNV VCF file for the star allele solution. The field provides an array (list) of missing variants.

collapsedAlleles

The most frequent star allele based on the population frequency of PGx alleles will be the star allele in the solution.

copyNumberRegions

Gene regions for the copy numbers listed in copyNumberSolution.

copyNumberSolution

Estimated copy number for each gene region listed in copyNumberRegions.
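The relationship between copyNumberRegions and copyNumberSolution can be sketched in Python as follows. This is a minimal illustration, not part of DRAGEN Array: it assumes both fields are comma-separated strings of equal length, as in the example JSON in this section, and that the copy numbers are integers. The function name copy_number_by_region and the input values are hypothetical.

```python
def copy_number_by_region(solution: dict) -> dict:
    """Pair each gene region with its estimated copy number."""
    regions = solution["copyNumberRegions"].split(",")
    # Assumption: copy numbers are integers, as in the example JSON.
    copies = [int(c) for c in solution["copyNumberSolution"].split(",")]
    if len(regions) != len(copies):
        raise ValueError("region and copy number lists differ in length")
    return dict(zip(regions, copies))


# Hypothetical candidate solution with a partial deletion at the 3' end.
candidate = {
    "copyNumberRegions": "p5,exon.1,intron.1,exon.2,p3",
    "copyNumberSolution": "2,2,2,1,1",
}
per_region = copy_number_by_region(candidate)

# Keep only the regions whose copy number differs from the diploid value 2.
non_diploid = {region: n for region, n in per_region.items() if n != 2}
print(non_diploid)
```

Pairing the two lists by position makes it easy to see which gene regions carry a copy number change.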

Example of JSON file content:

{
  "softwareVersion": "dragena 1.3.0",
  "genomeBuild": "38",
  "starAlleleDatabaseSources": [
    "PharmVar Version: 6.1",
    "PharmGKB Database Version: Snapshot-2024.05.16",
    "UGT Alleles Nomenclature: 2010.12.21",
    "The Human Cytochrome P450 (CYP) Allele Nomenclature Database, July 2024"
  ],
  "phenotypeDatabaseSources": [
    "CPIC Database Version: 1.38.0",
    "DPWG Database Version: June 2023"
  ],
  "mappingFile": "DRAGENA-549-fix-annotate-sha.e56e884ed1f2d118e796cdab578ab895456bb94e.zip",
  "pgxGuideline": "CPIC",
  "sampleId": "207883050020_R08C03",
  "locusAnnotations": [
    {
      "gene": "CYP2C9",
      "callType": "Star Allele",
      "genotype": "*1/*1",
      "activityScore": "2",
      "phenotypeDatabaseAnnotation": "CYP2C9 Normal Metabolizer",
      "qualityScore": "0.9999",
      "rawScore": "0.9999",
      "supportingVariants": [],
      "candidateSolutions": [
        {
          "rank": 1,
          "genotype": "*1/*1",
          "activityScore": "2",
          "phenotypeDatabaseAnnotation": "CYP2C9 Normal Metabolizer",
          "qualityScore": 0.9999,
          "rawScore": 0.9999,
          "alleles": [
            {
              "solutionLong": "Complete: *1",
              "supportingVariants": [],
              "missingVariantSites": [],
              "collapsedAlleles": ""
            }
          ],
          "copyNumberRegions": "p5,exon.1,intron.1,exon.2,intron.2,exon.3,intron.3,exon.4,intron.4,exon.5,intron.5,exon.6,intron.6,exon.7,intron.7,exon.8,intron.8,exon.9,p3",
          "copyNumberSolution": "2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2"
        }
      ],
      "missingVariantSites": [
        {
          "id": "NC_000010.11:g.94938719T>G",
          "alleleIds": "*80"
        },
        {
          "id": "NC_000010.11:g.94938788C>T",
          "alleleIds": "*83"
        },
        {
          "id": "NC_000010.11:g.94938800G>A",
          "alleleIds": "*76"
        },
        {
          "id": "NC_000010.11:g.94941975G>A",
          "alleleIds": "*77"
        },
        {
          "id": "NC_000010.11:g.94942243T>G",
          "alleleIds": "*78"
        },
        {
          "id": "NC_000010.11:g.94942306C>T",
          "alleleIds": "*72"
        },
        {
          "id": "NC_000010.11:g.94942308C>T",
          "alleleIds": "*73"
        },
        {
          "id": "NC_000010.11:g.94942309G>T",
          "alleleIds": "*27"
        },
        {
          "id": "NC_000010.11:g.94947939G>T",
          "alleleIds": "*74"
        },
        {
          "id": "NC_000010.11:g.94949145C>T",
          "alleleIds": "*82"
        },
        {
          "id": "NC_000010.11:g.94949163del",
          "alleleIds": "*85"
        },
        {
          "id": "NC_000010.11:g.94972183A>T",
          "alleleIds": "*81"
        },
        {
          "id": "NC_000010.11:g.94981258C>T",
          "alleleIds": "*79"
        },
        {
          "id": "NC_000010.11:g.94986136A>C",
          "alleleIds": "*75"
        },
        {
          "id": "NC_000010.11:g.94986174G>C",
          "alleleIds": "*84"
        }
      ],
      "allelesTested": "*1,*2,*3,*4,*5,*6,*7,*8,*9,*10,*11,*12,*13,*14,*15,*16,*17,*18,*19,*20,*21,*22,*23,*24,*25,*26,*27,*28,*29,*30,*31,*32,*33,*34,*35,*36,*37,*38,*39,*40,*41,*42,*43,*44,*45,*46,*47,*48,*49,*50,*51,*52,*53,*54,*55,*56,*57,*58,*59,*60,*61,*62,*63,*64,*65,*66,*67,*68,*69,*70,*71,*72,*73,*74,*75,*76,*77,*78,*79,*80,*81,*82,*83,*84,*85"
    },
    {
      "gene": "CYP2C19",
      "callType": "Star Allele",
      "genotype": "*1/*2",
      "activityScore": "n/a",
      "phenotypeDatabaseAnnotation": "CYP2C19 Intermediate Metabolizer",
      "qualityScore": "0.9999",
      "rawScore": "0.9958",
      "supportingVariants": [
        {
          "chrom": "10",
          "pos": "94842866",
          "ref": "A",
          "alt": "G",
          "gt": "1/1",
          "gs": "0.2669",
          "baf": "1",
          "id": "NC_000010.11:g.94842866A>G",
          "alleleIds": "*1"
        },
        {
          "chrom": "10",
          "pos": "94775367",
          "ref": "A",
          "alt": "G",
          "gt": "0/1",
          "gs": "0.2191",
          "baf": "0.4690612",
          "id": "NC_000010.11:g.94775367A>G",
          "alleleIds": "*2"
        },
        {
          "chrom": "10",
          "pos": "94781859",
          "ref": "G",
          "alt": "A",
          "gt": "0/1",
          "gs": "0.3351",
          "baf": "0.66212183",
          "id": "NC_000010.11:g.94781859G>A",
          "alleleIds": "*2"
        },
        {
          "chrom": "10",
          "pos": "94842866",
          "ref": "A",
          "alt": "G",
          "gt": "1/1",
          "gs": "0.2669",
          "baf": "1",
          "id": "NC_000010.11:g.94842866A>G",
          "alleleIds": "*2"
        }
      ],
      "candidateSolutions": [
        {
          "rank": 1,
          "genotype": "*1/*2",
          "activityScore": "n/a",
          "phenotypeDatabaseAnnotation": "CYP2C19 Intermediate Metabolizer",
          "qualityScore": 0.9999,
          "rawScore": 0.9958,
          "alleles": [
            {
              "solutionLong": "Complete: *1",
              "supportingVariants": [
                {
                  "chrom": "10",
                  "pos": "94842866",
                  "ref": "A",
                  "alt": "G",
                  "gt": "1/1",
                  "gs": "0.2669",
                  "baf": "1",
                  "id": "NC_000010.11:g.94842866A>G"
                }
              ],
              "missingVariantSites": [],
              "collapsedAlleles": ""
            },
            {
              "solutionLong": "Complete: *2",
              "supportingVariants": [
                {
                  "chrom": "10",
                  "pos": "94775367",
                  "ref": "A",
                  "alt": "G",
                  "gt": "0/1",
                  "gs": "0.2191",
                  "baf": "0.4690612",
                  "id": "NC_000010.11:g.94775367A>G"
                },
                {
                  "chrom": "10",
                  "pos": "94781859",
                  "ref": "G",
                  "alt": "A",
                  "gt": "0/1",
                  "gs": "0.3351",
                  "baf": "0.66212183",
                  "id": "NC_000010.11:g.94781859G>A"
                },
                {
                  "chrom": "10",
                  "pos": "94842866",
                  "ref": "A",
                  "alt": "G",
                  "gt": "1/1",
                  "gs": "0.2669",
                  "baf": "1",
                  "id": "NC_000010.11:g.94842866A>G"
                }
              ],
              "missingVariantSites": [],
              "collapsedAlleles": "*2.001"
            }
          ],
          "copyNumberRegions": "p5,exon.1,intron.1,exon.2,intron.2,exon.3,intron.3,exon.4,intron.4,exon.5,intron.5,exon.6,intron.6,exon.7,intron.7,exon.8,intron.8,exon.9,p3",
          "copyNumberSolution": "2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2"
        }
      ],
      "missingVariantSites": [
        {
          "id": "NC_000010.11:g.94762715T>C",
          "alleleIds": "*34"
        }
      ],
      "allelesTested": "*1,*2,*3,*4,*5,*6,*7,*8,*9,*10,*11,*12,*13,*14,*15,*16,*17,*18,*19,*22,*23,*24,*25,*26,*28,*29,*30,*31,*32,*33,*34,*35,*38,*39"
    }
  ]
}

Guidance on alternative star-allele results

Typically, the star allele solution with the highest quality score is accepted as the final genotype (i.e. star allele diplotype) for the PGx locus. In rare cases, lower-ranked star allele solutions have quality scores of at least 50% of the highest quality score. These lower-ranked solutions are considered feasible, and they are all listed in the genotype field of the locus annotation for the PGx gene in the PGx JSON file. Alternative solutions should also be considered if there are supporting variants for those solutions with low GS scores (less than 0.15). Supporting variants with low GS scores should also be evaluated for cluster quality and any potential cluster shift.
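The two checks in this guidance can be sketched in Python as follows. This is an illustrative sketch rather than the logic DRAGEN Array uses internally: it assumes the candidateSolutions and supportingVariants structures described above, and the function names and input values are hypothetical.

```python
def feasible_genotypes(candidates, ratio=0.5):
    """Genotypes whose quality score is at least `ratio` times the top score."""
    scores = [float(c["qualityScore"]) for c in candidates]
    cutoff = ratio * max(scores)
    return [c["genotype"] for c, s in zip(candidates, scores) if s >= cutoff]


def low_gs_variants(variants, gs_cutoff=0.15):
    """IDs of supporting variants whose GenCall score is below the cutoff."""
    return [v["id"].strip() for v in variants if float(v["gs"]) < gs_cutoff]


# Hypothetical values for illustration only.
candidates = [
    {"rank": 1, "genotype": "*1/*2", "qualityScore": 0.9},
    {"rank": 2, "genotype": "*1/*35", "qualityScore": 0.5},
    {"rank": 3, "genotype": "*2/*15", "qualityScore": 0.2},
]
variants = [
    {"id": "NC_000010.11:g.94775367A>G", "gs": "0.2191"},
    {"id": "NC_000010.11:g.94781859G>A", "gs": "0.1200"},
]

print(feasible_genotypes(candidates))
print(low_gs_variants(variants))
```

Variants returned by low_gs_variants are candidates for manual review of cluster quality, as described above.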

Cytogenetics Annotation JSON File

DRAGEN Array produces one cytogenetics annotation JSON (*.json) per sample to report additional sample-level, chromosome-level, and event-level metrics and annotations.

Example of JSON file content:

{
  "annotateDb": "CytoAnnotateData_DAv1.2.0.zip",
  "softwareVersion": "dragena 1.3.0 Cyto",
  "referenceGenome": "file://genome.fa",
  "annotationType": "Constitutional",
  "genomeBuild": "hg19",
  "databaseSources": "RefSeq (Version: GCF_000001405.40-RS_2023_10; Release Date: 2023-10-07),Ensembl (Version: 112; Release Date: 2024-05-14)",
  "iscnVersion": "ISCN 2020",
  "sampleId": "208662410005_R01C01",
  "gcCorrect": true,
  "minDelProbes": 10,
  "minDupProbes": 10,
  "minLOHProbes": 500,
  "minDelSize": "20kb",
  "minDupSize": "20kb",
  "minLOHSize": "3000kb",
  "minQual": 20,
  "overallPloidy": 2.012,
  "callRate": 0.9847303032875061,
  "logRDev": 0.19977793097496033,
  "medianLogRDev": 0.1329595363075463,
  "bafDev": {
    "AA": 0.015344353947021572,
    "AB": 0.05078388443393606,
    "BB": 0.026002263878862304
  },
  "numLOHOver1M": 4,
  "numLOHOver8M": 3,
  "totalSizeLOHOver1M": 49319297,
  "copyNumberMedian": 2.0,
  "percentLOH": "1.59%",
  "sexEstimate": "Female",
  "traditionalNomenclature": "dup(2)(q32.3q33.1),dup(2)(q33.1q37.1),dup(2)(q37.1q37.3),del(2)(q37.3q37.3),del(2)(q37.3q37.3),dup(3)(p24.3p24.3),del(13)(q34q34)",
  "microarrayNomenclature": "1p12q21.1(120311442_144549929)x2 hmz,2q32.3q33.1(197045077_201353083)x3,2q33.1q37.1(201356309_234652155)x3,2q37.1q37.3(234653107_238195820)x3,2q37.3(238204076_238283050)x1,2q37.3(238283403_243062047)x1,3p24.3(23235392_23403815)x3,5p12q11.1(44708357_49847659)x2 hmz,11p11.2q12.1(47912150_56507812)x2 hmz,13q34(111358236_111423865)x1,Xp11.22q12(53907828_65253670)x2 hmz",
  "chromosomeAnnotations": [
    {
      "id": "chr1",
      "size": 249250621,
      "percentHet": "10.9587%",
      "hasMosaicism": false,
      "lrrMedian": -0.009397780522704124,
      "lrrDev": 0.16622741086471732,
      "numLOHOver1M": 0,
      "numLOHOver8M": 0,
      "totalSizeLOHOver1M": 0,
      "percentLOH": "0%",
      "copyNumberMedian": 2.0,
      "copyNumberMean": 2.0,
      "minLogRRatio": -2.739042392000556,
      "maxLogRRatio": 1.631125334650278,
      "medianMosaicFraction": ".",
      "numberDel": 0,
      "numberDup": 0,
      "numberLOH": 0,
      "numberMosaic": 0
    },
    ...
  ],
  "locusAnnotations": [
    {
      "id": "AOH:1:120311442:144549929",
      "chrom": "chr1",
      "start": 120311441,
      "end": 144549929,
      "callType": "LOH",
      "mosaicState": false,
      "mosaicFraction": ".",
      "copyNumber": 2,
      "qualityScore": 35.0,
      "size": 24238488,
      "effectiveSize": 151008838,
      "probeCount": 701,
      "percentHet": "1.01%",
      "lrrMedian": 0.05862508801510572,
      "lrrDev": 0.08330211160585141,
      "bafDev": 0.47622189059186604,
      "startCytoBand": "1p12",
      "endCytoBand": "1q21.1",
      "traditionalNomenclature": "N/A",
      "microarrayNomenclature": "1p12q21.1(120311442_144549929)x2 hmz",
      "geneCount": 96,
      "genes": [
        "HMGCS2",
        "REG4",
        "NBPF7P",
        "PFN1P9",
        "NOTCH2P1",
        "ADAM30",
        "RP5-1042I8.7",
        "NOTCH2",
        "RP11-114O18.1",
        ...
      ]
    },
    ...
  ]
}

The fields in the annotation JSON for each sample are described as follows.

Field

Description

annotateDb

File name of the database annotation file.

softwareVersion

Version of DRAGEN Array used for the analysis.

referenceGenome

File name of the reference genome.

annotationType

Integer representing annotation methodology where 0=Constitutional and 1=Oncology.

genomeBuild

Genome build e.g. hg19, hg38.

databaseSources

Release versions of annotation data.

iscnVersion

Release date of ISCN formatting used.

sampleId

ID string assigned to the sample.

gcCorrect

Boolean indicating whether GC correction was enabled.

minDelProbes

Deletions must contain this many probes to be reported.

minDupProbes

Duplications must contain this many probes to be reported.

minLOHProbes

LOH variants must contain this many probes to be reported.

minDelSize

Minimum length filter for reporting deletions in kilobases (kb).

minDupSize

Minimum length filter for reporting a duplication in kb.

minLOHSize

Minimum length filter for reporting a loss-of-heterozygosity (LOH) variant in kb.

minQual

Minimum quality score filter for reporting a variant.

overallPloidy

Arithmetic mean of the ploidy across the genome. This value accounts for the length of all variant calls. The baseline ploidy value without any variants will differ by sex.

callRate

Frequency of expected calls, i.e. #Calls/(#No_Calls + #Calls).

logRDev

Standard deviation of the Log R ratio values for all probes.

bafDev

Standard deviation of the B allele frequency values for each assigned genotype (AA/AB/BB).

numLOHOver1M

Count of LOH variants detected > 1 Mbp in length.

numLOHOver8M

Count of LOH variants detected > 8 Mbp in length.

totalSizeLOHOver1M

Cumulative length of all detected LOH variants > 1 Mbp in length.

copyNumberMedian

Length-normalized genome-wide median copy number value. Copy number values are assigned to contiguous segments of variable size in the genome by the algorithm. The length-weighted copy numbers of each variant are aggregated to calculate the overall median.

percentLOH

Percent of the genome comprised of LOH variants.

sexEstimate

Detected sex of the sample.

traditionalNomenclature

Simplified ISCN format designation for all detected variants in the sample.

microarrayNomenclature

ISCN format designation for all detected variants in the sample.

chromosomeAnnotations

Counts of each type of variant detected per chromosome, including mosaic calls.

locusAnnotations

Locus level statistics (see additional table for locus-level statistics).
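The length weighting mentioned for overallPloidy and copyNumberMedian can be sketched as follows. This is only a sketch under stated assumptions: the genome is represented as (length, copy number) segments, the mean and median are weighted by segment length, and the exact aggregation used by DRAGEN Array may differ. The function names and segment values are hypothetical.

```python
def weighted_mean(segments):
    """Length-weighted mean copy number over (length, copy_number) segments."""
    total = sum(length for length, _ in segments)
    return sum(length * cn for length, cn in segments) / total


def weighted_median(segments):
    """Copy number at which half of the total segment length is reached."""
    total = sum(length for length, _ in segments)
    running = 0
    for length, cn in sorted(segments, key=lambda seg: seg[1]):
        running += length
        if running >= total / 2:
            return cn


# Hypothetical genome: 90 Mb at CN 2, a 6 Mb duplication, a 4 Mb deletion.
segments = [(90_000_000, 2), (6_000_000, 3), (4_000_000, 1)]
print(round(weighted_mean(segments), 3))
print(weighted_median(segments))
```

In this sketch, long normal segments dominate both statistics, so a few short events shift the mean only slightly and leave the median at 2.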

The fields for each chromosome under the chromosomeAnnotations field of the Cyto annotation JSON are described below.

Field

Description

id

Chromosome name.

size

Chromosome size.

percentHet

Percent of probes in the chromosome called as heterozygous, i.e. AB.

hasMosaicism

Boolean indicating presence of any mosaic variant on the chromosome.

lrrMedian

Median log R ratio value of the probes within the chromosome.

lrrDev

Standard deviation of the log R ratio values of the probes within the chromosome.

numLOHOver1M

Count of LOH variants detected > 1 Mbp in length.

numLOHOver8M

Count of LOH variants detected > 8 Mbp in length.

totalSizeLOHOver1M

Cumulative length of all detected LOH variants > 1 Mbp in length.

percentLOH

Percent of the genome comprised of LOH variants.

copyNumberMedian

Copy number median of the chromosome.

copyNumberMean

Copy number mean of the chromosome.

minLogRRatio

Minimum Log R ratio of the chromosome.

maxLogRRatio

Maximum Log R ratio of the chromosome.

medianMosaicFraction

Median mosaic fraction of mosaic events on the chromosome.

numberDel

Number of deletion events on the chromosome.

numberDup

Number of duplication events on the chromosome.

numberLOH

Number of LOH events on the chromosome.

numberMosaic

Number of mosaic events on the chromosome.

The fields within each variant (CNV/LOH event) under the locusAnnotations field of the Cyto annotation JSON are described below.

Field

Description

id

Unique variant ID containing the variant type, chromosome, and the start and end positions.

chrom

Chromosome of the variant.

start

Variant start position.

end

Variant end position.

callType

Variant class (DEL, DUP, LOH).

mosaicState

Boolean indicating whether the locus is a mosaic variant.

copyNumber

Copy number of the locus.

qualityScore

Phred-scaled score of the variant call quality.

size

Length of the variant.

effectiveSize

Gap-excluded length of the variant. A gap is defined when probe spacing is more than 150 times the median probe spacing. In that case, the gap is replaced with the median probe spacing.

probeCount

Number of probes contained in the called variant region.

percentHet

Percent of probes in the region called as heterozygous, i.e. AB.

lrrMedian

Median log R ratio value of the probes within the variant.

lrrDev

Standard deviation of the log R ratio values of the probes within the variant.

bafDev

Standard deviation of the B allele frequency values of the probes within the variant.

startCytoBand

Cytoband in which the variant starts.

endCytoBand

Cytoband in which the variant ends.

traditionalNomenclature

Simplified ISCN format designation for the detected variant.

microarrayNomenclature

ISCN format designation for the detected variant.

geneCount

Count of annotated genes within the variant region.

genes

List of names of all annotated genes within the variant region.
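The gap rule described for effectiveSize can be sketched in Python as follows. This is a minimal sketch, not the DRAGEN Array implementation: it assumes the input is the list of probe positions within one variant and, unless a value is supplied, that the median probe spacing is taken over those same probes (the source does not state which probes the median is computed over). The function name and positions are hypothetical.

```python
from statistics import median


def effective_size(positions, gap_factor=150, median_spacing=None):
    """Sum of probe spacings, with each gap replaced by the median spacing."""
    positions = sorted(positions)
    spacings = [b - a for a, b in zip(positions, positions[1:])]
    if not spacings:
        return 0
    if median_spacing is None:
        # Assumption: median over the probes of this variant only.
        median_spacing = median(spacings)
    return sum(
        median_spacing if s > gap_factor * median_spacing else s
        for s in spacings
    )


# Hypothetical probe positions with one large uncovered stretch.
positions = [0, 10, 20, 30, 40, 5040, 5050, 5060]
print(positions[-1] - positions[0])  # plain size
print(effective_size(positions))     # gap-excluded size
```

Here the 5000 bp stretch exceeds 150 times the median spacing of 10 bp, so it counts as a single median spacing in the effective size.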

The traditionalNomenclature field describes individual and cumulative copy number variants at a coarse resolution. It shows gains (dup) and losses (del) according to their chromosome number, arm (p or q), and band (e.g., p36.13). The microarrayNomenclature field follows the Comparative Genomic Hybridization or SNP array conventions. Values are prefixed with arr[] to indicate array data as the source, along with the genome build, e.g. GRCh38. This notation describes the location (start_end in bp) and copy number (x1 for loss, x3 for gain, etc.) more precisely.

The genes list for each variant includes the genes whose transcript coordinates intersect the described variant. The values are a combination of HGNC gene symbols taken from the NCBI RefSeq database (e.g. TBC1D3I) and the subset of Ensembl gene accessions that are not matched to a gene symbol (e.g. ENSG00000278395).
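A microarrayNomenclature value can be split into its parts with a regular expression, as sketched below. This is a hypothetical helper, not part of DRAGEN Array: the pattern is written against the comma-separated, unprefixed form shown in the example JSON in this section and does not handle an arr[] prefix or other ISCN constructs.

```python
import re

# Pattern for one event, e.g. "1p12q21.1(120311442_144549929)x2 hmz".
EVENT = re.compile(
    r"^(?P<chrom>\d{1,2}|X|Y)"
    r"(?P<bands>[pq][0-9.pq]*)"
    r"\((?P<start>\d+)_(?P<end>\d+)\)"
    r"x(?P<copy_number>\d+)"
    r"(?P<hmz> hmz)?$"
)


def parse_microarray_nomenclature(text):
    """Parse a comma-separated microarrayNomenclature string into dicts."""
    events = []
    for item in text.split(","):
        match = EVENT.match(item.strip())
        if match is None:
            raise ValueError(f"unrecognized event: {item!r}")
        events.append({
            "chrom": match["chrom"],
            "bands": match["bands"],
            "start": int(match["start"]),
            "end": int(match["end"]),
            "copy_number": int(match["copy_number"]),
            "homozygous": match["hmz"] is not None,
        })
    return events


example = (
    "2q32.3q33.1(197045077_201353083)x3,"
    "1p12q21.1(120311442_144549929)x2 hmz"
)
for event in parse_microarray_nomenclature(example):
    print(event)
```

The homozygous flag corresponds to the hmz suffix, which the example output attaches to copy-neutral LOH regions (x2 hmz).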

TBI Index File

The TBI (TABIX) index file is associated with the bgzipped VCF files. It allows for data line lookup in VCF files for quick data retrieval. The format is a tab-delimited genome index file developed by Samtools as part of the HTSlib utilities. For more information, visit the Samtools website.

Methylation Control Probe Output File

The software produces a control probe output file ({BeadChipBarcode}_{Position}_ctrl.tsv.gz) per sample that includes the raw methylated and unmethylated values for each control probe.

Each control probe has an address, type, color channel, name, and probe ID. The file also provides the raw signal for methylated green (MG), methylated red (MR), unmethylated green (UG), and unmethylated red (UR).

The file can help identify which probes are available on a given BeadChip.

Methylation CG Output File

The software produces a CG output file ({BeadChipBarcode}_{Position}_cgs.tsv.gz) per sample that includes beta values, m-values and detection p-values for each CG site.

Beta values measure methylation levels in a linear fashion for easy interpretation. Unmethylated probes are close to zero and methylated probes are close to 1.

M-values are log-transformed beta values, which provide a more representative measure of methylation.

Detection p-values measure the likelihood that the signal is background noise. It is recommended to exclude CG sites with a detection p-value > 0.05 from analysis, as they are likely background noise.

See the High-throughput Infinium methylation array QC using DRAGEN Array Methylation QC software tech note for further detail on the calculation of these metrics.
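The relationship between the three metrics can be illustrated with a short Python sketch. It uses the commonly used logit transform M = log2(beta / (1 - beta)), which is an assumption here; the exact calculation used by the software is described in the tech note. The column names beta and pval and the row values are hypothetical.

```python
import math


def m_value(beta: float) -> float:
    """Logit (base 2) transform of a beta value; requires 0 < beta < 1."""
    return math.log2(beta / (1.0 - beta))


def passing_sites(rows, max_pval=0.05):
    """Keep CG sites whose detection p-value does not exceed `max_pval`."""
    return [row for row in rows if row["pval"] <= max_pval]


# Hypothetical rows; real files may use different column names.
rows = [
    {"id": "cg_example_1", "beta": 0.5, "pval": 0.001},
    {"id": "cg_example_2", "beta": 0.8, "pval": 0.010},
    {"id": "cg_example_3", "beta": 0.1, "pval": 0.200},
]

for row in passing_sites(rows):
    print(row["id"], round(m_value(row["beta"]), 3))
```

A beta value of 0.5 maps to an M-value of 0, and values closer to 0 or 1 map to increasingly negative or positive M-values.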

Methylation Sample QC Summary Files

The software produces a methylation sample QC summary in .xlsx and .tsv file formats (sample_qc_summary.xlsx and sample_qc_summary.tsv) per analysis batch, which provides per-sample QC data for all samples in the batch.

The QC summary provides details on 21 control metrics (see tables below), which are computed in the same way as in the BeadArray Controls Reporter software from Illumina. In addition, it provides average red and green raw and normalized signals, time of scanning, proportion of probes passing, overall sample pass/fail status, and the failure codes for control metrics that did not pass. A sample passes only if all 21 control metrics pass. The QC summary .xlsx file also highlights failing parameters for easy viewing.
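The pass rule can be sketched in Python as follows, using the 21 metric names and default thresholds given in the field descriptions of this section. This is an illustrative sketch only: it assumes a metric passes when its value is greater than or equal to its threshold, which the source does not state explicitly, and the function names and sample values are hypothetical.

```python
# Default thresholds as listed in the field descriptions of this section.
DEFAULT_THRESHOLDS = {
    "restoration": 0,
    "staining_green": 5, "staining_red": 5,
    "extension_green": 5, "extension_red": 5,
    "hybridization_high_medium": 1, "hybridization_medium_low": 1,
    "target_removal1": 1, "target_removal2": 1,
    "bisulfite_conversion1_green": 1,
    "bisulfite_conversion1_background_green": 1,
    "bisulfite_conversion1_red": 1,
    "bisulfite_conversion1_background_red": 1,
    "bisulfite_conversion2": 1, "bisulfite_conversion2_background": 1,
    "specificity1_green": 1, "specificity1_red": 1,
    "specificity2": 1, "specificity2_background": 1,
    "nonpolymorphic_green": 5, "nonpolymorphic_red": 5,
}


def failing_metrics(row, thresholds=DEFAULT_THRESHOLDS):
    """Return the control metrics in `row` that fall below their threshold."""
    # Assumption: a metric passes when its value is >= the threshold.
    return [
        name for name, limit in thresholds.items()
        if float(row[name]) < limit
    ]


def sample_passes(row, thresholds=DEFAULT_THRESHOLDS):
    """A sample passes only if every control metric passes."""
    return not failing_metrics(row, thresholds)


# Hypothetical rows, with values as strings as they would be read from a TSV.
good_sample = {name: "10.0" for name in DEFAULT_THRESHOLDS}
bad_sample = dict(good_sample, staining_red="2.3")

print(sample_passes(good_sample))
print(failing_metrics(bad_sample))
```

Listing the failing metrics, rather than returning only a pass/fail flag, mirrors the failure codes reported in the QC summary.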

The QC summary files contain the following fields:

Field

Description

Sentrix_ID

12-digit BeadChip Barcode associated with the sample.

Sentrix_Position

Row and column on the BeadChip, e.g. R01C01.

Sample_ID

Optional field that can be indicated using the IDAT Sample Sheet.

User Defined Meta Data

Optional field(s) that can be indicated using IDAT Sample Sheet. Any number of fields indicated will appear in this output file.

restoration

The default threshold is 0.
If using the FFPE DNA Restore Kit, the restoration control identifies success of the FFPE restoration chemistry. Change the threshold from 0 to 1 if the FFPE DNA Restore Kit was used.
The green channel intensity is higher than Background. Therefore, the metric provided is the Green Channel Intensity/Background.

staining_green

staining_red

Staining controls are used to examine the efficiency of the staining step in both the red and green channels. These controls are independent of the hybridization and extension step.
The green channel shows a higher signal for biotin staining when compared to biotin background, whereas the red channel shows higher signal for DNP staining when compared to DNP background.
The metric provided for green is (Biotin High value) / (Biotin Bkg value), and the metric provided for red is (DNP High value) / (DNP Bkg value).
The default threshold is 5. This threshold can be increased on some scanners.

extension_green

extension_red

Extension controls test the extension efficiency of A, T, C, and G nucleotides from a hairpin probe, and are therefore sample independent.
In the green channel, the lowest intensity for C or G is always greater than the highest intensity for A or T.
The metric provided is the (lowest of the C or G intensities) / (highest of the A or T intensities) for a single sample.
The default threshold is 5. This threshold can be increased on some scanners.

hybridization_high_medium

hybridization_medium_low

Hybridization controls test the overall performance of the Infinium Assay using synthetic targets instead of amplified DNA. These synthetic targets complement the sequence on the array, allowing the probe to extend on the synthetic target as a template. Synthetic targets are present in the Hybridization Buffer at 3 levels, monitoring the response from high-concentration (5 pM), medium concentration (1 pM), and low concentration (0.2 pM) targets. All bead type IDs result in signals with various intensities, corresponding to the concentrations of the initial synthetic targets.
The value for high concentration is always higher than medium and the value for medium concentration is always higher than low.
The metric provided is the value of high/medium and the value of medium/low.
The default thresholds are 1. Do not change the default threshold.

target_removal1

target_removal2

Target removal controls test the efficiency of the stripping step after the extension reaction. In contrast to allele-specific extension, the control oligos are extended using the probe sequence as a template. This process generates labeled targets. The probe sequences are designed such that extension from the probe does not occur. All target removal controls result in low signal compared to the hybridization controls, indicating that the targets were removed efficiently after extension. Target removal controls are present in the Hybridization Buffer.
The Background for the same sample is close to or larger than either control.
The metric provided is Background/Control Intensity.
The default threshold is 1. Do not change the default threshold; however, the offset correction can be changed.

bisulfite_conversion1_green

bisulfite_conversion1_background_green

bisulfite_conversion1_red

bisulfite_conversion1_background_red

These controls assess the efficiency of bisulfite conversion of the genomic DNA. The Infinium Methylation probes query a [C/T] polymorphism created by bisulfite conversion of non-CpG cytosines in the genome.
These controls use Infinium I probe design and allele-specific single base extension to monitor efficiency of bisulfite conversion. If the bisulfite conversion reaction was successful, the "C" (Converted) probes match the converted sequence and get extended. If the sample has unconverted DNA, the "U" (Unconverted) probes get extended. There are no underlying C bases in the primer landing sites, except for the query site itself.
The calculation is done in both the green and red channels separately to provide 2 unique sets of values:
- Green Channel
  - Lowest value of C1 or C2 / Highest value of U1 or U2. The default threshold is 1. This value can be increased for some scanners.
  - Background/(U1, or U2). The default threshold is 1. Do not change the default threshold; however, the offset correction can be changed.
- Red Channel
  - Lowest value of C3, 4, or 5 / Highest value of U3, 4, or 5. The default threshold is 1. This value can be increased for some scanners.
  - Background /(Highest value of U4, U5, or U6). The default threshold is 1. Do not change the default threshold; however, the offset correction can be changed.

bisulfite_conversion2

bisulfite_conversion2_background

These controls assess the efficiency of bisulfite conversion of the genomic DNA. The Infinium Methylation probes query a [C/T] polymorphism created by bisulfite conversion of non-CpG cytosines in the genome.
These controls use Infinium II probe design and single base extension to monitor efficiency of bisulfite conversion. If the bisulfite conversion reaction was successful, the "A" base gets incorporated and the probe has intensity in the red channel. If the sample has unconverted DNA, the "G" base gets incorporated across the unconverted cytosine, and the probe has elevated signal in the green channel.
The calculation is done using both channels for 1 set of numbers returned.
The following metrics are provided:
- (Lowest of red C 1, 2, 3, or 4) / (Highest of green C 1, 2, 3, or 4). The default threshold is 1. This value can be increased for some scanners.
- Background/(Highest C1, C2, C3, or C4 green). The default threshold is 1. Do not change the default threshold; however, the offset correction can be changed.

specificity1_green

specificity1_red

Specificity controls are designed to monitor potential nonspecific primer extension for Infinium I and Infinium II assay probes. Specificity controls are designed against nonpolymorphic T sites.
These controls are designed to monitor allele-specific extension for Infinium I probes. The methylation status of a particular cytosine is determined following bisulfite treatment of DNA by using query probes for the unmethylated and methylated states of each CpG locus. In assay oligo design, the A/T match corresponds to the unmethylated status of the interrogated C, and the G/C match corresponds to the methylated status of C. G/T mismatch controls check for nonspecific detection of methylation signal over unmethylated background. PM controls correspond to A/T perfect match and give high signal. MM controls correspond to G/T mismatch and give low signal.
The metrics provided are the ratio of the lowest PM/highest MM in each channel.
The default threshold is 1. Do not change the default threshold.

specificity2

specificity2_background

Specificity controls are designed to monitor potential nonspecific primer extension for Infinium I and Infinium II assay probes. Specificity controls are designed against nonpolymorphic T sites.
These controls are designed to monitor extension specificity for Infinium II probes and check for potential nonspecific detection of methylation signal over unmethylated background. Specificity II probes incorporate the "A" base across the nonpolymorphic T and have intensity in the Red channel. If there was nonspecific incorporation of the "G" base, the probe has elevated signal in the Green channel.
The following metrics are provided:
- (Lowest intensity of S1, S2, or S3 red) / (Highest intensity of S1, S2, or S3 green). The default threshold is 1. Do not change the default threshold.
- Background/(Highest intensity S1, S2, S3, or S4 green). The default threshold is 1. Do not change the default threshold; however, the offset correction can be changed.

nonpolymorphic_green

nonpolymorphic_red

Nonpolymorphic controls test the overall performance of the assay, from amplification to detection, by querying a particular base in a nonpolymorphic region of the genome. They let you compare assay performance across different samples. One nonpolymorphic control has been designed for each of the 4 nucleotides (A, T, C, and G).
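In the green channel, the lowest intensity of C or G should be greater than the highest intensity of A or T. In the red channel, the lowest intensity of A or T should be greater than the highest intensity of C or G.
The metric provided for a single sample is (lowest intensity for C or G)/(highest intensity for A or T) in the green channel and (lowest intensity for A or T)/(highest intensity for C or G) in the red channel.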
The default threshold is 5. This value can be increased for some scanners.

avg_green_raw

avg_red_raw

Average green and red raw signal for the given sample.

avg_green_norm

avg_red_norm

Average green and red signal after dye bias correction and noob normalization for the given sample.

ScanTime

The date (MM/DD/YY) and time (HH:MM) that the sample was scanned by the iScan system.

NProbes

Number of probes on the BeadChip, including SNP and CG probes.

NPassDetection

Number of probes on the BeadChip that passed detection p-value at the threshold defined.

prop_probes_passing

The proportion of probes passing, defined as the number of probes passing the detection p-value threshold divided by the total number of probes on the BeadChip.
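The relationship between NProbes, NPassDetection, and prop_probes_passing can be sketched in Python as follows. The function name and the p-values are hypothetical. The comparison used here (a probe passes when its p-value is at or below the threshold) is an assumption; the software's exact rule may differ.

```python
def detection_summary(p_values, threshold):
    """Return (NProbes, NPassDetection, prop_probes_passing).

    Assumption: a probe passes when its detection p-value is at or
    below the threshold; the software's exact comparison may differ.
    """
    n_probes = len(p_values)
    n_pass = sum(1 for p in p_values if p <= threshold)
    prop = n_pass / n_probes if n_probes else float("nan")
    return n_probes, n_pass, prop


# Made-up detection p-values for five probes.
print(detection_summary([0.001, 0.2, 0.03, 0.0, 0.5], threshold=0.05))
```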

passQC

1 = sample passed all QC metrics for the thresholds defined
0 = sample did not pass all QC metrics for the thresholds defined

failCodes

The list of parameters that failed QC for the thresholds defined.
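A minimal sketch of how passQC and failCodes could be derived from a set of metrics and thresholds is shown below. The function name and the example values are hypothetical. Two rules are assumptions rather than documented behavior: a metric passes only when it is strictly greater than its threshold, and unavailable (NA) metrics are skipped.

```python
import math


def evaluate_qc(metrics, thresholds):
    """Return (passQC, failCodes) for one sample.

    metrics: metric name -> computed value (None or NaN if unavailable).
    thresholds: metric name -> value the metric must exceed.
    Assumptions: a metric passes only when it is strictly greater than
    its threshold, and unavailable metrics are skipped, not failed.
    """
    fail_codes = []
    for name, threshold in thresholds.items():
        value = metrics.get(name)
        if value is None or math.isnan(value):
            continue  # assumption: NA metrics do not count as failures
        if not value > threshold:
            fail_codes.append(name)
    pass_qc = 0 if fail_codes else 1
    return pass_qc, fail_codes


# Made-up metric values checked against thresholds of 1 and 5.
metrics = {"specificity2": 4.2, "nonpolymorphic_green": 3.1}
thresholds = {"specificity2": 1, "nonpolymorphic_green": 5}
print(evaluate_qc(metrics, thresholds))
```

In this made-up example, the sample fails because nonpolymorphic_green does not exceed its threshold of 5.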

The control metrics in the QC summary files are calculated as follows. The default background correction offset (x) of 3,000 can be modified and applies to all background calculations indicated with (bkg + x). The table uses the default thresholds for EPIC arrays as an example; the default thresholds differ between methylation arrays. See the Threshold Adjustment section for additional details.

| Control | Calculation | Additional Information |
| --- | --- | --- |
| Restoration: Green > bkg | (Green/(bkg + x)) > 0 | If using the FFPE Restore kit, change the default threshold from 0 to 1. bkg = Extension Green highest A or T intensity. |
| Staining Green: Biotin High > Biotin Bkg | (High/Biotin Bkg) > 5 | |
| Staining Red: DNP High > DNP Bkg | (High/DNP Bkg) > 5 | |
| Extension Green: Lowest CG/Highest AT | (C or G/A or T) > 5 | Green channel: lowest C or G intensity is used; highest A or T intensity is used. |
| Extension Red: Lowest AT/Highest CG | (A or T/C or G) > 5 | Red channel: lowest A or T intensity is used; highest C or G intensity is used. |
| Hybridization Green: High > Medium > Low | (High/Med) > 1; (Med/Low) > 1 | |
| Target Removal Green: ctrl 1 ≤ bkg | ((bkg + x)/ctrl) > 1 | bkg = Extension Green highest A or T intensity. |
| Target Removal Green: ctrl 2 ≤ bkg | ((bkg + x)/ctrl) > 1 | bkg = Extension Green highest A or T intensity. |
| Bisulfite Conversion I Green: C1, 2 > U1, 2 | (C/U) > 1 | Lowest C intensity is used. Highest U intensity is used. |
| Bisulfite Conversion I Green: U ≤ bkg | ((bkg + x)/U) > 1 | For MSA arrays, the default is 0.5. Highest U intensity is used. Green channel: bkg = Extension Green highest AT. |
| Bisulfite Conversion I Red: C3, 4, 5 > U3, 4, 5 | (C/U) > 1 | Lowest C intensity is used. Highest U intensity is used. |
| Bisulfite Conversion I Red: U ≤ bkg | ((bkg + x)/U) > 1 | For MSA arrays, the default is 0.5. Highest U intensity is used. Red channel: bkg = Extension Red highest CG. |
| Bisulfite Conversion II: C Red > C Green | (C Red/C Green) > 1 | For MSA arrays, the default is 0.5. Lowest C Red intensity is used. Highest C Green intensity is used. |
| Bisulfite Conversion II: C Green ≤ bkg | ((bkg + x)/C Green) > 1 | For MSA arrays, the default is 0.5. Highest C Green intensity is used. Green channel: bkg = Extension Green highest AT. |
| Specificity I Green: PM > MM | (PM/MM) > 1 | Lowest PM intensity is used. Highest MM intensity is used. |
| Specificity I Red: PM > MM | (PM/MM) > 1 | Lowest PM intensity is used. Highest MM intensity is used. |
| Specificity II: S Red > S Green | (S Red/S Green) > 1 | Lowest S Red intensity is used. Highest S Green intensity is used. |
| Specificity II: S Green ≤ bkg | ((bkg + x)/S Green) > 1 | bkg = Extension Green highest A or T intensity. Highest S Green intensity is used. |
| Nonpolymorphic Green: Lowest CG/Highest AT | (C or G/A or T) > 5 | Lowest C or G intensity is used; highest A or T intensity is used. For MSA arrays, the default threshold is 2.5. |
| Nonpolymorphic Red: Lowest AT/Highest CG | (A or T/C or G) > 5 | Lowest A or T intensity is used; highest C or G intensity is used. For MSA arrays, the default threshold is 3. |
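Because some defaults differ between array types, it can help to keep the EPIC defaults and the MSA exceptions separate when scripting around these files. The sketch below encodes a subset of the table above. Mapping table rows to the QC summary field names used as keys is an assumption, and the function is hypothetical rather than part of the software.

```python
# Default thresholds for EPIC arrays, for a subset of the controls.
EPIC_DEFAULTS = {
    "bisulfite_conversion2": 1,
    "bisulfite_conversion2_background": 1,
    "specificity1_green": 1,
    "specificity1_red": 1,
    "specificity2": 1,
    "specificity2_background": 1,
    "nonpolymorphic_green": 5,
    "nonpolymorphic_red": 5,
}

# Rows where the table lists a different default for MSA arrays.
MSA_OVERRIDES = {
    "bisulfite_conversion2": 0.5,
    "bisulfite_conversion2_background": 0.5,
    "nonpolymorphic_green": 2.5,
    "nonpolymorphic_red": 3,
}


def default_thresholds(array_type):
    """Return a copy of the default thresholds for "EPIC" or "MSA"."""
    thresholds = dict(EPIC_DEFAULTS)
    if array_type == "MSA":
        thresholds.update(MSA_OVERRIDES)
    elif array_type != "EPIC":
        raise ValueError(f"unknown array type: {array_type}")
    return thresholds


print(default_thresholds("MSA")["nonpolymorphic_green"])
```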

Methylation Sample QC Summary Plots

The software produces one methylation sample QC summary plot file (sample_qc_summary.pdf) per analysis batch. The file contains two QC summary plots for quick visual review.

The file contains the following control plots:

Control Plot

Description

Proportion of Probes Passing Threshold

Histogram of the proportion of probes passing the p-value detection threshold. Samples passing QC are shown in one color, and samples failing QC are shown in another color.

Principal Component Analysis (PCA)

Uses beta values for all analytical probes to compare samples. Principal component analysis (PCA) is applied to the beta values to reduce the dimensionality of the data to two “principal components” that reflect the most variation across samples. If more than 100 samples are used in the analysis, a random subset of 10,000 probes is used for the PCA to reduce the computational burden. The PCA control plot assigns a unique color to each sample group defined in the IDAT Sample Sheet. If no groups were assigned, all samples appear in the same color. Samples from the same group may cluster together, and grouping can explain some of the variation. The coordinates used to plot each sample in the PCA control plot are provided in the pcs.tsv.gz output file (see below).
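The probe subsetting rule described for the PCA plot can be sketched in Python as follows. The function name, parameter names, and the fixed random seed are assumptions made for illustration. How the software draws its random subset is not documented here.

```python
import random


def select_pca_probes(probe_ids, n_samples, max_samples=100,
                      subset_size=10_000, seed=0):
    """Return the probes to use for PCA.

    All probes are used for batches of up to max_samples samples;
    larger batches use a random subset of subset_size probes.
    The fixed seed is an assumption made here for reproducibility.
    """
    probe_ids = list(probe_ids)
    if n_samples <= max_samples or len(probe_ids) <= subset_size:
        return probe_ids
    return random.Random(seed).sample(probe_ids, subset_size)


# Hypothetical probe identifiers for a batch of 150 samples.
probes = [f"cg{i:08d}" for i in range(25_000)]
print(len(select_pca_probes(probes, n_samples=150)))
```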

Methylation Principal Component Summary

The software produces one methylation principal component summary file (pcs.tsv.gz) per analysis batch, which provides principal component data for each sample in the batch. This file can be used to identify the specific samples associated with points on the PCA control plot in the Methylation Sample QC Summary Plots output file.

The file contains the following fields:

Field

Description

blank

BeadChip Barcode and Position, for example 123456789101_R01C01.

principal component 1

The value of the sample on the first axis of the principal component analysis.

principal component 2

The value of the sample on the second axis of the principal component analysis.

Sample_Group

Sample group defined by the user in the IDAT Sample Sheet. If no sample group was defined, all samples will show NA.
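To match a point on the PCA control plot to a sample, the file can be read with standard tools. The Python sketch below assumes the columns appear in the order described above (sample identifier, first component, second component, Sample_Group) and that the first row is a header. The function names and dictionary keys are hypothetical.

```python
import csv
import gzip


def load_pcs(path):
    """Read pcs.tsv.gz into a list of dicts, one per sample.

    Assumption: columns appear in the order sample identifier, PC1,
    PC2, Sample_Group, and the first row is a header.
    """
    samples = []
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # skip the header row
        for row in reader:
            if not row:
                continue  # ignore blank lines
            sample_id, pc1, pc2, group = row[:4]
            samples.append({
                "sample": sample_id,
                "pc1": float(pc1),
                "pc2": float(pc2),
                "group": group,
            })
    return samples


def nearest_sample(samples, x, y):
    """Return the sample whose (pc1, pc2) is closest to a plot position."""
    return min(samples,
               key=lambda s: (s["pc1"] - x) ** 2 + (s["pc2"] - y) ** 2)
```

For example, `nearest_sample(load_pcs("pcs.tsv.gz"), x, y)` returns the record closest to a position (x, y) read off the plot.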

Methylation Manifest Files

The software produces two methylation manifest files:

- Manifest in Sesame format (probes.csv)
- Additional information for control probes (controls.csv)

The probes.csv file has the following columns:

Field

Description

Probe_ID

This is a unique identifier for each probe. It corresponds to the IlmnID column in the standard Illumina manifest format or ctl_[AddressA_ID] for control probes.

This corresponds to the AddressA_ID column in the standard Illumina manifest format.

This corresponds to the AddressB_ID column in the standard Illumina manifest format.

col

This is the color channel for Infinium I probes (R/G). For Infinium II probes, this column will be NA.

The controls.csv file has the following columns:

Field

Description

Address

The address of the probe

Type

The control probe type

Color_Channel

A color used to denote certain control probes in legacy software

Name

A human readable identifier for certain control probes

Probe_ID

This is a unique identifier for each probe. It corresponds to the IlmnID column in the standard Illumina manifest format or ctl_[AddressA_ID] for control probes.
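As an illustration of working with controls.csv, the following Python sketch groups control probes by their Type column. It assumes the file has a header row with the column names listed above. The function name is hypothetical.

```python
import csv
from collections import defaultdict


def control_probes_by_type(path):
    """Group the Probe_ID values in controls.csv by the Type column.

    Assumption: the file has a header row containing at least the
    Type and Probe_ID columns described above.
    """
    by_type = defaultdict(list)
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            by_type[row["Type"]].append(row["Probe_ID"])
    return dict(by_type)
```

Calling `control_probes_by_type("controls.csv")` returns a dictionary that maps each control probe type to the list of probe identifiers of that type.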

Methylation Warning/Error Messages and Logs

The following scenarios result in a warning or error message:

- Missing IDATs or manifest
- Incorrect sample sheet formatting
- Duplicate BeadChip Barcode and Position within the sample sheet
- Missing control or assay probes
- Missing required columns in the manifest
- Unable to compute certain metrics

Examples of such notifications can include the following:

| Log | Error | Type | Cause |
| --- | --- | --- | --- |
| write_samplesheet.log | No IDATs found | Error | No IDATs provided for analysis |
| format_samplesheet.log | No samples in sample sheet | Error | No samples in user’s sample sheet input |
| format_samplesheet.log | Sample sheet not correctly formatted | Error | Sample sheet is not in CSV format or header lines do not start with “<” |
| format_samplesheet.log | beadChipName and sampleSectionName columns are required for the sample sheet. | Error | Sample sheet does not contain required columns: beadChipName and sampleSectionName. |
| format_samplesheet.log | Warning: <Number> samples have duplicate Sample_ID | Warning | X lines in the sample sheet have duplicate <beadChipName>_<sampleSectionName>. Duplicates are dropped from analysis. |
| convert_manifest_ilmn_sesame.log | Missing control probes in manifest | Error | Missing “[Controls]” line in CSV manifest |
| convert_manifest_ilmn_sesame.log | Probe section not found | Error | Missing “[Assay]” line in CSV manifest |
| convert_manifest_ilmn_sesame.log | Missing required columns: IlmnID, AddressA_ID, AddressB_ID, Color_Channel | Error | Missing one of required columns in Assay section of manifest |
| convert_manifest_ilmn_sesame.log | Controls not formatted correctly. Must have 4 columns (Address,Type,Color_Channel,Name) | Error | Missing one of required columns in Control section of manifest |
| run_sesame_gs.log | Missing sample: <Sample_ID> | Error | Missing IDATs for a particular sample |
| run_sesame_gs.log | No scan time available | Warning | No scan time in IDAT |
| run_sesame_gs.log | Prep failed | Error | Dye bias correction or noob failure for sample |
| run_sesame_gs.log | Warning: missing control probe types <Missing probes> | Warning | Missing control probe types to compute a BACR metric. Metric will be set to NA. |
| run_sesame_gs.log | Warning: missing control probe names <Missing probe types> | Warning | Missing control probes to compute a BACR metric. Metric will be set to NA. |
| qc.log | No features, skipping PCA plot | Warning | No common betas found in all samples. This may occur if a sample has no signal intensity in the IDAT files. |
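Duplicate sample sheet entries can be caught before analysis. The Python sketch below checks for repeated <beadChipName>_<sampleSectionName> keys in a simplified sample sheet that contains only a column header row and data rows; real sample sheets may include additional header lines. The function name is hypothetical, and keeping the first occurrence of each duplicate is an assumption, since the table above states only that duplicates are dropped.

```python
import csv
import io


def drop_duplicate_samples(rows):
    """Keep the first row for each <beadChipName>_<sampleSectionName>.

    Returns (kept_rows, n_duplicates). Keeping the first occurrence is
    an assumption; the source only states that duplicates are dropped.
    """
    seen = set()
    kept = []
    n_duplicates = 0
    for row in rows:
        key = f'{row["beadChipName"]}_{row["sampleSectionName"]}'
        if key in seen:
            n_duplicates += 1
            continue
        seen.add(key)
        kept.append(row)
    return kept, n_duplicates


# Simplified, made-up sample sheet with one repeated entry.
sheet = io.StringIO(
    "beadChipName,sampleSectionName\n"
    "123456789101,R01C01\n"
    "123456789101,R02C01\n"
    "123456789101,R01C01\n"
)
kept, n_dup = drop_duplicate_samples(csv.DictReader(sheet))
print(len(kept), n_dup)
```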
