Input Files

The following section describes the input files required by DRAGEN Array. Product files (anything other than the IDATs) can be found on the support site.

IDAT Files

For each sample, a pair of raw intensity files (.idat) is generated by the iScan System or NextSeq550 (for select arrays). The files provide intensities in the red and green channels for each probe on the Infinium array. More information on which arrays can be used with NextSeq550 can be found on the Illumina Knowledge page on NextSeq550.

An IDAT file is identified by the BeadChip Barcode (12-digit unique Sentrix ID, e.g., 123456789101), the BeadChip Position (row and column of the sample, e.g., R01C01), and Grn (Green) or Red for the specific channel.
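The three identifying parts can be pulled out of a file name with a short script. The following is a minimal sketch, assuming the common layout `<barcode>_<position>_<channel>.idat` (for example, 123456789101_R01C01_Grn.idat) and two-digit row and column numbers. The names `parse_idat_name` and `channels_present` are hypothetical helpers, not part of DRAGEN Array.

```python
import re
from typing import NamedTuple

# Assumed layout: <12-digit barcode>_<RxxCxx position>_<Grn|Red>.idat
IDAT_NAME = re.compile(
    r"^(?P<barcode>\d{12})_(?P<position>R\d{2}C\d{2})_(?P<channel>Grn|Red)\.idat$"
)


class IdatName(NamedTuple):
    barcode: str
    position: str
    channel: str


def parse_idat_name(filename: str) -> IdatName:
    """Split an IDAT file name into barcode, position, and channel."""
    match = IDAT_NAME.match(filename)
    if match is None:
        raise ValueError(f"Not a recognized IDAT file name: {filename}")
    return IdatName(**match.groupdict())


def channels_present(filenames: list[str], barcode: str, position: str) -> set[str]:
    """Collect the channels found for one sample, skipping unrelated files."""
    found = set()
    for name in filenames:
        match = IDAT_NAME.match(name)
        if match and match["barcode"] == barcode and match["position"] == position:
            found.add(match["channel"])
    return found
```

A sample has a complete pair when `channels_present(...)` returns both `Grn` and `Red`.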

Manifest Files

The CSV and BPM manifest files can be found on the Illumina Support Site for all commercial Infinium BeadChips or on MyIllumina for custom and semi-custom designs. DRAGEN Array only supports manifest files from the Illumina Support site. For instructions on obtaining manifest files from MyIllumina, see Illumina Knowledge article, How to access custom array product files (manifest and product definition files) in MyIllumina.

The CSV manifest file (.csv) provides complementary data to the BPM manifest file in a human-readable format. It is a required input to the genotype gtc-to-vcf command to enable VCF generation for insertion/deletion variants. gtc-to-vcf depends on accurate mapping information within the manifest and may produce inaccurate results if the mapping information is incorrect. Mapping information follows the implicit dbSNP standard, where:

Positions are reported with 1-based indexing.
Positions in the PAR are reported with mapping position to the X chromosome.
For an insertion relative to the reference, the position of the base immediately 5' to the insertion (on the plus strand) is given.
For a deletion relative to the reference, the position of the most 5' deleted base (on the plus strand) is given.
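The insertion and deletion conventions above can be checked against a toy sequence. This is an illustrative sketch only: the function `manifest_position` and the six-base reference are made up for the example, and the input is assumed to be a 0-based plus-strand index, as used by Python strings.

```python
def manifest_position(kind: str, start0: int) -> int:
    """Convert a 0-based plus-strand index to the 1-based position described above.

    For "deletion", start0 is the index of the most 5' deleted base.
    For "insertion", start0 is the index of the reference base that ends up
    immediately 3' of the inserted sequence.
    """
    if kind == "deletion":
        # 1-based position of the first deleted base.
        return start0 + 1
    if kind == "insertion":
        # The base immediately 5' of the insertion has 0-based index start0 - 1,
        # which is 1-based position start0.
        return start0
    raise ValueError(f"Unknown variant kind: {kind}")


reference = "ACGTAC"  # toy plus-strand sequence, 1-based positions 1..6

# Deleting "GT" (reference[2:4]): the most 5' deleted base is G at position 3.
deletion_pos = manifest_position("deletion", 2)

# Inserting between C (position 2) and G (position 3): report position 2.
insertion_pos = manifest_position("insertion", 2)
```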

Cluster File

The cluster file (.egt) is a standard product file provided by Illumina for commercial genotyping products and it is a required input for the genotype call command in DRAGEN Array. Custom cluster files may be required for optimal genotyping performance. See section Optimizing cluster files and copy number models for additional details.

PGx CN Model File

The PGx CN (Copy Number) model file (.dat) is a required input to the pgx copy-number call command to enable accurate copy number calling for pharmacogenomics. Illumina provides a standard CN model file for each PGx array product. See section Optimizing cluster files and copy number models for additional details.

Cytogenetics Model File

The cytogenetics CN (Copy Number) model file (.dat) is a required input to the cyto call command to enable accurate Cytogenetics analysis. Illumina provides a standard CN model file for each supported array product. For custom or other products, please contact Tech Support to request a CN model file and include the product BPM manifest.

Note: The CN model file must be updated whenever the manifest is revised, because probes can be added or removed in a revision. A mismatch between the CN model file and the manifest will cause an error during pgx copy-number call and cyto call.

Mask File

The mask file (.msk) is a required input to the pgx copy-number train command to enable accurate pgx copy number training for pharmacogenomics. It does not need to be provided as an explicit input to the command line interface but should reside in the same folder as the BPM manifest. It should have the same base name as the manifest for the product. Illumina provides a mask file for each PGx array product and these can be found on the product files support page.
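Because the mask file is located implicitly, it can be useful to confirm that it is present before starting a training run. The sketch below derives the expected path from the BPM manifest path (same folder, same base name, .msk extension). The helper names are hypothetical and not part of DRAGEN Array.

```python
from pathlib import Path


def expected_mask_path(bpm_manifest: str) -> Path:
    """Return the mask path implied by the manifest: same folder, same base name."""
    return Path(bpm_manifest).with_suffix(".msk")


def check_mask_present(bpm_manifest: str) -> Path:
    """Raise FileNotFoundError if the mask file is not next to the manifest."""
    mask = expected_mask_path(bpm_manifest)
    if not mask.is_file():
        raise FileNotFoundError(f"Expected mask file next to the manifest: {mask}")
    return mask
```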

PGx Database File

The PGx database file (.zip) contains the variant mapping information from Infinium PGx arrays to PGx variants. Each line in this file represents a single probe ID mapping to a variant's HGVS (Human Genome Variation Society) tag. This creates a map of many probes to one variant. DRAGEN Array cross references this map with SNV VCF IDs during runtime to do star allele calling. It works across all supported PGx products, even though the probes and variant coverage differ across them.
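The many-to-one relationship can be pictured with a small sketch. The internal layout of the database file is not described here, so the probe IDs and variant tags below are made-up placeholders, and the sketch assumes that SNV VCF IDs correspond to probe IDs. It illustrates the idea only, not how DRAGEN Array implements it.

```python
from collections import defaultdict


def group_probes_by_variant(pairs: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Turn (probe_id, variant_tag) lines into a variant -> probes map."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for probe_id, variant_tag in pairs:
        grouped[variant_tag].append(probe_id)
    return dict(grouped)


def variants_with_data(grouped: dict[str, list[str]], vcf_ids: set[str]) -> dict[str, list[str]]:
    """Keep only variants that have at least one probe present in the VCF."""
    covered = {}
    for variant_tag, probes in grouped.items():
        present = [probe for probe in probes if probe in vcf_ids]
        if present:
            covered[variant_tag] = present
    return covered


# Placeholder lines: two probes map to variant_1, one probe maps to variant_2.
lines = [("probe_A", "variant_1"), ("probe_B", "variant_1"), ("probe_C", "variant_2")]
grouped = group_probes_by_variant(lines)

# A product that carries only probe_B still covers variant_1, but not variant_2.
covered = variants_with_data(grouped, vcf_ids={"probe_B"})
```

Under these assumptions, one shared map can serve products with different probe content, because each product simply matches whichever of its probes appear in the map.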

Cytogenetics Database File

The cytogenetics database file (.zip) contains information from the Ensembl and RefSeq data sources used to generate the Cytogenetics Annotation JSON File. This file can be used across products (BeadChip/manifest types and versions). It is only required as an input to local analysis (i.e., cyto annotate), because it is already stored in the cloud for cloud analysis. It may be updated in the future to accommodate changes in the underlying Ensembl and RefSeq data sources.

Genome FASTA Files

The genome FASTA file (.fa) is a text file containing the reference genome sequences. The FASTA index file (.fai) contains metadata about how the chromosomes are laid out within the FASTA file for a particular species. DRAGEN Array PGx calling supports human genome builds 37 and 38. The genome FASTA file and FASTA index file are both provided by Illumina for human and should be stored together in the same input folder.

For custom reference genomes, the contig identifiers in the provided genome FASTA file must exactly match the chromosome identifiers specified in the provided manifest. For a standard human product manifest, this means that the contig headers should read ">1" rather than ">chr1".

Note: The genome FASTA file is only required for the dragen-array-local-analysis workflow. If you are using dragen-array-cloud-analysis, you do not need to provide this file.
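For a custom reference, the contig naming requirement can be checked before running an analysis. The sketch below reads contig names from the FASTA index, relying on the standard .fai layout in which the first tab-separated column is the contig name. How the chromosome identifiers are obtained from the manifest is outside this example, so they are passed in as a plain set. The function names are hypothetical.

```python
def contig_names(fai_path: str) -> list[str]:
    """Read contig names from a FASTA index (first tab-separated column)."""
    names = []
    with open(fai_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                names.append(line.split("\t")[0])
    return names


def chromosomes_missing_from_fasta(manifest_chromosomes: set[str], fai_path: str) -> set[str]:
    """Return manifest chromosome identifiers with no identically named contig."""
    return manifest_chromosomes - set(contig_names(fai_path))
```

A non-empty result for a manifest that uses "1", "2", and so on typically indicates that the FASTA uses "chr"-prefixed contig names.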

Sample Sheet

The sample sheet is a CSV-formatted input file that uses two required fields for sample lookup (SentrixBarcode_A and SentrixPosition_A for local analysis; beadChipName and sampleSectionName for cloud analysis). It makes it possible to add optional metadata and to analyze a filtered list of samples within a folder. It is intended to be flexible, and the local version should be backwards compatible with most GenomeStudio samplesheets.

The root folder in which DRAGEN Array searches for the sample files can be set either with the --idat-folder or --gtc-folder options (where applicable) or with the RootFolder field in the [Header] section. RootFolder should be the full absolute path to the sample files. For example:

[Header]
RootFolder,/test/samples
[Data]
....

Note: In the case of conflict between RootFolder and the CLI options (--idat-folder or --gtc-folder), the CLI options take precedence.
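The precedence rule in the note can be expressed as a small function. This is a sketch of the rule as described, not DRAGEN Array code; `resolve_root_folder` and its parameters are hypothetical names.

```python
from typing import Optional


def resolve_root_folder(cli_folder: Optional[str], header_root: Optional[str]) -> str:
    """Pick the folder to search: the CLI option wins over RootFolder."""
    if cli_folder:
        # Value of --idat-folder or --gtc-folder, when given.
        return cli_folder
    if header_root:
        # Value of RootFolder from the [Header] section.
        return header_root
    raise ValueError("No sample folder given on the command line or in RootFolder")
```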

The following are all examples of valid samplesheets:

Most basic (no sections, one sample)

SentrixBarcode_A,SentrixPosition_A
204753010023,R02C01

Medium complexity (no sections, multiple samples, optional data)

SentrixBarcode_A,SentrixPosition_A,Sample_ID,Sample_Group,MetaData1
204753010023,R01C01,NA1231,Group1,F
204753010024,R01C01,NA1233,Group2,M

High complexity (sections, multiple samples, optional data)

[Header]
RootFolder,/tests/samples
Date,1/1/2025
[Data]
SentrixBarcode_A,SentrixPosition_A,Sample_ID,Sample_Group,MetaData1
204753010023,R01C01,NA1231,Group1,F
204753010024,R01C01,NA1233,Group2,M
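All three layouts above can be read with one routine. The following is a minimal sketch of a reader that treats a sheet without sections as a single [Data] block and skips any section other than [Header] and [Data]. It is illustrative only and is not the parser used by DRAGEN Array; `parse_sample_sheet` is a hypothetical name.

```python
import csv
import io


def parse_sample_sheet(text: str) -> tuple[dict[str, str], list[dict[str, str]]]:
    """Return (header_fields, sample_rows) for a sectioned or section-free sheet."""
    header: dict[str, str] = {}
    data_lines: list[str] = []
    section = "data"  # a sheet without sections is read as one [Data] block
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("[") and "]" in stripped:
            # Section marker such as [Header] or [Data].
            section = stripped[1:stripped.index("]")].lower()
            continue
        if section == "header":
            key, _, value = stripped.partition(",")
            header[key] = value
        elif section == "data":
            data_lines.append(stripped)
        # Lines in any other section are skipped.
    rows = list(csv.DictReader(io.StringIO("\n".join(data_lines))))
    return header, rows
```

Each entry in `sample_rows` is a dictionary keyed by column name, so required and optional fields are accessed the same way.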

Notes:

The column names are case-insensitive. For example, the columns Sample_Name and sample_name would be considered the same, and the software would produce an error like this: Duplicate column sample_name found. Column names are case-insensitive. Please remove or rename the column from the samplesheet and re-process.
Because user-provided fields are output in the Genotype Summary File, the column names cannot conflict with the fields in that file. For example, if the user provides a column named Sex Estimate in their samplesheet, DRAGEN Array will produce the following error: Sex Estimate is a reserved keyword. Please remove or rename the column from the samplesheet and re-process.
The optional fields (i.e., any columns other than SentrixBarcode_A and SentrixPosition_A) will be output as-is in the genotype summary files for the genotype call command.
The [Manifests] section (used by GenomeStudio to delineate manifests in multi-manifest analyses) is currently ignored in DRAGEN Array.
There is a known issue regarding empty columns in the v1.3 Release Notes.
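The first two notes can be checked ahead of time with a short pre-flight script. This sketch is an approximation: only Sex Estimate is confirmed above as a reserved keyword, so the `RESERVED` set is an incomplete placeholder, and comparing reserved names case-insensitively is an assumption. `validate_columns` is a hypothetical helper.

```python
# Only "Sex Estimate" is confirmed by the notes above; the full list is not given here.
RESERVED = {"sex estimate"}


def validate_columns(columns: list[str]) -> None:
    """Raise ValueError on duplicate (case-insensitive) or reserved column names."""
    seen: set[str] = set()
    for name in columns:
        key = name.strip().lower()
        if key in seen:
            raise ValueError(
                f"Duplicate column {key} found. Column names are case-insensitive."
            )
        if key in RESERVED:  # assumption: reserved names are matched case-insensitively
            raise ValueError(f"{name} is a reserved keyword.")
        seen.add(key)
```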

For cloud analyses (i.e., when the samplesheet is used for sample selection in running cloud analyses), the samplesheet does not currently support sections such as [Header] and [Data]. Instead of using the SentrixBarcode_A and SentrixPosition_A columns as the sample keys, it uses beadChipName and sampleSectionName. For example, a valid cloud samplesheet could look like this:

beadChipName,sampleSectionName
204753010023,R01C01
204753010023,R02C01
204753010024,R01C01
204753010024,R02C01

There is also a template available on the sample selection interface on BaseSpace.
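An existing section-free local samplesheet can be adapted for cloud use by renaming its two key columns. The sketch below does only that renaming and leaves all other columns and rows unchanged. It assumes the input has no [Header] or [Data] sections, and `local_to_cloud` is a hypothetical helper rather than an Illumina tool.

```python
import csv
import io

# Local key columns and their cloud equivalents, matched case-insensitively.
COLUMN_MAP = {
    "sentrixbarcode_a": "beadChipName",
    "sentrixposition_a": "sampleSectionName",
}


def local_to_cloud(local_csv: str) -> str:
    """Rename the key columns of a section-free local samplesheet."""
    rows = [row for row in csv.reader(io.StringIO(local_csv)) if row]
    header = [COLUMN_MAP.get(col.strip().lower(), col) for col in rows[0]]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows[1:])
    return out.getvalue()
```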

Methylation QC sample sheet

For DRAGEN Array Methylation QC on cloud, additional optional sample sheet fields are used in the analysis.

Following Sample_Group, any number of additional columns can be added to include metadata fields such as sex, sample type, plate and well information, etc. Additional columns added after the Sample_Group column may have user-defined column header values. The Sample_ID field and any additional metadata added will be replicated in the Sample QC Summary output files.

The Sample_Group field will be used to populate the PCA Control Plot within the Sample QC Summary Plots file and the Principal Component Summary file. For the PCA Control Plot, each sample group will be assigned a unique color, so samples with the same Sample_Group value will have the same color. For example:

beadChipName,sampleSectionName,Sample_ID,Sample_Group,MetaData1
204753010023,R01C01,NA1231,Group1,F
204753010023,R02C01,NA1232,Group2,F
204753010024,R01C01,NA1233,Group2,M
204753010024,R02C01,NA1234,Group1,M

Cytogenetics analysis + Emedgene interpretation sample sheet

For Cytogenetics analysis + Emedgene interpretation on cloud, an additional column, demographicSex, will be compared against the Sex Estimate output from the DRAGEN Array genotyping module and displayed in Emedgene. The allowed values for this field are M (Male), F (Female), or U (Unknown).

Example:

beadChipName,sampleSectionName,demographicSex
204753010023,R01C01,F
204753010023,R02C01,F
204753010024,R01C01,M
204753010024,R02C01,M
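The allowed values can be checked before the sheet is uploaded. This is a small sketch that lists rows whose demographicSex is not M, F, or U. Whether the cloud workflow accepts lowercase values is not stated, so the check here is strict. `invalid_demographic_sex` is a hypothetical helper.

```python
import csv
import io

ALLOWED_SEX = {"M", "F", "U"}


def invalid_demographic_sex(sheet_csv: str) -> list[tuple[str, str, str]]:
    """Return (beadChipName, sampleSectionName, value) for rows with a bad value."""
    bad = []
    for row in csv.DictReader(io.StringIO(sheet_csv)):
        value = (row.get("demographicSex") or "").strip()
        if value not in ALLOWED_SEX:
            bad.append((row["beadChipName"], row["sampleSectionName"], value))
    return bad
```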

Input File Summary Table

In addition to the input files, there is a set of intermediate files, including GTC, SNV VCF, CNV VCF, and PGx CSV, which are outputs of some DRAGEN Array Local commands and inputs to other commands.

The table below summarizes the input and intermediate files, their sources, and the associated DRAGEN Array Local commands and options.

| Input File | Source | Command | Option |
| --- | --- | --- | --- |
| IDAT | User provided from scanning instrument | genotype call | --idat-folder |
| CSV Manifest | Product file from Illumina | genotype gtc-to-vcf | --csv-manifest |
| BPM Manifest | Product file from Illumina | pgx copy-number train, genotype call, genotype gtc-to-bedgraph, genotype gtc-to-vcf | --bpm-manifest |
| Cluster File | Product file from Illumina or user created using GenomeStudio | genotype call | --cluster-file |
| PGx CN Model | Product file from Illumina or user created using DRAGEN Array Local | pgx copy-number call | --cn-model |
| Cytogenetics CN Model | Product file from Illumina | cyto call | --cn-model |
| PGx Database | Product file from Illumina | pgx star-allele call | --database |
| Cytogenetics Database | Product file from Illumina | cyto annotate | --database |
| Genome FASTA | Product file from Illumina | genotype gtc-to-vcf, pgx copy-number train | --genome-fasta-file |
| Sample Sheet | User provided | genotype call, genotype gtc-to-bedgraph, genotype gtc-to-vcf, pgx copy-number call, pgx copy-number train | --sample-sheet |
| GTC | DRAGEN Array output from genotype call | genotype gtc-to-bedgraph, genotype gtc-to-vcf, pgx copy-number call, pgx copy-number train | --gtc-folder |
| SNV and PGx CNV VCF | DRAGEN Array output from genotype gtc-to-vcf and pgx copy-number call | pgx star-allele call | --vcf-folder |
| PGx CSV | DRAGEN Array output from pgx star-allele call | pgx star-allele annotate | --star-alleles |
| Cytogenetics CNV VCF | DRAGEN Array output from cyto call | cyto annotate | --vcf-folder |
