The following section describes the input files required by DRAGEN Array. Product files (anything other than the IDATs) can be found on the support site.
For each sample, a pair of raw intensity files (.idat) is generated by the iScan System or NextSeq550 (for select arrays). The files provide intensities in the red and green channels for each probe on the Infinium array. More information on which arrays can be used with NextSeq550 can be found on the Illumina Knowledge page on NextSeq550.
An IDAT file is identified by the BeadChip Barcode (12-digit unique Sentrix ID, e.g. 123456789101), BeadChip Position (row and column of the sample, e.g. R01C01), and Grn (Green) or Red for the specific channel.
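The three identifying parts can be pulled out of a file name with a short parser. This is a minimal sketch that assumes the common `<barcode>_<position>_<channel>.idat` naming pattern with underscores as separators; the pattern and the helper name `parse_idat_name` are illustrative assumptions, not part of DRAGEN Array.

```python
import re
from typing import NamedTuple

# Assumed file name layout: <12-digit barcode>_<RxxCyy>_<Grn|Red>.idat
IDAT_NAME = re.compile(
    r"^(?P<barcode>\d{12})_(?P<position>R\d{2}C\d{2})_(?P<channel>Grn|Red)\.idat$"
)


class IdatName(NamedTuple):
    barcode: str   # BeadChip Barcode (Sentrix ID)
    position: str  # BeadChip Position, row and column
    channel: str   # "Grn" or "Red"


def parse_idat_name(filename: str) -> IdatName:
    """Split an IDAT file name into barcode, position, and channel."""
    match = IDAT_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a recognized IDAT file name: {filename!r}")
    return IdatName(**match.groupdict())


print(parse_idat_name("123456789101_R01C01_Grn.idat"))
```

A name that does not fit the assumed pattern raises `ValueError`, which makes misnamed files easy to spot before analysis.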
The CSV and BPM manifest files can be found on the Illumina Support Site for all commercial Infinium BeadChips or on MyIllumina for custom and semi-custom designs. DRAGEN Array only supports manifest files from the Illumina Support Site. For instructions on obtaining manifest files from MyIllumina, see the Illumina Knowledge article How to access custom array product files (manifest and product definition files) in MyIllumina.
The CSV manifest file (.csv) provides complementary data to the BPM manifest file in a human-readable format. It is a required input to the genotype gtc-to-vcf command to enable VCF generation for insertion/deletion variants. gtc-to-vcf depends on the presence of accurate mapping information within the manifest and may produce inaccurate results if the mapping information is incorrect. Mapping information follows the implicit dbSNP standard, where:
Positions are reported with 1-based indexing.
Positions in the pseudoautosomal region (PAR) are reported using their mapping position on the X chromosome.
For an insertion relative to the reference, the position of the base immediately 5' to the insertion (on the plus strand) is given.
For a deletion relative to the reference, the position of the most 5' deleted base (on the plus strand) is given.
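The two indel conventions differ by one base, which is easy to get wrong. The sketch below works through both on a toy plus-strand reference; the sequence and the helper names are hypothetical, and the index arguments are ordinary 0-based Python slice indices.

```python
def insertion_position(slice_index: int) -> int:
    """1-based position reported for an insertion placed at ref[:i] + ins + ref[i:].

    The reported base is the one immediately 5' of the insertion, i.e. the
    0-based index i - 1, which is 1-based position i.
    """
    if slice_index < 1:
        raise ValueError("no reference base lies 5' of the insertion")
    return slice_index


def deletion_position(slice_start: int) -> int:
    """1-based position reported for a deletion of ref[i:j].

    The reported base is the most 5' deleted base, i.e. the 0-based
    index i, which is 1-based position i + 1.
    """
    if slice_start < 0:
        raise ValueError("slice_start must be non-negative")
    return slice_start + 1


ref = "ACGTAC"  # toy plus-strand reference, 1-based positions 1..6

# Insertion of "GG" between G (position 3) and T (position 4).
assert ref[:3] + "GG" + ref[3:] == "ACGGGTAC"
print("insertion reported at", insertion_position(3))  # the G at position 3

# Deletion of "TA" (0-based slice 3:5).
assert ref[:3] + ref[5:] == "ACGC"
print("deletion reported at", deletion_position(3))    # the T at position 4
```

For the same slice index, the insertion is reported one position lower than the deletion, because the insertion anchors on the preceding retained base while the deletion anchors on the first removed base.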
The cluster file (.egt) is a standard product file provided by Illumina for commercial genotyping products and it is a required input for the genotype call command in DRAGEN Array. Custom cluster files may be required for optimal genotyping performance. See section Optimizing cluster files and copy number models for additional details.
The CN (Copy Number) model file (.dat) is a required input to the copy-number call command to enable accurate copy number calling for pharmacogenomics. Illumina provides a standard CN model file for each PGx array product. See section Optimizing cluster files and copy number models for additional details.
The mask file (.msk) is a required input to the copy-number train command to enable accurate copy number training for pharmacogenomics. It does not need to be provided as an explicit input to the command line interface but should reside in the same folder as the BPM manifest. It should have the same base name as the manifest for the product. Illumina provides a mask file for each PGx array product and these can be found on the product files support page.
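Because the mask file is located by convention rather than by an explicit option, it can help to check for it before training. This is a minimal sketch, assuming "same base name" means the manifest file name with its extension replaced by `.msk`; the helper name and the example path are hypothetical.

```python
from pathlib import Path, PurePosixPath


def expected_mask_path(bpm_manifest: str) -> PurePosixPath:
    """Return where the mask file is expected: same folder and base name, .msk extension."""
    return PurePosixPath(bpm_manifest).with_suffix(".msk")


def mask_is_present(bpm_manifest: str) -> bool:
    """True if a mask file sits next to the given BPM manifest on disk."""
    return Path(expected_mask_path(bpm_manifest)).is_file()


# Hypothetical manifest location, for illustration only.
print(expected_mask_path("/data/products/MyPGxArray.bpm"))
```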
The PGx database file (.zip) contains the variant mapping information from Infinium PGx arrays to PGx variants. For each gene and each variant used in the star allele definitions of the gene, there is a mapping to the ID field in the SNV VCF file. Each line in the gene mapping file represents a single variant and contains the SNV VCF ID for that variant followed by the HGVS (Human Genome Variation Society) tag for the variant. The PGx database file is array specific and is one of the product files provided by Illumina for each PGx array product.
The genome FASTA file (.fa) is a text file with the reference genome sequences. The FASTA index file (.fai) contains metadata about how the chromosomes are organized within the FASTA file for a particular species. DRAGEN Array PGx calling supports human genome builds 37 and 38. The genome FASTA file and FASTA index file are both provided by Illumina for human species and should be stored together in the same input folder. For custom reference genomes, the contig identifiers in the provided genome FASTA file must match exactly the chromosome identifiers specified in the provided manifest. For a standard human product manifest, this means that the contig headers should read ">1" rather than ">chr1".
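A mismatch between contig names and manifest chromosome names can be caught before running any command. The following sketch relies on the standard `.fai` layout, in which the first tab-separated column of each line is the contig name; how the chromosome identifiers are extracted from the manifest is left out, and the function names and the sample index lines are illustrative assumptions.

```python
from typing import Iterable, List, Set


def fai_contig_names(fai_text: str) -> List[str]:
    """Return contig names from .fai content (first tab-separated column)."""
    return [line.split("\t")[0] for line in fai_text.splitlines() if line.strip()]


def missing_contigs(manifest_chromosomes: Iterable[str], fai_text: str) -> Set[str]:
    """Chromosome identifiers used by the manifest that the FASTA index lacks."""
    return set(manifest_chromosomes) - set(fai_contig_names(fai_text))


# Illustrative index content; lengths and offsets are made-up values.
example_fai = "1\t1000\t3\t60\t61\nX\t500\t1023\t60\t61\n"

print(missing_contigs(["1", "X"], example_fai))     # nothing missing
print(missing_contigs(["chr1", "X"], example_fai))  # "chr1" does not match "1"
```

An empty result means every manifest chromosome has an identically named contig; any name returned points to a ">chr1" versus ">1" style mismatch or a missing sequence.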
For local analysis, the IDAT sample sheet can be a CSV or JSON formatted file with direct paths to sample IDAT files. It enables easy analysis of samples from different directories.
Example CSV format:
Green IDAT Path,Red IDAT Path
/path/to/sample1_Grn.idat,/path/to/sample1_Red.idat
/path/to/sample2_Grn.idat,/path/to/sample2_Red.idat
/path/to/sample3_Grn.idat,/path/to/sample3_Red.idat
Example JSON format:
[
{
"Green IDAT Path": "/path/to/sample1_Grn.idat",
"Red IDAT Path": "/path/to/sample1_Red.idat"
},
{
"Green IDAT Path": "/path/to/sample2_Grn.idat",
"Red IDAT Path": "/path/to/sample2_Red.idat"
},
{
"Green IDAT Path": "/path/to/sample3_Grn.idat",
"Red IDAT Path": "/path/to/sample3_Red.idat"
}
]
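Sample sheets in either format can be generated from a list of IDAT paths rather than written by hand. This is a minimal sketch that pairs files by the `_Grn.idat` and `_Red.idat` suffixes used in the examples above; the function names are hypothetical, and only the column headers come from the documented format.

```python
import csv
import io
import json

GREEN_SUFFIX = "_Grn.idat"
RED_SUFFIX = "_Red.idat"
HEADERS = ["Green IDAT Path", "Red IDAT Path"]


def pair_idats(paths):
    """Pair green and red IDAT paths that share everything before the channel suffix."""
    green = {p[: -len(GREEN_SUFFIX)]: p for p in paths if p.endswith(GREEN_SUFFIX)}
    red = {p[: -len(RED_SUFFIX)]: p for p in paths if p.endswith(RED_SUFFIX)}
    # Samples missing either channel are skipped.
    return [(green[stem], red[stem]) for stem in sorted(green) if stem in red]


def to_csv(pairs):
    """Render the pairs as CSV sample sheet text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADERS)
    writer.writerows(pairs)
    return buffer.getvalue()


def to_json(pairs):
    """Render the pairs as JSON sample sheet text."""
    records = [dict(zip(HEADERS, pair)) for pair in pairs]
    return json.dumps(records, indent=2)


paths = [
    "/run1/sample1_Red.idat",
    "/run2/sample2_Grn.idat",
    "/run1/sample1_Grn.idat",
    "/run2/sample2_Red.idat",
]
print(to_csv(pair_idats(paths)))
```

Using `json.dumps` for the JSON variant also guarantees syntactically valid output, for example no trailing comma after the last entry.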
For cloud analysis, the IDAT sample sheet can be a CSV formatted file.
beadChipName,sampleSectionName
Beadchip 1 barcode (204753010023), sample section (R01C01)
Beadchip 1 barcode (204753010023), sample section (R02C01)
Beadchip 2 barcode (204753010024), sample section (R01C01)
Beadchip 2 barcode (204753010024), sample section (R02C01)
For DRAGEN Array Methylation QC on cloud, additional optional sample sheet fields are available.
Following Sample_Group, any number of additional columns can be added to include metadata fields such as sex, sample type, plate and well information, etc. Additional columns added after the Sample_Group column may have user-defined column header values. The Sample_ID field and any additional metadata added will be replicated in the Sample QC Summary output files.
The Sample_Group field will be used to populate the PCA Control Plot within the Sample QC Summary Plots file and the Principal Component Summary file. For the PCA Control Plot, each sample group will be assigned a unique color. Samples assigned to the same Sample_Group value will be the same color in the PCA Control Plot.
beadChipName,sampleSectionName,Sample_ID,Sample_Group,MetaData1
Beadchip 1 barcode (204753010023), sample section (R01C01),NA1231,Group1,F
Beadchip 1 barcode (204753010023), sample section (R02C01),NA1232,Group2,F
Beadchip 2 barcode (204753010024), sample section (R01C01),NA1233,Group2,M
Beadchip 2 barcode (204753010024), sample section (R02C01),NA1234,Group1,M
The GTC sample sheet is a CSV or JSON formatted file with direct paths to sample GTC files. It enables easy analysis of samples from different directories.
Example CSV format:
GTC Path
/path/to/sample1.gtc
/path/to/sample2.gtc
/path/to/sample3.gtc
Example JSON format:
[
{
"GTC Path": "/path/to/sample1.gtc"
},
{
"GTC Path": "/path/to/sample2.gtc"
},
{
"GTC Path": "/path/to/sample3.gtc"
}
]
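Since the GTC sample sheet can arrive in either format, a script that pre-checks it needs to read both. The sketch below is one way to do that, assuming a JSON sheet starts with `[` and anything else is CSV; the function name is hypothetical, and only the `GTC Path` header comes from the documented format.

```python
import csv
import io
import json
from typing import List


def read_gtc_sample_sheet(text: str) -> List[str]:
    """Return the GTC paths listed in CSV or JSON sample sheet text."""
    stripped = text.lstrip()
    if stripped.startswith("["):
        # JSON form: a list of objects, each with a "GTC Path" key.
        return [entry["GTC Path"] for entry in json.loads(stripped)]
    # CSV form: a single "GTC Path" column with a header row.
    reader = csv.DictReader(io.StringIO(text))
    return [row["GTC Path"] for row in reader]


csv_sheet = "GTC Path\n/path/to/sample1.gtc\n/path/to/sample2.gtc\n"
json_sheet = '[{"GTC Path": "/path/to/sample1.gtc"}, {"GTC Path": "/path/to/sample2.gtc"}]'

print(read_gtc_sample_sheet(csv_sheet))
print(read_gtc_sample_sheet(json_sheet))
```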
In addition to the input files, there is a set of intermediate files, including GTC, SNV VCF, CNV VCF, and PGx CSV, which are outputs of some DRAGEN Array Local commands and inputs to other commands.
The table below summarizes the input and intermediate files, their sources, and the associated DRAGEN Array Local commands and options.
Input File | Source | Command | Option |
---|---|---|---|
IDAT | User provided from scanning instrument | genotype call | --idat-folder |
CSV Manifest | Product file from Illumina | genotype gtc-to-vcf | --csv-manifest |
BPM Manifest | Product file from Illumina | copy-number train, genotype call, genotype gtc-to-bedgraph, genotype gtc-to-vcf | --bpm-manifest |
Cluster File | Product file from Illumina or user created using GenomeStudio | genotype call | --cluster-file |
CN Model | Product file from Illumina or user created using DRAGEN Array Local | copy-number call | --cn-model |
PGx Database | Product file from Illumina | star-allele call | --database |
Genome FASTA | Product file from Illumina | genotype gtc-to-vcf, copy-number train | --genome-fasta-file |
IDAT Sample Sheet | User provided | genotype call | --idat-sample-sheet |
GTC Sample Sheet | User provided | genotype gtc-to-bedgraph, genotype gtc-to-vcf, copy-number call, copy-number train | --gtc-sample-sheet |
GTC | DRAGEN Array output from genotype call | genotype gtc-to-bedgraph, genotype gtc-to-vcf, copy-number call, copy-number train | --gtc-folder |
SNV and CNV VCF | DRAGEN Array output from genotype gtc-to-vcf and copy-number call | star-allele call | --vcf-folder |
PGx CSV | DRAGEN Array output from star-allele call | star-allele annotate | --star-alleles |
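The table is organized by file, but when preparing a run it is often more useful to ask which options a given command is associated with. The sketch below encodes the table as a dictionary and inverts it; it only restates the table, does not say which options are mandatory or mutually exclusive, and the names `OPTION_COMMANDS` and `options_for` are hypothetical.

```python
from typing import Dict, List

GTC_CONSUMERS = [
    "genotype gtc-to-bedgraph",
    "genotype gtc-to-vcf",
    "copy-number call",
    "copy-number train",
]

# Option -> commands, transcribed from the table in table order.
OPTION_COMMANDS: Dict[str, List[str]] = {
    "--idat-folder": ["genotype call"],
    "--csv-manifest": ["genotype gtc-to-vcf"],
    "--bpm-manifest": [
        "copy-number train",
        "genotype call",
        "genotype gtc-to-bedgraph",
        "genotype gtc-to-vcf",
    ],
    "--cluster-file": ["genotype call"],
    "--cn-model": ["copy-number call"],
    "--database": ["star-allele call"],
    "--genome-fasta-file": ["genotype gtc-to-vcf", "copy-number train"],
    "--idat-sample-sheet": ["genotype call"],
    "--gtc-sample-sheet": GTC_CONSUMERS,
    "--gtc-folder": GTC_CONSUMERS,
    "--vcf-folder": ["star-allele call"],
    "--star-alleles": ["star-allele annotate"],
}


def options_for(command: str) -> List[str]:
    """List the options the table associates with a command, in table order."""
    return [opt for opt, commands in OPTION_COMMANDS.items() if command in commands]


for name in ("genotype call", "genotype gtc-to-vcf", "star-allele call"):
    print(name, "->", options_for(name))
```

Folder and sample sheet options for the same file type (for example `--gtc-folder` and `--gtc-sample-sheet`) both appear in the result because the table lists both for the command.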