DRAGEN Host Software
You use the DRAGEN host software program dragen to build and load reference genomes, and then to analyze sequencing data by decompressing the data, mapping, aligning, sorting, duplicate marking with optional removal, and variant calling.
Invoke the software using the dragen command. The command line options are described in the following sections.
Command line options can also be set in a configuration file. For more information on configuration files, see Configuration Files. If an option is set in the configuration file and is also specified on the command line, the command line option overrides the configuration file.
Command-line Options
The following are examples of frequently used command lines:
Build Reference/Hash Table
Run Map/Align and Variant Caller (*.fastq to *.vcf)
Run Map/Align (*.fastq to *.bam)
Run Variant Caller Only (*.bam to *.vcf)
Re-map and Run Variant Caller (*.bam to *.vcf)
Run BCL Converter (BCL to *.fastq)
Run RNA Map/Align (*.fastq to *.bam)
For a complete list of command line options, see Command Line Options.
Reference Genome Options
Before you can use the DRAGEN system for aligning reads, you must load a reference genome and its associated hash tables onto the PCIe card. For information on preprocessing a reference genome's FASTA files into the native DRAGEN binary reference and hash table formats, see Prepare a Reference Genome. You must also specify the directory containing the preprocessed binary reference and hash tables with the -r (--ref-dir) option. This argument is always required.
Use the following command to load the reference genome and hash tables to DRAGEN card memory separately from processing reads.
dragen -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149
Use the -l (--force-load-reference) option to force the reference genome to load even if it is already loaded.
dragen -l -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149
The time needed to load the reference genome depends on the size of the reference. With typical recommended settings, loading takes approximately 30 to 60 seconds.
Operating Modes
DRAGEN has two primary modes of operation, as follows:
Mapper/aligner
Variant caller
DRAGEN is capable of performing each mode independently or as an end-to-end solution. DRAGEN also allows you to enable and disable decompression, sorting, duplicate marking, and compression along the DRAGEN pipeline.
Full pipeline mode
To execute full pipeline mode, set --enable-variant-caller to true and provide input as unmapped reads in *.fastq, *.bam, or *.cram format. DRAGEN performs decompression, mapping, aligning, sorting, and optional duplicate marking, and feeds the results directly into the variant caller to produce a VCF file. In this mode, DRAGEN uses parallel stages throughout the pipeline to drastically reduce the overall run time.
Map/align mode
Map/align mode is enabled by default. Input is unmapped reads in *.fastq, *.bam, or *.cram format. DRAGEN produces an aligned and sorted BAM or CRAM file. To mark duplicate reads at the same time, set --enable-duplicate-marking to true.
Variant caller mode
To execute variant caller mode, set the --enable-variant-caller option to true and set the --enable-map-align option to false. The input must be a mapped and aligned BAM/CRAM file. DRAGEN produces a VCF file. DRAGEN force-enables re-sorting of the BAM, because a number of read statistics and estimates are required for the variant caller to operate effectively. Setting --enable-sort to false is overridden. BAM files cannot be duplicate marked in the DRAGEN pipeline prior to variant calling if they have not already been marked. Use the end-to-end mode of operation to take advantage of the mark-duplicates feature.
RNA-Seq data
To enable processing of RNA-Seq-based data, set --enable-rna to true. DRAGEN uses the RNA spliced aligner during the mapper/aligner stage. DRAGEN dynamically switches between the required modes of operation.
Bisulfite MethylSeq data
To enable processing of Bisulfite MethylSeq data, set the --enable-methylation-calling option to true. DRAGEN automates the processing of data for Lister (directional) and Cokus (nondirectional) protocols to generate a single BAM with bismark-compatible tags. Alternatively, you can run DRAGEN in a mode that produces a separate BAM file for each combination of the C->T and G->A converted reads and references. To enable this mode of processing, you need to build a set of reference hash tables with --ht-methylated enabled, and run DRAGEN with the appropriate --methylation-protocol setting.
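The option settings for each mode can be collected in a small lookup. The following Python sketch is illustrative only: the mode labels and the helper name mode_arguments are hypothetical, and it assumes that the --option=value form (used elsewhere in this guide for Boolean options) is accepted for these options.

```python
# Option settings per operating mode, taken from the descriptions above.
# The mode labels ("full_pipeline", ...) are hypothetical names for this sketch.
MODE_OPTIONS = {
    "full_pipeline": {"--enable-variant-caller": "true"},
    "map_align": {},  # map/align is enabled by default
    "variant_caller": {
        "--enable-variant-caller": "true",
        "--enable-map-align": "false",
    },
    "rna": {"--enable-rna": "true"},
    "methylation": {"--enable-methylation-calling": "true"},
}


def mode_arguments(mode, mark_duplicates=False):
    """Return the options for a mode as a list of '--name=value' strings."""
    if mode not in MODE_OPTIONS:
        raise ValueError(f"unknown mode: {mode}")
    options = dict(MODE_OPTIONS[mode])
    if mark_duplicates:
        if mode == "variant_caller":
            # Duplicate marking is not available when starting from an aligned BAM.
            raise ValueError("duplicate marking is not available in variant caller mode")
        options["--enable-duplicate-marking"] = "true"
    return [f"{name}={value}" for name, value in options.items()]
```

For example, mode_arguments("variant_caller") returns the two settings that variant caller mode requires, and mode_arguments("map_align", mark_duplicates=True) adds only the duplicate marking option.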
Output Options
The following command line options for output are mandatory:
--output-directory <out_dir>: Specifies the output directory for generated files.
--output-file-prefix <out_prefix>: Specifies the output file prefix. DRAGEN appends the appropriate file extension onto this prefix for each generated file.
-r [ --ref-dir ]: Specifies the reference hash table.
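A wrapper script can check for the mandatory options before launching a run. The following Python sketch is a minimal illustration; the helper name missing_mandatory is hypothetical, and the check only looks at option names, not at their values.

```python
# Each tuple lists the accepted spellings of one mandatory option.
MANDATORY = [
    ("--output-directory",),
    ("--output-file-prefix",),
    ("-r", "--ref-dir"),
]


def missing_mandatory(argv):
    """Return the long names of mandatory options absent from argv."""
    # Accept both "--name value" and "--name=value" spellings.
    present = {argument.split("=", 1)[0] for argument in argv}
    return [names[-1] for names in MANDATORY if not present.intersection(names)]
```

An empty result means that all three options are present on the command line being checked.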
The following examples do not include these mandatory options.
For mapping and aligning, the output is sorted and compressed into BAM format by default before saving to disk. You can control the output format from the map/align stage with the --output-format <SAM|BAM|CRAM> option. If the output file exists, the software issues a warning and exits. To force an overwrite when the output file already exists, use the -f [ --force ] option.
For example, the following commands output to a compressed BAM file and force an overwrite:
dragen ... -f
dragen ... -f --output-format bam
To generate a BAI-format BAM index file (*.bai), set --enable-bam-indexing to true.
The following example outputs to a SAM file and forces an overwrite:
dragen ... -f --output-format sam
The following example outputs to a CRAM file and forces an overwrite:
dragen ... -f --output-format cram
DRAGEN only outputs lossless CRAM files. All QNAMEs and BAM tags are preserved in the CRAM.
Alignment tags
DRAGEN can generate mismatch difference (MD) tags, as described in the BAM standard. The feature is turned off by default because there is a small performance cost to generate these strings. To generate MD tags, set --generate-md-tags to true.
To generate ZS:Z alignment status tags, set --generate-zs-tags to true. These tags are generated only in the primary alignment, and only when a read has suboptimal alignments qualifying for secondary output (even if none were output because --Aligner.sec-aligns was set to 0). The following are valid tag values:
Tag | Tag meaning |
---|---|
| Multiple alignments with similar score were found. |
| No alignment was found. |
| An alignment was found but it was below the quality threshold. |
To generate SA:Z tags, set --generate-sa-tags to true (the default). These tags provide alignment information (position, CIGAR, orientation) of groups of supplementary alignments, which are useful in structural variant calling.
To generate the pair score in a ps:i tag, set --generate-ps-tags to true (false by default for DNA, true for RNA). The pair score is used in DRAGEN for computing MAPQ and can be used to check how well alignment candidate pairs score against each other.
DRAGEN can also output mate alignment tags. To generate the mate CIGAR (in the MC:Z tag), set --generate-mc-tags to true (this is the default). To generate the mate mapping quality (in the MQ:i tag), set --generate-mq-tags to true (this is the default). To generate the mate sequence (in the R2:Z tag) and mate base qualities (in the Q2:Z tag), set --generate-r2-tags to true (default is false) and --generate-q2-tags to true (default is false), respectively. When enabled, R2:Z and Q2:Z tags are emitted only for improperly paired read alignments with a fragment length of at least 1000 bp. The methylation pipelines currently do not support the output of mate alignment tags.
DRAGEN also outputs a graph alignment tag (ga:Z) when applicable. The tag is controlled by --generate-ga-tags (true by default for DNA, false for RNA). This tag describes the best alt contig alignment that improved the score of a primary-contig alignment at its liftover position. It can also describe read alignments to alt contigs for which there is no liftover and the primary alignment is unmapped, for example, when the read maps best to an alt contig describing a novel long insertion that is not present in the reference. In addition, read alignments that have been marked as unmapped because they map to auto-detected decoy contigs not present in the original user-provided FASTA also have their alignments described in the ga tag.
The ga tag uses the same format as the SA tag used to describe supplementary alignments.
CRAM Output
When CRAM is selected as output, DRAGEN generates a CRAM file with the following features:
CRAM format V3.0 is produced.
The CRAM is lossless. Lossy compression is never used and cannot be enabled.
Quality score compression is lossless. Read names are preserved.
Only the GZIP compression algorithm is used, for maximum compatibility. bzip2 and lzma are not used. rANS is used for quality scores.
All input BAM tags are preserved.
The reference used to compress the CRAM file is the DRAGEN hash table provided during the map/align run. When decompressing the CRAM with a FASTA file and third-party tools, you must use the FASTA that was used to generate the hash table.
A CRAM index is produced in .crai format.
CRAM output is only possible when sort is enabled. CRAM alignments are always positionally sorted.
The following default settings are used for CRAM output:
CRAM option | Value | Description |
---|---|---|
SEQS_PER_SLICE | 2000 | Max sequences per slice |
BASES_PER_SLICE | SEQS_PER_SLICE*500 | Max bases per slice |
SLICE_PER_CNT | 1 | Max slices per container |
embed_ref | 0 | Do not embed reference sequence |
noref | 0 | Do not use non-reference-based encoding |
multiseq | -1 | Do not use multiple references per slice |
unsorted | 0 | Do not use unsorted mode |
use_bz2 | 0 | Do not compress using bzip2 |
use_lzma | 0 | Do not compress using lzma |
use_rans | 1 | Use rANS for quality score compression |
binning | NONE | Qual score binning not used |
preserve_aux_order | 1 | Preserve all aux tags and order (incl RG,NM,MD) |
preserve_aux_size | 0 | Aux tag sizes not preserved ('i', 's', 'c') |
lossy_read_names | 0 | Preserve read names |
lossy | 0 | Do not enable Illumina 8 quality-binning system |
ignore_md5 | 0 | Enable all checking of checksums |
decode_md | 0 | Do not (re)generate MD and NM tags |
Input Options
DRAGEN can process reads in FASTQ format or BAM/CRAM format. DRAGEN supports the following compression options for FASTQ input files.
Uncompressed
gzip or bgzip compression
ORA compression. To use ORA compression, you must provide an ORA reference and reference directory. See ORA Compression and Decompression.
If your input FASTQ files are gzipped, DRAGEN automatically decompresses the files using hardware-accelerated decompression, and then streams the reads into the mapper. If your files end in *.ora, DRAGEN automatically decompresses the files using ORA decompression, and then streams the reads into the mapper. The same FASTQ command-line options apply for all compression formats.
FASTQ Input Files
FASTQ input files can be single-ended or paired-end, as shown in the following examples.
Single-ended in one FASTQ file (-1 option)
Paired-end in two matched FASTQ files (-1 and -2 options)
Paired-end in a single interleaved FASTQ file (--interleaved (-i) option)
Both bcl2fastq and the DRAGEN BCL command use a common file naming convention, as follows:
<SampleID>_S<#>_<Lane>_<Read>_<segment#>.fastq.gz
Older versions of bcl2fastq and DRAGEN could segment FASTQ samples into multiple files to limit file size or to decrease the time to generate them.
For Example:
These files do not need to be concatenated to be processed together by DRAGEN. To map/align any sample, provide the first file in the series (-1 <FileName>_001.fastq). DRAGEN reads all segment files in the sample consecutively for both of the FASTQ file sequences specified using the -1 and -2 options for paired-end input, and also for compressed fastq.gz files. To turn this behavior off, set --enable-auto-multifile to false on the command line.
DRAGEN can also optionally read multiple files by the sample name given in the file name, which can be used to combine samples that have been distributed across multiple BCL lanes or flow cells. To enable this feature, set the --combine-samples-by-name option to true.
If the FASTQ files specified on the command-line use the Casava 1.8 file naming convention shown above and additional files in the same directory share that sample name, those files and all their segments are processed automatically. Note that sample name, read number, and file extension must match. Index barcode and lane number can differ.
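To see which files would be grouped under this naming convention, you can parse the file names yourself. The following Python sketch is an illustration under stated assumptions, not DRAGEN's matching logic: it assumes the lane field looks like L001, the read field like R1 or R2, and the segment field like 001, and the sample names in the usage example are hypothetical.

```python
import re
from collections import defaultdict

# Pattern for <SampleID>_S<#>_<Lane>_<Read>_<segment#>.fastq.gz
# Assumed field shapes: lane "L001", read "R1"/"R2", segment "001".
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<number>\d+)_(?P<lane>L\d+)"
    r"_(?P<read>R[12])_(?P<segment>\d+)\.fastq\.gz$"
)


def group_by_sample(file_names):
    """Group names by (sample, read); lane and other fields may differ."""
    groups = defaultdict(list)
    for name in file_names:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue  # names outside the convention are skipped in this sketch
        groups[(match["sample"], match["read"])].append(name)
    return {key: sorted(names) for key, names in groups.items()}
```

Calling group_by_sample on a directory listing returns one sorted list per sample and read number, which is a quick way to preview what a run with --combine-samples-by-name would pick up.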
To avoid impacting system performance, input files must be located on a fast file system.
Multiple FASTQ Input Files
To process multiple FASTQ input files as one sample, it is recommended that you use the --fastq-list <csv file name> option to specify the name of a CSV file containing the list of FASTQ files, instead of using the --combine-samples-by-name option.
For example:
Using a CSV file avoids having to concatenate the FASTQ files in cases where there are multiple FASTQ files for a sample, such as top-up scenarios or where FASTQ files are split across lanes. It also allows you to name the FASTQ input files explicitly, take input from multiple subdirectories, and add BAM tags for each read group. DRAGEN automatically generates a CSV file of the correct format during BCL conversion to FASTQ. The CSV file is named fastq_list.csv and contains an entry for each FASTQ file or paired-end file pair produced during the run.
FASTQ CSV File Format
The first line of the CSV file specifies the title of each column, and is followed by one or more data lines. All lines in the CSV file must contain the same number of comma-separated values and should not contain white space or other extraneous characters.
Column titles are case-sensitive. The following column titles are required:
RGID--Read Group
RGSM--Sample ID
RGLB--Library
Lane--Flow cell lane
Read1File--Full path to a valid FASTQ input file
Read2File--Full path to a valid FASTQ input file. Required for paired-end input. If not using paired-end input, leave empty.
Each FASTQ file referenced in the CSV list can be referenced only once. All values in the Read2File column must be either nonempty and reference valid files, or they must all be empty.
When generating a BAM file using fastq-list input, one read group is generated per unique RGID value. The BAM header contains the following RG tags for each read group:
ID (from RGID)
SM (from RGSM)
LB (from RGLB)
You can specify additional tags for each read group by adding a column title. The column title must be only four upper-case characters and begin with RG. For example, to add a PU (platform unit) tag, add a column named RGPU and specify the value for each read group in this column. All column titles must be unique.
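The format rules above lend themselves to a quick pre-flight check. The following Python sketch validates the text of a fastq-list CSV against those rules. It is a minimal illustration: the function name validate_fastq_list and the message wording are hypothetical, and it does not check that the listed files exist or that values are free of white space.

```python
import csv
import io
import re

REQUIRED = ["RGID", "RGSM", "RGLB", "Lane", "Read1File", "Read2File"]
EXTRA_TAG = re.compile(r"RG[A-Z]{2}")  # e.g. RGPU


def validate_fastq_list(text):
    """Return a list of problems found in fastq-list CSV text (empty if none)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    header, data = rows[0], rows[1:]
    problems = []
    if len(set(header)) != len(header):
        problems.append("column titles must be unique")
    for title in REQUIRED:
        if title not in header:
            problems.append(f"missing required column {title}")
    for title in header:
        if title not in REQUIRED and not EXTRA_TAG.fullmatch(title):
            problems.append(f"unexpected column title {title}")
    if not data:
        problems.append("no data lines")
    for number, row in enumerate(data, start=2):
        if len(row) != len(header):
            problems.append(f"line {number}: expected {len(header)} values")
    if problems:
        return problems
    records = [dict(zip(header, row)) for row in data]
    read2 = [record["Read2File"] for record in records]
    # Read2File must be filled in for every row or left empty in every row.
    if any(read2) and not all(read2):
        problems.append("Read2File must be set for all rows or for none")
    files = [record["Read1File"] for record in records] + [f for f in read2 if f]
    if len(set(files)) != len(files):
        problems.append("each FASTQ file can be referenced only once")
    return problems
```

An empty list means the text passed every check in this sketch; DRAGEN itself may apply additional validation.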
A fastq-list file can contain files for more than one sample. If a fastq-list file contains only one unique RGSM entry, then no additional options need to be specified, and DRAGEN processes all files listed in the fastq-list file. If there is more than one unique RGSM entry in a fastq-list file, --fastq-list-sample-id <SampleID> must be used in addition to --fastq-list <filename> to process only a specific sample from the CSV file. Only the entries in the fastq-list file with an RGSM value that matches the specified SampleID are processed.
Independent processing and output for multiple individual samples in one run is not supported.
To process all listed files together as one sample, regardless of the RGSM value, use the --fastq-list-all-samples=true option instead of --fastq-list-sample-id.
Note
For a single run, only one BAM and VCF output file are produced because all input read groups are expected to belong to the same sample. To process multiple samples independently from one BCL conversion run, DRAGEN must be run multiple times using different values for the `--fastq-list-sample-id` option.
There is no option to specify groupings or subsets of RGSM values for more complex filtering, but the fastq-list file can be modified to achieve the same effect.
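The choice between these options depends only on how many unique RGSM values the list contains. The following Python sketch expresses that decision; the function name fastq_list_arguments and its parameters are hypothetical, and the returned list contains only the fastq-list options, not a complete command line.

```python
def fastq_list_arguments(csv_path, rgsm_values, sample_id=None, all_samples=False):
    """Return the fastq-list options needed for the given RGSM column values."""
    unique = set(rgsm_values)
    arguments = ["--fastq-list", csv_path]
    if all_samples:
        # Process every listed file as one sample, regardless of RGSM.
        return arguments + ["--fastq-list-all-samples=true"]
    if sample_id is not None:
        if sample_id not in unique:
            raise ValueError(f"sample ID {sample_id} is not in the list")
        return arguments + ["--fastq-list-sample-id", sample_id]
    if len(unique) > 1:
        raise ValueError("several RGSM values: select a sample ID or all samples")
    return arguments  # a single RGSM value needs no extra option
```

To process several samples independently, call the function once per sample ID and start one DRAGEN run for each result.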
The following is an example FASTQ list CSV file with the required columns:
If you use the --tumor-fastq-list option for somatic input, use the --tumor-fastq-list-sample-id <SampleID> option to specify the sample ID for the corresponding FASTQ list, as shown in the following example:
Tumor-Normal Pairs Input
If using fastq_lists or tumor_fastq_lists comprising multiple samples (RGSMs) in somatic mode, you can use a loop to iterate through the two lists to create tumor-normal pairs for testing. Create a *.txt file with the RGSM of each normal sample to be tested (one per line), and then create a separate *.txt file with the RGSM of each tumor sample to be tested. Make sure that the tumor sample RGSMs are listed in the same order as the corresponding normal samples, and include a blank line after the last sample.
You can use the following example script to perform testing in somatic mode. Each iteration takes one entry from the tumor samples list and one entry from the normal samples list (from top to bottom) to create a tumor-normal pair as input for the DRAGEN run.
The following are examples of the FASTQ lists and samples lists used as input for the script.
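The pairing step described in this section can also be sketched in Python. This is an illustration of the top-to-bottom pairing only, not the script referenced above: the helper names are hypothetical, and each generated argument list contains just the four fastq-list options, so the reference, output, and somatic-mode options still have to be added before a run.

```python
def read_sample_list(text):
    """Return the non-blank lines of a sample list file, in order."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def tumor_normal_commands(normal_ids, tumor_ids, normal_list, tumor_list):
    """Pair samples top to bottom and return one argument list per pair."""
    if len(normal_ids) != len(tumor_ids):
        raise ValueError("the two sample lists must have the same length")
    commands = []
    for normal_id, tumor_id in zip(normal_ids, tumor_ids):
        commands.append([
            "dragen",
            "--fastq-list", normal_list,
            "--fastq-list-sample-id", normal_id,
            "--tumor-fastq-list", tumor_list,
            "--tumor-fastq-list-sample-id", tumor_id,
        ])
    return commands
```

Because pairing is positional, the length check catches a missing line in either sample list before any run is started.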
FASTQ ORA Input Files
You can use the same options as the other FASTQ input file types for ORA files. To use the ORA file, replace the FASTQ file name with the ORA file name and specify the ORA reference directory using --ora-reference.
See ORA Compression and Decompression for more information on ORA reference files.
The following command represents paired-end in two matched ORA FASTQ files (-1 and -2 options).
BAM Input Files
BAM files can be used as input to the mapper/aligner. By default, --enable-map-align is true. You can use the BAM file as input to the variant caller by setting the --enable-map-align option to false.
When you specify a BAM file as input, with map/align enabled, DRAGEN ignores any alignment information contained in the input file, and outputs new alignments for all reads.
If the input file contains paired-end reads, it is important to specify that the input data should be sorted so that pairs can be processed together. Other pipelines require you to re-sort the input data set by read name. DRAGEN vastly increases the speed of this operation by pairing the input reads and sending them on to the mapper/aligner as soon as pairs are identified. Use the --pair-by-name option to enable or disable this feature (the default is true).
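The idea of pairing reads as they stream past, rather than re-sorting by name first, can be illustrated with a small generator. The following Python sketch is conceptual only and makes no claim about how DRAGEN implements the feature; the function name and the (name, payload) input shape are hypothetical.

```python
def pair_reads_by_name(reads):
    """Yield (first_seen, second_seen) as soon as both mates have arrived.

    `reads` is an iterable of (name, payload) tuples in any order.
    """
    pending = {}  # reads still waiting for their mate, keyed by read name
    for name, payload in reads:
        if name in pending:
            yield pending.pop(name), payload
        else:
            pending[name] = payload
```

Each pair is emitted the moment its second mate is seen, so downstream work can start before the whole input has been read.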
Specify single-ended input in one BAM file with the -b and --pair-by-name=false options, as follows:
Specify paired-end input in one BAM file with the -b and --pair-by-name=true options, as follows:
CRAM Input
You can use CRAM files as input to the DRAGEN mapper/aligner and variant caller. The DRAGEN functionality available when using CRAM input is the same as when using BAM input.
By default, the CRAM compressor and decompressor use the DRAGEN reference specified with the --ref-dir option. CRAM compression is reference based, and the reference used for compression is not part of the CRAM file. Therefore, the CRAM input file must have been created with the same reference as the one provided to DRAGEN for the analysis.
DRAGEN supports the re-alignment of a CRAM input that was created with a different reference in one step. Re-aligning a CRAM file that was created with a different reference requires use of the --cram-reference option. This option makes the CRAM decompressor use the specified reference.
--cram-reference can be either a FASTA file or a DRAGEN hash table folder. If it points to a FASTA file, the .fai index file must be present next to the FASTA file.
CRAM output is always compressed using the --ref-dir reference.
Example: CRAM was created with hg19, re-analysis with hg38
The following options are used for providing a CRAM input to either mapper/aligner or variant caller:
--cram-input: The name and path of the CRAM file. One usage example is paired-end input in a single CRAM file. In that case, also set the --pair-by-name option to true.
Multiple BAM or CRAM Input Files
To provide multiple BAM input files, you can use the --bam-list <csv file name> option to specify the name of a CSV file containing the list of BAM files. For example:
To provide multiple CRAM input files, you can use the --cram-list <csv file name> option.
BAM or CRAM CSV Input File Format
The first line of the CSV file specifies the header containing the title for each column and each subsequent line is a data line. All lines in the CSV file must contain the same number of comma-separated values and should not contain white space or any other extraneous characters.
An example BAM CSV file:
Column titles are case sensitive. The following column titles are required:
BamFile -- path to BAM file
Please note that only the "BamFile" column is supported at this time. Extra fields may be specified in the CSV file but they will not be processed by DRAGEN.
CRAM CSV input follows the same format above, with "CramFile" as the column title instead.
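A list file in this format can be generated with a few lines of Python. The sketch below is a minimal illustration; the function name bam_list_csv and the paths in the usage note are hypothetical.

```python
import csv
import io


def bam_list_csv(paths, column="BamFile"):
    """Return CSV text with one header line and one path per data line."""
    for path in paths:
        # The format allows no white space or extraneous characters.
        if "," in path or any(character.isspace() for character in path):
            raise ValueError(f"path not usable in the list file: {path!r}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column])
    for path in paths:
        writer.writerow([path])
    return buffer.getvalue()
```

For CRAM input, pass column="CramFile". Write the returned text to a file and give that file name to --bam-list or --cram-list.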
Restrictions and Limitations:
DRAGEN bam-list and cram-list are intended to mirror manually merging BAM or CRAM files with a utility such as samtools or MergeSamFiles (Picard). As a result, using bam-list or cram-list is analogous to having a single merged BAM or CRAM input file. Note that some callers (e.g., DRAGEN variant calling) are unable to process a bam-list or cram-list that is composed of input files containing multiple samples.
In the case where identical read group IDs appear across multiple files and you want to treat them as distinct read groups, you can use the --prepend-filename-to-rgid=true option to distinguish between read groups.
If enabled, the resulting output BAM or CRAM file will contain all read groups from the input BAM or CRAM files passed in the CSV list file.
Tumor-Normal Pairs Input
You can also use --tumor-bam-list <csv file name> or --tumor-cram-list <csv file name> when running with tumor-only or tumor-normal inputs to DRAGEN. The CSV file has the same format as the options described above.
BCL Input Files
BCL is the output format of Illumina sequencing systems. Under limited circumstances, DRAGEN can read directly from BCL for map-align operations, saving the time needed for conversion to FASTQ.
DRAGEN can read directly from BCL in the following circumstances:
Only one lane is input as part of a run (specified on the command-line).
The lane has only a single sample specified in the SampleSheet.csv file.
When converting BCL to FASTQ is required, DRAGEN provides a BCL to FASTQ converter (see DRAGEN BCL Data Conversion).
The following example command is for BCL input with only one lane of input:
For additional BCL conversion options, see Input File Types.
Handling of N bases
One of the techniques that DRAGEN uses to optimize the handling of sequences can lead to overwriting the base quality score assigned to N base calls.
When you use the --fastq-n-quality and --fastq-offset options, the base quality scores are overwritten with a fixed base quality. The default values for these options are 2 and 33, respectively. Together they match the Illumina minimum quality character, which has ASCII code 35 (‘#’).
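The relationship between the two defaults and the ‘#’ character is simple arithmetic: the quality value plus the offset gives the ASCII code. The following Python sketch shows the calculation and applies it to a read. It is an illustration only; the function names are hypothetical and the code does not describe DRAGEN internals.

```python
def n_quality_character(fastq_n_quality=2, fastq_offset=33):
    """Return the quality character written for N base calls."""
    return chr(fastq_n_quality + fastq_offset)  # 2 + 33 = 35, which is '#'


def overwrite_n_qualities(bases, qualities, fastq_n_quality=2, fastq_offset=33):
    """Replace the quality character at every N position of a read."""
    fixed = n_quality_character(fastq_n_quality, fastq_offset)
    return "".join(
        fixed if base == "N" else quality
        for base, quality in zip(bases, qualities)
    )
```

With the defaults, a read ACNT with qualities IIII becomes II#I.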
Read Names for Paired-End Reads
By a common convention, read names can include suffixes (such as /1 or /2) that indicate which end of a pair the read represents. For BAM input using the --pair-by-name option, DRAGEN ignores these suffixes to find matching pair names. By default, DRAGEN uses the forward slash character as the delimiter for these suffixes and ignores the /1 and /2 when comparing names. By default, DRAGEN also strips these suffixes from the original read names.
DRAGEN has the following options to control how suffixes are used:
To change the delimiter character for suffixes, use the --pair-suffix-delimiter option. Valid values for this option include forward slash (/), dot (.), and colon (:).
To preserve the entire name, including the suffixes, set --strip-input-qname-suffixes to false.
To append a new set of suffixes to all read names, set --append-read-index-to-name to true. The delimiter is determined by the --pair-suffix-delimiter option. By default, the delimiter is a slash, so /1 and /2 are added to the names.
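The effect of these options on a read name can be mimicked with two small helpers. The following Python sketch is an illustration of the naming behavior described above, not DRAGEN code; the function names and the read name in the usage note are hypothetical.

```python
def strip_pair_suffix(name, delimiter="/"):
    """Remove a trailing <delimiter>1 or <delimiter>2 from a read name."""
    for suffix in (delimiter + "1", delimiter + "2"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def append_read_index(name, read_index, delimiter="/"):
    """Append <delimiter><read_index> to a read name."""
    return f"{name}{delimiter}{read_index}"
```

For example, strip_pair_suffix("read7/1") and strip_pair_suffix("read7.2", delimiter=".") both return read7, which is why the two mates compare as equal when pairing by name.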
Gene Annotation Input Files
When processing RNA-Seq data, you can supply a gene annotations file by using the --annotation-file option. Providing this file improves the accuracy of the mapping and aligning stage (see Input Files). The file should conform to the GTF/GFF format specification and should list annotated transcripts that match the reference genome being mapped against. The similar GFF3 format is currently not supported, due to inconsistent contig naming between GENCODE and Ensembl. See the RNA user guide section for more details on potential issues and workarounds.
DRAGEN can take the SJ.out.tab file (see SJ.out.tab) as an annotations file to help guide the aligner in a two-pass mode of operation.
Networked Streaming
AWS S3, Azure Blob Storage, and AWS Presigned URL Input Streaming
DRAGEN can stream input files directly from an AWS S3 bucket, Azure Blob storage account, or by using AWS presigned URLs (presigned URLs are not supported for Azure Blob storage at this time). With streaming, input files are not required to be downloaded locally prior to being processed. The files are streamed over the network directly into the DRAGEN processor.
Input streaming is most beneficial for large input files. DRAGEN supports input streaming for BAMs and compressed FASTQ files. For FASTQ files, input streaming can be used in all the configurations, including single-end FASTQs, paired-end FASTQs, and FASTQ lists.
Input streaming is supported for the following use cases:
Mapping/aligning of FASTQ and BAM.
Germline and somatic small variant calling from BAM (without remapping).
For other file types that are significantly smaller in size, download them locally before running the analysis.
Streaming FASTQ Input Using AWS S3
Streaming FASTQ Input Using Azure Blob Storage Account
Streaming FASTQ Input Using Presigned URLs (for AWS only)
Streaming BAM Input Using AWS S3
Streaming BAM Input Using Azure Blob Storage Account
Streaming BAM Input Using Presigned URLs (for AWS only)
AWS S3 and Azure Blob Storage Output Streaming
DRAGEN can stream its output to an AWS S3 Bucket or an Azure Blob Storage Account Container. Output streaming is beneficial for large output files and for sharing results.
Streaming output to AWS S3
Streaming output to Azure Blob Storage Account
Security and Permissions
To stream input files from or write to a cloud provider's storage, you must have permission to access the remote files.
AWS S3
S3 requires AWS authentication and credentials. The authentication should already be set up on the instance you are running, for example, via IAM policies.
Azure Blob Storage Account
Azure requires authentication and environment variables. DRAGEN supports two cases: (1) Using managed identities and (2) Storage account access keys.
To use managed identities, you must run DRAGEN on an Azure instance. The instance must have Contributor permissions (read/write) on the Storage Account that it reads from and writes to. If the instance has a single managed identity, only the AZ_ACCOUNT_NAME=<azure-storage-account-name> environment variable is required. For multiple managed identities, you must also provide the AZR_IDENT_CLIENT_ID=<client-id> environment variable, with the client ID of the identity that can access your storage bucket. You can find the client ID in the Azure Portal.
With storage account access keys, DRAGEN can write to an Azure bucket both on and off Azure instances. For this use case, find the Storage Account Access Key and set the environment variables AZ_ACCOUNT_NAME=<azure-storage-account-name> and AZ_ACCOUNT_KEY=<account-key>.
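Before starting a run, you can check which of the two documented cases your environment corresponds to. The following Python sketch is a pre-flight check only: the function name and the returned labels are hypothetical, and the order in which the variables are tested is an assumption of this sketch, not documented DRAGEN behavior.

```python
import os


def azure_auth_mode(environment=None):
    """Classify the Azure settings found in a mapping of environment variables."""
    if environment is None:
        environment = os.environ
    if not environment.get("AZ_ACCOUNT_NAME"):
        raise ValueError("AZ_ACCOUNT_NAME is required in both cases")
    if environment.get("AZ_ACCOUNT_KEY"):
        return "access key"
    if environment.get("AZR_IDENT_CLIENT_ID"):
        return "managed identity with explicit client ID"
    return "single managed identity"
```

Calling azure_auth_mode() with no argument inspects the current process environment.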
Presigned URL (AWS only)
An AWS presigned URL most likely has a query string attached to it, which provides the authentication credentials or necessary tokens to grant permission to the S3 bucket (e.g., https://bucket-name.amazonaws.com/path/to/folder?querystring). Currently, streaming input to DRAGEN using Azure presigned URLs is not supported.
Sample Sex
Use the --sample-sex command line option to control the sex karyotype input used in downstream components, such as variant callers. If a sample sex karyotype input is not specified using the command line, the sex karyotype is automatically determined. The sex karyotype input is converted to a reference sex karyotype for use in variant calling. Other components might support sex karyotype input. Refer to the corresponding section for the component you are using.
The --sample-sex option supports the following values. Values are not case-sensitive.
none: No sex karyotype input. Components use a default reference sex karyotype.
auto: The sex karyotype is estimated by the Ploidy Estimator. If using CNV calling, sex karyotype is determined using a separate sex estimation module. If DRAGEN cannot estimate the sex karyotype, then components do not have a sex karyotype input. This behavior is then the same as none. auto is the default value.
female: Sex karyotype input is XX.
male: Sex karyotype input is XY.
The following example command lines use --sample-sex to specify the sex karyotype.
If the value is none, female, or male, the Ploidy Estimator could still run and produce output, but variant callers do not use any estimated sex karyotype that differs from the sex karyotype provided on the command line.
The sex karyotype input is converted to the reference sex karyotype for the different components as follows. See the relevant component section for more information on how --sample-sex is used.
Reference Sex Karyotype
Sex Karyotype Input | CNV Caller | ExpansionHunter | Ploidy Caller | Small Variant Caller | SV Caller |
---|---|---|---|---|---|
XX | XX | XX | XX | XX | XXYY |
XY | XY | XY | XY | XY | XXYY |
XXY | XY | XX | XY | XXYY | XXYY |
XYY | XY | XY | XY | XXYY | XXYY |
X0 | XX | XY | XX | XXYY | XXYY |
XXXY | XY | XX | XY | XXYY | XXYY |
XXX | XX | XX | XX | XXYY | XXYY |
None | XX/XY | XX | XX | XXYY | XXYY |
For sex karyotype input of None, CNV independently checks the coverage ratio of X and Y to determine the reference sex karyotype. Detection of minimal Y coverage will yield XY, otherwise XX.
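For scripting, the table above can be held as a lookup. The following Python sketch copies the table values as listed; the names TABLE, reference_karyotype, and from_sample_sex are hypothetical, and the auto value is left out because its karyotype is estimated at run time.

```python
COMPONENTS = (
    "CNV Caller",
    "ExpansionHunter",
    "Ploidy Caller",
    "Small Variant Caller",
    "SV Caller",
)

# Rows copied from the Reference Sex Karyotype table above.
TABLE = {
    "XX": ("XX", "XX", "XX", "XX", "XXYY"),
    "XY": ("XY", "XY", "XY", "XY", "XXYY"),
    "XXY": ("XY", "XX", "XY", "XXYY", "XXYY"),
    "XYY": ("XY", "XY", "XY", "XXYY", "XXYY"),
    "X0": ("XX", "XY", "XX", "XXYY", "XXYY"),
    "XXXY": ("XY", "XX", "XY", "XXYY", "XXYY"),
    "XXX": ("XX", "XX", "XX", "XXYY", "XXYY"),
    "None": ("XX/XY", "XX", "XX", "XXYY", "XXYY"),
}

# --sample-sex values with a fixed karyotype input ("auto" is estimated).
SAMPLE_SEX_INPUT = {"female": "XX", "male": "XY", "none": "None"}


def reference_karyotype(karyotype_input, component):
    """Look up the reference sex karyotype a component uses."""
    return TABLE[karyotype_input][COMPONENTS.index(component)]


def from_sample_sex(value, component):
    """Same lookup, starting from a --sample-sex value (not case-sensitive)."""
    return reference_karyotype(SAMPLE_SEX_INPUT[value.lower()], component)
```

For example, reference_karyotype("XXY", "ExpansionHunter") returns XX, and the CNV Caller entry for None is XX/XY because the CNV caller decides between the two from X and Y coverage.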
Preservation or Stripping of BQSR Tags
The Picard Base Quality Score Recalibration (BQSR) tool produces output BAM files that include tags BI and BD. BQSR calculates these tags relative to the exact sequence for a read. If a BAM file with BI and BD tags is used as input to mapper/aligner with hard clipping enabled, the BI and/or BD tags can become invalid.
The recommendation is to strip these tags when using BAM files as input. To remove the BI and BD tags, set the --preserve-bqsr-tags option to false. If you preserve the tags, DRAGEN warns you to disable hard clipping.
Read Group Options
DRAGEN assumes that all the reads in a given FASTQ belong to the same read group. DRAGEN creates a single @RG read group descriptor in the header of the output BAM file, with the ability to specify the following standard BAM attributes:
Attribute | Argument | Description |
---|---|---|
ID | | Read group identifier. If you include any of the read group parameters, RGID is required. It is the value written into each output BAM record. |
LB | | Library. |
PL | | Platform/technology used to produce the reads. The BAM standard allows for values CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT and PACBIO. |
PU | | Platform unit, e.g., flowcell-barcode.lane. |
SM | | Sample. |
CN | | Name of the sequencing center that produced the read. |
DS | | Description. |
DT | | Date the run was produced. |
PI | | Predicted mean insert size. |
If any of these arguments are present, DRAGEN adds an RG tag to all the output records to indicate that they are members of a read group. The following example shows a command line that includes read group parameters:
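A command of this shape might look as follows. This is a sketch, not a reference invocation: the FASTQ paths, read group values, output prefix, and the function name map_with_read_group are hypothetical. The script prints the command by default and only executes it when RUN is set to an empty string.

```shell
# Dry run by default: prints the command instead of executing it.
# On a DRAGEN server, set RUN= (empty) in the environment to execute.
RUN=${RUN-echo}

map_with_read_group() {
  # All read group values below are hypothetical placeholders.
  $RUN dragen \
    -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
    -1 /staging/input/sample1_R1.fastq.gz \
    -2 /staging/input/sample1_R2.fastq.gz \
    --RGID flowcell1.lane1 \
    --RGSM sample1 \
    --RGLB lib1 \
    --RGPL ILLUMINA \
    --RGPU flowcell1.lane1 \
    --output-directory /staging/tmp \
    --output-file-prefix rg_example
}

map_with_read_group
```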
When using the --fastq-list option to input multiple read groups, BAM tags (and others) are specified for each read group by adding columns to the fastq_list.csv file. Each column heading consists of four capital letters and each begins with 'RG'. For each column, each read group's values for that column are propagated to the output BAM file in an identically named tag.
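The sketch below writes a small fastq_list.csv for one sample sequenced on two lanes and passes it to DRAGEN. The RGID, RGSM, and RGLB columns carry the read group tags; the Lane, Read1File, and Read2File columns are the file-location columns of the fastq_list.csv layout. All values and paths are hypothetical, and the command is only printed unless RUN is set to an empty string.

```shell
# Dry run by default: prints the command instead of executing it.
RUN=${RUN-echo}

# Hypothetical location for the list file.
fastq_list=${TMPDIR:-/tmp}/fastq_list.csv

# One row per read group; every RG* column becomes a tag of the same name.
cat > "$fastq_list" <<'EOF'
RGID,RGSM,RGLB,Lane,Read1File,Read2File
flowcell1.1,sample1,lib1,1,/staging/input/sample1_L001_R1.fastq.gz,/staging/input/sample1_L001_R2.fastq.gz
flowcell1.2,sample1,lib1,2,/staging/input/sample1_L002_R1.fastq.gz,/staging/input/sample1_L002_R2.fastq.gz
EOF

$RUN dragen \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  --fastq-list "$fastq_list" \
  --fastq-list-sample-id sample1 \
  --output-directory /staging/tmp \
  --output-file-prefix sample1
```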
License Options
To suppress the license status message at the end of the run, use the --lic-no-print option. The following shows an example of the license status message:
Autogenerated MD5SUM for BAM and CRAM Output Files
An MD5SUM file is generated automatically for BAM and CRAM output files. The MD5SUM file has the same name as the output file, with an .md5sum extension appended (eg, whole_genome_run_123.bam.md5sum). The MD5SUM file is a single-line text file that contains the md5sum of the output file, which exactly matches the output of the Linux md5sum command.
The MD5SUM calculation is performed as the output file is written, so there is no measurable performance impact (compared to the Linux md5sum command, which can take several minutes for a 30x BAM).
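To check an output file against its MD5SUM file after copying it elsewhere, a small helper along these lines could be used. The function name verify_md5 is hypothetical. It compares only the first whitespace-separated field of each side, so it works whether or not the .md5sum file also carries a file name.

```shell
# verify_md5 FILE: compare a freshly computed digest with FILE.md5sum.
verify_md5() {
  file=$1
  # Use only the first field in case the sidecar also lists a file name.
  expected=$(awk '{print $1; exit}' "$file.md5sum") || return 1
  actual=$(md5sum "$file" | awk '{print $1}')
  if [ -n "$expected" ] && [ "$expected" = "$actual" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file" >&2
    return 1
  fi
}

# Usage (hypothetical path):
#   verify_md5 /staging/tmp/whole_genome_run_123.bam
```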
Configuration Files
Command line options can be stored in a configuration file. The location of the default configuration file is <INSTALL_PATH>/config/dragen-user-defaults.cfg. You can override this file by using the --config-file (-c) option to specify a different file. The configuration file used for a given run supplies the default settings for that run, any of which can be overridden by command line options.
The recommended approach is to use the dragen-user-defaults.cfg file as a template to create default settings for different use cases. Copy dragen-user-defaults.cfg, rename the copy, and then modify the new file for the specific use case. Best practice is to put options that rarely change into the configuration file and to specify options that vary from run to run on the command line.
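The copy-and-override workflow can be sketched as follows. The install location /opt/edico, the config file name, the helper new_config, and the read group values are assumptions for illustration. The dragen command is only printed unless RUN is set to an empty string.

```shell
# Dry run by default: prints the dragen command instead of executing it.
RUN=${RUN-echo}

# new_config TEMPLATE DEST: start a use-case config from the shipped defaults.
new_config() {
  cp "$1" "$2" && echo "created $2 (edit it before use)"
}

# Assumed install location; replace with your <INSTALL_PATH>.
INSTALL_PATH=${INSTALL_PATH:-/opt/edico}
template=$INSTALL_PATH/config/dragen-user-defaults.cfg
my_cfg=$HOME/wgs-germline.cfg   # hypothetical file name

# Only copy when the template is actually present on this machine.
if [ -f "$template" ]; then
  new_config "$template" "$my_cfg"
fi

# Options that vary per run stay on the command line and override the file.
$RUN dragen -c "$my_cfg" \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 /staging/input/sample1_R1.fastq.gz \
  --RGID rg1 \
  --RGSM sample1 \
  --output-directory /staging/tmp \
  --output-file-prefix sample1
```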
Cloud Authentication and Licensing
Authentication is required for users who run DRAGEN on the cloud with the Bring-Your-Own-License (BYOL) model, outside of integrated Illumina cloud products. A valid license is required to enable authentication and usage quotas.
License Server
DRAGEN cloud runs access the DRAGEN License Server to validate the credentials and licenses against the intended run. BYOL users must provide credentials and must allow access to the license server URL. The following command line option can be used to pass the credentials to DRAGEN: --lic-server=https://<user>:<pass>@license.edicogenome.com.
An alternative way to provide license server credentials is by using a license credentials file. Use the --lic-credentials command line option to provide the full path to the license credentials file. This is a more secure way to pass cloud credentials because it avoids accidental credential leaks from command line console logs.
A license credentials file is a plain text file audited by the customer. The format is the same as the DRAGEN config files: <key> = <value>, with each {key,value} pair on its own line. The following key names must be used: credentials1 and credentials2.
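A credentials file could be created along the following lines. This sketch assumes that credentials1 holds the user name and credentials2 holds the password, mirroring the <user>:<pass> order of the --lic-server URL; confirm the mapping for your license before relying on it. The function name write_credentials and the file path are hypothetical.

```shell
# write_credentials FILE USER PASS: create an owner-only credentials file.
# Assumption: credentials1 = user name, credentials2 = password.
write_credentials() {
  (
    umask 077   # a newly created file gets mode 600 (owner read/write only)
    printf 'credentials1 = %s\ncredentials2 = %s\n' "$2" "$3" > "$1"
  )
}

# Usage (hypothetical path; --lic-credentials expects the full path):
#   write_credentials "$HOME/.dragen_lic_credentials" "$LIC_USER" "$LIC_PASS"
#   dragen --lic-credentials "$HOME/.dragen_lic_credentials" <other options>
```

The umask only affects files that do not exist yet, so remove or chmod an older file before rewriting it.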
AWS Instance Metadata Service Support (IMDSv1/IMDSv2)
DRAGEN uses AWS Instance Metadata Service (IMDS) to identify its own metadata within the AWS environment, including location, identity, and configuration.
DRAGEN supports both AWS IMDSv1 and the more secure AWS IMDSv2. AWS IMDSv1 is request/response based: it accesses metadata through HTTP requests to a specific endpoint on the instance. AWS IMDSv2 uses token-based authentication with time-limited tokens.
AWS IMDSv2 must be enabled on the AWS instance; otherwise, IMDSv1 is used by default. DRAGEN software automatically detects the IMDS version in use and adapts its behavior accordingly.
Instance Identity
DRAGEN cloud runs access the instance identity document via the Instance Metadata Service as part of the authentication, using the IPv4 local address. If access to the local address is not allowed, authentication fails. Alternatively, if the user does not want to allow applications to access this service, the user can save the instance identity document(s) and point DRAGEN to the saved files. The method for providing instance identity documents to the software is described below.
Save the instance identity document(s) as files from the user's instance, and provide them as inputs to the DRAGEN software with each run.
The instance identity document(s) only need to be saved once per AWS account and region, and those files can be re-used subsequently.
Examples for saving instance identity document(s):
AWS
IMDSv1
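One way to save the documents with IMDSv1 is sketched below. The URLs are the standard EC2 instance identity endpoints; the function name save_identity_v1 and the instance_identity directory are hypothetical. The function is only defined here and must be called on the EC2 instance itself.

```shell
# save_identity_v1 DIR: fetch the three documents with plain GET requests.
save_identity_v1() {
  dir=$1
  base=http://169.254.169.254/latest/dynamic/instance-identity
  mkdir -p "$dir" || return 1
  for name in document signature pkcs7; do
    # -f makes curl fail on HTTP errors instead of saving an error page.
    curl -sf -o "$dir/$name" "$base/$name" || return 1
  done
}

# Usage on an EC2 instance:
#   instance_identity=$HOME/instance-identity
#   save_identity_v1 "$instance_identity"
```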
IMDSv2
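With IMDSv2, a session token is requested first and then sent with every metadata request. The sketch below uses the standard EC2 token and instance identity endpoints; the function name save_identity_v2, the six-hour token lifetime, and the instance_identity directory are hypothetical choices. The function is only defined here and must be called on the EC2 instance itself.

```shell
# save_identity_v2 DIR: fetch the three documents using a session token.
save_identity_v2() {
  dir=$1
  host=http://169.254.169.254
  mkdir -p "$dir" || return 1
  # Step 1: request a time-limited session token (here 21600 s = 6 hours).
  token=$(curl -sf -X PUT "$host/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 21600") || return 1
  # Step 2: present the token on every metadata request.
  for name in document signature pkcs7; do
    curl -sf -H "X-aws-ec2-metadata-token: $token" \
      -o "$dir/$name" \
      "$host/latest/dynamic/instance-identity/$name" || return 1
  done
}

# Usage on an EC2 instance:
#   instance_identity=$HOME/instance-identity
#   save_identity_v2 "$instance_identity"
```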
There should be three files in this folder, named pkcs7, signature, and document. Run DRAGEN using the --lic-instance-id-location ${instance_identity} command option.
Azure
There should be two files in this folder, named instance and document. Run DRAGEN using the --lic-instance-id-location ${instance_identity} command option.