Before a reference genome can be used with DRAGEN, it must be converted from FASTA format into a custom binary format for use with the DRAGEN hardware. The options used in this preprocessing step offer tradeoffs between performance and mapping quality.
Pre-built DRAGEN reference genomes are available for download in the Illumina customer portal. If you find that performance and mapping quality with these are adequate, there is a good chance that you can simply work with these supplied reference genomes. Depending on your read lengths and other particular aspects of your application, you may be able to improve mapping quality and/or performance by tuning the reference preprocessing options.
The DRAGEN mapper extracts many overlapping seeds (subsequences or K-mers) from each read, and looks up those seeds in a hash table residing in memory on its PCIe card, to identify locations in the reference genome where the seeds match. Hash tables are ideal for extremely fast lookups of exact matches. The DRAGEN hash table must be constructed from a chosen reference genome using the --build-hash-table option
, which extracts many overlapping seeds from the reference genome, populates them into records in the hash table, and saves the hash table as a binary file.
DRAGEN will attempt to detect the provided reference in order to automatically apply recommended resources and settings. There are four human references that DRAGEN can detect: hg38, hg19, hs37d5, and chm13v2. DRAGEN is able to detect references that contain a subset of the primary contigs from one of these references, as long as the names and lengths of the detected contigs are consistent with the names and lengths from the standard assemblies of these references.
In detail, automatic reference detection operates as follows:
We define a primary contig of a human genome to be an autosome (1-22) or sex chromosome (X, Y). Let F be the input FASTA. For each reference genome R in hg38, hg19, hs37d5, and chm13v2, DRAGEN checks that at least one contig in F has the same name and length as a primary contig in R, and that no contig in F has the same name as a contig in R but a different length. If these conditions hold for exactly one of hg38, hg19, hs37d5, and chm13v2, then that reference is detected and resources may be applied automatically.
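The detection rule above can be sketched in a few lines of Python. This is only an illustration of the stated conditions, not DRAGEN's implementation: the function names, the dictionary layout, and the toy assemblies (`asmA`, `asmB`, `asmC`) with their lengths are all hypothetical.

```python
def matches_reference(fasta_contigs, ref_primary, ref_all):
    """All arguments are dicts mapping contig name -> length."""
    # Condition 1: some FASTA contig has the name and length of a primary contig.
    has_primary = any(
        ref_primary.get(name) == length for name, length in fasta_contigs.items()
    )
    # Condition 2: no FASTA contig shares a name with a reference contig
    # while having a different length.
    has_conflict = any(
        name in ref_all and ref_all[name] != length
        for name, length in fasta_contigs.items()
    )
    return has_primary and not has_conflict


def detect_reference(fasta_contigs, references):
    """references: dict of assembly name -> (primary contigs, all contigs)."""
    hits = [
        assembly
        for assembly, (primary, all_contigs) in references.items()
        if matches_reference(fasta_contigs, primary, all_contigs)
    ]
    # A reference is detected only if exactly one assembly satisfies both conditions.
    return hits[0] if len(hits) == 1 else None


# Toy assemblies for illustration only; these are NOT real contig lengths.
TOY_REFERENCES = {
    "asmA": ({"chr1": 1000, "chr2": 900}, {"chr1": 1000, "chr2": 900, "chr1_alt": 50}),
    "asmB": ({"chr1": 1010, "chr2": 905}, {"chr1": 1010, "chr2": 905}),
    "asmC": ({"1": 1000, "2": 900}, {"1": 1000, "2": 900}),
}
```

With these toy tables, a FASTA containing only `chr1` of length 1000 is detected as `asmA`, while a FASTA mixing `chr1` from `asmA` with `chr2` from `asmB` is not detected at all, because each candidate sees a name with a conflicting length.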
The DRAGEN hash table builder automatically applies decoy contigs and mask bed files to a detected reference. Other pipelines may also apply automatic resources. For example, variant callers may apply machine learning models and target bed files.
In order for DRAGEN to correctly detect the provided reference, it is important to use the standard naming conventions for each of the four human assemblies that DRAGEN detects:
Assembly | Autosome and Sex Chromosome Names |
---|---|
The size of the DRAGEN hash table is proportionate to the number of seeds populated from the reference genome. The default is to populate a seed starting at every position in the reference genome, ie, roughly 3 billion seeds from a human genome. This default requires at least 32 GB of memory on the DRAGEN PCIe board.
To operate on larger, nonhuman genomes or to reduce hash table congestion, it is possible to populate fewer than all reference seeds by using the --ht-ref-seed-interval
option to specify an average reference interval. The default interval for 100% population is --ht-ref-seed-interval 1
, and 50% population is specified with --ht-ref-seed-interval 2
. The population interval does not need to be an integer. For example, --ht-ref-seed-interval 1.2
indicates 83.3% population, with mostly 1-base and some 2-base intervals to achieve a 1.2 base interval on average.
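The relationship between the interval and the populated fraction is simple arithmetic, sketched below. The helper names are hypothetical, and the way `seed_positions` spaces the seeds (flooring multiples of the interval) is an assumption chosen to reproduce the "mostly 1-base and some 2-base intervals" pattern; the source does not specify how DRAGEN selects the positions.

```python
from fractions import Fraction
from math import floor


def population_percent(interval):
    """Percentage of reference seed positions populated for a given interval."""
    return 100.0 / float(interval)


def seed_positions(interval, ref_len):
    """Illustrative spacing only: floor(i * interval) for i = 0, 1, 2, ..."""
    step = Fraction(str(interval))  # exact arithmetic avoids float rounding
    positions = []
    i = 0
    while floor(i * step) < ref_len:
        positions.append(floor(i * step))
        i += 1
    return positions


print(round(population_percent(1.2), 1))   # 83.3
print(population_percent(2))               # 50.0
print(seed_positions(1.2, 12))             # [0, 1, 2, 3, 4, 6, 7, 8, 9, 10]
```

In the last line, 10 of 12 positions are populated, with four 1-base steps followed by one 2-base step, which averages to a 1.2 base interval.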
It is characteristic of hash tables that they are allocated a certain size, but always retain some empty records, so they are less than 100% occupied. A healthy amount of empty space is important for quick access to the DRAGEN hash table. Approximately 90% occupancy is a good upper bound. Empty space is important because records are pseudo-randomly placed in the hash table, resulting in an abnormally high number of records in some places. These congested regions can get quite large as the percentage of empty space approaches zero, and queries by the DRAGEN mapper for some seeds can become increasingly slow.
The hash table is populated with reference seeds of a single common length. This primary seed length is controlled with the --ht-seed-len
option, which defaults to 21.
The longest primary seed supported is 27 bases when the table is 8 GB to 31.5 GB in size. Generally, longer seeds are better for run time performance, and shorter seeds are better for mapping quality (success rate and accuracy). A longer seed is more likely to be unique in the reference genome, facilitating fast mapping without needing to check many alternative locations. But a longer seed is also more likely to overlap a deviation from the reference (variant or sequencing error), which prevents successful mapping by an exact match of that seed (although another seed from the read may still map), and there are fewer long seed positions available in each read.
Longer seeds are more appropriate for longer reads, because there are more seed positions available to avoid deviations.
Seed Length Recommendations
Due to repetitive sequences, some seeds of any given length match many locations in the reference genome. DRAGEN uses a unique mechanism called seed extension to successfully map such high-frequency seeds. When the software determines that a primary seed occurs at many reference locations, it extends the seed by some number of bases at both ends, to some greater length that is more unique in the reference.
For example, a 21-base primary seed may be extended by 7 bases at each end to a 35-base extended seed. A 21-base primary seed may match 100 places in the reference. But 35-base extensions of these 100 seed positions may divide into 40 groups of 1-3 identical 35-base seeds. Iterative seed extensions are also supported, and are automatically generated when a large set of identical primary seeds contains various subsets that are best resolved by different extension lengths.
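To make the grouping effect concrete, here is a toy sketch that extends every hit of a primary seed by the same number of bases at each end and groups the hits by the resulting extended sequence. The reference string, the 4-base seed, and the function names are made up for illustration; real seeds are far longer and DRAGEN's actual extension logic is not shown in this document.

```python
from collections import defaultdict


def extended_length(primary_len, ext_per_side):
    """Length of a seed extended symmetrically at both ends."""
    return primary_len + 2 * ext_per_side


def seed_hits(reference, seed):
    """All (possibly overlapping) start positions of seed in reference."""
    hits = []
    start = reference.find(seed)
    while start != -1:
        hits.append(start)
        start = reference.find(seed, start + 1)
    return hits


def group_by_extension(reference, seed, ext_per_side):
    """Group the hits of a primary seed by their extended sequence."""
    groups = defaultdict(list)
    k = len(seed)
    for pos in seed_hits(reference, seed):
        lo, hi = pos - ext_per_side, pos + k + ext_per_side
        if lo < 0 or hi > len(reference):
            continue  # toy simplification: drop hits too close to the ends
        groups[reference[lo:hi]].append(pos)
    return dict(groups)


print(extended_length(21, 7))  # 35, as in the example above

toy_reference = "TTACGTGGCCACGTAATTACGTGG"
print(group_by_extension(toy_reference, "ACGT", 2))
# {'TTACGTGG': [2, 18], 'CCACGTAA': [10]}
```

The three hits of the toy primary seed split into one group of two identical extended seeds and one unique extended seed, which is the same effect as the 100 hits dividing into 40 smaller groups.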
The maximum extended seed length, by default equal to the primary seed length plus 128, can be controlled with the --ht-max-ext-seed-len
option. For example, for short reads, it is advisable to set the maximum extended seed shorter than the read length, because extensions longer than the whole read can never match.
It is also possible to tune how aggressively seeds are extended using the following options (advanced usage):
--ht-cost-coeff-seed-len
--ht-cost-coeff-seed-freq
--ht-cost-penalty
--ht-cost-penalty-incr
There is a tradeoff between extension length and hit frequency. Faster mapping can be achieved using longer seed extensions to reduce seed hit frequencies, or more accurate mapping can be achieved by avoiding seed extensions or keeping extensions short, while tolerating the higher hit frequencies that result. Shorter extensions can benefit mapping quality both by fitting seeds better between SNPs, and by finding more candidate mapping locations at which to score alignments. The default extension settings, along with the default seed frequency settings, lean aggressively toward mapping accuracy, with relatively short seed extensions and high hit frequencies.
The defaults for the seed frequency options are as follows:
One primary or extended seed can match multiple places in the reference genome. All such matches are populated into the hash table, and retrieved when the DRAGEN mapper looks up a corresponding seed extracted from a read. The multiple reference positions are then considered and compared to generate aligned mapper output. However, the DRAGEN software enforces a limit on the number of matches, or frequency, of each seed, which is controlled with the --ht-max-seed-freq option
. By default, the frequency limit is 16. In practice, when the software encounters a seed with higher frequency, it extends it to a sufficiently long secondary seed that the frequency of any particular extended seed pattern falls within the limit. However, if a maximum seed extension would still exceed the limit, the seed is rejected, and not populated into the hash table. Instead, a single High Frequency record is populated.
This seed frequency limit does not tend to impact DRAGEN mapping quality notably, for two reasons. First, because seeds are rejected only when extension fails, only extremely high-frequency primary seeds, typically with many thousands of matches, are rejected. Such seeds are not very useful for mapping. Second, there are other seed positions to check in a given read. If another seed position is unique enough to return one or more matches, the read can still be properly mapped. However, if all seed positions were rejected as high frequency, often this means that the entire read matches similarly well in many reference positions, so even if the read were mapped it would be an arbitrary choice, with very low or zero MAPQ.
Thus, the default frequency limit of 16 for --ht-max-seed-freq
works well. However, it may be decreased or increased, up to a maximum of 256. A higher frequency limit tends to marginally increase the number of reads mapped (especially for short reads), but commonly the additional mapped reads have very low or zero MAPQ. This also tends to slow down DRAGEN mapping, because correspondingly large numbers of possible mappings are occasionally considered.
In addition to a frequency limit, a target seed frequency can be specified with the --ht-target-seed-freq
option. This target frequency is used when extensions are generated for high frequency primary seeds. Extension lengths are chosen with a preference toward extended seed frequencies near the target. The default of 4 for --ht-target-seed-freq
means that the software is biased toward generating shorter seed extensions than necessary to map seeds uniquely.
When building a reference hash table from a FASTA with ALT contigs, it may be desirable to mask certain regions of high similarity, or to establish liftover relationships between primary and alternate contigs. The recommended approach is masking, as described in the Map-Align section. When hg19 or hg38 alt contigs are detected, the hash table builder requires a liftover file or a bed file to mask the alt contigs. If none is provided, a mask bed file from <INSTALL_PATH>/fasta_mask/
will be used automatically.
DRAGEN has adopted a masked approach to handle native reference ALT contigs, where strategic regions are masked to increase accuracy. The hash table builder will build the mapper hash table as if the regions that were specified in the argument for ht-mask-bed
were masked with N's. The hash table builder will only allow setting one of ht-mask-bed
or ht-alt-liftover
. Each line in the bed file is expected to contain a contig name, start position (0-based), and end position (1-based), separated by a single tab or space. Lines that start with # are ignored by the hash table builder to allow commenting. Any line with a contig name that is not found in the input FASTA is skipped and logged to the DRAGEN log file. Likewise, lines that describe empty intervals are skipped. If all lines are skipped this way, the hash table builder will issue an error and abort, unless the mask bed file was automatically applied (see Automatic masking). The hash table builder will always issue an error and abort if an interval described in the BED file is outside of the range of the corresponding contig in the FASTA. Lines that are not skipped are written to a file called mask.bed that will be present in the hash table output directory, and whose digest will appear in hash_table.cfg. This file is used when a reference is loaded to the FPGA card to dynamically mask reference.bin.
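The line-handling rules above can be summarized as a small pre-check that you might run on a custom mask BED file before building. This is a hypothetical helper, not part of DRAGEN. It assumes that blank lines are ignored and that the empty-interval check is applied before the range check, neither of which is stated in the text.

```python
def filter_mask_bed(lines, contig_lengths, auto_applied=False):
    """Return (kept, skipped) following the rules described above.

    contig_lengths maps each contig name in the input FASTA to its length.
    """
    kept, skipped = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # comments (and, by assumption, blank lines) are ignored
        contig, start, end = line.split()[:3]  # tab or space separated
        start, end = int(start), int(end)
        if contig not in contig_lengths:
            skipped.append(line)  # contig not in the FASTA: skip (DRAGEN logs it)
            continue
        if end <= start:
            skipped.append(line)  # empty interval: skip
            continue
        if start < 0 or end > contig_lengths[contig]:
            raise ValueError(f"interval outside contig range: {line!r}")
        kept.append((contig, start, end))
    if not kept and not auto_applied:
        raise ValueError("all mask BED lines were skipped")
    return kept, skipped


kept, skipped = filter_mask_bed(
    ["# toy example", "chr1\t0\t10", "chrUn\t0\t5", "chr1 5 5"],
    {"chr1": 100},
)
print(kept)     # [('chr1', 0, 10)]
print(skipped)  # ['chrUn\t0\t5', 'chr1 5 5']
```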
When running from a fasta for which hg38 or hg19 is detected (See Automatic Reference Detection), and no argument for ht-mask-bed
or ht-alt-liftover
was provided, the hash table builder will automatically apply the corresponding bed file for the detected reference from <INSTALL_PATH>/fasta_mask/
. Note that the hash table builder identifies alt contigs by name, so when running from an input FASTA that contains alt contigs with standard names but modified base content, it is recommended to suppress automatic masking by setting ht-suppress-mask=true
or by passing a custom mask bed file to ht-mask-bed
.
The behavior of DRAGEN with respect to the handling of decoy contigs in the reference has changed since version 2.6.
Starting with DRAGEN 3.x, DRAGEN's hash table builder automatically detects the absence of the decoy contigs from the reference and adds them to the FASTA file prior to building the hash table. The decoys file is found at <INSTALL_PATH>/liftover/hs_decoys.fa
. If the reference is missing the decoy contigs, then the reads which map to the decoy contigs are artificially marked as unmapped in the output BAM (because the original reference does not have the decoy contigs). This results in an artificially lower mapping rate. However, the accuracy of variant calling is improved because false positives caused by decoy reads are removed.
Illumina recommends using this feature by default. However, you can set the --ht-suppress-decoys
option to true to suppress adding these decoys to the hash table.
The table below describes the difference in behavior between older DRAGEN versions (2.6 and earlier) and DRAGEN 3.x versions with respect to the handling of decoy contigs in the hash table builder:
Use the --build-hash-table
option to transform a reference FASTA into the hash table for DRAGEN mapping. It takes as input a FASTA file (with multiple reference sequences concatenated) and a preexisting output directory. Build command usage is as follows:
The --ht-reference
and --output-directory
options are required for building a hash table. The --ht-reference
option specifies the path to the reference FASTA file, while --output-directory
specifies a preexisting directory where the hash table output files are written. Illumina recommends organizing various hash table builds into different folders. As a best practice, folder names should include any nondefault parameter settings used to generate the contained hash table. The sequence names in the reference FASTA file must be unique.
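As a sketch of how the required options fit together, the following Python snippet assembles a build command line as an argument list. It uses only options named in this document; the FASTA path, the output folder name, and the helper itself are hypothetical, and the snippet prints the command rather than running it.

```python
import shlex


def build_hash_table_command(reference_fasta, output_dir, extra_options=None):
    """Assemble a dragen hash table build command as an argument list."""
    cmd = [
        "dragen",
        "--build-hash-table", "true",
        "--ht-reference", reference_fasta,   # required: reference FASTA
        "--output-directory", output_dir,    # required: must already exist
    ]
    for name, value in (extra_options or {}).items():
        cmd += [f"--{name}", str(value)]
    return cmd


# Hypothetical paths; the folder name records the nondefault seed length.
cmd = build_hash_table_command("hg38.fa", "hg38_seed27", {"ht-seed-len": 27})
print(shlex.join(cmd))
```

Naming the output folder after the nondefault setting follows the best practice described above. The list form can be passed to `subprocess.run` on a machine where `dragen` is installed.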
While masking is the recommended approach to dealing with ALT contigs, DRAGEN also supports a liftover based method. To enable liftover based ALT-aware mapping in DRAGEN, build the hash table with a liftover file by using the --ht-alt-liftover
option. The hash table builder classifies each reference sequence as primary or alternate based on the liftover file, and packs primaries before alternates in reference.bin. SAM liftover files for hg38DH and hg19 are in the <INSTALL_PATH>/liftover
folder.
Custom Liftover Files
Custom liftover files can be used in place of those provided with DRAGEN. Liftover files must be SAM format, but no SAM header is required. SEQ and QUAL fields can be omitted ('*'). Each alignment record should have an alternate haplotype reference sequence name as QNAME, indicating the RNAME and POS of its liftover alignment in a destination (normally primary assembly) reference sequence.
Reverse-complemented alignments are indicated by bit 0x10 in FLAG. Records flagged unmapped (0x4) or secondary (0x100) are ignored. The CIGAR may include hard or soft clipping, leaving parts of the ALT contig unaligned.
A single reference sequence cannot serve as both an ALT contig (appearing in QNAME) and a liftover destination (appearing in RNAME). Multiple ALT contigs can align to the same primary assembly location. Multiple alignments can also be provided for a single ALT contig (extras can optionally be flagged 0x800 supplementary), such as to align one portion forward and another portion reverse-complemented. However, each base of the ALT contig only receives one liftover image, according to the first alignment record with an M CIGAR operation covering that base.
SAM records with QNAME missing from the reference genome are ignored, so that the same liftover file may be used for various reference subsets, but an error occurs if any alignment has its QNAME present but its RNAME absent.
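The record-level rules for custom liftover files can be checked with a short script before building. The sketch below is a hypothetical validator, not DRAGEN code. It relies only on the standard SAM column order (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, ...) and on the rules stated above; the contig names in the example are invented.

```python
def parse_liftover(lines, reference_names):
    """Return the liftover records that the rules above would keep."""
    records = []
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # a SAM header is optional, so header lines are skipped
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        pos, cigar = int(fields[3]), fields[5]
        if flag & 0x4 or flag & 0x100:
            continue  # unmapped or secondary records are ignored
        if qname not in reference_names:
            continue  # ALT contig absent from this reference subset
        if rname not in reference_names:
            raise ValueError(f"{qname}: destination {rname} not in reference")
        records.append({
            "alt": qname,
            "dest": rname,
            "pos": pos,
            "cigar": cigar,
            "reverse": bool(flag & 0x10),  # reverse-complemented alignment
        })
    both = {r["alt"] for r in records} & {r["dest"] for r in records}
    if both:
        raise ValueError(f"sequences used as both ALT and destination: {sorted(both)}")
    return records


toy_sam = [
    "chr1_alt\t0\tchr1\t101\t60\t50M\t*\t0\t0\t*\t*",
    "chr2_alt\t16\tchr2\t5\t60\t10S40M\t*\t0\t0\t*\t*",
    "chr3_alt\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",       # unmapped: ignored
    "chrX_alt\t0\tchrX\t1\t60\t5M\t*\t0\t0\t*\t*",   # QNAME not in reference: ignored
]
toy_names = {"chr1", "chr1_alt", "chr2", "chr2_alt", "chr3_alt"}
records = parse_liftover(toy_sam, toy_names)
print([(r["alt"], r["dest"], r["reverse"]) for r in records])
# [('chr1_alt', 'chr1', False), ('chr2_alt', 'chr2', True)]
```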
DRAGEN analysis is capable of mapping on a multigenome (graph) hash table. The multigenome hash table introduces alternate graph paths to the linear reference hash table to represent more broadly the allelic diversity of the population over the whole genome or in specific regions defined in a bed file. The accuracy gains from this methodology have been described in scientific blogs available on the Illumina Genomics Research Hub site. Multigenome hash tables for CHM13_v2, hg38, hg19 and hs37d5 assemblies are available on the DRAGEN Software Support Site page.
See DRAGEN Graph Mapper for information on the graph mapping method.
It is possible to build a custom multigenome reference in order to:
Customize the released multigenome hash table with custom bed files or hash table builder options. A set of bed files is available in the resource files on the DRAGEN Software Support Site page.
Generate a population-specific multigenome hash table from a pangenome msVCF generated from the BSSH app.
Generate a human or non-human multigenome hash table from a customer-provided msVCF.
The required input files are a single multi-sample VCF file containing the set of population variants and, optionally, bed files restricting the graph to specific regions. The generated files, including hash_table.cmp and associated files in the specified output directory, can then be used as the reference hash table for the DRAGEN mapper. DRAGEN software supports the tool on human references with files available on the DRAGEN Software Support Site page. For non-human references, the user provides the required resource files.
Example command usage for the multigenome hash table builder:
dragen
  --build-hash-table true (required)
  --ht-graph-msvcf-file <path to a multi-sample VCF file> (required for multigenome reference)
  --ht-reference <reference.fasta> (required)
  --ht-graph-extra-kmer-bed <graph.bed> (optional)
  --ht-mask-bed <mask.bed> (optional)
  --ht-graph-exclusion-bed <exclusion bed> (optional)
  --output-directory <DIR> (required)
  [options]
The custom multigenome hash table builder tool uses a set of population variants provided by the user to generate a multigenome hash table. The variants must be specified in VCF format, in a single multi-sample VCF (msVCF) file containing the variants for a set of individuals. This multi-sample VCF file must have specific formatting described below.
The custom multigenome hash table builder tool only supports msVCF file input respecting the format described below:
msVCF compliant with 4.2 VCF format specification
with variants positionally sorted in the same contig order as the main FASTA reference genome provided in --ht-reference
records shall include diploid or haploid GT calls
supports multi-allelic variants, either merged on a single line or split across multiple lines
with the following FILTER codes, non-PASS records are ignored:
##FILTER=<ID=PASS,Description="All filters passed">
with the following FORMAT field :
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
for better results, we recommend left-aligning variants.
the recommended maximum number of samples in the msVCF is 256. A higher number may lead to very high memory usage at hash table creation.
Note: INFO/FORMAT subfields must be defined in the header. Events with undefined subfields are ignored.
To build a high-performance custom genome it is highly recommended to use long read sequencing data. We recommend using external tools such as Whatshap (https://github.com/whatshap/whatshap) to generate phased input. DRAGEN analysis leverages the phasing information to reconstruct population haplotypes.
A reference genome in FASTA format must be provided. Reference genomes are available to download from the DRAGEN Software Support Site page.
Note: the reference genome provided as input must be the same as the one used to generate the input phased msVCF. If the msVCF contains variants from regions not present in the fasta file, the multigenome reference builder will stop with an error.
This bed file is used to filter out regions of the msVCF file. Variants that fall within intervals defined in the "Graph exclusion bed" file will be ignored and not used in any part of the multigenome reference builder. The result will be the same as if the input msVCF did not contain any variants in the regions defined in the exclusion bed. The file is optional; by default, every variant in the msVCF file is used. Exclusion bed files are available to download from the DRAGEN Software Support Site page.
A custom exclusion bed file can also be provided given the following format: tab delimited with first three columns being: contig name, start position, end position. Any line with a contig name that is not found in the input FASTA is skipped. Any lines that describe empty intervals are skipped.
Note: records of the exclusion bed file provided must be from the same build as the reference genome used to build the multigenome reference.
This file is used to define regions in the genome where extra seeds will be indexed in the hash table. By default, only seeds from the primary reference are extracted and saved in the reference hash table for mapping. This option additionally generates seeds from population variants in the defined regions. It is recommended to include the expected difficult regions in this bed file. Extra-kmer-bed files are available to download from the DRAGEN Software Support Site page for the human hg38, hg19, hs37d5, and chm13 references.
A custom Extra-kmer-bed file can also be provided given the following format: tab delimited with first three columns being: contig name, start position, end position. Any line with a contig name that is not found in the input FASTA is skipped. Any lines that describe empty intervals are skipped.
Note: records of the Extra-kmer-bed file provided must be from the same build as the reference genome used to build the graph reference.
A mask bed file must be provided in order to mask certain regions of high similarity between primary and alternate contigs present in the main genome FASTA. Mask bed files are available to download from the DRAGEN Software Support Site page.
A custom mask bed file can also be provided given the following format: tab delimited with first three columns being: contig name, start position, end position. Any line with a contig name that is not found in the input FASTA is skipped. Any lines that describe empty intervals are skipped.
Note: records of the mask bed file provided must be from the same build as the reference genome used to build the graph reference.
Note: The custom graph reference hash table end to end pipeline will return an error if options --ht-alt-liftover or --ht-allow-mask-and-liftover are specified.
The hash table builder generates the following outputs:
The --ht-seed-len
option specifies the initial length in nucleotides of seeds from the reference genome to populate into the hash table. At run time, the mapper extracts seeds of this same length from each read, and looks for exact matches (unless seed editing is enabled) in the hash table.
The maximum primary seed length is a function of hash table size. The limit is k=27 for table sizes from 16 GB to 64 GB, covering typical sizes for whole human genome, or k=26 for sizes from 4 GB to 16 GB.
The minimum primary seed length depends mainly on the reference genome size and complexity. It needs to be long enough to resolve most reference positions uniquely. For whole human genome references, hash table construction typically fails with k < 16. The lower bound may be smaller for shorter genomes, or higher for less complex (more repetitive) genomes. The uniqueness threshold of --ht-seed-len 16
for the 3.1Gbp human genome can be understood intuitively because log4(3.1 G) ≈ 16, so it requires at least 16 choices from 4 nucleotides to distinguish 3.1 G reference positions.
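The log4 estimate can be checked numerically. The helper below is only an illustration of that intuition (the function name is hypothetical); it says nothing about where hash table construction actually fails for a given genome, which also depends on repetitiveness.

```python
import math


def uniqueness_threshold(genome_size_bp):
    """Number of bases needed for 4**k to reach the number of reference positions."""
    return math.log(genome_size_bp, 4)


k_min = uniqueness_threshold(3.1e9)
print(round(k_min, 2))    # 15.76
print(math.ceil(k_min))   # 16
```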
For read mapping to succeed, at least one primary seed must match exactly (or with a single SNP when edited seeds are used). Shorter seeds are more likely to map successfully to the reference, because they are less likely to overlap variants or sequencing errors, and because more of them fit in each read. So for mapping accuracy, shorter seeds are generally better.
However, very short seeds can sometimes reduce mapping accuracy. Very short seeds often map to multiple reference positions, and lead the mapper to consider more false mapping locations. Due to imperfect modeling of mutations and errors by Smith-Waterman alignment scoring and other heuristics, occasionally these noise matches may be reported. Run time quality filters such as --Aligner.aln_min_score
can control the accuracy issues with very short seeds.
Shorter seeds tend to slow down mapping, because they map to more reference locations, resulting in more work such as Smith-Waterman alignments to determine the best result. This effect is most pronounced when primary seed length approaches the reference genome's uniqueness threshold, eg, k=16 for whole human genome.
Read Length---Generally, shorter seeds are appropriate for shorter reads, and longer seeds for longer reads. Within a short read, a few mismatch positions (variants or sequencing errors) can chop the read into only short segments matching the reference, so that only a short seed can fit between the differences and match the reference exactly. For example, in a 36 bp read, just one SNP in the middle can block seeds longer than 18 bp from matching the reference. By contrast, in a 250 bp read, it takes 15 SNPs to exceed a 0.01% chance of blocking even 27 bp seeds.
Paired Ends---The use of paired end reads can make longer seeds yield good mapping accuracy. DRAGEN uses paired end information to improve mapping accuracy, including with rescue scans that search the expected reference window when only one mate has seeds mapping to a given reference region. Thus, paired end reads have essentially twice the opportunity for an exact matching seed to find their correct alignments.
Variant or Error Rate---When read differences from the reference are more frequent, shorter seeds may be required to fit between the difference positions in a given read and match the reference exactly.
Mapping Percentage Requirement---If the application requires a high percentage of reads to be mapped somewhere (even at low MAPQ), short seeds may be helpful. Some reads that do not match the reference well anywhere are more likely to map using short seeds to find partial matches to the reference.
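The 36 bp example under Read Length can be verified with a small helper that returns the longest stretch of a read that is free of mismatches, which is the longest seed that can still match the reference exactly. The function name and the 0-based mismatch positions are illustrative assumptions.

```python
def longest_exact_seed(read_len, mismatch_positions):
    """Longest run of read bases (0-based positions) containing no mismatch."""
    longest = 0
    prev = -1
    # Treat the end of the read as a final boundary.
    for pos in sorted(mismatch_positions) + [read_len]:
        longest = max(longest, pos - prev - 1)
        prev = pos
    return longest


print(longest_exact_seed(36, [17]))       # 18: seeds longer than 18 bp are blocked
print(longest_exact_seed(36, []))         # 36: no mismatches, the whole read matches
print(longest_exact_seed(100, [10, 50]))  # 49
```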
The --ht-max-ext-seed-len
option limits the length of extended seeds populated into the hash table. Primary seeds (length specified by --ht-seed-len
) that match many reference positions can be extended to achieve more unique matching, which may be required to map seeds within the maximum hit frequency (--ht-max-seed-freq
).
Given a primary seed length k, the maximum seed length can be configured between k and k+128. The default is the upper bound, k+128.
The --ht-max-ext-seed-len
option is recommended for short reads, eg, less than 50 bp. In such cases, it is helpful to limit seed extension to the read length minus a small margin, such as 1-4 bp. For example, with 36 bp reads, setting --ht-max-ext-seed-len
to 35 might be appropriate. This ensures that the hash table builder does not plan a seed extension longer than the read, which would cause seed extension and mapping to fail at run time for seeds that could have fit within the read with shorter extensions.
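The short-read guidance above amounts to a clamp between k and k+128. The helper below is a hypothetical convenience for picking a value, with the default margin of 1 bp chosen to reproduce the 36 bp example; it is not a DRAGEN formula.

```python
def suggested_max_ext_seed_len(read_len, seed_len=21, margin=1):
    """Read length minus a small margin, kept within [k, k + 128]."""
    upper = seed_len + 128  # default and maximum for --ht-max-ext-seed-len
    return max(seed_len, min(read_len - margin, upper))


print(suggested_max_ext_seed_len(36))    # 35
print(suggested_max_ext_seed_len(250))   # 149, the default k + 128 already applies
```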
While seed extension can be similarly limited for longer reads, eg, setting --ht-max-ext-seed-len
to 99 for 100 bp reads, there is little utility in this because seeds are extended conservatively in any event. Even with the default k+128 limit, individual seeds are only extended to the lengths required to fit under the maximum hit frequency (--ht-max-seed-freq
), and at most a few bases longer to approach the target hit frequency (--ht-target-seed-freq
), or to avoid taking too many incremental extension steps.
The --ht-max-seed-freq
option sets a firm limit on the number of seed hits (reference genome locations) that can be populated for any primary or extended seed. If a given primary seed maps to more reference positions than this limit, it must be extended long enough that the extended seeds subdivide into smaller groups of identical seeds under the limit. If, even at the maximum extended seed length (--ht-max-ext-seed-len
), a group of identical reference seeds is larger than this limit, their reference positions are not populated into the hash table. Instead, a single High Frequency record is populated.
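The interaction between the frequency limit and the maximum extended seed length can be illustrated with a toy model. This sketch extends one base per side at a time until every group fits under the limit, and otherwise reports the group as high frequency. It is a deliberate simplification under stated assumptions: the names are hypothetical, hits too close to the reference ends are simply dropped, and the real builder chooses extension lengths using cost settings and the target frequency, which are not modeled here.

```python
from collections import defaultdict


def populate_seed(reference, seed, positions, max_freq, max_ext_len):
    """Return (records, high_frequency) for one primary seed.

    records maps each populated (possibly extended) seed to its positions;
    high_frequency lists patterns still over the limit at the maximum length.
    """
    k = len(seed)
    records, high_frequency = {}, []
    pending = [(seed, 0, sorted(positions))]  # (pattern, extension per side, hits)
    while pending:
        pattern, ext, group = pending.pop()
        if len(group) <= max_freq:
            records[pattern] = group
            continue
        ext += 1
        if k + 2 * ext > max_ext_len:
            high_frequency.append(pattern)  # stands in for a High Frequency record
            continue
        buckets = defaultdict(list)
        for pos in group:
            lo, hi = pos - ext, pos + k + ext
            if lo >= 0 and hi <= len(reference):
                buckets[reference[lo:hi]].append(pos)
        pending.extend((p, ext, g) for p, g in buckets.items())
    return records, high_frequency


toy_reference = "TTACGTGGCCACGTAATTACGTGG"
hits = [2, 10, 18]  # positions of the toy primary seed "ACGT"

print(populate_seed(toy_reference, "ACGT", hits, max_freq=2, max_ext_len=8))
# ({'CACGTA': [10], 'TACGTG': [2, 18]}, [])
print(populate_seed(toy_reference, "ACGT", hits, max_freq=1, max_ext_len=8))
# ({'CACGTA': [10]}, ['TTACGTGG'])
```

With a limit of 2, one extension step is enough. With a limit of 1, the two identical hits cannot be separated within the maximum extended length, so their positions are not populated.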
The maximum hit frequency can be configured from 1 to 256. However, if this value is too low, hash table construction can fail because too many seed extensions are needed. The practical minimum for a whole human genome reference, other options being default, is 8.
Generally, a higher maximum hit frequency leads to more successful mapping. There are two reasons for this. First, a higher limit rejects fewer reference positions that cannot map under it. Second, a higher limit allows seed extensions to be shorter, improving the odds of exact seed matching without overlapping variants or sequencing errors.
However, as with very short seeds, allowing high hit counts can sometimes hurt mapping accuracy. Most of the seed hits in a large group are not to the true mapping location, and occasionally one of these noise hits may be reported due to imperfect scoring models. Also, the mapper limits the total number of reference positions it considers, and allowing very high hit counts can potentially crowd out the actual best match from consideration.
Higher maximum hit frequencies slow down read mapping, because seed mapping finds more reference locations, resulting in more work, such as Smith-Waterman alignments, to determine the best result.
The DRAGEN Software enables the user to build a custom multigenome hash table from a set of population variants. The population variants are specified in a single multi-sample VCF file.
--ht-graph-msvcf-file: Input file containing list of population variants, in multi-sample VCF format.
This option replaces the following options, which were previously used to build a graph reference and are now deprecated.
List of deprecated options:
--ht-pop-alt-contigs: Population based alternate contigs FASTA.
--ht-pop-alt-liftover: Liftover SAM file of population alternate contigs.
--ht-pop-snps: Population based SNPs VCF
The following options control building hash tables from references with ALT-contigs. See References with ALT contigs for more information.
--ht-mask-bed
: Set a custom BED file that defines which regions to mask. If not provided, the DRAGEN software automatically applies BED files for hg38 and hg19 from <INSTALL_PATH>/fasta_mask
.
--ht-alt-liftover
: Set a liftover file to build a liftover based ALT-aware hash table. SAM liftover files for hg38DH and hg19 are provided in <INSTALL_PATH>/liftover
.
--ht-allow-mask-and-liftover
: Allow the use of both --ht-mask-bed
and --ht-alt-liftover
together.
--ht-suppress-mask
: Suppress automatic detection of the default mask bed files when building the hash table.
--ht-decoys
The DRAGEN software automatically detects the use of hg19 and hg38 references and adds decoys to the hash table when they are not found in the FASTA file. Use the --ht-decoys
option to specify the path to a decoys file. The default is <INSTALL_PATH>/liftover/hs_decoys.fa
.
--ht-suppress-decoys
: Suppress automatic detection of the default decoys file when building the hash table.
--ht-num-threads
The --ht-num-threads
option determines the maximum number of worker CPU threads that are used to speed up hash table construction. The default for this option is 8, with a maximum of 32 threads allowed. If your server supports execution of more threads, it is recommended that you use the maximum. For example, the DRAGEN servers contain 24 cores that have hyperthreading enabled, so a value of 32 should be used. When using a higher value, --ht-max-table-chunks
needs to be adjusted as well. The servers have 128 GB of memory available.
--ht-max-table-chunks
The --ht-max-table-chunks
option controls the memory footprint during hash table construction by limiting the number of ~1 GB hash table chunks that reside in memory simultaneously. Each additional chunk consumes roughly twice its size (~2 GB) in system memory during construction. The hash table is divided into power-of-two independent chunks, of a fixed chunk size, X, which depends on the hash table size, in the range 0.5 GB < X ≤ 1 GB. For example, a 24 GB hash table contains 32 independent 0.75 GB chunks that can be constructed by parallel threads with enough memory and a 16 GB hash table contains 16 independent 1 GB chunks. The default is --ht-max-table-chunks
equal to --ht-num-threads
, but with a minimum default --ht-max-table-chunks
of 8. It makes sense to have these two options match, because building one hash table chunk requires one chunk space in memory and one thread to work on it. Nevertheless, there are build-speed advantages to raising --ht-max-table-chunks
higher than --ht-num-threads
, or to raising --ht-num-threads
higher than --ht-max-table-chunks
.
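The chunk arithmetic above can be checked with a short Python sketch. The function names are illustrative and are not DRAGEN options. The memory figure is only a rough estimate that applies the "roughly twice its size" rule to the chunks resident at once.

```python
def chunk_layout(table_size_gb):
    """Return (chunk_count, chunk_size_gb) for a hash table of the given size.

    The chunk count is the power of two n for which 0.5 < size / n <= 1,
    which holds for any table of more than 0.5 GB.
    """
    if table_size_gb <= 0:
        raise ValueError("table size must be positive")
    n = 1
    while table_size_gb / n > 1.0:
        n *= 2
    return n, table_size_gb / n


def estimated_build_memory_gb(table_size_gb, max_table_chunks):
    """Rough estimate: each chunk held in memory costs about twice its size."""
    n, chunk_gb = chunk_layout(table_size_gb)
    resident = min(n, max_table_chunks)
    return resident * 2 * chunk_gb


print(chunk_layout(24))                  # 24 GB table
print(chunk_layout(16))                  # 16 GB table
print(estimated_build_memory_gb(24, 8))  # 8 chunks resident at a time
```

Under these assumptions, raising --ht-max-table-chunks from 8 to 32 for a 24 GB table raises the estimate from 12 GB to 48 GB, which is why the option should be sized against the memory available on the server.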
--ht-mem-limit
Memory Limit. The --ht-mem-limit option controls the generated hash table size by specifying the DRAGEN card memory available for both the hash table and the encoded reference genome. The --ht-mem-limit option defaults to 32 GB when the reference genome approaches whole human genome (WHG) size, or to a generous size for smaller references. Normally there is little reason to override these defaults.
--ht-size
Hash Table Size. The --ht-size option specifies the hash table size to generate, rather than calculating an appropriate table size from the reference genome size and the available memory (option --ht-mem-limit). Using default table sizing is recommended, and using --ht-mem-limit is the next best choice.
--ht-ref-seed-interval
Seed Interval. The --ht-ref-seed-interval option defines the step size between positions of seeds in the reference genome populated into the hash table. An interval of 1 (default) means that every seed position is populated, 2 means 50% of positions are populated, and so on. Noninteger values are supported. For example, 2.5 yields 40% populated. Seeds from a whole human reference are easily 100% populated with 32 GB memory on DRAGEN boards. If a substantially larger reference genome is used, increase the interval so that fewer seed positions are populated.
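The relationship between the interval and the populated fraction is simply 1/interval. The sketch below also shows one way a noninteger step could select positions (rounding each multiple down). That placement rule is an assumption made for illustration; the guide does not state how DRAGEN places seeds for fractional intervals.

```python
import math


def populated_fraction(seed_interval):
    """Fraction of reference seed positions populated for a given interval."""
    if seed_interval < 1:
        raise ValueError("seed interval must be at least 1")
    return 1.0 / seed_interval


def populated_positions(n_positions, seed_interval):
    """Illustrative placement only: take floor(k * interval) for k = 0, 1, 2, ..."""
    positions = []
    k = 0
    while True:
        pos = math.floor(k * seed_interval)
        if pos >= n_positions:
            break
        positions.append(pos)
        k += 1
    return positions


print(populated_fraction(2.5))       # fraction for an interval of 2.5
print(populated_positions(10, 2.5))  # 4 of 10 positions are selected
```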
--ht-soft-seed-freq-cap and --ht-max-dec-factor
Soft Frequency Cap and Maximum Decimation Factor for Seed Thinning. Seed thinning is an experimental technique to improve mapping performance in high-frequency regions. When primary seeds have higher frequency than the cap indicated by the --ht-soft-seed-freq-cap option, only a fraction of seed positions are populated to stay under the cap. The --ht-max-dec-factor option specifies a maximum factor by which seeds can be thinned. For example, --ht-max-dec-factor 3 retains at least 1/3 of the original seeds, and --ht-max-dec-factor 1 disables any thinning. Seeds are decimated in careful patterns to prevent leaving any long gaps unpopulated. The idea is that seed thinning can achieve mapped seed coverage in high-frequency reference regions where the maximum hit frequency would otherwise have been exceeded. Seed thinning can also keep seed extensions shorter, which is also good for successful mapping. Based on testing to date, seed thinning has not proven to be superior to other accuracy optimization methods.
--ht-rand-hit-hifreq and --ht-rand-hit-extend
Random Sample Hit with HIFREQ Record and EXTEND Record. Whenever a HIFREQ or EXTEND record is populated into the hash table, it stands in place of a large set of reference hits for a certain seed. Optionally, the hash table builder can choose a random representative of that set, and populate that HIT record alongside the HIFREQ or EXTEND record. Random sample hits provide alternative alignments that are very useful in estimating MAPQ accurately for the alignments that are reported. They are never used outside of this context for reporting alignment positions, because that would result in biased coverage of locations that happened to be selected during hash table construction. To include a sample hit, set --ht-rand-hit-hifreq to 1. The --ht-rand-hit-extend option is a minimum pre-extension hit count to include a sample hit, or zero to disable. Modifying these options is not recommended.
DRAGEN seed extension is dynamic, applied as needed for particular K-mers that map to too many reference locations. Seeds are incrementally extended in steps of 2 to 14 bases (always even) from a primary seed length to a fully extended length. The bases are appended symmetrically in each extension step, determining the next extension increment if any.
There is a potentially complex seed extension tree associated with each high frequency primary seed. Each full tree is generated during hash table construction and a path from the root is traced by iterative extension steps during seed mapping. The hash table builder employs a dynamic programming algorithm to search the space of all possible seed extension trees for an optimal one, using a cost function that balances mapping accuracy and speed. The following options define that cost function:
--ht-target-seed-freq
Target Hit Frequency. The --ht-target-seed-freq option defines the ideal number of hits per seed for which seed extension should aim. Higher values lead to fewer and shorter final seed extensions, because shorter seeds tend to match more reference positions.
--ht-cost-coeff-seed-len
Cost Coefficient for Seed Length. The --ht-cost-coeff-seed-len option assigns the cost component for each base by which a seed is extended. Additional bases are considered a cost because longer seeds risk overlapping variants or sequencing errors and losing their correct mappings. Higher values lead to shorter final seed extensions.
--ht-cost-coeff-seed-freq
Cost Coefficient for Hit Frequency. The --ht-cost-coeff-seed-freq option assigns the cost component for the difference between the target hit frequency and the number of hits populated for a single seed. Higher values result primarily in high-frequency seeds being extended further to bring their frequencies down toward the target.
--ht-cost-penalty
Cost Penalty for Seed Extension. The --ht-cost-penalty option assigns a flat cost for extending beyond the primary seed length. A higher value results in fewer seeds being extended at all. Current testing shows that zero (0) is appropriate for this parameter.
--ht-cost-penalty-incr
Cost Increment for Extension Step. The --ht-cost-penalty-incr option assigns a recurring cost for each incremental seed extension step taken from primary to final extended seed length. More steps are considered a higher cost because extending in many small steps requires more hash table space for intermediate EXTEND records, and takes substantially more run time to execute the extensions. A higher value results in seed extension trees with fewer nodes, reaching from the root primary seed length to leaf extended seed lengths in fewer, larger steps.
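To make the roles of these five options concrete, the following Python sketch combines them into a single additive cost. The guide does not publish the actual cost function, so the linear form, the use of an absolute difference, and all names here are assumptions for illustration only. The defaults are the values listed for these options in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostParams:
    target_seed_freq: float = 4.0  # --ht-target-seed-freq
    coeff_seed_len: float = 1.0    # --ht-cost-coeff-seed-len
    coeff_seed_freq: float = 0.5   # --ht-cost-coeff-seed-freq
    penalty: float = 0.0           # --ht-cost-penalty
    penalty_incr: float = 0.7      # --ht-cost-penalty-incr


def extension_cost(extra_bases, hits, steps, p=CostParams()):
    """Hypothetical additive cost of one extended seed (not DRAGEN's formula)."""
    cost = p.coeff_seed_len * extra_bases                   # per extended base
    cost += p.coeff_seed_freq * abs(hits - p.target_seed_freq)  # distance from target
    if steps > 0:
        cost += p.penalty                                   # flat cost of extending at all
        cost += p.penalty_incr * steps                      # recurring cost per step
    return cost


# A seed left at primary length with 40 hits versus the same seed
# extended by 8 bases in two steps down to 4 hits.
print(extension_cost(extra_bases=0, hits=40, steps=0))
print(extension_cost(extra_bases=8, hits=4, steps=2))
```

In this toy model, extension wins whenever the reduction in the frequency term outweighs the per-base and per-step terms, which mirrors the trade-off between mapping accuracy and speed described above.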
When building a hash table, DRAGEN configures the options for DNA analysis by default. To run RNA-Seq data, you must build an RNA-Seq hash table by setting --ht-build-rna-hashtable to true. When running RNA-Seq alignment, refer to the original --output-directory instead of the automatically generated subdirectory.
If using the CNV pipeline, set --enable-cnv to true. The command generates an additional k-mer hash map that is used in the CNV algorithm. Illumina recommends always using the --enable-cnv option, so you can perform CNV calling with the same hash table used for mapping and aligning.
To run the methylation pipeline, you must build a methylation-specific hash table. DRAGEN can build a single-pass or legacy multi-pass methylation hash table. Methylation runs that use a single-pass hash table complete faster than runs that use the legacy multi-pass hash tables. Single-pass hash tables are recommended for building methylation tables and running analyses.
The following is an example of a single-pass hash table build. The example generates a combined hash table in your reference index folder under the methyl_converted subdirectory.
dragen --build-hash-table true --output-directory $REFDIR --ht-reference $FASTA --ht-num-threads 40 --ht-methylated-combined=true --ht-seed-len 27
Multi-pass methylation mapping requires building two special hash tables with reference bases converted from C to T in one table and G to A in the other table. The conversions are performed automatically when using the --ht-methylated command line option. The converted hash tables are generated in two subdirectories under the folder specified using the --output-directory command line option. The subdirectories are named CT_converted and GA_converted, corresponding with the base conversions. When using the hash tables for methylated alignment runs, make sure to refer to the --output-directory folder, not the subdirectories.
The base conversions remove a significant amount of information from the hash tables. You might need to use different hash table parameters than in a conventional hash table build. The following options are recommended for building hash tables for mammalian species.
dragen --build-hash-table=true --output-directory $REFDIR --ht-reference $FASTA --ht-max-seed-freq 16 --ht-seed-len 27 --ht-num-threads 40 --ht-methylated=true
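The two methylation builds differ only in a few options, which a small Python helper can make explicit. This is a sketch that assembles an argument list from the options shown in the two example commands. The function name and the paths are hypothetical, and nothing is executed.

```python
import shlex


def methylation_build_args(ref_dir, fasta, num_threads, single_pass=True):
    """Assemble a hash table build command for single-pass or multi-pass methylation."""
    args = [
        "dragen",
        "--build-hash-table", "true",
        "--output-directory", ref_dir,
        "--ht-reference", fasta,
        "--ht-num-threads", str(num_threads),
        "--ht-seed-len", "27",
    ]
    if single_pass:
        # Combined table, written under the methyl_converted subdirectory.
        args.append("--ht-methylated-combined=true")
    else:
        # Legacy build: CT_converted and GA_converted tables.
        args += ["--ht-methylated=true", "--ht-max-seed-freq", "16"]
    return args


# Hypothetical paths, printed as a shell-quoted command line for review.
print(shlex.join(methylation_build_args("/staging/ref/meth", "/staging/ref/genome.fa", 32)))
print(shlex.join(methylation_build_args("/staging/ref/meth", "/staging/ref/genome.fa", 32,
                                        single_pass=False)))
```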
To run the HLA caller, an HLA-specific anchored reference hash table must be built. Set --ht-build-hla-hashtable to true. The command creates an anchored_hla subdirectory inside the --output-directory. The HLA-specific reference subdirectory can be built at the same time as the primary reference construction.
An HLA resource file is packaged with DRAGEN and located at the following path after installation: <INSTALL_PATH>/resources/hla/HLA_resource.v1.fasta.gz. This file is used by default when building the HLA-specific anchored hash table. A custom file can be specified with --ht-hla-reference. For more information, see Using Custom HLA Reference Files in the HLA section.
Reference | Primary Contig Names |
---|---|
hg38, hg19, chm13v2 | chr1-chr22, chrX, chrY |
hs37d5 | 1-22, X, Y |

Value for --ht-seed-len | Read Length |
---|---|
21 | 100 bp to 150 bp |
17 to 19 | shorter reads (36 bp) |
27 | 250+ bp |

Option | Default |
---|---|
--ht-cost-coeff-seed-len | 1 |
--ht-cost-coeff-seed-freq | 0.5 |
--ht-cost-penalty | 0 |
--ht-cost-penalty-incr | 0.7 |
--ht-max-seed-freq | 16 |
--ht-target-seed-freq | 4 |

DRAGEN Behavior | DRAGEN 2.6 and earlier versions | DRAGEN 3.0 and later versions |
---|---|---|
Reference does not include the decoy contigs (eg, hg19) | Decoy reads mismap elsewhere in the genome due to the lack of contigs in the reference. Artificially higher mapping rate. False positive calls in noisy regions to which the decoy reads are mismapped. | DRAGEN automatically detects the absence of the decoy contigs from the reference and adds them to the FASTA file. Artificially lower mapping rate, because reads that map to the decoy contigs are marked as unmapped in the output BAM (the original reference does not have the decoy contigs). False positive calls are avoided because the decoy contigs are added under the hood, which helps variant calling. |
Reference includes the decoy contigs (eg, hs37d5) | Decoy reads map to the decoy contigs. High mapping rate. No false positive calls caused by decoy reads because decoy reads map to the right place. | Decoy reads map to the decoy contigs. High mapping rate. No false positive calls caused by decoy reads because decoy reads map to the right place. |

Option | Required | Description |
---|---|---|
--build-hash-table | Yes | Set to true. |
--ht-reference | Yes | Path to the reference genome FASTA file. |
--ht-mask-bed | No (but recommended) | Path to the mask BED file. If not provided, the DRAGEN software automatically applies BED files for hg38 and hg19 from /opt/edico/fasta_mask. |
--output-directory | Yes | Specify the directory where all related hash table files will be written. |

Option | Required | Description |
---|---|---|
--build-hash-table | Yes | Set to true. |
--ht-graph-msvcf-file | Yes | Path to the multi-sample VCF file containing population variants. |
--ht-reference | Yes | Path to the reference genome FASTA file. |
--ht-graph-extra-kmer-bed | No | Path to the extra k-mer BED file. |
--ht-mask-bed | No (but recommended) | Path to the mask BED file. |
--ht-graph-exclusion-bed | No | Path to the exclusion BED file. |
--output-directory | Yes | Specify the directory where all related hash table files will be written. |

File | Description |
---|---|
reference.bin | The reference sequences, encoded in 4 bits per base. Four-bit codes are used, so the size in bytes is roughly half the reference genome size. In between reference sequences, N are trimmed and padding is automatically inserted. For example, hg19 has 3,137,161,264 bases in 93 sequences. This is encoded in 1,526,285,312 bytes = 1.46 GB, where 1 GB means 1 GiB or 2^30 bytes. |
hash_table.cmp | Compressed hash table. The hash table is decompressed and used by the DRAGEN mapper to look up primary seeds with length specified by the --ht-seed-len option and extended seeds of various lengths. |
hash_table.cfg | A list of parameters and attributes for the generated hash table, in a text format. This file provides key information about the reference genome and hash table. |
hash_table.cfg.bin | A binary version of hash_table.cfg used to configure the DRAGEN hardware. |
hash_table_stats.txt | A text file listing extensive internal statistics on the constructed hash table, including the hash table occupancy percentages. This file is for information purposes. It is not used by other tools. |
mask.bed | Present only for masked hash tables. A tab-delimited BED file that describes the masked regions. Contains all lines from the input BED file that are not comment lines, lines that describe empty intervals, or lines with contig names that were not found in the input FASTA. |

Hash Table Type | Hash Table Commands |
---|---|
single-pass | --ht-methylated-combined=true --ht-seed-len 27 |
multi-pass | --ht-methylated=true --ht-seed-len 27 --ht-max-seed-freq 16 |
You use the DRAGEN host software program dragen to build and load reference genomes, and then to analyze sequencing data by decompressing the data, mapping, aligning, sorting, duplicate marking with optional removal, and variant calling.
Invoke the software using the dragen command. The command line options are described in the following sections.
Command line options can also be set in a configuration file. For more information on configuration files, see Configuration Files. If an option is set in the configuration file and is also specified on the command line, the command line option overrides the configuration file.
The following are examples of frequently used command lines:
Build Reference/Hash Table
Run Map/Align and Variant Caller (*.fastq to *.vcf)
Run Map/Align (*.fastq to *.bam)
Run Variant Caller Only (*.bam to *.vcf)
Re-map and Run Variant Caller (*.bam to *.vcf)
Run BCL Converter (BCL to *.fastq)
Run RNA Map/Align (*.fastq to *.bam)
For a complete list of command line options, see Command Line Options.
Before you can use the DRAGEN system for aligning reads, you must load a reference genome and its associated hash tables onto the PCIe card. For information on preprocessing a reference genome's FASTA files into the native DRAGEN binary reference and hash table formats, see Prepare a Reference Genome. You must also specify the directory containing the preprocessed binary reference and hash tables with the -r [or --ref-dir] option. This argument is always required.
Use the following command to load the reference genome and hash tables to DRAGEN card memory separately from processing reads.
dragen -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149
Use the -l (--force-load-reference) option to force the reference genome to load even if it is already loaded.
dragen -l -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149
The time needed to load the reference genome depends on the size of the reference, but for typical recommended settings, it takes approximately 30 to 60 seconds.
DRAGEN has two primary modes of operation, as follows:
Mapper/aligner
Variant caller
DRAGEN is capable of performing each mode independently or as an end-to-end solution. DRAGEN also allows you to enable and disable decompression, sorting, duplicate marking, and compression along the DRAGEN pipeline.
Full pipeline mode: To execute full pipeline mode, set --enable-variant-caller to true and provide input as unmapped reads in *.fastq, *.bam, or *.cram format. DRAGEN performs decompression, mapping, aligning, sorting, and optional duplicate marking and feeds directly into the variant caller to produce a VCF file. In this mode, DRAGEN uses parallel stages throughout the pipeline to drastically reduce the overall run time.
Map/align mode: Map/align mode is enabled by default. Input is unmapped reads in *.fastq, *.bam, or *.cram format. DRAGEN produces an aligned and sorted BAM or CRAM file. To mark duplicate reads at the same time, set --enable-duplicate-marking to true.
Variant caller mode: To execute variant caller mode, set the --enable-variant-caller option to true and the --enable-map-align option to false. The input must be a mapped and aligned BAM/CRAM file. DRAGEN produces a VCF file. DRAGEN force-enables re-sorting of the BAM, because a number of read statistics and estimates are required for the variant caller to operate effectively, so setting --enable-sort to false is overridden. BAM files cannot be duplicate marked in the DRAGEN pipeline prior to variant calling if they have not already been marked. Use the end-to-end mode of operation to take advantage of the mark-duplicates feature.
RNA-Seq data: To enable processing of RNA-Seq-based data, set --enable-rna to true. DRAGEN uses the RNA spliced aligner during the mapper/aligner stage. DRAGEN dynamically switches between the required modes of operation.
Bisulfite MethylSeq data: To enable processing of Bisulfite MethylSeq data, set the --enable-methylation-calling option to true. DRAGEN automates the processing of data for Lister (directional) and Cokus (nondirectional) protocols to generate a single BAM with bismark-compatible tags. Alternatively, you can run DRAGEN in a mode that produces a separate BAM file for each combination of the C->T and G->A converted reads and references. To enable this mode of processing, you need to build a set of reference hash tables with --ht-methylated enabled, and run DRAGEN with the appropriate --methylation-protocol setting.
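The mode descriptions above reduce to a small set of flag combinations. The following Python sketch assembles a command line for each mode using only options named in this guide. The function name, mode labels, and paths are hypothetical, and the check that rejects duplicate marking in variant caller mode is one reading of the restriction stated above. Nothing is executed.

```python
import shlex

# Flags implied by each mode of operation.
MODE_FLAGS = {
    "full-pipeline": ["--enable-variant-caller", "true"],
    "map-align": [],  # map/align is enabled by default
    "variant-caller": ["--enable-variant-caller", "true",
                       "--enable-map-align", "false"],
}


def dragen_command(mode, ref_dir, out_dir, prefix, input_args,
                   mark_duplicates=False, rna=False):
    """Return the argument list for one DRAGEN run in the given mode."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode}")
    if mode == "variant-caller" and mark_duplicates:
        raise ValueError("duplicate marking requires map/align; use full-pipeline mode")
    cmd = ["dragen", "-r", ref_dir,
           "--output-directory", out_dir,
           "--output-file-prefix", prefix]
    cmd += list(input_args) + MODE_FLAGS[mode]
    if mark_duplicates:
        cmd += ["--enable-duplicate-marking", "true"]
    if rna:
        cmd += ["--enable-rna", "true"]
    return cmd


# Hypothetical paths: paired-end FASTQ input through the full pipeline.
print(shlex.join(dragen_command(
    "full-pipeline", "/staging/ref/hg38", "/staging/out", "sample1",
    ["-1", "s1_R1.fastq.gz", "-2", "s1_R2.fastq.gz"], mark_duplicates=True)))
```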
The following command line options for output are mandatory:
--output-directory <out_dir>: Specifies the output directory for generated files.
--output-file-prefix <out_prefix>: Specifies the output file prefix. DRAGEN appends the appropriate file extension onto this prefix for each generated file.
-r [--ref-dir]: Specifies the reference hash table.
The following examples do not include these mandatory options.
For mapping and aligning, the output is sorted and compressed into BAM format by default before saving to disk. You can control the output format from the map/align stage with the --output-format <SAM|BAM|CRAM> option. If the output file exists, the software issues a warning and exits. To force overwrite if the output file already exists, use the -f [ --force ] option.
For example, the following commands output to a compressed BAM file and force overwrite of an existing file:
dragen ... -f
dragen ... -f --output-format bam
To generate a BAI-format BAM index file (*.bai), set --enable-bam-indexing to true.
The following example outputs to a SAM file and forces overwrite:
dragen ... -f --output-format sam
The following example outputs to a CRAM file and forces overwrite:
dragen ... -f --output-format cram
DRAGEN only outputs lossless CRAM files. All QNAMEs and BAM tags are preserved in the CRAM.
DRAGEN can generate mismatch difference (MD) tags, as described in the BAM standard. The feature is turned off by default because there is a small performance cost to generate these strings. To generate MD tags, set --generate-md-tags to true.
To generate ZS:Z alignment status tags, set --generate-zs-tags to true. These tags are only generated in the primary alignment and when a read has suboptimal alignments qualifying for secondary output (even if none were output because --Aligner.sec-aligns was set to 0). The following are valid tag values:
To generate SA:Z tags, set --generate-sa-tags to true (the default). These tags provide alignment information (position, CIGAR, orientation) of groups of supplementary alignments, which are useful in structural variant calling.
To generate the pair score in a ps:i tag, set --generate-ps-tags to true (false by default for DNA, true for RNA). The pair score is used in DRAGEN for computing MAPQ and can be used to check how well alignment candidate pairs score against each other.
DRAGEN can also output mate alignment tags. To generate the mate CIGAR (in the MC:Z tag), set --generate-mc-tags to true (this is the default). To generate the mate mapping quality (in the MQ:i tag), set --generate-mq-tags to true (this is the default). To generate the mate sequence (in the R2:Z tag) and mate base qualities (in the Q2:Z tag), set --generate-r2-tags and --generate-q2-tags to true, respectively (both default to false). When enabled, R2:Z and Q2:Z tags are emitted only for improperly paired read alignments with a fragment length of at least 1000 bp. The methylation pipelines currently do not support the output of mate alignment tags.
DRAGEN also outputs a graph alignment tag (ga:Z) when applicable. The tag is controlled by --generate-ga-tags (true by default for DNA, false for RNA). This tag describes the best alt contig alignment that improved the score of a primary-contig alignment at its liftover position. It can also describe read alignments to alt contigs for which there is no liftover and the primary alignment is unmapped, for example when the read maps best to an alt contig describing a novel long insertion that is not present in the reference. In addition, read alignments that have been marked as unmapped because they map to auto-detected decoy contigs not present in the original user-provided FASTA also have their alignments described in the ga tag.
The ga tag uses the same format as the SA tag used to describe supplementary alignments.
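Because the ga tag shares the SA layout, a single parser can read both. The SAM optional fields specification defines an SA value as a semicolon-terminated list of entries of the form rname,pos,strand,CIGAR,mapQ,NM. The Python sketch below parses such a value. The class and function names are illustrative, and it assumes that contig names contain no semicolons.

```python
from typing import List, NamedTuple


class TagAlignment(NamedTuple):
    rname: str   # reference (contig) name
    pos: int     # 1-based leftmost position
    strand: str  # "+" or "-"
    cigar: str
    mapq: int
    nm: int      # edit distance


def parse_sa_style(value: str) -> List[TagAlignment]:
    """Parse the value of an SA:Z (or ga:Z) tag into its alignment entries."""
    alignments = []
    for entry in value.split(";"):
        if not entry:
            continue  # the list ends with a trailing semicolon
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        if strand not in ("+", "-"):
            raise ValueError(f"unexpected strand {strand!r} in {entry!r}")
        alignments.append(TagAlignment(rname, int(pos), strand, cigar, int(mapq), int(nm)))
    return alignments


# Hypothetical tag value with two entries.
for aln in parse_sa_style("chr6_GL000250v2_alt,1200,+,151M,60,1;chr1,5000,-,100M51S,30,0;"):
    print(aln)
```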
When CRAM is selected as output, DRAGEN generates a CRAM file with the following features:
CRAM format V3.0 is produced
The CRAM is lossless. Lossy compression is never employed and is not available as an option.
Quality score compression is lossless. Read names are preserved.
Only the GZIP compression algorithm is employed for maximum compatibility. bgzip and lzma are not employed. rANS is used for quality scores.
All input BAM tags are preserved.
The reference used to compress the CRAM file is the DRAGEN hash table provided during the map/align run. When decompressing the CRAM with a FASTA file and third-party tools, the FASTA that was used to generate the hash table must be used.
A CRAM index is produced in .crai format.
CRAM output is only possible when sort is enabled. CRAM alignments are always positionally sorted.
The following default settings are used for the CRAM output:
DRAGEN can process reads in FASTQ format or BAM/CRAM format. DRAGEN supports the following compression options for FASTQ input files.
Uncompressed
gzip or bgzip compression
ORA compression. To use ORA compression, you must provide an ORA reference and reference directory. See ORA Compression and Decompression.
If your input FASTQ files are gzipped, DRAGEN automatically decompresses the files using hardware-accelerated decompression, and then streams the reads into the mapper. If your files end in *.ora, DRAGEN automatically decompresses the files using ORA decompression, and then streams the reads into the mapper. The same FASTQ command-line options apply for all compression formats.
FASTQ input files can be single-ended or paired-end, as shown in the following examples.
Single-ended in one FASTQ file (-1 option)
Paired-end in two matched FASTQ files (-1 and -2 options)
Paired-end in a single interleaved FASTQ file (--interleaved (-i) option)
Both bcl2fastq and the DRAGEN BCL command use a common file naming convention, as follows:
<SampleID>_S<#>_<Lane>_<Read>_<segment#>.fastq.gz
Older versions of bcl2fastq and DRAGEN could segment FASTQ samples into multiple files to limit file size or to decrease the time to generate them.
For example:
These files do not need to be concatenated to be processed together by DRAGEN. To map/align any sample, provide the first file in the series (-1 <FileName>_001.fastq). DRAGEN reads all segment files in the sample consecutively. This applies to both FASTQ file sequences specified with the -1 and -2 options for paired-end input, and to compressed fastq.gz files. To turn the behavior off, set --enable-auto-multifile to false on the command line.
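The naming convention can be parsed with a regular expression to see which files belong to one segment series. The Python sketch below assumes the usual bcl2fastq field shapes (a lane such as L001 and a read such as R1), which the pattern above does not spell out. The function names and file names are hypothetical.

```python
import re
from collections import defaultdict

# <SampleID>_S<#>_<Lane>_<Read>_<segment#>.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<number>\d+)_(?P<lane>L\d+)_(?P<read>[RI]\d)"
    r"_(?P<segment>\d+)\.fastq(?:\.gz)?$"
)


def parse_fastq_name(filename):
    """Split a FASTQ file name into its naming-convention fields."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        raise ValueError(f"not in the expected naming convention: {filename}")
    fields = match.groupdict()
    fields["number"] = int(fields["number"])
    fields["segment"] = int(fields["segment"])
    return fields


def segment_series(filenames):
    """Group files by (sample, lane, read) and order each group by segment."""
    groups = defaultdict(list)
    for name in filenames:
        f = parse_fastq_name(name)
        groups[(f["sample"], f["lane"], f["read"])].append((f["segment"], name))
    return {key: [n for _, n in sorted(items)] for key, items in groups.items()}


files = ["RDRS_S1_L001_R1_002.fastq.gz", "RDRS_S1_L001_R1_001.fastq.gz",
         "RDRS_S1_L001_R2_001.fastq.gz"]
for key, series in segment_series(files).items():
    print(key, series)
```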
DRAGEN can also optionally read multiple files by the sample name given in the file name, which can be used to combine samples that have been distributed across multiple BCL lanes or flow cells. To enable this feature, set the --combine-samples-by-name option to true.
If the FASTQ files specified on the command line use the Casava 1.8 file naming convention shown above and additional files in the same directory share that sample name, those files and all their segments are processed automatically. Note that sample name, read number, and file extension must match. Index barcode and lane number can differ.
To avoid impacting system performance, input files must be located on a fast file system.
To process multiple FASTQ input files as one sample, it is recommended that you use the --fastq-list <csv file name> option to specify the name of a CSV file containing the list of FASTQ files, instead of using the --combine-samples-by-name option.
For example:
Using a CSV file avoids having to concatenate the FASTQ files in cases where there are multiple FASTQ files for a sample, such as top-up scenarios or FASTQ files split across lanes. It also allows you to name the FASTQ input files, take input from multiple subdirectories, and add BAM tags specified explicitly for each read group. DRAGEN automatically generates a CSV file of the correct format during BCL conversion to FASTQ. The CSV file is named fastq_list.csv and contains an entry for each FASTQ file or paired-end file pair produced during the run.
FASTQ CSV File Format
The first line of the CSV file specifies the title of each column, and is followed by one or more data lines. All lines in the CSV file must contain the same number of comma-separated values and should not contain white space or other extraneous characters.
Column titles are case-sensitive. The following column titles are required:
RGID: Read group
RGSM: Sample ID
RGLB: Library
Lane: Flow cell lane
Read1File: Full path to a valid FASTQ input file
Read2File: Full path to a valid FASTQ input file. Required for paired-end input. If not using paired-end input, leave empty.
Each FASTQ file referenced in the CSV list can be referenced only once. All values in the Read2File column must be either nonempty and reference valid files, or they must all be empty.
When generating a BAM file using fastq-list input, one read group is generated per unique RGID value. The BAM header contains RG tags for the following read groups:
ID (from RGID)
SM (from RGSM)
LB (from RGLB)
You can specify additional tags for each read group by adding a column title. The column title must be only four upper-case characters and begin with RG. For example, to add a PU (platform unit) tag, add a column named RGPU and specify the value for each read group in this column. All column titles must be unique.
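The format rules above lend themselves to a quick pre-flight check. The Python sketch below validates fastq-list CSV text against the rules stated in this section. It does not check that the listed files exist, and the function name and sample values are hypothetical.

```python
import csv
import io
import re

REQUIRED = ["RGID", "RGSM", "RGLB", "Lane", "Read1File", "Read2File"]
EXTRA_TAG = re.compile(r"^RG[A-Z]{2}$")  # e.g. RGPU for a platform unit tag


def validate_fastq_list(text):
    """Return a list of problems found in fastq-list CSV text (empty if none)."""
    problems = []
    if " " in text or "\t" in text:
        problems.append("file contains white space")
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return ["file is empty"]
    header, data = rows[0], rows[1:]
    if not data:
        problems.append("no data lines")
    for name in REQUIRED:
        if name not in header:
            problems.append(f"missing required column {name}")
    if len(set(header)) != len(header):
        problems.append("column titles are not unique")
    for name in header:
        if name not in REQUIRED and not EXTRA_TAG.match(name):
            problems.append(f"unexpected column title {name}")
    for number, row in enumerate(data, start=2):
        if len(row) != len(header):
            problems.append(f"line {number}: expected {len(header)} values, found {len(row)}")
    if problems:
        return problems
    records = [dict(zip(header, row)) for row in data]
    files = [r[c] for r in records for c in ("Read1File", "Read2File") if r[c]]
    if len(set(files)) != len(files):
        problems.append("a FASTQ file is referenced more than once")
    has_read2 = [bool(r["Read2File"]) for r in records]
    if any(has_read2) and not all(has_read2):
        problems.append("Read2File must be filled in on every line or on none")
    return problems


example = (
    "RGID,RGSM,RGLB,Lane,Read1File,Read2File\n"
    "rg1,sampleA,lib1,1,/data/A_L1_R1.fastq.gz,/data/A_L1_R2.fastq.gz\n"
    "rg2,sampleA,lib1,2,/data/A_L2_R1.fastq.gz,/data/A_L2_R2.fastq.gz\n"
)
print(validate_fastq_list(example) or "no problems found")
```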
A fastq-list file can contain files for more than one sample. If a fastq-list file contains only one unique RGSM entry, then no additional options need to be specified, and DRAGEN processes all files listed in the fastq-list file. If there is more than one unique RGSM entry in a fastq-list file, --fastq-list-sample-id <SampleID> must be used in addition to --fastq-list <filename> to process only a specific sample from the CSV file. Only the entries in the fastq-list file with an RGSM value that matches the specified SampleID are processed.
Independent processing and output for multiple individual samples in one run is not supported.
To process all listed files together as one sample, regardless of the RGSM value, the option --fastq-list-all-samples=true can be used instead of --fastq-list-sample-id.
Note
For a single run, only one BAM file and one VCF file are produced, because all input read groups are expected to belong to the same sample. To process multiple samples independently from one BCL conversion run, DRAGEN must be run multiple times using different values for the --fastq-list-sample-id option.
There is no option to specify groupings or subsets of RGSM values for more complex filtering, but the fastq-list file can be modified to achieve the same effect.
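The selection rules for RGSM values can be summarized in a few lines of Python. This sketch models which fastq-list entries a run would process under the rules described above. The function and its parameters are illustrative stand-ins for --fastq-list-sample-id and --fastq-list-all-samples, and the error for an ambiguous list is an assumption about how a missing option would be handled.

```python
def entries_to_process(records, sample_id=None, all_samples=False):
    """Select fastq-list records (dicts with an RGSM key) for one run."""
    if all_samples:
        return list(records)  # everything is treated as one sample
    if sample_id is not None:
        return [r for r in records if r["RGSM"] == sample_id]
    if len({r["RGSM"] for r in records}) <= 1:
        return list(records)  # a single RGSM needs no extra option
    raise ValueError("more than one RGSM value: choose a sample ID or all samples")


records = [
    {"RGID": "rg1", "RGSM": "sampleA"},
    {"RGID": "rg2", "RGSM": "sampleA"},
    {"RGID": "rg3", "RGSM": "sampleB"},
]
print([r["RGID"] for r in entries_to_process(records, sample_id="sampleA")])
print([r["RGID"] for r in entries_to_process(records, all_samples=True)])
```

Filtering the records before writing a new fastq-list file is also one way to build the custom groupings of RGSM values that no single option provides.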
The following is an example FASTQ list CSV file with the required columns:
If you use the --tumor-fastq-list option for somatic input, use the --tumor-fastq-list-sample-id <SampleID> option to specify the sample ID for the corresponding FASTQ list, as shown in the following example:
Tumor-Normal Pairs Input
If using fastq_lists or tumor_fastq_lists comprising multiple samples (RGSMs) in somatic mode, you can use a loop to iterate through the two lists to create tumor-normal pairs for testing. Create a *.txt file with the RGSM of each normal sample to be tested (one per line), and then create a separate *.txt file with the RGSM of the tumor samples to be tested. Make sure that the tumor sample RGSMs are listed in the same order as the corresponding normal samples, and include a blank line after the last sample.
You can use the following example script to perform testing in somatic mode. Each iteration takes one entry from the tumor samples list and one entry from the normal samples list (from top to bottom) to create a tumor-normal pair as input for the DRAGEN run.
The following are examples of the FASTQ lists and samples lists used as input for the script.
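The pairing loop described above can be sketched in Python as follows. This is an illustration rather than the guide's own script: the function names, sample IDs, file names, and output prefix scheme are hypothetical, and the remaining options for a somatic run are omitted. The sketch only builds the command lines and does not run them.

```python
import shlex


def read_sample_ids(text):
    """One RGSM per line; blank lines (including a trailing one) are ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def tumor_normal_commands(tumor_ids, normal_ids, tumor_fastq_list, normal_fastq_list):
    """Pair the n-th tumor sample with the n-th normal sample."""
    if len(tumor_ids) != len(normal_ids):
        raise ValueError("tumor and normal lists must have the same length")
    commands = []
    for tumor, normal in zip(tumor_ids, normal_ids):
        commands.append([
            "dragen",
            "--tumor-fastq-list", tumor_fastq_list,
            "--tumor-fastq-list-sample-id", tumor,
            "--fastq-list", normal_fastq_list,
            "--fastq-list-sample-id", normal,
            "--output-file-prefix", f"{tumor}_vs_{normal}",  # hypothetical naming
        ])
    return commands


tumors = read_sample_ids("tumor1\ntumor2\n\n")
normals = read_sample_ids("normal1\nnormal2\n\n")
for cmd in tumor_normal_commands(tumors, normals, "tumor_fastq_list.csv", "fastq_list.csv"):
    print(shlex.join(cmd))
```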
You can use the same options for ORA files as for the other FASTQ input file types. To use an ORA file, replace the FASTQ file name with the ORA file name and specify the ORA reference directory using --ora-reference.
See ORA Compression and Decompression for more information on ORA reference files.
The following command represents paired-end in two matched ORA FASTQ files (-1 and -2 options).
BAM files can be used as input to the mapper/aligner. By default, --enable-map-align is true. You can use the BAM file as input to the variant caller by setting the --enable-map-align option to false.
When you specify a BAM file as input, with map/align enabled, DRAGEN ignores any alignment information contained in the input file, and outputs new alignments for all reads.
If the input file contains paired-end reads, the two reads of each pair must be processed together. Other pipelines would require you to re-sort the input data set by read name. DRAGEN vastly increases the speed of this operation by pairing the input reads as they are read and sending them on to the mapper/aligner when pairs are identified. Use the --pair-by-name option to enable or disable this feature (the default is true).
Specify single-ended input in one BAM file with the -b and --pair-by-name=false options, as follows:
Specify paired-end input in one BAM file with the -b and --pair-by-name=true options, as follows:
You can use CRAM files as input to the DRAGEN mapper/aligner and variant caller. The DRAGEN functionality available when using CRAM input is the same as when using BAM input.
By default, the CRAM compressor and decompressor use the DRAGEN reference specified with the --ref-dir option. CRAM compression is reference based, and the reference used for compression is not part of the CRAM file. Therefore, the CRAM input file must have been created with the same reference as the one provided to DRAGEN for the analysis.
DRAGEN supports the re-alignment of a CRAM input that was created with a different reference in one step. Re-aligning a CRAM file that was created with a different reference requires use of the --cram-reference option. This option makes the CRAM decompressor use the specified reference.
--cram-reference can be either a FASTA file or a DRAGEN hash table folder.
If pointing to a FASTA file, the FASTA .fai index file must be present next to the FASTA file.
CRAM output is always compressed using the --ref-dir reference.
Example: CRAM was created with hg19, re-analysis with hg38
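A sketch of this hg19-to-hg38 case follows. All paths are hypothetical placeholders, and the command is printed rather than executed.

```shell
HG38_REF_DIR=/staging/reference/hg38         # hash table used for the new analysis (placeholder)
HG19_FASTA=/staging/reference/hg19/hg19.fa   # reference the CRAM was created with (placeholder);
                                             # hg19.fa.fai must sit next to this file

set -- dragen -f \
    -r "$HG38_REF_DIR" \
    --cram-input /staging/reads/sample1_hg19.cram \
    --cram-reference "$HG19_FASTA" \
    --enable-map-align true \
    --output-directory /staging/output \
    --output-file-prefix sample1_hg38

echo "$*"    # dry run: print the command; use "$@" instead to execute it
```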
The following option is used for providing a CRAM input to either the mapper/aligner or the variant caller:
--cram-input: The name and path of the CRAM file. One usage example is paired-end input in a single CRAM file. In that case, also set the --pair-by-name option to true.
Multiple BAM or CRAM Input Files
To provide multiple BAM input files, use the --bam-list <csv file name> option to specify the name of a CSV file containing the list of BAM files. For example:
To provide multiple CRAM input files, use the --cram-list <csv file name> option.
BAM or CRAM CSV Input File Format
The first line of the CSV file specifies the header containing the title for each column and each subsequent line is a data line. All lines in the CSV file must contain the same number of comma-separated values and should not contain white space or any other extraneous characters.
An example BAM CSV file:
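A minimal sketch of such a file, together with a format check and a matching command, is shown below. The BAM paths, the CSV location, and the reference and output paths are hypothetical; the DRAGEN command is printed rather than executed.

```shell
# Write a hypothetical BAM list; the BAM paths are placeholders.
BAM_LIST="${TMPDIR:-/tmp}/bam_list.csv"
cat > "$BAM_LIST" <<'EOF'
BamFile
/staging/reads/sample1_lane1.bam
/staging/reads/sample1_lane2.bam
EOF

# Check that every line has the same number of comma-separated values
# and contains no white space.
awk -F, 'NR == 1 { n = NF } NF != n || /[ \t]/ { bad = 1 } END { exit bad }' "$BAM_LIST" \
    && echo "bam list OK: $BAM_LIST"

set -- dragen -f -r /staging/reference/hg38 \
    --bam-list "$BAM_LIST" \
    --output-directory /staging/output \
    --output-file-prefix sample1

echo "$*"    # dry run: print the command; use "$@" instead to execute it
```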
Column titles are case sensitive. The following column titles are required:
BamFile -- path to BAM file
Please note that only the "BamFile" column is supported at this time. Extra fields may be specified in the CSV file, but they will not be processed by DRAGEN.
CRAM CSV input follows the same format above, with "CramFile" as the column title instead.
Restrictions and Limitations:
DRAGEN bam-list and cram-list are intended to mirror manually merging BAM or CRAM files via a utility such as samtools or MergeSamFiles (Picard). As a result, using bam-list or cram-list is analogous to having a single merged BAM or CRAM input file. Please note that some callers (i.e. DRAGEN variant calling) are unable to process a bam-list or cram-list that is composed of input files containing multiple samples.
In the case where identical read group IDs appear across multiple files and you want to treat them as distinct read groups, you can use the --prepend-filename-to-rgid=true option to distinguish between read groups.
If enabled, the resulting output BAM or CRAM file will contain all read groups from the input BAM or CRAM files passed in the CSV list file.
Tumor-Normal Pairs Input
You can also use --tumor-bam-list <csv file name> or --tumor-cram-list <csv file name> when running with tumor-only or tumor-normal inputs to DRAGEN. The CSV file has the same format as for the options described above.
BCL is the output format of Illumina sequencing systems. Under limited circumstances, DRAGEN can read directly from BCL for map-align operations, saving the time needed for conversion to FASTQ.
DRAGEN can read directly from BCL in the following circumstances:
Only one lane is input as part of a run (specified on the command-line).
The lane has only a single sample specified in the SampleSheet.csv file.
When conversion from BCL to FASTQ is required, DRAGEN provides a BCL to FASTQ converter (see DRAGEN BCL Data Conversion).
The following example command is for BCL input with only one lane of input:
For additional BCL conversion options, see Input File Types.
One of the techniques that DRAGEN uses to optimize the handling of sequences can lead to overwriting the base quality score assigned to N base calls.
When you use the --fastq-n-quality and --fastq-offset options, the base quality scores of N base calls are overwritten with a fixed base quality. The default values for these options are 2 and 33, respectively, which add up to 35 (ASCII character ‘#’), matching the Illumina minimum quality encoding.
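The arithmetic behind the defaults can be checked with a few lines of shell. The variable names are illustrative only; the two values are the defaults stated above.

```shell
n_quality=2     # default of --fastq-n-quality
offset=33       # default of --fastq-offset
code=$((n_quality + offset))

# Convert the ASCII code to its character via an octal escape.
char=$(printf "\\$(printf '%03o' "$code")")
echo "ASCII code $code is '$char'"
```

With the defaults, the encoded quality character written for an N base call is `#` (ASCII code 35).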
By a common convention, read names can include suffixes (such as /1 or /2) that indicate which end of a pair the read represents. For BAM input using the --pair-by-name option, DRAGEN ignores these suffixes to find matching pair names. By default, DRAGEN uses the forward slash character as the delimiter for these suffixes and ignores the /1 and /2 when comparing names. By default, DRAGEN also strips these suffixes from the original read names.
DRAGEN has the following options to control how suffixes are used:
To change the delimiter character for suffixes, use the --pair-suffix-delimiter option. Valid values for this option include forward slash (/), dot (.), and colon (:).
To preserve the entire name, including the suffixes, set --strip-input-qname-suffixes to false.
To append a new set of suffixes to all read names, set --append-read-index-to-name to true. The delimiter is determined by the --pair-suffix-delimiter option. By default, the delimiter is a slash, so /1 and /2 are added to the names.
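The effect of ignoring the suffixes when comparing names can be illustrated outside of DRAGEN. The helper below is a hypothetical illustration of the naming convention only and is not DRAGEN's implementation; the function name `strip_suffix` is made up for this sketch.

```shell
# Illustration only: remove a trailing <delimiter>1 or <delimiter>2 from a read name.
# strip_suffix NAME [DELIMITER]; the delimiter defaults to "/".
strip_suffix() {
    delim=${2:-/}
    case "$1" in
        *"$delim"1|*"$delim"2)
            name=${1%"$delim"?}      # drop the delimiter and the read index
            printf '%s\n' "$name" ;;
        *)
            printf '%s\n' "$1" ;;    # no pair suffix: leave the name unchanged
    esac
}

strip_suffix 'read42/1'       # prints read42
strip_suffix 'read42/2'       # prints read42
strip_suffix 'read42.1' .     # prints read42
strip_suffix 'read42'         # prints read42
```

Two reads whose names are equal after this step would be treated as mates of the same pair.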
When processing RNA-Seq data, you can supply a gene annotations file by using the --annotation-file option. Providing this file improves the accuracy of the mapping and aligning stage (see Input Files). The file should conform to the GTF/GFF format specification and should list annotated transcripts that match the reference genome being mapped against. The similar GFF3 format is currently not supported, due to inconsistent contig naming between GENCODE and Ensembl. See the RNA user guide section for more details on potential issues and workarounds.
DRAGEN can take the SJ.out.tab file (see SJ.out.tab) as an annotations file to help guide the aligner in a two-pass mode of operation.
DRAGEN can stream input files directly from an AWS S3 bucket, Azure Blob storage account, or by using AWS presigned URLs (presigned URLs are not supported for Azure Blob storage at this time). With streaming, input files are not required to be downloaded locally prior to being processed. The files are streamed over the network directly into the DRAGEN processor.
Input streaming is most beneficial for large input files. DRAGEN supports input streaming for BAMs and compressed FASTQ files. For FASTQ files, input streaming can be used in all the configurations, including single-end FASTQs, paired-end FASTQs, and FASTQ lists.
Input streaming is supported for the following use cases:
Mapping/aligning of FASTQ and BAM.
Germline and somatic small variant calling from BAM (without remapping).
For other file types that are significantly smaller in size, download them locally before running the analysis.
Streaming FASTQ Input Using AWS S3
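A sketch of a paired-end run that streams both FASTQ files from S3 follows. It assumes that S3 objects are passed as `s3://` URLs in place of local file names; the bucket name, object keys, and local paths are hypothetical, and the command is printed rather than executed.

```shell
set -- dragen -f -r /staging/reference/hg38 \
    -1 s3://example-bucket/reads/sample1_R1.fastq.gz \
    -2 s3://example-bucket/reads/sample1_R2.fastq.gz \
    --RGID sample1_rg --RGSM sample1 \
    --output-directory /staging/output \
    --output-file-prefix sample1

echo "$*"    # dry run: print the command; use "$@" instead to execute it
```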
Streaming FASTQ Input Using Azure Blob Storage Account
Streaming FASTQ Input Using Presigned URLs (for AWS only)
Streaming BAM Input Using AWS S3
Streaming BAM Input Using Azure Blob Storage Account
Streaming BAM Input Using Presigned URLs (for AWS only)
DRAGEN can stream its output to an AWS S3 Bucket or an Azure Blob Storage Account Container. Output streaming is beneficial for large output files and for sharing results.
Streaming output to AWS S3
Streaming output to Azure Blob Storage Account
To stream input files from, or write output to, a cloud provider's storage, you must have permission to access the remote files.
AWS S3
S3 requires AWS authentication and credentials. The authentication should already be set up on the instance you are running, for example, via IAM policies.
Azure Blob Storage Account
Azure requires authentication and environment variables. DRAGEN supports two cases: (1) Using managed identities and (2) Storage account access keys.
To use managed identities, you must run DRAGEN on an Azure instance. The instance must have Contributor permissions (read/write) on the Storage Account it reads from and writes to. If the instance has a single managed identity, only the AZ_ACCOUNT_NAME=<azure-storage-account-name> environment variable is required. For multiple managed identities, you must also provide the AZR_IDENT_CLIENT_ID=<client-id> environment variable, with the client ID of the identity that can access your storage bucket. This can be found on the Azure Portal.
With storage account access keys, DRAGEN can write to an Azure bucket both on and off Azure instances. For this use case, find the Storage Account Access Key and set the environment variables AZ_ACCOUNT_NAME=<azure-storage-account-name> and AZ_ACCOUNT_KEY=<account-key>.
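For the access-key case, the environment could be prepared as in the sketch below. The account name and key shown are placeholders to be replaced with your own values; the two checks simply stop the shell early if either variable is empty.

```shell
# Placeholder values; substitute your own storage account name and access key.
export AZ_ACCOUNT_NAME="examplestorageaccount"
export AZ_ACCOUNT_KEY="<account-key>"

# Abort with a message if either variable is unset or empty.
: "${AZ_ACCOUNT_NAME:?storage account name must be set}"
: "${AZ_ACCOUNT_KEY:?storage account access key must be set}"

echo "Azure credentials exported for account $AZ_ACCOUNT_NAME"
```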
Presigned URL (AWS only)
An AWS presigned URL typically has a query string attached to it, which provides the authentication credentials or tokens needed to grant access to the S3 bucket (e.g., https://bucket-name.amazonaws.com/path/to/folder?querystring). Currently, streaming input to DRAGEN using Azure presigned URLs is not supported.
Use the --sample-sex command line option to control the sex karyotype input used in downstream components, such as variant callers. If a sample sex karyotype input is not specified using the command line, the sex karyotype is automatically determined. The sex karyotype input is converted to a reference sex karyotype for use in variant calling. Other components might support sex karyotype input. Refer to the corresponding section for the component you are using.
The --sample-sex option supports the following values. Values are not case-sensitive.
none: No sex karyotype input. Components use a default reference sex karyotype.
auto: The sex karyotype is estimated by the Ploidy Estimator. If using CNV calling, sex karyotype is determined using a separate sex estimation module. If DRAGEN cannot estimate the sex karyotype, then components do not have a sex karyotype input. This behavior is then the same as none. auto is the default value.
female: Sex karyotype input is XX.
male: Sex karyotype input is XY.
The following example command lines use --sample-sex to specify the sex karyotype.
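A sketch of two such command lines follows. The helper name `sample_sex_cmd`, the BAM input, and the reference and output paths are hypothetical, and the commands are printed rather than executed.

```shell
# Print a variant-calling command from BAM input with an explicit sex karyotype.
# $1 = one of: none, auto, female, male. All paths are placeholders (assumptions).
sample_sex_cmd() {
    echo dragen -f -r /staging/reference/hg38 \
        -b /staging/reads/sample1.bam \
        --enable-map-align false \
        --enable-variant-caller true \
        --sample-sex "$1" \
        --output-directory /staging/output \
        --output-file-prefix "sample1_$1"
}

sample_sex_cmd female
sample_sex_cmd male
```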
If the value is none, female, or male, the Ploidy Estimator could still run and produce output, but variant callers will not use any estimated sex karyotype that is different from the sex karyotype provided via the command line.
The sex karyotype input is converted to the reference sex karyotype for the different components as follows. See the relevant component section for more information on how --sample-sex is used.
For sex karyotype input of None, CNV independently checks the coverage ratio of X and Y to determine the reference sex karyotype. Detection of minimal Y coverage will yield XY, otherwise XX.
The Picard Base Quality Score Recalibration (BQSR) tool produces output BAM files that include tags BI and BD. BQSR calculates these tags relative to the exact sequence for a read. If a BAM file with BI and BD tags is used as input to mapper/aligner with hard clipping enabled, the BI and/or BD tags can become invalid.
The recommendation is to strip these tags when using BAM files as input. To remove the BI and BD tags, set the --preserve-bqsr-tags option to false. If you preserve the tags, DRAGEN warns you to disable hard clipping.
DRAGEN assumes that all the reads in a given FASTQ belong to the same read group. DRAGEN creates a single @RG read group descriptor in the header of the output BAM file, with the ability to specify the following standard BAM attributes:
If any of these arguments are present, DRAGEN adds an RG tag to all the output records to indicate that they are members of a read group. The following example shows a command line that includes read group parameters:
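A command line with read group parameters could look like the sketch below. The read group values and all paths are hypothetical, and the command is printed rather than executed.

```shell
set -- dragen -f -r /staging/reference/hg38 \
    -1 /staging/reads/sample1_R1.fastq.gz \
    -2 /staging/reads/sample1_R2.fastq.gz \
    --RGID rg1 \
    --RGSM sample1 \
    --RGLB lib1 \
    --RGPL ILLUMINA \
    --RGPU flowcell1.1 \
    --output-directory /staging/output \
    --output-file-prefix sample1

echo "$*"    # dry run: print the command; use "$@" instead to execute it
```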
When using the --fastq-list option to input multiple read groups, BAM tags (and others) are specified for each read group by adding columns to the fastq_list.csv file. Each column heading consists of four capital letters and each begins with 'RG'. For each column, each read group's values for that column are propagated to the output BAM file in an identically named tag.
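A minimal sketch of such a file follows. The `Lane`, `Read1File`, and `Read2File` columns are assumed from the usual fastq_list.csv layout and are not described in this section; the read group values and file paths are hypothetical.

```shell
FASTQ_LIST="${TMPDIR:-/tmp}/fastq_list.csv"
cat > "$FASTQ_LIST" <<'EOF'
RGID,RGSM,RGLB,Lane,Read1File,Read2File
rg1,sample1,lib1,1,/staging/reads/sample1_L001_R1.fastq.gz,/staging/reads/sample1_L001_R2.fastq.gz
rg2,sample1,lib1,2,/staging/reads/sample1_L002_R1.fastq.gz,/staging/reads/sample1_L002_R2.fastq.gz
EOF

# List the read group columns (headings of four capital letters starting with RG).
head -n 1 "$FASTQ_LIST" | tr ',' '\n' | grep -E '^RG[A-Z]{2}$'
```

In this sketch, the RGID, RGSM, and RGLB columns are the ones that would be propagated as ID, SM, and LB tags.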
To suppress the license status message at the end of the run, use the --lic-no-print option. The following shows an example of the license status message:
An MD5SUM file is generated automatically for BAM and CRAM output files. The MD5SUM file has the same name as the output file, with an .md5sum extension appended (eg, whole_genome_run_123.bam.md5sum). The MD5SUM file is a single-line text file that contains the md5sum of the output file, which exactly matches the output of the Linux md5sum command.
The MD5SUM calculation is performed as the output file is written, so there is no measurable performance impact (compared to the Linux md5sum command, which can take several minutes for a 30x BAM).
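If you want to re-check an output file later, for example after copying it, the stored value can be compared with a fresh checksum. The helper below is a sketch that assumes a Linux `md5sum` command is available and that the hash is the first field of the `.md5sum` file; the function name `verify_md5` is hypothetical.

```shell
# verify_md5 OUTPUT_FILE
# Succeeds if the md5sum of OUTPUT_FILE matches the hash in OUTPUT_FILE.md5sum.
verify_md5() {
    expected=$(awk '{ print $1; exit }' "$1.md5sum")   # first field of the first line
    actual=$(md5sum "$1" | cut -d ' ' -f 1)
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Usage:
#   verify_md5 /staging/output/whole_genome_run_123.bam && echo "checksum OK"
```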
Command line options can be stored in a configuration file. The location of the default configuration file is <INSTALL_PATH>/config/dragen-user-defaults.cfg. You can override this file by using the --config-file (-c) option to specify a different file. The configuration file used for a given run supplies the default settings for that run, any of which can be overridden by command line options.
The recommended approach is to use the dragen-user-defaults.cfg file as a template to create default settings for different use cases. Copy dragen-user-defaults.cfg, rename the copy, then modify the new file for the specific use-case. Best practice is to put options that rarely change into the configuration file and to specify options that vary from run to run on the command line.
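The copy-and-customize workflow can be sketched as follows. The install location `/opt/edico`, the name and location of the copied file, and the input paths are assumptions for illustration; the DRAGEN command is printed rather than executed.

```shell
INSTALL_PATH=${INSTALL_PATH:-/opt/edico}        # assumed install location
TEMPLATE="$INSTALL_PATH/config/dragen-user-defaults.cfg"
WGS_CFG="${TMPDIR:-/tmp}/dragen-wgs.cfg"        # hypothetical use-case copy

# Copy the template when it is available, then edit the copy for the use case.
if [ -f "$TEMPLATE" ]; then
    cp "$TEMPLATE" "$WGS_CFG"
fi

# Rarely changing options live in the file; per-run options stay on the command line.
set -- dragen -c "$WGS_CFG" \
    -r /staging/reference/hg38 \
    -1 /staging/reads/sample1_R1.fastq.gz \
    -2 /staging/reads/sample1_R2.fastq.gz \
    --RGID rg1 --RGSM sample1 \
    --output-directory /staging/output \
    --output-file-prefix sample1

echo "$*"    # dry run: print the command; use "$@" instead to execute it
```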
Authentication is required for users that run DRAGEN on the cloud, with the Bring-Your-Own-License (BYOL) model, outside of integrated Illumina cloud products. A valid license is required to enable authentication and usage quotas.
DRAGEN cloud runs access the DRAGEN License Server to validate the credentials and licenses against the intended run. BYOL users must provide credentials and must allow access to the license server URL. The following command line option can be used to pass the credentials to DRAGEN: --lic-server=https://<user>:<pass>@license.edicogenome.com.
An alternative way to provide license server credentials is by using a license credentials file. The --lic-credentials input command line option can be used to provide the full path to the license credentials file. This provides a more secure way to pass cloud credentials, which avoids accidental credentials leaks from command line console logs.
A license credentials file is a plain text file audited by the customer. The format is the same as the DRAGEN config files: = , each {key,value} separated by new line. The following key names must be used: credentials1 = credentials2 =
DRAGEN uses AWS Instance Metadata Service (IMDS) to identify its own metadata within the AWS environment, including location, identity, and configuration.
DRAGEN supports both AWS IMDSv1 and the more secure AWS IMDSv2. AWS IMDSv1 is request/response based: it accesses metadata through HTTP requests to a specific endpoint on the instance. AWS IMDSv2 uses token-based authentication with time-limited tokens.
AWS IMDSv2 must be enabled on the AWS instance, otherwise, IMDSv1 is used by default. DRAGEN software will automatically detect the IMDS version in use and adapt its behavior accordingly.
DRAGEN cloud runs access the instance identity document via the Instance Metadata Service as part of the authentication. It uses the IPv4 local address. If access to the local address is not allowed, the authentication will fail. Alternately, the user may save the instance identity document(s) and point DRAGEN to use them instead, if the user does not want to allow applications to access this service. The method for providing instance identity documents to the software is described below.
Save the instance identity document(s) as files from the user's instance, and provide them as inputs to the DRAGEN software with each run.
The instance identity document(s) only need to be saved once per AWS account and region, and those files can be re-used subsequently.
Examples for saving instance identity document(s):
IMDSv1
IMDSv2
There should be 3 files in this folder, named pkcs7, signature, and document. Run DRAGEN using the --lic-instance-id-location ${instance_identity} command option.
There should be 2 files in this folder, named instance and document. Run DRAGEN using the --lic-instance-id-location ${instance_identity} command option.
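For the three-file case, saving the documents could be sketched as below. This assumes tokenless (IMDSv1-style) requests to the standard EC2 instance identity endpoints and that `curl` is installed; the function name `save_identity_docs` and the folder used in the usage comment are hypothetical.

```shell
# save_identity_docs DIR
# Store the pkcs7, signature, and document files for this instance in DIR.
save_identity_docs() {
    base=http://169.254.169.254/latest/dynamic/instance-identity
    mkdir -p "$1" || return 1
    for name in pkcs7 signature document; do
        curl -s --fail "$base/$name" > "$1/$name" || return 1
    done
}

# Usage on an EC2 instance:
#   instance_identity=/staging/license/instance-identity
#   save_identity_docs "$instance_identity"
#   dragen ... --lic-instance-id-location "$instance_identity"
```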
DRAGEN supports the construction of reference hash tables for both human and non-human reference genomes. The reference autodetect feature of DRAGEN is able to recognize reference hash tables built on the four human reference genomes: hg19 (hg19), GRCh37/hs37d5 (hs37d5), GRCh38/hs38d1 (hg38), and T2T-CHM13v2.0 (chm13). Pre-built human reference hash tables are available for download from the DRAGEN Software Support Site page.
DRAGEN also supports multigenome reference graph hash tables, which extend the reference genomes with alternative variant paths from the sample cohort used to construct the multigenome reference. A graph-based reference improves the mapping accuracy of Illumina reads in the “Difficult-to-Map Regions” of the genome and improves downstream variant calling. Pre-built human multigenome reference graphs are available for download from the DRAGEN Software Support Site page.
In the following tables we summarize the reference support for each DRAGEN component and the recommended reference type for each component.
Component | Human hg19 | Human hs37d5 | Human hg38 | Human chm13 | Non-Human | Recommended Human Reference Type | Recommended Non-Human Reference Type |
---|---|---|---|---|---|---|---|
Component | Human hg19 | Human hs37d5 | Human hg38 | Human chm13 | Non-Human | Recommended Human Reference Type | Recommended Non-Human Reference Type |
---|---|---|---|---|---|---|---|
* DRAGEN supports the component execution, however the component's accuracy has not been established.
The DRAGEN secondary analysis software utilizes a highly reconfigurable Field Programmable Gate Array (FPGA) card and is available on a preconfigured DRAGEN server that can be seamlessly integrated into bioinformatics workflows. The platform can be loaded with highly optimized algorithms for many different NGS secondary analysis pipelines, including the following:
Whole genome
Exome
RNA-Seq
Methylome
Cancer
All user interaction is accomplished via DRAGEN software that runs on the host server and manages all communication with the FPGA card. This user guide summarizes the technical aspects of the system and provides detailed information for all DRAGEN command line options. If you are working with DRAGEN for the first time, Illumina recommends that you first read the Getting Started section, which provides a short introduction to DRAGEN, including running a test of the server, generating a reference genome, and running example commands.
DRAGEN DNA Pipeline
The DRAGEN DNA Pipeline massively accelerates the secondary analysis of NGS data. For example, the time taken to process an entire human genome at 30x coverage is reduced from approximately 10 hours (using the current industry standard, BWA-MEM+GATK-HC software) to approximately 20 minutes. Time scales linearly with coverage depth.
These pipelines harness the tremendous power of the DRAGEN server and include highly optimized algorithms for mapping, aligning, sorting, duplicate marking, and haplotype variant calling. They also use platform features such as hardware-accelerated compression and optimized BCL conversion, together with the full set of platform tools.
Unlike all other secondary analysis methods, DRAGEN DNA Applications do not reduce accuracy to achieve speed improvements. Accuracy for both SNPs and INDELs is improved over that of BWA-MEM+GATK-HC in side-by-side comparisons.
In addition to haplotype variant calling, the pipeline supports calling of copy number and structural variants as well as detection of repeat expansions.
DRAGEN secondary analysis includes an RNA-Seq (splicing-aware) aligner, as well as RNA-specific analysis components for gene expression quantification and gene fusion detection.
The DRAGEN RNA Pipeline shares many components with the DNA Pipeline. Mapping of short seed sequences from RNA-Seq reads is performed similarly to mapping DNA reads. In addition, splice junctions (the joining of noncontiguous exons in RNA transcripts) near the mapped seeds are detected and incorporated into the full read alignments.
DRAGEN secondary analysis uses hardware-accelerated algorithms to map and align RNA-Seq-based reads faster and more accurately than popular software tools. For instance, it can align 100 million paired-end RNA-Seq-based reads in about three minutes. With simulated benchmark RNA-Seq data sets, its splice junction sensitivity and specificity are unsurpassed.
The DRAGEN Methylation Pipeline provides support for automating the processing of bisulfite sequencing data to generate a BAM with the tags required for methylation analysis and reports detailing the locations with methylated cytosines.
Tag | Tag meaning |
---|---|
ZS:Z:R | Multiple alignments with similar score were found. |
ZS:Z:NM | No alignment was found. |
ZS:Z:QL | An alignment was found but it was below the quality threshold. |

CRAM option | Value | Description |
---|---|---|
SEQS_PER_SLICE | 2000 | Max sequences per slice |
BASES_PER_SLICE | SEQS_PER_SLICE*500 | Max bases per slice |
SLICE_PER_CNT | 1 | Max slices per container |
embed_ref | 0 | Do not embed reference sequence |
noref | 0 | Do not use non-referenced based encoding |
multiseq | -1 | Do not use multiple references per slice |
unsorted | 0 | Do not use unsorted mode |
use_bz2 | 0 | Do not compress using bzip2 |
use_lzma | 0 | Do not compress using lzma |
use_rans | 1 | Use rANS for quality score compression |
binning | NONE | Qual score binning not used |
preserve_aux_order | 1 | Preserve all aux tags and order (incl RG,NM,MD) |
preserve_aux_size | 0 | Aux tag sizes not preserved ('i', 's', 'c') |
lossy_read_names | 0 | Preserve read names |
lossy | 0 | Do not enable Illumina 8 quality-binning system |
ignore_md5 | 0 | Enable all checking of checksums |
decode_md | 0 | Do not (re)generate MD and NM tags |

Sex Karyotype Input | CNV Caller | ExpansionHunter | Ploidy Caller | Small Variant Caller | SV Caller |
---|---|---|---|---|---|
XX | XX | XX | XX | XX | XXYY |
XY | XY | XY | XY | XY | XXYY |
XXY | XY | XX | XY | XXYY | XXYY |
XYY | XY | XY | XY | XXYY | XXYY |
X0 | XX | XY | XX | XXYY | XXYY |
XXXY | XY | XX | XY | XXYY | XXYY |
XXX | XX | XX | XX | XXYY | XXYY |
None | XX/XY | XX | XX | XXYY | XXYY |

Attribute | Argument | Description |
---|---|---|
ID | --RGID | Read group identifier. If you include any of the read group parameters, RGID is required. It is the value written into each output BAM record. |
LB | --RGLB | Library. |
PL | --RGPL | Platform/technology used to produce the reads. The BAM standard allows for values CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT and PACBIO. |
PU | --RGPU | Platform unit, eg, flowcell-barcode.lane. |
SM | --RGSM | Sample. |
CN | --RGCN | Name of the sequencing center that produced the read. |
DS | --RGDS | Description. |
DT | --RGDT | Date the run was produced. |
PI | --RGPI | Predicted mean insert size. |

Component | Human hg19 | Human hs37d5 | Human hg38 | Human chm13 | Non-Human | Recommended Human Reference Type | Recommended Non-Human Reference Type |
---|---|---|---|---|---|---|---|
SNV | Yes | Yes | Yes | Yes | Yes | Graph | Non-Graph |
CNV | Yes | Yes | Yes | Yes* | No | Graph | Non-Graph |
SV | Yes | Yes | Yes | Yes* | Yes | Graph | Non-Graph |
Expansion Hunter | Yes | Yes | Yes | No | No | Graph | Non-Graph |
Targeted Callers | Yes | Yes | Yes | No | No | Graph | Non-Graph |
RNA | Yes | Yes | Yes | Yes* | Yes | Non-Graph | Non-Graph |
De Novo | Yes | Yes | Yes | Yes* | Yes | Graph | Non-Graph |
Joint Genotyping | Yes | Yes | Yes | Yes* | Yes | Graph | Non-Graph |
Biomarkers (HLA) | Yes | Yes | Yes | Yes* | No | Graph | Non-Graph |
gVCF genotyper | Yes | Yes | Yes | Yes* | Yes | Graph | Non-Graph |

Component | Human hg19 | Human hs37d5 | Human hg38 | Human chm13 | Non-Human | Recommended Human Reference Type | Recommended Non-Human Reference Type |
---|---|---|---|---|---|---|---|
SNV | Yes | Yes | Yes | Yes* | No | Non-Graph | Non-Graph |
UMI SNV | Yes | Yes | Yes | Yes* | No | Non-Graph | Non-Graph |
CNV | Yes | Yes | Yes | Yes* | No | Non-Graph | Non-Graph |
SV | Yes | Yes | Yes | Yes* | No | Non-Graph | Non-Graph |
Methylation | Yes | Yes | Yes | No | No | Non-Graph | Non-Graph |
Nirvana
Yes
Yes
Yes
No
Yes