Illumina 5-base Prep

DRAGEN’s 5-Base DNA pipeline integrates genetic and epigenetic analysis, enabling simultaneous genome and methylome profiling through specialized processing options. This comprehensive workflow supports mapping, unique molecular identifiers, methylation calling, variant calling, and copy number variation analysis tailored for 5-base data.

Summary

  • Integrated 5-base processing: Activating --methylation-conversion=illumina automatically enables multiple DRAGEN options for methylation mapping, calling, and UMI processing specific to 5-base data. 

  • 5-base hash table building: Setting --ht-methylated-cg=true builds a 5-base reference hash table stored under the methyl_cg sub-directory, which DRAGEN mapper uses automatically. 

  • Mapping and alignment: The pipeline adapts mapping algorithms to account for C>T conversions due to methylation, supporting local alignment, soft-clipping, and graph genomes by default. 

  • Unique molecular identifiers: UMI processing is extended to handle methylation-induced asymmetric base pairing, with duplex consensus reads annotated with methylation status on both strands. Only nonrandom-duplex UMI libraries are compatible. 

  • Methylation calling: Methylation is identified by C>T or G>A mismatches, with variant calling deconvoluting methylation from true variants. Directional methylation protocol is required. 

  • Methylation reports: DRAGEN generates BAM files with methylation tags, genome-wide cytosine methylation reports, and optionally integrates methylation data into VCF/gVCF files, balancing completeness and file size.  

  • Quality metrics: Mapping and methylation-specific metrics are produced, including base quality and methylation rates, to assess run quality.  

  • Small variant calling support: The pipeline supports germline and somatic variant calling on 5-base data with enhanced algorithms to distinguish methylation-induced changes from variants. Some features like pedigree analysis are not currently supported but planned for future releases.  

  • Copy number variant calling: Supported for whole genome sequencing in germline and somatic contexts with some limitations. Allele-specific copy number (ASCN) germline CNV calling is not currently supported.

Overview

DNA is inherently multiomic, holding both genetic and epigenetic molecular information. Beyond the sequence of adenine (A), thymine (T), guanine (G), and cytosine (C), there are modified bases such as 5-methylcytosine (5mC) that help direct gene expression. The Illumina 5-Base DNA Prep is a single workflow that, when combined with DRAGEN algorithms, provides an integrated readout of both genome and methylome:

image

The following processing is available for 5-base data in DRAGEN, and activated by setting --methylation-conversion=illumina:

image

The --methylation-conversion=illumina batch option sets the following DRAGEN options automatically to facilitate downstream processing. These specific options will be discussed in the following sections, but it is not necessary to set them independently.

--enable-cpg-methylated-mapping=true
--enable-methylation-calling=true
--vc-enable-methylation=true
--umi-enable-methylation=true
--umi-library-type=nonrandom-duplex
--methylation-protocol=directional
--methylation-generate-mbias-report=true
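
For wrapper scripts that assemble DRAGEN command lines, the implied settings listed above can be tracked with a small lookup. The following is a minimal sketch, not DRAGEN behavior: the helper name effective_options and the conflict check are assumptions made for illustration, and this document does not describe how DRAGEN itself resolves a conflicting explicit setting.

```python
# Options that this page lists as set automatically by
# --methylation-conversion=illumina.
IMPLIED_BY_ILLUMINA_CONVERSION = {
    "--enable-cpg-methylated-mapping": "true",
    "--enable-methylation-calling": "true",
    "--vc-enable-methylation": "true",
    "--umi-enable-methylation": "true",
    "--umi-library-type": "nonrandom-duplex",
    "--methylation-protocol": "directional",
    "--methylation-generate-mbias-report": "true",
}


def effective_options(user_options: dict) -> dict:
    """Return the user's options plus the implied ones (hypothetical helper).

    Raises ValueError when an implied option was explicitly set to a
    different value. This is bookkeeping for a wrapper script only.
    """
    if user_options.get("--methylation-conversion") != "illumina":
        return dict(user_options)
    merged = dict(user_options)
    for name, value in IMPLIED_BY_ILLUMINA_CONVERSION.items():
        if name in merged and merged[name] != value:
            raise ValueError(
                f"{name}={merged[name]} conflicts with implied {name}={value}"
            )
        merged[name] = value
    return merged
```

Calling effective_options({"--methylation-conversion": "illumina"}) returns the batch option together with the seven implied options.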

Build a 5-Base Hash Table

When --ht-methylated-cg=true is set, the DRAGEN reference builder saves 5-base reference information in a sub-directory of the output directory called methyl_cg. When running the DRAGEN mapper, provide the top-level directory (the parent of methyl_cg); DRAGEN auto-detects which reference sub-folder to use depending on the workflow specified (DNA, 5-base, RNA, etc.).
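
Before launching a 5-base run, a script can confirm that the top-level reference directory actually contains the methyl_cg sub-directory. This is a minimal sketch: the function name is hypothetical, and it only tests that the directory exists, not that its contents are valid.

```python
from pathlib import Path


def has_5base_hash_table(reference_dir) -> bool:
    """Return True if reference_dir contains a methyl_cg sub-directory.

    reference_dir is the top-level directory passed to the mapper,
    that is, the parent of methyl_cg.
    """
    return (Path(reference_dir) / "methyl_cg").is_dir()
```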

5-Base Map/Align

Map/align of 5-Base data follows the approach detailed under DRAGEN DNA. 5-base data-specific algorithms have been updated to account for the possibility of C>T conversions due to methylation at both seed mapping and alignment scoring stages. Unlike with DRAGEN Methylation, local alignment, soft-clipping, and graph genomes are all supported and used by default.

Unique Molecular Identifiers

UMI processing follows the logic and options detailed under unique-molecular-identifiers. The logic used to collapse duplex UMI adapters has been extended with 5-base data-specific algorithms to allow for accurate collapsing of asymmetric base pairing due to mC > T conversion. For duplex consensus reads in the BAM, XM tags report the methylation status of cytosines on both + and – strands. 5-base data is only compatible with --umi-library-type=nonrandom-duplex.

Methylation Calling

Methylation is primarily identified by reference C>T mismatches on the + strand, or G>A mismatches on the – strand. Additionally, variant calling provides confident deconvolution of methylation and variant status, whether mixed or unique, at non-reference cytosines, extending the completeness and accuracy of the methylome for individual subjects. If any bases on read2 overlap with those on read1, they are reported in the BAM but otherwise excluded from downstream quantification. 5-base data is only compatible with --methylation-protocol=directional.
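
The primary rule (C>T on the + strand, G>A on the – strand) can be illustrated per base. This is a deliberately simplified sketch with a hypothetical function name: it ignores variants, base quality, and the read1/read2 overlap handling described above, all of which the real caller takes into account.

```python
def call_base(ref_base: str, read_base: str, strand: str):
    """Classify one aligned base under the simple mismatch rule.

    Returns "methylated", "unmethylated", or None when the position is
    not informative (not a C on "+", not a G on "-", or another base).
    """
    # (informative reference base, base observed when methylated)
    rules = {"+": ("C", "T"), "-": ("G", "A")}
    if strand not in rules:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    informative_ref, converted = rules[strand]
    if ref_base.upper() != informative_ref:
        return None
    read = read_base.upper()
    if read == converted:
        return "methylated"
    if read == informative_ref:
        return "unmethylated"
    return None  # some other base: left to variant calling
```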

Methylation Calling Outputs

BAM

The DRAGEN BAM file includes methylation-related tags for all proper-pair mapped reads with MAPQ>0. The added tags follow Bismark conventions:

image
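
As an illustration of reading such tags, the sketch below tallies calls from an XM string. It assumes the string uses Bismark's per-base letter code (z/Z for CG, x/X for CHG, h/H for CHH, uppercase meaning methylated, lowercase unmethylated, and . for other bases). The function name is hypothetical, and the exact tag contents in DRAGEN 5-base BAMs should be checked against the tag table.

```python
from collections import Counter

# Lowercase letter -> cytosine context, per Bismark's XM convention.
XM_CONTEXT = {"z": "CG", "x": "CHG", "h": "CHH"}


def summarize_xm(xm: str) -> dict:
    """Return {context: (methylated_count, unmethylated_count)}."""
    counts = Counter(xm)
    return {
        context: (counts[letter.upper()], counts[letter])
        for letter, context in XM_CONTEXT.items()
    }
```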

Cytosine Report

DRAGEN can generate a genome-wide cytosine methylation report (CX_report.txt.gz) containing the methylation status of every reference cytosine in the genome by setting --methylation-generate-cytosine-report=true.

  • If processing 5-base data without enabling the variant caller, this option is set to true automatically.

  • The option will default to false when --enable-variant-caller=true is set, as cytosine methylation is already output in the (g)VCF. See VCF methylation reporting.

To keep all cytosines from your reference in the CX_report, even if they are not included in the input sequences, set --methylation-keep-ref-cytosine=true (default: false).

  • Setting this option to true increases run time and the CX_report file size.

To compress the cytosine report, set --methylation-compress-cx-report=true (default: false).

  • DRAGEN outputs a compressed *.CX_report.txt.gz, instead of a *.CX_report.txt.

Report Fields

  • The position and strand of each C in the genome are given in the first three fields of the report.

  • A record with a - in the strand field is used for a G in the reference FASTA.

  • The counts of methylated and unmethylated Cs covering the positions are given in the fourth and fifth fields.

  • The C context in the reference (CG, CHG, or CHH) is given in the sixth field.

  • The trinucleotide sequence context is given in the last field (e.g., CCC, CGT, CGA, and so on).

  • The cytosine report only includes records for positions that have one or more spanning alignments.

The following is an example cytosine report record:

chr2 24442367 + 18 0 CG CGC
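
The seven report fields map directly onto a small record type. The following is a minimal parsing sketch: the names CytosineRecord and parse_cx_line are hypothetical, and splitting on any whitespace is an assumption about the delimiter.

```python
from typing import NamedTuple, Optional


class CytosineRecord(NamedTuple):
    chrom: str
    position: int
    strand: str          # "+" for a reference C, "-" for a reference G
    methylated: int      # count of methylated Cs covering the position
    unmethylated: int    # count of unmethylated Cs covering the position
    context: str         # CG, CHG, or CHH
    trinucleotide: str   # e.g. CGC

    @property
    def methylation_fraction(self) -> Optional[float]:
        """Methylated / total calls, or None when there is no coverage."""
        total = self.methylated + self.unmethylated
        return self.methylated / total if total else None


def parse_cx_line(line: str) -> CytosineRecord:
    """Parse one cytosine report line into a CytosineRecord."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    chrom, pos, strand, meth, unmeth, context, tri = fields
    return CytosineRecord(chrom, int(pos), strand, int(meth), int(unmeth), context, tri)
```

For the example record, parse_cx_line("chr2 24442367 + 18 0 CG CGC") gives 18 methylated and 0 unmethylated calls in a CG context, so its methylation fraction is 1.0.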

VCF methylation reporting

5-base small variant calling is enabled by setting --enable-variant-caller=true and --methylation-conversion=illumina.

  • DRAGEN can integrate methylation reporting into the VCF and gVCF output files as well.

  • In contrast to the CX_report, methylation is reported not only for the reference allele but also at alternative alleles, which yields more accurate %methylation estimates even in the presence of confounding T or A variant alleles.

For reporting in the (g)VCF files, the --methylation-report-to-vcf and --methylation-report-to-gvcf options can be set to none, cg, or c.

  • none will exclude methylation reporting.

  • cg will report methylation of cytosines in a CpG context.

  • c will report methylation at all cytosines.

When analysis is configured as described above, the defaults are --methylation-report-to-vcf=c and --methylation-report-to-gvcf=cg, which balances complete methylome reporting against file size.
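
The effect of the three modes can be expressed as a predicate over cytosine context. This is an illustrative sketch only: the function and constant names are hypothetical, and it models the selection rule described above rather than DRAGEN's implementation.

```python
# Defaults stated above for 5-base small variant calling.
DEFAULT_REPORT_MODES = {
    "--methylation-report-to-vcf": "c",
    "--methylation-report-to-gvcf": "cg",
}

CYTOSINE_CONTEXTS = {"CG", "CHG", "CHH"}


def reports_methylation(context: str, mode: str) -> bool:
    """Return True if a cytosine in `context` is reported under `mode`."""
    if mode == "none":
        return False
    if mode == "cg":
        return context == "CG"
    if mode == "c":
        return context in CYTOSINE_CONTEXTS
    raise ValueError(f"mode must be none, cg, or c; got {mode!r}")
```

With the defaults, a CHH cytosine would appear in the VCF (mode c) but not in the gVCF (mode cg).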

Below are the VCF header definitions of the 5-base methylation fields:

  • INFO:M5mC Marks nucleotides for which 5mC levels are reportable. The letters z, x and h indicate CG, CHG and CHH contexts, respectively. The lowercase letters z, x and h are used to report methylation of individual cytosines (C), whereas the uppercase Z marks CpG dinucleotides for which methylation reporting is aggregated across the two CpG cytosines on opposite strands. The missing value (.) is used for unreported or not applicable (A/T) nucleotides.

  • FORMAT:M5mC 5mC methylation levels of individual cytosines/CpG dinucleotides. Encoded as a VCFv4.5 Number=M field but with cardinality defined by the INFO M5mC field.

  • FORMAT:DPM5mC Total informative read depth of potentially 5mC modified bases. Encoded as a VCFv4.5 Number=M field but with cardinality defined by the INFO M5mC field.

Metric files

The quality of each methylation run can be summarized in the following metric files.

  • *.mapping_metrics.csv—Contains mapping-specific metrics that are generated for the alignment phase, including benchmarks like number of total reads, aligned reads, deduped reads, base quality, etc.

  • *.methyl_metrics.csv—Contains methylation-specific metrics that are generated for the methylation calling phase, including benchmarks like the total number of cytosines analyzed, count and rate of methylation in each cytosine context, strand of the best alignment, etc. This file is generated when --methylation-generate-mbias-report=true.

Small Variant Calling

For a comprehensive overview of small variant calling, please see small-variant-calling. 5-base data is a supported input for:

  • Germline

  • Somatic Tumor Only WGS

  • Somatic Tumor/Normal WGS

  • Somatic Tumor Only Enrichment (including both solid and ctDNA modes)

To support accurate variant calling on 5-base data, updates were made throughout the variant caller algorithm and statistical models to account for methylation-specific C>T conversions. The methylation-induced substitutions appear on only a single DNA strand, while variant calls have evidence on both strands. This information allows estimation of methylation levels at putative variant positions, and deconvolution of this signal from DNA variant evidence. This approach provides confident and accurate variant calls without sacrificing excessive information by masking all ambiguous base changes.

Not all small variant calling functionality is supported for 5-base data. The following features are not yet available:

  • Pedigree Analysis

  • Machine Learning for Variant Calling

  • VCF Imputation

  • Multi-Region Joint Detection
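
The strand asymmetry described in this section (methylation changes one strand, a true variant changes both) can be illustrated with a toy classifier. This is not DRAGEN's statistical model: the function name, the count-based inputs, and the min_fraction threshold are all assumptions chosen for illustration.

```python
def classify_site(conv_ref: int, conv_alt: int,
                  opp_ref: int, opp_alt: int,
                  min_fraction: float = 0.2) -> str:
    """Toy classification of a site from strand-split read counts.

    conv_*: reads from the strand carrying the cytosine, where either
            methylation or a variant can produce the base change.
    opp_*:  reads from the opposite strand, where only a true variant
            produces the change.
    Returns "variant", "methylation", "reference", or "no_call".
    """
    conv_depth = conv_ref + conv_alt
    opp_depth = opp_ref + opp_alt
    if conv_depth == 0 or opp_depth == 0:
        return "no_call"  # both strands are needed to separate the signals
    conv_fraction = conv_alt / conv_depth
    opp_fraction = opp_alt / opp_depth
    if opp_fraction >= min_fraction:
        return "variant"      # change is visible on the opposite strand
    if conv_fraction >= min_fraction:
        return "methylation"  # change is confined to one strand
    return "reference"
```

A real caller models mixed methylation and variant signal jointly; this sketch only shows why evidence from both strands is what makes the two separable.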

CNV Calling

Copy number variant calling follows the logic outlined in cnv-calling. Not all pipelines or modules are compatible with 5-base data:

  • Germline CNV Calling (depth-based): Supported for WGS; not supported for WES

  • Germline CNV Calling ASCN: Not supported

  • Multisample Germline CNV Calling: Not supported

  • Somatic CNV Calling ASCN: Supported for WGS; not supported for WES

  • Somatic CNV Calling WES: Not supported

  • Cytogenetics Modality: Not supported

  • CNV with SV Support: Supported

