DRAGEN Methylation Pipeline
Last updated
Last updated
The epigenetic methylation of cytosine bases in DNA can have a dramatic effect on gene expression, and bisulfite sequencing is the most common method for detecting epigenetic methylation patterns at single-base resolution. This technique involves chemically treating DNA with sodium bisulfite, which converts unmethylated cytosine bases to uracil, but does not alter methylated cytosines. Subsequent PCR amplification converts any uracils to thymines.
A bisulfite sequencing library can either be nondirectional or directional. For nondirectional, each double-stranded DNA fragment yields four distinct strands for sequencing, post-amplification, as shown in the following figure:
Bisulfite Watson (BSW), reverse complement of BSW (BSWR),
Bisulfite Crick (BSC), reverse complement of BSC (BSCR)
For directional libraries, the four strand types are generated, but adapters are attached to the DNA fragments such that only the BSW and BSC strands are sequenced (Lister protocol). Less commonly, the BSWR and BSCR strands are selected for sequencing (eg, PBAT).
BSW and BSC strands:
A, G, T: unchanged
Methylated C remains C
Unmethylated C converted to T
BSWR and BSCR strands:
Bases complementary to original Watson/Crick A, G, T bases remain unchanged
G complementary to original Watson/Crick methylated C remains G
G complementary to original Watson/Crick unmethylated C becomes A
Therefore, several steps are performed to map methylation, as shown in the following flowchart:
Details on each part of the mapping process can be found below.
DRAGEN methylation mapping works in one single alignment run. During alignment, the mapper considers all possible base and reference conversions for the read and emits the single best alignment to a particular methylation strand if one exists. Any read (pair) that did not have a single best scoring alignment across all methylation strands tested appears in the output BAM with MAPQ 0. The output BAM in single-pass might contain mapped reads that do not have the XM, XR, and XG methylation tags. Methylation data from reads that do not have methylation tags are not tallied into the reports or metrics files.
--enable-methylation-calling
must be set to true to enable single-pass methylation mapping.
A read must meet the following requirements to have methylation tags in the output BAM and to get tallied into the reports and metrics:
The read and its mate (if applicable) are mapped with MAPQ above the value specified using --methylation-mapq-threshold
. The default value is 0.
The read is not part of an improper pair.
The DRAGEN methylation pipeline requires a methylation-specific hash table in which combines both C to T and G to A conversions of contigs from the original FASTA. Specifically, each contig appears twice, once with C bases converted to T bases, and once with G bases converted to A bases. When --ht-methylated-combined=true
is set, the DRAGEN hash table builder will create a methylation hash table in a sub-directory of the output directory called methyl_converted
. When running the DRAGEN mapper, the top level directory should be provided (parent of methyl_converted), and DRAGEN will automatically locate the methylation sub-directory and use it as needed. The top level directory can be used for regular mapping.
Due to the base conversions in the methylation hash table, short seeds map poorly. So the default and recommended seed length of the methylation hash table is 27.
The following is an example command line for a methylation hash table.
Different methylation protocols require the generation of two or four alignments per input read, followed by an analysis to choose a best alignment and determine which cytosines are methylated. DRAGEN can automate this process by generating a single output BAM file with Bismark-compatible tags (XR, XG, and XM) that can be used for methylation calling and other downstream workflows.
When the --methylation-protocol
option is set to a valid value other than none, DRAGEN automatically produces the required set of alignment runs. Each alignment run includes the appropriate base conversions on the reads, base conversions on the reference, and constraints on whether reads must be forward-aligned or reverse complement (RC) aligned with the reference. The following options are automatically configured:
--generate-md-tags true
--Aligner.global 1
--Aligner.no-unpaired 1
--Aligner.aln-min-score 0
--Aligner.min-score-coeff -0.2
--Aligner.match-score 0
--Aligner.mismatch-pen 4
--Aligner.gap-open-pen 6
--Aligner.gap-ext-pen 1
--Aligner.supp-aligns 0
--Aligner.sec-aligns 0
--seed-density 1
Because global alignments (end-to-end in the reads) are generated, DRAGEN recommends trimming any artifacts introduced by library prep and adapter sequences.
The following table describes the properties of the alignment runs:
Protocol | BAM | Reference | Read 1 | Read 2 | Orientation Constraint |
---|---|---|---|---|---|
directional | |||||
1 | C->T | C->T | G->A | Forward-only | |
2 | G->A | C->T | G->A | RC-only | |
non-directional, or directional-complement | |||||
1 | C->T | C->T | G->A | Forward-only | |
2 | G->A | C->T | G->A | RC-only | |
3 | C->T | G->A | C->T | RC-only | |
4 | G->A | G->A | C->T | Forward-only | |
PBAT | |||||
3 | C->T | G->A | C->T | RC-only | |
4 | G->A | G->A | C->T | Forward-only |
In directional protocols, the library is prepared such that only the BSW and BSC strands are sequenced. Thus, alignment runs are performed with the two combinations of base conversions and orientation constraints best suited for these strands (directional runs 1 and 2 above).
With nondirectional protocols, reads from each of the four strands are equally likely, so alignment runs must be performed with two more combinations of base conversions and orientation constraints (nondirectional runs 3 and 4 above).
In PBAT protocols, the library is prepared so only the BSWR and BSCR strands are sequenced. Only two alignment runs are performed with the combinations of base conversions and orientation constraints best suited for these strands (runs 3 and 4).
The directional-complement protocol can also be used for PBAT or similar libraries where mainly the BSWR and BSCR strands are sequenced. With this protocol, all four aligner runs are performed, but relatively few good alignments are expected from the runs for the BSW and BSC strands, so DRAGEN is automatically tuned to a faster analysis mode for those runs.
The following is an example DRAGEN command line for the directional protocol:
An end-to-end (ie., fastq->bam->cytosine report) run can be performed as follows:
To generate sorted alignment output (in BAM format), set --enable-sort
to true.
To detect duplicate reads, set --enable-duplicate-marking
to true.
[Optional]To remove duplicate reads, set --remove-duplicates
to true.
[Optional]Set --methylation-generate-cytosine-report
and --methylation-generate-mbias-report
to either false or true according to user need.
By default, DRAGEN methylation performs strand-aware dedup in concordance with Bismark. Strand-aware dedup partitions the mapped reads into four groups, one per methylation strand. Within each group, DRAGEN performs a normal dedup. For paired reads, the strand of the pair is defined as the strand of the first read in the pair.
The following example demonstrates strand-aware dedup for paired-end reads. The example pairs all map to the same position, but the first read in each pair (BAM flag 83 and 99) is mapped to a different methylation strand, as shown by the different values of the XR and XG tags. None of these pairs are marked as duplicates.
DRAGEN support fastq files that contains UMI barcode during the alingment phase. The principle and requirement are identical to DNA UMI. Briefly, during library prep, (methyl-treated) DNA fragments could be barcoded by unique molecular identifiers, so that true signals from the original fragments can be separated from PCR error and sequencing error, which enables more accurate methylation calling. The fastq files need to have UMI barcode in 7th field of the QNAME.
eg. @NS500561:434:H5LC2BGXJ:1:11101:10798:1359:CACATGA+ACATTC 1:N:0:TGGTACCTAA+AGTACTCATG
To enable UMI, either set --umi-enable true
if you are using random UMI (common), or set --tso500-solid-umi true
if you are using the same non-random UMIs as the TSO500 solid panel. If so, read collapsing will be performed among reads with the same UMI that are mapped to the same genomic location at the same strand, either from top (OT/CTOT) or bottom (OB/CTOB) strand.
See DRAGEN DNA Pipeline / Unique Molecular Identifiers for more details.
TET-Assisted Pyridine Borane Sequencing (TAPS) is a new assay which directly converts methylated C to T, whereas typical bisulfite conversion converts unmethylated C to T. This approach preserves genomic complexity and uses less destructive chemicals to enable lower input DNA. To enable analysis of FASTQ data generated through TAPS, set --methylation-TAPS
to true. By default, the option is false. This option is performed only during the alignment step and is not necessary when generating methylation cytosine and M-Bias reports from an existing BAM.
When --enable-methylation-calling
is set to true, DRAGEN analyzes the alignments produced for the configured --methylation-protocol
and generates a single output BAM file that includes methylation-related tags for all mapped reads. As in Bismark, reads without a unique best alignment are excluded from the output BAM. The added tags are as follows.
Tag | Brief Description | Description |
---|---|---|
XR:Z | Read conversion | For the best alignment, which base-conversion was performed on the read: CT or GA. |
XG:Z | Reference conversion | For the best alignment, which base-conversion was performed on the reference: CT or GA |
XM:Z | Methylation call | A byte-per-base methylation string. |
The XM:Z (methylation call) tag contains a byte that corresponds to each base in the sequence of the read. Each position that does not involve a cytosine contains a period (.). Each position that does involve a cytosine contains a letter. The letter indicates the context (CpG, CHG, CHH, or unknown). The case indicates methylation. Methylated positions use upper-case and unmethylated positions use lower-case. The letters used at cytosine positions are as follows.
Character | Methylated? | Context |
---|---|---|
. | not cytosine | not cytosine |
z | No | CpG |
Z | Yes | CpG |
X | No | CHG |
X | Yes | CHG |
h | No | CHH |
H | Yes | CHH |
u | No | Unknown |
U | Yes | Unknown |
You can use DRAGEN to generate a genome-wide cytosine methylation report. Your command line options settings depend on if you are running using FASTQ through the aligner or a prealigned BAM that already contains the methylation tags.
For FASTQ input, set --methylation-generate-cytosine-report=true
For BAM input, set --methylation-reports-only=true
To keep all cytosines from your reference in the CX_report
, even if they are not included in the input sequences, set --methylation-keep-ref-cytosine true
. The default value is false. Setting this option to true increases run time and the CX_report
file size.
To compress the cytosine report, set --methylation-compress-cx-report = true
. The default value is false. DRAGEN outputs a compressed *.CX_report.txt.gz
, instead of a *.CX_report.txt
.
The position and strand of each C in genome are given in the first three fields of the report. A record with a - in the strand field is used for a G in the reference FASTA. The counts of methylated and unmethylated Cs covering the positions are given in the fourth and fifth fields. The C context in the reference (CG, CHG, or CHH) is given in the sixth field. The trinucleotide sequence context is given in the last field (eg, CCC, CGT, CGA, and so on) The cytosine report only includes records for positions that have one or more spanning alignments. The following is an example cytosine report record:
chr2 24442367 + 18 0 CG CGC
To generate an M-bias report, set --methylation-generate-mbias-report
to true. This report contains three tables for single-ended data with one table for each C-context and six tables for paired-end data. Each table is a series of records, with one record per read base position. For example, the first record for the CHG table contains the counts of methylated Cs (field 2) and unmethylated Cs (field 3) that occur in the first read base position, and restricts to those reads in which the first base is aligned to a CHG location in the genome. Each record of a table also includes the percent methylated C bases (field 4) and the sum of methylated and unmethylated C counts (field 5).
The following is an example M-bias record for read base position 10:
10 7335 2356 75.69 9691
For data sets with paired-end reads that overlap, both the cytosine and M-bias reports do not report any Cs in the second read that overlaps the first read. In addition, 1-based coordinates are used for positions in both reports..
The quality of each methylation run can be summarized in the following two metric files.
*.mapping_metrics.csv
—Contains mapping-specific metrics that are generated for the alignment phase, including benchmarks like number of total reads, aligned reads, deduped reads, base quality, etc.
*.methyl_metrics.csv
—Contains methylation-specific metrics that are generated for the methylation calling phase, including benchmarks like the total number of cytosines analyzed, count and rate of methylation in each cytosine context, strand of the best alignment, etc.