DRAGEN implements a beta version of the Population Haplotyping tool. The tool estimates haplotypes from a population-scale dataset by packaging the SHAPEIT5 software (Hofmeister RJ, Ribeiro DM, Rubinacci S, Delaneau O, 2022). It phases common variants and then rare variants in a step-by-step workflow, which must be repeated for each chromosome of the studied genome:
Step 1: Phase Common step to estimate the haplotypes of common variants (variants with allele frequency at or above a given allele frequency threshold) on defined regions.
Step 2: Ligate Common step to ligate the phased common variants from step 1 into a single chromosome.
Step 3: Phase Rare step to add the haplotypes of rare variants (variants with allele frequency below that threshold) on defined regions to the common variant scaffold obtained in step 2.
Step 4: Concat All step to concatenate the haplotype regions obtained in step 3 into a single chromosome.
This tool provides the best accuracy on population-scale datasets with thousands of samples. Running it on multiple nodes to parallelize processing is recommended. A common use case of the Population Haplotyping tool is the generation of a custom reference panel for the VCF Imputation pipeline.
The tool supports autosomes and mixed ploidy chromosomes, for diploid species only. It does not use FPGA acceleration and can run on a generic, software-only compute node.
Note: the Population Haplotyping tool only supports input msVCF produced with the DRAGEN gVCF Genotyper tool.
The following are examples of the commands required to generate haplotypes for common and rare variants (with the default allele frequency threshold) on a population-scale dataset:
To generate per-chromosome haplotypes:
To generate per-genome haplotyped sites:
For the Phase Common step (step 1), it is recommended to provide msVCF files generated with the DRAGEN gVCF Genotyper tool. This step takes as input a .txt file containing the path to a single msVCF or to a list of msVCFs, one path per line. The msVCFs must meet the following requirements:
per-chromosome msVCF, or positionally sorted msVCF shards spanning a whole chromosome without overlap (shards are defined below)
generated from the same reference build
compressed and indexed
with unphased GT calls
with no duplicates
with ##contig header lines that include the "ID" and "length" fields for all contigs present in the studied genome
Note: for mixed ploidy chromosomes, each PAR and non-PAR region of the chromosome must be treated as a chromosome in its own right. For example, on human data, the sample input msVCF for chrX must be divided into chrX_par1, chrX_par2, and chrX_nonpar.
The msVCF input list provided at step 1 is pre-processed to generate a formatted msVCF called <prefix>.preprocess.vcf.gz. This formatted msVCF is generated in the output directory and must be used as input to the Phase Rare step (step 3).
To facilitate parallel processing on distributed compute nodes, and to avoid the overhead of downloading and uploading a chromosome-level multisample VCF for each sub-chromosome job, chromosome portions of equal size (shards) can be used as input. The gVCF Genotyper tool, with the appropriate option, can generate these equal-size shards. Note: streaming from the cloud is not supported. Instead, download the inputs in advance and process them locally to achieve maximum I/O efficiency and stability.
A per-chromosome genetic map corresponding to the studied species and to the reference build used for the msVCF input is required. You can use your own genetic map computed from the recombination rate of the species and its reference genome, or use the genetic map for the human hg38 reference genome, available for download from the DRAGEN Software Support Site page. DRAGEN does not generate custom genetic map files.
The genetic map should follow the format:
3 columns, tab separated, in this order: position, chromosome number, distance (cM)
The genetic map for a mixed ploidy chromosome must be separated into its PAR and non-PAR regions (e.g. for human, chromosome X is split into the PAR1 chrX_par1, PAR2 chrX_par2, and non-PAR chrX_nonpar regions)
No genetic map is needed for a region in which all samples are haploid (e.g. for human, chromosome Y chrY)
The user must ensure that the genetic maps provided are from the same reference build as the reference used to generate the msVCF input.
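A quick sanity check of a genetic map file against the three-column format above can be sketched in Python. The function name `check_genetic_map` is hypothetical. Two behaviors are assumptions that the text above does not state: that a first line whose position field is not an integer is a header to skip, and that positions and cM distances never decrease from one row to the next.

```python
from typing import Iterable, List, Tuple


def check_genetic_map(lines: Iterable[str]) -> List[Tuple[int, str, float]]:
    """Parse rows of: position, chromosome, distance (cM), tab separated."""
    rows: List[Tuple[int, str, float]] = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 tab-separated columns, got {len(fields)}"
            )
        pos_text, chrom, cm_text = fields
        if line_no == 1 and not pos_text.isdigit():
            continue  # assumption: a non-numeric first line is a header
        position, distance = int(pos_text), float(cm_text)
        # Assumption: a genetic map is sorted, so neither value may decrease.
        if rows and (position < rows[-1][0] or distance < rows[-1][2]):
            raise ValueError(f"line {line_no}: position or cM distance decreases")
        rows.append((position, chrom, distance))
    return rows
```

Running the check once per map file, before launching any phasing job, catches space-delimited or mis-ordered files early.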
The configuration file is a required text file. It allows for proper handling of haploid/diploid chromosomes and verification of concordance between the genetic maps, the msVCF input, and the sample type file. The current configuration supports binary gender (male or female) and ploidy 2 or 1. When a region has different ploidies in male and female samples, it is considered a mixed ploidy region (e.g. for human, the non-PAR region of chromosome X, chrX_nonpar).
Users can provide their own config file or use the one available for download from the DRAGEN Software Support Site page.
The config file is a text file with the headers:
##version
##ref_build indicating the reference build used for the study.
Below the headers, the config file contains 4 tab-delimited columns, each of which must be populated.
Note: for a mixed ploidy chromosome, ensure the genetic map is separated into its PAR and non-PAR regions with no overlap. Example: for human data, the prefixes should be chrX_par1, chrX_nonpar, and chrX_par2.
The sample type file is a required file. The number and names of the samples in the sample type file should match those in the input multisample VCF.
The sample type file is a .txt file with the following format:
2 columns, tab or space delimited
First column: names of all samples present in the input msVCF
Second column: 1 or 2. 1 for a subject with mixed ploidy chromosomes, 2 for a subject with all diploid chromosomes.
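The match between the sample type file and the msVCF samples can be checked before a run. The sketch below is one way to do it in Python; the function names are hypothetical, and it relies only on the VCF convention that sample names follow the FORMAT column on the `#CHROM` header line.

```python
from typing import Dict, Iterable, List


def read_sample_types(lines: Iterable[str]) -> Dict[str, int]:
    """Read 'sample<TAB or space>type' rows, where type is 1 or 2."""
    types: Dict[str, int] = {}
    for line_no, raw in enumerate(lines, start=1):
        fields = raw.split()  # splits on tabs or spaces
        if not fields:
            continue
        if len(fields) != 2 or fields[1] not in ("1", "2"):
            raise ValueError(f"line {line_no}: expected '<sample> <1|2>'")
        if fields[0] in types:
            raise ValueError(f"line {line_no}: duplicate sample {fields[0]}")
        types[fields[0]] = int(fields[1])
    return types


def samples_from_header(chrom_line: str) -> List[str]:
    """Return sample names from a VCF '#CHROM ...' header line."""
    # Columns 1-9 are CHROM..FORMAT; everything after is a sample name.
    return chrom_line.rstrip("\n").split("\t")[9:]


def mismatched_samples(types: Dict[str, int], vcf_samples: List[str]) -> List[str]:
    """Return names present in only one of the two inputs, sorted."""
    return sorted(set(types) ^ set(vcf_samples))
```

An empty list from `mismatched_samples` means both inputs name exactly the same samples.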
The Phase Common step (step 1) is run on a defined region and outputs:
a single scaffold msVCF and related msVCF index with phased common variants for that region. The default name is dragen.ph_phase_common.vcf.gz.
a single formatted msVCF called <prefix>.preprocess.vcf.gz and related index. This formatted msVCF is generated in the output directory and must be used as input to the Phase Rare step (step 3).
The Ligate Common step (step 2) ligates the regions phased in step 1 and outputs a single scaffold msVCF and related msVCF index with phased common variants for a single chromosome. The default name is “dragen.ph_ligate_common.vcf.gz”.
The Phase Rare step (step 3) is run on a defined region of a chromosome, using the preprocessed unphased msVCF from step 1 and the phased scaffold msVCF from step 2, and outputs:
a single phased msVCF and related msVCF index with phased common and rare variants for that region. The default name is “dragen.ph_rare_common.vcf.gz”.
a single 8-column VCF and related index listing all sites that have been phased for that region. The default name is “dragen.ph_rare_common.sites.vcf.gz”. This output is used at the Concat All step to generate a VCF file with all phased sites across the genome.
The Concat All step is used to generate 2 types of output:
Phased common and rare variants for a chromosome
The Concat All step (step 4) concatenates the regions phased in step 3 and outputs an msVCF and related index with phased common and rare variants for a single chromosome. The default name is “dragen.ph_concat_all.vcf.gz”.
List of phased sites
This output is useful as input to the ForceGT tool. The Concat All step lists all phased sites in an 8-column VCF format and outputs a VCF and related index containing that list. This output can be generated either directly from the phased site VCFs across the genome produced in step 3 or, in a second pass, from the per-chromosome site lists once they have been generated. The default name is “dragen.ph_concat_all.sites.vcf.gz”.
An additional module of the Population Haplotyping tool checks the quality of the haplotypes produced against a phased truth set provided as input.
The four columns of the config file are:
Column information | Description |
---|---|
First column: filename | Specifies the genetic map basename, 1 name per line. Mixed ploidy chromosomes must be separated into PAR and non-PAR regions. Basenames must match the genetic map basenames. |
Second column: region | Specifies the start and end positions of the chromosome or sub-chromosome region in the format <contig_name>:<start_position>-<end_position>. For chromosomes without mixed ploidy regions, the start position is 1 and the end position is the length of the chromosome (1-based, inclusive). For chromosomes with mixed ploidy regions, the start and end positions of each region are those of the region (1-based, inclusive). |
Third column: mixed ploidy subject | Specifies 2 for diploid chromosomes and PAR regions, and 1 for non-PAR regions. |
Fourth column: diploid subject | Specifies 2 for all chromosomes. |
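
The column rules above can be checked mechanically. The following Python sketch parses a config file and validates each row; `ConfigRow` and `parse_config` are hypothetical names. It assumes that lines beginning with `##` are the header lines and that there is no separate column-title row, neither of which is confirmed by the description above.

```python
import re
from typing import Iterable, List, NamedTuple

# <contig_name>:<start_position>-<end_position>, 1-based and inclusive
REGION_RE = re.compile(r"^([^:\s]+):(\d+)-(\d+)$")


class ConfigRow(NamedTuple):
    map_basename: str
    contig: str
    start: int
    end: int
    mixed_ploidy_subject: int
    diploid_subject: int


def parse_config(lines: Iterable[str]) -> List[ConfigRow]:
    rows: List[ConfigRow] = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # assumption: '##' lines are the version/ref_build headers
        fields = line.split("\t")
        if len(fields) != 4 or any(field == "" for field in fields):
            raise ValueError(f"line {line_no}: expected 4 populated tab-delimited columns")
        match = REGION_RE.match(fields[1])
        if match is None:
            raise ValueError(f"line {line_no}: bad region {fields[1]!r}")
        start, end = int(match.group(2)), int(match.group(3))
        if start < 1 or end < start:
            raise ValueError(f"line {line_no}: region must be 1-based with start <= end")
        mixed, diploid = int(fields[2]), int(fields[3])
        if mixed not in (1, 2) or diploid != 2:
            raise ValueError(f"line {line_no}: ploidy columns must be 1|2 and 2")
        rows.append(ConfigRow(fields[0], match.group(1), start, end, mixed, diploid))
    return rows
```

Rows whose third column is 1 identify the mixed ploidy (non-PAR) regions, which is a convenient way to list the regions that need separate runs.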
Options for the Phase Common step (step 1):
Option | Required | Description |
---|---|---|
--enable-population-haplotyping | Yes | Set to true to enable the Population Haplotyping tool. |
--enable-phase-common | Yes | Set to true to enable the Phase Common step. |
--ph-phase-common-input-list | Yes | Provides a .txt file listing the sample input for one chromosome, with the path to a single msVCF or a list of msVCFs, one path per line. Note: for a mixed ploidy chromosome, each PAR and non-PAR region must be treated as a single chromosome. |
--ph-phase-common-input-region | Yes | Specifies the target region to be phased. String in the format contigname:startposition-endposition. Consecutive regions must overlap for the downstream Ligate Common step. Example of input region length for human data: 10 Mbp. Note: for a chromosome with mixed ploidy regions and diploid regions, the command should be run with one region at a time (e.g. three runs with the three regions chrX_par1, chrX_nonpar, and chrX_par2, instead of one run with region chrX). |
--ph-phase-common-map | Yes | Provides the path to the chromosome genetic map. Note: for a mixed ploidy chromosome, the genetic map must be divided into PAR and non-PAR regions. |
--ph-phase-common-config | Yes | Provides the path to the .txt config file. |
--ph-phase-common-reference | No | Provides the path to a reference panel of haplotypes in msVCF format. Useful for iterative haplotyping to accelerate the process. |
--ph-phase-common-scaffold | No | Provides the path to a scaffold of haplotypes in msVCF format. Useful for iterative haplotyping to accelerate the process. |
--ph-phase-common-sample-type | Yes | Provides the path to the sample type file. |
--ph-phase-common-filter-maf | No | Default 0.001. Sets the Minimum Allele Frequency threshold. All variants with allele frequency equal to or above this MAF are phased during the Phase Common step. |
--ph-phase-common-max-miss-gt-rate | No | Default 0.1. Sets the missing-GT threshold. Variants with a rate of missing GT higher than this value are skipped. |
--output-directory | Yes | Specifies the output directory. |
--output-file-prefix | No | Specifies the prefix for the names of the files generated by the pipeline. |
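
Because Phase Common regions must overlap for the later ligation, it can help to generate the region strings programmatically. The Python sketch below tiles a chromosome into overlapping chunks and assembles one argument list per chunk from the options in the table above. Several things are assumptions for illustration only: the function names, the overlap size (the text gives a 10 Mbp region length for human data but no overlap value), the `dragen` executable name, and the `--option=value` spelling.

```python
from typing import List


def overlapping_regions(contig: str, length: int, chunk: int, overlap: int) -> List[str]:
    """Tile 1..length into chunks of `chunk` bp that share `overlap` bp."""
    if length < 1 or overlap < 0 or chunk <= overlap:
        raise ValueError("need length >= 1 and chunk > overlap >= 0")
    regions: List[str] = []
    start = 1
    while True:
        end = min(start + chunk - 1, length)
        regions.append(f"{contig}:{start}-{end}")
        if end == length:
            break
        start = end - overlap + 1  # next chunk re-covers the last `overlap` bp
    return regions


def phase_common_args(region: str, input_list: str, genetic_map: str,
                      config: str, sample_type: str, out_dir: str) -> List[str]:
    """Build an argument list using only the required Phase Common options."""
    return [
        "dragen",  # assumption: executable name
        "--enable-population-haplotyping=true",
        "--enable-phase-common=true",
        f"--ph-phase-common-input-list={input_list}",
        f"--ph-phase-common-input-region={region}",
        f"--ph-phase-common-map={genetic_map}",
        f"--ph-phase-common-config={config}",
        f"--ph-phase-common-sample-type={sample_type}",
        f"--output-directory={out_dir}",
    ]
```

Each argument list can then be dispatched to a separate node, which matches the recommendation to parallelize across multiple nodes.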
Options for the Ligate Common step (step 2):
Option | Required | Description |
---|---|---|
--enable-population-haplotyping | Yes | Set to true to enable the Population Haplotyping tool. |
--enable-ligate-common | Yes | Set to true to enable the Ligate Common step. |
--ph-ligate-common-input-list | Yes | Provides a .txt file listing the phased msVCFs for a single chromosome. The msVCF files are the output files of the Phase Common step, listed in ascending position order. Note: for a mixed ploidy chromosome, each PAR and non-PAR region must be treated as a single chromosome. |
--output-directory | Yes | Specifies the output directory. |
--output-file-prefix | No | Specifies the prefix for the names of the files generated by the pipeline. |
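
The Ligate Common input list must name the Phase Common outputs in ascending position order. A minimal Python sketch of building that list is shown below; it assumes the caller keeps a mapping from each phased region string to its output path, and the function names are hypothetical.

```python
from typing import Dict


def region_start(region: str) -> int:
    """Return the start position of 'contig:start-end'."""
    _, span = region.rsplit(":", 1)
    return int(span.split("-", 1)[0])


def ligate_list_text(chunk_paths: Dict[str, str]) -> str:
    """Return the input-list text: one path per line, ascending by start."""
    contigs = {region.rsplit(":", 1)[0] for region in chunk_paths}
    if len(contigs) > 1:
        # The list must pertain to a single chromosome (or PAR/non-PAR region).
        raise ValueError(f"regions span several contigs: {sorted(contigs)}")
    ordered = sorted(chunk_paths, key=region_start)
    return "".join(f"{chunk_paths[region]}\n" for region in ordered)
```

Sorting numerically on the parsed start position avoids the usual pitfall of lexical sorting, where a region starting at 10000001 would sort before one starting at 9000001.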
Options for the Phase Rare step (step 3):
Option | Required | Description |
---|---|---|
--enable-population-haplotyping | Yes | Set to true to enable the Population Haplotyping tool. |
--enable-phase-rare | Yes | Set to true to enable the Phase Rare step. |
--ph-phase-rare-input | Yes | Provides the path to the preprocessed unphased msVCF generated by the Phase Common step and covering the Phase Rare region. |
--ph-phase-rare-input-region | Yes | Specifies the target region to be phased. String in the format contigname:startposition-endposition. Regions must not overlap or have gaps between them. Note: for a chromosome with mixed ploidy regions and diploid regions, the command should be run with one region at a time (e.g. three runs with the three regions chrX_par1, chrX_nonpar, and chrX_par2, instead of one run with region chrX). |
--ph-phase-rare-map | Yes | Provides the path to the genetic map of the chromosome. Note: for a mixed ploidy chromosome, the genetic map must be divided into PAR and non-PAR regions. |
--ph-phase-rare-config | Yes | Provides the path to the .txt config file. |
--ph-phase-rare-scaffold | Yes | Provides the path to the scaffold of haplotypes in msVCF format generated by the Ligate Common step. |
--ph-phase-rare-scaffold-region | Yes | Specifies the scaffold region to be phased. String in the format contigname:startposition-endposition. The scaffold region must cover the input region and leave a buffer on each side. The buffer length affects both accuracy and speed: a longer buffer is slower but improves accuracy. |
--ph-phase-rare-sample-type | Yes | Provides the path to the sample type file. |
--ph-phase-rare-filter-maf | No | Default 0.001. Sets the Maximum Allele Frequency threshold. All variants with allele frequency below this MAF are phased during the Phase Rare step. This value must be the same as the one provided to --ph-phase-common-filter-maf. If the values differ, not all variants will be phased. |
--output-directory | Yes | Specifies the output directory. |
--output-file-prefix | No | Specifies the prefix for the names of the files generated by the pipeline. |
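
Unlike Phase Common, Phase Rare input regions must tile the chromosome with no overlap and no gaps, while each scaffold region extends the input region by a buffer. The Python sketch below derives both strings for every chunk. The function name, the chunk size, and the buffer size are assumptions, since the text above does not give recommended values.

```python
from typing import List, Tuple


def rare_regions(contig: str, length: int, chunk: int, buffer: int) -> List[Tuple[str, str]]:
    """Return (input_region, scaffold_region) pairs covering 1..length."""
    if length < 1 or chunk < 1 or buffer < 0:
        raise ValueError("need length >= 1, chunk >= 1, buffer >= 0")
    pairs: List[Tuple[str, str]] = []
    for start in range(1, length + 1, chunk):
        end = min(start + chunk - 1, length)
        # Scaffold region = input region widened by the buffer, clipped to the contig.
        scaffold_start = max(1, start - buffer)
        scaffold_end = min(length, end + buffer)
        pairs.append((
            f"{contig}:{start}-{end}",
            f"{contig}:{scaffold_start}-{scaffold_end}",
        ))
    return pairs
```

The first element of each pair is a candidate value for --ph-phase-rare-input-region and the second for --ph-phase-rare-scaffold-region.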
Options for the Concat All step (step 4):
Option | Required | Description |
---|---|---|
--enable-population-haplotyping | Yes | Set to true to enable the Population Haplotyping tool. |
--enable-concat-all | Yes | Set to true to enable the Concat All step. |
--ph-concat-all-input-list | Yes when --ph-concat-all-input-list-sites-only is not provided | Provides a .txt file listing the phased msVCFs for a single chromosome. The msVCF files are the output files of the Phase Rare step, listed in ascending position order. Note: for a mixed ploidy chromosome, each PAR and non-PAR region must be treated as a single chromosome. |
--ph-concat-all-input-list-sites-only | Yes when --ph-concat-all-input-list is not provided | Provides a .txt file listing the VCFs that contain all the haplotyped sites. The VCF files are the output files of the Phase Rare step, listed in ascending position order, with sex chromosomes at the end. |
--output-directory | Yes | Specifies the output directory. |
--output-file-prefix | No | Specifies the prefix for the names of the files generated by the pipeline. |
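
For the sites-only list, the files must be in ascending position order with sex chromosomes at the end. One way to produce that order is sketched below in Python. The function names are hypothetical, and the sketch assumes human-style contig names (an optional `chr` prefix followed by a number for autosomes) and that PAR and non-PAR regions are named with a suffix after an underscore, as in chrX_par1.

```python
import re
from typing import List, Tuple


def contig_rank(contig: str) -> Tuple[int, int, str]:
    """Rank autosomes numerically first, then every other contig by name."""
    base = contig.split("_", 1)[0]  # chrX_par1 -> chrX, so chrX regions group together
    match = re.fullmatch(r"(?:chr)?(\d+)", base)
    if match:
        return (0, int(match.group(1)), "")
    return (1, 0, base)


def order_site_files(entries: List[Tuple[str, int, str]]) -> List[str]:
    """Order (contig, region_start, path) entries for the sites-only list."""
    ordered = sorted(entries, key=lambda entry: (contig_rank(entry[0]), entry[1]))
    return [path for _, _, path in ordered]
```

Grouping chrX_par1, chrX_nonpar, and chrX_par2 under the same base name lets the region start position, rather than the alphabetical order of the suffixes, decide their relative order.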
Options for the quality control module:
Option | Required | Description |
---|---|---|
--enable-population-haplotyping | Yes | Set to true to enable the Population Haplotyping tool. |
--enable-phase-qc | Yes | Set to true to enable the quality control module. |
--ph-phase-qc-validation | Yes | Provides the path to the phased truth set msVCF. Note: the validation msVCF must contain the same samples as the estimation msVCF whose phasing accuracy is to be estimated. |
--ph-phase-qc-estimation | Yes | Provides the path to the phased msVCF to be validated (the output of Concat All). |
--ph-phase-qc-input-region | Yes | Specifies the target region to be phased. String in the format contigname:startposition-endposition (startposition-endposition is optional). Regions must not overlap or have gaps between them. |
--output-directory | Yes | Specifies the output directory. |
--output-file-prefix | No | Specifies the prefix for the names of the files generated by the pipeline. |