Population Haplotyping (Beta)
DRAGEN implements a beta version of the Population Haplotyping tool. The tool estimates haplotypes from a population-scale dataset by packaging the SHAPEIT5 software (2022, Hofmeister RJ, Ribeiro DM, Rubinacci S, Delaneau O). It phases common variants and rare variants in separate steps. The following step-by-step workflow must be run for each chromosome of the studied genome.
Step 1: Phase Common step to estimate the haplotypes of common variants (variants with allele frequency above a given allele frequency threshold) on defined regions.
Step 2: Ligate Common step to ligate the phased common-variant regions from step 1 into a single chromosome.
Step 3: Phase Rare step to add the haplotypes of rare variants (variants with allele frequency below a given allele frequency threshold) on defined regions to the common variant scaffold obtained in step 2.
Step 4: Concat All step to concatenate the haplotype regions obtained in step 3 into a single chromosome.
This tool is most accurate on population-scale datasets with thousands of samples. Running it on multiple nodes is recommended so that regions can be processed in parallel. A common use case of the Population Haplotyping tool is the generation of a custom reference panel to be used for the VCF Imputation pipeline.
The tool supports autosomes and mixed ploidy chromosomes for diploid species only. It does not use FPGA acceleration and can run on a generic, software-only compute node.
Note: the Population Haplotyping tool only supports input msVCF produced with the DRAGEN gVCF Genotyper tool.
Command-Line Examples
The following are examples of the commands required to generate haplotypes for common and rare variants (with the default allele frequency threshold) on a population-scale dataset:
Step 1: Phase Common
Step 2: Ligate Common
Step 3: Phase Rare
Step 4: Concat All
To generate per chromosome haplotypes:
To generate per genome haplotyped sites:
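The four steps can be sketched as argument lists assembled in Python from the options documented in the tables further down. This is a minimal sketch, not the official command lines: every path, prefix, and region is a hypothetical placeholder, the `flag value` spelling of boolean options and the region string without a space after the colon are assumptions, and `dragen_command` is a helper invented for this illustration.

```python
import shlex

def dragen_command(step_flag, options, output_dir, prefix=None):
    """Assemble the argument list for one Population Haplotyping step."""
    argv = ["dragen", "--enable-population-haplotyping", "true", step_flag, "true"]
    for flag, value in options.items():
        argv += [flag, str(value)]
    argv += ["--output-directory", output_dir]
    if prefix is not None:
        argv += ["--output-file-prefix", prefix]
    return argv

# All paths, prefixes, and regions below are hypothetical placeholders.
phase_common = dragen_command(
    "--enable-phase-common",
    {
        "--ph-phase-common-input-list": "chr20_msvcf_list.txt",
        "--ph-phase-common-input-region": "chr20:1-10000000",
        "--ph-phase-common-map": "maps/chr20.map",
        "--ph-phase-common-config": "config.txt",
        "--ph-phase-common-sample-type": "sample_type.txt",
    },
    "out/phase_common",
    prefix="chr20",
)
ligate_common = dragen_command(
    "--enable-ligate-common",
    {"--ph-ligate-common-input-list": "chr20_phase_common_list.txt"},
    "out/ligate_common",
)
phase_rare = dragen_command(
    "--enable-phase-rare",
    {
        "--ph-phase-rare-input": "out/phase_common/chr20.preprocess.vcf.gz",
        "--ph-phase-rare-input-region": "chr20:1-4000000",
        "--ph-phase-rare-map": "maps/chr20.map",
        "--ph-phase-rare-config": "config.txt",
        "--ph-phase-rare-scaffold": "out/ligate_common/dragen.ph_ligate_common.vcf.gz",
        "--ph-phase-rare-scaffold-region": "chr20:1-4500000",
        "--ph-phase-rare-sample-type": "sample_type.txt",
    },
    "out/phase_rare",
)
# Step 4, per chromosome haplotypes.
concat_all = dragen_command(
    "--enable-concat-all",
    {"--ph-concat-all-input-list": "chr20_phase_rare_list.txt"},
    "out/concat_all",
)
# Step 4, per genome haplotyped sites.
concat_sites = dragen_command(
    "--enable-concat-all",
    {"--ph-concat-all-input-list-sites-only": "genome_sites_list.txt"},
    "out/concat_sites",
)

for argv in (phase_common, ligate_common, phase_rare, concat_all, concat_sites):
    print(shlex.join(argv))
```

Each printed line can be pasted into a shell, or the list can be handed to `subprocess.run(argv, check=True)`. Steps 1 and 3 would be repeated once per region of the chromosome.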
Input Files
msVCF Input (step 1 and step 3)
msVCF input list for the Phase Common step (step 1)
For the Phase Common step (step 1), it is recommended to provide msVCF files generated with the DRAGEN gVCF Genotyper tool. This step takes as input a .txt file containing the path to a single msVCF or to a list of msVCF files, one path per line. Each msVCF must comply with the following requirements:
per chromosome msVCF OR positionally sorted msVCF shards spanning a whole chromosome without overlap (shards are defined below)
generated from the same reference build
compressed and indexed
with unphased GT calls
with no duplicates
with header ##contig "ID" and "length" fields for all contigs present in the studied genome
Note: for mixed ploidy chromosomes, each PAR and non-PAR region of the chromosome must be treated as a single chromosome. For example, on human data, the sample input msVCF for chrX must be divided into chrX_par1, chrX_par2, and chrX_nonpar.
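One of the requirements above, the `##contig` header carrying both "ID" and "length", can be checked before launching a run. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it relies on the fact that bgzip-compressed files can be read with the `gzip` module.

```python
import gzip
import re

def read_vcf_header(path):
    """Yield the header lines of a gzip- or bgzip-compressed VCF."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                break
            yield line.rstrip("\n")

def contigs_missing_fields(header_lines):
    """Return the ##contig lines that lack an ID or a length field."""
    incomplete = []
    for line in header_lines:
        if not line.startswith("##contig="):
            continue
        has_id = re.search(r"[<,]ID=([^,>]+)", line) is not None
        has_length = re.search(r"[<,]length=(\d+)", line) is not None
        if not (has_id and has_length):
            incomplete.append(line)
    return incomplete

# Small in-memory example; real use would pass read_vcf_header(path).
example_header = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=248956422>",
    "##contig=<ID=chr2>",
]
print(contigs_missing_fields(example_header))
```

The check only covers the header requirement; compression, indexing, duplicates, and unphased GT calls would need separate checks.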
msVCF input for the Phase Rare step (step 3)
The msVCF input list provided at step 1 is pre-processed to generate a formatted msVCF called <prefix>.preprocess.vcf.gz. This formatted msVCF is generated in the output directory and must be used as input of the Phase Rare step (step 3).
To facilitate parallel processing on distributed compute nodes, and to avoid downloading and uploading a whole-chromosome multisample VCF for each sub-chromosome job, chromosome portions of equal size (shards) can be used as input. The gVCF Genotyper tool, with the proper option, can generate these equal-size shards. Note: streaming from the cloud is not supported. Instead, download the input files in advance and process them locally to maximize I/O efficiency and stability.
Genetic map (step 1 and step 3)
A per chromosome genetic map corresponding to the studied species and to the reference build used for the msVCF input is required. You can use your own genetic map computed from the recombination rate of the species and its reference genome, or use the genetic map for the human hg38 reference genome, available to download from the DRAGEN Software Support Site page. DRAGEN does not generate custom genetic map files.
The genetic map should follow the format:
3 columns: position, chromosome number, distance (cM), in this order and tab separated
Genetic maps for mixed ploidy chromosomes must be separated into their PAR and non-PAR regions (e.g. for human, chromosome X is split into the PAR1 chrX_par1, PAR2 chrX_par2, and non-PAR chrX_nonpar regions)
A genetic map is not needed for a region in which all samples are haploid (e.g. for human, chromosome Y chrY)
The user must ensure that the genetic maps provided are from the same reference build as the reference used to generate the msVCF input.
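The three-column layout described above can be validated with a short parser. This is a sketch under stated assumptions: the source does not say whether the file has a header line, so a first line whose first field is not an integer is skipped, and the check that positions and distances never decrease is an extra sanity check, not a documented requirement. `parse_genetic_map` is a hypothetical helper name.

```python
def parse_genetic_map(lines):
    """Parse tab-separated rows of position, chromosome, distance (cM)."""
    rows = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {number}: expected 3 tab-separated columns, got {len(fields)}"
            )
        if number == 1 and not fields[0].isdigit():
            continue  # assumed header line
        position, chromosome, distance = int(fields[0]), fields[1], float(fields[2])
        # Assumed sanity check: a map is sorted and cumulative.
        if rows and (position < rows[-1][0] or distance < rows[-1][2]):
            raise ValueError(f"line {number}: position or cM decreases")
        rows.append((position, chromosome, distance))
    return rows

# Hypothetical map content for illustration only.
example_map = ["pos\tchr\tcM\n", "1000\t20\t0.0\n", "2000\t20\t0.15\n"]
print(parse_genetic_map(example_map))
```

For a file on disk, the function can be called with an open text handle, for example `parse_genetic_map(open(path))`.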
Config file (step 1 and step 3)
The configuration file is a required text file. It allows proper handling of haploid/diploid chromosomes and verification of concordance between the genetic maps, the msVCF input, and the sample type file. The current configuration supports binary gender (male or female) and ploidy 2 or 1. When a region has different ploidies in male and female samples, the region is considered a mixed ploidy region (e.g. for human, the non-PAR region on chromosome X, chrX_nonpar).
Users can provide their own config file or use the one available to download from the DRAGEN Software Support Site page.
Example of Config file
Instructions to make a custom configuration file:
The config file is a text file with the headers:
##version
##ref_build indicating the reference build used for the study.
Below the headers, the config file contains 4 tab-delimited columns. Each of them must be populated.
Column information | Description |
---|---|
First column: filename | Specifies the genetic map basename, 1 name per line. Mixed ploidy chromosomes must be separated into par and non-par regions. Basenames must match genetic map basenames. |
Second column: region | Specifies the start and end positions of the chromosome or sub-chromosome region with format |
Third column: mixed ploidy subject | Specifies 2 on diploid chromosomes and PAR regions. 1 for non PAR region |
Fourth column: diploid subject | Specifies 2 for all chromosomes |
Note: for a mixed ploidy chromosome, ensure the genetic map is separated into its PAR and non-PAR regions with no overlap. Example: for human data, the prefixes should be chrX_par1, chrX_nonpar, and chrX_par2.
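A custom config file can be sanity-checked against the rules above with a sketch like the following. Several details are assumptions: the source does not show how header values are written, so both `##key=value` and `##key value` are accepted; the region column is kept as an opaque string because its format is not given here; and the sample rows are invented placeholders. `parse_config` is a hypothetical helper name.

```python
import re

def parse_config(lines):
    """Parse '##' headers and 4 tab-delimited columns of a config file."""
    headers, regions = {}, []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            parts = re.split(r"[=\s]+", line[2:].strip(), maxsplit=1)
            headers[parts[0]] = parts[1] if len(parts) > 1 else ""
            continue
        fields = line.split("\t")
        if len(fields) != 4 or any(field == "" for field in fields):
            raise ValueError(f"line {number}: expected 4 populated columns")
        filename, region, mixed, diploid = fields
        if mixed not in ("1", "2"):
            raise ValueError(f"line {number}: third column must be 1 or 2")
        if diploid != "2":
            raise ValueError(f"line {number}: fourth column must be 2")
        regions.append({
            "filename": filename,
            "region": region,  # kept verbatim, format not validated
            "mixed_ploidy_subject": int(mixed),
            "diploid_subject": int(diploid),
        })
    for required in ("version", "ref_build"):
        if required not in headers:
            raise ValueError(f"missing ##{required} header")
    return headers, regions

# Invented placeholder content for illustration only.
example_config = [
    "##version=1",
    "##ref_build=hg38",
    "chr20\tregion-placeholder\t2\t2",
    "chrX_nonpar\tregion-placeholder\t1\t2",
]
headers, regions = parse_config(example_config)
print(headers, [row["filename"] for row in regions])
```

The filenames returned here could then be compared with the genetic map basenames, which the first column is required to match.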
Sample type file (step 1 and step 3)
The sample type file is a required file. The number of samples and name of samples in the input multisample VCF and sample type file should match.
The sample type file is a txt file with the following format:
2 columns, tab or space delimited
First column: list of all sample names present in the input msVCF
Second column: 1 or 2. 1 for subject with mixed ploidy chromosomes, 2 for subject with all diploid chromosomes.
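The requirement that sample names match between the sample type file and the msVCF can be checked ahead of a run. The sketch below uses hypothetical helper names and invented sample names; it relies only on the VCF convention that sample columns follow the nine fixed columns of the `#CHROM` line.

```python
def parse_sample_types(lines):
    """Parse 'name type' rows, where type is 1 (mixed ploidy) or 2 (diploid)."""
    types = {}
    for number, line in enumerate(lines, start=1):
        fields = line.split()  # accepts tabs or spaces
        if not fields:
            continue
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 2 columns")
        name, code = fields
        if code not in ("1", "2"):
            raise ValueError(f"line {number}: second column must be 1 or 2")
        if name in types:
            raise ValueError(f"line {number}: duplicate sample {name}")
        types[name] = int(code)
    return types

def samples_from_chrom_line(line):
    """Return sample names from a '#CHROM' VCF header line."""
    return line.rstrip("\n").split("\t")[9:]

def compare_samples(sample_types, vcf_samples):
    """Return (names only in the VCF, names only in the sample type file)."""
    in_vcf, in_types = set(vcf_samples), set(sample_types)
    return sorted(in_vcf - in_types), sorted(in_types - in_vcf)

# Invented sample names for illustration only.
types = parse_sample_types(["NA001 1\n", "NA002\t2\n"])
chrom_line = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA001\tNA003"
print(compare_samples(types, samples_from_chrom_line(chrom_line)))
```

Two empty lists mean the sample sets agree; anything else points to names that need to be added or removed before running steps 1 and 3.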
Output Files
Phase Common step
The Phase common step (step 1) is run on a defined region, and outputs:
a single scaffold msVCF and related msVCF index with phased common variants for that region. The default name is dragen.ph_phase_common.vcf.gz.
a single formatted msVCF called <prefix>.preprocess.vcf.gz and related index. This formatted msVCF is generated in the output directory and must be used as input of the Phase Rare step (step 3).
Ligate Common step
The Ligate Common step (step 2) ligates the regions phased in step 1 and outputs a single scaffold msVCF and related msVCF index with phased common variants for a single chromosome. The default name is “dragen.ph_ligate_common.vcf.gz”.
Phase Rare step
The Phase Rare step (step 3) is run on a defined region on a chromosome with preprocessed unphased msVCF from step 1 and phased scaffold msVCF from step 2, and outputs:
a single phased msVCF and related msVCF index with phased common and rare variants for that region. The default name is “dragen.ph_rare_common.vcf.gz”.
a single 8-column VCF and related index listing all sites that have been phased for that region. The default name is “dragen.ph_rare_common.sites.vcf.gz”. This output is used at the Concat All step to generate a VCF file with all phased sites across the genome.
Concat All step
The Concat All step is used to generate two types of output:
Phased common and rare variants for a chromosome
The Concat All step (step 4) concatenates the regions phased in step 3 and outputs an msVCF and related index with phased common and rare variants for a single chromosome. The default name is “dragen.ph_concat_all.vcf.gz”.
List of phased sites
This output is useful as input for the ForceGT tool. The Concat All step lists all sites that have been phased in an 8-column VCF format and outputs a VCF and related index with the list of phased sites. This output can be generated either directly from the list of phased site VCFs across the genome from step 3 or, in a second pass, from the per chromosome site lists once they have been generated. The default name is “dragen.ph_concat_all.sites.vcf.gz”.
Command-Line Options for step 1: Phase Common
Option | Required | Description |
---|---|---|
--enable-population-haplotyping | Yes | Set to true to enable population haplotyping tool. |
--enable-phase-common | Yes | Set to true to enable the Phase Common step. |
--ph-phase-common-input-list | Yes | Provides a .txt file listing the sample input pertaining to one chromosome, with the path to a single msVCF or a list of msVCF files, one path per line. Note: for a mixed ploidy chromosome, each PAR and non-PAR region must be treated as a single chromosome. |
--ph-phase-common-input-region | Yes | Specifies the target region to be phased, as a string in the format contigname: startposition-endposition. Adjacent regions must overlap so that the downstream Ligate Common step can join them. Example of input region length for human data: 10 Mbp. Note: for a chromosome with both mixed ploidy and diploid regions, run the command on one region at a time (e.g. three runs with three regions, |
--ph-phase-common-map | Yes | Provides path to the chromosome genetic map. Note: in the case of mixed ploidy chromosome, the genetic map name must be divided into PAR and non-PAR regions. |
--ph-phase-common-config | Yes | Provides path to the txt config file. |
--ph-phase-common-reference | No | Provides the path to a reference panel of haplotypes in msVCF format. Useful for iterative haplotyping to accelerate the process. |
--ph-phase-common-scaffold | No | Provides the path to a scaffold of haplotypes in msVCF format. Useful for iterative haplotyping to accelerate the process. |
--ph-phase-common-sample-type | Yes | Provides the path to the Sample type file. |
--ph-phase-common-filter-maf | No | Default 0.001. Sets the minimum allele frequency threshold. All variants with allele frequency equal to or above this threshold are phased during the Phase Common step. |
--ph-phase-common-max-miss-gt-rate | No | Default 0.1. Set the threshold for variants to be skipped if the rate of missing GT is higher than this value. |
--output-directory | Yes | Specifies the output directory. |
--output-file-prefix | No | Outputs filename with the defined prefix for the file generated by the pipeline. |
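Because the Phase Common regions must overlap for ligation, the region strings for one chromosome can be generated rather than typed by hand. The following is a sketch: the 10 Mbp window follows the human-data example in the table, while the 1 Mbp overlap, the 25 Mbp contig length, and the region string without a space after the colon are assumptions chosen for illustration. `overlapping_regions` is a hypothetical helper name.

```python
def overlapping_regions(contig, start, end, window, overlap):
    """Return 'contig:start-end' strings covering [start, end] with overlap."""
    if start > end:
        raise ValueError("start must not exceed end")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    regions = []
    left = start
    while True:
        right = min(left + window - 1, end)
        regions.append(f"{contig}:{left}-{right}")
        if right == end:
            break
        # Step back by the overlap so consecutive windows share variants.
        left = right - overlap + 1
    return regions

for region in overlapping_regions("chr20", 1, 25_000_000, 10_000_000, 1_000_000):
    print(region)
# chr20:1-10000000
# chr20:9000001-19000000
# chr20:18000001-25000000
```

Each string would be passed to `--ph-phase-common-input-region` in a separate run, and the resulting outputs listed in ascending position order for the Ligate Common step.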
Command-Line Options for step 2: Ligate Common
Option | Required | Description |
---|---|---|
--enable-population-haplotyping | Yes | Set to true to enable population haplotyping tool. |
--enable-ligate-common | Yes | Set to true to enable the Ligate Common step. |
--ph-ligate-common-input-list | Yes | Provides a .txt file with the list of phased msVCF files pertaining to a single chromosome. The msVCF files provided are the output files of the Phase Common step, in ascending position order. Note: for a mixed ploidy chromosome, each PAR and non-PAR region must be treated as a single chromosome. |
--output-directory | Yes | Specifies the output directory. |
--output-file-prefix | No | Outputs filename with the defined prefix for the file generated by the pipeline. |
Command-Line Options for step 3: Phase Rare
Option | Required | Description |
---|---|---|
--enable-population-haplotyping | Yes | Set to true to enable population haplotyping tool. |
--enable-phase-rare | Yes | Set to true to enable the Phase Rare step. |
--ph-phase-rare-input | Yes | Provides the path to the preprocessed unphased msVCF generated from Phase Common step covering the phase rare region. |
--ph-phase-rare-input-region | Yes | Specifies the target region to be phased, as a string in the format contigname: startposition-endposition. Regions must not overlap or have gaps between them. Note: for a chromosome with both mixed ploidy and diploid regions, run the command on one region at a time (e.g. three runs with three regions, |
--ph-phase-rare-map | Yes | Provides the path to the genetic map of the chromosome. Note: in the case of mixed ploidy chromosome, the genetic map name must be divided into PAR and non-PAR regions. |
--ph-phase-rare-config | Yes | Provides the path to the txt config file. |
--ph-phase-rare-scaffold | Yes | Provides the path to the scaffold of haplotypes in msVCF format generated from Ligate Common step. |
--ph-phase-rare-scaffold-region | Yes | Specifies the scaffold region to be phased. String in the format contigname: startposition-endposition. This scaffold region needs to cover the Input region and to allow buffer between regions. The buffer length impacts the accuracy and speed of the process: longer length is slower but improves accuracy. |
--ph-phase-rare-sample-type | Yes | Provides the path to the Sample type file. |
--ph-phase-rare-filter-maf | No | Default 0.001. Sets the maximum allele frequency threshold. All variants with allele frequency below this threshold are phased during the Phase Rare step. This value must be the same as the one provided to --ph-phase-common-filter-maf. If the values differ, not all variants will be phased. |
--output-directory | Yes | Specifies the output directory. |
--output-file-prefix | No | Outputs filename with the defined prefix for the file generated by the pipeline. |
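The Phase Rare step pairs each input region (no overlaps, no gaps) with a scaffold region that covers it plus a buffer on each side. A minimal sketch of generating such pairs follows; the 4 Mbp chunk size, the 500 kbp buffer, and the 10 Mbp contig length are assumptions chosen for illustration, not recommended values, and `rare_regions` is a hypothetical helper name.

```python
def rare_regions(contig, start, end, chunk, buffer):
    """Return (input_region, scaffold_region) pairs tiling [start, end]."""
    if chunk <= 0 or buffer < 0:
        raise ValueError("chunk must be positive and buffer non-negative")
    pairs = []
    left = start
    while left <= end:
        right = min(left + chunk - 1, end)
        # The scaffold region extends the input region by the buffer,
        # clipped to the bounds of the contig.
        scaffold_left = max(start, left - buffer)
        scaffold_right = min(end, right + buffer)
        pairs.append((
            f"{contig}:{left}-{right}",
            f"{contig}:{scaffold_left}-{scaffold_right}",
        ))
        left = right + 1  # next chunk starts right after: no gap, no overlap
    return pairs

for input_region, scaffold_region in rare_regions("chr20", 1, 10_000_000, 4_000_000, 500_000):
    print(input_region, scaffold_region)
# chr20:1-4000000 chr20:1-4500000
# chr20:4000001-8000000 chr20:3500001-8500000
# chr20:8000001-10000000 chr20:7500001-10000000
```

The first element of each pair would go to `--ph-phase-rare-input-region` and the second to `--ph-phase-rare-scaffold-region`. A larger buffer trades speed for accuracy, as noted in the table.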
Command-Line Options for step 4: Concat All
Option | Required | Description |
---|---|---|
--enable-population-haplotyping | Yes | Set to true to enable population haplotyping tool. |
--enable-concat-all | Yes | Set to true to enable the Concat All step. |
--ph-concat-all-input-list | Yes when --ph-concat-all-input-list-sites-only is not provided | Provides a .txt file with the list of phased msVCF files pertaining to a single chromosome. The msVCF files provided are the output files of the Phase Rare step, in ascending position order. Note: for a mixed ploidy chromosome, each PAR and non-PAR region must be treated as a single chromosome. |
--ph-concat-all-input-list-sites-only | Yes when --ph-concat-all-input-list is not provided | Provides a .txt file with list of VCF containing all the haplotyped sites. The VCF files provided are the output files of Phase Rare step, in ascending position order, sex chromosomes at the end. |
--output-directory | Yes | Specifies the output directory. |
--output-file-prefix | No | Outputs filename with the defined prefix for the file generated by the pipeline. |
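Both Concat All input lists must be in ascending position order, and the sites-only list places sex chromosomes at the end. One way to build such a list is sketched below. It assumes the caller keeps track of which region each Phase Rare output file covers and supplies the desired contig order; the file names and regions are invented, and `order_for_concat` is a hypothetical helper name.

```python
def order_for_concat(entries, contig_order):
    """Sort (region, path) pairs by contig rank, then by start position."""
    rank = {contig: index for index, contig in enumerate(contig_order)}

    def sort_key(entry):
        region, _path = entry
        contig, _, span = region.rpartition(":")
        start = int(span.split("-")[0])
        return rank[contig], start

    return [path for _region, path in sorted(entries, key=sort_key)]

# Invented regions and file names for illustration only.
entries = [
    ("chrX_par1:1-100", "x_par1.sites.vcf.gz"),
    ("chr2:1-500", "chr2_a.sites.vcf.gz"),
    ("chr1:501-1000", "chr1_b.sites.vcf.gz"),
    ("chr1:1-500", "chr1_a.sites.vcf.gz"),
]
ordered = order_for_concat(entries, ["chr1", "chr2", "chrX_par1"])
print("\n".join(ordered))
```

Writing the printed lines to a .txt file, one path per line, gives a list in the order the option descriptions above ask for.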
Population Haplotyping Accuracy
An additional module of the Population Haplotyping tool checks for the quality of the haplotypes produced based on a phased truth set provided as input.
Command-Line Example
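A quality control run can be sketched as an argument list built from the options listed in the table that follows. This is not the official command line: the paths and region are hypothetical placeholders, the `flag value` spelling of boolean options is an assumption, and `phase_qc_command` is a helper invented for this illustration.

```python
import shlex

def phase_qc_command(validation, estimation, region, output_dir, prefix=None):
    """Assemble the argument list for the phasing quality control module."""
    argv = [
        "dragen",
        "--enable-population-haplotyping", "true",
        "--enable-phase-qc", "true",
        "--ph-phase-qc-validation", validation,
        "--ph-phase-qc-estimation", estimation,
        "--ph-phase-qc-input-region", region,
        "--output-directory", output_dir,
    ]
    if prefix is not None:
        argv += ["--output-file-prefix", prefix]
    return argv

# Hypothetical truth set, Concat All output, and region.
qc = phase_qc_command(
    validation="truth/chr20.phased.vcf.gz",
    estimation="out/concat_all/dragen.ph_concat_all.vcf.gz",
    region="chr20",
    output_dir="out/phase_qc",
)
print(shlex.join(qc))
```

The validation msVCF must contain the same samples as the estimation msVCF, so a sample comparison before the run can save a failed job.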
Command-Line Options
Option | Required | Description |
---|---|---|
--enable-population-haplotyping | Yes | Set to true to enable population haplotyping tool. |
--enable-phase-qc | Yes | Set to true to enable the quality control module. |
--ph-phase-qc-validation | Yes | Provides the path to the phased truth set msVCF. Note: the validation msVCF must have the same samples as in the estimation msVCF for which the phasing accuracy is to be estimated. |
--ph-phase-qc-estimation | Yes | Provides the path to the phased msVCF, output of Concat All to be validated. |
--ph-phase-qc-input-region | Yes | Specifies the target region to be phased. String in the format contigname: startposition-endposition (startposition-endposition is optional). Regions must not overlap or have gaps between them. |
--output-directory | Yes | Specifies the output directory. |
--output-file-prefix | No | Outputs filename with the defined prefix for the file generated by the pipeline. |