VCF Imputation
The VCF imputation tool can infer multi-allelic SNP and INDEL variants from low-coverage sequencing samples by packaging the GLIMPSE software (2020, Olivier Delaneau & Simone Rubinacci). The DRAGEN implementation of the GLIMPSE software allows for scalability of variant imputation:
with an end-to-end pipeline in which the three phases of the GLIMPSE software (Chunk, Phase, and Ligate) are executed in a single command, on one chromosome or on multiple chromosomes
with acceleration through Advanced Vector Extensions (AVX)
The DRAGEN VCF imputation tool infers variants on autosomes and chromosome X of haploid and diploid species.
Upon completion, the tool generates imputed variants based on a reference panel, a genetic map, and input samples provided. The DRAGEN secondary analysis software supports VCF imputation on human data and provides a reference panel and a genetic map for the hg38 reference build accessible on the DRAGEN Software Support site page.
For data other than human data on the hg38 reference build, you must provide your own reference panel and genetic map. A custom reference panel can be built with the DRAGEN Population Haplotyping tool.
Notes:
The output is in biallelic format, one line per ALT.
The VCF imputation tool only supports input sample data generated with the DRAGEN secondary analysis software.
The following is an example command to impute a VCF on a single chromosome:
The following is an example command to impute a VCF on chromosome X:
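As a minimal sketch of how such commands can be put together, the Python helper below assembles an argument list from the options documented in the Command Line Options table. The executable name dragen, the option=value syntax, the value true for --enable-imputation, the region name chrX, and all paths and file names are assumptions for illustration, not a reproduction of the official example commands.

```python
import shlex


def build_imputation_command(ref_panel_dir, ref_panel_prefix, genome_map_dir,
                             region, sample_vcf, output_dir, output_prefix,
                             sample_type=None):
    """Assemble a DRAGEN VCF imputation command as a list of arguments."""
    args = [
        "dragen",                    # assumed executable name
        "--enable-imputation=true",  # assumed value and option=value syntax
        f"--imputation-ref-panel-dir={ref_panel_dir}",
        f"--imputation-ref-panel-prefix={ref_panel_prefix}",
        f"--imputation-genome-map-dir={genome_map_dir}",
        f"--imputation-chunk-input-region={region}",
        f"--imputation-phase-input={sample_vcf}",
        f"--output-directory={output_dir}",
        f"--output-file-prefix={output_prefix}",
    ]
    if sample_type is not None:
        # Needed for non-PAR regions of mixed ploidy chromosomes (single VCF).
        args.append(f"--imputation-phase-sample-type={sample_type}")
    return args


# Hypothetical paths and names, for illustration only.
common = dict(
    ref_panel_dir="/data/ref_panel",
    ref_panel_prefix="IRPv2.0",
    genome_map_dir="/data/genetic_map",
    sample_vcf="/data/sample.vcf.gz",
    output_dir="/data/out",
    output_prefix="sample",
)

single_chrom_cmd = build_imputation_command(region="chr20", **common)
chrx_cmd = build_imputation_command(region="chrX", sample_type="M", **common)

print(shlex.join(single_chrom_cmd))
print(shlex.join(chrx_cmd))
```

The only difference between the two calls is the region and, for chromosome X, the sample typename, which must match a typename defined in the JSON config file.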
Inputs
Sample Input
The imputation tool infers multi-allelic SNP and INDEL variants from low-coverage sequencing samples that are provided by the user. To maximize the accuracy of the imputed variant per sample, the tool leverages the information from all provided samples.
The sample(s) to be imputed must have the following format:
VCF, multi-sample VCF, BCF, or multi-sample BCF (compressed or uncompressed). gVCF is not supported.
Must contain GL (genotype likelihoods) or PL (Phred-scaled genotype likelihoods) information.
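The GL/PL requirement can be checked before launching a run. The sketch below scans the FORMAT column of text VCF records; it assumes an uncompressed, tab-delimited VCF already read into lines (BCF or compressed input would need a dedicated reader), and the function name and sample records are hypothetical.

```python
def has_genotype_likelihoods(vcf_lines):
    """Return True if every data record lists GL or PL in its FORMAT column."""
    seen_record = False
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 9:
            return False  # record has no FORMAT column at all
        format_keys = fields[8].split(":")
        if "GL" not in format_keys and "PL" not in format_keys:
            return False
        seen_record = True
    return seen_record


# Hypothetical records, for illustration only.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1"
with_pl = [header, "chr20\t100\t.\tA\tG\t.\t.\t.\tGT:PL\t0/1:10,0,20"]
without_pl = [header, "chr20\t100\t.\tA\tG\t.\t.\t.\tGT:DP\t0/1:4"]

print(has_genotype_likelihoods(with_pl))
print(has_genotype_likelihoods(without_pl))
```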
To achieve more accurate results, use an input VCF generated with the force genotyping capability of the DRAGEN secondary analysis software so that it contains all positions present in the reference panel. A sites file for the force genotyping run of the DRAGEN variant caller, listing all sites present in the IRP reference panel (built from human reference genome hg38), is provided in the Imputation files accessible on the DRAGEN Software Support Site page. When running the DRAGEN variant caller with force genotyping for imputation, it is recommended to disable machine learning recalibration (--vc-ml-enable-recalibration=false).
Recommendation for imputing INDELs
To impute INDELs and get the best accuracy on INDELs, it is recommended:
to force genotype the input VCF with a SNPs-only sites.vcf file using DRAGEN argument --vc-forcegt-vcf. This SNPs-only sites.vcf file contains only the SNPs sites present in the reference panel. A SNPs-only VCF file is also available in the IRP reference panel (built from human reference genome hg38) in the Imputation files accessible in the DRAGEN Software Support Site page.
and to set the option --imputation-phase-impute-reference-only-variants to true.
Reference Panel
A per-chromosome reference panel in BCF format that lists all the imputation positions in the targeted regions, along with the corresponding haplotypes, must be provided. A reference panel (with prefix IRPv{x}) is available in the Imputation files accessible on the DRAGEN Software Support Site page. IRPv2.0 is a multi-allelic SNP and INDEL reference panel containing 3202 samples from the 1000 Genomes Project, variant called using DRAGEN 4.0 against hg38.
Notes: IRPv1.x does not support chrX. IRPv2.x supports chrX. chrY and chrM are not supported.
A custom reference panel can be built with the DRAGEN Population Haplotyping tool. When providing a custom reference panel, ensure that each chromosome of mixed ploidy is divided into its PAR and non-PAR regions, and that each basename matches the subregion names defined in the JSON config file. The format should be <PREFIX>.basename. Examples: IRPv2.0.chrX_par1, IRPv2.0.chrX_par2, and IRPv2.0.chrX_nonpar.
Genetic Map
A genetic map per chromosome is required to obtain the imputed variants. You can use your own genetic map computed from the recombination rate of the species and its reference genome, or use a prebuilt genetic map corresponding to the human hg38 reference genome. A prebuilt map is available as part of the Imputation files, accessible at the DRAGEN Software Support Site page. DRAGEN does not generate custom genetic map files. The genetic map should follow the format:
<chromosome name>.gmap.gz
3 columns: position, chromosome number, distance (cM)
compliant with the reference genome used to generate the sample input
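A custom genetic map can be sanity-checked against this layout before use. The sketch below reads a gzip-compressed map with three whitespace-separated columns (position, chromosome, cM). The optional header line, the whitespace delimiter, the function name, and the sample values are assumptions for illustration.

```python
import gzip
import io


def read_genetic_map(handle):
    """Parse (position, chromosome, cM) rows from an open text handle."""
    rows = []
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 3:
            raise ValueError(f"expected 3 columns, got {len(parts)}: {line!r}")
        if not parts[0].isdigit():
            continue  # assume a non-numeric first field marks a header line
        rows.append((int(parts[0]), parts[1], float(parts[2])))
    return rows


# Build a tiny in-memory stand-in for a file such as chr20.gmap.gz.
buffer = io.BytesIO()
with gzip.open(buffer, "wt") as out:
    out.write("pos\tchr\tcM\n100\t20\t0.0\n250\t20\t0.0015\n900\t20\t0.004\n")
buffer.seek(0)

with gzip.open(buffer, "rt") as handle:
    genetic_map = read_genetic_map(handle)

# Positions and genetic distances should both be non-decreasing.
positions_sorted = all(a[0] <= b[0] for a, b in zip(genetic_map, genetic_map[1:]))
distances_sorted = all(a[2] <= b[2] for a, b in zip(genetic_map, genetic_map[1:]))
print(genetic_map, positions_sorted, distances_sorted)
```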
JSON config file
This config file enables proper handling of haploid and diploid chromosomes. It must be located in the same directory as the input reference panel, share its PREFIX, and follow the naming convention {$DIR}/{$PREFIX}.config.json. A config file for the IRP reference panel is available on the DRAGEN Software Support Site page. When the config file is not present in the directory, the tool assumes that all chromosomes being imputed are diploid.
In the IRP reference panel folder available on the DRAGEN Software Support Site page, the JSON config file corresponds to human data. You can edit this file if imputation is done on another species.
Example of JSON config file
For imputing VCF on human data with typename “M” for Male and “F” for Female (“M” and “F” are the values used in the sample type file):
Instructions to make a custom JSON configuration file:
The JSON config file is made of two fields, as defined in the table below.
Fields | Required/Optional | Purpose | Type |
---|---|---|---|
regions | Required only when a chromosome of mixed ploidy is present in the Reference Panel folder | Define contig name and subregion names of the mixed ploidy chromosome | Dictionary in the form: "<contigname_of_mixed_ploidy>": ["<contigname_of_mixed_ploidy>_par1", "<contigname_of_mixed_ploidy>_par2", "<contigname_of_mixed_ploidy>_nonpar1", "<contigname_of_mixed_ploidy>_nonpar2", ...] |
ploidy | | Define: | Dictionary in the form: "<contigname_of_mixed_ploidy>_nonpar": {"typename1": 1, "typename2": 2}, "default": {"typename1": 2, "typename2": 2}. The typename values are those used in the Sample Type file input |
Note: ensure the subregion names match the genetic map names. For example, if "chrX_nonpar" is defined in the "regions" field of the JSON config file, then the genetic map corresponding to the chromosome X non-PAR region in the Reference Panel folder must be named chrX_nonpar.gmap.gz.
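The two fields can be sketched as follows for human chromosome X with typenames "M" and "F". This is an illustration assembled from the field descriptions in the table above, not a copy of the config file shipped with the IRP reference panel. The subregion order and the helper name are assumptions.

```python
import json

# Illustrative config: chrX split into PAR and non-PAR subregions,
# haploid in the non-PAR region for "M", diploid everywhere else.
config = {
    "regions": {
        "chrX": ["chrX_par1", "chrX_nonpar", "chrX_par2"],
    },
    "ploidy": {
        "chrX_nonpar": {"M": 1, "F": 2},
        "default": {"M": 2, "F": 2},
    },
}


def genetic_map_names(cfg):
    """List the genetic map file names implied by the subregion names."""
    return [
        f"{subregion}.gmap.gz"
        for subregions in cfg["regions"].values()
        for subregion in subregions
    ]


# Every non-default ploidy entry should refer to a declared subregion.
declared = {s for subs in config["regions"].values() for s in subs}
undeclared = [k for k in config["ploidy"] if k != "default" and k not in declared]

config_text = json.dumps(config, indent=2)
print(config_text)
print(genetic_map_names(config), undeclared)
```

Writing config_text to {$DIR}/{$PREFIX}.config.json would follow the naming convention described earlier in this section.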
Sample type file
The sample type file is required when imputation is performed on non-PAR regions of mixed ploidy chromosomes, to define the typename of each sample.
The sample type file is a text file with the following format:
2 columns, tab- or space-delimited
First column: list of all sample names present in the input sample
Second column: typename value for each sample. This typename value should match the typenames used in the JSON config file.
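The format above can be validated with a few lines of Python. The sketch below parses a two-column sample type file and rejects typenames that are not defined in the JSON config file. The sample names, the typenames "M" and "F", and the function name are hypothetical.

```python
def parse_sample_types(text, allowed_typenames):
    """Map each sample name to its typename, validating both columns."""
    sample_types = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        parts = line.split()  # accepts tabs or spaces as delimiter
        if not parts:
            continue  # skip blank lines
        if len(parts) != 2:
            raise ValueError(f"line {lineno}: expected 2 columns, got {len(parts)}")
        sample, typename = parts
        if typename not in allowed_typenames:
            raise ValueError(f"line {lineno}: unknown typename {typename!r}")
        sample_types[sample] = typename
    return sample_types


# Hypothetical file content, for illustration only.
sample_type_text = "sample1\tM\nsample2 F\nsample3\tF\n"
sample_types = parse_sample_types(sample_type_text, allowed_typenames={"M", "F"})
print(sample_types)
```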
Outputs
The VCF imputation tool generates several outputs:
The imputed variant file with concatenated imputed variants: one single VCF or msVCF file for all specified regions/chromosomes, named <prefix>.impute.vcf.gz
The intermediate files:
chunk regions to be passed along to the internal Phase step, named <prefix>.impute.chunk.out.txt
imputed variants per identified chunk: VCF or msVCF depending on the input sample format, named <prefix>_chr_start-end.impute.phase.vcf.gz
text file with the paths to all generated <prefix>_chr_start-end.impute.phase.vcf.gz files, named <prefix>.impute.phase.out.txt
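The naming scheme above can be expressed as a small helper, for example to check that a run produced every expected file. The sketch assumes the chr, start, and end parts of the per-chunk name are joined exactly as the pattern reads. The function name, prefix, and chunk coordinates are hypothetical.

```python
def expected_outputs(prefix, chunks):
    """Return the output file names implied by the documented patterns."""
    return {
        "final": f"{prefix}.impute.vcf.gz",
        "chunk_list": f"{prefix}.impute.chunk.out.txt",
        "phase_list": f"{prefix}.impute.phase.out.txt",
        "phase_chunks": [
            f"{prefix}_{chrom}_{start}-{end}.impute.phase.vcf.gz"
            for chrom, start, end in chunks
        ],
    }


# Hypothetical prefix and chunk coordinates, for illustration only.
outputs = expected_outputs("sample", [("chr20", 1, 2200000)])
print(outputs)
```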
Note: while the imputation tool can impute multi-allelic positions, the output is in biallelic format, one line per ALT. To collapse all ALT alleles of a position onto one line afterward, use bcftools, for example: bcftools norm -m +snps
Command Line Options
Option | Type | Required | Description |
---|---|---|---|
--enable-imputation | NA | Yes | Set to |
--imputation-ref-panel-dir | STRING | Yes | Directory containing per-chromosome reference panel VCF and optionally the JSON config file |
--imputation-ref-panel-prefix | STRING | Yes | Prefix for reference panel files and the JSON config file |
--imputation-genome-map-dir | STRING | Yes | Directory containing per-chromosome genome map files |
--imputation-chunk-input-region | STRING | Yes for single region | Target region, usually a full chromosome (e.g. chr20:1000000-2000000 or chr20). |
--imputation-chunk-input-region-list | STRING | Yes for list of regions | Text file listing chromosomes or regions to be processed, one chromosome/region per line. |
--imputation-phase-input | STRING | Yes for single VCF file | Sample input file with VCF/BCF format. Single VCF or multi-sample VCF |
--imputation-phase-input-list | STRING | Yes for multiple VCF files | Text file listing sample input in VCF/BCF format, one input file per line |
--imputation-phase-sample-type | STRING | Yes when imputing on a non PAR region of mixed ploidy chromosome AND a single VCF file | Define typename of the VCF file imputed. The typename must match one of the two typenames defined in the JSON config file |
--imputation-phase-sample-type-list | STRING | Yes when imputing on a non PAR region of mixed ploidy chromosome AND a list of VCF files | Path to the Sample Type file |
--output-directory | STRING | Yes | Output directory |
--output-file-prefix | STRING | Yes | Output files prefix |
--imputation-phase-threads | INT | No | Specify the number of threads to use. Default is the number of system threads |
--imputation-phase-filter-input-sample-in-ref | NA | No | Default is |
--imputation-phase-impute-reference-only-variants | STRING | No | Default is |
--imputation-phase-input-independently | STRING | No | Default is |
Note: with this end-to-end implementation of the GLIMPSE software, the parameters window_size and buffer_size are respectively set to 2 Mb and 200 kb.
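To make the two sizes concrete, the sketch below tiles a region into fixed 2 Mb windows and pads each with a 200 kb buffer on both sides. This is a deliberate simplification: GLIMPSE's own chunking may place boundaries differently, and the function name, the 1-based inclusive coordinates, and 1 Mb = 1,000,000 bp are assumptions made for illustration.

```python
WINDOW_SIZE = 2_000_000  # 2 Mb, assuming 1 Mb = 1,000,000 bp
BUFFER_SIZE = 200_000    # 200 kb


def tile_region(chrom, start, end, window=WINDOW_SIZE, buffer=BUFFER_SIZE):
    """Split [start, end] into fixed windows, each padded by a buffer."""
    chunks = []
    pos = start
    while pos <= end:
        core_end = min(pos + window - 1, end)
        # The buffer extends the window on both sides, clipped to the region.
        buffered_start = max(start, pos - buffer)
        buffered_end = min(end, core_end + buffer)
        chunks.append({
            "core": f"{chrom}:{pos}-{core_end}",
            "buffered": f"{chrom}:{buffered_start}-{buffered_end}",
        })
        pos = core_end + 1
    return chunks


# Hypothetical 5 Mb region, for illustration only.
chunks = tile_region("chr20", 1, 5_000_000)
for chunk in chunks:
    print(chunk["core"], chunk["buffered"])
```

Adjacent buffered regions overlap, which is what allows per-chunk results to be joined across window boundaries in the Ligate phase.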