
VCF Imputation

The VCF imputation tool infers multi-allelic SNP and INDEL variants from low-coverage sequencing samples. It packages the GLIMPSE software (2020, Olivier Delaneau & Simone Rubinacci). The DRAGEN implementation of the GLIMPSE software makes variant imputation scalable:

with an end-to-end pipeline in which the three phases of the GLIMPSE software (Chunk, Phase, and Ligate) are executed in a single command, on one chromosome or on multiple chromosomes
with acceleration supported by Advanced Vector Extensions (AVX)

The DRAGEN VCF imputation tool infers variants on autosomes and chromosome X of haploid and diploid species.

Upon completion, the tool generates imputed variants based on a reference panel, a genetic map, and input samples provided. The DRAGEN secondary analysis software supports VCF imputation on human data and provides a reference panel and a genetic map for the hg38 reference build accessible on the DRAGEN Software Support site page.

For data other than human data (reference build hg38), you must provide your own reference panel and genetic map. A custom reference panel can be built with the DRAGEN Population Haplotyping tool.

Notes:

The output is in biallelic format, one line per ALT.
The VCF imputation tool only supports input sample data generated with the DRAGEN secondary analysis software.

The following is an example command to impute a VCF on a single chromosome:

dragen \
  --enable-imputation true \
  --imputation-ref-panel-dir <REF_PANEL_DIR> \
  --imputation-ref-panel-prefix <IRPv2.0> \
  --imputation-chunk-input-region <chr22> \
  --imputation-phase-input-list <VCF_to_be_imputed.txt> \
  --imputation-genome-map-dir <MAP_DIR> \
  --output-directory <OUT_DIR> \
  --output-file-prefix <OUT_PREFIX>

The following is an example command to impute a VCF on chromosome X:

dragen \
  --enable-imputation true \
  --imputation-ref-panel-dir <REF_PANEL_DIR> \
  --imputation-ref-panel-prefix <IRPv2.0> \
  --imputation-chunk-input-region <chrX> \
  --imputation-phase-input-list <VCF_to_be_imputed.txt> \
  --imputation-genome-map-dir <MAP_DIR> \
  --imputation-phase-sample-type-list <path to sample type file> \
  --output-directory <OUT_DIR> \
  --output-file-prefix <OUT_PREFIX>
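When the same command is launched for several chromosomes, it can help to assemble the argument list programmatically. The following is a minimal Python sketch that uses only the options shown in the two examples above. The function name and all paths are hypothetical, and the sketch only prints the command rather than running it.

```python
import shlex


def build_imputation_command(ref_panel_dir, ref_panel_prefix, region,
                             input_list, map_dir, out_dir, out_prefix,
                             sample_type_list=None):
    """Assemble the dragen imputation command as an argument list."""
    cmd = [
        "dragen",
        "--enable-imputation", "true",
        "--imputation-ref-panel-dir", ref_panel_dir,
        "--imputation-ref-panel-prefix", ref_panel_prefix,
        "--imputation-chunk-input-region", region,
        "--imputation-phase-input-list", input_list,
        "--imputation-genome-map-dir", map_dir,
    ]
    # The sample type file is only passed when it is supplied (chrX example).
    if sample_type_list is not None:
        cmd += ["--imputation-phase-sample-type-list", sample_type_list]
    cmd += ["--output-directory", out_dir, "--output-file-prefix", out_prefix]
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_imputation_command(
    "/data/ref_panel", "IRPv2.0", "chrX", "/data/VCF_to_be_imputed.txt",
    "/data/maps", "/data/out", "cohort",
    sample_type_list="/data/sample_types.txt",
)
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would launch the run on a system where dragen is installed.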

Inputs

Sample Input

The imputation tool infers multi-allelic SNP and INDEL variants from low-coverage sequencing samples that are provided by the user. To maximize the accuracy of the imputed variant per sample, the tool leverages the information from all provided samples.

The sample(s) to be imputed must have the following format:

VCF, multi-sample VCF, BCF, or multi-sample BCF (zipped or unzipped). gVCF is not supported.
Must contain GL (genotype likelihoods) or PL (Phred-scaled genotype likelihoods) information
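The GL/PL requirement above can be checked before launching a run. The following is a minimal Python sketch, assuming a text VCF (plain or gzip-compressed); it does not handle BCF input, and it only inspects the FORMAT column of the first data record. The function names and the demonstration record are hypothetical.

```python
import gzip
import os
import tempfile


def format_keys(vcf_path):
    """Return the FORMAT keys of the first data record in a text VCF."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            # Column 9 (index 8) holds the colon-separated FORMAT keys.
            return set(fields[8].split(":")) if len(fields) > 8 else set()
    return set()


def has_likelihoods(vcf_path):
    """True when the first record carries GL or PL."""
    return bool(format_keys(vcf_path) & {"GL", "PL"})


# Tiny hypothetical VCF written to a temporary file for demonstration.
demo = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n"
    "chr22\t16050075\t.\tA\tG\t.\t.\t.\tGT:PL\t0/1:10,0,30\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.vcf")
    with open(path, "w") as out:
        out.write(demo)
    demo_ok = has_likelihoods(path)
print(demo_ok)
```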

To achieve more accurate results, it is recommended to use input VCFs generated with the force genotyping capability of the DRAGEN secondary analysis software, so that they contain all the positions that are present in the reference panel. A sites file to be used as input for the force genotyping run of the DRAGEN variant caller, containing all sites present in the IRP reference panel (built from human reference genome hg38), is provided in the Imputation files accessible on the DRAGEN Software Support Site page. When running the DRAGEN variant caller with force genotyping for imputation, it is recommended to disable the machine learning tool (--vc-ml-enable-recalibration=false).

Recommendation for imputing INDELs

To impute INDELs with the best accuracy, it is recommended to:

force genotype the input VCF with a SNPs-only sites.vcf file using the DRAGEN argument --vc-forcegt-vcf. This SNPs-only sites.vcf file contains only the SNP sites present in the reference panel. A SNPs-only VCF file is also available in the IRP reference panel (built from human reference genome hg38) in the Imputation files accessible on the DRAGEN Software Support Site page.
set the option --imputation-phase-impute-reference-only-variants to true.

Reference Panel

A per-chromosome reference panel in BCF format must be provided. It lists all the imputation positions in the targeted regions along with the corresponding haplotypes. A reference panel (with prefix IRPv{x}) is available in the Imputation files accessible on the DRAGEN Software Support Site page. IRPv2.0 is a multi-allelic SNP and INDEL reference panel containing 3202 samples from the 1000 Genomes Project, which have been variant called using DRAGEN 4.0 against hg38.

Notes: IRPv1.x does not support chrX. IRPv2.x supports chrX. chrY and chrM are not supported.

A custom reference panel can be built with the DRAGEN Population Haplotyping tool. When providing a custom reference panel, ensure that each mixed-ploidy chromosome is divided into the PAR and non-PAR regions that exist for it, and that each basename matches the subregion names defined in the JSON config file. The format should be <PREFIX>.basename. Examples: IRPv2.0.chrX_par1, IRPv2.0.chrX_par2, and IRPv2.0.chrX_nonpar.

Genetic Map

A genetic map per chromosome is required to obtain the imputed variants. You can use your own genetic map computed from the recombination rate of the species and its reference genome, or use a prebuilt genetic map corresponding to the human hg38 reference genome. A prebuilt map is available as part of the Imputation files, accessible at the DRAGEN Software Support Site page. DRAGEN does not generate custom genetic map files. The genetic map should follow the format:

<chromosome name>.gmap.gz
3 columns: position, chromosome number, distance (cM)
compliant with the reference genome used to generate the sample input
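The three-column layout above can be sanity-checked with a short script. The following is a minimal Python sketch: it assumes whitespace-delimited columns in the order listed (position, chromosome, cM), treats a first row with a non-numeric position as an optional header, and assumes that positions and cM distances should never decrease along the map. The function names and the demonstration values are hypothetical.

```python
import io


def read_genetic_map(handle):
    """Parse (position, chromosome, cM) rows from an open text handle."""
    rows = []
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 3:
            raise ValueError(f"expected 3 columns, got {len(parts)}: {line!r}")
        if not parts[0].isdigit():
            continue  # assumed optional header row
        rows.append((int(parts[0]), parts[1], float(parts[2])))
    return rows


def is_monotonic(rows):
    """True when neither position nor cM distance ever decreases."""
    return all(a[0] <= b[0] and a[2] <= b[2] for a, b in zip(rows, rows[1:]))


# Hypothetical map content, for illustration only.
demo = io.StringIO("pos chr cM\n16050000 22 0.0\n16060000 22 0.0125\n")
rows = read_genetic_map(demo)
print(len(rows), is_monotonic(rows))
```

For a real file, the handle can be opened with gzip.open("<chromosome name>.gmap.gz", "rt").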

JSON config file

This config file enables the proper handling of haploid and diploid chromosomes. It must be located in the same directory as the input reference panel and follow the naming convention {$DIR}/{$PREFIX}.config.json, where PREFIX is the reference panel prefix. A config file is available on the DRAGEN Software Support Site page. When the config file is not present in the directory, the tool assumes that the imputation is done on all diploid chromosomes.

In the IRP reference panel folder available on the DRAGEN Software Support Site page, the JSON config file corresponds to human data. You can edit this file if imputation is done on another species.

Example of JSON config file

For imputing VCF on human data with typename “M” for Male and “F” for Female (“M” and “F” are the values used in the sample type file):

{
  "regions": {
    "chrX": [ "chrX_par1", "chrX_nonpar", "chrX_par2" ]
  },
  "ploidy": {
    "chrX_nonpar": { "M": 1, "F": 2 },
    "default":     { "M": 2, "F": 2 }
  }
}
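To illustrate how this config can be read, the following minimal Python sketch looks up the ploidy of a sample type in a subregion. It assumes that the "default" entry applies to any subregion without its own "ploidy" entry, and that a chromosome absent from "regions" is treated as a single region; both are interpretations of the example, not documented behavior. The function names are hypothetical.

```python
import json

CONFIG_TEXT = """
{
  "regions": {
    "chrX": ["chrX_par1", "chrX_nonpar", "chrX_par2"]
  },
  "ploidy": {
    "chrX_nonpar": {"M": 1, "F": 2},
    "default": {"M": 2, "F": 2}
  }
}
"""
config = json.loads(CONFIG_TEXT)


def subregions(config, chromosome):
    """Subregions of a chromosome, or the chromosome itself if undivided."""
    return config.get("regions", {}).get(chromosome, [chromosome])


def ploidy(config, subregion, typename):
    """Ploidy for a typename, falling back to the "default" entry (assumed)."""
    table = config["ploidy"]
    return table.get(subregion, table["default"])[typename]


for name in subregions(config, "chrX"):
    print(name, ploidy(config, name, "M"), ploidy(config, name, "F"))
```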

Instructions to make a custom JSON configuration file:

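The JSON config file is made of two fields, "regions" and "ploidy", as used in the example above:

"regions": maps a mixed-ploidy chromosome to the list of its subregions (for example, chrX to chrX_par1, chrX_nonpar, and chrX_par2)
"ploidy": maps each subregion name, plus a "default" entry, to the ploidy of each typename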

Note: ensure the subregion names match the genetic map names. Example: if "chrX_nonpar" is defined in the "regions" field of the JSON config file, then the genetic map corresponding to the chromosome X non-PAR region in the Reference Panel folder must be named chrX_nonpar.gmap.gz.
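The naming rules can be cross-checked against the config file before a run. The following is a minimal Python sketch that derives, for one chromosome, the expected reference panel basenames (<PREFIX>.basename) and genetic map file names (<name>.gmap.gz). It assumes that a chromosome absent from "regions" uses its own name as basename; the function names are hypothetical.

```python
import json
import os


def expected_files(config, prefix, chromosome):
    """Return (reference panel basename, genetic map file name) pairs."""
    names = config.get("regions", {}).get(chromosome, [chromosome])
    return [(f"{prefix}.{name}", f"{name}.gmap.gz") for name in names]


def missing_genetic_maps(config, prefix, chromosome, map_dir):
    """Genetic map file names expected in map_dir but not found there."""
    return [gmap for _, gmap in expected_files(config, prefix, chromosome)
            if not os.path.exists(os.path.join(map_dir, gmap))]


config = json.loads(
    '{"regions": {"chrX": ["chrX_par1", "chrX_nonpar", "chrX_par2"]}}'
)
for panel, gmap in expected_files(config, "IRPv2.0", "chrX"):
    print(panel, gmap)
```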

Sample type file

The sample type file is required when imputation is performed on non-PAR regions of mixed-ploidy chromosomes. It defines the typename of each sample.

The sample type file is a txt file with the following format:

2 columns, tab- or space-delimited
First column: list of all sample names present in the input sample
Second column: typename value for each sample. This typename value should match the typenames used in the JSON config file.
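The format above is simple enough to validate before a run. The following is a minimal Python sketch that parses the two columns and reports samples whose typename does not appear in the "ploidy" field of the JSON config file. The function names and sample names are hypothetical.

```python
def read_sample_types(lines):
    """Parse 'sample typename' rows (tab- or space-delimited) into a dict."""
    types = {}
    for number, line in enumerate(lines, 1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected 2 columns, got {len(parts)}")
        types[parts[0]] = parts[1]
    return types


def unknown_typenames(sample_types, config):
    """Samples whose typename is not used in the config 'ploidy' field."""
    known = set()
    for mapping in config["ploidy"].values():
        known.update(mapping)
    return {s: t for s, t in sample_types.items() if t not in known}


# Hypothetical sample names and config, for illustration only.
config = {"ploidy": {"chrX_nonpar": {"M": 1, "F": 2},
                     "default": {"M": 2, "F": 2}}}
sample_types = read_sample_types(["sampleA\tF", "sampleB M", "sampleC U"])
print(unknown_typenames(sample_types, config))
```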

Outputs

The VCF imputation tool generates several outputs:

The imputed variant file with concatenated imputed variants: one single VCF or msVCF file for all specified regions/chromosomes with name <prefix>.impute.vcf.gz
The intermediate files:
- chunk regions to be passed along to the internal Phase step with name <prefix>.impute.chunk.out.txt
- imputed variants for each identified chunk: VCF or msVCF depending on the input sample format, with name <prefix>_chr_start-end.impute.phase.vcf.gz
- text file with path to all the <prefix>_chr_start-end.impute.phase.vcf.gz generated with name <prefix>.impute.phase.out.txt

Note: while the imputation tool can impute multi-allelic positions, the output is in biallelic format, one line per ALT. The bcftools tool can be used afterward to collapse all ALT alleles onto one line with the command: bcftools norm -m +snps
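The collapsing step can be scripted after the imputation run. The following is a minimal Python sketch that builds the bcftools norm -m +snps command with the standard bcftools output options -Oz and -o, and runs it only when bcftools is installed and the input file exists. The function name and file names are hypothetical.

```python
import os
import shutil
import subprocess


def build_collapse_command(input_vcf, output_vcf):
    """bcftools command that merges biallelic SNP records into one line."""
    return ["bcftools", "norm", "-m", "+snps", "-Oz", "-o", output_vcf, input_vcf]


# Hypothetical file names following the <prefix>.impute.vcf.gz convention.
cmd = build_collapse_command("cohort.impute.vcf.gz",
                             "cohort.impute.multiallelic.vcf.gz")
print(" ".join(cmd))

# Run only when bcftools is available and the input is present.
if shutil.which("bcftools") and os.path.exists(cmd[-1]):
    subprocess.run(cmd, check=True)
```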

Command Line Options

Note: with this end-to-end implementation of the GLIMPSE software, the parameters window_size and buffer_size are respectively set to 2 Mb and 200 kb.
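To give an intuition for these two parameters, the following Python sketch tiles a region into fixed 2 Mb windows and extends each one by a 200 kb buffer on both sides, clamped to the region bounds. This is a simplified illustration of the window and buffer concept only; it is not the GLIMPSE chunking algorithm, which may place chunk boundaries differently. The function name and coordinates are hypothetical.

```python
WINDOW_SIZE = 2_000_000  # 2 Mb, as stated in the note above
BUFFER_SIZE = 200_000    # 200 kb, as stated in the note above


def tile_region(start, end, window=WINDOW_SIZE, buffer=BUFFER_SIZE):
    """Return (buffered_start, buffered_end, core_start, core_end) tuples.

    Coordinates are 1-based and inclusive. Fixed-size tiling is a
    simplification chosen for illustration.
    """
    chunks = []
    core_start = start
    while core_start <= end:
        core_end = min(core_start + window - 1, end)
        chunks.append((max(start, core_start - buffer),
                       min(end, core_end + buffer),
                       core_start, core_end))
        core_start = core_end + 1
    return chunks


chunks = tile_region(1, 5_000_000)
for chunk in chunks:
    print(chunk)
```

In this picture, the buffers overlap between neighboring chunks, while the core windows do not.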