Creating an external database VCF file

Prerequisites

Format: The database file must follow the VCF 4.2 specification.

Tools: awk, bgzip, tabix.

How to create a noise or historic database file

1

Calculate population statistics

  1. General: Allele Number (AN): Calculate the total number of alleles in your population by multiplying the number of individuals (N) by 2: AN = 2×N.

  2. For each variant within the dataset:

    1. Allele Count (AC): Determine the number of times the alternate allele of a variant appears across all individuals. This is inferred from genotype counts (n). For a biallelic variant where allele A is the reference and allele B is the alternate, the allele count for the alternate allele is calculated as follows:

      AC(B) = 2×n(BB) + n(AB)

    2. Allele Frequency (AF): Calculate as the ratio of Allele Count to Allele Number: AF(B) = AC(B) / AN.
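The three formulas above can be sketched as a small shell helper. This is a minimal sketch, assuming diploid individuals and a biallelic variant; the function name allele_stats and the sample counts are hypothetical and only illustrate the arithmetic.

```shell
# allele_stats N n_BB n_AB  ->  prints "AC AN AF" for one biallelic variant
# N = number of individuals, n_BB = homozygous-alternate count,
# n_AB = heterozygous count (hypothetical helper, not an Illumina tool)
allele_stats() {
  awk -v N="$1" -v nBB="$2" -v nAB="$3" 'BEGIN {
    AN = 2 * N            # AN = 2 x N
    AC = 2 * nBB + nAB    # AC(B) = 2 x n(BB) + n(AB)
    printf "%d %d %.4f\n", AC, AN, AC / AN   # AF(B) = AC(B) / AN
  }'
}

# Example: 100 individuals, 5 with genotype BB, 20 with genotype AB
allele_stats 100 5 20   # prints: 30 200 0.1500
```

With these counts, AC = 2×5 + 20 = 30, AN = 2×100 = 200, and AF = 30 / 200 = 0.15.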

2

Create a VCF file with your variants

  1. In the INFO field, include the AF sub-field.

  2. Optionally, include AC, AN, and TEN (a list of up to 10 samples carrying the variant). You may add other fields, provided the field names do not contain underscores or hyphens.

  3. Specify the exact format of each INFO sub-field in ##INFO meta-information lines.

See format example below.
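The naming rule and the required AF definition can be checked mechanically before the file is sorted and compressed. The following is a minimal sketch, assuming a POSIX shell with awk; check_info_ids is a hypothetical helper name, and it applies the no-underscore, no-hyphen rule to every INFO ID declared in the header.

```shell
# check_info_ids FILE: report INFO IDs containing "_" or "-" and confirm
# that AF is declared in the header (hypothetical helper)
check_info_ids() {
  awk '
    /^##INFO=<ID=/ {
      id = $0
      sub(/^##INFO=<ID=/, "", id)   # drop the prefix
      sub(/[,>].*$/, "", id)        # keep only the ID itself
      if (id ~ /[_-]/) { print "invalid INFO ID: " id; bad = 1 }
      if (id == "AF") has_af = 1
    }
    /^#CHROM/ { exit }              # header ends here; the END block still runs
    END {
      if (!has_af) { print "missing ##INFO definition for AF"; bad = 1 }
      exit (bad ? 1 : 0)
    }
  ' "$1"
}

# Usage (replace the file name with your own uncompressed VCF):
#   check_info_ids your_db.vcf && echo "INFO header OK"
```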

3

Sort variants in the VCF based on chromosome and position with awk

awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k2,2n"}' <your_unsorted_db>.vcf > <your_db>.vcf

4

Compress the VCF with bgzip

bgzip <your_db>.vcf

5

Create a TBI index file with tabix

tabix -p vcf <your_db>.vcf.gz
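Before handing the files over, it can be worth confirming that the compressed file is ordered the way tabix expects: records of each chromosome grouped together, with positions non-decreasing within a chromosome. The following is a minimal sketch, assuming a POSIX shell with awk and gzip; check_sorted is a hypothetical helper name. Because bgzip output is gzip-compatible, gzip -dc can stream it.

```shell
# check_sorted: read VCF text on stdin; fail if a chromosome block is split
# or positions go backwards within a chromosome (hypothetical helper)
check_sorted() {
  awk -F '\t' '
    /^#/ { next }                                   # skip header lines
    $1 != chrom {
      if ($1 in seen) { print "chromosome not contiguous: " $1; bad = 1; exit }
      seen[$1] = 1; chrom = $1; pos = 0
    }
    $2 + 0 < pos { print "unsorted at " $1 ":" $2; bad = 1; exit }
    { pos = $2 + 0 }
    END { exit (bad ? 1 : 0) }
  '
}

# Usage (replace the file name with your own compressed VCF):
#   gzip -dc your_db.vcf.gz | check_sorted && echo "sort order OK"
```

The check does not require any particular chromosome order, so the lexicographic order produced by sort -k1,1 (chr1, chr10, chr2, ...) passes.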

Example historic DB VCF header and variant line
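The layout described in the steps above can be sketched as follows. This is an illustrative sketch rather than an official template: the Number, Type, and Description values for AF, AC, and AN follow the reserved definitions in the VCF 4.2 specification, while the definition chosen for TEN, the chromosome naming (chr1), the sample names, the variant itself, and the output file name are all assumptions.

```shell
# Write a minimal, purely illustrative historic DB VCF (all values hypothetical).
# Columns of the #CHROM line and the variant line must be tab-separated,
# hence printf with \t.
{
  printf '##fileformat=VCFv4.2\n'
  printf '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">\n'
  printf '##INFO=<ID=AC,Number=A,Type=Integer,Description="Alternate allele count">\n'
  printf '##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles">\n'
  printf '##INFO=<ID=TEN,Number=.,Type=String,Description="Up to 10 samples carrying the variant">\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t12345\t.\tA\tG\t.\t.\tAF=0.15;AC=30;AN=200;TEN=S1,S2,S3\n'
} > example_historic_db.vcf
```

The INFO values on the variant line are internally consistent: AF = AC / AN = 30 / 200 = 0.15.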

How to create a curated database file

1

Create a VCF file with your variants

  1. In the INFO field, include the significance sub-field and assign its value to each variant based on Table 1. Only one significance value is allowed per variant. If a variant has multiple interpretations, list the variant in separate rows, each with a different significance value.

    Table 1. Mapping of significance values to pathogenicity classes.

    Significance value in the VCF    Pathogenicity class in the UI
    0                                Unknown
    1                                Benign
    2                                Likely Benign
    3                                VUS
    4                                Likely Pathogenic
    5                                Pathogenic

  2. Optionally, include comment, category, or other fields to capture text or numerical values that are relevant to classification. You may add other fields, provided the field names do not contain underscores or hyphens.

  3. Specify the exact format of each INFO sub-field in ##INFO meta-information lines.

See format example below.
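The significance rule (exactly one value from 0 to 5 per row) can be checked with a short script before compressing the file. The following is a minimal sketch, assuming a POSIX shell with awk and that INFO is the eighth tab-separated column, as in standard VCF; check_significance is a hypothetical helper name.

```shell
# check_significance: read VCF text on stdin; every record needs exactly one
# significance=<0..5> entry in the INFO column (hypothetical helper)
check_significance() {
  awk -F '\t' '
    /^#/ { next }                                   # skip header lines
    {
      n = split($8, kv, ";"); count = 0; ok = 0
      for (i = 1; i <= n; i++)
        if (kv[i] ~ /^significance=/) {
          count++
          if (kv[i] ~ /^significance=[0-5]$/) ok = 1
        }
      if (count != 1 || !ok) {
        print "line " NR ": bad significance in " $1 ":" $2
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  '
}

# Usage (replace the file name with your own uncompressed VCF):
#   check_significance < your_db.vcf && echo "significance values OK"
```

A row such as significance=3,4 is rejected by this check, which matches the rule that a variant with multiple interpretations is listed in separate rows.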

2

Sort variants in the VCF based on chromosome and position with awk

3

Compress the VCF with bgzip

4

Create a TBI index file with tabix

Example curated DB VCF variant lines

Small variant

Copy number variant

Next steps

1

Reach out to Illumina support

Provide the VCF and TBI files to Illumina support for upload to your organization's dedicated storage bucket, along with the following information:

  • Database name. Underscores ("_") in a database name are not allowed.

  • Database type (Noise, Historic, Curated)

  • Variant type (SNV, CNV)

  • Genome reference (GRCh37, GRCh38)

2

Register the database

Once the database is uploaded, a user with the appropriate permissions can register it by selecting it from the organization's bucket in Settings.
