CheckFingerprint
Check Sample Identity with CheckFingerprint
CheckFingerprint is broadly based on Picard CheckFingerprint. It outputs a LOD score that indicates whether the genetic data in two files come from the same individual.
If the LOD score is positive, the two samples come from the same individual. Otherwise, the two samples come from different individuals.
In general, the sign of the LOD in the summary file should be consistent with the Picard CheckFingerprint summary file, but the exact values may differ.
Validation was done on whole-genome sequencing (WGS) data, mixing WGS samples and whole-exome sequencing (WES) data.
Usage
Modes
The checks can run in one of three modes:
Read comparison mode. Aligned reads are compared with the expected VCF.
VCF comparison mode. The output VCF is compared with the expected VCF.
Standalone VCF comparison mode. Provide two VCF files, one as the observed VCF and the other as the expected VCF, and compare them.
Options
To enable the CheckFingerprint module, the following command-line options are required:
--enable-checkfingerprint true
--checkfingerprint-expected-vcf [path_to_expected_sample_vcf]
Read comparison mode is enabled by default. It is recommended for small datasets or whole-exome sequencing data.
To switch to VCF comparison mode, use the following options:
--checkfingerprint-enable-vcf-comparison true
--enable-variant-caller true
VCF comparison mode is recommended for larger samples, such as whole-genome sequencing data with an average of 30x coverage or whole-exome sequencing data.
Command-line Examples
Read mode. Input BAM/FASTQ/CRAM, examine the individual reads in the input sample, and compare them with the expected VCF file.
./bin/dragen -r [dragen_hash_table] -b [bam] --output-directory [output_dir] \
--output-file-prefix [output_prefix] --enable-checkfingerprint true --checkfingerprint-expected-vcf [input_expected_vcf]
VCF mode. Input BAM/FASTQ/CRAM, generate a VCF file first, and compare that VCF file with the expected VCF file.
./bin/dragen -r [dragen_hash_table] -b [bam] --output-directory [output_dir] \
--output-file-prefix [output_prefix] --enable-checkfingerprint true --checkfingerprint-expected-vcf [input_expected_vcf] \
--checkfingerprint-enable-vcf-comparison true --enable-variant-caller true
Standalone VCF mode. Input an observed VCF file and compare it with the expected VCF file.
./bin/dragen -r [dragen_hash_table] --output-directory [output_dir] \
--output-file-prefix [output_prefix] --enable-checkfingerprint true --checkfingerprint-expected-vcf [input_expected_vcf] \
--checkfingerprint-observed-vcf [input_observed_vcf]
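The three invocations above differ only in a few options, which the following minimal Python sketch makes explicit. It uses only the options shown in the examples; the helper name `build_checkfingerprint_args` and the mode labels "read", "vcf", and "standalone" are hypothetical and exist only for this illustration.

```python
def build_checkfingerprint_args(mode, expected_vcf, hash_table, output_dir,
                                output_prefix, bam=None, observed_vcf=None):
    """Assemble a dragen argument list for one of the three documented modes.

    `mode` is a hypothetical label used only by this helper:
    "read", "vcf", or "standalone".
    """
    # Options shared by every CheckFingerprint run in the examples above.
    args = [
        "dragen",
        "-r", hash_table,
        "--output-directory", output_dir,
        "--output-file-prefix", output_prefix,
        "--enable-checkfingerprint", "true",
        "--checkfingerprint-expected-vcf", expected_vcf,
    ]
    if mode in ("read", "vcf"):
        if bam is None:
            raise ValueError("read and vcf modes need a BAM input")
        args += ["-b", bam]
        if mode == "vcf":
            # VCF comparison needs the variant caller to produce the VCF.
            args += ["--checkfingerprint-enable-vcf-comparison", "true",
                     "--enable-variant-caller", "true"]
    elif mode == "standalone":
        if observed_vcf is None:
            raise ValueError("standalone mode needs an observed VCF")
        args += ["--checkfingerprint-observed-vcf", observed_vcf]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return args
```

The returned list can be passed to `subprocess.run` on a system where DRAGEN is installed. Only BAM input is handled here, matching the `-b` option in the examples.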
Advanced Usage
Input a customized haplotype map. Without user input, DRAGEN CheckFingerprint uses the default haplotype map. The format of the haplotype map is described in the "a) Haplotype Map" section under Inputs.
--checkfingerprint-haplotype-map [input_haplotype_map]
Enable tumor-aware LOD. The default --checkfingerprint-loss-of-het-rate is 0.5. This setting assumes that the tumor sample has undergone a loss of heterozygosity (LoH), in which large sections of chromosomes are lost. LoH makes haplotypes that are heterozygous in the normal sample appear homozygous in the corresponding tumor sample.
--checkfingerprint-enable-tumor-aware true
--checkfingerprint-loss-of-het-rate [float]
Inputs
The input files used by DRAGEN CheckFingerprint are: a) haplotype map (configuration files), b) FASTQ/BAM/CRAM (user input) or observed VCF file (user input), c) expected VCF file (user input).
a) Haplotype Map
Haplotype maps for hg19, hg38, and chm13 are packaged with DRAGEN and automatically selected by the software. The haplotype map is a set of SNPs grouped into haplotype blocks (also known as linkage disequilibrium blocks). The SNPs in the haplotype map are used for fingerprinting.
The haplotype map is a tab-delimited text file.
@SQ SN:X LN:156040895 M5:2b3a55ff7f58eb308420c8a9b11cac50 AS:38 UR:/seq/references/Homo_sapiens_assembly38/v0/Homo_sapiens_assembly38.fasta SP:Homo sapiens
#CHROMOSOME POSITION NAME MAJOR_ALLELE MINOR_ALLELE MAF ANCHOR_SNP PANELS
1 122872 chr1:25 T G 0.235623
1 789502 chr1:84 T C 0.480232
1 789503 chr1:85 G A 0.480232 chr1:84
1 796338 chr1:89 T C 0.154353
1 798969 chr1:91 T C 0.152556 chr1:89
The following columns are of interest:
CHROMOSOME : chromosome
POSITION : position
NAME : SNP identifier
MAF : minor allele frequency
ANCHOR_SNP : refers to the NAME of a SNP. SNPs with the same ANCHOR_SNP have high linkage disequilibrium with each other.
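To illustrate how the ANCHOR_SNP column ties SNPs into haplotype blocks, here is a minimal Python sketch that reads a haplotype map and groups SNP names by anchor. It assumes the tab-delimited layout shown above and assumes that a SNP with an empty ANCHOR_SNP anchors its own block (the Picard convention, which this page does not state explicitly). The function name `parse_haplotype_map` is hypothetical.

```python
from collections import defaultdict

COLUMNS = ["CHROMOSOME", "POSITION", "NAME", "MAJOR_ALLELE",
           "MINOR_ALLELE", "MAF", "ANCHOR_SNP", "PANELS"]

def parse_haplotype_map(text):
    """Group SNP names into blocks keyed by anchor SNP name."""
    blocks = defaultdict(list)
    for line in text.splitlines():
        # Skip blank lines, '@' sequence headers, and the '#' column header.
        if not line.strip() or line.startswith(("@", "#")):
            continue
        fields = line.split("\t")
        # Trailing optional columns may be absent; pad to the full width.
        fields += [""] * (len(COLUMNS) - len(fields))
        row = dict(zip(COLUMNS, fields))
        # Assumption: an empty ANCHOR_SNP means the SNP anchors its own block.
        anchor = row["ANCHOR_SNP"] or row["NAME"]
        blocks[anchor].append(row["NAME"])
    return dict(blocks)

# The five example rows from the haplotype map excerpt above.
sample = "\n".join([
    "#CHROMOSOME\tPOSITION\tNAME\tMAJOR_ALLELE\tMINOR_ALLELE\tMAF\tANCHOR_SNP\tPANELS",
    "1\t122872\tchr1:25\tT\tG\t0.235623",
    "1\t789502\tchr1:84\tT\tC\t0.480232",
    "1\t789503\tchr1:85\tG\tA\t0.480232\tchr1:84",
    "1\t796338\tchr1:89\tT\tC\t0.154353",
    "1\t798969\tchr1:91\tT\tC\t0.152556\tchr1:89",
])
blocks = parse_haplotype_map(sample)
```

Under these assumptions the five example rows form three blocks: chr1:25 on its own, chr1:84 with chr1:85, and chr1:89 with chr1:91.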
b) Sample Input
Samples are input from BAM/CRAM/FASTQ files or from an observed VCF file.
The following command-line example uses FASTQ input:
dragen \
-r [dragen_hash_table] \
--fastq-file1 /staging/test/data/NA12878_R1.fastq \
--fastq-file2 /staging/test/data/NA12878_R2.fastq \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--RGID DRAGEN_RGID \
--RGSM NA12878 \
--enable-checkfingerprint true \
--checkfingerprint-expected-vcf [input_expected_vcf] \
--checkfingerprint-enable-vcf-comparison true \
--enable-variant-caller true
The following command-line example uses VCF input:
dragen \
-r [dragen_hash_table] \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--enable-checkfingerprint true \
--checkfingerprint-expected-vcf [input_expected_vcf] \
--checkfingerprint-observed-vcf [input_observed_vcf]
c) Expected VCF Input
VCF output from DRAGEN is recommended. The file can contain multiple samples.
Multiple sample VCFs can be combined into one file and supplied through --checkfingerprint-expected-vcf.
CheckFingerprint calculates the LOD between the input sample (BAM/CRAM/FASTQ or VCF) and each sample in the expected VCF file.
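Because a LOD is reported for each sample in the expected VCF, it can help to list those samples before a run. The following minimal Python sketch reads the sample names from the `#CHROM` header line, relying on the VCF convention that every column after FORMAT is a sample. The function name `vcf_sample_names` and the second sample name `hg003` are hypothetical.

```python
def vcf_sample_names(vcf_text):
    """Return the sample names declared on the #CHROM header line."""
    for line in vcf_text.splitlines():
        if line.startswith("#CHROM"):
            columns = line.split("\t")
            # The first 9 columns (CHROM through FORMAT) are fixed;
            # each remaining column is one sample.
            return columns[9:]
    raise ValueError("no #CHROM header line found")

# Header of a hypothetical two-sample expected VCF.
header = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\thg002\thg003\n"
)
names = vcf_sample_names(header)
```

In this example, one CheckFingerprint run would be expected to report a result for each of the two names.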
Outputs
There are two main output files:
[output-file-prefix].CheckFingerprint.summary.txt : contains LOD scores between the input sample and each expected sample.
[output-file-prefix].CheckFingerprint.detail.txt : contains LOD scores for individual SNPs.
CheckFingerprint.detail.txt example
READ_GROUP EXPECTED_SAMPLE SNP SNP_ALLELES CHROM POSITION EXPECTED_GENOTYPE OBSERVED_GENOTYPE LOD LOD_EXPECTED_TUMOR_OBSERVED_NORMAL LOD_EXPECTED_NORMAL_OBSERVED_TUMOR OBS_A OBS_B
IGNORE hg002 chr1:274 AG 1 908025 AG AG 5.92214 -9.7141e-06 -0.301035 0 0
IGNORE hg002 chr1:308 GA 1 916119 GG GG 7.15017 -0.0459602 -1.39646e-06 0 0
IGNORE hg002 chr1:473 CT 1 984039 CT CT 4.60476 -0.000568506 -0.301314 0 0
CheckFingerprint.summary.txt example
LOD_EXPECTED_SAMPLE is the LOD score between the two samples. LOD_OBS_TUMOR_EXP_NORMAL is the LOD score when the observed sample is a tumor sample and the expected sample is a normal sample. LOD_OBS_NORMAL_EXP_TUMOR is the LOD score when the observed sample is a normal sample and the expected sample is a tumor sample. LOD_OBS_TUMOR_EXP_NORMAL and LOD_OBS_NORMAL_EXP_TUMOR have values only when ENABLE_TUMOR_AWARE is true.
READ_GROUP EXPECTED_SAMPLE LL_EXPECTED_SAMPLE LL_RANDOM_SAMPLE LOD_EXPECTED_SAMPLE ENABLE_TUMOR_AWARE LOD_OBS_TUMOR_EXP_NORMAL LOD_OBS_NORMAL_EXP_TUMOR HAPLOTYPES_WITH_GENOTYPES HAPLOTYPES_CONFIDENTLY_CHECKED HAPLOTYPES_CONFIDENTLY_MATCHING HET_AS_HOM HOM_AS_HET HOM_AS_OTHER_HOM
IGNORE hg002 -18237 -6517.62 -11719.4 true -5907.94 -4499.57 12423 6725 4307 1193 1225 0
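A summary file like the one above can be read programmatically. The following minimal Python sketch parses the example row and interprets the sign of LOD_EXPECTED_SAMPLE. It assumes one header line followed by one row per expected sample, and it splits on tabs when present, otherwise on whitespace; the actual delimiter is not stated here. The names `parse_summary` and `interpret_lod`, and the `band` parameter, are hypothetical.

```python
SUMMARY = """\
READ_GROUP EXPECTED_SAMPLE LL_EXPECTED_SAMPLE LL_RANDOM_SAMPLE LOD_EXPECTED_SAMPLE ENABLE_TUMOR_AWARE LOD_OBS_TUMOR_EXP_NORMAL LOD_OBS_NORMAL_EXP_TUMOR HAPLOTYPES_WITH_GENOTYPES HAPLOTYPES_CONFIDENTLY_CHECKED HAPLOTYPES_CONFIDENTLY_MATCHING HET_AS_HOM HOM_AS_HET HOM_AS_OTHER_HOM
IGNORE hg002 -18237 -6517.62 -11719.4 true -5907.94 -4499.57 12423 6725 4307 1193 1225 0
"""

def parse_summary(text):
    """Parse summary rows into dicts keyed by the header names."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Prefer tabs so that empty fields keep their position; fall back to
    # whitespace for text copied from this page.
    sep = "\t" if "\t" in lines[0] else None
    header = lines[0].split(sep)
    return [dict(zip(header, ln.split(sep))) for ln in lines[1:]]

def interpret_lod(lod, band=0.0):
    """Map a LOD score to a verdict.

    `band` is a hypothetical half-width around 0 treated as inconclusive;
    the documentation does not define a numeric cutoff.
    """
    if lod > band:
        return "same individual"
    if lod < -band:
        return "different individuals"
    return "inconclusive"

rows = parse_summary(SUMMARY)
lod = float(rows[0]["LOD_EXPECTED_SAMPLE"])
verdict = interpret_lod(lod)
# In the example row, LL_EXPECTED_SAMPLE - LL_RANDOM_SAMPLE matches the LOD
# to the printed precision (an observation, not a documented guarantee).
ll_diff = float(rows[0]["LL_EXPECTED_SAMPLE"]) - float(rows[0]["LL_RANDOM_SAMPLE"])
```

For the example row the LOD is strongly negative, so the sketch reports that the input sample and hg002 are different individuals.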
Method of Operation
CheckFingerprint calculates a LOD score to identify whether two samples are from the same individual. A positive value indicates that the two samples are from the same individual. A negative value indicates that the two samples do not match. The LOD is on a logarithmic scale (base 10), so a LOD of 4 indicates that it is 10,000 times more likely that the data matches the genotypes than not. A score close to 0 is inconclusive and can result from low coverage or missing informative genotypes. The identity check takes advantage of the haplotype blocks defined in the configuration file (hg38_nochr.map, hg19_nochr.map). Checking SNPs within haplotype blocks improves the statistical power of identity detection.
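Because the LOD is a base-10 logarithm of a likelihood ratio, converting it back to a ratio is a single exponentiation. This minimal Python sketch shows the conversion; the function name `lod_to_likelihood_ratio` is hypothetical.

```python
def lod_to_likelihood_ratio(lod):
    """Return how many times more likely a match is than a non-match."""
    # LOD = log10(ratio), so ratio = 10 ** LOD.
    return 10.0 ** lod

ratio_match = lod_to_likelihood_ratio(4)      # match is favored
ratio_mismatch = lod_to_likelihood_ratio(-2)  # mismatch is favored
ratio_neutral = lod_to_likelihood_ratio(0)    # neither is favored
```

A ratio above 1 favors a match, a ratio below 1 favors a mismatch, and a ratio near 1 corresponds to the inconclusive case.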
In VCF mode, CheckFingerprint uses the PL field (phred-scaled genotype likelihoods) to estimate genotype probabilities.
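As an illustration of what PL values encode, the following minimal Python sketch converts a PL triple into normalized genotype probabilities using the standard phred relation (likelihood proportional to 10^(-PL/10)). Normalizing under a flat prior is an assumption made for this example; how CheckFingerprint weights PL values internally is not described here. The function name `pl_to_probabilities` and the PL values are hypothetical.

```python
def pl_to_probabilities(pl):
    """Convert phred-scaled likelihoods to normalized genotype probabilities."""
    # A PL of 0 marks the most likely genotype; each +10 is a 10x drop.
    likelihoods = [10.0 ** (-p / 10.0) for p in pl]
    total = sum(likelihoods)
    # Assumption: flat prior over genotypes, so normalize directly.
    return [value / total for value in likelihoods]

# Hypothetical PL for a confident homozygous-reference call,
# ordered as (hom-ref, het, hom-alt).
probs = pl_to_probabilities([0, 30, 300])
```

Here nearly all of the probability mass falls on the first genotype, with about a 1 in 1,000 share on the heterozygous genotype.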
Limitations:
VCF mode is recommended for general use.
Currently, VCF mode is designed for whole-genome sequencing samples with 30x coverage.
Read mode is designed for whole-exome sequencing. Larger datasets may encounter timeout errors.
Read mode should be used in isolation, without other components enabled, and only if VCF mode does not provide sufficient accuracy.
DRAGEN CheckFingerprint is compatible only with DRAGEN germline and tumor-only pipelines.
Tumor-aware settings assume tumor samples with loss of heterozygosity and should be used with caution.
The observed and expected sample VCFs should originate from the same pipeline, because using different pipelines can lead to inaccurate LOD calculations.