Create genotype matrix
A genotype matrix is a numerical representation of genetic variants in which rows represent observations (such as samples or subjects) and columns represent variants (such as SNVs or SNPs). The values in the matrix are genotype encodings:
0 — reference homozygous
1 — heterozygous
2 — alternate homozygous
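As an illustration of this encoding, the sketch below converts VCF-style GT strings into 0/1/2 values and assembles a small sample-by-variant matrix. It assumes biallelic, diploid calls; the function name encode_genotype and the sample and variant labels are hypothetical and are not part of the task itself.

```python
def encode_genotype(gt: str) -> int:
    """Count alternate alleles in a diploid, biallelic GT string such as '0/1'."""
    # Treat phased ('|') and unphased ('/') separators the same way.
    alleles = gt.replace("|", "/").split("/")
    # Assumption: only allele '1' counts as alternate; anything else adds 0.
    return sum(1 for allele in alleles if allele == "1")


# Hypothetical input: rows are samples, columns are variants.
samples = ["S1", "S2", "S3"]
variants = ["chr1:100", "chr1:250"]
calls = {
    "S1": ["0/0", "0/1"],
    "S2": ["0|1", "1/1"],
    "S3": ["1/1", "0/0"],
}

matrix = [[encode_genotype(gt) for gt in calls[sample]] for sample in samples]
for sample, row in zip(samples, matrix):
    print(sample, row)
```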
A genotype matrix can be used for analyses such as GWAS, QTL, PCA, and kinship.
This task is available on data nodes containing variants.
Click a variants data node and choose Variant analysis > Create genotype matrix from the task menu:

Configure the dialog and click Finish to run the task:

Merge mode:
Intersect: use only the variants present in all samples from the input variant data node to create the matrix.
Union: include all variants present in any sample from the input data node. If a variant is not found in a sample, its value is set to 0; in other words, it is treated as a reference homozygous genotype call in that sample.
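The difference between the two merge modes can be sketched as follows. This is a minimal illustration, not the task's implementation: the function build_genotype_matrix, the (chromosome, position) variant keys, and the input layout are assumptions made for the example.

```python
def build_genotype_matrix(sample_calls, mode="intersect"):
    """Merge per-sample calls into a sample-by-variant matrix.

    sample_calls maps a sample name to {variant_key: genotype_code}.
    """
    variant_sets = [set(calls) for calls in sample_calls.values()]
    if mode == "intersect":
        # Keep only variants that every sample carries a call for.
        variants = set.intersection(*variant_sets)
    elif mode == "union":
        # Keep every variant seen in at least one sample.
        variants = set.union(*variant_sets)
    else:
        raise ValueError(f"unknown merge mode: {mode}")
    ordered = sorted(variants)
    # Variants absent from a sample are filled with 0 (reference homozygous).
    matrix = {
        sample: [calls.get(variant, 0) for variant in ordered]
        for sample, calls in sample_calls.items()
    }
    return ordered, matrix


# Hypothetical input with one shared variant and one private variant per sample.
sample_calls = {
    "S1": {("chr1", 100): 1, ("chr1", 250): 2},
    "S2": {("chr1", 100): 2, ("chr1", 400): 1},
}
print(build_genotype_matrix(sample_calls, "intersect"))
print(build_genotype_matrix(sample_calls, "union"))
```

With Union, the matrix has more columns, but a 0 can mean either a true reference call or a variant that was simply not reported for that sample.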
Minor Allele Frequency (MAF): the frequency at which the second most common allele occurs in the input samples. The value is between 0 and 1. Note: sample size matters for this parameter; MAF estimates are more accurate with larger sample sizes. When this option is selected, a minimum MAF threshold must be specified, and variants with MAF values smaller than the cutoff are filtered out. The default is 0.01; variants with MAF < 0.01 are typically considered rare variants.
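For a biallelic variant coded 0/1/2, MAF can be computed from the genotype column directly, as in the sketch below. The helper names are hypothetical, and the calculation assumes diploid samples with no missing calls.

```python
def minor_allele_frequency(genotypes):
    """MAF of one biallelic variant from its 0/1/2 genotype codes."""
    # Each diploid sample contributes two alleles.
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def filter_by_maf(matrix, threshold=0.01):
    """Return kept column indices and the matrix restricted to those columns."""
    n_variants = len(matrix[0])
    keep = [
        j for j in range(n_variants)
        if minor_allele_frequency([row[j] for row in matrix]) >= threshold
    ]
    return keep, [[row[j] for j in keep] for row in matrix]


# Hypothetical matrix: 4 samples (rows) by 3 variants (columns).
matrix = [
    [0, 0, 2],
    [1, 0, 2],
    [0, 0, 2],
    [1, 0, 1],
]
keep, filtered = filter_by_maf(matrix, threshold=0.01)
print(keep, filtered)
```

Here the second column is monomorphic (MAF of 0) and is dropped at the default cutoff, while the other two columns are kept.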
Linkage Disequilibrium (LD) Pruning: LD measures the association of alleles at different loci; a high r2 value indicates high correlation between two variants. r2 ranges from 0 to 1. When this option is selected, redundant SNVs are removed to create a set of approximately independent variants.
Window size: the number of variants considered in a window at a time.
Step size: the number of SNVs by which the window slides forward at each iteration.
LD r2: when the r2 of an SNV pair is higher than the specified value, the variant that is in high LD with more of the other variants in the window is removed. If both SNVs have the same number of correlated partners, the variant that is later in genomic order is removed [1]. Typically, SNVs with r2 <= 0.2 are considered independent.
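The windowed pruning described above can be approximated with the following sketch. It is a simplified greedy interpretation and not necessarily the exact procedure the task uses: r2 is taken as the squared Pearson correlation of the 0/1/2 codes, columns are assumed to be sorted by genomic position, and the names r_squared and ld_prune are hypothetical.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two genotype columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic column has no defined correlation
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_prune(columns, window_size, step_size, r2_threshold):
    """Return indices of variants kept after greedy windowed LD pruning."""
    removed = set()
    n = len(columns)
    for start in range(0, n, step_size):
        window = [j for j in range(start, min(start + window_size, n))
                  if j not in removed]
        while True:
            # Count, for each variant, its partners above the r2 threshold.
            partners = {j: 0 for j in window}
            for a, b in combinations(window, 2):
                if r_squared(columns[a], columns[b]) > r2_threshold:
                    partners[a] += 1
                    partners[b] += 1
            if not partners or max(partners.values()) == 0:
                break
            # Remove the variant with the most partners; ties go to the
            # variant that is later in genomic order (larger index).
            worst = max(window, key=lambda j: (partners[j], j))
            removed.add(worst)
            window.remove(worst)
    return [j for j in range(n) if j not in removed]


# Hypothetical columns: variants 0 and 1 are identical, variant 2 is uncorrelated.
columns = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 0, 1, 1, 0, 2],
]
print(ld_prune(columns, window_size=3, step_size=1, r2_threshold=0.2))
```

Recomputing r2 inside the loop keeps the sketch short; a practical implementation would cache the pairwise values for each window.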
Variant Quality Filtering: based on the QUAL field in the VCF file. When this option is selected, SNVs with QUAL values lower than the specified number are removed.
Hardy-Weinberg Equilibrium (HWE): HWE assumes that a variant's genotypes follow the expected proportions in the population. The observed genotype counts in the input samples are compared with the expected genotype counts using an exact test [2]. When this option is selected, SNVs with a p-value smaller than the specified value are considered to deviate from HWE and are removed.
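To show what an HWE exact test computes, the sketch below follows the recurrence approach described by Wigginton et al.: it enumerates the probability of every possible heterozygote count given the observed allele counts, then sums the probabilities that are no larger than that of the observed count. The function names and the 0.01 cutoff in the usage line are illustrative assumptions, not the task's defaults.

```python
def hwe_exact_pvalue(n_het, n_hom_ref, n_hom_alt):
    """Exact HWE p-value for one biallelic variant from its genotype counts."""
    n = n_het + n_hom_ref + n_hom_alt
    if n == 0:
        return 1.0
    hom_rare = min(n_hom_ref, n_hom_alt)
    rare = 2 * hom_rare + n_het  # copies of the rarer allele
    probs = [0.0] * (rare + 1)   # unnormalised P(het count = index)

    # Start from the most likely heterozygote count, with matching parity.
    mid = rare * (2 * n - rare) // (2 * n)
    if mid % 2 != rare % 2:
        mid += 1
    probs[mid] = 1.0

    # Walk down to fewer heterozygotes.
    het, hom_r = mid, (rare - mid) // 2
    hom_c = n - het - hom_r
    while het >= 2:
        probs[het - 2] = probs[het] * het * (het - 1) / (4.0 * (hom_r + 1) * (hom_c + 1))
        het, hom_r, hom_c = het - 2, hom_r + 1, hom_c + 1

    # Walk up to more heterozygotes.
    het, hom_r = mid, (rare - mid) // 2
    hom_c = n - het - hom_r
    while het <= rare - 2:
        probs[het + 2] = probs[het] * 4.0 * hom_r * hom_c / ((het + 2) * (het + 1))
        het, hom_r, hom_c = het + 2, hom_r - 1, hom_c - 1

    total = sum(probs)
    observed = probs[n_het]
    return min(1.0, sum(p for p in probs if p <= observed) / total)


def passes_hwe(genotypes, p_cutoff):
    """True if a 0/1/2 genotype column is kept at the given p-value cutoff."""
    p = hwe_exact_pvalue(genotypes.count(1), genotypes.count(0), genotypes.count(2))
    return p >= p_cutoff


# Hypothetical column: 5 reference and 5 alternate homozygotes, no heterozygotes.
print(passes_hwe([0] * 5 + [2] * 5, p_cutoff=0.01))
```

A column with equal allele frequencies but no heterozygotes is a strong departure from HWE, so it yields a small p-value and is filtered out in this example.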
The output data node contains the sample-by-variant matrix.
Reference
Wigginton JE, Cutler DJ, Abecasis GR. A note on exact tests of Hardy-Weinberg equilibrium. Am J Hum Genet. 2005 May;76(5):887-93. doi: 10.1086/429864. Epub 2005 Mar 23. PMID: 15789306; PMCID: PMC1199378.
