Multisample CNV Calling
Multisample CNV calling is possible starting from tangent-normalized counts files (*.tn.tsv.gz) specified with the --cnv-input option (one per sample). Multisample CNV analysis benefits from using joint segmentation to increase the sensitivity of detection of copy number variable segments. For each copy number variable segment identified, the copy number genotype of each sample is emitted in a single VCF entry to facilitate annotation and interpretation.
Multisample CNV analysis is supported for WGS and WES workflows.
The following is an example command line for running a trio analysis:
Make sure all input samples have gone through the same single sample workflow and have identical intervals. If the samples are WES inputs, then you must generate the samples using the same panel of normals, and the autosomal intervals for all samples must match.
The following options are used in DeNovo CNV calling:
--cnv-input
For DeNovo CNV calling, this specifies the input tangent-normalized signal files (*.tn.tsv.gz) from the single sample runs. This option can be specified multiple times, once for each input sample.
--cnv-filter-de-novo-qual
Phred-scaled threshold at which a putative event in the proband sample is marked as DeNovo. Default value is 0.125.
--pedigree-file
Pedigree file specifying the relationship between the input samples.
First, CNV calling is performed on each sample independently. Joint segmentation then uses the copy number variable segments from each single sample analysis to derive a set of joint copy number variable segments. This set of joint segments is determined simply by taking the union of all breakpoints from the copy number variable segments of all samples. This results in the splitting of any partially overlapping segments across different samples. For example:
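A minimal sketch of the breakpoint union, assuming each sample's copy number variable segments on one chromosome are given as half-open (start, end) pairs. The function name and coordinates are hypothetical and are not part of the tool itself.

```python
def joint_segments(per_sample_segments):
    """Split segments at the union of all samples' breakpoints.

    per_sample_segments: one list per sample of (start, end) pairs,
    all on the same chromosome (half-open intervals are an assumption).
    """
    # Union of every start and end position across all samples.
    breakpoints = sorted(
        {pos for segs in per_sample_segments for seg in segs for pos in seg}
    )
    joint = []
    for left, right in zip(breakpoints, breakpoints[1:]):
        # Keep an interval only if at least one sample's segment covers it.
        covered = any(
            start <= left and right <= end
            for segs in per_sample_segments
            for start, end in segs
        )
        if covered:
            joint.append((left, right))
    return joint


# Two partially overlapping segments from two samples are split in three.
sample_a = [(100, 500)]
sample_b = [(300, 800)]
print(joint_segments([sample_a, sample_b]))
```

Here the overlapping region (300, 500) becomes its own joint segment, so each sample can be genotyped on exactly the same intervals.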
Following joint segmentation, copy number calling is again performed independently on each sample using the joint segments. Segments can be merged as with the single sample analysis, but each joint segment is emitted in the multisample VCF as a single entry. The quality score (QS in the VCF) from the sample's merged segment, if applicable, is used for filtering the call. Sample calls are filtered using the sample's FT field in the multisample VCF. The QUAL column of the multisample VCF is always missing (ie, "."). The FILTER column of the multisample VCF is SampleFT if none of the samples' FT fields are PASS, and PASS if any of the samples' FT fields are PASS.
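The record-level FILTER rule can be illustrated with a short sketch. The function name and the non-PASS filter label below are hypothetical placeholders; only the PASS and SampleFT values come from the text.

```python
def record_filter(sample_ft_values):
    """Derive the record-level FILTER value from the per-sample FT fields."""
    # PASS if any sample passes; otherwise SampleFT.
    if any(ft == "PASS" for ft in sample_ft_values):
        return "PASS"
    return "SampleFT"


# "someFilter" stands in for any sample-level filter name.
print(record_filter(["PASS", "someFilter", "someFilter"]))
print(record_filter(["someFilter", "someFilter", "someFilter"]))
```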
Note, however, that when a single segment in one sample overlaps multiple segments in another sample, the larger segment annotation is replicated across multiple records, e.g. (only relevant VCF fields are printed below):
A de novo event is defined as the existence of a genotype at a particular locus in a proband's genome that did not result from standard Mendelian inheritance from the parents. The de novo calling stage identifies putative de novo events in the proband of each trio of a multisample analysis. In some cases, these putative de novo events may be real, but they can also arise from sequencing or analysis artifacts. Consequently, a de novo quality score is assigned to each putative de novo event and used to filter out low-quality de novo events. Trios are specified by providing a .ped file with the --pedigree-file option. Multiple trios can be specified (eg, quad analysis), and all valid trios will be processed.
For each joint segment in a trio, the de novo caller determines if there is a Mendelian inheritance conflict for the called copy number genotypes. The CNV caller does not identify the copy number for each allele of a given diploid segment, which means assumptions are made about the possible allelic composition of the parent genotypes.
The assumption is that the copy number 0 allele is not present for diploid regions of a parent's genome (sex dependent) when the assigned copy number is 2 or greater. This results in simplifications, as follows:
The following are examples of consistent and inconsistent copy number genotypes for diploid regions using these assumptions:
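As a sketch of how such a consistency check could work for diploid regions, the code below enumerates the assumed alleles of each parent and tests whether any pair sums to the proband copy number. The function names are hypothetical, and the treatment of parent copy numbers 0 and 1 (alleles 0/0 and 0/1) is an assumption, since the text only states the rule for copy number 2 or greater.

```python
def assumed_alleles(copy_number):
    """Alleles a parent may transmit under the no-zero-allele assumption."""
    if copy_number < 2:
        # Assumption: CN 0 is 0/0 and CN 1 is 0/1.
        return {0, copy_number}
    alleles = set()
    # For CN >= 2, only splits x/(N-x) with 1 <= x <= N/2 are considered.
    for x in range(1, copy_number // 2 + 1):
        alleles.update((x, copy_number - x))
    return alleles


def is_mendelian_consistent(mother_cn, father_cn, proband_cn):
    """True if one allele from each parent can sum to the proband CN."""
    return any(
        m + f == proband_cn
        for m in assumed_alleles(mother_cn)
        for f in assumed_alleles(father_cn)
    )


# (mother, father, proband) copy numbers
for trio in [(2, 2, 2), (2, 2, 1), (3, 2, 4), (3, 2, 2), (2, 0, 2)]:
    print(trio, is_mendelian_consistent(*trio))
```

Without the assumption, a mother with copy number 2 could be 0/2 and transmit a zero allele, which would make the (2, 2, 1) trio consistent; the assumption is what turns it into a conflict.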
If a joint segment has a Mendelian inheritance conflict, a Phred-scaled de novo quality score (DQ field in the VCF) is calculated using the likelihoods for each copy number state (see Quality Scoring section) of each sample in the trio, combined with a prior for the trio genotypes:
Where
The DN field in the VCF is used to indicate the de novo status for each segment. Possible values are:
Inherited - the called trio genotype is consistent with Mendelian inheritance
LowDQ - the called trio genotype is inconsistent with Mendelian inheritance and DQ is less than the de novo quality threshold (default 0.125)
DeNovo - the called trio genotype is inconsistent with Mendelian inheritance and DQ is greater than or equal to the de novo quality threshold (default 0.125)
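The three DN values amount to a simple decision rule, sketched below. The function name and argument names are hypothetical; the threshold default mirrors the documented default of --cnv-filter-de-novo-qual.

```python
def de_novo_status(mendelian_consistent, dq=None, threshold=0.125):
    """Map a trio call to the DN value described above."""
    if mendelian_consistent:
        # No conflict, so no de novo quality score is needed.
        return "Inherited"
    # Conflict: compare DQ against the de novo quality threshold.
    return "DeNovo" if dq >= threshold else "LowDQ"


print(de_novo_status(True))
print(de_novo_status(False, dq=0.05))
print(de_novo_status(False, dq=0.125))
```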
The records in a multisample CNV VCF differ slightly from the single sample case. The major differences are as follows:
The per-record entries are broken down into the segments among the union of all the input samples' breakpoints, which means there are more entries in the overall VCF.
The QUAL column is not used and its value is ".". The per-sample quality is carried over into the SAMPLE columns with the QS tag.
The FILTER column indicates PASS if any of the individual SAMPLE columns PASS. Otherwise, it indicates SampleFT.
The per-sample annotations are carried over from their originating calls. The single sample filters are applied at the sample level and are emitted in the FT annotation.
Additionally, if a valid pedigree is used, then de novo calling is performed, which adds two annotations to the proband sample: DQ (the de novo quality score) and DN (the de novo status).
While the VCF contains many entries, due to the joint segmentation stage, the number of de novo events can be found by extracting entries that have a DN and DQ annotation. These records are also extracted and are converted to GFF3 in the de novo calling case.
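One way to count de novo events from the multisample VCF is sketched below, assuming the standard VCF layout in which the FORMAT keys are in the ninth column and the sample columns follow. The function name, the example record, and the FORMAT keys other than DN, DQ, and FT are made up for illustration.

```python
def count_de_novo_records(vcf_lines, proband_column):
    """Count records whose proband sample carries DN=DeNovo and a DQ value.

    proband_column is the 0-based index of the proband's column in the
    tab-separated record (sample columns start at index 9).
    """
    count = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        keys = fields[8].split(":")
        values = fields[proband_column].split(":")
        annotations = dict(zip(keys, values))
        if annotations.get("DN") == "DeNovo" and "DQ" in annotations:
            count += 1
    return count


# Hypothetical records: mother, father, proband in columns 9, 10, 11.
records = [
    "##fileformat=VCFv4.2",
    "chr1\t1000\t.\tN\t<DEL>\t.\tPASS\tEND=2000\tCN:FT:DN:DQ\t2:PASS:.:.\t2:PASS:.:.\t1:PASS:DeNovo:30",
    "chr1\t5000\t.\tN\t<DUP>\t.\tPASS\tEND=9000\tCN:FT:DN:DQ\t3:PASS:.:.\t2:PASS:.:.\t3:PASS:Inherited:.",
]
print(count_de_novo_records(records, proband_column=11))
```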
Assumed possible copy number alleles for each parent copy number genotype:

Parent Copy Number Genotype | Possible Copy Number Alleles | Assumed Possible Copy Number Alleles |
---|---|---|
2 | 0/2, 1/1 | 1/1 |
3 | 0/3, 1/2 | 1/2 |
4 | 0/4, 1/3, 2/2 | 1/3, 2/2 |
N | x/(N-x) for x <= N/2 | x/(N-x) for 1 <= x <= N/2 |

Examples of consistent and inconsistent copy number genotypes for diploid regions:

Mother Copy Number | Father Copy Number | Proband Copy Number | Mendelian Consistent? |
---|---|---|---|
2 | 2 | 2 | Yes |
2 | 2 | 1 | No |
3 | 2 | 4 | No |
3 | 2 | 2 | Yes |
2 | 0 | 2 | No |

The terms in the de novo quality score (DQ) calculation are the set of all genotypes, the set of conflicting genotypes, the Mother copy number, the Father copy number, the Proband copy number, and the prior for the trio genotype.
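The de novo quality score (DQ) described earlier combines per-sample copy number likelihoods with a trio prior over all genotypes and over the conflicting genotypes. The sketch below shows one plausible reading of that description: the Phred-scaled probability that the trio genotype is not a conflict. This exact form is an assumption rather than the documented formula, and the function name, the prior, and the conflict test are hypothetical inputs.

```python
import math
from itertools import product


def de_novo_quality(lik_mother, lik_father, lik_proband, prior, is_conflict):
    """Assumed form: DQ = -10 * log10(1 - P(conflict | data)).

    lik_*: per-sample likelihoods indexed by copy number state.
    prior(m, f, p): prior for a trio genotype.
    is_conflict(m, f, p): True for Mendelian-inconsistent genotypes.
    """
    total = 0.0
    conflict = 0.0
    states = product(
        range(len(lik_mother)), range(len(lik_father)), range(len(lik_proband))
    )
    for m, f, p in states:
        weight = prior(m, f, p) * lik_mother[m] * lik_father[f] * lik_proband[p]
        total += weight  # sum over all trio genotypes
        if is_conflict(m, f, p):
            conflict += weight  # sum over conflicting genotypes only
    if total == 0.0:
        raise ValueError("all trio genotypes have zero weight")
    p_conflict = conflict / total
    if p_conflict >= 1.0:
        return math.inf
    return -10.0 * math.log10(1.0 - p_conflict)


# Toy example: both parents are certainly CN 2; proband is CN 1 with 0.9.
dq = de_novo_quality(
    lik_mother=[0.0, 0.0, 1.0],
    lik_father=[0.0, 0.0, 1.0],
    lik_proband=[0.0, 0.9, 0.1],
    prior=lambda m, f, p: 1.0,            # flat prior, for illustration
    is_conflict=lambda m, f, p: p != 2,   # toy conflict rule for CN 2 parents
)
print(round(dq, 3))
```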