Oncovirus Detection

Overview

The DRAGEN oncovirus detection analysis can detect the presence of oncoviruses, whether they have integrated into the human genome, and at what locations. The analysis takes unmapped reads as input, uses the DRAGEN k-mer classifier to identify whether each read is from an oncovirus, and determines which reference sequence the read matches best. It writes a TSV file describing which oncoviruses were detected.

An oncovirus is considered detected if it passes a read count threshold and has at least one reference that passes its k-mer fraction threshold (described in more detail below).

Any oncovirus that is determined to be present is further analyzed by the DRAGEN SV caller. Assembled SV breakends are aligned to oncoviral references identified by k-mer classification. Integration sites discovered by this process are included in the SV VCF file.

Oncovirus detection can be enabled with WGS, WES, and panels, but it is expected to perform best with WGS and panels with oncoviral probes. Integration site detection has not been evaluated outside of WGS.

Database

Oncovirus detection requires resource files that can be downloaded from the DRAGEN Secondary Analysis Product Files page. This set of resource files is referred to below as the oncovirus database.

The downloaded tar.gz file will need to be unpacked:

tar xzvf oncovirus-detection-files.tar.gz

The unpacked md5sum file can be used to check the integrity of the other unpacked files.
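The integrity check can be sketched with the standard `md5sum -c` command. This is a minimal sketch: the helper name `verify_checksums` is hypothetical, and it assumes the checksum file lists paths relative to the directory it was unpacked into, which this page does not state.

```shell
# verify_checksums <unpack-directory> <checksum-file-name>
# Hypothetical helper: checks every file listed in the checksum file.
# Assumes the listed paths are relative to <unpack-directory>.
verify_checksums() {
  # Run in a subshell so the caller's working directory is unchanged.
  ( cd "$1" && md5sum -c "$2" )
}
```

The function returns a non-zero status if any listed file is missing or its checksum does not match. Call it with the directory where the archive was unpacked and the name of the unpacked md5sum file.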

A subdirectory named after the version of the database (e.g. "1.0.0") is also unpacked. Pass the path to this subdirectory to the --oncovirus-detection-db command line argument.

Oncovirus Presence

The detection of oncoviruses in a sample is enabled with --enable-oncovirus-detection=true and by providing the database path with --oncovirus-detection-db=/path/to/directory/. An example command is given below where tumor and normal sample reads are analyzed for the presence of oncoviral sequences:

dragen \
  --enable-oncovirus-detection true \
  --oncovirus-detection-db $db \
  --tumor-fastq-list $tumorFastqList \
  --fastq-list $normalFastqList \
  --ref-dir $ref \
  --output-file-prefix $prefix \
  --output-directory $out

Enabling oncovirus detection will create an output TSV file at $out/$prefix.oncovirus_detections.tsv with the fields described below. Empty values are denoted in the TSV with a hyphen.

| Field | Description |
| --- | --- |
| oncovirus | Virus name |
| sample | Name of the sample |
| detected | Value is "detected" if virus metrics are above thresholds |
| oncovirus_read_count | Number of reads that classified to the virus and its references |
| best_match_ref_accession | Accession of the reference with the highest k-mer fraction |
| best_match_ref_read_count | Number of reads that classified to the best-match reference |
| best_match_ref_kmer_fraction | Fraction of k-mers detected for the best-match reference |
| best_match_ref_length | Length of the best-match reference |
| best_match_ref_completeness | Length of the best-match reference compared to the RefSeq reference for this virus; capped at 1.0 |
| best_primary_ref_accession | Accession of the primary (e.g. RefSeq) reference with the highest k-mer fraction |
| best_primary_ref_read_count | Number of reads that classified to the best-match primary reference |
| best_primary_ref_kmer_fraction | Fraction of k-mers detected for the best-match primary reference |
| best_primary_ref_length | Length of the best-match primary reference |
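As an illustration of working with this file, the sketch below lists the rows whose detected field is "detected". It assumes the TSV starts with a header row whose column names match the field names above, which this page does not confirm; the helper name `list_detected` is hypothetical.

```shell
# list_detected <detections.tsv>
# Hypothetical helper: prints sample, oncovirus, and best-match accession
# (tab-separated) for every row marked "detected".
# Assumes the first line is a header containing the field names.
list_detected() {
  awk -F '\t' '
    # Header row: remember the column number of each field name.
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Data rows: keep only those whose "detected" column says detected.
    $(col["detected"]) == "detected" {
      print $(col["sample"]) "\t" $(col["oncovirus"]) "\t" $(col["best_match_ref_accession"])
    }
  ' "$1"
}
```

For example, `list_detected "$out/$prefix.oncovirus_detections.tsv"` would print one line per detected virus. Looking columns up by name rather than position keeps the sketch independent of the column order.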

In order to be considered detected, an oncovirus must pass a read count threshold and have at least one reference that passes its k-mer fraction threshold.

The k-mer fraction quantifies how much of a reference sequence is supported by the sequencing data. First, all canonical k-mers are enumerated from the reference sequence. The k-mer fraction is then calculated as the proportion of these reference k-mers that are observed at least once in the reads. A value close to 1 indicates broad coverage across the reference, whereas lower values indicate partial or sparse support.
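The calculation described above can be sketched as follows. This is a rough illustration, not DRAGEN's implementation. The function name `kmer_fraction`, the value of k, and the input layout (one uppercase A/C/G/T sequence per line, with a non-empty reference file) are assumptions. "Canonical" is taken to mean the lexicographically smaller of a k-mer and its reverse complement, a common convention that this page does not confirm.

```shell
# kmer_fraction <k> <reference.txt> <reads.txt>
# Hypothetical helper. Both files hold one uppercase A/C/G/T sequence per line.
kmer_fraction() {
  awk -v k="$1" '
    # Reverse complement of an A/C/G/T string.
    function revcomp(s,    i, out) {
      out = ""
      for (i = length(s); i >= 1; i--) out = out comp[substr(s, i, 1)]
      return out
    }
    # Canonical form: the smaller of a k-mer and its reverse complement.
    function canonical(s,    rc) {
      rc = revcomp(s)
      return (s < rc) ? s : rc
    }
    BEGIN { comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A" }
    # First file: enumerate the distinct canonical k-mers of the reference.
    FNR == NR {
      for (i = 1; i + k - 1 <= length($0); i++) ref[canonical(substr($0, i, k))] = 0
      next
    }
    # Second file: mark each reference k-mer that occurs in any read.
    {
      for (i = 1; i + k - 1 <= length($0); i++) {
        km = canonical(substr($0, i, k))
        if (km in ref) ref[km] = 1
      }
    }
    # Fraction = observed reference k-mers / all reference k-mers.
    END {
      for (km in ref) { total++; seen += ref[km] }
      if (total > 0) printf "%.3f\n", seen / total; else print "NA"
    }
  ' "$2" "$3"
}
```

For example, with k = 4, a reference of AAACCCGG, and a single read GGGTT, two of the five distinct canonical reference k-mers (AACC and ACCC) are observed, so the function prints 0.400.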

Included Oncoviruses and Thresholds

| Virus Name | Read Count Threshold | K-mer Fraction Threshold | Database Reference Count |
| --- | --- | --- | --- |
| Epstein-Barr virus (EBV) | 5 | 0.05 | 196 |
| Hepatitis B virus (HBV) | 5 | 0.05 | 5493 |
| Hepatitis C virus (HCV) | 5 | 0.05 | 3293 |
| Human papillomavirus (25+ types)* | 5 | 0.25 | 310 |
| Human T-lymphotropic virus 1 (HTLV-1) | 5 | 0.05 | 11 |
| Kaposi's sarcoma-associated herpesvirus (KSHV) | 5 | 0.05 | 54 |
| Merkel cell polyomavirus (MCPyV) | 5 | 0.05 | 13 |

*Classifications are HPV6, HPV11, HPV16, HPV18, HPV26, HPV31, HPV33, HPV35, HPV39, HPV40, HPV42, HPV43, HPV44, HPV45, HPV51, HPV52, HPV53, HPV54, HPV56, HPV58, HPV59, HPV61, HPV66, HPV68, HPV69, HPV70, HPV73, HPV82, Other HPV
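The two-part detection rule can be sketched as a small check that takes a virus's read count and best k-mer fraction together with its thresholds from the table. The helper name `passes_thresholds` is hypothetical, and treating a value equal to the threshold as passing is an assumption; this page does not say whether the comparison is inclusive.

```shell
# passes_thresholds <read_count> <kmer_fraction> <read_count_threshold> <kmer_fraction_threshold>
# Hypothetical helper: succeeds (status 0) only if both metrics pass.
# Assumption: a value equal to its threshold counts as passing.
passes_thresholds() {
  awk -v n="$1" -v f="$2" -v min_n="$3" -v min_f="$4" \
    'BEGIN { exit !((n + 0 >= min_n + 0) && (f + 0 >= min_f + 0)) }'
}
```

For example, `passes_thresholds 12 0.31 5 0.25 && echo detected` applies the human papillomavirus thresholds to a read count of 12 and a k-mer fraction of 0.31. A hyphen (the empty-value marker in the TSV) is treated as 0 and therefore fails.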

Integration Site Detection

When the SV caller is enabled alongside oncovirus detection, DRAGEN can call sites where oncoviral sequences have integrated into the human genome and report them in the SV VCF output. For details on enabling and interpreting viral integration site detection, see Viral Integration Site Detection in the SV Calling documentation.
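A minimal sketch of such a run is shown below. It mirrors the tumor-normal example from the Oncovirus Presence section and adds `--enable-sv true` to turn on the SV caller. The wrapper function name and its positional parameters are hypothetical, and any further options needed for integration site calling are covered in the SV Calling documentation rather than here.

```shell
# run_oncovirus_with_sv <db_dir> <tumor_fastq_list> <normal_fastq_list> <ref_dir> <prefix> <out_dir>
# Hypothetical wrapper: oncovirus detection with the SV caller enabled.
run_oncovirus_with_sv() {
  dragen \
    --enable-oncovirus-detection true \
    --oncovirus-detection-db "$1" \
    --enable-sv true \
    --tumor-fastq-list "$2" \
    --fastq-list "$3" \
    --ref-dir "$4" \
    --output-file-prefix "$5" \
    --output-directory "$6"
}
```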

Command Line Arguments

| Argument | Type | Description | Default |
| --- | --- | --- | --- |
| enable-oncovirus-detection | bool | Enables detection of oncoviruses | false |
| oncovirus-detection-db | string | Path to directory containing resource files | empty string |
| oncovirus-detection-all-reads | bool | Enable to use all reads instead of just unmapped reads | false |
| oncovirus-detection-softclipped-reads | bool | Enable to keep softclipped reads in addition to unmapped reads | false |
| oncovirus-detection-below-threshold | bool | Enable to include below-threshold viruses in detections TSV | false |
| oncovirus-detection-enable-read-output* | bool | Enable to write the per-read output file | false |
| oncovirus-detection-num-threads | int | Number of threads to use for processing reads | 8 |

*Note that when --oncovirus-detection-enable-read-output=true, --oncovirus-detection-num-threads must be set to 1 to ensure the per-read output file is properly formed.
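The constraint in the note can be sketched as a run that sets both options together. The command mirrors the tumor-normal example from the Oncovirus Presence section; the wrapper function name and its positional parameters are hypothetical.

```shell
# run_with_read_output <db_dir> <tumor_fastq_list> <normal_fastq_list> <ref_dir> <prefix> <out_dir>
# Hypothetical wrapper: per-read output enabled, so the thread count is pinned to 1.
run_with_read_output() {
  dragen \
    --enable-oncovirus-detection true \
    --oncovirus-detection-db "$1" \
    --oncovirus-detection-enable-read-output true \
    --oncovirus-detection-num-threads 1 \
    --tumor-fastq-list "$2" \
    --fastq-list "$3" \
    --ref-dir "$4" \
    --output-file-prefix "$5" \
    --output-directory "$6"
}
```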
