Genome definition file formats

The genome definition file is a BED-like tab-separated values (TSV) file with no header row. It consists of the following columns:

  1. chrom: the sequence name, exactly as it appears in the Custom Reference FASTA

  2. chromStart: start position (always set to 0)

  3. chromEnd: end position (sequence length)

  4. genomeName: full name of the virus the sequence belongs to (e.g. Monkeypox virus clade II)

  5. (optional) segmentName: how this sequence is labeled within the virus (e.g. Segment 4 (HA)). Set to 'Full' if the sequence is the full genome.
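The column layout above can be produced directly from a FASTA file. The following is a minimal sketch, not an official tool: the function names and the `labels` mapping are hypothetical, and it assumes the sequence name is the first whitespace-delimited token of each FASTA header line.

```python
def fasta_lengths(fasta_text):
    """Return {sequence_name: length} from FASTA text, in file order."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the sequence name is the first token of the header.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def genome_definition_tsv(fasta_text, labels):
    """Build the definition file text.

    `labels` (hypothetical) maps sequence name -> (genomeName, segmentName),
    where segmentName may be None to omit the optional 5th column.
    """
    rows = []
    for name, length in fasta_lengths(fasta_text).items():
        genome, segment = labels[name]
        # chromStart is always 0; chromEnd is the sequence length.
        fields = [name, "0", str(length), genome]
        if segment is not None:
            fields.append(segment)
        rows.append("\t".join(fields))
    return "\n".join(rows) + "\n"
```

Writing the returned text to a file gives one row per FASTA sequence, with no header row.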

Guidelines

  • This file affects how sequences are labeled in the output.

  • Sequence names must match those in the Custom Reference FASTA exactly, and both files must contain the same set of sequences.

  • If the file describes multiple viruses, each virus should have a unique name in the 4th column. For example, two different Influenza genomes should not share the same genomeName. Segments of the same genome share one name.

  • If the Custom Reference FASTA includes sequences from multiple segments, providing this BED file is recommended. Otherwise, each segment is treated independently, and some segments may not be used as references.
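Several of these guidelines can be checked before the file is submitted. The sketch below is illustrative only: `check_genome_definition` and its `sequence_lengths` argument (a mapping of FASTA sequence name to length) are hypothetical names, and the checks cover only the rules stated on this page.

```python
def check_genome_definition(tsv_text, sequence_lengths):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    seen = set()
    for number, line in enumerate(tsv_text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split("\t")
        # Four required columns plus the optional segmentName.
        if len(fields) not in (4, 5):
            problems.append(f"line {number}: expected 4 or 5 columns, found {len(fields)}")
            continue
        chrom, start, end = fields[0], fields[1], fields[2]
        seen.add(chrom)
        if start != "0":
            problems.append(f"line {number}: chromStart should be 0, found {start}")
        if chrom not in sequence_lengths:
            problems.append(f"line {number}: {chrom} is not in the FASTA")
        elif end != str(sequence_lengths[chrom]):
            problems.append(f"line {number}: chromEnd {end} does not match the sequence length")
    # Both files must contain the same set of sequences.
    for name in sequence_lengths:
        if name not in seen:
            problems.append(f"{name}: in the FASTA but missing from the definition file")
    return problems
```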

Example

NC_012532.1	0	10794	Zika	Full
NC_007373.1	0	2341	Influenza A virus (H3N2)	Segment 1 (PB2)
NC_007372.1	0	2341	Influenza A virus (H3N2)	Segment 2 (PB1)
NC_007371.1	0	2233	Influenza A virus (H3N2)	Segment 3 (PA+PA-X)
NC_007366.1	0	1762	Influenza A virus (H3N2)	Segment 4 (HA)
NC_007369.1	0	1566	Influenza A virus (H3N2)	Segment 5 (NP)
NC_007368.1	0	1467	Influenza A virus (H3N2)	Segment 6 (NB+NA)
NC_007367.1	0	1027	Influenza A virus (H3N2)	Segment 7 (M1+M2)
NC_007370.1	0	890	Influenza A virus (H3N2)	Segment 8 (NS1+NEP)
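To see how the 4th column ties segments to one genome, the rows can be grouped by genomeName. This is a rough sketch with a hypothetical helper name, assuming the columns are separated by real tab characters; it does not describe how the software itself processes the file.

```python
from collections import defaultdict


def segments_by_genome(tsv_text):
    """Group (sequence name, segment label) pairs by genomeName (column 4)."""
    groups = defaultdict(list)
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        chrom, genome = fields[0], fields[3]
        # The 5th column is optional, so fall back to None when it is absent.
        segment = fields[4] if len(fields) > 4 else None
        groups[genome].append((chrom, segment))
    return dict(groups)
```

Applied to the example above, this would yield one entry for Zika with a single sequence and one entry for Influenza A virus (H3N2) with eight segments.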
