Custom reference

In addition to the built-in options, DRAGEN Microbial Amplicon supports the use of custom reference genomes and primer definitions. These files must be uploaded to a BaseSpace Project before they can be used. See https://help.basespace.illumina.com/manage-data/import-data for more information about importing files into BaseSpace.

In the app input form, select the 'Custom' option for 'Amplicon Primer Set'. Then expand the 'Custom Reference' settings to provide the following:

Custom Reference FASTA for Consensus Generation (required)
Custom Reference BED (optional)
Custom PCR Primer Definitions (optional)

Custom reference FASTA

If the 'Custom' option is selected for 'Amplicon Primer Set', the user must provide a custom FASTA file containing one or more reference sequences to serve as the target for read alignment and as the basis for generating consensus sequences. The software generates the required DRAGEN hash tables and other auxiliary files automatically, so there is no need to process the FASTA file with a separate app. Note that some of the reference sequences provided in the FASTA file may not be used for read alignment and consensus sequence generation.

Custom reference BED

Optionally, a reference BED file may be provided to add information about each reference sequence in the FASTA file, such as human-readable names to be used in the reports. For multi-segment genomes such as Influenza, this file assigns the segment name to each sequence, which allows the software to group individual segment sequences by genome. See the 'Reference BED file format' section for the format of this file.

Custom PCR primer definitions

Optionally, a TSV file may be provided to define the primer sequences or binding locations, which are used for two purposes:

Primer sequences are trimmed from reads, which separates sequences that may come from the primers themselves (which we do not want) from sequences contributed by the biological sample (which we do want). This reduces reference bias that can incorrectly lower the observed allele frequency of true sequence variants in primer binding sites.
Primer locations are used to define the amplicons expected from PCR reactions. The read coverage within the unique (non-overlapping) amplicon regions is used to determine whether each amplicon is reliably detected. The percentage of detected amplicons is used to determine whether sufficient material exists to accurately call variants and generate consensus sequences from the sample.

See the following pages for further information:

Nextclade datasets

Optionally, one or more Nextclade datasets can be selected to use for phylogenetic analysis of the consensus sequences generated from the samples. Every selected dataset will be applied to every consensus sequence generated in every sample.

Reference BED file format

A BED-like tab-separated value (TSV) file with no header row and with 4 or 5 columns:

accession: each sequence accession as it appears in the Custom Reference FASTA header
start: start position (always set to 0)
end: end position (sequence length)
genome: full name of the virus the sequence belongs to (e.g. Influenza A H1N1)
(optional) segment: how this sequence is labeled within the virus (e.g. Segment 4 (HA)). Set to 'Full' if the sequence is the full genome

Guidelines

This file affects how sequences are labeled in the output.
Sequence names must match those in Custom Reference FASTA. The same set of sequences must appear in both.
If there are multiple viruses, their names should be unique. For example, if there are multiple Influenza genomes, they should not be labeled with the same virus name in the 4th column.
If the Custom Reference FASTA includes sequences from multiple segments, it is strongly recommended to provide this BED file. Otherwise, each segment will be treated independently and not all of them may be used as reference.

Example

NC_012532.1 0   10794   Zika    Full
NC_007373.1 0   2341    Influenza A virus (H3N2)    Segment 1 (PB2)
NC_007372.1 0   2341    Influenza A virus (H3N2)    Segment 2 (PB1)
NC_007371.1 0   2233    Influenza A virus (H3N2)    Segment 3 (PA+PA-X)
NC_007366.1 0   1762    Influenza A virus (H3N2)    Segment 4 (HA)
NC_007369.1 0   1566    Influenza A virus (H3N2)    Segment 5 (NP)
NC_007368.1 0   1467    Influenza A virus (H3N2)    Segment 6 (NB+NA)
NC_007367.1 0   1027    Influenza A virus (H3N2)    Segment 7 (M1+M2)
NC_007370.1 0   890     Influenza A virus (H3N2)    Segment 8 (NS1+NEP)
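
The column definitions and guidelines above can be checked before upload with a short script. The following is a minimal sketch, not part of the app: the function names fasta_lengths and check_reference_bed are hypothetical, and it assumes that the accession is the first whitespace-delimited token of each FASTA header line, which may not match how the software reads headers.

```python
def fasta_lengths(fasta_text):
    """Map each accession to its sequence length.

    Assumption: the accession is the first token after '>' in the header.
    """
    lengths = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def check_reference_bed(bed_text, fasta_text):
    """Return a list of problems found in a reference BED file."""
    lengths = fasta_lengths(fasta_text)
    problems = []
    seen = set()
    for number, line in enumerate(bed_text.splitlines(), start=1):
        if not line.strip():
            continue
        columns = line.split("\t")
        if len(columns) not in (4, 5):
            problems.append(
                f"line {number}: expected 4 or 5 tab-separated columns, "
                f"found {len(columns)}"
            )
            continue
        accession, start, end = columns[0], columns[1], columns[2]
        seen.add(accession)
        if accession not in lengths:
            problems.append(f"line {number}: {accession} is not in the FASTA")
            continue
        if start != "0":
            problems.append(f"line {number}: start for {accession} must be 0")
        if end != str(lengths[accession]):
            problems.append(
                f"line {number}: end for {accession} is {end}, "
                f"but the sequence length is {lengths[accession]}"
            )
    # The same set of sequences must appear in both files.
    for accession in sorted(set(lengths) - seen):
        problems.append(f"{accession} is in the FASTA but not in the BED")
    return problems
```

An empty list means that none of these checks failed. Whether virus names in the 4th column are appropriately unique still has to be reviewed by hand.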

PCR Primer definition file formats

If primer coordinates are known (recommended)

Provide a BED file with at least 4 columns: accession, start, end, primerName. The additional columns pool, strand, and sequence can be included, but they must appear in that order.

For example, accession, start, end, primerName, pool for 5-column BED format:

And accession, start, end, primerName, pool, strand, sequence for 7-column BED format:

If primer coordinates are unknown

Option 1. One line per amplicon with 3 columns: ampliconName, forwardSequence, reverseSequence.

Option 2. One line per primer with 3 columns: primerName, sequence, pool.

Formatting rules

General
- All text is case sensitive.
- Any line starting with '#' is ignored. This can be used to add a header line with column names.
- Every line must have the same number of columns and format (except those starting with '#').
- Any number of spaces can separate the columns. A value within a single column should not contain any spaces.
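
The general rules above can be sketched as a small reader. This is an illustration under stated assumptions, not the parser used by the app: the function name read_primer_rows is hypothetical, blank lines are skipped (the rules do not mention them), and columns are split on any run of whitespace, which treats tabs the same as spaces.

```python
def read_primer_rows(text):
    """Split a primer definition file into rows of column values."""
    rows = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Lines starting with '#' are ignored; skipping blank lines is an assumption.
        if line.startswith("#") or not line.strip():
            continue
        columns = line.split()  # any run of whitespace separates columns
        if rows and len(columns) != len(rows[0]):
            raise ValueError(
                f"line {number}: found {len(columns)} columns, "
                f"expected {len(rows[0])}"
            )
        rows.append(columns)
    return rows
```

Values are returned as strings with their case unchanged, since all text is case sensitive.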
BED format
- Per standard BED conventions, sequence coordinates are given as 0-based, half-open intervals: the start field (2nd column) is the position of the first nucleotide in the primer binding site, and the end field (3rd column) is the position of the last nucleotide in the primer binding site plus 1.
- The accession field must contain a sequence identifier that matches the FASTA header of the sequence that the coordinates are relative to.
- Multiple sequence identifiers (accession) are permitted within one file.
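
As a quick illustration of the 0-based, half-open convention: Python slices use the same convention, so a BED start and end can be used directly to extract a binding site. The function names and the 10-nucleotide reference below are made up for the example.

```python
def binding_site(reference, start, end):
    """Return the nucleotides covered by a BED interval (0-based, half-open)."""
    return reference[start:end]


def to_one_based_inclusive(start, end):
    """Convert a BED interval to 1-based, inclusive coordinates."""
    return start + 1, end


reference = "ACGTACGTAC"  # hypothetical sequence, for illustration only
site = binding_site(reference, 2, 6)        # positions 2, 3, 4 and 5
one_based = to_one_based_inclusive(2, 6)    # same site as 1-based positions
```

The length of the binding site is always end minus start, here 4 nucleotides.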
Primer name
- primerName must be unique and encode the name of the amplicon for which the primer is designed, the direction tag indicating which side of the amplicon, left or right, the primer belongs to, and an optional indicator that the primer is an alternative primer for that amplicon.
- In addition to _LEFT and _RIGHT, we permit _L and _R as direction tags in primerName. Any text after the direction tag should be separated by an underscore.
- Text in primerName before the direction tag is considered to be an amplicon identifier. Ensure that the text of the amplicon identifier is unique for that amplicon and that the direction tag occurs only once in primerName.
- Each amplicon must have at least one left and right primer (including alternative primers) associated with it.
- Alternative primers are used to bind to locations that avoid sequence variation in the default primer binding site that may disrupt hybridization. An amplicon may have an arbitrary number of alternative primers (as long as each primer name is unique), but most amplicons will have none. Alternative primers are indicated by the presence of _alt after the direction tag in primerName, followed by optional text to distinguish between different alternative primers, such as a number.
- Examples of valid primer names:
  - MY_SEQUENCE_434_A_LEFT
  - virus1_L
  - amplicon_4934m_RIGHT_alt
  - amplicon_4934m_RIGHT_alt1
  - amplicon_4934m_R_altprimerB
- Examples of invalid primer names:
  - LEFT_MY_SEQUENCE_434_A
  - virus1_l
  - amplicon_4934m_RIGHT_L
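
The primer naming rules above can be sketched as a small parser. This is one possible reading of the rules, not the app's implementation: the names parse_primer_name and amplicons_missing_a_side are hypothetical, and the sketch assumes that a direction tag is any underscore-separated token that is exactly LEFT, RIGHT, L, or R.

```python
from typing import NamedTuple, Optional

# Direction tags are case sensitive, so "l" or "left" do not match.
DIRECTION_TAGS = {"LEFT": "left", "L": "left", "RIGHT": "right", "R": "right"}


class PrimerName(NamedTuple):
    amplicon: str   # text before the direction tag
    side: str       # "left" or "right"
    is_alt: bool    # True if the text after the tag starts with "alt"


def parse_primer_name(name: str) -> Optional[PrimerName]:
    """Return the parsed name, or None if it breaks the naming rules."""
    tokens = name.split("_")
    tag_positions = [i for i, t in enumerate(tokens) if t in DIRECTION_TAGS]
    if len(tag_positions) != 1:
        return None  # the direction tag must occur exactly once
    position = tag_positions[0]
    amplicon = "_".join(tokens[:position])
    if not amplicon:
        return None  # the amplicon identifier must come before the tag
    suffix = "_".join(tokens[position + 1:])
    side = DIRECTION_TAGS[tokens[position]]
    return PrimerName(amplicon, side, suffix.startswith("alt"))


def amplicons_missing_a_side(names):
    """List amplicons that lack a left or a right primer."""
    sides = {}
    for name in names:
        parsed = parse_primer_name(name)
        if parsed is not None:
            sides.setdefault(parsed.amplicon, set()).add(parsed.side)
    return sorted(a for a, s in sides.items() if s != {"left", "right"})
```

Under this reading, the five valid example names above parse successfully and the three invalid ones return None. Uniqueness of primer names across the whole file is not checked here and would need a separate pass.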