
Custom genomes and primer sets

In addition to the built-in options, DRAGEN Targeted Microbial supports custom reference genomes and primer definitions. Upload these files to a BaseSpace Project before use; see https://help.basespace.illumina.com/manage-data/import-data for more information about importing files into BaseSpace. Custom files can be used with both Enrichment and Amplicon libraries by choosing the 'Custom' option for 'Enrichment Panel' or 'Amplicon Primer Set', respectively. Expand the 'Custom Reference' settings block to access the options for custom files. The following controls apply to each experiment type:

Custom Enrichment Panel

Custom Reference FASTA for Consensus Generation (required)
Custom Reference BED (optional)

Custom Amplicon Primer Set

Custom Reference FASTA for Consensus Generation (required)
Custom Reference BED (optional)
Custom PCR Primer Definitions (optional)

Custom genome references

The user may provide one or more reference genomes as the target for read alignment (and as the basis for generating consensus sequences). At a minimum, the user must provide a FASTA file containing the sequences of the reference genomes. The software will generate the required DRAGEN hash tables and other auxiliary files automatically, so there is no need to process the FASTA file with a separate app. Use the 'Custom Reference FASTA for Consensus Generation' control to select the previously-uploaded FASTA file containing the reference sequences.

Optionally, a genome definition BED file may also be provided. It gives the software more information about each sequence, such as a human-readable common name to be used in the reports. For multi-segment genomes such as Influenza, the genome definition file provides the segment name of each sequence and indicates that all the segments of a single genome belong together. Use the 'Custom Reference BED' control to select the previously-uploaded BED file containing the genome definition. See 'Genome definition file formats' below for a description of the format of this file.

Custom primer sets

For amplicon experiments, the user may optionally provide a file that defines the primer sequences or locations. The primers defined in this file are used for two purposes:

The primer binding locations are used to trim reads, which separates sequence data contributed by the primer sequences themselves (which we do not want) from sequence data contributed by the sample (which we do want). This is important to avoid reference bias that can depress the observed allele frequency of sequence variants in primer binding sites.
The primers are matched to define the boundaries of the expected amplicons resulting from the PCR reaction. The read coverage within the unique (non-overlapping) regions of these amplicons is used to determine whether each amplicon is reliably observed. The fraction of observed amplicons is a function of the concentration of the sample, and is used to determine whether the sample contains sufficient material to reliably and accurately call variants and generate a consensus sequence. See 'Special considerations for amplicon sequencing with IMAP protocols' for a more in-depth discussion.
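The second purpose can be illustrated with a minimal Python sketch. It is not the software's implementation: the function names, the use of mean depth, and the `min_mean_depth` threshold are assumptions made for illustration, and all amplicons are assumed to lie on one reference sequence with 0-based, half-open boundaries.

```python
from typing import Dict, List, Sequence, Tuple

Interval = Tuple[int, int]  # 0-based, half-open [start, end)


def unique_regions(amplicons: Dict[str, Interval]) -> Dict[str, List[Interval]]:
    """For each amplicon, return the parts not overlapped by any other amplicon."""
    result = {}
    for name, (start, end) in amplicons.items():
        pieces = [(start, end)]
        for other, (o_start, o_end) in amplicons.items():
            if other == name:
                continue
            remaining = []
            for p_start, p_end in pieces:
                if o_start > p_start:  # part to the left of the other amplicon
                    remaining.append((p_start, min(p_end, o_start)))
                if o_end < p_end:      # part to the right of the other amplicon
                    remaining.append((max(p_start, o_end), p_end))
            pieces = [(s, e) for s, e in remaining if s < e]
        result[name] = pieces
    return result


def fraction_observed(
    amplicons: Dict[str, Interval],
    depth: Sequence[int],
    min_mean_depth: float = 10.0,  # hypothetical threshold, not a documented value
) -> float:
    """Fraction of amplicons whose unique region reaches the mean-depth threshold."""
    observed = 0
    for pieces in unique_regions(amplicons).values():
        bases = [d for s, e in pieces for d in depth[s:e]]
        if bases and sum(bases) / len(bases) >= min_mean_depth:
            observed += 1
    return observed / len(amplicons) if amplicons else 0.0


# Three overlapping amplicons; only the first 160 bases are covered.
amplicons = {"A": (0, 100), "B": (80, 180), "C": (160, 260)}
depth = [50] * 160 + [0] * 100
print(unique_regions(amplicons))
print(fraction_observed(amplicons, depth))
```

In this toy example amplicons A and B are counted as observed and C is not, so two of three amplicons are observed.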

Use the 'Custom PCR Primer Definitions' control to select the previously-uploaded primer definition file. The allowed formats for this file are described in 'Primer definition file formats' below.

Required custom input based on reference type for amplicon experiments

| Reference | Example | Required input | Note |
| --- | --- | --- | --- |
| Single non-segmented genome | Zika | Primer set | |
| Single segmented genome | All 8 segments from one Influenza A genome | Primer set | |
| Multiple non-segmented genomes | Multiple genomes of Zika | Reference BED, Primer set | Reference BED must be provided to make it clear that the reference sequences are not segments in the same genome. Otherwise, the pipeline will assume this is a single segmented genome (above). If multiple genomes remain after reference selection, the genome with the best per-amplicon coverage will be considered for sample filtering. |
| Multiple segmented genomes | A collection of Influenza A and B genomes | Reference BED, Primer set | Reference BED must be provided to specify which sequences belong to the same genome. Otherwise, the pipeline will assume this is a single segmented genome. If multiple genomes remain after reference selection, the genome with the best per-amplicon coverage will be considered for sample filtering. |

Genome definition file formats

The genome definition file is a BED-like tab-separated value (TSV) file with no header row. It consists of the following columns:

chrom: each sequence name as it appears in Custom Reference FASTA
chromStart: start position (always set to 0)
chromEnd: end position (sequence length)
genomeName: full name of the virus the sequence belongs to (e.g. Monkeypox virus clade II)
(optional) segmentName: how this sequence is labeled within the virus (e.g. Segment 4 (HA)). Set to 'Full' if the sequence is the full genome

Guidelines

This file affects how sequences are labeled in the output.
Sequence names must match those in Custom Reference FASTA. The same set of sequences must appear in both.
If there are multiple viruses, their names should be unique. For example, if there are multiple Influenza genomes, they should not be labeled with the same virus name in the 4th column.
If the Custom Reference FASTA includes sequences from multiple segments, it is recommended to provide this BED file. Otherwise, each segment will be treated independently and not all of them may be used as reference.

Example

NC_012532.1 0   10794   Zika    Full
NC_007373.1 0   2341    Influenza A virus (H3N2)    Segment 1 (PB2)
NC_007372.1 0   2341    Influenza A virus (H3N2)    Segment 2 (PB1)
NC_007371.1 0   2233    Influenza A virus (H3N2)    Segment 3 (PA+PA-X)
NC_007366.1 0   1762    Influenza A virus (H3N2)    Segment 4 (HA)
NC_007369.1 0   1566    Influenza A virus (H3N2)    Segment 5 (NP)
NC_007368.1 0   1467    Influenza A virus (H3N2)    Segment 6 (NB+NA)
NC_007367.1 0   1027    Influenza A virus (H3N2)    Segment 7 (M1+M2)
NC_007370.1 0   890     Influenza A virus (H3N2)    Segment 8 (NS1+NEP)
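Before uploading, it can help to check a genome definition file against the FASTA file it describes. The following Python sketch is one way to do that under stated assumptions: the function names are hypothetical, and the sequence name is assumed to be the FASTA header text up to the first whitespace, which may differ from how the software reads headers.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def read_fasta_lengths(fasta_text: str) -> Dict[str, int]:
    """Return {sequence name: length}. Name = header up to first whitespace (assumption)."""
    lengths: Dict[str, int] = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def check_genome_definition(
    bed_text: str, fasta_lengths: Dict[str, int]
) -> Tuple[Dict[str, List[Tuple[str, Optional[str]]]], List[str]]:
    """Group sequences by genomeName and collect any problems found."""
    genomes = defaultdict(list)
    problems: List[str] = []
    seen = set()
    for n, line in enumerate(bed_text.splitlines(), 1):
        if not line.strip():
            continue
        row = line.split("\t")  # values may contain spaces, so split on tabs only
        if len(row) not in (4, 5):
            problems.append(f"line {n}: expected 4 or 5 columns, got {len(row)}")
            continue
        chrom, start, end, genome = row[:4]
        segment = row[4] if len(row) == 5 else None
        seen.add(chrom)
        if chrom not in fasta_lengths:
            problems.append(f"line {n}: {chrom} is not in the FASTA file")
        else:
            if start != "0":
                problems.append(f"line {n}: chromStart should be 0")
            if end != str(fasta_lengths[chrom]):
                problems.append(f"line {n}: chromEnd should be {fasta_lengths[chrom]}")
        genomes[genome].append((chrom, segment))
    for missing in sorted(set(fasta_lengths) - seen):
        problems.append(f"{missing} is in the FASTA file but not in the BED file")
    return dict(genomes), problems
```

Each key of the returned dictionary corresponds to one genome, and its list shows which sequences (segments) the file assigns to that genome.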

Primer definition file formats

If primer coordinates are known (recommended)

Provide a BED file with at least 4 columns: chrom, chromStart, chromEnd, primerName. The additional columns pool, strand, and sequence can be included, but they must appear in that order.

For example, a 5-column BED file (chrom, chromStart, chromEnd, primerName, pool):

#chrom  chromStart  chromEnd  primerName     pool
seqX    0           15        primer1_LEFT   1
seqX    1745        1760      primer1_RIGHT  1
seqY    0           15        primer2_LEFT   2
seqY    1015        1030      primer2_RIGHT  2

And a 7-column BED file (chrom, chromStart, chromEnd, primerName, pool, strand, sequence):

#chrom  chromStart  chromEnd  primerName     pool  strand  sequence
seqX    0           15        primer1_LEFT   1     +       GGGCAAACCTAAAGG
seqX    1745        1760      primer1_RIGHT  1     -       GTTATGTAAAGGTGC
seqY    0           15        primer2_LEFT   2     +       GGGCGAAACTAAAGG
seqY    1015        1030      primer2_RIGHT  2     -       GTTATGTAAAGGTGC
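A file in this format can be read with a few lines of Python. The sketch below is an illustration rather than the software's parser: the `Primer` type and `parse_primer_bed` function are hypothetical names, and skipping blank lines is an assumption. It splits on runs of whitespace, ignores lines that start with '#', and requires every data line to have the same number of columns.

```python
from typing import List, NamedTuple, Optional


class Primer(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str
    pool: Optional[str] = None
    strand: Optional[str] = None
    sequence: Optional[str] = None


def parse_primer_bed(text: str) -> List[Primer]:
    """Parse a 4- to 7-column primer BED file into Primer records."""
    primers: List[Primer] = []
    width = None
    for n, line in enumerate(text.splitlines(), 1):
        if not line.strip() or line.startswith("#"):
            continue  # header/comment lines; blank lines skipped by assumption
        fields = line.split()  # any number of spaces separates columns
        if not 4 <= len(fields) <= 7:
            raise ValueError(f"line {n}: expected 4 to 7 columns, got {len(fields)}")
        if width is None:
            width = len(fields)
        elif len(fields) != width:
            raise ValueError(f"line {n}: expected {width} columns, got {len(fields)}")
        chrom, start, end, name, *optional = fields
        start_i, end_i = int(start), int(end)
        if not 0 <= start_i < end_i:
            raise ValueError(f"line {n}: invalid interval {start_i}-{end_i}")
        primers.append(Primer(chrom, start_i, end_i, name, *optional))
    return primers
```

For a 7-column file, a useful sanity check is that `end - start` equals the length of `sequence` for every primer, as it does in the example above.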

If primer coordinates are unknown

Option 1. One line per amplicon with 3 columns: ampliconName, forwardSequence, reverseSequence.

#ampliconName  forwardSequence  reverseSequence
amplicon1      GGGCAAACCTAAAGG  GTTATGTAAAGGTGC
amplicon2      GGGCGAAACTAAAGG  GTTATGTAAAGGTGC

Option 2. One line per primer with 3 columns: primerName, sequence, pool.

#primerName       sequence         pool
primer1_LEFT      GGGCAAACCTAAAGG  1
primer1_LEFT_alt  GGGCGAAACTAAAGG  1
primer1_RIGHT     GTTATGTAAAGGTGC  1
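When only sequences are available, binding coordinates can in principle be recovered by searching the reference for each primer. The Python sketch below shows the idea with exact string matching. It is not a description of how the software matches primers, which is not specified here and may tolerate mismatches. It also assumes that a right (reverse) primer is written 5' to 3' and therefore appears in the reference as its reverse complement; `reverse_complement` and `locate_primer` are hypothetical helper names.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def locate_primer(reference: str, primer: str, is_reverse: bool) -> List[Tuple[int, int]]:
    """Return all exact-match binding sites as 0-based, half-open intervals."""
    target = reverse_complement(primer) if is_reverse else primer
    hits = []
    pos = reference.find(target)
    while pos != -1:
        hits.append((pos, pos + len(target)))
        pos = reference.find(target, pos + 1)  # allow overlapping matches
    return hits


# Toy reference built from the example primer sequences (not a real genome).
forward = "GGGCAAACCTAAAGG"
reverse = "GTTATGTAAAGGTGC"
reference = forward + "T" * 10 + reverse_complement(reverse)
print(locate_primer(reference, forward, is_reverse=False))
print(locate_primer(reference, reverse, is_reverse=True))
```

A primer with no hits, or with more than one hit, would need manual review before its coordinates are written to a BED file.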

Formatting rules

General
- All text is case sensitive.
- Any line starting with '#' is ignored. This can be used to add a header line with column names.
- Every line must have the same number of columns and format (except those starting with '#').
- Any number of spaces can separate the columns. A value within a single column should not have any space.

BED format
- Per standard BED conventions, sequence coordinates are given as 0-based, half-open intervals. The chromStart field (2nd column) is the position of the first nucleotide in the primer binding site, and the chromEnd field (3rd column) is the position of the last nucleotide in the primer binding site plus 1.
- chrom field must contain a sequence identifier that matches the header of the FASTA file containing the sequence that the coordinates are relative to.
- Multiple sequence identifiers (chrom) are permitted within one file.
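The 0-based, half-open convention matches Python slice semantics, which makes it easy to check a primer's coordinates against the reference. This is a small sketch; the helper names and the toy reference string are made up for illustration.

```python
from typing import Tuple


def primer_site(reference: str, chrom_start: int, chrom_end: int) -> str:
    """Return the bases covered by a 0-based, half-open BED interval."""
    return reference[chrom_start:chrom_end]


def to_one_based_inclusive(chrom_start: int, chrom_end: int) -> Tuple[int, int]:
    """Convert a BED interval to 1-based, inclusive coordinates."""
    return chrom_start + 1, chrom_end


# Hypothetical reference that begins with the primer1_LEFT sequence from the examples.
reference = "GGGCAAACCTAAAGG" + "ACGTACGT"
site = primer_site(reference, 0, 15)
print(site, len(site))                # GGGCAAACCTAAAGG 15
print(to_one_based_inclusive(0, 15))  # (1, 15)
```

The interval length is always chromEnd minus chromStart, so the primer at 0 to 15 covers 15 bases.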

Primer name
- primerName must be unique and encode the name of the amplicon for which the primer is designed, the direction tag indicating which side of the amplicon, left or right, the primer belongs to, and an optional indicator that the primer is an alternative primer for that amplicon.
- In addition to _LEFT and _RIGHT, we permit _L and _R as direction tags in primerName. Any text after the direction tag should be separated by an underscore.
- Text in primerName before the direction tag is considered to be an amplicon identifier. Ensure that the text of the amplicon identifier is unique for that amplicon and that the direction tag occurs only once in primerName.
- Each amplicon must have at least one left and right primer (including alternative primers) associated with it.
- Alternative primers bind to locations that avoid sequence variation in the default primer binding site that may disrupt hybridization. An amplicon may have an arbitrary number of alternative primers (as long as each primer name is unique), but most amplicons will have none. Alternative primers are indicated by the presence of _alt after the direction tag in primerName, followed by optional text to distinguish between different alternative primers, such as a number.
- Examples of valid primer names:
  - MY_SEQUENCE_434_A_LEFT
  - virus1_L
  - amplicon_4934m_RIGHT_alt
  - amplicon_4934m_RIGHT_alt1
  - amplicon_4934m_R_altprimerB
- Examples of invalid primer names:
  - LEFT_MY_SEQUENCE_434_A
  - virus1_l
  - amplicon_4934m_RIGHT_L
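
The primer name rules can be checked before upload with a short Python sketch such as the one below. It reflects one reading of the rules above, not the software's own validation: names are split on underscores, exactly one token must be a direction tag, and the text before that tag is taken as the amplicon identifier. `parse_primer_name` and `amplicons_missing_a_side` are hypothetical names.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple

# Direction tags are case sensitive; _L and _R are short forms of _LEFT and _RIGHT.
DIRECTION_TAGS = {"LEFT": "LEFT", "L": "LEFT", "RIGHT": "RIGHT", "R": "RIGHT"}


class ParsedName(NamedTuple):
    amplicon: str
    side: str      # "LEFT" or "RIGHT"
    is_alt: bool   # True when the text after the direction tag starts with "alt"


def parse_primer_name(name: str) -> ParsedName:
    """Split a primer name into amplicon identifier, side, and alt flag."""
    tokens = name.split("_")
    tag_positions = [i for i, tok in enumerate(tokens) if tok in DIRECTION_TAGS]
    if len(tag_positions) != 1:
        raise ValueError(
            f"{name!r}: expected exactly one direction tag, found {len(tag_positions)}"
        )
    i = tag_positions[0]
    amplicon = "_".join(tokens[:i])
    if not amplicon:
        raise ValueError(f"{name!r}: no amplicon identifier before the direction tag")
    suffix = "_".join(tokens[i + 1:])
    return ParsedName(amplicon, DIRECTION_TAGS[tokens[i]], suffix.startswith("alt"))


def amplicons_missing_a_side(names: Iterable[str]) -> List[str]:
    """Return amplicons that lack a left or a right primer (alternatives count)."""
    sides = defaultdict(set)
    for name in names:
        parsed = parse_primer_name(name)
        sides[parsed.amplicon].add(parsed.side)
    return sorted(a for a, s in sides.items() if s != {"LEFT", "RIGHT"})
```

With this reading, the valid examples above parse successfully and the three invalid examples raise `ValueError`: the first has no amplicon identifier before the tag, the second has no recognized (upper-case) tag, and the third contains two tags.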
