1 of 2

Repeat Expansion Detection with ExpansionHunter

Short tandem repeats (STRs) are regions of the genome consisting of repetitions of short DNA segments called repeat units. STRs can expand to lengths beyond the normal range and cause mutations called repeat expansions. Repeat expansions are responsible for many diseases, including Fragile X syndrome, amyotrophic lateral sclerosis, and Huntington's disease.

DRAGEN includes a repeat expansion detection method called ExpansionHunter. ExpansionHunter performs sequence-graph based realignment of reads that originate inside and around each target repeat. ExpansionHunter then genotypes the length of the repeat in each allele based on these graph alignments.

The ExpansionHunter is designed for PCR-free whole genome samples. Repeats are only genotyped if the coverage at the locus is at least 10x, but a minimum of 30x is recommended. Sequencing reads must be paired-end with a minimum read length of 100 (2x100bp). The ExpansionHunter cannot be run on multiple FASTQ files that are assigned to different library IDs in the fastq_list.csv file.

ExpansionHunter does not support somatic analysis.

More information and analysis is available in the following ExpansionHunter papers:

Dolzhenko et al., Detection of long repeat expansions from PCR-free whole-genome sequence data 2017
Dolzhenko et al., ExpansionHunter: A sequence-graph based tool to analyze variation in short tandem repeat regions 2019

Repeat Expansion Detection Options

To enable DRAGEN repeat expansion detection, the following command-line options are required.

--repeat-genotype-enable=true
--repeat-genotype-specs=<path to specification file>

You can use the --sample-sex option to specify the sex of the sample. The following options are optional.

--repeat-genotype-region-extension-length=<length of region around repeat to examine> (default 1000 bp)
--repeat-genotype-min-baseq=<Minimum base quality for high confidence bases> (default 20)

For more information on the specification file specified by --repeat-genotype-specs option, see Repeat Expansion Specification Files.

The main output of repeat expansion detection is a VCF file that contains the variants found via this analysis.

Repeat Expansion Specification Files

The repeat-specification (also called variant catalog) JSON file defines the repeat regions for ExpansionHunter to analyze. Default repeat-specification for some pathogenic and polymorphic repeats are in the <INSTALL_PATH>/resources/repeat-specs/ directory, based on the reference genome used with DRAGEN.

You can create specification files for new repeat regions by using one of the provided specification files as a template. See the ExpansionHunter documentation for details on the format.

--repeat-genotype-specs is required for ExpansionHunter. If the option is not provided, DRAGEN attempts to autodetect the applicable catalog file from <INSTALL_PATH>/resources/repeat-specs/ based on the reference provided.

Users can choose between any of the three default repeat-specification files packaged with DRAGEN using the command line option: --repeat-genotype-use-catalog=<default|default_plus_smn|expanded>. The default option includes ~60 repeats. The default_plus_smn option includes the SMN repeat in addition to all the repeats in the default catalog. The expanded catalog includes ~174K repeats, see Covered Repeat Regions. If --repeat-genotype-use-catalog is not specified on the command line, then the default catalog is used.

The repeat genotyping results will be incorrect if the selected reference genome is not compatible with the repeat specification file. When this occurs, many repeats may be marked as "LowDepth" in the VCF output file or estimated to have zero length. This can be further confirmed by visualizing read alignments with the REViewer visualization tool.

Covered Repeat Regions

The default variant catalog contains specifications on disease-causing repeats located in AFF2, AR, ARX_1, ARX_2, ATN1, ATXN1, ATXN10, ATXN2, ATXN3, ATXN7, ATXN8OS, BEAN1, C9ORF72, CACNA1A, CBL, CNBP, COMP, CSTB, DAB1, DIP2B, DMD, DMPK, EIF4A3, FMR1, FOXL2, FXN, GIPC1, GLS, HOXA13_1, HOXA13_2, HOXA13_3, HOXD13, HTT, JPH3, LRP12, MARCHF6, NIPA1, NOP56, NOTCH2NLC, NUTM2B-AS1, PABPN1, PHOX2B, PPP2R2B, PRDM12, PRNP, RAPGEF2, RFC1, RUNX2, SAMD12, SOX3, STARD7, TBP, TBX1, TCF4, TNRC6A, VWA1, XYLT1, YEATS2, ZIC2 and ZIC3 genes. More information about disease-causing repeats can also be found here.

For the expanded variant catalog, apart from the aforementioned disease-causing repeats, there are ~174K additional polymorphic repeats. They are initially detected using STR-Finder from the 1000 Genomes Project. After that, the candidate repeats are filtered out based on a customized quality control pipeline, see details here.

The ExpansionHunter can detect pathogenic expansions of FXN, ATXN3, ATN1, AR, DMPK, HTT, FMR1, ATXN1, C9ORF72 repeats with high accuracy (see the ExpansionHunter papers above). The pathogenicity status of some repeats might depend on the presence of sequence interruptions or motif changes that ExpansionHunter does not call. If you would like to visually inspect the relevant read alignments, you can use a Repeat Expansion Viewer third-party tool.

Repeat Expansion Thresholds

Included below are the repeat unit expansion thresholds (normal, pre-mutation and expansion) for some common repeats.

Repeat Expansion Detection Output Files

VCF Output File

The results of repeat genotyping are output as a separate VCF file, which provides the length of each allele at each callable repeat defined in the repeat-specification catalog file. The name is <outputPrefix>.repeats.vcf (*.gz). The VCF output file lists with the following fields first.

Table 2 Core VCF Fields

Table 3 Additional INFO Fields

Table 4 GENOTYPE (Per Sample) Fields

For example, the following VCF entry describes the ATXN1 repeat in a sample NA13537.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA13537
chr6	16327864	.	G	<STR33>,<STR58>	.	PASS	END=16327954;REF=30;RL=90;RU=TGC;VARID=ATXN1;REPID=ATXN1	GT:SO:REPCN:REPCI:ADSP:ADFL:ADIR:LC	1/2:SPANNING/INREPEAT:33/58:33-33/52-71:4/0:69/83:0/4:37.459459

In this example, the first allele spans 33 repeat units while the second allele spans 58 repeat units. The repeat unit is TGC (RU INFO field), so the sequence of the first allele is TGC x 33 and the sequence of the second allele is TGC x 58. The repeat spans 30 repeat units in the reference (REF INFO field).

The length of the short allele was estimated from spanning reads (SPANNING) while the length of the expanded allele was estimated from in-repeat reads (INREPEAT). The confidence interval for the size of the expanded allele is (52,71). There are 4 spanning and 69 flanking reads consistent with the repeat allele of size 33 that is 4 reads fully contain the repeat of size 33 and 69 flanking reads overlap at most 33 repeat units. There are 83 flanking and 4 in-repeat reads consistent with the repeat allele of size 58. The average coverage of this locus is 37.46x.

Additional Output Files

The sequence-graph alignments of reads in the targeted repeat regions are output in a BAM file. You can use a specialized GraphAlignmentViewer tool available on GitHub to visualize the alignments. Programs like Integrative Genomics Viewer (IGV) are not designed for displaying graph-aligned reads and cannot visualize these BAMs.

The BAMs store graph alignments in custom XG tags using the format <LocusName>,<StartPosition>,<GraphCIGAR>.

LocusName---A locus identifier that matches the corresponding entry in the repeat expansion specification file.
StartPosition---The starting alignment position of a read on the first graph node.
GraphCIGAR---The alignment of a read against the graph starting from that position. GraphCIGAR consists of a sequence of graph node identifiers and linear CIGARS describing the alignment of the read to each node.

Quality scores in the BAM file are binary. High-scoring bases are assigned a score of 40, and low-scoring bases are assigned a score of 0.

De Novo STR Expansion Detection

ExpansionHunter de novo allows the discovery of expanded STR regions from paired-end reads across a cohort of samples. It is designed to work with PCR-free samples of 100-200bp paired-end reads at >30X coverage.

Note:
STRs shorter than the read length are ignored; the program is appropriate only for detecting expansions that exceed the read length.
The location of each reported STR is approximate (up to about 500bp-1Kbp)
STRs are not genotyped; the program reports a depth-normalized count of reads originating inside each STR; this count can be used as a very approximate measure of the repeat length
To achieve best results all samples must be sequenced on the same instrument to similar coverage, have the same read and fragment lengths, and be subjected to the same computational pre-processing (e.g. reads must be aligned by the same aligner)

For more information refer to:

Dolzhenko, E., Bennett, M.F., Richmond, P.A. et al. ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data. Genome Biol 21, 102 (2020)

Briefly, the workflow can be separated in two distinct steps: profiling and analysis. In the profiling step, repetitive reads are found and used to infer the location of potential STR regions. The regions and the respective read counts are then saved in a "profile" on disk. The profiling step is run for each sample and the resulting profiles are merged into a single dataset for the analysis. In the analysis step the user needs to provide a table describing the experimental design to run either an outlier analysis which tests one sample against the rest or a case-control analysis where the samples are split in two groups.

In DRAGEN, the analysis is more streamlined than the standalone EHdn tool and has considerable performance improvements, while retaining the same output.

Note: The output in the case of outlier analysis might not be exactly identical because it involves bootstrapping. In DRAGEN, the random sampling function necessary for bootstrapping is different than what is implemented by Numpy in the standalone EHdn.

Running DRAGEN

Note: The DRAGEN implementation is based on EHdn version v0.9.1

The two steps of the workflow, profiling and analysis, are performed by two separate DRAGEN commands.

Profiling

In the first step we compute the profiles which are going to be saved as ProtoBuf messages (<out_prefix>.data). The profile can be saved in a specific directory with the --str-profiler-output-directory flag. The sample name will be saved in the profile and can be specified at the profiling stage with the flag --str-profiler-sample-name. If not specified, the sample name in the RGSM field will be used instead.

DRAGEN has to be called once for each sample, for example with the command:

dragen \
  --enable-map-align=false \
  --output-directory=<out_dir>  \
  --output-file-prefix=<out_prefix>  \
  --bam-input <bam_input> \
  --enable-str-profiler=true \
  --str-profiler-sample-name=<optional_name> \            # if not set is == to RGSM
  --str-profiler-output-directory=<path_to_directory> \   # if not set is == to --output-directory
  -r=<dragen_ref>

After all the profiles are computed, they have to be divided in 'cases' and 'controls' directories. This can be achieved while computing the profiles by passing the directory with the --str-profiler-output-directory flag. The input can be a list of samples with the --fastq-list option. DRAGEN can take as input a list of FASTQ files and save each profile in the directory specified directory with --str-profiler-output-directory. A list of cases and a list of controls can be run in this manner.

Example command:

dragen \
  --enable-map-align=true \
  --output-directory=<out_dir>  \
  --output-file-prefix=<out_prefix>  \
  --fastq-list=<fastq_list> \
  --fastq-list-all-samples=true \                # necessary to run samples separately based on sample-id
  --enable-multi-sample=true \                   # necessary to run samples separately based on sample-id
  --enable-per-sample-map-align-output=true \    # can be set to false if map align output (BAM) is not needed
  --enable-str-profiler=true \
  --str-profiler-output-directory=<path_to_directory> \
  -r=<dragen_ref>

Analysis

The analysis is performed with a separate DRAGEN command, which takes as input the path to the two directories.

Two analysis types can be specified:

outlier = bootstraps the sampling distribution of the 95% quantile and then calculates the z-scores for the cases samples
casecontrol = cases and controls counts are compared with a one-sided Wilcoxon rank-sum test and a Bonferroni correction is applied to the resulting p-values

Providing the --str-profile-analysis flag will trigger the analysis workflow. Example command:

dragen \
  --output-directory=<out_dir>  \
  --output-file-prefix=<out_prefix>  \
  --enable-str-profiler=true \
  --str-profiler-analysis=<outlier|casecontrol> \
  --str-profiler-cases-directory=<directory_with_cases_profiles> \
  --str-profiler-controls-directory=<directory_with_controls_profiles> \
  -r=<dragen_ref>

A note about bootstrapping

The standalone version of EHdn performs 100 rounds of resampling during bootstrapping due to computational constraints. In DRAGEN the resampling has been increased to 1000 by default thanks to the much faster computation. This number can be adjusted with the flag --str-profiler-resampling-rounds. Increasing the number of resampling cycles will improve the precision of the estimate but also linearly increase the compute times.

DRAGEN will spread the computation across 48 threads by default, but the number can be adjusted on the command line with the flag --str-profiler-threads.

Output

The output (as in the standalone EHdn implementation) is composed of two tables, one for the "motif" level analysis and one for the "locus" level analysis which will be saved as <output-prefix>.str_profiler_locus.tsv and <output-prefix>.str_profiler_motif.tsv respectively. Below is a description of the locus analysis output. The motif table is the same as the locus table but without the contig, start and end columns.

Outlier analysis (locus) output

Case-control analysis (locus) output

Repeat Expansion Detection with ExpansionHunter

ExpansionHunter does not support somatic analysis.

More information and analysis is available in the following ExpansionHunter papers:

Dolzhenko et al., Detection of long repeat expansions from PCR-free whole-genome sequence data 2017
Dolzhenko et al., ExpansionHunter: A sequence-graph based tool to analyze variation in short tandem repeat regions 2019

Repeat Expansion Detection Options

To enable DRAGEN repeat expansion detection, the following command-line options are required.

--repeat-genotype-enable=true
--repeat-genotype-specs=<path to specification file>

You can use the --sample-sex option to specify the sex of the sample. The following options are optional.

--repeat-genotype-region-extension-length=<length of region around repeat to examine> (default 1000 bp)
--repeat-genotype-min-baseq=<Minimum base quality for high confidence bases> (default 20)

For more information on the specification file specified by --repeat-genotype-specs option, see Repeat Expansion Specification Files.

The main output of repeat expansion detection is a VCF file that contains the variants found via this analysis.

Repeat Expansion Specification Files

You can create specification files for new repeat regions by using one of the provided specification files as a template. See the ExpansionHunter documentation for details on the format.

Covered Repeat Regions

Repeat Expansion Thresholds

Included below are the repeat unit expansion thresholds (normal, pre-mutation and expansion) for some common repeats.

Gene containing repeat

Repeat number (normal)

Repeat number (pre-mutation)

Repeat number (expansion)

Repeat Expansion Detection Output Files

VCF Output File

Table 2 Core VCF Fields

Field

Description

Table 3 Additional INFO Fields

Field

Description

Table 4 GENOTYPE (Per Sample) Fields

Field

Description

For example, the following VCF entry describes the ATXN1 repeat in a sample NA13537.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA13537
chr6	16327864	.	G	<STR33>,<STR58>	.	PASS	END=16327954;REF=30;RL=90;RU=TGC;VARID=ATXN1;REPID=ATXN1	GT:SO:REPCN:REPCI:ADSP:ADFL:ADIR:LC	1/2:SPANNING/INREPEAT:33/58:33-33/52-71:4/0:69/83:0/4:37.459459

Additional Output Files

The BAMs store graph alignments in custom XG tags using the format <LocusName>,<StartPosition>,<GraphCIGAR>.

LocusName---A locus identifier that matches the corresponding entry in the repeat expansion specification file.
StartPosition---The starting alignment position of a read on the first graph node.
GraphCIGAR---The alignment of a read against the graph starting from that position. GraphCIGAR consists of a sequence of graph node identifiers and linear CIGARS describing the alignment of the read to each node.

Quality scores in the BAM file are binary. High-scoring bases are assigned a score of 40, and low-scoring bases are assigned a score of 0.

De Novo STR Expansion Detection

Note:
STRs shorter than the read length are ignored; the program is appropriate only for detecting expansions that exceed the read length.
The location of each reported STR is approximate (up to about 500bp-1Kbp)
STRs are not genotyped; the program reports a depth-normalized count of reads originating inside each STR; this count can be used as a very approximate measure of the repeat length
To achieve best results all samples must be sequenced on the same instrument to similar coverage, have the same read and fragment lengths, and be subjected to the same computational pre-processing (e.g. reads must be aligned by the same aligner)

For more information refer to:

Dolzhenko, E., Bennett, M.F., Richmond, P.A. et al. ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data. Genome Biol 21, 102 (2020)

In DRAGEN, the analysis is more streamlined than the standalone EHdn tool and has considerable performance improvements, while retaining the same output.

Note: The output in the case of outlier analysis might not be exactly identical because it involves bootstrapping. In DRAGEN, the random sampling function necessary for bootstrapping is different than what is implemented by Numpy in the standalone EHdn.

Running DRAGEN

Note: The DRAGEN implementation is based on EHdn version v0.9.1

The two steps of the workflow, profiling and analysis, are performed by two separate DRAGEN commands.

Profiling

DRAGEN has to be called once for each sample, for example with the command:

dragen \
  --enable-map-align=false \
  --output-directory=<out_dir>  \
  --output-file-prefix=<out_prefix>  \
  --bam-input <bam_input> \
  --enable-str-profiler=true \
  --str-profiler-sample-name=<optional_name> \            # if not set is == to RGSM
  --str-profiler-output-directory=<path_to_directory> \   # if not set is == to --output-directory
  -r=<dragen_ref>

Example command:

dragen \
  --enable-map-align=true \
  --output-directory=<out_dir>  \
  --output-file-prefix=<out_prefix>  \
  --fastq-list=<fastq_list> \
  --fastq-list-all-samples=true \                # necessary to run samples separately based on sample-id
  --enable-multi-sample=true \                   # necessary to run samples separately based on sample-id
  --enable-per-sample-map-align-output=true \    # can be set to false if map align output (BAM) is not needed
  --enable-str-profiler=true \
  --str-profiler-output-directory=<path_to_directory> \
  -r=<dragen_ref>

Analysis

The analysis is performed with a separate DRAGEN command, which takes as input the path to the two directories.

Two analysis types can be specified:

outlier = bootstraps the sampling distribution of the 95% quantile and then calculates the z-scores for the cases samples
casecontrol = cases and controls counts are compared with a one-sided Wilcoxon rank-sum test and a Bonferroni correction is applied to the resulting p-values

Providing the --str-profile-analysis flag will trigger the analysis workflow. Example command:

dragen \
  --output-directory=<out_dir>  \
  --output-file-prefix=<out_prefix>  \
  --enable-str-profiler=true \
  --str-profiler-analysis=<outlier|casecontrol> \
  --str-profiler-cases-directory=<directory_with_cases_profiles> \
  --str-profiler-controls-directory=<directory_with_controls_profiles> \
  -r=<dragen_ref>

A note about bootstrapping

DRAGEN will spread the computation across 48 threads by default, but the number can be adjusted on the command line with the flag --str-profiler-threads.

Output

Outlier analysis (locus) output

Column

Description

Case-control analysis (locus) output

Column

Description