Variant Analysis

Variations in nucleotide sequence, in the form of single nucleotide variants (SNVs) and insertion and deletion events (INDELs), can either be neutral in nature or can have functional effects. Partek Flow provides all the tools necessary to interrogate and prioritize variants for further analysis. Variants stored in Variant Call Format (vcf) files can be analyzed to filter, annotate, summarize, visualize, and validate your panel of identified variants. Multiple vcf processing tools are available under the Variant analysis section of the context-sensitive menu:

Fusion Gene Detection
Annotate Variants
Annotate Variants (SnpEff)
Annotate Variants (VEP)
Filter Variants
Summarize Cohort Mutations
Combine Variants

Additional Assistance

If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.

Fusion Gene Detection

A fusion gene is a hybrid gene that combines parts of two or more original genes. Fusion genes can form as a result of chromosomal rearrangements (such as translocation, interstitial deletion, or chromosomal inversion) or abnormal transcription, and have been shown to act as drivers of malignant transformation and/or progression in various neoplasms (1). The discovery and characterization of fusion genes have been greatly facilitated by the use of NGS (2), and several computational algorithms have been developed to detect them.

This chapter illustrates how to detect fusion genes using:

STAR Algorithm
TopHat-Fusion Algorithm

STAR Algorithm

General Overview

The STAR aligner also has the ability to detect fusion genes (referred to as “chimeric alignments”) (5,6). During the first phase of alignment, STAR searches for maximal mappable prefixes (seeds) of sequencing reads. In the second phase, all the seeds that align within user-defined genomic windows are stitched together. If an alignment within one genomic window does not cover the entire read sequence, STAR will try to find two or more windows that cover the entire read. This essentially results in the detection of fusion events, with different parts of reads aligning to distal genomic locations, or different chromosomes, or different strands.
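
The last point can be illustrated with a small sketch that flags a pair of aligned read segments as chimeric when they map to different chromosomes, different strands, or distal positions. This is only an illustration of the idea, not STAR's implementation: the Segment type, the is_chimeric helper, and the 1,000,000-base distance threshold are hypothetical and do not correspond to STAR parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One aligned portion of a read (hypothetical structure)."""
    chrom: str
    start: int   # 1-based leftmost genomic position
    strand: str  # "+" or "-"


def is_chimeric(a: Segment, b: Segment, max_distance: int = 1_000_000) -> bool:
    """Return True if the two segments cannot belong to one linear alignment."""
    if a.chrom != b.chrom:
        return True          # different chromosomes
    if a.strand != b.strand:
        return True          # different strands
    return abs(a.start - b.start) > max_distance  # distal locations


# Coordinates below are invented for the example.
left = Segment("chr9", 133_729_451, "+")
right = Segment("chr22", 23_632_600, "+")
print(is_chimeric(left, right))  # True
```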

STAR fusion detection is performed in two steps: chimeric alignment of reads with the STAR aligner and fusion detection with STAR-Fusion. Performing fusion detection in two steps is equivalent to running the analysis in "Kickstart" mode, as described by the authors of STAR-Fusion. We recommend using STAR version 2.7.8a (see Task management to check which version you are running).

To save time, you can import the pre-built STAR-Fusion pipeline from our hosted pipeline page. This pipeline includes the two steps outlined below, where the advanced options for the STAR 2.7.8a alignment have been optimized for fusion detection according to the STAR-Fusion authors' recommendations. See Importing a Pipeline for more information.

Running STAR Chimeric Alignment within Partek Flow

When performing an alignment with STAR, chimeric alignment can be activated by tick-marking the Chimeric alignment option in the Advanced options of the aligner (the Advanced options dialog is reached via the Configure link in the setup dialog). When the Chimeric alignment checkbox is selected, additional options specific to the fusion search algorithm are shown (Figure 1). For a discussion of the details of the options, see the STAR documentation.

The output is associated with the Chimeric junctions data node (Figure 2), which is part of the STAR results in addition to the Aligned reads node and, optionally, the Unaligned reads node.

Running STAR-Fusion on Chimeric results

STAR-Fusion v1.10 is wrapped into Partek Flow. STAR-Fusion will process the chimeric output generated by the STAR aligner to map junction reads and spanning reads to a reference annotation set. To run fusion detection, select the Chimeric junctions data node and choose STAR-Fusion from the Variant analysis menu in the toolbox (Figure 5).

Choose the STAR-Fusion annotation from the drop-down list. We provide automatic downloads of the plug-n-play libraries distributed by Trinity Cancer Transcriptome Analysis Toolkit (CTAT) for Human hg38 (Gencode v22 and v37) and hg19 (Gencode v19) assemblies (Figure 6). If you wish to add your own STAR-Fusion library, you can either import a pre-built CTAT library or gather the appropriate files and build it in Partek Flow. See here for more details on the files you need.

To change any of the advanced options, click the Configure link (Figure 7). To run the task, click Finish.

The resulting Fusion predictions task node (Figure 8) can be downloaded to your local machine by selecting the data node and clicking Download data from the toolbox. There will be one tab-separated (.tsv) file per sample. To view the full table, double-click the new data node to open the task report (Figure 9). Each row of the table is a fusion event and the columns contain information about each detected fusion.

FusionName: the name of the fusion event, given as LeftGene--RightGene. Multiple fusion events can be detected across the same pair of genes, so the FusionName of an event is not necessarily unique;
JunctionReadCount: indicates the number of RNA-Seq fragments containing a read that aligns as a split read at the site of the putative fusion junction;
SpanningFragCount: indicates the number of RNA-Seq fragments that encompass the fusion junction such that one read of the pair aligns to a different gene than the other paired-end read of that fragment;
est_J: estimated junction read counts corrected for multiple mappings;
est_S: estimated spanning fragment counts corrected for multiple mappings;
SpliceType: indicates whether the proposed breakpoint occurs at reference exon junctions as provided by the reference transcript structure annotations (Gencode);
LeftGene: name of the first (left) gene;
LeftBreakpoint: genome coordinates for the breakpoint in left gene;
RightGene: name of the second (right) gene;
RightBreakpoint: genome coordinates for the breakpoint in right gene;
JunctionReads: sequence identifiers for all junction reads;
SpanningFrags: sequence identifiers for all spanning fragments;
LargeAnchorSupport: indicates whether there are split reads that provide 'long' (set to 25bp) alignments on both sides of the putative breakpoint;
FFPM: fusion fragments per million reads;
LeftBreakDinuc: dinucleotide base pairs at the left breakpoint;
LeftBreakEntropy: the Shannon entropy of the 15 exonic bases flanking the left breakpoint;
RightBreakDinuc: dinucleotide base pairs at the right breakpoint;
RightBreakEntropy: the Shannon entropy of the 15 exonic bases flanking the right breakpoint;
annots: provides a simplified annotation for the fusion transcript.
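
Once a per-sample .tsv file has been downloaded, the columns listed above can be used to shortlist fusions outside of Partek Flow. The following is a minimal sketch, assuming the file is tab-separated with a header row whose first name may carry a leading "#". The example rows, the helper names, and the thresholds (at least 2 junction reads, FFPM of at least 0.1) are hypothetical and should be adapted to the study.

```python
import csv
import io

# Tiny in-memory stand-in for one per-sample .tsv file (values are made up).
EXAMPLE_TSV = (
    "#FusionName\tJunctionReadCount\tSpanningFragCount\tLargeAnchorSupport\tFFPM\n"
    "BCR--ABL1\t42\t17\tYES_LDAS\t1.85\n"
    "GENEA--GENEB\t1\t0\tNO_LDAS\t0.03\n"
)


def load_fusions(handle):
    """Read fusion rows into dicts, dropping a leading '#' from the header."""
    reader = csv.reader(handle, delimiter="\t")
    header = [name.lstrip("#") for name in next(reader)]
    return [dict(zip(header, row)) for row in reader]


def filter_fusions(rows, min_junction_reads=2, min_ffpm=0.1):
    """Keep fusions meeting hypothetical evidence thresholds."""
    return [
        r for r in rows
        if int(r["JunctionReadCount"]) >= min_junction_reads
        and float(r["FFPM"]) >= min_ffpm
    ]


rows = load_fusions(io.StringIO(EXAMPLE_TSV))
for fusion in filter_fusions(rows):
    print(fusion["FusionName"], fusion["JunctionReadCount"], fusion["FFPM"])
```

To read a real file, replace the io.StringIO object with an opened file handle, for example open(path, newline="").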

TopHat-Fusion Algorithm

General Overview

TopHat-Fusion is a version of TopHat with the ability to align reads across fusion points and detect fusion genes resulting from breakage and re-joining of two different chromosomes or from rearrangements within a chromosome (3). It is independent of gene annotation and can discover fusion products from known genes, unannotated splice variants of known genes or completely unknown genes.

The reads are first aligned to the genome. The unaligned reads resulting from this initial alignment are split into multiple 25 bp sequences which are, in turn, aligned to the genome by Bowtie. The TopHat-Fusion algorithm identifies the cases where the first and the last 25 bp segments are aligned to either two different chromosomes or two locations on the same chromosome (spacing is defined by the user). The whole read is used to identify a fusion point. After the initial fusion candidates are defined, all the segments from the initially unaligned reads are realigned against the fusion points (as well as intron boundaries and indels). The resulting alignments are combined with the full read alignments.
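
The segmenting step described above can be sketched as follows. This is a simplified illustration rather than TopHat-Fusion's code: the function names, the handling of a trailing remainder shorter than 25 bp, and the 100,000-base spacing threshold are assumptions made for the example.

```python
def split_into_segments(read: str, segment_length: int = 25) -> list[str]:
    """Split a read into consecutive fixed-length segments.

    A trailing remainder shorter than segment_length is dropped here; that
    choice is an assumption of this sketch.
    """
    usable = len(read) - len(read) % segment_length
    return [read[i:i + segment_length] for i in range(0, usable, segment_length)]


def is_fusion_candidate(first_hit, last_hit, min_distance=100_000):
    """Compare (chromosome, position) hits of the first and last segments."""
    (chrom_a, pos_a), (chrom_b, pos_b) = first_hit, last_hit
    return chrom_a != chrom_b or abs(pos_a - pos_b) >= min_distance


segments = split_into_segments("ACGT" * 19)  # a 76-base read
print(len(segments), [len(s) for s in segments])  # 3 [25, 25, 25]
```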

The most up-to-date TopHat-Fusion version implemented in Partek® Flow® when the manual was written (2.1.0) focuses on fusions due to chromosomal rearrangements, while fusions resulting from read-through transcription or trans-splicing were not supported. For details as well as discussion of TopHat-Fusion options, see TopHat-Fusion home page (4).

Running TopHat-Fusion within Partek Flow

TopHat-Fusion is integrated in the TopHat 2 task and is invoked by using the Fusion search check box in the Alignment options dialog (Figure 10).

The output is generated as a new Fusion results data node (Figure 11), produced as part of the TopHat 2 align reads task (in addition to the Aligned reads node and, optionally, the Unaligned reads node).

Selecting the Fusion results data node opens the task menu, with four options (Figure 12): Data summary report, Fusion report, Fusion attribute report, and Download data.

Clicking Download data downloads a *.fusion file to the local computer. The file is human-readable and can be opened in a text editor (example in Figure 13). For details, refer to the TopHat-Fusion documentation.

A list of annotated fusion genes, in the form of a Fusion report, can be obtained by first selecting the Fusion report task node and then the Task report link from the task menu. Since the task provides an annotated report, an annotation file needs to be specified first (Figure 14).

The resulting Fusion report task node (Figure 15) can be double-clicked to reveal the full table (Figure 16).

Each row of the table in Figure 16 is a potential fusion event, with the columns providing the following information.

Sample ID: sample in which the fusion event was identified
Chromosome 1: chromosome hosting the first (left) segment of the fusion transcript
Stop 1: end of the first (left) segment of the fusion transcript
Chromosome 2: chromosome hosting the second (right) part of the fusion transcript
Start 2: beginning of the second (right) segment of the fusion transcript
Gene1: gene on the left side of the fusion
Gene2: gene on the right side of the fusion
Spanning reads: number of reads which were unaligned during the initial phase of TopHat and where only one mate is used as evidence of the fusion event
Mate Pairs: number of reads which were unaligned during the initial phase of TopHat and where both mates are used as evidence of the fusion event
Spanning mate pairs: number of reads where both mates were aligned during the initial phase of TopHat, but their pairing is discordant (e.g. different chromosomes, different orientation etc.)
Contradicting reads: number of reads which do not support the fusion
Left bases: number of bases on the left side of the fusion
Right bases: number of bases on the right side of the fusion

All the columns can be sorted by using the arrow buttons in the column headers, while the type-in boxes can be used for searching. TopHat-Fusion does not report exact start and stop positions for each side of the fusion event. It reports a single location for the end of the upstream segment (Stop 1) and the beginning of the downstream segment (Start 2). Therefore, columns Start 1 and Stop 2 are added for (internal) consistency with other Partek Flow tools.

The checkboxes Disrupted Genes and Gene/Gene fusions are filter tools. When selected, Disrupted Genes removes all the rows (fusion events) that have no genes assigned to them, i.e. those that merge two intergenic regions. However, if there is a fusion between a gene and an intergenic region, it will be kept in the table. The Gene/Gene fusions filter retains only those fusion events that have an annotated gene on both sides of the breakpoint. In other words, only gene-to-gene fusions are kept in the table.
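
The behavior of the two filters can be expressed compactly. The sketch below is an illustration of the rules just described, not Partek Flow code. It assumes each row is a dictionary with Gene1 and Gene2 keys, where an empty string marks an intergenic side; the function names are hypothetical.

```python
def disrupted_genes(rows):
    """Drop events with no gene on either side (intergenic to intergenic)."""
    return [r for r in rows if r["Gene1"] or r["Gene2"]]


def gene_gene_fusions(rows):
    """Keep only events with an annotated gene on both sides."""
    return [r for r in rows if r["Gene1"] and r["Gene2"]]


events = [
    {"Gene1": "BCR", "Gene2": "ABL1"},   # gene to gene
    {"Gene1": "TMPRSS2", "Gene2": ""},   # gene to intergenic
    {"Gene1": "", "Gene2": ""},          # intergenic to intergenic
]
print(len(disrupted_genes(events)), len(gene_gene_fusions(events)))  # 2 1
```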

Another table which can be generated based on a Fusion results node is the Fusion attribute report. When the option is selected, it brings up the dialog shown in Figure 17. First, you need to specify one or more categorical attributes (Select attribute(s) to test), which have at least two categories (see Data tab). Second, you need to specify an annotation file, using the Assembly and Gene/feature annotation drop-down lists.

A new data node, Fusion attribute report, is generated in the Analysis tab (Figure 18) and it provides access to the Task report link in the task menu.

The resulting report table resembles the basic TopHat-Fusion output; each row of the table is a single fusion event, while the columns hold the information on the merged segments.

Chromosome 1: chromosome hosting the first (left) segment of the fusion transcript;
Start 1: beginning of the first (left) segment of the fusion transcript;
Stop 1: end of the first (left) segment of the fusion transcript;
Chromosome 2: chromosome hosting the second (right) segment of the fusion transcript;
Start 2: beginning of the second (right) segment of the fusion transcript;
Stop 2: end of the second (right) segment of the fusion transcript;
Gene1: gene on the left side of the fusion;
Gene2: gene on the right side of the fusion;
% in (category name): fraction of samples within the category with the fusion event.
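
As a worked illustration of the last column, the sketch below computes the percentage of samples per category that carry a given fusion event. The function name, the sample names, and the categories are hypothetical; the real report derives the categories from the attributes selected in the task dialog.

```python
from collections import defaultdict


def percent_in_category(sample_category, samples_with_fusion):
    """Return {category: percentage of its samples carrying the fusion}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for sample, category in sample_category.items():
        totals[category] += 1
        if sample in samples_with_fusion:
            hits[category] += 1
    return {c: 100.0 * hits[c] / totals[c] for c in totals}


categories = {"S1": "tumor", "S2": "tumor", "S3": "normal", "S4": "normal"}
print(percent_in_category(categories, {"S1", "S2", "S3"}))
# {'tumor': 100.0, 'normal': 50.0}
```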

The checkboxes Disrupted Genes and Gene/Gene fusions are filter tools. When selected, Disrupted Genes removes all the rows (fusion events) that have no genes assigned to them, i.e. those that merge two intergenic regions. However, if there is a fusion between a gene and an intergenic region, it will be kept in the table. The Gene/Gene fusions filter retains only those fusion events that have an annotated gene on both sides of the breakpoint. In other words, only gene-to-gene fusions are kept in the table.

References

Annala MJ, Parker BC, Zhang W, Nykter M. Fusion genes and their discovery using high throughput sequencing. Cancer Lett. 2013;340:192-200.
Costa V, Aprile M, Esposito R, Ciccodicola A. RNA-Seq and human complex diseases: recent accomplishments and future perspectives. Eur J Hum Genet. 2013;21:134-142.
Kim D, Salzberg SL. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biology. 2011;12:R72.
TopHat-Fusion. An algorithm for discovery of novel fusion transcripts. http://tophat.cbcb.umd.edu/fusion_index.html Accessed on April 25, 2014.
Dobin A, Davies CA, Schlesinger F et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15-21.
Haas BJ, Dobin A, Li B et al. Accuracy assessment of fusion transcript detection via read-mapping and de novo fusion transcript assembly-based methods. Genome Biol. 2019;20:213.

Annotate Variants

The Annotate variants task in Partek Flow provides a means to add information regarding genomic features, such as transcript models, and existing variant databases to the variants contained in the project. This information can be useful for filtering, interpreting, and prioritizing variants for downstream investigation. The Annotate variants task can be invoked from any Variants or Annotated variants data node, and the task will add to and supplement any existing annotation in the underlying vcf files. Annotation information will also be visible in the downstream View variants Variant report.

Annotate variants dialog

The task dialog for Annotate variants contains three sections: Assembly, Annotate with genomic features, and Annotate with known variants (Figure 1). If variant detection was performed in Partek Flow, the Assembly will be displayed as text in the section, and you do not have the option to change the reference. In the event that variant detection was performed outside of Partek Flow, you will need to select the appropriate Assembly utilized for variant detection in the drop-down list. Assemblies previously added to library files (see Library File Management) will be available for selection or New assembly… can be utilized to import the reference sequence from within the task.

Selecting Annotate with genomic features provides the means to add gene/feature information to the variants (Figure 2). This typically takes the form of overlaying a transcript model (such as Ensembl). Annotation models previously added to library files (see Library File Management) will be available for selection, or Add annotation model in the drop-down list can be utilized to import an annotation model to library files from within the task. Promoter upstream limit and Promoter downstream limit set the number of bases flanking the transcription start site; this region will be considered the promoter of a feature.
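
To make the promoter definition concrete, the sketch below derives the promoter interval from a transcription start site (TSS) and the two limits, taking strand into account. It is an illustration under stated assumptions, not Partek Flow's implementation: the function names and the default limits of 5000 and 1000 bases are hypothetical, and the interval is treated as inclusive at both ends.

```python
def promoter_region(tss, strand, upstream_limit, downstream_limit):
    """Return (start, end) of the promoter around a transcription start site.

    Upstream means lower coordinates on the "+" strand and higher
    coordinates on the "-" strand.
    """
    if strand == "+":
        return tss - upstream_limit, tss + downstream_limit
    return tss - downstream_limit, tss + upstream_limit


def in_promoter(position, tss, strand, upstream_limit=5000, downstream_limit=1000):
    """True when a variant position falls inside the promoter interval."""
    start, end = promoter_region(tss, strand, upstream_limit, downstream_limit)
    return start <= position <= end


print(promoter_region(10_000, "+", 5000, 1000))  # (5000, 11000)
print(promoter_region(10_000, "-", 5000, 1000))  # (9000, 15000)
```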

Selecting Annotate with known variants will provide the ability to specify a Variant annotation database (Figure 2). Known variant databases in vcf format, such as dbSNP (1) and 1000 Genomes (2) for human variants, can be used in the task. Additional databases not provided for automated download in Partek® Flow®, such as the Catalogue of Somatic Mutations in Cancer (COSMIC) (3), can be obtained and employed by the user. Variant databases previously added to library files (see Library File Management) will be available for selection, or Add variant database in the menu can be utilized to import the variant database to library files from within the task.
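
Conceptually, annotating with known variants amounts to looking up each called variant in the database. The sketch below matches on chromosome, position, reference allele, and alternate allele. That matching rule, the helper names, and the placeholder identifier are assumptions made for illustration; the exact matching performed by Partek Flow is not described here.

```python
def variant_key(chrom, pos, ref, alt):
    """Identify a variant by position and alleles."""
    return (chrom, int(pos), ref, alt)


def annotate_with_known(variants, database):
    """Attach the database ID to each variant, or None when it is novel."""
    known = {variant_key(*rec[:4]): rec[4] for rec in database}
    return [(v, known.get(variant_key(*v))) for v in variants]


# (chrom, pos, ref, alt, id); the identifier is an invented placeholder.
database = [("chr1", 1000, "A", "G", "known_0001")]
calls = [("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T")]
for variant, known_id in annotate_with_known(calls, database):
    print(variant, known_id or "novel")
```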

References

Sherry ST. dbSNP: the NCBI database of genetic variation. Nucleic Acids Research. 2001;29(1):308-311. doi:10.1093/nar/29.1.308
Auton A, Abecasis GR, Altshuler DM, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68-74. doi:10.1038/nature15393.
Forbes SA, Bhamra G, Bamford S, et al. The Catalogue of Somatic Mutations in Cancer (COSMIC). In: Haines JL, Korf BR, Morton CC, Seidman CE, Seidman JG, Smith DR, eds. Current Protocols in Human Genetics. Hoboken, NJ, USA: John Wiley & Sons, Inc.; 2008. http://doi.wiley.com/10.1002/0471142905.hg1011s57.

Annotate Variants (SnpEff)

An important aspect of variant analysis is the ability to prioritize specific variants for further investigation. As variant detection can often identify a large number of variants, it may be difficult to determine which variants may impact phenotypes. SnpEff (version 4.1k) provides a means to annotate and predict the effects of variants on genes, allowing for prioritization of variants within the project. In addition, the SnpEff databases utilized for prediction support a large number of genome assemblies. Information regarding the implementation of the predictions is detailed by Cingolani et al. (1). The predicted effect of the variant is categorized by impact:

HIGH – frame shifts, addition/deletion of stop codons, etc.;
MODERATE – codon change/deletion/insertion, etc.;
LOW – synonymous changes, etc.;
MODIFIER – changes outside coding regions, etc.
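
The impact categories lend themselves to a simple ranking when prioritizing variants outside of Partek Flow. The sketch below extracts the most severe impact from a VCF INFO string. It assumes the annotations are stored in an ANN entry as comma-separated, pipe-delimited records whose third sub-field is the impact. Verify this layout against the SnpEff documentation for your output before relying on it. The function name and the example INFO string are hypothetical.

```python
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}


def worst_impact(info):
    """Return the most severe impact found in the ANN entry, or None.

    Assumes pipe-separated annotations whose third sub-field is the impact.
    """
    for item in info.split(";"):
        if item.startswith("ANN="):
            fields = (a.split("|") for a in item[4:].split(","))
            impacts = [f[2] for f in fields if len(f) > 2 and f[2] in IMPACT_RANK]
            return min(impacts, key=IMPACT_RANK.__getitem__, default=None)
    return None


info = "DP=35;ANN=T|synonymous_variant|LOW|GENE1,T|stop_gained|HIGH|GENE1"
print(worst_impact(info))  # HIGH
```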

Further details about output metrics can be found in the SnpEff documentation. The Annotate variants (SnpEff) task can be invoked from any Variants or Annotated variants data node, and the task will supplement any existing annotation in the vcf files. Annotation information will also be visible in the View variants Variant report and the Summarize cohort mutations Cohort mutation summary report.

Annotate variants (SnpEff) dialog

The task dialog for Annotate variants (SnpEff) contains two sections: Select SnpEff database and Advanced options (Figure 1). Select SnpEff database specifies the reference assembly used for variant detection. If variant detection was performed in Partek Flow, the Assembly will be displayed as text in the section, and you do not have the option to change the reference. In the event that variant detection was performed outside of Partek Flow, you will need to select the appropriate Assembly utilized for variant detection in the drop-down list. Assemblies previously added to library files (see Library File Management) will be available for selection, or New assembly… can be utilized to import the reference sequence to library files from within the task. Select SnpEff database also allows selection of the database utilized for prediction, and Partek Flow provides automated download of a limited number of these databases. Databases previously added to library files (see Library File Management) will be available for selection, or Add SnpEff variant database in the menu can be utilized to import a database to library files from within the task. Additional information on SnpEff databases can be found in the SnpEff documentation.

References

Cingolani P, Platts A, Wang LL, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012;6(2):80-92.

Annotate Variants (VEP)

An important aspect of variant analysis is the ability to prioritize variants for downstream analysis. As variant detection can often identify a large number of variants, it may be difficult to determine which variants may impact phenotypes. As implemented in Partek Flow, the Ensembl Variant Effect Predictor (VEP, version 84) (1) provides a means to add detailed annotation to variants in the project, such as discrete aspects of transcript models and variant databases not available in the Annotate Variants task. For variants identified in human data, information from popular tools that predict the impact of variants that cause amino acid changes, SIFT (2–4) and PROVEAN (5) (available for the hg19 genome assembly), will be included. VEP databases can be obtained for multiple species, and content will be dependent on available transcript and variant information for that organism. The Annotate variants (VEP) task can be invoked from any Variants or Annotated variants data node, and the task will supplement any existing annotation in the vcf files. Annotation information will also be visible in the downstream View variants Variant report.

Annotate variants (VEP) dialog

The task dialog for Annotate variants (VEP) contains two sections: Select Variant Effect Predictor database and Advanced options (Figure 1). Select Variant Effect Predictor database will specify the reference assembly to utilize for variant detection. If the variant detection was performed in Partek Flow, the Assembly will be displayed as text in the section. Upon initial task usage, click the Create variant effect predictor database button to import a database. The VEP database for hg19 is available for automated download in Partek Flow, and information regarding obtaining additional databases for other species and genome assemblies can be found in the VEP documentation.

The report includes variant impact information, which is a subjective classification of the severity of the variant consequence:

Low: a variant that is assumed to be mostly harmless or unlikely to change protein behavior

Moderate: a non-disruptive variant that might change protein effectiveness

Modifier: usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact

High: a variant assumed to have a high (disruptive) impact on the protein, probably causing protein truncation, loss of function, or triggering nonsense-mediated decay.

Filter Variants

Variant detection can identify large numbers of variants, dependent both on the size of the regions being interrogated and the parameters utilized during detection. As such, filtering variants is often a necessary task to aid in identifying variants that may be relevant for downstream investigation. Partek Flow provides tools for filtering variant data with regard to both quality metrics generated during detection and annotation information. The task can be invoked from any Variants or Annotated variants data node.

Filter variants dialog

The Filter variants task dialog can contain two to five sections, dependent on the variant caller used for detection and the level of annotation (Figure 1). All instances of the task will include the following: Include region overlapping variants and a section for Quality.

Selecting Include region overlapping variants will bring up a dialog to include variants located within genomic regions of interest (Figure 2), which could be regions such as transcript models or amplicons. If variant detection was performed in Partek Flow, the Assembly will be displayed as text in the section, and you do not have the option to change the reference. In the event that variant detection was performed outside of Partek Flow, you will need to select the appropriate Assembly utilized for variant detection in the drop-down list. Assemblies previously added to library files (see Library File Management) will be available for selection, or New assembly… can be utilized to import the reference sequence to library files from within the task. The Gene/feature section allows the use of any annotation model in the library files (see Library File Management) via the drop-down menu; a new model can be imported from within the task by selecting Add annotation model. If an annotation that contains gene-level information is selected, this filter will include both intronic and exonic regions.

If the filter is invoked from an Annotated variants data node, the Variant Novelty section can be utilized to filter known variants as identified in a variant database used for annotation (Figure 3). Selecting Known only, Novel only, or All will include only these types of variants in the resulting filtered variants.

Variants annotated with a transcript model will include a filter for Variant Type (Figure 4). For variants in coding regions, Mutation type allows for the inclusion of Synonymous, Missense, and/or Nonsense variants when selecting the appropriate type. For variants located outside of coding regions, Feature section allows for the inclusion of 5-prime splice site (Splice-5), 3-prime splice site (Splice-3), Non-coding RNA, 5-prime UTR, 3-prime UTR, Intron, Promoter, and/or Intergenic variants by selecting the appropriate type.

When the Filter by field option is checked, all available fields are displayed in the drop-down list. The list of fields differs between data nodes, depending on the variant detection algorithm, the annotation database, and so on.

For instance, the VarQual field is a metric generated during variant detection, and such metrics depend on the method utilized for variant detection.

Fields can be searched in the drop-down list (Figure 6); hovering the mouse over a field displays its description.

Decisions on quality filtering parameters should be based upon the sequencing assay design as well as the goal of the study, either identification of all potential variants or identification of high confidence variants. At the very least, the use of Minimum read depth should be considered for filtering to ensure sufficient read evidence was available to call a variant. In instances where paired variant detection was performed in Samtools, Minimum genotype log ratio may be employed to ensure sufficient evidence of genotype differences in case and control sample pairs. Please refer to the Samtools, FreeBayes, and LoFreq documentation for further details on any of these parameters.
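
A minimum read depth filter can also be reproduced on an exported vcf file. The sketch below is a minimal illustration, assuming the depth is stored under the DP key of the INFO column (the eighth tab-separated column of a vcf record). Callers may report depth elsewhere, so this may not match the field Partek Flow uses. The function names and the threshold of 10 reads are hypothetical.

```python
def read_depth(info):
    """Extract the DP value from a VCF INFO string, or None if absent."""
    for item in info.split(";"):
        key, _, value = item.partition("=")
        if key == "DP" and value.isdigit():
            return int(value)
    return None


def passes_min_depth(vcf_line, min_depth=10):
    """True when the record's INFO/DP meets the threshold."""
    info = vcf_line.rstrip("\n").split("\t")[7]
    depth = read_depth(info)
    return depth is not None and depth >= min_depth


# Example records are invented for illustration.
records = [
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=35;AF=0.5",
    "chr1\t200\t.\tC\tT\t12\tPASS\tDP=4;AF=0.5",
]
kept = [line for line in records if passes_min_depth(line)]
print(len(kept))  # 1
```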

Summarize Cohort Mutations

Variant information is stored on a per sample basis, but it can be informative to view variants in the context of recurrent variants identified within the project’s sample cohort to identify both the frequency of variants and the samples that share a particular variant. The Summarize cohort mutations task can be invoked from any Variants or Annotated variants data node to generate a report of shared variants identified from detection against a reference sequence or among paired samples.

Summarize cohort mutations dialog

For the Summarize cohort mutations task, you need to specify Minimum coverage for genotype calls. In general, it is likely that if a variant is not called in a sample at a particular locus then the sample has a homozygous reference genotype. Yet this may not always be the case, as factors such as insufficient depth or low quality bases at that locus may lead to an inability of the variant caller to identify any genotype at that locus. As such, setting a minimum coverage will make the assumption that the sample contains a homozygous reference genotype if the depth requirement is met. This is done for the purpose of generating genotype calls for all samples (even reference homozygotes) at all variant loci within the project.
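
The rule described above can be written as a three-way decision. In this sketch the function name is hypothetical, and the "0/0" and "./." strings follow the common vcf notation for homozygous reference and missing genotypes; they are a choice made for illustration rather than the exact values shown in the report.

```python
def genotype_call(called_genotype, depth, min_coverage):
    """Genotype reported for one sample at one variant locus."""
    if called_genotype is not None:
        return called_genotype   # the caller reported a variant genotype
    if depth >= min_coverage:
        return "0/0"             # assume homozygous reference
    return "./."                 # insufficient evidence: no genotype


print(genotype_call("0/1", 30, 10))  # 0/1
print(genotype_call(None, 30, 10))   # 0/0
print(genotype_call(None, 3, 10))    # ./.
```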

For reports from a paired variant caller, if the Merge pairs checkbox is unselected, pairs will be analyzed separately. If it is selected, all samples will be analyzed together.

Cohort mutation summary report

The Cohort mutation summary report provides a row in the table for all variant sites, either SNVs or INDELs, identified in the project (Figure 2). Hovering over a column header will provide a brief description of the column data. Columns presented in the table include the following information:

View: provides a link to Chromosome View by selecting the chromosome icon;
Chr: the chromosome from the reference assembly;
Position: the base position in the chromosome;
Mutation type: the category of variant (Substitution for SNVs and Insertion or Deletion for INDELs);
Reference allele: the base(s) in the reference assembly sequence;
Case genotypes: the genotypes of the samples with a variant at the locus;
Variant frequency: the frequency of the variant site in the sample cohort;
Sample count: the fraction of samples in the cohort with the variant;
Samples: the names of the samples that contain the variant.

The Summarize cohort mutations task is not available for variants detected by LoFreq, as no genotypes are produced by the caller.

If variant detection was performed on paired samples in Samtools (Figure 3), the Genotype column will be replaced with four columns: GT Change presents the possible change in zygosity between cases and controls at the variant locus, Control Genotypes are the genotypes of the designated control samples in the pairs, and Case Genotypes are the genotypes of the cases in the pairs.

Additional columns can be added to the Cohort mutation summary report table by selecting Optional columns. The optional columns are dependent upon the information present in the underlying vcf file and include variant and sample metrics from variant detection and information from the annotation. Hovering over a term in the list will provide a brief description of the data contained in that column. Optional columns can also be used to exclude default columns in the table.

Below each data column header in the Cohort mutation summary report, the Search... section allows for filtering of the table (Figures 2, 3 and 4). The search can be useful for limiting the list of variants to those of interest when large numbers of variants are present in the table. For columns with numbers, exact values or ranges using either ">" or "<" can be utilized in the search. For columns with letters or words, an exact string of characters must be entered in order to obtain a match. In the case of table cells with multiple entries, there must be an exact match between the query and one entry to retain the table row.
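
The search rules above can be summarized in a short sketch. It is an interpretation of the described behavior rather than Partek Flow code: the function name is hypothetical, and the comma used to separate multiple entries within a cell and the case-sensitive comparison are assumptions.

```python
def matches(cell, query):
    """Apply a search query to one table cell."""
    query = query.strip()
    if query[:1] in ("<", ">"):
        # Range search on a numeric column.
        bound = float(query[1:])
        value = float(cell)
        return value > bound if query[0] == ">" else value < bound
    try:
        return float(cell) == float(query)   # exact numeric value
    except ValueError:
        # Text cell, possibly with multiple entries: one entry must match exactly.
        entries = [e.strip() for e in str(cell).split(",")]
        return query in entries


print(matches("0.25", ">0.1"))      # True
print(matches("A/G, A/A", "A/A"))   # True
print(matches("A/G, A/A", "A"))     # False
```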

If the Summarize cohort mutations task is performed upon an Annotated variants data node, additional information can be presented in the Cohort mutation summary report table. Click on Optional columns to select more fields to add to the table (Figure 3).

At any point, information in the Cohort mutation summary report table can be saved in text or vcf format by selecting Download at the bottom right corner of the table. If the table is exported in text format, the visible table will be appended with additional columns for all samples in the project. These columns specify the genotype call for each variant locus in the project. In instances where no variant was detected within a sample, the value specified by Minimum coverage for genotype calls in the task dialog will be used to call either a homozygous reference genotype if above the specified threshold or no genotype if below the specified threshold.
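
The fallback rule for samples without a detected variant can be sketched in Python as follows. This is an illustration only: the function name, the VCF-style strings "0/0" (homozygous reference) and "./." (no genotype), and treating a coverage equal to the threshold as passing are assumptions, not a description of how Partek Flow encodes the exported values.

```python
def fallback_genotype(coverage: int, min_coverage: int) -> str:
    """Genotype reported at a locus where no variant was detected in a sample.

    Assumptions: VCF-style notation, and coverage equal to the threshold passes.
    """
    if coverage >= min_coverage:
        return "0/0"  # enough reads to call homozygous reference
    return "./."      # too few reads: no genotype call
```

For example, with Minimum coverage for genotype calls set to 10, a locus covered by 25 reads would be reported as homozygous reference, while a locus covered by 3 reads would receive no genotype.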

Combine Variants

Due to fundamental differences between the statistical models employed by different variant detection tools, as well as varying parameter optimizations within tools selected for discrete discovery goals, separate instances of variant detection can yield large numbers of both unique and common variants. In some studies the goal is to provide a list of all possible variants, whereas in others the goal is to generate a list of variants with increased confidence of being true polymorphic sites. To facilitate both goals, Partek Flow provides a Combine variants tool that generates both the union and the intersection of two variants data nodes. This task provides a means to identify common and unique variant calls in samples that have undergone two discrete variant calling tasks. The Combine variants task can be invoked from any Variants or Annotated variants data node, provided that at least one other discrete Variants data node exists in the project. The task will generate two new variant data nodes and underlying vcf files: one for the union and one for the intersection of the variant data (Figure 1).
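
Conceptually, the union and intersection correspond to set operations on the variant calls. The Python sketch below illustrates the idea under the assumption that two calls are the same variant when chromosome, position, reference allele and alternate allele all agree. The matching criteria used by Partek Flow are not stated here, and the Variant type and combine function are hypothetical names.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Variant(NamedTuple):
    """Minimal key identifying a variant call (an assumption for this sketch)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def combine(calls_a: Iterable[Variant], calls_b: Iterable[Variant]) -> Tuple[Set[Variant], Set[Variant]]:
    """Return (union, intersection) of two collections of variant calls."""
    a, b = set(calls_a), set(calls_b)
    return a | b, a & b


# Made-up calls from two hypothetical variant detection runs.
caller_1 = [Variant("chr1", 1000, "A", "G"), Variant("chr2", 500, "C", "T")]
caller_2 = [Variant("chr1", 1000, "A", "G"), Variant("chr3", 42, "G", "GA")]
union, intersection = combine(caller_1, caller_2)
```

The union lists every variant reported by either run, while the intersection keeps only the variants reported by both, which is the higher-confidence list.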

To run the task, select a Variants data node and then click Combine variants in the task menu. The task dialog (Figure 2) will allow you to select a second Variants data node to be combined with the first. Only one other data node can be selected for use in the task. If there are no other valid variant tasks available within the project, a message stating "No connections to upstream task found" will be displayed.

Currently, this is the only task in Partek Flow that requires two input data nodes and then generates two output data nodes.

Fusion Gene Detection

This chapter illustrates how to detect fusion genes using:

STAR Algorithm
TopHat-Fusion Algorithm

STAR Algorithm

General Overview

Running STAR Chimeric Alignment within Partek Flow

The output is associated with the Chimeric junctions data node (Figure 2), which is part of the STAR results in addition to the Aligned reads node and, optionally, the Unaligned reads node.

To obtain a .fusion file that summarizes the chimeric reads across samples, double-click the Chimeric junctions data node to open the report, click the View output files link, select the chimeric_result.fusion file and click the download icon (Figure 3). The file is human-readable and can be opened in a text editor (example in Figure 4). For details, refer to STAR's documentation.

Running STAR-Fusion on Chimeric results

To change any of the advanced options, click the Configure link (Figure 7). To run the task, click Finish.

The STAR-Fusion output provides the following information for each predicted fusion:

FusionName: the name of the fusion event, given as LeftGene--RightGene. Multiple fusion events can be detected across the same pair of genes, so the FusionName of an event is not necessarily unique;
JunctionReadCount: indicates the number of RNA-Seq fragments containing a read that aligns as a split read at the site of the putative fusion junction;
SpanningFragCount: indicates the number of RNA-Seq fragments that encompass the fusion junction such that one read of the pair aligns to a different gene than the other paired-end read of that fragment;
est_J: estimated junction read counts corrected for multiple mappings;
est_S: estimated spanning fragment counts corrected for multiple mappings;
SpliceType: indicates whether the proposed breakpoint occurs at reference exon junctions as provided by the reference transcript structure annotations (Gencode);
LeftGene: name of the first (left) gene;
LeftBreakpoint: genome coordinates for the breakpoint in left gene;
RightGene: name of the second (right) gene;
RightBreakpoint: genome coordinates for the breakpoint in right gene;
JunctionReads: sequence identifiers for all junction reads;
SpanningFrags: sequence identifiers for all spanning fragments;
LargeAnchorSupport: indicates whether there are split reads that provide 'long' (set to 25bp) alignments on both sides of the putative breakpoint;
FFPM: fusion fragments per million reads;
LeftBreakDinuc: dinucleotide base pairs at the left breakpoint;
LeftBreakEntropy: the Shannon entropy of the 15 exonic bases flanking the left breakpoint;
RightBreakDinuc: dinucleotide base pairs at the right breakpoint;
RightBreakEntropy: the Shannon entropy of the 15 exonic bases flanking the right breakpoint;
annots: provides a simplified annotation for the fusion transcript.
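
As an illustration of how these columns might be used downstream, the Python sketch below reads a tab-delimited table with a subset of the columns listed above and keeps the better-supported events. Several things are assumptions made for the example: the header line may start with "#", the thresholds are arbitrary and not recommended values, the gene names and numbers are made up, and shannon_entropy computes the entropy in bits over single-base frequencies, which may differ from how the tool itself derives LeftBreakEntropy and RightBreakEntropy.

```python
import csv
import io
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits) of single-base frequencies in a sequence."""
    counts = Counter(seq.upper())
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def read_fusions(text: str) -> list:
    """Parse a tab-delimited fusion table into a list of dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # Drop a leading '#' from column names such as '#FusionName'.
    return [{key.lstrip("#"): value for key, value in row.items()} for row in reader]


def filter_fusions(rows: list, min_junction_reads: int = 2, min_ffpm: float = 0.1) -> list:
    """Keep events with enough junction reads and a high enough FFPM.

    The default thresholds are hypothetical.
    """
    return [
        row for row in rows
        if int(row["JunctionReadCount"]) >= min_junction_reads
        and float(row["FFPM"]) >= min_ffpm
    ]


# Made-up example content with only a few of the columns.
example = (
    "#FusionName\tJunctionReadCount\tSpanningFragCount\tFFPM\n"
    "GENEA--GENEB\t12\t5\t0.84\n"
    "GENEC--GENED\t1\t0\t0.03\n"
)
kept = filter_fusions(read_fusions(example))
```

A low-complexity flanking sequence such as a homopolymer has an entropy of 0, whereas a sequence with all four bases equally represented reaches the maximum of 2 bits, so low entropy values can flag breakpoints in repetitive sequence.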

TopHat-Fusion Algorithm

General Overview

Running TopHat-Fusion within Partek Flow

TopHat-Fusion is integrated in the TopHat 2 task and is invoked by using the Fusion search check box in the Alignment options dialog (Figure 10).

Selecting the Fusion results data node opens the task menu, with four options (Figure 12): Data summary report, Fusion report, Fusion attribute report, and Download data.

The resulting Fusion report task node (Figure 15) can be double-clicked to reveal the full table (Figure 7).

Each row of the table in Figure 7 is a potential fusion event, with the columns providing the following information.

Sample ID: sample in which the fusion event was identified
Chromosome 1: chromosome hosting the first (left) segment of the fusion transcript
Stop 1: end of the first (left) segment of the fusion transcript
Chromosome 2: chromosome hosting the second (right) part of the fusion transcript
Start 2: beginning of the second (right) segment of the fusion transcript
Gene1: gene on the left side of the fusion
Gene2: gene on the right side of the fusion
Spanning reads: number of reads which were unaligned during the initial phase of TopHat and where only one mate is used as evidence of the fusion event
Mate Pairs: number of reads which were unaligned during the initial phase of TopHat and where both mates are used as evidence of the fusion event
Spanning mate pairs: number of reads where both mates were aligned during the initial phase of TopHat, but their pairing is discordant (e.g. different chromosomes, different orientation etc.)
Contradicting reads: number of reads which do not support the fusion
Left bases: number of bases on the left side of the fusion
Right bases: number of bases on the right side of the fusion

A new data node, Fusion attribute report, is generated in the Analysis tab (Figure 18) and it provides access to the Task report link in the task menu.

The output Fusion report table resembles the basic TopHat-Fusion output: each row of the table is a single fusion event, while the information on the merged segments is in the columns.

Chromosome 1: chromosome hosting the first (left) segment of the fusion transcript;
Start 1: beginning of the first (left) segment of the fusion transcript;
Stop 1: end of the first (left) segment of the fusion transcript;
Chromosome 2: chromosome hosting the second (right) segment of the fusion transcript;
Start 2: beginning of the second (right) segment of the fusion transcript;
Stop 2: end of the second (right) segment of the fusion transcript;
Gene1: gene on the left side of the fusion;
Gene2: gene on the right side of the fusion;
% in (category name): fraction of samples within the category with the fusion event.
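
The "% in (category name)" column can be understood through the following Python sketch, which computes, for one fusion event, the percentage of samples in each category that carry the event. The function name, the sample names and the category labels are hypothetical, and the sketch assumes that each sample belongs to exactly one category.

```python
from collections import defaultdict
from typing import Dict, Set


def percent_by_category(sample_category: Dict[str, str], samples_with_fusion: Set[str]) -> Dict[str, float]:
    """Percentage of samples per category in which the fusion event was found."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for sample, category in sample_category.items():
        totals[category] += 1
        if sample in samples_with_fusion:
            hits[category] += 1
    return {category: 100.0 * hits[category] / totals[category] for category in totals}


# Made-up cohort: two tumor and two normal samples.
categories = {"s1": "tumor", "s2": "tumor", "s3": "normal", "s4": "normal"}
percentages = percent_by_category(categories, {"s1", "s2", "s3"})
```

In this made-up cohort the event is present in both tumor samples and in one of the two normal samples.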

The checkboxes Disrupted Genes and Gene/Gene fusions are filter tools. When selected, Disrupted Genes removes all the rows (fusion events) that have no genes assigned to them, i.e. those that merge two intergenic regions. However, a fusion between a gene and an intergenic region will be kept in the table. The Gene/Gene fusions filter retains only those fusion events that have an annotated gene on both sides of the breakpoint. In other words, only gene-to-gene fusions are kept in the table.
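
The difference between the two filters can be expressed as a short Python sketch. Here each fusion event is represented as a dictionary with Gene1 and Gene2 keys, and an intergenic side is represented by None. This representation and the function names are assumptions made for illustration and do not reflect how Partek Flow stores the table.

```python
from typing import Dict, List, Optional

Event = Dict[str, Optional[str]]


def disrupted_genes(events: List[Event]) -> List[Event]:
    """Keep events with a gene on at least one side of the breakpoint."""
    return [e for e in events if e["Gene1"] is not None or e["Gene2"] is not None]


def gene_gene_fusions(events: List[Event]) -> List[Event]:
    """Keep events with an annotated gene on both sides of the breakpoint."""
    return [e for e in events if e["Gene1"] is not None and e["Gene2"] is not None]


# Made-up events: gene-gene, gene-intergenic and intergenic-intergenic.
events = [
    {"Gene1": "GENEA", "Gene2": "GENEB"},
    {"Gene1": "GENEC", "Gene2": None},
    {"Gene1": None, "Gene2": None},
]
```

Applied to the three made-up events, disrupted_genes keeps the first two, while gene_gene_fusions keeps only the first.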

References

Annala MJ, Parker BC, Zhang W, Nykter M. Fusion genes and their discovery using high throughput sequencing. Cancer Lett. 2013;340:192-200.
Costa V, Aprile M, Esposito R, Ciccodicola A. RNA-Seq and human complex diseases: recent accomplishments and future perspectives. Eur J Hum Genet. 2013;21:134-142.
Kim D, Salzberg SL. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol. 2011;12:R72.
TopHat-Fusion. An algorithm for discovery of novel fusion transcripts. http://tophat.cbcb.umd.edu/fusion_index.html Accessed on April 25, 2014.
Dobin A, Davies CA, Schlesinger F et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15-21.
Haas BJ, Dobin A, Li B et al. Accuracy assessment of fusion transcript detection via read-mapping and de novo fusion transcript assembly-based methods. Genome Biol. 2019;20:213.

Additional Assistance

If you need additional assistance, please visit our support page to submit a help ticket or find phone numbers for regional support.