Cufflinks Assembly & DE output conversion to bioset

The most straightforward way to create a bioset for import into Correlation Engine is to go to the Gene Browser section, set the Significant filter option to “True”, and click the Save Filtered Table link at the bottom of the section. The “True” setting applies a q-value cut-off of 0.05. Users may also want to save an unfiltered version. Curators apply an FDR analysis to determine whether a particular bioset should be tagged as “Below threshold significance” and excluded from calculations in Correlation Engine.

Example content of this file is shown in the table below.

| Test ID | Gene | Locus | Status | log2(FFPESample1 FPKM) | log2(FFPESample2 FPKM) | log2(Ratio) | q Value | Significant |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | A2ML1 | chr12:8975149-9029381 |  | 0.73 | -10 | -10.73 | 1.68E-04 | TRUE |
|  | ABHD12B | chr14:51338877-51371688 |  | -1.52 | -10 | -8.48 | 1.68E-04 | TRUE |
|  | ACKR1 | chr1:159173802-159176290 |  | 1.17 | -10 | -11.17 | 1.68E-04 | TRUE |

The following table lists the columns to be extracted, new column headers, and any recommended transformations:

| Original header | New header | Transformation |
| --- | --- | --- |
| Gene | Gene name | None |
| log2(Ratio) | Log2 fold change | Remove values between -0.2630344 and 0.2630344, that is, between log2(1/1.2) and log2(1.2) |
| log2(<control> FPKM) | Control expression | Unlog values |
| log2(<test> FPKM) | Test expression | Unlog values |
| q Value | q-value | None |

The FPKM column headers vary according to the names of the test and control groups designated by users. Correlation Engine biosets normally report these values in unlogged format, so for consistency we recommend transforming the data prior to upload. Since the q-value cut-off has already been applied, applying the fold change cut-off is the last data quality step.
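The column extraction, renaming, and transformations described above can be sketched in Python as follows. This is a minimal illustration rather than a supported tool: the function names are hypothetical, each input row is assumed to be a dictionary keyed by the original column headers, and values strictly inside the fold change band are assumed to be the ones removed. The example also assumes FFPESample1 is the control group and FFPESample2 is the test group, and the second example row is made up to show a value that fails the cut-off.

```python
import math

# Fold change cut-off from the table above: log2(1.2), about 0.2630344.
LOG2_FC_CUTOFF = math.log2(1.2)


def convert_row(row, control_group, test_group):
    """Return one bioset row as a dict, or None if it fails the fold change cut-off.

    `row` maps the original column headers to string values (assumed input shape).
    """
    log2_ratio = float(row["log2(Ratio)"])
    if abs(log2_ratio) < LOG2_FC_CUTOFF:
        return None  # inside the excluded band log2(1/1.2) to log2(1.2)
    return {
        "Gene name": row["Gene"],
        "Log2 fold change": log2_ratio,
        # "Unlog" the FPKM columns: the source values are log2(FPKM).
        "Control expression": 2 ** float(row[f"log2({control_group} FPKM)"]),
        "Test expression": 2 ** float(row[f"log2({test_group} FPKM)"]),
        "q-value": float(row["q Value"]),
    }


def convert_table(rows, control_group, test_group):
    """Convert all rows and drop those removed by the fold change cut-off."""
    converted = (convert_row(r, control_group, test_group) for r in rows)
    return [r for r in converted if r is not None]


example_rows = [
    {   # values taken from the example table above
        "Gene": "A2ML1",
        "log2(FFPESample1 FPKM)": "0.73",
        "log2(FFPESample2 FPKM)": "-10",
        "log2(Ratio)": "-10.73",
        "q Value": "1.68E-04",
    },
    {   # made-up row that falls inside the excluded fold change band
        "Gene": "HYPOTHETICAL_GENE",
        "log2(FFPESample1 FPKM)": "2.0",
        "log2(FFPESample2 FPKM)": "2.1",
        "log2(Ratio)": "0.1",
        "q Value": "0.01",
    },
]

bioset_rows = convert_table(example_rows, "FFPESample1", "FFPESample2")
```

If the saved table is a delimited text file, rows of this shape could be read with `csv.DictReader` from the Python standard library; the delimiter depends on how the file was exported.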

An optional step to perform at this point is to replace the gene names with RefSeq identifiers. Correlation Engine matches RNA-Seq biosets to the correct platform model based on species-specific RefSeq identifiers. This ensures that the best statistics are used for correlation calculations. Skipping this step results in the bioset being treated as a custom platform. Note:

Files can be provided for the human, mouse, and rat genomes, respectively, for re-mapping.
If users wish to upload RNA-Seq data from other species supported in Correlation Engine, gene names should be used as is, and the biosets will be ingested as custom platforms.
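A minimal sketch of the optional re-mapping step, assuming the mapping file has already been loaded into a dictionary from gene name to RefSeq identifier. The format of the mapping files is not described here, so the dictionary, the function name, and the placeholder identifier are all hypothetical. Genes without a mapping are returned separately so they can be reviewed rather than silently changed.

```python
def remap_gene_names(rows, gene_to_refseq):
    """Split rows into (remapped, unmapped) using a gene name -> RefSeq lookup.

    `rows` are dicts with a "Gene name" key; `gene_to_refseq` is a plain dict
    built from a mapping file (its format is an assumption in this sketch).
    """
    remapped, unmapped = [], []
    for row in rows:
        refseq_id = gene_to_refseq.get(row["Gene name"])
        if refseq_id is None:
            unmapped.append(row)  # no identifier available; left untouched
        else:
            # Copy the row and replace only the gene name.
            remapped.append({**row, "Gene name": refseq_id})
    return remapped, unmapped


# Placeholder identifier for illustration only, not a real RefSeq accession.
mapping = {"A2ML1": "REFSEQ_ID_PLACEHOLDER"}
rows = [
    {"Gene name": "A2ML1", "q-value": 1.68e-4},
    {"Gene name": "UNMAPPED_GENE", "q-value": 0.01},
]
remapped, unmapped = remap_gene_names(rows, mapping)
```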

It is advisable to add information as a header to the data table in a bioset. This informs other users of the details of the processing and group identification. The list below shows the content Correlation Engine normally provides, followed by the layout.

Bioset summary: the same as the study title
Comparison: restates the comparison using full group names
Data pre-processing: fixed text
Analysis summary: modified according to species
Test expression: Test group name with sample count
Control expression: Control group name with sample count
Internal ID - <studyID>: Typically the GEO series number. Based on this ID, biosets can be matched with their Internal ID files

Bioset summary =

Comparison = <test group> v. <control group>

Data pre-processing = FASTQ files were downloaded from Sequence Read Archive (SRA). No other preprocessing was performed.

Analysis summary = Alignment genome, software used, gene identifiers used, cut-offs applied, etc.

Test expression - Median FPKM expression in <test group> (total replicates = #)

Control expression - Median FPKM expression in <control group> (total replicates = #)

Internal ID - <>

Gene name Fold change Control expression Test expression q-value
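The layout above can be assembled programmatically. The sketch below builds the header lines followed by the data table as a single text block. The function name and parameters are hypothetical, the tab delimiter between columns is an assumption (the required delimiter is not stated here), and the example values are made up.

```python
COLUMNS = ["Gene name", "Fold change", "Control expression", "Test expression", "q-value"]

# Fixed text taken from the layout above.
PREPROCESSING = (
    "FASTQ files were downloaded from Sequence Read Archive (SRA). "
    "No other preprocessing was performed."
)


def build_bioset_text(summary, test_group, control_group, test_n, control_n,
                      analysis_summary, internal_id, rows):
    """Return the header block followed by the data table as one string."""
    header = [
        f"Bioset summary = {summary}",
        f"Comparison = {test_group} v. {control_group}",
        f"Data pre-processing = {PREPROCESSING}",
        f"Analysis summary = {analysis_summary}",
        f"Test expression - Median FPKM expression in {test_group} "
        f"(total replicates = {test_n})",
        f"Control expression - Median FPKM expression in {control_group} "
        f"(total replicates = {control_n})",
        f"Internal ID - {internal_id}",
    ]
    # Tab-delimited table: an assumption of this sketch.
    table = ["\t".join(COLUMNS)]
    table += ["\t".join(str(value) for value in row) for row in rows]
    return "\n".join(header + table) + "\n"


# All values below are placeholders for illustration.
text = build_bioset_text(
    summary="Example study title",
    test_group="Test group",
    control_group="Control group",
    test_n=3,
    control_n=3,
    analysis_summary="Alignment genome, software used, gene identifiers used, cut-offs applied",
    internal_id="STUDY_ID_PLACEHOLDER",
    rows=[["GENE1", -2.5, 10.0, 4.0, 0.001]],
)
```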

Import the finalized bioset through the Import UI and select ranking on absolute fold change, descending.

Note: Not all genes in the Cufflinks output will be recognized by the Correlation Engine gene tables, particularly some miRNAs and lincRNAs.
