Cufflinks Assembly & DE output conversion to bioset
The most straightforward way to create a bioset for import into Correlation Engine is to open the Gene Browser section, set the Significant filter option to “True”, and click the Save Filtered Table link at the bottom of the section. The “True” setting applies a q-value cut-off of 0.05. Users may also want to save an unfiltered version. Curators apply an FDR analysis to determine whether a particular bioset should be tagged as “Below threshold significance” and excluded from calculations in Correlation Engine.
Example content of this file is shown in the table below.
A2ML1 | A2ML1 | chr12:8975149-9029381 | OK | 0.73 | -10 | -10.73 | 1.68E-04 | TRUE
ABHD12B | ABHD12B | chr14:51338877-51371688 | OK | -1.52 | -10 | -8.48 | 1.68E-04 | TRUE
ACKR1 | ACKR1 | chr1:159173802-159176290 | OK | 1.17 | -10 | -11.17 | 1.68E-04 | TRUE
The following table lists the columns to be extracted, new column headers, and any recommended transformations:
Column to extract | New column header | Recommended transformation
Gene | Gene name | None
log2(Ratio) | Log2 fold change | Remove values between -0.2630344 and 0.2630344, that is, between log2(1/1.2) and log2(1.2)
log2(<control> FPKM) | Control expression | Unlog values
log2(<test> FPKM) | Test expression | Unlog values
q Value | q-value | None
The FPKM column headers will vary according to the names of the test and control groups as designated by users. Correlation Engine biosets normally report these values in unlogged format, so for consistency we recommend transforming the data prior to upload. Since the q-value cut-off has already been applied, applying the fold change cut-off is the last data quality step.
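The transformations in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the curators' actual pipeline: the function names and the placeholder column headers `log2(control FPKM)` and `log2(test FPKM)` are assumptions, and the choice to keep a value that falls exactly on the cut-off boundary is also an assumption, since the text does not say whether the range is inclusive.

```python
import math

# Fold change cut-off from the table: log2(1.2), about 0.2630344.
FC_CUTOFF = math.log2(1.2)


def convert_row(row, control_col, test_col):
    """Convert one exported row (a dict of strings) to bioset columns.

    Returns None when the log2 fold change lies strictly inside the
    cut-off band. Boundary handling is an assumption of this sketch.
    """
    log2_ratio = float(row["log2(Ratio)"])
    if abs(log2_ratio) < FC_CUTOFF:
        return None
    return {
        "Gene name": row["Gene"],
        "Log2 fold change": log2_ratio,
        # "Unlog" the FPKM values: the export stores log2(FPKM).
        "Control expression": 2 ** float(row[control_col]),
        "Test expression": 2 ** float(row[test_col]),
        "q-value": float(row["q Value"]),
    }


def convert_table(rows, control_col, test_col):
    """Apply convert_row to every row and drop the filtered-out ones."""
    converted = (convert_row(r, control_col, test_col) for r in rows)
    return [r for r in converted if r is not None]


# Usage with the first example row; the FPKM column names are
# hypothetical and depend on the group names chosen by the user.
example_rows = [
    {
        "Gene": "A2ML1",
        "log2(control FPKM)": "0.73",
        "log2(test FPKM)": "-10",
        "log2(Ratio)": "-10.73",
        "q Value": "1.68E-04",
    },
]
bioset_rows = convert_table(
    example_rows, "log2(control FPKM)", "log2(test FPKM)"
)
```

In practice the rows would come from the saved filtered table, for example read with `csv.DictReader`; the delimiter of that file is not stated here and should be checked against the actual export.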
An optional step at this point is to replace the gene names with RefSeq identifiers. Correlation Engine matches RNA-Seq biosets to the correct platform model based on species-specific RefSeq identifiers. This ensures that the best statistics are used for correlation calculations. Skipping this step results in the bioset being treated as a custom platform. Note:
Re-mapping files can be provided for the human, mouse, and rat genomes
If users wish to upload RNA-seq data from other species supported in Correlation Engine, gene names should be used as is and they will be ingested as custom platforms
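The optional re-mapping step can be sketched as a simple lookup. The format of the re-mapping files is not described here, so this example assumes a two-column, tab-delimited file (gene name, then RefSeq identifier); the function names and the placeholder identifier are hypothetical. Rows without a match are returned separately rather than dropped, so the user can decide how to handle them.

```python
import csv


def load_mapping(lines):
    """Build a gene-name -> RefSeq dict from tab-delimited lines.

    Assumes two columns per line: gene name, RefSeq identifier.
    `lines` can be an open file or any iterable of strings.
    """
    reader = csv.reader(lines, delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def remap_gene_names(rows, gene_to_refseq):
    """Replace the "Gene name" value with its RefSeq identifier.

    Returns (mapped, unmapped): rows whose gene name was found in the
    mapping, and rows left unchanged because no match was found.
    """
    mapped, unmapped = [], []
    for row in rows:
        refseq_id = gene_to_refseq.get(row["Gene name"])
        if refseq_id is None:
            unmapped.append(row)
        else:
            mapped.append({**row, "Gene name": refseq_id})
    return mapped, unmapped


# Usage with a placeholder identifier (not a real RefSeq accession).
mapping = load_mapping(["A2ML1\tREFSEQ_PLACEHOLDER_1\n"])
mapped_rows, unmapped_rows = remap_gene_names(
    [{"Gene name": "A2ML1"}, {"Gene name": "ACKR1"}], mapping
)
```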
It is advisable to add a header to the data table in a bioset. The header informs other users of the processing details and group identification. The content Correlation Engine normally provides is listed below, followed by the layout.
Bioset summary: same as the study title
Comparison: restates the comparison using full group names
Data pre-processing: fixed text
Analysis summary: modified according to species
Test expression: Test group name with sample count
Control expression: Control group name with sample count
Internal ID - <studyID>: Typically the GEO series number. Based on this ID, biosets can be matched with their Internal ID files
Bioset summary =
Comparison = <test group> v. <control group>
Data pre-processing = FASTQ files were downloaded from Sequence Read Archive (SRA). No other preprocessing was performed.
Analysis summary = Alignment genome, software used, gene identifiers used, cut-offs applied, etc.
Test expression - Median FPKM expression in <test group> (total replicates = #)
Control expression - Median FPKM expression in <control group> (total replicates = #)
Internal ID - <>
Gene name Fold change Control expression Test expression q-value
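The layout above can be generated programmatically when many biosets are prepared. The following is a minimal sketch: the function name, its parameters, and the use of tabs between column headers are assumptions, and all example values are placeholders.

```python
# Column headers taken from the last line of the layout.
COLUMNS = [
    "Gene name",
    "Fold change",
    "Control expression",
    "Test expression",
    "q-value",
]

# Fixed text, as given in the layout.
PREPROCESSING = (
    "FASTQ files were downloaded from Sequence Read Archive (SRA). "
    "No other preprocessing was performed."
)


def build_header(summary, test_group, control_group,
                 n_test, n_control, analysis_summary, internal_id):
    """Return the bioset header block followed by the column header line."""
    lines = [
        f"Bioset summary = {summary}",
        f"Comparison = {test_group} v. {control_group}",
        f"Data pre-processing = {PREPROCESSING}",
        f"Analysis summary = {analysis_summary}",
        f"Test expression - Median FPKM expression in {test_group} "
        f"(total replicates = {n_test})",
        f"Control expression - Median FPKM expression in {control_group} "
        f"(total replicates = {n_control})",
        f"Internal ID - {internal_id}",
        # Tab-delimited column headers are an assumption of this sketch.
        "\t".join(COLUMNS),
    ]
    return "\n".join(lines)


# Usage with placeholder values.
header = build_header(
    summary="Example study title",
    test_group="treated",
    control_group="untreated",
    n_test=3,
    n_control=3,
    analysis_summary="Alignment genome, software used, cut-offs applied",
    internal_id="<studyID>",
)
```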
Import the finalized bioset through the Import UI and select ranking on absolute fold change, descending.
Note: Not all genes in the Cufflinks output will be recognized by the Correlation Engine gene tables, particularly some miRNAs and lincRNAs.