RNA-Seq alignment conversion to Correlation Engine Biosets
There are two options for transferring data to Correlation Engine from the outputs of the BaseSpace RNA-Seq Alignment app. First, you can download the entire analysis from the project page.
Second, you can select the sample of interest and scroll down to download the FPKM or gene counts file.
The following table shows example reference FPKM values for genes. Filter the rows by the desired status and FPKM cut-off (an FPKM greater than 0 is recommended as a minimum). Then create a new file containing only the tracking_id and FPKM columns.
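| tracking_id | class_code | nearest_ref_id | gene_id | gene_short_name | tss_id | locus | length | coverage | FPKM | FPKM_conf_lo | FPKM_conf_hi | FPKM_status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A1BG | - | - | A1BG | A1BG | TSS12895 | chr19:58858171-58874214 | - | - | 1.30307 | 1.30307 | 1.30307 | OK |
| A1BG-AS1 | - | - | A1BG-AS1 | A1BG-AS1 | TSS15204 | chr19:58858171-58874214 | - | - | 0.328392 | 0.328392 | 0.328392 | OK |
| A1CF | - | - | A1CF | A1CF | TSS27441 | chr10:52559168-52645435 | - | - | 0 | 0 | 0 | OK |
| A2M | - | - | A2M | A2M | TSS12287 | chr12:9217772-9268558 | - | - | 465.75 | 465.75 | 465.75 | OK |
| A2M-AS1 | - | - | A2M-AS1 | A2M-AS1 | TSS19734 | chr12:9217772-9268558 | - | - | 1.52412 | 1.52412 | 1.52412 | OK |
| A2ML1 | - | - | A2ML1 | A2ML1 | TSS32184,TSS6636 | chr12:8975149-9029381 | - | - | 4.83395 | 4.83395 | 4.83395 | OK |
| A2MP1 | - | - | A2MP1 | A2MP1 | TSS26806 | chr12:9381128-9386803 | - | - | 0.0701295 | 0.0701295 | 0.0701295 | OK |
| A3GALT2 | - | - | A3GALT2 | A3GALT2 | TSS700 | chr1:33772366-33786699 | - | - | 0 | 0 | 0 | OK |
| A4GALT | - | - | A4GALT | A4GALT | TSS7799 | chr22:43088126-43116876 | - | - | 9.15569 | 9.15569 | 9.15569 | OK |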
Rename these columns to Gene name and Test expression, respectively:
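| Gene name | Test expression |
| --- | --- |
| A1BG | 1.30307 |
| A1BG-AS1 | 0.328392 |
| A1CF | 0 |
| A2M | 465.75 |
| A2M-AS1 | 1.52412 |
| A2ML1 | 4.83395 |
| A2MP1 | 0.0701295 |
| A3GALT2 | 0 |
| A4GALT | 9.15569 |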
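The extraction, filtering, and renaming described above could be scripted roughly as follows. This is a minimal sketch, not a BaseSpace or Correlation Engine tool: it assumes the FPKM file is tab-delimited with a header row, and that the status column is named FPKM_status as in Cufflinks-style output. The function name `extract_fpkm` and its parameters are hypothetical.

```python
import csv
import io

# Assumed column layout (Cufflinks-style genes.fpkm_tracking); adjust to your file.
COLUMNS = ["tracking_id", "class_code", "nearest_ref_id", "gene_id",
           "gene_short_name", "tss_id", "locus", "length", "coverage",
           "FPKM", "FPKM_conf_lo", "FPKM_conf_hi", "FPKM_status"]


def extract_fpkm(src, dst, min_fpkm=0.0, keep_status=("OK",)):
    """Write tracking_id/FPKM for rows passing the filters, with renamed headers.

    src and dst are open text file objects. Rows are kept only when the status
    is in keep_status and FPKM is strictly greater than min_fpkm.
    """
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow(["Gene name", "Test expression"])
    kept = 0
    for row in reader:
        if row["FPKM_status"] not in keep_status:
            continue
        if float(row["FPKM"]) <= min_fpkm:
            continue
        writer.writerow([row["tracking_id"], row["FPKM"]])
        kept += 1
    return kept


# Small in-memory demonstration using three rows from the example table.
sample_rows = [
    ["A1BG", "-", "-", "A1BG", "A1BG", "TSS12895",
     "chr19:58858171-58874214", "-", "-", "1.30307", "1.30307", "1.30307", "OK"],
    ["A1CF", "-", "-", "A1CF", "A1CF", "TSS27441",
     "chr10:52559168-52645435", "-", "-", "0", "0", "0", "OK"],
    ["A2M", "-", "-", "A2M", "A2M", "TSS12287",
     "chr12:9217772-9268558", "-", "-", "465.75", "465.75", "465.75", "OK"],
]
sample_text = "\n".join("\t".join(r) for r in [COLUMNS] + sample_rows) + "\n"

fpkm_out = io.StringIO()
fpkm_kept = extract_fpkm(io.StringIO(sample_text), fpkm_out)
fpkm_result = fpkm_out.getvalue()
```

With the default cut-off, the A1CF row (FPKM of 0) is dropped and the other two rows are written under the new headers. For real files, pass objects opened with `open(path, newline="")` instead of the in-memory buffers.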
The reference gene counts file comes without headers. Before upload, add the headers Gene name and Test expression to the first and second columns, respectively.
| Gene name | Test expression |
| --- | --- |
| FAM41C | 16 |
| LOC100130417 | 43 |
| SAMD11 | 228 |
| NOC2L | 590 |
| KLHL17 | 33 |
| PLEKHN1 | 77 |
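Adding the two headers to a headerless gene counts file can be done in a few lines. The sketch below assumes the file is tab-delimited with the gene name in the first column and the count in the second; the function name `add_headers` is hypothetical.

```python
import csv
import io


def add_headers(src, dst, headers=("Gene name", "Test expression")):
    """Copy a headerless two-column table to dst with a header row prepended.

    Blank lines are skipped; only the first two columns of each row are kept.
    """
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow(headers)
    written = 0
    for row in reader:
        if not row:
            continue
        writer.writerow(row[:2])
        written += 1
    return written


# In-memory demonstration with rows from the example gene counts file.
counts_text = "FAM41C\t16\nLOC100130417\t43\nSAMD11\t228\n"
counts_out = io.StringIO()
counts_written = add_headers(io.StringIO(counts_text), counts_out)
counts_result = counts_out.getvalue()
```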
An optional step at this point is to replace the gene names with RefSeq identifiers. Correlation Engine matches RNA-Seq biosets to the correct platform model based on species-specific RefSeq identifiers, which ensures that the best available statistics are used for correlation calculations. If this step is skipped, the bioset is treated as a custom platform. Note:
Re-mapping files can be provided for the human, mouse, and rat genomes.
For RNA-Seq data from other species supported in Correlation Engine, gene names should be used as is; these biosets will be ingested as custom platforms.
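If a gene-name-to-RefSeq mapping is available, the optional re-mapping step might look like the sketch below. The format of the re-mapping files is not described here, so the mapping is represented as a plain dictionary, and the identifiers shown are placeholders rather than real RefSeq accessions. Dropping unmapped genes and reporting them separately is a design assumption of this sketch, not documented Correlation Engine behavior.

```python
def remap_to_refseq(rows, mapping):
    """Replace gene names with RefSeq identifiers where a mapping exists.

    rows is an iterable of (gene_name, value) pairs. Returns a tuple of
    (remapped_rows, unmapped_gene_names) so unmapped genes can be reviewed.
    """
    remapped = []
    unmapped = []
    for gene, value in rows:
        refseq_id = mapping.get(gene)
        if refseq_id is None:
            unmapped.append(gene)
        else:
            remapped.append((refseq_id, value))
    return remapped, unmapped


# Placeholder identifiers for illustration only; load real ones from a mapping file.
example_mapping = {
    "A1BG": "NM_PLACEHOLDER_1",
    "A2M": "NM_PLACEHOLDER_2",
}
example_rows = [("A1BG", "1.30307"), ("A2M", "465.75"), ("A2M-AS1", "1.52412")]

remapped_rows, unmapped_genes = remap_to_refseq(example_rows, example_mapping)
```

Returning the unmapped names separately makes it easy to decide whether to keep them under their gene names or leave them out before import.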
It is advisable to add descriptive information as a header to the data table in a bioset. This tells other users how the data were processed and how the groups were defined. The list below describes the content Correlation Engine normally provides, and the layout follows it. Since this basic bioset does not provide a fold change comparison column, lines 2 and 7 do not necessarily apply.
Bioset summary: the same as the study title
Comparison: restates the comparison using full group names
Data pre-processing: fixed text
Analysis summary: modified according to species
Test expression: Test group name with sample count
Control expression: Control group name with sample count
Internal ID - <studyID>: Typically the GEO series number. Based on this ID, biosets can be matched with their Internal ID files
Bioset summary =
Comparison = <test group> v. <control group>
Data pre-processing = FASTQ files were downloaded from Sequence Read Archive (SRA). No other preprocessing was performed.
Analysis summary = Alignment genome, software used, gene identifiers used, cut-offs applied, etc.
Test expression - Median FPKM expression in <test group> (total replicates = #)
Control expression - Median FPKM expression in <control group> (total replicates = #)
Internal ID - <>
Gene name Fold change Control expression Test expression q-value
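The header layout above can be generated programmatically and prepended to the data table. The following is a minimal sketch: the function name `bioset_header` and its parameters are hypothetical, the example values are invented, and the Comparison and Control expression lines are treated as optional so they can be left out for a single-group bioset. Whether the Import UI requires any additional syntax around these lines is not confirmed here.

```python
DEFAULT_PREPROCESSING = (
    "FASTQ files were downloaded from Sequence Read Archive (SRA). "
    "No other preprocessing was performed."
)


def bioset_header(summary, analysis_summary, test_group, test_replicates,
                  internal_id, preprocessing=DEFAULT_PREPROCESSING,
                  comparison=None, control=None):
    """Build the header lines in the order shown in the layout.

    comparison is an optional string such as "<test group> v. <control group>".
    control is an optional (group_name, replicate_count) tuple.
    """
    lines = [f"Bioset summary = {summary}"]
    if comparison is not None:
        lines.append(f"Comparison = {comparison}")
    lines.append(f"Data pre-processing = {preprocessing}")
    lines.append(f"Analysis summary = {analysis_summary}")
    lines.append(
        f"Test expression - Median FPKM expression in {test_group} "
        f"(total replicates = {test_replicates})"
    )
    if control is not None:
        group, replicates = control
        lines.append(
            f"Control expression - Median FPKM expression in {group} "
            f"(total replicates = {replicates})"
        )
    lines.append(f"Internal ID - {internal_id}")
    return lines


# Invented example values for a single-group bioset.
header_lines = bioset_header(
    summary="Example study title",
    analysis_summary="Example alignment genome and software",
    test_group="example test group",
    test_replicates=1,
    internal_id="EXAMPLE-ID",
)
table_text = "Gene name\tTest expression\nA1BG\t1.30307\n"
bioset_text = "\n".join(header_lines) + "\n" + table_text
```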
Import the finalized bioset through the Import UI.
Notes: Not all genes in the output will be recognized by the Correlation Engine gene tables, particularly some miRNAs and lincRNAs.
The last step before completing the upload is to apply tags to the biosets. The standard tag types are BIOSOURCE, TISSUE, PHENOTYPE/DISEASE, COMPOUND, GENE/GENEMODE, and BIODESIGN. GENE and GENEMODE tags are paired to indicate a directed perturbation, such as GENE KNOCKOUT or ANTIBODY TARGET. BIODESIGN tags reflect the structure of the experiment: DISEASE VS. NORMAL, TREATMENT VS. CONTROL, MUTANT VS. WILDTYPE, etc. For single-sample expression analysis, the user can omit the BIODESIGN tag or apply a non-differentiating one such as NORMAL VS. NORMAL, MUTANT VS. MUTANT, or DISEASE VS. DISEASE.