RNA-Seq alignment conversion to Correlation Engine Biosets

There are two ways to transfer data from the outputs of the BaseSpace RNA-Seq Alignment app to Correlation Engine. First, you can download the entire analysis from the project page.

Second, you can select the sample of interest and scroll down to download the FPKM or gene counts file.

The following table is an example of the reference FPKM values for genes. Filter the rows by the desired status and FPKM cut-off (an FPKM greater than 0 is the recommended minimum), then create a new file that contains only the tracking_id and FPKM columns.

Columns in the file: tracking_id, class_code, nearest_ref_id, gene_id, gene_short_name, tss_id, locus, length, coverage, FPKM, conf_lo, conf_hi, status. Only the columns with values in this example are shown.

tracking_id  tss_id            locus                    FPKM
A1BG         TSS12895          chr19:58858171-58874214  1.30307
A1BG-AS1     TSS15204          chr19:58858171-58874214  0.328392
A1CF         TSS27441          chr10:52559168-52645435
A2M          TSS12287          chr12:9217772-9268558    465.75
A2M-AS1      TSS19734          chr12:9217772-9268558    1.52412
A2ML1        TSS32184,TSS6636  chr12:8975149-9029381    4.83395
A2MP1        TSS26806          chr12:9381128-9386803    0.0701295
A3GALT2      TSS700            chr1:33772366-33786699
A4GALT       TSS7799           chr22:43088126-43116876  9.15569

Rename the tracking_id and FPKM columns to Gene name and Test expression, respectively:

tracking_id  FPKM
A1BG         1.30307
A1BG-AS1     0.328392
A1CF
A2M          465.75
A2M-AS1      1.52412
A2ML1        4.83395
A2MP1        0.0701295
A3GALT2
A4GALT       9.15569

Gene name  Test expression
A1BG       1.30307
A1BG-AS1   0.328392
A1CF
A2M        465.75
A2M-AS1    1.52412
A2ML1      4.83395
A2MP1      0.0701295
A3GALT2
A4GALT     9.15569
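The filtering, extraction, and renaming steps above can be sketched in Python. This is a minimal illustration, not an official tool: it assumes the FPKM file is tab-delimited with a header row that uses the column names shown in the table (tracking_id, FPKM, status). The function name, its parameters, and the example status value "OK" are hypothetical and should be adjusted to the actual file.

```python
import csv


def fpkm_to_bioset_rows(in_path, out_path, min_fpkm=0.0, keep_status=None,
                        status_column="status"):
    """Write a two-column Gene name / Test expression file from an FPKM table.

    Assumptions (not confirmed by the source): the input is tab-delimited and
    has a header row containing tracking_id, FPKM, and a status column.
    Returns the number of rows written.
    """
    kept = 0
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t")
        # Renamed headers, as described in the text.
        writer.writerow(["Gene name", "Test expression"])
        for row in reader:
            # Optional status filter: skip rows whose status is not accepted.
            if keep_status is not None and row.get(status_column) not in keep_status:
                continue
            try:
                fpkm = float(row["FPKM"])
            except (TypeError, ValueError):
                continue  # blank or non-numeric FPKM value
            # Keep only rows strictly above the cut-off (default: FPKM > 0).
            if fpkm > min_fpkm:
                writer.writerow([row["tracking_id"], row["FPKM"]])
                kept += 1
    return kept
```

For example, fpkm_to_bioset_rows("genes.fpkm_tracking", "bioset.txt", keep_status={"OK"}) would keep only rows with status OK and an FPKM above 0; both file names here are placeholders.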

The reference gene counts file comes without headers. Before upload, add the headers Gene name and Test expression to the first and second columns, respectively.

Gene name     Test expression
FAM41C
LOC100130417
SAMD11        228
NOC2L         590
KLHL17

FAM41C
LOC100130417
SAMD11        228
NOC2L         590
KLHL17
PLEKHN1
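Adding the header row to a gene counts file can be sketched as follows. This is a minimal example that assumes the counts file is tab-delimited with the gene name in the first column and the count in the second; the function name is hypothetical.

```python
def add_count_headers(in_path, out_path,
                      headers=("Gene name", "Test expression")):
    """Copy a header-less, tab-delimited counts file and prepend a header row.

    Assumption: the input has no header line of its own. Blank lines are
    dropped. Returns the number of data lines copied.
    """
    copied = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.write("\t".join(headers) + "\n")
        for line in src:
            if not line.strip():
                continue  # skip empty lines
            # Make sure every copied line ends with exactly one newline.
            dst.write(line.rstrip("\n") + "\n")
            copied += 1
    return copied
```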

An optional step at this point is to replace the gene names with RefSeq identifiers. Correlation Engine matches RNA-Seq biosets to the correct platform model based on species-specific RefSeq identifiers, which ensures that the best statistics are used for correlation calculations. If you skip this step, the bioset is treated as a custom platform. Note:

Re-mapping files can be provided for the human, mouse, and rat genomes.
To upload RNA-Seq data from other species supported in Correlation Engine, use the gene names as is; the biosets will be ingested as custom platforms.
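The optional re-mapping step could look like the sketch below. It assumes a two-column, tab-delimited mapping file (gene name, then RefSeq identifier); the actual layout of the re-mapping files is not described here, so that format and both function names are hypothetical. Dropping genes that have no mapping, and returning them for review, is a design choice of this example rather than documented behavior.

```python
import csv


def load_mapping(path):
    """Read a tab-delimited gene name -> identifier mapping (assumed layout)."""
    mapping = {}
    with open(path, newline="") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if len(row) >= 2 and row[0]:
                mapping[row[0]] = row[1]
    return mapping


def remap_gene_names(in_path, out_path, mapping):
    """Replace the first column of a bioset table using `mapping`.

    The first line is treated as the header row and copied unchanged.
    Genes without a mapping are left out of the output and returned
    to the caller as a list.
    """
    unmapped = []
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t")
        header = next(reader, None)
        if header is not None:
            writer.writerow(header)
        for row in reader:
            if not row:
                continue  # skip empty lines
            new_id = mapping.get(row[0])
            if new_id is None:
                unmapped.append(row[0])
                continue
            writer.writerow([new_id] + row[1:])
    return unmapped
```

Reviewing the returned list before upload shows how many genes would be lost by re-mapping.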

It is advisable to add descriptive information as a header above the data table in a bioset. This tells other users how the data were processed and how the groups are identified. The list below describes the content Correlation Engine normally provides, followed by the layout. Because this basic bioset has no fold change comparison column, lines 2 and 7 do not necessarily apply.

Bioset summary: same as the study title
Comparison: restates the comparison using full group names
Data pre-processing: fixed text
Analysis summary: modified according to species
Test expression: Test group name with sample count
Control expression: Control group name with sample count
Internal ID - <studyID>: Typically the GEO series number. Based on this ID, biosets can be matched with their Internal ID files

Bioset summary =

Comparison = <test group> v. <control group>

Data pre-processing = FASTQ files were downloaded from Sequence Read Archive (SRA). No other preprocessing was performed.

Analysis summary = Alignment genome, software used, gene identifiers used, cut-offs applied, etc.

Test expression - Median FPKM expression in <test group> (total replicates = #)

Control expression - Median FPKM expression in <control group> (total replicates = #)

Internal ID - <>

Gene name Fold change Control expression Test expression q-value
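The header layout above can be generated with a small helper such as the following sketch. The function name and parameters are hypothetical. The helper only assembles the text lines shown in the layout, and it leaves out the Comparison and Control expression lines when no control group is given and the Internal ID line when no ID is given, which is one possible reading of how to handle a single-sample bioset.

```python
DEFAULT_PREPROCESSING = (
    "FASTQ files were downloaded from Sequence Read Archive (SRA). "
    "No other preprocessing was performed."
)


def bioset_header_lines(summary, analysis_summary, test_group, test_replicates,
                        control_group=None, control_replicates=None,
                        internal_id=None, preprocessing=DEFAULT_PREPROCESSING,
                        measure="Median FPKM expression"):
    """Return the descriptive header lines for a bioset as a list of strings."""
    lines = [f"Bioset summary = {summary}"]
    if control_group is not None:
        lines.append(f"Comparison = {test_group} v. {control_group}")
    lines.append(f"Data pre-processing = {preprocessing}")
    lines.append(f"Analysis summary = {analysis_summary}")
    lines.append(
        f"Test expression - {measure} in {test_group} "
        f"(total replicates = {test_replicates})"
    )
    if control_group is not None:
        lines.append(
            f"Control expression - {measure} in {control_group} "
            f"(total replicates = {control_replicates})"
        )
    if internal_id is not None:
        lines.append(f"Internal ID - {internal_id}")
    return lines
```

The returned lines can be written above the column header row of the data table before import.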

Import the finalized bioset through the Import UI.

Note: Not all genes in the output will be recognized by the Correlation Engine gene tables, particularly some miRNAs and lincRNAs.

The last step before completing the upload is to apply tags to the biosets. The standard tag types are BIOSOURCE, TISSUE, PHENOTYPE/DISEASE, COMPOUND, GENE/GENEMODE, and BIODESIGN. GENE and GENEMODE tags are paired to indicate a directed perturbation such as GENE KNOCKOUT, ANTIBODY TARGET, etc. BIODESIGN tags reflect the structure of the experiment: DISEASE VS. NORMAL, TREATMENT VS. CONTROL, MUTANT VS. WILDTYPE, etc. For a single-sample expression analysis, the user can omit the BIODESIGN tag or apply a non-differentiating tag such as NORMAL VS. NORMAL, MUTANT VS. MUTANT, or DISEASE VS. DISEASE.
