RNA-Express output conversion to bioset
The core content needed to create a bioset for import into Correlation Engine is found in the “deseq.res.csv” file. The file contains 8 comma-separated columns and is located in the RNA-Express-AppResult/differential/<control_vs_comparison> folder. The columns of interest are the first column, which has no header and contains the Entrez gene names, and the columns with the headers “log2FoldChange”, “pvalue”, and “padj”.
| | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | status |
|---|---|---|---|---|---|---|---|
| DDX11L1 | 0.257395 | 0.093503 | 0.132188 | 0.707353 | 0.479347 | NaN | Low |
| WASH7P | 51.49006 | -0.02556 | 0.273837 | -0.09336 | 0.925619 | 0.953609 | OK |
| FAM138A | 0 | NaN | NaN | NaN | NaN | NaN | Low |
| FAM138F | 0 | NaN | NaN | NaN | NaN | NaN | Low |
| OR4F5 | 0 | NaN | NaN | NaN | NaN | NaN | Low |
| LOC729737 | 52.91158 | 0.432888 | 0.230048 | 1.881727 | 0.059873 | 0.11944 | OK |
| LOC100132287 | 0 | NaN | NaN | NaN | NaN | NaN | Low |
| LOC100132062 | 0 | NaN | NaN | NaN | NaN | NaN | Low |
| LOC100133331 | 50.25088 | 0.011386 | 0.254789 | 0.044688 | 0.964356 | 0.976735 | OK |
Our processing pipeline applies a number of transformations to these data to create a bioset. Some of the steps are optional.
Filter out rows with a “Low” status using the status column (this removes all of the uninformative NaN results)
Extract the four columns of interest (gene name, log2FoldChange, pvalue, padj) to create a new file and change the column headers as follows:
Column 1 (no header) -> Gene name
pvalue -> p-value
padj -> q-value
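The filtering and renaming steps above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the header row of deseq.res.csv has an empty first cell for the gene name column, and the function name `to_unfiltered_rows` is hypothetical. The sample rows are taken from the table shown earlier.

```python
import csv
import io

# Excerpt of deseq.res.csv (assumption: the first header cell is empty).
SAMPLE = """\
,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,status
DDX11L1,0.257395,0.093503,0.132188,0.707353,0.479347,NaN,Low
WASH7P,51.49006,-0.02556,0.273837,-0.09336,0.925619,0.953609,OK
FAM138A,0,NaN,NaN,NaN,NaN,NaN,Low
LOC729737,52.91158,0.432888,0.230048,1.881727,0.059873,0.11944,OK
"""

# Columns to keep, in output order, and the headers that get renamed.
KEEP = ["", "log2FoldChange", "pvalue", "padj"]
RENAMED = {"": "Gene name", "pvalue": "p-value", "padj": "q-value"}


def to_unfiltered_rows(handle):
    """Drop rows with status 'Low', keep the four columns of interest,
    and apply the new column headers."""
    rows = []
    for record in csv.DictReader(handle):
        if record["status"] == "Low":
            continue
        rows.append({RENAMED.get(col, col): record[col] for col in KEEP})
    return rows


rows = to_unfiltered_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row)
```

The log2FoldChange header is left unchanged here because its final name depends on the fold change representation chosen in the next step.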
Decide how to represent differential expression. This will determine how you will name the column and apply cutoffs later. Fold change data can be ingested as either:
log2 of the fold change ratio 0-N, column = Log2 fold change
fold change ratio 0-N, column = Fold change 0-N
directional +/- fold change from 0. Upon upload, fold change will be converted to +/- fold change. Column = Fold change
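To make the three representations concrete, here is a small sketch of how a log2 fold change could be converted to the other two forms. The directional convention used (ratios below 1 become the negative reciprocal) is a common one and is an assumption here, not a documented description of what Correlation Engine does on upload. The function names are hypothetical.

```python
def log2_to_ratio(log2_fc):
    """Convert a log2 fold change to a fold change ratio (always > 0)."""
    return 2.0 ** log2_fc


def ratio_to_directional(ratio):
    """Convert a fold change ratio to a signed (+/-) fold change.

    Assumed convention: ratios >= 1 are kept as-is, ratios < 1 become
    the negative reciprocal (for example 0.5 -> -2.0).
    """
    if ratio <= 0:
        raise ValueError("fold change ratio must be positive")
    return ratio if ratio >= 1.0 else -1.0 / ratio


# Example using the log2FoldChange value of LOC729737 from the source table.
ratio = log2_to_ratio(0.432888)
directional = ratio_to_directional(ratio)
print(ratio, directional)
```

Under this convention a directional fold change never falls between -1 and 1, which is why its cutoff range is symmetric around zero.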
Optional: Rename the gene names to RefSeq identifiers
Correlation Engine matches RNA-Seq biosets to the correct platform model based on species-specific RefSeq identifiers. This ensures that the best statistics are used for correlation calculations. Skipping this step results in the bioset being treated as a custom platform.
Mapping files can be provided for the human, mouse, and rat genomes for re-mapping
RNA-seq data from other species supported in Correlation Engine will be ingested as custom platforms
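The optional re-mapping step amounts to a lookup against a mapping file. The sketch below assumes the mapping has already been loaded into a dictionary; the identifiers shown (`ID_0001`, `ID_0002`) are placeholders rather than real identifiers, and the function name `remap_gene_names` is hypothetical. Rows without a match keep their original name and are reported so they can be reviewed.

```python
def remap_gene_names(rows, name_to_id):
    """Replace each row's gene name with its mapped identifier.

    Returns the remapped rows and the list of gene names that had no
    entry in the mapping (those rows are left unchanged).
    """
    remapped = []
    unmapped = []
    for row in rows:
        name = row["Gene name"]
        if name in name_to_id:
            remapped.append({**row, "Gene name": name_to_id[name]})
        else:
            unmapped.append(name)
            remapped.append(dict(row))
    return remapped, unmapped


# Placeholder mapping; a real mapping would come from the species file.
mapping = {"WASH7P": "ID_0001", "SAMD11": "ID_0002"}
example_rows = [
    {"Gene name": "WASH7P", "q-value": "0.012142"},
    {"Gene name": "NOC2L", "q-value": "2.11E-19"},
]
remapped_rows, missing = remap_gene_names(example_rows, mapping)
print(remapped_rows)
print(missing)
```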
At this point, you have the basic unfiltered bioset, equivalent to the Internal ID files provided as supplementary files in Correlation Engine. This file is used to perform an FDR analysis to determine whether the bioset should be tagged as “Below threshold significance” and thereby excluded from categorizations when the study is published to the Enterprise library.
Fold change cut-offs: how fold change is represented affects how the user applies filters to remove features. Features whose fold change falls within the range for the chosen representation are removed.
Log2 fold change ratio 0-N
log2(1/1.2) to log2(1.2), or
-0.2630344 to 0.2630344
Fold change ratio 0-N
1/1.2 to 1.2, or
0.8333333 to 1.2
+/- Fold change
-1.2 to 1.2
Q-value cutoff
Remove any features with q-value > 0.05
Note that this is different from our practice with array studies, where we remove features above a p-value cutoff of 0.05
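The two cutoffs can be combined into a single filter. The sketch below works on log2 fold change values and is only an illustration: the function name `passes_cutoffs` is hypothetical, and treating values exactly on the fold change boundary as kept is an assumption, since the text does not say how boundary values are handled.

```python
import math

FOLD_CHANGE_CUTOFF = 1.2   # features within 1/1.2 to 1.2 are removed
Q_VALUE_CUTOFF = 0.05      # features with q-value > 0.05 are removed

# The same cutoff on the log2 scale: log2(1.2) is about 0.2630344.
LOG2_CUTOFF = math.log2(FOLD_CHANGE_CUTOFF)


def passes_cutoffs(log2_fc, q_value):
    """Return True if a feature survives both the fold change and
    q-value filters. NaN values never pass."""
    if math.isnan(log2_fc) or math.isnan(q_value):
        return False
    return abs(log2_fc) >= LOG2_CUTOFF and q_value <= Q_VALUE_CUTOFF


print(passes_cutoffs(0.432888, 0.003668))   # outside the range, low q-value
print(passes_cutoffs(-0.02556, 0.012142))   # inside the removed range
print(passes_cutoffs(0.612835, 0.11944))    # q-value above the cutoff
```

Because the log2 range is symmetric around zero, a single absolute-value comparison covers both up- and down-regulated features.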
At this point the files should look like one of the two following tables:
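| Gene name | Log2 fold change | p-value | q-value |
|---|---|---|---|
| WASH7P | -0.02556 | 0.007739 | 0.012142 |
| LOC729737 | 0.432888 | 0.002186 | 0.003668 |
| LOC100133331 | 0.011386 | 1.50E-10 | 4.47E-10 |
| LOC100288069 | 0.578921 | 1.94E-129 | 5.89E-128 |
| LOC643837 | -0.00985 | 1.27E-21 | 6.35E-21 |
| SAMD11 | 0.612835 | 8.38E-06 | 1.81E-05 |
| NOC2L | 0.064104 | 4.47E-20 | 2.11E-19 |
| KLHL17 | -1.47607 | 1.82E-27 | 1.12E-26 |

| Gene name | Fold change | p-value | q-value |
|---|---|---|---|
| 653635 | -1.56044 | 0.007739 | 0.012142 |
| 79854 | 1.575059 | 0.002186 | 0.003668 |
| 100130417 | -2.36943 | 1.50E-10 | 4.47E-10 |
| 148398 | -3.90303 | 1.94E-129 | 5.89E-128 |
| 26155 | 1.271878 | 1.27E-21 | 6.35E-21 |
| 84069 | -1.3572 | 8.38E-06 | 1.81E-05 |
| 9636 | -1.58597 | 4.47E-20 | 2.11E-19 |
| 375790 | -1.47607 | 1.82E-27 | 1.12E-26 |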
Optional: Add in test expression and control expression
In addition to the RNA-Express workflow, our pipeline provides follow-on calculations of the expression level in the test group and the control group of samples. These are based on the globally normalized counts, adjusted for gene length to calculate fragments per kilobase of gene length per million reads (FPKM).
The source file, genes.counts.csv, is found in the RNA-Express-AppResult/differential/global folder
Using a reference for gene lengths, a user would create a genes.fpkm.csv file and then calculate the median FPKM value for each group per feature.
These values are provided only as a reference and are not used in correlations
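As a rough illustration of the FPKM step, the sketch below uses the standard FPKM formula and takes the median per group. It does not read genes.counts.csv, because the layout of that file is not described here; the sample names, counts, gene length, and read totals are made-up inputs, and the function names are hypothetical.

```python
from statistics import median


def fpkm(count, gene_length_bp, total_reads):
    """Fragments per kilobase of gene length per million reads."""
    return count * 1e9 / (gene_length_bp * total_reads)


def median_group_fpkm(counts_by_sample, gene_length_bp, total_reads_by_sample, group):
    """Median FPKM of one gene across the samples in a group."""
    return median(
        fpkm(counts_by_sample[s], gene_length_bp, total_reads_by_sample[s])
        for s in group
    )


# Hypothetical inputs for a single gene of length 2,000 bp.
counts = {"test_1": 1000, "test_2": 2000, "test_3": 4000}
totals = {"test_1": 1_000_000, "test_2": 1_000_000, "test_3": 1_000_000}

test_expression = median_group_fpkm(counts, 2000, totals, ["test_1", "test_2", "test_3"])
print(test_expression)
```

The same call with the control sample names would give the control expression value for the gene.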
Add header information. We provide details about the processing and group identification in the header section of each bioset. Below is an example of this content. Highlights include:
The Bioset summary is the same as the study title
Comparison: restates the comparison using full group names
Data pre-processing: fixed text
Analysis summary: modified according to species
Test expression: Test group name with sample count
Control expression: Control group name with sample count
Internal ID - <studyID>: Typically the GEO series number. Based on this ID, biosets can be matched with their Internal ID files
Bioset summary = Prostate cancer cell lines treated with combinations of enzalutamide, dihydrotestosterone and JQ1
Comparison = Prostate_cancer_cell_line_MR49F_treated_24hr_with_1nM_dihydrotestosterone_(DHT)_and_charcoal_stripped_serum vs. Prostate_cancer_cell_line_MR49F_treated_24hr_with_0.1percent_each_dimethylasulfoxide_(DMSO)_and_ethanol_vehicles_and_charcoal_stripped_serum
Data pre-processing = FASTQ files were downloaded from Sequence Read Archive (SRA). No other preprocessing was performed.
Analysis summary = Reads are aligned to the human genome (UCSC iGenomes hg19, download date 5/23/2014) using STAR 2.3 and RefSeq annotations. Reads are assigned to a gene if the read (or both reads in a pair) uniquely and fully map to the exons of one gene. Differential gene expression results between the control and test sample groups are generated based on these counts using DESeq2. The base-mean read count, fold change, p-value and q-value (Benjamini-Hochberg adjusted) are derived from this analysis. The median FPKMs per group are calculated separately based on normalized read counts, number of aligned reads and the full gene length. The genes were filtered with a q-value cutoff 0.05. An additional fold change cutoff of +/-1.2 was applied to generate the final list of genes.
Test expression - Median FPKM expression in Prostate_cancer_cell_line_MR49F_treated_24hr_with_1nM_dihydrotestosterone_(DHT)_and_charcoal_stripped_serum (total replicates = 3)
Control expression - Median FPKM expression in Prostate_cancer_cell_line_MR49F_treated_24hr_with_0.1percent_each_dimethylasulfoxide_(DMSO)_and_ethanol_vehicles_and_charcoal_stripped_serum (total replicates = 3)
Internal ID - GSE69896_3
Rename the file with an appropriately descriptive title of 96 characters or fewer
Import through the UI and select ranking by absolute fold change, decreasing