RNA-Express output conversion to bioset


Last updated 3 days ago


The core content needed to create a bioset for import into Correlation Engine is in the “deseq.res.csv” file, which is found in the RNA-Express-AppResult/differential/<control_vs_comparison> folder. The file contains 8 comma-separated columns. The columns of interest are the first column, which has no header and contains the Entrez gene names, and the columns with the headers “log2FoldChange”, “pvalue”, and “padj”.

| (no header) | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DDX11L1 | 0.257395 | 0.093503 | 0.132188 | 0.707353 | 0.479347 | NaN | Low |
| WASH7P | 51.49006 | -0.02556 | 0.273837 | -0.09336 | 0.925619 | 0.953609 | OK |
| FAM138A | 0 | NaN | NaN | NaN | NaN | NaN | Low |
| FAM138F | 0 | NaN | NaN | NaN | NaN | NaN | Low |
| OR4F5 | 0 | NaN | NaN | NaN | NaN | NaN | Low |
| LOC729737 | 52.91158 | 0.432888 | 0.230048 | 1.881727 | 0.059873 | 0.11944 | OK |
| LOC100132287 | 0 | NaN | NaN | NaN | NaN | NaN | Low |
| LOC100132062 | 0 | NaN | NaN | NaN | NaN | NaN | Low |
| LOC100133331 | 50.25088 | 0.011386 | 0.254789 | 0.044688 | 0.964356 | 0.976735 | OK |
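As a rough illustration of the layout described above, the following sketch reads a few rows with Python's standard csv module. It assumes the header row carries an empty first cell for the unnamed gene column; the key "gene" and the function name read_deseq_results are labels chosen for this example, not part of RNA-Express or Correlation Engine.

```python
import csv
import io

# Excerpt in the layout described above. Assumption: the first header cell
# is empty because the gene-name column has no header.
SAMPLE = """\
,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,status
DDX11L1,0.257395,0.093503,0.132188,0.707353,0.479347,NaN,Low
WASH7P,51.49006,-0.02556,0.273837,-0.09336,0.925619,0.953609,OK
FAM138A,0,NaN,NaN,NaN,NaN,NaN,Low
"""


def read_deseq_results(handle):
    """Return one dict per data row, storing the unnamed first column as 'gene'."""
    reader = csv.reader(handle)
    header = next(reader)
    header[0] = "gene"  # hypothetical label for the headerless column
    return [dict(zip(header, row)) for row in reader if row]


rows = read_deseq_results(io.StringIO(SAMPLE))
for row in rows:
    print(row["gene"], row["log2FoldChange"], row["pvalue"], row["padj"], row["status"])
```

For a real file, the same function could be called on `open("deseq.res.csv", newline="")` instead of the in-memory sample.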

Our processing pipeline applies several transformations to these data to create a bioset. Some of the steps are optional.

  1. Filter out rows with a Low value in the “status” column (this removes all meaningless NaN results)

  2. Extract the four columns of interest into a new file and rename the column headers as follows:

    1. Column 1 (no header) -> Gene name

    2. pvalue -> p-value

    3. padj -> q-value

  3. Decide how to represent differential expression. This choice determines the name of the fold change column and how cutoffs are applied later. Fold change data can be ingested in one of three forms:

    1. Log2 of the fold change ratio 0-N. Column = Log2 fold change

    2. Fold change ratio 0-N. Column = Fold change 0-N

    3. Directional +/- fold change from 0. Upon upload, fold change is converted to +/- fold change. Column = Fold change

  4. Optional: Rename the gene names to RefSeq identifiers

    1. Correlation Engine matches RNA-Seq biosets to the correct platform model based on species-specific RefSeq identifiers. This ensures that the best statistics are used for correlation calculations. Skipping this step results in the bioset being treated as a custom platform.

    2. Re-mapping files can be provided for the human, mouse, and rat genomes

    3. RNA-Seq data uploaded for other species supported in Correlation Engine will be ingested as custom platforms
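Steps 1 and 2 above can be sketched as follows. This is a minimal illustration and not the pipeline's actual code: the key "gene" is an assumed label for the headerless first column, the function name to_unfiltered_bioset is hypothetical, and the fold change column is named for the log2 representation from step 3.

```python
# Source header -> bioset header, following step 2 above.
# "gene" is an assumed label for the headerless first column.
RENAME = {
    "gene": "Gene name",
    "log2FoldChange": "Log2 fold change",
    "pvalue": "p-value",
    "padj": "q-value",
}


def to_unfiltered_bioset(rows):
    """Drop rows flagged Low and keep only the renamed columns of interest."""
    kept = []
    for row in rows:
        if row["status"].strip().lower() == "low":
            continue  # low-status rows carry the NaN results
        kept.append({new: row[old] for old, new in RENAME.items()})
    return kept


# Two rows copied from the example table above.
rows = [
    {"gene": "DDX11L1", "baseMean": "0.257395", "log2FoldChange": "0.093503",
     "lfcSE": "0.132188", "stat": "0.707353", "pvalue": "0.479347",
     "padj": "NaN", "status": "Low"},
    {"gene": "WASH7P", "baseMean": "51.49006", "log2FoldChange": "-0.02556",
     "lfcSE": "0.273837", "stat": "-0.09336", "pvalue": "0.925619",
     "padj": "0.953609", "status": "OK"},
]

bioset = to_unfiltered_bioset(rows)
print(bioset)
```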

At this point, you have the basic unfiltered bioset, equivalent to the Internal ID files provided as supplementary files in Correlation Engine. This file is used to perform an FDR analysis to determine whether the bioset should be tagged as “Below threshold significance” and thereby excluded from categorizations when the study is published to the Enterprise library.

  1. Fold change cutoffs: how fold change is represented affects how the user applies filters to remove features

    1. Log2 fold change ratio 0-N

      1. cutoffs log2(1/1.2) to log2(1.2), or

      2. -0.2630344 to 0.2630344

    2. Fold change ratio 0-N

      1. 1/1.2 to 1.2, or

      2. 0.8333333 to 1.2

    3. +/- Fold change

      1. -1.2 to 1.2

  2. Q-value cutoff

    1. Remove any features with q-value > 0.05

    2. Note: this differs from our practice with array studies, where we remove features above a p-value cutoff of 0.05
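The three fold change representations and the two cutoffs can be related with a short sketch. Two assumptions are made here that the text does not spell out: a ratio below 1 is converted to a +/- fold change as -1/ratio, and a feature sitting exactly on a cutoff is kept. The function names are hypothetical.

```python
import math

FOLD_CUTOFF = 1.2  # fold change cutoff from the list above
Q_CUTOFF = 0.05    # q-value cutoff from the list above


def log2_to_ratio(log2_fc):
    """Convert a log2 fold change to a 0-N fold change ratio."""
    return 2.0 ** log2_fc


def ratio_to_signed(ratio):
    """Convert a 0-N ratio to +/- fold change (assumed: below 1 becomes -1/ratio)."""
    return ratio if ratio >= 1.0 else -1.0 / ratio


def passes_cutoffs(log2_fc, q_value):
    """Keep a feature only if it lies outside the fold change band and q <= 0.05."""
    if math.isnan(log2_fc) or math.isnan(q_value):
        return False
    return abs(log2_fc) >= math.log2(FOLD_CUTOFF) and q_value <= Q_CUTOFF


print(round(math.log2(FOLD_CUTOFF), 7))        # width of the log2 band
print(ratio_to_signed(log2_to_ratio(-1.0)))    # a halving expressed as +/- fold change
print(passes_cutoffs(0.432888, 0.003668))
```

Working in log2 space keeps the fold change test symmetric, so a single absolute-value comparison covers both directions.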

At this point the files should look like one of the two following tables:

| Gene name | Log2 fold change | p-value | q-value |
| --- | --- | --- | --- |
| WASH7P | -0.02556 | 0.007739 | 0.012142 |
| LOC729737 | 0.432888 | 0.002186 | 0.003668 |
| LOC100133331 | 0.011386 | 1.50E-10 | 4.47E-10 |
| LOC100288069 | 0.578921 | 1.94E-129 | 5.89E-128 |
| LOC643837 | -0.00985 | 1.27E-21 | 6.35E-21 |
| SAMD11 | 0.612835 | 8.38E-06 | 1.81E-05 |
| NOC2L | 0.064104 | 4.47E-20 | 2.11E-19 |
| KLHL17 | -1.47607 | 1.82E-27 | 1.12E-26 |

| Gene name | Fold change | p-value | q-value |
| --- | --- | --- | --- |
| 653635 | -1.56044 | 0.007739 | 0.012142 |
| 79854 | 1.575059 | 0.002186 | 0.003668 |
| 100130417 | -2.36943 | 1.50E-10 | 4.47E-10 |
| 148398 | -3.90303 | 1.94E-129 | 5.89E-128 |
| 26155 | 1.271878 | 1.27E-21 | 6.35E-21 |
| 84069 | -1.3572 | 8.38E-06 | 1.81E-05 |
| 9636 | -1.58597 | 4.47E-20 | 2.11E-19 |
| 375790 | -1.47607 | 1.82E-27 | 1.12E-26 |

  1. Optional: Add test expression and control expression

    1. In addition to the RNA-Express workflow, our pipeline provides follow-on calculations of the average counts in the test group of samples and the control group of samples. These are based on the globally normalized counts and adjusted for gene length to give fragments per kb of gene length per million total reads (FPKM).

    2. The source file is genes.counts.csv, found in the RNA-Express-AppResult/differential/global folder

    3. Using a reference for gene lengths, a user would create a genes.fpkm.csv file and then calculate the median FPKM value for each group per feature.

    4. These values are provided only as a reference and are not used in correlations

  2. Add header information. We provide details about the processing and group identification in the header section of each bioset. An example of this content follows the list. Some highlights:

    1. The Bioset summary is the same as the study title

    2. Comparison: restates the comparison using full group names

    3. Data pre-processing: fixed text

    4. Analysis summary: modified according to species

    5. Test expression: Test group name with sample count

    6. Control expression: Control group name with sample count

    7. Internal ID - <studyID>: Typically the GEO series number. Based on this ID, biosets can be matched with their Internal ID files

Bioset summary = Prostate cancer cell lines treated with combinations of enzalutamide, dihydrotestosterone and JQ1

Comparison = Prostate_cancer_cell_line_MR49F_treated_24hr_with_1nM_dihydrotestosterone_(DHT)_and_charcoal_stripped_serum vs. Prostate_cancer_cell_line_MR49F_treated_24hr_with_0.1percent_each_dimethylasulfoxide_(DMSO)_and_ethanol_vehicles_and_charcoal_stripped_serum

Data pre-processing = FASTQ files were downloaded from Sequence Read Archive (SRA). No other preprocessing was performed.

Analysis summary = Reads are aligned to the human genome (UCSC iGenomes hg19, download date 5/23/2014) using STAR 2.3 and RefSeq annotations. Reads are assigned to a gene if the read (or both reads in a pair) uniquely and fully map to the exons of one gene. Differential gene expression results between the control and test sample groups are generated based on these counts using DESeq2. The base-mean read count, fold change, p-value and q-value (Benjamini-Hochberg adjusted) are derived from this analysis. The median FPKMs per group are calculated separately based on normalized read counts, number of aligned reads and the full gene length. The genes were filtered with a q-value cutoff 0.05. An additional fold change cutoff of +/-1.2 was applied to generate the final list of genes.

Test expression - Median FPKM expression in Prostate_cancer_cell_line_MR49F_treated_24hr_with_1nM_dihydrotestosterone_(DHT)_and_charcoal_stripped_serum (total replicates = 3)

Control expression - Median FPKM expression in Prostate_cancer_cell_line_MR49F_treated_24hr_with_0.1percent_each_dimethylasulfoxide_(DMSO)_and_ethanol_vehicles_and_charcoal_stripped_serum (total replicates = 3)

Internal ID - GSE69896_3
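The optional test and control expression values described earlier can be sketched with the standard FPKM definition. The exact normalization used by the pipeline is not specified here, so the example treats the inputs as already-normalized counts and uses made-up gene lengths and read totals; the function names fpkm and median_group_fpkm are hypothetical.

```python
from statistics import median


def fpkm(count, gene_length_bp, total_reads):
    """Fragments per kilobase of gene length per million total reads."""
    return count / (gene_length_bp / 1_000) / (total_reads / 1_000_000)


def median_group_fpkm(counts, gene_length_bp, library_sizes):
    """Median FPKM for one gene across the samples of one group."""
    return median(
        fpkm(count, gene_length_bp, total)
        for count, total in zip(counts, library_sizes)
    )


# Illustrative values only: one 2,000 bp gene in three replicates,
# each with 10 million aligned reads.
test_counts = [500, 1000, 2000]
library_sizes = [10_000_000, 10_000_000, 10_000_000]

print(fpkm(500, 2_000, 10_000_000))
print(median_group_fpkm(test_counts, 2_000, library_sizes))
```

The same calculation would be repeated for the control group to fill the control expression column.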

  1. Rename the file to an appropriately descriptive title of 96 characters or fewer

  2. Import through the UI and select ranking on absolute fold change, decreasing
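The ranking itself is chosen in the import UI. The sketch below only previews what "absolute fold change, decreasing" means for a few rows from the second example table, and checks a file name against the 96-character limit. Whether that limit includes the file extension is not stated, so the check counts the whole name as an assumption; the function names and the sample file name are hypothetical.

```python
MAX_NAME_LENGTH = 96  # limit from the rename step above


def name_fits(file_name):
    """True if the name is within the limit (assumed to count the whole name)."""
    return len(file_name) <= MAX_NAME_LENGTH


def rank_by_abs_fold_change(rows, column="Fold change"):
    """Order rows by absolute fold change, largest first."""
    return sorted(rows, key=lambda row: abs(float(row[column])), reverse=True)


# Rows copied from the second example table above.
rows = [
    {"Gene name": "653635", "Fold change": "-1.56044"},
    {"Gene name": "79854", "Fold change": "1.575059"},
    {"Gene name": "148398", "Fold change": "-3.90303"},
]

ranked = rank_by_abs_fold_change(rows)
print([row["Gene name"] for row in ranked])
print(name_fits("MR49F_DHT_vs_vehicle.txt"))
```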