Cohorts Data in ICA Base

ICA Cohorts data can be viewed in an ICA Project Base instance as a shared database. A shared database in ICA Base operates as a database view. To use this feature, enable Base for your project prior to starting any ICA Cohorts ingestions. See Base for more information on enabling this feature in your ICA Project.

ICA Cohorts Base Tables

After ingesting data into your project, Phenotypic and Molecular data are available to view in Base. See Cohorts Import for instructions on importing data sets into Cohorts.

  1. Post ingestion, data will be represented in Base.

  2. Select BASE from the ICA left-navigation and click Query.

  3. Under the New Query window, a list of tables is displayed. Expand the Shared Database for Project \<your project name\>.

  4. Cohorts tables will be displayed.

  5. To preview a table and its fields, click the corresponding view in the list.

  6. Clicking any of these views then selecting PREVIEW on the right-hand side will show you a preview of the data in the tables.

Note: If your ingestion includes Somatic variants, there will be two molecular tables: ANNOTATED_SOMATIC_MUTATIONS and ANNOTATED_VARIANTS. All ingestions will include a PHENOTYPE table.


Note: The PHENOTYPE table includes a harmonized set of fields collected across all data ingestions; it is not representative of all data ingested for the Subject or Sample. Sample information is also displayed in this table, if applicable. If molecular data is included in the ingestion, this sample information drives the annotation process and is stored in the PHENOTYPE table.

Phenotype Data

| Field Name | Type | Description |
| --- | --- | --- |
| SAMPLE_BARCODE | STRING | Sample Identifier |
| SUBJECTID | STRING | Identifier for Subject entity |
| STUDY | STRING | Study designation |
| AGE | NUMERIC | Age in years |
| SEX | STRING | Sex field to drive annotation |
| POPULATION | STRING | Population Designation for 1000 Genomes Project |
| SUPERPOPULATION | STRING | Superpopulation Designation from 1000 Genomes Project |
| RACE | STRING | Race according to NIH standard |
| CONDITION_ONTOLOGIES | VARIANT | Diagnosis Ontology Source |
| CONDITION_IDS | VARIANT | Diagnosis Concept Ids |
| CONDITIONS | VARIANT | Diagnosis Names |
| HARMONIZED_CONDITIONS | VARIANT | Diagnosis High-level concept to drive UI |
| LIBRARYTYPE | STRING | Sequencing technology |
| ANALYTE | STRING | Substance sequenced |
| TISSUE | STRING | Tissue source |
| TUMOR_OR_NORMAL | STRING | Tumor designation for somatic |
| GENOMEBUILD | STRING | Genome Build to drive annotations - hg38 only |
| SAMPLE_BARCODE_VCF | STRING | Sample ID from VCF |
| AFFECTED_STATUS | NUMERIC | Affected, Unaffected, or Unknown for Family Based Analysis |
| FAMILY_RELATIONSHIP | STRING | Relationship designation for Family Based Analysis |
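
As a rough illustration of how the PHENOTYPE fields can be combined with molecular data, the sketch below joins PHENOTYPE to ANNOTATED_VARIANTS (described in the next section). It uses an in-memory SQLite database purely as a local stand-in for Base, and the rows are invented for the example. It also assumes that SAMPLE_BARCODE holds the same values in both tables; if your VCF sample IDs differ from your sample identifiers, SAMPLE_BARCODE_VCF may be the column to match on instead.

```python
import sqlite3

# In-memory SQLite is only a local stand-in for the Base shared database.
# All rows below are invented illustration data, not real Cohorts output.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PHENOTYPE (
    SAMPLE_BARCODE TEXT, SUBJECTID TEXT, STUDY TEXT, AGE REAL, SEX TEXT);
CREATE TABLE ANNOTATED_VARIANTS (
    SAMPLE_BARCODE TEXT, STUDY TEXT, GENE_HGNC TEXT, CONSEQUENCE TEXT);
INSERT INTO PHENOTYPE VALUES
    ('S1', 'P1', 'DEMO', 54, 'F'),
    ('S2', 'P2', 'DEMO', 61, 'M');
INSERT INTO ANNOTATED_VARIANTS VALUES
    ('S1', 'DEMO', 'TP53', 'missense'),
    ('S2', 'DEMO', 'TP53', 'intronic'),
    ('S2', 'DEMO', 'BRCA2', 'stop gained');
""")

# Assumption: SAMPLE_BARCODE and STUDY together link a variant to its sample.
QUERY = """
SELECT p.SUBJECTID, p.AGE, p.SEX, v.GENE_HGNC, v.CONSEQUENCE
FROM PHENOTYPE AS p
JOIN ANNOTATED_VARIANTS AS v
  ON v.SAMPLE_BARCODE = p.SAMPLE_BARCODE AND v.STUDY = p.STUDY
WHERE v.GENE_HGNC = ?
ORDER BY p.SUBJECTID
"""

# Subjects carrying any variant annotated to the requested gene symbol.
rows = conn.execute(QUERY, ("TP53",)).fetchall()
for row in rows:
    print(row)
```

The same SELECT statement, minus the SQLite setup, is the kind of query that could be adapted for the Query window in Base, subject to the table and database names shown in your project.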

Annotated Variants

This table will be available for all projects with ingested molecular data.

| Field Name | Type | Description |
| --- | --- | --- |
| SAMPLE_BARCODE | STRING | Original sample barcode used in VCF column |
| STUDY | STRING | Study designation |
| GENOMEBUILD | STRING | Only hg38 is supported |
| CHROMOSOME | STRING | Chromosome without 'chr' prefix |
| CHROMOSOMEID | NUMERIC | Chromosome ID: 1..22, 23=X, 24=Y, 25=Mt |
| DBSNP | STRING | dbSNP Identifiers |
| VARIANT_KEY | STRING | Variant ID in the form "1:12345678:12345678:C" |
| NIRVANA_VID | STRING | Broad Institute VID: "1-12345678-A-C" |
| VARIANT_TYPE | STRING | Description of Variant Type (e.g. SNV, Deletion, Insertion) |
| VARIANT_CALL | NUMERIC | 1=germline, 2=somatic |
| DENOVO | BOOLEAN | true / false |
| GENOTYPE | STRING | "G\|T" |
| READ_DEPTH | NUMERIC | Sequencing read depth |
| ALLELE_COUNT | NUMERIC | Counts of each alternate allele for each site across all samples |
| ALLELE_DEPTH | STRING | Unfiltered count of reads that support a given allele for an individual sample |
| FILTERS | STRING | Filter field from VCF. If all filters pass, field is PASS |
| ZYGOSITY | NUMERIC | 0 = hom ref, 1 = het ref/alt, 2 = hom alt, 4 = hemi alt |
| GENEMODEL | NUMERIC | 1=Ensembl, 2=RefSeq |
| GENE_HGNC | STRING | HUGO/HGNC gene symbol |
| GENE_ID | STRING | Ensembl gene ID ("ENSG00001234") |
| GID | NUMERIC | NCBI Entrez Gene ID (RefSeq) or numerical part of Ensembl ENSG ID |
| TRANSCRIPT_ID | STRING | Ensembl ENST or RefSeq NM_ |
| CANONICAL | STRING | Transcript designated 'canonical' by source |
| CONSEQUENCE | STRING | missense, stop gained, intronic, etc. |
| HGVSC | STRING | The HGVS coding sequence name |
| HGVSP | STRING | The HGVS protein sequence name |
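
Several ANNOTATED_VARIANTS fields are numeric codes. The following sketch shows one way to translate them back into labels after exporting rows from Base. The lookup values come from the field descriptions above; the helper names (`parse_variant_key`, `decode_variant_row`) are hypothetical, and the reading of VARIANT_KEY as chromosome, start, end, and allele is an assumption inferred from the example value, not a documented format.

```python
from typing import Dict, NamedTuple

# Code tables taken from the field descriptions for ANNOTATED_VARIANTS.
CHROMOSOME_BY_ID = {**{i: str(i) for i in range(1, 23)}, 23: "X", 24: "Y", 25: "Mt"}
ZYGOSITY = {0: "hom ref", 1: "het ref/alt", 2: "hom alt", 4: "hemi alt"}
VARIANT_CALL = {1: "germline", 2: "somatic"}
GENEMODEL = {1: "Ensembl", 2: "RefSeq"}


class VariantKey(NamedTuple):
    chromosome: str
    start: int
    end: int
    allele: str


def parse_variant_key(key: str) -> VariantKey:
    """Split a VARIANT_KEY such as "1:12345678:12345678:C".

    Assumption: the four parts are chromosome, start, end, and allele.
    """
    chromosome, start, end, allele = key.split(":")
    return VariantKey(chromosome, int(start), int(end), allele)


def decode_variant_row(row: Dict[str, int]) -> Dict[str, str]:
    """Replace the numeric codes of one exported row with readable labels."""
    return {
        "chromosome": CHROMOSOME_BY_ID[row["CHROMOSOMEID"]],
        "zygosity": ZYGOSITY[row["ZYGOSITY"]],
        "variant_call": VARIANT_CALL[row["VARIANT_CALL"]],
        "gene_model": GENEMODEL[row["GENEMODEL"]],
    }


example = {"CHROMOSOMEID": 23, "ZYGOSITY": 1, "VARIANT_CALL": 2, "GENEMODEL": 1}
print(decode_variant_row(example))
print(parse_variant_key("1:12345678:12345678:C"))
```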

Annotated Somatic Mutations

This table will only be available for data sets with ingested Somatic molecular data.

| Field Name | Type | Description |
| --- | --- | --- |
| SAMPLE_BARCODE | STRING | Original sample barcode, used in VCF column |
| SUBJECTID | STRING | Identifier for Subject entity |
| STUDY | STRING | Study designation |
| GENOMEBUILD | STRING | Only hg38 is supported |
| CHROMOSOME | STRING | Chromosome without 'chr' prefix |
| DBSNP | NUMERIC | dbSNP Identifiers |
| VARIANT_KEY | STRING | Variant ID in the form "1:12345678:12345678:C" |
| MUTATION_TYPE | NUMERIC | Rank of consequences by expected impact: 0 = Protein Truncating to 40 = Intergenic Variant |
| VARIANT_CALL | NUMERIC | 1=germline, 2=somatic |
| GENOTYPE | STRING | "G\|T" |
| REF_ALLELE | STRING | Reference allele |
| ALLELE1 | STRING | First allele call in the tumor sample |
| ALLELE2 | STRING | Second allele call in the tumor sample |
| GENEMODEL | NUMERIC | 1=Ensembl, 2=RefSeq |
| GENE_HGNC | STRING | HUGO/HGNC gene symbol |
| GENE_ID | STRING | Ensembl gene ID ("ENSG00001234") |
| TRANSCRIPT_ID | STRING | Ensembl ENST or RefSeq NM_ |
| CANONICAL | BOOLEAN | Transcript designated 'canonical' by source |
| CONSEQUENCE | STRING | missense, stop gained, intronic, etc. |
| HGVSP | STRING | HGVS nomenclature for AA change: p.Pro72Ala |

Annotated Copy Number Variants

This table will only be available for data sets with ingested CNV molecular data.

| Field Name | Type | Description |
| --- | --- | --- |
| SAMPLE_BARCODE | STRING | Sample barcode used in the original VCF |
| GENOMEBUILD | STRING | Genome build, always 'hg38' |
| NIRVANA_VID | STRING | Variant ID of the form 'chr-pos-ref-alt' |
| CHRID | STRING | Chromosome without 'chr' prefix |
| CID | NUMERIC | Numerical representation of the chromosome, X=23, Y=24, Mt=25 |
| GENE_ID | STRING | NCBI or Ensembl gene identifier |
| GID | NUMERIC | Numerical part of the gene ID; for Ensembl, we remove the 'ENSG000..' prefix |
| START_POS | NUMERIC | First affected position on the chromosome |
| STOP_POS | NUMERIC | Last affected position on the chromosome |
| VARIANT_TYPE | NUMERIC | 1 = copy number gain, -1 = copy number loss |
| COPY_NUMBER | NUMERIC | Observed copy number |
| COPY_NUMBER_CHANGE | NUMERIC | Fold-change of copy number, assuming 2 for diploid and 1 for haploid as the baseline |
| SEGMENT_VALUE | NUMERIC | Average FC for the identified chromosomal segment |
| PROBE_COUNT | NUMERIC | Probes confirming the CNV (arrays only) |
| REFERENCE | NUMERIC | Baseline taken from normal samples (1) or averaged disease tissue (2) |
| GENE_HGNC | STRING | HUGO/HGNC gene symbol |
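
The GID and COPY_NUMBER_CHANGE descriptions can be made concrete with a small sketch. It assumes that "fold-change" means the plain ratio of observed copy number to the baseline (2 for diploid, 1 for haploid) rather than a log ratio, and that GID is obtained by dropping the 'ENSG' prefix and leading zeros. Both function names are hypothetical and are not part of ICA.

```python
import re

# Labels for the VARIANT_TYPE codes of the copy number variant table.
CNV_VARIANT_TYPE = {1: "copy number gain", -1: "copy number loss"}


def gid_from_gene_id(gene_id: str) -> int:
    """Numeric part of a gene ID.

    Assumption: an Ensembl ID loses its 'ENSG' prefix and leading zeros,
    while a purely numeric (NCBI) ID is used as-is.
    """
    match = re.fullmatch(r"(?:ENSG)?0*(\d+)", gene_id)
    if match is None:
        raise ValueError(f"unrecognized gene ID: {gene_id!r}")
    return int(match.group(1))


def copy_number_fold_change(copy_number: float, haploid: bool = False) -> float:
    """Observed copy number divided by the baseline (2 diploid, 1 haploid).

    Assumption: fold-change is a plain ratio, not a log ratio.
    """
    baseline = 1 if haploid else 2
    return copy_number / baseline


print(gid_from_gene_id("ENSG00001234"))
print(copy_number_fold_change(3))                # three copies on an autosome
print(copy_number_fold_change(2, haploid=True))  # two copies of a haploid region
```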

Annotated Structural Variants

This table will only be available for data sets with ingested SV molecular data. Note that ICA Cohorts stores copy number variants in a separate table.

| Field Name | Type | Description |
| --- | --- | --- |
| SAMPLE_BARCODE | STRING | Sample barcode used in the original VCF |
| GENOMEBUILD | STRING | Genome build, always 'hg38' |
| NIRVANA_VID | STRING | Variant ID of the form 'chr-pos-ref-alt' |
| CHRID | STRING | Chromosome without 'chr' prefix |
| CID | NUMERIC | Numerical representation of the chromosome, X=23, Y=24, Mt=25 |
| BEGIN | NUMERIC | First affected position on the chromosome |
| END | NUMERIC | Last affected position on the chromosome |
| BAND | STRING | Chromosomal band |
| QUALIITY | NUMERIC | Quality from the original VCF |
| FILTERS | ARRAY | Filters from the original VCF |
| VARIANT_TYPE | STRING | Insertion, deletion, indel, tandem_duplication, translocation_breakend, inversion ("INV"), short tandem repeat ("STR2") |
| VARIANT_TYPE_ID | NUMERIC | 21=insertion, 22=deletion, 23=indel, 24=tandem_duplication, 25=translocation_breakend, 26=inversion ("INV"), 27=short tandem repeat ("STR2") |
| CIPOS | ARRAY | Confidence interval around first position |
| CIEND | ARRAY | Confidence interval around last position |
| SVLENGTH | NUMERIC | Overall size of the structural variant |
| BONDCHR | STRING | For translocations, the other affected chromosome |
| BONDCID | NUMERIC | For translocations, the other affected chromosome as a numeric value, X=23, Y=24, Mt=25 |
| BONDPOS | STRING | For translocations, positions on the other affected chromosome |
| BONDORDER | NUMERIC | 3 or 5: Whether this fragment (the current variant/VID) "receives" the other chromosome's fragment on its 3' end, or attaches to the 5' of the other chromosome fragment |
| GENOTYPE | STRING | Called genotype from the VCF |
| GENOTYPE_QUALITY | NUMERIC | Genotype call quality |
| READCOUNTSSPLIT | ARRAY | Read counts |
| READCOUNTSPAIRED | ARRAY | Read counts, paired end |
| REGULATORYREGIONID | STRING | Ensembl ID for the affected regulatory region |
| REGULATORYREGIONTYPE | STRING | Type of the regulatory region |
| CONSEQUENCE | ARRAY | Variant consequence according to SequenceOntology |
| TRANSCRIPTID | STRING | Ensembl or RefSeq transcript identifier |
| TRANSCRIPTBIOTYPE | STRING | Biotype of the transcript |
| INTRONS | STRING | Count of impacted introns out of the total number of introns, specified as "M/N" |
| GENEID | STRING | Ensembl or RefSeq gene identifier |
| GENEHGNC | STRING | HUGO/HGNC gene symbol |
| ISCANONICAL | BOOLEAN | Is the transcript ID the canonical one according to Ensembl? |
| PROTEINID | STRING | RefSeq or Ensembl protein ID |
| SOURCEID | NUMERIC | Gene model: 1=Ensembl, 2=RefSeq |
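
For exported structural variant rows, the VARIANT_TYPE_ID codes and the "M/N" format of INTRONS can be decoded as in the sketch below. The mapping follows the VARIANT_TYPE_ID description; the helper names are hypothetical, and the row is assumed to be available as a plain dictionary.

```python
from typing import Dict, Tuple

# VARIANT_TYPE_ID codes as listed for the structural variant table.
SV_TYPE_BY_ID = {
    21: "insertion",
    22: "deletion",
    23: "indel",
    24: "tandem_duplication",
    25: "translocation_breakend",
    26: "inversion",
    27: "short tandem repeat",
}


def parse_introns(value: str) -> Tuple[int, int]:
    """Split an INTRONS value "M/N" into (impacted, total) intron counts."""
    impacted, total = value.split("/")
    return int(impacted), int(total)


def has_bond_fields(row: Dict[str, object]) -> bool:
    """BONDCHR, BONDCID, BONDPOS, and BONDORDER only apply to translocations."""
    return SV_TYPE_BY_ID.get(row["VARIANT_TYPE_ID"]) == "translocation_breakend"


# Invented example row with only the fields used here.
row = {"VARIANT_TYPE_ID": 25, "INTRONS": "2/11"}
print(SV_TYPE_BY_ID[row["VARIANT_TYPE_ID"]], parse_introns(row["INTRONS"]))
```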

Raw RNAseq data tables for genes and transcripts

These tables will only be available for data sets with ingested RNAseq molecular data.

Table for gene quantification results:

| Field Name | Type | Description |
| --- | --- | --- |
| GENOMEBUILD | STRING | Genome build, always 'hg38' |
| STUDY_NAME | STRING | Study designation |
| SAMPLE_BARCODE | STRING | Sample barcode used in the original VCF |
| LABEL | STRING | Group label specified during import: Case or Control, Tumor or Normal, etc. |
| GENE_ID | STRING | Ensembl or RefSeq gene identifier |
| GID | NUMERIC | Numerical part of the gene ID; for Ensembl, we remove the 'ENSG000..' prefix |
| GENE_HGNC | STRING | HUGO/HGNC gene symbol |
| SOURCE | STRING | Gene model: 1=Ensembl, 2=RefSeq |
| TPM | NUMERIC | Transcripts per million |
| LENGTH | NUMERIC | The length of the gene in base pairs. |
| EFFECTIVE_LENGTH | NUMERIC | The length as accessible to RNA-seq, accounting for insert-size and edge effects. |
| NUM_READS | NUMERIC | The estimated number of reads from the gene. The values are not normalized. |

The corresponding transcript table uses TRANSCRIPT_ID instead of GENE_ID and GENE_HGNC.
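
To show how TPM, EFFECTIVE_LENGTH, and NUM_READS relate, here is a minimal sketch of the standard transcripts-per-million definition: each feature's read count is divided by its effective length, and the resulting rates are scaled to sum to one million within a sample. This is the textbook formula, not a statement of how ICA Cohorts populates the TPM column, and the input numbers are invented.

```python
import math
from typing import List, Sequence


def tpm(num_reads: Sequence[float], effective_lengths: Sequence[float]) -> List[float]:
    """Standard TPM for the features of a single sample.

    rate_i = NUM_READS_i / EFFECTIVE_LENGTH_i
    TPM_i  = rate_i / sum(rates) * 1e6
    """
    rates = [reads / length for reads, length in zip(num_reads, effective_lengths)]
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]


# Three hypothetical genes from one sample.
values = tpm([100, 200, 300], [1000, 1000, 3000])
print([round(v) for v in values])
print(math.isclose(sum(values), 1e6))
```

Because the scaling uses every feature in the sample, TPM values recomputed on a filtered subset of rows will not match the stored column.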

Differential expression tables for genes and transcripts

These tables will only be available for data sets with ingested RNAseq molecular data.

Table for differential gene expression results:

| Field Name | Type | Description |
| --- | --- | --- |
| GENOMEBUILD | STRING | Genome build, always 'hg38' |
| STUDY_NAME | STRING | Study designation |
| SAMPLE_BARCODE | STRING | Sample barcode used in the original VCF |
| CASE_LABEL | STRING | Study designation |
| GENE_ID | STRING | Ensembl or RefSeq gene identifier |
| GID | NUMERIC | Numerical part of the gene ID; for Ensembl, we remove the 'ENSG000..' prefix |
| GENE_HGNC | STRING | HUGO/HGNC gene symbol |
| SOURCE | STRING | Gene model: 1=Ensembl, 2=RefSeq |
| BASEMEAN | NUMERIC | |
| FC | NUMERIC | Fold-change |
| LFC | NUMERIC | Log of the fold-change |
| LFCSE | NUMERIC | Standard error for log fold-change |
| PVALUE | NUMERIC | P-value |
| CONTROL_SAMPLECOUNT | NUMERIC | Number of samples used as control |
| CONTROL_LABEL | NUMERIC | Label used for controls |

The corresponding transcript table uses TRANSCRIPT_ID instead of GENE_ID and GENE_HGNC.
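
A common first step with differential expression rows is to filter on PVALUE and LFC. The sketch below does this on rows represented as dictionaries. It assumes that LFC is a base-2 logarithm of FC, which the field description does not state, and the thresholds and sample rows are illustrative only. Note that the table exposes a raw PVALUE; any multiple-testing correction would have to be applied separately.

```python
import math
from typing import Dict, List


def log2_fold_change(fc: float) -> float:
    """Assumption: LFC is log2(FC); the table only says "log of the fold-change"."""
    return math.log2(fc)


def significant(rows: List[Dict[str, float]],
                max_pvalue: float = 0.05,
                min_abs_lfc: float = 1.0) -> List[Dict[str, float]]:
    """Keep rows with PVALUE at or below the cutoff and a large enough |LFC|."""
    return [
        row for row in rows
        if row["PVALUE"] <= max_pvalue and abs(row["LFC"]) >= min_abs_lfc
    ]


# Invented rows carrying only the fields used by the filter.
rows = [
    {"GENE_HGNC": "GENE_A", "FC": 4.0, "LFC": 2.0, "PVALUE": 0.001},
    {"GENE_HGNC": "GENE_B", "FC": 1.2, "LFC": 0.26, "PVALUE": 0.0004},
    {"GENE_HGNC": "GENE_C", "FC": 0.25, "LFC": -2.0, "PVALUE": 0.2},
]
hits = significant(rows)
print([row["GENE_HGNC"] for row in hits])
```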
