Cohorts Data in ICA Base
ICA Cohorts data can be viewed in an ICA Project Base instance as a shared database. A shared database in ICA Base operates as a database view. To use this feature, enable Base for your project prior to starting any ICA Cohorts ingestions. See Base for more information on enabling this feature in your ICA Project.
ICA Cohorts Base Tables
After ingesting data into your project, phenotypic and molecular data are available to view in Base. See Cohorts Import for instructions on importing data sets into Cohorts.
Post ingestion, data will be represented in Base.
Select BASE from the ICA left-navigation and click Query.
Under the New Query window, a list of tables is displayed. Expand the Shared Database for Project \<your project name\>.
Cohorts tables will be displayed.
To preview the tables and fields, click each view listed.
Clicking any of these views and then selecting PREVIEW on the right-hand side will show you a preview of the data in the tables.
Note: If your ingestion includes Somatic variants, there will be two molecular tables: ANNOTATED_SOMATIC_MUTATIONS and ANNOTATED_VARIANTS. All ingestions will include a PHENOTYPE table.
Note: The PHENOTYPE table includes a harmonized set that is collected across all data ingestions and is not representative of all data ingested for the Subject or Sample. Sample information is also displayed in this table, if applicable. Sample information drives the annotation process if molecular data is included in the ingestion. That data is stored in the PHENOTYPE table.
Phenotype Data
Field Name | Type | Description |
SAMPLE_BARCODE | STRING | Sample Identifier |
SUBJECTID | STRING | Identifier for Subject entity |
STUDY | STRING | Study designation |
AGE | NUMERIC | Age in years |
SEX | STRING | Sex field to drive annotation |
POPULATION | STRING | Population Designation for 1000 Genomes Project |
SUPERPOPULATION | STRING | Superpopulation Designation from 1000 Genomes Project |
RACE | STRING | Race according to NIH standard |
CONDITION_ONTOLOGIES | VARIANT | Diagnosis Ontology Source |
CONDITION_IDS | VARIANT | Diagnosis Concept Ids |
CONDITIONS | VARIANT | Diagnosis Names |
HARMONIZED_CONDITIONS | VARIANT | Diagnosis High-level concept to drive UI |
LIBRARYTYPE | STRING | Sequencing technology |
ANALYTE | STRING | Substance sequenced |
TISSUE | STRING | Tissue source |
TUMOR_OR_NORMAL | STRING | Tumor designation for somatic |
GENOMEBUILD | STRING | Genome Build to drive annotations - hg38 only |
SAMPLE_BARCODE_VCF | STRING | Sample ID from VCF |
AFFECTED_STATUS | NUMERIC | Affected, Unaffected, or Unknown for Family Based Analysis |
FAMILY_RELATIONSHIP | STRING | Relationship designation for Family Based Analysis |
Annotated Variants
This table will be available for all projects with ingested molecular data.
Field Name | Type | Description |
SAMPLE_BARCODE | STRING | Original sample barcode used in VCF column |
STUDY | STRING | Study designation |
GENOMEBUILD | STRING | Only hg38 is supported |
CHROMOSOME | STRING | Chromosome without 'chr' prefix |
CHROMOSOMEID | NUMERIC | Chromosome ID: 1..22, 23=X, 24=Y, 25=Mt |
DBSNP | STRING | dbSNP Identifiers |
VARIANT_KEY | STRING | Variant ID in the form "1:12345678:12345678:C" |
NIRVANA_VID | STRING | Broad Institute VID: "1-12345678-A-C" |
VARIANT_TYPE | STRING | Description of Variant Type (e.g. SNV, Deletion, Insertion) |
VARIANT_CALL | NUMERIC | 1=germline, 2=somatic |
DENOVO | BOOLEAN | true / false |
GENOTYPE | STRING | "G|T" |
READ_DEPTH | NUMERIC | Sequencing read depth |
ALLELE_COUNT | NUMERIC | Counts of each alternate allele for each site across all samples |
ALLELE_DEPTH | STRING | Unfiltered count of reads that support a given allele for an individual sample |
FILTERS | STRING | Filter field from VCF. If all filters pass, field is PASS |
ZYGOSITY | NUMERIC | 0 = hom ref, 1 = het ref/alt, 2 = hom alt, 4 = hemi alt |
GENEMODEL | NUMERIC | 1=Ensembl, 2=RefSeq |
GENE_HGNC | STRING | HUGO/HGNC gene symbol |
GENE_ID | STRING | Ensembl gene ID ("ENSG00001234") |
GID | NUMERIC | NCBI Entrez Gene ID (RefSeq) or numerical part of Ensembl ENSG ID |
TRANSCRIPT_ID | STRING | Ensembl ENST or RefSeq NM_ |
CANONICAL | STRING | Transcript designated 'canonical' by source |
CONSEQUENCE | STRING | missense, stop gained, intronic, etc. |
HGVSC | STRING | The HGVS coding sequence name |
HGVSP | STRING | The HGVS protein sequence name |
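Several ANNOTATED_VARIANTS fields are numeric codes or packed strings. The following is a minimal sketch of how they might be decoded in a client script after exporting query results. The helper names are hypothetical, and reading the four parts of VARIANT_KEY as chromosome, start, stop, and allele is an assumption based only on the example value in the table above.

```python
from typing import NamedTuple

# Code tables copied from the field descriptions in the table above.
ZYGOSITY_LABELS = {0: "hom ref", 1: "het ref/alt", 2: "hom alt", 4: "hemi alt"}
VARIANT_CALL_LABELS = {1: "germline", 2: "somatic"}
GENEMODEL_LABELS = {1: "Ensembl", 2: "RefSeq"}


def chromosome_name(chromosome_id: int) -> str:
    """Map CHROMOSOMEID (1..22, 23=X, 24=Y, 25=Mt) to a name without 'chr'."""
    special = {23: "X", 24: "Y", 25: "Mt"}
    if 1 <= chromosome_id <= 22:
        return str(chromosome_id)
    if chromosome_id in special:
        return special[chromosome_id]
    raise ValueError(f"unexpected CHROMOSOMEID: {chromosome_id}")


class VariantKey(NamedTuple):
    # Field meanings are assumed from the example "1:12345678:12345678:C".
    chromosome: str
    start: int
    stop: int
    allele: str


def parse_variant_key(key: str) -> VariantKey:
    """Split a VARIANT_KEY string into its four colon-separated parts."""
    chromosome, start, stop, allele = key.split(":")
    return VariantKey(chromosome, int(start), int(stop), allele)
```

For example, `parse_variant_key("1:12345678:12345678:C").start` gives the integer position, and `ZYGOSITY_LABELS[row_zygosity]` turns the stored code into a readable label.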
Annotated Somatic Mutations
This table will only be available for data sets with ingested Somatic molecular data.
Field Name | Type | Description |
SAMPLE_BARCODE | STRING | Original sample barcode, used in VCF column |
SUBJECTID | STRING | Identifier for Subject entity |
STUDY | STRING | Study designation |
GENOMEBUILD | STRING | Only hg38 is supported |
CHROMOSOME | STRING | Chromosome without 'chr' prefix |
DBSNP | NUMERIC | dbSNP Identifiers |
VARIANT_KEY | STRING | Variant ID in the form "1:12345678:12345678:C" |
MUTATION_TYPE | NUMERIC | Rank of consequences by expected impact: 0 = Protein Truncating to 40 = Intergenic Variant |
VARIANT_CALL | NUMERIC | 1=germline, 2=somatic |
GENOTYPE | STRING | "G|T" |
REF_ALLELE | STRING | Reference allele |
ALLELE1 | STRING | First allele call in the tumor sample |
ALLELE2 | STRING | Second allele call in the tumor sample |
GENEMODEL | NUMERIC | 1=Ensembl, 2=RefSeq |
GENE_HGNC | STRING | HUGO/HGNC gene symbol |
GENE_ID | STRING | Ensembl gene ID ("ENSG00001234") |
TRANSCRIPT_ID | STRING | Ensembl ENST or RefSeq NM_ |
CANONICAL | BOOLEAN | Transcript designated 'canonical' by source |
CONSEQUENCE | STRING | missense, stop gained, intronic, etc. |
HGVSP | STRING | HGVS nomenclature for AA change: p.Pro72Ala |
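As an illustration of the kind of query these views support, the sketch below counts subjects with a somatic mutation on a canonical transcript, per gene. It runs against an in-memory SQLite stand-in rather than Base itself, so the SQL dialect accepted in the Base Query window may differ. Joining the two tables on SUBJECTID is an assumption based on both tables listing that field, and the rows and the 'Tumor' value are hypothetical toy data.

```python
import sqlite3

# In-memory stand-in for the shared database; only a few documented columns are modeled.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PHENOTYPE (
    SUBJECTID TEXT, SAMPLE_BARCODE TEXT, SEX TEXT, TUMOR_OR_NORMAL TEXT);
CREATE TABLE ANNOTATED_SOMATIC_MUTATIONS (
    SUBJECTID TEXT, SAMPLE_BARCODE TEXT, GENE_HGNC TEXT,
    CONSEQUENCE TEXT, CANONICAL INTEGER, VARIANT_CALL INTEGER);
""")

# Hypothetical toy rows, not real data.
conn.executemany("INSERT INTO PHENOTYPE VALUES (?, ?, ?, ?)", [
    ("SUBJ-1", "S1", "F", "Tumor"),
    ("SUBJ-2", "S2", "M", "Tumor"),
])
conn.executemany("INSERT INTO ANNOTATED_SOMATIC_MUTATIONS VALUES (?, ?, ?, ?, ?, ?)", [
    ("SUBJ-1", "S1", "TP53", "missense", 1, 2),
    ("SUBJ-1", "S1", "TP53", "missense", 0, 2),  # non-canonical transcript, filtered out
    ("SUBJ-2", "S2", "TP53", "stop gained", 1, 2),
    ("SUBJ-2", "S2", "KRAS", "missense", 1, 2),
])

# VARIANT_CALL = 2 selects somatic calls; CANONICAL = 1 keeps canonical transcripts only.
QUERY = """
SELECT m.GENE_HGNC, COUNT(DISTINCT m.SUBJECTID) AS SUBJECTS
FROM ANNOTATED_SOMATIC_MUTATIONS AS m
JOIN PHENOTYPE AS p ON p.SUBJECTID = m.SUBJECTID
WHERE m.CANONICAL = 1
  AND m.VARIANT_CALL = 2
  AND p.TUMOR_OR_NORMAL = 'Tumor'
GROUP BY m.GENE_HGNC
ORDER BY SUBJECTS DESC, m.GENE_HGNC
"""
rows = conn.execute(QUERY).fetchall()
```

Counting distinct SUBJECTID values rather than rows avoids inflating the result when one mutation is annotated against several transcripts.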
Annotated Copy Number Variants
This table will only be available for data sets with ingested CNV molecular data.
Field Name | Type | Description |
SAMPLE_BARCODE | STRING | Sample barcode used in the original VCF |
GENOMEBUILD | STRING | Genome build, always 'hg38' |
NIRVANA_VID | STRING | Variant ID of the form 'chr-pos-ref-alt' |
CHRID | STRING | Chromosome without 'chr' prefix |
CID | NUMERIC | Numerical representation of the chromosome, X=23, Y=24, Mt=25 |
GENE_ID | STRING | NCBI or Ensembl gene identifier |
GID | NUMERIC | Numerical part of the gene ID; for Ensembl, we remove the 'ENSG000..' prefix |
START_POS | NUMERIC | First affected position on the chromosome |
STOP_POS | NUMERIC | Last affected position on the chromosome |
VARIANT_TYPE | NUMERIC | 1 = copy number gain, -1 = copy number loss |
COPY_NUMBER | NUMERIC | Observed copy number |
COPY_NUMBER_CHANGE | NUMERIC | Fold-change of copy number, assuming 2 for diploid and 1 for haploid as the baseline |
SEGMENT_VALUE | NUMERIC | Average FC for the identified chromosomal segment |
PROBE_COUNT | NUMERIC | Probes confirming the CNV (arrays only) |
REFERENCE | NUMERIC | Baseline taken from normal samples (1) or averaged disease tissue (2) |
GENE_HGNC | STRING | HUGO/HGNC gene symbol |
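The GID and VARIANT_TYPE conventions above can be reproduced in a few lines. This is a minimal sketch with hypothetical helper names; it assumes Ensembl gene IDs without a version suffix, and it treats START_POS and STOP_POS as inclusive, which follows from their descriptions as the first and last affected positions.

```python
# Labels copied from the VARIANT_TYPE description in the table above.
CNV_TYPE_LABELS = {1: "copy number gain", -1: "copy number loss"}


def ensembl_gid(gene_id: str) -> int:
    """Return the numeric part of an Ensembl gene ID, dropping 'ENSG' and leading zeros."""
    prefix = "ENSG"
    digits = gene_id[len(prefix):]
    if not gene_id.startswith(prefix) or not digits.isdigit():
        raise ValueError(f"not an unversioned Ensembl gene ID: {gene_id!r}")
    return int(digits)


def segment_length(start_pos: int, stop_pos: int) -> int:
    """Number of positions covered, counting both START_POS and STOP_POS."""
    if stop_pos < start_pos:
        raise ValueError("STOP_POS must not be smaller than START_POS")
    return stop_pos - start_pos + 1
```

Storing the numeric GID alongside GENE_ID allows joins and filters on a number instead of a string, which is presumably why both columns are provided.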
Annotated Structural Variants
This table will only be available for data sets with ingested SV molecular data. Note that ICA Cohorts stores copy number variants in a separate table.
Field Name | Type | Description |
SAMPLE_BARCODE | STRING | Sample barcode used in the original VCF |
GENOMEBUILD | STRING | Genome build, always 'hg38' |
NIRVANA_VID | STRING | Variant ID of the form 'chr-pos-ref-alt' |
CHRID | STRING | Chromosome without 'chr' prefix |
CID | NUMERIC | Numerical representation of the chromosome, X=23, Y=24, Mt=25 |
BEGIN | NUMERIC | First affected position on the chromosome |
END | NUMERIC | Last affected position on the chromosome |
BAND | STRING | Chromosomal band |
QUALIITY | NUMERIC | Quality from the original VCF |
FILTERS | ARRAY | Filters from the original VCF |
VARIANT_TYPE | STRING | Insertion, deletion, indel, tandem_duplication, translocation_breakend, inversion ("INV"), short tandem repeat ("STR2") |
VARIANT_TYPE_ID | NUMERIC | 21=insertion, 22=deletion, 23=indel, 24=tandem_duplication, 25=translocation_breakend, 26=inversion ("INV"), 27=short tandem repeat ("STR2") |
CIPOS | ARRAY | Confidence interval around first position |
CIEND | ARRAY | Confidence interval around last position |
SVLENGTH | NUMERIC | Overall size of the structural variant |
BONDCHR | STRING | For translocations, the other affected chromosome |
BONDCID | NUMERIC | For translocations, the other affected chromosome as a numeric value, X=23, Y=24, Mt=25 |
BONDPOS | STRING | For translocations, positions on the other affected chromosome |
BONDORDER | NUMERIC | 3 or 5: Whether this fragment (the current variant/VID) "receives" the other chromosome's fragment on its 3' end, or attaches to the 5' end of the other chromosome fragment |
GENOTYPE | STRING | Called genotype from the VCF |
GENOTYPE_QUALITY | NUMERIC | Genotype call quality |
READCOUNTSSPLIT | ARRAY | Read counts |
READCOUNTSPAIRED | ARRAY | Read counts, paired end |
REGULATORYREGIONID | STRING | Ensembl ID for the affected regulatory region |
REGULATORYREGIONTYPE | STRING | Type of the regulatory region |
CONSEQUENCE | ARRAY | Variant consequence according to SequenceOntology |
TRANSCRIPTID | STRING | Ensembl or RefSeq transcript identifier |
TRANSCRIPTBIOTYPE | STRING | Biotype of the transcript |
INTRONS | STRING | Count of impacted introns out of the total number of introns, specified as "M/N" |
GENEID | STRING | Ensembl or RefSeq gene identifier |
GENEHGNC | STRING | HUGO/HGNC gene symbol |
ISCANONICAL | BOOLEAN | Is the transcript ID the canonical one according to Ensembl? |
PROTEINID | STRING | RefSeq or Ensembl protein ID |
SOURCEID | NUMERICAL | Gene model: 1=Ensembl, 2=RefSeq |
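The coded and packed fields of the structural variant table can be unpacked as sketched below. The helper names are hypothetical, and accepting "MT" in any letter case for the mitochondrial chromosome is an assumption; the numeric codes are taken from the VARIANT_TYPE_ID, CID, and INTRONS descriptions above.

```python
from typing import Tuple

# Codes copied from the VARIANT_TYPE_ID description in the table above.
SV_TYPE_BY_ID = {
    21: "insertion",
    22: "deletion",
    23: "indel",
    24: "tandem_duplication",
    25: "translocation_breakend",
    26: "inversion",
    27: "short tandem repeat",
}


def chromosome_id(name: str) -> int:
    """Map a chromosome name without 'chr' (CHRID, BONDCHR) to its numeric form."""
    special = {"X": 23, "Y": 24, "MT": 25}
    upper = name.upper()
    if upper in special:
        return special[upper]
    number = int(name)  # raises ValueError for anything that is not a number
    if not 1 <= number <= 22:
        raise ValueError(f"unexpected chromosome: {name!r}")
    return number


def parse_introns(value: str) -> Tuple[int, int]:
    """Split an INTRONS value "M/N" into (impacted, total)."""
    impacted, total = (int(part) for part in value.split("/"))
    if not 0 <= impacted <= total:
        raise ValueError(f"inconsistent INTRONS value: {value!r}")
    return impacted, total
```

For a translocation row, `chromosome_id(bondchr)` should agree with the stored BONDCID, which gives a quick consistency check on exported data.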
Raw RNAseq data tables for genes and transcripts
These tables will only be available for data sets with ingested RNAseq molecular data.
Table for gene quantification results:
Field Name | Type | Description |
GENOMEBUILD | STRING | Genome build, always 'hg38' |
STUDY_NAME | STRING | Study designation |
SAMPLE_BARCODE | STRING | Sample barcode used in the original VCF |
LABEL | STRING | Group label specified during import: Case or Control, Tumor or Normal, etc. |
GENE_ID | STRING | Ensembl or RefSeq gene identifier |
GID | NUMERIC | Numerical part of the gene ID; for Ensembl, we remove the 'ENSG000..' prefix |
GENE_HGNC | STRING | HUGO/HGNC gene symbol |
SOURCE | STRING | Gene model: 1=Ensembl, 2=RefSeq |
TPM | NUMERICAL | Transcripts per million |
LENGTH | NUMERICAL | The length of the gene in base pairs. |
EFFECTIVE_LENGTH | NUMERICAL | The length as accessible to RNA-seq, accounting for insert-size and edge effects. |
NUM_READS | NUMERICAL | The estimated number of reads from the gene. The values are not normalized. |
The corresponding transcript table uses TRANSCRIPT_ID instead of GENE_ID and GENE_HGNC.
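The TPM, EFFECTIVE_LENGTH, and NUM_READS columns are related through the commonly used definition of transcripts per million. The sketch below applies that standard definition to one sample; whether the ingested TPM values were computed exactly this way is not stated here, so treat it as an assumption. The function name is hypothetical.

```python
from typing import List, Sequence


def tpm(num_reads: Sequence[float], effective_lengths: Sequence[float]) -> List[float]:
    """Standard TPM for one sample: length-normalized read rates scaled to sum to 1e6."""
    if len(num_reads) != len(effective_lengths):
        raise ValueError("num_reads and effective_lengths must have the same length")
    # Reads per base of effective length for each gene or transcript.
    rates = [reads / length for reads, length in zip(num_reads, effective_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [rate / total * 1_000_000 for rate in rates]
```

Because NUM_READS is not normalized, it is only comparable within a sample; TPM values computed this way sum to one million per sample, which makes relative abundances easier to compare.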
Differential expression tables for genes and transcripts
These tables will only be available for data sets with ingested RNAseq molecular data.
Table for differential gene expression results:
Field Name | Type | Description |
GENOMEBUILD | STRING | Genome build, always 'hg38' |
STUDY_NAME | STRING | Study designation |
SAMPLE_BARCODE | STRING | Sample barcode used in the original VCF |
CASE_LABEL | STRING | Label used for cases |
GENE_ID | STRING | Ensembl or RefSeq gene identifier |
GID | NUMERIC | Numerical part of the gene ID; for Ensembl, we remove the 'ENSG000..' prefix |
GENE_HGNC | STRING | HUGO/HGNC gene symbol |
SOURCE | STRING | Gene model: 1=Ensembl, 2=RefSeq |
BASEMEAN | NUMERICAL | |
FC | NUMERICAL | Fold-change |
LFC | NUMERICAL | Log of the fold-change |
LFCSE | NUMERICAL | Standard error for log fold-change |
PVALUE | NUMERICAL | P-value |
CONTROL_SAMPLECOUNT | NUMERICAL | Number of samples used as control |
CONTROL_LABEL | NUMERICAL | Label used for controls |
The corresponding transcript table uses TRANSCRIPT_ID instead of GENE_ID and GENE_HGNC.