Cohorts Data in ICA Base
ICA Cohorts data can be viewed in an ICA Project Base instance as a shared database. A shared database in ICA Base operates as a database view. To use this feature, enable Base for your project prior to starting any ICA Cohorts ingestions. See Base for more information on enabling this feature in your ICA Project.
ICA Cohorts Base Tables
After you ingest data into your project, phenotypic and molecular data are available to view in Base. See Cohorts Import for instructions on importing data sets into Cohorts.
Post ingestion, data will be represented in Base.
Select BASE from the ICA left-navigation and click Query. Under the New Query window, a list of tables is displayed. Expand the Shared Database for Project \<your project name\>. Cohorts tables will be displayed.
To preview a table and its fields, click any of the listed views and then select PREVIEW on the right-hand side.
Note: If your ingestion includes Somatic variants, there will be two molecular tables: ANNOTATED_SOMATIC_MUTATIONS and ANNOTATED_VARIANTS. All ingestions will include a PHENOTYPE table.
Note: The PHENOTYPE table contains a harmonized set of fields collected across all data ingestions; it is not representative of all data ingested for the Subject or Sample. Where applicable, sample information is also stored in this table, and it drives the annotation process if molecular data is included in the ingestion.
Phenotype Data
| Field Name | Type | Description |
| --- | --- | --- |
| SAMPLE_BARCODE | STRING | Sample Identifier |
| SUBJECTID | STRING | Identifier for Subject entity |
| STUDY | STRING | Study designation |
| AGE | NUMERIC | Age in years |
| SEX | STRING | Sex field to drive annotation |
| POPULATION | STRING | Population Designation from 1000 Genomes Project |
| SUPERPOPULATION | STRING | Superpopulation Designation from 1000 Genomes Project |
| RACE | STRING | Race according to NIH standard |
| CONDITION_ONTOLOGIES | VARIANT | Diagnosis Ontology Source |
| CONDITION_IDS | VARIANT | Diagnosis Concept Ids |
| CONDITIONS | VARIANT | Diagnosis Names |
| HARMONIZED_CONDITIONS | VARIANT | Diagnosis High-level concept to drive UI |
| LIBRARYTYPE | STRING | Sequencing technology |
| ANALYTE | STRING | Substance sequenced |
| TISSUE | STRING | Tissue source |
| TUMOR_OR_NORMAL | STRING | Tumor designation for somatic |
| GENOMEBUILD | STRING | Genome Build to drive annotations - hg38 only |
| SAMPLE_BARCODE_VCF | STRING | Sample ID from VCF |
| AFFECTED_STATUS | NUMERIC | Affected, Unaffected, or Unknown for Family Based Analysis |
| FAMILY_RELATIONSHIP | STRING | Relationship designation for Family Based Analysis |
Annotated Variants
This table will be available for all projects with ingested molecular data.
| Field Name | Type | Description |
| --- | --- | --- |
| SAMPLE_BARCODE | STRING | Original sample barcode used in VCF column |
| STUDY | STRING | Study designation |
| GENOMEBUILD | STRING | Only hg38 is supported |
| CHROMOSOME | STRING | Chromosome without 'chr' prefix |
| CHROMOSOMEID | NUMERIC | Chromosome ID: 1..22, 23=X, 24=Y, 25=Mt |
| DBSNP | STRING | dbSNP Identifiers |
| VARIANT_KEY | STRING | Variant ID in the form "1:12345678:12345678:C" |
| NIRVANA_VID | STRING | Broad Institute VID: "1-12345678-A-C" |
| VARIANT_TYPE | STRING | Description of Variant Type (e.g. SNV, Deletion, Insertion) |
| VARIANT_CALL | NUMERIC | 1=germline, 2=somatic |
| DENOVO | BOOLEAN | true / false |
| GENOTYPE | STRING | "G\|T" |
| READ_DEPTH | NUMERIC | Sequencing read depth |
| ALLELE_COUNT | NUMERIC | Counts of each alternate allele for each site across all samples |
| ALLELE_DEPTH | STRING | Unfiltered count of reads that support a given allele for an individual sample |
| FILTERS | STRING | Filter field from VCF. If all filters pass, field is PASS |
| ZYGOSITY | NUMERIC | 0 = hom ref, 1 = het ref/alt, 2 = hom alt, 4 = hemi alt |
| GENEMODEL | NUMERIC | 1=Ensembl, 2=RefSeq |
| GENE_HGNC | STRING | HUGO/HGNC gene symbol |
| GENE_ID | STRING | Ensembl gene ID ("ENSG00001234") |
| GID | NUMERIC | NCBI Entrez Gene ID (RefSeq) or numerical part of Ensembl ENSG ID |
| TRANSCRIPT_ID | STRING | Ensembl ENST or RefSeq NM_ |
| CANONICAL | STRING | Transcript designated 'canonical' by source |
| CONSEQUENCE | STRING | missense, stop gained, intronic, etc. |
| HGVSC | STRING | The HGVS coding sequence name |
| HGVSP | STRING | The HGVS protein sequence name |
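Several ANNOTATED_VARIANTS columns hold numeric codes or packed strings. The following is a minimal Python sketch of how rows exported from this table might be decoded on the client side. The lookup values are copied from the descriptions above; the function names are hypothetical, and the reading of VARIANT_KEY as chromosome, start, end, and allele is an assumption based only on the example value shown.

```python
# Code-to-label lookups transcribed from the ANNOTATED_VARIANTS descriptions.
ZYGOSITY = {0: "hom ref", 1: "het ref/alt", 2: "hom alt", 4: "hemi alt"}
VARIANT_CALL = {1: "germline", 2: "somatic"}
GENEMODEL = {1: "Ensembl", 2: "RefSeq"}


def chromosome_to_id(chromosome: str) -> int:
    """Map a CHROMOSOME value (no 'chr' prefix) to CHROMOSOMEID: 1..22, 23=X, 24=Y, 25=Mt."""
    special = {"X": 23, "Y": 24, "MT": 25}
    name = chromosome.strip().upper()
    if name in special:
        return special[name]
    number = int(name)  # raises ValueError for unexpected names
    if not 1 <= number <= 22:
        raise ValueError(f"unsupported chromosome: {chromosome!r}")
    return number


def parse_variant_key(variant_key: str) -> dict:
    """Split a VARIANT_KEY such as '1:12345678:12345678:C'.

    Assumption: the four parts are chromosome, start, end, and allele.
    """
    chromosome, start, end, allele = variant_key.split(":")
    return {
        "chromosome": chromosome,
        "start": int(start),
        "end": int(end),
        "allele": allele,
    }


def decode_row(row: dict) -> dict:
    """Return a copy of a row with the coded columns replaced by labels."""
    decoded = dict(row)
    decoded["ZYGOSITY"] = ZYGOSITY.get(row["ZYGOSITY"], "unknown")
    decoded["VARIANT_CALL"] = VARIANT_CALL.get(row["VARIANT_CALL"], "unknown")
    decoded["GENEMODEL"] = GENEMODEL.get(row["GENEMODEL"], "unknown")
    return decoded
```

For example, `decode_row({"ZYGOSITY": 1, "VARIANT_CALL": 2, "GENEMODEL": 1})` yields the labels "het ref/alt", "somatic", and "Ensembl". Codes not listed in the table fall back to "unknown" rather than raising an error.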
Annotated Somatic Mutations
This table will only be available for data sets with ingested Somatic molecular data.
| Field Name | Type | Description |
| --- | --- | --- |
| SAMPLE_BARCODE | STRING | Original sample barcode, used in VCF column |
| SUBJECTID | STRING | Identifier for Subject entity |
| STUDY | STRING | Study designation |
| GENOMEBUILD | STRING | Only hg38 is supported |
| CHROMOSOME | STRING | Chromosome without 'chr' prefix |
| DBSNP | NUMERIC | dbSNP Identifiers |
| VARIANT_KEY | STRING | Variant ID in the form "1:12345678:12345678:C" |
| MUTATION_TYPE | NUMERIC | Rank of consequences by expected impact: 0 = Protein Truncating to 40 = Intergenic Variant |
| VARIANT_CALL | NUMERIC | 1=germline, 2=somatic |
| GENOTYPE | STRING | "G\|T" |
| REF_ALLELE | STRING | Reference allele |
| ALLELE1 | STRING | First allele call in the tumor sample |
| ALLELE2 | STRING | Second allele call in the tumor sample |
| GENEMODEL | NUMERIC | 1=Ensembl, 2=RefSeq |
| GENE_HGNC | STRING | HUGO/HGNC gene symbol |
| GENE_ID | STRING | Ensembl gene ID ("ENSG00001234") |
| TRANSCRIPT_ID | STRING | Ensembl ENST or RefSeq NM_ |
| CANONICAL | BOOLEAN | Transcript designated 'canonical' by source |
| CONSEQUENCE | STRING | missense, stop gained, intronic, etc. |
| HGVSP | STRING | HGVS nomenclature for AA change: p.Pro72Ala |
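The HGVSP column stores the amino acid change as text, such as p.Pro72Ala. As an illustration, a simple substitution in that form could be split into its parts as sketched below. The function name is hypothetical, and the sketch assumes the value looks exactly like the example above (three-letter amino acid codes, no transcript prefix or parentheses); any other form returns None.

```python
import re
from typing import Optional, Tuple

# Matches only simple substitutions written like "p.Pro72Ala".
_HGVSP_SUBSTITUTION = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def parse_hgvsp_substitution(hgvsp: str) -> Optional[Tuple[str, int, str]]:
    """Return (reference residue, position, new residue), or None if the
    value is not a simple three-letter substitution."""
    match = _HGVSP_SUBSTITUTION.match(hgvsp.strip())
    if match is None:
        return None
    ref_aa, position, alt_aa = match.groups()
    return ref_aa, int(position), alt_aa
```

Returning None for unrecognized values keeps the helper safe to apply to a whole column, since frameshifts, synonymous changes, and other notations will not match this pattern.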
Annotated Copy Number Variants
This table will only be available for data sets with ingested CNV molecular data.
| Field Name | Type | Description |
| --- | --- | --- |
| SAMPLE_BARCODE | STRING | Sample barcode used in the original VCF |
| GENOMEBUILD | STRING | Genome build, always 'hg38' |
| NIRVANA_VID | STRING | Variant ID of the form 'chr-pos-ref-alt' |
| CHRID | STRING | Chromosome without 'chr' prefix |
| CID | NUMERIC | Numerical representation of the chromosome, X=23, Y=24, Mt=25 |
| GENE_ID | STRING | NCBI or Ensembl gene identifier |
| GID | NUMERIC | Numerical part of the gene ID; for Ensembl, we remove the 'ENSG000..' prefix |
| START_POS | NUMERIC | First affected position on the chromosome |
| STOP_POS | NUMERIC | Last affected position on the chromosome |
| VARIANT_TYPE | NUMERIC | 1 = copy number gain, -1 = copy number loss |
| COPY_NUMBER | NUMERIC | Observed copy number |
| COPY_NUMBER_CHANGE | NUMERIC | Fold-change of copy number, assuming 2 for diploid and 1 for haploid as the baseline |
| SEGMENT_VALUE | NUMERIC | Average FC for the identified chromosomal segment |
| PROBE_COUNT | NUMERIC | Probes confirming the CNV (arrays only) |
| REFERENCE | NUMERIC | Baseline taken from normal samples (1) or averaged disease tissue (2) |
| GENE_HGNC | STRING | HUGO/HGNC gene symbol |
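The GID and COPY_NUMBER_CHANGE descriptions above each imply a small calculation. The Python sketch below shows one plausible reading of both. The function names are hypothetical, and treating the fold-change as a plain ratio of observed copy number to baseline is an assumption; the source does not state the exact formula used during ingestion.

```python
import re


def gene_id_to_gid(gene_id: str) -> int:
    """Numeric part of a gene ID.

    For Ensembl IDs the 'ENSG' prefix and leading zeros are dropped;
    a purely numeric (NCBI) ID is returned unchanged.
    """
    match = re.fullmatch(r"(?:ENSG)?0*(\d+)", gene_id.strip())
    if match is None:
        raise ValueError(f"unrecognized gene ID: {gene_id!r}")
    return int(match.group(1))


def copy_number_fold_change(copy_number: float, baseline: int = 2) -> float:
    """Ratio of observed copy number to the baseline.

    Assumption: baseline is 2 for diploid regions and 1 for haploid regions,
    as described for COPY_NUMBER_CHANGE.
    """
    if baseline not in (1, 2):
        raise ValueError("baseline must be 1 (haploid) or 2 (diploid)")
    return copy_number / baseline
```

Under these assumptions, `gene_id_to_gid("ENSG00000141510")` gives 141510, and a copy number of 3 in a diploid region gives a fold-change of 1.5.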
Annotated Structural Variants
This table will only be available for data sets with ingested SV molecular data. Note that ICA Cohorts stores copy number variants in a separate table.
| Field Name | Type | Description |
| --- | --- | --- |
| SAMPLE_BARCODE | STRING | Sample barcode used in the original VCF |
| GENOMEBUILD | STRING | Genome build, always 'hg38' |
| NIRVANA_VID | STRING | Variant ID of the form 'chr-pos-ref-alt' |
| CHRID | STRING | Chromosome without 'chr' prefix |
| CID | NUMERIC | Numerical representation of the chromosome, X=23, Y=24, Mt=25 |
| BEGIN | NUMERIC | First affected position on the chromosome |
| END | NUMERIC | Last affected position on the chromosome |
| BAND | STRING | Chromosomal band |
| QUALIITY | NUMERIC | Quality from the original VCF |
| FILTERS | ARRAY | Filters from the original VCF |
| VARIANT_TYPE | STRING | Insertion, deletion, indel, tandem_duplication, translocation_breakend, inversion ("INV"), short tandem repeat ("STR2") |
| VARIANT_TYPE_ID | NUMERIC | 21=insertion, 22=deletion, 23=indel, 24=tandem_duplication, 25=translocation_breakend, 26=inversion ("INV"), 27=short tandem repeat ("STR2") |
| CIPOS | ARRAY | Confidence interval around first position |
| CIEND | ARRAY | Confidence interval around last position |
| SVLENGTH | NUMERIC | Overall size of the structural variant |
| BONDCHR | STRING | For translocations, the other affected chromosome |
| BONDCID | NUMERIC | For translocations, the other affected chromosome as a numeric value, X=23, Y=24, Mt=25 |
| BONDPOS | STRING | For translocations, positions on the other affected chromosome |
| BONDORDER | NUMERIC | 3 or 5: Whether this fragment (the current variant/VID) "receives" the other chromosome's fragment on its 3' end, or attaches to the 5' of the other chromosome fragment |
| GENOTYPE | STRING | Called genotype from the VCF |
| GENOTYPE_QUALITY | NUMERIC | Genotype call quality |
| READCOUNTSSPLIT | ARRAY | Read counts |
| READCOUNTSPAIRED | ARRAY | Read counts, paired end |
| REGULATORYREGIONID | STRING | Ensembl ID for the affected regulatory region |
| REGULATORYREGIONTYPE | STRING | Type of the regulatory region |
| CONSEQUENCE | ARRAY | Variant consequence according to SequenceOntology |
| TRANSCRIPTID | STRING | Ensembl or RefSeq transcript identifier |
| TRANSCRIPTBIOTYPE | STRING | Biotype of the transcript |
| INTRONS | STRING | Count of impacted introns out of the total number of introns, specified as "M/N" |
| GENEID | STRING | Ensembl or RefSeq gene identifier |
| GENEHGNC | STRING | HUGO/HGNC gene symbol |
| ISCANONICAL | BOOLEAN | Is the transcript ID the canonical one according to Ensembl? |
| PROTEINID | STRING | RefSeq or Ensembl protein ID |
| SOURCEID | NUMERICAL | Gene model: 1=Ensembl, 2=RefSeq |
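Two of the structural variant columns above are encoded: VARIANT_TYPE_ID as a numeric code and INTRONS as an "M/N" string. A short Python sketch of decoding both follows. The lookup values are taken from the table; the function names are hypothetical and only illustrate client-side handling of exported rows.

```python
from typing import Tuple

# VARIANT_TYPE_ID codes as listed in the structural variants table.
SV_TYPE_BY_ID = {
    21: "insertion",
    22: "deletion",
    23: "indel",
    24: "tandem_duplication",
    25: "translocation_breakend",
    26: "inversion",
    27: "short tandem repeat",
}


def sv_type_name(variant_type_id: int) -> str:
    """Translate a VARIANT_TYPE_ID into its label; raises KeyError if unlisted."""
    return SV_TYPE_BY_ID[variant_type_id]


def parse_introns(introns: str) -> Tuple[int, int]:
    """Split an INTRONS value 'M/N' into (impacted introns, total introns)."""
    impacted, total = introns.split("/")
    return int(impacted), int(total)
```

For instance, `parse_introns("2/10")` returns `(2, 10)`, which makes it straightforward to compute the fraction of introns affected by a variant.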
Raw RNAseq data tables for genes and transcripts
These tables will only be available for data sets with ingested RNAseq molecular data.
Table for gene quantification results:
| Field Name | Type | Description |
| --- | --- | --- |
| GENOMEBUILD | STRING | Genome build, always 'hg38' |
| STUDY_NAME | STRING | Study designation |
| SAMPLE_BARCODE | STRING | Sample barcode used in the original VCF |
| LABEL | STRING | Group label specified during import: Case or Control, Tumor or Normal, etc. |
| GENE_ID | STRING | Ensembl or RefSeq gene identifier |
| GID | NUMERIC | Numerical part of the gene ID; for Ensembl, we remove the 'ENSG000..' prefix |
| GENE_HGNC | STRING | HUGO/HGNC gene symbol |
| SOURCE | STRING | Gene model: 1=Ensembl, 2=RefSeq |
| TPM | NUMERICAL | Transcripts per million |
| LENGTH | NUMERICAL | The length of the gene in base pairs. |
| EFFECTIVE_LENGTH | NUMERICAL | The length as accessible to RNA-seq, accounting for insert-size and edge effects. |
| NUM_READS | NUMERICAL | The estimated number of reads from the gene. The values are not normalized. |
The corresponding transcript table uses TRANSCRIPT_ID instead of GENE_ID and GENE_HGNC.
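The TPM, EFFECTIVE_LENGTH, and NUM_READS columns are related through the standard transcripts-per-million definition: each feature's read count is divided by its effective length, and the resulting rates are scaled so they sum to one million within a sample. The Python sketch below illustrates that relationship. It assumes the stored TPM values follow this standard definition, which the source does not state explicitly, and the function name is hypothetical.

```python
from typing import List, Sequence


def compute_tpm(num_reads: Sequence[float], effective_lengths: Sequence[float]) -> List[float]:
    """Standard TPM for one sample.

    rate_i = NUM_READS_i / EFFECTIVE_LENGTH_i
    TPM_i  = rate_i / sum(rates) * 1e6
    """
    if len(num_reads) != len(effective_lengths):
        raise ValueError("num_reads and effective_lengths must have the same length")
    rates = [reads / length for reads, length in zip(num_reads, effective_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)  # no reads at all in this sample
    return [rate / total * 1_000_000 for rate in rates]
```

Because the scaling is done per sample, the inputs should be all rows sharing one SAMPLE_BARCODE. The resulting values sum to one million, which is what makes TPM comparable across samples while NUM_READS is not.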
Differential expression tables for genes and transcripts
These tables will only be available for data sets with ingested RNAseq molecular data.
Table for differential gene expression results:
| Field Name | Type | Description |
| --- | --- | --- |
| GENOMEBUILD | STRING | Genome build, always 'hg38' |
| STUDY_NAME | STRING | Study designation |
| SAMPLE_BARCODE | STRING | Sample barcode used in the original VCF |
| CASE_LABEL | STRING | Label used for cases |
| GENE_ID | STRING | Ensembl or RefSeq gene identifier |
| GID | NUMERIC | Numerical part of the gene ID; for Ensembl, we remove the 'ENSG000..' prefix |
| GENE_HGNC | STRING | HUGO/HGNC gene symbol |
| SOURCE | STRING | Gene model: 1=Ensembl, 2=RefSeq |
| BASEMEAN | NUMERICAL | |
| FC | NUMERICAL | Fold-change |
| LFC | NUMERICAL | Log of the fold-change |
| LFCSE | NUMERICAL | Standard error for log fold-change |
| PVALUE | NUMERICAL | P-value |
| CONTROL_SAMPLECOUNT | NUMERICAL | Number of samples used as control |
| CONTROL_LABEL | NUMERICAL | Label used for controls |
The corresponding transcript table uses TRANSCRIPT_ID instead of GENE_ID and GENE_HGNC.
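Once differential expression rows are exported from Base, a common next step is to keep only results that pass a p-value and effect-size threshold. The Python sketch below shows one way to do that using the PVALUE and LFC columns. The function names and the thresholds are illustrative assumptions, and the base of the logarithm used for LFC is not stated in the source; base 2 is assumed here because it is the common convention.

```python
import math
from typing import Dict, Iterable, List


def log_fold_change(fold_change: float, base: float = 2.0) -> float:
    """Logarithm of a fold-change. Base 2 is an assumption, not confirmed by the source."""
    if fold_change <= 0:
        raise ValueError("fold-change must be positive")
    return math.log(fold_change, base)


def filter_significant(
    rows: Iterable[Dict[str, float]],
    max_pvalue: float = 0.05,   # illustrative threshold
    min_abs_lfc: float = 1.0,   # illustrative threshold
) -> List[Dict[str, float]]:
    """Keep rows whose PVALUE is small enough and whose |LFC| is large enough."""
    return [
        row
        for row in rows
        if row["PVALUE"] <= max_pvalue and abs(row["LFC"]) >= min_abs_lfc
    ]
```

Filtering on the absolute value of LFC keeps both up-regulated and down-regulated genes. The table lists only a raw P-value, so any multiple-testing correction would have to be applied separately.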