Cohorts

Introduction to Cohorts

ICA Cohorts is a cohort analysis tool integrated with Illumina Connected Analytics (ICA). ICA Cohorts combines subject- and sample-level metadata, such as phenotypes, diseases, demographics, and biometrics, with molecular data stored in ICA to perform tertiary analyses on selected subsets of individuals.

Overview Video

This video is an overview of Illumina Connected Analytics. It walks through a Multi-Omics Cancer workflow that can be found here:

Features At-a-glance

  • Intuitive UI for selecting subjects and samples to analyze and compare: deep phenotypical and clinical metadata, molecular features including germline, somatic, gene expression.

  • Comprehensive, harmonized data model exposed to ICA Base and ICA Bench users for custom analyses.

  • Run analyses in ICA Base and ICA Bench and upload final results back into Cohorts for visualization.

Functionality

Walk-throughs

Public Data Sets

ICA Cohorts contains a variety of freely available data sets covering different disease areas and sequencing technologies. For a list of currently available data, see here.

  • Out-of-the-box statistical analyses, including genetic burden tests and GWAS/PheWAS.
  • Rich public data sets covering key disease areas to enrich private data analysis.

  • Easy-to-use visualizations for gene prioritization and genetic variation inspection.

  • Oncology Walkthrough
    Create a Cohort
    Import New Samples
    Prepare Metadata Sheets
    Oncology
    Rare Genetic Disorders
    see here
    Precomputed GWAS and PheWAS
    Cohort Analysis
    Compare Cohorts
    Cohorts Data in ICA Base

    Compare Cohorts

    You can compare up to four previously created individual cohorts, to view differences in variants and mutations, RNA expression, copy number variation, and distribution of clinical attributes. Once comparisons are created, they are saved in the Comparisons left-navigation tab of the Cohorts module.

    Create a comparison view

    1. Select Cohorts from the left-navigation panel.

    2. Select 2 to 4 cohorts that you have already created. If you have not created any cohorts, see the Create a Cohort documentation.

    3. Click Compare Cohorts in the right-navigation panel.

    4. Note you are now in the Comparisons left-navigation tab of the Cohorts module.

    5. In the Charts Section, if the COHORTS item is not displayed, click the gear icon in the top right and add Cohorts as the first attribute and click Save.

    6. The COHORTS item in the charts panel will provide a count of subjects in each cohort and act as a legend for color representation throughout comparison screens.

    7. For each clinical attribute category, a bar chart is displayed. Use the gear icon to select attributes to display in the charts panel.

    You can share a comparison with other team members in the same ICA Project. Refer to the "Sharing a Cohort" section of "Create a Cohort" for details on sharing, unsharing, deleting, and archiving; the same steps apply to comparisons.

    Attribute Comparison

    1. Select the Attributes tab

    2. Attribute categories are listed and can be expanded using the down-arrows next to the category names. The categories available are based on cohorts selected. Categories and attributes are part of the ICA Cohorts metadata template that map to each Subject.

    3. For example, use the drop-down arrow next to Vital status to view sub-categories and frequencies across selected cohorts.

    Variants Comparison

    1. Select the Genes tab

    2. Search for a gene of interest using its HUGO/HGNC gene symbol

    3. Variants and mutations will be displayed as one needle plot for each cohort that is part of the comparison (see Cohort analysis -> Genes in this online help for more details).

    Survival Summary

    1. Select the Survival Summary tab.

    2. Attribute categories are listed and can be expanded using the down-arrows next to the category names.

    3. Select the drop-down arrow for Therapeutic interventions.

    Survival Comparison

    1. Click the Survival Comparison tab.

    2. A Kaplan-Meier curve is rendered for each cohort.

    3. The P-value displayed at the top of Survival Comparison indicates whether there is a statistically significant difference between the survival probabilities over time of any pair of cohorts (CI=0.95).

    When comparing two cohorts, the P-Value is shown above the two survival curves. For three or four cohorts, P-Values are shown as a pair-wise heatmap, comparing each cohort to every other cohort.
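The pairwise layout described above can be illustrated with a short sketch that enumerates the cohort pairs a comparison needs a P-value for. This is an illustration only: the cohort names are hypothetical, the helper `cohort_pairs` is not part of ICA Cohorts, and no statistical test is computed here.

```python
from itertools import combinations


def cohort_pairs(cohorts):
    """Return every unordered pair of cohorts in a comparison."""
    # The comparison view accepts between two and four cohorts.
    if not 2 <= len(cohorts) <= 4:
        raise ValueError("a comparison takes 2 to 4 cohorts")
    return list(combinations(cohorts, 2))


# Hypothetical cohort names, for illustration only.
print(len(cohort_pairs(["A", "B"])))            # 1 pair: a single P-value
print(len(cohort_pairs(["A", "B", "C"])))       # 3 pairs: heatmap
print(len(cohort_pairs(["A", "B", "C", "D"])))  # 6 pairs: heatmap
```

With two cohorts there is exactly one pair, which is why a single P-value suffices; three or four cohorts produce three or six pairs, which is where a heatmap becomes the more compact display.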

    Marker Frequency Comparison

    1. Select the Marker Frequency tab.

    2. Select either Gene expression (default), Somatic mutation, or Copy number variation

    Correlation Comparison

    1. Select the Correlation tab.

    2. Similar to the single-cohort view (Cohort Analysis | Correlation), choose two clinical attributes and/or genes to compare.

    3. Depending on the available data types for the two selections (categorical and/or continuous), Cohorts will display a bubble plot, violin plot, or scatter plot.

    As additional filter options, you can view only those variants that occur in every cohort; that are unique to one cohort; that have been observed in at least two cohorts; or any variant.

    In each subcategory there is a sum of the subject counts across the selected cohorts.

  • For each cohort, designated by a color, there is a Subject count and Median survival (years) column.

  • Type Malignancy in the Search Box and an auto-complete dropdown suggests three different attributes.

  • Select Synchronous malignancy and the results are automatically opened and highlighted in orange.

  • For gene expression (up- versus down-regulated) and for copy number variation (gain versus loss), Cohorts will display a list of all genes with bidirectional bar charts.
  • For somatic mutations, the bar charts are unidirectional and indicate the percentage of samples with a mutation in each gene per cohort.

  • Bars are color-coded by cohort, see the accompanying legend.

  • Each row shows P-value(s) resulting from pairwise comparison of all cohorts. In the case of comparing two cohorts, the numerical P-value will be displayed in the table. In the case of comparing three or more cohorts, the pairwise P-values are shown as a triangular heatmap, with details available as a tooltip.

  • Cohort analysis -> Genes

    Rare Genetic Disorders Walk-through

    Cohorts Walk-through: Rare Genetic Disorders

    This walk-through is meant to represent a typical workflow when building and studying a cohort of rare genetic disorder cases.

    Login and Create a new ICA Project

    Create a new Project to track your study:

    1. Log in to ICA.

    2. Navigate to Projects

    3. Create a new project using the New Project button.

    Create and Review a Rare Disease Cohort

    1. Navigate to the ICA Cohorts module by clicking Cohorts in the left navigation panel.

    2. Click the Create Cohort button.

    3. Enter a name for your cohort, like Rare Disease + 1kGP, at the top, to the left of the pencil icon.

    Analyze Your Rare Disease Cohort Data

    1. A recent GWAS publication identified 10 risk genes for intellectual disability (ID) and autism. Our task is to evaluate them in ICA Cohorts: TTN, PKHD1, ANKRD11, ARID1B, ASXL3, SCN2A, FHL1, KMT2A, DDX3X, SYNGAP1.

    2. First let’s Hide charts for more visual space.

    3. Click the Genes tab where you need to query a gene to see and interact with results.

    Oncology Walk-through

    This walk-through is intended to represent a typical workflow when building and studying a cohort of oncology cases.

    Create a Cancer Cohort and View Subject Details

    1. Click the Create Cohort button.

    2. Select the following studies to add to your cohort:

      1. TCGA – BRCA – Breast Invasive Carcinoma

      2. TCGA – Ovarian Serous Cystadenocarcinoma

    3. Add a Cohort Name = TCGA Breast and Ovarian_1472

    4. Click on Apply.

    5. Expand Show query details to see the study makeup of your cohort.

    6. Charts will be open by default. If not, click Show charts

    7. Use the gear icon in the top-right to change viewable chart settings.

      Tip: Disease Type, Histological Diagnosis, Technology, and Overall Survival have interesting data about this cohort.

    8. The Subject tab with all Subjects list is displayed below Charts with a link to each Subject by ID and other high-level information, like Data Types measured and reported. By clicking a subject ID, you will be brought to the data collected at the Subject level.

    9. Search for subject TCGA-E2-A14Y and view the data about this Subject.

    10. Click the TCGA-E2-A14Y Subject ID link to view clinical data for this Subject that was imported via the metadata.tsv file on ingest.

      Note: the Subject is a 35-year-old female with vital status and other phenotypes that feed into the Subject attribute selection criteria when creating or editing cohorts.

    11. Click X to close the Subject details.

    12. Click Hide charts to increase the interactive area.

    Data Analysis, Multi-Omic Biomarker Discovery, and Interpretation

    1. Click the Marker Frequency tab, then click the Somatic Mutation tab.

    2. Review the gene list and mutation frequencies.

    3. Note that PIK3CA has a high rate of mutation in the Cohort (ranked 2nd with 33% mutation frequency in 326 of the 987 Subjects that have Somatic Mutation data in this cohort).

    Create a Cohort

    ICA Cohorts lets you create a research cohort of subjects and associated samples based on the following criteria:

    • Project:

      • Include subjects that are part of any ICA Project that you own or that is shared with you.

    Give your project a name and click Save.

  • Navigate to the ICA Cohorts module by clicking COHORTS in the left navigation panel then choose Cohorts.

  • From the Public Data Sets list select:

    1. DRAGEN-1kGP

    2. All Rare genetic disease cohorts

  • Notice that a cohort can also be created based on Technology, Disease Type and Tissue.

  • Under Selected Conditions in right panel, click on Apply

  • A new page opens with your cohort in a top-level tab.

  • Expand Query Details to see the study makeup of your cohort.

  • A set of 4 Charts will be open by default. If they are not, click Show Charts.

    1. Use the gear icon in the top-right of the Charts pane to change chart settings.

  • The bottom section is demarcated by 8 tabs (Subjects, Marker Frequency, Genes, GWAS, PheWAS, Correlation, Molecular Breakdown, CNV).

  • The Subjects tab displays a list of exportable Subject IDs and attributes.

    1. Clicking on a Subject ID link pops up a Subject details page.

  • Type SCN2A into the Gene search field and select it from autocomplete dropdown options.

  • The Gene Summary tab now lists information and links to public resources about SCN2A.

  • Click on the Variants tab to see an interactive Legend and analysis tracks.

    1. The Needle Plot displays gnomAD Allele Frequency for variants in your cohort.

      1. Note that some are in SCN2A conserved protein domains.

    2. In Legend, switch the Plot by option to Sample Count in your cohort.

    3. In Legend, uncheck all Variant Types except Stop gained. Now you should see 7 variants.

    4. Hover over pin heads to see pop-up information about particular variants.

  • The Primate AI track displays scores for potential missense variants, based on polymorphisms observed in primate species. Points above the dashed line for the 75th percentile may be considered "likely pathogenic" because the cross-species sequence is highly conserved; you often see high conservation at the functional domains. Points below the 25th percentile may be considered "likely benign".

  • The Pathogenic variants track displays markers from ClinVar color-coded by variant type. Hover over to see pop-ups with more information.

  • The Exons track shows mRNA exon boundaries with click and zoom functionality at the ends.

  • Below the Needle Plot and analysis tracks is a list of "Variants observed in the selected cohort"

    1. Export Gene Variants table icon is above the legend on right side.

  • Now let's click on the Gene Expression tab to see a Bar chart of 50 normal tissues from GTEx in transcripts per million (TPM). SCN2A is highly expressed in certain Brain tissues, indicating specificity to where good markers for intellectual disability and autism could be expected.

  • As a final exercise in discovering good markers, click on the tab for Genetic Burden Test. The table here associates Phenotypes with Mutations Observed in each Study selected for our cohort, alongside Mutations Expected, to derive p-values. Given all considerations above, SCN2A is a good marker for intellectual disability (p < 1.465 × 10^-22) and autism (p < 5.290 × 10^-9).

  • Continue to check the other genes of interest in step 1.

  • Do Subjects with PIK3CA mutations have changes in PIK3CA RNA Expression?

  • Click the Gene Expression tab, search for PIK3CA

    1. PIK3CA RNA is down-regulated in 27% of the subjects relative to normal samples.

      1. Switch from normal to disease Reference where the Subject’s denominator is the median of all disease samples in your cohort.

      2. Note the count of matching vs. total subjects that have up-regulated PIK3CA RNA, which may indicate a distinctive sub-phenotype.

  • Click directly on PIK3CA gene link in the Gene Expression table.

  • You are brought to the Gene tab under the Gene Summary sub-tab that lists information and links to public resources about PIK3CA.

  • Click the Variants tab and Show legend and filters if it does not open by default.

  • Below the interactive legend you see a set of analysis tracks: Needle Plot, Primate AI, Pathogenic variants, and Exons.

  • The Needle Plot allows toggling the plot by gnomAD frequency and Sample Count. Select Sample Count in the Plot by legend above the plot.

    1. There are 87 mutations distributed across the 1068 amino acid sequence, listed below the analysis tracks. These can be exported via the icon into a table.

  • We know that missense variants can severely disrupt translated protein activity. Deselect all Variant Types except for Missense from the Show Variant Type legend above the needle plot.

    1. Many mutations are in the functional domains of the protein as seen by the colored boxes and labels on the x-axis of the Needle Plot.

  • Hover over the variant with the highest sample count in the yellow PI3Ka protein domain.

    1. The pop-up shows variant details for the 64 Subjects observed with it: 63 in the Breast Cancer study and 1 in the Ovarian Cancer Study.

  • Use the Exon zoom bar from each end of the Amino Acid sequence to zoom in to the PI3Ka domain to better separate observations.

  • There are three different missense mutations at this locus, changing the wild-type glutamic acid to lysine (64), glycine (6), or alanine (2) at different frequencies.

  • The Pathogenic variants track shows 7 ClinVar entries for mutations stacked at this locus, affecting amino acid 545. Hover over the purple triangles to see pop-up details with pathogenicity calls, phenotypes, submitter, and a link to the ClinVar entry.

  • Note the Primate AI track and high Primate AI score.

    1. The Primate AI track displays scores for potential missense variants, based on polymorphisms observed in primate species. Points above the dashed line for the 75th percentile may be considered "likely pathogenic" because the cross-species sequence is highly conserved; you often see high conservation at the functional domains. Points below the 25th percentile may be considered "likely benign".

  • Click the Expression tab and notice that normal Breast and normal Ovarian tissue have relatively high PIK3CA RNA expression in the GTEx RNAseq tissue data, although the gene is ubiquitously expressed.

  • Subject:

    • Subject inclusion by Identifier:

      • Input a list of Subject Identifiers (up to 100 entries) when defining a cohort.

      • The Subject Identifier filter is combined using AND logic with any other applied filters.

      • Within the list of subject identifiers, OR logic is applied (i.e., a subject matches if it is in the provided list).

    • Demographics such as age, sex, ancestry.

    • Biometrics such as body height, body mass index.

    • Family and patient medical history.

  • Sample:

    • Sample type such as FFPE.

    • Tissue type.

    • Sequencing technology: Whole genome DNA-sequencing, RNAseq, single-cell RNAseq, etc.

  • Disease:

    • Phenotypes and diseases from standardized ontologies.

  • Drug:

    • Drugs from standardized ontologies along with specific typing, stop reasons, drug administration routes, and time points.

  • Molecular attributes:

    • Samples with a somatic mutation in one or multiple, specified genes.

    • Samples with a germline variant of a specific type in one or multiple, specified genes.

    • Samples over- or under-expressed in one or multiple, specified genes.

    • Samples with a copy number gain or loss involving one or multiple, specified genes.
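The way the Subject Identifier filter combines with other criteria (OR within the identifier list, AND with every other applied filter) can be sketched as follows. This is a minimal illustration of the logic described in the list above, not the ICA Cohorts implementation; the function name, the subject records, and the treatment of an empty identifier list as "no identifier filter" are all assumptions.

```python
MAX_IDENTIFIERS = 100  # limit stated in the list above


def matches_cohort(subject, subject_ids, other_filters):
    """Apply an identifier list (OR inside the list) AND all other filters."""
    if len(subject_ids) > MAX_IDENTIFIERS:
        raise ValueError("the identifier list accepts up to 100 entries")
    # OR logic: the subject matches if it is anywhere in the provided list.
    # Assumption: an empty list means no identifier filter is applied.
    in_id_list = not subject_ids or subject["id"] in subject_ids
    # AND logic: every other applied filter must also hold.
    passes_others = all(check(subject) for check in other_filters)
    return in_id_list and passes_others


# Hypothetical subjects and filter, for illustration only.
subjects = [
    {"id": "S1", "sex": "Female"},
    {"id": "S2", "sex": "Male"},
    {"id": "S3", "sex": "Female"},
]


def is_female(subject):
    return subject["sex"] == "Female"


selected = [s["id"] for s in subjects if matches_cohort(s, {"S1", "S2"}, [is_female])]
print(selected)  # ['S1']
```

S2 is in the identifier list but fails the other filter, and S3 passes the other filter but is not in the list, so only S1 is selected.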

    Disease search

    ICA Cohorts currently uses six standard medical ontologies to 1) annotate each subject during ingestion and then to 2) search for subjects: HPO for phenotypes, MeSH, SNOMED-CT, ICD9-CM, ICD10-CM, and OMIM for diseases. By default, any 'type-ahead' search will find matches from all six; and you can limit the search to only the one(s) you prefer. When searching for subjects using names or codes from one of these ontologies, ICA Cohorts will automatically match your query against all the other ontologies, therefore returning subjects that have been ingested using a corresponding entry from another ontology.

    In the 'Disease' tab, you can search for subjects diagnosed with one or multiple diseases, as well as phenotypes, in two ways:

    • Start typing the English name of a disease/phenotype and pick from the suggested matches. Continue typing if your disease/phenotype of interest is not listed initially.

      • Use the mouse to select the term or navigate to the term in the dropdown using the arrow buttons.

      • If applicable, the concept hierarchy is shown, with ancestors and immediate children visible.

      • For diagnostic hierarchies, concept children count and descendant count for each disease name is displayed.

        • Descendant Count: Displays next to each disease name in the tree hierarchy (e.g., "Disease (10)").

        • Leaf Nodes: No children count shown for leaf nodes.

        • Missing Counts: Children count is hidden if unavailable.

      • Select a checkbox to include the diagnostic term along with all of its children and descendants.

      • Expand the categories and select or deselect specific disease concepts.

    • Paste one or multiple diagnostic codes separated by a pipe (‘|’).
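If you keep diagnostic codes in a script or spreadsheet export, a small helper can build the pipe-separated string to paste. This is a sketch only: the helper names are hypothetical, the codes shown are placeholders rather than real ontology entries, and stripping surrounding whitespace is an assumption about what the search field expects.

```python
def join_codes(codes):
    """Build a pipe-separated string from a list of diagnostic codes."""
    # Drop empty entries and surrounding whitespace (assumed to be unwanted).
    cleaned = [code.strip() for code in codes if code.strip()]
    if any("|" in code for code in cleaned):
        raise ValueError("codes must not contain the pipe separator")
    return "|".join(cleaned)


def split_codes(text):
    """Reverse of join_codes: split a pasted string back into codes."""
    return [code.strip() for code in text.split("|") if code.strip()]


# Placeholder codes, for illustration only.
query = join_codes(["CODE_A", " CODE_B ", ""])
print(query)  # CODE_A|CODE_B
```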

    Drug Search

    In the 'Drug' tab, you can search for subjects who have a specific medication record:

    • Start typing the concept name for the drug and pick from suggested matches. Continue typing if the drug is not listed initially.

    • Paste one or multiple drug concept codes. ICA Cohorts currently uses RxNorm as a standard ontology during ingestion. If multiple concepts are in your instance of ICA Cohorts, they will be listed under 'Concept Ontology.'

    • 'Drug Type' is a static list of qualifiers that denote the specific administration of the drug. For example, where the drug was dispensed.

    • 'Stop Reason' is a static list of attributes describing a reason why a drug was stopped if available in the data ingestion.

    • 'Drug Route' is a static list of attributes that describe the physical route of administration of the drug. For example, Intravenous Route (IV).

    Measurement Search

    In the ‘Measurements’ tab, you can search for vital signs and laboratory test data leveraging LOINC concept codes.

    • Start typing the English name of the LOINC term, for example, ‘Body height’. A dropdown will appear with matching terms. Use the mouse or down arrows to select the term.

    • Upon selecting a term, the term will be available for use in a query.

    • Terms can be added to your query criteria.

    • For each term, you can set a value `Greater than or equal`, `Equals`, `Less than or equal`, `In range`, or `Any value`.

    • `Any value` will find any record where there is an entry for the measurement independent of an available value.

    • Click `Apply` to add your criteria to the query.

    • Click `Update Now` to update the running count of the Cohort.
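The comparison options listed above can be read as simple predicates on a measurement record. The sketch below is one plausible interpretation, not the ICA Cohorts implementation: the function name is hypothetical, and both the inclusive bounds for `In range` and the rule that a record without a value only matches `Any value` are assumptions.

```python
def measurement_matches(record_value, operator, low=None, high=None):
    """Evaluate one measurement record against a query operator."""
    if operator == "Any value":
        # Matches any record for the measurement, even without a value.
        return True
    if record_value is None:
        # Assumption: value-based operators cannot match a missing value.
        return False
    if operator == "Greater than or equal":
        return record_value >= low
    if operator == "Equals":
        return record_value == low
    if operator == "Less than or equal":
        return record_value <= low
    if operator == "In range":
        # Assumption: both bounds are inclusive.
        return low <= record_value <= high
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical 'Body height' records in centimetres.
heights = [150.0, 172.5, None]
print([measurement_matches(h, "Greater than or equal", 170) for h in heights])
print([measurement_matches(h, "Any value") for h in heights])
```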

    Include/Exclude

    • As attributes are added to the 'Selected Condition' on the right-navigation panel, you can choose to include or exclude the criteria selected.

      • Select a criterion from 'Subject', 'Disease', and/or 'Molecular' attributes by filling in the appropriate checkbox on the respective attribute selection pages.

      • When selected, the attribute will appear in the right-navigation panel.

      • You can use the 'Include' / 'Exclude' dropdown next to the selected attribute to decide if you want to include or exclude subjects and samples matching the attribute.

      • Note: with 'Include', a subject needs to match at least one of the 'included' attributes in any given category to be included in the cohort. (Category refers to disease, sex, body height, etc.) For example, if you specify multiple diseases as inclusion criteria, subjects only need to be diagnosed with one of them. With 'Exclude', any subject who matches at least one exclusion criterion is excluded; subjects do not have to match all exclusion criteria in the same category to be excluded from the cohort.

      • Note: This feature is not available on the 'Project' level selections as there is no overlap between subjects in datasets.

      • Note: Using exclusion criteria does not account for NULL values. For example, if the Super-population 'Europeans' is excluded, subjects will be in your cohort even if they do not contain this data point.
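The Include/Exclude semantics in the notes above, including the behaviour for NULL values, can be sketched like this. It is an illustration under stated assumptions rather than the actual query engine: the function name and subject records are hypothetical, a NULL value is modelled as an empty set, and combining different categories with AND is an assumption.

```python
def in_cohort(subject, include, exclude):
    """subject, include, exclude: dicts mapping a category to a set of values."""
    # Include: within a category, matching any one listed value is enough.
    # Assumption: different categories are combined with AND.
    for category, wanted in include.items():
        if not (subject.get(category, set()) & wanted):
            return False
    # Exclude: matching any one listed value removes the subject.
    # A NULL value (empty set) never matches, so the subject stays.
    for category, unwanted in exclude.items():
        if subject.get(category, set()) & unwanted:
            return False
    return True


# Hypothetical subjects, for illustration only.
subjects = {
    "S1": {"disease": {"asthma"}, "super_population": {"Europeans"}},
    "S2": {"disease": {"asthma", "diabetes"}, "super_population": {"East Asians"}},
    "S3": {"disease": {"diabetes"}, "super_population": set()},  # NULL value
    "S4": {"disease": {"epilepsy"}, "super_population": {"East Asians"}},
}
include = {"disease": {"asthma", "diabetes"}}
exclude = {"super_population": {"Europeans"}}

kept = sorted(sid for sid, s in subjects.items() if in_cohort(s, include, exclude))
print(kept)  # ['S2', 'S3']
```

S3 has no super-population recorded and therefore remains in the cohort even though 'Europeans' is excluded, which mirrors the NULL-value note above.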

    Once you have selected Create Cohort, the above data are organized in tabs such as Project, Subject, Disease, and Molecular. Each tab then contains the aforementioned sections, among others, to help you identify cases and/or controls for further analysis. Navigate through these tabs, or search for an attribute by name to jump directly to that tab and section, and select attributes and values that are relevant to describe your subjects and samples of interest. Assign a new name to the cohort you created, and click Apply to save the cohort.

    Duplicate a Cohort Definition

    • After creating a Cohort, select the Duplicate icon.

    • A copy of the Cohort definition will be created and tagged with "_copy".

    Delete a Cohort Definition

    • Deleting a Cohort Definition can be accomplished by clicking the Delete Cohort icon.

    • This action cannot be undone.

    Sharing a Cohort within an ICA Project

    After creating a Cohort, users can set a Cohort bookmark as Shared. A shared Cohort can be applied across the project by other users with access to the Project. By default, Cohorts created in a Project are accessible only to the user who created them; other users in the project cannot see the cohort unless it is shared.

    Share Cohort Definition

    • Create a Cohort using the directions above.

    • To make the Cohort available to other users in your Project, click the Share icon.

    • The Share icon will be filled in black and the Shared Status will be turned from Private to Shared.

    • Other users with access to Cohorts in the Project can now apply the Cohort bookmark to their data in the project.

    Unshare a Cohort Definition

    • To unshare the Cohort, click the Share icon.

    • The icon will turn from black to white, and other users within the project will no longer have access to this cohort definition.

    Archive a Cohort Definition

    • A Shared Cohort can be Archived.

    • Select a Shared Cohort with a black Shared Cohort icon.

    • Click the Archive Cohort icon.

    • You will be asked to confirm this selection.

    • Upon archiving the Cohort definition, the Cohort will no longer be seen by other users in the Project.

    • The archived Cohort definition can be unarchived by clicking the Unarchive Cohort icon.

    • When the Cohort definition is unarchived, it will be visible to all users in the Project.

    Sharing a Cohort as Bundle

    You can link cohorts data sets to a bundle as follows:

    • Create or edit a bundle at Bundles from the main navigation.

    • Navigate to Bundles > your_bundle > Cohorts > Data Sets.

    • Select Link Data Set to Bundle.

    • Select the data set that you want to link and click +Select.

    • After a brief time, the cohorts data set will be linked to your bundle and ICA_BASE_100 will be logged.

    If you cannot find the cohorts data sets that you want to link, verify that:

    • Your data set is part of a project (Projects > your_project > Cohorts > Data Sets)

    • This project is set to Data Sharing (Projects > your_project > Project Settings > Details)

    Stop sharing a Cohort as Bundle

    You can unlink cohorts data sets from bundles as follows:

    • Edit the desired bundle at Bundles from the main navigation.

    • Navigate to Bundles > your_bundle > Cohorts > Data Sets.

    • Select the cohorts data set which you want to unlink.

    • Select Unlink Data Set from Bundle.

    • After a brief time, the cohorts data set will be unlinked from your bundle and ICA_BASE_101 will be logged.

    Precomputed GWAS and PheWAS

    The GWAS and PheWAS tabs in ICA Cohorts allow you to visualize precomputed analysis results for phenotypes/diseases and genes, respectively. Note that these do not reflect the subjects that are part of the cohort that you created.

    ICA Cohorts currently hosts GWAS and PheWAS analysis results for approximately 150 quantitative phenotypes (such as "LDL direct" and "sitting height") and about 700 diseases.

    Visualize Results from Precomputed Genome-Wide Association Studies (GWAS)

    : Children count is hidden if unavailable.
  • Show Term Count: A new checkbox below "Age of Onset" that is always checked. Unchecking it hides the descendant count.

  • Navigate to the GWAS tab and start looking for phenotypes and diseases in the search box. Cohorts will suggest the best matches against any partial input ("cancer") you provide. After selecting a phenotype/disease, Cohorts will render a Manhattan plot, by default collapsed to gene level and organized by their respective position in each chromosome.

    Circles in the Manhattan plot indicate binary traits, potential associations between genes and diseases. Triangles indicate quantitative phenotypes with regression Beta different from zero, and point up or down to depict positive or negative correlation, respectively.

    Hovering over a circle/triangle will display the following information:

    • gene symbol

    • variant group (see below)

    • P-value, both raw and FDR-corrected

    • number of carriers of variants of the given type

    • number of carriers of variants of any type

    • regression Beta

    For gene-level results, Cohorts distinguishes five different classes of variants: protein truncating; deleterious; missense; missense with a high ILMN PrimateAI score (indicating likely damaging variants); and synonymous variants. You can limit results to any one of these five classes, or select All to display all results together.

    • Deleterious variants (del): the union of all protein-truncating variants (PTVs, defined below), pathogenic missense variants with a PrimateAI score greater than a gene-specific threshold, and variants with a SpliceAI score greater than 0.2.

    • Protein-truncating variants (ptv): variant consequences matching any of stop_gained, stop_lost, frameshift_variant, splice_donor_variant, splice_acceptor_variant, start_lost, transcript_ablation, transcript_truncation, exon_loss_variant, gene_fusion, or bidirectional_gene_fusion.

    • missense_all: all missense variants regardless of their pathogenicity.

    • missense, PrimateAI optimized (missense_pAI_optimized): only pathogenic missense variants with a PrimateAI score greater than a gene-specific threshold.

    • missenses and PTVs (missenses_and_ptvs_all): the union of all PTVs, SpliceAI > 0.2 variants and all missense variants regardless of their pathogenicity scores.

    • all synonymous variants (syn).
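The variant group definitions above can be expressed as set logic. The sketch below assigns a single variant to the groups it would fall into, following the definitions as written; it is not the pipeline used to precompute the results. The function name and parameters are hypothetical, and the consequence labels `missense_variant` and `synonymous_variant` are assumed to follow the same naming style as the protein-truncating consequences listed above.

```python
# Consequences that define protein-truncating variants, as listed above.
PTV_CONSEQUENCES = {
    "stop_gained", "stop_lost", "frameshift_variant", "splice_donor_variant",
    "splice_acceptor_variant", "start_lost", "transcript_ablation",
    "transcript_truncation", "exon_loss_variant", "gene_fusion",
    "bidirectional_gene_fusion",
}
SPLICE_AI_CUTOFF = 0.2  # threshold stated in the definitions above


def variant_groups(consequence, primate_ai=None, splice_ai=None, gene_threshold=None):
    """Return the set of variant groups a single variant belongs to."""
    is_ptv = consequence in PTV_CONSEQUENCES
    is_missense = consequence == "missense_variant"  # assumed label
    # "Pathogenic" missense: PrimateAI score above the gene-specific threshold.
    is_pathogenic_missense = (
        is_missense
        and primate_ai is not None
        and gene_threshold is not None
        and primate_ai > gene_threshold
    )
    high_splice = splice_ai is not None and splice_ai > SPLICE_AI_CUTOFF

    groups = set()
    if is_ptv:
        groups.add("ptv")
    if is_missense:
        groups.add("missense_all")
    if is_pathogenic_missense:
        groups.add("missense_pAI_optimized")
    if is_ptv or is_pathogenic_missense or high_splice:
        groups.add("del")
    if is_ptv or high_splice or is_missense:
        groups.add("missenses_and_ptvs_all")
    if consequence == "synonymous_variant":  # assumed label
        groups.add("syn")
    return groups


print(sorted(variant_groups("stop_gained")))
print(sorted(variant_groups("missense_variant", primate_ai=0.9, gene_threshold=0.8)))
```

Note that the groups overlap by construction: every protein-truncating variant is also counted as deleterious, and a missense variant below its gene-specific threshold appears only in the two "all missense" groups.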

    To zoom in to a particular chromosome, click the chromosome name underneath the plot, or select the chromosome from the drop down box, which defaults to Whole genome.

    Visualize Results from Precomputed Phenome-Wide Association Studies (PheWAS)

    To browse PheWAS analysis results by gene, navigate to the PheWAS tab and enter a gene of interest into the search box. The resulting Manhattan plot will show phenotypes and diseases organized into a number of categories, such as "Diseases of the nervous system" and "Neoplasms". Click on the name of a category, shown underneath the plot, to display only those phenotypes/diseases, or select a category from the drop down, which defaults to All.

    Public Data Sets

    ICA Cohorts comes front-loaded with a variety of publicly accessible data sets, covering multiple disease areas and also including healthy individuals.

    Data set
    Samples
    Diseases/Phenotypes
    Reference

    1kGP-DRAGEN

    3202 WGS: 2504 original samples plus 698 relateds

    Presumed healthy

    DDD

    4293 (3664 affected), de novos only

    Prepare Metadata Sheets

    In ICA Cohorts, metadata describe any subjects and samples imported into the system in terms of attributes, including:

    • subject:

      • demographics such as age, sex, ancestry;

      • phenotypes and diseases;

      • biometrics such as body height, body mass index, etc.;

      • pathological classification, tumor stages, etc.;

      • family and patient medical history;

    • sample:

      • sample type such as FFPE,

      • tissue type,

      • sequencing technology: whole genome DNA-sequencing, RNAseq, single-cell RNAseq, among others.

    You can use these attributes while creating a cohort to define the cases and/or controls that you want to include.

    During import, you will be asked to upload a metadata sheet as a tab-delimited (TSV) file. An example sheet is available for download on the Import files page in the ICA Cohorts UI.

    A metadata sheet will need to contain at least these four columns per row:

    • Subject ID - identifier referring to individuals; use the column header "SubjectID".

    • Sample ID - identifier for a sample. Sample IDs need to match the corresponding column header in VCF/GVCFs; each subject can have multiple samples, these need to be specified in individual rows for the same SubjectID; use the column header "SampleID".

    • Biological sex - can be "Female (XX)", "Female"; "Male (XY)", "Male"; "X (Turner's)"; "XXY (Klinefelter)"; "XYY"; "XXXY" or "Not provided". Use the column header "DM_Sex" (demographics).

    A description of all attributes and data types currently supported by ICA Cohorts can be found here:

    You can download an example of a metadata sheet, which contains some samples from The Cancer Genome Atlas () and their publicly available clincal attributes, here:

    A list of concepts and diagnoses that cover all public data subjects to easily navigate the new concept code browser for diagnosis can be found here:

    Sequencing technology - can be "Whole genome sequencing", "Whole exome sequencing", "Targeted sequencing panels", or "RNA-seq"; use the column header "TC" (technology).

    creating a cohortarrow-up-right
    importarrow-up-right
    ICA_Cohorts_Supported_Attributes.xlsxarrow-up-right
    TCGAarrow-up-right
    ICA_Cohorts_Example_Metadata.tsvarrow-up-right
    PublicData_AllConditionsSummarized.xlsxarrow-up-right

    Developmental disorders

    McRae et al., Nature 19:1194-1196arrow-up-right

    EPI4K

    356, de novos only

    Epilepsy

    Epi4K Consortium, Nature 501:217-221arrow-up-right

    ASD Cohorts

    6786 (4266 affected), de novos only

    Autism Spectrum disorder

    Iossifov et al. Neuron 74:285-299arrow-up-right; Iossifov et al. Nature 498:216-221arrow-up-right; O'Roak et al. Nature 485:246-250arrow-up-right; Sanders et al. Nature 485:237-241arrow-up-right; Sanders et al. Neuron 87:1215-1233arrow-up-right; De Rubeis et al. Nature 515:209-215arrow-up-right

    De Ligt et al.

    100, de novos only

    Intellectual disability

    De Ligt et al., N Engl J Med 367:1921-1929arrow-up-right

    Homsy et al.

    1213, de novos only

    Congenital heart disease (HP:0030680)

    Homsy et al., Science 350:1262-1266arrow-up-right

    Lelieveld et al.

    820, de novos only

    Intellectual disability

    Lelieveld et al., Nature Neuroscience19:1194-1196arrow-up-right

    Rauch et al.

    51, de novos only

    Intellectual disability

    Rauch et al., Lancet 380:1674-1682arrow-up-right

    Rare Genomes Project

    315 WES (112 pedigrees)

    Various

    https://raregenomes.org/

    TCGA

    ca. 4200 WES, ca. 4000 RNAseq

    12 tumor types

    https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga

    GEO

    RNAseq

    Auto-immune disorders, incl. asthma, arthritis, SLE, MS, Crohn's disease, Psoriasis, Sjögren's Syndrome

    For GEO/GSE study identifiers, please refer to the in-product list of studies

    RNAseq

    Kidney diseases

    For GEO/GSE study identifiers, please refer to the in-product list of studies

    RNAseq

    Central nervous system diseases

    For GEO/GSE study identifiers, please refer to the in-product list of studies

    RNAseq

    Parkinson's disease

    For GEO/GSE study identifiers, please refer to the in-product list of studies

    DRAGEN reanalysis of the 1000 Genomes Datasetarrow-up-right
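
    As a minimal sketch of the four required metadata columns described under Prepare Metadata Sheets, the following Python snippet checks rows against the documented column headers and accepted values, and writes them as a tab-delimited file. The subject and sample identifiers, the function names, and any output path are hypothetical; only the column headers ("SubjectID", "SampleID", "DM_Sex", "TC") and the accepted values come from this documentation. It is not the validation that ICA Cohorts itself performs.

```python
import csv

# Column headers and accepted values as listed in this documentation.
REQUIRED_COLUMNS = ["SubjectID", "SampleID", "DM_Sex", "TC"]
ALLOWED_SEX = {
    "Female (XX)", "Female", "Male (XY)", "Male", "X (Turner's)",
    "XXY (Klinefelter)", "XYY", "XXXY", "Not provided",
}
ALLOWED_TC = {
    "Whole genome sequencing", "Whole exome sequencing",
    "Targeted sequencing panels", "RNA-seq",
}


def validate_rows(rows):
    """Return a list of problems found; an empty list means no problems."""
    problems = []
    for number, row in enumerate(rows, start=1):
        for column in REQUIRED_COLUMNS:
            if not row.get(column):
                problems.append(f"row {number}: missing {column}")
        sex = row.get("DM_Sex")
        if sex and sex not in ALLOWED_SEX:
            problems.append(f"row {number}: unexpected DM_Sex value {sex!r}")
        technology = row.get("TC")
        if technology and technology not in ALLOWED_TC:
            problems.append(f"row {number}: unexpected TC value {technology!r}")
    return problems


def write_metadata_sheet(path, rows):
    """Write the rows as a tab-delimited sheet with the required header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=REQUIRED_COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical example: one subject with two samples, one row per sample.
example_rows = [
    {"SubjectID": "SUBJ-001", "SampleID": "SAMPLE-001-WGS",
     "DM_Sex": "Female", "TC": "Whole genome sequencing"},
    {"SubjectID": "SUBJ-001", "SampleID": "SAMPLE-001-RNA",
     "DM_Sex": "Female", "TC": "RNA-seq"},
]
problems = validate_rows(example_rows)
```

    If `problems` is empty, calling `write_metadata_sheet` with a path of your choice produces a TSV with the header line followed by one row per sample. Additional attribute columns from the supported-attributes spreadsheet would need to be added to the field list.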

    Import New Samples


    ICA Cohorts can pull any molecular data available in an ICA Project, as well as additional sample- and subject-level metadata information such as demographics, biometrics, sequencing technology, phenotypes, and diseases.

    To import a new data set, select Import Jobs from the left navigation tab underneath Cohorts, and click the Import Files button. The Import Files button is also available under the Data Sets left navigation item.

    The Data Sets menu item is used to view imported data sets and information. The Import Jobs menu item is used to check the status of data set imports.

    Confirm that the project shown is the ICA Project that contains the molecular data you would like to add to ICA Cohorts.

    1. Choose a data type among

      • Germline variants

      • Somatic mutations

      • RNAseq

      • GWAS

    2. Choose a new study name by selecting the radio button Create new study and entering a Study Name.

    3. To add new data to an existing Study, select the radio button Select from list of studies and select an existing Study Name from the dropdown.

    4. To add data to existing records or add new records, select Job Type, Append.

      • Append does not wipe out any data ingested previously and can be used to ingest the molecular data in an incremental manner.

    5. To replace data, select Job Type, Replace. If you are ingesting data again, use the Replace job type.

    6. Enter an optional Study description.

    7. Select the metadata model (default: Cohorts; alternatively, select OMOP version 5.4 if your data is formatted that way).

    8. Select the genome build your molecular data is aligned to (default: GRCh38/hg38).

    9. For RNAseq, specify whether you want to run differential expression (see below) or only upload raw TPM.

    10. Click Next.

    11. Navigate to VCFs located in the Project Data.

    12. Select each single-sample VCF or multi-sample VCF to ingest. For GWAS, select CSV files produced by Regenie.

      • As an alternative to selecting individual files, you can also opt to select a folder instead. Toggle the radio button on Step 2 from "Select files" to "Select folder".

      • This option is currently only available for germline variant ingestion: any combination of small variants, structural variation, and/or copy number variants.

      • ICA Cohorts will scan the selected folder and all sub-folders for any VCF files or JSON files and try to match them against the Sample ID column in the metadata TSV file (Step 3).

      • Allowed file extensions for VCF files after the sample ID are: *.vcf.gz, *.hard-filtered.vcf.gz, *.cnv.vcf.gz, and *.sv.vcf.gz.

      • Allowed file extensions for JSON files after the sample ID are: *.json, *.json.gz, *.json.bgz, and *.json.gzip.

      • Files not matching sample IDs will be ignored.

    13. Click Next.

    14. Navigate to the metadata (phenotype) TSV file in the Project Data.

    15. Select the TSV file or files for ingestion.

    16. Click Finish.

    circle-info

    Search Spinner behavior in Import Jobs table

    • Search a term and press Enter.

    • The search spinner will appear while the results are loading.

    • Once the results are displayed in the table, the spinner will disappear immediately.

    All VCF types, specifically from DRAGEN, can be ingested using the Germline variants selection. Cohorts will distinguish the variant types that it is ingesting. If Cohorts cannot determine the variant file type, it defaults to ingesting small variants.

    As an alternative to VCFs, you can select Nirvana JSON files for DNA variants: small variants, structural variants, and copy number variation.

    The maximum number of files that can be part of a single manual ingestion batch is 1000.

    Alternatively, users can choose a single folder and ICA Cohorts will identify all ingestible files within that folder and its sub-folders. In this scenario, Cohorts will select molecular data files matching the samples listed in the metadata sheet, which is the next step in the import process.

    Users have the option to ingest either VCF files or Nirvana JSON files for any given batch, regardless of the chosen ingestion method.

    The sample identifiers used in the VCF columns need to match the sample identifiers used in subject/sample metadata files; accordingly, if you are starting from JSON files containing variant- and gene-level annotations provided by Illumina Nirvana, the samples listed in the header need to match the metadata files.

    hashtag
    Variant file formats

    ICA Cohorts supports VCF files formatted according to VCF v4.2 and v4.3 specifications. VCF files require at least one of the following header rows to identify the genome build:

    • ##reference=file://... --- needs to contain a reference to hg38/GRCh38 in the file path or name (numerical value is sufficient)

    • ##contig=<ID=chr1,length=248956422> --- for hg38/GRCh38

    • ##DRAGENCommandLine= ... --ht-reference

    ICA Cohorts accepts VCFs aligned to hg38/GRCh38 and hg19/GRCh37. If your data uses hg19/GRCh37 coordinates, Cohorts will convert these to hg38/GRCh38 during the ingestion process [see Reference 1]. Harmonizing data to one genome build facilitates searches across different private, shared, and public projects when building and analyzing a cohort. If your data contains a mixture of samples mapped to hg38 and hg19, please ingest these in separate batches, as each import job into Cohorts is limited to one genome build.

    As an alternative to VCFs, ICA Cohorts accepts the JSON output of Illumina Nirvana for hg38/GRCh38-aligned data for small germline variants and somatic mutations, copy number variations, and other structural variants.

    hashtag
    RNAseq file format

    ICA Cohorts can process gene- and transcript-level quantification files produced by the Illumina DRAGEN RNA pipeline. The file naming convention needs to match .quant.genes.sf for genes, and .quant.sf for transcript-level TPM (transcripts per million).

    Please also see the online documentation for the Illumina DRAGEN RNA Pipeline for more information on output file formats.

    hashtag
    GWAS file format

    ICA Cohorts currently supports upload of SNV-level GWAS results produced by Regenie and saved as CSV files.

    hashtag
    Metadata and File Types

    Field | Description
    Project name | The ICA project for your cohort analysis (cannot be changed).
    Study name | Create or select a study. Each study represents a subset of data within the project.
    Description | Short description of the data set (optional).
    Job type | Append: Appends values to any existing values. If a field supports only a single value, the value is replaced. Replace: Overwrites existing values with the values in the uploaded file.
    Subject metadata files | Subject metadata file(s) in tab-delimited format. For Append and Replace job types, the following fields are required and cannot be changed: Sample identifier, Sample display name, Subject identifier, Subject display name, Sex.

    Note: If annotating large sets of samples with molecular data, expect the annotation process to take over 20 minutes per whole genome batch of samples. You will receive two e-mail notifications: one when your ingestion starts and one when it completes successfully or fails.

    As an alternative to ICA Cohorts' metadata file format, you can provide files formatted according to the OMOP common data model 5.4. Cohorts currently ingests data for these OMOP 5.4 tables, formatted as tab-delimited files:

    • PERSON (mandatory),

    • CONCEPT (mandatory if any of the following is provided),

    • CONDITION_OCCURRENCE (optional),

    • DRUG_EXPOSURE (optional), and

    • PROCEDURE_OCCURRENCE (optional).

    Additional files such as measurement and observation will be supported in a subsequent release of Cohorts.

    Note that Cohorts requires that all such files do not deviate from the OMOP CDM 5.4 standard. Depending on your implementation, you may have to adjust file formatting to be OMOP CDM 5.4-compatible.

    hashtag
    References

    [1] VcfMapper: https://stratus-documentation-us-east-1-public.s3.amazonaws.com/downloads/cohorts/main_vcfmapper.py

    [2] crossMap: https://crossmap.sourceforge.net/

    [3] liftOver: https://genome.ucsc.edu/cgi-bin/hgLiftOver

    [4] Chain files: ftp://ftp.ensembl.org/pub/assembly_mapping/homo_sapiens/
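
    The genome build header rules described under Variant file formats can be checked before starting an import. The following Python sketch inspects VCF header lines for the three documented header rows and guesses the build. It is only an illustration of those rules, not the detection logic ICA Cohorts actually uses. The function names are hypothetical, the GRCh37 chr1 length (249250621) is a well-known constant that is not stated in this documentation, and applying the file-path rule to the value after --ht-reference is an assumption.

```python
import re

# chr1 length for hg38/GRCh38 is taken from the documentation;
# the hg19/GRCh37 value is a well-known constant added for this sketch.
CHR1_LENGTHS = {"248956422": "hg38", "249250621": "hg19"}


def build_from_path(text):
    """Guess the genome build from a reference path or file name."""
    lowered = text.lower()
    # Prefer explicit build names over bare numbers.
    if re.search(r"hg38|grch38", lowered):
        return "hg38"
    if re.search(r"hg19|grch37", lowered):
        return "hg19"
    # Fallback: the documentation says a numerical value is sufficient.
    if "38" in lowered:
        return "hg38"
    if "37" in lowered or "19" in lowered:
        return "hg19"
    return None


def detect_genome_build(header_lines):
    """Return 'hg38', 'hg19', or None if no header row identifies the build."""
    for raw_line in header_lines:
        line = raw_line.strip()
        build = None
        if line.startswith("##reference="):
            build = build_from_path(line[len("##reference="):])
        elif line.startswith("##contig="):
            match = re.match(r"##contig=<ID=(?:chr)?1,length=(\d+)", line)
            if match:
                build = CHR1_LENGTHS.get(match.group(1))
        elif line.startswith("##DRAGENCommandLine="):
            # Assumption: the path given to --ht-reference names the build.
            match = re.search(r"--ht-reference[ =](\S+)", line)
            if match:
                build = build_from_path(match.group(1))
        if build:
            return build
    return None


# Hypothetical header excerpt.
example_header = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=248956422>",
]
detected = detect_genome_build(example_header)
```

    A result of None suggests that the file lacks all three header rows and may need one added before ingestion. Because each import job is limited to one genome build, a check like this can also help separate hg19 and hg38 files into different batches.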

    Cohort Analysis


    From the Cohorts menu in the left hand navigation, select a cohort created in Create Cohort to begin a cohort analysis.

    hashtag

    Query Details

    The query details can be accessed by clicking the triangle next to Show Query Details. The query details display the selections used to create a cohort. The selections can be edited by clicking the pencil icon in the top right.

    hashtag
    Charts

    1. Charts will be open by default. If not, click Show Charts.

    2. Use the gear icon in the top-right to change viewable chart settings.

    3. There are four charts available to view summary counts of attributes within a cohort as histogram plots.

    4. Click Hide Charts to hide the histograms.

    hashtag
    Single Subject Timeline View:

    1. Display time-stamped events and observations for a single subject on a timeline. The timeline view is available only for subjects that have time-series data.

    2. The following attributes are displayed in the timeline view:

      • Diagnosed and Self-Reported Diseases:

        • Start and end dates

        • Progression vs. remission

      • Medication and Other Treatments:

        • Prescribed and self-medicated

        • Start date, end date, and dosage at every time point

    3. The timeline uses age (at diagnosis, at event, at measurement) as the x-axis and attribute name as the y-axis. If the birthdate is not recorded for a subject, the user can switch to Date to visualize data.

    4. In the default view, the timeline shows the first five disease entries and the first five drug/medication entries in the plot. Users can choose different attributes or change the order of existing attributes by clicking the “select attribute” button.

    5. The x-axis shows the person’s age in years, with data points initially displayed between ages 0 to 100. Users can zoom in by selecting the desired range to visualize data points within the selected age range.

    6. Each event is represented by a dot in the corresponding track. Events in the same track can be connected by lines to indicate the start and end period of an event.

    circle-info

    Measurement Section: A summary of measurements (without values) is displayed under the section titled "Measurements and Laboratory Values Available." Users can click a link to access the Timeline View for detailed results.

    Drug Section: The "Drug Name" section lists drug names without repeating the header "Drug Name" for each entry.

    hashtag
    Subjects

    1. By default, the Subjects tab is displayed.

    2. The Subjects tab with a list of all subjects matching your criteria is displayed below Charts with a link to each Subject by ID and other high-level information. By clicking a subject ID, you will be brought to the data collected at the Subject level.

    3. Search for a specific subject by typing the Subject ID into the Search Subjects text box.

    4. Get all details available on a subject by clicking the hyperlinked Subject ID in the Subject list.

    To Exclude specific subjects from subsequent analysis, such as marker frequencies or gene-level aggregated views, you can uncheck the box at the beginning of each row in the subject list. You will then be prompted to save any exclusion(s).

    You can Export the list of subjects either to your ICA Project's data folder or to your local disk as a TSV file for subsequent use. Any export will omit subjects that you excluded after you saved those changes. For more information, see Subject Export for Analysis in ICA Bench at the bottom of this page.

    hashtag
    Remove a Subject

    1. Specific subjects can be removed from a Cohort.

    2. Select the Subjects tab.

    3. By default, all subjects in the Cohort are checked.

    4. To remove a specific subject from a Cohort, uncheck the checkbox next to subjects to remove from a Cohort.

    5. Check box selections are maintained while browsing through the pages of the subject list.

    6. Click Save Cohort to save the subjects you would like to exclude.

    7. The specific subjects will no longer be counted in all analysis visualizations.

    8. The specific excluded subjects will be saved for the Cohort.

    9. To add the subjects back to the Cohort, re-check their checkboxes and click Save Cohort.

    hashtag
    Structural variant aggregation: Marker Frequency analysis

    For each individual cohort, display a table of all observed SVs that overlap with a given gene.

    hashtag
    Marker Frequency

    1. Click the Marker Frequency tab, then click the Gene Expression tab.

    2. Down-regulated genes are displayed in blue and Up-regulated genes are displayed in red.

    3. A frequency in the Cohort is displayed and the Matching number/Total is also displayed in the chart.

    4. Genes can be searched by using the Search Genes text box.

    hashtag
    Genes

    1. You are brought to the Gene tab under the Gene Summary sub-tab.

    2. Select a gene by typing the gene name into the Search Genes text box.

    3. A Gene Summary will be displayed that lists information and links to public resources about the selected gene.

    4. A cytogenetic map will be displayed based on the selected gene; a vertical orange bar represents the gene location on the chromosome.

    5. Click the Variants tab and Show legend and filters if it does not open by default.

    6. Below the interactive legend, you see a set of analysis tracks: Needle Plot, PrimateAI, Pathogenic variants, and Exons.

    7. The Needle Plot allows toggling the plot by gnomAD frequency and Sample Count. Select Sample Count in the Plot by legend above the plot. You can also filter the plot to only show variants above/below a certain cut-off for gnomAD frequency (in percent) or absolute sample count.

    8. The Needle Plot allows filtering by PrimateAI Score.

      • Set a lower (>=) or upper (<=) threshold for the PrimateAI Score to filter variants.

      • Enter the threshold value in the text box located below the gnomadFreq/SampleCount input box.

      • If no threshold value is entered, no filter will be applied.

      • The filter affects both the plot and the table when the “Display only variants shown in the plot above” toggle is enabled.

      • Filter preferences persist across gene views for a seamless experience.

    9. The following filters are always shown and can be set independently: %gnomAD Frequency, Sample Count, and PrimateAI Score. Changes made to these filters are immediately reflected in both the needle plot and the variant list below.

    10. Click on a variant's needle pin to view details about the variant from public resources and counts of variants in the selected cohort by disease category. If you want to view all subjects that carry the given variant, click on the sample count link, which will take you to the list of subjects (see above).

    11. Use the Exon zoom bar from each end of the Amino Acid sequence to zoom in on the gene domain to better separate observations.

    12. The Pathogenic Variant Track shows pop-up details with pathogenicity calls, phenotypes, submitter, and a link to the ClinVar entry when you hover over the purple triangles.

    13. Below the needle plot is a full listing of the variants displayed in the needle plot visualization.

      • The "Display only variants shown in the plot above" toggle (enabled by default) syncs the table with the Needle Plot. When the toggle is on, the table will display only the variants shown in the Needle Plot, applying all active filters (e.g., variant type, somatic/germline, sample count). When the toggle is off, all reported variants are displayed in the table and table-based filters can be used.

      • Export to CSV: When the views are synchronized (toggle on), the filtered list of variants can be exported to a CSV file for further analysis.

      • Note on "Stop Lost" Consequence Variants:

        • The stop_lost consequence is mapped as Frameshift, Stop lost in the tooltip.

        • The Stop gained|lost value includes both stop gain and stop loss variants.

        • When the Stop gained filter is applied, Stop lost variants will not appear in the plot or table if the "Display only variants shown in the plot above" toggle is enabled.

    14. The Phenotypes tab shows a stacked horizontal bar chart which displays molecular breakdown (disease type vs. gene) and subject count for the selected gene.

    15. The Gene Expression tab shows known gene expression data from tissue types in GTEx.

    16. The Genetic Burden Test is available for de novo variants only.

    hashtag
    Correlation

    For every correlation, subjects contained in each count can be viewed by selecting the count on the bubble or the count on the X-axis and Y-axis.

    hashtag
    Clinical vs. Clinical Attribute Comparison – Bubble Plot

    1. Click the Correlation Tab.

    2. In X-axis category, select Clinical.

    3. In X-axis Attribute, select a clinical attribute.

    4. In Y-axis category, select Clinical.

    5. In Y-Axis Attribute, select another clinical attribute.

    6. You will be shown a bubble plot comparing the first clinical attribute on the x-axis to the second attribute on the y-axis.

    7. The size of each bubble corresponds to the number of subjects falling into that category.

    hashtag
    Molecular vs. Molecular Attribute Comparison – Bubble Plot

    To see a breakdown of Somatic Mutations vs. RNA Expression levels, perform the following steps:

    Note this comparison is for a Cancer case.

    1. Click the Correlation Tab.

    2. In X-axis category, select Somatic.

    3. In X-axis Attribute, select a gene.

    4. In Y-axis category, select RNA expression.

    5. In Y-Axis Attribute, type a gene and leave Reference Type set to NORMAL.

    6. Click Continuous to see violin plots of compared variables.

    hashtag
    Clinical vs. Molecular Attribute Comparison – Bubble Plot

    Note this comparison is for a Cancer case.

    1. Click the Correlation Tab.

    2. In X-axis category, select Somatic.

    3. In X-axis Attribute, type a gene name.

    4. In Y-axis category, select Clinical.

    5. In Y-Axis Attribute, select a clinical attribute.

    hashtag
    Molecular Breakdown

    1. Click the Molecular Breakdown Tab.

    2. In Enter a clinical Attribute, select a clinical attribute.

    3. In Enter a gene, select a gene by typing a gene name.

    4. You are shown a stacked bar chart with the values of the selected clinical attribute on the Y-axis.

    5. For each attribute value, the bar represents the % of Subjects with RNA Expression, Somatic Mutation, and Multiple Alterations.

    Note: for each of the aforementioned bubble plots, you can view the list of subjects by following the link under each subject count associated with an individual bubble or axis label. This will take you to the list of subjects view, see above.

    hashtag
    CNV

    If there is Copy Number Variant data in the cohort:

    1. Click the CNV tab.

    2. A graph will show the CNV Sample Percentage on the Y-axis and Chromosomes on the X-axis.

    3. Any value above zero is a copy number gain, and any value below zero is a copy number loss.

    4. Click Chromosome: to select a specific chromosome position.

    hashtag
    Subject Export for Analysis in ICA Bench

    ICA allows for integrated analysis in a computation workspace. You can export your cohort definitions and, in combination with molecular data in your ICA Project Data, perform, for example, a GWAS analysis.

    1. Confirm the VCF data for your analysis is in ICA Project Data.

    2. From within your ICA Project, start a Bench Workspace (see Bench for more details).

    3. Navigate back to ICA Cohorts.

    4. Create a Cohort of subjects of interest using Create a Cohort.

    5. From the Subjects Tab, click Export subjects... at the top right of the subject list. The file can be downloaded to the browser or to ICA Project Data.

    6. We suggest using export ...to Data Folder for immediate access to this data in Bench or other areas of ICA.

    7. Create another cohort if needed for your research and repeat the last three steps.

    8. Navigate to the Bench workspace created in the second step.

    9. After the workspace has started up, click Access.

    10. Find the /Project/ folder in the Workspace file navigation.

    11. This folder will contain your cohort files created along with any pipeline output data needed for your workspace analysis.
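
    Once exported subject lists are available in the /Project/ folder of a Bench workspace, they can be read with standard Python tooling. The sketch below shows one possible way to load subject identifiers from two exported TSV files (for example, cases and controls) and check for overlap before an analysis such as GWAS. The column header "SubjectID", the function name, and the sample content are assumptions made for illustration; check the header line of your own export and adjust accordingly.

```python
import csv
import io


def load_subject_ids(handle, id_column="SubjectID"):
    """Read a tab-delimited subject export and return the set of subject IDs.

    The column name is an assumption; pass the header used in your export.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or id_column not in reader.fieldnames:
        raise ValueError(f"column {id_column!r} not found in export header")
    return {row[id_column] for row in reader if row[id_column]}


# Hypothetical export contents standing in for files under /Project/.
cases_tsv = "SubjectID\tSex\nSUBJ-001\tFemale\nSUBJ-002\tMale\n"
controls_tsv = "SubjectID\tSex\nSUBJ-003\tFemale\nSUBJ-002\tMale\n"

cases = load_subject_ids(io.StringIO(cases_tsv))
controls = load_subject_ids(io.StringIO(controls_tsv))

# Subjects present in both cohorts would need to be resolved before
# running a case/control analysis.
overlap = cases & controls
```

    In a Bench workspace, the in-memory strings would be replaced by files opened from the /Project/ folder, for example with `open(path, newline="", encoding="utf-8")`, where `path` points to your exported file.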
    Illumina Connected Analytics: Cohorts Multi-Omic Cancer
    Multi-Omic Cancer Workflow

    Cohorts Data in ICA Base

    ICA Cohorts data can be viewed in an ICA Project Base instance as a shared database. A shared database in ICA Base operates as a database view. To use this feature, enable Base for your project prior to starting any ICA Cohorts ingestions. See Basearrow-up-right for more information on enabling this feature in your ICA Project.

    hashtag
    ICA Cohorts Base Tables

    After you ingest data into your project, selected phenotypic and molecular data are available to view in Base. See Cohorts Import for instructions on importing data sets into Cohorts.

    1. Post ingestion, data will be represented in Base.

    2. Select BASE from the ICA left-navigation and click Query.

    3. Under the New Query window, a list of tables is displayed. Expand the Shared Database for Project <your project name>.

    4. Cohorts tables will be displayed.

    5. To preview the table and fields click each view listed.

    6. Clicking any of these views then selecting PREVIEW on the right-hand side will show you a preview of the data in the tables.
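
    To illustrate the kind of query that can be written against the Cohorts views, the sketch below runs a SQL statement against a small in-memory SQLite table that mimics a few columns of ANNOTATED_VARIANTS. This is a stand-in for experimentation only: ICA Base is not SQLite, the rows are invented, and the assumption that GENE_HGNC, CONSEQUENCE, VARIANT_CALL, and ZYGOSITY all belong to ANNOTATED_VARIANTS should be confirmed by previewing the view. The query counts germline (VARIANT_CALL = 1), homozygous-alternate (ZYGOSITY = 2) variants per gene.

```python
import sqlite3

# In-memory stand-in for the shared database; not a connection to ICA Base.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ANNOTATED_VARIANTS ("
    "GENE_HGNC TEXT, CONSEQUENCE TEXT, VARIANT_CALL INTEGER, ZYGOSITY INTEGER)"
)

# Hypothetical rows for illustration only.
rows = [
    ("BRCA2", "missense", 1, 1),
    ("BRCA2", "stop gained", 1, 2),
    ("TP53", "missense", 2, 1),
    ("BRCA2", "missense", 1, 2),
]
conn.executemany("INSERT INTO ANNOTATED_VARIANTS VALUES (?, ?, ?, ?)", rows)

# Count germline, homozygous-alternate variants per gene.
QUERY = """
SELECT GENE_HGNC, COUNT(*) AS N_VARIANTS
FROM ANNOTATED_VARIANTS
WHERE VARIANT_CALL = 1 AND ZYGOSITY = 2
GROUP BY GENE_HGNC
ORDER BY N_VARIANTS DESC
"""
result = conn.execute(QUERY).fetchall()
conn.close()
```

    The SQL text in `QUERY` uses only standard clauses, so a similar statement could be tried in the Base query editor after adjusting the table and column names to match the views shown for your project.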

    circle-info

    If your ingestion includes Somatic variants, there will be two molecular tables: ANNOTATED_SOMATIC_MUTATIONS and ANNOTATED_VARIANTS. All ingestions will include a PHENOTYPE table.

    circle-info

    The PHENOTYPE table includes a harmonized set that is collected across all data ingestions and is not representative of all data ingested for the Subject or Sample. Sample information is also displayed in this table, if applicable. Sample information drives the annotation process if molecular data is included in the ingestion. That data is stored in the PHENOTYPE table.

    hashtag
    Phenotype Data

    Field Name
    Type
    Description

    hashtag
    Sample Information

    Field Name
    Type
    Description

    hashtag
    Sample Attribute

    This table is an entity-attribute value table of supplied sample data matching Cohorts accepted attributes.

    Field Name
    Type
    Description

    hashtag
    Study Information

    Field Name
    Type
    Description

    hashtag
    Subject

    Field
    Type
    Description

    hashtag
    Subject Attribute

    This table is an entity-attribute value table of supplied subject data matching Cohorts accepted attributes.

    Field
    Type
    Description

    hashtag
    Disease

    Field
    Type
    Description

    hashtag
    Drug Exposure

    Field
    Type
    Description

    hashtag
    Measurement

    Field
    Type
    Description

    hashtag
    Procedure

    Field
    Type
    Description

    hashtag
    Annotated Variants

    This table will be available for all projects with ingested molecular data.

    hashtag
    Annotated Somatic Mutations

    This table will only be available for data sets with ingested Somatic molecular data.

    hashtag
    Annotated Copy Number Variants

    This table will only be available for data sets with ingested CNV molecular data.

    hashtag
    Annotated Structural Variants

    This table will only be available for data sets with ingested SV molecular data. Note that ICA Cohorts stores copy number variants in a separate table.

    hashtag
    Raw RNAseq data tables for genes and transcripts

    These tables will only be available for data sets with ingested RNAseq molecular data.

    Table for gene quantification results:

    The corresponding transcript table uses TRANSCRIPT_ID instead of GENE_ID and GENE_HGNC.

    hashtag
    Differential expression tables for genes and transcripts

    These tables will only be available for data sets with ingested RNAseq molecular data.

    Table for differential gene expression results:

    The corresponding transcript table uses TRANSCRIPT_ID instead of GENE_ID and GENE_HGNC.

    SEX | STRING | Sex field to drive annotation
    POPULATION | STRING | Population Designation for 1000 Genomes Project
    SUPERPOPULATION | STRING | Superpopulation Designation from 1000 Genomes Project
    RACE | STRING | Race according to NIH standard
    CONDITION_ONTOLOGIES | VARIANT | Diagnosis Ontology Source
    CONDITION_IDS | VARIANT | Diagnosis Concept Ids
    CONDITIONS | VARIANT | Diagnosis Names
    HARMONIZED_CONDITIONS | VARIANT | Diagnosis High-level concept to drive UI
    LIBRARYTYPE | STRING | Sequencing technology
    ANALYTE | STRING | Substance sequenced
    TISSUE | STRING | Tissue source
    TUMOR_OR_NORMAL | STRING | Tumor designation for somatic
    GENOMEBUILD | STRING | Genome Build to drive annotations - hg38 only
    SAMPLE_BARCODE_VCF | STRING | Sample ID from VCF
    AFFECTED_STATUS | NUMERIC | Affected, Unaffected, or Unknown for Family Based Analysis
    FAMILY_RELATIONSHIP | STRING | Relationship designation for Family Based Analysis
    CREATEDATE | DATE | Date and time of record creation
    LASTUPDATEDATE | DATE | Date and time of last update of record
    STUDY | STRING | Study subject belongs to
    CREATEDATE | DATE | Date and time of record creation
    LASTUPDATEDATE | DATE | Date and time of record update

    NUMERIC

    Chromosome ID: 1..22, 23=X, 24=Y, 25=Mt

    DBSNP | STRING | dbSNP Identifiers
    VARIANT_KEY | STRING | Variant ID in the form "1:12345678:12345678:C"
    NIRVANA_VID | STRING | Broad Institute VID: "1-12345678-A-C"
    VARIANT_TYPE | STRING | Description of Variant Type (e.g. SNV, Deletion, Insertion)
    VARIANT_CALL | NUMERIC | 1=germline, 2=somatic
    DENOVO | BOOLEAN | true / false
    GENOTYPE | STRING | "G|T"
    READ_DEPTH | NUMERIC | Sequencing read depth
    ALLELE_COUNT | NUMERIC | Counts of each alternate allele for each site across all samples
    ALLELE_DEPTH | STRING | Unfiltered count of reads that support a given allele for an individual sample
    FILTERS | STRING | Filter field from VCF. If all filters pass, field is PASS
    ZYGOSITY | NUMERIC | 0 = hom ref, 1 = het ref/alt, 2 = hom alt, 4 = hemi alt
    GENEMODEL | NUMERIC | 1=Ensembl, 2=RefSeq
    GENE_HGNC | STRING | HUGO/HGNC gene symbol
    GENE_ID | STRING | Ensembl gene ID ("ENSG00001234")
    GID | NUMERIC | NCBI Entrez Gene ID (RefSeq) or numerical part of Ensembl ENSG ID
    TRANSCRIPT_ID | STRING | Ensembl ENST or RefSeq NM_
    CANONICAL | STRING | Transcript designated 'canonical' by source
    CONSEQUENCE | STRING | missense, stop gained, intronic, etc.
    HGVSC | STRING | The HGVS coding sequence name
    HGVSP | STRING | The HGVS protein sequence name
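
    Several of the fields above store numeric codes rather than labels. The following Python sketch shows one way to decode them after exporting query results from Base. The code-to-label mappings are copied from the field descriptions; the function names are hypothetical, and reading VARIANT_KEY as chromosome:start:stop:allele is an assumption based only on the example value "1:12345678:12345678:C".

```python
# Code tables copied from the field descriptions above.
ZYGOSITY = {0: "hom ref", 1: "het ref/alt", 2: "hom alt", 4: "hemi alt"}
VARIANT_CALL = {1: "germline", 2: "somatic"}
GENEMODEL = {1: "Ensembl", 2: "RefSeq"}


def chromosome_label(chromosome_id):
    """Map a numeric chromosome ID (1..22, 23=X, 24=Y, 25=Mt) to its label."""
    special = {23: "X", 24: "Y", 25: "Mt"}
    if 1 <= chromosome_id <= 22:
        return str(chromosome_id)
    if chromosome_id in special:
        return special[chromosome_id]
    raise ValueError(f"unexpected chromosome ID: {chromosome_id}")


def parse_variant_key(variant_key):
    """Split a VARIANT_KEY into parts (assumed order: chromosome, start, stop, allele)."""
    chromosome, start, stop, allele = variant_key.split(":")
    return {
        "chromosome": chromosome,
        "start": int(start),
        "stop": int(stop),
        "allele": allele,
    }


# Example using the value shown in the field description.
parsed = parse_variant_key("1:12345678:12345678:C")
label = chromosome_label(23)
```

    Keeping the code tables in plain dictionaries makes it easy to add labels as new columns when post-processing query results in ICA Bench.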

    STRING

    Chromosome without 'chr' prefix

    DBSNP | NUMERIC | dbSNP Identifiers
    VARIANT_KEY | STRING | Variant ID in the form "1:12345678:12345678:C"
    MUTATION_TYPE | NUMERIC | Rank of consequences by expected impact: 0 = Protein Truncating to 40 = Intergenic Variant
    VARIANT_CALL | NUMERIC | 1=germline, 2=somatic
    GENOTYPE | STRING | "G|T"
    REF_ALLELE | STRING | Reference allele
    ALLELE1 | STRING | First allele call in the tumor sample
    ALLELE2 | STRING | Second allele call in the tumor sample
    GENEMODEL | NUMERIC | 1=Ensembl, 2=RefSeq
    GENE_HGNC | STRING | HUGO/HGNC gene symbol
    GENE_ID | STRING | Ensembl gene ID ("ENSG00001234")
    TRANSCRIPT_ID | STRING | Ensembl ENST or RefSeq NM_
    CANONICAL | BOOLEAN | Transcript designated 'canonical' by source
    CONSEQUENCE | STRING | missense, stop gained, intronic, etc.
    HGVSP | STRING | HGVS nomenclature for AA change: p.Pro72Ala

    NUMERIC

    Numerical representation of the chromosome, X=23, Y=24, Mt=25

    GENE_ID

    STRING

    NCBI or Ensembl gene identifier

    GID

    NUMERIC

    Numerical part of the gene ID; for Ensembl, we remove the 'ENSG000..' prefix

    START_POS

    NUMERIC

    First affected position on the chromosome

    STOP_POS

    NUMERIC

    Last affected position on the chromosome

    VARIANT_TYPE

    NUMERIC

    1 = copy number gain, -1 = copy number loss

    COPY_NUMBER

    NUMERIC

    Observed copy number

    COPY_NUMBER_CHANGE

    NUMERIC

    Fold-change of copy number, assuming 2 for diploid and 1 for haploid as the baseline

    SEGMENT_VALUE

    NUMERIC

    Average fold-change (FC) for the identified chromosomal segment

    PROBE_COUNT

    NUMERIC

    Probes confirming the CNV (arrays only)

    REFERENCE

    NUMERIC

    Baseline taken from normal samples (1) or averaged disease tissue (2)

    GENE_HGNC

    STRING

    HUGO/HGNC gene symbol
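Several of the numeric columns above are derived from text identifiers. The Python sketch below illustrates three of those conventions: the numeric chromosome encoding (X=23, Y=24, Mt=25), the numerical part of an Ensembl gene ID, and a copy-number fold-change relative to a baseline of 2 (diploid) or 1 (haploid). The function names are hypothetical, and computing the fold-change as a simple ratio of observed copy number to baseline is an assumption about how the column is populated.

```python
import re

# Special chromosome names and their numeric codes, per the table above.
_SPECIAL_CHROMOSOMES = {"X": 23, "Y": 24, "MT": 25}


def chromosome_to_number(chrom: str) -> int:
    """Map '1'..'22', 'X', 'Y', 'Mt' (with or without 'chr') to an integer."""
    name = chrom.strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    upper = name.upper()
    if upper in _SPECIAL_CHROMOSOMES:
        return _SPECIAL_CHROMOSOMES[upper]
    return int(upper)


def ensembl_gid(gene_id: str) -> int:
    """Return the numerical part of an Ensembl gene ID, e.g. ENSG00001234 -> 1234."""
    match = re.fullmatch(r"ENSG(\d+)", gene_id)
    if match is None:
        raise ValueError(f"not an Ensembl gene ID: {gene_id!r}")
    return int(match.group(1))  # int() drops the leading zeros


def copy_number_fold_change(copy_number: float, haploid: bool = False) -> float:
    """Ratio of observed copy number to the baseline (assumed definition)."""
    baseline = 1 if haploid else 2
    return copy_number / baseline
```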

    NUMERIC

    Numerical representation of the chromosome, X=23, Y=24, Mt=25

    BEGIN

    NUMERIC

    First affected position on the chromosome

    END

    NUMERIC

    Last affected position on the chromosome

    BAND

    STRING

    Chromosomal band

    QUALIITY

    NUMERIC

    Quality from the original VCF

    FILTERS

    ARRAY

    Filters from the original VCF

    VARIANT_TYPE

    STRING

    Insertion, deletion, indel, tandem_duplication, translocation_breakend, inversion ("INV"), short tandem repeat ("STR2")

    VARIANT_TYPE_ID

    NUMERIC

    21=insertion, 22=deletion, 23=indel, 24=tandem_duplication, 25=translocation_breakend, 26=inversion ("INV"), 27=short tandem repeat ("STR2")

    CIPOS

    ARRAY

    Confidence interval around first position

    CIEND

    ARRAY

    Confidence interval around last position

    SVLENGTH

    NUMERIC

    Overall size of the structural variant

    BONDCHR

    STRING

    For translocations, the other affected chromosome

    BONDCID

    NUMERIC

    For translocations, the other affected chromosome as a numeric value, X=23, Y=24, Mt=25

    BONDPOS

    STRING

    For translocations, positions on the other affected chromosome

    BONDORDER

    NUMERIC

    3 or 5: whether this fragment (the current variant/VID) "receives" the other chromosome's fragment on its 3' end, or attaches to the 5' end of the other chromosome's fragment

    GENOTYPE

    STRING

    Called genotype from the VCF

    GENOTYPE_QUALITY

    NUMERIC

    Genotype call quality

    READCOUNTSSPLIT

    ARRAY

    Read counts

    READCOUNTSPAIRED

    ARRAY

    Read counts, paired end

    REGULATORYREGIONID

    STRING

    Ensembl ID for the affected regulatory region

    REGULATORYREGIONTYPE

    STRING

    Type of the regulatory region

    CONSEQUENCE

    ARRAY

    Variant consequence according to SequenceOntology

    TRANSCRIPTID

    STRING

    Ensembl or RefSeq transcript identifier

    TRANSCRIPTBIOTYPE

    STRING

    Biotype of the transcript

    INTRONS

    STRING

    Count of impacted introns out of the total number of introns, specified as "M/N"

    GENEID

    STRING

    Ensembl or RefSeq gene identifier

    GENEHGNC

    STRING

    HUGO/HGNC gene symbol

    ISCANONICAL

    BOOLEAN

    Is the transcript ID the canonical one according to Ensembl?

    PROTEINID

    STRING

    RefSeq or Ensembl protein ID

    SOURCEID

    NUMERICAL

    Gene model: 1=Ensembl, 2=RefSeq
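Two of the structural variant columns above use compact encodings: VARIANT_TYPE_ID is a numeric code, and INTRONS is an "M/N" string. The following Python sketch decodes both. The code values are copied from the VARIANT_TYPE_ID description; the names `SV_TYPE_BY_ID` and `parse_introns` are illustrative and not part of ICA.

```python
from typing import Tuple

# Code values copied from the VARIANT_TYPE_ID description above.
SV_TYPE_BY_ID = {
    21: "insertion",
    22: "deletion",
    23: "indel",
    24: "tandem_duplication",
    25: "translocation_breakend",
    26: "inversion",
    27: "short tandem repeat",
}


def parse_introns(value: str) -> Tuple[int, int]:
    """Split an INTRONS value such as "2/10" into (impacted, total)."""
    impacted, total = value.split("/")
    return int(impacted), int(total)
```

For example, `parse_introns("2/10")` yields the pair `(2, 10)`, meaning 2 of the transcript's 10 introns are impacted.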

    STRING

    Ensembl or RefSeq gene identifier

    GID

    NUMERIC

    Numerical part of the gene ID; for Ensembl, we remove the 'ENSG000..' prefix

    GENE_HGNC

    STRING

    HUGO/HGNC gene symbol

    SOURCE

    STRING

    Gene model: 1=Ensembl, 2=RefSeq

    TPM

    NUMERICAL

    Transcripts per million

    LENGTH

    NUMERICAL

    The length of the gene in base pairs.

    EFFECTIVE_LENGTH

    NUMERICAL

    The length as accessible to RNA-seq, accounting for insert-size and edge effects.

    NUM_READS

    NUMERICAL

    The estimated number of reads from the gene. The values are not normalized.
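The TPM, EFFECTIVE_LENGTH, and NUM_READS columns above are related through the usual transcripts-per-million normalization: each gene's read count is divided by its effective length, and the resulting rates are scaled to sum to one million. The Python sketch below shows that standard definition. Whether the TPM column in Cohorts is derived exactly this way is an assumption, and `tpm_from_counts` is a hypothetical helper name.

```python
def tpm_from_counts(genes):
    """Compute TPM values for one sample.

    `genes` is a list of dicts with NUM_READS and EFFECTIVE_LENGTH keys
    (EFFECTIVE_LENGTH is assumed to be positive). Returns TPM values in
    the same order as the input.
    """
    # Reads per base of effective length for each gene.
    rates = [g["NUM_READS"] / g["EFFECTIVE_LENGTH"] for g in genes]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in genes]
    # Scale so that the values for the sample sum to one million.
    return [rate / total * 1_000_000 for rate in rates]
```

Because the values are scaled per sample, TPM is comparable across genes within a sample, whereas NUM_READS is not normalized.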

    STRING

    Ensembl or RefSeq gene identifier

    GID

    NUMERIC

    Numerical part of the gene ID; for Ensembl, we remove the 'ENSG000..' prefix

    GENE_HGNC

    STRING

    HUGO/HGNC gene symbol

    SOURCE

    STRING

    Gene model: 1=Ensembl, 2=RefSeq

    BASEMEAN

    NUMERICAL

    FC

    NUMERICAL

    Fold-change

    LFC

    NUMERICAL

    Log of the fold-change

    LFCSE

    NUMERICAL

    Standard error for log fold-change

    PVALUE

    NUMERICAL

    P-value

    CONTROL_SAMPLECOUNT

    NUMERICAL

    Number of samples used as control

    CONTROL_LABEL

    NUMERICAL

    Label used for controls
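The FC and LFC columns above carry the same information on different scales. The table does not state the base of the logarithm; the Python sketch below assumes base 2, which is a common convention for differential expression results, and should be adjusted if the data use a different base. The function names are hypothetical.

```python
import math


def log2_fold_change(fc: float) -> float:
    """Convert a fold-change to log2 scale (base 2 is an assumption)."""
    if fc <= 0:
        raise ValueError("fold-change must be positive")
    return math.log2(fc)


def fold_change_from_lfc(lfc: float) -> float:
    """Inverse conversion: recover the fold-change from a log2 value."""
    return 2.0 ** lfc
```

On the log scale, up- and down-regulation are symmetric: a fold-change of 4 maps to 2, and a fold-change of 0.25 maps to -2.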

    SAMPLE_BARCODE

    STRING

    Sample Identifier

    SUBJECTID

    STRING

    Identifier for Subject entity

    STUDY

    STRING

    Study designation

    AGE

    NUMERIC

    Age in years

    SAMPLE_BARCODE

    STRING

    Original sample barcode used in VCF column

    SUBJECTID

    STRING

    Original identifier for the subject record

    DATATYPE

    ARRAY

    The categorization of molecular data

    TECHNOLOGY

    ARRAY

    The sequencing method

    SAMPLE_BARCODE

    STRING

    Original sample barcode used in VCF column

    SUBJECTID

    STRING

    Original identifier for the subject record

    ATTRIBUTE_NAME

    STRING

    Cohorts meta-data driven field name

    ATTRIBUTE_VALUE

    VARIANT

    List of values entered for the field

    NAME

    STRING

    Study name

    CREATEDATE

    DATE

    Date and time of study creation

    LASTUPDATEDATE

    DATE

    Date and time of record update

    SUBJECTID

    STRING

    Original identifier for the subject record

    AGE

    FLOAT

    Age entered on subject record if applicable

    SEX

    STRING

    -

    ETHNICITY

    STRING

    -

    SUBJECTID

    STRING

    Original identifier for the subject record

    ATTRIBUTE_NAME

    STRING

    Cohorts meta-data driven field name

    ATTRIBUTE_VALUE

    VARIANT

    List of values entered for the field

    SUBJECTID

    STRING

    Original identifier for the subject record

    TERM

    STRING

    Code for disease term

    OCCURRENCES

    STRING

    List of occurrence-related data

    SUBJECTID

    STRING

    Original identifier for the subject record

    TERM

    STRING

    Code for drug term

    OCCURRENCES

    STRING

    List of occurrence-related data for drug exposure

    SUBJECTID

    STRING

    Original identifier for the subject record

    TERM

    STRING

    Code for measurement term

    OCCURRENCES

    STRING

    List of occurrences and values related to lab or measurement data

    SUBJECTID

    STRING

    Original identifier for the subject record

    TERM

    STRING

    Code for procedure term

    OCCURRENCES

    STRING

    List of occurrences and values related to procedure data

    Field Name

    Type

    Description

    SAMPLE_BARCODE

    STRING

    Original sample barcode used in VCF column

    STUDY

    STRING

    Study designation

    GENOMEBUILD

    STRING

    Only hg38 is supported

    CHROMOSOME

    STRING

    Chromosome without 'chr' prefix

    Field Name

    Type

    Description

    SAMPLE_BARCODE

    STRING

    Original sample barcode, used in VCF column

    SUBJECTID

    STRING

    Identifier for Subject entity

    STUDY

    STRING

    Study designation

    GENOMEBUILD

    STRING

    Only hg38 is supported

    Field Name

    Type

    Description

    SAMPLE_BARCODE

    STRING

    Sample barcode used in the original VCF

    GENOMEBUILD

    STRING

    Genome build, always 'hg38'

    NIRVANA_VID

    STRING

    Variant ID of the form 'chr-pos-ref-alt'

    CHRID

    STRING

    Chromosome without 'chr' prefix

    Field Name

    Type

    Description

    SAMPLE_BARCODE

    STRING

    Sample barcode used in the original VCF

    GENOMEBUILD

    STRING

    Genome build, always 'hg38'

    NIRVANA_VID

    STRING

    Variant ID of the form 'chr-pos-ref-alt'

    CHRID

    STRING

    Chromosome without 'chr' prefix
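The NIRVANA_VID column above is described as having the form 'chr-pos-ref-alt'. The Python sketch below splits an identifier of that shape into its parts. It covers only the four-part form given in the table; the `parse_nirvana_vid` name and the example value "1-12345-A-G" are made up for illustration, and identifiers with a different layout are rejected instead of being guessed at.

```python
def parse_nirvana_vid(vid: str) -> dict:
    """Split a 'chr-pos-ref-alt' identifier into named parts."""
    parts = vid.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected 'chr-pos-ref-alt', got {vid!r}")
    chrom, pos, ref, alt = parts
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}
```

For the hypothetical value "1-12345-A-G", the result has chromosome "1", position 12345, reference allele "A", and alternate allele "G".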

    Field Name

    Type

    Description

    GENOMEBUILD

    STRING

    Genome build, always 'hg38'

    STUDY_NAME

    STRING

    Study designation

    SAMPLE_BARCODE

    STRING

    Sample barcode used in the original VCF

    LABEL

    STRING

    Group label specified during import: Case or Control, Tumor or Normal, etc.

    Field Name

    Type

    Description

    GENOMEBUILD

    STRING

    Genome build, always 'hg38'

    STUDY_NAME

    STRING

    Study designation

    SAMPLE_BARCODE

    STRING

    Sample barcode used in the original VCF

    CASE_LABEL

    STRING

    Study designation

    CHROMOSOMEID

    CHROMOSOME

    CID

    CID

    GENE_ID

    GENE_ID