From the Cohorts menu in the left hand navigation, select a cohort created in Create Cohort
to begin a cohort analysis.
The query details can be accessed by clicking the triangle next to Show Query Details. The query details display the selections used to create a cohort. The selections can be edited by clicking the pencil icon in the top right.
Charts will be open by default. If not, click Show Charts.
Use the gear icon in the top-right to change viewable chart settings.
There are four charts available to view summary counts of attributes within a cohort as histogram plots.
Click Hide Charts
to hide the histograms.
Display time-stamped events and observations for a single subject on a timeline. The timeline view is available only for subjects that have time-series data.
The following attributes are displayed in the timeline view:
• Diagnosed and self-reported diseases:
  • Start and end dates
  • Progression vs. remission
• Medication and other treatments:
  • Prescribed and self-medicated
  • Start date, end date, and dosage at every time point
The timeline utilizes age (at diagnosis, at event, at measurement) as the x-axis and attribute name as the y-axis. If the birthdate is not recorded for a subject, the user can now switch to Date to visualize data.
In the default view, the timeline plots the first five diseases and the first five drugs/medications. Users can choose different attributes or change the order of existing attributes by clicking the “select attribute” button.
The x-axis shows the person’s age in years, with data points initially displayed between ages 0 to 100. Users can zoom in by selecting the desired range to visualize data points within the selected age range.
Each event is represented by a dot in the corresponding track. Events in the same track can be connected by lines to indicate the start and end period of an event.
By default, the Subjects tab is displayed.
The Subjects
tab with a list of all subjects matching your criteria is displayed below Charts
with a link to each Subject by ID and other high-level information. By clicking a subject ID, you will be brought to the data collected at the Subject level.
Search for a specific subject by typing the Subject ID into the Search Subjects
text box.
Get all details available on a subject by clicking the hyperlinked Subject ID in the Subject list.
To Exclude specific subjects from subsequent analysis, such as marker frequencies or gene-level aggregated views, you can uncheck the box at the beginning of each row in the subject list. You will then be prompted to save any exclusion(s).
You can Export the list of subjects either to your ICA Project's data folder or to your local disk as a TSV file for subsequent use. Any export will omit subjects that you excluded after you saved those changes. For more information, see at the bottom of this page.
Specific subjects can be removed from a Cohort.
Select the Subjects
tab.
By default, all subjects in the Cohort are checked.
To remove specific subjects from a Cohort, uncheck the checkbox next to each subject you want to remove.
Check box selections are maintained while browsing through the pages of the subject list.
Click Save Cohort
to save the subjects you would like to exclude.
The specific subjects will no longer be counted in all analysis visualizations.
The specific excluded subjects will be saved for the Cohort.
To add the subjects back to the Cohort, re-check their checkboxes and click Save Cohort.
For each individual cohort, display a table of all observed SVs that overlap with a given gene.
Click the Marker Frequency
tab, then click the Gene Expression
tab.
Down-regulated genes are displayed in blue and Up-regulated genes are displayed in red.
A frequency in the Cohort is displayed and the Matching number/Total is also displayed in the chart.
Genes can be searched by using the Search Genes
text box.
You are brought to the Gene
tab under the Gene Summary
sub-tab.
Select a Gene by typing the gene name into the Search Genes
text box.
A Gene Summary
will be displayed that lists information and links to public resources about the selected gene.
A cytogenetic map will be displayed based on the selected gene; a vertical orange bar represents the gene's location on the chromosome.
Click the Variants
tab and Show legend and filters
if it does not open by default.
Below the interactive legend, you see a set of analysis tracks: Needle Plot, Primate AI, Pathogenic variants, and Exons.
The Needle Plot allows toggling the plot by gnomAD frequency
and Sample Count
. Select Sample Count
in the Plot by
legend above the plot. You can also filter the plot to only show variants above/below a certain cut-off for gnomAD frequency (in percent) or absolute sample count.
Click on a variant's needle pin to view details about the variant from public resources and counts of variants in the selected cohort by disease category. If you want to view all subjects that carry the given variant, click on the sample count link, which will take you to the list of subjects (see above).
Use the Exon zoom bar from each end of the Amino Acid sequence to zoom in on the gene domain to better separate observations.
The Pathogenic Variant track shows pathogenic variants as purple triangles. Hover over a triangle to see pop-up details with pathogenicity calls, phenotypes, submitter, and a link to the ClinVar entry.
The Phenotypes tab
shows a stacked horizontal bar chart which displays molecular breakdown (disease type vs Gene) and subject count for the selected gene.
The Gene Expression
tab shows known gene expression data from tissue types in GTEx.
The Genetic Burden Test is available for de novo variants only.
For every correlation, subjects contained in each count can be viewed by selecting the count on the bubble or the count on the X-axis and Y-axis.
Click the Correlation
Tab.
In X-axis category
, select Clinical
.
In X-axis Attribute
, select a clinical attribute.
In Y-axis category
, select Clinical
.
In Y-Axis Attribute
, select another clinical attribute.
You will be shown a bubble plot comparing the first clinical attribute on the x-axis to the second attribute on the y-axis.
The size of each bubble corresponds to the number of subjects falling into that pair of categories.
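The bubble sizes described above amount to a cross-tabulation of two categorical attributes. The following is a minimal sketch of that counting, not the implementation used by ICA Cohorts; the subject records and attribute names (`sex`, `vital_status`) are hypothetical, and skipping subjects with a missing value is an assumption.

```python
from collections import Counter


def bubble_counts(subjects, x_attr, y_attr):
    """Count subjects per (x, y) category pair; each count is one bubble's size."""
    counts = Counter()
    for subject in subjects:
        x_val = subject.get(x_attr)
        y_val = subject.get(y_attr)
        if x_val is None or y_val is None:
            continue  # assumption: subjects missing either attribute are skipped
        counts[(x_val, y_val)] += 1
    return counts


# Hypothetical subject records, for illustration only
subjects = [
    {"sex": "female", "vital_status": "alive"},
    {"sex": "female", "vital_status": "alive"},
    {"sex": "male", "vital_status": "deceased"},
    {"sex": "male", "vital_status": None},
]
sizes = bubble_counts(subjects, "sex", "vital_status")
for (x_val, y_val), n in sorted(sizes.items()):
    print(x_val, y_val, n)
```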
To see a breakdown of Somatic Mutations vs. RNA Expression levels perform the following steps:
Note this comparison is for a Cancer case.
Click the Correlation
Tab.
In X-axis category
, select Somatic
.
In X-axis Attribute
, select a gene.
In Y-axis category
, select RNA expression
.
In Y-Axis Attribute, type a gene and leave Reference Type set to NORMAL.
Click Continuous
to see violin plots of compared variables.
Note this comparison is for a Cancer case.
Click the Correlation
Tab.
In X-axis category
, select Somatic
.
In X-axis Attribute
, type a gene name.
In Y-axis category
, select Clinical
.
In Y-Axis Attribute
, select a clinical attribute.
Click the Molecular Breakdown
Tab.
In Enter a clinical Attribute, select a clinical attribute.
In Enter a gene
, select a gene by typing a gene name.
You are shown a stacked bar-chart by the clinical attribute selected values on the Y-axis.
For each attribute value the bar represents the % of Subjects with RNA Expression
, Somatic Mutation
, and Multiple Alterations
.
Note: for each of the aforementioned bubble plots, you can view the list of subjects by following the link under each subject count associated with an individual bubble or axis label. This will take you to the list of subjects view, see above.
If there is Copy Number Variant data in the cohort:
Click the CNV
tab.
A graph will show the CNV sample percentage on the Y-axis and chromosomes on the X-axis.
Any value above Zero is a copy number gain, and any value below Zero is a copy number loss.
Click Chromosome:
to select a specific chromosome position.
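The sign convention of the CNV graph can be expressed as a small helper. This is only an illustrative sketch; the function name is hypothetical, and treating a value of exactly zero as "neutral" is an assumption, since the text above only describes values above and below zero.

```python
def classify_cnv(value):
    """Map a CNV plot value to a label using the sign convention described above."""
    if value > 0:
        return "gain"
    if value < 0:
        return "loss"
    return "neutral"  # assumption: zero means no copy number change


print([classify_cnv(v) for v in (12.5, -3.0, 0)])
```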
ICA allows for integrated analysis in a computation workspace. You can export your cohort definitions and, in combination with molecular data in your ICA Project Data, perform, for example, a GWAS analysis.
Confirm the VCF data for your analysis is in ICA Project Data.
From within your ICA Project, Start a Bench Workspace -- See Bench for more details.
Navigate back to ICA Cohorts.
Create a Cohort of subjects of interest using Create a Cohort.
From the Subjects tab, click Export subjects... at the top right of the subject list. The file can be downloaded through the browser or saved to ICA Project Data.
We suggest using export ...to Data Folder
for immediate access to this data in Bench or other areas of ICA.
Create another cohort if needed for your research and repeat the previous three steps.
Navigate to the Bench workspace created in the second step.
After the workspace has started up, click Access
.
Find the /Project/
folder in the Workspace file navigation.
This folder will contain your cohort files created along with any pipeline output data needed for your workspace analysis.
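Once two cohort exports are available in the workspace, they can be combined into a case/control table for a downstream analysis such as GWAS. The sketch below assumes the exported TSV has a header row with a subject identifier column; the column name `subject_id` is hypothetical and should be replaced with the header found in your export. In-memory text stands in for the files so that the example is self-contained.

```python
import csv
import io


def read_subject_ids(handle, id_column="subject_id"):
    """Read subject IDs from a tab-separated cohort export (column name is assumed)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row[id_column] for row in reader]


def case_control_table(case_ids, control_ids):
    """Label cases 1 and controls 0; a subject found in both lists stays a case."""
    case_set = set(case_ids)
    rows = [(sid, 1) for sid in case_ids]
    rows += [(sid, 0) for sid in control_ids if sid not in case_set]
    return rows


# Hypothetical export contents; in Bench you would open the exported files instead
cases_tsv = io.StringIO("subject_id\tsex\nS1\tfemale\nS2\tmale\n")
controls_tsv = io.StringIO("subject_id\tsex\nS2\tmale\nS3\tfemale\n")

table = case_control_table(read_subject_ids(cases_tsv), read_subject_ids(controls_tsv))
print(table)
```

Keeping overlapping subjects as cases is one possible design choice; dropping them from both groups is an equally reasonable alternative.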
The GWAS
and PheWAS
tabs in ICA Cohorts allow you to visualize precomputed analysis results for phenotypes/diseases and genes, respectively. Note that these do not reflect the subjects that are part of the cohort that you created.
ICA Cohorts currently hosts GWAS and PheWAS analysis results for approximately 150 quantitative phenotypes (such as "LDL direct" and "sitting height") and about 700 diseases.
Navigate to the GWAS
tab and start looking for phenotypes and diseases in the search box. Cohorts will suggest the best matches against any partial input ("cancer") you provide. After selecting a phenotype/disease, Cohorts will render a Manhattan plot, by default collapsed to gene level and organized by their respective position in each chromosome.
Circles in the Manhattan plot indicate binary traits, potential associations between genes and diseases. Triangles indicate quantitative phenotypes with regression Beta different from zero, and point up or down to depict positive or negative correlation, respectively.
Hovering over a circle/triangle will display the following information:
gene symbol
variant group (see below)
P-value, both raw and FDR-corrected
number of carriers of variants of the given type
number of carriers of variants of any type
regression Beta
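For readers unfamiliar with the relationship between raw and FDR-corrected P-values, the sketch below shows one common correction, the Benjamini-Hochberg procedure, together with the conventional -log10 transform used for Manhattan plot heights. Both are assumptions for illustration: this page does not state which FDR method or axis scaling ICA Cohorts uses.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so monotonicity can be enforced
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def manhattan_height(p_value):
    """Conventional Manhattan plot height: -log10 of the P-value."""
    return -math.log10(p_value)


raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```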
For gene-level results, Cohorts distinguishes five different classes of variants: protein truncating; deleterious; missense; missense with a high ILMN PrimateAI score (indicating likely damaging variants); and synonymous variants. You can limit results to any one of these five classes, or select All to display all results together.
Deleterious variants (del
): the union of all protein-truncating variants (PTVs, defined below), pathogenic missense variants with a PrimateAI score greater than a gene-specific threshold, and variants with a SpliceAI score greater than 0.2.
Protein-truncating variants (ptv
): variant consequences matching any of stop_gained
, stop_lost
, frameshift_variant
, splice_donor_variant
, splice_acceptor_variant
, start_lost
, transcript_ablation
, transcript_truncation
, exon_loss_variant
, gene_fusion
, or bidirectional_gene_fusion
.
missense_all
: all missense variants regardless of their pathogenicity.
missense, PrimateAI optimized (missense_pAI_optimized
): only pathogenic missense variants with primateAI score greater than a gene-specific threshold.
missenses and PTVs (missenses_and_ptvs_all
): the union of all PTVs, SpliceAI > 0.2 variants and all missense variants regardless of their pathogenicity scores.
all synonymous variants (syn
).
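The group definitions above can be read as set membership rules. The following sketch assigns a single variant to the groups it would fall into; it is an interpretation of the definitions, not the code used by ICA Cohorts. The consequence terms `missense_variant` and `synonymous_variant` and the function and parameter names are assumptions, and the gene-specific PrimateAI threshold must be supplied by the caller.

```python
PTV_CONSEQUENCES = {
    "stop_gained", "stop_lost", "frameshift_variant", "splice_donor_variant",
    "splice_acceptor_variant", "start_lost", "transcript_ablation",
    "transcript_truncation", "exon_loss_variant", "gene_fusion",
    "bidirectional_gene_fusion",
}
SPLICEAI_THRESHOLD = 0.2


def variant_groups(consequence, primate_ai=None, splice_ai=None, gene_threshold=None):
    """Return the set of variant group names a variant belongs to."""
    is_ptv = consequence in PTV_CONSEQUENCES
    is_missense = consequence == "missense_variant"      # assumed term
    is_synonymous = consequence == "synonymous_variant"  # assumed term
    is_splice = splice_ai is not None and splice_ai > SPLICEAI_THRESHOLD
    is_pathogenic_missense = (
        is_missense
        and primate_ai is not None
        and gene_threshold is not None
        and primate_ai > gene_threshold
    )

    groups = set()
    if is_ptv:
        groups.add("ptv")
    if is_missense:
        groups.add("missense_all")
    if is_pathogenic_missense:
        groups.add("missense_pAI_optimized")
    if is_ptv or is_pathogenic_missense or is_splice:
        groups.add("del")
    if is_ptv or is_splice or is_missense:
        groups.add("missenses_and_ptvs_all")
    if is_synonymous:
        groups.add("syn")
    return groups


print(sorted(variant_groups("missense_variant", primate_ai=0.9, gene_threshold=0.8)))
```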
To zoom in to a particular chromosome, click the chromosome name underneath the plot, or select the chromosome from the drop down box, which defaults to Whole genome
.
To browse PheWAS analysis results by gene, navigate to the PheWAS
tab and enter a gene of interest into the search box. The resulting Manhattan plot will show phenotypes and diseases organized into a number of categories, such as "Diseases of the nervous system" and "Neoplasms". Click on the name of a category, shown underneath the plot, to display only those phenotypes/diseases, or select a category from the drop down, which defaults to All
.
A future release of ICA Cohorts will allow you to run your own customized GWAS analysis inside ICA Bench and then upload variant- or gene-level results for visualization in the ICA Cohorts graphical user interface.
ICA Cohorts lets you create a research cohort of subjects and associated samples based on the following criteria:
Project:
Include subjects that are part of any ICA Project that you own or that is shared with you.
Sample:
Sample type such as FFPE.
Tissue type.
Sequencing technology: Whole genome DNA-sequencing, RNAseq, single-cell RNAseq, etc.
Subject:
Demographics such as age, sex, ancestry.
Biometrics such as body height, body mass index.
Family and patient medical history.
Disease:
Phenotypes and diseases from standardized ontologies.
Drug:
Drugs from standardized ontologies along with specific typing, stop reasons, drug administration routes, and time points.
Molecular attributes:
Samples with a somatic mutation in one or multiple, specified genes.
Samples with a germline variant of a specific type in one or multiple, specified genes.
Samples over- or under-expressed in one or multiple, specified genes.
Samples with a copy number gain or loss involving one or multiple, specified genes.
In the 'Disease' tab, you can search for subjects diagnosed with one or multiple diseases, as well as phenotypes, in two ways:
Start typing the English name of a disease/phenotype and pick from the suggested matches. Continue typing if your disease/phenotype of interest is not listed initially.
Paste one or multiple diagnostic codes. ICA Cohorts currently uses six standard medical ontologies to 1) annotate each subject during ingestion and then to 2) search for subjects: HPO for phenotypes, MeSH, SNOMED-CT, ICD9-CM, ICD10-CM, and OMIM for diseases. By default, any 'type-ahead' search will find matches from all six; and you can limit the search to only the one(s) you prefer. When searching for subjects using names or codes from one of these ontologies, ICA Cohorts will automatically match your query against all of the other ontologies, therefore returning subjects that have been ingested using a corresponding entry from another ontology. Example: a subject was annotated with the disease "Barrett esophagus" (HPO) during ingestion; ICA Cohorts will include this subject when you create a cohort by searching for "Barrett's esophagus" (SNOMED-CT).
In the 'Drug' tab, you can search for subjects who have a specific medication record:
Start typing the concept name for the drug and pick from suggested matches. Continue typing if the drug is not listed initially.
Paste one or multiple drug concept codes. ICA Cohorts currently uses RxNorm as a standard ontology during ingestion. If multiple concepts are in your instance of ICA Cohorts, they will be listed under 'Concept Ontology.'
'Drug Type' is a static list of qualifiers that denote the specific administration of the drug. For example, where the drug was dispensed.
'Stop Reason' is a static list of attributes describing a reason why a drug was stopped if available in the data ingestion.
'Drug Route' is a static list of attributes that describe the physical route of administration of the drug. For example, Intravenous Route (IV).
As attributes are added to the 'Selected Condition' on the right-navigation panel, you can choose to include or exclude the criteria selected.
Select a criterion from 'Subject', 'Disease', and/or 'Molecular' attributes by filling in the appropriate checkbox on the respective attribute selection pages.
When selected, the attribute will appear in the right-navigation panel.
You can use the 'Include' / 'Exclude' dropdown next to the selected attribute to decide if you want to include or exclude subjects and samples matching the attribute.
Note: the semantics of 'Include' are such that a subject needs to match at least one of the 'included' attributes in any given category to be included in the cohort. (Category refers to disease, sex, body height, etc.) For example, if you specify multiple diseases as inclusion criteria, subjects only need to be diagnosed with one of them. Using 'Exclude', you can exclude any subject who matches at least one exclusion criterion; subjects do not have to match all exclusion criteria in the same category to be excluded from the cohort.
Note: This feature is not available on the 'Project' level selections as there is no overlap between subjects in datasets.
Note: Using exclusion criteria does not account for NULL values. For example, if the Super-population 'Europeans' is excluded, subjects will be in your cohort even if they do not contain this data point.
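The include/exclude semantics in the notes above can be sketched as a filter function. This is an illustration under stated assumptions rather than the actual query logic: the category and value names are hypothetical, and requiring a match in every category that has inclusion criteria (AND across categories) is an assumption not stated in the text.

```python
def as_set(value):
    """Normalize a single value, a collection, or None into a set."""
    if value is None:
        return set()
    if isinstance(value, (set, list, tuple)):
        return set(value)
    return {value}


def in_cohort(subject, include, exclude):
    """Apply include (any match within a category) and exclude (any match) criteria."""
    for category, wanted in include.items():
        # Within a category, matching any one included value is enough
        if not as_set(subject.get(category)) & set(wanted):
            return False
    for category, unwanted in exclude.items():
        # A missing (NULL) value never matches, so such subjects are retained
        if as_set(subject.get(category)) & set(unwanted):
            return False
    return True


include = {"disease": {"asthma", "diabetes"}}
exclude = {"super_population": {"Europeans"}}

print(in_cohort({"disease": ["diabetes"], "super_population": None}, include, exclude))
```

The last call illustrates the NULL behavior: a subject without a recorded super-population stays in the cohort even though 'Europeans' is excluded.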
Once you have selected Create Cohort, the above data are organized in tabs such as Project, Subject, Disease, and Molecular. Each tab then contains the aforementioned sections, among others, to help you identify cases and/or controls for further analysis. Navigate through these tabs, or search for an attribute by name to directly jump to that tab and section, and select attributes and values that are relevant to describe your subjects and samples of interest. Assign a new name to the cohort you created, and click Apply to save the cohort.
After creating a Cohort, select the Duplicate
icon.
A copy of the Cohort definition will be created and tagged with "_copy".
Deleting a Cohort Definition can be accomplished by clicking the Delete Cohort
icon.
This action cannot be undone.
After creating a Cohort, users can set a Cohort bookmark as Shared. A shared Cohort can be applied across the project by other users with access to the Project. By default, Cohorts created in a Project are accessible only to the user who created them. Other users in the project cannot see the cohort unless it is shared.
Create a Cohort using the directions above.
To make the Cohort available to other users in your Project, click the Share
icon.
The Share
icon will be filled in black and the Shared Status will be turned from Private
to Shared
.
Other users with access to Cohorts in the Project can now apply the Cohort bookmark to their data in the project.
To unshare the Cohort, click the Share
icon.
The icon will turn from black to white, and other users within the project will no longer have access to this cohort definition.
A Shared Cohort can be Archived.
Select a Shared Cohort with a black Shared Cohort
icon.
Click the Archive Cohort
icon.
You will be asked to confirm this selection.
Upon archiving the Cohort definition, the Cohort will no longer be seen by other users in the Project.
The archived Cohort definition can be unarchived by clicking the Unarchive Cohort
icon.
When the Cohort definition is unarchived, it will be visible to all users in the Project.
You can link cohorts data sets to a bundle as follows:
Create or edit a bundle at Bundles from the main navigation.
Navigate to Bundles > your_bundle > Cohorts > Data Sets.
Select Link Data Set to Bundle.
Select the data set which you want to link and +Select.
After a brief time, the cohorts data set will be linked to your bundle and ICA_BASE_100 will be logged.
If you cannot find the cohorts data set that you want to link, verify that:
Your data set is part of a project (Projects > your_project > Cohorts > Data Sets)
This project is set to Data Sharing (Projects > your_project > Project Settings > Details)
You can unlink cohorts data sets from bundles as follows:
Edit the desired bundle at Bundles from the main navigation.
Navigate to Bundles > your_bundle > Cohorts > Data Sets.
Select the cohorts data set which you want to unlink.
Select Unlink Data Set from Bundle.
After a brief time, the cohorts data set will be unlinked from your bundle and ICA_BASE_101 will be logged.
ICA Cohorts can pull any molecular data available in an ICA Project, as well as additional sample- and subject-level metadata information such as demographics, biometrics, sequencing technology, phenotypes, and diseases.
To import a new data set, select Import Jobs
from the left navigation tab underneath Cohorts
, and click the Import Files
button. The Import Files
button is also available under the Data Sets
left navigation item.
The Data Set menu item is used to view imported data sets and information. The Import Jobs menu item is used to check the status of data set imports.
Confirm that the project shown is the ICA Project that contains the molecular data you would like to add to ICA Cohorts.
Choose a data type among
Germline variants
Somatic mutations
RNAseq
GWAS
Choose a new study name by selecting the radio button: Create new study
and entering a Study Name
.
To add new data to an existing Study, select the radio button: Select from list of studies
and select an existing Study Name
from the dropdown.
To add data to existing records or add new records, select Job Type
, Append
.
Append
does not wipe out any data ingested previously and can be used to ingest the molecular data in an incremental manner.
To replace data, select Job Type
, Replace
. If you are ingesting data again, use the Replace job type.
Enter an optional Study description
.
Select the metadata model (default: Cohorts; alternatively, select OMOP version 5.4 if your data is formatted that way.)
Select the genome build your molecular data is aligned to (default: GRCh38/hg38)
For RNAseq, specify whether you want to run differential expression (see below) or only upload raw TPM.
Click Next
.
Navigate to VCFs located in the Project Data.
Select each single-sample VCF or multi-sample VCF to ingest. For GWAS, select CSV files produced by Regenie.
As an alternative to selecting individual files, you can also opt to select a folder instead. Toggle the radio button on Step 2 from "Select files" to "Select folder".
This option is currently only available for germline variant ingestion: any combination of small variants, structural variation, and/or copy number variants.
ICA Cohorts will scan the selected folder and all sub-folders for any VCF files or JSON files and try to match them against the Sample ID column in the metadata TSV file (Step 3).
Files not matching sample IDs will be ignored; allowed file extensions for VCF files after the sample ID are: *.vcf.gz, *.hard-filtered.vcf.gz, *.cnv.vcf.gz, and *.sv.vcf.gz.
Files not matching sample IDs will be ignored; allowed file extensions for JSON files after the sample ID are: *.json, *.json.gz, *.json.bgz, and *.json.gzip.
Click Next
.
Navigate to the metadata (phenotype) data tsv in the project Data.
Select the TSV file or files for ingestion.
Click Finish
.
All VCF types, specifically from DRAGEN, can be ingested using the Germline variants selection. Cohorts will distinguish the variant types that it is ingesting. If Cohorts cannot determine the variant file type, it will default to ingest small variants.
Alternatively to VCFs, you can select Nirvana JSON files for DNA variants: small variants, structural variants, and copy number variation.
The maximum number of files in a single manual ingestion batch is 1000.
Alternatively, users can choose a single folder and ICA Cohorts will identify all ingestible files within that folder and its sub-folders. In this scenario, cohorts will select molecular data files matching the samples listed in the metadata sheet which is the next step in the import process.
Users have the option to ingest either VCF files or Nirvana JSON files for any given batch, regardless of the chosen ingestion method.
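Before starting a folder-based import, it can help to check locally which files would pair with the sample IDs in your metadata sheet. The sketch below is one reading of the matching rule (file name equals sample ID plus an allowed extension); the exact rule applied by ICA Cohorts may differ, and the paths and sample IDs are hypothetical.

```python
from pathlib import PurePosixPath

VCF_SUFFIXES = (".vcf.gz", ".hard-filtered.vcf.gz", ".cnv.vcf.gz", ".sv.vcf.gz")
JSON_SUFFIXES = (".json", ".json.gz", ".json.bgz", ".json.gzip")


def match_files(paths, sample_ids, suffixes):
    """Group file paths by sample ID when the file name is '<sample ID><suffix>'."""
    expected = {sid + suffix: sid for sid in sample_ids for suffix in suffixes}
    matched = {}
    for path in paths:
        name = PurePosixPath(path).name  # sub-folders are ignored for matching
        sid = expected.get(name)
        if sid is not None:
            matched.setdefault(sid, []).append(path)
    return matched


# Hypothetical folder listing and metadata sample IDs
paths = [
    "run1/S1.hard-filtered.vcf.gz",
    "run1/S1.cnv.vcf.gz",
    "run2/S2.vcf.gz",
    "run2/S9.vcf.gz",   # no such sample in the metadata: ignored
    "run2/S2.txt",      # extension not allowed: ignored
]
matched = match_files(paths, ["S1", "S2"], VCF_SUFFIXES)
print(matched)
```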
The sample identifiers used in the VCF columns need to match the sample identifiers used in subject/sample metadata files; accordingly, if you are starting from JSON files containing variant- and gene-level annotations provided by ILMN Nirvana, the samples listed in the header need to match the metadata files.
ICA Cohorts supports VCF files formatted according to VCF v4.2 and v4.3 specifications. VCF files require at least one of the following header rows to identify the genome build:
##reference=file://... --- needs to contain a reference to hg38/GRCh38 in the file path or name (numerical value is sufficient)
##contig=<ID=chr1,length=248956422> --- for hg38/GRCh38
##DRAGENCommandLine= ... --ht-reference
ICA Cohorts accepts VCFs aligned to hg38/GRCh38 and hg19/GRCh37. If your data uses hg19/GRCh37 coordinates, Cohorts will convert these to hg38/GRCh38 during the ingestion process [see Reference 1]. Harmonizing data to one genome build facilitates searches across different private, shared, and public projects when building and analyzing a cohort. If your data contains a mixture of samples mapped to hg38 and hg19, please ingest these in separate batches, as each import job into Cohorts is limited to one genome build.
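To check ahead of ingestion which genome build a VCF header points to, a small pre-flight script can inspect the header rows listed above. This is a hedged sketch, not the detection logic of ICA Cohorts: it handles only the `##reference` and `##contig` rows, leaves the `##DRAGENCommandLine` row out because its full format is not shown here, and the hg19/GRCh37 checks (including the chromosome 1 length 249250621) are assumptions added for illustration.

```python
import re

# Chromosome 1 lengths: the GRCh38 value is from the text, GRCh37 is an added assumption
CHR1_LENGTHS = {"248956422": "GRCh38", "249250621": "GRCh37"}


def build_from_text(text):
    """Guess the build from a reference path or name."""
    # A bare number 38 is accepted for GRCh38, per the note on numerical values
    if re.search(r"hg38|grch38|(?<!\d)38(?!\d)", text, re.IGNORECASE):
        return "GRCh38"
    if re.search(r"hg19|grch37", text, re.IGNORECASE):
        return "GRCh37"
    return None


def detect_genome_build(header_lines):
    """Return 'GRCh38', 'GRCh37', or None based on VCF header rows."""
    for line in header_lines:
        if line.startswith("##contig="):
            match = re.search(r"ID=(?:chr)?1,length=(\d+)", line)
            if match and match.group(1) in CHR1_LENGTHS:
                return CHR1_LENGTHS[match.group(1)]
        elif line.startswith("##reference="):
            build = build_from_text(line[len("##reference="):])
            if build:
                return build
    return None


print(detect_genome_build(["##fileformat=VCFv4.2", "##contig=<ID=chr1,length=248956422>"]))
```

Running such a check per file also makes it easier to split a mixed set of samples into separate hg38 and hg19 batches.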
As an alternative to VCFs, ICA Cohorts accepts the JSON output of Illumina Nirvana for hg38/GRCh38-aligned data for small germline variants and somatic mutations, copy number variations, and other structural variants.
ICA Cohorts can process gene- and transcript-level quantification files produced by the Illumina DRAGEN RNA pipeline. The file naming convention needs to match .quant.genes.sf for genes; and .quant.sf for transcript-level TPM (transcripts per million.)
Please also see the online documentation for the Illumina DRAGEN RNA Pipeline for more information on output file formats.
ICA Cohorts currently supports upload of SNV-level GWAS results produced by Regenie and saved as CSV files.
Note: If annotating large sets of samples with molecular data, expect the annotation process to take over 20 minutes per whole genome batch of samples. You will receive two e-mail notifications: one when your ingestion starts and one when it completes successfully or fails.
As an alternative to ICA Cohorts' metadata file format, you can provide files formatted according to the OMOP common data model 5.4. Cohorts currently ingests data for these OMOP 5.4 tables, formatted as tab-delimited files:
PERSON (mandatory),
CONCEPT (mandatory if any of the following is provided),
CONDITION_OCCURRENCE (optional),
DRUG_EXPOSURE (optional), and
PROCEDURE_OCCURRENCE (optional).
Additional files such as measurement and observation will be supported in a subsequent release of Cohorts.
Note that Cohorts requires that all such files do not deviate from the OMOP CDM 5.4 standard. Depending on your implementation, you may have to adjust file formatting to be OMOP CDM 5.4-compatible.
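The table requirements above lend themselves to a quick check before upload. The sketch below only verifies which tables are present, under the rules as listed; it does not validate file contents against the OMOP CDM 5.4 standard, and the function name and messages are hypothetical.

```python
MANDATORY = {"PERSON"}
OPTIONAL = {"CONDITION_OCCURRENCE", "DRUG_EXPOSURE", "PROCEDURE_OCCURRENCE"}


def check_omop_tables(provided):
    """Return a list of problems with the set of OMOP table names provided."""
    provided = {name.upper() for name in provided}
    problems = []
    for name in sorted(MANDATORY - provided):
        problems.append(f"missing mandatory table: {name}")
    # CONCEPT becomes mandatory as soon as any optional table is supplied
    if provided & OPTIONAL and "CONCEPT" not in provided:
        problems.append("CONCEPT is required when an optional table is provided")
    for name in sorted(provided - MANDATORY - OPTIONAL - {"CONCEPT"}):
        problems.append(f"table not ingested in this release: {name}")
    return problems


print(check_omop_tables(["PERSON", "DRUG_EXPOSURE"]))
```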
[1] VcfMapper: https://stratus-documentation-us-east-1-public.s3.amazonaws.com/downloads/cohorts/main_vcfmapper.py
[2] crossMap: https://crossmap.sourceforge.net/
[3] liftOver: https://genome.ucsc.edu/cgi-bin/hgLiftOver
[4] Chain files: ftp://ftp.ensembl.org/pub/assembly_mapping/homo_sapiens/
You can compare up to four previously created individual cohorts, to view differences in variants and mutations, RNA expression, copy number variation, and distribution of clinical attributes. Once comparisons are created, they are saved in the Comparisons
left-navigation tab of the Cohorts module.
Select Cohorts
from the left-navigation panel.
Select 2 to 4 cohorts already created. If you have not created any cohorts, see the Create a Cohort documentation.
Click Compare Cohorts
in the right-navigation panel.
Note you are now in the Comparisons
left-navigation tab of the Cohorts module.
In the Charts
Section, if the COHORTS
item is not displayed, click the gear icon in the top right and add Cohorts
as the first attribute and click Save
.
The COHORTS
item in the charts panel will provide a count of subjects in each cohort and act as a legend for color representation throughout comparison screens.
For each clinical attribute category, a bar chart is displayed. Use the gear icon to select attributes to display in the charts panel.
You can share a comparison with other team members in the same ICA Project. Please refer to the section on "Sharing a Cohort" on "Create a Cohort" for details on sharing, unsharing, deleting, and archiving, which are analogous for sharing comparisons.
Select the Attributes
tab
Attribute categories are listed and can be expanded using the down-arrows next to the category names. The categories available are based on cohorts selected. Categories and attributes are part of the ICA Cohorts metadata template that map to each Subject.
For example, use the drop-down arrow next to Vital status
to view sub-categories and frequencies across selected cohorts.
Select the Genes
tab
Search for a gene of interest using its HUGO/HGNC gene symbol
As additional filter options, you can view only those variants that occur in every cohort; that are unique to one cohort; that have been observed in at least two cohorts; or any variant.
Select the Survival Summary
tab.
Attribute categories are listed and can be expanded using the down-arrows next to the category names.
Select the drop-down arrow for Therapeutic interventions
.
In each subcategory there is a sum of the subject counts across select cohorts.
For each cohort, designated by a color, there is a Subject count
and Median survival (years)
column.
Type Malignancy
in the Search Box and an auto-complete dropdown suggests three different attributes.
Select Synchronous malignancy
and the results are automatically opened and highlighted in orange.
Click Survival Comparison
tab.
A Kaplan-Meier Curve is rendered based on each cohort.
The P-value displayed at the top of Survival Comparison indicates whether there is a statistically significant difference between the survival probabilities over time of any pair of cohorts (CI=0.95).
When comparing two cohorts, the P-Value is shown above the two survival curves. For three or four cohorts, P-Values are shown as a pair-wise heatmap, comparing each cohort to every other cohort.
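For orientation, the sketch below shows how a Kaplan-Meier survival curve and a median survival time are computed from follow-up times and event flags. It is a textbook estimator written for illustration, not the code behind the Survival Comparison tab; the example data are made up, and the statistical test that produces the P-values is not covered.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each time with at least one event.

    durations: follow-up time per subject; events: 1 if the event occurred, 0 if censored.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def median_survival(curve):
    """First time at which the survival probability drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached


# Made-up follow-up times (years) and event flags
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
print(curve, median_survival(curve))
```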
Select the Marker Frequency
tab.
Select either Gene expression
(default), Somatic mutation
, or Copy number variation
For gene expression (up- versus down-regulated) and for copy number variation (gain versus loss), Cohorts will display a list of all genes with bidirectional barcharts
For somatic mutations, the barcharts are unidirectional and indicate the percentage of samples with a mutation in each gene per cohort.
Bars are color-coded by cohort, see the accompanying legend.
Each row shows P-value(s) resulting from pairwise comparison of all cohorts. In the case of comparing two cohorts, the numerical P-value will be displayed in the table. In the case of comparing three or more cohorts, the pairwise P-values are shown as a triangular heatmap, with details available as a tooltip.
Select the Correlation
tab.
Similar to the single-cohort view (Cohort Analysis | Correlation
), choose two clinical attributes and/or genes to compare.
Depending on the available data types for the two selections (categorical and/or continuous), Cohorts will display a bubble plot, violin plot, or scatter plot.
ICA Cohorts data can be viewed in an ICA Project Base instance as a shared database. A shared database in ICA Base operates as a database view. To use this feature, enable Base for your project prior to starting any ICA Cohorts ingestions. See for more information on enabling this feature in your ICA Project.
After ingesting data into your project, select Phenotypic and Molecular data are available to view in Base. See for instruction on importing data sets into Cohorts.
Post ingestion, data will be represented in Base.
Select BASE
from the ICA left-navigation and click Query
.
Under the New Query window, a list of tables is displayed. Expand the Shared Database for Project \<your project name\>
.
Cohorts tables will be displayed.
To preview the table and fields, click each view listed.
Clicking any of these views and then selecting PREVIEW on the right-hand side will show you a preview of the data in the tables.
Note: If your ingestion includes Somatic variants, there will be two molecular tables: ANNOTATED_SOMATIC_MUTATIONS and ANNOTATED_VARIANTS. All ingestions will include a PHENOTYPE table.
Note: The PHENOTYPE table includes a harmonized set that is collected across all data ingestions and is not representative of all data ingested for the Subject or Sample. Sample information is also displayed in this table, if applicable. Sample information drives the annotation process if molecular data is included in the ingestion. That data is stored in the PHENOTYPE table.
This table will be available for all projects with ingested molecular data
This table will only be available for data sets with ingested Somatic molecular data.
This table will only be available for data sets with ingested CNV molecular data.
This table will only be available for data sets with ingested SV molecular data. Note that ICA Cohorts stores copy number variants in a separate table.
These tables will only be available for data sets with ingested RNAseq molecular data.
Table for gene quantification results:
The corresponding transcript table uses TRANSCRIPT_ID instead of GENE_ID and GENE_HGNC.
These tables will only be available for data sets with ingested RNAseq molecular data.
Table for differential gene expression results:
The corresponding transcript table uses TRANSCRIPT_ID instead of GENE_ID and GENE_HGNC.
In ICA Cohorts, metadata describe any subjects and samples imported into the system in terms of attributes, including:
subject:
demographics such as age, sex, ancestry;
phenotypes and diseases;
biometrics such as body height, body mass index, etc.;
pathological classification, tumor stages, etc.;
family and patient medical history;
sample:
sample type such as FFPE,
tissue type,
sequencing technology: whole genome DNA-sequencing, RNAseq, single-cell RNAseq, among others.
You can use these attributes while creating a cohort to define the cases and/or controls that you want to include.
During data import, you will be asked to upload a metadata sheet as a tab-delimited (TSV) file. An example sheet is available for download on the Import files page in the ICA Cohorts UI.
A metadata sheet will need to contain at least these four columns per row:
Subject ID - identifier referring to individuals; use the column header "SubjectID".
Sample ID - identifier for a sample. Sample IDs need to match the corresponding column header in VCF/GVCFs. Each subject can have multiple samples; specify each sample in its own row with the same SubjectID. Use the column header "SampleID".
Biological sex - can be "Female (XX)", "Female", "Male (XY)", "Male", "X (Turner's)", "XXY (Klinefelter)", "XYY", "XXXY", or "Not provided". Use the column header "DM_Sex" (demographics).
Sequencing technology - can be "Whole genome sequencing", "Whole exome sequencing", "Targeted sequencing panels", or "RNA-seq"; use the column header "TC" (technology).
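Before uploading, the four required columns can be checked locally. The following is a minimal sketch using only the Python standard library; the function name `validate_metadata` and the example rows are hypothetical, and the allowed values are copied from the list above rather than from any official validator.

```python
import csv
import io

REQUIRED_COLUMNS = ["SubjectID", "SampleID", "DM_Sex", "TC"]
ALLOWED_SEX = {
    "Female (XX)", "Female", "Male (XY)", "Male", "X (Turner's)",
    "XXY (Klinefelter)", "XYY", "XXXY", "Not provided",
}
ALLOWED_TC = {
    "Whole genome sequencing", "Whole exome sequencing",
    "Targeted sequencing panels", "RNA-seq",
}


def validate_metadata(tsv_text):
    """Return a list of human-readable problems found in a metadata sheet."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]
    problems = []
    # Row 1 is the header, so data rows start at 2.
    for row_number, row in enumerate(reader, start=2):
        for column in ("SubjectID", "SampleID"):
            if not (row[column] or "").strip():
                problems.append(f"row {row_number}: empty {column}")
        if row["DM_Sex"] not in ALLOWED_SEX:
            problems.append(f"row {row_number}: unexpected DM_Sex {row['DM_Sex']!r}")
        if row["TC"] not in ALLOWED_TC:
            problems.append(f"row {row_number}: unexpected TC {row['TC']!r}")
    return problems


# Made-up example: one subject with two samples, one row with an invalid sex value.
example = (
    "SubjectID\tSampleID\tDM_Sex\tTC\n"
    "S1\tS1-tumor\tFemale\tWhole exome sequencing\n"
    "S1\tS1-normal\tFemale\tWhole exome sequencing\n"
    "S2\tS2-a\tunknown\tRNA-seq\n"
)
print(validate_metadata(example))
```

The check is deliberately strict about exact spelling, since the sheet is matched against fixed values; additional optional columns are ignored.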
This walk-through is intended to represent a typical workflow when building and studying a cohort of oncology cases.
Click the Create Cohort button.
Select the following studies to add to your cohort:
TCGA – BRCA – Breast Invasive Carcinoma
TCGA – Ovarian Serous Cystadenocarcinoma
Add a Cohort Name = TCGA Breast and Ovarian_1472
Click on Apply.
Expand Show query details to see the study makeup of your cohort.
Charts will be open by default. If not, click Show charts.
Use the gear icon in the top-right to change viewable chart settings.
Tip: Disease Type, Histological Diagnosis, Technology, and Overall Survival have interesting data about this cohort.
The Subject tab, which lists all Subjects, is displayed below Charts. It links to each Subject by ID and shows other high-level information, such as the Data Types measured and reported. Clicking a Subject ID brings you to the data collected at the Subject level.
Search for subject TCGA-E2-A14Y and view the data about this Subject.
Click the TCGA-E2-A14Y Subject ID link to view clinical data for this Subject that was imported via the metadata.tsv file on ingest.
Note: the Subject is a 35-year-old Female with vital status and other phenotypes that feed up into the Subject attribute selection criteria when creating or editing cohorts.
Click X to close the Subject details.
Click Hide charts to free up more space for the interactive views.
Click the Marker Frequency tab, then click the Somatic Mutation tab.
Review the gene list and mutation frequencies.
Note that PIK3CA has a high rate of mutation in the Cohort (ranked 2nd with 33% mutation frequency in 326 of the 987 Subjects that have Somatic Mutation data in this cohort).
Do Subjects with PIK3CA mutations have changes in PIK3CA RNA Expression?
Click the Gene Expression tab and search for PIK3CA.
PIK3CA RNA is down-regulated in 27% of the subjects relative to normal samples.
Switch from normal to disease Reference, where the Subject’s denominator is the median of all disease samples in your cohort.
Note the count of matching vs. total subjects that have up-regulated PIK3CA RNA, which may indicate a distinctive sub-phenotype.
Click directly on the PIK3CA gene link in the Gene Expression table.
You are brought to the Gene tab under the Gene Summary sub-tab, which lists information and links to public resources about PIK3CA.
Click the Variants tab and Show legend and filters if it does not open by default.
Below the interactive legend you see a set of analysis tracks: Needle Plot, Primate AI, Pathogenic variants, and Exons.
The Needle Plot allows toggling the plot by gnomAD frequency and Sample Count. Select Sample Count in the Plot by legend above the plot.
There are 87 mutations distributed across the 1068 amino acid sequence, listed below the analysis tracks. These can be exported via the icon into a table.
We know that missense variants can severely disrupt translated protein activity. Deselect all Variant Types except for Missense from the Show Variant Type legend above the needle plot.
Many mutations are in the functional domains of the protein as seen by the colored boxes and labels on the x-axis of the Needle Plot.
Hover over the variant with the highest sample count in the yellow PI3Ka protein domain.
The pop-up shows variant details for the 64 Subjects observed with it: 63 in the Breast Cancer study and 1 in the Ovarian Cancer Study.
Use the Exon zoom bar from each end of the Amino Acid sequence to zoom in to the PI3Ka domain to better separate observations.
There are three different missense mutations at this locus changing the wildtype Glutamic acid at different frequencies to Lysine (64), Glycine (6), or Alanine (2).
The Pathogenic Variant Track shows 7 ClinVar entries for mutations stacked at this locus affecting amino acid 545. Pop-up details with pathogenicity calls, phenotypes, submitter, and a link to the ClinVar entry are shown when hovering over the purple triangles.
Note the Primate AI track and high Primate AI score.
The Primate AI track displays Scores for potential missense variants, based on polymorphisms observed in primate species. Points above the dashed line for the 75th percentile may be considered "likely pathogenic" as cross-species sequence is highly conserved; you often see high conservation at the functional domains. Points below the 25th percentile may be considered "likely benign".
Click the Expression tab and notice that normal Breast and normal Ovarian tissue have relatively high PIK3CA RNA Expression in GTEx RNAseq tissue data, although the gene is ubiquitously expressed.
ICA Cohorts is a cohort analysis tool integrated with Illumina Connected Analytics (ICA). ICA Cohorts combines subject- and sample-level metadata, such as phenotypes, diseases, demographics, and biometrics, with molecular data stored in ICA to perform tertiary analyses on selected subsets of individuals.
This video is an overview of Illumina Connected Analytics. It walks through a Multi-Omics Cancer workflow that can be found here:
Intuitive UI for selecting subjects and samples to analyze and compare: deep phenotypical and clinical metadata, molecular features including germline, somatic, gene expression.
Comprehensive, harmonized data model exposed to ICA Base and ICA Bench users for custom analyses.
Run analyses in ICA Base and ICA Bench and upload final results back into Cohorts for visualization.
Out-of-the-box statistical analyses including genetic burden tests, GWAS/PheWAS.
Rich public data sets covering key disease areas to enrich private data analysis.
Easy-to-use visualizations for gene prioritization and genetic variation inspection.
Variants and mutations will be displayed as one needle plot for each cohort that is part of the comparison (see in this online help for more details)
A description of all attributes and data types currently supported by ICA Cohorts can be found here:
You can download an example of a metadata sheet, which contains some samples from The Cancer Genome Atlas (TCGA) and their publicly available clinical attributes, here:
A list of concepts and diagnoses that cover all public data subjects to easily navigate the new concept code browser for diagnosis can be found here:
ICA Cohorts contains a variety of freely available data sets covering different disease areas and sequencing technologies. For a list of currently available data, .
Field | Description |
Project name | The ICA project for your cohort analysis (cannot be changed). |
Study name | Create or select a study. Each study represents a subset of data within the project. |
Description | Short description of the data set (optional). |
Job type | Append: Appends values to any existing values. If a field supports only a single value, the value is replaced. Replace: Overwrites existing values with the values in the uploaded file. |
Subject metadata files | Subject metadata file(s) in tab-delimited format. For Append and Replace job types, the following fields are required and cannot be changed: Sample identifier, Sample display name, Subject identifier, Subject display name, Sex. |
Field Name | Type | Description |
SAMPLE_BARCODE | STRING | Original sample barcode used in VCF column |
STUDY | STRING | Study designation |
GENOMEBUILD | STRING | Only hg38 is supported |
CHROMOSOME | STRING | Chromosome without 'chr' prefix |
CHROMOSOMEID | NUMERIC | Chromosome ID: 1..22, 23=X, 24=Y, 25=Mt |
DBSNP | STRING | dbSNP Identifiers |
VARIANT_KEY | STRING | Variant ID in the form "1:12345678:12345678:C" |
NIRVANA_VID | STRING | Broad Institute VID: "1-12345678-A-C" |
VARIANT_TYPE | STRING | Description of Variant Type (e.g. SNV, Deletion, Insertion) |
VARIANT_CALL | NUMERIC | 1=germline, 2=somatic |
DENOVO | BOOLEAN | true / false |
GENOTYPE | STRING | "G|T" |
READ_DEPTH | NUMERIC | Sequencing read depth |
ALLELE_COUNT | NUMERIC | Counts of each alternate allele for each site across all samples |
ALLELE_DEPTH | STRING | Unfiltered count of reads that support a given allele for an individual sample |
FILTERS | STRING | Filter field from VCF. If all filters pass, field is PASS |
ZYGOSITY | NUMERIC | 0 = hom ref, 1 = het ref/alt, 2 = hom alt, 4 = hemi alt |
GENEMODEL | NUMERIC | 1=Ensembl, 2=RefSeq |
GENE_HGNC | STRING | HUGO/HGNC gene symbol |
GENE_ID | STRING | Ensembl gene ID ("ENSG00001234") |
GID | NUMERIC | NCBI Entrez Gene ID (RefSeq) or numerical part of Ensembl ENSG ID |
TRANSCRIPT_ID | STRING | Ensembl ENST or RefSeq NM_ |
CANONICAL | STRING | Transcript designated 'canonical' by source |
CONSEQUENCE | STRING | missense, stop gained, intronic, etc. |
HGVSC | STRING | The HGVS coding sequence name |
HGVSP | STRING | The HGVS protein sequence name |
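The numeric codes in this table can be decoded after export. Below is a small sketch, assuming rows have already been retrieved from Base; the helper names are hypothetical, and `parse_nirvana_vid` handles only the simple "chr-pos-ref-alt" form shown in the table.

```python
# ZYGOSITY codes as listed in the table above.
ZYGOSITY_LABELS = {0: "hom ref", 1: "het ref/alt", 2: "hom alt", 4: "hemi alt"}


def chromosome_id(chromosome):
    """Map a chromosome name without 'chr' prefix to CHROMOSOMEID (1..22, 23=X, 24=Y, 25=Mt)."""
    special = {"X": 23, "Y": 24, "MT": 25}
    name = chromosome.upper()
    if name in special:
        return special[name]
    number = int(name)
    if not 1 <= number <= 22:
        raise ValueError(f"unexpected chromosome: {chromosome!r}")
    return number


def parse_nirvana_vid(vid):
    """Split a simple VID such as '1-12345678-A-C' into its four parts."""
    chromosome, position, ref, alt = vid.split("-")
    return {"chromosome": chromosome, "position": int(position), "ref": ref, "alt": alt}


print(chromosome_id("X"), ZYGOSITY_LABELS[1], parse_nirvana_vid("1-12345678-A-C"))
```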
Field Name | Type | Description |
SAMPLE_BARCODE | STRING | Original sample barcode, used in VCF column |
SUBJECTID | STRING | Identifier for Subject entity |
STUDY | STRING | Study designation |
GENOMEBUILD | STRING | Only hg38 is supported |
CHROMOSOME | STRING | Chromosome without 'chr' prefix |
DBSNP | NUMERIC | dbSNP Identifiers |
VARIANT_KEY | STRING | Variant ID in the form "1:12345678:12345678:C" |
MUTATION_TYPE | NUMERIC | Rank of consequences by expected impact: 0 = Protein Truncating to 40 = Intergenic Variant |
VARIANT_CALL | NUMERIC | 1=germline, 2=somatic |
GENOTYPE | STRING | "G|T" |
REF_ALLELE | STRING | Reference allele |
ALLELE1 | STRING | First allele call in the tumor sample |
ALLELE2 | STRING | Second allele call in the tumor sample |
GENEMODEL | NUMERIC | 1=Ensembl, 2=RefSeq |
GENE_HGNC | STRING | HUGO/HGNC gene symbol |
GENE_ID | STRING | Ensembl gene ID ("ENSG00001234") |
TRANSCRIPT_ID | STRING | Ensembl ENST or RefSeq NM_ |
CANONICAL | BOOLEAN | Transcript designated 'canonical' by source |
CONSEQUENCE | STRING | missense, stop gained, intronic, etc. |
HGVSP | STRING | HGVS nomenclature for AA change: p.Pro72Ala |
Field Name | Type | Description |
SAMPLE_BARCODE | STRING | Sample barcode used in the original VCF |
GENOMEBUILD | STRING | Genome build, always 'hg38' |
NIRVANA_VID | STRING | Variant ID of the form 'chr-pos-ref-alt' |
CHRID | STRING | Chromosome without 'chr' prefix |
CID | NUMERIC | Numerical representation of the chromosome, X=23, Y=24, Mt=25 |
GENE_ID | STRING | NCBI or Ensembl gene identifier |
GID | NUMERIC | Numerical part of the gene ID; for Ensembl, we remove the 'ENSG000..' prefix |
START_POS | NUMERIC | First affected position on the chromosome |
STOP_POS | NUMERIC | Last affected position on the chromosome |
VARIANT_TYPE | NUMERIC | 1 = copy number gain, -1 = copy number loss |
COPY_NUMBER | NUMERIC | Observed copy number |
COPY_NUMBER_CHANGE | NUMERIC | Fold-change of copy number, assuming 2 for diploid and 1 for haploid as the baseline |
SEGMENT_VALUE | NUMERIC | Average FC for the identified chromosomal segment |
PROBE_COUNT | NUMERIC | Probes confirming the CNV (arrays only) |
REFERENCE | NUMERIC | Baseline taken from normal samples (1) or averaged disease tissue (2) |
GENE_HGNC | STRING | HUGO/HGNC gene symbol |
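One plausible reading of COPY_NUMBER_CHANGE and VARIANT_TYPE is sketched below: the fold-change is the observed copy number divided by the baseline ploidy, and the sign encodes gain or loss. This is an interpretation of the column descriptions, not a confirmed account of how ICA Cohorts populates the table; the function names are hypothetical.

```python
def copy_number_fold_change(copy_number, haploid=False):
    """Fold-change relative to a baseline of 2 (diploid) or 1 (haploid)."""
    baseline = 1 if haploid else 2
    return copy_number / baseline


def cnv_variant_type(copy_number, haploid=False):
    """Return 1 for a gain, -1 for a loss, or None when the copy number equals the baseline."""
    fold_change = copy_number_fold_change(copy_number, haploid)
    if fold_change > 1:
        return 1
    if fold_change < 1:
        return -1
    return None


print(copy_number_fold_change(4), cnv_variant_type(4))  # amplification on an autosome
print(copy_number_fold_change(0, haploid=True), cnv_variant_type(0, haploid=True))  # loss on a haploid region
```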
Field Name | Type | Description |
SAMPLE_BARCODE | STRING | Sample barcode used in the original VCF |
GENOMEBUILD | STRING | Genome build, always 'hg38' |
NIRVANA_VID | STRING | Variant ID of the form 'chr-pos-ref-alt' |
CHRID | STRING | Chromosome without 'chr' prefix |
CID | NUMERIC | Numerical representation of the chromosome, X=23, Y=24, Mt=25 |
BEGIN | NUMERIC | First affected position on the chromosome |
END | NUMERIC | Last affected position on the chromosome |
BAND | STRING | Chromosomal band |
QUALIITY | NUMERIC | Quality from the original VCF |
FILTERS | ARRAY | Filters from the original VCF |
VARIANT_TYPE | STRING | Insertion, deletion, indel, tandem_duplication, translocation_breakend, inversion ("INV"), short tandem repeat ("STR2") |
VARIANT_TYPE_ID | NUMERIC | 21=insertion, 22=deletion, 23=indel, 24=tandem_duplication, 25=translocation_breakend, 26=inversion ("INV"), 27=short tandem repeat ("STR2") |
CIPOS | ARRAY | Confidence interval around first position |
CIEND | ARRAY | Confidence interval around last position |
SVLENGTH | NUMERIC | Overall size of the structural variant |
BONDCHR | STRING | For translocations, the other affected chromosome |
BONDCID | NUMERIC | For translocations, the other affected chromosome as a numeric value, X=23, Y=24, Mt=25 |
BONDPOS | STRING | For translocations, positions on the other affected chromosome |
BONDORDER | NUMERIC | 3 or 5: Whether this fragment (the current variant/VID) "receives" the other chromosome's fragment on its 3' end, or attaches to the 5' of the other chromosome fragment |
GENOTYPE | STRING | Called genotype from the VCF |
GENOTYPE_QUALITY | NUMERIC | Genotype call quality |
READCOUNTSSPLIT | ARRAY | Read counts |
READCOUNTSPAIRED | ARRAY | Read counts, paired end |
REGULATORYREGIONID | STRING | Ensembl ID for the affected regulatory region |
REGULATORYREGIONTYPE | STRING | Type of the regulatory region |
CONSEQUENCE | ARRAY | Variant consequence according to SequenceOntology |
TRANSCRIPTID | STRING | Ensembl or RefSeq transcript identifier |
TRANSCRIPTBIOTYPE | STRING | Biotype of the transcript |
INTRONS | STRING | Count of impacted introns out of the total number of introns, specified as "M/N" |
GENEID | STRING | Ensembl or RefSeq gene identifier |
GENEHGNC | STRING | HUGO/HGNC gene symbol |
ISCANONICAL | BOOLEAN | Is the transcript ID the canonical one according to Ensembl? |
PROTEINID | STRING | RefSeq or Ensembl protein ID |
SOURCEID | NUMERICAL | Gene model: 1=Ensembl, 2=RefSeq |
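As an example of working with the VARIANT_TYPE_ID codes, the sketch below tallies structural variants per type. The input rows are made up for illustration, and the label strings are shortened from the table descriptions; the exact strings stored in VARIANT_TYPE may differ.

```python
from collections import Counter

# VARIANT_TYPE_ID codes as listed in the table above.
SV_TYPE_BY_ID = {
    21: "insertion",
    22: "deletion",
    23: "indel",
    24: "tandem_duplication",
    25: "translocation_breakend",
    26: "inversion",
    27: "short tandem repeat",
}


def count_sv_types(rows):
    """Count structural variants per type label; unknown IDs are reported as 'unknown (<id>)'."""
    return Counter(
        SV_TYPE_BY_ID.get(row["VARIANT_TYPE_ID"], f"unknown ({row['VARIANT_TYPE_ID']})")
        for row in rows
    )


# Made-up rows standing in for an exported query result.
rows = [
    {"VARIANT_TYPE_ID": 22},
    {"VARIANT_TYPE_ID": 22},
    {"VARIANT_TYPE_ID": 25},
    {"VARIANT_TYPE_ID": 99},
]
print(count_sv_types(rows))
```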
Field Name | Type | Description |
GENOMEBUILD | STRING | Genome build, always 'hg38' |
STUDY_NAME | STRING | Study designation |
SAMPLE_BARCODE | STRING | Sample barcode used in the original VCF |
LABEL | STRING | Group label specified during import: Case or Control, Tumor or Normal, etc. |
GENE_ID | STRING | Ensembl or RefSeq gene identifier |
GID | NUMERIC | Numerical part of the gene ID; for Ensembl, we remove the 'ENSG000..' prefix |
GENE_HGNC | STRING | HUGO/HGNC gene symbol |
SOURCE | STRING | Gene model: 1=Ensembl, 2=RefSeq |
TPM | NUMERICAL | Transcripts per million |
LENGTH | NUMERICAL | The length of the gene in base pairs. |
EFFECTIVE_LENGTH | NUMERICAL | The length as accessible to RNA-seq, accounting for insert-size and edge effects. |
NUM_READS | NUMERICAL | The estimated number of reads from the gene. The values are not normalized. |
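The relationship between NUM_READS, EFFECTIVE_LENGTH, and TPM can be illustrated with the standard TPM definition: each gene's read count is divided by its effective length, and the resulting rates are scaled to sum to one million. Whether the TPM column in this table was computed exactly this way is an assumption; the numbers below are made up.

```python
def tpm_from_reads(num_reads, effective_lengths):
    """Compute transcripts per million from read counts and effective lengths."""
    rates = [reads / length for reads, length in zip(num_reads, effective_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    return [rate / total * 1_000_000 for rate in rates]


# Three hypothetical genes within one sample.
tpm = tpm_from_reads([100, 200, 300], [1000, 1000, 3000])
print([round(value) for value in tpm])
```

Note that the third gene has the most reads but the same TPM as the first, because its effective length is three times larger.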
Field Name | Type | Description |
GENOMEBUILD | STRING | Genome build, always 'hg38' |
STUDY_NAME | STRING | Study designation |
SAMPLE_BARCODE | STRING | Sample barcode used in the original VCF |
CASE_LABEL | STRING | Study designation |
GENE_ID | STRING | Ensembl or RefSeq gene identifier |
GID | NUMERIC | Numerical part of the gene ID; for Ensembl, we remove the 'ENSG000..' prefix |
GENE_HGNC | STRING | HUGO/HGNC gene symbol |
SOURCE | STRING | Gene model: 1=Ensembl, 2=RefSeq |
BASEMEAN | NUMERICAL |
FC | NUMERICAL | Fold-change |
LFC | NUMERICAL | Log of the fold-change |
LFCSE | NUMERICAL | Standard error for log fold-change |
PVALUE | NUMERICAL | P-value |
CONTROL_SAMPLECOUNT | NUMERICAL | Number of samples used as control |
CONTROL_LABEL | NUMERICAL | Label used for controls |
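A typical use of this table is to filter genes by P-value and effect size. The sketch below assumes that LFC is a base-2 logarithm of FC, which the table does not state, and uses hypothetical thresholds and made-up rows. The PVALUE column is used as-is; no multiple-testing adjustment is applied here.

```python
import math


def is_differentially_expressed(row, max_pvalue=0.05, min_abs_lfc=1.0):
    """Flag a row whose P-value and absolute log fold-change pass the given thresholds."""
    return row["PVALUE"] <= max_pvalue and abs(row["LFC"]) >= min_abs_lfc


# Made-up rows; LFC is computed as log2(FC) under the assumption stated above.
rows = [
    {"GENE_HGNC": "GENE_A", "FC": 4.0, "LFC": math.log2(4.0), "PVALUE": 0.001},
    {"GENE_HGNC": "GENE_B", "FC": 1.1, "LFC": math.log2(1.1), "PVALUE": 0.20},
    {"GENE_HGNC": "GENE_C", "FC": 0.25, "LFC": math.log2(0.25), "PVALUE": 0.01},
]
hits = [row["GENE_HGNC"] for row in rows if is_differentially_expressed(row)]
print(hits)
```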
Field Name | Type | Description |
SAMPLE_BARCODE | STRING | Sample Identifier |
SUBJECTID | STRING | Identifier for Subject entity |
STUDY | STRING | Study designation |
AGE | NUMERIC | Age in years |
SEX | STRING | Sex field to drive annotation |
POPULATION | STRING | Population Designation for 1000 Genomes Project |
SUPERPOPULATION | STRING | Superpopulation Designation from 1000 Genomes Project |
RACE | STRING | Race according to NIH standard |
CONDITION_ONTOLOGIES | VARIANT | Diagnosis Ontology Source |
CONDITION_IDS | VARIANT | Diagnosis Concept Ids |
CONDITIONS | VARIANT | Diagnosis Names |
HARMONIZED_CONDITIONS | VARIANT | Diagnosis High-level concept to drive UI |
LIBRARYTYPE | STRING | Sequencing technology |
ANALYTE | STRING | Substance sequenced |
TISSUE | STRING | Tissue source |
TUMOR_OR_NORMAL | STRING | Tumor designation for somatic |
GENOMEBUILD | STRING | Genome Build to drive annotations - hg38 only |
SAMPLE_BARCODE_VCF | STRING | Sample ID from VCF |
AFFECTED_STATUS | NUMERIC | Affected, Unaffected, or Unknown for Family Based Analysis |
FAMILY_RELATIONSHIP | STRING | Relationship designation for Family Based Analysis |
This walk-through is meant to represent a typical workflow when building and studying a cohort of rare genetic disorder cases.
Create a new Project to track your study:
Log in to ICA.
Navigate to Projects.
Create a new project using the New Project button.
Give your project a name and click Save.
Navigate to the ICA Cohorts module by clicking COHORTS in the left navigation panel, then choose Cohorts.
Navigate to the ICA Cohorts module by clicking Cohorts in the left navigation panel.
Click the Create Cohort button.
Enter a name for your cohort, like Rare Disease + 1kGP, at the top, to the left of the pencil icon.
From the Public Data Sets list select:
DRAGEN-1kGP
All Rare genetic disease cohorts
Notice that a cohort can also be created based on Technology, Disease Type, and Tissue.
Under Selected Conditions in the right panel, click on Apply.
A new page opens with your cohort in a top-level tab.
Expand Query Details to see the study makeup of your cohort.
A set of 4 Charts will be open by default. If they are not, click Show Charts.
Use the gear icon in the top-right of the Charts pane to change chart settings.
The bottom section is demarcated by 8 tabs (Subjects, Marker Frequency, Genes, GWAS, PheWAS, Correlation, Molecular Breakdown, CNV).
The Subjects tab displays a list of exportable Subject IDs and attributes.
Clicking on a Subject ID link pops up a Subject details page.
A recent GWAS publication identified 10 risk genes for intellectual disability (ID) and autism. Our task is to evaluate them in ICA Cohorts: TTN, PKHD1, ANKRD11, ARID1B, ASXL3, SCN2A, FHL1, KMT2A, DDX3X, SYNGAP1.
First let’s Hide charts for more visual space.
Click the Genes tab, where you need to query a gene to see and interact with results.
Type SCN2A into the Gene search field and select it from the autocomplete dropdown options.
The Gene Summary tab now lists information and links to public resources about SCN2A.
Click on the Variants tab to see an interactive Legend and analysis tracks.
The Needle Plot displays gnomAD Allele Frequency for variants in your cohort.
Note that some are in SCN2A conserved protein domains.
In Legend, switch the Plot by option to Sample Count in your cohort.
In Legend, uncheck all Variant Types except Stop gained. Now you should see 7 variants.
Hover over pin heads to see pop-up information about particular variants.
The Primate AI track displays Scores for potential missense variants, based on polymorphisms observed in primate species. Points above the dashed line for the 75th percentile may be considered "likely pathogenic" as cross-species sequence is highly conserved; you often see high conservation at the functional domains. Points below the 25th percentile may be considered "likely benign".
The Pathogenic variants track displays markers from ClinVar color-coded by variant type. Hover over them to see pop-ups with more information.
The Exons track shows mRNA exon boundaries with click and zoom functionality at the ends.
Below the Needle Plot and analysis tracks is a list of "Variants observed in the selected cohort".
The Export Gene Variants table icon is above the legend on the right side.
Now let's click on the Gene Expression tab to see a bar chart of 50 normal tissues from GTEx in transcripts per million (TPM). SCN2A is highly expressed in certain Brain tissues, a tissue specificity consistent with where good markers for intellectual disability and autism could be expected.
As a final exercise in discovering good markers, click on the tab for Genetic Burden Test. The table here associates Phenotypes with Mutations Observed in each Study selected for our cohort, alongside Mutations Expected, to derive p-values. Given all considerations above, SCN2A is a good marker for intellectual disability (p < 1.465 × 10^-22) and autism (p < 5.290 × 10^-9).
Continue to check the other genes of interest in step 1.
ICA Cohorts comes front-loaded with a variety of publicly accessible data sets, covering multiple disease areas and also including healthy individuals.
Data set | Samples | Diseases/Phenotypes | Reference |
---|---|---|---|
1kGP-DRAGEN | 3202 WGS: 2504 original samples plus 698 relateds | Presumed healthy | |
DDD | 4293 (3664 affected), de novos only | Developmental disorders | |
EPI4K | 356, de novos only | Epilepsy | |
ASD Cohorts | 6786 (4266 affected), de novos only | Autism Spectrum disorder | |
De Ligt et al. | 100, de novos only | Intellectual disability | |
Homsy et al. | 1213, de novos only | Congenital heart disease (HP:0030680) | |
Lelieveld et al. | 820, de novos only | Intellectual disability | |
Rauch et al. | 51, de novos only | Intellectual disability | |
Rare Genomes Project | 315 WES (112 pedigrees) | Various | https://raregenomes.org/ |
TCGA | ca. 4200 WES, ca. 4000 RNAseq | 12 tumor types | https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga |
GEO | RNAseq | Auto-immune disorders, incl. asthma, arthritis, SLE, MS, Crohn's disease, Psoriasis, Sjögren's Syndrome | For GEO/GSE study identifiers, please refer to the in-product list of studies |
GEO | RNAseq | Kidney diseases | For GEO/GSE study identifiers, please refer to the in-product list of studies |
GEO | RNAseq | Central nervous system diseases | For GEO/GSE study identifiers, please refer to the in-product list of studies |
GEO | RNAseq | Parkinson's disease | For GEO/GSE study identifiers, please refer to the in-product list of studies |