# Import New Samples

ICA Cohorts can pull any molecular data available in an ICA Project, as well as additional sample- and subject-level metadata information such as demographics, biometrics, sequencing technology, phenotypes, and diseases.

To import a new data set, select `Import Jobs` from the left navigation tab underneath `Cohorts`, and click the `Import Files` button. The `Import Files` button is also available under the `Data Sets` left navigation item.

{% hint style="info" %}
The `Data Sets` menu item is used to view imported data sets and their information. The `Import Jobs` menu item is used to check the status of data set imports.
{% endhint %}

Confirm that the project shown is the ICA Project that contains the molecular data you would like to add to ICA Cohorts.

1. Choose one of the following data types:
   * Germline variants
   * Somatic mutations
   * RNAseq
   * GWAS
2. To create a new study, select the `Create new study` radio button and enter a `Study Name`.
3. To add new data to an existing study instead, select the `Select from list of studies` radio button and choose an existing `Study Name` from the dropdown.
4. To add data to existing records or add new records, set `Job Type` to `Append`.
5. `Append` does not remove previously ingested data, so it can be used to ingest molecular data incrementally.
6. To replace data, set `Job Type` to `Replace`. Use `Replace` when you are ingesting the same data again.
7. Enter an optional `Study description`.
8. Select the metadata model. The default is Cohorts; select OMOP version 5.4 if your data is formatted that way.
9. Select the genome build your molecular data is aligned to (default: GRCh38/hg38).
10. For RNAseq, specify whether you want to run differential expression (see below) or only upload raw TPM.
11. Click `Next`.
12. Navigate to VCFs located in the Project Data.
13. Select each single-sample VCF or multi-sample VCF to ingest. For GWAS, select CSV files produced by Regenie.
    * As an alternative to selecting individual files, you can select a folder instead. Toggle the radio button on Step 2 from "Select files" to "Select folder".
    * This option is currently only available for germline variant ingestion: any combination of small variants, structural variation, and/or copy number variants.
    * ICA Cohorts will scan the selected folder and all sub-folders for VCF or JSON files and try to match them against the Sample ID column in the metadata TSV file (Step 3).
    * Files that do not match a sample ID are ignored. Allowed VCF file extensions after the sample ID are: \*.vcf.gz, \*.hard-filtered.vcf.gz, \*.cnv.vcf.gz, and \*.sv.vcf.gz.
    * Allowed JSON file extensions after the sample ID are: \*.json, \*.json.gz, \*.json.bgz, and \*.json.gzip.
14. Click `Next`.
15. Navigate to the metadata (phenotype) TSV file in the project data.
16. Select the TSV file or files for ingestion.
17. Click `Finish`.
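
The file-name matching described for folder selection can be illustrated with a short sketch. This is not how ICA Cohorts is implemented; `match_files_to_samples` is a hypothetical helper, and it assumes that a file name consists of exactly the sample ID followed by one of the allowed extensions listed above.

```python
from pathlib import PurePosixPath

# Extensions listed in the folder-selection notes above (longest VCF variants first).
VCF_SUFFIXES = (".hard-filtered.vcf.gz", ".cnv.vcf.gz", ".sv.vcf.gz", ".vcf.gz")
JSON_SUFFIXES = (".json", ".json.gz", ".json.bgz", ".json.gzip")


def match_files_to_samples(paths, sample_ids, suffixes=VCF_SUFFIXES):
    """Return {sample_id: [paths]} for files named <sample_id><allowed suffix>.

    Hypothetical helper: files whose name does not resolve to a known
    sample ID are skipped, mirroring the "ignored" behavior described above.
    """
    wanted = set(sample_ids)
    matches = {}
    for path in paths:
        name = PurePosixPath(path).name
        for suffix in suffixes:
            stem = name[: -len(suffix)]
            if name.endswith(suffix) and stem in wanted:
                matches.setdefault(stem, []).append(path)
                break
    return matches


if __name__ == "__main__":
    files = ["run1/S1.hard-filtered.vcf.gz", "run1/S1.sv.vcf.gz", "run2/S9.vcf.gz"]
    print(match_files_to_samples(files, ["S1", "S2"]))
```

Running a check like this on a folder listing before you start the import can show which samples in the metadata sheet would end up without any molecular data file.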

{% hint style="info" %}
Search spinner behavior in the `Import Jobs` table:

* Enter a search term and press **Enter**.
* The search spinner appears while the results are loading.
* Once the results are displayed in the table, the spinner disappears.
{% endhint %}

{% hint style="info" %}
All VCF types, specifically from DRAGEN, can be ingested using the Germline variants selection. Cohorts will distinguish the variant types that it is ingesting. If Cohorts cannot determine the variant file type, it will default to ingest small variants.

As an alternative to VCFs, you can select Nirvana JSON files for DNA variants: small variants, structural variants, and copy number variation.

A single manual ingestion batch can contain at most 1000 files.

Alternatively, you can choose a single folder and ICA Cohorts will identify all ingestible files within that folder and its sub-folders. In this scenario, Cohorts will select the molecular data files that match the samples listed in the metadata sheet, which you provide in the next step of the import process.

You have the option to ingest either VCF files or Nirvana JSON files for any given batch, regardless of the chosen ingestion method.

The sample identifiers used in the VCF columns need to match the sample identifiers used in the subject/sample metadata files. Likewise, if you are starting from JSON files containing variant- and gene-level annotations provided by Illumina Nirvana, the `samples` listed in the header need to match the metadata files.
{% endhint %}
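
Because sample identifiers in the VCF columns must match those in the metadata files, it can help to compare the two before starting an import. The sketch below reads the sample columns from the `#CHROM` header line (columns 10 and onward in the VCF specification) and reports IDs missing from the metadata. The function names and the `sample_id` column header are hypothetical; substitute the header your metadata TSV actually uses.

```python
import csv
import io


def vcf_sample_ids(lines):
    """Return the sample IDs from the #CHROM header line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            # The first nine columns are fixed; sample columns follow.
            return line.rstrip("\n").split("\t")[9:]
        if not line.startswith("#"):
            break  # reached data records without finding the column header
    return []


def metadata_sample_ids(tsv_text, column="sample_id"):
    """Return the set of sample IDs in a metadata TSV.

    `column` is a hypothetical header name, not one defined by ICA Cohorts.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[column] for row in reader}


def unmatched_samples(vcf_lines, tsv_text, column="sample_id"):
    """Return VCF sample IDs that have no row in the metadata TSV."""
    in_vcf = set(vcf_sample_ids(vcf_lines))
    return sorted(in_vcf - metadata_sample_ids(tsv_text, column))
```

For a compressed VCF, the lines can be read with `gzip.open(path, "rt")` from the Python standard library and passed in as `vcf_lines`.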

#### Variant file formats

ICA Cohorts supports VCF files formatted according to VCF v4.2 and v4.3 specifications. VCF files require at least one of the following header rows to identify the genome build:

* \##reference=file://... --- needs to contain a reference to hg38/GRCh38 in the file path or name (numerical value is sufficient)
* \##contig=\<ID=chr1,length=248956422> --- for hg38/GRCh38
* \##DRAGENCommandLine= ... --ht-reference
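
As a rough pre-flight check, the following sketch reports which of these header rows are present in a VCF header. It is an illustration only: `build_header_hints` is a hypothetical name, the `##reference` test simply looks for "38" anywhere in the line, and how ICA Cohorts itself parses these rows is not documented here.

```python
import re

# chr1 length in hg38/GRCh38, taken from the contig example above.
HG38_CHR1_LENGTH = 248956422


def build_header_hints(header_lines):
    """Return the names of build-identifying header rows found (hg38 only)."""
    hints = []
    for line in header_lines:
        if line.startswith("##reference="):
            # Crude check: a numerical "38" in the path or name is enough.
            if "38" in line and "reference" not in hints:
                hints.append("reference")
        elif line.startswith("##contig="):
            is_chr1 = re.search(r"[<,]ID=(?:chr)?1[,>]", line)
            length = re.search(r"[<,]length=(\d+)", line)
            if is_chr1 and length and int(length.group(1)) == HG38_CHR1_LENGTH:
                if "contig" not in hints:
                    hints.append("contig")
        elif line.startswith("##DRAGENCommandLine=") and "--ht-reference" in line:
            if "DRAGENCommandLine" not in hints:
                hints.append("DRAGENCommandLine")
    return hints
```

An empty result suggests that the file lacks all three rows and may need a header fix before ingestion.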

ICA Cohorts accepts VCFs aligned to hg38/GRCh38 and hg19/GRCh37. If your data uses hg19/GRCh37 coordinates, Cohorts will convert these to hg38/GRCh38 during the ingestion process \[see Reference 1]. Harmonizing data to one genome build facilitates searches across different private, shared, and public projects when building and analyzing a cohort. If your data contains a mixture of samples mapped to hg38 and hg19, please ingest these in separate batches, as each import job into Cohorts is limited to one genome build.

As an alternative to VCFs, ICA Cohorts accepts the JSON output of [Illumina Nirvana](https://illumina.github.io/NirvanaDocumentation/) for hg38/GRCh38-aligned data: small germline variants and somatic mutations, copy number variations, and other structural variants.

#### RNAseq file format

ICA Cohorts can process gene- and transcript-level quantification files produced by the Illumina DRAGEN RNA pipeline. The file naming convention needs to match `.quant.genes.sf` for gene-level TPM and `.quant.sf` for transcript-level TPM (transcripts per million).

Please also see the online documentation for the [Illumina DRAGEN RNA Pipeline](https://support-docs.illumina.com/SW/dragen_v42/Content/SW/DRAGEN/GeneExpressionQuantification.htm) for more information on output file formats.
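
A minimal sketch of the naming convention, assuming the two patterns are file-name suffixes (the text above only says the names need to "match" them). `quant_level` is a hypothetical helper for sorting files before selection.

```python
def quant_level(filename):
    """Classify a DRAGEN RNA quantification file by its name.

    Returns "gene", "transcript", or None when neither pattern applies.
    Treating the patterns as suffixes is an assumption.
    """
    if filename.endswith(".quant.genes.sf"):
        return "gene"
    if filename.endswith(".quant.sf"):
        return "transcript"
    return None
```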

#### GWAS file format

ICA Cohorts currently supports upload of SNV-level GWAS results produced by [Regenie](https://rgcgithub.github.io/regenie/) and saved as CSV files.

### Metadata and File Types

| **Field**              | **Description**                                                                                                                                                                                                                                                                                     |
| ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Project name           | The ICA project for your cohort analysis (cannot be changed.)                                                                                                                                                                                                                                       |
| Study name             | Create or select a study. Each study represents a subset of data within the project.                                                                                                                                                                                                                |
| Description            | Short description of the data set (optional).                                                                                                                                                                                                                                                       |
| Job type               | **Append**: Appends values to any existing values. If a field supports only a single value, the value is replaced.                                                                                                                                                                                  |
|                        | **Replace**: Overwrites existing values with the values in the uploaded file.                                                                                                                                                                                                                       |
| Subject metadata files | <p>Subject metadata file(s) in tab-delimited format.<br>For <strong>Append</strong> and <strong>Replace</strong> job types, the following fields are required and cannot be changed:<br>- Sample identifier<br>- Sample display name<br>- Subject identifier<br>- Subject display name<br>- Sex</p> |
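
The required subject metadata fields can be checked before upload with a sketch like the one below. The column headers in `REQUIRED_COLUMNS` are hypothetical stand-ins for the five required fields in the table; replace them with the headers your metadata template actually uses.

```python
import csv
import io

# Hypothetical headers for: sample identifier, sample display name,
# subject identifier, subject display name, and sex.
REQUIRED_COLUMNS = [
    "sample_id",
    "sample_display_name",
    "subject_id",
    "subject_display_name",
    "sex",
]


def check_metadata(tsv_text, required=REQUIRED_COLUMNS):
    """Return a list of problems found in a tab-delimited metadata file."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    problems = [f"missing column: {col}" for col in required if col not in header]
    if problems:
        return problems
    # Data rows start on line 2, after the header line.
    for line_no, row in enumerate(reader, start=2):
        for col in required:
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty {col}")
    return problems
```

An empty list means every required column is present and filled in for every row; it does not validate the values themselves.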

{% hint style="info" %}
If you are annotating large sets of samples with molecular data, expect the annotation process to take over 20 minutes per whole-genome batch of samples. You will receive two e-mail notifications: one when your ingestion starts and one when it completes successfully or fails.
{% endhint %}

As an alternative to ICA Cohorts' metadata file format, you can provide files formatted according to the [OMOP common data model 5.4](http://ohdsi.github.io/CommonDataModel/cdm54.html). Cohorts currently ingests data for these OMOP 5.4 tables, formatted as tab-delimited files:

* PERSON (mandatory),
* CONCEPT (mandatory if any of the following is provided),
* CONDITION\_OCCURRENCE (optional),
* DRUG\_EXPOSURE (optional), and
* PROCEDURE\_OCCURRENCE (optional).

Additional files such as measurement and observation will be supported in a subsequent release of Cohorts.
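
The table requirements above can be expressed as a small check on the set of OMOP tables you plan to upload. This is a sketch under the rules stated in the list; `check_omop_tables` is a hypothetical helper, not part of ICA Cohorts.

```python
OPTIONAL_TABLES = {"CONDITION_OCCURRENCE", "DRUG_EXPOSURE", "PROCEDURE_OCCURRENCE"}
MANDATORY_TABLES = {"PERSON", "CONCEPT"}


def check_omop_tables(provided):
    """Return a list of problems for a collection of OMOP table names."""
    tables = {name.upper() for name in provided}
    problems = []
    if "PERSON" not in tables:
        problems.append("PERSON is mandatory")
    # CONCEPT is only required when at least one optional table is supplied.
    if tables & OPTIONAL_TABLES and "CONCEPT" not in tables:
        problems.append("CONCEPT is mandatory when an optional table is provided")
    for name in sorted(tables - OPTIONAL_TABLES - MANDATORY_TABLES):
        problems.append(f"{name} is not currently ingested")
    return problems
```

For example, `check_omop_tables(["PERSON", "DRUG_EXPOSURE"])` flags the missing CONCEPT table, while `check_omop_tables(["PERSON"])` returns an empty list.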

{% hint style="info" %}
Cohorts requires all such files to conform to the OMOP CDM 5.4 standard. Depending on your implementation, you may have to adjust file formatting to be OMOP CDM 5.4-compatible.
{% endhint %}

## References

\[1] VcfMapper: <https://stratus-documentation-us-east-1-public.s3.amazonaws.com/downloads/cohorts/main_vcfmapper.py>

\[2] CrossMap: <https://crossmap.sourceforge.net/>

\[3] liftOver: <https://genome.ucsc.edu/cgi-bin/hgLiftOver>

\[4] Chain files: <ftp://ftp.ensembl.org/pub/assembly_mapping/homo_sapiens/>

