
Base

hashtag
Introduction to Base

Base is a genomics data aggregation and knowledge management solution suite. It is a secure, scalable and integrated genomics data analysis solution that provides information management and knowledge mining. You can analyze, aggregate and query data for new insights that can inform and improve diagnostic assay development, clinical trials, patient testing and patient care. Clinically relevant data needs to be generated and extracted from routine clinical testing, and clinical questions need to be asked across all data and information sources. As a large data store, Base provides a secure and compliant environment in which to accumulate data and efficiently explore the aggregated result. This data consists of test results, patient data, metadata, reference data, consent and QC data.

hashtag
Use Cases

Base can be used for different use cases:

  • Clinical and Academic Researchers:

    • Big data storage solution housing all aggregated sample test outcomes

    • Analyze information by way of a convenient query formalism

    • Look for signals in combined phenotypic and genotypic data

    • Analyze QC patterns over large cohorts of patients

    • Securely share (sub)sets of data with other scientists

    • Generate reports and analyze trends in a straightforward and simple manner

  • Bioinformaticians:

    • Access, consult, audit, and query all relevant data and QC information for tests run

    • All accumulated data and accessible pipelines can be used to investigate and improve bioinformatics for clinical analysis

    • Metadata is captured via automatic pipeline version tracking: information on individual tools and/or reference files used during processing for each sample analyzed, the duration of the pipeline, the execution path of the different analytical steps, or, in case of failure, exit codes can be warehoused

  • Product Developers and Service Providers:

    • Better understand the efficiency of kits and tests

    • Analyze usage, understand QC data trends, improve products

    • Store and aggregate business intelligence data such as lab identification, consumption patterns and frequency, and render test result outcome trends and much more

hashtag
Base Action Possibilities

  • Data Warehouse Creation: Build a relational database for your Project in which desired data sets can be selected and aggregated. Typical data sets include pipeline output metrics and other suitable data files generated by the ICA platform which can be complemented by additional public (or privately built) databases.

  • Report and Export: Once created, a data warehouse can be mined using standard database query instructions. All Base data is stored in a structured and easily accessible way. An interface allows for the selection of specific datasets and conditional reporting. All queries can be stored, shared, and re-used in the future. This type of standard functionality supports most expected basic mining operations, such as variant frequency aggregation (see the sketch after this list). All result sets can be downloaded or exported in various standard data formats for integration in other reporting or analytical applications.

  • Detect Signals and Patterns: Extensive and detailed selection of subsets of patients or samples adhering to any imaginable set of conditions is possible. Users can, for example, group and list subjects based on a combination of (several) specific genetic variants in combination with patient characteristics such as therapeutic (outcome) information. The built-in integration with public datasets allows users to retrieve all relevant publications, or clinically significant information, for a single individual or a group of samples with a specific variant. Virtually any combination of stored sample and patient information allows for detecting signals and patterns with a single query on the big data set.

  • Profile/Cluster Patients: Use and re-analyze patient cohort information based on specific sample or individual characteristics, for instance to run a next, agile iteration of a clinical trial with only the patients that respond. Through integrated and structured consent information allowing for time-boxed use, combined with the capability to group subjects by the use of a simple query, patients can be stratified and combined to export all relevant individuals with their genotypic and phenotypic information for further research.

  • Share Your Data: Data sharing is subject to strict ethical and regulatory requirements. Base provides built-in functionality to securely share (sub)sets of your aggregated data with third parties. All data access can be monitored and audited; in this way Base data can be shared with people in and outside of an organization in a compliant and controlled fashion.
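As an illustration of such a mining operation, the following is a minimal sketch of a variant frequency aggregation. The table and column names (SAMPLE_VARIANTS, GENE, VARIANT, SAMPLE_ID) are hypothetical placeholders and not part of the platform:

    SELECT GENE, VARIANT, COUNT(DISTINCT SAMPLE_ID) AS observed_samples
    FROM SAMPLE_VARIANTS
    GROUP BY GENE, VARIANT
    ORDER BY observed_samples DESC;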

hashtag
Access

The Base module can be found at Projects > your_project > Base. In order to use Base, you need to meet the following requirements:

hashtag
Subscription

On the domain level, Base needs to be included in the subscription (full and premium subscriptions give access to Base).

hashtag
Enabling Base

Once a project is created, the project owner must navigate to Projects > your_project > Base and click the Enable button. From that moment on, every user who has the proper permissions has access to the Base module in that project.

hashtag
Enabling User Access

Access to the projects and Base is configured on the Projects > your_project > Project Settings > Team page. Here you can add or edit a user or workgroup and give them the appropriate Base permissions.

hashtag
Activity

The status and history of Base activities and jobs are shown on the Activity page.


    Tables

    All tables created within Base are gathered on the Projects > your_project > Base > Tables page. New tables can be created and existing tables can be updated or deleted here.

    hashtag
    Create a new Table

    To create a new table, click Projects > your_project > Base > Tables > +Create. Tables can be created from scratch or from a template that was previously saved. Views on data from Illumina hardware and processes are selected with the option Import from catalogue.

    circle-info

    If you make a mistake in the order of columns when creating your table, then as long as you have not saved your table, you can switch to Edit definition to change the column order. The text editor can swap or move columns whereas the built-in editor can only delete columns or add columns to the end of the sequence. When editing in text mode, it is best practice to copy the content of the text editor to a notepad before you make changes because a corrupted syntax will result in the text being wiped or reverted when switching between text and non-text mode.

    triangle-exclamation

Once a table is saved, it is no longer possible to edit the schema; only new fields can be added. The workaround is switching to text mode, copying the schema of the table to which you want to make modifications, and pasting it into a new empty table where the necessary changes can be made before saving.

    triangle-exclamation

Once created, do not try to modify your table column layout via the Query module: even though you can execute ALTER TABLE commands, the table definition and the actual table will go out of sync, resulting in processing issues.

    circle-exclamation

Be careful when naming tables that you want to use in bundles. Table names have to be unique per bundle, so no two tables with the same name can be part of the same bundle.

    hashtag
    Empty Table

    To create a table from scratch, complete the fields listed below and click the Save button. Once saved, a job will be created to create the table. To view table creation progress, navigate to the Activity page.

    hashtag
    Table information

The table name is a required field and must be unique. The first character of the table name must be a letter, followed by letters, numbers or underscores. The description is optional.

    hashtag
    References

Including or excluding references can be done by checking or un-checking the Include reference checkbox. These reference fields are not shown on the table creation page, but are added to the schema definition, which is visible after creating the table (Projects > your_project > Base > Tables > your_table > Schema definition). By including references, additional columns will be added to the schema which can contain references to the data on the platform (see the example query after this list):

    • data_reference: reference to the data element in the Illumina platform from which the record originates

    • data_name: original name of the data element in the Illumina platform from which the record originates

    • sample_reference: reference to the sample in the Illumina platform from which the record originates

    • sample_name: name of the sample in the Illumina platform from which the record originates

    • pipeline_reference: reference to the pipeline in the Illumina platform from which the record originates

    • pipeline_name: name of the pipeline in the Illumina platform from which the record originates

    • execution_reference: reference to the pipeline execution in the Illumina platform from which the record originates

    • account_reference: reference to the account in the Illumina platform from which the record originates

    • account_name: name of the account in the Illumina platform from which the record originates
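    For example, a minimal sketch of using these reference columns to trace aggregated records back to their source, assuming a hypothetical table named MY_RESULTS that was created with the Include reference checkbox checked:

        SELECT sample_name, pipeline_name, data_reference, COUNT(*) AS record_count
        FROM MY_RESULTS
        GROUP BY sample_name, pipeline_name, data_reference
        ORDER BY record_count DESC;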

    hashtag
    Schema

    In an empty table, you can create a schema by adding a field with the +Add button for each column of the table and defining it. At any time during the creation process, it is possible to switch to the edit definition mode and back. The definition mode shows the JSON code, whereas the original view shows the fields in a table.

    Each field requires:

    • a unique name (*1) with optional description.

    • a type

      • String – collection of characters

      • Bytes – raw binary data

      • Integer – whole numbers

      • Float – fractional numbers (*2)

      • Numeric – any number (*3)

      • Boolean – only options are "true" or "false"

      • Timestamp – stores the number of (milli)seconds passed since the Unix epoch

      • Date – stores a date in the format YYYY-MM-DD

      • Time – stores a time in the format HH:MI:SS

      • Datetime – stores date and time information in the format YYYY-MM-DD HH:MI:SS

      • Record – has a child field

      • Variant – can store a value of any other type, including OBJECT and ARRAY

    • a mode

      • Required – mandatory field

      • Nullable – field is allowed to have no value

      • Repeated – multiple values are allowed in this field (will be recognized as an array in Snowflake)

    circle-exclamation

    (*1) Do not use reserved Snowflake keywords such as left, right, sample, select, table,... (https://docs.snowflake.com/en/sql-reference/reserved-keywords) for your field names, as this will lead to SQL compilation errors.

    circle-info

    (*2) Float values will be exported differently depending on the output format. For example, JSON uses scientific notation, so verify that your downstream processing methods support this.

    circle-info

    (*3) Defining the precision when creating tables with SQL is not supported as this will result in rounding issues.

    hashtag
    From template

    Users can create their own template by making a table which is turned into a template at Projects > your_project > Base > Tables > your_table > Manage (top right) > Save as template.

    If a template is created and available/active, it is possible to create a new table based on this template. The table information and references follow the rules of the empty table but in this case the schema will be pre-filled. It is possible to still edit the schema that is based on the template.

    hashtag
    Table information

    hashtag
    Table status

    The status of a table can be found at Projects > your_project > Base > Tables. The possible statuses are:

    • Available: Ready to be used, both with or without data

    • Pending: The system is still processing the table, there is probably a process running to fill the table with data

    • Deleted: The table is deleted functionally; it still exists and can be shown in the list again by clicking the Show deleted tables/views button

    Additional Considerations

    • Tables created from empty data or from a template are available faster.

    • When copying a table with data, it can remain in a Pending state for a longer period of time.

    • Clicking on the page's refresh button will update the list.

    hashtag
    Table details

    For any available table, the following details are shown:

    • Table information: Name, description, status, number of records and data size.

    circle-info

    The data size of tables with the same layout and content may vary slightly, depending on when and how the data was written by Snowflake.

    • Definition: An overview of the table schema, also available in text. Fields can be added to the schema but not deleted. Tip for deleting fields: copy the schema as text and paste in a new empty table where the schema is still editable.

    • Preview: A preview of the first 50 rows of the table (when data is uploaded into the table). Select Show details to see record details.

    • Source Data: The files that are currently uploaded into the table. You can see the Load Status of the files, which can be Prepare Started, Prepare Succeeded or Prepare Failed, and finally Load Succeeded or Load Failed.

    hashtag
    Table actions

    From within the details of a table it is possible to perform the following actions from the Manage menu (top right) of the table:

    • Edit: Add fields to the table and change the table description.

    • Copy: Create a copy of this table in the same or a different project. In order to copy to another project, data sharing must be enabled in the details of the original project. The user also has to have access to both the original and the target project.

    • Export as file: Export this table as a CSV or JSON file. The exported file can be found in the project data, where the user can download it.

    • Save as template: Save the schema, or an edited form of it, as a template.

    • Add data: Load additional data into the table manually. This can be done by selecting data files previously uploaded to the project, or by dragging and dropping files directly into the popup window for adding data to the table. It is also possible to load data into a table manually or automatically via a pre-configured job. This can be done on the Schedule page.

    • Delete: Delete the table.

    hashtag
    Manually importing data to your Table

    To manually add data to your table, go to Projects > your_project > Base > Tables > your_table > Manage (top right) > Add Data

    hashtag
    Data selection

    The data selection screen will show options to select the structure as CSV (comma-separated), TSV (tab-separated) or JSON (JavaScript Object Notation) and the location of your source data. In the first step, you select the data format and the files containing the data.

    • Data format (required): Select the format of the data which you want to import.

    • Write preference: Define if data can be written to the table only when the table is empty, if the data should be appended to the table or if the table should be overwritten.

    • Delimiter: Which delimiter is used in the delimiter-separated file. If the required delimiter is not comma, tab or pipe, select custom and define the custom delimiter.

    • Custom delimiter: If a custom delimiter is used in the source data, it must be defined here.

    • Header rows to skip: The number of consecutive header rows (at the top of the table) to skip.

    • References: Choose which references must be added to the table.

    Most of the advanced options are legacy functions and should not be used. The only exceptions are:

    • Encoding: Select if the encoding is UTF-8 (any Unicode character) or ISO-8859-1 (first 256 Unicode characters).

    • Ignore unknown values: This applies to CSV-formatted files. You can use this function to handle optional fields without separators, provided that the missing fields are located at the end of the row. Otherwise, the parser can not detect the missing separator and will shift fields to the left, resulting in errors.

      • If headers are used: The columns that have matching fields are loaded, those that have no matching fields are loaded with NULL and remaining fields are discarded.

      • If no headers are used: The fields are loaded in order of occurrence; trailing missing fields are loaded with NULL and trailing additional fields are discarded.

    hashtag
    Data import progress

    To see the status of your data import, go to Projects > your_project > Activity > Base Jobs, where you will see a job of type Prepare Data which will have succeeded or failed. If it has failed, you can see the error message and details by double-clicking the Base job. You can then take corrective action if the input did not match the table design and run the import again (with a new copy of the file, as each input file can only be used once).

    If you need to cancel the import, you can do so while it is scheduled by navigating to the Base Jobs inventory and selecting the job followed by Abort.

    hashtag
    List of table data sources

    To see which data has been used to populate your table, go to Projects > your_project > Base > Tables > your_table > Source Data. This lists all the source data files, including those that failed to be imported. To prevent double entries, these files can not be used again for import. The load status remains empty while the data is being processed and is set to Load Succeeded or Load Failed after loading completes.

    hashtag
    How to load array data in Base

    Base Table schema definitions do not include an array type, but arrays can be ingested using either the Repeated mode for arrays containing a single type (i.e., String), or the Variant type.
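    As an illustration, a minimal sketch of querying such a column with Snowflake's FLATTEN function, assuming a hypothetical table MY_ARRAY_TABLE with a name column and a tags column defined as Repeated (or Variant) and holding an array of strings:

        SELECT t.name, f.value::STRING AS tag
        FROM MY_ARRAY_TABLE t,
             LATERAL FLATTEN(input => t.tags) f;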

    hashtag
    Parsing nested JSON data

    If you have a nested JSON structure, you can import it into individual fields of your table. For example, suppose your nested JSON structure looks like this:

    {
        "one": {
          "a": "1",
          "b": "1"
        },
        "three": {
          "a": "3",
          "b": "3",
          "c": "3"
        }
    }

    If you want to get it imported into a table with a, b and c having integers as values, you need to create a matching table. This can be done either manually or via the SQL command CREATE OR REPLACE TABLE json_data ( a INTEGER, b INTEGER, c INTEGER);

    Format your JSON data to have single lines per structure:

    {"A":1,"B":1}
    {"A":3,"B":3,"C":3}

    Finally, create a schedule to import your data or perform a manual import.

    The resulting table will look like this:

    # | A | B | C
    1 | 1 | 1 |
    2 | 3 | 3 | 3


    Query

    Queries can be used for data mining. On the Projects > your_project > Base > Query page:

    • New queries can be created and executed

    • Already executed queries can be found in the query history

    Saved queries and query templates are listed under the saved queries tab.

    hashtag
    New Query

    hashtag
    Available tables

    All available tables are listed on the Run tab.

    circle-info

    Metadata tables are created by syncing with the Base module. This synchronization is configured on the Details page within the project.

    hashtag
    Input

    Queries are executed using SQL (for example Select * From table_name). When there is a syntax issue with the query, the error will be displayed on the query screen when trying to run it. The query can be immediately executed or saved for future use.

    hashtag
    Best practices and notes

    triangle-exclamation

    Do not use queries such as ALTER TABLE to modify your table structure as it will go out of sync with the table definition and will result in processing errors.

    • When you have duplicate column names in your query, put the columns explicitly in the select clause and use column aliases for columns with the same name (see the sketch after this list).

    • Case sensitive column names (such as the VARIANTS table) must be surrounded by double quotes. For example, select * from MY_TABLE where "PROJECT_NAME" = 'MyProject'.

    • The syntax for ICA case-sensitive subfields is without quotes, for example select * from MY_TABLE where ica:Tenant = 'MyTenant'. As these are case sensitive, the upper- and lowercasing must be respected.

    • If you want to query data from a table shared from another tenant (indicated in green), select the table (Projects > your_project > Base > Tables > your_table) to see its unique name. For a shared table named TestFiles, for example, the query would be select * from demo_alpha_8298.public.TestFiles

    • For more information on queries, please also see the Snowflake documentation: https://docs.snowflake.com/en/user-guide/
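    A minimal sketch combining the notes above: explicit aliases for duplicate column names, double quotes around case-sensitive columns, and unquoted ICA subfields. MY_TABLE and its ica column come from the examples above; it is assumed, for illustration only, that the shared demo_alpha_8298.public.TestFiles table also has a case-sensitive PROJECT_NAME column:

        SELECT a."PROJECT_NAME" AS project_name_a,
               b."PROJECT_NAME" AS project_name_b,
               a.ica:Tenant::STRING AS tenant
        FROM MY_TABLE a
        JOIN demo_alpha_8298.public.TestFiles b
          ON a."PROJECT_NAME" = b."PROJECT_NAME"
        WHERE a.ica:Tenant = 'MyTenant';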

    hashtag
    Querying data within columns

    Some tables contain columns with an array of values instead of a single value.

    hashtag
    Querying data within an array

    circle-info

    As of ICA version 2.27, there is a change in the use of capitals for ICA array fields. In previous versions, the data name within the array would start with a capital letter. As of 2.27, lowercase is used. For example ICA:Data_reference has become ICA:data_reference.

    You can use the GET_IGNORE_CASE option to adapt existing queries when you have both data in the old syntax and new data in the lowercase syntax. The syntax is GET_IGNORE_CASE(Table_Name.Column_Name,'Array_field')

    For example:

    select ICA:Data_reference as MY_DATA_REFERENCE from TestTable becomes:

    select GET_IGNORE_CASE(TESTTABLE.ICA,'Data_reference') as MY_DATA_REFERENCE from TestTable

    You can also modify the data to have consistent capital usage by executing the query update YOUR_TABLE_NAME set ica = object_delete(object_insert(ica, 'data_name', ica:Data_name), 'Data_name') and repeating this process for all field names (Data_name, Data_reference, Execution_reference, Pipeline_name, Pipeline_reference, Sample_name, Sample_reference, Tenant_name and Tenant_reference).
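    For example, a sketch of the same pattern applied to the Data_reference field (repeat analogously for the remaining field names):

        update YOUR_TABLE_NAME set ica = object_delete(object_insert(ica, 'data_reference', ica:Data_reference), 'Data_reference');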

    Suppose you have a table called YOUR_TABLE_NAME consisting of three fields. The first is a name, the second is a code and the third field is an array of data called ArrayField:

    NameField | CodeField | ArrayField
    Name A | Code A | { "userEmail": "[email protected]", "bundleName": null, "boolean": false }
    Name B | Code B | { "userEmail": "[email protected]", "bundleName": "thisbundle", "boolean": true }

    Examples

    You can use the name field and code field to do queries by running

    Select * from YOUR_TABLE_NAME where NameField = 'Name A'.

    If you want to show specific data like the email and bundle name from the array, this becomes

    Select ArrayField:userEmail as User_Email, ArrayField:bundleName as Bundle_Name from YOUR_TABLE_NAME where NameField = 'Name A'.

    If you want to use data in the array as your selection criteria, the expression becomes

    Select ArrayField:userEmail as User_Email, ArrayField:bundleName as Bundle_Name from YOUR_TABLE_NAME where ArrayField:boolean = true.

    If your criterion is text in the array, use ' to delimit the text. For example:

    Select ArrayField:userEmail as User_Email, ArrayField:bundleName as Bundle_Name from YOUR_TABLE_NAME where ArrayField:userEmail = '[email protected]'.

    You can also use the LIKE operator with the % wildcard if you do not know the exact content.

    Select ArrayField:userEmail as User_Email, ArrayField:bundleName as Bundle_Name from YOUR_TABLE_NAME where ArrayField:userEmail LIKE '%A@server%'

    hashtag
    Query results

    If the query is valid for execution, the result will be shown as a table underneath the input box. Only the first 200 characters of a string, record or variant field are included in the query results grid. The complete value is available by clicking the Show details link.

    From within the result page of the query, it is possible to save the result in several ways:

    • Export to > New table saves the query result as a new table with contents.

    • Export to > New view saves the query results as a new view.

    • Export to > Project file saves the query result as a file in the project in CSV (tab, pipe or a custom delimiter are also allowed) or JSON format. When exporting in JSON format, the result will be saved in a text file that contains a JSON object for each entry, similar to when exporting a table. The exported file can be located on the Data page under the folder named base_export_<user_supplied_name>_<auto generated unique id>.

    hashtag
    Run a new query

    1. Navigate to Projects > your_project > Base > Query.

    2. Enter the query to execute using SQL.

    3. Select Run.

    4. Optionally, select Save to add the query to your saved queries list.

    If the query takes more than 30 seconds without returning a result, a message will be displayed to inform you that the query is still in progress; its status can be consulted at Projects > your_project > Activity > Base Jobs. Once the query has successfully completed, the results can be found on the Projects > your_project > Base > Query > Query History tab.

    hashtag
    Query history

    The query history lists all queries that were executed. Historical queries are shown with their date, executing user, returned rows and duration of the run.

    1. Navigate to Projects > your_project > Base > Query.

    2. Select the History tab.

    3. Select a query.

    4. Perform one of the following actions:

      • Use—Open the query for editing and running in the Run tab. You can then select Run to execute the query again.

      • Save —Save the query to the saved queries list.

      • View Results—Download the results from a query or export results to a new table, view, or file in the project. Results are available for 24 hours after the query is executed. To view results after 24 hours, you need to execute the query again.

    hashtag
    Saved Queries

    All queries saved within the project are listed under the Saved tab together with the query templates.

    The saved queries can be:

    • Used — The query is opened for editing and running in the Run tab. You can then select Run to execute the query again.

    • Saved as template — The saved query becomes a query template.

    • Deleted — The query is removed from the list and cannot be opened again.

    The query templates can be:

    • Opened: This will open the query again in the “New query” tab.

    • Deleted: The query is removed from the list and cannot be opened again.

    It is possible to edit the saved queries and templates by double-clicking on each query or template. Specifically for Query Templates, the data classification can be edited to be:

    • Account: The query template will be available for everyone within the account

    • User: The query template will be available for the user who created it

    hashtag
    Run a saved Query

    If you have saved a query, you can run the query again by selecting it from the list of saved queries.

    1. Navigate to Projects > your_project > Base > Query.

    2. Select the Saved Queries tab.

    3. Select a query.

    4. Select Open Query to open the query in the New Query tab from where it can be edited if needed and run by selecting Run Query.

    hashtag
    Shared database for project

    Shared databases are displayed under the list of Tables as Shared Database for project <project name>.

    For ICA Cohorts customers, shared databases are available in a project's Base instance. For more information on the specific Cohorts shared database tables that are viewable, see Cohorts Base.

    Snowflake

    hashtag
    User

    Every Base user has one Snowflake username: ICA_U_<id>

    hashtag
    User/Project-Bundle

    For each user/project-bundle combination a role is created: ICA_UR_<id>_<name project/bundle>__<id>

    This role receives the viewer or contributor role of the project/bundle, depending on their permissions in ICA.

    hashtag
    Roles

    Every project or bundle has a dedicated Snowflake database.

    For each database, 2 roles are created:

    • <project/bundle name>_<id>_VIEWER

    • <project/bundle name>_<id>_CONTRIBUTOR

    hashtag
    Project viewer role

    This role receives

    • REFERENCE and SELECT rights on the tables/views within the project's PUBLIC schema.

    • Grants on the viewer roles of the bundles linked to the project.

    hashtag
    Project contributor role

    This role receives the following rights on current and future objects in the PUBLIC schema of the project's/bundle's database:

    • ownership

    • select, insert, update, delete, truncate and references on tables/views/materialized views

    • usage on sequences/functions/procedures/file formats

    • write, read and usage on stages

    • select on streams

    • monitor and operate on tasks

    It also receives a grant on the viewer role of the project.
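    To verify these rights from a Snowflake session (for example one opened with snowsql and an OAuth token, as shown in the Warehouses section below), you can use the standard SHOW GRANTS command. The role name here reuses the example identifiers from this page and is illustrative only:

        show grants to role ATESTBASE2_264891_CONTRIBUTOR;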

    hashtag
    Warehouses

    For each project (not bundle!), 2 warehouses are created, whose size can be changed in ICA at Projects > your_project > Project Settings > Details.

    • <projectname>_<id>_QUERY

    • <projectname>_<id>_LOAD

    hashtag
    Using the LOAD instead of the QUERY warehouse

    When you generate an OAuth token, ICA always uses the QUERY warehouse by default (see the -w parameter in the connection string below):

    snowsql -a iap.us-east-1 -u ICA_U_277853 --authenticator=oauth -r ICA_UR_274853_603465_264891 -d atestbase2_264891 -s PUBLIC -w ATESTBASE2_264891_QUERY --token=<token>

    If you wish to use the LOAD warehouse in a session, you have two options:

    1. Change the warehouse name in the connection string: snowsql -a iapdev.us-east-1 -u ICA_U_277853 --authenticator=oauth -r ICA_UR_277853_603465_264891 -d atestbase2_264891 -s PUBLIC -w ATESTBASE2_264891_LOAD --token=<token>

    2. Execute the following statement after logging in: use warehouse ATESTBASE2_264891_LOAD

    To determine which warehouse you are using, execute: select current_warehouse();

    hashtag
    Synchronizing Tables

    If you have created tables directly in Snowflake with the OAuth token, you can synchronize them so that they appear in ICA by means of the Projects > your_project > Base > Tables > Sync button.
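    As an illustration, a minimal sketch of creating such a table in a snowsql session before synchronizing it. The table and column names are hypothetical; the warehouse and database names reuse the example from the Warehouses section above:

        use warehouse ATESTBASE2_264891_LOAD;
        use schema atestbase2_264891.PUBLIC;
        create or replace table EXTERNAL_RESULTS (
            sample_id    STRING,
            metric_name  STRING,
            metric_value FLOAT
        );
        -- After creating and loading the table, click Projects > your_project > Base > Tables > Sync in ICA.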

    Schedule

    On the Schedule page at Projects > your_project > Base > Schedule, it’s possible to create a job for importing different types of data you have access to into an existing table.

    When creating or editing a schedule, automatic import is performed when the Active box is checked. The job will run at 10-minute intervals. In addition, for both active and inactive schedules, a manual import is performed by selecting the schedule and clicking the Run button.

    hashtag
    Configure a schedule

    There are different types of schedules that can be set up:

    • Files

    • Metadata

    • Administrative data.

    hashtag
    Files

    This type will load the content of specific files from this project into a table. When adding or editing this schedule you can define the following parameters:

    • Name (required): The name of the scheduled job

    • Description: Extra information about the schedule

    • File name pattern (required): Define in this field a part or the full name of the file name or of the tag that the files you want to upload contain. For example, if you want to import files named sample1_reads.txt, sample2_reads.txt, … you can fill in _reads.txt in this field to have all files that contain _reads.txt imported to the table.

    • Generated by Pipelines: Only files generated by the selected pipelines are taken into account. When left clear, files from all pipelines are used.

    • Target Base Table (required): The table to which the information needs to be added. A drop-down list with all created tables is shown. This means the table needs to be created before the schedule can be created.

    • Write preference (required): Define if data can be written to the table only when the table is empty, if the data should be appended to the table or if the table should be overwritten.

    • Data format (required): Select the data format of the files (CSV, TSV, JSON).

    • Delimiter (required): Indicates which delimiter is used in the delimiter-separated file. If the delimiter is not present in the list, it can be indicated as custom.

    • Active: The job will run automatically if checked.

    • Custom delimiter: The custom delimiter that is used in the file. You can only enter a delimiter here if custom delimiter is selected.

    • Header rows to skip: The number of consecutive header rows (at the top of the table) to skip.

    • References: Choose which references must be added to the table.

    • Advanced Options:

      • Encoding (required): Select the encoding of the file.

      • Null Marker: Specifies a string that represents a null value in a CSV/TSV file.

      • Quote: The value (single character) that is used to quote data sections in a CSV/TSV file. When this character is encountered at the beginning and end of a field, it will be removed. For example, entering " as quote will remove the quotes from "bunny" and only store the word bunny itself.

      • Ignore unknown values: This applies to CSV-formatted files. You can use this function to handle optional fields without separators, provided that the missing fields are located at the end of the row. Otherwise, the parser can not detect the missing separator and will shift fields to the left, resulting in errors.

        • If headers are used: The columns that have matching fields are loaded, those that have no matching fields are loaded with NULL and remaining fields are discarded.

        • If no headers are used: The fields are loaded in order of occurrence and trailing missing fields are loaded with NULL, trailing additional fields are discarded.

    hashtag
    Metadata

    This type will create two new tables: BB_PROJECT_PIPELINE_EXECUTIONS_DETAIL and ICA_PROJECT_SAMPLE_META_DATA. The job will load metadata (added to the samples) into ICA_PROJECT_SAMPLE_META_DATA. The process gathers the metadata from the samples via the data linked to the project and the metadata from the analyses in this project. Furthermore, the scheduler will add provenance data to BB_PROJECT_PIPELINE_EXECUTIONS_DETAIL. This process gathers the execution details of all the analyses in the project: the pipeline name and status, the user reference, the input files (with identifiers), and the settings selected at runtime. This enables you to track the lineage of your data and to identify any potential sources of errors or biases. For example, the following query counts how many times each pipeline was executed and sorts the result accordingly:

    SELECT PIPELINE_NAME, COUNT(*) AS Appearances
    FROM BB_PROJECT_PIPELINE_EXECUTIONS_DETAIL
    GROUP BY PIPELINE_NAME
    ORDER BY Appearances DESC;

    To obtain a similar table for the failed runs, you can execute the following SQL query:

    SELECT PIPELINE_NAME, COUNT(*) AS Appearances
    FROM BB_PROJECT_PIPELINE_EXECUTIONS_DETAIL
    WHERE PIPELINE_STATUS = 'Failed'
    GROUP BY PIPELINE_NAME
    ORDER BY Appearances DESC;

    When adding or editing this schedule you can define the following parameters:

    • Name (required): the name of this scheduled job.

    • Description: Extra information about the schedule.

    • Include sensitive metadata fields: In the metadata fields configuration, fields can be set to sensitive. When checked, those fields will also be added.

    • Active: The job will run automatically if ticked.

    hashtag
    Administrative data

    This type will automatically create a table and load administrative data into this table. A usage overview of all executions is considered administrative data.

    When adding or editing this schedule the following parameters can be defined:

    • Name (required): The name of this scheduled job.

    • Description: Extra information about the schedule.

    • Include sensitive metadata fields: In the metadata fields configuration, fields can be set to sensitive. When checked, those fields will also be added.

    • Active: The job will run automatically if checked.

    • Source (Tenant Administrators Only):

      • Project (default): All administrative data from this project will be added.

      • Account: All administrative data from every project in the account will be added. When a tenant admin creates the tenant-wide table with administrative data in a project and invites other users to this project, these users will see this table as well.

    hashtag
    Delete schedule

    Schedules can be deleted. Once deleted, they will no longer run, and they will not be shown in the list of schedules.

    hashtag
    Run schedule

    When clicking the Run button, or Save & Run when editing, the schedule will start the job of importing the configured data in the correct tables. This way the schedule can be run manually. The result of the job can be seen in the tables. The load status is empty while the data is being processed and set to failed or succeeded once loading completes.

    Data Catalogue

    Data Catalogues provide views on data from Illumina hardware and processes (Instruments, Cloud software, Informatics software and Assays) so that this data can be distributed to different applications. This data consists of read-only tables to prevent updates by the applications accessing it. Access to data catalogues is included with professional and enterprise subscriptions.

    hashtag
    Available views

    • Project-level views

      • ICA_PIPELINE_ANALYSES_VIEW (Lists project-specific ICA pipeline analysis data)

      • ICA_DRAGEN_QC_METRIC_ANALYSES_VIEW (project-specific quality control metrics)


  • Tenant-level views

    • ICA_PIPELINE_ANALYSES_VIEW (Lists ICA pipeline analysis data)

    • CLARITY_SEQUENCINGRUN_VIEW_tenant (sequencing run data coming from the lab workflow software)

    • CLARITY_SAMPLE_VIEW_tenant (sample data coming from the lab workflow software)

    • CLARITY_LIBRARY_VIEW_tenant (library data coming from the lab workflow software)

    • CLARITY_EVENT_VIEW_tenant (event data coming from the lab workflow software)

    • ICA_DRAGEN_QC_METRIC_ANALYSES_VIEW (quality control metrics)

    hashtag
    Preconditions for view content

    • DRAGEN metrics will only have content when DRAGEN pipelines have been executed.

    • Analysis views will only have content when analyses have been executed.

    • Views containing Clarity data will only have content if you have a Clarity LIMS instance with minimum version 6.0 and the Product Analytics service installed and configured. Please see the Clarity LIMS documentationarrow-up-right for more information.

    • When you use your own AWS S3 storage in a project, metrics can not be collected and thus the DRAGEN METRICS-related views can not be used.

    hashtag
    Who can add or remove Catalogue data (views) to a project?

    Members of a project who have both Base Contributor and Project Contributor or Administrator rights, and who belong to the same tenant as the project, can add views from a Catalogue. Members of a project with the same rights who do not belong to the same tenant can only remove the catalogue views from a project. Therefore, if you are invited to collaborate on a project but belong to a different tenant, you can remove catalogue views, but cannot add them again.

    hashtag
    Adding Catalogue data (views) to your project

    To add Catalogue data,

    1. Go to Projects > your_project > Base > Tables.

    2. Select Add table > Import from Catalogue.

    3. A list of available views will be displayed. (Note that views which are already part of your project are not listed)

    4. Select the table you want to add and choose +Select

    Catalogue data will have View as type, the same as tables which are linked from other projects.

    hashtag
    Removing Catalogue data (views) from your project

    To delete Catalogue data,

    1. go to Projects > your_project > Base > Tables.

    2. Select the table you want to delete and choose Delete.

    3. A warning will be presented to confirm your choice. Once deleted, you can add the Catalogue data again if needed.

    hashtag
    Catalogue table details (Catalogue Table Selection Screen)

    • View: The name of the Catalogue table.

    • Description: An explanation of which data is contained in the view.

    • Category: The identification of the source system which provided the data.

    • Tenant/project: Appended to the view name as _tenant or _project. Determines whether the data is visible for all projects within the same tenant or only within the project. Only the tenant administrator can see the non-project views.

    hashtag
    Catalogue table details (Table Schema Definition)

    In the Projects > your_project > Base > Tables view, double-click the Catalogue table to see the details. For an overview of the available actions and details, see Tables.

    hashtag
    Querying views

    In this section, we provide examples of querying selected views from the Base UI, starting with ICA_PIPELINE_ANALYSES_VIEW (project view). This table includes the following columns: TENANT_UUID, TENANT_ID, TENANT_NAME, PROJECT_UUID, PROJECT_ID, PROJECT_NAME, USER_UUID, USER_NAME, and PIPELINE_ANALYSIS_DATA. While the first eight columns contain straightforward data types (each holding a single value), the PIPELINE_ANALYSIS_DATA column is of type VARIANT, which can store multiple values in a nested structure. In SQL queries, this column returns data as a JSON object. To filter specific entries within this complex data structure, a combination of JSON functions and conditional logic in SQL queries is essential.

    Since Snowflake offers robust JSON processing capabilities, the FLATTEN function can be utilized to expand JSON arrays within the PIPELINE_ANALYSIS_DATA column, allowing for the filtering of entries based on specific criteria. It's important to note that each entry in the JSON array becomes a separate row once flattened. Snowflake aligns fields outside of this FLATTEN operation accordingly, i.e. the USER_NAME value in the SQL query below is "recycled" for each flattened row.

    The following query extracts

    • USER_NAME directly from the ICA_PIPELINE_ANALYSES_VIEW_project table.

    • PIPELINE_ANALYSIS_DATA:reference and PIPELINE_ANALYSIS_DATA:price. These are direct accesses into the JSON object stored in the PIPELINE_ANALYSIS_DATA column. They extract specific values from the JSON object.

    • Entries from the array 'steps' in the JSON object. The query uses LATERAL FLATTEN(input => PIPELINE_ANALYSIS_DATA:steps) to expand the steps array within the PIPELINE_ANALYSIS_DATA JSON object into individual rows. For each of these rows, it selects various elements (like bpeResourceLifeCycle, bpeResourcePresetSize, etc.) from the JSON.

    Furthermore, the query filters the rows based on the status being 'FAILED' and the stepId not containing the word 'Workflow', which allows the user to find the steps that failed:

    SELECT
        USER_NAME as user_name,
        PIPELINE_ANALYSIS_DATA:reference as reference,
        PIPELINE_ANALYSIS_DATA:price as price,
        PIPELINE_ANALYSIS_DATA:totalDurationInSeconds as duration,
        f.value:bpeResourceLifeCycle::STRING as bpeResourceLifeCycle,
        f.value:bpeResourcePresetSize::STRING as bpeResourcePresetSize,
        f.value:bpeResourceType::STRING as bpeResourceType,
        f.value:completionTime::TIMESTAMP as completionTime,
        f.value:durationInSeconds::INT as durationInSeconds,
        f.value:price::FLOAT as price,
        f.value:pricePerSecond::FLOAT as pricePerSecond,
        f.value:startTime::TIMESTAMP as startTime,
        f.value:status::STRING as status,
        f.value:stepId::STRING as stepId
    FROM
        ICA_PIPELINE_ANALYSES_VIEW_project,
        LATERAL FLATTEN(input => PIPELINE_ANALYSIS_DATA:steps) f
    WHERE
        f.value:status::STRING = 'FAILED'
        AND f.value:stepId::STRING NOT LIKE '%Workflow%';

    Now let's have a look at the DRAGEN_METRICS_VIEW_project view. Each DRAGEN pipeline on ICA creates multiple metrics files, e.g. SAMPLE.mapping_metrics.csv, SAMPLE.wgs_coverage_metrics.csv, etc. for the DRAGEN WGS Germline pipeline. Each of these files is represented by a row in the DRAGEN_METRICS_VIEW_project table with columns ANALYSIS_ID, ANALYSIS_UUID, PIPELINE_ID, PIPELINE_UUID, PIPELINE_NAME, TENANT_ID, TENANT_UUID, TENANT_NAME, PROJECT_ID, PROJECT_UUID, PROJECT_NAME, FOLDER, FILE_NAME, METADATA, and ANALYSIS_DATA. The ANALYSIS_DATA column contains the content of the file FILE_NAME as an array of JSON objects. Similarly to the previous query, we will use the FLATTEN command. The following query extracts

    • Sample name from the file names.

    • Two metrics 'Aligned bases in genome' and 'Aligned bases' for each sample and the corresponding values.

    The query looks for files SAMPLE.wgs_coverage_metrics.csv only and sorts based on the sample name:

    SELECT DISTINCT
        SPLIT_PART(FILE_NAME, '.wgs_coverage_metrics.csv', 1) as sample_name,
        f.value:column_2::STRING as metric,
        f.value:column_3::FLOAT as value
    FROM
        DRAGEN_METRICS_VIEW_project,
        LATERAL FLATTEN(input => ANALYSIS_DATA) f
    WHERE
        FILE_NAME LIKE '%wgs_coverage_metrics.csv'
        AND (
            f.value:column_2::STRING = 'Aligned bases in genome'
            OR f.value:column_2::STRING = 'Aligned bases'
        )
    ORDER BY
        sample_name;

    Lastly, you can combine these views (or rather intermediate results derived from these views) using the WITH and JOIN commands. The SQL snippet below demonstrates how to join two intermediate results referred to as 'flattened_dragen_scrna' and 'pipeline_table'. The query:

    • Selects two metrics ('Invalid barcode read' and 'Passing cells') associated with single-cell RNA analysis from records where the FILE_NAME ends with 'scRNA.metrics.csv', and then stores these metrics in a temporary table named 'flattened_dragen_scrna'.

    • Retrieves metadata related to all scRNA analyses by filtering on the pipeline ID from the 'ICA_PIPELINE_ANALYSES_VIEW_project' view and stores this information in another temporary table named 'pipeline_table'.

    • Joins the two temporary tables using the JOIN operator, specifying the join condition with the ON operator.

    WITH flattened_dragen_scrna AS (
    SELECT DISTINCT
        SPLIT_PART(FILE_NAME, '.scRNA.metrics.csv', 1) as sample_name,
        ANALYSIS_UUID,
        f.value:column_2::STRING as metric,
        f.value:column_3::FLOAT as value
    FROM
        DRAGEN_METRICS_VIEW_project,
        LATERAL FLATTEN(input => ANALYSIS_DATA) f
    WHERE
        FILE_NAME LIKE '%scRNA.metrics.csv'
        AND (
            f.value:column_2::STRING = 'Invalid barcode read'
            OR f.value:column_2::STRING = 'Passing cells'
        )
    ),
    pipeline_table AS (
    SELECT
        PIPELINE_ANALYSIS_DATA:reference::STRING as reference,
        PIPELINE_ANALYSIS_DATA:id::STRING as analysis_id,
        PIPELINE_ANALYSIS_DATA:status::STRING as status,
        PIPELINE_ANALYSIS_DATA:pipelineId::STRING as pipeline_id,
        PIPELINE_ANALYSIS_DATA:requestTime::TIMESTAMP as start_time
    FROM
        ICA_PIPELINE_ANALYSES_VIEW_project
    WHERE
        PIPELINE_ANALYSIS_DATA:pipelineId = 'c9c9a2cc-3a14-4d32-b39a-1570c39ebc30'
        )
    SELECT * FROM flattened_dragen_scrna JOIN pipeline_table
    ON
         flattened_dragen_scrna.ANALYSIS_UUID = pipeline_table.analysis_id;

    hashtag
    An example of how to obtain the costs incurred by the individual steps of an analysis

    You can use ICA_PIPELINE_ANALYSES_VIEW to obtain the costs of individual steps of an analysis. Using the following SQL snippet, you can retrieve the costs of individual steps for every analysis run in the past week:

    SELECT
        USER_NAME as user_name,
        PROJECT_NAME as project,
        SUBSTRING(PIPELINE_ANALYSIS_DATA:reference, 1, 30) as reference,
        PIPELINE_ANALYSIS_DATA:status as status,
        ROUND(PIPELINE_ANALYSIS_DATA:computePrice,2) as price,
        PIPELINE_ANALYSIS_DATA:totalDurationInSeconds as duration,
        PIPELINE_ANALYSIS_DATA:startTime::TIMESTAMP as startAnalysis,
        f.value:bpeResourceLifeCycle::STRING as bpeResourceLifeCycle,
        f.value:bpeResourcePresetSize::STRING as bpeResourcePresetSize,
        f.value:bpeResourceType::STRING as bpeResourceType,
        f.value:durationInSeconds::INT as durationInSeconds,
        f.value:price::FLOAT as priceStep,
        f.value:status::STRING as status,
        f.value:stepId::STRING as stepId
    FROM
        ICA_PIPELINE_ANALYSES_VIEW_project,
        LATERAL FLATTEN(input => PIPELINE_ANALYSIS_DATA:steps) f
    WHERE
       PIPELINE_ANALYSIS_DATA:startTime > CURRENT_TIMESTAMP() - INTERVAL '1 WEEK'
    ORDER BY
       priceStep DESC;

    hashtag
    Limitations

    • Data Catalogue views cannot be shared as part of a Bundle.

    • Data size is not shown for views because views are a subset of data.

    • By removing Base from a project, the Data Catalogue will also be removed from that project.

    hashtag
    Best Practices

    As tenant-level Catalogue views can contain sensitive data, it is best to save this (filtered) data to a new table and share that table instead of sharing the entire view as part of a project. To do so, add your view to a separate project and run a query on the data at Projects > your_project > Base > Query > New Query. When the query completes, you can export the result as a new table. This ensures no new data will be added on subsequent runs.
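    For example, a minimal sketch of such a filtering query, assuming the tenant-level ICA_PIPELINE_ANALYSES_VIEW_tenant view shares the column layout described above for the project-level view; the project name is a placeholder:

    SELECT PROJECT_NAME, USER_NAME, PIPELINE_ANALYSIS_DATA:reference AS reference
    FROM ICA_PIPELINE_ANALYSES_VIEW_tenant
    WHERE PROJECT_NAME = 'MyProject';

    The result can then be exported via Export to > New table and that table shared instead of the view.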
