1 of 4

CWL CLI Workflow

In this tutorial, we will demonstrate how to create and launch a pipeline using the CWL language using the ICA command line interface (CLI).

Installation

Please refer to for installing ICA CLI.

Tutorial project

In this project, we will create two simple tools and build a workflow that we can run on ICA using CLI. The first tool (tool-fqTOfa.cwl) will convert a FASTQ file to a FASTA file. The second tool(tool-countLines.cwl) will count the number of lines in an input FASTA file. The workflow (workflow.cwl) will combine the two tools to convert an input FASTQ file to a FASTA file and count the number of lines in the resulting FASTA file.

Following are the two CWL tools and workflow scripts we will use in the project. If you are new to CWL, please refer to the cwl for a better understanding of CWL codes. You will also need the cwltool installed to create these tools and workflows. You can find installation instructions on the CWL page.

tool-fqTOfa.cwl

tool-countLines.cwl

workflow.cwl

[!IMPORTANT] Please note that we don't specify the Docker image used in both tools. In such a case, the default behaviour is to use public.ecr.aws/docker/library/bash:5 image. This image contains basic functionality (sufficient to execute wc and awk commands).

In case you want to use a different public image, you can specify it using requirements tag in cwl file. Assuming you want to use *ubuntu:latest' you need to add

In case you want to use a Docker image from the ICA Docker repository, you would need the link to AWS ECR from ICA GUI. Double-click on the image name in the Docker repository and copy the URL to the clipboard. Add the URL to dockerPull key.

Authentication

Enter/Create a Project

You can create a project or use an existing project for creating a new pipeline. You can create a new project using the "icav2 projects create" command.

If you do not provide the "--region" flag, the value defaults to the existing region when there is only one region available. When there is more than one region available, a selection must be made from the available regions at the command prompt. The region input can be determined by calling the "icav2 regions list" command first.

You can select the project to work on by entering the project using the "icav2 projects enter" command. Thus, you won't need to specify the project as an argument.

You can also use the "icav2 projects list" command to determine the names and ids of the project you have access to.

Create a pipeline on ICA

"projectpipelines" is the root command to perform actions on pipelines in a project. "create" command creates a pipeline in the current project.

The parameter file specifies the input for the workflow with additional parameter settings for each step in the workflow. In this tutorial, input is a FASTQ file shown inside <dataInput> tag in the parameter file. There aren't any specific settings for the workflow steps resulting in a parameter file below with an empty <steps> tag. Create a parameter file (parameters.xml) with the following content using a text editor.

The following command creates a pipeline called "cli-tutorial" using the workflow "workflow.cwl", tools "tool-fqTOfa.cwl" and "tool-countLines.cwl" and parameter file "parameter.xml" with small storage size.

Once the pipeline is created, you can view it using the "list" command.

Running the pipeline

The "icav2 projectdata upload" command lets you upload data to ica.

The "list" command lets you view the uploaded file. Note the ID of the file you want to use with the pipeline.

The "icav2 projectpipelines start" command initiates the pipeline run. The following command runs the pipeline. Note the id for exploring the analysis later.

Note: If for some reason your "create" command fails and needs to rerun, you might get an error (ConstraintViolationException). If so, try your command with a different name.

You can check the status of the run using the "icav2 projectanalyses get" command.

The pipelines can be run using JSON input type as well. The following is an example of running pipelines using JSON input type. Note that JSON input works only with file-based CWL pipelines (built using code, not a graphical editor in ICA).

Notes

runtime.ram and runtime.cpu

runtime.ram and runtime.cpu values are by default evaluated using the compute environment running the host CWL runner. CommandLineTool Steps within a CWL Workflow run on different compute environments than the host CWL runner, so the valuations of the runtime.ram and runtime.cpu for within the CommandLineTool will not match the runtime environment the tool is running in. The valuation of runtime.ram and runtime.cpu can be overridden by specifying cpuMin and ramMin in the ResourceRequirements for the CommandLineTool.

CWL Graphical Pipeline

This tutorial aims to guide you through the process of creating CWL tools and pipelines from the very beginning. By following the steps and techniques presented here, you will gain the necessary knowledge and skills to develop your own pipelines or transition existing ones to ICA.

Build and push to ICA your own Docker image

The foundation for every tool in ICA is a Docker image (externally published or created by the user). Here we present how to create your own Docker image for the popular tool (FASTQC).

Copy the contents displayed below to a text editor and save it as a Dockerfile. Make sure you use an editor which does not add formatting to the file.

FROM centos:7
WORKDIR /usr/local

# DEPENDENCIES
RUN yum -y install java-1.8.0-openjdk wget unzip perl && \
    yum clean all && \
    rm -rf /var/cache/yum

# INSTALLATION fastqc
RUN wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip --no-check-certificate && \
    unzip fastqc_v0.11.9.zip && \
    chmod a+rx /usr/local/FastQC/fastqc && rm -rf fastqc_v0.11.9.zip

# Adding FastQC to the PATH
ENV PATH $PATH:/usr/local/FastQC

# DEFAULTS
ENV LANG=en_US.UTF-8
ENV LC_ALL=en_US.UTF-8
ENTRYPOINT []

## how to build the docker image
## docker build --file fastqc-0.11.9.Dockerfile --tag fastqc-0.11.9:0 .
## docker run --rm -i -t --entrypoint /bin/bash fastqc-0.11.9:0

Open a terminal window, place this file in a dedicated folder and navigate to this folder location. Then use the following command:

docker build --file fastqc-0.11.9.Dockerfile --tag fastqc-0.11.9:1 .

Check the image has been successfully built:

docker images

Check that the container is functional:

docker run --rm -i -t --entrypoint /bin/bash fastqc-0.11.9:1

Once inside the container check that the fastqc command is responsive and prints the expected help message. Remember to exit the container.

Save a tar of the previously built image locally:

docker save fastqc-0.11.9:1 -o fastqc-0.11.9:1.tar.gz

Upload your docker image .tar to an ICA project (browser upload, Connector, or CLI). Important: In Data tab, select the uploaded .tar file, then click “Manage --> Change Format”, select 'DOCKER' and Save.

Now step outside of the Project and go to Docker Repository, Select New and click on the Search Icon. You can filter on Project names and locations, select your docker file (use the checkbox on the left) and Press Select.

Create a CWL tool

While outside of any Project go to Tool Repository and Select New Tool. Fill the mandatory fields (Name and Version) and click on the Search Icon to look for a Docker image to link to the tool. You must double-click on the image row to confirm the selection. Tool creation in ICA adheres to the cwl standard.

There are 2 ways you can create a (cwl) tool on top of a docker image in ICA UI:

1: Navigate to the Tool cwl tab and use the Text Editor to create the tool definition in CWL syntax. 2: Use the other tabs to independently define inputs, outputs, arguments, settings, etc …

In this tutorial we will present the 1st option using the CWL file: paste the following content into the Tool CWL tab

#!/usr/bin/env cwl-runner

# (Re)generated by BlueBee Platform

$namespaces:
  ilmn-tes: http://platform.illumina.com/rdf/iap/
cwlVersion: cwl:v1.0
class: CommandLineTool
label: FastQC
doc: FastQC aims to provide a simple way to do some quality control checks on raw
  sequence data coming from high throughput sequencing pipelines.
inputs:
  Fastq1:
    type: File
    inputBinding:
      position: 1
  Fastq2:
    type:
    - File
    - 'null'
    inputBinding:
      position: 3
outputs:
  HTML:
    type:
      type: array
      items: File
    outputBinding:
      glob:
      - '*.html'
  Zip:
    type:
      type: array
      items: File
    outputBinding:
      glob:
      - '*.zip'
arguments:
- position: 4
  prefix: -o
  valueFrom: $(runtime.outdir)
- position: 1
  prefix: -t
  valueFrom: '2'
baseCommand:
- fastqc

Please, observe the following: since the user needs to specify the output folder for FASTQC application (-o prefix), we are using the $(runtime.outdir) runtime parameter to point to the designated output directory.

Create the pipeline

While inside a Project, navigate to Pipelines and click on cwl and then Graphical.

Fill the mandatory fields (Code = pipeline name and free text Description) and click on the Definition tab to open the Graphical Editor.

Expand the Tool Repository menu (lower right) and drag your FastQC tool into the Editor field (center).

Now drag one Input and one Output file icon (on top) into the Editor field as well. Both may be given a Name (editable fields on the right when icon is selected) and need a Format attribute. Set the Input Format to fastq and Output Format to html. Connect both Input and Output files to the matching nodes on the tool itself (mouse over the node, then hold-click and drag to connect).

Press Save, you just created your first FastQC pipeline on ICA!

Run a pipeline

First make sure you have at least one Fastq file uploaded and/or linked to your Project. You may use Fastq files available in the Bundle.

Navigate to Pipelines and select the pipeline you just created, then press Start New Run.

Fill the mandatory field (User Reference = pipeline execution name) and click on the Select button to open the File Selection dialog box. Select any of the Fastq files available to you (use the checkbox on the left and press Select on the lower right).

Press Start Run on the top right, the platform is now orchestrating the workflow execution.

View Results

Navigate to Runs and observe that the pipeline execution is now listed and will first appear to be in “Requested” Status. After a few minutes the Status should change to “In Progress” and then to “Succeeded”.

Once this Run is succeeded click on the row (a single click is enough) to enter Result view. You should see the FastQC HTML output file listed on the right. Click on the file to open Data Details view. Since it is an HTML file Format there is a View tab that allows visualizing the HTML within the browser.

CWL DRAGEN Pipeline

In this tutorial, we will demonstrate how to create and launch a DRAGEN pipeline using the CWL language.

In ICA, CWL pipelines are built using tools developed in CWL. For this tutorial, we will use the "DRAGEN Demo Tool" included with DRAGEN Demo Bundle 3.9.5.

Linking bundle to Project

1.) Start by selecting a project at the Projects inventory.

2.) In the details page, select Edit.

3.) In the edit mode of the details page, click the + button in the LINKED BUNDLES section.

4.) In the Add Bundle to Project window: Select the dragen demo tool bundle from the list. Once you have selected the bundle, the Link Bundles button becomes available. Select it to continue.

Tip: You can select multiple bundles using Ctrl + Left mouse button or Shift + Left mouse button.

5.) In the project details page, the selected bundle will appear under the LINKED BUNDLES section. If you need to remove a bundle, click on the - button. Click Save to save the project with linked bundles.

Create Pipeline

1.) From the project details page, select Pipelines > CWL

2.) You will be given options to create pipelines using a graphical interface or code. For this tutorial, we will select Graphical.

3.) Once you have selected the Graphical option, you will see a page with multiple tabs. The first tab is the Information page where you enter pipeline information. You can find the details for different fields in the tab in the GitBook. The following three fields are required for the INFORMATION page.

Code: Provide pipeline name here.
Description: Provide pipeline description here.
Storage size: Select the storage size from the drop-down menu.

4.) The Documentation tab provides options for configuring the HTML description for the tool. The description appears in the tool repository but is excluded from exported CWL definitions.

5.) The Definition tab is used to define the pipeline. When using graphical mode for the pipeline definition, the Definition tab provides options for configuring the pipeline using a visualization panel (A) and a list of component menus (B). You can find details on each section in the component menu here

6.) To build a pipeline, start by selecting Machine PROFILE from the component menu section on the right. All fields are required and are pre-filled with default values. Change them as needed.

The profile Name field will be updated based on the selected Resource. You can change it as needed.
Color assigns the selected color to the tool in the design view to easily identify the machine profile when more than one tool is used in the pipeline.
Tier lets you select Standard or Economy tier for AWS instances. Standard is on-demand ec2 instance and Economy is spot ec2 instance. You can find the difference between the two AWS instances here. You can find the price difference between the two Tiers here.
Resource lets you choose from various compute resources available. In this case, we are building a DRAGEN pipeline and we will need to select a resource with FPGA in it. Choose from FPGA resources (FPGA Medium/Large) based on your needs.

7.) Once you have selected the Machine Profile for the tool, find your tool from the Tool Repository at the bottom section of the component menu on the right. In this case, we are using the DRAGEN Demo Tool. Drag and drop the tool from the Tool Repository section to the visualization panel.

8.) The dropped tool will show the machine profile color, number of outputs and inputs, and warning to indicate missing parameters, mandatory values, and connections. Selecting the tool in the visualization panel activates the tool (Dragen Demo Tool) component menu. On the component menu section, you will find the details of the tool under Tool - DRAGEN Demo Tool. This section lists the inputs, outputs, additional parameters, and the machine profile required for the tool. In this case, the DRAGEN Demo Tool requires three inputs (FASTQ read 1, FASTQ read 2, and a Reference genome). The tool has two outputs (a VCF file and an output directory). The tool also has a mandatory parameter (Output File Prefix). Enter the value for the input parameter (Output File Prefix) in the text box.

9.) The top right corner of the visualization panel has icons to zoom in and out in the visualization panel followed by three icons: ref, in, and out. Based on the type of input/output needed, drag and drop the icons into the visualization area. In this case, we need three inputs (read 1, read 2, and Reference hash table.) and two outputs (VCF file and output directory). Start by dragging and dropping the first input (a). Connect the input to the tool by clicking on the blue dot at the bottom of the input icon and dragging it to the blue dot representing the first input on the tool (b). Select the input icon to activate the input component menu. The input section for the first input lets you enter the Name, Format, and other relevant information based on tool requirements. In this case, for the first input, enter the following information:

Name: FASTQ read 1
Format: FASTQ
Comments: any optional comments

10.) Repeat the step for other inputs. Note that the Reference hash table is treated as the input for the tool rather than Reference files. So, use the input icon instead of the reference icon.

11.) Repeat the process for two outputs by dragging and connecting them to the tool. Note that when connecting output to the tool, you will need to click on the blue dot at the bottom of the tool and drag it to the output.

12.) Select the tool and enter additional parameters. In this case, the tool requires Output File Prefix. Enter demo_ in the text box.

13.) Click on the Save button to save the pipeline. Once saved, you can run it from the Pipelines page under Flow from the left menus as any other pipeline.

CWL: Scatter-gather Method

In bioinformatics and computational biology, the vast and growing amount of data necessitates methods and tools that can process and analyze data in parallel. This demand gave birth to the scatter-gather approach, an essential pattern in creating pipelines that offers efficient data handling and parallel processing capabilities. In this tutorial, we will demonstrate how to create a CWL pipeline utilizing the scatter-gather approach. To this purpose, we will use two widely known tools: and . Given the functionalities of both fastp and multiqc, their combination in a scatter-gather pipeline is incredibly useful. Individual datasets can be scattered across resources for parallel preprocessing with fastp. Subsequently, the outputs from each of these parallel tasks can be gathered and fed into multiqc, generating a consolidated quality report. This workflow not only accelerates the preprocessing of large datasets but also offers an aggregated perspective on data quality, ensuring that subsequent analyses are built upon a robust foundation.

Creating the tools

First, we create the two tools: fastp and multiqc. For this, we need the corresponding Docker images and CWL tool definitions. Please, look up this of our help sites to learn more how to import a tool into ICA. In a nutshell, once the CWL tool definition is pasted into the editor, the other tabs for editing the tool will be populated. To complete the tool, the user needs to select the corresponding Docker image and to provide a tool version (could be any string).

For this demo, we will use the publicly available Docker images: quay.io/biocontainers/fastp:0.20.0--hdbcaa40_0 for fastp and docker.io/ewels/multiqc:v1.15 for multiqc. In this one can find how to import publicly available Docker images into ICA.

Furthermore, we will use the following CWL tool definitions:

and

Pipeline

Once the tools are created, we will create the pipeline itself using these two tools at Projects > your_project > Flow > Pipelines > CWL > Graphical:

On the Definition tab, go to the tool repository and drag and drop the two tools which you just created on the pipeline editor.
Connect the JSON output of fastp to multiqc input by hovering over the middle of the round, blue connector of the output until the icon changes to a hand and then drag the connection to the first input of multiqc. You can use the magnification symbols to make it easier to connect these tools.
Above the diagram, drag and drop two input FASTQ files and an output HTML file on to the pipeline editor and connect the blue markers to match the diagram below.

Relevant aspects of the pipeline:

Both inputs are multivalue (as can be seen on the screenshot)
Ensure that the step fastp has scattering configured: it scatters on both inputs using the scatter method 'dotproduct'. This means that as many instances of this step will be executed as there are pairs of FASTQ files. To indicate that this step is executed multiple times, the icons of both inputs have doubled borders.

Important remark

Both input arrays (Read1 and Read2) must be matched. Currently an automatic sorting of input arrays is not supported yet. One has to take care of matching the input arrays. There are two ways to achieve this (besides the manual specification in the GUI):

invoke this pipeline in CLI using Bash functionality to sort the arrays
add a tool to the pipeline which will intake array of all FASTQ files, spread them on R1 and R2 suffixes, and sort them.

Now this tool can added to the pipeline before fastp step.

CWL Graphical Pipeline

Build and push to ICA your own Docker image

The foundation for every tool in ICA is a Docker image (externally published or created by the user). Here we present how to create your own Docker image for the popular tool (FASTQC).

Copy the contents displayed below to a text editor and save it as a Dockerfile. Make sure you use an editor which does not add formatting to the file.

FROM centos:7
WORKDIR /usr/local

# DEPENDENCIES
RUN yum -y install java-1.8.0-openjdk wget unzip perl && \
    yum clean all && \
    rm -rf /var/cache/yum

# INSTALLATION fastqc
RUN wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip --no-check-certificate && \
    unzip fastqc_v0.11.9.zip && \
    chmod a+rx /usr/local/FastQC/fastqc && rm -rf fastqc_v0.11.9.zip

# Adding FastQC to the PATH
ENV PATH $PATH:/usr/local/FastQC

# DEFAULTS
ENV LANG=en_US.UTF-8
ENV LC_ALL=en_US.UTF-8
ENTRYPOINT []

## how to build the docker image
## docker build --file fastqc-0.11.9.Dockerfile --tag fastqc-0.11.9:0 .
## docker run --rm -i -t --entrypoint /bin/bash fastqc-0.11.9:0

Open a terminal window, place this file in a dedicated folder and navigate to this folder location. Then use the following command:

docker build --file fastqc-0.11.9.Dockerfile --tag fastqc-0.11.9:1 .

Check the image has been successfully built:

docker images

Check that the container is functional:

docker run --rm -i -t --entrypoint /bin/bash fastqc-0.11.9:1

Once inside the container check that the fastqc command is responsive and prints the expected help message. Remember to exit the container.

Save a tar of the previously built image locally:

docker save fastqc-0.11.9:1 -o fastqc-0.11.9:1.tar.gz

Create a CWL tool

There are 2 ways you can create a (cwl) tool on top of a docker image in ICA UI:

1: Navigate to the Tool cwl tab and use the Text Editor to create the tool definition in CWL syntax. 2: Use the other tabs to independently define inputs, outputs, arguments, settings, etc …

In this tutorial we will present the 1st option using the CWL file: paste the following content into the Tool CWL tab

#!/usr/bin/env cwl-runner

# (Re)generated by BlueBee Platform

$namespaces:
  ilmn-tes: http://platform.illumina.com/rdf/iap/
cwlVersion: cwl:v1.0
class: CommandLineTool
label: FastQC
doc: FastQC aims to provide a simple way to do some quality control checks on raw
  sequence data coming from high throughput sequencing pipelines.
inputs:
  Fastq1:
    type: File
    inputBinding:
      position: 1
  Fastq2:
    type:
    - File
    - 'null'
    inputBinding:
      position: 3
outputs:
  HTML:
    type:
      type: array
      items: File
    outputBinding:
      glob:
      - '*.html'
  Zip:
    type:
      type: array
      items: File
    outputBinding:
      glob:
      - '*.zip'
arguments:
- position: 4
  prefix: -o
  valueFrom: $(runtime.outdir)
- position: 1
  prefix: -t
  valueFrom: '2'
baseCommand:
- fastqc

Create the pipeline

While inside a Project, navigate to Pipelines and click on cwl and then Graphical.

Fill the mandatory fields (Code = pipeline name and free text Description) and click on the Definition tab to open the Graphical Editor.

Expand the Tool Repository menu (lower right) and drag your FastQC tool into the Editor field (center).

Press Save, you just created your first FastQC pipeline on ICA!

Run a pipeline

First make sure you have at least one Fastq file uploaded and/or linked to your Project. You may use Fastq files available in the Bundle.

Navigate to Pipelines and select the pipeline you just created, then press Start New Run.

Press Start Run on the top right, the platform is now orchestrating the workflow execution.

View Results

CWL DRAGEN Pipeline

In this tutorial, we will demonstrate how to create and launch a DRAGEN pipeline using the CWL language.

In ICA, CWL pipelines are built using tools developed in CWL. For this tutorial, we will use the "DRAGEN Demo Tool" included with DRAGEN Demo Bundle 3.9.5.

Linking bundle to Project

1.) Start by selecting a project at the Projects inventory.

2.) In the details page, select Edit.

3.) In the edit mode of the details page, click the + button in the LINKED BUNDLES section.

4.) In the Add Bundle to Project window: Select the dragen demo tool bundle from the list. Once you have selected the bundle, the Link Bundles button becomes available. Select it to continue.

Tip: You can select multiple bundles using Ctrl + Left mouse button or Shift + Left mouse button.

Create Pipeline

1.) From the project details page, select Pipelines > CWL

2.) You will be given options to create pipelines using a graphical interface or code. For this tutorial, we will select Graphical.

Code: Provide pipeline name here.
Description: Provide pipeline description here.
Storage size: Select the storage size from the drop-down menu.

4.) The Documentation tab provides options for configuring the HTML description for the tool. The description appears in the tool repository but is excluded from exported CWL definitions.

6.) To build a pipeline, start by selecting Machine PROFILE from the component menu section on the right. All fields are required and are pre-filled with default values. Change them as needed.

The profile Name field will be updated based on the selected Resource. You can change it as needed.
Color assigns the selected color to the tool in the design view to easily identify the machine profile when more than one tool is used in the pipeline.
Tier lets you select Standard or Economy tier for AWS instances. Standard is on-demand ec2 instance and Economy is spot ec2 instance. You can find the difference between the two AWS instances here. You can find the price difference between the two Tiers here.
Resource lets you choose from various compute resources available. In this case, we are building a DRAGEN pipeline and we will need to select a resource with FPGA in it. Choose from FPGA resources (FPGA Medium/Large) based on your needs.

Name: FASTQ read 1
Format: FASTQ
Comments: any optional comments

10.) Repeat the step for other inputs. Note that the Reference hash table is treated as the input for the tool rather than Reference files. So, use the input icon instead of the reference icon.

12.) Select the tool and enter additional parameters. In this case, the tool requires Output File Prefix. Enter demo_ in the text box.

13.) Click on the Save button to save the pipeline. Once saved, you can run it from the Pipelines page under Flow from the left menus as any other pipeline.

CWL: Scatter-gather Method

Creating the tools

Furthermore, we will use the following CWL tool definitions:

and

#!/usr/bin/env cwl-runner

cwlVersion: cwl:v1.0
class: CommandLineTool
label: MultiQC
doc: MultiQC is a tool to create a single report with interactive plots for multiple
  bioinformatics analyses across many samples.
inputs:
  files:
    type:
    - type: array
      items: File
    - 'null'
    doc: Files containing the result of quality analysis.
    inputBinding:
      position: 2
  directories:
    type:
    - type: array
      items: Directory
    - 'null'
    doc: Directories containing the result of quality analysis.
    inputBinding:
      position: 3
  report_name:
    type: string
    doc: Name of output report, without path but with full file name (e.g. report.html).
    default: multiqc_report.html
    inputBinding:
      position: 1
      prefix: -n
outputs:
  report:
    type: File
    outputBinding:
      glob:
      - '*.html'
baseCommand:
- multiqc

Pipeline

Once the tools are created, we will create the pipeline itself using these two tools at Projects > your_project > Flow > Pipelines > CWL > Graphical:

On the Definition tab, go to the tool repository and drag and drop the two tools which you just created on the pipeline editor.
Connect the JSON output of fastp to multiqc input by hovering over the middle of the round, blue connector of the output until the icon changes to a hand and then drag the connection to the first input of multiqc. You can use the magnification symbols to make it easier to connect these tools.
Above the diagram, drag and drop two input FASTQ files and an output HTML file on to the pipeline editor and connect the blue markers to match the diagram below.

Relevant aspects of the pipeline:

Both inputs are multivalue (as can be seen on the screenshot)
Ensure that the step fastp has scattering configured: it scatters on both inputs using the scatter method 'dotproduct'. This means that as many instances of this step will be executed as there are pairs of FASTQ files. To indicate that this step is executed multiple times, the icons of both inputs have doubled borders.

Important remark

invoke this pipeline in CLI using Bash functionality to sort the arrays
add a tool to the pipeline which will intake array of all FASTQ files, spread them on R1 and R2 suffixes, and sort them.

We will describe the second way in more detail. The tool will be based on public python Docker docker.io/python:3.10 and have the following definition. In this tool we are providing the Python script spread_script.py via Dirent .

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool
requirements:
- class: InlineJavascriptRequirement
- class: InitialWorkDirRequirement
  listing:
  - entry: "import argparse\nimport os\nimport json\n\n# Create argument parser\n\
      parser = argparse.ArgumentParser()\nparser.add_argument(\"-i\", \"--inputFiles\"\
      , type=str, required=True, help=\"Input files\")\n\n# Parse the arguments\n\
      args = parser.parse_args()\n\n# Split the inputFiles string into a list of file\
      \ paths\ninput_files = args.inputFiles.split(',')\n\n# Sort the input files\
      \ by the base filename\ninput_files = sorted(input_files, key=lambda x: os.path.basename(x))\n\
      \n\n# Separate the files into left and right arrays, preserving the order\n\
      left_files = [file for file in input_files if '_R1_' in os.path.basename(file)]\n\
      right_files = [file for file in input_files if '_R2_' in os.path.basename(file)]\n\
      \n# Print the left files for debugging\nprint(\"Left files:\", left_files)\n\
      \n# Print the left files for debugging\nprint(\"Right files:\", right_files)\n\
      \n# Ensure left and right files are matched\nassert len(left_files) == len(right_files),\
      \ \"Mismatch in number of left and right files\"\n\n    \n# Write the left files\
      \ to a JSON file\nwith open('left_files.json', 'w') as outfile:\n    left_files_objects\
      \ = [{\"class\": \"File\", \"path\": file} for file in left_files]\n    json.dump(left_files_objects,\
      \ outfile)\n\n# Write the right files to a JSON file\nwith open('right_files.json',\
      \ 'w') as outfile:\n    right_files_objects = [{\"class\": \"File\", \"path\"\
      : file} for file in right_files]\n    json.dump(right_files_objects, outfile)\n\
      \n"
    entryname: spread_script.py
    writable: false
label: spread_items
inputs:
  inputFiles:
    type:
      type: array
      items: File
    inputBinding:
      separate: false
      prefix: -i
      itemSeparator: ','
outputs:
  leftFiles:
    type:
      type: array
      items: File
    outputBinding:
      glob:
      - left_files.json
      loadContents: true
      outputEval: $(JSON.parse(self[0].contents))
  rightFiles:
    type:
      type: array
      items: File
    outputBinding:
      glob:
      - right_files.json
      loadContents: true
      outputEval: $(JSON.parse(self[0].contents))
baseCommand:
- python3
- spread_script.py

Now this tool can added to the pipeline before fastp step.

CWL CLI Workflow

In this tutorial, we will demonstrate how to create and launch a pipeline using the CWL language using the ICA command line interface (CLI).

Installation

Please refer to for installing ICA CLI.

Tutorial project

tool-fqTOfa.cwl

tool-countLines.cwl

#!/usr/bin/env cwltool

cwlVersion: v1.0
class: CommandLineTool
baseCommand: [wc, -l]
inputs:
  inputFasta:
    type: File
    inputBinding:
        position: 1
stdout: lineCount.tsv
outputs:
  outputCount:
    type: File
    streamable: true
    outputBinding:
        glob: lineCount.tsv

workflow.cwl

cwlVersion: v1.0
class: Workflow
inputs:
  ipFQ: File

outputs:
  count_out:
    type: File
    outputSource: count/outputCount
  fqTOfaOut:
    type: File
    outputSource: convert/outputFasta
   
steps:
  convert:
    run: tool-fqTOfa.cwl
    in:
      inputFastq: ipFQ
    out: [outputFasta]

  count:
    run: tool-countLines.cwl
    in:
      inputFasta: convert/outputFasta
    out: [outputCount]

[!IMPORTANT] Please note that we don't specify the Docker image used in both tools. In such a case, the default behaviour is to use public.ecr.aws/docker/library/bash:5 image. This image contains basic functionality (sufficient to execute wc and awk commands).

In case you want to use a different public image, you can specify it using requirements tag in cwl file. Assuming you want to use *ubuntu:latest' you need to add

requirements:
  - class: DockerRequirement
    dockerPull: ubuntu:latest

requirements:
  - class: DockerRequirement
    dockerPull: 079623148045.dkr.ecr.eu-central-1.amazonaws.com/cp-prod/XXXXXXXXXX:latest

To add a custom or public docker image to the ICA repository, please refer to the .

Authentication

Before you can use ICA CLI, you will need to authenticate using the Illumina API key. Please follow to authenticate.

Enter/Create a Project

You can create a project or use an existing project for creating a new pipeline. You can create a new project using the "icav2 projects create" command.

% icav2 projects create basic-cli-tutorial --region c39b1feb-3e94-4440-805e-45e0c76462bf

You can select the project to work on by entering the project using the "icav2 projects enter" command. Thus, you won't need to specify the project as an argument.

% icav2 projects enter basic-cli-tutorial

You can also use the "icav2 projects list" command to determine the names and ids of the project you have access to.

% icav2 projects list

Create a pipeline on ICA

"projectpipelines" is the root command to perform actions on pipelines in a project. "create" command creates a pipeline in the current project.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pd:pipeline xmlns:pd="xsd://www.illumina.com/ica/cp/pipelinedefinition" code="" version="1.0">
    <pd:dataInputs>
        <pd:dataInput code="ipFQ" format="FASTQ" type="FILE" required="true" multiValue="false">
            <pd:label>ipFQ</pd:label>
            <pd:description></pd:description>
        </pd:dataInput>
    </pd:dataInputs>
    <pd:steps/>
</pd:pipeline>

% icav2 projectpipelines create cwl cli-tutorial --workflow workflow.cwl --tool tool-fqTOfa.cwl --tool tool-countLines.cwl --parameter parameters.xml --storage-size small --description "cli tutorial pipeline"

Once the pipeline is created, you can view it using the "list" command.

% icav2 projectpipelines list
ID                                   	CODE                      	DESCRIPTION                                      
6779fa3b-e2bc-42cb-8396-32acee8b6338	cli-tutorial             	cli tutorial pipeline

Running the pipeline

Upload data to the project using the "icav2 projectdata upload" command. Please refer to the for advanced data upload features. For this test, we will use a small FASTQ file test.fastq containing the following reads.

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
AAGTTACCCTTAACAACTTAAGGGTTTTCAAATAGA
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIDIIIIIII>IIIIII/
@SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=36
AGCAGAAGTCGATGATAATACGCGTCGTTTTATCAT
+SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=36
IIIIIIIIIIIIIIIIIIIIIIGII>IIIII-I)8I
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
AAGTTACCCTTAACAACTTAAGGGTTTTCAAATAGA
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIDIIIIIII>IIIIII/
@SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=36
AGCAGAAGTCGATGATAATACGCGTCGTTTTATCAT
+SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=36
IIIIIIIIIIIIIIIIIIIIIIGII>IIIII-I)8I
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
AAGTTACCCTTAACAACTTAAGGGTTTTCAAATAGA
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIDIIIIIII>IIIIII/
@SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=36
AGCAGAAGTCGATGATAATACGCGTCGTTTTATCAT
+SRR001666.2 071112_SLXA-EAS1_s_7:5:1:801:338 length=36
IIIIIIIIIIIIIIIIIIIIIIGII>IIIII-I)8I

The "icav2 projectdata upload" command lets you upload data to ica.

% icav2 projectdata upload test.fastq /
oldFilename= test.fastq en newFilename= test.fastq
bucket= stratus-gds-use1  prefix= 0a488bb2-578b-404a-e09d-08d9e3343b2b/test.fastq
Using: 1 workers to upload 1 files
15:23:32: [0]  Uploading /Users/user1/Documents/icav2_validation/for_tutorial/working/test.fastq
15:23:33: [0]  Uploaded /Users/user1/Documents/icav2_validation/for_tutorial/working/test.fastq to /test.fastq in 794.511591ms
Finished uploading 1 files in 795.244677ms

The "list" command lets you view the uploaded file. Note the ID of the file you want to use with the pipeline.

% icav2 projectdata list                
PATH          NAME        TYPE  STATUS    ID                                    OWNER                                 
/test.fastq  test.fastq FILE  AVAILABLE fil.c23246bd7692499724fe08da020b1014  4b197387-e692-4a78-9304-c7f73ad75e44

The "icav2 projectpipelines start" command initiates the pipeline run. The following command runs the pipeline. Note the id for exploring the analysis later.

Note: If for some reason your "create" command fails and needs to rerun, you might get an error (ConstraintViolationException). If so, try your command with a different name.

% icav2 projectpipelines start cwl cli-tutorial --type-input STRUCTURED --input ipFQ:fil.c23246bd7692499724fe08da020b1014 --user-reference tut-test
analysisStorage.description           1.2 TB
analysisStorage.id                    6e1b6c8f-f913-48b2-9bd0-7fc13eda0fd0
analysisStorage.name                  Small
analysisStorage.ownerId               8ec463f6-1acb-341b-b321-043c39d8716a
analysisStorage.tenantId              f91bb1a0-c55f-4bce-8014-b2e60c0ec7d3
analysisStorage.tenantName            ica-cp-admin
analysisStorage.timeCreated           2021-11-05T10:28:20Z
analysisStorage.timeModified          2021-11-05T10:28:20Z
id                                    461d3924-52a8-45ef-ab62-8b2a29621021
ownerId                               7fa2b641-1db4-3f81-866a-8003aa9e0818
pipeline.analysisStorage.description  1.2 TB
pipeline.analysisStorage.id           6e1b6c8f-f913-48b2-9bd0-7fc13eda0fd0
pipeline.analysisStorage.name         Small
pipeline.analysisStorage.ownerId      8ec463f6-1acb-341b-b321-043c39d8716a
pipeline.analysisStorage.tenantId     f91bb1a0-c55f-4bce-8014-b2e60c0ec7d3
pipeline.analysisStorage.tenantName   ica-cp-admin
pipeline.analysisStorage.timeCreated  2021-11-05T10:28:20Z
pipeline.analysisStorage.timeModified 2021-11-05T10:28:20Z
pipeline.code                         cli-tutorial
pipeline.description                  Test, prepared parameters file from working GUI
pipeline.id                           6779fa3b-e2bc-42cb-8396-32acee8b6338
pipeline.language                     CWL
pipeline.ownerId                      7fa2b641-1db4-3f81-866a-8003aa9e0818
pipeline.tenantId                     d0696494-6a7b-4c81-804d-87bda2d47279
pipeline.tenantName                   icav2-entprod
pipeline.timeCreated                  2022-03-10T13:13:05Z
pipeline.timeModified                 2022-03-10T13:13:05Z
reference                             tut-test-cli-tutorial-eda7ee7a-8c65-4c0f-bed4-f6c2d21119e6
status                                REQUESTED
summary                               
tenantId                              d0696494-6a7b-4c81-804d-87bda2d47279
tenantName                            icav2-entprod
timeCreated                           2022-03-10T20:42:42Z
timeModified                          2022-03-10T20:42:43Z
userReference                         tut-test

You can check the status of the run using the "icav2 projectanalyses get" command.

%   icav2 projectanalyses get 461d3924-52a8-45ef-ab62-8b2a29621021
analysisStorage.description           1.2 TB
analysisStorage.id                    6e1b6c8f-f913-48b2-9bd0-7fc13eda0fd0
analysisStorage.name                  Small
analysisStorage.ownerId               8ec463f6-1acb-341b-b321-043c39d8716a
analysisStorage.tenantId              f91bb1a0-c55f-4bce-8014-b2e60c0ec7d3
analysisStorage.tenantName            ica-cp-admin
analysisStorage.timeCreated           2021-11-05T10:28:20Z
analysisStorage.timeModified          2021-11-05T10:28:20Z
endDate                               2022-03-10T21:00:33Z
id                                    461d3924-52a8-45ef-ab62-8b2a29621021
ownerId                               7fa2b641-1db4-3f81-866a-8003aa9e0818
pipeline.analysisStorage.description  1.2 TB
pipeline.analysisStorage.id           6e1b6c8f-f913-48b2-9bd0-7fc13eda0fd0
pipeline.analysisStorage.name         Small
pipeline.analysisStorage.ownerId      8ec463f6-1acb-341b-b321-043c39d8716a
pipeline.analysisStorage.tenantId     f91bb1a0-c55f-4bce-8014-b2e60c0ec7d3
pipeline.analysisStorage.tenantName   ica-cp-admin
pipeline.analysisStorage.timeCreated  2021-11-05T10:28:20Z
pipeline.analysisStorage.timeModified 2021-11-05T10:28:20Z
pipeline.code                         cli-tutorial
pipeline.description                  Test, prepared parameters file from working GUI
pipeline.id                           6779fa3b-e2bc-42cb-8396-32acee8b6338
pipeline.language                     CWL
pipeline.ownerId                      7fa2b641-1db4-3f81-866a-8003aa9e0818
pipeline.tenantId                     d0696494-6a7b-4c81-804d-87bda2d47279
pipeline.tenantName                   icav2-entprod
pipeline.timeCreated                  2022-03-10T13:13:05Z
pipeline.timeModified                 2022-03-10T13:13:05Z
reference                             tut-test-cli-tutorial-eda7ee7a-8c65-4c0f-bed4-f6c2d21119e6
startDate                             2022-03-10T20:42:42Z
status                                SUCCEEDED
summary                               
tenantId                              d0696494-6a7b-4c81-804d-87bda2d47279
tenantName                            icav2-entprod
timeCreated                           2022-03-10T20:42:42Z
timeModified                          2022-03-10T21:00:33Z
userReference                         tut-test

 % icav2 projectpipelines start cwl cli-tutorial --data-id fil.c23246bd7692499724fe08da020b1014 --input-json '{
  "ipFQ": {
    "class": "File",
    "path": "test.fastq"
  }
}' --type-input JSON --user-reference tut-test-json