Parse CSV

Available from: Clarity LIMS v2.0.5

The -haltOnMissingSample option and support for header section values (e.g., containerName) were introduced in NGS v5.4.0.

Data might sometimes need to be parsed from an instrument result file (CSV, TSV, or other character-separated format) into Clarity LIMS, for the purposes of QC.

For example, suppose that a 96 well plate is run on a Caliper GX. The instrument produces a result file, which the user imports into Clarity LIMS. The per-sample data are parsed and stored for a range of capabilities, such as QC threshold checking, searching, and visibility in the Clarity LIMS interface.

The parseCSV script parses the data for each well into fields on either derived samples or result files (measurement records) that map directly to the derived samples being measured.

If the instrument result file contains data that applies to the whole batch of derived samples being measured, this data is stored in fields on the step.

The Script

The parseCSV script automates parsing a separated-value file, configurable but typically comma- or tab-separated, into the LIMS.

  1. Data lines in the file are matched to the corresponding sample in the LIMS using well placement information.

  2. For example, a line that references well A1 of container Plate123 has its parsed data mapped to the sample placed in well position A:1 of container Plate123 in the LIMS.

  3. Values from the file are mapped to fields (known as UDFs in the API) in Clarity LIMS based on the automation configuration for the script.
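The matching described above can be pictured with a minimal Python sketch. This is not the script's actual implementation (which ships inside ngs-extensions.jar). The names normalize_well and find_sample, the in-memory placements dictionary, and the sample name are hypothetical and only illustrate how a file label such as A1 lines up with the LIMS position A:1.

```python
import re

def normalize_well(label):
    """Convert a file-style well label such as 'A1' to the LIMS-style 'A:1'."""
    match = re.fullmatch(r"([A-Za-z]+):?(\d+)", label.strip())
    if match is None:
        raise ValueError(f"Unrecognized well label: {label!r}")
    row, column = match.groups()
    # Assumption of this sketch: zero-padded columns ('A01') are also accepted.
    return f"{row.upper()}:{int(column)}"

def find_sample(placements, container, well_label):
    """Return the sample placed at (container, well), or None if the well is empty."""
    return placements.get((container, normalize_well(well_label)))

# Hypothetical placement data standing in for what the LIMS knows about the step.
placements = {("Plate123", "A:1"): "Sample-001"}
print(find_sample(placements, "Plate123", "A1"))
```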

Workflow and Configuration

  • Configure the step to invoke the script manually via a button on the Record Details screen.

  • Before pressing the button that invokes the script, upload a shared result file to be parsed.

  • Configure the automation command line to match the destination fields configured in Clarity LIMS.

  • Create a field for each column that will be brought into the LIMS. Field names must not contain the separator used for the automation parameter string, "::".

  • When using NGS v5.0 or later, fields can be configured for the step, input samples, output samples, or output result files. Versions before this release support only output result files.

  • Input result files are not supported.

Script Parameters

Parameter                                     Description
-u {user}                                     LIMS username (Required)
-p {password}                                 LIMS password (Required)
-i {URI}                                      LIMS process URI (Required)
-inputFile {result file}                      Instrument result file to be parsed (Required)
-log {log file name}                          Log file name (Required)
-containerName {container name}               Name of column header for container name
-wellPosition {well position}                 Name of column header for well position
-sampleLocation {sample location}             Name of column header for <container_name>_<well>
-measurementUDFMap {measurement UDF map}      See Parameter Details
-partialMatchUDFMap {partial match UDF map}   See Parameter Details
-processUDFMap {process UDF map}              See Parameter Details
-headerRow {header row}                       Numeric index of CSV header row, starting from one (default 1)
-separator {separator}                        File separator (default comma)
-matchOutput {boolean}                        See Parameter Details (default false)
-setOutput {boolean}                          See Parameter Details (default true)
-relaxed {boolean}                            See Parameter Details (default false)
-haltOnMissingSample {boolean}                See Parameter Details (default true)

Association Strategy

The association strategy describes how information in the file is mapped to samples in the LIMS.

When running this script, there are two association strategies you can implement. Which strategy you choose is determined by the contents of the file that will be parsed. Both strategies rely on sample placement information (well and container name) to perform the mapping to the LIMS.

  • Strategy 1: Provide the -containerName and -wellPosition parameters to the script. Use this strategy when the well and container information are found in separate columns of the file, e.g., "Plate123" in column "Plate Name" and "A1" in column "Well Label".

  • Strategy 2: Provide the -sampleLocation parameter to the script. Use this strategy when the placement information is all found in the same column, in the following format: <container_name>_<well>_<free text>, e.g., "Plate123_A1_control" in column "Sample ID".
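As an illustration of Strategy 2, the following Python sketch splits a sample location value into its parts. The function name split_sample_location is hypothetical, and the sketch assumes that container names contain no underscores. How the actual script handles such names is not stated here.

```python
def split_sample_location(value):
    """Split '<container_name>_<well>_<free text>' into its three parts.

    Assumption of this sketch: the container name itself has no underscore.
    The free-text part is optional and may contain further underscores.
    """
    parts = value.split("_", 2)
    if len(parts) < 2:
        raise ValueError(f"Expected <container_name>_<well>, got {value!r}")
    container, well = parts[0], parts[1]
    free_text = parts[2] if len(parts) == 3 else ""
    return container, well, free_text

print(split_sample_location("Plate123_A1_control"))
```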

Header Section Parsing (NGS v5.4 and later)

For the association strategy provided, if matching headers are not found in the file at the provided header index, the script will then search the lines of the file that appear prior to this index (the header section) for a match.

For example, when using association strategy 1 and providing -containerName and -wellPosition, a file that contains information for only a single container may list the container name just once, in a header section. In a comma-separated file, this might look like: "ContainerID, plate123". With -containerName provided as "ContainerID", the script uses the adjacent value as the container name for the entire file and interprets the well positions as being within this container.
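The header section lookup can be sketched in Python as follows. The function find_header_section_value and the three-line example file are hypothetical. The sketch only mirrors the behavior described above: search the lines before the header row for the given name and take the adjacent value.

```python
import csv
import io

def find_header_section_value(text, key, header_row, delimiter=","):
    """Search the rows before the 1-based header_row for key.

    Returns the value in the cell next to key, or None if key is not found.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    for row in rows[: header_row - 1]:
        cells = [cell.strip() for cell in row]
        if key in cells:
            index = cells.index(key)
            if index + 1 < len(cells):
                return cells[index + 1]
    return None

# Hypothetical file: one header section line, then the header row on line 2.
example = "ContainerID, plate123\nWell Label,Conc\nA1,12.5\n"
print(find_header_section_value(example, "ContainerID", header_row=2))
```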

Mapping Parameters

Mapping parameters (measurementUDFMap, partialMatchUDFMap, and processUDFMap) determine which information is mapped from the file to fields in the LIMS.

The structure in which to provide these parameters is as follows, where the <Header Name> is the name of the data column or header section row in the file:

<UDF Name>::<Header Name>

At least one mapping parameter must be provided to map data from the file to the LIMS. The details of how each of these parameters affects the behavior of the script are described in the Parameter Details section.
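To make the <UDF Name>::<Header Name> structure concrete, here is a small Python sketch that splits one mapping value at the first "::". The function name parse_mapping is hypothetical. Splitting at the first occurrence is one reason a field name cannot itself contain the separator.

```python
def parse_mapping(value):
    """Split '<UDF Name>::<Header Name>' into (udf_name, header_name)."""
    udf_name, separator, header_name = value.partition("::")
    if not separator or not udf_name or not header_name:
        raise ValueError(f"Expected '<UDF Name>::<Header Name>', got {value!r}")
    return udf_name, header_name

print(parse_mapping("Run date::Run Date"))
```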

File Separator

While the most common file formats are *.csv (comma-separated) and *.tsv (tab-separated), the script may be configured to use any separator.

To use a comma or tab as the separator, pass the word "comma" or "tab" to the -separator parameter rather than the literal character, because these two separators require additional handling by the script.
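A minimal Python sketch of this convention follows. The name resolve_separator is hypothetical, and the sketch assumes that any other separator is passed through literally.

```python
# The two named separators described above; anything else is assumed literal.
NAMED_SEPARATORS = {"comma": ",", "tab": "\t"}

def resolve_separator(value="comma"):
    """Translate the -separator argument into the character used to split lines."""
    return NAMED_SEPARATORS.get(value, value)

print(repr(resolve_separator("tab")))
```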

Boolean Parameters

The script supports several boolean parameters. Boolean parameter values must be provided in quotes, e.g., "true".

Script Usage

Example 1

This example uses matching Strategy 1 for a comma-separated file and maps two columns, "Region[100–1000] Conc. (ng/ul)" and "Region[100–1000] Size at Maximum [BP]", to the output result file fields "Concentration" and "Size (bp)" in the LIMS, respectively:

bash -c "/opt/gls/clarity/bin/java \
-jar /opt/gls/clarity/extensions/ngs-common/v5/EPP/ngs-extensions.jar \
-i {processURI:v2:http} \
-u {username} \
-p {password} \
script:parseCSV \
-inputFile {compoundOutputFileLuid0} \
-log {compoundOutputFileLuid1} \
-headerRow '1' \
-separator 'comma' \
-containerName 'Plate Name' \
-wellPosition 'Well Label' \
-measurementUDFMap 'Concentration::Region[100–1000] Conc. (ng/ul)' \
-measurementUDFMap 'Size (bp)::Region[100–1000] Size at Maximum [BP]'"

Example 2

This example uses matching Strategy 2 for a tab-separated file, running in relaxed mode. It maps a column to an input sample field, using that input sample placement information, and maps a header section row to a protocol step field:

bash -c "/opt/gls/clarity/bin/java -jar /opt/gls/clarity/extensions/ngs-common/v5/EPP/ngs-extensions.jar \
-i {processURI:v2:http} \
-u {username} \
-p {password} \
script:parseCSV \
-inputFile {compoundOutputFileLuid0} \
-log {compoundOutputFileLuid1} \
-headerRow '1' \
-separator 'tab' \
-sampleLocation 'Sample ID' \
-measurementUDFMap 'Concentration::Region[100–1000] Conc. (ng/ul)' \
-processUDFMap 'Run date::Run Date' \
-matchOutput 'false' \
-setOutput 'false' \
-relaxed 'true'"

To view an out-of-the-box example included in the NGS package, review the configuration of the NanoDrop QC protocol steps included in the Initial DNA and RNA QC protocols.

Parameter Details

measurementUDFMap

This performs a 1:1 parsing of column information from the file to individual sample fields in the LIMS. The column names must match exactly. The exact destination (input/output sample or result file fields) is controlled through other script options.

partialMatchUDFMap

This allows customization of the column names that appear in the file by matching only on the first part of the column name, e.g., a partial match of "Sample" will match a column customized to "Sample (internal ID)". Other than providing this flexibility, this parameter functions the same as measurementUDFMap.

If two columns are found that begin with the partial match provided, the script will log an error and stop execution.
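The partial match rule can be sketched in Python like this. The function name find_partial_match and the example headers are hypothetical. The sketch raises an error where the script is described as logging an error and stopping.

```python
def find_partial_match(headers, prefix):
    """Return the single header that starts with prefix, or None if there is none.

    Raises ValueError when more than one header begins with the prefix,
    mirroring the "log an error and stop" behavior described above.
    """
    matches = [header for header in headers if header.startswith(prefix)]
    if len(matches) > 1:
        raise ValueError(f"Ambiguous partial match {prefix!r}: {matches}")
    return matches[0] if matches else None

print(find_partial_match(["Sample (internal ID)", "Conc"], "Sample"))
```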

processUDFMap

The process UDF option is provided to parse per-run information into protocol step fields in the LIMS. When provided, the script will search for a match in the header section and the data column headers of the file.

In the following example file:

  • The first two lines (beginning with OPERATOR and WORKFLOW) represent a header section with information for the batch of derived samples.

  • The third line (S_PLATE_ID) is the data section header (header row).

  • The remaining lines make up the data section, which contains data for each derived sample.

How it Works

  • If there is a matching header in both the header section and column headers, the value from the header section will be used.

  • If no matching header is found and the script isn't running in relaxed mode, the script will log an error and stop execution.

  • When a match is found only among the column headers, validation is done to ensure all the values in that column are equal (because they will be mapped to a single destination field). If not all of the values are the same, a warning will be logged listing the distinct values and the field in the LIMS will not be updated.
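The three rules above can be summarized in a short Python sketch. The function resolve_process_value and its dictionary inputs are hypothetical stand-ins for the parsed file: header_section maps header section names to values, and columns maps column headers to the list of values in that column.

```python
def resolve_process_value(header_section, columns, header, relaxed=False):
    """Pick the single value to store in a step field, or None if not updated."""
    # Rule 1: a header section match wins over a column header match.
    if header in header_section:
        return header_section[header]
    # Rule 3: a column match is used only if every value in the column is equal.
    if header in columns:
        distinct = sorted(set(columns[header]))
        if len(distinct) == 1:
            return distinct[0]
        print(f"WARNING: {header!r} has distinct values {distinct}; field not updated")
        return None
    # Rule 2: no match is an error unless relaxed mode is on.
    if relaxed:
        print(f"WARNING: header {header!r} not found")
        return None
    raise LookupError(f"Header {header!r} not found")

columns = {"Run Date": ["2020-01-01", "2020-01-01"], "Conc": ["1", "2"]}
print(resolve_process_value({}, columns, "Run Date"))
```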

matchOutput Mode

This parameter is provided as a boolean true/false value (default is false). It toggles whether information from the file is matched to the LIMS by comparing it to the placement of the protocol step inputs or protocol step outputs.

  • If set to False: The script uses the placement information of the inputs.

  • If set to True: The script uses the placement information of protocol step outputs.

setOutput Mode

This parameter is provided as a boolean true/false value (default is true). It toggles whether per-sample information is mapped to fields on the protocol step inputs or outputs.

  • If set to True: The script updates the protocol step outputs.

  • If set to False: The script updates field information on the protocol step inputs.

Input samples, output samples, and output result files are supported. The script expects either output samples or output result files, not both.

The script will log an error and stop execution if there is more than one kind of per-input output configured for the protocol step.
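How matchOutput and setOutput combine can be shown with a tiny Python sketch. The function choose_artifacts is hypothetical. It only encodes the defaults and meanings stated above and returns which side of the step is used for matching and which side is updated.

```python
def choose_artifacts(match_output=False, set_output=True):
    """Return (matched_on, updated) for the given flag values."""
    matched_on = "outputs" if match_output else "inputs"
    updated = "outputs" if set_output else "inputs"
    return matched_on, updated

# Defaults: match on input placement, write to outputs.
print(choose_artifacts())
# Both flags false, as in Example 2: match on inputs and write to inputs.
print(choose_artifacts(match_output=False, set_output=False))
```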

relaxed Mode

This parameter is provided as a boolean true/false value (default is false) to toggle relaxed mode.

  • If set to False: The script considers all provided header mappings to be mandatory headers and throws an exception if anything cannot be found in the file.

  • If set to True: In relaxed mode, the script will log a warning if a header cannot be found in the file, and will continue execution.
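The difference between the two modes can be sketched in Python as follows. The function check_headers is hypothetical. It raises an exception in strict mode and prints a warning in relaxed mode, in place of the script's own logging.

```python
def check_headers(required, available, relaxed=False):
    """Return the required headers that are present in the file.

    Strict mode (relaxed=False) raises if any required header is missing.
    Relaxed mode warns about each missing header and carries on.
    """
    missing = [header for header in required if header not in available]
    if missing and not relaxed:
        raise LookupError(f"Headers not found in file: {missing}")
    for header in missing:
        print(f"WARNING: header {header!r} not found in file")
    return [header for header in required if header in available]

print(check_headers(["Conc", "Size"], ["Well", "Conc"], relaxed=True))
```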

haltOnMissingSample mode (NGS v5.4 and later)

This parameter is provided as a boolean true/false value (default is true) to toggle halt on missing sample mode.

  • If set to False: The script will warn but continue execution when placement information for a line in the file cannot be determined. This mode can be used to handle, for example, ladder entries or footer sections, where the lines in the file will not contain valid sample information for the parser to use.

  • If set to True: The script will log an error and stop execution when a line in the file is encountered where it cannot determine the placement information for a sample. This mode allows strict matching of all contents.
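As a rough Python illustration of the two modes, the sketch below walks over parsed lines and either stops or warns when a line has no matching placement. The function match_rows, the row tuples, and the placement data (including the "Ladder" entry) are hypothetical.

```python
def match_rows(rows, placements, halt_on_missing_sample=True):
    """Pair each (container, well, values) row with the sample placed there."""
    matched = []
    for line_number, (container, well, values) in enumerate(rows, start=1):
        sample = placements.get((container, well))
        if sample is None:
            message = f"No sample at {container} {well} (data line {line_number})"
            if halt_on_missing_sample:
                raise LookupError(message)
            print("WARNING: " + message)
            continue  # skip ladder or footer lines and keep going
        matched.append((sample, values))
    return matched

placements = {("Plate123", "A:1"): "Sample-001"}
rows = [("Plate123", "A:1", {"Concentration": "12.5"}),
        ("Plate123", "Ladder", {"Concentration": ""})]
print(match_rows(rows, placements, halt_on_missing_sample=False))
```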

Additional Information

Other scripts you may find useful are as follows.
