Downloading a File and PDF Image Extraction

Compatibility: API version 2 revision 21

A lab user can attach a PDF file containing multiple images to a result file placeholder. The script extracts the images and automatically attaches them to the corresponding samples as individual result files.

Images within a PDF may be in a number of formats, and will usually be .ppm or .jpeg. The example script includes additional code to convert .ppm images to .jpeg.

Prerequisites

  • You have installed the 'poppler' package (Linux), which provides the pdfimages utility.

  • You have defined a process with analytes (samples) as inputs, and outputs that generate the following:

    • A single shared result file output.

    • A result file output per input.

  • You have added samples to Clarity LIMS.

  • You have uploaded the Results PDF to Clarity LIMS during 'Step Setup'.

  • Optionally, if you wish to convert other file types to .jpeg, you have installed ImageMagick (Linux package).

Code Example

How it works:

  1. The lab scientist runs a process/protocol step and attaches the PDF in Clarity LIMS.

  2. When run, the script uses the API and the Python 'requests' package to locate and retrieve the PDF.

  3. The script generates a file for each image.

  4. Files are named with LUIDs and well location.

  5. The images are attached to the ResultFile placeholder. **The file names must begin with the {outputFileLuids} for automatic attachment.**

Additionally, this script converts the images to JPEG format for compatibility with other LIMS features.
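
The naming rule in step 5 can be checked before any file is written. The following is a minimal sketch, not part of the cookbook script: split_luids and is_attachable are hypothetical helpers, the LUID values are made up, and the sketch assumes that the {outputFileLuids} value reaches the script as a single space-separated string.

```python
def split_luids(output_file_luids):
    """Split the -f argument into individual LUIDs (assumes space separation)."""
    return output_file_luids.split()


def is_attachable(filename, luids):
    """Return True if the file name begins with one of the placeholder LUIDs."""
    return any(filename.startswith(luid) for luid in luids)


luids = split_luids("92-1001 92-1002")          # made-up LUIDs for illustration
print(is_attachable("92-1001_A1.jpg", luids))   # name starts with a LUID
print(is_attachable("A1_92-1001.jpg", luids))   # LUID is not at the start
```

A file whose name fails this check would not be matched to a placeholder under the rule described above.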

Step 1. Create the script

Part 1 - Downloading a file using the API

The script finds and retrieves the content of the PDF through two separate GET requests:

  1. A request to the artifact URI, built from {compoundOutputFileLuid0}, to identify the LUID of the PDF file.

  2. A request to the ~/api/v2/files/{luid}/download endpoint to save the file to the temporary working directory.

import requests
import xml.etree.ElementTree as ET

def dl_pdf(artifactluid_ofpdf):
    """Find the file LUID from the artifact LUID of the PDF, then download the PDF."""
    # BASE_URI and args are defined elsewhere in the script
    artif_URI = BASE_URI + "artifacts/" + artifactluid_ofpdf
    artGET = requests.get(artif_URI, auth=(args["username"], args["password"]))
    root = ET.fromstring(artGET.content)  # parse the artifact XML from the response body
    for node in root.findall("{http://genologics.com/ri/file}file"):
        fileLUID = node.get("limsid")
    # download the file content and save it as frag.pdf in the working directory
    file_URI = BASE_URI + "files/" + fileLUID + "/download"
    fileGET = requests.get(file_URI, auth=(args["username"], args["password"]))
    with open("frag.pdf", "wb") as fd:
        for chunk in fileGET.iter_content(chunk_size=8192):
            fd.write(chunk)

The PDF is written to the temporary directory.
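
The two lookups can be separated from the network calls so that they are easy to test. The following is a minimal sketch using only the standard library: file_luid_from_artifact_xml and download_uri are hypothetical helper names, the host name and LUIDs are made up, and the sample XML is a trimmed illustration rather than a real API response. Unlike the loop in the script, it returns the first file element it finds.

```python
import xml.etree.ElementTree as ET

FILE_TAG = "{http://genologics.com/ri/file}file"  # same tag the script searches for


def file_luid_from_artifact_xml(artifact_xml):
    """Return the limsid of the first file element in the artifact XML, or None."""
    root = ET.fromstring(artifact_xml)
    node = root.find(FILE_TAG)
    return node.get("limsid") if node is not None else None


def download_uri(base_uri, file_luid):
    """Build the download endpoint URI for a file LUID."""
    return base_uri.rstrip("/") + "/files/" + file_luid + "/download"


# Hypothetical, trimmed artifact XML for illustration only.
SAMPLE_XML = (
    '<art:artifact xmlns:art="http://genologics.com/ri/artifact" '
    'xmlns:file="http://genologics.com/ri/file" limsid="92-1000">'
    '<file:file limsid="40-2000"/>'
    '</art:artifact>'
)

luid = file_luid_from_artifact_xml(SAMPLE_XML)
print(download_uri("https://lims.example.com/api/v2/", luid))
```

Returning None when no file is attached lets the caller report a clear error instead of failing on an undefined variable.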

The script then performs a batch retrieval of the artifact XML for all samples and builds a Python dictionary that maps each well location to the corresponding LIMS ID.
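
The dictionary step might look like the following sketch. It is not taken from the cookbook script: build_well_map is a hypothetical helper, and the element layout (an artifact element carrying a limsid attribute, with the well in location/value) and the sample LUIDs are assumptions about the batch-retrieve response that should be checked against a real response from your server.

```python
import xml.etree.ElementTree as ET

ARTIFACT_TAG = "{http://genologics.com/ri/artifact}artifact"  # assumed namespace


def build_well_map(details_xml):
    """Map well location -> artifact limsid from a batch-retrieve response."""
    well_map = {}
    root = ET.fromstring(details_xml)
    for artifact in root.iter(ARTIFACT_TAG):
        well = artifact.findtext("location/value")
        if well:  # skip artifacts that have no container placement
            well_map[well] = artifact.get("limsid")
    return well_map


# Hypothetical, trimmed response for illustration only.
SAMPLE_DETAILS = (
    '<art:details xmlns:art="http://genologics.com/ri/artifact">'
    '<art:artifact limsid="92-1001"><location><value>A:1</value></location></art:artifact>'
    '<art:artifact limsid="92-1002"><location><value>B:1</value></location></art:artifact>'
    '</art:details>'
)

print(build_well_map(SAMPLE_DETAILS))
```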

Part 2 - Extracting images as individual results files

The script uses the pdfimages utility to extract the images from the PDF. pdfimages is a command-line tool from a Linux package and can be called from Python using the os.system() function.

This example script extracts an image from each page, beginning with page 10. Files are named with LUIDs and well location. The file names must begin with the {outputFileLuids} for automatic attachment.

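import os

page = 10  # first page that contains an image
for each in range(len(wells)):
    well_loci = wells[each]
    if well_loci in well_map.keys():
        limsid = well_map[well_loci]
        # the file name must start with the LUID for automatic attachment
        filename = limsid + "_" + well_loci
        command = 'pdfimages ' + 'frag.pdf' + ' -j -f ' + str(page) + ' -l ' + str(page) + ' ' + filename
        os.system(command)
    page = page + 1  # move to the next page (assumes one page per entry in wells)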
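
As an alternative to building a shell string, the same call can be made with an argument list, which avoids quoting problems and reports failures. The following is a sketch under stated assumptions: pdfimages_command and extract_page are hypothetical helper names, the file prefix is made up, and subprocess.run requires Python 3.5 or later, whereas the cookbook script itself calls os.system().

```python
import subprocess


def pdfimages_command(pdf_path, page, prefix):
    """Build the argument list that extracts the images on a single page."""
    # -j writes JPEG-encoded images as .jpg; -f and -l set the first and last page
    return ["pdfimages", "-j", "-f", str(page), "-l", str(page), pdf_path, prefix]


def extract_page(pdf_path, page, prefix):
    """Run pdfimages and raise CalledProcessError if it exits with an error."""
    subprocess.run(pdfimages_command(pdf_path, page, prefix), check=True)


print(pdfimages_command("frag.pdf", 10, "92-1001_A1"))
```

Calling extract_page("frag.pdf", 10, "92-1001_A1") would then write the extracted images using the given prefix, provided that pdfimages is installed and on the PATH.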

Additionally, the cookbook example script converts the image files to JPEG for compatibility with other features in Clarity LIMS. The script uses 'convert', a command from the Linux package 'ImageMagick'. Like 'pdfimages', 'convert' can be called from a Python script through the os.system() function.
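
The conversion step could be sketched as follows. This is an illustration rather than the code in the attached script: convert_commands is a hypothetical helper, the file names are made up, and the sketch only builds the 'convert' argument lists for any .ppm files in a directory without running them.

```python
import os
import tempfile


def convert_commands(directory):
    """Return one ImageMagick 'convert' argument list per .ppm file in directory."""
    commands = []
    for name in sorted(os.listdir(directory)):
        stem, ext = os.path.splitext(name)
        if ext.lower() == ".ppm":
            src = os.path.join(directory, name)
            dst = os.path.join(directory, stem + ".jpg")  # keeps the LUID prefix
            commands.append(["convert", src, dst])
    return commands


# Demonstration with empty placeholder files; nothing is converted here.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("92-1001_A1-000.ppm", "92-1002_B1-000.jpg"):
        open(os.path.join(tmp, name), "w").close()
    for cmd in convert_commands(tmp):
        print([os.path.basename(part) for part in cmd])
```

Each argument list can be passed to subprocess.run(cmd, check=True), or joined into a string for os.system(), once ImageMagick is installed. Because the output name reuses the input stem, the converted file still begins with the LUID.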

Step 2. Configure the Process

The steps required to configure a process to run EPP are described in the Process Execution with EPP/Automation Support example, namely:

  1. Configure the inputs and outputs.

  2. On the External Programs tab, select the check box to associate the process with an external program.

Parameters

The process parameter string for the external program is as follows:

bash -c "/usr/bin/python /opt/gls/clarity/customextensions/pdfimages.py -a {compoundOutputFileLuid0} -u {username} -p {password} -f '{outputFileLuids}'"

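The EPP command is configured to pass the following parameters:

  • -a {compoundOutputFileLuid0}: the LUID of the shared result file placeholder that holds the PDF.

  • -u {username}: the user name used to authenticate API requests.

  • -p {password}: the password used to authenticate API requests.

  • -f {outputFileLuids}: the LUIDs of the per-input result file placeholders to which the images are attached.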
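
A minimal sketch of how a script might read the four flags in the parameter string, assuming argparse; the attached script may parse its arguments differently. The keys "username" and "password" match the args dictionary used in the download code, while "pdf_artifact_luid" and "output_file_luids" are hypothetical names chosen for this illustration, and the sample values are made up.

```python
import argparse


def parse_args(argv):
    """Parse the EPP parameters into a dictionary."""
    parser = argparse.ArgumentParser(description="Extract images from a results PDF")
    parser.add_argument("-a", dest="pdf_artifact_luid", required=True)
    parser.add_argument("-u", dest="username", required=True)
    parser.add_argument("-p", dest="password", required=True)
    parser.add_argument("-f", dest="output_file_luids", required=True)
    return vars(parser.parse_args(argv))  # vars() turns the namespace into a dict


# Made-up values standing in for the tokens that Clarity LIMS substitutes.
args = parse_args(["-a", "92-1000", "-u", "apiuser", "-p", "secret", "-f", "92-1001 92-1002"])
print(args["username"], args["output_file_luids"].split())
```

In a real run the script would call parse_args(sys.argv[1:]). The -f value is quoted in the parameter string so that it arrives as a single argument.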

Step 3. Run the Process

Record Details page in Clarity LIMS

The placeholder where the lab scientist can upload the PDF.

External Process ready to be run. 'Script generated' message marking individual result file placeholder.

Expected Output and Results

External program was run successfully. Individual result files named with artifact LUID and well location.

Attachments

pdfimages.py:
