Select the Optimal Batch Size
The Clarity LIMS API has batch retrieve endpoints for samples, artifacts, containers, and files. This article talks generically about links for any of those four entities.
When using the batch endpoints, you often need to process hundreds of links or more. Intuitively, you may think that a single API call with all the links would be the fastest way to retrieve the data. However, analysis of the API performance shows that as the number of links increases beyond a threshold, the time per object increases.
To retrieve the data in the most efficient way, it is best to do multiple POSTs, each containing an optimally sized batch. For a single sample (or other entity), a batch call takes longer than a GET to that sample's endpoint. However, when more than one or two samples are needed, the batch endpoint is more efficient.
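The following is a minimal sketch of splitting a list of links into batches and building one POST body per batch. The URIs, the `chunk_links` and `build_payload` helper names, and the batch size of 500 are hypothetical, and the `ri:links` payload shape is an assumption that should be checked against the API reference for your version. No request is sent here.

```python
from xml.sax.saxutils import quoteattr


def chunk_links(links, batch_size):
    """Split a list of URIs into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [links[i:i + batch_size] for i in range(0, len(links), batch_size)]


def build_payload(batch, rel="artifacts"):
    """Build the XML body for one batch retrieve POST (assumed payload shape)."""
    rows = "".join(
        f"<link uri={quoteattr(uri)} rel={quoteattr(rel)}/>" for uri in batch
    )
    return f'<ri:links xmlns:ri="http://genologics.com/ri">{rows}</ri:links>'


# Hypothetical URIs, for illustration only.
links = [f"https://lims.example.com/api/v2/artifacts/2-{n}" for n in range(1, 1201)]

batches = chunk_links(links, 500)
payloads = [build_payload(batch) for batch in batches]

print([len(batch) for batch in batches])  # [500, 500, 200]
```

Each payload would then be sent in its own POST to the relevant batch retrieve endpoint, so 1200 links become three calls instead of one.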
Prerequisites
Before you follow the example, make sure that you understand what determines the optimal batch size:
The optimal size depends on your specific server and on the number of UDFs / custom fields or other data attached to the object being retrieved.
The optimal batch size may be different for artifacts, samples, files, and containers. For example, if the optimal size for samples is 500, 10 batches of 500 samples will retrieve the data faster than one batch of 5000.
You must also have a compatible version of the API (v2 r21 or later).
Determining Optimal Batch Size
Attached below is a simple Python script that times how long batch retrieve calls take for a range of batch sizes. Efficiency is measured as the duration of the call divided by the number of links posted.
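The efficiency measure described above can be sketched as follows. This is not the attached script: `time_per_link` and `fake_post` are hypothetical names, and `fake_post` only sleeps as a stand-in for a real batch retrieve POST.

```python
import time


def time_per_link(post_batch, batch):
    """Return the call duration in seconds divided by the number of links posted."""
    if not batch:
        raise ValueError("batch must contain at least one link")
    start = time.perf_counter()
    post_batch(batch)
    elapsed = time.perf_counter() - start
    return elapsed / len(batch)


def fake_post(batch):
    """Stand-in for a real batch retrieve call; it only waits briefly."""
    time.sleep(0.01)


per_link = time_per_link(fake_post, ["uri"] * 100)
print(f"{per_link:.6f} seconds per link")
```

Replacing `fake_post` with a function that performs the actual POST gives the per-link time for one batch size on your server.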
Hard-coded Parameters
The attached script has hard-coded parameters that define the range and increment of batch sizes to test. The number of replicates for each size is also adjustable. These parameters are on line 110 and may not require any modification, because they are already set to the following by default:
For example, the above parameters will test the following sizes: 100, 125, 150, 175, 200, 225, 250, 275.
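As an illustration, the list of sizes above corresponds to a range from 100 up to (but not including) 300 in steps of 25. The variable names and the replicate count below are hypothetical and are not taken from the attached script.

```python
# Hypothetical parameter names; the attached script may name these differently.
BATCH_MIN = 100    # smallest batch size to test
BATCH_MAX = 300    # upper bound, exclusive
BATCH_STEP = 25    # increment between tested sizes
REPLICATES = 3     # assumed number of repeated calls per size

sizes = list(range(BATCH_MIN, BATCH_MAX, BATCH_STEP))
print(sizes)  # [100, 125, 150, 175, 200, 225, 250, 275]

# One timed call per (size, replicate) pair.
runs = [(size, rep) for size in sizes for rep in range(REPLICATES)]
print(len(runs))  # 24
```

Widening the range or shrinking the step gives a finer picture at the cost of more calls against the server.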
Command-line Parameters
The parameters that are specific to your server are entered at the command line.
An example of the full syntax to invoke the script is as follows:
Expected Results
The script tracks how long each batch call takes to complete. It outputs a .txt file with the raw numeric data and the batch size that returns the minimum time per link, which is the most efficient size.
Viewing this data in a scatterplot format, you can see that the range of optimal batch sizes for the artifacts/batch/retrieve endpoint is about 200 to 300 artifacts. This result is valid for artifacts only. Each of the other entities (for example, sample, file, or container) should be evaluated separately.
The batch size with the shortest time per artifact is the most efficient, as shown in the following example:
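Selecting the most efficient size from the recorded durations can be sketched as below. The `most_efficient` helper is hypothetical and the timing values are made up for illustration; they are not measurements from any server.

```python
from statistics import mean


def most_efficient(timings):
    """Given {batch size: [call durations in seconds]}, return the size with
    the lowest mean time per link, plus the per-link times for every size."""
    per_link = {
        size: mean(durations) / size for size, durations in timings.items()
    }
    best = min(per_link, key=per_link.get)
    return best, per_link


# Made-up durations (seconds) for two replicates per batch size.
timings = {
    100: [1.0, 1.2],
    200: [1.6, 1.8],
    400: [4.4, 4.8],
}

best, per_link = most_efficient(timings)
print(best)  # 200
```

Averaging the replicates before dividing by the batch size reduces the effect of a single slow call on the result.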
Proxy timeout
By default, the LIMS send and receive timeouts are configured to 60 seconds. Very large batch calls will not complete if their duration is greater than the configured timeout. This configuration is located at
Attachments
BatchOptimalSizeTest.py: