Spark on ICA Bench

Running a Spark application in a Bench Spark Cluster

Running a PySpark application

The JupyterLab environment is configured by default with three additional PySpark kernels.

When one of these kernels is selected, the Spark context is automatically initialised and can be accessed through the sc object.
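
For example, in a notebook cell the pre-initialised context can be used directly. This is a minimal sketch; sc is the only object assumed to be provided by the kernel:

# The kernel provides sc (SparkContext); no explicit initialisation is needed.
print(sc.version)                  # Spark version of the selected runtime
rdd = sc.parallelize(range(100))   # distribute a small dataset
print(rdd.sum())                   # run a simple action -> 4950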

PySpark - Local

The PySpark - Local runtime environment launches the Spark driver locally on the workspace node, and all Spark executors are created on that same node. It does not require a Spark cluster and can be used for smaller Spark applications that do not exceed the capacity of a single node.

The spark configuration can be found at /data/.spark/local/conf/spark-defaults.conf.

Making changes to the configuration requires a restart of the Jupyter kernel.
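
To verify which values are in effect after a restart, the configuration can be inspected from within the kernel. This is a minimal sketch using standard PySpark calls:

# Dump the effective Spark configuration of the running kernel.
for key, value in sorted(sc.getConf().getAll()):
    print(key, "=", value)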

PySpark - Remote

The PySpark - Remote runtime environment launches the Spark driver locally on the workspace node and interacts with the Manager to schedule tasks onto executors created across the Bench Cluster.

This configuration does not dynamically spin up executors, so it will not trigger the cluster to auto-scale when using a Dynamic Bench cluster.

The spark configuration can be found at /data/.spark/remote/conf/spark-defaults.conf.

Making changes to the configuration requires a restart of the Jupyter kernel.

PySpark - Remote - Dynamic

The PySpark - Remote - Dynamic runtime environment launches the Spark driver locally on the workspace node and interacts with the Manager to schedule tasks onto executors created across the Bench Cluster.

This configuration increases or decreases the number of required executors as needed, which allows the cluster to auto-scale when using a Dynamic Bench cluster.
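
Auto-scaling of executors in Spark is typically driven by its standard dynamic allocation properties. The exact values set in conf-dynamic are managed by the environment, but the settings in effect can be checked from the kernel (illustrative sketch; the property names are standard Spark settings, not ICA-specific):

# Check the dynamic allocation settings the kernel was started with.
conf = sc.getConf()
print(conf.get("spark.dynamicAllocation.enabled", "false"))
print(conf.get("spark.dynamicAllocation.minExecutors", "not set"))
print(conf.get("spark.dynamicAllocation.maxExecutors", "not set"))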

The spark configuration can be found at /data/.spark/remote/conf-dynamic/spark-defaults.conf.

Making changes to the configuration requires a restart of the Jupyter kernel.

Job resources

Every cluster member has a certain capacity, depending on the Resource model selected for that member.

A Spark application consists of one or more jobs. Each job consists of one or more stages, and each stage consists of one or more tasks. Tasks are handled by executors, and executors run on a worker (cluster member).
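
As an illustration (a minimal sketch, assuming one of the PySpark kernels above), a single action launches a job whose tasks follow the partitioning of the data:

# One action -> one job; the tasks per stage follow the number of partitions.
rdd = sc.parallelize(range(1000), numSlices=16)   # 16 partitions -> 16 tasks per stage
print(rdd.getNumPartitions())                     # 16
print(rdd.map(lambda x: x * x).sum())             # triggers the job; tasks run on executors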

The following setting defines the number of CPUs needed per task:

spark.task.cpus 1

The following settings define the size of a single executor, which handles the execution of tasks:

spark.executor.cores 4 
spark.executor.memory 4g

The above example allows an executor to handle 4 tasks concurrently, sharing a total capacity of 4 GB of memory. Depending on the resource model chosen (e.g. standard-2xlarge), a single cluster member (worker node) is able to run multiple executors concurrently (e.g. 32 cores and 128 GB of memory allow 8 concurrent executors on a single cluster member).
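
The arithmetic behind that example can be sketched as follows. This is illustrative only: the node figures are taken from the example above, and the calculation ignores memory reserved for Spark overhead and the operating system.

# Assumed member capacity (example resource model) and the executor sizing from above.
node_cores, node_memory_gb = 32, 128
executor_cores, executor_memory_gb = 4, 4     # spark.executor.cores / spark.executor.memory
task_cpus = 1                                 # spark.task.cpus

executors_per_node = min(node_cores // executor_cores, node_memory_gb // executor_memory_gb)
concurrent_tasks = executors_per_node * (executor_cores // task_cpus)
print(executors_per_node, concurrent_tasks)   # 8 executors, 32 concurrent tasks per member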

Spark User Interface

The Spark UI can be accessed via the Cluster; the Web Access URL is displayed on the Workspace details page.

This Spark UI registers all applications submitted through one of the Remote Jupyter kernels. It provides an overview of the registered workers (cluster members) and the applications running in the Spark cluster.
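
In addition to the cluster-level UI reachable through the Web Access URL, each application exposes its own driver UI. Its address can be printed from the kernel (minimal sketch using a standard PySpark property):

# URL of the per-application Spark UI started by the driver.
print(sc.uiWebUrl)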

Spark Reference documentation

See the Apache Spark documentation (https://spark.apache.org/docs/latest/).
