Spark on ICA Bench
Running a Spark application in a Bench Spark Cluster
Running a PySpark application
The JupyterLab environment is configured by default with 3 additional kernels:
PySpark – Local
PySpark – Remote
PySpark – Remote – Dynamic
When one of these kernels is selected, the Spark context is automatically initialised and can be accessed through the sc object.
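A quick way to verify that the kernel's Spark context is working is to run a small job against it. The snippet below is a minimal sketch that only assumes the preconfigured sc object described above; the numbers are illustrative.

# sc is provided by the kernel; no SparkContext/SparkSession setup is needed
rdd = sc.parallelize(range(1000), numSlices=4)
print(rdd.sum())    # 499500
print(sc.master)    # shows which master this kernel is bound to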

PySpark – Local
The PySpark – Local runtime environment launches the Spark driver on the workspace node and creates all Spark executors on that same node. It does not require a Spark cluster and is suitable for smaller Spark applications that do not exceed the capacity of a single node.
The Spark configuration can be found at /data/.spark/local/conf/spark-defaults.conf.
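The settings from this file are applied to the kernel's Spark context and can be inspected at runtime. A minimal sketch, assuming only the kernel-provided sc object (the exact keys and values depend on how your Bench workspace is configured):

# List the effective Spark configuration of the Local kernel
for key, value in sorted(sc.getConf().getAll()):
    print(key, value)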
PySpark – Remote
The PySpark – Remote runtime environment launches the Spark driver on the workspace node and interacts with the Manager to schedule tasks onto executors created across the Bench Cluster.
This configuration does not dynamically request executors, so it will not trigger auto-scaling when using a Dynamic Bench cluster.
The Spark configuration can be found at /data/.spark/remote/conf/spark-defaults.conf.
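Because the driver stays on the workspace node while tasks run on the cluster, it can be useful to confirm where work is actually executed. The sketch below is an illustrative check, not part of the Bench tooling; the hostnames it prints depend entirely on your Bench cluster:

import socket

# Run one task per default partition and report the executor hostnames;
# with the Remote kernel these should be cluster members, not the workspace node
hosts = (sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
           .map(lambda _: socket.gethostname())
           .distinct()
           .collect())
print(hosts)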
PySpark – Remote – Dynamic
The PySpark – Remote – Dynamic runtime environment launches the Spark driver on the workspace node and interacts with the Manager to schedule tasks onto executors created across the Bench Cluster.
This configuration increases or decreases the number of requested executors as needed, causing a Dynamic Bench cluster to auto-scale.
The Spark configuration can be found at /data/.spark/remote/conf-dynamic/spark-defaults.conf.
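The dynamic behaviour typically comes from Spark's standard dynamic allocation settings in that file. The values below are illustrative assumptions rather than the values shipped with Bench, so check the file itself:

spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 0
spark.dynamicAllocation.maxExecutors 16
spark.dynamicAllocation.shuffleTracking.enabled true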
Job resources
Every cluster member has a certain capacity, determined by the Resource model selected for that member.
A Spark application consists of one or more jobs. Each job consists of one or more stages, and each stage consists of one or more tasks. Tasks are handled by executors, and executors run on a worker (cluster member).
The following setting defines the number of CPUs needed per task:
spark.task.cpus 1
The following settings define the size of a single executor, which handles the execution of tasks:
spark.executor.cores 4
spark.executor.memory 4g
With the above example, an executor handles 4 tasks concurrently, and those tasks share a total capacity of 4 GB of memory. Depending on the resource model chosen (e.g. standard-2xlarge), a single cluster member (worker node) can run multiple executors concurrently (e.g. 32 cores and 128 GB allow 8 concurrent executors on a single cluster member).
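If an application needs a different executor size, these settings can also be supplied programmatically when the session is created. The sketch below is a generic PySpark example, not Bench-specific; note that executor settings only take effect if they are set before the Spark context is created, so in an already-running kernel session spark-defaults.conf remains the place to change them.

from pyspark.sql import SparkSession

# Request executors of 4 cores / 4g each; a 32-core, 128 GB worker
# can then host up to 8 such executors concurrently.
spark = (SparkSession.builder
         .appName("executor-sizing-example")
         .config("spark.task.cpus", "1")
         .config("spark.executor.cores", "4")
         .config("spark.executor.memory", "4g")
         .getOrCreate())
sc = spark.sparkContext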
Spark User Interface
The Spark UI can be accessed via the cluster; the Web Access URL is displayed on the Workspace details page.
The Spark UI registers all applications submitted through one of the Remote Jupyter kernels and provides an overview of the registered workers (cluster members) and the applications running in the Spark cluster.

Spark Reference documentation
See the Apache Spark website (https://spark.apache.org).