How to quickly set up a local Spark development environment?
- 1. Introduction
- 2. Setup
- 3. Use VSCode devcontainers to set up Spark environment
- 4. Conclusion
- 5. Read these
1. Introduction
Setting up Spark locally is not easy, especially if you are simultaneously trying to learn Spark. If you:
- Don’t know how to start working with Spark locally
- Don’t know which tools are recommended for working with Spark (such as which IDE or which table format to use)
- Try and try, only to end up falling back on a cloud provider or giving up altogether
This post is for you! You can have a fully functioning local Spark development environment with all the bells and whistles in a few minutes.
In this post, you will learn how to set up Spark locally, how to develop with it, how to debug your code, how to code interactively with Jupyter, how to see Spark UI, and how you can easily set up a complete development environment with devcontainers and Docker.
By the end of this post, you will have learnt how to set up Spark locally and the principles behind using Docker and devcontainers to set up a local development Spark environment.
2. Setup
Follow the instructions in this repo: local_spark_setup. The following sections assume that you have completed these setup steps.
3. Use VSCode devcontainers to set up Spark environment
Let’s go over how you can develop Spark applications with this setup. The key idea is to have all of the development tools inside VSCode, making it easy for you to iterate on your code quickly.
3.1. Run code interactively with Jupyter Notebook
Open the notebook named sample_spark_iceberg.ipynb and press the Run button on the first cell, which creates the Spark session. Choose Python 3.10.16 when prompted to choose a kernel.
You can run PySpark code in the notebook, and you can also use %%sql as the first line of a cell to run Spark SQL.
Now you can create and work with tables; by default, the tables will be Apache Iceberg tables.
Note Read more about table formats here
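For example, here is a minimal sketch of two notebook cells, assuming the Spark session created in the first cell is available as spark (the exact DDL in the notebook may differ):

# Cell: create an Iceberg table with PySpark and insert a row
spark.sql("CREATE DATABASE IF NOT EXISTS sample")
spark.sql("CREATE TABLE IF NOT EXISTS sample.employee (id INT, name STRING) USING iceberg")
spark.sql("INSERT INTO sample.employee VALUES (1, 'Ada')")

And in a separate cell, the %%sql magic lets you write plain Spark SQL:

%%sql
SELECT * FROM sample.employee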
3.2. Run & Debug your PySpark code
A Python script can be run using the Run button in the top-right corner. Shown below is how you can run the sample_pyspark_script.py script.
Debugging is a vital skill that enables you to step through each line of your code to determine what is happening. You can also examine the values of variables at each of your breakpoints.
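For instance, you could set a breakpoint inside a small driver-side function like the one below and inspect the DataFrame variables as you step through it (this is an illustrative snippet, not the repo's sample_pyspark_script.py):

from pyspark.sql import SparkSession

def build_report(spark):
    # Set a breakpoint on the next line to inspect df before it is aggregated
    df = spark.createDataFrame([(1, "Ada"), (2, "Grace")], ["id", "name"])
    return df.groupBy("name").count()

if __name__ == "__main__":
    spark = SparkSession.builder.appName("Debugging example").getOrCreate()
    build_report(spark).show()
    spark.stop()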
Shown below is a GIF of how PySpark debugging works in this setup.
3.3. Explore Spark performance with the Spark UI
Spark UI provides a lot of information crucial for troubleshooting hanging jobs and identifying performance bottlenecks.
In our setup, you can
- Use the Spark History server at http://localhost:18080 to see the list of Spark applications that have completed running. Note that only completed (or stopped) Spark applications show up here. The Python script you ran earlier starts and stops a Spark application, so it will appear; in contrast, an open Jupyter notebook with an active SparkSession will not show up here until that session is stopped (see the sketch after this list).
- Use the Spark Application UI at http://localhost:4041 to see the currently running Spark application. Note that if there are no running Spark applications, you won’t be able to see this UI.
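For example, from a notebook you can check where the running application's UI is and, when you are done, stop the session so that it shows up in the Spark History server (spark here is the notebook's SparkSession):

# URL of the running application's UI as seen by the driver;
# in this setup it is exposed on the host at http://localhost:4041
print(spark.sparkContext.uiWebUrl)

# Stopping the session ends the application, after which it
# appears in the Spark History server at http://localhost:18080
spark.stop()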
3.4. Examine Iceberg data with Data Wrangler (local only)
Note The following are only viewable when running locally.
By default, the tables are created as Iceberg tables with Parquet format and stored in Minio (a local open source S3 equivalent).
Go ahead and create the sample.employee table as shown in the sample_spark_iceberg.ipynb notebook.
Now open Minio at http://localhost:9001 and log in with the username admin and the password password. The files can be browsed at warehouse/sample/employee/data as shown below:
A handy extension for exploring Parquet files is Data Wrangler.
Run the sample_pyspark_script.py script, which writes files in Iceberg format to the local file system (to the warehouse location we set when creating the SparkSession, as shown below).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("PySpark Sample Application")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/home/iceberg/warehouse")
    .getOrCreate()
)
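The script then writes a DataFrame into this local catalog; a minimal sketch of such a write might look like the following (the actual data in the script will differ):

df = spark.createDataFrame([(1, "Ada"), (2, "Grace")], ["id", "name"])
# Creates an Iceberg table under /home/iceberg/warehouse/sample_db/employees
df.writeTo("local.sample_db.employees").createOrReplace()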
Open the data folder at warehouse/sample_db/employees/data and open any file with the .parquet extension. You will be prompted to open the file with Data Wrangler, which will then show you its contents.
Connect to the local Python Interpreter runtime when prompted.
You will be able to see the data in each column along with column-level statistics.
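If you prefer to sanity-check the same data from Spark instead of Data Wrangler, you can read the table back directly (assuming the table name used above):

spark.table("local.sample_db.employees").show()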
3.5. Devcontainers make it easy to set up a local Spark environment
We use Docker to set up containers to run:
- Spark: This container, named spark-iceberg, runs the Spark master, worker, History Server, and Thrift server.
- Minio: This container runs a minio server, which provides an S3-compatible interface, i.e., it works as a local S3 equivalent.
- Iceberg REST interface: This is a REST server that allows anyone to interact with the Iceberg tables in Minio.
For more information on how Docker works, read this.
We use the following official images from DockerHub instead of creating our own:
- Spark-iceberg
- Minio
- iceberg-rest-fixture
- minio-mc is a container that creates our minio folders.
When we start the devcontainer, our devcontainer.json does the following:
- Starts the containers with Docker Compose.
- Installs VSCode in the spark-iceberg container.
- Installs Node.js in the spark-iceberg container.
- Installs the listed extensions for our VSCode instance.
- Sets the default Python version to use with our VSCode instance.
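As an illustration, a trimmed-down devcontainer.json for this kind of setup might look roughly like the sketch below; the exact file in the repo will differ (the compose file name, extension list, and Python path here are placeholders):

{
  "name": "spark-iceberg-dev",
  "dockerComposeFile": "docker-compose.yml",
  "service": "spark-iceberg",
  "workspaceFolder": "/workspace",
  "features": {
    "ghcr.io/devcontainers/features/node:1": {}
  },
  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "ms-toolsai.jupyter",
        "ms-toolsai.datawrangler"
      ],
      "settings": {
        "python.defaultInterpreterPath": "/usr/bin/python3"
      }
    }
  }
}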
4. Conclusion
To recap, we saw:
- How to set up a local Spark development environment with devcontainers
- Running & debugging Spark interactively and as a script
- Using Spark UI to explore how Spark is processing our data
- Viewing Iceberg data in its location and exploring Parquet data with Data Wrangler
- An overview of how the devcontainer sets up our local dev environment
The next time you are struggling with setting up Spark locally for development or for your job, use this repo to get started.
5. Read these
- Docker for data engineers
- Testing Spark code with Pytest
- Triggering a Spark job with AWS Lambda
- Data pipeline testing for CI
If you found this article helpful, share it with a friend or colleague using one of the socials below!