How to quickly set up a local Spark development environment?
- 1. Introduction
- 2. Setup
- 3. Use VSCode devcontainers to set up Spark environment
- 4. Conclusion
- 5. Read these
1. Introduction
Setting up Spark locally is not easy, especially if you are simultaneously trying to learn Spark. If you:
- Don’t know how to start working with Spark locally
- Don’t know which tools are recommended for working with Spark (such as which IDE or which table format to use)
- Try and try, only to end up falling back on a cloud provider or giving up altogether
This post is for you! You can have a fully functioning local Spark development environment with all the bells and whistles in a few minutes.
In this post, you will learn how to set up Spark locally, how to develop with it, how to debug your code, how to code interactively with Jupyter, how to see Spark UI, and how you can easily set up a complete development environment with devcontainers and Docker.
By the end of this post, you will have learnt how to set up Spark locally and the principles behind using Docker and devcontainers to set up a local development Spark environment.
2. Setup
Follow the instructions in this repo: local_spark_setup. The following sections assume that you have completed these setup steps.
3. Use VSCode devcontainers to set up Spark environment
Let’s go over how you can develop Spark applications with this setup. The key idea is to have all of the development tools inside VSCode, making it easy for you to iterate on your code quickly.
3.1. Run code interactively with Jupyter Notebook
Open the notebook named sample_spark_iceberg.ipynb and press the Run button on the first cell, which creates the Spark session. Choose Python 3.10.16 when prompted to choose a kernel.
You can run PySpark code in the notebook, and you can also use %%sql as the first line of a cell to run Spark SQL.
Now you can create and work with tables; by default, the tables will be Apache Iceberg tables.
Note Read more about table formats here
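For example, here is a minimal sketch of two notebook cells, assuming the Spark session created in the first cell is available as spark (the exact DDL in the notebook may differ):

# Cell: create an Iceberg table with PySpark and insert a row
spark.sql("CREATE DATABASE IF NOT EXISTS sample")
spark.sql("CREATE TABLE IF NOT EXISTS sample.employee (id INT, name STRING) USING iceberg")
spark.sql("INSERT INTO sample.employee VALUES (1, 'Ada')")

And in a separate cell, the %%sql magic lets you write plain Spark SQL:

%%sql
SELECT * FROM sample.employee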
3.2. Run & Debug your PySpark code
A Python script can be run using the Run button in the top-right corner. Shown below is how you can run the sample_pyspark_script.py script.
Debugging is a vital skill that enables you to step through each line of your code to determine what is happening. You can also examine the values of variables at each of your breakpoints.
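For instance, you could set a breakpoint inside a small driver-side function like the one below and inspect the DataFrame variables as you step through it (this is an illustrative snippet, not the repo's sample_pyspark_script.py):

from pyspark.sql import SparkSession

def build_report(spark):
    # Set a breakpoint on the next line to inspect df before it is aggregated
    df = spark.createDataFrame([(1, "Ada"), (2, "Grace")], ["id", "name"])
    return df.groupBy("name").count()

if __name__ == "__main__":
    spark = SparkSession.builder.appName("Debugging example").getOrCreate()
    build_report(spark).show()
    spark.stop()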
Shown below is a GIF of how PySpark debugging works in this setup.
3.3. Explore Spark performance with the Spark UI
Spark UI provides a lot of information crucial for troubleshooting hanging jobs and identifying performance bottlenecks.
In our setup, you can
- Use the Spark History server at http://localhost:18080 to see the list of Spark applications that have completed running. Note that only completed (or stopped) Spark applications show up here. The Python script you ran earlier starts and stops a Spark application, so it will appear; in contrast, an open Jupyter notebook with an active SparkSession will not show up here until that session is stopped (see the sketch after this list).
- Use the Spark Application UI at http://localhost:4041 to see the currently running Spark application. Note that if there are no running Spark applications, you won’t be able to see this UI.
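For example, from a notebook you can check where the running application's UI is and, when you are done, stop the session so that it shows up in the Spark History server (spark here is the notebook's SparkSession):

# URL of the running application's UI as seen by the driver;
# in this setup it is exposed on the host at http://localhost:4041
print(spark.sparkContext.uiWebUrl)

# Stopping the session ends the application, after which it
# appears in the Spark History server at http://localhost:18080
spark.stop()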
3.4. Examine Iceberg data with Data Wrangler (local only)
Note The following are only viewable when running locally.
By default, the tables are created as Iceberg tables with Parquet format and stored in Minio (a local open source S3 equivalent).
Go ahead and create the sample.employee table as shown in the sample_spark_iceberg.ipynb notebook.
Now open Minio at http://localhost:9001 and log in with the username admin and the password password. The files can be browsed at warehouse/sample/employee/data as shown below:
A handy extension for exploring Parquet files is Data Wrangler.
Run the sample_pyspark_script.py script, which writes files in Iceberg format to the local file system (to the warehouse location we set when creating the SparkSession, as shown below).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("PySpark Sample Application")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/home/iceberg/warehouse")
    .getOrCreate()
)
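The script then writes a DataFrame into this local catalog; a minimal sketch of such a write might look like the following (the actual data in the script will differ):

df = spark.createDataFrame([(1, "Ada"), (2, "Grace")], ["id", "name"])
# Creates an Iceberg table under /home/iceberg/warehouse/sample_db/employees
df.writeTo("local.sample_db.employees").createOrReplace()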
Open the data folder at warehouse/sample_db/employees/data and open any file with the .parquet extension. You will be prompted to open the file with Data Wrangler, which will then show you its contents.
Connect to the local Python Interpreter runtime when prompted.
You will be able to see the data in each column along with column-level statistics.
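If you prefer to sanity-check the same data from Spark instead of Data Wrangler, you can read the table back directly (assuming the table name used above):

spark.table("local.sample_db.employees").show()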
3.5. Devcontainers make it easy to set up a local Spark environment
We use Docker to set up containers to run:
- Spark: This container, named spark-iceberg, runs the Spark master, worker, History Server, and Thrift server.
- Minio: This container runs a minio server, which provides an S3-compatible interface, i.e., it works as a local S3 equivalent.
- Iceberg REST interface: This is a REST server that allows anyone to interact with the Iceberg tables in Minio.
For more information on how Docker works, read this.
We use the following official images from DockerHub instead of creating our own:
- Spark-iceberg
- Minio
- iceberg-rest-fixture
- minio-mc is a container that creates our minio folders.
When we start the devcontainer, our devcontainer.json does the following:
- Starts the containers with Docker Compose.
- Installs VSCode in the spark-iceberg container.
- Installs Node.js in the spark-iceberg container.
- Installs the listed extensions for our VSCode instance.
- Sets the default Python version to use with our VSCode instance.
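As an illustration, a trimmed-down devcontainer.json for this kind of setup might look roughly like the sketch below; the exact file in the repo will differ (the compose file name, extension list, and Python path here are placeholders):

{
  "name": "spark-iceberg-dev",
  "dockerComposeFile": "docker-compose.yml",
  "service": "spark-iceberg",
  "workspaceFolder": "/workspace",
  "features": {
    "ghcr.io/devcontainers/features/node:1": {}
  },
  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "ms-toolsai.jupyter",
        "ms-toolsai.datawrangler"
      ],
      "settings": {
        "python.defaultInterpreterPath": "/usr/bin/python3"
      }
    }
  }
}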
4. Conclusion
To recap, we saw:
- How to set up a local Spark development environment with devcontainers
- Running & debugging Spark interactively and as a script
- Using Spark UI to explore how Spark is processing our data
- Viewing Iceberg data in its location and exploring Parquet data with Data Wrangler
- An overview of how the devcontainer sets up our local dev environment
The next time you are struggling with setting up Spark locally for development or for your job, use this repo to get started.
5. Read these
- Docker for data engineers
- Testing Spark code with Pytest
- Triggering a Spark job with AWS Lambda
- Data pipeline testing for CI
If you found this article helpful, share it with a friend or colleague using one of the socials below!