How to quickly set up a local Spark development environment?

1. Introduction

Setting up Spark locally is not easy, especially if you are trying to learn Spark at the same time. If you

Don’t know how to start working with Spark locally

Don’t know which tools are recommended for working with Spark (such as which IDE or which table format to use)

Keep trying and giving up, only to fall back on one of the cloud providers or abandon the effort altogether

This post is for you! You can have a fully functioning local Spark development environment with all the bells and whistles in a few minutes.

In this post, you will learn how to set up Spark locally, how to develop with it, how to debug your code, how to code interactively with Jupyter, how to use the Spark UI, and how to easily set up a complete development environment with devcontainers and Docker.

By the end of this post, you will have learnt how to set up Spark locally and the principles behind using Docker and devcontainers to create a local Spark development environment.

2. Setup

Follow the instructions in this repo: local_spark_setup. The following sections assume that you have completed these setup steps.

3. Use VSCode devcontainers to set up Spark environment

Let’s go over how you can develop Spark applications with this setup. The key idea is to have all of the development tools in VSCode, making it easy for you to quickly iterate on your code.

3.1. Run code interactively with Jupyter Notebook

Open the notebook named sample_spark_iceberg.ipynb and press the Run button on the first cell, which creates the Spark session.

Choose Python 3.10.16 when prompted to select a kernel.

Notebook

You can run PySpark code in the notebook, and you can also use %%sql as the first line of a cell to run Spark SQL.

Now you can create and work with tables; by default, the tables will be Apache Iceberg tables.
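For example, assuming the SparkSession is available as spark (created by the first cell) and the sample catalog from the setup repo is the default, a cell like the one below creates and queries an Iceberg table; the schema is illustrative, and the same statements could also be run from a %%sql cell.

# A minimal sketch: the table name matches the sample.employee table used in
# the sample notebook, but the columns and rows below are made up for illustration.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sample.employee (
        id BIGINT,
        name STRING
    ) USING iceberg
""")
spark.sql("INSERT INTO sample.employee VALUES (1, 'Alice'), (2, 'Bob')")
spark.sql("SELECT * FROM sample.employee").show()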

Note Read more about table formats here

3.2. Run & Debug your PySpark code

A Python script can be run using the Run button in the top-right corner. Shown below is how you can run the sample_pyspark_script.py script.

Run Python

Debugging is a vital skill: it lets you step through your code line by line to see exactly what is happening and examine the values of variables at each of your breakpoints.

Shown below is a GIF of how PySpark debugging works in this setup.

Debug Python
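If you want something small to practice on, here is a hypothetical script (not part of the repo) that you could step through; set a breakpoint on the line that builds filtered and inspect the DataFrames in the VARIABLES pane or the DEBUG CONSOLE.

# A tiny, hypothetical script for practicing the debugger; none of these names
# come from the repo. Set a breakpoint on the `filtered = ...` line, then
# evaluate expressions like filtered.count() in the DEBUG CONSOLE.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Debugging practice").getOrCreate()

df = spark.createDataFrame([("Alice", 30), ("Bob", 45)], ["name", "age"])
filtered = df.filter(df.age > 40)  # breakpoint here
filtered.show()

spark.stop()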

3.3. Explore Spark performance with the Spark UI

The Spark UI provides a lot of information that is crucial for troubleshooting hanging jobs and identifying performance bottlenecks.

In our setup, you can

  1. Use the Spark History Server at http://localhost:18080 to see the list of Spark applications that have completed running. Note that only completed (or stopped) Spark applications show up here: the Python script you ran earlier starts and stops a Spark application, whereas an open Jupyter notebook with an active SparkSession will not appear until that session is stopped. (The event-log settings the History Server relies on are sketched after this list.)

History server

  2. Use the Spark Application UI at http://localhost:4041 to see the currently running Spark application. Note that if there are no running Spark applications, you won’t be able to see this UI.

Spark UI
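The History Server only lists applications that wrote event logs. The spark-iceberg image in this repo already takes care of that, but for reference, the relevant settings look roughly like this (the log directory below is an assumption, not necessarily the repo's actual path):

# Illustrative only: the event-log settings the Spark History Server reads.
# In this setup they are preconfigured in the spark-iceberg container; the
# directory below is a placeholder and must match spark.history.fs.logDirectory.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("App that appears in the History Server")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "/tmp/spark-events")  # placeholder path
    .getOrCreate()
)
# ... run your job ...
spark.stop()  # the app shows up at http://localhost:18080 once it is stopped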

3.4. Examine Iceberg data with Data Wrangler (local only)

Note The following are only viewable when running locally.

By default, the tables are created as Iceberg tables in Parquet format and stored in Minio (a local, open-source, S3-compatible object store).

Go ahead and create the sample.employee table as shown in the sample_spark_iceberg.ipynb notebook.

Now open the Minio console at http://localhost:9001 and log in with the username admin and the password password. The files can be browsed at warehouse/sample/employee/data as shown below:

Minio

A handy extension for exploring Parquet files is Data Wrangler.

Run the sample_pyspark_script.py script, which writes Iceberg-format files to the local file system (a location we set when creating the SparkSession, as shown below).

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("PySpark Sample Application")
    # Register an Iceberg catalog named "local", backed by the local file system
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/home/iceberg/warehouse")
    .getOrCreate()
)
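With that catalog registered, the write itself looks roughly like this (the table name mirrors the sample_db/employees path the script produces; the DataFrame contents are made up for illustration):

# Illustrative write and read against the "local" Iceberg catalog defined above.
# Column names and rows are invented for the example.
df = spark.createDataFrame(
    [(1, "Alice", "Engineering"), (2, "Bob", "Sales")],
    ["id", "name", "department"],
)
df.writeTo("local.sample_db.employees").createOrReplace()  # DataFrameWriterV2 API
spark.table("local.sample_db.employees").show()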

Open the data folder at warehouse/sample_db/employees/data and open any file with the .parquet extension. You will be prompted to open the file with Data Wrangler, which lets you see its contents.

DataWrangler

Connect to the local Python interpreter runtime.

DataWrangler

You will be able to see the data along with column-level statistics.

DataWrangler Explore
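If you prefer to peek at the same Parquet files from code instead of the UI, a quick sketch (assuming pandas and pyarrow are available in the container; the file name below is a placeholder) looks like this:

# Optional, code-based alternative to Data Wrangler for inspecting a Parquet
# data file. Assumes pandas + pyarrow are installed; replace the placeholder
# file name with any .parquet file under the employees data folder.
import pandas as pd

df = pd.read_parquet(
    "/home/iceberg/warehouse/sample_db/employees/data/<some-file>.parquet"
)
print(df.head())
print(df.describe(include="all"))  # rough per-column statistics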

3.5. Devcontainers make it easy to set up a local Spark environment

Architecture

We use Docker to set up containers that run:

  1. Spark: This container, named spark-iceberg, runs the Spark master, worker, History Server, and Thrift server.
  2. Minio: This container runs a Minio server, which provides an S3-compatible interface, i.e., it works as a local S3 equivalent.
  3. Iceberg REST interface: This is a REST server that lets clients interact with the Iceberg tables stored in Minio.

For more information on how Docker works, read this.

We use the following official images from Docker Hub instead of creating our own:

  1. Spark-iceberg
  2. Minio
  3. iceberg-rest-fixture
  4. minio-mc: a container that creates our Minio folders.

When we start the devcontainer, our devcontainer.json does the following:

  1. Starts the containers with Docker Compose.
  2. Installs VSCode in the spark-iceberg container.
  3. Installs Node.js in the spark-iceberg container.
  4. Installs the listed extensions for our VSCode instance.
  5. Sets the default Python version to use with our VSCode instance.

4. Conclusion

To recap, we saw:

  1. How to set up a local Spark development environment with devcontainers
  2. How to run and debug Spark code interactively and as a script
  3. How to use the Spark UI to explore how Spark processes our data
  4. How to view Iceberg data in its storage location and explore Parquet data with Data Wrangler
  5. How devcontainers set up our local dev environment

The next time you are struggling to set up Spark locally, whether for learning or for your job, use this repo to get started.

5. Read these

  1. Docker for data engineers
  2. Testing Spark code with Pytest
  3. Triggering a Spark job with AWS Lambda
  4. Data pipeline testing for CI

If you found this article helpful, share it with a friend or colleague using one of the socials below!

Land your dream Data Engineering job with my free book!

Build data engineering proficiency with my free book!

Are you looking to enter the field of data engineering? And are you

> Overwhelmed by all the concepts/jargon/frameworks of data engineering?

> Feeling lost because there is no clear roadmap for someone to quickly get up to speed with the essentials of data engineering?

Learning to be a data engineer can be a long and rough road, but it doesn't have to be!

Imagine knowing the fundamentals of data engineering that are crucial to any data team. You will be able to quickly pick up any new tool or framework.

Sign up for my free Data Engineering 101 Course. You will get

✅ Instant access to my Data Engineering 101 e-book, which covers SQL, Python, Docker, dbt, Airflow & Spark.

✅ Executable code to practice and exercises to test yourself.

✅ Weekly email for 4 weeks with the exercise solutions.

Join now and get started on your data engineering journey!

    Testimonials:

    I really appreciate you putting these detailed posts together for your readers, you explain things in such a detailed, simple manner that's well organized and easy to follow. I appreciate it so so much!
    I have learned a lot from the course which is much more practical.
    This course helped me build a project and actually land a data engineering job! Thank you.

    When you subscribe, you'll also get emails about data engineering concepts, development practices, career advice, and projects every 2 weeks (or so) to help you level up your data engineering skills. We respect your email privacy.
