How to Use Spark SQL Merge Into - Step-by-Step Tutorial

Understand when and why to use Merge Into and its tradeoffs, with real-world code examples and visual diagrams.

Author: Joseph Machado

Published: August 13, 2025

Keywords: Spark SQL, Merge Into, Warehouse, Data Warehouse

Introduction

Queries that modify existing data in a warehouse are some of the most complex, error-prone code you will build. If you have

Struggled with debugging issues caused by a series of data updates

Had to deal with partially updated data

Had to deal with complex systems that update data in multiple places

This post is for you. Imagine all your complex table update logic in a single place. Debugging and maintenance will be a breeze.

This is what MERGE INTO enables you to do. By the end of this post, you will know why and how to use MERGE INTO effectively.

We will walk through a complex example that, without MERGE INTO, would require multiple queries and take even longer to debug.

MERGE INTO enables you to update/delete/insert rows in a table

MERGE INTO enables you to modify data in a table (aka target table) based on values in another table (aka source table).

Target table is the table to be modified. Source table is the input to the modification logic represented by MERGE INTO.

MERGE INTO involves two key parts

  1. A join condition, specified with ON, that identifies which target rows have corresponding source rows.
  2. A series of clauses that define how target rows are to be modified (update/delete) or how source rows are to be inserted. Each row is evaluated against the clauses in the specified order; the first clause whose condition is met is applied, the remaining clauses are skipped, and evaluation moves on to the next row.
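The first-match-wins evaluation order can be sketched in plain Python (an illustrative model only; Spark does this internally, and the clause conditions here are hypothetical):

```python
# Illustrative model of MERGE INTO's per-row clause evaluation.
# Each clause is a (condition, action) pair; the FIRST clause whose
# condition holds determines what happens to the row.

def first_matching_action(row, clauses):
    """Return the action of the first clause whose condition holds, else None."""
    for condition, action in clauses:
        if condition(row):
            return action
    return None  # no clause fired: the row is left unchanged

# Hypothetical clauses mirroring WHEN MATCHED AND <cond> THEN <action>
clauses = [
    (lambda r: r["matched"] and r["source_value"] > r["target_value"], "UPDATE"),
    (lambda r: r["matched"], "DELETE"),
    (lambda r: not r["matched"], "INSERT"),
]

# Matched row whose source value is NOT larger: clause 1 fails, clause 2 fires.
print(first_matching_action({"matched": True, "source_value": 5, "target_value": 9}, clauses))
```

Note that ordering matters: swapping the first two clauses above would delete every matched row before the update condition is ever checked.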

There are three types of matching clauses:

Clause                      Description                                    INSERT  UPDATE  DELETE
WHEN MATCHED                Update or delete matched target rows                   ✓       ✓
WHEN NOT MATCHED            Insert unmatched source rows into the target   ✓
WHEN NOT MATCHED BY SOURCE  Update or delete unmatched target rows                 ✓       ✓

MERGE INTO is atomic: the statement either fully succeeds or fully fails, so you won't be left with partially updated data if something goes wrong.

Working Example

To execute the code in this post, you need the following:

  1. Set up Spark locally
  2. Start a Jupyter notebook
  3. Create the tables using the code below
spark.sql("create schema if not exists merge_into")

spark.sql("use merge_into")

spark.sql("DROP TABLE IF EXISTS source_data")
spark.sql("""
CREATE TABLE source_data (
    id INT,
    value INT
)
""")
spark.sql("""
INSERT INTO source_data VALUES
    (1, 10),
    (2, 25),
    (3, 5),
    (10, 100),
    (20, 200),
    (30, 5)
""")

spark.sql("DROP TABLE IF EXISTS target_table")
spark.sql("""
CREATE TABLE target_table (
    id INT,
    value INT,
    is_active BOOLEAN
)
""")
spark.sql("""
INSERT INTO target_table VALUES
    (1, 10, true),
    (2, 20, true),
    (3, 30, true),
    (4, 40, true),
    (5, 50, true),
    (6, 5, true)
""")

Let’s assume we need to update a table based on the following logic:

  1. Target rows should be updated with their value = max(current value, incoming value)
  2. New rows from incoming data should be inserted into the target
  3. Target rows that are not present in incoming data should be set to inactive (is_active=False)
  4. Any target rows with a value less than a specified threshold (15 in our example) should be removed.

Traditionally, this would require multiple update queries, but it is much simpler with MERGE INTO, as shown below:

MERGE INTO target_table
USING source_data
ON target_table.id = source_data.id
    
WHEN MATCHED 
  AND source_data.value >= 15 
  AND source_data.value > target_table.value 
  THEN 
  UPDATE SET target_table.value = source_data.value

WHEN MATCHED 
  AND target_table.value < 15  
  THEN DELETE
    
WHEN NOT MATCHED 
  AND source_data.value >= 15 
  THEN INSERT (id, value, is_active) 
VALUES (source_data.id, source_data.value, true)

WHEN NOT MATCHED BY SOURCE 
  AND target_table.value < 15 
  THEN DELETE

WHEN NOT MATCHED BY SOURCE  
  THEN UPDATE SET is_active = false;
The numbered clauses do the following, in order:

  1. Update matched target rows with the source value
  2. Delete matched target rows with values below the threshold
  3. Insert unmatched source rows into the target
  4. Delete unmatched target rows with values below the threshold
  5. Deactivate unmatched target rows
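To make the clause ordering concrete, here is a pure-Python model of the statement above (an illustrative sketch of the semantics, not how Spark executes it). Tables are modeled as dicts: the target maps id to (value, is_active), the source maps id to value.

```python
# Pure-Python model of the MERGE INTO statement above.
# For each target row, the first clause whose condition holds wins.

THRESHOLD = 15

def merge_into(target, source, threshold=THRESHOLD):
    result = {}
    for tid, (value, is_active) in target.items():
        if tid in source:                                # MATCHED
            sval = source[tid]
            if sval >= threshold and sval > value:
                result[tid] = (sval, is_active)          # clause 1: update
            elif value < threshold:
                pass                                     # clause 2: delete
            else:
                result[tid] = (value, is_active)         # no clause fired
        else:                                            # NOT MATCHED BY SOURCE
            if value < threshold:
                pass                                     # clause 4: delete
            else:
                result[tid] = (value, False)             # clause 5: deactivate
    for sid, sval in source.items():                     # NOT MATCHED
        if sid not in target and sval >= threshold:
            result[sid] = (sval, True)                   # clause 3: insert
    return result

target = {1: (10, True), 2: (20, True), 3: (30, True),
          4: (40, True), 5: (50, True), 6: (5, True)}
source = {1: 10, 2: 25, 3: 5, 10: 100, 20: 200, 30: 5}

print(merge_into(target, source))
# {2: (25, True), 3: (30, True), 4: (40, False), 5: (50, False),
#  10: (100, True), 20: (200, True)}
```

Note the subtle outcome for id 1: its source value (10) fails the `>= 15` guard in clause 1, so clause 2 fires and the row is deleted, even though it had a match in the source.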

We can see how the rows are updated visually as follows:

[Diagram: MERGE INTO row-by-row evaluation]

Tradeoffs

MERGE INTO has the following caveats:

  1. Performance: MERGE INTO is an expensive operation; the system essentially performs a full outer join between the source and target, which typically requires a full scan of the target table. Avoid MERGE INTO on large fact tables.
  2. Hard to debug: The more matching clauses you add, the harder the statement is to test and debug. Keep MERGE INTO logic easy to read, or use it sparingly.
  3. Schema evolution: If schema evolution is switched on and you are not careful, a MERGE can evolve your table beyond its intended schema.
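As a concrete example of the schema evolution caveat: if your target happens to be a Delta Lake table, automatic schema merging is controlled by a session setting, and keeping it off (the default) prevents a MERGE from silently adding new columns to the target. A sketch, assuming Delta Lake and a live `spark` session:

```python
# Delta Lake-specific setting (assumption: the target is a Delta table).
# When enabled, MERGE/write operations may add new source columns to the
# target schema; keeping it disabled (the default) prevents that.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "false")
```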

Conclusion

To recap, we saw:

  1. How MERGE INTO enables you to update/delete/insert rows in a table
  2. How to use MERGE INTO with a working example
  3. The tradeoffs to consider

The next time you are faced with complex update logic that would otherwise require multiple queries, use this post to rewrite it with MERGE INTO.

Let me know in the comments below how you use MERGE INTO.
