How to Use Spark SQL Merge Into - Step-by-Step Tutorial

Introduction

Queries involving modifying existing data in a warehouse are some of the complex, error-prone code you will build. If you have

Struggled with debugging issues caused by a series of data updates

Had to deal with partially updated data

Had to deal with complex systems that update data in multiple places

This post is for you. Imagine all your complex table update logic in a single place. Debugging and maintenance will be a breeze.

This is what MERGE INTO enables you to do. By the end of this post, you will know why and how to use MERGE INTO effectively.

We will walk through a complex example that, without MERGE INTO, would require multiple queries and take even longer to debug.

Merge Into enables you to Update/Delete/Insert rows into a table

MERGE INTO enables you to modify data in a table (aka target table) based on values in another table (aka source table).

Target table is the table to be modified. Source table is the input to the modification logic represented by MERGE INTO.

MERGE INTO involves two key parts

A join criterion to identify target rows that have corresponding source rows. The rows are matched with an on criterion.
A series of criteria that define how the target rows are to be modified (update/delete) or to insert source data. A row is evaluated against the criterion in the specified order. When met, the process is repeated for the next row.

There are three ways to define matching criteria.

Clause	Description	INSERT	UPDATE	DELETE
`WHEN MATCHED`	Used to update or delete matched target rows	❌	✅	✅
`WHEN NOT MATCHED`	Used to insert unmatched source rows into target	✅	❌	❌
`WHEN NOT MATCHED BY SOURCE`	Used to update or delete unmatched target rows	❌	✅	✅

The MERGE INTO will either pass or fail, and you won’t be left with partially updated data if something fails.

Working Example

Expand for Setup

To execute the code in this post, you need the following

Setup Spark locally
Start a Jupyter notebook
Create tables using the code below

spark.sql("create schema if not exists merge_into")

spark.sql("use merge_into")

spark.sql("DROP TABLE IF EXISTS source_data")
spark.sql("""
CREATE TABLE source_data (
    id INT,
    value INT
)
""")
spark.sql("""
INSERT INTO source_data VALUES
    (1, 10),
    (2, 25),
    (3, 5),
    (10, 100),
    (20, 200),
    (30, 5)
""")

spark.sql("DROP TABLE IF EXISTS target_table")
spark.sql("""
CREATE TABLE target_table (
    id INT,
    value INT,
    is_active BOOLEAN
)
""")
spark.sql("""
INSERT INTO target_table VALUES
    (1, 10, true),
    (2, 20, true),
    (3, 30, true),
    (4, 40, true),
    (5, 50, true),
    (6, 5, true)
""")

Let’s assume we need to update a table based on the following logic:

Target rows should be updated with their value = max(current value, incoming value)
New rows from incoming data should be inserted into the target
Target rows that are not present in incoming data should be set to inactive (is_active=False)
Any target rows with a value less than a specified threshold (15 in our example) should be removed.

Traditionally, this would require multiple update queries, but it is much simpler with MERGE INTO, as shown below:

MERGE INTO target_table
USING source_data
ON target_table.id = source_data.id
    
WHEN MATCHED 
  AND source_data.value >= 15 
  AND source_data.value > target_table.value 
  THEN 
  UPDATE SET target_table.value = source_data.value

WHEN MATCHED 
  AND target_table.value < 15  
  THEN DELETE
    
WHEN NOT MATCHED 
  AND source_data.value >= 15 
  THEN INSERT (id, value, is_active) 
VALUES (source_data.id, source_data.value, true)

WHEN NOT MATCHED BY SOURCE 
  AND target_table.value < 15 
  THEN DELETE

WHEN NOT MATCHED BY SOURCE  
  THEN UPDATE SET is_active = false;

1: Update matched target row with source value
2: Delete matched target rows with values less than the threshold
3: Insert unmatched source rows into the target
4: Delete unmatched target rows with values less than the threshold
5: Deactivate unmatched target rows

We can see how the rows are updated visually as follows:

Tradeoffs

MERGE INTO has the following caveats:

Performance: It’s an expensive operation as the system essentially does a full outer join and any full scane. Do not use MERGE INTO with fact tables.
Hard to debug: The more the matching logic, the harder it is to test and debug. Keep MERGE INTO logic easy to read or use sparingly.
Schema evolution: If you are not careful and have schema evolution switched on, your table may evolve beyond its standard schema.

Conclusion

To recap, we saw how

Merge Into enables you to update/delete/insert rows into a table
Practice MERGE INTO with an example
Tradeoffs to consider

The next time you are faced with a complex update logic that requires multiple queries, use this post to rewrite it with MERGE INTO.

Let me know in the comments below how you use MERGE INTO.

Introduction

Merge Into enables you to Update/Delete/Insert rows into a table

Working Example

Tradeoffs

Conclusion

Recommended Reading