Introduction
Queries involving modifying existing data in a warehouse are some of the complex, error-prone code you will build. If you have
Struggled with debugging issues caused by a series of data updates
Had to deal with partially updated data
Had to deal with complex systems that update data in multiple places
This post is for you. Imagine all your complex table update logic in a single place. Debugging and maintenance will be a breeze.
This is what MERGE INTO enables you to do. By the end of this post, you will know why and how to use MERGE INTO effectively.
We will walk through a complex example that, without MERGE INTO, would require multiple queries and take even longer to debug.
Merge Into enables you to Update/Delete/Insert rows into a table
MERGE INTO enables you to modify data in a table (aka target table) based on values in another table (aka source table).
Target table is the table to be modified. Source table is the input to the modification logic represented by MERGE INTO.
MERGE INTO involves two key parts
- A join criterion to identify target rows that have corresponding source rows. The rows are matched with an
oncriterion. - A series of criteria that define how the target rows are to be modified (update/delete) or to insert source data. A row is evaluated against the criterion in the specified order. When met, the process is repeated for the next row.
There are three ways to define matching criteria.
| Clause | Description | INSERT | UPDATE | DELETE |
|---|---|---|---|---|
WHEN MATCHED |
Used to update or delete matched target rows | ❌ | ✅ | ✅ |
WHEN NOT MATCHED |
Used to insert unmatched source rows into target | ✅ | ❌ | ❌ |
WHEN NOT MATCHED BY SOURCE |
Used to update or delete unmatched target rows | ❌ | ✅ | ✅ |
The MERGE INTO will either pass or fail, and you won’t be left with partially updated data if something fails.
Working Example
To execute the code in this post, you need the following
- Setup Spark locally
- Start a Jupyter notebook
- Create tables using the code below
spark.sql("create schema if not exists merge_into")
spark.sql("use merge_into")
spark.sql("DROP TABLE IF EXISTS source_data")
spark.sql("""
CREATE TABLE source_data (
id INT,
value INT
)
""")
spark.sql("""
INSERT INTO source_data VALUES
(1, 10),
(2, 25),
(3, 5),
(10, 100),
(20, 200),
(30, 5)
""")
spark.sql("DROP TABLE IF EXISTS target_table")
spark.sql("""
CREATE TABLE target_table (
id INT,
value INT,
is_active BOOLEAN
)
""")
spark.sql("""
INSERT INTO target_table VALUES
(1, 10, true),
(2, 20, true),
(3, 30, true),
(4, 40, true),
(5, 50, true),
(6, 5, true)
""")Let’s assume we need to update a table based on the following logic:
- Target rows should be updated with their value = max(current value, incoming value)
- New rows from incoming data should be inserted into the target
- Target rows that are not present in incoming data should be set to inactive (
is_active=False) - Any target rows with a value less than a specified threshold (15 in our example) should be removed.
Traditionally, this would require multiple update queries, but it is much simpler with MERGE INTO, as shown below:
MERGE INTO target_table
USING source_data
ON target_table.id = source_data.id
WHEN MATCHED
AND source_data.value >= 15
AND source_data.value > target_table.value
THEN
UPDATE SET target_table.value = source_data.value
WHEN MATCHED
AND target_table.value < 15
THEN DELETE
WHEN NOT MATCHED
AND source_data.value >= 15
THEN INSERT (id, value, is_active)
VALUES (source_data.id, source_data.value, true)
WHEN NOT MATCHED BY SOURCE
AND target_table.value < 15
THEN DELETE
WHEN NOT MATCHED BY SOURCE
THEN UPDATE SET is_active = false;- 1
- Update matched target row with source value
- 2
- Delete matched target rows with values less than the threshold
- 3
- Insert unmatched source rows into the target
- 4
- Delete unmatched target rows with values less than the threshold
- 5
- Deactivate unmatched target rows
We can see how the rows are updated visually as follows:
Tradeoffs
MERGE INTO has the following caveats:
Performance: It’s an expensive operation as the system essentially does a full outer join and any full scane. Do not useMERGE INTOwith fact tables.Hard to debug: The more the matching logic, the harder it is to test and debug. KeepMERGE INTOlogic easy to read or use sparingly.Schema evolution: If you are not careful and have schema evolution switched on, your table may evolve beyond its standard schema.
Conclusion
To recap, we saw how
- Merge Into enables you to update/delete/insert rows into a table
- Practice MERGE INTO with an example
- Tradeoffs to consider
The next time you are faced with a complex update logic that requires multiple queries, use this post to rewrite it with MERGE INTO.
Let me know in the comments below how you use MERGE INTO.
