CTEs(Common Table Expression) or Temporary Tables for Spark SQL
- 1. Introduction
- 2. CTE for short clean code & temp tables for re-usability
- 3. Conclusion
- 4. Recommended reading
1. Introduction
As a data engineer, CTEs(Common Table Expression) are one of the best techniques you can use to improve query readability. Most DEs wish they had started using CTEs sooner in their careers.
However, overuse of CTEs can lead to
complex queries trying to fit everything into a single, massive nested statements
Hard to debug code, as you canβt quickly inspect the results of one CTE at a time
Inability to reuse code, as CTEs in one query cannot be used in another query
Imagine having the benefits of CTE (readability) while being able to make your code reusable & easy to debug. That is what temp tables enable you to do.
In this post, we will explore how temporary tables can enhance CTEs by making your code more reusable and easier to debug.
Prerequisites:
TL;DR: CTEs are great for single use; Temp tables are great for debuggability & re-usability
2. CTE for short clean code & temp tables for re-usability
Assume both the CTE and temp table in the code below are written with clean SQL; it is impossible to compare approaches with bad code.
2.1. CTEs make medium-complex SQL easy to understand
CTE-based queries are designed to be read as a single query, containing self-contained logic, making it easy to understand the steps involved in a CTE query and the resulting output. Your context window is a single query.
However, suppose you are not mindful of query size and let it grow beyond, say, 4 CTEs. In that case, you encounter readability and complexity problems.
Pros:
- A single query encapsulates the entire logic.
- Easy to read
Cons:
- Can quickly become complex to understand if not approached with mindfulness. Especially as the number of CTEs per query grows over 4.
- Debugging can be difficult since you must identify where an error occurred by running each CTE individually.
- Non reusable logic. A CTE in one query cannot be reused in another query.
The above flowchart demonstrates problems with CTEs:
- Hard to
debug
as you will have to manually copy-run the CTE query and inspect results. - The same code is rewritten in multiple queries, leading to
hard-to-maintain code
.
2.2. Temp table enables you to reuse logic multiple times in a session
Note In this context, Spark dataframes are equivalent to temp tables
Temp tables exist within a session and can be reused multiple times across queries in that session.
Pros:
- Temp tables can be reused
- Simple to debug, as you can examine the outputs of the temp table one at a time
Cons:
- Having too many temporary tables is usually an indication of complexity. See if the temp tables can be materialized into tables or views.
The above flowchart and query demonstrate how a non-DRY CTE-based query can be turned into reusable code with temporary tables.
Letβs look at some benefits:
- The temp tables can be
re-used
in multiple queries without having to dogpile all query logic into a single CTE-based query. Debuggable
: Easy to validate each of the 3 key transformations
Notice that similar temporary tables are being created in various queries. It may be an indication that those temporary tables can be converted into permanent tables.
2.3. Performance depends on the execution engine
Spark is lazy-evaluated, so it considers how a temporary table is used and typically only computes it once, then reuses the result. Spark is also intelligent enough to understand how to reuse Common Table Expressions (CTEs) across a session.
Note: There can be significant performance differences between CTE and temp table-based approaches for other engines; use EXPLAIN
to understand these differences and utilize indexes as needed.
3. Conclusion
To recap, we saw how
- CTEs are great for representing single-use logic
- Temp tables are great for debugging and reusable logic
CTEs are great for single use; temporary tables are great for debuggability and reusability.
The next time you recreate the same CTEs at multiple places, ask yourself if it should be a temporary table, materialized as a table, or a view.
4. Recommended reading
If you found this article helpful, share it with a friend or colleague using one of the socials below!