CTEs(Common Table Expression) or Temporary Tables for Spark SQL
- 1. Introduction
- 2. CTE for short clean code & temp tables for re-usability
- 3. Conclusion
- 4. Recommended reading
1. Introduction
As a data engineer, CTEs(Common Table Expression) are one of the best techniques you can use to improve query readability. Most DEs wish they had started using CTEs sooner in their careers.
However, overuse of CTEs can lead to
complex queries trying to fit everything into a single, massive nested statements
Hard to debug code, as you can’t quickly inspect the results of one CTE at a time
Inability to reuse code, as CTEs in one query cannot be used in another query
Imagine having the benefits of CTE (readability) while being able to make your code reusable & easy to debug. That is what temp tables enable you to do.
In this post, we will explore how temporary tables can enhance CTEs by making your code more reusable and easier to debug.
Prerequisites:
TL;DR: CTEs are great for single use; Temp tables are great for debuggability & re-usability
2. CTE for short clean code & temp tables for re-usability
Assume both the CTE and temp table in the code below are written with clean SQL; it is impossible to compare approaches with bad code.
2.1. CTEs make medium-complex SQL easy to understand
CTE-based queries are designed to be read as a single query, containing self-contained logic, making it easy to understand the steps involved in a CTE query and the resulting output. Your context window is a single query.
However, suppose you are not mindful of query size and let it grow beyond, say, 4 CTEs. In that case, you encounter readability and complexity problems.
Pros:
- A single query encapsulates the entire logic.
- Easy to read
Cons:
- Can quickly become complex to understand if not approached with mindfulness. Especially as the number of CTEs per query grows over 4.
- Debugging can be difficult since you must identify where an error occurred by running each CTE individually.
- Non reusable logic. A CTE in one query cannot be reused in another query.