CTEs(Common Table Expression) or Temporary Tables for Spark SQL
- 1. Introduction
- 2. CTE for short clean code & temp tables for re-usability
- 3. Conclusion
- 4. Recommended reading
1. Introduction
As a data engineer, CTEs(Common Table Expression) are one of the best techniques you can use to improve query readability. Most DEs wish they had started using CTEs sooner in their careers.
However, overuse of CTEs can lead to
complex queries trying to fit everything into a single, massive nested statements
Hard to debug code, as you can’t quickly inspect the results of one CTE at a time
Inability to reuse code, as CTEs in one query cannot be used in another query
Imagine having the benefits of CTE (readability) while being able to make your code reusable & easy to debug. That is what temp tables enable you to do.
In this post, we will explore how temporary tables can enhance CTEs by making your code more reusable and easier to debug.
Prerequisites:
TL;DR: CTEs are great for single use; Temp tables are great for debuggability & re-usability
2. CTE for short clean code & temp tables for re-usability
Assume both the CTE and temp table in the code below are written with clean SQL; it is impossible to compare approaches with bad code.
2.1. CTEs make medium-complex SQL easy to understand
CTE-based queries are designed to be read as a single query, containing self-contained logic, making it easy to understand the steps involved in a CTE query and the resulting output. Your context window is a single query.
However, suppose you are not mindful of query size and let it grow beyond, say, 4 CTEs. In that case, you encounter readability and complexity problems.
Pros:
- A single query encapsulates the entire logic.
- Easy to read
Cons:
- Can quickly become complex to understand if not approached with mindfulness. Especially as the number of CTEs per query grows over 4.
- Debugging can be difficult since you must identify where an error occurred by running each CTE individually.
- Non reusable logic. A CTE in one query cannot be reused in another query.
table")] TX2[("users
table")] TX3[("products
table")] %% First Query CTEs subgraph Q1 ["๐ต QUERY 1: Dashboard Analytics"] CTE1_Q1["user_activity
(CTE1)"] CTE2_Q1["user_segments
(CTE2)
๐ Segments + Rankings"] CTE3_Q1["dashboard_metrics
(CTE3)"] FINAL_Q1["Final SELECT
Dashboard Results"] CTE1_Q1 --> CTE2_Q1 CTE2_Q1 --> CTE3_Q1 CTE3_Q1 --> FINAL_Q1 end %% Second Query CTEs - RECREATING THE SAME LOGIC! subgraph Q2 ["๐ก QUERY 2: Report Analytics"] CTE1_Q2["user_activity
(CTE1)
๐ DUPLICATED"] CTE2_Q2["user_segments
(CTE2)
๐ Segments + Rankings
๐ DUPLICATED"] CTE3_Q2["report_metrics
(CTE3)"] FINAL_Q2["Final SELECT
Report Results"] CTE1_Q2 --> CTE2_Q2 CTE2_Q2 --> CTE3_Q2 CTE3_Q2 --> FINAL_Q2 end %% Third Query CTEs - AGAIN! subgraph Q3 ["๐ข QUERY 3: Executive Summary"] CTE1_Q3["user_activity
(CTE1)
๐ DUPLICATED"] CTE2_Q3["user_segments
(CTE2)
๐ Segments + Rankings
๐ DUPLICATED"] CTE3_Q3["executive_summary
(CTE3)"] FINAL_Q3["Final SELECT
Executive Results"] CTE1_Q3 --> CTE2_Q3 CTE2_Q3 --> CTE3_Q3 CTE3_Q3 --> FINAL_Q3 end %% Base tables feed into all queries TX1 --> CTE1_Q1 TX2 --> CTE1_Q1 TX3 --> CTE1_Q1 TX1 --> CTE1_Q2 TX2 --> CTE1_Q2 TX3 --> CTE1_Q2 TX1 --> CTE1_Q3 TX2 --> CTE1_Q3 TX3 --> CTE1_Q3 %% Problem callout PROBLEM["โ PROBLEM:
Recreating same CTEs
in multiple queries!
๐ Code Duplication
๐ Maintenance Nightmare
"] %% Light mode styling classDef cteStyle fill:#f8f9ff,stroke:#2563eb,stroke-width:2px,color:#1e40af classDef duplicateStyle fill:#fff7ed,stroke:#ea580c,stroke-width:3px,color:#c2410c classDef tableStyle fill:#fef3f2,stroke:#dc2626,stroke-width:2px,color:#991b1b classDef finalStyle fill:#f0fdf4,stroke:#16a34a,stroke-width:3px,color:#15803d classDef problemStyle fill:#fef2f2,stroke:#dc2626,stroke-width:4px,color:#b91c1c class CTE1_Q1,CTE3_Q1,CTE1_Q2,CTE3_Q2,CTE1_Q3,CTE3_Q3 cteStyle class CTE2_Q1,CTE2_Q2,CTE2_Q3 duplicateStyle class TX1,TX2,TX3 tableStyle class FINAL_Q1,FINAL_Q2,FINAL_Q3 finalStyle class PROBLEM problemStyle
The above flowchart demonstrates problems with CTEs:
- Hard to
debug
as you will have to manually copy-run the CTE query and inspect results. - The same code is rewritten in multiple queries, leading to
hard-to-maintain code
.
2.2. Temp table enables you to reuse logic multiple times in a session
Note In this context, Spark dataframes are equivalent to temp tables
Temp tables exist within a session and can be reused multiple times across queries in that session.
Pros:
- Temp tables can be reused
- Simple to debug, as you can examine the outputs of the temp table one at a time
Cons:
- Having too many temporary tables is usually an indication of complexity. See if the temp tables can be materialized into tables or views.
table")] TX2[("users
table")] TX3[("products
table")] %% Shared temp tables created once subgraph SHARED ["๐ง SHARED TEMP TABLES (Created Once)"] TEMP1["temp_user_activity
๐ Base User Data"] TEMP2["temp_user_segments
๐ฏ Segments + Rankings
โ REUSABLE"] end %% Query 1 - Simple and clean subgraph Q1 ["๐ต QUERY 1: Dashboard Analytics"] CTE1_Q1["dashboard_metrics
(CTE)"] FINAL_Q1["Final SELECT
Dashboard Results"] CTE1_Q1 --> FINAL_Q1 end %% Query 2 - Also simple subgraph Q2 ["๐ก QUERY 2: Report Analytics"] CTE1_Q2["report_metrics
(CTE)"] FINAL_Q2["Final SELECT
Report Results"] CTE1_Q2 --> FINAL_Q2 end %% Query 3 - Clean and focused subgraph Q3 ["๐ข QUERY 3: Executive Summary"] CTE1_Q3["executive_summary
(CTE)"] FINAL_Q3["Final SELECT
Executive Results"] CTE1_Q3 --> FINAL_Q3 end %% Base tables feed into shared temps (once!) TX1 --> TEMP1 TX2 --> TEMP1 TX3 --> TEMP1 TEMP1 --> TEMP2 %% Shared temps feed into all queries TEMP2 --> CTE1_Q1 TEMP2 --> CTE1_Q2 TEMP2 --> CTE1_Q3 %% Solution callout SOLUTION["โ SOLUTION:
Shared temp tables!
๐ฏ Single Source of Truth
โก Compute Once, Use Many
๐ ๏ธ Easy Maintenance
"] %% Benefits arrows TEMP2 -.->|"REUSE"| CTE1_Q1 TEMP2 -.->|"REUSE"| CTE1_Q2 TEMP2 -.->|"REUSE"| CTE1_Q3 %% Light mode styling classDef tempStyle fill:#f0fdf4,stroke:#16a34a,stroke-width:3px,color:#15803d classDef cteStyle fill:#f8f9ff,stroke:#2563eb,stroke-width:2px,color:#1e40af classDef tableStyle fill:#fef3f2,stroke:#dc2626,stroke-width:2px,color:#991b1b classDef finalStyle fill:#f0f9ff,stroke:#7c3aed,stroke-width:3px,color:#5b21b6 classDef solutionStyle fill:#f0fdf4,stroke:#16a34a,stroke-width:4px,color:#15803d classDef reuseStyle stroke:#16a34a,stroke-width:2px,stroke-dasharray: 5 5 class TEMP1,TEMP2 tempStyle class CTE1_Q1,CTE1_Q2,CTE1_Q3 cteStyle class TX1,TX2,TX3 tableStyle class FINAL_Q1,FINAL_Q2,FINAL_Q3 finalStyle class SOLUTION solutionStyle
The above flowchart and query demonstrate how a non-DRY CTE-based query can be turned into reusable code with temporary tables.
Let’s look at some benefits:
- The temp tables can be
re-used
in multiple queries without having to dogpile all query logic into a single CTE-based query. Debuggable
: Easy to validate each of the 3 key transformations
Notice that similar temporary tables are being created in various queries. It may be an indication that those temporary tables can be converted into permanent tables.
2.3. Performance depends on the execution engine
Spark is lazy-evaluated, so it considers how a temporary table is used and typically only computes it once, then reuses the result. Spark is also intelligent enough to understand how to reuse Common Table Expressions (CTEs) across a session.
Note: There can be significant performance differences between CTE and temp table-based approaches for other engines; use EXPLAIN
to understand these differences and utilize indexes as needed.
3. Conclusion
To recap, we saw how
- CTEs are great for representing single-use logic
- Temp tables are great for debugging and reusable logic
CTEs are great for single use; temporary tables are great for debuggability and reusability.
The next time you recreate the same CTEs at multiple places, ask yourself if it should be a temporary table, materialized as a table, or a view.
4. Recommended reading
If you found this article helpful, share it with a friend or colleague using one of the socials below!