CTEs(Common Table Expression) or Temporary Tables for Spark SQL

1. Introduction

As a data engineer, CTEs(Common Table Expression) are one of the best techniques you can use to improve query readability. Most DEs wish they had started using CTEs sooner in their careers.

However, overuse of CTEs can lead to

complex queries trying to fit everything into a single, massive nested statements

Hard to debug code, as you can’t quickly inspect the results of one CTE at a time

Inability to reuse code, as CTEs in one query cannot be used in another query

Imagine having the benefits of CTE (readability) while being able to make your code reusable & easy to debug. That is what temp tables enable you to do.

In this post, we will explore how temporary tables can enhance CTEs by making your code more reusable and easier to debug.

Prerequisites:

TL;DR: CTEs are great for single use; Temp tables are great for debuggability & re-usability

2. CTE for short clean code & temp tables for re-usability

Assume both the CTE and temp table in the code below are written with clean SQL; it is impossible to compare approaches with bad code.

2.1. CTEs make medium-complex SQL easy to understand

CTE-based queries are designed to be read as a single query, containing self-contained logic, making it easy to understand the steps involved in a CTE query and the resulting output. Your context window is a single query.

However, suppose you are not mindful of query size and let it grow beyond, say, 4 CTEs. In that case, you encounter readability and complexity problems.

Pros:

A single query encapsulates the entire logic.
Easy to read

Cons:

Can quickly become complex to understand if not approached with mindfulness. Especially as the number of CTEs per query grows over 4.
Debugging can be difficult since you must identify where an error occurred by running each CTE individually.
Non reusable logic. A CTE in one query cannot be reused in another query.

graph TD
    %% Base data sources
    TX1[("transactions<br/>table")] 
    TX2[("users<br/>table")]
    TX3[("products<br/>table")]
    
    %% First Query CTEs
    subgraph Q1 ["🔵 QUERY 1: Dashboard Analytics"]
        CTE1_Q1["user_activity<br/>(CTE1)"]
        CTE2_Q1["user_segments<br/>(CTE2)<br/>📊 Segments + Rankings"]
        CTE3_Q1["dashboard_metrics<br/>(CTE3)"]
        FINAL_Q1["Final SELECT<br/>Dashboard Results"]
        
        CTE1_Q1 --> CTE2_Q1
        CTE2_Q1 --> CTE3_Q1
        CTE3_Q1 --> FINAL_Q1
    end
    
    %% Second Query CTEs - RECREATING THE SAME LOGIC!
    subgraph Q2 ["🟡 QUERY 2: Report Analytics"]
        CTE1_Q2["user_activity<br/>(CTE1)<br/>🔄 DUPLICATED"]
        CTE2_Q2["user_segments<br/>(CTE2)<br/>📊 Segments + Rankings<br/>🔄 DUPLICATED"]
        CTE3_Q2["report_metrics<br/>(CTE3)"]
        FINAL_Q2["Final SELECT<br/>Report Results"]
        
        CTE1_Q2 --> CTE2_Q2
        CTE2_Q2 --> CTE3_Q2
        CTE3_Q2 --> FINAL_Q2
    end
    
    %% Third Query CTEs - AGAIN!
    subgraph Q3 ["🟢 QUERY 3: Executive Summary"]
        CTE1_Q3["user_activity<br/>(CTE1)<br/>🔄 DUPLICATED"]
        CTE2_Q3["user_segments<br/>(CTE2)<br/>📊 Segments + Rankings<br/>🔄 DUPLICATED"]
        CTE3_Q3["executive_summary<br/>(CTE3)"]
        FINAL_Q3["Final SELECT<br/>Executive Results"]
        
        CTE1_Q3 --> CTE2_Q3
        CTE2_Q3 --> CTE3_Q3
        CTE3_Q3 --> FINAL_Q3
    end
    
    %% Base tables feed into all queries
    TX1 --> CTE1_Q1
    TX2 --> CTE1_Q1
    TX3 --> CTE1_Q1
    
    TX1 --> CTE1_Q2
    TX2 --> CTE1_Q2
    TX3 --> CTE1_Q2
    
    TX1 --> CTE1_Q3
    TX2 --> CTE1_Q3
    TX3 --> CTE1_Q3
    
    %% Problem callout
    PROBLEM["❌ PROBLEM:<br/>Recreating same CTEs<br/>in multiple queries!<br/><br/>💔 Code Duplication<br/>🐛 Maintenance Nightmare<br/>"]
    
    %% Light mode styling
    classDef cteStyle fill:#f8f9ff,stroke:#2563eb,stroke-width:2px,color:#1e40af
    classDef duplicateStyle fill:#fff7ed,stroke:#ea580c,stroke-width:3px,color:#c2410c
    classDef tableStyle fill:#fef3f2,stroke:#dc2626,stroke-width:2px,color:#991b1b
    classDef finalStyle fill:#f0fdf4,stroke:#16a34a,stroke-width:3px,color:#15803d
    classDef problemStyle fill:#fef2f2,stroke:#dc2626,stroke-width:4px,color:#b91c1c
    
    class CTE1_Q1,CTE3_Q1,CTE1_Q2,CTE3_Q2,CTE1_Q3,CTE3_Q3 cteStyle
    class CTE2_Q1,CTE2_Q2,CTE2_Q3 duplicateStyle
    class TX1,TX2,TX3 tableStyle
    class FINAL_Q1,FINAL_Q2,FINAL_Q3 finalStyle
    class PROBLEM problemStyle

The above flowchart demonstrates problems with CTEs:

Hard to debug as you will have to manually copy-run the CTE query and inspect results.
The same code is rewritten in multiple queries, leading to hard-to-maintain code.

2.2. Temp table enables you to reuse logic multiple times in a session

Note

In this context, Spark dataframes are equivalent to temp tables

Temp tables exist within a session and can be reused multiple times across queries in that session.

Pros:

Temp tables can be reused at different places in the same session
Simple to debug, as you can examine the outputs of the temp table one at a time

Cons:

Having too many temporary tables is usually an indication of complexity. See if the temp tables can be materialized into tables or views.

graph TD
    %% Base data sources
    TX1[("transactions<br/>table")] 
    TX2[("users<br/>table")]
    TX3[("products<br/>table")]
    
    %% Shared temp tables created once
    subgraph SHARED ["🔧 SHARED TEMP TABLES (Created Once)"]
        TEMP1["temp_user_activity<br/>📊 Base User Data"]
        TEMP2["temp_user_segments<br/>🎯 Segments + Rankings<br/>✅ REUSABLE"]
    end
    
    %% Query 1 - Simple and clean
    subgraph Q1 ["🔵 QUERY 1: Dashboard Analytics"]
        CTE1_Q1["dashboard_metrics<br/>(CTE)"]
        FINAL_Q1["Final SELECT<br/>Dashboard Results"]
        
        CTE1_Q1 --> FINAL_Q1
    end
    
    %% Query 2 - Also simple
    subgraph Q2 ["🟡 QUERY 2: Report Analytics"]
        CTE1_Q2["report_metrics<br/>(CTE)"]
        FINAL_Q2["Final SELECT<br/>Report Results"]
        
        CTE1_Q2 --> FINAL_Q2
    end
    
    %% Query 3 - Clean and focused
    subgraph Q3 ["🟢 QUERY 3: Executive Summary"]
        CTE1_Q3["executive_summary<br/>(CTE)"]
        FINAL_Q3["Final SELECT<br/>Executive Results"]
        
        CTE1_Q3 --> FINAL_Q3
    end
    
    %% Base tables feed into shared temps (once!)
    TX1 --> TEMP1
    TX2 --> TEMP1
    TX3 --> TEMP1
    
    TEMP1 --> TEMP2
    
    %% Shared temps feed into all queries
    TEMP2 --> CTE1_Q1
    TEMP2 --> CTE1_Q2
    TEMP2 --> CTE1_Q3
    
    %% Solution callout
    SOLUTION["✅ SOLUTION:<br/>Shared temp tables!<br/><br/>🎯 Single Source of Truth<br/>⚡ Compute Once, Use Many<br/>🛠️ Easy Maintenance<br/>"]
    
    %% Benefits arrows
    TEMP2 -.->|"REUSE"| CTE1_Q1
    TEMP2 -.->|"REUSE"| CTE1_Q2
    TEMP2 -.->|"REUSE"| CTE1_Q3
    
    %% Light mode styling
    classDef tempStyle fill:#f0fdf4,stroke:#16a34a,stroke-width:3px,color:#15803d
    classDef cteStyle fill:#f8f9ff,stroke:#2563eb,stroke-width:2px,color:#1e40af
    classDef tableStyle fill:#fef3f2,stroke:#dc2626,stroke-width:2px,color:#991b1b
    classDef finalStyle fill:#f0f9ff,stroke:#7c3aed,stroke-width:3px,color:#5b21b6
    classDef solutionStyle fill:#f0fdf4,stroke:#16a34a,stroke-width:4px,color:#15803d
    classDef reuseStyle stroke:#16a34a,stroke-width:2px,stroke-dasharray: 5 5
    
    class TEMP1,TEMP2 tempStyle
    class CTE1_Q1,CTE1_Q2,CTE1_Q3 cteStyle
    class TX1,TX2,TX3 tableStyle
    class FINAL_Q1,FINAL_Q2,FINAL_Q3 finalStyle
    class SOLUTION solutionStyle

The above flowchart and query demonstrate how a non-DRY CTE-based query can be turned into reusable code with temporary tables.

Let’s look at some benefits:

The temp tables can be re-used in multiple queries without having to dogpile all query logic into a single CTE-based query.
Debuggable: Easy to validate each of the 3 key transformations

Notice that similar temporary tables are being created in various queries. It may be an indication that those temporary tables can be converted into permanent tables.

2.3. Performance depends on the execution engine

Spark is lazy-evaluated, so it considers how a temporary table is used and typically only computes it once, then reuses the result. Spark is also intelligent enough to understand how to reuse Common Table Expressions (CTEs) across a session.

Note: There can be significant performance differences between CTE and temp table-based approaches for other engines; use EXPLAIN to understand these differences and utilize indexes as needed.

3. Conclusion

To recap, we saw how

CTEs are great for single use; temporary tables are great for debuggability and reusability.

The next time you recreate the same CTEs at multiple places, ask yourself if it should be a temporary table, materialized as a table, or a view.

1. Introduction

2. CTE for short clean code & temp tables for re-usability

2.1. CTEs make medium-complex SQL easy to understand

2.2. Temp table enables you to reuse logic multiple times in a session

2.3. Performance depends on the execution engine

3. Conclusion

4. Recommended reading