CTEs (Common Table Expressions) or Temporary Tables for Spark SQL

1. Introduction

As a data engineer, you'll find that CTEs (Common Table Expressions) are one of the best techniques for improving query readability. Most DEs wish they had started using them sooner in their careers.

However, overuse of CTEs can lead to:

  1. Overly complex queries that try to fit everything into a single, massive nested statement
  2. Hard-to-debug code, since you can't quickly inspect the results of one CTE at a time
  3. Non-reusable logic, since a CTE in one query cannot be used in another query

Imagine having the readability benefits of CTEs while also being able to make your code reusable and easy to debug. That is what temp tables enable you to do.

In this post, we will explore how temporary tables can enhance CTEs by making your code more reusable and easier to debug.

Prerequisites:

  1. Common Table Expression (CTE)
  2. Temporary table
  3. SQL Session

TL;DR: CTEs are great for single-use logic; temp tables are great for debuggability & reusability.

2. CTEs for short, clean code & temp tables for reusability

Assume that both the CTE and temp table examples below are written with clean SQL; the comparison is only meaningful if both approaches are well written.

2.1. CTEs make medium-complex SQL easy to understand

CTE-based queries are designed to be read as a single query with self-contained logic, making it easy to follow the steps involved and the resulting output. Your context window is a single query.
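For instance, here is a minimal sketch of a CTE-based query. The table and CTE names follow the flowchart below; the column names are assumptions.

```sql
-- Minimal sketch of a CTE-based query. Table names (users, transactions)
-- and CTE names (user_activity, user_segments) mirror the flowchart below;
-- the column names are assumptions.
WITH user_activity AS (
    -- Step 1: aggregate raw transactions per user.
    SELECT u.user_id,
           COUNT(t.transaction_id) AS txn_count,
           SUM(t.amount)           AS total_spend
    FROM users u
    LEFT JOIN transactions t
        ON u.user_id = t.user_id
    GROUP BY u.user_id
),
user_segments AS (
    -- Step 2: derive a segment label from the aggregates.
    SELECT user_id,
           txn_count,
           total_spend,
           CASE WHEN total_spend > 1000 THEN 'high_value'
                ELSE 'standard' END AS segment
    FROM user_activity
)
-- Step 3: the final SELECT consumes the CTEs; none of them can be
-- referenced outside this one query.
SELECT segment,
       COUNT(*)         AS num_users,
       AVG(total_spend) AS avg_spend
FROM user_segments
GROUP BY segment;
```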

However, if you are not mindful of query size and let it grow beyond, say, four CTEs, you will run into readability and complexity problems.

Pros:

  1. A single query encapsulates the entire logic.
  2. Easy to read

Cons:

  1. Can quickly become hard to understand if not approached with care, especially as the number of CTEs per query grows beyond four.
  2. Debugging can be difficult, since you must identify where an error occurred by running each CTE individually.
  3. Non-reusable logic: a CTE in one query cannot be used in another query.

[Flowchart: three CTE-only queries. πŸ”΅ QUERY 1 (Dashboard Analytics), 🟑 QUERY 2 (Report Analytics), and 🟒 QUERY 3 (Executive Summary) each read the transactions, users, and products tables, and each rebuilds the same user_activity (CTE1) and user_segments (CTE2) before its own final CTE and SELECT. The duplicated CTEs are flagged: ❌ recreating the same CTEs in multiple queries means πŸ’” code duplication and a πŸ› maintenance nightmare.]

The above flowchart demonstrates problems with CTEs:

  1. Hard to debug, since you have to manually copy each CTE's query, run it, and inspect the results (see the sketch after this list).
  2. The same code is rewritten in multiple queries, leading to hard-to-maintain code.
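For example, to inspect an intermediate CTE you have to copy its body into a standalone statement (a sketch, reusing the hypothetical tables from above):

```sql
-- Debugging sketch: to inspect user_activity you must copy its body
-- into its own query and run it separately.
SELECT u.user_id,
       COUNT(t.transaction_id) AS txn_count,
       SUM(t.amount)           AS total_spend
FROM users u
LEFT JOIN transactions t
    ON u.user_id = t.user_id
GROUP BY u.user_id
LIMIT 10;
-- Inspecting user_segments is worse: you must copy user_activity as well,
-- since user_segments depends on it.
```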

2.2. Temp tables enable you to reuse logic multiple times in a session

Note: In this context, Spark DataFrames are equivalent to temp tables.

Temp tables exist within a session and can be reused multiple times across queries in that session.
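Here is a minimal sketch of the same pipeline built with Spark SQL temporary views (Spark's flavor of temp tables); the names mirror the flowchart below, and the columns are the same assumptions as before.

```sql
-- Create the shared logic once per session as temporary views.
CREATE OR REPLACE TEMPORARY VIEW temp_user_activity AS
SELECT u.user_id,
       COUNT(t.transaction_id) AS txn_count,
       SUM(t.amount)           AS total_spend
FROM users u
LEFT JOIN transactions t
    ON u.user_id = t.user_id
GROUP BY u.user_id;

CREATE OR REPLACE TEMPORARY VIEW temp_user_segments AS
SELECT user_id,
       txn_count,
       total_spend,
       CASE WHEN total_spend > 1000 THEN 'high_value'
            ELSE 'standard' END AS segment
FROM temp_user_activity;

-- Debug: inspect each intermediate step one at a time.
SELECT * FROM temp_user_segments LIMIT 10;

-- Reuse: every downstream query in the session reads the same views.
SELECT segment, COUNT(*) AS num_users
FROM temp_user_segments
GROUP BY segment;                -- e.g. the dashboard query

SELECT segment, AVG(total_spend) AS avg_spend
FROM temp_user_segments
GROUP BY segment;                -- e.g. the report query
```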

Pros:

  1. Temp tables can be reused across queries in the same session
  2. Simple to debug, as you can examine the output of each temp table one at a time

Cons:

  1. Having too many temporary tables is usually a sign of complexity. See whether some of them should be materialized as permanent tables or views.

[Flowchart: the same three queries backed by πŸ”§ shared temp tables created once. temp_user_activity and temp_user_segments are built from the transactions, users, and products tables and reused by πŸ”΅ QUERY 1 (Dashboard Analytics), 🟑 QUERY 2 (Report Analytics), and 🟒 QUERY 3 (Executive Summary), each of which keeps only its own final CTE and SELECT. βœ… Shared temp tables give a 🎯 single source of truth: ⚑ compute once, use many, πŸ› οΈ easy maintenance.]

The above flowchart and query demonstrate how a non-DRY, CTE-based workload can be turned into reusable code with temporary tables.

Let’s look at some benefits:

  1. The temp tables can be reused in multiple queries without piling all of the logic into a single CTE-based query.
  2. Debuggable: it is easy to validate each of the three key transformations one at a time.

If you notice similar temporary tables being created across many different queries, it may be a sign that they should be converted into permanent tables.
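Promoting a temp view to a permanent table is a one-statement change. A sketch, where the target database name (analytics) and the Parquet format are assumptions:

```sql
-- Sketch: materialize a frequently recreated temp view as a permanent
-- table. The database name and file format are assumptions.
CREATE TABLE IF NOT EXISTS analytics.user_segments
USING PARQUET
AS SELECT * FROM temp_user_segments;
```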

2.3. Performance depends on the execution engine

Spark evaluates lazily: a temporary view is just a named query plan, so creating one does no work up front, and every query that references it re-executes that plan unless you cache the results explicitly (see the sketch below). Also note that a CTE is scoped to the single query that defines it; Spark's optimizer may avoid recomputing a CTE referenced several times within that query, but it cannot reuse it across queries in a session.
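A sketch of explicit caching in Spark SQL, assuming the temp_user_segments view from section 2.2:

```sql
-- Pin the view's results in memory so repeated queries in this session
-- do not recompute the underlying joins and aggregations.
CACHE TABLE temp_user_segments;

-- ...run the dashboard, report, and executive summary queries here...

-- Release the memory once the session no longer needs the cached data.
UNCACHE TABLE temp_user_segments;
```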

Note: There can be significant performance differences between the CTE and temp table approaches on other engines (for example, temp tables in a traditional RDBMS are materialized and can be indexed); use EXPLAIN to understand those differences and add indexes as needed.
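In Spark SQL (and most engines), you can prepend EXPLAIN to a query to see its plan without running it:

```sql
-- Show the plan Spark will execute, without running the query.
EXPLAIN
SELECT segment, COUNT(*) AS num_users
FROM temp_user_segments
GROUP BY segment;
```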

3. Conclusion

To recap, we saw how:

  1. CTEs are great for representing single-use logic
  2. Temp tables are great for debugging and reusable logic

The next time you find yourself recreating the same CTEs in multiple places, ask yourself whether they should become a temporary table, or be materialized as a permanent table or view.

Further reading:

  1. Common Table Expression (CTE)
  2. Temporary table
  3. SQL Session
  4. SQL techniques

If you found this article helpful, share it with a friend or colleague!
