How to Get Really Good at Advanced SQL for Data Engineering

A step-by-step guide to help you write advanced SQL. Learn by practicing real-life use cases. On your browser, no setup needed.

A step-by-step guide to help you write advanced SQL. Learn by practicing real-life use cases. On your browser, no setup needed.
Author

Joseph Machado

Published

August 9, 2025

Keywords

AdventureWorks, CTE, Window Function, Workshop, Analytical Data Processing, Step-by-step SQL, Practice Complex SQL

Introduction

How do you get good at complex data manipulation in SQL? Are you

An analytics engineer trying to break into data engineering?

A data engineer trying to get better at complex data manipulation?

Realizing that your SQL proficiency is mid-level and not advanced?

Stuck at a role where you don’t get to practice advanced SQL?

Is this you? Want to advance your SQL skills? Read on.

Imagine being able to make informed business decisions. And write easy-to-understand SQL. That is what SQL proficiency is.

In this post, you will learn how to go from business requirements to creating clean SQL code. You will do this by following a step-by-step repeatable process.

You will learn by practicing your new skills on real-life use cases.

By the end of this post, you will have a step-by-step process. Use this process to transform business requirements into easy-to-understand & clean SQL queries.

Prerequisites

To follow along, you will need to know the following:

  1. SQL Basics
  2. CTE
  3. Window Functions

Input Data

Assume you are working at a bicycle manufacturing company called Adventure Works.

ERD for the data you need is shown below.

OrderDateKey = DateKey; Many-to-One

ProductKey = ProductKey; Many-to-One

CustomerKey = CustomerKey; Many-to-One

FactInternetSales

int

OrderDateKey

FK

int

ProductKey

FK

int

CustomerKey

FK

string

SalesOrderNumber

decimal

SalesAmount

DimDate

int

DateKey

PK

date

FullDateAlternateKey

int

CalendarYear

int

MonthNumberOfYear

DimProduct

int

ProductKey

PK

string

EnglishProductName

DimCustomer

int

CustomerKey

PK

string

FirstName

string

LastName

string

EmailAddress

Advanced SQL = Solving complex requirements with easy-to-understand code

The expectation from an advanced SQL practitioner is not just the ability to answer complex questions. But the ability to answer complex questions with easy-to-understand SQL.

You are working on an initiative to increase revenue with product bundling. The metrics below are your business requirements. Reference.

Define outputs & their input sources

Output Grain Each row in the output should represent every unique combination of two products (product_name_a, product_name_b).

Column Name Description Source Data
product_name_a (Grain Column 1) Name of the Product
  • DimProduct
product_name_b (Grain Column 2) Name of the Product that A is compared to
  • DimProduct
support number of orders with both A & B / total number of orders
  • total number of orders from FactInternetSales
  • number of orders with both A & B from self-joined FactInternetSales grouped by product_a, product_b
confidence number of orders with both A & B / number of orders with A
  • number of orders with A from FactInternetSales grouped by product
  • number of orders with both A & B from self-joined FactInternetSales grouped by product_a, product_b
lift number of orders with both A & B / ((number of orders with A / total number of orders) * (number of orders with B / total number of orders))
  • total number of orders from FactInternetSales
  • number of orders with both A & B from self-joined FactInternetSales grouped by product_a, product_b
  • number of orders with A (or B) from FactInternetSales grouped by product

Step-by-step process to get outputs from inputs

From the Source Data in the previous section, list out the different grains needed from the inputs.

  1. Total number of orders from FactInternetSales
  2. FactInternetSales grouped by product
  3. Self-joined FactInternetSales grouped by product_a, product_b
  4. DimProduct

Define these 3 grains of FactInternetSales as individual CTEs. One CTE per grain.

In the final SELECT, join the 3 CTEs and enrich them with DimProduct.

Write the code

Hover over the annotation number to the right to see the CTE corresponding to each step.

WITH 
total_orders AS (
  SELECT COUNT(DISTINCT SalesOrderNumber) AS total_order_count
  FROM FactInternetSales
),
product_counts AS (
  SELECT 
    ProductKey,
    COUNT(DISTINCT SalesOrderNumber) AS orders_with_product
  FROM FactInternetSales
  GROUP BY ProductKey
),
product_pair_counts AS (
  SELECT 
    a.ProductKey AS ProductKey_A,
    b.ProductKey AS ProductKey_B,
    COUNT(DISTINCT a.SalesOrderNumber) AS orders_with_both
  FROM FactInternetSales a
  JOIN FactInternetSales b 
    ON a.SalesOrderNumber = b.SalesOrderNumber
    AND a.ProductKey < b.ProductKey
  GROUP BY a.ProductKey, b.ProductKey
)
SELECT
  pa.EnglishProductName AS product_a,
  pb.EnglishProductName AS product_b,
  t.total_order_count,
  -- Support: P(A AND B) 
  (pc.orders_with_both * 1.0) / t.total_order_count AS support,
  -- Confidence: P(B|A) = P(A AND B) / P(A) 
  (pc.orders_with_both * 1.0) / pca.orders_with_product AS confidence_a_to_b,
  -- Lift: P(A AND B) / (P(A) * P(B)) 
  ((pc.orders_with_both * 1.0) / t.total_order_count) / 
    (((pca.orders_with_product * 1.0) / t.total_order_count) * 
     ((pcb.orders_with_product * 1.0) / t.total_order_count)) AS lift
FROM product_pair_counts pc
JOIN product_counts pca ON pc.ProductKey_A = pca.ProductKey
JOIN product_counts pcb ON pc.ProductKey_B = pcb.ProductKey
JOIN DimProduct pa ON pc.ProductKey_A = pa.ProductKey
JOIN DimProduct pb ON pc.ProductKey_B = pb.ProductKey
CROSS JOIN total_orders t
ORDER BY confidence_a_to_b DESC, lift DESC
LIMIT 5;
1
Total number of orders from FactInternetSales
2
FactInternetSales grouped by product
3
Self-joined FactInternetSales grouped by product_a, product_b
4
3 grains of FactInternetSales enriched with DimProduct
5
We multiply by 1.0 to output a decimal. Default will convert to an integer, leaving percentages as 0.

Try it now

Try running the SQL above (with LIMIT 5).

initSqlJs = ƒ(moduleConfig)
SQL = Object {locateFile: ƒ(file), onAbort: ƒ(errorThatCausedAbort), postRun: Array(0), onRuntimeInitialized: ƒ(), _sqlite3_free: ƒ(a), _sqlite3_value_text: ƒ(a), _sqlite3_prepare_v2: ƒ(a, b, c, d, e), _sqlite3_step: ƒ(a), _sqlite3_reset: ƒ(a), _sqlite3_exec: ƒ(a, b, c, d, e), _sqlite3_finalize: ƒ(a), _sqlite3_column_name: ƒ(a, b), _sqlite3_column_text: ƒ(a, b), _sqlite3_column_type: ƒ(a, b), _sqlite3_errmsg: ƒ(a), _sqlite3_clear_bindings: ƒ(a), _sqlite3_value_blob: ƒ(a), _sqlite3_value_bytes: ƒ(a), _sqlite3_value_double: ƒ(a), _sqlite3_value_int: ƒ(a), …}
db = e {filename: "dbfile_2053382069", db: 5318224, fb: Object, Sa: Object}
executeQueryOnDb = ƒ(db, query)
getErrorHint = ƒ(errorMessage)
renderExecuteButton = ƒ(…)
sqlQuery = "select 'Write your SQL query here'"
executeQuery = 0
No results.

You should see the following.

product_a product_b total_order_count support confidence_a_to_b lift
Touring Tire Tube Touring Tire 27,659 0.029 0.543 16.063
Water Bottle - 30 oz. Mountain Bottle Cage 27,659 0.061 0.399 5.445
Water Bottle - 30 oz. Road Bottle Cage 27,659 0.055 0.358 5.79
Road-250 Red, 58 HL Road Tire 27,659 0.002 0.351 11.307
Road-250 Black, 48 Road Bottle Cage 27,659 0.002 0.338 5.454

Practice the approach you learnt with real-life business questions.

Exercise 1: Customer Lifetime Value analysis

You are part of a project to analyze customer lifetime value. Compute the following metrics.

Define output columns & their input sources

Output Grain Each row in the output represents a customer.

Column Name Description Source Data
customer_name Name of the customer FirstName LastName DimCustomer
email_address EmailAddress of customer DimCustomer
total_revenue sum(SalesAmount) FactInternetSales Grouped by CustomerKey
number_of_orders count(SalesOrderNumber) FactInternetSales Grouped by CustomerKey
recency_days Days since last purchase (current_date - max(OrderDate)) FactInternetSales Grouped by CustomerKey
first_purchase_date Earliest purchase date for customer FactInternetSales Grouped by CustomerKey

Try it now

Note:

  1. For date calculations, enrich FactInternetSales with DimDate and use FullDateAlternateKey
  2. For computing date differences, use julianday function.
sqlQuery_2 = "-- Your Customer Lifetime Value Query Here"
executeQuery_2 = 0
No results.

Order results by total_revenue in descending order and limit to 5.

You should see the following as the output (recency_days will differ as it’s computed based on the day you run this query).

customer_name email_address total_revenue number_of_orders recency_days first_purchase_date
Nichole Nara nichole16@adventure-works.com 13,295.38 5 4340 2011-05-10
Kaitlyn Henderson kaitlyn72@adventure-works.com 13,294.27 5 4374 2011-01-30
Margaret He margaret25@adventure-works.com 13,269.27 5 4341 2011-06-03
Randall Dominguez randall14@adventure-works.com 13,265.99 5 4381 2011-02-21
Adriana Gonzalez adriana19@adventure-works.com 13,242.75 5 4374 2011-05-24

Use a CTE to enrich FactInternetSales with DimDate (join key f.OrderDateKey = d.DateKey). And use d.FullDateAlternateKey to represent the date

WITH 
customer_metrics AS (
    SELECT 
        f.CustomerKey,
        SUM(f.SalesAmount) as total_revenue,
        COUNT(DISTINCT f.SalesOrderNumber) as number_of_orders,
        MIN(d.FullDateAlternateKey) as first_purchase_date,
        MAX(d.FullDateAlternateKey) as last_purchase_date
    FROM FactInternetSales f
    JOIN DimDate d ON f.OrderDateKey = d.DateKey
    GROUP BY f.CustomerKey
)
SELECT
    c.FirstName || ' ' || c.LastName as customer_name,
    c.EmailAddress as email_address,
    ROUND(cm.total_revenue, 2) as total_revenue,
    cm.number_of_orders,
    CAST(julianday('now') - julianday(cm.last_purchase_date) AS INTEGER) as recency_days,
    cm.first_purchase_date
FROM customer_metrics cm
JOIN DimCustomer c ON cm.CustomerKey = c.CustomerKey
ORDER BY cm.total_revenue DESC 
LIMIT 5;
1
FactInternetSales Grouped by CustomerKey
2
Enrich Customer metrics with Name & EmailAddress

Exercise 2: Year-over-year analysis

You are creating a dataset to enable business people to track sales over time. You have been provided the following metrics to be computed.

Define output columns & their input sources

Output Grain One row per year-month combination.

Filter Criteria: Only consider sales on or after 2013.

Column Name Description Source Data
year (Grain Column 1) Year of sale
  • FactInternetSales grouped to Year, Month
month (Grain Column 2) Month of sale
  • FactInternetSales grouped to Year, Month
current_period_sales_amount Total salesamount for this year-month
  • Sum(SalesAmount) from FactInternetSales grouped to Year, Month
prior_period_sales_amount Total salesamount for the same month, but the previous year
  • Sum(SalesAmount) from FactInternetSales grouped to Year, Month
yoy_percentage_change (current_period_sales_amount - prior_period_sales_amount) / prior_period_sales_amount
  • current_period_sales_amount
  • prior_period_sales_amount
current_ytd_sales_amount Running total sales amount for this year, until this month
  • Running Sum(SalesAmount) until this year month from FactInternetSales grouped to Year, Month
prior_ytd_sales_amount Running total sales amount for the prior year, until this month
  • Running Sum(SalesAmount) until this year month from FactInternetSales grouped to Year, Month
running_yoy_percentage_change (current_ytd_sales_amount - prior_ytd_sales_amount) / prior_ytd_sales_amount
  • current_ytd_sales_amount
  • prior_ytd_sales_amount

Try it now

sqlQuery_3 = "-- Your YOY Analysis Query Here"
executeQuery_3 = 0
No results.

Order results by current_year, current_month in ascending order, and limit to 5.

You should see the following as the output.

current_year month_number current_period_value prior_period_value yoy_percentage_change current_ytd prior_ytd running_total_yoy_change
2013 1 857,689.91 495,364.13 73.14 857,689.91 495,364.13 73.14
2013 2 771,348.74 506,994.19 52.14 1,629,038.65 1,002,358.31 62.52
2013 3 1,049,907.39 373,483.01 181.11 2,678,946.04 1,375,841.32 94.71
2013 4 1,046,022.77 400,335.61 161.29 3,724,968.81 1,776,176.93 109.72
2013 5 1,284,59.93 358,877.89 257.95 5,009,561.74 2,135,054.82 134.63

Define the following steps as individual CTES.

  1. Aggregate sales by month and year
  2. Calculate running totals (year-to-date)
  3. Get current year and prior year data side by side
  4. Calculate YoY metrics

To display the current and prior years side by side, use lead/lag, or a self-join (year = year - 1)

During division operations, ensure that the denominator is greater than 0 to prevent a division by zero error.

WITH 
monthly_sales AS (
    SELECT 
        d.CalendarYear,
        d.MonthNumberOfYear,
        SUM(f.SalesAmount) as monthly_sales
    FROM FactInternetSales f
    JOIN DimDate d ON f.OrderDateKey = d.DateKey
    GROUP BY d.CalendarYear, d.MonthNumberOfYear
),
sales_with_running_total AS (
    SELECT 
        CalendarYear,
        MonthNumberOfYear,
        monthly_sales,
        SUM(monthly_sales) OVER (
            PARTITION BY CalendarYear 
            ORDER BY MonthNumberOfYear
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) as running_total
    FROM monthly_sales
),
yoy_comparison AS (
    SELECT 
        current.CalendarYear as current_year,
        current.MonthNumberOfYear as month_number,
        
        -- Current Period Value
        current.monthly_sales as current_period_value,
        
        -- Prior Period Value
        prior.monthly_sales as prior_period_value,
        
        -- Current Running Total
        current.running_total as current_ytd,
        
        -- Prior Running Total
        prior.running_total as prior_ytd
        
    FROM sales_with_running_total current
    LEFT JOIN sales_with_running_total prior
        ON current.MonthNumberOfYear = prior.MonthNumberOfYear
        AND current.CalendarYear = prior.CalendarYear + 1
    WHERE current.CalendarYear >= 2013
)
SELECT
    current_year,
    month_number,
    
    -- Current Period Value
    ROUND(current_period_value, 2) as current_period_value,
    
    -- Prior Period Value
    ROUND(prior_period_value, 2) as prior_period_value,
    
    -- YoY Percentage Change
    CASE 
        WHEN prior_period_value > 0 THEN
            ROUND(((current_period_value - prior_period_value) / prior_period_value) * 100, 2)
        ELSE NULL
    END as yoy_percentage_change,
    
    -- Current Year-to-Date
    ROUND(current_ytd, 2) as current_ytd,
    
    -- Prior Year-to-Date
    ROUND(prior_ytd, 2) as prior_ytd,
    
    -- Running Total YoY % Change
    CASE 
        WHEN prior_ytd > 0 THEN
            ROUND(((current_ytd - prior_ytd) / prior_ytd) * 100, 2)
        ELSE NULL
    END as running_total_yoy_change
    
FROM yoy_comparison
ORDER BY current_year, month_number 
LIMIT 5;
1
Aggregate sales by month and year
2
Calculate running totals (year-to-date)
3
Get current year and prior year data side by side
4
Calculate YoY metrics

Conclusion

To recap, we saw

  1. How to clearly define the outputs and their relationship to inputs
  2. How to create a step-by-step process to get the inputs in the right grain
  3. How to use CTEs to represent the step-by-step processing of inputs

The next time you are stuck with a gnarly query or complex requirement at work or an interview, follow the process you just learnt.

The mapping from inputs to outputs will enable you to create elegant SQL that can represent arbitrarily complex logic.

Make sure to read these

  1. SQL 101
  2. How to clean up a 1000-line SQL query
  3. Should you use SQL or Python for data transformations?
  4. How to improve your SQL skills as a data engineer
Back to top

Land your dream Data Engineering job with my free book!

Build data engineering proficiency with my free book!

Are you looking to enter the field of data engineering? And are you

> Overwhelmed by all the concepts/jargon/frameworks of data engineering?

> Feeling lost because there is no clear roadmap for someone to quickly get up to speed with the essentials of data engineering?

Learning to be a data engineer can be a long and rough road, but it doesn't have to be!

Imagine knowing the fundamentals of data engineering that are crucial to any data team. You will be able to quickly pick up any new tool or framework.

Sign up for my free Data Engineering 101 Course. You will get

✅ Instant access to my Data Engineering 101 e-book, which covers SQL, Python, Docker, dbt, Airflow & Spark.

✅ Executable code to practice and exercises to test yourself.

✅ Weekly email for 4 weeks with the exercise solutions.

Join now and get started on your data engineering journey!

    Testimonials:

    I really appreciate you putting these detailed posts together for your readers, you explain things in such a detailed, simple manner that's well organized and easy to follow. I appreciate it so so much!
    I have learned a lot from the course which is much more practical.
    This course helped me build a project and actually land a data engineering job! Thank you.

    When you subscribe, you'll also get emails about data engineering concepts, development practices, career advice, and projects every 2 weeks (or so) to help you level up your data engineering skills. We respect your email privacy.

    Land your dream Data Engineering job with my free book!

    Build data engineering proficiency with my free book!

    Are you looking to enter the field of data engineering? And are you

    > Overwhelmed by all the concepts/jargon/frameworks of data engineering?

    > Feeling lost because there is no clear roadmap for someone to quickly get up to speed with the essentials of data engineering?

    Learning to be a data engineer can be a long and rough road, but it doesn't have to be!

    Imagine knowing the fundamentals of data engineering that are crucial to any data team. You will be able to quickly pick up any new tool or framework.

    Sign up for my free Data Engineering 101 Course. You will get

    ✅ Instant access to my Data Engineering 101 e-book, which covers SQL, Python, Docker, dbt, Airflow & Spark.

    ✅ Executable code to practice and exercises to test yourself.

    ✅ Weekly email for 4 weeks with the exercise solutions.

    Join now and get started on your data engineering journey!

      Testimonials:

      I really appreciate you putting these detailed posts together for your readers, you explain things in such a detailed, simple manner that's well organized and easy to follow. I appreciate it so so much!
      I have learned a lot from the course which is much more practical.
      This course helped me build a project and actually land a data engineering job! Thank you.

      When you subscribe, you'll also get emails about data engineering concepts, development practices, career advice, and projects every 2 weeks (or so) to help you level up your data engineering skills. We respect your email privacy.