How to Get Really Good at Advanced SQL for Data Engineering

Introduction

How do you get good at complex data manipulation in SQL? Are you

An analytics engineer trying to break into data engineering?

A data engineer trying to get better at complex data manipulation?

Realizing that your SQL proficiency is mid-level and not advanced?

Stuck at a role where you don’t get to practice advanced SQL?

Is this you? Want to advance your SQL skills? Read on.

Imagine being able to make informed business decisions. And write easy-to-understand SQL. That is what SQL proficiency is.

In this post, you will learn how to go from business requirements to creating clean SQL code. You will do this by following a step-by-step repeatable process.

You will learn by practicing your new skills on real-life use cases.

By the end of this post, you will have a step-by-step process. Use this process to transform business requirements into easy-to-understand & clean SQL queries.

Prerequisites

To follow along, you will need to know the following:

Input Data

Assume you are working at a bicycle manufacturing company called Adventure Works.

ERD for the data you need is shown below.

erDiagram
    FactInternetSales }|--|| DimDate : "OrderDateKey = DateKey; Many-to-One"
    FactInternetSales }|--|| DimProduct : "ProductKey = ProductKey; Many-to-One"
    FactInternetSales }|--|| DimCustomer : "CustomerKey = CustomerKey; Many-to-One"
    
    FactInternetSales {
        int OrderDateKey FK
        int ProductKey FK
        int CustomerKey FK
        string SalesOrderNumber
        decimal SalesAmount
    }
    
    DimDate {
        int DateKey PK
        date FullDateAlternateKey
        int CalendarYear
        int MonthNumberOfYear
    }
    
    DimProduct {
        int ProductKey PK
        string EnglishProductName
    }
    
    DimCustomer {
        int CustomerKey PK
        string FirstName
        string LastName
        string EmailAddress
    }

Advanced SQL = Solving complex requirements with easy-to-understand code

The expectation from an advanced SQL practitioner is not just the ability to answer complex questions. But the ability to answer complex questions with easy-to-understand SQL.

You are working on an initiative to increase revenue with product bundling. The metrics below are your business requirements. Reference.

Define outputs & their input sources

Output Grain Each row in the output should represent every unique combination of two products (product_name_a, product_name_b).

Column Name	Description	Source Data
product_name_a (Grain Column 1)	Name of the Product	DimProduct
product_name_b (Grain Column 2)	Name of the Product that A is compared to	DimProduct
support	number of orders with both A & B / total number of orders	total number of orders from FactInternetSales number of orders with both A & B from self-joined FactInternetSales grouped by product_a, product_b
confidence	number of orders with both A & B / number of orders with A	number of orders with A from FactInternetSales grouped by product number of orders with both A & B from self-joined FactInternetSales grouped by product_a, product_b
lift	number of orders with both A & B / ((number of orders with A / total number of orders) * (number of orders with B / total number of orders))	total number of orders from FactInternetSales number of orders with both A & B from self-joined FactInternetSales grouped by product_a, product_b number of orders with A (or B) from FactInternetSales grouped by product

Step-by-step process to get outputs from inputs

From the Source Data in the previous section, list out the different grains needed from the inputs.

Total number of orders from FactInternetSales
FactInternetSales grouped by product
Self-joined FactInternetSales grouped by product_a, product_b
DimProduct

Define these 3 grains of FactInternetSales as individual CTEs. One CTE per grain.

In the final SELECT, join the 3 CTEs and enrich them with DimProduct.

Write the code

Hover over the annotation number to the right to see the CTE corresponding to each step.

WITH 
total_orders AS (
  SELECT COUNT(DISTINCT SalesOrderNumber) AS total_order_count
  FROM FactInternetSales
),
product_counts AS (
  SELECT 
    ProductKey,
    COUNT(DISTINCT SalesOrderNumber) AS orders_with_product
  FROM FactInternetSales
  GROUP BY ProductKey
),
product_pair_counts AS (
  SELECT 
    a.ProductKey AS ProductKey_A,
    b.ProductKey AS ProductKey_B,
    COUNT(DISTINCT a.SalesOrderNumber) AS orders_with_both
  FROM FactInternetSales a
  JOIN FactInternetSales b 
    ON a.SalesOrderNumber = b.SalesOrderNumber
    AND a.ProductKey < b.ProductKey
  GROUP BY a.ProductKey, b.ProductKey
)
SELECT
  pa.EnglishProductName AS product_a,
  pb.EnglishProductName AS product_b,
  t.total_order_count,
  -- Support: P(A AND B) 
  (pc.orders_with_both * 1.0) / t.total_order_count AS support,
  -- Confidence: P(B|A) = P(A AND B) / P(A) 
  (pc.orders_with_both * 1.0) / pca.orders_with_product AS confidence_a_to_b,
  -- Lift: P(A AND B) / (P(A) * P(B)) 
  ((pc.orders_with_both * 1.0) / t.total_order_count) / 
    (((pca.orders_with_product * 1.0) / t.total_order_count) * 
     ((pcb.orders_with_product * 1.0) / t.total_order_count)) AS lift
FROM product_pair_counts pc
JOIN product_counts pca ON pc.ProductKey_A = pca.ProductKey
JOIN product_counts pcb ON pc.ProductKey_B = pcb.ProductKey
JOIN DimProduct pa ON pc.ProductKey_A = pa.ProductKey
JOIN DimProduct pb ON pc.ProductKey_B = pb.ProductKey
CROSS JOIN total_orders t
ORDER BY confidence_a_to_b DESC, lift DESC
LIMIT 5;

1: Total number of orders from FactInternetSales
2: FactInternetSales grouped by product
3: Self-joined FactInternetSales grouped by product_a, product_b
4: 3 grains of FactInternetSales enriched with DimProduct
5: We multiply by 1.0 to output a decimal. Default will convert to an integer, leaving percentages as 0.

Try it now

Try running the SQL above (with LIMIT 5).

initSqlJs = require('sql.js');
SQL = await initSqlJs({
  locateFile: file => `https://sql.js.org/dist/${file}`
});

db = {
  const response = await fetch("adventureworks.db");
  const buffer = await response.arrayBuffer();
  const uint8Array = new Uint8Array(buffer);
  return new SQL.Database(uint8Array);
}

function executeQueryOnDb(db, query) {
  try {
    const result = db.exec(query);
    if (result.length === 0) return [];
    const columns = result[0].columns;
    const values = result[0].values;
    return values.map(row => {
      const obj = {};
      columns.forEach((col, i) => { obj[col] = row[i]; });
      return obj;
    });
  } catch(e) {
    return [{
      "❌ Error Type": e.name || "SQLError",
      "📝 Error Message": e.message,
      "🔍 Query Attempted": query.substring(0, 200) + (query.length > 200 ? "..." : ""),
      "💡 Hint": getErrorHint(e.message)
    }];
  }
}
function getErrorHint(errorMessage) {
  const msg = errorMessage.toLowerCase();
  
  if (msg.includes("no such table")) {
    return "Table doesn't exist. Use: SELECT name FROM sqlite_schema WHERE type='table' to see all tables";
  }
  if (msg.includes("no such column")) {
    return "Column doesn't exist. Use: PRAGMA table_info(tablename) to see columns";
  }
  if (msg.includes("syntax error")) {
    return "Check your SQL syntax. Common issues: missing commas, unmatched quotes, or typos in keywords";
  }
  if (msg.includes("ambiguous column")) {
    return "Multiple tables have this column. Use table aliases: SELECT t1.column FROM table1 t1";
  }
  if (msg.includes("near")) {
    const match = errorMessage.match(/near "(.+?)"/);
    if (match) {
      return `Problem near: "${match[1]}". Check syntax around this keyword/symbol`;
    }
  }
  
  return "Check your SQL syntax and table/column names";
}
function renderExecuteButton(label = "Execute", queryInput) {
  return Inputs.button(label, {
    reduce: () => queryInput
  });
}

viewof sqlQuery = {
  const textarea = Inputs.textarea({
    label: "SQLite Query",
    value: "select 'Write your SQL query here'",
    rows: 5,
    width: "100%"
  });
  
  const input = textarea.querySelector('textarea');
  
  // Auto-resize function
  const autoResize = () => {
    input.style.height = 'auto';
    input.style.height = input.scrollHeight + 'px';
  };
  
  // Resize on input
  input.addEventListener('input', autoResize);
  
  // Initial resize
  setTimeout(autoResize, 0);
  
  return textarea;
}

viewof executeQuery = renderExecuteButton("Execute", sqlQuery)

Inputs.table(executeQueryOnDb(db, executeQuery))

You should see the following.

product_a	product_b	total_order_count	support	confidence_a_to_b	lift
Touring Tire Tube	Touring Tire	27,659	0.029	0.543	16.063
Water Bottle - 30 oz.	Mountain Bottle Cage	27,659	0.061	0.399	5.445
Water Bottle - 30 oz.	Road Bottle Cage	27,659	0.055	0.358	5.79
Road-250 Red, 58	HL Road Tire	27,659	0.002	0.351	11.307
Road-250 Black, 48	Road Bottle Cage	27,659	0.002	0.338	5.454

Practice the approach you learnt with real-life business questions.

Exercise 1: Customer Lifetime Value analysis

You are part of a project to analyze customer lifetime value. Compute the following metrics.

Define output columns & their input sources

Output Grain Each row in the output represents a customer.

Column Name	Description	Source Data
customer_name	Name of the customer `FirstName LastName`	DimCustomer
email_address	EmailAddress of customer	DimCustomer
total_revenue	sum(SalesAmount)	FactInternetSales Grouped by CustomerKey
number_of_orders	count(SalesOrderNumber)	FactInternetSales Grouped by CustomerKey
recency_days	Days since last purchase (current_date - max(OrderDate))	FactInternetSales Grouped by CustomerKey
first_purchase_date	Earliest purchase date for customer	FactInternetSales Grouped by CustomerKey

Try it now

Note:

For date calculations, enrich FactInternetSales with DimDate and use FullDateAlternateKey
For computing date differences, use julianday function.

viewof sqlQuery_2 = Inputs.textarea({
  label: "SQLite Query",
  value: "-- Your Customer Lifetime Value Query Here",
  rows: 5,
  width: "100%"
})

viewof executeQuery_2 = renderExecuteButton("Execute", sqlQuery_2)

Inputs.table(executeQueryOnDb(db, executeQuery_2))

Order results by total_revenue in descending order and limit to 5.

You should see the following as the output (recency_days will differ as it’s computed based on the day you run this query).

customer_name	email_address	total_revenue	number_of_orders	recency_days	first_purchase_date
Nichole Nara	nichole16@adventure-works.com	13,295.38	5	4340	2011-05-10
Kaitlyn Henderson	kaitlyn72@adventure-works.com	13,294.27	5	4374	2011-01-30
Margaret He	margaret25@adventure-works.com	13,269.27	5	4341	2011-06-03
Randall Dominguez	randall14@adventure-works.com	13,265.99	5	4381	2011-02-21
Adriana Gonzalez	adriana19@adventure-works.com	13,242.75	5	4374	2011-05-24

Hint

Use a CTE to enrich FactInternetSales with DimDate (join key f.OrderDateKey = d.DateKey). And use d.FullDateAlternateKey to represent the date

Answer

WITH 
customer_metrics AS (
    SELECT 
        f.CustomerKey,
        SUM(f.SalesAmount) as total_revenue,
        COUNT(DISTINCT f.SalesOrderNumber) as number_of_orders,
        MIN(d.FullDateAlternateKey) as first_purchase_date,
        MAX(d.FullDateAlternateKey) as last_purchase_date
    FROM FactInternetSales f
    JOIN DimDate d ON f.OrderDateKey = d.DateKey
    GROUP BY f.CustomerKey
)
SELECT
    c.FirstName || ' ' || c.LastName as customer_name,
    c.EmailAddress as email_address,
    ROUND(cm.total_revenue, 2) as total_revenue,
    cm.number_of_orders,
    CAST(julianday('now') - julianday(cm.last_purchase_date) AS INTEGER) as recency_days,
    cm.first_purchase_date
FROM customer_metrics cm
JOIN DimCustomer c ON cm.CustomerKey = c.CustomerKey
ORDER BY cm.total_revenue DESC 
LIMIT 5;

1: FactInternetSales Grouped by CustomerKey
2: Enrich Customer metrics with Name & EmailAddress

Exercise 2: Year-over-year analysis

You are creating a dataset to enable business people to track sales over time. You have been provided the following metrics to be computed.

Define output columns & their input sources

Output Grain One row per year-month combination.

Filter Criteria: Only consider sales on or after 2013.

Column Name	Description	Source Data
year (Grain Column 1)	Year of sale	FactInternetSales grouped to Year, Month
month (Grain Column 2)	Month of sale	FactInternetSales grouped to Year, Month
current_period_sales_amount	Total salesamount for this year-month	Sum(SalesAmount) from FactInternetSales grouped to Year, Month
prior_period_sales_amount	Total salesamount for the same month, but the previous year	Sum(SalesAmount) from FactInternetSales grouped to Year, Month
yoy_percentage_change	(current_period_sales_amount - prior_period_sales_amount) / prior_period_sales_amount	current_period_sales_amount prior_period_sales_amount
current_ytd_sales_amount	Running total sales amount for this year, until this month	Running Sum(SalesAmount) until this year month from FactInternetSales grouped to Year, Month
prior_ytd_sales_amount	Running total sales amount for the prior year, until this month	Running Sum(SalesAmount) until this year month from FactInternetSales grouped to Year, Month
running_yoy_percentage_change	(current_ytd_sales_amount - prior_ytd_sales_amount) / prior_ytd_sales_amount	current_ytd_sales_amount prior_ytd_sales_amount

Try it now

viewof sqlQuery_3 = Inputs.textarea({
  label: "SQLite Query",
  value: "-- Your YOY Analysis Query Here",
  rows: 5,
  width: "100%"
})

viewof executeQuery_3 = renderExecuteButton("Execute", sqlQuery_3)

Inputs.table(executeQueryOnDb(db, executeQuery_3))

Order results by current_year, current_month in ascending order, and limit to 5.

You should see the following as the output.

current_year	month_number	current_period_value	prior_period_value	yoy_percentage_change	current_ytd	prior_ytd	running_total_yoy_change
2013	1	857,689.91	495,364.13	73.14	857,689.91	495,364.13	73.14
2013	2	771,348.74	506,994.19	52.14	1,629,038.65	1,002,358.31	62.52
2013	3	1,049,907.39	373,483.01	181.11	2,678,946.04	1,375,841.32	94.71
2013	4	1,046,022.77	400,335.61	161.29	3,724,968.81	1,776,176.93	109.72
2013	5	1,284,59.93	358,877.89	257.95	5,009,561.74	2,135,054.82	134.63

Hint 1

Define the following steps as individual CTES.

Aggregate sales by month and year
Calculate running totals (year-to-date)
Get current year and prior year data side by side
Calculate YoY metrics

Hint 2

To display the current and prior years side by side, use lead/lag, or a self-join (year = year - 1)

Hint 3

During division operations, ensure that the denominator is greater than 0 to prevent a division by zero error.

Answer

WITH 
monthly_sales AS (
    SELECT 
        d.CalendarYear,
        d.MonthNumberOfYear,
        SUM(f.SalesAmount) as monthly_sales
    FROM FactInternetSales f
    JOIN DimDate d ON f.OrderDateKey = d.DateKey
    GROUP BY d.CalendarYear, d.MonthNumberOfYear
),
sales_with_running_total AS (
    SELECT 
        CalendarYear,
        MonthNumberOfYear,
        monthly_sales,
        SUM(monthly_sales) OVER (
            PARTITION BY CalendarYear 
            ORDER BY MonthNumberOfYear
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) as running_total
    FROM monthly_sales
),
yoy_comparison AS (
    SELECT 
        current.CalendarYear as current_year,
        current.MonthNumberOfYear as month_number,
        
        -- Current Period Value
        current.monthly_sales as current_period_value,
        
        -- Prior Period Value
        prior.monthly_sales as prior_period_value,
        
        -- Current Running Total
        current.running_total as current_ytd,
        
        -- Prior Running Total
        prior.running_total as prior_ytd
        
    FROM sales_with_running_total current
    LEFT JOIN sales_with_running_total prior
        ON current.MonthNumberOfYear = prior.MonthNumberOfYear
        AND current.CalendarYear = prior.CalendarYear + 1
    WHERE current.CalendarYear >= 2013
)
SELECT
    current_year,
    month_number,
    
    -- Current Period Value
    ROUND(current_period_value, 2) as current_period_value,
    
    -- Prior Period Value
    ROUND(prior_period_value, 2) as prior_period_value,
    
    -- YoY Percentage Change
    CASE 
        WHEN prior_period_value > 0 THEN
            ROUND(((current_period_value - prior_period_value) / prior_period_value) * 100, 2)
        ELSE NULL
    END as yoy_percentage_change,
    
    -- Current Year-to-Date
    ROUND(current_ytd, 2) as current_ytd,
    
    -- Prior Year-to-Date
    ROUND(prior_ytd, 2) as prior_ytd,
    
    -- Running Total YoY % Change
    CASE 
        WHEN prior_ytd > 0 THEN
            ROUND(((current_ytd - prior_ytd) / prior_ytd) * 100, 2)
        ELSE NULL
    END as running_total_yoy_change
    
FROM yoy_comparison
ORDER BY current_year, month_number 
LIMIT 5;

1: Aggregate sales by month and year
2: Calculate running totals (year-to-date)
3: Get current year and prior year data side by side
4: Calculate YoY metrics

Conclusion

To recap, we saw

How to clearly define the outputs and their relationship to inputs
How to create a step-by-step process to get the inputs in the right grain
How to use CTEs to represent the step-by-step processing of inputs

The next time you are stuck with a gnarly query or complex requirement at work or an interview, follow the process you just learnt.

The mapping from inputs to outputs will enable you to create elegant SQL that can represent arbitrarily complex logic.

Introduction

Advanced SQL = Solving complex requirements with easy-to-understand code

Define outputs & their input sources

Step-by-step process to get outputs from inputs

Write the code

Try it now

Exercise 1: Customer Lifetime Value analysis

Define output columns & their input sources

Try it now

Exercise 2: Year-over-year analysis

Define output columns & their input sources

Try it now

Conclusion

Make sure to read these