Introduction
How do you get good at complex data manipulation in SQL? Are you
An analytics engineer trying to break into data engineering?
A data engineer trying to get better at complex data manipulation?
Realizing that your SQL proficiency is mid-level and not advanced?
Stuck at a role where you don’t get to practice advanced SQL?
Is this you? Want to advance your SQL skills? Read on.
Imagine being able to make informed business decisions. And write easy-to-understand SQL. That is what SQL proficiency is.
In this post, you will learn how to go from business requirements to creating clean SQL code. You will do this by following a step-by-step repeatable process.
You will learn by practicing your new skills on real-life use cases.
By the end of this post, you will have a step-by-step process. Use this process to transform business requirements into easy-to-understand & clean SQL queries.
Prerequisites
To follow along, you will need to know the following:
Input Data
Assume you are working at a bicycle manufacturing company called Adventure Works.
ERD for the data you need is shown below.
Advanced SQL = Solving complex requirements with easy-to-understand code
The expectation from an advanced SQL practitioner is not just the ability to answer complex questions. But the ability to answer complex questions with easy-to-understand SQL.
You are working on an initiative to increase revenue with product bundling. The metrics below are your business requirements. Reference.
Define outputs & their input sources
Output Grain Each row in the output should represent every unique combination of two products (product_name_a, product_name_b).
Column Name | Description | Source Data |
---|---|---|
product_name_a (Grain Column 1) | Name of the Product |
|
product_name_b (Grain Column 2) | Name of the Product that A is compared to |
|
support | number of orders with both A & B / total number of orders |
|
confidence | number of orders with both A & B / number of orders with A |
|
lift | number of orders with both A & B / ((number of orders with A / total number of orders) * (number of orders with B / total number of orders)) |
|
Step-by-step process to get outputs from inputs
From the Source Data
in the previous section, list out the different grains needed from the inputs.
- Total number of orders from FactInternetSales
- FactInternetSales grouped by product
- Self-joined FactInternetSales grouped by product_a, product_b
- DimProduct
Define these 3 grains of FactInternetSales
as individual CTEs. One CTE per grain.
In the final SELECT
, join the 3 CTEs and enrich them with DimProduct.
Write the code
Hover
over the annotation number to the right to see the CTE corresponding to each step.
WITH
total_orders AS (
SELECT COUNT(DISTINCT SalesOrderNumber) AS total_order_count
FROM FactInternetSales
),
product_counts AS (
SELECT
ProductKey,
COUNT(DISTINCT SalesOrderNumber) AS orders_with_product
FROM FactInternetSales
GROUP BY ProductKey
),
product_pair_counts AS (
SELECT
a.ProductKey AS ProductKey_A,
b.ProductKey AS ProductKey_B,
COUNT(DISTINCT a.SalesOrderNumber) AS orders_with_both
FROM FactInternetSales a
JOIN FactInternetSales b
ON a.SalesOrderNumber = b.SalesOrderNumber
AND a.ProductKey < b.ProductKey
GROUP BY a.ProductKey, b.ProductKey
)
SELECT
pa.EnglishProductName AS product_a,
pb.EnglishProductName AS product_b,
t.total_order_count,
-- Support: P(A AND B)
(pc.orders_with_both * 1.0) / t.total_order_count AS support,
-- Confidence: P(B|A) = P(A AND B) / P(A)
(pc.orders_with_both * 1.0) / pca.orders_with_product AS confidence_a_to_b,
-- Lift: P(A AND B) / (P(A) * P(B))
((pc.orders_with_both * 1.0) / t.total_order_count) /
(((pca.orders_with_product * 1.0) / t.total_order_count) *
((pcb.orders_with_product * 1.0) / t.total_order_count)) AS lift
FROM product_pair_counts pc
JOIN product_counts pca ON pc.ProductKey_A = pca.ProductKey
JOIN product_counts pcb ON pc.ProductKey_B = pcb.ProductKey
JOIN DimProduct pa ON pc.ProductKey_A = pa.ProductKey
JOIN DimProduct pb ON pc.ProductKey_B = pb.ProductKey
CROSS JOIN total_orders t
ORDER BY confidence_a_to_b DESC, lift DESC
LIMIT 5;
- 1
- Total number of orders from FactInternetSales
- 2
- FactInternetSales grouped by product
- 3
- Self-joined FactInternetSales grouped by product_a, product_b
- 4
- 3 grains of FactInternetSales enriched with DimProduct
- 5
- We multiply by 1.0 to output a decimal. Default will convert to an integer, leaving percentages as 0.
Try it now
Try running the SQL above (with LIMIT 5).
You should see the following.
product_a | product_b | total_order_count | support | confidence_a_to_b | lift |
---|---|---|---|---|---|
Touring Tire Tube | Touring Tire | 27,659 | 0.029 | 0.543 | 16.063 |
Water Bottle - 30 oz. | Mountain Bottle Cage | 27,659 | 0.061 | 0.399 | 5.445 |
Water Bottle - 30 oz. | Road Bottle Cage | 27,659 | 0.055 | 0.358 | 5.79 |
Road-250 Red, 58 | HL Road Tire | 27,659 | 0.002 | 0.351 | 11.307 |
Road-250 Black, 48 | Road Bottle Cage | 27,659 | 0.002 | 0.338 | 5.454 |
Practice the approach you learnt with real-life business questions.
Exercise 1: Customer Lifetime Value analysis
You are part of a project to analyze customer lifetime value. Compute the following metrics.
Define output columns & their input sources
Output Grain Each row in the output represents a customer.
Column Name | Description | Source Data |
---|---|---|
customer_name | Name of the customer FirstName LastName |
DimCustomer |
email_address | EmailAddress of customer | DimCustomer |
total_revenue | sum(SalesAmount) | FactInternetSales Grouped by CustomerKey |
number_of_orders | count(SalesOrderNumber) | FactInternetSales Grouped by CustomerKey |
recency_days | Days since last purchase (current_date - max(OrderDate)) | FactInternetSales Grouped by CustomerKey |
first_purchase_date | Earliest purchase date for customer | FactInternetSales Grouped by CustomerKey |
Try it now
Note:
- For date calculations, enrich FactInternetSales with DimDate and use FullDateAlternateKey
- For computing date differences, use julianday function.
Order results by total_revenue in descending order and limit to 5
.
You should see the following as the output (recency_days will differ as it’s computed based on the day you run this query).
customer_name | email_address | total_revenue | number_of_orders | recency_days | first_purchase_date |
---|---|---|---|---|---|
Nichole Nara | nichole16@adventure-works.com | 13,295.38 | 5 | 4340 | 2011-05-10 |
Kaitlyn Henderson | kaitlyn72@adventure-works.com | 13,294.27 | 5 | 4374 | 2011-01-30 |
Margaret He | margaret25@adventure-works.com | 13,269.27 | 5 | 4341 | 2011-06-03 |
Randall Dominguez | randall14@adventure-works.com | 13,265.99 | 5 | 4381 | 2011-02-21 |
Adriana Gonzalez | adriana19@adventure-works.com | 13,242.75 | 5 | 4374 | 2011-05-24 |
Use a CTE to enrich FactInternetSales with DimDate (join key f.OrderDateKey = d.DateKey). And use d.FullDateAlternateKey to represent the date
WITH
customer_metrics AS (
SELECT
f.CustomerKey,
SUM(f.SalesAmount) as total_revenue,
COUNT(DISTINCT f.SalesOrderNumber) as number_of_orders,
MIN(d.FullDateAlternateKey) as first_purchase_date,
MAX(d.FullDateAlternateKey) as last_purchase_date
FROM FactInternetSales f
JOIN DimDate d ON f.OrderDateKey = d.DateKey
GROUP BY f.CustomerKey
)
SELECT
c.FirstName || ' ' || c.LastName as customer_name,
c.EmailAddress as email_address,
ROUND(cm.total_revenue, 2) as total_revenue,
cm.number_of_orders,
CAST(julianday('now') - julianday(cm.last_purchase_date) AS INTEGER) as recency_days,
cm.first_purchase_date
FROM customer_metrics cm
JOIN DimCustomer c ON cm.CustomerKey = c.CustomerKey
ORDER BY cm.total_revenue DESC
LIMIT 5;
- 1
- FactInternetSales Grouped by CustomerKey
- 2
- Enrich Customer metrics with Name & EmailAddress
Exercise 2: Year-over-year analysis
You are creating a dataset to enable business people to track sales over time. You have been provided the following metrics to be computed.
Define output columns & their input sources
Output Grain One row per year-month combination.
Filter Criteria: Only consider sales on or after 2013.
Column Name | Description | Source Data |
---|---|---|
year (Grain Column 1) | Year of sale |
|
month (Grain Column 2) | Month of sale |
|
current_period_sales_amount | Total salesamount for this year-month |
|
prior_period_sales_amount | Total salesamount for the same month, but the previous year |
|
yoy_percentage_change | (current_period_sales_amount - prior_period_sales_amount) / prior_period_sales_amount |
|
current_ytd_sales_amount | Running total sales amount for this year, until this month |
|
prior_ytd_sales_amount | Running total sales amount for the prior year, until this month |
|
running_yoy_percentage_change | (current_ytd_sales_amount - prior_ytd_sales_amount) / prior_ytd_sales_amount |
|
Try it now
Order results by current_year, current_month in ascending order, and limit to 5.
You should see the following as the output.
current_year | month_number | current_period_value | prior_period_value | yoy_percentage_change | current_ytd | prior_ytd | running_total_yoy_change | |
---|---|---|---|---|---|---|---|---|
2013 | 1 | 857,689.91 | 495,364.13 | 73.14 | 857,689.91 | 495,364.13 | 73.14 | |
2013 | 2 | 771,348.74 | 506,994.19 | 52.14 | 1,629,038.65 | 1,002,358.31 | 62.52 | |
2013 | 3 | 1,049,907.39 | 373,483.01 | 181.11 | 2,678,946.04 | 1,375,841.32 | 94.71 | |
2013 | 4 | 1,046,022.77 | 400,335.61 | 161.29 | 3,724,968.81 | 1,776,176.93 | 109.72 | |
2013 | 5 | 1,284,59.93 | 358,877.89 | 257.95 | 5,009,561.74 | 2,135,054.82 | 134.63 |
Define the following steps as individual CTES.
- Aggregate sales by month and year
- Calculate running totals (year-to-date)
- Get current year and prior year data side by side
- Calculate YoY metrics
To display the current and prior years side by side, use lead/lag, or a self-join (year = year - 1)
During division operations, ensure that the denominator is greater than 0 to prevent a division by zero error.
WITH
monthly_sales AS (
SELECT
d.CalendarYear,
d.MonthNumberOfYear,
SUM(f.SalesAmount) as monthly_sales
FROM FactInternetSales f
JOIN DimDate d ON f.OrderDateKey = d.DateKey
GROUP BY d.CalendarYear, d.MonthNumberOfYear
),
sales_with_running_total AS (
SELECT
CalendarYear,
MonthNumberOfYear,
monthly_sales,
SUM(monthly_sales) OVER (
PARTITION BY CalendarYear
ORDER BY MonthNumberOfYear
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as running_total
FROM monthly_sales
),
yoy_comparison AS (
SELECT
current.CalendarYear as current_year,
current.MonthNumberOfYear as month_number,
-- Current Period Value
current.monthly_sales as current_period_value,
-- Prior Period Value
prior.monthly_sales as prior_period_value,
-- Current Running Total
current.running_total as current_ytd,
-- Prior Running Total
prior.running_total as prior_ytd
FROM sales_with_running_total current
LEFT JOIN sales_with_running_total prior
ON current.MonthNumberOfYear = prior.MonthNumberOfYear
AND current.CalendarYear = prior.CalendarYear + 1
WHERE current.CalendarYear >= 2013
)
SELECT
current_year,
month_number,
-- Current Period Value
ROUND(current_period_value, 2) as current_period_value,
-- Prior Period Value
ROUND(prior_period_value, 2) as prior_period_value,
-- YoY Percentage Change
CASE
WHEN prior_period_value > 0 THEN
ROUND(((current_period_value - prior_period_value) / prior_period_value) * 100, 2)
ELSE NULL
END as yoy_percentage_change,
-- Current Year-to-Date
ROUND(current_ytd, 2) as current_ytd,
-- Prior Year-to-Date
ROUND(prior_ytd, 2) as prior_ytd,
-- Running Total YoY % Change
CASE
WHEN prior_ytd > 0 THEN
ROUND(((current_ytd - prior_ytd) / prior_ytd) * 100, 2)
ELSE NULL
END as running_total_yoy_change
FROM yoy_comparison
ORDER BY current_year, month_number
LIMIT 5;
- 1
- Aggregate sales by month and year
- 2
- Calculate running totals (year-to-date)
- 3
- Get current year and prior year data side by side
- 4
- Calculate YoY metrics
Conclusion
To recap, we saw
- How to clearly define the outputs and their relationship to inputs
- How to create a step-by-step process to get the inputs in the right grain
- How to use CTEs to represent the step-by-step processing of inputs
The next time you are stuck with a gnarly query or complex requirement at work or an interview, follow the process you just learnt.
The mapping from inputs to outputs will enable you to create elegant SQL that can represent arbitrarily complex logic.