Using Joins and Group Bys the right way for data warehousing
- 1. Introduction
- 2. Joins & Group bys are two of the most commonly used operations in data warehousing
- 3. Conclusion
- 4. Recommended reading
1. Introduction
Joins represent the denormalization of data to create dimensions and add dimension attributes to fact tables for reporting. Group bys enable the creation of metrics that determine how your data is used to make decisions.
If you have wondered
Why do your joins often end up in a dense block that’s hard to pick apart.
Why are people running nearly identical queries with only different join combinations.
How to avoid losing data when performing joins
How group bys have been used to mask underlying data modeling issues
Why a group by all is considered a red flag
Then this post is for you. In this post, we will cover how to use joins and group bys, as well as what to look out for when using them.
By the end of this post, you will know what to watch out for when you perform joins or group bys.
Prerequisites:
YouTube Live Workshop on July 21st; 1PM EST (10 AM PST) 2025
2. Joins & Group bys are two of the most commonly used operations in data warehousing
2.1. Joins are used to create denormalized dimension tables & to enrich fact tables with dimensions for reporting
2.1.1. When to use joins
Joins are used in data warehousing to create dimension tables from two or more individual tables from upstream application systems.
-- Sample join usage
select
c.customer_id,
c.zipcode,
c.city,
c.state_code,
s.state_name,
c.datetime_created,
c.datetime_updated,
c.dbt_valid_from::timestamp as valid_from,
case
when c.dbt_valid_to is NULL then '9999-12-31'::timestamp
else c.dbt_valid_to::timestamp
end as valid_to
from customers as c
inner join state as s on c.state_code = s.state_code
Joins are also used in reporting to enrich fact data with dimensional attributes (e.g., name, state, address, etc).
2.1.2. How to use joins
Join criteria refers to the columns used to join the tables. When joining tables, there is usually one table called the driver
table to which other tables are joined.
Depending on your use case, you may want to:
- Use a
left join
to get all data from thedriver
table, even when relevant data is missing from other tables. - Use an
inner join
to retrieve only data that is present in all the tables in the join. - Use an
anti-join
to retrieve data from thedriver
table that is not present in the table being joined to. - Use a
full outer join
to get data in either one of the multiple joining tables. This type of join is typically used for data validation and to determine differences in data between the tables.
In addition to types of joins, there are 2 main join criteria
- Direct equality criteria such as a = b or a!= b
- Range criteria such as a >= b or a between b1 and b2.
Joins with range criteria are usually used to find rows that fall within a specific time range. For example, when you want to determine the lifecycle of a customer on your website (Impression -> Click -> Purchase), you would self-join the table containing this data to check for past events from the same user.
Note: If you find yourself trying to use self-join to compare values across rows, check to see if a window function would better suit your purpose.
2.1.3. Things to watch out for when joining
- Joining table(s) with multiple grains will lead to duplicated or partial data. It’s best to ensure that your table(s) have a single grain before joining them.
- If the tables you are joining do not have complete data, your joins will produce partial or NULL data points. For e.g., if you are joining a customer’s personal details table with their payment information and the payment information for that customer has not yet arrived in the warehouse, you will receive NULL for this information.
- Joins on columns with NULLs. NULLs represent the absence of data, and as such, joins on NULL = NULL will not work. Ensure that you coalesce NULLs to a hardcoded value before joining (if that’s what you intend to do).
- Complex join criteria and join criteria that require transformation functions are typically indicative of a poor underlying data model. Try to solve this by making changes to the tables in the join.
2.2. Group bys are the cornerstone of reporting
2.2.1 When to use Group by
Group bys are used to calculate metrics from numerical data. The metrics are typically used to understand how the business is performing. GROUP BY allows us to perform aggregate calculations on data from a set of rows grouped together based on values of specified column(s).
2.2.2. How to use Group by
Group bys involve grouping by a set of columns (typically low cardinality) and aggregating numbers to provide an overview of how the set of values representing the columns are performing.
SELECT
orderpriority,
count(*) AS num_orders
FROM
orders
GROUP BY
orderpriority;
In the above query, we group the data by orderpriority, and the calculation count() will be applied to the rows having a specific orderpriority value. The output will consist of one row per unique value of orderpriority and the count() calculation.
Types of Group by
Standard agg:
min/max/sum/avg/countStatistical agg:
Functions like correlation, sampling, standard deviation, skew, etcCollection agg:
Functions to combine values into nested data types, e.g., array_agg, collect_set, etcApproximation agg:
Functions that are fast by sacrificing accuracy, e.g., approx_distinct, approx_most_frequentConvenience agg:
Functions that make common usages easier, e.g., count_if, bool_or, etc
The available alternatives for grouping are ROLLUP, CUBE, etc.
Note The types of aggregate depend on your database engine; check your documentation for available functions. E.g.,Trino aggregation function .
2.2.3. Things to watch out for when using Group by
- If you are grouping by multiple columns or using Group by to remove duplicates, this usually indicates a problem with your underlying data model
- For the columns on which you want to run aggregation functions, ensure that they are of the correct data type.
- Ensure that the columns that you are aggregating are additive. For example, you cannot aggregate percentages; instead, you must aggregate the numerator and denominator separately.
3. Conclusion
To recap, we saw:
- How to use joins to create denormalized dimension tables & enrich facts with dimensions.
- How group bys represent the core purpose of data warehousing.
Joins and Group bys are extremely powerful; while they are easy enough operations to grasp when combined with complex requirements and a bad data model, they are extremely easy to introduce unintended bugs.
Thinking about failure scenarios in depth and knowing how to use them effectively will make your code easier to maintain and bug-free.
4. Recommended reading
If you found this article helpful, share it with a friend or colleague using one of the socials below!