Using Joins and Group Bys the right way for data warehousing

1. Introduction

Joins represent the denormalization of data to create dimensions and add dimension attributes to fact tables for reporting. Group bys enable the creation of metrics that determine how your data is used to make decisions.

If you have wondered

Why do your joins often end up in a dense block that’s hard to pick apart.

Why are people running nearly identical queries with only different join combinations.

How to avoid losing data when performing joins

How group bys have been used to mask underlying data modeling issues

Why a group by all is considered a red flag

Then this post is for you. In this post, we will cover how to use joins and group bys, as well as what to look out for when using them.

By the end of this post, you will know what to watch out for when you perform joins or group bys.

Prerequisites:

  1. What is data warehousing
  2. SQL basics

YouTube Live Workshop on July 21st; 1PM EST (10 AM PST) 2025

2. Joins & Group bys are two of the most commonly used operations in data warehousing

2.1. Joins are used to create denormalized dimension tables & to enrich fact tables with dimensions for reporting

2.1.1. When to use joins

Joins are used in data warehousing to create dimension tables from two or more individual tables from upstream application systems.

-- Sample join usage
select
    c.customer_id,
    c.zipcode,
    c.city,
    c.state_code,
    s.state_name,
    c.datetime_created,
    c.datetime_updated,
    c.dbt_valid_from::timestamp as valid_from,
    case
        when c.dbt_valid_to is NULL then '9999-12-31'::timestamp
        else c.dbt_valid_to::timestamp
    end as valid_to
from customers as c
inner join state as s on c.state_code = s.state_code

Joins are also used in reporting to enrich fact data with dimensional attributes (e.g., name, state, address, etc).

2.1.2. How to use joins

Join criteria refers to the columns used to join the tables. When joining tables, there is usually one table called the driver table to which other tables are joined.

Depending on your use case, you may want to:

  1. Use a left join to get all data from the driver table, even when relevant data is missing from other tables.
  2. Use an inner join to retrieve only data that is present in all the tables in the join.
  3. Use an anti-join to retrieve data from the driver table that is not present in the table being joined to.
  4. Use a full outer join to get data in either one of the multiple joining tables. This type of join is typically used for data validation and to determine differences in data between the tables.

join type

In addition to types of joins, there are 2 main join criteria

  1. Direct equality criteria such as a = b or a!= b
  2. Range criteria such as a >= b or a between b1 and b2.

Joins with range criteria are usually used to find rows that fall within a specific time range. For example, when you want to determine the lifecycle of a customer on your website (Impression -> Click -> Purchase), you would self-join the table containing this data to check for past events from the same user.

Note: If you find yourself trying to use self-join to compare values across rows, check to see if a window function would better suit your purpose.

2.1.3. Things to watch out for when joining

  1. Joining table(s) with multiple grains will lead to duplicated or partial data. It’s best to ensure that your table(s) have a single grain before joining them.
  2. If the tables you are joining do not have complete data, your joins will produce partial or NULL data points. For e.g., if you are joining a customer’s personal details table with their payment information and the payment information for that customer has not yet arrived in the warehouse, you will receive NULL for this information.
  3. Joins on columns with NULLs. NULLs represent the absence of data, and as such, joins on NULL = NULL will not work. Ensure that you coalesce NULLs to a hardcoded value before joining (if that’s what you intend to do).
  4. Complex join criteria and join criteria that require transformation functions are typically indicative of a poor underlying data model. Try to solve this by making changes to the tables in the join.

2.2. Group bys are the cornerstone of reporting

2.2.1 When to use Group by

Group bys are used to calculate metrics from numerical data. The metrics are typically used to understand how the business is performing. GROUP BY allows us to perform aggregate calculations on data from a set of rows grouped together based on values of specified column(s).

2.2.2. How to use Group by

Group bys involve grouping by a set of columns (typically low cardinality) and aggregating numbers to provide an overview of how the set of values representing the columns are performing.

SELECT
  orderpriority,
  count(*) AS num_orders
FROM
  orders
GROUP BY
  orderpriority;

In the above query, we group the data by orderpriority, and the calculation count() will be applied to the rows having a specific orderpriority value. The output will consist of one row per unique value of orderpriority and the count() calculation.

Types of Group by

  1. Standard agg: min/max/sum/avg/count
  2. Statistical agg: Functions like correlation, sampling, standard deviation, skew, etc
  3. Collection agg: Functions to combine values into nested data types, e.g., array_agg, collect_set, etc
  4. Approximation agg: Functions that are fast by sacrificing accuracy, e.g., approx_distinct, approx_most_frequent
  5. Convenience agg: Functions that make common usages easier, e.g., count_if, bool_or, etc

The available alternatives for grouping are ROLLUP, CUBE, etc.

Note The types of aggregate depend on your database engine; check your documentation for available functions. E.g.,Trino aggregation function .

2.2.3. Things to watch out for when using Group by

  1. If you are grouping by multiple columns or using Group by to remove duplicates, this usually indicates a problem with your underlying data model
  2. For the columns on which you want to run aggregation functions, ensure that they are of the correct data type.
  3. Ensure that the columns that you are aggregating are additive. For example, you cannot aggregate percentages; instead, you must aggregate the numerator and denominator separately.

3. Conclusion

To recap, we saw:

  1. How to use joins to create denormalized dimension tables & enrich facts with dimensions.
  2. How group bys represent the core purpose of data warehousing.

Joins and Group bys are extremely powerful; while they are easy enough operations to grasp when combined with complex requirements and a bad data model, they are extremely easy to introduce unintended bugs.

Thinking about failure scenarios in depth and knowing how to use them effectively will make your code easier to maintain and bug-free.

  1. What is data warehousing
  2. SQL basics
  3. Improve SQL as a data engineer

If you found this article helpful, share it with a friend or colleague using one of the socials below!

Land your dream Data Engineering job!

Overwhelmed by all the concepts you need to learn to become a data engineer? Have difficulty finding good data projects for your portfolio? Are online tutorials littered with sponsored tools and not foundational concepts?

Learning data engineer can be a long and rough road, but it doesn't have to be!

Pick up any new tool/framework with a clear understanding of data engineering fundamentals. Demonstrate your expertise by building well-documented real-world projects on GitHub.

Sign up for my free DE-101 course that will take you from basics to building data projects in 4 weeks!

Join now and get started on your data engineering journey!

    Testimonials:

    I really appreciate you putting these detailed posts together for your readers, you explain things in such a detailed, simple manner that's well organized and easy to follow. I appreciate it so so much!
    I have learned a lot from the course which is much more practical.
    This course helped me build a project and actually land a data engineering job! Thank you.

    When you subscribe, you'll also get emails about data engineering concepts, development practices, career advice, and projects every 2 weeks (or so) to help you level up your data engineering skills. We respect your email privacy.

    M ↓   Markdown