How to Join a fact and a type 2 dimension (SCD2) table

Introduction

If you are using a data warehouse, you would have heard of fact and dimension tables. Simply put, fact tables are used to record a business event and dimension tables are used to record the attributes of business items(eg user, item tables in an e-commerce app). This is also referred to as a star schema or dimension modeling.

For example, in an e-commerce website, a fact table would contain information about orders, such as when the order was placed, the items in that order, who placed that order, etc. The dimension tables would be an item table (containing item id, item price, size, etc) and an user table (containing user id, user name, user address etc).

If you are wondering

how can I store my table’s history over time ?

how can I join a fact table and an SCD2 dimension table ?

Then this post is for you. For a detailed explanation of what a data warehouse is, checkout this article.

What is an SCD2 table and why use it?

The dimensional data in a data warehouse are usually derived from an application’s database. There are 7 common types of ways to model and store dimensional data in a data warehouse. In this post, we will look exclusively at Type 2: Add New Row.

SCD2 stands for slowly changing dimension type 2. In this type, we create a new row for each change to an existing record in the corresponding transaction table. Each row in the SCD2 dimension table will have row effective and row expiration datetime columns to denote the range within which that row represents the state of the data.

Application table

Assume in our e-commerce applications database that we have a user table to store user attributes.

user column definition
user_id unique identifier for a user (primary key)
first_name first name of the user
last_name last name of the user
address address of the user
zipcode zipcode of the user
created_datetime Date and time when the user was created
updated_datetime Date and time when this user data was last modified

Let’s say that a user with user_id=b0cc9fde-a29a-498e-824f-e52399991beb has a zip code of 10027 until 2020-12-31, after which, the user changes address and the new zip code is 10012. In our application table, this would mean that the record with user_id=b0cc9fde-a29a-498e-824f-e52399991beb now has a zip code of 10012.

Transaction update

Dimension table

In the data warehouse, we need to ensure that we have access to historical data to run historical analyses. For example, if an end user want to check how many of our users lived in the zip code 10027 for the month of Dec 2020, we need to make sure that the user with user_id=b0cc9fde-a29a-498e-824f-e52399991beb is counted towards 10027 and not 10012, because that is where that user was, as of December 2020.

To store this historically changing data in our data warehouse, we create a new row for each change. Let’s consider our user_dim(user dimension) table in our data warehouse. This corresponds to the user table in our application database.

user_dim column definition
user_id unique identifier for a user (primary key)
first_name first name of the user
last_name last name of the user
address address of the user
zipcode zipcode of the user
created_datetime Date and time when the user was created
updated_datetime Date and time when this user data was last modified
row_effective_datetime The date and time from which this row is the effective data for this user_id
row_expiration_datetime The date and time until which this row is the effective data for this user_id
current_row_indicator Indicator denoting if this row is the most current state of the data for this user_id

SCD2

We do not update the value of an existing row, but we add a new record to indicate the new state. We also set the old record’s row_expiration_datetime and set current_row_indicator to expired.

--row_expiration_datetime=9999-12-31 00:00:00
++row_expiration_datetime=2021-01-01 17:59:59

--current_row_indicator=current
++current_row_indicator=expired

For the new record, we set the updated_datetime as row_effective_datetime and set end of time as row_expiration_datetime and set current for current_row_indicator to reflect the current state.

Setup

We will use PostgreSQL on a docker container for our data warehouse and pgcli to connect to it.

  1. docker
  2. pgcli

From your terminal, run

docker run --name pg_local -p 5432:5432 -e POSTGRES_USER=sde -e POSTGRES_PASSWORD=password -e POSTGRES_DB=scd2 -d postgres:12.2

Now, login to the running postgres instance as shown below. The password is password

pgcli -h localhost -p 5432 -U sde scd2 # password is password

Let’s create a simple user_dim table and an items_purchased fact table in our data warehouse.

CREATE DATABASE warehouse;
USE warehouse;
DROP TABLE IF EXISTS user_dim;
CREATE TABLE user_dim (
    user_key BIGINT,
    user_id VARCHAR(40),
    first_name VARCHAR(10),
    last_name VARCHAR(10),
    address VARCHAR(100),
    zipcode VARCHAR(10),
    created_datetime TIMESTAMP,
    updated_datetime TIMESTAMP,
    row_effective_datetime TIMESTAMP,
    row_expiration_datetime TIMESTAMP,
    current_row_indicator VARCHAR(10)
);
INSERT INTO user_dim (
        user_key,
        user_id,
        first_name,
        last_name,
        address,
        zipcode,
        created_datetime,
        updated_datetime,
        row_effective_datetime,
        row_expiration_datetime,
        current_row_indicator
    )
VALUES (
        1000,
        'b0cc9fde-a29a-498e-824f-e52399991beb',
        'john',
        'doe',
        'world',
        10027,
        '2020-01-01 10:00:00',
        '2020-01-01 10:00:00',
        ' 2020-01-01 10:00:00',
        '2021-01-01 17:59:59',
        'expired'
    ),
    (
        1200,
        'b0cc9fde-a29a-498e-824f-e52399991beb',
        'john',
        'doe',
        'world',
        10012,
        '2020-01-01 10:00:00',
        '2021-01-01 18:00:00',
        '2021-01-01 18:00:00',
        '9999-12-31 00:00:00',
        'current'
    );
DROP TABLE IF EXISTS items_purchased;
CREATE TABLE items_purchased (
    item_purchased_id VARCHAR(40),
    order_id VARCHAR(40),
    user_id VARCHAR(40),
    item_id VARCHAR(40),
    item_cost decimal(10, 2),
    purchased_datetime TIMESTAMP -- and other fact information
);
INSERT INTO items_purchased (
        item_purchased_id,
        order_id,
        user_id,
        item_id,
        item_cost,
        purchased_datetime
    )
VALUES (
        'nljbac724bbskd',
        'order_id_1',
        'b0cc9fde-a29a-498e-824f-e52399991beb',
        'item_id_1',
        1500.00,
        '2020-12-28 12:30:00'
    ),
    (
        'ljbkcfvbj6758njh',
        'order_id_23',
        'b0cc9fde-a29a-498e-824f-e52399991beb',
        'item_id_45',
        20.00,
        '2021-01-28 09:30:00'
    ),
    (
        'sjbv09uy7njhbvvj',
        'order_id_100',
        '11111111-2222-3333-44444444444',
        'item_id_12',
        32.00,
        '2021-02-01 11:00:00'
    );

We now have an items_purchased fact table and a user_dim dimension table, with some sample data.

Joining fact and SCD2 tables

Let’s say our data analysts wants to answer questions like

  1. I want to see the distribution of our high spending users by month, year and zip code(at time of purchase)
  2. I want to see the high spending users first name, last name, zip code along with how long they lived at each zip code(duration_of_stay). If they are currently living at a zip code then the duration_of_stay should be Null.

High spending user is defined as any user who has spent at least 1000$ on our website.

We can write a query to answer question 1, as shown below

WITH high_spenders AS (
    SELECT user_id
    FROM items_purchased
    GROUP BY user_id
    HAVING sum(item_cost) > 1000
),
user_items AS (
    SELECT ip.item_purchased_id,
        ip.user_id,
        ip.item_cost,
        ud.zipcode,
        ip.purchased_datetime
    FROM items_purchased ip
        JOIN user_dim ud ON ip.user_id = ud.user_id
        AND ip.purchased_datetime BETWEEN ud.row_effective_datetime AND ud.row_expiration_datetime
)
SELECT EXTRACT(
        YEAR
        FROM ui.purchased_datetime
    ) yr,
    EXTRACT(
        MONTH
        FROM ui.purchased_datetime
    ) mnth,
    ui.zipcode,
    COUNT(DISTINCT ui.user_id) num_high_spenders
FROM user_items ui
    JOIN high_spenders hs ON ui.user_id = hs.user_id
GROUP BY yr,
    mnth,
    ui.zipcode
ORDER BY yr DESC,
    mnth DESC;

Let’s look at each of the individual CTEs, high_spenders and user_items.

high_spenders

high spenders

We can see that the user with user_id=b0cc9fde-a29a-498e-824f-e52399991beb is a high spender, because they spent a total of 1520.00 during their entire time in our application.

user_items

user_items

We can see how the user with user_id=b0cc9fde-a29a-498e-824f-e52399991beb is associated with the zip code of 10027 for purchases made in the month of December 2020 and zip code of 10012 for purchases made in the month of January 2021.

The fact to SCD2 dimension table join happens in the user_items CTE.

Fact SCD2 Query

Fact SCD2 Join

We know that for any given row in the user_dim table, the row_effective_datetime and row_expiration_datetime define the ranges between which the row represents the state of data at that point in time.

This enables us to uniquely join a row in the fact table to the state of the user at the time the purchase was made.

The distribution of our high spending users by month, year, and zip code, will be as shown below.

yr mnth zipcode num_high_spenders
2021.0 1.0 10012 1
2020.0 12.0 10027 1

Answering question number 2 is left as an exercise for the reader. Comment your answers in the comment section below to check the implementation.

To stop and remove the docker containers, run

docker stop pg_local
docker rm pg_local

Educating end users

When using a slowly changing dimension, it is crucial to educate the end user on how to use the data. You can try to encapsulate its complexity by creating a view or a function on top of your slowly changing dimension table. However ultimately, any users of data should be aware of what SCD2 is, how to use, and how not to use them.

Conclusion

Hope this article gave you a good understanding of how to join fact tables with SCD2 dimension tables to get point in time information. Most end user questions arise from the inability to have point in time information which makes the results of their queries inaccurate and confusing.

The next time you are modeling a dimension table in your data warehouse, try SCD2 to store historical changes that will be required for running analytical queries later.

Further reading

  1. Common Table Expressions (CTEs)
  2. What is a data warehouse
  3. Mastering window functions
  4. DBT support for SCD2

References:

  1. The Data Warehouse Toolkit