6 Key Concepts, to Master Window Functions

Introduction

If work with data, window functions can significantly level up your SQL skills. If you have ever thought

window functions are confusing

How do I perform calculations that use information from other rows, in SQL

then the post is for you. We go over 6 key concepts to master the window function.

Prerequisites

If you want to code along install the following

  1. docker to run postgres
  2. pgcli to connect to our postgres instance

On your terminal start a Postgres docker container and log into it

docker run --name pg_local -p 5432:5432 \
-e POSTGRES_USER=start_data_engineer -e POSTGRES_PASSWORD=password \
-e POSTGRES_DB=window -d postgres:12.2

pgcli -h localhost -p 5432 -U start_data_engineer window
# the password is password, enter when prompted

Create a fake clickstream table.

drop table if exists clickstream;
create table clickstream (
    eventId varchar(40),
    userId int,
    sessionId int,
    actionType varchar(8),
    datetimeCreated timestamp
);
INSERT INTO clickstream(eventId, userId, sessionId, actionType, datetimeCreated )
VALUES 
('6e598ae5-3fb1-476d-9787-175c34dcfeff',1 ,1000,'click','2020-11-25 12:40:00'),
('0c66cf8c-0c00-495b-9386-28bc103364da',1 ,1000,'login','2020-11-25 12:00:00'),
('58c021ad-fcc8-4284-a079-8df0d51601a5',1 ,1000,'click','2020-11-25 12:10:00'),
('85eef2be-1701-4f7c-a4f0-7fa7808eaad1',1 ,1001,'buy',  '2020-11-22 18:00:00'),
('08dd0940-177c-450a-8b3b-58d645b8993c',3 ,1010,'buy',  '2020-11-20 01:00:00'),
('db839363-960d-4319-860d-2c9b34558994',10,1120,'click','2020-11-01 13:10:03'),
('2c85e01d-1ed4-4ec6-a372-8ad85170a3c1',10,1121,'login','2020-11-03 18:00:00'),
('51eec51c-7d97-47fa-8cb3-057af05d69ac',8 ,6,   'click','2020-11-10 10:45:53'),
('5bbcbc71-da7a-4d75-98a9-2e9bfdb6f925',3 ,3002,'login','2020-11-14 10:00:00'),
('f3ee0c19-a8f9-4153-b34e-b631ba383fad',1 ,90,  'buy',  '2020-11-17 07:00:00'),
('f458653c-0dca-4a59-b423-dc2af92548b0',2 ,2000,'buy',  '2020-11-20 01:00:00'),
('fd03f14d-d580-4fad-a6f1-447b8f19b689',2 ,2000,'click','2020-11-20 00:00:00');

Note We use the term user session to indicate a time when a user logs in until they log out. The session id will be the same between the corresponding user login and logout event.

6 Key Concepts

A Partition refers to a set of rows that have the same values in one or more columns. These column(s) are specified using the PARTITION BY clause as we will see in the examples.

1. When to Use

Window functions are useful when you have to

  1. Rank rows based on a certain column(s) within each partition in the table.
  2. Label numerical values within each partition into buckets based on percentile.
  3. Identify the first(or second or last) event within a specific partition.
  4. Calculate rolling average/mean.

General uses of window functions are when

  1. A calculation is needed to be performed on a set of rows(defined by partition columns) and still keep the result at row level. If we use group by we would have to use aggregation functions on any columns that are not part of the group by clause.
  2. Need to perform calculations based on a rolling window.

The clickstream table contains login, click, buy, sell, return, logout events from our app. The query below orders the events per userId & sessionId, based on the event creation (datetimeCreated) time.

select eventId,
    userId,
    sessionId,
    actionType,
    datetimeCreated,
    ROW_NUMBER() OVER(
        PARTITION BY userId,
        sessionId
        ORDER BY datetimeCreated DESC
    ) as eventOrder
from clickstream;

2. Partition By

Partition By clause defines the column(s) that groups related rows. This set of rows is referred to as a partition. In our case, we partition by userId, sessionId which groups rows, as shown below.

partition

3. Order By

Within a partition, we can optionally define the order of rows using the Order By clause. Here we order the rows within a partition in reverse chronological order based on event created time (datetimeCreated).

order by

4. Function

This is defined before the OVER (PARTITION BY ..). This function will be applied to the rows within a partition. The available window functions depend on your database, eg in PostgreSQL we have these window functions .

In our example, we use ROW_NUMBER to get the order in which the events were created per user session.

window function

5. Lead and Lag

These can be used to perform calculations based on data from other rows. Lead and Lag are used to access data from rows after or before the current row respectively. The rows can be ordered using the order by clause.

Lead and lag can be used to calculate the time difference between events for a given user session (partition). In the example below, we use lead and lag to get the time that the next and previous events occur during a user session.

select eventId,
    userId,
    sessionId,
    actionType,
    datetimeCreated,
    LEAD(datetimeCreated, 1) OVER(
        PARTITION BY userId,
        sessionId
        ORDER BY datetimeCreated
    ) as nextEventTime,
    LAG(datetimeCreated, 1) OVER(
        PARTITION BY userId,
        sessionId
        ORDER BY datetimeCreated
    ) as prevEventTime
from clickstream;

Lead and Lag

6. Rolling Window

We can use window functions without a PARTITION BY clause to simulate a rolling window over all the rows Let’s say we want to find the number of buy events within the last 5 events across all users, exclusive of the current event, then we do the following.

select eventId,
    userId,
    sessionId,
    actionType,
    datetimeCreated,
    SUM(
        CASE
            WHEN actionType = 'buy' THEN 1
            ELSE 0
        END
    ) OVER(
        ORDER BY datetimeCreated DESC ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING
    ) as num_purchases
from clickstream;

Preceding

You can see from the image that the window starts from the 5 PRECEDING rows and stops before the current row, which is the 1 PRECEDING row. This num_purchases will be calculated for each row as seen in the result set above.

Let’s write a query to check if one of the current, previous, or next events was a buy event.

select eventId,
    userId,
    sessionId,
    actionType,
    datetimeCreated,
    MAX(
        CASE
            WHEN actionType = 'buy' THEN 1
            ELSE 0
        END
    ) OVER(
        ORDER BY datetimeCreated DESC ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
    ) as neighborBuy
from clickstream;

Preceding and Following

You can also use this custom window ranges within a specific partition.

Efficiency Considerations

Window functions can be expensive, use EXPLAIN to see the query plan. This will help when using window functions in low latency situations.

For example, if you want to only get the row with the latest event. It might be beneficial to use another technique, such as a group by shown below

EXPLAIN
select *
from (
        select userId,
            sessionId,
            datetimeCreated,
            ROW_NUMBER() OVER(
                PARTITION BY userId,
                sessionId
                ORDER BY datetimeCreated DESC
            ) as eventOrder
        from clickstream
    ) as t
where t.eventOrder = 1;

Window Explain

EXPLAIN
select userId,
    sessionId,
    max(datetimeCreated) as datetimeCreated
from clickstream
group by userId,
    sessionId;

\q -- exit pgcli

Subquery Explain

The cost shown in the red box in the above images represents the cost of the query(“cost units are arbitrary, but conventionally mean disk page fetches” - pg docs). The lower the cost the faster your query execution. Try EXPLAIN ANALYZE to execute the query and get the actual time taken as well.

# Stop Postgres docker container
docker stop pg_local
docker rm pg_local

Conclusion

The next time you come across a particularly complex query, which involves having to group but keep the same granularity, or calculate metrics based on values from other rows use window functions.

Hope this article gives you a good idea of when to use window functions, what they are, what they can do and the performance considerations to be aware of while using them.

As always, if you have any questions or comments, please feel free to leave them in the comments section below.

Further reading

  1. What is a data warehouse
  2. What are common table expressions

References

  1. explain
  2. markdown TOC

If you found this article helpful, share it with a friend or colleague using one of the socials below!

Don't want to keep digging through the SQL window function docs ?

Drop in your email address below and I'll send you my cheat sheet so you have a handy reference sheet you can use any time you are using SQL window functions. I'll also let you know whenever I publish a new blog post.

    We only send useful and actionable content. Unsubscribe at any time.

    M ↓   Markdown
    B
    Bernardo
    1 point
    3 years ago

    Great post! I am coming from your course-newsletter and I get the feeling this is exactly the type of content I needed to grow my DE skills! :D

    Also, db-fiddle.com is a nice way to follow your exercises/explanations when you provide the insert queries. :)

    J
    Joseph Kevin Machado
    0 points
    3 years ago

    Thank you Bernardo! I'll try to provide insert queries in my other articles as well.

    A
    Alex Hessler
    1 point
    4 years ago

    Very good explanation !! Thank you a lot for this awesome article.

    You are one of the very few newsletter that I keep reading because of the good quality.

    Great job. Keep going :)

    J
    Joseph Machado
    0 points
    4 years ago

    Thank you Alex

    M
    Madhusudan K
    1 point
    3 years ago

    Beautifully and so neatly explained...

    J
    Joseph Kevin Machado
    0 points
    3 years ago

    Thank you, Madhusudan

    R
    Renee S. Liu
    0 points
    4 years ago

    Hey this is really helpful! The only thing I am having trouble understanding is the last example where the code has 'DESC ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING' in it. Would you please explain further the intuition behind this query? Thank you ahead!

    J
    Joseph Machado
    0 points
    4 years ago

    Hey @reneesliu:disqus , Glad this is helpful.

    The query you question can be understood in 3 parts

    1. "Case when buy then 1". This will create a flag which will be either 0 or 1 depending on absence or presence of buy event.
    2. Max is a function applied over the range which is determined by 1 preceding and 1 following. This MAX is used to calculate presence on 1 among the 3 rows(the current row, 1 preceding and 1 following), you can think of this as an if condition checking for presence of atleast on 1 across the 3 rows.
    3. This 3 row window will be calculated for each row and the result will be stored in the neighborBuy column

    lmk if this helps.

    R
    Renee S. Liu
    0 points
    4 years ago

    Hi Joseph, really appreciate your explanation. So the business intuition behind this is really to see if there are buy events among the current 3 windows? Thank you!

    J
    Joseph Machado
    0 points
    4 years ago

    you are welcome.
    Yes the business intuition is to see if there is a buy events among the 3 rows in the window.
    Max is generally used as an aggregate substitute to get these "is present" true/false columns.

    R
    Renee S. Liu
    0 points
    4 years ago

    That's really informative. Thank you a lotl

    ?
    Anonymous
    0 points
    3 years ago

    Really useful, thanks for sharing the knowledge