6 Key Concepts, to Master Window Functions

Introduction

Window functions are one the most powerful features of SQL. In spite of this, window functions are not used frequently. If you have ever thought “window functions are confusing”, then post is for you. We go over 6 key concepts to master window function.

Optional Prerequisites

This is optional, if you want to follow along with code.

  1. docker to run postgres
  2. pgcli to connect to our postgres instance

From your terminal run

docker run --name pg_local -p 5432:5432 \
-e POSTGRES_USER=start_data_engineer -e POSTGRES_PASSWORD=password \
-e POSTGRES_DB=window -d postgres:12.2

Now, login to the running postgres instance as shown below, the password is password

pgcli -h localhost -p 5432 -U start_data_engineer window

Create clickstream table and fill it with some fake data

drop table if exists clickstream;
create table clickstream (
    eventId varchar(40),
    userId int,
    sessionId int,
    actionType varchar(8),
    datetimeCreated timestamp
);
INSERT INTO clickstream(eventId, userId, sessionId, actionType, datetimeCreated )
VALUES 
('6e598ae5-3fb1-476d-9787-175c34dcfeff',1 ,1000,'click','2020-11-25 12:40:00'),
('0c66cf8c-0c00-495b-9386-28bc103364da',1 ,1000,'login','2020-11-25 12:00:00'),
('58c021ad-fcc8-4284-a079-8df0d51601a5',1 ,1000,'click','2020-11-25 12:10:00'),
('85eef2be-1701-4f7c-a4f0-7fa7808eaad1',1 ,1001,'buy',  '2020-11-22 18:00:00'),
('08dd0940-177c-450a-8b3b-58d645b8993c',3 ,1010,'buy',  '2020-11-20 01:00:00'),
('db839363-960d-4319-860d-2c9b34558994',10,1120,'click','2020-11-01 13:10:03'),
('2c85e01d-1ed4-4ec6-a372-8ad85170a3c1',10,1121,'login','2020-11-03 18:00:00'),
('51eec51c-7d97-47fa-8cb3-057af05d69ac',8 ,6,   'click','2020-11-10 10:45:53'),
('5bbcbc71-da7a-4d75-98a9-2e9bfdb6f925',3 ,3002,'login','2020-11-14 10:00:00'),
('f3ee0c19-a8f9-4153-b34e-b631ba383fad',1 ,90,  'buy',  '2020-11-17 07:00:00'),
('f458653c-0dca-4a59-b423-dc2af92548b0',2 ,2000,'buy',  '2020-11-20 01:00:00'),
('fd03f14d-d580-4fad-a6f1-447b8f19b689',2 ,2000,'click','2020-11-20 00:00:00');

6 Key Concepts:

1. When to Use

Some sample use cases of when window functions would be useful are, when you have to

Note: partition in this context means a combination of one or more columns of the table, which uniquely identifies a set of rows.

  1. Rank rows based on a certain column(s) within each partition in the table.
  2. Label numerical values within a partition into buckets based on percentile.
  3. Flag the row that represent the first(or last) login event from a clickstream table along with other rows.
  4. Rolling window calculation over all(or based on partition) rows of the table.

You can see a common trend among the sample use cases. Basically we use window function when a calculation is to be performed on a subset of rows defined by the partition columns and we have to keep all(or some) of the rows, which we would have not been able to keep if we had used a GROUP BY.

Let’s go over the next section using an example. Let’s assume that we are working with a clickstream table. This clickstream table contains login, click, buy, sell, return, logout events from our app.

Let’s use this query, which assigns a row number to the events based on reverse chronological order of its occurrence per user session.

select eventId,
    userId,
    sessionId,
    actionType,
    datetimeCreated,
    ROW_NUMBER() OVER(
        PARTITION BY userId,
        sessionId
        ORDER BY datetimeCreated DESC
    ) as eventOrder
from clickstream;

2. Partition By

Partition by defines the subset of one or more columns over which the function will be executed. In our case we partition by userId, sessionId as shown in the image below.

partition

3. Order By

Within a partition we can optionally define the order of rows using order by. In the example we order the rows within a partition based on the time that event was created.

order by

4. Function

This defines the function that is to be executed within each partition. Depending on the type of function the order of the rows within the partition will matter. The available window functions depend on your database, for eg in PostgreSQL we have these window functions. In our use case we assign a row number to the rows in reverse chronological order per partition using ROW_NUMBER. You could also try RANK and DENSE_RANK functions.

window function

5. Lead and Lag

If you want to compare data from a row with data from other rows that come before or after the current row depending on the order by clause, use lead or lag. Let’s say you want to get

  1. the time of the next user action
  2. the time of the previous user action

We use LEAD function to get the next event’s time. We do this by sorting using “datetimeCreated” in ascending order and assigning the next rows “datetimeCreated” as the nextEventTime. We use LAG similarly to get the prevEventTime

select eventId,
    userId,
    sessionId,
    actionType,
    datetimeCreated,
    LEAD(datetimeCreated, 1) OVER(
        PARTITION BY userId,
        sessionId
        ORDER BY datetimeCreated
    ) as nextEventTime,
    LAG(datetimeCreated, 1) OVER(
        PARTITION BY userId,
        sessionId
        ORDER BY datetimeCreated
    ) as prevEventTime
from clickstream;

Lead and Lag

6. Rolling Window

Note that within our partition, the function operates on the entire partition. For example the row_number was applied to the entire partition. What if we wanted to calculate a running window, eg) If we are analyzing user behavior and want to find the number of buy events within the last 5 events across all users, exclusive of the current event, then we do the following

select eventId,
    userId,
    sessionId,
    actionType,
    datetimeCreated,
    SUM(
        CASE
            WHEN actionType = 'buy' THEN 1
            ELSE 0
        END
    ) OVER(
        ORDER BY datetimeCreated DESC ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING
    ) as num_purchases
from clickstream;

Preceding

You can see from the image that window starts from 5 preceding rows and stops before the current row, which is the 1 preceding row.

Let’s say you want to write a query that checked the previous 1 and next 1 event to see if either of them or the current event was a buy event, you can do the following

select eventId,
    userId,
    sessionId,
    actionType,
    datetimeCreated,
    MAX(
        CASE
            WHEN actionType = 'buy' THEN 1
            ELSE 0
        END
    ) OVER(
        ORDER BY datetimeCreated DESC ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
    ) as neighborBuy
from clickstream;

Preceding and Following

You can also use these custom window ranges within a specific partition.

Efficiency Considerations

Depending on your database the window functions can be expensive. Use EXPLAIN to see the query plan, if you plan on using window functions in a low latency requirement situation this will help optimize for performance.

For example if you want to only get the row with the latest event. It might be beneficial to use another technique, such as a group by shown below

EXPLAIN
select *
from (
        select userId,
            sessionId,
            datetimeCreated,
            ROW_NUMBER() OVER(
                PARTITION BY userId,
                sessionId
                ORDER BY datetimeCreated DESC
            ) as eventOrder
        from clickstream
    ) as t
where t.eventOrder = 1;

Window Explain

EXPLAIN
select userId,
    sessionId,
    max(datetimeCreated) as datetimeCreated
from clickstream
group by userId,
    sessionId;

\q -- exit pgcli

Subquery Explain

The cost shown in the red box in the above images represent the cost of the query(“cost units are arbitrary, but conventionally mean disk page fetches” - pg docs). The lower the cost the faster your query execution. Try EXPLAIN ANALYZE to execute the query and get the actual time taken as well.

# Stop postgres docker container
docker stop pg_local
docker rm pg_local

Conclusion

Window functions are very useful in data analytics and performing operations which cannot be done easily with standard group by, subquery and filters. The next time you come across a particularly complex query, which involves having to write multiple queries try to see if it could be done using a window function. Hope this article gives you a good idea of when to use window functions, what they are, what they can do and the performance considerations to be aware of while using them. Let me know if you have any questions or comments in the comment section below.

Reference

  1. explain
  2. markdown TOC