What are Common Table Expressions (CTEs), and when should you use them?

Introduction

Common Table Expressions (CTEs) are used to define temporary, named result sets that exist only for the duration of the query in which they are defined. If you have

Wondered what CTEs are

Wanted to understand the potential performance pitfalls of CTEs

then this post is for you. We go over what CTEs are and compare their performance to subquery, derived table, and temp table based approaches in the context of a data warehouse.
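
As a quick primer, a CTE is defined with the with keyword and can then be referenced like a table in the rest of the statement. Here is a minimal sketch of the syntax (my_table and my_col are hypothetical names used only to illustrate the shape):

-- minimal CTE sketch; my_table and my_col are hypothetical names
with my_cte as (
    select my_col,
        count(*) as cnt
    from my_table
    group by my_col
)
select *
from my_cte
where cnt > 1;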

Setup

This post uses AWS Redshift to explore CTEs. You can also follow along without setting up your own Redshift instance.

Prerequisites

  1. pgcli
  2. AWS account
  3. AWS CLI installed and configured

In your terminal, do the following

# create an AWS Redshift instance
aws redshift create-cluster --node-type dc2.large --number-of-nodes 2 --master-username sdeuser --master-user-password Password1234 --cluster-identifier sdeSampleCluster

# get your AWS Redshift endpoints address
aws redshift describe-clusters --cluster-identifier sdesamplecluster | grep '\"Address'

# use pgcli to connect to your AWS Redshift instance
pgcli -h <your-redshift-address> -U sdeuser -p 5439 -d dev
# note dev is the default database created. When prompted for the password enter "Password1234".
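
Once connected, you can optionally run a quick query to confirm you are on the right database:

-- optional sanity check: should return "dev"
select current_database();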

Let’s assume you work for a data collection company that tracks user clickstream and geolocation data. In your SQL terminal (pgcli) create fake clickstream and geolocation tables as shown below.

-- create fake clickstream data
drop table if exists clickstream;
create table clickstream (
    eventId varchar(40),
    userId int,
    sessionId int,
    actionType varchar(8),
    datetimeCreated timestamp
);
insert into clickstream(eventId, userId, sessionId, actionType, datetimeCreated )
values 
('6e598ae5-3fb1-476d-9787-175c34dcfeff',1 ,1000,'click','2020-11-25 12:40:00'),
('0c66cf8c-0c00-495b-9386-28bc103364da',1 ,1000,'login','2020-11-25 12:00:00'),
('58c021ad-fcc8-4284-a079-8df0d51601a5',1 ,1000,'click','2020-11-25 12:10:00'),
('85eef2be-1701-4f7c-a4f0-7fa7808eaad1',1 ,1001,'buy',  '2020-11-22 18:00:00'),
('08dd0940-177c-450a-8b3b-58d645b8993c',3 ,1010,'buy',  '2020-11-20 01:00:00'),
('db839363-960d-4319-860d-2c9b34558994',10,1120,'click','2020-11-01 13:10:03'),
('2c85e01d-1ed4-4ec6-a372-8ad85170a3c1',10,1121,'login','2020-11-03 18:00:00'),
('51eec51c-7d97-47fa-8cb3-057af05d69ac',8 ,6,   'click','2020-11-10 10:45:53'),
('5bbcbc71-da7a-4d75-98a9-2e9bfdb6f925',3 ,3002,'login','2020-11-14 10:00:00'),
('f3ee0c19-a8f9-4153-b34e-b631ba383fad',1 ,90,  'buy',  '2020-11-17 07:00:00'),
('f458653c-0dca-4a59-b423-dc2af92548b0',2 ,2000,'buy',  '2020-11-20 01:00:00'),
('fd03f14d-d580-4fad-a6f1-447b8f19b689',2 ,2000,'click','2020-11-20 00:00:00');

-- create fake geolocation data
drop table if exists geolocation;
create table geolocation (
    userId int,
    zipcode varchar(10),
    datetimeCreated timestamp
);

insert into geolocation(userId, zipCode, datetimeCreated )
values 
(1 ,66206,'2020-11-25 12:40:00'),
(1 ,66209,'2020-11-25 12:00:00'),
(1 ,91355,'2020-11-25 12:10:00'),
(1 ,83646,'2020-11-22 18:00:00'),
(3 ,91354,'2020-11-20 01:00:00'),
(10,91355,'2020-11-01 13:10:03'),
(10,91355,'2020-11-03 18:00:00'),
(8 ,91355,'2020-11-10 10:45:53'),
(3 ,91355,'2020-11-14 10:00:00'),
(1 ,83646,'2020-11-17 07:00:00'),
(2 ,83646,'2020-11-20 01:00:00'),
(2 ,91355,'2020-11-20 00:00:00');
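
If you want to confirm the inserts worked, a quick (optional) row-count check should show 12 rows in each table:

-- optional sanity check: each table should contain 12 rows
select 'clickstream' as tableName, count(*) as numRows from clickstream
union all
select 'geolocation', count(*) from geolocation;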

Common Table Expressions (CTEs)

Let’s assume that we want to get the userId and the number of purchases made by users who have been in multiple locations (identified by zip code) and have purchased at least one product. You can write a query that uses a subquery (where userId in (...)) as shown below.

select userId,
    sum(
        case
            when actionType = 'buy' then 1
            else 0
        end
    ) as numPurchases
from clickstream
where userId in (
        select userId
        from geolocation
        group by userId
        having count(distinct zipCode) > 1
    )
group by userId
having numPurchases >= 1;

Alternatively, we can use CTEs to define temporary result sets that exist only for the duration of the query, as shown below.

with buyingUsers as (
    select userId,
        sum(
            case
                when actionType = 'buy' then 1
                else 0
            end
        ) as numPurchases
    from clickstream
    group by userId
    having numPurchases >= 1
),
movingUsers as (
    select userId
    from geolocation
    group by userId
    having count(distinct zipCode) > 1
)
select bu.userId,
    bu.numPurchases
from buyingUsers bu
    join movingUsers mu on bu.userId = mu.userId;

Notice how we replaced the subquery with a CTE. This example is simple, but in cases with multiple derived tables and sophisticated join logic, using CTEs can make your query much easier to read.

Performance: CTEs vs subqueries and derived tables vs temp tables

Now that we know what CTEs are, let’s compare their performance against other competing approaches. Before we look at the query plans, we should update the table statistics so that the planner can generate more accurate plans. We update table statistics using ANALYZE.

analyze clickstream;
analyze geolocation;
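
If you want to verify that the statistics are fresh, the stats_off column of the svv_table_info system view indicates how stale a table’s statistics are (0 means current). A quick, optional check:

-- optional: check staleness of table statistics; stats_off = 0 means up to date
select "table", stats_off, tbl_rows
from svv_table_info
where "table" in ('clickstream', 'geolocation');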

For the performance comparison, let’s consider a more involved requirement.

Objective: For users who have been in multiple locations (identified by zip code) and have purchased at least one product, get user-session level metrics. The metrics required are counts of their clicks, logins and purchases.

The result should be

userId  sessionId  numclicks  numlogins  numpurchases
2       2000       1          0          1
1       90         0          0          1
1       1001       0          0          1
1       1000       2          1          0
3       3002       0          1          0
3       1010       0          0          1

CTE performance

Let’s look at the CTE way to achieve the objective.

with purchasingUsers as (
    select userId,
        sum(
            case
                when actionType = 'buy' then 1
                else 0
            end
        ) as numPurchases
    from clickstream
    group by userId
    having numPurchases >= 1
),
movingUsers as (
    select userId
    from geolocation
    group by userId
    having count(distinct zipCode) > 1
),
userSessionMetrics as (
    select userId,
        sessionId,
        sum(
            case
                when actionType = 'click' then 1
                else 0
            end
        ) as numclicks,
        sum(
            case
                when actionType = 'login' then 1
                else 0
            end
        ) as numlogins,
        sum(
            case
                when actionType = 'buy' then 1
                else 0
            end
        ) as numPurchases
    from clickstream
    group by userId,
        sessionId
)
select usm.*
from userSessionMetrics as usm
    join movingUsers as mu on usm.userId = mu.userId
    join purchasingUsers as pu on usm.userId = pu.userId;

We can see the query plan by prefixing the above query with explain in your SQL terminal.
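
For reference, the pattern looks like this, shown here with a shortened query so it fits on a few lines; for the actual comparison, prefix the full CTE query above with explain:

-- prefix a query with explain to see its plan without executing it
explain
with movingUsers as (
    select userId
    from geolocation
    group by userId
    having count(distinct zipCode) > 1
)
select count(*)
from movingUsers;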

CTE query plan

From the query plan (shown above), we can see the steps the query planner decided to take.

Note that cost is relative. “Cost is a measure that compares the relative execution times of the steps within a plan. Cost does not provide any precise information about actual execution times or memory consumption, nor does it provide a meaningful comparison between execution plans. It does give you an indication of which operations in a query are consuming the most resources” - Query Plan docs

Subquery and derived table performance

Let’s look at the subquery and derived table (from (select ...) purchasingUsers) way to achieve the same objective.

select userSessionMetrics.userId,
    userSessionMetrics.sessionId,
    userSessionMetrics.numclicks,
    userSessionMetrics.numlogins,
    userSessionMetrics.numPurchases
from (
        select userId,
            sum(
                case
                    when actionType = 'buy' then 1
                    else 0
                end
            ) as numPurchases
        from clickstream
        group by userId
        having numPurchases >= 1
    ) purchasingUsers
    join (
        select userId,
            sessionId,
            sum(
                case
                    when actionType = 'click' then 1
                    else 0
                end
            ) as numclicks,
            sum(
                case
                    when actionType = 'login' then 1
                    else 0
                end
            ) as numlogins,
            sum(
                case
                    when actionType = 'buy' then 1
                    else 0
                end
            ) as numPurchases
        from clickstream
        group by userId,
            sessionId
    ) userSessionMetrics on purchasingUsers.userId = userSessionMetrics.userId
where purchasingUsers.userId in (
        select userId
        from geolocation
        group by userId
        having count(distinct zipCode) > 1
    );

We can see the query plan by prefixing the above query with explain in your SQL terminal.

Subquery and derived table query plan

From the query plan (shown above), we can see the steps the query planner decided to take.

You can see that the query plan is very similar to the CTE approach. The only difference is that the positions of purchasingUsers and userSessionMetrics in the query plan have been swapped.

Temp table performance

Let’s look at the “temp table” way to achieve the same objective.

create temporary table purchasingUsers DISTKEY(userId) as (
    select userId,
        sum(
            case
                when actionType = 'buy' then 1
                else 0
            end
        ) as numPurchases
    from clickstream
    group by userId
    having numPurchases >= 1
);
create temporary table movingUsers DISTKEY(userId) as (
    select userId
    from geolocation
    group by userId
    having count(distinct zipCode) > 1
);
create temporary table userSessionMetrics DISTKEY(userId) as (
    select userId,
        sessionId,
        sum(
            case
                when actionType = 'click' then 1
                else 0
            end
        ) as numclicks,
        sum(
            case
                when actionType = 'login' then 1
                else 0
            end
        ) as numlogins,
        sum(
            case
                when actionType = 'buy' then 1
                else 0
            end
        ) as numPurchases
    from clickstream
    group by userId,
        sessionId
);
select usm.*
from userSessionMetrics as usm
    join movingUsers as mu on usm.userId = mu.userId
    join purchasingUsers as pu on usm.userId = pu.userId;

We can see the query plan for the select statement by prefixing it with explain in your SQL terminal.

Temp table select query plan

You will see that the number of steps is lower compared to the CTE and “subquery + derived table” based approaches. This is because we spent time aggregating and filtering the data into the required datasets beforehand and distributed them on the join key to enable better distributed join performance. This does not come for free: there is a cost associated with creating and distributing the temp tables. The query plans for the table creations are shown below.

Temp table 1 query plan

Temp table 2 query plan

Temp table 3 query plan

You can quit the pgcli terminal using the \q command.

Trade-offs

In most cases you get the same performance from the CTE and “subquery + derived table” based approaches. The usual deciding factor is readability, which is a subjective measure.

The temp table approach is preferred when you need to optimize for read performance. This comes at the cost of creating and optionally distributing (indexing, in a traditional RDBMS) the temp tables, and you also need permissions to create temp tables. If you are going to reuse the temp tables in multiple select queries and are OK with spending time to pre-create them, it can be worth it.
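
For example, once the purchasingUsers temp table from the previous section exists in your session, it can back multiple queries without recomputing the aggregation. A small sketch (run in the same pgcli session that created the temp tables):

-- reuse the purchasingUsers temp table created earlier in this session
select count(*) as numPurchasingUsers
from purchasingUsers;

select userId, numPurchases
from purchasingUsers
order by numPurchases desc;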

Generally, it is good practice to check the query plans of competing approaches and choose the one that is most appropriate for your specific use case. CTE performance is very database dependent. For example, in older PostgreSQL versions (before 12), CTEs were not inlined by the query planner and were always fully materialized before being used. Be sure to check your specific database’s documentation and query planner.
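
For instance, PostgreSQL 12 and later let you control this behavior explicitly with the MATERIALIZED / NOT MATERIALIZED keywords. A sketch of that syntax is shown below for illustration only; it is PostgreSQL-specific, so check whether your database supports it:

-- PostgreSQL 12+ only: ask the planner to inline the CTE instead of materializing it
with movingUsers as not materialized (
    select userId
    from geolocation
    group by userId
    having count(distinct zipCode) > 1
)
select *
from movingUsers;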

Tear down

When you are done, do not forget to tear down your AWS Redshift instance.

In your terminal run the following command

aws redshift delete-cluster --cluster-identifier sdesamplecluster --skip-final-cluster-snapshot

AWS Redshift charges based on time, so make sure to tear down your instance.

Conclusion

To recap we saw

  1. What Common Table Expressions (CTEs) are.
  2. How to use them.
  3. Performance trade-offs compared to subquery, derived table and temp table based approaches.

Hope this article helps you understand CTEs in detail. The next time you come across a complex query with confusing subqueries, or have to write your own query, try CTEs and see if you can improve readability.

References

  1. AWS Redshift Docs
  2. TOC generator
  3. AWS Redshift cost