What are Common Table Expressions (CTEs) and when to use them?
- Introduction
- Setup
- Common Table Expressions (CTEs)
- Performance: CTEs vs subqueries, derived tables, and temp tables
- Tear down
- Conclusion
- References
Introduction
Common Table Expressions are used to define temporary tables that exist only for the duration of the query in which they are defined. If you
- Have wondered what CTEs are
- Want to understand the potential performance problems with CTEs
Then this post is for you. We go over what CTEs are and compare their performance to subquery-, derived-table-, and temp-table-based approaches in the context of a data warehouse.
Setup
This post uses AWS Redshift to explore CTEs. If you do not want to set up your own Redshift instance, you can still follow along, since the concepts carry over to most databases.
Prerequisites
In your terminal, do the following
# create an AWS Redshift instance
aws redshift create-cluster --node-type dc2.large --number-of-nodes 2 --master-username sdeuser --master-user-password Password1234 --cluster-identifier sdeSampleCluster
# get your AWS Redshift endpoints address
aws redshift describe-clusters --cluster-identifier sdesamplecluster | grep '\"Address'
# use pgcli to connect to your AWS Redshift instance
pgcli -h <your-redshift-address> -U sdeuser -p 5439 -d dev
# note dev is the default database created. When prompted for the password enter "Password1234".
Let’s assume you work for a data collection company that tracks user clickstream and geolocation data. In your SQL terminal (pgcli) create fake clickstream and geolocation tables as shown below.
-- create fake clickstream data
drop table if exists clickstream;
create table clickstream (
eventId varchar(40),
userId int,
sessionId int,
actionType varchar(8),
datetimeCreated timestamp
);
insert into clickstream(eventId, userId, sessionId, actionType, datetimeCreated )
values
('6e598ae5-3fb1-476d-9787-175c34dcfeff',1 ,1000,'click','2020-11-25 12:40:00'),
('0c66cf8c-0c00-495b-9386-28bc103364da',1 ,1000,'login','2020-11-25 12:00:00'),
('58c021ad-fcc8-4284-a079-8df0d51601a5',1 ,1000,'click','2020-11-25 12:10:00'),
('85eef2be-1701-4f7c-a4f0-7fa7808eaad1',1 ,1001,'buy', '2020-11-22 18:00:00'),
('08dd0940-177c-450a-8b3b-58d645b8993c',3 ,1010,'buy', '2020-11-20 01:00:00'),
('db839363-960d-4319-860d-2c9b34558994',10,1120,'click','2020-11-01 13:10:03'),
('2c85e01d-1ed4-4ec6-a372-8ad85170a3c1',10,1121,'login','2020-11-03 18:00:00'),
('51eec51c-7d97-47fa-8cb3-057af05d69ac',8 ,6, 'click','2020-11-10 10:45:53'),
('5bbcbc71-da7a-4d75-98a9-2e9bfdb6f925',3 ,3002,'login','2020-11-14 10:00:00'),
('f3ee0c19-a8f9-4153-b34e-b631ba383fad',1 ,90, 'buy', '2020-11-17 07:00:00'),
('f458653c-0dca-4a59-b423-dc2af92548b0',2 ,2000,'buy', '2020-11-20 01:00:00'),
('fd03f14d-d580-4fad-a6f1-447b8f19b689',2 ,2000,'click','2020-11-20 00:00:00');
-- create fake geolocation data
drop table if exists geolocation;
create table geolocation (
userId int,
zipcode varchar(10),
datetimeCreated timestamp
);
insert into geolocation(userId, zipCode, datetimeCreated )
values
(1 ,66206,'2020-11-25 12:40:00'),
(1 ,66209,'2020-11-25 12:00:00'),
(1 ,91355,'2020-11-25 12:10:00'),
(1 ,83646,'2020-11-22 18:00:00'),
(3 ,91354,'2020-11-20 01:00:00'),
(10,91355,'2020-11-01 13:10:03'),
(10,91355,'2020-11-03 18:00:00'),
(8 ,91355,'2020-11-10 10:45:53'),
(3 ,91355,'2020-11-14 10:00:00'),
(1 ,83646,'2020-11-17 07:00:00'),
(2 ,83646,'2020-11-20 01:00:00'),
(2 ,91355,'2020-11-20 00:00:00');
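As a quick sanity check that the inserts landed, you can count the rows in each table:

```sql
-- each table was loaded with 12 rows
select count(*) from clickstream;
select count(*) from geolocation;
```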
Common Table Expressions (CTEs)
Let’s assume that we want to get each userId and the number of purchases they made, for users who have been in multiple locations (identified by zip code) and have purchased at least one product. You can write a query that uses a subquery (`where userId in (...)`) as shown below.
select userId,
sum(
case
when actionType = 'buy' then 1
else 0
end
) as numPurchases
from clickstream
where userId in (
select userId
from geolocation
group by userId
having count(distinct zipCode) > 1
)
group by userId
having numPurchases >= 1;
Alternatively, we can use CTEs to define temp tables that only exist for the duration of the query as shown below.
with buyingUsers as (
select userId,
sum(
case
when actionType = 'buy' then 1
else 0
end
) as numPurchases
from clickstream
group by userId
having numPurchases >= 1
),
movingUsers as (
select userId
from geolocation
group by userId
having count(distinct zipCode) > 1
)
select bu.userId,
bu.numPurchases
from buyingUsers bu
join movingUsers mu on bu.userId = mu.userId;
Notice how we replaced the subquery with a CTE. This example is simple but in cases with multiple derived tables and sophisticated join logic, using CTEs may make your query easier to read.
Performance: CTEs vs subqueries, derived tables, and temp tables
Now that we know what CTEs are, let’s compare their performance against the competing approaches. Before we look at the query plan, we should update the table statistics to enable a more accurate plan. We update table statistics using `analyze`.
analyze clickstream;
analyze geolocation;
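To verify that the statistics are fresh, you can check the `stats_off` metric (0 means up to date) in Redshift’s `svv_table_info` system view; a quick sketch:

```sql
-- stats_off = 0 means the table statistics are current
select "table", stats_off
from svv_table_info
where "table" in ('clickstream', 'geolocation');
```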
For the performance comparison, let’s consider a more involved requirement.
Objective: For users who have been in multiple locations (identified by zip code) and have purchased at least one product, get user-session level metrics. The metrics required are counts of their clicks, logins and purchases.
The result should be:

| userId | sessionId | numclicks | numlogins | numpurchases |
| --- | --- | --- | --- | --- |
| 2 | 2000 | 1 | 0 | 1 |
| 1 | 90 | 0 | 0 | 1 |
| 1 | 1001 | 0 | 0 | 1 |
| 1 | 1000 | 2 | 1 | 0 |
| 3 | 3002 | 0 | 1 | 0 |
| 3 | 1010 | 0 | 0 | 1 |
CTE performance
Let’s look at the CTE way to achieve the objective.
with purchasingUsers as (
select userId,
sum(
case
when actionType = 'buy' then 1
else 0
end
) as numPurchases
from clickstream
group by userId
having numPurchases >= 1
),
movingUsers as (
select userId
from geolocation
group by userId
having count(distinct zipCode) > 1
),
userSessionMetrics as (
select userId,
sessionId,
sum(
case
when actionType = 'click' then 1
else 0
end
) as numclicks,
sum(
case
when actionType = 'login' then 1
else 0
end
) as numlogins,
sum(
case
when actionType = 'buy' then 1
else 0
end
) as numPurchases
from clickstream
group by userId,
sessionId
)
select usm.*
from userSessionMetrics as usm
join movingUsers as mu on usm.userId = mu.userId
join purchasingUsers as pu on usm.userId = pu.userId;
We can see the query plan by running `explain` followed by the above query in your SQL terminal.
From the query plan, we can see that the query planner decided to:
- Calculate `movingUsers` and `purchasingUsers` in parallel and join them.
- Simultaneously aggregate the `clickstream` data to generate `userSessionMetrics`.
- Finally, join the datasets from the above two steps.
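As a minimal sketch of how to get a plan, prefix any query with `explain` (the exact plan output will vary by cluster and data):

```sql
-- explain prints the query plan without executing the query
explain
select userId,
       count(*) as numEvents
from clickstream
group by userId;
```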
Note that cost is relative. “Cost is a measure that compares the relative execution times of the steps within a plan. Cost does not provide any precise information about actual execution times or memory consumption, nor does it provide a meaningful comparison between execution plans. It does give you an indication of which operations in a query are consuming the most resources” - Query Plan docs
Subquery and derived table performance
Let’s look at the “subquery and derived table” (`from (select ...) purchasingUsers`) way to achieve the same objective.
select userSessionMetrics.userId,
userSessionMetrics.sessionId,
userSessionMetrics.numclicks,
userSessionMetrics.numlogins,
userSessionMetrics.numPurchases
from (
select userId,
sum(
case
when actionType = 'buy' then 1
else 0
end
) as numPurchases
from clickstream
group by userId
having numPurchases >= 1
) purchasingUsers
join (
select userId,
sessionId,
sum(
case
when actionType = 'click' then 1
else 0
end
) as numclicks,
sum(
case
when actionType = 'login' then 1
else 0
end
) as numlogins,
sum(
case
when actionType = 'buy' then 1
else 0
end
) as numPurchases
from clickstream
group by userId,
sessionId
) userSessionMetrics on purchasingUsers.userId = userSessionMetrics.userId
where purchasingUsers.userId in (
select userId
from geolocation
group by userId
having count(distinct zipCode) > 1
);
We can see the query plan by running `explain` followed by the above query in your SQL terminal.
From the query plan, we can see that the query planner decided to:
- Calculate `movingUsers` (the `IN` subquery) and `userSessionMetrics` in parallel and join them.
- Simultaneously filter the `clickstream` data to generate `purchasingUsers`.
- Finally, join the datasets from the above two steps.
You can see that the query plan is very similar to the CTE approach. The only difference is that the positions of `purchasingUsers` and `userSessionMetrics` in the query plan have been swapped.
Temp table performance
Let’s look at the “temp table” way to achieve the same objective.
create temporary table purchasingUsers DISTKEY(userId) as (
select userId,
sum(
case
when actionType = 'buy' then 1
else 0
end
) as numPurchases
from clickstream
group by userId
having numPurchases >= 1
);
create temporary table movingUsers DISTKEY(userId) as (
select userId
from geolocation gl
group by userId
having count(distinct zipCode) > 1
);
create temporary table userSessionMetrics DISTKEY(userId) as (
select userId,
sessionId,
sum(
case
when actionType = 'click' then 1
else 0
end
) as numclicks,
sum(
case
when actionType = 'login' then 1
else 0
end
) as numlogins,
sum(
case
when actionType = 'buy' then 1
else 0
end
) as numPurchases
from clickstream
group by userId,
sessionId
);
select usm.*
from userSessionMetrics as usm
join movingUsers as mu on usm.userId = mu.userId
join purchasingUsers as pu on usm.userId = pu.userId;
We can see the query plan for the select statement by running `explain` followed by the select query above in your SQL terminal.
You will see that the number of steps is lower compared to the CTE and “subquery + derived table” based approaches. This is because we spent time aggregating and filtering the data into the required datasets beforehand, and distributed them based on the join keys to enable better distributed join performance. This does not come for free: there is a cost associated with creating and distributing the temp tables, which you can see by running `explain` on the create statements as well.
You can quit the `pgcli` terminal using the `\q` command.
Trade-offs
In most cases you get the same performance from CTE or “subquery + derived table” based approaches. The usual deciding factor is readability, which is subjective.
The temp table approach is preferred when you need to optimize for read performance. This comes at the cost of creating and optionally distributing (indexing, in an RDBMS) the temp tables, and you also need permissions to create temp tables. If you are going to reuse the temp tables across multiple select queries and are OK with spending time to pre-create them, the trade-off may be worth it.
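As an illustration of the “indexing in an RDBMS” point, in Postgres the analogous step would be indexing the temp table on the join key (a sketch; note that Redshift itself does not support indexes):

```sql
-- Postgres sketch: pre-compute a temp table, then index it on the join key
create temporary table purchasingUsers as
select userId,
       count(*) as numPurchases
from clickstream
where actionType = 'buy'
group by userId;

create index idx_purchasingusers_userid on purchasingUsers (userId);
```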
Generally, it is good practice to check the query plans of competing approaches and choose the one most appropriate for your specific use case. CTE performance is very database dependent. For example, in older Postgres versions (before 12), CTEs were not optimized by the query planner and were always fully materialized before being used. Be sure to check your specific database’s documentation and query planner.
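For instance, Postgres 12 and later let you override the planner’s choice per CTE with `MATERIALIZED` / `NOT MATERIALIZED` (a sketch):

```sql
-- NOT MATERIALIZED asks the planner to inline the CTE into the outer query
with movingUsers as not materialized (
    select userId
    from geolocation
    group by userId
    having count(distinct zipCode) > 1
)
select userId
from movingUsers;
```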
Tear down
When you are done, do not forget to tear down your AWS Redshift instance.
In your terminal run the following command
aws redshift delete-cluster --cluster-identifier sdesamplecluster --skip-final-cluster-snapshot
AWS Redshift charges based on time so make sure to tear down your instance.
Conclusion
To recap, we saw:
- What Common Table Expressions (CTEs) are.
- How to use them.
- Performance trade-offs compared to subquery, derived table, and temp table based approaches.
I hope this article helps you understand CTEs in detail. The next time you come across a complex query with confusing subqueries, or have to write your own, try CTEs and see if they improve readability.