Advanced SQL is knowing how to model data & retrieve it effectively

1. Introduction

Most data engineering job descriptions these days expect “knowledge of advanced SQL,” but ask any data engineer what that means, and you will get a different answer every time.

Are you

Frustrated that “advanced SQL” ebooks or Udemy courses aren’t all that advanced?

Wondering where you can learn more about writing advanced SQL queries?

Frustrated by job descriptions requiring “Advanced SQL”, but no one knows what that means?

It can be especially demotivating when seeking a new data engineering job, as the goalposts for landing a data job are constantly shifting.

Imagine being able to build systems that enable your stakeholders to use data to make decisions efficiently. In this post, you will learn the key advanced SQL techniques, how to optimize data processing, and how to structure data for ease of use.

Advanced SQL involves knowing how to model data (data modeling) and retrieve it effectively (SQL techniques and query optimization).

2. SQL techniques

The techniques most commonly considered advanced SQL are

  1. Window functions
  2. CTEs (common table expressions)
  3. JOIN types
  4. MERGE INTO
  5. EXPLODE / CROSS JOIN
  6. Aggregates beyond the standard ones

In addition to knowing that these features exist, it’s also crucial to understand when to use them.
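As a minimal sketch of the first two techniques, here is a CTE feeding a window function, run through Python's built-in sqlite3 module (the `sales` table and its values are hypothetical, just for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (store TEXT, sale_date TEXT, revenue REAL);
INSERT INTO sales VALUES
  ('north', '2024-01-01', 100),
  ('north', '2024-01-02', 150),
  ('south', '2024-01-01', 200),
  ('south', '2024-01-02', 50);
""")

# The CTE rolls revenue up per store and day; the window function then
# ranks stores within each day WITHOUT collapsing rows the way a second
# GROUP BY would -- that is the typical reason to reach for windows.
query = """
WITH daily AS (
  SELECT store, sale_date, SUM(revenue) AS revenue
  FROM sales
  GROUP BY store, sale_date
)
SELECT store, sale_date, revenue,
       RANK() OVER (PARTITION BY sale_date ORDER BY revenue DESC) AS day_rank
FROM daily
ORDER BY sale_date, day_rank;
"""
for row in conn.execute(query):
    print(row)
```

The same query runs largely unchanged on warehouse engines such as BigQuery or Snowflake; sqlite3 is used here only because it ships with Python.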

3. Query optimization

The key to query optimization is reducing the amount of data to be processed and/or increasing the utilization of cluster resources.

  1. Narrow & wide transformations
  2. Partitioning & Clustering
  3. Query planner
  4. Data skew avoidance

You need to know how to store data for optimized reads and how to configure your job to use all the available cluster resources.
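Partitioning and clustering are warehouse-specific features, but the underlying idea, letting the planner prune data instead of scanning everything, can be sketched with an index and EXPLAIN QUERY PLAN in sqlite3 (table name and data are made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, user_id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(f"2024-01-{d:02d}", u, "x") for d in range(1, 31) for u in range(100)],
)

q = "SELECT COUNT(*) FROM events WHERE event_date = '2024-01-15'"

# Before indexing: the planner has no choice but a full table scan.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + q).fetchall()
print(plan_before[0][3])

# An index on the filter column lets the planner skip irrelevant rows --
# the same pruning idea behind partitioning/clustering in a warehouse.
conn.execute("CREATE INDEX idx_events_date ON events (event_date)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + q).fetchall()
print(plan_after[0][3])
```

Inspecting the plan before and after a storage change is the habit that matters; the exact tooling (EXPLAIN, query profiles, Spark UI) varies by engine.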

4. Data modeling & data flow

There are five key types of tables; in almost all cases, modeling your data with them consistently will solve 99.99% of your data problems.

  1. Dimension tables store data for a business entity (e.g., customer, product, partner, etc). These tables answer the ‘who’ and ‘what’ types of questions. For example, which stores had the highest revenue yesterday? In this question, stores are the dimension.
  2. Fact tables contain information about how dimensions interact with each other in real life. Example: an order fact is an interaction between a customer and a seller involving one or more products.
  3. Bridge tables are used to represent many-to-many relationships between dimension tables. Example: Account-Customer bridge table where one account can have multiple customers and one customer can have multiple accounts.
  4. One Big Table (OBT) tables are fact tables to which all of their dimension tables are left joined. This eliminates joins and simplifies analysis for business users, albeit at the expense of storage efficiency and increased maintenance complexity. Example: a sales OBT with transaction details plus customer name, product description, and store location all in one table.
  5. Summary/Aggregate tables are pre-calculated tables storing metrics rolled up to a coarser grain than the base fact tables. They improve query performance for common reporting patterns by avoiding expensive aggregation at runtime. Example: a monthly sales summary by region instead of querying daily transaction details each time.
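The dimension, fact, and summary patterns above can be sketched in a few statements; the table and column names here (`dim_store`, `fct_orders`, `agg_daily_region_sales`) are illustrative conventions, not a prescribed standard, and sqlite3 again stands in for a real warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension: one row per store (the 'who/what').
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, store_name TEXT, region TEXT);

-- Fact: one row per order, referencing the dimension by key.
CREATE TABLE fct_orders (
  order_id   INTEGER PRIMARY KEY,
  store_id   INTEGER REFERENCES dim_store (store_id),
  order_date TEXT,
  amount     REAL
);

INSERT INTO dim_store VALUES (1, 'Downtown', 'east'), (2, 'Airport', 'west');
INSERT INTO fct_orders VALUES
  (10, 1, '2024-01-01', 25.0),
  (11, 1, '2024-01-01', 75.0),
  (12, 2, '2024-01-01', 40.0);

-- Summary/aggregate: pre-rolled-up metrics at a coarser grain
-- (region + day) so reports skip the join-and-aggregate at runtime.
CREATE TABLE agg_daily_region_sales AS
SELECT d.region, f.order_date, SUM(f.amount) AS revenue
FROM fct_orders f
JOIN dim_store d ON d.store_id = f.store_id
GROUP BY d.region, f.order_date;
""")

for row in conn.execute("SELECT * FROM agg_daily_region_sales ORDER BY region"):
    print(row)
```

An OBT would be built the same way as the summary table, just left-joining every dimension onto the fact without the GROUP BY.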

Most data teams follow the 3-hop architecture (but every team does it subtly differently); read this for details on how to use the 3-hop architecture.

5. Conclusion

To recap, we saw:

  1. Advanced SQL techniques
  2. Query optimization techniques
  3. Data modeling and data flow patterns

Learn these for interviews and real-world use cases; they will help you leverage decades of existing research and solve most of the common issues plaguing data teams.

Please let me know in the comment section below if you have any questions or comments.

6. Further reading

  1. 25 SQL techniques
  2. Data flow best practices
  3. Interview preparation series: SQL
  4. Messy SQL -> Modular code
  5. Using nested data in SQL

If you found this article helpful, share it with a friend or colleague using one of the socials below!

Land your dream Data Engineering job!

Overwhelmed by all the concepts you need to learn to become a data engineer? Have difficulty finding good data projects for your portfolio? Are online tutorials littered with sponsored tools and not foundational concepts?

Learning data engineering can be a long and rough road, but it doesn't have to be!

Pick up any new tool/framework with a clear understanding of data engineering fundamentals. Demonstrate your expertise by building well-documented real-world projects on GitHub.

Sign up for my free DE-101 course that will take you from basics to building data projects in 4 weeks!

    We won't send you spam. Unsubscribe at any time.


    Join now and get started on your data engineering journey!

      Testimonials:

      I really appreciate you putting these detailed posts together for your readers, you explain things in such a detailed, simple manner that's well organized and easy to follow. I appreciate it so so much!
      I have learned a lot from the course which is much more practical.
      This course helped me build a project and actually land a data engineering job! Thank you.

      When you subscribe, you'll also get emails about data engineering concepts, development practices, career advice, and projects every 2 weeks (or so) to help you level up your data engineering skills. We respect your email privacy.