Joins denormalize data: they build dimensions and attach dimension attributes to fact tables for reporting. Group bys aggregate that data into the metrics that determine how your data is used to make decisions.
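For a concrete (if tiny) picture of both ideas, here is a minimal sketch using Python's built-in sqlite3 module; the table and column names are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_customer (customer_id INT, state TEXT);
    CREATE TABLE fct_orders (order_id INT, customer_id INT, amount REAL);
    INSERT INTO dim_customer VALUES (1, 'CA'), (2, 'NY');
    INSERT INTO fct_orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);
""")

# Join: attach the customer's state to each order (denormalization).
# Group by: roll the joined rows up into a per-state revenue metric.
rows = con.execute("""
    SELECT c.state, SUM(o.amount) AS total_revenue
    FROM fct_orders o
    JOIN dim_customer c ON o.customer_id = c.customer_id
    GROUP BY c.state
    ORDER BY c.state
""").fetchall()
print(rows)  # [('CA', 150.0), ('NY', 75.0)]
```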
If you have wondered
> Why your joins often end up in a dense block that's hard to pick apart
> Why people run nearly identical queries that differ only in their join combinations
> How to avoid losing data when performing joins
> How group bys have been used to mask underlying data modeling issues
> Why a group by all is considered a red flag
Then this post is for you. In this post, we will cover how to use joins and group bys, as well as what to look out for when using them.
By the end of this post, you will know what to watch out for when you perform joins or group bys.
Common Table Expressions (CTEs) are one of the best techniques a data engineer can use to improve query readability. Most DEs wish they had started using them sooner in their careers.
However, overuse of CTEs can lead to
> Complex queries that try to fit everything into a single, massive nested statement
> Hard-to-debug code, as you can't quickly inspect the results of one CTE at a time
> Code that can't be reused, as CTEs defined in one query cannot be referenced from another query
Imagine having the **benefits of CTEs (readability) while being able to make your code reusable & easy to debug**. That is what temp tables enable you to do.
In this post, we will explore how temporary tables can enhance CTEs by making your code more reusable and easier to debug.
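To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table names are made up, and your warehouse's temp table syntax may differ slightly:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INT, customer_id INT, amount REAL);
    INSERT INTO orders VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 75.0);
""")

# CTE version: the intermediate result lives only inside this one query,
# so you cannot inspect it on its own or reuse it in another query.
print(con.execute("""
    WITH customer_totals AS (
        SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id
    )
    SELECT * FROM customer_totals WHERE total > 80
""").fetchall())

# Temp table version: same logic, but the intermediate result can now be
# inspected directly (debugging) and reused by any later query on this connection.
con.execute("""
    CREATE TEMP TABLE customer_totals AS
    SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id
""")
print(con.execute("SELECT * FROM customer_totals").fetchall())                    # debug it
print(con.execute("SELECT * FROM customer_totals WHERE total > 80").fetchall())   # reuse it
```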
Most data engineering job descriptions these days expect "knowledge of advanced SQL," but ask any data engineer what that actually means, and you will get a different answer every time.
Are you
> Frustrated that "advanced SQL" ebooks or Udemy courses aren't really all that advanced?
> Wondering where you can learn more about writing advanced SQL queries?
> Frustrated by job descriptions requiring "Advanced SQL" when no one really knows what that means?
It can be especially demotivating when seeking a new data engineering job, as the goalposts for landing a data job are constantly shifting.
Imagine being able to build systems that enable your stakeholders to use data efficiently to make decisions. In this post, you will learn the key advanced SQL techniques, how to optimize data processing, and how to structure data in a way that makes it easy to use.
Advanced SQL is knowing how to model the data (data modeling) and how to get it into that shape efficiently (SQL techniques & query optimization).
Every data engineering interview includes a SQL round. If you are:
> Worried about job descriptions asking for advanced SQL when you are not sure what advanced SQL means for an interview
> Anxious about not being able to land a job
> Frustrated with online SQL courses that teach the basics of a dialect, but not a step-by-step approach to problem solving
If so, this post is for you. Imagine being able to dissect any SQL problem and make the interviewer say, "I need this person on my team." That is what this post helps you do.
This post will go over a step-by-step approach to solving any SQL interview question. You will have an algorithm you can use to show the interviewer that they'd be missing out if they didn't hire you.
Extracting data is one of the critical skills for data engineering. If you have wondered
> How to get started extracting data from an API for the first time
> What are some good resources to learn API data extraction?
> Whether there are any recommendations, guides, or videos for dealing with APIs in Python
> Which Python library to use to extract data from an API
> Whether you are missing any libraries ("I don't know what I don't know")
Then this post is for you. Imagine being able to mentally visualize how systems communicate via APIs. By the end of this post, you will have learned how to pull data via an API and will be able to quickly and efficiently create data pipelines that pull data from most APIs.
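As a taste of what that looks like, here is a minimal sketch using the requests library; the endpoint, auth header, and page-based pagination are placeholders you would swap for your API's specifics:

```python
import requests

BASE_URL = "https://api.example.com/v1/orders"        # placeholder endpoint
HEADERS = {"Authorization": "Bearer <your-token>"}     # placeholder auth

def fetch_all(url: str) -> list[dict]:
    """Pull every page of a (hypothetical) paginated JSON API into a list."""
    records, page = [], 1
    while True:
        resp = requests.get(url, headers=HEADERS, params={"page": page}, timeout=30)
        resp.raise_for_status()   # fail loudly on HTTP errors
        batch = resp.json()
        if not batch:             # empty page => no more data
            break
        records.extend(batch)
        page += 1
    return records

if __name__ == "__main__":
    print(len(fetch_all(BASE_URL)))
```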
Slowly changing dimension type 2 (SCD2) is a critical data modeling technique used in most warehouses. If you are
> Wondering if there is a simple way to create an SCD2 table
> Struggling to handle edge cases with SCD2 creation
> Having to write a lot of code to ensure that SCD2 pipeline failures don't corrupt your data
Then this post is for you.
By the end of this post, you will have learned how to use MERGE INTO effectively to build an SCD2 pipeline. Imagine being able to let end users see how the data looked at any point in time. What if you could replace multiple SQL queries with a single, small query that just works?
In this post, we will explain how MERGE INTO works and how to use it to build an SCD2 pipeline. You will also receive a code recipe that you can repurpose for your use case.
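As a rough preview of the shape of such a statement (not the post's full recipe), here is a MERGE INTO sketch wrapped in a Python string you would submit through your warehouse client. The table and column names are made up, dialects differ, and a complete SCD2 load also needs to insert the new version of rows whose attributes changed:

```python
# Illustrative only: close out changed rows and insert brand-new keys.
# Run via your warehouse's client/cursor; names below are assumptions.
SCD2_MERGE = """
MERGE INTO dim_customer AS tgt
USING stg_customer AS src
  ON tgt.customer_id = src.customer_id AND tgt.is_current = TRUE
WHEN MATCHED AND tgt.attr_hash <> src.attr_hash THEN
  UPDATE SET is_current = FALSE, valid_to = src.extracted_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, attr_hash, valid_from, valid_to, is_current)
  VALUES (src.customer_id, src.attr_hash, src.extracted_at, NULL, TRUE)
"""

print(SCD2_MERGE)  # e.g. cursor.execute(SCD2_MERGE) with your warehouse's DB-API cursor
```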
Over the past decade, every department has wanted to be data-driven, and data engineering teams are under more pressure than ever.
If you have been an engineer for more than a few years, you have likely seen your world change from a 'well-planned data model' to 'dump everything in S3 and get some data to the end user'.
Data engineers are under a lot of stress caused by:
> The business becoming more complex while every department wants to be data-driven, so expectations of the data team skyrocket
> Not having enough time to pay down tech debt or properly model the data
> Businesses acting like they do not have the time, money, or patience to do things the right way
> Too many requirements from too many stakeholders
If this sounds familiar, this post is for you. Imagine building systems that enable you to deliver any new data your stakeholders want in minutes. You will be known for delivering quickly and empowering the business to make more money.
This post will discuss an approach to quickly delivering new data to your end users. By the end of this post, you will have a technique you can apply to your pipelines to make your life easier and boost your career.
If you have worked at a company that moves fast (or claims to), you've inevitably had to deal with your pipelines breaking because the upstream team decided to change the data schema!
If you are
> Frequently in meetings, fixing pipeline issues due to schema changes
> Stressed, unable to deliver quality work, always in a hurry to put out the next fire
> Working with teams who have to prioritize speed over everything
This post is for you. Constantly dealing with broken pipelines due to upstream data changes is detrimental to your career and leads to burnout.
What if you could focus on building great data projects? Imagine pipelines that correct themselves automatically! This post will enable you to do just that.
We will discuss the strategies for handling upstream schema changes. These strategies will help you move from constant fire-fighting mode to a stable way of dealing with breaking upstream changes.
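To give a flavor of one such strategy (illustrative only, with made-up field names): validate incoming records against the schema your pipeline expects, so an upstream change is caught at the boundary instead of deep inside a transformation.

```python
# A minimal, defensive schema check at the pipeline boundary.
EXPECTED_SCHEMA = {"order_id": int, "customer_id": int, "amount": float}

def validate(record: dict) -> list[str]:
    """Return a list of human-readable schema problems for one record."""
    problems = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    extra = set(record) - set(EXPECTED_SCHEMA)
    if extra:
        problems.append(f"unexpected new fields: {sorted(extra)}")
    return problems

# Upstream changed a type and added a field; you find out here, not in a dashboard.
print(validate({"order_id": 1, "customer_id": "42", "amount": 9.99, "discount": 1.0}))
```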
Whether you are setting up Visual Studio Code for your colleagues or want to improve your own workflow, tons of extensions are available. If you have wondered
> What are the best Visual Studio Code extensions for data engineers?
> How do I share my Visual Studio Code environment with my colleagues?
> How do Visual Studio Code user settings, workspace settings, dev containers, and profiles work?
Then this post is for you!
Imagine being able to quickly set up Visual Studio Code on any laptop exactly how you want it. You won't notice that you are coding on a different machine!
In this post, we will go over Visual Studio Code's settings hierarchy, how to set up Visual Studio Code on any machine exactly to your liking with profiles, useful extensions for data engineering, and the caveats of unrestricted extensions.
By the end of this post, you will have set up Visual Studio Code exactly how you like it and be able to share it with other data engineers. Let's get started.
As a data engineer, you have probably spent hours trying to figure out the right place to make a change in your repository. I know I have.
> You think, "Why is it so difficult to make a simple change?".
> You push a simple change (with tests, by the way), and suddenly, production issues start popping up!
> Dealing with on-call issues when your repository is spaghetti code with multiple layers of abstracted logic is a special hell that makes data engineers age in dog years!
> Messy code leads to delayed feature delivery and slow debug cycles, which lowers work satisfaction and delays promotions!
**Bad code leads to a bad life**
If this resonates with you, know that you are not alone. Every day, thousands of data engineers deal with bad code and, with the best intentions, write messy code.
Most data engineers want to write good code, but the common SWE patterns don't translate easily to data processing patterns, and there aren't many practical examples that illustrate how to write clean data pipelines.
**Imagine a code base where every engineer knows where to look when something breaks, even if they have never worked on that part of the code base before.** Imagine knowing intuitively where a piece of logic would be and quickly figuring out the source of any issue.
That is what this article helps you do!
In this post, I explain how to combine functions and OOP patterns in Python to write pipelines that are easy to maintain/debug. By the end of this post, **you will have a clear picture of when and how to use functions and OOP effectively to make your (and your colleagues') life easy.**
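As a small illustrative sketch of that split (made-up names, not the article's exact code): keep transformations as pure functions that are trivial to test, and use a thin class only for the pipeline's wiring and orchestration.

```python
from dataclasses import dataclass

# --- pure functions: no IO, easy to unit test -------------------------------
def remove_cancelled(orders: list[dict]) -> list[dict]:
    return [o for o in orders if o["status"] != "cancelled"]

def add_net_amount(orders: list[dict]) -> list[dict]:
    return [{**o, "net_amount": o["amount"] - o["discount"]} for o in orders]

# --- a thin class: owns config and the order of operations ------------------
@dataclass
class OrdersPipeline:
    source_path: str
    target_table: str

    def extract(self) -> list[dict]:
        # In real life: read from source_path; hardcoded here to stay runnable.
        return [
            {"order_id": 1, "status": "paid", "amount": 100.0, "discount": 10.0},
            {"order_id": 2, "status": "cancelled", "amount": 50.0, "discount": 0.0},
        ]

    def load(self, orders: list[dict]) -> None:
        print(f"writing {len(orders)} rows to {self.target_table}")

    def run(self) -> None:
        self.load(add_net_amount(remove_cancelled(self.extract())))

OrdersPipeline(source_path="s3://bucket/orders/", target_table="fct_orders").run()
```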