How to Extract Data from APIs for Data Pipelines using Python

Extracting data is one of the critical skills for data engineering. If you have wondered:

- How to get started extracting data from an API for the first time
- What are some good resources to learn API data extraction?
- Whether there are any recommendations, guides, videos, etc., for dealing with APIs in Python
- Which Python library to use to extract data from an API
- "I don't know what I don't know. Am I missing any libraries?"

then this post is for you. Imagine being able to mentally visualize how systems communicate via APIs. By the end of this post, you will have learned how to pull data via an API, and you will be able to quickly and efficiently create data pipelines that pull data from most APIs.
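
To give a rough flavor of what the post builds toward, here is a minimal sketch of pulling paginated JSON data with the `requests` library; the endpoint URL, pagination parameters, and response shape are made-up placeholders, not a specific API.

```python
import requests

def fetch_all_pages(base_url: str, page_size: int = 100) -> list[dict]:
    """Pull every page from a hypothetical paginated JSON API."""
    records: list[dict] = []
    page = 1
    while True:
        response = requests.get(
            base_url,
            params={"page": page, "per_page": page_size},
            timeout=30,
        )
        response.raise_for_status()  # fail fast on 4xx/5xx errors
        payload = response.json()
        if not payload:  # an empty page means there is no more data
            break
        records.extend(payload)
        page += 1
    return records

if __name__ == "__main__":
    # Placeholder endpoint; swap in the API you actually need to extract from.
    data = fetch_all_pages("https://api.example.com/v1/orders")
    print(f"Fetched {len(data)} records")
```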

How to create an SCD2 Table using MERGE INTO with Spark & Iceberg

Slowly changing dimension type 2 (SCD2) is a critical data modeling technique used in most warehouses. If you are

- Wondering if there is a simple way to create an SCD2 table
- Struggling to handle edge cases with SCD2 creation
- Having to write a lot of code to ensure that SCD2 pipeline failures don't corrupt your data

then this post is for you. By the end of this post, you will have learned how to effectively use MERGE INTO to build an SCD2 pipeline. Imagine enabling end users to see how the data looked at any point in time. What if you could replace multiple SQL queries with a single, small query that just works? In this post, we will explain how MERGE INTO works and how to use it to build an SCD2 pipeline. You will also receive a code recipe that you can repurpose for your use case.
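
As a taste of the recipe, here is a minimal PySpark sketch of the MERGE INTO idea against an assumed Iceberg target table; the table, column names, and change-detection logic are illustrative placeholders rather than the post's exact code.

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession configured with an Iceberg catalog and two relations:
# dim_customer (the SCD2 target) and updates (staged changes with an updated_at column).
spark = SparkSession.builder.getOrCreate()

spark.sql("""
MERGE INTO dim_customer t
USING (
    -- 1) every staged row, keyed so it can match the current dimension row
    SELECT u.customer_id AS merge_key, u.* FROM updates u
    UNION ALL
    -- 2) changed customers duplicated with a NULL key, so they fall through
    --    to the INSERT branch and become the new current row
    SELECT NULL AS merge_key, u.*
    FROM updates u
    JOIN dim_customer d
      ON d.customer_id = u.customer_id
     AND d.is_current = true
     AND d.customer_state <> u.customer_state
) s
ON t.customer_id = s.merge_key AND t.is_current = true
WHEN MATCHED AND t.customer_state <> s.customer_state THEN
    -- close out the old version
    UPDATE SET is_current = false, valid_to = s.updated_at
WHEN NOT MATCHED THEN
    -- open a new current version (brand-new customers and changed customers)
    INSERT (customer_id, customer_state, valid_from, valid_to, is_current)
    VALUES (s.customer_id, s.customer_state, s.updated_at, NULL, true)
""")
```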

How to quickly deliver data to business users? #1. Advanced Data Types & Schema Evolution

Over the past decade, every department has wanted to be data-driven, and data engineering teams are under more pressure than ever. If you have been an engineer for more than a few years, you have seen your world change from a 'well-planned data model' to 'dump everything in S3 and get some data to the end user'. Data engineers are under a lot of stress caused by:

- The business becoming too complex; every department wants to be data-driven, so expectations of data teams skyrocket
- Not having enough time to pay down tech debt or properly model the data
- Businesses acting as if they do not have the time, money, or patience to do things the right way
- Too many requirements from too many stakeholders

If so, this post is for you. Imagine building systems that enable you to deliver any new data stakeholders want in minutes. You will be known for delivering quickly and empowering the business to make more money. This post will discuss an approach to quickly delivering new data to your end users. By the end of this post, you will have a technique to apply to your pipelines to make your life easier and boost your career.
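
To hint at what flexible data types can buy you, here is a small assumed PySpark sketch where raw attributes land in a MAP column, so a brand-new field (`coupon` below) reaches end users without a table migration; the event shape and names are invented for illustration, not the post's exact approach.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical raw events: new attributes can show up at any time.
events = spark.createDataFrame(
    [
        ("o1", {"status": "shipped", "channel": "web"}),
        ("o2", {"status": "pending", "channel": "store", "coupon": "SAVE10"}),
    ],
    "order_id STRING, attributes MAP<STRING, STRING>",
)

# Because everything lands in a MAP column, the new "coupon" attribute needs
# no schema migration; expose it to end users as soon as they ask for it.
events.select(
    "order_id",
    F.col("attributes")["status"].alias("status"),
    F.col("attributes")["coupon"].alias("coupon"),
).show()
```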

How to Manage Upstream Schema Changes in a Data-Driven, Fast-Moving Company

If you have worked at a company that moves fast (or claims to), you've inevitably had to deal with your pipelines breaking because the upstream team decided to change the data schema! If you are

- Frequently in meetings, fixing pipeline issues due to schema changes
- Stressed, unable to deliver quality work, and always in a hurry to put out the next fire
- Working with teams that have to prioritize speed over everything else

this post is for you. Constantly dealing with broken pipelines due to upstream data changes is detrimental to your career and leads to burnout. What if you could focus on building great data projects instead? Imagine pipelines auto-correcting themselves! This post will enable you to do that. We will discuss strategies for handling upstream schema changes; they will help you move from constant fire-fighting to a stable way of dealing with breaking upstream changes.
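
As one possible flavor of such a strategy (not necessarily the post's exact approach), here is a minimal PySpark-style sketch of a column contract at the pipeline boundary: new upstream columns are ignored with a warning, while dropped columns fail loudly. The column names are hypothetical.

```python
from pyspark.sql import DataFrame

# Hypothetical column contract this pipeline was built against.
EXPECTED_COLUMNS = {"order_id", "customer_id", "order_amount"}

def conform_to_contract(upstream_df: DataFrame) -> DataFrame:
    """Keep the pipeline running when upstream adds or drops columns."""
    incoming = set(upstream_df.columns)
    added, missing = incoming - EXPECTED_COLUMNS, EXPECTED_COLUMNS - incoming
    if added:
        # New columns are ignored (and logged) instead of breaking downstream code.
        print(f"WARNING: ignoring new upstream columns: {sorted(added)}")
    if missing:
        # A dropped column is a breaking change; fail loudly with a clear message.
        raise ValueError(f"Upstream dropped expected columns: {sorted(missing)}")
    return upstream_df.select(*sorted(EXPECTED_COLUMNS))
```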

Visual Studio Code (VSCode) extensions for data engineers

Whether you are setting up Visual Studio Code for your colleagues or want to improve your own workflow, tons of extensions are available. If you have wondered:

- What are the best Visual Studio Code extensions for data engineers?
- How do I share my Visual Studio Code environment with my colleagues?
- How do Visual Studio Code user settings, workspace settings, devcontainers, and profiles work?

then this post is for you! Imagine being able to quickly set up Visual Studio Code on any laptop exactly how you want it. You won't even notice that you are coding on a different machine! In this post, we will go over Visual Studio Code's settings hierarchy, how to set up Visual Studio Code on any machine exactly to your liking with profiles, useful extensions for data engineering, and the caveats of unrestricted extensions. By the end of this post, you will have set up Visual Studio Code exactly how you like it and be able to share your setup with other data engineers. Let's get started.

Should Data Pipelines in Python be Function based or Object-Oriented (OOP)?

As a data engineer, you have probably spent hours trying to figure out the right place to make a change in your repository. I know I have.

- You think, "Why is it so difficult to make a simple change?"
- You push a simple change (with tests, by the way), and suddenly production issues start popping up!
- Dealing with on-call issues when your repository is spaghetti code with multiple layers of abstracted logic is a special hell that makes data engineers age in dog years!
- Messy code leads to delayed feature delivery and slow debug cycles, which lowers work satisfaction and delays promotions!

**Bad code leads to a bad life.** If this resonates with you, know that you are not alone. Every day, thousands of data engineers deal with bad code and, with the best intentions, write messy code. Most data engineers want to write good code, but common SWE patterns don't translate easily to data processing patterns, and there aren't many practical examples that illustrate how to write clean data pipelines. **Imagine a code base where every engineer knows where to look when something breaks, even if they have never worked on that part of the code base before.** Imagine knowing intuitively where a piece of logic would be and quickly figuring out the source of any issue. That is what this article helps you do! In this post, I explain how to combine functions and OOP patterns in Python to write pipelines that are easy to maintain and debug. By the end of this post, **you will have a clear picture of when and how to use functions and OOP effectively to make your (and your colleagues') lives easy.**
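
As a teaser of the idea, here is a tiny assumed sketch: plain functions carry the transformation logic, while a small class holds the stateful bits such as configuration. The file, column, and class names are placeholders, not the post's exact code.

```python
from dataclasses import dataclass

import pandas as pd

@dataclass
class PipelineConfig:
    """OOP is used for the things that carry state: config, connections, clients."""
    source_path: str
    min_order_amount: float

def extract(config: PipelineConfig) -> pd.DataFrame:
    # One input, one output: easy to test by pointing it at a tiny CSV fixture.
    return pd.read_csv(config.source_path)

def filter_small_orders(orders: pd.DataFrame, min_amount: float) -> pd.DataFrame:
    # Transformations stay as plain functions: no hidden state, trivial to unit test.
    return orders[orders["order_amount"] >= min_amount]

def run(config: PipelineConfig) -> pd.DataFrame:
    # The pipeline is just functions composed in one obvious place.
    orders = extract(config)
    return filter_small_orders(orders, config.min_order_amount)

if __name__ == "__main__":
    print(run(PipelineConfig(source_path="orders.csv", min_order_amount=10.0)))
```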

How to turn a messy, 1000-line SQL script into a modular & easy-to-maintain data pipeline?

If you've been in the data space long enough, you have come across really long SQL scripts that someone wrote years ago. No one dares to touch them, as they may be powering an important part of the data pipeline, and everyone is scared of accidentally breaking them. If you feel that

- Rough SQL is a good place to start, but it cannot scale past a certain point
- A dogmatic KISS approach leads to unmaintainable systems
- The simplest solution that takes the shortest time is not always the most optimal
- Building the 80% solution, only to rebuild the entire thing later when the 100% solution is needed, is no better than building the 100% solution once

then this post is for you! Imagine pipelines that are a joy to work with, where any update is quick and straightforward. In this post, we will see how to convert roughly 1,000 lines of messy SQL into modular code that is easy to test and modify. By the end of this post, you will have a systematic approach to converting your messy SQL queries into modular, well-scoped, easily testable code.
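
To sketch the general direction (with hypothetical table and column names, not the post's exact refactor), here is how a slice of a long script can be split into small, individually testable functions that are composed back into one query.

```python
# Each well-scoped piece of the old 1000-line script becomes a named, testable unit.

def dedupe_orders_sql() -> str:
    # Was: a long tangle of nested subqueries near the top of the script.
    return """
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
    FROM raw_orders
    """

def daily_revenue_sql() -> str:
    # Depends only on the deduped orders, not on the raw table.
    return """
    SELECT order_date, SUM(order_amount) AS revenue
    FROM deduped_orders
    WHERE rn = 1
    GROUP BY order_date
    """

def build_pipeline_sql() -> str:
    # Compose the small pieces into one query; each piece can be tested alone
    # by pointing it at a tiny fixture table.
    return f"""
    WITH deduped_orders AS ({dedupe_orders_sql()})
    {daily_revenue_sql()}
    """

if __name__ == "__main__":
    print(build_pipeline_sql())
```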

How to ensure consistent metrics in your warehouse

If you’ve worked on a data team, you’ve likely encountered situations where multiple teams define metrics in slightly different ways, leaving you to untangle why discrepancies exist. These metric deviations often stem from rapid data utilization without prioritizing long-term maintainability.

Imagine this common scenario: a company hires its first data professional, who writes an ad-hoc SQL query to compute a metric. Over time, multiple teams build their own datasets using this query, each tweaking the metric definition slightly. As the number of downstream consumers grows, so does the volume of ad-hoc requests to the data team to investigate inconsistencies. Before long, the team spends most of its time firefighting data bugs and reconciling metric definitions instead of delivering new insights. This cycle erodes trust, stifles career growth, and lowers team morale.

This post explores two options to reduce ad-hoc data issues and empower consumers to derive insights independently.
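
As a loose illustration of centralizing a metric definition (the module, table, and metric below are invented and are not necessarily either of the post's two options), here is a tiny sketch of a single shared definition that every consumer reuses instead of re-deriving the logic.

```python
# A hypothetical shared module (e.g. metrics.py) that every team imports,
# so "weekly active users" is defined in exactly one place.

ACTIVE_EVENT_TYPES = ("login", "purchase")

def weekly_active_users_sql(events_table: str = "analytics.events") -> str:
    """Return the one canonical query for WAU; consumers never hand-roll it."""
    event_list = ", ".join(f"'{e}'" for e in ACTIVE_EVENT_TYPES)
    return f"""
    SELECT DATE_TRUNC('week', event_time) AS week,
           COUNT(DISTINCT user_id)        AS weekly_active_users
    FROM {events_table}
    WHERE event_type IN ({event_list})
    GROUP BY 1
    """

if __name__ == "__main__":
    print(weekly_active_users_sql())
```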

Data Engineering Interview Series #2: System Design

System design interviews are usually vague and depend on you (as the interviewee) to guide the interviewer. If you are thinking:

- How do I prepare for data engineering system design interviews?
- I struggle to come up with the questions to ask in a system design interview for data engineering
- I don't have enough interview experience to know what companies ask
- Is data engineering "system design" more than choosing between technologies like Spark and Airflow?

this post is for you! Imagine being able to solve any data systems design interview systematically. You'll be able to showcase your abilities and demonstrate clear thinking to your interviewer. By the end of this post, you will have a list of questions, ordered by concept, that you can use to approach any data systems design interview.

How to reference a seed from a different dbt project?

If your company has multiple dbt projects, you have likely had to reuse code across projects. Creating cross-project dependencies is not straightforward in a SQL templating system like dbt. If you are wondering:

- How to use seed data defined in one dbt project in another
- How dbt packages work under the hood
- What caveats to be aware of when using assets across projects

this post is for you. In this post, we will go over how to use packaging in dbt to reuse assets and how packaging works under the hood. By the end of this post, you will know how to access seed data across projects.