How to turn a 1000-line messy SQL into a modular, & easy-to-maintain data pipeline?

Feb 3, 2025 · 5 min read

If you've been in the data space long enough, you would have come across really long SQL scripts that someone had written years ago. However, no one dares to touch them, as they may be powering some important part of the data pipeline, and everyone is scared of accidentally breaking them. If you feel > Rough SQL is a good place to start, but it cannot scale after a certain limit > That dogmatic KISS approach leads to unmaintainable systems > The simplest solution that takes the shortest time is not always the most optimal. > The need to build the 80% solution and then rebuild the entire thing again if you need the 100% solution later is not better than creating the 100% solution so you don't have to make it twice Then this post is for you! Imagine working with pipelines that are a joy to work with; any updates will be quick and straightforward. In this post, we will see how to convert 1000-ish lines of messy SQL into modular code that is easy to test and modify. By the end of this post, you will have a systematic approach to converting your messy SQL queries into modular, well-scoped, easily testable code.

How to ensure consistent metrics in your warehouse

Jan 28, 2025 · 3 min read

If you’ve worked on a data team, you’ve likely encountered situations where multiple teams define metrics in slightly different ways, leaving you to untangle why discrepancies exist. The root cause of these metric deviations often stems from rapid data utilization without prioritizing long-term maintainability. Imagine this common scenario: a company hires its first data professional, who writes an ad-hoc SQL query to compute a metric. Over time, multiple teams build their own datasets using this query—each tweaking the metric definition slightly. As the number of downstream consumers grows, so does the volume of ad-hoc requests to the data team to investigate inconsistencies. Before long, the team spends most of its time firefighting data bugs and reconciling metric definitions instead of delivering new insights. This cycle erodes trust, stifles career growth, and lowers team morale. This post explores two options to reduce ad-hoc data issues and empower consumers to derive insights independently.

Data Engineering Interview Preparation Series #2: System Design

Jan 20, 2025 · 13 min read

System design interviews are usually vague and depend on you (as the interviewee) to guide the interviewer. If you are thinking: How do I prepare for data engineering system design interviews? I struggle to think of questions you would ask in a system design interview for data engineering; I don't have enough interview experience to know what companies ask. Is data engineering "system design" more than choosing between technologies like Spark and Airflow? This post is for you! Imagine being able to solve any data systems design interviews systematically. You'll be able to showcase your abilities and demonstrate clear thinking to your interviewer. By the end of this post, you will have a list of questions ordered by concepts that you can use to approach any data systems design interview.

How to reference a seed from a different dbt project?

Dec 19, 2024 · 5 min read

If your company has multiple dbt projects, you would have had to use code cross projects. Creating cross-project dependencies is not straightforward in a SQL templating system like dbt. If you are wondering: How to use seed data defined in one dbt project in another, How dbt packages work under the hood, Caveats to be aware of when using assets cross-projects, etc. This post is for you. In this post, we will go over how to use packaging in dbt to reuse assets and how packaging works under the hood. By the end of this post, you will know how to access seed data cross-projects.

What do Snowflake, Databricks, Redshift, BigQuery actually do?

Nov 22, 2024 · 12 min read

Most data engineering job requirements involve one of the big data platforms: Databricks, Snowflake, Bigquery, Redshift, etc. You try to learn about these platforms, but all you can glean is that they are used to store and process data. You also try to understand why these platforms are necessary in the first place, but you do not have a clear answer to why they are necessary. With the plethora of marketing material out there, it is easy to get overwhelmed by exactly what these platforms are and why they may be very beneficial for your use case. The end result is confusion and no clear picture of why these platforms matter or why companies spend large amounts of money on them. Imagine knowing the key requirements of a data processing system and quickly choosing the right platform for your use case. By knowing what these platforms offer and when to use them, you can make recommendations to your team/leadership, making you a key contributor to any data team. In this post, we will look at these platforms (and the open-source alternative) from the perspective of the features they provide. By the end of this post, you will have a clear idea of what to expect from these platforms. You can also choose which platform to use for your specific use case.

25 SQL tips to level up your data engineering skills

Oct 17, 2024 · 19 min read

As a data engineer, you always want to uplevel yourself. SQL is the bread and butter of data engineering. Whether you are a seasoned pro or new to data engineering, there is always a way to improve your SQL skills. Do you ever think: > I wish I had known this SQL feature sooner > I wish 'learn SQL' online got into more interesting depths of the dialect than the basic shit it always is > I wish I had known this sooner; it's a much simpler way to use window functions for filtering; no more nested queries > I wish I didn't have to pull data into Python to do some loops This post is for you. Imagine being proficient in data processing patterns in SQL in addition to the standard functions. You will be able to write easy-to-maintain, clean, and scalable SQL. This post will review eight patterns to help you write easy-to-maintain SQL code and uplevel your SQL skills.

How to use nested data types effectively in SQL

Oct 14, 2024 · 11 min read

If you've worked in the data space, you've likely faced the frustration of dealing with massive tables—so many columns that it’s hard to remember their names or how they relate to each other. This complexity slows you down, increases the risk of errors, and makes your pipelines harder to maintain. You may be left wondering, Is there a simpler, more efficient way to handle this? Imagine a world where your data is organized intuitively, where relationships between entities are clear, and you don’t have to memorize dozens of column names to get your work done. In this world, representing complex relationships in data is straightforward, and your metrics are calculated with accuracy and ease. In this post, I’ll show you how to use complex data types in SQL to represent relationships more efficiently. You’ll learn how these data types can simplify your pipeline, improve developer experience, and minimize the risk of calculation errors. By the end, you’ll know the tradeoffs and practical applications of complex data types—and how to integrate them into your tables to make your work smoother and more effective.

How to decide on a data project for your portfolio

Sep 23, 2024 · 6 min read

Whether you are looking to improves your data skills or building portfolio projects to land a job, you would have faced the issue of deciding what and how to build data projects. If you are struggling to decide what tools/frameworks to use for your portfolio data projects or are not sure that what you are building is actually serving any purpose. Then this post is for you! Imagine being able make a potential referrer or hiring manager quickly understand that you have the expertise that they are looking for. By showing them exactly what they are looking for you improve the chances of landing an interview. By the end of this post you will have an algorithm that can help you decide what tools/frameworks to use for your data project to get the most out of the time you spend on it.

How to build a data project with step-by-step instructions

Sep 18, 2024 · 16 min read

There are a lot of data projects available on the web. While these projects are great, starting from scratch to build your data project can be challenging. If you are Wondering how to go from an idea to a production-ready data pipeline, Feeling overwhelmed by how all the parts of a data system fit together, or Unsure that the pipelines you build are up to industry-standard If so, this post is for you! In it, we will go over how to build a data project step-by-step from scratch. By the end of this post, you will be able to quickly create data projects for any use case and see how the different parts of data systems work together.

What are the Key Parts of Data Engineering?

Sep 5, 2024 · 5 min read

If you are trying to break into (or land a new) data engineering job, you will inevitably encounter a slew of data engineering tools. The list of tools/frameworks to know can be overwhelming. If you are wondering > What are the parts of data engineering? > Which parts of data engineering are the most important? > Which popular tools should you focus your learning on? > How to build portfolio projects? Then this post is for you. This post will review the critical components of data engineering and how you can combine them. By the end of this post, you will know all the critical components necessary for building a data pipeline.

Land your dream Data Engineering job!