Building Cost Efficient Data Pipelines with Python & DuckDB

May 28, 2024 · 12 min read

Imagine working for a company that processes a few GBs of data every day but spends hours configuring/debugging large-scale data processing systems! Whoever set up the data infrastructure copied it from some blog/talk by big tech. Now, the responsibility of managing the data team's expenses has fallen on your shoulders. You're under pressure to scrutinize every system expense, no matter how small, in an effort to save some money for the organization. It can be frustrating when data vendors charge you a lot and will gladly charge you more if you are not careful with usage. Imagine if your data processing costs were dirt cheap! Imagine being able to replicate and debug issues quickly on your laptop! In this post, we will discuss how to use the latest advancements in data processing systems and cheap hardware to enable cheap data processing. We will use DuckDB and Python to demonstrate how to process data quickly while improving developer ergonomics.

Enable stakeholder data access with Text-to-SQL RAGs

May 21, 2024 · 13 min read

You want to democratize your company's data to a larger part of your organization. However, trying to teach SQL to nontechnical stakeholders has not gone well. Stakeholders will always choose the easiest way to get what they want: by writing bad queries or opening an ad-hoc request for a data engineer to handle. You hope stakeholders will recognize the power of SQL, but it can be disappointing and frustrating to know that most people do not care about learning SQL but only about getting what they need, fast! The result is that the data team is overloaded with ad-hoc data requests or has to deal with bad queries that can bring a warehouse to its knees. Imagine a scenario where stakeholders can independently analyze data in our warehouse without needing a new dashboard for each request. This would free up the data team for more focused, deep work while empowering stakeholders to become proficient in data analysis. It is asking a lot of stakeholders to take the time to learn SQL when they have multiple other priorities. One way to get stakeholders to get good at SQL is to enable them to see the SQL query they will need to run to get the data they need. In this post, we will build an RAG to convert stakeholder questions to SQL queries, which the stakeholder will run. Repeatedly seeing the SQL queries necessary to get the needed data, the stakeholders can modify them according to their needs and eventually write SQL themselves! By the end of this post, you will learn how to build a simple LLM-powered text-to-SQL query engine for your data.

How to reduce your Snowflake cost

May 9, 2024 · 14 min read

Have you worked on Snowflake SQL written without concern for maintainability or performance? Most data projects are built without consideration of warehouse costs! You may be a new data engineer brought in to optimize Snowflake usage or suddenly thrust into a cost-reduction project. While the Snowflake contracts are signed by management without your consultation, the cost reduction initiative will fall on you, the data engineer! On top of that, you will be held responsible for skyrocketing costs in addition to the feature delivery work that you are expected to deliver! It can be quite stressful and frustrating to deal with costs in a short amount of time! A high warehouse bill can cost jobs and maybe a key metric for your career aspirations. What if your Snowflake warehouse can be a profit center? Imagine Snowflake cost management on auto-pilot. In addition to money saved, you can improve your team's morale and use the savings as a key point for your promotion! In this post, we will discuss four strategies for reducing your Snowflake costs. By the end of the posts, you will have a plan that includes quick wins, long-term planning, and monitoring costs so that they don't go haywire.

How to test PySpark code with pytest

Apr 22, 2024 · 8 min read

Working on a large codebase without any tests can be nerve-wracking. One wrong line of code or an in-conspicuous library update can bring down your whole production pipeline! Data pipelines start simple, so engineers skip tests, but the complexity increases rapidly after a while, and the lack of tests can grind down your feature delivery speed. It can be especially tricky to start testing if you are working on a large legacy codebase with few to no tests. In long-running data pipelines, bad code can take hours to be identified and fixed, causing stakeholder frustration! No testing => Stress, Worry, Extra work from an apparent bug that made it to production! What if you could confidently push changes to production without worrying about slowing down your pipeline? Quickly delivering features and empowering your team to be fast can uplevel your career as a data engineer. Tests => Peace of mind, fast feature delivery, and happy stakeholders. While testing will not catch all potential issues, it can avoid a significant number of production issues. In this post, we go over the types of tests and how to test PySpark data pipelines with pytest. We will create tests (unit, integration, and end-to-end) for a simple data pipeline that demonstrates key concepts like fixtures and mocking. By the end of this post, you will be able to identify pieces of your data pipeline to add tests.

Docker Fundamentals for Data Engineers

Apr 22, 2024 · 7 min read

Docker can be overwhelming to start with. Most data projects use Docker to set up the data infrastructure locally (and often in production as well). Setting up data tools locally without Docker is (usually)a nightmare! The official Docker documentation, while extremely instructive, does not provide a simple guide covering the basics for setting up data infrastructure. With a good understanding of data components and their interactions combined with some networking knowledge, you can easily set up a local data infrastructure with Docker. Knowing the core fundamentals of Docker will not only help you set up data infrastructure quickly but also empower you to think about networking, volumes, ports, etc., which are critical parts of most cloud data infrastructure. I wrote a post that covers the fundamental concepts one will need to set up complex data infra locally. By the end of the post, you will be able to use Docker to run any data tool (that is open source) locally on your laptop. In the post, we set up a Spark cluster, Postgres database, and minio (OSS cloud storage system) that can communicate with each other using Docker.

Data Engineering Best Practices - #2. Metadata & Logging

Feb 22, 2024 · 9 min read

Imagine this scenario: You are on call when suddenly an obscure alert pops up. It just says that your pipeline failed but has no other information. The pipelines you inherited (or didn't build) seem like impenetrable black boxes. When they break, it's a mystery—why did it happen? Where did it go wrong? The feeling is palpable: frustration and anxiety mount as you scramble to resolve the issue swiftly. It's a common struggle, especially for new team members who have yet to unravel the system's intricacies or data engineers who have to deal with pipelines built without observability. The root cause often lies in systems built without consideration for debugging and quick issue identification. The consequence? Lengthy downtimes, overburdened on-call engineers, and a slowdown in feature delivery. However, the ramifications extend beyond the technical realm, as incorrect data or failure to quickly fix a high-priority pipeline can erode stakeholder trust. Bugs are inevitable, but imagine a system that detects issues and provides the necessary information for an engineer to fix them quickly! A well-setup system that captures pertinent pipeline metadata and logs and exposes them in an easy-to-access UI will significantly reduce engineering time spent on fixing bugs. In the following post, you will learn what metadata is (in the context of data pipelines), how to log and monitor it, and how to design actionable alerts that simplify resolving bugs, even for someone new to the team. You will also set up an end-to-end logging system with Spark, Prometheus, and Grafana.

Uplevel your dbt workflow with these tools and techniques

Dec 13, 2023 · 20 min read

Are you part of an under-resourced team where adding time-saving dbt (data build tool) features take a back seat to delivering new datasets? Do you want to incorporate time (& money) saving dbt processes but need more time? While focussing on delivery may help in the short term, the delivery speed will suffer without proper workflow! A good workflow will save time, prevent bad data, and ensure high development speed! Imagine the time (& mental pressure) savings if you didn't have to validate data manually each time you put up a PR? Your development speed will be high, you can be confident that your change does not bring down the pipeline, & you can concentrate on creating value for your end users! In this post, we will see how to add improvements to an existing dbt project with an example. By the end of the post, you will know the most common enhancements engineers make to their dbt projects, how to do them yourself quickly, and how further to optimize dbt workflow for your specific use case.

What is an Open Table Format? & Why to use one?

Nov 14, 2023 · 8 min read

Do you need clarification about what Open Table Formats (OTF) are? Is it more than just a pointer to some metadata files that helps you sift through the data quickly? What is the difference between table formats (Apache Iceberg, Apache Hudi, Delta Lake) & file formats (Parquet, ORC)? How do OTFs work? Then this post is for you. Understanding the underlying principle behind open table formats will enable you to deeply understand what happens behind the scenes and make the right decisions when designing your data systems. This post will review what open table formats are, their main benefits, and some examples with Apache Iceberg. By the end of this post, you will know what OTFs are, why you use them, and how they work.

6 Steps to Avoid Messy Data in Your Warehouse

Oct 25, 2023 · 10 min read

Whether you are a new Data Engineer or someone with a few years of experience, you inevitably would have encountered messy data systems that seemed impossible to fix. Working at such a company usually comes with multiple pointless meetings, no clear work expectations, frustration, career stagnation, and ultimately no satisfaction from work! The reasons can be Managerial: Such as politics, red tape, cluelessness of management, influential people dictating roadmap, etc or Technical: Such as no data strategy at a leadership level, multiple teams using Excel as a warehouse, data/metric duplication across systems (without clear bounded context), lack of data rigor by upstream teams, etc Imagine if the data systems were seamless and a joy to work with; what would that do for your sanity, happiness & career growth? While there is no data utopia or a mythical mature organization where the data systems are perfect, there will always be some issues with the data. We, as data engineers, have the ability & responsibility to clean up the mess, build a great data warehouse, and make data accessible for the company. In this post, we will go over six critical steps to having a data warehouse that gives stakeholders precisely what they want while avoiding messy data.

Data Engineering Best Practices - #1. Data flow & Code

Jul 20, 2023 · 10 min read

If you are trying to improve your data engineering skills or are the sole data person in your company, it can be hard to know how well your technical skills are developing. Questions like Am I building pipelines the right way? How do I measure up to DEs at bigger tech companies? How do I get feedback on my pipeline design? It can cause a lot of uncertainty in career development! Imagine if you know that your code is on par (or even better than) with pipelines at tech-forward companies and that you are using industry best practices. You will be confident with your career progression and can quickly ramp up on any code base. These industry-standard best practices and concepts required to build resilient data pipelines are what you will learn in this post! By the end of this post, you will know the underlying concepts behind best practices and when to use them. While there is no perfect code/design, following these concepts will help you build resilient and easy-to-maintain data pipelines.

Land your dream Data Engineering job!