What is an Open Table Format? & Why to use one?

Do you need clarification about what Open Table Formats (OTF) are? Is it more than just a pointer to some metadata files that helps you sift through the data quickly? What is the difference between table formats (Apache Iceberg, Apache Hudi, Delta Lake) & file formats (Parquet, ORC)? How do OTFs work? Then this post is for you. Understanding the underlying principle behind open table formats will enable you to deeply understand what happens behind the scenes and make the right decisions when designing your data systems. This post will review what open table formats are, their main benefits, and some examples with Apache Iceberg. By the end of this post, you will know what OTFs are, why you use them, and how they work.

6 Steps to Avoid Messy Data in Your Warehouse

Whether you are a new Data Engineer or someone with a few years of experience, you will inevitably have encountered messy data systems that seemed impossible to fix. Working at such a company usually comes with multiple pointless meetings, no clear work expectations, frustration, career stagnation, and ultimately no satisfaction from work! The reasons can be managerial, such as politics, red tape, cluelessness of management, and influential people dictating the roadmap, or technical, such as no data strategy at a leadership level, multiple teams using Excel as a warehouse, data/metric duplication across systems (without clear bounded context), and lack of data rigor by upstream teams. Imagine if the data systems were seamless and a joy to work with; what would that do for your sanity, happiness & career growth? There is no data utopia or mythical mature organization where the data systems are perfect; there will always be some issues with the data. But we, as data engineers, have the ability & responsibility to clean up the mess, build a great data warehouse, and make data accessible for the company. In this post, we will go over six critical steps to having a data warehouse that gives stakeholders precisely what they want while avoiding messy data.

Data Engineering Best Practices - #1. Data flow & Code

If you are trying to improve your data engineering skills or are the sole data person in your company, it can be hard to know how well your technical skills are developing. Am I building pipelines the right way? How do I measure up to DEs at bigger tech companies? How do I get feedback on my pipeline design? Questions like these can cause a lot of uncertainty in career development! Imagine knowing that your code is on par with (or even better than) pipelines at tech-forward companies and that you are using industry best practices. You will be confident in your career progression and can quickly ramp up on any code base. These industry-standard best practices and the concepts required to build resilient data pipelines are what you will learn in this post! By the end of this post, you will know the underlying concepts behind best practices and when to use them. While there is no perfect code/design, following these concepts will help you build resilient and easy-to-maintain data pipelines.

What is a self-serve data platform & how to build one

Are you a data engineer who can't respond quickly to user requests since your self-serve tool is over-complex with a lot of tech debt? Has your team's over-reliance on so-called self-serve tools (vs. focusing on the end-user) caused the company to waste a lot of money? Is your work satisfaction suffering due to slow-moving, technical debt-ridden systems meant to enable end-users to use data effectively? Are you tired of vendors trying to sell you their self-serve data platform while not elaborating on what it is and why it may be helpful? If so, this post is for you! Imagine empowering end-users to analyze data and make impactful decisions with minimal dependence on data engineers. The end-user impact will skyrocket, and your work will enable your company to use data effectively. In this post, we go over what self-serve is, what problems it aims to solve, the core components of a self-serve platform, and an approach you can follow to build a solid self-serve platform.

How to become a valuable data engineer

Are you looking to better yourself as a data engineer? But when you look at job postings or company tech stacks, are you overwhelmed by the sheer number of tools you have to learn? Do you feel like you are just winging it and need a solid plan? Choosing what to learn among 100s of tools/frameworks can lead to analysis paralysis. The result is feeling overwhelmed, confused, and developing imposter syndrome, which is not helpful! What if you could have a fun and impactful career? You can be a force multiplier for any team or business you are a part of. You can be confident in providing significant value to any business. Companies will roll out the red carpet to work with you! If you want to become a valuable data engineer, this post is for you. This post will review what makes a data engineer (or any engineer) valuable. We will also go over a step-by-step method that you can use to choose/work on projects that can provide significant business impact, thus improving your value as a data engineer.

Data Engineering Project: Stream Edition

Stream processing differs from batch; one needs to be mindful of the system's memory, event order, and system recovery in case of failures. However, understanding the fundamental concepts of time attributes, cluster memory, time-bounded joins, and system monitoring will enable you to build resilient and efficient streaming pipelines. If you are looking for an end-to-end streaming tutorial or a project to understand the foundational skills required to build streaming pipelines, this post is for you. In this post, we will design & build a streaming pipeline that multiple marketing companies build in-house. We will create a real-time first-click attribution pipeline. By the end of this post, you will know the fundamental concepts to develop your streaming pipelines. We will use Apache Flink and Apache Kafka for stream processing and queuing. However, the ideas in this project apply to all stream processing systems.
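To make the idea of first-click attribution with a time-bounded join concrete, here is a minimal batch-style sketch in plain Python. It is not the Flink pipeline from the post; the `Click` and `Checkout` types, their field names, and the one-hour window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Click:
    user_id: str
    source: str
    ts: int  # event time in seconds (hypothetical field)


@dataclass(frozen=True)
class Checkout:
    user_id: str
    ts: int  # event time in seconds (hypothetical field)


def first_click_attribution(
    clicks: List[Click], checkout: Checkout, window_seconds: int = 3600
) -> Optional[Click]:
    """Return the earliest click by the same user within the window before checkout."""
    candidates = [
        c
        for c in clicks
        if c.user_id == checkout.user_id
        and checkout.ts - window_seconds <= c.ts <= checkout.ts
    ]
    # The first click is the one with the smallest event time; None if no click matched.
    return min(candidates, key=lambda c: c.ts, default=None)


clicks = [
    Click("u1", "email", 100),
    Click("u1", "search", 50),
    Click("u1", "ad", 5000),  # happens after the checkout, so it is ignored
    Click("u2", "ad", 60),
]
attributed = first_click_attribution(clicks, Checkout("u1", 200))
```

In a real streaming system the list of clicks would be bounded state kept per user, which is why the time window matters: it limits how much the system has to remember.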

Change Data Capture, with Debezium

Change data capture (CDC) is a popular technique to copy data from DBs into warehouses. However, it can be tricky to understand at first. Without working with a CDC system, knowing what it does, why it's needed, or how it works can be challenging. However, understanding the what, why, and how of CDC can help you set up pipelines that are resilient and reliable. If you have wondered what CDC does, why it's needed, and how it works, this post is for you. By the end of this post, you will have a good idea of what a CDC system is, where it's used, the different types of CDC, and how a CDC system built on Debezium and Kafka works.
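As a rough sketch of what a CDC consumer does, the snippet below replays change events onto an in-memory copy of a table. The events are heavily simplified: they keep only the `op`, `before`, and `after` fields of a Debezium-style change event, and the `id` primary key and the `apply_change` helper are assumptions made for illustration.

```python
from typing import Any, Dict

Row = Dict[str, Any]


def apply_change(replica: Dict[int, Row], event: Dict[str, Any]) -> Dict[int, Row]:
    """Apply one simplified change event to a replica keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read, and update all carry the new row state in "after"
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":
        # delete carries the last known row state in "before"
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown operation: {op!r}")
    return replica


events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "a"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "b"}},
    {"op": "u", "before": {"id": 1, "name": "a"}, "after": {"id": 1, "name": "a2"}},
    {"op": "d", "before": {"id": 2, "name": "b"}, "after": None},
]

replica: Dict[int, Row] = {}
for event in events:
    apply_change(replica, event)
```

Replaying the events in order leaves the replica with only row 1 in its updated state, which is the property a warehouse copy built from CDC relies on.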

Data Pipeline Design Patterns - #2. Coding patterns in Python

As a data engineer, you might have heard the terms functional data pipeline, factory pattern, singleton pattern, etc. One can quickly look up the implementation, but it can be tricky to understand precisely what they are and when to (& when not to) use them. Blindly following a pattern can help in some cases, but not knowing the caveats of a design will lead to hard-to-maintain and brittle code! While writing clean and easy-to-read code takes years of experience, you can accelerate that by understanding the nuances and reasoning behind each pattern. Imagine being able to design an implementation that provides the best extensibility and maintainability! Your colleagues (& future self) will be extremely grateful, your feature delivery speed will increase, and your boss will highly value your opinion. In this post, we will go over the specific code design patterns used for data pipelines, when and why to use them, and when not to use them. We will also go over a few Python-specific techniques to help you write better pipelines. By the end of this post, you will be able to identify patterns in your data pipelines and apply the appropriate code design patterns. You will also be able to take advantage of Pythonic features to write maintainable code that is a joy to work on!
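For a sense of what one of these patterns looks like, here is a minimal sketch of the factory pattern applied to pipeline extractors. The class names, the `extractor_factory` function, and the returned rows are hypothetical and chosen only for illustration; they are not taken from the post.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class Extractor(ABC):
    """Common interface that every source-specific extractor implements."""

    @abstractmethod
    def extract(self) -> List[Dict[str, str]]:
        ...


class CsvExtractor(Extractor):
    def extract(self) -> List[Dict[str, str]]:
        # Placeholder data; a real extractor would read a file here.
        return [{"source": "csv"}]


class ApiExtractor(Extractor):
    def extract(self) -> List[Dict[str, str]]:
        # Placeholder data; a real extractor would call an API here.
        return [{"source": "api"}]


def extractor_factory(kind: str) -> Extractor:
    """Create the extractor for a source type without the caller naming the class."""
    registry: Dict[str, Type[Extractor]] = {"csv": CsvExtractor, "api": ApiExtractor}
    try:
        return registry[kind]()
    except KeyError:
        raise ValueError(f"unsupported source type: {kind!r}") from None


rows = extractor_factory("csv").extract()
```

The caller depends only on the `Extractor` interface, so adding a new source means adding a class and a registry entry. The caveat is extra indirection, which is rarely worth it when a pipeline has a single source.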

Data Pipeline Design Patterns - #1. Data flow patterns

Data pipelines built (and added on to) without a solid foundation will suffer from poor efficiency, slow development speed, long times to triage production issues, and poor testability. What if your data pipelines were elegant and enabled you to deliver features quickly? An easy-to-maintain and extendable data pipeline significantly increases developer morale, stakeholder trust, and the business bottom line! Using the correct design pattern will increase feature delivery speed and developer value (allowing devs to do more in less time), decrease toil during pipeline failures, and build trust with stakeholders. This post goes over the most commonly used data flow design patterns, what they do, when to use them, and, more importantly, when not to use them. By the end of this post, you will have an overview of the typical data flow patterns and be able to choose the right one for your use case.

How to gather requirements for your data project

Frustrated trying to pry data pipeline requirements out of end users? Is scope creep preventing you from delivering data projects on time? You assume that the end-users know (and communicate) exactly what they want, but that is rarely the case! Ad hoc feature/change requests throw off your delivery timelines. You want to deliver on time and make an impact, but you are interrupted constantly! Requirements gathering is rough, but it doesn't have to be! In this post, we go over five steps you can follow to work with the end-user to define requirements, validate data before wasting engineering hours, deliver iteratively, and handle new feature/change requests without context switching.