What are the types of data quality checks?

Data quality is a broad topic. There are many ways to check the quality of a dataset, but knowing which checks to run, and when, can be confusing. In this post, we will review the main types of data quality checks, where to use them, and what to do when a DQ check fails. By the end, you will not only have a clear understanding of the different types of DQ checks and when to use them, but you'll also be equipped to prioritize which DQ checks to implement.
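For a flavor of what such checks can look like in practice, here is a minimal sketch of my own (not code from the post) of two common checks, completeness and uniqueness, run with DuckDB in Python; the orders table and its columns are made up for illustration.

```python
# A minimal sketch of two common DQ checks -- completeness and uniqueness --
# run with DuckDB. The orders table and its columns are hypothetical.
import duckdb

con = duckdb.connect()  # in-memory database, enough for the sketch
con.execute("""
    CREATE TABLE orders AS
    SELECT * FROM (VALUES
        (1, 'shipped', 100.0),
        (2, 'shipped', NULL),
        (3, NULL,      25.5),
        (3, 'pending', 25.5)
    ) AS t(order_id, status, amount)
""")

# Completeness: status should never be NULL
null_count = con.execute(
    "SELECT COUNT(*) FROM orders WHERE status IS NULL"
).fetchone()[0]

# Uniqueness: order_id should identify exactly one row
dup_count = con.execute(
    """
    SELECT COUNT(*) FROM (
        SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1
    ) AS dups
    """
).fetchone()[0]

for name, failures in [("status completeness", null_count), ("order_id uniqueness", dup_count)]:
    result = "PASS" if failures == 0 else f"FAIL ({failures} offending rows)"
    print(name, result)
```

A failing check like the ones above is exactly the situation the post digs into: do you halt the pipeline, quarantine the bad rows, or just alert?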

SQL or Python for Data Transformations?

Do you use SQL or Python for data processing? Every data engineer has a preference. Some swear by Python, pointing out that it's a Turing-complete language, while the SQL camp counters with performance, readability, and so on. Not using the right tool for the job can lead to hard-to-maintain code and sleepless nights! Using the right tool can help you progress up the career ladder, yet most advice online seems to be 'Just use Python' or 'Just use SQL.' Understanding how your code interacts with the underlying execution engine, and the tradeoffs you can choose from, will give you the mental model to make a calculated, objective decision about which tool fits your use case. By the end of this post, you will understand how the underlying execution engine impacts your pipeline's performance, and you will have a checklist of criteria to consider when choosing Python or SQL for a data processing task, so you can use each tool to its strengths.
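To make the tradeoff concrete, here is the same aggregation (total amount per status, with made-up data) written both ways; this is my own sketch rather than an excerpt from the post, using DuckDB for the SQL side.

```python
# The same aggregation written declaratively (SQL via DuckDB) and
# imperatively (plain Python). Data and column names are made up.
import duckdb

rows = [
    ("shipped", 100.0),
    ("shipped", 40.0),
    ("pending", 25.5),
]

# SQL: you describe the result, the engine decides how to compute it
con = duckdb.connect()
con.execute("CREATE TABLE orders(status VARCHAR, amount DOUBLE)")
con.executemany("INSERT INTO orders VALUES (?, ?)", rows)
sql_totals = dict(
    con.execute("SELECT status, SUM(amount) FROM orders GROUP BY status").fetchall()
)

# Python: you spell out exactly how the rows are combined
py_totals = {}
for status, amount in rows:
    py_totals[status] = py_totals.get(status, 0.0) + amount

assert sql_totals == py_totals
```

Both produce the same result; the interesting question, which the post explores, is what each approach costs you in performance, testability, and maintainability as the data and logic grow.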

Why use Apache Airflow (or any orchestrator)?

Are you a data engineer (or new to the data space) wondering why one would use Apache Airflow instead of plain cron? Does Apache Airflow feel like an over-engineered solution to a simple problem? Then this post is for you. Understanding the critical features a data pipelining system needs will ensure that your output is high quality! When you know exactly what a complex orchestration system brings to the table, you can make the right tradeoffs for your data architecture. This post will review three critical components of a data pipelining system: Scheduling, Orchestration, and Observability, and explain how Apache Airflow empowers data engineers with these vital components.
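To make those three components concrete, here is a minimal DAG sketch of my own; the dag_id, tasks, and callables are hypothetical, and it assumes Airflow 2.4+ (for the schedule argument).

```python
# A minimal Airflow DAG showing the three components: a schedule, explicit
# task ordering (orchestration), and retries/logs for observability.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pulling data from the source system")


def load():
    print("loading data into the warehouse")


with DAG(
    dag_id="example_daily_pipeline",          # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                        # Scheduling: run once per day
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
):
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task                 # Orchestration: load waits for extract
```

The retries, per-task logs, and UI you get for free here are the observability piece that a bare cron entry cannot give you.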

Data Engineering Projects

A strong data project portfolio can help set you apart from the run-of-the-mill candidate. Projects show that you are someone who can learn and adapt. Your portfolio tells a potential employer about your ability to keep learning, your knowledge of data pipeline best practices, and your genuine interest in the data field. Most importantly, it gives you the confidence to pick up new tools and build data pipelines from scratch. But setting up data infrastructure with coding best practices, data quality checks, etc., is not trivial and can take a long time, especially if you have not done it before! This post gives you 10 data engineering projects you can adapt for your portfolio. The projects cover batch, stream, and event-driven pipelines and follow best practices, including:
1. Version control
2. Industry-standard code organization
3. Testing & data quality checks
4. Using in-demand tools, & much more
Bookmark this post, use these templates to build your projects, and share your portfolio with me in the comments for a review!

Build Data Engineering Projects, with Free Template

Setting up data infrastructure is one of the most complex parts of starting a data engineering project. Overwhelmed trying to set up data infrastructure with code, or to apply DevOps practices such as CI/CD to data pipelines? If so, this post will help! It covers the critical concepts of setting up data infrastructure and a development workflow, along with sample data projects that follow this pattern. We will also use a data project template that runs Airflow, Postgres, DuckDB & Quarto to demonstrate how each concept works.

Python Essentials for Data Engineers

You know Python is essential for a data engineer. But how much Python does one actually need to learn to become a data engineer? And when you're in an interview with a hiring manager, how can you effectively demonstrate your Python proficiency? Imagine knowing exactly how to build resilient and stable data pipelines (in any language). Knowing the foundational ideas behind data processing will ensure you can quickly adapt to the ever-changing tools landscape. In this post, we will review the concepts you need to know to use Python effectively for data engineering. Each concept has an associated workbook for practice, and you can try them out directly in your browser with GitHub Codespaces.
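As a taste of the kind of foundational concept the post covers (a sketch of mine with a hypothetical file layout): processing a file lazily with a generator, so the whole dataset never has to fit in memory.

```python
# Lazily stream a CSV with a generator instead of loading it all at once.
# The file path and column names are hypothetical.
import csv
from typing import Iterator


def read_orders(path: str) -> Iterator[dict]:
    """Yield one parsed row at a time."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            row["amount"] = float(row["amount"])
            yield row


def total_revenue(path: str) -> float:
    # Memory use stays flat no matter how large the file grows.
    return sum(row["amount"] for row in read_orders(path))
```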

Building Cost Efficient Data Pipelines with Python & DuckDB

Imagine working for a company that processes a few GBs of data every day but spends hours configuring and debugging large-scale data processing systems! Whoever set up the data infrastructure copied it from some blog or talk by a big tech company. Now the responsibility of managing the data team's expenses has fallen on your shoulders, and you're under pressure to scrutinize every system expense, no matter how small, to save the organization some money. It can be frustrating when data vendors charge you a lot, and they will gladly charge you more if you are not careful with usage. Imagine if your data processing costs were dirt cheap! Imagine being able to replicate and debug issues quickly on your laptop! In this post, we will discuss how to use the latest advancements in data processing systems and cheap hardware to keep data processing inexpensive. We will use DuckDB and Python to demonstrate how to process data quickly while improving developer ergonomics.
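To give a flavor of the approach (a sketch with a hypothetical file path and columns, not code from the post): DuckDB can aggregate Parquet files straight off local disk, which comfortably covers the few-GBs-a-day scale described above.

```python
# Aggregate local Parquet files with DuckDB; path and columns are hypothetical.
import duckdb

daily_revenue = duckdb.sql(
    """
    SELECT order_date, SUM(amount) AS revenue
    FROM read_parquet('data/orders/*.parquet')
    GROUP BY order_date
    ORDER BY order_date
    """
).df()  # the aggregated result is small, so handing it to pandas is cheap

print(daily_revenue.head())
```

No cluster, no vendor bill, and the same script runs on your laptop when you need to reproduce a bug.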

Enable stakeholder data access with Text-to-SQL RAGs

You want to democratize your company's data across a larger part of your organization, but trying to teach SQL to nontechnical stakeholders has not gone well. Stakeholders will always choose the easiest way to get what they want: writing bad queries or opening an ad-hoc request for a data engineer to handle. You hope stakeholders will recognize the power of SQL, but it can be disappointing and frustrating to realize that most people do not care about learning SQL, only about getting what they need, fast! The result is a data team overloaded with ad-hoc requests, or stuck dealing with bad queries that can bring a warehouse to its knees. Imagine a scenario where stakeholders can independently analyze data in your warehouse without needing a new dashboard for each request. This would free up the data team for more focused, deep work while empowering stakeholders to become proficient in data analysis. It is asking a lot of stakeholders to take the time to learn SQL when they have multiple other priorities. One way to help them get good at SQL is to show them the SQL query they would need to run to get the data they need. In this post, we will build a RAG pipeline that converts stakeholder questions into SQL queries, which the stakeholders then run. By repeatedly seeing the SQL needed to get their data, stakeholders can modify the queries to suit their needs and eventually write SQL themselves! By the end of this post, you will know how to build a simple LLM-powered text-to-SQL query engine for your data.
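As a rough sketch of the overall shape (not the exact stack built in the post): the "retrieval" below is a naive keyword match over hand-written table docs standing in for a real vector store, and the OpenAI client and model name are just one possible choice.

```python
# Toy text-to-SQL: retrieve relevant table docs, then ask an LLM for a query.
# Table names, columns, and the model are assumptions for illustration.
from openai import OpenAI

TABLE_DOCS = {
    "orders": "orders(order_id INT, customer_id INT, order_date DATE, amount DOUBLE)",
    "customers": "customers(customer_id INT, name VARCHAR, signup_date DATE)",
}


def retrieve_schema(question: str) -> str:
    """Naive retrieval: keep tables whose name appears in the question."""
    hits = [ddl for name, ddl in TABLE_DOCS.items() if name.rstrip("s") in question.lower()]
    return "\n".join(hits or TABLE_DOCS.values())


def question_to_sql(question: str) -> str:
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    prompt = (
        "Given these tables:\n"
        f"{retrieve_schema(question)}\n\n"
        f"Write one SQL query that answers: {question}\n"
        "Return only the SQL."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()


# The stakeholder sees (and can tweak) the generated SQL before running it.
print(question_to_sql("Total order amount per customer last month?"))
```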