How to build a data project with step-by-step instructions

There are a lot of data projects available on the web. While these projects are great, starting from scratch to build your data project can be challenging. If you are Wondering how to go from an idea to a production-ready data pipeline, Feeling overwhelmed by how all the parts of a data system fit together, or Unsure that the pipelines you build are up to industry-standard If so, this post is for you! In it, we will go over how to build a data project step-by-step from scratch. By the end of this post, you will be able to quickly create data projects for any use case and see how the different parts of data systems work together.

What are the Key Parts of Data Engineering?

If you are trying to break into (or land a new) data engineering job, you will inevitably encounter a slew of data engineering tools. The list of tools/frameworks to know can be overwhelming. If you are wondering > What are the parts of data engineering? > Which parts of data engineering are the most important? > Which popular tools should you focus your learning on? > How to build portfolio projects? Then this post is for you. This post will review the critical components of data engineering and how you can combine them. By the end of this post, you will know all the critical components necessary for building a data pipeline.

Data Engineering Interview Series #1: Data Structures and Algorithms

Preparing for data engineering interviews can be stressful. There are so many things to learn. In this 'Data Engineering Interview Series', you will learn how to crack each section of the data engineering interview. If you have felt > That you need to practice 100s of Leetcode questions to crack the data engineering interview > That you have no idea where/how to start preparing for the data structures and algorithms interview > That you are not good enough to crack the data structures and algorithms interview. Then this post is for you!

How to implement data quality checks with greatexpectations

Data quality checks are critical for any production pipeline. While there are many ways to implement data quality checks, the greatexpectations library is one of the popular ones. If you have wondered 1. How can you effectively use the greatexpectations library? 2. Why is the greatexpectations library so complex? 3. Why is the greatexpectations library so clunky and has many moving pieces? Then this post is for you. In this post, we will go over the key concepts you’ll need to get up and running with the greatexpectations library, along with examples of the types of tests you may run. By the end of this post, you will have a mental model of how the greatexpectations library works and be able to quickly set up and run your own data quality checks with greatexpectations.

What are the types of data quality checks?

Data quality is such a broad topic. There are many ways to check the data quality of a dataset, but knowing what checks to run and when can be confusing and unclear. In this post, we will review the main types of data quality checks, where to use them, and what to do if a DQ check fails. By the end of this post, you will not only have a clear understanding of the different types of DQ checks and when to use them, but you'll also be equipped with the knowledge to prioritize which DQ checks to implement.

SQL or Python for Data Transformations?

Do you use SQL or Python for data processing? Every data engineer will have their preference. Some will swear by Python, stating that it's a Turing-complete language. At the same time, the SQL camp will restate its performance, ease of understanding, etc. Not using the right tool for the job can lead to hard-to-maintain code and sleepless nights! Using the right tool for the job can help you progress the career ladder, but every advice online seems to be 'Just use Python' or 'Just use SQL.' Understanding how the underlying execution engine and code interact and the tradeoffs you can choose from will equip you with the mental model to make a calculated, objective decision about which tool to use for your use case. By the end of this post, you will understand how the underlying execution engine impacts your pipeline performance. You will have a list of criteria to consider when using Python or SQL for a data processing task. With this checklist, you can use each tool to its benefit.

Why use Apache Airflow (or any orchestrator)?

Are you a data engineer(or new to data space) wondering why one may need to use Apache Airflow vs. just using cron? Does Apache Airflow feel like an over-optimized solution for a simple problem? Then this post is for you. Understanding the critical features necessary for a data pipelining system will ensure that your output is high quality! Imagine knowing exactly what a complex orchestration system brings to the table; you can make the right tradeoffs for your data architecture. This post will review three critical components of a data pipelining system: Scheduling, Orchestrating, and Observability. We will explain how Apache Airflow empowers data engineers with these vital components.

Data Engineering Projects

An in-significant data project portfolio can help set you apart from the run-of-a-mill candidate. Projects show that you are someone who can learn and adapt. Your portfolio informs a potential employer about your ability to continually learn, your knowledge of data pipeline best practices, and your genuine interest in the data field. Most importantly, it gives you the confidence to pick up new tools and build data pipelines from scratch. But setting up data infrastructure, with coding best practices, data quality checks, etc., is not trivial and can take a long time, especially if you have not done it before! This post gives you 10 data engineering projects you can adapt to your portfolio! The data projects in this post cover batch, stream, and event-driven pipelines and follow best practices, including 1. version control 2. Industry-standard code organization 3. Testing & data quality checks 4. Using in-demand tools & much more Bookmark this post, use these templates to build your projects, and share your portfolio with me in the comments for a review!

Build Data Engineering Projects, with Free Template

Setting up data infra is one of the most complex parts of starting a data engineering project. Overwhelmed trying to set up data infrastructure with code? Or using dev ops practices such as CI/CD for data pipelines? In that case, this post will help! This post will cover the critical concepts of setting up data infrastructure, development workflow, and sample data projects that follow this pattern. We will also use a data project template that runs Airflow, Postgres, DuckDB & Quarto to demonstrate how each concept works.