Wondering how to backfill an hourly SQL query in Apache Airflow? Then this post is for you. In this post we go over how to manipulate the execution_date to run backfills at any time granularity. We use an hourly DAG to explain what execution_date is and how you can manipulate it using Airflow macros.
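To make the idea concrete, here is a minimal sketch of an hourly, backfill-friendly DAG (mine, not the post's; the DAG name, connection id, table, and SQL are hypothetical placeholders):

```python
# An hourly DAG whose SQL is templated with Airflow macros, so a backfill
# re-runs each past hour with the correct time window.
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="hourly_user_events",      # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=True,                     # allow past hours to be (back)filled
) as dag:
    load_hour = PostgresOperator(
        task_id="load_hour",
        postgres_conn_id="warehouse",  # hypothetical connection id
        sql="""
            INSERT INTO user_events_hourly
            SELECT *
            FROM user_events
            WHERE event_ts >= '{{ execution_date }}'
              AND event_ts <  '{{ next_execution_date }}';
        """,
    )
```

A backfill is then `airflow dags backfill -s 2024-01-01 -e 2024-01-02 hourly_user_events`, and each hourly run fills in its own window. (In newer Airflow versions, `data_interval_start` and `data_interval_end` are the preferred macro spellings.)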
Change data capture (CDC) is a software design pattern in which we track every change (update, insert, delete) to the data in a database and replicate it to other database(s). In this post we will see how to do CDC by reading data from database logs and replicating it to other databases, using the popular open source Singer standard.
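Because Singer taps and targets are plain CLI programs that speak JSON over stdout/stdin, the smallest possible pipeline is a pipe. Here is a hedged sketch (the tap/target names, config file names, and exact flags are assumptions; some taps use `--properties` instead of `--catalog`):

```python
# Pipe a Singer tap (reads changes from the source DB) into a Singer target
# (writes them to the destination DB). Roughly equivalent to the shell pipe:
#   tap-postgres --config tap.json --catalog catalog.json | target-postgres --config target.json
import subprocess

tap = subprocess.Popen(
    ["tap-postgres", "--config", "tap.json", "--catalog", "catalog.json"],
    stdout=subprocess.PIPE,
)
subprocess.run(
    ["target-postgres", "--config", "target.json"],
    stdin=tap.stdout,
    check=True,
)
tap.stdout.close()
tap.wait()
```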
You have most likely heard of Common Table Expressions (CTEs), but may not be sure what they are or when to use them. What if you knew exactly what CTEs were and when to use them? In this post, we go over what CTEs are and compare their performance against subqueries, derived tables, and temp tables to help you decide when to use them.
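If you want to see one right away, here is a tiny self-contained illustration (mine, with made-up data) of the same query written as a CTE and as a derived table, using Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('a', 10), ('a', 30), ('b', 5);
""")

# CTE: name the intermediate result, then query it by name.
with_cte = conn.execute("""
    WITH customer_totals AS (
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
    )
    SELECT customer FROM customer_totals WHERE total > 20;
""").fetchall()

# The equivalent derived-table (inline subquery) form.
with_derived = conn.execute("""
    SELECT customer
    FROM (SELECT customer, SUM(amount) AS total
          FROM orders GROUP BY customer)
    WHERE total > 20;
""").fetchall()

assert with_cte == with_derived == [("a",)]
```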
If you are preparing for a data engineering interview and are overwhelmed by all the tools and concepts you need to learn, this post is for you. We go over the most common tools and concepts you need to know to ace any data engineering interview.
In this post we go over 6 key concepts to help you master window functions. Window functions are one of the most powerful features of SQL: they are very useful in analytics and enable operations that cannot be done easily with standard GROUP BY clauses, subqueries, and filters. In spite of this, window functions are not used frequently. If you have ever thought 'window functions are confusing', then this post is for you.
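As a taste of what they can do, here is a small self-contained example (mine, with made-up data; needs SQLite 3.25+ for window function support): a per-customer running total that keeps every row, something a plain GROUP BY cannot do:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, order_day INTEGER, amount REAL);
    INSERT INTO orders VALUES
        ('a', 1, 10), ('a', 2, 30), ('b', 1, 5), ('b', 3, 20);
""")

rows = conn.execute("""
    SELECT customer,
           order_day,
           amount,
           SUM(amount) OVER (
               PARTITION BY customer   -- restart the total per customer
               ORDER BY order_day      -- accumulate in day order
           ) AS running_total
    FROM orders
    ORDER BY customer, order_day;
""").fetchall()

for row in rows:
    print(row)
# ('a', 1, 10.0, 10.0)
# ('a', 2, 30.0, 40.0)
# ('b', 1, 5.0, 5.0)
# ('b', 3, 20.0, 25.0)
```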
If you are looking for a simple, easy-to-set-up way to automate, schedule, and monitor a 'small' API data pull in the cloud, serverless functions are a good option. In this post we cover what a serverless function can and cannot do and what its pros and cons are, then walk through a simple API data pull project. We will be using AWS Lambda and AWS S3 for this project.
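The core of such a project is surprisingly small. A bare-bones sketch of the handler (the bucket name and API URL are hypothetical; boto3 is preinstalled in the Lambda Python runtime):

```python
import urllib.request
from datetime import datetime, timezone

import boto3

S3_BUCKET = "my-data-bucket"              # assumption: your bucket name
API_URL = "https://api.example.com/data"  # hypothetical endpoint


def lambda_handler(event, context):
    # Pull the API response and land it, unmodified, in S3.
    with urllib.request.urlopen(API_URL, timeout=30) as resp:
        payload = resp.read()

    key = f"raw/api_pull/{datetime.now(timezone.utc):%Y-%m-%d-%H}.json"
    boto3.client("s3").put_object(Bucket=S3_BUCKET, Key=key, Body=payload)
    return {"status": "ok", "s3_key": key}
```

Scheduling then comes down to an EventBridge cron rule that triggers this function every hour or day.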
There are many ways to submit an Apache Spark job to an AWS EMR cluster using Apache Airflow. In this post we go over the steps to create a temporary EMR cluster, submit jobs to it, wait for them to complete, and terminate the cluster, the Airflow way.
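In outline, the "Airflow way" chains four components from the Amazon provider: create, add step, sense, terminate. A condensed sketch (the cluster spec, bucket, and job path are placeholders, not a full working config):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

SPARK_STEP = {
    "Name": "my_spark_job",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://my-bucket/jobs/job.py"],  # placeholder
    },
}

with DAG(
    dag_id="spark_on_transient_emr",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_cluster",
        job_flow_overrides={"Name": "transient-cluster"},  # + instances, release, ...
    )
    add_step = EmrAddStepsOperator(
        task_id="add_step",
        job_flow_id=create_cluster.output,
        steps=[SPARK_STEP],
    )
    wait_for_step = EmrStepSensor(
        task_id="wait_for_step",
        job_flow_id=create_cluster.output,
        step_id="{{ task_instance.xcom_pull(task_ids='add_step')[0] }}",
    )
    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_cluster",
        job_flow_id=create_cluster.output,
        trigger_rule="all_done",  # tear the cluster down even if the step fails
    )

    create_cluster >> add_step >> wait_for_step >> terminate_cluster
```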
Data engineering project for beginners, stream edition. In this post we design and build a simple data streaming pipeline using Apache Kafka, Apache Flink, and a PostgreSQL database. We will also review the design and cover some common issues to avoid when building distributed stream processing systems.
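For a sense of the pipeline's shape, here is a compressed PyFlink sketch (my own; the topic, schema, and connection strings are made up, and the Kafka and JDBC connector jars must be on Flink's classpath): read click events from Kafka, aggregate per user, and upsert into Postgres:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: a Kafka topic of JSON click events (hypothetical schema).
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id STRING,
        url STRING,
        ts TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'clicks',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# Sink: a Postgres table, keyed so counts are upserted rather than appended.
t_env.execute_sql("""
    CREATE TABLE clicks_per_user (
        user_id STRING,
        click_count BIGINT,
        PRIMARY KEY (user_id) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:postgresql://localhost:5432/analytics',
        'table-name' = 'clicks_per_user'
    )
""")

# The streaming job itself: a continuously updating aggregation.
t_env.execute_sql("""
    INSERT INTO clicks_per_user
    SELECT user_id, COUNT(*) AS click_count
    FROM clicks
    GROUP BY user_id
""")
```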
This post goes over what the ETL and ELT data pipeline paradigms are. It addresses the inconsistency in naming conventions and how to understand what these terms really mean, and ends with a comparison of the two paradigms and how to use these concepts to build efficient and scalable data pipelines.
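A toy contrast may help pin down the two terms (illustrative only; sqlite3 stands in for the warehouse and the data is made up):

```python
import sqlite3

raw = [("a", "10"), ("b", "oops"), ("c", "25")]  # messy extract
wh = sqlite3.connect(":memory:")

# ETL: transform in application code first, load only the clean rows.
wh.execute("CREATE TABLE etl_amounts (customer TEXT, amount REAL)")
clean = [(c, float(v)) for c, v in raw if v.replace(".", "", 1).isdigit()]
wh.executemany("INSERT INTO etl_amounts VALUES (?, ?)", clean)

# ELT: load everything as-is, then transform inside the warehouse with SQL.
wh.execute("CREATE TABLE raw_amounts (customer TEXT, amount TEXT)")
wh.executemany("INSERT INTO raw_amounts VALUES (?, ?)", raw)
wh.execute("""
    CREATE TABLE elt_amounts AS
    SELECT customer, CAST(amount AS REAL) AS amount
    FROM raw_amounts
    WHERE amount GLOB '[0-9]*'
""")
```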
This post goes over what exactly a staging area is in a data pipeline. It covers the reasons for having a staging area and some common use cases where one can save engineering effort and time.
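As a concrete miniature of the pattern (the tables and dedupe rule are made up; sqlite3 stands in for the warehouse): raw loads land in a staging table, get validated there, and only then get published:

```python
import sqlite3

wh = sqlite3.connect(":memory:")
wh.executescript("""
    CREATE TABLE final_users (user_id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE stg_users (user_id INTEGER, email TEXT);
""")

# 1. Land the raw batch in staging; a bad batch never touches final_users.
batch = [(1, "a@x.com"), (1, "a@x.com"), (2, "b@x.com")]  # note the duplicate
wh.executemany("INSERT INTO stg_users VALUES (?, ?)", batch)

# 2. Dedupe/validate in staging, then publish to the final table in one step.
wh.execute("""
    INSERT OR REPLACE INTO final_users
    SELECT DISTINCT user_id, email FROM stg_users
""")
wh.execute("DELETE FROM stg_users")  # staging is disposable, reloaded each run
```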