How to unit test sql transforms in dbt

Using dbt you can test the output of your sql transformations. If you have wondered how to "unit test" your sql transformations in dbt, then this post is for you. In this post, we go over how to write unit tests for your sql transformations with mock inputs/outputs and test them locally. This helps keep the development cycle shorter and enables you to follow a TDD approach for your sql based data pipelines.

How to Backfill a SQL query using Apache Airflow

Wondering how to backfill an hourly SQL query in Apache Airflow ? Then, this post is for you. In this post we go over how to manipulate the execution_date to run backfills with any time granularity. We use an hourly DAG to explain execution_date and how you can manipulate them using Airflow macros.

How to do Change Data Capture (CDC), using Singer

Change data capture(CDC) is a software design pattern in which we track every change(update, insert, delete) to the data in a database and replicate it to other database(s). In this post we will see how to do CDC by reading data from database logs and replicating it to other databases, using the popular open source Singer standard.

What are Common Table Expressions(CTEs) and when to use them?

You have most likely heard of Common Table Expressions(CTEs), but may not be sure what they are and when to use them. What if you knew exactly what Common Table Expressions(CTEs) were and when to use them ? In this post, we go over what CTEs are, and its performance comparisons against subqueries, derived tables, and temp tables to help decide when to use them.

10 Skills to Ace Your Data Engineering Interview

If you are preparing for a data engineering interview and are overwhelmed by all the tools and concepts you need to learn. Then this post is for you, in this post we go over the most common tools and concepts you need to know to ace any data engineering interview.

6 Key Concepts, to Master Window Functions

In this post we go over 6 key concepts to help you master window functions. Window functions are one the most powerful features of SQL, they are very useful in analytics and performing operations which cannot be done easily with standard group by, subquery and filters. In spite of this, window functions are not used frequently. If you have ever thought 'window functions are confusing', then this post is for you.

How to Pull Data from an API, Using AWS Lambda

If you are looking for an easy to setup and simple way to automate, schedule and monitor a 'small' API data pull on the cloud, serverless functions are a good option. In this post we cover what a serverless function can and cannot do, what its pros and cons are and walk through a simple API data pull project. We will be using AWS Lambda and AWS S3 for this project.

How to submit Spark jobs to EMR cluster from Airflow

There are many ways to submit an Apache Spark job to an AWS EMR cluster using Apache Airflow. In this post we go over the steps on how to create a temporary EMR cluster, submit jobs to it, wait for the jobs to complete and terminate the cluster, the Airflow-way.

Data Engineering Project: Stream Edition

Data engineering project for beginners, stream edition. In this post we design and build a simple data streaming pipeline using Apache Kafka, Apache Flink and PostgreSQL DB. We will also review the design and understand some common issues to avoid while building distributed stream processing systems.

ETL & ELT, a comparison

This post goes over what the ETL and ELT data pipeline paradigms are. It tries to address the inconsistency in naming conventions and how to understand what they really mean. Finally ends with a comparison of the 2 paradigms and how to use these concepts to build efficient and scalable data pipelines.