How to Join a fact and a type 2 dimension (SCD2) table

Wondering how to store a dimension table's history over time and how to join these historical dimension tables with fact tables for analytical querying ? Then this post is for you. In this post, we will go over a popular dimension modeling technique called SCD2, which preserves historical changes. We will also see how to join a fact table with an SCD2 table to get accurate point in time information.

How to update millions of records in MySQL?

Whenever updating a few records in an OLTP table we just use the update command. But what if we have to update millions of records in an OLTP table? If you run a large update, your database will lock those records and other transactions may fail. In this post we look at how a large update can cause lock timeout error and how running batches of smaller updates can eliminate this issue.

How to unit test sql transforms in dbt

Using dbt you can test the output of your sql transformations. If you have wondered how to "unit test" your sql transformations in dbt, then this post is for you. In this post, we go over how to write unit tests for your sql transformations with mock inputs/outputs and test them locally. This helps keep the development cycle shorter and enables you to follow a TDD approach for your sql based data pipelines.

How to Backfill a SQL query using Apache Airflow

Wondering how to backfill an hourly SQL query in Apache Airflow ? Then, this post is for you. In this post we go over how to manipulate the execution_date to run backfills with any time granularity. We use an hourly DAG to explain execution_date and how you can manipulate them using Airflow macros.

How to do Change Data Capture (CDC), using Singer

Change data capture(CDC) is a software design pattern in which we track every change(update, insert, delete) to the data in a database and replicate it to other database(s). In this post we will see how to do CDC by reading data from database logs and replicating it to other databases, using the popular open source Singer standard.

What are Common Table Expressions(CTEs) and when to use them?

You have most likely heard of Common Table Expressions(CTEs), but may not be sure what they are and when to use them. What if you knew exactly what Common Table Expressions(CTEs) were and when to use them ? In this post, we go over what CTEs are, and its performance comparisons against subqueries, derived tables, and temp tables to help decide when to use them.

10 Skills to Ace Your Data Engineering Interview

If you are preparing for a data engineering interview and are overwhelmed by all the tools and concepts you need to learn. Then this post is for you, in this post we go over the most common tools and concepts you need to know to ace any data engineering interview.

6 Key Concepts, to Master Window Functions

In this post we go over 6 key concepts to help you master window functions. Window functions are one the most powerful features of SQL, they are very useful in analytics and performing operations which cannot be done easily with standard group by, subquery and filters. In spite of this, window functions are not used frequently. If you have ever thought 'window functions are confusing', then this post is for you.

How to Pull Data from an API, Using AWS Lambda

If you are looking for an easy to setup and simple way to automate, schedule and monitor a 'small' API data pull on the cloud, serverless functions are a good option. In this post we cover what a serverless function can and cannot do, what its pros and cons are and walk through a simple API data pull project. We will be using AWS Lambda and AWS S3 for this project.

How to submit Spark jobs to EMR cluster from Airflow

There are many ways to submit an Apache Spark job to an AWS EMR cluster using Apache Airflow. In this post we go over the steps on how to create a temporary EMR cluster, submit jobs to it, wait for the jobs to complete and terminate the cluster, the Airflow-way.