Writing memory efficient data pipelines in Python

Working with a dataset that is too large to fit in memory? Then this post is for you. In this post, we will write memory-efficient data pipelines using Python generators. We also cover the common generator patterns you will need for your data pipelines.
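As a rough illustration of the idea, here is a minimal sketch of a generator pipeline. The function names and the two-column `name,amount` input format are assumptions made for this example, not code from the post. Each stage pulls one row at a time from the stage before it, so the full dataset is never held in memory.

```python
from typing import Iterable, Iterator, List


def read_rows(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one parsed row at a time instead of building a full list."""
    for line in lines:
        yield line.rstrip("\n").split(",")


def keep_valid(rows: Iterable[List[str]]) -> Iterator[List[str]]:
    """Drop rows that do not have a numeric amount in the second column."""
    for row in rows:
        if len(row) == 2 and row[1].isdigit():
            yield row


def to_amounts(rows: Iterable[List[str]]) -> Iterator[int]:
    """Convert the amount column of each row to an integer."""
    for row in rows:
        yield int(row[1])


def total_amount(lines: Iterable[str]) -> int:
    """Chain the stages; nothing runs until sum() starts pulling values."""
    return sum(to_amounts(keep_valid(read_rows(lines))))


# Hypothetical sample input; an open file object would work the same way,
# because a file is also an iterable of lines.
sample_lines = ["a,10\n", "b,oops\n", "c,32\n"]
pipeline_total = total_amount(sample_lines)
```

Passing an open file handle instead of a list keeps memory use roughly constant regardless of file size, since only one line is in flight at a time.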

How to trigger a spark job from AWS Lambda

Wondering how to execute a Spark job on an AWS EMR cluster, based on a file upload event on S3? Then this post is for you. In this post we go over how to trigger Spark jobs on an AWS EMR cluster, using AWS Lambda. The Lambda function will execute in response to an S3 upload event. We will go over this event-driven pattern with code snippets and set up a fully functioning pipeline.

How to set up a dbt data-ops workflow, using dbt cloud and Snowflake

Setting up an ELT data-ops workflow with multiple environments for developers is often extremely time consuming. What if there was a way to speed up this process, so that you could concentrate on modeling your data and delivering value to your end users? The good news is that there is a way. You can leverage dbt cloud to set up an ELT data-ops workflow in a very short time. In this post, we cover how to set up a data-ops workflow for an ELT system. We will go over how to set up dbt, Snowflake, CI and scheduled jobs. This data-ops workflow can be easily modified and built upon as your data team's needs evolve.

Apache Superset Tutorial

Spending hundreds of thousands of dollars on vendor BI tools? Looking for a clean open source alternative? Then this post is for you. In this post we go over Apache Superset, which is one of the most popular open source visualization tools. We will go over its architecture and build charts and dashboards to visualize data. We will end with a list of pros and cons of using an open source visualization tool like Apache Superset.

How to Join a fact and a type 2 dimension (SCD2) table

Wondering how to store a dimension table's history over time and how to join these historical dimension tables with fact tables for analytical querying? Then this post is for you. In this post, we will go over a popular dimension modeling technique called SCD2, which preserves historical changes. We will also see how to join a fact table with an SCD2 table to get accurate point-in-time information.
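To make the point-in-time join concrete, here is a minimal sketch using an in-memory SQLite database. The table names, column names (`valid_from`, `valid_to`), the half-open validity interval, and the `9999-12-31` end date for the current row are all assumptions chosen for this example; the post itself may use a different schema or database.

```python
import sqlite3


def point_in_time_orders(conn: sqlite3.Connection) -> list:
    """Join each order to the dimension row that was valid on the order date."""
    query = """
        SELECT f.order_id, d.city
        FROM fact_orders f
        JOIN dim_customer d
          ON f.customer_id = d.customer_id
         AND f.order_date >= d.valid_from
         AND f.order_date < d.valid_to
        ORDER BY f.order_id
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical SCD2 dimension: one row per version of a customer.
    CREATE TABLE dim_customer (
        customer_id INTEGER, city TEXT, valid_from TEXT, valid_to TEXT
    );
    CREATE TABLE fact_orders (
        order_id INTEGER, customer_id INTEGER, order_date TEXT
    );
    INSERT INTO dim_customer VALUES
        (1, 'Boston', '2020-01-01', '2020-06-01'),
        (1, 'Denver', '2020-06-01', '9999-12-31');
    INSERT INTO fact_orders VALUES
        (100, 1, '2020-03-15'),
        (101, 1, '2020-08-20');
    """
)
scd2_rows = point_in_time_orders(conn)
```

The range condition on the join is what picks exactly one version of the customer per order. ISO-formatted date strings compare correctly as text, which is why plain `TEXT` columns are enough in this sketch.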

How to update millions of records in MySQL?

When updating a few records in an OLTP table, we just use the UPDATE command. But what if we have to update millions of records in an OLTP table? If you run a large update, your database will lock those records and other transactions may fail. In this post we look at how a large update can cause a lock timeout error and how running batches of smaller updates can eliminate this issue.
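The batching idea can be sketched as follows. This example uses SQLite only so that it runs without a database server; it does not reproduce MySQL's locking behavior, and the `users` table, its columns, and the batch size are hypothetical. The point it shows is the shape of the loop: update one primary-key range, commit, then move to the next range, so each transaction stays short.

```python
import sqlite3


def update_in_batches(conn: sqlite3.Connection, batch_size: int) -> int:
    """Update rows in primary-key ranges, committing after each batch."""
    max_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM users").fetchone()[0]
    batches = 0
    start = 1
    while start <= max_id:
        end = start + batch_size - 1
        conn.execute(
            "UPDATE users SET status = 'inactive' WHERE id BETWEEN ? AND ?",
            (start, end),
        )
        conn.commit()  # end the transaction so only this batch's rows were held
        batches += 1
        start = end + 1
    return batches


# Hypothetical table with 1,000 rows standing in for "millions of records".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO users (id, status) VALUES (?, 'active')",
    [(i,) for i in range(1, 1001)],
)
conn.commit()

batches_run = update_in_batches(conn, batch_size=100)
remaining_active = conn.execute(
    "SELECT COUNT(*) FROM users WHERE status = 'active'"
).fetchone()[0]
```

Ranging over the primary key keeps each batch's `WHERE` clause index-friendly. Against a real MySQL server, a short pause between batches may also help other transactions get through, though the right batch size and pause depend on the workload.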

How to unit test sql transforms in dbt

Using dbt you can test the output of your SQL transformations. If you have wondered how to "unit test" your SQL transformations in dbt, then this post is for you. In this post, we go over how to write unit tests for your SQL transformations with mock inputs/outputs and test them locally. This helps keep the development cycle shorter and enables you to follow a TDD approach for your SQL-based data pipelines.

How to Backfill a SQL query using Apache Airflow

Wondering how to backfill an hourly SQL query in Apache Airflow? Then this post is for you. In this post we go over how to manipulate the execution_date to run backfills with any time granularity. We use an hourly DAG to explain execution_date and how you can manipulate it using Airflow macros.

How to do Change Data Capture (CDC), using Singer

Change data capture (CDC) is a software design pattern in which we track every change (update, insert, delete) to the data in a database and replicate it to other database(s). In this post we will see how to do CDC by reading data from database logs and replicating it to other databases, using the popular open source Singer standard.
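The core of the pattern, replaying an ordered stream of changes against a copy of the data, can be sketched in a few lines. The event shape used here (`op`, `id`, `data`) is a hypothetical format invented for this example; it is not the Singer message format, and a plain dictionary stands in for the target database.

```python
from typing import Any, Dict, Iterable


def apply_changes(
    replica: Dict[int, Dict[str, Any]],
    changes: Iterable[Dict[str, Any]],
) -> Dict[int, Dict[str, Any]]:
    """Replay change events, in order, against an in-memory replica."""
    for change in changes:
        op, key = change["op"], change["id"]
        if op in ("insert", "update"):
            # Merge the changed columns into the existing row, if any.
            replica[key] = {**replica.get(key, {}), **change["data"]}
        elif op == "delete":
            replica.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return replica


# Hypothetical change log, in the order the changes happened at the source.
change_log = [
    {"op": "insert", "id": 1, "data": {"name": "a"}},
    {"op": "insert", "id": 2, "data": {"name": "b"}},
    {"op": "update", "id": 1, "data": {"name": "c"}},
    {"op": "delete", "id": 2},
]
replica_state = apply_changes({}, change_log)
```

Order matters here: applying the same events in a different sequence could leave the replica in a different state, which is why log-based CDC reads changes in the order the source database recorded them.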