What and Why Staging

This post explains what a staging area is in a data pipeline. It covers the reasons for having one and walks through some common use cases where a staging area can save engineering effort and time.

What is a Data Warehouse

This post explains what the term data warehousing means. It presents a simple e-commerce relational data model and shows how that model has to change to support analytical queries. It also covers the reasons for using a data warehouse and how to choose an appropriate database for your project.

Ensuring Data Quality, With Great Expectations

Ensure your data meets basic and business-specific data quality constraints. In this post we go over a data quality testing framework called Great Expectations, which provides powerful functionality to cover the most common test cases, along with the ability to group tests together and run them.

Designing a "low-effort" ELT system, using stitch and dbt

With the advent of powerful data warehouses like Snowflake, BigQuery, Redshift Spectrum, etc. that allow separation of storage and execution, it has become very economical to store data in the data warehouse and then transform it as required. This post goes over how to design such an ELT system using Stitch and dbt. The main objective is to keep code complexity and server management low, while automating as much as possible.

3 Key techniques, to optimize your Apache Spark code

This post covers key techniques to optimize your Apache Spark code. You will learn what distributed data storage and distributed data processing systems are, how they operate, and how to use them efficiently. Go beyond the basic syntax and learn 3 powerful strategies to drastically improve the performance of your Apache Spark project.

Change Data Capture Using Debezium Kafka and Pg

A change data capture tutorial using Debezium, Kafka, and Postgres. Change data capture is a software design pattern used to capture changes to data and take a corresponding action based on each change. The change is usually one of create, update, or delete. The corresponding action usually occurs in another system, in response to the change that was made in the source system.
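To make the pattern concrete, here is a minimal sketch in plain Python, with no Debezium or Kafka involved. It assumes each change event is a dictionary with an `op` field (`"c"` for create, `"u"` for update, `"d"` for delete) plus `before` and `after` row images; that shape is loosely modelled on Debezium's event envelope but is simplified for illustration. The `apply_change` function, the sample rows, and the in-memory `replica` are all hypothetical stand-ins for the "other system" that reacts to changes.

```python
from typing import Any, Dict, List

# Assumed, simplified event shape: {"op": ..., "before": ..., "after": ...}
ChangeEvent = Dict[str, Any]
Row = Dict[str, Any]


def apply_change(replica: Dict[int, Row], event: ChangeEvent) -> None:
    """Apply one captured change to a downstream copy keyed by row id."""
    op = event["op"]
    if op in ("c", "u"):
        # Create and update both carry the new row image in "after".
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":
        # Delete carries the last known row image in "before".
        row = event["before"]
        replica.pop(row["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


# Hypothetical stream of changes captured from a source table.
events: List[ChangeEvent] = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ada"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "Bob"}},
    {"op": "u", "before": {"id": 1, "name": "Ada"},
     "after": {"id": 1, "name": "Ada L."}},
    {"op": "d", "before": {"id": 2, "name": "Bob"}, "after": None},
]

replica: Dict[int, Row] = {}
for event in events:
    apply_change(replica, event)

print(replica)
```

In a real pipeline the events would arrive from a message broker rather than a list, and the action could be anything from updating a cache to loading a warehouse table; the sketch only shows the capture-then-react shape of the pattern.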