What and Why Staging

Introduction

If you starting with or have been working with data pipelines, you might have noticed a staging area being used in most data pipelines. I have seen questions regarding this staging area, like

Why is there a staging area ? Can’t we just load data into the destination tables ?

Is the data removed from the staging area completely once the transformation has been done ?

In this post we will go over what exactly a staging area is and why it is necessary for almost all data pipelines.

youtube version

What is staging

Staging is a concept in building data pipelines. It refers to storing the raw/unprocessed data in a certain location before processing it, as per our data model requirements and keeping it there even after processing. The location may be file storage systems such as AWS S3 or database tables usually named something like stg_*. In most companies it is a mix of these types or more of storage systems, which are used to store this raw/unprocessed data.

What is staging area

Why

This begs the question of Why, Why do we need to store the raw/unprocessed data if we already have the cleaned data. Isn’t it expensive to store data that we are not going to use ?. There are many reasons to preserve the raw/unprocessed data(as we will see below), but the main concept is the decoupling of historical data dependency between the source system and data pipeline. This means our data pipeline should be able to rerun the data cleaning process on the raw/unprocessed data without having to get the data again from the source systems, a use case of why this is beneficial is explained below

Let’s say you build a data pipeline that does some data cleaning based on some business calculation logic and this data is pulled every day from our application database which contains the application data for that day. 3 months later the business realizes they had made some error in the calculation logic and want us to recreate the data with the correct calculation. If you did not have a staging area, this would be impossible (one can argue that we can try to recreate the data from backups and logs, but that is a high effort engineering task) since your pull from the application database will only reflect the current data of the application, not the data in the application database that was present 3 months ago. But if you had stored the raw/unprocessed data in a staging area, all you will have to do is run a backfill (which is the process of reprocessing raw data based on a new logic).

Other such use cases are

  1. If the data vendor you get data from sends wrong data for a particular day and later resends the correct data for that day, you will need to reprocess the data for that day. This does not necessarily require a staging area, but having it designed this way, easily allows you to reprocess the correct data for that particular time frame(a single day in this example).
  2. If you are scraping websites to capture data, it is ideal to store that information since the website may change later or block scrapers, rate limit calls etc.
  3. If the business user notices an error in the cleaned data, we can easily track it back to the source data by looking up where the raw data is stored. This preservation of data lineage (aka how the cleaned data relates to the raw data) is crucial when debugging issues with your data pipeline.
  4. If the incoming data adds a new column after a certain date, we can modify our load script with a simple date based if-else logic to account for this. This is called schema evolution.

In the diagram below, the numbered boxes indicate

  1. data vendor reprocessing of data for a particular day in the past
  2. reprocessing website data after a realizing an error in the transformation logic
  3. reprocessing historical application data, after changing the transformation logic

The numbered blocks are rerun for the offending date(s) along with recreating denorm_data for those date(s).

Use cases

Conclusion

In most companies the data in staging area is stored for 3-5 years and then it is moved to a data archive service such as Amazon S3 Glacier. Hope this post gives you an understanding of what staging is, why we need a staging area and its benefits. Let me know if you have any questions in the comment section below.