What is a staging area?
- 1. Introduction
- 2. What is a staging area
- 3. The advantages of having a staging area
- 4. Conclusion
1. Introduction
If you work with data pipelines, you have probably noticed that most of them include a staging area. If you work in the data space and have questions like
- Why is there a staging area? Can’t we just load data directly into the destination tables?
- Why do we need to store raw/unprocessed data if we already have the cleaned data?
- Isn’t it expensive to store data that we are not going to use?
- Is the data removed from the staging area completely once the transformation has been done?
then this post is for you. In this post, we will go over what exactly a staging area is and why it is crucial for data pipelines.
2. What is a staging area
A staging area is a data pipeline design concept. It refers to an area where the raw/unprocessed data lives, before being transformed for downstream use. Staging areas can be database tables, files in a cloud storage system, etc.
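To make this concrete, here is a minimal sketch of a file-based staging area. The local `staging/` folder and the `extract_from_source()` function are made up for illustration; in practice the staging location might be an S3 prefix or a schema in your warehouse. Each pipeline run writes its raw extract, untouched by any transformation, to a date-partitioned path.

```python
import csv
import datetime
from pathlib import Path


def extract_from_source() -> list[dict]:
    """Placeholder for pulling raw rows from the source system."""
    return [{"user_id": 1, "zipcode": "10001"}, {"user_id": 2, "zipcode": "94105"}]


def stage_raw_data(rows: list[dict], staging_root: str = "staging") -> Path:
    """Write the raw, untransformed extract to a date-partitioned staging path."""
    run_date = datetime.date.today().isoformat()
    out_dir = Path(staging_root) / f"run_date={run_date}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "users_raw.csv"
    with out_file.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return out_file


if __name__ == "__main__":
    print(f"Raw extract staged at {stage_raw_data(extract_from_source())}")
```

Because each run lands in its own `run_date=` partition, the staging area naturally accumulates the historical snapshots discussed in the next section.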
3. The advantages of having a staging area
In most data pipelines, data in the source system is constantly changing. Storing the data from the source in a staging area with every run of the data pipeline provides us with a historical snapshot of the source data.
Let’s assume we have a data pipeline pulling data every day from an application’s database. An application’s database represents the data in its current state. For example, suppose we have a user table with a zipcode column. If the user changes their zipcode, the application will overwrite the existing zipcode value. This is a standard OLTP database design.
Let’s assume we discover an error in a transformation step and have to reprocess the last three months’ data using the correct logic. We do not have point-in-time data for the past three months, since the application database only contains the current state. But if we had stored the extracted data in a staging area, we could run a backfill with the correct logic on the data in the staging area.
Thus, the staging area stores historical snapshots of the source data and removes our dependence on the source system for historical data.
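Here is a minimal sketch of such a backfill, assuming the date-partitioned staging layout from the earlier example and a hypothetical `transform()` function that holds the corrected logic; it simply replays the staged snapshots for the affected date range.

```python
import datetime
from pathlib import Path


def transform(raw_file: Path) -> None:
    """Placeholder for the corrected transformation logic that loads the destination table."""
    print(f"Reprocessing {raw_file} with the fixed logic")


def backfill(staging_root: str, start: datetime.date, end: datetime.date) -> None:
    """Replay staged snapshots in [start, end] through the corrected transformation."""
    day = start
    while day <= end:
        raw_file = Path(staging_root) / f"run_date={day.isoformat()}" / "users_raw.csv"
        if raw_file.exists():  # some runs may be missing; skip them
            transform(raw_file)
        day += datetime.timedelta(days=1)


if __name__ == "__main__":
    today = datetime.date.today()
    backfill("staging", start=today - datetime.timedelta(days=90), end=today)
```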
Here are some more example scenarios where staging areas can be helpful.
- Staging data from scraping websites provides us with a historical snapshot of the website. The staging area is crucial since the website may block scrapers, rate limit calls, etc.
- Data discrepancies can be traced back to the raw source data. This data lineage is crucial when debugging issues with your data pipeline.
- If the incoming data adds a new column after a specific date, we can modify our load script with simple date-based if-else logic to account for this (see the sketch after this list). Date-based logic allows us to handle schema evolution.
- Backfill the destination data by applying new transformation logic to the staging data.
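As referenced above, the date-based check can be as simple as the following sketch. The cutoff date and the referral_code column are made up for illustration: older snapshots that predate the column get a default value so every snapshot exposes the same schema downstream.

```python
import datetime

# Hypothetical cutoff: the source started sending `referral_code` on this date.
REFERRAL_CODE_ADDED_ON = datetime.date(2023, 6, 1)


def normalize_row(row: dict, snapshot_date: datetime.date) -> dict:
    """Give every snapshot the same columns, regardless of when it was extracted."""
    if snapshot_date < REFERRAL_CODE_ADDED_ON:
        # Older snapshots predate the column; default it so the schema stays consistent.
        row = {**row, "referral_code": None}
    return row


# Example: an old snapshot gets the missing column filled in.
old_row = normalize_row({"user_id": 1, "zipcode": "10001"}, datetime.date(2023, 1, 15))
print(old_row)  # {'user_id': 1, 'zipcode': '10001', 'referral_code': None}
```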
Some data pipelines have a raw/base/landing zone, where the data extracted from the source system is stored, and a staging area, where the raw data is transformed to have consistent column names and types. The business logic is then applied to the staging data and tested before being inserted into the final tables.
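Below is a minimal sketch of that layered flow, with hypothetical column names, business rule, and checks: raw rows are standardized into a staging shape, a business rule is applied, and a simple test gates the insert into the final table.

```python
def to_staging(raw_row: dict) -> dict:
    """Standardize column names and types (raw/landing -> staging)."""
    return {
        "user_id": int(raw_row["USER_ID"]),       # e.g. raw extract uses upper-case names
        "zipcode": str(raw_row["ZIP"]).zfill(5),  # restore leading zeros dropped upstream
    }


def apply_business_logic(staging_row: dict) -> dict:
    """Hypothetical business rule: flag users in a particular region."""
    row = dict(staging_row)
    row["is_east_coast"] = row["zipcode"].startswith(("0", "1", "2"))
    return row


def passes_tests(row: dict) -> bool:
    """Basic quality checks before inserting into the final table."""
    return row["user_id"] > 0 and len(row["zipcode"]) == 5


raw_rows = [{"USER_ID": "1", "ZIP": "2139"}, {"USER_ID": "2", "ZIP": "94105"}]
final_rows = [
    r for r in (apply_business_logic(to_staging(x)) for x in raw_rows) if passes_tests(r)
]
print(final_rows)
```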
4. Conclusion
Hope this article gives you a good understanding of what a staging area is and why it’s crucial in a data pipeline. The next time you are building a data pipeline, make sure to have a staging area, and your future self will thank you.
If you have any questions or comments, please leave them in the comment section below.
Hello, great post as usual! Just wanted to know, is the staging area a data lake?
Thank you. Not exactly; a staging area is a data pipeline design concept and can be a file location, a table, etc. A data lake is a location (usually on cloud storage) where one can store any type of data. A data lake is usually offered as an alternative to a central data warehouse (sometimes in addition to it).
Your data lake may contain a staging area for your data pipelines. This article gives a good overview of a data lake: https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
Hope this helps :)
Great post!
https://www.youtube.com/watch?v=sLhInuwdwcc&ab_channel=SeattleDataGuy
Joseph, a question for you. In this YouTube video from 'Seattle Data Guy', he recommends data be moved from OLTP system -> raw DB -> stg -> DWH.
It seems his 'raw' database is what you call staging, and his 'staging' database is where he runs some business logic and does error checking.
Once the stg tables have passed error checks, then it is moved to PRD.
Have you seen this type of pattern before? Is this essentially what you're describing as well?
Hi Justin, Yes, you are right and I have seen this pattern before. I have seen it called raw, base, etc. In the post, I talk only about non-transformed staging, but you can also have multiple "intermediate staging" tables that are quality tested post-transformation and before inserting into the final production table.
Hope this helps :)
Thank you for the post!
Thank you!
Hope this article gives you a good understanding of what a staging area is and why it’s crucial in a data pipeline.
nice post!
While all this may feel overwhelming, it doesn't have to be.
That's absolutely right!
But, if we had stored the extracted data in a staging area, we can run a backfill with the correct logic on the data in the staging area.
If the incoming data adds a new column after a specific date, we can modify our load script with a simple date-based if-else logic to account for this. Having date-based logic allows us to handle schema evolution.
Hi Joseph, great work! I'm trying to grasp why the process is called "ETL" when we move data from source to raw table if we are not doing any transformations. From my understanding, we are just extracting the data from the source and loading into the raw table then doing transformation in the target system.
Hi, Thank you. The term "ETL" is often used to denote moving data from source to destination, with or without transformations. But you are right; if we want to be more specific, we would say "ELT". IMO these terms are overloaded; depending on the architecture, we might have ETLT... The point of this article is the storage of raw data as is. Hope this helps.
Why do we need to store raw/unprocessed data if we already have the cleaned data?
Hey Joseph, great article! Was wondering what a good rule of thumb is regarding the number of historical snapshots which should be stored in the staging area? In other words, how far back in time should data be staged for? Is it sufficient to only stage data used in the most recent data pipeline run, then overwrite and maintain a single most recent snapshot in future pipeline runs? I'm thinking storing multiple historical snapshots would be overkill in cases when there is a large amount of data. Thoughts on this?
Hey, Thank you. The time range for which data is stored in the staging area depends on the company. Typically it's anywhere from 3-6 years. After this time the data is moved into an archive system such as AWS Glacier, etc.
You can store only the single most recent snapshot, but this will limit your ability to reprocess the raw data with new logic. Given that data storage is cheap, companies tend to keep raw data around rather than delete it.
Also, the staging area can be any cloud storage, not necessarily in your database. This way the cost is even lower and the data is readily available if there is an issue down the line.
Hope this helps. LMK if you have more questions.