What is a Data Warehouse

Common questions that I see asked among people trying to get into data engineering are:

What is data warehousing?

Should I use Redshift, BigQuery, or Synapse for my data warehouse?

When should I start using a data warehouse?

In this post we will assume that we work for an e-commerce startup, look at its evolving data needs, and see how a data warehouse fits into the context of this business. By the end of this post you should have a clear understanding of what data warehousing means and when to use it.

What is a Data Warehouse

Let’s understand what a data warehouse is and when to use one by assuming that we work for an e-commerce startup. You will see how a simple website can evolve over time to require more complex data capabilities, and how data warehousing fits into this.

youtube version

Step 1: Setting up the e-commerce website

For our e-commerce website, let’s assume we have customers who can log in and place orders for items. The items are created by merchants through the merchant portal. The basic design looks something like the image below.

basic ecommerce website

Let the customer_details, order_details, merchant_details, and item_details tables look as shown below (a rough DDL sketch follows the table definitions).

| customer_details column | definition |
| --- | --- |
| customer_id | unique identifier for a customer (primary key) |
| customer_first_name | first name of the customer |
| customer_last_name | last name of the customer |
| customer_create_datetime | Date and time when the customer was created |
| customer_update_datetime | Date and time when this customer data was last modified |
| customer_delete_datetime | Date and time when a customer was removed from the e-commerce website |
| merchant_details column | definition |
| --- | --- |
| merchant_id | unique identifier for the merchant (primary key) |
| merchant_name | merchant name |
| merchant_location | registered location of this merchant |
| merchant_create_datetime | Date and time when this merchant was created |
| merchant_update_datetime | Date and time when this merchant’s attribute was last updated |
| merchant_delete_datetime | Date and time when a merchant was removed from the e-commerce website |
| item_details column | definition |
| --- | --- |
| item_id | unique identifier for the item (primary key) |
| item_name | item name |
| item_type | type of item (e.g. grocery, appliance, etc) |
| item_price | registered price of this item |
| item_create_datetime | Date and time when this item was created |
| item_update_datetime | Date and time when this item’s attribute was last updated |
| item_delete_datetime | Date and time when an item was removed from the e-commerce website |
| order_details column | definition |
| --- | --- |
| order_id | unique identifier for the order (primary key) |
| customer_id | customer identifier (foreign key to customer_details) |
| item_id | item identifier (foreign key to item_details) |
| merchant_id | merchant identifier (foreign key to merchant_details) |
| is_promotion | Boolean flag denoting if this order has any promotion applied |
| order_status | current status of the order (one of placed, in-transit, delivered) |
| order_create_datetime | Date and time when an order was created |
| order_update_datetime | Date and time when an order status change was made |
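
To make the schemas above concrete, here is a minimal sketch of how two of these application tables could be defined. SQLite is used purely so the snippet is self-contained; the actual database engine, column types, and constraints used by the website are assumptions, not something specified in this post.

```python
# Illustrative only: in-memory SQLite definitions of two of the
# application (OLTP) tables described above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_details (
    customer_id              INTEGER PRIMARY KEY,
    customer_first_name      TEXT,
    customer_last_name       TEXT,
    customer_create_datetime TEXT,
    customer_update_datetime TEXT,
    customer_delete_datetime TEXT
);

CREATE TABLE order_details (
    order_id              INTEGER PRIMARY KEY,
    customer_id           INTEGER NOT NULL REFERENCES customer_details (customer_id),
    item_id               INTEGER NOT NULL,
    merchant_id           INTEGER NOT NULL,
    is_promotion          INTEGER NOT NULL DEFAULT 0,  -- boolean flag
    order_status          TEXT CHECK (order_status IN ('placed', 'in-transit', 'delivered')),
    order_create_datetime TEXT,
    order_update_datetime TEXT
);
""")
```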

Step 2: Request for analytical query capabilities

As the e-commerce business grows, business analysts (or similar users) will want to access data to create models, assess performance, optimize strategy, etc. (this is generally referred to as analytics). Some basic questions the analytics users may want answered are:

  1. How many orders were placed on average per customer in a given time period? This is usually asked to figure out holiday shopping trends.
  2. Who are the top 10 merchants that have sold the most items? This may be asked in order to incentivize high-performing merchants, etc.
  3. What was the average time for an order to move from being placed to being delivered during the month of July 2020?

There will be more such questions that need data. As you can see, question 1 can be answered by running a query on the customer_database, but for question 2 you would need to join tables across the merchant_database and the customer_database. For question 3 you would need historical order status data (aka point in time information). In such scenarios you would want to create a data pipeline to move data from multiple databases into a single database (aka a data warehouse) that analysts can use to answer these questions. In general there are 3 main reasons to move data from the source databases (merchant_database and customer_database) to a single database (aka data warehouse); they are listed below, and a sketch of such a pipeline follows the list.

  1. To prevent overloading the application databases. In our e-commerce website we need to ensure that heavy queries (which most analytic queries are) do not run against the application databases, and thus keep the latency low for the website.
  2. To enable joins across tables spanning multiple databases, to have a single database for all analytic needs, and to serve a high number of read requests from analytic users.
  3. To store point in time information that can be analyzed later.
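
Here is a minimal sketch of such a pipeline in Python. The connection strings, the staging table names (stg_*), and the choice of pandas with SQLAlchemy are illustrative assumptions; in practice this step is often handled by a scheduled job in an orchestrator or by a managed ingestion service.

```python
# A rough extract-and-load sketch: copy data out of the application (OLTP)
# databases and into the single analytics database (the data warehouse).
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection strings for the source databases and the warehouse.
customer_db = create_engine("postgresql://user:password@customer-db:5432/app")
merchant_db = create_engine("postgresql://user:password@merchant-db:5432/app")
warehouse = create_engine("postgresql://user:password@warehouse:5439/analytics")

# Extract from the application databases ...
orders = pd.read_sql("SELECT * FROM order_details", customer_db)
merchants = pd.read_sql("SELECT * FROM merchant_details", merchant_db)

# ... and load into staging tables in the warehouse, where they can later be
# joined across sources and transformed into dimension and fact tables.
orders.to_sql("stg_orders", warehouse, if_exists="replace", index=False)
merchants.to_sql("stg_merchants", warehouse, if_exists="replace", index=False)
```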

The data flow would look something like the image below

basic Data Warehouse

| dim_customer column | definition |
| --- | --- |
| customer_id | unique identifier for a customer |
| customer_first_name | first name of the customer |
| customer_last_name | last name of the customer |
| start_datetime | Date and time from which this customer detail was valid |
| end_datetime | Date and time until which this customer detail was valid |

Note: The above dimension model is based on SCD2 (slowly changing dimension, type 2) to preserve point in time information. Also note that in the above dimension table we do not use customer_id as a primary key. The reason is that with SCD2 we create a new row whenever an attribute changes; this way we can get the data that was valid for a given period of time. For example, the dim_customer SCD2 table will have data as shown below.

| customer_id | customer_first_name | customer_last_name | start_datetime | end_datetime |
| --- | --- | --- | --- | --- |
| 1 | John | Doe | 2020-08-10 12:00:00 | 2020-08-20 12:00:00 |
| 1 | John | Joe | 2020-08-20 12:00:00 | NULL |

As you can see from the above data, if customer 1 decides to change their name, we can keep track of when that change was made and what the customer’s name was before the change.
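
Below is a minimal sketch (using an in-memory SQLite database purely for illustration) of how an SCD2 change is typically applied: the currently valid row is closed by setting its end_datetime, a new row is inserted, and a point in time lookup then filters on the validity window.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_id         INTEGER,
    customer_first_name TEXT,
    customer_last_name  TEXT,
    start_datetime      TEXT,
    end_datetime        TEXT   -- NULL means "currently valid"
);
INSERT INTO dim_customer VALUES (1, 'John', 'Doe', '2020-08-10 12:00:00', NULL);
""")

# Customer 1 changes their last name to 'Joe' on 2020-08-20 12:00:00:
# close the currently valid row, then insert the new version.
change_ts = "2020-08-20 12:00:00"
conn.execute("UPDATE dim_customer SET end_datetime = ? "
             "WHERE customer_id = 1 AND end_datetime IS NULL", (change_ts,))
conn.execute("INSERT INTO dim_customer VALUES (1, 'John', 'Joe', ?, NULL)", (change_ts,))

# Point in time lookup: what was the customer's name on 2020-08-15?
row = conn.execute("""
    SELECT customer_first_name, customer_last_name
    FROM dim_customer
    WHERE customer_id = 1
      AND start_datetime <= '2020-08-15 00:00:00'
      AND (end_datetime IS NULL OR end_datetime > '2020-08-15 00:00:00')
""").fetchone()
print(row)  # ('John', 'Doe')
```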

Other dimension tables will be similar to the one shown above

| fact_orders column | definition |
| --- | --- |
| order_event_id | unique identifier for this particular event |
| order_id | order identifier |
| customer_id | customer identifier |
| item_id | item identifier |
| merchant_id | merchant identifier |
| is_promotion | Boolean flag denoting if this order had any promotion applied at the time this event occurred |
| order_status | point in time information denoting the order status |
| order_event_datetime | Date and time when this event occurred |

You can see that we have fact (fact_) and dimension (dim_) tables. This is called a star schema, because it is often visualized as a central fact table connected to multiple dimension tables, resembling a star-like structure (this approach is also known as dimensional modeling). The idea is to store each point in time occurrence of an event; in our case we will have an event when an order is placed, one when it goes in transit, and finally one when it is delivered. In the fact table you treat each individual event as a separate row.

This enables the analytic user to ask questions with a time component, as shown in one of the questions above: What was the average time for an order to move from being placed to being delivered during the month of July 2020? This cannot be answered from the order_details table in the customer_database, since that table only contains the most current state of each order. These point in time events are stored in fact tables, which hold key references to other details, such as customer details, stored in dimension tables. See this dimension modeling reference for more details.
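
As an example, question 3 from above could be answered from fact_orders roughly as sketched below. The warehouse connection string is a placeholder, and the timestamp arithmetic is PostgreSQL/Redshift-style SQL, so the exact functions may differ for your warehouse.

```python
# Average time from the 'placed' event to the 'delivered' event for orders
# placed during July 2020, computed by self-joining the fact table.
import pandas as pd
from sqlalchemy import create_engine

warehouse = create_engine("postgresql://user:password@warehouse:5439/analytics")

avg_delivery_time = pd.read_sql("""
    SELECT AVG(delivered.order_event_datetime - placed.order_event_datetime)
               AS avg_time_to_deliver
    FROM fact_orders AS placed
    JOIN fact_orders AS delivered
      ON placed.order_id = delivered.order_id
    WHERE placed.order_status = 'placed'
      AND delivered.order_status = 'delivered'
      AND placed.order_event_datetime >= '2020-07-01'
      AND placed.order_event_datetime <  '2020-08-01'
""", warehouse)
print(avg_delivery_time)
```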

In some cases we may want to load data from other data sources directly into our data warehouse as well; e.g. if we use cookies to record clickstream data from our website, that data should probably be loaded into a clickstream events fact table in our warehouse.

So we have seen how a business’s data needs evolve to require data suitable for analytics. The concepts, processes, and techniques involved in creating tables to run analytical queries on, and in writing data pipelines to move data into the data warehouse, are collectively called data warehousing.

If interested, you can check out this article, which shows how to use managed open source tools and cloud services to set up a simple data warehouse.

OLTP vs OLAP database

Now that we have seen what a data warehouse is and why we may want to set one up, we can understand what OLTP and OLAP databases are.

OLTP database

  1. Stands for Online Transaction Processing.
  2. These are databases specifically designed for fast CREATE, READ, UPDATE, and DELETE (aka CRUD) operations on a small number of rows.
  3. Some popular examples are PostgreSQL, MySQL, etc.

In our example, the merchant_database and customer_database would be OLTP databases.

OLAP database

  1. Stands for Online Analytical Processing.
  2. These are databases specifically designed for running analytic queries, which usually read a specific set of columns and involve group bys and multiple joins across fact and dimension tables on a large number of rows; the type of read-only queries that analytic users need to run.
  3. Some popular examples are Hive, Redshift, Snowflake, BigQuery, etc.

In our example, our data warehouse would be an OLAP database.

A suitable analogy would be cars. Although you could use a race car to move your belongings across the country, or a moving van to try and win a race, you would not want to do that. The idea is to use the right tool for the job.

car analogy

Similarly, you can use an OLTP database to run large analytical queries, or an OLAP database to run CRUD-style transactions, but you would not want to do that.
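
To make the contrast concrete, here are two rough, hypothetical examples of the access patterns, reusing the tables defined earlier in this post (dim_merchant stands in for one of the “other dimension tables” mentioned above, and the order_id value is made up). The exact SQL dialect depends on the database you use.

```python
# OLTP-style access: touch a single row, as the website does when an order
# ships. This is what the application databases are optimized for.
oltp_query = """
    UPDATE order_details
    SET order_status = 'in-transit', order_update_datetime = NOW()
    WHERE order_id = 12345;
"""

# OLAP-style access: scan and aggregate many rows across joins, as an
# analyst does when ranking merchants by items sold. This is what the
# data warehouse is optimized for.
olap_query = """
    SELECT m.merchant_name, COUNT(*) AS items_sold
    FROM fact_orders AS f
    JOIN dim_merchant AS m ON f.merchant_id = m.merchant_id
    WHERE f.order_status = 'delivered'
    GROUP BY m.merchant_name
    ORDER BY items_sold DESC
    LIMIT 10;
"""
```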

Conclusion

I hope this article gave you a good understanding of what a data warehouse is and how to use one. If you have any questions or comments, let me know in the comments section below.