4 Key Patterns to Load Data Into A Data Warehouse

Introduction

Loading data into a data warehouse is a key component of most data pipelines. If you are wondering

How do you handle SQL loads?

What are the patterns used to load data into a data warehouse?

Then this post is for you. In this post, we go over 4 key patterns to load data into a data warehouse. Data pipelines typically use one or a combination of the patterns shown below. Recognizing these patterns can help you design better data pipelines and understand existing ones.

Patterns

1. Batch Data Pipelines

These are data pipelines that run on a schedule, typically at intervals of 5 minutes or more.

1.1 Process => Data Warehouse

[Diagram: Process => Data Warehouse]

This generally involves a single process that extracts data from some source and loads it into a data warehouse. The number of network round trips to the data warehouse should be kept low by batching multiple inserts (or all of them, if process memory allows) into a single database call (i.e., micro/mini batching).

For example, if you are using Python and Postgres (as a data warehouse), you might use psycopg2's execute_values to insert multiple rows with one network call.
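A minimal sketch of this batching approach, assuming psycopg2 and a hypothetical events(id, payload) table; the chunking helper is generic, and only the load function touches the database.

```python
from itertools import islice


def chunks(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


def load(conn, rows, batch_size=1000):
    """Insert rows in batches: one network round trip per batch
    instead of one per row. Assumes a hypothetical events table."""
    from psycopg2.extras import execute_values

    with conn.cursor() as cur:
        for batch in chunks(rows, batch_size):
            execute_values(
                cur,
                "INSERT INTO events (id, payload) VALUES %s",
                batch,
            )
    conn.commit()
```

Tuning batch_size trades memory usage against the number of round trips; a few hundred to a few thousand rows per call is a common starting point.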

Pros:

  1. Simple to set up, run, monitor, and debug.
  2. Large machines can handle a significant amount of data.

Cons:

  1. Does not scale beyond a single machine. This will become an issue if your data size is very large.
  2. Inserts into the data warehouse will be a bottleneck. This can be mitigated by running multiple processes in parallel. However, if you are not careful, this can introduce incorrect ordering of inserts.

1.2 Process => Cloud Storage => Data Warehouse

[Diagram: Process => Cloud Storage => Data Warehouse]

This generally involves a distributed process (sometimes a single process) that writes data in parallel to a cloud storage system. This is then followed by a process to COPY the data in parallel from the cloud storage system into a data warehouse.

Some data warehouses support external tables, which enable reading data directly from S3. This will remove the need to run the COPY command.
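A hedged sketch of the COPY step, assuming an Amazon Redshift-style warehouse and hypothetical names for the table, S3 prefix, and IAM role; other warehouses use similar but not identical syntax.

```python
def build_copy_sql(table, s3_prefix, iam_role):
    """Build a COPY statement that loads all files under an S3 prefix
    in parallel; the syntax here follows Amazon Redshift."""
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS PARQUET"
    )


def run_copy(conn, table, s3_prefix, iam_role):
    """Execute the COPY against a warehouse connection."""
    with conn.cursor() as cur:
        cur.execute(build_copy_sql(table, s3_prefix, iam_role))
    conn.commit()
```

Because the warehouse pulls the files itself, the load is parallelized across its nodes rather than bottlenecked on the client process.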

Pros:

  1. Can load very large amounts of data.
  2. Writing to a cloud storage system and copying from it into a data warehouse are both parallelized, making this approach fast.

Cons:

  1. Managing a distributed systems cluster can be expensive.
  2. With external tables, new partitions can require running an ALTER TABLE ADD PARTITION command.
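To illustrate the partition-registration con above, here is a sketch of the DDL you would run, assuming Hive/Athena-style syntax and a hypothetical dt partition column; the table and location names are illustrative.

```python
def add_partition_sql(table, dt, location):
    """Build the DDL that makes a newly landed partition visible
    to an external table (Hive/Athena-style syntax)."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt = '{dt}') LOCATION '{location}'"
    )
```

In Hive/Athena-style systems, MSCK REPAIR TABLE can discover all partitions at once, but running one ALTER TABLE per new partition is usually much cheaper for incremental loads.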

2. Near Real-Time Data pipelines

These are data pipelines that are constantly running. Typically, the time between data generation and its availability in the data warehouse is less than a minute, e.g., a pipeline that ingests clicks and impressions from a website into a data warehouse.

2.1 Data Stream => Consumer => Data Warehouse

[Diagram: Data Stream => Consumer => Data Warehouse]

This generally involves a data stream (being fed by another process) and a consumer of this data stream. The consumer process consumes records, performs data enrichment (optional), collects a batch of records in memory, and inserts it into a data warehouse.
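The consume-buffer-flush loop can be sketched as below; flush_to_warehouse is a hypothetical callback standing in for the actual batched insert (e.g., the execute_values approach from pattern 1.1).

```python
class BatchingConsumer:
    """Buffer stream records in memory and flush them to the
    warehouse in batches, rather than one insert per record."""

    def __init__(self, flush_to_warehouse, batch_size=500):
        self.flush = flush_to_warehouse
        self.batch_size = batch_size
        self.buffer = []

    def on_record(self, record):
        """Buffer a record; flush once the batch is full."""
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush(self.buffer)
            self.buffer = []

    def close(self):
        """Flush any remaining records on shutdown."""
        if self.buffer:
            self.flush(self.buffer)
            self.buffer = []
```

With a Kafka client, you would typically call on_record for each message in the consumer poll loop and close on shutdown; a production version would also flush on a time interval so a slow stream does not leave records stuck in the buffer.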

Note: Some projects, such as ksqlDB and ClickHouse, let you query data in a Kafka topic directly. However, these are not substitutes for a data warehouse.

Pros:

  1. Wide range of ready-to-use connectors.
  2. Most data warehouses have established connectors (e.g. Snowflake Kafka Connector).
  3. The consumer can be parallelized easily with consumer groups.

Cons:

  1. Pay close attention to at-least-once insertion semantics, as required by your system.
  2. With multiple consumers inserting data into a data warehouse, the order of insertion is not guaranteed to match the order in which records enter the data stream. Handle this carefully if ordering matters for your use case. Ref: Message ordering in Kafka.

2.2 Cloud Storage => Process => Data Warehouse

[Diagram: Cloud Storage => Process => Data Warehouse]

This generally involves data (usually a small batch of rows) landing in a cloud storage system (being fed by another process) and a monitoring system that detects this and triggers a process to insert that data into a data warehouse.

For example, you might set up an S3 trigger that starts a Lambda function to insert the data into a data warehouse when data lands in S3.
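A hedged sketch of such an S3-triggered Lambda handler; insert_into_warehouse is a hypothetical helper that would COPY or insert the new object's rows, and the event shape follows the standard S3 event notification payload.

```python
def handler(event, context):
    """Extract the bucket/key of each newly landed object from the
    S3 event payload, then load it into the warehouse."""
    loaded = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # insert_into_warehouse(bucket, key)  # hypothetical: COPY this object
        loaded.append(f"s3://{bucket}/{key}")
    return {"loaded": loaded}
```

Keeping the handler a thin wrapper around the load call makes it easy to unit test with a sample event dict, without deploying to Lambda.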

Pros:

  1. Most cloud service providers can monitor their cloud storage system and trigger a process in response (usually via serverless functions).
  2. Some data warehouses also support this pattern natively (e.g. Snowpipe).
  3. The raw data stored in the cloud storage system provides additional data redundancy.

Cons:

  1. Pay close attention to limits on the number of processes that can be triggered during high traffic.
  2. With multiple processes inserting data into a data warehouse, the order of insertion is not guaranteed to match the order in which data lands in the cloud storage system. Handle this carefully if ordering matters for your use case.

Conclusion

I hope this article gives you a good idea of the general patterns used to load data into a data warehouse. In most cases, one or a combination of the above patterns is used.

The next time you are building a pipeline to load data into a data warehouse, try one of these loading patterns. You will be surprised by how most tools, cloud providers, and orchestration frameworks support and promote these patterns.

If you have used another pattern or have any questions or comments please leave them in the comments section below.

Further Reading

  1. Make your data pipelines idempotent
  2. Micro/Mini batching
  3. What is Staging
  4. ELT v ETL