Data Pipeline Design Patterns - #1. Data flow patterns

1. Introduction

Data pipelines can become flakey over time if the data pipeline design foundations are not solid. If you are

Wondering what design patterns that people follow for a typical data pipeline

Looking for learning materials for data pipeline design patterns

Curious about the design patterns other people use besides the ones you already use

Confused about the never-ending praise in favor of idempotency and not its actual benefits

Then this post is for you. This post will cover the typical data flow design patterns. We will learn about the pros and cons of each design pattern, when to use them, and, more importantly, when not to use them.

By the end of this post, you will have an overview of the typical data flow patterns.

Note that this is a 2-part series; in the next post, we will review the common coding patterns.

Choosing your data pipeline design:

graph TD
    A{Need historical data in output} -->|No| C[Non-replayable source ok]
    A -->|Yes| B[Replayable source]
    B --> D{Size of data to be pulled}
    C --> D
    D -->|large| E[Time ranged]
    D -->|Small| F[Full Snapshot]
    D -->|Only past n period| G[Lookback]
    D -->|Streaming data| G1[Streaming]
    E --> H{Transformational complexity}
    F --> H
    G --> H
    G1 --> H
    H -->|Standard| I[Multi hop arch]
    H -->|Transformation logic depends on the time of run or input values| J[Conditional flow arch]
    H -->|Multiple teams| K[Disconnected pipelines]
    I --> L{Is sink append only}
    J --> L
    K --> L
    L --> |Yes| M[Non-overwritable sink ok]
    L --> |No| N[Overwritable sink]
    B --> |Idempotent pipeline possible| N

The above decision chart shows a simple way to choose your data pipeline design, be sure to read about them in detail to understand the nuances.

2. Source & Sink

Before designing our pipeline, we must understand the inputs and outputs available to us.

Source: Systems which provide input(s) to your data pipeline.
Sink: Systems where your data pipeline stores the processed data.

2.1. Source Replayability

If the source system can answer the question What did the data look like n periods(min/hour/day/months/years) ago?(n can be any reasonable arbitrary number), then it is replayable, else it is not. Replayability is critical for backfills.

Replayable sources: Event stream, web server logs, a dump of database WAL showing all the create/update/delete statements (CDC), etc

Nonreplayable sources: Application tables that are constantly modified, APIs, etc., that only provide the current state of the data.

One can create a replayable source from a non-replayable source by dumping the incoming data into a raw/loading/base area. The raw/loading/base area now acts as a replayable source.

However, if our dumping process only happens periodically (say every day), then the degree of replayability is reduced since the answer to the questions what did the data look like five hours ago? and what did the data look like one day ago? will be the same, this is fine for most data pipelines, but is something to keep in mind.

2.2. Source Ordering

Ordering refers to the order in which data is pulled from a source system. Some data pipelines require the incoming events to be in a particular order. E.g., attributing checkout to a specific click, the log-out event must occur after the log-in event, etc.

Use techniques such as exponential backoff, watermarking, or handling late events to deal with out of order events.

2.3. Sink Overwritability

Overwritability is the ability of a system to update specific rows of existing data. For example, if the sink is a database and the data has a unique key, it can be overwritable based on that key. If your sink is a cloud storage and you are storing the data namespaced by a unique run id, they are overwritable.

Overwritability is crucial to prevent duplicates or partial records when a pipeline fails.

Overwritable sinks: Database tables with unique keys, cloud storage system namespaced by unique run ids (e.g., s3://your-bucket/yyyy-mm-dd/data_yyyy_mm_dd.csv for a pipeline run daily), etc.

Non-overwritable sinks: Kafka queue (without log compaction), “create only” access sinks, database tables without a unique key, etc.

3. Data pipeline patterns

In this section, we will go over extraction, behavior, & structural patterns. One can combine these patterns based on your use case. For example, you might have a data pipeline that is self-healing (behavior), pulls a full snapshot (extraction), and uses multi-hop (structural) architecture.

3.1. Extraction patterns

In this section, we will go over the common patterns used to pull data from a source.

3.1.1. Time ranged

With a time ranged pull, the data pipeline only pulls the data corresponding to a specific time frame.

E.g., A daily pipeline (run at 12:01 AM every day) will pull data for the entire yesterday.

When you do time ranged pull, you must update your sink reasonably. For example, if you are building a SCD2 table and there is an error with your pipeline you will need to pause further pulls before rerunning the current one to prevent corruption of the SCD2 table.

Pros

Fast data pulls, as your pipeline only pulls the necessary data.
The data pulls for multiple pipeline runs (in case of backfills) can be parallelized.

Cons

Incremental loads into sinks can be challenging to build (UPSERTs & MERGE INTOs).
If the source data is not replayable, rerunning a time ranged pull will produce results different from the first time, causing downstream confusion.

3.1.2. Full Snapshot

In a full snapshot pull, the entire data is pulled from the source.

If you need historical data, add a run_id column (or a new folder in the cloud storage system) that uniquely identifies each pipeline run. You can overwrite the data with each run if you do not need historical data.

Are you pulling data from a production table? Make sure you are pulling from a replica database!

Pros

Simple to build and maintain.
Easy to track when a data issue started, especially if you store history.
Ability to see historical changes.
Well suited for dimensional data.

Cons

Slow due to pulling all the data from a table.
Schema changes may break the pipeline since the older snapshots will not have a specific column. Table formats such as Apache Iceberg, Delta lake, etc., can help handle schema changes gracefully.
Storage costs, since you are storing a full copy of the entire data set per run, your data storage costs will be high.
Not suited for fact data (clickstream, logs, orders placed, etc.), as storing a complete snapshot will cause data size to explode.

3.1.3. Lookback

The lookback pull is used when the end-user wants an aggregate metric for the past n periods (hour/day/months/years)—for example, calculating MAU (monthly active users) at the end of every month, tracking KPI for the past 30 days.

This pattern is beneficial when your source data is continuously updated or has late arriving events, and the end-user only cares about metric(s) based on the current state of the data. Generally, dashboards/reports that track the number of users over time end-of-quarter financial calculations are powered by a pull of this type.

Pros

Easy to build and maintain.
Great fit when you have a large source but only need the most current past n periods of data.
Usually run on fact tables, making debugging data issues easy.

Cons

If your source has a lot of late-arriving events, the metrics may change significantly between runs causing end-user confusion & panic.

3.1.4. Streaming

In this pattern, each record flows through your data pipeline. As a record flows through your pipeline, it can be enriched, filtered, etc., as needed. E.g., a near real-time pipeline to detect credit card fraud from transactions. This pattern is beneficial for any event that needs to be processed as soon as possible.

Pros

Low latency.
Essential for some instances like CC fraud detection, stock market data monitoring, etc.

Cons

It must be built with an eye towards the replayability of the pipeline, or else an error in the code can cause irreparable downstream impact.
Back pressure, exponential backoff, speed of commits into sink (rate limit), checkpointing system state, no downtime deploys, etc. need to be well thought out.

3.2. Behavioral

In this section, we will explore data pipeline patterns based on how the data pipeline behaves in case of failures and reruns.

3.2.1. Idempotent

Idempotency in data pipelines means that one can run a data pipeline multiple times with the same input, and the output will not change. Running a data pipeline numerous times with the same inputs will not create any duplicates or schema changes.

E.g., the Hourly data pipeline reads data from logs, aggregates it, and dumps(overwrites) it into cloud storage.

Run 1: run_id: 2022-12-01 => Transformations => output file path s3://some-bkt/your-data/2022-12-01 overwritten with new data.
Run 2: run_id: 2022-12-02 => Transformations => output file path s3://some-bkt/your-data/2022-12-02 overwritten with new data.
…
Run 30: run_id: 2022-12-30 => Transformations => output file path s3://some-bkt/your-data/2022-12-30 overwritten with new data.
New request to rerun 2022-12-02 due to upstream issues.
Re-run: run_id: 2022-12-02 => Transformations => output file path s3://some-bkt/your-data/2022-12-02 overwritten with new data.

In the above run log, re-rerunning the data pipeline multiple times with the same inputs will not produce duplicates or incorrect values.

For idempotency to work, we need replayable sources and an overwritable sink.

Code example: Idempotent data pipeline

Pros

Easy to maintain, do reruns and backfills.
Easy to track data lineage, i.e., determine a row’s data source.
Debug issues easily by allowing users to replay pipelines for different inputs.

Cons

Longer dev time.
It can be hard to maintain idempotency with changing requirements. E.g., In the above example, if we need to enrich the data from an OLTP transactional table, we break idempotency since transactional tables are non-replayable.

3.2.2. Self-healing

While an idempotent data pipeline may sound like the gold standard, it can be challenging to implement and maintain with changing business requirements and non-replayable sources. In some cases, it might not be possible to enforce idempotency, E.g., reconciling systems that require aggregating past n days’ data to create some metric(s).

A more straightforward design pattern is self-healing pipelines.The idea behind a self-healing pipeline is that the next pipeline run will “catch up” all the unprocessed data when an error occurs during a run.

In the case of time ranged pipelines, this can be as simple as automatically running the pipeline for the failed run(s) before starting the current runs. For full snapshot pipelines that don’t need historical data and lookback pipelines, you can skip the failed run.

Pros

Simpler to build & maintain.
Since it self-heals, it reduces alert fatigue.
Ideal for pipelines where upstream data sources or data infra break intermittently.
The catch-up pipeline can be designed to run the failed runs as idempotent runs, thus providing the best of both.

Cons

Since its self-healing, code bugs may not be caught for a few runs, as most assume the pipeline would self-heal.
If you have a failed run, you will need logic to ensure that the catch-up run does not create duplicates. You will also need to ensure that partial records are not made available to end users.
Inconsistent pipeline run time due to variability in input sizes.
If you do not have a replayable source, chances of losing historical changes. Check your business requirements; most cases do not require historical change data.
You will need to handle metadata to know when to rerun the data pipeline. Most orchestrators provide this metadata.
Data might only sometimes be fresh.

3.3. Structural

This section will cover how tasks/transformations are structured in a data pipeline.

3.3.1. Multi-hop pipelines

As data flows through the pipeline, keeping them separated at different levels of “cleanliness” is ideal. Multiple layers of transformation help catch issues sooner by implementing data quality checks after each layer. Debugging cases are made simpler by following a consistent pattern of applying transformation between layers (e.g., naming issue => check layer 1, aggregate looking incorrect => check layer 3, etc.).

In standard data pipelines, there is a concept of staging tables. Most tools have their preferred approaches; here are two popular ones.

Pros

Most tools can rerun only the failed transformations and their dependencies.
Ability to rerun with new logic at any step.
Having intermediate tables can help pinpoint where an issue occurred with the code. E.g., if there is an issue with a middle table, you will know that the transformation step before it caused it, helping devs triage quicker.

Cons

Using custom patterns has a steep learning curve as one tends to run into various issues. Use well-established patterns.
Storage costs since we are storing copies of the dataset.
Processing costs, especially if your data is large.

3.3.2. Conditional/ Dynamic pipelines

As your data pipeline grows, so does its complexity. The requirements may need complex flows that do different tasks based on when a pipeline is run, choose one task among multiple tasks based on input, etc.

A data pipeline typically does not start with a complex architecture but grows and evolves. While it can help deliver on time, one must keep an eye on exploding complexity. Otherwise, the pipeline can become flakey and hard to develop, test, or debug when facing production issues.

Pros

Easy to deliver on complex data flow requirements
One pipeline (& repo) to control data flow

Cons

Hard to debug, slow to develop
Hard to test since you will need to simulate all the different input scenarios.

3.3.3. Disconnected data pipelines

Disconnected data pipelines depend on sinks of other data pipelines. If they don’t have a way of recognizing cross-pipeline dependencies (e.g., Task sensor), then debugging, defining data lineage, defining SLAs can be highly convoluted.

Multiple disconnected data pipelines will lead to the inability (or make it very hard) to improve data latency (time taken for data to go from source to sink), debug, and triage production issues. Multiple disconnected data pipelines also increase communication overhead among various teams.

Pros

Quick to build.
Multiple teams can develop their pipelines independently.

Cons

Hard to debug, track data lineage, & triage issues.
Defining & Monitoring SLAs are difficult.

4. Conclusion

This article gives you a good idea of the typical design patterns for organizing data flows through your pipeline. To recap, we saw.

Types of Sources & Sinks
Data flow design patterns

A consistent pattern across your data pipelines helps devs communicate easily and understand code better. However, adding design patterns is overhead and will only make sense if you have more than a few simple pipelines.

When you have to build a new data pipeline requirement (or refactor an old one), use this flow chart to choose the right design(it’s usually the simplest one) for your data pipeline.

If you have any questions, comments, or other patterns you’d like to talk about, please leave them in the comment section below.

Data Pipeline Design Patterns - #1. Data flow patterns

1. Introduction

2. Source & Sink

2.1. Source Replayability

2.2. Source Ordering

2.3. Sink Overwritability

3. Data pipeline patterns

3.1. Extraction patterns

3.1.1. Time ranged

3.1.2. Full Snapshot

3.1.3. Lookback

3.1.4. Streaming

3.2. Behavioral

3.2.1. Idempotent

3.2.2. Self-healing

3.3. Structural

3.3.1. Multi-hop pipelines

3.3.2. Conditional/ Dynamic pipelines

3.3.3. Disconnected data pipelines

4. Conclusion

5. Further reading

6. References