How to Ingest Data: 2 Essential Patterns

Learn the 2 data ingestion patterns every data engineer needs — streaming for high-velocity events and batch for everything else. With real trade-offs and tools.

Author

Joseph Machado

Published

May 23, 2026

Keywords

data ingestion, data ingestion patterns, stream data ingestion

Most Companies Use These Ingestion Patterns

Companies pull data from multiple source systems. While they use a variety of tools, the ingestion patterns remain the same.

If you are wondering how companies ingest data into their warehouse, this post is for you.

By the end of this post, you will be able to design data ingestion systems for your use case.

You will learn the patterns companies use and their trade-offs.

Ingestion Pattern Is Chosen Based on Data Velocity

If your data is generated at high velocity (think >1,000 events per minute), it is pushed into a cloud store via an event log.

Otherwise, data is pulled into your cloud store via batch pipelines.
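As a sketch, this decision can be expressed as a simple rule of thumb (the threshold and function name below are illustrative, not a standard API):

```python
def choose_ingestion_pattern(events_per_minute: int) -> str:
    """Illustrative rule of thumb: high-velocity data is pushed
    through an event log; everything else is pulled in batches."""
    if events_per_minute > 1000:
        return "stream"  # push via an event log (e.g., Kafka) to a cloud store
    return "batch"  # pull via a scheduled batch pipeline


print(choose_ingestion_pattern(5000))  # stream
print(choose_ingestion_pattern(10))    # batch
```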

Note: The cloud store (e.g., S3) can be replaced with a table in an OLAP system (e.g., Snowflake).

Fact Data Is Streamed to Cloud Storage

Fact data generated at high velocity is sent from the client systems to the company servers.

Company servers asynchronously push these high-velocity events into event logs. An event log like Kafka is used here to reliably handle the volume of data.

A Kafka Connect cluster reads data from a Kafka topic and writes it to a cloud storage location. The Kafka Connect cluster can be replaced by any consumer.

Stream Ingest

Note that we need to make sure the consumer writes large enough files to the cloud storage.

E.g., the Confluent S3 connector has a flush size that specifies how many events to hold in memory before writing them to cloud storage as a file.

This ensures we don’t write a separate file for each event, which is extremely inefficient.
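As a sketch, a Confluent S3 sink connector config might combine flush.size with a time-based rotation (the connector name, topic, and bucket below are made up; check the connector documentation for your version):

```json
{
  "name": "s3-sink",
  "connector.class": "io.confluent.connect.s3.S3SinkConnector",
  "topics": "clickstream-events",
  "s3.bucket.name": "my-company-raw-events",
  "storage.class": "io.confluent.connect.s3.storage.S3Storage",
  "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
  "flush.size": "10000",
  "rotate.interval.ms": "600000"
}
```

Here flush.size caps the number of events per output file, while rotate.interval.ms forces a flush on a time interval so that low-volume topics still produce files regularly.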

Batch Flush

Batch Flush

Alternative tools: AWS Kinesis, Kinesis Data Firehose, Confluent, Redpanda, Spark Structured Streaming, Snowpipe, etc.

Be mindful of event order & uniqueness; most event logs don’t guarantee them by default.
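One way to handle this on the consumer side is to track processed event IDs and re-sort before writing. A minimal sketch, assuming each event carries an event_id and an event_ts field (names are illustrative):

```python
def dedupe_events(events, seen=None):
    """Drop events whose event_id was already processed, then sort by
    event timestamp. Field names (event_id, event_ts) are illustrative."""
    if seen is None:
        seen = set()
    unique = []
    for event in events:
        if event["event_id"] in seen:
            continue  # duplicate delivery from the event log
        seen.add(event["event_id"])
        unique.append(event)
    # restore the ordering the event log does not guarantee across partitions
    return sorted(unique, key=lambda e: e["event_ts"])


events = [
    {"event_id": "a", "event_ts": 2},
    {"event_id": "b", "event_ts": 1},
    {"event_id": "a", "event_ts": 2},  # duplicate delivery
]
print(dedupe_events(events))  # b (ts 1) first, then a single copy of a (ts 2)
```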

Change data capture (CDC) events are also sent to a Kafka topic.

Dimension & 3rd-Party Fact Data Are Ingested Periodically

Data from source systems such as OLTP databases, SFTP/FTP servers, cloud storage, and APIs is extracted as a batch process.

Batch processes are triggered by a scheduler (cron, Airflow, etc).

For small to medium data sizes, native Python is used; for larger data, distributed systems are used to extract the data and dump it into cloud storage.

Depending on the size and updateability of the source, you can perform a full or an incremental data pull.

Extract Entire Source Data and Load Into Cloud Store

If the data size is small/medium (e.g., under 50GB per extract), the simplest option is a full extract.

In this approach, the entire source data is extracted and loaded into a cloud store.

To preserve historical data, ingestion systems create one output folder (partition) per extract. The folder path is typically the pipeline’s run date.
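A minimal sketch of this run-date partitioning convention (the bucket layout and the run_ts key are illustrative):

```python
from datetime import date


def extract_output_path(base: str, table: str, run_date: date) -> str:
    """Build one partition folder per pipeline run so each full
    extract is preserved as a historical snapshot."""
    return f"{base}/{table}/run_ts={run_date.isoformat()}/"


print(extract_output_path("s3://warehouse/raw", "customer", date(2026, 5, 23)))
# s3://warehouse/raw/customer/run_ts=2026-05-23/
```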

Batch

Cheap storage makes it feasible to keep historical snapshots, but old data will need to be cleaned up regularly.

E.g., S3 Lifecycle policies can automatically delete old data.
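For example, an S3 lifecycle rule that expires snapshots after 90 days might look like this (the prefix and retention period are illustrative):

```json
{
  "Rules": [
    {
      "ID": "expire-old-snapshots",
      "Status": "Enabled",
      "Filter": { "Prefix": "raw/customer/" },
      "Expiration": { "Days": 90 }
    }
  ]
}
```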

No backfill logic is necessary since the entire source is pulled every time.

Read the latest data as shown below.

SELECT *
FROM s3_data_location
WHERE run_ts = (
    SELECT MAX(run_ts)
    FROM s3_data_location
);

Large Data Is Extracted in Chunks

If the data is large or expensive (e.g., due to API costs), pull a single chunk of data per run.

You will need a scheduling system to tell your data pipeline which data chunk to pull. See this data interval technique for how to do this.

The data is stored in a cloud store with output files partitioned or named by run time.
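A minimal sketch of an interval-driven pull, assuming the scheduler hands the pipeline a [start, end) data interval and the source table has an updated_at column (table and column names are illustrative):

```python
from datetime import datetime, timedelta


def pull_chunk_query(interval_start: datetime, interval_end: datetime) -> str:
    """Build the extract query for one data interval; the scheduler
    decides which interval each run is responsible for."""
    return (
        "SELECT * FROM customer "
        f"WHERE updated_at >= '{interval_start.isoformat()}' "
        f"AND updated_at < '{interval_end.isoformat()}'"
    )


start = datetime(2026, 5, 23)
print(pull_chunk_query(start, start + timedelta(days=1)))
```

Because each run pulls only its own interval, backfilling a missed day is just a re-run of the pipeline with that day's interval.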

Batch Inc

This pipeline is typically NOT idempotent: re-runs append data rather than overwrite it, to avoid the risk of destroying existing data.

We trade idempotency for safety, accepting potential data duplication. Since this data store will serve as the source for downstream backfills, we need it to be complete.

We can remove possible duplicates and create snapshot/SCD2 versions as shown below.

-- Source: customer table  
-- PK: customer_id
-- Update at column: updated_at (timestamp)

-- snapshot table
WITH deduped AS ( -- dedupes data
    SELECT *
    FROM read_parquet('s3://warehouse/dimension_dataname/*')
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY customer_id, updated_at
        ORDER BY updated_at DESC
    ) = 1
)
SELECT * -- picks the latest updated_at => most recent snapshot
FROM deduped
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY updated_at DESC
) = 1;

-- scd2 
WITH deduped AS ( -- dedupes data
    SELECT *
    FROM read_parquet('s3://warehouse/dimension_dataname/*')
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY customer_id, updated_at
        ORDER BY updated_at DESC
    ) = 1
),
scd2 AS (
    SELECT
        *,
        updated_at AS valid_from,
        LEAD(updated_at) OVER (
            PARTITION BY customer_id
            ORDER BY updated_at
        ) AS valid_to -- selects the next updated_at ts for this customer_id
    FROM deduped
)
SELECT
    *,
    valid_to IS NULL AS is_current -- flag to indicate the most recent snapshot
FROM scd2;

The data store acts as the bronze layer upon which silver and gold tables are built. See multi-hop architecture for more details.

Conclusion

To recap, we covered:

  1. Ingestion Pattern Is Chosen Based on Data Velocity
  2. Fact Data Is Streamed to Cloud Storage
  3. Dimension & 3rd-Party Fact Data Are Ingested Periodically

These patterns are used in both system design interviews and in real-world pipelines.

What patterns do you use to ingest data? If you had the choice, which one would you choose? Let me know in the comments below.

Read These

  1. Multi-Hop Architecture
  2. Change Data Capture
  3. Airflow 3.0 Tutorial