Most Companies Use These Ingestion Patterns
Companies pull data from multiple source systems. While they use a variety of tools, the ingestion patterns remain the same.
If you are wondering how companies ingest data into their warehouse, this post is for you.
By the end of this post, you will be able to design data ingestion systems for your use case.
You will learn the patterns companies use and their trade-offs.
Ingestion Pattern Is Chosen Based on Data Velocity
If your data is generated at high velocity (think >1,000 events per minute), it is pushed into a cloud store via an event log.
Otherwise, data is pulled into your cloud store via batch pipelines.
Note: The cloud store (e.g., S3) can be replaced with an OLAP system (e.g., Snowflake) table.
Fact Data Is Streamed to Cloud Storage
Fact data generated at high velocity is sent from the client systems to the company servers.
Company servers asynchronously push these high-velocity events into event logs. An event log like Kafka is used here to reliably handle the volume of data.
A Kafka Connect cluster reads data from a Kafka topic and writes it to a cloud storage location. The Kafka Connect cluster can be replaced by any consumer.
Note that we need to make sure the consumer writes large enough files to the cloud storage.
E.g., the Confluent S3 connector has a flush size that specifies how many events to hold in memory before writing them to cloud storage as a file.
This ensures we don’t write a separate file for each event, which is extremely inefficient.
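The buffering idea can be sketched in a few lines of plain Python. This is a minimal illustration of the flush-size pattern, not the Confluent connector's actual implementation; the `BufferedWriter` class and its in-memory "cloud storage" list are assumptions for the sketch:

```python
import json

class BufferedWriter:
    """Accumulate events in memory; write one file per `flush_size` events."""

    def __init__(self, flush_size):
        self.flush_size = flush_size
        self.buffer = []
        self.files_written = []  # stands in for files in cloud storage

    def add(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        if self.buffer:
            # One file per batch of events instead of one file per event.
            self.files_written.append(json.dumps(self.buffer))
            self.buffer = []

writer = BufferedWriter(flush_size=3)
for i in range(7):
    writer.add({"event_id": i})
writer.flush()  # write any leftover events at shutdown
print(len(writer.files_written))  # 3 files for 7 events
```

A real consumer would also flush on a time interval, so low-traffic periods don't leave events stuck in memory indefinitely.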
Alternative tools: AWS Kinesis, Kinesis Data Firehose, Confluent, Redpanda, Spark Structured Streaming, Snowpipe, etc.
Be mindful of event order & uniqueness; most event logs don’t guarantee them by default.
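Since most event logs only guarantee at-least-once delivery, consumers often deduplicate by a unique event ID. A minimal sketch of that idea (the `event_id` field is an assumption about your event schema):

```python
def deduplicate(events):
    """Keep the first occurrence of each event_id, preserving arrival order."""
    seen = set()
    unique = []
    for event in events:
        if event["event_id"] not in seen:
            seen.add(event["event_id"])
            unique.append(event)
    return unique

events = [
    {"event_id": "a", "value": 1},
    {"event_id": "b", "value": 2},
    {"event_id": "a", "value": 1},  # redelivered duplicate
]
print(deduplicate(events))  # duplicate "a" is dropped
```

In practice the dedup happens downstream in SQL (as in the QUALIFY queries later in this post) or with a keyed state store, since an in-memory set doesn't survive consumer restarts.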
Change data capture (CDC) events are also sent as events to a Kafka topic.
Dimension & 3rd-Party Fact Data Are Ingested Periodically
Data from source systems such as OLTP databases, SFTP/FTP servers, cloud storage, and APIs is extracted as a batch process.
Batch processes are triggered by a scheduler (cron, Airflow, etc).
For small to medium data sizes, native Python is used; for larger data, distributed systems are used to extract the data and dump it into cloud storage.
Depending on the size and updateability of the source, you can perform a full or an incremental data pull.
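An incremental pull typically tracks a watermark (e.g., the highest `updated_at` seen so far) and only fetches rows past it. A rough sketch of the idea, assuming the source table has an `updated_at` column; in a real pipeline the filter would run in the source query and the watermark would be persisted between runs:

```python
def incremental_pull(rows, watermark):
    """Return rows updated after the watermark, plus the advanced watermark."""
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

source = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-03"},
]
rows, wm = incremental_pull(source, watermark="2024-01-02")
print(rows, wm)  # only id 2 is pulled; watermark advances to 2024-01-03
```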
Extract Entire Source Data and Load Into Cloud Store
If the data size is small/medium (e.g., under 50GB per extract), the simplest option is a full extract.
In this approach, the entire source data is extracted and loaded into a cloud store.
To preserve historical data, ingestion systems create one output folder (partition) per extract. The folder path is typically the pipeline’s run date.
Cheap storage makes it feasible to retain historical snapshots, but old data will need to be cleaned up regularly.
E.g., S3 Lifecycle policies can automatically delete old data.
No backfill logic is necessary since the entire source is pulled every time.
Downstream readers then only need to read the latest partition.
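A minimal sketch of selecting the latest run-date partition. The paths and bucket layout are assumptions; the trick is that ISO-formatted dates sort lexicographically, so the maximum string is the newest folder:

```python
partitions = [
    "s3://warehouse/customer/2024-05-01/",
    "s3://warehouse/customer/2024-05-02/",
    "s3://warehouse/customer/2024-05-03/",
]

def latest_partition(paths):
    """ISO-dated folder names sort lexicographically, so max() is the newest."""
    return max(paths)

print(latest_partition(partitions))
```

In a real pipeline the partition list would come from a storage listing call (or the scheduler's run date) rather than a hardcoded list.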
Large Data Is Extracted in Chunks
If the data is large or expensive (e.g., due to API costs), pull a single chunk of data per run.
You will need a scheduling system to tell your data pipeline which data chunk to pull. See this data interval technique for how to do this.
The data is stored in a cloud store with output files partitioned or named by run time.
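One common way to derive the chunk is to give each run a half-open window `[run_start, run_start + schedule_interval)`, which also determines the output partition name. A minimal sketch; the hourly schedule is an assumption:

```python
from datetime import datetime, timedelta

def data_interval(run_start, schedule=timedelta(hours=1)):
    """Each run owns the half-open window [run_start, run_start + schedule)."""
    return run_start, run_start + schedule

start, end = data_interval(datetime(2024, 5, 1, 9, 0))
print(start, end)  # this run pulls data from 09:00 (inclusive) to 10:00 (exclusive)
```

Half-open windows ensure adjacent runs neither overlap nor leave gaps, which is what makes chunk-at-a-time backfills tractable.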
This pipeline is typically NOT idempotent, because an idempotent re-run would overwrite existing data. We trade idempotency for safety, at the cost of potential duplicates: since this data store serves as the source for downstream backfills, we need it to be complete.
We can remove possible duplicates and create snapshot/SCD2 versions as shown below.
-- Source: customer table
-- PK: customer_id
-- Updated-at column: updated_at (timestamp)

-- snapshot table
WITH deduped AS ( -- dedupe data
    SELECT *
    FROM read_parquet('s3://warehouse/dimension_dataname/*')
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY customer_id, updated_at
        ORDER BY updated_at DESC
    ) = 1
)
SELECT * -- pick the latest updated_at => most recent snapshot
FROM deduped
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY updated_at DESC
) = 1
-- scd2
WITH deduped AS ( -- dedupe data
    SELECT *
    FROM read_parquet('s3://warehouse/dimension_dataname/*')
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY customer_id, updated_at
        ORDER BY updated_at DESC
    ) = 1
),
scd2 AS (
    SELECT
        *,
        updated_at AS valid_from,
        LEAD(updated_at) OVER (
            PARTITION BY customer_id
            ORDER BY updated_at
        ) AS valid_to -- the next updated_at ts for this customer_id
    FROM deduped
)
SELECT
    *,
    valid_to IS NULL AS is_current -- flag indicating the most recent version
FROM scd2

The data store acts as the bronze layer upon which silver and gold tables are built. See multi-hop architecture for more details.
Conclusion
To recap, we saw:
- Ingestion Pattern Is Chosen Based on Data Velocity
- Fact Data Is Streamed to Cloud Storage
- Dimension & 3rd-Party Fact Data Are Ingested Periodically
These patterns are used in both system design interviews and in real-world pipelines.
What patterns do you use to ingest data? If you had the choice, which one would you choose? Let me know in the comments below.



