ETL & ELT, a comparison

Introduction

When designing and building data pipelines to load data into data warehouses you might have heard of the common ETL and ELT paradigms. This post goes over what they mean, their differences and which paradigm you might want to choose. Before we start lets define the abbreviations

  1. E: Extract, refers to the process of getting data from the source.
  2. T: Transform, refers to the process of transforming the raw data from the source (eg: joins with other tables, group by, column mapping, denormalizing, lookups on external database, machine learning modeling, etc).
  3. L: Load, refers to the process of loading the data into a table to be used.

Youtube version

If you prefer to follow along on video

ETL

ETL refers to extract-transform-load. The usual steps involved in ETL are

  1. extracting data from a data source
  2. storing it in a staging area
  3. doing some custom transformation (commonly a python/scala/spark script or spark/flink streaming service for stream processing)
  4. loading into a table ready to be used by data users.

Batch ETL

Stream ETL

If you are wondering why we have a staging area click here.

ELT

ELT is very similar but the data is loaded into a table before being transformed to a final table which is used by users.

  1. Extract and Load data from data source into a data warehouse table (commonly called stg_* table)
  2. Run a native SQL script to transform the data into the desired shape that is ready to be used by data users.

ELT

As you can see it has fewer components compared to the ETL approach. Note that with ELT we have the same paradigm for batch and streaming, this is because with most data warehouses you can stream data directly into the staging table. ref: BigQuery, Snowflake.

ETL & ELT, a comparison

With data warehouse introducing powerful features like external tables, separation of storage and query engine, the lines between ETL and ELT have gotten blurred. In recent years ELT has been used to denote a process where raw data is loaded into a table (internal or external) and then some transformations are done on it to change it to a final format, which is the data used by the users. The below table shows a comparison of some key points.

ELT ETL
number of components/tools ✔️
easier to maintain ✔️
works with standard encoding formats (parquet, orc, ) ✔️ ✔️
works with non standard file types(eg zipped files, vendor specific formats) ✔️
run machine learning algorithms ✔️
accessing external services (eg hitting a separate micro service for transformation) ✔️
dealing with non tabular structure (eg graph processing) ✔️

With the number of positives for ETL from the above table we might conclude that this is the right paradigm. But it is not always the case, ELT has been gaining a lot of popularity, with powerful data warehouses offering a lot of flexibility, reduced cost(although this depends on how you use it), less engineering effort and with tools like DBT ELT has become a favorite among companies. Having said that there are lots of use cases for ETL, such as machine learning and accessing data that is not present in the data warehouse at transformation time.

If you are building a data pipeline it is recommended to start with the simplest solution which is an ELT pipeline, while making sure your data pipelines and code base is evolvable and then add ETL capabilities as needed. You might even end up doing 2 transformation steps such as running a ML model in one transformation step, loading the data into a data warehouse and doing more transformations within the data warehouse, thus becoming an ETLTL pipeline. As you can see this depends on your use case and project. The key idea is to start simple (think ELT) and add complexity (think extrenal transformation systems) as needed.

Note that ELT is not recommended for applications OLTP databases. If you are working on loading data into an application’s OLTP database it is almost always recommended to use ETL approach, so as to not overload your application’s OLTP database and increase latency in your application.

Conclusion

Hope this article gives you a good understanding of what ETL and ELT is and how to use these concepts as you design and build out your data pipeline. If you have any comments or suggestions please leave them in the comment section below.