dbt (Data Build Tool) Tutorial – Start Data Engineering

Introduction

Are you trying to learn dbt and understand what it’s all about? Are you:

Even more confused after reviewing the complex documentation?

Frustrated at having to set up so many things before getting to what dbt is?

This post is for you.

dbt enables data engineering best practices with just SQL.

By the end of this post, you will know.

How dbt works
How to use dbt
Build and run an end-to-end dbt project

Code along project

Follow along by running the project on GitHub Codespaces (Recommended) or set up locally.

In your terminal, run the following command.

# copy paste this in your terminal
uv run python extract_load_pipeline.py
uv run dbt clean 
uv run dbt deps
uv run dbt seed
uv run dbt run --select models/bronze
uv run dbt snapshot
uv run dbt run --select models/silver models/gold
uv run dbt test
uv run dbt docs generate 
uv run dbt docs serve

1: Simulate an EL process to load data into warehouse
2: Clean up old code (if any)
3: Load seed data from ./data folder
4: Run models in models/bronze folder
5: Create SCD2 table
6: Run models in models/silver and models/gold folder
7: Run tests
8: Create and serve documentation

The docs site will open; click on the lineage icon on the bottom right to see the end-to-end lineage.

Note

We use uv run in front of all our commands.

uv is a fast python package manager, and uv run runs our commands inside a virtual env defined by pyproject.toml

We will go over each of these commands below.

Video walkthrough

dbt enables anyone with SQL knowledge to build data pipelines

dbt does this by

Creating datasets with just SQL SELECT statements.
Using code to understand dataset lineage.
Defining data quality testing & documentation as part of the code.
Enabling reusable code with SQL functions (macros) & dbt packages.
Having helper functions to create SCD2 and incremental pipelines.

Cheap storage has made loading source data into a warehouse and then transforming it (ELT) popular.

dbt (data build tool) brings best practices to the T in ELT.

dbt is select statements (models) connected with the ref function

Select statements are the pipeline’s building blocks

dbt is built on files with SQL select statements, called models. Models can be referenced by other models.

The file name is the model name.

Let’s look at an example:

./models/silver/fct_orders.sql

with orders as (
    select *
    from {{ ref('bronze_orders') }}
),

order_status_code as (
    select *
    from {{ ref('order_status_code') }}
)

select o.*
, sc.order_status_code
from orders o 
left join order_status_code sc 
on o.order_status = sc.order_status

1: Using a model called bronze_orders
2: Using a model called order_status_code
3: The model file should end with a SELECT statement.

When we run dbt, these models will be created as tables, views, or materialized views in the database (see here for details).

When we run our pipelines, dbt will compile this into a SQL CREATE statement.

Let’s look at the compiled version of the above model.

./target/run/sde_dbt_tutorial/models/silver/fct_orders.sql

create  table
      "dbt"."main"."fct_orders__dbt_tmp"
    as (
      with orders as (
    select *
    from "dbt"."main"."bronze_orders"
),

order_status_code as (
    select *
    from "dbt"."main"."order_status_code"
)

select o.* 
, sc.order_status_code
from orders o 
left join order_status_code sc 
on o.order_status = sc.order_status
    );

1: Create table added
2: ref replaced with actual DB table/view

dbt uses the `ref` function to understand model creation order

From our data lineage, we see that bronze_orders and order_status_code models need to be created before the fct_orders model.

dbt uses the ref function to determine this order.

./models/silver/fct_orders.sql

with orders as (
    select *
    from {{ ref('bronze_orders') }}
),

order_status_code as (
    select *
    from {{ ref('order_status_code') }}
)

…

1: Using a model called bronze_orders
2: Using a model called order_status_code

Caution

The ref function only works on models created by dbt.

For tables/views created outside of dbt use the source function.

./models/bronze/bronze_state.sql

with source as (
    select *
    from {{ source('raw', 'raw_state') }}
),

renamed as (
    select
        state_id::INT as state_id,
        state_code::VARCHAR(2) as state_code,
        state_name::VARCHAR(30) as state_name
    from source
)

select *
from renamed

1: Use the table raw.raw_state.

The table raw.raw_state was created by our EL process. We need to define these source models as shown below.

./models/bronze/raw.yml

version: 2
sources:
  - name: raw
    description: Data from the application database, imported via an EL process.
    tables:
      - name: raw_customer
      - name: raw_orders
      - name: raw_state
      - name: raw_clickstream

1: Tables imported via EL process

Video walkthrough

Yaml files define how dbt operates

dbt has multiple configurations that we can use (e.g., should models be views or tables?). We can define these in three main places:

dbt_project.yml : This file is mandatory. In this file, we define the following:
- Project name & version
- Folders for specific dbt commands
- Folder & file level configurations
Individual yaml files per folder. While we can define everything in dbt_project.yml, split configs into individual files per folder for manageability. E.g., models/bronze/bronze.yml
As Jinja configs per model. E.g. Incremental logic in models/silver/fct_clickstream.sql and SCD2 logic in scd2/dim_customers.sql.

We use profiles.yml to define the connection to our warehouse. We can define multiple connections and set a default one with a target.

We can quickly switch targets with the --target flag in dbt commands.

./profiles.yml

---
sde_dbt_tutorial:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ./dbt.duckdb
    prod:
      type: duckdb
      path: ./dbt-prod.duckdb

1: Project id (must match one in dbt_project.yml)
2: duckdb connection is just a local file

Video walkthrough

# dbt Configs (single choice) When you run `dbt snapshot`, which folder does dbt look in for the SCD2 tables? > Hint: read the [dbt_project.yml](https://github.com/josephmachado/simple_dbt_project/blob/master/dbt_project.yml) and find the line for `snapshot-paths`. 1. [ ] models 1. [ ] macros 1. [ ] data 1. [x] scd2 1. [ ] target

Using multi-hop architecture to model data for analytics

dbt project structure is complex, so we use a standard multi-hop architecture.

Bronze is source data as-is with data type conversions
Silver is data modeled as facts and dimensions
Gold is summary tables for stakeholders

We create one folder per layer of our architecture.

We define bronze as views, since the data is already in the warehouse and we only need to perform data-type cleanup.

models:
  sde_dbt_tutorial:
    bronze:
      +materialized: view

1: A + applies config to all models in this folder

Let’s run the bronze models.

uv run dbt run --select models/bronze

The --select enables us to choose the models to run. We run all the models in the bronze folder.

Silver layer models data as facts and dimensions. We create: fct_orders, fct_clickstream, and dim_customer.

Warning

We model our dim_customer as SCD2. dbt requires that we run a dbt snapshot command to create SCD2 tables.

Video walkthrough

Read more about incremental tables here.

dbt packages are open-source data transformation code

In the gold layer, we create a model called order_status_pivot for the marketing team.

./models/gold/marketing/order_status_pivot.sql

SELECT
    EXTRACT(
        YEAR
        FROM
        order_approved_at
    ) AS order_year, --noqa: CV03
{{
    dbt_utils.pivot('order_status',
    dbt_utils.get_column_values(ref('orders_obt'), 'order_status')) }}
FROM
    {{ ref('orders_obt') }}
GROUP BY
    1

1: pivot from dbt-utils package
2: get_column_values from dbt-utils package

We define the packages we need in the packages.yml file and when we run dbt deps dbt downloads them.

Run tests on materialized models

We can run tests on materialized models. There are 3 types of tests:

dbt tests: dbt comes with unique, not_null, accepted_values, & relationships tests. Reference docs.
dbt-expectations: Great expectations equivalent for dbt. Has a lot of complex tests. Reference docs.
Custom tests: We can create custom SQL-based tests. E.g. reconcilation tests

We define tests in our yaml configs, as shown below.

./models/silver/silver.yml

---
version: 2
models:
  - name: dim_customer
    columns:
      - name: customer_id
        data_tests: [not_null]

1: Model to test
2: Column on which the test is run
3: Type of test to run on customer_id

Caution

dbt tests do not follow the WAP pattern because data must be present to run them.

Read more about WAP pattern for data quality checks here ->

See entire data lineage with dbt docs

Create dbt docs and serve them as shown below:

uv run dbt docs generate 
uv run dbt docs serve

Exposures show which of our models are used by stakeholders, ./models/exposure.yml.

Video walkthrough

Conclusion

To recap, we saw

Learning and understanding dbt can also significantly improve your odds of landing a DE job.

dbt can help you make your ELT pipelines stable and development fun. If you have any questions or comments, please leave them in the comment section below.

Introduction

Code along project

Video walkthrough

dbt enables anyone with SQL knowledge to build data pipelines

dbt is select statements (models) connected with the ref function

Select statements are the pipeline’s building blocks

dbt uses the ref function to understand model creation order

Video walkthrough

Yaml files define how dbt operates

Video walkthrough

Using multi-hop architecture to model data for analytics

Video walkthrough

dbt packages are open-source data transformation code

Run tests on materialized models

See entire data lineage with dbt docs

Video walkthrough

Conclusion

Recommended reading

dbt uses the `ref` function to understand model creation order