dbt (Data Build Tool) Tutorial

The reasons why someone would consider dbt are covered in this article. The best way to learn any tool is to build a simple project with it and improve it as you get new requirements.

Project details

For this project we will use an open data set which has

  1. Orders data
  2. Customer data
  3. A simple map from state abbreviation to state name

Data Dictionary

Objective

The objective of the project is to create a simple order-customer denormalized table called customer_orders as shown below.

Denormalized customer_orders

1. Prereq Installation

For this project we will use a postgres database. The requirements are

  1. docker (preferred) to run postgres
  2. pgcli to connect to our postgres instance
  3. dbt

Make sure you have docker running. You can start a postgres instance in a docker container using the following command

docker run --name pg_local -p 5432:5432 \
-e POSTGRES_USER=start_data_engineer -e POSTGRES_PASSWORD=password \
-e POSTGRES_DB=tutorial -d postgres:12.2

The above command sets up a user account (user: POSTGRES_USER, password: POSTGRES_PASSWORD) and creates a database called tutorial. You can check for running docker containers using the following command.

docker ps

Note: We are setting our password to password, which is fine for a toy example; do not use a password like this in real life.

2. Connecting to the database

We use the pgcli tool to connect to our docker postgres instance.

pgcli -h localhost -p 5432 -U start_data_engineer tutorial

The command will ask for your password, which we have set to password.

Use \dn to check the list of available schemas and \q to exit the cli. We will use this tool later to see the tables created by dbt.

3. dbt quick start

dbt performs the T of the ETL process in your data warehouse, and as such it expects the raw data to already be present in the data warehouse (an exception is small mapping/lookup data, which can be loaded in as seed files). dbt works as follows

  1. A select query in a sql file defines a data model (a data model maps to a table or view in your database, depending on your config)

  2. A data model can be built from existing data models or from raw data. This is done using the ref function (which we will use in later steps)

  3. When you run dbt, each select query is compiled with references to its dependent data models filled in, and the resulting scripts are run on your database

dbt takes care of figuring out the dependencies between tables, creating the tables, and inserting the records. We operate at the model level.
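As a concrete illustration (the file below is hypothetical and not part of this project), a model is just a select statement in a sql file, and dbt swaps the ref() call for the actual table or view name when it compiles the query.

-- hypothetical model file: models/my_model.sql
-- dbt replaces {{ ref('some_upstream_model') }} with the
-- schema-qualified name of that model's table or view
select
  order_id,
  order_status
from {{ ref('some_upstream_model') }}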

Go to the directory where you want to create your dbt project. In the terminal type in

dbt init simple_dbt_project
cd simple_dbt_project

This will set up a simple_dbt_project directory, which is the starting point for your dbt project.

4. Folder structure

Within your dbt project, you will have the following folders

default project structure

  1. analysis Directory to store scripts that perform analytical queries (e.g. reports); we can use dbt compile to generate the sql scripts for these reports.

  2. data Directory to store any raw data we might want to load into our models

  3. dbt_project.yml yaml file to specify configurations for the project

  4. macros Directory to store reusable macro functions (a short hypothetical example follows this list)

  5. models Directory to store our model files

  6. snapshots Directory to store snapshots of your model

  7. tests Directory to store test files
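
As an example of what goes in the macros directory, here is a minimal sketch of a reusable macro (hypothetical, not used in this project) that wraps a postgres cast:

-- hypothetical file: macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100)::numeric(16, 2)
{% endmacro %}

A model could then call it in its select list as {{ cents_to_dollars('amount_cents') }}.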

5. Download seed data

Download the seed data from data
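
The download should contain one CSV per seed, named to match the ref() calls we use later (customer, orders, state), with headers matching the columns selected in the staging models; we will move these into the data folder in the next step. As an illustration, state.csv starts roughly like this (rows shown here are only illustrative):

st,state_name
SP,São Paulo
RJ,Rio de Janeiro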

6. What is dbt_project.yml

In the base directory of your dbt project you will see a dbt_project.yml file. This file is used to specify the project level details. We set the following variables here

  1. name: The name for your dbt project.

  2. version: The version of your dbt project (this is similar to software project versioning).

  3. profile: Denotes the name of the profile to use from ~/.dbt/profiles.yml to connect to your database (we will see how to set this next).

  4. source-paths: Denotes the directory where we store our scripts to create our models.

  5. data-paths: Denotes the directory where dbt will look for seed data.

  6. target-path: Denotes the directory where dbt will store the compiled sql scripts to create our model and the files for the UI.

  7. models: In this section we define how we want the models within the models directory to behave; we can set models to be materialized as tables or views using the materialized option. Note that we can set different options for different folders within the models directory.

name: "tutorial"
version: "1.0.0"

profile: "tutorial"

source-paths: ["models"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
data-paths: ["data"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"
clean-targets:
    - "target"
    - "dbt_modules"

models:
    tutorial:
        materialized: table
        # Applies to all files under models/staging/
        staging:
            materialized: view
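
The clean-targets entry lists the directories that dbt clean deletes, which is a convenient way to clear out compiled artifacts and downloaded packages:

dbt clean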

Delete the example directory within the models directory and create staging and mart folders. Move the data you downloaded into the data folder of your dbt project. Your folder structure should look like

our project structure

7. Creating ~/.dbt/profiles.yml

This file is used to store different profiles that can be referenced in individual dbt projects (in their dbt_project.yml file) to connect to the corresponding database. You will be provided with a default profiles.yml file with the following content

# For more information on how to configure this file, please see:
# https://docs.getdbt.com/docs/profile

default: # profile
  outputs:
    dev:
      type: redshift
      threads: 1
      host: 127.0.0.1
      port: 5439
      user: alice
      pass: pa55word
      dbname: warehouse
      schema: dbt_alice
    prod:
      type: redshift
      threads: 1
      host: 127.0.0.1
      port: 5439
      user: alice
      pass: pa55word
      dbname: warehouse
      schema: analytics
  target: dev

In the above yaml file, the profile is named default and has two environments, dev and prod, with target set to dev. This means that when you run dbt on your local machine, it runs against the database specified under the dev section. In profiles.yml you can have multiple profiles, each with multiple environments, and the target can be switched depending on where dbt is run; this is a common approach when deploying to different environments.
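
For example, you can pick the environment at run time with the --target flag (the prod target here comes from the sample profile above, not something we set up in this tutorial):

dbt run --target prod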

Our ~/.dbt/profiles.yml should look as follows

tutorial:
  outputs:
    dev:
      type: postgres
      threads: 1
      host: localhost
      port: 5432
      user: start_data_engineer
      pass: password
      dbname: tutorial
      schema: dbt_tutorial
  target: dev

We have named our profile tutorial, matching the profile setting in our dbt_project.yml file, and set our dev environment config to our postgres instance details. We also set the schema to dbt_tutorial, which will be created when we run dbt.

8. Staging

In ETL design you generally have a staging area where you store the raw unprocessed data. dbt is designed to handle transformations and as such expects the staging data to be present in the data warehouse. But since this is a toy example, we can load all our raw data from the file system. Let's see how to load raw data into some staging tables.

Create a file models/staging/stg_customer.sql with the following content

with source as (
  select * from {{ ref('customer') }}
),

stage_customer as (
  select
    customer_order_id,
    customer_unique_id,
    customer_zip_code_prefix,
    customer_city,
    customer_st
  from source
)
select
  *
from stage_customer

If you are familiar with common table expressions (CTEs), this query will seem straightforward. Basically we reference the raw data by its file name (ref('customer')) in a CTE called source and expose that data using a select query. Since we set the materialization strategy to view for staging, this query will be used to create a view in the database when you run dbt. Similarly, we can create staging scripts for the orders and state files within the models/staging directory, as shown below.

-- stg_orders.sql
with source as (
  select
    *
  from {{ ref('orders') }}
),
stage_orders as (
  select
    order_id,
    customer_order_id,
    order_status,
    order_purchase_timestamp,
    order_approved_at,
    order_delivered_carrier_date,
    order_delivered_customer_date,
    order_estimated_delivery_date
  from source
)
select
  *
from stage_orders

-- stg_state.sql
with source as (
  select
    *
  from {{ ref('state') }}
),
stage_state as (
  select
    st,
    state_name
  from source
)
select
  *
from stage_state

Note that here we use ref('file-name') to reference the file; this function is a key component of dbt's workflow, as dbt figures out the data flow DAG from these references.
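
If you want to see what a ref() call turns into, dbt writes the rendered sql under the target directory when you compile or run the project; a rough sketch of what to expect (exact paths and quoting depend on your dbt version and adapter):

dbt compile
# then inspect something like target/compiled/tutorial/staging/stg_customer.sql,
# where {{ ref('customer') }} has been replaced by "tutorial"."dbt_tutorial"."customer"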

A powerful feature of dbt is that it enables easy data testing. This is done by specifying the test criteria within a schema.yml file. Let's create a file named schema.yml within the models/staging directory, with the content shown below.

version: 2

models:
  - name: stg_customer
    columns:
      - name: customer_unique_id
        tests:
          - not_null

  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_order_id
        tests:
          - unique
          - not_null

  - name: stg_state
    columns:
      - name: st
        tests:
          - accepted_values:
              values:
                [
                  "AC",
                  "AL",
                  "AP",
                  "AM",
                  "BA",
                  "CE",
                  "DF",
                  "ES",
                  "GO",
                  "MA",
                  "MT",
                  "MS",
                  "MG",
                  "PA",
                  "PB",
                  "PR",
                  "PE",
                  "PI",
                  "RJ",
                  "RN",
                  "RS",
                  "RO",
                  "RR",
                  "SC",
                  "SP",
                  "SE",
                  "TO",
                ]

Here you can see how we define column-level test cases for the columns in our data models. We have defined

  1. unique and not null constraints for order_id and customer_order_id
  2. not null constraint for customer_unique_id
  3. accepted_values constraint for st column
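
These tests run as part of dbt test in step 10; while developing, you can also run the tests for a single model by selecting it explicitly (model selection flags vary slightly across dbt versions):

# run only the tests attached to the stg_orders model
dbt test --models stg_orders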

9. Transform

Now that you have the scripts for your staging models, it's time to create the final denormalized model. Within the mart folder, create a customer_orders.sql file with the content shown below

with customer as (
  select
    *
  from {{ ref('stg_customer') }}
),
orders as (
  select
    *
  from {{ ref('stg_orders') }}
),
state_map as (
  select
    *
  from {{ ref('stg_state') }}
),
final as (
  select
    customer.customer_unique_id,
    orders.order_id,
    case orders.order_status
      when 'delivered' then 1
      else 0
    end as order_status,
    state_map.state_name
  from orders
  inner join customer on orders.customer_order_id = customer.customer_order_id
  inner join state_map on customer.customer_st = state_map.st
)
select
  *
from final

In this script we load the staging data models as CTEs and join them, with a case condition, to get our final model.

It is good practice to have a schema.yml file to test the data quality. Create a file schema.yml in the models/mart directory with the following content

version: 2

models:
  - name: customer_orders
    columns:
      - name: customer_unique_id
        description: Unique identifier for a customer
        tests:
          - not_null
      - name: order_status
        description: 1 if the order is delivered 0 otherwise
        tests:
          - accepted_values:
              values: [1, 0]
      - name: state_name
        description: The full state name of the customer

We have an additional description option, which holds the column description to be displayed in the dbt UI. It is that simple to perform modular transformations in dbt; there is no need to manually create and manage dependencies.

10. Running dbt

Open your terminal and go to the simple_dbt_project directory

  1. Check your configuration and database connection using dbt debug; make sure everything shows OK before proceeding
  2. Run dbt seed to load the data from the data folder into the database as tables
  3. Run the dbt data transformation pipeline using dbt run
  4. Run the tests you have specified in the schema.yml files using dbt test; it should say Completed successfully
  5. Now we can check for the tables created in our postgres database
    1. Log into the postgres instance using pgcli -h localhost -p 5432 -U start_data_engineer tutorial and the password password
    2. Use \dn to see the list of schemas in your database; you should see dbt_tutorial
    3. Use a simple select query to spot-check your data: select * from dbt_tutorial.customer_orders limit 2;
    4. Exit pgcli using \q
  6. Now that we have seen our data loaded in, we can generate the documentation site and serve it locally using dbt docs generate and then dbt docs serve
  7. If you go to http://localhost:8080/#!/overview in your browser you will see the UI. The Project tab on the left pane shows the project folder structure and the column names and types. Select the Database option on the left pane to explore the tables, their descriptions, and a sample sql script.
  8. If you click on customer_orders and then on view lineage graph, you should see the dependency graph shown below

Customer order lineage graph

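For reference, here is the full command sequence from the steps above as a single sketch, run from the simple_dbt_project directory:

dbt debug          # check config and database connection
dbt seed           # load the CSVs in the data folder as tables
dbt run            # build the staging views and the customer_orders table
dbt test           # run the tests defined in the schema.yml files
dbt docs generate  # build the documentation
dbt docs serve     # serve the docs UI at http://localhost:8080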

Stop Docker containers

You can stop your postgres docker container using the following commands

docker stop $(docker ps -aq)
docker rm $(docker ps -aq)

Note that this stops all running docker containers.
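
If you prefer to stop only the container we started, you can target it by the name we gave it (pg_local):

docker stop pg_local
docker rm pg_local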

Code

The project code is available at github

Conclusion

In this tutorial we have seen how to quickly get up and running with dbt. Hope this gives you some ideas for your next project. There are more advanced topics, such as macros, analysis, incremental loads, and scheduling, which we have skipped, as this post was meant to give you a basic introduction to dbt.

Given the trend of the data engineering industry as a whole moving towards powerful data warehouses (AWS Redshift, Google BigQuery, etc.), I think tools like dbt will become more popular by

  1. helping non-engineering users easily create and run transformation logic
  2. helping bridge the communication gap between engineers and analysts

dbt is already being used at companies like GitLab and Betterment, and they recently raised a Series A (ref), showing the industry trend towards building easy-to-use ETL tools. Let me know if you have any questions or comments in the comments section below.