dbt (Data Build Tool) Tutorial
The reasons why someone would consider dbt are covered in this article. The best way to learn any tool is to build a simple project with it and improve it as you get new requirements.
Project details
For this project we will use an open data set which has
- Orders data
- Customer data
- A simple map from state abbreviation to state name
Data Dictionary
Objective
The objective of the project is to create a simple order-customer denormalized table called customer_orders as shown below.
Denormalized customer_orders
1. Prereq Installation
For this project we will use a postgres database. The requirements are docker, to run the postgres instance, and pgcli, to connect to it.
Make sure you have docker running. You can start a postgres instance in a docker container using the following command
docker run --name pg_local -p 5432:5432 \
-e POSTGRES_USER=start_data_engineer -e POSTGRES_PASSWORD=password \
-e POSTGRES_DB=tutorial -d postgres:12.2
The above command sets up a user account (user: start_data_engineer, password: password, from the POSTGRES_USER and POSTGRES_PASSWORD environment variables) and creates a database called tutorial. You can check for running docker containers using
docker ps
Note: we are setting the password to password, which is fine for a toy example; do not use a password like this in real life.
2. Connecting to the database
We use the pgcli tool to connect to our docker postgres instance.
pgcli -h localhost -p 5432 -U start_data_engineer tutorial
The command will ask for your password, which we have set to password.
Use \dn to check the list of available schemas and \q to exit the cli. We will use this tool later to see the tables created by dbt.
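If you prefer plain SQL to the meta-commands, you can also list the schemas with a standard information_schema query (this is not dbt specific; it works in any postgres client):
select schema_name
from information_schema.schemata;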
3. dbt quick start
dbt performs the T of the ETL process in your data warehouse, and as such it expects the raw data to be present in the data warehouse (an exception would be small mapping/lookup data that can be loaded in as raw data). dbt works as follows
- You write a select query in a sql file to define a data model (a data model maps to a table or view in your database, depending on your config).
- A data model can be built from existing data models or from raw data. This is done using the ref function (which we will use in later steps).
- When you run dbt, each select query is compiled with references to its dependent data models filled in, and the resulting scripts are run on your database.
dbt takes care of figuring out the dependencies between tables, creating tables, and inserting records. We operate at the model level.
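To make this concrete, here is a minimal sketch of a model file. The file and table names are hypothetical and not part of this project; the point is that a model is just a select statement, and ref() points at another model or seed.
-- models/example_model.sql (hypothetical names, for illustration only)
select
    id,
    created_at
from {{ ref('some_raw_table') }}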
Go to the directory where you want to create your dbt project. In the terminal type in
dbt init simple_dbt_project
cd simple_dbt_project
This will set up a simple_dbt_project directory, which is the starting point for your dbt project.
4. Folder structure
Within your dbt project, you will have the following folders and files
default project structure
- analysis: directory to store scripts that perform analytical queries (e.g. reports); we can use these files to generate the sql scripts for the report using dbt compile.
- data: directory to store any raw data we might want to load into our models.
- dbt_project.yml: yaml file to specify configurations for the project.
- macros: directory to store reusable macro functions.
- models: directory to store our model files.
- snapshots: directory to store snapshots of your models.
- tests: directory to store test files.
5. Download seed data
Download the seed data from data
6. What is dbt_project.yml
In the base directory of your dbt project you will see a dbt_project.yml file. This file is used to specify project level details. We set the following variables here
- name: the name of your dbt project.
- version: the version of your dbt project (similar to software project versioning).
- profile: the name of the profile in ~/.dbt/profiles.yml to use to connect to your database (we will see how to set this next).
- source-paths: the directory where we store our scripts to create our models.
- data-paths: the directory where dbt will look for seed data.
- target-path: the directory where dbt will store the compiled sql scripts to create our models and the files for the UI.
- models: in this section we define how the models within the models directory should behave; we can set models to be tables or views using the materialized option. Note that we can have different options for different folders within the models directory.
name: "tutorial"
version: "1.0.0"
profile: "tutorial"

source-paths: ["models"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
data-paths: ["data"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"
clean-targets:
  - "target"
  - "dbt_modules"

models:
  tutorial:
    materialized: table
    # Applies to all files under models/staging/
    staging:
      materialized: view
Delete the example directory within the models directory and create staging and mart folders. Move the data you downloaded into the data folder of your dbt project (see the commands after the figure below). Your folder structure should look like
our project structure
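The shell commands below sketch these steps. The download location and csv file names are assumptions (the files should match the ref() names customer, orders and state used later), so adjust the paths to wherever you saved the data.
cd simple_dbt_project
# remove the sample models created by dbt init
rm -rf models/example
# create folders for our staging and mart models
mkdir -p models/staging models/mart
# move the downloaded seed files into the data folder (paths are assumptions)
mv ~/Downloads/customer.csv ~/Downloads/orders.csv ~/Downloads/state.csv data/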
7. Creating ~/.dbt/profiles.yml
This file is used to store different profiles that can be referenced in individual dbt projects (in their dbt_project.yml file) to connect to the corresponding database. You will be provided with a default profiles.yml file with the following content
# For more information on how to configure this file, please see:
# https://docs.getdbt.com/docs/profile
default: # profile
  outputs:
    dev:
      type: redshift
      threads: 1
      host: 127.0.0.1
      port: 5439
      user: alice
      pass: pa55word
      dbname: warehouse
      schema: dbt_alice
    prod:
      type: redshift
      threads: 1
      host: 127.0.0.1
      port: 5439
      user: alice
      pass: pa55word
      dbname: warehouse
      schema: analytics
  target: dev
In the above yaml file, the profile is named default and has two environments, dev and prod. The target is set to dev, which means that when you run dbt on your local machine it runs against the database specified under the dev section. A profiles.yml file can have multiple profiles, each with multiple environments, and the target can be set depending on where dbt is run. This is a common approach when deploying to different environments.
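For example, dbt uses the target named in the profile by default, and you can override it per invocation with the --target flag:
# runs against the default target (dev in the example profile above)
dbt run
# explicitly run against the prod environment defined in the same profile
dbt run --target prod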
Our ~/.dbt/profiles.yml should look as follows
tutorial:
  outputs:
    dev:
      type: postgres
      threads: 1
      host: localhost
      port: 5432
      user: start_data_engineer
      pass: password
      dbname: tutorial
      schema: dbt_tutorial
  target: dev
We have changed the profile name to tutorial, matching the profile set in our dbt_project.yml file, and set the dev environment config to our postgres instance details. We also set the schema to dbt_tutorial, which will be created when we run dbt.
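At this point you can already verify that dbt can read this profile and reach the postgres container (we will run this again as part of step 10):
# run from the simple_dbt_project directory; checks profiles.yml and the database connection
dbt debug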
8. Staging
In ETL design you generally have a staging area where you store the raw, unprocessed data. dbt is designed to handle transformations and as such expects the staging data to be present in the data warehouse. But since this is a toy example, we can load all our raw data from the file system. Let's see how to load raw data into some staging tables.
Create a file models/staging/stg_customer.sql with the following content
with source as (
    select * from {{ ref('customer') }}
),

stage_customer as (
    select
        customer_order_id,
        customer_unique_id,
        customer_zip_code_prefix,
        customer_city,
        customer_st
    from source
)

select
    *
from stage_customer
If you are familiar with common table expressions (CTEs), this query will seem straightforward. We reference the raw data by its file name (ref('customer')) in a CTE called source and expose that data using a select query. Since we set the materialization strategy to view for staging, this query will be used to create a view in the database when you run dbt. Similarly, we can create staging scripts for the orders and state files within the models/staging directory, as shown below.
-- stg_orders.sql
with source as (
    select
        *
    from {{ ref('orders') }}
),

stage_orders as (
    select
        order_id,
        customer_order_id,
        order_status,
        order_purchase_timestamp,
        order_approved_at,
        order_delivered_carrier_date,
        order_delivered_customer_date,
        order_estimated_delivery_date
    from source
)

select
    *
from stage_orders
-- stg_state.sql
with source as (
    select
        *
    from {{ ref('state') }}
),

stage_state as (
    select
        st,
        state_name
    from source
)

select
    *
from stage_state
Note that here we use ref('file-name'), which references another model or seed file; this function is a key component of dbt's workflow. dbt figures out the data flow DAG using these references.
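To make this concrete, when dbt compiles stg_customer it swaps the ref() call for the concrete relation built from our profile (database tutorial, schema dbt_tutorial). The compiled sql written to the target directory looks roughly like the sketch below; the exact quoting and file path depend on the dbt version.
-- sketch of target/compiled/.../stg_customer.sql
with source as (
    select * from "tutorial"."dbt_tutorial"."customer"
),
stage_customer as (
    select
        customer_order_id,
        customer_unique_id,
        customer_zip_code_prefix,
        customer_city,
        customer_st
    from source
)
select
    *
from stage_customer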
A powerful feature of dbt is easy data testing. This is done by specifying test criteria in a schema.yml file. Let's create a file named schema.yml within the models/staging directory. Its content should be as shown below
version: 2

models:
  - name: stg_customer
    columns:
      - name: customer_unique_id
        tests:
          - not_null
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_order_id
        tests:
          - unique
          - not_null
  - name: stg_state
    columns:
      - name: st
        tests:
          - accepted_values:
              values:
                [
                  "AC", "AL", "AP", "AM", "BA", "CE", "DF", "ES", "GO",
                  "MA", "MT", "MS", "MG", "PA", "PB", "PR", "PE", "PI",
                  "RJ", "RN", "RS", "RO", "RR", "SC", "SP", "SE", "TO",
                ]
Here you can see how we define column level test cases for columns in your data models. We have defined
- unique and not null constraints for order_id and customer_order_id
- a not null constraint for customer_unique_id
- an accepted_values constraint for the st column
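Once the models are built (step 10), you can also run the tests for a single model while iterating. The --models selection flag below is from the dbt version used here; newer releases use --select for the same thing.
# run only the tests defined for the stg_orders model
dbt test --models stg_orders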
9. Transform
Now that you have the scripts for your staging models, it's time to create the final denormalized model. Within the mart folder, create a customer_orders.sql file with the content shown below
with customer as (
    select
        *
    from {{ ref('stg_customer') }}
),

orders as (
    select
        *
    from {{ ref('stg_orders') }}
),

state_map as (
    select
        *
    from {{ ref('stg_state') }}
),

final as (
    select
        customer.customer_unique_id,
        orders.order_id,
        case orders.order_status
            when 'delivered' then 1
            else 0
        end as order_status,
        state_map.state_name
    from orders
    inner join customer on orders.customer_order_id = customer.customer_order_id
    inner join state_map on customer.customer_st = state_map.st
)

select
    *
from final
In this script we load the staging data models as CTEs and join them, using a case expression on the order status, to get our final model.
It is good practice to have a schema.yml file to test the data quality. Create a file named schema.yml in the models/mart directory with the following content
version: 2

models:
  - name: customer_orders
    columns:
      - name: customer_unique_id
        description: Unique identifier for a customer
        tests:
          - not_null
      - name: order_status
        description: 1 if the order is delivered, 0 otherwise
        tests:
          - accepted_values:
              values: [1, 0]
      - name: state_name
        description: The full state name of the customer
We have an additional description option, which holds the column description to be displayed in the dbt UI.
It is that simple to perform modular transformations in dbt; there is no need to manually create and manage dependencies.
10. Running dbt
Open your terminal and go to the simple_dbt_project directory
- Check for any errors in your files using dbt debug; make sure everything shows OK before proceeding.
- Run dbt seed to load the data from the data folder into the database as tables.
- Run the dbt data transformation pipeline using dbt run.
- Run the tests you have specified in the schema.yml files using dbt test; it should say Completed successfully.
- Now we can check the tables created in our postgres database:
  - Log into our postgres instance using pgcli -h localhost -p 5432 -U start_data_engineer tutorial and the password password.
  - Use \dn to see the list of schemas in your database; you should see dbt_tutorial.
  - Use a simple select query to spot check your data: select * from dbt_tutorial.customer_orders limit 2;
  - Exit pgcli using \q.
- Now that we have seen our data loaded in, we can generate the docs and serve them on a front end using dbt docs generate and then dbt docs serve.
- If you go to http://localhost:8080/#!/overview in your browser you will be able to see the UI. The Project tab on the left pane shows the project folder structure and the column names and types. Select the Database option on the left pane to explore the tables, their descriptions, and a sample sql script.
- If you click on customer_orders and then on the view lineage graph icon, you should see the dependency graph as shown below.
Customer order lineage graph
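For reference, the full command sequence from this step looks like this, run from the simple_dbt_project directory:
dbt debug          # check config, profile and database connectivity
dbt seed           # load the csv files in data/ into the database as tables
dbt run            # build the staging views and the customer_orders table
dbt test           # run the tests defined in the schema.yml files
dbt docs generate  # compile the documentation
dbt docs serve     # serve the docs UI at http://localhost:8080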
Stop Docker containers
You can stop your postgres docker container using the following commands
docker stop $(docker ps -aq)
docker rm $(docker ps -aq)
Note that this stops all running docker containers.
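If you prefer to stop only the postgres container started in step 1, you can target it by the name we gave it (pg_local):
docker stop pg_local
docker rm pg_local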
Code
The project code is available at github
Conclusion
In this tutorial we have seen how to quickly get up and running with dbt. Hope this gives you some ideas for your next project. There are more advanced topics, such as macros, analysis, incremental loads, and scheduling, which we skipped, as this post was meant to give you a basic introduction to dbt.
Given the trend of the data engineering industry as a whole moving towards powerful data warehouses (AWS Redshift, Google BigQuery, etc.), I think tools like dbt will become more popular by
- helping non engineering users easily create and run transformation logic
- helping bridge the communication gap between engineers and analysts
dbt is already being used at companies like gitlab and betterment, and the team behind it recently raised a Series A (ref), showing growth in the industry trend towards building easy to use ETL tools. Let me know if you have any questions or comments in the comments section below.