Uplevel your dbt workflow with these tools and techniques

Dec 13, 2023 · 20 min read

1. Introduction
2. Setup
3. Ways to uplevel your dbt workflow
4. Conclusion
5. Further Reading
6. References

1. Introduction

Data Build Tool (dbt) is a powerful system that helps data engineers and analysts ship code faster, enabling good SWE best practices. dbt has had widespread adoption in the past few years. If you are wondering

How do you take your dbt development experience to the next level?

What dbt hacks do other DEs use that you wish you knew about?

How do you spend less time reviewing trivial stuff in PRs?

How do you enable your under-resourced data team to keep up with new features to improve business impact?

Is testing data only after building data the right approach?

Then this post is for you. In this post, we will go over some approaches you can quickly set up in your dbt project to improve development speed, confidently deploy while ensuring that your changes will not break datasets, enhance code quality, reduce feedback loop time, and ensure data quality.

By the end of this post, you will know about six concepts to speed up your dbt project’s feature delivery speed. You can also adopt these concepts to your dbt projects reasonably quickly.

2. Setup

Prerequisites

Clone the git repo.

git clone https://github.com/josephmachado/simple_dbt_project_dev.git
cd simple_dbt_project_dev

# Set dbt env variables to tell dbt where the profiles.yml and dbt_project.yml files are
export DBT_PROFILES_DIR=$(just profile-dir)
export DBT_PROJECT_DIR=$(just project-dir)

To fully grasp these concepts, you will need to know what dbt is and how to use it.

A common task with most systems is to run specific commands multiple times. In such cases, it’s beneficial to use custom scripts or a command runner like just that enables you to create shortcuts for the most commonly used commands.

Check out our justfile with shortcuts for the most commonly run commands. In the code examples below, we will show the command to run and include the corresponding shortcuts (with the just command runner) for your convenience.

3. Ways to uplevel your dbt workflow

This section will review six concepts to level up your dbt workflow. Each section will describe the problem it aims to solve, how to solve it with a tool/technique/argument, an example, and the tradeoffs you make when using the approach specified.

3.1. Reproducible environment

3.1.1. A virtual environment with Poetry

dbt is a Python project, so managing a Python environment will enable a reproducible codebase. Python suffers from dependency hell where different Python libraries, may depend on various versions of the same underlying library. We will use Poetry to handle dependency.

Poetry will use the pyproject.toml and poetry.lock files to identify the python libraries(along with the correct versions that work for all of our python project dependencies) needed. Let’s see how we can use Poetry to create and activate a virtual environment.

rm -rf .venv && poetry config virtualenvs.in-project true && poetry install --no-root 
# This can be run as 
# just create-venv
source .venv/bin/activate

You will now be in the virtual environment with all the Python libraries installed. To add/remove Python dependencies from the virtual environment, you will need to use Poetry.

3.1.2. Use Docker to run your warehouse locally

While a virtual environment provides good isolation for Python libraries, Docker allows you to run databases easily. We can use Docker to run a Postgres instance locally; this will serve as a data warehouse for local development and testing.

We will use Docker to run a Postgres instance locally for development; run the following command in your terminal.

docker run -d \
  --name postgres \
  -e POSTGRES_USER=dbt \
  -e POSTGRES_PASSWORD=password1234 \
  -e POSTGRES_DB=dbt \
  -v $(pwd)/raw_data:/input_data \
  -v $(pwd)/warehouse_setup:/docker-entrypoint-initdb.d \
  -p 5432:5432 \
  postgres:16

# You can run this with just start-db

The above command shows how to set the database name and credentials (with the POSTGRES_* environment variables). We mount our warehouse_setup volume; this folder contains the setup SQL scripts to run to create the base tables and users and grant them appropriate access.

Note However, if you are using a closed source system like Snowflake, BigQuery, or Redshift, you will need to have a dev environment setup and be able to connect to it. While using Postgres as a local dev alternative to closed source systems may work, there will be a lot of consistency issues with data types, functions available, and query plans.

3.2. Reduce feedback loop time when developing locally

One of the most significant issues when developing data pipelines is the time it takes to run a pipeline and then check/validate its outputs. Feedback loop refers to the repetitive process of changing and testing the output. Ideally, the feedback loop when developing should be as low as possible for an efficient development flow.

We will review some steps you can follow to ensure your feedback loop is as quick as possible.

3.2.1. Run only required dbt objects with selectors

One of the most significant issues with running the dbt run command is that it runs all the models. In most cases, your dbt project may have multiple unrelated pipelines; it is unnecessary to run all the models while testing a few models.

dbt offers the ability for the user to run only chosen models. One can choose which models to run with selectors (--select flag), which is available with all the dbt (run, test, etc) commands.

dbt offers four powerful ways to select which models to run.

Methods : Provides ways of selecting models to run (based on if they are seeds, tags, path, test name/type, file path, etc)
Graph operators : Used to select models to run based on a given models up/down stream dependencies.
Set operators : Used to combine (AND / OR) multiple selection criteria.
Exclude : Used to exclude models to run.

With these selectors, you can choose precisely which models to run. If you are testing locally, it is beneficial to only run the necessary models instead of using dbt run, which will run all the models in your dbt project.

An example of running only one model(customer_orders) is:

dbt run --select "customer_orders"
# List dbt objects that are tested and belong to source data
dbt ls --select "resource_type:test,source:*"

Another way to run only select models or tests is by using tags. While most of the selection methods specified here use folder paths or dbt concepts (source, package, etc.), the tag method enables the selection of arbitrary models/tests provided the appropriate tags are added to them.

For some of our tests (which we will go over in a later section), we tag them with the value of elementary. Tagging tests/models enables us to tell dbt only to run those models/tests.

For example, one of our tests has this tag:

  tests:
    - elementary.volume_anomalies:
        training_period:
          period: day
          count: 3000
        timestamp_column: order_purchase_timestamp
        time_bucket:
          period: day
          count: 1
        tags: [elementary]
        config:
          severity: error

As you see, this specific test elementary.volume_anomalies has added a tag called elementary. Using the following command, we can tell dbt to list only this test (and other tests with the elementary tag).

# Make sure that you are in the sde_dbt_tutorial directory
dbt ls --select "elementary"

While these selectors are extremely powerful, they require careful examination before running. It is critical to ensure that all your required models are being run based on the selectors used. Fortunately, dbt has a handy ls command that lets one check the list of models/tests to run based on your chosen selectors.

dbt ls --select "elementary,resource_type:test"

3.2.2. Use prod datasets to build dev models with defer

You may need to use production (or another environment’s) data to build a model. The need to use a different environment to build a model in your development environment can be because the upstream models do not exist in your environment, or you need access to data from a different environment, etc.

In such cases, you can use defer to tell dbt to use models from another environment to build your chosen model(s).

# make sure your virtual env is activated
just restart #. to clear out existing models in the dev and prod
just prod-run # to run all the dbt models in the prod environment

# We are copying over the manifest JSON file from the prod run to let dbt know which upstream models to use
rm -f ./sde_dbt_tutorial/prod-run-artifacts/manifest.json
cp ./sde_dbt_tutorial/target/manifest.json ./sde_dbt_tutorial/prod-run-artifacts/

# Run defer using pre-build prod tables and create a new model in dev
dbt run --select "customer_orders" --defer --state prod-run-artifacts
# You will see a log showing one model created

# Open warehouse with 
just warehouse

On your development warehouse(log in with just warehouse), you can check to see that only the customer_orders model exists.

select * from your_name_warehouse.dim_customers; -- table does not exist error
select * from your_name_warehouse.customer_orders limit 10; -- you will see results
set search_path to 'your_name_warehouse';
\d -- you will only see customer_orders tables in your your_name_warehouse schema
\q -- exit pgcli

The manifest.json file contains the dbt lineage graph details, test details, macro details etc doc .

With defer and state:modified selector, we can ensure that only the modified table built and all other data is from an alternate environment (prod/UAT, etc) or an older run of the current environment. Only creating the modified model is called slim ci.

Cons:

Managing manifest.json, if your use is complex with multiple schemas, etc., you will need to ensure that the dbt manifest files are copied from the appropriate location.
Using manifest has some caveats(such as issues with macros rendering the expected model name, variables, etc): https://docs.getdbt.com/reference/node-selection/state-comparison-caveats

3.2.3. Parallelize model building by increasing thread count

dbt builds models in parallel, and you can increase the number of models it can run at a given time by increasing the thread count. While increasing the thread count does increase the number of models built in parallel, it also impacts the database performance.

If you have a dedicated warehouse for development (like we do in our case), you can increase it; however, in production, you will need to ensure that you are not overtaxing the database and thus impacting other users/system data warehouse performance.

We can set thread count in profiles.yml .

3.3. Reduce the amount of code to write using dbt packages

One should ensure the reuse of existing code; this helps develop faster and reduces the chances of bugs.

3.3.1. dbt utils for commonly recreated SQL functionality

Building and maintaining a data warehouse is a pretty common task, and there are well-established patterns and codes for how to do this with dbt. When building a new feature with no native support by dbt, please check for it among the multiple dbt packages available.

A handy dbt package is dbt-utils; it provides standard functionality often needed for data processing but is missing natively in warehouses and dbt-macros. For example, if you want to pivot a table in a warehouse that does not have a native pivot function, you can use the pivot macro. We use the pivot macro in our order_status_pivot model.

SELECT
    EXTRACT(
        YEAR
        FROM
        order_approved_at
    ) AS order_year, --noqa: CV03
{{
    dbt_utils.pivot('order_status',
    dbt_utils.get_column_values(ref('customer_orders'), 'order_status')) }}
FROM
    {{ ref('customer_orders') }}
GROUP BY
    1

The above query gets compiled to the target folder (sde_dbt_tutorial/target/compiled/sde_dbt_tutorial/models/marts/marketing/order_status_pivot.sql). Check out dbt-utils documentation for a list of available features.

3.3.2. dbt expectations for extensive testing in dbt

Another good example is the dbt-expectations library, which provides a wide range of tests you can use in your dbt project (inspired by Python’s great expectations package). For example, we use dbt-expectations to compare column values.

  - dbt_expectations.expect_column_pair_values_A_to_be_greater_than_B:
          column_A: delivered
          column_B: invoiced
          or_equal: false
          row_condition: order_year > 2018

The con, however, is that you will not be able to control the implementation, which may be inefficient for your warehouse.

Check out dbt-expectations documentation for a list of available features.

3.4. Validate data before pushing changes to production

NOTE: data diff has been archived, do not use!

One of the most common ways of testing that your code changes are valid in a dbt project is by comparing the data generated using your code against the data already in the production database. Comparing development with production data ensures that the granularity remains the same and that the changes made do not have unintended consequences such as corrupting values in a column (e.g., lower casing string), changing units (dollars and cents etc.), etc.

data-diff is a Python library that compares datasets and shows their differences. For example, when changing a dataset, you can use data-diff to compare data in your development environment with data in production to ensure that your changes are not inadvertently affecting other columns.

just restart && just dev-run && just prod-run

rm -f ./sde_dbt_tutorial/prod-run-artifacts/manifest.json
cp ./sde_dbt_tutorial/target/manifest.json ./sde_dbt_tutorial/prod-run-artifacts/

# go to and ./sde_dbt_tutorial/models/marts/marketing/customer_orders.sql 
# and add a new col 'som col' as some_new_col,

cd sde_dbt_tutorial
dbt run --select "customer_orders" && data-diff --dbt --state ./prod-run-artifacts/manifest.json --select customer_orders -k order_id,customer_id
cd ..

You will see the diff as shown below.


dbt.warehouse.customer_orders <> dbt.your_name_warehouse.customer_orders 
Column(s) added: {'some_new_col'}
No row differences

Note that we can also use data-diff to compare datasets between different environments. In our example, both the datasets are in the same database (different schema); depending on the project, your data may be in the same database or another database. data-diff has options to compare data within and outside the same database ref docs .

Cons: While the open-source library offers a good comparison of datasets, the more valuable features are only available if you sign up for their cloud services. dbt audit helper is another option

3.5. Observe data quality with elementary

With data teams creating more models and becoming responsible for them, it’s essential that the data team understands the data quality, what/where/how failures occur, and data quality trends over time. To address the need to monitor data quality over time, a new term called data observability has emerged.

Multiple tools (paid, semi-free, free) handle observability for the data produced. One of the tools that focuses primarily on dbt is elementary. We will use the open-source elementary dbt package to add schema checks and anomaly detection and the elementary python package to create a data quality dashboard.

3.5.1. Schema checks

One of the main reasons for pipeline failures is due to unexpected schema changes. Elementary provides the ability to check schema changes of upstream tables(schema checks) and detect if your output model (consumed by a BI tool) has a schema change(exposure), which may lead to BI tool breaking.

Let’s add schema checks to our sources to detect unexpected schema changes(schema checks ). We can add these tests to the source table definitions as shown below.

tables:
  - name: customers
    config:
      tags: [raw_layer]
    columns:
      - name: customer_id
        tests: [not_null]
    tests:
      - elementary.schema_changes:
          tags: [elementary]
          config:
            severity: error

Let’s change our source schema and see how our tests fail; log into the warehouse with the just warehouse command. Add a new dummy column called textcol as shown below.

alter table raw_layer.customers add column textcol varchar;
\q

In your terminal, run the tests for the source tables, and you will see the schema check test fail. The schema check is helpful when another team/process controls data loading into the warehouse.

dbt test --select "source:*"

Elementary provides a schema check mechanism called ' exposure ' to ensure that the output data that our pipeline produces does not change types unexpectedly and break downstream consumers. Exposures enable us to catch unexpected data type changes in our data sets used by some BI tools/other consumers.

Our exposure for order_status_pivot model is as follows (exposure.yml )

---
version: 2
exposures:
  - name: order_status_pivot
    label: Order Status Pivot
    type: dashboard
    maturity: high
    url: https://your.bi.tool/dashboards/2
    description: >
      Shows year distribution of order by status
    depends_on: [ref('order_status_pivot')]
    owner:
      name: John Reese
      email: john.reese@themachine.com
    meta:
      referenced_columns:
        - column_name: order_year
          data_type: numeric

Let’s make a type change in the order_status_pivot model to check how exposures alert us on a potential issue. Make the following changes to your order_status_pivot.sql file.

- EXTRACT(
-     YEAR
-     FROM
-     order_approved_at
- ) AS order_year,
+ CAST(EXTRACT(
+         YEAR
+         FROM
+         order_approved_at
+     ) AS VARCHAR) AS order_year,

Now run the model and then test it as shown below.

dbt run --select "order_status_pivot"
dbt test --select "order_status_pivot"

You will see the error Failure in test elementary_exposure_schema_validity_order_status_pivot_ (models/marts/marketing/marketing.yml).

3.5.2. Anomaly detection

A particularly tricky issue to detect is value changes over time. Elementary provides a suite of anomaly detection tests to ensure

Metrics along key dimensions remain within the allowed threshold
Column values remain within the allowed threshold
Volume and freshness of data are within the allowed threshold

By default, elementary considers any values with a standard score of 3 or above an anomaly(detection algorithm ). Elementary allows you to modify arguments to control the anomaly detection algorithm. There are different types of anomalies that elementary can detect ; the general idea is to monitor a number over multiple dbt runs to see if the value generated in a run is outside the expected range. This number can be:

row count
freshness of the datasets(based on some timestamp column)
metrics from the group by dimension(s)
column statistics such as null rate, length, min, max

Let’s see an example of monitoring row count over multiple dbt runs. Let’s add an anomaly check based on row count for the customer_orders model as shown below.

    tests:
      - elementary.volume_anomalies:
          training_period:
            period: day
            count: 3000
          timestamp_column: order_purchase_timestamp
          time_bucket:
            period: day
            count: 1
          tags: [elementary]
          config:
            severity: error

In the above example, elementary will run volume anomaly detection with the given parameters (parameter docs ).

The red points indicate anomalies in the volume count of the customer_orders model.

3.5.3. Build data observability reports

Elementary also allows generating a report with all the test results. Elementary will gather all the tests that were run as part of your dbt test (available at ./sde_dbt_tutorial/target/compiled/sde_dbt_tutorial/tests/*) and generate a static HTML in the ./edr_target/elementary_report.html location.

Here, one can see all the tests run and anomaly detection results.

3.6. Streamline common tasks

In this section, we will go over some tips to speed up dbt development and testing.

3.6.1. [VS Code] Use the dbt power user extension

dbt uses jinja2, ref, etc., to create SQL queries(compile to SQL), which run on your warehouse. The common problems associated with this jinja2 -> SQL compilation approach are

There is no autocomplete of SQL commands
Not knowing the exact sql query to be executed
Having to run dbt docs to create and view lineage
Not knowing if the referenced database object(table or view exists)
Needing to run parents/children of a model quickly

While it is possible to check the above by compiling and inspecting the target folder and using dbt selectors, the dbt power user vscode extension provides this via an easy to use vscode extenstion. The dbt power user extension provides autocomplete, column lineage, complied view, sql validation, checking for presence of referenced database objects, easy to edit documentation & go to definition code exploration.

Note: some features(e.g. column lineage) require sigining up for their services.

3.6.2. Set up auto grants to streamline end-user access

Granting permission to the models created by dbt is a common requirement. Historically done with a post hook, where after the models run, you’d set up a GRANT stakeholder SELECT ACCESS TO some_model; or similar. But dbt now has a feature dedicated to providing grants to models. You can specify grants for all the models in a path via dbt_project.ym l, & models via individual config.

We have added a grant config in our dbt_project.yml to ensure that the user stakeholder(in actual project this will be a user group) has select access to all the models in marketing path.

models:
  sde_dbt_tutorial:
    # Applies to all files under models/marts/core/
    marts:
      core:
        materialized: table
      marketing:
        materialized: view
        +grants:
          select: ['stakeholder']

Note that this only grants access to models; you must ensure that the user/role has usage access for your schemas. We have added schema usage permissions via our setup script here .

3.6.3. Reduce time to production by avoiding PR nits

A typical comment among PR reviews is to format the code better or argue over formatting style. It is beneficial to have a standard style guide to ensure the code has a consistent feel. Having style standards automated ensures that it is easy to follow, and any dev new to the team is empowered to focus on the core feature delivery and not spend time fixing formatting issues.

There are two main concepts to understand

Linter tells us what’s wrong with our code regarding style and code correctness.
Formatter formats the code to conform to the standard style.

3.6.3.1. Lint and format sql with sqlfluff

sqlfluff is an SQL linter and a formatter. While there are multiple sql linters and formatters, we chose sqlfluff since it has good support for dbt (macros, jinja2, etc). We use the .sqlfluff to provide dbt specific settings (like where the dbt project file is, how to format macros, etc) and the .sqlfluffignore to ignore the folders to format.

We can lint and format sql files with sqlfluff via terminal as shown below

sqlfluff lint sde_dbt_tutorial/models --dialect postgres # just lint-sql
sqlfluff fix sde_dbt_tutorial/models --dialect postgres --show-lint-violations # just format-sql

3.6.3.2. Lint and format yaml with yamllint

We use yamllint and yamlfix to lint and format our yaml files. Note that dbt cloud uses prettier (installed with npm, yarn, etc.) to format; however, to keep our libraries Python focussed, we have opted for yamllint & yamlfix. We can lint and format yaml files as shown below.

yamllint sde_dbt_tutorial/models sde_dbt_tutorial/snapshots sde_dbt_tutorial/dbt_project.yml sde_dbt_tutorial/packages.yml sde_dbt_tutorial/profiles.yml # just lint-yml

yamlfix sde_dbt_tutorial/models # just format-yml

3.6.4. Parse manifest.json to inspect the dbt graph

When we run dbt, it creates the required lineage graph (i.e., order of debt models to run), compiles dbt code to raw SQL, associates tags and tests with the corresponding model, etc., and stores all this information in the target folder, (./sde_dbt_tutorial/target). Within the target folder, the manifest.json file contains the references to which models/tests to run and how they are associated.

Parsing this manifest file is an excellent way to inspect our models/tests via code. Let’s look at an example; our marketing.yml file shows a test based on a table that does not exist.

  - name: non_existent_table
    columns:
      - name: non_existent
        tests: [unique]

dbt will compile these tests but not run them since the table non_existent_table does not exist. As shown below, we can identify such cases with a simple Python script.

The above Python script reads the manifest.json file and checks if any tests do not depend on any node if it raises a warning nudging the engineer to remove unused code. The code also checks if any model does not use the ref function (i.e., hardcoded table names).

3.6.5. Autorun linting & checks locally before opening a PR to save on CI costs

Usually, your CI pipeline will run checks and tests to ensure the PR is up to standard. You can reduce the time taken at CI runs by preventing issues by adding the checks as a pre-commit git hook. The pre-commit git hook ensures that checks and tests run before a developer puts up a PR, saving potential CI time (if there are issues with the code).

As shown below, you can add a pre-commit hook to your .git/hooks folder.

echo -e '
#!/bin/sh
just ci
' > .git/hooks/pre-commit
chmod ug+x .git/hooks/*

Now, the just ci command will run each time you try to add a commit(just ci will run before the code commit).

4. Conclusion

This article gives you a good idea of how you can improve your dbt development experience. To recap, we went over.

Reproducible environments
Reducing feedback loop when developing locally
Reducing the amount of code to write by leveraging dbt packages
Validating data and Data Observability
Streamlining common tasks

If you are working on a dbt project where the development speed(dev start to feature in prod time) is slow, identify where it could be better & fix them using the concepts specified above.

If you have any questions or comments, please leave them in the comment section below. If you have any issues with running the code, please open a GitHub issue here .

5. Further Reading

6. References

If you found this article helpful, share it with a friend or colleague using one of the socials below!

Land your dream Data Engineering job!

Overwhelmed by all the concepts you need to learn to become a data engineer? Have difficulty finding good data projects for your portfolio? Are online tutorials littered with sponsored tools and not foundational concepts?

Learning data engineer can be a long and rough road, but it doesn't have to be!

Pick up any new tool/framework with a clear understanding of data engineering fundamentals. Demonstrate your expertise by building well-documented real-world projects on GitHub.

Join now and get started on your data engineering journey!

Testimonials:

I really appreciate you putting these detailed posts together for your readers, you explain things in such a detailed, simple manner that's well organized and easy to follow. I appreciate it so so much!

I have learned a lot from the course which is much more practical.

This course helped me build a project and actually land a data engineering job! Thank you.

When you subscribe, you'll also get emails about data engineering concepts, development practices, career advice, and projects every 2 weeks (or so) to help you level up your data engineering skills. We respect your email privacy.

Built with Kit

Comment anonymously

M ↓ Markdown

Upvotes Newest Oldest

May 2024 Calendar

0 points

15 months ago

By the end of this post, you will know about six concepts to speed up your dbt project’s feature delivery speed. You can also adopt these concepts to your dbt projects reasonably quickly.

Anonymous

0 points

12 months ago

really cool stuff to read and learn. thank you for the effort and sharing :)

February Calendar 2024

0 points

18 months ago

Anonymous

0 points

18 months ago

Great article

Commento