Introduction
Everyone agrees that data modeling is essential. In reality, tables are built haphazardly, without any modeling standards.
If you have felt that:

- Building pipelines insanely fast for quick wins, then spending hours or days fixing them, is unending drudgery
- No one really cares about getting things right the first time, because everything will be rebuilt anyway
- The insanely hyped and hyper-funded data platforms are pushing bad practices in a land grab for your data

then this post is for you.
Imagine being a data engineer with:

- Work-life balance
- Predictable career growth
- A reputation as a leader

That is what the techniques in this post will enable you to have.
In this post, we will go over six techniques to help you build well-designed warehouses insanely fast and demonstrate your value.
Prerequisites:
- Delivering Value and Building Trust in a Quick-Win Industry
We can separate the six techniques into two sections:
- End user-facing documents: These documents ensure the data team and end-users are on the same page.
- Data pipeline design techniques: These techniques ensure that the data team builds resilient pipelines.
End-User-Facing Documents
The techniques in this section are for documents intended for the end-user of your data. Following these will enable you to
- Demonstrate your value
- Build resilient warehouse foundations
- Reduce chances of miscommunication between teams
Use the Bus Matrix to Demonstrate Your Value
Stakeholders speak in business jargon. Telling them that you are working on a fact table called orders would not be as impactful as showing how it can help them.
This is where a bus matrix can help.
A bus matrix is a table with each row indicating a business process and each column a business dimension.
Each column value indicates whether that dimension is associated with the row’s business process.
For example, consider the bus matrix for a Bike Part Seller Warehouse.
| Business Process | Customer | Supplier | Part | Order Date | Ship Date | Nation | Region |
|---|---|---|---|---|---|---|---|
| Order Purchases | ✓ | ✓ | ✓ | ✓ | | | |
| Order Returns | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| LineItem Discounts | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
We can expand this to include the names of the fact table(s) for each business process, so it also serves as a short-form ERD. With a bus matrix, we get:
- Stakeholders to care about the outcome of your work by tying it to their outcomes
- A way to demonstrate progress and your value to the company
Set End-User Expectations with Data Contracts
Tables are your team’s end-user-facing products. A false assumption about a table can lead to its misuse. This is where data contracts help.
A data contract is a document that outlines what end-users can expect from a dataset. The data contract should be easy for end users to discover and view.
Here are five data contract attributes that will get you a long way:
- Table grain: Indicates what each row in the table means.
- SLA: The time between when an event occurs and when it's ready for use by the end-user.
- Table schema: Column names and their data types.
- Owner: The name of the team to be contacted for any questions. This typically also includes expected response times.
- Data quality checks: List of data quality checks that have been run on this table. This should be understandable by end-users.
A data contract is a document; how you enforce or check it depends on the tools available to you.
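As an illustration, here is a minimal sketch of those five attributes captured in code. Everything in it (the table, schema, and owner) is hypothetical; in practice, the contract might live as a YAML file or in your data catalog, as long as end-users can easily find it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataContract:
    """The five data contract attributes listed above."""

    table: str
    grain: str                    # what each row in the table means
    sla_hours: int                # event occurrence -> ready for end-user use
    table_schema: dict[str, str]  # column name -> data type
    owner: str                    # team to contact, with expected response time
    quality_checks: list[str] = field(default_factory=list)


# Hypothetical contract for an orders fact table.
orders_contract = DataContract(
    table="fact_orders",
    grain="One row per order",
    sla_hours=24,
    table_schema={
        "order_key": "BIGINT",
        "customer_key": "BIGINT",
        "order_date": "DATE",
        "total_price": "DECIMAL(12, 2)",
    },
    owner="core-data team (responds within 1 business day)",
    quality_checks=[
        "order_key is unique and not null",
        "total_price is non-negative",
    ],
)
```

Because the contract is plain data, the same structure can be rendered into end-user docs or validated against the warehouse automatically.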
Data Pipeline Design Techniques
Data teams are under immense pressure to deliver fast. Protect your time and money with the following techniques.
Insert-only Fact and Snapshot Dimension Tables Work For Most Use Cases
Most business questions can be answered with well-defined Kimball fact and dimension tables.
Follow these techniques for easy-to-maintain tables:
- Insert-only fact tables: Do not update the fact table once data is inserted (unless backfilling).
- Snapshot dimension tables: Completely recreate dimension tables on each pipeline run (ref). Keep historical snapshots for a few years; SCD2 should be limited.
- Only one grain per table: Multiple grains per table will cause issues with joins or group bys.
- Combine flag-conditioned columns: If a column's usage is conditioned on another flag column, combine them into individual columns, one per flag value (e.g., FLAG_COL_VALUE_1_COLUMN_NAME, FLAG_COL_VALUE_2_COLUMN_NAME, etc.).
- Use naming conventions: If your company does not have one, use an existing one, like the Kimball naming convention.
- Do not create surrogate keys: With advances in technology, you can directly join fact and dimension tables.
- Create a view interface for end-users: If you change the underlying schema, column names, etc., it will not impact your end-users. With snapshot tables, the view can also be used to select the most recent snapshot (see the sketch after this list).
Break the above recommendations if you have a clear use case.
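To make the pattern concrete, here is a minimal sketch using DuckDB. The staging tables, names, and columns are hypothetical stand-ins; the point is the shape: facts are only appended, the dimension is fully rebuilt as a dated snapshot, and a view hides the snapshot logic from end-users.

```python
import duckdb

con = duckdb.connect()  # in-memory database, just for this sketch

# Hypothetical staging tables standing in for your extraction step.
con.execute("CREATE TABLE stg_orders AS "
            "SELECT 1 AS order_key, 10 AS customer_key, "
            "DATE '2024-01-01' AS order_date, 99.50 AS total_price")
con.execute("CREATE TABLE stg_customer AS "
            "SELECT 10 AS customer_key, 'Acme' AS customer_name, 'US' AS nation")

# Insert-only fact table: rows are appended, never updated.
con.execute("CREATE TABLE IF NOT EXISTS fact_orders ("
            "order_key BIGINT, customer_key BIGINT, "
            "order_date DATE, total_price DECIMAL(12, 2))")
con.execute("INSERT INTO fact_orders SELECT * FROM stg_orders")

# Snapshot dimension: fully recreated every run and tagged with the
# run date, instead of updating rows in place (no SCD2 bookkeeping).
con.execute("CREATE TABLE IF NOT EXISTS dim_customer ("
            "snapshot_date DATE, customer_key BIGINT, "
            "customer_name VARCHAR, nation VARCHAR)")
con.execute("INSERT INTO dim_customer "
            "SELECT CURRENT_DATE, customer_key, customer_name, nation "
            "FROM stg_customer")

# View interface: end-users always see the latest snapshot, so schema
# or logic changes underneath do not break their queries.
con.execute("CREATE OR REPLACE VIEW customer AS "
            "SELECT * FROM dim_customer WHERE snapshot_date = "
            "(SELECT MAX(snapshot_date) FROM dim_customer)")

print(con.execute(
    "SELECT * FROM fact_orders f JOIN customer c USING (customer_key)"
).fetchall())
```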
Data Quality Checks to Prevent High-Impact Issues
Using incorrect data can cause irreversible damage to the business and erode trust in the data. We need to ensure that the data end-users consume has been checked.
Here are four data quality checks that provide the highest ROI for the time invested:

- Table constraints: Check table constraints such as uniqueness, allowed values, and not null.
- Referential integrity checks: Ensure that column values in one table have corresponding values in another table. For example, a dimension key in the fact table should be present in the corresponding dimension table; otherwise, any join on those tables will drop the fact row.
- Reconciliation checks: Ensure that the row counts and key numerical aggregates (e.g., revenue) of the table match (or are within a threshold of) those of the source table.
- Metric variance checks: Ensure that the current run's table metrics do not deviate too much from their historical values.
- Use the WAP (Write-Audit-Publish) pattern to run the DQ checks (sketched after this list).
- Here are more types of data quality checks.
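Here is a minimal Write-Audit-Publish sketch in DuckDB. The tables, sample data, and the variance threshold are all hypothetical; the point is that data lands in a staging table, is audited with checks like the ones above, and is published to the end-user-facing name only if every check passes.

```python
import duckdb

con = duckdb.connect()  # in-memory database, just for this sketch
con.execute("CREATE TABLE dim_customer AS "
            "SELECT 10 AS customer_key, 'Acme' AS customer_name")

# 1. WRITE: land new data in a staging table, not the live one.
con.execute("CREATE TABLE stg_fact_orders AS "
            "SELECT 1 AS order_key, 10 AS customer_key, 99.50 AS total_price")

# 2. AUDIT: each check is a query that must return true.
checks = {
    "order_key is unique and not null":
        "SELECT COUNT(*) = COUNT(DISTINCT order_key) "
        "AND COUNT(*) = COUNT(order_key) FROM stg_fact_orders",
    "every customer_key exists in dim_customer":
        "SELECT COUNT(*) = 0 FROM stg_fact_orders f "
        "LEFT JOIN dim_customer d ON f.customer_key = d.customer_key "
        "WHERE d.customer_key IS NULL",
    "revenue within an assumed variance threshold":
        "SELECT SUM(total_price) BETWEEN 0 AND 1000000 FROM stg_fact_orders",
}
for name, sql in checks.items():
    if not con.execute(sql).fetchone()[0]:
        raise ValueError(f"DQ check failed, aborting publish: {name}")

# 3. PUBLISH: only data that passed every check reaches end-users.
con.execute("ALTER TABLE stg_fact_orders RENAME TO fact_orders")
```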
Debug Data Issues using Data Lineage
As a data engineer, you will inevitably face questions about where data comes from and have to track down where a bug in the logic lies. For this, you need a way to trace the lineage of a table.
Some tools offer lineage out of the box (SQLMesh, dbt, etc.).
Without lineage, debugging data issues takes far more work and invites confusion.
Lineage is critical in audit-intensive fields such as healthcare and finance.
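If your tools do not provide lineage, even a hand-maintained table-level map goes a long way. Here is a minimal sketch with hypothetical table names that walks upstream from a broken table to find candidate sources of a bug:

```python
# Hypothetical table-level lineage: table -> tables it reads from.
LINEAGE = {
    "fact_orders": ["stg_orders"],
    "dim_customer": ["stg_customer"],
    "customer_revenue_mart": ["fact_orders", "dim_customer"],
}


def upstream(table):
    """Return every table feeding `table`, nearest first."""
    seen, queue = [], list(LINEAGE.get(table, []))
    while queue:
        t = queue.pop(0)
        if t not in seen:
            seen.append(t)
            queue.extend(LINEAGE.get(t, []))
    return seen


# A bug in customer_revenue_mart? Inspect these tables, in order:
print(upstream("customer_revenue_mart"))
# ['fact_orders', 'dim_customer', 'stg_orders', 'stg_customer']
```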
Centralize Metric Definitions
As the company grows, so does the number of metrics and the teams that track them.
We need to ensure that metric definitions live in one place, making them easier to find and debug.
The two main options for this are:

- Semantic layer: Define metric definitions in one place and have systems re-query them as needed.
- Data mart: Tables with the metrics available to query.
Ensure that metrics have clear ownership.
The more metrics are scattered across teams, the harder it will be to debug issues.
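As one minimal illustration of the semantic-layer option (the metric names, tables, and query generator here are all hypothetical), metric definitions can live in a single shared module, and every consumer generates its query from that one source of truth:

```python
from __future__ import annotations

# Single source of truth: metric name -> definition and owner.
METRICS = {
    "total_revenue": {
        "sql": "SUM(total_price)",
        "table": "fact_orders",
        "owner": "core-data team",
    },
    "order_count": {
        "sql": "COUNT(DISTINCT order_key)",
        "table": "fact_orders",
        "owner": "core-data team",
    },
}


def metric_query(name: str, group_by: str | None = None) -> str:
    """Generate SQL for a metric from its central definition."""
    m = METRICS[name]
    select = f"{m['sql']} AS {name}"
    if group_by:
        return (f"SELECT {group_by}, {select} FROM {m['table']} "
                f"GROUP BY {group_by}")
    return f"SELECT {select} FROM {m['table']}"


print(metric_query("total_revenue", group_by="order_date"))
# SELECT order_date, SUM(total_price) AS total_revenue FROM fact_orders GROUP BY order_date
```

Because every team queries through the same definitions, a metric bug has exactly one place to be found and fixed, and each metric carries an explicit owner.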
Conclusion
To recap, we saw how to:

- Use a bus matrix to demonstrate your value
- Set end-user expectations with data contracts
- Create insert-only fact and snapshot dimension tables for maintainability
- Use high-ROI DQ checks
- Debug data issues using data lineage
- Centralize metric definitions
Bookmark this post. The next time you are in a time crunch to build a table, follow these techniques; you will be off building fast, stable tables while demonstrating your value to the company.