Introduction
Everyone agrees that data modeling is essential. In reality, tables are built haphazardly, without any modeling standards.
If you have felt that:

- Building pipelines insanely fast for quick wins, then spending hours or days fixing them, is unending drudgery
- No one really cares about getting things right the first time, because everything will be rebuilt anyway
- The insanely hyped and hyper-funded data platforms are pushing bad practices in a land grab for your data

then this post is for you.
Imagine being a data engineer with:

- Work-life balance
- Predictable career growth
- A reputation as a leader

That is what the techniques in this post will enable you to have.
In this post, we will go over six techniques to help you build well-designed warehouses insanely fast and demonstrate your value.
Prerequisites:
- Delivering Value and Building Trust in a Quick-Win Industry
We can separate the six techniques into two sections:
- End user-facing documents: These documents ensure the data team and end-users are on the same page.
- Data pipeline design techniques: These techniques ensure that the data team builds resilient pipelines.
End-User-Facing Documents
The techniques in this section are for documents intended for the end-user of your data. Following these will enable you to
- Demonstrate your value
- Build resilient warehouse foundations
- Reduce chances of miscommunication between teams
Use the Bus Matrix to Demonstrate Your Value
Stakeholders speak in business jargon. Telling them that you are working on a fact table called orders would not be as impactful as showing how it can help them.
This is where a bus matrix can help.
A bus matrix is a table with each row indicating a business process and each column a business dimension.
Each column value indicates whether that dimension is associated with the row’s business process.
For example, consider the bus matrix for a Bike Part Seller Warehouse.
| Business Process | Customer | Supplier | Part | Order Date | Ship Date | Nation | Region |
|---|---|---|---|---|---|---|---|
| Order Purchases | ✓ | ✓ | ✓ | ✓ | | | |
| Order Returns | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| LineItem Discounts | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
We can expand this to include the names of the fact table(s) for each business process, so it also serves as a short-form ERD. With a bus matrix, we get:
- Stakeholders to care about the outcome of your work by tying it to their outcomes
- A way to demonstrate progress and your value to the company
Set End-User Expectations with Data Contracts
Tables are your team’s end-user-facing products. A false assumption about a table can lead to its misuse. This is where data contracts help.
A data contract is a document that outlines what end-users can expect from a dataset. The data contract should be easy for end users to discover and view.
Here are five data contract attributes that will get you a long way:
- Table grain: Indicates what each row in the table means.
- SLA: The time between when an event occurs and when it's ready for use by the end-user.
- Table schema: Column names and their data types.
- Owner: The name of the team to be contacted for any questions. This typically also includes expected response times.
- Data quality checks: List of data quality checks that have been run on this table. This should be understandable by end-users.
A data contract is a document; how you enforce or check it depends on the tools available to you.
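As an illustration, here is a minimal sketch of those five attributes captured in code. Everything in it (the table, schema, and owner) is hypothetical; in practice, the contract might live as a YAML file or in your data catalog, as long as end-users can easily find it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataContract:
    """The five data contract attributes listed above."""

    table: str
    grain: str                    # what each row in the table means
    sla_hours: int                # event occurrence -> ready for end-user use
    table_schema: dict[str, str]  # column name -> data type
    owner: str                    # team to contact, with expected response time
    quality_checks: list[str] = field(default_factory=list)


# Hypothetical contract for an orders fact table.
orders_contract = DataContract(
    table="fact_orders",
    grain="One row per order",
    sla_hours=24,
    table_schema={
        "order_key": "BIGINT",
        "customer_key": "BIGINT",
        "order_date": "DATE",
        "total_price": "DECIMAL(12, 2)",
    },
    owner="core-data team (responds within 1 business day)",
    quality_checks=[
        "order_key is unique and not null",
        "total_price is non-negative",
    ],
)
```

Because the contract is plain data, the same structure can be rendered into end-user docs or validated against the warehouse automatically.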
Data Pipeline Design Techniques
Data teams are under immense pressure to deliver fast. Protect your time and money with the following techniques.
Insert-only Fact and Snapshot Dimension Tables Work For Most Use Cases
Most business questions can be answered with well-defined Kimball fact and dimension tables.
Follow these techniques for easy-to-maintain tables:
- Insert-only fact tables: Do not update the fact table once data is inserted (unless backfilling).
- Snapshot dimension tables: Completely recreate dimension tables on each pipeline run (ref). Keep historical snapshots for a few years; SCD2 should be limited.
- Only one grain per table: Multiple grains per table will cause issues with joins or group bys.
- Combine flag-conditioned columns: If a column's usage is conditioned on another flag column, combine them into individual columns, one per flag value (e.g., FLAG_COL_VALUE_1_COLUMN_NAME, FLAG_COL_VALUE_2_COLUMN_NAME, etc.).
- Use naming conventions: If your company does not have one, use an existing one, like the Kimball naming convention.
- Do not create surrogate keys: With advances in technology, you can directly join fact and dimension tables.
- Create a view interface for end-users: If you change the underlying schema, column names, etc., it will not impact your end-users. With snapshot tables, the view can also be used to select the most recent snapshot (see the sketch after this list).
Break the above recommendations if you have a clear use case.
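To make the pattern concrete, here is a minimal sketch using DuckDB. The staging tables, names, and columns are hypothetical stand-ins; the point is the shape: facts are only appended, the dimension is fully rebuilt as a dated snapshot, and a view hides the snapshot logic from end-users.

```python
import duckdb

con = duckdb.connect()  # in-memory database, just for this sketch

# Hypothetical staging tables standing in for your extraction step.
con.execute("CREATE TABLE stg_orders AS "
            "SELECT 1 AS order_key, 10 AS customer_key, "
            "DATE '2024-01-01' AS order_date, 99.50 AS total_price")
con.execute("CREATE TABLE stg_customer AS "
            "SELECT 10 AS customer_key, 'Acme' AS customer_name, 'US' AS nation")

# Insert-only fact table: rows are appended, never updated.
con.execute("CREATE TABLE IF NOT EXISTS fact_orders ("
            "order_key BIGINT, customer_key BIGINT, "
            "order_date DATE, total_price DECIMAL(12, 2))")
con.execute("INSERT INTO fact_orders SELECT * FROM stg_orders")

# Snapshot dimension: fully recreated every run and tagged with the
# run date, instead of updating rows in place (no SCD2 bookkeeping).
con.execute("CREATE TABLE IF NOT EXISTS dim_customer ("
            "snapshot_date DATE, customer_key BIGINT, "
            "customer_name VARCHAR, nation VARCHAR)")
con.execute("INSERT INTO dim_customer "
            "SELECT CURRENT_DATE, customer_key, customer_name, nation "
            "FROM stg_customer")

# View interface: end-users always see the latest snapshot, so schema
# or logic changes underneath do not break their queries.
con.execute("CREATE OR REPLACE VIEW customer AS "
            "SELECT * FROM dim_customer WHERE snapshot_date = "
            "(SELECT MAX(snapshot_date) FROM dim_customer)")

print(con.execute(
    "SELECT * FROM fact_orders f JOIN customer c USING (customer_key)"
).fetchall())
```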
Data Quality Checks to Prevent High-Impact Issues
Using incorrect data can cause irreversible damage to the business and erode trust in the data. We need to ensure that the data end-users consume has been checked.
Here are four data quality checks that provide the highest ROI for the time invested:

- Table constraints: Check table constraints such as uniqueness, allowed values, and not null.
- Referential integrity checks: Ensure that column values in one table have corresponding values in another table. For example, a dimension key in the fact table should be present in the corresponding dimension table; otherwise, any join on those tables will drop the fact row.
- Reconciliation checks: Ensure that the row counts and key numerical aggregates (e.g., revenue) of the table match (or are within a threshold of) those of the source table.
- Metric variance checks: Ensure that the current run's table metrics do not deviate too much from their historical values.
- Use the WAP (Write-Audit-Publish) pattern to run the DQ checks (sketched after this list).
- Here are more types of data quality checks.
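Here is a minimal Write-Audit-Publish sketch in DuckDB. The tables, sample data, and the variance threshold are all hypothetical; the point is that data lands in a staging table, is audited with checks like the ones above, and is published to the end-user-facing name only if every check passes.

```python
import duckdb

con = duckdb.connect()  # in-memory database, just for this sketch
con.execute("CREATE TABLE dim_customer AS "
            "SELECT 10 AS customer_key, 'Acme' AS customer_name")

# 1. WRITE: land new data in a staging table, not the live one.
con.execute("CREATE TABLE stg_fact_orders AS "
            "SELECT 1 AS order_key, 10 AS customer_key, 99.50 AS total_price")

# 2. AUDIT: each check is a query that must return true.
checks = {
    "order_key is unique and not null":
        "SELECT COUNT(*) = COUNT(DISTINCT order_key) "
        "AND COUNT(*) = COUNT(order_key) FROM stg_fact_orders",
    "every customer_key exists in dim_customer":
        "SELECT COUNT(*) = 0 FROM stg_fact_orders f "
        "LEFT JOIN dim_customer d ON f.customer_key = d.customer_key "
        "WHERE d.customer_key IS NULL",
    "revenue within an assumed variance threshold":
        "SELECT SUM(total_price) BETWEEN 0 AND 1000000 FROM stg_fact_orders",
}
for name, sql in checks.items():
    if not con.execute(sql).fetchone()[0]:
        raise ValueError(f"DQ check failed, aborting publish: {name}")

# 3. PUBLISH: only data that passed every check reaches end-users.
con.execute("ALTER TABLE stg_fact_orders RENAME TO fact_orders")
```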
Debug Data Issues using Data Lineage
As a data engineer, you will inevitably face questions about where data comes from and have to track down where a bug in the logic lies. For this, you need a way to trace the lineage of a table.
Some tools offer lineage out of the box (SQLMesh, dbt, etc.).
Without lineage, debugging data issues takes far more work and invites confusion.
Lineage is critical in audit-intensive fields such as healthcare and finance.
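If your tools do not provide lineage, even a hand-maintained table-level map goes a long way. Here is a minimal sketch with hypothetical table names that walks upstream from a broken table to find candidate sources of a bug:

```python
# Hypothetical table-level lineage: table -> tables it reads from.
LINEAGE = {
    "fact_orders": ["stg_orders"],
    "dim_customer": ["stg_customer"],
    "customer_revenue_mart": ["fact_orders", "dim_customer"],
}


def upstream(table):
    """Return every table feeding `table`, nearest first."""
    seen, queue = [], list(LINEAGE.get(table, []))
    while queue:
        t = queue.pop(0)
        if t not in seen:
            seen.append(t)
            queue.extend(LINEAGE.get(t, []))
    return seen


# A bug in customer_revenue_mart? Inspect these tables, in order:
print(upstream("customer_revenue_mart"))
# ['fact_orders', 'dim_customer', 'stg_orders', 'stg_customer']
```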
Centralize Metric Definitions
As the company grows, so does the number of metrics and the teams that track them.
We need to ensure that metric definitions live in one place, making them easier to find and debug.
The two main options for this are:

- Semantic layer: Define metric definitions in one place and have systems re-query them as needed.
- Data mart: Tables with the metrics available to query.
Ensure that metrics have clear ownership.
The more metrics are scattered across teams, the harder it will be to debug issues.
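As one minimal illustration of the semantic-layer option (the metric names, tables, and query generator here are all hypothetical), metric definitions can live in a single shared module, and every consumer generates its query from that one source of truth:

```python
from __future__ import annotations

# Single source of truth: metric name -> definition and owner.
METRICS = {
    "total_revenue": {
        "sql": "SUM(total_price)",
        "table": "fact_orders",
        "owner": "core-data team",
    },
    "order_count": {
        "sql": "COUNT(DISTINCT order_key)",
        "table": "fact_orders",
        "owner": "core-data team",
    },
}


def metric_query(name: str, group_by: str | None = None) -> str:
    """Generate SQL for a metric from its central definition."""
    m = METRICS[name]
    select = f"{m['sql']} AS {name}"
    if group_by:
        return (f"SELECT {group_by}, {select} FROM {m['table']} "
                f"GROUP BY {group_by}")
    return f"SELECT {select} FROM {m['table']}"


print(metric_query("total_revenue", group_by="order_date"))
# SELECT order_date, SUM(total_price) AS total_revenue FROM fact_orders GROUP BY order_date
```

Because every team queries through the same definitions, a metric bug has exactly one place to be found and fixed, and each metric carries an explicit owner.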
Conclusion
To recap, we saw how to:

- Use a bus matrix to demonstrate your value
- Set end-user expectations with data contracts
- Create insert-only fact and snapshot dimension tables for maintainability
- Use high-ROI DQ checks
- Debug data issues using data lineage
- Centralize metric definitions
Bookmark this post. The next time you are in a time crunch to build a table, follow these techniques; you will be off building fast, stable tables while demonstrating your value to the company.