How to ensure consistent metrics in your warehouse

1. Introduction

If you’ve worked on a data team, you’ve likely encountered situations where multiple teams define metrics in slightly different ways, leaving you to untangle why discrepancies exist.

These metric deviations often stem from prioritizing speed of data delivery over long-term maintainability. Imagine this common scenario: a company hires its first data professional, who writes an ad-hoc SQL query to compute a metric. Over time, multiple teams build their own datasets on top of this query, each tweaking the metric definition slightly.
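To make the drift concrete, here is a toy sketch (a hypothetical `orders` table in an in-memory SQLite database) of how two teams' slightly different definitions of "revenue" produce two different numbers:

```python
import sqlite3

# Hypothetical orders table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100.0, "complete"), (2, 50.0, "refunded"), (3, 75.0, "complete")],
)

# Team A's "revenue": every order counts.
revenue_a = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Team B's "revenue": refunds excluded -- a small tweak to the same metric.
revenue_b = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE status != 'refunded'"
).fetchone()[0]

print(revenue_a, revenue_b)  # 225.0 175.0 -- same metric name, two answers
```

Both queries are "correct" in isolation; the problem is that nothing forces them to agree.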

As the number of downstream consumers grows, so does the volume of ad-hoc requests to the data team to investigate inconsistencies. Before long, the team spends most of its time firefighting data bugs and reconciling metric definitions instead of delivering new insights. This cycle erodes trust, stifles career growth, and lowers team morale.

Multi-team table model causing discrepancies

This post explores two options to reduce ad-hoc data issues and empower consumers to derive insights independently.


2. Centralize Metric Definitions in Code

The foundation of consistency lies in defining critical metrics in a single repository. While code organization varies (e.g., medallion architecture, dbt), metric consolidation typically follows one of two approaches.

Option A: Semantic Layer for On-the-Fly Queries

A semantic layer allows you to define dimensions (columns in the GROUP BY clause) and metrics (computed columns) in a static format, often YAML files. Consumers (APIs, BI tools) query this layer, which dynamically generates SQL against raw tables.

Semantic layer architecture
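A minimal sketch of the idea in Python (the metric, dimension, and table names below are hypothetical; real tools such as Cube.js define these in YAML and also handle joins, caching, and access control):

```python
# Metrics and dimensions are defined exactly once, in one place.
METRICS = {
    "total_revenue": "SUM(amount)",
    "order_count": "COUNT(*)",
}
DIMENSIONS = {
    "order_date": "order_date",
    "region": "region",
}

def generate_sql(metrics: list[str], dimensions: list[str], table: str = "orders") -> str:
    """Generate a SQL query on the fly for any requested combination."""
    select_cols = [DIMENSIONS[d] for d in dimensions] + [
        f"{METRICS[m]} AS {m}" for m in metrics
    ]
    sql = f"SELECT {', '.join(select_cols)} FROM {table}"
    if dimensions:
        sql += f" GROUP BY {', '.join(DIMENSIONS[d] for d in dimensions)}"
    return sql

print(generate_sql(["total_revenue"], ["region"]))
# SELECT region, SUM(amount) AS total_revenue FROM orders GROUP BY region
```

Because every consumer goes through `generate_sql`, changing a metric definition in `METRICS` changes it everywhere at once.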

Pros:

- A single source of truth: every consumer sees the same metric definition.
- Flexible: consumers can combine any defined metrics and dimensions without filing a new request with the data team.

Cons:


Option B: Pre-Aggregated Tables for Consumers

This approach involves creating pre-computed tables with fixed metrics and dimensions tailored to specific use cases.

Pre-aggregated tables architecture
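A sketch of this approach (hypothetical table and column names, with in-memory SQLite standing in for the warehouse): a scheduled job materializes the metric once, and consumers read the small result table instead of re-deriving the metric themselves.

```python
import sqlite3

# Hypothetical raw table in the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("NA", 100.0), ("EU", 50.0), ("NA", 25.0)],
)

# Scheduled batch job: rebuild the pre-aggregated, consumer-facing table.
conn.execute("DROP TABLE IF EXISTS revenue_by_region")
conn.execute(
    """CREATE TABLE revenue_by_region AS
       SELECT region, SUM(amount) AS total_revenue
       FROM orders GROUP BY region"""
)

# Consumers query the small, fixed-schema table directly.
rows = dict(conn.execute("SELECT region, total_revenue FROM revenue_by_region"))
print(rows)
```

The metric logic lives in exactly one `CREATE TABLE ... AS SELECT` statement, but consumers can only slice by the dimensions baked into it.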

Pros:

- Fast and cheap to query, since metrics are computed once ahead of time.
- Simple for consumers: the table already contains exactly the numbers they need.

Cons:

- Inflexible: a new metric or dimension means changing the pipeline and backfilling the table.
- Tables tend to multiply per use case, which can reintroduce the very inconsistency problem you set out to solve.


3. Conclusion & Recap

Recap:

- Metric discrepancies usually arise when teams copy and tweak metric logic independently.
- Option A (semantic layer): define metrics and dimensions once; SQL is generated on the fly for whatever combination consumers request.
- Option B (pre-aggregated tables): pre-compute fixed metrics and dimensions into tables tailored to specific use cases.

Recommendation: If possible, adopt a semantic layer tool like Cube.js early to balance flexibility and scalability. Reserve pre-aggregated tables for niche use cases with strict cost constraints.

Please let me know in the comment section below if you have any questions or comments.


4. Required Reading

  1. What are facts and dimensions?
  2. Multi-hop architecture

If you found this article helpful, share it with a friend or colleague using one of the socials below!

Land your dream Data Engineering job with my free book!


Are you looking to enter the field of data engineering? And are you

> Overwhelmed by all the concepts/jargon/frameworks of data engineering?

> Feeling lost because there is no clear roadmap for someone to quickly get up to speed with the essentials of data engineering?

Learning to be a data engineer can be a long and rough road, but it doesn't have to be!

Imagine knowing the fundamentals of data engineering that are crucial to any data team. You will be able to quickly pick up any new tool or framework.

Sign up for my free Data Engineering 101 Course. You will get

✅ Instant access to my Data Engineering 101 e-book, which covers SQL, Python, Docker, dbt, Airflow & Spark.

✅ Executable code to practice and exercises to test yourself.

✅ Weekly email for 4 weeks with the exercise solutions.

Join now and get started on your data engineering journey!

Testimonials:

- "I really appreciate you putting these detailed posts together for your readers; you explain things in such a detailed, simple manner that's well organized and easy to follow. I appreciate it so so much!"
- "I have learned a lot from the course, which is much more practical."
- "This course helped me build a project and actually land a data engineering job! Thank you."

When you subscribe, you'll also get emails about data engineering concepts, development practices, career advice, and projects every 2 weeks (or so) to help you level up your data engineering skills. We respect your email privacy.
