How to ensure consistent metrics in your warehouse

1. Introduction

If you’ve worked on a data team, you’ve likely encountered situations where multiple teams define metrics in slightly different ways, leaving you to untangle why discrepancies exist.

These metric deviations often stem from prioritizing speed of data delivery over long-term maintainability. Imagine this common scenario: a company hires its first data professional, who writes an ad-hoc SQL query to compute a metric. Over time, multiple teams build their own datasets on top of this query, each tweaking the metric definition slightly.
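To make the drift concrete, here is a toy sketch (a hypothetical `orders` table in an in-memory SQLite database) of how two teams' slightly different definitions of "revenue" produce two different numbers:

```python
import sqlite3

# Hypothetical orders table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100.0, "complete"), (2, 50.0, "refunded"), (3, 75.0, "complete")],
)

# Team A's "revenue": every order counts.
revenue_a = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Team B's "revenue": refunds excluded -- a small tweak to the same metric.
revenue_b = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE status != 'refunded'"
).fetchone()[0]

print(revenue_a, revenue_b)  # 225.0 175.0 -- same metric name, two answers
```

Both queries are "correct" in isolation; the problem is that nothing forces them to agree.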

As the number of downstream consumers grows, so does the volume of ad-hoc requests to the data team to investigate inconsistencies. Before long, the team spends most of its time firefighting data bugs and reconciling metric definitions instead of delivering new insights. This cycle erodes trust, stifles career growth, and lowers team morale.

Multi-team table model causing discrepancies

This post explores two options to reduce ad-hoc data issues and empower consumers to derive insights independently.


2. Centralize Metric Definitions in Code

The foundation of consistency lies in defining critical metrics in a single repository. While code organization varies (e.g., medallion architecture, dbt), metric consolidation typically follows one of two approaches.

Option A: Semantic Layer for On-the-Fly Queries

A semantic layer allows you to define dimensions (columns in the GROUP BY clause) and metrics (computed columns) in a static format, often YAML files. Consumers (APIs, BI tools) query this layer, which dynamically generates SQL against raw tables.

Semantic layer architecture
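A minimal sketch of the idea in Python (the metric, dimension, and table names below are hypothetical; real tools such as Cube.js define these in YAML and also handle joins, caching, and access control):

```python
# Metrics and dimensions are defined exactly once, in one place.
METRICS = {
    "total_revenue": "SUM(amount)",
    "order_count": "COUNT(*)",
}
DIMENSIONS = {
    "order_date": "order_date",
    "region": "region",
}

def generate_sql(metrics: list[str], dimensions: list[str], table: str = "orders") -> str:
    """Generate a SQL query on the fly for any requested combination."""
    select_cols = [DIMENSIONS[d] for d in dimensions] + [
        f"{METRICS[m]} AS {m}" for m in metrics
    ]
    sql = f"SELECT {', '.join(select_cols)} FROM {table}"
    if dimensions:
        sql += f" GROUP BY {', '.join(DIMENSIONS[d] for d in dimensions)}"
    return sql

print(generate_sql(["total_revenue"], ["region"]))
# SELECT region, SUM(amount) AS total_revenue FROM orders GROUP BY region
```

Because every consumer goes through `generate_sql`, changing a metric definition in `METRICS` changes it everywhere at once.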

Pros:

- A single source of truth: every consumer sees the same metric definition.
- Flexible: consumers can combine any defined metrics and dimensions without filing a new request with the data team.

Cons:


Option B: Pre-Aggregated Tables for Consumers

This approach involves creating pre-computed tables with fixed metrics and dimensions tailored to specific use cases.

Pre-aggregated tables architecture
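A sketch of this approach (hypothetical table and column names, with in-memory SQLite standing in for the warehouse): a scheduled job materializes the metric once, and consumers read the small result table instead of re-deriving the metric themselves.

```python
import sqlite3

# Hypothetical raw table in the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("NA", 100.0), ("EU", 50.0), ("NA", 25.0)],
)

# Scheduled batch job: rebuild the pre-aggregated, consumer-facing table.
conn.execute("DROP TABLE IF EXISTS revenue_by_region")
conn.execute(
    """CREATE TABLE revenue_by_region AS
       SELECT region, SUM(amount) AS total_revenue
       FROM orders GROUP BY region"""
)

# Consumers query the small, fixed-schema table directly.
rows = dict(conn.execute("SELECT region, total_revenue FROM revenue_by_region"))
print(rows)
```

The metric logic lives in exactly one `CREATE TABLE ... AS SELECT` statement, but consumers can only slice by the dimensions baked into it.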

Pros:

- Fast and cheap to query, since metrics are computed once ahead of time.
- Simple for consumers: the table already contains exactly the numbers they need.

Cons:

- Inflexible: a new metric or dimension means changing the pipeline and backfilling the table.
- Tables tend to multiply per use case, which can reintroduce the very inconsistency problem you set out to solve.


3. Conclusion & Recap

Recap:

- Metric discrepancies usually arise when teams copy and tweak metric logic independently.
- Option A (semantic layer): define metrics and dimensions once; SQL is generated on the fly for whatever combination consumers request.
- Option B (pre-aggregated tables): pre-compute fixed metrics and dimensions into tables tailored to specific use cases.

Recommendation: If possible, adopt a semantic layer tool like Cube.js early to balance flexibility and scalability. Reserve pre-aggregated tables for niche use cases with strict cost constraints.

Please let me know in the comment section below if you have any questions or comments.


4. Required Reading

  1. What are facts and dimensions?
  2. Multi-hop architecture

If you found this article helpful, share it with a friend or colleague using one of the socials below!

Land your dream Data Engineering job with my free book!


Are you looking to enter the field of data engineering? And are you

> Overwhelmed by all the concepts/jargon/frameworks of data engineering?

> Feeling lost because there is no clear roadmap for someone to quickly get up to speed with the essentials of data engineering?

Learning to be a data engineer can be a long and rough road, but it doesn't have to be!

Imagine knowing the fundamentals of data engineering that are crucial to any data team. You will be able to quickly pick up any new tool or framework.

Sign up for my free Data Engineering 101 Course. You will get

✅ Instant access to my Data Engineering 101 e-book, which covers SQL, Python, Docker, dbt, Airflow & Spark.

✅ Executable code to practice and exercises to test yourself.

✅ Weekly email for 4 weeks with the exercise solutions.

Join now and get started on your data engineering journey!

Testimonials:

- "I really appreciate you putting these detailed posts together for your readers; you explain things in such a detailed, simple manner that's well organized and easy to follow. I appreciate it so so much!"
- "I have learned a lot from the course, which is much more practical."
- "This course helped me build a project and actually land a data engineering job! Thank you."

When you subscribe, you'll also get emails about data engineering concepts, development practices, career advice, and projects every 2 weeks (or so) to help you level up your data engineering skills. We respect your email privacy.
