OpenMetadata: The Complete Guide to Modern Data Cataloging
Your data estate has outgrown tribal knowledge — tables nobody owns, pipelines nobody understands, dashboards nobody trusts. OpenMetadata fixes that with a unified platform for discovery, lineage, quality, and governance. This is the complete guide from zero to production.
I. What metadata really is
The textbook definition, data about data, is technically correct but practically useless.
A better definition of metadata would be:
Context that helps systems and humans understand, trust, discover, govern, operate, and automate around data assets.
Notice something important here? Metadata is not merely descriptive. It is operational intelligence.
Example: suppose you have a table called orders with the following row:
| order_id | customer_id | amount |
|---|---|---|
| 101 | 55 | 4200 |
This is the data.
Now let's see what counts as metadata:
| Information | Type |
|---|---|
| table name | metadata |
| column names | metadata |
| datatype | metadata |
| owner | metadata |
| refresh frequency | metadata |
| downstream dashboard | metadata |
| lineage | metadata |
| quality score | metadata |
| query count | metadata |
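As a rough sketch, the rows of that table can be pictured as one record that lives next to the data but is separate from it. The class name, field names, and all values below are illustrative assumptions, not an OpenMetadata schema.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Hypothetical container for the metadata listed above (not an OpenMetadata class)."""
    name: str
    columns: dict[str, str]                      # column name -> datatype
    owner: str
    refresh_frequency: str
    downstream_dashboards: list[str] = field(default_factory=list)
    upstream_tables: list[str] = field(default_factory=list)  # lineage
    quality_score: float = 0.0
    query_count: int = 0


# Every value here is made up for illustration.
orders_meta = TableMetadata(
    name="orders",
    columns={"order_id": "int", "customer_id": "int", "amount": "int"},
    owner="sales-data-team",
    refresh_frequency="hourly",
    downstream_dashboards=["daily_sales_dashboard"],
    upstream_tables=["raw_orders"],
    quality_score=0.97,
    query_count=1240,
)

# The data itself: one row of the orders table.
orders_row = {"order_id": 101, "customer_id": 55, "amount": 4200}
```

Nothing in `orders_meta` changes when a new order arrives, and nothing in `orders_row` tells you who owns the table. That separation is the point.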
Key Insight
Everything about the data becomes metadata. Modern systems care as much about metadata as about data itself.
II. Evolution of data systems
In the early era, the setup was simple:
Application -> Single Database
No metadata platform needed.
Engineers knew everything mentally. Tribal Knowledge sufficed.
Then companies added:
OLTP DB -> ETL -> Warehouse -> Reports
This was still manageable. Maybe spreadsheets documented things, with some small wiki pages here and there.
Beginning of the modern data stack explosion
Kafka, Spark, Airflow, dbt, Snowflake, BigQuery, Databricks, S3, ML pipelines, Tableau, Power BI, Microservices, Real-time streams, Feature stores
Complexity arrives
Now complexity explodes. Human cognition breaks. Nobody knows:
- what depends on what,
- which dataset is trusted,
- where data originated,
- who owns pipelines,
- which dashboard is correct.
Key Insight
This is the birth condition for metadata platforms.
Metadata platforms are fundamentally complexity management systems. This is their real purpose.
III. Metadata categories
There are primarily 7 categories of metadata:
Technical Metadata
Structure: tables, schemas, columns, datatypes, partitions, indexes, file formats, and storage locations.
Operational Metadata
Runtime behavior: query frequency, freshness, latency, failures, SLA violations, and storage growth.
Business Metadata
Human meaning: definitions, owners, glossary terms, domains, SLAs, and steward assignments.
Governance Metadata
Control layer: PII tags, GDPR labels, retention policies, access classifications, and sensitivity levels.
Lineage Metadata
Dependencies: where data came from, how it changed, and which consumers depend on it.
Usage Metadata
Interaction: who queries what, which dashboards matter, and which assets are dormant.
Semantic Metadata
Meaning in the AI era: concepts, relationships, embeddings, similarity, and ontology mappings.
1. Technical Metadata (Describes structure)
Examples are table names, schemas, datatypes, partitions, indexes, file formats, and storage locations.
This is the oldest metadata type. Traditional catalogs mostly stopped here.
2. Operational Metadata (Describes runtime behavior)
This includes query frequency, pipeline failures, freshness, latency, runtime, SLA violations, and storage growth.
This transforms metadata into living operational telemetry.
3. Business Metadata (This is where human meaning enters)
Examples are business definitions, owners, glossary terms, domains, SLAs, and steward assignments.
Why business metadata matters
Without business metadata, companies fight endlessly over definitions.
Finance says:
Revenue = gross sales
Analytics says:
Revenue = post-discount value
Product team says:
Revenue = subscription MRR
Now every dashboard disagrees. Metadata systems attempt to solve this.
4. Governance Metadata
This is the security and compliance layer.
Examples are PII tags, GDPR labels, retention policies, access classifications, and sensitivity levels.
This becomes critical in enterprises.
5. Lineage Metadata (Very Valuable)
It describes dependencies.
Example:
raw_orders -> clean_orders -> daily_sales_dashboard
Lineage answers
With lineage, we can answer:
- What breaks if schema changes?
- Where did this metric originate?
- Why is this dashboard wrong?
6. Usage Metadata
Describes human/system interaction.
Examples: who queried tables, dashboard popularity, dormant assets, active consumers.
Usage metadata helps teams:
- discover important datasets
- identify dead pipelines
- optimize systems
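One simple way to picture "identify dead pipelines" is a last-queried timestamp per asset and an idle threshold. The sketch below assumes usage data has already been reduced to such timestamps. The asset names, dates, and the 90-day cutoff are all illustrative.

```python
from datetime import date, timedelta

# Hypothetical usage metadata: the last day each asset was queried.
LAST_QUERIED = {
    "orders": date(2024, 6, 28),
    "daily_sales": date(2024, 6, 30),
    "legacy_export": date(2023, 11, 2),
}


def dormant_assets(last_queried, today, max_idle_days=90):
    """Return assets nobody has queried within `max_idle_days` of `today`."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(name for name, seen in last_queried.items() if seen < cutoff)


print(dormant_assets(LAST_QUERIED, today=date(2024, 7, 1)))  # ['legacy_export']
```

A dormant asset is only a candidate for removal. Lineage should still be checked before anything is dropped.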
7. Semantic Metadata
Increasingly important in the AI era.
It describes:
- meaning
- conceptual relationships
- embeddings
- similarity
- ontology mappings
Example: customer ≈ client ≈ account_holder
This is becoming huge with LLM systems.
IV. Metadata graph thinking
If you think of metadata as a table of information, you are missing part of the picture. Modern metadata systems are graph systems.
Why graphs instead of tables? Because relationships matter more than isolated assets.
Nodes and Edges
Nodes
Nodes are entities such as:
- tables
- topics
- dashboards
- users
- pipelines
- ML models
Edges
Edges, or links, are relationships such as:
- owns
- depends_on
- upstream_of
- consumes
- produces
Why graphs?
Real-world metadata questions are graph traversals.
Example: “If I deprecate this Kafka topic, what breaks?”
This is a dependency traversal problem.
Another Example: “Which dashboards are affected if column X changes?”
Column X -> Table -> dbt model -> Dashboard
Metadata Graph enables impact analysis.
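The traversal itself is small once the graph exists. Here is a minimal sketch of impact analysis as a breadth-first walk over "consumed by" edges. The adjacency dictionary and asset names are invented for illustration, and a real platform would query its own graph store instead.

```python
from collections import deque

# Edges point from an asset to the assets that consume it (illustrative names).
DOWNSTREAM = {
    "orders.column_x": ["orders"],
    "kafka.order_events": ["orders"],
    "orders": ["dbt.daily_sales"],
    "dbt.daily_sales": ["dashboard.executive", "dashboard.finance"],
}


def impacted_assets(start, graph):
    """Return every asset reachable downstream of `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for consumer in graph.get(node, []):
            if consumer not in seen:
                seen.add(consumer)
                order.append(consumer)
                queue.append(consumer)
    return order


print(impacted_assets("orders.column_x", DOWNSTREAM))
# ['orders', 'dbt.daily_sales', 'dashboard.executive', 'dashboard.finance']
```

The same function answers the Kafka question: call it with `"kafka.order_events"` as the starting node.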
Key Note
Metadata systems are partially knowledge graph systems.
V. Active metadata
Traditional catalogs were passive, meaning they were:
- manually updated
- stale
- documentation-oriented
- disconnected from runtime systems
Example:
- wiki pages
- spreadsheets
- manually written docs
They became outdated immediately.
Modern systems = active metadata
Modern systems emit:
- events
- lineage updates
- schema changes
- query stats
- freshness metrics
- ownership updates
With these signals, metadata becomes real-time operational context.
Examples
An Airflow pipeline fails.
OpenMetadata updates these automatically:
- pipeline status
- freshness warning
- downstream impact
- activity feed
A schema changes: customer_name → full_name
The metadata system can:
- detect change
- compute downstream impact
- trigger alerts
- update lineage
- notify owners
This is not “documentation”. This is operational automation.
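To show what "operational automation" might look like for the rename above, here is a small sketch that turns one change event into owner notifications. The event class, the in-memory lineage and owner maps, and the message format are all assumptions made for illustration. This is not how OpenMetadata implements it.

```python
from dataclasses import dataclass


@dataclass
class SchemaChange:
    """Hypothetical change event: a column of `table` was renamed."""
    table: str
    old_column: str
    new_column: str


# Hypothetical in-memory state; a real platform reads this from its metadata store.
DOWNSTREAM = {"customers": ["customer_360"], "customer_360": ["churn_dashboard"]}
OWNERS = {"customers": "crm-team", "customer_360": "analytics", "churn_dashboard": "bi-team"}


def downstream_of(table):
    """Collect all transitive consumers of `table` with a depth-first walk."""
    found, stack = [], [table]
    while stack:
        for child in DOWNSTREAM.get(stack.pop(), []):
            if child not in found:
                found.append(child)
                stack.append(child)
    return found


def handle_schema_change(event):
    """Compute downstream impact and address one alert to each impacted asset's owner."""
    return [
        f"to {OWNERS[asset]}: {event.table}.{event.old_column} was renamed to "
        f"{event.new_column}; {asset} may break"
        for asset in downstream_of(event.table)
    ]


alerts = handle_schema_change(SchemaChange("customers", "customer_name", "full_name"))
for line in alerts:
    print(line)
```

The handler does three of the listed steps in one pass: compute downstream impact, trigger alerts, and notify owners. Detecting the change and updating lineage would sit before and after it.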
Modern systems are too dynamic. Manual governance fails completely at scale.
VI. Lineage fundamentals
Basic definition:
Lineage describes how data moves and transforms through systems over time.
But again, that definition is too shallow.
A better definition:
Lineage is a dependency graph that models the flow, transformation, and propagation of data across systems.
Imagine:
orders table -> daily_sales aggregation -> executive dashboard
This already forms lineage.
Meaning:
- dashboard depends on aggregation
- aggregation depends on orders
Without lineage
Modern systems become operationally dangerous, because nobody knows:
- what depends on what,
- what breaks downstream,
- where metrics originated,
- whether data is trustworthy.
Lineage solves this by providing dependency visibility.
Lineage has existed conceptually for decades, but it became critical because of distributed data systems.
Earlier, lineage was small enough for humans to understand manually:
Database -> Nightly ETL -> Warehouse
Modern lineage shape
Modern systems look like this:
Kafka -> Spark -> Data Lake -> dbt -> Warehouse -> Dashboards -> ML Models
Thousands of transformations -> Thousands of dependencies -> Human reasoning collapses.
Types of Lineage
Lineage exists at multiple granularities:
Dataset/Table-Level Lineage
raw_orders -> clean_orders -> daily_sales
Good for impact analysis, system overview, and operational debugging.
Column-Level Lineage
Tracks individual field propagation.
raw_orders.amount -> SUM(amount) -> daily_sales.total_revenue
Pipeline Lineage
Describes orchestration dependencies.
Airflow DAG A -> dbt Job B -> ML Pipeline C
End-to-End Lineage
Full ecosystem graph.
Kafka Topic -> Spark Job -> Iceberg Table -> dbt Model -> Snowflake Table -> Dashboard
Column-Level Lineage Example
Column-level lineage tracks individual field propagation.
raw_orders.amount -> SUM(amount) -> daily_sales.total_revenue
This is technically very hard, because SQL transformations must be parsed precisely.
Consider a SQL query that derives a revenue column from two columns of the orders table, for example price * quantity AS revenue.
The lineage engine must infer that orders.price and orders.quantity produce revenue.
This is not trivial at all. It requires SQL parsing, AST analysis, and expression mapping.
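To make the expression-mapping step concrete, here is a deliberately tiny sketch. It handles only `SELECT <expr> AS <alias>, ... FROM <table>` with regular expressions, and the sample query (price * quantity AS revenue) is an assumption chosen to match the inference described above. Production lineage engines build a full AST. This toy breaks on commas inside function calls, on SQL keywords inside expressions, on joins, and on subqueries.

```python
import re

# Identifiers that are not immediately followed by "(" (so function names are skipped).
IDENTIFIER = re.compile(r"\b[A-Za-z_]\w*\b(?!\s*\()")


def column_lineage(sql):
    """Map each output column to the source columns its expression mentions."""
    match = re.match(r"\s*SELECT\s+(.*?)\s+FROM\s+(\w+)", sql, re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("only simple SELECT ... FROM <table> statements are supported")
    select_list, table = match.groups()
    lineage = {}
    for item in select_list.split(","):
        item = item.strip()
        aliased = re.match(r"(.+?)\s+AS\s+(\w+)$", item, re.IGNORECASE)
        # Without an alias, the item is treated as a plain column that keeps its name.
        expr, output = aliased.groups() if aliased else (item, item)
        sources = dict.fromkeys(IDENTIFIER.findall(expr))  # de-duplicate, keep order
        lineage[output] = [f"{table}.{column}" for column in sources]
    return lineage


print(column_lineage("SELECT order_id, price * quantity AS revenue FROM orders"))
# {'order_id': ['orders.order_id'], 'revenue': ['orders.price', 'orders.quantity']}
```

Even this small case needs three decisions: where the select list ends, which token is the alias, and which identifiers are columns rather than functions. Each of those gets harder with real SQL.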
Lineage direction
Upstream lineage: Data sources feeding current asset.
Downstream lineage: Consumers affected by current asset.
Lineage Derivation
How does a metadata system know lineage? There are several methods to derive it.
Static SQL Parsing
The system parses SQL statements and infers dependencies from reads, writes, joins, aliases, CTEs, nested queries, unions, and functions.
Query Log Analysis
Warehouses expose query histories, so metadata systems can infer lineage from the queries that actually executed.
Orchestration Metadata
Airflow and dbt expose DAGs, jobs, and task dependencies that can be converted into lineage edges.
OpenLineage Events
Systems emit lineage events directly, so the platform does not have to guess lineage only from logs or SQL.
i. Static SQL Parsing
The system parses SQL statements.
For example, given a statement that reads from orders and writes to sales_summary, the parser infers:
orders → sales_summary
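For the simplest statements, that inference can be sketched with two regular expressions. The SQL text below is an assumed example that produces the edge above. The function is a toy: it would misread CTE names as source tables, and it knows nothing about CREATE TABLE AS, MERGE, or subqueries.

```python
import re


def table_lineage(sql):
    """Return (source, target) edges for a simple INSERT INTO ... SELECT statement."""
    target = re.search(r"\bINSERT\s+INTO\s+([\w.]+)", sql, re.IGNORECASE)
    if target is None:
        return []
    sources = re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    # dict.fromkeys removes duplicate sources while keeping first-seen order.
    return [(source, target.group(1)) for source in dict.fromkeys(sources)]


sql = """
INSERT INTO sales_summary
SELECT customer_id, SUM(amount) AS total
FROM orders
GROUP BY customer_id
"""
print(table_lineage(sql))  # [('orders', 'sales_summary')]
```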
Real SQL becomes far more complicated than this.
The parser must then understand aliases, joins, CTEs, nested queries, unions, and functions.
This is a hard compiler-style problem. Lineage systems often contain mini SQL compilers.
ii. Query Log Analysis
Warehouses expose query histories. Examples:
- Snowflake query logs
- BigQuery audit logs
Metadata systems infer lineage from executed queries.
Tradeoff
Advantage: it reflects actual runtime usage.
Disadvantage: context is sometimes incomplete.
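A sketch of the idea, assuming the log has already been normalized into records that list which tables each executed query read and wrote. That record shape is hypothetical. Real warehouse logs have their own schemas and usually need parsing first.

```python
from collections import Counter

# Hypothetical, already-normalized log records (not a real warehouse schema).
QUERY_LOG = [
    {"reads": ["raw_orders"], "writes": ["clean_orders"]},
    {"reads": ["clean_orders", "customers"], "writes": ["daily_sales"]},
    {"reads": ["raw_orders"], "writes": ["clean_orders"]},
    {"reads": ["daily_sales"], "writes": []},  # a plain SELECT produces no edge
]


def edges_from_log(log):
    """Count how often each (source, target) pair was observed in executed queries."""
    counts = Counter()
    for record in log:
        for target in record["writes"]:
            for source in record["reads"]:
                counts[(source, target)] += 1
    return counts


edge_counts = edges_from_log(QUERY_LOG)
for (source, target), seen in edge_counts.items():
    print(f"{source} -> {target} (seen {seen}x)")
```

The counts are what static parsing cannot give you: an edge seen twice yesterday is live, while an edge that exists only in code may never run.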
iii. Orchestration Metadata
Airflow and dbt provide dependency information.
Example:
task_a → task_b
The metadata system converts these into lineage:
- task dependencies
- DAG structure
- job execution
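The conversion can be sketched as follows, assuming each task declares its upstream tasks and the datasets it reads and writes. The dictionary layout is invented for illustration. Airflow and dbt expose this information through their own APIs and artifacts, which are not modeled here.

```python
# Hypothetical description of a two-task DAG.
DAG = {
    "task_a": {"upstream": [], "reads": ["raw_orders"], "writes": ["clean_orders"]},
    "task_b": {"upstream": ["task_a"], "reads": ["clean_orders"], "writes": ["daily_sales"]},
}


def pipeline_edges(dag):
    """Task-to-task edges taken directly from the declared dependencies."""
    return [(up, task) for task, spec in dag.items() for up in spec["upstream"]]


def dataset_edges(dag):
    """Dataset-to-dataset edges, labelled with the task that performs the write."""
    return [
        (source, target, task)
        for task, spec in dag.items()
        for source in spec["reads"]
        for target in spec["writes"]
    ]


print(pipeline_edges(DAG))  # [('task_a', 'task_b')]
print(dataset_edges(DAG))
# [('raw_orders', 'clean_orders', 'task_a'), ('clean_orders', 'daily_sales', 'task_b')]
```

Keeping the task name on each dataset edge lets a lineage view jump from a table to the job that produced it.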
iv. OpenLineage Events
This is the most modern approach. Systems emit lineage events directly.
Instead of guessing lineage, systems explicitly publish it.
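Here is a rough sketch of what such an event can carry. The top-level fields (eventType, eventTime, run, job, inputs, outputs, producer) follow the general shape of an OpenLineage run event, but this is a simplified, unvalidated approximation. The namespace, job name, and producer URL are placeholders, and the helper functions are hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, inputs, outputs, namespace="example_namespace"):
    """Build a simplified lineage event as a plain dict (not validated against the spec)."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        "producer": "https://example.com/my-etl-job",  # placeholder URL
    }


def edges_from_event(event):
    """Every input dataset feeds every output dataset of the run."""
    return [(i["name"], o["name"]) for i in event["inputs"] for o in event["outputs"]]


event = build_run_event("build_daily_sales", ["clean_orders"], ["daily_sales"])
payload = json.dumps(event)                    # what the job would send
print(edges_from_event(json.loads(payload)))   # [('clean_orders', 'daily_sales')]
```

The receiving side does no SQL parsing at all. The job states its inputs and outputs, and the platform records the edge.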
Column Lineage Is Hard
Table lineage is relatively manageable. But column lineage is much harder.
Example: a query builds full_name from customers.first_name and customers.last_name.
The lineage engine must infer:
customers.first_name, customers.last_name -> full_name
Now add:
- CASE statements
- nested subqueries
- window functions
- UDFs
- macros
- dynamic SQL
Reality check
Complexity explodes. Many lineage tools advertise column-level lineage, but accuracy varies dramatically.
Another classification of lineage
Static lineage
Derived from code, SQL, configs, DAG definitions
Pros: predictable and easier to obtain
Cons: may not reflect runtime reality
Runtime lineage
Derived from actual executions, query logs, emitted events
Pros: reflects reality
Cons: operationally harder
VII. Why Lineage Is Valuable
Impact Analysis
“What breaks if this changes?” This is the core lineage use case.
Root Cause Analysis
Dashboard wrong? Then trace backward: Dashboard -> dbt model -> warehouse table -> pipeline. Find the failure source.
Trust & Certification
If an upstream asset is unreliable, downstream trust decreases.
This is trust propagation.
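One simple way to model this, offered as an assumption and not as how any particular platform scores trust, is to cap each asset's effective trust at the lowest trust found anywhere upstream of it. The scores and the acyclic upstream map below are illustrative.

```python
from functools import lru_cache

# Hypothetical lineage (asset -> direct upstream assets) and per-asset quality scores.
UPSTREAM = {
    "raw_orders": [],
    "clean_orders": ["raw_orders"],
    "daily_sales": ["clean_orders"],
    "dashboard": ["daily_sales"],
}
OWN_SCORE = {"raw_orders": 0.6, "clean_orders": 0.95, "daily_sales": 0.9, "dashboard": 1.0}


@lru_cache(maxsize=None)
def effective_trust(asset):
    """Assumed rule: an asset is only as trustworthy as its weakest ancestor.

    Requires an acyclic lineage graph; a cycle would recurse forever.
    """
    return min([OWN_SCORE[asset]] + [effective_trust(up) for up in UPSTREAM[asset]])


print(effective_trust("dashboard"))  # 0.6
```

The dashboard scores 1.0 on its own checks but inherits 0.6 from raw_orders. A certification badge on the dashboard would be misleading without that propagation.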
Governance
Lineage helps answer where PII flows, which systems consume sensitive data, and how retention rules propagate.
This is very important for compliance.
Cost Optimization
Unused downstream assets can be identified, dead pipelines discovered, and expensive transformations analyzed.
Important Limitations
Lineage is not magically perfect.
Real-world problems:
- dynamic SQL
- hidden transformations
- external scripts
- missing instrumentation
- incomplete logs
- manual uploads
- spreadsheets
So lineage graphs are often probabilistic or incomplete.