OpenMetadata: The Complete Guide to Modern Data Cataloging

I. What metadata really is

The textbook definition, data about data, is technically correct but practically useless.

A better definition of metadata would be:

Context that helps systems and humans understand, trust, discover, govern, operate, and automate around data assets.

Notice something important here? Metadata is NOT merely descriptive. It is operational intelligence.

Example: Suppose you have a table called orders with the following row:

order_id	customer_id	amount
101	55	4200

This is the DATA

Now let’s see what counts as metadata:

Information	Type
table name	metadata
column names	metadata
datatype	metadata
owner	metadata
refresh frequency	metadata
downstream dashboard	metadata
lineage	metadata
quality score	metadata
query count	metadata

Key Insight

Everything about the data becomes metadata. Modern systems care as much about metadata as about data itself.

II. Evolution of data systems

In the early era, the setup was simple:

Application -> Single Database

No metadata platform needed.

Engineers kept everything in their heads. Tribal knowledge sufficed.

Then companies added:

OLTP DB -> ETL -> Warehouse -> Reports

This was still manageable. Spreadsheets might document things, along with a few small wiki pages here and there.

Beginning of the modern data stack explosion

Kafka, Spark, Airflow, dbt, Snowflake, BigQuery, Databricks, S3, ML pipelines, Tableau, Power BI, Microservices, Real-time streams, Feature stores

Complexity arrives

Now complexity explodes. Human cognition breaks.

Nobody knows:

what depends on what,
which dataset is trusted,
where data originated,
who owns pipelines,
which dashboard is correct.

Key Insight

This is the birth condition for metadata platforms.

Metadata platforms are fundamentally complexity management systems. This is their real purpose.

III. Metadata categories

There are primarily 7 categories of metadata:

Technical Metadata
Structure: tables, schemas, columns, datatypes, partitions, indexes, file formats, and storage locations.
Operational Metadata
Runtime behavior: query frequency, freshness, latency, failures, SLA violations, and storage growth.
Business Metadata
Human meaning: definitions, owners, glossary terms, domains, SLAs, and steward assignments.
Governance Metadata
Control layer: PII tags, GDPR labels, retention policies, access classifications, and sensitivity levels.
Lineage Metadata
Dependencies: where data came from, how it changed, and which consumers depend on it.
Usage Metadata
Interaction: who queries what, which dashboards matter, and which assets are dormant.
Semantic Metadata
Meaning in the AI era: concepts, relationships, embeddings, similarity, and ontology mappings.

1. Technical Metadata (Describes structure)

Examples are

table names, schemas, datatypes, partitions, indexes, file formats, storage locations

{
  "table": "orders",
  "columns": [
    {
      "name": "amount",
      "type": "decimal(10,2)"
    }
  ]
}

This is the oldest metadata type. Traditional catalogs mostly stopped here.

2. Operational Metadata (Describes runtime behavior)

This includes query frequency, pipeline failures, freshness, latency, runtime, SLA violations, storage growth

{
  "queries_per_day": 18344,
  "last_updated": "2026-05-18",
  "freshness_minutes": 12
}

This transforms metadata into living operational telemetry.
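As a small illustration of how a value like freshness_minutes could be derived, the sketch below computes it from a last-updated timestamp and compares it with an SLA threshold. The function names and the 30-minute SLA are assumptions made for this example, not part of any particular platform.

```python
from datetime import datetime, timezone


def freshness_minutes(last_updated, now):
    """Whole minutes elapsed since the asset was last updated."""
    return int((now - last_updated).total_seconds() // 60)


def violates_sla(last_updated, now, sla_minutes):
    """True when the asset is staler than the (hypothetical) SLA allows."""
    return freshness_minutes(last_updated, now) > sla_minutes


# Fixed timestamps so the example is reproducible.
last_updated = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)
now = datetime(2026, 5, 18, 12, 12, tzinfo=timezone.utc)

print(freshness_minutes(last_updated, now))             # 12
print(violates_sla(last_updated, now, sla_minutes=30))  # False
```

In a real platform the timestamps would come from the warehouse or the pipeline run, not from constants.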

3. Business Metadata (This is where human meaning enters)

Examples are business definitions, owners, glossary terms, domains, SLAs, steward assignments

{
  "definition": "Net revenue after refunds and discounts"
}

Why business metadata matters

Without business metadata, companies fight endlessly over definitions.

Finance says: Revenue = gross sales

Analytics says: Revenue = post-discount value

Product team says: Revenue = subscription MRR

Now every dashboard disagrees. Metadata systems attempt to solve this.

4. Governance Metadata

Security/compliance layer.

Examples are PII tags, GDPR labels, retention policies, access classifications, sensitivity levels

{
  "classification": "PII",
  "retention_days": 365
}

This becomes critical in enterprises.

5. Lineage Metadata (Very Valuable)

It describes dependencies.

Example:

raw_orders -> clean_orders -> daily_sales_dashboard

Lineage answers

Now we can answer:

What breaks if schema changes?
Where did this metric originate?
Why is this dashboard wrong?

6. Usage Metadata

Describes human/system interaction.

Examples: who queried tables, dashboard popularity, dormant assets, active consumers.

Usage metadata helps teams

discover important datasets
identify dead pipelines
optimize systems
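A minimal sketch of how usage metadata might drive these decisions: rank assets by query count and flag the ones nobody has touched in a while. The records, field names (`queries_30d`, `last_queried`), and the 90-day idle threshold are all hypothetical.

```python
from datetime import date

# Hypothetical usage records, one per asset.
USAGE = [
    {"asset": "daily_sales", "queries_30d": 1840, "last_queried": date(2026, 5, 18)},
    {"asset": "clean_orders", "queries_30d": 320, "last_queried": date(2026, 5, 10)},
    {"asset": "legacy_export", "queries_30d": 0, "last_queried": date(2025, 11, 2)},
]


def most_used(usage, n=2):
    """Return the names of the n most-queried assets."""
    ranked = sorted(usage, key=lambda u: u["queries_30d"], reverse=True)
    return [u["asset"] for u in ranked[:n]]


def dormant(usage, today, max_idle_days=90):
    """Return assets that have not been queried within max_idle_days."""
    return [
        u["asset"]
        for u in usage
        if (today - u["last_queried"]).days > max_idle_days
    ]


today = date(2026, 5, 18)
print(most_used(USAGE))       # ['daily_sales', 'clean_orders']
print(dormant(USAGE, today))  # ['legacy_export']
```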

7. Semantic Metadata

Increasingly important in the AI era.

It describes:

meaning
conceptual relationships
embeddings
similarity
ontology mappings

Example: customer ≈ client ≈ account_holder

This is becoming huge with LLM systems.

IV. Metadata graph thinking

Thinking of metadata as a table of information misses the full picture. Modern metadata systems are graph systems.

Why graphs instead of tables? Because relationships matter more than isolated assets.

Nodes and Edges

Nodes

Nodes are entities like

tables
topics
dashboards
users
pipelines
ML models

Edges

Edges, or links, are relationships like

owns
depends_on
upstream_of
consumes
produces

Why graphs?

Real-world metadata questions are graph traversals.

Example: “If I deprecate this Kafka topic, what breaks?”

This is a dependency traversal problem.

Another Example: “Which dashboards are affected if column X changes?”

Column X -> Table -> dbt model -> Dashboard

The metadata graph enables impact analysis.
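The column-to-dashboard traversal above can be sketched as a breadth-first search over an adjacency list. This is only an illustration, not how any particular catalog stores its graph; the node names and the `downstream_of` helper are hypothetical.

```python
from collections import deque

# Hypothetical metadata graph: each key points at the assets that depend on it.
EDGES = {
    "orders.column_x": ["orders"],
    "orders": ["dbt.daily_sales"],
    "dbt.daily_sales": ["dashboard.revenue", "dashboard.ops"],
}


def downstream_of(start, edges):
    """Return every asset reachable from `start`, in breadth-first order."""
    seen = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


impacted = downstream_of("orders.column_x", EDGES)
dashboards = [name for name in impacted if name.startswith("dashboard.")]
print(dashboards)  # ['dashboard.revenue', 'dashboard.ops']
```

The same function answers the Kafka topic question: start from the topic node instead of the column.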

Key Note

Metadata systems are, in part, knowledge graph systems.

V. Active metadata

Traditional catalogs were passive, meaning they were:

manually updated
stale
documentation-oriented
disconnected from runtime systems

Example:

wiki pages
spreadsheets
manually written docs

They became outdated immediately.

Modern systems = active metadata

Modern systems emit:

events
lineage updates
schema changes
query stats
freshness metrics
ownership updates

With these signals, metadata becomes real-time operational context.

Examples

Airflow pipeline fails.

OpenMetadata updates these automatically:

pipeline status
freshness warning
downstream impact
activity feed

Schema changes: customer_name → full_name

The metadata system can:

detect change
compute downstream impact
trigger alerts
update lineage
notify owners
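The steps above can be sketched roughly as follows: diff the old and new column lists, look up downstream assets, and build one alert per owner. The `DOWNSTREAM` and `OWNERS` tables and every name in them are hypothetical stand-ins for what a metadata platform would hold; this is not OpenMetadata's actual API.

```python
# Hypothetical lookups a metadata platform would normally provide.
DOWNSTREAM = {"customers": ["customer_report", "churn_model"]}
OWNERS = {"customer_report": "analytics-team", "churn_model": "ml-team"}


def diff_columns(old, new):
    """Compare two column lists and report which names were removed and added."""
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    return removed, added


def handle_schema_change(table, old, new):
    """Detect a change, find downstream assets, and build alerts for their owners."""
    removed, added = diff_columns(old, new)
    if not removed and not added:
        return []  # nothing changed, nothing to alert on
    return [
        {
            "notify": OWNERS.get(asset, "unknown"),
            "asset": asset,
            "removed": removed,
            "added": added,
        }
        for asset in DOWNSTREAM.get(table, [])
    ]


alerts = handle_schema_change(
    "customers", ["id", "customer_name"], ["id", "full_name"]
)
for alert in alerts:
    print(alert)
```

Note that a rename shows up here as one removed column plus one added column. Telling a rename apart from a drop-and-add needs extra information, such as the migration statement itself.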

This is not “documentation”. This is operational automation.

Modern systems are too dynamic. Manual governance fails completely at scale.

VI. Lineage fundamentals

Basic definition:

Lineage describes how data moves and transforms through systems over time.

But again, that definition is too shallow.

Better definition

Lineage is a dependency graph that models the flow, transformation, and propagation of data across systems.

Imagine:

orders table -> daily_sales aggregation -> executive dashboard

This already forms lineage.

Meaning:

dashboard depends on aggregation
aggregation depends on orders

Without lineage

Modern systems become operationally dangerous.

Because nobody knows:

what depends on what,
what breaks downstream,
where metrics originated,
whether data is trustworthy.

Lineage provides dependency visibility.

Lineage has existed as a concept for decades, but it became critical with the rise of distributed data systems.

Earlier, lineage was small enough for humans to follow manually:

Database -> Nightly ETL -> Warehouse

Modern lineage shape

Modern systems look like this:

Kafka -> Spark -> Data Lake -> dbt -> Warehouse -> Dashboards -> ML Models

Thousands of transformations -> Thousands of dependencies -> Human reasoning collapses.

Types of Lineage

Lineage exists at multiple granularities:

Dataset/Table-Level Lineage
raw_orders -> clean_orders -> daily_sales
Good for impact analysis, system overview, and operational debugging.
Column-Level Lineage
Tracks individual field propagation.
raw_orders.amount -> SUM(amount) -> daily_sales.total_revenue
Pipeline Lineage
Describes orchestration dependencies.
Airflow DAG A -> dbt Job B -> ML Pipeline C
End-to-End Lineage
Full ecosystem graph.
Kafka Topic -> Spark Job -> Iceberg Table -> dbt Model -> Snowflake Table -> Dashboard

Column-Level Lineage Example

Column-level lineage tracks individual field propagation.

raw_orders.amount -> SUM(amount) -> daily_sales.total_revenue

This is technically very hard, because SQL transformations must be parsed precisely.

Consider the SQL query below:

SELECT
  SUM(price * quantity) AS revenue
FROM orders

Now the lineage engine must infer that orders.price and orders.quantity produce revenue.

This is not trivial at all. It requires SQL parsing, AST analysis, and expression mapping.
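To make the expression-mapping step concrete, here is a deliberately naive sketch that handles only the shape of the query above: a single table, `expr AS alias` items, and no commas inside expressions. It uses regular expressions instead of a real AST, which is exactly why it would fail on anything more complex. The function name and the small keyword list are assumptions for this example.

```python
import re

# Words that look like identifiers but are not column names (far from complete).
SQL_WORDS = {"select", "from", "as", "sum", "avg", "min", "max", "count"}


def column_lineage(sql):
    """Map each aliased output column to the source columns it reads.

    Toy parser: handles only `SELECT <expr> AS <alias>[, ...] FROM <table>`
    with no commas inside expressions, no subqueries, and no joins.
    """
    match = re.search(
        r"select\s+(.*?)\s+from\s+(\w+)", sql, re.IGNORECASE | re.DOTALL
    )
    select_list, table = match.group(1), match.group(2)
    lineage = {}
    for item in select_list.split(","):
        expr, alias = re.split(r"\s+as\s+", item.strip(), flags=re.IGNORECASE)
        names = re.findall(r"[A-Za-z_]\w*", expr)
        lineage[alias] = [
            f"{table}.{name}" for name in names if name.lower() not in SQL_WORDS
        ]
    return lineage


sql = "SELECT\n  SUM(price * quantity) AS revenue\nFROM orders"
print(column_lineage(sql))  # {'revenue': ['orders.price', 'orders.quantity']}
```

A production engine would build a proper syntax tree and resolve names against table schemas; the sketch only shows what "expression mapping" produces.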

Lineage direction

Upstream lineage: the data sources feeding the current asset.

Downstream lineage: the consumers affected by the current asset.
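Both directions can be computed from the same edge list by walking it forwards or backwards. A minimal sketch, using the asset names from the earlier example; the helper names are hypothetical.

```python
# Lineage edges as (source, target) pairs.
EDGES = [
    ("raw_orders", "clean_orders"),
    ("clean_orders", "daily_sales"),
    ("daily_sales", "executive_dashboard"),
]


def neighbors(edges, reverse=False):
    """Build an adjacency map; reverse=True flips every edge for upstream walks."""
    graph = {}
    for source, target in edges:
        key, value = (target, source) if reverse else (source, target)
        graph.setdefault(key, []).append(value)
    return graph


def reachable(start, graph):
    """Return every node reachable from `start` by following the map."""
    found, stack = [], [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in found:
                found.append(nxt)
                stack.append(nxt)
    return found


upstream = reachable("daily_sales", neighbors(EDGES, reverse=True))
downstream = reachable("daily_sales", neighbors(EDGES))
print(upstream)    # ['clean_orders', 'raw_orders']
print(downstream)  # ['executive_dashboard']
```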

Lineage Derivation

How does a metadata system KNOW lineage? There are several methods to derive it.

Static SQL Parsing
The system parses SQL statements and infers dependencies from reads, writes, joins, aliases, CTEs, nested queries, unions, and functions.
Query Log Analysis
Warehouses expose query histories, so metadata systems can infer lineage from the queries that actually executed.
Orchestration Metadata
Airflow and dbt expose DAGs, jobs, and task dependencies that can be converted into lineage edges.
OpenLineage Events
Systems emit lineage events directly, so the platform does not have to guess lineage only from logs or SQL.

i. Static SQL Parsing

The system parses SQL statements.

INSERT INTO sales_summary
SELECT *
FROM orders

Based on this query, the parser infers: orders → sales_summary
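For a statement this simple, the inference can be sketched with two regular expressions. This is a toy, not a real SQL parser: it would wrongly treat CTE names as source tables and knows nothing about quoting or dialects. The `table_edges` name is an assumption for this example.

```python
import re


def table_edges(sql):
    """Infer (source, target) table edges from a simple INSERT ... SELECT."""
    target = re.search(r"insert\s+into\s+([\w.]+)", sql, re.IGNORECASE)
    sources = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.IGNORECASE)
    if target is None:
        return []  # a plain SELECT writes nothing, so it creates no edge
    return [(source, target.group(1)) for source in sources]


sql = """
INSERT INTO sales_summary
SELECT *
FROM orders
"""
print(table_edges(sql))  # [('orders', 'sales_summary')]
```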

Real SQL becomes insanely complicated.

Example:

WITH x AS (
   SELECT ...
),
y AS (
   SELECT ...
)
SELECT ...

Now the parser must understand:

aliases, joins, CTEs, nested queries, unions, functions

This is a hard compiler-style problem. Lineage systems often contain mini SQL compilers.

ii. Query Log Analysis

Warehouses expose query histories. Example:

Snowflake query logs
BigQuery audit logs

Metadata systems infer lineage from executed queries.

Tradeoff

Advantage: reflects actual runtime usage.

Disadvantage: context is sometimes incomplete.
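As a rough sketch of the idea, assume the query history has already been reduced to records listing which tables each query read and wrote. The `reads` and `writes` field names are hypothetical, not the format of any specific warehouse log. Counting how often each edge appears also shows why this method reflects real usage.

```python
from collections import Counter

# Hypothetical, simplified query log: one record per executed query.
QUERY_LOG = [
    {"reads": ["raw_orders"], "writes": ["clean_orders"]},
    {"reads": ["clean_orders"], "writes": ["daily_sales"]},
    {"reads": ["raw_orders"], "writes": ["clean_orders"]},
    {"reads": ["daily_sales"], "writes": []},  # a pure SELECT creates no edge
]


def edges_from_log(log):
    """Count how many executed queries support each (source, target) edge."""
    counts = Counter()
    for record in log:
        for source in record["reads"]:
            for target in record["writes"]:
                counts[(source, target)] += 1
    return counts


observed = edges_from_log(QUERY_LOG)
for edge, count in observed.items():
    print(edge, count)
```

An edge that never shows up in the log stays invisible, which is one form of the incomplete context mentioned above.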

iii. Orchestration Metadata

Airflow/dbt provide dependency information.

Example: task_a → task_b

The metadata system converts the following into lineage:

task dependencies
DAG structure
job execution
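A minimal sketch of that conversion, assuming the orchestrator can tell us each task's upstream tasks and the dataset each task writes. Both dictionaries below are hypothetical; real Airflow or dbt metadata looks different and is richer.

```python
# Hypothetical DAG description: task -> list of upstream tasks.
DAG = {"task_a": [], "task_b": ["task_a"], "task_c": ["task_b"]}

# Hypothetical mapping from each task to the dataset it writes.
OUTPUTS = {
    "task_a": "raw_orders",
    "task_b": "clean_orders",
    "task_c": "daily_sales",
}


def lineage_edges(dag, outputs):
    """Turn task dependencies into dataset-level (source, target) edges."""
    edges = []
    for task, upstream_tasks in dag.items():
        for upstream in upstream_tasks:
            edges.append((outputs[upstream], outputs[task]))
    return edges


print(lineage_edges(DAG, OUTPUTS))
# [('raw_orders', 'clean_orders'), ('clean_orders', 'daily_sales')]
```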

iv. OpenLineage Events

This is the most modern approach. Systems emit lineage events directly.

Instead of guessing lineage, systems explicitly publish it.
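The sketch below builds a trimmed event shaped loosely like an OpenLineage run event and reads lineage edges straight out of it. The field names follow my reading of the OpenLineage spec and should be checked against it; the namespace, job name, and producer URL are placeholders, and required fields such as the schema URL are left out for brevity.

```python
import json
import uuid
from datetime import datetime, timezone


def build_event(job_name, inputs, outputs):
    """Build a trimmed, OpenLineage-style event (placeholder values throughout)."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "example", "name": job_name},
        "inputs": [{"namespace": "example", "name": name} for name in inputs],
        "outputs": [{"namespace": "example", "name": name} for name in outputs],
        "producer": "https://example.com/my-pipeline",  # placeholder URI
    }


def edges_from_event(event):
    """Every input dataset feeds every output dataset of the run."""
    return [
        (source["name"], target["name"])
        for source in event["inputs"]
        for target in event["outputs"]
    ]


event = build_event("build_sales_summary", ["orders"], ["sales_summary"])
print(json.dumps(event, indent=2))
print(edges_from_event(event))  # [('orders', 'sales_summary')]
```

The point is that the receiving platform does no parsing at all: inputs and outputs arrive already named.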

Column Lineage Is Hard

Table lineage is relatively manageable. But column lineage is MUCH harder.

Example

SELECT
   first_name || ' ' || last_name AS full_name
FROM customers

The lineage engine must infer:

customers.first_name, customers.last_name -> full_name

Now add:

CASE statements
nested subqueries
window functions
UDFs
macros
dynamic SQL

Reality check

Complexity explodes. Many lineage tools advertise column-level lineage, but accuracy varies dramatically.

Another classification of lineage

Static lineage

Derived from code, SQL, configs, DAG definitions

Pros: predictable and easier to derive

Cons: may not reflect runtime reality

Runtime lineage

Derived from actual executions, query logs, emitted events

Pros: reflects reality

Cons: operationally harder

VII. Why Lineage Is Valuable

Impact Analysis

“What breaks if this changes?” This is the core lineage use case.

Root Cause Analysis

Dashboard wrong? Then let's trace backward: Dashboard -> dbt model -> warehouse table -> pipeline. Find the failure source.

Trust & Certification

If an upstream asset is unreliable, downstream trust decreases.

This becomes trust propagation.
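One simple way to model this is a "weakest link" rule, where an asset is only as trustworthy as its least trusted upstream. That rule, the scores, and the asset names below are all assumptions for illustration; real platforms may weight things quite differently.

```python
# Hypothetical lineage: asset -> the assets it is built from.
UPSTREAMS = {
    "clean_orders": ["raw_orders"],
    "daily_sales": ["clean_orders", "fx_rates"],
}

# Hypothetical trust scores (0.0 to 1.0) assigned to source assets.
SOURCE_TRUST = {"raw_orders": 0.9, "fx_rates": 0.6}


def trust(asset):
    """Weakest-link rule: inherit the lowest trust among upstream assets.

    Assumes the lineage graph has no cycles.
    """
    if asset in SOURCE_TRUST:
        return SOURCE_TRUST[asset]
    return min(trust(parent) for parent in UPSTREAMS[asset])


print(trust("clean_orders"))  # 0.9
print(trust("daily_sales"))   # 0.6
```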

Governance

Lineage helps answer:

where PII flows, which systems consume sensitive data, and how retention requirements propagate.

Very important for compliance.
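Tracing where PII flows can be sketched as pushing tags along lineage edges until nothing changes. The asset names and the `propagate_tags` helper are hypothetical.

```python
# Hypothetical lineage: asset -> assets built from it.
EDGES = {
    "raw_customers": ["clean_customers"],
    "clean_customers": ["marketing_export", "support_dashboard"],
    "raw_products": ["product_catalog"],
}

# Tags applied by hand (or by a classifier) at the source.
TAGS = {"raw_customers": {"PII"}}


def propagate_tags(edges, tags):
    """Copy each asset's tags to everything downstream of it."""
    result = {asset: set(asset_tags) for asset, asset_tags in tags.items()}
    pending = list(tags)
    while pending:
        asset = pending.pop()
        for child in edges.get(asset, []):
            before = result.setdefault(child, set())
            merged = before | result[asset]
            if merged != before:
                result[child] = merged
                pending.append(child)  # revisit: its children may need the tag too
    return result


tagged = propagate_tags(EDGES, TAGS)
print(sorted(asset for asset, t in tagged.items() if "PII" in t))
# ['clean_customers', 'marketing_export', 'raw_customers', 'support_dashboard']
```

Table-level propagation like this is conservative: a downstream table that drops the sensitive column still inherits the tag. Narrowing it down requires column-level lineage.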

Cost Optimization

Unused downstream assets can be identified, dead pipelines discovered, and expensive transformations analyzed.

Important Limitations

Lineage is NOT magically perfect.

Real-world problems:

dynamic SQL
hidden transformations
external scripts
missing instrumentation
incomplete logs
manual uploads
spreadsheets

So lineage graphs are often probabilistic or incomplete.
