I. What metadata really is

The textbook definition, data about data, is technically correct but practically useless.

A better definition of metadata would be:

Context that helps systems and humans understand, trust, discover, govern, operate, and automate around data assets.

Notice something important here? Metadata is NOT merely descriptive. It is operational intelligence.

Example: suppose you have a table called orders with this row:

order_id  customer_id  amount
101       55           4200

This is the data.

Now let's see what counts as metadata:

Information           Type
table name            metadata
column names          metadata
datatype              metadata
owner                 metadata
refresh frequency     metadata
downstream dashboard  metadata
lineage               metadata
quality score         metadata
query count           metadata

Key Insight

Everything about the data becomes metadata. Modern systems care as much about metadata as about data itself.

II. Evolution of data systems

In the early era, the setup was simple:

Application -> Single Database

No metadata platform needed.

Engineers kept the whole system in their heads. Tribal knowledge sufficed.

Then companies added:

OLTP DB -> ETL -> Warehouse -> Reports

This was still manageable. Spreadsheets documented some things, with a few small wiki pages here and there.

Beginning of the modern data stack explosion

Kafka, Spark, Airflow, dbt, Snowflake, BigQuery, Databricks, S3, ML pipelines, Tableau, Power BI, Microservices, Real-time streams, Feature stores

Complexity arrives

Now complexity explodes. Human cognition breaks down.

Nobody knows:

  • what depends on what,
  • which dataset is trusted,
  • where data originated,
  • who owns pipelines,
  • which dashboard is correct.

Key Insight

This is the birth condition for metadata platforms.

Metadata platforms are fundamentally complexity management systems. This is their real purpose.

III. Metadata categories

There are primarily 7 categories of metadata:

  • Technical Metadata

    Structure: tables, schemas, columns, datatypes, partitions, indexes, file formats, and storage locations.
  • Operational Metadata

    Runtime behavior: query frequency, freshness, latency, failures, SLA violations, and storage growth.
  • Business Metadata

    Human meaning: definitions, owners, glossary terms, domains, SLAs, and steward assignments.
  • Governance Metadata

    Control layer: PII tags, GDPR labels, retention policies, access classifications, and sensitivity levels.
  • Lineage Metadata

    Dependencies: where data came from, how it changed, and which consumers depend on it.
  • Usage Metadata

    Interaction: who queries what, which dashboards matter, and which assets are dormant.
  • Semantic Metadata

    Meaning in the AI era: concepts, relationships, embeddings, similarity, and ontology mappings.

1. Technical Metadata (Describes structure)

Examples: table names, schemas, datatypes, partitions, indexes, file formats, and storage locations.

{
  "table": "orders",
  "columns": [
    {
      "name": "amount",
      "type": "decimal(10,2)"
    }
  ]
}

This is the oldest metadata type. Traditional catalogs mostly stopped here.
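Technical metadata usually does not need to be typed by hand, because most databases expose it through their own catalog. Here is a minimal sketch that reads it from an in-memory SQLite database with Python's standard library. The function name `extract_technical_metadata` is made up for this example, and SQLite stands in for whatever engine you actually run.

```python
import json
import sqlite3


def extract_technical_metadata(conn: sqlite3.Connection, table: str) -> dict:
    """Read column names and declared types for one table from SQLite's catalog."""
    # PRAGMA table_info returns one row per column:
    # (cid, name, type, notnull, dflt_value, pk).
    # The table name is interpolated directly, so only pass trusted names.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {
        "table": table,
        "columns": [{"name": row[1], "type": row[2]} for row in rows],
    }


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount DECIMAL(10,2))"
)
metadata = extract_technical_metadata(conn, "orders")
print(json.dumps(metadata, indent=2))
```

The output has the same shape as the JSON document shown for the orders table, just with all three columns listed.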

2. Operational Metadata (Describes runtime behavior)

This includes query frequency, pipeline failures, freshness, latency, runtime, SLA violations, and storage growth.

{
  "queries_per_day": 18344,
  "last_updated": "2026-05-18",
  "freshness_minutes": 12
}

This transforms metadata into living operational telemetry.
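A field like freshness_minutes is typically derived rather than stored. A minimal sketch of that derivation, assuming the platform records a last-updated timestamp per asset. The function names and the 60-minute SLA are hypothetical.

```python
from datetime import datetime, timezone


def freshness_minutes(last_updated: datetime, now: datetime) -> int:
    """Whole minutes elapsed since the asset was last updated."""
    return int((now - last_updated).total_seconds() // 60)


def is_stale(last_updated: datetime, now: datetime, sla_minutes: int) -> bool:
    """True when the asset is older than its freshness SLA allows."""
    return freshness_minutes(last_updated, now) > sla_minutes


last_updated = datetime(2026, 5, 18, 9, 0, tzinfo=timezone.utc)
now = datetime(2026, 5, 18, 9, 12, tzinfo=timezone.utc)

print(freshness_minutes(last_updated, now))            # 12
print(is_stale(last_updated, now, sla_minutes=60))     # False
```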

3. Business Metadata (This is where human meaning enters)

Examples: business definitions, owners, glossary terms, domains, SLAs, and steward assignments.

{
  "definition": "Net revenue after refunds and discounts"
}

Why business metadata matters

Without business metadata, companies fight endlessly over definitions.

Finance says: Revenue = gross sales

Analytics says: Revenue = post-discount value

Product team says: Revenue = subscription MRR

Now every dashboard disagrees. Metadata systems attempt to solve this.

4. Governance Metadata

Security/compliance layer.

Examples: PII tags, GDPR labels, retention policies, access classifications, and sensitivity levels.

{
  "classification": "PII",
  "retention_days": 365
}

This becomes critical in enterprises.

5. Lineage Metadata (Very Valuable)

It describes dependencies.

Example:

raw_orders -> clean_orders -> daily_sales_dashboard

Lineage answers

With lineage, we can answer:

  • What breaks if schema changes?
  • Where did this metric originate?
  • Why is this dashboard wrong?

6. Usage Metadata

Describes human/system interaction.

Examples: who queried tables, dashboard popularity, dormant assets, active consumers.

Usage metadata helps teams

  • discover important datasets
  • identify dead pipelines
  • optimize systems

7. Semantic Metadata

Increasingly important in the AI era.

It describes:

  • meaning
  • conceptual relationships
  • embeddings
  • similarity
  • ontology mappings

Example: customer, client, and account_holder can all refer to the same concept.

This is becoming huge with LLM systems.

IV. Metadata graph thinking

If you think of metadata as a table of information, you are missing part of the picture. Modern metadata systems are graph systems.

Why graphs instead of tables? Because relationships matter more than isolated assets.

Nodes and Edges

Nodes

Nodes are entities like:

  • tables
  • topics
  • dashboards
  • users
  • pipelines
  • ML models

Edges

Edges (or links) are relationships:

  • owns
  • depends_on
  • upstream_of
  • consumes
  • produces

Why graphs?

Real-world metadata questions are graph traversals.

Example: “If I deprecate this Kafka topic, what breaks?”

This is a dependency traversal problem.

Another Example: “Which dashboards are affected if column X changes?”

Column X -> Table -> dbt model -> Dashboard

A metadata graph enables impact analysis.
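To make the "what is affected if column X changes?" question concrete, here is a minimal sketch of impact analysis as a breadth-first traversal. The asset names in `EDGES` and the function `downstream_of` are hypothetical. A real platform would read the edges from its graph store instead of a hard-coded list.

```python
from collections import deque

# Hypothetical edges, each pointing from an asset to the asset that depends on it.
EDGES = [
    ("orders.column_x", "orders"),
    ("orders", "dbt.daily_sales"),
    ("dbt.daily_sales", "dashboard.revenue"),
    ("kafka.clicks", "spark.sessionize"),
]


def downstream_of(start: str, edges: list[tuple[str, str]]) -> list[str]:
    """Breadth-first traversal returning every asset reachable from `start`."""
    adjacency: dict[str, list[str]] = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)

    seen = {start}
    impacted = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in adjacency.get(node, []):
            if child not in seen:  # each asset is reported once
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


print(downstream_of("orders.column_x", EDGES))
# ['orders', 'dbt.daily_sales', 'dashboard.revenue']
```

The unrelated Kafka branch never shows up in the result, which is the whole point: the traversal follows relationships, not a flat list of assets.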

Key Note

Metadata systems are, in part, knowledge graph systems.

V. Active metadata

Traditional catalogs were passive, meaning they were:

  • manually updated
  • stale
  • documentation-oriented
  • disconnected from runtime systems

Examples:

  • wiki pages
  • spreadsheets
  • manually written docs

They became outdated immediately.

Modern systems = active metadata

Modern systems emit:

  • events
  • lineage updates
  • schema changes
  • query stats
  • freshness metrics
  • ownership updates

With these signals, metadata becomes real-time operational context.

Examples

Say an Airflow pipeline fails.

OpenMetadata then updates these automatically:

  • pipeline status
  • freshness warning
  • downstream impact
  • activity feed

Say a schema changes: customer_name → full_name

The metadata system can:

  • detect change
  • compute downstream impact
  • trigger alerts
  • update lineage
  • notify owners
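The "detect change" step can be sketched as a diff between two schema snapshots. This is a minimal illustration, not how any particular platform does it. The function `diff_schema` and the event field names are assumptions made up for this example.

```python
def diff_schema(old: dict[str, str], new: dict[str, str]) -> list[dict]:
    """Compare two {column: type} snapshots and emit one event per change."""
    events = []
    for column in sorted(old.keys() - new.keys()):
        events.append({"change": "column_removed", "column": column})
    for column in sorted(new.keys() - old.keys()):
        events.append({"change": "column_added", "column": column})
    for column in sorted(old.keys() & new.keys()):
        if old[column] != new[column]:
            events.append(
                {"change": "type_changed", "column": column,
                 "from": old[column], "to": new[column]}
            )
    return events


old = {"customer_id": "int", "customer_name": "varchar"}
new = {"customer_id": "int", "full_name": "varchar"}
events = diff_schema(old, new)
print(events)
```

Note that a rename shows up here as one removal plus one addition. Recognising it as a rename needs extra evidence (matching types, migration scripts, or a human confirming it), which is one reason downstream impact has to be computed rather than assumed.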

This is not “documentation”. This is operational automation.

Modern systems are too dynamic. Manual governance fails completely at scale.

VI. Lineage fundamentals

Basic definition:

Lineage describes how data moves and transforms through systems over time.

But again, that definition is too shallow.

A better definition:

Lineage is a dependency graph that models the flow, transformation, and propagation of data across systems.

Imagine:

orders table -> daily_sales aggregation -> executive dashboard

This already forms lineage.

Meaning:

  • dashboard depends on aggregation
  • aggregation depends on orders

Without lineage

Modern systems become operationally dangerous because nobody knows:

  • what depends on what,
  • what breaks downstream,
  • where metrics originated,
  • whether data is trustworthy.

Lineage solves this by providing dependency visibility.

Lineage has existed conceptually for decades, but it became critical because of distributed data systems.

Earlier, lineage was small enough for humans to understand manually:

Database -> Nightly ETL -> Warehouse

Modern lineage shape

Modern systems look like this:

Kafka -> Spark -> Data Lake -> dbt -> Warehouse -> Dashboards -> ML Models

Thousands of transformations -> Thousands of dependencies -> Human reasoning collapses.

Types of Lineage

Lineage exists at multiple granularities:

  • Dataset/Table-Level Lineage

    raw_orders -> clean_orders -> daily_sales

    Good for impact analysis, system overview, and operational debugging.

  • Column-Level Lineage

    Tracks individual field propagation.

    raw_orders.amount -> SUM(amount) -> daily_sales.total_revenue

  • Pipeline Lineage

    Describes orchestration dependencies.

    Airflow DAG A -> dbt Job B -> ML Pipeline C

  • End-to-End Lineage

    Full ecosystem graph.

    Kafka Topic -> Spark Job -> Iceberg Table -> dbt Model -> Snowflake Table -> Dashboard

Column-Level Lineage Example

Column-level lineage is technically very hard because SQL transformations must be parsed precisely.

Consider the SQL query below:

SELECT
  SUM(price * quantity) AS revenue
FROM orders

The lineage engine must now infer that orders.price and orders.quantity together produce revenue.

This is not trivial at all. It requires SQL parsing, AST analysis, and expression mapping.
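To get a feel for the expression-mapping step, here is a deliberately naive sketch that handles only the shape `SELECT <expr> AS <alias> FROM <table>`. It uses regular expressions instead of a real SQL parser, and the function name `column_lineage` is made up for this example. Treat it as a toy under those assumptions.

```python
import re


def column_lineage(sql: str) -> dict[str, list[str]]:
    """Map the output alias to its source columns for one simple SELECT."""
    parsed = re.search(
        r"SELECT\s+(?P<expr>.+?)\s+AS\s+(?P<alias>\w+)\s+FROM\s+(?P<table>\w+)",
        sql,
        flags=re.IGNORECASE | re.DOTALL,
    )
    if parsed is None:
        raise ValueError("unsupported query shape")

    # Drop string literals so their contents are not mistaken for identifiers.
    expr = re.sub(r"'[^']*'", " ", parsed["expr"])
    # An identifier followed by "(" is a function call, not a column.
    names = re.findall(r"\b([A-Za-z_]\w*)\b(?!\s*\()", expr)

    sources = []
    for name in names:
        qualified = f"{parsed['table']}.{name}"
        if qualified not in sources:
            sources.append(qualified)
    return {parsed["alias"]: sources}


print(column_lineage("SELECT\n  SUM(price * quantity) AS revenue\nFROM orders"))
# {'revenue': ['orders.price', 'orders.quantity']}
```

It breaks quickly: a CASE expression would have its keywords reported as columns, and joins, subqueries, or multiple select items are not handled at all. That fragility is exactly why real lineage engines work on a proper AST.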

Lineage direction

Upstream lineage: Data sources feeding current asset.

Downstream lineage: Consumers affected by current asset.

Lineage Derivation

How does a metadata system KNOW lineage? There are several methods to derive it.

  • Static SQL Parsing

    The system parses SQL statements and infers dependencies from reads, writes, joins, aliases, CTEs, nested queries, unions, and functions.
  • Query Log Analysis

    Warehouses expose query histories, so metadata systems can infer lineage from the queries that actually executed.
  • Orchestration Metadata

    Airflow and dbt expose DAGs, jobs, and task dependencies that can be converted into lineage edges.
  • OpenLineage Events

    Systems emit lineage events directly, so the platform does not have to guess lineage only from logs or SQL.

i. Static SQL Parsing

The system parses SQL statements.

INSERT INTO sales_summary
SELECT *
FROM orders

Based on this query the parser infers: orders -> sales_summary
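For a statement this simple, the inference can be sketched in a few lines. The regex approach and the function name `table_lineage` are assumptions for illustration only. It treats anything after FROM or JOIN as a source table and anything after INSERT INTO as the target.

```python
import re


def table_lineage(sql: str) -> list[tuple[str, str]]:
    """Return (source, target) edges for a simple INSERT INTO ... SELECT statement."""
    target = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, flags=re.IGNORECASE)
    if target is None:
        return []  # the statement writes nothing we can recognise

    # Tables read by the statement appear after FROM or JOIN.
    sources = re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, flags=re.IGNORECASE)

    edges = []
    for source in sources:
        edge = (source, target.group(1))
        if edge not in edges:
            edges.append(edge)
    return edges


print(table_lineage("INSERT INTO sales_summary\nSELECT *\nFROM orders"))
# [('orders', 'sales_summary')]
```

This sketch would report a CTE name as if it were a real table and it ignores subqueries in the FROM clause entirely.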

Real SQL becomes insanely complicated.

Example:

WITH x AS (
   SELECT ...
),
y AS (
   SELECT ...
)
SELECT ...

Now the parser must understand aliases, joins, CTEs, nested queries, unions, and functions.

This is a hard compiler-style problem. Lineage systems often contain mini SQL compilers.

ii. Query Log Analysis

Warehouses expose query histories. Example:

  • Snowflake query logs
  • BigQuery audit logs

Metadata systems infer lineage from executed queries.

Tradeoff

Advantage: reflects actual runtime usage.

Disadvantage: context is sometimes incomplete.
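A minimal sketch of log-based inference, assuming the query history has already been reduced to records listing which tables each query read and wrote. The record shape (`query_id`, `reads`, `writes`) is hypothetical. Real warehouse logs have their own schemas and usually need parsing first.

```python
from collections import Counter

# Hypothetical, already-parsed query-history records.
QUERY_LOG = [
    {"query_id": "q1", "reads": ["raw_orders"], "writes": ["clean_orders"]},
    {"query_id": "q2", "reads": ["clean_orders"], "writes": ["daily_sales"]},
    {"query_id": "q3", "reads": ["clean_orders"], "writes": ["daily_sales"]},
    {"query_id": "q4", "reads": ["daily_sales"], "writes": []},  # read-only query
]


def edges_from_log(log: list[dict]) -> Counter:
    """Count how often each (source, target) pair was observed in executed queries."""
    counts: Counter = Counter()
    for record in log:
        for source in record["reads"]:
            for target in record["writes"]:
                counts[(source, target)] += 1
    return counts


observed = edges_from_log(QUERY_LOG)
for (source, target), runs in observed.most_common():
    print(f"{source} -> {target} (seen {runs}x)")
```

The counts are the advantage in action: an edge seen twice today is real runtime behavior. The read-only query contributes no edge, which hints at the disadvantage, since the log alone does not say which dashboard or user was behind it.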

iii. Orchestration Metadata

Airflow and dbt provide dependency information.

Example: task_a -> task_b

The metadata system converts the following into lineage:

  • task dependencies
  • DAG structure
  • job execution
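As a rough sketch of that conversion, assume the orchestrator's DAG has been exported as a mapping from each task to the tasks it waits on. The `DAG` structure and `dag_to_edges` are hypothetical and do not mirror the Airflow or dbt APIs.

```python
# Hypothetical DAG description: each task lists the tasks it waits on.
DAG = {
    "task_a": [],
    "task_b": ["task_a"],
    "task_c": ["task_a", "task_b"],
}


def dag_to_edges(dag: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Turn task dependencies into (upstream, downstream) lineage edges."""
    edges = []
    for task, upstream_tasks in dag.items():
        for upstream in upstream_tasks:
            edges.append((upstream, task))
    return edges


print(dag_to_edges(DAG))
# [('task_a', 'task_b'), ('task_a', 'task_c'), ('task_b', 'task_c')]
```

These are task-to-task edges. Turning them into dataset lineage additionally needs to know which tables each task reads and writes, which the orchestrator does not always declare.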

iv. OpenLineage Events

This is the most modern approach. Systems emit lineage events directly.

Instead of guessing lineage, systems explicitly publish it.
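Here is a simplified sketch of what such an event can look like, built as a plain Python dict. The field names follow the general shape of an OpenLineage run event (eventType, eventTime, run, job, inputs, outputs, producer), but this is not a complete or validated event, and the namespace, producer URL, and job name are placeholders. Check the OpenLineage specification for the required fields.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name: str, inputs: list[str], outputs: list[str]) -> dict:
    """Build a simplified OpenLineage-style run event as a plain dict."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/hypothetical-producer",  # placeholder
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "example", "name": job_name},
        "inputs": [{"namespace": "example", "name": name} for name in inputs],
        "outputs": [{"namespace": "example", "name": name} for name in outputs],
    }


def edges_from_event(event: dict) -> list[tuple[str, str]]:
    """Every input dataset feeds every output dataset of the run."""
    return [
        (source["name"], target["name"])
        for source in event["inputs"]
        for target in event["outputs"]
    ]


event = build_run_event("build_sales_summary", ["orders"], ["sales_summary"])
print(json.dumps(event, indent=2))
print(edges_from_event(event))  # [('orders', 'sales_summary')]
```

The receiving side does no SQL parsing at all here. The job states its inputs and outputs, and the platform only has to record them.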

Column Lineage Is Hard

Table lineage is relatively manageable, but column lineage is MUCH harder.

Example

SELECT
   first_name || ' ' || last_name AS full_name
FROM customers

The lineage engine must infer:

customers.first_name, customers.last_name -> full_name

Now add:

  • CASE statements
  • nested subqueries
  • window functions
  • UDFs
  • macros
  • dynamic SQL

Reality check

Complexity explodes. Many lineage tools advertise column-level lineage, but accuracy varies dramatically.

Another classification of lineage

Static lineage

Derived from code, SQL, configs, DAG definitions

Pros: predictable and easier to compute

Cons: may not reflect runtime reality

Runtime lineage

Derived from actual executions, query logs, emitted events

Pros: reflects reality

Cons: operationally harder

VII. Why Lineage Is Valuable

Impact Analysis

“What breaks if this changes?” Core lineage use case.

Root Cause Analysis

Dashboard wrong? Then let's trace backward:

Dashboard -> dbt model -> warehouse table -> pipeline

Find the failure source.
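Tracing backward can be sketched as an upstream walk that stops at the unhealthy assets farthest from the dashboard. The lineage map, the status values, and `find_root_causes` are all hypothetical, and treating the farthest-upstream unhealthy asset as the root cause is a simplifying assumption.

```python
# Hypothetical lineage: each asset maps to the assets it reads from.
UPSTREAM = {
    "dashboard": ["dbt_model"],
    "dbt_model": ["warehouse_table"],
    "warehouse_table": ["pipeline"],
    "pipeline": [],
}

# Hypothetical health status per asset.
STATUS = {
    "dashboard": "ok",
    "dbt_model": "ok",
    "warehouse_table": "stale",
    "pipeline": "failed",
}


def find_root_causes(asset, upstream, status, cache=None):
    """Return unhealthy assets upstream of `asset` that have no unhealthy ancestor."""
    cache = {} if cache is None else cache
    if asset in cache:
        return cache[asset]
    cache[asset] = []  # provisional entry guards against cycles

    causes = []
    for parent in upstream.get(asset, []):
        for cause in find_root_causes(parent, upstream, status, cache):
            if cause not in causes:
                causes.append(cause)

    # Nothing unhealthy further upstream, so this asset is where the problem starts.
    if not causes and status.get(asset, "ok") != "ok":
        causes = [asset]

    cache[asset] = causes
    return causes


print(find_root_causes("dashboard", UPSTREAM, STATUS))  # ['pipeline']
```

The stale warehouse table is a symptom, not the cause, so only the failed pipeline is reported.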

Trust & Certification

If an upstream asset is unreliable, downstream trust decreases.

This is trust propagation.
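One simple way to model this is "an asset is only as trustworthy as its weakest upstream". The sketch below uses that min rule, which is an assumption chosen for illustration, not a standard. The scores and asset names are made up.

```python
def propagate_trust(own_trust: dict[str, float],
                    upstream: dict[str, list[str]]) -> dict[str, float]:
    """Effective trust = min(own score, effective trust of every upstream asset)."""
    effective: dict[str, float] = {}

    def resolve(asset: str) -> float:
        if asset not in effective:
            effective[asset] = own_trust[asset]  # provisional value, also stops cycles
            scores = [own_trust[asset]]
            scores += [resolve(parent) for parent in upstream.get(asset, [])]
            effective[asset] = min(scores)
        return effective[asset]

    for asset in own_trust:
        resolve(asset)
    return effective


# Hypothetical per-asset quality scores between 0 and 1.
own_trust = {"raw_orders": 0.6, "clean_orders": 0.9, "daily_sales": 0.95}
upstream = {"clean_orders": ["raw_orders"], "daily_sales": ["clean_orders"]}

print(propagate_trust(own_trust, upstream))
# {'raw_orders': 0.6, 'clean_orders': 0.6, 'daily_sales': 0.6}
```

Even though daily_sales scores 0.95 on its own checks, its effective trust drops to 0.6 because everything it is built from traces back to raw_orders.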

Governance

Lineage helps answer:

  • where PII flows
  • which systems consume sensitive data
  • how retention requirements propagate

Very important for compliance.

Cost Optimization

Unused downstream assets can be identified. Dead pipelines discovered. Expensive transformations analyzed.
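Combining lineage with usage metadata gives a first cut at dead assets: anything that feeds nothing downstream and that nobody queries. A minimal sketch, where the edges, the query counts, and `find_dormant_assets` are hypothetical.

```python
def find_dormant_assets(edges: list[tuple[str, str]],
                        query_counts: dict[str, int]) -> list[str]:
    """Assets with no downstream consumers and no recorded queries."""
    has_consumers = {source for source, _ in edges}
    assets = {name for edge in edges for name in edge} | set(query_counts)
    return sorted(
        asset for asset in assets
        if asset not in has_consumers and query_counts.get(asset, 0) == 0
    )


# Hypothetical lineage edges (source, target) and daily query counts.
edges = [
    ("raw_orders", "clean_orders"),
    ("clean_orders", "daily_sales"),
    ("clean_orders", "legacy_export"),
]
query_counts = {"daily_sales": 18344, "legacy_export": 0}

print(find_dormant_assets(edges, query_counts))  # ['legacy_export']
```

A result like this is a candidate list, not a deletion list. An asset can look dormant simply because its consumers are not instrumented.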

Important Limitations

Lineage is NOT magically perfect.

Real-world problems:

  • dynamic SQL
  • hidden transformations
  • external scripts
  • missing instrumentation
  • incomplete logs
  • manual uploads
  • spreadsheets

So lineage graphs are often probabilistic or incomplete.
