Column-Level Lineage with OpenLineage and Marquez in 2026

Table-level lineage tells you that table A feeds table B. That’s the floor — useful for high-level impact analysis but inadequate for the operational questions data teams actually face. Column-level lineage tells you that column X in table A feeds column Y in table B through a specific transformation. That’s the level of detail that audits, GDPR data subject requests, and surgical impact analysis actually need.

OpenLineage (an open lineage standard hosted by the LF AI & Data Foundation), Marquez (its reference metadata store), and the broader ecosystem have matured substantially through 2024-2026. This post walks through what’s actually working and the deployment pattern that produces value.

Why column-level lineage matters

Several specific use cases require column-level granularity.

GDPR / DPDPA / privacy data subject requests. When a user requests deletion or export of their personal data, you need to know everywhere a specific personal data column has propagated. Table-level lineage tells you “this data is somewhere in 47 downstream tables.” Column-level tells you “the email column flows to these specific columns in these specific downstream tables.”

Audit and compliance. When auditors ask how a specific KPI was calculated, table-level lineage doesn’t answer. Column-level lineage traces the specific calculation path through each transformation.

Impact analysis for schema changes. When you want to change a column type or rename a column, table-level lineage says “47 downstream tables might be affected.” Column-level says “12 downstream columns in 8 tables use this column; the rest don’t reference it.”

Root-cause analysis for data quality issues. When a downstream metric is wrong, column-level lineage traces backward to identify which upstream column changed.

ML feature lineage. Particularly important — knowing which raw columns feed which features is essential for ML model governance.
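Most of the use cases above reduce to walking a graph of column-to-column edges in one direction or the other. The sketch below is a minimal illustration, assuming the edges have already been extracted from lineage events into simple pairs; the table and column names are made up for the example.

```python
from collections import defaultdict, deque

# Hypothetical column-level edges: (source column, target column).
# Each column is written as "schema.table.column"; all names are invented.
EDGES = [
    ("raw.users.email", "staging.users.email_normalized"),
    ("staging.users.email_normalized", "marts.customers.contact_email"),
    ("staging.users.email_normalized", "marts.marketing.email_hash"),
    ("raw.orders.amount", "marts.revenue.total_amount"),
]


def build_index(edges):
    """Return (downstream, upstream) adjacency maps for the edge list."""
    downstream, upstream = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return downstream, upstream


def reachable(start, adjacency):
    """Breadth-first walk: every column reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


DOWNSTREAM, UPSTREAM = build_index(EDGES)

# Data subject request or schema change: where does the email column end up?
email_targets = reachable("raw.users.email", DOWNSTREAM)

# Root-cause analysis: which upstream columns feed a suspicious column?
hash_sources = reachable("marts.marketing.email_hash", UPSTREAM)
```

The same traversal answers the privacy, impact-analysis, and root-cause questions; only the starting column and the direction change.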

The OpenLineage standard

OpenLineage is an open standard for capturing lineage events, hosted by the LF AI & Data Foundation. The model is event-based: each job run emits events as it starts, progresses, and finishes, with structured information about inputs, outputs, and the transformations between them.

The standard defines event types (START, RUNNING, COMPLETE, ABORT, FAIL), job metadata, run metadata, dataset facets including schemas and statistics, job facets including the SQL and code references, and column-level lineage as a specific facet.
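To make the shape concrete, here is a rough sketch of a COMPLETE event carrying a column-level lineage facet, built as a plain Python dictionary. The producer URI, namespace, job, dataset, and column names are placeholders, and the schema URL versions are assumptions; check them against the spec version your integrations emit.

```python
import json
import uuid
from datetime import datetime, timezone

PRODUCER = "https://example.com/lineage-demo"  # hypothetical producer URI


def column_lineage_facet(mapping):
    """mapping: output column -> list of (namespace, dataset, field) inputs."""
    return {
        "_producer": PRODUCER,
        # Facet schema version is an assumption for this sketch.
        "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/ColumnLineageDatasetFacet.json",
        "fields": {
            out_col: {
                "inputFields": [
                    {"namespace": ns, "name": ds, "field": col}
                    for ns, ds, col in inputs
                ]
            }
            for out_col, inputs in mapping.items()
        },
    }


def complete_event(job_name, input_ds, output_ds, mapping, namespace="demo"):
    """Build a COMPLETE run event with one input and one output dataset."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        # Core spec version is an assumption for this sketch.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": input_ds}],
        "outputs": [{
            "namespace": namespace,
            "name": output_ds,
            "facets": {"columnLineage": column_lineage_facet(mapping)},
        }],
    }


event = complete_event(
    "build_customers",
    "staging.users",
    "marts.customers",
    {"contact_email": [("demo", "staging.users", "email_normalized")]},
)
print(json.dumps(event, indent=2))
```

The column-level information lives on the output dataset: each output column lists the input dataset fields it was derived from.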

The standard is implemented through integrations: Airflow, dbt, Spark, Flink, Snowflake, BigQuery, and a growing list of others. Each integration emits events during job execution; the events flow to a metadata store.

Marquez and the alternatives

Marquez is the OpenLineage reference metadata store. It receives events, stores the lineage graph, and provides UI and API for querying. Solid open-source option for teams that want to self-host.
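As a rough illustration of the "receives events" part, the sketch below builds an HTTP POST of a single event to Marquez’s lineage endpoint using only the standard library. The base URL assumes a local Marquez instance on its default API port, and the event body is a stub rather than a complete OpenLineage event.

```python
import json
import urllib.request


def build_lineage_request(event, marquez_url="http://localhost:5000"):
    """Build (but do not send) a POST of one OpenLineage event to Marquez."""
    return urllib.request.Request(
        url=marquez_url.rstrip("/") + "/api/v1/lineage",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=5):
    """Perform the network call; needs a reachable Marquez instance."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status


# Stub payload for illustration only; a real event carries run, job, and datasets.
req = build_lineage_request({"eventType": "START"})
```

In practice the integrations handle this transport for you; the sketch is mainly useful for smoke-testing a new Marquez deployment by hand.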

DataHub (originally built and open-sourced by LinkedIn) is a broader metadata platform that includes lineage. It has a larger feature set than Marquez but more operational overhead.

OpenMetadata is another open-source metadata platform with strong lineage support.

Commercial platforms (Atlan, Collibra, Alation, Manta, Castor, and others) increasingly support OpenLineage as input. They typically provide broader metadata management beyond lineage.

For most teams starting with column-level lineage, Marquez or DataHub plus OpenLineage events from the major data tools (dbt, Airflow) is the right starting architecture.

The deployment pattern

The pattern we’ve converged on for client engagements:

Phase 1: Enable OpenLineage events from dbt. OpenLineage provides a dbt integration (the dbt-ol wrapper in the openlineage-dbt package). The events capture model-to-model lineage, and column-level lineage when the SQL is clean enough for column-level parsing.
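One way to wire this up, sketched under the assumption that the openlineage-dbt package is installed and that dbt is invoked through its `dbt-ol` wrapper: set the OpenLineage endpoint and namespace as environment variables, then run the usual dbt command through the wrapper. The URL and namespace values below are placeholders.

```python
import os
import subprocess


def dbt_ol_invocation(openlineage_url, namespace, dbt_args=("run",)):
    """Return the command and environment for running dbt via dbt-ol."""
    env = dict(os.environ)
    env["OPENLINEAGE_URL"] = openlineage_url      # where events are sent
    env["OPENLINEAGE_NAMESPACE"] = namespace      # job namespace for the events
    return ["dbt-ol", *dbt_args], env


def run_dbt_with_lineage(openlineage_url, namespace, dbt_args=("run",)):
    """Execute dbt through the wrapper; requires dbt-ol on PATH."""
    cmd, env = dbt_ol_invocation(openlineage_url, namespace, dbt_args)
    return subprocess.run(cmd, env=env, check=True)


# Placeholder endpoint and namespace for illustration.
cmd, env = dbt_ol_invocation("http://localhost:5000", "analytics")
```

Keeping the command construction separate from execution makes it easy to drop the same settings into a CI job or an orchestrator task.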

Phase 2: Enable OpenLineage events from the orchestrator. Airflow, Dagster, Prefect all have OpenLineage integration. Events from the orchestrator capture cross-tool lineage.

Phase 3: Enable OpenLineage events from warehouse compute. Snowflake, BigQuery, Databricks all support OpenLineage. Events from warehouse compute catch ad-hoc transformations that don’t go through dbt.

Phase 4: Build the query layer. Marquez or DataHub provides the API; build the specific queries your team needs (column lineage for a specific PII column, impact analysis for a proposed schema change, etc.).
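For Marquez, a PII-column query can start from its column-lineage endpoint. The sketch below only builds the request URL; the endpoint path, the `datasetField:` node ID format, and the `depth` and `withDownstream` parameters reflect our understanding of the Marquez API and should be checked against the version you deploy. The namespace, dataset, and field names are placeholders.

```python
from urllib.parse import urlencode


def column_lineage_url(base_url, namespace, dataset, field,
                       depth=20, with_downstream=True):
    """Build a Marquez column-lineage query URL for one dataset field."""
    node_id = f"datasetField:{namespace}:{dataset}:{field}"
    query = urlencode({
        "nodeId": node_id,
        "depth": depth,
        "withDownstream": str(with_downstream).lower(),
    })
    return f"{base_url.rstrip('/')}/api/v1/column-lineage?{query}"


# Placeholder identifiers for illustration.
url = column_lineage_url("http://localhost:5000", "demo", "staging.users", "email")
```

Wrapping each recurring question (PII propagation, schema-change impact) in a small named function like this keeps the query layer readable for people who never touch the raw API.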

Phase 5: Integrate with workflows. Make lineage queryable from the tools the team uses — IDE plugins, Slack bots, GitHub PR comments, the warehouse UI.

What’s hard

Column-level lineage has specific challenges.

Dynamic SQL. When transformations use dynamic SQL generation, column-level parsing fails. The events are partial.
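Partial events are easier to manage when you can measure them. A minimal sketch, assuming each output dataset carries both a schema facet and a column lineage facet: compare the two and report which output columns have no lineage entry. The dataset and column names are made up.

```python
def column_lineage_coverage(output_dataset):
    """Return (coverage ratio, columns missing from the lineage facet)."""
    facets = output_dataset.get("facets", {})
    schema_fields = [f["name"] for f in facets.get("schema", {}).get("fields", [])]
    mapped = set(facets.get("columnLineage", {}).get("fields", {}))
    missing = [name for name in schema_fields if name not in mapped]
    covered = len(schema_fields) - len(missing)
    ratio = covered / len(schema_fields) if schema_fields else 0.0
    return ratio, missing


def _inputs(field):
    # Helper for the example: one input field from a hypothetical staging table.
    return {"inputFields": [{"namespace": "demo", "name": "staging.users", "field": field}]}


# Hypothetical output dataset where the parser resolved three of four columns.
output = {
    "namespace": "demo",
    "name": "marts.customers",
    "facets": {
        "schema": {"fields": [
            {"name": "customer_id", "type": "BIGINT"},
            {"name": "contact_email", "type": "VARCHAR"},
            {"name": "lifetime_value", "type": "DECIMAL"},
            {"name": "segment", "type": "VARCHAR"},
        ]},
        "columnLineage": {"fields": {
            "customer_id": _inputs("id"),
            "contact_email": _inputs("email_normalized"),
            "lifetime_value": _inputs("ltv"),
        }},
    },
}

ratio, missing = column_lineage_coverage(output)
```

Tracking this ratio per model shows which jobs rely on dynamic SQL heavily enough to need manual lineage annotations.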

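Complex transformations. When a column is derived through complex aggregation or window functions, the lineage is technically correct but operationally less useful: the output column often ends up listing every input column it touches, including ones used only for grouping, ordering, or filtering.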

Cross-engine flows. When data moves between systems (warehouse to Spark, Spark to ML platform), the lineage events from each system need to stitch together. The stitching is fragile.
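Stitching depends on two systems naming the same dataset identically (namespace plus name). The sketch below shows one hypothetical normalization and a check for consumer inputs that no producer claims to have written. The normalization rules are assumptions for illustration; lowercasing in particular is lossy for engines with case-sensitive identifiers.

```python
def canonical_key(namespace, name):
    """Normalize a dataset identifier so equivalent spellings compare equal.

    Assumed rules for this sketch: trim whitespace, drop trailing slashes on
    the namespace, strip surrounding double quotes on the name, lowercase both.
    """
    ns = namespace.strip().rstrip("/").lower()
    nm = name.strip().strip('"').lower()
    return ns, nm


def unmatched_inputs(producer_outputs, consumer_inputs):
    """Consumer inputs whose canonical key matches no producer output."""
    produced = {canonical_key(ns, n) for ns, n in producer_outputs}
    return [
        (ns, n) for ns, n in consumer_inputs
        if canonical_key(ns, n) not in produced
    ]


# Hypothetical identifiers emitted by a warehouse job and read by a Spark job.
warehouse_outputs = [("snowflake://acme", "ANALYTICS.MARTS.CUSTOMERS")]
spark_inputs = [
    ("snowflake://acme/", "analytics.marts.customers"),  # same table, other spelling
    ("s3://exports", "customers/2026"),                   # nothing upstream emits this
]

gaps = unmatched_inputs(warehouse_outputs, spark_inputs)
```

Running a check like this on a schedule surfaces broken stitches before someone relies on an incomplete lineage graph during an audit.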

Performance. Capturing detailed events on every job run adds overhead. The events have to be efficient enough not to slow down production workloads.
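One common way to keep emission off the hot path is to buffer events in a bounded queue and send them elsewhere, accepting that events are dropped under pressure rather than slowing the job. The class below is a hypothetical sketch of that idea, not the behavior of any particular OpenLineage client; `send` is whatever function actually ships an event.

```python
import queue


class BufferedEmitter:
    """Bounded, non-blocking buffer in front of a slower `send` function."""

    def __init__(self, send, max_pending=1000):
        self._send = send
        self._queue = queue.Queue(maxsize=max_pending)
        self.dropped = 0

    def emit(self, event):
        """Called by the job: never blocks; counts a drop when the buffer is full."""
        try:
            self._queue.put_nowait(event)
        except queue.Full:
            self.dropped += 1

    def flush(self, batch_size=100):
        """Called off the hot path (background thread or job end). Returns events sent."""
        sent = 0
        while sent < batch_size:
            try:
                event = self._queue.get_nowait()
            except queue.Empty:
                break
            self._send(event)
            sent += 1
        return sent


# Illustration with an in-memory sink and a deliberately tiny buffer.
delivered = []
emitter = BufferedEmitter(delivered.append, max_pending=2)
for i in range(3):
    emitter.emit({"seq": i})
flushed = emitter.flush()
```

Exposing the `dropped` counter as a metric makes the trade-off visible: if it climbs, the buffer or the flush cadence needs adjusting.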

Where pdpspectra fits

Our data engineering practice builds lineage infrastructure into client engagements. Column-level lineage is increasingly table stakes for compliance-sensitive workloads.

Lineage is operational discipline, not theater. Talk to our team about your data platform.
