Capture Data Lineage from dbt, Apache Airflow, and Apache Spark with Amazon SageMaker

Jun 25, 2025

Introducing SageMaker’s Enhanced Data Lineage Capabilities

The next generation of Amazon SageMaker serves as a comprehensive hub for your data, analytics, and AI needs. It integrates AWS artificial intelligence, machine learning, and analytics capabilities into a unified experience with seamless data access. Through Amazon SageMaker Unified Studio, you can access your data and utilize powerful tools for data processing, SQL analytics, model development, training, inference, and generative AI development—all from a single interface.

This unified experience is enhanced by Amazon Q and Amazon SageMaker Catalog (powered by Amazon DataZone), delivering embedded generative AI and governance capabilities at every step of your workflow.

Understanding Data Lineage in SageMaker

Data lineage, now integrated into SageMaker Catalog, allows domain administrators and data producers to centralize lineage metadata for their data assets. This feature tracks data flow over time, providing clear visibility into data origins, transformations, and business usage. By offering transparency around data sources, data lineage helps users trust that the data is appropriate for their use cases.

Because lineage is captured at the table, column, and job levels, data producers can perform impact analysis and trace data issues back to their source when they arise.

OpenLineage Integration for Expanded Capabilities

Once connections and data sources are configured, SageMaker begins capturing lineage, generating lineage events as data is transformed in AWS Glue or Amazon Redshift. This capability is compatible with OpenLineage, so you can extend lineage capture to additional processing tools.

Solution Architecture

Many third-party and open-source tools used to build and orchestrate data pipelines, such as dbt, Apache Airflow, and Apache Spark, support the OpenLineage standard. Because they share this standard, you only need to include and configure the appropriate OpenLineage library in each tool to stream lineage events to a target HTTP endpoint.
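To make the idea of a lineage event concrete, here is a minimal sketch of the kind of JSON payload an OpenLineage-compatible tool sends to an HTTP endpoint. It uses only the Python standard library. The helper name `build_run_event`, the producer URI, the namespaces, and the dataset names are all hypothetical, and the schema URL is an assumption about which spec version applies; real integrations emit richer events with facets.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_namespace, job_name, run_id, inputs=(), outputs=()):
    """Build a minimal OpenLineage-style run event as a plain dict.

    `inputs` and `outputs` are iterables of (namespace, name) pairs.
    """
    def dataset(namespace, name):
        return {"namespace": namespace, "name": name}

    return {
        "eventType": event_type,  # for example START, COMPLETE, or FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Hypothetical producer URI; a real integration sets its own.
        "producer": "https://example.com/hypothetical-producer",
        # Assumed spec version; check the version your tooling emits.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {"runId": run_id},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [dataset(ns, name) for ns, name in inputs],
        "outputs": [dataset(ns, name) for ns, name in outputs],
    }


# Illustrative values only.
event = build_run_event(
    "COMPLETE",
    "example_namespace",
    "daily_orders_load",
    str(uuid.uuid4()),
    inputs=[("s3://example-bucket", "raw/orders")],
    outputs=[("s3://example-bucket", "curated/orders")],
)
payload = json.dumps(event)  # the body that would be sent to the HTTP endpoint
```

The job, its run, and the input and output datasets in this payload are what allow a lineage graph to be assembled from a stream of such events.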

The architecture for integrating these tools with SageMaker or Amazon DataZone works through a proxy pattern:

Amazon API Gateway exposes an HTTP endpoint and path
Amazon SQS queue buffers events as they arrive
AWS Lambda function retrieves events from the queue, performs transformations if needed, and posts them to SageMaker or DataZone

AWS IAM secures the interactions between these components, and Amazon CloudWatch provides logging and observability.
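The Lambda step of this proxy pattern could be sketched roughly as follows. This is not the post's actual function: it assumes each SQS message body is the raw OpenLineage JSON forwarded by API Gateway, that the target domain ID is supplied in a hypothetical `DOMAIN_ID` environment variable, and that the queue's event source mapping has partial batch responses enabled. The helper name `process_records` is also illustrative. The only AWS call used is the DataZone `post_lineage_event` operation from boto3.

```python
import json
import os


def process_records(records, client, domain_id):
    """Post each SQS record body as a lineage event; return the failed items."""
    failures = []
    for record in records:
        try:
            event = json.loads(record["body"])  # rejects bodies that are not JSON
            # Any event transformation would be applied here before posting.
            client.post_lineage_event(
                domainIdentifier=domain_id,
                event=json.dumps(event).encode("utf-8"),
            )
        except Exception as exc:  # report the message so SQS can redeliver it
            print(f"Failed to post message {record['messageId']}: {exc}")
            failures.append({"itemIdentifier": record["messageId"]})
    return failures


def handler(event, context):
    """Lambda entry point for an SQS-triggered invocation."""
    import boto3  # imported lazily so process_records can be exercised without AWS

    client = boto3.client("datazone")
    # DOMAIN_ID is a hypothetical environment variable name.
    failures = process_records(event.get("Records", []), client, os.environ["DOMAIN_ID"])
    return {"batchItemFailures": failures}
```

Passing the client into `process_records` keeps the posting logic separate from the AWS wiring, so it can be checked locally with a stand-in client before deployment.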

Practical Implementation Examples

The post provides detailed implementation guides for:

Setting up OpenLineage package for Spark in AWS Glue 4.0
Configuring OpenLineage package for dbt
Implementing OpenLineage package for Apache Airflow (MWAA)

Each implementation allows you to stream lineage events to SageMaker or Amazon DataZone, providing comprehensive visibility into your data pipelines. The examples demonstrate how SageMaker and DataZone can map all datasets and jobs to accurately reflect your data pipeline structure.
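As a rough illustration of what "configuring the OpenLineage package" means for each tool, the sketch below builds the HTTP transport settings for Spark, dbt, and Airflow. It is not the post's exact configuration. The function names are hypothetical, and the URL, endpoint path, and namespace in the usage lines are placeholders; the setting keys are the ones used by the OpenLineage Spark integration, the OpenLineage Python client, and the Airflow OpenLineage provider, to the best of my knowledge.

```python
import json


def spark_openlineage_conf(url, endpoint, namespace):
    """Spark properties that enable the OpenLineage listener with an HTTP transport."""
    return {
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": url,
        "spark.openlineage.transport.endpoint": endpoint,
        "spark.openlineage.namespace": namespace,
    }


def dbt_openlineage_env(url, endpoint, namespace):
    """Environment variables read by the OpenLineage client when wrapping dbt."""
    return {
        "OPENLINEAGE_URL": url,
        "OPENLINEAGE_ENDPOINT": endpoint,
        "OPENLINEAGE_NAMESPACE": namespace,
    }


def airflow_openlineage_env(url, endpoint, namespace):
    """Airflow settings for the OpenLineage provider, in environment-variable form."""
    transport = {"type": "http", "url": url, "endpoint": endpoint}
    return {
        "AIRFLOW__OPENLINEAGE__TRANSPORT": json.dumps(transport),
        "AIRFLOW__OPENLINEAGE__NAMESPACE": namespace,
    }


# Placeholder values standing in for the API Gateway endpoint and path.
URL = "https://example.execute-api.us-east-1.amazonaws.com"
ENDPOINT = "example-stage/lineage"

spark_conf = spark_openlineage_conf(URL, ENDPOINT, "example_spark_namespace")
dbt_env = dbt_openlineage_env(URL, ENDPOINT, "example_dbt_namespace")
airflow_env = airflow_openlineage_env(URL, ENDPOINT, "example_airflow_namespace")
```

In a Spark job the properties would be applied to the session or passed as job configuration, along with the OpenLineage Spark jar. For dbt, the environment would typically be set before running the `dbt-ol` wrapper from the `openlineage-dbt` package. How the Airflow settings are supplied on MWAA depends on how the environment is managed.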

Production Considerations

When implementing this pattern in production environments, consider:

Implementing proper authentication and authorization for the API endpoint
Customizing Lambda function logic for specific transformations
Using FIFO queues if event order is critical
Using the direct OpenLineage transport (available from OpenLineage version 1.33.0) when the controls offered by the proxy pattern aren’t needed
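For the FIFO queue consideration, the following sketch shows one possible way to choose SQS FIFO message parameters so that events belonging to the same run stay in order. Grouping by the OpenLineage run ID and deduplicating on a hash of the body are design assumptions made here, not something the post prescribes, and `fifo_message_params` is a hypothetical helper name.

```python
import hashlib
import json


def fifo_message_params(queue_url, event):
    """Build SQS send_message parameters for a FIFO queue.

    Events with the same run ID share a message group, so SQS delivers
    them in the order they were sent.
    """
    body = json.dumps(event, sort_keys=True)
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": event["run"]["runId"],
        # Identical events hash to the same ID and are deduplicated by SQS.
        "MessageDeduplicationId": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }


# Illustrative values only; the queue URL is a placeholder.
params = fifo_message_params(
    "https://sqs.us-east-1.amazonaws.com/123456789012/example-lineage.fifo",
    {"eventType": "START", "run": {"runId": "example-run-id"}},
)
```

The resulting dictionary could be passed to the boto3 SQS client as `send_message(**params)`. Ordering is only guaranteed within a message group, so the choice of group key decides which events are ordered relative to each other.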

Benefits of SageMaker’s OpenLineage Compatibility

SageMaker’s compatibility with OpenLineage simplifies governance of data assets and increases trust in them. This capability is part of SageMaker’s broader governance strategy, which includes data quality, business metadata, data discovery, and access automation, and is intended to help teams get value from their data faster and establish a data-driven culture.

