Author: Data Domain Blogger
-

Capture Data Lineage from dbt, Apache Airflow, and Apache Spark with Amazon SageMaker
Amazon SageMaker now offers enhanced data lineage capabilities compatible with OpenLineage, allowing users to track data flow from tools like dbt, Apache Airflow, and Apache Spark. This integration creates transparency, builds trust, and centralizes governance of data assets.
-

Amazon PackScan: Revolutionizing Real-Time Sort Center Analytics with AWS Services
Discover how Amazon transformed its logistics operations with PackScan, an AWS-powered platform that reduced data latency from 1 hour to under 1 minute. This real-time analytics solution processes 500,000 scan events per second across 80 sort centers, resulting in a 25% increase in throughput and a 12% reduction in labor hours.
-

Using Amazon Neptune for Real-time Anomaly Detection in Gaming Transactions
Discover how Zupee used Amazon Neptune’s graph database to detect anomalies in gaming wallet transactions in real time. Learn how they overcame relational database limitations to build an integrity system that processes over 1 million daily transactions, identifies suspicious patterns, and ensures incentives reach legitimate users.
-

How Flutter UKI Optimized Data Pipelines with Amazon MWAA
Discover how Flutter UKI transformed their data pipelines by migrating from EC2-based Airflow to Amazon MWAA, managing 5,500 DAGs and 60,000 daily runs with improved stability and reduced operational overhead.
-

Scaling Apache Iceberg Tables with AWS Lake Formation Hybrid Access Mode
Apache Iceberg tables combined with AWS Lake Formation’s hybrid access mode provide a powerful solution for enterprises managing large datasets. This approach allows organizations to use Lake Formation for read access while maintaining IAM policy-based permissions for write operations, offering fine-grained access control without disrupting existing workflows.
-

MIT’s SASA Method: Training LLMs to Self-Detoxify Their Language Output
MIT researchers have developed SASA, a method that allows large language models to detoxify their own outputs without retraining. This system creates internal boundaries between toxic and non-toxic subspaces, helping LLMs generate appropriate content while maintaining natural language fluency, similar to how humans develop internal filters for appropriate speech.
-

Streamlining Cross-Account Orchestration with Amazon MWAA
Learn how to orchestrate data workflows across multiple AWS accounts and regions using Amazon Managed Workflows for Apache Airflow (MWAA). This article covers implementing secure cross-account access, creating custom Airflow operators, and following best practices for distributed data processing and machine learning pipelines.
-

Instagram’s Journey to Managing 1000+ ML Models
Instagram has successfully scaled its recommendation system to manage over 1000 ML models. This article explores how they built a robust infrastructure through a model registry, a streamlined launch process, and innovative stability metrics to maintain high-quality personalized experiences for billions of users.

