The Challenge of Large-Scale Data Understanding
Managing and understanding vast data ecosystems presents significant challenges for organizations committed to protecting user privacy. Meta’s diverse systems make comprehending data structure, meaning, and context particularly complex at scale.
To address these challenges, Meta has made substantial investments in advanced data understanding technologies as part of their Privacy Aware Infrastructure (PAI). This includes adopting a “shift-left” approach, integrating data schematization and annotations early in product development, and creating a universal privacy taxonomy—a standardized framework providing a common semantic vocabulary for data privacy management.
A Decade-Long Journey
Meta began their data understanding journey a decade ago, with millions of assets in scope, ranging from structured to unstructured data and processed by millions of data flows across Meta's family of apps. Today, Meta has cataloged millions of data assets and classifies them daily, supporting numerous privacy initiatives across product groups.
Their approach ensures privacy considerations are embedded at every stage of product development. For Meta, privacy isn’t just a compliance requirement—it’s a driver of product innovation.
The Privacy Aware Infrastructure (PAI)
The Privacy Aware Infrastructure integrates efficient and reliable privacy tools into Meta’s systems to address needs such as purpose limitation while unlocking opportunities for product innovation by ensuring transparency in data flows.
Data understanding is an early step in PAI, involving capturing the structure and meaning of data assets such as tables, logs, and AI models. Initially, Meta employed heuristics and classifiers to automatically detect semantic types from user-generated content, but conducting these processes outside developer workflows presented challenges in accuracy and timeliness.
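To make the idea of heuristic semantic-type detection concrete, here is a minimal sketch of how rule-based classifiers of this kind often work: regex rules are scored against a sample of column values, and a type is predicted only when enough values match. The rule set, function name, and threshold are illustrative assumptions, not Meta's actual classifiers.

```python
import re

# Illustrative rules mapping semantic types to value patterns.
# Ordered so more specific patterns are tried first.
RULES = {
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "IP_ADDRESS": re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
    "PHONE": re.compile(r"^\+?[\d\s().-]{7,15}$"),
}

def detect_semantic_type(samples, threshold=0.8):
    """Return the semantic type whose rule matches enough samples, else None."""
    for label, pattern in RULES.items():
        hits = sum(1 for value in samples if pattern.match(value))
        if samples and hits / len(samples) >= threshold:
            return label
    return None
```

A key limitation the article points at: because a scan like this runs outside the developer workflow and only sees sampled values, its predictions can lag behind schema changes and misfire on ambiguous data, which motivates the shift-left approach described above.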
Five-Step Approach to Data Understanding
Meta developed a comprehensive five-step approach to data understanding:
- Schematizing: Creating DataSchema, a standard format to capture structure and relationships of all data assets independent of system implementation
- Predicting metadata at scale: Using a universal privacy taxonomy and classification systems to identify and classify data elements
- Annotating: Attaching metadata to individual fields in data assets, combining machine predictions with developer input
- Inventorying assets and systems: Using OneCatalog to discover, register, and enumerate all data assets
- Maintaining data understanding: Employing robust processes to maintain high coverage and quality of schemas and annotations
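The annotating step above combines machine predictions with developer input. A minimal sketch of what a field-level annotation model might look like, assuming a simplified data model (the class and attribute names here are hypothetical, not Meta's internal APIs):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FieldAnnotation:
    """One field of a data asset, carrying both machine and human labels."""
    field_name: str
    predicted_label: str                  # from the classification system
    confidence: float                     # model confidence in the prediction
    confirmed_label: Optional[str] = None # developer-supplied override

    def effective_label(self) -> str:
        """Developer input takes precedence over the machine prediction."""
        return self.confirmed_label or self.predicted_label

@dataclass
class AssetSchema:
    """A schematized asset: a named collection of annotated fields."""
    asset_name: str
    fields: List[FieldAnnotation] = field(default_factory=list)
```

The design point this illustrates is that predictions and confirmations are stored side by side: low-confidence predictions can be routed to developers for review, while confirmed labels remain authoritative.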
Technical Implementation
Meta developed DataSchema, a standard format based on the Thrift Interface Definition Language and compatible with Meta's systems and programming languages. It describes over 100 million schemas across more than 100 data systems, covering granular data units from database tables to AI models.
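For orientation, here is a hedged sketch of what a Thrift-based schema format might look like. The struct and field names below are hypothetical stand-ins, not Meta's actual DataSchema definition:

```thrift
// Illustrative only: a field with a wire type and an optional taxonomy label.
struct FieldSchema {
  1: string name;
  2: string type;                    // e.g. "string", "int64", "map"
  3: optional string semantic_label; // privacy taxonomy label for this field
}

// Illustrative only: an asset-level schema, system-independent by design.
struct DataSchema {
  1: string asset_name;  // e.g. a table, log, or AI model
  2: string system;      // the data system that owns the asset
  3: list<FieldSchema> fields;
}
```

Describing assets in an IDL rather than in each system's native format is what makes the schema independent of system implementation, as the five-step list above emphasizes.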
The classification system leverages machine learning models and heuristics to predict data types by sampling data, extracting features, and inferring annotation values. Key components include a scheduling component, scanning service, and classification service.
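The sample, extract features, and infer flow can be sketched end to end. The function names and the single heuristic "model" below are illustrative assumptions; in practice the classification service would apply trained models over many features:

```python
import random

def sample_column(values, k=100):
    """Scanning service: draw a bounded sample so scans stay cheap."""
    return random.sample(values, min(k, len(values)))

def extract_features(samples):
    """Toy feature extraction: fraction of values containing '@'."""
    return {"at_sign_ratio": sum("@" in v for v in samples) / len(samples)}

def infer_annotation(features):
    """Classification service: map features to a predicted label."""
    return "EMAIL" if features["at_sign_ratio"] > 0.8 else "UNKNOWN"

def classify_column(values):
    """One scheduled scan of a single column, start to finish."""
    return infer_annotation(extract_features(sample_column(values)))
```

The scheduling component mentioned above would decide when and how often `classify_column`-style scans run over each asset, so that classifications stay fresh as data changes.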
Portable annotation APIs integrate directly into developer workflows, ensuring consistent representation of data across all systems at Meta, accurate understanding of that data, and efficient demonstration of compliance with regulatory requirements.
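One way an in-workflow annotation API can surface to developers is labeling fields where the data asset is defined, so annotations travel with the schema from day one. The `annotate` helper and label names below are hypothetical, shown only to illustrate the shift-left idea:

```python
def annotate(**labels):
    """Attach taxonomy labels to a class's fields at definition time."""
    def wrapper(cls):
        cls.__privacy_labels__ = labels
        return cls
    return wrapper

# Labels are declared alongside the asset definition, not added by a
# separate scan after the fact.
@annotate(user_id="IDENTIFIER", email="CONTACT_INFO")
class SignupEvent:
    def __init__(self, user_id: str, email: str):
        self.user_id = user_id
        self.email = email
```

Because the labels live next to the code that produces the data, downstream tooling can read them at build or deploy time instead of re-deriving them from content scans.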
Learnings and Future Direction
Building data understanding at Meta’s scale required novel infrastructure and contributions from thousands of engineers. Key learnings include:
- The importance of canonical catalogs for systems, assets, and taxonomy labels
- The value of an incremental and flexible approach to onboarding diverse systems
- The necessity of collaboration between infrastructure teams and subject matter experts
- The critical role of community engagement with tight feedback loops
Looking ahead, data understanding will continue to evolve, impacting various aspects of operations and product offerings, including improved AI and machine learning, streamlined developer workflows, operational efficiency, and product innovation.
By harnessing canonical metadata, Meta can deepen their shared understanding of data, unlocking unprecedented opportunities for innovation not just at Meta, but across the industry.
Visit the Meta Engineering Blog for more information on how Meta understands data at scale.