The Scale of Instagram’s ML Ecosystem
Instagram’s recommendation systems extend far beyond what users see in Feed, Stories, and Reels. From surfacing relevant comments and deciding which notifications matter to suggesting whom to tag in posts, ML models drive many personalized experiences across the platform.
The ranking funnel consists of multiple layers: sourcing (retrieval), early-stage ranking (ESR), and late-stage ranking (LSR). As content moves through this funnel, fewer candidates are processed while operations become increasingly complex and resource-intensive.
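The narrowing behavior of the funnel can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the eligibility filter standing in for sourcing, and the toy scoring functions are hypothetical and do not describe Instagram’s actual implementation.

```python
from typing import Callable, Iterable, List


def top_k(items: Iterable[int], score: Callable[[int], float], k: int) -> List[int]:
    """Keep the k highest-scoring items, best first."""
    return sorted(items, key=score, reverse=True)[:k]


def run_funnel(
    inventory: Iterable[int],
    is_eligible: Callable[[int], bool],   # hypothetical stand-in for sourcing
    light_score: Callable[[int], float],  # cheap model, applied to many items
    heavy_score: Callable[[int], float],  # costly model, applied to few items
    esr_keep: int,
    lsr_keep: int,
) -> List[int]:
    # Sourcing (retrieval): cut the full inventory down to eligible candidates.
    sourced = [item for item in inventory if is_eligible(item)]
    # Early-stage ranking: a cheap score prunes the candidate set.
    esr_survivors = top_k(sourced, light_score, esr_keep)
    # Late-stage ranking: the expensive score only sees the survivors.
    return top_k(esr_survivors, heavy_score, lsr_keep)


# Toy data: item IDs are integers and the scores are arbitrary arithmetic.
ranked = run_funnel(
    inventory=range(1000),
    is_eligible=lambda item: item % 2 == 0,
    light_score=lambda item: -abs(item - 500),
    heavy_score=lambda item: item % 7 + item / 1000.0,
    esr_keep=50,
    lsr_keep=5,
)
print(ranked)
```

The point of the sketch is the cost profile: the expensive scorer runs on 50 items rather than 1000, which is why later stages can afford more complex models.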
This architecture, combined with constant experimentation, creates significant infrastructure challenges. Instagram’s ML engineers need flexibility to adjust model weights and parameters, resulting in an ever-growing number of models serving user traffic in production.
Infrastructure Challenges Identified
As Instagram’s ML ecosystem expanded, several critical risks emerged:
- Discovery issues: Even within Instagram’s team, tracking model growth became unmanageable, with product ML teams maintaining separate sources of truth.
- Release bottlenecks: The lack of a consistent, efficient model launch process slowed ML velocity and hampered product innovation.
- Health monitoring gaps: Without a standardized definition of model prediction quality, degraded ranking quality often went undetected.
Building Solutions at Scale
To address these challenges, Instagram implemented three core solutions:
1. Model Registry – A centralized ledger for tracking production model importance and business function. This registry became the foundation for automation, observability, and model health monitoring.
2. Model Launch Tooling – A streamlined process for launching new models, including estimation, approval, preparation, scale-up, and finalization. This automation reduced launch time from days to hours.
3. Model Stability Framework – A pioneering metric measuring the accuracy of model predictions. This framework established Service Level Objectives (SLOs) for all models in the registry, enabling comprehensive ML health monitoring.
The Power of a Model Registry
Prior to the registry, model investigations were time-consuming and error-prone. On-call engineers had to gather contextual information about models from multiple owners, slowing response times during critical incidents.
The model registry standardized information collection about model importance and business function, ensuring operational resources prioritized the most critical models. Built on Meta’s Configerator system, this schematized ledger provides a flexible foundation for automation and tooling.
Models are categorized by type (e.g., “ig_stories_tray_mtml”) and criticality level (from TIER0 to TIER4), providing clear context about their purpose and importance. This structured approach enabled comprehensive monitoring coverage and automated alerting.
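A registry entry of this kind might look like the sketch below. Only the model type string and the TIER0 to TIER4 range come from the text; the field names, the owner field, the second model type, and the assumption that TIER0 is the most critical tier are hypothetical, and the real registry is a Configerator schema rather than Python.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Criticality(IntEnum):
    # Assumption: lower tier number means more critical.
    TIER0 = 0
    TIER1 = 1
    TIER2 = 2
    TIER3 = 3
    TIER4 = 4


@dataclass(frozen=True)
class RegistryEntry:
    model_id: str             # hypothetical identifier
    model_type: str           # e.g. "ig_stories_tray_mtml"
    criticality: Criticality
    owner: str                # hypothetical field for the owning team


def triage_order(entries: List[RegistryEntry]) -> List[RegistryEntry]:
    """Order models so the most critical ones are investigated first."""
    return sorted(entries, key=lambda entry: entry.criticality)


entries = [
    RegistryEntry("model_b", "example_comment_ranker", Criticality.TIER3, "team_b"),
    RegistryEntry("model_a", "ig_stories_tray_mtml", Criticality.TIER0, "team_a"),
]
for entry in triage_order(entries):
    print(entry.criticality.name, entry.model_type, entry.owner)
```

With every model carrying the same fields, an on-call engineer or an alerting rule can answer "what is this model and how much does it matter" with a single lookup instead of asking several owners.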
Revolutionizing the Model Launch Process
The traditional model launch process was inefficient and time-consuming. It involved cloning services, sending shadow traffic, running multiple overload tests, and manually shifting traffic, a process that could take days.
To improve this, Instagram implemented:
- Virtual resource pools for each team, eliminating competition for resources
- Offline performance evaluation using pre-recorded traffic
- An automated launching platform that handles scaling and traffic shifts
These improvements dramatically reduced launch-related incidents, increased launch frequency from a few per week to more than 10 per week, and saved engineers more than two days per launch.
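As a rough illustration, the launch stages named earlier (estimation, approval, preparation, scale-up, finalization) can be modeled as an ordered pipeline that stops at the first failing stage. The stage names come from the text; the `run_launch` function and the per-stage check callbacks are hypothetical and are not the launch platform’s actual interface.

```python
from enum import Enum
from typing import Callable, Dict, List


class LaunchStage(Enum):
    # Declared in the order the stages are expected to run.
    ESTIMATION = "estimation"
    APPROVAL = "approval"
    PREPARATION = "preparation"
    SCALE_UP = "scale-up"
    FINALIZATION = "finalization"


def run_launch(checks: Dict[LaunchStage, Callable[[], bool]]) -> List[LaunchStage]:
    """Run stages in order and return the ones that completed.

    A stage with no registered check, or whose check returns False,
    halts the launch so later stages never run.
    """
    completed: List[LaunchStage] = []
    for stage in LaunchStage:  # Enum iteration follows definition order
        check = checks.get(stage)
        if check is None or not check():
            break
        completed.append(stage)
    return completed


# Hypothetical launch in which capacity scale-up fails.
checks = {stage: (lambda: True) for stage in LaunchStage}
checks[LaunchStage.SCALE_UP] = lambda: False
print([stage.value for stage in run_launch(checks)])
```

Encoding the stages as data rather than as a manual runbook is what lets a platform automate them; the checks themselves (capacity estimates, approvals, traffic shifts) would be far more involved in practice.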
Ensuring Model Stability
As Instagram’s model count grew, consistently measuring model health became crucial. Unlike traditional backend systems, ranking models must produce accurate recommendations to maintain user engagement.
The team developed a model stability metric that evaluates:
- Model calibration – the ratio of predicted click-through rate (CTR) to empirical CTR
- Model normalized entropy – measuring how well the predictor separates action from inaction
When either metric breaches established thresholds, the model is considered unstable. This approach enables real-time detection of prediction instability and helps identify trends across different ranking funnels.
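The two signals can be computed as in the sketch below. Calibration follows the definition given above. For normalized entropy, the sketch assumes the commonly used definition (average log loss divided by the entropy of the empirical CTR), which the text does not spell out. The thresholds in `is_stable` are placeholders chosen for illustration, not Instagram’s values.

```python
import math
from typing import Sequence, Tuple


def calibration(preds: Sequence[float], labels: Sequence[int]) -> float:
    """Predicted CTR divided by empirical CTR; 1.0 is perfectly calibrated."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("empirical CTR is zero; calibration is undefined")
    return sum(preds) / positives


def normalized_entropy(preds: Sequence[float], labels: Sequence[int],
                       eps: float = 1e-12) -> float:
    """Average log loss divided by the entropy of the empirical CTR.

    Assumed definition: a value near 1.0 means the model is no better than
    always predicting the average CTR; lower is better.
    """
    n = len(labels)
    ctr = sum(labels) / n
    if ctr <= 0.0 or ctr >= 1.0:
        raise ValueError("labels must contain both actions and non-actions")
    log_loss = 0.0
    for p, y in zip(preds, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        log_loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    log_loss /= n
    baseline = -(ctr * math.log(ctr) + (1.0 - ctr) * math.log(1.0 - ctr))
    return log_loss / baseline


def is_stable(preds: Sequence[float], labels: Sequence[int],
              calibration_band: Tuple[float, float] = (0.9, 1.1),  # placeholder
              max_ne: float = 1.0) -> bool:                        # placeholder
    """Unstable if either metric breaches its threshold."""
    low, high = calibration_band
    return (low <= calibration(preds, labels) <= high
            and normalized_entropy(preds, labels) <= max_ne)


labels = [1, 0, 0, 0]
print(is_stable([0.7, 0.1, 0.1, 0.1], labels))  # calibrated and informative
print(is_stable([0.9, 0.5, 0.5, 0.5], labels))  # over-predicts clicks
```

The two checks are complementary: a model can be well calibrated on average yet fail to separate action from inaction, or rank well while systematically over- or under-predicting, so either breach alone is treated as instability.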
Key Lessons Learned
Instagram’s journey to managing 1000+ ML models yielded valuable insights:
- Infrastructure understanding is essential for building the right tools
- Enabling teams to move quickly benefits the entire organization
- Reliability must encompass both system performance and recommendation quality
By implementing robust infrastructure and processes, Instagram has not only improved operational efficiency but also empowered its teams to drive continuous innovation and growth.
Visit Meta Engineering for more details on Instagram’s journey to scaling 1000+ ML models.