AWS Glue Data Catalog Now Automates Table Statistics for Enhanced Query Performance

Introducing Automated Statistics Generation

AWS Glue Data Catalog has introduced an automated system for generating statistics on new tables, enhancing the cost-based optimizer (CBO) functionality in Amazon Redshift Spectrum and Amazon Athena. With current statistics available, both engines can plan queries more efficiently, which can improve query performance and reduce cost.

Understanding the Impact on Query Performance

Queries over large datasets often join and aggregate data across multiple tables. The CBO uses table statistics to plan these queries. With metrics such as the number of distinct values in each column, it can choose better join strategies and join orders.
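To illustrate why distinct value counts matter, here is a minimal sketch of the textbook equi-join size estimate that cost-based optimizers commonly build on. It is not the actual planner logic of Athena or Redshift Spectrum; the function names and the table figures are hypothetical and chosen only for illustration.

```python
def estimate_join_rows(left_rows, right_rows, left_ndv, right_ndv):
    """Textbook equi-join estimate: |L| * |R| / max(ndv_L, ndv_R).

    ndv = number of distinct values in the join column.
    """
    return left_rows * right_rows / max(left_ndv, right_ndv, 1)


def choose_build_side(left_name, left_rows, right_name, right_rows):
    """For a hash join, build the hash table on the smaller input."""
    return left_name if left_rows <= right_rows else right_name


# Hypothetical tables: 1,000,000 orders joined to 50,000 customers.
with_stats = estimate_join_rows(1_000_000, 50_000, 50_000, 50_000)
# Without statistics a planner has to guess; a pessimistic guess of a
# single distinct value inflates the estimate to a full cross product.
without_stats = estimate_join_rows(1_000_000, 50_000, 1, 1)

print(with_stats)     # estimated rows when distinct counts are known
print(without_stats)  # estimated rows under the pessimistic guess
print(choose_build_side("orders", 1_000_000, "customers", 50_000))
```

The gap between the two estimates is the point: a planner that knows the distinct value counts can tell that the join output stays close to the size of the larger table, and can order joins accordingly.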

Key Features and Benefits

  • Automatic statistics generation for new and updated tables
  • Support for Parquet, ORC, JSON, ION, CSV, and XML file formats, as well as Apache Iceberg tables
  • Weekly automated statistics collection across all databases by default
  • Default sampling of 20% of records, balancing accuracy against collection cost
  • Flexible configuration options at both catalog and table levels
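The trade-off behind sampling can be sketched in a few lines. The example below draws roughly 20% of a column's values and estimates the null fraction from the sample alone. The source does not describe how the service samples records, so the row-level random sampling here, along with the function names and the synthetic column, is an assumption made purely for illustration.

```python
import random


def sample_rows(values, sample_percent=20.0, seed=0):
    """Keep each value with probability sample_percent / 100."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    threshold = sample_percent / 100.0
    return [v for v in values if rng.random() < threshold]


def estimate_null_fraction(sample):
    """Fraction of sampled values that are null (None)."""
    if not sample:
        return 0.0
    return sum(v is None for v in sample) / len(sample)


# Synthetic column: 10,000 values, exactly 10% of them null.
column = [None if i % 10 == 0 else i % 500 for i in range(10_000)]
sample = sample_rows(column)

print(len(sample))                     # roughly 2,000 rows scanned
print(estimate_null_fraction(sample))  # close to the true value of 0.1
```

Scanning about a fifth of the rows yields an estimate close to the true value, which is why a partial sample is often a reasonable default for statistics that only need to guide planning.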

Administrative Control and Customization

Data lake administrators can now enable catalog-level statistics collection through the Lake Formation console. This provides a foundation for platform-wide optimization while allowing individual data owners to customize settings for specific tables based on their unique requirements.

Table-Level Configuration Options

  • Adjustable collection frequency (hourly, daily, weekly, or monthly)
  • Customizable sampling percentage
  • Column-specific statistics collection
  • Integration with IAM roles and security configurations
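As a rough sketch, the options above could be gathered into a single settings payload for one table. The key names below follow what the Glue API's column statistics task settings operation is understood to accept (in boto3, create_column_statistics_task_settings); treat them, the cron expressions, and the helper name build_table_stats_settings as assumptions to verify against the current AWS documentation. The database, table, and role values are hypothetical.

```python
# Assumed cron expressions in the AWS six-field format.
SCHEDULES = {
    "hourly": "cron(0 * * * ? *)",
    "daily": "cron(0 0 * * ? *)",
    "weekly": "cron(0 0 ? * SUN *)",
    "monthly": "cron(0 0 1 * ? *)",
}


def build_table_stats_settings(database, table, role_arn,
                               frequency="weekly", sample_percent=20.0,
                               columns=None, security_configuration=None):
    """Assemble table-level statistics settings as a plain dict."""
    if frequency not in SCHEDULES:
        raise ValueError(f"unsupported frequency: {frequency}")
    if not 0 < sample_percent <= 100:
        raise ValueError("sample_percent must be in (0, 100]")

    settings = {
        "DatabaseName": database,
        "TableName": table,
        "Role": role_arn,  # IAM role the collection task runs as
        "Schedule": SCHEDULES[frequency],
        "SampleSize": float(sample_percent),
    }
    if columns:  # omit to collect statistics for all columns
        settings["ColumnNameList"] = list(columns)
    if security_configuration:
        settings["SecurityConfiguration"] = security_configuration
    return settings


# Hypothetical table: daily collection, 50% sample, two columns only.
settings = build_table_stats_settings(
    "sales_db", "orders",
    "arn:aws:iam::123456789012:role/ExampleGlueStatsRole",
    frequency="daily", sample_percent=50, columns=["customer_id", "order_date"],
)
print(settings)
```

If the parameter names match the current API, the resulting dict could be passed as keyword arguments to the corresponding Glue client call. Building it separately keeps validation testable without AWS credentials.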

Implementation Benefits

This automation eliminates the need for manual monitoring and configuration of statistics collection, reducing operational overhead while maintaining optimal query performance. The system automatically adapts to new tables and data changes, ensuring consistent optimization across your data platform.

 
