Amazon S3 Expands Capabilities with Managed Apache Iceberg Tables for Faster Data Lake Analytics and Automatic Metadata Generation to Simplify Data Discovery and Understanding
At AWS re:Invent, Amazon Web Services, Inc. (AWS), an Amazon.com, Inc. company (NASDAQ: AMZN), today announced new Amazon Simple Storage Service (Amazon S3) features that make S3 the first cloud object store with fully-managed support for Apache Iceberg for faster analytics and the easiest way to store and manage tabular data at any scale. These new features also include the ability to automatically generate queryable metadata, simplifying data discovery and understanding to help customers unlock the value of their data in S3.
- Amazon S3 Tables makes S3 the first cloud object store with built-in Apache Iceberg table support and introduces a new bucket type to optimize storage and querying of tabular data as Iceberg tables, delivering up to 3x faster query performance, up to 10x higher transactions per second (TPS), and automated table maintenance for analytics workloads.
- Amazon S3 Metadata streamlines data discovery in near real time by automatically capturing queryable object metadata, as well as custom metadata supplied through object tags, and storing it in S3 Tables to accelerate analytics across data lakes.
“As the leading object store in the world with more than 400 trillion objects, S3 is used by millions of customers, and we continue to innovate to remove the complexity of working with data at an unprecedented scale,” said Andy Warfield, vice president, Storage, and distinguished engineer, AWS. “We have seen the rapid rise of tabular data and, increasingly, customers want to query across tables, improve query performance, and understand and organize troves of data so they can easily find exactly what they need. S3 Tables and S3 Metadata remove the overhead of organizing and operating table and metadata stores on top of objects, so customers can shift their focus back to building with their data.”
S3 Tables and S3 Metadata are Apache Iceberg table-compatible so customers can easily query their data using AWS analytics services and open source tools, including Amazon Athena, Amazon QuickSight, and Apache Spark.
Amazon S3 Tables—the easiest and fastest way to perform analytics on Apache Iceberg tables in S3
Many customers today organize the data they use for analytics as tabular data, most often stored in Apache Parquet, a file format optimized for data queries. Parquet has become one of the fastest-growing data types in S3, and customers increasingly want to query these growing tabular data sets. Many turn to open table formats (OTFs), open source standards for storing data in tables, because they help organize, update, and track changes to large amounts of data. Iceberg has become the most popular OTF for managing Parquet files, with customers using Iceberg to query across billions of files containing petabytes or even exabytes of data. However, Iceberg can be challenging for customers to manage as they scale, often requiring dedicated teams to build and maintain systems that handle table maintenance, data compaction, and access control. These external systems are costly and complex, and maintaining them ties up skilled teams and valuable resources.
Amazon S3 Tables are purpose-built for managing Apache Iceberg tables for data lakes. S3 Tables are specifically optimized for analytics workloads, delivering up to 3x faster query performance and up to 10x higher TPS compared to general purpose S3 buckets. S3 Tables automatically manage table maintenance tasks such as compaction and snapshot management to continuously optimize query performance and storage costs, even as customers’ data lakes scale and evolve. Customers can use S3 Tables by creating a table bucket that optimizes the storage and querying of tabular data in fully-managed Iceberg tables. With S3 Tables, customers benefit from Iceberg capabilities like row-level transactions, queryable snapshots via time travel functionality, schema evolution, and more. In addition, S3 Tables provide table-level access controls, allowing customers to define permissions for individual tables.
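As a rough illustration of the table bucket workflow described above, the following Python sketch creates a table bucket, a namespace, and an Iceberg table. It is a sketch under stated assumptions, not official sample code: it assumes a client that behaves like the AWS SDK for Python's `s3tables` client, and the operation and parameter names (`create_table_bucket`, `create_namespace`, `create_table`, `tableBucketARN`) should be checked against current SDK documentation. `DryRunClient`, the ARNs it returns, and the bucket, namespace, and table names are hypothetical placeholders so the example runs without an AWS account.

```python
class DryRunClient:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def create_table_bucket(self, **kwargs):
        self.calls.append(("create_table_bucket", kwargs))
        # Placeholder ARN; the account ID and Region are made up.
        return {"arn": "arn:aws:s3tables:us-east-1:111122223333:bucket/" + kwargs["name"]}

    def create_namespace(self, **kwargs):
        self.calls.append(("create_namespace", kwargs))
        return {}

    def create_table(self, **kwargs):
        self.calls.append(("create_table", kwargs))
        # Placeholder ARN shape, for illustration only.
        return {"tableARN": kwargs["tableBucketARN"] + "/table/" + kwargs["name"]}


def create_iceberg_table(client, bucket_name, namespace, table_name):
    """Create a table bucket, then a namespace, then an Iceberg table in it.

    `client` is assumed to expose the same operations as
    boto3.client("s3tables"); the names used here are assumptions.
    """
    bucket_arn = client.create_table_bucket(name=bucket_name)["arn"]
    client.create_namespace(tableBucketARN=bucket_arn, namespace=[namespace])
    table = client.create_table(
        tableBucketARN=bucket_arn,
        namespace=namespace,
        name=table_name,
        format="ICEBERG",
    )
    return bucket_arn, table["tableARN"]


if __name__ == "__main__":
    client = DryRunClient()
    print(create_iceberg_table(client, "analytics-bucket", "sales", "daily_orders"))
```

With AWS credentials configured, the dry-run client could in principle be swapped for `boto3.client("s3tables")`. Passing the client in as a parameter keeps the sketch runnable and testable offline.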
Genesys, a global leader in AI-powered experience orchestration, plans to leverage Amazon S3 for its data lake. By utilizing S3 Tables’ managed Iceberg support, Genesys expects to offer a materialized view layer for its diverse data analysis needs. S3 Tables’ built-in support for Iceberg tables will simplify complex data workflows by automating key maintenance tasks such as table compaction, snapshot management, and unreferenced file cleanup. Genesys is looking forward to improved performance and broad support from Iceberg-compliant analytics tools that can read and write Iceberg tables directly from S3. S3 Tables will be foundational to Genesys’ future data strategy, enabling the company to deliver faster, more flexible, and reliable data insights to support its AI-driven customer and employee experience solutions.
Amazon S3 Metadata—the easiest and fastest way to discover and understand data in S3
As more customers use S3 as their central data repository, the volume and variety of data have grown exponentially, with metadata becoming increasingly important as a way to understand and organize large amounts of data so customers can find the exact objects they need. To address this problem, many customers resort to building and maintaining complex metadata capture and storage systems to enrich their understanding of data. But these metadata systems are expensive, time-consuming, and resource-intensive, often requiring data engineers to manually track and update metadata as it flows through their processing pipelines, as well as data analysts to manually inspect massive object stores to find the specific data they need for analytics and AI/ML data processing workflows.
Amazon S3 Metadata automatically generates queryable object metadata in near real time to help accelerate data discovery and improve data understanding, eliminating the need for customers to build and maintain their own complex metadata systems. S3 Metadata lets customers query, find, and use data for business analytics, real-time inference applications, and more. The generated object metadata includes system-defined details, such as the size and source of each object, and is made queryable through new S3 Tables. S3 Metadata updates object metadata in S3 Tables as objects are added or removed, giving customers an up-to-date view of their data. Customers can also add their own custom metadata using object tags to annotate objects with information specific to their business, such as product SKUs, transaction IDs, content ratings, or customer details. Customers can query metadata using a simple SQL query, enabling them to quickly find and prepare data for use in business analytics and real-time inference applications, as well as fine-tune foundation models, perform retrieval augmented generation (RAG), integrate data warehouse and analytics workflows, perform targeted storage optimization tasks, and more.
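To give a feel for the kind of "simple SQL query" described above, here is a minimal Python sketch that finds objects by tag and size. It is an illustration only: real S3 Metadata tables are Iceberg tables queried through engines such as Amazon Athena, whereas this example uses an in-memory SQLite database as a stand-in. The table name `object_metadata`, its columns, and the sample rows are all hypothetical and do not reflect the actual S3 Metadata schema.

```python
import sqlite3


def build_demo_metadata(conn):
    """Create a hypothetical metadata table and load made-up sample rows."""
    conn.execute(
        "CREATE TABLE object_metadata ("
        "object_key TEXT, size INTEGER, tag_key TEXT, tag_value TEXT)"
    )
    rows = [
        ("images/cat.jpg", 2_000_000, "content_rating", "G"),
        ("images/dog.jpg", 5_500_000, "content_rating", "PG"),
        ("orders/1001.parquet", 120_000_000, "sku", "A-42"),
    ]
    conn.executemany("INSERT INTO object_metadata VALUES (?, ?, ?, ?)", rows)


def find_objects_by_tag(conn, tag_key, tag_value, min_size=0):
    """Return (object_key, size) pairs matching a tag and a minimum size."""
    cur = conn.execute(
        "SELECT object_key, size FROM object_metadata "
        "WHERE tag_key = ? AND tag_value = ? AND size >= ? "
        "ORDER BY object_key",
        (tag_key, tag_value, min_size),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_demo_metadata(conn)
    print(find_objects_by_tag(conn, "content_rating", "G"))
```

The point of the sketch is the query shape: filtering on system-defined fields such as size together with custom tags replaces a manual scan of the object store.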
Organizations of all sizes are set to benefit from the data discovery and understanding that S3 Metadata will bring. Roche, a leading biotech company, plans to leverage S3 Metadata to accelerate their future generative AI initiatives. As they develop advanced large language model (LLM) applications like sophisticated internal chatbots, they anticipate managing exponentially larger volumes of unstructured data for enhanced RAG. S3 Metadata will simplify the creation of a scalable metadata system, automatically surfacing and updating metadata as new data is ingested. Roche envisions using custom Lambda functions to extract complex, business-specific metadata, integrating it seamlessly with S3 Metadata in a comprehensive Glue catalog. This will enable more efficient organization and rapid identification of relevant datasets for cutting-edge AI applications, allowing Roche to focus on groundbreaking innovations in personalized healthcare.
Cambridge Mobile Telematics (CMT) is the world’s largest telematics service provider. The company gathers sensor data from devices and enriches it with contextual data to create a unified view of vehicle and driver behavior that auto insurers, automakers, commercial mobility companies, and the public sector use to power risk assessment, safety, claims, and driver improvement programs. CMT stores and analyzes multiple petabytes of data from millions of IoT devices worldwide. As CMT scales, locating specific data for developing new insights and models becomes increasingly challenging. S3 Metadata, including system and custom metadata, allows CMT to query petabytes of metadata, making finding relevant data simple and cost-effective.
S3 Tables (generally available) and S3 Metadata (preview) are available today. S3 Tables’ integration with AWS Glue Data Catalog is in preview, allowing customers to query and visualize data, including S3 Metadata tables, using AWS analytics services such as Amazon Athena, Amazon Redshift, Amazon EMR, and Amazon QuickSight.
To learn more, visit:
- S3 Tables and S3 Metadata AWS News Blog posts for details on today’s announcements.
- S3 Tables and S3 Metadata product detail pages to learn more about their capabilities.
- S3 Tables and S3 Metadata videos for explanations on how they work.
About Amazon Web Services
Since 2006, Amazon Web Services has been the world’s most comprehensive and broadly adopted cloud. AWS has been continually expanding its services to support virtually any workload, and it now has more than 240 fully featured services for compute, storage, databases, networking, analytics, machine learning and artificial intelligence (AI), Internet of Things (IoT), mobile, security, hybrid, media, and application development, deployment, and management from 108 Availability Zones within 34 geographic regions, with announced plans for 18 more Availability Zones and six more AWS Regions in Mexico, New Zealand, the Kingdom of Saudi Arabia, Taiwan, Thailand, and the AWS European Sovereign Cloud. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—trust AWS to power their infrastructure, become more agile, and lower costs. To learn more about AWS, visit aws.amazon.com.
About Amazon
Amazon is guided by four principles: customer obsession rather than competitor focus, passion for invention, commitment to operational excellence, and long-term thinking. Amazon strives to be Earth’s Most Customer-Centric Company, Earth’s Best Employer, and Earth’s Safest Place to Work. Customer reviews, 1-Click shopping, personalized recommendations, Prime, Fulfillment by Amazon, AWS, Kindle Direct Publishing, Kindle, Career Choice, Fire tablets, Fire TV, Amazon Echo, Alexa, Just Walk Out technology, Amazon Studios, and The Climate Pledge are some of the things pioneered by Amazon. For more information, visit amazon.com/about and follow @AmazonNews.