How AWS S3 is built
Amazon S3 operates at an immense scale, handling hundreds of millions of transactions per second and storing exabytes of data, a feat achieved through meticulous engineering and a design philosophy centered on building for failure. The system has evolved significantly, incorporating Rust for performance-critical code and leveraging formal methods to ensure correctness, particularly in areas like consistency and cross-region replication. Key to its reliability are strategies to mitigate correlated failures and a principle that scale should be an advantage, leading to continuous improvements in performance and user experience.
- S3 handles hundreds of millions of transactions per second globally and stores over 500 trillion objects, amounting to hundreds of exabytes of data.
- The system achieved strong consistency without compromising availability or increasing costs through innovations like a replicated journal and a new cache coherency protocol.
- Performance-critical code paths in S3 have been largely rewritten in Rust to maximize performance and minimize latency.
- S3’s 11 nines of durability (99.999999999%) are continuously measured by auditor microservices, with automated repair systems addressing detected issues.
- Formal methods, including automated reasoning, are extensively used in production to verify code correctness, especially for the index subsystem and cross-region replication.
- Correlated failures, where multiple components fail simultaneously due to shared fault domains, are a primary concern and are mitigated by replicating data across multiple availability zones and using quorum-based algorithms.
- S3 comprises around 200 microservices, many dedicated to durability tasks like health checks and repairs, emphasizing simplified, focused services.
- S3 Vectors, a new data structure for searching high-dimensional vector spaces, achieves sub-100ms query times by precomputing vector neighborhoods.
- Crash consistency is a core design philosophy, with engineers reasoning about system states under failure conditions.
- The ‘Scale Is to Your Advantage’ principle ensures that increased scale improves system attributes like reliability.
Read the full article:
https://newsletter.pragmaticengineer.com/p/how-aws-s3-is-built
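
The summary mentions quorum-based algorithms over data replicated across availability zones as the defense against correlated failures. The general idea can be illustrated with a minimal sketch. This is a textbook read/write quorum, not S3's actual implementation: the `Replica` and `QuorumStore` classes, the zone labels, and the caller-supplied version numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One storage node. `zone` is a hypothetical availability-zone label."""
    zone: str
    up: bool = True
    store: dict = field(default_factory=dict)  # key -> (version, value)


class QuorumStore:
    """Toy quorum store (illustrative only, not how S3 is implemented)."""

    def __init__(self, replicas, write_quorum, read_quorum):
        # R + W > N means every read quorum overlaps every write quorum,
        # so a read always reaches at least one replica with the latest write.
        if write_quorum + read_quorum <= len(replicas):
            raise ValueError("need read_quorum + write_quorum > replica count")
        self.replicas = replicas
        self.w = write_quorum
        self.r = read_quorum

    def _live(self):
        return [rep for rep in self.replicas if rep.up]

    def put(self, key, value, version):
        live = self._live()
        if len(live) < self.w:
            raise RuntimeError("write quorum unavailable")
        for rep in live:
            current = rep.store.get(key)
            # Never overwrite a newer version with an older one.
            if current is None or current[0] < version:
                rep.store[key] = (version, value)
        return len(live)  # number of acknowledgements

    def get(self, key):
        live = self._live()
        if len(live) < self.r:
            raise RuntimeError("read quorum unavailable")
        answers = [rep.store[key] for rep in live if key in rep.store]
        if not answers:
            raise KeyError(key)
        # The highest version among the responders wins.
        return max(answers, key=lambda pair: pair[0])[1]


# Three replicas, one per (hypothetical) zone; any two form a quorum.
replicas = [Replica("az-1"), Replica("az-2"), Replica("az-3")]
store = QuorumStore(replicas, write_quorum=2, read_quorum=2)

store.put("photo.jpg", b"v1", version=1)

replicas[0].up = False                     # az-1 fails
store.put("photo.jpg", b"v2", version=2)   # still succeeds with az-2 and az-3

replicas[0].up = True                      # az-1 returns holding stale v1
replicas[1].up = False                     # az-2 fails
result = store.get("photo.jpg")            # az-3 supplies the newer version
print(result)
```

With three replicas and quorums of two, the loss of any single zone leaves both reads and writes available, and a replica that comes back with stale data cannot cause a stale read because it is always outvoted by a replica holding the newer version. Losing two zones at once makes the store refuse requests rather than return possibly stale data.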