AMAZON S3: ARCHITECTING FOR RESILIENCY IN THE FACE OF FAILURES Jason McHugh

Similar documents

Fault-Tolerant Computer System Design ECE 695/CS 590. Putting it All Together

Amazon EC2 Product Details Page 1 of 5

Designing Apps for Amazon Web Services

DISTRIBUTED SYSTEMS [COMP9243] Lecture 9a: Cloud Computing WHAT IS CLOUD COMPUTING? 2

EEDC. Scalability Study of web apps in AWS. Execution Environments for Distributed Computing

Preparing Your IT for the Holidays. A quick start guide to take your e-commerce to the Cloud

Web Application Deployment in the Cloud Using Amazon Web Services From Infancy to Maturity

Design for Failure High Availability Architectures using AWS

Migration Scenario: Migrating Backend Processing Pipeline to the AWS Cloud

How AWS Pricing Works May 2015

Cloud Computing For Bioinformatics

Amazon Elastic Beanstalk

High availability on the Catalyst Cloud

DLT Solutions and Amazon Web Services

Smartronix Inc. Cloud Assured Services Commercial Price List

Building Fault-Tolerant Applications on AWS October 2011

Cloud Computing Trends

Service Organization Controls 3 Report

Scalable Application. Mikalai Alimenkou

Cloud Computing Disaster Recovery (DR)

Migration Scenario: Migrating Batch Processes to the AWS Cloud

How AWS Pricing Works

Expand Your Infrastructure with the Elastic Cloud. Mark Ryland Chief Solutions Architect Jenn Steele Product Marketing Manager

Part V Applications. What is cloud computing? SaaS has been around for awhile. Cloud Computing: General concepts

Drupal in the Cloud. Scaling with Drupal and Amazon Web Services. Northern Virginia Drupal Meetup

Meeting Management Solution. Technology and Security Overview N. Dale Mabry Hwy Suite 115 Tampa, FL Ext 702

Designing a Cloud Storage System

Enabling Cloud Architecture for Globally Distributed Applications

High Availability for Citrix XenServer

SteelFusion with AWS Hybrid Cloud Storage

Multi-Datacenter Replication

Architecting For Failure Why Cloud Architecture is Different! Michael Stiefel

The Total Cost of (Non) Ownership of a NoSQL Database Cloud Service

Object Storage: A Growing Opportunity for Service Providers. White Paper. Prepared for: 2012 Neovise, LLC. All Rights Reserved.

WE RUN SEVERAL ON AWS BECAUSE WE CRITICAL APPLICATIONS CAN SCALE AND USE THE INFRASTRUCTURE EFFICIENTLY.

CONNECTRIA MANAGED AMAZON WEB SERVICES (AWS)

References. Introduction to Database Systems CSE 444. Motivation. Basic Features. Outline: Database in the Cloud. Outline

Introduction to Database Systems CSE 444

Resilient Voice Architecture

Cloud Computing project Report

Service Organization Controls 3 Report

Web Application Hosting in the AWS Cloud Best Practices

Scalable Architecture on Amazon AWS Cloud

Storage and Disaster Recovery

EXECUTIVE SUMMARY CONTENTS. 1. Summary 2. Objectives 3. Methodology and Approach 4. Results 5. Next Steps 6. Glossary 7. Appendix. 1.

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

A Comparison of Clouds: Amazon Web Services, Windows Azure, Google Cloud Platform, VMWare and Others (Fall 2012)

CloudLink - The On-Ramp to the Cloud Security, Management and Performance Optimization for Multi-Tenant Private and Public Clouds

Cloud Computing with Amazon Web Services and the DevOps Methodology.

Software- as- a- Service (SaaS) on AWS Business and Architecture Overview

Zadara Storage Cloud A

A SWOT ANALYSIS ON CISCO HIGH AVAILABILITY VIRTUALIZATION CLUSTERS DISASTER RECOVERY PLAN

Cloud and the future of Unemployment Sean Rhody, CTO Capgemini Government Solutions

High Performance MySQL Choices in Amazon Web Services: Beyond RDS. Andrew Shieh, SmugMug Operations smugmug.

High Availability and Clustering

High Availability with Postgres Plus Advanced Server. An EnterpriseDB White Paper

Cloud Computing. Chapter 1 Introducing Cloud Computing

BASICS OF SCALING: LOAD BALANCERS

Famly ApS: Overview of Security Processes

Cluster Computing. ! Fault tolerance. ! Stateless. ! Throughput. ! Stateful. ! Response time. Architectures. Stateless vs. Stateful.

RemoteApp Publishing on AWS

Scaling in the Cloud with AWS. By: Eli White (CTO & mojolive) eliw.com - mojolive.com

Storage Solutions in the AWS Cloud. Miles Ward Enterprise Solutions Architect

Thing Big: How to Scale Your Own Internet of Things.

VXLAN: Scaling Data Center Capacity. White Paper

HIGH-SPEED BRIDGE TO CLOUD STORAGE

Reference Model for Cloud Applications CONSIDERATIONS FOR SW VENDORS BUILDING A SAAS SOLUTION

NetflixOSS A Cloud Native Architecture

Active Directory Domain Services on the AWS Cloud: Quick Start Reference Deployment Mike Pfeiffer

Amazon Web Services Yu Xiao

Cloud Computing: Meet the Players. Performance Analysis of Cloud Providers

Deploying for Success on the Cloud: EBS on Amazon VPC. Phani Kottapalli Pavan Vallabhaneni AST Corporation August 17, 2012

STRATEGIC WHITE PAPER. The next step in server virtualization: How containers are changing the cloud and application landscape

Deploy Remote Desktop Gateway on the AWS Cloud

Alfresco Enterprise on AWS: Reference Architecture

HRG Assessment: Stratus everrun Enterprise

Building an AWS-Compatible Hybrid Cloud with OpenStack

Cloud Computing. Chapter 1 Introducing Cloud Computing

Deploying for Success on the Cloud: EBS on Amazon VPC Session ID#11312

Success in the Cloud. Mark williams

FAQ: BroadLink Multi-homing Load Balancers

FortiBalancer: Global Server Load Balancing WHITE PAPER

Storage Options in the AWS Cloud: Use Cases

DNS use guidelines in AWS Jan 5, 2015 Version 1.0

SiteCelerate white paper

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Network Virtualization Platform (NVP) Incident Reports

Transcription:

AMAZON S3: ARCHITECTING FOR RESILIENCY IN THE FACE OF FAILURES Jason McHugh

CAN YOUR S ERVICE S URVIVE?

CAN YOUR S ERVICE S URVIVE?

CAN YOUR SERVICE SURVIVE? Datacenter loss of connectivity Flood Tornado Complete destruction of a datacenter containing thousands of machines

KEY TAKEAWAYS Dealing with large scale failures takes a qualitatively different approach Set of design principles here will help AWS, like any mature software organization, has learned a lot of lessons about being resilient in the face of failures

OUTLINE AWS Amazon Simple Storage Service (S3) Scoping the failure scenarios Why failures happen Failure detection and propagation Architectural decisions to mitigate the impact of failures Examples of failures

ONE SLIDE INTRODUCTION TO AWS Amazon Elastic Compute Cloud (EC2) Amazon Elastic block storage service (EBS) Amazon Virtual Private Cloud (VPC) Amazon Simple storage service (S3) Amazon Simple queue service (SQS) Amazon SimpleDB Amazon Cloudfront CDN Amazon Elastic Map-Reduce (EMR) Amazon Relational Database Service (RDS)

AMAZON S3 Simple storage service Launched: March 14, 2006 at 1:59am Simple key/value storage system Core tenets: simple, durable, available, easily addressable, eventually consistent Large scale import/export available Financial guarantee of availability Amazon S3 has to be above99.9% available

AMAZON S3 MOMENTUM 52 Billion Q3 2009: 82 billion Peak RPS: 100,000+ 18 Billion 200 Million 5 Billion Total Number of Objects Stored in Amazon S3

FAILURES There are some things that pretty much everyone knows Expect drives to fail Expect network connection to fail (independent of the redundancy in networking) Expect a single machine to go out Workers Central Coordinator Workers Datacenter #1Datacenter #1Datacenter #3 Datacenter #2

FAILURE SCENARIOS Corruption of stored and transmitted data Losing one machine in fleet Losing an entire datacenter Losing an entire datacenter and one machine in another datacenter

WHY FAILURES HAPPEN Human error Acts of nature Entropy Beyond scale

FAILURE CAUSE: HUMAN ERROR Network configuration Pulled cords Forgetting to expose load balancers to external traffic DNS black holes Software bug Failure to use caution while pushing a rack of servers

FAILURE CAUSE: ACTS OF NATURE Flooding Standard kind Non-standard kind: Flooding from the roof down Heat waves New failure mode: dude that drives the diesel truck Lightning It happens Can be disruptive

FAILURE CAUSE: ENTROPY Drive failures During an average day many drives will fail in Amazon S3 Rack switch makes half the hosts in rack unreachable Which half? Depends on the requesting IP. Chillers fail forcing the shutdown of some hosts Which hosts? Essentially random from the service owner s perspective.

FAILURE CAUSE: BEYOND SCALE Some dimensions of scale are easy to manage Amount of free space in system Precise measurements of when you could run out No ambiguity Acquisition of components by multiple suppliers Some dimensions of scale are more difficult Request rate Ultimate manifestation: DDOS attack

RECOGNIZING WHEN FAILURE HAPPENS Timely failure detection Propagation of failure must handle or avoid Scaling bottlenecks of their own Centralized failure of failure detection units Asymmetric routes X #1 is healthy #1 is healthy Request #1 is healthy to #1 Service 1 Service 2 Service 3

GOSSIP APPROACH FOR FAILURE DETECTION Gossip, or epidemic protocols, are useful tools when probabilistic consistency can be used Basic idea Applications, components, or failure units, heartbeat their existence Machines wake up every time quantum to perform a round of gossip Every round machines contact another machine randomly, exchange all gossip state Robustness of propagation is both a positive and negative

S3 S GOSSIP APPROACH THE REALITY No, it really isn t this simple at scale Can t exchange all gossip state Different types of data change at different rates Rate of change might require specialized compression techniques Network overlay must be taken into consideration Doesn t handle the bootstrap case Doesn t address the issue of application lifecycle This alone is not simple Not all state transitions in lifecycle should be performed automatically. For some human intervention may be required.

DESIGN PRINCIPLES Prior just sets the stage 7 design principles

DESIGN PRINCIPLES TOLERATE FAILURES Service relationships Calls/Depends on Service 1 Service 2 Upstream from #2 Downstream from #1 Decoupling functionality into multiple services has standard set of advantages Scale the two independently Rate of change (verification, deployment, etc) Ownership encapsulation and exposure of proper primitives

DESIGN PRINCIPLES TOLERATE FAILURES Protect yourself from upstream service dependencies when they haze you Protect yourself from downstream service dependencies when they fail

DESIGN PRINCIPLES CODE FOR LARGE FAILURES Some systems you suppress entirely Example: replication of entities (data) When a drive fails replication components work quickly When a datacenter fails then replication components do minimal work without operator confirmation To Datacenter #3 Storage Storage Datacenter #1 Datacenter #2

DESIGN PRINCIPLES CODE FOR LARGE FAILURES Some systems must choose different behaviors based on the unit of failure Storage Datacenter #1 Object Storage Datacenter #2 Storage Datacenter #3 Storage Datacenter #4

DESIGN PRINCIPLE DATA & MESSAGE CORRUPTION At scale it is a certainty Application must do end-to-end checksums Can t trust TCP checksums Can t trust drive checksum mechanisms End-to-end includes the customer

DESIGN PRINCIPLE CODE FOR ELASTICITY The dimensions of elasticity Need infinite elasticity for cloud storage Quick elasticity for recovery from large-scale failures Introducing new capacity to a fleet Ideally you can introduce more resources in the system and capabilities increase All load balancing systems (hardware and software) Must become aware of new resources Must not haze How not to do it

DESIGN PRINCIPLE MONITOR, EXTRAPOLATE, AND REACT Modeling Alarming Reacting Feedback loops Keeping ahead of failures

DESIGN PRINCIPLE CODE FOR FREQUENT SINGLE MACHINE FAILURES Most common failure manifestation a single box Also sometimes exhibited as a larger-scale uncorrelated failure For persistent data consider use Quorum Specialization of redundancy If you are maintaining n copies of data Write to w copies and ensure all n are eventually consistent Read from r copies of data and reconcile

DESIGN PRINCIPLE CODE FOR FREQUENT SINGLE MACHINE FAILURES For persistent data use Quorum Advantage: does not require all operations to succeed on all copies Hides underlying failures Hides poor latency from users Disadvantages Increases aggregate load on system for some operations More complex algorithms Anti-entropy is difficult at scale

DESIGN PRINCIPLE CODE FOR FREQUENT SINGLE MACHINE FAILURES For persistent data use Quorum Optimal quorum set size System strives to maintain the optimal size even in the face of failures All operations have a set size If available copies are less than the operation set size then the operation is not available Example operations: read and write Operation set sizes can vary depending on the execution of the operations (driven by user s access patterns)

DESIGN PRINCIPLE GAME DAYS Network eng and data center technicians turn off a data center Don t tell service owners Accept the risk, it is going to happen anyway Build up to it to start Randomly, once a quarter minimum Standard post-mortems and analysis Simple idea test your failure handling however it may be difficult to introduce

REAL FAILURE EXAMPLES Large outage last year Traced down to a single network interface card Once found the problem was easily reproduced Corruption leaked past TCP checksuming on the single communication channel that did not have application level checksuming

REAL FAILURE EXAMPLES Network access to Datacenter is lost Happens not infrequently Several noteworthy events in the last year Due to transit providers, networking upgrades, etc. None noticed by customers Easily direct customers away from a datacenter It helps that we run game-days and irregular maintenance by failing entire datacenters

REAL FAILURE EXAMPLES Network route asymmetry Learning about machine health via gossip Route taken to learn about health might not be the same taken by communication between two machines Results in split brain I think that machine is unhealthy Everyone else says it is fine, keep trying

REAL FAILURE EXAMPLES Rack switch makes all or some of hosts unreachable Must handle losing hundreds of disks simultaneously Independent of fixing the rack switch and the timeline some action needs to be taken Intersection of a hundreds of sets of objects (say each set Intersection of a hundreds of sets of objects (say each set is 10 million objects) efficiently taking into account state of the world for other failed components

DESIGN PRINCIPLES RECAP Expect and tolerate failures Code for large scale failures Expect and handle data and message corruption Code for elasticity Monitor, extrapolate and react Code for frequent single machine failures Game days

WHAT I HAVEN T DISCUSSED Unit of failures Coalescing reporting of failures intelligently How to handle a failure Recording and trending of failure types Tracking and resolving failures In general all issues related to maintaining a good ratio of support burden to fleet size

CONCLUSION Just scratching the surface Set of design principles which can help your system be resilient in the face of failures Amazon S3 has maintained aggregate availability far in excess of our stated SLA for the last year Amazon AWS is hiring: http://aws.amazon.com/jobs

QUESTIONS?