Architecting For Failure Why Cloud Architecture is Different! Michael Stiefel www.reliablesoftware.com development@reliablesoftware.



Similar documents
Do Relational Databases Belong in the Cloud? Michael Stiefel

Designing Apps for Amazon Web Services

DISTRIBUTED SYSTEMS [COMP9243] Lecture 9a: Cloud Computing WHAT IS CLOUD COMPUTING? 2

Cloud Computing with Microsoft Azure

Cloud Computing Is In Your Future

Architecting ColdFusion For Scalability And High Availability. Ryan Stewart Platform Evangelist

Multi-Datacenter Replication

Fault-Tolerant Computer System Design ECE 695/CS 590. Putting it All Together

Guideline for stresstest Page 1 of 6. Stress test

Scalable Architecture on Amazon AWS Cloud

SCALABILITY AND AVAILABILITY

MEASURING WORKLOAD PERFORMANCE IS THE INFRASTRUCTURE A PROBLEM?

Web Application Hosting Cloud Architecture

Cloud Computing Disaster Recovery (DR)

FortiBalancer: Global Server Load Balancing WHITE PAPER

LinuxWorld Conference & Expo Server Farms and XML Web Services

Cloud Service Model. Selecting a cloud service model. Different cloud service models within the enterprise

Service Level Agreement for Windows Azure operated by 21Vianet

Web Application Hosting in the AWS Cloud Best Practices

How AWS Pricing Works

HRG Assessment: Stratus everrun Enterprise

Cluster Computing. ! Fault tolerance. ! Stateless. ! Throughput. ! Stateful. ! Response time. Architectures. Stateless vs. Stateful.

EXECUTIVE SUMMARY CONTENTS. 1. Summary 2. Objectives 3. Methodology and Approach 4. Results 5. Next Steps 6. Glossary 7. Appendix. 1.

Accelerating Web-Based SQL Server Applications with SafePeak Plug and Play Dynamic Database Caching

How AWS Pricing Works May 2015

Techniques for implementing & running robust and reliable DB-centric Grid Applications

Part V Applications. What is cloud computing? SaaS has been around for awhile. Cloud Computing: General concepts

High Availability Solutions for the MariaDB and MySQL Database

Using Multipathing Technology to Achieve a High Availability Solution

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

Cloud computing - Architecting in the cloud

Mark Bennett. Search and the Virtual Machine

Overview of Luna High Availability and Load Balancing

Application Performance Testing Basics

Public Cloud Partition Balancing and the Game Theory

Global Server Load Balancing

SAN Conceptual and Design Basics

Introduction to AWS Economics

Managing your Domino Clusters

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand

High-Availability in the Cloud Architectural Best Practices

High Availability Database Solutions. for PostgreSQL & Postgres Plus

All-Flash Arrays Weren t Built for Dynamic Environments. Here s Why... This whitepaper is based on content originally posted at

Distribution One Server Requirements

Performance and Scale in Cloud Computing

INDIA September 2011 virtual techdays

Informix Dynamic Server May Availability Solutions with Informix Dynamic Server 11

Cloud Architecture Patterns

msuite5 & mdesign Installation Prerequisites

High Availability and Clustering

New Features in SANsymphony -V10 Storage Virtualization Software

Amazon EC2 Product Details Page 1 of 5

Coyote Point Systems White Paper

VMware vcloud Automation Center 6.0

Blackboard Collaborate Web Conferencing Hosted Environment Technical Infrastructure and Security

HA / DR Jargon Buster High Availability / Disaster Recovery

Web Server (Step 1) Processes request and sends query to SQL server via ADO/OLEDB. Web Server (Step 2) Creates HTML page dynamically from record set

Migration Scenario: Migrating Backend Processing Pipeline to the AWS Cloud

Business Continuity in Today s Cloud Economy. Balancing regulation and security using Hybrid Cloud while saving your company money.

Load Balancing and Maintaining the Qos on Cloud Partitioning For the Public Cloud

Chapter 2 TOPOLOGY SELECTION. SYS-ED/ Computer Education Techniques, Inc.

Design for Failure High Availability Architectures using AWS

Scalability of web applications. CSCI 470: Web Science Keith Vertanen

ELIXIR LOAD BALANCER 2

5 Easy Steps to Implementing Application Load Balancing for Non-Stop Availability and Higher Performance

1. Comments on reviews a. Need to avoid just summarizing web page asks you for:

AMAZON S3: ARCHITECTING FOR RESILIENCY IN THE FACE OF FAILURES Jason McHugh

Cisco Application Networking for IBM WebSphere

High Availability for Database Systems in Cloud Computing Environments. Ashraf Aboulnaga University of Waterloo

High Availability Essentials

Scaling Analysis Services in the Cloud

Creating Web Farms with Linux (Linux High Availability and Scalability)

Fax Server Cluster Configuration

Cloud Based Application Architectures using Smart Computing

How To Use Arcgis For Free On A Gdb (For A Gis Server) For A Small Business

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University

Testing Cloud Application System Resiliency by Wrecking the System

Rackspace Cloud Databases and Container-based Virtualization

Building Scalable Applications Using Microsoft Technologies

Radware ADC-VX Solution. The Agility of Virtual; The Predictability of Physical

Best Practices for Deploying SSDs in a Microsoft SQL Server 2008 OLTP Environment with Dell EqualLogic PS-Series Arrays

Alfresco Enterprise on AWS: Reference Architecture

Web Application Hosting in the AWS Cloud Best Practices

Low-cost Open Data As-a-Service in the Cloud

HBA Virtualization Technologies for Windows OS Environments

The Shortcut Guide to Balancing Storage Costs and Performance with Hybrid Storage

Resource Utilization of Middleware Components in Embedded Systems

A SWOT ANALYSIS ON CISCO HIGH AVAILABILITY VIRTUALIZATION CLUSTERS DISASTER RECOVERY PLAN

The Benefits of Virtualizing

Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies

Delivering Quality in Software Performance and Scalability Testing

Global Server Load Balancing

FAQ: BroadLink Multi-homing Load Balancers

Building a Highly Available and Scalable Web Farm

Transcription:

Architecting For Failure Why Cloud Architecture is Different! Michael Stiefel www.reliablesoftware.com development@reliablesoftware.com

Outsource Infrastructure?

Traditional Web Application Web Site Virtual Machine / Directly on Hardware 100 MB Relational Database Inbound Transactions Output Transactions File System

Hosting Provider Costs Provider $ / Monthly Cost Host Gator 9.95 Go Daddy 10 ORCS Web 69 Amazon 83+ BYOS Windows Azure 97 Note: traditional hosting, no custom colocation, virtualized data centers.

Cloud is Not Cheaper for Hosting

Perhaps, Higher Availability?

SLA is Not Radically Different Provider Compute SLA (%) Go Daddy 99.9 ORCS Web 99.9 Host Gator 99.9 Amazon 99.95 Azure 99.95 Difference is seven minutes a day; 1.75 days a year.

Higher Rate Since You Pay for Flexibility

Hosting is Not Cloud Computing

Why Utility Computing? Scalability: do not have to pay for peak scenarios. Availability: can approach 100% if you want to pay.

Architecturally, they are the same problem

You must design to accommodate missing computing resources.

Designing for Failure is Cloud Computing

What s wrong with this Code Fragment? ClientProxy client = new ClientProxy(); Response response = client.do (request);

Never assume that any interface between two components always succeeds.

So You Put in a Catch Handler try { ClientProxy client = new ClientProxy(); int result = client.do (a, b, c); } catch (Exception ex) { }

What if a timeout, how many retries? the result is a complete failure? the underlying hardware crashed? you need to save the user s data? you are in the middle of a transaction?

What Do You Put in the Catch Handler? try { ClientProxy client = new ClientProxy(); int result = client.do (a, b, c); } catch (Exception ex) {???? }

You can t program yourself out of a failure.

Failure is a first-class design citizen.

Principle #1 The critical issue is how to respond to failure. The underlying infrastructure cannot guarantee availability.

Consequences of Failure

Multiple tiers and dependencies If your order queue fails, no orders If your customer service fails, no membership information

The more dependencies, the more consequences of a poorly handle failure

Dependencies include your code, third parties, the Internet/Web, anything you do not control

Unhandled failures propagate (like cracks) through your application.

Principle #2 Failures Cascade an unhandled failure in one part of the system becomes a failure of your application.

Transient Failure Resource Failure Two Types of Failure

Typical Response to a Transient Failure Retry How Often? How Long Before You Give Up?

Delays Cascade Just Like Failures Delays occur while you are waiting or retrying Delays hog resources like threads, TCP/IP ports, database connections, memory. Since delays are usually the result of resource bottlenecks, waiting or retrying for long periods adds to the bottleneck.

Transient failures become resource failures

Transient Failures Retry for a short time, then give up (like a circuit breaker) if unsuccessful. Never block on I/O, timeout and assume failure.

Principle #3 There is no such thing as a transient failure. Fail fast and treat it as a resource failure.

Make Components Failure Resistant Must Provide Failure Isolation

Make Components Failure Resistant Design For Beyond Largest Expected Load Understand latency of adding a new resource User load, virtual memory, CPU size, bandwidth, database Handle all Errors Failure affects more people than on the desktop.

Provide Failure Isolation Catch all exceptions Log all errors Return Succeed / Fail to External Services Have Failure Strategy For Dependent Services

Define your own SLA

Stress test components and system

A chain is a strong as its weakest link

Principle #4 Use a Margin of Safety when designing the resources used.

What is the cost of availability?

Any component or instance can fail eliminate single points of failure.

Search for Dependencies Hardware / Virtual Machines Third Party Libraries Internet/Web Interfaces to your own components TCP/IP ports DNS Servers Message Queues Database Drivers Credit Card Processors, Geocoding services, etc.

Examine Queries Only three types of result sets: Zero, One, Many (can become large overnight) Search Providers limit results returned Remember those 5 way joins your ORM uses Objects on a DCOM or RMI call

Principle #5 Eliminate single points of failure. Accept the fact that you must build a distributed application.

You need redundancy...

but you have to manage state.

Solutions such as database mirroring may have unacceptable latencies, such as over geography.

Reduce the parts of your application that handle state to a minimum.

Loss of a stateful component usually means loss of user data.

State Handling Components Does the UI layer need session state? Business Logic, Domain Layer should be stateless Use queues where they make sense to hold data Design services for minimal dependencies Pay with a customer number Keep state with the message Don t forget infrastructure logs, configuration files State is in specialized stores

Build atomic services. Atomic means unified, not small. Decouple the services.

Stateless components allow for scalability and redundancy.

What about the data tier?

Can you relax consistency constraints? What is acceptable data loss?

What is the cost of an apology?

How important is the relational model?

Design for Eventual Consistency

Consider CQRS

Monitor your components. Understand why they fail.

Reroute traffic to existing instances or another data center or geographic area?

Add more instances?

Caching or throttling can help your application run under failure.

Poorer performance may be acceptable.

Automate Automate.Automate

Principle #6 Degrade gracefully and predictably. Know what you can live without.

Cloud Outages Happen

Some Are Normal Some Are Black Swans

Humans Reason About Probabilities Poorly

Principle #7 Assume the Rare Will Occur - It Will Occur

Case Study: Amazon Four Day Outage

Facts April 21, 2011 One Day of Stabilization, Three Days of Recovery Problems: EC2, EBS, Relational Database Service Affected: Quora, Hootsite, Foursquare, Reddit Unaffected: Netflix, Twillo

Why were Netflix and Twillo Unaffected? They Designed For Failure

Netflix Explicitly Architected For Failure

Although more errors, higher latency, no increase in customer service calls or inability to find or start movies.

Key Architectural Decisions Stateless Services Data stored across isolation zones Could switch to hot standby Had Excess Capacity (N + 1) Handle large spikes or transient failures Used relational databases only where needed. Could partition data Degraded Gracefully

Degraded Gracefully Fail Fast, Aggressive Timeouts Can degrade to lower quality service no personalized movie list, still can get list of available movies Non Critical Features can be removed.

Chaos Monkey

Some Problems Had to manually reroute traffic; use more automation in the future for failover and recovery Round robin load balancer can overload decreased number of instances. May have to change auto scaling algorithm and internal load balancing. Expand to Geographic Regions

Summary Hosting in a cloud computing environment is valid. Cloud Computing means designing for failure.