Fault-Tolerant Computer System Design ECE 695/CS 590. Putting it All Together



Similar documents
Building Fault-Tolerant Applications on AWS October 2011

Web Application Deployment in the Cloud Using Amazon Web Services From Infancy to Maturity

Migration Scenario: Migrating Backend Processing Pipeline to the AWS Cloud

Amazon EC2 Product Details Page 1 of 5

Amazon Web Services Primer. William Strickland COP 6938 Fall 2012 University of Central Florida

Drupal in the Cloud. Scaling with Drupal and Amazon Web Services. Northern Virginia Drupal Meetup

How AWS Pricing Works May 2015

Cloud Computing For Bioinformatics

How AWS Pricing Works

Amazon Web Services Yu Xiao

Web Application Hosting in the AWS Cloud Best Practices

EXECUTIVE SUMMARY CONTENTS. 1. Summary 2. Objectives 3. Methodology and Approach 4. Results 5. Next Steps 6. Glossary 7. Appendix. 1.

Amazon Elastic Beanstalk

TECHNOLOGY WHITE PAPER Jun 2012

TECHNOLOGY WHITE PAPER Jan 2016

Using ArcGIS for Server in the Amazon Cloud

Scalable Architecture on Amazon AWS Cloud

Alfresco Enterprise on AWS: Reference Architecture

Deploying for Success on the Cloud: EBS on Amazon VPC. Phani Kottapalli Pavan Vallabhaneni AST Corporation August 17, 2012

Scaling in the Cloud with AWS. By: Eli White (CTO & mojolive) eliw.com - mojolive.com

ur skills.com

Preparing Your IT for the Holidays. A quick start guide to take your e-commerce to the Cloud

Migration Scenario: Migrating Batch Processes to the AWS Cloud

Amazon Cloud Storage Options

Design for Failure High Availability Architectures using AWS

Postgres Plus Cloud Database!

EEDC. Scalability Study of web apps in AWS. Execution Environments for Distributed Computing

DISTRIBUTED SYSTEMS [COMP9243] Lecture 9a: Cloud Computing WHAT IS CLOUD COMPUTING? 2

Deploying for Success on the Cloud: EBS on Amazon VPC Session ID#11312

Web Application Hosting in the AWS Cloud Best Practices

Cloud computing - Architecting in the cloud

Storage and Disaster Recovery

Using ArcGIS for Server in the Amazon Cloud

An Introduction to Cloud Computing Concepts

DLT Solutions and Amazon Web Services

Amazon AWS in.net. Presented by: Scott Reed

Amazon Web Services Student Tutorial

ArcGIS for Server in the Amazon Cloud. Michele Lundeen Esri

Cloud Models and Platforms

Service Organization Controls 3 Report

Cloud Computing project Report

High-Availability in the Cloud Architectural Best Practices

Storage Options in the AWS Cloud: Use Cases

Simple Storage Service (S3)

Apache Hadoop. Alexandru Costan

Getting Started with AWS. Web Application Hosting for Linux

Architecting for the Cloud: Best Practices January 2010 Last updated - May 2010

Reliable Data Tier Architecture for Job Portal using AWS

Multi-Datacenter Replication

ArcGIS for Server: In the Cloud

Jinesh Varia Technology Evangelist Architectural Design Patterns in Cloud Computing

AWS Account Setup and Services Overview

AWS Storage: Minimizing Costs While Retaining Functionality

Case Study. Cloud Adoption, Fault Tolerant AWS Support & Magento ecommerce Implementation. Case Study

Designing Apps for Amazon Web Services

A Comparison of Clouds: Amazon Web Services, Windows Azure, Google Cloud Platform, VMWare and Others (Fall 2012)

ColdFusion 10 in the Amazon AWS Cloud. Sven Ramuschkat tecracer GmbH

Expand Your Infrastructure with the Elastic Cloud. Mark Ryland Chief Solutions Architect Jenn Steele Product Marketing Manager

High Availability of VistA EHR in Cloud. ViSolve Inc. White Paper February

Running Oracle Applications on AWS

Software- as- a- Service (SaaS) on AWS Business and Architecture Overview

319 MANAGED HOSTING TECHNICAL DETAILS

AMAZON S3: ARCHITECTING FOR RESILIENCY IN THE FACE OF FAILURES Jason McHugh

Every Silver Lining Has a Vault in the Cloud

HADOOP BIG DATA DEVELOPER TRAINING AGENDA

Cloud Computing with Amazon Web Services and the DevOps Methodology.

ArcGIS 10.3 Server on Amazon Web Services

Hypertable Architecture Overview

Razvoj Java aplikacija u Amazon AWS Cloud: Praktična demonstracija

C Examcollection.Premium.Exam.34q

An Esri White Paper January 2011 Estimating the Cost of a GIS in the Amazon Cloud

RackWare Solutions Disaster Recovery

Service Organization Controls 3 Report

Big data blue print for cloud architecture

Storage Solutions in the AWS Cloud. Miles Ward Enterprise Solutions Architect

Learning Management Redefined. Acadox Infrastructure & Architecture

PostgreSQL Performance Characteristics on Joyent and Amazon EC2

Cloud Computing Disaster Recovery (DR)

Zadara Storage Cloud A

Amazon Compute - EC2 and Related Services

Estimating the Cost of a GIS in the Amazon Cloud. An Esri White Paper August 2012

Cloud Computing In Reality: Experience sharing in cloud solution developments and evaluations

Thing Big: How to Scale Your Own Internet of Things.

Getting Started with AWS. Computing Basics for Linux

Druva Phoenix: Enterprise-Class. Data Security & Privacy in the Cloud

Understanding AWS Storage Options

Architecting For Failure Why Cloud Architecture is Different! Michael Stiefel

Intro to AWS: Storage Services

Backup and Recovery of SAP Systems on Windows / SQL Server

Transcription:

Fault-Tolerant Computer System Design ECE 695/CS 590 Putting it All Together Saurabh Bagchi ECE/CS Purdue University ECE 695/CS 590 1 Outline Looking at some practical systems that integrate multiple techniques that we have learned Amazon Web Services (AWS) Hadoop Distributed Processing ECE 695/CS 590 2 1

Amazon Web Service (AWS): Case Study A set of services built in for reliability and security ECE 695/CS 590 3 AWS Ecosystem ECE 695/CS 590 4 2

AWS Reliability Examples Amazon S3 is highly durable and distributed data store With a simple web services interface, you can store and retrieve large amounts of data as objects in containers From anywhere on the web using standard HTTP commands Copies of objects can be distributed and cached at 14 edge locations around the world by creating a distribution Distribution handled using Amazon CloudFront7 service a web service for content delivery (static or streaming content) Amazon Simple Queue Service (Amazon SQS) is a reliable, highly scalable, hosted distributed queue for storing messages as they travel between computers and application components ECE 695/CS 590 5 Amazon Web Service: Case Study Amazon Machine Images (AMIs): Commonly used machine instances from which the user can choose to use as an execution platform; Spare instances can be kept running Amazon Elastic Block Store (Amazon EBS): Block-level storage volumes for AMIs Durability of EBS is higher than a typical hard drive due to storing data redundantly; Annual failure rate for an EBS volume is 0.1 to 0.5% compared to 4% for a regular hard drive. EBS provides a snapshot feature a backup of the system taken at a specific instance of time. Snapshots are stored in the Amazon S3 to ensure high durability. ECE 695/CS 590 6 3

Amazon Web Service: Case Study Autoscaling and Elastic Load Balancing: Allows EC2 capacity to go up or down as needed by load Example: When # running server instances is below a threshold, launch new server instances Example: Monitor resource utilization of server instances using CloudWatch service; if utilization too high, terminate server instances Example: Distribute incoming traffic across EC2 instances, for load balancing or to route around failed instances Regions and Availability zones: Distribute the application geographically in distant data centers. Each geographic location is called a Region. Within each Region, there are Availability Zones. Availability Zones are distinct locations that are insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. ECE 695/CS 590 7 AWS Regions US East (Northern Virginia) Region EC2 Availability Zones: 5 US West (Northern California) Region EC2 Availability Zones: 3 US West (Oregon) Region EC2 Availability Zones: 3 AWS GovCloud (US) Region EC2 Availability Zones: 2 ECE 695/CS 590 8 4

Designing for Failure in AWS Questions that should be asked wrt hardware failure: What happens if a node in your system fails? How do you recognize that failure? How do I replace that node? What kind of scenarios do I have to plan for? What are my single points of failure? If a load balancer is sitting in front of an array of application servers, what if that load balancer fails? If there are master and slaves in your architecture, what if the master node fails? How does the failover occur and how is a new slave instantiated and brought into sync with the master? Questions that should be asked wrt software failure: What happens to my application if the dependent services changes its interface? What if downstream service times out or returns an exception? What if the cache keys grow beyond memory limit of an instance? ECE 695/CS 590 9 Designing for Failures in AWS Best practices that help dealing with failures: 1. Have a coherent backup and restore strategy for your data and automate it 2. Build process threads that resume on reboot 3. Allow the state of the system to re-sync by reloading messages from queues 4. Keep pre-configured and pre-optimized virtual images to support (2) and (3) on launch/boot 5. Avoid in-memory sessions or stateful user context, move that to data stores. Review AWS specific tactics for implementing this best practice from Amazon Web Services - Architecting for The Cloud: Best Practices, Jinesh Varia, January 2011, page 12. ECE 695/CS 590 10 5

AWS Specific Fault-Tolerance Techniques 1. Failover gracefully using Elastic IPs: Elastic IP is a static IP that is dynamically re-mappable. You can quickly remap and failover to another set of servers so that your traffic is routed to the new servers. It works great when you want to upgrade from old to new versions or in case of hardware failures 2. Utilize multiple Availability Zones: Availability Zones are conceptually like logical datacenters. By deploying your architecture to multiple availability zones, you can ensure highly availability. Automatically replicate database updates across multiple Availability Zones. 3. Maintain an Amazon Machine Image so that you can restore and clone environments very easily in a different Availability Zone; Maintain multiple Database slaves across Availability Zones and setup hot replication. ECE 695/CS 590 11 AWS Specific Fault-Tolerance Techniques 4. Utilize Amazon CloudWatch (or various real-time open source monitoring tools) to get more visibility and take appropriate actions in case of hardware failure or performance degradation. Setup an Auto scaling group to maintain a fixed fleet size so that it replaces unhealthy Amazon EC2 instances by new ones. 5. Utilize Amazon EBS and set up cron jobs so that incremental snapshots are automatically uploaded to Amazon S3 and data is persisted independent of your instances. 6. Utilize Amazon RDS and set the retention period for backups, so that it can perform automated backups. ECE 695/CS 590 12 6

Hadoop: Case Study Runs on a collection of COTS shared-nothing servers ECE 695/CS 590 13 Hadoop: Case Study Hadoop job has two types of tasks: mappers and reducers. Mappers read the job input data from a distributed file system (HDFS) and produce key-value pairs. These map outputs are stored locally on compute nodes Each reducer processes a particular key range. For this, it copies map outputs from the mappers which produced values with that key (oftentimes all mappers). A reducer writes job output data to HDFS. A Task- Tracker (TT) is a Hadoop process running on compute nodes which is responsible for starting and managing tasks locally. A TT has a number of mapper and reducer slots which determine task concurrency. A TT communicates regularly with a Job Tracker (JT), a centralized Hadoop component that decides when and where to start tasks. JT also runs a speculative execution algorithm which attempts to improve job running time by duplicating under-performing tasks. ECE 695/CS 590 14 7

Hadoop: Case Study Failure cases it worries about: non-responsiveness of a task Can be due to overload of the task or network congestion Waits for non-responsive tasks (on the order of 10 minutes) and then re-executes the work of these tasks TT sends heartbeat to JT every 3 s. JT declares a TT dead if no heartbeat for 600 s. Then tasks are restarted on a different node A reducer is considered faulty if it failed too many times to copy map outputs. This decision is made at the TT. ECE 695/CS 590 15 8