NetflixOSS A Cloud Native Architecture




NetflixOSS: A Cloud Native Architecture
LASER Session 5: Availability
September 2013
Adrian Cockcroft
@adrianco @NetflixOSS
http://www.linkedin.com/in/adriancockcroft

Failure Modes and Effects

Failure Mode        | Probability | Current Mitigation Plan
--------------------|-------------|----------------------------------------
Application failure | High        | Automatic degraded response
AWS region failure  | Low         | Active-active multi-region deployment
AWS zone failure    | Medium      | Continue to run on 2 out of 3 zones
Datacenter failure  | Medium      | Migrate more functions to cloud
Data store failure  | Low         | Restore from S3 backups
S3 failure          | Low         | Restore from remote archive

Until we got really good at mitigating high- and medium-probability failures, the ROI for mitigating regional failures didn't make sense. Getting there.
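The "automatic degraded response" row is the circuit-breaker-with-fallback pattern that Netflix open sourced as Hystrix (not shown in these slides). A minimal sketch, assuming a hypothetical RecommendationsCommand and a canned fallback list standing in for the real personalized call:

import java.util.Arrays;
import java.util.List;

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// If run() fails or times out, Hystrix serves getFallback() instead of an error.
public class RecommendationsCommand extends HystrixCommand<List<String>> {
    private final String userId;

    public RecommendationsCommand(String userId) {
        super(HystrixCommandGroupKey.Factory.asKey("RecommendationsService")); // hypothetical group
        this.userId = userId;
    }

    @Override
    protected List<String> run() throws Exception {
        // Normal path: call the remote personalization service (stubbed out here).
        return remotePersonalizedListFor(userId);
    }

    @Override
    protected List<String> getFallback() {
        // Degraded response: an unpersonalized default list rather than an error page.
        return Arrays.asList("popular-title-1", "popular-title-2");
    }

    private List<String> remotePersonalizedListFor(String user) throws Exception {
        throw new Exception("placeholder for the real dependency call");
    }
}

// Usage: List<String> titles = new RecommendationsCommand("user-123").execute();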

Application Resilience
- Run what you wrote
- Rapid detection
- Rapid response

Chaos Monkey
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
- Computers (datacenter or AWS) randomly die: a fact of life, but too infrequent to test resiliency
- Test to make sure systems are resilient: kill individual instances without customer impact
- Latency Monkey (coming soon): inject extra latency and error return codes
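Conceptually the monkey just picks a random instance out of an Auto Scaling Group and terminates it. A minimal sketch of that idea with the AWS SDK for Java (the ASG name and client setup are assumptions; the real Chaos Monkey adds opt-outs, probabilities, and scheduling):

import java.util.List;
import java.util.Random;

import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
import com.amazonaws.services.autoscaling.model.AutoScalingGroup;
import com.amazonaws.services.autoscaling.model.DescribeAutoScalingGroupsRequest;
import com.amazonaws.services.autoscaling.model.Instance;
import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

// Toy "chaos monkey": terminate one random instance in a named ASG.
public class TinyChaosMonkey {
    public static void main(String[] args) {
        String asgName = args[0]; // e.g. "myapp-v042" (hypothetical)
        AmazonAutoScalingClient autoScaling = new AmazonAutoScalingClient();
        AmazonEC2Client ec2 = new AmazonEC2Client();

        // Look up the ASG and list its current instances.
        AutoScalingGroup group = autoScaling.describeAutoScalingGroups(
                new DescribeAutoScalingGroupsRequest().withAutoScalingGroupNames(asgName))
            .getAutoScalingGroups().get(0);
        List<Instance> instances = group.getInstances();

        // Pick a random victim and terminate it; the ASG should replace it automatically.
        Instance victim = instances.get(new Random().nextInt(instances.size()));
        ec2.terminateInstances(new TerminateInstancesRequest()
                .withInstanceIds(victim.getInstanceId()));
    }
}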

Edda - Configuration History
http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html
- AWS: instances, ASGs, etc.
- Eureka: services metadata
- AppDynamics: request flow
(Diagram: these three sources feed Edda, which is queried by the monkeys.)

Edda Query Examples

Find any instances that have ever had a specific public IP address:

$ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"
["i-0123456789","i-012345678a","i-012345678b"]

Show the most recent change to a security group:

$ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
--- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
+++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
@@ -1,33 +1,33 @@
{
  "ipRanges" : [
    "10.10.1.1/32",
    "10.10.1.2/32",
+   "10.10.1.3/32",
-   "10.10.1.4/32"
}

Apache Cassandra
- Scalable and stable in large deployments
- No additional license cost for large scale!
- Optimized for OLTP, vs. HBase which is optimized for DSS
- Available during partition (AP from CAP): hinted handoff repairs most transient issues; read repair and periodic repair keep it clean
- Quorum reads/writes and client-generated timestamps give read-after-write consistency with 2 of 3 copies
- Latest version includes Paxos for stronger transactions
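A minimal sketch of what "read-after-write consistency with 2 of 3 copies" looks like from the client side, using Astyanax (the column family, row key, and column names are hypothetical; keyspace setup is sketched under Astyanax below). With replication factor N = 3 and both operations at QUORUM, W = 2 and R = 2, so R + W > N and the read quorum overlaps the write quorum:

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.Column;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ConsistencyLevel;
import com.netflix.astyanax.serializers.StringSerializer;

public class QuorumReadAfterWrite {
    // Hypothetical column family in a keyspace with replication factor N = 3.
    static final ColumnFamily<String, String> CF_USERS = new ColumnFamily<String, String>(
            "Users", StringSerializer.get(), StringSerializer.get());

    static String writeThenRead(Keyspace keyspace) throws ConnectionException {
        // Write waits for acks from 2 of 3 replicas (W = 2)...
        MutationBatch m = keyspace.prepareMutationBatch()
                .setConsistencyLevel(ConsistencyLevel.CL_QUORUM);
        m.withRow(CF_USERS, "user-123").putColumn("plan", "streaming", null);
        m.execute();

        // ...and the read also waits for 2 of 3 (R = 2). R + W > N, so the
        // read is guaranteed to see the value just written.
        Column<String> plan = keyspace.prepareQuery(CF_USERS)
                .setConsistencyLevel(ConsistencyLevel.CL_QUORUM)
                .getKey("user-123")
                .getColumn("plan")
                .execute()
                .getResult();
        return plan.getStringValue();
    }
}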

Astyanax - Client for Java
Available at http://github.com/netflix
Features (see the setup sketch below):
- Abstraction of connection pool from RPC protocol
- Fluent-style API
- Operation retry with backoff
- Token aware
- Batch manager
- Many useful recipes
- New: entity mapper based on JPA annotations
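A minimal setup sketch showing how several of those features are wired together; cluster, keyspace, seed, and pool names are hypothetical, and exact builder methods vary slightly between Astyanax versions:

import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolType;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.retry.ExponentialBackoff;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class AstyanaxSetup {
    static Keyspace connect() {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                .forCluster("TestCluster")
                .forKeyspace("TestKeyspace")
                .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                        .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)
                        .setConnectionPoolType(ConnectionPoolType.TOKEN_AWARE) // token-aware routing
                        .setRetryPolicy(new ExponentialBackoff(250, 5)))       // operation retry with backoff
                .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("MyPool")
                        .setPort(9160)
                        .setMaxConnsPerHost(3)
                        .setSeeds("127.0.0.1:9160"))
                .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        return context.getClient(); // Keyspace handle used by the query examples
    }
}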

Astyanax Query Example
Paginate through all columns in a row:

// Iterate over row "A" in CF_STANDARD1, ten columns per page.
ColumnList<String> columns;
int pageize = 10;
try {
    RowQuery<String, String> query = keyspace
        .prepareQuery(CF_STANDARD1)
        .getKey("A")
        .setIsPaginating()
        .withColumnRange(new RangeBuilder().setMaxSize(pageize).build());

    while (!(columns = query.execute().getResult()).isEmpty()) {
        for (Column<String> c : columns) {
            // process each column here
        }
    }
} catch (ConnectionException e) {
    // handle or log connection failures
}

C* Astyanax Recipes
- Distributed row lock (without needing ZooKeeper) - see the sketch below
- Multi-region row lock
- Uniqueness constraint
- Multi-row uniqueness constraint
- Chunked and multi-threaded large file storage
- Reverse index search
- All-rows query
- Durable message queue
- Contributed: high-cardinality reverse index
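For example, the distributed row lock recipe serializes updates to a single row using only Cassandra. A minimal sketch with a hypothetical lock column family and row key:

import java.util.concurrent.TimeUnit;

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.recipes.locks.ColumnPrefixDistributedRowLock;
import com.netflix.astyanax.serializers.StringSerializer;

public class RowLockExample {
    // Hypothetical column family that holds the lock columns.
    static final ColumnFamily<String, String> CF_LOCKS = new ColumnFamily<String, String>(
            "Locks", StringSerializer.get(), StringSerializer.get());

    static void updateAccount(Keyspace keyspace) throws Exception {
        ColumnPrefixDistributedRowLock<String> lock =
                new ColumnPrefixDistributedRowLock<String>(keyspace, CF_LOCKS, "account-42")
                        .withTtl(60)                            // lock columns expire automatically
                        .expireLockAfter(10, TimeUnit.SECONDS); // treat older locks as stale
        try {
            lock.acquire(); // writes a lock column, then reads back to detect competing lockers
            // ... critical section: read-modify-write the row guarded by this lock ...
        } finally {
            lock.release();
        }
    }
}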

Astyanax Futures
- Maintain backwards compatibility
- Wrapper for C* 1.2 Netty driver
- More CQL support
- NetflixOSS Cloud Prize ideas: DynamoDB backend? More recipes?

Astyanax - Write Data Flows
Single region, multiple availability zones, token aware

1. Client writes to local coordinator
2. Coordinator writes to other zones
3. Nodes return ack
4. Data written to internal commit log disks (no more than 10 seconds later)

(Diagram: token-aware clients writing to replica disks in Zones A, B, and C.)

- If a node goes offline, hinted handoff completes the write when the node comes back up.
- Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
- SSTable disk writes and compactions occur asynchronously.

Data Flows for Multi-Region Writes
Token aware, consistency level = local quorum

1. Client writes to local replicas
2. Local write acks returned to client, which continues when 2 of 3 local nodes have committed
3. Local coordinator writes to remote coordinator
4. When data arrives, remote coordinator node acks and copies to other remote zones
5. Remote nodes ack to local coordinator
6. Data flushed to internal commit log disks (no more than 10 seconds later)

(Diagram: US clients writing to Zones A, B, and C in the local region, replicating across a 100+ ms latency link to Zones A, B, and C serving EU clients.)

- If a node or region goes offline, hinted handoff completes the write when the node comes back up.
- Nightly global compare and repair jobs ensure everything stays consistent.
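A small sketch of how a client might default to this behavior with Astyanax, assuming a keyspace replicated to both regions (e.g. NetworkTopologyStrategy with 3 copies per region, not shown): every operation waits only for 2 of the 3 local replicas, while cross-region replication proceeds asynchronously.

import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ConsistencyLevel;

public class LocalQuorumDefaults {
    // Reads and writes ack after 2 of the 3 replicas in the local region only;
    // the remote region's 3 copies are updated asynchronously by replication.
    static AstyanaxConfigurationImpl config() {
        return new AstyanaxConfigurationImpl()
                .setDefaultWriteConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM)
                .setDefaultReadConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM);
    }
}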

Platform Outage Taxonomy
Classify and name the different types of things that can go wrong.

YOLO

Zone Failure Modes
- Power outage: instances lost, ephemeral state lost; clean break and recovery, fail fast, no route to host
- Network outage: instances isolated, state inconsistent; more complex symptoms, recovery issues, transients
- Dependent service outage: cascading failures, misbehaving instances, human errors; confusing symptoms, recovery issues, byzantine effects

Zone Power Failure - June 29, 2012 - AWS US-East, "The Big Storm"
http://aws.amazon.com/message/67457/
http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html
Highlights:
- One of 10+ US-East datacenters failed generator startup
- UPS depleted -> 10-minute power outage for 7% of instances
Result:
- Netflix lost power to most of a zone and evacuated the zone
- Small, brief user impact due to errors and retries

Zone Failure Modes
(Diagram: US-East and EU-West regions, each with load balancers in front of Zones A, B, and C, annotated with zone network outage, zone power outage, and zone dependent service outage.)

Regional Failure Modes
Network failure takes region offline:
- DNS configuration errors
- Bugs and configuration errors in routers
- Network capacity overload
Control plane overload affecting entire region:
- Consequence of other outages
- Lose control of remaining zones' infrastructure
- Cascading service failure, hard to diagnose

Regional Control Plane Overload - April 2011, "The Big EBS Outage"
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
- Human error during a network upgrade triggered a cascading failure
- Zone-level failure, with brief regional control plane overload
Netflix infrastructure impact:
- Instances in one zone hung and could not launch replacements
- Overload prevented other zones from launching instances
- Some MySQL slaves offline for a few days
Netflix customer-visible impact:
- Higher latencies for a short time
- Higher error rates for a short time
- Outage was at a low-traffic time, so no capacity issues

Dependent Services Failure - June 29, 2012 - AWS US-East, "The Big Storm"
- Power failure recovery overloaded the EBS storage service
- Backlog of instance startups using EBS root volumes
ELB (load balancer) impacted:
- ELB instances couldn't scale because EBS was backlogged
- ELB control plane also became backlogged
Mitigation plans mentioned:
- Multiple control plane request queues to isolate backlog
- Rapid DNS-based traffic shifting between zones

Regional Failure Modes
(Diagram: US-East and EU-West regions, each with load balancers in front of Zones A, B, and C, annotated with regional network outage and control plane overload.)

Application Routing Failure - June 29, 2012 - AWS US-East, "The Big Storm"
- The Eureka service directory failed to mark down dead instances due to a configuration error
- Applications not using zone-aware routing kept trying to talk to dead instances and timing out
- Effect: higher latency and errors
- Mitigation: fixed the config, and made zone-aware routing the default (an illustrative sketch follows)
(Diagram: US-East and EU-West regions, each with load balancers in front of Zones A, B, and C; a zone power outage in one US-East zone.)
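The idea behind zone-aware routing is simple: prefer healthy instances in the caller's own availability zone and only spill over to other zones when the local zone has nothing healthy left. A purely illustrative sketch; the types and method here are hypothetical, not the Eureka/Ribbon API:

import java.util.ArrayList;
import java.util.List;

public class ZoneAwareRouting {
    // Hypothetical instance record as a client-side load balancer might see it.
    public static class ServerInstance {
        final String id;
        final String zone;
        final boolean healthy;
        ServerInstance(String id, String zone, boolean healthy) {
            this.id = id; this.zone = zone; this.healthy = healthy;
        }
    }

    // Return the candidate targets for a request originating in myZone.
    static List<ServerInstance> candidates(List<ServerInstance> all, String myZone) {
        List<ServerInstance> local = new ArrayList<ServerInstance>();
        List<ServerInstance> healthy = new ArrayList<ServerInstance>();
        for (ServerInstance s : all) {
            if (!s.healthy) continue;          // skip instances marked down in the registry
            healthy.add(s);
            if (s.zone.equals(myZone)) local.add(s);
        }
        // Prefer same-zone targets; if the whole zone is down, spill to other zones.
        return local.isEmpty() ? healthy : local;
    }
}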

Partial Regional ELB Outage - December 24, 2012
- ELB (load balancer) impacted: control plane database state accidentally corrupted
- Hours to detect, hours to restore from backups
Mitigation plans mentioned:
- Tighter process for access to the control plane
- Better zone isolation
(Diagram: US-East and EU-West regions, each with load balancers in front of Zones A, B, and C.)

Global Failure Modes
Software bugs:
- Externally triggered (e.g. leap year/leap second)
- Memory leaks and other delayed-action failures
Global configuration errors:
- Usually human error
- Both infrastructure and application level
Cascading capacity overload:
- Customers migrating away from a failure
- Lack of cross-region service isolation

Global Software Bug Outages
AWS S3 global outage in 2008:
- Gossip protocol propagated errors worldwide
- No data loss, but service offline for up to 9 hours
- Extra error detection fixes; no big issues since
Microsoft Azure leap day outage in 2012:
- A bug failed to generate certificates ending 2/29/13
- Failure to launch new instances for up to 13 hours
- One-line code fix
Netflix configuration error in 2012:
- A global property was updated to a broken value
- Streaming stopped worldwide for ~1 hour until we changed it back
- Fix planned to keep a history of properties for quick rollback

Global Failure Modes
(Diagram: capacity demand migrating from US-East to EU-West load balancers and zones during a cascading capacity overload; software bugs and global configuration errors hit both regions. "Oops.")

Slideshare.net/Netflix - Details
- Meetup S1E3 July, featuring contributors Eucalyptus, IBM, PayPal, Riot Games: http://techblog.netflix.com/2013/07/netflixoss-meetup-series-1-episode-3.html
- Lightning Talks March S1E2: http://www.slideshare.net/ruslanmeshenberg/netflixoss-meetup-lightning-talks-androadmap
- Lightning Talks Feb S1E1: http://www.slideshare.net/ruslanmeshenberg/netflixoss-open-house-lightning-talks
- Asgard In Depth Feb S1E1: http://www.slideshare.net/joesondow/asgard-overview-from-netflix-oss-open-house
- Security Architecture: http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned/
- Cost Aware Cloud Architectures, with Jinesh Varia of AWS: http://www.slideshare.net/amazonwebservices/building-costaware-architectures-jineshvaria-aws-and-adrian-cockroft-netflix

Takeaways
- Cloud Native manages scale and complexity at speed
- NetflixOSS makes it easier for everyone to become Cloud Native

http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/netflix
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud @NetflixOSS