NetflixOSS: A Cloud Native Architecture
LASER Session 5: Availability
September 2013
Adrian Cockcroft @adrianco @NetflixOSS
http://www.linkedin.com/in/adriancockcroft
Failure Modes and Effects
- Application Failure (High probability): Automatic degraded response
- AWS Region Failure (Low): Active-Active multi-region deployment
- AWS Zone Failure (Medium): Continue to run on 2 out of 3 zones
- Datacenter Failure (Medium): Migrate more functions to cloud
- Data store failure (Low): Restore from S3 backups
- S3 failure (Low): Restore from remote archive
Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn't make sense. Getting there.
Application Resilience
- Run what you wrote
- Rapid detection
- Rapid response
Chaos Monkey
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
Computers (datacenter or AWS) randomly die
- Fact of life, but too infrequent to test resiliency
Test to make sure systems are resilient
- Kill individual instances without customer impact
Latency Monkey (coming soon)
- Inject extra latency and error return codes
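The core selection logic is simple enough to sketch. The following is a minimal illustration, not the real Chaos Monkey code; the class name and the termination callback are hypothetical:

```java
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;

// Minimal Chaos Monkey-style sketch: pick one random instance from an
// auto-scaling group and hand it to a termination callback. Illustrative only.
public class ChaosMonkeySketch {
    private final Random random;

    public ChaosMonkeySketch(Random random) {
        this.random = random;
    }

    // Returns the chosen victim so callers can log it, or null if the group is empty.
    public String unleashOn(List<String> groupInstances, Consumer<String> terminator) {
        if (groupInstances.isEmpty()) {
            return null; // nothing to kill in this group
        }
        String victim = groupInstances.get(random.nextInt(groupInstances.size()));
        terminator.accept(victim); // in production this would call the cloud API
        return victim;
    }
}
```

In the real service the callback would terminate the instance through the AWS API, and the monkey would run only during business hours so engineers are around to notice.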
Edda: Configuration History
http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html
[Diagram: Edda collects AWS state (instances, ASGs, etc.), Eureka service metadata, and AppDynamics request flow; consumed by the Monkeys]
Edda Query Examples
Find any instances that have ever had a specific public IP address
$ curl "http://edda/api/v2/view/instances;publicipaddress=1.2.3.4;_since=0"
["i-0123456789","i-012345678a","i-012345678b"]
Show the most recent change to a security group
$ curl "http://edda/api/v2/aws/securitygroups/sg-0123456789;_diff;_all;_limit=2"
--- /api/v2/aws.securitygroups/sg-0123456789;_pp;_at=1351040779810
+++ /api/v2/aws.securitygroups/sg-0123456789;_pp;_at=1351044093504
@@ -1,33 +1,33 @@
{
  "ipranges" : [
    "10.10.1.1/32",
    "10.10.1.2/32",
+   "10.10.1.3/32",
-   "10.10.1.4/32"
}
Apache Cassandra
- Scalable and stable in large deployments
- No additional license cost at large scale!
- Optimized for OLTP, vs. HBase, which is optimized for DSS
Available during Partition (AP from CAP)
- Hinted handoff repairs most transient issues
- Read-repair and periodic repair keep it clean
Quorum and Client-Generated Timestamp
- Read-after-write consistency with 2 of 3 copies
- Latest version includes Paxos for stronger transactions
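The arithmetic behind "read-after-write consistency with 2 of 3 copies" is worth making concrete: with N replicas, a quorum is floor(N/2) + 1, and a read is guaranteed to overlap the latest write whenever read acks + write acks > N. A small sketch (my own helper class, not a Cassandra API):

```java
// Quorum arithmetic behind Cassandra-style tunable consistency: with N
// replicas, a quorum is floor(N/2) + 1, and reads are guaranteed to
// overlap writes whenever R + W > N.
public class QuorumMath {
    public static int quorumFor(int replicas) {
        return replicas / 2 + 1;
    }

    // True if every read is guaranteed to see the latest acknowledged write.
    public static boolean readsSeeWrites(int readAcks, int writeAcks, int replicas) {
        return readAcks + writeAcks > replicas;
    }
}
```

With 3 replicas, quorum reads and writes (2 + 2 > 3) give read-your-writes; consistency level ONE on both sides (1 + 1 = 2) does not.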
Astyanax: Cassandra Client for Java
Available at http://github.com/netflix
Features
- Abstraction of connection pool from RPC protocol
- Fluent-style API
- Operation retry with backoff
- Token aware
- Batch manager
- Many useful recipes
- New: Entity Mapper based on JPA annotations
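"Operation retry with backoff" typically means an exponentially growing, capped delay between attempts. A rough sketch of such a schedule (illustrative; this is not the actual Astyanax RetryPolicy interface):

```java
// Exponential backoff schedule of the kind retry policies use: the delay
// doubles after each failed attempt, capped at a maximum. Illustrative
// sketch only, not the Astyanax RetryPolicy API.
public class BackoffSketch {
    public static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt, 20); // clamp shift to avoid overflow
        return Math.min(delay, maxMillis);
    }
}
```

A production policy would also add random jitter so that many clients retrying at once do not hammer the cluster in lockstep.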
Astyanax Query Example
Paginate through all columns in a row

ColumnList<String> columns;
int pageSize = 10;
try {
    RowQuery<String, String> query = keyspace
        .prepareQuery(CF_STANDARD1)
        .getKey("A")
        .setIsPaginating()
        .withColumnRange(new RangeBuilder().setMaxSize(pageSize).build());
    // Each execute() fetches the next page until an empty result comes back
    while (!(columns = query.execute().getResult()).isEmpty()) {
        for (Column<String> c : columns) {
            // process column c
        }
    }
} catch (ConnectionException e) {
    // handle connection failure
}
C* Astyanax Recipes
- Distributed row lock (without needing ZooKeeper)
- Multi-region row lock
- Uniqueness constraint
- Multi-row uniqueness constraint
- Chunked and multi-threaded large file storage
- Reverse index search
- All rows query
- Durable message queue
- Contributed: high cardinality reverse index
Astyanax Futures
- Maintain backwards compatibility
- Wrapper for C* 1.2 Netty driver
- More CQL support
NetflixOSS Cloud Prize Ideas
- DynamoDB backend?
- More recipes?
Astyanax - Write Data Flows
Single Region, Multiple Availability Zone, Token Aware
1. Client writes to local coordinator
2. Coordinator writes to other zones
3. Nodes return ack
4. Data written to internal commit log disks (no more than 10 seconds later)
[Diagram: token-aware clients writing across disks in Zones A, B, and C]
If a node goes offline, hinted handoff completes the write when the node comes back up.
Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
SSTable disk writes and compactions occur asynchronously.
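The ack step above can be simulated: the coordinator sends the write to every replica and the request succeeds once the chosen consistency level's ack count is reached. A toy model with replica availability injected as a predicate (all names here are hypothetical, not driver code):

```java
import java.util.List;
import java.util.function.Predicate;

// Toy model of the coordinator's ack counting: send the write to every
// replica, succeed once at least `required` replicas acknowledge.
// Down replicas would later be repaired by hinted handoff.
public class WriteAckSketch {
    public static boolean write(List<String> replicas, int required,
                                Predicate<String> replicaUp) {
        int acks = 0;
        for (String replica : replicas) {
            if (replicaUp.test(replica)) {
                acks++; // this replica committed the write and acked
            }
        }
        return acks >= required;
    }
}
```

With three replicas, a quorum write (required = 2) still succeeds when one zone is down, which is exactly the "continue to run on 2 out of 3 zones" mitigation.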
Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum
1. Client writes to local replicas
2. Local write acks returned to client, which continues when 2 of 3 local nodes are committed
3. Local coordinator writes to remote coordinator
4. When data arrives, remote coordinator node acks and copies to other remote zones
5. Remote nodes ack to local coordinator
6. Data flushed to internal commit log disks (no more than 10 seconds later)
[Diagram: US clients writing to Zones A/B/C locally, replicating to EU zones over a 100+ms latency link]
If a node or region goes offline, hinted handoff completes the write when the node comes back up.
Nightly global compare and repair jobs ensure everything stays consistent.
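The LOCAL_QUORUM rule in step 2 (continue when 2 of 3 local nodes commit) can be sketched as a check that counts only acks from the client's own region, ignoring the asynchronous remote replication. A toy model, not driver code; the zone-name prefix convention is an assumption for the sketch:

```java
import java.util.List;

// LOCAL_QUORUM sketch: the client waits only for a quorum of replicas in
// its own region; remote-region replication completes asynchronously.
// Zones are identified here by a region prefix, e.g. "us-a" vs "eu-a".
public class LocalQuorumSketch {
    public static boolean localQuorumMet(List<String> ackedZones, String localPrefix,
                                         int localReplicas) {
        long localAcks = ackedZones.stream()
                .filter(z -> z.startsWith(localPrefix))
                .count();
        return localAcks >= localReplicas / 2 + 1;
    }
}
```

This is why the 100+ms inter-region latency never appears in the client's write path: remote acks arrive later and only matter for repair.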
Platform Outage Taxonomy Classify and name the different types of things that can go wrong
YOLO
Zone Failure Modes
Power Outage
- Instances lost, ephemeral state lost
- Clean break and recovery, fail fast, no route to host
Network Outage
- Instances isolated, state inconsistent
- More complex symptoms, recovery issues, transients
Dependent Service Outage
- Cascading failures, misbehaving instances, human errors
- Confusing symptoms, recovery issues, byzantine effects
Zone Power Failure
June 29, 2012: AWS US-East - The Big Storm
http://aws.amazon.com/message/67457/
http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html
Highlights
- One of 10+ US-East datacenters failed generator startup
- UPS depleted -> 10 min power outage for 7% of instances
Result
- Netflix lost power to most of a zone, evacuated the zone
- Small/brief user impact due to errors and retries
Zone Failure Modes
[Diagram: US-East and EU-West load balancers over Zones A/B/C, annotated with Zone Network Outage, Zone Power Outage, and Zone Dependent Service Outage]
Regional Failure Modes
Network Failure Takes Region Offline
- DNS configuration errors
- Bugs and configuration errors in routers
- Network capacity overload
Control Plane Overload Affecting Entire Region
- Consequence of other outages
- Lose control of remaining zones' infrastructure
- Cascading service failure, hard to diagnose
Regional Control Plane Overload
April 2011: The big EBS outage
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
Human error during a network upgrade triggered a cascading failure: a zone-level failure, with brief regional control plane overload
Netflix Infrastructure Impact
- Instances in one zone hung and could not launch replacements
- Overload prevented other zones from launching instances
- Some MySQL slaves offline for a few days
Netflix Customer Visible Impact
- Higher latencies for a short time
- Higher error rates for a short time
- Outage was at a low traffic time, so no capacity issues
Dependent Services Failure
June 29, 2012: AWS US-East - The Big Storm
- Power failure recovery overloaded the EBS storage service
- Backlog of instance startups using EBS root volumes
ELB (Load Balancer) Impacted
- ELB instances couldn't scale because EBS was backlogged
- ELB control plane also became backlogged
Mitigation Plans Mentioned
- Multiple control plane request queues to isolate backlog
- Rapid DNS-based traffic shifting between zones
Regional Failure Modes
[Diagram: Regional Network Outage and Control Plane Overload spanning the US-East and EU-West load balancers and Zones A/B/C]
Application Routing Failure
June 29, 2012: AWS US-East - The Big Storm
Eureka service directory failed to mark down dead instances due to a configuration error
[Diagram: zone power outage in US-East while load balancers route across Zones A/B/C in both regions]
Applications not using zone-aware routing kept trying to talk to dead instances and timing out
Effect: higher latency and errors
Mitigation: fixed the config, and made zone-aware routing the default
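The zone-aware routing mitigation can be sketched as a candidate filter: prefer instances registered in the caller's own zone, and fall back to the full list only when the local zone has none. This is an illustrative sketch, not the actual Ribbon/Eureka implementation:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Zone-aware routing sketch: route to instances in the caller's zone when
// possible, so a dead remote zone doesn't add timeouts to every request.
public class ZoneAwareRouting {
    public static List<String> candidates(List<String> instances,
                                          Map<String, String> zoneOf,
                                          String myZone) {
        List<String> local = instances.stream()
                .filter(i -> myZone.equals(zoneOf.get(i)))
                .collect(Collectors.toList());
        // Fall back to all instances if the local zone is empty
        return local.isEmpty() ? instances : local;
    }
}
```

A production router would also drop zones whose error rate is spiking, rather than relying purely on registry state.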
Partial Regional ELB Outage
Dec 24th, 2012
ELB (Load Balancer) Impacted
- ELB control plane database state accidentally corrupted
- Hours to detect, hours to restore from backups
Mitigation Plans Mentioned
- Tighter process for access to control plane
- Better zone isolation
[Diagram: US-East and EU-West load balancers over Zones A/B/C]
Global Failure Modes
Software Bugs
- Externally triggered (e.g. leap year/leap second)
- Memory leaks and other delayed action failures
Global Configuration Errors
- Usually human error
- Both infrastructure and application level
Cascading Capacity Overload
- Customers migrating away from a failure
- Lack of cross-region service isolation
Global Software Bug Outages
AWS S3 Global Outage in 2008
- Gossip protocol propagated errors worldwide
- No data loss, but service offline for up to 9 hrs
- Extra error detection fixes, no big issues since
Microsoft Azure Leap Day Outage in 2012
- Bug failed to generate certificates ending 2/29/13
- Failure to launch new instances for up to 13 hrs
- One-line code fix
Netflix Configuration Error in 2012
- Global property updated to a broken value
- Streaming stopped worldwide for ~1 hr until we changed it back
- Fix planned to keep a history of properties for quick rollback
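The planned Netflix fix (keep a history of properties for quick rollback) amounts to a versioned property store. A minimal sketch, assuming a simple last-in-first-out history; this is hypothetical, not the real Archaius-based implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a property store that remembers every value it has held, so a
// broken global update can be rolled back in one step. Illustrative only.
public class PropertyHistory {
    private final Deque<String> history = new ArrayDeque<>();

    public void set(String value) {
        history.push(value); // newest value on top
    }

    public String current() {
        return history.peek();
    }

    // Drop the newest value and return to the previous one;
    // the oldest value is never discarded.
    public String rollback() {
        if (history.size() > 1) {
            history.pop();
        }
        return history.peek();
    }
}
```

The point is operational: when a bad value stops streaming worldwide, recovery is one rollback call instead of hunting for what the old value used to be.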
Global Failure Modes
[Diagram: cascading capacity overload as demand migrates from US-East to EU-West; software bugs and global configuration errors ("Oops") hit both regions at once]
Slideshare.net/Netflix Details
Meetup S1E3 July, featuring contributors Eucalyptus, IBM, Paypal, Riot Games
http://techblog.netflix.com/2013/07/netflixoss-meetup-series-1-episode-3.html
Lightning Talks March S1E2
http://www.slideshare.net/ruslanmeshenberg/netflixoss-meetup-lightning-talks-androadmap
Lightning Talks Feb S1E1
http://www.slideshare.net/ruslanmeshenberg/netflixoss-open-house-lightning-talks
Asgard In Depth Feb S1E1
http://www.slideshare.net/joesondow/asgard-overview-from-netflix-oss-open-house
Security Architecture
http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned/
Cost Aware Cloud Architectures, with Jinesh Varia of AWS
http://www.slideshare.net/amazonwebservices/building-costaware-architectures-jineshvaria-aws-and-adrian-cockroft-netflix
Takeaways Cloud Native Manages Scale and Complexity at Speed NetflixOSS makes it easier for everyone to become Cloud Native http://netflix.github.com http://techblog.netflix.com http://slideshare.net/netflix http://www.linkedin.com/in/adriancockcroft @adrianco #netflixcloud @NetflixOSS