Design For Availability October 2013 Stevan Vlaovic svlaovic@netflix.com http://www.linkedin.com/in/stevanvlaovic
Stevan Vlaovic: Director, Membership Infrastructure, Netflix; Performance Architect, Display Advertising, Yahoo!; CTO & Founder, FastScale Technology; Microprocessor R&D, Sun Microsystems and Intel Corporation; PhD, Computer Science & Engineering, University of Michigan
Where time to market wins big: Making a land-grab. Disrupting competitors (OODA). Anything delivered as web services.
OODA loop (Colonel Boyd, USAF): "Get inside your adversaries' OODA loop to disorient them." Observe: land grab opportunity, competitive move, customer pain point, measure customers. Orient: analysis, model alternatives. Decide: plan response, get buy-in, commit resources. Act: implement, deliver, engage customers.
How Soon? Product features in days instead of months. Deployment in minutes instead of weeks. Incident response in seconds instead of hours.
Assumptions (axes: Scale vs. Speed). Telcos: slowly changing, large scale; hardware will fail. Enterprise IT: slowly changing, small scale; everything works. Web-Scale: rapid change, large scale; everything is broken. Startups: rapid change, small scale; software will fail.
Cloud Native: A new engineering challenge. Construct a highly agile and highly available service from ephemeral and assumed broken components.
Inspiration
Cloud Native How does Netflix work?
Netflix Member Web Site Home Page: Personalization driven. Everyone gets different content.
How Netflix Used to Work. Legend: Consumer Electronics, AWS Cloud Services, CDN Edge Locations, Datacenter. Components: Customer Device (PC, PS3, TV); Monolithic Web App (Oracle, MySQL); Monolithic Streaming App (Oracle, MySQL); Limelight/Level 3 Akamai CDNs; Content Management; Content Encoding.
How Netflix Streaming Works Today. Legend: Consumer Electronics, AWS Cloud Services, CDN Edge Locations, Datacenter. Components: Customer Device (PC, PS3, TV); Web Site or Discovery API (User Data, Personalization); Streaming API (DRM, QoS Logging); OpenConnect CDN Boxes; CDN Management and Steering; Content Encoding.
Things We Don't Use AWS For: SaaS Applications (PagerDuty, AppDynamics). Content Delivery Service. DNS Service.
Streaming Bandwidth (chart; labels: Nov 2012, March 2013): Mean Bandwidth +39% in 6 months
Incidents: Impact and Mitigation. Public Relations / Media Impact (PR): X incidents; Y incidents mitigated by Active-Active and game day practicing. High Customer Service Calls (CS): XX incidents; YY incidents mitigated by better tools and practices. Affects AB Test Results (metrics impact, feature disable): XXX incidents; YYY incidents mitigated by better data tagging. No Impact (fast retry or automated failover): XXXX incidents.
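The "no impact" tier above relies on fast retry or automated failover. A minimal sketch of that idea follows, assuming each endpoint is a plain callable standing in for a zone or region; the function name `call_with_failover` and the `retries_per_endpoint` parameter are hypothetical and are not taken from the talk or from any NetflixOSS library.

```python
def call_with_failover(endpoints, request, retries_per_endpoint=2):
    """Try each endpoint in order, retrying quickly before failing over.

    `endpoints` is a list of callables (hypothetical stand-ins for zones or
    regions). A ConnectionError counts as a retryable failure.
    """
    last_error = None
    for endpoint in endpoints:
        for _ in range(retries_per_endpoint):
            try:
                # First success wins; the caller never sees earlier failures.
                return endpoint(request)
            except ConnectionError as err:
                last_error = err
    # Every endpoint exhausted its retries: surface the last failure.
    raise RuntimeError("all endpoints failed") from last_error


if __name__ == "__main__":
    def zone_a(req):
        raise ConnectionError("zone A unavailable")

    def zone_b(req):
        return "zone B handled " + req

    print(call_with_failover([zone_a, zone_b], "GET /home"))
```

A real client would likely add timeouts and a bound on total latency; those are left out here to keep the sketch short.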
Highly Available NoSQL Storage A highly scalable, available and durable deployment pattern based on Apache Cassandra
Single Function Micro-Service Pattern: One keyspace, replaces a single table or materialized view. Many different single-function REST clients. Stateless data access REST service (Astyanax Cassandra client). Single-function Cassandra cluster, managed by Priam, between 6 and 144 nodes. Optional datacenter update flow. Over 50 Cassandra clusters; over 1000 nodes; over 30TB backup; over 1M writes/s/cluster. Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones. AppDynamics service flow visualization.
Antifragile Architecture. DNS: AWS Route53, UltraDNS, DynECT DNS; Denominator. Two sets of Regional Load Balancers, each in front of Zone A, Zone B, and Zone C, with Cassandra replicas in every zone. Global deployment in minutes, robust, agile, denormalized, NoSQL.
Open source all the things! Everything is on github.com Default to Apache 2.0 License
Asgard http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
Ephemeral Instances: Largest services are autoscaled. Average lifetime of an instance is 36 hours. (Chart labels: Push, Autoscale Up, Autoscale Down)
Application Resilience Run what you wrote Rapid detection Rapid Response
Antifragile API Patterns Functional Reactive with Circuit Breakers and Bulkheads https://speakerdeck.com/benjchristensen/rxjava-goto-aarhus-2013
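The two patterns named above can be sketched in a few lines. This is an illustrative sketch only, not the API of RxJava or of any NetflixOSS library: the class names `CircuitBreaker` and `Bulkhead`, the parameters `failure_threshold`, `reset_timeout`, and `max_concurrent`, and the injectable `clock` are all assumptions made for the example. The breaker stops calling a failing dependency for a cooling-off period and serves a fallback instead; the bulkhead caps concurrent calls so one slow dependency cannot exhaust the caller's capacity.

```python
import threading
import time


class CircuitBreaker:
    """Open after repeated failures; allow a trial call after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()   # open: short-circuit without calling
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        self.opened_at = None       # success closes the circuit
        return result


class Bulkhead:
    """Reject calls beyond a fixed concurrency limit instead of queueing."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, fallback):
        if not self._slots.acquire(blocking=False):
            return fallback()       # compartment full: fail fast
        try:
            return func()
        finally:
            self._slots.release()
```

In use, a caller would wrap each remote dependency in its own breaker and bulkhead, for example `breaker.call(fetch_ratings, lambda: [])`, where `fetch_ratings` is a hypothetical remote call and the empty list is the degraded response.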
Cloud Security: Fine-grained security rather than perimeter. Leveraging AWS scale to resist DDoS attacks. Automated attack surface monitoring and testing. http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned
Vendor Driven Portability: Interest in using NetflixOSS for Enterprise Private Clouds. "It's done when it runs Asgard": functionally complete; demonstrated March; released June in V3.3. IBM example application "Acme Air": based on NetflixOSS running on AWS; ported to IBM SoftLayer with RightScale. Vendor and end user interest: OpenStack "Heat" getting there; PayPal C3 Console based on Asgard.
Takeaway Cloud provides development and deployment agility. NetflixOSS provides highly available Cloud Native patterns. http://netflix.github.com http://techblog.netflix.com http://slideshare.net/netflix http://www.linkedin.com/in/stevanvlaovic @NetflixOSS
Appendix
Cassandra at Scale Benchmarking to Retire Risk
Benchmarking Global Cassandra: Write intensive test of cross region replication capacity. 16 x hi1.4xlarge SSD nodes per zone = 96 total. 192 TB of SSD in six locations, up and running Cassandra in 20 minutes. Test Load: 1 million writes at CL.ONE (wait for one replica to ack). Validation Load: 1 million reads after 500ms at CL.ONE with no data loss. Regions: US-West-2 (Oregon) and US-East-1 (Virginia), each with Zone A, Zone B, and Zone C holding Cassandra replicas. Inter-zone traffic. Inter-region traffic up to 9Gbits/s, 83ms. 18TB backups from S3.
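The sizing figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes the "six locations" are the three zones in each of the two regions listed; the per-node SSD figure is derived from the slide's totals rather than stated on it.

```python
# Figures taken from the slide.
NODES_PER_ZONE = 16
TOTAL_NODES_STATED = 96
TOTAL_SSD_TB = 192

# Assumption: six locations = 3 zones (A, B, C) x 2 regions.
ZONES_PER_REGION = 3
REGIONS = 2

locations = ZONES_PER_REGION * REGIONS            # 6 zones overall
total_nodes = NODES_PER_ZONE * locations          # 16 * 6 = 96 nodes
ssd_per_node_tb = TOTAL_SSD_TB / total_nodes      # 192 / 96 = 2.0 TB per node
ssd_per_zone_tb = ssd_per_node_tb * NODES_PER_ZONE  # 32.0 TB per zone

# The derived node count agrees with the slide's stated total.
assert total_nodes == TOTAL_NODES_STATED

print(locations, total_nodes, ssd_per_node_tb, ssd_per_zone_tb)
```

The derived node count matches the slide's "96 total", and the totals imply 2 TB of SSD per node.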