Design For Availability. October 2013 Stevan Vlaovic svlaovic@netflix.com http://www.linkedin.com/in/stevanvlaovic



Stevan Vlaovic
- Director, Membership Infrastructure, Netflix
- Performance Architect, Display Advertising, Yahoo!
- CTO & Founder, FastScale Technology
- Microprocessor R&D, Sun Microsystems and Intel Corporation
- PhD, Computer Science & Engineering, University of Michigan

Where time to market wins big
- Making a land-grab
- Disrupting competitors (OODA)
- Anything delivered as web services

OODA loop (Colonel Boyd, USAF): "Get inside your adversaries' OODA loop to disorient them"
- Observe: land grab opportunity, competitive move, customer pain point, measure customers
- Orient: analysis, model alternatives
- Decide: plan response, get buy-in, commit resources
- Act: implement, deliver, engage customers

How Soon?
- Product features in days instead of months
- Deployment in minutes instead of weeks
- Incident response in seconds instead of hours

Assumptions (axes: scale and speed)
- Slowly changing, large scale (Telcos): hardware will fail
- Rapid change, large scale (Web-Scale): everything is broken
- Slowly changing, small scale (Enterprise IT): everything works
- Rapid change, small scale (Startups): software will fail

Cloud Native
A new engineering challenge: construct a highly agile and highly available service from ephemeral and assumed broken components

Inspiration

Cloud Native How does Netflix work?

Netflix Member Web Site Home Page
- Personalization driven
- Everyone gets different content

How Netflix Used to Work
[Diagram] Tiers: Customer Device (PC, PS3, TV), Consumer Electronics, AWS Cloud Services, CDN Edge Locations, Datacenter
Components: Monolithic Web App (Oracle, MySQL); Monolithic Streaming App (Oracle, MySQL); Limelight, Level 3, and Akamai CDNs; Content Management; Content Encoding

How Netflix Streaming Works Today
[Diagram] Tiers: Customer Device (PC, PS3, TV), Consumer Electronics, AWS Cloud Services, CDN Edge Locations, Datacenter
Components: Web Site or Discovery API; User Data; Personalization; Streaming API; DRM; QoS Logging; OpenConnect CDN Boxes; CDN Management and Steering; Content Encoding

Things We Don't Use AWS For
- SaaS applications: PagerDuty, AppDynamics
- Content delivery service
- DNS service

Streaming Bandwidth
[Chart: Nov 2012 and March 2013; mean bandwidth +39% (6mo)]

Incidents: Impact and Mitigation
- Public relations / media impact (PR): X incidents
- High customer service calls (CS): XX incidents
- Affects AB test results (metrics impact, feature disable): XXX incidents
- No impact (fast retry or automated failover): XXXX incidents
Mitigation
- Y incidents mitigated by Active Active, game day practicing
- YY incidents mitigated by better tools and practices
- YYY incidents mitigated by better data tagging
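
The "fast retry or automated failover" tier of the incident slide can be illustrated with a small sketch. This is not Netflix code: `call_with_failover`, `make_endpoint`, and the zone names are hypothetical, and the endpoints are plain Python callables standing in for remote services.

```python
class AllEndpointsFailed(Exception):
    """Raised when every endpoint has exhausted its retries."""


def call_with_failover(endpoints, request, retries_per_endpoint=2):
    """Retry each endpoint a few times, then fail over to the next one."""
    errors = []
    for endpoint in endpoints:
        for _ in range(retries_per_endpoint):
            try:
                return endpoint(request)
            except ConnectionError as exc:
                # Fast retry: no sleep here; a real client would likely
                # add a short, jittered backoff.
                errors.append(exc)
    raise AllEndpointsFailed(errors)


def make_endpoint(name, failures):
    """Hypothetical endpoint that fails `failures` times, then succeeds."""
    state = {"remaining": failures}

    def endpoint(request):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ConnectionError(f"{name} unavailable")
        return f"{name}:{request}"

    return endpoint


primary = make_endpoint("zone-a", failures=5)    # stays down for this call
secondary = make_endpoint("zone-b", failures=1)  # recovers after one retry
result = call_with_failover([primary, secondary], "hello")
```

Here the caller never sees the two failures on `zone-a` or the single failure on `zone-b`; the request is answered by `zone-b` on its second attempt.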

Highly Available NoSQL Storage
A highly scalable, available and durable deployment pattern based on Apache Cassandra

Single Function Micro-Service Pattern
- One keyspace, replaces a single table or materialized view
- Many different single-function REST clients
- Stateless data access REST service, using the Astyanax Cassandra client
- Single-function Cassandra cluster, managed by Priam, between 6 and 144 nodes
- Optional datacenter update flow
- Over 50 Cassandra clusters, over 1000 nodes, over 30TB backup, over 1M writes/s/cluster
- Each icon represents a horizontally scaled service of three to hundreds of instances deployed over three availability zones
- AppDynamics service flow visualization
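
To make "deployed over three availability zones" concrete, here is a simplified sketch of zone-aware replica placement on a hash ring. It is an illustration only, not Cassandra's or Priam's actual placement logic; `token`, `place_replicas`, and the node and zone names are assumptions made for the example.

```python
import hashlib


def token(value):
    """Map a string to a position on the ring (MD5 used only for spread)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def place_replicas(key, nodes, replication_factor=3):
    """Pick replicas for `key` so that each one lands in a different zone.

    `nodes` is a list of (node_name, zone) pairs. Starting at the key's
    position, walk the ring clockwise and take the first node found in
    each zone that has not been used yet.
    """
    ring = sorted(nodes, key=lambda node: token(node[0]))
    key_token = token(key)
    start = next(
        (i for i, node in enumerate(ring) if token(node[0]) >= key_token), 0
    )
    chosen, used_zones = [], set()
    for offset in range(len(ring)):
        name, zone = ring[(start + offset) % len(ring)]
        if zone not in used_zones:
            chosen.append(name)
            used_zones.add(zone)
        if len(chosen) == replication_factor:
            break
    return chosen


# Six nodes, the smallest cluster size mentioned above, two per zone.
example_nodes = [
    ("node-0", "zone-a"), ("node-1", "zone-b"), ("node-2", "zone-c"),
    ("node-3", "zone-a"), ("node-4", "zone-b"), ("node-5", "zone-c"),
]
replicas = place_replicas("member:42", example_nodes)
```

With one replica per zone, losing a whole zone still leaves two copies of every key reachable.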

Antifragile Architecture
- DNS: AWS Route53, UltraDNS, DynECT DNS, Denominator
- Two sets of regional load balancers, each in front of Zone A, Zone B, and Zone C
- Cassandra replicas in every zone
- Global deployment in minutes, robust, agile, denormalized, NoSQL
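
The DNS layer in this diagram sits on top of several DNS vendors at once. The sketch below shows the general idea of a provider-neutral interface used to steer traffic away from a failed region. It is not the Denominator API: `DnsProvider`, `InMemoryProvider`, `steer_away_from`, and all hostnames are hypothetical stand-ins.

```python
from abc import ABC, abstractmethod


class DnsProvider(ABC):
    """Hypothetical provider-neutral interface (one subclass per vendor)."""

    @abstractmethod
    def set_records(self, name, targets):
        """Point `name` at the given list of targets."""


class InMemoryProvider(DnsProvider):
    """Stand-in for a real vendor client; just stores the records."""

    def __init__(self, label):
        self.label = label
        self.records = {}

    def set_records(self, name, targets):
        self.records[name] = list(targets)


def steer_away_from(providers, name, regional_lbs, failed_region):
    """Update every provider so `name` resolves only to healthy regions."""
    healthy = [
        lb for region, lb in sorted(regional_lbs.items())
        if region != failed_region
    ]
    if not healthy:
        raise RuntimeError("no healthy region left to serve traffic")
    for provider in providers:
        provider.set_records(name, healthy)
    return healthy


providers = [InMemoryProvider("vendor-1"), InMemoryProvider("vendor-2")]
regional_lbs = {
    "us-east-1": "lb-east.example.com",
    "us-west-2": "lb-west.example.com",
}
remaining = steer_away_from(
    providers, "api.example.com", regional_lbs, failed_region="us-east-1"
)
```

Because the same call is applied to every provider, the failure of one DNS vendor does not block the regional failover.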

Open source all the things!
- Everything is on github.com
- Default to Apache 2.0 License

Asgard http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html

Ephemeral Instances
- Largest services are autoscaled
- Average lifetime of an instance is 36 hours
[Chart labels: Push, Autoscale Up, Autoscale Down]

Application Resilience
- Run what you wrote
- Rapid detection
- Rapid response

Antifragile API Patterns
Functional Reactive with Circuit Breakers and Bulkheads
https://speakerdeck.com/benjchristensen/rxjava-goto-aarhus-2013
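
A minimal sketch of the two patterns named on this slide, assuming a synchronous call wrapped with a fallback. It is not the RxJava or any NetflixOSS API; `GuardedCall` and its parameters are hypothetical. The bulkhead caps concurrent calls with a semaphore, and the circuit breaker stops calling the dependency after repeated failures until a cool-down has passed.

```python
import threading
import time


class GuardedCall:
    """Wrap `func` with a bulkhead and a simple circuit breaker."""

    def __init__(self, func, fallback, max_concurrent=10,
                 failure_threshold=3, reset_after=5.0, clock=time.monotonic):
        self._func = func
        self._fallback = fallback
        self._slots = threading.BoundedSemaphore(max_concurrent)  # bulkhead
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed
        self._lock = threading.Lock()

    def _is_open(self):
        with self._lock:
            if self._opened_at is None:
                return False
            # After the cool-down, let trial calls through again.
            return self._clock() - self._opened_at < self._reset_after

    def __call__(self, *args):
        if self._is_open():
            return self._fallback(*args)      # short-circuit
        if not self._slots.acquire(blocking=False):
            return self._fallback(*args)      # bulkhead is full
        try:
            result = self._func(*args)
        except Exception:
            with self._lock:
                self._failures += 1
                if self._failures >= self._threshold:
                    self._opened_at = self._clock()  # trip (or re-trip)
            return self._fallback(*args)
        else:
            with self._lock:
                self._failures = 0
                self._opened_at = None        # success closes the circuit
            return result
        finally:
            self._slots.release()
```

A caller would wrap each remote dependency separately, for example `GuardedCall(fetch_recommendations, fallback=lambda user: [])` with a hypothetical `fetch_recommendations`, so that one slow or failing dependency cannot exhaust the threads needed by the others.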

Cloud Security
- Fine grain security rather than perimeter
- Leveraging AWS scale to resist DDoS attacks
- Automated attack surface monitoring and testing
http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned

Vendor Driven Portability
Interest in using NetflixOSS for Enterprise Private Clouds
- It's done when it runs Asgard: functionally complete, demonstrated March, released June in V3.3
- IBM example application Acme Air: based on NetflixOSS running on AWS, ported to IBM Softlayer with Rightscale
- Vendor and end user interest: Openstack Heat getting there; Paypal C3 Console based on Asgard

Takeaway
Cloud provides development and deployment agility.
NetflixOSS provides highly available Cloud Native patterns.
http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/netflix
http://www.linkedin.com/in/stevanvlaovic
@NetflixOSS

Appendix

Cassandra at Scale
Benchmarking to Retire Risk

Benchmarking Global Cassandra
Write-intensive test of cross-region replication capacity
- 16 x hi1.4xlarge SSD nodes per zone = 96 total
- 192 TB of SSD in six locations, up and running Cassandra in 20 minutes
- Test load: 1 million writes at CL.ONE (wait for one replica to ack)
- Validation load: 1 million reads after 500ms at CL.ONE, with no data loss
- Regions: US-West-2 (Oregon) and US-East-1 (Virginia), each with Zone A, Zone B, and Zone C and Cassandra replicas in every zone
- Inter-zone traffic; inter-region traffic up to 9Gbits/s, 83ms
- 18TB backups from S3
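
The cluster sizing on this slide can be checked with a few lines of arithmetic. The inputs come from the slide; the per-node SSD figure is derived here rather than stated in the deck, and the variable names are only for illustration.

```python
# Figures taken from the slide.
nodes_per_zone = 16
zones_per_region = 3      # Zone A, Zone B, Zone C
regions = 2               # US-West-2 and US-East-1
total_ssd_tb = 192

# Derived values.
locations = zones_per_region * regions        # the "six locations"
total_nodes = nodes_per_zone * locations      # should match "96 total"
ssd_per_node_tb = total_ssd_tb / total_nodes  # implied SSD per instance

print(f"{locations} locations, {total_nodes} nodes, "
      f"{ssd_per_node_tb:.0f} TB of SSD per node")
```

The numbers are consistent: 16 nodes in each of six zones gives 96 nodes, and 192 TB spread over 96 nodes implies 2 TB of SSD per instance.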