Lessons from Giant-Scale Services and Load Balancing Switches
George Porter
CSE 124
January 27, 2015
Announcements
- Project 1 submission website: http://vm134.sysnet.ucsd.edu/ (this link will be put on the course webpage)
- Upload in .tar.gz format
- Instructions will be put on Piazza today
Giant-Scale Services
Challenges for network services:
- High availability: critical in today's environment, with $1000/sec of lost revenue during downtime
- Evolution
- Growth
This paper does not address:
- Service monitoring, configuration, QoS, security, logging, and log analysis
- Wide-area replicated services
- Write-intensive services
Benefits of Network Services
- Access anywhere, anytime
- Availability via multiple devices
- Groupware support: calendaring, teleconferencing, messaging, etc.
- Lower overall cost: multiplex the infrastructure over active users (dedicated resources are typically 98% idle)
- Centralized administrative burden: simplified service updates (update the service in one place, or on 100 million devices?)
Network Service Components
Clusters as the Building Block
- No alternative to clusters for building network services that can scale to global use
- Key question: what is the lowest-level building block of a cluster? A commodity Intel processor, or a higher-end SMP?
- Cluster benefits:
  - Incremental scalability: adding one machine typically improves performance linearly
  - Independent components
  - Cost and performance
Load Management
- Started with round-robin DNS in 1995: map a hostname to multiple IP addresses, and hand out a particular mapping to clients in round-robin fashion
iClicker: which of the following are drawbacks of this approach?
A: Does not hide failed or inactive servers
B: Cannot scale to millions of users
C: Exposes the structure of the underlying service
D: A & C
E: B & C
Load Management
- Started with round-robin DNS in 1995: map a hostname to multiple IP addresses, and hand out a particular mapping to clients in round-robin fashion (see the sketch below)
  - Does not hide failed or inactive servers
  - Exposes the structure of the underlying service
- Today, L4 and L7 switches can inspect TCP session state or HTTP session state (sticky cookies, etc.)
  - They map requests to back-end servers based on dynamically changing membership information
- Load balancing is still an important topic today (more about that later)
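A minimal sketch of round-robin DNS, in Python with hypothetical names (SERVER_IPS, resolve): the server rotates through its address list with no awareness of server health, which is exactly the first drawback above.

from itertools import cycle

# Hypothetical address pool for the hostname being served.
SERVER_IPS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
_rotation = cycle(SERVER_IPS)

def resolve(hostname: str) -> str:
    # Hand out the next address in rotation, ignoring server health:
    # a failed server keeps receiving its share of new clients until
    # the DNS records themselves are changed.
    return next(_rotation)

for _ in range(5):
    print(resolve("site1.com"))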
Service Replication
Service Partitioning
Case Study: Search
- Map keywords to the set of documents containing all of the words
- Optionally rank the document set in decreasing relevance (e.g., PageRank from Google)
- Need a web crawler to build an inverted index: a data structure that maps each keyword to the list of all documents containing that word
- Multi-word search: perform a join operation across the individual inverted indices (sketch below)
- Where to store the individual inverted indices? Too much storage to place all of them on each machine (especially if you also need to keep portions of the documents available as well)
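A minimal sketch of the data structure, with toy data and hypothetical names (docs, index, search): build an inverted index, then answer a multi-word query by joining (intersecting) the per-keyword document sets.

from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "quick brown dogs run",
}

# Inverted index: keyword -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def search(*keywords):
    # Join across the individual inverted indices: a document must
    # appear in every keyword's posting set to match.
    postings = [index.get(w, set()) for w in keywords]
    return set.intersection(*postings) if postings else set()

print(search("quick", "brown"))  # {1, 3}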
Case Study: Search
- Vertical partitioning:
  - Split the inverted index across multiple nodes (each node contains as much of the index as possible for a particular keyword)
  - Replicate the inverted indices across multiple nodes
  - OK if a certain portion of the document database is not reflected in a particular query result (this is even expected)
- Horizontal partitioning:
  - Each node contains a portion of the inverted index for all keywords (or a large fraction of them)
  - Have to visit every node in the system to perform a full join
(The sketch below contrasts the two placements.)
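A minimal sketch of the two placements, with hypothetical names (vertical_node, horizontal_node) and a hypothetical cluster size, continuing the index example above.

N = 4  # hypothetical number of nodes

def vertical_node(keyword: str) -> int:
    # Vertical: a keyword's full posting list lives on one node, so a
    # multi-word query only visits the nodes owning the queried keywords.
    return hash(keyword) % N

def horizontal_node(doc_id: int) -> int:
    # Horizontal: each node indexes a slice of the documents for all
    # keywords, so a full join must visit every node and merge results.
    return doc_id % N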
Replication versus Partitioning
- Replication: any replica can serve any request
  iClicker: A: Failure reduces system capacity but not data availability, or B: Failure reduces data availability but not system capacity?
- Partitioning:
  - Nodes are no longer identical, so certain requests must be sent to particular nodes
  - No need for coherence traffic
  - Failure reduces data availability and may reduce capacity
Availability Metrics
- Mean time between failures (MTBF); mean time to repair (MTTR)
- Availability = (MTBF - MTTR) / MTBF
- Example: MTBF = 10 minutes, MTTR = 1 minute, so A = (10 - 1) / 10 = 90% availability
- Can improve availability by increasing MTBF or by reducing MTTR
- Ideally, systems never fail, but it is much easier to test a reduction in MTTR than an improvement in MTBF
(Sketch of the formula below.)
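A minimal sketch of the slide's formula (function name hypothetical):

def availability(mtbf: float, mttr: float) -> float:
    # Fraction of time the service is up, with both times in the
    # same units (e.g., minutes).
    return (mtbf - mttr) / mtbf

print(availability(10, 1))  # the slide's example: 0.9, i.e., 90%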
Harvest and Yield
- Yield = queries completed / queries offered
  - In some sense more interesting than availability, because it focuses on client perceptions rather than server perceptions: a failure when no one is accessing the service costs no yield
- Harvest = data available / complete data
  - How much of the database is reflected in each query?
- Should faults affect yield, harvest, or both? (See the sketch below.)
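A minimal sketch (numbers hypothetical) of how a single node failure shows up in harvest for a 10-node service, depending on the design from the replication-versus-partitioning slide:

nodes, failed = 10, 1

# Replicated: surviving replicas hold all the data, so harvest stays
# 1.0; yield dips only if the remaining capacity saturates.
harvest_replicated = 1.0

# Partitioned: every query still completes (yield holds), but each
# query sees only the surviving fraction of the data.
harvest_partitioned = (nodes - failed) / nodes

print(harvest_replicated, harvest_partitioned)  # 1.0 0.9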
DQ Principle
- Data per query × queries per second → constant
- At high levels of utilization, can increase queries per second by reducing the amount of data behind each response
- Adding nodes or software optimizations changes the constant
(Sketch below.)
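A minimal sketch of the tradeoff (the DQ budget value is hypothetical): with a roughly constant DQ product, a node trades data per query against queries per second.

DQ = 1_000_000  # hypothetical data-units/sec a node can sustain

def max_qps(data_per_query: float) -> float:
    # With D * Q ~ constant, shrinking D buys proportionally more Q.
    return DQ / data_per_query

print(max_qps(1000))  # full harvest:  1,000 queries/sec
print(max_qps(500))   # half the data: 2,000 queries/sec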
Graceful Degradation
- The peak-to-average load ratio for giant-scale systems varies from 1.6:1 to 6:1
- Single-event bursts can mean a 1 to 3 order-of-magnitude increase in load
- Power failures and natural disasters are not independent failures, and they can severely reduce capacity
- Under heavy load, can limit capacity (queries/sec) to maintain harvest, or sacrifice harvest to improve capacity
Graceful Degradation
- Cost-based admission control: the search engine denies expensive queries (in terms of D); rejecting one expensive query may allow multiple cheaper ones to complete (sketch below)
- Priority-based admission control: stock-trade requests are given higher priority than, e.g., stock-quote requests
- Reduced data freshness: reduce the required data movement under load by allowing certain data to become out of date (again, stock quotes, or perhaps book inventory)
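A minimal sketch of cost-based admission control (all names and thresholds hypothetical): at low load admit everything; under overload, shed queries whose estimated D cost would crowd out many cheap ones.

def admit(estimated_cost: float, load: float,
          busy_threshold: float = 0.8,
          max_cost_when_busy: float = 100.0) -> bool:
    # Admit everything while there is headroom.
    if load < busy_threshold:
        return True
    # Under overload, reject queries that would consume too much D.
    return estimated_cost <= max_cost_when_busy

print(admit(estimated_cost=500.0, load=0.95))  # False: shed it
print(admit(estimated_cost=20.0, load=0.95))   # True: cheap enough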
Online Evolution and Growth
- Internet services undergo rapid development, with frequent releases of new products and features
- Rapid release means that software is released in an unstable state, with known bugs
- Goal: acceptable MTBF, low MTTR, no cascading failures
- Beneficial to have a staging area so that both the new and the old system can coexist on a node simultaneously
  - Otherwise, you have to transfer the new software after taking down the old software → increased MTTR
  - A staging area also makes it easier to switch back to the old version in case of trouble
Online Evolution and Growth
- Fast reboot: simultaneously reboot all machines to the new version
  - Simple, but guaranteed downtime
- Rolling upgrade: upgrade one node at a time, in a wave moving across the cluster (sketch below)
  - Old and new versions must be compatible, because they will coexist (hard in practice)
- Big flip: update one half of the system at a time
  - Remove one half from the view of the load balancing switch
  - Wait for existing connections to complete
  - Upgrade this half with the new software
  - Atomically flip the load balancing switch over to the upgraded half
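A minimal sketch of the rolling-upgrade wave, in Python with hypothetical helper names (the drain/install steps are stubs): capacity only ever drops by one node at a time, but old and new versions coexist during the wave.

import time

NODES = ["node1", "node2", "node3", "node4"]

def rolling_upgrade(nodes, version):
    for node in nodes:
        remove_from_load_balancer(node)  # stop new requests to the node
        drain_connections(node)          # wait for in-flight work to finish
        install(node, version)           # new version must interoperate with old
        add_to_load_balancer(node)       # put the node back in rotation

# Stubs so the sketch runs standalone.
def remove_from_load_balancer(n): print(f"draining {n}")
def drain_connections(n): time.sleep(0.01)
def install(n, v): print(f"{n} -> {v}")
def add_to_load_balancer(n): print(f"{n} back in rotation")

rolling_upgrade(NODES, "v2")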
Load Balancing
Load Balancing
We're going to talk about two strategies:
1. Load balancing switches
2. Network-wide load balancing (specifically within data centers)
Load Balancing Switches
[Diagram: WAN → router → firewall → load balancing switch (integrated?) → Ethernet LAN switches]
Load Balancing Switches
[Diagram: load balancing switch in front of an HTTP front-end tier, an application-server tier, and a storage tier (fileserver/database), connected by Ethernet LAN switches]
Load Balancing Switches
[Diagram: one load balancing switch, behind the HTTP front end, fronting Virtual Cluster 1 (site1.com) and Virtual Cluster 2 (site2.com)]
The load balancing switch maps external virtual IP addresses (VIPs) to (often dynamically changing) physical IP addresses (DIPs) in the internal network. (Sketch below.)
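A minimal sketch of the VIP-to-DIP mapping (all addresses hypothetical): each virtual cluster's VIP fronts its own DIP pool, and hashing the client address keeps a client on one back end while the pool membership changes. (Python's built-in string hash is randomized per process; a real switch would use a stable hash.)

VIP_TABLE = {
    "203.0.113.10": ["10.0.1.1", "10.0.1.2"],              # site1.com
    "203.0.113.20": ["10.0.2.1", "10.0.2.2", "10.0.2.3"],  # site2.com
}

def pick_dip(vip: str, client_ip: str) -> str:
    # Hash the client onto the current DIP pool for this VIP.
    pool = VIP_TABLE[vip]
    return pool[hash(client_ip) % len(pool)]

print(pick_dip("203.0.113.10", "198.51.100.7"))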
F5 Big-IP
Big-IP: Cookie Encryption
when HTTP_REQUEST {
    # Decrypt the cookie on its way to the server.
    HTTP::cookie decrypt "cookie_name" "password-key"
}

when HTTP_RESPONSE {
    # Encrypt the cookie on its way to the client.
    HTTP::cookie encrypt "cookie_name" "password-key"
}
Big-IP: Sanitize HTTP Headers from All Apps
when HTTP_RESPONSE {
    # Remove all response headers except the given ones.
    HTTP::header sanitize "ETag" "Content-Type" "Connection"
}
Big-IP: Remove SSNs from Responses
when HTTP_RESPONSE_DATA {
    # Assumes an HTTP_RESPONSE event has called HTTP::collect so that
    # this event fires with the response payload buffered.
    set payload [HTTP::payload [HTTP::payload length]]
    set ssnx "xxx-xx-xxxx"
    # Find SSN-shaped numbers and mask them; regsub returns the
    # number of substitutions made.
    if {[regsub -all {\d{3}-\d{2}-\d{4}} $payload $ssnx new_response] > 0} {
        # Replace the content if there was a match.
        HTTP::payload replace 0 [HTTP::payload length] $new_response
    }
}