Chapter 7: Distributed Systems: Warehouse-Scale Computing. Fall 2011 Jussi Kangasharju





Chapter Outline
- Warehouse-scale computing overview
- Workloads and software infrastructure
- Failures and repairs
Note:
- Term "warehouse-scale computing" originates from Google
- Examples typically of Google's services
- Trend towards WSC is more general
- This chapter based on book Barroso, Hölzle: The Datacenter as a Computer (see course website)

What is Warehouse-Scale Computing (WSC)?
- Essentially: modern Internet services
- Massive scale of:
  - Software infrastructure
  - Data repositories
  - Hardware platform
- Program is a service
  - Consists of tens of interacting programs
  - Different teams, organizations, etc.

WSC vs. Data Centers
- Both look very similar from the outside
  - Lots of computers in one building
- Key difference:
  - Data centers host services for multiple providers
    - Little commonality between hardware and software
    - Third-party software solutions
  - WSC is run by a single organization
    - Homogeneous hardware, software, and management
    - In-house middleware

Cost Efficiency
- Cost efficiency extremely important
- Growth driven by 3 main factors:
  - Popularity increases load
  - Size of problem increases (e.g., indexing of Web)
  - Highly competitive market
- Need bigger and bigger systems
  - Cost efficiency!

Future of Distributed Computing?
- WSC is not just a collection of servers
  - New and rapidly evolving workloads
  - Too big to simulate
  - New design techniques
  - Fault behavior
  - Energy efficiency
  - New programming paradigms
- Design spectrum:
  - One computer
  - Multiple computers
  - Data center
  - WSC = multiple data centers operating together
- Modern CDN: server = WSC data center

Architectural Overview
- Storage
- Networking
- Storage hierarchy
  - Latency, bandwidth, capacity
- Power usage
- Handling failures

General architecture
- Servers, e.g., 1-U servers
- Racks
- Interconnected racks

Storage
- Tradeoff: NAS vs. local disks as distributed filesystem?
- NAS:
  - Easier to deploy, puts responsibility on vendor
- Collection of disks:
  - Must implement own filesystem abstraction (e.g., GFS)
  - Lower hardware costs (desktop vs. enterprise disks)
  - Reliability issues and replication?
  - More network traffic due to writes

Network
- 48-port 1 Gbps Ethernet switches are cheap
  - Good bandwidth within one rack
- Problem: cluster-level bandwidth?
  - Bigger and faster switches prohibitively expensive?
- Hierarchical network organization:
  - Good bandwidth within rack
  - Less bandwidth within cluster
- Programmer must keep this in mind! (transparency?)
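The gap between rack-level and cluster-level bandwidth can be made concrete with an oversubscription ratio. The sketch below is only an illustration: the split of a 48-port switch into 40 server-facing ports and 8 uplinks is an assumption, not a figure from the slides, and the function names are hypothetical.

```python
def oversubscription_ratio(server_ports: int, uplink_ports: int,
                           port_gbps: float = 1.0,
                           uplink_gbps: float = 1.0) -> float:
    """Ratio of bandwidth offered to servers in a rack versus
    bandwidth available from the rack switch to the cluster."""
    return (server_ports * port_gbps) / (uplink_ports * uplink_gbps)


def per_server_cluster_bandwidth_gbps(server_ports: int, uplink_ports: int,
                                      port_gbps: float = 1.0,
                                      uplink_gbps: float = 1.0) -> float:
    """Bandwidth each server gets towards other racks if all servers
    in the rack send out of the rack at the same time."""
    ratio = oversubscription_ratio(server_ports, uplink_ports,
                                   port_gbps, uplink_gbps)
    return port_gbps / ratio


# Assumed split of a 48-port 1 Gbps switch: 40 ports to servers, 8 uplinks.
print(oversubscription_ratio(40, 8))             # 5.0
print(per_server_cluster_bandwidth_gbps(40, 8))  # 0.2
```

Under these assumed numbers a server sees the full 1 Gbps inside its rack but only a fraction of that towards the rest of the cluster, which is why the programmer has to keep data placement in mind.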

Storage Hierarchy
- Server: N processors, X cores/CPU, local cache, DRAM, disks
  - Fast, but limited capacity
- Rack: individual servers, combined view
  - A bit slower, but more capacity
- Cluster: view over all racks
  - Slower, but more capacity
- Tradeoff: bandwidth, latency, capacity

Power Usage
- No single culprit on server level
  - CPU 33%
  - DRAM 30%
  - Disk 10%
  - Network 5%
  - Other 22%
- Further optimization targets on cluster/WSC level
  - Cooling of data center

Handling Failures
- At this scale, things will break often
- Application must handle them
- More details later

Workloads and Software Infrastructure
- Different levels of abstraction
- Platform-level software
  - Firmware, kernel, individual OS
- Cluster-level infrastructure software
  - Distributed software for managing resources and services
  - OS for a datacenter
  - Distributed FS, RPC, MapReduce, ...
- Application-level software
  - Actual application, e.g., Gmail, Google Maps

Datacenter vs. Desktop
- Differences in developing software
- Datacenter:
  - Parallelism (both data and requests)
  - Workload changes
  - Homogeneous platform
  - Hiding failures

Basic Techniques Technique Reliability Availability Replication Yes Yes Partitioning Yes Yes Load balancing Timers Integrity checks App.-specific Compression Yes Yes Eventual consistency Yes Yes Yes Yes

Cluster-Level Infrastructure Software
- Resource management
  - Mapping of tasks to resources
- Hardware abstraction and basic services
  - Distributed storage, message passing, ...
- Deployment and maintenance
  - Software distribution, configuration, ...
- Programming frameworks
  - Hide some of the above from programmer
  - Examples: MapReduce, BigTable, Dynamo

MapReduce
- Google's framework for processing large data sets on clusters
- Name from map and reduce (functional programming)
  - Not really much in common with real map and reduce
- One master, multiple (levels of) slaves
- Map:
  - Master partitions input, distributes it to slaves
  - Slaves may do the same
- Reduce:
  - Slave sends its result to its master
  - Eventually root master will get result
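The partition / compute / combine flow described on this slide can be sketched as a word count. This is a single-process illustration under simplifying assumptions, not Google's implementation: all function names are hypothetical, the workers run sequentially rather than on separate machines, and there is only one level of workers.

```python
from collections import Counter
from functools import reduce


def partition(lines, num_workers):
    """Master step: split the input into one chunk per worker."""
    return [lines[i::num_workers] for i in range(num_workers)]


def map_task(lines):
    """Worker step: count the words in this worker's chunk."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_task(left, right):
    """Combine two partial results into one."""
    return left + right


def run_job(lines, num_workers=3):
    chunks = partition(lines, num_workers)
    # In a real cluster these map tasks would run in parallel on workers.
    partials = [map_task(chunk) for chunk in chunks]
    # Partial results flow back to the master, which merges them.
    return reduce(reduce_task, partials, Counter())


lines = ["the quick brown fox", "the lazy dog", "the fox"]
result = run_job(lines)
print(result["the"], result["fox"])  # 3 2
```

Because each chunk is processed independently, a failed worker only requires re-running its own chunk, which is one reason this style of framework suits clusters where failures are common.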

Application-Level Software
- What is the application?
  - First was search, then many others have appeared
- Datacenter must support general-purpose computing
  - Too expensive to tailor datacenters for applications
  - Changing workloads
  - Faster to adapt software
- Two application examples:
  - Search
  - Similar scientific articles (see book for description)

Search
- Inverted index
  - Set of documents matching a keyword
  - Size of index similar to original data
- Consider query "new york restaurant"
  - Must search each of three terms
  - Find documents matching every term
  - Sorting (PageRank + other criteria)
  - Result
- Latency must be low (user waiting)
- Throughput must be high (many users)
- Read-only index
  - Easily parallelizable
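The lookup steps for a multi-term query can be sketched with a toy inverted index. The documents, identifiers, and function names below are made up for illustration, and the ranking step (PageRank and other criteria) is left out; the sketch only covers finding the documents that match every term.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    # Intersect starting from the shortest posting list to keep
    # the intermediate result small.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


# Hypothetical toy corpus.
docs = {
    1: "new york restaurant guide",
    2: "new restaurant in boston",
    3: "york minster history",
    4: "best restaurant in new york",
}
index = build_inverted_index(docs)
print(sorted(search(index, "new york restaurant")))  # [1, 4]
```

Since queries only read the index, it can be split by document across many servers, with each server answering for its own shard and the results merged afterwards.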

Monitoring Infrastructure
- Service-level dashboards
  - Real-time monitoring of a few key indicators (latency, throughput)
  - Can extend to some more indicators
- Performance debugging tools
  - Dashboards only show the problem, but give no answer to why
  - No need for real-time (compare CPU profilers)
  - Blackbox monitoring vs. instrumentation approach
- Platform-level monitoring
  - Everything above is needed, but not sufficient
  - Need a higher-level view (see book for details)

Buy vs. Build?
- Buy:
  - Typical solution
- Build:
  - Google's (and others') approach
  - Original reason: no third-party solutions available
  - More software development and maintenance work
  - Improved flexibility
  - In-house software can take shortcuts
    - Not implement every feature

Failures
- Traditional software not good with failures
  - Result: make hardware more reliable
- WSC is different because of scale
  - 30-year MTBF ≈ 10,000 days MTBF
  - WSC with 10,000 servers = about 1 failure per day
- Software must handle failures
  - Application or middleware
  - Middleware makes applications simpler
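The failure-rate arithmetic can be checked with a few lines of code. This is a back-of-the-envelope sketch that assumes servers fail independently at a constant rate of one failure per MTBF; the function name is hypothetical.

```python
DAYS_PER_YEAR = 365


def expected_failures_per_day(num_servers: int, mtbf_days: float) -> float:
    """Expected number of server failures per day, assuming independent
    failures at a constant rate of 1/MTBF per server."""
    return num_servers / mtbf_days


# 30 years is 10,950 days, which rounds to roughly 10,000 days.
mtbf_days = 30 * DAYS_PER_YEAR
print(mtbf_days)                                      # 10950
print(expected_failures_per_day(10_000, 10_000))      # 1.0
print(round(expected_failures_per_day(10_000, mtbf_days), 2))  # 0.91
```

Even with very reliable individual servers, a cluster of this size therefore sees a failure roughly every day, so failure handling has to be part of normal operation.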

Positive Side Effect
- Failures are a fact of life
  - Can buy cheaper hardware
- Upgrades are simpler
  - Upgrade, kill, reboot
  - Same for hardware upgrades
- Failure is an option
  - Can allow servers to fail, makes life simpler

Caveats
- Cannot ignore reliability completely
- Hardware must be able to detect errors and failures
  - No need to recover, but can include
- Not detecting hardware errors is risky
  - See book for example
  - Every piece of software would need to handle everything

Categorizing Faults
- Corrupted
  - Data lost or corrupted
  - Can data be regenerated or not?
- Unreachable
  - Service unreachable by users
  - User network reliability?
- Degraded
  - Service available, but degraded
  - What can still be done?
- Masked
  - Fault occurs, but is masked

Sources of Faults
- Hardware not the common culprit (~10%)
- Software and configurations are bigger problems
- Exact numbers depend on study
- Hardware problem = single computer
- Software/configuration problem = many computers simultaneously

Causes of Crashes
- Anecdotal evidence points to software
- Hardware: memory or disk
  - DRAM errors happen, but can be helped with ECC
  - Some errors still persist
- Real crash rate higher than studies predict
  - Again points to software
- Predicting problems in WSC not useful
  - Need to handle failures anyway
  - Could be useful in other systems

Repairs
- When something breaks, it must be repaired
- Two important characteristics of WSC
  - No need to repair immediately
    - Optimize time of repair technician
  - Collect a lot of health data from a large number of servers
    - Use machine learning to optimize actions

Summary: Key Challenges
- Rapidly changing workloads
- Building balanced systems from imbalanced components
- Energy use
- Amdahl's Law

Chapter Summary
- Warehouse-scale computing overview
- Workloads and software infrastructure
- Failures and repairs