Data Center Construction (and management)
Johan Tordsson, Department of Computing Science

Last time
1. Common (Web) application architectures
- N-tier applications
- Load balancers
- Application servers
- Databases
2. Cloud application guidelines
- Scalability
- Fault tolerance
- Some best practices
- A few Amazon examples

Today
Data centers: how to build (and operate) them
- Servers
- Network
- Storage
- Power
- Cooling
- Energy efficiency
- ...and a building to keep everything in
Conceptual overview only; the details are only relevant for those who actually build or operate data centers.

Data Center as a Computer
The majority of cloud computing infrastructure consists of reliable services delivered through data centers.
Traditional co-location data centers:
- Multiple servers and communications gear collocated due to common environmental and security needs
- Host a large number of relatively small or medium-sized applications, each running on a dedicated hardware infrastructure
Data centers for cloud computing platforms:
- Belong to a single organization
- Use a relatively homogeneous hardware and system software platform
- Common system management layer
- Run a smaller number of very large applications
Cloud computing workloads must be designed to gracefully tolerate large numbers of component faults with little or no impact on service-level performance and availability.
Warehouse Scale Computers (WSC)
Not just a collection of servers:
- 100s to 1000s of coordinated servers
- Typically runs on a virtualized platform
- Fault behavior and energy considerations have significant impact
- Needs to be considered as a single unit
Must be highly manageable:
- Deployment of software updates
- Monitoring and system management
Affordability:
- Currently power public clouds such as Google, Amazon, Yahoo, Microsoft, etc.
- Soon to be affordable by enterprises
- A rack of servers can easily have > 600 cores

What's different about WSCs?
"As computation continues to move into the cloud, the computing platform of interest no longer resembles a pizza box or a refrigerator, but a warehouse full of computers. These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC)."

[Figure: a Google warehouse-style computer data center]

The New Data Center Industry
Data centers replace servers:
- Container computers for high efficiency and environmental conservation (packaging, PUE, ...)
- Bundled software for integrated service, high scalability, and availability
Large enterprises will bypass traditional server channels (IBM, HP, Dell, ...):
- Purchase of entire data centers directly from manufacturers
- Significant cost reductions
- Horizontal scalability
- High availability
Google already purchases directly from Taiwan:
- Google is the 4th largest server manufacturer, but does not sell servers
Facebook's opencompute.org project:
- Open specifications for data center design
Container Computers: Data Center Architecture
Treat the entire data center as a computer:
- Air flow analysis
- Cooling architecture (thermal management)
- Power/energy management
- Focus on ease of system and network management
- What cannot be managed/monitored does not get deployed
Modular and scalable:
- Card to rack
- Rack to container
- Container to warehouse
Explore low-power, commodity CPUs as a building block

Data center server hardware
- Standard servers
- Standard networks
- Standard storage
- But at a very large scale
Comparison: parallel computer
- Custom high-performance hardware (?)
- Fast interconnection networks

Design Motivation
- Multicore CPUs in mid-range servers typically carry a price/performance benefit: 2-5 times cheaper than top-of-the-line systems
- Many services are memory-bound: faster CPUs do not scale well for large services
- Applications are larger-than-server anyway
- Slower CPUs are more power efficient: CPU power decreases by O(k^2) when CPU frequency decreases by a factor k (see the sketch below)
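To make the last bullet concrete, here is a minimal sketch under the slide's quadratic scaling assumption: k servers running at 1/k of full frequency deliver the same aggregate throughput as one full-speed server, but at 1/k of the total CPU power. The 100 W base figure is an illustrative assumption, and static/idle power is ignored.

```python
# Sketch of the O(k^2) power-frequency claim above. The 100 W base figure is
# an assumed value for illustration; static/idle power is ignored.

def cpu_power(base_power_w: float, freq_fraction: float) -> float:
    """CPU power under the slide's quadratic scaling assumption."""
    return base_power_w * freq_fraction ** 2

BASE_POWER_W = 100.0  # assumed dynamic power of one CPU at full frequency

for k in (1, 2, 4):
    # k servers at 1/k frequency: same aggregate throughput as one fast server
    total_w = k * cpu_power(BASE_POWER_W, 1.0 / k)
    print(f"k={k}: {k} server(s) at 1/{k} speed -> {total_w:.0f} W total")
# Prints 100, 50, 25 W: the slower, wider configuration wins on CPU power,
# at the cost of more machines (memory, disks, fans, floor space).
```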
[Figure: cost comparison example]

Server and network overview
- High-latency, low-price network: Gigabit Ethernet, hierarchy of commodity switches
- Storage: capacity increases with distance from the server, while latency grows and bandwidth decreases (illustrated in the sketch after this page)

Data center management tools
[Figure: management tool stack, comprising a Physical Deployment Tool and a Cloud Application Management Tool on top of Virtual Provisioning, Physical Compute Servers, Distributed Main/Secondary Storage, Network, Network/System Management, Security, Power Management, Virtual Machine Management, and Intra-Virtual Load Balancing]
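To illustrate the storage trade-off above, here is a small sketch of the WSC storage hierarchy in the spirit of Barroso & Hölzle: capacity grows with distance from the CPU while latency and bandwidth get worse. All numbers are assumed order-of-magnitude values for illustration, not figures from the lecture.

```python
# Illustrative WSC storage hierarchy: the further from the CPU, the more
# capacity is reachable, but latency rises and bandwidth drops. All values
# are assumed order-of-magnitude numbers, not measurements.

hierarchy = [
    # (level, accessible capacity, latency in ms, bandwidth in MB/s)
    ("local DRAM",              "16 GB",  0.0001, 20000),
    ("local disk",              "2 TB",   10.0,   200),
    ("rack (via ToR switch)",   "100 TB", 11.0,   100),
    ("datacenter (via fabric)", "10 PB",  12.0,   10),
]

print(f"{'level':<26}{'capacity':>10}{'latency (ms)':>14}{'MB/s':>8}")
for level, capacity, latency_ms, bandwidth in hierarchy:
    print(f"{level:<26}{capacity:>10}{latency_ms:>14}{bandwidth:>8}")
```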
Management (cont.)
Virtualization platform (virtualize everything):
- CPUs
- Storage (filesystems)
- Network
Resource management:
- Provisioning of virtual clusters
- Physical machine load balancing
- Network traffic load balancing
- Power management
Security:
- Hypervisor protection
- Isolation between clusters
System management and high availability:
- Physical component failure should not interrupt availability of virtual resources
Cloud application management:
- Unless a resource can be remotely managed, it should not be part of the data center

Virtualization Platform
- Leverage existing hypervisors
- Allocation of virtual machine instances
- Monitor VM performance
- Virtual storage provisioning
- Intra-virtual load balancing
- Scalable data center network
- Isolation between virtual clusters
- Virtual machine migration
[Figure: virtual compute nodes (Mail, Bkup, HC, AppXYZ) mapped onto physical nodes, with service nodes, data nodes, storage servers, system service daemons, and Cloud OS agents]

Virtual Machine Management
Objective: power management and physical machine load balancing
- Monitor runtime VM statistics
- Heuristic calculation to predict workloads
- Determine power down/up of machines
- Multi-dimensional bin packing (knapsack) over CPU, network, and disk (a sketch of this heuristic follows this page)
- VM migration algorithm
- Physical machine load balancing: migration of VMs to other physical machines

Power
- To run servers
- To run the data center: cooling, power distribution, etc.
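The bin-packing step above can be sketched as a greedy first-fit-decreasing heuristic over (CPU, network, disk) demand vectors. This is a minimal illustration of the idea, not the lecture's actual algorithm; the demand values are made up.

```python
# First-fit-decreasing multi-dimensional bin packing for VM placement:
# pack VMs onto as few physical machines as possible so that idle machines
# can be powered down. A minimal sketch, not the lecture's actual algorithm.

from typing import List, Tuple

Vector = Tuple[float, float, float]  # (cpu, network, disk) as capacity fractions

def fits(load: Vector, vm: Vector) -> bool:
    """True if adding the VM keeps every dimension within machine capacity."""
    return all(l + d <= 1.0 for l, d in zip(load, vm))

def first_fit_decreasing(vms: List[Vector]) -> List[List[Vector]]:
    machines: List[List[Vector]] = []  # each machine is the list of VMs on it
    loads: List[Vector] = []
    # Consider VMs in order of their largest resource demand, biggest first.
    for vm in sorted(vms, key=max, reverse=True):
        for i, load in enumerate(loads):
            if fits(load, vm):
                machines[i].append(vm)
                loads[i] = tuple(l + d for l, d in zip(load, vm))
                break
        else:  # no existing machine fits: power up a new one
            machines.append([vm])
            loads.append(vm)
    return machines

demands = [(0.5, 0.2, 0.1), (0.4, 0.4, 0.3), (0.3, 0.1, 0.6), (0.2, 0.5, 0.2)]
placement = first_fit_decreasing(demands)
print(f"{len(demands)} VMs packed onto {len(placement)} machines: {placement}")
```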
Power (cont.)
Uninterruptible Power Supply (UPS):
- Detects power failure
- Batteries (for short-term outages + the switch-over)
- (Diesel) generator (for long-term outages)
Power Distribution Unit (PDU):
- Fancy socket with power distribution and/or control
[Figure: power usage breakdown]

Power (cont.)
Data centers are major power users:
- Common claim: ~4% of world electricity use
- Example: Facebook in Luleå (120 MW): ~1 BSEK (1 000 000 000 SEK) / year at list prices (see the estimate after this page)
- Exponential growth of data center capacity and cheaper server hardware: power costs (will) dominate. Exponential power use?
[Figure: cost breakdown examples]

Cooling
Keep heat-generating servers cool:
- Computer Room Air Conditioners (CRAC): like a room air conditioner, but for server rooms
- Very complex to model and design: airflow is 3D and non-linear

Cooling (cont.)
- Cooling by water: water cooling very close to the servers
- Cooling by sea water: effectively infinite availability of cool water (example: Google Finland)
- Cooling by location: a cold climate reduces the need for cooling
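A back-of-the-envelope check of the Luleå figure above. The slide only gives the total, so the ~1 SEK/kWh electricity list price used here is an assumption for illustration.

```python
# Back-of-the-envelope check of the ~1 BSEK/year Luleå figure above.
# The 1 SEK/kWh list price is an assumed value, not from the slide.

FACILITY_POWER_MW = 120
PRICE_SEK_PER_KWH = 1.0  # assumed electricity list price
HOURS_PER_YEAR = 24 * 365

annual_kwh = FACILITY_POWER_MW * 1_000 * HOURS_PER_YEAR
annual_cost_sek = annual_kwh * PRICE_SEK_PER_KWH
print(f"{annual_kwh:,.0f} kWh/year -> {annual_cost_sek:,.0f} SEK/year")
# ~1.05e9 SEK/year, i.e. roughly the ~1 BSEK/year quoted on the slide.
```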
Energy efficiency
Not all power is used by servers.
Power Usage Effectiveness (PUE): total power used / power used for computing
- Typical: 2.0
- State-of-the-art: ~1.2
- Quite a few variants of the definition; many crafted to make data centers look good
- Others look at the power source: carbon vs. solar vs. ...
(A worked PUE example follows this page.)

Energy efficiency (cont.)
- Non-linear server power usage: the performance/power ratio changes with load
- High server utilization is beneficial, but not common by default
[Figure: utilization of 5,000 Google servers over 6 months]

Energy efficiency (cont.)
Consolidate workloads:
- Power servers off, or slow servers down: Dynamic Voltage and Frequency Scaling (DVFS)
- Very hard to assess the impact for bursty (rapidly changing) workloads: oscillations and unwanted correlations (more next time)
Consolidation requires software support:
- Must be able to start/stop instances and autoscale
- Stateless services preferable
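A minimal worked example of the PUE definition above; the kW figures are illustrative assumptions, not measurements from the lecture.

```python
# PUE = total facility power / power delivered to the computing equipment.
# The kW inputs below are assumed values for illustration.

def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    return total_facility_kw / it_equipment_kw

servers_kw = 1_000                 # assumed IT load
cooling_and_distribution_kw = 400  # assumed overhead (chillers, UPS losses, ...)

print(f"PUE = {pue(servers_kw + cooling_and_distribution_kw, servers_kw):.2f}")
# 1.40 here; a 'typical' facility at PUE 2.0 burns as much on overhead as on
# computing, while a state-of-the-art ~1.2 facility wastes only ~20% extra.
```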
Costs for a Data Center
- How much performance is required? How many and how fast servers, disks, networks, etc.?
- The size of a data center is given in Watts: how much power is needed? PUE? How much cooling? Price of electricity?
- What additional physical equipment is needed? Redundancy of power and cooling
- Where to place it, given the above? Costs vs. location of users; some locations are very attractive for hosting data centers

Cloud computing = cost cuts?
Amazon EC2 examples (Small VM, 3 years of full use, the estimated server lifetime):
- On-demand, per hour: $0.08 * (24*365*3) -> ~$2100 (!)
- Reserved: $300 + $0.013 * (24*365*3) -> ~$640
Rough estimate of costs for Amazon (following "The Datacenter as a Computer"):
- Assume the server is 25% of total cost (TCO)
- Standard $2k (list price) 1U server today: 32 cores + memory, disk, etc.; total cost $8k
- Estimate: can deliver 64 Small VMs
- Revenue: $2100 * 64 -> ~17 times the server cost!
- Amazon does not pay list prices: a 90% discount is rumoured
- With 24/7 use, hourly prices are very high
(This arithmetic is reproduced in the sketch after the Conclusions slide.)

Cloud cost life cycle
1. Develop the service: run in-house for testing and very early use
2. Move to cloud hosting: to handle a large scale-up of the user base
3. Build an own data center to cut hosting costs, once the size of the service is roughly known
Unless there are major price cuts by IaaS providers, this will happen for more and more SaaS providers as server and data center costs drop (example: Zynga).

Conclusions
- Data centers at warehouse scale: more than just a group of servers; a holistic management perspective is needed
- Standard solutions are superior: off-the-shelf servers, networks, disks, etc., with redundancy, scalability, etc. in the software layer
- Balanced design cuts costs and increases efficiency
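The sketch below reproduces the EC2 cost arithmetic from the "Cloud computing = cost cuts?" slide above, using the slide's own prices and estimates.

```python
# EC2 cost arithmetic from the slide above, using the slide's prices
# (circa the lecture's date; current prices differ).

HOURS_3_YEARS = 24 * 365 * 3

on_demand = 0.08 * HOURS_3_YEARS        # ~$2,102 per Small VM over 3 years
reserved = 300 + 0.013 * HOURS_3_YEARS  # ~$642 per Small VM over 3 years

SERVER_LIST_PRICE = 2_000       # standard 1U server, 32 cores
TCO = SERVER_LIST_PRICE / 0.25  # server assumed to be 25% of total cost -> $8k
VMS_PER_SERVER = 64             # the slide's estimate

revenue = on_demand * VMS_PER_SERVER
print(f"on-demand 3y: ${on_demand:,.0f}, reserved 3y: ${reserved:,.0f}")
print(f"revenue / TCO: {revenue / TCO:.1f}x")  # ~16.8, the slide's ~17x
```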
Suggested reading
"The Datacenter as a Computer", Barroso & Hölzle (Google)
- Read (somewhat) carefully: Chapter 1 and Chapters 3-5; focus on principles, ignore the numbers (the examples are a few years old...)
- Skim: Chapter 2; it overlaps texts from the last lecture and the data management lecture

Next time
Thursday: Data center #2: Autonomic management
- Data centers are large; cloud services are complex
- How to make these configure themselves? Optimize themselves? Heal themselves?

Delay project demos???
- From Thursday (31 May) to Monday (4 June)?
- 3 hours, 13-16: 1h review + evaluation (me), 2h presentation + demo (you)?