A Holistic Model of the Performance and the Energy-Efficiency of Hypervisors in an HPC Environment
Mateusz Guzek, Sebastien Varrette, Valentin Plugaru, Johnatan E. Pecero and Pascal Bouvry
SnT & CSC, University of Luxembourg, Luxembourg
Summary
1. Introduction, Context & Motivations
2. Modeling
3. Experimental Setup & Experiments Performed
4. Results
5. Conclusion
Introduction, Context & Motivations
HPC at the Heart of our Daily Life
Today: research, industry, local authorities...
Tomorrow: applied research, digital health, nano/bio technologies
Cloud Computing in an HPC context
Horizontal scalability: perfect for replication / HA (High Availability)
- best suited for runs with minimal communication and I/O
- nearly useless for true parallel/distributed HPC runs
Cloud data storage
- data locality enforced for performance
- data outsourcing vs. legal obligation to keep data local
- accessibility and security challenges
Cost effectiveness
- chaos+gaia usage: 11,154,125 CPU-hours (1273 years) since 2007
- $15.06M on EC2 cc2.8xlarge vs. €4M cumulative HW investment
Virtualization layer impact on performance?
- most probably decreased performance
- huge overhead induced on I/O + no support for IB (QDR/FDR/EDR)
Objectives of this study
Better than assumptions / a priori claims: concrete models and experiments.
Evaluate the impact of the underlying hypervisor
- at the heart of any cloud middleware so far
- analysis of the most widespread virtualization frameworks
- propose a lightweight, high-level model of a virtualized machine
Evaluate a real HPC platform (or anything as close as possible)
- concrete deployment on top of the Grid'5000 platform
- select benchmarking tools that reflect an HPC usage
Abstract from the specifics of a single processor architecture
- evaluate Intel vs. AMD (thanks Georges ;))
Virtualization Frameworks
Enable finer-grained resource provisioning and provide new functionalities (e.g. migration, suspension). The study includes the most commonly used hypervisors (a quick hardware-support check is sketched below):

                    Xen 4.0            KVM 0.12      ESXi 5.1
Host architecture   x86, x86-64, ARM   x86, x86-64   x86-64
VT-x/AMD-V          Yes                Yes           Yes
Max. guest CPUs     128                64            32
Max. host memory    1TB                -             2TB
Max. guest memory   1TB                -             1TB
3D acceleration     Yes (HVM guests)   No            Yes
License             GPL                GPL/LGPL      Proprietary

Deployment on the same Debian instance on Grid'5000.
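The VT-x/AMD-V row of the table can be verified on a Linux host before deploying any hypervisor. A minimal sketch (the helper name is ours): the "vmx" flag in /proc/cpuinfo indicates Intel VT-x, "svm" indicates AMD-V.

```python
# Minimal sketch (assuming a Linux host): detect the hardware
# virtualization extensions listed in the table via /proc/cpuinfo.
def virtualization_support(cpuinfo_path="/proc/cpuinfo"):
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                if "vmx" in flags:
                    return "Intel VT-x"
                if "svm" in flags:
                    return "AMD-V"
    return None

print(virtualization_support() or "no HW virtualization extensions found")
```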
HPC benchmarks
Selected to represent various use cases of HPC systems (a scripted launch is sketched below):
- HPCC: the reference benchmark suite for HPC; includes HPL; 7 tests to stress CPU/disk/RAM/network usage
- Bonnie++: a file system benchmarking suite
- IOZone: cross-platform benchmark of file operations (read, write, re-read, re-write, read backwards/strided, mmap...)
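A hedged sketch of how such a campaign can be scripted; the exact flags are assumptions, except the IOzone file/record sizes, which match the 64MB-file / 1MB-record test reported in the results.

```python
# Launch the three benchmark suites in sequence (flags are assumptions).
import os
import subprocess

os.makedirs("/tmp/bench", exist_ok=True)
BENCHMARKS = [
    ["mpirun", "-np", "12", "hpcc"],           # HPCC/HPL: 12 ranks = 2 x 6 cores (taurus)
    ["bonnie++", "-d", "/tmp/bench"],          # Bonnie++: file system benchmark
    ["iozone", "-a", "-s", "64m", "-r", "1m"], # IOzone: 64MB file, 1MB record
]
for cmd in BENCHMARKS:
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```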
Modeling
Resource model
The model divides a machine into resources of distinct types, further defined by resource supplies: their capacity and architecture. Example node (a structural sketch follows):

Node: Dell PowerEdge R310
- Processor: Intel Xeon X3430, 2.4 GHz, 8M cache, Turbo (4 cores)
- Memory: 4GB (2 x 2GB), 1333MHz single-ranked UDIMM
- Storage: 500GB 7.2K RPM SATA 3.5in, no RAID
- Network: Broadcom 5709 dual-port 1GbE NIC w/ TOE, iSCSI, PCIe x4 (2 x 1 Gbps)
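An illustrative sketch of this model (class and field names are ours, not the paper's notation), instantiated for the PowerEdge R310 above:

```python
# Each resource supply has a type, a capacity and an architecture tag.
from dataclasses import dataclass

@dataclass
class Resource:
    kind: str        # "processor", "memory", "storage" or "network"
    capacity: float  # cores, GB, GB or Gbps respectively
    arch: str

r310 = [
    Resource("processor", 4, "x86-64"),   # Intel Xeon X3430, 4 cores
    Resource("memory", 4, "DDR3-1333"),   # 2 x 2GB UDIMM
    Resource("storage", 500, "SATA"),     # 500GB 7.2K RPM disk
    Resource("network", 1, "1GbE"),       # one port of the dual-port NIC
    Resource("network", 1, "1GbE"),
]
```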
Resource allocation model
The resource allocation is three-tier: Task, VM, Machine.
[Figure: nine tasks (T1-T9, each with a demand D and an architecture A) mapped onto five typed VMs (each with a demand D, a provision P and an architecture A), themselves mapped onto two nodes (provision P, utilization U, architecture A).]
Each level is described by the previously presented resource model, either as a resource provision or as a resource demand; a feasibility check is sketched below.
Validation: can such a model estimate the average power draw of a node?
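A sketch of the feasibility rule the figure implies (identifiers are ours): children fit a layer if their summed demand does not exceed its provision and every architecture matches. The same rule applies between tasks and VMs, and between VMs and machines.

```python
# Feasibility check for one layer of the three-tier allocation.
def fits(demands, archs, provision, arch):
    return sum(demands) <= provision and all(a == arch for a in archs)

# VM1 (provision P=4, architecture A=1) hosting T1 (D=1, A=1) and T2 (D=2, A=1)
print(fits(demands=[1, 2], archs=[1, 1], provision=4, arch=1))  # True
```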
Experimental Setup & Experiments Performed
Setup

Name   Site   Cluster  #cpus/node  RAM   Processor                        R_peak
Intel  Lyon   taurus   2           32GB  Intel Xeon E5-2630@2.3GHz, 6C    110.4 GFlops
AMD    Reims  stremi   2           48GB  AMD Opteron 6164 HE@1.7GHz, 12C  163.2 GFlops

Deployment on recent (2011) platforms, Intel and AMD. The experimental setup is not straightforward: see the paper for details. A back-of-the-envelope check of the R_peak column follows.
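A hedged sanity check of the R_peak column using the standard peak formula, R_peak = cores x clock [GHz] x DP FLOPs/cycle. The per-cycle FLOP counts (8 for the AVX-capable Xeon E5, 4 for the Opteron 6164 HE) are our assumption, not stated on the slide; note that the listed Intel figure then corresponds to a single socket, while the AMD figure covers the whole two-socket node.

```python
# R_peak = cores x clock [GHz] x DP FLOPs/cycle (assumed values).
xeon_e5_2630 = 6 * 2.3 * 8        # 110.4 GFlops, one 6-core socket
opteron_6164 = 2 * 12 * 1.7 * 4   # 163.2 GFlops, both 12-core sockets
print(round(xeon_e5_2630, 1), round(opteron_6164, 1))
```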
Runs & Monitoring

Config     baseline  KVM  Xen  VMWare ESXi  Observation No.
stremi-3   5         5    5    0            10916
stremi-6   5         5    5    0            10907
stremi-30  5         5    5    5            13706
stremi-31  5         5    5    5            14026
taurus-7   5         5    5    5            6516
taurus-8   4         5    5    0            4769
taurus-9   5         5    5    0            5085
taurus-10  5         5    5    5            6545

Average power values based on:
- OmegaWatt (taurus): averaged, every 1s
- Raritan (stremi): instantaneous, every 3s

dstat used to capture utilization every 1s (a parsing sketch follows):
- CPU: user, system, idle, wio (%)
- Memory: used, buffered, cached, free (B)
- Disk: read, write (B)
- Network: received, sent (B)
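A minimal sketch of turning such dstat logs into model inputs, assuming CSV output and pandas; the file name is hypothetical and the number of metadata rows that `dstat --output` prepends varies with the dstat version.

```python
# Average the dstat utilization metrics over a run.
import pandas as pd

df = pd.read_csv("run.csv", skiprows=6)  # skip dstat's CSV preamble
# Average CPU utilization over the run: user, system, idle, wait-on-I/O (%)
print(df[["usr", "sys", "idl", "wai"]].mean())
```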
Results
Performance: HPCC results
[Figure: raw HPCC results (log scale) for Baseline, Xen, KVM and VMWare ESXi on the Intel and AMD platforms. Panels: HPL (TFlops), DGEMM (GFlops), STREAM Copy/Scale/Add/Triad (MB/s), RandomAccess (GUPs).]
Performance: IOzone results
[Figure: IOzone 64MB file, 1MB record test (rewrite, random_write, read, random_read, reread) for baseline, KVM, Xen and ESXi on stremi and taurus.]
- The Intel-based platform outperforms the AMD-based one.
- ESXi outperforms the baseline in some cases on the Intel platform (caching strategy?)
Consumption profiles I
[Figure: power usage [W] over time for Baseline and Xen on Intel (Lyon) and AMD (Reims); each curve is split into HPCC, Bonnie++ and IOZone phases, annotated with the total energy (J) and duration (s) of each phase.]
Consumption profiles II
[Figure: power usage [W] over time for KVM and VMWare ESXi on Intel (Lyon) and AMD (Reims); each curve is split into HPCC, Bonnie++ and IOZone phases, annotated with the total energy (J) and duration (s) of each phase.]
Models and their accuracy
The proposed models are based on the multiple linear regression principle and include different subsets of the available inputs (a regression sketch follows).

Model        R²     St.er.  Residuals: Min  1Q     Median   3Q    Max   Error [W]  Error [%]
Basic        0.959  10.4    -116            -3.54  0.8      5.07  117   6.67       3.8
Refined      0.959  10.4    -116            -3.54  0.806    5.07  117   6.67       3.8
No Phases    0.941  12.4    -127            -4.13  1.04     4.88  125   8.18       4.4
CPU Hom.     0.814  22      -147            -13.1  3.9      13.6  160   16.7       9.6
CPU Het.     0.922  14.2    -129            -5.12  -0.0472  4.89  129   9.73       5.0
Only phases  0.922  14.3    -114            -3.94  2.06     5.51  72.8  8.75       4.9
No group     0.856  19.4    -142            -8.45  2.96     11.3  129   14.1       8.0
Clusterwise  0.928  13.7    -122            -7.22  0.876    7.02  131   9.97       5.3
Group only   0.924  14.1    -113            -4.29  2.52     5.88  69.9  8.77       4.9
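A sketch of how such a model can be fitted, assuming pandas/statsmodels and our own column names (the paper's exact tooling is not stated): power is regressed on the categorical predictors (node, phase, hypervisor) plus the dstat utilization metrics.

```python
# Multiple linear regression with categorical and numerical predictors.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("observations.csv")  # one row per monitoring sample
model = smf.ols(
    "power ~ C(node) + C(phase) + C(hypervisor)"
    " + cpu_user + cpu_system + cpu_idle + cpu_wio"
    " + mem_used + mem_buffers + mem_cached + mem_free"
    " + disk_write + bytes_received + bytes_sent",
    data=df,
).fit()
print(model.rsquared)   # compare against the R² column above
print(model.params)     # fitted coefficients, as on the next slide
```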
Numerical values for a sample model

Categorical predictors:

node   stremi-3  stremi-30  stremi-31  stremi-6  taurus-10  taurus-7  taurus-8  taurus-9
value  0         -1.8       -14        -4.7      -45        -40       -46       -44

phase  Bonnie  DGEMM  FFT  HPL  IOZONE  PTRANS  RandAcc  STREAM  idle
value  0       5.7    6.5  16   0.012   -6.1    -11      3.1     6.1

hypervisor  ESXi  KVM   Xen   baseline
value       0     -3.6  -5.4  -19

Numerical predictors:

metric  Intercept  cpu user  cpu system  cpu idle  cpu wio  mem used
value   316        -0.78     -1.3        -1.7      -1.8     1.1E-09

metric  mem buffers  mem cached  mem free  disk write  bytes rec.  bytes sent
value   -3.1E-08     8.0E-10     8.3E-10   -1.8E-10    4.6E-05     -1.1E-04

A worked application of these coefficients follows.
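A worked example applying the sample coefficients above: predicted power = intercept + categorical offsets + sum of coefficient x metric. The utilization values are invented, and the memory, disk and network terms are omitted for brevity.

```python
# Hypothetical observation: HPL phase on bare-metal stremi-3, CPU-bound.
coef = {"cpu_user": -0.78, "cpu_system": -1.3, "cpu_idle": -1.7, "cpu_wio": -1.8}
obs = {"cpu_user": 90.0, "cpu_system": 5.0, "cpu_idle": 5.0, "cpu_wio": 0.0}

power = 316            # intercept
power += 0             # node: stremi-3 (reference level)
power += 16            # phase: HPL
power += -19           # hypervisor: baseline
power += sum(coef[k] * obs[k] for k in coef)
print(round(power, 1), "W predicted")  # 227.8 W
```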
Predictions I
[Figure: observed vs. predicted (Refined model) power [W] over time for Baseline Intel (Lyon) and Baseline AMD (Reims).]
Predictions II
[Figure: observed vs. predicted (Refined model) power [W] over time for Xen Intel (Lyon) and Xen AMD (Reims).]
Predictions III
[Figure: observed vs. predicted (Refined model) power [W] over time for KVM Intel (Lyon) and KVM AMD (Reims).]
Predictions IV
[Figure: observed vs. predicted (Refined model) power [W] over time for ESXi Intel (Lyon) and ESXi AMD (Reims).]
Conclusion
Conclusion
Cloud Computing in an HPC context requires a better understanding of the performance of virtualization middlewares. In this talk:
- evaluation of 3 widespread virtualization frameworks (Xen, KVM and VMware ESXi) vs. baseline
- practical and lightweight power model of virtualized nodes; the holistic model is the most accurate
- deployment/experiments on a real HPC environment: Grid'5000, Intel vs. AMD evaluation
- middleware affects both performance and energy results; the observed overhead (20-30%) is acceptable
- hardware heterogeneity is noticeable: better results on Intel than on AMD
Future work
1. Performance modelling, and its comparison with the power model
2. Analysis of the effects of multiple VMs on a single node
3. Inclusion of temperature as an environmental factor
4. Analysis of the overhead of cloud management systems: OpenNebula, OpenStack, Eucalyptus, Nimbus etc. (and Snooze ;))
5. Experiments with network-intensive workloads
6. Derivation of the model at the component level, rather than at the full-platform level
Thank you for your attention...

Contacts: {firstname.name}@uni.lu