Toward a practical HPC Cloud: Performance tuning of a virtualized HPC cluster
Ryousei Takano
Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Japan
SC2011 @ Seattle, Nov. 15, 2011
Outline
- What is HPC Cloud?
- Performance tuning methods for HPC Cloud
  - PCI passthrough
  - NUMA affinity
  - VMM noise reduction
- Performance evaluation
HPC Cloud
HPC Cloud uses cloud resources for High Performance Computing (HPC) applications. Users request resources according to their needs, and the provider allocates each user a dedicated virtual cluster on demand.
[Figure: virtualized clusters mapped onto a physical cluster.]
HPC Cloud (cont'd)
Pros:
- User side: easy deployment
- Provider side: high resource utilization
Cons:
- Performance degradation? Methods for performance tuning in a virtualized environment are not yet established.
Toward a practical HPC Cloud
Current HPC Cloud: its performance is poor and unstable.
Tuning steps toward a true HPC Cloud, whose performance approaches that of bare metal machines:
- Use PCI passthrough
- Set NUMA affinity
- Reduce VMM noise (not completed): reduce the overhead of interrupt virtualization and disable unnecessary services on the host OS (e.g., ksmd; see the sketch below)
[Figure: KVM architecture — a VM is a QEMU process; guest OS threads run on VCPU threads, which the Linux kernel/KVM schedules onto the physical CPUs of a CPU socket.]
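Host-side cleanup such as stopping ksmd can be done through the standard sysfs interface for KSM; a minimal sketch (the deck does not specify which other services to disable):

# Stop kernel samepage merging so ksmd stops scanning guest memory.
$ echo 0 | sudo tee /sys/kernel/mm/ksm/run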
PCI passthrough
[Figure: three I/O virtualization models — I/O emulation (the guest driver talks to a virtual switch and the physical driver inside the VMM), PCI passthrough (the guest's physical driver accesses the NIC directly), and SR-IOV (each guest's driver accesses a virtual function of the NIC, whose embedded switch (VEB) handles sharing).]
Comparison of the three models on VM sharing and performance: I/O emulation allows the NIC to be shared among VMs but performs poorly; PCI passthrough performs well but cannot be shared; SR-IOV offers both sharing and high performance.
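As a hedged sketch of how a device is handed to a guest with the legacy KVM device assignment (pci-assign) used later in this deck; the PCI address 0000:05:00.0 and the vendor/device ID pair are placeholders for the actual adapter:

# Detach the adapter from its host driver and reserve it with pci-stub
# (PCI address and vendor/device IDs below are placeholders).
$ echo "<vendor_id> <device_id>" | sudo tee /sys/bus/pci/drivers/pci-stub/new_id
$ echo 0000:05:00.0 | sudo tee /sys/bus/pci/devices/0000:05:00.0/driver/unbind
$ echo 0000:05:00.0 | sudo tee /sys/bus/pci/drivers/pci-stub/bind
# Assign it to the guest at launch time (or hot-add it via the monitor).
$ qemu-system-x86_64 -enable-kvm -device pci-assign,host=05:00.0 ...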
Virtual CPU scheduling
[Figure: bare metal vs. Xen vs. KVM. On bare metal, guest threads run directly on the physical CPUs (P0-P3) of a CPU socket. With Xen, a VM (DomU) runs its threads on VCPUs (V0-V3) that the Xen hypervisor's domain scheduler maps onto physical CPUs; a guest OS cannot run numactl. With KVM, a VM is a QEMU process whose VCPU threads are scheduled onto physical CPUs by the ordinary Linux process scheduler.]
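Because a KVM guest is just a QEMU process, its vCPU threads can be inspected with ordinary Linux tools; a small sketch (the process name and PID are placeholders):

# Find the QEMU process of the VM and list its threads; the vCPU threads
# are ordinary Linux threads visible to the process scheduler.
$ pgrep -f qemu-system-x86_64
$ ps -L -p <qemu_pid> -o tid,psr,comm
# The QEMU monitor also reports the vCPU thread IDs:
(qemu) info cpus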
NUMA affinity
[Figure: bare metal Linux vs. KVM. On bare metal, numactl binds application threads and memory to a CPU socket, and the process scheduler keeps them there. On KVM there are two levels: inside the guest, numactl binds threads to a virtual socket (VCPUs V0-V3); on the host, taskset pins each VCPU thread to a physical CPU so that Vn = Pn.]
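A minimal sketch of the two binding levels in the figure, assuming the VCPU thread IDs have been obtained as above and the guest is presented with a NUMA topology; thread IDs, CPU numbers, and the application binary are placeholders:

# Host side: pin each VCPU thread to one physical CPU so that Vn = Pn.
$ sudo taskset -pc 0 <tid_of_vcpu0>
$ sudo taskset -pc 1 <tid_of_vcpu1>
$ sudo taskset -pc 2 <tid_of_vcpu2>
$ sudo taskset -pc 3 <tid_of_vcpu3>
# Guest side: bind application threads and their memory to one (virtual) socket.
$ numactl --cpunodebind=0 --membind=0 ./app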
Evaluation
Evaluation of HPC applications on a 16-node cluster (part of the AIST Green Cloud cluster).

Host machine environment:
  Compute node: Dell PowerEdge M610
  CPU: Intel quad-core Xeon E5540 / 2.53 GHz x2
  Chipset: Intel 5520
  Memory: 48 GB DDR3
  InfiniBand: Mellanox ConnectX (MT26428)
  Blade switch: Mellanox M3601Q (InfiniBand QDR, 16 ports)
  OS: Debian 6.0.1 (Linux kernel 2.6.32-5-amd64)
  KVM: 0.12.50
  Compiler: gcc/gfortran 4.4.5
  MPI: Open MPI 1.4.2

VM environment:
  VCPU: 8
  Memory: 45 GB
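For reference, a hedged sketch of a qemu-kvm invocation matching the VM environment above (8 VCPUs, 45 GB of memory); the disk image and the assigned HCA address are placeholders:

# 8 vCPUs, 45 GB RAM (in MB), with the InfiniBand HCA passed through.
$ qemu-system-x86_64 -enable-kvm -smp 8 -m 46080 \
    -device pci-assign,host=<hca_pci_addr> \
    -drive file=guest.img -nographic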
MPI point-to-point communication performance
[Figure: bandwidth (MB/sec, log scale, higher is better) vs. message size (1 byte to 1 GB) for Bare Metal and KVM.]
With PCI passthrough, MPI communication throughput is close to that of bare metal machines.
Bare Metal: non-virtualized cluster.
NUMA affinity
Execution time on a single node: NPB multi-zone (computational fluid dynamics) and Bloss (non-linear eigensolver).

                 SP-MZ [sec]      BT-MZ [sec]      Bloss [min]
  Bare Metal     94.41  (1.00)    138.01 (1.00)    21.02 (1.00)
  KVM            104.57 (1.11)    141.69 (1.03)    22.12 (1.05)
  KVM (w/ bind)  96.14  (1.02)    139.32 (1.01)    21.28 (1.01)

NUMA affinity is an important performance factor not only on bare metal machines but also on virtual machines.
NPB BT-MZ: Parallel efficiency
[Figure: total performance (Gop/s) and parallel efficiency (%, higher is better) vs. number of nodes (1-16) for Bare Metal, KVM, and Amazon EC2.]
Degradation of parallel efficiency: KVM 2%, Amazon EC2 14%.
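The slide does not define the parallel-efficiency axis; presumably it is the usual ratio of achieved speedup to node count:

  PE(N) = Perf(N) / (N x Perf(1)) x 100%

For example, with hypothetical numbers, 256 Gop/s on 16 nodes against 20 Gop/s on one node gives PE = 256 / (16 x 20) = 80%.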
Bloss: Parallel efficiency
Bloss: non-linear internal eigensolver; a hierarchical parallel program using MPI and OpenMP. The gap from ideal reflects the overhead of communication and virtualization.
[Figure: parallel efficiency (%) vs. number of nodes (1-16) for Bare Metal, KVM, Amazon EC2, and Ideal.]
Degradation of parallel efficiency: KVM 8%, Amazon EC2 22%.
Summary
HPC Cloud is promising! The performance of coarse-grained parallel applications is comparable to bare metal machines. We plan to operate a private cloud service, AIST Cloud, for HPC users.
Open issues:
- VMM noise reduction
- VMM-bypass device-aware VM scheduling
- Live migration with VMM-bypass devices
Live migration with SR-IOV (1/4)
[Figure: initial state — the guest OS bonds a virtio interface and the SR-IOV virtual function eth1 (igbvf) into bond0; on each host, the virtio path goes through tap0 and bridge br0 over the physical driver (igb) of the SR-IOV NIC.]
Live migration with SR-IOV (2/4)
Before migration, hot-remove the virtual function from the guest:
(qemu) device_del vf0
[Figure: the guest's bond0 now carries traffic only over the virtio / tap0 / br0 path.]
Live migration with SR-IOV (3/4)
Start the destination QEMU in incoming mode, then migrate from the source:
$ qemu -incoming tcp:0:y ...
(qemu) migrate -d tcp:x.x.x.x:y
[Figure: the guest, now using only its virtio interface, is live-migrated from the source host to the destination host.]
Live migration with SR-IOV (4/4)
After migration, hot-add a virtual function on the destination host:
(qemu) device_add pci-assign,host=05:10.0,id=vf0
[Figure: the guest's bond0 again includes eth1 (igbvf), now backed by the destination host's SR-IOV NIC.]
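The slides show only the QEMU monitor side; inside the guest, the failover between the VF and the virtio NIC relies on Linux bonding, which could be set up roughly as follows (interface names, bonding options, and the IP address are assumptions, not taken from the deck):

# Inside the guest: bond the virtio NIC (eth0) and the SR-IOV VF (eth1)
# in active-backup mode, preferring the VF while it is attached.
$ sudo modprobe bonding mode=active-backup miimon=100 primary=eth1
$ sudo ifconfig bond0 192.168.0.10 netmask 255.255.255.0 up
$ sudo ifenslave bond0 eth0 eth1
# "device_del vf0" fails traffic over to eth0 (virtio); re-adding the VF
# with "device_add pci-assign,..." moves it back to eth1.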
Bloss: Parallel efficiency
Bloss: non-linear internal eigensolver; a hierarchical parallel program using MPI and OpenMP.
[Figure: parallel efficiency (%) vs. number of nodes (1-16) for Bare Metal, KVM, KVM (w/ bind), Amazon EC2, and Ideal.]
Binding threads to physical CPUs can be sensitive to VMM noise and degrade performance.
LINPACK efficiency
[Figure: LINPACK efficiency (%) of TOP500 systems (June 2011), grouped by GPGPU machines, InfiniBand, Gigabit Ethernet, and 10 Gigabit Ethernet.]
InfiniBand: 79%; Gigabit Ethernet: 54%; 10 Gigabit Ethernet: 74%.
Amazon EC2 cluster compute instances (TOP500 rank #451): virtualization causes performance degradation!
Efficiency = maximum LINPACK performance (Rmax) / theoretical peak performance (Rpeak).
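For concreteness, with hypothetical numbers: a system with Rpeak = 50 Tflops that sustains Rmax = 40 Tflops on LINPACK has an efficiency of 40 / 50 = 80%.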