GPU Accelerated Signal Processing in OpenStack John Paul Walters Computer Scientist, USC Information Sciences Institute jwalters@isi.edu
Outline: Motivation; OpenStack Background; Heterogeneous OpenStack; GPU Performance Across Hypervisors; Current Status; Future Work
Motivation Scientific workloads demand increasing performance with greater power efficiency, and architectures have been driven towards specialization and heterogeneity. Infrastructure-as-a-Service (IaaS) clouds can democratize access to the latest, most powerful accelerators. Then why are most of today's clouds homogeneous? Of the major providers, only Amazon offers virtual machine access to GPUs in the public cloud.
Cloud Computing and GPUs GPU passthrough has historically been hard: specific to particular GPUs, hypervisors, and host OSes, with legacy VGA BIOS support issues, etc. Today we can access GPUs through most of the major hypervisors: KVM, VMWare ESXi, Xen, and LXC. Combine this with a heterogeneous cloud.
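At the libvirt level, KVM GPU passthrough amounts to attaching the GPU's PCI function to the guest as a hostdev. A minimal sketch follows, assuming the libvirt Python bindings, a guest named "gpu-guest", and a GPU at host PCI address 05:00.0 (all placeholders, not the configuration used in these experiments):

import libvirt

# Host PCI address of the GPU (from lspci); 05:00.0 is a placeholder.
HOSTDEV_XML = """
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
  </source>
</hostdev>
"""

conn = libvirt.open("qemu:///system")      # local KVM/QEMU hypervisor
dom = conn.lookupByName("gpu-guest")       # existing guest definition
# Attach to the persistent config; takes effect at next boot (add
# VIR_DOMAIN_AFFECT_LIVE to hot-plug into a running guest).
dom.attachDeviceFlags(HOSTDEV_XML, libvirt.VIR_DOMAIN_AFFECT_CONFIG)
conn.close()

Xen and VMWare ESXi express the same idea through their own passthrough mechanisms (pciback and DirectPath I/O, respectively).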
OpenStack Background OpenStack was founded by Rackspace and NASA and is in use by Rackspace, HP, and others for their public clouds. It is open source, with hundreds of participating companies, and is used for both public and private clouds. Current stable release: OpenStack Havana; OpenStack Icehouse to be released in April. [Chart: Google Trends searches for common open source IaaS projects: openstack, cloudstack, opennebula, eucalyptus]
OpenStack Architecture Image source: http://docs.openstack.org/training-guides/content/module001-ch004-openstackarchitecture.html
Supporting GPUs We're pursuing multiple approaches for GPU support in OpenStack: LXC support for container-based VMs, Xen support for fully virtualized guests, and KVM support for fully virtualized guests with SR-IOV; we also compare against VMWare ESXi. Our OpenStack work currently supports GPU-enabled LXC containers, with a prototype Xen implementation as well. Given widespread support for GPUs across hypervisors, does hypervisor choice impact performance?
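For the LXC path, "GPU-enabled" mostly means making the host's NVIDIA device nodes visible inside the container. A minimal sketch, assuming a container named gpu-container and the usual NVIDIA device major number (195); the exact keys our OpenStack patches generate may differ:

# Append GPU device access to an LXC container's config (hypothetical path).
LXC_CONFIG = "/var/lib/lxc/gpu-container/config"

gpu_lines = [
    # Whitelist the NVIDIA character devices in the container's device cgroup.
    "lxc.cgroup.devices.allow = c 195:* rwm",
    # Bind-mount the device nodes so CUDA/OpenCL inside the container can open them.
    "lxc.mount.entry = /dev/nvidia0 dev/nvidia0 none bind,optional,create=file",
    "lxc.mount.entry = /dev/nvidiactl dev/nvidiactl none bind,optional,create=file",
]

with open(LXC_CONFIG, "a") as cfg:
    cfg.write("\n".join(gpu_lines) + "\n")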
GPU Performance Across Hypervisors 2 CPU architectures, 2 GPU architectures, 4 hypervisors: Sandy Bridge + Kepler and Westmere + Fermi; KVM, Xen, LXC, and VMWare ESXi. We standardize on a common CentOS 6.4 base system for comparison, with the same 2.6.32-358.23.2 kernel across all guests and LXC.
Hardware Setup
              Sandy Bridge + Kepler   Westmere + Fermi
CPU (cores)   2xE5-2670 (16)          2xX5660 (12)
Clock Speed   2.6 GHz                 2.6 GHz
RAM           48 GB                   192 GB
NUMA Nodes    2                       2
GPU           1xK20m                  2xC2075
Hypervisor Configuration
Hypervisor          Linux Kernel       Linux Distro
KVM                 3.12               Arch 2013.10.01
Xen 4.3.0-7         3.12 (dom0)        Arch 2013.10.01
VMWare ESXi 5.5.0   N/A                N/A
LXC                 2.6.32-358.23.2    CentOS 6.4
Benchmarks Three benchmarks: SHOC OpenCL (signal processing); GPU-LIBSVM (big data, machine learning); HOOMD (molecular dynamics, GPUDirect). Virtual machines: CentOS 6.4 with the 2.6.32-358.23.2 kernel, 20 GB RAM, and 1 CPU socket, configured to control for NUMA effects (a pinning sketch follows).
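One way to control for NUMA effects is to pin each benchmark guest's vCPUs to the cores of a single socket. A minimal sketch using the libvirt Python bindings; the core list, the 16-core host, and the guest name "benchmark-guest" are assumptions:

import libvirt

NODE0_CORES = set(range(0, 8))          # cores on NUMA node 0 (assumed layout)
cpumap = tuple(i in NODE0_CORES for i in range(16))

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("benchmark-guest")   # running benchmark VM
for vcpu in range(dom.maxVcpus()):
    dom.pinVcpu(vcpu, cpumap)           # restrict each vCPU to node-0 cores
conn.close()

Memory locality is handled similarly, e.g. with a numatune element in the guest's domain XML.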
K20 Results - SHOC [Chart: relative performance of SHOC common signal processing kernels under KVM, Xen, LXC, and VMWare]
K20 Results - SHOC Outliers [Chart: relative performance of SHOC OpenCL Level 1 and Level 2 outliers under KVM, Xen, LXC, and VMWare]
C2075 Results - SHOC [Chart: relative performance of SHOC common signal processing kernels under KVM, Xen, LXC, and VMWare]
C2075 Results - SHOC Outliers [Chart: relative performance of SHOC OpenCL Level 1 and Level 2 outliers under KVM, Xen, LXC, and VMWare]
SHOC Observations Overall, both the Fermi and Kepler systems perform at near-native levels; this is especially true for KVM and LXC. Xen on the C2075 system shows some overhead, likely because Xen couldn't activate large page tables. There is some unexpected performance improvement for the Kepler Spmv kernel.
K20 Results - GPU-LIBSVM [Chart: GPU-LIBSVM relative performance vs. number of training instances (1800, 3600, 4800, 6000) under KVM, Xen, LXC, and VMWare]
C2075 Results - GPU-LIBSVM [Chart: GPU-LIBSVM relative performance vs. number of training instances (1800, 3600, 4800, 6000) under KVM, Xen, LXC, and VMWare]
GPU-LIBSVM Observations Unexpected performance improvement for KVM on both systems, most pronounced on the Westmere/Fermi platform. This is due to the use of transparent hugepages (THP), which back the entire guest memory with hugepages and improve TLB performance. Disabling hugepages on the Westmere/Fermi platform reduces performance to 80-87% of the base system.
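A minimal sketch of checking (and, with root, setting) THP on the KVM host; the sysfs path below is the standard upstream location and applies to the Arch/3.12 host used here, not necessarily to a stock CentOS 6 kernel:

THP = "/sys/kernel/mm/transparent_hugepage/enabled"

with open(THP) as f:
    print("THP setting:", f.read().strip())   # e.g. "[always] madvise never"

# Enable THP so KVM guest memory can be backed by hugepages (requires root).
with open(THP, "w") as f:
    f.write("always")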
Multi-GPU with GPUDirect Many real applications extend beyond a single node's capabilities, so we test multi-node performance with InfiniBand SR-IOV and GPUDirect: 2 Sandy Bridge nodes equipped with K20 GPUs, ConnectX-3 IB with SR-IOV enabled, Mellanox OFED 2.1-1 ported to the 3.13 kernel, and the KVM hypervisor. We test with HOOMD, a commonly used GPUDirect-enabled particle dynamics simulator.
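Before handing a virtual function to each guest, it helps to confirm that the ConnectX-3 adapter actually exposes SR-IOV VFs. A minimal sketch via sysfs; the device name mlx4_0 is an assumption, and drivers configured through module parameters may not expose sriov_numvfs:

import glob, os

PF = "/sys/class/infiniband/mlx4_0/device"       # parent PCI device of the HCA

numvfs = os.path.join(PF, "sriov_numvfs")
if os.path.exists(numvfs):
    with open(numvfs) as f:
        print("active VFs:", f.read().strip())

# Each virtfn* symlink is a VF that can be passed through to a guest.
for vf in sorted(glob.glob(os.path.join(PF, "virtfn*"))):
    print(vf, "->", os.path.realpath(vf))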
GPUDirect Advantage Image source: http://old.mellanox.com/content/pages.php?pg=products_dyn&product_family=116
HOOMD Performance [Chart: HOOMD Lennard-Jones liquid MD relative performance vs. number of particles (16k-512k)]
Current Status Source code is available now: https://github.com/usc-isi/nova. It includes support for heterogeneity: GPU-enabled LXC instances, bare-metal provisioning, an architecture-aware scheduler (a minimal filter sketch follows), and a prototype Xen GPU passthrough implementation.
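The idea behind the architecture-aware scheduler, reduced to a standalone sketch: match a flavor's requested architecture and accelerator count against what each host reports. The key names (cpu_arch, gpus) are illustrative, not the exact extra_specs used in the usc-isi/nova branch:

def host_passes(host_caps, extra_specs):
    """Return True if a host satisfies the instance's architecture request."""
    wanted_arch = extra_specs.get("cpu_arch")
    wanted_gpus = int(extra_specs.get("gpus", 0))
    if wanted_arch and host_caps.get("cpu_arch") != wanted_arch:
        return False
    return host_caps.get("gpus", 0) >= wanted_gpus

hosts = {
    "gpu-node-1":   {"cpu_arch": "x86_64", "gpus": 1},   # e.g. Sandy Bridge + K20
    "bare-metal-1": {"cpu_arch": "x86_64", "gpus": 0},
}
request = {"cpu_arch": "x86_64", "gpus": 1}              # from the flavor's extra_specs
print([name for name, caps in hosts.items() if host_passes(caps, request)])

In the real scheduler this logic lives in a nova host filter that the filter scheduler invokes per candidate host.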
Future Work Primary focus: multi-node, with a greater range of applications and larger systems. Integrate GPU passthrough support for KVM; this might come free with the existing OpenStack PCI passthrough work (a hedged configuration sketch follows). NUMA support: this work assumes perfect NUMA mapping; OpenStack should be NUMA-aware.
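A hedged sketch of how KVM GPU passthrough could ride on nova's existing PCI passthrough support: whitelist the device and define an alias in nova.conf, then request it from a flavor. The vendor/product IDs below are illustrative (verify with lspci -nn on the actual host), and the option names reflect the Havana-era configuration:

import json

# JSON strings of the form nova.conf expects (Havana-era options).
whitelist = json.dumps([{"vendor_id": "10de", "product_id": "1028"}])
alias = json.dumps({"vendor_id": "10de", "product_id": "1028", "name": "k20"})

print("pci_passthrough_whitelist =", whitelist)
print("pci_alias =", alias)
# A flavor would then request one such device, e.g.:
#   nova flavor-key gpu.large set "pci_passthrough:alias"="k20:1"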