Virtual Machine Scalability on Multi-Core Processors Based Servers for Cloud Computing Workloads




Virtual Machine Scalability on Multi-Core Processors Based Servers for Cloud Computing Workloads

M. Hasan Jamal, Abdul Qadeer, and Waqar Mahmood
Al-Khawarizmi Institute of Computer Science
University of Engineering and Technology, Lahore, Pakistan
{hasan.jamal, qadeer, waqar}@kics.edu.pk

Abdul Waheed and Jianxun Jason Ding
Cisco Systems, Inc.
17 W Tasman Dr., San Jose, CA 95134, USA
{abwaheed, jiding}@cisco.com

Abstract: In this paper, we analyze virtual machine (VM) scalability on multi-core systems for compute-, memory-, and network I/O-intensive workloads. The VM scalability evaluation under these three workloads will help cloud users to understand the performance impact of the underlying system and network architectures. We demonstrate that VMs on state-of-the-art multi-core processor based systems scale as well as multiple threads on a native SMP kernel for CPU and memory intensive workloads. Intra-VM communication of a network I/O intensive TCP message workload has lower overhead compared to multiple threads when VMs are pinned to specific cores. However, VM scalability is severely limited for such workloads for across-VM communication on a single host due to virtual bridges. For across local and wide area network communication, the network bandwidth is the limiting factor. Unlike previous studies that use workload mixes, we apply a single workload type at a time to clearly attribute VM scalability bottlenecks to system and network architectures or virtualization itself.

Keywords: Cloud Computing; Virtualization; Performance Evaluation; Multi-core Processors; Server Scalability

I. INTRODUCTION

Massive increase in parallelism due to many-cores and the ubiquity of high speed Internet connectivity are the defining forces behind the recent surge of a distributed computing model termed Cloud Computing.
A Cloud Computing environment manages a pool of computing and data processing resources that vary vastly in terms of models, sizes, and configurations, and that are provisioned to end users either in a raw form (e.g. selling machine cycles, storage space, etc.) or as a service. This pool of resources and services is typically distributed and globally accessible through the Internet. Typical building blocks inside a cloud are multi-core processor based systems. These multi-core based systems connect with each other through local and wide-area networks. Many interesting distributed computing systems of the past (e.g. Condor [8] and Prospero [12]) were built along these lines, scheduling jobs on an available pool of hosts to efficiently utilize the available computing power. Interconnects between such systems were traditionally local area networks. Being closely tied to underlying platforms (processor, interconnect, and operating system) was one of the hurdles to seamlessly deploying such systems at geographically dispersed locations. Thus, Cloud Computing differs from traditional distributed computing systems in terms of its ubiquity. A second difference is the use of virtualization. Virtualization is a mechanism to have multiple operating systems concurrently share the resources of a machine (e.g. running a Linux distribution as a Windows process). In this case, virtualization creates a virtual hardware machine and gives the Linux system the illusion that it is running on dedicated physical hardware. One can use virtualization to create any kind of virtual resource and then present this virtual resource as if it were real. Virtualization is used heavily in Cloud Computing and results in efficient utilization of the available hardware resources. Server consolidation, runtime guest migration, and security against malicious code are a few of the most compelling reasons for the use of virtualization.
Xen [3] is a popular open-source choice for virtualization. In this paper, we evaluate the performance overhead and scalability of virtual machines (VMs) on state-of-the-art multi-core processor based systems. While using multiple VMs to execute different applications ensures isolation among these applications, it has its overheads. Due to the increasing use of multi-core processors as building blocks of a Cloud Computing environment, it is important to understand the overhead of virtualization. A typical Cloud Computing workload consists of four types of interactions among distributed compute nodes: (1) intra-processor; (2) inter-processor; (3) across a Local Area Network (LAN); and (4) across a Wide Area Network (WAN). In this paper, we focus our attention on the virtualization overhead and scalability with these four types of interactions. Fig. 1 presents an overview of a dual Intel quad-core processor based system. Without virtualization, this system can work as an SMP through a single operating system image. This is the traditional approach of workload scheduling, which may not be very efficient. In a Cloud Computing environment with virtualization support, one or more VMs can simultaneously run on the system to provide isolated execution environments. While each VM operates under the illusion of dedicated access to physical resources

Figure 1. Overview of the dual processor, quad-core Intel Xeon E5405 architecture and our testbed.

allocated to it, these resources are shared at the processor and memory architecture level. In this paper, our goal is to quantify the overhead of this level of sharing on VM scalability using the four types of interactions mentioned above. Additionally, we want to find the cost of virtualization in terms of the performance penalty for intra-processor, inter-processor, and across LAN and WAN interactions. We use an Intel dual processor, quad-core system where intra- and inter-processor communication is through a shared bus, while the LAN is a Gigabit Ethernet switch. We emulate the WAN using DummyNet [13]. We measure CPU, memory, and network I/O performance using micro-benchmarks running across multiple VMs as well as on a non-virtualized SMP kernel based baseline system. The baseline system employs multiple threads to fully exercise the system and to compare its scalability characteristics with the multiple VM cases. We provide background on cloud computing environments in Section II. We discuss related research efforts in Section III. Section IV presents our evaluation approach, and the findings of a measurement based study are reported in Section V. We conclude with a discussion of the contributions and future directions of this work in Section VI.

II. CLOUD COMPUTING ENVIRONMENTS

Fig. 2 presents a generic Cloud Computing infrastructure based on multiple virtualized nodes, running applications or services in isolated VMs, distributed across a wide-area network. In this case, multiple hardware platforms may be connected through a LAN in a data center and connected to other data centers through a WAN.
Figure 2. A cloud computing infrastructure using virtualized system building blocks.

Many system calls and hardware accesses require the intervention of the virtual machine monitor (VMM). VMs can typically run on any of the available physical resources regardless of their location, types of hardware resources, and VMMs. While it is straightforward to characterize the performance of a single operating system image based SMP system, it is a challenge to characterize the performance of a cloud. Virtualization adds an additional layer of functionality between an application and the physical platform (hardware and operating system). While such architectures can utilize resources more efficiently, obtaining comparable performance may become more difficult. In this study, we use CPU, memory, and network benchmarks to isolate the overhead and scalability of VMs.

III. RELATED WORK

With the rapid adoption of virtualization, running multiple virtual machines on a single physical host can lead to many performance issues, and many previous studies have documented virtualization overheads. Some studies characterize and analyze the performance impact of various types of workloads on VMs. Apparao et al. [1] analyze the performance characteristics of server consolidation workloads on a multi-core based system in a virtualized environment and identify that certain cache and core interference can be reduced by using core affinity. They also study the effect of different scheduling schemes on the applications. Menon et al. [10] highlight the need for performance measurement of virtual machines and develop a new profiler for Xen, called Xenoprof. They demonstrate their profiler on a network application and show that Xenoprof can be a good tool for finding bottlenecks in the Xen kernel. Wood et al. [14] develop a mathematical model to estimate the resource overhead of a virtual machine. They claim that their model can be used to plan the required resources. Additionally, they claim that their model is general enough to be used with any virtual machine and demonstrate it on Xen.
Jerger et al. [6] study the impact of caches on many-core server consolidation. They find that traditional ways of handling caches in unicore systems are not suitable for virtual machines, as many performance and fairness issues arise. They describe a strategy to isolate the workloads on the virtual machines such that the interference between applications is minimal. Cherkasova et al. [4] specifically study the CPU overhead of I/O processing in the Xen virtual machine monitor, using I/O intensive workloads for their experiments. Apparao et al. [2] study the effects of I/O virtualization and try to pinpoint the components responsible for the performance degradation. As cloud computing relies on virtualization, our work characterizes multi-core based systems with respect to virtualization for three types of workload: compute, memory, and network I/O. In contrast to the above studies, we do not mix multiple types of these workloads, so that we can clearly attribute VM scalability bottlenecks to system or network architectures or to virtualization itself. While realistic applications represent a mix of compute, memory, and network I/O intensive workloads, they are not suitable for isolating VM scalability bottlenecks. Our objective is to isolate these bottlenecks as well as attribute them to either architectural features or virtualization related overheads. In addition, our evaluation considers a hierarchy of VMs on one or more multi-core processor based hosts.

IV. EVALUATION APPROACH

In this section, we first present the micro-benchmarks used for measuring VM overhead and scalability. We then outline the use cases under which we measure performance to adequately exercise the intra-processor, inter-processor, LAN, and WAN based interactions among VMs.

A. Micro-Benchmarks

For our measurement based study, we use three multi-threaded micro-benchmarks: CPU, memory, and network micro-benchmarks, developed using the MPAC library [11].
MPAC is an extensible, specifications-driven benchmarking infrastructure based on an implementation of a fork-and-join paradigm to concurrently execute symmetric benchmarking threads. A brief description of these benchmarks is given below.

1) CPU Micro-Benchmark: The CPU micro-benchmark can exercise the floating point, integer, and logical units of the processor cores according to a user specified workload, in parallel, using multiple threads.

2) Memory Micro-Benchmark: The memory micro-benchmark is inspired by the STREAM benchmark [9] and performs memory-to-memory data transfer operations using the number of threads, data size, data type, and number of repetitions specified by the user.

3) Network Micro-Benchmark: The network micro-benchmark is inspired by the Netperf benchmark [5] and is implemented using its specification. This benchmark can run multiple client and server thread pairs passing TCP messages, and it measures the end-to-end network I/O throughput.

B. Workload Characteristics

We use the MPAC based CPU, memory, and network benchmarks to generate representative workloads. For processor performance, we use log, summation, and left shift operations to exercise the floating point, integer, and logical units, respectively. For memory measurements, we use randomly generated floating point data to perform memory-to-memory copies of various data sizes to exercise the hierarchical memory subsystem. We run all of our benchmark cases for a duration of at least 12 seconds to eliminate the impact of transient behavior. For network I/O measurements, we use a TCP message based workload with a buffer size of 87,380 bytes and randomly generated messages of 16,384 bytes. The test duration for each network I/O benchmark run is 5 minutes. We define three high-level VM-VM interaction cases for our performance measurements: (1) single host based VM interactions; (2) multiple hosts based VM interactions across a LAN; and (3) multiple hosts based VM interactions across an (emulated) WAN.
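The fork-and-join execution of symmetric benchmark threads described above can be sketched as follows. This is a minimal Python illustration of the control structure, not the MPAC API (MPAC is a C library); the log-based kernel mirrors the floating point workload described above. Note that CPython's global interpreter lock prevents the linear speedup that native threads achieve, so the sketch shows the structure rather than the expected scaling.

```python
import math
import threading
import time

def fp_kernel(ops, results, idx):
    """Symmetric worker thread: exercise the floating point unit with log operations."""
    acc = 0.0
    for i in range(1, ops + 1):
        acc += math.log(i)  # floating point workload, as in the CPU micro-benchmark
    results[idx] = acc

def fork_and_join(num_threads, ops_per_thread):
    """Fork symmetric benchmark threads, join them, and report aggregate MOPS."""
    results = [0.0] * num_threads
    threads = [
        threading.Thread(target=fp_kernel, args=(ops_per_thread, results, i))
        for i in range(num_threads)
    ]
    start = time.perf_counter()
    for t in threads:  # fork phase
        t.start()
    for t in threads:  # join phase
        t.join()
    elapsed = time.perf_counter() - start
    return num_threads * ops_per_thread / elapsed / 1e6  # million operations/second

if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"{n} threads: {fork_and_join(n, 100_000):.2f} MOPS")
```

The aggregate throughput is the total operation count divided by the wall-clock time of the whole fork-and-join phase, which is how symmetric-thread benchmarks report scalability.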
For memory and CPU performance, the single host based VM interactions case is sufficient. For the network benchmark, all three cases are used. These cases are compared with baseline performance measurements. Our performance measurement methodology is described in the following subsections.

1) Performance Baseline: To establish a performance baseline, we use a Fedora Core 8 Linux kernel without virtualization to run the MPAC based memory, CPU, and network benchmarks with one to eight threads, measuring scalability with respect to the number of threads. For the baseline network I/O measurement, there are three scenarios: (1) client and server threads running on the same host; (2) client and server threads running on different hosts connected through a LAN; and (3) client and server threads running on different hosts connected through an (emulated) WAN. In the first scenario, the sender and receiver ends of the network benchmark run on the same physical host as two processes. Each of the sender and receiver processes runs with one to eight threads to exercise the scalability of network I/O on a single host using the loopback interface. In the second and third scenarios, the sender and receiver ends of the network benchmark run on different physical hosts to measure the scalability across the LAN and WAN.

2) Single Host Based VM Interactions: For the single host based VM interactions case, we use a Fedora Core 8 Linux kernel with virtualization, which can launch up to eight VMs (guests), equal to the number of available CPU cores. We run a single instance of the single threaded MPAC based benchmarks on one to eight VMs concurrently to evaluate the scalability across VMs. We synchronize the launch of benchmark processes across multiple VMs through cron. The total throughput is the sum of the throughputs on the individual VMs. These benchmarks run for multiple minutes to ensure that non-overlapped execution at start and end does not skew the overall throughput measurement.
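The baseline loopback scenario above, with a sender and a receiver end exchanging TCP messages on the same host, can be sketched as follows. This is a simplified single-pair Python analogue of the MPAC network benchmark, not its actual implementation; the 16,384-byte message size comes from the workload description, and a fixed message count stands in for the timed run.

```python
import socket
import threading
import time

MSG_SIZE = 16_384  # TCP message size from the workload description
NUM_MSGS = 1_000   # fixed message count standing in for the timed run

def receiver(listener, ready):
    """Server end: accept one connection and drain the expected number of bytes."""
    listener.listen(1)
    ready.set()
    conn, _ = listener.accept()
    with conn:
        remaining = MSG_SIZE * NUM_MSGS
        while remaining > 0:
            data = conn.recv(65536)
            if not data:
                break
            remaining -= len(data)

def measure_loopback_gbps():
    """Run one sender/receiver pair over the loopback interface; report Gbps."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # ephemeral port on the loopback interface
    port = listener.getsockname()[1]
    ready = threading.Event()
    t = threading.Thread(target=receiver, args=(listener, ready))
    t.start()
    ready.wait()
    msg = b"x" * MSG_SIZE
    start = time.perf_counter()
    with socket.create_connection(("127.0.0.1", port)) as sender:
        for _ in range(NUM_MSGS):
            sender.sendall(msg)  # client end sends fixed-size TCP messages
    t.join()  # wait until the server end has drained everything
    elapsed = time.perf_counter() - start
    listener.close()
    return MSG_SIZE * NUM_MSGS * 8 / elapsed / 1e9

if __name__ == "__main__":
    print(f"loopback throughput: {measure_loopback_gbps():.2f} Gbps")
```

The multi-pair scenarios replicate this pair across threads or VMs and sum the per-pair throughputs, as described above.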
For single host based VM network I/O measurements, there are two scenarios: (1) client and server ends running on the same VM; and (2) client and server ends running on different VMs. In the first scenario, the sender and receiver ends of the network benchmark run on the same VM, on one to eight VMs in parallel, to observe the scalability of network I/O performance of non-interacting VMs on a single host. In the second scenario, sender and receiver ends run in neighboring VMs, from one to eight sender-receiver pairs. To begin with, the sender end is run on odd numbered VMs and the receiver end on even numbered VMs, incrementally, until all odd VMs are running a sender end. Then sender ends are run on even VMs and receiver ends on odd VMs, until all VMs are running both a sender and a receiver end. This scenario is used to observe the scalability of network I/O performance of communicating VMs on a single host.

3) Multiple Hosts Based VM Interactions across LAN and WAN: For the multiple hosts based VM interactions case, we use a Fedora Core 8 Linux kernel with virtualization using up to eight VMs (guests), equal to the number of cores on each of the physical hosts. We run a single instance of the single threaded MPAC based benchmarks on one to eight VMs of different hosts, in parallel, to measure the network I/O performance scalability across two physical hosts. The total throughput is the sum of the throughputs on the individual VMs. In this case, all sender ends run on VMs of the first physical host and all receiver ends on the second physical host. For the multiple hosts based VM interactions case, there are two scenarios: (1) multiple hosts connected through a LAN; and (2) multiple hosts connected through a WAN. We use DummyNet to emulate the WAN through an OC-3 link of the kind that can be used to connect data centers.

C. System Under Test (SUT)

We choose an Intel dual quad-core processor based system as our SUT.
The specifications of the SUT are given in Table I.

TABLE I. SPECIFICATIONS OF THE SUT.

Processor type: Intel Xeon E5405
Number of processors: 2
Number of cores/processor: 4
Clock speed (MHz): 2,000
L1 cache size (KB): 32 D / 32 I
L2 cache size (MB): 2 x 6 (each core pair shares 6 MB L2)
Memory (RAM) size (MB): 8,000
CPU-memory bus speed: 1,333 MHz FSB frequency
NIC type: Broadcom NetXtreme II 5708
NIC speed (Mbps): 1,000
Number of NICs: 2
Type of hard disks: SCSI (SAS)
Baseline kernel: Linux 2.6.23.1-42.fc8 SMP (FC8)
Xen kernel: Linux 2.6.21-195.fc8xen SMP (FC8)
Xen guest kernel: Linux 2.6.23.1-42.fc8 SMP
Xen guest type: Para-virtualized

D. Micro-Benchmark Validation

To validate the MPAC memory benchmark measurements, we compare the result of its single thread execution with the STREAM benchmark's default case result on our SUT. Table II compares the memory-to-memory copy throughputs of the STREAM benchmark and the MPAC memory benchmark. The similarity validates our results. The STREAM benchmark measures slightly higher

throughput because it statically allocates the arrays, while our methodology initializes them dynamically from heap memory. To validate the MPAC network benchmark, we compare the loopback data transfer throughput with Netperf benchmark results on our SUT. We use a single thread based execution to measure the end-to-end network data transfer throughput of sending messages, mimicking the Netperf benchmark approach. Table II shows the end-to-end network data transfer throughputs of the Netperf benchmark and the MPAC network benchmark. MPAC incurs overhead due to its multi-threaded software architecture, which results in slightly lower throughput compared to Netperf.

TABLE II. THROUGHPUT IN MBPS OF MEMORY-TO-MEMORY COPY OF 16 MB FLOATING POINT DATA AND END-TO-END NETWORK DATA TRANSFER THROUGH LOOPBACK ON THE SUT.

Memory benchmarks (Mbps): STREAM 27,950; MPAC 26,434
Network benchmarks (Mbps): Netperf 7,986; MPAC 7,553

V. MEASUREMENT BASED EVALUATION

In this section, we use the three micro-benchmarks to characterize the performance of an Intel multi-core processor based system. We use a baseline case with the SMP system running a non-virtualized image of Linux and compare its CPU, memory, and network I/O scalability with a Xen based virtualized kernel image. While the baseline cases exercise the non-virtualized SMP kernel using multiple threads, the virtualized cases exercise the system through independent, concurrently executing processes in multiple VMs.

A. CPU Throughput

Fig. 3 compares the scalability of baseline and VM related measurements of: (a) floating point; (b) integer; and (c) logical operations of the CPU micro-benchmark. We observe a linear scalability trend for both the non-virtualized baseline and the virtualized use cases. Thus virtualization provides isolation without compromising the linear CPU throughput scalability. This is expected, as we are utilizing each processor core independently of the others.
Another noticeable characteristic is that the (Xen) virtualization overhead remains insignificant to small, compared to the baseline case, as the number of VMs increases. This is promising for CPU intensive Cloud Computing workloads hosted on state-of-the-art multi-core processor based systems.

B. Memory Throughput

We measure memory throughput for different data sizes (ranging from 16 KB to 16 MB). These results are presented in Fig. 4. We use four array sizes, 16 KB, 512 KB, 6 MB, and 16 MB, to distinguish the impact of the private caches, the shared L2 cache, and the shared memory bus.

Figure 3. CPU throughput in MOPS across number of threads and guests, for floating point, integer, and logical operations on the SUT.

While these cases are not mutually exclusive, the 16 KB array copy mostly accesses the private cache. Similarly, for the 16 MB array size, accesses from main memory over the shared bus play the dominant role in the measured throughput, while the L2 caches enhance spatial locality. Keeping this distinction in mind, we observe three memory throughput characteristics of the system under this workload. For 16 KB arrays, both the baseline and VM based cases show the highest memory throughput and linear scalability, because the underlying memory-to-memory copy operation of the benchmark is mostly confined to the local private caches. Array sizes of 512 KB and 6 MB start showing the impact of the shared L2 caches. Due to this sharing, the throughputs in both the baseline and VM cases are lower than in the 16 KB case and do not scale linearly.

Figure 4. Memory throughput in Gbps across number of threads and guest VMs for floating point data of different sizes on the SUT.

Shared L2 cache conflicts arise as multiple threads or VMs access the main memory. In both cases, throughput reaches a saturation level determined by the combined L2 cache and main memory bandwidth. With a further increase in the number of threads or VMs, memory throughput starts declining due to L2 cache conflicts until it reaches close to the memory bus throughput. With 16 MB arrays, throughput is the lowest compared to the smaller array sizes, as the L1 and L2 caches no longer play a dominant role. Throughput reaches the bus saturation level with up to four cores simultaneously stressing it. The memory micro-benchmark measurements clearly indicate that the overhead of virtualization is comparable with the overhead of multi-threading. In both cases, the shared L2 caches and the memory bus equally impact the memory throughput. Virtualization itself does not add anything to this architecture level overhead. In addition, the 6 MB shared L2 cache works effectively to hide the main memory access latencies for up to four cores.
C. Network I/O Throughput

The network benchmark exercises the underlying system with five use cases: (1) a baseline case where multiple pairs of threads on the non-virtualized SMP system act as clients and servers exchanging TCP messages; (2) a client and server on a single VM, where each VM hosts a client-server pair as in case (1); (3) a client and server on different VMs on the same host, where each VM hosts a client end that sends TCP messages to a server end running in a different VM on the same host; (4) client and server ends on different hosts connected through a GigE LAN, running on SMP systems (baseline) or inside VMs; and (5) client and server ends on different hosts connected through a WAN, running on SMP systems (baseline) or inside VMs. The main purpose behind these five cases is to determine TCP/IP stack performance within the host as well as across a LAN and WAN under virtualization. We realize the WAN case using DummyNet based emulation. Fig. 5 compares the network I/O throughput of TCP based messages under the above five use cases. In Fig. 5(a), the non-virtualized baseline case shows almost linear scalability across TCP client and server thread pairs. With more threads, scheduling overhead due to thread-exclusive TCP message dispatching for each client-server pair prevents hitting the bus throughput limit (~40 Gbps). The client and server on a single VM case (#2) shows a linear throughput increase with the number of VMs until it reaches the bus based memory throughput limit of about 40 Gbps with six client-server pairs in six VMs. In this case, each VM is isolated from the others and is pinned to a single core. Thus, all communication is fully contained within a VM. Hence the communication overhead is slightly lower than in the non-virtualized baseline case, which results in higher scalability for the virtualized case. In the client and server on different VMs case (#3), a Xen based virtual bridge is used.
In this case, TCP message based network I/O fully saturates the Xen bridge, which acts as a serialization point for all the flows from different VMs. The aggregate maximum throughput is almost 10 times lower than the memory-to-memory throughput of about 40 Gbps that

Figure 5. (a) Single host based VM interactions case. (b) Multiple hosts based interactions case across GigE LAN. (c) Multiple hosts based interactions case across WAN connected through an emulated OC-3 link.

we observe in Fig. 4(d). This is essentially the price that a virtualized application incurs for ensuring isolation among multiple VMs. Throughput plots for the across-LAN communication scenario (Fig. 5(b)) are not surprising. In this case, the limiting factor is the 1 Gbps physical network throughput. For the across-WAN communication among data centers scenario, we use an emulated OC-3 link (155 Mbps link with 6 ms delay). Throughput plots in Fig. 5(c) show the physical network throughput to be the limiting factor, while the delay in the network prevents the throughput from reaching the link speed of 155 Mbps. Regardless of the increasing number of client-server pairs, the throughputs of the LAN and WAN scenarios saturate to the available network bandwidths for both the virtualized and non-virtualized cases.

VI. CONCLUSION AND FUTURE WORK

In this paper, we characterized the performance of multi-core processor based systems for cloud computing workloads. We observe that state-of-the-art computer architectures allow multiple VMs to scale as long as cache, memory, bus, and network bandwidth limits are not reached.
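A back-of-the-envelope TCP window calculation suggests why a single flow cannot reach the emulated 155 Mbps link speed. The sketch below assumes that the 87,380-byte socket buffer from Section IV-B caps the effective TCP window and that the 6 ms emulated delay applies in each direction (a 12 ms round trip); neither assumption is stated explicitly above, so the numbers are only indicative.

```python
def window_limited_throughput_mbps(window_bytes, rtt_seconds):
    """Maximum TCP throughput when limited by the window: window / RTT."""
    return window_bytes * 8 / rtt_seconds / 1e6

# Assumed values: 87,380-byte socket buffer (Section IV-B) as the effective
# window, and a 6 ms one-way emulated delay, i.e. a 12 ms round-trip time.
WINDOW_BYTES = 87_380
RTT_SECONDS = 2 * 0.006

per_pair = window_limited_throughput_mbps(WINDOW_BYTES, RTT_SECONDS)
print(f"per-pair ceiling: {per_pair:.1f} Mbps (link speed: 155 Mbps)")
```

Under these assumptions a single client-server pair is window-limited to roughly 58 Mbps, so several concurrent pairs are needed to approach the link bandwidth, consistent with the gradual saturation behavior in Fig. 5(c).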
Thus, CPU- and memory-intensive virtualized workloads should scale up to the limits imposed by the memory architecture. Similarly, network I/O-intensive workloads scale up to the available LAN or WAN effective throughput. Virtualization becomes a bottleneck when multiple VMs communicate: communication among VMs on the same physical host is bound by the throughput of the virtual bridge. Furthermore, communication within a VM has lower overhead than in the non-virtualized case, because the VM is pinned to a single core and avoids thread scheduling overheads.

Using micro-benchmarks to generate one of compute-, memory-, or network I/O-intensive workloads at a time allows us to attribute the scalability bottlenecks to one of three possible areas: (1) cache and memory architecture; (2) network architecture; and (3) virtualization overheads. Our evaluation clearly indicates that virtualization overheads have a significant impact on scalability under workloads based on VM-VM interactions.

There are multiple proposed solutions to the bottleneck caused by virtual bridges in across-VM communication, including XenSocket [15] and XWAY [7]. We are evaluating these techniques in terms of their impact on the serialization at the virtual bridge. In addition, we plan to study the cost of VM migration and the parallelized execution of coarse-grained tasks on different VMs.

ACKNOWLEDGMENT

We would like to thank the National ICT R&D Fund, Ministry of Information Technology, Pakistan, for funding this project.

REFERENCES

[1] P. Apparao, R. Iyer, X. Zhang, D. Newell, and T. Adelmeyer, "Characterization and Analysis of a Server Consolidation Benchmark," in Proceedings of the 4th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, Seattle, WA, USA, 2008, pp. 21-30.

[2] P. Apparao, S. Makineni, and D. Newell, "Characterization of Network Processing Overheads in Xen," in Proceedings of the 2nd International Workshop on Virtualization Technology in Distributed Computing, 2007.

[3] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the Art of Virtualization," in Proceedings of the 19th ACM Symposium on Operating Systems Principles, ACM Press, New York, NY, USA, 2003, pp. 164-177.

[4] L. Cherkasova and R. Gardner, "Measuring CPU Overhead for I/O Processing in the Xen Virtual Machine Monitor," in Proceedings of the USENIX Annual Technical Conference, April 2005.

[5] Hewlett-Packard Company, "Netperf: A Network Performance Benchmark," February 1995. Available online: http://www.netperf.org/netperf/training/netperf.html

[6] N. Jerger, D. Vantrease, and M. Lipasti, "An Evaluation of Server Consolidation Workloads for Multi-Core Designs," in Proceedings of the 10th IEEE International Symposium on Workload Characterization, 2007, pp. 47-56.

[7] K. Kim, C. Kim, S.-I. Jung, H. Shin, and J.-S. Kim, "Inter-domain Socket Communication Supporting High Performance and Full Binary Compatibility on Xen," in Proceedings of the 4th International Conference on Virtual Execution Environments, ACM, Seattle, WA, USA, March 2008, pp. 11-20.

[8] M. J. Litzkow, M. Livny, and M. W. Mutka, "Condor: A Hunter of Idle Workstations," in Proceedings of the 8th International Conference on Distributed Computing Systems, June 1988, pp. 104-111.

[9] J. D. McCalpin, "Memory Bandwidth and Machine Balance in Current High Performance Computers," IEEE Technical Committee on Computer Architecture Newsletter, December 1995.

[10] A. Menon, J. R. Santos, Y. Turner, G. Janakiraman, and W. Zwaenepoel, "Diagnosing Performance Overheads in the Xen Virtual Machine Environment," in Proceedings of the 1st ACM/USENIX Conference on Virtual Execution Environments (VEE'05), June 2005, pp. 13-23.

[11] MPAC Benchmarks. Available online: http://www.kics.edu.pk/hpcnl/download.php

[12] B. C. Neuman and S. Rao, "The Prospero Resource Manager: A Scalable Framework for Processor Allocation in Distributed Systems," Concurrency: Practice and Experience, 6(4), June 1994, pp. 339-355.

[13] L. Rizzo, "Dummynet: A Simple Approach to the Evaluation of Network Protocols," ACM SIGCOMM Computer Communication Review, vol. 27, no. 1, January 1997, pp. 31-41.

[14] T. Wood, L. Cherkasova, K. Ozonat, and P. Shenoy, "Profiling and Modeling Resource Usage of Virtualized Applications," in Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware, Leuven, Belgium, 2008, pp. 366-387.

[15] X. Zhang, S. McIntosh, P. Rohatgi, and J. L. Griffin, "XenSocket: A High-Throughput Interdomain Transport for Virtual Machines," in Proceedings of Middleware 2007, pp. 184-203.