Micro-architectural Characterization of Desktop Cloud Workloads
Tao Jiang, Rui Hou, Lixin Zhang, Ke Zhang, Licheng Chen, Mingyu Chen, Ninghui Sun
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China
Graduate University of Chinese Academy of Sciences, Beijing, China
{jiangtao, hourui, zhanglixin, zhangke, chenlicheng, cmy, snh}

Abstract

Desktop cloud replaces traditional desktop computers with completely virtualized systems from the cloud. It is becoming one of the fastest growing segments in the cloud computing market. However, as far as we know, little work has been done to understand the behavior of desktop clouds. On one hand, desktop cloud workloads differ from conventional data center workloads in that they are rich in interactive operations, and they differ from traditional non-virtualized desktop workloads in that they carry an extra layer in the software stack: the hypervisor. On the other hand, desktop cloud servers are mostly built with conventional commodity processors. While such processors are well optimized for traditional desktop and high-performance computing workloads, their effectiveness for desktop cloud workloads remains to be studied. As an attempt to shed some light on the effectiveness of conventional general-purpose processors on desktop cloud workloads, we have studied the behavior of desktop cloud workloads and compared it with that of SPEC CPU2006, TPC-C, PARSEC, and CloudSuite. We evaluate a Xen-based virtualization platform. The performance results reveal that desktop cloud workloads have significantly different characteristics from SPEC CPU2006, TPC-C, and PARSEC, but perform similarly to the data center scale-out benchmarks from CloudSuite. In particular, desktop cloud workloads have a high instruction cache miss rate (12.7% on average), a high percentage of kernel instructions (23% on average), and low IPC (0.36 on average).
They also have much higher TLB miss rates and lower off-chip memory bandwidth utilization than traditional benchmarks. Our experimental numbers indicate that the effectiveness of existing commodity processors is quite low for desktop cloud workloads. In this paper, we provide some preliminary discussion of potential architectural and micro-architectural enhancements. We hope that the performance numbers presented in this paper will give some insights to the designers of desktop cloud systems.

I. INTRODUCTION

Desktop cloud, a.k.a. virtual desktop infrastructure, replaces traditional desktops and workstations (e.g., [24, 29]) with virtual machines (VMs) running on centralized data center servers. It is being adopted by more and more organizations as the office computing platform and is becoming one of the fastest growing segments in the cloud computing market. With desktop cloud, users can access the same desktop, with their own applications and data, without being tied to a single client device. Desktop cloud allows IT administrators to maintain a centralized environment and to respond quickly to the changing needs of users and the business. It allocates resources to users on an as-needed basis and allows users to share resources, reducing the number of hardware units an organization must purchase, and in turn the total cost of ownership. The servers providing desktop cloud services are typically built with off-the-shelf commodity processors, most notably general-purpose x86 server processors. Commercial general-purpose processors adopt many optimization techniques, such as out-of-order super-scalar execution and multi-level cache hierarchies with sophisticated coherence protocols [3, 4], that have greatly improved single-thread performance.
As data center applications have diversified, these processors have also adopted many additional features, such as hardware-assisted virtualization [7], efficient voltage and frequency scaling [19, 20], many-core architectures [18], and wider I/O and memory bandwidth [17], to meet the needs of the increasing variety of data center applications. Desktop cloud workloads differ from traditional non-virtualized applications and from data center scale-out applications in that they run in virtualized environments and contain abundant interactive operations. One key question is whether server processors designed mostly for traditional applications can support desktop cloud services efficiently. As an attempt to answer this question, we have evaluated typical desktop cloud workloads in a real deployment. Our evaluation captures the behavior of the entire software stack (application, OS, and hypervisor) via source code instrumentation, hardware performance counters, and a custom-made memory trace collection device. To put the performance numbers of desktop cloud workloads in perspective, we have also evaluated some traditional benchmarks: SPEC CPU2006 [23] as single-threaded ILP (Instruction Level Parallelism) benchmarks, TPC-C [26] as a transactional TLP (Thread Level Parallelism) benchmark, PARSEC [15] as parallel TLP benchmarks, and CloudSuite [28] as data center scale-out benchmarks. We find that desktop cloud workloads exhibit significantly different characteristics from SPEC CPU2006, TPC-C, and PARSEC. While they also differ from the CloudSuite benchmarks, that difference is relatively smaller because the two share some characteristics. The major insights derived from our evaluation are as follows.

The cache hierarchy has uneven performance. Desktop cloud workloads in our experiments exhibit high I-Cache (instruction cache) miss rates and high L2 cache miss rates, but very low LLC (Last-Level Cache, L3 in our experiments) miss rates. These numbers indicate that even though the LLC is very effective, the I-Cache is not sufficiently large and the L2 is extremely ineffective.

Both the instruction TLB and the data TLB perform poorly. Desktop cloud workloads running in the virtualized environment suffer from much higher numbers of instruction and data TLB misses than those running in a non-virtualized environment. These numbers imply that the TLBs are inadequate for the extra demands brought by virtualization.

Off-chip memory bandwidth consumption is quite low. To our surprise, desktop cloud workloads do not generate a high number of memory accesses. While this might change once the inefficiencies in other parts of the system are addressed, it shows the potential for managing memory bandwidth for improved power efficiency.

Desktop cloud workloads typically do not require full cache coherence. We have observed that the cache-to-cache communication pattern of desktop cloud workloads follows a star topology, with one central core and point-to-point communication between the central core and each of the remaining cores. There is little or no communication among the non-central cores. To the best of our knowledge, no mainline processor uses a star topology among its on-chip caches. Their on-chip caches are normally connected through a bus, a ring, or a crossbar that optimizes cache-to-cache communication between any two cores.

The rest of the paper is organized as follows. Section II discusses related work. Section III describes the experimental methodology and tools. Section IV presents the experimental results. Section V briefly brings up future work and concludes this paper.

II. RELATED WORK

Many researchers have studied the performance and resource utilization of data center computer systems with both traditional high-performance benchmarks and cloud computing benchmarks [1, 6, 11, 12]. Ebbers et al.
[34] evaluated a virtual Linux desktop service created using eyeOS, an open-source web desktop, running on an IBM z-series server. They focused on the scalability of the system. Kochut et al. [2] analyzed the CPU and memory usage of desktops running on a native OS, with the intent of optimizing desktop cloud management. Tickoo et al. [14] demonstrated that a single parallel performance model does not fit the virtualized environment of data center servers, and proposed an alternative performance model. They performed a detailed case study with the vConsolidate benchmark [13] and investigated its core and cache contention effects. SPECvirt_sc2010 [36] is SPEC's first benchmark addressing the performance evaluation of data center servers used in virtualized server consolidation. It includes typical server-side workloads such as SPECweb2005 and SPECmail2009. VMmark [37] from VMware is a virtualization platform benchmark. It includes a variety of platform-level workloads such as dynamic relocation, cloning, and deployment of VMs. These two benchmarks are not used in our study because our experiments focus on client-side workloads. Some researchers have used scale-out benchmarks to study today's dominant system architectures. Ferdman et al. [5] used hardware performance counters to study the CloudSuite benchmarks. They discovered that existing systems are inefficient for running these benchmarks. To the best of our knowledge, this paper is the first attempt to characterize desktop cloud in a real deployment. Our evaluation captures the behavior of the entire software stack via source code instrumentation, hardware performance counters, and a custom-made hardware-based memory trace collection tool. We also compare the corresponding results with traditional ILP and TLP benchmarks and data center scale-out benchmarks.

III. METHODOLOGY

A. Hardware platform

Our experiments are performed on a 10-node Supermicro X86 server.
Each node contains two Intel Xeon E5620 2.40GHz processors, 32GB of DDR3 memory, and two Gigabit Ethernet ports. Table I lists the major configuration parameters of each node. As in many real systems, large page support is not enabled on our platform. The number of nodes used by each benchmark varies according to the requirements of the workloads and is disclosed in the corresponding subsections below. We also attempted to perform our study on an ATOM-based platform and an ARM-based platform, but we were unable to get Xen running in full-virtualization mode on these two platforms. Our attempt to get more support from the vendors was also unsuccessful.

TABLE I: Hardware configuration.
CPU: Intel Xeon E5620, 2.40GHz
Sockets: 2
Cores per socket: 4
Threads per core: 1 (Hyper-Threading disabled)
L1I: 32 KB, 4-way
L1D: 32 KB, 8-way
L2: 256 KB, 8-way
LLC (L3): 12 MB, 16-way
Memory: 32 GB DDR3-800
BIOS configuration: Hyper-Threading disabled; Turbo-Boost disabled; hardware-assisted virtualization enabled; hardware prefetchers enabled, unless indicated otherwise

B. Workloads

We evaluate a live desktop cloud and compare it with the CloudSuite, SPEC CPU2006, TPC-C, and PARSEC benchmarks.
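The DDR3-800 memory in Table I fixes the off-chip bandwidth ceiling that the measurements in Section IV are compared against. A minimal back-of-the-envelope sketch of that arithmetic; the single-channel view matches the one-DIMM-per-processor trace-collection setup described in Section III-C and is our own simplification, not a number stated in the paper:

```python
# Peak bandwidth of one DDR3-800 channel: transfer rate x bus width.
# DDR3-800 performs 800 mega-transfers/s over a 64-bit (8-byte) data bus.
transfers_per_sec = 800e6
bus_width_bytes = 8
peak_bytes_per_sec = transfers_per_sec * bus_width_bytes
print(peak_bytes_per_sec / 1e9)  # 6.4 GB/s per channel

# Section IV-H reports 714 MB/s for four VMs, i.e. roughly one-ninth of
# this single-channel peak, consistent with "less than one-eighth".
print(714e6 / peak_bytes_per_sec)
```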
1) Desktop cloud workloads: There are a number of commercial desktop cloud offerings on the market, including Desktone [30], Citrix's XenDesktop [31], VMware's View [33], and IBM Smart Business Desktop Cloud [24]. The platform we use is based on the open-source Xen [21]. It serves as the office computing platform for the students and staff members of our department.

Fig. 1: Structure of the desktop cloud being tested (VM servers, storage servers, and a management server, serving PC, laptop, and thin-client endpoints over the network).

The structure of our desktop cloud environment is shown in Figure 1. The hardware platform includes a number of VM servers, several storage servers, and one management server. All of them are connected through a 32-port Gigabit switch. For the guest OS on Xen, we use CentOS 5.5 with the Linux kernel for Domain0 (privileged, responsible for handling I/O operations), CentOS 5.5 with the Linux kernel for para-virtualized DomainUs (non-privileged virtual machines), and Windows XP for fully virtualized DomainUs. Unless noted otherwise, we allocate 4GB of memory for each Domain0 and 2GB for each DomainU. Each host node with 32GB of memory can thus support 14 DomainUs. This empirical configuration ensures that all VMs get a sufficient amount of memory while allowing as many VMs as possible to share one physical processor. It is the result of a stress test and the corresponding satisfaction measurements of real users. Even though our stress test shows that each node can support up to 42 DomainUs before it slows to a crawl, 14 DomainUs per node is the maximum number of VMs supported without users complaining about constant slowdowns and unresponsiveness. In our platform, each machine has eight cores. Core 0 is dedicated to Domain0, and the remaining cores are used for DomainUs, with each core running two DomainUs. Each DomainU is allocated one virtual CPU (VCPU), 30GB of disk space, and one RTL Mb virtual Ethernet card.
RDP (Remote Desktop Protocol) is used as the communication protocol between clients and servers for Windows XP desktops, and XDMCP (X Display Manager Control Protocol) is used for Linux GNOME desktops. In order to guarantee the validity of the performance numbers, the evaluation is done without the awareness of the desktop users. Table II gives a brief description of typical operations performed by desktop users.

TABLE II: Common operations of desktop cloud users.
Watching Video: Watching videos online
Surfing Web: Viewing webpages and processing s
Web Downloading: Downloading files directly or through P2P
Office Work: Editing Word, Excel, and PowerPoint files
Browsing PDF: Viewing PDF files
Compressing & Decompressing: Compressing and decompressing files
Copying File: Copying files to another disk partition
Anti-virus: Running anti-virus software

The most common daily operations recorded in our environment are editing files (mostly MS Word and PowerPoint files), viewing files (mostly PDF and MS Word), surfing the Web, watching videos, copying files, running anti-virus software, downloading files, running clients, and compressing and decompressing files. These workloads are similar to the View Planner benchmark [35] used to evaluate VMware VDI deployments.

Fig. 2: IPC trends.

Despite the various optimizations proposed to accelerate VM migrations, we have found that migrations are rarely observed in our environment. In addition, we have observed that different physical nodes behave similarly when serving the same number of users. Without loss of generality, we have randomly selected one node for detailed evaluation. We have recorded the IPC of the selected node over a full day, as shown in Figure 2.
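The "stable state" visible in Figure 2 can be identified mechanically from periodic IPC samples by flagging windows whose variation falls below a threshold. A minimal sketch; the one-sample-per-minute series and the 5% coefficient-of-variation threshold are illustrative choices of ours, not the paper's actual data:

```python
# Flag sample indices where the trailing `window` IPC samples are "stable",
# i.e. their coefficient of variation (stddev / mean) is at most max_cv.
from statistics import mean, pstdev

def stable_regions(samples, window=6, max_cv=0.05):
    """Return indices whose trailing `window` samples have CV <= max_cv."""
    stable = []
    for i in range(window, len(samples) + 1):
        w = samples[i - window:i]
        m = mean(w)
        if m > 0 and pstdev(w) / m <= max_cv:
            stable.append(i - 1)
    return stable

# Illustrative samples: ramp-up after 9:00 AM, then a steady plateau.
ipc = [0.05, 0.12, 0.20, 0.28, 0.33, 0.35, 0.36, 0.36, 0.35, 0.36, 0.37, 0.36]
print(stable_regions(ipc))  # indices of the minutes judged stable
```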
It can be seen from the figure that the IPC becomes stable (from point A to point B) within six minutes after people start to work at 9:00 AM, and it remains stable until 11:30 AM (point C), when people start to head out for lunch. During our experiments, we randomly select three time samples in the stable state and report the arithmetic means in the paper. During the stable state, the utilization rate of the core running Domain0 is 24.7% and the utilization
rate of the cores running DomainUs is 21.3%. To enable the measurement of cache-to-cache transfers, we set up a special test environment with only one Domain0 and two DomainUs. We modified the BIOS to activate only two cores in each of the two processors within a node. Xen Domain0 uses one processor, while the two DomainUs are pinned to two cores of the other processor.

Fig. 3: IPC. Fig. 4: Pipeline stall breakdown.

2) Data center benchmarks: CloudSuite consists of six benchmarks representing common data center applications of today. They are run with the native inputs.

a) Data Serving: A 10GB Yahoo! Cloud Serving Benchmark (YCSB) dataset is used as the input to Cassandra [27], which is configured with a 16GB Java heap and an 800MB young generation. The server load is generated by a YCSB client that sends 10,000 operation requests per second following a Zipfian distribution with a 50:50 write-to-read ratio.

b) Media Streaming: 20 Java processes are set up, each starting 50 client threads that issue requests simultaneously. To minimize the impact of the network, our setup is limited to low bit-rate video streams, with GetMediumLow set to 30 and GetShortLow set to 70.

c) Web Serving: One node is used as the web server and another as the database server. 200 concurrent users send requests to the web server with a 50-second ramp-up time, a 1000-second steady-state time, and a 30-second ramp-down time.

d) Data Analytics: A 30GB Wikipedia data set is used. The machine learning library (Mahout) is provided by Apache and is built on top of the Hadoop MapReduce framework. The default machine learning algorithm, Bayes, is used.

e) Web Search: One node is used as the index-processing server, with a 3GB index and 8GB of data. Another is used as the front-end server, which accepts requests and forwards them to the index-processing server.
f) Software Testing: Cloud9 [22] is deployed on five nodes: one load balancer and four workers that execute software-testing tasks automatically. The iteration cycle is set to 1000 seconds.

3) SPEC CPU2006, TPC-C, and PARSEC benchmarks: All SPEC CPU2006 benchmarks are run with the reference input set, and each run is pinned to a specified processor core via the thread affinity commands. TPC-C uses MySQL with 40 warehouses. The load is generated by 32 clients running on one node. The ramp-up time is 60 seconds, which is sufficient for the load to reach a steady state. All PARSEC benchmarks are run with the native input.

C. Measurement tools

For desktop cloud workloads, we use Xenoprof [10], a system-wide profiler for Xen-based virtualization systems, to profile the Xen hypervisor and the guest OS. We use oprofile [25] to collect processor hardware performance counters. The Xeon E5620 processor allows only four performance counters to be collected at a time, so we use time-multiplexing to collect the 20+ events we need. One drawback of hardware performance counters is that they cannot identify the virtual machine that owns a memory reference, and they cannot measure the interval between memory references, both of which are important for fully understanding the performance of the memory system. To better understand the memory behavior, we use the enhanced Hyper Memory Trace Tracker (HMTT) [8], a platform-independent memory trace monitoring system that sits between the memory controller and the DIMM on the board and records all off-chip memory references. Due to the limited number of HMTT devices available to us, we install only one dual-ranked DDR3-800 DIMM per processor when collecting memory access traces. In each run, HMTT collects memory references for a duration of 180 seconds after the machine reaches the stable state.
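Turning such a trace into the bandwidth and inter-reference-interval numbers reported in Section IV-H is straightforward post-processing. A minimal sketch, assuming a trace of (timestamp in ns, physical address) records and 64-byte cache-line transfers; the record layout and the toy trace are illustrative, not HMTT's actual format:

```python
# Post-processing sketch for an off-chip memory reference trace.
# Each record is (timestamp in ns, physical address); each reference
# moves one 64-byte cache line.
LINE_BYTES = 64

def bandwidth_mb_s(trace):
    """Average bandwidth over the trace duration, in MB/s."""
    span_ns = trace[-1][0] - trace[0][0]
    return (len(trace) * LINE_BYTES) / (span_ns / 1e9) / 1e6

def intervals_ns(trace):
    """Time between consecutive references (Figure 15-style data)."""
    return [b[0] - a[0] for a, b in zip(trace, trace[1:])]

trace = [(0, 0x1000), (200, 0x1040), (1200, 0x2000), (1400, 0x9000)]
print(bandwidth_mb_s(trace))  # 4 lines * 64 B over 1400 ns
print(intervals_ns(trace))    # [200, 1000, 200]
```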
We instrument the Xen source code with a number of software counters to record the events that we want to trace. The overhead is negligible.

IV. EVALUATION AND ANALYSIS

This section presents the performance numbers that we have collected. Figure 3 shows the IPC of the benchmarks
we have tested. IPC is calculated based on the number of architectural instructions, not the number of micro-ops. In this figure, dom0 on the X-axis is the IPC of the processor core hosting Xen Domain0, and DomU is the arithmetic mean of the IPC of all cores hosting Xen DomainUs. To simplify the figure, CloudSuite mean is the arithmetic mean of the CloudSuite benchmarks, and SPEC INT mean/SPEC FP mean represent the arithmetic means of all SPEC CPU INT2006/FP2006 benchmarks. It can be observed from the figure that desktop cloud workloads show lower IPC than traditional benchmarks. For conventional CPU-intensive applications, the IPC can reach 2 or above. In contrast, the IPC of desktop cloud workloads is only 0.36 on average. Considering that the Xeon E5620 has a theoretical peak IPC of 4.0, an IPC of 0.36 is a strong indication that the processor is being used ineffectively. As a first attempt to understand the cause of the low IPC, we break down the pipeline stall cycles in Figure 4. The figure shows that the register allocation stalls, reservation station full stalls, reorder buffer full stalls, and branch misprediction stalls of desktop cloud workloads are actually lower than those of the other benchmarks. They are therefore not the cause of the low IPC of desktop cloud workloads. The rest of this section shows that the caches and TLBs are the real culprits.

A. I-Cache

The efficiency of instruction fetch directly affects pipeline utilization. The L1 and L2 instruction reference misses per kilo-instruction (MPKI) are shown in Figure 5, in which domuX represents the DomainU hosted by core X.

Fig. 5: L1 and L2 instruction misses per kilo-instruction.

Because the domuXs show similar results in most of our runs, only their arithmetic means are listed in the rest of this section. Figure 5 shows that the I-Cache miss rates of Domain0 and DomainU are much higher than those of the SPEC CPU2006 benchmarks.
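The metrics plotted in Figures 3, 5, 9, and 11 are simple ratios over raw counter values; a minimal sketch of that arithmetic, where the counter values are illustrative numbers chosen only to land near the averages quoted in this paper:

```python
# Deriving the metrics used in this section from raw counter values.
# The counts below are made up for illustration, not measured data.
def ipc(instructions_retired, core_cycles):
    """Instructions per cycle, as in Figure 3."""
    return instructions_retired / core_cycles

def mpki(misses, instructions_retired):
    """Misses per kilo-instruction, as in Figures 5, 9, and 11."""
    return misses * 1000 / instructions_retired

def miss_rate(misses, references):
    """Fraction of references that miss, as in Figures 6-8."""
    return misses / references

counters = {"inst_retired": 2_000_000, "cpu_clk": 5_500_000,
            "l1i_misses": 25_000, "l1i_refs": 200_000}
print(ipc(counters["inst_retired"], counters["cpu_clk"]))      # ~0.36
print(mpki(counters["l1i_misses"], counters["inst_retired"]))  # 12.5
print(miss_rate(counters["l1i_misses"], counters["l1i_refs"])) # 0.125
```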
Specifically, the I-Cache miss rate of DomainU (23% on average) is 34 times higher than that of SPEC INT2006 and 100 times higher than that of SPEC FP2006. The I-Cache miss rate of DomainU is also higher than that of the CloudSuite benchmarks. Such high I-Cache miss rates can be expected to cause frequent pipeline stalls and significant performance degradation. The L2 instruction reference miss rate of desktop cloud workloads is also 27 times higher than that of the traditional ILP benchmarks. Such a high L2 miss rate leads to a high I-Cache miss penalty, causing further performance degradation.

While we are unable to inspect the contents of the I-Cache on a real machine to fully understand the true cause of the high number of I-Cache misses, we believe one of the main reasons is the frequent context switching in Xen. The Xen hypervisor uses a split device driver model: the real device driver is located in Domain0, and the device driver of a DomainU needs to interact with Domain0 to complete I/O requests. Under full virtualization, every device access (normally through memory-mapped registers) must be intercepted and emulated by the hypervisor. A typical I/O operation needs a number of traps to operate a device, leading to a long execution path. In a virtualized environment, the desktop workloads issue many network and disk operations, which inevitably cause many context switches (between the hypervisor, Domain0, and DomainU). In addition, when multiple DomainUs run on the same physical core, switching between different DomainUs causes even more context switches. We instrument Xen to collect the number of context switches. Table III shows the vmexits (switches from DomainU to the hypervisor), including 7156 switches per second caused by I/O instructions and 6607 caused by accesses to the Advanced Programmable Interrupt Controller (APIC), plus 1002 inter-DomainU switches per second, when running two VMs on one core. While dedicating one physical core to a single VM could reduce the number of context switches, it contradicts the essence of virtualization: resource sharing. Running more VMs on one core can be expected to cause even more context switches.

TABLE III: Number of context switches per second.
Vmexits caused by I/O instructions: 7156
Vmexits caused by APIC accesses: 6607
Inter-DomainU domain switches: 1002

Fig. 6: L1 data cache miss rate.

B. L1 D-Cache

Figure 6 displays the L1 data cache miss rates. It shows that the average L1 data cache miss rate of Xen (10.2%) is slightly higher than that of the other benchmarks, while the miss rate of Domain0 is similar to that of the other benchmarks. Compared with traditional applications, neither shows much performance degradation, which indicates that the existing L1 data cache works well with desktop cloud workloads.

C. TLB

The TLB (translation look-aside buffer) is a common hardware structure for accelerating virtual-to-physical address translation. A TLB miss triggers a process called a page walk to load the associated translation into the TLB. To better support virtualization, the Intel Nehalem architecture introduced a Virtual Processor ID (VPID) in its TLBs. The VPID avoids flushing TLB entries when switching from one VM to another.

Fig. 7: ITLB and DTLB miss rates ((a) ITLB, (b) DTLB). Fig. 8: L2 cache miss rate. Fig. 9: L2 cache misses per kilo-instruction. Fig. 10: The number of cache-to-cache data transfers. Fig. 11: LLC misses per kilo-instruction.

Figure 7 presents the ITLB and DTLB miss rates. It shows that desktop cloud workloads have several times higher ITLB and DTLB miss rates than the other benchmarks. Babka et al.
[16] have shown that the average TLB miss penalty is larger than the average L2 cache miss penalty, even with hardware page walks. A TLB miss rate of 4.5% may seem small, but it can noticeably degrade overall performance. One of the main reasons for the high TLB miss rates is that many live contexts compete for the same set of TLB entries, causing many conflict misses.

D. L2 cache

Figure 8 displays the L2 miss rates. It shows that the average L2 miss rate of Domain0 is lower than that of the other benchmarks. In contrast, the average L2 miss rate of DomainU (68%) is significantly higher than that of the other benchmarks. Figure 9 presents the L2 MPKI. It shows that the L2 MPKI of DomainU is much greater than that of the traditional ILP and TLP benchmarks, while the L2 MPKI of Domain0 is much lower than that of the other benchmarks. These results indicate that while the L2 cache works well for Domain0, it is not adequate for DomainU.

E. Cache-to-cache data transfers

The percentage of L2 misses that hit another on-chip L2 cache (i.e., cache-to-cache transfers) is used as the metric for core-to-core communication. Figure 10 shows that there are very few cache-to-cache transfers among different DomainUs. Only about 0.02% of the L2 cache misses of a DomainU get data from a sibling L2 cache belonging to another DomainU running on the same chip. In contrast, the core-to-core communication between Domain0 and the DomainUs is high, similar to that of PARSEC and TPC-C.

F. Last Level Cache

The last-level cache is the last line of defense in mitigating the speed gap between the processor and off-chip memory. Most of today's server processors devote a large chip area to the LLC. Figure 11 shows that LLC misses occur rarely for all benchmarks. These results indicate that the existing (large) LLC works wonderfully for all benchmarks, even though the L1 instruction cache and the L2 cache do not work as well.

Fig. 12: Impact of hardware prefetchers.

G. Prefetcher

In order to reduce the number of cache misses, many processors use hardware prefetchers to load cache lines before they are requested. Such mechanisms have been proven to perform well with traditional benchmarks [9]. Figure 12 presents the TLB miss rates, cache miss rates, and IPC of DomainU with the hardware prefetcher on and off. The results show that enabling the prefetcher noticeably increases the miss rate of each cache level and degrades performance. This is a strong indication that the hardware prefetcher fails in its role of reducing the number of cache misses. We speculate that frequent context switching disrupts the access patterns, rendering the prefetcher completely ineffective.

H. Off-chip memory references

We use HMTT to collect off-chip memory reference traces of desktop cloud workloads and several memory-intensive benchmarks from SPEC CPU2006. Figure 13 shows the memory bandwidth consumed by each tested benchmark. In this figure, 4vm on xen means four VMs (DomainUs) running
on Xen in parallel, while 1vm on xen means only one VM (DomainU) running on Xen. It shows that even with four VMs, the Xen-based platform uses only 714MB/s, which is less than one-eighth of the peak bandwidth. Most of the SPEC CPU2006 benchmarks use much more memory bandwidth, with the exceptions of bzip2, h264ref, and povray. The average memory bandwidth consumed by SPEC CPU2006 is three times higher than that of Xen with four VMs.

Fig. 13: Memory bandwidth consumption. Desktop cloud (left) and SPEC CPU2006 (right). Fig. 14: Memory bandwidth of 4 VMs on Xen. Fig. 15: Memory reference interval of 4 VMs on Xen.

To better understand the dynamic behavior of memory references, Figure 14 plots the consumed memory bandwidth over time. Each point in this figure represents one millisecond. The results show that the consumed memory bandwidth does not fluctuate much over time, with only a few spikes. The average is below 1GB/s, only a few points reach 3GB/s, and the highest is only 4GB/s. These numbers indicate that the VMs do not access memory frequently and do not consume much memory bandwidth. We select 10-microsecond trace segments and compute the time interval between every two consecutive memory accesses. Figure 15 shows that memory references are sparse for desktop cloud workloads.

I. User/System breakdown

Fig. 16: User/System breakdown of different benchmarks.

Figure 16 shows the percentage of kernel and application instructions. Desktop cloud workloads have a much higher percentage of kernel (Xen and guest OS) instructions (ranging from 23% to 33%) than the traditional ILP benchmarks. In particular, SPEC CPU2006 rarely executes kernel instructions (less than 1%). In contrast, the CloudSuite benchmarks execute mostly in kernel mode (58% on average). These results demonstrate that the efficiency of kernel code plays a major role in the performance of desktop cloud workloads and the CloudSuite benchmarks.

J. Desktop workloads on native OS

To illustrate the impact of virtualization on desktop workloads, we run a few representative operations of the desktop workloads in a native OS environment. The results are shown in Table IV. Compared with running in the virtualized environment, desktop workloads running on the native OS have much higher IPC, much lower numbers of L1I/L2/LLC misses, and much lower TLB miss rates. In particular, virtualization increases the number of L1 instruction
misses by five times and the TLB miss rate by over 12 times. These numbers are a strong indication that virtualization can have a dramatic performance impact on desktop workloads.

TABLE IV: Performance with and without virtualization (IPC, L1I/L2/LLC MPKI, and TLB miss rates; native OS vs. DomU).
ITLB miss rate: 0.09% (native OS) vs. 1.30% (DomU)
DTLB miss rate: 0.35% (native OS) vs. 4.36% (DomU)

K. Discussion

Our experiments clearly demonstrate that the prevailing processor architecture, optimized for traditional desktop and high-performance computing applications, does not serve desktop cloud workloads well. Further optimizations are needed to run desktop cloud workloads efficiently. For instance, Domain0 and the DomainUs have quite different characteristics: Domain0 is in charge of handling all I/O operations, while a DomainU runs a guest OS and user applications. Such a functional split between Domain0 and DomainU is common in virtualization environments; Hyper-V [32] adopts it as well. This calls for both functional and micro-architectural heterogeneity among on-chip cores. In addition, the desktop cloud workloads in our experiments exhibit high I-Cache and L2 cache miss rates but very low LLC miss rates. These results indicate that opportunities exist for cache hierarchy optimizations. The core-to-core communication pattern of desktop cloud workloads follows a star-like topology, with Domain0 as the central core and point-to-point communication between the Domain0 core and each of the DomainU cores; there is little or no communication between the DomainU cores. Efficient support for this pattern requires a new on-chip topology or special extensions to existing topologies such as the bus, ring, crossbar, and mesh.

V. CONCLUSION AND FUTURE WORK

As desktop cloud becomes more and more popular, it is important to understand whether existing systems can handle desktop cloud workloads efficiently.
In this paper, we use a large set of performance numbers gathered from a real desktop cloud to show that modern processors are ineffective in many ways for desktop cloud workloads. During our study, we used instrumented code, performance counters, and a special trace-collection device to analyze desktop cloud workloads and the SPEC CPU2006, TPC-C, PARSEC, and CloudSuite benchmarks. We have collected performance numbers for the I-Cache, L1 D-Cache, TLBs, L2 cache, LLC, cache-to-cache transfers, off-chip memory references, and hardware prefetchers. Our experimental results demonstrate that a processor architecture optimized for traditional benchmarks is not efficient for desktop cloud workloads. Compared to traditional benchmarks, desktop cloud workloads have much higher I-Cache miss rates, much higher L2 miss rates, and several times higher ITLB/DTLB miss rates. In addition, desktop cloud workloads have excellent LLC performance, very few cache-to-cache transfers, and high percentages of kernel instructions. The drawback of measuring a real system is that it is difficult to obtain detailed enough information to uncover the root causes of some of the problems. We are building a software simulation platform to assist in performing in-depth analysis. We are also using the empirical results to guide the design of an efficient desktop cloud system.

ACKNOWLEDGMENT

We would like to thank many group members for their help and feedback on the experimental setup and paper writing. We are grateful to the anonymous reviewers for their encouraging comments. This work is supported by the Strategic Priority Research Program of the Chinese Academy of Sciences, Grant No. XDA.

REFERENCES

[1] N. E. Jerger, D. Vantrease, and M. Lipasti, "An Evaluation of Server Consolidation Workloads for Multi-Core Designs," in Proceedings of the 2007 IEEE 10th International Symposium on Workload Characterization (IISWC 07), Washington, DC, USA, 2007.
[2] A. Kochut, K. Beaty, H. Shaikh, and D. G.
Shea, Desktop workload study with implications for desktop cloud resource optimization, in Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW 10), 2010 IEEE International Symposium on, 2010, pp [3] J. L. Hennessy and D. A. Patterson, Computer Architecture - A Quantitative Approach, 4th ed. Morgan Kaufmann, [4] J. Sharkey and D. Ponomarev, Balancing ILP and TLP in SMT Architectures through Out-of-Order Instruction Dispatch, in Proceedings of the 2006 International Conference on Parallel Processing (ICPP 06), Washington, DC, USA, 2006, pp [5] Michael Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware, in 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 12), [6] H. Xi, J. Zhan, Z. Jia, X. Hong, L. Wang, L. Zhang, N. Sun, and G. Lu, Characterization of real workloads of web search engines, in 2011 IEEE International Symposium on Workload Characterization (IISWC 11), 2011, pp [7] K. Adams and O. Agesen, A comparison of software and hardware techniques for x86 virtualization, in Proceedings of the 12th international conference on Architectural Support for Programming Languages and Operating Sys- 9
10 tems (ASPLOS 06), New York, NY, USA, 2006, pp [8]Y.Bao,M.Chen,Y.Ruan,L.Liu,J.Fan,Q.Yuan,B. Song, and J. Xu, HMTT: a platform independent fullsystem memory trace monitoring system, SIGMETRICS Perform. Eval. Rev., vol. 36, no. 1, pp , [9] S. Palacharla and R. E. Kessler, Evaluating stream buffers as a secondary cache replacement, in Proceedings of the 21st annual International Symposium on Computer Architecture (ISCA 94), Los Alamitos, CA, USA, 1994, pp [10] A. Menon, J. R. Santos, Y. Turner, G. (John) Janakiraman, and W. Zwaenepoel, Diagnosing performance overheads in the xen virtual machine environment, in Proceedings of the 1st ACM/USENIX international conference on Virtual execution environments (VEE 05), New York, NY, USA, 2005, pp [11] Rodrigo N. Calheiros, Rajiv Ranjan, Anton Beloglazov, Cesar A. F. De Rose, and Rajkumar Buyya, CloudSim: A Toolkit for Modeling and Simulation of Cloud Computing Environments and Evaluation of Resource Provisioning Algorithms, Software: Practice and Experience (SPE), Volume 41, Number 1, pp , ISSN: , Wiley Press, New York, USA, January, [12] Rodrigo Calheiros, Rajiv Ranjan and Rajkumar Buyya, Virtual Machine Provisioning Based on Analytical Performance and QoS in Cloud Computing Environments, Proceedings of the 40th International Conference on Parallel Processing (ICPP 11), Taipei, Taiwan, September 13-16, [13] J. Casazza, Redefining Server Performance Characterization for Virtualization Benchmarking, Intel Technology Journal, vol. 10, no. 03, Aug [14] O. Tickoo, R. Iyer, R. Illikkal, and D. Newell, Modeling virtual machine performance: challenges and approaches, SIGMETRICS Perform. Eval. Rev., vol. 37, no. 3, pp , [15] C. Bienia, S. Kumar, J. P. Singh, and K. Li, The PAR- SEC benchmark suite: characterization and architectural implications, in Proceedings of the 17th international conference on Parallel Architectures and Compilation Techniques (PACT 08), New York, NY, USA, 2008, pp [16] V. Babka and P. 
Tuma, Investigating Cache Parameters of x86 Family Processors, in Proceedings of the 2009 SPEC Benchmark Workshop on Computer Performance Evaluation and Benchmarking, Berlin, Heidelberg, 2009, pp [17] L. A. Barroso and U. Hlzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture, Jan [18] P. Gepner and M. F. Kowalik, Multi-Core Processors: New Way to Achieve High System Performance, in Proceedings of the International Symposium on Parallel Computing in Electrical Engineering (PARELEC 06), Washington, DC, USA, 2006, pp [19] Ofri Wechsler. Inside Intel Core Microarchitecture: Setting New Standards for Energy-Efficient Performance. Technology@Intel Magazine, March [20] R. Raghavendra, P. Ranganathan, V. Talwar, Z. Wang, and X. Zhu, No power struggles: coordinated multi-level power management for the data center, in Proceedings of the 13th international conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 08), New York, NY, USA, 2008, pp [21] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, Xen and the art of virtualization, in Proceedings of the nineteenth ACM Symposium on Operating Systems Principles (SOSP 03), New York, NY, USA, 2003, pp [22] L. Ciortea, C. Zamfir, S. Bucur, V. Chipounov, and G. Candea, Cloud9: a software testing service, SIGOPS Oper. Syst. Rev., vol. 43, no. 4, pp. 5-10, [23] SPEC CPU2006 Benchmark Suite. [24] IBM Smart Business Desktop Cloud. [25] OProfile Tools. [26] TPC-C. [27] The Apache Cassandra Project. [28] CloudSuite. [29] Desktop Cloud Solutions of HuaWei. [30] Desktone Desktop Cloud. cloud/. [31] Citrix XenDesktop. [32] Hyper-V. [33] ware View (ware VDI). [34] IBM Redbooks: Performance Test of Virtual Linux Desktop Cloud Services on System Z. [35] ware View Planner. planner. [36] SPECvirt sc2010: sc2010/. [37] mark: 10