Cloud Gaming: Understanding the Support from Advanced Virtualization and Hardware


Ryan Shea, Simon Fraser University, Burnaby, Canada
Di Fu, Simon Fraser University, Burnaby, Canada
Jiangchuan Liu, Simon Fraser University, Burnaby, Canada

Abstract: Fueled by elastic resource provisioning, reduced costs and unparalleled scalability, cloud gaming brings immense benefits by expanding the user base to the vast number of resource-constrained thin clients. Existing cloud gaming platforms have mainly focused on private, non-virtualized environments with proprietary hardware. Modern public cloud platforms rely heavily on virtualization for efficient resource sharing, whose potential has yet to be explored. Migrating gaming to a public cloud is non-trivial, however, particularly considering the overhead of virtualization and the fact that the GPUs needed for game rendering have long been an obstacle to virtualization. This paper takes a first step towards bridging online gaming systems and public cloud platforms. We present a systematic study on the diverse aspects of migrating gaming services to a virtualized cloud environment. To this end, we closely examine the technology evolution of GPU virtualization and pass-through, and measure the performance of both earlier and advanced solutions available in the market. We then present the design and implementation of a fully virtualized cloud gaming platform with the latest hardware support. We address the issues in each and every module of the platform, including the choice of hardware or software video encoding, the configuration and detailed power consumption of the thin client, and the impact of streaming protocols. We demonstrate that, with the latest hardware and virtualization support, gaming over a virtualized cloud can be made possible with careful optimization and integration of the different modules. We also highlight the critical challenges towards full-fledged deployment of gaming services over the public virtualized cloud.
Index Terms: Video gaming, cloud computing, Graphics Processing Unit (GPU), virtualization

I. INTRODUCTION

Fueled by elastic resource provisioning, reduced costs and unparalleled scalability, cloud computing is drastically changing the operation and business models of the IT industry. A wide range of conventional applications, from file sharing and document synchronization to media streaming, have experienced a great leap forward in terms of system efficiency and usability by leveraging cloud resources with computation offloading. Recently, advances in cloud technology have expanded to facilitate offloading more complex tasks such as high definition 3D rendering, which turns the idea of Cloud Gaming into a reality. Cloud gaming, in its simplest form, renders an interactive gaming application remotely in the cloud and streams the scenes as a video sequence back to the player over the Internet. A cloud gaming player interacts with the application through a thin client, which is responsible for displaying the video from the cloud rendering server as well as collecting the player's commands and sending the interactions back to the cloud. This new paradigm of gaming services brings immense benefits by expanding the user base to the vast number of less-powerful devices that support thin clients only, particularly smartphones and tablets [1]. There have been significant studies exploring the potential of the cloud for gaming and addressing the challenges therein [2] [3] [4]. We have seen the development of such open-source cloud gaming systems as GamingAnywhere for the Android OS [5]. We have also seen successful industrial deployments, e.g., OnLive [6] and Gaikai [7], both with multi-million user bases. The 380 million dollar purchase of Gaikai by Sony [8], an industrial giant in digital entertainment and consumer electronics, and the forthcoming integration of Gaikai and Sony's PlayStation 4, show that cloud gaming is beginning to move into the mainstream.
These existing cloud gaming platforms tended to focus on private, non-virtualized environments with proprietary hardware, where each user is often mapped in a one-to-one fashion to a physical machine in the cloud. Modern public cloud platforms rely heavily on virtualization, which allows multiple virtual machines to share the underlying physical resources, making truly scalable pay-as-you-go service possible. In other words, despite their simplicity and ease of deployment, existing cloud gaming platforms have yet to unleash the full potential of the modern cloud towards expansion to ultra-large scale with flexible services. Migrating gaming to a public cloud, e.g., Amazon EC2, is non-trivial, however. The system modules should be carefully planned for effective virtual resource sharing with minimum overhead. Moreover, as the complexity of 3D rendering increases, modern game engines rely not only on the general purpose CPU for computation, but also on dedicated graphics processing units (GPUs). While GPU cards have been virtualized to some degree in modern virtualization systems, their performance has historically been poor given the ultra-high memory transfer demand and the unique data flows [9]. Recent advances in both hardware and software design have not only increased the usability and performance of GPUs, but created a new class of GPUs specifically for virtualized

environments. A representative is NVIDIA's recently released GRID-class GPUs, which allow several virtualized systems to each utilize a dedicated GPU by placing several logical GPUs on the same physical GPU board. These new hardware advances greatly assist the deployment of online gaming systems in a public cloud environment. This paper takes a first step towards bridging online gaming systems and public cloud platforms. We present a systematic study on the diverse aspects of migrating gaming services to a virtualized cloud environment. To this end, we first closely examine the technology evolution of GPU virtualization and pass-through, and measure the performance of both earlier and advanced solutions available in the market. We then present the design and implementation of a fully virtualized cloud gaming platform with the latest hardware support for both remote servers and local clients. We address the issues in each and every module of the platform, including the choice of hardware or software video encoding, the configuration and detailed power consumption of the thin client, and the impact of streaming protocols. We demonstrate that, with the latest hardware and virtualization support, gaming over a virtualized cloud can be made possible with careful optimization and integration of the different modules. We also highlight the critical challenges towards full-fledged deployment of gaming services over the public virtualized cloud. The remainder of this paper is organized as follows: In Section II, we present the background and related work on cloud gaming and machine virtualization. Section III examines the state of the art of GPU virtualization and pass-through. We then present the design and implementation of a fully virtualized cloud gaming platform in Section IV, and systematically examine its key modules, including video coding, the thin client, and streaming protocols, in Sections V, VI, and VII, respectively.
Section VIII further discusses the practical issues and concludes the paper.

II. BACKGROUND AND RELATED WORK

A. Gaming over Cloud

Using remote servers, and more recently, cloud computing, for gaming has attracted significant interest. Winter et al. [10] designed a hybrid framework for streaming interactive games to thin clients, which rendered some scenes locally at the thin client and thus greatly reduced the bandwidth requirements on the clients. To better understand the performance of thin-client gaming applications, Chang et al. [11] proposed a measurement approach to study the implementations of thin-client games even when they are closed-source. A follow-up study by Lee et al. [12] further investigated the latency issues of playing thin-client games under different network configurations. To understand the user-perceived Quality-of-Experience (QoE) in cloud gaming systems, Jarschel et al. [13] conducted a subjective study, in which the selected participants were asked to play slow, medium, and fast videogames under different latency and packet-loss conditions. Hemmati et al. [14], on the other hand, examined the player's experience of cloud gaming with a selective object encoding method. A recent study by Claypool et al. [3] provided a detailed study of OnLive [15], a commercially available cloud gaming system, and closely analyzed its bitrates, packet sizes and inter-packet times for both upstream and downstream game traffic. To further clarify its streaming quality, Shea et al. [2] measured the real-world performance of OnLive under different types of network and bandwidth conditions, revealing critical challenges towards the widespread deployment of cloud gaming. Wu et al. [16] further conducted a series of passive measurements on a large-scale cloud gaming platform and identified the performance issues of queueing delay as well as response delay among users. Cai et al.
[1] suggested that existing cloud service providers could also provide gaming as a service in their business models. Besides investigating commercial cloud gaming systems, Lee et al. [4] proposed a system design that delivers real-time gaming interactivity as fast as traditional local client-side execution, despite the network latencies. Huang et al. [5] provided an open-source cloud gaming system, GamingAnywhere, which has been deployed on the Android OS with extensive experiments performed. These studies have mainly focused on private, non-virtualized cloud environments with proprietary hardware. The potential of advanced hardware for both thin clients and public clouds with resource virtualization has yet to be understood.

B. Machine Virtualization

Computer virtualization can be roughly classified into two major classes: paravirtualization and hardware virtualization. Paravirtualization (PVM) is one of the first adopted versions of virtualization and is still widely deployed today. It requires no special hardware to realize virtualization, instead relying on special kernels and drivers. The kernel sends privileged system calls and hardware access directly to a hypervisor, which in turn decides what to do with the request. The use of special kernels and drivers means a loss of flexibility in terms of choosing the operating system. In particular, PVM must use an OS that can be modified to work with the hypervisor. Typical PVM solutions include Xen [17] and User Mode Linux. Amazon, the current industry leader in cloud computing, heavily uses Xen to power its EC2 platform [18]. Hardware Virtual Machine (HVM) is the lowest level of virtualization, requiring special hardware capabilities to trap privileged calls from guest domains. It allows a machine to be fully virtualized without the need for any special operating systems or drivers on the guest system. Most modern CPUs are built with HVM capabilities, often called virtualization extensions.
They detect when a guest VM tries to make a privileged call to a system resource. The hardware intercepts this call and sends it to a hypervisor, which decides how to handle it. It has been noted, however, that HVMs can also have the highest virtualization overhead and as such may not

(a) Shared Device  (b) Pass-through Device
Fig. 1: Shared vs Pass-through Device Simplified Architecture

always be the best choice for a particular situation [19]. Yet paravirtualized I/O drivers can alleviate such overhead; one example of a paravirtualized driver package is the open-source VirtIO [20]. Representative HVM solutions include VMware Server, KVM, and VirtualBox. Using PVM and HVM, the CPU and many other computer resources have been well virtualized, laying a solid foundation for today's cloud computing. There have also been initial attempts to virtualize the GPU, e.g., in VMware [9]. Yet full virtualization of the GPU remains difficult, because the GPU possesses a fundamentally different architecture from the CPU. A GPU consists of hundreds or even thousands of cores, allowing a vast number of threads to run simultaneously to solve computationally intensive tasks. Due to this intrinsically parallel nature, a GPU demands much higher memory bandwidth than a CPU, and thus existing hardware designs around the GPU are mostly throughput-driven. For instance, the GPU's internal bandwidth is optimized by using dedicated memory, the newest generation of which is GDDR5. Externally, the GPU interfaces with the motherboard through a PCI Express (PCI-E) expansion slot, which has much higher bus throughput than older PCI buses. Unfortunately, the data transfer between main memory and the GPU is still a bottleneck, and virtualization can dramatically degrade memory transfer performance. Furthermore, using SIMD (Single Instruction, Multiple Data), a GPU is able to unleash its computing power on data-parallel computations. This, however, makes a GPU harder to share among multiple VMs when each of them has its own distinct task to perform. The CPU, on the other hand, excels at task-parallel computations with its multiple cores, making virtualization easier, with concurrent access for different VMs and their disparate tasks.

III. GPU VIRTUALIZATION AND PASS-THROUGH: WHERE ARE WE?
Recent hardware advances have enabled virtualization systems to perform a one-to-one mapping between a device and a virtual machine guest, allowing hardware devices that do not virtualize well, including GPUs, to still be used by a VM. Both Intel and AMD have created hardware extensions for such device pass-through, namely Intel's VT-d and AMD's AMD-Vi. They work by making the processor's input/output memory management unit (IOMMU) configurable, allowing the system's hypervisor to reconfigure the interrupts and direct memory access (DMA) channels of a physical device, so as to map them directly into one of the guests [21]. As illustrated in Figure 1a, data flows through DMA channels from the physical device into the memory space of the VM host. The hypervisor then forwards the data to a virtual device belonging to the guest VM. The virtual device interacts with the driver residing in the VM to deliver the data to the guest's virtual memory space. Notifications are sent via interrupts and follow a similar path. Figure 1b shows how a one-to-one device pass-through to a VM is achieved. As can be seen, the DMA channel can now flow data directly from the physical device to the VM's memory space. Also, interrupts can be mapped directly into the VM through the use of remapping hardware, which the hypervisor configures for the guest VM. The advanced pass-through grants a single VM a one-to-one hardware mapping between itself and the GPU. These advances have allowed cloud platforms to offer virtual machine instances with GPU capabilities. For example, Amazon EC2 has added a new instance class known as GPU Instances, which have dedicated NVIDIA GPUs for graphics and general purpose GPU computing. The performance implications and overhead of pass-through have been examined for various pass-through devices, e.g., network interface controllers [22][23][24].
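As a purely conceptual illustration of the remapping described above, the hypervisor-managed IOMMU can be thought of as a page-granular lookup table from guest-physical to host-physical addresses. The toy model below is our own simplification for intuition only; it does not reflect how any real IOMMU is programmed:

```python
# Toy model of IOMMU DMA remapping: the hypervisor programs a
# translation table so that device DMA issued with guest-physical
# addresses lands in the correct host-physical pages, and any
# access to an unmapped page faults instead of corrupting memory.
PAGE = 4096  # 4 KiB pages

class ToyIOMMU:
    def __init__(self):
        self.table = {}  # guest page frame number -> host page frame number

    def map(self, guest_pfn, host_pfn):
        self.table[guest_pfn] = host_pfn

    def translate(self, guest_addr):
        pfn, offset = divmod(guest_addr, PAGE)
        if pfn not in self.table:
            raise PermissionError("DMA fault: unmapped guest page")
        return self.table[pfn] * PAGE + offset

iommu = ToyIOMMU()
iommu.map(0x10, 0x80)  # hypervisor maps one guest page to a host page
print(hex(iommu.translate(0x10 * PAGE + 0x2c)))  # 0x8002c
```

The key property illustrated is isolation: DMA can only reach pages the hypervisor has explicitly mapped for that guest.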
There have also been recent studies on enabling multiple VMs to access CUDA-enabled GPUs [25][26], as well as performance analyses of CUDA applications using a GPU pass-through device in Xen [27]. There have also been studies on GPU resource scheduling in the cloud [28][29]. Despite these pioneering efforts, the suitability of modern implementations of GPU pass-through devices for cloud gaming has seldom been explored. To understand the performance improvement and whether the technology is ready for gaming over a public cloud, we have conducted a series of experiments on both an earlier GPU pass-through platform and the latest platform. With direct access to the hardware, these local virtualized platforms facilitate the measurement of virtualization overhead on game applications, as well as of the energy consumption, which can hardly be obtained from a public cloud platform.

1) Earlier GPU Pass-through Platform (2011): Our first test system is a server with an AMD Phenom II 1045T 6-core processor running at 2.7 GHz. The motherboard's chipset is based on AMD's 990X, and we enabled AMD-V and AMD-Vi in the BIOS, as they are required for Hardware Virtual Machine (HVM) support and device pass-through. The server is equipped with 16 GB of 1333 MHz DDR-3 SDRAM, and the physical network interface is a 1000 Mb/s Broadcom Ethernet adapter attached to the PCI-E bus. The GPU is an AMD-based Sapphire HD 5830 Xtreme with 1 GB of GDDR5 memory. The Xen 4.0 hypervisor is installed on our test system, and the host and VM guests used Debian as their operating system. We configured Xen to use the HVM mode, since GPU pass-through requires hardware virtualization extensions. The VM is given access to 6 VCPUs and 848 MB of RAM.

2) Advanced GPU Pass-through Platform (2014): Our second system is a state-of-the-art server with an Intel Haswell Xeon E3 quad-core (8-thread) processor. The motherboard utilizes Intel's C226 chipset, one of Intel's latest server chipsets, supporting device pass-through using VT-d. The server has 16 GB of 1600 MHz ECC DDR-3 memory installed. Networking is provided through an Intel I217-LM 1000 Mb/s Ethernet card. We have also installed a Sapphire R9 280X GPU with 3 GB of GDDR5 memory, a representative of advanced GPUs on the market.
The Xen 4.1 hypervisor is installed and the VM guests again use Debian as their operating system. On this basis, we configure Xen to use the HVM mode, and the VM is given access to 8 VCPUs and 848 MB of RAM.

3) Baseline and Benchmarks: As the baseline for comparison, for both systems we run each test on a bare-metal setup with no virtualization, i.e., the system has direct access to the hardware. The same drivers, packages, and kernel were used as in the previous setup. This configuration enabled us to calculate the amount of performance degradation that a virtualized system can experience. To determine the overall system's power consumption, referred to as wall power, we have wired a digital multi-meter (Mastech MAS-345) into the AC input power line of our system. The meter has an accuracy rating of ±1.5%. We read the data from our meter using the PCLink data logger installed on a workstation, and collect samples every second throughout our experiments. Both systems are powered by 750 Watt 80 Plus Gold power supplies with active power factor correction. To compare the pass-through performance, we have selected two game engines, both of which have cross-platform implementations and can run natively on our Debian Linux machine. The first is Doom 3, a popular game released in 2004 that utilizes OpenGL to provide high quality graphics. The second is Unigine's Sanctuary benchmark, an advanced benchmarking tool that runs on both Windows and Linux. The Unigine engine uses the latest OpenGL hardware features to provide rich graphics that are critical to many state-of-the-art games. For each of the following experiments, we run each benchmark three times and graph the average.

Fig. 2: Doom 3 Performance
Fig. 3: Unigine Sanctuary Performance
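The wall-power methodology above reduces to a simple numerical integration: with one wattage sample per second, total energy in joules is the sum of the samples, and virtualization overhead is the relative difference between runs. The sketch below illustrates the arithmetic; the wattage values and function names are our own illustrative choices, not measurements from our meter:

```python
# Integrate 1 Hz power samples (watts) into total energy (joules),
# then compute the relative energy overhead of a virtualized run
# versus a bare-metal run. Sample values below are hypothetical.

def energy_joules(watt_samples, interval_s=1.0):
    """Approximate energy as the sum of power times the sample interval."""
    return sum(w * interval_s for w in watt_samples)

def overhead_percent(baseline_j, virtualized_j):
    """Extra energy consumed by the virtualized run, in percent."""
    return (virtualized_j - baseline_j) / baseline_j * 100.0

if __name__ == "__main__":
    bare = [300.0] * 60   # hypothetical bare-metal run: 60 s at 300 W
    vm = [305.0] * 62     # hypothetical VM run: slightly longer and hungrier
    print(energy_joules(bare))  # 18000.0
    print(round(overhead_percent(energy_joules(bare),
                                 energy_joules(vm)), 1))  # 5.1
```

Comparing in joules rather than watts matters whenever the two runs take different amounts of time, which is why the Doom 3 results below are reported in joules while the fixed-duration Sanctuary results are reported in watts.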
For Doom 3 and Sanctuary, we give the results in frames per second (fps). We also provide the amount of energy consumed by each system, which we measure through our meter attached to the system's AC power input.

A. Frame Rate and Energy Consumption

For Doom 3, we use the ultra-high graphics settings with 8x Anti-Aliasing (AA) and 8x Anisotropic Filtering (AF), at a display resolution of 1920 x 1080 pixels (1080p). To perform the actual benchmark, we utilize Doom 3's built-in time demo, which loads a predefined sequence of game frames as fast as the system will render them. In Figure 2, we show the results of frame rates, running on our bare-metal systems as well as on the virtualized systems Xen and KVM. We start the discussion with our older 2011 server. Compared with its bare-metal frame rate, the virtualized system dramatically falls over 65%, to 39.2 fps. Our newer 2014 system processes the frames at over 274 fps when running directly on the hardware, and falls less than 3% when run inside a virtual machine.

Doom 3 (Joules), Old: 217% overhead
Unigine Sanctuary (Watts), Old: 3.2% overhead
Doom 3 (Joules), Adv.: 5.2% overhead
Unigine Sanctuary (Watts), Adv.: 31.7 W virtualized, 4.4% overhead
TABLE I: Power Consumption: Games Bare-metal vs Virtual Machine

Table I gives the energy consumption experienced by our machines in this experiment, measured in joules since each system has a different running time in the test. The older system consumes over twice the amount of energy to perform the Doom 3 time demo when the system is virtualized. The advanced platform fares much better with virtualization, adding only 5.4% to energy consumption. This first virtualization experiment makes it clear that device pass-through technology has come a long way in terms of performance and overhead. The advanced platform performs within 3% of the optimal bare-metal result while consuming only 5% more energy. The energy efficiency of this advanced platform is quite impressive, considering that a virtualized system must maintain its host's software hypervisor as well as many hardware facilities such as the IOMMU, shadow TLB, and the CPU's virtualization extensions. To confirm the performance implications with newer and more advanced OpenGL implementations, we next run the Unigine Sanctuary benchmark at 1920 x 1080, with all high-detail settings enabled. To further stress the GPUs, we enable 8x Anti-Aliasing (AA) and 16x Anisotropic Filtering (AF). We run the built-in benchmark mode 3 times and measure the average frame rates and the energy consumption of the system. In this experiment, the running time is consistent between each run and platform, so we express the results in watts (joules per second). The results are given in Figure 3. Once again we see that our older virtualized system shows significant signs of performance degradation when compared to its bare-metal optimal.
The old system drops from 84.4 fps to 51 fps when virtualized, i.e., nearly 40%. The advanced system has near-identical performance whether the game engine is running in a virtualized environment or directly on the hardware. Table I further shows that our advanced system uses measurably more power when virtualized, but still impressively remains at an increase of only 4.4%. The older system appears to have its power consumption increase by only 3.2%; on the surface this looks like a positive trait, but it comes at the cost of a nearly 40% lower video frame rate.

B. GPU Memory Bandwidth

It is known that memory transfer can be a bottleneck for the GPU, which can be even more severe between a VM's memory and the GPU [30]. This would affect the game's performance as well, because both Doom 3 and the Unigine engine must constantly move data from the main memory to the GPU for processing. To understand the impact, we ran a simple memory bandwidth experiment written in OpenCL by NVIDIA.3 We tested three different copy operations: from the host's main memory (DDR3) to the GPU device's global memory (GDDR5), from the device's global memory to the host's main memory, and finally from the device's global memory to another location on the device's global memory. We give the results for the host-to-device and device-to-host experiments in Figure 4.

Fig. 4: Memory Bandwidth by System

As can be seen, for both systems, the bare-metal setup outperforms the VMs in terms of memory bandwidth across the PCI-E bus to the GPU. The older system loses over 40% of its memory transfer speed to and from the host memory when compared to the bare-metal system. The newer, more advanced platform degrades even more, losing over 60% of its memory transfer speed when virtualized.
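The arithmetic behind such a bandwidth benchmark is simple: time a fixed-size transfer and divide bytes by seconds. The sketch below is a plain-Python stand-in for the OpenCL test (an in-memory copy rather than an actual PCI-E transfer); the helper names and the example figures in the final print are hypothetical, chosen only to illustrate the degradation calculation:

```python
import time

def bandwidth_mb_s(n_bytes, seconds):
    """Convert a timed transfer into MB/s (1 MB = 10**6 bytes here)."""
    return n_bytes / seconds / 1e6

def degradation_percent(bare_mb_s, vm_mb_s):
    """Relative slowdown of the virtualized transfer path."""
    return (bare_mb_s - vm_mb_s) / bare_mb_s * 100.0

if __name__ == "__main__":
    # Stand-in for a host-to-device copy: time a large in-memory copy.
    src = bytearray(64 * 1024 * 1024)   # 64 MiB payload
    t0 = time.perf_counter()
    dst = bytes(src)                    # the timed "transfer"
    elapsed = time.perf_counter() - t0
    print(bandwidth_mb_s(len(src), elapsed))
    # Hypothetical example: 5000 MB/s bare metal vs 2000 MB/s in the
    # VM is a 60% degradation, the magnitude observed above.
    print(degradation_percent(5000.0, 2000.0))  # 60.0
```

A real reproduction would replace the in-memory copy with timed `clEnqueueWriteBuffer`/`clEnqueueReadBuffer` calls against the GPU, as the NVIDIA OpenCL sample does.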
Interestingly, all virtualized systems sustain nearly identical performance to their bare-metal counterparts for device-to-device copies: the older platform at 66,400 MB/s and the more advanced platform at 194,800 MB/s. This indicates that once data has been transferred from the virtual machine to the GPU, commands that operate exclusively on the GPU are much less susceptible to the overhead created by the virtualization system. Our results indicate that, even though the gaming performance issues have largely disappeared on the more modern advanced platform, memory transfer from main memory across the PCI-E bus remains severely degraded. As we will show in the next section, this memory bottleneck does not significantly affect the graphics performance of bleeding-edge applications, such as Unigine Heaven or Futuremark's Fire Strike benchmark, at the 1080p resolution. It may, however, become an issue as the gaming industry transitions from 1080p to the higher memory requirements of the 4K (4096 x 2160) UHD resolution.

3 Code is available in the NVIDIA OCL SDK at: com/opencl

Fig. 5: Advanced: 3DMark Score
Fig. 6: Advanced: 3DMark Power Usage

C. Advanced Device Pass-through Experiments

The previous experiments have shown that running gaming applications inside a virtual machine is now feasible with little overhead. They were, however, limited to virtual machines having access to the entire system. In a real-world deployment over a public cloud, we are more interested in sharing one physical machine and its resources among multiple users/clients. To this end, we upgraded our advanced platform to three R9 280X GPUs, so as to test the performance and energy implications of multiple gaming applications running at one time on our platform. Further, to test the performance of other widely deployed 3D rendering interfaces, such as Microsoft's DirectX, we created a Windows Server 2008 virtual machine and a corresponding bare-metal installation on our platform.

1) 3DMark: Windows and DirectX 11 Performance: Our first advanced experiment utilizes Futuremark's widely used gaming benchmark 3DMark. Specifically, we use the Fire Strike module, a 1080p benchmark utilizing the latest DirectX features (DX11) that stresses not only the 3D rendering capabilities of the system but also the physics calculations done on the CPU. We run the entire Fire Strike benchmark 3 times and provide the average graphics and physics scores. Futuremark's score is not as straightforward to interpret on its own as our previous tests using frame rates. Fortunately, a large public searchable dataset is available, which can be used to determine how a particular system performs when compared to thousands of others. Since we are interested in the performance difference between a game engine running directly on the hardware and in a virtual machine, it works well for our purpose.
Once again, we collect the total energy consumption experienced by the system running the benchmark. We first run the benchmark on our bare-metal Windows server to determine the baseline. We then perform experiments using virtual machines, assigning each VM a dedicated R9 280X GPU and 4 GB of RAM. We test the performance of different numbers of VMs, ranging from one to three, concurrently running the 3DMark benchmark. The performance results for the 3DMark experiment are given in Figure 5. We first look at the graphics score: our bare-metal system achieves 7938, and our single-VM experiment achieves a statistically identical score, the two being within 0.1% of each other after 3 runs. The 2-VM case drops to an average of 7907, which is still only a small drop of about 0.3%. Finally, with 3 VMs performing GPU-intensive operations, the average score drops to 7876, which is still less than a 1% drop in performance. That is, in terms of graphics-intensive gaming applications, very little performance is lost when adding more concurrent VMs. Also presented in Figure 5 is 3DMark's CPU-intensive physics benchmark. Immediately we can see a large difference between the bare-metal and the single-virtual-machine performance, even though they have access to the same amount of CPU resources. Our virtual system suffers a 30% performance degradation from its optimal bare-metal performance. We have found that this is because the virtualized system does not properly utilize the hyper-threading provided by the Intel CPU. To determine this, we disabled hyper-threading in our system's BIOS and re-ran Fire Strike on the bare-metal platform, which results in a physics score drop from 10190 to 7256, putting the bare-metal optimal and the virtual machine within 1% of each other. It is likely that some gaming-related optimizations could be applied to the Xen hypervisor to allow it to fully utilize hyper-threading for gaming applications.
Next we run two VMs concurrently and find that the physics performance drops to just over 55% of the bare-metal baseline; running three VMs drops it to under 50% of the bare-metal baseline. Although the single-VM result is surprising, the performance loss in the 2- and 3-VM tests is much more expected. Unlike the GPU-intensive graphics modules in this benchmark, the CPU-intensive physics modules are all processed by a single physical component, the CPU. Given this device sharing, it is no surprise that multiple VMs competing for the same CPU affect each other's performance. 3DMark's physics benchmark has been specifically programmed to use all available CPUs. For many gaming applications, however, the GPU is usually the bottleneck, and thus a shared CPU may not impact the game performance.

The power consumption for this test is given in Figure 6, which shows that our bare-metal system uses approximately 35% less energy than the same application running inside the VM. Further, the power consumption increases greatly as we increase the number of concurrently running VMs, because each additional VM saturates another dedicated physical GPU. Unfortunately, 3DMark only allows our bare-metal platform to run one instance of the Fire Strike benchmark at a time, so we could not directly compare multiple instances of the same gaming application running directly on the hardware to multiple gaming instances running inside different VMs. To determine the difference in performance and energy consumption between the two deployments, we devised another experiment based on the Unigine Heaven benchmark.

2) Unigine Heaven - Multiple Instance Performance: To compare the two platforms, our final experiment leverages Unigine's latest benchmark, Heaven, which supports the latest graphics rendering techniques such as DirectX 11 and OpenGL 4.0. Once again, we use a 1920 x 1080 resolution with all high detail settings enabled, 8x Anti-Aliasing (AA), and 16x Anisotropic Filtering (AF).

TABLE II: Advanced: Heaven Multiple Instances

Fig. 7: Advanced: Heaven Power Usage

The results in Table II suggest that the performance remains consistent as we add more instances of the application to either the bare-metal or the virtualized system. For the bare-metal system, the average performance is approximately 30 fps regardless of the number of running game instances. The virtualized performance is approximately 5% lower, at around 29 fps. Figure 7 further compares the energy consumption. Although higher in the virtual machine deployment, the energy consumption at its worst is only about 6.5% more. This is impressive considering that the virtual machine deployment actually runs one operating system for the virtualization host and one for each guest VM. Overall, given such great benefits of virtualization as performance isolation and resource sharing, an overhead of around 5-6.5% for advanced virtualization is quite justifiable.

IV. A FULLY VIRTUALIZED CLOUD GAMING PLATFORM: DESIGN AND IMPLEMENTATION

Our measurement above has demonstrated the strong potential of today's GPUs for high-quality cloud gaming. To establish best practices and system-level trade-offs of cloud gaming over real-world public and virtualized cloud platforms, we have implemented an open-source high-performance cloud gaming system. Our implementation is a fully functional and modular platform, which allows users and researchers to customize many subsystems. It supports not only software encoding, using the highly optimized x264 encoder for H.264 video [31], but also the state-of-the-art NVIDIA GRID and its hardware H.264 encoder [32]. The choice of streaming protocol can also be customized, e.g., RTSP over UDP or TCP, as well as HTTP streaming, using the widely deployed open-source streaming library Live555 [33] as the streaming engine.

We now offer a high-level description of our system, starting with the user interaction module, which targets both mobile (Android) and PC users. It captures user commands using VNC remote desktop hooks, which have inherent cross-platform support. The commands are sent back to the server over TCP, where the server interprets them and delivers them to the gaming window. We emphasize that VNC is used only for sending commands to the server, for its low overhead and latency; it is not involved in the transmission or reception of the video in any way, as its performance is not sufficient for such data-intensive jobs.
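The command path described above can be sketched as a minimal TCP channel. This is an illustrative sketch only: the actual platform relies on VNC remote desktop hooks, and the one-byte-type plus four-byte-keycode wire format here is a hypothetical stand-in for the real message layout.

```python
import socket
import struct

# Hypothetical wire format: 1-byte event type, 4-byte key code.
# The real system uses VNC hooks; this only illustrates the
# low-overhead TCP command channel separated from the video path.
KEY_DOWN, KEY_UP = 0, 1

def send_key_event(sock, event_type, key_code):
    """Client side: pack and send a single key event to the server."""
    sock.sendall(struct.pack("!BI", event_type, key_code))

def recv_key_event(sock):
    """Server side: read one 5-byte event and unpack it for the game window."""
    buf = b""
    while len(buf) < 5:
        chunk = sock.recv(5 - len(buf))
        if not chunk:
            raise ConnectionError("command channel closed")
        buf += chunk
    return struct.unpack("!BI", buf)
```

Keeping this channel tiny and separate from the video stream is what allows the command path to stay low-latency even when the video path saturates the link.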
The next module is the game logic, which in essence is the gaming application that the user is playing. It intercepts the user's keystrokes and computes the game world changes. The rendering is performed by the GPU assigned to the VM, e.g., an NVIDIA GRID GK104. The video encoder in our platform is selectable: either a software or a hardware H.264 encoder. The software encoder is provided by the high-performance x264 encoding library; if the VM is assigned a GRID GPU, it can instead utilize the built-in H.264 encoder provided by the platform. In either case, the encoders need additional support to be adapted for real-time streaming, namely a discrete framer module, which allows the Live555 streaming library to request live frames

Software and More Info: rws1/cloud-gaming

from the encoders at a desired video stream frame rate. The encoded video stream is then encapsulated and transported over either UDP, TCP, or HTTP. Finally, when the video is received by a thin client (mobile or PC), we use the cross-platform media library FFmpeg [34], with real-time optimizations, to decode the video and display it on the client device. Our design and implementation are platform-independent, although a GRID GPU is required for hardware encoding. We have deployed and experimented with the system on Amazon EC2 GPU Instances (G2), a new type of cloud instance backed by Intel Xeon E5-2670 (Sandy Bridge) processors and the NVIDIA GRID K520 board, which contains a GK104 GPU with 1536 CUDA cores and 4 GB of video memory. The GRID's on-board hardware video encoder supports up to 8 live HD video streams (720p at 30 fps) or up to 4 live FHD video streams (1080p at 30 fps), as well as low-latency frame capture for either the entire screen or selected rendering objects, enabling a G2 instance to offer such high-quality interactive streaming as game streaming, 3D application streaming, or other server-side graphics tasks.

We emphasize here that the use of virtualization opens a new space for the design and deployment of cloud gaming. It can now be readily deployed or migrated to a public cloud (e.g., Amazon EC2) with low cost and practically unlimited resources; there is no need to maintain a dedicated and specialized private cloud. Besides, although the GRID GPU is passed in as a physical device, other parts of the system can easily be tuned by the cloud provider.

Fig. 8: Encoder: Software Multi-threaded 720p

Fig. 9: Encoder: Hardware 720p
For instance, if future 3D games require more main memory, this can easily be satisfied, as the amount of memory is a tunable parameter for a cloud instance. Furthermore, the CPU resources are also configurable: if future games require more computing resources, the cloud provider can increase the VM's number of virtual CPUs, or their share of the physical resources.

V. HARDWARE VERSUS SOFTWARE VIDEO ENCODING

One of the fundamental changes that NVIDIA's GRID technology brings to cloud gaming is the onboard H.264 video encoder. This encoder can take a raw frame directly from the GPU's frame-buffer and encode it using the H.264 video compression standard. Before this, video encoding applications heavily relied on software-based encoders, e.g., the x264 encoder, which is also used in GamingAnywhere [5]. Although OnLive has used a hardware encoder, it is not built into the GPU, but instead attached to the physical output of the video card. Our system supports both hardware and software encoding, which allows us to directly compare the performance of hardware and software H.264 encoders in terms of both encoding latency and CPU overhead. To our knowledge, this is the first such head-to-head comparison in a cloud computing context.

A. Encoder Benchmark

We configure our cloud gaming platform to capture video from the GPU and encode it using either the hardware or the software H.264 encoder. For all the tests, we again used the Unigine Heaven benchmark playing in a loop as the rendering source, and used the CPU's hardware counters to accurately determine the running time of encoding a frame. We also collected the CPU usage during the test, both for the encoding task alone and for the encoder together with the streaming application. For our software x264 encoder, we performed measurements for a multi-threaded implementation.
It is worth mentioning that a single-threaded implementation is much less likely to interfere with the 3D application's CPU time; yet more threads can significantly decrease the encoding time, as the encoding task is inherently parallel. For both the software encoder and the hardware encoder, we recorded videos at 30 fps at a variable bit rate (VBR) of 8 Mb/s, under three H.264 encoding profiles, namely base, main, and high. The distinction is important, because different platforms may implement only certain profiles, e.g., only the base profile for a mobile device, as it is the simplest and most energy efficient. We recorded the running time for each call to the encoder and plotted them against each other. Further, we set a threshold value of 33.33 ms, which is the maximum single-frame encoding time that still allows 30

frames per second video. Any time higher than 33.33 ms indicates that the current encoder cannot produce video at 30 fps. The parameters of the x264 software encoder were initialized with the zerolatency and ultrafast flags for the fastest possible encoding speed. For the GRID's built-in hardware H.264 encoder, we applied NVIDIA's default encoding parameters, which provide a good balance between encoding speed and image quality.

1) 720p Result: We start our discussion with 720p H.264 software encoding, where the multi-threaded experimental results can be found in Figure 8. Regardless of the profile chosen, the encoder maintains a consistently low average frame encoding time. The longest encoding time experienced by any frame in this test is approximately 20 ms, making even the worst case acceptable for smooth 30 fps encoding. Next, we tested the GRID's built-in hardware H.264 encoder; the results are given in Figure 9. As can be seen, the results are very consistent regardless of the encoding profile, with an average encoding time of less than 10 ms. Overall, the hardware encoder on average encodes frames roughly 30% faster than the software encoder does, with a lower deviation. Table III further lists the corresponding CPU utilization, providing both the CPU time taken by just the encoder and that of the complete system (encoder + streaming overhead).

TABLE III: CPU Overhead: Hardware vs Software H.264

Encoder Type | Encoder Only (CPU %) | Encoder + Streaming (CPU %)
Software (720p) | 15% | 23%
Hardware (720p) | 1% | 1%
Software (1080p) | 25% | 37%
Hardware (1080p) | 1% | 14%

Fig. 10: Encoder: Software Multi-threaded 1080p

Fig. 11: Encoder: Hardware 1080p
As can be seen, the software encoder takes considerably more CPU resources than the hardware encoder. In fact, 15% of the total CPU is consumed by the multi-threaded encoder, which implies that over one entire core of our 8-core system is being used by the video encoding subsystem. It is also important to note that, even though both the hardware and the multi-threaded software encoder on our cloud platform are capable of encoding 720p HD video at 30 fps, for cloud gaming, the lower the latency at this stage, the lower the playback delay at the end user: after the streaming software requests a new frame, it has to wait for however long encoding takes before the frame can be sent to the end user.

2) 1080p Result: We now look at the more complex situation of encoding Full HD (FHD) 1080p video, using the same settings as the previous experiment except for the resolution. Once again, we start our discussion with the multi-threaded software encoder, whose results can be seen in Figure 10. For high-profile encoding, the average frame encode time sits over 29 ms, and the main and base profiles both attain an average of 30 ms per frame. The multi-threaded software encoder's performance at 1080p thus manages to stay under 33.33 ms per frame on average for all profiles. However, all profiles show encoding-time spikes of over 50 ms, which means the software encoder will not always meet the encoding deadline needed to provide fluid 30 fps video. The results for the GRID's hardware H.264 encoder can be found in Figure 11. Regardless of the encoding profile level, the average encoding time is less than 11 ms per frame. During our experiments, the worst-case encoding time for the hardware encoder was only slightly over 25 ms. It is clear from these results that the GRID has no issue encoding full high definition frames at rates above 30 fps.
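The 33.33 ms criterion used above follows directly from the target frame rate: an encoder sustains 30 fps only if every frame finishes within 1000/30 ms. A minimal sketch of how a trace of per-frame encode times can be checked against this deadline (the sample trace below is illustrative, not our measured data):

```python
# Frame-deadline check: a 30 fps stream allows at most
# 1000/30 ms of encoding time per frame.
TARGET_FPS = 30
DEADLINE_MS = 1000.0 / TARGET_FPS  # ~33.33 ms

def deadline_misses(encode_times_ms, deadline_ms=DEADLINE_MS):
    """Count frames whose encode time exceeds the per-frame deadline.

    Returns (miss_count, miss_ratio) for the trace.
    """
    misses = sum(1 for t in encode_times_ms if t > deadline_ms)
    return misses, misses / len(encode_times_ms)

# Illustrative trace: mostly ~30 ms frames with one spike over 50 ms,
# the pattern we observed for software encoding at 1080p.
trace_ms = [29.0, 31.5, 30.2, 52.8, 28.4]
print(deadline_misses(trace_ms))  # (1, 0.2): one frame missed 30 fps
```

A single spike over the deadline is enough to produce a visible stutter, which is why the worst-case encode time matters as much as the average.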
A closer look into the results is shown in Table III, which reveals the CPU processing overhead of both the software-based x264 encoder and the hardware encoder while processing 1080p video frames. The multi-threaded x264 encoder consumes 25% of the available CPU resources for the encoding phase, with a combined CPU utilization of 37%. This implies that, on the 8-core system, the software encoder consumes the equivalent resources of 2 cores to provide the

encoding, and nearly 3 cores to provide both the streaming and encoding systems. Despite this high CPU usage, the software-based encoder often misses the 33.33 ms encoding deadline for 30 fps video. The GRID's hardware H.264 encoder, on the other hand, easily handles the frames with very low CPU utilization and never exceeds the threshold. It is clear that, for our new cloud gaming platform, the GRID's hardware H.264 encoder greatly reduces encoding latency as well as the potential for performance interference between the encoder and the actual game application.

This experiment suggests that the GRID's hardware encoder has significant advantages over a software implementation. Offloading the computationally expensive job of encoding frames to a dedicated piece of hardware not only greatly reduces CPU consumption but also completes encoding a frame with much lower latency. This reduction in latency translates into improved interaction latency for the end user, as the streaming subsystem is not stalled while waiting for the next frame. Further, the x264 software encoder's performance, even with 8 threads, does not allow smooth capture of 1080p video, whereas the GRID's hardware encoder handles these high definition frames with low latency. As the resolution increases (e.g., towards 4K UHD video), the suitability of the software video encoder will quickly diminish.

VI. THIN CLIENT: CONFIGURATION AND PERFORMANCE

One of the key motivations for migrating games to the cloud is to enable resource-constrained (in terms of computation, memory, battery, etc.) thin clients to play advanced games. Hence, we now shift our focus to the configuration and performance of the thin client in our system.
Our test system was a 7-inch EVGA Tegra NOTE 7 tablet, powered by NVIDIA's mobile chipset Tegra 4, a SoC (System on Chip) that features a quad-core ARM Cortex-A15 CPU, a 72-core NVIDIA GeForce GPU, and, in particular, a dedicated video decoding block that provides full support for hardware decoding. The tablet has a 1280 x 800 (720p) native resolution screen; yet the GPU can support up to a 4K UHD external display through its video output. The installed Android Jelly Bean enables users to explore and utilize the device with a wide range of gaming apps. All this makes it an ideal cloud gaming client for both use and test.

NVIDIA supplies such profiling tools as the System Profiler and PerfHUD ES for Tegra, and Android also has a built-in battery usage monitor and numerous third-party applications for checking battery life. Unfortunately, these tools either do not support overall power measurement or are far from precise and reliable. To determine the overall system's power consumption in executing client tasks, we instead resorted to a direct and low-level means: measuring the DC (direct current) intensity of the tablet's circuit. Given that the tablet's working DC voltage stays mostly stable (approximately 3.7 V) for a sufficiently long period when the battery's remaining capacity is over 50%, we can compute the instantaneous power as the product of current (in amperes) and voltage. We used a digital clamp meter (Mastech MS2115B, shown in Figure 12) to measure the DC amperage (precision of ±2.5%). Since the original wire is far too short and too close to the device for the meter to clamp, we detached the back cover of the tablet, cut off the main wire, and soldered an extension wire back on for the measurement.

Fig. 12: Amperage Measurement

A. Gaming Applications and Benchmark Setup

We configured our Tegra tablet to make use of its built-in hardware decoder and to switch between hardware decoding and software decoding.
For comparing their respective energy consumption, we chose a high-definition video clip as our sample video, originally in 1080p resolution. To have a consistent comparison across high and low resolutions, we used the VLC media player to convert this source video into multiple versions: 480p H.264 + AAC (MP4) at a 3.0 Mb/s bitrate, 720p H.264 + AAC (MP4) at 6.0 Mb/s, and 1080p H.264 + AAC (MP4) at 10.0 Mb/s. We also used an advanced 3D benchmark for mainstream tablets, namely the 3DMark Ice Storm benchmark, which includes 720p game scenes, two graphics tests, and a physics test to stress the GPU and CPU, respectively. We installed it both on the Tegra tablet, for the power measurement of local rendering, and on the cloud gaming server, for remote rendering.

B. Client-side Power Consumption

For each run of the experiments, we fully recharged the tablet's battery so as to ensure a stable supply voltage. After the recharge completed, we restarted the tablet to minimize interference from other running applications, and unplugged the power supply to ensure that the power source was purely the battery. For the Ice Storm experiment, when we did the local rendering tests, we started only the 3DMark application, ran the benchmark, and began to record readings of the clamp meter using the data logger PCLink, a software package installed on a workstation, with a sampling rate of two readings per second. Likewise, when conducting the remote rendering tests, we selected the network video stream and

Fig. 13: Avg. Power Consumption for Locally Played Video

Fig. 14: Avg. Power Consumption for Streamed Video

Fig. 15: Ice Storm Power Consumption

started measurement recording. By turning the hardware acceleration option on and off before each run, we obtained the measurements with the hardware decoder or with the software decoder. For all experiments, the screen brightness, sound volume, and other settings remained unchanged. For the experiment with the high-definition video clips, we followed the same set of steps to run each version of the sample clip, except that it did not involve any local rendering tests. It is worth mentioning that the DC clamp meter is highly sensitive; hence, to reduce measurement error, we reset the clamp meter before each measurement.

1) Video Clip Results: In Figure 13, we depict the resulting power consumption in watts of locally playing the video clips, using the software decoder and the hardware decoder at different resolutions. The dotted line represents the device's overall power consumption when using the software decoder, which started from 2.81 watts at 480p, rose to 4.19 watts at 720p, then mounted to 6.21 watts at 1080p. The hardware decoding had a lower start of 1.87 watts at 480p, 2.19 watts at 720p, and ended at a slightly higher 2.35 watts at 1080p. Clearly, the hardware decoder outperforms its software counterpart, consuming approximately half the power. This can be explained by the fact that software decoding invokes a much more intensive CPU workload than hardware decoding. As shown in Table IV, obtained from the Android console, the hardware decoder incurred slightly more system-level CPU usage but much less user-level usage than the software decoder; more importantly, its total CPU overhead remained nearly constant, regardless of how high the resolution was.
On the contrary, for the software decoder, the total CPU overhead surged from 66% at 480p to 89% at 1080p. From an even more practical perspective, given that the battery capacity of our tablet is 15.1 Wh, when a user continuously plays the 480p video using the software decoder, the tablet can only work for roughly 5.4 hours (15.1 Wh divided by 2.81 watts). With the hardware decoder, this can be prolonged to roughly 8 hours. Likewise, for the 720p video, it is 3.6 hours against 7 hours, and 2.4 hours against 6.4 hours (more than double) for the 1080p display. This margin grows significantly as the video's resolution increases; for a 4K ultra-high resolution display, even with a short play duration, hardware decoding will make a qualitative difference against software decoding with regard to the device's endurance.

Similarly, in Figure 14, we exhibit the results on average power consumption, including the streaming portion. Again the hardware decoder outperforms the software decoder, saving a remarkable amount of energy and thus significantly extending battery life. The same CPU overhead pattern can also be observed in this experiment.

2) Ice Storm Results: In Figure 15, we depict the results of the advanced 3D benchmark for mainstream tablets, the 3DMark Ice Storm. The horizontal axis represents the timeline of the benchmark in seconds, and the vertical axis denotes the overall power consumption in watts. On the bottom sits the hardware decoding (average 2.36 watts), which stayed stable throughout the entire benchmark. Local rendering is represented by the solid line sitting roughly in the middle (average 3.75 watts), which started high at around 5 watts, dropped under 4 watts, and remained there until the later part of the benchmark. The other line, sitting roughly on top (average 5.11 watts), gives the result with software decoding.
As we can see, when applying hardware decoding, the device had an overall power consumption between 2 and 3 watts, while local rendering stayed competitive with hardware decoding in the middle of the benchmark, where only normal game scenes were rendered and no benchmark loading or other ultra CPU-intensive work was happening. Software decoding, however, doubled this amount to over 5 watts, which

was even worse than local rendering in terms of supporting usual game scenes.

TABLE IV: CPU Overhead: Hardware vs Software H.264

Decoder Type | User (CPU %) | System (CPU %) | Total (CPU %)
Software (480p) | 51% | 15% | 66%
Hardware (480p) | 35% | 25% | 60%
Software (720p) | 60% | 13% | 73%
Hardware (720p) | 35% | 26% | 61%
Software (1080p) | 80% | 9% | 89%
Hardware (1080p) | 36% | 23% | 59%

C. Hardware Decoder: Key for Cloud Gaming Client

A deeper investigation shows that local rendering increases the power consumption at the very beginning and particularly in the later part of the benchmark. We believe this is because a great number of CPU and GPU calls were made to load and initialize the benchmark, stressing the device in its graphics and physics tests. This conjecture is based not only on our trace of CPU/GPU usage, but also on the fact that the sharp increases in power consumption closely correspond, on the timeline, to the benchmark loading and stress testing, where extraordinarily intensive workloads were involved. In other words, as opposed to traditional locally rendered gaming, cloud gaming saves not only a large amount of the power consumed in rendering game scenes, but in game loading and other CPU/GPU-intensive operations as well.

The savings, however, come from hardware decoding only. The software decoder on the client side consumes even more energy than local rendering, 5.11 watts against 3.75, theoretically cutting battery life by 27% and making cloud gaming less appealing given the notorious shortage of power supply in mobile devices. The hardware decoder, on the other hand, holds a much lower average power consumption of 2.36 watts, impressively extending battery life by 59% over local rendering and 117% over the software decoder, making battery shortage much less of a concern for users.
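The battery-life figures above follow from simple arithmetic: instantaneous power is the measured DC current times the (stable) 3.7 V supply, and endurance is the 15.1 Wh capacity divided by the average draw. A small sketch reproducing the reported numbers:

```python
# Power and endurance arithmetic used in Section VI.
VOLTAGE_V = 3.7      # tablet's stable DC supply voltage
CAPACITY_WH = 15.1   # battery capacity of the Tegra NOTE 7

def power_watts(current_amps):
    """Instantaneous power from the clamp-meter amperage reading."""
    return current_amps * VOLTAGE_V

def battery_hours(avg_watts):
    """Endurance of the battery under a steady average draw."""
    return CAPACITY_WH / avg_watts

# Average draws measured in the Ice Storm experiment:
hw_decode, local, sw_decode = 2.36, 3.75, 5.11

print(round(battery_hours(hw_decode), 1))  # ~6.4 hours with hardware decoding
# Relative battery-life extension of hardware decoding:
print(round((battery_hours(hw_decode) / battery_hours(local) - 1) * 100))     # ~59%
print(round((battery_hours(hw_decode) / battery_hours(sw_decode) - 1) * 100)) # ~117%
```

Because endurance is inversely proportional to draw, the 59% and 117% extensions are simply the power ratios 3.75/2.36 and 5.11/2.36 minus one.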
Moreover, as exhibited in the video clip experiment, when the video resolution increased from 480p to 720p and to 1080p, the software decoder expended significantly more computing overhead and power, while the hardware decoder managed to contain the computational and energy cost at a relatively low level. Given that mainstream tablets now stand at 720p resolution and are still moving forward, the gap between software and hardware decoders will only increase. For richer and more detailed game worlds at 4K UHD resolution and beyond, a hardware decoder is definitely needed, though it remains to be universally supported by tablets and smartphones. (We have seen hardware decoding support for H.264 HD videos in certain high-end tablets readily available in the current market, including HP's TouchPad, the EVGA Tegra NOTE, and Samsung's Galaxy Note.)

VII. IMPACT OF STREAMING PROTOCOLS

Different from traditional streaming applications, cloud gaming introduces intensive user operations to control the rendering of the gaming screen. Due to this new feature, the applicability of different transport protocols remains largely unclear in real-world deployment. We next closely examine the performance of TCP and UDP in our cloud gaming system under a variety of packet loss rates. To capture the user's interaction latency, we configured and used the high-performance screen capture software MSI Afterburner, which provides low-level access to the GPU with little CPU and disk overhead. We bound the key combination CTRL-N to begin the video recording of MSI Afterburner, which invoked a window on our cloud gaming server.

Fig. 16: Packet Loss: TCP

Fig. 17: Packet Loss: UDP
We then measured the interaction latency by counting the number of frames from the start of the screen recording until the action was performed on our local cloud gaming video stream. In our experiment, such tools as TC and the network

emulator NETEM were applied to control the packet loss in the network, ranging from 0.1% up to 10%. We also ran each experiment five times to avoid possible bias.

Fig. 18: Packet Loss: TCP Anomaly (Freezes)

Fig. 19: Packet Loss: UDP Anomaly (Distortion)

Figure 16 shows the results of the TCP measurement. We can see that the interaction latency stayed below 140 ms when the packet loss rate was less than 1%. However, at a 5% loss rate, the interaction latency increased by nearly 43%, to 200 ms. At a 10% loss rate, this grew to 368 ms, an astounding 160% increase. The UDP interaction latency, shown in Figure 17, on the other hand, remained low and essentially constant at all loss rates. Despite the clear advantage of UDP in this experiment, we noticed many visual distortions for UDP as well as video freezes for TCP. To better understand their impact, we recorded the live video streams and analyzed their display quality on the user clients. For each protocol and loss rate, we recorded 60 seconds of the incoming live stream from the Unigine Heaven benchmark, and then checked for possible anomalies (freezes with TCP, distortions with UDP).

Figure 18 shows the results of the TCP measurement. Obviously, there was no anomaly when the loss rate was 0%. Yet, when the loss rate increased to 0.1%, the user experienced a noticeable freezing event once per minute. The number of freezing events increased steadily, up to a loss rate of 10%, at which point the game was unplayable, freezing every 4 seconds. Figure 19 shows the visual distortions in UDP transmission. Again, there was no noticeable distortion at the low packet loss rate of 0.1%; yet, as the loss rate approached 1%, the user experienced noticeable distortions approximately every 30 seconds.
At 5%, distortions happened approximately every 10 seconds. To make matters worse, when the loss rate reached 10%, a massive increase in distortions occurred, affecting nearly every frame. Clearly, the chance of losing part or all of an H.264 I-frame in this case was extremely high, given that an I-frame is much larger than most other frames and is thus split into many smaller packets for transmission over UDP. The results suggest that, for a stable and reliable network, both TCP and UDP work well for cloud gaming. When the loss rate escalates, particularly when it is over 5%, which is not uncommon in today's Internet, UDP is a better choice for the highly interactive and delay-sensitive cloud gaming.

VIII. FUTURE WORK AND CONCLUSION

Cloud gaming has attracted significant interest from both academia and industry, yet its potentials and challenges have not been fully understood, particularly with the latest hardware and virtualization technology advances. This paper presented an in-depth study in this direction. We closely examined the performance of modern virtualization systems equipped with virtualized GPU and pass-through techniques. Our results showed that GPU virtualization has greatly improved and is ready for gaming over public clouds. A modern platform can host three concurrent gaming sessions running inside different VMs at 95% of the optimal performance of a bare-metal counterpart. Although there was degradation of memory transfer between a virtualized system's main memory and its assigned GPU, game performance at the full HD resolution of 1080p was only marginally impacted. Nevertheless, as the industry transitions to the new 4K UHD, the impact on gaming remains to be seen. We have begun working on a 4K-enabled cloud gaming server and are currently undertaking research to discover whether any virtualization-specific issues exist at these higher resolutions.
Also, it is likely that practical changes can be made at the hypervisor level to enhance cloud gaming performance. Such enhancements may include game-aware virtual CPU scheduling, as well as memory management techniques to increase the memory transfer speed from the VM to the GPU. Further, we have designed and implemented a practical cloud gaming system, whose server side was deployed with virtualization and an advanced GRID GPU in a public cloud platform. Our real-world experimental results revealed the clear advantages of the GRID's hardware encoder over its software counterpart. We showed that only the hardware encoder can achieve acceptable gaming performance at 1080p and 30 fps. We also explored practical issues facing thin-client design. We measured and quantified the power consumption of different

ways to supply game scenes, and found that a software decoder can consume even more power than local rendering. Only with a hardware decoder will a thin client benefit from remote rendering, particularly for high-definition content. The impact of TCP and UDP on streaming has also been evaluated under various network conditions. Cloud gaming is rapidly evolving, and advancing hardware and virtualization support is a crucial component of this evolution. A wide range of new hardware and systems will emerge in the coming years. Among the many open issues to be addressed, we are particularly interested in direct measurements of advanced GPU controllers on our local platform, to quantify not only the energy overhead but also any performance degradation caused by virtualization. We are also interested in whether different encoding settings affect the interaction delay and the energy overhead at the thin-client endpoint. Discovering novel transmission and intelligent masking techniques that physically or perceptually reduce the interaction delay is another direction to explore.

REFERENCES

[1] W. Cai, M. Chen, and V. Leung, "Toward gaming as a service," IEEE Internet Computing, vol. 18, no. 3, 2014.
[2] R. Shea and J. Liu, "Cloud gaming: Architecture and performance," IEEE Network, vol. 27, no. 4, 2013.
[3] M. Claypool and D. Finkel, "On the performance of OnLive thin client games," Multimedia Systems, Special Issue on Network Systems Support for Games, pp. 1–14, 2014.
[4] K. Lee, D. Chu, E. Cuervo, A. Wolman, and J. Flinn, "Demo: DeLorean: Using speculation to enable low-latency continuous interaction for mobile cloud gaming," in Proceedings of ACM MobiSys, 2014.
[5] C. Huang, C. Hsu, D. Chen, and K. Chen, "Quantifying user satisfaction in mobile cloud games," in Proceedings of ACM MoViD, 2014.
[6] OnLive.
[7] Gaikai.
[8] Engadget, "Sony buys Gaikai cloud gaming service for $380 million."
[9] M. Dowty and J. Sugerman, "GPU virtualization on VMware's hosted I/O architecture," ACM SIGOPS Operating Systems Review, vol. 43, no. 3, 2009.
[10] D. D. Winter, P. Simoens, L. Deboosere, F. D. Turck, and J. Moreau, "A hybrid thin-client protocol for multimedia streaming and interactive gaming applications," in Proceedings of the NOSSDAV Workshop, 2006.
[11] Y. C. Chang, P. H. Tseng, K. T. Chen, and C. L. Lei, "Understanding the performance of thin-client gaming," in Proceedings of IEEE CQR, 2011.
[12] Y. Lee, K. Chen, H. Su, and C. Lei, "Are all games equally cloud-gaming-friendly? An electromyographic approach," in Proceedings of IEEE NetGames, 2012.
[13] M. Jarschel, D. Schlosser, S. Scheuring, and T. Hossfeld, "Gaming in the clouds: QoE and the users' perspective," Mathematical and Computer Modelling, vol. 57, no. 11, 2013.
[14] M. Hemmati, A. Javadtalab, A. A. N. Shirehjini, S. Shirmohammadi, and T. Atici, "Game as video: Bit rate reduction through adaptive object encoding," in Proceedings of ACM NOSSDAV, 2013.
[15] OnLive.
[16] D. Wu, Z. Xue, and J. He, "iCloudAccess: Cost-effective streaming of video games from the cloud with low latency," IEEE Transactions on Circuits and Systems for Video Technology, vol. PP, no. 99, pp. 1–12, 2014.
[17] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003.
[18] Amazon Elastic Compute Cloud.
[19] K. Ye, X. Jiang, S. Chen, D. Huang, and B. Wang, "Analyzing and modeling the performance in Xen-based virtual cluster environment," in 2010 12th IEEE International Conference on High Performance Computing and Communications, 2010.
[20] R. Russell, "virtio: Towards a de-facto standard for virtual I/O devices," ACM SIGOPS Operating Systems Review, vol. 42, July 2008.
[21] N. Amit, M. Ben-Yehuda, and B.-A. Yassour, "IOMMU: Strategies for mitigating the IOTLB bottleneck," in Computer Architecture, Springer, 2012.
[22] J. Liu, "Evaluating standard-based self-virtualizing devices: A performance study on 10 GbE NICs with SR-IOV support," in 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010.
[23] A. Gordon, N. Amit, N. Har'El, M. Ben-Yehuda, A. Landau, A. Schuster, and D. Tsafrir, "ELI: Bare-metal performance for I/O virtualization," ACM SIGARCH Computer Architecture News, vol. 40, no. 1, 2012.
[24] Y. Dong, X. Yang, J. Li, G. Liao, K. Tian, and H. Guan, "High performance network virtualization with SR-IOV," Journal of Parallel and Distributed Computing, vol. 72, no. 11, 2012.
[25] L. Shi, H. Chen, J. Sun, and K. Li, "vCUDA: GPU-accelerated high-performance computing in virtual machines," IEEE Transactions on Computers, vol. 61, no. 6, 2012.
[26] C. Reaño, A. J. Peña, F. Silla, J. Duato, R. Mayo, and E. S. Quintana-Ortí, "CU2rCU: Towards the complete rCUDA remote GPU virtualization and sharing solution," in 2012 19th International Conference on High Performance Computing (HiPC), IEEE, 2012.
[27] C.-T. Yang, H.-Y. Wang, and Y.-T. Liu, "Using PCI pass-through for GPU virtualization with CUDA," in Network and Parallel Computing, Springer, 2012.
[28] M. Yu, C. Zhang, Z. Qi, J. Yao, Y. Wang, and H. Guan, "VGRIS: Virtualized GPU resource isolation and scheduling in cloud gaming," in Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC), ACM, 2013.
[29] C. Zhang, Z. Qi, J. Yao, M. Yu, and H. Guan, "vGASA: Adaptive scheduling algorithm of virtualized GPU resource in cloud gaming," 2013.
[30] R. Shea and J. Liu, "On GPU pass-through performance for cloud gaming: Experiments and analysis," in 2013 12th Annual Workshop on Network and Systems Support for Games (NetGames), IEEE, 2013.
[31] x264.
[32] NVIDIA GeForce GRID.
[33] Internet streaming media library.
[34] FFmpeg.
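The TCP versus UDP streaming comparison discussed above can be sketched with a minimal loopback round-trip measurement. This is illustrative only, not the paper's methodology: a real evaluation streams encoded video across emulated WAN links, and the ports and 1 KB payload size here are arbitrary choices.

```python
# Illustrative sketch: mean round-trip time of 1 KB "frame" packets
# over UDP and TCP on the loopback interface (Python stdlib only).
import socket
import threading
import time

HOST = "127.0.0.1"
UDP_PORT, TCP_PORT = 9901, 9902  # hypothetical test ports

def udp_echo(stop):
    # Datagram echo server: reflects every packet back to its sender.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind((HOST, UDP_PORT))
    s.settimeout(0.2)
    while not stop.is_set():
        try:
            data, addr = s.recvfrom(4096)
            s.sendto(data, addr)
        except socket.timeout:
            pass
    s.close()

def tcp_echo(stop):
    # Stream echo server for a single client; exits when the client closes.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((HOST, TCP_PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    while True:
        data = conn.recv(4096)
        if not data:          # client closed the connection
            break
        conn.sendall(data)
    conn.close()
    srv.close()

def mean_rtt(send, recv, rounds=50):
    payload = b"x" * 1024     # one simulated 1 KB video packet
    t0 = time.perf_counter()
    for _ in range(rounds):
        send(payload)
        recv()                # wait for the echo before the next send
    return (time.perf_counter() - t0) / rounds

stop = threading.Event()
workers = [threading.Thread(target=f, args=(stop,), daemon=True)
           for f in (udp_echo, tcp_echo)]
for w in workers:
    w.start()
time.sleep(0.2)               # give both servers time to bind/listen

u = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
u.settimeout(1.0)
udp_rtt = mean_rtt(lambda p: u.sendto(p, (HOST, UDP_PORT)),
                   lambda: u.recvfrom(4096))
u.close()

c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
c.connect((HOST, TCP_PORT))
tcp_rtt = mean_rtt(c.sendall, lambda: c.recv(4096))
c.close()                     # unblocks the TCP echo thread

stop.set()
print(f"mean RTT: udp={udp_rtt * 1e3:.3f} ms  tcp={tcp_rtt * 1e3:.3f} ms")
```

On loopback the two transports perform similarly; the differences the conclusion refers to only emerge once loss and queuing delay are introduced (e.g., with a network emulator), where TCP retransmission and head-of-line blocking inflate interaction delay.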


More information

9/26/2011. What is Virtualization? What are the different types of virtualization.

9/26/2011. What is Virtualization? What are the different types of virtualization. CSE 501 Monday, September 26, 2011 Kevin Cleary kpcleary@buffalo.edu What is Virtualization? What are the different types of virtualization. Practical Uses Popular virtualization products Demo Question,

More information

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Technology brief Introduction... 2 GPU-based computing... 2 ProLiant SL390s GPU-enabled architecture... 2 Optimizing

More information

ZEN LOAD BALANCER EE v3.04 DATASHEET The Load Balancing made easy

ZEN LOAD BALANCER EE v3.04 DATASHEET The Load Balancing made easy ZEN LOAD BALANCER EE v3.04 DATASHEET The Load Balancing made easy OVERVIEW The global communication and the continuous growth of services provided through the Internet or local infrastructure require to

More information

IP Video Rendering Basics

IP Video Rendering Basics CohuHD offers a broad line of High Definition network based cameras, positioning systems and VMS solutions designed for the performance requirements associated with critical infrastructure applications.

More information

MAXPRO. NVR Software NETWORK VIDEO RECORDING SOLUTION

MAXPRO. NVR Software NETWORK VIDEO RECORDING SOLUTION NETWORK VIDEO RECORDING SOLUTION Honeywell s MAXPRO is a flexible, scalable and open IP video surveillance system. Supporting Honeywell's high definition (HD) cameras and broad integration with third party

More information

Networking for Caribbean Development

Networking for Caribbean Development Networking for Caribbean Development BELIZE NOV 2 NOV 6, 2015 w w w. c a r i b n o g. o r g Virtualization: Architectural Considerations and Implementation Options Virtualization Virtualization is the

More information

Virtualised MikroTik

Virtualised MikroTik Virtualised MikroTik MikroTik in a Virtualised Hardware Environment Speaker: Tom Smyth CTO Wireless Connect Ltd. Event: MUM Krackow Feb 2008 http://wirelessconnect.eu/ Copyright 2008 1 Objectives Understand

More information

The Bus (PCI and PCI-Express)

The Bus (PCI and PCI-Express) 4 Jan, 2008 The Bus (PCI and PCI-Express) The CPU, memory, disks, and all the other devices in a computer have to be able to communicate and exchange data. The technology that connects them is called the

More information

Intel Cloud Builders Guide to Cloud Design and Deployment on Intel Platforms

Intel Cloud Builders Guide to Cloud Design and Deployment on Intel Platforms Intel Cloud Builders Guide Intel Xeon Processor-based Servers RES Virtual Desktop Extender Intel Cloud Builders Guide to Cloud Design and Deployment on Intel Platforms Client Aware Cloud with RES Virtual

More information

HOW MANY USERS CAN I GET ON A SERVER? This is a typical conversation we have with customers considering NVIDIA GRID vgpu:

HOW MANY USERS CAN I GET ON A SERVER? This is a typical conversation we have with customers considering NVIDIA GRID vgpu: THE QUESTION HOW MANY USERS CAN I GET ON A SERVER? This is a typical conversation we have with customers considering NVIDIA GRID vgpu: How many users can I get on a server? NVIDIA: What is their primary

More information

Infrastructure Matters: POWER8 vs. Xeon x86

Infrastructure Matters: POWER8 vs. Xeon x86 Advisory Infrastructure Matters: POWER8 vs. Xeon x86 Executive Summary This report compares IBM s new POWER8-based scale-out Power System to Intel E5 v2 x86- based scale-out systems. A follow-on report

More information

FOR SERVERS 2.2: FEATURE matrix

FOR SERVERS 2.2: FEATURE matrix RED hat ENTERPRISE VIRTUALIZATION FOR SERVERS 2.2: FEATURE matrix Red hat enterprise virtualization for servers Server virtualization offers tremendous benefits for enterprise IT organizations server consolidation,

More information