CloudTops: Latency-Aware Placement of Virtual Desktops in Distributed Cloud Infrastructures

Keywords: Cloud, Virtual Machine, Response Time, Latency, VM Placement, Desktop

Abstract: Latency-sensitive interactive applications such as virtual desktops for enterprise workers are slated to be important driving applications for next-generation cloud infrastructures. Determining where to geographically place desktop VMs in a globally distributed cloud so as to optimize user-perceived performance is an important and challenging problem. Historically, the performance of thin-client-based systems has been characterized predominantly in terms of the front-end network link between the thin client and the desktop. In this paper, we show that for typical enterprise applications, back-end network connectivity to the file systems and applications that support the desktop can be equally important, and that the optimal balance between the front-end and back-end links depends on the precise workload. To help make dynamic decisions about desktop VM placement, we propose a per-user model that can be used to automatically construct user profiles and to predict the optimal location for a user's desktop based on their past and current usage patterns. Using experimental evaluation of several typical enterprise applications, we show that our methodology can accurately predict which of many distributed data centers to use for a particular user's workload even if details of the precise applications being used are not known.

1 INTRODUCTION

The Infrastructure as a Service (IaaS) model of cloud computing has led to revolutionary changes in the ease with which users can access compute cycles located anywhere in the world. While this revolution has thus far been focused on compute-intensive server and batch workloads, a new class of latency-sensitive interactive applications such as gaming and virtual desktops is emerging as a major future driver of cloud adoption.
Commercially available Virtual Desktop Infrastructure (VDI) solutions advocate hosting enterprise worker desktops as virtual machines on shared infrastructure and accessing them through dedicated thin-client terminals as a way to consolidate management, reduce the high costs of periodic hardware refreshes, and improve on-premise power consumption. While most of today's VDI solutions advocate running desktop VMs in private corporate data centers, the day is not far away when VDI services hosted by public cloud providers will give corporate and personal users anytime, anywhere access to their desktops through a variety of endpoints such as thin clients, ultrabooks, or even tablets. However, unlike server and batch processing workloads that can be tucked away in remote data centers, cloud-hosted virtual desktops (or, as we call them, Cloudtops) are extremely latency and bandwidth sensitive. There exists a rich history of experiments in the literature, e.g. (Lai and Nieh, 2006; Tolia et al., 2006), that demonstrate that user satisfaction drops significantly as the network latency between the thin-client front-end and the remote desktop back-end increases. This is because the front-end link transports what is usually the most data- and interaction-sensitive portion of the user's experience: the display and input subsystems. Consequently, placing back-ends as close as possible to the thin client so as to optimize the quality of the front-end network link has been a primary focus of both academic (Malet and Pietzuch, 2010; Calyam et al., 2011) and commercial efforts (e.g., OnLive.com).
In this paper, we show that such a singular focus on the front-end is not sufficient, and that to get a more accurate picture of user experience, a more holistic view is required that examines not only the front-end network link between the thin client and the desktop, but also the back-end network links between the desktop and any remote services (e.g., file servers) that are accessed by desktop applications. This holistic view is especially important for VDI's primary demographic: enterprise workers in a workgroup setting who use their desktops to work with data and applications shared with colleagues who may be geographically dispersed over wide distances. Such scenarios are increasingly common: they are commonplace in large enterprises with workers in many different locations, in the increasing number of companies that promote telecommuting and flexible working arrangements, and in corporate units with a large population of mobile employees, e.g., sales divisions. In such settings, we argue that the availability of a geographically distributed cloud infrastructure provides an additional lever to control the user experience that was not available before: while common back-end servers and services can continue to reside in the enterprise's data centers, the virtual desktops themselves can be placed in data centers in different geographical locations, with varying front-end and back-end network link qualities, so as to optimize overall user experience. However, choosing optimal desktop VM placements is not easy. Desktop users often work with a wide range of applications, many of which may not be known in advance, and each of which may have a different desirable front-end/back-end ratio. Moreover, because different users may use their environments differently, the best location for VM placement can differ even for members of the same workgroup. Therefore, what is needed is a model that can be automatically constructed for each individual user based on system-level measurement data, and which can be used to predict the optimal location for that user's desktop VM from a selection of cloud data centers with different front-end/back-end latencies and bandwidths. In this paper, we propose, for the first time, precisely such a model, and make the following contributions: a) First, we experimentally measure the user-perceived performance of several typical enterprise workloads, and demonstrate that in addition to the quality of the front-end link, the quality of the back-end link has a significant impact on performance, and that this impact can vary significantly and qualitatively for different applications. b) Next, we propose a model and an algorithm to predict the end-to-end performance of a user's desktop VM running an unknown application workload.
This model can be automatically constructed using OS-level data measured from a user's desktop VM without any knowledge of the applications involved, and can thus be used to build a historical profile for each user. Finally, c) we use the model to predict which one of several geographically distributed cloud sites with differing front-end latency (to the thin client) and back-end latency (to the back-end file servers and applications) is the optimal choice for a given user, and show experimentally that the model produces accurate predictions for all the workloads considered.

2 RESPONSE TIME MEASUREMENT

In this section, we describe the measurement environment we use to quantify the impact of VM placement on user-observable application response times. We evaluate a number of workloads and show that no single placement is optimal for all workloads. The goal of the measurements is to identify the characteristics of the application workload that dictate user-observable response time; these factors are then used to build the VM placement estimation model that we describe in Section 3.

2.1 Measurement Environment

Our measurement environment emulates a corporate use scenario, where users may be located around the world (either permanently or temporarily due to business travel), while the corporate data is stored centrally at the corporate file servers for security reasons (at the corporate headquarters, for example). While running applications that access the corporate data remotely is possible, the latency for interactive applications (e.g., typing) can become unsatisfactory. Therefore, we evaluate scenarios where a VM placed somewhere in the cloud runs the applications, the VM is accessed by the user using thin-client software (VNC), and the applications access the corporate data using appropriate file access protocols (e.g., NFS).
Figure 1 shows the configuration of the measurement environment, which consists of three machines: the thin-client (machine a), the VM (machine b), and the file server (machine c). Microsoft Windows 7 is installed as a guest OS on the Xen hypervisor on the VM machine (b). Machines (a) and (b) are set up as a thin-client/server system using VNC software. The user executes the application workloads on the VM from the VNC client on machine (a) via the VNC server on machine (b). On the file server (c), the files accessed by the VM are placed on a Samba server and accessed using the SMB2 protocol. The VM's OS disk image is also located on the file server machine (c) and is accessed via the NFS protocol. Network emulator software is configured on the VM machine to control both the network latency and bandwidth between the thin-client (a) and the VM machine (b), as well as between the VM machine (b) and the file server (c). We use one physical 100 Mbps Ethernet network interface card (NIC) running netem, a network emulation facility in the Linux kernel controlled by the command-line tool tc (traffic control), which is part of the iproute2 package.

Figure 1: Measurement Environment.

We configure four pseudo NICs using the one physical NIC, one each for incoming and outgoing traffic to the thin-client machine (a) and to the file server machine (c). Both the latency and the available bandwidth can be set separately for each of these four connections. Two packet monitors are used in the measurement configuration. One, on the thin-client machine (a), captures all packet traffic on the front-end network and measures the response time for the packets on the thin-client side. The other, on the VM machine (b), records all packets sent in both directions between the VM machine (b) and the file server (c). The Wireshark tool is used as the packet monitor. Note that since packet monitoring is done below the VNC client, the measured response time does not include the time between when the user executes an action (e.g., presses a key) and when the VNC client actually sends the corresponding event from the thin-client machine to the VM machine. Similarly, it does not include the time between when the screen update message is received at the thin-client machine and when the image is drawn to the screen for the user to see.

Application Workload Scenarios

The application workloads (see Table 1) are executed manually by a user at the thin-client machine (a) using the normal Windows 7 graphical user interface and applications, presented to the user as a virtual desktop by the VNC client software. The application software actually executes on the VM machine (b) and accesses files located on the file server machine (c). Our goal is to measure some typical business use scenarios. The Typing workload involves typing 150 characters into an MS Word file located on the Samba file server. The response time is measured for each keystroke separately, and thus the user think time between keystrokes does not affect the measurement.
The File Download workload copies an 18 MB PowerPoint (PPT) file from the Samba file server to the VM. The File Open workload opens a 20 MB PowerPoint file located on the file server (Samba server). The File Search workload involves searching for all files with the letter a in their file name on the Samba file server. In our experiment, 3109 files match this criterion out of a total of 7000 files. Finally, the Play Movie workload downloads a 16 MB Windows Media Video (wmv) file from the Samba file server and plays it to completion in Windows Media Player.

Measurement Procedure

Each workload scenario is repeated and measured five times following the procedure described below, and the average response time is computed and reported.

1. Reboot the VM and hypervisor on machine (b) to ensure identical initial system status.
2. Set the chosen emulated network configuration using netem at machine (b).
3. Start packet capture using Wireshark at machines (a) and (b).
4. Execute the workload manually via the thin-client application.
5. Stop packet capture and network emulation.
6. Analyze the response time from the captured packets.
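Step 2 of this procedure can be sketched as follows. This is a minimal sketch: the interface names, the `replace` form of the tc invocation, and the use of netem's built-in rate option are assumptions about the setup, not details given above.

```python
# Sketch of step 2: building the tc/netem commands that emulate the
# latency (and optionally bandwidth) of one emulated link. Interface
# names and the "rate" option are assumptions, not measured setup details.
import subprocess
from typing import List, Optional

def netem_cmds(iface: str, delay_ms: int, rate_mbit: Optional[int] = None) -> List[str]:
    """Return the tc command that adds a one-way delay (and, optionally,
    a bandwidth cap) to the given pseudo NIC."""
    cmd = f"tc qdisc replace dev {iface} root netem delay {delay_ms}ms"
    if rate_mbit is not None:
        cmd += f" rate {rate_mbit}mbit"  # netem rate shaping (kernel >= 3.3)
    return [cmd]

def apply_cmds(cmds: List[str]) -> None:
    """Apply the commands; requires root privileges on the VM machine."""
    for c in cmds:
        subprocess.run(c.split(), check=True)

# Example: 10 ms front-end delay at 1 Mbps, 40 ms back-end delay (no cap).
front = netem_cmds("eth0.front", 10, 1)
back = netem_cmds("eth0.back", 40)
```

Building the command strings separately from applying them makes step 5 (tearing down the emulation) a matter of issuing the corresponding `tc qdisc del` commands.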
Table 1: Application Workload Scenarios.

Case  Workload       Overview
1     Typing         Typing 150 characters into an MS Word file on the file server.
2     File Download  Open the Samba directory, copy a PPT file from the directory, and paste it on the VM.
3     File Open      Open a PPT file in the Samba directory.
4     File Search    Search for files that include a in the file name.
5     Play Movie     Play a movie stored in the Samba directory.

We use two measurement approaches depending on the workload. For quick interactive workloads like typing, we measure the response time to individual events (key press, mouse click) from the time the event message is sent from the thin-client to the VM to the time the first response message is received at the thin-client from the VM. For workloads involving long-duration actions (e.g., file open or search), we measure the whole duration of the action; that is, we measure the time difference between the user event (e.g., the user clicking the OK button) and the last packet received from the VM at the thin-client. The response time is analyzed from the packets captured with Wireshark on the thin-client side. A user event such as a key press and the first received packet are identified as the VNC packet corresponding to the event and the first packet from the VM captured by Wireshark after the user event packet was sent. The last received packet is the first packet sent from the VM to the thin-client whose payload size is less than the maximum packet size, the explicit assumption being that the VNC protocol sends only full packets (of the maximum allowed size) while it still has data to send, and thus only the last packet will not be full size. The back-end bandwidth is fixed at 100 Mbps for all of the measurements, assuming that the back-end bandwidth is always sufficiently high between the cloud sites and the corporate data centers where the file servers are located.
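The per-event timing and the last-packet heuristic described above can be sketched as follows. The trace format and the maximum payload value are assumptions for illustration, not details taken from the measurement setup.

```python
# Minimal sketch of the response-time extraction described above.
# The trace format (timestamp_s, direction, payload_len) and the
# MAX_PAYLOAD value are assumptions for illustration.
MAX_PAYLOAD = 1448  # assumed "full" packet payload size on this link

def keystroke_response_time(packets, event_ts):
    """Quick interactive event: time until the first VM-to-client packet
    received after the user event."""
    for ts, direction, _size in packets:
        if ts > event_ts and direction == "vm_to_client":
            return ts - event_ts
    return None

def action_response_time(packets, event_ts):
    """Long-duration action: time until the first VM-to-client packet
    smaller than the maximum packet size, i.e., the last packet of the
    response burst under the full-packets assumption."""
    for ts, direction, size in packets:
        if ts > event_ts and direction == "vm_to_client" and size < MAX_PAYLOAD:
            return ts - event_ts
    return None

# Hypothetical capture of one user event and its screen-update burst.
trace = [
    (0.00, "client_to_vm", 96),    # key-press event leaves the thin-client
    (0.03, "vm_to_client", 1448),  # full-size screen-update packets
    (0.05, "vm_to_client", 1448),
    (0.06, "vm_to_client", 400),   # first non-full packet: end of burst
]
print(keystroke_response_time(trace, 0.0))  # 0.03
print(action_response_time(trace, 0.0))     # 0.06
```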
The front-end bandwidth is varied among 256 Kbps, 1 Mbps, and 100 Mbps to emulate different connection speeds between the user and the cloud sites. For all of the measurements, the resolution of the VNC thin-client is set to 1024 x.

2.2 Measurement Results

In this section, we present the measurement results and explain their implications for VM placement.

Case 1: Typing

The response time is measured as the time difference between the user event (key press) and the first packet received from the VM machine. This means that we get 150 response time measurements for each execution of the workload. Figure 2 shows the results of executing the typing workload on the VM. The different emulated cloud locations are shown using different colors labeled as x ms - y ms, denoting that the one-way front-end latency (between the thin-client machine (a) and the VM machine (b)) is x ms and the back-end latency (between the VM machine (b) and the file server machine (c)) is y ms. Here we consider the scenario where the user is 50 ms of network latency away from the corporate data center (with the file server) and we have cloud locations within 10 ms, 20 ms, 30 ms, and 40 ms available to choose from between the user and the file server. The x-axis denotes the response time and the y-axis denotes how often that response time is observed in the set of 150 measurements for the experiment. Figure 3 shows the average response time for each user event (typed character) in the different emulated cloud locations under different network emulation conditions. The separate lines denote the different bandwidth allocations for the front-end and back-end links.

Figure 2: Typing Response Time.

The results show that for this workload, the closer the VM is to the thin-client, the better the response time obtained.
For example, in the case where the front-end latency is 0 ms (and the back-end latency is 50 ms), meaning the VM is co-located with the thin-client, the mean response time is around 0.03 s. Note that some response times differ significantly from the mean for the experiment. Some of the
exceptionally short response times are due to the fact that the next user event occurs before the previous response is received, resulting in mismatching of events and responses. However, the correct matchings dominate, as shown by the tall spikes in the response time histogram. The figure shows that the response time is dictated solely by the front-end latency. Specifically, the difference between the graphs for the different experiments matches the increase in round-trip latency on the front-end link. For example, the mean for one experiment is around 0.06 s while the mean for another is around 0.08 s, and the front-end network round-trip latency difference between these experiments is 0.02 s (20 ms). The result is not unexpected given that the back-end link is not used during the workload execution (except for potential autosave). The results of this workload are representative of most interactive workloads where the user interacts with the application and the application state is maintained in the memory of the VM: for example, developing a presentation, entering numbers in a spreadsheet, drawing, reading a document, or viewing a presentation.

Figure 3: Average Typing Response Time.

Case 2: File Download

The response time is measured as the time difference between the mouse click user event that selects the paste item in the menu displayed on the screen and the last received packet. Screen updates are sent by the OS running on the VM to update the progress bar, which shows the status of the file download. Figure 4 shows the mean response times out of 5 experiments for each point in the graph. As expected, the back-end latency dictates the end-to-end completion time for this workload. The unexpected spike can likely be attributed to measurement noise; logically, the response time improvement should be continuous as the VM is moved closer to the file server.

Figure 4: File Download Response Time.

Case 3: File Open

The response time is measured as the time difference between the enter key press event that opens the file after the file has been selected, and the last packet received from the VM machine (file opened and presented on the screen). Figure 5 shows the mean response times out of 5 experiments for each point in the graph. As can be expected from the nature of the workload, lower back-end latency results in a lower total response time (from click to the file appearing on the screen). Thus, the closer the VM is located to the file server, the better the response time. Note that reducing the available front-end bandwidth has a significant impact on the response time, especially when the front-end latency is large. This is because the opened file is presented on the screen only when the open is completed. The behavior of this workload scenario would be typical for any operation involving copying, moving, or saving large files to/from the file server.

Figure 5: File Open Response Time.

Case 4: File Search

The response time is measured as the time difference between the mouse click event that starts the search and the last received packet. The files found to match the search criterion are shown on the screen and thus result
in screen update messages from the VM machine to the thin-client machine. Figure 6 shows the response times for the file search case. Unsurprisingly, the end-to-end completion time decreases when the VM is placed closer to the file server. Note that the front-end bandwidth does not make any significant difference in the response time, either because the workload does not utilize the front-end link heavily or because the screen updates are performed asynchronously.

Figure 6: Desktop Search Response Time.

Case 5: Play Movie

The response time is measured as the time difference between the enter key press event that starts playing the movie file after the file has been selected, and the last packet received from the VM machine (file downloaded and played on the screen). Figure 7 shows the mean response times out of 5 experiments for each point in the graph. All experiments use 100 Mbps links for the front-end and back-end communication. We observed that the total number of bytes transmitted from the VM to the thin-client was up to times larger than the number of bytes transmitted from the file server to the VM. This difference is due to the fact that the video file is stored in a more compact form than the screen updates that need to be transmitted to the thin-client. However, we also observed that the VNC player adapts to the front-end latency: when the latency is large, it sends fewer screen updates. The overall result is that lowering the back-end latency (i.e., placing the VM close to the file server) results in a lower total completion time.

Figure 7: Movie Download and Play Time.

2.3 Discussion

Overall, our measurement results match intuition. If there is only communication between the front-end and the VM (case 1), the closer the VM is to the thin-client, the better. Note that even though we only presented one use case with this characteristic, the majority of typical user-computer interactions (e.g., typing, data entry, drawing, reading) match this pattern. If there is a lot of communication between the VM and the back-end and much less communication between the VM and the thin-client (cases 2, 3, and 4), placing the VM closer to the back-end improves completion time significantly. As expected, file operations (open, save, download) match this pattern. However, we also found that simply looking at the number of bytes transmitted on the front-end and back-end links is not sufficient to determine the optimal location (case 5), and sometimes the bandwidth on the front-end link makes a difference (case 3) while at other times it does not (case 4). As mentioned in Case 5, VNC exhibits significant adaptive behavior with regard to the front-end link latency and bandwidth. Figure 8 illustrates this for use case 2 (File Download). The figure shows the number of bytes transmitted from the VM to the thin-client under different network settings. Note that when the VM is close to the thin-client and there is a lot of available bandwidth, VNC transmits orders of magnitude more data than when the latency is high or the bandwidth is limited.

Figure 8: VNC adapts to available front-end latency and bandwidth.
When combined, these measurements imply that there are a number of workload characteristics that have implications for the user-observable end-to-end performance of the system under different VM placement conditions. These characteristics can be estimated using a number of I/O metrics, in particular, the total amount of data transmitted on the front-end and back-end links as well as the number of requests issued from the thin-client to the VM and from the VM to the back-end. In real usage, the user will perform operations that fall into both patterns (e.g., open file, edit, save), and the placement has to be optimized in a holistic manner. In the next section, we propose a VM placement estimation model that considers these I/O metrics and gives a ranking of possible VM placement locations.

3 VM PLACEMENT ESTIMATION MODEL

In this section, we explain a scenario using our proposed VM placement estimation model and describe the model.

3.1 Scenario

We assume the user (running the thin-client software) travels extensively around the world while needing access to the VM to run his applications and to the back-end server to access corporate files and other data sources. We assume a number of cloud locations are available for the system to place the VM so that the performance as observed by the user is optimized. The challenge for the system is to choose the best location out of the available locations when the user logs on with his thin-client software. Our VM Placement Estimation (VPE) model provides a ranking of the available locations using the user's workload history. The workload history is collected and updated every time the user connects from the thin-client software to the VM. Specifically, we assume the typical use scenario consists of the following steps:

1. The user travels to a new location.
2. The user launches the thin-client terminal.
3.
The system determines the user's location and the available cloud locations for his VM, determines the best location using the VPE model, and moves the VM to the chosen optimal location.
4. The user executes his applications and workloads on the VM.

Furthermore, if the user's workload pattern changes from the historical one enough that a new location becomes more optimal than the current one, the VM may be migrated live while the user is using it.

3.2 VPE Model

The VPE model considers four user metrics and four location metrics for each possible cloud location. The user metrics are continuously collected for the user and thus characterize his normal usage behavior. Specifically, they are:

E_f: Number of requests from the thin-client to the VM in a time unit.
B_f: Number of bytes transmitted from the VM to the thin-client in a time unit.
E_b: Number of requests from the VM to the file server in a time unit.
B_b: Number of bytes transmitted from the file server to the VM in a time unit.

The location metrics can be measured for each cloud location. They are:

L_f: Round-trip latency between the thin-client and the cloud location.
BW_f: Bandwidth between the thin-client and the cloud location.
L_b: Round-trip latency between the cloud location and the back-end server.
BW_b: Bandwidth between the cloud location and the back-end server.

The latter two metrics for available cloud locations can be measured ahead of time. The first two may need to be measured when the user starts the thin-client software. Our ranking model reflects the typical communication pattern when a thin-client accesses the VM, which in turn accesses the file server. This access consists of two phases of request-response computation: the first phase is between the thin-client and the VM, and the second is between the VM and the file server.
Therefore, the response time observed by the user is a combination of the response times on the front-end and back-end links, and thus the total score S_T(i) for a cloud location i is a combination of the score for the front-end link between the user and cloud location i, S_f(i), and the score for the back-end link between cloud location i and the file server, S_b(i). Our model uses a simple summation of these scores; the smaller the score, the better.

S_T(i) = S_f(i) + S_b(i)
The front-end score S_f(i) is calculated using the observation that the thin-client typically sends user events (e.g., key presses or mouse clicks) sequentially and synchronously, and therefore transmitting E_f events takes on the order of E_f * L_f time units. The screen updates (e.g., those resulting from the events) are typically larger and may be limited by the network bandwidth (B_f / BW_f). Thus, we calculate S_f(i) using the following formula:

S_f(i) = E_f * L_f + B_f / BW_f

Similarly, we calculate the back-end score using the formula:

S_b(i) = E_b * L_b + B_b / BW_b

Note that the goal of the ranking formula is not to estimate the actual response time but to generate a score that accounts for the factors that affect the end-to-end response time given the user and location metrics. There are multiple factors not considered by the model, including processing time at the VM and file server, the adaptive behavior exhibited by VNC, and any asynchrony in communication over the front-end and back-end links.

3.3 Evaluation

We evaluated the VPE scoring formula using the 5 case studies reported above. For each of the executions, we recorded the total number of requests from the thin-client to the VM (E_f), the total number of requests from the VM to the file server (E_b), the total number of bytes from the VM to the thin-client (B_f), and the total number of bytes from the file server to the VM (B_b) for the duration of the experiment.

Table 2: Ranking accuracy: Case 1 (typing), 100M-100M.
Location  RT(s)  VPE  RT rank  VPE rank

Table 2 presents the accuracy of the VPE ranking for use case 1 (typing) with the front-end link bandwidth set at 100 Mbps. The first column gives the cloud location. For example, 0-50 denotes a location where the VM is co-located with the thin-client (0 ms latency between them) but the file server is 50 ms away from the VM.
The second column (RT(s)) gives the measured end-to-end response time, the third column (VPE) gives the raw VPE score for the location, the fourth column gives the rank based on the measured response time, and the final column gives the rank based on the VPE score. As we can see, the VPE-based ranking matches the measurement-based ranking for all locations. The other bandwidth settings behaved similarly, with the VPE rank matching the measurement rank perfectly. Table 3 presents the results for use case 2 (download). Note that the VPE score correctly ranks the last location (with 50 ms front-end latency and 0 ms back-end latency) as the best location out of the given 6, matching the ranking based on the actual measurements. The VPE rank matches the measurements in all the locations except the first two, where the VPE score considers location 10-40 better than location 0-50, while the RT measurements rank these locations in the opposite order. We suspect that the measurements for the first two locations may have experienced some noise, and it is likely that the VPE-based ranking is actually more correct than the measurement-based ranking for these two locations.

Table 3: Ranking accuracy: Case 2 (download).
Location  RT(s)  VPE  RT rank  VPE rank

Table 4 presents the results for use case 3 (open). The results for all the network bandwidth configurations are similar in terms of ranking accuracy, but for brevity we only present the results for the case where the front-end link is limited to 256 Kbps. In these experiments, the VPE ranking perfectly matches the ranking based on actual measurements.

Table 4: Ranking accuracy: Case 3 (open), 256K-100M.
Location  RT(s)  VPE  RT rank  VPE rank
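The scoring and ranking procedure evaluated above can be sketched as follows. The class names and all metric values in the example are illustrative assumptions; they are not the measured values behind Tables 2-4.

```python
# Sketch of the VPE scoring model (Section 3.2) and location ranking.
# The profile and site parameters below are illustrative, not measured data.
from dataclasses import dataclass
from typing import List

@dataclass
class UserMetrics:
    E_f: float  # requests thin-client -> VM per time unit
    B_f: float  # bytes VM -> thin-client per time unit
    E_b: float  # requests VM -> file server per time unit
    B_b: float  # bytes file server -> VM per time unit

@dataclass
class Location:
    name: str
    L_f: float   # round-trip latency thin-client <-> cloud site (s)
    BW_f: float  # front-end bandwidth (bytes/s)
    L_b: float   # round-trip latency cloud site <-> back-end (s)
    BW_b: float  # back-end bandwidth (bytes/s)

def score(u: UserMetrics, loc: Location) -> float:
    """S_T(i) = S_f(i) + S_b(i); the smaller the score, the better."""
    s_f = u.E_f * loc.L_f + u.B_f / loc.BW_f  # front-end score
    s_b = u.E_b * loc.L_b + u.B_b / loc.BW_b  # back-end score
    return s_f + s_b

def rank(u: UserMetrics, locations: List[Location]) -> List[Location]:
    """Rank candidate cloud locations from best to worst."""
    return sorted(locations, key=lambda loc: score(u, loc))

# A download-heavy profile (many back-end requests and bytes) should
# rank the site co-located with the file server first.
user = UserMetrics(E_f=5, B_f=2e5, E_b=200, B_b=2e7)
sites = [
    Location("0-50",  L_f=0.000, BW_f=12.5e6, L_b=0.100, BW_b=12.5e6),
    Location("25-25", L_f=0.050, BW_f=12.5e6, L_b=0.050, BW_b=12.5e6),
    Location("50-0",  L_f=0.100, BW_f=12.5e6, L_b=0.000, BW_b=12.5e6),
]
print(rank(user, sites)[0].name)  # 50-0
```

In a deployment, `UserMetrics` would be populated from the per-user workload history and `rank` re-run at each login (and periodically thereafter) to decide whether a live migration is warranted.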