On the Prevention of Cache-Based Side-Channel Attacks in a Cloud Environment

On the Prevention of Cache-Based Side-Channel Attacks in a Cloud Environment

by

Michael Godfrey

A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science

Queen's University
Kingston, Ontario, Canada
September 2013

Copyright © Michael Godfrey, 2013

Abstract

As Cloud services become more commonplace, recent works have uncovered vulnerabilities unique to such systems. Specifically, the paradigm promotes a risk of information leakage across virtual machine isolation via side-channels. Unlike conventional computing, the infrastructure supporting a Cloud environment allows mutually distrusting clients simultaneous access to the underlying hardware, a seldom-met requirement for a side-channel attack. This thesis investigates the current state of side-channel vulnerabilities involving the CPU cache, and identifies the shortcomings of traditional defences in a Cloud environment. It explores why solutions to non-Cloud cache-based side-channels cease to work in Cloud environments, and describes new mitigation techniques applicable for Cloud security. Specifically, it separates canonical cache-based side-channel attacks into two categories, Sequential and Parallel attacks, based on their implementation, and devises a unique mitigation technique for each. Applying these solutions to a canonical Cloud environment, this thesis demonstrates the validity of these Cloud-specific, cache-based side-channel mitigation techniques. Furthermore, it shows that they can be implemented, together, as a server-side approach to improve security without inconveniencing the client. Finally, it conducts a comparison of our solutions to the current state-of-the-art.

Acknowledgments

I would like to thank my supervisor, Professor Mohammad Zulkernine, for supporting me and letting me stumble unrestrained through the course of this thesis. Not many supervisors would be comfortable letting their students work in foreign territory with the only restriction being that it should have something to do with security or reliability.

I would like to thank my family, who supported me in my decision to do an MSc, and would still have done so had I not. They always showed interest, even if some of them still think Cloud Computing is something that happens vis-à-vis the sky.

More technically, I would also like to thank Yunjing Xu [35] for his discussion and help in the creation of a canonical cache-based side-channel attack for the purpose of our evaluation. Without his guidance I might still be figuring out how to get the attack working, rather than how to prevent it.

Table of Contents

Abstract
Acknowledgments
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
    1.1 Motivation
    1.2 Contributions
    1.3 Organization of Thesis
Chapter 2: Background
    2.1 The Cloud Model
        2.1.1 Virtual Machines
        2.1.2 The Xen Hypervisor
    2.2 CPU Cache
    2.3 Side-Channel Attacks
        2.3.1 Cache Channel Attacks
    2.4 Summary
Chapter 3: Related Work
    3.1 Side-Channels
    3.2 Side-Channels in the Cloud
    3.3 Cache Colouring
    3.4 Summary
Chapter 4: Side-Channel Attacks
    4.1 Sequential Side-Channel
    4.2 Parallel Side-Channel
    4.3 Summary
Chapter 5: Selective Cache Flushing
    5.1 Overview
    5.2 Expected Issues and Mitigations
        5.2.1 Reduced Cache Usability
        5.2.2 Cache Flushing Overhead
    5.3 Cache Flushing Technique
        5.3.1 Decision Algorithm
        5.3.2 Flushing Function
    5.4 Experimental Evaluation
        5.4.1 Objectives
        5.4.2 Environment
        5.4.3 Side-Channel Prevention
        5.4.4 Performance Experiments
        Results
    Result Analysis
        Side-Channel Prevention
        Performance Analysis
    Summary
Chapter 6: Cache Partitioning
    Parallel vs Sequential Side-Channels
    Overview
    Expected Issues and Mitigations
    Cache Partitioning Technique
        Cache Colouring
    Experimental Evaluation
        Objectives
        Environment
        Side-Channel Prevention
        Performance Experiments
        Results
    Result Analysis
        Side-Channel Prevention
        Performance Analysis
    Summary
Chapter 7: Conclusions and Future Work
Bibliography

List of Tables

5.1 Sun Context Switch Timings (Assumes 200 Switches per Second)
5.2 IBM Context Switch Timings (Assumes 200 Switches per Second)

List of Figures

2.1 Architecture of the Xen Hypervisor
2.2 The Cache Hierarchy
4.1 The Prime+Trigger+Probe Technique
4.2 The Prime+Trigger+Probe Technique (Parallel Adaptation)
5.1 Our Solution's Effect on the PTP Technique (Sequential)
5.2 The Apache benchmark on the Sun machine with varying number of VMs
5.3 The Apache benchmark on the IBM machine with varying number of VMs
5.4 The 7zip benchmark on the IBM machine with varying number of VMs
5.5 The 7zip benchmark on the Sun machine with varying number of VMs
5.6 Latency workloads executed on Xen Hypervisor with varying Timeslice values
5.7 Computationally-heavy workloads executed on Xen Hypervisor with varying Timeslice values
5.8 Latency workloads executed on Xen Hypervisor with varying Ratelimit values
5.9 Computationally-heavy workloads executed on Xen Hypervisor with varying Ratelimit values
5.10 Latency workloads executed on Xen Hypervisor with varying number of VMs
5.11 Computationally-heavy workloads executed on Xen Hypervisor with varying number of VMs
6.1 The Effect of Partitioning the Cache on the PTP Technique (Parallel)
6.2 Mapping of a Memory Address in the Cache
6.3 Apache Benchmark
6.4 Cachebench Benchmark
6.5 Cache Timing Benchmark (Cached)
6.6 Cache Timing Benchmark (Flushed)
6.7 Cache Timing Benchmark (Flushed - Cached)
6.8 VM Boot Times based on Number of Partitions

Chapter 1
Introduction

1.1 Motivation

As a new design paradigm, Cloud computing brings with it a unique set of features and vulnerabilities. Specifically, the Cloud introduces the concept of mutually distrusting co-resident clients as a valid execution state. This is a relatively new concept in computing. Few systems in the past have had to account for malicious action among their own users, on the same hardware, while still providing each user with seemingly full machine access. As one might imagine, providing co-residence for clients has brought to light a new set of vulnerabilities in the paradigm. Chief among these is an attacker's use of hardware side-channels to gain information about data and functionality that they should not, by design, have access to. Such attacks exploit co-resident systems by inferring software functionality from observed hardware phenomena. This allows an attack to be performed in any context where the attacker and victim have access to the same hardware, so long as proper safeguards are not in place.

While side-channel attacks have existed in the past [21], the novel co-residency feature of Cloud computing makes them particularly effective in this context. Because the attacker is no longer required to gain unlawful, or otherwise restricted, access to the victim's hardware, co-residency essentially bypasses the first, and most effective, defence against such attacks.

Since a side-channel requires the exploitation of a specific piece of hardware, each solution must also be adapted specifically for that hardware channel. This allows us to classify side-channel attacks and defences based on the hardware medium they exploit. The CPU cache is one of the most frequently used pieces of shared hardware, and often deals with sensitive data. This makes it one of the most common targets for use in a side-channel attack, as it can more easily be used to extract useful data at a high rate. An attack made over this channel is referred to as a cache-based side-channel attack.

Cloud computing is becoming more popular, but the number one concern of technologists heading to the Cloud is security [11]. CPU cache-based side-channel attacks are currently believed to be the most dangerous among side-channel attacks. In recent years, there have been multiple publications about Cloud-specific vulnerabilities and exploits, specifically the use of side-channels to bypass the virtualization technology used in Cloud systems [26, 36, 33, 6]. Among other exploits, they have recently been used to extract private keys in a Cloud environment [37] and yield high-bandwidth information leakage [35]. Because both attacks and defences are specific to the hardware medium being exploited, the most significant and robust threat comes from CPU-cache-based attacks. Because of this, the scope of this thesis is limited to side-channel attacks which exploit
the CPU cache.

In response to these attacks, there have been publications attempting to mitigate such situations. While beneficial, these solutions require the client to modify their own software to work with the proposed technology [13] [27], or to modify the underlying hardware [22]. From studying the Cloud, we believe that this restriction is both hindering to the user and unnecessary. We argue that the Cloud's relationship with its users and hardware, referred to in this thesis as the Cloud model, prevents such solutions from being implemented. Therefore, a new, server-side defence against such attacks in the Cloud is necessary.

1.2 Contributions

Our goal in this thesis is to provide a defence capable of preventing cache-based side-channels in the Cloud while not interfering with the Cloud model. Using the code base of an open-source hypervisor, Xen [15], we demonstrate how to inhibit cache-based side-channels from occurring within a Cloud server. In addition, we validate the system against canonical attacks while demonstrating that they can be prevented without client-side or hardware modifications. We then compare our system to an established Cloud provider, demonstrating that such methods can be practically implemented in a modern Cloud system. Special attention is paid to the issues and configurations expected to be heavily impacted by the defences' implementations. The major contributions of this thesis are listed as follows:

1: A study of cache-based side-channels in a Cloud environment. This includes a categorization of existing attacks into the Sequential and Parallel types.

2: Two server-side defences against cache-based side-channels. One focuses on sequential side-channels [9], and includes a technique to prevent the side-channel's occurrence as well as an algorithm designed to implement the technique. The algorithm applies the solution in a minimalistic fashion to help minimize the resulting overhead. The second focuses on parallel side-channels, and uses a cache colouring technique to prevent their occurrence and to improve cache efficiency in certain situations.

3: An implementation and validation of the above defences. We implement the above defences and demonstrate their ability to prevent cache-based side-channel attacks in an experimental Cloud environment. The defences are evaluated by running them against attacks designed to be representative of previously published attacks of the cache-based side-channel genre.

1.3 Organization of Thesis

The side-channels studied typically come in two varieties that each require a unique solution. As a result, this thesis details and approaches the problems of Sequential Cache-Based Side-Channels and Parallel Cache-Based Side-Channels in two separate chapters. The chapters that share information about both varieties of channels tend to first describe details that concern one variety, then the other. The chapters explaining our solutions to each variety of channel tend to be dedicated to only that variety.

The remainder of this thesis is organized as follows: Chapter 2 describes aspects of the Cloud, Cloud technologies, side-channels, and related software techniques, and how they relate to side-channel attacks and the prevention thereof. Chapter 3 summarizes
related work regarding side-channels, their use in the Cloud, and their prevention. It also describes programming techniques, such as cache colouring, that we make use of in our solutions. Chapter 4 details the attacks we attempt to prevent, following how they are traditionally implemented and what system attributes they exploit. Chapter 5 focuses on our solution to sequential side-channels. It details how our solution solves the problem, how it is implemented, and our experimental results as to its validity and performance. Chapter 6 describes our solution to parallel side-channels. It also describes how our solution solves the problem, its implementation, and its experimental validity and performance. Finally, Chapter 7 concludes the thesis.

Chapter 2
Background

This chapter details the form and function of the modern Cloud system and its susceptibility to side-channel attacks. It describes the Cloud, its architecture, and a common open-source implementation that will be used to implement our solution. It also explains side-channel attacks, specifically those involving the CPU cache, and their specific threat to Cloud environments.

2.1 The Cloud Model

The idea of Cloud Computing revolves around the pooling of multiple, large-scale computing resources into a single, abstract entity, commonly referred to as the Cloud. This construction allows multiple clients concurrent access to these resources. Conceptually similar to the Mainframe paradigm of decades past [10], Cloud computing differs in its extensive use of networking technologies. It typically focuses on using many machines, composed of lower-grade canonical hardware, rather than fewer, more
powerful machines. Because the underlying machines are discrete, complex software technologies need to be used in the Cloud to abstract these individual machines into a single, dynamically manageable resource.

The main objective of Cloud computing is to allow clients to outsource their hardware needs. Cloud providers take advantage of the versatile nature of the Cloud to provide clients with computational resources when needed, and allow them to be used by other clients when free. The paradigm has often been compared to that of a utility grid, and has sometimes been referred to as Utility Computing.

As a paradigm, Cloud computing has developed a specific relationship with its users and underlying hardware that we will refer to in this thesis as the Cloud Model. The model highlights two key points that have become commonplace in Cloud systems. First, users will often run canonical software on the Cloud and, as such, may not have the permission, or the knowledge, to modify the software they intend to run. Second, a Cloud system is typically built from canonical hardware so that it can be easily expanded and maintained. It is our belief that any modifications made to the Cloud must preserve these two qualities so as to maintain the practicality of the Cloud. We therefore define a solution as adhering to the Cloud model if:

1: It does not require any software modifications by the client or on the client end of the interface.

2: It does not require any modifications to the underlying hardware.

If a solution matches both points then we can say it complies with the Cloud model and can be applied to a current Cloud system without interfering with the established functionality.

2.1.1 Virtual Machines

Virtual Machine (VM) technology is the backbone of the modern Cloud implementation. A virtual machine system functions by emulating a physical machine within a software program. Essentially, virtual machine software acts as middleware between the programs being executed and the underlying hardware. It acts as one or more execution environments within which the programs may run. The real benefit of virtual machine software is that the middleware is (ideally) completely transparent. Each executing program has the illusion of running on a single isolated system. Even if there are actually twenty such systems being emulated in tandem, each one acts within the constraints of a single conceptual machine. It is through such technology that Cloud systems are viable, as they can virtually distribute the resources of a single machine between an arbitrary number of clients. Each client has the impression of being given sole access to a smaller, isolated machine.

A virtual machine system is typically composed of two parts: the domains and the hypervisor. The domains are the machine-emulating instances. Each domain will typically run its own version of an operating system and emulate a distinct physical machine. Any process executing within a domain has no access to, or knowledge of, anything outside of that domain. From its perspective, the domain and its emulated OS compose the entire system.

The hypervisor of a virtual machine system (also known as the virtual machine monitor) is where the real complexity of the system lies. It acts as a super-operating system, typically installed between the domains and the underlying hardware (referred to as a bare-metal install). It serves as an intermediary to manage each domain's access to the hardware. Most of the operations of a typical operating system can be
found in a hypervisor, including CPU scheduling and memory management. However, these operations function on a larger scale than in a regular operating system, as the processes being managed are themselves full-blown operating systems rather than smaller-scale processes.

2.1.2 The Xen Hypervisor

Xen is a specific, open-source implementation of a bare-metal hypervisor. At the time of writing, Xen is quite mature and is being used as the backbone for many established Cloud enterprises, including Amazon's EC2 [2]. The combination of its canonical use and its open-source access makes it a prime candidate for experimentation relating to hypervisors in the Cloud.

Xen functions as a combination of a bare-metal hypervisor (the Xen hypervisor), a single host domain (Dom0), and any number of guest domains (DomUs, or guests). Dom0 is a customized Linux OS capable of interacting with the hypervisor. It acts as the administration tool for the system. The DomUs represent the guest machines that clients will be using. The hypervisor controls basic operations, such as the CPU scheduling and memory management of the system. Complex functionality such as networking and I/O is handled by Dom0 after being redirected from the guests [15]. The guest operating systems are, of course, unaware of the context in which they are running. This type of software isolation has the bonus of acting as a security mechanism to separate one VM from another, working along much the same lines as sandboxing. Unfortunately, side-channels have been demonstrated to bypass this sort of isolation. A graphical interpretation of the Xen hypervisor and its chief components can be seen in Figure 2.1.

Figure 2.1: Architecture of the Xen Hypervisor

In this figure, we see a Xen system which includes the hypervisor, a Dom0, and two DomUs. This configuration represents a single physical server set up with two guest VMs running. The arrows represent communication between the subsystems via system calls/hypercalls (a hypercall is the hypervisor's equivalent of an operating system's system call). When the two guests require resources, such as memory or other machine access, they communicate with the hypervisor. When the administrator needs to communicate with the hypervisor, it is done via Dom0. If a guest needs to communicate with Dom0, it should be done through the hypervisor. Some systems allow exceptions to this organization and are referred to as para-virtualized systems. These systems allow certain direct communication on the guest's behalf for optimization purposes. By design, there should be no other channels of direct communication save for the standard networking capabilities of the guest machines. Ideally, there
should be no way for the guest machines to communicate with one another directly without using networking protocols.

2.2 CPU Cache

A CPU cache is a small section of memory built into the CPU. This memory acts as a medium through which any request for data must go. It serves to increase the speed of memory access for more commonly accessed data. Essentially, the cache is a section of memory that can be accessed much more quickly, but is much smaller, than main memory. Dedicating a small amount of memory to a cache can lead to massive speed increases, as CPUs often require frequent access to the same memory addresses. Keeping these frequently accessed data in the cache reduces the time needed to access them, and therefore increases the speed of the program.

A modern-day CPU contains multiple levels of cache dedicated to different purposes. The most common organization is depicted in Figure 2.2.

Figure 2.2: The Cache Hierarchy

In this figure, the data storage areas are organized from the top down by decreasing speed of access and by increasing size. This puts the smallest and fastest storage mechanisms at the top, and the slowest and largest at the bottom. When data are required by the CPU (for the registers), they are first requested from the L1 cache. The L1 cache is unique in that it is often divided into separate instruction and data caches. The speed requirements for this cache, as it will be accessed most often, mean that it is usually virtually indexed. This means that the mapping of a memory address to a cache position is determined by its virtual address (as viewed by a process) as opposed to its physical address (as viewed by the operating system). As a result, the cache is high speed, but mappings are not preserved between
contexts (different processes will have different mappings). Because these mappings are not preserved across contexts, there is little possibility for information leakage across the L1 cache. If the requested data are located in the L1 cache, this is called a cache hit, and the data are returned to the CPU at a very high speed (typically within a handful of cycles). If they are not located in the L1 cache, this is referred to as a cache miss. If a cache miss is experienced, the data are next looked for in the next level of the memory hierarchy; for an L1 cache miss, this would be the L2 cache. Once found (at any level), the data are propagated back through each level of cache that experienced a miss to populate a portion of that cache. In this way, a subsequent attempt to access those data should find them located in a closer, and therefore faster, level of the cache hierarchy. Because there is limited space in each cache, the incorporation of new data may remove some existing data from the cache. This removal is referred to as a cache eviction.

The L2 cache is typically much larger than an L1 cache and contains both instructions and data. Because the L2 cache is larger, and slower, than an L1 cache, it is most often physically indexed. This means that the mapping of memory addresses to cache locations is done via physical addresses and is preserved across contexts. Since the mapping is preserved, processes which have shared access to data may find those data located in the cache without first causing a cache miss. As a trade-off, this can cause information leakage between processes, as a process that finds data pre-located in the cache can infer that another process put it there.
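To make the mapping from memory addresses to cache locations concrete, the short sketch below computes which set of a physically indexed cache a given physical address falls into. It is an illustration only: the cache parameters are assumptions chosen for the example, not the parameters of any machine used in this thesis.

#include <stdint.h>
#include <stdio.h>

/* Illustrative cache parameters (assumed for this example). */
#define LINE_SIZE   64                      /* bytes per cache line */
#define CACHE_SIZE  (1024 * 1024)           /* 1 MB cache           */
#define WAYS        8                       /* associativity        */
#define NUM_SETS    (CACHE_SIZE / (LINE_SIZE * WAYS))

/* In a physically indexed cache, the set is selected by the address bits
 * just above the line-offset bits, so the mapping is the same no matter
 * which process (or VM) issues the access. */
static unsigned set_index(uint64_t phys_addr)
{
    return (unsigned)((phys_addr / LINE_SIZE) % NUM_SETS);
}

int main(void)
{
    uint64_t addr = 0x12345680;   /* an arbitrary example address */
    printf("address 0x%llx maps to set %u of %u\n",
           (unsigned long long)addr, set_index(addr), (unsigned)NUM_SETS);
    return 0;
}

Two addresses that map to the same set compete for the same cache lines; this shared, context-independent mapping is what allows one party's accesses to evict, and thus become visible to, another's.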

Often included in modern CPUs is a shared L3 cache. While the L2 and L1 caches have each been dedicated to a single CPU core, a shared L3 cache stores data from multiple cores simultaneously. The argued benefit of this scheme is that memory-intensive processes can make use of a much larger cache when the resources are available, allowing maximal usage of cache resources. This scheme is unique in that it allows multiple CPU cores access to a shared resource, and information leakage can happen much as in the L2 example. If a cache miss at the L3 (last-level) cache occurs, the requested data are sought in main memory.

2.3 Side-Channel Attacks

A side-channel (or covert channel) in a software program is a means of communication via a medium not intended for information transfer [22]. It typically involves correlating the higher-level functionality of the software with the underlying hardware phenomena. With an established correlation, these phenomena can be measured and analysed to infer what is occurring within the software program at a given time. While the measured phenomena vary with the specific properties of the hardware in question, any phenomenon that can be reliably correlated to the software's function can be used as a side-channel. Examples may include the timing of specific hardware functions, or the acoustic properties of a hardware device. Typically, the higher-rate hardware functions are more interesting to explore as side-channels because they can communicate information more quickly, and therefore can yield more details about the state of the program in execution. To this end, CPU cache-based side-channels typically receive the most attention, as they are one of the highest-rate measurable resources shared between processes [35] [33].

Traditionally, cache-based side-channels have been used to glean the functionality of closed-source systems and functions. Examples include the breaching of cross-VM
systems, and the breaking of the Advanced Encryption Standard (AES) and Data Encryption Standard (DES) encryption algorithms. An experiment in 2003 used a timing-based measurement system on the CPU cache channel to determine cache hits and misses in a DES encryption algorithm. By correlating the timing measurements with the execution of the algorithm, the authors were able to break the DES cypher in more than 90% of their attempts [31].

2.3.1 Cache Channel Attacks

In order for the CPU cache to be used in a side-channel attack, the cache must in some way be shared by the attacker and their target. A Cloud environment makes this condition particularly easy to achieve, as both the attacker and target can get access to the same physical machine. Typically, a CPU cache can be shared in two ways: either the cache is exclusive to one CPU core, in which case two processes must access it sequentially, or the cache is shared between CPU cores, in which case two processes can access it concurrently. We refer to the first type of attack as a sequential cache channel, and to the second as a concurrent, or parallel, cache channel. Research has been done into attacks for both classes of channel [33], with the former typically seen as more portable, as only some systems will allow parallel access to a cache. However, recent trends in hardware technology are seeing more and more CPUs outfitted with larger, shared, victim caches. These caches are designed to be shared among multiple cores and can be accessed in tandem.

While the techniques for attacking these two types of cache are quite similar, the hardware differences require dramatically different solutions. In order to address
both types of channels we have devised two solutions. For the sequential channel we apply a technique called Selective Cache Flushing, which can be found in Chapter 5. For parallel channels we apply a technique called Cache Partitioning, based on cache colouring, which can be found in Chapter 6. These two solutions can be implemented in conjunction to insulate a hypervisor against both types of side-channel.

2.4 Summary

This chapter details the relevant background information dealing with side-channels and Cloud technologies. We describe how the Cloud paradigm has affected traditional side-channel security and which attacks have become prominent or dangerous in the Cloud. We also describe what aspects of Cloud technologies have encouraged these changes and what may be done to solve such a problem. More specifically, we describe the current state of the Cloud and its relationship with its users. We emphasize the importance of maintaining this relationship when attempting to mitigate side-channel attacks in the Cloud. From this we impose constraints on our own solutions to make sure that this relationship is maintained.

Chapter 3
Related Work

This chapter details the related work on side-channels and techniques to mitigate them. It includes a brief summary of cache-based side-channels and their typical mitigation techniques. It then compares these techniques to the Cloud-based variations of the attacks and solutions. In addition, it summarizes work on other techniques that we incorporate into our design, including cache colouring.

3.1 Side-Channels

Previous work has demonstrated that cache-based side-channels can be exploited in the Cloud to glean information that service providers do not want users to have access to. Such cases include determining whether two machines are co-resident [26], and, more dangerously, the extraction of cryptographic private keys from unwary hosts [37]. The latter particularly demonstrates the severity of a side-channel attack in the Cloud and the potential for similar side-channel attacks when migrated to a Cloud environment [31] [21] [28].

Since the attention given to side-channels in non-Cloud environments in the 1970s [14], there have been many solutions proposed [21], [22]. Some of these include: altering the functionality of the hardware channel, disabling the hardware channel, or modifying the victim code to break the correlation between the program's execution and the hardware phenomena. While these solutions often succeed, all three are counter-productive to the Cloud model, which uses canonical hardware and whose clientele come from various strata of technical skill. Implementing any of these defences would require the customization of either all hardware intended for use in the Cloud, or else all software intended to be run in a Cloud environment. Both of these solutions conflict with the Cloud model, as they would either restrict the hardware requirements or the client skill level needed to use the Cloud. Cache flushing has been considered, but disregarded, as a solution to traditional cache-based side-channels because it is expected to generate large amounts of overhead [21].

Recent work at the Massachusetts Institute of Technology has focused on modifying Cloud hardware to prevent side-channels [7]. Their work attempts to obfuscate side-channel attempts by adding additional work to certain processes. Unfortunately, their hardware often decreases efficiency by factors of up to an order of magnitude. That these levels of overhead can be considered an acceptable trade-off for security emphasizes the need for high-efficiency solutions. Of course, the fact that they modify the hardware violates the Cloud model, as it would require every machine in the Cloud to be replaced with a modified system.

3.2 Side-Channels in the Cloud

Kim et al. have developed a solution for cache-based side-channels in Cloud systems [13]. In their solution, they prevent cache-based side-channels by giving each VM exclusive access to a sectioned portion of the cache they call a stealth page. Using a stealth page, the sharing of cache information is prevented by having each VM restore its context in that page before the execution of its time slice. In order to have software applications access these hidden pages, their solution requires the user to make client-side modifications to the software being executed in the guest VM. Because that software is typically canonical, however, we believe that most clients will either not have the access, or not have the technical skill, to modify the software they intend to run in the Cloud. This restriction demonstrates the need for a solution transparent to the Client.

For our solution, we implemented a purely server-side defence for cache-based side-channels in the Cloud. To make it fully compatible with the Cloud model, we impose the constraints that it both prevent cache-based side-channels between co-resident VMs, and that it require no modifications of the underlying hardware nor of the software being run on a VM. From this perspective, the solution should be both secure and invisible to the Client, as well as to the Cloud provider. Only the Cloud developer would be aware that such a solution is in place.

3.3 Cache Colouring

For our parallel cache channel solution, we use a technique known as Cache Colouring. Cache colouring has been used in the past primarily as an optimization scheme to
maximize the cache hit rate. Previous work by Tam et al. explored the use of cache colouring in environments where multiple processes shared a common cache [30]. In these experiments, they used a cache colouring technique to increase the cache hit rate by minimizing the number of cache data evictions occurring across processes. In our experiments, we use a similar technique to partition the cache at virtual-machine-level granularity. While some similar results are observed in the reduction of cache misses in certain cases, our priority was the security of the system. As a result, the technique was implemented with some variations to guarantee the security features, as opposed to the performance features, of the system.
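To make the idea concrete, the sketch below shows one common way of computing a physical page's cache colour. The cache and page parameters are assumptions chosen for illustration; they are not the parameters of the machines or of the implementation described later in this thesis.

#include <stdint.h>

/* Illustrative parameters (assumed for this example). */
#define PAGE_SIZE   4096
#define LINE_SIZE   64
#define CACHE_SIZE  (1024 * 1024)   /* 1 MB           */
#define WAYS        8               /* associativity  */

/* Number of distinct page colours: how many page-sized strides fit into
 * one way of the cache. */
#define NUM_COLOURS (CACHE_SIZE / (WAYS * PAGE_SIZE))

/* Pages with the same colour map onto the same cache sets, so giving each
 * owner (process or VM) its own set of colours keeps its cached data out
 * of the sets used by the others. */
static unsigned page_colour(uint64_t phys_frame_number)
{
    return (unsigned)(phys_frame_number % NUM_COLOURS);
}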

Two recent papers, by Kim et al. [13] and Shi et al. [27], attempt to use a form of cache partitioning to prevent cache-based side-channels. Both focus on reserving a small portion of the cache on a per-VM or per-core basis using cache colouring, which can dynamically be accessed by VMs when they need to execute a side-channel-proof process. The upside of these attempts is that they try to incur very minor overhead by maximizing each VM's access to the cache and by securing portions of the VM dynamically. The downside is that both of these solutions require the client programs to understand that they are running in a modified environment to take advantage of the security features. While minor, these alterations violate the Cloud Model. In addition, both solutions are focused on specific types of cache-channel attacks, such as attacks that focus on the S-Box of a cryptographic algorithm, and so assume only a small portion of memory needs to be protected at once. While this is understandable, as these have, to date, been the most dangerous cache-based attacks mounted on conventional systems, such an assumption means that these defences cannot prevent attacks that use a larger portion of the cache. For instance, work by Ristenpart et al. implemented a side-channel to detect co-residence in Cloud environments [26]. As their particular implementation attempts to maximize usage of the cache, reserving small portions of the cache would not prevent such a technique. Other work has been done attempting to partition the cache physically [23]. Of course, this also violates the Cloud Model, and an ideal solution would provide the same security using purely software means.

3.4 Summary

This chapter describes work related to side-channels, cache-based side-channel attacks in the Cloud, defences against them, and cache colouring. Included are details of the side-channel attacks that have successfully been launched in the Cloud and of works that have attempted to prevent such attacks. All of the work that has attempted to solve side-channels in the Cloud has done so by either altering the software being run by the client or the underlying hardware. We argue that, while successful, these solutions are not practical for general Cloud users and that, therefore, a new solution that conforms to the Cloud model is necessary. In addition, we describe work that has attempted to use cache colouring for security or efficiency purposes. We compare our own work with theirs and show that while they use a similar technique to manipulate the cache, they design their solution quite differently by isolating only small portions of the cache.

Chapter 4
Side-Channel Attacks

In order to discuss possible mitigations for a cache-based side-channel attack, it is necessary to define the specifics of the attack in question. While previous work has detailed attacks in local environments [21] [28], more recent work has specified a form of attack applicable to a Cloud system. This chapter describes the cache-channel attack that we are attempting to mitigate, including both the sequential and parallel variants.

4.1 Sequential Side-Channel

The first documented form of a cache-based side-channel attack explored in the Cloud was by Ristenpart et al. [26], who demonstrated the use of side-channels to verify virtual machine co-residence in Amazon's EC2. As part of their work, they explored the use of a previously identified cache-based side-channel technique, which they refer to as the Prime+Probe technique, in a Cloud environment. The result of such work is the Prime+Trigger+Probe (PTP) technique, illustrated in Figure 4.1, a variation
designed to work in Cloud environments.

Figure 4.1: The Prime+Trigger+Probe Technique

The purpose of their experiment, using the PTP technique, was to see if a cache-based side-channel could be established between two guest domains in the Cloud. The channel was established such that the first VM (referred to as the probing instance) could receive a message that the second VM (the target instance) encodes in its usage of the cache. The basic version of the PTP technique shown in Figure 4.1 is an example of a sequential side-channel.

In the PTP technique, the probing instance first separates the cache lines into two categories: the Touched category and the Untouched category. Once the channel has been established, cache lines in the touched category will be modified by the target instance. Lines in the untouched category will remain intact. Using the timestamp counter in the CPU, the probing instance can measure the number of CPU cycles required to access each cache line. The resulting differences in access times for hits and misses in the cache are observable by the probing instance and will serve as the communication means within the channel.

Once the categories have been established, the probing instance primes the cache by filling as many cache lines as it can (ideally, all of them). It then establishes an access-time baseline by reading from each line in both categories. Having just been primed, each cache line should yield a cache hit, regardless of category, thus keeping the baseline access times low. This process is highlighted in the Prime step of Figure 4.1. Having done this, the probing instance now has a series of values which represent how long it took to access each cache line with a primed cache.
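As an illustration of this measurement step, the sketch below uses the x86 timestamp counter (rdtsc) to count the cycles taken by a single memory read (GCC/Clang syntax). It is a simplified illustration rather than the attack code used in our evaluation: a real probe would serialize the instruction stream around the measurement (for example with cpuid or lfence) and would repeat the measurement over every cache line in each category.

#include <stdint.h>

/* Read the CPU's timestamp counter. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Time one access to the given address, in cycles.  A primed (cached) line
 * typically returns within a handful of cycles; an evicted line costs far
 * more, and that difference is the signal the probing instance reads. */
static uint64_t time_access(volatile uint8_t *addr)
{
    uint64_t start = rdtsc();
    uint8_t value = *addr;      /* the measured load */
    (void)value;
    return rdtsc() - start;
}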

Once the cache has been primed, and the baseline established, the probing instance must Trigger (context switch to) the target instance by busy looping, or otherwise giving up its time slice. When the target instance begins its time slice, it heavily accesses the cache lines in one of the pre-defined categories of the cache (the Touched category), but not the other (the Untouched category). The effect of this switch on the data can be seen in the Trigger step of Figure 4.1. In this figure, the target instance is accessing every second cache line in an access pattern we have pre-defined. The resulting set of accessed lines we define as the Touched category of the cache. By contrast, cache lines in the untouched category were not accessed by the target instance.

When the CPU core context switches back to the probing instance, the instance probes the cache by re-measuring the access times for each line. This is illustrated in the Probe step of Figure 4.1. If there is a significant increase in the access times for cache lines in the touched category compared to the untouched category, then the probing instance can assume that the target instance was trying to communicate. The PTP technique was refined in later work by Wu et al. [33], where they establish a high-speed bit-stream by communicating a 1 or a 0 based on whether the difference between category timings is positive or negative (assuming the difference is above a certain threshold). Using this technique, they were able to establish reliable side-channel bit-streams of over 190 kb/s.

This attack technique has so far been the most robust and reliable cache-based side-channel attack demonstrated in a Cloud environment. At the time of writing, all other sequential cache-based side-channel attacks have been based on this technique, making it a good example of a canonical attack. Since all cache-based side-channel attacks in the Cloud rely on this basic technique, a successful inhibition of its principles would mitigate all currently viable attacks, including some of the more dangerous variations [37]. For these reasons, this particular side-channel attack has been implemented and is used as an example attack in our
system's evaluation.

4.2 Parallel Side-Channel

While the previous section describes a sequential side-channel, the technique can easily be adapted on a shared-cache system to become a parallel side-channel. In this case, the probing and target instances are each running on a separate core but have parallel access to some shared cache. While the cache access is essentially the same, the parallel technique omits the need to trigger between VMs because both the Trigger and the Probe steps are happening concurrently.

Much like the sequential technique, the process begins with the probing instance priming the cache. Once the cache is primed, rather than context switching to the target instance, the probing instance starts the Probe step. Likewise, once the priming is done, the target instance can start the Trigger stage. In this way, the technique functions the same way as the sequential version, except that the Trigger and Probe steps occur at the same time. Because both VMs cannot usefully access the same cache lines at the same time, they instead each work on a section of the cache. Therefore, while one set of cache lines is being read, another can be modified by the other VM. This process is depicted in Figure 4.2, with the set of cache lines being read by VM1 highlighted in green and the set being modified by VM2 in red.

Figure 4.2: The Prime+Trigger+Probe Technique (Parallel Adaptation)

As might be expected, the parallel technique is less reliable as an attack medium, as there tends to be more noise in the system. To date, only a sequential side-channel attack has been demonstrated to be able to do serious damage in the Cloud [37]. However, while more difficult to use, parallel channels still hold the potential to be used in such an attack, and they can still be used to gain otherwise inaccessible
information about a VM.

4.3 Summary

This chapter describes the current state of cache-based side-channels in the Cloud. It separates the channels into two categories, Sequential and Parallel, based on the type of cache access each requires. For both types of channels it describes a basic communication scenario in which two VMs attempt to use the channel as a means of illicit communication.

The attacks described in this chapter are intentionally rudimentary and rely on only the most basic properties of a proper attack. This serves both clarity's and generality's sake, as the attacks are simple to comprehend and general enough to work in a wide variety of scenarios. The simplicity and basic structure of the attacks shown argue that if these attacks can be prevented from occurring within a system, then that system should be proof against the more complex varieties. This argument stems from the idea that the rudimentary attacks are less sensitive to noise and error, easier to implement, and rely on the same hardware vulnerabilities as the more complicated attacks.

Chapter 5
Selective Cache Flushing

This chapter details our approach to solving sequential cache-based side-channels in the Cloud. It describes what attributes of a sequential cache-based side-channel can be inhibited to prevent its use. It goes on to illustrate how this can be done in a Cloud environment without violating the requirements set out in the Cloud model in Section 2.1, and this provides the foundation of our solution. From here we discuss known or expected issues that would arise from taking the suggested approach. Included in this discussion are countermeasures that can be taken, and circumstances that can mitigate these issues. The chapter concludes by describing how our solution is realized and evaluated. This includes what concrete requirements the solution demands, as well as how it was incorporated into an established Cloud system. This system is then compared with the existing state of the art in terms of security against side-channels and general performance.

5.1 Overview

The key to the PTP technique, and, in essence, to all sequential cache-based side-channels, is that both the probing and the target instances are granted some level of overlapping access to a shared cache resource. Since access to said resource cannot be denied to either party without compromising the fairness of the system, our solution focuses on disabling the overlapping aspect of the cache. Much like the principle of VM isolation, our solution attempts to isolate the data in the cache based on the VM that is using it. This can be done by designing the system such that both parties view the cache as flushed (having no cache line that yields a cache hit without first incurring a miss for the current context) upon gaining access to the cache. Establishing this flushed state would prevent the possibility of a cache-based side-channel by preventing the probing instance from ever establishing a baseline (by priming). More specifically, whenever the probing instance attempts to prime the cache, its baseline would be destroyed, as the cache would flush every time the probing instance regains context. Our solution's effect on the PTP technique is depicted in Figure 5.1.

Our solution inhibits the PTP technique by flushing the cache between the Prime and Trigger steps, thereby preventing the probing instance from ever seeing a pattern in the cache hit data. Additionally, if the system supports cache warming (the saving of the cache's state on a context switch), the flush can be applied within this subsystem. This would allow the cache to retain useful data for the VM, while still preventing each VM from seeing if another has used the cache. A cache warming technique was not applied in our system, as it would not make the system any more secure; it is instead left for future work.

Figure 5.1: Our Solution's Effect on the PTP Technique (Sequential)

Our solution can be implemented directly in the hypervisor, thereby conforming to the Cloud model defined in Section 2.1.

5.2 Expected Issues and Mitigations

Our proposed solution involves a selective cache flushing technique that is implemented directly in the hypervisor. Such an approach would prevent side-channel attacks by flushing the cache during a context switch between VMs for which a side-channel could occur.

The chief drawback of a cache flushing approach lies in the levels of overhead that such a technique may generate. The overhead from our solution manifests, principally, in two forms. One is that, unless special precautions are taken, such as
cache warming, the cache will return to a guest in a blank state, slowing down the system by minimizing cache hits. Second, reverting the cache to this blank slate for every VM would add overhead to the system. These sources of overhead are summarized in the following subsections as the Reduced Cache Usability and the Cache Flushing Overhead.

5.2.1 Reduced Cache Usability

The reduced usability, or effectiveness, of the cache can be mitigated by observing that, in a canonical cache-based side-channel attack, the attacker would be priming the cache with every time slice allotted to it. Doing this would place the cache in a functionally flushed state for any VM allocated time immediately after the probing instance. By priming the cache, the attacker essentially puts every other VM into a state similar to that of the proposed solution (flushed). In this way, the secure system incurs no more overhead than a vulnerable system currently under attack.

Additionally, the overhead generated within a VM by starting with a flushed cache should be negligible in most cases. Unlike with process-level context switching, under most circumstances each guest's time slice would likely use a great deal of the cache (depending on the memory requirements of the workload). This would leave each successive guest with few warm cache lines when it runs next. The exception would be a guest with such small memory requirements that its cache lines were not overwritten by the time it regained access to the CPU. Under these circumstances, having a warm cache would not likely grant any of the guests much use, since they would only be using a few lines. Either option implies that starting each VM with a blank cache should not incur a large amount of overhead compared to process-level
flushing. To summarize, our system's cache usage should be no less efficient than that of a system under attack via a side-channel. In addition, the reduction in cache effectiveness should be minimal because we are dealing with virtual-machine-level granularity. Compared to a single process, a VM should use a longer time slice and a larger portion of the cache, making it less likely for a VM to return to a warm cache state after gaining a new time slice.

5.2.2 Cache Flushing Overhead

The second issue is a more difficult one to overcome, as it relies on implementing a solution with a minimal amount of overhead. The overhead generated from flushing the cache is dependent on two main factors: the ratio of flushes to context switches, and the frequency of context switching in the system. These factors compound one another to generate the main source of overhead for the system. The more context switches, the more flushes are required; the more flushes, the more overhead is generated.
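As a rough way of seeing how these two factors combine, the time lost to flushing can be modelled (a back-of-the-envelope estimate, not a measurement from our experiments; the symbols are introduced purely for illustration) as

\[ \text{overhead fraction} \approx f_{\text{switch}} \cdot p_{\text{flush}} \cdot t_{\text{flush}}, \]

where f_switch is the number of VM context switches per second, p_flush is the fraction of those switches that actually trigger a flush, and t_flush is the time taken by a single flush. For example, at the 200 switches per second assumed in Tables 5.1 and 5.2, flushing on every switch would cost roughly 200 × t_flush seconds out of every second of execution; lowering p_flush (the aim of the decision algorithm in Section 5.3.1) and keeping t_flush small (the aim of the flushing function in Section 5.3.2) each reduce the overhead proportionally.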

When cache-based side-channels were being investigated in a non-virtualized environment, flushing the cache was deemed too expensive for general use [21]. This was because of the high rate at which cache flushes would need to occur and the overhead that would come with each flush. However, a Cloud system does not require side-channel security at the process level, but at VM-level granularity, as this is the granularity at which resources are allocated. By comparison, a hypervisor will be running fewer VMs simultaneously than a regular OS would be running processes, and the rate of switching between them would be much lower. This means that the context switch rate between VMs in a Cloud system should be much lower than that between processes in a regular OS. Since the cache-flushing code can be inserted into the existing context-switching functionality, the overhead generated by flushing the cache should be directly proportional to the context switch rate. With this in mind, the reduced granularity of Cloud systems makes this technique more viable in a Cloud environment.

In addition to the reduced rate, some mitigating factors as to the frequency of flushing can be postulated. The frequency of flushing depends on the exact circumstances during which the sharing of the cache could occur. For instance, flushing the cache would not be necessary if a guest VM switches to a VM that does not intend to access the cache (such as the idle domain). By tightening restrictions on when a context switch between virtual machines is considered vulnerable to a side-channel attack, we can reduce the frequency of context switches that require a cache flush.

In total, the issue of cache flushing overhead can be addressed in two ways simultaneously. As mentioned earlier, the rate of context switching and the ratio of cache flushes to context switches compound to create the greater part of the solution's overhead. These two factors can be addressed individually. The frequency of cache flushing is naturally reduced by focusing on VM-level granularity rather than process-level granularity, and the ratio of cache flushes to context switches can be reduced by tightening restrictions on when a context switch is considered vulnerable to a side-channel. By addressing each factor, the net overhead decreases to a point where it is manageable in a standard Cloud system.

Addressing the aforementioned issues, the solution should prevent sequentially scheduled VMs from gaining any information about the previous VM's cache usage,
thereby blocking the prospective side-channel. Additionally, it should be implemented such that it conforms to the Cloud model. This implies that the solution cannot require any changes to the client-side code base, nor have any non-canonical hardware requirements. This adherence to the Cloud model suggests that the solution must be contained entirely within the hypervisor. Restricting the implementation to the hypervisor in this way distinguishes our solution as the only purely server-side defence for side-channel attacks in a Cloud environment.

5.3 Cache Flushing Technique

Our solution to sequential side-channels involves the incorporation of our cache flushing technique into a canonical Cloud system. The technique includes two new functions within the hypervisor: one is a server-run routine to flush the cache (revert the cache to a blank slate), and the other is a tainting algorithm in the scheduler for deciding when flushing the cache is necessary. The sequential side-channel solution was implemented using the code base of the open-source Xen hypervisor, Version 4.2.

5.3.1 Decision Algorithm

Flushing a high-level cache on a modern machine can be a time-consuming process. As mentioned in the above section, it is better to flush the cache only when necessary. Because the PTP technique requires that two VMs have consecutive access to a CPU core, flushing the cache is only required when this situation is able to occur.

For instance, Xen maps VMs to domains so that it can compare guest VMs, the host VM, and additional states (such as an idle CPU) using the same structures. If a CPU goes from executing Domain1 (VM1) to the Idle domain (Xen's representation of an idle CPU), and then back to Domain1, there is no need to flush the cache, as no side-channel could have been implemented over such a context.

A tainting algorithm, Algorithm 1, was implemented to record which VMs own the data currently in the cache and to determine when it is necessary to flush those data. The Xen scheduler operates on scheduling units called VCPUs. Each VCPU represents a virtual CPU and is associated with a domain. A domain may have multiple VCPUs associated with it, but will always have at least one. According to our algorithm, each VCPU is given a new data field, taint, which indicates the origin of the data that currently reside in the CPU cache. Upon initialization, each VCPU's taint value is assigned the identifier for the domain it represents, with the exception of the Idle VCPU, which does not have an associated domain and is assigned an idle value.

Algorithm 1 Determines when flushes are necessary

Function contextswitch(DomX, DomY) {
    if DomX.taint == idle then
        return;
    end if
    if DomX.taint == DomY.id then
        return;
    end if
    if DomY.id == idle then
        DomY.taint := DomX.taint;
        return;
    end if
    flushcache();
    return;
} EndFunction

The outcome of Algorithm 1 is that a cache flush occurs only when a context switch changes from one domain to another where the second domain has the ability to establish a side-channel with the first. Switching to the Idle domain, or to the same domain, will not invoke a cache flush. Avoiding a flush during these situations can be critical, as they will arise often in an elastic Cloud environment.

5.3.2 Flushing Function

The cache flushing functionality was implemented in two versions. The first version is a hardware-dependent implementation (referred to as the x86-secure version for its use of the x86 instruction set). The second version is a hardware-independent implementation (referred to as the portable-secure version). If the hardware has built-in functionality for flushing the cache, such as with the x86 instruction set, this task can be accomplished more efficiently. Both versions are compared with the existing version of the Xen 4.2 hypervisor (referred to as the insecure hypervisor) in the experimental evaluation (Section 5.4).

In this case, we use the wbinvd instruction built into the x86 instruction set to flush the cache. This instruction invalidates every cache line by toggling the validity bit associated with that line after writing any relevant information back to memory. The hardware-independent, or portable-secure, cache flushing function is implemented by allocating a chunk of memory equal to, or larger than, the size of the hardware's L2 cache (or the largest CPU cache available on the machine). Next, the chunk is divided into cache-line-sized blocks. When the flushing function is invoked, it iterates over these blocks, altering the data stored in each.

cache line associated with each of the blocks is modified and will need to be written back to memory on the next access of that line. This will result in a cache miss, as the cache contents have been overwritten. The main difference between these two techniques is that the hardware-specific flush will typically invalidate the cache lines, toggling their validity flags and indicating to the CPU that they contain useless data while leaving the data themselves unchanged. By contrast, the hardware-independent solution will overwrite each cache line (filling it with useless data) but leave the validity flags untouched. Either technique will cause subsequent attempts to access those cache lines to miss, but the implementations are very different.

5.4 Experimental Evaluation

This section describes the process of evaluating our solution. It describes the criteria by which we evaluate our solution's effectiveness and the environment in which we conduct our experiments. We also describe the metrics by which we compare our solution to the existing state-of-the-art. These metrics are used to determine under what conditions our solution can be practically implemented in a commercial Cloud environment.

5.4.1 Objectives

The evaluation of our secure hypervisors focuses on answering two research questions: Does this hypervisor prevent sequential cache-based side-channels? And what is the performance difference between it and the insecure hypervisor?

To answer these questions, we have simulated a single-server Cloud environment. In this environment, we subject each hypervisor to a conventional sequential cache-based side-channel attack, using the PTP technique, designed for use in Cloud environments. In addition, we subject each hypervisor to a series of workloads under different configurations, designed to test different behaviours of the system. The resulting completion times are used to determine a likely increase in overhead.

5.4.2 Environment

The evaluation was conducted using the x86-secure, portable-secure, and insecure hypervisors, with Dom0 running Ubuntu and each guest running a sparse installation of Ubuntu. Each hypervisor was installed on a Sun Ultra 40 model machine, with 16GB of RAM and two 2.6 GHz Opteron processors. Each processor contains two cores, each of which uses its own 1MB CPU L2 cache. Since this solution is specific to side-channels that do not have concurrent cache access, this machine was selected because it does not have a shared L3 cache. This machine is referred to in subsequent sections as the Sun Machine. For the Apache and 7zip benchmarks, the hypervisors were also run on an IBM X3200 M3 model machine, with 32GB of RAM and a single Xeon X3430 processor. This processor contains four cores, each of which uses its own 256KB CPU L2 cache and a shared 8MB victim L3 cache. In subsequent sections this machine is referred to as the IBM machine. From our experience with Cloud systems and examples in the related work [13], we believe these machines are comparable to servers that might be used in a Cloud

50 CHAPTER 5. SELECTIVE CACHE FLUSHING 40 environment. Unless otherwise stated, all environmental parameters and configurations were left on their default values. It should be noted that in the following section, the term context switch refers to the hypervisor-level process of switching attention between virtual machines, as opposed to any pre-existing definitions Side-Channel Prevention The hypervisors were evaluated using a side-channel attack based on the experiments done by [35] which uses the PTP technique. The side-channel Receiver and Sender programs, performing the functions of the probing instance and the target instance, respectively, were each installed onto separate guest instances. Both programs were executed simultaneously by co-resident guests on the test-bed machine. The attack was designed to send an identifiable string of 160 bits from the target instance to the probing instance. The attack was run ten times on each hypervisor to verify the consistency of the results. In order to verify each hypervisor s ability to block the side-channel under any conditions, the side-channel was given ideal conditions to work in. Specifically, the probing instance, the target instance, and Dom0 were the only virtual machines running on the hypervisor and the first two were pinned to a single CPU core separate from the core on which Dom0 was pinned. This configuration represents the best possible conditions for a cache-based side-channel attack; any variations would make it more difficult for the attack to succeed. Our experiments assume that if an attack can be prevented under these conditions, then the same prevention would work for environments more hostile to the attack s success. The viability of the defence

51 CHAPTER 5. SELECTIVE CACHE FLUSHING 41 mechanisms implemented should not be affected by these configurations Performance Experiments The overhead generated by each of the secure hypervisors is expected to be directly proportional to the amount of vulnerable context switches. Because of the variable nature of Cloud workloads, it was necessary to evaluate the hypervisors under a variety of configurations. The hypervisor configurations selected for experimentation address three categories of variation: variations on the type of workload; variations on the Xen Credit scheduler [29] configurations; and variations of the workload s magnitude. A variation in the type of workload undertaken by the guests in the system can significantly affect the execution of the system. Because of this, the workloads used in the experiments are designed to emulate different types of applications one might find in the Cloud. To better accomplish this goal we evaluate our hypervisors with a combination of established benchmarks and customized benchmarks designed to generate high levels of overhead in the system. System configurations consist of modifications to the internal variables which control the scheduling of guests access to the CPUs. These values are used in Cloud environments to customize performance to the expected workloads the Cloud will process. Different amounts of work imposed concurrently on the system can also significantly affect performance. Cloud systems are designed to manage a dynamic number of clients and their workloads. Variation in the workload magnitude will simulate this attribute. The performances of the secure hypervisors were compared given two standard

benchmarks and two customized workloads with three different configurations. The results of these comparisons are listed at the end of the section.

List of Experiments

To evaluate the performance of our two systems, the following experiments were conducted:

A) Measurement of Cache Flushing Overhead
B) Apache Benchmark with Varying Number of VMs
C) 7zip Benchmark with Varying Number of VMs
D) Latency & Compute Workloads with Varying Timeslice Value
E) Latency & Compute Workloads with Varying Ratelimit Value
F) Latency & Compute Workloads with Varying Number of VMs

The two systems are evaluated in two types of environments. One uses standard benchmarks that represent workloads one might see in the Cloud; these include the Apache and 7zip benchmarks. The other uses customized workloads to attempt to generate a particularly difficult load for the systems, to determine what type of scenarios they will fail in. In addition, the customized workloads are used to test the configuration parameters of the systems to determine the solution's impact on different types of environments. These workloads are referred to as the Latency workload and the Compute workload. The two standard benchmarks used in the following experiments were conducted on two machines with different hardware. This was done so as to compare the effect

of different cache sizes and types on our solutions. This solution is designed to be hardware-independent and should therefore work on both systems, but we would expect the overhead to change based on the different cache designs. The two machines used are specified in the above section (Section 5.4.2). As mentioned, these machines have significantly different cache sizes and layouts, but both are still machines we might expect to see in a Cloud environment. It was decided that the two systems should be compared in two different custom environments: one with a higher expected rate of context switching, and one with a lower rate. Each configuration was tested for a computationally-heavy workload (the Compute workload) and a latency-sensitive workload (the Latency workload). The reasoning behind this decision is that a computationally-heavy workload would operate with fewer interrupts and cause less context switching. A latency-sensitive workload, by contrast, would do the opposite. Both would simulate workloads typically found in Cloud environments, such as a major computational undertaking (solving a large problem) or a primarily on-call service (a web/gaming server). The Compute workload consists of a program that calculates the first 1000 iterations of the Fibonacci sequence. Its implementation consists of purely arithmetic functions with no system calls in place. By contrast, the Latency workload forks two processes that alternate their execution 500,000 times by use of semaphores. This implementation implies that both processes wake and block 500,000 times throughout one execution. The Latency workload was designed to simulate a high-latency scenario with many system calls, as processes wake and release their use of the CPU.
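The thesis does not reproduce the source of these two workloads; the following is only a minimal sketch of how such workloads might be written, assuming a POSIX environment. The structure and names are our own illustration, not the original code.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <semaphore.h>

/* Compute workload: pure arithmetic, no system calls inside the loop. */
static void compute_workload(void)
{
    unsigned long a = 0, b = 1;
    for (int i = 0; i < 1000; i++) {      /* 1000 Fibonacci iterations */
        unsigned long next = a + b;       /* overflow is irrelevant; only the work matters */
        a = b;
        b = next;
    }
    printf("fib done: %lu\n", b);
}

/* Latency workload: two processes alternate 500,000 times via semaphores,
 * so each process wakes and blocks on every iteration. */
static void latency_workload(void)
{
    sem_t *sems = mmap(NULL, 2 * sizeof(sem_t), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    sem_init(&sems[0], 1, 1);             /* parent runs first */
    sem_init(&sems[1], 1, 0);

    pid_t pid = fork();
    int me = (pid == 0) ? 1 : 0;          /* child waits on sems[1] */
    for (int i = 0; i < 500000; i++) {
        sem_wait(&sems[me]);              /* block until it is our turn */
        sem_post(&sems[1 - me]);          /* wake the other process */
    }
    if (pid != 0)
        wait(NULL);
}

int main(int argc, char **argv)
{
    if (argc > 1 && strcmp(argv[1], "latency") == 0)
        latency_workload();
    else
        compute_workload();
    return 0;
}

The Compute variant keeps the CPU busy between interrupts, while the Latency variant forces a wake/block pair on every iteration, which is what drives the higher context-switch rate discussed above.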

A) Measurement of Cache Flushing Overhead

The first round of experimentation was to simply measure the number of CPU cycles required for the context switch code to execute in each of the three hypervisors. This was done because almost all of the overhead generated by our solution manifests in this functionality, and it serves to establish a good baseline for how much overhead we can expect due to context switching. The Flush Timing experiments were performed by encompassing the context-switching functionality of the hypervisor in code that records the time-stamp counter of the CPU. We refer to the difference between the readings at the beginning of the function and at the end as the amount of time (in CPU cycles) that was required to perform the context switch, as this is the only section of code that should be affected by our modifications. To run the experiment, we run two guest VMs on each hypervisor, pin them to the same CPU, and proceed to run CPU-intensive code on them. The resulting times for the context switches are recorded in the hypervisor and averaged to generate the values shown in Tables 5.1 and 5.2. It should be noted that these readings were taken using the selective cache flushing algorithm (Algorithm 1), and as a result the only timings considered are those that would be affected by our solution, i.e., context switches between VMs that could potentially establish a channel. Depending on the workload, these timings could be frequent or infrequent, but should always take more time than the default context switching code.
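The exact instrumentation lives inside Xen and is not reproduced in the thesis; the sketch below only illustrates the time-stamp-counter idea on x86, with context_switch_stub() standing in for the hypervisor's real context-switch path.

#include <stdint.h>
#include <stdio.h>

/* Read the CPU time-stamp counter (x86). */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Placeholder for the code being measured; in the experiment the readings
 * bracket the hypervisor's context-switching function. */
static void context_switch_stub(void) { }

int main(void)
{
    const int runs = 1000;
    uint64_t total = 0;
    for (int i = 0; i < runs; i++) {
        uint64_t start = read_tsc();
        context_switch_stub();
        total += read_tsc() - start;      /* cycles spent in the measured section */
    }
    printf("average cycles per switch: %llu\n",
           (unsigned long long)(total / runs));
    return 0;
}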

55 CHAPTER 5. SELECTIVE CACHE FLUSHING 45 B) Apache Benchmark with Varying Number of VMs The Apache benchmark program (AB) is a standard http webserver benchmarking tool [8]. Apache was chosen because it is open source, frequently available, commonly used as a benchmarking service, and represents the type of webservice one would expect to see running in a Cloud environment. In our experiments we run the benchmark simultaneously on 1, 2, 4, 8, and 16 co-resident VMs for each system. Each of the VMs have been distributed evenly amongst the CPUs. The benchmark yields an average number of requests per second that the system can handle. C) 7Zip Benchmark with Varying Number of VMs The 7Zip benchmark program is a benchmark designed to test the speed with which a system can compress data using the 7Zip program [24]. Like the Apache benchmark, this benchmark was chosen because it is open source, has a reputation for being robust, it represents a function that we would expect to be commonly found in the Cloud, and we expect its performance to be impacted by our solutions. At the same time, 7zip performs a fairly different function than the Apache benchmark and serves to add more variety to the test bed. The 7zip benchmark yields results in the form of MIPS (Millions of Instructions executed Per Second). D) Latency & Compute Workloads with Varying Timeslice Value The Timeslice parameter is a value, in milliseconds, which represents the default amount of consecutive time allocated to a VCPU for execution on a CPU core. A recent addition in the Xen hypervisor source code, the Timeslice value was devised to enable the optimization of Cloud systems for higher, or lower, latency workloads.

Presumably, higher-latency workloads should be more efficient under a smaller Timeslice value, and lower-latency workloads would be more efficient under higher values. Adjusting the Timeslice value allows the user to match the expected number of context switches to the expected latency requirements of the system. The default value, 30ms, has been suggested [29] to be rather large, and proper experimentation is expected to yield a more appropriate value for future versions. Lower values will allocate each VM a smaller time slice in the schedule and will likely make the system more latency-sensitive, but will increase overhead due to higher context switching rates.

E) Latency & Compute Workloads with Varying Ratelimit Value

The Ratelimit parameter is a value, in microseconds, newly introduced in Xen 4.2. It represents the minimum amount of time that a VCPU must run before it can be preempted by another VCPU. In Xen 4.2, the default Ratelimit value is 1000 microseconds. This parameter was introduced to Xen after it was observed in previous work [16] that certain high-latency workloads could over-preempt the system. This over-preemption causes the hypervisor equivalent of CPU thrashing. In this situation, a small number of processes repeatedly gain access to the CPU for small fractions of time, without lowering their overall priority in the system. This leads to heavy increases in overhead due to context switching. The Ratelimit value was employed as a prevention mechanism for such situations, and acts as a minimum time, in microseconds, for which a VCPU is guaranteed to run without being pre-empted.
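For reference, both parameters can be inspected and set at runtime through Xen's xl tool. The invocation below assumes the credit scheduler of Xen 4.2 or later and is shown only as an illustration of how such a configuration might be applied; it is not a command taken from the thesis.

# Show the current credit-scheduler parameters (including tslice_ms and ratelimit_us)
xl sched-credit

# Example: set a 5 ms Timeslice and a 1000-microsecond Ratelimit system-wide
xl sched-credit -s -t 5 -r 1000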

57 CHAPTER 5. SELECTIVE CACHE FLUSHING 47 F) Latency & Compute Workloads with Varying VM Number The Number of Virtual Machines is the number of guest domains running concurrently on the system, each executing the same load as all others. From the hardware and VM sizes available in canonical Clouds (such as Amazon s EC2), we estimate the average number of VMs running on a moderately loaded machine to be somewhere between 6 and 16. For the sake of our experiments, we have selected a VM number of 10, since it allows us to run VMs of canonical size (modelled after EC2) while still maximizing the number of machines within this range. Other researchers [13] have also considered the range of 6-16 to be an acceptable number of VMs with which to model Cloud server activity. Despite having this value established as a default, Cloud systems are designed to be elastic, dynamically increasing and reducing the number of VMs running on a machine as demand needs. To that extent, we have added values as low as 1 and as high as 50 to our test-bed to use as extreme cases. Unless otherwise specified, all experiments were conducted under the default configuration of 10 guests, with a Timeslice value of 30ms and a Ratelimit value of 1000 microseconds Results The attack performed on the insecure hypervisor was able to successfully communicate the entire 160 bit message between instances every time (10/10). The attack and experiments were repeated in order to prevent external influences from affecting the results. With such delicate measurements, external variables such as the scheduler s ordering, page allocation, or even the temperature of the CPU could affect timing values. This result demonstrates the vulnerability of the unmodified system. By

58 CHAPTER 5. SELECTIVE CACHE FLUSHING 48 contrast, both of the secure hypervisors yielded 0 bits of successful communication over all twenty attempts. This result suggests that the side-channel could not be successfully established with the countermeasures in place. Figures present the results of each performance benchmark described in the beginning of Section Each experiment observed a mean execution time over three instances of the benchmark as performed by the Phoronix Test Suite [17]. The results from each VM running in tandem were averaged for the final result shown. If an execution returned a standard deviation higher than 3% the results for that machine were discarded. This was due to the fact that the benchmarking tool will execute an additional run of the benchmark if the standard deviation was above 3%. While ordinarily this would be beneficial, the additional run would execute after all the other VMs had terminated and would therefore run without interference. This would significantly skew the results for this VM, as it would run unhindered in an experiment designed to measure how the VMs affect one another. Figures present the results of each performance experiment with the customized workloads described at the end of in Section Each experiment observed a mean execution time reported by the VMs involved over twenty workload executions under identical conditions. The means observed for each VM were then averaged to represent the values shown. The error bars represent standard deviations between the means of each VM. 5.5 Result Analysis This section discusses the results from the previous section to analyse our solution in detail. Additionally, the results are used to compare how our implementations

Figure 5.2: The Apache benchmark on the Sun machine with varying number of VMs
Figure 5.3: The Apache benchmark on the IBM machine with varying number of VMs
Figure 5.4: The 7zip benchmark on the IBM machine with varying number of VMs
Figure 5.5: The 7zip benchmark on the Sun machine with varying number of VMs
Figure 5.6: Latency workloads executed on Xen Hypervisor with varying Timeslice values
Figure 5.7: Computationally-heavy workloads executed on Xen Hypervisor with varying Timeslice values
Figure 5.8: Latency workloads executed on Xen Hypervisor with varying Ratelimit values
Figure 5.9: Computationally-heavy workloads executed on Xen Hypervisor with varying Ratelimit values
Figure 5.10: Latency workloads executed on Xen Hypervisor with varying number of VMs
Figure 5.11: Computationally-heavy workloads executed on Xen Hypervisor with varying number of VMs

64 CHAPTER 5. SELECTIVE CACHE FLUSHING 54 function differently from the insecure model and speculate why this may be the case. Our analysis focuses on the fulfilment of two features: the successful inhibition of a side-channel attack, and how emerging performance trends compare between hypervisors. The results are analysed to extrapolate the usefulness of our solution and its application in modern Cloud environments Side-Channel Prevention As mentioned in Section 5.4.5, there was no successful side-channel established between two given VMs using either secure hypervisor. This leads us to conclude, that for the attack we have tested against, our modified systems are capable of mitigating sequential cache-based side-channel attacks. As we were using a representative attack, it is important to determine to exactly what extent we can extrapolate this result. Our experiment includes the most common type of attack (sequential cache-based side-channel) which uses the PTP technique, and is currently the only type known to have done serious damage in the Cloud [37]. This particular solution was not designed to prevent an attack exploiting the parallel access of a higher level cache by multiple cores. While the solution may very well serve to inhibit such an attack, any mitigation would be due to cache flushing as interference from third party VMs otherwise not involved in the attack. In such a case, we speculate that our solution would act as some sort of noise amplifier. Likely, it would inhibit the cache channel to a degree proportional to the frequency of context switches on the targeted CPU core. While more secure than the canonical hypervisor, we believe such a defence would be unreliable at best. Our experiments are restricted to this one representative attack and a single test

system. However, the theory holds that our solution should prevent any communication between VMs across a sequential cache-channel. As this attack represents the most basic form of communication, we believe that this experiment stands as a proof of concept that our solution can prevent sequential cache-based side-channels in a Cloud environment without interfering with the Cloud model.

5.5.2 Performance Analysis

This section analyses the results of the performance experiments given in Section 5.4.5. The experiments are the same as those described in Section 5.4.4.

A) Measurement of Cache Flushing Overhead

Prior to subjecting the systems to any benchmarking workloads, experiments were run to determine the average number of CPU cycles required for the modified context switch code. The average numbers of CPU cycles (c) for the hypervisors run on the Sun and IBM machines are reported in Tables 5.1 and 5.2. These tables show the average number of cycles required to run through the context-switch code in the hypervisor for each configuration. In addition, they use the average number of context switches per second (obtained empirically) to extrapolate how many cycles per second would be dedicated to context switching at this rate. The final column in each table shows an estimate of how much of the CPU's time is dedicated to context switching under these parameters. The main purpose of this experiment was to compare the cache flushing techniques on machines with significantly different cache hardware. To this extent, we compare each hypervisor's performance on both machines.
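To make the derivation of the final column explicit: the estimated share of CPU time is simply (cycles per switch) x (switches per second) / (core clock rate). For example, taking the portable-secure figures for the Sun machine quoted below in Table 5.1 (1,500,000 cycles per switch, 200 switches per second, 2.6 GHz cores):

    1,500,000 cycles/switch x 200 switches/s = 300,000,000 cycles/s
    300,000,000 / 2,600,000,000 cycles/s (2.6 GHz) ~ 0.115, i.e. roughly 11.5% of CPU time

This matches the 11.5% expected overhead quoted for the Latency workload later in this section.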

Table 5.1: Sun Context Switch Timings (Assumes 200 Switches per Second)

Hypervisor        Cycles per Switch   Switch Cycles per Second   % of CPU time
Unmodified        3,000               600,000                    < 0.05
Portable-Secure   1,500,000           300,000,000                ~ 11.5
x86-secure        400,000             80,000,000                 ~ 3.1

Table 5.2: IBM Context Switch Timings (Assumes 200 Switches per Second)

Hypervisor        Cycles per Switch   Switch Cycles per Second   % of CPU time
Unmodified        3,000               600,000                    < 0.05
Portable-Secure   100,000             20,000,000                 ~ 0.8
x86-secure        1,300,000           260,000,000                ~ 10.8

For the Unmodified hypervisor, both machines show a similar overhead per context switch, marked at 3,000c (CPU cycles). Despite the clock speed difference between these machines (2.6 GHz vs. 2.4 GHz), these numbers show a very low amount of overhead in the system (less than 0.05% of clock cycles). For the Portable-Secure hypervisor, the Sun shows 1,500,000c per switch and the IBM shows 100,000c. This difference can be attributed to the size of the cache that needs to be flushed. Since the portable-secure solution only flushes the L2 cache, the size difference between the IBM's L2 and the Sun's L2 can be felt. The Sun machine has a 1MB L2 cache per core, whereas the IBM machine has only a 256KB L2 per core. Since the IBM machine needs to iterate through only one quarter of the memory size, we can expect a significant reduction in the overhead it produces compared to the Sun for this solution. Therefore, we can expect the IBM machine to out-perform the Sun for the portable-secure solution, despite its large L3 cache, which is not affected by this cache flushing technique. For the x86-secure hypervisor, the Sun shows 400,000c per switch and the IBM

is marked at 1,300,000c. This large difference can also be attributed to the size of the caches that need to be flushed. Unlike the portable-secure solution, the hardware instructions used to flush the cache for the x86-secure solution flush all levels of the cache, including the IBM's 8MB L3 victim cache. Because the Sun's last-level cache is the L2, the wbinvd() (write-back invalidate) instruction requires less communication and has fewer caches, of smaller size, to flush. The overhead difference between these two systems should demonstrate the effect of flushing different layers of cache. Expected context switch rates were measured for the insecure system as 200 and 130 switches per second (per core) for the Latency and Compute workloads, respectively, performed under the default configurations. Performed on cores with a clock rate of 2.6 GHz (Sun), this would imply an expected overhead of 11.5% and 7.5% of CPU cycles for the Latency and Compute workloads on the portable-secure system, respectively. The same calculations suggest expected overheads of 1.3% and 0.85% for the x86-secure system, and comparatively negligible overhead on the insecure system (approximately 1/500th of that of the modified systems). The following experiments compare the observed results to this expected overhead.

B) Apache Benchmark with Varying Number of VMs

The results of running the Apache benchmark on the Sun and IBM machines can be seen in Figures 5.2 and 5.3, respectively. As can be seen in these figures, there is relatively little overhead shown in this benchmark, with no secure hypervisor yielding less than 90% of the original's efficiency for any configuration. In the Sun's performance in Figure 5.2 we can see that, with few exceptions, both

68 CHAPTER 5. SELECTIVE CACHE FLUSHING 58 the x86-secure and the portable-secure hypervisors perform slightly worse, but quite comparably to the unmodified hypervisor. Despite the larger disparity of overhead anticipated from the Cache Flush Overhead experiments, they seem to perform quite similarly. The IBM machine, by comparison, shows more overhead for the x86-secure hypervisor than for the portable-secure. The reasons for this, as explained in Section 5.5.2, is that the portable secure technique needs to flush less cache on the IBM machine than the Sun, whereas the x86 secure techniques need to flush the entire L3 cache. Overall, the performance for all three hypervisors were relatively close, with comparatively little overhead demonstrated. As a result, we run the Latency and Compute experiments in order to determine workloads that may have a more serious negative impact on the hypervisors. C) 7Zip Benchmark with Varying Number of VMs Curiously, the 7zip benchmark also yields fairly little overhead for our new systems. The overhead difference between the x86-secure and the portable-secure hypervisors on the IBM machine is more obvious for 7zip, compared to Apache, but follows the same general trend. The IBM machine typically shows more overhead in the x86 secure flushing scheme and the rest is relatively even. The Sun, by comparison, shows little overhead generated. One curious thing to note is that the x86 secure flushing scheme typically yields as much, if not more overhead, for the Sun as for the IBM machine. Considering the Flush Timing experiments above, this should not be the case, as the portable secure flushing scheme generates far more overhead on the Sun. We speculate that this may

be due to some sort of additional bottleneck generated by the wbinvd() instruction [12]. Since it uses bus cycles to flush the cache, it may encounter a separate bottleneck unrelated to the cache which adds overhead. Overall, the results from the Apache and 7zip benchmarks imply that these workloads could be performed in a Cloud environment running our secure hypervisors while experiencing comparatively little overhead (less than 10%).

Latency & Compute Workloads with Default Values

A comparison of the three hypervisors was run under the aforementioned default values (Timeslice = 30ms, Ratelimit = 1000 microseconds, Number of VMs = 10). The results for the Latency and Compute custom benchmark completion times, run on the Sun machine, show an 11.01% and 11.78% increase for the portable-secure system, and a 1.61% and 0.33% increase for the x86-secure system, respectively. These values can be seen in the corresponding figures. These results are similar to those predicted, with the exception that the percentage increase for the Compute workload is not significantly lower than that of the Latency workload on the portable-secure system. As most of the overhead generated by the modified system is directly proportional to the amount of context switching, the Latency workload was expected to generate more overhead. For default values, the portable-secure system seems to have increased the rate of context switching for the Compute workload. This may be due to the context switch time taking up a larger chunk of each VM's time slice than it otherwise would. Such a situation would create the need for more context switches over the course of the process. VMs running the Latency application would be more likely to give up their

70 CHAPTER 5. SELECTIVE CACHE FLUSHING 60 time slices early (block or wait), therefore giving this overhead less of a chance to cause additional context switching. This would explain why such an increase was not observed in the Latency workload. A related trend was not observed in the x86-secure hypervisor, likely due to its comparatively small increase in overhead. D) Latency & Compute workloads with Varying Timeslice Value The results of modifying the Timeslice value can be seen in Figures 5.6 and 5.7. As might be expected, modifying the Timeslice value appears to have the largest impact on the completion time of each workload. Most notably, there is a dramatic increase in the completion time on the portable-secure system for both workloads ( % and %) when Timeslice is set to its minimum value, 1ms. The 1ms Timeslice value also generates a surprisingly high (more than 27 times any other Timeslice value) standard deviation. This suggests that some sort of variable condition is arising during the application execution; generating a large amount of overhead. Very likely, this is the result of a CPU thrashing situation, with the overhead for each context switch being magnified 500 fold by the increased context switch overhead of the portable-secure hypervisor. This overhead is reflected in the x86-secure hypervisor as well, but to a much lesser extent (11.06% and 14.62%). This is likely due to the same cause, but magnified significantly less due to the reduced context switch time by comparison to the time spent in application execution. Due to the completely unreasonable level of overhead generated in this configuration, we conclude that the portable-secure hypervisor is not a practical implementation for a Cloud system requiring this high a level of latency sensitivity; at least not on a machine with such a large L2 cache. However, less than 15% overhead could

71 CHAPTER 5. SELECTIVE CACHE FLUSHING 61 be considered acceptable for the x86-secure system. This would imply that any implementation of our solution for such latency-sensitive configurations would need to be implemented using hardware specific instructions, or on a system with a smaller cache. Despite the above conclusion, very few Cloud systems would require such a high level of latency optimization (which is the absolute minimum value Timeslice can be set to). Such a latency-sensitive system would also lose much of the benefit of a cache due to its frequent context switching. Considering this, the system would probably be better defended by the complete removal or disabling of the cache, rather than its frequent flushing. While not encountering the same problems as with a value of 1, the overhead generated by having Timeslice set to 2 is still substantial. For the portable-secure system, generated overhead sits near 50% for both workloads under these configurations. Setting Timeslice to 2 seems to be the turning point at which the additional overhead generated by the portable-secure hypervisor ceases to be considered an acceptable loss. For values greater than 2, the overhead tends to remain below 15%, which we deem an acceptable overhead for the additional security provided. The x86-secure system, however, manages to keep the overhead at less than 6% for Timeslice values equal to or more than 2. We deem this overhead level acceptable for the increased levels of security. E) Latency & Compute workloads with Varying Ratelimit Value The results of modifying the Ratelimit value can be seen in Figures 5.8 and 5.9. The immediate trend to note is that the overhead increase is almost negligible (less than

72 CHAPTER 5. SELECTIVE CACHE FLUSHING 62 2%) for all experiments with the x86-secure system. The portable-secure system, however, shows a generally increasing trend in the amount of overhead as the Ratelimit decreases, with the highest values appearing with Ratelimit set to 100. The largest difference occurs for the Latency workload when Ratelimit is set to 2000, demonstrating a comparable decrease in overhead. As this trend was not reflected in the Compute workload, it may indicate a tendency for the Latency workload to cause pre-emption among VMs that have expended less than 2ms of their time slice. Since it was designed for specific instances in mind, we conclude from these minor changes that the Ratelimit variable should not have a large impact on the secure hypervisors under normal circumstances. F) Latency & Compute workloads with Varying VM Number The results of modifying the number of guests executing concurrently within the system can be seen in Figures 5.10 and The most surprising result from experimenting with this variable was the observation that, on the portable-secure system, there was less of an overhead increase with 50 guests than with 10. From our understanding of the Credit Scheduler (the scheduling algorithm used by Xen), this is due to the algorithm s tendency to reduce the number of context switches between VCPUs as there are more VCPUs waiting for execution time. Scheduling this way leads to each VCPU being given less frequent, but longer lasting sequential runs on the CPU core, and therefore reduces the amount of overhead due to context switching. One might expect this to show more variation in the execution time from guest to guest when there are a larger number of guests running. We believe that such a situation is being avoided because each VM begins its application execution by

waking from sleep, thereby pre-empting any currently running VCPU. We speculate that each guest must start its load execution via pre-emption, before settling into a comparatively sequential execution, keeping the overhead due to context switching low. The x86-secure system did not show particularly significant overhead over the course of this experiment. For smaller numbers of guests, the variation in the number of VMs showed no significant overhead increase from the insecure to the secure versions. This is likely due to the fact that no context switches are necessary when there are fewer domains running than CPU cores available.

5.6 Summary

This chapter details the approach and implementation of our solution to sequential cache-based side-channel attacks. The solution is composed of two cache flushing functions and an algorithm to determine when to selectively flush the cache. By combining a cache flushing function applied at virtual-machine-level granularity with a decision algorithm, we were able to reduce the overhead of cache flushing to a reasonable level. We compare our solution with the current state of the art and determine that it is able to prevent side-channels that the insecure hypervisor cannot. As a trade-off, each of these solutions generates additional overhead depending on the context in which it is run. Both hypervisors generate fairly little overhead (less than 10%) when run with default configurations, regardless of hardware. Despite this, the effect of the hardware on our solutions is apparent, showing variations in the overhead generated based on

74 CHAPTER 5. SELECTIVE CACHE FLUSHING 64 the solution applied and the hardware executing it. This confirms that the ideal solution would need to take the underlying hardware into account when selecting a hypervisor, or potentially altering the executing code based on the hardware configurations. Neither of these things would be particularly difficult to do and could greatly benefit a performance sensitive environment. While modifying scheduler configurations both implementations generate more overhead as the Timeslice value is decreased. This is likely due to the increased rate of context switching a low Timeslice value would imply. While the portable-secure hypervisor generates increasingly large amounts of overhead as this value is decreased, the x86-secure implementation remains within the bounds of reason (less than 15%). We believe this demonstrates the viability of such an approach without compromising the efficiency of the system. Surprisingly, neither implementation generates a particularly large amount of overhead as the number of VMs increased. This is likely due to the finer points of the scheduling algorithm implemented by Xen and could be further exploited for efficiency s sake.

75 Chapter 6 Cache Partitioning This chapter details our approach to mitigating parallel cache-based side-channel attacks. It explains the differences between sequential and parallel attacks, and what needs to be considered to prevent a parallel attack. Included in this chapter are the unique attributes of the parallel attack and the techniques we use to mitigate them. We also include a discussion on the expected issues and mitigations of implementing a solution as specified. Included in this discussion are the expected issues when implementing such a solution and suggested countermeasures or circumstances in which the solution would be viable. This includes observations from related work having dealt with similar techniques. We then describe how our solution was realized and conclude by comparing its performance with that of existing systems. 6.1 Parallel vs Sequential Side-Channels Due to the increased levels of noise and unpredictability, parallel side-channels are more difficult to execute than traditional sequential side channels. However, as current 65

76 CHAPTER 6. CACHE PARTITIONING 66 hardware trends more towards having CPU cores share a large last-level cache the possibility of establishing a parallel side-channel becomes more likely. In addition, this trend towards large, shared, victim caches increases the number of cache lines that would need to be flushed using our sequential solution. This would increase the overhead as more time would be needed to flush more lines. The large size of these shared caches can end up being an asset, however, as it allows VMs to use large amounts of cache memory while still using only a fraction of the cache. This allows for the shared cache to be partitioned into more slices while still allocating a large amount of memory to each slice. While selectively flushing the cache is an effective way to prevent sequential sidechannels, it is not a reliable method of preventing parallel side-channels. Flushing the cache to prevent a side-channel relies on the cache being flushed in between the sender and the receiver attempting to communicate. For a sequential channel, there is a clear ordering to the sending and receiving of messages, i.e., the sender writes a message and, once it is done, the receiver reads the message. Because of the ordering, we have a clear window in which to flush the cache when the context switch between the sender and the receiver occurs. For a parallel channel, no such obvious window exists. Both the sender and receiver are attempting to communicate over a shared cache by accessing it at the same time. Implementing a flushing function in this scenario would require a flush every time either VM modified the cache. Not only would this make the cache useless, but it would add massive overhead to the system. Rather than flush certain cache lines when they are a threat, we have opted to solve the problem of parallel cache-based side-channels by preventing co-resident VMs from being able to evict one another s

data from the cache. In order to do this, and to retain a level of fairness in cache access throughout the system, each VM would need to be assigned a partition of the shared cache that does not overlap with any other. By assigning such partitions, and ensuring that no VM can access cache lines outside of its partition, each VM would be able to use a fragment of the shared cache without interfering with any other. To maintain the system's adherence to the Cloud model, such partitioning would need to be enforced entirely through software means. Previous work has attempted to partition or otherwise affect placement in the cache using software means [13] [27] [30]. The most common of these techniques is known as cache colouring, the details of which are explained in Section 6.4.1.

6.2 Overview

Much like the sequential PTP technique, the parallel PTP technique relies on two VMs getting sequential access to the same cache lines. While both VMs have parallel access to the cache, mutual exclusion is still held on the cache lines themselves, and therefore the same lines must be accessed sequentially. The difference between the two techniques is that the parallel approach assumes no context switches. Since the lack of context switches means that there is no clear phase in between each VM's accesses to the same cache line, the previous solution for sequential cache-channels cannot be applied in this context. Our solution to the parallel variant takes a preventative, rather than reactive, approach. Since we cannot tell when a cache channel might be forming, we simply have to make sure that the two

78 CHAPTER 6. CACHE PARTITIONING 68 Figure 6.1: The Effect of Partitioning the Cache on the PTP Technique (Parallel) VMs cannot access the same cache lines. For this solution, we partition the shared cache into a number of smaller sections, called slices or partitions, and allow each VM access to a subset of these partitions (usually one). In this format, each cache line belongs to one, and only one, partition. Therefore, if two VMs are given different partitions of the cache they are unable to evict, or otherwise touch, one another s cache lines (or any cache lines outside of their partitions for that matter). The effect of this partitioning on the PTP Technique can be seen in Figure 6.1. As we can see in Figure 6.1, there are two cache partitions enforced on the cache data. The one in the red dotted line (Partition 1) is reserved for VM1 (the Probing

79 CHAPTER 6. CACHE PARTITIONING 69 instance), and the one in the green dotted line (Partition 2) is reserved for VM2 (the Target instance). These partitions map to the first three and last four cache lines out of the seven shown. When the probing instance (VM1) tries to prime the cache, it is only able to prime the cache lines in its partition, i.e., the first three lines, and has to leave the other four empty. When the triggering instance (VM2) then attempts to modify every other cache line, it can only do so in its own partition and therefore ends up evicting none of VM1 s data. When the probing instance once again reads from the cache lines that it is able, it can see no difference from when it left, and therefore no communication was able to occur. This solution intends to prevent a parallel cache-channel from ever occurring by preventing the VMs from ever sharing the same cache lines. This is markedly different from our sequential solution which prevents sequential channels by actively stopping channels when they have a possibility of forming. 6.3 Expected Issues and Mitigations The biggest expected issue in partitioning the cache is the reduced efficiency of the cache s usage. If the solution restricts each VM s usage of the cache to too small a section it could negatively impact performance. Since the side-channel could be prevented by removing the cache entirely, the goal of this solution is to prevent such channels while still retaining useful efficiency from the cache. Effectively, this makes the task to prevent side-channels from occurring, while incurring as little overhead as possible.

80 CHAPTER 6. CACHE PARTITIONING 70 Previous work on cache partitioning, however, has shown that intelligent partitioning of the cache can be used to reduce the eviction rate of cache data and actually improve performance. Ultimately, the trade-off in this situation is that by partitioning the cache, the size of the usable portion of cache for each VM is reduced, but it will also have less useful data evicted by other VMs. As a cache portion gets smaller there are fewer VMs that can compete for its usage, making it a more efficient portion of cache to work with. If a VM is given a cache of half the size, but does not have to worry about data being evicted from it by foreign VMs then it may end up yielding a greater cache hit/miss ratio. The ideal ratio of cache efficiency to size is difficult to determine. Therefore, one of the goals of the experiments in this section is to demonstrate which partitioning schemes may yield more efficient cache usage, while still ensuring side-channel security. To this end we have repeated our experiments with multiple partitioning configurations to determine how they might affect different workloads. 6.4 Cache Partitioning Technique The implementation to our parallel side-channel solution involves the incorporation of our Cache Partitioning technique into a canonical Cloud system. The number of partitions can be configured statically when building the hypervisor. In this section, we will typically refer to the unmodified hypervisor as Default (1) or as Partitioned (1), as the unmodified system considers the cache to be one large entity. The modified hypervisor is referred to as Partitioned (x) where x is the number of partitions into which the cache has been split. Our solution was implemented using the code base for the open source Xen Hypervisor, Version 4.2. The outcome is a new hypervisor which

81 CHAPTER 6. CACHE PARTITIONING 71 selectively assigns VMs memory based on the partition to which they are assigned. We have opted to partition the cache statically, which is to say, the partitions cannot be adapted at runtime. This means that the number of partitions cannot be scaled up or down to correspond with the number of executing VMs. As shown in our Section 6.5.5, the efficiency of the solution relies on correctly balancing the number of VMs to the number of partitions, and a significant mismatch can lead to significant overhead. The reason for the static allocation comes from the fact that we are partitioning the entire memory pool to guarantee the complete security of the system. Other works such as those done by Shi et al. [27] have attempted to partition the cache dynamically. To maintain efficiency they are forced to partition only a small portion of the cache, as the hypervisor would not be able to reallocate memory on a large scale at any practical speed. In addition, their solution requires the client s software to be aware of their solution so that it can take advantage of the partitions it creates. Our solution, by contrast, partitions the entire memory pool at boot time and as a result the number of partitions are fixed. It does this by assigning all memory pages to partitions based on the least significant bits of their page frame numbers. At boot time, VMs that belong to one partition are only allowed to allocate memory that belongs to the same partition. Ultimately, this may lead to wasted resources, as a single VM running in a 4-way partitioned system can only allocate one quarter of the total memory. Even-balancing of loads, however, can make sure that memory resources are maximized. Having access to multiple machines, each with different partition configurations, can guarantee that the loads can be appropriately distributed so as not to encounter a scenario wherein such resources are wasted.
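The thesis does not show the allocator changes themselves, so the following is only a rough sketch of the allocation rule just described, under the assumption that a page's partition is derived from the low bits of its page frame number; the names used are hypothetical, not Xen symbols.

#include <stdbool.h>
#include <stdint.h>

#define NUM_PARTITIONS 4u   /* fixed when the hypervisor is built, as in a 4-way partitioned system */

/* Partition of a page, taken from the least significant bits of its PFN. */
static unsigned int partition_of_pfn(uint64_t pfn)
{
    return (unsigned int)(pfn % NUM_PARTITIONS);
}

/* Boot-time allocation rule: a domain assigned to partition 'part' may only
 * be given pages whose PFN maps to that same partition. */
static bool page_allowed_for_domain(uint64_t pfn, unsigned int part)
{
    return partition_of_pfn(pfn) == part;
}

A real implementation would apply a check of this kind inside the hypervisor's page allocator while populating a guest's memory at boot, which is why the partitioning cost is paid once at boot time rather than during workload execution.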

6.4.1 Cache Colouring

Cache colouring is a process by which memory pages are mapped to cache lines via groupings called colours. In reality, the term colour just refers to a specific part of the memory address (usually the left-most bits corresponding to a page frame number). The mapping of memory addresses to cache locations (referred to as cache lines) is implemented in the hardware, but is typically consistent across systems. The location of a particular memory address in the cache can therefore be known in advance, and an operating system, or hypervisor, can select memory addresses that correspond to particular sets of cache lines. Usually, cache colouring is done for cache-hit optimization purposes. For instance, much like with virtual page colouring, it makes no sense for two instructions that are consecutive in memory to evict one another in the cache. Cache colouring solves this problem by mapping consecutive memory addresses to separate sections of the cache. Only sufficiently distant memory addresses will be able to evict one another. Despite its primary use as an optimization technique, the specific mapping of memory addresses to cache lines can be exploited for other purposes. In this case, we use it to bolster the security of the system by mimicking VM isolation across the cache. Modern caches have a strict mapping from physical memory to cache lines. Using this, we can predict which cache lines any given page will have access to. Because of this mapping, we can assign a VM to a cache partition by restricting its assignable pages to those which correspond with that partition. The cache line/partition a memory address maps to can be determined by looking

at its least significant bits. The standard cache map shows a memory address separated into three sections. Starting from the least significant bits, these sections are the Cache Block (or Offset), the Index (or Associative Set #), and the Tag.

Figure 6.2: Mapping of a Memory Address in the Cache

As is visible in Figure 6.2, a large section of the Associative Set # (the Index) is pre-defined by the address's page number and cannot be changed at page-level granularity. However, in our given example there remain several bits in the address that overlap with the Associative Set # and do not overlap with the page definition. These bits are what we refer to as the page colour, as only pages that share these five bits will be able to compete for the same cache lines. For instance, two memory addresses that share the same colour (the same bits in bit positions 15-20) can evict one another, whereas two addresses that differ in those bits cannot. Using this pattern, the hypervisor can assign pages to guest VMs based on colour, thereby restricting a VM's cache access to the subset of lines that follows that bit combination. In the above example, the hypervisor has up to 32 colours to work with, which means it can effectively partition the cache into a maximum of 32 equal parts of 64KB. Alternatively, it could choose to use only a subset of those 5 bits. For instance, using 4 of the 5 bits would give 16 partitions; 3 bits would yield 8, and so on.
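As a worked illustration of the arithmetic above: the 2MB cache, 32 colours, and 64KB slices come from the text, while the 16-way associativity, 64-byte lines, and 4KB pages are assumed values chosen so that the numbers reproduce that example.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Example geometry (assumed): 2MB shared cache, 16-way set-associative,
     * 64-byte lines, 4KB pages. */
    const uint64_t cache_size = 2u * 1024 * 1024;
    const uint64_t ways       = 16;
    const uint64_t page_size  = 4096;

    /* Bytes of cache spanned by one way, i.e. number of sets * line size. */
    const uint64_t way_size   = cache_size / ways;        /* 128KB */

    /* Number of page colours = way size / page size. */
    const uint64_t colours    = way_size / page_size;     /* 32   */

    /* Each colour corresponds to an equal slice of the cache. */
    const uint64_t slice_size = cache_size / colours;     /* 64KB */

    printf("colours: %llu, slice size: %llu bytes\n",
           (unsigned long long)colours, (unsigned long long)slice_size);

    /* The colour of a page is given by the low bits of its page frame number. */
    uint64_t pfn = 0x12345;
    printf("colour of pfn 0x%llx: %llu\n",
           (unsigned long long)pfn, (unsigned long long)(pfn % colours));
    return 0;
}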

84 CHAPTER 6. CACHE PARTITIONING 74 It should be noted that while our example shows 5 bits, this number is specific to the cache and may vary with the cache s size, associativity, and cache line size. Typically, a shared cache will be far larger than the 2MB example we have shown here and therefore our example with 5 bits worth of colours (32 colours) is conservative. Our implementation uses this form of cache colouring to enforce that each VM be restricted to a unique partition, or group of partitions, of the shared cache. For instance, if the system is set to have 16 partitions, then each of, up to, 16 running VMs will be assigned memory from the set that maps to their cache partition. If there are additional VMs, they will need to be assigned within a used partition, or the number of partitions will need to be increased. Our studies of Cloud technologies have suggested that the typical Cloud machine may run between 6 and 16 VMs simultaneously. Therefore, to maximize security, we suggest partitioning the cache either 8 or 16 times depending on the expected load. At the moment, the number of partitions are declared statically so as to reduce the overhead of having to reassign partitions. Further work may focus on enabling dynamic partitioning to maximize memory/cache usage at a trade-off to reassignment overhead. By default, the hypervisor s memory is not bound to a specific partition as we are not aware of any side-channel attack that targets the hypervisor. However, this could be easily implemented using the same technique. 6.5 Experimental Evaluation This section describes the process of evaluating our solution. It describes the criteria by which we evaluate our solution s effectiveness and the environment in which we

conduct our experiments. We also describe the metrics by which we compare our solution to the existing state-of-the-art. These metrics are used to determine under what conditions our solution can be practically implemented in a commercial Cloud environment.

6.5.1 Objectives

The evaluation of our secure hypervisors focuses on answering two research questions: Does this hypervisor prevent parallel cache-based side-channels? And what is the performance difference between it and the insecure hypervisor? To answer these questions, we have simulated a single-server Cloud environment. In this environment, we subject each hypervisor to a conventional parallel cache-based side-channel attack, using the PTP technique, designed for use in Cloud environments. In addition, we subject each hypervisor to a series of workloads under different configurations, designed to test different behaviours of the system. The resulting completion times are used to determine a likely increase in overhead.

6.5.2 Environment

The evaluation was conducted using the default Xen 4.2 hypervisor and our modified version, with the cache divided into 2, 4, 8, or 16 partitions. Once again the experiments were run with Dom0 running Ubuntu and each guest running a sparse installation of Ubuntu. Each hypervisor was installed on an IBM System X3200 M3 model machine, with 32GB of RAM and a single Xeon X3430 processor. This processor contains four cores, each of which uses its own 256KB CPU L2 cache and a shared 8MB victim L3 cache. From our experience with Cloud

systems and examples in the related work [13], we believe this machine is comparable to a server that might be used in a Cloud environment. Unless otherwise stated, all environmental parameters and configurations were left at their default values.

6.5.3 Side-Channel Prevention

The hypervisors were evaluated using the same side-channel attack used against the Selective Cache Flushing solution in Chapter 5, but adapted for a parallel environment. The side-channel Receiver and Sender programs, performing the functions of the probing instance and the target instance, respectively, were each installed onto separate guest instances. Both programs were executed simultaneously by co-resident guests on the test-bed machine, and pinned to separate CPU cores such that the L2 cache could not be used as a side-channel. The attack was designed to send an identifiable string of 160 bits from the target instance to the probing instance. The attack was run ten times on each hypervisor to verify the consistency of the results. In order to verify each hypervisor's ability to block the side-channel under any conditions, the side-channel was given ideal conditions to work in. Specifically, the probing instance, the target instance, and Dom0 were the only virtual machines running on the hypervisor, and all three were pinned to separate CPU cores. This configuration represents the best possible conditions for a parallel cache-based side-channel attack; any variation would make it more difficult for the attack to succeed. Our experiments assume that if an attack can be prevented under these conditions, then the same prevention would work for environments more hostile to the attack's success. The viability of the defence mechanisms implemented should not be affected by these configurations.
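As an illustration of the pinning just described (not a command from the thesis, and the guest domain names are hypothetical), VCPUs can be bound to physical cores with Xen's xl tool:

# Pin Dom0, the probing instance, and the target instance to separate physical cores
xl vcpu-pin Domain-0 all 0
xl vcpu-pin probe-vm all 1
xl vcpu-pin target-vm all 2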

6.5.4 Performance Experiments

Statically partitioning the cache effectively reduces the size of the cache that each VM has access to. However, it should also prevent any VM from having its cache data evicted by another VM. Due to these conflicting factors, the amount of overhead should vary based on how well the workload lines up with the partitions. Ideally, if each VM is assigned exactly one partition and all VMs need the cache at the same time, then the reduced cache size should not have a negative impact on performance. On the contrary, under these ideal conditions we should expect the partitioned hypervisor to provide an increase in performance, since the VMs would be prevented from interfering with one another's cache usage. Previous work by Tam et al. [30] demonstrated that by partitioning the cache in a multi-processing system one can expect performance improvements of up to 17%, depending on the workload. It should also be noted that the overhead due to the implementation itself is incurred entirely during the boot process of the VMs (as they are assigned memory), rather than at any point during their workload execution.

List of Experiments

To evaluate the performance of our partitioned system, the following experiments are conducted:

A) Apache Benchmark with Varying Number of VMs and Partitions

B) Cachebench Benchmark with Varying Number of VMs and Partitions

C) Memory Access Rates with Varying Number of Partitions

D) VM Boot Times vs. Number of Partitions

To test the partitioned solution, we used two standard workloads: the Apache benchmark and the Cachebench benchmark from the open-source Phoronix Test Suite. Apache was chosen because it is the type of software typically found in the Cloud, and we wished to see how our solution handles such software. In addition, it was used to benchmark our sequential solution, and we wanted a point of comparison between the sequential and parallel solutions. Cachebench was chosen because it was expected to show more detailed cache usage than the 7zip benchmark, while still remaining a robust and open-source benchmark.

In addition, we also run two customized programs designed to measure the average number of cycles necessary to access a set of memory addresses. The first program, Timings (Cached), primes the cache and then measures the average amount of time, in cycles, needed to access the primed cache lines. The second program, Timings (Flushed), does the same, but flushes the cache after priming. Each of these four tests was conducted using the default hypervisor and the partitioned hypervisor with the number of partitions set to 2, 4, 8, or 16. The results are discussed in Section 6.5.5.

Certain experiments from Chapter 5 were not repeated for the partitioned system. Specifically, the custom workloads, Latency and Compute, and their related scheduler configurations were omitted because they were not expected to affect the performance of the partitioned system. These experiments focus on the tie between the frequency of context switching and the overhead generated, and no such tie is expected for the partitioned system. Likewise, there is little use in measuring the cache flushing overhead, as it should not change between the partitioned and non-partitioned systems.

While there is no expected cache flushing overhead, the implementation of the partitioned system suggests that overhead may be generated by the memory allocation subsystem at each VM's boot time. To this end, we have added a series of experiments comparing VM boot times to the number of partitions in the system.

A) Apache Benchmark with Varying Number of VMs and Partitions

The Apache benchmark used in these experiments is the same one used in the flushing tests of Chapter 5. This benchmark was chosen for both experiments because we believe it represents a realistic Cloud workload and has been developed as a robust benchmark. Using the same benchmark in both sections also gives us a base on which to compare the overhead generated by both solutions. Based on the design of the benchmark, it is expected that both solutions should have an impact on its performance.

B) Cachebench Benchmark with Varying Number of VMs and Partitions

Cachebench is a program to empirically determine some parameters about an architecture's memory subsystem [19]. As the expected performance impact of our modified system depends on cache behaviour, this benchmark is a natural choice. The results that follow come from executing Cachebench's Read/Modify/Write functionality.

C) Memory Access Rates with Varying Number of Partitions

The Timings benchmarks are programs of our own design, specifically tuned to observe cache hit frequency within a section of memory.

In the Cached version, the program assigns a portion of memory larger than the L2 cache, but no larger than the L3 cache (in this case 2MB). The program then iterates over this memory, storing it in the cache. Next, the program performs timed accesses of each of these memory locations and takes the average. The idea is to bias the results in terms of cache hits, determining how quickly the memory can be run through if almost all of the cache accesses are hits. In the Flushed version of this program, everything functions the same except that the cache lines are flushed between the first and second iterations over the memory (that is, between the cache priming and the timings). This biases the results in favour of cache misses so that we have a comparable baseline. By comparing the flushed and non-flushed values for the same hypervisor configurations, we should be able to determine how much use of the cache each hypervisor is making.

For the Timings benchmark, the Timings program was run on a single VM which was given uncontested access to a single CPU. For different stages of the experiment, we then started additional VMs running a cache contention program on different CPUs. The cache contention programs ran a process that simply attempted to modify as many cache lines as possible, as fast as possible, in a similar fashion to the programs used to send and receive side-channel messages.
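The sketch below illustrates the core of this measurement and of the contention workload. It is a simplified stand-in for our actual programs, written for an x86 guest with GCC/Clang intrinsics; the buffer size and line size mirror the description above, but every name here is ours rather than part of the real benchmark.

```c
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_clflush, _mm_mfence */

#define BUF_SIZE   (2u * 1024 * 1024)   /* larger than L2, smaller than L3 */
#define LINE_SIZE  64u

/*
 * Average cycles per access over the buffer.  If 'flush' is non-zero the lines
 * are flushed between priming and timing, giving the "Flushed" baseline that
 * is biased towards cache misses.
 */
static double avg_access_cycles(volatile uint8_t *buf, int flush)
{
    uint64_t total = 0;
    unsigned int aux;
    size_t i, n = BUF_SIZE / LINE_SIZE;

    for (i = 0; i < BUF_SIZE; i += LINE_SIZE)       /* prime the cache */
        (void)buf[i];

    if (flush) {
        for (i = 0; i < BUF_SIZE; i += LINE_SIZE)
            _mm_clflush((const void *)&buf[i]);
        _mm_mfence();
    }

    for (i = 0; i < BUF_SIZE; i += LINE_SIZE) {     /* timed accesses */
        uint64_t t0 = __rdtscp(&aux);
        (void)buf[i];
        uint64_t t1 = __rdtscp(&aux);
        total += t1 - t0;
    }
    return (double)total / (double)n;
}

/*
 * Cache contention worker run in the co-resident VMs: touch as many distinct
 * cache lines as possible, as fast as possible, to evict other tenants' data.
 */
static void contend(volatile uint8_t *buf)
{
    for (;;)
        for (size_t i = 0; i < BUF_SIZE; i += LINE_SIZE)
            buf[i]++;
}
```

Comparing avg_access_cycles(buf, 0) against avg_access_cycles(buf, 1) for the same hypervisor configuration indicates how much effective use of the shared cache that configuration allows.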

D) VM Boot Times vs. Number of Partitions

The major computational part of the partitioning algorithm comes from finding the correct pages to assign to a VM. While this is mostly due to the specific implementation of the algorithm, it is expected that the modifications to the memory allocation system will significantly impact VM boot times. This is because the partitioning system is built into an existing binary-buddy memory allocation system and does not function efficiently. Because overhauling this sort of system could be costly, we decided to gauge just how much boot-time overhead a simple implementation would produce, so that the decision to implement the system more efficiently could be better informed. This experiment simply records the boot times of several machines as the number of partitions in the system is adjusted.

6.5.5 Results

The attack performed on the insecure hypervisor was able to successfully communicate the entire 20-bit message between instances every time (10/10). This result demonstrates the vulnerability of the unmodified system. By contrast, the partitioned hypervisor yielded 0 bits of successful communication over all twenty attempts. This result suggests that the side-channel could not be successfully established with these countermeasures in place.

Figures 6.3 to 6.8 present the results of each performance experiment given in Section 6.5.4. Each experiment observed a mean execution time reported by the VMs involved over twenty workload executions under identical conditions. The means observed for each VM were then averaged to produce the values shown. The error bars represent the standard deviations between the means of each VM.
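Concretely, the aggregation works as follows; this is a small self-contained sketch, and the function name is ours rather than part of any benchmark.

```c
#include <math.h>
#include <stddef.h>

/*
 * Aggregate per-VM mean execution times into a plotted point and error bar:
 * the point is the mean of the per-VM means, and the error bar is the
 * (population) standard deviation across those per-VM means.
 */
static void aggregate(const double *vm_means, size_t n_vms,
                      double *point, double *error_bar)
{
    double sum = 0.0, var = 0.0;
    size_t i;

    for (i = 0; i < n_vms; i++)
        sum += vm_means[i];
    *point = sum / (double)n_vms;

    for (i = 0; i < n_vms; i++)
        var += (vm_means[i] - *point) * (vm_means[i] - *point);
    *error_bar = sqrt(var / (double)n_vms);
}
```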

Figure 6.3: Apache Benchmark
Figure 6.4: Cachebench Benchmark
Figure 6.5: Cache Timing Benchmark (Cached)
Figure 6.6: Cache Timing Benchmark (Flushed)
Figure 6.7: Cache Timing Benchmark (Flushed - Cached)
Figure 6.8: VM Boot Times based on Number of Partitions

6.6 Result Analysis

This section discusses the results from the previous section to analyse our solution in detail. Additionally, the results are used to compare how our implementations behave differently from the insecure model and to speculate why this may be the case. Our analysis focuses on two criteria: the successful inhibition of a side-channel attack, and how the emerging performance trends compare between hypervisors. The results are analysed to extrapolate the usefulness of our solution and its applicability in modern Cloud environments.

6.6.1 Side-Channel Prevention

As mentioned in Section 6.5.5, no successful side-channel was established between two given VMs using the secure hypervisor. This leads us to conclude that, for the attack we tested against, our modified systems are capable of mitigating parallel cache-based side-channel attacks. As we were using a representative attack, it is important to determine exactly to what extent we can extrapolate this result. Our experiment uses a modified version of the PTP technique designed to work in a parallel environment. This particular solution was not designed to prevent sequential side-channel attacks and is expected to have no effect on such attacks should they occur. Our empirical analysis only tests the solution on one representative attack, and on a single hardware configuration. However, the theory states that, if properly implemented, this solution should prevent any shared-cache communication between VMs regardless of the attack or system. In this case, we believe that our experiment stands as a proof of concept that such a solution can prevent parallel side-channels in a Cloud environment without interfering with the Cloud model as we have defined it.

6.6.2 Performance Analysis

This section analyses the results of the performance experiments given in Section 6.5.5. The experiments are the same as those described in Section 6.5.4.

A) Apache Benchmark with Varying Number of VMs and Partitions

The first thing to note about the Apache benchmark is that the runs for 1, 2, and 4 VMs all involve four or fewer instances of the benchmark being run on four cores. Once the number of VMs increases to 8 or 16, multiple benchmarking instances must be assigned to the same core at the same time. Due to this, we would expect to see a large drop in throughput across all instances as we increase to 8 or 16 active VMs. This drop can be seen in Figure 6.3, as the requests per second drop by almost a factor of two when the number of VMs increases from 4 to 8, and again from 8 to 16.

Consider Figure 6.3. In this figure, we can see that the overhead generated (the reduction in requests per second) as the number of partitions increases is not significant while the cache is split into 1 to 4 partitions. However, we do see the overhead increasing for 8, and especially 16, partitions. Considering that the L3 cache on this machine is 8MB, when it is split into 4 partitions each VM has 2MB at its disposal. This appears to be a pivotal amount of cache memory for the benchmark, because when given 1MB (8 partitions) or 512KB (16 partitions) partitions, the program slows down.

Interestingly, as the number of VMs increases, the percentage of overhead generated for the larger numbers of partitions decreases. This may be because, as more benchmarking instances vie for the same cache memory, they begin evicting one another's data. Effectively, this means they are not able to make use of the full 2MB+ of cache that would otherwise be available to them. This implies that as the number of VMs increases, increasing the number of partitions generates less overhead.

B) Cachebench Benchmark with Varying Number of VMs and Partitions

Despite expectations to the contrary, the Cachebench benchmark shows no significant difference between the partitioned and unpartitioned hypervisors as the number of VMs increases. To explore why, we examined the Cachebench source code. From the code, it appears that Cachebench obtains its readings by measuring the response time for small sections of the cache at any one time. Because these sections are so small, they should always have enough cache to work with no matter how much the cache has been partitioned. This benchmark serves as an example that a program with a small enough memory footprint should not be negatively affected by cache partitioning, because it can fit entirely in the smaller cache partition assigned to its VM. For instance, if a program can only use up to 4KB of the cache at any one time, there should be no negative impact on performance even with 16 partitions, as each partition still allows a VM up to 512KB of cache memory.

C) Memory Access Rates with Varying Number of Partitions

Figure 6.5 shows the access latency for each number of partitions as the number of VMs increases. The first thing to note is that, despite attempts to remove all non-CPU-cache interference, there are still some bottlenecks in the system that can affect the overhead as the number of partitions changes. Factors that can affect these timings include page contiguity, access to the translation look-aside buffer, resource scheduling, and many other shared resources in the software or hardware architecture. For this reason, we have run all of the Timings experiments with a modified program that flushes the expected cache line before each line is accessed, for a guaranteed miss. The results of this experiment can be seen in Figure 6.6 and should be used as a baseline against which to compare the cached values. For comparison's sake, we have included Figure 6.7, which directly compares the cached results to the flushed results.

Interestingly, when partitioning the cache 2 or 4 ways, the partitioned hypervisor outperforms the default hypervisor when there is a large number of interfering VMs. This is reminiscent of the work done by Tam et al. [30], where they demonstrated cache colouring as a performance optimization technique. When the Timings program is running without interference, the non-partitioned hypervisor can let it have access to the entire cache. This would, as expected, cause it to access data faster because it can generate more cache hits. When run under the partitioned hypervisor, the program is only given access to a fraction of the cache. This should increase its overhead accordingly.

When there are VMs running interfering programs on different CPUs, however, they interfere with the Timings program (unless prevented) by evicting data from the shared cache.

In the non-partitioned hypervisor, this can be seen as the average access time going up threefold during the transition from zero to one interfering machine. For the partitioned hypervisor, this overhead cannot occur until there are more VMs than partitions. While there can be more VMs than partitions in this system, it should be noted that a side-channel could occur between any two VMs that share a partition, so for maximum security this situation should be avoided.

Settings with a larger number of partitions tend to generate more overhead even at their ideal loads. This is likely because, as the memory becomes more heavily partitioned, fewer and fewer contiguous memory pages are available, causing the memory assigned to a VM to become fragmented. From Figure 6.7, we can expect that the best balance for high loads is somewhere around 4 partitions. However, this configuration will only prevent side-channels between VMs that are located in different partitions. This performance/security trade-off is something that the provider would need to consider when configuring their system. For each number of partitions, we can see that the ideal distribution (least overhead) comes from having an equal number of partitions and VMs. This configuration prevents overhead from cross-VM cache evictions while still maximizing memory contiguity.

D) VM Boot Times vs. Number of Partitions

As mentioned earlier, the major overhead contributed by our solution is in the selective assignment of pages to VMs. This is done at the VM's boot time, or when memory needs to be re-allocated. Because our solution aimed to keep as much of the initial memory allocation system intact as possible (for benchmarking purposes), little effort was put into optimizing that system.

As it stands, the pages in Xen are still maintained in a single, long list which needs to be searched exhaustively whenever a new page must be assigned to a partitioned VM. This means that as the restrictions on pages get stricter (as there are more partitions), the time taken to search this list becomes immense. This outcome can be observed in Figure 6.8.

This overhead should not be a practical concern, however, as a more efficiency-focused implementation could remove it entirely. Previous work in cache colouring [30] has shown that by dividing the list into multiple sequences based on partition, we can all but remove any overhead generated by searching. For instance, if our solution kept 16 lists for 16 partitions, then rather than having to search a single list for every page, it would simply take the top page from the appropriate list. As the focus of our solution was side-channel prevention rather than optimization, we did not include these sorts of optimizations. However, as they have been proven to work in similar situations, we believe that implementing them would remove this overhead from the system.
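A minimal sketch of that per-partition free-list idea is shown below; the structure and function names are ours, and it simplifies away the binary-buddy bookkeeping of Xen's real allocator.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_COLOURS 128u   /* as derived for the 8MB L3 example earlier */

struct free_page {
    uint64_t pfn;
    struct free_page *next;
};

/*
 * One free list per colour: handing out a page of a required colour becomes
 * an O(1) pop instead of an exhaustive scan of one long global list.
 */
static struct free_page *free_lists[NUM_COLOURS];

static void free_page_insert(struct free_page *pg)
{
    unsigned int c = (unsigned int)(pg->pfn % NUM_COLOURS);

    pg->next = free_lists[c];
    free_lists[c] = pg;
}

static struct free_page *alloc_page_of_colour(unsigned int colour)
{
    struct free_page *pg = free_lists[colour % NUM_COLOURS];

    if (pg != NULL)
        free_lists[colour % NUM_COLOURS] = pg->next;
    return pg;
}
```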

6.7 Summary

This chapter details our approach to, and implementation of, a solution to parallel cache-based side-channels. We describe the challenges of preventing parallel side-channels and compare our new approach with the one used for sequential channels. Our solution prevents communication over a shared cache by partitioning the cache into multiple segments using a technique called cache colouring. We then compare this solution to the current state of the art.

In our comparison, we find that the partitioned hypervisor is more secure against side-channels regardless of the number of partitions we assign. In terms of performance, our experiments isolate two main factors as the number of partitions is increased: performance degradation due to the reduced cache size, and performance improvement due to the reduced cache-line eviction rate. Measuring how these factors affect performance, we observe that, depending on the workload, the partitioned hypervisor can perform more efficiently than its insecure predecessor. These factors need to be considered when configuring the system and, when properly balanced, can help optimize the system's performance.

Chapter 7

Conclusions and Future Work

Two unique security attributes of the Cloud motivated this research. First, the Cloud's architecture is particularly susceptible to cache-based side-channel attacks. Second, such attacks in the Cloud cannot be addressed by conventional means without interfering with the Cloud model. To address these problems, we have developed, implemented, and evaluated two new techniques designed to prevent cache-based side-channels: one for dealing with sequential side-channels, and the other for parallel side-channels. These techniques are implemented entirely within the server (hypervisor) of a Cloud system, so as not to interfere with the Cloud's methods of operation. Our solutions are unique in that they both address cache-based side-channels in the Cloud and do not interfere with the Cloud model (they require no changes to the client-side code, nor to the underlying hardware).

As shown in our evaluation, our secure hypervisors can effectively prevent sequential cache-based side-channels while generating less than 15% overhead, even with the systems configured to our worst-case scenario.

We show the trade-offs between hardware-specific solutions and more abstracted ones, which should be weighed based on the hardware being used. We also demonstrate the efficiency of our system on a large L2 cache; as the workload becomes more latency-sensitive, the portable system becomes less efficient, generating high levels of overhead for high-latency workloads. As they are designed to deal with hardware directly, Xen and other hypervisors already provide hardware-specific installations that can easily be adapted to include efficient side-channel solutions.

In terms of parallel side-channels, our secure hypervisor can effectively prevent parallel cache-based side-channels while generating overhead that is highly dependent on the number of partitions needed. If fewer partitions are needed, the solution can run as fast as, or possibly faster than, the insecure hypervisor. If more partitions are needed, the solution will run with overhead of up to 20-30%, though this depends heavily on the workload. The system's sensitivity to the partition configuration suggests that automatic re-configuration could be a boon to efficiency. While we have not implemented our system in such a way, further work down this path could lead to self-load-balancing systems with better performance.

We believe that our implementations serve as a proof of concept that cache-based side-channels can be prevented exclusively through server modifications. Specifically, they can be prevented without any changes to the client's code or to the hardware. This confirms that our solutions meet the Cloud model requirements listed in Section 2.1.

As a proof-of-concept design, there is still room for improvement. Any optimization of the cache flushing functionality may impact the overhead generated by the system, and improvements to the algorithm or to the flushing functionality can reduce this overhead while still providing security against cache-based side-channel attacks in a Cloud environment. Additionally, improvements to the cache partitioning functionality should be able to nullify the boot-time overhead of that hypervisor and further improve the performance benefits of the system.

As with cache-based side-channels, the Cloud provides an environment uniquely vulnerable to many other side-channels. Addressing them is difficult because, as medium-specific attacks, each often requires a unique solution. As such, each possible channel will need further work to develop a solution customized to its particular vulnerabilities.

Bibliography

[1] Emmanuel Ackaouy. The Xen credit CPU scheduler. Technical report, XenSource.

[2] Amazon. Amazon Web Services.

[3] AWS. AWS instance types.

[4] Sean Kenneth Barker and Prashant Shenoy. Empirical evaluation of latency-sensitive application performance in the cloud. In Proceedings of the First Annual ACM SIGMM Conference on Multimedia Systems, MMSys '10, pages 35-46, New York, NY, USA. ACM.

[5] Miles Cheatham. Core i7-930 specifications.

[6] Shuo Chen, Rui Wang, XiaoFeng Wang, and Kehuan Zhang. Side-channel leaks in web applications: A reality today, a challenge tomorrow. In Proceedings of the 2010 IEEE Symposium on Security and Privacy, SP '10, Washington, DC, USA. IEEE Computer Society.

[7] Christopher Fletcher and Srinivas Devadas. Ascend: An architecture for performing secure computation on encrypted data. In Proc. of the International Symposium on Computer Architecture.

[8] Apache Software Foundation. Apache HTTP server benchmarking tool.

[9] Michael Godfrey and Mohammad Zulkernine. A server-side solution to cache-based side-channels in the cloud. In Proceedings of IEEE CLOUD 2013.

[10] IBM. IBM mainframes.

[11] IDC. Enterprise IT in the cloud computing era.

[12] Intel. Intel 64 and IA-32 Architectures Software Developer's Manual.

[13] Taesoo Kim, Marcus Peinado, and Gloria Mainar-Ruiz. StealthMem: System-level protection against cache-based side channel attacks in the cloud. In Proceedings of the 21st USENIX Conference on Security Symposium, Security '12, pages 11-11, Berkeley, CA, USA. USENIX Association.

[14] Butler W. Lampson. A note on the confinement problem. Commun. ACM, 16(10), October.

[15] Linux Foundation. Xen project.

[16] Hui Lv, Xudong Zheng, Zhiteng Huang, and Jiangang Duan. Tackling the challenges of server consolidation on multi-core systems. In Proceedings of the IEEE International Symposium on Workload Characterization, IISWC '10, pages 1-10, Washington, DC, USA. IEEE Computer Society.

[17] Phoronix Media. Phoronix Test Suite.

[18] National Institute of Standards and Technology. NIST cloud definition, 2011.

[19] Phillip J. Mucci, Innovative Computing Laboratory. Cachebench.

[20] Diego Ongaro, Alan L. Cox, and Scott Rixner. Scheduling I/O in virtual machine monitors. In Proceedings of the Fourth ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE '08, pages 1-10, New York, NY, USA. ACM.

[21] Dag Arne Osvik, Adi Shamir, and Eran Tromer. Cache attacks and countermeasures: The case of AES. In Proceedings of the Cryptographers' Track at the RSA Conference on Topics in Cryptology, CT-RSA '06, pages 1-20, Berlin, Heidelberg. Springer-Verlag.

[22] D. Page. Defending against cache-based side-channel attacks. Information Security Technical Report, 8(1):30-44.

[23] D. Page. Partitioned cache architecture as a side-channel defence mechanism.

[24] Igor Pavlov. 7-Zip.

[25] Moinuddin K. Qureshi and Yale N. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 39, Washington, DC, USA. IEEE Computer Society.

[26] Thomas Ristenpart, Eran Tromer, Hovav Shacham, and Stefan Savage. Hey, you, get off of my cloud: Exploring information leakage in third-party compute clouds. In Proceedings of the 16th ACM Conference on Computer and Communications Security, CCS '09, New York, NY, USA. ACM.

[27] Jicheng Shi, Xiang Song, Haibo Chen, and Binyu Zang. Limiting cache-based side-channel in multi-tenant cloud using dynamic page coloring. In Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops, DSNW '11, Washington, DC, USA. IEEE Computer Society.

[28] Dawn Xiaodong Song, David Wagner, and Xuqing Tian. Timing analysis of keystrokes and timing attacks on SSH. In Proceedings of the 10th USENIX Security Symposium, SSYM '01, pages 25-25, Berkeley, CA, USA. USENIX Association.

[29] Citrix Systems. Xen credit scheduler.

[30] David Tam, Reza Azimi, Livio Soares, and Michael Stumm. Managing shared L2 caches on multicore systems in software. In Proc. of the Workshop on the Interaction between Operating Systems and Computer Architecture (WIOSCA).

[31] Yukiyasu Tsunoo, Teruo Saito, Tomoyasu Suzaki, and Maki Shigeri. Cryptanalysis of DES implemented on computers with cache. In Proc. of CHES 2003, Springer LNCS. Springer-Verlag.

[32] William von Hagen. Professional Xen Virtualization. Wrox Press Ltd., Birmingham, UK, 2008.

[33] Zhenyu Wu, Zhang Xu, and Haining Wang. Whispers in the hyper-space: High-speed covert channel attacks in the cloud. In Proceedings of the 21st USENIX Conference on Security Symposium, Security '12, pages 9-9, Berkeley, CA, USA. USENIX Association.

[34] Sisu Xi, Justin Wilson, Chenyang Lu, and Christopher Gill. RT-Xen: Towards real-time hypervisor scheduling in Xen. In Proceedings of the Ninth ACM International Conference on Embedded Software, EMSOFT '11, pages 39-48, New York, NY, USA. ACM.

[35] Yunjing Xu, Michael Bailey, Farnam Jahanian, Kaustubh Joshi, Matti Hiltunen, and Richard Schlichting. An exploration of L2 cache covert channels in virtualized environments. In Proceedings of the 3rd ACM Workshop on Cloud Computing Security, CCSW '11, pages 29-40, New York, NY, USA. ACM.

[36] Xiao Zhang, Sandhya Dwarkadas, and Kai Shen. Towards practical page coloring-based multicore cache management. In Proceedings of the 4th ACM European Conference on Computer Systems, EuroSys '09, New York, NY, USA. ACM.

[37] Yinqian Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. Cross-VM side channels and their use to extract private keys. In Proceedings of the 2012 ACM Conference on Computer and Communications Security, CCS '12, New York, NY, USA. ACM.


More information

Enabling Technologies for Distributed and Cloud Computing

Enabling Technologies for Distributed and Cloud Computing Enabling Technologies for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Multi-core CPUs and Multithreading

More information

Lecture 02a Cloud Computing I

Lecture 02a Cloud Computing I Mobile Cloud Computing Lecture 02a Cloud Computing I 吳 秀 陽 Shiow-yang Wu What is Cloud Computing? Computing with cloud? Mobile Cloud Computing Cloud Computing I 2 Note 1 What is Cloud Computing? Walking

More information

Virtualization. Pradipta De [email protected]

Virtualization. Pradipta De pradipta.de@sunykorea.ac.kr Virtualization Pradipta De [email protected] Today s Topic Virtualization Basics System Virtualization Techniques CSE506: Ext Filesystem 2 Virtualization? A virtual machine (VM) is an emulation

More information