Nesting Virtual Machines in Virtualization Test Frameworks


Nesting Virtual Machines in Virtualization Test Frameworks

Dissertation submitted in May 2010 to the Department of Mathematics and Computer Science of the Faculty of Sciences, University of Antwerp, in partial fulfillment of the requirements for the degree of Master of Science.

Supervisor: Prof. Dr. Jan Broeckhove
Co-supervisor: Dr. Kurt Vanmechelen
Mentors: Sam Verboven & Ruben Van den Bossche

Olivier Berghmans
Research Group Computational Modelling and Programming

Contents

List of Figures
List of Tables
Nederlandstalige samenvatting
Preface
Abstract

1 Introduction
  1.1 Goals
  1.2 Outline

2 Virtualization
  2.1 Applications
  2.2 Taxonomy
    2.2.1 Process virtual machines
    2.2.2 System virtual machines
  2.3 x86 architecture
    2.3.1 Formal requirements
    2.3.2 The x86 protection level architecture
    2.3.3 The x86 architecture problem

3 Evolution of virtualization for the x86 architecture
  3.1 Dynamic binary translation
    3.1.1 System calls
    3.1.2 I/O virtualization
    3.1.3 Memory management
  3.2 Paravirtualization
    3.2.1 System calls
    3.2.2 I/O virtualization
    3.2.3 Memory management
  3.3 First generation hardware support
  3.4 Second generation hardware support
  3.5 Current and future hardware support
  3.6 Virtualization software
    3.6.1 VirtualBox
    3.6.2 VMware
    3.6.3 Xen
    3.6.4 KVM
    3.6.5 Comparison between virtualization software

4 Nested virtualization
  4.1 Dynamic binary translation
  4.2 Paravirtualization
  4.3 Hardware supported virtualization

5 Nested virtualization in Practice
  5.1 Software solutions
    5.1.1 Dynamic binary translation
    5.1.2 Paravirtualization
    5.1.3 Overview software solutions
  5.2 First generation hardware support
    5.2.1 Dynamic binary translation
    5.2.2 Paravirtualization
    5.2.3 Hardware supported virtualization
    5.2.4 Overview first generation hardware support
  5.3 Second generation hardware support
    5.3.1 Dynamic binary translation
    5.3.2 Paravirtualization
    5.3.3 Hardware supported virtualization
    5.3.4 Overview second generation hardware support
  5.4 Nested hardware support
    5.4.1 KVM
    5.4.2 Xen

6 Performance results
  6.1 Processor performance
  6.2 Memory performance
  6.3 I/O performance
    6.3.1 Network I/O
    6.3.2 Disk I/O
  6.4 Conclusion

7 Conclusions
  7.1 Nested virtualization and performance results
  7.2 Future work

Appendices

Appendix A Virtualization software
  A.1 VirtualBox

Appendix B Details of the nested virtualization in practice
  B.1 Dynamic binary translation
    B.1.1 VirtualBox
    B.1.2 VMware Workstation
  B.2 Paravirtualization
  B.3 First generation hardware support
    B.3.1 Dynamic binary translation
    B.3.2 Paravirtualization
  B.4 Second generation hardware support
    B.4.1 Dynamic binary translation
    B.4.2 Paravirtualization
  B.5 KVM's nested SVM support

Appendix C Details of the performance tests
  C.1 sysbench
  C.2 iperf
  C.3 iozone

List of Figures

2.1 Implementation layers in a computer system
2.2 Taxonomy of virtual machines
2.3 The x86 protection levels
3.1 Memory management in x86 virtualization using shadow tables
3.2 Execution flow using virtualization based on Intel VT-x
3.3 Latency reductions by CPU implementation [30]
4.1 Layers in a nested virtualization setup with hosted hypervisors
4.2 Memory architecture in a nested situation
4.3 Layers for nested paravirtualization in dynamic binary translation
4.4 Layers for nested Xen paravirtualization
4.5 Layers for nested dynamic binary translation in paravirtualization
4.6 Layers for nested dynamic binary translation in a hypervisor based on hardware support
4.7 Layers for nested paravirtualization in a hypervisor based on hardware support
4.8 Nested virtualization architecture based on hardware support
4.9 Execution flow in nested virtualization based on hardware support
6.1 CPU performance for native with four cores and L1 guest with one core
6.2 CPU performance for native, L1 and L2 guest with four cores
6.3 CPU performance for L1 and L2 guests with one core
6.4 Memory performance for L1 and L2 guests
6.5 Threads performance for native, L1 guests and L2 guests with the sysbench benchmark
6.6 Network performance for native, L1 guests and L2 guests
6.7 File I/O performance for native, L1 guests and L2 guests with the sysbench benchmark
6.8 File I/O performance for native, L1 guests and L2 guests with the iozone benchmark

List of Tables

3.1 Comparison between a selection of the most popular hypervisors
5.1 Index table indicating in which subsections information about a certain nested setup can be found
5.2 The nesting setups with dynamic binary translation as the L1 hypervisor technique
5.3 The nesting setups with paravirtualization as the L1 hypervisor technique
5.4 Overview of the nesting setups with a software solution as the L1 hypervisor technique
5.5 The nesting setups with first generation hardware support as the L1 hypervisor technique and DBT as the L2 hypervisor technique
5.6 The nesting setups with first generation hardware support as the L1 hypervisor technique and PV as the L2 hypervisor technique
5.7 The nesting setups with first generation hardware support as the L1 and L2 hypervisor technique
5.8 Overview of the nesting setups with first generation hardware support as the L1 hypervisor technique
5.9 The nesting setups with second generation hardware support as the L1 hypervisor technique and DBT as the L2 hypervisor technique
5.10 The nesting setups with second generation hardware support as the L1 hypervisor technique and PV as the L2 hypervisor technique
5.11 The nesting setups with second generation hardware support as the L1 and L2 hypervisor technique
5.12 Overview of the nesting setups with second generation hardware support as the L1 hypervisor technique
5.13 Overview of all nesting setups

Nederlandstalige samenvatting

Virtualization has grown into a widespread technology used to abstract, combine or divide computing resources. Requests for these resources thus depend only minimally on the underlying physical layer. The x86 architecture was not specifically designed for virtualization and contains a number of non-virtualizable instructions. Several software solutions and hardware support have provided an answer to this. The growing number of applications means that users increasingly want to employ virtualization. Among other things, the need for complete physical setups for research purposes can be avoided by using virtualization. To virtualize components that may themselves use virtualization, it must be possible to nest virtual machines inside one another. Little information on nested virtualization was available, and this thesis examines in depth what is possible with current techniques. We test the nesting of hypervisors based on the different virtualization techniques. The techniques used are dynamic binary translation, paravirtualization and hardware support. For hardware support, a distinction was made between first generation and second generation hardware support. Successful nested setups use software solutions for the second hypervisor and hardware support for the first hypervisor. Only one working nested setup uses a software solution for both. Benchmarks were run to determine whether working nested setups perform well. The performance of the processor, the memory and I/O was tested and compared across the different levels of virtualization. We found that nested virtualization works for certain setups, especially with a software solution on top of a hypervisor with hardware support. Setups with hardware support for the upper hypervisor are not yet possible. Nested hardware support will become available soon, but for now the only option is to use a software solution for the upper hypervisor. The benchmark results showed that the performance of nested setups is promising.

Preface

In this section I give some insight into the creation of this thesis. It was submitted in partial fulfillment of the requirements for a Master's degree in Computer Science. I have always been fascinated by virtualization, and during the presentation of open thesis subjects I stumbled upon the subject of nested virtualization. Right from the start I found the subject very interesting, so I made an appointment for more information and I eventually got the subject! I had already used some virtualization software, but I did not know much about the underlying techniques. During the first semester I followed a course on virtualization, which helped me learn the fundamentals. It took time to become familiar with the installation and use of the different virtualization packages. At first, it took a long time to test one nested setup, and it seemed that all I was doing was installing operating systems in virtual machines. Predefined images can save a lot of work, but I had to find this out the hard way! Even with these predefined images, a nested setup can take a long time to test and re-test, since there are so many possible configurations. After the first series of tests, I was quite disappointed with the obtained results. Due to some setbacks in December and January, I also fell behind on schedule, leading to a hard second semester. It was hard combining this thesis with other courses and with extracurricular responsibilities during this second semester. I am pleased that I got back on track and finished the thesis on time! This would not have been possible without the help of the people around me. I want to thank my girlfriend Anneleen Wislez for supporting me, not only during this year but during the last few years. She also helped me create the figures for this thesis and read the text.

Further, I would like to show my appreciation to my mentors Sam Verboven and Ruben Van den Bossche for always pointing me in the right direction and for their help during this thesis. Additionally, I want to thank my supervisor Prof. Dr. Jan Broeckhove and co-supervisor Dr. Kurt Vanmechelen for giving me the opportunity to write this thesis. A special thank you goes out to all my fellow students, and especially to Kristof Overdulve, for the interesting conversations and the laughter during the past years. Last but not least, I want to thank my parents and sister for supporting me throughout my education: my dad for offering support by buying his new computer early and lending it to me so that I could run a second series of tests on a new processor, and my mom for the excellent care and her interest in what I was doing.

Abstract

Virtualization has become a widespread technology that is used to abstract, combine or divide computing resources, allowing resource requests to be described and fulfilled with minimal dependence on the underlying physical delivery. The x86 architecture was not designed with virtualization in mind and contains certain non-virtualizable instructions. This has resulted in the emergence of several software solutions and has led to the introduction of hardware support. The expanding range of applications means that users increasingly want to use virtualization. Among other things, the need for entire physical setups for research purposes can be avoided by using virtualization. For components that themselves use virtualization, executing a virtual machine inside a virtual machine is necessary; this is called nested virtualization. There has been little related work on nested virtualization, and this thesis elaborates on what is possible with current techniques. We tested the nesting of hypervisors based on the different virtualization techniques. The techniques that were used are dynamic binary translation, paravirtualization and hardware support. For hardware support, a distinction was made between first generation and second generation hardware support. Successful nested setups use a software solution for the inner hypervisor and hardware support for the bottom-layer hypervisor. Only one working nested setup uses software solutions for both hypervisors. Performance benchmarks were conducted to find out whether the performance of working nested setups is reasonable. The performance of the processor, the memory and I/O was tested and compared across the different levels of virtualization. We found that nested virtualization on the x86 architecture works for certain setups, especially with a software solution on top of a hardware supported hypervisor. Setups with hardware support for the inner hypervisor are not yet possible. Nested hardware support is coming soon, but until then the only option is to use a software solution for the inner hypervisor. Results of the performance benchmarks showed that the performance of the nested setups is promising.

CHAPTER 1
Introduction

Within the research surrounding grid and cluster computing there are many developments at different levels that make use of virtualization. Virtualization can be used for all, or a selection of, the components in grid or cluster middleware. Grids and clusters also use virtualization to run separate applications in a sandbox environment. Both developments bring advantages concerning security, fault tolerance, legacy support, isolation, resource control, consolidation, etc. Complete test setups are not available or desirable for many development and research purposes. If certain performance limitations do not pose a problem, virtualization of all components in a system can avoid the need for physical grid or cluster setups. This thesis focusses on the latter: the consolidation of several physical cluster machines by virtualizing them on a single physical machine. The virtualization of cluster machines that use virtualization themselves leads to a combination of the above mentioned levels.

1.1 Goals

The goal of this thesis is to find out whether different levels of virtualization are possible with current virtualization techniques. The research question is whether nested virtualization works on the x86 architecture. In cases where nested virtualization works, we want to find out what the performance degradation is compared to a single level of virtualization or to a native solution. For cases where nested virtualization does not work, we look for the reasons for the failure and for what needs to be changed in order for it to work. The experiments are conducted with some of the most popular virtualization software to find an answer to the posed question.

1.2 Outline

The outline of this thesis is as follows. Chapter 2 contains an introduction to virtualization: a brief history of virtualization is given, followed by a few definitions and a taxonomy of virtualization in general. The chapter ends with the formal requirements for virtualization on a computer architecture and how the x86 architecture compares to these requirements. Chapter 3 describes the evolution of virtualization for the x86 architecture. Virtualization software first used software techniques; at a later stage, processor vendors provided hardware support for virtualization. The last section of the chapter provides an overview of a selection of the most popular virtualization software. Chapter 4 provides a theoretical view of the requirements for nested virtualization on the x86 architecture. For each technique described in chapter 3, a detailed explanation of the theoretical requirements gives more insight into whether nested virtualization can work for the given technique. Chapter 5 investigates the actual nesting of virtual machines using some of the most popular virtualization software solutions. The different virtualization techniques are combined to get an overview of which nested setup works best. Chapter 6 presents performance results for the working nested setups of chapter 5. System benchmarks are executed on each setup and the results are compared. Chapter 7 summarizes the results of this thesis and gives directions for future work.

CHAPTER 2
Virtualization

In recent years virtualization has become a widespread technology that is used to abstract, combine or divide computing resources and to allow resource requests to be described and fulfilled with minimal dependence on the underlying physical delivery. The origins of virtualization can be traced back to the 1960s [1, 2], in research projects that provided concurrent, interactive access to mainframes. Each virtual machine (VM) gave the user the illusion of working directly on a physical machine. By partitioning the system into virtual machines, multiple users could use the system concurrently, each within their own operating system. The projects provided an elegant way to enable time- and resource-sharing on expensive mainframes. Users could execute, develop and test applications within their own virtual machine without interfering with other users. At that time, virtualization was used to reduce the cost of acquiring new hardware and to improve productivity by letting more users work simultaneously. In the late 1970s and early 1980s virtualization became unpopular because of the introduction of cheaper hardware and multiprocessing operating systems. The popular x86 architecture lacked the power to run multiple operating systems at the same time, but since this hardware was so cheap, a dedicated machine was used for each separate application. The use of these dedicated machines led to a decrease in the use of virtualization. The ideas behind virtualization became popular again in the late 1990s with the emergence of a wide variety of operating systems and hardware configurations. Virtualization was used for executing a series of applications, targeted at different hardware or operating systems, on a given machine. Instead of buying dedicated machines and operating systems for each application, the use of virtualization on one machine offers the ability to create virtual machines that are able to run these applications. Virtualization concepts can be used in many areas of computer science. Large variations in the abstraction level and underlying architecture lead to many definitions of virtualization.

In A survey on virtualization technologies, S. Nanda and T. Chiueh define virtualization by the following relaxed definition [1]:

Definition 2.1 Virtualization is a technology that combines or divides computing resources to present one or many operating environments using methodologies like hardware and software partitioning or aggregation, partial or complete machine simulation, emulation, time-sharing, and many others.

The definition mentions the aggregation of resources, but in this context the focus lies on the partitioning of resources. Throughout the rest of this thesis, virtualization provides infrastructure used to abstract lower-level, physical resources and to create multiple independent and isolated virtual machines.

2.1 Applications

The expanding range of computer applications and their varied requirements for hardware and operating systems increase the need for users to start using virtualization. Most people will already have used virtualization without realizing it, because there are many applications where virtualization can be used in some form. This section elaborates on some practical applications of virtualization. S. Nanda and T. Chiueh enumerate some of these applications in A survey on virtualization technologies, but the list is not complete and one can easily think of other applications [1]. A first practical application that benefits from using virtualization is server consolidation [3]. It allows system administrators to consolidate the workloads of multiple under-utilized machines onto a few powerful machines. This saves hardware, management, administration of the infrastructure, space, cooling and power. A second application that also involves consolidation is application consolidation. A legacy application might require faster and newer hardware but might also require a legacy operating system. The need for such legacy applications can be served well by virtualizing the newer hardware. Virtual machines can be used to provide secure, isolated environments for running foreign or less-trusted applications. This form of sandboxing can help build secure computing platforms. Besides sandboxing, virtualization can also be used for debugging purposes. It can help debug complicated software such as operating systems or device drivers by letting the user execute them on an emulated PC with full software controls. Moreover, virtualization can help produce arbitrary test scenarios that are hard to produce in reality, and thus eases the testing of software. Virtualization provides the ability to capture the entire state of a running virtual machine, which creates new management possibilities. Saving the state of a virtual machine, also called a snapshot, offers the user the capability to roll back to the saved state when, for example, a crash occurs in the virtual machine. The saved state can also be used to package an application together with its required operating system; this is often called an appliance. This eases the installation of that application on a new server, lowering the entry barrier for its use.

Another advantage of snapshots is that the user can copy the saved state to other physical servers and use the new instance of the virtual machine without having to install it from scratch. This is useful for migrating virtual machines from one physical server to another when needed. Another practical application is the use of virtualization within distributed network computing systems [4]. Such a system must deal with the complexity of decoupling local administration policies and configuration characteristics of distributed resources from the quality of service expected by end users. Virtualization can simplify or eliminate this complex decoupling because it offers functionality like consolidation of physical resources, security and isolation, flexibility and ease of management. It is not difficult to see that the practical applications given in this section are just a few examples of the many possible uses of virtualization. The number of possible advantages that virtualization can provide continues to rise, making it more and more popular.

2.2 Taxonomy

Virtual machines can be divided into two main categories, namely process virtual machines and system virtual machines. In order to describe the differences, this section starts with an overview of the different implementation layers in a computer system, followed by the characteristics of process virtual machines. Finally, the characteristics of system virtual machines are explained. Most information in this section is drawn from the book Virtual machines: Versatile platforms for systems and processes by J. E. Smith and R. Nair [5].

Figure 2.1: Implementation layers in a computer system.

The complexity of computer systems is tackled by a division into levels of abstraction separated by well-defined interfaces. Implementation details at lower levels are ignored or simplified by introducing levels of abstraction. In both the hardware and the software of a computer system, the levels of abstraction correspond to implementation layers. A typical computer system consists of several implementation layers; figure 2.1 shows the key layers. At the base of the computer system we have the hardware layer, consisting of all the different components of a modern computer. Just above the hardware layer we find the operating system layer, which exploits the hardware resources to provide a set of services to system users [6]. The libraries layer allows application calls to invoke various services available on the system, including those provided by the operating system. At the top, the application layer consists of the applications running on the computer system. Figure 2.1 also shows the three interfaces between the implementation layers: the instruction set architecture (ISA), the application binary interface (ABI) and the application programming interface (API). These interfaces are especially important for virtual machine construction [7]. The division between hardware and software is marked by the instruction set architecture. The ISA consists of two interfaces: the user ISA and the system ISA. The user ISA includes the aspects visible to the libraries and application layers. The system ISA is a superset of the user ISA that also includes the aspects visible to supervisor software, such as the operating system. The application binary interface provides a program or library with access to the hardware resources and services available in the system. This interface consists of the user ISA and a system call interface, which allows application programs to interact with the shared hardware resources indirectly. The ABI allows the operating system to perform operations on behalf of a user program. The application programming interface allows a program to invoke various services available on the system and is usually defined with respect to a high-level language (HLL). An API enables applications written to that API to be ported easily to other systems that support the same API. The interface consists of the user ISA and of HLL library calls.
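To make the distinction between these interfaces concrete, the following minimal C program (a sketch assuming a Linux system) requests the same service, writing bytes to standard output, once through a library call at the API level and once directly through the system call interface that forms part of the ABI.

```c
/* Illustrates the API and ABI boundaries from figure 2.1: the same request
 * ("write bytes to standard output") made through a high-level library call
 * (API) and through the raw system call interface (part of the ABI).
 * Linux is assumed. */
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    /* API level: portable library call; libc chooses the mechanism */
    fputs("via the API (libc)\n", stdout);
    fflush(stdout);

    /* ABI level: invoke the kernel's system call interface directly */
    const char msg[] = "via the ABI (raw system call)\n";
    syscall(SYS_write, STDOUT_FILENO, msg, sizeof(msg) - 1);
    return 0;
}
```

A program compiled against the API remains portable across systems offering the same API; the second call, by contrast, depends on the ABI of this particular operating system and architecture.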

Using the three interfaces, virtual machines can be divided into two main categories: process virtual machines and system virtual machines. A process VM runs a single program, supporting only an individual process. It provides a user application with a virtual ABI or API environment. The process virtual machine is created when the corresponding process is created and terminates when the process terminates. System virtual machines provide a complete system environment in which many processes can coexist. System VMs do this by virtualizing the ISA layer.

2.2.1 Process virtual machines

Process virtual machines virtualize the ABI or API and can run only a single user program. Each virtual machine thus supports a single process, possibly consisting of multiple threads. The most common process VM is an operating system. It supports multiple user processes running simultaneously by time-sharing the limited hardware resources. The operating system provides a replicated process VM for each executing program, so that each program thinks it has its own machine. Program binaries that are compiled for a different instruction set are also supported by process VMs. There are two approaches to emulating the instruction set. Interpretation is a simple but slow approach: an interpreter fetches, decodes and emulates each individual instruction. A more efficient approach is dynamic binary translation, which is explained in section 3.1. Emulation between different instruction sets provides cross-platform compatibility only on a case-by-case basis and requires considerable programming effort. Designing a process-level VM together with an HLL application development environment is an easier way to achieve full cross-platform portability. The HLL virtual machine does not correspond to any real platform but is designed for ease of portability. The Java programming language is a widely used example of an HLL VM.

2.2.2 System virtual machines

System virtual machines provide a complete system environment by virtualizing the ISA layer. They allow a physical hardware system to be shared among multiple, isolated guest operating system environments simultaneously. The layer that provides the hardware virtualization is called the virtual machine monitor (VMM) or hypervisor. It manages the hardware resources so that multiple guest operating system environments and their user programs can execute simultaneously. A first subdivision is based on the supported ISAs of the guest operating systems, i.e. whether virtualization or emulation is used. Virtualization can be further subdivided based on where the hypervisor executes: native or hosted. The following two paragraphs clarify the subdivision according to the supported ISAs.

Emulation: Guest operating systems with an ISA different from the host ISA can be supported through emulation. The hypervisor must emulate both the application and the operating system code by translating each instruction to the ISA of the physical machine. The translation is applied to each instruction, so the hypervisor can easily manage all hardware resources. When emulation is used for guest operating systems with the same ISA as the host, performance is severely lower than with virtualization.

Virtualization: When the ISA of the guest operating system is the same as the host ISA, virtualization can be used to improve performance. It treats non-privileged instructions and privileged instructions differently. A privileged instruction is an instruction that traps when executed in user mode instead of in kernel mode; privileged instructions are discussed in more detail in section 2.3. Non-privileged instructions are executed directly on the hardware without intervention by the hypervisor. Privileged instructions are caught by the hypervisor and translated in order to guarantee correct results. When guest operating systems primarily execute non-privileged instructions, performance is comparable to near native speed.

Thus, when the ISA of the guest and the host are the same, the best performing technique is virtualization. It improves performance in terms of execution speed by running non-privileged instructions directly on the hardware. If the ISA of the guest and the host are different, emulation is the only way to execute the guest operating system.

The subdivision of virtualization based on the location of the hypervisor is clarified in the next two paragraphs.

Native, bare-metal hypervisor: A native, bare-metal hypervisor, also referred to as a Type 1 hypervisor, is the first layer of software installed on a clean system. The hypervisor runs in the most privileged mode, while all the guests run in a less privileged mode. It runs directly on the hardware and executes the intercepted instructions directly on the hardware. According to J. E. Smith and R. Nair, a bare-metal hypervisor is more efficient than a hosted hypervisor in many respects, since it has direct access to hardware resources, enabling greater scalability, robustness and performance [5]. There are some variations of this architecture in which a privileged guest operating system handles the intercepted instructions. The disadvantage of a native, bare-metal hypervisor is that the user must wipe the existing operating system in order to install the hypervisor.

Hosted hypervisor: An alternative to a native, bare-metal hypervisor is the hosted or Type 2 hypervisor. It runs on top of a standard operating system and supports the broadest range of hardware configurations [3]. The installation of the hypervisor is similar to the installation of an application within the host operating system. The hypervisor relies on the host OS for device support and physical resource management. Privileged instructions cannot be executed directly on the hardware but are modified by the hypervisor and passed down to the host OS.

The implementation specifics of Type 1 and Type 2 hypervisors can be separated into several categories: dynamic binary translation, paravirtualization and hardware assisted virtualization. These approaches are discussed in more detail in chapter 3, which elaborates on virtualization within system virtual machines. An overview of the taxonomy of virtual machines is shown in figure 2.2.

Figure 2.2: Taxonomy of virtual machines.

2.3 x86 architecture

The taxonomy given in the previous section provides an overview of the different virtual machines and implementation approaches. This section gives detailed information about the requirements associated with virtualization and the problems that occur when virtualization technologies are implemented on the x86 architecture.

2.3.1 Formal requirements

In order to provide insight into the problems of, and solutions for, virtualization on top of the x86 architecture, the formal requirements for a virtualizable architecture are given first. These requirements describe what is needed in order to use virtualization on a computer architecture. In Formal requirements for virtualizable third generation architectures, G. J. Popek and R. P. Goldberg defined a set of formal requirements for a virtualizable computer architecture [8]. They divided the ISA instructions into several groups. The first group contains the privileged instructions:

Definition 2.2 Privileged instructions are all the ISA instructions that only work in kernel mode and trap when executed in user mode instead of in kernel mode.

Another important group of instructions, with a big influence on the virtualizability of a particular machine, are the sensitive instructions. Before defining sensitive instructions, the notions of behaviour sensitive and control sensitive are explained.

Definition 2.3 An instruction is behaviour sensitive if the effect of its execution depends on the state of the hardware, i.e. on its location in real memory, or on the mode.

Definition 2.4 An instruction is control sensitive if it changes the state of the hardware upon execution, i.e. it attempts to change the amount of resources available or affects the processor mode without going through the memory trap sequence.

With these notions, instructions can be separated into sensitive instructions and innocuous instructions.

Definition 2.5 Sensitive instructions are the instructions that are either control sensitive or behaviour sensitive.

Definition 2.6 Innocuous instructions are the instructions that are not sensitive.

According to Popek and Goldberg, there are three properties of interest when any arbitrary program is executed while the control program (the virtual machine monitor) is resident: efficiency, resource control and equivalence.

The efficiency property: All innocuous instructions are executed by the hardware directly, with no intervention at all on the part of the control program.

The hypervisor should not intervene for instructions that do no harm. These instructions do not change the state of the hardware and should be executed by the hardware directly in order to preserve performance. The more instructions are executed directly, the better the performance of the virtualization will be. This property highlights the contrast between emulation (where every single instruction is analyzed) and virtualization.

The resource control property: It must be impossible for that arbitrary program to affect the system resources, i.e. memory, available to it; the allocator of the control program is to be invoked upon any attempt.

The hypervisor is in full control of the hardware resources. A virtual machine should not be able to access the hardware resources directly; it should go through the hypervisor to ensure correct results and isolation from other virtual machines.

The equivalence property: Any program K executing with a control program resident, with two possible exceptions, performs in a manner indistinguishable from the case when the control program did not exist and K had whatever freedom of access to privileged instructions that the programmer had intended.

A program running on top of a hypervisor should exhibit behaviour identical to the case where the program runs on the hardware directly. As mentioned, there are two exceptions: timing and resource availability problems. The hypervisor will occasionally intervene, and instruction sequences may then take longer to execute. This can invalidate assumptions a program makes about its own execution time. The second exception, the resource availability problem, might occur when the hypervisor does not satisfy a particular request for space. The program may then be unable to function in the same way as if the space had been made available. This problem can easily occur, since the virtual machine monitor itself and other virtual machines take up space as well. A virtual machine environment can be seen as a smaller version of the actual hardware: logically the same, but with a smaller quantity of certain resources. Given the categories of instructions and the properties above, Popek and Goldberg define the hypervisor and a virtualizable architecture as follows:

Definition 2.7 We say that a virtual machine monitor, or hypervisor, is any control program that satisfies the three properties of efficiency, resource control and equivalence. Then, functionally, the environment which any program sees when running with a virtual machine monitor present is called a virtual machine. It is composed of the original real machine and the virtual machine monitor.

Definition 2.8 For any conventional third generation computer, a virtual machine monitor may be constructed, i.e. it is a virtualizable architecture, if the set of sensitive instructions for that computer is a subset of the set of privileged instructions.
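Stated compactly in set notation (a restatement, not Popek and Goldberg's original wording), with S the set of sensitive instructions and P the set of privileged instructions of an architecture:

```latex
% Popek-Goldberg virtualizability condition, restated in set notation
\[
  S \subseteq P
  \;\Longrightarrow\;
  \text{an efficient, resource-controlling and equivalent VMM can be constructed}
\]
```

The next subsections show that the x86 architecture violates precisely this inclusion: it has sensitive instructions that are not privileged.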

2.3.2 The x86 protection level architecture

The x86 architecture recognizes four privilege levels, numbered from 0 to 3 [9]. Figure 2.3 shows how the privilege levels can be interpreted as rings of protection. The center ring, ring 0, is reserved for the most privileged code and is used for the kernel of an operating system. When the processor is running in kernel mode, the code is executing in ring 0. Rings 1 and 2 are less privileged and are used for operating system services. These two rings are rarely used, but some virtualization techniques run guests inside ring 1. The outermost ring is used for applications and has the fewest privileges. The code of applications running in user mode executes in ring 3.

Figure 2.3: The x86 protection levels.

These rings are used to prevent a program operating in a less privileged ring from accessing more privileged system routines. A call gate is used to allow an outer ring to access an inner ring's resources in a predefined manner.

2.3.3 The x86 architecture problem

A computer architecture can support virtualization if it meets the formal requirements described in subsection 2.3.1. The x86 architecture, however, does not meet these requirements. The x86 instruction set architecture contains sensitive instructions that are non-privileged, called non-virtualizable instructions. In other words, these instructions do not trap when executed in user mode, yet they depend on or change the hardware state. This is not desirable, because the hypervisor cannot simulate the effect of the instruction. The current hardware state could belong to another virtual machine, producing an incorrect result for the current virtual machine. The non-virtualizable instructions make virtualization on the x86 architecture more difficult, and virtualization techniques need to deal with these instructions.
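A classic example of such a sensitive-but-unprivileged instruction is SMSW, which exposes the lower bits of the CR0 control register to unprivileged code. The short program below is an illustrative sketch (gcc on x86 Linux is assumed, and it behaves as shown only on processors where the later UMIP feature is absent or disabled): the instruction succeeds in ring 3 instead of trapping, so a trap-and-emulate hypervisor never gets a chance to intercept it.

```c
/* Demonstrates a sensitive but non-privileged x86 instruction.
 * SMSW stores the machine status word (the low bits of CR0) without
 * requiring ring 0, so it leaks privileged hardware state to user mode
 * without causing a trap. On CPUs with UMIP enabled it faults instead. */
#include <stdio.h>

int main(void) {
    unsigned long msw = 0;
    /* executes fine in ring 3; a hypervisor cannot intercept it */
    __asm__ volatile("smsw %0" : "=r"(msw));
    printf("CR0 low bits visible in user mode: 0x%lx\n", msw);
    return 0;
}
```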

Applications will only run at near native speed when they contain a minimal amount of non-virtualizable instructions. Approaches that overcome the limitations of the x86 architecture are discussed in the next chapter.

CHAPTER 3
Evolution of virtualization for the x86 architecture

Developers of virtualization software did not wait for processor vendors to solve the x86 architecture problem. They introduced software solutions like binary translation and, when virtualization became more popular, paravirtualization. Processor vendors then introduced hardware support to solve the design problem of the x86 architecture and, at a later stage, to improve performance. A next generation of hardware support was introduced to improve the performance of memory management. This chapter gives an overview of the evolution towards hardware supported virtualization on x86 architectures. Dynamic binary translation, a software solution that tries to circumvent the design problem of the x86 architecture, is explained in the first section. The second section explains paravirtualization, a software solution that tries to improve on the binary translation concept; it has some advantages and disadvantages compared to dynamic binary translation. The third section gives details on the first generation hardware support and its advantages and disadvantages compared to the software solutions. In many cases the software solutions outperform the hardware support. The next generation hardware support tries to further close the performance gap by eliminating major sources of virtualization overhead. This second generation hardware support focusses on memory management and is discussed in the fourth section. The last section gives an overview of VirtualBox, KVM and Xen, which are virtualization products, and of VMware, a company providing multiple virtualization products.

3.1 Dynamic binary translation

In full virtualization, the guest OS is not aware that it is running inside a virtual machine and requires no modifications [10]. Dynamic binary translation is a technique that implements full virtualization.

It requires no hardware-assisted or operating-system-assisted support, while other techniques, like paravirtualization, need modifications to either the hardware or the operating system. Dynamic binary translation works by translating code from one instruction set to another. The word dynamic indicates that the translation is done on the fly and is interleaved with execution of the generated code [11]. The word binary indicates that the input is binary code and not source code. To improve performance, the translation is mostly done on blocks of code instead of single instructions [12]. A block of code is a sequence of instructions that ends with a jump or branch instruction. A translation cache is used to avoid retranslating code blocks multiple times. In x86 virtualization, dynamic binary translation is not used to translate between different instruction set architectures. Instead, the translation is done from x86 instructions to x86 instructions, which makes the translation much lighter than previous binary translation technologies [13]. Since it is a translation between the same ISA, a copy of the original instructions often suffices. In other words, generally no translation is needed and the code can be executed as is. In particular, whenever the guest OS is executing code in user mode, no translation is carried out and the instructions are executed directly, which is comparable in performance to executing the code natively. Code that the guest OS wants to execute in kernel mode is translated on the fly and saved in the translation cache. Even when the guest OS is running kernel code, most of the time no translation is needed and the code is copied as is. Only in some cases will the hypervisor need to translate instructions of the kernel code to guarantee the integrity of the guest. With software virtualization, the kernel of the guest is executed in ring 1 instead of ring 0. As explained in section 2.3, the x86 instruction set architecture contains sensitive instructions that are non-privileged. If the kernel of the guest operating system wants to execute privileged instructions or one of these non-virtualizable instructions, the dynamic binary translator translates them into a safe equivalent. The safe equivalent will not harm other guests or the hypervisor. For example, if access to the physical hardware is needed, the performed translation ensures that the code uses the virtual hardware instead. In these cases, the translation also ensures that the safe code is less costly than the code with privileged instructions: the code with privileged instructions would trap when running in ring 1, and the hypervisor would have to handle these traps. Dynamic binary translation thus avoids the traps by replacing the privileged instructions, so that there are fewer interrupts and the safe code is less costly. The translation of code into safer equivalents is less costly than letting the privileged instructions trap, but the translation itself should also be taken into account. Luckily, the translation overhead is rather low and decreases over time, since translated pieces of code are cached in order to avoid retranslation in case of loops in the code. Yet, dynamic binary translation has a few cases it cannot fully solve: system calls, I/O, memory management and complex code. The latter is code that, for example, modifies itself or has indirect control flows.
Such code is complex to execute even on an operating system that runs natively. The other cases are described in more detail in the following subsections; first, the sketch below illustrates the basic translate-and-cache loop.
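The toy "guest ISA" in this sketch is entirely invented for illustration: one-byte opcodes, one "sensitive" opcode that is rewritten to a safe stand-in, and one opcode that ends a basic block. Real translators, such as those in VMware or QEMU, operate on actual x86 code and are vastly more involved; only the translate-once, reuse-later idea carries over.

```c
/* Toy sketch of dynamic binary translation with a translation cache.
 * Invented guest ISA: 0x90 is innocuous, 0x0F is "sensitive" and must be
 * rewritten, 0xC3 ends a basic block. */
#include <stdio.h>

#define CACHE_SLOTS 64
#define BLOCK_MAX   32

struct tblock { int guest_pc; unsigned char code[BLOCK_MAX]; int len; int valid; };
static struct tblock cache[CACHE_SLOTS];
static int translations;   /* counts actual translation work */

/* Translate the basic block starting at guest_pc, or reuse a cached copy.
 * Colliding blocks simply overwrite the slot in this toy version. */
static struct tblock *translate(const unsigned char *mem, int guest_pc) {
    struct tblock *tb = &cache[guest_pc % CACHE_SLOTS];
    if (tb->valid && tb->guest_pc == guest_pc)
        return tb;                      /* cache hit: no retranslation */
    tb->guest_pc = guest_pc; tb->len = 0;
    for (int pc = guest_pc; tb->len < BLOCK_MAX; pc++) {
        unsigned char op = mem[pc];
        if (op == 0x0F)                 /* sensitive: emit a safe equivalent */
            tb->code[tb->len++] = 0x90; /* here: a harmless stand-in */
        else
            tb->code[tb->len++] = op;   /* innocuous: copy as-is */
        if (op == 0xC3) break;          /* block ends at the branch */
    }
    tb->valid = 1;
    translations++;
    return tb;
}

int main(void) {
    const unsigned char guest[] = {0x90, 0x0F, 0x90, 0xC3};
    translate(guest, 0);   /* first run: translates and fills the cache */
    translate(guest, 0);   /* second run: served from the cache */
    printf("blocks translated: %d\n", translations);   /* prints 1 */
    return 0;
}
```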

3.1.1 System calls

A system call is a mechanism used by processes to access the services provided by the operating system. It involves a transition to the kernel, where the required function is performed [6, 14]. The kernel of an operating system is also a process, but it differs from other processes in that it has privileged access to processor instructions. The kernel does not execute on its own but only when it receives an interrupt from the processor or a system call from another process running in the operating system. There are many different techniques for implementing system calls. One way is to use a software interrupt and trap, but for x86 a faster technique was chosen [13, 15]. Intel and AMD introduced the instruction pairs SYSENTER/SYSEXIT and SYSCALL/SYSRET for a process to perform a system call. These instructions transfer control to the kernel without the overhead of an interrupt. In software virtualization the kernel of the guest runs inside ring 1 instead of ring 0. This implies that the hypervisor must intercept a SYSENTER (or SYSCALL), translate the code and hand over control to the kernel of the guest. This kernel then executes the translated code and performs a SYSEXIT (or SYSRET) to return control to the process that requested the service of the kernel. Because the kernel of the guest is running inside ring 1, it does not have the privilege to perform the SYSEXIT. This causes an interrupt at the processor, and the hypervisor has to emulate the effect of this instruction. System calls therefore cause a significant amount of overhead when using software virtualization: in a virtual machine, a system call costs about 10 times the cycles needed for a system call on a native machine. In A comparison of software and hardware techniques for x86 virtualization, the authors measured that a system call on a 3.8 GHz Pentium 4 takes 242 cycles [11]. On the same machine, a system call in a virtual machine, virtualized with dynamic binary translation and with the kernel running in ring 1, takes 2308 cycles. In an environment where virtualization is used, there will most likely be more than one virtual machine on a physical machine. In this case, the overhead of the system calls can become a significant part of the virtualization overhead. As we will see later, hardware support for virtualization offers a solution for this.
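To get a feeling for numbers like the cycle counts quoted above, system call cost can be estimated from user space with the processor's time-stamp counter. The sketch below assumes x86 Linux and gcc; results vary with CPU, kernel and virtualization level, and rdtsc itself may be intercepted or skewed inside a guest, so this is illustrative only.

```c
/* Rough estimate of per-system-call cost using the time-stamp counter.
 * syscall(SYS_getpid) forces a real kernel round trip on every iteration
 * (plain getpid() may be cached in user space by the C library). */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    const int iters = 100000;
    uint64_t start = rdtsc();
    for (int i = 0; i < iters; i++)
        syscall(SYS_getpid);            /* minimal round trip into the kernel */
    uint64_t cycles = (rdtsc() - start) / iters;
    printf("~%llu cycles per system call\n", (unsigned long long)cycles);
    return 0;
}
```

Run natively and then inside a guest virtualized with dynamic binary translation, the ratio between the two averages gives a crude picture of the overhead discussed above.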

3.1.2 I/O virtualization

When creating a virtual machine, not only the processor needs to be virtualized but also all the essential hardware, like memory and storage. Each I/O device type has its own characteristics and needs to be controlled in its own special way [5]. There are often a large number of devices for each I/O device type, and this number continues to rise. The strategy consists of constructing a virtual I/O device and then virtualizing the I/O activity that is directed at it. Every access to this virtual hardware must be translated to the real hardware. The hypervisor must intercept all I/O operations issued by the guest operating system, and it must emulate these instructions using software that understands the semantics of the specific I/O port accessed [16]. The I/O devices are emulated because of the ease-of-migration and multiplexing advantages [17]. Migration is easy because the virtual device exists in memory and can easily be transferred. The hypervisor can present a virtual device to each guest while performing the multiplexing. Emulation has the disadvantage of poor performance: the hypervisor must perform a significant amount of work to present the illusion of a virtual device. The great number of physical devices makes the emulation of I/O devices in the hypervisor complex. The hypervisor needs drivers for every physical device in order to be usable on different physical systems. A hosted hypervisor has the advantage that it can reuse the device drivers provided by the host operating system. Another problem is that the virtual I/O device is often a device model that does not match the full power of the underlying physical device [18]. This means that optimizations implemented by specific devices can be lost in the process of emulation.

3.1.3 Memory management

In an operating system, every application has the illusion that it is working with a piece of contiguous memory, whereas in reality the memory used by applications can be dispersed across the physical memory. The application works with virtual addresses that are translated to physical addresses. The operating system manages a set of tables to translate the virtual memory addresses to physical addresses. The x86 architecture provides hardware support for paging. Paging is the process that translates virtual addresses of a process to system physical addresses. The hardware that translates virtual addresses to physical addresses is called the memory management unit (MMU). The page table walker performs address translation using the page tables and uses a hardware page table pointer, the CR3 register, to start the page walk [19]. It traverses several page table entries, each of which points to the next level of the walk. The memory hierarchy may be traversed many times when the page walker performs address translation. To keep this overhead within limits, a translation look-aside buffer (TLB) is used. The most recent translations are saved in this buffer. The processor first checks the TLB to see whether the translation is located in the cache. When the translation is found in the buffer, it is used; otherwise a page walk is performed and the result is saved in the TLB. The operating system and the processor must cooperate to ensure that the TLB stays consistent. Inside a virtual machine, the guest operating system manages its own page tables. The task of the hypervisor is not only to virtualize the memory but also to virtualize the virtual memory, so that the guest operating system can itself use virtual memory [20]. This introduces an extra level of translation, which maps physical addresses of the guest to real physical addresses of the system. The hypervisor must manage the address translation on the processor using software techniques. It derives a shadow version of the page table from the guest page table, which holds the translations of virtual guest addresses to real physical addresses. This shadow page table is used by the processor when the guest is active, and the hypervisor manages this shadow table to keep it synchronized with the guest page table. The guest does not have access to these shadow page tables and can only see its own guest page tables, which run on an emulated MMU. It has the illusion that it can translate virtual addresses to real physical ones; in the background, the hypervisor deals with the real translation using the shadow page tables.

Figure 3.1: Memory management in x86 virtualization using shadow tables.

Figure 3.1 shows the translations needed for translating a virtual guest address into a real physical address. Without the shadow page tables, virtual guest memory (orange area) is translated into physical guest memory (blue area), and the latter is translated into real physical memory (white area). The shadow page tables avoid the double translation by immediately translating virtual guest memory (orange) into real physical memory (white), as shown by the red arrow. In software, several techniques can be used to keep the shadow page tables and guest page tables consistent. These techniques use the page fault exception mechanism of the processor: it throws an exception when a page fault occurs, allowing the hypervisor to update the current shadow page table. This introduces extra page faults due to the shadow paging. The shadow page tables introduce overhead because of the extra page faults and the extra work in keeping the shadow tables up to date. The shadow page tables also consume additional memory. Maintaining shadow page tables for SMP guests introduces further overhead. Each processor in the guest can use the same guest page table instance. The hypervisor could maintain a shadow page table instance for each processor, which results in memory overhead. Another possibility is to share the shadow page table between the virtual processors, leading to synchronization overhead.
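The essence of shadow paging, composing two translations into one, can be modelled in a few lines. The page numbers and flat single-level "tables" below are invented for illustration; a real hypervisor builds multi-level shadow tables lazily from page faults rather than eagerly as done here.

```c
/* Minimal model of shadow paging: composing guest-virtual -> guest-physical
 * and guest-physical -> host-physical into one direct map, as the red arrow
 * in figure 3.1 suggests. */
#include <stdio.h>

#define PAGES 8

int guest_pt[PAGES];   /* guest page table: guest-virtual  -> guest-physical */
int host_map[PAGES];   /* hypervisor map:   guest-physical -> host-physical  */
int shadow_pt[PAGES];  /* shadow table:     guest-virtual  -> host-physical  */

/* Rebuild the shadow entry for one guest-virtual page. A real hypervisor
 * does this on demand from page-fault exceptions, which is exactly the
 * source of the extra page faults described above. */
void sync_shadow(int gva) {
    shadow_pt[gva] = host_map[guest_pt[gva]];
}

int main(void) {
    for (int p = 0; p < PAGES; p++) {
        guest_pt[p] = (p + 3) % PAGES;   /* some guest mapping */
        host_map[p] = (p * 5) % PAGES;   /* where guest pages really live */
        sync_shadow(p);
    }
    printf("gva 2 -> gpa %d -> hpa %d (shadow maps directly to %d)\n",
           guest_pt[2], host_map[guest_pt[2]], shadow_pt[2]);
    return 0;
}
```

Whenever the guest writes to guest_pt, the corresponding shadow entries become stale and must be re-synchronized; keeping this consistency cheap is the hard part of shadow paging.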

3.2 Paravirtualization

Paravirtualization is in many ways comparable to dynamic binary translation. It is also a software technique designed to enable virtualization on the x86 architecture. As explained in Denali: Lightweight Virtual Machines for Distributed and Networked Applications and applied in Denali [21], paravirtualization exposes a virtual architecture to the guest that is slightly different from the physical architecture.

Dynamic binary translation translates critical code into safe code on the fly. Paravirtualization does the same thing but requires changes to the source code of the operating system in advance. Operating systems built for the x86 architecture are by default not compatible with the paravirtualized architecture. This is a major disadvantage for existing operating systems, because extra effort is needed to run these operating systems inside a paravirtualized guest. In the case of Denali, which provides lightweight virtual machines, this allowed the authors to co-design the virtual architecture with the operating system. The advantages of successful paravirtualization are a simpler hypervisor implementation and an improvement in the performance degradation compared to the physical system. Better performance is achieved because many unnecessary traps to the hypervisor are eliminated. The hypervisor provides hypercall interfaces for critical kernel operations such as memory management, interrupt handling and time keeping [10]. The guest operating system is adapted so that it is aware of the virtualization. The kernel is modified to replace non-virtualizable instructions with hypercalls that communicate directly with the hypervisor. The binary translation overhead is completely eliminated, since the modifications are made in the operating system at design time. The implementation of the hypervisor is much simpler because it does not contain a binary translator.

3.2.1 System calls

The overhead of system calls can be reduced somewhat. The dynamic binary translation technique intercepts each SYSENTER/SYSCALL instruction and translates it to hand over control to the kernel of the guest operating system. Afterwards, the guest operating system's kernel executes a SYSEXIT/SYSRET instruction to return to the application; this instruction is again intercepted and translated by the dynamic binary translation. The paravirtualization technique allows guest operating systems to install a handler for system calls, permitting direct calls from an application into its guest OS and avoiding indirection through the hypervisor on every call [22]. This handler is validated before installation and is accessed directly by the processor without indirection via ring 0.
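The following sketch models the paravirtual contract for one privileged operation, switching the page table base. The function names, and the plain function call standing in for the real guest-to-hypervisor transition, are hypothetical; an actual interface such as Xen's transfers control into the hypervisor, which validates the request before applying it.

```c
/* Sketch of the paravirtual idea: the guest kernel is modified at design
 * time to call the hypervisor instead of executing a privileged instruction
 * that would trap. All names here are illustrative, not a real API. */
#include <stdio.h>

/* --- hypervisor side: validates and applies the request --- */
static unsigned long emulated_cr3;

void hypercall_set_cr3(unsigned long gpa) {
    /* a real hypervisor validates gpa before touching hardware state */
    emulated_cr3 = gpa;
}

/* --- guest kernel side --- */
void guest_switch_address_space(unsigned long gpa) {
    /* unmodified kernel:  asm("mov %0, %%cr3" ...)  -> traps in ring 1   */
    /* paravirtual kernel: one deliberate hypercall, nothing to intercept */
    hypercall_set_cr3(gpa);
}

int main(void) {
    guest_switch_address_space(0x1000);
    printf("hypervisor now holds CR3 = 0x%lx\n", emulated_cr3);
    return 0;
}
```

The design gain is that the expensive trap-decode-emulate path of binary translation is replaced by a single, explicit call whose arguments the hypervisor can validate directly, and several such updates can be batched into one transition, as the memory management subsection below describes for Xen.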

In Xen, well known for its use of paravirtualization, the real device drivers reside in a privileged guest known as domain 0; a description of Xen can be found in subsection 3.6.3. Xen is not the only hypervisor that uses paravirtualization for I/O, however. VMware has a paravirtualized I/O device driver, vmxnet, that shares data structures with the hypervisor [10]. A Performance Comparison of Hypervisors states that, by using the paravirtualized vmxnet network driver, network I/O intensive datacenter applications can be run with very acceptable network performance [24].

3.2.3 Memory management

Paravirtual interfaces can be used by both the hypervisor and the guest to reduce hypervisor complexity and the overhead of virtualizing x86 paging [19]. When a paravirtualized memory management unit is used, the guest operating system's page tables are registered directly with the MMU [22]. To reduce the overhead and complexity associated with shadow page tables, the guest operating system is given read-only access to its page tables. A page table update is passed to Xen via a hypercall and validated before being applied. Guest operating systems can locally queue page table updates and apply the entire batch with a single hypercall, which minimizes the number of hypercalls needed for memory management.
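The batching idea can be sketched as follows. The queue and the hypercall name below are illustrative rather than Xen's actual interface; Xen's real mechanism is the mmu_update hypercall, which takes a list of pointer/value pairs.

    /* Hypothetical illustration of batched page table updates. */
    #define BATCH_MAX 64

    struct pt_update {
        unsigned long pte_addr;   /* address of the page table entry */
        unsigned long new_val;    /* value to write into the entry   */
    };

    static struct pt_update queue[BATCH_MAX];
    static int pending;

    extern long hypercall_mmu_update(struct pt_update *u, int count);

    /* One hypervisor transition applies the whole batch. */
    void flush_pte_queue(void)
    {
        if (pending) {
            hypercall_mmu_update(queue, pending);
            pending = 0;
        }
    }

    /* Queue one update locally instead of trapping immediately. */
    void queue_pte_write(unsigned long pte_addr, unsigned long val)
    {
        queue[pending].pte_addr = pte_addr;
        queue[pending].new_val  = val;
        if (++pending == BATCH_MAX)
            flush_pte_queue();
    }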

3.3 First generation hardware support

In the meantime, processor vendors noticed that virtualization was becoming increasingly popular, and they created a solution to the virtualization problem on the x86 architecture by introducing hardware assisted support. Hardware support for processor virtualization enables simple, robust and reliable hypervisor software [25]. It eliminates the need for the hypervisor to listen for, trap and execute certain instructions on behalf of the guest OS [26]. Both Intel and AMD provide these hardware extensions, in the form of Intel VT-x and AMD SVM respectively [11, 27, 28].

The first generation hardware support introduces a data structure for virtualization, together with specific instructions and a new execution flow. In AMD SVM, the data structure is called the virtual machine control block (VMCB). The VMCB combines control state with the guest's processor state; each guest has its own VMCB. It contains a list of which instructions or events in the guest to intercept, various control bits and the guest's processor state. The control bits specify the execution environment of the guest or indicate special actions to be taken before running guest code. The VMCB is accessed by reading and writing its physical address.

The execution environment of the guest is referred to as guest mode; the execution environment of the hypervisor is called host mode. The new VMRUN instruction transfers control from host to guest mode: it saves the current processor state and loads the corresponding guest state from the VMCB. The processor then runs the guest code until an intercepted event occurs. This results in a #VMEXIT, at which point the processor writes the current guest state back to the VMCB and resumes host execution at the instruction following the VMRUN. The processor is then executing the hypervisor again, which can retrieve information from the VMCB to handle the exit. When the effect of the exiting operation has been emulated, the hypervisor can execute VMRUN again to return to guest mode.

Intel has implemented its own version of hardware support. It has many similarities with AMD's implementation, although the terminology is somewhat different. Intel uses a virtual machine control structure (VMCS) instead of a VMCB. A VMCS is manipulated with the new instructions VMCLEAR, VMPTRLD, VMREAD and VMWRITE, which clear, load, read from, and write to a VMCS respectively. The hypervisor runs in VMX root operation and the guest in VMX non-root operation, instead of host and guest mode. Software enters VMX operation by executing the VMXON instruction. From then on, the hypervisor can use a VMEntry to transfer control to one of its guests; there are two instructions for triggering a VMEntry: VMLAUNCH and VMRESUME. As with AMD SVM, the hypervisor regains control through VMExits. Eventually, the hypervisor can leave VMX operation with the VMXOFF instruction.

Figure 3.2: Execution flow using virtualization based on Intel VT-x.

The execution flow of a guest virtualized with hardware support is shown in figure 3.2. The VMXON instruction starts, and VMXOFF stops, VMX operation. The guest is started using a VMEntry, which loads the guest's VMCS into the hardware. The hypervisor regains control through a VMExit when the guest tries to execute a privileged instruction. After intervention by the hypervisor, a VMEntry transfers control back to the guest. In the end, the guest can shut down, and control is handed back to the hypervisor with a VMExit.

The basic idea behind the first generation hardware support is to fix the problem that the x86 architecture cannot be virtualized: the VMExit forces a transition from guest to hypervisor, in line with the philosophy of trapping all exceptions and privileged instructions.
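The flow of figure 3.2 corresponds to a simple run loop in the hypervisor. The sketch below uses hypothetical C wrappers (vmxon, vmlaunch, vmresume, vmcs_read, vmxoff) around the hardware instructions; only the instruction names and the exit-reason field encoding (0x4402) come from Intel's documentation. Real code issues these instructions through inline assembly, sets up the host state in the VMCS so that a VMExit resumes at the right place, and checks each instruction's detailed success and failure conventions.

    /* Hypothetical wrappers around the VT-x instructions. */
    extern int  vmxon(void *vmxon_region);  /* enter VMX root operation */
    extern int  vmlaunch(void);             /* first VMEntry for a VMCS */
    extern int  vmresume(void);             /* subsequent VMEntries     */
    extern long vmcs_read(unsigned long field);
    extern void vmxoff(void);               /* leave VMX root operation */

    extern int  handle_exit(long reason);   /* hypervisor exit handler  */
    #define GUEST_SHUTDOWN 1
    #define EXIT_REASON    0x4402           /* VMCS exit-reason field   */

    void run_guest(void *vmxon_region)
    {
        int launched = 0;

        vmxon(vmxon_region);
        for (;;) {
            /* VMEntry: transfer control to the guest. Execution
               conceptually continues here once a VMExit hands
               control back to the hypervisor (simplified).        */
            if (!launched) { vmlaunch(); launched = 1; }
            else            vmresume();

            long reason = vmcs_read(EXIT_REASON);
            if (handle_exit(reason) == GUEST_SHUTDOWN)
                break;                      /* guest is done        */
        }
        vmxoff();
    }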

Nevertheless, each transition between the hypervisor and a virtual machine requires a fixed number of processor cycles. When the hypervisor has to handle a complex operation, this overhead is relatively low, but for a simple operation the cost of switching from guest to hypervisor and back is relatively high. Creating processes, switching contexts and applying small page table updates are all simple operations that incur a large relative overhead: if, for example, a guest/hypervisor round trip costs on the order of a thousand cycles, it can easily dominate a page table update of a few dozen cycles. In these cases, software solutions such as binary translation and paravirtualization perform better than hardware supported virtualization.

The overhead can be reduced by lowering the number of processor cycles required for a transition between guest and hypervisor. The exact number of extra cycles depends on the processor implementation. For Intel, the format and layout of the VMCS in memory is not architecturally defined, allowing implementation-specific optimizations that improve performance in VMX non-root operation and reduce the latency of a VMEntry and VMExit [29]. Intel and AMD keep improving these latencies in successive processors, as figure 3.3 shows for Intel.

Figure 3.3: Latency reductions by CPU implementation [30].

System calls are an example of complex operations with a relatively low transition overhead. In hardware supported virtualization, system calls do not automatically transfer control from the guest to the hypervisor; a hypervisor intervention is only needed when the system call contains critical instructions. Even then the relative overhead is low, since a system call is itself a complex operation that already requires many processor cycles.

First generation hardware support does not include support for I/O virtualization or memory management unit virtualization. Hypervisors that use the first generation hardware extensions therefore need a software technique for virtualizing the I/O devices and the MMU. For the MMU, this can be done using shadow tables or paravirtualization of the MMU.

3.4 Second generation hardware support

First generation hardware support made the x86 architecture virtualizable, but only in some cases can an improvement in performance be measured [11]. Maintaining the shadow tables can be an intensive task, as was pointed out in subsection 3.1.3. The next step for the processor vendors was to provide hardware MMU support. This second generation hardware support adds memory management support so that the hypervisor no longer has to maintain the integrity of the shadow page table mappings [17].

The shadow page tables remove the need to translate the virtual memory of a process to guest physical memory and then translate the latter into real physical memory, as can be seen in figure 3.1: they immediately translate the virtual memory of the guest process into real physical memory. On the other hand, the hypervisor must do the bookkeeping to keep the shadow page tables up to date whenever the guest OS page tables change. In software solutions like binary translation this bookkeeping introduces overhead, and with first generation hardware support the situation is even worse. The hypervisor must maintain the shadow page tables, and every time a guest changes a translation the hypervisor must intervene. In software solutions this intervention is an extra page fault, but with first generation hardware support it results in a VMExit and VMEntry round trip. As shown in figure 3.3, the latencies of such a round trip are improving, but second generation hardware support removes the need for the round trip altogether.

Intel and AMD each introduced their own hardware MMU support. As with the first generation, this results in two different implementations with similar characteristics: Intel proposed extended page tables (EPT) and AMD proposed nested page tables (NPT). With Intel's EPT, the guest page tables translate from virtual memory to guest physical addresses, while a separate set of page tables, the extended page tables, translates from guest physical addresses to real physical addresses [29]. The guest can modify its page tables without hypervisor intervention, and the extended page tables remove the VMExits associated with page table virtualization. AMD's nested paging also uses additional page tables, the nested page tables (npt), to translate guest physical addresses to real physical addresses [19]. The guest page tables (gpt) map virtual memory addresses to guest physical addresses. The gpt are set up by the guest, the npt by the hypervisor. When nested paging is enabled and a guest references memory through a virtual address, the page walker performs a two-dimensional walk over the gpt and npt to translate the guest virtual address into a real physical address. Like Intel's EPT, nested paging removes the overheads associated with software shadow paging.
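The two-dimensional walk can be sketched as follows. This is illustrative C pseudocode, not the hardware's exact algorithm: it assumes a four-level x86-64 layout, abstracts the nested translation into an npt_translate() helper, and ignores permissions, large pages and fault handling.

    /* Illustrative two-dimensional page walk. */
    typedef unsigned long gpa_t;   /* guest physical address */
    typedef unsigned long hpa_t;   /* real physical address  */

    extern hpa_t         npt_translate(gpa_t gpa);   /* walks the npt */
    extern unsigned long read_phys(hpa_t addr);      /* physical read */

    #define LEVELS     4
    #define ENTRY_MASK 0x1ffUL     /* 9 index bits per level */
    #define PAGE_MASK  0xfffUL

    hpa_t nested_walk(gpa_t gpt_root, unsigned long guest_va)
    {
        gpa_t table = gpt_root;

        for (int level = LEVELS - 1; level >= 0; level--) {
            unsigned idx = (guest_va >> (12 + 9 * level)) & ENTRY_MASK;
            /* Every access to a guest page table itself requires an
               npt walk; this is what makes the walk two dimensional. */
            hpa_t entry_addr = npt_translate(table) + idx * 8;
            table = read_phys(entry_addr) & ~PAGE_MASK;  /* next level */
        }
        /* "table" now holds the guest physical page; one final npt
           translation yields the real physical address.             */
        return npt_translate(table) + (guest_va & PAGE_MASK);
    }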

Another feature introduced by both Intel and AMD in the second generation hardware support is tagged TLBs. Intel uses Virtual-Processor Identifiers (VPIDs), which allow a hypervisor to assign a different identifier to each virtual processor; the zero VPID is reserved for the hypervisor itself. The processor uses the VPIDs to tag translations in the TLB. AMD calls these identifiers Address Space IDs (ASIDs). During a TLB lookup, the VPID or ASID value of the active guest is matched against the ID tag in the TLB entry. In this way, TLB entries belonging to different guests and to the hypervisor can coexist without causing incorrect address translations. Tagged TLBs eliminate the need for TLB flushes on every VMEntry and VMExit, and thereby eliminate the performance impact of those flushes. This is an improvement over the other virtualization techniques, which need to flush the TLB on every switch between a guest and the hypervisor.

The drawback of extended page tables and nested paging is that a TLB miss hits guest performance harder, because it introduces an additional level of address translation. This is mitigated by making the TLBs much larger than before. Previous techniques like shadow page tables translate the virtual guest address directly into the real physical address, eliminating the additional level of address translation.

The second generation hardware support is completely focussed on improving memory management: it eliminates the need for the hypervisor to maintain shadow tables and it eliminates the TLB flushes. EPT and NPT help to improve performance for memory intensive workloads.

3.5 Current and future hardware support

Intel and AMD are still working on support for virtualization. They are improving the latencies of the VMEntry and VMExit instructions, but are also working on new hardware techniques for supporting virtualization on the x86 architecture. The first generation hardware support was aimed primarily at the processor and the second generation at the memory management unit. The final component required next to CPU and memory virtualization is device and I/O virtualization [10]. Recent techniques here are Intel VT-d and AMD IOMMU.

There are three general techniques for I/O virtualization. The first is emulation, described in subsection 3.1.2. The second, explained in subsection 3.2.2, is paravirtualization. The last technique is direct I/O: the device is not virtualized but assigned directly to a guest virtual machine, and the guest's own device drivers are used for the dedicated device. To improve the performance of I/O virtualization, Intel and AMD are looking at letting virtual machines talk to the device hardware directly. With Intel VT-d and AMD IOMMU, hardware support is introduced for assigning I/O devices to virtual machines. In such cases, the ability to multiplex the I/O device is lost. Depending on the device, this need not be an issue; network interfaces, for example, can easily be added to the hardware so that each virtual machine gets its own NIC.

3.6 Virtualization software

There are many different virtualization implementations. This section gives an overview of some well-known virtualization software. Each implementation can be placed in the categories explained throughout the previous sections.

3.6.1 VirtualBox

VirtualBox is a hosted hypervisor that performs full virtualization. It started as proprietary software but currently comes under a Personal Use and Evaluation License (PUEL); the software is free of charge for personal and educational use. VirtualBox was initially created by Innotek and was released as an Open Source Edition in January 2007. The company was later purchased by Sun Microsystems, which in turn was recently purchased by Oracle Corporation. VirtualBox runs on Windows, Linux, Mac OS X and Solaris hosts. In-depth information can be found on the wiki of their site [31], more specifically in the technical documentation [32]. Appendix A.1 presents an overview of VirtualBox, largely based on that technical documentation; a short summary is given in the following paragraph.

VirtualBox started as a pure software solution for virtualization: the hypervisor used dynamic binary translation to work around the virtualization problem of the x86 architecture. With the arrival of hardware support for virtualization, VirtualBox now also supports Intel VT-x and AMD SVM. The host operating system runs each VirtualBox virtual machine as an application, i.e. just another process in the host operating system. A ring 0 driver needs to be loaded in the host OS for VirtualBox to work. It performs only a few tasks: allocating physical memory for the virtual machine, saving and restoring CPU registers and descriptor tables, switching from host ring 3 to guest context, and enabling or disabling hardware support. The guest operating system is manipulated to execute its ring 0 code in ring 1. This could result in poor performance, since a large number of additional instruction faults may be generated. To address these performance issues, VirtualBox uses a Code Scanning and Analysis Manager (CSAM) and a Patch Manager (PATM). The CSAM scans code recursively and, every time a fault occurs, analyzes the fault's cause and determines whether the offending code can be patched to prevent it from causing more expensive faults; the PATM then replaces problematic instructions with a jump to hypervisor memory where a more suitable implementation is placed.

3.6.2 VMware

VMware [33] provides several virtualization products. The company was founded in 1998 and released its first product, VMware Workstation, in May 1999. In 2001 it also entered the server market with VMware GSX Server and VMware ESX Server. Currently, VMware provides a variety of products for datacenter and desktop solutions, together with management products. VMware software runs on Windows and Linux, and since the introduction of VMware Fusion it also runs on Mac OS X. Like VirtualBox, VMware started with a software-only solution for its hypervisors.

In contrast with VirtualBox, VMware does not release the source code of its products. VMware now supports both full virtualization with binary translation and hardware assisted virtualization, and it has a paravirtualized I/O device driver, vmxnet, that shares data structures with the hypervisor [10].

VMware Server is a free product based on the VMware virtualization technology. It is a hosted hypervisor that can be installed on Windows or Linux hosts; a web-based user interface provides a simple way to manage virtual machines. Another free datacenter product is VMware ESXi. It provides the same functionality but uses a native, bare-metal architecture for its hypervisor. VMware ESXi needs a dedicated server but offers better performance. VMware makes these products available at no cost in order to help companies of all sizes experience the benefits of virtualization. The desktop product is VMware Player. It is free for personal non-commercial use and allows users to create and run virtual machines on a Windows or Linux host. It is a hosted hypervisor, as is common practice for desktop products. Users who need developer-centric features can upgrade to VMware Workstation.

3.6.3 Xen

Xen [34] is an open source example of virtualization software that uses paravirtualization. It is a native, bare-metal hypervisor for the x86 architecture and was initially created by the University of Cambridge Computer Laboratory in 2003 [22]. Xen is designed to allow multiple commodity operating systems to share conventional hardware. In 2007, Citrix Systems acquired the source of Xen and intended to license it freely to all vendors and projects that implement the Xen hypervisor. Since 2010, the Xen community maintains and develops Xen. The Xen hypervisor is licensed under the GNU General Public License.

After installing the Xen hypervisor, the user can boot into Xen. When the hypervisor starts, it automatically boots a guest, domain 0, that has special management privileges and direct access to the physical hardware [35]. I/O devices are not emulated; instead, Xen exposes a set of clean and simple device abstractions. There are two ways to run device drivers. In the first, domain 0 is responsible for running the device drivers for the hardware: it runs a backend driver which queues requests from other domains and relays them to the real hardware driver. Each domain communicates with domain 0 through a frontend driver to access the devices; to the applications and the kernel, this driver looks like a normal device. The other possibility is that a driver domain is given responsibility for a particular piece of hardware. It runs the hardware driver and the backend driver for that device class. When the hardware driver fails, only this domain is affected and all other domains survive. Apart from running paravirtualized guests, Xen has also supported Intel VT-x and AMD SVM since its 3.x releases, which allows users to run unmodified guest operating systems in Xen.

3.6.4 KVM

KVM [36], short for Kernel-based Virtual Machine, is a virtualization product that uses hardware support exclusively. Instead of creating major portions of an operating system kernel, as other hypervisors have done, the KVM developers turned the standard Linux kernel into a hypervisor.

By developing KVM as a loadable module, the virtualized environment can benefit from all the ongoing work on the Linux kernel itself, and redundancy is reduced [37]. KVM exposes a device node (/dev/kvm) that communicates with the kernel and acts as the interface for userspace virtual machines. The initial version of KVM was released in November 2006 and it was first included in the Linux kernel in February 2007.

The recommended way of installing KVM is through the packaging system of a Linux distribution. The latest versions of the KVM kernel modules and the supporting userspace can be found on the project's website: the kernel modules in the kvm-kmod releases and the userspace components in the qemu-kvm releases. The latter is the stable branch of KVM, based on QEMU [38] with the KVM extras on top. QEMU is a machine emulator that can run an unmodified target operating system and all its applications in a virtual machine. The plain kvm releases are development releases, but they are outdated.

Every virtual machine is a Linux process, scheduled by the standard Linux scheduler [39]. A normal Linux process has two modes of execution: kernel and user mode. KVM adds a third mode of execution, guest mode; processes that run inside the virtual machine run in guest mode. Hardware virtualization is used to virtualize the processor, memory management is handled by the host kernel, and I/O is handled in user space through QEMU.

In this text, KVM is considered a hosted hypervisor, although there is some discussion 1 about whether KVM is rather a native, bare-metal hypervisor. One side argues that KVM turns Linux into a native, bare-metal hypervisor, because Linux becomes the hypervisor and runs directly on top of the hardware. The other side argues that KVM runs on top of Linux and should be considered a hosted hypervisor. Regardless of what type of hypervisor KVM actually is, this text considers KVM to be a hosted hypervisor.

1 KVM-BareMetal-Hypervisor.aspx
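The /dev/kvm interface is a small ioctl-based API. The following sketch shows the basic sequence for creating and running a virtual CPU; error handling is omitted, and the guest is given a single empty memory slot, so a real user of the API would first load guest code into that memory and set up the vCPU registers.

    /* Minimal use of the /dev/kvm ioctl interface (error checks omitted). */
    #include <fcntl.h>
    #include <linux/kvm.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    int main(void)
    {
        int kvm = open("/dev/kvm", O_RDWR);
        int vm  = ioctl(kvm, KVM_CREATE_VM, 0);

        /* Back 64 KiB of guest physical memory with anonymous pages. */
        void *mem = mmap(NULL, 0x10000, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        struct kvm_userspace_memory_region region = {
            .slot            = 0,
            .guest_phys_addr = 0,
            .memory_size     = 0x10000,
            .userspace_addr  = (uint64_t)mem,
        };
        ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

        int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
        int size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
        struct kvm_run *run = mmap(NULL, size, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, vcpu, 0);

        /* Guest mode: run until the hardware forces an exit back to
           user space (I/O, HLT, ...), then inspect the exit reason. */
        ioctl(vcpu, KVM_RUN, 0);
        return run->exit_reason;   /* e.g. KVM_EXIT_IO or KVM_EXIT_HLT */
    }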

3.6.5 Comparison between virtualization software

A high-level comparison is given in table 3.1. All virtualization products in the table except Xen are installed within a host operating system; Xen is installed directly on the hardware. Most products provide two techniques for virtualization on x86 architectures. Hardware support for virtualization on x86 architectures is supported by all virtualization software in the table.

                            VirtualBox   VMware Workstation   Xen                  KVM
    Hypervisor type         Hosted       Hosted               Native, bare-metal   Hosted
    Dynamic binary transl.  yes          yes                  no                   no
    Paravirtualization      no           no                   yes                  no
    Hardware support        yes          yes                  yes                  yes

Table 3.1: Comparison between a selection of the most popular hypervisors.

CHAPTER 4

Nested virtualization

The focus of this thesis lies with nested virtualization on x86 architectures. Nested virtualization is executing a virtual machine inside a virtual machine; in the case of multiple nesting levels, one can also speak of recursive virtual machines. In 1973 and 1975, initial research was published on the properties of recursive virtual machine architectures [40, 41]. These works refer to the virtualization used in mainframes, which let multiple users work simultaneously on a single mainframe.

Multiple use cases come to mind for nested virtualization. A first use case for nested x86 virtualization is the development of test setups for research purposes. Research in cluster 1 and grid 2 computing requires extensive test setups, which might not be available. The latest developments in grid and cluster computing research make use of virtualization at different levels. Virtualization can be used for all, or some, components of a grid or cluster; it can also be used to run applications within the grid or cluster in a sandbox environment. If certain performance limitations are not an issue, virtualizing all components of such a system can eliminate the need to acquire the entire physical test setup. Because these virtualized components, e.g. Eucalyptus or OpenNebula, might themselves use virtualization for running applications in a sandbox environment, two levels of virtualization are used. Nesting the physical machines of a cluster or grid as virtual machines on one physical machine can offer security, fault tolerance, legacy support, isolation, resource control, consolidation, etc.

1 A cluster is a group of interconnected computers working together as a single, integrated computer resource [42, 43].
2 There is no strict definition of a grid. In [44], Bote-Lorenzo et al. list a number of attempts to create a definition; Ian Foster created a three-point checklist that combines the common properties of a grid [45].

A second possible use case is the creation of a test framework for hypervisors. Just as virtualization allows testing and debugging an operating system by deploying the OS in a virtual machine, nested virtualization allows testing and debugging a hypervisor inside a virtual machine. It eliminates the need for a separate physical machine on which a developer tests and debugs the hypervisor.

Another possible use case is the use of virtual machines inside a server rented from the cloud 5. Such a server is itself virtualized, so that the cloud vendor can make optimal use of its resources. For example, Amazon EC2 offers virtual private servers which are virtual machines running on the Xen hypervisor. Hence, if a user wants to use virtualization software inside such a server, nested x86 virtualization is needed to make that setup work.

As explained in chapter 2, virtualization on the x86 architecture is not straightforward. This has resulted in the emergence of the several techniques described in chapter 3, and these techniques can be combined in many ways to nest virtual machines. A nested setup can use the same technique for both hypervisors, but it can also use a different technique for the first level hypervisor and the nested hypervisor. Hence, if we divide the techniques into three major groups, dynamic binary translation, paravirtualization and hardware support, there are nine possible combinations for nesting a virtual machine inside another virtual machine. The following sections discuss the theoretical possibilities and requirements for each of these combinations; the practical results of nested virtualization on x86 architectures are given in chapter 5.

Figure 4.1: Layers in a nested virtualization setup with hosted hypervisors.

To prevent confusion about which hypervisor or guest is meant, some terms are introduced. In a nested virtualization setup there are two levels of virtualization, as illustrated in figure 4.1.

5 Two widely accepted definitions of the term cloud can be found in [46] and [47].

The first level, referred to as L1, is the layer of virtualization that is also used in a non-nested setup; it is the virtualization layer closest to the hardware. The terms L1 and bottom layer indicate this first level of virtualization, e.g. the L1 hypervisor is the hypervisor used at the first level. The second level, referred to as L2, is the new layer of virtualization introduced by the nesting. Hence, the terms L2, nested and inner indicate the second level of virtualization, e.g. the L2 hypervisor is the hypervisor installed inside the L1 guest.

4.1 Dynamic binary translation

This section focusses on L1 hypervisors that use dynamic binary translation for nested virtualization on x86 architectures. Such a hypervisor can run in a host operating system or directly on the hardware; it can be VirtualBox (see subsection 3.6.1), a VMware product (see subsection 3.6.2) or any other hypervisor using dynamic binary translation. The nested hypervisor can be any hypervisor, resulting in three major combinations, each with a nested hypervisor that uses a different virtualization technique. The nested hypervisor is installed in a guest virtualized by the L1 hypervisor. The first combination again uses a hypervisor based on dynamic binary translation; in the second combination a hypervisor using paravirtualization is installed in the guest; the last combination uses a nested hypervisor based on hardware support.

It should be theoretically possible to nest virtual machines using dynamic binary translation at L1. When using dynamic binary translation, no modifications are needed to the hardware or to the operating system, as pointed out in section 3.1. Code meant to run in ring 0 actually runs in ring 1, but the guest is not aware of this.

Dynamic binary translation: The first combination nests a L2 hypervisor inside a guest virtualized by a L1 hypervisor where both hypervisors are based on dynamic binary translation. The L2 hypervisor runs in guest ring 0. Since the hypervisor is not aware that its code actually runs in ring 1, it should be possible to run a hypervisor in this guest. The nested hypervisor has to take care of the memory management in the L2 guest: it maintains the shadow page tables for its guests, see subsection 3.1.3. The hypervisor uses these shadow page tables to translate L2 virtual memory addresses to what it believes to be real memory addresses. In reality, these translated addresses lie in the virtual memory range of the L1 guest and are converted to real memory addresses by the shadow page tables maintained by the L1 hypervisor. The memory architecture in a nested setup is illustrated in figure 4.2. For a L1 guest there are two levels of address translation, as shown in figure 3.1; a nested guest has three levels of address translation, resulting in the need for shadow tables in the L2 hypervisor.

Figure 4.2: Memory architecture in a nested situation.
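The three translation levels compose as sketched below. The functions are hypothetical; in practice each hypervisor collapses one of the first two steps into its shadow page tables, so the hardware never walks three separate tables.

    /* Hypothetical composition of the three address translation
       levels in a nested setup based on dynamic binary translation. */
    typedef unsigned long addr_t;

    extern addr_t l2_guest_pt(addr_t va);  /* L2 virtual  -> L2 "physical"
                                              (maintained by the L2 guest OS)   */
    extern addr_t l1_guest_pt(addr_t va);  /* L2 "physical", which lies in L1
                                              guest virtual memory, -> L1
                                              "physical" (L1 guest OS tables)   */
    extern addr_t l1_machine(addr_t pa);   /* L1 "physical" -> real machine
                                              address (L1 hypervisor)           */

    addr_t l2_va_to_machine(addr_t va)
    {
        addr_t l1_va = l2_guest_pt(va);    /* collapsed into L2 shadow tables */
        addr_t l1_pa = l1_guest_pt(l1_va); /* collapsed into L1 shadow tables */
        return l1_machine(l1_pa);
    }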

Paravirtualization: The second combination uses paravirtualization as the technique for the L2 hypervisor. This situation is the same as with dynamic binary translation as the L2 hypervisor: the hypervisor using paravirtualization runs in guest ring 0 and is not aware that it actually runs in ring 1. This should make it possible to nest a L2 hypervisor based on paravirtualization within a guest virtualized by a L1 hypervisor using dynamic binary translation.

Hardware supported virtualization: The virtualized processor available to the L1 guest is based on the x86 architecture, so that current operating systems can work in the virtualized environment. But are the extensions for virtualization on x86 architectures (see sections 3.3 and 3.4) also included? To use a L2 hypervisor based on hardware support within the L1 guest, the L1 hypervisor would have to virtualize or emulate the virtualization extensions of the processor. A virtualization product based on hardware support needs these extensions; if they are not available, the hypervisor cannot be installed or activated. If the L1 hypervisor provides these extensions, chances are that it requires a physical processor with the same extensions. It might be possible for a hypervisor based on dynamic binary translation to provide the extensions without a processor that supports hardware virtualization, but since all current processors have these extensions, it is very unlikely that developers will implement functionality that exposes hardware support to the guest on a processor without hardware support for x86 virtualization.

Memory management in a L2 guest based on hardware support is not possible, because the second generation hardware support only provides two levels of address translation. The L1 hypervisor would have to provide the EPT or NPT functionality to the guest together with the first generation hardware support, and it would have to use a software technique for the implementation of the MMU.
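Whether a guest sees these extensions can be checked with CPUID. The snippet below tests the VMX feature bit documented by Intel in CPUID leaf 1 (ECX bit 5) and the SVM feature bit documented by AMD in extended leaf 0x80000001 (ECX bit 2); inside a L1 guest of the hypervisors discussed here, both checks would come back negative.

    /* Check for hardware virtualization extensions with CPUID (GCC). */
    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned eax, ebx, ecx, edx;

        __get_cpuid(1, &eax, &ebx, &ecx, &edx);
        int vmx = (ecx >> 5) & 1;            /* Intel VT-x feature bit */

        __get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
        int svm = (ecx >> 2) & 1;            /* AMD SVM feature bit    */

        printf("VT-x: %s, SVM: %s\n", vmx ? "yes" : "no",
                                      svm ? "yes" : "no");
        return 0;
    }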

4.2 Paravirtualization

The situation for nested virtualization is quite different when paravirtualization is used for the bottom layer hypervisor. The most popular example of a hypervisor based on paravirtualization is Xen (see subsection 3.6.3). There are again three combinations: the nested hypervisor can be based on paravirtualization, like the bottom layer hypervisor, it can be based on dynamic binary translation, or it can be based on hardware support. The main difference with the previous section is that the L1 guest is aware of the virtualization.

Dynamic binary translation and paravirtualization: The paravirtualized guest is aware of the virtualization and must use the hypercalls provided by the hypervisor. The guest's operating system has to be modified to use these hypercalls, so all code in the guest that runs in kernel mode needs these modifications in order to work in the paravirtualized guest. This has major consequences for a nested virtualization setup: a nested hypervisor can only work in a paravirtualized environment if it is modified to work with these hypercalls. For a native, bare-metal hypervisor this means that all ring 0 code must be adapted; for a hosted hypervisor it means that the module loaded into the kernel of the host operating system must be modified to work in the paravirtualized environment. Hence, companies that develop virtualization products would need to actively make their hypervisors compatible with running inside a paravirtualized guest.

Memory management of the L2 guests is done by the nested hypervisor. The page tables of the L1 guests are registered directly with the MMU, so the nested hypervisor can use the hypercalls to register its page tables with the MMU. A nested hypervisor based on paravirtualization might allow a L2 guest to register its page tables directly with the MMU, while a nested hypervisor based on dynamic binary translation will maintain shadow tables.

Hardware supported virtualization: Hardware support for x86 virtualization is an exceptional case for paravirtualization as well. The L1 hypervisor would have to provide the extensions for hardware support to its guests, probably by means of hypercalls; modified hypervisors based on hardware support could then use the hardware extensions. Second generation hardware support can likewise only be used if it is provided by the L1 hypervisor, together with the first generation hardware support.

In conclusion, nested virtualization with paravirtualization as the bottom layer needs modifications to the nested hypervisor, whereas nested virtualization with dynamic binary translation as the bottom layer does not. On the other hand, the guests know that they are virtualized, which might influence the performance of the L2 guests in a positive way. The nested virtualization will not work unless support is actively introduced, and the likelihood that virtualization software developers are willing to incorporate these modifications in their hypervisors is low, since the benefits do not outweigh the implementation cost.

4.3 Hardware supported virtualization

The last setup uses a hypervisor based on hardware support for x86 virtualization as the bottom layer. This configuration requires a processor with the hardware virtualization extensions. KVM (see subsection 3.6.4) is a popular example of such a hypervisor, but the latest versions of VMware, VirtualBox and Xen can also use hardware support. As with the previous configurations, there are three combinations: a nested hypervisor based on the same technique as the L1 hypervisor, one based on dynamic binary translation, and one based on paravirtualization.

Dynamic binary translation and paravirtualization: These combinations are similar to those with a hypervisor based on dynamic binary translation as the bottom layer. A guest and its operating system need no modifications, so it should in theory be possible to nest virtual machines in a setup where the bottom layer hypervisor is based on hardware support. The nested hypervisor thinks its code is running in ring 0, but it is actually running in the guest mode of the processor, as the result of a VMRUN or VMEntry instruction.

The memory management depends on whether the processor supports the second generation hardware support. If it does not, the L1 hypervisor uses a software technique for virtualizing the MMU; memory management is then the same as with dynamic binary translation, where both the L1 and the L2 hypervisor maintain shadow tables. If the processor does support the hardware MMU, the L1 hypervisor does not need to maintain these shadow tables, which can improve performance.

Hardware supported virtualization: As in the other configurations, hardware support for nested hypervisors is a special case. The virtualized processor provided to the L1 guest is based on the x86 processor, but it needs to contain the hardware extensions for virtualization if the nested hypervisor uses hardware support. If the L1 hypervisor does not provide these hardware extensions to its guests, only the combinations with a nested hypervisor using dynamic binary translation or paravirtualization can work. The developers of KVM and Xen are researching and working on exposing the x86 hardware virtualization extensions to their guests; more details are given in section 5.4.

Hardware support for EPT or NPT (see section 3.4) in the guest, which can also be referred to as nested EPT or nested NPT, deserves special attention according to Avi Kivity [48], a lead developer and maintainer of KVM who has posted some interesting information about nested virtualization on his blog. Nested EPT or nested NPT can be critical for obtaining reasonable performance: the guest hypervisor needs to trap and service context switches and writes to guest page tables, and a single trap in the guest hypervisor is multiplied by quite a large factor into KVM traps. Since the hardware only supports two levels of address translation, nested EPT or NPT has to be implemented in software.

CHAPTER 5

Nested virtualization in Practice

The previous chapter gave some insight into the theoretical requirements of nested x86 virtualization. The division into three categories resulted in nine combinations. This chapter presents how nested x86 virtualization behaves in practice. Each of the nine combinations is tested, and performance tests are executed on the working combinations; the results of these tests are discussed in the following chapter. The combinations that fail to run are analyzed in order to find the reason for the failure.

A selection of currently popular virtualization products is tested: VirtualBox, VMware Workstation, Xen and KVM, as discussed in section 3.6. Table 3.1 summarizes these hypervisors and the virtualization techniques they support. Counting the products with multiple techniques as distinct hypervisors, there are seven different hypervisors, and each of these seven can be nested within each of the seven. Nesting these hypervisors thus results in 49 different setups, which are described in the following sections. Details of the tests are given in appendix B, which lists the configuration used for each setup together with version information of the hypervisors and the result of the setup.

The subsection in which each nested setup is discussed is summarized in table 5.1. The columns of the table represent the L1 hypervisors and the rows represent the L2 hypervisors, i.e. the hypervisor represented by the row is nested inside the hypervisor represented by the column. For example, information about the setup where VirtualBox based on dynamic binary translation is nested inside Xen using paravirtualization can be found in subsection 5.1.2. The table cells for setups with a L1 hypervisor based on hardware support are split in two: the upper cell refers to the nested setup tested on a processor with first generation hardware support, the bottom cell to the setup tested on a processor with second generation hardware support.

    L2 hypervisor     L1 based on DBT        L1 based on PV   L1 based on HV (all four;
                      (VirtualBox, VMware)   (Xen)            1st gen. / 2nd gen.)
    VirtualBox DBT    5.1.1                  5.1.2            5.2.1 / 5.3.1
    VMware DBT        5.1.1                  5.1.2            5.2.1 / 5.3.1
    Xen PV            5.1.1                  5.1.2            5.2.2 / 5.3.2
    VirtualBox HV     5.1.1                  5.1.2            5.2.3 / 5.3.3
    VMware HV         5.1.1                  5.1.2            5.2.3 / 5.3.3
    Xen HV            5.1.1                  5.1.2            5.2.3 / 5.3.3
    KVM HV            5.1.1                  5.1.2            5.2.3 / 5.3.3

Table 5.1: Index table indicating the subsection in which information about each nested setup can be found.

5.1 Software solutions

5.1.1 Dynamic binary translation

This subsection gives the results of actually nesting virtual machines inside a L1 hypervisor based on dynamic binary translation, as discussed in section 4.1. The nested hypervisors should not need modifications; only the nested hypervisors based on hardware support need a virtual processor that contains the hardware extensions. The L1 hypervisors are VirtualBox and VMware Workstation, both using dynamic binary translation to virtualize their guests. Since two L1 hypervisors are tested, this subsection describes 14 setups, categorized in the following paragraphs by the technique of the L2 hypervisor: first the setups that use dynamic binary translation on top of dynamic binary translation, then the setups with paravirtualization as the L2 technique, then the setups with hardware support as the L2 technique, and finally an overview.

Dynamic binary translation: Every setup that used dynamic binary translation for both the L1 and the L2 hypervisor failed; the setups either hung or crashed when starting the inner guest. In the two setups where VMware Workstation was nested inside VMware Workstation and VirtualBox inside VirtualBox, the L2 guest became unresponsive when started. After a few hours the nested guests were still trying to start, so these setups can be marked as failures. In both setups the L1 and L2 hypervisors were the same, so the developers know which instructions and functionality are used by the nested hypervisor and may have foreseen this situation. Nevertheless, the double layer of dynamic binary translation appears to be inoperative, or too slow for a working nested setup, when the same hypervisor is used at both levels.

The other two setups, where VMware Workstation is nested in VirtualBox and VirtualBox is nested in VMware Workstation, resulted in a crash. In the former, the L1 VirtualBox guest crashed, which indicates that the L2 guest tried to use functionality that is not fully supported by VirtualBox; this can be functionality that was left out to improve performance, or simply a bug. In the latter, with VMware Workstation as the L1 hypervisor and VirtualBox as the L2, the VirtualBox guest noticed that some conditions were not met and crashed with an assertion failure, while the VMware Workstation guest stayed operational. In both setups it seems that the L2 guest does not see a fully virtualized environment, and one of the guests, in particular VirtualBox, reports a crash. More information about the reported crash is given in section B.1. A possible reason why VirtualBox reports the crash in both cases is that VirtualBox is open source and can expose more information to its users.

Paravirtualization: Of the two setups that use paravirtualization on top of dynamic binary translation, one worked and the other crashed. Figure 5.1 shows the layers of these setups, where the L1 guest and the L2 hypervisor are represented by the same layer.

Figure 5.1: Layers for nested paravirtualization in dynamic binary translation.

The setup with VMware Workstation as the L1 hypervisor allowed a Xen guest to boot successfully. In the other setup, using VirtualBox, the L1 guest crashed and reported a message similar to the setup with VMware Workstation inside VirtualBox (see section B.1). This result, one setup that works and one that does not, gives some insight into the implementations of VMware Workstation and VirtualBox. The latter contains one or more bugs that make the L1 guest crash when a nested hypervisor starts a guest. The functionality could have been left out deliberately, because such a situation might not be very common; leaving out these exceptional situations allows developers to focus on more important virtualization functionality. VMware Workstation, on the other hand, does provide the functionality and can be considered more mature for nested virtualization with dynamic binary translation as the L1 technique.

Hardware supported virtualization: VirtualBox and VMware Workstation do not provide the x86 virtualization extensions of the processor to their guests. This means that no hardware support is available in the guests, neither for the processor nor for the memory management. Since four of the hypervisors are based on hardware support, there are eight setups that contain such a hypervisor, and the lack of hardware support causes all eight to fail. Implementing the hardware support in the L1 hypervisor in software, without underlying support from the processor, could result in bad performance. However, if performance is not an issue, such a setup could be useful to simulate a processor with hardware support on a processor without it.

Only one of the 14 setups with dynamic binary translation as the L1 technique worked: the Xen hypervisor using paravirtualization inside VMware Workstation. The other setups hung or crashed, with VirtualBox reporting the most crashes; VirtualBox seems to contain some bugs that VMware Workstation does not have, resulting in crashes in the guest virtualized by VirtualBox. Hardware support for virtualization is not present in a L1 guest under VMware Workstation or VirtualBox, which eliminates the eight setups with a nested hypervisor that needs the hardware extensions. Table 5.2 summarizes the setups described in this subsection; the columns represent the L1 hypervisors and the rows the L2 hypervisors.

                      VirtualBox DBT   VMware DBT
    VirtualBox DBT    fails            fails
    VMware DBT        fails            fails
    Xen PV            fails            works
    VirtualBox HV     fails            fails
    VMware HV         fails            fails
    Xen HV            fails            fails
    KVM HV            fails            fails

Table 5.2: The nesting setups with dynamic binary translation as the L1 hypervisor technique. DBT stands for dynamic binary translation, PV for paravirtualization and HV for hardware virtualization.

5.1.2 Paravirtualization

The previous subsection described the setups that use dynamic binary translation as the L1 hypervisor; the following paragraphs elaborate on the use of a L1 hypervisor based on paravirtualization. In section 4.2 we concluded that nested virtualization with paravirtualization as the bottom layer needs modifications to the nested hypervisor. The L1 hypervisor used for the tests is Xen. In all the nested setups, the L2 hypervisor would have to be modified to use the paravirtual interfaces offered by Xen instead of executing ring 0 code. The following paragraphs discuss the problems for each hypervisor technique, together with what the setup would look like if the nested virtualization worked; the last paragraph gives a summary.

Paravirtualization: The paravirtualized guest does not allow a Xen hypervisor to start within the guest. The kernel loaded in the paravirtualized guest is a kernel adapted for paravirtualization; the Xen hypervisor is not adapted to use the provided interface, and the paravirtualized guest removes the other kernels from the bootloader. The complete setup, see figure 5.2, would consist of Xen as the L1 hypervisor, which automatically starts its domain 0, a privileged L1 guest. Another domain would run the nested hypervisor, which in turn would run its own automatically started domain 0 and a nested virtual machine.

Dynamic binary translation: The hypervisors of VMware Workstation and VirtualBox based on dynamic binary translation could not be loaded in the paravirtualized guest, because their ring 0 code is not adapted for paravirtualization. In practice this expresses itself as the inability to compile the driver or module that needs to be loaded: it should be compiled against the kernel headers, but compilation fails because the version of the adapted kernel and its headers is not recognized.

Figure 5.2: Layers for nested Xen paravirtualization.

The setup with dynamic binary translation as the technique for the nested hypervisor (see figure 5.3) differs from the previous setup (figure 5.2) in that the L2 hypervisor sits on top of a guest operating system. Xen is a native, bare-metal hypervisor, which runs directly on the hardware, in this case the virtual hardware; VMware Workstation and VirtualBox are hosted hypervisors and need an operating system between the hypervisor and the virtual hardware.

Figure 5.3: Layers for nested dynamic binary translation in paravirtualization.

Hardware supported virtualization: The remaining four setups, with a nested hypervisor based on hardware support, suffer from the same problem: none of the hypervisors are modified to run in a paravirtualized environment. In addition, the virtualization extensions are not provided in the paravirtualized guest, and even if the hypervisors were adapted for paravirtualization, they would still need these extensions. These setups look like figure 5.2 or figure 5.3, depending on whether the nested hypervisor is hosted or native, bare-metal.

None of the seven setups with paravirtualization as the bottom layer worked. The results are shown in table 5.3, in which the column with the header Xen represents the L1 hypervisor. The main problem is the required adaptation of the hypervisors.

                      Xen PV
    VirtualBox DBT    fails
    VMware DBT        fails
    Xen PV            fails
    VirtualBox HV     fails
    VMware HV         fails
    Xen HV            fails
    KVM HV            fails

Table 5.3: The nesting setups with paravirtualization as the L1 hypervisor technique.

Unless these hypervisors are modified, paravirtualization is not a good choice as the L1 hypervisor technique. Nesting will always depend on the adaptation of a particular hypervisor, and only that hypervisor could then be used. When using paravirtualization, the best one can do is hope that developers adapt their hypervisors, or modify a hypervisor oneself.

5.1.3 Overview software solutions

The previous subsections explained the results of nested virtualization with a software solution as the bottom layer hypervisor. This subsection gives an overview of all the setups described above, gathered in table 5.4. The columns of the table represent the setups belonging to the same L1 hypervisor and the rows indicate the nested hypervisor, i.e. the hypervisor represented by the row is nested inside the hypervisor represented by the column.

Nested x86 virtualization using a L1 hypervisor based on a software solution is not successful. Of the 21 setups that were tested, only one allows a L2 guest to boot successfully: nesting Xen inside VMware Workstation. Note that 12 setups are unsuccessful simply because hardware support for x86 virtualization is not available in the L1 guest.

5.2 First generation hardware support

This section describes the setups with a bottom layer hypervisor based on hardware support. The theoretical possibilities and requirements for these setups were discussed in section 4.3; the conclusion was that it should be possible to nest virtual machines without modifying the guest operating systems, given that the physical processor provides the hardware extensions for x86 virtualization.

                      VirtualBox DBT   VMware DBT   Xen PV
                      (5.1.1)          (5.1.1)      (5.1.2)
    VirtualBox DBT    fails            fails        fails
    VMware DBT        fails            fails        fails
    Xen PV            fails            works        fails
    VirtualBox HV     fails            fails        fails
    VMware HV         fails            fails        fails
    Xen HV            fails            fails        fails
    KVM HV            fails            fails        fails

Table 5.4: Overview of the nesting setups with a software solution as the L1 hypervisor technique.

In chapter 3, the hardware support for x86 virtualization was divided into first generation and second generation hardware support, where the second generation adds a hardware supported memory management unit so that the hypervisor does not need to maintain shadow tables. The original research was done on a processor 1 that did not have second generation hardware support; detailed information about the hypervisor versions is listed in section B.3. To compare first and second generation hardware support for x86 virtualization, the setups were also tested on a newer processor 2 that does provide the hardware supported MMU. The results of the tests on the newer processor are given in section 5.3.

The tested L1 hypervisors using the hardware extensions for virtualization are VirtualBox, VMware Workstation, Xen and KVM. The seven hypervisors (see table 3.1) were nested within these four hypervisors, resulting in 28 setups. In the first subsection the nested hypervisor is based on dynamic binary translation; the second subsection describes the setups with Xen paravirtualization as the L2 hypervisor; the last subsection handles the setups with a nested hypervisor based on hardware support for x86 virtualization.

1 Setups with a L1 hypervisor based on first generation hardware support for x86 virtualization were tested on an Intel® Core™2 Quad Q9550 processor.
2 Setups with a L1 hypervisor based on second generation hardware support for x86 virtualization were tested on an Intel® Core™ i7-860 processor.

5.2.1 Dynamic binary translation

With dynamic binary translation as the nested hypervisor technique, there are eight setups, three of which are able to successfully boot and run a nested virtual machine. The layout of these setups can be seen in figure 5.4, where the L1 hypervisor is based on hardware support and the L2 hypervisor on dynamic binary translation. When Xen is used as the L1 hypervisor, the host OS layer is left out and a domain 0 is started next to VM1, which still uses hardware support for its virtualization.

Figure 5.4: Layers for nested dynamic binary translation in a hypervisor based on hardware support.

VirtualBox: With VirtualBox based on hardware support as the bottom layer hypervisor, none of the setups worked. Nesting VirtualBox inside VirtualBox resulted in the L2 guest becoming unresponsive, just as when VirtualBox was nested inside VirtualBox using dynamic binary translation at both levels. When nesting a VMware Workstation guest inside VirtualBox, the configuration was very unstable: almost every minor change resulted in a setup that refused to start the L2 guest. There was one working configuration, which is listed in section B.3.

VMware Workstation: If the L1 hypervisor in figure 5.4 is VMware Workstation, the setups were successful in nesting virtual machines. Both VirtualBox and VMware Workstation as nested hypervisors based on dynamic binary translation were able to start the L2 guest, which booted and ran correctly.

Xen: VMware Workstation 3 checks whether there is an underlying hypervisor running; it noticed that Xen was running and refused to start a nested guest, preventing a L2 VMware guest from starting within a Xen guest. In the other setup, with VirtualBox as the inner hypervisor, the L2 guest again became unresponsive after starting.

3 In version VMware Workstation build and newer

There was no crash, error message or warning, which might indicate that the L2 guest was booting at a very slow pace.

KVM: The third and last working setup for nesting a hypervisor based on dynamic binary translation inside one based on hardware support is nesting VMware Workstation inside KVM. In newer versions of VMware Workstation 4, the check for an underlying hypervisor noticed that KVM was running and refused to boot a nested guest. The setup with VirtualBox as the nested hypervisor crashed while booting: the L2 guest showed an error indicating a kernel panic because it could not synchronize, and became unresponsive after displaying the message.

                      VirtualBox HV   VMware HV   Xen HV   KVM HV
    VirtualBox DBT    fails           works       fails    fails
    VMware DBT        unstable        works       fails    works

Table 5.5: The nesting setups with first generation hardware support as the L1 hypervisor technique and DBT as the L2 hypervisor technique.

Table 5.5 summarizes the eight setups discussed in this subsection. VMware Workstation is the best option: it allows other hypervisors based on dynamic binary translation to be nested inside it, and it is also the most likely to work when used as the nested hypervisor based on dynamic binary translation. In comparison to nesting inside a software solution, VirtualBox is able to nest within VMware Workstation when hardware support is used for the L1 hypervisor. VirtualBox is still not able to nest within KVM, Xen or itself, while VMware Workstation is able to nest within KVM and itself. It is regrettable that VMware Workstation checks for an underlying hypervisor other than VMware itself, preventing the use of VMware Workstation within other hypervisors.

4 In version VMware Workstation build and newer

5.2.2 Paravirtualization

This subsection discusses the setups that nest a paravirtualized guest inside a guest virtualized using hardware support. Figure 5.5 shows the layers in these setups. The main differences with the setups in the previous subsection are that the L1 guest and the L2 hypervisor are represented by the same layer, and that Xen automatically starts a domain 0. Only four setups are tested here, since only Xen is nested within the four hypervisors based on hardware support.

Figure 5.5: Layers for nested paravirtualization in a hypervisor based on hardware support.

All four setups could successfully nest a paravirtualized guest inside the L1 guest. However, the setup where Xen is nested inside VirtualBox was not very stable: sometimes several segmentation faults occurred during the start-up of the privileged domain. Domain 0 was able to boot and run successfully, but the creation of another paravirtualized guest was sometimes impossible.

                VirtualBox (HV)   VMware (HV)  Xen (HV)  KVM (HV)
    Xen (PV)    works (unstable)  works        works     works

Table 5.6: The nesting setups with first generation hardware support as the L1 hypervisor technique and PV as the L2 hypervisor technique.

An overview of the four setups is shown in table 5.6. Clearly, paravirtualization can be recommended as the technique for the nested hypervisor. The only setup that does not completely work is the one with VirtualBox; since the other three setups work, and since previous conclusions were also not in favor of VirtualBox, VirtualBox is probably the cause of the instability.

5.2.3 Hardware supported virtualization

The remaining setups, which attempt to nest a hypervisor based on hardware supported virtualization, are discussed in this subsection. Nesting the four hypervisors based on hardware support within each other results in 16 setups. The layout of the setups is equal to figure 5.4 or figure 5.5, depending on which hypervisors are used. None of the hypervisors provide the x86 virtualization processor extensions to their guests, so none of the setups work. Developers of both KVM and Xen are working on support for nested hardware support; detailed information can be found in section 5.4. KVM has already released initial patches for nested hardware support on AMD processors and is working on patches for nested support on Intel processors. Xen is likewise researching the ability to offer nested hardware support.
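Whether an L1 guest could host a hardware-assisted L2 hypervisor can be verified from inside the guest, since the extensions announce themselves through CPUID: VMX (Intel VT-x) as bit 5 of ECX in leaf 1, and SVM (AMD-V) as bit 2 of ECX in leaf 0x80000001. A minimal C sketch of such a check:

    #include <stdio.h>
    #include <cpuid.h>  /* GCC helper: __get_cpuid() */

    /* Run inside an L1 guest: if neither bit is set, the L1 hypervisor
       does not expose the virtualization extensions, and no hardware-
       assisted L2 hypervisor can run there. */
    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 5)))
            printf("VMX (Intel VT-x) visible in this guest\n");
        else if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) &&
                 (ecx & (1u << 2)))
            printf("SVM (AMD-V) visible in this guest\n");
        else
            printf("no x86 virtualization extensions visible\n");
        return 0;
    }

In the 16 setups above, such a check reports no extensions inside every L1 guest; with KVM's nested SVM patches, loading the kvm-amd module with its nested parameter enabled is what makes the SVM bit visible to guests, as discussed further in section 5.4.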
