Business Continuity as a Service
ICT FP7-609828
D3.2.1 - RCL: Software Prototype
June 2014
Document Information

Scheduled delivery: 30.06.2014
Actual delivery: 30.06.2014
Version: 1.0
Responsible Partner: IBM
Dissemination Level: PU (Public)

Revision History

Date        Editor        Status  Version  Changes
18.05.2014  Ronen Kat     Draft   0.1      Outline
25.05.2014  Yossi Kuperman Draft  0.2      Added section 3
05.06.2014  Ronen Kat     Draft   0.3      Merge IBM and Red Hat updates
15.06.2014  Ronen Kat     Draft   0.4      Added introduction and finishing
22.06.2014  Ronen Kat     Draft   0.5      Input from internal reviewers
23.06.2014  Dave Gilbert  Draft   0.6      Address comments from reviewers on Red Hat sections
30.06.2014  Ronen Kat     Final   1.0      Finishing

Contributors: Ronen Kat (IBM), Yossi Kuperman (IBM), Dave Gilbert (Red Hat), Andrea Arcangeli (Red Hat)
Internal Reviewers: Vasileios Anagnostopoulos, Luis Tomás

Copyright: This report is by IBM and other members of the ORBIT Consortium, 2013-2016. Its duplication is allowed only in integral form for anyone's personal use and for the purposes of research or education.

Acknowledgements: The research leading to these results has received funding from the EC Seventh Framework Programme FP7/2007-2013 under grant agreement n. 609828.

www.orbitproject.eu 2/15
Glossary of Acronyms

D - Deliverable
DoW - Description of Work
EC - European Commission
PM - Project Manager
PO - Project Officer
WP - Work Package
MTU - Maximum Transmission Unit
Table of Contents

1. Executive Summary
2. Introduction
3. I/O Consolidation
4. Memory Consolidation and Externalization
5. Cloud Management
6. References
List of Figures

Figure 1: Split I/O deployment
Figure 2: Memory externalization nested test-bed

List of Tables

Table 1: I/O consolidation: components internal names
Table 2: I/O consolidation: status of components
Table 3: I/O consolidation: status of API
Table 4: Memory consolidation and externalization: status of components
1. Executive Summary

This document summarizes the prototype development work done as part of WP3. For this project interval, the first nine months of the project, we report on Task 3.1 (T3.1) and Task 3.2 (T3.2); Task 3.3 (T3.3) has not yet started. Work on the I/O consolidation layer (T3.1) and on the memory consolidation and externalization layer (T3.2) is well under way, and development completeness is well beyond the 25% required for this deliverable by the quality plan in D1.2.1 [1]. The next prototype deliverable, D3.2.2, is scheduled for June 2015, month 21 of the project.
2. Introduction

This first software prototype delivery realizes the design and specification outlined in deliverable D3.1.1 [2]. The work and status of the I/O consolidation layer are described in Chapter 3, those of the memory consolidation and externalization layer in Chapter 4, and those of the cloud management in Chapter 5.

2.1. Progress toward feature and API completeness

Development is well under way, and feature completeness is close to (or in places above) 50% of the planned features. API development progress is about 25% for the I/O consolidation layer and above 50% for the memory consolidation and externalization layer. Work on cloud management has not started yet; it is scheduled to start in October 2014 per the project plan.
3. I/O Consolidation

Our objective is to externalize I/O resources and consolidate all of them in a single dedicated appliance. We do so by detaching the back-end logic responsible for handling I/O from the hypervisor software stack and moving it from the compute servers that host the VMs to a remote server dedicated to I/O virtualization. The prototype comprises the components described in deliverable D3.1.1; the internal names of the components are listed in Table 1 and their status in Table 2.

3.1. Status of prototype

The prototype implementation is capable of creating block and network virtual devices. The exposed virtual devices are partially functional and not yet ready to be deployed. It is possible to communicate over a virtual network device with the load generator, and to read/write a block of data from a remote block device that resides in the I/O hypervisor's memory.

We deployed the Split I/O prototype on 3 machines in our lab (depicted in Figure 1, from right to left): the load generator, the I/O hypervisor (back-end), and the host that runs the VM (front-end). Each machine is an IBM System x3550 M4, equipped with two 8-core Intel Xeon E5-2660 CPUs running at 2.2 GHz, 56GB of memory, and an Intel x520 dual-port 10Gbps SR-IOV NIC.

Figure 1: Split I/O deployment

The prototype is implemented as a set of new kernel modules for the Linux 3.9 kernel. Each module corresponds to a component described in deliverable D3.1.1 (see Table 1). Modules 3.1.1, 3.1.21, 3.1.22, and 3.1.23 are installed on the I/O hypervisor machine, and modules 3.1.1, 3.1.11, 3.1.12, and 3.1.13 are installed on the VM host. Note that module 3.1.1 (Ethernet transport) is installed on both the VM host and the I/O hypervisor, as it is used by both the front-end and the back-end; its main purpose is to facilitate data transport between the two ends.
It does so efficiently by using layer 2 (the MAC layer), thus avoiding higher layers such as TCP/IP, which incur high overhead. As block I/O requests have arbitrary sizes, and the size of a network packet that can be sent over the wire is bounded by the Maximum Transmission Unit (MTU), component 3.1.1 must fragment requests before sending them to the other end for processing. The fragment size is determined by the MTU. To instantiate a virtual device, we created a special user-space utility that operates the generic back-end kernel module (module 3.1.21). Invoking the utility on the I/O hypervisor with the following details: the device type (block or net), the backing device (e.g. a local block device)
and the address of a remote VM will pass the request to the I/O hypervisor kernel (via an ioctl), which in turn instructs the VM to expose a virtual device.

Below is a mapping between our internal module names and the components described in D3.1.1.

Component                               Module name       Installed on
3.1.1 - Split I/O Ethernet Transport    vrio_eth.ko       VM + I/O hypervisor
3.1.11 - Split I/O Generic Front-End    vrio_generic.ko   VM
3.1.12 - Split I/O Block Front-End      vrio_gblk.ko      VM
3.1.13 - Split I/O Net Front-End        vrio_gnet.ko      VM
3.1.21 - Split I/O Generic Back-End     vrio_generic.ko   I/O hypervisor
3.1.22 - Split I/O Block Back-End       vrio_hblk.ko      I/O hypervisor
3.1.23 - Split I/O Net Back-End         vrio_hnet.ko      I/O hypervisor
3.1.31 - Split I/O management module    vrio.py           I/O hypervisor

Table 1: I/O consolidation: components internal names

Notes:
1. vrio (virtual Remote I/O) is the internal development name for Split I/O.
2. Module vrio_generic.ko is used for both the generic front-end and the generic back-end.

3.2. External interactions

Management for the Split I/O hypervisor is provided through the Split I/O Python management module; its development status is listed in Table 2. The management library will be used by the cloud management layer as part of T3.3.

3.3. Completion status of components

Table 2 shows the development status of the Split I/O modules included in this prototype.

Component                               Status and progress
3.1.1 - Split I/O Ethernet Transport    70% completed
3.1.11 - Split I/O Generic Front-End    70% completed
3.1.12 - Split I/O Block Front-End      40% completed
3.1.13 - Split I/O Net Front-End        50% completed
3.1.21 - Split I/O Generic Back-End     70% completed
3.1.22 - Split I/O Block Back-End       40% completed
3.1.23 - Split I/O Net Back-End         50% completed
3.1.31 - Split I/O management module    Not started

Table 2: I/O consolidation: status of components

Table 3 shows the development status of the Split I/O API functions.
API                         Status and progress
RESET_GUEST_DEVICES()       Not started
CREATE_BLOCK_DEVICE()       80% completed
REMOVE_BLOCK_DEVICE()       Not started
CREATE_NETWORK_DEVICE()     80% completed
REMOVE_NETWORK_DEVICE()     Not started

Table 3: I/O consolidation: status of API
4. Memory Consolidation and Externalization

Our objective is to provide a mechanism that allows the hypervisor to retrieve memory pages from a remote system for use in a VM; this is used as part of a post-copy migration implementation, in which a migrated VM starts running on the destination host before all of its memory has been copied over.

4.1. Status of prototype

The prototype is capable of performing small post-copy migrations, with page requests making the full round trip in limited situations, but it is incomplete and not yet stable. The prototype consists of modifications to both QEMU and the Linux kernel (currently v3.13). The components are as described in deliverable D3.1.1.

The current deployment is a testing environment consisting of a nested hypervisor, allowing all components to be tested easily on one machine, as shown in Figure 2. The two QEMU instances are the source (left) and the destination, which has a partial copy of the source's memory.

The Linux kernel running in the L1 guest includes the 'Linux kernel mm subsystem enhancements' (component 3.2.1). These allow the destination QEMU instance to mark an area of memory as 'userfault', i.e. external, which causes QEMU to be notified when the guest accesses a page in that area. At a later point in time, QEMU uses another kernel modification to atomically (and efficiently) move a page into place to satisfy the previous request.

The QEMU running in the L1 guest contains the modifications from components 3.2.2 and 3.2.3. Component 3.2.2, 'Remote memory front end', routes page requests/data between the kernel on the destination machine and the network towards the source machine. Requests from the kernel are recorded in a page-ownership data structure and sent as requests along a 'return path' to the source VM. Component 3.2.3, 'Remote memory handler', satisfies page requests from the destination machine and routes control messages to the remote memory front end; these include the messages that initiate the transition to postcopy.
Figure 2: Memory externalization nested test-bed. (The host system runs Linux with unmodified KVM; an unmodified host QEMU instance runs an L1 guest with the modified Linux kernel; within it, two modified L1 QEMU instances each run an unmodified L2 guest.)

The existing QEMU migration protocol has been modified to include commands that initiate the postcopy mode, and to provide a bidirectional transport allowing the destination to request pages. The mode is enabled using an extra migration-capability flag.

4.2. External interactions

None yet.

4.3. Completion status of components

The figures below reflect a prototype that is starting to work: the basics are in place, but they need to be filled out, made more robust, and tidied up before submission to the upstream projects.

Component                                       Status and progress
3.2.1 - Linux kernel mm subsystem enhancements  60% completed
3.2.2 - Remote memory front end                 50% completed
3.2.3 - Remote memory handler                   50% completed

Table 4: Memory consolidation and externalization: status of components
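For illustration, the extra migration-capability flag and the postcopy trigger can be expressed as QMP commands; the names below (postcopy-ram, migrate-start-postcopy) follow the interface as it later stabilised in upstream QEMU and may differ from this prototype's.

```json
{ "execute": "migrate-set-capabilities",
  "arguments": { "capabilities": [
      { "capability": "postcopy-ram", "state": true } ] } }

{ "execute": "migrate",
  "arguments": { "uri": "tcp:destination-host:4444" } }

{ "execute": "migrate-start-postcopy" }
```

The last command switches an in-progress precopy migration into postcopy mode, after which the destination starts the guest and pulls the remaining pages on demand.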
4.4. Additional notes

The next step is to finish the functionality and stabilise the current version so that arbitrary guests can be transferred.
5. Cloud Management

The cloud management is part of Task 3.3, which is scheduled to start at M13 (October 2014). Therefore, no components of the cloud management layer are included in this deliverable.
6. References

[1] ORBIT document D1.2.1 - Quality Plan, February 12, 2014.
[2] ORBIT document D3.1.1 - RCL Design and Open Specification.