Virtualizing FPGAs for Cloud Computing Applications. Stuart A. Byma


Virtualizing FPGAs for Cloud Computing Applications
by Stuart A. Byma

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science, Graduate Department of Electrical and Computer Engineering, University of Toronto

© Copyright 2014 by Stuart A. Byma

Abstract

Virtualizing FPGAs for Cloud Computing Applications
Stuart A. Byma
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2014

Cloud computing has become a multi-billion dollar industry, and represents a computing paradigm where all resources are virtualized, flexible and scalable. Field Programmable Gate Arrays (FPGAs) have the potential to accelerate many cloud-based applications, but as of yet are not available as cloud resources because they are so different from the conventional microprocessors that virtual machines (VMs) are based on. This thesis presents a first attempt at virtualizing and integrating FPGAs into cloud computing systems, making them available as generic cloud resources to end users. A novel architecture enabling this integration is presented and explored, and several custom hardware applications are evaluated on a prototype system. These applications show that Virtualized FPGA Resources can significantly outperform VMs in certain classes of common cloud computing applications, showing the potential to increase user compute power while reducing datacenter power consumption and operating costs.

Dedication

To Jennifer.

Acknowledgements

First I must sincerely thank my advisors, Professors Greg Steffan and Paul Chow. I owe my success in graduate school to them and their invaluable guidance and advice. I could not have asked for better mentorship throughout my Master's research, or in this chapter of my life. Thank you both.

Also to my esteemed colleagues and office mates Xander Chin, Charles Lo, Ruedi Willenberg, Robert Heße, Fernando Martin Del Campo, Andrew Shorten, Jimmy Lin and others: you have made my time at the U of T a true pleasure. Thank you for all the good times, and for being an ever-present sounding board for thoughts and ideas.

A special thanks as well to members of the SAVI testbed: Professor Alberto Leon-Garcia, Hadi Bannazadeh, Thomas Lin and Hesam Rahimi. Your help and advice, technical and otherwise, have made the work presented here possible.

Finally and most importantly, an everlasting thanks to my wife. Thank you for encouraging me to pursue my passions, and thank you for your unfaltering belief in me. None of this would have happened without you.

Contents

1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Overview
2 Background
  2.1 Field-Programmable Gate Arrays
  2.2 Cloud Computing
  2.3 The Smart Applications on Virtual Infrastructure Network
    2.3.1 The Smart Edge Node
    2.3.2 Heterogeneous Resources in the SAVI Testbed
  2.4 FPGA Virtualization
    2.4.1 Related Work
3 Architecture for FPGA Virtualization
  3.1 OpenStack Resource Management
    3.1.1 SAVI Testbed FPGA Resources
    3.1.2 Requirements for FPGA Virtualization in OpenStack
  3.2 Hardware Architecture
    3.2.1 Fully Virtualized Hardware
    3.2.2 FPGA Partial Reconfiguration
    3.2.3 Virtualization via PR
    3.2.4 Static Logic Design
  3.3 Agent Design
    3.3.1 Booting
    3.3.2 Deleting
  3.4 Booting VFRs in OpenStack
  3.5 Compiling Custom Hardware
4 SAVI Testbed Prototype
  4.1 FPGA Hardware
  4.2 Agent Software
    4.2.1 statustable and Associated Objects
    4.2.2 Initialization and Operation
5 Platform Evaluation and Application Case Studies
  5.1 Platform Evaluation
  5.2 Case Study: Load Balancer
    5.2.1 Load Balancer Designs
    5.2.2 Performance Analysis
  5.3 Case Study: Extending OpenFlow Capabilities
    5.3.1 VXLAN
    5.3.2 Virtualized Hardware for New OpenFlow Capabilities
    5.3.3 Performance Analysis
6 Conclusion
  6.1 Future Work
    6.1.1 Architectural Enhancements
    6.1.2 Failures and Migration
    6.1.3 Further Heterogeneity
    6.1.4 Applications
    6.1.5 FPGA CAD as a Service
    6.1.6 Complementary Studies
Bibliography

Acronyms

BRAM  Block Random Access Memory
CAD   Computer Aided Design
CAM   Content Addressable Memory
DRAM  Dynamic Random Access Memory
FIFO  First In First Out
FPGA  Field Programmable Gate Array
GPIO  General-Purpose Input-Output
LUT   Look-up Table
MAC   Media Access Control
PR    Partial Reconfiguration
PRM   Partially Reconfigurable Module
PRR   Partially Reconfigurable Region
SAVI  Smart Applications on Virtual Infrastructure
TCP   Transmission Control Protocol
UART  Universal Asynchronous Receiver-Transmitter
UUID  Universally Unique Identifier
VFR   Virtualized FPGA Resource
VM    Virtual Machine

List of Tables

4.1 Resource Usage for System Static Hardware
5.1 Boot Times for VMs and VFRs
5.2 Resource Usage for VFR Load Balancer
5.3 Resource Usage for VFR VXLAN Port Firewall
5.4 Throughput and Latency for VXLAN Port Firewall

List of Figures

2.1 Diagram of the SAVI testbed
2.2 The SAVI testbed Smart Edge node
2.3 The Driver-Agent abstraction used in the SAVI testbed OpenStack system
3.1 A simplified view of resource management in OpenStack/SAVI Testbed
3.2 FPGA Partial Reconfiguration
3.3 System view of the on-FPGA portion of the virtualization hardware
3.4 Virtualization hardware input arbiter block
3.5 The VFR wrapper design
3.6 VFR boot sequence
3.7 Compile flow
4.1 A sequence diagram of the entire boot procedure in the SAVI testbed prototype system
5.1 Experiment setup for load balancer tests
5.2 VFR load balancer latency at different throughput levels
5.3 VM load balancer latency at different throughput levels
5.4 Number of dropped packets for the VM load balancer
5.5 Packet diagram for the VXLAN protocol
5.6 VXLAN Port Firewall
5.7 Experimental setups for VFR-based VXLAN firewall

Chapter 1 Introduction

Datacenter-based cloud computing has evolved into a multi-billion dollar industry, with continued growth forecast [1]. Cloud computing is based on virtualization technology, which abstracts physical resources into virtualized resources. This virtualization provides flexibility and system scalability (or elasticity) [2], and also allows many users to share available resources in a datacenter in a transparent way. Cloud computing can also greatly reduce the Information Technology (IT) operating costs of companies and organizations [3], making it a very attractive option for IT needs.

1.1 Motivation

Field-Programmable Gate Arrays (FPGAs) have the potential to accelerate many common cloud computing and datacenter-centric applications, such as encryption [4], compression [5], or low-level packet processing [6]. FPGAs have begun to make their way into datacenters, and their use in this context can be organized into three categories.

The first sees FPGAs being used in the technology that enables the datacenter itself, such as switches and routers. FPGAs in this category are transparent, as neither the end user nor the datacenter operator is necessarily aware of their existence.

In the second category, FPGAs are used in special appliances, essentially boxes that accelerate certain tasks or processing. An example could be FPGA-based Memcached appliances [7]. The appliance may be available to end users, but the FPGAs inside are themselves not accessible, programmable resources; they are still relatively transparent.

The third category, which is the focus of this thesis, sees FPGAs becoming fully user-accessible, programmable resources. Users would be able to allocate FPGA resources just as they allocate a virtual machine, using the same control infrastructure, making FPGAs first-class citizens of the cloud.

Consider a motivating example: a large organization runs its IT services and website on an infrastructure-as-a-service cloud, using hundreds or even thousands of VMs to serve its site and services to millions of users. Its applications may require compute-intensive processing, or application-level packet processing that requires many VMs to do efficiently. If user-accessible FPGA resources are available in the cloud, the organization could design custom hardware to accelerate these tasks, eliminating a number of VMs in exchange for a few FPGA resources, and potentially gaining a boost in throughput and a reduction in latency. At the same time, the user retains all the benefits of using a compute cloud, such as dynamic scalability, flexibility, and reliability. The cloud provider also benefits by freeing up VMs, which could potentially reduce power consumption and operating costs. The work presented in this thesis aims to explore methods of enabling these FPGA resources in commercial cloud computing systems.

1.2 Contributions

This thesis presents a hardware/software architecture that enables the virtualization of FPGAs and their management using the OpenStack cloud system. The major contributions are as follows:

- A hardware and software infrastructure that splits an FPGA into a number of reconfigurable regions, and allows these regions to be managed as individual resources in an OpenStack cloud system.
- Introduction of the term Virtualized FPGA Resource (VFR).
- A functional implementation of such an architecture using the Smart Applications on Virtual Infrastructure testbed.
- A comparison of VFRs and VMs in terms of boot time performance.
- An evaluation and proof of concept of the prototype system by means of two applications:
  - A hardware load balancer using a hypothetical UDP-based protocol.
  - A method of using virtualized hardware to extend capabilities in an OpenFlow software-defined network (SDN).

1.3 Overview

The rest of this thesis is organized as follows. Chapter 2 provides background and context, reviewing prior work on virtualizing FPGAs and work similar to the techniques used in this thesis. Chapter 3 introduces the hardware and software architecture enabling FPGA virtualization. Chapter 4 introduces the SAVI [8] testbed prototype [9] and its implementation details. Comparisons of VMs and VFRs, as well as evaluations of the proof-of-concept applications, are shown in Chapter 5. Chapter 6 provides some future vision and concludes the thesis.

Chapter 2 Background

This chapter introduces concepts and definitions used throughout this thesis, and provides context. Work related to the techniques used in this thesis will also be examined.

2.1 Field-Programmable Gate Arrays

This thesis focuses on the use of Field-Programmable Gate Arrays (FPGAs) in datacenters and cloud computing. An FPGA is a silicon chip whose functionality can be reprogrammed an arbitrary number of times to become nearly any digital circuit; it is a type of reconfigurable hardware. Modern FPGAs are typically made up of a large array of programmable Look-up Tables (LUTs), each of which can implement a four-, five- or six-variable logic function, depending on the device architecture. LUTs are usually coupled with flip-flops and organized into logic blocks that can then be connected together through a dense, programmable routing fabric. For further reading on FPGA architectures, the reader is directed to [10].

A set of CAD tools can map arbitrary hardware designs described in Hardware Description Languages (HDLs) to the FPGA fabric. The most common HDLs include Verilog HDL [11] and VHDL [12], but others such as Bluespec SystemVerilog [13] are gaining popularity. Modern FPGAs also have embedded hard blocks to increase their capabilities; these include Digital Signal Processor (DSP) or multiplier blocks, block Random Access Memories (BRAMs), high-speed serializer/deserializer (SERDES) transceivers, communication controllers (Ethernet [14], PCIe [15]), and even full microprocessors.

2.2 Cloud Computing

It is useful to define what is meant by the term cloud computing, as many companies, individuals and other sources may use the term in different ways. This thesis follows the definition of cloud computing given by NIST, which describes several essential characteristics [16], summarized here:

1. On-Demand Self-Service: ability to provision cloud resources at any time, on demand, without human interaction.
2. Broad Network Access: all resources are available and accessible over the network.
3. Resource Pooling: provider resources are organized into pools, enabling multi-tenant service.
4. Rapid Elasticity: the amount of resources can be dynamically increased or decreased.
5. Measured Service: ability to monitor, control and report resource usage.

In addition, there are several different cloud service models. The NIST definition above covers all of these models, which are described briefly here:

- Software as a Service (SaaS): Allocatable resources are software programs, usually provided over the Internet via web browsers.
- Platform as a Service (PaaS): Resources are operating systems, development tools and frameworks for creating software and services.
- Infrastructure as a Service (IaaS): Resources are virtualized datacenter components such as Virtual Machines (VMs), virtual storage, networking and bandwidth.

This thesis will focus primarily on IaaS-type cloud computing, where allocatable resources are virtualized datacenter components. A good example of this type of cloud computing would be Amazon Web Services [17], or the SAVI testbed, which will be discussed shortly.

The cloud computing paradigm has become immensely popular in the IT services and related industries because it frees organizations from the physical aspects of computing and IT infrastructures. There is no major capital investment required for physical servers and networking equipment, nor are there maintenance costs on said equipment. These burdens are shifted to the cloud provider, and the IT organization simply pays a set rate for the cloud-based resources that it uses. The fact that the organization pays only for what it uses, combined with the lack of capital investment, represents a significant cost reduction to the organization. The cloud provider generally commits to a Service Level Agreement (SLA), leaving the organization assured that its IT infrastructure will experience little to no downtime due to hardware problems. Cloud computing also allows the end user to scale their systems up or down seamlessly, avoiding the over-provisioning of computing capabilities or bandwidth that is usually needed to mitigate the effects of bursty traffic, again saving on operating costs.

From a technical and cost perspective, cloud computing is generally extremely attractive to organizations with both large and small IT needs. Certain other factors may influence the attractiveness of cloud computing, usually legal issues arising from the geographic location of the cloud provider's datacenter, or privacy concerns due to the fact that a user's data is effectively in the hands of a third party. However, these points are outside the scope of this thesis.

Chapter 2. Background 7 2.3 The Smart Applications on Virtual Infrastructure Network SAVI [8] is a Canada-wide research network aimed at exploring next-generation application marketplaces that make use of fully virtualized infrastructure, as well as future Internet alternatives. A central vision of the SAVI network is the notion of a Smart Edge node a smaller-scale datacenter situated close to the network edge, providing specialized low-latency processing for future application platforms. SAVI joins a number of other networking research testbeds such as GENI [18], Emulab [19], PlanetLab [20] and Internet2 [21], many of which are also federated. Victoria Edge C & M C & M U of T Edge C & M U of T Core C & M McGill Edge C & M Application X Resources Application Y Resources SAVI Testbed Network Virtual Network Calgary Edge C & M CANARIE Virtual Network ORION CANARIE Carlton Edge C & M C & M C & M Waterloo Edge YorkU Edge Figure 2.1: Diagram of the SAVI testbed. The SAVI testbed is one of the SAVI network research themes. The goal of the SAVI testbed is to realize a future application platform that will provide a testing ground for other SAVI research themes. The testbed consists of several Core nodes and many Smart Edge nodes, deployed at various Universities and institutions across Canada. These Core and Edge nodes are interconnected by a fully virtualized Software Defined Network (SDN), and the whole system is orchestrated by a Control and Management (C & M) system. Users can allocate virtualized resources via the C & M system across all nodes in the testbed, as well as private virtual networks that provide complete isolation from other

Chapter 2. Background 8 users experiments and systems. The ORION [22] and CANARIE [23] networks connect all components over a large geographic area of Canada. Figure 2.1 shows a diagram of the testbed architecture. 2.3.1 The Smart Edge Node SAVI Smart Edge nodes are small-scale datacenters situated close to the edge of the network and are the primary connection point for application users. SAVI Smart Edge nodes are unique in that they make use of heterogeneous resources in addition to virtual machines to accelerate processing GPUs and reconfigurable hardware, as well as regular bare-metal servers. These resources put a large amount of processing power close to the edge of the network, and allow applications to do a majority of intensive processing before having to traverse the possibly high-latency network to the Core datacenter. Such intensive processing may include things like advanced signal processing for wirelessly connected devices, encryption and decryption, multimedia streaming acceleration, new types of switching and routing, and other packet-oriented processing. Figure 2.2 shows a diagram of a Smart Edge node. In the SAVI testbed, the Smart Edge is an OpenStack cloud system [24]. OpenStack management forms the Smart Edge C & M plane through a number of subsystems, all of which are reachable through RESTful [25] APIs. Nova [26] manages all compute resources through a Driver-Agent abstraction. Keystone [27] performs authentication and identity management. Glance [28] manages Virtual Machine images and other images. Quantum [29] performs network management functions. Due to naming trademarks, will become known as Neutron. Swift [30] object storage system.

Chapter 2. Background 9 Cinder [31] block storage service. The SAVI smart edge also has a custom Software-Defined Infrastructure (SDI) manager, called Janus. Janus offloads certain tasks from OpenStack, such as network control and resource scheduling, and also performs configuration management and orchestration of the testbed s OpenFlow-based Software-Defined Network (SDN). Network control is accomplished through an OpenFlow Controller implemented using Ryu [32]. Janus also virtualizes the network into slices using FlowVisor [33] (an OpenFlow-based network virtualization layer), and users can run their own User OpenFlow Controller to manage their own private network slice. Essentially, the SDI Manager brings together Cloud Computing and Software Defined Networking together under one management system. More information on SAVI testbed infrastructure management and Janus is provided in [34]. Application and Service Provider Keystone Glance-reg Smart Edge OpenStack Nova Driver-1 Driver-2 Driver-N Quantum Swift Glance-API Cinder Configuration Manager SDI Manager Janus OpenFlow Controller (ryu) Agent Agent Agent Agent Agent Agent FlowVisor User OpenFlow Controller Heterogeneous Resources Sliceable OpenFlow Network Figure 2.2: The SAVI testbed Smart Edge node. Of particular interest to this thesis in Figure 2.2 is the Nova component of OpenStack, which is the part that allocates resources. The standard Nova only supports processor

Chapter 2. Background 10 virtualization, where Virtual Machines (VMs) are booted on top of hypervisors that abstract away the physical hardware. The vision of the Smart Edge however, incorporates Heterogeneous Resources in addition to VMs. Thus Nova in the SAVI testbed is extended to enable it to manage these new resources. 2.3.2 Heterogeneous Resources in the SAVI Testbed For OpenStack to manage different types of resources, they must all appear homogeneous in nature. To accomplish this, the SAVI testbed OpenStack uses a Driver-Agent system. A driver for any resource implements required OpenStack management API methods, such as boot, reboot, start, stop and release. The driver then communicates these OpenStack management commands to an Agent, which carries them out directly on the resource, via a hypervisor or otherwise. In this fashion, OpenStack can manage all resources through the same interface. Figure 2.3 shows a diagram of the Driver-Agent system. Essentially, the Agent is performing resource-specific management, while the driver facilitates resource-agnostic management for OpenStack. The method of communication between the Agent and the Resource, and the Driver and the Agent, is entirely resource-dependent. OpenStack Nova Resource-Agnostic Management Common API Driver Communication Agent 1... Agent N Resource-Specific Management Resources Resources Figure 2.3: The Driver-Agent abstraction used in the SAVI testbed OpenStack system. If a user desires to allocate a resource, they need to be able to specify what resource type they want - the SAVI testbed extends the OpenStack notion of resource flavor to enable this. Usually, resource flavor refers to the number of virtual processors and amount of RAM to allocate to a VM. The SAVI testbed extends the definition of flavor to also

Chapter 2. Background 11 include resource type. The SAVI testbed currently has several of these additional resource types including GPUs, bare-metal servers, and reconfigurable hardware. Although reconfigurable hardware is included in SAVI (and its precursor VANI [35]) it is still relatively non-virtualized simply FPGA cards in bare-metal servers managed by OpenStack. To be made aware of their existence, OpenStack must have resource references placed in its database one for each allocatable resource. This is done using the nova-manage tool. The resource database entry includes the address of the Agent that provides the resource, a type name that can be associated with a flavor, and how many physical network interfaces the resource has. A flavor is created for each unique resource type. 2.4 FPGA Virtualization This thesis focuses on exploring methods of virtualizing FPGAs and managing the device or portions thereof using OpenStack in the SAVI testbed. A number of prior works have examined virtualization of FPGAs in different contexts, described in the following subsection. 2.4.1 Related Work Hardware virtualization, especially that pertaining to FPGAs, has been explored for some time. Initially realized through time multiplexing hardware [36], most hardware virtualization schemes now generally use run-time reconfiguration of the FPGA. This can be full reconfiguration of the entire device, but usually refers to dynamic Partial Reconfiguration (PR) of a portion of the FPGA. In terms of network applications, which is a theme in this thesis, there has been work examining the use of partial reconfiguration to virtualize forwarding data planes in routers [37], although this is a very specific case and does not involve user-designed custom hardware. Recent works involving virtualized FPGAs for custom user hardware virtually

Chapter 2. Background 12 increase the number of available FPGA partially reconfigurable resources [38] or virtualize non-pr coprocessors [39], to maintain parity with the number of microprocessors in a high performance computing environment. Others use the partial reconfiguration technique to make reconfigurable hardware sharable by multiple software processes. This generally involves some sort of virtualization on the level of the operating system in addition to the FPGA or gate-level virtualization done using PR. Some works investigate operating system and scheduler design specifically to manage reconfigurable hardware tasks [40, 41]. On a lower level, Huang et al. use a hot plugin technique to provide access to PR based accelerators via a unified Linux kernel module, allowing multiple processes to efficiently share different accelerators [42]. Pham et al. propose a microkernel hypervisor for new FPGA/CPU hybrid devices, which facilitates access to either a CGRA-like Intermediate Fabric or a regular PR region running user accelerators [43]. What all these schemes and others like them have in common is that they view the reconfigurable accelerators as rather short-lived entities executing hardware tasks, which supplement software tasks running on a conventional processor. Thus they focus heavily on reconfiguration times and concurrent access to the same PR region for multiple processes, as well as high bandwidth between the FPGA and CPU. The context of FPGA virtualization in this thesis is markedly different. This thesis does not assume the virtualized accelerators to be closely coupled with CPUs or software processes, rather, the accelerators are seen as being a major or supplemental component of massive, distributed, cloud-based infrastructures. Most virtualization techniques like the ones mentioned above cannot readily be applied to IaaS clouds and VMs because the end user does not have any sort of access to the underlying physical hardware. 
It may be possible for a cloud provider to make so-called hardware tasks available to VMs and thus to end users, but the users would likely be unable to define their own hardware because of the low-level access it would still require, which somewhat defeats the purpose. Additionally, the hardware task model may not suit all IaaS users: this thesis also envisions streaming, packet processing, and network-centric applications, all of which the user of a virtualized datacenter may need. Because of this, the work presented here focuses on providing virtualized, in-network hardware resources that are analogous, in the resource sense, to VMs.

Chapter 3 Architecture for FPGA Virtualization

The general approach to virtualization is modelled after that for virtual machines. To follow this model, it is important to understand how OpenStack manages resources. This chapter briefly examines how OpenStack operates and manages heterogeneous resources in the SAVI Testbed. Then, the architecture of the system enabling FPGA virtualization is presented.

3.1 OpenStack Resource Management

Figure 3.1 shows a simplified diagram of resource management in OpenStack. The OpenStack Controller runs on a Commodity Server inside the SAVI Testbed, and provides an API (specifically Nova) to allow a user to request resources. In general, VM resources in the cloud system are booted on top of hypervisors on machines physically separate from the Controller. OpenStack maintains a database of all resources in the system, both in use and free for allocation. When a resource request comes in, OpenStack finds available resources in the Resource Database, and determines which physical machine they are located on. The Controller communicates with the Resource Server via a separate process running beside the Hypervisor, called an Agent.

Figure 3.1: A simplified view of resource management in OpenStack/SAVI Testbed. (Diagram blocks: the User; the SAVI TB OpenStack Controller on a Commodity Server, with the Nova API, Resource Database and Drivers; a Resource Server (Commodity Server) running an Agent and a Hypervisor that hosts VM resources; and SAVI TB heterogeneous resources such as a GPU server and a physical server.)

The Agent is a piece of software that interprets commands from the main OpenStack Controller. As described briefly in Chapter 2, the Agent is part of a Driver-Agent abstraction: a Driver integrates with the OpenStack compute controller (Nova), implementing resource control API functions. Through the Driver, OpenStack requests the resources from the Agent, which in turn instructs the Hypervisor to boot a VM with the operating system image and parameters specified by the User (sent to the Agent by the Controller). Networking information is also sent by the Controller, and is used by the Agent and Hypervisor to set up network access for the VM. A reference, usually in the form of an IP address, is returned to the User so that they can connect to their resource and run whatever application or system they want on top of it. For more details on OpenStack, the reader is referred to [24].
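The allocation path described above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the names `ResourceEntry`, `ResourceDatabase`, `allocate` and `release` are hypothetical and do not correspond to actual Nova code, and the addresses are made up. Only the fields of an entry (Agent address, a type name tied to a flavor, and the number of network interfaces) are taken from the description in Chapter 2.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ResourceEntry:
    """One allocatable resource, as registered in the resource database."""
    agent_ip: str                      # address of the Agent providing the resource
    agent_port: int
    type_name: str                     # type name associated with a flavor
    num_interfaces: int                # physical network interfaces on the resource
    owner_uuid: Optional[str] = None   # None means free for allocation


class ResourceDatabase:
    """Hypothetical stand-in for the Controller's resource table."""

    def __init__(self, entries: Iterable[ResourceEntry]) -> None:
        self.entries = list(entries)

    def allocate(self, flavor: str, uuid: str) -> Optional[ResourceEntry]:
        """Mark the first free resource of the requested flavor as in use."""
        for entry in self.entries:
            if entry.type_name == flavor and entry.owner_uuid is None:
                entry.owner_uuid = uuid
                return entry
        return None  # no free resource of this type

    def release(self, uuid: str) -> bool:
        """Return a resource to the free pool; False if the UUID is unknown."""
        for entry in self.entries:
            if entry.owner_uuid == uuid:
                entry.owner_uuid = None
                return True
        return False


# Illustrative entries only; the addresses and type names are invented.
db = ResourceDatabase([
    ResourceEntry("10.0.0.5", 9000, "fpga", 4),
    ResourceEntry("10.0.0.6", 9000, "gpu", 1),
])
chosen = db.allocate("fpga", "uuid-1")
# The Driver would now contact the Agent at this address.
print(chosen.agent_ip, chosen.agent_port)
```

Keeping the Agent address inside each entry mirrors the text: the Controller never talks to the hardware directly, it only learns from the database which Agent to contact.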

Figure 3.1 also shows that the SAVI Testbed OpenStack Controller manages heterogeneous resources through similar mechanisms. A custom driver communicates with Agents managing bare-metal servers, some of which are regular bare-metal servers while others contain GPUs or reconfigurable hardware. The current state of reconfigurable hardware in the SAVI Testbed is relatively non-virtualized: bare-metal servers with PCIe FPGA cards, and several BEECube systems. They are non-virtualized because the resource provided is a fully physical resource, not sharable between Users and thus not very scalable or flexible. The current reconfigurable resources in the SAVI Testbed are described briefly below.

3.1.1 SAVI Testbed FPGA Resources

This subsection briefly describes the current, non-virtualized FPGA resources in the SAVI Testbed.

BEE2 Boards

The SAVI Testbed has a number of BEE2 systems [44]. The BEE2 is equipped with five Xilinx FPGAs, with one used to control the others. In the Testbed, an Agent runs on an embedded system on the control FPGA, and manages the other FPGAs as resources that can be allocated. Each FPGA resource has four 10G-capable CX4 interfaces that connect to the Testbed SDN, allowing the user to send data to and receive data from their hardware on the FPGA. Since the user simply gets the entire device as a resource, they are responsible for designing and compiling their hardware using vendor tools, ensuring that their hardware ports match the correct pin locations on the BEE2, and ensuring that the hardware will function correctly. Once they generate a bitstream file for programming the FPGA, it is uploaded to OpenStack as an image, and is then loaded onto the FPGA by the Agent.

Note that again the definition of a concept in OpenStack is being extended. Normally, an image refers only to an Operating System (OS) image; however, OpenStack allows any file type to be uploaded as an image. Therefore, for a BEE2 FPGA resource, the image is a bitstream generated by the FPGA tools. The Agent receives this image from the OpenStack Controller via the driver, and simply configures it onto an unused FPGA. OpenStack sees the FPGA as any other resource thanks to the Driver-Agent abstraction, and the user can now make use of custom hardware acceleration in the SAVI Testbed.

PCIe-Based FPGA Cards

To increase the range of FPGA applications available to researchers, it is useful to have FPGAs closely coupled to processors so that the reconfigurable hardware can accelerate compute-intensive portions of software. The SAVI Testbed provides several PCI-Express-based FPGA boards connected to physical servers: the NetFPGA, the NetFPGA10G [45] and the DE5Net [46]. The boards have varying FPGA device sizes and on-board memory, but have in common four network interfaces that are connected to the Testbed SDN. The NetFPGA has four 1G Ethernet ports, while the NetFPGA10G and DE5Net have four 10G Ethernet ports. A researcher can now design custom hardware that can accelerate software tasks, provide line-rate packet processing, or a combination of both.

In addition to these boards, the Testbed also contains MiniBEE [47] resources. The MiniBEE contains a conventional processor and an on-board FPGA connected through PCIe. It also has 10G network interfaces, a large amount of memory and an expansion port for additional FPGA peripherals. Since the PCIe boards must be mounted inside physical servers, the SAVI Testbed provides the server itself, with the FPGA card attached, as a resource. In the case of the MiniBEE, the entire system is also offered as a resource.

3.1.2 Requirements for FPGA Virtualization in OpenStack

Using Virtual Machines as a model for full FPGA virtualization, it is clear there are several required components: an Agent to provide the FPGA (or pieces thereof) as a resource, and a Driver to integrate into OpenStack so that OpenStack can communicate with the Agent. The Agent is responsible for managing the actual resource provided to the user, in this case an FPGA or portion thereof, and therefore must be capable of receiving an FPGA programming file (hereafter referred to as a "bitstream") and configuring or reconfiguring the device. It must also track which FPGAs or FPGA portions have user hardware running in them, and which hardware belongs to which user.

Additionally, if full virtualization is to be achieved, the physical FPGA device must be abstracted and sharable between different users. This requires a base hardware architecture to virtualize the device, somewhat similar to a hypervisor. The Agent must be aware of this virtualization layer in order to manage the resources.

The following sections describe the design of a system that meets the aforementioned requirements, using OpenStack in the SAVI Testbed as the cloud computing platform. The base hardware architecture virtualizing the device is described first, followed by the Agent that provides the virtualized hardware to OpenStack.

3.2 Hardware Architecture

Though the current FPGA resources in the SAVI Testbed are managed by OpenStack, they still leave much to be desired compared with commercial systems and in terms of user-friendliness. The resources are still relatively non-virtualized: a single physical device is allocated to one user, whereas in a fully virtualized system, one physical device should be sharable among different users simultaneously.
There is more motivation for this when considering that one user may not make full use of an entire FPGA, wasting some reconfigurable fabric that could go to another user. A full FPGA can also be more difficult to design for and program, especially when integrating complex IP (such as memory controllers), and without physical access to the device. The architecture presented in this thesis seeks to resolve these problems through full hardware virtualization.

3.2.1 Fully Virtualized Hardware

Virtualization in a cloud computing context has several characteristics that are usually presented in terms of Virtual Machines:

Physical Abstraction: The physical device itself is abstracted and the user is not aware of the underlying hardware. For example, a VM may be running via a hypervisor on an Intel Xeon processor, but the user is only aware of how many Virtual CPUs they have. They are unaware of the real hardware.

Sharing: A single physical device provides one or more virtual instances to one or more users. Such devices can also be referred to as multi-tenant devices.

Illusion of Infinity: The actual number and physical location of resources is also abstracted, and from the user's view there exists a seemingly infinite pool of resources.

The objective of the work presented in this thesis is to enable fully virtualized FPGA hardware by designing an architecture and system that has the above characteristics (or analogous ones). The physical device should be abstracted: a user should be able to specify a hardware design, in HDL for example, and rely on the system to run their design in the cloud. They should not have to worry about what specific device their hardware must run on, nor about compiling for different devices. They should also not be aware of other physical aspects of the system, such as the physical location of resources or the number of available resources (i.e., the illusion of infinite resources in the cloud should be maintained).

FPGA size and density have grown to the point where it is feasible for the device to be virtualized and shared between users, with enough fabric left over for each user to run non-trivial hardware designs. This also improves the usage efficiency and cost-effectiveness of a device, since many useful hardware designs do not need as much logic as the entire device provides. Full hardware virtualization would allow the cloud provider to put the unused fabric to work for other users.

Another issue that fully virtualized hardware would address is security. Giving a user full control over an entire FPGA directly connected to the network may be risky for a cloud provider. It would allow a nefarious user to inject malicious data directly into the provider datacenter at extremely high rates (10 Gb/s or more). A hypervisor, in the case of a regular VM, acts as a buffer between the user's guest OS and the provider hardware, allowing the provider to set up security and police network traffic before it gets onto the internal network. The hardware virtualization layer is therefore designed to allow the provider to police the data going into and coming out of the user hardware.

The general approach to virtualization of the hardware is based on Partial Reconfiguration (PR) of the FPGA. This technique of reconfiguring specific portions of an FPGA while the rest remains running can be used to effectively split the device into several regions that can be offered individually as resources to cloud users. To familiarize the reader, the basics of Partial Reconfiguration are reviewed in the following section.

3.2.2 FPGA Partial Reconfiguration

Partial Reconfiguration is a capability of some FPGAs where portions of the device can be reconfigured independently, without affecting other circuits running on the device.
Physical portions of the device must be designated as Partially Reconfigurable Regions (PRRs), and specific hardware modules of the overall design must be mapped to one of these PRRs. Multiple modules can be compiled for one PRR, but only one can be configured into it at a time at run time. These modules are called Partially Reconfigurable Modules (PRMs). Generally, the logic surrounding the PRRs is fixed, and is referred to as the static logic. Major FPGA vendors support partial reconfiguration [48, 49, 50].

Figure 3.2 depicts a partially reconfigurable FPGA system. There is one PRR and three PRMs (PRM A, B and C). Each PRM contains different hardware implementing different functionality, and each PRM can be dynamically configured into the PRR at run time while the Static Logic remains running.

PR introduces complexities into the hardware compilation process. The interface from the static logic to a PRR, called the PR Boundary, must be dealt with carefully by the CAD compile process. The static logic must be compiled once, since it does not change, along with one of the PRMs. After placement and routing, the physical wires crossing the PR boundary are set permanently since they connect to the static logic, shown in Figure 3.2 as Static Connection Points. Further PRMs compiled with the static logic must have their connections routed to the same physical locations, so that when they are partially reconfigured, their connections actually connect to the running static logic. This is also shown in Figure 3.2, where any PR Boundary crossing signal is routed to the same location in each PRM. This is usually accomplished by locking the placement of the logic cells whose wires cross the boundary (called anchor LUTs) after compiling the static logic. It follows that every PRM for a given PRR must have the same logical top-level ports, whether or not it uses them all.

Other considerations must be made by designers using PR. During reconfiguration, outputs of a PRR may be in flux and have unknown values. The static logic should have a method of freezing these outputs or ignoring them while reconfiguration takes place. Timing constraints can also be harder to meet in PR systems, since the CAD tools are unable to perform any logic optimization across the PR boundary.
Xilinx Inc. suggests registering signals both before and after the PR boundary to improve timing performance [48]. Lastly, the designer should ensure that a freshly reconfigured PRM is fully reset to a known state.

Figure 3.2: FPGA Partial Reconfiguration. (Diagram blocks: an FPGA containing Static Logic and one PRR; PRM A, PRM B and PRM C; the PR Boundary; Static Connection Points.)

3.2.3 Virtualization via PR

In a cloud computing context, the static logic surrounding the Partially Reconfigurable Regions implements hypervisor-like functions, providing a buffer, under control of the cloud provider, between the network and the user-defined hardware in the PRRs. Just as with a VM hypervisor, this allows the cloud provider to implement some measure of security, and possibly other required management functions. The static logic also has several other functions that will become apparent as the full design is described below.

3.2.4 Static Logic Design

As mentioned previously, partial reconfiguration is used to split a single FPGA into several reconfigurable regions, each of which is managed as a single Virtualized FPGA Resource (VFR). In effect, this virtualizes the FPGA and makes it a multi-tenant device, although one that still requires the external control of the Agent. A user can now allocate one of these VFRs and have their own custom-designed hardware placed within it. The static logic surrounding the VFRs is still under control of the cloud provider, and must accomplish several functions. Data is transferred between a user and their VFR over the network, and therefore the static logic must forward packets to the correct VFR. To do this, the static logic must track the network information (i.e., MAC addresses) of each VFR as provided by the OpenStack Controller. The static logic is also designed in a way that enables it, and thus the cloud provider, to police the interfaces to the VFRs in order to maintain some basic network security, such as preventing the sniffing of traffic and the spoofing of addresses. The static logic contains interface hardware for four 10G Ethernet ports, memory controllers, all chip-level I/O, and a method of communicating with the Agent. The following paragraphs and subsections describe the design choices made for this thesis to accomplish these functions.

Figure 3.3 shows a block diagram of the on-FPGA portion of the system. A Soft Processor (that is, a microprocessor implemented inside the FPGA fabric) communicates with the Agent that runs on the Host machine. The Soft Processor is attached to a Bus that allows it to communicate with and control the different components of the system. A bus is a two-way communication system with two kinds of actors: Masters and Slaves. Masters can initiate read and write transactions by addressing the Slave they wish to communicate with, while Slaves can only respond to reads or writes. The Soft Processor is a Bus Master. The DRAM Controller is a Bus Slave that facilitates access to Off-Chip DRAM. The MAC Memory-mapped (Memmap) Registers are also Slaves, and allow the Soft Processor to control the Input Arbiter and Output Queues (described in the following subsection). The VFRs are wrapped in Bus Masters, which allows them to access the DRAM Controller slave. Packet streams from the 10G interfaces pass through the Input Arbiter and the VFR Bus Master wrappers and enter the VFRs.
Output streams exit the VFR and connect to the Output Queues and subsequently the egress interfaces. The Agent can reprogram the entire chip or individual VFRs through an external Programmer that operates over JTAG. The major subcomponents shown in Figure 3.3 are described in the following subsections.

Figure 3.3: System view of the on-FPGA portion of the virtualization hardware. (Diagram blocks: Programmer (JTAG); Agent (Host); Soft Processor; 32-bit Bus; DRAM Controller with 128 MB Off-Chip DRAM; MAC Memmap Registers; Input Arbiter fed by a 256b-wide Stream In from 10GE at 160 MHz; VFR Wrappers 1 to N, each containing a VFR; Output Queues driving Stream Out to 10GE; device: Virtex 5 VTX240T.)

Data Transfer and I/O

A Streaming Interface facilitates packet transfer into the system from the 10G Ethernet ports. Streaming Interfaces are one-way communication channels. They consist, at a minimum, of four signals:

1. A variable bit width Data signal.
2. A single bit Valid signal.
3. A single bit Ready signal.
4. A Clock signal.

The actual data of the packet is transferred in chunks through the Data field, where each chunk is the bit width of the Data field. These chunks are referred to as flits. At the positive edge of the Clock signal (that is, a logic low to logic high transition), one flit is transferred if and only if both the Valid and the Ready signals are logic high. The Valid signal is asserted by the sender along with the flit in the Data field, and the Ready signal is asserted by the receiver to indicate it is ready to receive data. The Valid and Ready signals are also referred to as handshake signals. At most one flit can be transferred over the Data signal every clock cycle.

The Input Arbiter block, shown in Figure 3.4, is responsible for directing incoming packets to the correct VFR. The Input Arbiter contains a Content Addressable Memory (CAM) that the arbiter uses to match a packet's destination MAC address to a specific VFR. An incoming packet is stalled for one clock cycle while the CAM looks up the destination VFR. If there is no matching VFR in the CAM, the packet is simply dropped. This has the benefit of preventing VFRs from inadvertently (or intentionally) sniffing Ethernet traffic that does not belong to them, but the drawback that VFRs are unable to receive broadcast packets. This could be addressed by designing a more complex switching fabric within the Input Arbiter.

The CAM must be programmed with the VFR MAC addresses (provided by OpenStack) before any packets can be received. MAC addresses for new VFRs are programmed into the CAM by a soft processor via several memory-mapped registers. The software running on the soft processor receives the MAC address and corresponding VFR region code over a UART link from the Agent running on the host. The software running on the processor and the communication protocol with the Agent will be described in more detail in Section 3.3.

The output queues operate similarly, except that this block simply tracks each VFR's MAC address in a register, and prevents spoofing by forcing an outgoing packet's source MAC address to be the VFR's MAC address. The output queue MAC addresses are updated via the same memory-mapped registers as the input arbiter.

Virtualized Accelerator Wrappers

The VFRs are wrapped inside Bus Masters, as can be seen in Figure 3.3. These are labelled 1 to N.
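Returning briefly to the Input Arbiter and Output Queues described above, their policing behaviour can be modelled in software. The following is a minimal sketch, not the hardware design: the class and method names (`InputArbiter`, `OutputQueues`, `program`, `route`, `egress`) are hypothetical, a Python dictionary stands in for the 48-bit CAM, and frames are assumed to be plain Ethernet frames with the destination MAC in bytes 0 to 5 and the source MAC in bytes 6 to 11.

```python
from typing import Dict, Optional


class InputArbiter:
    """Software model of the CAM lookup: destination MAC -> VFR region code."""

    def __init__(self) -> None:
        self.cam: Dict[bytes, int] = {}

    def program(self, mac: bytes, region: int) -> None:
        """Models the soft processor writing a MAC and region code to the CAM."""
        self.cam[mac] = region

    def route(self, frame: bytes) -> Optional[int]:
        """Return the region code for a frame, or None if it is dropped."""
        return self.cam.get(frame[0:6])


class OutputQueues:
    """Software model of the per-VFR source MAC registers on egress."""

    def __init__(self) -> None:
        self.mac_of: Dict[int, bytes] = {}

    def program(self, region: int, mac: bytes) -> None:
        self.mac_of[region] = mac

    def egress(self, region: int, frame: bytes) -> bytes:
        """Force the source MAC to the VFR's own address to prevent spoofing."""
        return frame[0:6] + self.mac_of[region] + frame[12:]


# Illustrative addresses only.
vfr_mac = bytes.fromhex("020000000001")
peer_mac = bytes.fromhex("020000000099")

arbiter, queues = InputArbiter(), OutputQueues()
arbiter.program(vfr_mac, 1)
queues.program(1, vfr_mac)

# A frame addressed to the VFR is routed to region 1.
inbound = vfr_mac + peer_mac + b"\x08\x00payload"
print(arbiter.route(inbound))

# A frame leaving region 1 with a forged source address is corrected.
spoofed = peer_mac + bytes.fromhex("0200000000aa") + b"\x08\x00payload"
print(queues.egress(1, spoofed)[6:12] == vfr_mac)
```

Because the lookup is an exact match on the destination address, a broadcast frame finds no entry and is dropped, which reproduces the limitation noted in the text.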
Figure 3.5 shows a closer view of this wrapper design, which must accomplish two things: facilitate safe partial reconfiguration of a VFR, and provide the VFR access to off-chip DRAM.

Figure 3.4: Virtualization hardware input arbiter block. (Diagram blocks: MAC Memmap Registers written over the Bus with the MAC address and VFR region code; 48-bit CAM producing the VFR region code; Stream In entering the Input Arbiter and leaving to the VFRs.)

Safe reconfiguration is of paramount importance, since the cloud provider should seek to guarantee that user hardware will be configured correctly. The system must ensure that no information or packet is being transferred across the PR boundary during reconfiguration. A transfer in progress at that moment can result in lost data, or in a newly reconfigured VFR receiving data starting in the middle of a packet, which could cause errors in the user hardware.

The VFR reconfiguration process is handled as follows. When a request comes to the Agent to configure a new VFR, the Agent sends a command to the soft processor to freeze the interfaces of the selected region. The soft processor de-asserts the PR_INTERFACE_ENABLE signal by writing to the register, which causes the wrapper hardware to set all streaming interface handshake signals low after any current transfer finishes. This is accomplished by using a register for each Stream Interface that records whether or not the stream is in the middle of a transaction. When both transaction registers are logic 0, the hardware sends an ACK to the processor through another memory-mapped register that the processor is polling, and the processor asserts the PR_RESET signal and notifies the Agent that it is now safe to reconfigure that region. The VFR is held in reset, and after the new user hardware is configured via an external JTAG connection and the MAC address is programmed into the input arbiter and output queue, the Agent instructs the soft processor to release PR_RESET and enable the interfaces again. This method also ensures that new user hardware is fully reset before the interfaces are enabled.

Figure 3.5: The VFR wrapper design. (Diagram blocks: Bus; Memmap Register carrying PR_RESET, PR_INTERFACE_ENABLE and ACK; Stream In; VFR (User IP); Stream Out; Memory Operation Queues.)

The wrapper also facilitates access to low-latency off-chip DRAM for a VFR. The VFR wrapper has read and write ports so that user hardware can insert memory operations into a queue that acts as a bus master. The queue would be implemented as a FIFO so that all writes added to the queue are done in order, and read data is also returned in order. The queue, which is part of the wrapper hardware under provider control, partitions the memory address space so that each VFR in the physical FPGA only sees a subset of the total DRAM, effectively giving each VFR a private off-chip memory while making it impossible for any VFR to access another's memory. This is done by dividing the address space and offsetting addresses by an appropriate amount before entry into the operation queue.

For example, if the memory is 64 MB, then the memory address space might be 0x00000000 to 0x03FFFFFF, a total of 0x04000000 address locations. If there are four VFRs, then this space is divided into four, and each VFR has log2(0x01000000) = 24 address bits, or a range of 0x00000000 to 0x00FFFFFF (16 MB each); the full 64 MB space, by comparison, needs log2(0x04000000) = 26 bits. When a VFR makes a read or write operation, its 24 address bits are zero-extended to 32 bits (the system bus width) and offset by adding a multiple of 0x01000000. For instance, an access to local address 0x00000010 by the third VFR becomes bus address 0x02000010. The offset is applied outside the VFR, inside the provider-controlled static logic, making the user hardware entirely unaware that memory beyond its partition exists. The queue is reset and cleared upon VFR reconfiguration via the PR_RESET signal.

In the prototype system using the NetFPGA10G, the RLDRAM modules have a minimum read latency of three clock cycles and a minimum write latency of four clock cycles, plus several cycles for the controller running on the FPGA. The module can burst read and write up to a length of eight. Maximum aggregate bandwidth is 38.4 Gb/s. This memory system is admittedly rudimentary compared to other possibilities such as a multi-port memory controller, but it was chosen to keep the static logic simple and low-area.

3.3 Agent Design

This section describes the Agent, the piece of software that communicates with OpenStack and performs resource-specific management. One Agent manages all the VFRs on one FPGA, although the design could easily be extended to manage VFRs on multiple FPGAs. This section describes Agent requirements in general, but also focuses on the Agent implemented in the SAVI Testbed prototype, which uses the NetFPGA10G card.

Generally, the Agent must implement the resource-specific management commands from OpenStack (issued via the Driver). At the very least, these must include boot (instantiate a new resource with the specified parameters) and delete (tear down a running resource). The Agent and embedded software use a simple command-acknowledge protocol to communicate: the Agent sends a command string, and the embedded software responds with an acknowledge, plus an additional acknowledge for each data parameter sent as part of the command. If no acknowledge is received, the command is aborted and re-attempted. In the prototype hardware, the embedded software runs on a Xilinx MicroBlaze soft processor. The Agent and OpenStack Driver communicate over the Testbed network using a text-based protocol over TCP. The following subsections describe how the Agent boots and deletes VFRs.

3.3.1 Booting

When a boot command is issued in OpenStack, several pieces of information are received by the Agent:

UUID: a universally unique identifier for the resource.
Network information: an IP and MAC address for the resource.
Image: usually an OS image for VMs, but repurposed for VFRs.

The Agent uses the UUID to track VFRs; UUIDs in OpenStack are an absolute reference for any object or resource. The image is the most important piece of data received. It contains PR bitstreams corresponding to the user-designed hardware. How this is compiled and created will be explained in Section 3.5. The image contains one PR bitstream corresponding to each physical PRR on the FPGA, numbered so that the Agent can tell which one is for which region. This is explained in detail in Section 4.2.1.

The Agent chooses the first available unconfigured PRR for the incoming VFR to be booted, and then begins the reconfiguration process. Note that this requires the Agent to maintain the current state of the system, remembering which regions are currently configured and which users they belong to. A simple data structure could be used to store the state of each physical VFR and, if configured, the associated UUID.

First, a disable interfaces command is sent to the embedded software. This command has a single parameter, the region code corresponding to the PRR about to be reconfigured. The soft processor freezes the packet stream interfaces as described in Section 3.2.3, and places the PRR (VFR) into reset. The Agent then reconfigures the PRR using an external reconfiguration tool; in the case of the prototype system, this is Xilinx iMPACT over JTAG. Then, the MAC address is programmed into the static logic's input arbiter and output queues, which allow packets into and out of the newly configured VFR. The load MAC address command is sent to the embedded software, followed by the six-byte MAC address received from the OpenStack Controller. Once this succeeds, the enable interfaces command is issued, and the embedded software releases the VFR from reset and enables the packet stream interfaces. This method ensures that no packet is in the middle of transfer during a reconfiguration, avoiding the situation where a piece of user hardware might receive a fragmented packet. It also ensures that user hardware is fully reset before running.

3.3.2 Deleting

Deleting a VFR is similar to booting in terms of what the Agent must do; however, the only piece of data sent is a UUID. The Agent finds the VFR with the corresponding UUID, and proceeds to reconfigure the PRR with a black box bitstream. This effectively removes any user hardware from the system. Again, this requires that the Agent store the state of each VFR and the associated UUIDs.
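The boot and delete procedures above can be sketched as a small state machine on the Agent side. This is a hedged illustration rather than the prototype's code: the command strings (`disable_interfaces`, `load_mac`, `enable_interfaces`), the `FakeLink` class standing in for the UART link, and the `configure` callback standing in for the external JTAG programming tool are all assumptions made for the example. The sketch also assumes that delete freezes the region's interfaces before loading the black box bitstream and leaves them disabled afterwards, which the text does not state explicitly.

```python
from typing import Callable, Dict, List, Optional


class FakeLink:
    """Stand-in for the UART link to the embedded software (always acks)."""

    def __init__(self) -> None:
        self.sent: List[str] = []

    def send(self, item: str) -> bool:
        self.sent.append(item)
        return True  # True models a received acknowledge


class Agent:
    """Tracks which region belongs to which UUID and drives reconfiguration."""

    def __init__(self, link, configure: Callable[[int, bytes], None],
                 num_regions: int = 4, retries: int = 3) -> None:
        self.link = link
        self.configure = configure  # hypothetical wrapper around the JTAG tool
        self.retries = retries
        self.owner: Dict[int, Optional[str]] = {r: None for r in range(num_regions)}

    def _command(self, name: str, *params: str) -> None:
        """Send a command and its parameters, expecting an ack for each item.

        A missing ack aborts the command, which is then re-attempted.
        """
        for _ in range(self.retries):
            if all(self.link.send(item) for item in (name, *params)):
                return
        raise RuntimeError(f"no acknowledge for {name}")

    def boot(self, uuid: str, mac: str, image: Dict[int, bytes]) -> int:
        """Configure the first free region with the matching partial bitstream."""
        region = next((r for r, o in self.owner.items() if o is None), None)
        if region is None:
            raise RuntimeError("no free VFR")
        self._command("disable_interfaces", str(region))  # freeze and reset
        self.configure(region, image[region])             # partial reconfiguration
        self._command("load_mac", str(region), mac)       # arbiter and output queue
        self._command("enable_interfaces", str(region))   # release reset
        self.owner[region] = uuid
        return region

    def delete(self, uuid: str, black_box: Dict[int, bytes]) -> None:
        """Replace the user's hardware with a black box bitstream."""
        region = next((r for r, o in self.owner.items() if o == uuid), None)
        if region is None:
            raise KeyError(uuid)
        self._command("disable_interfaces", str(region))
        self.configure(region, black_box[region])
        self.owner[region] = None


link = FakeLink()
agent = Agent(link, configure=lambda region, bits: None)
image = {r: b"partial-bitstream" for r in range(4)}
print(agent.boot("uuid-a", "02:00:00:00:00:01", image))
print(link.sent)
```

The `owner` dictionary is the "simple data structure" mentioned in the text: one slot per physical region, holding the UUID of the user hardware configured there, or `None` if the region is free.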

Chapter 3. Architecture for FPGA Virtualization 31 3.4 Booting VFRs in OpenStack Several modifications were needed in the OpenStack Controller (Nova) to get a complete working system. Major modifications were already made by the SAVI Testbed team to enable management of bare-metal servers specifically modifications that allowed the integration of custom Drivers and therefore custom Agents. Recall that in Chapter 2 the notion of resource flavor was discussed. Normally, flavor refers to the number of virtual processors and amount of memory for a VM, but it can be repurposed to refer to more of a resource type. This is important because the Nova scheduler uses the flavor submitted by the user to select which Driver to use to contact the Agent. Each resource in the database references a flavor, so many single resources can fall under one flavor. Figure 3.6 shows a diagram of the boot sequence for a VFR, which proceeds as follows: Upon receiving the Boot command from the User, the OpenStack Controller uses the specified flavor to choose a resource in the database. The database entry references the Agent associated with it through an IP and port number. Nova calls the Driver implementation of the boot function, and passes generated networking information, the Agent IP and port, UUID and any other required information. The Driver then communicates with the Agent (FPGA host), instructing it to boot a new resource, and in the case of VFRs, sending over networking information and the image containing partial bitstreams. The Agent selects the partial bitstream in the image matching a free region, and programs it along with the network information as in Section 3.3.1. Success is indicated to the Controller, which then passes a reference to the user. 3.5 Compiling Custom Hardware Through the modifications and systems described in this chapter, custom hardware acceleration is now available to cloud computing users. However, there is still the question

of how to compile hardware for the system. The VFRs have specific interfaces that any user hardware must match exactly, not only logically but also physically, due to the nature of PR discussed previously. The static logic remains constantly running while user hardware is partially reconfigured as resources are booted and deleted. Therefore, any new user hardware must be compiled with the currently running static logic that is part of the cloud provider's systems. A compile flow must be developed to allow end users to compile their custom hardware for use with the virtualization system.

[Figure 3.6: VFR boot sequence. Steps shown: Boot (User); OpenStack Controller passes image and network info to the Agent (FPGA host); Agent selects the PR bitstream matching a free region; disable VFR interfaces and configure region; program VFR MAC; enable VFR interfaces.]

First, the user hardware top-level ports must match those of the static logic PR boundary logically: a template HDL file is provided to end users, inside which they can define their hardware. The template file contains a module definition whose top-level ports match the PR boundary in the static logic. It is assumed at this point that the provider has already synthesized, placed and routed the static logic. Any user hardware has to be compiled with the placed and routed static logic, so that the physical wires crossing the PR boundary are placed in the right locations in the PRM. How this is done in practice depends entirely on the CAD tool flow provided by the FPGA vendor. For the prototype system implemented for this thesis, Xilinx PlanAhead is used to perform PR compilation. The general hardware compile procedure is performed as follows for the prototype

system, but should generalize well to other vendors also. First, the user's HDL (written in the provided template) is synthesized to a netlist. This netlist is added to a new compile run in Xilinx PlanAhead, where it is assigned to all physical VFRs present in the system (in the case of the prototype, all four of them). The run is also configured to use the already placed and routed static logic, and PlanAhead ensures that all boundary-crossing wires are placed in the correct locations in the PRM. The compile run is initiated, and the tool maps, places and routes the user netlist in all VFRs (which are PRRs). One design is placed and routed in all VFRs because the Agent requires the flexibility to place the user's design in any VFR in the system; PRMs can only be configured inside PRRs for which they are specifically compiled. Bitstream generation is then completed, creating one partial bitstream for each physical VFR. These partial bitstreams will be used by the Agent to configure the user hardware into a running system when a VFR is booted in OpenStack.

Recall that user hardware is sent to the Agent as an image. All partial bitstreams generated by the compile are added to a compressed archive, and this archive is uploaded as an image to OpenStack by the user with the image management tool called Glance. When the Agent selects which physical VFR partition to use for the user's hardware, it simply selects the partial bitstream corresponding to that PRR, configures it, and disregards the others. Figure 3.7 shows the general compile flow, upload and boot procedure.

For the prototype, such a compile system is realized using a script-based approach. The user places their netlist in a specific folder, and then executes a compile script that uses PlanAhead to perform the aforementioned compilation steps and bitstream packaging. The end result is the zip archive containing the partial bitstreams, ready for upload as an image. Steps inside the dashed lines of Figure 3.7 are part of the compile script.
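The final packaging step of such a compile script might look like the sketch below. It is a hedged illustration only: the function name, the build directory layout and the `<name>_<region id>.bit` file naming are assumptions made for the example, and the PlanAhead invocation itself is deliberately left out.

```python
import zipfile
from pathlib import Path


def package_bitstreams(build_dir, design_name, num_regions, archive_path):
    """Collect one partial bitstream per VFR into a zip archive (the image).

    Assumes the compile produced files named <design_name>_<region id>.bit
    in build_dir; this naming is an assumption for illustration.
    """
    build_dir = Path(build_dir)
    archive_path = Path(archive_path)
    bitfiles = [build_dir / f"{design_name}_{region_id}.bit"
                for region_id in range(num_regions)]

    # Check up front so that an incomplete archive is never written.
    missing = [str(b) for b in bitfiles if not b.is_file()]
    if missing:
        raise FileNotFoundError("missing partial bitstreams: " + ", ".join(missing))

    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for bitfile in bitfiles:
            # Store only the file name so the Agent sees a flat archive.
            archive.write(bitfile, arcname=bitfile.name)
    return archive_path
```

Refusing to build the archive unless every region has a bitstream mirrors the requirement that the Agent be free to place the design in any VFR.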

[Figure 3.7: Compile flow. The user's HDL (user_ip.v) is synthesized with XST (ISE 13.4) to a netlist (user_ip.ngc). The PR compile with Xilinx PlanAhead 13.4 combines this netlist with the placed and routed static logic from the provider (static.ncd), and bitstream generation produces the user IP PR bitstreams. These are placed in a zipped folder and uploaded as an OpenStack image via glance.]

Chapter 4 SAVI Testbed Prototype

A prototype of the system described in Chapter 3 has been created using the SAVI testbed at the University of Toronto. The goal of the prototype system is to validate the system architecture and show its feasibility, as well as to evaluate and attempt to quantify the benefit of reconfigurable hardware resources in an Infrastructure-as-a-Service cloud system.

4.1 FPGA Hardware

The prototype is based on the SAVI testbed OpenStack cloud, with its ability to manage heterogeneous resources. The FPGA-based portion is implemented using the NetFPGA10G [45], available in the SAVI testbed as a non-virtualized resource in the form of a bare-metal server with a NetFPGA10G connected to a PCIe slot. The NetFPGA10G comes equipped with a Xilinx Virtex 5 VTX240T, four 10G Ethernet interfaces, and 128 MB of off-chip reduced latency DRAM. Although of an older generation than currently available devices, the Virtex 5 [51] is still large enough to realize a sufficiently non-trivial system that demonstrates the required functionality.

The static logic architecture uses as a base the NetFPGA10G open source infrastructure [52]. Certain components of the infrastructure were modified to realize the architecture described in Chapter 3:

1. Input Arbiter: The open source infrastructure provides an input arbiter that simply forwards packets from all four interfaces to a single output stream in a round-robin fashion. This was modified to drive several output streams, one corresponding to each physical VFR in the system, and a CAM was inserted in the pipeline to realize correct forwarding to said VFRs. The CAM is created using Xilinx IP [53]. Additional top-level ports were also added to the Input Arbiter that connect to the write ports of the CAM. These top-level ports then connect to several memory-mapped registers (implemented as GPIO peripherals) connected to the microprocessor system bus. This allows a Xilinx MicroBlaze soft processor to program the MAC address for any specific VFR by writing data to the GPIO registers. A single bit in one of the registers forms a Write signal to the CAM, which the software toggles to write the MAC and VFR region code into the CAM. The VFR datapath is clocked at a higher frequency than the embedded processor system, which may cause the CAM to be written several times in succession before the processor can write a 0 to the write bit. This is generally not a problem, however, since the same location will be written with valid data each time; it is only a minor inefficiency.

2. Output Queues: The NetFPGA10G infrastructure Output Queues were also modified to accommodate multiple incoming packet streams. Registers that contain the MAC addresses of each VFR were added, allowing the hardware to force source MAC address fields. Top-level ports were also added to program these registers via the same GPIO peripherals as the Input Arbiter.

All packet streams in the design are AXI Stream [54] interfaces, with a width of 256 bits running at 160 MHz, equating to just over 40 Gb/s peak throughput. The stream widths are kept overprovisioned at 256 bits for the sake of simplicity.
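The forwarding behaviour of the modified Input Arbiter and Output Queues can be illustrated with a small behavioural model. This is a software sketch only, not the hardware design: the class and method names are hypothetical, a single table stands in for both the CAM and the Output Queue registers (which hold the same MAC per region), and Ethernet frames are assumed to carry the destination MAC in bytes 0 to 5 and the source MAC in bytes 6 to 11.

```python
from typing import List, Optional


class StaticLogicModel:
    """Behavioural model of MAC-based steering to VFRs (illustrative only)."""

    def __init__(self, num_regions: int) -> None:
        # cam[i] holds the MAC programmed for region code i, or None if unset.
        self.cam: List[Optional[bytes]] = [None] * num_regions

    def write_mac(self, region_code: int, mac: bytes) -> None:
        """Model of a CAM write; repeating the same write is harmless."""
        if len(mac) != 6:
            raise ValueError("MAC address must be six bytes")
        self.cam[region_code] = mac

    def route(self, frame: bytes) -> Optional[int]:
        """Input arbiter: look up the destination MAC, return the region code."""
        dst = frame[0:6]
        for code, mac in enumerate(self.cam):
            if mac == dst:
                return code
        return None  # no VFR owns this destination address

    def force_source(self, region_code: int, frame: bytes) -> bytes:
        """Output queue: overwrite the source MAC with the region's own MAC."""
        mac = self.cam[region_code]
        if mac is None:
            raise ValueError("region has no MAC programmed")
        return frame[0:6] + mac + frame[12:]
```

Because a write always targets the location selected by the region code, writing the same entry several times in succession leaves the table unchanged, which is why the repeated CAM writes mentioned above are only a minor inefficiency.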
The prototype system contains four physical VFRs, each a Partially Reconfigurable Region (PRR). Each VFR region contains 11376 LUTs and 15 36K BRAMs. The number of four is chosen because it is a non-trivial number, and because it allows maintenance of

a region size that can implement meaningful hardware. It should also be noted that increasing the number of VFRs in the system also increases the number of required streaming interfaces, causing a large increase in required routing resources. Normally the placement and routing algorithms may handle this acceptably, but the problem is vastly more complicated due to the fact that PRRs are physically fixed, and the CAD tools are unable to do any optimization across the PR boundary. These constraints make it significantly more difficult for the CAD tools to meet timing, especially as the number of PRRs increases. This also contributed to the limiting of the number of VFRs.

Resource utilization for the static logic is shown in Table 4.1. The counts in Table 4.1 do not include the counts for the VFR regions. The design with four VFRs successfully meets timing with the AXI streams running at 160 MHz and the soft processor system running at 100 MHz. The VFRs are connected to the AXI streams, so they also run at 160 MHz and make this clock available to user hardware. User hardware must meet the 160 MHz timing constraint to work properly, as there are no other clocks provided.

Table 4.1: Resource Usage for System Static Hardware

Resource     Usage (Used / Device Total)
Flip-flop    29327 / 149760 (19%)
LUT          28711 / 149760 (19%)
36K BRAM     105 / 324 (32%)

4.2 Agent Software

The Agent for the prototype is implemented in Python, in keeping with the rest of the OpenStack project, and because Python provides high functionality with low coding effort. Although this can come at the cost of performance, the Agent is not required to be a high performance component. The Agent uses a collection of Python objects and standalone functions to accomplish

its required tasks, and is designed to be FPGA hardware agnostic, the idea being that the same Agent software can be used for any FPGA device realizing the VFR system, with minimal modifications, provided the communication protocol to the soft processor is implemented correctly. The software contains two global objects: a statustable object, which holds information about the status of the hardware, and a serial object, which provides RS232 send and receive functionality for communication with the soft processor system. The serial object is implemented using the PySerial library.

4.2.1 statustable and Associated Objects

The statustable global object stores the needed state of the entire FPGA hardware system. This includes:

1. The number of physical VFR regions in the managed FPGA hardware.
2. The FPGA system type. This refers to the FPGA device or part number, and the specific board it is mounted on. For example, the prototype system type is a NetFPGA10G.
3. A Python list of region objects. Each region object corresponds to one unique PRR (VFR) in the FPGA system.
4. A string containing information about the system type, useful for debugging.

The statustable object also provides several methods (functions) that allow the Agent to manage the FPGA hardware. The first is (in pseudocode):

statustable.program(bitpkg, macaddr, serial, uuid)

The first argument to the function (bitpkg) points to a compressed file containing the image: a collection of bitstreams, one matching each PRR in the system, that are essentially hardware images supplied by the user and sent to the Agent by the OpenStack controller. macaddr is a string containing the MAC address of the VFR about to be booted. serial is a reference to the serial object, and uuid is a string containing the OpenStack-generated UUID of the resource. Algorithm 1 shows the basic operation of the function.

Algorithm 1: The statustable.program() function
Data: bitpkg, regionlist, uuid, MAC
Result: An unconfigured VFR in regionlist is configured with the user hardware in bitpkg, and the MAC address is set to MAC

for region in regionlist do
    if region is not configured then
        // region is free, find bitstream file matching this region
        pkg = uncompress(bitpkg)
        for file in pkg do
            if match(file, "*_%d" % region.id) then
                region.configure(file, uuid)
            end
        end
        region.setmac(mac)
        return Success
    end
end
// a free region was not found, fail
return Failure
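Algorithm 1 might be realized in Python along the following lines. This is a minimal sketch under stated assumptions rather than the prototype's code: the Region class is a stub with no hardware access, the image is assumed to be a zip archive, bitstream names are assumed to end in `_<region id>.bit`, and the serial argument is omitted. Unlike the pseudocode, the sketch reports failure when the image holds no bitstream for the free region, which is a design choice made for the example.

```python
import re
import zipfile


class Region:
    """Minimal stand-in for the region object; hardware access is stubbed."""

    def __init__(self, region_id):
        self.id = region_id
        self.uuid = None
        self.mac = None
        self.isconfigured = False
        self.bitstream = None

    def configure(self, bitstream_name, uuid):
        # The real method would disable the stream interfaces, run the
        # programming tool, and re-enable the interfaces.
        self.bitstream = bitstream_name
        self.uuid = uuid
        self.isconfigured = True

    def setmac(self, mac):
        # The real method would send the MAC to the soft processor.
        self.mac = mac


def program(regionlist, bitpkg, macaddr, uuid):
    """Configure the first free region from the image; return True on success."""
    for region in regionlist:
        if region.isconfigured:
            continue
        # Region is free: find the bitstream compiled for this region.
        pattern = re.compile(r".*_%d\.bit" % region.id)
        with zipfile.ZipFile(bitpkg) as pkg:
            for name in pkg.namelist():
                if pattern.fullmatch(name):
                    region.configure(name, uuid)
                    region.setmac(macaddr)
                    return True
        return False  # free region, but no matching bitstream in the image
    return False  # no free region was found
```

Anchoring the pattern on the full file name avoids, for example, region 1 accidentally matching a bitstream compiled for region 11.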

The match function call in Algorithm 1 is a regular expression matching the region object identifier (an integer id in [0..N-1], where N is the number of PRRs) to a portion of a particular bitstream filename. The filename format of the bitstreams in the package is [name]_[region id].bit. For example, if the user's hardware system was named myhardware and was compiled for a system containing four VFRs, the bitstreams generated by the script-based compile would be named myhardware_0.bit ... myhardware_3.bit. This is how the Agent knows which bitstreams in the package correspond to which PRRs.

The second method in the statustable object is the opposite of the first:

statustable.release(serial, uuid)

This function is used to release a currently running VFR with UUID uuid and configure a black box PRM to remove the user hardware from the system. It also removes that resource's MAC address from the hardware (i.e. from the Input Arbiter CAM and the Output Queues registers). The basic operation is shown in Algorithm 2.

Algorithm 2: The statustable.release() function
Data: regionlist, uuid
Result: A configured VFR in regionlist is released

for region in regionlist do
    if region is configured AND region.uuid == uuid then
        // region matches uuid to be released, remove MACs and configure black box
        region.resetmac()
        bb = "blackbox" + region.id + ".bit"
        region.release(bb)
        return Success
    end
end
// the configured region was not found, fail
return Failure
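A Python rendering of Algorithm 2 could look like the sketch below. As before, the Region class is a hypothetical stub without hardware access and the serial argument is omitted. The black box file name follows the concatenation in the pseudocode, with the integer identifier converted to a string explicitly, as Python requires.

```python
class Region:
    """Minimal stand-in for the region object; hardware access is stubbed."""

    def __init__(self, region_id):
        self.id = region_id
        self.uuid = None
        self.mac = None
        self.isconfigured = False
        self.bitstream = None

    def resetmac(self):
        # The real method would send a zero MAC to the soft processor.
        self.mac = "00:00:00:00:00:00"

    def release(self, bitstream_name):
        # The real method would disable the stream interfaces and
        # configure the black box bitstream into the region.
        self.bitstream = bitstream_name
        self.uuid = None
        self.isconfigured = False


def release(regionlist, uuid):
    """Release the configured region owned by uuid; return True on success."""
    for region in regionlist:
        if region.isconfigured and region.uuid == uuid:
            # Region matches: remove its MAC and configure the black box.
            region.resetmac()
            blackbox = "blackbox" + str(region.id) + ".bit"
            region.release(blackbox)
            return True
    return False  # no configured region carries this UUID
```

Checking both the configured flag and the UUID means a stale UUID left on a free region can never trigger a second release.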

As can be seen in the program and release functions, the statustable object holds a list of non-global region objects, where each region object in the list corresponds to a PRR on the FPGA. The region object contains data describing a single VFR:

1. An integer identifier for the particular region.
2. A string containing a configured VFR's OpenStack UUID.
3. A boolean variable indicating whether or not the region is configured.
4. A string containing the MAC address of the region.

The region object also contains several methods for managing a single VFR. These are:

1. region.resetmac(serial): Send a zero MAC address over the serial connection for the soft processor to program into the VFR corresponding to region.id.
2. region.setmac(serial, newmac): Send newmac over the serial connection for the soft processor to program into the VFR corresponding to region.id.
3. region.configure(serial, bitstream, uuid, systype): Configure bitstream into the VFR region. This involves sending a command to disable the stream interfaces, receiving an acknowledgement, using Xilinx iMPACT to program the partial bitstream for the correct FPGA (dependent on systype), re-enabling the stream interfaces and receiving a final acknowledgement. Also sets region.isconfigured to True.
4. region.release(serial, bitstream, systype): Disable the stream interfaces as in the configure method, use Xilinx iMPACT to configure bitstream (a black box) into the FPGA dependent on systype, and set region.isconfigured to False.

Standalone Functions

Several standalone functions are also used to implement the required functionality of the Agent. These include functions to set up a TCP socket for communication with the OpenStack controller, to read and write bytes, null-terminated lines and files on said socket, and to set up and run a TCP server that the OpenStack controller can connect to. Also included is a main() function where execution begins when the Agent is started. This is described in the following section.

4.2.2 Initialization and Operation

The Agent execution starts in a main() function, which first sets up a serial connection (preferred to PCI Express for simplicity) to the FPGA hardware, specifically the soft processor, and queries the embedded system for information on the system. The embedded system responds with a string containing the number of VFRs and the system type. With this information, the Agent creates a new statustable object whose constructor sets up the list of region objects and initializes them. If the query to the embedded system fails, the Agent will retry several times and exit with a failure message if there is no response.

Upon initializing the statustable object, the Agent acts as a server, listening on a predefined TCP port for incoming connections from the OpenStack controller. The Driver, which integrates with the controller, gets the connection parameters from the resource flavor information passed to it after a boot command is issued by the user. The Driver sets up a TCP connection to the Agent when a resource needs to be booted or torn down, and the Agent will accept the connection and begin reading null-terminated lines from the Driver. These lines contain string-based commands followed by other string-based parameters and files. The Agent carries out programming the resource as described in other sections, and a successful return is propagated back to the user. Any failure along the way is also propagated back to the user.
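The null-terminated line framing used between the Driver and the Agent can be sketched with the standard socket module as below. The helper names are hypothetical, and the wire layout of a request (a PROGRAM line, then UUID and MAC lines, then a decimal length line followed by the raw image bytes) is an assumption made for illustration; the text above does not specify how files are delimited on the socket.

```python
import socket


def send_line(sock, text):
    """Send one ASCII line terminated by a null byte."""
    sock.sendall(text.encode("ascii") + b"\x00")


def read_line(sock, max_len=4096):
    """Read bytes up to the next null terminator and return them as a string."""
    buf = bytearray()
    while True:
        byte = sock.recv(1)
        if not byte:
            raise ConnectionError("peer closed before the line terminator")
        if byte == b"\x00":
            return buf.decode("ascii")
        buf += byte
        if len(buf) > max_len:
            raise ValueError("line exceeds maximum length")


def read_exact(sock, count):
    """Read exactly count bytes (used for the binary image)."""
    data = bytearray()
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("peer closed during file transfer")
        data += chunk
    return bytes(data)


def read_program_request(sock):
    """Parse a PROGRAM request using the assumed layout described above."""
    command = read_line(sock)
    if command != "PROGRAM":
        raise ValueError("unexpected command: %r" % command)
    uuid = read_line(sock)
    mac = read_line(sock)
    size = int(read_line(sock))     # assumed length prefix for the image
    image = read_exact(sock, size)  # may itself contain null bytes
    return uuid, mac, image
```

Sending the image with an explicit length rather than a terminator matters because bitstream data is binary and can contain null bytes.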

Figure 4.1 shows a sequence diagram of the entire boot procedure. A User would use OpenStack Nova to issue a Boot VFR command and the OpenStack Controller would map the resource Flavor to the correct Driver (the VFR Driver) and relay the Boot VFR command. The VFR Driver extracts the Agent IP address and port and sets up a TCP connection. Then, the VFR Driver sends a PROGRAM command, followed by arguments containing the Resource UUID, the Resource MAC address, and the Image (PR bitstreams). The Agent calls statustable.program(), which configures the PR bitstream from the image into an available VFR and programs the MAC address. Success is returned to the Driver, which closes the TCP connection, and returns Success to the OpenStack controller, which in turn returns Success and a reference to the User.

[Figure 4.1: A sequence diagram of the entire boot procedure in the SAVI testbed prototype system.]

Chapter 5 Platform Evaluation and Application Case Studies

The previous chapters have described the architecture for FPGA virtualization and the prototype system in the SAVI testbed used to realize it. In this chapter, a brief analysis of the prototype is presented. Comparisons are made in terms of time to boot resources (comparing VMs and VFRs), and the trade-off between virtualized and non-virtualized hardware is discussed. Furthermore, two application case studies are presented: hardware applications run in the testbed using the virtualization infrastructure.

5.1 Platform Evaluation

It is desirable to see how efficiently OpenStack can handle these new resources, relative to existing Virtual Machines. At the very least, one would expect that a VFR can be allocated in the same amount of time as a VM, or less. An experiment was set up to determine how quickly a VFR can be booted compared to a VM. The experiment attempts to measure the amount of time from when the command is issued to OpenStack to the point at which the resource becomes usable. Usability will obviously differ for different resources, and the term is defined for VMs and VFRs as

follows:

Virtual Machine: A VM is defined as usable at the point where an SSH connection can successfully be established.

Virtualized FPGA Resource: A VFR is defined as usable at the point where the hardware can successfully process packets as its design intends.

The VFR hardware used for the experiment is a simple Layer 2 packet reflector. An incoming packet is buffered, and has its source and destination MAC address fields swapped. The packet is sent back out, effectively returning to its original sender. This provides a low-latency method of determining whether a VFR is usable or not.

The experiment is run as follows: From a node within the SAVI testbed, the boot command to allocate a VFR is entered. Time is measured while a small program sends out a packet destined to the MAC address of the new VFR (extracted from OpenStack) every tenth of a second. Between transmissions, the program listens for the reflected response and, if one is found, exits and stops the timer. This is done five times and an average is taken. For the VM, the boot command is entered and time is measured while a shell script continually attempts to make an SSH connection. The flavor of the VM is a small size, with 2048 MB of RAM and one virtual CPU. The timer stops once a successful connection is made. Again this is done five times and an average is taken. Results of these averages are shown in Table 5.1.

Table 5.1: Boot Times for VMs and VFRs

Resource           Boot Time (seconds)
Virtual Machine    71
VFR                2.6

As can be seen in the results, a VFR can be booted much faster than a VM. This is unsurprising, however, considering that it takes on the order of milliseconds to partially reconfigure a PRR over JTAG. The time to reconfigure a PRR may increase with the

physical size of the region; however, it would still be on the order of milliseconds. A VM must first be initialized by the hypervisor, and would then generally take around the same time to boot up as a regular non-virtualized machine. It can be concluded, though, that VFR systems can be scaled up or down relatively quickly.

Virtualization Trade-offs

Virtualization usually comes with trade-offs in terms of performance: Virtual Machines typically have worse performance than a bare-metal machine due to the underlying abstractions, although much progress has been made in reducing this performance gap to the point where it is very small, within a few percentage points [55, 56]. VFRs also come with some trade-offs in terms of performance, as well as some that are not analogous to processor virtualization.

First, the performance trade-offs are analyzed. Since VFRs are network-connected accelerators, performance can be framed in terms of throughput and latency. The static logic in the virtualization system will cause additional latency, although the system has been designed to balance this trade-off in favour of the user's hardware. The baseline for the following comparisons is the open source NetFPGA10G hardware, unmodified.

In the virtualization system, the input arbiter is modified to include a CAM. This CAM is generated using Xilinx IP and adds one cycle of latency, but this only happens once per packet. This is because only the first flit of the packet is needed by the CAM to decide which VFR to route the packet to, after which one flit is transferred per cycle as normal. From the arbiter the streams connect to each physical VFR wrapper module. Recall that the wrapper modules contain logic under control of the cloud provider, and the Partially Reconfigurable Region (PRR) into which the user hardware gets configured. The wrapper module contains two FIFO packet buffers: one to buffer packets going into the PRR, and another to buffer outgoing packets. Each of these buffers adds another cycle of latency. The output queues in the NetFPGA10G hardware were also modified to force outgoing packet source MAC addresses, but this is

done completely in combinational logic during the cycle in which the hardware decides to which output port the packet should be directed, therefore adding no additional latency.

Thus, in terms of performance, the user sees only a one-cycle pipeline stall per packet and two additional cycles of latency, which, at 160 MHz (6.25 ns per cycle), is a mere 12.5 nanoseconds. Buffering will not affect throughput, but the one-cycle stall per packet will. The larger the packet, however, the less the throughput is affected, because there are more flits per packet over which the stall is amortized. Peak theoretical throughput of the datapath is 256 bits per cycle, or 256 bits / 6.25 ns = 40.96 Gb/s, but with a minimum size packet of 64 bytes (two flits, so three cycles including the stall) this falls to 512 bits / 18.75 ns = 27.3 Gb/s. For maximum size Ethernet packets of 1500 bytes the maximum throughput increases to 40.11 Gb/s.

There are, however, additional penalties for virtualization in terms of available area for use in the FPGA fabric. For example, in the prototype system, each VFR has 11376 LUTs, which amounts to 7.6% of the entire device, or 9.4% when considering fabric already taken by the static logic. This was the maximum VFR area that could be achieved in a four-VFR system while manually placing and adjusting the PRRs. In a production system, a cloud provider would likely have several different resource flavours of VFR, with different partition sizes that could help to alleviate this problem.

5.2 Case Study: Load Balancer

In this section the first application case study for the virtualization system is presented: an application-layer load balancer. Load balancers are an essential part of large datacenter systems, allowing requests to be distributed among active servers, increasing performance and system stability.
In a virtualized cloud environment, however, users generally would not have access to hardware load balancers, especially not if their incoming data or requests were in a proprietary or non-HTTP protocol. Their only option is to use a VM-based software load balancer. In addition, this type of end-user application will become increasingly important as datacenter networks become virtualized through software defined networking (SDN), and users have full control over private internal Ethernet LANs. This case study shows how VFRs can be used to implement such an arbitrary-protocol load balancer that can vastly outperform a software version in terms of throughput and latency predictability.

5.2.1 Load Balancer Designs

The load balancer operates on a hypothetical protocol that runs on top of UDP. The protocol has an identifier field in the first 16 bits of the UDP payload, which the load balancers recognize. There are two identifiers: one designates request or data packets, which come from a client and must be distributed to servers, and the other designates update packets, which are sent from servers to the load balancers so that the servers can be added to or removed from the distribution system.

Software

The software load balancer is written in Python using the low-level Berkeley Sockets library [57]. A list is used to track active servers to distribute request packets to, and incoming request packets are distributed in round-robin fashion. If an update packet is received from a server, that server's IP address is added to or removed from the distribution list accordingly.

Hardware

The hardware load balancer is implemented in a similar manner. Since the VFR stream interfaces are 256 bits wide, the first two flits of the packet are required to detect the identifier field in the UDP payload. If an update packet is detected, the hardware stores the server's IP address and MAC address in a memory. For request packets, the hardware will replace the packet's destination IP and destination MAC fields with values read from

the memory location corresponding to the current destination server in the round robin schedule. The packet's UDP checksum is then recalculated, and the rest of the packet is buffered and sent along.

Because the interface to the VFR is a Xilinx AXI Stream, it was possible to compile this hardware using Xilinx's high level synthesis tool, Vivado HLS [58], and simply instantiate the generated HDL within the VFR template file. The design is described in approximately 150 lines of C code, and compiled for the virtualization system as in Figure 3.7. Using this method, the design iteration time is reduced from days or weeks to as little as a few hours. The resource utilization for the hardware load balancer is shown in Table 5.2.

Table 5.2: Resource Usage for VFR Load Balancer

Resource     Usage (Used / VFR Total)
Flip-flop    7523 / 11376 (67%)
LUT          3594 / 11376 (31.6%)
36K BRAM     11 / 15 (74%)

Many sources of frustration are also removed from the design cycle because the static logic is already placed, routed, tested and guaranteed working by the cloud provider. The user need not worry about complex memory interfaces, pin constraints and I/O, or special timing considerations for high speed transceivers.

Because the hardware is just another generic cloud resource, it is possible to quickly scale up or down the number of active load balancers according to system-level loads. This can also be done much faster than with the software version, since VFRs can be allocated much more quickly than VMs.
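The round-robin logic shared by both load balancer designs can be sketched as a small Python class. This is an illustrative reconstruction, not the code used in the case study: the identifier values, the one-byte add/remove flag following the identifier, and all names are hypothetical, since the text only states that a 16-bit identifier leads the UDP payload and that update packets add or remove the sending server.

```python
import struct
from typing import List, Optional

# Hypothetical protocol constants; the text does not give actual values.
REQUEST_ID = 0x0001
UPDATE_ID = 0x0002
UPDATE_REMOVE = 0
UPDATE_ADD = 1


class RoundRobinBalancer:
    """Tracks active servers and picks a destination for each request."""

    def __init__(self) -> None:
        self.servers: List[str] = []
        self._next = 0

    def handle(self, payload: bytes, sender_ip: str) -> Optional[str]:
        """Return the server IP to forward to, or None if nothing is forwarded."""
        if len(payload) < 2:
            return None
        (identifier,) = struct.unpack("!H", payload[:2])  # first 16 bits
        if identifier == UPDATE_ID:
            self._update(payload, sender_ip)
            return None
        if identifier == REQUEST_ID and self.servers:
            self._next %= len(self.servers)  # stay in range after removals
            server = self.servers[self._next]
            self._next += 1
            return server
        return None

    def _update(self, payload: bytes, sender_ip: str) -> None:
        """Add or remove the sending server (assumed one-byte flag)."""
        if len(payload) < 3:
            return
        if payload[2] == UPDATE_ADD and sender_ip not in self.servers:
            self.servers.append(sender_ip)
        elif payload[2] == UPDATE_REMOVE and sender_ip in self.servers:
            self.servers.remove(sender_ip)
```

In a software deployment, a loop around `socket.recvfrom` and `socket.sendto` would feed each datagram to `handle` and forward the payload to the returned address; the hardware version performs the equivalent lookup on the first two flits of each packet.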

5.2.2 Performance Analysis

Test System and Prototype

The prototype system in the SAVI Testbed is used for all tests. Recall that the SAVI Testbed provides an extended cloud consisting of Core nodes (conventional cloud compute) and Edge nodes that contain additional resources such as the NetFPGA platforms used for the prototype. All of the experiments are limited in scope to a single edge node in the SAVI testbed.

Comparison of VM and VFR Load Balancers

An experiment is set up to compare latency and throughput of the different load balancers. The experiment uses a single load balancer, three receiving servers, and two clients. The servers and clients are all VMs spread amongst two physical machines. Care is taken to make sure that the two clients are on different physical machines so that they do not share a single physical interface and interfere with one another. One client is used to take measurements of latency, while the other is used to inject additional traffic at prescribed rates onto the network, bound for the load balancer. A third physical machine hosts the NetFPGA 10G platform, which in turn provides the VFRs to OpenStack as described in the prototype. The machines are connected via the SAVI Testbed OpenFlow network. The network physical layer is implemented via a gigabit switch, with four 10G ports connecting the NetFPGA and VFRs. The test setup is shown in Figure 5.1.

The latency is measured as the round trip time from Client 1, through the VM or VFR Load Balancer, to a Server, and back to Client 1, averaged over 10000 mid-sized (760 byte) packets. The load balancer is on the forward path, but not the return path. The latency is continually measured and the number of dropped packets counted as Client 2 is used to inject additional traffic into the load balancer. In this way it is possible to see what level of throughput the load balancer can handle. Figures 5.2 and 5.3

show the results for the VFR and VM load balancers as a function of the Injection Rate from Client 2. Each latency measurement is an average of 10000 packets, so the standard deviation is included as well, which gives a sense of system load and performance predictability. A socket timeout at one second is counted as a dropped packet, and is not counted in the latency average. Transmission continues until 10000 packets are successfully received back.

[Figure 5.1: Experiment setup for load balancer tests. Labels: Client 1 (Measure), Client 2 (Inject), three Servers, VM Load Balancer, VFR Load Balancer, Physical Servers 1 to 3 with hypervisors, and a switch with 1G and 10G links.]

Dropped packets for the VM load balancer at each data point are shown in Figure 5.4. The VFR load balancer had no dropped packets. Already at 25 MB/s, the VM load balancer begins to drop a number of packets. At 45 MB/s, the VM load balancer drops a significant number of packets, and latency increases and varies widely. The VFR balancer, however, has predictable and fairly constant latency. Even at 100 MB/s the VFR balancer dropped no packets and maintained predictable performance. Latency in general was dominated by software in both send and receive. Unfortunately, the test system did not allow testing of the full capabilities of the VFR, since there were not enough physical servers available in the edge node to saturate the 10GE links to the VFR load balancer. Clearly though, the VFR provides a significant benefit to the cloud user: several VMs can be replaced with a single VFR, simplifying and streamlining the user's systems, and potentially lowering operating costs. The cloud provider also benefits by having fewer VMs per user, which may reduce overall power

Figure 5.2: VFR load balancer latency at different throughput levels. (Plot of VFR latency and its standard deviation, in microseconds, against injection rate in MB/s.)

Figure 5.3: VM load balancer latency at different throughput levels. (Plot of VM latency and its standard deviation, in microseconds, against injection rate in MB/s.)

consumption and costs of the datacenter.

Figure 5.4: Number of dropped packets for the VM load balancer. (Bar chart of dropped packets against injection rate in MB/s.)

5.3 Case Study: Extending OpenFlow Capabilities

Software-Defined Networking (SDN) is a growing paradigm that sees network management move from a decentralized approach to a centralized, software-managed system. Network management rules are centrally defined for an entire network, and these rules are then translated into matching and forwarding actions for the data plane. OpenFlow is a popular realization of SDN that has also significantly affected the development of SDN in general [59, 60].

OpenFlow operates by having a central controller specify rules that packets coming into the network are matched against. These matches and the specific packet fields they operate on form what are known as flows. A flow consists of a match field, a priority field, statistics counters, timeouts, and most importantly, actions. When an incoming packet is matched to a flow (if several flows match, the one with the highest priority is taken), the action(s) specified in the flow are applied to the packet. This may simply be to forward the packet out of a particular switch port, or it may be to drop the packet, modify certain fields in the packet header, swap certain fields or insert VLAN (Virtual Local Area Network) [61]

tags or MPLS (Multi-Protocol Label Switching) [62] shims. In current OpenFlow-enabled switches, not all of these actions are supported in the hardware datapath; certain actions, usually the more complex ones, must be processed in switch software, which is a much slower path. Many desirable actions are not even possible on current OpenFlow switches, and some desirable actions are not even specified in the OpenFlow specification [63].

In this case study, virtualized hardware is used to extend the capabilities of OpenFlow. By forwarding matched packets to a VFR, custom hardware can be used to implement arbitrary actions or matches on packets at 10 Gb/s line rates. In particular, this case study looks at actions pertaining to the Virtual eXtensible Local Area Network protocol, or VXLAN, which allows bridging of LANs over IP.

5.3.1 VXLAN

Virtual eXtensible Local Area Network (VXLAN) is a method of bridging two OSI Layer 2 networks by encapsulating the Layer 2 packets inside UDP packets. This virtually connects two Layer 2 networks. The packet structure for VXLAN is straightforward. It consists of a standard UDP packet, with a destination port number of 4789, and a payload consisting of a VXLAN header and the encapsulated Layer 2 packet. Figure 5.5 shows the format of a VXLAN packet.

Currently, OpenFlow does not specify any actions that can act on the encapsulated packet, or in fact on anything that lies within the payload of the encapsulating UDP packet. It would be very useful from a network management perspective to be able to manage virtually connected LANs directly in OpenFlow without having to invest in additional infrastructure. This amounts to the ability to match and/or edit fields within the encapsulated packet.
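The VXLAN payload layout described above can be illustrated with a minimal Python sketch. It is not part of the thesis implementation. It assumes the 8-byte header layout given in Figure 5.5 (8 bits, 24 reserved bits, a 24-bit network identifier, 8 reserved bits), and it sets the "valid VNI" flag bit (0x08) in the first byte as RFC 7348 requires; the function names are hypothetical.

```python
import struct

VXLAN_UDP_PORT = 4789        # destination UDP port named in the text
VXLAN_HEADER_LEN = 8         # bytes
VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag from RFC 7348 (an assumption beyond the text)


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit network identifier."""
    if not 0 <= vni < (1 << 24):
        raise ValueError("VNI must fit in 24 bits")
    # Word 0: flags in the top byte, 24 reserved bits below it.
    # Word 1: VNI in the top 24 bits, 8 reserved bits below it.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vxlan_header(header: bytes) -> int:
    """Return the 24-bit VNI stored in an 8-byte VXLAN header."""
    if len(header) < VXLAN_HEADER_LEN:
        raise ValueError("VXLAN header is 8 bytes long")
    _flags_word, vni_word = struct.unpack("!II", header[:VXLAN_HEADER_LEN])
    return vni_word >> 8


def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """UDP payload of a VXLAN packet: header followed by the inner L2 frame."""
    return pack_vxlan_header(vni) + inner_frame


def decapsulate(udp_payload: bytes):
    """Split a VXLAN UDP payload into (vni, inner_frame)."""
    return unpack_vxlan_header(udp_payload), udp_payload[VXLAN_HEADER_LEN:]
```

Because the inner frame begins at a fixed offset of 8 bytes into the UDP payload, any field of the encapsulated packet can be located by ordinary header parsing once that offset is skipped.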

Figure 5.5: Packet diagram for the VXLAN protocol. The outer L2 Ethernet, L3 Internet Protocol, and L4 Transport Protocol (UDP) headers are followed by the VXLAN header (Reserved, 8 bits; Reserved, 24 bits; VNID, 24 bits; Reserved, 8 bits) and then the encapsulated packet (L2 Ethernet, L3 Internet Protocol, L4 Transport Protocol, ...).

5.3.2 Virtualized Hardware for New OpenFlow Capabilities

Hardware can be written and compiled for a VFR via the methods discussed in Chapters 3 and 4. For this application, the chosen enhancement is the ability to match on the fields of the encapsulated packet and either forward or drop the packet. This amounts to a custom, OpenFlow-controlled in-network firewall for VXLAN-tunnelled networks. The end result is that a user will be able to control forwarding in a VXLAN-bridged Layer 2 network using nothing but OpenFlow. In principle, of course, the VFR architecture allows access to the entire packet, and therefore any field could potentially be parsed, modified or inserted.

The VFR used for this case study is designed to match transport layer (i.e. TCP, UDP) port numbers inside VXLAN-encapsulated packets, which allows it to block certain communication protocols. The hardware is implemented in Verilog, and consists of a packet datapath, a programmable drop port register, and a simple control circuit. Control packets are defined as having the Ethernet Type field equal to 0x88B3. If the hardware detects a control packet in the datapath, it extracts the first 16 bits of the payload and places them in the drop port register. The control packet is then dropped. Any other packet will be parsed to detect destination UDP port 4789, signifying that

the packet is a VXLAN packet. If this port is detected, the hardware then tests whether the source or destination transport layer port of the encapsulated packet matches the one currently stored in the drop port register. If it matches, the packet is dropped. If not, the packet is simply forwarded. Using the control packets, the user can change which port number the hardware matches. The hardware module also contains a FIFO buffer to help the design meet timing. The design introduces no pipeline stalls and can run at full pipeline throughput. To avoid broadcast loops, any broadcast Ethernet packets entering the VFR are dropped.

All that remains is the question of how the VXLAN packets get to the VFR in the first place. This is achieved by programming several regular flows into the network such that all switches that receive packets with a destination UDP port of 4789 will forward the packet to the VFR. To avoid a forwarding loop, additional flows are added such that if the VXLAN packet ingress port corresponds to the VFR hardware, it is sent on to its original destination. In the SAVI Testbed, physical topologies can be queried by the user, and it is possible to determine exactly which physical or virtual switches lie on the path between VXLAN-tunnelled machines and the VFR. This allows the aforementioned flows to be set manually. Also recall that the SAVI Testbed has a fully virtualized network that allows users to allocate a private slice of the network with their own OpenFlow controller. In such an environment, the flows necessary for routing VXLAN tunnels through the VFR can be set up programmatically via the controller.

Figure 5.6 depicts the structure of the VXLAN port firewall. Incoming packets are buffered, and if a packet is a VXLAN packet (i.e. the destination UDP port is equal to 4789) and the encapsulated packet contains a transport layer port (that is, TCP or UDP) equal to the Drop Port register, the packet is dropped. If the packet is not a VXLAN packet, it is also dropped. Otherwise, it is simply forwarded out of the VFR, and regular flows move the packet to its original destination, avoiding a loop.
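The drop and forward rules described above can be summarized in a small behavioural model. The sketch below is in Python rather than the Verilog used for the actual hardware, and it makes several assumptions that the text does not state: frames are untagged Ethernet carrying IPv4 in both the outer and inner packets, the drop port register resets to 0, and control packets are recognized before the broadcast check. The class and function names are hypothetical.

```python
import struct

CONTROL_ETHERTYPE = 0x88B3  # control packets, as defined in the text
VXLAN_UDP_PORT = 4789       # destination UDP port that marks a VXLAN packet
ETHERTYPE_IPV4 = 0x0800
BROADCAST_MAC = b"\xff" * 6
PROTO_TCP, PROTO_UDP = 6, 17


def l4_ports(frame: bytes):
    """Return (protocol, src_port, dst_port, l4_offset) for an untagged
    Ethernet/IPv4 frame carrying TCP or UDP, or None if it is anything else."""
    if len(frame) < 34 or struct.unpack("!H", frame[12:14])[0] != ETHERTYPE_IPV4:
        return None
    ip_header_len = (frame[14] & 0x0F) * 4
    l4_offset = 14 + ip_header_len
    protocol = frame[23]
    if protocol not in (PROTO_TCP, PROTO_UDP) or len(frame) < l4_offset + 4:
        return None
    src_port, dst_port = struct.unpack("!HH", frame[l4_offset:l4_offset + 4])
    return protocol, src_port, dst_port, l4_offset


class VxlanPortFirewall:
    """Behavioural model of the VFR: process() returns True to forward a frame."""

    def __init__(self):
        self.drop_port = 0  # assumed reset value of the drop port register

    def process(self, frame: bytes) -> bool:
        if len(frame) < 14:
            return False
        ethertype = struct.unpack("!H", frame[12:14])[0]
        if ethertype == CONTROL_ETHERTYPE:
            # The first 16 bits of the payload become the new drop port;
            # the control packet itself is never forwarded.
            if len(frame) >= 16:
                self.drop_port = struct.unpack("!H", frame[14:16])[0]
            return False
        if frame[0:6] == BROADCAST_MAC:
            return False  # broadcast frames are dropped to avoid loops
        outer = l4_ports(frame)
        if outer is None or outer[0] != PROTO_UDP or outer[2] != VXLAN_UDP_PORT:
            return False  # not a VXLAN packet
        # Skip the 8-byte UDP header and the 8-byte VXLAN header.
        inner = l4_ports(frame[outer[3] + 16:])
        if inner is not None and self.drop_port in (inner[1], inner[2]):
            return False  # encapsulated source or destination port matches
        return True
```

A test harness would feed the model a control frame carrying a port number and then check that VXLAN frames whose encapsulated TCP or UDP header uses that port are rejected while other VXLAN frames pass.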

Figure 5.6: VXLAN Port Firewall. (Flow diagram: an incoming packet is tested by "IS VXLAN?" and "IS TP_PORT X?" and is either dropped or sent out; the VFR sits on the VXLAN tunnel between VM1 and VM2.)

5.3.3 Performance Analysis

The hardware for the VXLAN port firewall was written in Verilog and compiled using the procedure outlined in Section 3.5. This uses Xilinx ISE 13.4 to synthesize a netlist, and Xilinx PlanAhead to perform the PR compilation. Resource usage for the hardware is shown in Table 5.3. At only 12.7% of the LUTs and 15.6% of the flip-flops available in a VFR, the hardware is lean and leaves room for more features, such as matching multiple ports using a CAM, or matching and blocking based on other encapsulated packet fields.

Table 5.3: Resource Usage for VFR VXLAN Port Firewall

Resource     Usage (used / VFR total)
Flip-flop    1778 / 11376 (15.6%)
LUT          1446 / 11376 (12.7%)
36K BRAM     0 / 15 (0%)

Experimental Setup

The port firewall VFR is booted using OpenStack along with two VMs. OpenVSwitch (OVS), a virtual OpenFlow switch, is installed on both VMs and used to set up a VXLAN

tunnel between the two VMs. Note that the two VMs are booted on separate physical machines and are separated by a physical OpenFlow switch. The two VMs are connected through the physical OpenFlow switch to the VFR port firewall. Flows are installed in the virtual switches and the physical switch to redirect any VXLAN packet to the VFR hardware, so that it can perform its intended function. It would be preferable to use a custom OpenFlow controller to accomplish this rather than manually installing the flows, but at the time of writing this thesis, private OpenFlow networks were not fully compatible with heterogeneous resources in the SAVI Testbed. Therefore, the flows are manually installed.

Results

After the flows are installed, VXLAN-encapsulated traffic between the two VMs operating on a given transport layer port is successfully blocked. The port can be changed at any time by sending a control packet with a different port number to the VFR.

The iperf tool is used to measure throughput from one VM to another, and the ping tool is used to measure latency. First, a baseline measurement is taken with no additional flows installed, to see what the throughput and latency are between the VMs without the VFR in the network path (Figure 5.7a). iperf is run five times and an average is taken, while ping is run until 20 pings are completed. These tests are run both with and without VXLAN tunnelling. Results are shown under No Tunnel and VXLAN Tunnel in Table 5.4.

Without tunnelling, the throughput is near line rate (1 Gb/s) as expected. When tunnelling using VXLAN, throughput takes a large performance hit. Normally this is not the case, but the Ubuntu operating systems used in the machines to boot the VMs have a dynamic CPU frequency scaling system to save power. The lower clock frequencies significantly affect the performance of the software switch (OpenVSwitch), since the VXLAN encapsulation and decapsulation are done inside OVS. The core frequency was

limited by Ubuntu to 1.2 GHz, which was still enough to saturate the 1 Gb/s link when not running a tunnel, but the processor was not able to maintain this bandwidth when running the VXLAN tunnel. This effect was verified by disabling the CPU frequency scaling and observing a rise in throughput over the VXLAN tunnel; however, the CPU scaling was left on for the experiments to maintain default settings across all VMs in the system.

Figure 5.7: Experimental setups for VFR-based VXLAN firewall: (a) baseline path between VM1 and VM2 through the switch, (b) VXLAN traffic rerouted through the bare-metal server (BM), (c) VXLAN traffic rerouted through the VFR.

Table 5.4: Throughput and Latency for VXLAN Port Firewall

Path                 Throughput    Latency
No Tunnel            941 Mb/s      0.465 ms
VXLAN Tunnel         517.4 Mb/s    0.532 ms
VXLAN through VFR    513.2 Mb/s    0.600 ms
VXLAN through BM     480.4 Mb/s    0.801 ms

Now that a baseline for comparison is established, the flows are installed to reroute VXLAN traffic to the running VFR, and iperf is run again to determine the overhead of rerouting the VXLAN traffic through the VFR port firewall, as shown in Figure 5.7c. An average of five runs gives a throughput of 513.2 Mb/s, slightly less than the 517.4 Mb/s achieved without the VFR in the network path. This is a difference of less than 1%, and it can be concluded that this technique introduces little to no overhead in terms of throughput. The ping test is also run again, and results show that rerouting to the VFR introduces a

small increase in latency (about 13%), but this is expected when adding an additional hop. These results are also summarized in Table 5.4 under VXLAN through VFR.

Lastly, it is also useful to see how the VFR compares in performance to a software version executing the same task. A software VXLAN port firewall is programmed in C, based on the packet capture library libpcap. Normally it would be best to run the software on a VM in the cloud, as VMs are the usual processing resource available. However, many VM network interfaces, including those in the current SAVI Testbed, cannot be put into promiscuous mode. In promiscuous mode, a network interface passes along all detected packets, even those not addressed to the receiving interface. To implement a pass-through firewall, a promiscuous interface is required. Therefore, for the sake of the experiment, the software is run on a non-standard bare-metal server whose network interface can be configured into promiscuous mode. The packet flows installed on the switch are slightly modified to enable the network path shown in Figure 5.7b, where VXLAN packets are sent to the bare-metal server (BM) for software processing rather than to the VFR hardware.

The iperf and ping tests are run again, and the results are also shown in Table 5.4 under VXLAN through BM for comparison. The software introduces a rather large increase in latency compared with both the VFR and the direct path from one VM to another. This is expected, since each packet must travel up through several software layers before being processed. That being said, network stacks in modern servers are very efficient, and this is evident in the throughput of 480.4 Mb/s, only about 6.4% lower than that of the VFR. The main reason for this good performance, however, is that the application is relatively simple. For a single port check, the software is only required to move a pointer and make a comparison, which amounts to no more than a few tens of instructions, and then either send or drop the packet. More complex operations, such as an IP address Longest Prefix Match or multiple exact matches, would

likely widen the performance gap, since they can be fully pipelined in hardware. Such enhancements are left to future work.

Summary

This case study has shown that VFRs can be used in a straightforward manner to enhance the capabilities of an OpenFlow network, allowing custom, in-network matching and actions to be added to the system with little to no degradation in throughput. This comes at the cost of slightly increased latency. It has also been shown that even for a simple application, a VFR will outperform a software implementation. VFRs thus provide a promising path toward fully custom complex network configurations, with all data plane operations done entirely inside hardware on high-speed paths.
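The overheads discussed in Section 5.3.3 can be recomputed directly from the measurements in Table 5.4. The short Python sketch below does only that arithmetic; the dictionary and function names are hypothetical, and the numbers are copied from the table.

```python
# Values copied from Table 5.4: (throughput in Mb/s, latency in ms).
TABLE_5_4 = {
    "No Tunnel": (941.0, 0.465),
    "VXLAN Tunnel": (517.4, 0.532),
    "VXLAN through VFR": (513.2, 0.600),
    "VXLAN through BM": (480.4, 0.801),
}


def relative_change(new: float, base: float) -> float:
    """Percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0


def compare(path: str, baseline: str):
    """Return (throughput change %, latency change %) of `path` vs `baseline`."""
    new_tp, new_lat = TABLE_5_4[path]
    base_tp, base_lat = TABLE_5_4[baseline]
    return relative_change(new_tp, base_tp), relative_change(new_lat, base_lat)


if __name__ == "__main__":
    for path, baseline in [
        ("VXLAN through VFR", "VXLAN Tunnel"),
        ("VXLAN through BM", "VXLAN Tunnel"),
        ("VXLAN through BM", "VXLAN through VFR"),
    ]:
        tp, lat = compare(path, baseline)
        print(f"{path} vs {baseline}: throughput {tp:+.1f}%, latency {lat:+.1f}%")
```

With the table's values, this gives roughly a 0.8% throughput reduction and a 12.8% latency increase for the VFR path relative to the plain tunnel, and roughly a 6.4% throughput reduction and a 33.5% latency increase for the bare-metal path relative to the VFR path.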

Chapter 6. Conclusion

This thesis has presented a hardware and software architecture that represents a first attempt at integrating FPGA hardware acceleration as a first-class citizen into cloud computing infrastructures. This work shows that FPGA hardware acceleration can be virtualized and feasibly integrated with existing infrastructure-as-a-service cloud management software. It also shows that FPGA Partial Reconfiguration (PR) can be used to virtualize an FPGA device with very little performance overhead in terms of throughput and latency when considering network-connected accelerators. The architecture presented shows that this style of virtualization can be very accessible to the end user as well, trading off potential application scope for a template-based automated compilation system that effectively removes many common design complexities. End users also benefit from the fact that the very same management commands for traditional cloud resources are used to manage the new heterogeneous resources.

6.1 Future Work

As stated, this work is merely a first attempt at the integration of FPGAs into cloud computing. There are many open avenues of exploration, both for the general concept and for the particular architecture presented in this thesis. This section on Future Work

will discuss some of these possible avenues, focusing mainly on the work presented in this thesis.

6.1.1 Architectural Enhancements

While the Agent is fairly generic and not performance critical, there are areas in which the FPGA static hardware virtualization layer could be improved. VFRs are limited in size, and many applications may need more than one to fully implement a desired function. The performance of such a system may be heavily dependent on where each constituent VFR is placed physically by the scheduling algorithm. A commercial system could consist of hundreds or thousands of FPGA systems spread across one or possibly several datacenters. VFRs that are part of the same system could take a severe latency penalty if placed very far away from each other. This leads to two related future work items:

1. VFR-aware resource scheduling: modify the scheduling algorithm to recognize that VFRs that are part of the same system (or belonging to the same user) should be placed as close as possible to each other physically. Ideally, they should be placed in the same physical device. In the current implementation, however, even this placement is not as good as it could be, because transferring packets between VFRs still requires the packet to be switched external to the FPGA.

2. Static logic switching fabric: related to the second point of the first item above, a switching fabric in the static logic would allow packets addressed from one VFR to another in the same device to be redirected without exiting the device. VFRs could be chained together with extremely low-latency paths, leading to higher performance for larger, multi-VFR systems.

Currently, the virtualization architecture lacks a dedicated promiscuous interface function. A promiscuous interface is one that allows any and all packets detected on

the wire into the device, regardless of whether they are addressed to the device. For example, all ports on a regular Ethernet switch would be promiscuous. What this means is that it is difficult for a user to use a VFR to do generic switching and processing, because the static logic only allows addressed packets into a VFR. A useful enhancement would be a mechanism that allows VFRs to act in a promiscuous mode, while still enabling the provider to enforce security.

Another thing to consider, especially from the view of the cloud provider, is what assumptions can be made about the user-designed hardware. A commercial provider would likely make very few assumptions: user-designed hardware may be bug-ridden, faulty, or have any number of other problems. Error detection and fault tolerance in the face of shoddy or unpredictable user hardware may very well become a top priority. For example, if one VFR begins to make transfers incorrectly on the stream interface, the entire static logic may be affected, at the very worst locking up and forcing an entire chip reconfiguration. A poorly implemented VFR may also transmit packets below the network minimum packet size (e.g. below 64 bytes for Ethernet networks), which could result in network problems for the provider. Because of these problems, much research in fault-tolerant embedded systems could likely be applied to the static logic to improve its commercial viability.

Additionally, the memory system connecting the VFRs to off-chip DRAM is currently simple and inefficient in its partitioning scheme. Future work may address this problem by replacing the partitioning scheme with a full multi-port memory controller or similar technology to more efficiently share the off-chip DRAM.
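The VFR-aware scheduling item in Section 6.1.1 could be prototyped along the following lines. This is a minimal Python sketch of one possible placement policy, not the scheduler used in the prototype. The three-level distance metric (same device, same rack, elsewhere) and all names are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Fpga:
    """One physical FPGA device offering a number of free VFR slots."""
    name: str
    rack: str
    free_vfrs: int
    tenants: list = field(default_factory=list)  # users with VFRs on this device


def distance(a: Fpga, b: Fpga) -> int:
    """Assumed coarse distance: 0 same device, 1 same rack, 2 otherwise."""
    if a.name == b.name:
        return 0
    if a.rack == b.rack:
        return 1
    return 2


def place_vfr(user: str, fpgas: list):
    """Pick the device with a free VFR that is closest to the user's existing
    VFRs; break ties by preferring the device with the most free slots.
    Returns the chosen device, or None if no VFR is free."""
    candidates = [f for f in fpgas if f.free_vfrs > 0]
    if not candidates:
        return None
    hosts = [f for f in fpgas if user in f.tenants]

    def cost(f: Fpga) -> int:
        # Worst-case distance to any VFR the user already holds (0 if none).
        return max((distance(f, h) for h in hosts), default=0)

    best = min(candidates, key=lambda f: (cost(f), -f.free_vfrs))
    best.free_vfrs -= 1
    best.tenants.append(user)
    return best
```

A production scheduler would also have to weigh load, fairness between users, and the network topology between devices, none of which this sketch models.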
6.1.2 Failures and Migration

As briefly discussed in the previous subsection, VFRs may occasionally fail due to a variety of circumstances such as faulty user hardware, a provider static logic bug or even network faults. In the case of a Virtual Machine, the hypervisor always retains the ability

to kill or reboot a faulty virtual machine; however, this feature does not translate well to VFRs, since they are so tightly integrated with the static logic that performs hypervisor-like functions. Future work would devise methods of maintaining a stricter separation, again possibly drawing on fault-tolerant embedded systems research, to allow the static logic and/or Agent to deal with faults in ways that do not involve bringing down the entire system.

Another key capability of Virtual Machines that future work would seek to replicate for VFRs is migration. Migration involves saving the state of a running VM, transferring it to another hypervisor, and then continuing execution in the new location, all done transparently to the user. This problem is more difficult for FPGAs because they are not designed with this feature in mind, and the vendor tools do not support moving a PRM between different PRRs. Migration would require accessing a running circuit's state and FPGA configuration bits, saving them, and reprogramming the circuit on another FPGA device.

6.1.3 Further Heterogeneity

In the future, a provider would likely have multiple flavors of VFRs, each with a different size. Larger VFRs would provide more LUTs for more complex circuits, but would cost more for end users. This is analogous to how one can allocate VMs with different amounts of RAM and virtual CPUs, and different operating systems.

New FPGA devices that combine ARM CPUs with traditional reconfigurable fabric have also recently entered the market [64, 65]. Future work could investigate how these devices could be virtualized as resources that provide an ARM-based CPU or VM together with a slice of closely coupled reconfigurable fabric.

6.1.4 Applications

The application case studies in this thesis have only described a taste of what may be possible with this type of system. Multi-VFR applications have yet to be investigated

and analyzed. The concept of extending available OpenFlow capabilities may also be an area of potential future work. There are other tunnelling protocols similar to VXLAN that could be added to OpenFlow (e.g. GRE tunnelling), and there are many other higher layer protocols that OpenFlow does not consider. Although these higher layer protocols are complicated because of the large amount of state information required, there is still potential for implementation via VFRs.

The current VXLAN port firewall application is also an area of future work. The VFR resource usage was low, and this leaves room for additional functionality, such as matching and blocking based on multiple ports, encapsulated Layer 2 and Layer 3 headers, or another part of the packet payload. Furthermore, the experiments revealed a significant loss in performance due to OVS having to encapsulate and decapsulate packets in software. This function could also be moved into hardware via a VFR, with the encapsulation and decapsulation done at line rate, incurring no losses in performance.

VFRs are also well suited to the many streaming protocols that exist. Real-time Transport Protocol (RTP) packet streams could be redirected to VFRs for hardware-accelerated processing or custom switching and routing. RTP is used for many applications such as telephony, video streaming and Voice over IP (VoIP). Other applications may include event stream processing or financial data stream processing.

6.1.5 FPGA CAD as a Service

Recall that the user must compile their hardware for VFRs in a way that uses pre-placed and routed static logic. In the implementation discussed in this thesis, this is done using an automated script-based compile system. Currently a user must have access to the entire project, but this is not feasible in a production environment. Ideally, the user would submit their Verilog module to a cloud-based compile service that would analyze and compile their hardware to run on the cloud provider's systems. CAD as a service as

a general topic may also be an interesting direction for future work, to see how it could be useful for traditional hardware design and also for cloud-based hardware as described in this thesis.

6.1.6 Complementary Studies

Overhead Area Trade-off Study

In the future, a study could be performed to determine how the overhead of the static logic (in terms of FPGA area) scales with the number of VFRs implemented on a physical chip. Larger device sizes could also be examined. Ideally, an optimal point would be discovered where the lowest static logic overhead gives the highest percentage of chip area to VFRs.

Deployment Costs Analysis

In this future study, a full, commercial-scale deployment of a VFR system would be examined. The primary focus would be on costs and considerations from a cloud provider perspective. Approximate values for how many FPGAs and VFRs of different flavors could fit in one standard rack would be determined, as well as how much they would cost. This will depend on the system architecture, on whether or not there are failsafe or redundant devices, and on the list prices of FPGA chips and boards, which are subject to volume and availability. Approximate power consumption and other operating costs would be compared to Virtual Machine deployments. The result of the study would show how virtualized FPGAs can provide benefits to providers by giving more computation per unit cost than Virtual Machines alone. With regard to power consumption and performance per Watt, recent work involving large-scale datacenter systems using FPGAs has shown that for a small (10%) increase in power, computation performance can be doubled [66]. These are promising results that can hopefully be replicated or improved with virtualized reconfigurable hardware.
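To make the performance-per-Watt remark concrete, the figures quoted from [66] (performance doubled for a 10% increase in power) can be turned into a single ratio. The Python snippet below is only this arithmetic, with a hypothetical function name.

```python
def perf_per_watt_gain(perf_factor: float, power_factor: float) -> float:
    """Ratio of new to old performance per Watt, given the factors by which
    performance and power each change."""
    return perf_factor / power_factor


# Doubling performance (factor 2.0) for 10% more power (factor 1.10).
gain = perf_per_watt_gain(2.0, 1.10)
print(f"performance per Watt improves by a factor of {gain:.2f}")
```

Under those two figures, performance per Watt improves by a factor of about 1.82.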

Bibliography

[1] Gartner. Gartner Says Worldwide Public Cloud Services Market to Total $131 Billion. Gartner, Inc., 2013.

[2] Michael Armbrust, Armando Fox, Rean Griffith, A Joseph, R Katz, A Konwinski, G Lee, D Patterson, A Rabkin, I Stoica, and Matei Zaharia. Above the Clouds: A Berkeley View of Cloud Computing. Dept. Electrical Eng. and Computer Sciences, University of California, Berkeley, Rep. UCB/EECS, 28, 2009.

[3] Daniele Catteddu. Cloud Computing: Benefits, Risks and Recommendations for Information Security. Springer, 2010.

[4] Máire McLoone and John V McCanny. High Performance Single-Chip FPGA Rijndael Algorithm Implementations. In Cryptographic Hardware and Embedded Systems - CHES 2001, pages 65-76. Springer, 2001.

[5] S. Rigler, W. Bishop, and A. Kennings. FPGA-Based Lossless Data Compression Using Huffman and LZ77 Algorithms. In Canadian Conference on Electrical and Computer Engineering, pages 1235-1238, April 2007.

[6] F. Braun, J. Lockwood, and M. Waldvogel. Protocol Wrappers for Layered Network Packet Processing in Reconfigurable Hardware. IEEE Micro, 22(1):66-74, Jan 2002.

[7] Sai Rahul Chalamalasetti, Kevin Lim, Mitch Wright, Alvin AuYoung, Parthasarathy Ranganathan, and Martin Margala. An FPGA Memcached Appliance. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 245-254. ACM, 2013.

[8] Joon-Myung Kang, H. Bannazadeh, H. Rahimi, T. Lin, M. Faraji, and A. Leon-Garcia. Software-Defined Infrastructure and the Future Central Office. In IEEE International Conference on Communications Workshops (ICC), pages 225-229, 2013.

[9] Joon-Myung Kang, Hadi Bannazadeh, and Alberto Leon-Garcia. SAVI Testbed: Control and Management of Converged Virtual ICT Resources. In IFIP/IEEE International Symposium on Integrated Network Management, pages 664-667. IEEE, 2013.

[10] Ian Kuon, Russell Tessier, and Jonathan Rose. FPGA Architecture: Survey and Challenges. Foundations and Trends in Electronic Design Automation, 2(2):135-253, February 2008.

[11] IEEE Standard for Verilog Hardware Description Language. IEEE Std 1364-2005, pages 1-560, 2006.

[12] IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008, pages c1-626, Jan 2009.

[13] Rishiyur Nikhil. Bluespec System Verilog: Efficient, Correct RTL from High Level Specifications. In Proceedings of the Second ACM and IEEE International Conference on Formal Methods and Models for Co-Design, pages 69-70, June 2004.

[14] Robert M. Metcalfe and David R. Boggs. Ethernet: Distributed Packet Switching for Local Computer Networks. Communications of the ACM, 19(7):395-404, July 1976.

[15] Don Anderson, Tom Shanley, and Ravi Budruk. PCI Express System Architecture. Addison-Wesley Professional, 2004.

[16] Peter Mell and Timothy Grance. The NIST Definition of Cloud Computing (draft). NIST Special Publication, 800(145):7, 2011.

[17] Amazon Web Services Inc. Amazon Web Services (AWS) Cloud Computing Services. http://aws.amazon.com, 2014.

[18] GENI. Global Environment for Networking Innovations (GENI) Project. http://geni.net/, 2014.

[19] Emulab. Emulab Network Emulation Testbed. http://emulab.net/, 2014.

[20] PlanetLab. PlanetLab - An Open Platform for Developing, Deploying, and Accessing Planetary-Scale Services. http://planet-lab.org/, 2014.

[21] Internet2. http://www.internet2.edu/, 2014.

[22] ORION. Ontario Research and Innovation Optical Network. http://www.orion.on.ca/, 2014.

[23] CANARIE. Canada's Advanced Research and Innovation Network. http://www.canarie.ca/, 2014.

[24] OpenStack. http://www.openstack.org/, 2013.

[25] R.T. Fielding. REST: Architectural Styles and the Design of Network-Based Software Architectures. PhD thesis, University of California, Irvine, 2000.

[26] OpenStack. Nova Developer Documentation. http://nova.openstack.org, 2014.

[27] OpenStack. Keystone Developer Documentation. http://keystone.openstack.org, 2014.

[28] OpenStack. Glance Developer Documentation. http://glance.openstack.org, 2014.

[29] OpenStack. Quantum Developer Documentation. http://quantum.openstack.org, 2014.

[30] OpenStack. Swift Developer Documentation. http://swift.openstack.org, 2014.

[31] OpenStack. Cinder Developer Documentation. http://cinder.openstack.org, 2014.

[32] Ryu. Ryu SDN Framework. http://osrg.github.io/ryu/, 2014.

[33] Rob Sherwood, Glen Gibb, Kok-Kiong Yap, Guido Appenzeller, Martin Casado, Nick McKeown, and Guru Parulkar. Flowvisor: A Network Virtualization Layer. OpenFlow Switch Consortium, Technical Reports, 2009.

[34] Joon-Myung Kang, T. Lin, H. Bannazadeh, and A. Leon-Garcia. Software-Defined Infrastructure and the SAVI Testbed. In TRIDENTCOM 2014, 2014.

[35] K. Redmond, H. Bannazadeh, P. Chow, and A. Leon-Garcia. Development of a Virtualized Application Networking Infrastructure Node. In IEEE GLOBECOM Workshops, pages 1-6, 2009.

[36] Steven Trimberger, Dean Carberry, Anders Johnson, and Jennifer Wong. A Time-Multiplexed FPGA. In Proceedings of the 5th Annual IEEE Symposium on FPGAs for Custom Computing Machines, pages 22-28. IEEE, 1997.

[37] D. Unnikrishnan, R. Vadlamani, Yong Liao, J. Crenne, Lixin Gao, and R. Tessier. Reconfigurable Data Planes for Scalable Network Virtualization. IEEE Transactions on Computers, 62(12):2476-2488, 2013.

[38] Esam El-Araby, Ivan Gonzalez, and Tarek El-Ghazawi. Virtualizing and Sharing Reconfigurable Resources in High-Performance Reconfigurable Computing Systems. In Second International Workshop on High-Performance Reconfigurable Computing Technology and Applications, pages 1-8. IEEE, 2008.

[39] Ivan Gonzalez, Sergio Lopez-Buedo, Gustavo Sutter, Diego Sanchez-Roman, Francisco J. Gomez-Arribas, and Javier Aracil. Virtualization of Reconfigurable Coprocessors in HPRC Systems with Multicore Architecture. Journal of Systems Architecture, 58(6-7):247-256, 2012.

[40] C. Steiger, H. Walder, and M. Platzner. Operating Systems for Reconfigurable Embedded Platforms: Online Scheduling of Real-Time Tasks. IEEE Transactions on Computers, 53(11):1393-1407, 2004.

[41] K. Rupnow. Operating System Management of Reconfigurable Hardware Computing Systems. In International Conference on Field-Programmable Technology, pages 477-478, 2009.

[42] Chun-Hsian Huang and Pao-Ann Hsiung. Virtualizable Hardware/Software Design Infrastructure for Dynamically Partially Reconfigurable Systems. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 6(2):11, 2013.

[43] Khoa Dang Pham, A.K. Jain, Jin Cui, S.A. Fahmy, and D.L. Maskell. Microkernel Hypervisor for a Hybrid ARM-FPGA Platform. In 24th International Conference on Application-Specific Systems, Architectures and Processors, pages 219-226, 2013.

[44] C. Chang, J. Wawrzynek, and R.W. Brodersen. BEE2: A High-End Reconfigurable Computing System. Design & Test of Computers, IEEE, 22(2):114-125, 2005.

[45] NetFPGA. NetFPGA 10G. http://netfpga.org/, 2014.

[46] Terasic Technologies Inc. DE5Net. http://de5-net.terasic.com/, 2013.

[47] BEECube Inc. minibee - Research in a Box. http://www.beecube.com/products/minibee.asp, 2014.

[48] Xilinx. Xilinx Partial Reconfiguration User Guide v12.3. http://www.xilinx.com/support/documentation/sw_manuals/xilinx12_3/ug702.pdf, 2010.

[49] Altera. Partial Reconfiguration Megafunction. http://www.altera.com/literature/ug/ug_partrecon.pdf, 2013.
[50] Xilinx. Vivado Design Suite User Guide - Partial Reconfiguration. http://www.xilinx.com/support/documentation/sw_manuals/xilinx2014_2/ug909-vivado-partial-reconfiguration.pdf, 2014.
[51] Xilinx. Virtex 5 User Guide. http://www.xilinx.com/support/documentation/user_guides/ug190.pdf, 2012.
[52] NetFPGA. NetFPGA 10G Open Source Hardware. https://github.com/netfpga/NetFPGA-public, 2014.
[53] Kyle Locke. Xilinx Parametrizable Content-Addressable Memory. http://www.xilinx.com/support/documentation/application_notes/xapp1151_Param_CAM.pdf, 2011.
[54] Xilinx. UG761 Xilinx AXI Reference Guide v14.3. http://www.xilinx.com/support/documentation/ip_documentation/axi_ref_guide/latest/ug761_axi_reference_guide.pdf, 2012.
[55] Jianhua Che, Yong Yu, Congcong Shi, and Weimin Lin. A Synthetical Performance Evaluation of OpenVZ, Xen and KVM. In Services Computing Conference (APSCC), 2010 IEEE Asia-Pacific, pages 587–594, Dec 2010.
[56] The Xen Project. Baremetal vs. Xen vs. KVM Redux. http://blog.xen.org/index.php/2011/11/29/baremetal-vs-xen-vs-kvm-redux/, 2013.
[57] Python. Socket - Low-level Networking Interface - Python Documentation. http://docs.python.org/2/library/socket.html, 2013.
[58] Xilinx. Vivado High-Level Synthesis. http://www.xilinx.com/products/design-tools/vivado/integration/esl-design/, 2014.

[59] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jennifer Rexford, Scott Shenker, and Jonathan Turner. OpenFlow: Enabling Innovation in Campus Networks. ACM SIGCOMM Computer Communication Review, 38(2):69–74, 2008.
[60] Scott Shenker, M. Casado, T. Koponen, and N. McKeown. The Future of Networking, and the Past of Protocols. Open Networking Summit, 2011.
[61] P.J. Frantz and G.O. Thompson. VLAN Frame Format, September 1999. US Patent 5,959,990.
[62] Bruce Davie and Yakov Rekhter. MPLS: Technology and Applications. Morgan Kaufmann Publishers Inc., 2000.
[63] OpenFlow Specification. http://www.opennetworking.org/sdn-resources/onf-specifications/openflow, 2013.
[64] Xilinx. Xilinx All Programmable SoC. http://www.xilinx.com/products/silicon-devices/soc/index.htm, 2014.
[65] Altera Corporation. Altera SoC Overview. http://www.altera.com/devices/processor/soc-fpga/overview/proc-soc-fpga.html, 2014.
[66] Andrew Putnam et al. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. In The 41st International Symposium on Computer Architecture. IEEE, 2014.