Network Function Virtualization: Virtualized BRAS with Linux* and Intel Architecture


Intel Network Builders Reference Architecture

Network Function Virtualization: Virtualized BRAS with Linux* and Intel Architecture
Packet Processing Performance of Virtualized Intel Architecture Platforms

Intel Xeon Processor E5-2600 Product Family
Intel Xeon Processor E5-2600 v2 Product Family

Audience and Purpose

The primary audiences for this reference architecture (RA) are architects and engineers evaluating Intel architecture based solutions for Network Function Virtualization (NFV). Virtualized software architectures and configuration options can have a dramatic impact on system performance, including throughput and latency. The primary focus of this analysis is a virtual Broadband Remote Access Server (vBRAS); however, it is applicable to various network services, including protocol conversion applications and firewalls. Details for writing and configuring host-based and virtualized packet processing software applications are provided (full code is not provided; rather, the techniques used are described). The platform tested uses Linux* KVM with the Intel Xeon Processor E5-2600 series and Intel Xeon Processor E5-2600 v2 series (codenamed Sandy Bridge and Ivy Bridge, respectively) and the Intel 82599 10 Gigabit Ethernet Controller (codenamed Niantic). Performance results are specific to the target platform; however, the test methodology and optimization learnings are generally applicable to NFV solutions using Linux* KVM virtualization and the Intel 82599 10 Gigabit Ethernet Controller.

Table of Contents

1 Audience and Purpose
2 Executive Summary
3 Summary
4 Introduction
  4.1 Test Configuration
  4.2 Use Cases
  4.3 Hardware
    Compute Platform
    Intel 82599 10 Gigabit Ethernet Controller
  4.4 Software Components
5 Test Methodology and Protocol Description
  5.1 Measuring Network Throughput
  5.2 Protocols
  5.3 Network Test Tools
6 Port Forwarding Performance
  6.1 General Performance and Effect of Intel DPDK Parameters
    Unidirectional: Influence of Socket and rx_free_thresh
    Bidirectional: Influence of PCIe Throughput
    Effect of CPU Socket Affinity
  6.2 Rx_mbuf and Mempool Cache
    Mempool cache=0
    Mempool cache=64
  6.3 Hyper-Threading
  6.4 Huge Pages
7 L3 Forwarding Performance
8 Load Balancing
  Hardware Load Balancing
  Software Load Balancing
  Intel DPDK Rings
  Rx_mbuf and Mempool Cache
  Load Balancing Architecture
  Load Balancer Performance Results
  Impact of CPU Socket Affinity
    Both Interfaces on the Same Socket
    Interfaces on a Different Socket
  Load Balancing Conclusion

9 BRAS Prototype Performance
  Use Case Description
    BRAS Test Description
    Detailed Description
  Architecture Description
    Lookup Tables
    Table Dimensioning
  Performance Results
    Performance of Intel Xeon Processor E5-2600
    Performance of Intel Xeon Processor E5-2600 v2
  Bottleneck Analysis
    Impact of Intel Data Direct I/O Technology
    Impact of Cache and Number of GREs
    Performance per Task
10 Virtualized Performance
  Port Forwarding Performance
    Performance Results: Influence of IOTLB
    Huge Pages
    Performance Stability
  Virtualized BRAS Prototype Performance
    Performance Results: Intel Xeon Processor E5-2600
    Performance Results: Intel Xeon Processor E5-2600 v2
    Huge Pages
11 Annexes
  11.1 Framing Format
    Ethernet Frame (Ethernet II, IEEE 802.3)
    IPv4 Packet
    IPv6 Packet
    UDP Datagram
    TCP Segment
    MPLS Header
    VLAN (802.1Q)
    QinQ
    GRE Tunneling
  11.2 Find the Socket of a PCIe Interface
  11.3 Find the Number of Lanes on a PCIe Interface
  11.4 Hash Table Statistics
  11.5 Qemu Command Line Parameters

  11.6 BRAS Configuration
    CPE Side UDP Packets
    CPE Side ARP Packets
    Network Core Side
  Ixia/Spirent Screenshots
    CPE Side Ixia Screenshots
    CPE Side Packet Decoding
    Core Side Screenshots
    Core Side Packet Decoding
Glossary
References

Executive Summary

This Reference Architecture (RA) evaluates Intel Architecture (IA) based platforms for Network Function Virtualization (NFV) workloads, with the goal of better understanding the implications of software design decisions. This RA specifically describes testing of a prototype Broadband Remote Access Server (BRAS) workload. Virtualized standard high volume servers are widely used in Cloud Data Centers. Cloud computing technologies such as virtualization can also be applied to telecom networks, making the network more flexible and agile in delivering services over mobile and fixed-line networks. Requirements, architectures, and performance criteria of virtualized network functions are being defined by industry bodies such as the ETSI Network Function Virtualization (NFV) Industry Specification Group (ISG). Evaluating the performance of virtualized packet processing on Commercial Off-The-Shelf (COTS) platforms and understanding system architecture and configuration options is fundamental to developing solutions that will meet industry requirements and transform telecom networks. This RA contains learnings for the Intel Xeon Processor E5-2600 series and Intel Xeon Processor E5-2600 v2 series processors, using the Intel Data Plane Development Kit (Intel DPDK) for packet processing and qemu as the open source virtualization technology. This RA targets PCIe pass-through as the virtual machine (VM) to Network Interface Controller (NIC) communication mechanism. The configurations tested do not include a virtual switch (vswitch); vswitch performance is the subject of another RA. This RA presents lessons for both virtualized and non-virtualized systems. It gives the necessary ingredients to build applications that handle high packet throughput with complex workloads, starting from the building blocks for the simplest use cases (L2/port forwarding) and moving to more complex and realistic ones (BRAS), and includes use cases such as L3 forwarding and load balancing.
For each use case or building block, this RA shows the performance impact of relevant configuration parameters:

1. System parameters
   a. Number of lanes of PCIe slots
   b. Number of huge pages
   c. Number of interfaces on one NIC, etc.
2. CPU configuration
   a. CPU sockets
   b. Hyper-Threading, etc.
3. Intel DPDK configuration parameters
   a. rx_free_threshold
   b. mempool
   c. cache size, etc.
4. Software architecture (i.e., how to architect the workload and map it onto multiple tasks running on multiple cores)

Failing to properly implement these recommendations can result in significant performance hits (up to 50 percent for each incorrect parameter/configuration). In order to simulate a more realistic workload, a prototype BRAS has been developed. The software architecture of this prototype BRAS, the lessons learned, and the performance results are explained in this RA. The performance obtained by this software prototype BRAS is summarized and compared in Figure 1 for the Intel Xeon Processor E5-2600 and Intel Xeon Processor E5-2600 v2, on bare metal and in a virtualized environment.

Figure 1: Performance comparison for Intel Xeon Processor E5-2600 and Intel Xeon Processor E5-2600 v2, on bare metal and in a virtualized environment

Line rate is achieved on bare metal for any packet longer than 256 bytes. In a virtual environment, the Intel Xeon Processor E5-2600 v2 offers the same performance as on bare metal, thanks to its IOTLB implementation. The performance impact of parameters such as Intel Data Direct I/O Technology (Intel DDIO), huge pages, and the number of CPEs (Customer Premises Equipment, i.e., customer devices accessing Service Provider equipment) is studied. Bottlenecks are analysed to explain the limiting factors for the different packet sizes.

3 Summary

Virtualized standard high volume (COTS) servers are widely used in Cloud Data Centers. Cloud computing technologies such as virtualization will also transform Telco networks to become flexible and agile in delivering virtualized network services over mobile and fixed-line networks. Requirements, architectures, and performance criteria of virtualized network solutions are being defined by industry bodies such as the ETSI NFV Industry Specification Group. Evaluating performance of virtualized packet processing on COTS platforms and understanding system configuration options is fundamental to developing solutions that will meet industry requirements and transform Telco networks. This Reference Architecture contains data for the Intel Xeon Processor E5-2600 and Intel Xeon Processor E5-2600 v2 series. The data and learnings are intended to help evaluate an Intel Architecture based COTS platform with specific software configurations.

Use Cases

1. Port forwarding: a network service forwards IP packets received on one port to another port, without changing any bytes of the packet.
2. L3 forwarding, with hash management: a network service forwards IP packets, updating MAC destination addresses and selecting the next hop (output port) based on destination IP address.
3.
Load balancing: a network service receives packets from an external interface and forwards those packets to other threads in the system, running on other cores, in an attempt to distribute the work over the available cores.
4. Protocol encapsulation, decapsulation, tunneling: a network service receives packets from an external interface (or from another thread/core), modifies those packets, and sends them to another core or an external interface.
5. The same workloads running in a virtualized system.

System under Test - Hardware

The target platform is an RMS from Intel, based on the Romley platform and including the Intel Xeon Processor E5-2600 series, the Intel Xeon Processor E5-2600 v2 series, and the Intel 82599 10 Gigabit Ethernet Controller.

System under Test - Software

The guest and host OS used is CentOS 6.2, updated with a kernel from kernel.org. The platform configuration includes Intel DPDK v1.3.0 and qemu-kvm-0.12 or qemu. Intel DPDK is a software package which helps fully utilize the packet parsing and security features of IA to shorten time-to-market of high performance packet processing solutions. PCI pass-through: 10GbE PCI devices are accessed through PCI pass-through (no bridge or vswitch used).

Figure 2: Port forwarding performance results

Port Forwarding

Using port forwarding applications, this RA shows the influence of the following parameters on performance:

Intel DPDK configuration parameters (like rx_free_threshold and mempool cache size)
Number of lanes of PCIe slots (x4 or x8, for 4-lane or 8-lane slots)
CPU socket
Number of huge pages (# HP)
Hyper-Threading

Figure 2 shows an example of performance results. While such port forwarding applications are not realistic, limitations encountered when configuring them in a non-optimal way will still be present in more complex, realistic applications.

Load Balancing

Recent NICs like the Intel 82599 10 Gigabit Ethernet Controller can classify incoming packets into different NIC receive queues and can serve as hardware load balancers. However, on the Intel 82599 10 Gigabit Ethernet Controller, not all protocols are supported by those classification offloads. For instance, RSS (Receive Side Scaling) does not support MPLS or QinQ protocols. In order to be able to scale and use all cores of a CPU efficiently when RSS is not possible for some of the target protocols, we need a software load balancer, receiving packets from one (or multiple) external interfaces and forwarding the packets, usually untouched, to multiple other threads. Different load balancer models are shown in this RA, depending on whether the load balancer is also responsible for transmitting the traffic to the external interfaces. Performance implications of the different architectures are studied. This RA also shows the impact on performance of some parameters, like Intel DPDK parameters (mempool cache) or cross-socket workloads, allowing users to make the right design decisions regarding which cores to use when interfaces are used on both sockets. Figures 3 and 4 show performance results for a load balancer when the RX interface is on one CPU socket and the TX interface is on the other socket.
It shows that in this configuration it is still possible to reach line rate by carefully selecting the socket of both the load balancer and the worker threads.

Figure 3: Throughput and memory bandwidth
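The software load-balancing idea described above can be sketched in a few lines. This is a hypothetical illustration, not the RA's actual code: when NIC RSS cannot classify a protocol such as QinQ, a load-balancer core hashes a flow identifier in software and hands the packet to a worker core, so that all packets of one flow stay on the same worker. The flow-identifier choice (the QinQ VLAN tag pair) and the CRC32 hash are assumptions for the sketch.

```python
# Minimal software load balancer sketch (illustrative, not the RA's code):
# hash a flow identifier in software and pick a worker core, keeping all
# packets of one flow on the same worker.
import binascii

NUM_WORKERS = 4

def worker_for(svlan, cvlan):
    """Pick a worker from the QinQ (outer, inner) VLAN tag pair."""
    flow_id = (svlan << 12) | cvlan
    h = binascii.crc32(flow_id.to_bytes(4, "big"))
    return h % NUM_WORKERS

# All packets of one flow always map to the same worker...
assert worker_for(10, 42) == worker_for(10, 42)
# ...while many different flows spread over the available workers.
buckets = {worker_for(s, c) for s in range(16) for c in range(256)}
print(sorted(buckets))
```

A real load balancer would of course work on raw packet headers and enqueue mbufs on Intel DPDK rings toward the workers; the point here is only the deterministic flow-to-worker mapping.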

BRAS

In order to simulate a more realistic application, a prototype BRAS has been developed, and learnings are presented in this RA. The proposed software architecture is shown in Figure 4.

Figure 4: Software architecture for prototype BRAS

The performance obtained by this software prototype BRAS is shown in Figure 5. We can see here the performance comparison between the Intel Xeon Processor E5-2600 and Intel Xeon Processor E5-2600 v2, as well as between bare metal and virtualized environments. This RA explains the differences between the different results and shows the influence of parameters such as Intel DDIO, huge pages, and the number of CPEs. Bottlenecks are analysed to explain the limiting factors for the different packet sizes.

Figure 5: Performance comparison between Intel Xeon Processor E5-2600 and Intel Xeon Processor E5-2600 v2

4 Introduction

4.1 Test Configuration

Figure 6 depicts the components used. The hardware is described in section 4.3 and the software is described in section 4.4. The system is configured with four Intel 82599 10 Gigabit Ethernet Controllers. See the BOM below for the detailed hardware configuration.

4.2 Use Cases

The focus is packet processing performance in a virtualized COTS environment. The following use cases are useful when evaluating or architecting a software based network services platform based on Intel DPDK:

Port forwarding/L2 forwarding
L3 forwarding
Load balancing across cores
BRAS (converting packets from QinQ to GRE tunnels, handling routing, and adding/removing MPLS tags)

These use cases are intended to serve as building blocks to explore higher level network service use-models. Following this recipe should dramatically reduce the effort to instantiate a POC platform for exploring new network cloud use-models. This RA specifically targets Intel Xeon Processor E5-2600 and Intel Xeon Processor E5-2600 v2 platforms. Small (64B) to large (1518B) packet sizes were investigated in all tests (no packet fragmentation). The first two use cases can be used to model a run-to-completion architecture, where each packet is fully handled by one core. In this model each core handles the same functions, on different packets. Scalability of this model is obtained using hardware or external load balancers (e.g. RSS). The core load balancing and BRAS use cases can be used to model pipelined applications, where a packet is handled successively by multiple cores. In this model, each core has a dedicated task. Each core handles all packets from an interface.
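The two execution models above can be contrasted with a small sketch. This is purely illustrative (the stage names and queue sizes are invented): run-to-completion applies every stage to a packet on one core, while the pipelined model puts a queue between per-stage cores, which is what Intel DPDK rings provide in practice.

```python
# Sketch of the two execution models described above (illustrative only).
# Run-to-completion: every core runs the full pipeline on its own packets.
# Pipelined: each core performs one stage and passes packets on via a queue.
from collections import deque

def parse(p):    return {**p, "parsed": True}
def route(p):    return {**p, "routed": True}
def transmit(p): return {**p, "sent": True}

packets = [{"id": i} for i in range(8)]

# Run-to-completion: one loop applies all stages to each packet.
run_to_completion = [transmit(route(parse(p))) for p in packets]

# Pipelined: explicit queues between per-stage "cores".
q1, q2, out = deque(packets), deque(), []
while q1: q2.append(route(parse(q1.popleft())))   # stage-1 core
while q2: out.append(transmit(q2.popleft()))      # stage-2 core

assert run_to_completion == out  # same result, different core mapping
```

The functional result is identical; the difference that matters for performance is which caches the packet data traverses and how the work maps onto cores.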
4.3 Hardware

Compute Platform

Two different systems are used and compared: a system based on the Intel Xeon Processor E5-2600 series (formerly codenamed Sandy Bridge) and a system based on the Intel Xeon Processor E5-2600 v2 series (formerly codenamed Ivy Bridge).

Intel 82599 10 Gigabit Ethernet Controller

The platform network interfaces are Intel 82599 10 Gigabit Ethernet Controllers. This is the latest Intel 10GbE network card; it provides dual-port 10Gbps throughput with improved end-to-end latency and lower CPU utilization using TCP Segmentation Offload (TSO), Receive Side Coalescing (RSC), and flow affinity filters. Due to the enhanced onboard intelligence of the Intel Ethernet Controller network card, it can also provide advanced features such as:

IPsec offload for up to 1024 Security Associations (SA) for each of Tx and Rx
AH and ESP protocols for authentication and encryption
AES-128-GMAC and AES-128-GCM crypto engines
Transport mode encapsulation
IPv4 and IPv6 versions (no options or extension headers)

Figure 6: Test configuration components (traffic generator test tool connected through NICs 1 to 4 to the System Under Test: a host running CentOS and DPDK on an Intel Xeon Processor 2600 base platform, with VM A and VM B each running CentOS and DPDK)

HARDWARE

Platform (Intel Xeon Processor E5-2600 system):
  Form factor: Intel Server Board S2600CP Family, RMS 4U
  Processor(s): 2x Intel Xeon CPU E5-2600 series, 20 MB cache, with Hyper-Threading enabled
  Cores: 8 physical cores/CPU (16 Hyper-Threaded cores per CPU, for 32 total cores)
  Memory: 32 GB RAM (8x 4GB), quad channel 1333 DDR3
  NICs: 4x Intel 82599 10 Gigabit Ethernet Controller
  BIOS: SE5C600.86B; Intel Virtualization Technology for Directed I/O (Intel VT-d) enabled; Hyper-Threading enabled

Platform (Intel Xeon Processor E5-2600 v2 system):
  Form factor: Intel Server R2308IP4LHPC (S2600IP) Family, RMS 2U
  Processor(s): 2x Intel Xeon CPU E5-2600 v2 series, 25 MB cache, with Hyper-Threading enabled
  Cores: 10 physical cores/CPU (20 Hyper-Threaded cores per CPU, for 40 total cores)
  Memory: 32 GB RAM (8x 4GB), quad channel 1333 DDR3
  NICs (Niantic): 4x Intel 82599 10 Gigabit Ethernet Controller
  BIOS: S2600IP_W2600CR_SFUP_BIOS_BMC01170r4151_FRUSDR109_ME zip; Intel Virtualization Technology for Directed I/O (Intel VT-d) enabled; Turbo Boost not enabled; Hyper-Threading enabled

4.4 Software Components

CentOS (host OS): version 6.2, used as a reference host OS to compare with Wind River Linux. The kernel was updated from kernel.org. System services: firewall disabled, irqbalance service disabled.
Qemu-kvm (virtualization technology): Qemu on the Intel Xeon Processor E5-2600 system; qemu-kvm (el6.centos.3.x86_64) on the Intel Xeon Processor E5-2600 v2 system.
CentOS (guest OS): used as the guest OS for virtualized configurations.
Intel DPDK (IP stack acceleration): version 1.3.0.
BRAS prototype application: internal Intel DPDK prototype application to characterize BRAS performance.

5 Test Methodology and Protocol Description

5.1 Measuring Network Throughput

When calculating data throughput, packet overhead needs to be considered, as it significantly impacts throughput depending on the size of packets. For instance, when sending or receiving 64 byte Ethernet frames, there are 20 additional bytes consisting of the inter-frame gap, start of frame delimiter, and preamble. Therefore, using 64 byte frames, the maximum theoretical data throughput on a 10Gbps interface is 7.62 Gbps. For 1518 byte Ethernet frames, the maximum theoretical throughput is 9.87 Gbps. Another consideration is to know which throughput to report: Ethernet throughput, IP payload throughput, UDP payload throughput, etc. For example, the Netperf UDP stream test reports UDP payload throughput. A 64 byte Ethernet frame contains 18 bytes of UDP payload, so the maximum theoretical throughput with such frames is 2.14 Gbps of UDP payload, or 7.62 Gbps of Ethernet payload; for 1518 byte Ethernet frames (containing 1472 bytes of UDP payload), it is 9.57 Gbps of UDP payload. There is a difference between an Ethernet frame, an IP packet and a UDP datagram. In the seven-layer OSI model of computer networking, 'packet' strictly refers to a data unit at layer 3 (Network Layer). The correct term for a data unit at layer 2 (Data Link Layer) is a frame, and at layer 4 (Transport Layer) a segment or datagram. In this document, we will typically measure performance in terms of packets per second (pps), kilo packets per second (kpps), or mega packets per second (Mpps). Network throughput usually represents the fastest rate at which the system under test can forward frames without any packet loss (RFC 2544, RFC 1242). However, in this reference architecture, network throughput refers to the packet rate transmitted by the system under test when receiving packets at line rate.
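The arithmetic above can be checked with a short calculation. This is a sketch using only the quantities stated in the text: the 10 Gbps link rate and the 20 bytes of per-frame wire overhead (12-byte inter-frame gap, 7-byte preamble, 1-byte start-of-frame delimiter).

```python
# Theoretical limits of a 10 Gbps link for a given Ethernet frame size.
# Every frame carries 20 extra bytes on the wire (inter-frame gap,
# preamble, start-of-frame delimiter).
LINK_BPS = 10_000_000_000
WIRE_OVERHEAD = 20  # bytes per frame

def max_pps(frame_size):
    """Maximum frames per second at line rate."""
    return LINK_BPS / ((frame_size + WIRE_OVERHEAD) * 8)

def payload_gbps(frame_size, payload_size):
    """Throughput in Gbps counted over `payload_size` bytes of each frame."""
    return max_pps(frame_size) * payload_size * 8 / 1e9

print(f"64B frames:   {max_pps(64)/1e6:.2f} Mpps, "
      f"{payload_gbps(64, 64):.2f} Gbps Ethernet, "
      f"{payload_gbps(64, 18):.2f} Gbps UDP payload")
print(f"1518B frames: {max_pps(1518)/1e6:.3f} Mpps, "
      f"{payload_gbps(1518, 1518):.2f} Gbps Ethernet, "
      f"{payload_gbps(1518, 1472):.2f} Gbps UDP payload")
```

The 64 byte case yields the 14,881 kpps theoretical maximum used throughout the unidirectional tests, and twice that (29,761 kpps) for the bidirectional case.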
One of the reasons for this measurement technique is to avoid test results that are difficult to reproduce because some services occasionally steal CPU cycles, resulting in some (very low) packet loss. For instance, losing a few packets every 5 seconds because a service starts every 5 seconds would result in a very low throughput as measured by RFC 2544. Making sure that this (very low) packet loss does not happen (i.e., making the system more real time) was not part of this RA.

5.2 Protocols

The port forwarding and L3 forwarding use cases use IPv4 packets and UDP datagrams. The BRAS use case additionally uses the MPLS, GRE, and QinQ protocols. Please see annex 11.1 for details on the format of those protocols.

5.3 Network Test Tools

All tests were performed using standalone traffic generators (Ixia* and Spirent*). This provides full control in defining the Ethernet frame (MPLS, QinQ, tunneling, TCP, UDP, etc.). Be sure to pay attention to how the frame length is defined, as the CRC may not be taken into account depending on the particular tool being used. Intel DPDK based applications were used as the SUT (System under Test). The performance reported in this document has been obtained using Intel DPDK based examples and an Intel DPDK based BRAS prototype application.

6 Port Forwarding Performance

We will investigate how the CPU socket to which the Intel 82599 10 Gigabit Ethernet Controller is attached, as well as some Intel DPDK parameters, can influence forwarding performance. In many cases, the performance is limited more by the number of packets than by the bandwidth: it requires the same number of cycles to handle 64 byte packets and 1500 byte packets. For this reason, the performance will be measured with 64 byte packets and expressed in kpps (kilo packets per second).
6.1 General Performance and Effect of Intel DPDK Parameters

6.1.1 Unidirectional: Influence of Socket and rx_free_thresh

In the following performance results, unidirectional forwarding means that packets generated from TS1 (Test System 1, i.e., a test generator) are received on one interface of the SUT and forwarded to the other interface of the SUT, as indicated in Figure 7. TS2 (Test System 2, i.e., another port of the test generator) receives and counts those packets. The theoretical maximum is 14,881 kpps for 64 byte frames. Figure 8 shows the performance obtained in the unidirectional case. The first bar (rx_free_thresh=4, wrong socket) shows the performance obtained when the core on which the packet forwarding application is running is on a different CPU socket than the interface itself. Please refer to annex 11.2 for detailed information on how to find which socket a PCIe interface is attached to. We see that in this case the forwarding application is unable to reach line rate. The second bar (rx_free_thresh=4, good socket) shows the effect of the socket: by choosing the right socket, forwarding performance increases from around 88 percent of line rate to 95 percent. The third bar is trickier. It shows the influence of an Intel DPDK parameter called rx_free_thresh. When we change

this parameter from 4 to 16, we see the performance increasing to almost line rate (99.995 percent of line rate). To explain the effect of this parameter, we need some additional background; the Intel 82599 10 Gigabit Ethernet Controller Datasheet contains fully detailed explanations, and here is a small summary. Communication of packets received by the hardware is done using a circular buffer of packet descriptors (see Figure 9, from the Intel 82599 10 Gigabit Ethernet Controller Datasheet). A receive descriptor is a data structure that contains the receive data buffer address and fields for the hardware to store packet information. Upon receipt of a packet for this device, the hardware stores the packet data into the indicated buffer and writes the length, status and errors to the receive descriptor. There can be up to 64K-8 descriptors in the circular buffer. The hardware maintains a shadow copy that includes those descriptors completed but not yet stored in memory. The Receive Descriptor Head register (RDH) indicates the in-progress descriptor. The Receive Descriptor Tail register (RDT) identifies the location beyond the last descriptor that the hardware can process. This is the location where software writes the first new descriptor. During run time, software processes descriptors and, upon completion of descriptors, increments the Receive Descriptor Tail register. The number of usable (free) descriptors for the hardware is the distance between the Tail and Head registers. When the tail reaches the head, there are no free descriptors and further packets might be either dropped or block the receive FIFO. However, updating the RDT after each packet has been processed by the software has a cost, as it increases PCIe operations. rx_free_thresh represents the maximum number of free descriptors that the Intel DPDK software will hold before sending them back to the hardware. Hence, by processing batches of packets before updating the RDT, we can reduce the PCIe cost of this operation.
We see that increasing rx_free_thresh improves the performance, as we can now reach line rate (or at least 99.995 percent of line rate).

Figure 8
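The batching effect of rx_free_thresh can be made concrete with a small simulation. This is not DPDK code; it simply counts how many tail-register (RDT) writes over PCIe are needed to return freed descriptors to the NIC under different thresholds, which is the cost the text describes.

```python
# Illustration of rx_free_thresh (simulation, not DPDK code): software
# returns processed RX descriptors to the NIC by writing the tail register
# (RDT) over PCIe. Holding up to rx_free_thresh free descriptors before
# each write trades a slightly fuller ring for far fewer PCIe writes.
def tail_updates(num_packets, rx_free_thresh):
    """Count RDT writes needed to process `num_packets` packets."""
    writes, held = 0, 0
    for _ in range(num_packets):
        held += 1                    # one descriptor freed per packet
        if held >= rx_free_thresh:   # batch is full: give them back
            writes += 1
            held = 0
    return writes

for thresh in (1, 4, 16, 64):
    print(f"rx_free_thresh={thresh:3d}: "
          f"{tail_updates(1_000_000, thresh):>7d} RDT writes per 1M packets")
```

Going from a threshold of 4 to 16 cuts the RDT write rate by a factor of four, which is consistent with the measured jump from 95 percent to 99.995 percent of line rate.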

Figure 9: Circular buffer of packet descriptors

6.1.2 Bidirectional: Influence of PCIe Throughput

Bidirectional forwarding means that packets are received on both interfaces and forwarded in both cases to the other interface, as indicated in Figure 10. The theoretical maximum is 29,761 kpps for 64 byte frames.

Figure 10

Performance for bidirectional port forwarding using different configurations is shown in Figure 11. The first bar shows that a PCIe x4 slot (4 lanes) is not sufficient to support the throughput of two 10GbE interfaces: less than 50 percent of the theoretical line rate is reached when using 64 byte packets (find how to check that your PCIe slot supports 8 lanes in annex 11.3). The second bar of the bidirectional case shows that we reach around 78 percent of line rate (with 64 byte packets) when using both interfaces of one Intel 82599 10 Gigabit Ethernet Controller card to perform bidirectional port forwarding. With the third bar, we see that using two Intel 82599 10 Gigabit Ethernet Controllers, with only one interface on each NIC, we can reach line rate even with the smallest packets.

6.1.3 Effect of CPU Socket Affinity

In this chapter we will show the influence of the CPU socket to which a core and an Intel 82599 10 Gigabit Ethernet Controller interface belong. We have seen (see chapter 6.1.1) that the best performance is obtained by running the software on a core belonging to the same socket as the interface it is handling. However, this is not always possible: for instance, one core might have to forward traffic from an interface on socket 0 to an interface on socket 1. This chapter studies the performance which can be obtained in this use case and compares it to the case where both interfaces are on the same socket. In the following test, two interfaces from two different NIC cards are used. The traffic received on one interface is forwarded back to the other interface (bidirectional forwarding).
Four different configurations are tested: Bad, Receive OK, Transmit OK, and Good. Figures 12 to 15 describe those four configurations. Figure 16 shows how choosing the right or the wrong cores can affect performance. The Intel 82599 10 Gigabit Ethernet Controller is a PCIe Gen2 x8 device, so it should be able to handle 40 GT/s; taking the 8b/10b encoding into account, it should be able to handle 32 Gbps in each direction. However, there is a lot of overhead, mainly for small packets (e.g., the PCIe Transaction Layer overhead). Finally, if we look at the bandwidth from PCIe to memory when doing bidirectional forwarding of IP packets, we see that it involves not only the NIC writing the packet, but also the NIC writing back TX and RX descriptors and issuing PCIe read requests (to read the packet, and to read RX and TX descriptors). Taking all those overheads into account, we see that the maximum bandwidth is around 80 percent of line rate when using 64 byte packets.

Figure 11: Bidirectional forwarding performance
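The x4-versus-x8 result can be sanity-checked with a back-of-envelope calculation. This is a sketch that only accounts for the 8b/10b line encoding; real PCIe Gen2 throughput is further reduced by TLP headers and descriptor traffic, as the text explains.

```python
# Back-of-envelope PCIe Gen2 bandwidth check (illustrative; real per-TLP
# overhead is larger and depends on payload splitting).
GT_PER_LANE = 5e9          # PCIe Gen2: 5 GT/s per lane
ENCODING = 8 / 10          # 8b/10b line encoding

def raw_gbps(lanes):
    """Raw PCIe bandwidth per direction, after 8b/10b encoding, in Gbps."""
    return lanes * GT_PER_LANE * ENCODING / 1e9

# Two 10GbE ports forwarding bidirectionally need roughly 20 Gbps per
# direction before any descriptor traffic or TLP overhead is counted.
need = 2 * 10
print(f"x4: {raw_gbps(4):.0f} Gbps, x8: {raw_gbps(8):.0f} Gbps, "
      f"needed: >{need} Gbps")
assert raw_gbps(4) < need < raw_gbps(8)  # x4 cannot carry two 10GbE ports
```

This matches the measurements: an x4 slot (16 Gbps raw) falls short of two 10GbE ports even before overhead, while an x8 slot (32 Gbps raw) leaves headroom that overhead then erodes to roughly 80 percent of line rate at 64 bytes.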

In the first configuration (Good, see Figure 12), both interfaces are on the same socket. The core handling the packets (receiving and transmitting) is on the same socket.

Figure 12

Figure 13

In the second configuration (Receive OK, see Figure 13), a core on socket 0 receives packets from an interface on socket 0 and transmits them on an interface on socket 1. In the third configuration (Transmit OK, see Figure 14), a core on socket 1 receives packets from an interface on socket 0 and transmits them on an interface on socket 1.

Figure 14

Figure 15

In the last configuration (Bad, see Figure 15), a core on socket 1 receives packets from and transmits packets to interfaces on socket 0.

Figure 16

We can see that, as expected, if both interfaces and the core are on the same socket (Good), the best performance is obtained. The memory throughput in this case is zero: packets are exchanged through the cache and DRAM is not involved. We also observe that if the interfaces are on different sockets, we can still reach line rate by using a core on the same socket as the receiving interface (e.g., a core on socket 0 receives traffic from an interface on socket 0 and transmits it to an interface on socket 1). Compared to the case where both interfaces were on the same socket, we can see that the memory throughput increases (packets need to go through memory to cross QPI). As soon as the core is on a different socket than the receiving interface, we see the performance decrease while memory throughput continues to increase. So, when possible, interfaces and cores should be used from the same CPU socket. When this is not possible (e.g., interfaces from both sockets must be used), the receiving cores should be on the same socket as the interface they handle.

6.2 Rx_mbuf and Mempool Cache

Mbufs (message buffers) are buffers used by the Intel DPDK to carry network packets. The message buffers are stored in a mempool (a memory pool being an allocator of fixed-sized objects; it uses a ring to store free objects). The message buffers must be allocated by the application, usually at startup time. The number of mbufs must be high enough to carry all packets handled by the system at a given time. This includes RX and TX descriptors, any packets stored in rings between cores, mempool caches, and any packets buffered by the application.
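The mbuf sizing rule above can be written down as a simple budget. The concrete numbers below (descriptor ring sizes, ring slots, cache size, application buffering) are assumed values for illustration only, not the RA's configuration; the point is that every place a packet can sit must be counted.

```python
# Hypothetical mbuf budget for a two-port forwarding app, following the
# sizing rule above: the pool must cover every place a packet can sit.
# All numeric values below are assumptions for illustration.
RX_DESC, TX_DESC = 512, 512      # descriptors per port (assumed)
PORTS = 2
RING_SLOTS = 1024                # total inter-core ring slots (assumed)
MEMPOOL_CACHE = 64               # per-core mempool cache (assumed)
CORES = 2
APP_BUFFERED = 256               # packets the app itself may hold (assumed)

needed = (PORTS * (RX_DESC + TX_DESC)   # sitting in RX/TX descriptor rings
          + RING_SLOTS                  # queued between cores
          + CORES * MEMPOOL_CACHE       # parked in per-core mempool caches
          + APP_BUFFERED)               # held by the application
print(f"minimum mbufs: {needed}")       # round up generously in practice
```

Undersizing this budget causes allocation failures under load; oversizing it, as the next sections show, increases memory traffic when the cache is disabled.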

6.2.1 Mempool cache=0

The following graph (Figure 17) shows the effect of the number of mbufs on performance, and more specifically on memory bandwidth. We see that, as soon as we use more than 10K mbufs per interface in this two-interface configuration (hence 20K mbufs), memory throughput starts to increase. Even though we do not see a direct impact of the number of mbufs on performance (not shown on this graph: we still hit line rate when using more than 80K mbufs per interface), having many mbufs impacts memory throughput. We expect that, in real applications more complex than this basic port forwarding, this will impact performance. When the cache size is set to 0 (this case), all configured mbufs are used in a circular way. When using a big number of mbufs, there are too many mbuf pointers for all of them to be stored in the cache, and they get evicted.

6.2.2 Mempool cache=64

When mempool_cache_size is set to 64, the basic forwarding application no longer shows any memory read or memory write throughput. While configured with up to one million mbufs, the whole test uses far fewer (a few hundred per core), and always the same ones; all the other mbufs are unused. As only a small number of mbufs are used, all mbuf pointers can stay in cache (their content being overwritten when a new packet comes in). We conclude that using the mempool cache in a basic forwarding application, as well as probably in most run-to-completion use cases, increases performance. We will see in a following chapter (8.3.1) how this parameter influences pipelined models.

Figure 17

6.3 Hyper-Threading

The following test shows the effect of Intel Hyper-Threading Technology on port forwarding. By port forwarding, we mean that each packet is taken from one interface and forwarded to the other interface, without any packet modification. We can see that, in this (very simple) workload at least, Hyper-Threading (HT) dramatically improves the results (almost 40 percent improvement). Both tests use one physical core (core 1). In the first case (HT off), only one thread (logical core) handles both interfaces. In the second case (HT on), two threads (logical cores) are running, each handling one interface.

Figure 18

6.4 Huge Pages

Intel DPDK requires the use of huge pages, so it is not possible to measure, when running on bare metal, the performance impact of huge pages themselves. Still, the question of the size and number of huge pages must be answered. Figure 19 shows the performance (per interface) of a basic port forwarding application when using various numbers of huge pages and huge page sizes. We can see that, with a basic application, the size of the huge pages does not influence the performance. However, while not proven in this document, we believe that more complex applications will benefit from larger huge pages: for instance, 2GB of memory requires 1000 pages when using 2MB pages but only 2 pages when using 1GB pages, so the pressure on the TLB is reduced when using 1GB pages. More interestingly, we see that using 28x 1GB huge pages gives a performance boost compared to using 8x 1GB huge pages. This might be surprising, as a very basic port forwarding application is being used here, and such an application does not require much memory. In fact, both the 8x and 28x 1GB huge page results seem to be limited by PCIe bandwidth. The difference is that, when configuring 28x 1GB huge pages, some huge pages are created above the 4GB physical memory limit and others below this limit. When pages below the 4GB limit are used, 32-bit addressing is used in the TLP (Transaction Layer Packet) instead of 64-bit, resulting in less overhead and better performance (32 bits can address 2^32 = 4GB). In the 8-huge-page case, all huge pages were created above the 4GB limit.

Figure 19
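The TLB-pressure argument above reduces to a page count. A short check (the text's round figure of 1000 pages corresponds to 1024 when counting with binary units):

```python
# TLB pressure estimate from the huge-page discussion above: backing the
# same 2 GB of memory takes orders of magnitude fewer pages (and hence
# fewer TLB entries) with 1 GB huge pages than with 2 MB huge pages.
GiB = 1 << 30
MiB = 1 << 20

def pages_needed(mem_bytes, page_bytes):
    # Round up: a partially used page still occupies a full TLB entry.
    return -(-mem_bytes // page_bytes)

print(pages_needed(2 * GiB, 2 * MiB))  # 2 MB pages
print(pages_needed(2 * GiB, GiB))      # 1 GB pages
```

Fewer mapped pages means fewer TLB entries to keep warm, which is why 1GB pages are expected to help memory-hungry workloads even though this basic test cannot show it.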

7 L3 Forwarding Performance
As shown in the previous chapter, port forwarding is useful for showing the basic forwarding performance of a platform. We have seen that one core can forward packets at line rate using the Intel DPDK, provided the appropriate configuration is used. We have also seen that bidirectional traffic can be forwarded at line rate using two cores, except for the smallest packet size (64 bytes), for which we reach PCIe bandwidth limits. This is interesting because performance can only decrease as the application gets more complex. However, this test is of course not realistic at all, as the packets are neither modified nor routed. The first step in getting a more realistic workload is to do L3 forwarding and see the effect of the size of the routing table. The following test shows the throughput obtained by an application (two ports on 1 NIC) doing L3 forwarding with a variable number of routes, using the longest prefix match (LPM) implementation of Intel DPDK version 1.3. All routes have a netmask length of /20 (i.e., they each cover 4K addresses). One million of those rules cover the whole IPv4 address space (4x10^9 addresses). Similar performance would be expected using 16x more /24 routes (e.g., we expect similar performance using 1M /20 routes and 16M /24 routes). We can see in Figure 20 that the application can forward frames at maximum speed (13 Mpps, limited by PCIe bandwidth) up to 256K /20 rules, i.e., 2^30 (consecutive) IP addresses. As already mentioned, it is hence expected that 4 million /24 rules would hit the same performance. We also see that, using 1 million /20 rules, the performance is degraded, as the routing table does not fit in cache anymore. Of course, in more complex applications the cache will also be used by other components, and the same performance will not be obtained with so many rules. Still, Figure 20 shows that using, for instance, 8192 /20 rules we are far from reaching the cache limit.
This result shows the small performance impact of small routing tables (like 8K /24 rules). On the other hand, it shows the big impact of huge routing tables. When such huge routing tables must be implemented, one must ensure that they are not duplicated across multiple cores. We will describe this further when discussing the BRAS prototype architecture (chapter 9.1.2).
Figure 20

8 Load Balancing
8.1 Hardware Load Balancing
Recent Network Interface Cards (NICs), like the Intel Gigabit Ethernet Controller, can classify incoming packets into different NIC receive queues. Classification can be based on the 5-tuple, i.e., 1) source IP address, 2) source port, 3) destination IP address, 4) destination port, and 5) protocol (UDP, TCP, etc.). The kernel assigns each RX queue to an interrupt (through Message Signaled Interrupts (MSI-X)). Then, each interrupt can be bound to a different core defined by the user: this is SMP IRQ affinity. This guarantees that packets belonging to the same stream are handled by the same core. To benefit from cache affinity, the process handling a flow should be run on that same core. When using the Intel DPDK, which operates in poll mode, interrupts are not assigned by the kernel to RX queues; instead, each queue can be polled by a different core. Flow Director and Receive Side Scaling (RSS) are two different classification features of Intel Gigabit Ethernet Controllers with advanced on-board features. These features can greatly improve system-level performance if configured appropriately based on traffic and application characteristics. See Intel%20Cloud%20Builders_Packet%20Processing_2012_2.pdf for details on RSS, Flow Director and related performance data. However, on the Intel Gigabit Ethernet Controller, not all protocols are supported by RSS. For instance, RSS does not support the MPLS or QinQ protocols. In order to scale and use all cores of a CPU efficiently when RSS is not possible for some of the target protocols, a software load balancer is needed, as described in the next section.
8.2 Software Load Balancing
A software load balancer is a simple task, receiving packets from one (or multiple) external interfaces, and forwarding the packets, usually untouched, to multiple other tasks.
Each task can be a thread in a multithreaded application, a different process, or even a process in a different VM. We will focus here on multithreaded applications, where the load balancer is embedded within the application itself. The requirements for the load balancer are:
- It should receive packets from one external interface and forward them to other threads at line rate. We will call those threads worker threads (WT or WTs). For this reason, load balancers should be as efficient as possible.
- The load on all worker threads should be as balanced as possible: in the case of 4 worker threads, for instance, WT1 should not receive 70 percent of the traffic while the other WTs receive 10 percent each.
- Traffic from one stream (same source and destination IP, same ports) should always be handled by the same WT. This guarantees packet ordering is maintained within a stream (if packet ordering were not maintained, packets would have to be re-ordered afterwards, which is a costly operation).
- In the case of bidirectional operation, packets from both directions should be handled by the same worker thread. The reason is that WTs often need access to flow-related data. Having the same WT handle packets from both directions prevents that flow-related data from being accessed from different threads, and improves cache affinity.
The BRAS prototype application described further in this document implements a load balancer that follows those requirements.
8.3 Intel DPDK Rings
Software load balancers implemented using Intel DPDK are based on Intel DPDK rings. Before we look further at load balancing, we must first consider Intel DPDK rings, as they will influence the performance of load balancers. In an asynchronous pipeline model, some logical cores (load balancers) may be dedicated to the retrieval of received packets and other logical cores to the processing of previously received packets. Received packets are exchanged between logical cores through rings.

8.3.1 Rx_mbuf and Mempool Cache
We have seen (chapter 6.2) that the mempool cache size has an effect on memory throughput and potentially on performance (packet throughput). We will see in this chapter that the mempool cache size has an even more important influence on packet throughput when using a pipeline model, e.g., a load balancer which forwards the packets to another core (a worker thread, doing in our case only port forwarding).
Figure 21
The test (see Figure 21) uses 4 interfaces and 4 cores. Two cores (here called "LB") receive traffic from the 4 interfaces and transmit the traffic through DPDK rings to worker threads (here called "L2 FWD"). The two worker threads transmit the packets to the 4 external interfaces. The throughput of each core, as a function of the mempool cache size, is shown in Figure 22. The theoretical maximum in this case is around 30 Mpps (2x line rate), as each core handles two interfaces. We see that, if the mempool cache size is lower than 64, there is some performance degradation. If the cache size is zero, the performance is up to three times worse. In terms of CPU usage, the cost of multiple cores accessing a memory pool's ring of free buffers may be high, since it requires a lock. Maintaining a per-core cache avoids having too many access requests to the memory pool's ring and decreases this cost.
Figure 22

8.4 Load Balancing Architecture
Two different models can be proposed for load balancing. In the first load balancer model (see Figure 23), each load balancer is responsible for both receiving packets from and transmitting packets to the external interfaces. The load balancers isolate the packet I/O task from the application-specific workload. The worker cores are totally oblivious to the intricacies of the packet I/O activity and use the NIC-agnostic interface provided by software rings to exchange packets with the load balancing cores. In the second load balancer model (see Figure 24), the load balancers only receive packets from the external interface; the worker threads are responsible for sending the packets to the external interface.
Figure 23 Figure 24

8.4.1 Load Balancer Performance Results
The two load balancer models do not reach the same performance. We can see that in the first case (Figure 23) the load balancer itself is more loaded than in the second case: it must also receive the packets from the worker threads and forward them to the external port. The worker threads themselves are slightly less loaded in the first case (it is less CPU intensive to forward a packet to an internal ring than to an external port). Hence the performance results might be surprising. Performance is measured using two interfaces, one per load balancer, bidirectional traffic and 64-byte frames. Using one worker thread doing only forwarding (see Figure 26), we observe (see Figure 25 and Table 1) that the first model reaches better performance (2x10100 kpps versus 2x8900 kpps). We can see that in both cases the WT is fully busy. As the second model is more CPU intensive for the WT, it reaches lower results. The load balancers are loaded at 100% and 72%, respectively. Using two worker threads (see Figure 27), we reach 2x10100 and 2x14880 kpps, respectively; the second model is now better than the first one. When comparing the 1 WT and 2 WT cases, we also see that performance does not improve in the first model when adding a second WT, while it improves in the second model. The reason is of course that in the first model the load balancer is the bottleneck, so adding a second WT does not help; on the contrary, it slightly increases the load on the load balancer. In the second model the WT is the bottleneck, so adding a second WT helps scaling and lets us reach line rate.
Figure 25: Load balancing performance
Figure 26

Figure 27

PERFORMANCE SUMMARY
          Model of Figure 23 (left)                  Model of Figure 24 (right)
          Throughput (kpps)  LB Load  WT Load       Throughput (kpps)  LB Load  WT Load
1 WT      2x10100            100%     100%          2x8900             72%      100%
2 WT      2x10100            …        33%           2x14880            …        100%
4 WT      2x…                …        15%           2x…                …        55%
Table 1

8.5 Impact of CPU Socket Affinity
8.5.1 Both Interfaces on the Same Socket
As described in section 6.1, it is better for the core handling the RX and TX paths to be on the same socket as both the RX and TX interfaces. Figure 28 shows a case where a packet is received on interface eth1 on socket 0, handled by core 8 on socket 1, and sent through interface eth0 on socket 0. We see from this simple diagram that the QPI bus is traversed twice by the packet: once on the RX path (eth1 => core 8) and once on the TX path (core 8 => eth0), which explains the performance impact of such a configuration. We will now see how load balancing is affected by the choice of socket. In this chapter (8.5.1), both interfaces are located on the same socket. While it is easy to guess how to correctly configure the cores, it is still interesting to measure the cost of a bad configuration. In the next chapter (8.5.2), we will see how to best assign tasks to cores when the receiving and transmitting interfaces are on two different sockets. In the first configuration, both the load balancer and the worker thread are on the same socket as the interfaces ("Good", see Figure 29).
Figure 28 Figure 29

In the second configuration ("RX OK", see Figure 30), the worker thread is running on the wrong socket. In the third configuration ("TX OK", see Figure 31), the load balancer thread is running on the wrong socket.
Figure 30 Figure 31 Figure 32
In the last configuration ("Bad", see Figure 32), both the worker thread and the load balancer thread are running on the wrong socket. We see that, with both the load balancer thread and the worker thread on the right socket, we reach line rate, and memory throughput is zero (Intel DDIO makes sure that packets are copied by the NIC directly to cache, and packets never go to main memory). When the load balancer is on the right socket and the worker thread on the wrong socket, the performance decreases, but this is still a better configuration than when the load balancer is on the wrong socket ("Bad" and "TX OK").
8.5.2 Interfaces on a Different Socket
It is not always possible to configure a system so that all interfaces and all threads are on the same socket; otherwise, we might end up using only one of the sockets. So, if interface 0 is on socket 0 and interface 1 is on socket 1, what performance can we expect? And what is the best configuration? Here again, we analyze four different configurations.
Figure 33

In the first configuration ("RX OK", see Figure 34), both the load balancer and the worker thread are on the same socket as the receiving interface. Packets cross the QPI bus when the worker thread sends them to the external interface. In the second configuration ("RX OK, TX OK", see Figure 35), the load balancer is on the same socket as the receiving interface and the worker thread is on the same socket as the transmitting interface. Packets cross the QPI bus when the worker thread reads them from the load balancer.
Figure 34 Figure 35
In the third configuration ("TX OK", see Figure 36), both the load balancer and the worker thread are on the same socket as the transmitting interface. Packets cross the QPI bus when being received by the load balancer. In the last configuration ("Bad", see Figure 37), the load balancer is on the socket of the transmitting interface and the worker thread on the socket of the receiving interface. Packets cross the QPI bus three times: when the load balancer receives packets from the interface, when the worker thread receives packets from the load balancer, and when the worker thread transmits the packets to the external interface. It is expected that this configuration will have the worst results.
Figure 36 Figure 37

We see that, even when the interfaces are not on the same socket, it is possible (using only two interfaces) to reach line rate by choosing the proper core configuration, i.e., having the load balancer core and the worker thread core on the same socket as the receiving interface.
Figure 38
8.6 Load Balancing Conclusion
The load balancer architecture has a big impact on performance. It is important to understand where the bottlenecks are to be able to achieve the best performance. Based on the test results from this chapter, we can conclude that we should always configure the core receiving packets (i.e., the load balancer, when there is one) on the same socket as the interface it receives packets from. Ideally, the worker thread should be on the same socket as the load balancer, even if the outgoing interface is on a different socket. This might not always be possible (for instance when a worker thread receives packets from two load balancers on different sockets). In that case, the worker thread should be located on the same socket as the load balancer it receives most packets from. The first load balancer model (in which the load balancer is also responsible for transmitting packets to the external interfaces) does not reach line rate and is not scalable (adding more WTs does not help). However, as we can see from the performance summary, it leaves more free CPU cycles to the worker threads. Hence, in complex applications where all cores are fully loaded, it might be the better solution.

9 BRAS Prototype Performance
9.1 Use Case Description
We use a prototype BRAS application whose workload is more complex than pure L3 forwarding. The BRAS prototype simulates the work required by a real BRAS when it handles most packets; the idea is that the performance of the whole system is determined by the performance of handling the common case. The exception path is not implemented. We view this BRAS use case as a protocol converter. The exact protocols being used are not essential to the performance analysis, even though they of course influence the performance itself. We use protocol conversions which involve changing the packet headers, adding or deleting bytes at the beginning of the packet, changing the packet length and changing the CRC. In addition, packets from the CPE towards the Internet use a routing table to find the next hop, which gives the destination IP address, destination MAC address and MPLS tag (this table is pre-configured in a configuration file). The source MAC addresses of those packets are stored locally and associated with the QinQ tags and IP addresses, so they can be restored when a packet is sent toward the subscriber from the Internet.
Figure 39 Figure 40

9.1.1 BRAS Test Description
The BRAS under test can be depicted as shown in Figure 41. The BRAS supports (per CPU socket) 2 interfaces towards the CPEs (eth1 and eth2) and 2 interfaces towards the Internet (eth3 and eth4). The traffic between CPE and BRAS is based on QinQ. The traffic between BRAS and Internet is based on GRE. There is a one-to-one mapping between gre_id and QinQ tags. Each CPE interface (eth1 and eth2) supports a predefined number of QinQ tags, each representing one CPE. QinQ tags are unique between eth1 and eth2. For this version of the document, the BRAS supports 65K CPEs (per CPU socket), i.e., 32K CPEs per interface. Traffic from each CPE interface (e.g., eth1) can go to both Internet facing interfaces (eth3 and eth4), based on the destination address and a preloaded routing table. Traffic from each Internet interface (e.g., eth3) can also go to both CPE interfaces, but this time based on the gre_id and destination IP address.
Figure 41

9.1.2 Detailed Description
IP packets between CPE and BRAS are encapsulated using QinQ. Packets between BRAS and Internet are encapsulated using GRE. The BRAS performs the protocol conversion between QinQ and GRE.
From CPE to Internet: The BRAS receives QinQ packets. It retrieves the QinQ tag and calculates the gre_id (based on a pre-loaded QinQ to GRE table). It removes the QinQ tag and encapsulates the packet in a GRE header. The BRAS stores MAC addresses from the CPEs, based on the incoming QinQ tags in ARP requests; those MAC addresses are used when sending packets towards the CPEs. The BRAS also does routing: from the destination IP address, it retrieves the output port, the destination IP address of the GRE tunnel, the destination MAC address of the GRE tunnel, and its MPLS tag. This information is retrieved through LPM tables using 8192 routes.
From Internet to CPE: The BRAS receives GRE tunneled packets, together with MPLS tagging. It removes the MPLS tag and decapsulates the GRE tunnel. It retrieves the gre_id, calculates the QinQ tags, and retrieves the MAC addresses and destination port based on the gre_id or QinQ tags. It encapsulates the packets in QinQ.
Figure 42

Figure 43 shows the actions taken by the BRAS when receiving packets from the CPE or from the Internet facing interface.
Figure 43

9.2 Architecture Description
To understand the bottlenecks, we need to first identify the different components of the architecture and some implementation details. Figure 44 shows the BRAS architecture. It uses four load balancers (one per interface). Each load balancer sends the traffic to eight worker threads; the same eight worker threads handle the traffic from all interfaces. For traffic from the CPEs towards the Internet, the worker threads send the packets to routing threads. There are two routing threads; each receives traffic from four predetermined worker threads. This graph shows the principle; in practice, the number of worker threads, or the number of load balancers handled by one core, are parameters to be adjusted for optimal performance (see the discussion later in this document).
Figure 44

9.2.1 Lookup Tables
To support the workload, multiple lookup tables are used: a table to convert between gre_id and QinQ, a table to store the MAC addresses of the CPEs based on their IP address and QinQ, and a routing table. They are briefly described below.
Gre_id QinQ Table
There is a one-to-one mapping between the gre_id and the QinQ tags (SVLAN and CVLAN). The mapping between those ids is configured at setup time and stored in two tables: a table converting from QinQ to gre_id and a table converting from gre_id to QinQ.
In the load balancer: In the case of bidirectional traffic, the load balancer must make sure that all packets from a stream, from both directions, are sent to the same worker thread. As the load balancing is done on the gre_id or the QinQ, a QinQ to gre_id or gre_id to QinQ conversion must be done in the load balancer. Hence this operation must be very fast, as there is (by definition) only one load balancer per interface (there is no way to scale it without using additional external systems or hardware techniques). For this reason, it was chosen to use the LSB of the gre_id as the index for load balancing, and a QinQ to gre_id table as simple as a 2^24-entry array, taking the QinQ as input and returning the gre_id as output. Such a table uses 32 bits * 2^24 = 64 Mbytes. It can even be reduced to 8 bits * 2^24 = 16 Mbytes by storing only the worker thread index instead of the gre_id. The amount of memory used in cache is much smaller, as most of the entries in the table will never be used if only a subset of all possible QinQ tags is in use. E.g., if a BRAS supports 32K CPEs per interface, only 32KB to 2MB (32K * 64 bytes = one cache line per CPE in the worst case) would really hit the cache. This simplistic table has been preferred to a more complex hash-based lookup table, as such a lookup table is more CPU intensive. This approach is possible because, per the IEEE 802.1ad specification, QinQ is coded on 24 bits. So, there are only 2^24 possible QinQ values, i.e.,
a table of a reasonable size (16MB) can be used. This also explains why the LSB of the gre_id (and not of the QinQ) has been chosen as the load balancer index.
In the worker threads: Our model supports a significant number of worker threads (8 or even 10), and it is anticipated that this number might increase in the future. Each worker thread only stores a subset of the QinQ to gre_id mapping, as each worker thread handles a predefined set of CPEs. For this QinQ to gre_id table, a hash table was used: each thread has a small hash table containing only the entries of the CPEs it handles. Note that this means we have both a QinQ to worker thread table (used in the load balancer) and a QinQ to gre_id hash table. A gre_id to QinQ table is not necessary as such. We need to be able to find the QinQ from the gre_id, but this information can be stored in the same table as the MAC addresses, i.e., in the MAC address table (see below). Doing so saves one table lookup.
MAC Address Table
The BRAS application uses MAC learning. As a principle, when an ARP packet is received from a CPE, an entry is created (or updated) in a MAC address table holding the QinQ (or gre_id) + IP to CPE MAC address conversion. In practice, to save one lookup table (the gre_id to QinQ), the MAC address table is indexed by the gre_id. When a packet is received from a CPE:
--Its QinQ tag and source IP address are retrieved.
--The gre_id is calculated from the QinQ (this has to be done anyway, as we need to know the gre_id for the GRE tunnel).
--An entry is added in the hash table for the key made of the gre_id and the source IP, using the rte_hash_add_key DPDK function. This function returns an index (the position of the key in the hash table) which can be used to store the MAC address, originating port and QinQ in a separate table.
When a packet is received from the Internet, the destination IP address (within the GRE tunnel) is retrieved, and the hash index is retrieved through rte_hash_lookup (based on the gre_id and the destination IP). Using the hash index, the MAC address, QinQ and port can be retrieved from the MAC table. In addition, entries time out if no ARP packet is sent from the CPE for some time.
Routing Table
The routing table stores the next_hop based on the IP address. It is based on the LPM (longest prefix match) implementation of the Intel DPDK. Another table stores information related to the next_hop (IP address, MPLS tag, port index, etc.). For simplicity, an ARP mechanism is not implemented, and the MAC address of the next hop is stored in the table as well; we anticipate that this should not have a significant impact on performance. We have seen (chapter 7) that huge routing tables have a big performance impact. So, for this prototype BRAS architecture, as the routing table can be big, we wanted to avoid duplicating it in all worker threads, as that would have increased the performance impact even more. Nor can it be a single table accessed by all worker threads at the same time, as this table is

modified from time to time, which would have required locks. So, it was decided to use a small number of threads to do the routing. One single thread would have been optimal, but it might not be able to sustain the load, and it would not be scalable. When a small routing table is used (e.g., 8K /20 rules or 128K /24 rules), the routing tables can be duplicated in the worker threads without performance impact. So, after the packets are handled by a worker thread and encapsulated in a GRE tunnel, they are passed to a routing thread. It is clear that in a real implementation the routing should be done before the GRE tunneling, as some of the packets might not need to be tunneled but instead sent to another CPE. This might have some influence on the performance, but is left to further study.
9.3 Table Dimensioning
There are a few key parameters when dimensioning a hash table. The first is the number of possible hash entries, i.e., the size of the table, each entry holding a CPE. The number of CPEs influences the performance, as a small table can usually fit in (L1 to L3) cache, while a bigger table requires access to main memory. Other parameters are the number of entries per bucket and the function used to calculate the hash. When a hash is calculated, it translates a key (in our case QinQ tags + IP address) to a bucket index. There is no guarantee that two different keys will result in two different bucket indexes. When using a higher number of entries per bucket, more keys resulting in the same hash can be stored in the table, but a high number of entries per bucket causes higher CPU utilization. Using a low number of bucket entries might cause a bucket to become full (collisions); hence, ideally, another table should also hold the exceptions (MAC addresses which could not fit in the table). An interesting question is what size such a hash table should be to guarantee (with, e.g.,
a probability higher than 99.9 percent) that each CPE hits an entry in the table. Let's suppose we have a table with N buckets and M CPEs, and that the number of entries per bucket is k. We can use statistics to estimate the probability that the table is sized properly (no collisions). An estimate (see Annex 11.4) is a function of k and of the load factor ρ = M/N. With about one million (2^20) CPEs, the probability of collision is given in Table 2.

          N=2^18  N=2^19  N=2^20  N=2^21  N=2^22  N=2^23  N=2^24
ρ=M/N     4       2       1       1/2     1/4     1/8     1/16
k=…       …
k=…       …
k=…       …       …E-05   1E-07   6E-10
k=…       …E-05   2E-9    4E-14   8E-19   1E-23
Table 2

So, to guarantee with 99.9 percent probability that a CPE gets into a bucket, we would need to use either 2^20 buckets with 16 entries each (1.6x10^7 entries in total), or 2^22 buckets with 8 entries each (3x10^7 entries in total). The second option consumes much more memory (2x more), but requires less CPU time (fewer entries per bucket => around 2x less search time). To support 8192 CPEs per thread, each thread would have to implement a table with N buckets and k entries per bucket, with N=16384 and k=8 (see Table 3).

          N=2^11  N=2^12  N=2^13  N=2^14  N=2^15  N=2^16  N=2^17
ρ=M/N     4       2       1       1/2     1/4     1/8     1/16
k=…       …
k=…       …
k=…       …E-04
k=…       …E-05   2E-07   1E-09   4E-12
k=…       …E-07   1E-11   3E-16   6E-21   1E-25
Table 3

9.4 Performance Results
9.4.1 Performance of the Intel Xeon Processor E
The following figure shows the performance obtained by the BRAS prototype application running on the host (non-virtualized) using Intel Xeon E processors. This BRAS uses 4x 10GbE interfaces (2x CPE and 2x Internet). It uses the software architecture described above, i.e., two threads handling load balancing, 10 worker threads, two router threads and one master thread. It runs on socket 0, on which the 4x 10GbE interfaces reside; socket 1 is not used in this configuration. 65K CPEs are handled by the system, and 8192 routes have been configured for this performance analysis.

Core Configuration
Core Function    Logical core id               Physical core id       Dedicated physical core?
Load Balancer    1, 16                         1, 0                   Yes
Worker Thread    2,3,4,5,18,19,20,21,22,23     2,3,4,5,2,3,4,5,6,7    No
Router Thread    6, 7                          6, 7                   No
Master           0                             0
Unused (1)       17                            1

Notes: Each load balancer handles traffic from one interface on the CPE side and one interface on the Internet side. (1) This core could be used to increase performance, but would result in asymmetric performance.
The load balancers do not share their physical core with any other threads (except the master core, which does not consume many resources), in order to reach their best performance. The router threads and worker threads share their physical cores with other threads. Only physical cores from socket 0 are being used. Core 17 is unused; it could be used by the WTs to improve performance, but this would result in non-symmetrical results. It is idle, and can be used for other tasks, like the control plane (note however that the impact on performance is not measured here: it would use the same L1 and L2 cache as core 1, and the same L3 cache as all other cores used in this prototype). The performance obtained is shown in Figure 45. We can observe that the BRAS prototype can handle line rate from large frames (1500 bytes) down to frames of 256 bytes. For 128-byte frames, the throughput is slightly less than the theoretical maximum.
For 64-byte packets, around 75% of the theoretical maximum can be reached. The IMIX (Internet mix) average frame size, as well as the average frame size reported by some service providers on some of their networks ("real avg frame size"), is also displayed.
Figure 45: Performance from BRAS host and VM

9.4.2 Performance of the Intel Xeon Processor E V2
The following figure (Figure 46) shows the performance of the same BRAS prototype using an Intel Xeon Processor E V2. We see that the performance increases for small packets, mainly thanks to the increased number of cores and the increased L3 cache size.
9.5 Bottlenecks Analysis
9.5.1 Impact of Intel Data Direct I/O Technology
We will check in this chapter the effect of some features and parameters. An important feature available in the Intel Xeon processor E5 family is Intel DDIO. Intel DDIO technology makes the processor cache, rather than main memory, the primary destination and source of I/O data, helping to deliver increased bandwidth, lower latency, and reduced power consumption. Intel DDIO is enabled by default on all Intel Xeon processor E5 platforms. Intel DDIO has no hardware dependencies and is invisible to software, requiring no changes to drivers, operating systems, hypervisors, or applications (more information on Intel DDIO: www-ssl.intel.com/content/www/us/en/io/direct-data-i-o.html).
Figure 46: BRAS host performance comparison

Figure 47 shows the benefit of Intel DDIO technology when forwarding packets. The memory throughput curve on this graph shows the memory throughput when Intel DDIO is off; the memory throughput when Intel DDIO is on is zero and is not displayed. We can see that, without Intel DDIO, the network throughput could not reach line rate if more than 4 interfaces were used.
9.5.2 Impact of Cache and Number of GREs
The number of CPEs supported by the BRAS is of course also an important parameter. When this number increases, there is more pressure on the cache and memory throughput increases. Figure 48 shows the BRAS performance when varying the number of CPEs (from 16K to 1M). We can see that the performance is stable up to 65K users. The memory throughput starts to increase when the BRAS supports more than 32K CPEs. Above 65K users, as memory throughput continues to increase, the performance decreases. The reason is that, when the BRAS supports more users, data is less likely to be in cache when a new packet comes in.
Figure 47

Figure 48

Performance per Task

We can measure the performance of each task when run on a logical core. The performance is impacted by different factors (whether other tasks are being run on the logical core that is part of the same physical core (hyper-threaded core), whether other tasks are being run on the same socket, etc.). Table 4 summarizes this performance per task, using one logical core. Other tasks were usually running on other cores within the same socket.

Function | Dedicated physical core? | Note | Load handled by each core/thread (in Mpps)
Load Balancer | Yes | Hyper-threaded core not used. 1 interface | (line rate)
Load Balancer | Yes | Hyper-threaded core not used. 2 interfaces | 2x10.1
Load Balancer | No | Hyper-threaded core used for another load balancer | 10.4
Worker Thread | No | Hyper-threaded core used by another WT. Using 65K CPEs (the result is a function of the CPE number) | 2x2 (1)
Router Thread | Yes | Hyper-threaded core not used | 11.5
Router Thread | No | Hyper-threaded core used by a WT | 9.4
Master | - | Unused (2) | 0

Table 4

(1) Each worker thread handles traffic from both directions. This is a requirement for the worker thread: traffic from both directions shares the same table, so it must be handled by the same core to avoid inter-core communication.

Based on those results, we can build a graph showing where the bottlenecks are in our BRAS results: see Figure 49. The bars are the performance measured on the host (with 65K CPEs, using 2 LB, 10 WT and 2 RT). The Intel DPDK also uses one master core. What can we deduce from Figure 49? At 64 bytes, we hit the RT limit; at 128 bytes, it is the PCIe gen2 limit; with 256 bytes, it becomes the 10Gbps limit. If we were using 262K CPEs, the 64-byte performance would drop to 7500 kpps. It would not impact the 128-byte performance.

If you want to improve performance:
--If you use 65K CPEs, using a 10-core Ivy Bridge (4 extra logical cores) you could make sure that the RTs use real physical cores instead of hyper-threaded ones (using 2 extra logical cores), and the same for the load balancers (2 extra logical cores). Performance would reach around 10.8 Mpps instead of 9.3 Mpps with 65K CPEs.
--If you use 262K CPEs, you need to increase the number of WTs: by adding 4 WTs, you should reach 9.3 Mpps (the RT limit).

Figure 49

If you use a VM, you are hit by the IOTLB issue up to almost 512 bytes (see the IOTLB discussion in the next chapter). To improve performance, you need to use the Intel Xeon Processor E V2 Series.
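The bottleneck reasoning above can be sketched numerically. The following Python model is illustrative only: the per-core rates are approximations based on Table 4 (hyper-threaded cores in use), and the capacity formula is our assumption, not code from the BRAS prototype. Each stage's aggregate capacity is its core count times its per-core rate, and the pipeline runs at the rate of the slowest stage.

```python
# Hypothetical bottleneck model for a multi-stage packet pipeline.
# Each stage's aggregate capacity is (cores * per-core Mpps); the pipeline
# throughput is limited by the stage with the smallest capacity.
def pipeline_limit(stages):
    """stages: dict of name -> (core_count, per_core_mpps)."""
    capacity = {name: cores * mpps for name, (cores, mpps) in stages.items()}
    bottleneck = min(capacity, key=capacity.get)
    return bottleneck, capacity[bottleneck]

# Per-core rates loosely based on Table 4, for the 2 LB / 10 WT / 2 RT setup.
stages = {
    "load_balancer": (2, 10.4),   # 2 LB cores sharing hyper-threads
    "worker_thread": (10, 4.0),   # 10 WT cores, 2x2 Mpps each (both directions)
    "router_thread": (2, 9.4),    # 2 RT cores on hyper-threaded siblings
}
name, mpps = pipeline_limit(stages)
print(name, mpps)  # the router threads come out as the limiting stage
```

With these assumed numbers the router threads (2 x 9.4 = 18.8 Mpps aggregate) limit the system, consistent with the text's observation that 64-byte traffic hits the RT limit.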

10 Virtualized Performance

10.1 Port Forwarding Performance

Performance Results: Influence of the IOTLB

The following test (Figure 50) shows the performance obtained when doing port forwarding from a virtual system, compared to the same test on bare metal, both tests running on an Intel Xeon Processor E5-2690. We see a big performance degradation when we virtualize the workload. To understand the reasons for this performance, we need to describe what the IOTLB is. DMA (Direct Memory Access) allows I/O devices to access system memory independently of the CPU. But DMA as-is is not suitable in a virtualized environment: the guests cannot know the host physical address of the I/O buffers that are used by the I/O devices, and the device is unaware of the virtualization of the guest physical address space. For this reason, and others such as protection (preventing a device from accessing a memory region it is not allowed to access), the IOMMU has been introduced. The IOMMU intercepts DMA transactions, determines whether the access is permitted, and resolves the actual host physical address that will be accessed: the IOMMU translates the virtual I/O address to the host physical address. The IOMMU includes an IOTLB (Input-Output Translation Lookaside Buffer) to speed up address resolution. The IOTLB caches frequently used effective translations (results of the page walk), in the same way the DTLB does for IA-32e paging. The IOTLB on the previous-generation Intel Xeon Processor E5 does not natively support huge pages (it emulates them using 4K pages).

Figure 50
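As a back-of-the-envelope illustration of why emulating huge pages with 4K pages hurts, the sketch below counts how many distinct translations the IOTLB must cover for a DMA-visible buffer region under each page size. The 512 MB region size is a made-up example, not a measured value from this document.

```python
def iotlb_entries_needed(region_bytes, page_bytes):
    # Number of distinct page translations needed to cover the region
    # (ceiling division).
    return (region_bytes + page_bytes - 1) // page_bytes

MB = 1024 * 1024
mempool = 512 * MB  # hypothetical DMA-visible packet-buffer region

print(iotlb_entries_needed(mempool, 4 * 1024))    # 4K pages: 131072 translations
print(iotlb_entries_needed(mempool, 1024 * MB))   # one 1G huge page: 1 translation
```

With 4K pages the working set of translations vastly exceeds any realistic IOTLB capacity, so page walks become frequent; with native 1G pages a single entry suffices.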

We can see in the following results (Figure 51) that the current generation of processor (Intel Xeon Processor E V2) does handle this use case without any performance degradation: virtualized and bare metal (host) performance are the same. Line rate is not reached because the maximum PCIe throughput is reached with 64-byte frames (see 6.1.2).

Huge Pages

As the Intel DPDK requires the usage of huge pages, it was not possible to measure the impact of using huge pages with the Intel DPDK on bare metal. When using a virtualized system, we could measure the impact of using huge pages in the following way:
--The host OS is configured either with huge pages (28x 1G huge pages) or without huge pages
--Qemu uses the huge pages from the host if available
--The guest OS is configured with 2x 1G huge pages

Figure 51

As we can see in Figure 52, the impact of using huge pages on the host is very significant. Note that this test was run on an Intel Xeon E V2 processor. The same test on an Intel Xeon E would not have given such a result, as the performance of a forwarding application run in a virtualized environment on that processor is limited by the IOTLB (see the previous chapter).

Performance Stability

When we want to measure performance, the stability of this performance over time is important: how long do we need to wait to get stable performance? When running port forwarding on a virtualized Intel Xeon Processor E5-2690, our initial measurements showed that the throughput was not stable for some time (see Figure 53). We see here that the stable performance, around 10 Mpps in port forwarding on an Intel Xeon Processor E5-2690, is only reached after more than five minutes. The performance in this case is limited by the IOTLB (see the previous chapter for more details).

Figure 52

The reason for this performance evolution over time is the way in which the mbufs are organized (see chapter 6.2 for a discussion of mbufs). At startup, all mbufs from one mempool are ordered. Hence, for instance, two consecutive mbufs will usually be within the same 4K page. Because of packet drops, the mbuf order is not maintained: some mbufs are released by the application when it cannot transmit a packet; other mbufs are released at a later time by the Intel DPDK after it transmitted a packet. So, after some time, the mbufs are completely shuffled. Two consecutive mbufs are not within the same page anymore.

Figure 53

There are two consequences of this explanation:
--If the application loops on the rte_eth_tx_burst function until it succeeds, then the application itself will not drop packets anymore (the packets will be dropped by the NIC instead, as rte_eth_rx_burst is called less often). So, the application will not release any mbufs anymore; the release of an mbuf is now always ordered by the Intel DPDK. The order of the mbufs should be maintained, and the performance should be close to the maximum performance observed in Figure 53.
--If we shuffle the mbufs at startup (by allocating and releasing them randomly for some time), the mbufs will never be ordered, and we should have a stable performance, close to the performance obtained in Figure 53 after more than 10 minutes.

We see in Figure 54 that, as expected, shuffling the mbufs at startup (red line) gives a stable performance. We also see that, if the application does not drop packets and lets the NIC drop them (green and orange lines), the performance is also much more stable. It reaches a much higher throughput if the mbufs are ordered (no shuffle, green line). Finally, if we compare the two shuffled tests, where the application either drops packets (red line) or does not drop them (orange line), we still see a performance difference. This is due to the way the performance is measured: the external traffic generator sends traffic at line rate, and we measure how much traffic is forwarded by our system. Hence, the no-drop case gives better results because packets are dropped earlier: they are dropped by the NIC, decreasing pressure on the system. In the drop case (red line), the reception of the packets which will be dropped later has a cost: some CPU is spent, and some amount of cache and IOTLB is used, to handle the reception of packets which will never be transmitted.

Figure 54
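The page-locality effect of mbuf shuffling can be shown with a toy model (this is not DPDK code; the mbuf size, pool size, and burst size are illustrative assumptions). With 2 KB mbufs, a burst of consecutive mbufs from an ordered pool spans half as many 4K pages — and hence needs half as many IOTLB translations — as a shuffled one:

```python
import random

MBUF_SIZE = 2048   # hypothetical mbuf size: two mbufs fit in one 4K page
PAGE = 4096
N = 1024           # mbufs in the (toy) mempool

def pages_touched(addresses, burst=32):
    # Sum, over each burst of consecutively allocated mbufs, the number of
    # distinct 4K pages the burst spans.
    total = 0
    for i in range(0, len(addresses), burst):
        total += len({a // PAGE for a in addresses[i:i + burst]})
    return total

ordered = [i * MBUF_SIZE for i in range(N)]   # freshly initialized mempool
shuffled = ordered[:]
random.seed(0)
random.shuffle(shuffled)                      # pool state after many drops

print(pages_touched(ordered))   # 512: each 32-mbuf burst spans 16 pages
print(pages_touched(shuffled))  # larger: bursts scatter across pages
```

The ordered pool touches exactly 16 pages per 32-mbuf burst, while the shuffled pool approaches one page per mbuf, mirroring the throughput decay seen in Figure 53.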

10.2 Virtualized BRAS Prototype Performance

Performance Results: Intel Xeon Processor E5-2690

We described our BRAS prototype in chapter 9 and presented the performance results obtained on bare metal. In this chapter we show the performance which can be obtained when running the same application from a virtualized environment. Figure 55 compares bare metal and virtualized performance on an Intel Xeon Processor E5-2690. It gives the performance per socket for a 4-interface BRAS prototype. We can see that on this processor the BRAS prototype reaches line rate for all packet sizes from 1500 bytes down to around 512 bytes. For instance, a size of 700 bytes is sometimes given as a realistic estimate of the average packet size on real networks; the BRAS prototype would be able to handle the load on a network with such an average packet size. But with smaller packet sizes, we see that the virtualized performance is worse than the bare metal (host) performance. In fact, we see that the virtualized performance is limited to around 10 Mpps. This is not really a surprise given the results presented earlier on the influence of the IOTLB: when only doing port forwarding, we were already seeing a maximum throughput of around 10 Mpps.

Performance Results: Intel Xeon Processor E V2

Figure 56 compares the performance obtained on the host (bare metal) and on a virtualized Intel Xeon Processor E V2, each time using only one CPU socket. We can see that the virtualized and the host performance are extremely similar: there is no visible performance penalty when running the BRAS prototype from a virtual system.

Huge Pages

We have already seen the importance of using huge pages; however, that was with a basic port forwarding application. In Figure 57 we show the per-interface performance of a 4-interface BRAS application running on one CPU socket. We see that with a more complex application such as the BRAS, the performance impact of using huge pages remains very significant.

Figure 55

Figure 56

Figure 57

11 Annexes

11.1 Framing Format

Ethernet Frame (Ethernet II, IEEE 802.3)

An Ethernet frame is a data packet on an Ethernet link. Its length depends on the presence of the optional IEEE 802.1Q tag (indicating VLAN membership and IEEE 802.1p priority) in the Ethernet header. If the optional 802.1Q tag is absent:

7 bytes | 1 byte | 14 bytes | 46 to 1500 bytes | 4 bytes | 12 bytes
Preamble | Start of frame delimiter | Ethernet Header | Payload | FCS/CRC | Interframe gap

If it is present:

7 bytes | 1 byte | 18 bytes | 42 to 1500 bytes | 4 bytes | 12 bytes
Preamble | Start of frame delimiter | Ethernet Header | Payload | FCS/CRC | Interframe gap

Frame Size

RFC 1242 defines the Data Link frame size as "The number of octets in the frame from the first octet following the preamble to the end of the FCS, if present, or to the last octet of the data if there is no FCS." (RFC 1242). This means that the frame size can vary from 64 to 1518 bytes (when the 802.1Q tag is absent) or from 64 to 1522 bytes (when this tag is present).

Ethernet Header

6 bytes | 6 bytes | 4 bytes | 2 bytes
MAC destination | MAC source | 802.1Q (optional) | EtherType

The EtherType is a two-byte field. It takes one of two meanings, depending on its value. If the value is less than or equal to 1500, it indicates a length. If the value is greater than or equal to 0x0600 (1536), it represents a Type, indicating which protocol is encapsulated in the payload of the Ethernet frame. Typical values are 0x0800 for IPv4 and 0x0806 for an ARP frame. So, the Ethernet header length (when the 802.1Q tag is absent) is 14 bytes. Any frame which is received and which is less than 64 bytes (18 bytes of header/CRC and 46 bytes of data) is illegal.
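The length-versus-type rule for the EtherType field described above can be sketched as a small helper (an illustrative function, not from any library; values between 1501 and 1535 are undefined by the standard):

```python
def interpret_ethertype(value):
    # IEEE 802.3: values up to 1500 indicate a payload length; values from
    # 0x0600 (1536) up identify the encapsulated protocol (a "Type").
    # Values in between (1501..1535) are undefined.
    if value <= 1500:
        return ("length", value)
    if value >= 0x0600:
        return ("type", value)
    return ("invalid", value)

print(interpret_ethertype(0x0800))  # ('type', 2048)  -> IPv4
print(interpret_ethertype(0x0806))  # ('type', 2054)  -> ARP
print(interpret_ethertype(46))      # ('length', 46)
```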

IPv4 Packet

The IP packet is transmitted as the payload of an Ethernet frame. Its structure is (with each line representing 32 bits):

Version (4 bits) / Header Length (4 bits) | DSCP (6 bits) | ECN (2 bits) | Total Length (16 bits)
Identification | Flags | Fragment Offset
TTL | Protocol | Header Checksum
Source IP Address
Destination IP Address
Options (if header length is greater than 5)
Data

The Total Length 16-bit field defines the entire packet (fragment) size, including header and data, in bytes. The IP header length (when options are absent) is 20 bytes. The total length of the IP packet varies from 20 to 65535 bytes. The MTU is the maximum size of IP packet that can be transmitted without fragmentation, including IP headers but excluding headers from lower levels in the protocol stack (so excluding Ethernet headers). If the network layer wishes to send less than 46 bytes of data, the MAC protocol adds a sufficient number of zero bytes (0x00, also known as null padding characters).

IPv6 Packet

The total length of the IPv6 packet varies from 40 to 65575 bytes (including the header):

Version (4 bits) | Traffic Class (8 bits) | Flow Label (20 bits)
Payload Length (16 bits) | Next Header (8 bits) | Hop Limit (8 bits)
Source IP Address (16 bytes)
Destination IP Address (16 bytes)

UDP Datagram

A UDP datagram is transmitted as the data of an IP packet. Length is the length in octets of this user datagram, including this header and the data; this means that the minimum value of the length is eight. A UDP payload of 0 bytes results in an 8-byte UDP datagram (8 bytes UDP header), a 28-byte IPv4 packet (20 bytes IP header) and a 64-byte Ethernet frame (14 bytes Ethernet header, 4 bytes CRC, plus 18 bytes of padding). In IPv6, it would result in a 66-byte frame (14 + 40 + 8 + 4 bytes). A UDP payload of 1 byte results in a 9-byte UDP datagram (8 bytes UDP header), a 29-byte IPv4 packet (20 bytes IP header) and a 64-byte Ethernet frame (14 bytes Ethernet header, 4 bytes CRC, plus 17 bytes of padding). A UDP payload of 18 bytes results in a 26-byte UDP datagram, a 46-byte IPv4 packet and a 64-byte Ethernet frame (14 bytes Ethernet header, 4 bytes CRC).

16 bits | 16 bits
Source Port | Destination Port
Length | CRC
Data/UDP Payload

TCP Segment

The header size varies from 5 words of 32 bits to 15 words of 32 bits, so from 20 to 60 bytes; the options vary from 0 to 40 bytes. A TCP payload of 6 bytes results in a 26-byte TCP segment (when no options are present in the TCP header), a 46-byte IP packet and a 64-byte Ethernet frame (14 bytes Ethernet header, 4 bytes CRC).

Source Port (16 bits) | Destination Port (16 bits)
Sequence Number
Acknowledgement Number
Offset | Reserved | Flags | Window Size
Checksum | Urgent Pointer
Options (+ padding)
Data/TCP Payload

MPLS Header

The EtherType field of the Ethernet header is 0x8847 for MPLS unicast and 0x8848 for MPLS multicast. Note that in the MPLS header there is no network-level protocol identifier (no EtherType).

Label (20 bits) | Traffic Class (3 bits) | Bottom of Stack (1 bit) | TTL (8 bits)
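The UDP frame-size arithmetic above (UDP payload to Ethernet frame size, with MAC-layer padding up to the 64-byte minimum) can be reproduced with a short sketch:

```python
# Header sizes for a plain (untagged) UDP-over-IPv4 Ethernet frame.
ETH_HDR, IPV4_HDR, UDP_HDR, CRC, MIN_FRAME = 14, 20, 8, 4, 64

def eth_frame_size(udp_payload):
    # Frame size from the Ethernet header through the FCS (preamble and
    # interframe gap excluded); the MAC layer pads up to 64 bytes.
    size = ETH_HDR + IPV4_HDR + UDP_HDR + udp_payload + CRC
    return max(size, MIN_FRAME)

print(eth_frame_size(0))     # 64  (46 bytes of frame content + 18 bytes padding)
print(eth_frame_size(18))    # 64  (exactly at the minimum, no padding needed)
print(eth_frame_size(1472))  # 1518 (maximum standard frame without a VLAN tag)
```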

VLAN (802.1Q)

802.1Q adds this 32-bit field between the MAC addresses and the EtherType:

TPID (Tag Protocol ID, 16 bits) = 0x8100 | Tag Control Information: PCP (3 bits) | DEI (1 bit) | VLAN ID (12 bits)

QinQ (802.1ad)

802.1ad, also known as double tagging, permits the use of an outer VLAN (called SVLAN, service VLAN) and an inner VLAN (CVLAN, customer VLAN):

TPID (Tag Protocol ID, 16 bits) = 0x88a8 | Tag Control Information: PCP (3 bits) | DEI (1 bit) | SVLAN (12 bits)
TPID (Tag Protocol ID, 16 bits) = 0x8100 | Tag Control Information: PCP (3 bits) | DEI (1 bit) | CVLAN (12 bits)

GRE Tunneling

The GRE header is defined in RFC 1701:

C|R|K|S|s (5 bits) | Recur (3 bits) | Flags (5 bits) | Ver (3 bits) | Protocol Type (16 bits)
Checksum (optional, if C==1) | Offset (optional, if R==1)
Key (optional, if K==1)
Sequence Number (optional, if S==1)
Routing (optional, if R==1)
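The 802.1Q/QinQ tag layouts above can be built directly from their bit fields. A minimal sketch (the VLAN IDs 100 and 42 are arbitrary example values):

```python
import struct

def vlan_tag(tpid, pcp, dei, vid):
    # One 802.1Q tag: a 16-bit TPID followed by the 16-bit Tag Control
    # Information field, laid out as PCP (3 bits) | DEI (1 bit) | VID (12 bits).
    tci = (pcp << 13) | (dei << 12) | (vid & 0x0FFF)
    return struct.pack("!HH", tpid, tci)  # network (big-endian) byte order

# QinQ (802.1ad): outer service tag uses TPID 0x88a8, inner customer
# tag uses TPID 0x8100.
qinq = vlan_tag(0x88A8, 0, 0, 100) + vlan_tag(0x8100, 0, 0, 42)
print(qinq.hex())  # 88a800648100002a
```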

11.2 Find the Socket of a PCIe Interface

First find all Intel NICs:

lspci -nn | grep Ether | grep Intel | grep 82599
02:00.0 Ethernet controller [0200]: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
02:00.1 Ethernet controller [0200]: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
83:00.0 Ethernet controller [0200]: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
83:00.1 Ethernet controller [0200]: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
85:00.0 Ethernet controller [0200]: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
85:00.1 Ethernet controller [0200]: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)

Then check to which NUMA node each one is attached:

cat /sys/bus/pci/devices/0000\:02\:00.0/numa_node
0
cat /sys/bus/pci/devices/0000\:83\:00.0/numa_node

11.3 Find the Number of Lanes Used by a PCIe Interface

lspci -nn | grep Ether | grep Intel | grep 82599
02:00.0 Ethernet controller [0200]: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)

lspci -s 02:00.0 -vv | grep LnkSta
LnkSta: Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

11.4 Hash Table Statistics

The problem is equivalent to having M balls and N boxes, each box having the possibility to hold k balls. The probability that the first box contains exactly k balls is given by the binomial probability model. If M and N are big and k is small, we have a Poisson distribution:

With M balls thrown uniformly into N boxes, the probability that the first box contains exactly k balls is:

P(X=k) = [M(M-1)(M-2)...(M-k+1) / k!] * (1/N)^k * (1 - 1/N)^(M-k)

If M is big and k is small, then M(M-1)...(M-k+1) ≈ M^k. If N is also big, then (1 - 1/N)^(M-k) ≈ e^(-M/N), using e^(-x) = lim_{N→∞} (1 - 1/N)^(Nx). Writing λ = M/N, this gives the Poisson distribution:

P(X=k) = (λ^k / k!) e^(-λ)

The probability that the first box has at most k balls is:

P(X≤k) = P(X=0) + P(X=1) + ... + P(X=k) = Σ_{i=0}^{k} (λ^i / i!) e^(-λ)

If k is big, we can use a Taylor expansion to estimate the sum and Stirling's formula to estimate the factorial:

Taylor: Σ_{i=0}^{k} λ^i / i! ≈ e^λ - λ^(k+1) / (k+1)!
Stirling: k! ≈ √(2πk) (k/e)^k

For all the boxes, we have:

P ≈ (P(X≤k))^N
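The balls-in-boxes probabilities above are easy to check numerically. A small sketch comparing the exact binomial model with its Poisson approximation (the M = 64K flows and N = 16K buckets are hypothetical example values, not figures from this document):

```python
from math import comb, exp, factorial

def p_binomial(M, N, k):
    # Exact probability that a given box holds k of M balls thrown
    # uniformly at random into N boxes.
    p = 1.0 / N
    return comb(M, k) * p**k * (1 - p)**(M - k)

def p_poisson(M, N, k):
    # Poisson approximation with lambda = M/N (valid for large M and N,
    # small k).
    lam = M / N
    return lam**k * exp(-lam) / factorial(k)

M, N = 65536, 16384   # e.g. 64K hash keys over 16K buckets -> lambda = 4
for k in range(4):
    print(k, round(p_binomial(M, N, k), 6), round(p_poisson(M, N, k), 6))
```

For these values the two models agree to roughly four decimal places, confirming that the Poisson form is a good working approximation for hash-bucket occupancy.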

11.5 Qemu Command Line Parameters

taskset 0xff00ff qemu-system-x86_64 -cpu host -enable-kvm -m 3000 \
 -smp 16,sockets=1,cores=16,threads=1 -name VM1 -hda /Images/VM1.img \
 -net nic,model=e1000,macaddr=00:00:01:00:00:01 \
 -net tap,ifname=tap1,script=no,downscript=no,vhost=on \
 -device pci-assign,host=06:00.0 \
 -device pci-assign,host=06:00.1 \
 -device pci-assign,host=08:00.0 \
 -device pci-assign,host=08:00.1 \
 -mem-path /mnt/huge \
 -mem-prealloc \
 -vnc :

11.6 BRAS Configuration

The test generation tool must be properly configured to use the qinq, gre_id and IP addresses as configured within the BRAS. The test generation tool must be configured with four interfaces: two at the CPE side and two at the core side.

CPE Side UDP Packets

The CPE side must send QinQ IPv4 packets which are 64 bytes in total. While a UDP payload of 10 bytes (UDP length = 18 bytes including the UDP header) would fulfill this requirement (14 + 8 + 20 + 18 + 4 = 64 bytes), we use a UDP payload of 0 bytes (UDP length = 8 bytes including the header). The reason is that this packet will get encapsulated in MPLS and GRE, so the packet length will increase; in order to measure the performance with the smallest packet size, we must use a UDP payload of 0. UDP payload = 0 => CPE packet size = 64 bytes (including 10 bytes of padding), network core packet size = 78 bytes. If the UDP payload was set to 10, the CPE packet size would still be 64 bytes, but the core network packet size would be 88 bytes.
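The packet-size accounting above can be reproduced with a small sketch. The byte-level split — in particular that a single MPLS label adds 4 bytes and that the GRE header is 8 bytes (4-byte base header plus a 4-byte key carrying the gre_id) — is our reading of the encapsulation described in this document, not an authoritative breakdown:

```python
# Assumed header sizes (bytes) for the BRAS traffic described in the text.
ETH, QINQ, MPLS, IPV4, GRE_WITH_KEY, UDP, CRC = 14, 8, 4, 20, 8, 8, 4

def cpe_frame(udp_payload):
    # QinQ-tagged UDP/IPv4 frame at the CPE side, padded to the 64-byte
    # Ethernet minimum.
    return max(ETH + QINQ + IPV4 + UDP + udp_payload + CRC, 64)

def core_frame(udp_payload):
    # The same flow after BRAS encapsulation on the core side: one MPLS
    # label, an outer IPv4 header, and a GRE header with key (gre_id).
    return ETH + MPLS + IPV4 + GRE_WITH_KEY + IPV4 + UDP + udp_payload + CRC

print(cpe_frame(0), core_frame(0))    # 64 78, as stated above
print(cpe_frame(10), core_frame(10))  # 64 88, as stated above
```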

7 bytes | 1 byte | 14 bytes | 8 bytes | 46 to 1500 bytes | 4 bytes | 12 bytes
Preamble | Start of frame delimiter | Ethernet Header | QinQ tags | Payload | FCS/CRC | Interframe gap

Payload = IP:

Version/Length | DSCP | ECN | Total Length
Identification | Flags | Fragment Offset
TTL | Protocol | Header Checksum
Source IP Address
Destination IP Address
Options (if header length is greater than 5)
Data

Data = UDP:

Source Port (16 bits) | Destination Port (16 bits)
Length | CRC
Data/UDP Payload

QinQ:

TPID (Tag Protocol ID, 16 bits) = 0x88a8 | Tag Control Information: PCP (3 bits) | DEI (1 bit) | SVLAN (12 bits)
TPID (Tag Protocol ID, 16 bits) = 0x8100 | Tag Control Information: PCP (3 bits) | DEI (1 bit) | CVLAN (12 bits)

The SVLAN format is 00000xxxxxxx on interface 0 and 00001xxxxxxx on interface 1, where x refers to a random bit (so we have 128 possible SVLANs on each CPE port). The CVLAN format is xxxx00xx00xx for both ports (i.e. 256 possible CVLANs). The combination of SVLAN and CVLAN results in 128 * 256 = 32768 different CPEs on each CPE side. We use a complex pattern for the CVLAN to be in a worst-case scenario: using all 4096 possible CVLANs (and 16 SVLANs) would result in better performance due to better cache affinity.

Destination IP Address

The performance of the LPM routing lookups might depend on the route configuration: the depth of each route (/24, /27, /16, ...), whether all routes are equally used or some routes are used more than others (resulting in better cache affinity), or even whether some routes can share the same cache lines. The routing tables in the BRAS were populated with 8192 routes, all being /24. Specifically, the routes are of the form x.xxxxxxxx.xxxx /24, where each x is randomly a 0 or a 1. The gap in the routes (the last four 0 bits) should give a rather worst-case scenario, where most of those routes will hit different cache lines. Traffic was generated equally distributed among those routes, so that no route was used more than the others. IP destination addresses were generated randomly (not sequentially) by the load generator, within the range of the routes. Using sequential /24 routes would result in better performance (better cache affinity).

Source IP Address

We used the same source IP address for all CPEs (the CPEs anyhow differ by their QinQ tags).

MAC Source Address

Ideally, use a different MAC address per CPE. We used 00:00:00:00:00:01 for all CPEs. They should of course use different MAC addresses, but this does not influence the performance (CPE MAC addresses are stored in a table).

MAC Destination Address

Ideally, use the MAC address of the destination port.
However, we used 00:00:00:00:00:02, as we configured the BRAS to work in promiscuous mode (so it is easier to change interfaces, for instance).

CPE Side ARP Packets

When the BRAS must send a packet from the core network towards the CPE, it must know the MAC address of the CPE. It knows this because it stores the CPE MAC address (per GRE_id, or per QinQ/IP address) when it receives ARP messages from the CPE. This also means that the BRAS works as a (very basic) firewall, as no packet will be sent to the CPE until some (ARP) packets were sent from the CPE. So, to have bi-directional traffic working, we must also generate ARP traffic from the CPE side. Those packets should use the same VLAN tags as the other packets from the CPE side. They should be 64-byte ARP requests, with the sender protocol address (src IP) matching the CPE source IP. Other fields are not important; we used sender hardware address 00:00:00:00:00:01 and target hardware address 00:00:00:00:00:02. One ARP is sent every 500 QinQ packets. Those packets are used by the BRAS to update its tables, but they are not forwarded to the core network.

Network Core Side

The same kind of configuration must be done at the core side. The core side generates IP/UDP packets encapsulated in GRE tunnels and MPLS tunnels.

7 bytes | 1 byte | 14 bytes | 4 bytes | 46 to 1500 bytes | 4 bytes | 12 bytes
Preamble | Start of frame delimiter | Ethernet Header | MPLS Header | Payload | FCS/CRC | Interframe gap

MPLS Header:

Label (20 bits) | Traffic Class (3 bits) | Bottom of Stack (1 bit) | TTL (8 bits)

The content of the MPLS header is not important. It is only important to make sure that BoS is set to 1 (one level of MPLS tags).
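The SVLAN/CVLAN tag-space arithmetic used at the CPE side (128 SVLANs times 256 CVLANs per side) can be checked by counting the free bits in the tag patterns described above; a trivial sketch:

```python
def combinations_from_pattern(pattern):
    # Number of distinct values a 12-bit tag pattern can take, where 'x'
    # marks a random bit and fixed characters are literal bits.
    return 2 ** pattern.count("x")

svlan_per_port = combinations_from_pattern("00000xxxxxxx")  # 128 SVLANs
cvlan = combinations_from_pattern("xxxx00xx00xx")           # 256 CVLANs
print(svlan_per_port * cvlan)  # 32768 distinct CPEs on each CPE side
```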

Payload = IP:

Version/Length | DSCP | ECN | Total Length
Identification | Flags | Fragment Offset
TTL | Protocol | Header Checksum
Source IP Address
Destination IP Address
Options (if header length is greater than 5)
Data

Data = GRE Header:

C|R|K|S|s (5 bits) | Recur (3 bits) | Flags (5 bits) | Ver (3 bits) | Protocol Type (16 bits)
Checksum (optional, if C==1) | Offset (optional, if R==1)
Key (optional, if K==1)
Sequence Number (optional, if S==1)
Routing (optional, if R==1)

Payload = IP:

Version/Length | DSCP | ECN | Total Length
Identification | Flags | Fragment Offset
TTL | Protocol | Header Checksum
Source IP Address
Destination IP Address
Options (if header length is greater than 5)
Data

Data = UDP:

Source Port (16 bits) | Destination Port (16 bits)
Length | CRC
Data/UDP Payload

The UDP length should be set to 8 (i.e. no UDP payload), as this results in the smallest packet size (14 + 4 + 20 + 8 + 20 + 8 + 4 = 78 bytes). The BRAS will add padding when converting the packet to QinQ, so that the packet is not smaller than 64 bytes.

Source IP Address

We used the same source IP address for all users (it does not influence the performance).

Destination IP Address

The destination IP addresses must be the same addresses as configured as source IP addresses at the CPE side.

MAC Source Address

We used 00:00:00:00:00:01 (this should not influence the performance).

MAC Destination Address

Ideally, use the MAC address of the destination port. However, we used 00:00:00:00:00:02, as we configured the BRAS to work in promiscuous mode (so it is easier to change interfaces, for instance).

Gre_id

The gre_id is a 32-bit value. As this number is only used through hashes, we used sequential values starting at 0 for simplicity.

11.7 Ixia/Spirent Screenshots

CPE Side Ixia Screenshots





CPE Side Packet Decoding

Core Side Screenshots




Core Side Packet Decoding

12 Glossary

ABBREVIATION | DEFINITION
ATR | Application Targeted Routing
COTS | Commercial Off-The-Shelf
CPE | Customer Premises Equipment
DPI | Deep Packet Inspection
FCS | Frame Check Sequence
GRE | Generic Routing Encapsulation
GRO | Generic Receive Offload
IOMMU | Input/Output Memory Management Unit
kpps | Kilo packets per second
KVM | Kernel-based Virtual Machine
LRO | Large Receive Offload
MSI | Message Signaled Interrupt
MPLS | Multiprotocol Label Switching
Mpps | Millions of packets per second
NFV | Network Function Virtualization
Pps | Packets per second
QinQ | VLAN stacking (802.1ad)
RA | Reference Architecture
RSC | Receive Side Coalescing
RSS | Receive Side Scaling
SP | Service Provider
SR-IOV | Single Root I/O Virtualization
TCO | Total Cost of Ownership
TSO | TCP Segmentation Offload
vBRAS | Virtual Broadband Remote Access Server

13 References

- Internet Protocol version 4
- RFC 1242 (Benchmarking Terminology for Network Interconnection Devices)
- RFC 1701 (Generic Routing Encapsulation (GRE))
- Internet Protocol version 6
- Point to Point Protocol over Ethernet
- RFC 2544 (Benchmarking Methodology for Network Interconnect Devices)
- Multiprotocol Label Switching
- RFC 6349 (Framework for TCP Throughput Testing)
- Virtual Local Area Network: IEEE 802.1Q
- QinQ (multiple VLAN tags in an Ethernet frame): IEEE 802.1ad
- Intel Gigabit Ethernet Controller Datasheet
- Intel DDIO
- Bandwidth Sharing Fairness
- Design Considerations for Efficient Network Applications with Intel Multi-core Processor-based Systems on Linux
- OpenFlow with Intel (workshop /openflow_ pdf)
- Wu, W., DeMar, P. & Crawford, M. (2012). A Transport-Friendly NIC for Multicore/Multiprocessor Systems. IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 4, April 2012.
- Why Does Flow Director Cause Packet Reordering?
- Linux man pages (man taskset)
- Linux kernel documentation (Documentation/IRQ-affinity.txt)
- IA packet processing: Intel Cloud Builders Guide to Cloud Design and Deployment on Intel Platforms: High Performance Packet Processing on Cloud Platforms using Linux* with Intel Architecture (Packet%20Processing_2012_2.pdf)
- Intel DPDK Documentation

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Disclaimers

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked reserved or undefined. Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by visiting Intel's Web site.

Copyright 2014 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon inside, and Intel Intelligent Power Node Manager are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.


More information

Evaluation and Characterization of NFV Infrastructure Solutions on HP Server Platforms

Evaluation and Characterization of NFV Infrastructure Solutions on HP Server Platforms Evaluation and Characterization of NFV Infrastructure Solutions on HP Server Platforms DPDK Summit Al Sanders August 17, 2015 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained

More information

DPDK Summit 2014 DPDK in a Virtual World

DPDK Summit 2014 DPDK in a Virtual World DPDK Summit 2014 DPDK in a Virtual World Bhavesh Davda (Sr. Staff Engineer, CTO Office, ware) Rashmin Patel (DPDK Virtualization Engineer, Intel) Agenda Data Plane Virtualization Trends DPDK Virtualization

More information

Performance of Software Switching

Performance of Software Switching Performance of Software Switching Based on papers in IEEE HPSR 2011 and IFIP/ACM Performance 2011 Nuutti Varis, Jukka Manner Department of Communications and Networking (COMNET) Agenda Motivation Performance

More information

HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring

HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring CESNET Technical Report 2/2014 HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring VIKTOR PUš, LUKÁš KEKELY, MARTIN ŠPINLER, VÁCLAV HUMMEL, JAN PALIČKA Received 3. 10. 2014 Abstract

More information

Intel Data Direct I/O Technology (Intel DDIO): A Primer >

Intel Data Direct I/O Technology (Intel DDIO): A Primer > Intel Data Direct I/O Technology (Intel DDIO): A Primer > Technical Brief February 2012 Revision 1.0 Legal Statements INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,

More information

D1.2 Network Load Balancing

D1.2 Network Load Balancing D1. Network Load Balancing Ronald van der Pol, Freek Dijkstra, Igor Idziejczak, and Mark Meijerink SARA Computing and Networking Services, Science Park 11, 9 XG Amsterdam, The Netherlands June [email protected],[email protected],

More information

Network Virtualization Technologies and their Effect on Performance

Network Virtualization Technologies and their Effect on Performance Network Virtualization Technologies and their Effect on Performance Dror Goldenberg VP Software Architecture TCE NFV Winter School 2015 Cloud Computing and NFV Cloud - scalable computing resources (CPU,

More information

Assessing the Performance of Virtualization Technologies for NFV: a Preliminary Benchmarking

Assessing the Performance of Virtualization Technologies for NFV: a Preliminary Benchmarking Assessing the Performance of Virtualization Technologies for NFV: a Preliminary Benchmarking Roberto Bonafiglia, Ivano Cerrato, Francesco Ciaccia, Mario Nemirovsky, Fulvio Risso Politecnico di Torino,

More information

OpenFlow with Intel 82599. Voravit Tanyingyong, Markus Hidell, Peter Sjödin

OpenFlow with Intel 82599. Voravit Tanyingyong, Markus Hidell, Peter Sjödin OpenFlow with Intel 82599 Voravit Tanyingyong, Markus Hidell, Peter Sjödin Outline Background Goal Design Experiment and Evaluation Conclusion OpenFlow SW HW Open up commercial network hardware for experiment

More information

Wireshark in a Multi-Core Environment Using Hardware Acceleration Presenter: Pete Sanders, Napatech Inc. Sharkfest 2009 Stanford University

Wireshark in a Multi-Core Environment Using Hardware Acceleration Presenter: Pete Sanders, Napatech Inc. Sharkfest 2009 Stanford University Wireshark in a Multi-Core Environment Using Hardware Acceleration Presenter: Pete Sanders, Napatech Inc. Sharkfest 2009 Stanford University Napatech - Sharkfest 2009 1 Presentation Overview About Napatech

More information

Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build 164009

Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build 164009 Performance Study Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build 164009 Introduction With more and more mission critical networking intensive workloads being virtualized

More information

Leveraging NIC Technology to Improve Network Performance in VMware vsphere

Leveraging NIC Technology to Improve Network Performance in VMware vsphere Leveraging NIC Technology to Improve Network Performance in VMware vsphere Performance Study TECHNICAL WHITE PAPER Table of Contents Introduction... 3 Hardware Description... 3 List of Features... 4 NetQueue...

More information

High-performance vswitch of the user, by the user, for the user

High-performance vswitch of the user, by the user, for the user A bird in cloud High-performance vswitch of the user, by the user, for the user Yoshihiro Nakajima, Wataru Ishida, Tomonori Fujita, Takahashi Hirokazu, Tomoya Hibi, Hitoshi Matsutahi, Katsuhiro Shimano

More information

VMWARE WHITE PAPER 1

VMWARE WHITE PAPER 1 1 VMWARE WHITE PAPER Introduction This paper outlines the considerations that affect network throughput. The paper examines the applications deployed on top of a virtual infrastructure and discusses the

More information

Creating Overlay Networks Using Intel Ethernet Converged Network Adapters

Creating Overlay Networks Using Intel Ethernet Converged Network Adapters Creating Overlay Networks Using Intel Ethernet Converged Network Adapters Technical Brief Networking Division (ND) August 2013 Revision 1.0 LEGAL INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION

More information

Accelerating Network Virtualization Overlays with QLogic Intelligent Ethernet Adapters

Accelerating Network Virtualization Overlays with QLogic Intelligent Ethernet Adapters Enterprise Strategy Group Getting to the bigger truth. ESG Lab Review Accelerating Network Virtualization Overlays with QLogic Intelligent Ethernet Adapters Date: June 2016 Author: Jack Poller, Senior

More information

VXLAN: Scaling Data Center Capacity. White Paper

VXLAN: Scaling Data Center Capacity. White Paper VXLAN: Scaling Data Center Capacity White Paper Virtual Extensible LAN (VXLAN) Overview This document provides an overview of how VXLAN works. It also provides criteria to help determine when and where

More information

Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck

Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck Sockets vs. RDMA Interface over 1-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck Pavan Balaji Hemal V. Shah D. K. Panda Network Based Computing Lab Computer Science and Engineering

More information

Intel Open Network Platform Release 2.1: Driving Network Transformation

Intel Open Network Platform Release 2.1: Driving Network Transformation data sheet Intel Open Network Platform Release 2.1: Driving Network Transformation This new release of the Intel Open Network Platform () introduces added functionality, enhanced performance, and greater

More information

Network Function Virtualization Using Data Plane Developer s Kit

Network Function Virtualization Using Data Plane Developer s Kit Network Function Virtualization Using Enabling 25GbE to 100GbE Virtual Network Functions with QLogic FastLinQ Intelligent Ethernet Adapters DPDK addresses key scalability issues of NFV workloads QLogic

More information

Big Data Technologies for Ultra-High-Speed Data Transfer and Processing

Big Data Technologies for Ultra-High-Speed Data Transfer and Processing White Paper Intel Xeon Processor E5 Family Big Data Analytics Cloud Computing Solutions Big Data Technologies for Ultra-High-Speed Data Transfer and Processing Using Technologies from Aspera and Intel

More information

An Oracle Technical White Paper November 2011. Oracle Solaris 11 Network Virtualization and Network Resource Management

An Oracle Technical White Paper November 2011. Oracle Solaris 11 Network Virtualization and Network Resource Management An Oracle Technical White Paper November 2011 Oracle Solaris 11 Network Virtualization and Network Resource Management Executive Overview... 2 Introduction... 2 Network Virtualization... 2 Network Resource

More information

VNF & Performance: A practical approach

VNF & Performance: A practical approach VNF & Performance: A practical approach Luc Provoost Engineering Manager, Network Product Group Intel Corporation SDN and NFV are Forces of Change One Application Per System Many Applications Per Virtual

More information

Benchmarking Virtual Switches in OPNFV draft-vsperf-bmwg-vswitch-opnfv-00. Maryam Tahhan Al Morton

Benchmarking Virtual Switches in OPNFV draft-vsperf-bmwg-vswitch-opnfv-00. Maryam Tahhan Al Morton Benchmarking Virtual Switches in OPNFV draft-vsperf-bmwg-vswitch-opnfv-00 Maryam Tahhan Al Morton Introduction Maryam Tahhan Network Software Engineer Intel Corporation (Shannon Ireland). VSPERF project

More information

Bridging the Gap between Software and Hardware Techniques for I/O Virtualization

Bridging the Gap between Software and Hardware Techniques for I/O Virtualization Bridging the Gap between Software and Hardware Techniques for I/O Virtualization Jose Renato Santos Yoshio Turner G.(John) Janakiraman Ian Pratt Hewlett Packard Laboratories, Palo Alto, CA University of

More information

Broadcom Ethernet Network Controller Enhanced Virtualization Functionality

Broadcom Ethernet Network Controller Enhanced Virtualization Functionality White Paper Broadcom Ethernet Network Controller Enhanced Virtualization Functionality Advancements in VMware virtualization technology coupled with the increasing processing capability of hardware platforms

More information

Cisco Integrated Services Routers Performance Overview

Cisco Integrated Services Routers Performance Overview Integrated Services Routers Performance Overview What You Will Learn The Integrated Services Routers Generation 2 (ISR G2) provide a robust platform for delivering WAN services, unified communications,

More information

Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family

Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family White Paper June, 2008 Legal INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL

More information

SDN software switch Lagopus and NFV enabled software node

SDN software switch Lagopus and NFV enabled software node SDN software switch Lagopus and NFV enabled software node Kazuaki OBANA NTT Network Innovation Laboratories SDN software switch Lagopus 1 Motivation Agile and flexible networking Full automation in provisioning,

More information

Linux NIC and iscsi Performance over 40GbE

Linux NIC and iscsi Performance over 40GbE Linux NIC and iscsi Performance over 4GbE Chelsio T8-CR vs. Intel Fortville XL71 Executive Summary This paper presents NIC and iscsi performance results comparing Chelsio s T8-CR and Intel s latest XL71

More information

Evaluating the Suitability of Server Network Cards for Software Routers

Evaluating the Suitability of Server Network Cards for Software Routers Evaluating the Suitability of Server Network Cards for Software Routers Maziar Manesh Katerina Argyraki Mihai Dobrescu Norbert Egi Kevin Fall Gianluca Iannaccone Eddie Kohler Sylvia Ratnasamy EPFL, UCLA,

More information

PCI Express Overview. And, by the way, they need to do it in less time.

PCI Express Overview. And, by the way, they need to do it in less time. PCI Express Overview Introduction This paper is intended to introduce design engineers, system architects and business managers to the PCI Express protocol and how this interconnect technology fits into

More information

TCP Offload Engines. As network interconnect speeds advance to Gigabit. Introduction to

TCP Offload Engines. As network interconnect speeds advance to Gigabit. Introduction to Introduction to TCP Offload Engines By implementing a TCP Offload Engine (TOE) in high-speed computing environments, administrators can help relieve network bottlenecks and improve application performance.

More information

Introduction to Intel Ethernet Flow Director and Memcached Performance

Introduction to Intel Ethernet Flow Director and Memcached Performance White Paper Intel Ethernet Flow Director Introduction to Intel Ethernet Flow Director and Memcached Performance Problem Statement Bob Metcalfe, then at Xerox PARC, invented Ethernet in 1973, over forty

More information

SUN DUAL PORT 10GBase-T ETHERNET NETWORKING CARDS

SUN DUAL PORT 10GBase-T ETHERNET NETWORKING CARDS SUN DUAL PORT 10GBase-T ETHERNET NETWORKING CARDS ADVANCED PCIE 2.0 10GBASE-T ETHERNET NETWORKING FOR SUN BLADE AND RACK SERVERS KEY FEATURES Low profile adapter and ExpressModule form factors for Oracle

More information

Networking Driver Performance and Measurement - e1000 A Case Study

Networking Driver Performance and Measurement - e1000 A Case Study Networking Driver Performance and Measurement - e1000 A Case Study John A. Ronciak Intel Corporation [email protected] Ganesh Venkatesan Intel Corporation [email protected] Jesse Brandeburg

More information

Telecom - The technology behind

Telecom - The technology behind SPEED MATTERS v9.3. All rights reserved. All brand names, trademarks and copyright information cited in this presentation shall remain the property of its registered owners. Telecom - The technology behind

More information

Getting the most TCP/IP from your Embedded Processor

Getting the most TCP/IP from your Embedded Processor Getting the most TCP/IP from your Embedded Processor Overview Introduction to TCP/IP Protocol Suite Embedded TCP/IP Applications TCP Termination Challenges TCP Acceleration Techniques 2 Getting the most

More information

Enabling Technologies for Distributed and Cloud Computing

Enabling Technologies for Distributed and Cloud Computing Enabling Technologies for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Multi-core CPUs and Multithreading

More information

PE310G4BPi40-T Quad port Copper 10 Gigabit Ethernet PCI Express Bypass Server Intel based

PE310G4BPi40-T Quad port Copper 10 Gigabit Ethernet PCI Express Bypass Server Intel based PE310G4BPi40-T Quad port Copper 10 Gigabit Ethernet PCI Express Bypass Server Intel based Description Silicom s quad port Copper 10 Gigabit Ethernet Bypass server adapter is a PCI-Express X8 network interface

More information

PE210G6SPi9 Six Port Fiber 10 Gigabit Ethernet PCI Express Server Adapter Intel based. Description. Key Features

PE210G6SPi9 Six Port Fiber 10 Gigabit Ethernet PCI Express Server Adapter Intel based. Description. Key Features PE210G6SPi9 Six Port Fiber 10 Gigabit Ethernet PCI Express Server Adapter Intel based Description Silicom s 10 Gigabit Ethernet PCI Express server adapters are designed for Servers and high-end appliances.

More information

Demartek June 2012. Broadcom FCoE/iSCSI and IP Networking Adapter Evaluation. Introduction. Evaluation Environment

Demartek June 2012. Broadcom FCoE/iSCSI and IP Networking Adapter Evaluation. Introduction. Evaluation Environment June 212 FCoE/iSCSI and IP Networking Adapter Evaluation Evaluation report prepared under contract with Corporation Introduction Enterprises are moving towards 1 Gigabit networking infrastructures and

More information

1000-Channel IP System Architecture for DSS

1000-Channel IP System Architecture for DSS Solution Blueprint Intel Core i5 Processor Intel Core i7 Processor Intel Xeon Processor Intel Digital Security Surveillance 1000-Channel IP System Architecture for DSS NUUO*, Qsan*, and Intel deliver a

More information

White Paper. Innovate Telecom Services with NFV and SDN

White Paper. Innovate Telecom Services with NFV and SDN White Paper Innovate Telecom Services with NFV and SDN 2 NEXCOM White Paper As telecommunications companies seek to expand beyond telecommunications services to data services, they find their purposebuilt

More information

Performance of Network Virtualization in Cloud Computing Infrastructures: The OpenStack Case.

Performance of Network Virtualization in Cloud Computing Infrastructures: The OpenStack Case. Performance of Network Virtualization in Cloud Computing Infrastructures: The OpenStack Case. Franco Callegati, Walter Cerroni, Chiara Contoli, Giuliano Santandrea Dept. of Electrical, Electronic and Information

More information

High-Density Network Flow Monitoring

High-Density Network Flow Monitoring Petr Velan [email protected] High-Density Network Flow Monitoring IM2015 12 May 2015, Ottawa Motivation What is high-density flow monitoring? Monitor high traffic in as little rack units as possible

More information

Intel Xeon Processor E5-2600

Intel Xeon Processor E5-2600 Intel Xeon Processor E5-2600 Best combination of performance, power efficiency, and cost. Platform Microarchitecture Processor Socket Chipset Intel Xeon E5 Series Processors and the Intel C600 Chipset

More information

Jun (Jim) Xu [email protected] Principal Engineer, Futurewei Technologies, Inc.

Jun (Jim) Xu jun.xu@huawei.com Principal Engineer, Futurewei Technologies, Inc. Jun (Jim) Xu [email protected] Princial Engineer, Futurewei Technologies, Inc. Linux K/QEMU Switch/Router NFV Linux IP stack in Kernel ll lications will communicate via socket Limited raw socket alications

More information

基 於 SDN 與 可 程 式 化 硬 體 架 構 之 雲 端 網 路 系 統 交 換 器

基 於 SDN 與 可 程 式 化 硬 體 架 構 之 雲 端 網 路 系 統 交 換 器 基 於 SDN 與 可 程 式 化 硬 體 架 構 之 雲 端 網 路 系 統 交 換 器 楊 竹 星 教 授 國 立 成 功 大 學 電 機 工 程 學 系 Outline Introduction OpenFlow NetFPGA OpenFlow Switch on NetFPGA Development Cases Conclusion 2 Introduction With the proposal

More information

Securing the Intelligent Network

Securing the Intelligent Network WHITE PAPER Securing the Intelligent Network Securing the Intelligent Network New Threats Demand New Strategies The network is the door to your organization for both legitimate users and would-be attackers.

More information

CCNA R&S: Introduction to Networks. Chapter 5: Ethernet

CCNA R&S: Introduction to Networks. Chapter 5: Ethernet CCNA R&S: Introduction to Networks Chapter 5: Ethernet 5.0.1.1 Introduction The OSI physical layer provides the means to transport the bits that make up a data link layer frame across the network media.

More information

Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308

Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308 Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308 Laura Knapp WW Business Consultant [email protected] Applied Expert Systems, Inc. 2011 1 Background

More information

Building High-Performance iscsi SAN Configurations. An Alacritech and McDATA Technical Note

Building High-Performance iscsi SAN Configurations. An Alacritech and McDATA Technical Note Building High-Performance iscsi SAN Configurations An Alacritech and McDATA Technical Note Building High-Performance iscsi SAN Configurations An Alacritech and McDATA Technical Note Internet SCSI (iscsi)

More information

VLAN for DekTec Network Adapters

VLAN for DekTec Network Adapters Application Note DT-AN-IP-2 VLAN for DekTec Network Adapters 1. Introduction VLAN (Virtual LAN) is a technology to segment a single physical network into multiple independent virtual networks. The VLANs

More information

Accelerating 4G Network Performance

Accelerating 4G Network Performance WHITE PAPER Accelerating 4G Network Performance OFFLOADING VIRTUALIZED EPC TRAFFIC ON AN OVS-ENABLED NETRONOME INTELLIGENT SERVER ADAPTER NETRONOME AGILIO INTELLIGENT SERVER ADAPTERS PROVIDE A 5X INCREASE

More information

Datacenter Operating Systems

Datacenter Operating Systems Datacenter Operating Systems CSE451 Simon Peter With thanks to Timothy Roscoe (ETH Zurich) Autumn 2015 This Lecture What s a datacenter Why datacenters Types of datacenters Hyperscale datacenters Major

More information

Fibre Channel over Ethernet in the Data Center: An Introduction

Fibre Channel over Ethernet in the Data Center: An Introduction Fibre Channel over Ethernet in the Data Center: An Introduction Introduction Fibre Channel over Ethernet (FCoE) is a newly proposed standard that is being developed by INCITS T11. The FCoE protocol specification

More information

Achieving a High-Performance Virtual Network Infrastructure with PLUMgrid IO Visor & Mellanox ConnectX -3 Pro

Achieving a High-Performance Virtual Network Infrastructure with PLUMgrid IO Visor & Mellanox ConnectX -3 Pro Achieving a High-Performance Virtual Network Infrastructure with PLUMgrid IO Visor & Mellanox ConnectX -3 Pro Whitepaper What s wrong with today s clouds? Compute and storage virtualization has enabled

More information

Boosting Data Transfer with TCP Offload Engine Technology

Boosting Data Transfer with TCP Offload Engine Technology Boosting Data Transfer with TCP Offload Engine Technology on Ninth-Generation Dell PowerEdge Servers TCP/IP Offload Engine () technology makes its debut in the ninth generation of Dell PowerEdge servers,

More information

Network Simulation Traffic, Paths and Impairment

Network Simulation Traffic, Paths and Impairment Network Simulation Traffic, Paths and Impairment Summary Network simulation software and hardware appliances can emulate networks and network hardware. Wide Area Network (WAN) emulation, by simulating

More information

Software Datapath Acceleration for Stateless Packet Processing

Software Datapath Acceleration for Stateless Packet Processing June 22, 2010 Software Datapath Acceleration for Stateless Packet Processing FTF-NET-F0817 Ravi Malhotra Software Architect Reg. U.S. Pat. & Tm. Off. BeeKit, BeeStack, CoreNet, the Energy Efficient Solutions

More information

Accelerating High-Speed Networking with Intel I/O Acceleration Technology

Accelerating High-Speed Networking with Intel I/O Acceleration Technology White Paper Intel I/O Acceleration Technology Accelerating High-Speed Networking with Intel I/O Acceleration Technology The emergence of multi-gigabit Ethernet allows data centers to adapt to the increasing

More information

Performance Comparison of Fujitsu PRIMERGY and PRIMEPOWER Servers

Performance Comparison of Fujitsu PRIMERGY and PRIMEPOWER Servers WHITE PAPER FUJITSU PRIMERGY AND PRIMEPOWER SERVERS Performance Comparison of Fujitsu PRIMERGY and PRIMEPOWER Servers CHALLENGE Replace a Fujitsu PRIMEPOWER 2500 partition with a lower cost solution that

More information

Performance and Recommended Use of AB545A 4-Port Gigabit Ethernet Cards

Performance and Recommended Use of AB545A 4-Port Gigabit Ethernet Cards Performance and Recommended Use of AB545A 4-Port Gigabit Ethernet Cards From Results on an HP rx4640 Server Table of Contents June 2005 Introduction... 3 Recommended Use Based on Performance and Design...

More information

Architecture of distributed network processors: specifics of application in information security systems

Architecture of distributed network processors: specifics of application in information security systems Architecture of distributed network processors: specifics of application in information security systems V.Zaborovsky, Politechnical University, Sait-Petersburg, Russia [email protected] 1. Introduction Modern

More information

Chapter 9. IP Secure

Chapter 9. IP Secure Chapter 9 IP Secure 1 Network architecture is usually explained as a stack of different layers. Figure 1 explains the OSI (Open System Interconnect) model stack and IP (Internet Protocol) model stack.

More information

VXLAN Performance Evaluation on VMware vsphere 5.1

VXLAN Performance Evaluation on VMware vsphere 5.1 VXLAN Performance Evaluation on VMware vsphere 5.1 Performance Study TECHNICAL WHITEPAPER Table of Contents Introduction... 3 VXLAN Performance Considerations... 3 Test Configuration... 4 Results... 5

More information

Oracle SDN Performance Acceleration with Software-Defined Networking

Oracle SDN Performance Acceleration with Software-Defined Networking Oracle SDN Performance Acceleration with Software-Defined Networking Oracle SDN, which delivers software-defined networking, boosts application performance and management flexibility by dynamically connecting

More information

Design considerations for efficient network applications with Intel multi-core processor-based systems on Linux*

Design considerations for efficient network applications with Intel multi-core processor-based systems on Linux* White Paper Joseph Gasparakis Performance Products Division, Embedded & Communications Group, Intel Corporation Peter P Waskiewicz, Jr. LAN Access Division, Data Center Group, Intel Corporation Design

More information

The Role of Virtual Routers In Carrier Networks

The Role of Virtual Routers In Carrier Networks The Role of Virtual Routers In Carrier Networks Sterling d Perrin Senior Analyst, Heavy Reading Agenda Definitions of SDN and NFV Benefits of SDN and NFV Challenges and Inhibitors Some Use Cases Some Industry

More information

Frequently Asked Questions

Frequently Asked Questions Frequently Asked Questions 1. Q: What is the Network Data Tunnel? A: Network Data Tunnel (NDT) is a software-based solution that accelerates data transfer in point-to-point or point-to-multipoint network

More information

The Advantages of Multi-Port Network Adapters in an SWsoft Virtual Environment

The Advantages of Multi-Port Network Adapters in an SWsoft Virtual Environment The Advantages of Multi-Port Network Adapters in an SWsoft Virtual Environment Introduction... 2 Virtualization addresses key challenges facing IT today... 2 Introducing Virtuozzo... 2 A virtualized environment

More information

ODP Application proof point: OpenFastPath. ODP mini-summit 2015-11-10

ODP Application proof point: OpenFastPath. ODP mini-summit 2015-11-10 ODP Application proof point: OpenFastPath ODP mini-summit 2015-11-10 What is Our Intention with OpenFastPath? To enable efficient IP communication Essential in practically all networking use-cases, including

More information

Demonstrating the high performance and feature richness of the compact MX Series

Demonstrating the high performance and feature richness of the compact MX Series WHITE PAPER Midrange MX Series 3D Universal Edge Routers Evaluation Report Demonstrating the high performance and feature richness of the compact MX Series Copyright 2011, Juniper Networks, Inc. 1 Table

More information

Enabling Technologies for Distributed Computing

Enabling Technologies for Distributed Computing Enabling Technologies for Distributed Computing Dr. Sanjay P. Ahuja, Ph.D. Fidelity National Financial Distinguished Professor of CIS School of Computing, UNF Multi-core CPUs and Multithreading Technologies

More information

Using Network Virtualization to Scale Data Centers

Using Network Virtualization to Scale Data Centers Using Network Virtualization to Scale Data Centers Synopsys Santa Clara, CA USA November 2014 1 About Synopsys FY 2014 (Target) $2.055-2.065B* 9,225 Employees ~4,911 Masters / PhD Degrees ~2,248 Patents

More information

Integrated Network Acceleration Features.

Integrated Network Acceleration Features. Technical white paper Integrated Network Acceleration Features. of Intel I/O Acceleration Technology and. Microsoft Windows Server 2008 Table of contents Introduction... 2 What Causes I/O. Bottlenecks?...

More information

Big Data Technologies for Ultra-High-Speed Data Transfer in Life Sciences

Big Data Technologies for Ultra-High-Speed Data Transfer in Life Sciences WHITE PAPER Intel Xeon Processor E5 Family Big Data Analytics Big Data Technologies for Ultra-High-Speed Data Transfer in Life Sciences Using Aspera s High-Speed Data Transfer Technology to Achieve 10

More information

High-performance vnic framework for hypervisor-based NFV with userspace vswitch Yoshihiro Nakajima, Hitoshi Masutani, Hirokazu Takahashi NTT Labs.
