NFV Acceleration with the EZchip NPS-400 Network Processor




White Paper

Contents
1. NFV Overview
2. Virtualized Network Functions (VNFs)
3. NFV Challenges
3.1 Performance Challenges
3.2 Load Balancing Challenges
3.3 Performance Monitoring Challenges
3.4 Scale Challenges
3.5 Reliability Challenges
3.6 Security Challenges
3.7 Traffic Management Challenges
3.8 Virtualization Software Challenges
3.9 NFV Acceleration
4. NPS-400 NPU Features for NFV Acceleration
4.1 Performance
4.2 Load Balancing
4.3 Performance Monitoring
4.4 Scale
4.5 Reliability
4.6 Security
4.7 Traffic Management
5. Code Examples
6. Smart NFV ToR
7. Smart Edge Router Design
8. Smart Acceleration Appliance Design
9. White Box Designs
10. Software Architecture
11. Conclusion

1. NFV Overview

Network Function Virtualization (NFV) is a standard under development by ETSI that allows network functions such as load balancing, firewalls and switching to be implemented in software and run on industry-standard servers, virtualizing network functions currently deployed on dedicated network hardware. NFV will dramatically reduce expenses for carrier networks, while providing carriers with greater flexibility when deploying networks and adapting them to new customers and applications. Acceleration capabilities allow NFV to address most price/performance points, resulting in high market acceptance of NFV by both equipment manufacturers and carriers. The carriers driving the effort expect NFV to significantly reduce the operational and capital expenses of their networks by consolidating onto a smaller number of hardware platforms. While the performance of network functions is lower on NFV than on dedicated hardware optimized for the function, for many networks the performance is acceptable and is expected to scale up as server performance increases and technology is enhanced to deploy many Virtualized Network Functions (VNFs) in parallel.

NFV-based networks will also be more flexible than dedicated network devices, as additional VNFs and servers can be deployed from existing carrier data center resources when additional network services are needed. For information on the ETSI NFV work, visit http://www.etsi.org/technologies-clusters/technologies/nfv. Figure 1 shows the consolidation of network functions onto COTS servers.

Figure 1. NFV consolidation onto COTS servers (Source: Cisco Developers Conference 2013)

The 16 different network devices shown on the left side of Figure 1 require 16 different hardware platforms. With the advent of NFV, each network device is virtualized and deployed as a virtualized network function (VNF) running on a standard server and network infrastructure.

Concurrent with NFV development, the workload on large multi-tenant data centers is increasing. New applications are causing an explosion in the network bandwidth used, and much of the increase is in East-West traffic inside the data center. Intelligence is moving to the edge of the network, with multi-tenant networks implemented as overlay networks. The combination of increased packet rate and increased processing per packet impacts server performance. In the face of these increases in demand, server performance is falling behind relative to workload; for instance, GRE encapsulation performance on a 10G link is reported at 24% of line rate for average packet sizes, dropping to 5% of line rate for smaller packets.

Data centers need innovation to deliver higher density, lower cost and lower power consumption. NFV provides the flexible platform for innovation, but by itself does not address the performance, power or cost issues. NFV acceleration increases performance while reducing power consumption and cost. NFV acceleration offloads the data plane processing requirements for the virtual switch (OVS), TCP, DPI, encryption and potentially application-specific acceleration primitives, allowing the server to focus on the control plane. Separating and accelerating the data plane processing improves overall performance, allowing NFV to address larger networks, higher-speed links and higher-touch applications.

2. Virtualized Network Functions (VNFs)

Figure 2 shows how the physical Evolved Packet Core (EPC) devices in a carrier network could be replaced or augmented with a VNF running on a server in a remote data center.

Figure 2. A VNF in a remote data center replacing EPC devices in a carrier network, building out data center capacity throughout the network

Many network functions may be virtualized, including customer premise equipment, edge router functions and even enterprise data centers themselves. As more functions are virtualized, consolidating the virtual network functions into the same data center provides better communication between the VNFs and improves performance. Figure 3 shows the virtualization of physical network functions for security, load balancing, Deep Packet Inspection (intrusion detection or application recognition) and monitoring; the virtualized functions are then run as software VNFs in the data center.

Figure 3. NFV virtualization of networking functions: security, load balancing, DPI and monitoring functions moved into the data center

Depending on the data rates and the amount of processing required, any network function can be virtualized. Figure 4 shows the virtualization of other VNFs, and even enterprise data centers, in a carrier data center.

Figure 4. Virtualization of other VNFs, and even enterprise data centers, in a carrier data center

Each of the physical Network Functions can be virtualized and run as a software VNF on a server in the lower left of Figure 4. For many VNFs, the servers will be located close to the original NF to avoid backhauling traffic and increasing latency. Yet having the flexibility to add additional VNFs at a remote server pool as shown in Figure 3 allows the carrier to be more responsive to changes in demand, new customers, etc. Removing multiple hardware platforms from carrier network deployments can greatly reduce the operational costs of running a network by eliminating the specialized support required for the physical network function, spares, truck rolls to remote sites, etc. But the main advantage of NFV is providing more flexibility to the carriers, allowing fast service provisioning over the network without a change in the physical topology.

3. NFV Challenges

Let us address some of the most challenging aspects of implementing NFV in the data center.

3.1 Performance Challenges

The savings and flexibility of NFV come at a price: performance is lower and power consumption is higher on NFV networks. In most cases the control plane software for virtualized network functions can be ported to standard servers with little performance penalty. The performance challenges occur on the data plane. NFV performance can be addressed for many functions by deploying multiple VNFs on multiple VMs and/or cores. However, when deployed on multiple VMs, the vswitch in the hypervisor is not able to effectively use the Single Root IO Virtualization (SR-IOV) capabilities in the server and NIC because the packet is copied into the vswitch buffer, resulting in lower performance on the server. If VNFs are deployed on multiple cores, the power consumed can be dramatically higher than for dedicated network functions. In some instances VNFs running on multiple cores exhibit different network behavior, such as out-of-order delivery of packets, as a result of distributing the network functions. Standard servers shared by multiple VMs provide non-deterministic performance, with variation caused by shared processors, cache memory state and TLB misses.

3.2 Load Balancing Challenges

NFV can use the Intel Data Plane Development Kit (DPDK) to improve determinism and performance, but use of DPDK leads to data plane problems for higher-speed links. At 10G most functions require multiple cores to achieve line rate for full network function feature sets; at 40G or 100G many VNFs and cores are required. Load balancing is then needed to distribute the workload over multiple VNFs/cores, but the movement of blocks of data between cores can impact performance by reducing the effectiveness of the caches and increasing the rate of TLB misses. Some applications require that all the packets in a flow be processed by the same VNF. In addition, to ensure delivery of all packets in the flow, the load balancer must be stateful (a minimal flow-table sketch appears at the end of Section 3.4). Stateful load balancing proves to be a significant task if IP fragments are handled, imposing additional latency and using more computational resources, especially if the load balancer itself is a VNF. And even with a stateful load balancer, single-flow performance is limited to the performance of a single VNF/core if packet order is preserved by the VNF. High-capacity single-flow scenarios are a challenge for NFV, especially if packet order must be maintained.

3.3 Performance Monitoring Challenges

Another NFV challenge is performance monitoring for proper orchestration and service management. Effective performance monitors in large NFV networks must support millions of flows in stateful flow tables, and then collect, maintain and in some cases analyze millions of counters. In some applications, accurate time stamping of events is required beyond the accuracy of servers or some NICs.

3.4 Scale Challenges

Support for large numbers of subscribers may involve large tables that impact the performance of servers if they exceed cache sizes or disrupt cache fill and eviction algorithms. Many network devices are flow based, and as the number of subscribers increases the size of the flow tables can quickly grow to exceed the cache capacity of standard servers. VNFs incorporating OpenFlow switch functions will have tables of different sizes and numbers depending on the function and the location in the network.
VNFs with millions of flow table entries pose a performance problem for servers, as the server caches are not large enough to handle large numbers of flows efficiently. Additionally, individual OpenFlow table entries can be large: OpenFlow 1.3 supports 40 different match fields, including IPv6 addresses, MPLS, PBB, etc. While no single table will have all 40 match fields, an IPv6 table with 12 or more fields is very large and heavily impacts server performance when implemented in a server-based vswitch.
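As a concrete illustration of the stateful, flow-affine load balancing described in Section 3.2, the sketch below is written in plain C rather than any NFV or NPS-400 API; the table size, the FNV-1a hash, the linear probing and the round-robin assignment policy are all illustrative assumptions. It pins each 5-tuple to a VNF instance so that every packet of a flow reaches the same instance.

    #include <stdint.h>
    #include <string.h>

    #define TABLE_SIZE 4096          /* number of flow entries; a power of two (sketch only) */
    #define NUM_VNFS   8             /* VNF instances to spread the load over                */

    /* Callers must memset() this struct to zero before filling it so that the
     * compiler-inserted padding bytes hash and compare consistently. */
    struct five_tuple {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  proto;
    };

    struct flow_entry {
        struct five_tuple key;
        uint8_t in_use;
        uint8_t vnf_id;              /* the VNF instance pinned to this flow */
    };

    static struct flow_entry flow_table[TABLE_SIZE];
    static uint32_t next_vnf;        /* round-robin assignment for new flows */

    /* FNV-1a hash over the 5-tuple bytes. */
    static uint32_t hash_tuple(const struct five_tuple *t)
    {
        const uint8_t *p = (const uint8_t *)t;
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < sizeof(*t); i++) {
            h ^= p[i];
            h *= 16777619u;
        }
        return h;
    }

    /* Return the VNF instance for this packet's flow, creating state on a miss
     * so that every later packet of the flow goes to the same instance. */
    uint8_t balance(const struct five_tuple *t)
    {
        uint32_t start = hash_tuple(t) & (TABLE_SIZE - 1);

        /* Linear probing keeps the sketch short; a production table also needs
         * entry aging/removal, which is exactly where the state grows costly. */
        for (uint32_t i = 0; i < TABLE_SIZE; i++) {
            struct flow_entry *e = &flow_table[(start + i) & (TABLE_SIZE - 1)];
            if (e->in_use && memcmp(&e->key, t, sizeof(*t)) == 0)
                return e->vnf_id;                        /* existing flow        */
            if (!e->in_use) {                            /* new flow: pin it     */
                memcpy(&e->key, t, sizeof(*t));
                e->in_use = 1;
                e->vnf_id = (uint8_t)(next_vnf++ % NUM_VNFS);
                return e->vnf_id;
            }
        }
        return (uint8_t)(hash_tuple(t) % NUM_VNFS);      /* table full: fall back */
    }

Even this simplified table grows with the number of subscribers and flows, which is exactly the cache-pressure problem described in Section 3.4.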

3.5 Reliability Challenges

NFV networks must deliver the same reliability as existing physical networks, determining faults, monitoring the health of network services, and managing connectivity (OAM, BFD). Access Control Lists (ACLs) are used to protect network functions by comparing the addressing and service (port) information in each packet to a set of rules, the ACL. Network protection can mean security (granting access) or protection against threats (denying access), and depending upon the location of the network function the ACL can be lengthy. Border routers may need to check thousands of ACL rules to provide network stability, imposing a large workload on the server if the border router is virtualized.

3.6 Security Challenges

For network virtualization of functions normally residing outside of the data center, the network traffic backhauled to the VNF increases the load on the network, and the transport delay adds latency to the service provided. Security can become an issue on VNFs if sensitive data is backhauled to the data center or the VNF shares a server with VMs running user applications. Backhaul security issues can be addressed by encrypting the backhaul traffic and by using encapsulating network protocols like VXLAN to isolate the traffic from other users. Backhauling traffic also results in additional bandwidth usage in the network. Providing secure connections and services for multiple tenants on the same network, server or core requires many of the same isolation techniques, such as the Virtual Ethernet Port Aggregator (VEPA) developed for data centers by the IEEE DCB group, which allows vswitches to offload ACL checks to adjacent Ethernet switches. The network and server infrastructure must support multiple VNFs from multiple vendors while also providing service level agreement (SLA) measurements to verify service delivery for billing purposes.

3.7 Traffic Management Challenges

On the outbound side of many VNFs the egress traffic must be shaped or otherwise managed. Some flows may require large buffers, others low latency. Transmit schedulers should be capable of enforcing resource use policy between users, groups of users, ports, channels on ports, etc. Resource use policies can include prioritization, fair use of bandwidth or buffers, time-out of buffered packets, etc. Traffic management features can impose a heavy workload on the server, or the server may not be capable of the accuracy required for shaped traffic due to server workloads, context switching or queueing in a shared NIC.

3.8 Virtualization Software Challenges

NFV scale-out means adding multiple VNFs, typically running on multiple cores in parallel, to increase the capacity of a VNF. As mentioned above, IP fragmentation can have a severe impact on the performance of a VNF implemented on multiple cores. Since the TCP and UDP port numbers are only in the first fragment of a fragmented packet, keeping the 5-tuple state involves maintaining load balancer information about each flow, including both the IP 5-tuple and the MAC SA/DA and type/length fields. For most network functions, a single VNF will handle all frames in a flow and may reassemble fragmented packets as part of its operation.
If fragments of an IPv4 packet are assigned to different cores, performance is hurt by the need to transfer the fragments between the cores to reassemble the packet. This transfer further lowers performance by moving blocks of data into caches, evicting potentially useful cache contents. Network behavior is changed if shortcuts are taken that re-order packet delivery. Packet re-ordering can happen if IP frames are distributed across multiple VNFs without a sequence number or some other mechanism to preserve packet order. The servers receiving the data will experience increased workloads in the TCP stack, or may drop out-of-order frames, resulting in re-transmission and increased network load. Some DPDK network implementations change expected network behavior to attain higher performance. The order of packets may not be preserved when distributing computing over multiple cores, and in some cases accepted features may not be performed. For instance, the DPDK L3 sample code makes checking the IP header checksum, decrementing the TTL field and recalculating the IP checksum optional, all features that impact network stability in larger networks.
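The fragmentation issue described earlier in this section is visible directly in the IPv4 header: only the first fragment carries the TCP/UDP ports, so a 5-tuple cannot be formed for the remaining fragments. The following check is a minimal sketch in plain C; the suggested 3-tuple fallback is an illustrative assumption, not a prescribed NFV mechanism.

    #include <stdint.h>

    /* Returns 1 if this IPv4 header belongs to a non-first fragment, i.e. the
     * TCP/UDP ports are NOT present and a 5-tuple cannot be formed for it. */
    int ipv4_is_non_first_fragment(const uint8_t *ip_hdr)
    {
        uint16_t flags_frag = (uint16_t)((ip_hdr[6] << 8) | ip_hdr[7]);
        return (flags_frag & 0x1FFF) != 0;      /* 13-bit fragment offset != 0 */
    }

    /* Returns 1 if the packet is any kind of fragment (first or later). */
    int ipv4_is_fragment(const uint8_t *ip_hdr)
    {
        uint16_t flags_frag = (uint16_t)((ip_hdr[6] << 8) | ip_hdr[7]);
        return (flags_frag & 0x2000) != 0 ||    /* MF (more fragments) flag    */
               (flags_frag & 0x1FFF) != 0;      /* non-zero fragment offset    */
    }

    /* A flow-affine load balancer must therefore either reassemble fragments,
     * or key them on {src IP, dst IP, IP ID} and remember which VNF was chosen
     * when the first fragment (the one carrying the ports) was seen. */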

3.9 NFV Acceleration

The NFV standard includes acceleration as an option to increase network performance. In the NFV Infrastructure Group Specification, the processor component can offload the packet processing to the NIC if the NIC has TCP Offload Engine (TOE), LSO, DPI, encryption or other acceleration capabilities. The Infrastructure Network specification includes OpenFlow capabilities so that virtualized functions like load balancers or firewalls can populate OpenFlow tables in an upstream switch to forward and/or process subsequent frames in a flow once the flow is accepted. This feature is called Fast Path Offload (FPO) in the infrastructure network. Since OpenFlow is a general mechanism, packet processing features such as Network Address Translation (NAT), tunnel encapsulation or decapsulation, etc. may be performed with packet modifications specified as OpenFlow Actions associated with flow table entries.

As innovation occurs in NFV and SDN networks, it is hard to anticipate the areas where changes will occur. But accelerating the functions requiring heavy packet processing, such as advanced classification, Access Control List checking and Deep Packet Inspection (scanning packet contents for pattern matches), and having the ability to do service chaining without performance degradation, are likely areas for improvement. Having a high-performance C-programmable network processor in the data plane of networking devices allows those devices to track changes to standards and respond quickly to innovation.

4. NPS-400 NPU Features for NFV Acceleration

The NPS-400 is a network processor from EZchip Technologies that is well suited for NFV acceleration. Like all network processors from EZchip, the NPS-400 is designed for high-bandwidth packet processing. The sections below describe how the NPS-400 addresses many of the challenges in implementing NFV in the data center.

4.1 Performance

The NPS-400 can process 400 Gbps of minimum-sized Ethernet packets (600M packets per second) and handle oversubscribed IO up to 960 Gbps. The packet processing code on an NPS-400 is written in C and executed on a CTOP (C-programmable Task Optimized Processor) processor array using a run-to-completion programming model. The hardware threads of the NPS-400 allow multiple table lookups and multiple statistic counter operations per packet at the 600M packets per second processing rate. There are 256 CTOP processors (cores) on the NPS-400, and each CTOP has 16 hardware threads with fast thread scheduling in hardware. With packet order maintenance, counter operations and search operations (in both DDR memory and TCAM) implemented in hardware, the NPS-400 provides the perfect meld of processing power, flexibility and software-directed table lookup capabilities to deliver 400 Gbps of packet processing.
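As a rough sanity check of the 600M packets per second figure (assuming minimum-size 64-byte frames plus the 20 bytes of preamble and inter-frame gap each frame occupies on the wire): 400 Gbps / (84 bytes x 8 bits per byte) is approximately 595 million packets per second, consistent with the quoted rate.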

4.2 Load Balancing

Pinning of flows to CTOP processors is not required on the NPS-400 because packet order is maintained in the CTOP scheduling hardware. As a consequence, the performance of a single flow is not restricted to the performance of a single core on the NPS-400. The NPS-400 can be used as a load balancer for NFV applications, or as an accelerator of a load balancer. Packet classification is not limited to hardware analysis of the packet headers on the NPS-400, since C programs can further analyze the packet to build simple or stateful load balancers. Large table support allows sophisticated per-flow information on millions of flows, and hardware scheduling is done without consuming processor cycles.

4.3 Performance Monitoring

The lookup and statistics engines of the NPS-400 support millions of flows with hardware token buckets for SLA monitoring of each flow. Policers can be built from simple token buckets or coupled token buckets (a minimal token-bucket sketch appears at the end of this section), and per-flow profiles allow each flow to conform to different industry-standard definitions. The NPS-400 also supports 1588 time synchronization, allowing accurate time stamps on packets or logged data. A Real Time Clock is available to CTOP programs for accurate timestamping of internal events.

4.4 Scale

The NPS-400 has an on-chip TCAM with algorithmic extension to DDR memory, on-chip SRAM and up to 48 GB of external DDR-3 or DDR-4 memory, allowing large lookup tables, large numbers of counters and huge packet buffers for the Traffic Manager (TM) queues. The TCAM support with algorithmic extension is useful for large OpenFlow tables supporting many vswitches and many subscribers. Large table size support allows per-flow tables with millions of flows. The instruction set of the NPS-400 can easily parse all frame types at line rate. Standard control frames are identified and parsed in hardware for packet scheduling; the priorities assigned to each packet type are configurable. Additional registers allow customers to have the classification hardware identify other protocol types, MAC addresses, tags and label values and assign priorities as needed.

4.5 Reliability

The NPS-400 supports fast failover to backup links in the Traffic Manager without the need to change forwarding databases or route tables. The NPS-400 also has support for OAM (Y.1731) to monitor connectivity to OAM peers and trigger fast failover to backup links. Internal and external memories are protected with error correction codes to increase reliability.

4.6 Security

The NPS-400 also has high-performance encryption support for the AES-256 and 3DES standards. In parallel with encryption, the NPS-400 can calculate authentication MACs using the SHA-1, SHA-2 or MD5 hash functions. To isolate user groups, the NPS-400 efficiently handles VXLAN or NVGRE encapsulation and decapsulation at 400 Gbps using proven NPU packet editing technology. Overlay network support provides security in multi-tenant environments.

4.7 Traffic Management

The embedded TM of the NPS-400 has 1M hardware queues and a 5-level hierarchical scheduler. Each queue and each aggregated scheduler entity has shapers delivering accurate conformance to outbound Service Level Agreements. Sophisticated scheduling combinations of shaping, priority and Weighted Fair Queue (WFQ) scheduling allow precision control of outbound traffic.
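To illustrate the simple token-bucket policing mentioned in Section 4.3, here is a minimal single-rate, two-color policer in plain C. It is a sketch only: the byte-based accounting, the nanosecond timestamps and the drop-or-mark decision are illustrative assumptions, and the NPS-400 performs the equivalent per-flow work in its hardware statistics engines.

    #include <stdint.h>

    struct token_bucket {
        uint64_t tokens;        /* current fill level, in bytes               */
        uint64_t burst_size;    /* bucket depth (committed burst), in bytes   */
        uint64_t rate_bps;      /* committed information rate, bits/second    */
        uint64_t last_ns;       /* timestamp of the previous update           */
    };

    /* Refill the bucket for the elapsed time, then decide whether the packet
     * conforms to the profile (returns 1, "green") or exceeds it (0, "red").
     * Overflow of rate_bps * elapsed_ns is ignored here for clarity. */
    int tb_conforms(struct token_bucket *tb, uint32_t pkt_bytes, uint64_t now_ns)
    {
        uint64_t elapsed_ns = now_ns - tb->last_ns;
        uint64_t refill = (tb->rate_bps * elapsed_ns) / (8ULL * 1000000000ULL);

        tb->tokens += refill;
        if (tb->tokens > tb->burst_size)
            tb->tokens = tb->burst_size;
        tb->last_ns = now_ns;

        if (tb->tokens >= pkt_bytes) {
            tb->tokens -= pkt_bytes;    /* in profile: forward as-is          */
            return 1;
        }
        return 0;                       /* out of profile: drop or re-mark    */
    }

A coupled policer adds a second bucket for an excess rate; in either case the per-flow profile determines the rate and burst parameters each flow must conform to.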

5. Code Examples

Built-in functions for the NPS-400 are provided in the EZchip Data Plane (EZdp) library, making use of the NPU instructions added to the CTOP processor for high-performance packet processing. For instance, the CTOP has hardware instructions for parsing basic L2 and L3 packet headers efficiently. In-line C function support in the compiler provides single-instruction parsing of Ethernet frames for VLAN tag(s), MPLS label stacks, IPv4 headers and IPv6 headers. For example, to decode an IPv4 header into the result struct in a single instruction:

    status = ezdp_decode_ipv4(header_ptr, size, framesize, result);

Here header_ptr is a pointer to the IPv4 header stored in the core's local memory, and size is the size of the IPv4 header, i.e. the number of bytes to decode. The decode instructions are derived from the TOPparse instruction set of the EZchip NP-5 NPU, an architecture that has delivered efficient packet processors for over a decade.

Lookup operations are performed on portions of the packet; typically a search key is composed from fields in the packet or from the metadata associated with the packet. The NPS-400 supports direct tables, longest prefix match (LPM) and hash tables in internal memory or external DDR memory, as well as TCAM accesses. To look up an 8-byte key in a hash table:

    hashed_key = ezdp_hash_lookup_key32(key.raw_data, 0, 4);
    result.raw_data = ezdp_lookup_hash_entry(table, false, key_len, result_len, entry_len, hashed_key, &key, &entry, &scratchpad);

where key is a 32-byte search key struct, hashed_key is the 32-bit hashed value of the key, table is the hash table descriptor, false indicates this table is not a single-cycle hash table, key_len is the length of the search key, result_len is the size of the buffer to receive the results, entry_len is the size of the buffer to receive the hash table entry (key plus results), &entry is the address of the buffer in local memory for the hash table entry, and &scratchpad is a buffer used as a temporary working area. The function return value is a C structure which includes an indication of whether a valid entry was found, as well as the first 4 bytes of the result data, if applicable. The function is coded in such a way that this structure fits in a core register for efficient testing of the lookup results.

An example of a lookup in the internal TCAM of the NPS-400 is:

    res = ezdp_lookup_int_tcam(side, profile, key_ptr, key_len, mask, &result);

key_ptr is the search key of the lookup, and the result of the lookup is returned in the result structure.

Likewise, built-in functions are provided for DMA operations, allowing high-performance encapsulation and decapsulation. The NPS-400 architecture allows frame buffers to be populated with an offset, allowing standard L2 encapsulation of overlay network headers to occur without the need to allocate and link additional buffers.

    status = ezdp_copy_frame_data(buff_desc, offset - hdr_size, template_desc, 0, hdr_size);

The example above uses the DMA engine to copy hdr_size bytes from a template to prepend to the buffer associated with the buffer descriptor buff_desc. More complex overlay networks such as VXLAN can be terminated with TCP/IP and VXLAN protocol processor offload in the NPS-400.
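The offset-based buffer population described above can also be illustrated in portable C. The sketch below is not the NPS-400 buffer format: the buffer layout, the headroom size and the 50-byte outer-header length are assumptions for illustration. The point is that, because headroom is reserved in front of the received frame, an overlay header can be added with a single copy instead of allocating and linking a second buffer.

    #include <stdint.h>
    #include <string.h>

    #define HEADROOM      64    /* bytes reserved ahead of each received frame */
    #define OVERLAY_BYTES 50    /* outer Ethernet + IPv4 + UDP + VXLAN header  */

    struct pkt_buf {
        uint8_t  data[2048];    /* frame bytes start at data + offset          */
        uint16_t offset;        /* set to HEADROOM when the frame is received  */
        uint16_t len;           /* current frame length                        */
    };

    /* Prepend a pre-built outer header template in front of the existing frame.
     * Returns 0 on success, -1 if there is not enough headroom left. */
    int push_overlay(struct pkt_buf *p, const uint8_t template_hdr[OVERLAY_BYTES])
    {
        if (p->offset < OVERLAY_BYTES)
            return -1;

        p->offset -= OVERLAY_BYTES;     /* grow the frame toward the front     */
        p->len    += OVERLAY_BYTES;
        memcpy(p->data + p->offset, template_hdr, OVERLAY_BYTES);
        return 0;
    }

    /* Decapsulation is the mirror image: advance the offset past the outer
     * header so the inner frame becomes the start of the packet. */
    void pop_overlay(struct pkt_buf *p)
    {
        p->offset += OVERLAY_BYTES;
        p->len    -= OVERLAY_BYTES;
    }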

The NPS-400 encryption instructions allow encrypting, decrypting and hashing of frames with minimal impact on CTOP processor cycles.

    ezdp_encrypt(sec_handle, plain_text, cypher_text, length);

Based on the contents of the security handle sec_handle, the encrypt instruction will encrypt the plain_text buffer using the AES, DES, 3DES or RC4 ciphers, storing the encrypted information in the cypher_text buffer. Message Authentication Code (MAC) instructions will calculate MACs using the SHA-1, SHA-2 or MD5 hash functions.

    ezdp_mac_calculation(sec_handle, buf_ptr, first_flag, last_flag, length);

The security handle sec_handle identifies the hash function to apply to the buffer in CMEM pointed to by buf_ptr, first_flag and last_flag identify whether the buffer is the first and/or last segment of text, and length is the size of the data in the buffer pointed to by buf_ptr. The message digest (the hash) is updated in the security context memory associated with sec_handle.

Once the packet processing is complete, packets are queued into the Traffic Manager (TM) queues by the ezdp_send_job_to_tm function. In the NPS-400 architecture, a job processes a packet, and a job descriptor contains information about the job and the packet being processed.

    ezdp_send_job_to_tm(&job, &job_desc, job_desc.tx_info.side, 0);

The embedded TM is a 5-level hierarchical hardware scheduler with per-output-port shaping, and with Weighted Fair Queueing (WFQ) and priority scheduling configurable at each level. The TM supports 1M queues and aggregates multiple queues at the upper levels of the scheduling hierarchy to conform to Service Level Agreements (SLAs). With a large buffer, the TM can provide loss-less service or buffering for high-speed links. A per-packet time-out allows recovery of buffers left in congested or flow-controlled queues beyond acceptable timeout values.

Application-specific accelerations coded in C can be located on any NPS-400 in the network and integrated with the packet processing acceleration for the flow. For instance, an NFV Fast Path Offloaded Network Address Translation (NAT) function may translate TCP port numbers and IP addresses in a packet. Users could write their own acceleration functions to translate other fields in other headers.

Other features of the NPS-400 include OAM, 1588 time stamping, Deep Packet Inspection (DPI) acceleration, encryption, decryption and hashing. One-step and two-step 1588 support is provided and can accurately timestamp the arrival or departure of packets for monitoring applications, SLA measurements or orchestration. The DPI can perform a coarse-grain screen of tens of thousands of rules at 200 Gbps, the encryption/decryption accelerators can encrypt/decrypt at 200 Gbps total, and hashes can be calculated on 200 Gbps of traffic in parallel with the encrypt/decrypt activity.

6. Smart NFV ToR

For most NFV applications, performance is improved by moving the vswitch processing for multiple servers into a Top of Rack (ToR) switch based on an NPS-400. The NPS-400 provides the encapsulation and decapsulation functions to access the overlay networks carrying the virtual network traffic, and delivers standard Ethernet frames to the devices on each server. If the device has SR-IOV or tunneling capabilities, high performance is attained for the entire chassis. Consolidating the vswitches of a rack of servers onto the Top of Rack switch provides a significant reduction in the number of managed virtual switches while increasing the performance of the servers. In data plane intensive VNFs such as routers or load balancers, the data plane processing can consume most of the processor cycles and become the bottleneck.
In such cases, offloading the packet processing will provide higher performance and in some cases allow normal network behavior. NFV switches and routers can offload most or all of their data plane lookup operations, frame modifications, forwarding decisions and queueing, while the VNF performs the exception forwarding functions and the control plane processing.

Other NFV applications may offload the TCP/IP stack, load balancing decisions, Deep Packet Inspection or cryptography processing to make more effective use of the processor to run the remaining NFV functions. The net result is improved NFV performance and greater capacity.

Single Root IO Virtualization (SR-IOV) support in the NIC devices on the servers allows an efficient vswitch implementation in the ToR switch to DMA frames directly into the VNF buffers. Use of a hypervisor-based vswitch in NFV applications reduces the effectiveness of SR-IOV acceleration technology. Additionally, as the size of the network increases, the vswitch flow and addressing tables can grow large enough to impact the server memory used and the effectiveness of the data cache memories on the server(s). SR-IOV lowers the latency for VNF processing by eliminating a copy of the frame, and improves the performance of the processor/VM running the VNF by improving the cache hit ratio. Use of NFV acceleration on NPS-400 devices improves performance and security while lowering power consumption for NFV applications.

For multiple tenants sharing the same physical network, the Virtual Ethernet Port Aggregator (VEPA) header will identify the virtual network, and the ToR switch can make the forwarding decision according to the 802.1Qbg Edge Virtual Bridging standards. VEPA allows better security on the network for traffic between VMs on the same server, since the ToR device can apply ACL checks in hardware, then hairpin-forward the packet back to the same server if the communication is allowed by the policy programmed in the ACL tests. Since the ACL tests are typically performed in the TCAM hardware on the NPS-400, millions of rules can be checked at line rate. A normal vswitch does not implement ACL checks due to the performance impact on the processor, or will only check a few rules.

A single NPS-400 in a switch, router or appliance can process packets at the same rate as an entire rack of servers for most data plane applications. Since an NPS-400 based ToR switch uses about the same amount of power as one server, using NPS-400 based acceleration results in a dramatic reduction in power consumption for packet processing.

7. Smart Edge Router Design

Internet traffic entering a data center is usually processed by an edge router with packet processing capabilities on the line cards of the router or on a services card. For instance, the edge router will provide the VXLAN overlay network headers for the flow, allowing the packet to transit a multi-tenant data center network securely. For NFV, the edge router can perform deep packet classification and map packets into flows and NFV service chains while dynamically load balancing the flows across many distributed VNFs. Many of these devices use OpenFlow tables and actions, and can provide NFV acceleration as part of the OpenFlow table processing (see the discussion above on NFVI network acceleration). Using an NPS-400 for the packet processing in an edge router provides a flexible platform for NFV acceleration, and allows for innovation in new acceleration functions. The packet processing capabilities of the NPS-400 deliver full-duplex, line-rate processing for a 2x100G line card using a single NPS-400, or for a 6x40G line card. Higher-density, over-subscribed line cards can also be built with the NPS-400. Edge routers perform significant packet processing, and the NPS-400 allows for line-rate Access Control List checking of millions of rules using the on-chip TCAM with algorithmic extension, as well as encapsulation/decapsulation for tunnels and overlay networks, and termination of secure connections.
The NPS-400 also has support for OAM, 1588 and policing of millions of flows, allowing high service delivery and integration with existing network management protocols while protecting against attacks. The NPS-400 Traffic Manager improves service delivery for both ingress and egress traffic. On ingress, the hierarchical scheduling and shaping can be used to aggregate and shape flows transiting constrained links or devices, protecting the network and servers from overloading. On egress, the TM's large number of queues, shapers and aggregation entities deliver SLA conformance on individual flows and/or groups of users, avoiding packet drops due to policing functions in Metro networks.

8. Smart Acceleration Appliance Design

An NFV acceleration appliance could be located adjacent to a core or spine switch and provide NFV acceleration services to many racks of servers or to multiple data centers. The appliance is a pure packet processor with NFV acceleration that implements the acceleration primitives from the NFV Infrastructure Group Specification, allowing the appliance to accelerate NFV processing for multiple VNFs from multiple vendors. For example, consider an NFV load balancer that can use FPO acceleration, except that the network switches do not support FPO. An appliance could be attached to a spine switch as a one-armed router, and have the accelerated traffic in a flow be processed in the appliance after the initial flow detection and load balance decision. Figure 5 shows acceleration deployed in either a ToR switch, an edge router, or an appliance or service blade.

Figure 5. NPS-400 deployment in a variety of smart equipment: a smart NFV ToR switch, an NFV-enabled edge router, and NFV-accelerated appliances or service blades

A pool of NFV acceleration resources can be deployed as multiple service blades in a blade chassis, or as multiple appliance boxes. Note that a single NPS-400 will likely provide a pool of accelerators for many VNFs; for instance, a ToR switch built on an NPS-400 will be able to accelerate all of the VNFs in a rack of servers.

9. White Box Designs

As new network architectures based on SDN and NFV emerge as the way forward for carrier and data center networks, a related trend toward white box systems is gaining momentum. In contrast to proprietary networking equipment provided by the established vendors, white boxes are systems from non-branded manufacturers (ODMs) that build off-the-shelf hardware. The particular networking function of the white box is determined through the software downloaded to the white box and is typically provided by application software vendors. White boxes may be deployed as servers, top-of-rack switches, network appliances and more. As SDN and NFV fuel the migration to software-based virtual network functions, they also unleash broad opportunities for white boxes on which the virtualized (software) network functions can run or be accelerated. Underscoring the promise of SDN and NFV to reduce OPEX and CAPEX, white-box networking further drives down cost and enables an ecosystem of hardware and software vendors.

The key to enabling the migration to white-box networking is the availability of powerful merchant chips used at the core of the white box. As the market offers choices for merchant silicon, a chip such as the NPS-400, offering richer functionality at higher throughput, enables a wider range of virtual functions to be deployed on the white box and brings greater value to the network operator.

Utilizing EZchip's NPS-400 at the core of white-box systems gives ODM vendors and software application vendors greater differentiation versus alternative systems that are typically based on non-programmable switching silicon with limited functionality or on CPUs with limited performance. The NPS-400 delivers a unique combination of NPU performance coupled with CPU functionality. High throughput of 400 Gigabits per second, C programming, a Linux operating system and 7-layer packet processing capabilities accelerate data-plane processing and unlock virtual functions to unprecedented functionality and performance levels. These functions include switching, routing, advanced classification, traffic management, ACLs (access control lists), stateful flow tables, load balancing, security (firewall, IPsec and SSL VPNs), DPI (deep packet inspection), network monitoring, application recognition, subscriber management, TCP termination, and network overlay termination.

10. Software Architecture

Figure 6 shows the software architecture of an NPS-400 based ToR switch implementing the vswitch for a rack of VNFs. If the devices on the physical servers are capable of SR-IOV or tunneling transfers, the vswitch will communicate directly with the VNF, avoiding having to buffer the packets in the hypervisor. The SR-IOV transfers avoid cache pollution, i.e. populating the processor cache with the hypervisor buffer contents when only the VNF will use the packet data. Better cache performance improves the server performance, and offloading the vswitch processing makes more cycles available to the VNFs residing on the processors. Having a single vswitch for the chassis reduces the number of switches in data centers, lowering the management load and making the networks more responsive.

Figure 6. Software architecture of a smart NFV ToR: the NPS-based ToR switch hosts the vswitch, and the VNFs on the servers attach to it via SR-IOV

Other VNFs can be accelerated as well; note that Figure 6 shows a Switch (L2) and a Router (L3) VNF, both virtual devices that make use of the vswitch in the NFV infrastructure. Both of these VNFs have data plane and control plane portions of the application, and the data plane functions can be offloaded to the NPS-400 as shown in Figure 7.

Figure 7. SW offload to the NPS-400 in a smart NFV ToR: the Switch and Router data planes (DP) run on the NPS-based ToR switch, while their control planes (CP) remain on the servers

The NFV acceleration primitives for the data plane are OpenFlow table entries and actions. The Switch DP could include the VEPA processing with ACL checks. Most of the traffic on the data plane will be processed by the ToR NPS-400 and forwarded to the egress port based on the OpenFlow table entry, without disturbing the control plane. If traffic is detected for the control plane, the packets are sent via the vswitch to the control plane, using the SR-IOV transfers to minimize the impact on the server caches. Again, server performance is improved by offloading the data plane processing of the Switch and Router VNFs, and by avoiding degradation of the cache performance.

Other types of acceleration primitives are likely in the future; for instance, a future OpenFlow set of actions may allow Application Recognition VNFs to populate regular expression tests into an OpenFlow switch. If the green-colored VNFs in Figure 7 are two instances of the Application Recognition VNF, Figure 8 shows how the regular expression checking could be deployed on the same NPS-400 based ToR switch as hardware accelerators for the data plane.

Figure 8. Addition of regular expression acceleration on the same ToR switch: per-VNF accelerators run alongside the Switch and Router data planes

A VNF with application recognition can request the ToR physical switch to process all the packets on a link with a set of Deep Packet Inspection rules (regular expressions), and only forward to the Anti-Virus VNF those packets which need additional processing. Use of an acceleration appliance improves system performance by off-loading the NFV server, and makes the service delivered to users more predictable by processing the bulk of the packets in the accelerator, providing quicker response and reducing network workload.
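Conceptually, the acceleration primitive the control plane installs into the ToR data plane is just a match-plus-action record with counters. The sketch below is plain C and deliberately simplified; the choice of match fields and the action encoding are illustrative assumptions, not the OpenFlow wire format or the NPS-400 table layout.

    #include <stdint.h>

    /* Simplified match: a zero mask bit means "wildcard" for that bit. */
    struct flow_match {
        uint8_t  dst_mac[6], dst_mac_mask[6];
        uint32_t dst_ip,      dst_ip_mask;
        uint16_t vlan_id,     vlan_id_mask;
    };

    enum flow_action_type {
        ACT_FORWARD,        /* send to a physical or virtual port            */
        ACT_SET_VLAN,       /* rewrite the VLAN tag                          */
        ACT_PUSH_OVERLAY,   /* prepend a VXLAN/NVGRE outer header            */
        ACT_SEND_TO_CP      /* exception path: hand the packet to the VNF's
                               control plane via the vswitch and SR-IOV      */
    };

    struct flow_action {
        enum flow_action_type type;
        uint32_t arg;              /* port number, VLAN ID, tunnel ID, ...    */
    };

    struct flow_entry {
        struct flow_match  match;
        struct flow_action action;
        uint16_t priority;         /* higher priority wins on multiple matches */
        uint64_t pkt_count;        /* per-entry counters for monitoring / SLAs */
        uint64_t byte_count;
    };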

11. Conclusion

EZchip's NPS-400 accelerated VNFs can deliver high network performance while providing a flexible platform for the most demanding data plane applications. The NPS-400 is a Network Processor (NPU) that offloads the data plane processing and improves the performance of standard servers running general-purpose CPU based VNFs by making more processor cycles available to the VNF and improving the effectiveness of those cycles through better cache utilization. The VNF gets more of the processor for the control plane and exception handling, the data plane packets see shorter latency, and the physical network sees better utilization of links by avoiding the need to backhaul some traffic. Since the performance of many servers is accelerated by the addition of an NPS-400 to the network infrastructure, overall power consumption per packet is reduced.

With a background of full network service delivery, the NPS-400 delivers high-performance packet processing while preserving packet order, accurately polices and shapes traffic, and can buffer large amounts of data to ensure loss-less service, keep statistics or mirror traffic. In contrast to DPDK, NFV acceleration on the NPS-400 can be located at optimal places in the network, either in ToR switches, edge routers or acceleration appliances, depending on the application and workload. Since traditional network behavior is preserved in the data plane, integrating NPS-400 accelerated VNFs into existing networks will be easier and will avoid the need to compromise on network integrity, performance or power consumption in large-scale NFV deployments.

Email: ezsupport@ezchip.com
Web: www.ezchip.com

EZchip Technologies Inc., 900 E Hamilton Ave, Suite 100, Campbell, CA 95008, USA. Tel: (408) 879-7355, Fax: (408) 879-7357
EZchip Technologies Ltd., 1 Hatamar Street, PO Box 527, Yokneam 20692, Israel. Tel: +972-4-959-6666, Fax: +972-4-959-4166

© 2014 EZchip Technologies. All rights reserved. Information is subject to change without notice. EZchip is a trademark of EZchip Technologies. Revised: April 29, 2014.