Heterogeneous Computing Tutorial @ Gwangju, Korea, Sep. 20, 2012

Prototyping a Programmable Computing/Networking Switch

JongWon Kim and Namgon Lucas Kim
Gwangju Institute of Science and Technology (GIST)
jongwon@nm.gist.ac.kr
Toward Balanced Service Composition based on Programmable (and Virtualized) Resources
Multi-Screen Content Consumption

[Figure: a logically centralized controller software (VisualXCoordinator) coordinates VisualXSwitch (computing + networking) nodes via switch control software and control agents having partial knowledge. Content servers and a cloud feed many flows; scenarios include content sharing and screen capturing at home, tiling at an airport lobby and a cafeteria, transcoding, visualization sharing, and screen-casting at the street and office.]
Multi-party Visual Sharing

- OptIPortal: termination device for the OptIPuter global backplane
- SAGE (Scalable Adaptive Graphics Environment): Visualcasting enables the integration of HD streams into tiled displays (Apr. 2008)

9/20/2012 KIISE 2011
Programmable Computing/Networking Switch: Why?

- Emergence of various network-connected consumer devices
  - The demand for accessing video content from heterogeneous devices with different screens and capabilities is increasing
- Real-time content adaptation is challenging:
  - Transcoding the content while maintaining good video quality
  - Maintaining appropriately short input-to-output latency
- We introduce a programmable switching node supporting in-network processing with balanced use of computing/networking resources
Networking Service and SDN

- Networking service: the collection of network-centric services that assist the networking (e.g., transport) of diverse flows for computing-centric services
- SDN (Software-Defined Networking): restructures the networking paradigm by exposing network APIs so that any software can program the networking nodes as it wants
- By employing the programmability of SDN, we attempt to fill the gap left by existing network services

[Figure: flows pass through the networking service layered on an SDN network substrate.]
Features of NetOpen Networking Service

- Extended flow-based networking: an extension of flow-based networking to reinforce the balanced utilization of both networking and computing resources
- Primitive-based, loosely coupled creation of services
  - Service primitives as linkage points between NetOpen networking services and programmable network substrates
  - Identify and then link the key features (from resource capabilities)
- Interactive service operation via intuitive UIs
A Road Map to Build SmartX Nodes

[Figure: roadmap converging on the SmartX node (NetOpen + MediaX).]
- NetOpen switch nodes: NetOpen node v1.0 (OF + NetFPGA) → v1.1 (OF + Click) → v1.2 (OF + Click + GPU) → v1.3 (OF + Click + GPU + storage + 10G NIC) → v2.0; plus a wireless NetOpen node v1.1 (wallbox)
- MediaX nodes: PC-based MediaX node v1.0 → PC-based MediaX node with GPU v1.1 → cloud-based MediaX node v1.2 (Open vSwitch + cloud) → MediaX cloud node (+ networking)
- Mobile SmartX nodes: mobile MediaX node with GPU (tablet, netbook, laptop) → mobile SmartX node v0.9 (tablet, netbook, laptop)
NetOpen Switch Node (v1.2): A Programmable Networking Switch Node with In-network Processing Support

Design issues:
- An independent processing module for each service functionality
- For each flow, a customized data plane can be built by selectively combining processing modules
- This buildup can be controlled by a logically centralized controller

[Figure: flows traverse the networking service over an SDN network substrate built on commodity CPU/GPU processors.]
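The per-flow composition idea can be sketched in a few lines. This is an illustrative sketch only, not the actual NetOpen code: the flow identifiers, module names, and flow-table layout are all assumptions made for the example.

```python
# Hypothetical processing modules; each is an independent action a
# controller can chain into a flow's data plane.
def decode(seg):
    return ("decoded", seg)

def resize(seg):
    return ("resized", seg)

def forward(seg):
    return ("forwarded", seg)

# The (logically centralized) controller installs per-flow action chains
# keyed by an illustrative (src, dst, port) flow identifier.
flow_table = {
    ("10.0.0.1", "10.0.0.2", 5004): [decode, resize, forward],  # video flow
    ("10.0.0.3", "10.0.0.4", 80):   [forward],                  # plain flow
}

def process(flow_id, segment):
    """Run a segment through the action chain installed for its flow."""
    for action in flow_table.get(flow_id, [forward]):
        segment = action(segment)
    return segment
```

A flow without an installed chain falls back to plain forwarding, which mirrors the idea that modules are enabled selectively per flow.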
In-network Processing Leveraging Heterogeneous Computing

- In-network processing: additional processing on packets before forwarding
  - Middle-boxes provide the required processing functionalities, e.g., packet caching, transcoding, network coding
- Leverage heterogeneous computing (with CPU/GPU/...) for in-network processing
  - Powerful CPUs with multiple cores (e.g., Intel Xeon E7-2820, AMD FX-8150)
  - GPUs and GPGPUs (General-Purpose Graphics Processing Units) with hundreds of cores (NVIDIA GTX 580: 512 CUDA cores)
How to Handle In-network Processing

- In-network processing for a flow is expressed by a task, defined as a set of actions (provided by processing modules) mapped to a flow
- For example, a task for a video flow can be composed of decode, resize, and forward actions, together with segmentation and packetization
  - Segmentation: checks the application-layer packet header of each packet and batches the payload into the current segment (assuming sequential transport of video frames)
  - Packetization: converts a segment back into packets

[Figure: segmentation joins the data payloads of packets (header + data, size P) into a segment of size S; packetization splits a segment back into packets.]
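A minimal sketch of the segmentation/packetization pair described above, assuming a toy packet format of (header, payload) tuples; the field names and the byte-offset header produced on repacketization are illustrative, not the actual wire format.

```python
def segment(packets):
    """Batch the payloads of sequentially transported packets into one segment."""
    return b"".join(payload for _header, payload in packets)

def packetize(seg, payload_size, make_header):
    """Convert a segment back into packets of at most payload_size bytes."""
    return [
        (make_header(i), seg[i:i + payload_size])
        for i in range(0, len(seg), payload_size)
    ]

packets = [({"seq": 0}, b"abcd"), ({"seq": 1}, b"ef")]
seg = segment(packets)                            # b"abcdef"
out = packetize(seg, 4, lambda off: {"off": off})
# out == [({"off": 0}, b"abcd"), ({"off": 4}, b"ef")]
```

In the real node the segment boundary would come from the application-layer header (e.g., a video frame boundary) rather than simple concatenation of every payload.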
Early Design of NetOpen Switch Node for In-network Processing

We need a special-purpose dispatcher, called the shader, between the task dispatchers and the GPUs, to avoid the performance degradation caused by multiple task dispatchers accessing a GPU at the same time.
Prototype Implementation

Hardware:
- CPU: two Intel Xeon quad-core CPUs (1.6 GHz) with 12 GB memory
- GPU: NVIDIA GTX 590 with 2 stream processors (each with 512 cores) and 3 GB GDDR5 memory
- NIC: six 1G NICs, two 2-port 1G NICs, and two WiFi NICs

Software (Click + OpenFlow):
- Click: a toolkit for building a software modular router by combining Click elements (i.e., small modules, each with a specific functionality)
- Each service primitive is implemented as a Click element that is controlled via OpenFlow
Supported Primitives for In-network Processing

- DXT compression (P_DXTC) / decompression (P_DXTD)
  - DXT (i.e., S3 texture compression) is a light-weight compression scheme that compresses an image frame based on 4x4-pixel blocks
  - The compression of each pixel block is completely independent, so we significantly reduce the compression time by implementing DXT compression on hundreds of GPU cores
- Rate shaping (P_rshape)
  - Controls the rate of traffic received on a network interface
  - Traffic at or below the specified rate is sent, whereas traffic that exceeds the rate is dropped or delayed
  - Implemented using the BandwidthShaper element of Click
- YUV2RGB conversion (P_Y2R)
  - Recent camera models produce YUV-format video content due to the small size of the YUV format (i.e., half of the RGB format)
  - Since the DXT compressor only takes RGB-format pixel blocks, we need a conversion element to support various pixel-block formats
- Resize (P_resize)
  - Changes the width and height of a video frame
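The drop variant of the rate-shaping behavior above can be sketched as a token bucket. Click's BandwidthShaper is the actual implementation; the class, parameter names, and burst handling here are illustrative assumptions.

```python
class RateShaper:
    """Toy token bucket: admit packets within the configured rate, drop the rest."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0        # refill rate in bytes per second
        self.burst = burst_bytes          # bucket capacity
        self.tokens = burst_bytes         # start with a full bucket
        self.last = 0.0                   # timestamp of the last decision

    def admit(self, now, size):
        """Return True if a size-byte packet arriving at time `now` may be sent."""
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False                      # over the rate: dropped (or delayed)
```

For example, with an 8 kbps rate (1000 bytes/s) and a 1000-byte burst, two back-to-back 600-byte packets cannot both pass at the same instant, but the second passes once enough time has elapsed to refill the bucket.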
Task Dispatcher Threads

[Figure: the main thread's flow batcher consults the flow table with per-flow actions; packets marked for forwarding go straight to the forwarding element of the per-flow customized data plane, while packets marked for in-network processing pass through a segmentation handler into an input task queue. Task dispatcher threads pull tasks (segment + actions), run the actions on CPU/GPU processing elements, and place results on an output task queue for the output action.]
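A minimal single-threaded sketch of the dispatcher loop in the figure above; the queue layout, task tuple shape, and sentinel-based shutdown are assumptions for illustration, not the actual multi-threaded NetOpen code.

```python
import queue

def dispatcher(in_q, out_q):
    """Pull (segment, actions) tasks, apply the actions, emit the results."""
    while True:
        task = in_q.get()
        if task is None:                 # sentinel: shut the dispatcher down
            break
        segment, actions = task
        for action in actions:           # each action would run on a CPU/GPU PE
            segment = action(segment)
        out_q.put(segment)

in_q, out_q = queue.Queue(), queue.Queue()
in_q.put((b"raw", [lambda s: s.upper()]))  # a stand-in "processing" action
in_q.put(None)
dispatcher(in_q, out_q)
```

In the real node several dispatcher threads would share the input task queue, which is why GPU access needs the serializing shader dispatcher mentioned earlier.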
A Processing Element for GPU-based DXT Compression

The task dispatcher passes a segment to the processing element, which:
1) copies the segment from host memory to GPU device memory,
2) launches the DXT compression kernel on the GPU cores (e.g., 2x 512 cores for the GTX 590), compressing each 4x4 pixel block into a DXT block, and
3) copies the DXT-compressed segment back from GPU device memory and returns it to the task dispatcher.
DXT: DirectX Texture Compression

- S/W-based light-weight lossy compression scheme
  - Compresses a 4x4 block of pixels (512-bit or 384-bit) into a 64-bit or 128-bit quantity
  - Fixed ratio per 4x4-pixel block

  FOURCC     | Description          | Comp. ratio | Texture
  DXT1 (BC1) | Opaque / 1-bit alpha | 8:1 or 6:1  | Simple non-alpha
  DXT3 (BC2) | Explicit alpha       | 4:1         | Sharp alpha
  DXT5 (BC3) | Interpolated alpha   | 4:1         | Gradient alpha

- Most VGAs support H/W-accelerated DXT decompression
- DXT1 (BC1) example: a block stores two 16-bit colors (Color0, Color1) plus 16 2-bit codes, one per pixel
  - If Color0 > Color1 (4-color mode): code 0 = Color0, code 1 = Color1, code 2 = (2*Color0 + Color1)/3, code 3 = (Color0 + 2*Color1)/3
  - If Color0 <= Color1 (3-color mode): code 0 = Color0, code 1 = Color1, code 2 = (Color0 + Color1)/2, code 3 = black

[Figure: 4x4 RGB pixels -> DXT1 block of 32 bits (2x 16-bit colors) + 32 bits (16x 2-bit codes)]

9/20/2012
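The fixed ratios in the table and the DXT1 palette rules can be checked with a small sketch. For simplicity, colors are treated here as per-channel scalars; real DXT1 stores 16-bit RGB565 endpoint colors and interpolates per channel.

```python
def dxt_ratio(bits_per_pixel, block_bits):
    """Fixed compression ratio: 16 pixels per 4x4 block vs. the DXT block size."""
    return (16 * bits_per_pixel) // block_bits

# 32-bit RGBA input (512 bits/block) vs. the 64-bit DXT1 block gives 8:1;
# 24-bit RGB input (384 bits/block) gives 6:1; the 128-bit DXT3/DXT5
# blocks give 4:1.

def dxt1_palette(c0, c1):
    """The four palette values a DXT1 block's 2-bit codes select (one channel)."""
    if c0 > c1:                              # 4-color mode
        return [c0, c1, (2 * c0 + c1) // 3, (c0 + 2 * c1) // 3]
    return [c0, c1, (c0 + c1) // 2, 0]       # 3-color mode; code 3 is black
```

Compression then reduces to picking, for each of the 16 pixels, the 2-bit code whose palette entry is nearest to the pixel's color.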
DXT Compression: GPU-based Parallelization

[Figure only.]
Color Converter: UYVY (YUV422) to RGB

- Each CUDA thread converts one UYVY packed pixel (U Y1 V Y2) into two RGBA pixels
- Conversion formula:
  - R = 1.164(Y - 16) + 1.596(V - 128)
  - G = 1.164(Y - 16) - 0.813(V - 128) - 0.391(U - 128)
  - B = 1.164(Y - 16) + 2.018(U - 128)
  - A = 0
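A direct transcription of the formula above; the actual element is a CUDA kernel, and the clamping/rounding shown here is an assumption about how out-of-range values are handled.

```python
def clamp(x):
    """Round and clamp a channel value into the 8-bit range."""
    return max(0, min(255, int(round(x))))

def yuv_to_rgb(y, u, v):
    """BT.601-style YUV -> RGB conversion, as given on the slide."""
    r = 1.164 * (y - 16) + 1.596 * (v - 128)
    g = 1.164 * (y - 16) - 0.813 * (v - 128) - 0.391 * (u - 128)
    b = 1.164 * (y - 16) + 2.018 * (u - 128)
    return clamp(r), clamp(g), clamp(b)

def convert_macropixel(u, y1, v, y2):
    """What each CUDA thread does: two RGB pixels from one packed UYVY quad,
    sharing the U/V pair between both pixels."""
    return yuv_to_rgb(y1, u, v), yuv_to_rgb(y2, u, v)
```

Reference black (Y=16, U=V=128) maps to (0, 0, 0) and reference white (Y=235, U=V=128) to (255, 255, 255), which is a quick sanity check on the coefficients.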
DXT Compression

Applies the FastDXT algorithm [2].
DXT Compression: Case #1 (Separate Color Conversion and DXT Compression)

- The two stages have different processing properties per CUDA thread
  - E.g., for a 1920x1080 frame with 64 threads per block and a maximum of 3072 blocks per kernel launch, color conversion launches its kernel 6 times while DXT compression launches its kernel once
- GPU: NVIDIA GeForce GTS 450 (1.61 GHz, 192 cores, 1 GB RAM with a 128-bit bus)
- Experiment with the CUDA Compute Visual Profiler
  - Input source: one 4K frame; total: 19 ms; processing: 11.65 ms
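The launch-count arithmetic for the 1920x1080 example can be reproduced as follows, assuming one thread per UYVY macropixel (two pixels) for color conversion and one thread per 4x4 pixel block for compression; the latter thread mapping is an assumption about the FastDXT kernel.

```python
import math

W, H = 1920, 1080
THREADS_PER_BLOCK = 64
MAX_BLOCKS_PER_LAUNCH = 3072

def launches(n_threads):
    """Kernel launches needed when the grid is capped at 3072 blocks."""
    blocks = math.ceil(n_threads / THREADS_PER_BLOCK)
    return math.ceil(blocks / MAX_BLOCKS_PER_LAUNCH)

# Color conversion: one thread per 2-pixel macropixel
#   1,036,800 threads -> 16,200 blocks -> 6 launches
conv_launches = launches(W * H // 2)

# DXT compression: one thread per 4x4 pixel block
#   129,600 threads -> 2,025 blocks -> 1 launch
dxt_launches = launches((W // 4) * (H // 4))
```

This shows why the conversion kernel needed 6 launches while compression fit in one on that GPU's grid limit.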
DXT Compression: Case #2 (Pipelining Based on Slices)

- Concurrent CUDA kernel execution overlapped with data copy (using pinned memory)

[Figure: host-to-device memory copy performance]
DXT Compression: Case #2 (Tuning)

- 13.472 ms (down 4.936 ms) with 128 threads
NetOpen Switch: Experimental Verification

- Service 1: shortest-path connection (flow-based forwarding)
- Service 2: in-network DXT compression (GPU-based)
- Service 3: rate-shaped connection

[Figure: testbed topology controlled by a NOX core. The HD sender H01 (fed by an HD camera) and the background traffic sender H02 connect to NetOpen node N01; N01 connects to N02; the HD receiver H03 and the background traffic receiver H04 connect to N02.]
Verification Results (Functionality)

[Figures: frame rates measured at H03]
- Frame rates at H03 with an 800 Mbps background flow injected
- Frame rates at H03 with in-network DXT compression (background: 500 Mbps)
- Frame rates at H03 with in-network DXT compression (background: 800 Mbps)
- Frame rates at H03 with in-network DXT compression (background traffic rate-shaped in-network from 800 Mbps to 500 Mbps)
Conclusion

- The NetOpen switch node: a programmable network substrate supporting the concept of the NetOpen networking service
- Designed and prototyped a NetOpen switch node providing in-network processing
  - Provides service functionalities as programmable processing modules that can be enabled independently
  - For each flow, a customized data plane can be built by selectively combining service functionalities (under logically centralized control)
- Future work
  - Enhance the performance of in-network processing
  - Continue developing the next versions of the NetOpen switch node
Thank you!

Send inquiries to jongwon@gist.ac.kr
http://nm.gist.ac.kr
Gwangju Institute of Science & Technology