Prototyping a Programmable Computing/Networking Switch

Similar documents
Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

IP Video Rendering Basics

GPGPU Computing. Yong Cao

NVIDIA GeForce GTX 580 GPU Datasheet

Autodesk Revit 2016 Product Line System Requirements and Recommendations

Introduction to GPGPU. Tiziano Diamanti

Computer Graphics Hardware An Overview

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

Networking Virtualization Using FPGAs

ultra fast SOM using CUDA

Power Benefits Using Intel Quick Sync Video H.264 Codec With Sorenson Squeeze

Introduction to GPU Architecture

Cloud Gaming & Application Delivery with NVIDIA GRID Technologies. Franck DIARD, Ph.D. GRID Architect, NVIDIA

OpenFlow with Intel Voravit Tanyingyong, Markus Hidell, Peter Sjödin

Open Source Network: Software-Defined Networking (SDN) and OpenFlow

The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA

The Future Of Animation Is Games

Interactive Level-Set Deformation On the GPU

An Efficient Application Virtualization Mechanism using Separated Software Execution System

Network Virtualization and Application Delivery Using Software Defined Networking

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

GPU-based Decompression for Medical Imaging Applications

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

GPUs for Scientific Computing

================================================================== CONTENTS ==================================================================

CLOUD GAMING WITH NVIDIA GRID TECHNOLOGIES Franck DIARD, Ph.D., SW Chief Software Architect GDC 2014

Transcend the Vision. Embedded Graphic Solutions that Lead to New Territory. Embedded Graphic Solutions.

Xeon+FPGA Platform for the Data Center

Low power GPUs a view from the industry. Edvard Sørgård

Qualified Apple Mac Workstations for Avid Media Composer v5.0.x

System requirements for Autodesk Building Design Suite 2017

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

GPU Parallel Computing Architecture and CUDA Programming Model

~ Greetings from WSU CAPPLab ~

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

Real-time Visual Tracker by Stream Processing

Several tips on how to choose a suitable computer

GPU Architecture. Michael Doggett ATI

Recent Advances and Future Trends in Graphics Hardware. Michael Doggett Architect November 23, 2005

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

Real-Time BC6H Compression on GPU. Krzysztof Narkowicz Lead Engine Programmer Flying Wild Hog

Lecture Notes in Computer Science: Media-Oriented Service Overlay Network Architecture over Future Internet Research for Sustainable Testbed

L20: GPU Architecture and Models

Next Generation GPU Architecture Code-named Fermi

Silverlight for Windows Embedded Graphics and Rendering Pipeline 1

The Lagopus SDN Software Switch. 3.1 SDN and OpenFlow. 3. Cloud Computing Technology

Boundless Security Systems, Inc.

Parallelization of video compressing with FFmpeg and OpenMP in supercomputing environment

MIDeA: A Multi-Parallel Intrusion Detection Architecture

Real-Time Realistic Rendering. Michael Doggett Docent Department of Computer Science Lund university

Using Fuzzy Logic Control to Provide Intelligent Traffic Management Service for High-Speed Networks ABSTRACT:

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Getting Started with RemoteFX in Windows Embedded Compact 7

The Collaboratorium & Remote Visualization at SARA. Tijs de Kler SARA Visualization Group (tijs.dekler@sara.nl)

HPC with Multicore and GPUs

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS

GeoImaging Accelerator Pansharp Test Results

DATA/SPEC SHEET VIRTUAL MATRIX DISPLAY CONTROLLER VERSION 8

Getting Started with the ZED 2 Introduction... 2

Brainlab Node TM Technical Specifications

2020 Design Update Release Notes November 10, 2015

Whitepaper. NVIDIA Miracast Wireless Display Architecture

Packet-based Network Traffic Monitoring and Analysis with GPUs

NVIDIA VIDEO ENCODER 5.0

Introduction to GPU hardware and to CUDA

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

Using MATLAB to Measure the Diameter of an Object within an Image

High-performance vswitch of the user, by the user, for the user

Parallel Computing with MATLAB

Stream Processing on GPUs Using Distributed Multimedia Middleware

Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build

Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter

Software Defined Networking and the design of OpenFlow switches

Home Exam 3: Distributed Video Encoding using Dolphin PCI Express Networks. October 20 th 2015

10 Gbit Hardware Packet Filtering Using Commodity Network Adapters. Luca Deri Joseph Gasparakis

ARM Processors for Computer-On-Modules. Christian Eder Marketing Manager congatec AG

Software Defined Networking What is it, how does it work, and what is it good for?

NVIDIA CUDA GETTING STARTED GUIDE FOR MICROSOFT WINDOWS

Unifying the Programmability of Cloud and Carrier Infrastructure

SGRT: A Scalable Mobile GPU Architecture based on Ray Tracing

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

ADVANTAGES OF AV OVER IP. EMCORE Corporation

HIGH PERFORMANCE VIDEO ENCODING WITH NVIDIA GPUS

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

Performance of Software Switching

Interactive Level-Set Segmentation on the GPU

This letter contains latest information about the above mentioned software version.

Outline. Institute of Computer and Communication Network Engineering. Institute of Computer and Communication Network Engineering

ACANO SOLUTION VIRTUALIZED DEPLOYMENTS. White Paper. Simon Evans, Acano Chief Scientist

Alberto Corrales-García, Rafael Rodríguez-Sánchez, José Luis Martínez, Gerardo Fernández-Escribano, José M. Claver and José Luis Sánchez

Choosing a Computer for Running SLX, P3D, and P5

OpenFlow and Onix. OpenFlow: Enabling Innovation in Campus Networks. The Problem. We also want. How to run experiments in campus networks?

HP Workstations graphics card options

NVIDIA workstation 3D graphics card upgrade options deliver productivity improvements and superior image quality

Optimizing AAA Games for Mobile Platforms

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

Feasibility Study of Searchable Image Encryption System of Streaming Service based on Cloud Computing Environment

Adoption of SDN: Progress Update

Transcription:

Heterogenesous Computing Tutorial @ Gwangju, Korea Sep. 20, 2012 Prototyping a Programmable Computing/Networking Switch JongWon Kim, Namgon Kim, and Namgon Lucas Kim, Gwangju Institute of Science and Technology () jongwon@nm.gist.ac.kr Sep. 20, 2012 1

Toward Balanced Service Composition based on Programmable (and Virtualized) Resources

Multi-Screen Content Consumption Content servers Cloud Logically Centralized Controller Software (Logically centralized) VisualXCoordinator Visualization Sharing Content Sharing at home Screen capturing Switch Control Software Switch Control Software Network Switch Control Software Many flows Tiling VisualXSwitch (Com + Net) at airport lobby Transcoding Switch Control Software Switch Control Software Control Agent (having partial knowledge) Tiling at cafeteria screen-casting at street at office 3

Multi-party Visual Sharing OptIPortal Termination Device for the OptIPuter Global Backplane SAGE (Scalable Adaptive Graphics Environment) Visualcasting enables Integration of HD Streams into Tiled Displays (2008. Apr) 9/20/2012 KIISE 2011 4

Programmable Computing/Networking Switch: Why? Emergence of various networkconnected consumer devices The demands for accessing video contents from heterogeneous devices with different screens and different capabilities are increasing Real-time content adaptation is challenging Transcoding the content while maintaining good video quality Maintaining appropriately short input-to-output latency We introduce a programmable switching node supporting innetwork processing with balanced use of computing/networking resources 5

Networking Service and SDN Networking Service The collection of network-centric services that assist the networking (e.g., transport) of diverse flows for computing centric services SDN (Software-Defined Networking) Restructures networking paradigm by exposing network APIs so that any software can program the networking nodes as they want By employing the programmability of SDN, we attempt to fill the gap caused by existing network services flows Networking Service SDN Network Substrate 6

Features of NetOpen Networking Service Extended flow-based networking: An extension of flow-based networking to reinforce the balanced utilization of both networking and computing resources Primitive-based loosely coupled creation of services Service primitives as linkage points between NetOpen networking services and programmable network substrates Identify and then link the key features (from resource capabilities) Interactive service operation via intuitive UIs 7

A Road Map to Build SmartX nodes SmartX Node (NetOpen + MediaX) MediaX Cloud Node NetOpen Node v2.0 + networking OF + Click + GPU + Storage + 10G NIC NetOpen Node v1.3 Cloud-based MediaX Node v1.2 OpenVSwitch+Cloud OF + Click + GPU OF + Click OF + NetFPGA NetOpen Node v1.1 NetOpen Node v1.0 NetOpen Node v1.2 NetOpen Switch Nodes Wireless NetOpen Node v1.1 (wallbox) Mobile SmartX Node v0.9 (tablet, netbook, laptop) Mobile MediaX Node with GPU (tablet, netbook, laptop) Mobile SmartX Nodes OF + Click PC-based MediaX Node with GPU v1.1 PC-based MediaX Node v1.0 MediaX Nodes 8

NetOpen Switch Node (v1.2) A Programmable Networking Switch Node with Innetwork Processing Support Design issues Independent processing module for a service functionality For each flow, a customized data plane can be built by selectively combining processing modules This buildup can be controlled by a logically centralized controller flows Networking Service SDN Network Substrate CPU GPU Commodity Processors 9

In-network Processing leveraging Heterogeneous Computing In-network Processing Additional processing on the packets before forwarding Middle-boxes provide required processing functionalities, e.g., packet caching, transcoding, network coding Leverage Heterogeneous Computing (with CPU/GPU/ ) for in-network processing Powerful CPUs with multiple cores (e.g. Intel Xeon E7-2820, AMD FX-8150) GPUs & GPGPUs (General Purpose Graphics Processing Units) with hundred cores (NVIDIA GTX580: 512 CUDA cores) 10

How to handle In-network Processing In-network processing for a flow is explained by a task, defined as a set of actions (provided by processing modules) mapped to a flow For example, a task for a video flow can be composed of decode, resize, and forward actions with segmentation and packetization Segmentation: Checks the application-layer packet header of each packet and batches the payload into current segment (Assuming the sequential transport of video frames) Packetization: Converts a segment back into packets Packet segment size size S Segment Segment Data size size P size(header) Packet P Header D (Data) S S P Header D (Data) D D D Segment <Segmentation> <Packetization> 11

Early Design of NetOpen Switch Node for In-network Processing We need a special purpose task dispatcher between task dispatchers and GPUs, shader, to avoid performance degradation due to multiple task dispatchers accessing GPU at the same time 12

Prototype Implementation Hardware CPU: 2 Intel Xeon quad-core CPUs (1.6Ghz) with 12GB memory GPU: NVIDIA GTX590 GPU 2 stream processors (each has 512 cores) 3GB GDDR5 memory NIC: Six 1G NICs, two 2port 1G NICs, and two WiFi NICs Software (Click + OpenFlow) Click: A toolkit for building a software modular router by combining Click elements (i.e., a small module that has a specific functionality) Each service primitive is implemented as a Click element that is controlled based on OpenFlow 13

Supported Primitives for In-networking Processing DXT compression (P DXTC ) /decompression (P DXTD ) DXT (i.e., S3 texture compression) is a light-weight compression scheme that compresses an image frame based on 4x4-pixel blocks The compression operation of each pixel block is completely independent, thus we significantly reduce the compression time by implementing DXT compression using hundreds of GPU cores Rate shaping (P rshape ) Control the rate of traffic received on a network interface Traffic that is less than or equal to the specified rate is sent, whereas traffic that exceeds the rate is dropped or delayed Implemented using BandwidthShaper element of Click YUV2RGB conversion (P Y2R ) Recent camera models produce YUV-format video content due to the small size of YUV format (i.e., half of the RGB format) DXT compressor only takes RGB-format pixel blocks, we need a conversion element to support various formats of pixel blocks Resize (P resize ) It changes the width and height of a video frame 14

Task Dispatcher Threads Main Thread Flow Batcher Forwarding Element Per-flow Customized Data Plane Flow Table Actions Packet (Forwarding) Packet (In-network Processing) Packet Segmentation Handler Input Task Queue Output Task Queue Segmentation Handler Task Task Task Task Dispatcher Task Segment Actions Task Dispatcher Segment Output Action Processing Element (CPU) Processing Element (CPU) Processing Element (GPU) Processing Element (GPU) 15

A Processing Element for GPUbased DXT Compression Task Dispatcher Pass segment Processing Element Return DXT compressed segment 2) Launch Kernel Kernel (DXT Compression) 1) Copy segment to GPU device memory 3) Copy DXT compressed segment from GPU device memory GPU GPU GPU Cores GPU Cores GPU Cores GPU Cores Cores Cores GPU Cores (e.g., 2x 512 cores for GTX 590) 4x4 pixel block DXT block segment segment DXT compressed segment Host Memory DXT compressed segment Device Memory GPU (Graphics Processing Unit) 16

DXT: DirectX Texture Compression S/W based light-weight lossy compression scheme 4x4 block of pixels (512-bit or 384-bit) to a 64-bit or 128-bit Fixed ratio per 4x4-pixel bolck quantity FOURCC Description Comp. ratio Texture DXT1 (BC1) Opaque / 1-bit Alpha 8:1 or 6:1 Simple non-alpha DXT3 (BC2) Explicit alpha 4:1 Sharp alpha DXT5 (BC3) Interpolated alpha 4:1 Gradient alpha Most of VGAs support H/W accelerated DXT-decompression DXT1 (BC1) Example Color0 > Color1 Color0: code 0 Color1: code 1 (2*Color0+Color1)/3: code 2 (Color0+2*Color1)/3: code 3 Color0 <= Color1 Color0: code 0 Color1: code 1 (Color0+Color1)/2: code 2 9/20/2012 Black: code 3 Color 0 Color 1 xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx < 4x4 RGB pixels > < DXT1 block > 32bit (2x16bit) 32bit (16x2bit) 17

DXT Compression: GPU-based parallelization 9/20/2012 18

Color Converter UYVY (YUV422) to RGB Each CUDA thread converter one UVYV packed pixel U Y2 V Y1 A B G R A B G R Pixel 1 Pixel 2 Conversion formula R: 1.164(Y - 16) + 1.596(V - 128) G: 1.164(Y - 16) - 0.813(V - 128) - 0.391(U - 128) B: 1.164(Y - 16) + 2.018(U - 128) A: 0 9/20/2012 KIISE 2011 19

DXT Compression Apply FastDXT algorithm [2] 9/20/2012 KIISE 2011 20

DXT Compression: Case #1 Separate Color conversion and DXT compression Different processing property in CUDA thread E.g.: 1920x1080 frame, 64 threads per block, Max. 3072 blocks per kernel launch Color conversion launches kernel 6 times; DXT compression launches kernel once GPU: NVIDIA GeForce GTS450 (1.61Ghz, 192cores, 1GB RAM w/ 128bit bus) Experiment with CUDA Compute Visual Profiler Input Source: 4K 1 frame Total: 19ms Processing: 11.65ms (us) 9/20/2012 21

DXT Compression: Case #2 Pipelining based on Slices (Pipelining) Concurrent CUDA kernel execution with data copy (using pinned memory) Host to Device Memory Copy Performance 9/20/2012 22

DXT Compression: Case #2 (Tuning) 13.472ms ( 4.936ms) with 128 Threads (us) 9/20/2012 23

NetOpen Switch: Experimental Service 1 (Shortest-path Connection) Verification Service 2 (In-network DXT compression ) Service 3 (Rate-shaped connection) HD Camera NOX Core Flow-based forwarding GPU-based DXT compression Rate shaping HD Sender HD Receiver H01 Background Traffic Sender eth0 eth5 eth1 N01 eth0 eth4 eth3 N02 eth5 eth6 eth0 H03 Background Traffic Receiver H02 eth4 eth0 H04 24

Verification Results (Functionality) Frames rates in H03 with a 800Mbps background flow injection Frames rates in H03 with in-network DXT compression (Background: 500Mbps) Frames rates in H03 with in-network DXT compression (Background: 800Mbps). Frames rates in H03 with in-network DXT compression (Background traffic in-network rate-shaped from 800Mbps to 500Mbps). 25

The NetOpen switch node Conclusion A programmable network substrate supporting the concept of NetOpen networking service Design and prototype a NetOpen switch node providing in-network processing Provides service functionalities as programmable processing modules that can be enabled independently For each flow, a customized data plane can be built by selectively combining service functionalities (under the logically centralized control) Future work Enhancing the performance of in-network processing Continue the next versions of NetOpen switch node 26

Gwangju Institute of Science & Technology Thank you! Send Inquiry to jongwon@gist.ac.kr http://nm.gist.ac.kr 27