Supercomputing Clusters with RapidIO Interconnect Fabric
Devashish Paul, Director Strategic Marketing, Systems Solutions
devashish.paul@idt.com
Ethernet Summit 2015, April 14-16, 2015, Santa Clara, CA
Integrated Device Technology
Agenda
- HPC Analytics Interconnect Development
- Supercompute Clusters
- RapidIO Interconnect Technology complements Ethernet
- RapidIO Attributes
- Heterogeneous Accelerators
- Open Compute HPC
- Summary and Wrap Up
Themes: heterogeneous compute acceleration; low latency; reliable; scalable; fault-tolerant; energy efficient
Supercomputing Needs Heterogeneous Accelerators
Interconnect spans every level of the system:
- Chip to chip
- Board to board, across backplanes
- Chassis to chassis, over cable
- Top of rack
[Diagram: heterogeneous computing with a rack-scale fabric for any-to-any compute -- FPGA, GPU, DSP, CPU and storage joined by a local interconnect fabric, a backplane interconnect fabric, and a rack-scale interconnect fabric (cable/connector) to other nodes]
HPC/Supercomputing Interconnect Check-In
Candidate fabrics: RapidIO, InfiniBand, Ethernet, PCIe. What each requirement means:
- Low latency: switch silicon ~100 ns; memory to memory < 1 us
- Scalability: ecosystem supports any topology, 1000s of nodes
- Integrated HW termination: available integrated into SoCs, and implements guaranteed in-order delivery without software
- Power efficient: 3 layers terminated in hardware, integrated into SoCs
- Fault tolerant: ecosystem supports hot swap and fault tolerance
- Deterministic: guaranteed in-order delivery; deterministic flow control
- Top-line bandwidth: ecosystem supports > 8 Gbps/lane
Clustering Fabric Needs
- Lowest deterministic system latency
- Scalability: peer to peer / any topology
- Embedded endpoints
- Energy efficiency and cost per performance
- HW reliability and determinism
PCIe delivers low latency and hardware termination but limited scalability; Ethernet delivers scalability but guarantees delivery only in software. RapidIO combines the best attributes of PCIe and Ethernet in a multi-processor fabric.
RapidIO on Ethernet Electricals
- Ethernet for external networking
- RapidIO for the internal fabric, running on Ethernet electricals
Open Compute Project HPC and Supercomputing
RapidIO Attributes
- 20 Gbps per port (6.25 Gbps/lane) in production; 40 Gbps per port (10 Gbps/lane) in development
- Embedded RapidIO NIC on processors, DSPs, FPGAs and ASICs
- Hardware termination at the PHY layer: 3-layer protocol
- Lowest-latency interconnect: ~100 ns
- Inherently scales to large systems with 1000s of nodes
- Over 13 million RapidIO switches shipped, > 2x Ethernet (10 GbE)
- Over 70 million 10-20 Gbps ports shipped
- 100% 4G interconnect market share; 60% 3G, 100% China 3G market share
Peer-to-Peer & Independent Routing
- Routing is easy: target ID based
- Every endpoint has a separate memory system
- All layers terminated in hardware
Packet fields (widths in bits), physical -> transport -> logical:
- Physical: S, AckID, Rsrv, S, Rsrv, Prio (2)
- Transport: TT, Target Address (8 or 16), Source Address (8 or 16)
- Logical: Transaction (4), Ftype (4), Size (4), Source TID (8), Device Offset Address (32, 48 or 64)
- Optional data payload (8 to 256 bytes), CRC (16)
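The header layout above can be sketched as a bit-packing exercise. A minimal sketch, assuming 8-bit device IDs and the field widths listed on this slide; real RapidIO packets also carry the physical-layer AckID bits and trailing CRC, which are omitted here for brevity.

```python
# Sketch: pack a simplified RapidIO header (assumption: 8-bit IDs, widths
# taken from the field list above; AckID/CRC and framing omitted).

def pack_header(prio, tt, ftype, target_id, source_id, transaction, size, tid):
    """Concatenate header fields MSB-first into one integer."""
    fields = [  # (value, width in bits)
        (prio, 2), (tt, 2), (ftype, 4),
        (target_id, 8), (source_id, 8),
        (transaction, 4), (size, 4), (tid, 8),
    ]
    word = 0
    for value, width in fields:
        assert 0 <= value < (1 << width), "field value out of range"
        word = (word << width) | value
    return word

hdr = pack_header(prio=1, tt=0, ftype=2, target_id=0x03,
                  source_id=0x00, transaction=4, size=0xB, tid=0x42)
print(hex(hdr))  # 0x4203004b42
```

Because every field is fixed-width and position-coded, a switch can extract the target address with a constant shift-and-mask, which is part of why hardware termination is cheap.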
Why RapidIO for Low Latency: Bandwidth and Latency Summary
- Switch per-port performance (raw data rate): 20 Gbps today, 40 Gbps in development
- Switch latency: 100 ns
- End-to-end packet termination: ~1-2 us
- Fault recovery: 2 us
- NIC latency (Tsi721, PCIe2 to S-RIO): 300 ns
- Messaging performance: excellent
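The per-hop numbers above can be summed to sanity-check the memory-to-memory claim from the earlier requirements slide. A rough sketch, assuming a path of source NIC, one or more switch hops, and destination NIC, with no queueing delay:

```python
# Sketch: sum the quoted per-hop latencies (nanoseconds) to check the
# "memory to memory < 1 us" claim. Topology assumed: NIC -> switches -> NIC.
SWITCH_NS = 100   # per switch hop (slide figure)
NIC_NS = 300      # Tsi721 PCIe2-to-S-RIO NIC latency, each end (slide figure)

def memory_to_memory_ns(switch_hops):
    return NIC_NS + switch_hops * SWITCH_NS + NIC_NS

print(memory_to_memory_ns(1))  # 700 -- single-switch path stays under 1 us
```

Even a three-hop path (900 ns by this model) fits the sub-microsecond budget, which is consistent with the ~1-2 us end-to-end termination figure once software overhead is included.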
RapidIO.org ARM 64-bit Scale-Out Group
Goals: latency, scalability, hardware termination, energy efficiency; 10s to 100s of cores and sockets
- ARM AMBA protocol mapping to RapidIO protocols
- AMBA 4 AXI4/ACE mapping to RapidIO protocols
- AMBA 5 CHI mapping to RapidIO protocols
- Migration path from AXI4/ACE to CHI and future ARM protocols
- Supports heterogeneous computing
- Supports wireless/HPC/data center applications
Source: www.rapidio.org, Linley Processor Conference
NASA Space Interconnect Standard
- Next Generation Spacecraft Interconnect Standard (NGSIS)
- RapidIO selected over InfiniBand, Ethernet, Fibre Channel and PCIe
- NGSIS members: BAE, Honeywell, Boeing, Lockheed Martin, Sandia, Cisco, Northrop Grumman, Loral, LGS, Orbital Sciences, JPL, Raytheon, AFRL
RapidIO in Open Compute Interconnect Supports Opencompute.org HPC Initiative
Computing: Open Compute Project HPC and RapidIO
- Low-latency analytics is becoming more important, not just in HPC and supercomputing but in other computing applications
- Mandate: create open, latency-sensitive, energy-efficient board- and silicon-level solutions
- Input requirements: latency, scalability, hardware termination, energy efficiency
- RapidIO submission: low-latency interconnect; RapidIO port shipments = 2x 10 GigE
- OCP HPC interconnect development: LaserConnect 1.0 = RapidIO 10xN (2015); LaserConnect 2.0 = 4x 25 Gbps, multi-vendor collaboration
RapidIO at CERN: LHC and Data Center
- RapidIO low-latency interconnect fabric for heterogeneous computing
- Large, scalable multi-processor systems
- Desire to leverage multi-core x86, ARM, GPU, FPGA and DSP with a uniform fabric
- Desire programmable upgrades during operations, before shutdowns
Heterogeneous HPC with RapidIO based Accelerators
RapidIO Heterogeneous Switch + Server
- 4x 20 Gbps RapidIO external ports; 4x 10 GigE external ports
- 4 processing mezzanine cards per chassis
- 320 Gbps of switching, with 3 ports to each processing mezzanine
- Compute nodes with x86 use a PCIe-to-S-RIO NIC
- Compute nodes with ARM/PPC/DSP/FPGA are natively RapidIO connected, with a small switching option on the card
- 20 Gbps RapidIO links to backplane and front panel for cabling
- Co-located storage over SATA; 10 GbE added for ease of migration
[Diagram: 19-inch chassis with x86 CPU (PCIe to S-RIO), ARM CPU, GPU and DSP/FPGA around a RapidIO switch; external panel with S-RIO and 10 GigE]
x86 + GPU Analytics Server + Switch
- Energy-efficient NVIDIA K1-based mobile GPU cluster acceleration
- 300 Gbps RapidIO switching in a 19-inch server
- 8-12 nodes per 1U; 12-18 Teraflops per 1U
38x 20 Gbps Low-Latency ToR Switching
- Switching at board, chassis, rack and top-of-rack level, with scalability to 64K nodes and a roadmap to 4 billion
- 760 Gbps full-duplex bandwidth with 100-300 ns typical latency
- 32x RapidIO 20 Gbps full-duplex downlink ports (QSFP+, front)
- 6x RapidIO 20 Gbps downlink ports (CXP, front)
- 2x management I2C (RJ.5, front); 1x 10/100/1000BASE-T (RJ.5, front)
www.prodrive.nl
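The headline bandwidth figure above is simply the downlink port count times the per-port rate, which this small check reproduces:

```python
# Sketch: 760 Gbps = (32 QSFP+ downlinks + 6 CXP downlinks) x 20 Gbps/port,
# matching the "38x 20 Gbps" slide title.
qsfp_ports, cxp_ports, gbps_per_port = 32, 6, 20
total_gbps = (qsfp_ports + cxp_ports) * gbps_per_port
print(total_gbps)  # 760
```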
Analytics Acceleration with GPU Clusters
- Announced at SC14, New Orleans
- Can be used in the data center or at the edge of the wireless network for analytics acceleration
- Prototype cluster based on Jetson K1 PCIe boards: NVIDIA K1 + RapidIO, using the Tsi721 PCIe-to-S-RIO NIC
- Low-energy mobile GPU + RapidIO for analytics
- 16 Gbps per node; 24 flops per bit of I/O flux
RapidIO with Mobile GPU Compute Node
- 4x Tegra K1 GPUs on a RapidIO network
- 140 Gbps embedded RapidIO switching; 4x PCIe2-to-RapidIO NIC silicon
- 384 Gigaflops per GPU; > 1.5 Tflops per AMC
- 12 Teraflops per 1U; 0.5 Petaflops per rack
- GPU compute acceleration
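The density figures above follow from simple multiplication; this sketch reproduces them, assuming 8 AMCs per 1U (a figure implied by the slide numbers, not stated):

```python
# Sketch: check the compute-density arithmetic (single-precision Gflops).
GPU_GFLOPS = 384      # per Tegra K1 GPU (slide figure)
gpus_per_amc = 4
amcs_per_1u = 8       # assumption: implied by 12 Tflops per 1U

amc_tflops = gpus_per_amc * GPU_GFLOPS / 1000
u1_tflops = amcs_per_1u * amc_tflops
print(amc_tflops, u1_tflops)  # 1.536 12.288 -- matches >1.5 Tflops/AMC, ~12 Tflops/1U
```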
Social Media Real-Time Analytics: FIFA World Cup 2014
- Analyzed user impressions of World Cup 2014
- Future: analytics acceleration using GPUs
RapidIO 10xN Development and Futures
- 3rd-generation scalable embedded peer-to-peer multi-processor interconnect: on-board, board-to-board, chassis-to-chassis
- Data rate of 40-160 Gbps per port; 10.3125 Gbaud per serial lane, with the option of moving to 12.5 Gbaud in the future
- Long-reach support; short reach: 20 cm with 1 connector, or 30 cm with no connector
- Backward compatible with RapidIO Gen2 switches and endpoints (5 & 6.25 Gbps)
- Lane widths of x1, x2, x4, x8, x16; speed granularity of 1, 2, 4, 5, 10, 20, 40 Gbps
Status: 40 Gbps IP core released; switches in development; bridging and embedded 40 Gbps NICs in definition; future: ARM 64-bit cache coherency IP
[Diagram: 40 Gbps S-RIO 10xN switch linking x86 (via PCIe3) and ARM CPUs]
Summary: Low-Latency Data Center Acceleration with RapidIO + GPU
- 100% 4G market share: all 4G calls worldwide go through RapidIO switches
- 2x the market size of 10 GbE (70 million S-RIO ports)
- 20 Gbps per port in production; 40 Gbps per port silicon in development now
- Low ~100 ns latency; scalable, energy-efficient interconnect
- Supports dense GPU-based compute, based on 384 Gflops per GPU single precision; 12-18 Teraflops per 1U server
- Ideal for analytics, deep learning and pattern recognition
Backup: RapidIO Protocol Background
RapidIO with x86 and/or GPU
- RapidIO-based x86 + GPU heterogeneous computing: x86 moves data, GPU provides compute/analytics
- RapidIO NIC silicon: 13x13 mm, 2 W
- Best-in-class end-to-end latency: < 1.2 microseconds any-to-any
- Supports any kind of topology
- Ideal for analytics, deep learning and pattern recognition
[Diagram: server node with an x86 CPU and four GPUs, each behind a PCIe2-to-S-RIO2 NIC, joined by RapidIO switches]
Lowest-latency scalable acceleration
RapidIO Interoperable Ecosystem
- Open ecosystem, no vendor lock-in
Source: http://www.rapidio.org/files/rapidio_asia_summit_intro.pdf
Requirements: Low Latency
RapidIO networks are built from two basic blocks: endpoints and switches.
- Endpoints source and sink packets; each has a device ID (e.g. Host A ID=0x00, Agent B ID=0x01, Agent A ID=0x02, Agent C ID=0x03)
- Switches pass packets between ports without interpreting them; routing is a simple lookup through a look-up table
- Example look-up table (destination ID -> output port): 0 -> 2, 1 -> 0, 2 -> 3, 3 -> 1
- Typical switch latency is 100-150 ns
[Diagram: Host A and Agents A/B/C attached to ports 0-3 of a four-port switch]
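The table-driven forwarding described above can be sketched in a few lines; the IDs and ports mirror the example table on this slide:

```python
# Sketch: a RapidIO switch forwards on destination (target) ID alone,
# via a look-up table, and never inspects the payload.
ROUTE_LUT = {0x00: 2, 0x01: 0, 0x02: 3, 0x03: 1}  # destination ID -> port

def route(packet):
    """Return the output port for a packet based only on its target ID."""
    return ROUTE_LUT[packet["target_id"]]

print(route({"target_id": 0x03, "payload": b"..."}))  # 1
```

Because the lookup is a single indexed read with no header parsing beyond the transport layer, it maps naturally to the 100-150 ns hardware latency quoted above.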
Messaging
- Messaging uses a push architecture: the receiver is responsible for storing the message in its memory system
- Overall system latency per message transfer is reduced significantly
[Diagram: outbound message transmitter with queues 0-3 pushing over RapidIO to an inbound message receiver with queues 0-3 in local memory]
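The push model above can be sketched as follows; the class and function names are illustrative, not RapidIO API names:

```python
# Sketch of push messaging: the sender targets a mailbox (queue) on the
# receiving endpoint, and the message lands directly in the receiver's
# local memory -- no round trip to ask the receiver where to put it.
from collections import deque

class Endpoint:
    def __init__(self, num_queues=4):
        # inbound message queues live in the receiver's local memory
        self.queues = [deque() for _ in range(num_queues)]

    def receive(self, queue_id, message):
        self.queues[queue_id].append(message)

def send(dest, queue_id, message):
    # push architecture: delivery happens on the sender's initiative
    dest.receive(queue_id, message)

rx = Endpoint()
send(rx, 2, b"hello")
print(len(rx.queues[2]))  # 1
```

Contrast this with a pull model, where the receiver must poll or be interrupted to fetch the data, adding a round trip to every transfer.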
RDMA: RapidIO Logical Layer
RDMA (Remote Direct Memory Access): one process writes directly into another process's memory; the fabric operates as a bus extension. RDMA is the most efficient, lowest-latency inter-process communication (IPC) mechanism.
Hardware (all terminated in hardware):
- Messaging: RDMA connection management
- Read/Write/Atomic: RDMA transfers
- Doorbells: events
Software:
- The OS connects processes for RDMA
- The OS is bypassed for RDMA transfers
[Diagram: source process/thread writing through the RapidIO fabric into the DDR of a destination process/thread]
Low latency * scalable * integrated HW termination * power efficient * fault tolerant
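The bus-extension semantics above can be modeled with a registered memory window; names here are illustrative, and on a real fabric the write below would be executed by the NIC hardware, not by software:

```python
# Sketch of an RDMA write: once the OS has connected the processes
# (connection management), each transfer is a direct placement into the
# destination's registered memory window, with no OS call per transfer.
class MemoryWindow:
    """A registered region of a destination process's memory (hypothetical)."""
    def __init__(self, size):
        self.buf = bytearray(size)

def rdma_write(window, offset, data):
    # bus-extension semantics: a remote write is just a store into the
    # target's address space; on RapidIO this is terminated in hardware
    window.buf[offset:offset + len(data)] = data

win = MemoryWindow(64)
rdma_write(win, 8, b"payload")
print(bytes(win.buf[8:15]))  # b'payload'
```

This is why the slide separates the OS's role (setup) from the data path (bypassed): the per-transfer cost is a hardware write, not a system call.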
Low Latency Open Fabric for Heterogeneous Computing
[Diagram: rack-scale fabric as shown earlier -- FPGA, GPU, DSP, CPU and storage joined by local, backplane and rack-scale interconnect fabrics to other nodes]