Interconnecting Future DoE Leadership Systems




Interconnecting Future DoE Leadership Systems. Rich Graham, HPC Advisory Council, Stanford, 2015

HPC: The Challenges

Proud to Accelerate Future DOE Leadership Systems (CORAL): the Summit and Sierra systems, delivering 5X-10X higher application performance versus current systems with Mellanox EDR 100Gb/s InfiniBand, IBM POWER CPUs, and NVIDIA Tesla GPUs. Mellanox EDR 100G solutions were selected by the DOE for the 2017 leadership systems and deliver superior performance and scalability over current and future competition.

Exascale-Class Computer Platforms: Communication Challenges. Challenges: very large functional unit count (~10M); large on-node functional unit count (~500); deeper memory hierarchies; smaller amounts of memory per functional unit; possible functional unit heterogeneity; component failures as part of normal operation; expensive data movement; and power costs. Solution focus: scalable communication capabilities (point-to-point and collectives); a scalable network with adaptive routing; a scalable HCA architecture; cache-aware network access; low-latency, high-bandwidth capabilities; support for data heterogeneity; a resilient and redundant stack; optimized data movement with independent remote and hardware progress; and power-aware hardware.

Standardize your Interconnect

Technology Roadmap: Always One Generation Ahead. Mellanox interconnect generations of 20Gb/s, 40Gb/s, 56Gb/s, 100Gb/s, and 200Gb/s span the terascale, petascale, and exascale eras (2000-2020). Milestones include the 2003 Virginia Tech (Apple) system (3rd on the TOP500) and Roadrunner (1st), among the Mellanox Connected mega supercomputers.

The Future is Here: Enter the World of Scalable Performance at the Speed of 100Gb/s!

The Future is Here: Entering the Era of 100Gb/s. Adapter: 100Gb/s, 0.7us latency, 150 million messages per second, supporting 10/25/40/50/56/100Gb/s speeds. Switch: 36 EDR (100Gb/s) ports, <90ns latency, 7.2Tb/s throughput. Cabling: copper (passive, active), optical cables (VCSEL), and silicon photonics.

Enter the World of Scalable Performance: 100Gb/s Switch. Switch-IB, the highest-performance switch on the market and the 7th generation InfiniBand switch: 36 EDR (100Gb/s) ports, <90ns latency, 7.2Tb/s throughput, InfiniBand router capability, and adaptive routing.

Enter the World of Scalable Performance: 100Gb/s Adapter. ConnectX-4, the highest-performance adapter on the market. InfiniBand: SDR / DDR / QDR / FDR / EDR. Ethernet: 10 / 25 / 40 / 50 / 56 / 100GbE. Connect. Accelerate. Outperform. 100Gb/s throughput, <0.7us latency, 150 million messages per second. Features include OpenPOWER CAPI technology, CORE-Direct technology, GPUDirect RDMA, Dynamically Connected Transport (DCT), and Ethernet offloads (HDS, RSS, TSS, LRO, LSOv2).

Connect-IB Provides the Highest Interconnect Throughput. [Charts, higher is better: unidirectional and bidirectional bandwidth (MBytes/sec) versus message size (4 bytes to 1MB) for ConnectX-2 PCIe2 QDR, ConnectX-3 PCIe3 FDR, and dual-FDR Connect-IB on Sandy Bridge and Ivy Bridge hosts; Connect-IB peaks at ~12,810 MB/s unidirectional and ~24,727 MB/s bidirectional, versus 3,385/6,521 MB/s for ConnectX-2 QDR. Source: Prof. DK Panda.] Gain your performance leadership with Connect-IB adapters.

Standard! A standardized wire protocol allows mixing and matching of vertical and horizontal components, with the ecosystem built together. Open-source, extensible interfaces can be extended and optimized per application's needs.

Extensible Open Source Framework. [Diagram: applications sit on protocols and accelerators (TCP/IP sockets, RPC, packet processing, storage, MPI / SHMEM / UPC), which use extended verbs plus vendor extensions, running over hardware from multiple vendors (HW A, B, C, D).]
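
To make the "standard, vendor-neutral interface" point concrete, here is a minimal sketch using the standard libibverbs API from rdma-core (assumed installed, linked with -libverbs): it enumerates whatever RDMA devices are present and queries their capabilities through the same verbs calls regardless of vendor.

```c
/* Minimal sketch: enumerate RDMA devices through the standard verbs API.
 * Assumes rdma-core (libibverbs) headers and -libverbs at link time.    */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **list = ibv_get_device_list(&num_devices);
    if (!list) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < num_devices; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        struct ibv_device_attr attr;

        if (ctx && !ibv_query_device(ctx, &attr))
            printf("%-16s ports=%u max_qp=%d max_mr_size=%llu\n",
                   ibv_get_device_name(list[i]),
                   (unsigned)attr.phys_port_cnt, attr.max_qp,
                   (unsigned long long)attr.max_mr_size);
        if (ctx)
            ibv_close_device(ctx);
    }

    ibv_free_device_list(list);
    return 0;
}
```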

Scalability

Collective Operations. Techniques: topology awareness, hardware multicast, a separate virtual fabric, offload, and scalable algorithms. [Charts: Reduce and Barrier collective latency (us) versus number of processes (PPN=8, up to ~2500 processes), with and without FCA.]
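
The charts above come from collective microbenchmarks; a hedged sketch of that kind of measurement is below (iteration count and root rank are arbitrary choices, not the benchmark actually used for the slides).

```c
/* Sketch of a collective-latency microbenchmark in the spirit of the
 * charts above: MPI_Reduce and MPI_Barrier timed across all ranks.    */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, iters = 1000, val = 1, sum = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);            /* synchronize before timing */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Reduce(&val, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    double reduce_us = (MPI_Wtime() - t0) / iters * 1e6;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double barrier_us = (MPI_Wtime() - t0) / iters * 1e6;

    if (rank == 0)
        printf("avg reduce %.1f us, avg barrier %.1f us\n",
               reduce_us, barrier_us);
    MPI_Finalize();
    return 0;
}
```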

The DC Model: Dynamic Connectivity. Each DC initiator can be used to reach any remote DC target. There is no resource sharing between processes: each process controls how many initiators it uses (and can adapt to load), each process controls the usage model (e.g. SQ allocation policy), and there are no inter-process dependencies. The resource footprint is a function of node and HCA capability and is independent of system size, with fast communication setup time. [Diagram: processes 0 through p-1 with footprint ~ p*(cs+cr)/2, where cs is the concurrency of the sender and cr the concurrency of the responder.]

Dynamically Connected Transport: key objects. DC Initiator: initiates data transfer. DC Target: handles incoming data.

Dynamically Connected Transport: Exascale Scalability. [Chart: host memory consumption (MB, log scale) versus system size (8 nodes to 100K nodes) for InfiniHost with RC (2002), InfiniHost-III with SRQ (2005), ConnectX with XRC (2008), and Connect-IB with DCT (2012).]
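
A back-of-the-envelope sketch of why DCT changes the memory curve above: with fully connected RC, per-node connection state grows with the total number of peer processes, while DCT keeps it roughly constant. The ~1 KB per-connection figure and 32 processes per node are assumed illustrative values, not measured ConnectX numbers.

```c
/* Back-of-the-envelope comparison of per-node connection state for
 * fully connected RC versus DCT.  Per-QP cost and PPN are assumptions. */
#include <stdio.h>

int main(void)
{
    const double qp_state_kb = 1.0;      /* assumed per-connection state */
    const long long ppn = 32;            /* processes per node (assumed) */
    long long nodes[] = {8, 2000, 10000, 100000};

    for (int i = 0; i < 4; i++) {
        long long peers = nodes[i] * ppn;              /* remote processes */
        double rc_mb  = peers * ppn * qp_state_kb / 1024.0; /* per node    */
        double dct_mb = ppn * qp_state_kb / 1024.0;    /* a few DCIs per
                                                          process; size-
                                                          independent      */
        printf("%7lld nodes: RC ~%12.0f MB/node, DCT ~%4.0f MB/node\n",
               nodes[i], rc_mb, dct_mb);
    }
    return 0;
}
```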

Reliable Connection Transport Mode

Dynamically Connected Transport Mode

More QoS, Servers, and Topologies. [Diagram: an InfiniBand subnet with multiple fat-tree topologies and subnet managers, connecting servers with multiple HCAs, IPC traffic, storage, routers, network management, a filer, a gateway, and distance storage, under administrator control.]

Scalability Under Load

Adaptive Routing. Purpose: improved network utilization (choose alternate routes on congestion) and network resilience (alternative routes on failure). Supported hardware: SwitchX-2 and Switch-IB, which adds adaptive routing notification.

Mellanox Adaptive Routing Hardware Support. For every incoming packet, the adaptive routing process has two main stages: the route change decision (to adapt or not to adapt) and new output port selection. Mellanox hardware is not topology specific: following the SDN concept, the configuration plane is separated from the data plane and every feature is software controlled. Fat-Tree, Dragonfly, and Dragonfly+ are fully supported, with new hardware features introduced to support Dragonfly and Dragonfly+.

Is the Packet Allowed to Adapt? AR modes: Static (traffic is always bound to a specific port), Time-Bound (traffic is bound to the last port used unless more than Tb seconds have passed since that event), and Free (traffic may select a new output port freely). Packets are classified as Legacy, Restricted, or Unrestricted; destinations are classified as Legacy, Restricted, Timely-Restricted, or Unrestricted. A matrix maps each combination of packet and destination classification to an AR mode, as sketched below.
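
A conceptual illustration of that classification matrix follows. The particular mode assignments are invented for illustration only; the real tables live in the switch and are programmed by the subnet manager and firmware, not by application code.

```c
/* Conceptual sketch: a matrix maps (packet class, destination class)
 * to an adaptive-routing mode.  Assignments below are illustrative.   */
#include <stdio.h>

enum ar_mode   { AR_STATIC, AR_TIME_BOUND, AR_FREE };
enum pkt_class { PKT_LEGACY, PKT_RESTRICTED, PKT_UNRESTRICTED };
enum dst_class { DST_LEGACY, DST_RESTRICTED, DST_TIMELY_RESTRICTED,
                 DST_UNRESTRICTED };

static const enum ar_mode ar_matrix[3][4] = {
    /* dst:        LEGACY     RESTRICTED     TIMELY_RESTR   UNRESTRICTED  */
    /* LEGACY  */ { AR_STATIC, AR_STATIC,     AR_STATIC,     AR_STATIC },
    /* RESTR   */ { AR_STATIC, AR_STATIC,     AR_TIME_BOUND, AR_TIME_BOUND },
    /* UNRESTR */ { AR_STATIC, AR_TIME_BOUND, AR_TIME_BOUND, AR_FREE },
};

int main(void)
{
    static const char *names[] = { "static", "time-bound", "free" };
    enum pkt_class p = PKT_UNRESTRICTED;
    enum dst_class d = DST_UNRESTRICTED;
    printf("unrestricted packet to unrestricted destination -> %s\n",
           names[ar_matrix[p][d]]);
    return 0;
}
```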

Mellanox Adaptive Routing Notification (ARN). Reaction time is critical to adaptive routing: traffic patterns change fast, a better AR decision requires some knowledge of the network state, internal switch-to-switch communication provides it, and the result is faster convergence after routing changes. ARN is a fast notification to the decision point and is fully configurable (topology agnostic). [Diagram: 1) sampling prefers the large flows; 2) the ARN is forwarded when the congestion can't be fixed locally; 3) the ARN causes a new output port selection.] Faster routing modifications, resilient network.

Network Offload

Scalability of Collective Operations. [Diagram: an ideal algorithm versus the impact of system noise on the steps of the collective.]

Scalability of Collective Operations II. [Diagram: an offloaded algorithm and a nonblocking algorithm, showing where the communication processing happens.]

CORE-Direct. Scalable, asynchronous collective communication: communication is managed by the communication resources themselves, avoiding system noise. The HCA works from a task list, where each task names a target QP and an operation: send, wait for completions, enable, or calculate. A conceptual sketch of such a task list follows.
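
The sketch below models the task-list idea only; the types and names are illustrative and are not the actual Mellanox CORE-Direct API. The point is that the host describes the whole collective up front and the HCA progresses it without CPU involvement.

```c
/* Conceptual model of a CORE-Direct style task list: the host builds a
 * list of operations (send, wait, enable, calculate) and hands it to
 * the HCA.  Illustrative types only, not a vendor API.                 */
#include <stdio.h>

enum task_op { TASK_SEND, TASK_WAIT, TASK_ENABLE, TASK_CALC };

struct task {
    enum task_op op;     /* what the HCA should do            */
    int target_qp;       /* QP the task applies to            */
    int count;           /* e.g. completions to wait for      */
};

int main(void)
{
    /* One possible list for rank 0 of a 4-process barrier (see the
     * recursive-doubling example on the next slide).                */
    struct task list[] = {
        { TASK_SEND,   1, 1 },  /* send to partner rank ^ 1          */
        { TASK_WAIT,   1, 1 },  /* wait for that partner's message   */
        { TASK_SEND,   2, 1 },  /* send to partner rank ^ 2          */
        { TASK_WAIT,   2, 1 },  /* wait for that partner's message   */
        { TASK_ENABLE, 0, 0 },  /* signal completion to the host     */
    };
    for (unsigned i = 0; i < sizeof list / sizeof list[0]; i++)
        printf("task %u: op=%d qp=%d count=%d\n",
               i, list[i].op, list[i].target_qp, list[i].count);
    return 0;
}
```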

Example: Four-Process Recursive Doubling. [Diagram: processes 1-4 exchange with their distance-1 partner in step 1 and with their distance-2 partner in step 2.]
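
For reference, a host-side version of the same recursive-doubling pattern, written as a hand-rolled MPI barrier. It assumes a power-of-two number of ranks, matching the four-process example above.

```c
/* Hand-rolled recursive-doubling barrier matching the four-process
 * example above.  Assumes the number of ranks is a power of two.     */
#include <mpi.h>
#include <stdio.h>

static void rd_barrier(MPI_Comm comm)
{
    int rank, size, token = 0, recv = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;       /* step 1: ^1, step 2: ^2, ... */
        MPI_Sendrecv(&token, 1, MPI_INT, partner, 0,
                     &recv,  1, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    rd_barrier(MPI_COMM_WORLD);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d passed the recursive-doubling barrier\n", rank);
    MPI_Finalize();
    return 0;
}
```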

Four-Process Barrier Example Using Managed Queues. [Diagram: the managed-queue operations for rank 0.]

Nonblocking Alltoall (Overlap-Wait) Benchmark. CORE-Direct offload allows the Alltoall benchmark to run at almost 100% compute.
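
A sketch of the overlap-wait pattern that benchmark measures: post a nonblocking all-to-all, do independent work, then wait. Buffer sizes and the "compute" loop are placeholders; how much of the compute time is actually retained depends on whether the MPI library progresses the collective in hardware or needs the CPU.

```c
/* Sketch of the compute/communication overlap pattern: post
 * MPI_Ialltoall, compute, then wait for completion.                 */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int count = 1024;                        /* ints per peer (arbitrary) */
    int *sendbuf = malloc(sizeof(int) * count * size);
    int *recvbuf = malloc(sizeof(int) * count * size);
    for (int i = 0; i < count * size; i++) sendbuf[i] = i;

    MPI_Request req;
    MPI_Ialltoall(sendbuf, count, MPI_INT,
                  recvbuf, count, MPI_INT, MPI_COMM_WORLD, &req);

    /* With offloaded collectives, this loop runs while the network
     * progresses the exchange; without offload, progress may stall.  */
    volatile double acc = 0.0;
    for (long i = 0; i < 10L * 1000 * 1000; i++) acc += 1e-9 * i;

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}
```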

Optimizing Non-Contiguous Memory Transfers. Supports combining contiguous registered memory regions into a single memory region; the hardware treats them as one contiguous region and handles the non-contiguous pieces. For a given memory region, non-contiguous access is supported through a regular structure representation: base pointer, element length, stride, and repeat count. These can be combined from multiple different memory keys. Memory descriptors are created by posting WQEs that fill in the memory key. Both local and remote non-contiguous memory access are supported, eliminating the need for some memory copies.
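
The (base pointer, element length, stride, repeat count) description above has a familiar host-side analogue in MPI derived datatypes; the sketch below uses MPI_Type_vector to move one column of a matrix without packing it into a staging buffer. The matrix size and column index are examples only, and the program needs at least two ranks.

```c
/* Strided (non-contiguous) transfer described once and sent directly:
 * MPI_Type_vector(count=8, blocklen=1, stride=8) selects one column.   */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double matrix[8][8];
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            matrix[i][j] = (rank == 0) ? i * 8 + j : -1.0;

    MPI_Datatype column;             /* repeat=8, element len=1, stride=8 */
    MPI_Type_vector(8, 1, 8, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)
        MPI_Send(&matrix[0][2], 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(&matrix[0][2], 1, column, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received column element [3][2] = %.0f\n",
               matrix[3][2]);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```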

Optimizing Non-Contiguous Memory Transfers (continued).

On-Demand Paging. No memory pinning, no memory registration caches! Advantages: greatly simplified programming, unlimited MR sizes, and physical memory optimized to hold the current working set. [Diagram: process address-space pages 0x1000-0x6000 mapped to physical frames PFN1-PFN6 and to the corresponding IO virtual addresses.] The ODP promise: the IO virtual address mapping equals the process virtual address mapping.
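
In rdma-core, ODP is requested at registration time through the IBV_ACCESS_ON_DEMAND flag after a capability check. The sketch below assumes a context and protection domain created as in the earlier device-open example; whether ODP is actually usable depends on the HCA and kernel.

```c
/* Sketch: register a memory region with on-demand paging (no pinning),
 * guarded by a device capability check.  Availability depends on HCA.  */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_mr *reg_odp_buffer(struct ibv_context *ctx, struct ibv_pd *pd,
                              void *buf, size_t len)
{
    struct ibv_device_attr_ex attr = {0};

    if (ibv_query_device_ex(ctx, NULL, &attr) ||
        !(attr.odp_caps.general_caps & IBV_ODP_SUPPORT)) {
        fprintf(stderr, "ODP not supported on this device\n");
        return NULL;
    }

    /* No pinning: pages are faulted in by the HCA/driver on first access. */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_ON_DEMAND | IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE);
}
```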

Connecting Compute Elements: x86 -> +GPU -> +ARM/POWER

GPUDirect RDMA. [Diagrams: with GPUDirect 1.0, GPU-to-GPU transmit and receive paths go through system memory and the CPU on each node, via the chipset and the InfiniBand adapter; with GPUDirect RDMA, the InfiniBand adapter reads and writes GPU memory directly, bypassing system memory and the CPU.]

Mellanox PeerDirect with NVIDIA GPUDirect RDMA. HOOMD-blue is a general-purpose molecular dynamics simulation code accelerated on GPUs. GPUDirect RDMA allows direct peer-to-peer GPU communications over InfiniBand, unlocking performance between the GPU and InfiniBand: it significantly decreases GPU-GPU communication latency and provides complete CPU offload for all GPU communications across the network. Up to 102% performance improvement was demonstrated with a large number of particles.
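
From the application's point of view, the GPUDirect RDMA path usually surfaces as a CUDA-aware MPI: device pointers are passed straight to MPI calls and the HCA moves GPU memory without staging through the host. The sketch below assumes such a build (CUDA-aware MPI with GPUDirect RDMA enabled) and at least two ranks; it is an illustration, not the HOOMD-blue code.

```c
/* Sketch: passing a GPU buffer directly to MPI.  Assumes a CUDA-aware
 * MPI library with GPUDirect RDMA enabled on the fabric.              */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t n = 1 << 20;
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));

    /* Device pointer goes straight into MPI; no cudaMemcpy to the host. */
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %zu floats into GPU memory\n", n);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```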

Storage

InfiniBand RDMA Storage: Get the Best Performance! The transport protocol is implemented in hardware, with zero copy using RDMA. [Charts: 4K IO throughput in KIOPs: iSCSI (TCP/IP) 130, 1 x 8Gb FC port 200, 4 x 8Gb FC ports 800, iSER over 1 x 40GbE/IB port 1100, iSER over 2 x 40GbE/IB ports with acceleration 2300. Disk access latency at 4K IO: iSCSI/TCP at only 131K IOPs versus iSCSI/RDMA at 2300K IOPs; 5-10% of the latency under 20x the workload.]

Data Protection Offloaded. Used to provide data block integrity check capabilities (CRC) for block storage (SCSI), as proposed by the T10 committee; DIF extends this support to main memory.
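
For context, T10 protection information appends an 8-byte tuple to each 512-byte block, which the adapter can generate and verify in hardware. The struct below is a layout sketch of that tuple (field meanings follow the T10 DIF definition; it is not a Mellanox API type).

```c
/* Layout sketch of the 8-byte T10 DIF tuple appended to each block. */
#include <stdint.h>
#include <stdio.h>

#pragma pack(push, 1)
struct t10_dif_tuple {
    uint16_t guard_tag;   /* CRC-16 over the data block               */
    uint16_t app_tag;     /* application-defined tag                  */
    uint32_t ref_tag;     /* typically the lower 32 bits of the LBA   */
};
#pragma pack(pop)

int main(void)
{
    printf("T10 DIF tuple size: %zu bytes per block\n",
           sizeof(struct t10_dif_tuple));
    return 0;
}
```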

Power

Motivation for Power-Aware Design. Today's networks run at maximum capacity with almost constant power dissipation, and low-power silicon devices alone may not suffice to meet future requirements for low-energy networks. The electricity consumption of datacenters is a significant contributor to the total cost of operation: cooling costs scale as roughly 1.3X the total energy consumption, so lower power consumption lowers OPEX, and annual OPEX for 1KW is about $1,000. (* According to the Global e-Sustainability Initiative (GeSI), if green network technologies (GNTs) are not adopted)
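
A quick worked example with the figures above (~$1,000 of annual OPEX per kW, cooling scaling at ~1.3X); the 100 kW fabric power draw is an assumed example, not a measured system.

```c
/* Rough OPEX arithmetic using the slide's figures; 100 kW is assumed. */
#include <stdio.h>

int main(void)
{
    const double fabric_kw      = 100.0;   /* assumed interconnect power   */
    const double cooling_factor = 1.3;     /* cooling ~1.3X energy drawn   */
    const double opex_per_kw    = 1000.0;  /* ~$1000 per kW per year       */

    double total_kw    = fabric_kw * cooling_factor;
    double annual_opex = total_kw * opex_per_kw;
    double saving_20pc = 0.20 * annual_opex;  /* e.g. 20% link power save  */

    printf("annual OPEX: $%.0f; a 20%% power reduction saves ~$%.0f/year\n",
           annual_opex, saving_20pc);
    return 0;
}
```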

Increasing Energy Efficiency with Mellanox Switches. ASIC level: voltage/frequency scaling and dynamic array power-save. Link level: width/speed reduction and Energy Efficient Ethernet. FW / MLNX-OS / UFM: aggregate power-aware API. Switch level: port/module/system shutdown. System level: energy-aware cooling control. Low Energy COnsumption NETworks.

Prototype Results: Power Scaling at the Link Level. [Chart: power relative to the highest speed under Speed Reduction Power Save (SRPS) for FDR (56Gb IB), QDR (40Gb IB), and SDR (10Gb IB), and under Width Reduction Power Save (WRPS).]

Prototype Results: Power Scaling at the System Level. Internal port shutdown on a director switch saves roughly 1% of power per port, together with load-based power scaling. [Charts: total system power for an SX1036 ranging from about 62W to 93W as internal ports are closed (0-36 ports); the power decrease percentage due to internal port closing; and a fan-system analysis for an improved power algorithm, comparing WRPS enabled (auto mode) and disabled across 10-100% bandwidth per port.]

Monitoring and Diagnostics

Unified Fabric Manager. Features: automatic discovery, central device management, a fabric dashboard, congestion analysis, health and performance monitoring, advanced alerting, fabric health reports, and service-oriented provisioning.

ExaScale!

Mellanox InfiniBand Connected Petascale Systems: connecting half of the world's petascale systems. [Examples of Mellanox Connected petascale systems.]

The Only Provider of End-to-End 40/56Gb/s Solutions: a comprehensive end-to-end InfiniBand and Ethernet portfolio of ICs, adapter cards, switches/gateways, host/fabric software, metro/WAN systems, and cables/modules, reaching from the data center to metro and WAN, for x86, ARM, and POWER based compute and storage platforms. The interconnect provider for 10Gb/s and beyond.

Thank You