Interconnecting Future DOE Leadership Systems
Rich Graham, HPC Advisory Council, Stanford, 2015
HPC: The Challenges 2
Proud to Accelerate Future DOE Leadership Systems (CORAL): Summit System, Sierra System
- 5X-10X Higher Application Performance versus Current Systems
- Mellanox EDR 100Gb/s InfiniBand, IBM POWER CPUs, NVIDIA Tesla GPUs
- Mellanox EDR 100G Solutions Selected by the DOE for 2017 Leadership Systems
- Deliver Superior Performance and Scalability over Current/Future Competition 3
Exascale-Class Computer Platforms: Communication Challenges
Challenges:
- Very large functional unit count (~10M)
- Large on-node functional unit count (~500)
- Deeper memory hierarchies
- Smaller amounts of memory per functional unit
- Possible functional unit heterogeneity
- Component failures as part of normal operation
- Expensive data movement
- Power costs
Solution focus:
- Scalable communication capabilities: point-to-point and collectives
- Scalable network: adaptive routing
- Scalable HCA architecture
- Cache-aware network access
- Low-latency, high-bandwidth capabilities
- Support for data heterogeneity
- Resilient and redundant stack
- Optimized data movement
- Independent remote progress, independent hardware progress
- Power-aware hardware 4
Standardize your Interconnect 5
Technology Roadmap: Always One Generation Ahead
Mellanox interconnect generations: 20Gb/s, 40Gb/s, 56Gb/s, 100Gb/s, 200Gb/s (2000-2020), spanning Terascale, Petascale and Exascale systems
Milestones: 3rd on the TOP500 in 2003 (Virginia Tech, Apple); 1st with Roadrunner; Mellanox Connected Mega Supercomputers 6
The Future is Here: Enter the World of Scalable Performance at the Speed of 100Gb/s! 7
The Future is Here: Entering the Era of 100Gb/s
- Adapter: 100Gb/s, 0.7us latency, 150 million messages per second (10 / 25 / 40 / 50 / 56 / 100Gb/s)
- Switch: 36 EDR (100Gb/s) ports, <90ns latency, throughput of 7.2Tb/s
- Cables: copper (passive, active), optical cables (VCSEL), silicon photonics 8
Enter the World of Scalable Performance: 100Gb/s Switch
Switch-IB: Highest Performance Switch in the Market
- 7th generation InfiniBand switch
- 36 EDR (100Gb/s) ports, <90ns latency
- Throughput of 7.2Tb/s
- InfiniBand router
- Adaptive routing 9
Enter the World of Scalable Performance: 100Gb/s Adapter
ConnectX-4: Highest Performance Adapter in the Market
- InfiniBand: SDR / DDR / QDR / FDR / EDR
- Ethernet: 10 / 25 / 40 / 50 / 56 / 100GbE
- 100Gb/s, <0.7us latency, 150 million messages per second
- OpenPOWER CAPI technology
- CORE-Direct technology
- GPUDirect RDMA
- Dynamically Connected Transport (DCT)
- Ethernet offloads (HDS, RSS, TSS, LRO, LSOv2)
Connect. Accelerate. Outperform. 10
Connect-IB Provides Highest Interconnect Throughput (source: Prof. DK Panda)
Peak bandwidth per adapter (MBytes/sec), unidirectional / bidirectional, measured over message sizes from 4 bytes to 1MB (higher is better):
- ConnectX2-PCIe2-QDR: 3385 / 6521
- ConnectX3-PCIe3-FDR: 6343 / 11643
- Sandy-ConnectIB-DualFDR: 12485 / 21025
- Ivy-ConnectIB-DualFDR: 12810 / 24727
Gain Your Performance Leadership With Connect-IB Adapters 11
Standard!
- Standardized wire protocol: mix and match of vertical and horizontal components
- Ecosystem built together
- Open-source, extensible interfaces: extend and optimize per application needs 12
Extensible Open Source Framework (layered stack): applications and protocols/accelerators (TCP/IP sockets, RPC, packet processing, storage, MPI / SHMEM / UPC) built on Extended Verbs and vendor extensions, running over multiple hardware back-ends (HW A, HW B, HW C, HW D) 13
Scalability 14
Collective Operations
- Topology aware
- Hardware multicast
- Separate virtual fabric
- Offload
- Scalable algorithms
Charts: Reduce collective and Barrier collective latency (us) versus number of processes (PPN=8), with and without FCA 15
The DC Model: Dynamic Connectivity
- Each DC Initiator can be used to reach any remote DC Target
- No resource sharing between processes: each process controls how many resources it uses (and can adapt to load), controls the usage model (e.g. SQ allocation policy), and has no inter-process dependencies
- Resource footprint: a function of node and HCA capability, on the order of p*(cs+cr)/2 for p processes per node, where cs is the concurrency of the sender and cr the concurrency of the responder; independent of system size
- Fast communication setup time 16
Dynamically Connected Transport: Key Objects
- DC Initiator: initiates data transfer
- DC Target: handles incoming data (a creation sketch follows) 17
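To make the initiator/target split concrete, here is a minimal sketch of creating a DC Initiator with the mlx5dv extension API found in recent rdma-core releases (the 2015-era stack exposed DCT through experimental verbs instead). The helper name and capacity values are illustrative; creating the matching DC Target uses the same call with MLX5DV_DCTYPE_DCT and an SRQ.

```c
/* Sketch only: DC Initiator (DCI) creation via the mlx5dv API in recent
 * rdma-core.  Assumes the caller already opened the device and created
 * the PD and CQ; error handling is omitted for brevity. */
#include <infiniband/verbs.h>
#include <infiniband/mlx5dv.h>

struct ibv_qp *create_dci(struct ibv_context *ctx, struct ibv_pd *pd,
                          struct ibv_cq *cq)
{
    struct ibv_qp_init_attr_ex attr = {
        .qp_type   = IBV_QPT_DRIVER,        /* DC is a device-specific transport */
        .send_cq   = cq,
        .recv_cq   = cq,
        .pd        = pd,
        .comp_mask = IBV_QP_INIT_ATTR_PD,
        .cap       = { .max_send_wr = 128, .max_send_sge = 1 },
    };
    struct mlx5dv_qp_init_attr dv = {
        .comp_mask    = MLX5DV_QP_INIT_ATTR_MASK_DC,
        .dc_init_attr = { .dc_type = MLX5DV_DCTYPE_DCI },
    };
    /* A single DCI can reach any DC Target in the fabric: the target is
     * addressed per work request (DCT number + access key), so the
     * per-process resource footprint no longer grows with the peer count. */
    return mlx5dv_create_qp(ctx, &attr, &dv);
}
```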
Dynamically Connected Transport: Exascale Scalability
Chart: host memory consumption (MB, log scale) at 8, 2K, 10K and 100K nodes for successive transport generations: InfiniHost with RC (2002), InfiniHost-III with SRQ (2005), ConnectX with XRC (2008), Connect-IB with DCT (2012); with DCT the footprint stays essentially independent of system size 18
Reliable Connection Transport Mode 19
Dynamically Connected Transport Mode 20
InfiniBand Subnet (diagram): administrator, servers with HCAs (1..h), IPC, storage, routers, network management, filer, gateway, distance storage; more QoS, more servers, more topologies; Fat-Tree fabrics each managed by a Subnet Manager (SM) 21
Scalability Under Load 22
Adaptive Routing
Purpose:
- Improved network utilization: choose alternate routes on congestion
- Network resilience: alternate routes on failure
Supported hardware: SwitchX-2; Switch-IB adds adaptive routing notification 23
Mellanox Adaptive Routing Hardware Support
For every incoming packet, the adaptive routing process has two main stages: the route-change decision (to adapt or not to adapt) and new output port selection.
Mellanox hardware is NOT topology specific: the SDN concept separates the configuration plane from the data plane, and every feature is software controlled. Fat-Tree, Dragonfly and Dragonfly+ are fully supported, with new hardware features introduced to support Dragonfly and Dragonfly+. 24
Is the Packet Allowed to Adapt? AR Modes
- Static: traffic is always bound to a specific port
- Time-Bound: traffic is bound to the last port used if no more than Tb seconds have passed since that event
- Free: traffic may select a new output port freely
Packets are classified as Legacy, Restricted or Unrestricted; destinations are classified as Legacy, Restricted, Timely-Restricted or Unrestricted. A matrix maps each combination of packet and destination classification to an AR mode (see the sketch below). 25
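The matrix itself is configuration rather than fixed hardware behavior. The sketch below only illustrates the lookup step; the matrix contents are hypothetical placeholders, since the slide does not give the real table.

```c
/* Hypothetical illustration of the AR-mode lookup: packet class and
 * destination class index a configurable matrix that yields Static,
 * Time-Bound or Free.  The entries below are made up for the example. */
enum pkt_class { PKT_LEGACY, PKT_RESTRICTED, PKT_UNRESTRICTED };
enum dst_class { DST_LEGACY, DST_RESTRICTED, DST_TIMELY_RESTRICTED, DST_UNRESTRICTED };
enum ar_mode   { AR_STATIC, AR_TIME_BOUND, AR_FREE };

static const enum ar_mode ar_matrix[3][4] = {
    /* dst:                Legacy      Restricted     Timely-Restr.  Unrestricted */
    /* pkt Legacy       */ {AR_STATIC, AR_STATIC,     AR_STATIC,     AR_STATIC},
    /* pkt Restricted   */ {AR_STATIC, AR_TIME_BOUND, AR_TIME_BOUND, AR_TIME_BOUND},
    /* pkt Unrestricted */ {AR_STATIC, AR_TIME_BOUND, AR_TIME_BOUND, AR_FREE},
};

static inline enum ar_mode ar_mode_for(enum pkt_class p, enum dst_class d)
{
    return ar_matrix[p][d];   /* to adapt or not to adapt */
}
```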
Mellanox Adaptive Routing Notification (ARN)
Reaction time is critical to adaptive routing: traffic patterns change fast, a better AR decision requires some knowledge of network state, and internal switch-to-switch communication gives faster convergence after routing changes.
ARN provides fast notification to the decision point and is fully configurable (topology agnostic):
1) Sampling prefers the large flows
2) The ARN is forwarded when the congestion can't be fixed at the local switch
3) The ARN causes a new output port selection at the decision point
Faster routing modifications, resilient network 26
Network Offload 27
Scalability of Collective Operations: Ideal Algorithm vs. the Impact of System Noise (diagram: a four-process exchange where a delay on one process at any step holds up the others) 28
Scalability of Collective Operations - II: Offloaded Algorithm, Nonblocking Algorithm (communication processing handled off the host CPU) 29
CORE-Direct
- Scalable collective communication
- Asynchronous communication
- Communication managed by the communication resources
- Avoids system noise
Task list: each task specifies a target QP and an operation (Send, Wait for completions, Enable, Calculate) 30
Example: Four-Process Recursive Doubling. Step 1: each process exchanges with its neighbor at distance 1; Step 2: with the neighbor at distance 2 (as sketched in the code below) 31
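For reference, a host-side sketch of the same recursive-doubling exchange in plain MPI; with CORE-Direct, each send/receive/wait step would instead be posted to the HCA as a task list and progressed without the CPU. The sketch assumes a power-of-two number of ranks.

```c
/* Recursive-doubling barrier sketch: at step k each rank exchanges a
 * token with the rank at distance 2^k; after log2(P) steps every rank
 * has transitively heard from every other rank.  Assumes P is a power
 * of two; error handling omitted. */
#include <mpi.h>

static void recursive_doubling_barrier(MPI_Comm comm)
{
    int rank, size, send_token = 0, recv_token;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int dist = 1; dist < size; dist <<= 1) {
        int peer = rank ^ dist;               /* partner at distance 2^k */
        MPI_Sendrecv(&send_token, 1, MPI_INT, peer, 0,
                     &recv_token, 1, MPI_INT, peer, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```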
Four-Process Barrier Example Using Managed Queues (the diagram walks through the task list posted by rank 0) 32
Nonblocking Alltoall (Overlap-Wait) Benchmark: CORE-Direct offload allows the Alltoall benchmark to overlap communication with almost 100% compute 33
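From the application side, the overlap-wait pattern is a standard MPI-3 nonblocking collective; with CORE-Direct underneath, the collective progresses on the HCA while the compute phase runs. A minimal sketch, where compute(), the buffers and the element count are placeholders:

```c
/* Overlap sketch: start a nonblocking Alltoall, do useful work while the
 * network progresses the collective, then wait for completion. */
#include <mpi.h>

void alltoall_overlap(const double *sendbuf, double *recvbuf, int count,
                      MPI_Comm comm, void (*compute)(void))
{
    MPI_Request req;
    MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                  recvbuf, count, MPI_DOUBLE, comm, &req);

    compute();                            /* application work overlapped with the exchange */

    MPI_Wait(&req, MPI_STATUS_IGNORE);    /* collective completes here */
}
```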
Optimizing Non-Contiguous Memory Transfers
- Supports combining contiguous registered memory regions into a single memory region; the hardware treats them as one contiguous region (and handles the non-contiguous pieces)
- For a given memory region, supports non-contiguous access using a regular structure representation: base pointer, element length, stride, repeat count; these can be combined from multiple different memory keys
- Memory descriptors are created by posting WQEs that fill in the memory key
- Supports local and remote non-contiguous memory access
- Eliminates the need for some memory copies (see the illustrative descriptor sketch below) 34
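The regular-structure description amounts to four parameters. The struct below is purely illustrative (not a verbs API type) and shows how a strided region, for example one column of a row-major matrix, would be described:

```c
/* Illustrative only: the four parameters of the regular (strided) memory
 * description named above.  Real descriptors are built by posting WQEs
 * that fill in a memory key; this struct is not part of the verbs API. */
#include <stddef.h>
#include <stdint.h>

struct strided_region {
    void    *base;        /* base pointer                          */
    size_t   elem_len;    /* element length in bytes               */
    size_t   stride;      /* distance between consecutive elements */
    uint32_t repeat;      /* repeat count (number of elements)     */
};

/* Example: column `col` of a row-major double matrix with `rows` rows and
 * `cols` columns becomes a single non-contiguous region. */
static struct strided_region matrix_column(double *m, size_t rows,
                                           size_t cols, size_t col)
{
    struct strided_region r = {
        .base     = &m[col],
        .elem_len = sizeof(double),
        .stride   = cols * sizeof(double),
        .repeat   = (uint32_t)rows,
    };
    return r;
}
```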
Optimizing Non Contiguous Memory Transfers 35
On Demand Paging (ODP)
No memory pinning, no memory registration caches!
Advantages: greatly simplified programming, unlimited MR sizes, physical memory optimized to hold the current working set.
The ODP promise: the IO virtual address mapping equals the process virtual address mapping (diagram: address-space pages 0x1000-0x6000 mapped to PFN1-PFN6 and to matching IO virtual addresses). 36
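In verbs terms, ODP is requested with an extra access flag at registration time. A minimal sketch using current rdma-core (IBV_ACCESS_ON_DEMAND is the upstream flag; the buffer and length are placeholders, and implicit whole-address-space ODP is a later refinement not shown here):

```c
/* Sketch: registering a buffer for On Demand Paging.  The pages are not
 * pinned up front; the HCA faults them in on first access, so the IO
 * mapping tracks the process's own virtual address space. */
#include <stddef.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_odp(struct ibv_pd *pd, void *buf, size_t len)
{
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_ON_DEMAND);   /* no pinning, no registration cache needed */
}
```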
Connecting Compute Elements: x86 -> +GPU -> +ARM/POWER 37
GPUDirect RDMA
- GPUDirect 1.0: GPU-to-GPU transfers over InfiniBand are staged through system memory on both the transmit and receive sides, with the CPU and chipset in the data path
- GPUDirect RDMA: the InfiniBand adapter reads from and writes to GPU memory directly, bypassing system memory 38
Mellanox PeerDirect with NVIDIA GPUDirect RDMA
- HOOMD-blue is a general-purpose molecular dynamics simulation code accelerated on GPUs
- GPUDirect RDMA allows direct peer-to-peer GPU communication over InfiniBand, unlocking performance between the GPU and InfiniBand
- Provides a significant decrease in GPU-GPU communication latency and complete CPU offload for all GPU communication across the network
- Demonstrated up to 102% performance improvement with a large number of particles 39
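From the application's point of view, PeerDirect/GPUDirect RDMA is transparent when using a CUDA-aware MPI: the GPU buffer is passed straight to MPI and the HCA moves the data without staging it through host memory. A minimal sketch, assuming a CUDA-aware MPI build (the buffer size, peer rank and tag are arbitrary placeholders):

```c
/* Sketch: sending a GPU-resident buffer directly with a CUDA-aware MPI.
 * With GPUDirect RDMA the HCA reads the device memory itself; without it
 * the MPI library would stage the data through host memory. */
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_gpu_buffer(MPI_Comm comm, int peer, size_t n)
{
    double *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(double));   /* device memory */

    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank < peer)
        MPI_Send(d_buf, (int)n, MPI_DOUBLE, peer, 0, comm);
    else
        MPI_Recv(d_buf, (int)n, MPI_DOUBLE, peer, 0, comm, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
}
```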
Storage 40
InfiniBand RDMA Storage: Get the Best Performance!
Transport protocol implemented in hardware; zero copy using RDMA.
KIOPs at 4K IO size:
- iSCSI (TCP/IP): 130
- 1 x FC 8Gb port: 200
- 4 x FC 8Gb ports: 800
- iSER, 1 x 40GbE/IB port: 1100
- iSER, 2 x 40GbE/IB ports (+acceleration): 2300
Disk access latency at 4K IO: iSCSI/TCP reaches ~9000 us at only 131K IOPs, while iSCSI/RDMA (iSER) shows 5-10% of that latency under 20x the workload (2300K IOPs) 41
Data Protection Offloaded
Provides data block integrity check capabilities (CRC) for block storage (SCSI), as proposed by the T10 committee; DIF support is extended to main memory 42
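For reference, the protection information defined by T10 DIF is an 8-byte tuple appended to each 512-byte block, which the adapter can generate, strip and verify on the fly when the offload is enabled. The struct below is an illustrative layout; the field names are mine:

```c
/* Illustrative layout of the 8-byte T10 DIF tuple carried with each
 * 512-byte data block.  Fields are big-endian on the wire. */
#include <stdint.h>

struct t10_dif_tuple {
    uint16_t guard_tag;   /* CRC-16 over the 512 data bytes                    */
    uint16_t app_tag;     /* application-defined tag                           */
    uint32_t ref_tag;     /* reference tag, typically low 32 bits of the LBA   */
};
```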
Power 43
Motivation For Power-Aware Design
- Today's networks work at maximum capacity with almost constant power dissipation
- Low-power silicon devices alone may not suffice to meet future requirements for low-energy networks
- The electricity consumption of datacenters is a significant contributor to the total cost of operation: cooling costs scale as roughly 1.3x the total energy consumption, and lower power consumption lowers OPEX (annual OPEX for 1 kW is ~$1000, as worked out below)
* According to the Global e-Sustainability Initiative (GeSI), if green network technologies (GNTs) are not adopted 44
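A rough sanity check of that ~$1000/kW figure, assuming an electricity price of about $0.10/kWh (the price is an assumption, not stated on the slide):

```latex
1\,\mathrm{kW} \times 8760\,\mathrm{h/yr} = 8760\,\mathrm{kWh/yr}, \qquad
8760\,\mathrm{kWh} \times \$0.10/\mathrm{kWh} \approx \$876/\mathrm{yr}
```

Folding in the ~1.3x cooling factor quoted above pushes this to roughly $1100 per year, i.e. on the order of the ~$1000 rule of thumb.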
Increasing Energy Efficiency with Mellanox Switches
- ASIC level: voltage/frequency scaling, dynamic array power-save
- Link level: width/speed reduction, Energy Efficient Ethernet
- FW / MLNX-OS / UFM: aggregate power-aware API
- Switch level: port/module/system shutdown
- System level: energy-aware cooling control
Low Energy COnsumption NETworks 45
Prototype Results: Power Scaling at the Link Level
- Speed Reduction Power Save (SRPS): power relative to the highest speed drops as the link steps down from FDR (56Gb IB) to QDR (40Gb IB) to SDR (10Gb IB)
- Width Reduction Power Save (WRPS) 46
Prototype Results: Power Scaling at the System Level
- Internal port shutdown (director switch): ~1% power saving per closed port
- Load-based power scaling: total system power for an SX1036 measured against per-port bandwidth, with WRPS enabled (auto mode) vs. disabled
- Fan system analysis for an improved power algorithm 47
Monitoring and Diagnostics 48
Unified Fabric Manager: Automatic Discovery, Central Device Mgmt, Fabric Dashboard, Congestion Analysis, Health & Perf Monitoring, Advanced Alerting, Fabric Health Reports, Service Oriented Provisioning 49
ExaScale! 50
Mellanox InfiniBand Connected Petascale Systems: Connecting Half of the World's Petascale Systems (Mellanox Connected Petascale System examples) 51
The Only Provider of End-to-End 40/56Gb/s Solutions: Comprehensive End-to-End InfiniBand and Ethernet Portfolio (ICs, Adapter Cards, Switches/Gateways, Host/Fabric Software, Metro/WAN, Cables/Modules); From Data Center to Metro and WAN; X86, ARM and Power based Compute and Storage Platforms; The Interconnect Provider For 10Gb/s and Beyond 52
Thank You