Introduction to HPC. Lecture 17
1 Introduction to HPC, Lecture 17. Dept. of Computer Science. Clusters.
2 Clusters. Recall: Bus-Connected SMPs (UMAs).
[Figure: four processors, each with its own cache, sharing a single bus to memory and I/O.]
Caches are used to reduce latency and to lower bus traffic. Hardware must be provided for cache coherence and process synchronization. Bus traffic and bandwidth limit scalability (to roughly 36 processors).
3 Network-Connected Multiprocessors.
[Figure: processors with caches and memories connected through an interconnection network (IN).]
Either a single address space (NUMA and cc-NUMA) with implicit interprocessor communication via loads and stores, or multiple private memories with message-passing communication via sends and receives. The interconnection network supports interprocessor communication. (Adapted.)
Networks. Facets people talk a lot about: direct (point-to-point) vs. indirect (multi-hop); topology (e.g., bus, ring, DAG); routing algorithms; switching (aka multiplexing); wiring (e.g., choice of media: copper, coax, fiber). What really matters: latency, bandwidth, cost, reliability.
4 Interconnections (Networks). Examples:
MPP and Clusters: 100s to 10s of 1000s of nodes; 100 meters per link.
Local Area Networks: 100s to 1000s of nodes; a few 1000 meters.
Wide Area Networks: 1000s of nodes; 5,000,000 meters.
MPP = Massively Parallel Processor.
Three cultures for three classes of networks: MPP and Clusters care about latency and bandwidth; LANs about workstations and cost; WANs about telecommunications and revenue.
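To connect these link lengths to the time-of-flight term introduced on the next slide, here is a minimal sketch in C. It assumes signal propagation at roughly 2e8 m/s (about two-thirds of the speed of light, typical of copper or fiber); that propagation speed is an assumption for illustration, not a figure from the lecture.

```c
/* Rough time-of-flight estimate for the three network classes above.
 * Assumes propagation at ~2e8 m/s (about 2/3 c in copper or fiber);
 * real networks add switching and routing delays on top of this. */
#include <stdio.h>

int main(void) {
    const double prop_speed = 2.0e8;                 /* m/s, assumed */
    const char  *cls[]  = { "MPP/Cluster link", "LAN", "WAN" };
    const double dist[] = { 100.0, 1000.0, 5.0e6 };  /* meters, roughly per the slide */

    for (int i = 0; i < 3; i++) {
        double tof_us = dist[i] / prop_speed * 1e6;  /* microseconds */
        printf("%-16s %10.0f m  ->  time of flight %10.2f us\n",
               cls[i], dist[i], tof_us);
    }
    return 0;
}
```

At these speeds a 100 m cluster link contributes about half a microsecond of time of flight, while a 5,000 km WAN path contributes tens of milliseconds, which is why the three classes of networks optimize for different things.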
5 Network Performance Measures: Universal Performance Metrics.
[Figure: timeline of a message from sender to receiver, showing sender overhead (processor busy), transmission time (size / bandwidth), time of flight, transport delay, receiver overhead (processor busy), and total delay.]
Total Delay = Sender Overhead + Time of Flight + Message Size / BW + Receiver Overhead
The header/trailer are included in the BW calculation.
6 Simplified Latency Model.
Total Delay = Latency + Message Size / BW
Latency = Sender Overhead + Time of Flight + Receiver Overhead
Example: show what happens as we vary latency (1, 25, 500 µsec), BW (10, 100, 1000 Mbit/sec; factors of 10), and message size (16 bytes to 4 MB; factors of 4). If the overhead is 500 µsec, how big a message is needed to exceed 10 Mb/s?
[Plot: effective bandwidth (Mbit/sec) vs. message size (bytes) for each overhead/bandwidth combination: o1/o25/o500 µs crossed with bw10/bw100/bw1000 Mbit/s.]
Example Performance Measures:
  Interconnect          MPP           LAN          WAN
  Example               CM-5          Ethernet     ATM
  Bisection BW          N x 5 MB/s    MB/s         N x 10 MB/s
  Int./Link BW          20 MB/s       MB/s         10 MB/s
  Transport Latency     5 µsec        15 µsec      50 to 10,000 µs
  HW Overhead to/from   0.5/0.5 µs    6/6 µs       6/6 µs
  SW Overhead to/from   1.6/12.4 µs   200/241 µs   207/360 µs (TCP/IP on LAN/WAN)
Software overhead dominates in LAN and WAN.
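A minimal sketch of the simplified model in C, sweeping the latencies, bandwidths, and message sizes listed in the example above; the function and variable names are illustrative. Effective bandwidth is message size divided by total delay, so small messages are latency-bound and large messages approach the link bandwidth.

```c
/* Effective bandwidth under the simplified latency model:
 *   total_delay  = latency + message_size / BW
 *   effective_BW = message_size / total_delay
 * Latency, bandwidth, and message-size values follow the example above. */
#include <stdio.h>

static double effective_bw_mbit(double latency_us, double bw_mbit, double msg_bytes) {
    double xfer_us  = msg_bytes * 8.0 / bw_mbit;   /* 1 Mbit/s == 1 bit/us */
    double total_us = latency_us + xfer_us;
    return msg_bytes * 8.0 / total_us;             /* Mbit/s */
}

int main(void) {
    const double latency_us[] = { 1.0, 25.0, 500.0 };
    const double bw_mbit[]    = { 10.0, 100.0, 1000.0 };

    for (double size = 16.0; size <= 4.0 * 1024 * 1024; size *= 4.0) {
        printf("%8.0f B:", size);
        for (int l = 0; l < 3; l++)
            for (int b = 0; b < 3; b++)
                printf(" o%g/bw%g=%.1f", latency_us[l], bw_mbit[b],
                       effective_bw_mbit(latency_us[l], bw_mbit[b], size));
        printf("\n");
    }
    return 0;
}
```

Running the sweep answers the question on the slide: with 500 µs of overhead, messages need to reach roughly a kilobyte before the effective rate exceeds 10 Mb/s on the faster links, and on a 10 Mb/s link it never quite does.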
7 HW Interface Issues. (Source: Mike Levine, PSC, DEISA Symposium, May 2005.)
Where to connect the network to the computer?
Cache consistent to avoid flushes? (=> memory bus)
Latency and bandwidth? (=> memory bus)
Standard interface card? (=> I/O bus)
MPP => memory bus; Clusters, LAN, WAN => I/O bus.
[Figure: CPU with caches; the network I/O controller attaches either directly to the memory bus or through a bus adaptor on the I/O bus. Ideal: high bandwidth, low latency, standard interface.]
8 Interconnect, First Level: Chip/Board.
AMD: HyperTransport on chip, HyperTransport on board.
Intel: QuickPath Interconnect (QPI) on board; ring (MIC, Sandy Bridge); mesh (Polaris, SCC).
SCC: /SCC_Sympossium_Feb212010_FINAL-A.pdf
MIC: reference/isc_2010_skaugen_keynote.pdf
9 Internode Connection: Where to Connect?
Internal bus/network (MPP): CM-1, CM-2, CM-5; IBM Blue Gene, Power7; Cray; SGI Ultra Violet.
I/O bus (Clusters): typically the PCI bus.
Interconnect examples for MPP (proprietary interconnection technology):
10 IBM Blue Gene/P.
3.4 GF/s (DP) per core, 13.6 GF/s per node. Memory BW/F = 1 B/F. Comm BW/F = (6 x 3.4 x 2 / 8) GB/s / 13.6 GF/s = 0.375 B/F.
3-Dimensional Torus: interconnects all compute nodes; communications backbone for computations; adaptive cut-through hardware routing; 3.4 Gb/s on all 12 node links (5.1 GB/s per node); 0.5 µs latency between nearest neighbors, 5 µs to the farthest (note: approximately 0.5 µs x number of hops); MPI: 3 µs latency for one hop, 10 µs to the farthest; .7/2.6 TB/s bisection bandwidth, 188 TB/s total bandwidth (72k-node machine).
Collective Network: interconnects all compute and I/O nodes (1152); one-to-all broadcast functionality; reduction operations functionality; 6.8 Gb/s of bandwidth per link; latency of a one-way tree traversal 2 µs, MPI 5 µs; ~62 TB/s total binary-tree bandwidth (72k machine).
Low-Latency Global Barrier and Interrupt: latency of one way to reach all 72K nodes 0.65 µs, MPI 1.6 µs.
Other networks: 10 Gb functional Ethernet (I/O nodes only); 1 Gb private control Ethernet (provides JTAG access to hardware; accessible only from the Service Node system).
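The per-hop figures above suggest a simple distance model for the torus. Below is a small C sketch, not IBM's routing code, that counts minimum hops between two nodes of a 3-D torus with wraparound links and scales by an assumed per-hop cost; the 8x8x8 dimensions and the 0.5 µs per hop are illustrative assumptions consistent with the slide.

```c
/* Minimum hop count between two nodes of a 3-D torus (with wraparound links),
 * plus a crude latency estimate at an assumed cost per hop.
 * Torus dimensions and per-hop cost are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>

static int hops_1d(int a, int b, int dim) {
    int d = abs(a - b);
    return d < dim - d ? d : dim - d;       /* take the short way around the ring */
}

static int hops_3d(const int src[3], const int dst[3], const int dims[3]) {
    int hops = 0;
    for (int i = 0; i < 3; i++)
        hops += hops_1d(src[i], dst[i], dims[i]);
    return hops;
}

int main(void) {
    const int dims[3] = { 8, 8, 8 };        /* one 512-node midplane, assumed */
    const double us_per_hop = 0.5;          /* per-hop cost, per the slide */
    const int src[3] = { 0, 0, 0 };
    const int far[3] = { 4, 4, 4 };         /* farthest node in an 8x8x8 torus */

    int h = hops_3d(src, far, dims);
    printf("farthest node: %d hops, ~%.1f us hardware latency\n", h, h * us_per_hop);
    return 0;
}
```

For an 8x8x8 partition this gives 12 hops, about 6 µs, roughly in line with the few-microsecond farthest-neighbor figure quoted above.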
11 IBM BG/P Ping-Pong.
MPI Ping-Pong latency: avg 4.7 µs. MPI Ping-Pong bandwidth: 0.38 GB/s. (147,456 cores.)
Measured: ~650 MF/s out of 13.6 GF/s (~5% of peak).
Estimate based on memory BW: (13.6/24 x 2 x 0.85) = 0.963 GF/s.
Estimate based on measured BW: (9.4/24 x 2 x 0.85) = 0.665 GF/s (BW measured by STREAM and reported as part of HPCC).
Blue Gene/Q
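The ping-pong latency and bandwidth figures quoted here (and for the other machines on later slides) come from this style of micro-benchmark. Below is a minimal MPI ping-pong sketch, not the exact code behind these measurements: rank 0 sends a message to rank 1 and waits for it to be echoed back; half the round-trip time gives the one-way latency (small messages), and the message size divided by that time gives the bandwidth (large messages).

```c
/* Minimal MPI ping-pong sketch (run with at least 2 ranks).
 * Half the round-trip time approximates one-way latency; message_size
 * divided by that time approximates bandwidth. Illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    const int iters = 1000;
    const int size = (argc > 1) ? atoi(argv[1]) : 8;   /* message size in bytes */
    char *buf = malloc((size_t)size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double one_way_us = (MPI_Wtime() - t0) / iters / 2.0 * 1e6;

    if (rank == 0)
        printf("%d bytes: %.2f us one-way, %.3f GB/s\n",
               size, one_way_us, size / one_way_us / 1e3);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Reported latencies usually use very small messages and reported bandwidths use large ones; averages such as the 4.7 µs above presumably reflect measurements over many node pairs across the machine.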
12 BG/Q Networks: 5-D Torus.
5-D torus in compute nodes; 2 GB/s bidirectional bandwidth on all (10+1) links; 5-D nearest-neighbor exchange measured at ~1.75 GB/s per link. Both the collective and barrier networks are embedded in this 5-D torus network. Virtual Cut-Through (VCT) routing. Floating-point addition support in the collective network.
Compute-rack-to-compute-rack bisection BW (46x BG/L, 19x BG/P); a small sketch of this arithmetic follows after this slide:
20.1 PF: bisection is 2 x 16 x 16 x 12 x 2 (bidirectional) x 2 (torus, not mesh) x 2 GB/s link bandwidth = 49.2 TB/s.
26.8 PF: bisection is 2 x 16 x 16 x 16 x 4 x 2 GB/s = 65.5 TB/s.
BG/L at LLNL is 0.7 TB/s.
I/O network to/from a compute rack: 2 links (4 GB/s in, 4 GB/s out) feed an I/O PCI-e port (4 GB/s in, 4 GB/s out). Every Q32 node card has up to 8 I/O links or 4 ports. Every rack has up to 32 x 8 = 256 links or 128 ports.
I/O rack: 8 I/O nodes per drawer; each node has 2 links from the compute rack and 1 PCI-e port to the outside world; 12 drawers/rack gives 96 I/O nodes, or 96 x 4 GB/s (PCI-e) = 384 GB/s = 3 Tb/s.
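The bisection figures above are just the product of the listed factors; a small sketch of that arithmetic (factor lists taken from the slide, results reported in decimal TB/s):

```c
/* BG/Q compute-rack bisection bandwidth as the product of the factors on
 * the slide: links crossing the bisection x bidirectional x torus wrap
 * x 2 GB/s per link. Reported in decimal TB/s (1 TB/s = 1000 GB/s). */
#include <stdio.h>

static double product(const double *f, int n) {
    double p = 1.0;
    for (int i = 0; i < n; i++)
        p *= f[i];
    return p;
}

int main(void) {
    /* 20.1 PF configuration: 2 x 16 x 16 x 12 x 2 (bidi) x 2 (torus) x 2 GB/s */
    const double cfg_20pf[] = { 2, 16, 16, 12, 2, 2, 2 };
    /* 26.8 PF configuration: 2 x 16 x 16 x 16 x 4 x 2 GB/s */
    const double cfg_27pf[] = { 2, 16, 16, 16, 4, 2 };

    printf("20.1 PF bisection: %.1f TB/s\n", product(cfg_20pf, 7) / 1000.0);
    printf("26.8 PF bisection: %.1f TB/s\n", product(cfg_27pf, 6) / 1000.0);
    return 0;
}
```

This reproduces the roughly 49 TB/s and 65 TB/s figures quoted above.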
13 BG/Q Network performance: all-to-all 97% of peak; bisection > 93% of peak; nearest-neighbor 98% of peak; collective FP reductions at 94.6% of peak. No performance problems identified in the network logic.
Cray XE6 Gemini Network: Day_2_Session_2_all.pdf
14 Cray XE6 Gemini Network.
MPI Ping-Pong latency: 6-9 µs, avg 7.5 µs (note: SeaStar, not Gemini).
MPI Ping-Pong bandwidth: 1.6 GB/s (note: SeaStar, not Gemini). (224,256 cores.)
Cray XE6 Gemini Network: Presentations/Courses_Ws_2011/Multi-Threaded_Course_Feb11/Day_2_Session_2_all.pdf
15 SGI Ultra Violet (UV).
1 rack: 16 nodes, 32 sockets; max 3 hops. External NUMAlink 5 routers: 16 ports.
UV Hub: two QPI interfaces (2 x 25 GB/s); four NUMAlink 5 links (4 x 10 GB/s).
SGI UV: 8 racks, 128 nodes, 256 sockets, fat-tree (1/4 shown in the figure); 16 TB shared memory (4 racks with 16 GB DIMMs).
MPI Ping-Pong latency: avg 1.6 µs. MPI Ping-Pong bandwidth: avg 3 GB/s. (64 cores.)
512 racks, 8192 nodes, 16,384 sockets: 8 x 8 torus of 128-node fat-trees. Each torus link consists of 2 NUMAlink 5 bidirectional links. Maximum estimated latency for a 1024-rack system: < 2 µs.
16 IBM Power7 Hub.
61 mm x 96 mm glass-ceramic LGA module; 56 12X optical modules; LGA attach onto substrate; TB/s-scale interconnect bandwidth.
45 nm lithography, Cu, SOI, 13 levels of metal; 440M transistors; 582 mm2 die (26.7 mm x 21.8 mm); 3707 signal I/O; 11,328 total I/O.
17 IBM Power7 Integrated Switch Router (ISR).
Two-tier, full-graph network. 3.0 GHz internal 56 x 56 crossbar switch: 8 HFI, 7 LL, 24 LR, 16 D, and SRV ports. Virtual channels for deadlock prevention. Input/output buffering. 2 KB maximum packet size, 128 B FLIT size.
Link reliability: CRC-based link-level retry; lane steering for failed links.
IP multicast support: multicast route tables per ISR for replicating and forwarding multicast packets.
Global counter support: the ISR compensates for link latencies as counter information is propagated; HW synchronization with Network Management setup and maintenance.
Routing characteristics: 3-hop L-D-L longest direct route; 5-hop L-D-L-D-L longest indirect route; cut-through wormhole routing; full hardware routing using distributed route tables across the ISRs; source route tables for packets injected by the HFI; port route tables for packets at each hop in the network; separate tables for inter-supernode and intra-supernode routes. FLITs of a packet arrive in order; packets of a message can arrive out of order.
Routing modes: hardware single direct routing; hardware multiple direct routing (for a less-than-full-up system where more than one direct path exists); hardware indirect routing for data striping and failover (round-robin, random); software-controlled indirect routing through hardware route tables.
I/O bus technologies (clusters):
18 Peripheral Component Interconnect (PCI).
PCI: V1.0 (1992): 32-bit, 33 MHz.
PCI-X: V1.0 (1998): 64-bit, 66 MHz, 100 MHz, 133 MHz. V2.0 (2003): 64-bit wide, 266 MHz, 533 MHz.
PCI Express (PCIe): V1.0 (2003): 256 MiB/s per lane (16 lanes = 4 GiB/s). V2.0 (2007): 512 MiB/s per lane (16 lanes = 8 GiB/s). V3.0 (2010): 1024 MiB/s per lane (16 lanes = 16 GiB/s). PCIe is defined for 1, 2, 4, 8, 16, and 32 lanes.
[Figure: PCIe x4, x16, x1, and x16 slots alongside a 32-bit PCI slot.]
Cluster Interconnect Technologies.
Ethernet: 1 GigE (1995), 10 GigE (2001), 40 GigE (2010), 100 GigE (2010).
InfiniBand:
2001: Single Data Rate (SDR), 2.5 Gbps/lane, x4 (2003) 10 Gbps; 8b/10b encoding (net data rate 2 Gbps, x4 8 Gbps).
2005: Double Data Rate (DDR), 5 Gbps/lane, x4 20 Gbps; 8b/10b encoding (net data rate 4 Gbps, x4 16 Gbps).
2007: Quad Data Rate (QDR), 10 Gbps/lane, x4 40 Gbps; 8b/10b encoding (net data rate 8 Gbps, x4 32 Gbps).
2011: Fourteen Data Rate (FDR), 14.06 Gbps/lane, x4 56.25 Gbps; 64b/66b encoding (net data rate 13.64 Gbps, x4 54.5 Gbps).
2013: Enhanced Data Rate (EDR), 25.78 Gbps/lane, x4 103.1 Gbps; 64b/66b encoding (net data rate 25 Gbps, x4 100 Gbps).
Switch latency: SDR 200 ns, DDR 140 ns, QDR 100 ns. The current Mellanox switch chip has 1.4 billion transistors, an aggregate throughput of 4 Tb/s across 36 ports, and a port-to-port latency of 165 ns.
Myrinet: 0.64 Gbps (1994), 1.28 Gbps (1996), 2 Gbps (2000), 10 Gbps (2006).
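The net data rates above follow from the encoding overhead: net rate = signaling rate per lane x encoding efficiency, then x4 for the common 4x link width. A small sketch of that arithmetic follows; the FDR and EDR lane rates used here (14.0625 and 25.78125 Gbps) are the commonly quoted signaling rates and are assumptions rather than values taken from the slide.

```c
/* InfiniBand net data rates from signaling rate and encoding efficiency.
 * SDR/DDR/QDR use 8b/10b encoding; FDR/EDR use 64b/66b.
 * FDR/EDR lane rates (14.0625, 25.78125 Gbps) are assumed values. */
#include <stdio.h>

struct ib_gen { const char *name; double lane_gbps; double encoding; };

int main(void) {
    const struct ib_gen gens[] = {
        { "SDR",  2.5,      8.0 / 10.0 },
        { "DDR",  5.0,      8.0 / 10.0 },
        { "QDR", 10.0,      8.0 / 10.0 },
        { "FDR", 14.0625,  64.0 / 66.0 },
        { "EDR", 25.78125, 64.0 / 66.0 },
    };
    for (int i = 0; i < 5; i++) {
        double net = gens[i].lane_gbps * gens[i].encoding;
        printf("%s: %8.3f Gbps/lane raw, %7.3f Gbps/lane net, x4 = %6.1f Gbps net\n",
               gens[i].name, gens[i].lane_gbps, net, 4.0 * net);
    }
    return 0;
}
```

The 8b/10b generations lose 20% of the raw rate to encoding (hence 10 Gbps raw but 8 Gbps net for a 4x SDR link), while 64b/66b keeps the overhead to about 3%.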
19 InfiniBand Roadmap.
SDR - Single Data Rate; DDR - Double Data Rate; QDR - Quad Data Rate; FDR - Fourteen Data Rate; EDR - Enhanced Data Rate; HDR - High Data Rate; NDR - Next Data Rate.
Typical InfiniBand network: HCA = Host Channel Adapter; TCA = Target Channel Adapter.
20 Interconnect Technology Properties.
Adapters compared: Mellanox ConnectX (IB 40 Gb/s, PCIe x8), QLogic InfiniPath (IB 20 Gb/s, PCIe x8), Myrinet 10G (PCIe x8), Quadrics QSNetII, and Chelsio T210-CX (PCIe x8), spanning InfiniBand, proprietary, GigE, and 10GigE technologies. Metrics: application latency (µs) and peak unidirectional bandwidth (MB/s) for PCIe Gen1 and PCIe Gen2 (N/A where not applicable).
Mellanox ConnectX InfiniBand, IPoIB bandwidth:
  IB 10 Gb/s, PCIe Gen1:  939 MB/s
  IB 20 Gb/s:            1410 MB/s
  IB 20 Gb/s, PCIe Gen2: 1880 MB/s
  IB 40 Gb/s, PCIe Gen2: 2950 MB/s
MPI Ping-Pong measurements (latency in µs, bandwidth in GB/s):
  Mellanox QDR:   latency avg 3.6, bandwidth avg 1.8 (4320 cores)
  InfiniPath QDR: latency avg 1.6, bandwidth avg 2.5 (192 cores)
  BG/P:           latency avg 4.7, bandwidth 0.38 (147,456 cores)
  Cray SeaStar:   latency 6-9, avg 7.5, bandwidth 1.6 (224,256 cores)
  SGI UV:         latency avg 1.6, bandwidth avg 3 (64 cores)
21 Cray XC30 Interconnection Network.
[Figure: chassis and group structure of the XC30 network.]
22 Cray XC30 Interconnection Network: Aries chip.
40 nm technology; 16.6 x 18.9 mm; 217M gates; 184 lanes of SerDes: 30 optical lanes, 90 electrical lanes, 64 PCIe 3.0 lanes.
Cray XC30 Network Overview.
23 Interconnection Networks References
CSE 431, Computer Architecture, Fall 2005, Lecture 27: Network Connected Multiprocessors, Mary Jane Irwin.
Lecture 21: Networks & Interconnect Introduction, Dave A. Patterson, Jan Rabaey, CS 252, Spring 2000.
Technology Trends in High Performance Computing, Mike Levine, DEISA Symposium, May 9-10, 2005.
Single-chip Cloud Computer: An Experimental Many-core Processor from Intel Labs, Jim Held.
Petascale to Exascale - Extending Intel's HPC Commitment, Kirk Skaugen.
Blue Gene: A Next Generation Supercomputer (BlueGene/P), Alan Gara.
Blue Gene/P Architecture: Application Performance and Data Analytics, Vitali Morozov.
HPC Challenge.
Multi-Threaded Course, February 15-17, 2011, Day 2: Introduction to Cray MPP Systems with Multi-core Processors; Multi-threaded Programming, Tuning and Optimization on Multi-core MPP Platforms, 011/Multi-Threaded_Course_Feb11/Day_2_Session_2_all.pdf
Gemini Description, MPI, Jason Beech-Brandt.
Technical Advances in the SGI UV Architecture.
SGI UV: Solving the World's Most Data Intensive Problems.
24 References (cont'd)
The IBM POWER7 HUB Module: A Terabyte Interconnect Switch for High-Performance Computer Systems, Baba Arimilli, Steve Baumgartner, Scott Clark, Dan Dreps, Dave Siljenberg, Andrew Mak, Hot Chips 22, August 2010.
Intel Core i7 I/O Hub and I/O Controller Hub.
PCI Express, InfiniBand and 10-Gigabit Ethernet for Dummies, DK Panda, Pavan Balaji, Matthew Koop, a tutorial at Supercomputing 09.
InfiniBand Roadmap.
InfiniBand Performance.
A Complexity Theory for VLSI, Clark David Thompson, Doctoral Thesis, ACM.
Microprocessors, Exploring Chip Layers.
Cray T3E.
Complexity Issues in VLSI, Frank Thomson Leighton, MIT Press, 1983.
The Tree Machine: An Evaluation of Strategies for Reducing Program Loading Time, Pey-yun Peggy Li and Lennart Johnsson.
Dado: A Tree-Structured Architecture for Artificial Intelligence Computation, S. J. Stolfo and D. P. Miranker, Annual Review of Computer Science, Vol. 1, pp. 1-18, June 1986, DOI: /annurev.cs.
Architecture and Applications of DADO: A Large-Scale Parallel Computer for Artificial Intelligence, Salvatore J. Stolfo, Daniel Miranker, David Elliot Shaw.
Introduction to Algorithms, Charles E. Leiserson, September 15, 2004.
References (cont'd)
UC Berkeley, CS 252, Spring 2000, Dave Patterson.
Interconnection Networks, Computer Architecture: A Quantitative Approach, 4th Edition, Appendix E, Timothy Mark Pinkston (USC) and Jose Duato (Universidad Politecnica de Valencia).
Access and Alignment of Data in an Array Processor, D. H. Lawrie, IEEE Trans. Computers, C-24, no. 12, December 1975, ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=
SP2 System Architecture, T. Agerwala, J. L. Martin, J. H. Mirza, D. C. Sadler, D. M. Dias, and M. Snir, IBM J. Res. Dev., v. 34, no. 2, 1995.
Inside the TC2000, BBN Advanced Computer Inc., preliminary version, 1989.
A Study of Non-Blocking Switching Networks, Charles Clos, Bell System Technical Journal, vol. 32, 1953.
On Rearrangeable Three-Stage Connecting Networks, V. E. "Vic" Benes, BSTJ, vol. XLI, no. 5, Sep. 1962.
GF11: M. Kumar, IBM J. Res. Dev., v. 36, no. 6.
Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing, Charles E. Leiserson, IEEE Trans. Computers 34(10), 1985.
Cray XC30 Series Network, B. Alverson, E. Froese, L. Kaplan, D. Roweth, Cray High Speed Networking, Hot Interconnects, August 2012.
Cost-Efficient Dragonfly Topology for Large-Scale Systems, J. Kim, W. Dally, S. Scott, D. Abts, IEEE Micro, vol. 29, no. 1, pp. 33-40, Jan.-Feb. 2009.
Technology-Driven, Highly-Scalable Dragonfly Topology, J. Kim, W. Dally, S. Scott, D. Abts, 35th International Symposium on Computer Architecture (ISCA), 2008.
25 References (cont'd)
Microarchitecture of a High-Radix Router, J. Kim, W. J. Dally, B. Towles, A. K. Gupta.
The BlackWidow High-Radix Clos Network, S. Scott, D. Abts, J. Kim, W. Dally, 33rd International Symposium on Computer Architecture (ISCA), 2006.
Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks, J. Kim, W. J. Dally, D. Abts, 34th International Symposium on Computer Architecture (ISCA), 2007.
Flattened Butterfly Topology for On-Chip Networks, J. Kim, J. Balfour, W. J. Dally, IEEE Computer Architecture Letters, vol. 6, no. 2, Jul.-Dec. 2007.
Flattened Butterfly Topology for On-Chip Networks, J. Kim, J. Balfour, W. J. Dally, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2007.
From Hypercubes to Dragonflies: A Short History of Interconnect, W. J. Dally.