Introduction to HPC. Lecture 17
1 Introduction to HPC, Lecture 17. Dept. of Computer Science. Clusters.
2 Clusters. Recall: Bus-Connected SMPs (UMAs).
[Figure: four processors, each with its own cache, sharing a single bus to memory and I/O.]
Caches are used to reduce latency and to lower bus traffic. Hardware must be provided for cache coherence and process synchronization. Bus traffic and bandwidth limit scalability (to roughly 36 processors).
3 Network-Connected Multiprocessors.
[Figure: processors with caches and memories connected through an interconnection network (IN).]
Either a single address space (NUMA and cc-NUMA) with implicit interprocessor communication via loads and stores, or multiple private memories with message-passing communication via sends and receives. The interconnection network supports interprocessor communication. (Adapted.)
Networks. Facets people talk a lot about: direct (point-to-point) vs. indirect (multi-hop); topology (e.g., bus, ring, DAG); routing algorithms; switching (aka multiplexing); wiring (e.g., choice of media: copper, coax, fiber). What really matters: latency, bandwidth, cost, reliability.
4 Interconnections (Networks). Examples:
MPP and Clusters: 100s to 10s of 1000s of nodes; 100 meters per link.
Local Area Networks: 100s to 1000s of nodes; a few 1000 meters.
Wide Area Networks: 1000s of nodes; 5,000,000 meters.
MPP = Massively Parallel Processor.
Three cultures for three classes of networks: MPP and Clusters care about latency and bandwidth; LANs about workstations and cost; WANs about telecommunications and revenue.
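To connect these link lengths to the time-of-flight term introduced on the next slide, here is a minimal sketch in C. It assumes signal propagation at roughly 2e8 m/s (about two-thirds of the speed of light, typical of copper or fiber); that propagation speed is an assumption for illustration, not a figure from the lecture.

```c
/* Rough time-of-flight estimate for the three network classes above.
 * Assumes propagation at ~2e8 m/s (about 2/3 c in copper or fiber);
 * real networks add switching and routing delays on top of this. */
#include <stdio.h>

int main(void) {
    const double prop_speed = 2.0e8;                 /* m/s, assumed */
    const char  *cls[]  = { "MPP/Cluster link", "LAN", "WAN" };
    const double dist[] = { 100.0, 1000.0, 5.0e6 };  /* meters, roughly per the slide */

    for (int i = 0; i < 3; i++) {
        double tof_us = dist[i] / prop_speed * 1e6;  /* microseconds */
        printf("%-16s %10.0f m  ->  time of flight %10.2f us\n",
               cls[i], dist[i], tof_us);
    }
    return 0;
}
```

At these speeds a 100 m cluster link contributes about half a microsecond of time of flight, while a 5,000 km WAN path contributes tens of milliseconds, which is why the three classes of networks optimize for different things.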
5 Network Performance Measures: Universal Performance Metrics.
[Figure: timeline of a message from sender to receiver, showing sender overhead (processor busy), transmission time (size / bandwidth), time of flight, transport delay, receiver overhead (processor busy), and total delay.]
Total Delay = Sender Overhead + Time of Flight + Message Size / BW + Receiver Overhead
The header/trailer are included in the BW calculation.
6 Simplified Latency Model.
Total Delay = Latency + Message Size / BW
Latency = Sender Overhead + Time of Flight + Receiver Overhead
Example: show what happens as we vary latency (1, 25, 500 µsec), BW (10, 100, 1000 Mbit/sec; factors of 10), and message size (16 bytes to 4 MB; factors of 4). If the overhead is 500 µsec, how big a message is needed to exceed 10 Mb/s?
[Plot: effective bandwidth (Mbit/sec) vs. message size (bytes) for each overhead/bandwidth combination: o1/o25/o500 µs crossed with bw10/bw100/bw1000 Mbit/s.]
Example Performance Measures:
  Interconnect          MPP           LAN          WAN
  Example               CM-5          Ethernet     ATM
  Bisection BW          N x 5 MB/s    MB/s         N x 10 MB/s
  Int./Link BW          20 MB/s       MB/s         10 MB/s
  Transport Latency     5 µsec        15 µsec      50 to 10,000 µs
  HW Overhead to/from   0.5/0.5 µs    6/6 µs       6/6 µs
  SW Overhead to/from   1.6/12.4 µs   200/241 µs   207/360 µs (TCP/IP on LAN/WAN)
Software overhead dominates in LAN and WAN.
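A minimal sketch of the simplified model in C, sweeping the latencies, bandwidths, and message sizes listed in the example above; the function and variable names are illustrative. Effective bandwidth is message size divided by total delay, so small messages are latency-bound and large messages approach the link bandwidth.

```c
/* Effective bandwidth under the simplified latency model:
 *   total_delay  = latency + message_size / BW
 *   effective_BW = message_size / total_delay
 * Latency, bandwidth, and message-size values follow the example above. */
#include <stdio.h>

static double effective_bw_mbit(double latency_us, double bw_mbit, double msg_bytes) {
    double xfer_us  = msg_bytes * 8.0 / bw_mbit;   /* 1 Mbit/s == 1 bit/us */
    double total_us = latency_us + xfer_us;
    return msg_bytes * 8.0 / total_us;             /* Mbit/s */
}

int main(void) {
    const double latency_us[] = { 1.0, 25.0, 500.0 };
    const double bw_mbit[]    = { 10.0, 100.0, 1000.0 };

    for (double size = 16.0; size <= 4.0 * 1024 * 1024; size *= 4.0) {
        printf("%8.0f B:", size);
        for (int l = 0; l < 3; l++)
            for (int b = 0; b < 3; b++)
                printf(" o%g/bw%g=%.1f", latency_us[l], bw_mbit[b],
                       effective_bw_mbit(latency_us[l], bw_mbit[b], size));
        printf("\n");
    }
    return 0;
}
```

Running the sweep answers the question on the slide: with 500 µs of overhead, messages need to reach roughly a kilobyte before the effective rate exceeds 10 Mb/s on the faster links, and on a 10 Mb/s link it never quite does.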
7 HW Interface Issues. (Source: Mike Levine, PSC, DEISA Symposium, May 2005.)
Where to connect the network to the computer?
Cache consistent to avoid flushes? (=> memory bus)
Latency and bandwidth? (=> memory bus)
Standard interface card? (=> I/O bus)
MPP => memory bus; Clusters, LAN, WAN => I/O bus.
[Figure: CPU with caches; the network I/O controller attaches either directly to the memory bus or through a bus adaptor on the I/O bus. Ideal: high bandwidth, low latency, standard interface.]
8 Interconnect, First Level: Chip/Board.
AMD: HyperTransport on chip, HyperTransport on board.
Intel: QuickPath Interconnect (QPI) on board; ring (MIC, Sandy Bridge); mesh (Polaris, SCC).
SCC: /SCC_Sympossium_Feb212010_FINAL-A.pdf
MIC: reference/isc_2010_skaugen_keynote.pdf
9 Internode Connection: Where to Connect?
Internal bus/network (MPP): CM-1, CM-2, CM-5; IBM Blue Gene, Power7; Cray; SGI Ultra Violet.
I/O bus (Clusters): typically the PCI bus.
Interconnect examples for MPP (proprietary interconnection technology):
10 IBM Blue Gene/P.
3.4 GF/s (DP) per core, 13.6 GF/s per node. Memory BW/F = 1 B/F. Comm BW/F = (6 x 3.4 x 2 / 8) GB/s / 13.6 GF/s = 0.375 B/F.
3-Dimensional Torus: interconnects all compute nodes; communications backbone for computations; adaptive cut-through hardware routing; 3.4 Gb/s on all 12 node links (5.1 GB/s per node); 0.5 µs latency between nearest neighbors, 5 µs to the farthest (note: approximately 0.5 µs x number of hops); MPI: 3 µs latency for one hop, 10 µs to the farthest; .7/2.6 TB/s bisection bandwidth, 188 TB/s total bandwidth (72k-node machine).
Collective Network: interconnects all compute and I/O nodes (1152); one-to-all broadcast functionality; reduction operations functionality; 6.8 Gb/s of bandwidth per link; latency of a one-way tree traversal 2 µs, MPI 5 µs; ~62 TB/s total binary-tree bandwidth (72k machine).
Low-Latency Global Barrier and Interrupt: latency of one way to reach all 72K nodes 0.65 µs, MPI 1.6 µs.
Other networks: 10 Gb functional Ethernet (I/O nodes only); 1 Gb private control Ethernet (provides JTAG access to hardware; accessible only from the Service Node system).
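The per-hop figures above suggest a simple distance model for the torus. Below is a small C sketch, not IBM's routing code, that counts minimum hops between two nodes of a 3-D torus with wraparound links and scales by an assumed per-hop cost; the 8x8x8 dimensions and the 0.5 µs per hop are illustrative assumptions consistent with the slide.

```c
/* Minimum hop count between two nodes of a 3-D torus (with wraparound links),
 * plus a crude latency estimate at an assumed cost per hop.
 * Torus dimensions and per-hop cost are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>

static int hops_1d(int a, int b, int dim) {
    int d = abs(a - b);
    return d < dim - d ? d : dim - d;       /* take the short way around the ring */
}

static int hops_3d(const int src[3], const int dst[3], const int dims[3]) {
    int hops = 0;
    for (int i = 0; i < 3; i++)
        hops += hops_1d(src[i], dst[i], dims[i]);
    return hops;
}

int main(void) {
    const int dims[3] = { 8, 8, 8 };        /* one 512-node midplane, assumed */
    const double us_per_hop = 0.5;          /* per-hop cost, per the slide */
    const int src[3] = { 0, 0, 0 };
    const int far[3] = { 4, 4, 4 };         /* farthest node in an 8x8x8 torus */

    int h = hops_3d(src, far, dims);
    printf("farthest node: %d hops, ~%.1f us hardware latency\n", h, h * us_per_hop);
    return 0;
}
```

For an 8x8x8 partition this gives 12 hops, about 6 µs, roughly in line with the few-microsecond farthest-neighbor figure quoted above.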
11 IBM BG/P Ping-Pong.
MPI Ping-Pong latency: avg 4.7 µs. MPI Ping-Pong bandwidth: 0.38 GB/s. (147,456 cores.)
Measured: ~650 MF/s out of 13.6 GF/s (~5% of peak).
Estimate based on memory BW: (13.6/24 x 2 x 0.85) = 0.963 GF/s.
Estimate based on measured BW: (9.4/24 x 2 x 0.85) = 0.665 GF/s (BW measured by STREAM and reported as part of HPCC).
Blue Gene/Q
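The ping-pong latency and bandwidth figures quoted here (and for the other machines on later slides) come from this style of micro-benchmark. Below is a minimal MPI ping-pong sketch, not the exact code behind these measurements: rank 0 sends a message to rank 1 and waits for it to be echoed back; half the round-trip time gives the one-way latency (small messages), and the message size divided by that time gives the bandwidth (large messages).

```c
/* Minimal MPI ping-pong sketch (run with at least 2 ranks).
 * Half the round-trip time approximates one-way latency; message_size
 * divided by that time approximates bandwidth. Illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    const int iters = 1000;
    const int size = (argc > 1) ? atoi(argv[1]) : 8;   /* message size in bytes */
    char *buf = malloc((size_t)size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double one_way_us = (MPI_Wtime() - t0) / iters / 2.0 * 1e6;

    if (rank == 0)
        printf("%d bytes: %.2f us one-way, %.3f GB/s\n",
               size, one_way_us, size / one_way_us / 1e3);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Reported latencies usually use very small messages and reported bandwidths use large ones; averages such as the 4.7 µs above presumably reflect measurements over many node pairs across the machine.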
12 BG/Q Networks: 5-D Torus.
5-D torus in compute nodes; 2 GB/s bidirectional bandwidth on all (10+1) links; 5-D nearest-neighbor exchange measured at ~1.75 GB/s per link. Both the collective and barrier networks are embedded in this 5-D torus network. Virtual Cut-Through (VCT) routing. Floating-point addition support in the collective network.
Compute-rack-to-compute-rack bisection BW (46x BG/L, 19x BG/P); a small sketch of this arithmetic follows after this slide:
20.1 PF: bisection is 2 x 16 x 16 x 12 x 2 (bidirectional) x 2 (torus, not mesh) x 2 GB/s link bandwidth = 49.2 TB/s.
26.8 PF: bisection is 2 x 16 x 16 x 16 x 4 x 2 GB/s = 65.5 TB/s.
BG/L at LLNL is 0.7 TB/s.
I/O network to/from a compute rack: 2 links (4 GB/s in, 4 GB/s out) feed an I/O PCI-e port (4 GB/s in, 4 GB/s out). Every Q32 node card has up to 8 I/O links or 4 ports. Every rack has up to 32 x 8 = 256 links or 128 ports.
I/O rack: 8 I/O nodes per drawer; each node has 2 links from the compute rack and 1 PCI-e port to the outside world; 12 drawers/rack gives 96 I/O nodes, or 96 x 4 GB/s (PCI-e) = 384 GB/s = 3 Tb/s.
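The bisection figures above are just the product of the listed factors; a small sketch of that arithmetic (factor lists taken from the slide, results reported in decimal TB/s):

```c
/* BG/Q compute-rack bisection bandwidth as the product of the factors on
 * the slide: links crossing the bisection x bidirectional x torus wrap
 * x 2 GB/s per link. Reported in decimal TB/s (1 TB/s = 1000 GB/s). */
#include <stdio.h>

static double product(const double *f, int n) {
    double p = 1.0;
    for (int i = 0; i < n; i++)
        p *= f[i];
    return p;
}

int main(void) {
    /* 20.1 PF configuration: 2 x 16 x 16 x 12 x 2 (bidi) x 2 (torus) x 2 GB/s */
    const double cfg_20pf[] = { 2, 16, 16, 12, 2, 2, 2 };
    /* 26.8 PF configuration: 2 x 16 x 16 x 16 x 4 x 2 GB/s */
    const double cfg_27pf[] = { 2, 16, 16, 16, 4, 2 };

    printf("20.1 PF bisection: %.1f TB/s\n", product(cfg_20pf, 7) / 1000.0);
    printf("26.8 PF bisection: %.1f TB/s\n", product(cfg_27pf, 6) / 1000.0);
    return 0;
}
```

This reproduces the roughly 49 TB/s and 65 TB/s figures quoted above.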
13 BG/Q Network performance: all-to-all 97% of peak; bisection > 93% of peak; nearest-neighbor 98% of peak; collective FP reductions at 94.6% of peak. No performance problems identified in the network logic.
Cray XE6 Gemini Network: Day_2_Session_2_all.pdf
14 Cray XE6 Gemini Network.
MPI Ping-Pong latency: 6-9 µs, avg 7.5 µs (note: SeaStar, not Gemini).
MPI Ping-Pong bandwidth: 1.6 GB/s (note: SeaStar, not Gemini). (224,256 cores.)
Cray XE6 Gemini Network: Presentations/Courses_Ws_2011/Multi-Threaded_Course_Feb11/Day_2_Session_2_all.pdf
15 SGI Ultra Violet (UV).
1 rack: 16 nodes, 32 sockets; max 3 hops. External NUMAlink 5 routers: 16 ports.
UV Hub: two QPI interfaces (2 x 25 GB/s); four NUMAlink 5 links (4 x 10 GB/s).
SGI UV: 8 racks, 128 nodes, 256 sockets, fat-tree (1/4 shown in the figure); 16 TB shared memory (4 racks with 16 GB DIMMs).
MPI Ping-Pong latency: avg 1.6 µs. MPI Ping-Pong bandwidth: avg 3 GB/s. (64 cores.)
512 racks, 8192 nodes, 16,384 sockets: 8 x 8 torus of 128-node fat-trees. Each torus link consists of 2 NUMAlink 5 bidirectional links. Maximum estimated latency for a 1024-rack system: < 2 µs.
16 IBM Power7 Hub.
61 mm x 96 mm glass-ceramic LGA module; 56 12X optical modules; LGA attach onto substrate; TB/s-scale interconnect bandwidth.
45 nm lithography, Cu, SOI, 13 levels of metal; 440M transistors; 582 mm2 die (26.7 mm x 21.8 mm); 3707 signal I/O; 11,328 total I/O.
17 IBM Power7 Integrated Switch Router (ISR).
Two-tier, full-graph network. 3.0 GHz internal 56 x 56 crossbar switch: 8 HFI, 7 LL, 24 LR, 16 D, and SRV ports. Virtual channels for deadlock prevention. Input/output buffering. 2 KB maximum packet size, 128 B FLIT size.
Link reliability: CRC-based link-level retry; lane steering for failed links.
IP multicast support: multicast route tables per ISR for replicating and forwarding multicast packets.
Global counter support: the ISR compensates for link latencies as counter information is propagated; HW synchronization with Network Management setup and maintenance.
Routing characteristics: 3-hop L-D-L longest direct route; 5-hop L-D-L-D-L longest indirect route; cut-through wormhole routing; full hardware routing using distributed route tables across the ISRs; source route tables for packets injected by the HFI; port route tables for packets at each hop in the network; separate tables for inter-supernode and intra-supernode routes. FLITs of a packet arrive in order; packets of a message can arrive out of order.
Routing modes: hardware single direct routing; hardware multiple direct routing (for a less-than-full-up system where more than one direct path exists); hardware indirect routing for data striping and failover (round-robin, random); software-controlled indirect routing through hardware route tables.
I/O bus technologies (clusters):
18 Peripheral Component Interconnect (PCI).
PCI: V1.0 (1992): 32-bit, 33 MHz.
PCI-X: V1.0 (1998): 64-bit, 66 MHz, 100 MHz, 133 MHz. V2.0 (2003): 64-bit wide, 266 MHz, 533 MHz.
PCI Express (PCIe): V1.0 (2003): 256 MiB/s per lane (16 lanes = 4 GiB/s). V2.0 (2007): 512 MiB/s per lane (16 lanes = 8 GiB/s). V3.0 (2010): 1024 MiB/s per lane (16 lanes = 16 GiB/s). PCIe is defined for 1, 2, 4, 8, 16, and 32 lanes.
[Figure: PCIe x4, x16, x1, and x16 slots alongside a 32-bit PCI slot.]
Cluster Interconnect Technologies.
Ethernet: 1 GigE (1995), 10 GigE (2001), 40 GigE (2010), 100 GigE (2010).
InfiniBand:
2001: Single Data Rate (SDR), 2.5 Gbps/lane, x4 (2003) 10 Gbps; 8b/10b encoding (net data rate 2 Gbps, x4 8 Gbps).
2005: Double Data Rate (DDR), 5 Gbps/lane, x4 20 Gbps; 8b/10b encoding (net data rate 4 Gbps, x4 16 Gbps).
2007: Quad Data Rate (QDR), 10 Gbps/lane, x4 40 Gbps; 8b/10b encoding (net data rate 8 Gbps, x4 32 Gbps).
2011: Fourteen Data Rate (FDR), 14.06 Gbps/lane, x4 56.25 Gbps; 64b/66b encoding (net data rate 13.64 Gbps, x4 54.5 Gbps).
2013: Enhanced Data Rate (EDR), 25.78 Gbps/lane, x4 103.1 Gbps; 64b/66b encoding (net data rate 25 Gbps, x4 100 Gbps).
Switch latency: SDR 200 ns, DDR 140 ns, QDR 100 ns. The current Mellanox switch chip has 1.4 billion transistors, an aggregate throughput of 4 Tb/s across 36 ports, and a port-to-port latency of 165 ns.
Myrinet: 0.64 Gbps (1994), 1.28 Gbps (1996), 2 Gbps (2000), 10 Gbps (2006).
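The net data rates above follow from the encoding overhead: net rate = signaling rate per lane x encoding efficiency, then x4 for the common 4x link width. A small sketch of that arithmetic follows; the FDR and EDR lane rates used here (14.0625 and 25.78125 Gbps) are the commonly quoted signaling rates and are assumptions rather than values taken from the slide.

```c
/* InfiniBand net data rates from signaling rate and encoding efficiency.
 * SDR/DDR/QDR use 8b/10b encoding; FDR/EDR use 64b/66b.
 * FDR/EDR lane rates (14.0625, 25.78125 Gbps) are assumed values. */
#include <stdio.h>

struct ib_gen { const char *name; double lane_gbps; double encoding; };

int main(void) {
    const struct ib_gen gens[] = {
        { "SDR",  2.5,      8.0 / 10.0 },
        { "DDR",  5.0,      8.0 / 10.0 },
        { "QDR", 10.0,      8.0 / 10.0 },
        { "FDR", 14.0625,  64.0 / 66.0 },
        { "EDR", 25.78125, 64.0 / 66.0 },
    };
    for (int i = 0; i < 5; i++) {
        double net = gens[i].lane_gbps * gens[i].encoding;
        printf("%s: %8.3f Gbps/lane raw, %7.3f Gbps/lane net, x4 = %6.1f Gbps net\n",
               gens[i].name, gens[i].lane_gbps, net, 4.0 * net);
    }
    return 0;
}
```

The 8b/10b generations lose 20% of the raw rate to encoding (hence 10 Gbps raw but 8 Gbps net for a 4x SDR link), while 64b/66b keeps the overhead to about 3%.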
19 InfiniBand Roadmap.
SDR - Single Data Rate; DDR - Double Data Rate; QDR - Quad Data Rate; FDR - Fourteen Data Rate; EDR - Enhanced Data Rate; HDR - High Data Rate; NDR - Next Data Rate.
Typical InfiniBand network: HCA = Host Channel Adapter; TCA = Target Channel Adapter.
20 Interconnect Technology Properties.
Adapters compared: Mellanox ConnectX (IB 40 Gb/s, PCIe x8), QLogic InfiniPath (IB 20 Gb/s, PCIe x8), Myrinet 10G (PCIe x8), Quadrics QSNetII, and Chelsio T210-CX (PCIe x8), spanning InfiniBand, proprietary, GigE, and 10GigE technologies. Metrics: application latency (µs) and peak unidirectional bandwidth (MB/s) for PCIe Gen1 and PCIe Gen2 (N/A where not applicable).
Mellanox ConnectX InfiniBand, IPoIB bandwidth:
  IB 10 Gb/s, PCIe Gen1:  939 MB/s
  IB 20 Gb/s:            1410 MB/s
  IB 20 Gb/s, PCIe Gen2: 1880 MB/s
  IB 40 Gb/s, PCIe Gen2: 2950 MB/s
MPI Ping-Pong measurements (latency in µs, bandwidth in GB/s):
  Mellanox QDR:   latency avg 3.6, bandwidth avg 1.8 (4320 cores)
  InfiniPath QDR: latency avg 1.6, bandwidth avg 2.5 (192 cores)
  BG/P:           latency avg 4.7, bandwidth 0.38 (147,456 cores)
  Cray SeaStar:   latency 6-9, avg 7.5, bandwidth 1.6 (224,256 cores)
  SGI UV:         latency avg 1.6, bandwidth avg 3 (64 cores)
21 Cray XC30 Interconnection Network.
[Figure: chassis and group structure of the XC30 network.]
22 Cray XC30 Interconnection Network: Aries chip.
40 nm technology; 16.6 x 18.9 mm; 217M gates; 184 lanes of SerDes: 30 optical lanes, 90 electrical lanes, 64 PCIe 3.0 lanes.
Cray XC30 Network Overview.
23 Interconnection Networks References
CSE 431, Computer Architecture, Fall 2005, Lecture 27: Network Connected Multiprocessors, Mary Jane Irwin.
Lecture 21: Networks & Interconnect Introduction, Dave A. Patterson, Jan Rabaey, CS 252, Spring 2000.
Technology Trends in High Performance Computing, Mike Levine, DEISA Symposium, May 9-10, 2005.
Single-chip Cloud Computer: An Experimental Many-core Processor from Intel Labs, Jim Held.
Petascale to Exascale - Extending Intel's HPC Commitment, Kirk Skaugen.
Blue Gene: A Next Generation Supercomputer (BlueGene/P), Alan Gara.
Blue Gene/P Architecture: Application Performance and Data Analytics, Vitali Morozov.
HPC Challenge.
Multi-Threaded Course, February 15-17, 2011, Day 2: Introduction to Cray MPP Systems with Multi-core Processors; Multi-threaded Programming, Tuning and Optimization on Multi-core MPP Platforms, 011/Multi-Threaded_Course_Feb11/Day_2_Session_2_all.pdf
Gemini Description, MPI, Jason Beech-Brandt.
Technical Advances in the SGI UV Architecture.
SGI UV: Solving the World's Most Data Intensive Problems.
24 References (cont'd)
The IBM POWER7 HUB Module: A Terabyte Interconnect Switch for High-Performance Computer Systems, Baba Arimilli, Steve Baumgartner, Scott Clark, Dan Dreps, Dave Siljenberg, Andrew Mak, Hot Chips 22, August 2010.
Intel Core i7 I/O Hub and I/O Controller Hub.
PCI Express, InfiniBand and 10-Gigabit Ethernet for Dummies, DK Panda, Pavan Balaji, Matthew Koop, a tutorial at Supercomputing 09.
InfiniBand Roadmap.
InfiniBand Performance.
A Complexity Theory for VLSI, Clark David Thompson, Doctoral Thesis, ACM.
Microprocessors, Exploring Chip Layers.
Cray T3E.
Complexity Issues in VLSI, Frank Thomson Leighton, MIT Press, 1983.
The Tree Machine: An Evaluation of Strategies for Reducing Program Loading Time, Pey-yun Peggy Li and Lennart Johnsson.
Dado: A Tree-Structured Architecture for Artificial Intelligence Computation, S. J. Stolfo and D. P. Miranker, Annual Review of Computer Science, Vol. 1, pp. 1-18, June 1986, DOI: /annurev.cs.
Architecture and Applications of DADO: A Large-Scale Parallel Computer for Artificial Intelligence, Salvatore J. Stolfo, Daniel Miranker, David Elliot Shaw.
Introduction to Algorithms, Charles E. Leiserson, September 15, 2004.
References (cont'd)
UC Berkeley, CS 252, Spring 2000, Dave Patterson.
Interconnection Networks, Computer Architecture: A Quantitative Approach, 4th Edition, Appendix E, Timothy Mark Pinkston (USC) and Jose Duato (Universidad Politecnica de Valencia).
Access and Alignment of Data in an Array Processor, D. H. Lawrie, IEEE Trans. Computers, C-24, no. 12, December 1975, ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=
SP2 System Architecture, T. Agerwala, J. L. Martin, J. H. Mirza, D. C. Sadler, D. M. Dias, and M. Snir, IBM J. Res. Dev., v. 34, no. 2, 1995.
Inside the TC2000, BBN Advanced Computer Inc., preliminary version, 1989.
A Study of Non-Blocking Switching Networks, Charles Clos, Bell System Technical Journal, vol. 32, 1953.
On Rearrangeable Three-Stage Connecting Networks, V. E. "Vic" Benes, BSTJ, vol. XLI, no. 5, Sep. 1962.
GF11: M. Kumar, IBM J. Res. Dev., v. 36, no. 6.
Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing, Charles E. Leiserson, IEEE Trans. Computers 34(10), 1985.
Cray XC30 Series Network, B. Alverson, E. Froese, L. Kaplan, D. Roweth, Cray High Speed Networking, Hot Interconnects, August 2012.
Cost-Efficient Dragonfly Topology for Large-Scale Systems, J. Kim, W. Dally, S. Scott, D. Abts, IEEE Micro, vol. 29, no. 1, pp. 33-40, Jan.-Feb. 2009.
Technology-Driven, Highly-Scalable Dragonfly Topology, J. Kim, W. Dally, S. Scott, D. Abts, 35th International Symposium on Computer Architecture (ISCA), 2008.
25 References (cont'd)
Microarchitecture of a High-Radix Router, J. Kim, W. J. Dally, B. Towles, A. K. Gupta.
The BlackWidow High-Radix Clos Network, S. Scott, D. Abts, J. Kim, W. Dally, 33rd International Symposium on Computer Architecture (ISCA), 2006.
Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks, J. Kim, W. J. Dally, D. Abts, 34th International Symposium on Computer Architecture (ISCA), 2007.
Flattened Butterfly Topology for On-Chip Networks, J. Kim, J. Balfour, W. J. Dally, IEEE Computer Architecture Letters, vol. 6, no. 2, Jul.-Dec. 2007.
Flattened Butterfly Topology for On-Chip Networks, J. Kim, J. Balfour, W. J. Dally, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2007.
From Hypercubes to Dragonflies: A Short History of Interconnect, W. J. Dally.