Cray Gemini Interconnect
Technical University of Munich, Parallel Programming Class of SS14, Denys Sobchyshak




Outline
1. Introduction
2. Overview
3. Architecture
4. Gemini Blocks
5. FMA & BTE
6. Fault tolerance
7. Software stack
8. Interesting to know
9. References

Introduction
A torus interconnect is a network topology for connecting processing nodes in a parallel computer system. Gemini is an interconnect network for Cray supercomputer systems, first used in 2010. Its successor, the Aries interconnect, was first used around 2013.

Introduction
The best-known Gemini-based supercomputer, currently occupying 2nd place in the TOP500, is Titan: a Cray XK7 with 16-core Opteron 6274 processors at 2.2 GHz, the Cray Gemini interconnect, and NVIDIA K20X accelerators.

Overview
- Builds on the highly scalable SeaStar design used to deliver the 225,000-core Jaguar system, improving network functionality, latency and issue rate.
- Novel system-on-chip (SoC) design used to construct direct 3D torus networks that can scale to in excess of 100,000 multi-core nodes.
- Capable of tens of millions of MPI messages per second.
- Each hybrid compute node is interfaced to the Gemini interconnect through HyperTransport 3.0 technology. This direct-connect architecture bypasses the PCI bottlenecks inherent in commodity networks and provides a peak of over 20 GB/s of injection bandwidth per node.
- The proven 3D torus topology provides powerful bisection and global bandwidth characteristics as well as support for dynamic routing of messages.

Architecture
Cray XT 3D torus network. Each SeaStar router provides one node of the 3D torus; a Gemini provides two.
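To make the wraparound property of a torus concrete, here is a minimal C sketch. The dimension sizes and coordinates are illustrative assumptions, not the geometry of any real Cray system; it only shows that every node has two neighbours per dimension, with the edges wrapping around.

```c
#include <stdio.h>

/* Illustrative 3D torus dimensions (made up for this example). */
#define DIM_X 8
#define DIM_Y 8
#define DIM_Z 8

/* Wrap a coordinate around the torus in one dimension. */
static int wrap(int coord, int dim) {
    return ((coord % dim) + dim) % dim;
}

int main(void) {
    int x = 0, y = 3, z = 7;   /* example node coordinates */

    /* Two neighbours per dimension (+1 and -1, with wraparound) is what gives
     * a torus its uniform connectivity and good bisection bandwidth. */
    printf("x neighbours: %d, %d\n", wrap(x - 1, DIM_X), wrap(x + 1, DIM_X));
    printf("y neighbours: %d, %d\n", wrap(y - 1, DIM_Y), wrap(y + 1, DIM_Y));
    printf("z neighbours: %d, %d\n", wrap(z - 1, DIM_Z), wrap(z + 1, DIM_Z));
    return 0;
}
```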

Gemini Blocks
Each Gemini ASIC provides two network interface controllers (NICs) and a 48-port YARC router. Each NIC has its own HyperTransport 3 host interface, enabling Gemini to connect two Opteron nodes to the network. This two-node building block provides 10 torus links: 4 each in two of the dimensions (x and z) and 2 in the third dimension (y), as shown on the previous slide. Traffic between the two nodes connected to a single Gemini is routed internally. The router uses a tiled design, with 8 tiles dedicated to the NICs and 40 (10 groups of 4) dedicated to the network.
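As a rough back-of-the-envelope check of these numbers, the sketch below counts nodes and torus links for a machine built from Gemini blocks. Only the "2 nodes and 10 torus link ports per Gemini" figures come from the slide; the machine dimensions are invented, and the divide-by-two for shared links is a simplifying assumption.

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical machine size, in Gemini routers per torus dimension. */
    const long gx = 16, gy = 12, gz = 24;

    const long geminis = gx * gy * gz;
    const long nodes   = 2 * geminis;       /* each Gemini ASIC connects two Opteron nodes */
    const long ports   = 10 * geminis;      /* 10 torus link ports per Gemini block */
    const long links   = ports / 2;         /* assume each physical link joins two Geminis */

    printf("Geminis: %ld, nodes: %ld, torus links (approx.): %ld\n",
           geminis, nodes, links);
    return 0;
}
```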

FMA & BTE
Fast Memory Access (FMA) is a mechanism whereby user processes generate network transactions (puts, gets and atomic memory operations, AMOs) by storing directly to the NIC. The FMA block translates stores by the processor into fully qualified network requests. FMA provides both low latency and a high issue rate on small transfers. The Block Transfer Engine (BTE) supports asynchronous transfers between local and remote memory: software writes block transfer descriptors to a queue and the Gemini hardware performs the transfers asynchronously. The BTE supports memory operations (put/get) where the user specifies a local address, a network address and a transfer size. Put simply, FMA is used for small transfers and the BTE for large ones. FMA transfers have lower latency; BTE transfers take longer to start, but once running can move large amounts of data (up to 4 GB) without CPU involvement.
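The division of labour between FMA and BTE can be pictured as a size-based dispatch in the communication library. The sketch below is schematic C only: the threshold, function names and stub bodies are invented for illustration and are not the Cray uGNI/DMAPP API; it shows the idea of issuing small puts directly and queuing descriptors for large ones.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cut-over point; the real value is tuned by the library, not fixed here. */
#define FMA_MAX_BYTES 4096

/* Stub standing in for an FMA put: the CPU writes the payload straight to the
 * NIC, which turns the stores into network requests (low latency, high issue rate). */
static void fma_put_stub(uint64_t remote_addr, const void *buf, size_t len) {
    (void)buf;
    printf("FMA put: %zu bytes to 0x%llx\n", len, (unsigned long long)remote_addr);
}

/* Stub standing in for a BTE transfer: software enqueues a descriptor and the
 * hardware moves the data asynchronously, without further CPU involvement. */
static void bte_put_stub(uint64_t remote_addr, const void *buf, size_t len) {
    (void)buf;
    printf("BTE descriptor queued: %zu bytes to 0x%llx\n", len, (unsigned long long)remote_addr);
}

/* Dispatch a put according to its size, mirroring the FMA/BTE split. */
static void gemini_put(uint64_t remote_addr, const void *buf, size_t len) {
    if (len <= FMA_MAX_BYTES)
        fma_put_stub(remote_addr, buf, len);
    else
        bte_put_stub(remote_addr, buf, len);
}

int main(void) {
    char small[256] = {0};
    static char large[1 << 20];
    gemini_put(0x1000, small, sizeof small);  /* small: goes via FMA */
    gemini_put(0x2000, large, sizeof large);  /* large: goes via BTE */
    return 0;
}
```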

Fault tolerance
Gemini is designed for large systems in which failures are to be expected and applications must keep running in the presence of errors. Each torus link comprises 4 groups of 3 lanes. Packet CRCs are checked by each device, with automatic link-level retry on error. In the event of a link failure, the router selects an alternate path for adaptively routed traffic. Gemini uses ECC to protect major memories and data paths within the device. Error-correcting code (ECC) memory is a type of computer data storage that can detect and correct the most common kinds of internal data corruption. A cyclic redundancy check (CRC) is an error-detecting code commonly used in digital networks and storage devices to detect accidental changes to raw data: blocks of data entering these systems get a short check value attached, based on the remainder of a polynomial division of their contents; on retrieval the calculation is repeated, and corrective action can be taken against presumed data corruption if the check values do not match (a small CRC example follows below). Note: ECC memory is common even in less complex systems; for example, an ordinary server DIMM (Dual In-line Memory Module, a plug-in stick of RAM) may use ECC memory circuits.
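To illustrate the polynomial-division idea behind a CRC, here is a small, self-contained C implementation of the common CRC-32. This is the generic textbook algorithm (IEEE polynomial, bitwise form), not the specific link-level CRC used by Gemini hardware.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32: fold each data bit into the remainder of a polynomial
 * division over GF(2), using the reflected IEEE 802.3 polynomial. */
static uint32_t crc32_bitwise(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;  /* reflected polynomial */
            else
                crc >>= 1;
        }
    }
    return ~crc;
}

int main(void) {
    const char *msg = "Cray Gemini";
    uint32_t check = crc32_bitwise((const uint8_t *)msg, strlen(msg));
    printf("CRC-32 of \"%s\" = 0x%08X\n", msg, check);

    /* The receiver recomputes the CRC over the received block; a mismatch
     * with the attached check value indicates corruption in transit. */
    return 0;
}
```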

Software stack
Gemini supports both kernel communication, through a Linux device driver, and direct user-space communication, where the driver is used to establish communication domains and handle errors but can be bypassed for data transfers. Parallel applications typically use a library such as Cray MPI or SHMEM, in which the programmer makes explicit communication calls.
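As an example of the explicit one-sided communication style such libraries expose, here is a minimal C program using generic OpenSHMEM calls, in which PE 0 puts a value directly into PE 1's memory. Cray's SHMEM headers, launch commands and older entry points may differ; this is a sketch of the programming model, not Cray-specific code.

```c
#include <stdio.h>
#include <shmem.h>

/* Symmetric variable: exists at the same address on every PE. */
static long target = 0;

int main(void) {
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    long value = 42;
    if (me == 0 && npes > 1) {
        /* One-sided put: writes directly into PE 1's copy of 'target'
         * without PE 1 posting a matching receive. */
        shmem_long_put(&target, &value, 1, 1);
    }

    shmem_barrier_all();   /* make the put globally visible before reading */

    if (me == 1)
        printf("PE 1 received %ld from PE 0\n", target);

    shmem_finalize();
    return 0;
}
```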

Interesting to know
Performance metrics: on a quiet network, the end-point latency is 1.0 μs or less for a remote put and 1.5 μs or less for a small MPI message. The per-hop latency is 105 ns on a quiet network (a rough worked example combining these figures follows at the end of this slide).
Upgrading SeaStar to Gemini: the Gemini mezzanine card is pin-compatible with the SeaStar network card used on Cray XT5 and XT6 systems, allowing them to be upgraded to Gemini. The upgrade is straightforward: each blade is removed and its network card is replaced. There are no changes to the chassis, cabinets or cabling.
Future of Cray interconnects: Cray sold patents (about 10% of its whole patent portfolio), the interconnect technology itself (starting from the Aries interconnect) and key people to Intel in 2012. The agreement prevents Intel from reselling standalone Aries chips to anyone but Cray until 2016. "It's been clear that the important technology trends are starting to intersect, enabling more powerful and energy-efficient systems interconnects by more closely integrating this technology with the processor." (Cray CEO Peter Ungaro) "What will be changing in 2017, and beyond, is that we won't continue to build system interconnect hardware." (Cray CEO Peter Ungaro)
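Taking the quoted figures at face value, a naive additive model, end-point latency plus per-hop latency times hop count, gives a rough lower bound for a quiet network. The C sketch below works through that arithmetic; the hop count is an arbitrary example, and real paths depend on job placement, adaptive routing and contention.

```c
#include <stdio.h>

int main(void) {
    /* Figures quoted for Gemini on a quiet network. */
    const double put_latency_us = 1.0;    /* remote put, end point */
    const double hop_latency_us = 0.105;  /* 105 ns per router hop */

    int hops = 12;  /* arbitrary example path length through the torus */

    /* Naive additive estimate; ignores contention and software overhead. */
    double estimate = put_latency_us + hops * hop_latency_us;
    printf("Estimated put latency over %d hops: %.2f us\n", hops, estimate);
    return 0;
}
```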

References
http://wiki.ci.uchicago.edu/pub/beagle/systemspecs/gemini_whitepaper.pdf
http://www.theregister.co.uk/2010/05/25/cray_xe6_baker_gemini/?page=2
http://www.hpcwire.com/2012/04/25/intel_makes_a_deal_for_cray_s_interconnect_technology/
http://www.cray.com/products/computing/xk7/technology.aspx
http://www.top500.org/system/177975#.u44g6nwswl9
http://en.wikipedia.org/wiki/cray_xe6
http://en.wikipedia.org/wiki/cyclic_redundancy_check
http://en.wikipedia.org/wiki/ecc_memory
http://livingthewiccanlife.blogspot.de/2012/05/aries-march-21-april-20-aries-is.html
http://sjastro.com/gemini-horoscope/
http://en.wikipedia.org/wiki/torus_interconnect
http://en.wikipedia.org/wiki/hypertransport