A Scalable Ethernet Clos-Switch




A Scalable Ethernet Clos-Switch
Norbert Eicker, John von Neumann Institute for Computing, Research Centre Jülich
Technisches Seminar, DESY Zeuthen, 9 May 2006

Outline
- Motivation
- Clos switches
- Ethernet crossbar switches and their problems
  - Broadcasts
  - Spanning trees
  - VLANs
  - Multiple spanning trees
  - Crossbar routing
- Implementation within ALiCEnext
- Results
- Summary / outlook on further work in Jülich

Ideal Network
- Low latency: determines the runtime of short messages; the most critical parameter for tightly coupled problems
- High bandwidth: determines the runtime of long messages
- Flat topology: two arbitrary nodes can see each other, with equal distance between any two nodes
- No contention: no bottleneck within the network, even for arbitrary traffic patterns; full bi-sectional bandwidth
- In other words: a true crossbar switch
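
To make the "full bi-sectional bandwidth" criterion concrete (this reminder is not on the slide): for a network of N nodes with per-link bandwidth b, the bisection bandwidth is the minimum bandwidth across any cut into two equal halves, and "full" bisection bandwidth means every node in one half can stream to the other half at full link rate simultaneously:

\[
  B_\text{bisection} \;=\; \min_{\text{equal halves}} \bigl(\text{bandwidth across the cut}\bigr),
  \qquad
  B_\text{full} \;=\; \frac{N}{2}\, b .
\]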

Real Networks
- Fat tree: communication between arbitrary nodes; varying distances; bottlenecks
- Torus / mesh: only nearest-neighbour communication; cut-through routing possible; scalable for suitable applications
- Clos / ω-network: communication between arbitrary nodes; varying distances; no contention, at least in principle

Clos Switches
- Introduced by Charles Clos in 1953 for telephony networks
- Redundant
- Scalable
- Full bi-sectional bandwidth (at least in principle)
- Standard architecture for commercial high-performance networks: Myrinet, Quadrics, InfiniBand
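
As background (not on the slide): in Clos' classical analysis of a three-stage network with n input ports per ingress switch and m middle-stage switches, the standard conditions are

\[
  m \;\ge\; 2n - 1 \quad\text{(strictly non-blocking)},
  \qquad
  m \;\ge\; n \quad\text{(rearrangeably non-blocking)} .
\]

This is what the next slide alludes to: with m = n the fabric offers full bi-sectional bandwidth only if routes are chosen (or rearranged) cleverly, while roughly doubling the middle layer removes the problem altogether.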

Contention
- Clos switches are not collision / contention free
- Typical systems use static source- or destination-based routing
- It is easy to construct colliding patterns
- Possible ways out:
  - More switches in the middle layer: doubling their number prevents contention
  - Traffic-dependent routing: hard to implement
  - Random patterns might reduce the probability
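
A minimal sketch of why static routing collides, using an assumed 3x3 Clos (3 leaf switches with 3 nodes each, 3 spine switches) and an assumed destination-based rule "spine = dst mod 3"; none of this is from the talk, it only illustrates the point:

```python
from collections import Counter

NODES_PER_LEAF = 3

def leaf_of(node):              # leaf switch a node is attached to
    return node // NODES_PER_LEAF

def spine_for(dst):             # static, purely destination-based spine choice
    return dst % 3

def uplink_load(pairs):
    """Count how many flows share each (leaf, spine) uplink."""
    load = Counter()
    for src, dst in pairs:
        if leaf_of(src) != leaf_of(dst):   # only inter-leaf traffic uses uplinks
            load[(leaf_of(src), spine_for(dst))] += 1
    return load

good = [(0, 3), (1, 4), (2, 5)]   # destinations map to three different spines
bad  = [(0, 3), (1, 6), (2, 4)]   # dsts 3 and 6 both map to spine 0 -> shared uplink

print("good:", uplink_load(good))  # every uplink carries at most one flow
print("bad: ", uplink_load(bad))   # uplink (leaf 0, spine 0) carries two flows -> halved bandwidth
```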

Ethernet Clos Switches
- Motivation: find a network for ALiCEnext
  - Limited budget
  - Mix of jobs
- Cascaded Ethernet (fat tree): bandwidth bottlenecks
- Mesh network: only suits special applications; present anyway for QCD
- Ethernet Clos switch: let's see...

Ethernet Clos Switches
- Let's start naively:
  - Take some nodes (in our case 9)
  - Take some switches (6: two layers of 3)
  - Cable them according to Clos' idea
  - See what happens
- First impression: success
  - All nodes see each other
  - Switches are still accessible
  - Bandwidth tests with single pairs give the expected numbers
- But: tests with more pairs show unexpected bottlenecks!

Address Resolution Protocol (ARP)
- Nodes initially only know IP addresses; Ethernet switches (and NICs) only handle MAC addresses
- Address Resolution Protocol (ARP), RFC 826: Ethernet address resolution
  - Requester sends a broadcast message "Who knows IP a.b.c.d?" (ARP request)
  - The node with the matching IP replies via a broadcast message "My IP / MAC is a.b.c.d / 12-34-56-78-90-ab" (ARP reply)
  - Known addresses are stored in the ARP table (a cache with TTL)
  - Optionally, listen to ARP traffic between other nodes
- ARP relies on Ethernet broadcasts
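
A minimal sketch of the ARP-table behaviour described above, i.e. a cache mapping IP to MAC with a time-to-live (the class, names and the 60-second TTL are illustrative assumptions, not from the talk):

```python
import time

class ArpCache:
    def __init__(self, ttl=60.0):          # TTL in seconds; 60 s is just an example
        self.ttl = ttl
        self.entries = {}                   # ip -> (mac, timestamp)

    def learn(self, ip, mac):
        """Store a mapping, e.g. from an observed ARP reply."""
        self.entries[ip] = (mac, time.time())

    def lookup(self, ip):
        """Return the MAC if the entry is still fresh, else None (caller sends an ARP request)."""
        hit = self.entries.get(ip)
        if hit is None:
            return None
        mac, stamp = hit
        if time.time() - stamp > self.ttl:  # entry expired, must be re-resolved
            del self.entries[ip]
            return None
        return mac

cache = ArpCache()
cache.learn("192.168.1.17", "12-34-56-78-90-ab")
print(cache.lookup("192.168.1.17"))        # '12-34-56-78-90-ab' while the entry is fresh
```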

Spanning Trees
- Ethernet broadcasts carry no TTL: loops generate packet storms
- Broadcasts are inevitable for IP over Ethernet (ARP), so packet storms will occur
- Spanning trees (IEEE 802.1D)
  - Create robustness against unwanted loops :-)
  - But eliminate all additional connectivity :-(

VLAN
- Virtual LAN (VLAN, IEEE 802.1Q, IEEE 802.3ac)
- Segments the physical network into virtually disjoint parts; segments may overlap
- Yet another layer of indirection: Ethernet frames are wrapped into a VLAN container
- In practice low (almost no) overhead
- Extremely useful for managing department networks
- Widely available in medium-sized commodity switches
- Idea: VLANs might be used to hide the loops
  - Create various VLANs, each forming a spanning tree
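
What "wrapping a frame into a VLAN container" means on the wire: IEEE 802.1Q inserts a 4-byte tag (TPID 0x8100, then 3 bits priority, 1 DEI bit, 12 bits VLAN ID) between the source MAC and the EtherType. A small sketch (the example frame and function name are my own, not from the talk):

```python
import struct

def add_vlan_tag(frame: bytes, vlan_id: int, priority: int = 0) -> bytes:
    """Insert an IEEE 802.1Q tag into an untagged Ethernet frame."""
    assert 0 <= vlan_id < 4096
    tci = (priority << 13) | vlan_id            # DEI bit left at 0
    tag = struct.pack("!HH", 0x8100, tci)       # TPID + TCI, network byte order
    return frame[:12] + tag + frame[12:]        # after dst MAC (6) + src MAC (6)

untagged = (bytes.fromhex("ffffffffffff")       # destination MAC (broadcast)
            + bytes.fromhex("1234567890ab")     # source MAC
            + struct.pack("!H", 0x0800)         # EtherType: IPv4
            + b"payload")
tagged = add_vlan_tag(untagged, vlan_id=42)
print(tagged.hex())                             # contains '8100002a': the inserted tag for VLAN 42
```

The tag is only four bytes per frame, which is why the overhead is negligible in practice.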

VLAN
- Switches support different types of VLAN links:
  - Tagged: delivered packets are explicitly tagged, i.e. wrapped into a VLAN header; inbound traffic is expected to carry a tag
  - Untagged: the VLAN header is discarded when a packet is delivered; inbound traffic is plain Ethernet and may be tagged automatically
- Our concept:
  - Don't touch the nodes' configuration
  - Make the crossbar as transparent as possible
  - Node ports are untagged and belong to all VLANs
  - Inbound traffic is mapped into a VLAN depending on the port number
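
A hedged sketch of how such a leaf-switch port/VLAN setup could be described; the port numbering, VLAN count per switch and data layout are assumptions for illustration, not the actual ALiCEnext configuration:

```python
N_VLANS = 24                   # one VLAN per uplink path (assumption)
NODE_PORTS = range(1, 25)      # ports 1-24: nodes (example numbering)
UPLINK_PORTS = range(25, 49)   # ports 25-48: uplinks to the spine layer (example)

def leaf_switch_config():
    config = {}
    for port in NODE_PORTS:
        config[port] = {
            "mode": "untagged",                           # node sees plain Ethernet
            "member_of": list(range(1, N_VLANS + 1)),     # node port belongs to all VLANs
            "pvid": (port - 1) % N_VLANS + 1,             # inbound traffic mapped by port number
        }
    for port in UPLINK_PORTS:
        config[port] = {
            "mode": "tagged",                             # uplinks carry all VLANs explicitly tagged
            "member_of": list(range(1, N_VLANS + 1)),
            "pvid": None,
        }
    return config

print(leaf_switch_config()[1]["pvid"])   # traffic entering node port 1 lands in VLAN 1
```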

Spanning VLANs [figure-only slide: the VLANs spanning the crossbar]

Multiple Spanning Trees
- Each individual VLAN might still contain a loop, so every VLAN needs its own spanning tree
- Multiple spanning trees (MST, IEEE 802.1s)
- On startup / change of topology:
  - Throw away the old MST configuration
  - Determine the network root via the switches' MAC addresses
  - The root sends test packets
  - Each switch forwards (updated) test packets on all ports
  - The first received packet determines the route to the root
  - Switch off all alternative routes
  - Do this for all configured VLANs
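
A simplified sketch of the mechanism just listed, per VLAN: elect the switch with the lowest MAC as root, keep the first-discovered path towards it, and block every other inter-switch link. This is an assumption-laden model (BFS standing in for the flooding of test packets), not the IEEE 802.1s state machine:

```python
from collections import deque

def spanning_tree(switches, links):
    """switches: {name: mac}; links: set of frozenset({a, b}) between switches."""
    root = min(switches, key=switches.get)     # lowest MAC wins the root election
    kept, seen, queue = set(), {root}, deque([root])
    while queue:                               # BFS models flooding of the test packets
        here = queue.popleft()
        for link in links:
            if here in link:
                other = next(iter(link - {here}))
                if other not in seen:          # first packet to arrive determines the route
                    seen.add(other)
                    kept.add(link)
                    queue.append(other)
    return kept, links - kept                  # (active links, blocked links)

switches = {"L1": "00:01", "L2": "00:02", "L3": "00:03", "S1": "00:00"}
links = {frozenset(p) for p in [("L1", "S1"), ("L2", "S1"), ("L3", "S1"), ("L1", "L2")]}
active, blocked = spanning_tree(switches, links)
print(blocked)                                 # the redundant L1-L2 link gets disabled
```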

Multiple Spanning Trees
- We tested the implementation on the SMC 8648T
  - Works for the 3x3 setup
- Tests with a bigger setup (24x6 nodes / 6+4 switches) show:
  - Not robust enough for our (quite unusual) setup
  - A kind of packet storm totally occupies the fabric
  - Switches lock up completely; the network becomes useless
- (Impractical) tricks make it work:
  - MST has to be switched off; plug in one cable after the other
  - Be aware of loops! Less tolerant against cabling / configuration errors
  - Robust enough in production (we almost never replug cables)

Routing
- Next test: the setup works for some patterns, but tests with most patterns show unexpected bottlenecks!
- We can understand this:
  - The sending node determines which VLAN is employed
  - Problem: nodes are only visible in one VLAN
  - MAC learning is impossible for the L2 switches
  - Unknown destination: switches deliver in broadcast mode
  - Contention caused by multiple broadcasts
- Explicit routing tables suppress the needless traffic
  - Up to 12k routing entries per switch (512 nodes x 24 VLANs)
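
A hedged sketch of how such static forwarding entries could be generated and where the "up to 12k" comes from: every switch needs an entry for every (node MAC, VLAN) pair so that unknown-destination flooding never occurs, i.e. 512 x 24 = 12288 entries. The MAC scheme and port-selection function below are placeholders, not the actual Jülich scripts:

```python
N_NODES, N_VLANS = 512, 24

def node_mac(node):                       # made-up, locally administered MAC scheme
    return f"02:00:00:00:{node // 256:02x}:{node % 256:02x}"

def egress_port(switch, node, vlan):
    """Port on `switch` towards `node` within `vlan`; the real answer depends on the cabling."""
    return hash((switch, node, vlan)) % 48 + 1   # stand-in for the real lookup

def static_fdb(switch):
    return [(node_mac(n), vlan, egress_port(switch, n, vlan))
            for n in range(N_NODES)
            for vlan in range(1, N_VLANS + 1)]

entries = static_fdb("leaf-01")
print(len(entries))                       # 12288 -> the ~12k entries per switch
```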

Routing [figure-only slide: routing through the crossbar]


ALiCEnext
- Advanced Linux Cluster Engine - next generation
- Dual Opteron nodes
- Gigabit Ethernet network
- 512 nodes / 1024 CPUs
- 1 TByte memory
- ~150 TByte hard disk
- ~3.6 TFlops peak performance
- Installation: June 2004

ALiCEnext [photo slides]

ALiCEnext Hardware
- Nodes: 512 Tyan dual-Opteron blades
- Processors: 1.8 GHz Opteron 244
- Cache: 1 MByte (L2), 64k/64k (L1)
- Memory: 2 GByte/node (1 TByte total)
- Disks: 320 GByte/node (160 TByte total)
- Connectivity: Gigabit Ethernet
  - 2D (4x4) torus mesh for QCD
  - Ethernet crossbar
- Power: 140 kW
- Price: 1.5 M€

ALiCEnext nodes
- Tyan dual-Opteron blades
- Opteron 244 @ 1.8 GHz, 1 MByte level-2 cache
- 2 GByte ECC RAM @ 3.2 GByte/sec
- 2 x 160 GByte IDE hard disks
- 1 x 64-bit PCI-X slot @ 100 MHz
- 2 x Gigabit Ethernet on board (Broadcom BCM 5700)
- 1 x quad Gigabit Ethernet card (Broadcom BCM 5700)

Switches
- SMC 8648T
- Supports:
  - VLANs
  - Static routing tables at MAC level (up to 16k entries)
  - STA / MST
  - Load/store of the configuration via tftp
  - Status requests via SNMP
- Non-blocking switching architecture
- Aggregate bandwidth: 96 Gbps
- Price: ~2000 €

Implementation
- The full setup needs 12k entries in the routing tables: no manual configuration possible
- Configuration & monitoring via a set of scripts:
  - Collect information from nodes / switches
  - Consistency tests
  - Generate configuration files
  - Deployment onto the switches
  - Status control
  - Reconfiguration
- Last but not least: put 1024 cables into place!
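
Purely to illustrate the "generate then deploy" part of such a script set (the config syntax, function names and file layout are assumptions, not the actual Jülich tooling; the switches would later pull their file via tftp, as listed above):

```python
import os

def generate_switch_config(switch, vlans, fdb_entries):
    """Render a plain-text config: VLAN list plus static MAC forwarding entries."""
    lines = [f"! config for {switch}"]
    lines += [f"vlan {v}" for v in vlans]
    lines += [f"mac-address-table static {mac} vlan {vlan} interface {port}"
              for mac, vlan, port in fdb_entries]          # syntax is only schematic
    return "\n".join(lines) + "\n"

def write_for_tftp(configs, directory="tftpboot"):
    os.makedirs(directory, exist_ok=True)
    for switch, text in configs.items():
        with open(os.path.join(directory, f"{switch}.cfg"), "w") as fh:
            fh.write(text)                                  # one file per switch to fetch

cfg = generate_switch_config("leaf-01", vlans=range(1, 25),
                             fdb_entries=[("02:00:00:00:00:01", 1, 5)])
write_for_tftp({"leaf-01": cfg})
```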

Results
- Results from the prototype implementation:
  - 144 nodes
  - 6 level-1 switches
  - 4 level-2 switches
- Tests with the Pallas MPI Benchmark (sendrecv & pingpong)
- Building blocks:
  - 2 nodes back-to-back: 214.3 MB/sec, 18.6 µsec
  - 2 nodes with one switch: 210.2 MB/sec, 21.5 µsec
- Almost no bandwidth loss; ~3 µsec latency per switch stage
- Low Ethernet latency thanks to the ParaStation middleware

Results
- Crossbar with 3 stages:
  - Expect ~9 µsec total latency
  - No bandwidth loss, even if all nodes communicate
- Actual test: 140 processes / 70 pairs
  - Communication partners connected to different level-1 switches
  - All traffic goes through the level-2 switches
- Result per pair (worst): 210.4 MB/sec, 28.0 µsec
  - Average throughput ~5% higher
- Total throughput: > 15 GB/sec full bi-sectional bandwidth!
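
For completeness, the numbers above can be cross-checked (my arithmetic, not part of the slides):

\[
  \frac{28.0\,\mu\text{s} - 18.6\,\mu\text{s}}{3\ \text{stages}} \approx 3.1\,\mu\text{s/stage},
  \qquad
  70 \times 210.4\ \text{MB/s} \approx 14.7\ \text{GB/s}.
\]

The difference 28.0 − 18.6 = 9.4 µs is exactly the "observed switch latency" quoted on the summary slide, and with the roughly 5% higher average throughput per pair the aggregate indeed exceeds 15 GB/s.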

Summary
- Ethernet Clos switches are possible
- Doable with off-the-shelf components
- Full bi-sectional bandwidth
  - Total bandwidth of the 144-port prototype: > 15 GB/sec
  - Observed switch latency of the prototype: 9.4 µsec
  - Expected bandwidth of the full system (528 ports): > 50 GB/sec
  - No change in latency expected
- Cost of the ALiCEnext Clos network: < 125 € / port
- Patent pending (Uni Wuppertal, FZ Jülich, ParTec)

Achieved Features
- The configuration is completely transparent to the end nodes
  - Nodes see each other
  - (Static) traffic shaping possible
- Broadcasts work (and are restricted to a single VLAN)
- ARP requests work (for nodes)
  - Request sent (as broadcast) via the requester's VLAN
  - Reply sent (as broadcast) via the responder's VLAN
- Even multicasts work: used for installation of nodes via systemimager / flamethrower

Future (Cluster) Plans in Jülich: Project JuLi [section slide]

Infrastructure Jülich [figure-only slide]

Infrastructure Jülich
- JUMP: IBM p690, 1312 Power4 @ 1.7 GHz, 41 x 32-way SMP, 8.9 TFlop/s, 5.2 TB memory, Federation switch (5.5 µs latency, 1.4 GB/s), AIX 5.2
- JUBL: IBM BlueGene, 16384 PowerPC 440 @ 700 MHz, 8 x (2x16) x 32, 45.8 TFlop/s, 4.1 TB memory, 3D torus, µk/Linux, SLES9
- Cray XD1: 120 AMD Opteron 248 @ 2.2 GHz, 12 FPGAs, 60 x 2-way SMP, 528 GFlop/s, 240 GB memory, RapidArray fat tree (2.2 µs latency, 2.3 GB/s), SLES9
- NICole: 32 AMD Opteron 250 @ 2.4 GHz, 16 x 2-way SMP, 153.6 GFlop/s, 128 GB memory, GigE, SuSE 9.3

Major Upgrade End 2007
- Jülich's main idea: two systems instead of one
- Capability system
  - Only for a few selected groups with very demanding needs
  - Applications have to (be made to) fit the machine
  - Very scalable system
  - Most probably BlueGene-like
- Capacity system
  - Open for all users
  - Less scalable applications
  - Applications should run out of the box (more or less)
  - More general purpose
  - Optimal integration into the existing environment
  - Most probably cluster-like

Prototype JuLi
- Philosophy: build clusters from the best components
  - CPU: Power platform (PowerPC 970)
  - Interconnect: InfiniPath ("better InfiniBand")
  - Middleware: ParaStation
- Actual hardware:
  - 56 x IBM JS21 PowerPC blades
    - 2 x dual-core PowerPC 970 (2.5 GHz)
    - 2 GB DDR2 SDRAM, 36 GByte SAS HD
  - BladeCenter H chassis
  - PathScale InfiniPath HS-DC
  - Voltaire ISR 9096 InfiniBand switch
  - DS4100 storage server with 14 x 400 GB disks (= 5.6 TBytes)
- Expected delivery: July 2006 (right after ISC)

JuLi Plans
- Test of production mode
  - Selected users / applications
  - Stability tests
  - Feasibility of maintenance
- Integration tests
  - Interoperability with JUMP, JUBL
  - How do clusters fit into our storage environment?
- Test of alternative components
  - Storage: GPFS vs. TerraScale vs. Lustre (vs. PVFS?)
  - Batch system: LoadLeveler vs. Torque (vs. Sun GridEngine?)

Thank you
Norbert Eicker, John von Neumann Institute for Computing, Research Centre Jülich