Network Contention and Congestion Control: Lustre Fine-Grained Routing

Network Contention and Congestion Control: Lustre Fine-Grained Routing. Matt Ezell, HPC Systems Administrator, Oak Ridge Leadership Computing Facility, Oak Ridge National Laboratory. International Workshop on the Lustre Ecosystem: Challenges and Opportunities, March 3-4, 2015. ORNL is managed by UT-Battelle for the US Department of Energy.

Fat Tree Network 3 Image from http://clusterdesign.org/fat-trees/fat_tree_varying_ports/

Fat Tree Network (Folded Clos Network), with nodes labeled Client1, Server A, and Server B. 4 Image from http://clusterdesign.org/fat-trees/60_nodes_labeled/

Fat Tree Network (Folded Clos Network). The network can be non-blocking (1:1) or blocking (x:1). Congestion and hot spots can (significantly) reduce performance, even in an expensive full-bisection-bandwidth setup; it is worse in under-provisioned or irregular networks. 5

InfiniBand Routing. InfiniBand uses a Linear Forwarding Table (LFT): a given destination always uses the same output port. 6

  Destination LID   Output Port
        7               2
       59               3
       66               3
      101               5
      121              11
      200              11
      232              20
      299              27
      341              32
      351              35
      404              35
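
For reference, the LFT programmed into a switch can be dumped with the ibroute utility from the infiniband-diags package; a minimal sketch (LID 7 here is an illustrative switch LID, not one taken from this fabric):

  ibroute 7    # print the linear forwarding table (destination LID -> output port) of the switch at LID 7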

Hot Spot Example (4-port switches): Spine 1 and Spine 2 connect down to Leaf 1 through Leaf 4, with nodes 1 through 8 attached to the leaves. What happens when node 1 sends to node 3 at the same time as node 2 sends to node 5? 7

Gemini 3D Torus 8 Image from http://clusterdesign.org/torus/

9 Gemini Dimension Ordered Routing

How fast are Gemini Links? It depends! Link speed depends on the link type, and protocol overhead is around 35% for large messages. 10

  Link Type     Link Data Rate   Number of Links   Raw Bitrate   Data Rate
  Y-Mezzanine   6.25 Gbps        12                9.375 GB/s    ~6 GB/s
  Z-Backplane   5.0 Gbps         24                15 GB/s       ~9.75 GB/s
  X, Z Cable    3.125 Gbps       24                9.375 GB/s    ~6 GB/s
  Y Cable       3.125 Gbps       12                4.6875 GB/s   ~3 GB/s
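
As a sanity check on the table, the raw bitrate is the number of links times the per-link rate, and the usable rate is roughly 65% of raw once the ~35% protocol overhead is removed; a quick sketch (the 0.65 factor is an assumption derived from the overhead figure above):

  echo "scale=3; 12 * 6.25 / 8" | bc          # Y-Mezzanine raw: 9.375 GB/s
  echo "scale=3; 12 * 6.25 / 8 * 0.65" | bc   # Y-Mezzanine usable: ~6.1 GB/s
  echo "scale=3; 24 * 5.0 / 8 * 0.65" | bc    # Z-Backplane usable: ~9.75 GB/s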

Geometric Bandwidth Reductions (figure showing reduction ratios of 1/12, 1/24, 1/48, and 1/8). Titan is a 25 x 16 x 24 torus. 11

Dragonfly Network (diagram: eight groups of 384 nodes each). 12

Dragonfly Network. In an Aries network, global bandwidth can be increased by purchasing more links. Adaptive routing can ease congestion, but it makes it difficult to predict the path that a packet will take. 13

The Million Dollar Question? 14

15 LNET Routing

LNET Routing. NIDs take the form identifier@network, for example 10.10.10.101@o2ib0, 10.10.10.101@o2ib201, 9409@gni0, 9409@gni101. 16
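
On a live node, the NIDs currently configured can be listed with lctl; a minimal sketch (the output shown simply reuses the example NIDs above):

  lctl list_nids
  10.10.10.101@o2ib0
  10.10.10.101@o2ib201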

LNET Routing. Routes are configured statically with the module parameter options lnet routes="<remotenet> [<hops>] <id>@<localnet>", or added at runtime with lctl --net <remotenet> add_route <id>@<localnet> [<hops>]. Example routing table entries: 17

  net o2ib225 hops 10 gw 18329@gni102 up
  net o2ib225 hops  1 gw 10871@gni102 up
  net o2ib234 hops 10 gw  5991@gni102 up
  net o2ib234 hops 10 gw 18247@gni102 up
  net o2ib234 hops  1 gw 10921@gni102 up
  net o2ib208 hops 10 gw 15218@gni102 up
  net o2ib208 hops 10 gw  2946@gni102 up
  net o2ib208 hops  1 gw  7788@gni102 up
  net o2ib217 hops 10 gw 15212@gni102 up
  net o2ib217 hops 10 gw  2908@gni102 up
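
Filled in with values from the listing above, a static configuration on a Gemini client (local net gni102) might look something like the following sketch; it assumes the convention that hops 1 marks the preferred fine-grained gateway and hops 10 a fallback:

  options lnet routes="o2ib225 1 10871@gni102; o2ib225 10 18329@gni102; o2ib234 1 10921@gni102; o2ib234 10 5991@gni102; o2ib234 10 18247@gni102"

The same route can also be added at runtime following the lctl form above, e.g. lctl --net o2ib225 add_route 10871@gni102 1.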

LNET Routing decision flow: start; is the remote net the same as the local net? If yes, deliver directly. If no, find the best router and deliver to that router. End. 18

Why Route? Control bandwidth by varying the number of routers. Control I/O paths by selecting which routers to use. Control route computation by partitioning fabrics. And if you are using multiple network types, you have to route traffic. 19

Initial Implementation for Jaguar. Focused on relieving contention within the IB network only. (Diagram: Jaguar XT5 connects through two IB core switches and IB leaf switches to the Lustre servers.) 20

Atlas Layout (diagram: InfiniBand switches A through I). 21

Atlas Layout (diagram: OSSs connected to InfiniBand Switches A through I). 22

23 OSS Connections (diagram: DDN controller pairs 1i1/1i2, 1a1/1a2, and 1b1/1b2 feed OSS1i1 through OSS1i8, OSS1a1 through OSS1a8, and OSS1b1 through OSS1b8, which connect to InfiniBand Switches I, A, B, and C).

How many routers? Performance Goal: > 1 TB/s. Router Performance: 2.6 GB/s. Required Router Count: at least 385. Deployed Router Count: 432. 24
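
The required count is just the bandwidth goal divided by per-router throughput:

  echo "scale=1; 1000 / 2.6" | bc   # 384.6, so at least 385 routers for 1 TB/s at 2.6 GB/s each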

Fine-Grained Routing (FGR) Group (diagram): OSS1a5 through OSS1a8 and OSS1b1 through OSS1b4 on InfiniBand Switch B are served by routers RTR1b-1 through RTR1b-12, an 8:12 OSS-to-router ratio. 25

Router Module (diagram): Node 0 and Node 1 share Gemini 0 and connect to Atlas1 InfiniBand Switches 1A and 2A; Node 2 and Node 3 share Gemini 1 and connect to Atlas2 InfiniBand Switches 3A and 4A. 26

Router Layout (diagram: router placement across the 25 x 8 cabinet grid). 27

28 Router Placement

29 Fine-Grained Routing Configuration

Routing Group (diagram: one routing group highlighted on the 25 x 8 cabinet grid). 30

FGR on an XC30. Our XC30 is only two groups, and we have not measured significant contention in the Aries network. Nine routers with 4 interfaces each provide connectivity to all 36 TOR switches. The routers provide a higher percentage of peak bandwidth compared to our XK7. 31

FGR on an IB Cluster. Compute nodes and routers connect to a 648-port core switch. 36 routers connect directly to the 36 TOR switches. 32

FGR Best Practices: Understand your topology. Understand your application I/O workload. Create large FGR groups. Create FGR groups in both directions. Verify all settings and physical connections (see the verification sketch below). 33
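
For the last point, the configured routes and gateway reachability can be checked from any node; a minimal sketch (the NID is one of the illustrative gateways above, and the route-listing interface varies with Lustre version):

  lctl route_list              # or: cat /proc/sys/lnet/routes
  lctl ping 10871@gni102       # confirm the gateway's LNET responds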

Questions? ezellma@ornl.gov 34