Scripts for InfiniBand Monitoring



Similar documents
Network Contention and Congestion Control: Lustre FineGrained Routing

Introduction to Infiniband. Hussein N. Harake, Performance U! Winter School

Network Monitoring. SAN Discovery and Topology Mapping. Device Discovery. Send documentation comments to

Lustre & Cluster. - monitoring the whole thing Erich Focht

Monitoring the Network

QLogic SRP Module on Linux for OpenFabrics and InfiniPath Version Table of Contents

RapidIO Network Management and Diagnostics

Mellanox Academy Online Training (E-learning)

Cisco SFS 7000D Series InfiniBand Server Switches

Maintaining Non-Stop Services with Multi Layer Monitoring

Jason Hill HPC Operations Group ORNL Cray User s Group 2011, Fairbanks, AK

Application Centric Infrastructure Object-Oriented Data Model: Gain Advanced Network Control and Programmability

Mellanox Academy Course Catalog. Empower your organization with a new world of educational possibilities

Direct Attached Storage

Hardware Monitoring with the new Nagios IPMI Plugin

Cisco s Massively Scalable Data Center. Network Fabric for Warehouse Scale Computer

Cisco EXAM Implementing Cisco IP Telephony and Video, Part 2 (CIPTV2) Buy Full Product.

INAM - A Scalable InfiniBand Network Analysis and Monitoring Tool

Scaling Graphite Installations

Intel Simple Network Management Protocol (SNMP) Subagent v6.0

How To Monitor Infiniband Network Data From A Network On A Leaf Switch (Wired) On A Microsoft Powerbook (Wired Or Microsoft) On An Ipa (Wired/Wired) Or Ipa V2 (Wired V2)

Discovering Devices. The Cisco Prime Collaboration Manager discovery process involves three phases: Access-level discovery Cisco Prime CM:

Determining the health of Lustre filesystems at scale

Installation Guide Avi Networks Cloud Application Delivery Platform Integration with Cisco Application Policy Infrastructure

Cisco SFS 7000P InfiniBand Server Switch

Know your Cluster Bottlenecks and Maximize Performance

Discovering Devices. The Cisco Prime Collaboration Manager discovery process involves three phases: Access-level discovery Prime CM:

GRNET NOC network monitoring & visualization tools

Commotion Network Dashboard Application for Commotion Wireless Request for Development Proposals

Isilon IQ Network Configuration Guide

Cisco ACI Simulator Release Notes, Release 1.2(1i)

IP Multicasting. Applications with multiple receivers

Extending Networking to Fit the Cloud

S&C IntelliTeam CNMS Communication Network Management System Table of Contents Overview Topology

Scaling 10Gb/s Clustering at Wire-Speed

Integrating Autotask Service Desk Ticketing with the Cisco OnPlus Portal

Data Center Infrastructure of the future. Alexei Agueev, Systems Engineer

TELCO challenge: Learning and managing the network behavior

Router Architectures

Strengths and Limitations of Nagios as a Network Monitoring Solution

The following InfiniBand adaptor products based on Mellanox technologies are available from HP

Avaya Converged Network Analyzer, Version 3.1 Release Notes

ITSM Service Monitoring Using Open Source Tools

LSM RELEASE NOTES LOCKING SYSTEM MANAGEMENT SOFTWARE

WÜRTHPHOENIX NetEye Version 3

Talend Open Studio for ESB. Release Notes 5.2.1

How to Painlessly Discover What You Don't Know - Before It Bites You

Using the Advanced GUI

Monitoring Drupal with Sensu. John VanDyk Iowa State University DrupalCorn Iowa City August 10, 2013

Moab and Fabriscale s Fabric Manager White Paper

Best Practices for Monitoring Exadata Enterprise Manager Cloud Control 12c

Deploying Microsoft Operations Manager with the BIG-IP system and icontrol

Discovering Devices CHAPTER

I. ADDITIONAL EVALUATION RESULTS. A. Environment

Using WebLOAD to Monitor Your Production Environment

Oracle Communications WebRTC Session Controller: Basic Admin. Student Guide

IPMI Event Viewer Command Line Interface ***This is the user guide for the CLI only***

Technical Solution & White Paper. AFDX/ARINC664 Switch Testing JS, AIM GmbH

Configuring VMware vsphere 5.1 with Oracle ZFS Storage Appliance and Oracle Fabric Interconnect

IaaS Configuration for Cloud Platforms

Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks

MONITORING PERFORMANCE IN WINDOWS 7

EventSentry Overview. Part I About This Guide 1. Part II Overview 2. Part III Installation & Deployment 4. Part IV Monitoring Architecture 13

Configuring High Availability for Embedded NGX Gateways in SmartCenter

Implementing Cisco Data Center Unified Fabric Course DCUFI v5.0; 5 Days, Instructor-led

Industrial Adoption of Automatically Extracted GUI Models for Testing

Collax Monitoring with Nagios

4/25/2016 C. M. Boyd, Practical Data Visualization with JavaScript Talk Handout

Processing millions of logs with Logstash

IBM DataPower SOA Appliances & MQ Interoperability

Understand SIP trunk and registration in DWG gateway Version: 1.0 Dinstar Technologies Co., Ltd. Date:

Client Overview. Engagement Situation. Key Requirements

INDIAN INSTITUTE OF TECHNOLOGY KANPUR Department of Mechanical Engineering

NetApp E-Series Storage Systems

Basic & Advanced Administration for Citrix NetScaler 9.2

Mellanox Cloud and Database Acceleration Solution over Windows Server 2012 SMB Direct

Cloudera Manager Training: Hands-On Exercises

PROFINET IO Diagnostics 1

TCP Labs. WACREN Network Monitoring and Measurement Workshop Antoine Delvaux perfsonar developer

A Plan for the Continued Development of the DNS Statistics Collector

Power Saving Features in Mellanox Products

CSE331: Introduction to Networks and Security. Lecture 8 Fall 2006

DE1600 DSA E-Series iscsi Disk Arrays. en Quick Install Guide

NSi Mobile Installation Guide. Version 6.2

Data Analysis Load Balancer

Details. Some details on the core concepts:

Scalable Source Routing

PSM/SAK Event Log Error Codes

Zenoss Service Dynamics Impact and Event Management Installation and Administration

Open Network Install Environment (ONIE) LinuxCon North America 2015

LiveAction Application Note

Transaction Monitoring Version for AIX, Linux, and Windows. Reference IBM

Evaluation of Nagios for Real-time Cloud Virtual Machine Monitoring

System Monitoring Using NAGIOS, Cacti, and Prism.

HDFS Cluster Installation Automation for TupleWare

IBM BladeCenter H with Cisco VFrame Software A Comparison with HP Virtual Connect

Juniper Networks Management Pack Documentation

Analyzing large flow data sets using. visualization tools. modern open-source data search and. FloCon Max Putas

Efficient Video Distribution Networks with.multicast: IGMP Querier and PIM-DM

Cisco s Massively Scalable Data Center

Transcription:

Scripts for InfiniBand Monitoring Blake Caldwell April 30, 2015 ORNL is managed by UT-Battelle for the US Department of Energy

Monitoring tools IB Health Check D3.js visualizations Perfquery balancing Available on github: https://github.com/bacaldwell/ 2 Presentation_name

IB Health Check Host-based shell script Designed to be run from snmpd or from CLI Returns Nagios status code 0: OK 1: WARNING 2: CRITICAL 3: UNKNOWN 3 Presentation_name

IB Health Check HCA and local link health Local errors check (HCA port) Remote errors check (switch port) PCI width/speed of each HCA Identify failed hardware or firmware issues Appropriate slot placement Port in up/active state Link speed/width matches capability SM lid is set 4 Presentation_name

IB Health Check Output CLI output Two interfaces, no errors: mlx4_1-ib0:ok ;; mlx4_1-ib1:ok ;; Two interfaces, LinkDownedCounter above threshold triggers warning: mlx4_1-ib0:warning - Direct-attached ddn-d-1 HCA-1 port 2: [LinkDownedCounter == 48] ;; \ mlx4_1-ib1:warning - Direct-attached ddn-d-0 HCA-1 port 2: [LinkDownedCounter == 66] ;; Nagios view 5 Presentation_name

Configuration /usr/local/etc/monitor_ib_health.conf oss:mlx4_0:1:56 oss:mlx4_0:2:56 oss:mlx4_1:1:56 mds:mlx4_0:1:56 /usr/local/etc/ib_node_name_map.conf # Compute nodes 0x0002c903001c5e31 oss1 0x0002c90300dde441 oss2 # Core switch 0x0002c903008726a0 ibsw-core1 Spine 1 0x0002c903008726b0 ibsw-core1 Spine 2 0x0002c903008726c0 ibsw-core1 Spine 3 0x0002c903008476d0 ibsw-core1 Line 1 0x0002c90300847790 ibsw-core1 Line 2 0x0002c903008477f0 ibsw-core1 Line 3 6 Presentation_name

D3.js visualizations Explored pydot for visualizations IBUG 2013 Too much information, wanted ability to collapse by switch (or chassis) Discovered of D3.js from Florent s IBUG 2014 presentation JavaScript library written by Mike Bostock of New York Times Demo https://github.com/bacaldwell/ib_d3viz 7 Presentation_name

ib_d3viz Workflow Create ibnetdiscover output file: ibnetdiscover -p --node-name-map > fabric_ibnetdisc.out Create JSON file describing a graph: python ib_topology_graph.py -g fabric_graph.json fabric_ibnetdisc.out Creates JSON file describing a graph: links : [ { source : 0, target : 1320, ], nodes : [ { group : 0, name : node1601 HCA-1, 8 Presentation_name

Full topology with data overlay 9 Presentation_name

Clustered graph with collapsible nodes 10 Presentation_name

Balanced perfqueries 1. Ports that are the direct neighbor of a node will always get assigned to that node for querying. 2. All link endpoints that make up paths between the all nodes part of the current job are identified. The number of links identified at this step relates to the number of nodes in the job, but grows very quickly nearing the total number of links in the fabric. 3. These links are allocated to nodes that are found on in the switch s forwarding table (LFT) for that link s port on the switch. This is in the reverse direction from the perfqueries, so routes will not be completely balanced. However, the hop count will remain minimized in the reverse direction. 4. From step 3 some imbalances will arise, but are kept in check by a static threshold of a difference of 5 between the node assigned the least number of link endpoints to query and the most loaded node. If the node picked by step 3 is past the threshold, the allocation script will instead choose the node with the least number of LID/ports assigned to it. 5. The per-node list of link endpoints to query is written to a file to be read later by each node when starting the actual queries. 11 Presentation_name

Distribution of queries Spine_1 Line_1 Line_2 HCA_0 HCA_1 HCA_2 HCA_3 HCA_4 HCA_5 HCA_6 HCA_7 HCA_8 HCA_9 12 Presentation_name

Start with local port and direct neighbor Spine_1 Line_1 Line_2 HCA_0 HCA_1 HCA_2 HCA_3 HCA_4 HCA_5 HCA_6 HCA_7 HCA_8 HCA_9 13 Presentation_name

Identify what else needs to be queried Spine_1 Line_1 Line_2 HCA_0 HCA_1 HCA_2 HCA_3 HCA_4 HCA_5 HCA_6 HCA_7 HCA_8 HCA_9 14 Presentation_name

Use lft to prune far HCAs Spine_1 Line_1 Line_2 HCA_0 HCA_1 HCA_2 HCA_3 HCA_4 HCA_5 HCA_6 HCA_7 HCA_8 HCA_9 15 Presentation_name

Take per-node list and run perfquery Spine_1 Line_1 Line_2 HCA_0 HCA_1 HCA_2 HCA_3 HCA_4 HCA_5 HCA_6 HCA_7 HCA_8 HCA_9 16 Presentation_name

Summary InfiniBand monitoring tools on github: https://github.com/bacaldwell 1. scalable-monitoring/monitor_ib_health 2. Ib_d3viz 3. balanced-perfquery Please contribute! J 17 Presentation_name