Know your Cluster Bottlenecks and Maximize Performance




Know your Cluster Bottlenecks and Maximize Performance
Hands-on training, March 2013

Agenda
- Overview
- Performance Factors
- General System Configuration
  - PCI Express (PCIe) Capabilities
  - Memory Configuration
  - BIOS Settings
  - Tuning Power Management
  - Tuning for NUMA Architecture
- Validating the Network
  - Node Validation
  - Subnet Manager Validation
  - Validating an Error Free Network with ibdiagnet
  - Expected Link Speed/Width
  - Cable Validation

Overview
Depending on the application of the user's system, it may be necessary to modify the default configuration of the network adapters and of the system/chipset. This slide deck describes common tuning parameters, settings and procedures that can improve the performance of the network adapter. Different server and NIC vendors may recommend different values, but the general tuning approach should be similar. For the hands-on demo we will use Mellanox ConnectX adapters, so we will apply the recommended settings issued by Mellanox.

Performance Factors
The performance-affecting factors are divided into two areas:
- Single System Configuration
  - PCI Express (PCIe) Capabilities
  - Memory Configuration
  - BIOS Settings
  - Tuning power management
  - Tuning for NUMA Architecture
- Network Configuration
  - Error free fabric
  - Expected link speed/width
  - Correct topology
  - QoS

Single System Configuration

General System Configurations - PCIe BW Limitations

  PCIe Generation   Speed    Width        PCIe BW Constraint
  2.0               5GT/s    x8           x8 * 5Gbps * (8/10) = 32Gbps; accounting for PCIe flow control and headers, a practical maximum of ~27-29Gbps
  3.0               8GT/s    x8 or x16    x8 * 8Gbps * (128/130) = 63Gbps; accounting for PCIe flow control and headers, a practical maximum of ~52-54Gbps

InfiniBand link rates:
- QDR: x4 * 10Gbps * (8/10) = max BW 32Gbps
- FDR (FDR14): x4 * 14.0625Gbps * (64/66) = max BW ~54.5Gbps (56Gbps nominal)

Resulting bottlenecks:
- QDR over PCIe Gen2 x8: PCIe limits BW at ~27Gbps
- QDR over PCIe Gen3 x8: IB limits max BW at 32Gbps
- FDR over PCIe Gen2 x8: PCIe limits BW at ~27Gbps
- FDR over PCIe Gen3 x8: PCIe limits max BW at 52-54Gbps
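The encoding arithmetic above can be reproduced with a minimal sketch; the lane counts and per-lane rates are the ones quoted on this slide, not measured values:

  awk 'BEGIN {
      # PCIe Gen2 x8: 5 GT/s per lane with 8b/10b encoding
      printf "Gen2 x8: %.0f Gbps before protocol overhead\n", 8 * 5 * 8/10
      # PCIe Gen3 x8: 8 GT/s per lane with 128b/130b encoding
      printf "Gen3 x8: %.0f Gbps before protocol overhead\n", 8 * 8 * 128/130
      # IB QDR x4: 10 Gbps per lane with 8b/10b encoding
      printf "QDR x4 : %.0f Gbps\n", 4 * 10 * 8/10
      # IB FDR x4: 14.0625 Gbps per lane with 64b/66b encoding
      printf "FDR x4 : %.1f Gbps\n", 4 * 14.0625 * 64/66
  }'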

Network Configuration - PCIe Demo
- Running a simple RDMA test on port 1 between two Gen3 servers:
  Server: ib_write_bw -d mlx5_0 -i 1 --run_infinitely
  Client: ib_write_bw -d mlx5_0 -i 1 --run_infinitely mtlae01
- Query the device capabilities using: lspci -s [slot_num] -xxxvvv
  - Find the PCIe capability register (using # lspci -v)
  - Locate the Link Capabilities Register (offset 0Ch) and read the Max Link Speed (bits 3:0) and Max Link Width (bits 9:4)
  - Locate the Link Status Register (offset 12h) and read the Current Link Speed (bits 3:0) and Width (bits 9:4)
- Verify the device has come up at the maximum supported speed/width (see the sketch below for a quicker check).
- In a real-life scenario, when encountering a PCIe issue the user will have to check both the card and the board itself.
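Instead of decoding the raw config space by hand, lspci can print the decoded capability. A hedged convenience sketch; the slot address below is an example placeholder, substitute your adapter's:

  SLOT="${1:-04:00.0}"                 # example slot address, adjust to your adapter
  # LnkCap shows what the device supports, LnkSta what was actually negotiated;
  # a Speed or Width mismatch between the two usually points at the slot, riser or BIOS.
  lspci -s "$SLOT" -vv | grep -E "LnkCap:|LnkSta:"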

General System Configurations - Tuning for NUMA Architecture
From Wikipedia: Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors).

Tuning for Intel Microarchitecture Code name Sandy Bridge
The Intel Sandy Bridge processor has an integrated PCI Express controller, so every PCIe adapter is connected directly to a NUMA node. On a system with more than one NUMA node, performance will be better when using the local NUMA node to which the PCIe adapter is connected. In order to identify the adapter's NUMA node, the system BIOS must support the proper ACPI feature. To see whether your system supports PCIe adapter NUMA node detection, run:
# cat /sys/devices/[pci root]/[pcie function]/numa_node
or
# cat /sys/class/net/[interface]/device/numa_node
Example for a supported system:
# cat /sys/devices/pci0000\:00/0000\:00\:05.0/numa_node
0
Example for an unsupported system:
# cat /sys/devices/pci0000\:00/0000\:00\:05.0/numa_node
-1
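A short sketch that applies the check above to every Mellanox PCI function at once (15b3 is Mellanox's PCI vendor ID; the sysfs paths are the standard ones and are assumed to be present):

  for slot in $(lspci -d 15b3: | awk '{print $1}'); do
      node=$(cat "/sys/bus/pci/devices/0000:$slot/numa_node")
      echo "0000:$slot -> NUMA node $node"   # -1 means the BIOS/ACPI does not report it
  done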

General System Configurations - Tuning for NUMA Architecture
Tuning for AMD Architecture
On AMD architecture there is a difference between a 2-socket system and a 4-socket system. In order to identify the adapter's NUMA node, the system BIOS must support the proper ACPI feature.
- On a 2-socket system the PCIe adapter will be connected to socket 0 (nodes 0,1).
- On a 4-socket system the PCIe adapter will be connected either to socket 0 (nodes 0,1) or to socket 3 (nodes 6,7).

General System Configurations - Tuning for NUMA Architecture
Recognizing NUMA Node Cores
To recognize the cores of a NUMA node, run:
# cat /sys/devices/system/node/node[X]/cpulist
# cat /sys/devices/system/node/node[X]/cpumap
Example:
# cat /sys/devices/system/node/node1/cpulist
1,3,5,7,9,11,13,15
# cat /sys/devices/system/node/node1/cpumap
0000aaaa
or simply:
# lscpu

Running an Application on a Certain NUMA Node
To run an application on a certain NUMA node, the process affinity should be set either on the command line or with an external tool. For example, if the adapter's NUMA node is 1 and NUMA node 1 cores are 8-15, then a multi-threaded application should run with a process affinity that uses cores 8-15 only:
# taskset -c 8-15 ib_write_bw -a
or:
# taskset 0xff00 ib_write_bw -a
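A hedged wrapper sketch that ties the two steps together: look up the adapter's NUMA node, read its core list, and pin the benchmark to those cores. The interface name is a placeholder for this example:

  IFACE="${IFACE:-ib0}"                                   # placeholder interface name
  NODE=$(cat /sys/class/net/$IFACE/device/numa_node)
  if [ "$NODE" -lt 0 ]; then
      echo "BIOS/ACPI does not report a NUMA node for $IFACE" >&2
      exit 1
  fi
  CORES=$(cat /sys/devices/system/node/node$NODE/cpulist)
  echo "$IFACE is local to NUMA node $NODE (cores $CORES)"
  taskset -c "$CORES" ib_write_bw -a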

General System Configurations - Memory Configuration
Common knowledge: for high performance it is recommended to use the highest memory speed with the fewest DIMMs. Please consult your vendor's memory configuration instructions, or the memory configuration tool available online, for tuning the system's memory to maximum performance.

General System Configurations - Recommended BIOS Settings
General: set BIOS power management to Maximum Performance.
NOTE: These performance optimizations may result in higher power consumption.

Intel Processors - recommended BIOS settings for machines with Intel Nehalem-based processors:

  Category    BIOS Option                Value
  General     Operating mode             Performance
  Processor   C-States                   Disabled
  Processor   Turbo Mode                 Disabled
  Processor   CPU Frequency              Max performance
  Memory      Memory speed               Max performance
  Memory      Memory channel mode        Independent
  Memory      Socket Interleaving        NUMA
  Memory      Memory Node Interleaving   OFF
  Memory      Patrol Scrubbing           Disabled
  Memory      Demand Scrubbing           Enabled
  Memory      Thermal Mode               Performance
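An OS-side sanity check sketch after the BIOS change, assuming the cpufreq sysfs interface is available. The Linux governor is separate from the BIOS profile, but a single "performance" value across all cores avoids OS-side frequency scaling being layered on top of it:

  cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c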

General System Configurations - Power Management Tuning C-States
In order to save energy when the CPU is idle, the CPU can be commanded to enter a low-power mode. Each CPU has several power modes, collectively called C-States. These states are numbered starting at C0, which is the normal CPU operating mode, i.e. the CPU is 100% turned on. The higher the C number, the deeper the CPU sleep mode, i.e. more circuits and signals are turned off, prolonging the time it will take the CPU to return to C0.
Some operating systems can override the BIOS power management configuration and enable C-states by default, which results in higher latency. To resolve the high latency issue, follow the instructions below:
1. Edit /boot/grub/grub.conf or any other bootloader configuration file.
2. Add the following kernel parameters to the bootloader command line: intel_idle.max_cstate=0 processor.max_cstate=1
3. Reboot the system.
Example:
  title RH6.2x64
  root (hd0,0)
  kernel /vmlinuz-rh6.2x64-2.6.32-220.el6.x86_64 root=UUID=817c207b-c0e8-4ed9-9c33-c589c0bb566f console=tty0 console=ttyS0,115200n8 rhgb intel_idle.max_cstate=0 processor.max_cstate=1
  initrd /initramfs-rh6.2x64-2.6.32-220.el6.x86_64.img
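A verification sketch for after the reboot, assuming the intel_idle driver and the cpuidle sysfs interface (paths may differ on other distributions):

  # Should print 0 if intel_idle.max_cstate=0 was picked up at boot
  cat /sys/module/intel_idle/parameters/max_cstate 2>/dev/null
  # Lists the idle states the kernel still exposes on core 0
  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name 2>/dev/null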

General System Configurations - Checking Core Frequency
Check that the reported CPU frequency of each core equals the maximum supported frequency and that all core frequencies are consistent.
- Check the maximum supported CPU frequency:
  # cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq
- Check that core frequencies are consistent:
  # cat /proc/cpuinfo | grep "cpu MHz"
- Check that the reported frequencies match the maximum supported.
If the CPU frequency is not at the maximum, check the BIOS settings against the Recommended BIOS Settings table above to verify that the power states are disabled.
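A compact sketch of the same consistency check; one distinct frequency value, equal to the supported maximum, is the expected outcome:

  grep "cpu MHz" /proc/cpuinfo | sort | uniq -c              # distinct core frequencies and their counts
  cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq  # maximum supported frequency (kHz)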

Network Configuration

Network Configuration - Node Validation
Validating the nodes' condition:
- All nodes responding, and running the same kernel version, e.g.
  # pdsh -w node[01-10] /bin/uname -r | dshbak -c
- All nodes have the same OFED version, e.g.
  # pdsh -w node[01-10] "/usr/bin/ofed_info | head -n 1" | dshbak -c
- HCAs are attached to the driver and have the expected FW version, e.g.
  # pdsh -w node[01-10] "ibstat | grep Firmware" | dshbak -c
- Validate the port state (after the SM has run), e.g.
  # pdsh -w node[01-10] "ibstat | grep State" | dshbak -c
  State: Active/Down
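If pdsh/dshbak are not installed, an equivalent minimal sketch over plain ssh; the node names are placeholders:

  for n in node01 node02 node03; do
      echo "== $n =="
      ssh "$n" 'uname -r; ofed_info | head -n 1; ibstat | grep -E "Firmware|State"'
  done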

Network Configuration - Validating an Error Free Network with ibdiagnet
Fabric cleanup is used to filter out bad hardware, including bad cables, bad ports or bad connectivity.
Fabric cleanup algorithm (a scripted sketch follows below):
1. Zero all counters (ibdiagnet -pc)
2. Wait X time
3. Check for errors exceeding the allowed threshold during this X time (ibdiagnet -lw 4x -ls 10 -P all=1)
4. Fix the problematic links (re-seat or swap cables, replace switch ports or HCAs, etc.)
5. Go to 1
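A sketch of that loop as a script; the 300-second soak time is an example value, and the ibdiagnet flags are the ones quoted on this slide:

  while true; do
      ibdiagnet -pc                        # step 1: zero all port counters
      sleep 300                            # step 2: let traffic run for X time
      ibdiagnet -lw 4x -ls 10 -P all=1     # step 3: report counters that exceeded the threshold
      # step 4 is manual: re-seat/swap the flagged cables, replace switch ports or HCAs
      read -r -p "Links fixed - run another pass? [y/N] " ans
      [ "$ans" = "y" ] || break            # step 5: repeat until the fabric comes up clean
  done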

Network Configuration - Validating an Error Free Network with ibdiagnet
Useful ibdiagnet options:
  -p  | --port <port-num>
  -pc : reset all the fabric links' PM counters
  -P  | --counter <<PM>=<value>>
  -r  | --routing : provides a report of the fabric qualities
  -u  | --fat_tree : indicates that Up/Down credit loop checking should be done against automatically determined roots
  -o  | --output_path <out-dir>
  -h  | --help
  -V  | --version

Network Configuration - Validating an Error Free Network with ibdiagnet
Symbol and Receive Errors Thresholds
- Thresholds are link speed and target BER dependent. For a QDR link (40x10^9 bits/s) with an expected BER of 10^-12 (IB spec), the threshold is 10^12 / (40x10^9) = 25 sec/err. For a 10^-15 BER: 10^15 / (40x10^9) = 25,000 sec/err, i.e. ~7 hours/err.
- Link down and link retrain: both of these counters should be zero.
- Rcv Remote Phys Errors: indicates an issue (Symbol or Receive errors) on some remote port.
- VL15Dropped: in most cases should be ignored, as it counts MADs that are received before the link is up.
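The same threshold arithmetic as a small sketch, so it can be redone for other line rates or target BERs; the 40 Gbps value is the QDR signaling rate used above:

  awk 'BEGIN {
      rate = 40e9                                                  # QDR signaling rate in bits/s
      printf "BER 1e-12: %.0f sec/err\n", 1e12 / rate
      printf "BER 1e-15: %.0f sec/err (~%.0f hours)\n", 1e15 / rate, (1e15 / rate) / 3600
  }'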

Network Configuration - Subnet Manager Optimization
One of the most common issues is not having an SM running; there are various scenarios where the SM may have dropped and no other SM has been configured on the subnet.
- Check the presence of the SM using the sminfo command.
- Check there are no errors with the SM by looking for error messages in the SM log (/var/log/opensm.log).
- Select the right routing algorithm, using opensm -R <routing engine> or the routing_engine parameter in the opensm config file.
  - For fat-tree or up-down routing a GUID file may be needed. The GUID file is a text file listing the GUIDs of the switch ASICs at the highest level of the hierarchy.
- Validate that routing succeeded in the opensm log (/var/log/opensm.log) and that there are no errors in the log file.
- Use ibdmchk to validate routing:
  # ibdmchk -s ./opensm-subnet.lst -f ./opensm.fdbs -m ./opensm.mcfdbs
- QoS settings - SL2VL mapping.
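A quick health-check sketch, assuming opensm logs to the default /var/log/opensm.log path shown above; the "SUBNET UP" marker is the message OpenSM typically prints when it finishes configuring the subnet:

  sminfo || echo "No subnet manager answered on this subnet" >&2
  grep -icE "error|fail" /var/log/opensm.log           # count of error/failure lines in the SM log
  grep -i "SUBNET UP" /var/log/opensm.log | tail -n 1  # last time routing completed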

Network Configuration - Link Speed/Width & Cable Validation
- Validate the link speed and width using ibportstate or any other OpenIB diagnostic tool.
- Validate the fabric's cables using the cable diagnostic plugin in ibdiagnet:
  --get_cable_info : query all QSFP cables for cable information; the information is stored in "ibdiagnet2.cables".
  --cable_info_disconnected : get cable info on disconnected ports.
- Validate that the correct cables are being used on the respective links.
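A sketch of the cable check described above; the output directory assumes the ibdiagnet2 default, so adjust the path if -o/--output_path was used:

  ibdiagnet --get_cable_info
  # the cables report is written next to the other ibdiagnet2 output files
  grep -iE "vendor|part number|length" /var/tmp/ibdiagnet2/ibdiagnet2.cables | head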

Network Configuration - Congestion Example
- Disabling one of the links between the switches while running RDMA tests between two interfaces on both servers: the SM configures the routing so that traffic is divided between the remaining ports.
- Reducing the link width of one of the switch connections: the SM won't re-route, and we'll see the BW drop only on the interfaces that are routed through the link with the reduced width, while the other link keeps performing well.

Thank You