Performance Analysis and Tuning in Windows HPC Server 2008. Xavier Pillons Program Manager Microsoft Corp. xpillons@microsoft.com


Introduction: How to monitor performance on Windows? What to look for? How to tune the system? How to trace MS-MPI?

MEASURING PERFORMANCE

Performance Analysis. Cluster-wide: built-in diagnostics, the Heatmap. Local: Perfmon, xperf.

Built-in Network Diagnostics: MPI Ping-Pong (mpipingpong.exe). Launchable via the HPC Admin Console Diagnostics. Pros: easy, data is auto-stored for historical comparison. Cons: no choice of network, no intermediate results. Also launchable via the command line. Command-line features: tournament mode, ring mode, serial mode; output progress to XML, stderr, stdout; histogram, per-node, and per-cluster data; test throughput, latency, or both. Remember: usually you want only 1 rank per node. Additional diagnostics and extensibility in v3.
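As a rough illustration of the one-rank-per-node rule when launching it by hand (mpipingpong.exe is assumed to be on the compute nodes' PATH, and its mode and output switches, listed in the tool's own command-line help, are omitted here):

    job submit /numnodes:8 /exclusive mpiexec -cores 1 mpipingpong.exe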

Network diagnostics

Basic Network Troubleshooting. Know the expected bandwidths and latencies:
Network, Bandwidth, Latency
IB QDR (ConnectX, PCI-E 2.0): 2400 MB/s, 2 µs
IB DDR (ConnectX, PCI-E 2.0): 1500 MB/s, 2 µs
IB DDR (ConnectX, PCI-E 1.0): 1400 MB/s, 2.8 µs
IB DDR / ND: 1400 MB/s, 5 µs
IB SDR / ND: 950 MB/s, 6 µs
IB / IPoIB: 200-400 MB/s, 30 µs
GigE: 105 MB/s, 40-70 µs
Make sure drivers and firmware are up to date. Use the product diagnostics to confirm, or Pallas PingPong, etc.

Cluster Sanity Checks. The HPC Toolpack can help too.

The Heatmap

Basic Tools - Perfmon. Counter (tolerance): used for
Processor \ % CPU Time (95%): user-mode bottleneck
Processor \ % Kernel Time (10%): kernel issues
Processor \ % DPC Time (5%): RSS, affinity
Processor \ % Interrupt Time (5%): misbehaving drivers
Network \ Output Queue Length (1): network bottleneck
Disk \ Average Queue Length (1 per platter): disk bottleneck
Memory \ Pages Per Sec (1): hard faults
System \ Context Switches Per Sec (20,000): locks, wasted processing
System \ System Calls Per Sec (100,000): excessive transitions
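For ad-hoc collection outside the Perfmon GUI, the same indicators can be sampled with typeperf. A minimal sketch, using the standard Windows counter paths that correspond to the shorthand above (5-second interval, 60 samples, written to a CSV on the node):

    typeperf "\Processor(_Total)\% Processor Time" ^
             "\Processor(_Total)\% Privileged Time" ^
             "\Processor(_Total)\% DPC Time" ^
             "\Processor(_Total)\% Interrupt Time" ^
             "\Network Interface(*)\Output Queue Length" ^
             "\PhysicalDisk(_Total)\Avg. Disk Queue Length" ^
             "\Memory\Pages/sec" ^
             "\System\Context Switches/sec" ^
             "\System\System Calls/sec" ^
             -si 5 -sc 60 -o node_perf.csv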

Perfmon In Use

Windows Performance Toolkit. The official performance analysis tools for Windows, used to optimize Windows itself. Wide support range: cross-platform (Vista, Server 2008/R2, Win7) and cross-architecture (x86, x64, IA64). Very low overhead, live capture on production systems: less than 2% processor overhead for a sustained rate of 10,000 events/second on a 2 GHz processor. The only tool that lets you correlate most of the fundamental system activity: all processes and threads, both user and kernel mode; DPCs and ISRs; thread scheduling; disk and file I/O; memory usage; the graphics subsystem; etc. Available externally as part of the Windows 7 SDK: http://www.microsoft.com/whdc/system/sysperf/perftools.mspx
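A minimal xperf capture on a compute node might look like the following sketch (DiagEasy is one of the stock kernel provider groups shipped with the toolkit; the trace file name is arbitrary):

    xperf -on DiagEasy
    REM ... run the workload under investigation ...
    xperf -d node_trace.etl
    xperfview node_trace.etl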

Performance Analysis

TUNING

Kernel By-pass: NetworkDirect. A new RDMA networking interface built for speed and stability. Verbs-based design for a close fit with native, high-performance networking interfaces. Equal to hardware-optimized stacks for MPI micro-benchmarks. NetworkDirect drivers for key high-performance fabrics: InfiniBand [available now!], 10 Gigabit Ethernet (iWARP-enabled) [available now!], Myrinet [available soon]. MS-MPIv2 has 4 networking paths: shared memory between processes on a motherboard; the TCP/IP stack ("normal" Ethernet); Winsock Direct for sockets-based RDMA; the new NetworkDirect interface. [Slide diagram: the stacks side by side, from a socket-based app over Windows Sockets (Winsock + WSD), TCP/IP and NDIS miniport drivers, to an MPI app over MS-MPI with Winsock Direct and NetworkDirect providers and the user-mode access layer, with components labelled as HPCS2008, OS, or RDMA-networking IHV components across user and kernel mode.]
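On nodes with several fabrics it can be useful to pin MS-MPI's socket traffic to the application network. A hedged sketch: MPICH_NETMASK is the variable MS-MPI uses to select the network for its socket path, and the 10.1.0.0/255.255.0.0 subnet below is a made-up example; check mpiexec -help3 for the authoritative list of environment variables.

    mpiexec -env MPICH_NETMASK 10.1.0.0/255.255.0.0 MyApp.exe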

MS-MPI Fine Tuning. Lots of MPI parameters (mpiexec -help3):
MPICH_PROGRESS_SPIN_LIMIT: 0 is adaptive, otherwise 1-64K.
SHM / SOCK / ND eager limit: switchover point for eager / rendezvous behaviour.
ND ZCOPY threshold: sets the switchover point between bcopy and zcopy; buffer reuse and registration cost affect this (registration ~= 32K bcopy).
Affinity: definitely use it on NUMA systems.
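A hedged example of overriding such parameters per run (the MSMPI_ND_EAGER_LIMIT name and the byte values below are assumptions to be checked against mpiexec -help3; MPICH_PROGRESS_SPIN_LIMIT is the variable named above):

    mpiexec -env MPICH_PROGRESS_SPIN_LIMIT 16384 -env MSMPI_ND_EAGER_LIMIT 131072 MyApp.exe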

Reducing OS Jitter. Track hard faults with xperf. Disable unused services (up to 42+ of them). Delete Windows scheduled tasks. Change the Group Policy update interval (90 minutes by default).
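A hedged sketch of the service and scheduled-task part (the service and task names below are examples only; audit your own node image before disabling anything):

    @REM disable and stop a service the compute nodes do not need (example: Windows Update)
    sc config wuauserv start= disabled
    sc stop wuauserv
    @REM disable a scheduled task (example: the built-in defrag task)
    schtasks /Change /TN "\Microsoft\Windows\Defrag\ScheduledDefrag" /Disable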

Tuning Memory Access. Effective memory use is rule #1. Processor affinity is key here; you need to know the processor architecture. Use STREAM to measure memory bandwidth.
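For example, a simple way to measure memory bandwidth per socket is to pin a STREAM binary to one socket at a time with the affinity masks discussed later (stream.exe here is a placeholder for however you built STREAM):

    start /wait /b /affinity 0x0F stream.exe
    start /wait /b /affinity 0xF0 stream.exe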

Process Placement: node groups, job templates, filters, affinity. Application aware: an ISV application requires nodes where that application is installed. Capacity aware: a multi-threaded application requires machines with many cores; a big model requires large-memory machines. NUMA aware: e.g. a 4-way structural analysis MPI job placed per socket. [Slide diagrams: cluster views with GigE, 10 GigE and InfiniBand fabrics, blade chassis with 8-, 16- and 32-core servers, and quad-core vs. 32-core layouts showing cores, memory and IO.]

MPI Process Placement. Request resources with JOB: /numnodes:n, /numsockets:n, /numcores:n, /exclusive. Control placement with MPIEXEC: -cores X, -n X, -affinity. See http://blogs.technet.com/windowshpc/archive/2008/09/16/mpi-process-placement-with-windows-hpc-server-2008.aspx. Examples:
job submit /numcores:4 mpiexec foo.exe
job submit /numnodes:2 mpiexec -c 2 -affinity foo.exe

Force Affinity. mpiexec -affinity. start /wait /b /affinity <mask> app.exe. The Windows API: SetProcessAffinityMask, SetThreadAffinityMask. Or with Task Manager or procexp.exe.

Core and affinity masks for Woodcrest (two quad-core processors, two cores per shared L2 cache):
Processor 1 mask 0x0F: Core 0 = 0x01, Core 1 = 0x02, Core 2 = 0x04, Core 3 = 0x08; L2 cache pairs 0x03 and 0x0C.
Processor 2 mask 0xF0: Core 4 = 0x10, Core 5 = 0x20, Core 6 = 0x40, Core 7 = 0x80; L2 cache pairs 0x30 and 0xC0.
[Slide diagram: processor, L2 cache and core affinity masks drawn over the two-socket block diagram with bus interfaces and the system bus.]
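Tying these masks to the start command above: for example, the first line below pins app.exe to the two cores sharing the first L2 cache, the second to all four cores of processor 2.

    start /wait /b /affinity 0x03 app.exe
    start /wait /b /affinity 0xF0 app.exe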

Finer control of affinity, to overcome hyperthreading on Nehalem: mpiexec setaff.cmd mpiapp.exe
@REM setaff.cmd: set affinity based on the MPI rank; the values are hex core masks
@IF "%MPI_SMPD_KEY%" == "7" set AFFINITY=1
@IF "%MPI_SMPD_KEY%" == "1" set AFFINITY=2
@IF "%MPI_SMPD_KEY%" == "5" set AFFINITY=4
@IF "%MPI_SMPD_KEY%" == "3" set AFFINITY=8
@IF "%MPI_SMPD_KEY%" == "4" set AFFINITY=10
@IF "%MPI_SMPD_KEY%" == "2" set AFFINITY=20
@IF "%MPI_SMPD_KEY%" == "6" set AFFINITY=40
@IF "%MPI_SMPD_KEY%" == "0" set AFFINITY=80
start /wait /b /affinity %AFFINITY% %*

MS-MPI TRACING

Devs can't tune what they can't see. MS-MPI tracing: a single, time-correlated log of MPI events on all nodes. Dual purpose: performance analysis and application troubleshooting. Trace data display: Vampir (TU Dresden), Intel Trace Analyzer, MPICH Jumpshot (Argonne NL), Windows ETW tools, plain text.

MS-MPI Tracing Overview. MS-MPI includes built-in tracing: low overhead, based on Event Tracing for Windows (ETW), no need to recompile your application. Three-step process. Trace: mpiexec -trace [event category] MyApp.exe. Sync: clocks across nodes (mpicsync.exe). Convert: to a viewing format. Explained in excruciating detail in "Tracing MPI Apps with Windows HPC Server 2008". Traces can also be triggered via any ETW mechanism (xperf, etc.).

Step 1 - Tracing and filtering. mpiexec -trace MyApp.exe, or with a filter: mpiexec -trace (PT2PT,ICND) MyApp.exe. PT2PT: point-to-point communication. ICND: NetworkDirect interconnect communication. These event groups are defined in the file mpitrace.mof, which resides in the %CCP_HOME%\bin\ folder. Log files are written on each node in %USERPROFILE% as mpi_trace_{jobid}.{taskid}.{taskinstanceid}.etl. The trace filename can be overridden with the -tracefile argument.

Step 2 - Clock synchronisation. Use mpiexec and mpicsync to correct the trace file timestamps for each node used in a job: mpiexec -cores 1 mpicsync mpi_trace_42.1.0.etl. mpicsync uses only the trace (.etl) file data to calculate CPU clock corrections. mpicsync must be run as an MPI program: mpiexec -cores 1 -wdir %%USERPROFILE%% mpicsync mpi_trace_%ccp_jobid%.%ccp_taskid%.%ccp_taskinstanceid%.etl

Step 3 - Format the binary .etl file for viewing. Format to TEXT, OTF or CLOG2 with tracefmt, etl2otf and etl2clog; these format the event log and apply the clock corrections. Leverage the power of your cluster by using mpiexec to translate all the .etl files simultaneously on the compute nodes used for your trace job: mpiexec -cores 1 -wdir %%USERPROFILE%% etl2otf mpi_trace_42.1.0.etl. Finally, collect the trace files from all nodes in a single location.
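Putting the three steps together, a minimal batch sketch of a traced run (it reuses the mpi_trace_42.1.0.etl example name from above and assumes mpicsync and etl2otf are on the compute nodes' PATH):

    @REM 1. run the application with MPI tracing enabled (writes mpi_trace_{jobid}.{taskid}.{taskinstanceid}.etl on each node)
    mpiexec -trace -wdir %%USERPROFILE%% MyApp.exe
    @REM 2. correct per-node clock skew in each node's trace file (one rank per node)
    mpiexec -cores 1 -wdir %%USERPROFILE%% mpicsync mpi_trace_42.1.0.etl
    @REM 3. convert each node's .etl file to OTF for a viewer such as Vampir
    mpiexec -cores 1 -wdir %%USERPROFILE%% etl2otf mpi_trace_42.1.0.etl
    @REM 4. collect the converted trace files from every node into a single location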

Helper script: TraceMyMPI.cmd. Provided as part of the tracing whitepaper; it executes all the required steps and starts mpiexec for you.

MS-MPI Tracing and Viewing

QUESTIONS?

Resources. The Windows Performance Toolkit is here: http://www.microsoft.com/whdc/system/sysperf/perftools.mspx. The Windows Internals book series is very good. Basic Windows Server tuning is here: http://www.microsoft.com/whdc/system/sysperf/Perf_tun_srv.mspx. Process Affinity in HPC Server 2008 SP1: http://blogs.technet.com/windowshpc/archive/2009/10/01/process-affinity-and-windows-hpc-server-2008-sp1.aspx