Performance Analysis and Tuning in Windows HPC Server 2008
Xavier Pillons, Program Manager, Microsoft Corp.
xpillons@microsoft.com
Introduction
- How to monitor performance on Windows?
- What to look for?
- How to tune the system?
- How to trace MS-MPI?
MEASURING PERFORMANCE
Performance Analysis
- Cluster-wide: built-in diagnostics, the Heatmap
- Local: Perfmon, xperf
Built-in Network Diagnostics
MPI Ping-Pong (mpipingpong.exe)
- Launchable via the HPC Admin Console diagnostics
  Pros: easy, and data is auto-stored for historical comparison
  Cons: no choice of network, no intermediate results
- Launchable via the command line (see the sketch below)
  Tournament mode, ring mode, serial mode
  Output progress to XML, stderr, stdout
  Histogram, per-node, and per-cluster data
  Test throughput, latency, or both
Remember: usually you want only 1 rank per node
Additional diagnostics and extensibility in v3
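A minimal sketch of the command-line route. The mpiexec wrapping follows the placement examples later in this deck; mpipingpong's own test options vary by HPC Pack version, so consult the tool's built-in help for exact flags. The node count is illustrative:

rem One rank per node, as recommended above.
job submit /numnodes:4 mpiexec -cores 1 mpipingpong.exe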
Network diagnostics
Basic Network Troubleshooting
Know expected bandwidths and latencies:

  Network                        Bandwidth      Latency
  IB QDR (ConnectX, PCI-E 2.0)   2400 MB/s      2 µs
  IB DDR (ConnectX, PCI-E 2.0)   1500 MB/s      2 µs
  IB DDR (ConnectX, PCI-E 1.0)   1400 MB/s      2.8 µs
  IB DDR / ND                    1400 MB/s      5 µs
  IB SDR / ND                    950 MB/s       6 µs
  IPoIB                          200-400 MB/s   30 µs
  GigE                           105 MB/s       40-70 µs

Make sure drivers and firmware are up to date
Use the product diagnostics to confirm, or Pallas PingPong, etc.
Cluster Sanity Checks
The HPC Tool Pack can help too
The Heatmap
Basic Tools - Perfmon

  Counter                            Tolerance      Used For
  Processor \ % CPU Time             95%            User-mode bottleneck
  Processor \ % Kernel Time          10%            Kernel issues
  Processor \ % DPC Time             5%             RSS, affinity
  Processor \ % Interrupt Time       5%             Misbehaving drivers
  Network \ Output Queue Length      1              Network bottleneck
  Disk \ Average Queue Length        1 per platter  Disk bottleneck
  Memory \ Pages per Sec             1              Hard faults
  System \ Context Switches per Sec  20,000         Locks, wasted processing
  System \ System Calls per Sec      100,000        Excessive transitions
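These same counters can be sampled from a script with typeperf, a standard Windows CLI, which avoids opening the Perfmon UI on every node. A minimal sketch; the counter paths are the stock English names and the output file name is illustrative:

rem Sample key counters once per second, 60 samples, into a CSV.
typeperf "\Processor(_Total)\% Interrupt Time" ^
         "\Processor(_Total)\% DPC Time" ^
         "\System\Context Switches/sec" ^
         "\Memory\Pages/sec" ^
         -si 1 -sc 60 -o node_counters.csv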
Perfmon In Use
Windows Performance Toolkit
- Official performance analysis tools from Windows, used to optimize Windows itself
- Wide support range
  Cross-platform: Vista, Server 2008/R2, Win7
  Cross-architecture: x86, x64, IA64
- Very low overhead live capture on production systems
  Less than 2% processor overhead for a sustained rate of 10,000 events/second on a 2 GHz processor
- The only tool that lets you correlate most of the fundamental system activity
  All processes and threads, both user and kernel mode; DPCs and ISRs; thread scheduling; disk and file I/O; memory usage; graphics subsystem; etc.
- Available externally: part of the Windows 7 SDK
  http://www.microsoft.com/whdc/system/sysperf/perftools.mspx
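A minimal xperf capture cycle looks like the sketch below. The kernel flags shown are standard ones (process/thread, image load, sampled profile, interrupts, DPCs); the trace file name is illustrative:

rem Start a kernel trace with sampled profile plus interrupt and DPC events.
xperf -on PROC_THREAD+LOADER+PROFILE+INTERRUPT+DPC -stackwalk Profile
rem ... run the workload ...
rem Stop tracing, merge to an .etl file, then open it in the viewer.
xperf -d hpc_capture.etl
xperf hpc_capture.etl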
Performance Analysis
TUNING
Kernel Bypass - NetworkDirect
- A new RDMA networking interface built for speed and stability
- Verbs-based design for a close fit with native, high-performance networking interfaces
- Equal to hardware-optimized stacks for MPI micro-benchmarks
- NetworkDirect drivers for key high-performance fabrics:
  InfiniBand [available now!]
  10 Gigabit Ethernet (iWARP-enabled) [available now!]
  Myrinet [available soon]
- MS-MPIv2 has 4 networking paths:
  Shared memory, between processes on the same motherboard
  TCP/IP stack ("normal" Ethernet)
  Winsock Direct, for sockets-based RDMA
  The new NetworkDirect interface
[Diagram: the four paths through the stack - a socket-based app via Windows Sockets (Winsock + WSD), TCP/IP, NDIS, and the miniport driver; an MPI app via MS-MPI through either a Winsock Direct provider or a NetworkDirect provider and the user-mode access layer, straight to the networking hardware; components labeled HPCS2008, OS, or RDMA networking IHV; user-mode/kernel-mode boundary shown.]
MS-MPI Fine Tuning
Lots of MPI parameters (mpiexec -help3):
- MPICH_PROGRESS_SPIN_LIMIT: 0 is adaptive, otherwise 1-64K
- SHM / SOCK / ND eager limit: the switchover point for eager / rendezvous behaviour
- ND ZCOPY threshold: sets the switchover point between bcopy and zcopy; buffer reuse and registration cost affect this (registration ~= a 32K bcopy)
- Affinity: definitely used for NUMA systems
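These parameters can be set per run with mpiexec's -env switch. A minimal sketch using the spin-limit variable named above (the value 16 and the rank count are purely illustrative - run mpiexec -help3 for the full list and semantics):

mpiexec -n 16 -env MPICH_PROGRESS_SPIN_LIMIT 16 MyApp.exe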
Reducing OS Jitter
- Track hard faults with xperf
- Disable unused services (up to 42+)
- Delete Windows scheduled tasks
- Change the Group Policy update interval (90 min by default)
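A sketch of the corresponding commands. The service and task names are placeholders - audit what actually runs on your nodes before disabling anything:

rem Capture hard faults with xperf while a job runs.
xperf -on PROC_THREAD+LOADER+HARD_FAULTS
xperf -d jitter.etl
rem Disable an unused service (note the space after start=).
sc config SomeUnusedService start= disabled
rem List scheduled tasks, then disable one.
schtasks /query /fo LIST
schtasks /change /tn "\Microsoft\Windows\SomeTask" /disable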
Tuning Memory Access
- Effective memory use is rule #1
- Processor affinity is key here
- You need to know the processor architecture
- Use STREAM to measure memory bandwidth
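For example, STREAM can be pinned with the same start /affinity mechanism shown later in this deck to compare per-socket bandwidth (the stream.exe path is a placeholder; the mask is hexadecimal):

rem Pin STREAM to the four cores of the first processor (mask 0x0F).
start /wait /b /affinity F stream.exe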
Process Placement
node groups, job templates, filters, affinity
[Diagram: three placement scenarios.
 Application aware - an ISV application requires nodes where the application is installed (mixed GigE / InfiniBand / 10 GigE fabrics).
 Capacity aware - a multi-threaded application requires machines with many cores; a big model requires large-memory machines (blade chassis with 8-, 16-, and 32-core servers).
 NUMA aware - a 4-way structural analysis MPI job laid out across sockets, cores C0-C3 each with local memory (M), on quad-core and 32-core NUMA machines.]
MPI Process Placement
- Request resources with JOB: /numnodes:n /numsockets:n /numcores:n /exclusive
- Control placement with MPIEXEC: -cores X, -n X, -affinity
http://blogs.technet.com/windowshpc/archive/2008/09/16/mpiprocess-placement-with-windows-hpc-server-2008.aspx
Examples:
job submit /numcores:4 mpiexec foo.exe
job submit /numnodes:2 mpiexec -c 2 -affinity foo.exe
Force Affinity
- mpiexec -affinity
- start /wait /b /affinity <mask> app.exe
- Windows API: SetProcessAffinityMask, SetThreadAffinityMask
- With Task Manager or procexp.exe
Core and Affinity Masks for Woodcrest
Processor 1, cores 0-3 (two L2 caches, each shared by a core pair):
  Core affinity masks: core 0 = 0x01, core 1 = 0x02, core 2 = 0x04, core 3 = 0x08
  L2 cache affinity masks: cores 0-1 = 0x03, cores 2-3 = 0x0C
  Processor affinity mask: 0x0F
Processor 2, cores 4-7:
  Core affinity masks: core 4 = 0x10, core 5 = 0x20, core 6 = 0x40, core 7 = 0x80
  L2 cache affinity masks: cores 4-5 = 0x30, cores 6-7 = 0xC0
  Processor affinity mask: 0xF0
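A minimal sketch of putting these masks to use: keep two instances on separate L2 caches of processor 1 (app.exe is a placeholder; start /affinity takes a hexadecimal mask):

rem First instance on cores 0-1 (shared L2, mask 0x03).
start /b /affinity 3 app.exe
rem Second instance on cores 2-3 (shared L2, mask 0x0C).
start /b /affinity C app.exe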
Finer control of affinity, to overcome hyperthreading on Nehalem:
mpiexec setaff.cmd mpiapp.exe

@REM setaff.cmd - set affinity based on MPI rank
@REM (the masks are hexadecimal, as start /affinity expects)
@IF "%PMI_SMPD_KEY%" == "7" set AFFINITY=1
@IF "%PMI_SMPD_KEY%" == "1" set AFFINITY=2
@IF "%PMI_SMPD_KEY%" == "5" set AFFINITY=4
@IF "%PMI_SMPD_KEY%" == "3" set AFFINITY=8
@IF "%PMI_SMPD_KEY%" == "4" set AFFINITY=10
@IF "%PMI_SMPD_KEY%" == "2" set AFFINITY=20
@IF "%PMI_SMPD_KEY%" == "6" set AFFINITY=40
@IF "%PMI_SMPD_KEY%" == "0" set AFFINITY=80
start /wait /b /affinity %AFFINITY% %*
MS-MPI TRACING
Devs can't tune what they can't see
MS-MPI tracing: a single, time-correlated log of MPI events on all nodes
Dual purpose:
- Performance analysis
- Application troubleshooting
Trace data display: VAMPIR (TU Dresden), Intel Trace Analyzer, MPICH Jumpshot (Argonne NL), Windows ETW tools, text
MS-MPI Tracing Overview
MS-MPI includes built-in tracing:
- Low overhead, based on Event Tracing for Windows (ETW)
- No need to recompile your application
Three-step process:
1. Trace: mpiexec -trace [event category] MyApp.exe
2. Sync clocks across nodes (mpicsync.exe)
3. Convert to a viewing format
Explained in excruciating detail in "Tracing MPI Apps with Windows HPC Server 2008"
Traces can also be triggered via any ETW mechanism (xperf, etc.)
Step 1 - Tracing and filtering
mpiexec -trace MyApp.exe
mpiexec -trace (PT2PT,ICND) MyApp.exe
- PT2PT: point-to-point communication
- ICND: NetworkDirect interconnect communication
These event groups are defined in the file mpitrace.mof, which resides in the %CCP_HOME%\bin\ folder.
Log files are written on each node in %userprofile% as mpi_trace_{jobid}.{taskid}.{taskinstanceid}.etl
The trace filename can be overridden with the -tracefile argument.
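For example, to write the trace somewhere other than %userprofile% - a minimal sketch, assuming -tracefile takes the target path and using an illustrative share name:

mpiexec -trace -tracefile \\headnode\traces\myrun.etl MyApp.exe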
Step 2 - Clock synchronisation
Use mpiexec and mpicsync to correct trace-file timestamps for each node used in a job:
mpiexec -cores 1 mpicsync mpi_trace_42.1.0.etl
- mpicsync uses only the trace (.etl) file data to calculate CPU clock corrections
- mpicsync must be run as an MPI program:
mpiexec -cores 1 -wdir %%USERPROFILE%% mpicsync mpi_trace_%ccp_jobid%.%ccp_taskid%.%ccp_taskinstanceid%.etl
Step 3 - Format the binary .etl file for viewing
Format to TEXT, OTF, or CLOG2 with tracefmt, etl2otf, and etl2clog
- These format the event log and apply the clock corrections
- Leverage the power of your cluster: use mpiexec to translate all your .etl files simultaneously on the compute nodes used for your trace job
mpiexec -cores 1 -wdir %%USERPROFILE%% etl2otf mpi_trace_42.1.0.etl
Finally, collect the trace files from all nodes in a single location
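One way to do that final collection step is to reuse mpiexec once more, so each node copies its own converted files to a share (the share name is a placeholder; the wildcard simply grabs every file the conversion produced for that trace):

mpiexec -cores 1 -wdir %%USERPROFILE%% cmd /c copy mpi_trace_42.1.0* \\headnode\traces\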
Helper script - TraceMyMPI.cmd
- Provided as part of the tracing whitepaper
- Executes all the required steps
- Starts mpiexec for you
MS-MPI Tracing and Viewing
QUESTIONS?
Resources
- The Windows Performance Toolkit: http://www.microsoft.com/whdc/system/sysperf/perftools.mspx
- The Windows Internals book series is very good
- Basic Windows Server tuning: http://www.microsoft.com/whdc/system/sysperf/Perf_tun_srv.mspx
- Process Affinity in HPC Server 2008 SP1: http://blogs.technet.com/windowshpc/archive/2009/10/01/process-affinity-and-windows-hpc-server-2008-sp1.aspx