How To Monitor Performance On A Microsoft Powerbook (Powerbook) On A Network (Powerbus) On An Uniden (Powergen) With A Microsatellite) On The Microsonde (Powerstation) On Your Computer (Power

Similar documents

Distributed communication-aware load balancing with TreeMatch in Charm++

Multi-Threading Performance on Commodity Multi-Core Processors

MAQAO Performance Analysis and Optimization Tool

ELEC 377. Operating Systems. Week 1 Class 3

Performance Analysis and Optimization Tool

Performance Characteristics of Large SMP Machines

Performance Monitoring of Parallel Scientific Applications

NVIDIA Tools For Profiling And Monitoring. David Goodwin

Performance Application Programming Interface

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

End-user Tools for Application Performance Analysis Using Hardware Counters

Optimization tools. 1) Improving Overall I/O

Performance Counter. Non-Uniform Memory Access Seminar Karsten Tausche

Multi-core Programming System Overview

Multi-core and Linux* Kernel

The Lagopus SDN Software Switch. 3.1 SDN and OpenFlow. 3. Cloud Computing Technology

OpenMP Programming on ScaleMP

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5

System Requirements Table of contents

The Impact of Memory Subsystem Resource Sharing on Datacenter Applications. Lingia Tang Jason Mars Neil Vachharajani Robert Hundt Mary Lou Soffa

Delivering Quality in Software Performance and Scalability Testing

Resource Aware Scheduler for Storm. Software Design Document. Date: 09/18/2015

PAPI - PERFORMANCE API. ANDRÉ PEREIRA ampereira@di.uminho.pt

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1

Understanding the Benefits of IBM SPSS Statistics Server

Enabling Technologies for Distributed Computing

Binary search tree with SIMD bandwidth optimization using SSE

Exascale Challenges and General Purpose Processors. Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation

Scalability evaluation of barrier algorithms for OpenMP

Enabling Technologies for Distributed and Cloud Computing

Very Large Enterprise Network, Deployment, Users

A Comparative Study on Vega-HTTP & Popular Open-source Web-servers

Release Notes for Open Grid Scheduler/Grid Engine. Version: Grid Engine

<Insert Picture Here> An Experimental Model to Analyze OpenMP Applications for System Utilization

OpenMP and Performance

CSCI E 98: Managed Environments for the Execution of Programs

ScaAnalyzer: A Tool to Identify Memory Scalability Bottlenecks in Parallel Programs

JUROPA Linux Cluster An Overview. 19 May 2014 Ulrich Detert

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011

Practical Performance Understanding the Performance of Your Application

The team that wrote this redbook Comments welcome Introduction p. 1 Three phases p. 1 Netfinity Performance Lab p. 2 IBM Center for Microsoft

Microsoft Windows Server 2003 with Internet Information Services (IIS) 6.0 vs. Linux Competitive Web Server Performance Comparison

D5.6 Prototype demonstration of performance monitoring tools on a system with multiple ARM boards Version 1.0

Understand Performance Monitoring

SIDN Server Measurements

Database Application Developer Tools Using Static Analysis and Dynamic Profiling

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand

Optimizing the Performance of Your Longview Application

Parallel Programming Survey

Oracle Database Scalability in VMware ESX VMware ESX 3.5

IBM Tivoli Composite Application Manager for Microsoft Applications: Microsoft Hyper-V Server Agent Version Fix Pack 2.

HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief

Implementing Probes for J2EE Cluster Monitoring

Data Structure Oriented Monitoring for OpenMP Programs

MAGENTO HOSTING Progressive Server Performance Improvements

How To Build A Supermicro Computer With A 32 Core Power Core (Powerpc) And A 32-Core (Powerpc) (Powerpowerpter) (I386) (Amd) (Microcore) (Supermicro) (

L20: GPU Architecture and Models

Scheduling Task Parallelism" on Multi-Socket Multicore Systems"

Operating Systems. 05. Threads. Paul Krzyzanowski. Rutgers University. Spring 2015

11.1 inspectit inspectit

x64 Servers: Do you want 64 or 32 bit apps with that server?

Chapter 14 Virtual Machines

Full and Para Virtualization

Tableau Server 7.0 scalability

Enterprise Performance Tuning: Best Practices with SQL Server 2008 Analysis Services. By Ajay Goyal Consultant Scalability Experts, Inc.

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003

Building an energy dashboard. Energy measurement and visualization in current HPC systems

Analysis of VDI Storage Performance During Bootstorm

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

Oracle Database Reliability, Performance and scalability on Intel Xeon platforms Mitch Shults, Intel Corporation October 2011

Cloud Computing Simulation Using CloudSim

Scalability Factors of JMeter In Performance Testing Projects

Oracle Enterprise Manager 12c New Capabilities for the DBA. Charlie Garry, Director, Product Management Oracle Server Technologies

SWARM: A Parallel Programming Framework for Multicore Processors. David A. Bader, Varun N. Kanade and Kamesh Madduri

Using Library Dependencies for Clustering

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

Xeon+FPGA Platform for the Data Center

Informatica Ultra Messaging SMX Shared-Memory Transport

Very Large Enterprise Network Deployment, 25,000+ Users

Lab Validation Report

HPSA Agent Characterization

Microsoft SQL Server OLTP Best Practice

Overlapping Data Transfer With Application Execution on Clusters

Chapter 5 Linux Load Balancing Mechanisms

Resource Utilization of Middleware Components in Embedded Systems

HP ProLiant Gen8 vs Gen9 Server Blades on Data Warehouse Workloads

vsphere Resource Management Guide

SQL Server 2008 Performance and Scale

vsphere Resource Management

Replication on Virtual Machines

Transcription:

A Topology-Aware Performance Monitoring Tool for Shared Resource Management in Multicore Systems TADaaM Team - Nicolas Denoyelle - Brice Goglin - Emmanuel Jeannot August 24, 2015

1. Context/Motivations 2. Fast presentation of the tool 3. Demonstration 4. How does it works? 5. How is it made? 6. Features & Future Works TADaaM Team - Nicolas Denoyelle - Brice Goglin - Emmanuel Jeannot August 24, 2015

MOTIVATIONS Memory hierarchy is growing deeper and larger. No performance without a fair usage of the system topology Batch schedulers, runtimes, applications themeselves... are getting topology aware. Processing Unit NUMA Shared Memory Shared Cache IO Network Machine NUMA Shared Memory Shared Cache Topology Aware Performance Monitoring August 24, 2015-3

MOTIVATIONS Memory hierarchy is growing deeper and larger. IO Network Hence, data management gives opportunities for performance improvements. NUMA Shared Memory Shared Cache Machine NUMA Shared Memory Shared Cache Processing Unit Topology Aware Performance Monitoring August 24, 2015-4

MOTIVATIONS Memory hierarchy is growing deeper and larger. IO Network Hence, data management gives opportunities for performance improvements. NUMA Shared Memory Shared Cache Machine NUMA Shared Memory Shared Cache Processing Unit Topology Aware Performance Monitoring August 24, 2015-4

MOTIVATIONS Memory hierarchy is growing deeper and larger. IO Network Hence, data management gives opportunities for performance improvements. NUMA Shared Memory Shared Cache Machine NUMA Shared Memory Shared Cache It is a multi-level and a multi-criteria problem. Processing Unit Topology Aware Performance Monitoring August 24, 2015-4

MOTIVATIONS Need to match use cases, and relevant performance metrics for each level. Need to match performance and topology. Requires topology modeling skills. Requires adaptable performance monitoring. Topology Aware Performance Monitoring August 24, 2015-5

Yet Another Tool to Monitor Applications Performance Focus on data presentation to link the results with topology informations. Relies on two cornerstones of topology modeling (hwloc) and performance counter abstraction (PAPI) to map the latter on the former Minimal configuration and software requirements. Can help finding and caracterizing localized bottlenecks. Topology Aware Performance Monitoring August 24, 2015-6

Hardware Locality (hwloc) Portable abstraction of hierarchical architectures for high-performance computing Performs topology discovery and extracts hardware component information. Provides tools for memory and process binding. Many operating systems supported... lstopo utility to display the topology: Developped at Inria Bordeaux. Topology Aware Performance Monitoring August 24, 2015-7

Hardware Locality (hwloc) Machine (31GB total) NUMANode P#0 (31GB) Package P#0 L3 (20MB) L2 (256KB) L2 (256KB) L2 (256KB) L2 (256KB) L2 (256KB) L2 (256KB) L2 (256KB) L2 (256KB) L1d L1d L1d L1d L1d L1d L1d L1d L1i L1i L1i L1i L1i L1i L1i L1i P#0 P#1 P#2 P#3 P#4 P#5 P#6 P#7 P#0 P#2 P#4 P#6 P#8 P#10 P#12 P#14 P#16 P#18 P#20 P#22 P#24 P#26 P#28 P#30 Topology Aware Performance Monitoring August 24, 2015-7

Performance Application Programming Interface (PAPI) Consistent interface and methodology for use of the performance counter hardware. Real time relation between software performance and processor events. Many operating systems supported too. Reliable and actively supported. Used in a wide range of performance analysis applications. An abstraction layer to plug some other performance library is under development. Topology Aware Performance Monitoring August 24, 2015-8

Dynamic Lstopo (example) Machine (16GB total) NUMANode P#0 (16GB) Package P#0 L3 9,8000000000e+01 (20MB) L2 2,5900000e+02 (256KB) L2 2,2400000e+02 (256KB) L2 2,7600000e+02 (256KB) L2 2,4000000e+02 (256KB) L2 2,4000000e+02 (256KB) L2 5,0700000e+02 (256KB) L2 2,1900000e+02 (256KB) L2 3,0100000e+02 (256KB) L1d 6,5800000e+02 L1d 6,8200000e+02 L1d 6,8700000e+02 L1d 6,8700000e+02 L1d 7,2600000e+02 L1d 1,0560000e+03 L1d 6,4200000e+02 L1d 6,8200000e+02 L1i 1,7140000e+03 L1i 1,7350000e+03 L1i 1,7220000e+03 L1i 1,7380000e+03 L1i 1,7670000e+03 L1i 2,3410000e+03 L1i 1,7440000e+03 L1i 1,7130000e+03 P#0 P#1 P#2 P#3 P#4 P#5 P#6 P#7 2,82176e+00 P#0 2,45916e+00 P#2 2,68772e+00 P#4 1,51689e+00 P#6 1,63417e+00 P#8 3,47249e+00 P#10 1,47808e+00 P#12 1,51514e+00 P#14 1,40868e+00 P#16 1,47441e+00 P#18 1,52142e+00 P#20 2,64271e+00 P#22 2,82472e+00 P#24 1,47165e+00 P#26 2,58947e+00 P#28 2,49344e+00 P#30 Sample of hardware performance counters mapped on a single socket of an Intel Xeon E5-2650 C. Topology Aware Performance Monitoring August 24, 2015-9

A Demonstration Worth Thousand Words L1 L2 L3 3 2 1 0 4... Accesses to a linked list of variable size. Topology Aware Performance Monitoring August 24, 2015-10

A Demonstration Worth Thousand Words L1 L2 L3 3 2 1 0 4... Accesses to a linked list of variable size. Topology Aware Performance Monitoring August 24, 2015-10

A Demonstration Worth Thousand Words L1 L2 L3 3 2 1 0 4... Accesses to a linked list of variable size. Topology Aware Performance Monitoring August 24, 2015-10

A Demonstration Worth Thousand Words L3_MISS{ OBJ = L3; CTR = PAPI_L3_TCM; LOGSCALE = 1; } L2_MISS{ OBJ = L2; CTR = PAPI_L2_TCM; LOGSCALE = 1; } L1_MISS{ OBJ = L1d; CTR = PAPI_L1_DCM; LOGSCALE = 1; } SINGLE_L3_MISS{ OBJ = ; CTR = PAPI_L3_TCM; LOGSCALE = 1; } Topology Aware Performance Monitoring August 24, 2015-11

Dynamic Lstopo (Usage) Machine (16GB total) Counters input: NUMANode P#0 (16GB) Package P#0 L3 2,5120000000e+03 (4096KB) L2 (256KB) L1d L1i L2 (256KB) L1d L1i SINGLE L3 MISS{ OBJ = L3 ; CTR = PAPI L2 TCM/PAPI L2 TCA ; LOGSCALE = 1 ; MAX=1000000; MIN=0; } P#0 P#1 P#0 P#1 P#2 P#3 Command line: lstopo perf-input counters input Topology Aware Performance Monitoring August 24, 2015-12

Dynamic Lstopo (Theory) 1. Spawn one pthread per hardware thread (#0,..., #3). L2 L1 + L1 + + Topology Aware Performance Monitoring August 24, 2015-13

Dynamic Lstopo (Theory) 1. Spawn one pthread per hardware thread (#0,..., #3). L2 2. For each timestamp, each thread collects a local set of performance counters. L1 + L1 + + Topology Aware Performance Monitoring August 24, 2015-13

Dynamic Lstopo (Theory) 1. Spawn one pthread per hardware thread (#0,..., #3). L2 2. For each timestamp, each thread collects a local set of performance counters. L1 + L1 3. Counters are accumulated in each upper level. + + Topology Aware Performance Monitoring August 24, 2015-13

Dynamic Lstopo (Theory) 1. Spawn one pthread per hardware thread (#0,..., #3). L2 % 2. For each timestamp, each thread collects a local set of performance counters. % L1 + L1 % 3. Counters are accumulated in each upper level. % + % % + % 4. For each level, a leaf computes an arithmetic expression of the performance counters in the set. Topology Aware Performance Monitoring August 24, 2015-13

Dynamic Lstopo Software Architecture in Brief lstopo utility Monitors output Application monitors library hwloc library PAPI library Machine static topology Machine dynamic performance counters Topology Aware Performance Monitoring August 24, 2015-14

Dynamic Lstopo Software Architecture in Brief lstopo utility Monitors output Application monitors library hwloc library PAPI library Machine static topology Machine dynamic performance counters Topology Aware Performance Monitoring August 24, 2015-14

API m o n i t o r s = l o a d M o n i t o r s f r o m c o n f i g (NULL, m y p e r f f i l e, m y o u t p u t f i l e, 0) ; M o n i t o r s w a t c h p i d ( monitors, g e t p i d ( ) ) ; M o n i t o r s s t a r t ( m o n i t o r s ) ; /... / M o n i t o r s u p d a t e c o u n t e r s ( m o n i t o r s ) ; d e l e t e M o n i t o r s ( m o n i t o r s ) ; Topology Aware Performance Monitoring August 24, 2015-15

Dynamic Lstopo (Output paje trace) % EventDef val 0 % Id int % Phase int % Time_us date % Value double % EndEventDef % EventDef container 1 % Id int % Level string % Sibling int % Name string % Logscale int % EndEventDef 1 3 0 SINGLE_ L3_ MISS 1 1 2 1 SINGLE_ L3_ MISS 1 1 1 2 SINGLE_ L3_ MISS 1 1 0 3 SINGLE_ L3_ MISS 1 0 2 0 962832762224 67,00 0 1 0 962832762224 58,00 0 3 0 962832762225 77,00 0 0 0 962832762236 64,00 0 3 0 962832860676 94514,00 0 0 0 962832860676 121746,00 0 2 0 962832860676 205170,00 0 1 0 962832860717 200931,00 Topology Aware Performance Monitoring August 24, 2015-16

Features Record and/or Display live machine performance counters and match them with topology. Several settings: counters accumulation, sampling rate, attach to a process... Replay any trace with a topology file (for external display).. Sample specific parts of an application with the monitor library. Support legacy lstopo options (restrict topology, change display format... ). Topology Aware Performance Monitoring August 24, 2015-17

Future works Match code and performance informations Accept user defined aggregation operator. Provide performance abstraction layer. Be able to delimit phases during execution. Find and give explicit hints on bottlenecks. Topology Aware Performance Monitoring August 24, 2015-18

Conclusion Data locality becomes a main criterion for high performance. We built a tool based on a topology model and a performance library to help taking up the challenge. It maps performance values to machine objects. It is a visual tool, fast and easy to use. It is lightweight and causes less than 1% C overhead. Let you build topology aware performance models. Dynamic lstopo is into the process of beeing merged with hwloc project. Topology Aware Performance Monitoring August 24, 2015-19

Now available from https://github.com/nicolasdenoyelle/dynamic lstopo Thank you Topology Aware Performance Monitoring August 24, 2015-20