Hardware performance monitoring Zoltán Majó
Question: Did you take any of these lectures?
- Computer Architecture and System Programming
- How to Write Fast Numerical Code
- Design of Parallel and High Performance Computing
Program performance
- Algorithmic complexity is decisive, e.g., O(n) is better than O(n^2)
- Constant factors matter as well, e.g., n^2 operations beat 10000000000000*n operations, and 3*n^3 operations beat 500*n^3 operations
- Constant factors are in many cases hardware-dependent
Today's example: dense matrix multiplication (MMM)
- Complexity: O(n^3)
- Hardware: cache-based architecture
Algorithm: MMM (C = A x B)

for (i=0; i<n; i++)
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += A[i][k]*B[k][j];
    C[i][j] = sum;
  }
Hardware: cache-based architecture
(Diagram: CPU, Cache, RAM)
- Cache: 35 cycles access latency
- RAM: 200 cycles access latency
MMM: Putting it together
(Diagram: CPU, Cache, RAM, computing C = A x B)

                 A[][]   B[][]
Cache hits         3       0
Total accesses     6       6
MMM: Cache performance
Hit rate:
- Accesses to A[][]: 3/6 = 50%
- Accesses to B[][]: 0/6 = 0%
- All accesses: 25%
Can we do better?
Cache-friendly MMM (C = A x B)

Cache-unfriendly MMM (ijk):
for (i=0; i<n; i++)
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += A[i][k]*B[k][j];
    C[i][j] = sum;
  }

Cache-friendly MMM (ikj):
for (i=0; i<n; i++)
  for (k=0; k<n; k++) {
    r = A[i][k];
    for (j=0; j<n; j++)
      C[i][j] += r*B[k][j];
  }
Cache-friendly MMM
(Diagram: CPU, Cache, RAM)

                 C[][]   B[][]
Cache hits         3       3
Total accesses     6       6
Cache-friendly MMM

Cache-unfriendly MMM (ijk):
- A[][]: 3/6 = 50% hit rate
- B[][]: 0/6 = 0% hit rate
- All accesses: 25% hit rate

Cache-friendly MMM (ikj):
- C[][]: 3/6 = 50% hit rate
- B[][]: 3/6 = 50% hit rate
- All accesses: 50% hit rate

Better performance due to cache-friendliness?
Performance of MMM
(Chart: execution time [s], log scale 0.01 to 10000, vs. matrix size 512-8192; ikj (cache-friendly) outperforms ijk (cache-unfriendly) by up to 20X)
Program performance
MMM: constant factors matter
Understanding constant factors requires access to the algorithm, the implementation, the inputs, and the architecture...
...but often not all of these are available: we may have only the binary file that we want to execute fast
Do we know the architecture?
Cache-based architecture
(Diagram: two processor packages, each with four cores; each core has private L1 and L2 caches, the cores of a package share an L3 cache, and both packages share the RAM)
Microarchitecture of a core
(Source of picture: http://wikipedia.org)
Outline
- Performance: constant factors matter
- Hardware performance counters
  - Simple example: measuring cache misses
  - Advanced uses
- Your project
Hardware performance counters
- Special registers, programmable to monitor a given hardware event (e.g., cache misses)
- Low-level information about hardware-software interaction
- Low overhead due to hardware implementation
- In the past: an undocumented feature; since the Intel Pentium: publicly available description
- Debugging tools: Intel VTune, Intel PTU, AMD CodeAnalyst
Intel PTU
(Screenshot: monitored events and per-function counts)
Source: http://software.intel.com/en-us/articles/intel-performance-tuning-utility/
Debugging tools
- Limited functionality
- No access to raw data
- Do not support all features of processors (example: Intel PTU supports only sampling)
Idea: write your own tool
Programming performance counters
- Model-specific registers
  - Access: RDMSR, WRMSR, and RDPMC instructions
  - Ring 0 instructions (available only in kernel mode)
- perf_events interface
  - Standard Linux interface since Linux 2.6.31
  - UNIX philosophy: performance counters are files
  - Simple API: set up counters with perf_event_open(), then read counters as files
Example: measuring MMM cache misses
Example: monitoring cache misses

int main() {
    int pid = fork();
    if (pid == 0) {
        /* child: run the program to be measured */
        exit(execl("./mmm", "./mmm", NULL));
    } else {
        int status;
        uint64_t value;
        int fd = perf_event_open(...);
        waitpid(pid, &status, 0);
        read(fd, &value, sizeof(uint64_t));
        printf("Cache misses: %" PRIu64 "\n", value);
    }
}
perf_event_open()
Looks simple:

int sys_perf_event_open(struct perf_event_attr *hw_event_uptr,
                        pid_t pid, int cpu, int group_fd,
                        unsigned long flags);

...but perf_event_attr has many fields:

struct perf_event_attr {
    u32 type;
    u32 size;
    u64 config;
    union {
        u64 sample_period;
        u64 sample_freq;
    };
    u64 sample_type;
    u64 read_format;
    u64 inherit;
    u64 pinned;
    u64 exclusive;
    u64 exclude_user;
    u64 exclude_kernel;
    u64 exclude_hv;
    u64 exclude_idle;
    u64 mmap;
    /* ... */
};
libpfm
Open-source helper library sitting between the user program and perf_events:
(1) the user program passes an event name to libpfm
(2) libpfm sets up perf_event_attr
(3) the user program calls perf_event_open()
(4) the user program reads the results
Example: measure MMM cache misses
- Determine the microarchitecture: Intel Xeon E5520 = Nehalem microarchitecture
- Look up the event needed (source: Intel Architectures Software Developer's Manual): OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_DRAM
- Measure cache misses
Performance of MMM
(Chart repeated: execution time [s] vs. matrix size 512-8192; ikj (cache-friendly) is up to 20X faster than ijk (cache-unfriendly))
MMM cache misses
(Chart: # cache misses x 10^6, log scale, vs. matrix size 512-8192; ijk (cache-unfriendly) incurs up to 30X more cache misses than ikj (cache-friendly))
Performance of MMM
30X more cache misses cause a 20X performance degradation
Hardware performance counters confirm the assumption
Outline
- Performance: constant factors matter
- Hardware performance counters
  - Simple example: measuring cache misses
  - Advanced uses
    - Sampling
    - Precise information
- Your project
Sampling
So far: counting mode
- Set up counters
- Execute program
- Read counters
Single-phased program
(Diagram: set up performance counters at program start, read performance counters at program end)
Program with multiple phases
(Diagram: set up performance counters once, then get a sample periodically throughout execution, so each phase is observed)
Sampling frequency
- Low sampling frequency: low overhead, but can fail to record changes in program behavior
- High sampling frequency: high overhead, but accurately follows program behavior
- Adaptive sampling
Precise information
- Normal operation: only event counts (e.g., # of cache misses, # of branch instructions retired, etc.)
- Events with more information in each sample (e.g., register contents, instruction latency)
- Intel PEBS, AMD IBS
Today: data address profiling in NUMA systems
Non-uniform memory architecture
(Diagram: Processor 0 with cores 0-3 and Processor 1 with cores 4-7; each processor has its own memory controller (MC) and DRAM, and the processors are linked by an interconnect (IC); thread T on core 0 accesses data)
- Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles
- Remote memory accesses: bandwidth 6.3 GB/s, latency 310 cycles
Key to good performance: data locality
All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO 09], Molka [PACT 09])
Data locality in multithreaded programs
(Chart: remote memory references / total memory references [%], ranging from 0% to 60%, for the NAS Parallel Benchmarks: cg.b, lu.c, ft.b, ep.c, bt.b, sp.b, is.b, mg.c)
Automatic page placement
- Current OS support for NUMA: first-touch page placement, which often results in a high number of remote accesses
- Data address profiling: for each thread, for each memory instruction, record the data address used
Profile-based page placement
(Diagram: thread T0 on Processor 0, thread T1 on Processor 1, each processor with its own DRAM)
Profile:
- Page P0: accessed 1000 times by T0, so place P0 in Processor 0's DRAM
- Page P1: accessed 3000 times by T1, so place P1 in Processor 1's DRAM
Automatic page placement
Compare: first-touch and profile-based page placement
- Machine: 2-processor 8-core Intel Xeon E5520
- Subset of NAS PB: programs with a high fraction of remote accesses
- 8 threads with fixed thread-to-core mapping
Profile-based page placement
(Chart: performance improvement over first-touch [%], ranging from 0% to 25%, for cg.b, lu.c, bt.b, ft.b, sp.b)
Hardware performance counters
- Useful for program optimizations
- Example seen: data locality optimization in NUMA systems
Question: what about my ACD project?
Your project
- Which event to use? Which measurement mode to use? Is precise information needed?
- Answer: it depends on the optimization you choose
- Example 1: loop unrolling - measure retired branch instructions, back-end stalls, i-fetch misses
- Example 2: code positioning - measure i-fetch misses
References
Processor manufacturer's manuals (Intel):
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
Processor manufacturer optimization manuals (Intel):
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
Talk to me: zoltan.majo@inf.ethz.ch