Performance Counter. Non-Uniform Memory Access Seminar Karsten Tausche 2014-12-10

Similar documents

STUDY OF PERFORMANCE COUNTERS AND PROFILING TOOLS TO MONITOR PERFORMANCE OF APPLICATION

Performance Application Programming Interface

A Study of Performance Monitoring Unit, perf and perf_events subsystem

Hardware performance monitoring. Zoltán Majó

PAPI - PERFORMANCE API. ANDRÉ PEREIRA ampereira@di.uminho.pt

Full and Para Virtualization

Building an energy dashboard. Energy measurement and visualization in current HPC systems

On the Importance of Thread Placement on Multicore Architectures

Perfmon2: A leap forward in Performance Monitoring

Xeon+FPGA Platform for the Data Center

Intel DPDK Boosts Server Appliance Performance White Paper

Parallel Algorithm Engineering

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis

Multi-Threading Performance on Commodity Multi-Core Processors

Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering

Overview. CPU Manufacturers. Current Intel and AMD Offerings

Intel Xeon Processor E Product Family Uncore Performance Monitoring Guide March 2012

Basics of VTune Performance Analyzer. Intel Software College. Objectives. VTune Performance Analyzer. Agenda

Innovativste XEON Prozessortechnik für Cisco UCS

OpenMP Programming on ScaleMP

Perf Tool: Performance Analysis Tool for Linux

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and Dell PowerEdge M1000e Blade Enclosure

Example of Standard API

Basics of Virtualisation

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Virtualization. Dr. Yingwu Zhu

Performance of Software Switching

Microsoft Windows Server 2003 with Internet Information Services (IIS) 6.0 vs. Linux Competitive Web Server Performance Comparison

KVM: Kernel-based Virtualization Driver

Enabling Technologies for Distributed Computing

Enabling Technologies for Distributed and Cloud Computing

Know your Cluster Bottlenecks and Maximize Performance

Intel Data Direct I/O Technology (Intel DDIO): A Primer >

VMware Server 2.0 Essentials. Virtualization Deployment and Management

Agenda. Context. System Power Management Issues. Power Capping Overview. Power capping participants. Recommendations

CS 377: Operating Systems. Outline. A review of what you ve learned, and how it applies to a real operating system. Lecture 25 - Linux Case Study

Virtual Machines.

Thomas Fahrig Senior Developer Hypervisor Team. Hypervisor Architecture Terminology Goals Basics Details

Hardware Performance Monitoring with PAPI

Putting it all together: Intel Nehalem.

How System Settings Impact PCIe SSD Performance

WHITE PAPER FUJITSU PRIMERGY SERVERS PERFORMANCE REPORT PRIMERGY BX620 S6

Release Notes for Open Grid Scheduler/Grid Engine. Version: Grid Engine

AMD Opteron Quad-Core

How To Monitor Performance On A Microsoft Powerbook (Powerbook) On A Network (Powerbus) On An Uniden (Powergen) With A Microsatellite) On The Microsonde (Powerstation) On Your Computer (Power

Virtualization. Types of Interfaces

D5.6 Prototype demonstration of performance monitoring tools on a system with multiple ARM boards Version 1.0

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION

ScaAnalyzer: A Tool to Identify Memory Scalability Bottlenecks in Parallel Programs

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Knut Omang Ifi/Oracle 19 Oct, 2015

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

ELEC 377. Operating Systems. Week 1 Class 3

KVM: A Hypervisor for All Seasons. Avi Kivity avi@qumranet.com

Chapter 3 Operating-System Structures

Introducing the IBM Software Development Kit for PowerLinux

OC By Arsene Fansi T. POLIMI

Performance monitoring at CERN openlab. July 20 th 2012 Andrzej Nowak, CERN openlab

FRONT FLYLEAF PAGE. This page has been intentionally left blank

Memory Channel Storage ( M C S ) Demystified. Jerome McFarland

RCL: Software Prototype

How To Write A Windows Operating System (Windows) (For Linux) (Windows 2) (Programming) (Operating System) (Permanent) (Powerbook) (Unix) (Amd64) (Win2) (X

White Paper. Real-time Capabilities for Linux SGI REACT Real-Time for Linux

ACANO SOLUTION VIRTUALIZED DEPLOYMENTS. White Paper. Simon Evans, Acano Chief Scientist

An OS-oriented performance monitoring tool for multicore systems

Intel Xeon Processor E5-2600

İSTANBUL AYDIN UNIVERSITY

Models For Modeling and Measuring the Performance of a Xen Virtual Server

Technical Paper. Moving SAS Applications from a Physical to a Virtual VMware Environment

Performance Monitoring of the Software Frameworks for LHC Experiments

The Lagopus SDN Software Switch. 3.1 SDN and OpenFlow. 3. Cloud Computing Technology

Design Cycle for Microprocessors

RPM Brotherhood: KVM VIRTUALIZATION TECHNOLOGY

Introduction to Virtual Machines

Operating System Impact on SMT Architecture

Resource Scheduling Best Practice in Hybrid Clusters

Virtualization. Jukka K. Nurminen

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Hardware Based Virtualization Technologies. Elsie Wahlig Platform Software Architect

Data Structure Oriented Monitoring for OpenMP Programs

Exascale Challenges and General Purpose Processors. Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation

features at a glance

CPS221 Lecture: Operating System Structure; Virtual Machines

Basic Performance Measurements for AMD Athlon 64, AMD Opteron and AMD Phenom Processors

Accelerate SQL Server 2014 AlwaysOn Availability Groups with Seagate. Nytro Flash Accelerator Cards

Week 1 out-of-class notes, discussions and sample problems

Performance Characteristics of Large SMP Machines

Eloquence Training What s new in Eloquence B.08.00

IBM Tivoli Composite Application Manager for Microsoft Applications: Microsoft Hyper-V Server Agent Version Fix Pack 2.

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

Research and Design of Universal and Open Software Development Platform for Digital Home

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

NVIDIA Tools For Profiling And Monitoring. David Goodwin

Transcription:

Performance Counter Non-Uniform Memory Access Seminar Karsten Tausche 2014-12-10

Performance Counter Hardware Unit for event measurements Performance Monitoring Unit (PMU) Originally for CPU-Debugging used by manufacturers Model Specific Register (MSR) Program and access PMUs 2014-12-10 NUMA Seminar - Karsten Tausche 2

Categories Total instruction count Branch instruction (total/conditional) Load and Store instructions Arithmetic instructions Cache events Uncore events 2014-12-10 NUMA Seminar - Karsten Tausche 3

Motivation Analyze CPU behavior on instruction level High Performance Computing CPU simulators Embedded Unmodified Code Does not effect cache contents Performance Counter vs. Instrumentation No performance impact with active PMUs 2014-12-10 NUMA Seminar - Karsten Tausche 4

Outline Accessing PMUs Performance Counters on Intel Xeon Definitions Core/Uncore events Accuracy Analysis Dependencies Reducing Inaccuracies Tools Intel Performance Counter Monitor Perf PAPI Intel Perf 2014-12-10 NUMA Seminar - Karsten Tausche 5

Counter Access Select counted events Either fixed function PMU Or multiple countable events per PMU Configure counters Start/stop Count in kernel/user-mode Read/write values Programming via MSR bitfields CPU model specific field definitions MSRs accessible only in kernel mode 2014-12-10 NUMA Seminar - Karsten Tausche 6

Counter Access PMU event counting per core/hyper thread Kernel mode driver and user space library/tool Counter per software thread Generalized PMU programming with event IDs Access in user mode, without root privileges [Abstraction from platform/operating system] 2014-12-10 NUMA Seminar - Karsten Tausche 7

Performance Counter on Intel Xeon instructions_retired Executed by speculative out-of-order pipeline Handled by Retirement Unit Results visible to user 2014-12-10 NUMA Seminar - Karsten Tausche 8

Performance Counter on Intel Xeon Uncore per socket resources Intel Xeon E5-2600 Family (Sandy Bridge EP) Box modular uncore unit 2014-12-10 NUMA Seminar - Karsten Tausche 9

Core Performance Counter on Intel Xeon Instructions counts total, branches, arithmetic, 2014-12-10 NUMA Seminar - Karsten Tausche 10

Core Performance Counter on Intel Xeon Intel Turbo Boost frequencies 2014-12-10 NUMA Seminar - Karsten Tausche 11

Core Performance Counter on Intel Xeon L1, L2 Cache hits and misses 2014-12-10 NUMA Seminar - Karsten Tausche 12

Uncore Performance Counter on Intel Xeon Last Level Cache (shared L3 cache) hits and misses 2014-12-10 NUMA Seminar - Karsten Tausche 13

Uncore Performance Counter on Intel Xeon Home Agent memory controller and cache coherency memory read/write, local/remote, conflicts, directory/snooping Pbox Physical connection between cores or sockets 2014-12-10 NUMA Seminar - Karsten Tausche 14

Uncore Performance Counter on Intel Xeon Integrated Memory Controller DRAM access read/write queues, ECC correctable errors, refreshes, thermal throttling 2014-12-10 NUMA Seminar - Karsten Tausche 15

Uncore Performance Counter on Intel Xeon QPI Ring Link Layer (socket interconnect) Filter event counts: physical address, Home Node ID, instruction 2014-12-10 NUMA Seminar - Karsten Tausche 16

Uncore Performance Counter on Intel Xeon QPI Ring Link Layer (socket interconnect) link speed, transfers, total link utilization 2014-12-10 NUMA Seminar - Karsten Tausche 17

Uncore Performance Counter on Intel Xeon Power Controller Unit Socket energy usage DRAM energy usage 2014-12-10 NUMA Seminar - Karsten Tausche 18

Uncore Performance Counter on Intel Xeon Power Controller Unit per core temperature time spent in power states 2014-12-10 NUMA Seminar - Karsten Tausche 19

Performance Counters per Box QPI: 4 per port + 3 PCU: 4 Ubox (system config): 2 CBox: 4 PCIe: 4 Home Agent: 4 imc: 4 per channel 2014-12-10 NUMA Seminar - Karsten Tausche 20

Performance Counters on Intel Xeon ubuntu-numa0101.fsoc: Linux 3.13, 2x Intel Xeon E5-2620 57 (+x) Performance Monitoring Units per socket 634 countable events Allowing comprehensive runtime analysis Mostly focused on a few context specific events 2014-12-10 NUMA Seminar - Karsten Tausche 21

Accuracy Not defined/guarantied by manufacturer At least not for Intel/AMD Speculative architecture Out-of-order pipeline, serving multiple computing units Branch prediction Hardware parallelization CPU behavior depending on timings, cache-contents, etc. 2014-12-10 NUMA Seminar - Karsten Tausche 22

Accuracy Analysis [Weaver2013] Non-determinism Identical run, different result Overcount Same (wrong) result for identical runs 2014-12-10 NUMA Seminar - Karsten Tausche 23

Accuracy Analysis [Weaver2013] One deterministic event without overcount on Sandy Bridge EP: BR_INST_RETIRED_CONDITIONAL (executed conditional branch instructions) PMU hardly usable for deterministic implementations Deterministic replay Deterministic threading libraries CPU simulators Don t use micro-operation-counter System specific, undocumented low-level instructions in CISC processors 2014-12-10 NUMA Seminar - Karsten Tausche 24

Inaccuracy Sources [Weaver2013] Hardware Interrupts: increment most events Nondeterministic overcount Wrong/unintuitive counter behavior Floating point instructions with wait for exception count twice? Count µops instead of retired instructions (e.g., load/store events) Accessing counters Requires system call (interrupt, context switch) 2014-12-10 NUMA Seminar - Karsten Tausche 25

Reduce Inaccuracies Carefully controlled test environment Kernel version, tool versions Running processes BIOS/Power saving settings Compiler/Runtime configuration E.g., Address space layout randomization 2014-12-10 NUMA Seminar - Karsten Tausche 26

Prefer low-level APIs Reduce Inaccuracies Prefer dynamic library over command line tool Compare tools/libraries [Read papers] [Check errors with well known Assembly] 2014-12-10 NUMA Seminar - Karsten Tausche 27

Error Rates Counted instructions in the micro benchmark use by [Weaver2013] Integer divides Nehalem: 11.2% Nehalem-EX: 1.1% Sandy Bridge EP: n/a Ivy Bridge: 2.8% Floating point instructions Nehalem: 0.0% Nehalem EX 0.0% Sandy Bridge EP: 0.003% Ivy Bridge: 0.08% 2014-12-10 NUMA Seminar - Karsten Tausche 28

Using Performance Counters Use Profilers first Using automated tests Partly implemented with Performance Counter Optimize as much as possible Low level analysis with Performance Counters Optimize problematic code sections Platform specific optimization Benefit from minimal overhead 2014-12-10 NUMA Seminar - Karsten Tausche 29

Tools and Libraries Intel Performance Counter Monitor Perfmon/libpfm Linux Perf PAPI 2014-12-10 NUMA Seminar - Karsten Tausche 30

Intel Performance Counter Monitor Tools and C++ programming interface Full support for Intel core/uncore events Supports newer Intel Xeon, Core i, Atom Uncore mainly available on server platforms 2014-12-10 NUMA Seminar - Karsten Tausche 31

Intel Performance Counter Monitor Provided by Intel as source code Driver; GUI/command line tools; C++ library Linux build upon MSR kernel module KDE: ksysguard plug-in Windows compile/modify sample driver Perfmon plug-in 2014-12-10 NUMA Seminar - Karsten Tausche 32

Intel Performance Counter Monitor Ksysguard (KDE System Monitor) 2014-12-10 NUMA Seminar - Karsten Tausche 33

Intel Performance Counter Monitor Windows Performance Monitor 2014-12-10 NUMA Seminar - Karsten Tausche 34

perf Performance Counters for Linux Part of Linux since v2.6.31 (2009) before: patch and compile your kernel Command-line tool perf (userspace) Debian/Ubuntu-Package linux-tools perf list Detects supported events But no support for finding relevant events perf stat -e [eventname] command Run command and count eventname 2014-12-10 NUMA Seminar - Karsten Tausche 35

Linux: perfmon2 / libpfm Originated from own kernel subsystem Kernel driver: perfmon2, userspace library: libpfm Superseded by Linux perf_events interface Part of kernel since v2.6.31, 2009 Libpfm3 IA64-subsystem in Linux by HP Libpfm4 complete rewrite by Google 2014-12-10 NUMA Seminar - Karsten Tausche 36

Linux: libpfm4 Using Linux perf_events interface libpfm4 Retrieve supported events per source Translating event IDs and names Program events Architecture support Intel x86: since Pentium P6, Core Duo/Solo, Atom, Nehalem AMD64 x86: K7, K8 and newer; uncore since Bulldozer Some ARM, SPARC, IBM Power, MIPS models 2014-12-10 NUMA Seminar - Karsten Tausche 37

Libpfm4 on ubuntu-numa0101.fsoc Source code: examples and tools showevtinfo (example tool) Lists supported and detected PMUs Xeon E5-2620 (Sandy Bridge EP): Lists 4196 available events, 634 supported Per event: PMU, index, parameters, description 2014-12-10 NUMA Seminar - Karsten Tausche 38

Libpfm4 on ubuntu-numa0101.fsoc showeventinfo IDX : 37748741 PMU name : ix86arch (Intel X86 architectural PMU) Name : BRANCH_INSTRUCTIONS_RETIRED Equiv : None Flags : None Desc : count branch instructions at retirement. Specifically, this event counts the retirement of the last micro-op of a branch instruction Code : 0xc4 Modif-00 : 0x00 : PMU : [k] : monitor at priv level 0 (boolean) Modif-01 : 0x01 : PMU : [u] : monitor at priv level 1, 2, 3 (boolean) Modif-02 : 0x02 : PMU : [e] : edge level (may require counter-mask >= 1) (boolean) Modif-03 : 0x03 : PMU : [i] : invert (boolean) Modif-04 : 0x04 : PMU : [c] : counter-mask in range [0-255] (integer) Modif-05 : 0x05 : PMU : [t] : measure any thread (boolean) 2014-12-10 NUMA Seminar - Karsten Tausche 39

PAPI Performance Application Programming Interface Platform independent PMU interface Provide standard definitions for performance metrics easy to use, well documented, freely available Windows support discontinued after XP Preset Events Supported across [nearly] all platforms High Level API Simplified access to Preset Events Low Level API Adds access to native events 2014-12-10 NUMA Seminar - Karsten Tausche 40

PAPI Performance Application Programming Interface High Level API: core events only Uncore still requiring Low Level API papi_avail Check available events papi_event_chooser List events that can be combined with a given list papi_native_avail Detailed information about native events 2014-12-10 NUMA Seminar - Karsten Tausche 41

Demo perf perf stat -e instructions:u -e cache-misses:u./cache-miss count total instructions and cache-misses, both in user mode perf stat -r 10 run command 10 times, print average and stddev libpfm4 examples in libpfm/libpfm-4.5.0/examples./showevtinfo less lists first PMUs supported by libpfm and detected on your system also lists all supported events with description and parameters 2014-12-10 NUMA Seminar - Karsten Tausche 42

Demo sources /NUMASem/demo_PerformanceCounter/ cache-miss / cache-miss2 use perf stat to check cache misses with different array iteration patterns libpfm_cache-miss Uses libpfm4 library calls to count cache events List of measured events is defined by eventnames string vector This is a simplified version of libpfm examples: /NUMASem/libpfm-4.5.0/perf_examples/self.c libpfm_qpi_remote Uses libpfm4 to count QPI events and libnuma to pin the task and allocated memory to different sockets. See line 75 to switch between allocating memory on local or remote socket. 2014-12-10 NUMA Seminar - Karsten Tausche 43

Sources Zaparanuks, D.; Jovic, M.; Hauswirth, M., Accuracy of performance counter measurements, 2009 Weaver, V.M.; Terpstra, D.; Moore, S. Non-determinism and overcount on modern hardware performance counter implementations, 2013 Intel 64 and IA-32 Architectures Software Developer s Manual Volume 3B: System Programming Guide, Part 2, September 2014 https://software.intel.com/en-us/articles/intel-performance-countermonitor-a-better-way-to-measure-cpu-utilization http://perfmon2.sourceforge.net/ http://icl.cs.utk.edu/projects/papi/wiki/main_page Linux man pages 2014-12-10 NUMA Seminar - Karsten Tausche 44