


For a description of these features, see http://download.intel.com/products/processor/corei7/prod_brief.pdf. The following features greatly affect performance monitoring:

- The new Performance Monitoring Unit (new counters, new meanings).
- Intel Wide Dynamic Execution (this includes the out-of-order capabilities of the processor): adds complexity for performance analysis.
- Intel Hyper-Threading Technology: adds complexity for performance analysis; if you can, it is easier to turn it off during performance analysis.
- Intel Turbo Boost Technology: adds complexity for performance analysis; if you can, it is easier to turn it off during performance analysis.
- The features of the Integrated Memory Controller, Smart Memory Access, QPI, and the 8 MB Intel Smart Cache: add more things which can be, and need to be, measured on Nehalem.

The 3 fixed counters are INST_RETIRED.ANY, CPU_CLK_UNHALTED.THREAD, and CPU_CLK_UNHALTED.REF. Note that each of these counters has a corresponding programmable version (which uses a general-purpose counter): INST_RETIRED.ANY_P, CPU_CLK_UNHALTED.THREAD_P, and CPU_CLK_UNHALTED.REF_P. Note: the new capability to measure uncore events is not supported by VTune analyzer, and the new information gathered when measuring certain events (e.g., load latency in the PEBS record) is not supported by VTune analyzer either.
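As a rough added illustration (not part of the original slides), the generic hardware events exposed by Linux correspond approximately to these three fixed counters and can be read with the perf_event_open(2) syscall. This is a minimal sketch; event availability and the exact mapping vary by kernel and CPU:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Open one hardware counter for the calling thread on any CPU. */
static int open_counter(uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.exclude_kernel = 1;          /* user-mode counts only */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void) {
    /* Roughly INST_RETIRED.ANY, CPU_CLK_UNHALTED.THREAD, and
       CPU_CLK_UNHALTED.REF, respectively. */
    int fd_ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    int fd_cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    int fd_ref = open_counter(PERF_COUNT_HW_REF_CPU_CYCLES);
    if (fd_ins < 0 || fd_cyc < 0 || fd_ref < 0)
        return 1;                     /* counters unavailable */

    /* ... run the code under test here ... */

    uint64_t ins, cyc, ref;
    read(fd_ins, &ins, sizeof ins);
    read(fd_cyc, &cyc, sizeof cyc);
    read(fd_ref, &ref, sizeof ref);
    printf("instructions=%llu cycles=%llu ref-cycles=%llu\n",
           (unsigned long long)ins, (unsigned long long)cyc,
           (unsigned long long)ref);
    return 0;
}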

Both Intel Hyper-Threading Technology and Intel Turbo Boost Technology can be enabled or disabled through the BIOS on most platforms. Contact the system vendor or manufacturer for the specifics of any platform before attempting this. Incorrectly modifying BIOS settings from those supplied by the manufacturer can render the system completely unusable and may void the warranty. Don't forget to re-enable these features once you are through with the software optimization process!

Note: while VTune analyzer can also be helpful in threading an application, this slide set is not aimed at the process of introducing threads. The process described here can be used either before or after threading. Intel Thread Profiler, another performance analysis tool included with Intel VTune Performance Analyzer for Windows, can help determine the efficiency of a parallel algorithm. Remember, for all upcoming slides: focus only on hotspots! Only try to determine efficiency, identify causes, and optimize in hotspots!

For Intel Core i7 processors, the CPU_CLK_UNHALTED.THREAD counter measures unhalted clockticks on a per-thread basis. So for each tick of the CPU's clock, the counters will count 2 ticks (summed across both logical threads) if Hyper-Threading is enabled, and 1 tick if Hyper-Threading is disabled. There is no per-core clocktick counter. There is also a CPU_CLK_UNHALTED.REF counter, which counts unhalted clockticks per thread at the reference frequency for the CPU. In other words, the CPU_CLK_UNHALTED.REF counter should not increase or decrease as a result of frequency changes due to Turbo Boost or SpeedStep technology.
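One practical consequence (an added sketch, not from the slides): over any measurement interval, the ratio of THREAD counts to REF counts gives the average frequency multiplier, which makes Turbo Boost or SpeedStep activity visible:

#include <stdint.h>

/* Estimate average core frequency from the two clocktick counters.
 * ref_ghz is the CPU's rated reference frequency (assumed known).
 * A result above ref_ghz suggests Turbo Boost was active; below it,
 * the core ran at a reduced frequency. */
double avg_core_ghz(uint64_t clk_thread, uint64_t clk_ref, double ref_ghz) {
    return ref_ghz * (double)clk_thread / (double)clk_ref;
}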

The % Execution Stalled and Changes in CPI methods rely on VTune analyzer's sampling. The Code Examination method relies on using VTune analyzer's capability as a source/disassembly viewer.

The UOPS_EXECUTED.CORE_STALL_CYCLES counter measures when the execution stage of the pipeline is stalled. This counter counts per core, not per thread, which has implications for doing performance analysis with Intel Hyper-Threading enabled. For example, if Intel Hyper-Threading is enabled, you won't be able to tell whether the stall cycles are evenly distributed across the threads or occurring on just one thread. For this reason it can be easier to interpret performance data taken while Intel Hyper-Threading is disabled (and enabled again after optimizations are completed). Note: optimized code (e.g., SSE instructions) may actually lower the CPI and increase the stall %, yet still increase performance. CPI and stalls are just general guidance on efficiency; the real measure of efficiency is the work taking less time, and "work" is not something that can be measured by the CPU.

Note that CPI is a ratio: cycles per instruction. So if the code size changes for a binary, CPI will change. In general, if CPI decreases as a result of optimizations, that is good, and if it increases, that is bad. However, there are exceptions! Some code can have a very low CPI but still be inefficient because more instructions are executed than are needed; this problem is discussed under the Code Examination method for determining efficiency. Another note: CPI will be doubled when using Intel Hyper-Threading. With Intel Hyper-Threading enabled, there are actually 2 different definitions of CPI; we call them "per-thread CPI" and "per-core CPI". The per-thread CPI will be twice the per-core CPI. Only convert between per-thread and per-core CPI when viewing aggregate CPIs (summed for all logical threads).
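As a small added sketch of the ratios these two slides describe (counter values as they might be exported from a sampling tool; the helper names are ours, not VTune analyzer's):

#include <stdint.h>

/* Cycles per instruction for one logical thread. */
double cpi_per_thread(uint64_t clk_unhalted_thread, uint64_t inst_retired) {
    return (double)clk_unhalted_thread / (double)inst_retired;
}

/* Aggregate per-core CPI with Hyper-Threading on: both threads' cycle
 * counts sum to twice the core's cycles, hence the factor of 0.5. */
double cpi_per_core(uint64_t clk_sum_both_threads, uint64_t inst_sum_both) {
    return 0.5 * (double)clk_sum_both_threads / (double)inst_sum_both;
}

/* Percent of core cycles in which the execution stage issued no uops,
 * from UOPS_EXECUTED.CORE_STALL_CYCLES and per-core clockticks. */
double pct_execution_stalled(uint64_t stall_cycles, uint64_t core_cycles) {
    return 100.0 * (double)stall_cycles / (double)core_cycles;
}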

This method involves looking at the disassembly to make sure the most efficient instruction streams are generated. This can be complex and can require expert knowledge of the Intel instruction set and compiler technology. The next 2 slides describe how to find the 2 issues that are easiest to detect and fix.

*Advanced note: Intel SSE has always included movups, but on previous IA-32 and Intel 64 processors movups was not the best way to load data from memory. On Intel Core i7 processors, movups and movupd are as fast as movaps and movapd on aligned data, so when the Intel Compiler uses /QxSSE4.2 it removes the if-condition that checks for aligned data, speeding up loops and reducing the number of instructions generated. + addss is a scalar Intel SSE instruction, which is better than x87 instructions but not as good as packed SSE instructions.
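To make the scalar-versus-packed distinction concrete, here is an added sketch using SSE intrinsics: the loop typically compiles to one addss per element, while the packed version does four lanes per addps and uses the unaligned load discussed above:

#include <xmmintrin.h>

/* Scalar path: one float per operation (addss when not vectorized). */
void add4_scalar(float *a, const float *b) {
    for (int i = 0; i < 4; i++)
        a[i] += b[i];
}

/* Packed path: four floats per operation. On Core i7, movups
 * (_mm_loadu_ps) is as fast as movaps when the data happens to be
 * aligned, so the unaligned load costs nothing in that case. */
void add4_packed(float *a, const float *b) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(a, _mm_add_ps(va, vb));
}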


These are issues that result in stall cycles and high CPI. It is important to go through them in order, as the most likely problem areas are placed first.

Note that all applications will have some branch mispredicts. As much as 10% of branches being mispredicted could be considered normal, depending on the application; however, anything above 10% should be investigated. With branch mispredicts, it is not the number of mispredicts that is the problem but their impact. The equations above show how to determine the impact of the mispredicts on the front-end (through instruction starvation) and the back-end (through wasted execution cycles given to wrongly speculated code).
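As an added illustration of reducing mispredict impact (one common fix, not taken from the slides), an unpredictable branch in a hot loop can often be replaced with branch-free arithmetic:

/* Count elements above a threshold. On random data the branchy form
 * mispredicts roughly half the time; the branch-free form gives the
 * CPU nothing to speculate on. */
int count_above_branchy(const int *v, int n, int t) {
    int c = 0;
    for (int i = 0; i < n; i++)
        if (v[i] > t)        /* unpredictable on random data */
            c++;
    return c;
}

int count_above_branchless(const int *v, int n, int t) {
    int c = 0;
    for (int i = 0; i < n; i++)
        c += (v[i] > t);     /* comparison result used as 0 or 1 */
    return c;
}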

The pipeline of the Intel Core i7 processor consists of many stages. Some stages are collectively referred to as the front-end: these are the stages responsible for fetching, decoding, and issuing instructions and micro-operations. The back-end contains the stages responsible for dispatching, executing, and retiring instructions and micro-operations. These two parts of the pipeline are decoupled and separated by a buffer (in fact, buffering is used throughout the pipeline), so when the front-end is stalled it does NOT necessarily mean the back-end is stalled. Front-end stalls are a problem when they cause instruction starvation, meaning no micro-operations are being issued to the back-end. If instruction starvation occurs for a long enough period of time, the back-end may stall.

This counter only measures 4K false store forwarding, but if 4K false store forwarding is detected, your code is probably also being impacted by set-associative cache issues and potentially even memory bank issues, all resulting in more cache misses and slower data access.

4K false store forwarding: to speed things up when processing stores and loads, the CPU uses a subset of the address bits to determine whether the store-forwarding optimization can occur; if those bits match, it starts to process the load. Later it fully resolves the addresses and determines that the load is not from the same address as the store, so the processor must fetch the appropriate data.

Set-associative cache issues: set-associative caches are organized into ways. If data elements sit in memory at addresses exactly 2^N apart such that they fall in the same set, they take multiple entries in that set; on a 4-way set-associative cache, the fifth access that maps to the same set forces one of the other entries out, effectively turning a large multi-KB/MB cache into an N-entry cache. The level 1, level 2, and level 3 caches of different processor SKUs have different sizes, line sizes, and associativities. Skipping through memory at exact 2^N boundaries (2K, 4K, 16K, 256K) will cause the cache to evict entries more quickly; the exact size depends on the processor and the cache.

Memory bank issues: typically (and for best performance) computers are installed with multiple DIMMs of RAM, since no single DIMM can keep up with requests from the CPU. The system determines which data goes to which DIMM via some bits in the address; skipping through memory at large 2^N strides may cause all or most of the accesses to hit the same DIMM.
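All three problems share the same trigger, a 2^N stride, so the classic mitigation is to pad the stride. A small added sketch (the array sizes are illustrative, not from the slides):

/* A 1024-float row is exactly 4096 bytes, so walking a column hits the
 * same cache set (and repeats the same low address bits) on every
 * access. Padding each row by one 64-byte cache line spreads them. */
#define ROWS 1024

float unpadded[ROWS][1024];      /* 4096-byte stride: aliases badly  */
float padded[ROWS][1024 + 16];   /* 4160-byte stride: no 2^N pattern */

float sum_column(int col) {
    float s = 0.0f;
    for (int r = 0; r < ROWS; r++)
        s += padded[r][col];
    return s;
}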

FP_ASSIST.ALL: it would be nice to use this counter, but it only works for x87 code (some of the manuals misreport that this event works on SSE code). UOPS_DECODED.MS counts the number of uops decoded by the Microcode Sequencer (MS). The MS delivers uops when an instruction is more than 4 uops long or a microcode assist is occurring. There is no public definitive list of events/instructions that will cause this to fire; if there is a high percentage of UOPS_DECODED.MS, try to deduce why the processor is issuing extra instructions or causing an exception in your code. One uop does not necessarily equal one cycle, but the result is close enough to determine whether it is affecting your overall time.
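One well-known cause of microcode assists in floating-point code (our addition, not from the slides) is denormal arithmetic. SSE provides flush-to-zero and denormals-are-zero modes that avoid these assists, at the cost of strict IEEE-754 gradual underflow:

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

/* Treat tiny results (FTZ) and tiny inputs (DAZ) as zero so they never
 * trigger slow assisted denormal handling. Only do this if the
 * algorithm tolerates losing gradual underflow. */
void enable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}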

To target data locality to the TLB size: this is accomplished via data blocking and trying to minimize random access patterns (a blocking sketch follows below). Note: TLB misses are more likely to occur with server applications or applications with a large, random dataset.
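A minimal added sketch of the data-blocking idea, with a hypothetical tile size that would be tuned against the page and TLB budget of the target machine:

#define N 4096
#define BLOCK 64   /* hypothetical tile edge; tune per machine */

/* Blocked transpose: each BLOCK x BLOCK tile is finished before moving
 * on, so the set of pages touched at any moment stays within TLB reach
 * instead of sweeping the whole matrix per pass. */
void transpose_blocked(double dst[N][N], const double src[N][N]) {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    dst[j][i] = src[i][j];
}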
