POWER8 Performance Analysis

Satish Kumar Sadasivam
Senior Performance Engineer, Master Inventor
IBM Systems and Technology Labs
satsadas@in.ibm.com
Join the conversation at #OpenPOWERSummit

Overview
- POWER8 Overview
- Introduction to Performance Monitoring
- Performance Monitoring Features in POWER8
  - What's new in POWER8?
  - POWER8 Pipeline
  - CPI Stack overview
  - Stall Accounting Model
- Performance analysis
  - CPI analysis
  - Data source analysis
  - Prefetch control & prefetch effectiveness
- Application-level performance analysis
  - Marked event profiling & performance analysis
- Microarchitecture bottleneck analysis
  - Core bottleneck analysis using trace tool and scroll pipe

POWER8 Processor

Improvements over POWER7

Cache Improvements

Cache Bandwidths

Memory Organization

Performance Instrumentation in P8
- Hardware performance monitoring is critical to enable performance evaluation of applications/programs on complex, high-performance cores such as POWER8
- POWER8 provides advanced instrumentation capabilities in two layers:
  - Core-level instrumentation (Core Level Performance Monitoring)
  - Nest-level instrumentation (Nest Level Performance Monitoring)

Core Level Performance Monitoring
- Key to root-causing performance bottlenecks at the core or thread level
- Facilitates monitoring of:
  - Core pipeline efficiency: front end, branch prediction, execution units, schedulers, etc.
  - Behavior metrics: stalls, execution rates, utilization, thread prioritization and resource sharing
- Enables understanding and optimization of application performance at the processor and compiler level

Nest Level Instrumentation
- Instrumentation at the L3 cache, interconnect fabric, and memory channels/controller
- Information is provided at the per-core and chip level (as opposed to the thread level for core-level counters)
- Significance & usefulness:
  - Bandwidth analysis
  - Key for analyzing performance in virtualized cloud environments
  - Can be used to monitor memory and chip-level characteristics so that cloud capacity is provisioned effectively

What's new in POWER8?
- Enhanced CPI Stack Cycle Accounting Model
- Hotness Table
- Branch History Rolling Buffer
- Event-Based Branches
- Prefetch effectiveness events
- Additional events to capture and analyze hardware-level performance issues

POWER8 Microarchitecture

POWER8 Core Pipeline
- Front-end stalls: cycles in which a thread's Global Completion Table (GCT) was empty, i.e. the pipeline was empty for that thread
- Back-end stalls: cycles in which the thread had GCT entries but no completion occurred
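The two stall definitions can be read as a per-cycle classification rule. The following is a minimal sketch of that rule, not the hardware's accounting logic: the function name, the `(gct_entries, group_completed)` trace format, and the sample trace are all hypothetical and exist only to illustrate the distinction.

```python
from collections import Counter


def classify_cycle(gct_entries: int, group_completed: bool) -> str:
    """Classify one cycle of one thread using the slide's two definitions."""
    if gct_entries == 0:
        # GCT empty: the pipeline holds nothing for this thread.
        return "front_end_stall"
    if not group_completed:
        # GCT has entries, but no group completed this cycle.
        return "back_end_stall"
    return "completion"


# Hypothetical per-cycle trace: (GCT entries held by the thread, did a group complete?)
trace = [(0, False), (0, False), (2, False), (2, True), (1, False), (1, True)]

breakdown = Counter(classify_cycle(entries, done) for entries, done in trace)
for kind, cycles in breakdown.items():
    print(f"{kind}: {cycles} cycles")
```

In this toy trace each category receives two cycles; on real hardware the same split is reported by performance counters rather than reconstructed from a trace.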

POWER8 Group Formation
- Group formation: after instruction fetch, instructions are formed into groups for dispatch and completion tracking
- Thread priority logic selects up to 8 instructions per cycle from the instruction buffers for group formation
- Group formation is driven by group formation rules
- Global Completion Table (GCT)
- Completion-based performance bottleneck analysis

CPI Analysis
- The cycles-per-instruction (CPI) stack presents a picture of a typical instruction's lifespan from fetch to completion
- Provides information to narrow down the bottleneck point(s) in the processor pipeline
- POWER8 features a completion-based CPI stack accounting model
- Time spent in execution is split into:
  - Group completion cycles
  - Stall cycles
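As a rough illustration of how a CPI stack is derived, each component's cycle count can be divided by the number of completed instructions, so that the components add up to the overall CPI. The sketch below assumes this simple normalization; the function name and all counts are hypothetical, not measurements.

```python
def cpi_stack(cycle_counts: dict[str, int], instructions: int) -> dict[str, float]:
    """Convert per-component cycle counts into per-instruction CPI contributions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return {name: cycles / instructions for name, cycles in cycle_counts.items()}


# Hypothetical counts for one run, using the two-way split named on the slide.
counts = {"group_completion_cycles": 400_000, "stall_cycles": 1_600_000}
completed_instructions = 1_000_000

stack = cpi_stack(counts, completed_instructions)
total_cpi = sum(stack.values())

for name, contribution in stack.items():
    print(f"{name}: {contribution:.3f}")
print(f"total CPI: {total_cpi:.3f}")
```

With these made-up numbers, stalls account for 1.6 of the 2.0 cycles per instruction, which is the kind of observation that directs attention to the stall breakdown.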

POWER8 CPI Stack
- Cycles
  - Completion Stalls
    - Stall due to BR or CR
      - Stall due to Branch
      - Stall due to CR
    - Stall due to Fixed-Point
      - Stall due to Fixed-Point Long
      - Stall due to Fixed-Point (other)
    - Stall due to Vector/Scalar
      - Stall due to Vector
        - Stall due to Vector Long
        - Stall due to Vector (other)
      - Stall due to Scalar
        - Stall due to Scalar Long
        - Stall due to Scalar (other)
      - Stall due to Vector/Scalar (other)
    - Stall due to LSU
      - Stall due to Dcache Miss
      - Stall due to LSU Reject
      - Stall due to Store Finish
      - Stall due to Load Finish
      - Stall due to Store Forward
      - Stall due to Load/Store (other)
    - Stall due to Next-to-Complete Flush
  - Waiting to Complete
  - Thread Blocked
    - Blocked due to LWSync
    - Blocked due to HWSync
    - Blocked due to ECC Delay
    - Blocked due to Flush
    - Blocked due to COQ Full
    - Thread Blocked (other)
  - Completion Table Empty
    - Completion Table Empty due to IC Miss
      - Completion Table Empty due to IC L3 Miss
      - Completion Table Empty due to IC Miss (other)
    - Completion Table Empty due to Branch Mispredict
    - Completion Table Empty due to Branch Mispredict + IC Miss
    - Dispatch Held
      - Dispatch Held due to Mapper
      - Dispatch Held due to Store Queue
      - Dispatch Held due to Issue Queue
      - Dispatch Held (other)
    - Completion Table Empty (other)
  - Completion Cycles
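The "(other)" entries in the stack read like residual buckets. One plausible interpretation, assumed here rather than stated on the slide, is that each "(other)" value is the parent's cycle count minus the sum of its named children. The sketch below shows that arithmetic for the load/store group; the helper name, the dictionary keys, and the numbers are hypothetical.

```python
def other_component(parent_cycles: int, child_cycles: dict[str, int]) -> int:
    """Residual cycles of a parent not explained by its named children (assumed rule)."""
    residual = parent_cycles - sum(child_cycles.values())
    if residual < 0:
        raise ValueError("children account for more cycles than the parent")
    return residual


# Hypothetical cycle counts for the "Stall due to LSU" group.
lsu_total = 900
lsu_children = {
    "dcache_miss": 500,
    "lsu_reject": 50,
    "store_finish": 100,
    "load_finish": 80,
    "store_forward": 20,
}

lsu_other = other_component(lsu_total, lsu_children)
print(f"Stall due to Load/Store (other): {lsu_other} cycles")
```

A check of this kind also doubles as a sanity test on collected data: a negative residual would suggest the counters were sampled over different intervals.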

CPI Stack LSU Stalls

An Example of CPI Stack
[Figure: "CPI Stack" chart comparing Prefetch OFF and Prefetch ON, CPI axis from 0.000 to 3.000. Components: PM_CMPLU_STALL, PM_NTCG_ALL_FIN, PM_CMPLU_STALL_THRD, PM_GCT_NOSLOT_CYC, PM_GRP_CMPL]

CPI Stack Detailed Stall Distribution
[Figure: "Completion Stall Components" chart comparing Prefetch OFF and Prefetch ON, axis from 0.000 to 4.000. Components: PM_CMPLU_STALL_BRU_CRU, PM_CMPLU_STALL_FXU, PM_CMPLU_STALL_VSU, PM_CMPLU_STALL_VECTOR, PM_CMPLU_STALL_SCALAR, PM_CMPLU_STALL_NTCG_FLUSH, PM_CMPLU_STALL_LSU, PM_CMPLU_STALL_DCACHE_MISS, PM_CMPLU_STALL_REJECT, PM_CMPLU_STALL_STORE, PM_CMPLU_STALL_LOAD_FINISH, PM_CMPLU_STALL_ST_FWD]

Data Source Analysis
- Analysis of application data accesses across the cache and memory hierarchy is key to understanding:
  - Performance-limiting factors and resource requirements of the application
  - Scaling capabilities (in multi-threaded scenarios)
- Cache hierarchy latencies:

Prefetch Controls
- Positive effects of prefetching:
  - Brings data closer to the core
  - Reduces memory access stalls
- Possible negative effects:
  - Extra bandwidth consumption, choking other applications' memory accesses
  - Cache pollution
  - Increased power consumption
- POWER8 supports prefetching at the L1 and L3 levels
- DSCR register (Power ISA v2.07) fields:
  - DPFD: Default Prefetch Depth
  - SSE: Store Stream Enable
  - SNSE: Stride-N Stream Enable
  - LSD: Load Stream Disable
  - URG: Depth Attainment Urgency

Studying Prefetch Effectiveness
- POWER8 provides performance events to study prefetch effectiveness
- Counters indicate, at the time a prefetched cache line is evicted from the cache, whether that line was used or not
- Counters available: MEPF metrics are used to evaluate prefetch effectiveness in POWER8
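Given counts of prefetched lines that were used and unused at eviction, one natural summary is the fraction that were used. The sketch below assumes that ratio as the metric; it is not taken from the slides, and the function name and counts are hypothetical rather than actual POWER8 event names or measurements.

```python
def prefetch_effectiveness(used_at_eviction: int, unused_at_eviction: int) -> float:
    """Fraction of evicted prefetched lines that were referenced before eviction."""
    if used_at_eviction < 0 or unused_at_eviction < 0:
        raise ValueError("counts must be non-negative")
    total = used_at_eviction + unused_at_eviction
    if total == 0:
        return 0.0  # No prefetched lines were evicted, so there is nothing to rate.
    return used_at_eviction / total


# Hypothetical counter readings for one run.
ratio = prefetch_effectiveness(used_at_eviction=750, unused_at_eviction=250)
print(f"prefetch effectiveness: {ratio:.0%}")
```

A low ratio would point to the negative effects of prefetching, such as cache pollution and wasted bandwidth, and might motivate a shallower prefetch depth.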

Application Profiling Tools
- Marked Event Profiling: pinpoints performance-inhibiting behavior and bottlenecks to a specific instruction in the application code
- Why is it necessary?
  - Non-marked events are best suited to studying performance metrics
  - In an out-of-order (OOO), superscalar, multiple-issue processor, profile data from non-marked events can only indicate the code region responsible for a performance bottleneck
  - Code region granularity can range from a few to tens of instructions
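Because a marked event is tied to a specific instruction, profile samples can be aggregated by instruction address rather than by code region. The sketch below shows that aggregation on a hand-written sample list. The `(address, event)` sample format, the addresses, and the function name are hypothetical; only the event names come from the deck's list of marked events.

```python
from collections import Counter


def hottest_instructions(samples: list[tuple[int, str]], event: str,
                         top: int = 3) -> list[tuple[int, int]]:
    """Return the instruction addresses with the most samples for one marked event."""
    counts = Counter(address for address, name in samples if name == event)
    return counts.most_common(top)


# Hypothetical samples: (instruction address, marked event that fired on it).
samples = [
    (0x10000A40, "PM_MRK_LD_MISS_L1"),
    (0x10000A40, "PM_MRK_LD_MISS_L1"),
    (0x10000A58, "PM_MRK_LD_MISS_L1"),
    (0x10000A40, "PM_MRK_LD_MISS_L1"),
    (0x10000A60, "PM_MRK_BR_MPRED_CMPL"),
]

for address, hits in hottest_instructions(samples, "PM_MRK_LD_MISS_L1"):
    print(f"{address:#x}: {hits} samples")
```

In this toy data the load at `0x10000a40` collects three of the four L1-miss samples, which is the instruction-level attribution that non-marked events cannot provide.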

Example of Marked Event Profiling

Marked Events (a non-exhaustive list)
- PM_MRK_LD_MISS_L1
- PM_MRK_LD_MISS_L1_CYC
- PM_MRK_BR_MPRED_CMPL
- PM_MRK_BR_TAKEN_CMPL
- PM_MRK_DATA_FROM_MEM
- PM_MRK_LSU_REJECT
- PM_MRK_STCX_FAIL
- PM_MRK_GRP_IC_MISS
- PM_MRK_DTLB_MISS
- PM_MRK_ST_FWD
- PM_MRK_LSU_FLUSH
- PM_MRK_LSU_FLUSH_ULD
- PM_MRK_LSU_FLUSH_UST

Microarchitecture Analysis
- Deep-dive analysis to root-cause performance inhibitors at the processor pipeline stages
- Tools used: itrace, cycle-accurate simulator
- Workflow:
  - Trace the application with Valgrind (itrace)
  - Generate a qtrace
  - Run the cycle-accurate simulator (sim_ppc)
  - Inspect the microarchitecture stats and Scrollpipe output
  - Analyze and optimize the application code

Tools for Microarchitecture Analysis
- IBM SDK for Linux on Power
- IBM POWER8 Functional Simulator (systemsim)
- Valgrind framework, which provides application/program tracing capabilities (itrace)
- POWER8 Performance Simulator (sim_ppc)
- https://www-304.ibm.com/webapp/set2/sas/f/lopdiags/sdklop.html

Thank You!