POWER8 Performance Analysis



1 POWER8 Performance Analysis
Satish Kumar Sadasivam
Senior Performance Engineer, Master Inventor
IBM Systems and Technology Labs
satsadas@in.ibm.
Join the conversation at #OpenPOWERSummit

2 Overview
- POWER8 Overview
- Introduction to Performance Monitoring
- Performance Monitoring Features in POWER8
- What's new in POWER8?
- POWER8 Pipeline
- CPI Stack overview
- Stall Accounting Model
- Performance analysis
- CPI analysis
- Data source analysis
- Prefetch control & Prefetch effectiveness
- Application level performance analysis
- Marked event profiling & performance analysis
- Microarchitecture bottleneck analysis
- Core bottleneck analysis using trace tool and scroll pipe

3 POWER8 Processor

4 Improvements over POWER7

5 Cache Improvements

6 Cache Bandwidths

7 Memory Organization

8 Performance Instrumentation in P8
- Hardware Performance Monitoring is critical to enable performance evaluation of applications/programs on complex performance cores such as POWER8
- POWER8 provides advanced instrumentation capabilities in two layers
  - Core Instrumentation
  - Nest level Instrumentation
- Core Level Performance Monitoring
- Nest Level Performance Monitoring

9 Core Level Performance Monitoring
- Key to root-causing performance bottlenecks at the core or thread level
- Facilitates monitoring of
  - Core pipeline efficiency: frontend, branch prediction, execution units, schedulers, etc.
  - Behavior metrics: stalls, execution rates, utilizations, thread prioritization & resource sharing
- Enables understanding and optimization of application performance at the processor and compiler level

10 Nest Level Instrumentation
- Instrumentation at
  - L3 Cache, Interconnect Fabric
  - Memory channels/controller
- Information is provided at the per-core and chip level (as opposed to the thread level for core-level counters)
- Significance & Usefulness:
  - Bandwidth analysis
  - Key for analyzing performance in cloud and virtualized environments
  - Can be used to monitor memory and chip-level characteristics so that cloud capacity is provisioned effectively

11 What's new in POWER8?
- Enhanced CPI Stack Cycle Accounting Model
- Hotness Table
- Branch History Rolling Buffer
- Event-Based Branches
- Prefetch effectiveness events
- Additional events to capture & analyze hardware-level performance issues

12 POWER8 Microarchitecture

13 POWER8 Core Pipeline
- Front end stalls: cycles in which a thread's GCT was empty, i.e. the pipeline was empty for that thread.
- Back end stalls: cycles in which the thread had GCT entries but no completion occurred.

14 POWER8 Group Formation
- Group formation: after instruction fetch, instructions are formed into groups for dispatch and completion tracking.
- Thread priority logic selects up to 8 instructions from the instruction buffers for group formation in each cycle
- Group formation is driven by group formation rules
- Global Completion Table (GCT)
- Completion-based performance bottleneck analysis

15 CPI Analysis
- The cycles-per-instruction (CPI) stack presents a picture of a typical instruction's lifespan from fetch to completion
- Provides information to narrow down the bottleneck point(s) in the processor pipeline
- POWER8 features a completion-based CPI stack accounting model
- Time spent in execution is split into:
  - Group completion cycles
  - Stall cycles
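The split described on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming that total run cycles, group completion cycles, and completed instructions have already been read from the performance counters. The function name `cpi_breakdown` and the sample values are hypothetical and do not come from the talk.

```python
def cpi_breakdown(run_cycles, instructions, completion_cycles):
    """Split total CPI into a completion part and a stall part.

    All three inputs are assumed to be raw counter values collected
    over the same measurement interval (an assumption of this sketch).
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    if not 0 <= completion_cycles <= run_cycles:
        raise ValueError("completion_cycles must lie between 0 and run_cycles")

    # Every cycle that is not a group completion cycle is treated as a stall cycle.
    stall_cycles = run_cycles - completion_cycles
    return {
        "total_cpi": run_cycles / instructions,
        "completion_cpi": completion_cycles / instructions,
        "stall_cpi": stall_cycles / instructions,
    }


# Hypothetical counter values, chosen only to make the arithmetic easy to follow.
example = cpi_breakdown(run_cycles=2_000_000,
                        instructions=1_000_000,
                        completion_cycles=500_000)
print(example)
```

The stall part is what the rest of the CPI stack then breaks down further.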


16 POWER8 CPI Stack
Cycles
- Completion Stalls
  - Stall due to BR or CR
    - Stall due to Branch
    - Stall due to CR
  - Stall due to Fixed-Point
    - Stall due to Fixed-Point Long
    - Stall due to Fixed-Point (Other)
  - Stall due to Vector/Scalar
    - Stall due to Vector
      - Stall due to Vector Long
      - Stall due to Vector (other)
    - Stall due to Scalar
      - Stall due to Scalar Long
      - Stall due to Scalar (other)
    - Stall due to Vector/Scalar (other)
  - Stall due to LSU
    - Stall due to Dcache Miss
    - Stall due to LSU Reject
    - Stall due to Store Finish
    - Stall due to Load Finish
    - Stall due to Store Forward
    - Stall due to Load/Store (other)
  - Stall due to Next-to-Complete Flush
- Waiting to Complete
- Thread Blocked
  - Blocked due to LWSync
  - Blocked due to HWSync
  - Blocked due to ECC Delay
  - Blocked due to Flush
  - Blocked due to COQ Full
  - Thread Blocked (other)
- Completion Table Empty
  - Completion Table Empty due to IC Miss
    - Completion Table Empty due to IC L3 Miss
    - Completion Table Empty due to IC Miss (other)
  - Completion Table Empty due to Branch Mispredict
  - Completion Table Empty due to Branch Mispredict + IC Miss
  - Completion Table Empty due to Dispatch Held
    - Dispatch Held due to Mapper
    - Dispatch Held due to Store Queue
    - Dispatch Held due to Issue Queue
    - Dispatch Held (other)
  - Completion Table Empty (Other)
- Completion Cycles

17 CPI Stack LSU Stalls

18 An Example of CPI Stack
[Chart: CPI Stack, Prefetch OFF vs. Prefetch ON. Components: PM_CMPLU_STALL, PM_NTCG_ALL_FIN, PM_CMPLU_STALL_THRD, PM_GCT_NOSLOT_CYC, PM_GRP_CMPL]
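A comparison like the one charted on this slide can be reproduced from raw counter values. The sketch below assumes that the five events named on the slide form the top level of the stack, and it normalizes each one by the number of completed instructions. The counter values and the instruction count are invented for illustration; they are not the measurements shown in the talk.

```python
# Top-level CPI stack events, as named on the slide.
TOP_LEVEL_EVENTS = (
    "PM_CMPLU_STALL",
    "PM_NTCG_ALL_FIN",
    "PM_CMPLU_STALL_THRD",
    "PM_GCT_NOSLOT_CYC",
    "PM_GRP_CMPL",
)


def cpi_stack(counts, instructions):
    """Convert raw cycle counts per event into CPI contributions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return {event: counts[event] / instructions for event in TOP_LEVEL_EVENTS}


def stack_delta(baseline, candidate):
    """Per-component CPI change going from baseline to candidate."""
    return {event: candidate[event] - baseline[event] for event in TOP_LEVEL_EVENTS}


# Hypothetical raw counts for two runs of the same workload.
INSTRUCTIONS = 1_000
prefetch_off_counts = {
    "PM_CMPLU_STALL": 3_000, "PM_NTCG_ALL_FIN": 100,
    "PM_CMPLU_STALL_THRD": 50, "PM_GCT_NOSLOT_CYC": 200, "PM_GRP_CMPL": 500,
}
prefetch_on_counts = {
    "PM_CMPLU_STALL": 1_500, "PM_NTCG_ALL_FIN": 100,
    "PM_CMPLU_STALL_THRD": 50, "PM_GCT_NOSLOT_CYC": 250, "PM_GRP_CMPL": 500,
}

off = cpi_stack(prefetch_off_counts, INSTRUCTIONS)
on = cpi_stack(prefetch_on_counts, INSTRUCTIONS)
delta = stack_delta(off, on)

# The component with the most negative delta improved the most.
most_improved = min(delta, key=delta.get)

for event in TOP_LEVEL_EVENTS:
    print(f"{event:22s} off={off[event]:.3f} on={on[event]:.3f} delta={delta[event]:+.3f}")
print("largest reduction:", most_improved)
```

Looking at per-component deltas rather than total CPI alone makes it visible when a change helps one part of the stack while slightly hurting another.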


19 CPI Stack Detailed Stall Distribution
[Chart: Completion Stall Components, Prefetch OFF vs. Prefetch ON. Components: PM_CMPLU_STALL_BRU_CRU, PM_CMPLU_STALL_FXU, PM_CMPLU_STALL_VSU, PM_CMPLU_STALL_VECTOR, PM_CMPLU_STALL_SCALAR, PM_CMPLU_STALL_NTCG_FLUSH, PM_CMPLU_STALL_LSU, PM_CMPLU_STALL_DCACHE_MISS, PM_CMPLU_STALL_REJECT, PM_CMPLU_STALL_STORE, PM_CMPLU_STALL_LOAD_FINISH, PM_CMPLU_STALL_ST_FWD]

20 Data Source Analysis
- Analysis of application data accesses across the cache & memory hierarchy is key to understanding the following
  - Performance-limiting factors & resource requirements of the application
  - Scaling capabilities (in multi-threaded scenarios)
- Cache hierarchy latencies:

21 Prefetch Controls
- Prefetch effects:
  - Positive
    - Brings data closer to the core
    - Reduces memory access stalls
  - Possible negative effects:
    - Extra bandwidth consumption, choking other applications' memory accesses
    - Cache pollution
    - Increased power consumption
- POWER8 supports prefetching at the L1 and L3 levels
- DSCR Register (Power ISA v2.07)
  - DPFD: Default Prefetch Depth
  - SSE: Store Stream Enable
  - SNSE: Stride-N Stream Enable
  - LSD: Load Stream Disable
  - URG: Depth Attainment Urgency
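The DSCR fields listed above can be packed into a register value with simple bit operations. The sketch below assumes the following layout, counted from the least-significant bit: DPFD in bits 0 to 2, SSE in bit 3, SNSE in bit 4, LSD in bit 5, and URG in bits 6 to 8. The slide does not give bit positions, so check this layout against Power ISA v2.07 before relying on it. The helper names are hypothetical, and the code only computes values; it does not write the register.

```python
# Assumed field positions (LSB = bit 0); verify against Power ISA v2.07.
DPFD_SHIFT = 0   # Default Prefetch Depth, 3 bits
SSE_SHIFT = 3    # Store Stream Enable, 1 bit
SNSE_SHIFT = 4   # Stride-N Stream Enable, 1 bit
LSD_SHIFT = 5    # Load Stream Disable, 1 bit
URG_SHIFT = 6    # Depth Attainment Urgency, 3 bits
FIELD3_MASK = 0b111


def encode_dscr(dpfd=0, sse=False, snse=False, lsd=False, urg=0):
    """Pack the five fields from the slide into one integer."""
    for name, value in (("dpfd", dpfd), ("urg", urg)):
        if not 0 <= value <= FIELD3_MASK:
            raise ValueError(f"{name} must be in the range 0..7")
    return (
        (dpfd << DPFD_SHIFT)
        | (int(sse) << SSE_SHIFT)
        | (int(snse) << SNSE_SHIFT)
        | (int(lsd) << LSD_SHIFT)
        | (urg << URG_SHIFT)
    )


def decode_dscr(value):
    """Unpack an integer into the five fields from the slide."""
    return {
        "dpfd": (value >> DPFD_SHIFT) & FIELD3_MASK,
        "sse": bool((value >> SSE_SHIFT) & 1),
        "snse": bool((value >> SNSE_SHIFT) & 1),
        "lsd": bool((value >> LSD_SHIFT) & 1),
        "urg": (value >> URG_SHIFT) & FIELD3_MASK,
    }


# Example: deepest default prefetch depth with store streams enabled.
value = encode_dscr(dpfd=7, sse=True)
print(hex(value), decode_dscr(value))
```

On Linux, a value computed this way is typically applied with a utility such as `ppc64_cpu --dscr=<value>` from powerpc-utils.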

22 Studying Prefetch Effectiveness
- POWER8 provides performance events to study prefetch effectiveness
- Counters indicate whether prefetched cache lines had been used or not at the time they are evicted from the cache
- Counters available:
- MEPF metrics are used to evaluate prefetch effectiveness in POWER8
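One way to turn such counters into a single figure is the fraction of prefetched lines that were used before eviction. The sketch below assumes two counts are available: prefetched lines that were used, and prefetched lines that were evicted unused. The parameter names are placeholders rather than actual POWER8 event names, and the sample values are invented.

```python
def prefetch_effectiveness(used_lines, unused_lines):
    """Fraction of prefetched cache lines that were used before eviction.

    Returns None when no prefetched lines were evicted in the interval,
    since the ratio is undefined in that case.
    """
    if used_lines < 0 or unused_lines < 0:
        raise ValueError("counts must be non-negative")
    total = used_lines + unused_lines
    if total == 0:
        return None
    return used_lines / total


# Hypothetical counts for one measurement interval.
ratio = prefetch_effectiveness(used_lines=750, unused_lines=250)
print(f"prefetch effectiveness: {ratio:.2%}")
```

A low ratio would suggest that prefetching mostly consumes bandwidth and pollutes the cache for the workload in question.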

23 Application Profiling tools
- Marked Event Profiling: pinpoints performance-inhibiting behavior/bottlenecks to a specific instruction in application code
- Why necessary?
  - Non-marked events are best suited to studying performance metrics
  - In an out-of-order (OOO), superscalar, multiple-issue processor, profile data from non-marked events can only indicate the code region responsible for performance bottlenecks
  - Code region granularity can range from a few to tens of instructions

24 Example of Marked Event profiling

25 Marked Events: a non-exhaustive list
- PM_MRK_LD_MISS_L1
- PM_MRK_LD_MISS_L1_CYC
- PM_MRK_BR_MPRED_CMPL
- PM_MRK_BR_TAKEN_CMPL
- PM_MRK_DATA_FROM_MEM
- PM_MRK_LSU_REJECT
- PM_MRK_STCX_FAIL
- PM_MRK_GRP_IC_MISS
- PM_MRK_DTLB_MISS
- PM_MRK_ST_FWD
- PM_MRK_LSU_FLUSH
- PM_MRK_LSU_FLUSH_ULD
- PM_MRK_LSU_FLUSH_UST

26 Microarchitecture Analysis
- Deep-dive analysis to root-cause performance inhibitors at processor pipeline stages.
- Tools used:
  - Itrace
  - Cycle Accurate Simulator
- Trace application with valgrind
- Generate qtrace
- simppc
- Microarchitecture Stats
- Scrollpipe
- Analyze & Optimize Application code

27 Tools for Microarchitecture Analysis
- IBM SDK for Linux on Power
- IBM POWER8 Functional Simulator (systemsim)
- Valgrind framework provides application/program tracing capabilities (itrace)
- POWER8 Performance Simulator (sim_ppc)

28 Thank You!
