POWER8 Performance Analysis
- Reynold Walton
- 10 years ago
1 POWER8 Performance Analysis
Satish Kumar Sadasivam, Senior Performance Engineer, Master Inventor, IBM Systems and Technology Labs
Join the conversation at #OpenPOWERSummit
2 POWER8 Overview
- Overview
  - Introduction to Performance Monitoring
  - Performance Monitoring Features in POWER8
  - What's new in POWER8?
  - POWER8 Pipeline
  - CPI Stack overview
  - Stall Accounting Model
- Performance analysis
  - CPI analysis
  - Data source analysis
  - Prefetch control & prefetch effectiveness
- Application level performance analysis
  - Marked event profiling & performance analysis
- Microarchitecture bottleneck analysis
  - Core bottleneck analysis using trace tool and scroll pipe
3 POWER8 Processor
4 Improvements over POWER7
5 Cache Improvements
6 Cache Bandwidths
7 Memory Organization
8 Performance Instrumentation in P8
- Hardware performance monitoring is critical for evaluating the performance of applications and programs on complex high-performance cores such as POWER8.
- POWER8 provides advanced instrumentation capabilities in two layers:
  - Core-level instrumentation (Core Level Performance Monitoring)
  - Nest-level instrumentation (Nest Level Performance Monitoring)
9 Core Level Performance Monitoring
- Key to root-causing performance bottlenecks at the core or thread level
- Facilitates monitoring of:
  - Core pipeline efficiency: front end, branch prediction, execution units, schedulers, etc.
  - Behavior metrics: stalls, execution rates, utilizations, thread prioritization & resource sharing
- Enables understanding and optimization of application performance at the processor and compiler level
10 Nest Level Instrumentation
- Instrumentation at:
  - L3 cache
  - Interconnect fabric
  - Memory channels/controller
- Information is provided per core and per chip (as opposed to per thread for core-level counters)
- Significance & usefulness:
  - Bandwidth analysis
  - Key for analyzing performance in virtualized cloud environments
  - Can be used to monitor memory and chip-level characteristics so that cloud capacity is provisioned effectively
11 What's new in POWER8?
- Enhanced CPI Stack Cycle Accounting Model
- Hotness Table
- Branch History Rolling Buffer
- Event-Based Branches
- Prefetch effectiveness events
- Additional events to capture & analyze hardware-level performance issues
12 POWER8 Microarchitecture
13 POWER8 Core Pipeline
- Front-end stalls: cycles in which a thread's GCT was empty, i.e. the pipeline was empty for that thread.
- Back-end stalls: cycles in which the thread had GCT entries but no completion occurred.
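The two stall definitions above can be read as a per-cycle classification rule. The following is a minimal sketch of that rule, not a description of the hardware counters; the function name, the trace format, and the sample values are all hypothetical.

```python
from collections import Counter


def classify_cycle(gct_entries: int, completed: bool) -> str:
    """Classify one cycle of one thread using the slide's two definitions."""
    if gct_entries == 0:
        # GCT empty: nothing is in flight for this thread.
        return "front_end_stall"
    if not completed:
        # Work is in flight, but no group completed this cycle.
        return "back_end_stall"
    return "completion"


# Hypothetical per-cycle trace: (GCT entries held by the thread, did a group complete?)
trace = [(0, False), (0, False), (3, False), (3, True), (2, False), (2, True)]
breakdown = Counter(classify_cycle(entries, done) for entries, done in trace)
print(dict(breakdown))
```

Summing each class over a run gives the coarse front-end / back-end split that the CPI stack then refines.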
14 POWER8 Group Formation
- Group formation: after instruction fetch, instructions are formed into groups for dispatch and completion tracking.
- Thread priority logic selects up to 8 instructions from the instruction buffers for group formation in each cycle.
- Group formation is driven by group formation rules.
- Global Completion Table (GCT)
- Completion-based performance bottleneck analysis
15 CPI Analysis
- The cycles-per-instruction (CPI) stack presents a picture of a typical instruction's lifespan from fetch to completion.
- Provides information to narrow down the bottleneck point(s) in the processor pipeline.
- POWER8 features a completion-based CPI stack accounting model.
- Time spent in execution is split into:
  - Group completion cycles
  - Stall cycles
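As a worked illustration of the split above, the sketch below divides total CPI into a completion part and a stall part. It assumes three raw counts are available from some profiling run (total cycles, completed instructions, and cycles in which a group completed); the function name and the numbers are hypothetical, not measurements.

```python
def cpi_breakdown(cycles: int, instructions: int, completion_cycles: int) -> dict:
    """Split CPI into completion and stall contributions.

    Every cycle that is not a group-completion cycle is treated as a stall
    cycle, which matches the two-way split on the slide.
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    stall_cycles = cycles - completion_cycles
    return {
        "cpi": cycles / instructions,
        "completion_cpi": completion_cycles / instructions,
        "stall_cpi": stall_cycles / instructions,
    }


# Hypothetical counts for illustration only.
result = cpi_breakdown(cycles=1_000_000, instructions=500_000, completion_cycles=300_000)
print(result)
```

The two parts always add up to the total CPI, which is what makes the stack form useful: each refinement only subdivides an existing slice.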
16 POWER8 CPI Stack
- Cycles
  - Completion Stalls
    - Stall due to BR or CR
      - Stall due to Branch
      - Stall due to CR
    - Stall due to Fixed-Point
      - Stall due to Fixed-Point Long
      - Stall due to Fixed-Point (Other)
    - Stall due to Vector/Scalar
      - Stall due to Vector
        - Stall due to Vector Long
        - Stall due to Vector (other)
      - Stall due to Scalar
        - Stall due to Scalar Long
        - Stall due to Scalar (other)
      - Stall due to Vector/Scalar (other)
    - Stall due to LSU
      - Stall due to Dcache Miss
      - Stall due to LSU Reject
      - Stall due to Store Finish
      - Stall due to Load Finish
      - Stall due to Store Forward
      - Stall due to Load/Store (other)
    - Stall due to Next-to-Complete Flush
  - Waiting to Complete
  - Thread Blocked
    - Blocked due to LWSync
    - Blocked due to HWSync
    - Blocked due to ECC Delay
    - Blocked due to Flush
    - Blocked due to COQ Full
    - Thread Blocked (other)
  - Completion Table Empty
    - Completion Table Empty due to IC Miss
      - Completion Table Empty due to IC L3 Miss
      - Completion Table Empty due to IC Miss (other)
    - Completion Table Empty due to Branch Mispredict
    - Completion Table Empty due to Branch Mispredict + IC Miss
    - Completion Table Empty Dispatch Held
      - Dispatch Held due to Mapper
      - Dispatch Held due to Store Queue
      - Dispatch Held due to Issue Queue
      - Dispatch Held (other)
    - Completion Table Empty (Other)
  - Completion Cycles
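Several entries in the stack are "(other)" buckets. One plausible way to obtain such a bucket, sketched here as an assumption rather than as the documented accounting method, is to subtract the named child components from their parent. The event names are taken from the stall-component list in this deck; the counts are hypothetical.

```python
def other_component(parent: int, children: dict) -> int:
    """Return the part of a parent stall count not explained by its named children.

    Counters sampled in separate runs may not add up exactly, so the
    residual is clamped at zero instead of going negative.
    """
    residual = parent - sum(children.values())
    return max(residual, 0)


# Hypothetical counts for the LSU branch of the stack.
lsu_total = 400  # PM_CMPLU_STALL_LSU
lsu_children = {
    "PM_CMPLU_STALL_DCACHE_MISS": 250,
    "PM_CMPLU_STALL_REJECT": 30,
    "PM_CMPLU_STALL_STORE": 40,
    "PM_CMPLU_STALL_LOAD_FINISH": 20,
    "PM_CMPLU_STALL_ST_FWD": 10,
}
lsu_other = other_component(lsu_total, lsu_children)
print("Load/Store (other):", lsu_other)
```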
17 CPI Stack LSU Stalls
18 An Example of CPI Stack
[Figure: CPI stack chart comparing Prefetch OFF and Prefetch ON. Components: PM_CMPLU_STALL, PM_NTCG_ALL_FIN, PM_CMPLU_STALL_THRD, PM_GCT_NOSLOT_CYC, PM_GRP_CMPL]
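A comparison like the one in this chart can be sketched by normalizing each top-level component by the number of completed instructions and then taking the difference between the two runs. The sketch assumes each listed event can be treated as a cycle count; the numbers are hypothetical and are not the values from the original chart.

```python
COMPONENTS = [
    "PM_CMPLU_STALL",
    "PM_NTCG_ALL_FIN",
    "PM_CMPLU_STALL_THRD",
    "PM_GCT_NOSLOT_CYC",
    "PM_GRP_CMPL",
]


def cpi_stack(counts: dict, instructions: int) -> dict:
    """Convert raw per-component cycle counts into CPI contributions."""
    return {name: counts[name] / instructions for name in COMPONENTS}


def stack_delta(before: dict, after: dict) -> dict:
    """Per-component CPI change; negative values mean fewer cycles per instruction."""
    return {name: after[name] - before[name] for name in COMPONENTS}


# Hypothetical counts for two runs of the same workload (1000 instructions each).
prefetch_off = cpi_stack(
    {"PM_CMPLU_STALL": 1500, "PM_NTCG_ALL_FIN": 100, "PM_CMPLU_STALL_THRD": 50,
     "PM_GCT_NOSLOT_CYC": 250, "PM_GRP_CMPL": 400}, 1000)
prefetch_on = cpi_stack(
    {"PM_CMPLU_STALL": 900, "PM_NTCG_ALL_FIN": 100, "PM_CMPLU_STALL_THRD": 50,
     "PM_GCT_NOSLOT_CYC": 250, "PM_GRP_CMPL": 400}, 1000)

delta = stack_delta(prefetch_off, prefetch_on)
for name in COMPONENTS:
    print(f"{name:22s} {prefetch_off[name]:.2f} -> {prefetch_on[name]:.2f} ({delta[name]:+.2f})")
```

Reading the delta per component shows which slice of the stack a tuning change actually moved, rather than only the change in total CPI.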
19 CPI Stack Detailed Stall Distribution
[Figure: completion stall components, comparing Prefetch OFF and Prefetch ON. Components: PM_CMPLU_STALL_BRU_CRU, PM_CMPLU_STALL_FXU, PM_CMPLU_STALL_VSU, PM_CMPLU_STALL_VECTOR, PM_CMPLU_STALL_SCALAR, PM_CMPLU_STALL_NTCG_FLUSH, PM_CMPLU_STALL_LSU, PM_CMPLU_STALL_DCACHE_MISS, PM_CMPLU_STALL_REJECT, PM_CMPLU_STALL_STORE, PM_CMPLU_STALL_LOAD_FINISH, PM_CMPLU_STALL_ST_FWD]
20 Data Source Analysis
- Analysis of application data accesses across the cache & memory hierarchy is key to understanding:
  - Performance-limiting factors & resource requirements of the application
  - Scaling capabilities (in multi-threaded scenarios)
- Cache hierarchy latencies:
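To illustrate why the data-source mix matters, the sketch below combines per-source access counts with per-source latencies into an average miss latency. The source labels, counts, and latencies are hypothetical placeholders chosen for easy arithmetic; they are not POWER8 figures.

```python
def average_miss_latency(counts: dict, latency_cycles: dict) -> float:
    """Weighted average latency of L1 misses, weighted by where the data came from."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples")
    return sum(counts[src] * latency_cycles[src] for src in counts) / total


# Hypothetical data-source counts and placeholder latencies (in cycles).
source_counts = {"L2": 700, "L3": 200, "memory": 100}
placeholder_latency = {"L2": 12, "L3": 30, "memory": 300}

avg = average_miss_latency(source_counts, placeholder_latency)
shares = {src: n / sum(source_counts.values()) for src, n in source_counts.items()}
print(f"average miss latency: {avg:.1f} cycles, shares: {shares}")
```

Even with only 10% of misses served from memory in this made-up mix, memory accounts for most of the average latency, which is the kind of conclusion data source analysis is meant to surface.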
21 Prefetch Controls
- Prefetch effects:
  - Positive:
    - Brings data closer to the core
    - Reduces memory access stalls
  - Possible negative effects:
    - Extra bandwidth consumption, choking other applications' memory accesses
    - Cache pollution
    - Increased power consumption
- POWER8 supports prefetching at the L1 and L3 levels
- DSCR register (Power ISA v2.07):
  - DPFD: Default Prefetch Depth
  - SSE: Store Stream Enable
  - SNSE: Stride-N Stream Enable
  - LSD: Load Stream Disable
  - URG: Depth Attainment Urgency
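The DSCR fields listed above can be unpacked from a raw register value with simple masks. The bit positions used below are an assumption based on one reading of the Power ISA v2.07 layout (DPFD in the three least significant bits, then SSE, SNSE, LSD, and a three-bit URG field above them) and should be checked against the ISA document before being relied on; the function name is hypothetical.

```python
# Assumed field layout, counted from the least significant bit.
DPFD_MASK = 0x7   # bits 0-2: Default Prefetch Depth
SSE_BIT = 1 << 3  # Store Stream Enable
SNSE_BIT = 1 << 4  # Stride-N Stream Enable
LSD_BIT = 1 << 5  # Load Stream Disable
URG_SHIFT = 6     # bits 6-8: Depth Attainment Urgency
URG_MASK = 0x7


def decode_dscr(value: int) -> dict:
    """Split a raw DSCR value into the fields named on the slide."""
    return {
        "DPFD": value & DPFD_MASK,
        "SSE": bool(value & SSE_BIT),
        "SNSE": bool(value & SNSE_BIT),
        "LSD": bool(value & LSD_BIT),
        "URG": (value >> URG_SHIFT) & URG_MASK,
    }


fields = decode_dscr(0x1F)
print(fields)
```

A decoder like this is mainly useful for logging the prefetch configuration alongside counter data, so that runs with different settings are not compared by accident.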
22 Studying Prefetch Effectiveness
- POWER8 provides performance events for studying prefetch effectiveness.
- Counters indicate, at the time a prefetched cache line is evicted from the cache, whether that line was used or went unused.
- Counters available: MEPF
- Metrics are used to evaluate prefetch effectiveness in POWER8.
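One metric that follows naturally from used/unused-at-eviction counts is the fraction of prefetched lines that were referenced before being evicted. The sketch below assumes two such counts are available; the parameter names and values are hypothetical and do not correspond to specific POWER8 event names.

```python
from typing import Optional


def prefetch_usefulness(used_lines: int, unused_lines: int) -> Optional[float]:
    """Fraction of evicted prefetched lines that were used before eviction.

    Returns None when no prefetched lines were evicted, since the ratio
    is undefined in that case.
    """
    total = used_lines + unused_lines
    if total == 0:
        return None
    return used_lines / total


# Hypothetical counts for illustration only.
ratio = prefetch_usefulness(used_lines=800, unused_lines=200)
print(f"useful prefetches: {ratio:.0%}")
```

A low ratio suggests the prefetcher is mostly polluting the cache and consuming bandwidth, which ties back to the possible negative effects listed for prefetch controls.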
23 Application Profiling Tools
- Marked Event Profiling: pinpoint performance-inhibiting behavior/bottlenecks to a specific instruction in application code
- Why necessary?
  - Non-marked events are best suited to studying performance metrics.
  - In an out-of-order (OOO), superscalar, multiple-issue processor, profile data from non-marked events can only indicate the code region responsible for a performance bottleneck.
  - Code region granularity can range from a few to tens of instructions.
24 Example of Marked Event profiling
25 Marked Events: a non-exhaustive list
- PM_MRK_LD_MISS_L1
- PM_MRK_LD_MISS_L1_CYC
- PM_MRK_BR_MPRED_CMPL
- PM_MRK_BR_TAKEN_CMPL
- PM_MRK_DATA_FROM_MEM
- PM_MRK_LSU_REJECT
- PM_MRK_STCX_FAIL
- PM_MRK_GRP_IC_MISS
- PM_MRK_DTLB_MISS
- PM_MRK_ST_FWD
- PM_MRK_LSU_FLUSH
- PM_MRK_LSU_FLUSH_ULD
- PM_MRK_LSU_FLUSH_UST
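Because a marked event is attributed to a specific instruction, a profile can be reduced to a ranking of instruction addresses per event. The sketch below assumes samples have already been collected as (instruction address, event name) pairs; that format, the function name, and the addresses are hypothetical, while the event names come from the list above.

```python
from collections import Counter


def hot_instructions(samples, event: str, top: int = 3):
    """Rank instruction addresses by how often a given marked event hit them."""
    hits = Counter(addr for addr, name in samples if name == event)
    return hits.most_common(top)


# Hypothetical samples: (instruction address, marked event name).
samples = [
    (0x1000, "PM_MRK_LD_MISS_L1"),
    (0x1008, "PM_MRK_LD_MISS_L1"),
    (0x1000, "PM_MRK_LD_MISS_L1"),
    (0x1010, "PM_MRK_BR_MPRED_CMPL"),
    (0x1000, "PM_MRK_LD_MISS_L1"),
]

ranking = hot_instructions(samples, "PM_MRK_LD_MISS_L1")
for addr, count in ranking:
    print(f"{addr:#x}: {count} samples")
```

Mapping the top addresses back to source lines is then a job for the usual symbolization tools.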
26 Microarchitecture Analysis
- Deep-dive analysis to root-cause performance inhibitors at processor pipeline stages.
- Tools used:
  - Itrace
  - Cycle Accurate Simulator
- Flow: trace application with valgrind -> generate qtrace -> simppc -> microarchitecture stats -> Scrollpipe -> analyze & optimize application code
27 Tools for Microarchitecture Analysis
- IBM SDK for Linux on Power
- IBM POWER8 Functional Simulator (systemsim)
- Valgrind framework provides application/program tracing capabilities (itrace)
- POWER8 Performance Simulator (sim_ppc)
28 Thank You!
