Hardware performance monitoring. Zoltán Majó
1 Hardware performance monitoring. Zoltán Majó
2 Question: Did you take any of these lectures? Computer Architecture and System Programming; How to Write Fast Numerical Code; Design of Parallel and High Performance Computing.
3 Program performance. Algorithmic complexity is decisive: e.g., O(n) is better than O(n^2). Constant factors matter as well: e.g., n^2 operations are better than *n operations, and 3*n^3 operations are better than 500*n^3 operations. Constant factors are in many cases hardware-dependent. Today's example: dense matrix multiplication (MMM). Complexity: O(n^3). Hardware: cache-based architecture.
4 Algorithm: MMM. C = A × B, where i indexes the rows of A and C, and j indexes the columns of B and C.
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
5 Hardware: cache-based architecture. Diagram: CPU, cache (35 cycles access latency), and RAM (200 cycles access latency). Further labels: size of the double type, cache line size.
6 MMM: Putting it together. Diagram: CPU, cache, and RAM holding C, A, and B. Cache hits: A[][] ?, B[][] ?; total accesses: A[][] 6, B[][] 6.
7 MMM: Putting it together. Cache hits: A[][] 3, B[][] ?; total accesses: A[][] 6, B[][] 6.
8 MMM: Putting it together. Cache hits: A[][] 3, B[][] ?.
9 MMM: Cache performance. Hit rate: accesses to A[][]: 3/6 = 50%; accesses to B[][]: 0/6 = 0%; all accesses: 3/12 = 25%. Can we do better?
10 Cache-friendly MMM. C = A × B. Both versions accumulate into C, so C must start out as zero.
Cache-unfriendly MMM (ijk):
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] += sum;
        }
Cache-friendly MMM (ikj):
    for (i = 0; i < n; i++)
        for (k = 0; k < n; k++) {
            r = A[i][k];
            for (j = 0; j < n; j++)
                C[i][j] += r * B[k][j];
        }
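The two loop orders can be packaged as functions for timing experiments. The following is a minimal sketch, not the code used in the talk: it assumes square matrices stored as flat row-major arrays, and the names mmm_ijk and mmm_ikj are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* C = A * B for n x n row-major matrices, ijk loop order.
 * The inner loop walks B column-wise (stride n), which is cache-unfriendly. */
void mmm_ijk(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* Same product, ikj loop order.
 * The inner loop walks both B and C row-wise (stride 1). */
void mmm_ikj(size_t n, const double *A, const double *B, double *C)
{
    memset(C, 0, n * n * sizeof(double));   /* C is accumulated into */
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double r = A[i * n + k];
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += r * B[k * n + j];
        }
}
```

Both functions compute the same product. Timing them (for example with clock() from <time.h>) on matrices much larger than the cache should expose the kind of gap the slides discuss, although the exact factor depends on the machine and the compiler flags.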
11 Cache-friendly MMM. Diagram: CPU, cache, and RAM holding C and B. Cache hits / total accesses: C[][] 3/6, B[][] 3/6.
12 Cache-friendly MMM. Cache-unfriendly MMM (ijk): A[][]: 3/6 = 50% hit rate; B[][]: 0/6 = 0% hit rate; all accesses: 25% hit rate. Cache-friendly MMM (ikj): C[][]: 3/6 = 50% hit rate; B[][]: 3/6 = 50% hit rate; all accesses: 50% hit rate. Better performance due to cache-friendliness?
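The hit counts in this comparison can be reproduced with a toy model. The sketch below rests on two assumptions that are not stated in the transcription: a cache line holds two doubles, and only the most recently touched line of an array stays cached. The function name count_hits is hypothetical.

```c
#include <stddef.h>

/* Count cache hits for `accesses` consecutive accesses to one array,
 * where successive accesses are `stride` elements apart and a cache
 * line holds `line_elems` elements. Toy model: an access hits only if
 * it falls in the same line as the previous access. */
size_t count_hits(size_t accesses, size_t stride, size_t line_elems)
{
    size_t hits = 0;
    size_t last_line = (size_t)-1;          /* nothing cached yet */
    for (size_t a = 0; a < accesses; a++) {
        size_t line = (a * stride) / line_elems;
        if (line == last_line)
            hits++;
        last_line = line;
    }
    return hits;
}
```

Under these assumptions, count_hits(6, 1, 2) models a row-wise walk (A in ijk, C and B in ikj) and gives 3 hits out of 6, while count_hits(6, 6, 2) models the column-wise walk over B in ijk for a 6 x 6 matrix and gives 0 hits out of 6.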
13 Performance of MMM. [Plot: execution time [s] versus matrix size for ijk (cache-unfriendly) and ikj (cache-friendly).]
14 Performance of MMM. [Same plot, with an annotation marking the gap between the two curves.]
15 Program performance. MMM: constant factors matter. Understanding constant factors requires access to the algorithm, the implementation, the inputs, and the architecture... but often not all of these are available. We may have only the binary that we want to run fast. Do we know the architecture?
16 Cache-based architecture. Diagram: two processor packages; each core has its own L1 and L2 cache, each package has an L3 cache, and the packages are connected to RAM.
17 Microarchitecture of a core. [Picture.]
18 Outline: Performance: constant factors matter; Hardware performance counters; Simple example: measuring cache misses; Advanced uses; Your project.
19 Hardware performance counters. Special registers, programmable to monitor a given hardware event (e.g., cache misses). Low-level information about hardware-software interaction. Low overhead due to hardware implementation. In the past: undocumented feature. Since the Intel Pentium: publicly available description. Debugging tools: Intel VTune, Intel PTU, AMD CodeAnalyst.
20 Intel PTU. [Screenshot: monitored events and per-function counts.]
21 Debugging tools. Limited functionality: no access to raw data; do not support all features of processors. Example: Intel PTU supports only sampling. Idea: write your own tool.
22 Programming performance counters. Model-specific registers. Access: RDMSR, WRMSR, and RDPMC instructions; ring 0 instructions (available only in kernel mode). perf_events interface: standard Linux interface since Linux; UNIX philosophy: performance counters are files. Simple API: set up counters with perf_event_open(); read counters as files. Example: measuring MMM cache misses.
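23 Example: monitoring cache misses.
    #include <inttypes.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main() {
        pid_t pid = fork();
        if (pid == 0) {
            /* child: run the program to be measured */
            exit(execl("./mmm", "./mmm", (char *)NULL));
        } else {
            int status;
            uint64_t value;
            int fd = perf_event_open(...);  /* arguments elided on the slide */
            waitpid(pid, &status, 0);       /* wait for the child to finish */
            read(fd, &value, sizeof(uint64_t));
            printf("Cache misses: %" PRIu64 "\n", value);
        }
    }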
24 perf_event_open(). Looks simple:
    int sys_perf_event_open(struct perf_event_attr *hw_event_uptr,
                            pid_t pid, int cpu, int group_fd,
                            unsigned long flags);

    struct perf_event_attr {
        u32 type;
        u32 size;
        u64 config;
        union {
            u64 sample_period;
            u64 sample_freq;
        };
        u64 sample_type;
        u64 read_format;
        u64 inherit;
        u64 pinned;
        u64 exclusive;
        u64 exclude_user;
        u64 exclude_kernel;
        u64 exclude_hv;
        u64 exclude_idle;
        u64 mmap;
        /* ... */
    };
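To make the call and the attribute structure concrete, here is a minimal sketch of a counting-mode setup on Linux. It is an illustration under assumptions, not the talk's code: it uses the generic PERF_COUNT_HW_CACHE_MISSES event rather than a microarchitecture-specific one, and the helper names init_cache_miss_attr and open_cache_miss_counter are hypothetical. glibc ships no wrapper for the system call, so it is invoked through syscall().

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper: glibc does not provide perf_event_open() itself. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Fill attr for counting the generic "cache misses" hardware event. */
void init_cache_miss_attr(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HARDWARE;
    attr->size = sizeof(*attr);
    attr->config = PERF_COUNT_HW_CACHE_MISSES;
    attr->inherit = 1;          /* also count in children of the target */
    attr->exclude_kernel = 1;   /* user-space events only */
    attr->exclude_hv = 1;       /* ignore hypervisor events */
}

/* Open a counter attached to process `pid` on any CPU.
 * Returns a file descriptor, or -1 with errno set on failure. */
int open_cache_miss_counter(pid_t pid)
{
    struct perf_event_attr attr;
    init_cache_miss_attr(&attr);
    return (int)perf_event_open(&attr, pid, -1, -1, 0);
}
```

The returned descriptor can be read like a file to obtain a 64-bit count. The counter starts when it is opened, so events from before that point are not included, and opening may fail if the kernel's perf settings or the environment (for example a container without PMU access) do not allow it.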
25 libpfm. Open-source helper library. Workflow: (1) the user program passes an event name to libpfm; (2) libpfm sets up perf_event_attr; (3) the user program calls perf_event_open() in perf_events; (4) the user program reads the results.
26 Example: measure MMM cache misses. Determine the microarchitecture: Intel Xeon E5520, Nehalem microarchitecture. Look up the event needed. Source: Intel Architectures Software Developer's Manual.
27 Software Developer's Manual. [Excerpt from the manual.]
28 Example: measure MMM cache misses. Event name: OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_DRAM. Measure cache misses.
29 Performance of MMM. [Plot: execution time [s] versus matrix size for ijk (cache-unfriendly) and ikj (cache-friendly), with an annotation marking the gap between the two curves.]
30 MMM cache misses. [Plot: number of cache misses (in millions) versus matrix size for ijk (cache-unfriendly) and ikj (cache-friendly).]
31 MMM cache misses. [Same plot, with an annotation marking the gap between the two curves.]
32 Performance of MMM. 30X more cache misses cause a 20X performance degradation. Hardware performance counters confirm the assumption.
33 Outline: Performance: constant factors matter; Hardware performance counters; Simple example: measuring cache misses; Advanced uses (sampling, precise information); Your project.
34 Sampling. So far: counting mode: set up counters, execute program, read counters.
35 Single-phased program. [Timeline: set up performance counters at the start, read performance counters at the end.]
36 Program with multiple phases. [Timeline: set up performance counters at the start, then get samples during execution.]
37-38 Program with multiple phases. [Further steps of the same timeline.]
39 Sampling frequency. Low sampling frequency: low overhead, but can fail to record changes in program behavior. High sampling frequency: high overhead, but accurately follows program behavior. Adaptive sampling.
40 Precise information. Normal operation: only event counts, e.g., number of cache misses, number of branch instructions retired, etc. Events with more information in each sample, e.g., register contents, instruction latency: Intel PEBS, AMD IBS. Today: data address profiling in NUMA systems.
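In perf_events terms, sampling with precise information is requested through a few more fields of perf_event_attr. The sketch below shows one plausible configuration for data address sampling. It is an assumption-laden illustration: the event is a generic placeholder, the function name init_address_sampling_attr is hypothetical, and whether the kernel accepts precise sampling for a given event depends on the processor.

```c
#include <linux/perf_event.h>
#include <string.h>

/* Configure attr to take one sample every `period` events and to record,
 * for each sample, the thread ID and the data address involved. */
void init_address_sampling_attr(struct perf_event_attr *attr,
                                unsigned long long period)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CACHE_MISSES;  /* placeholder event */
    attr->sample_period = period;   /* larger period: lower overhead */
    attr->sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_ADDR;
    attr->precise_ip = 2;           /* request zero-skid (precise) samples */
    attr->exclude_kernel = 1;       /* user-space events only */
}
```

The period is the knob behind the overhead trade-off: a small period follows program behavior closely at a higher cost, a large one is cheap but coarse. Samples are not read with read(); the kernel delivers them through a ring buffer that the tool maps with mmap() on the counter's file descriptor.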
41 Non-uniform memory architecture. Diagram: Processor 0 (Cores 0-3) and Processor 1 (Cores 4-7); each processor has a memory controller (MC) attached to its own DRAM and an interconnect (IC) linking it to the other processor.
42 Non-uniform memory architecture. Thread T runs on Core 0 and accesses data in the local DRAM. Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles.
43 Non-uniform memory architecture. Thread T runs on Core 0 and accesses data in the DRAM of the other processor. Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles. Remote memory accesses: bandwidth 6.3 GB/s, latency 310 cycles. Key to good performance: data locality. All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO '09], Molka [PACT '09]).
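The latency figures give a quick feel for why locality matters. As a back-of-the-envelope sketch, the function below computes the average memory access latency for a given fraction of remote accesses, using the 190-cycle and 310-cycle values quoted above. It deliberately ignores caches, bandwidth limits, and contention, and the function name is hypothetical.

```c
/* Average memory access latency in cycles when a fraction
 * `remote_fraction` (0.0 to 1.0) of accesses goes to remote DRAM.
 * Latencies are the measured Xeon 5500 values quoted in the slides. */
double avg_latency_cycles(double remote_fraction)
{
    const double local_cycles = 190.0;
    const double remote_cycles = 310.0;
    return (1.0 - remote_fraction) * local_cycles
         + remote_fraction * remote_cycles;
}
```

For example, a program with half of its accesses remote sees an average of 250 cycles under this model, roughly 32% more than a fully local one.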
44 Data locality in multithreaded programs. [Bar chart: remote memory references / total memory references [%], on a 0-60% axis, for the NAS Parallel Benchmarks cg.B, lu.C, ft.B, ep.C, bt.B, sp.B, is.B, mg.C.]
45 Data locality in multithreaded programs. [Same chart, next animation step.]
46 Automatic page placement. Current OS support for NUMA: first-touch page placement; often a high number of remote accesses. Data address profiling: for each thread, for each memory instruction, record the data address used.
47 Profile-based page placement. Diagram: thread T0 runs on Processor 0 and thread T1 on Processor 1, each processor with its own DRAM. Profile: page P0 is accessed 1000 times by T0; page P1 is accessed 3000 times by T1.
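One simple way to turn such a profile into a placement decision is to put each page on the node whose threads access it most. The sketch below shows that policy only; it is an assumed reading of the slide, not necessarily the talk's algorithm, and the function name place_pages and the layout of the counts array are hypothetical.

```c
#include <stddef.h>

/* counts[p * nodes + n] holds the profiled number of accesses to page p
 * made by threads running on node n. For each page, target_node[p] is
 * set to the node with the most accesses (lowest node wins ties). */
void place_pages(size_t pages, size_t nodes,
                 const unsigned long *counts, int *target_node)
{
    for (size_t p = 0; p < pages; p++) {
        size_t best = 0;
        for (size_t n = 1; n < nodes; n++)
            if (counts[p * nodes + n] > counts[p * nodes + best])
                best = n;
        target_node[p] = (int)best;
    }
}
```

With the profile from the slide (P0: 1000 accesses from T0 on Processor 0; P1: 3000 accesses from T1 on Processor 1), the result is P0 on node 0 and P1 on node 1. On Linux, the resulting placement could then be applied with a page migration facility such as the move_pages(2) system call.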
48 Automatic page placement. Compare first-touch and profile-based page placement. Machine: 2-processor, 8-core Intel Xeon E5520. Subset of the NAS Parallel Benchmarks: programs with a high fraction of remote accesses. 8 threads with fixed thread-to-core mapping.
49 Profile-based page placement. [Bar chart: performance improvement over first-touch [%], on a 0-25% axis, for cg.B, lu.C, bt.B, ft.B, sp.B.]
50 Profile-based page placement. [Same chart, next animation step.]
51 Hardware performance counters. Useful for program optimizations. Example seen: data locality optimization in NUMA systems. Question: what about my ACD project?
52 Your project. Which event to use? Which measurement mode to use? Is precise information needed? Answer: it depends on the optimization you choose. Example 1: loop unrolling; measure retired branch instructions, back-end stalls, i-fetch misses. Example 2: code positioning; measure i-fetch misses.
53 References. Processor manufacturers' manuals (Intel). Processor manufacturers' optimization manuals (Intel). Talk to me: zoltan.majo@inf.ethz.ch
More informationReconfigurable Architecture Requirements for Co-Designed Virtual Machines
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra
More informationIntroducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
More informationHigh-performance vnic framework for hypervisor-based NFV with userspace vswitch Yoshihiro Nakajima, Hitoshi Masutani, Hirokazu Takahashi NTT Labs.
High-performance vnic framework for hypervisor-based NFV with userspace vswitch Yoshihiro Nakajima, Hitoshi Masutani, Hirokazu Takahashi NTT Labs. 0 Outline Motivation and background Issues on current
More informationWeek 1 out-of-class notes, discussions and sample problems
Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types
More informationLecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
More informationOptimizing Virtual Machine Scheduling in NUMA Multicore Systems
Optimizing Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Kun Wang, Xiaobo Zhou, Cheng-Zhong Xu Dept. of Computer Science Dept. of Electrical and Computer Engineering University of Colorado,
More informationMassimo Bernaschi Istituto Applicazioni del Calcolo Consiglio Nazionale delle Ricerche. massimo.bernaschi@cnr.it
Massimo Bernaschi Istituto Applicazioni del Calcolo Consiglio Nazionale delle Ricerche massimo.bernaschi@cnr.it Performance There are two main measurements of performance. Execution time is what we ll
More informationVirtuoso and Database Scalability
Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of
More informationProgram Grid and HPC5+ workshop
Program Grid and HPC5+ workshop 24-30, Bahman 1391 Tuesday Wednesday 9.00-9.45 9.45-10.30 Break 11.00-11.45 11.45-12.30 Lunch 14.00-17.00 Workshop Rouhani Karimi MosalmanTabar Karimi G+MMT+K Opening IPM_Grid
More informationCHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.
More informationUses for Virtual Machines. Virtual Machines. There are several uses for virtual machines:
Virtual Machines Uses for Virtual Machines Virtual machine technology, often just called virtualization, makes one computer behave as several computers by sharing the resources of a single computer between
More informationPAPI - PERFORMANCE API. ANDRÉ PEREIRA ampereira@di.uminho.pt
1 PAPI - PERFORMANCE API ANDRÉ PEREIRA ampereira@di.uminho.pt 2 Motivation Application and functions execution time is easy to measure time gprof valgrind (callgrind) It is enough to identify bottlenecks,
More informationCourse Development of Programming for General-Purpose Multicore Processors
Course Development of Programming for General-Purpose Multicore Processors Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University Richmond, VA 23284 wzhang4@vcu.edu
More informationOpenFOAM: Computational Fluid Dynamics. Gauss Siedel iteration : (L + D) * x new = b - U * x old
OpenFOAM: Computational Fluid Dynamics Gauss Siedel iteration : (L + D) * x new = b - U * x old What s unique about my tuning work The OpenFOAM (Open Field Operation and Manipulation) CFD Toolbox is a
More informationCisco Prime Home 5.0 Minimum System Requirements (Standalone and High Availability)
White Paper Cisco Prime Home 5.0 Minimum System Requirements (Standalone and High Availability) White Paper July, 2012 2012 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public
More information