Computer Architecture




Computer Architecture Slide Sets, WS 2013/2014
Prof. Dr. Uwe Brinkschulte, M.Sc. Benjamin Betting
Part 6: Fundamentals in Performance Evaluation

Why performance evaluation?
- Comparison of computers
- Selection of a computer
- Changes in the configuration of an existing computer (tuning)
- Design of computers
- Verification or validation of design decisions

Methods for performance evaluation:
(1) analytical methods
(2) measurements

Aspects for evaluation

modularity:     Is the system composed of mostly independent parts, so-called modules?
orthogonality:  Does every module offer its own set of functions to the system? Is no particular function offered by more than one module?
adequacy:       Do the performance and cost of a module match its weight within the whole system?
virtuality:     Are the physical limits of the hardware modules hidden from the user? (Example: virtual memory)
symmetry:       Is it possible to derive the function of unknown parts from the properties of some known parts of the architecture, e.g. parts of the ISA?
transparency:   Are irrelevant parts of the architecture hidden from the user? (Example: transparent coprocessor)

Analytical methods

Performance measures (hypothetical maximum performance!):
- MIPS (millions of instructions per second)
- MFLOPS (millions of floating-point operations per second)

Mix (also calculated, not measured):
In a mix, the average execution time of each instruction is calculated and scaled by a characteristic weight.

Core programs:
Typical application programs, written for the evaluated computer. No measurements are taken; the overall execution time is calculated from the execution times of the individual machine instructions.
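The mix calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mix_average_time`, the instruction classes, the weights and the timings are all hypothetical and do not come from the slides.

```python
def mix_average_time(mix):
    """Weighted average execution time per instruction.

    `mix` maps an instruction class to (weight, execution_time_ns).
    The weights are relative frequencies and must sum to 1.
    """
    total_weight = sum(weight for weight, _ in mix.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Scale each instruction's execution time by its weight and add up.
    return sum(weight * time_ns for weight, time_ns in mix.values())


# Hypothetical mix: made-up weights and timings, for illustration only.
example_mix = {
    "alu": (0.50, 1.0),
    "load_store": (0.30, 2.0),
    "branch": (0.20, 3.0),
}
avg_ns = mix_average_time(example_mix)
print(f"average time per instruction: {avg_ns:.2f} ns")
```

Nothing is measured here: the result depends entirely on the chosen weights and the per-instruction times taken from the data sheet.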

Performance measures

$$\text{runtime} = \#\text{clock cycles} \cdot \text{clock period}$$

MIPS (million instructions per second):

$$\text{MIPS} = \frac{\text{instruction count}}{\text{runtime} \cdot 10^{6}} = \frac{\text{instruction count}}{\#\text{clock cycles} \cdot \text{clock period} \cdot 10^{6}} = \frac{\text{instruction count} \cdot \text{clock frequency}}{\#\text{clock cycles} \cdot 10^{6}}$$

$$\text{MIPS} = \frac{\text{clock frequency}}{\text{CPI} \cdot 10^{6}} = \frac{\text{clock frequency} \cdot \text{IPC}}{10^{6}}$$

CPI (cycles per instruction):

$$\text{CPI} = \frac{\#\text{clock cycles}}{\text{instruction count}}$$

MFLOPS (million floating-point operations per second):

$$\text{MFLOPS} = \frac{\#\text{executed floating-point instructions}}{\text{runtime} \cdot 10^{6}}$$

IPC (instructions per cycle):

$$\text{IPC} = \frac{1}{\text{CPI}}$$
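As a worked example, the following Python sketch evaluates runtime, CPI, IPC and MIPS from an instruction count, a cycle count and a clock frequency, and checks that the two forms of the MIPS formula agree. The function names and the numeric values are assumptions chosen for illustration.

```python
def runtime_seconds(clock_cycles, clock_frequency_hz):
    # runtime = #clock cycles * clock period, with clock period = 1 / frequency
    return clock_cycles / clock_frequency_hz


def cpi(clock_cycles, instruction_count):
    # CPI = #clock cycles / instruction count
    return clock_cycles / instruction_count


def ipc(clock_cycles, instruction_count):
    # IPC = 1 / CPI
    return instruction_count / clock_cycles


def mips(instruction_count, clock_cycles, clock_frequency_hz):
    # MIPS = instruction count / (runtime * 10^6)
    return instruction_count / (runtime_seconds(clock_cycles, clock_frequency_hz) * 1e6)


def mips_from_cpi(cpi_value, clock_frequency_hz):
    # MIPS = clock frequency / (CPI * 10^6)
    return clock_frequency_hz / (cpi_value * 1e6)


# Hypothetical program: 2e9 instructions in 4e9 cycles on a 2 GHz clock.
instructions, cycles, frequency = 2e9, 4e9, 2e9
print(runtime_seconds(cycles, frequency))             # runtime in seconds
print(cpi(cycles, instructions), ipc(cycles, instructions))
print(mips(instructions, cycles, frequency))
print(mips_from_cpi(cpi(cycles, instructions), frequency))
```

With these numbers the program runs for 2 s at CPI = 2 and IPC = 0.5, and both MIPS expressions give 1000.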

Drawbacks of performance measures
- CPI, IPC, MIPS and MFLOPS depend on the instruction set.
- CPI, IPC, MIPS and MFLOPS depend on the program.
- CPI, IPC, MIPS and MFLOPS depend on the microarchitecture.

Conclusions:
- A higher MIPS or MFLOPS rating does not automatically mean more performance!
- It is vitally important to choose well-suited test applications (benchmarks)!
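A small numeric sketch shows why a higher MIPS rating need not mean a shorter runtime. Both machines and all their figures are hypothetical: machine A is assumed to need many simple instructions for the task, machine B fewer but slower ones.

```python
def runtime_s(instruction_count, cpi, clock_frequency_hz):
    # runtime = instruction count * CPI * clock period
    return instruction_count * cpi / clock_frequency_hz


def mips_rating(cpi, clock_frequency_hz):
    # MIPS = clock frequency / (CPI * 10^6)
    return clock_frequency_hz / (cpi * 1e6)


# Same task, same 1 GHz clock, different instruction sets (made-up numbers).
machine_a = {"instructions": 10e9, "cpi": 1.0, "freq": 1e9}
machine_b = {"instructions": 4e9, "cpi": 2.0, "freq": 1e9}

for name, m in (("A", machine_a), ("B", machine_b)):
    rating = mips_rating(m["cpi"], m["freq"])
    seconds = runtime_s(m["instructions"], m["cpi"], m["freq"])
    print(f"machine {name}: {rating:.0f} MIPS, runtime {seconds:.1f} s")
```

Under these assumptions machine A is rated at 1000 MIPS and machine B at 500 MIPS, yet B finishes the task in 8 s while A needs 10 s, because B executes far fewer instructions.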

Measurements

Benchmarks:
- Existing or synthetic programs are used to measure performance.
- These programs are compiled and executed on the evaluated computer.
- Therefore, not only the computer hardware but also the compiler influences the outcome of a benchmark.

Monitoring:
- Monitors are used to observe parts of the computer at run time.
- Therefore, quantities of interest inside the computer can be measured in addition to the overall outcome of a benchmark (e.g. cache utilization, network traffic, ...).
- Monitoring can be done in hardware or in software.

Benchmark terminology

benchmark:              A test program.
benchmark suite:        A set of benchmarks.
synthetic benchmark:    A test program that is useful only as a benchmark.
kernel benchmark:       A very small synthetic benchmark. Usually a time-intensive part of a real program is chosen. Kernel benchmarks are well suited for design and simulation but normally unsuitable for comparing complete systems.
benchmark application:  A complete program that is additionally used as a benchmark. The opposite of a synthetic benchmark.

SPEC benchmarks

SPEC: Standard Performance Evaluation Corporation
- Since 1989; a consortium of different manufacturers
- General-purpose computer applications, mainly to measure speed and throughput

Several benchmark suites, e.g.:
- SPEC95
- SPECweb96
- SPEC JVM98
- SPEC JBB2000
- SPEC CINT 2006
- SPEC CFP 2006

SPECmarks

Goal: comparable values for different systems.
But: single values do not always reflect real relations, so they are only a first indication for selecting or judging a computer.
- CPU performance plus cache, memory and compiler is measured; the operating system and I/O are less relevant.
- Integer test programs (ANSI C)
- Floating-point test programs (Fortran 77)
- SPECmark: the geometric mean of the individual program characteristics contained in the suite
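The geometric mean of n values is the n-th root of their product. A minimal Python sketch follows; the function name and the per-program values are hypothetical, and the values stand in for whatever per-program characteristic a suite reports (for example, a speed ratio relative to a reference machine).

```python
import math


def geometric_mean(values):
    """Geometric mean: the n-th root of the product of n positive values."""
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("all values must be positive")
    # Averaging the logarithms avoids overflow in the product for long lists.
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-program characteristics of a tiny two-program suite.
ratios = [2.0, 8.0]
print(geometric_mean(ratios))  # geometric mean
print(sum(ratios) / len(ratios))  # arithmetic mean, for comparison
```

For these two values the geometric mean is about 4, while the arithmetic mean is 5: the geometric mean is less dominated by a single large ratio.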

SPEC-CINT2006: 12 integer test programs (C, C++)

name         description
perlbench    PERL interpreter
bzip2        bzip compression program
gcc          GNU C compiler version 3.2
mcf          Simplex algorithm for traffic planning
gobmk        AI implementation of the game Go
hmmer        Protein sequence analysis based on a hidden Markov model
sjeng        Chess program
libquantum   Quantum computer simulator
h264ref      H.264 codec
omnetpp      OMNET++ discrete event simulator
astar        Route planning
xalancbmk    XML translator

SPEC-CFP2006: 17 floating-point test programs (C, C++, FORTRAN)

name         description
bwaves       Fluid dynamics algorithm
gamess       Quantum chemistry algorithm
milc         Physics algorithm
zeusmp       Fluid dynamics algorithm
gromacs      Newton's equations of motion
cactusADM    Equation solver for Einstein's evolutionary equation
leslie3d     Fluid dynamics algorithm
namd         Biomolecular simulation
dealII       Finite elements
soplex       Simplex algorithm
povray       Image rendering
calculix     Finite elements
GemsFDTD     Maxwell equation solver
tonto        Quantum chemistry
lbm          Lattice-Boltzmann simulator
wrf          Weather modeling
sphinx3      Speech recognition

More popular benchmark suites
- Basic Linear Algebra Subprograms (BLAS): for numerical applications; the core of the LINPACK software package for solving linear equation systems, which underlies the TOP 500 list of the fastest parallel computers.
- Whetstone benchmark: developed in the seventies; a single program with a lot of floating-point calculations.
- Dhrystone benchmark: developed in the eighties as a follow-up to Whetstone (integer rather than floating-point oriented).
- Powerstone benchmark suite: to compare the energy consumption of microprocessors and microcontrollers.

Powerstone benchmark suite

name       description
auto       Vehicle control
bilv       Logical and shift operations
bilt       Graphical application
compress   UNIX compression program
crc        CRC error detection
des        Data encryption
dhry       Dhrystone
engine     Engine control
fir_int    Integer FIR filter
g3fax      FAX group 3
g721       Audio compression
jpeg       JPEG 24-bit compression
pocsag     Communication protocol for pagers
servo      Hard disk control
summin     Handwriting recognition
ucbqsort   Quick sort
v42bits    Modem operation
whet       Whetstone

Monitoring

Monitors are components that record the states of a system during its normal operation. The contents of registers, flags and buffers and the traffic on data paths are recorded. Monitors are used to observe and debug systems.

Monitoring

Generally, monitors can be classified into:

a) Hardware monitors
A hardware monitor is a separate component that is physically connected to the locations of the target system where measurements take place. Hardware monitors typically consist of comparators and counters to create data, memories to store it and buses to transport it. Thus, hardware monitors use their own resources.

Monitoring

b) Software monitors
A software monitor is a program that collects measurement data through interfaces provided by the operating system, the programming language or the application program. A software monitor uses the resources of the observed system to collect, transport and store data.

c) Hybrid monitors
A hybrid monitor is a mixed hardware and software monitor. Often simple elements such as counters and memories are implemented in hardware, while more complex observation functions are implemented in software.

Monitoring constraints

1. Accessing information
Ideally, monitoring is integrated into the hardware and software components of a system during design. Software monitors are cheaper than hardware monitors, but they may influence the system's run-time behavior.

2. Reaction-free (non-intrusive) monitoring
Hardware monitors and most hybrid monitors store the recorded data in their own memories. Software monitors have to use the memories of the observed system. Thus, hardware monitors affect the observed system less than software monitors do.

Monitoring constraints

3. Amount of recorded data and its further processing
Most purposes, especially debugging, require observations with high resolution. To analyze a program error accurately, the machine instruction that caused it has to be identified. For other purposes, e.g. a global performance analysis, a coarser resolution is sufficient. Although it often seems necessary to record observable data at the level of machine instruction execution, this would generate traces much larger than the memory used by the observed application. Thus, the cost of storing this large amount of data and the general difficulty of processing the trace data prohibit a complete recording of traces at machine instruction level.
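A back-of-the-envelope estimate makes the size problem concrete. The sketch below is purely illustrative: the instruction rate, the record size and the duration are assumed values, not figures from the slides.

```python
def trace_size_bytes(instructions_per_second, bytes_per_record, seconds):
    # One trace record per executed machine instruction.
    return instructions_per_second * bytes_per_record * seconds


# Assumptions: 10^9 instructions per second, 16 bytes per trace record
# (e.g. address plus timestamp), one minute of observation.
size = trace_size_bytes(1_000_000_000, 16, 60)
print(f"{size / 1e9:.0f} GB of trace data")
```

Under these assumptions a single minute of instruction-level tracing already produces 960 GB, far more than the memory footprint of most observed applications.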

Instrumentation

One way of software monitoring is to insert measuring commands, e.g. loop or time counters, into the program code. This is called instrumentation. Instrumentation can be performed by the user, the compiler, the class library or the operating system.

[Figure: an instrumented program runs on the computer system and produces its results together with the measurement results.]
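Manual instrumentation by the user might look like the following Python sketch, which adds a loop counter and a time counter to a simple summation. The function name and the returned dictionary layout are hypothetical; `time.perf_counter` is the standard-library timer.

```python
import time


def instrumented_sum(values):
    """Sum `values`, returning (result, measurements)."""
    loop_count = 0                 # measuring command: loop counter
    start = time.perf_counter()    # measuring command: time counter (start)
    total = 0
    for v in values:
        total += v
        loop_count += 1            # measuring command inside the loop body
    elapsed = time.perf_counter() - start
    # The program result and the measurement results are kept separate.
    return total, {"loop_count": loop_count, "elapsed_s": elapsed}


result, measurements = instrumented_sum(range(1000))
print(result, measurements["loop_count"], measurements["elapsed_s"])
```

Note that the counter increment and the timer calls run on the observed system itself, so the instrumentation slightly changes the run-time behavior it is measuring.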

Monitoring overview

method            system state         accuracy       tools
direct            hardware             very high      hardware monitor
instrumentation   hardware             high           instrumented program
trace driven      hard- and software   satisfactory   simulation program + hardware trace
simulation        software             sufficient     simulation program

Typical load-dependent parameters

throughput:         The average number of jobs completed per time unit. A job may be the execution of an instruction or a program, saving a data block or sending a message.
utilization:        The throughput (average number of jobs completed) divided by the maximum possible throughput.
response time:      The average time needed to complete a job.
utilization ratio:  The time spent working on the jobs divided by the whole operating time.
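The four parameters can be computed from a simple job log, as in the Python sketch below. Everything here is an assumption made for illustration: the log format (arrival, start and finish time per job), the function name, the single processing unit that serves one job at a time, and the example numbers.

```python
def load_parameters(jobs, operating_time, max_throughput):
    """Compute load-dependent parameters from a job log.

    jobs: list of (arrival, start, finish) times in seconds, for a single
          processing unit that serves one job at a time (assumption).
    operating_time: length of the whole observation interval in seconds.
    max_throughput: maximum possible throughput in jobs per second.
    """
    throughput = len(jobs) / operating_time
    utilization = throughput / max_throughput
    # Response time: from arrival of a job until its completion.
    response_time = sum(finish - arrival for arrival, _, finish in jobs) / len(jobs)
    # Utilization ratio: time spent working divided by whole operating time.
    busy_time = sum(finish - start for _, start, finish in jobs)
    utilization_ratio = busy_time / operating_time
    return {
        "throughput": throughput,
        "utilization": utilization,
        "response_time": response_time,
        "utilization_ratio": utilization_ratio,
    }


# Hypothetical log: three jobs observed over 10 seconds.
log = [(0, 0, 2), (1, 2, 3), (4, 4, 6)]
params = load_parameters(log, operating_time=10, max_throughput=0.5)
print(params)
```

For this log the throughput is 0.3 jobs per second, the utilization 0.6, the average response time 2 s and the utilization ratio 0.5.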