Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?

Similar documents
Chapter 2. Why is some hardware better than others for different programs?

EEM 486: Computer Architecture. Lecture 4. Performance

CSEE W4824 Computer Architecture Fall 2012

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Performance evaluation

Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.

CPU Performance. Lecture 8 CAP

Types of Workloads. Raj Jain. Washington University in St. Louis

Advanced Computer Architecture

! Metrics! Latency and throughput. ! Reporting performance! Benchmarking and averaging. ! CPU performance equation & performance trends

CS 147: Computer Systems Performance Analysis

on an system with an infinite number of processors. Calculate the speedup of

Lecture 36: Chapter 6

Parallel Algorithm Engineering

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

RAID. RAID 0 No redundancy ( AID?) Just stripe data over multiple disks But it does improve performance. Chapter 6 Storage and Other I/O Topics 29

How To Understand The Design Of A Microprocessor

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Week 1 out-of-class notes, discussions and sample problems

Multi-core Programming System Overview

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Thread level parallelism

farmerswife Contents Hourline Display Lists 1.1 Server Application 1.2 Client Application farmerswife.com

System Models for Distributed and Cloud Computing

PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0

CS2101a Foundations of Programming for High Performance Computing

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat

CHAPTER 1 INTRODUCTION

Performance of the JMA NWP models on the PC cluster TSUBAME.

Performance Metrics and Scalability Analysis. Performance Metrics and Scalability Analysis

Multi-Threading Performance on Commodity Multi-Core Processors

Chapter 7: Distributed Systems: Warehouse-Scale Computing. Fall 2011 Jussi Kangasharju

Muse Server Sizing. 18 June Document Version Muse

Parallel Programming Survey

Performance Analysis of Web based Applications on Single and Multi Core Servers

Virtuoso and Database Scalability

Delivering Quality in Software Performance and Scalability Testing

Parallel Processing and Software Performance. Lukáš Marek

FPGA-based Multithreading for In-Memory Hash Joins

Chapter 2 Parallel Architecture, Software And Performance

Benchmarking Large Scale Cloud Computing in Asia Pacific

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Parallelism and Cloud Computing

Full and Para Virtualization

Performance metrics for parallel systems

Assessing the Performance of OpenMP Programs on the Intel Xeon Phi

Application Performance Analysis of the Cortex-A9 MPCore

Characterizing Java Virtual Machine for More Efficient Processor Power Management. Abstract

EMBEDDED X86 PROCESSOR PERFORMANCE RATING SYSTEM

Massimo Bernaschi Istituto Applicazioni del Calcolo Consiglio Nazionale delle Ricerche.

Input / Ouput devices. I/O Chapter 8. Goals & Constraints. Measures of Performance. Anatomy of a Disk Drive. Introduction - 8.1

EE361: Digital Computer Organization Course Syllabus

Pipelining Review and Its Limitations

Workshop on Parallel and Distributed Scientific and Engineering Computing, Shanghai, 25 May 2012

Understanding Hardware Transactional Memory

Understanding the Performance of an X User Environment

Driving force. What future software needs. Potential research topics

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

CS 153 Design of Operating Systems Spring 2015

IMCM: A Flexible Fine-Grained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications

IOS110. Virtualization 5/27/2014 1

Price/performance Modern Memory Hierarchy

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Part V Applications. What is cloud computing? SaaS has been around for awhile. Cloud Computing: General concepts

PERFORMANCE ANALYSIS OF KERNEL-BASED VIRTUAL MACHINE

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems

COLO: COarse-grain LOck-stepping Virtual Machine for Non-stop Service

On the Importance of Thread Placement on Multicore Architectures

Towards Energy Efficient Query Processing in Database Management System

Architectures for Big Data Analytics A database perspective

Resource Efficient Computing for Warehouse-scale Datacenters

Virtualizing Performance Asymmetric Multi-core Systems


Software Performance and Scalability

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

Building an Inexpensive Parallel Computer

32 K. Stride K 8 K 32 K 128 K 512 K 2 M

PyCompArch: Python-Based Modules for Exploring Computer Architecture Concepts

Java Performance. Adrian Dozsa TM-JUG

An examination of the dual-core capability of the new HP xw4300 Workstation

Energy Efficient MapReduce

A Performance Monitor based on Virtual Global Time for Clusters of PCs

Virtualization Performance on SGI UV 2000 using Red Hat Enterprise Linux 6.3 KVM

VDI Optimization Real World Learnings. Russ Fellows, Evaluator Group

Software and the Concurrency Revolution

Next Generation GPU Architecture Code-named Fermi

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University

Measuring Computer Systems: How to Measure Performance

Oracle Database Scalability in VMware ESX VMware ESX 3.5

A Close Look at PCI Express SSDs. Shirish Jamthe Director of System Engineering Virident Systems, Inc. August 2011

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Binary search tree with SIMD bandwidth optimization using SSE

Cloud Storage. Parallels. Performance Benchmark Results. White Paper.

Intel DPDK Boosts Server Appliance Performance White Paper

Architecture of Hitachi SR-8000

General Introduction

COS 318: Operating Systems

ST810 Advanced Computing

Transcription:

Lecture 3: Evaluating Computer Architectures Announcements - Reminder: Homework 1 due Thursday 2/2 Last Time technology back ground Computer elements Circuits and timing Virtuous cycle of the past and future? Today What is computer performance? What programs do I care about? Performance equations Amdahl s Law UTCS CS352 Lecture 3 1 Software & Hardware: The Virtuous Cycle? Faster Single Processor Frequency Scaling Larger, More Capable Software Managed Languages More Cores Multi/Many Core? Scalable Software Scalable Apps + Scalable Runtime UTCS CS352 Lecture 3 2 1

Performance Hype AMD Performance Preview: Taking Phenom II to 4.2 GHz speed up by 10-25% in many cases Intel Core i7 8 processing threads They are the best desktop processor family on the planet. With 8 cores, each supporting 4 threads, the UltraSPARC T1 processor executes 32 simultaneous threads within a design consuming only 72 watts of power. improves throughput by up to 41x about 2x in two cases more than 10x in two small benchmarks speedups of 1.2x to 6.4x on a variety of benchmarks can reduce garbage collection time by 50% to 75% demonstrating high efficiency and scalability our prototype has usable performance speedups. are very significant (up to 54-fold) sometimes more than twice as fast our. is better or almost as good as. across the board UTCS CS352 Lecture 3 3 What Does this Graph Mean? Performance Trends on SPEC Int 2000 UTCS CS352 Lecture 3 4 2

Computer Performance Evaluation Metric = something we measure Goal: Evaluate how good/bad a design is Examples Clock rate of computer Power consumed by a program Execution time for a program Number of programs executed per second Cycles per program instruction How should we compare two computer systems? UTCS CS352 Lecture 3 5 Tradeoff: latency vs. throughput Pizza delivery Do you want your pizza hot? Or do you want your pizza to be inexpensive? Two different delivery strategies for pizza company! This course focuses primarily on latency (hot pizza) Latency = execution time for a single task Throughput = number of tasks per unit time UTCS CS352 Lecture 3 6 3

Two notions of performance Plane DC to Paris Speed Passengers Throughput (pmph) Boeing 747 6.5 hours 610 mph 470 286,700 Concorde 3 hours 1350 mph 132 178,200 Which has plane higher performance? Time to do the task (Execution Time) execution time, response time, latency Tasks per day, hour, week, sec, ns... (Performance) throughput, bandwidth UTCS CS352 Slide courtesy of D. Patterson Lecture 3 7 Definitions Performance is in units of things-per-second bigger is better Response time of a system Y running program Z performance (Y) = 1 execution time (Z on Y) Throughput of system Y running many programs performance (Y) = number of programs unit time " System X is n times faster than Y" means n = performance(x) performance(y) UTCS CS352 Slide courtesy of D. Patterson Lecture 3 8 4

Definitions Performance is in units of things-per-second bigger is better Response time of a system Y running program Z performance (Y) = 1 execution time (Z on Y) Throughput of system Y running many programs performance (Y) = number of programs unit time " System X is n times faster than Y" means n = performance(x) performance(y) UTCS CS352 Slide courtesy of D. Patterson Lecture 3 9 Which Programs Should I Measure? UTCS CS352 Slide courtesy of D. Patterson Lecture 3 10 5

Which Programs Should I Measure? Pros Cons Actual Target Workload Full Application Benchmarks Small Kernel Benchmarks Microbenchmarks UTCS CS352 Slide courtesy of D. Patterson Lecture 3 11 Which Programs Should I Measure? Pros representative portable widely used improvements useful in reality easy to run, early in design cycle identify peak capability and potential bottlenecks Actual Target Workload Full Application Benchmarks Small Kernel Benchmarks Microbenchmarks Cons very specific non-portable difficult to run, or measure hard to identify cause less representative easy to fool peak may be a long way from application performance UTCS CS352 Slide courtesy of D. Patterson Lecture 3 12 6

Brief History of Benchmarking Early days (1960s) Single instruction execution time Average instruction time [Gibson 1970] Pure MIPS (1/AIT) Simple programs(early 70s) Synthetic benchmarks (Whetstone, etc.) Kernels (Livermore Loops) Relative Performance (late 70s) VAX 11/780 1-MIPS but was it? MFLOPs Real Applications (1989-now) SPEC CPU C/Fortran Scientific, Irregular 89, 92, 95, 00, 07,?? TPC C: Transaction Processing SPECWeb WinBench: Desktop Graphics C/C++ Quake III, Doom 3 MediaBench Java: SPECJVM98 Problem: Programming Language Parallel?, Java, C#, JavaScript?? DaCapo Java Benchmarks 06, 09 Parsec: Parallel C/C++, 2008 UTCS CS352 Lecture 3 13 How to Compromise a Comparison: C programs running on two architectures UTCS CS352 Lecture 3 14 7

The compiler reorganized the code! Change the memory system performance Matrix multiply cache blocking After Before UTCS CS352 Lecture 3 15 There are lies, damn lies, and statistics Desraeli UTCS CS352 Lecture 3 16 8

benchmarks There are lies, damn lies, and statistics Desraeli UTCS CS352 Lecture 3 17 Benchmarking Java Programs Let s consider the performance of the DaCapo Java Benchmarks What do we need to think about when comparing two computers running Java programs? http://dacapo.anu.edu.au/regression/perf /2006-10-MR2.html UTCS CS352 Lecture 3 18 9

Pay Attention to Benchmarks & System Benchmarks measure the whole system application compiler, VM, memory management operating system architecture implementation Popular benchmarks often reflect yesterday s programs what about the programs people are running today? need to design for tomorrow s problems Benchmark timings are sensitive alignment in cache location of data on disk values of data Danger of inbreeding or positive feedback if you make an operation fast (slow) it will be used more (less) often therefore you make it faster (slower) and so on, and so on the optimized NOP UTCS CS352 Lecture 3 19 Performance Summary so Far Key concepts Throughput and Latency Best benchmarks are real programs DaCapo, Spec, TPC, Doom3 Pitfalls Whole system measurement Workload may not match user s Compiler, VM, memory management Next Amdahl s Law UTCS CS352 Lecture 3 20 10

Improving Performance: Fundamentals Suppose we have a machine with two instructions Instruction A executes in 100 cycles Instruction B executes in 2 cycles We want better performance. Which instruction do we improve? UTCS CS352 Lecture 3 21 Speedup Make a change to an architecture Measure how much faster/slower it is UTCS CS352 Lecture 3 22 11

Speedup when we know details about the change Performance improvements depend on: how good is enhancement (factor S) how often is it used (fraction p) Speedup due to enhancement E: Speedup(E) = ExTime w/out E ExTime w/ E = Perf w/ E Perf w/out E UTCS CS352 Lecture 3 23 Amdahl s Law: Example FP instructions improved by 2x But.only 10% of instructions are FP ExTime new = ExTime old 0.9 + 0.1 = 0.95 ExTime 2 old Amdahl s Law: Speedup bounded by UTCS CS352 Lecture 3 24 12

How Does Amdahl s Law Apply to Multicore? Given N cores what is our ideal speedup? UTCS CS352 Lecture 3 25 How Does Amdahl s Law Apply to Multicore? Given N cores what is our ideal speedup? ExTime new = ExTime old /N Say 90% of the code is parallel and N = 16? UTCS CS352 Lecture 3 26 13

How Does Amdahl s Law Apply to Multicore? Given N cores what is our ideal speedup? ExTime new = ExTime old /N Say 90% of the code is parallel and N = 16? ExTime new = ExTime old 1 p ( ) + p N ExTime new = ExTime old 0.1+ 0.9 = 0.15625 ExTime 16 old Speedup total = 1 0.15625 = 6.2 UTCS CS352 Lecture 3 27 How Does Amdahl s Law Apply to Multicore? UTCS CS352 Lecture 3 28 14

Performance Summary so Far Amdahl s law: Pay attention to what are you speeding up. Next Time More on Performance Cycles per Instruction Means Start: Instruction Set Architectures (ISA) Read: P&H 2.1 2.5 Turn in your homework at the beginning of class UTCS CS352 Lecture 3 29 15