CSEE W4824 Computer Architecture Fall 2012




CSEE W4824 Computer Architecture, Fall 2012
Lecture 2: Performance Metrics and Quantitative Principles of Computer Design
Luca Carloni, Department of Computer Science, Columbia University in the City of New York
http://www.cs.columbia.edu/~cs4824/

Announcements: CS Distinguished Lecture, Wed, Oct. 12th, 11:00 am, Davis Auditorium
"What Should a Well-informed Person Know about Computers?" by Brian Kernighan (Princeton Univ.)
- His book with Dennis Ritchie, the creator of the C programming language, is considered the bible of C
- At Bell Labs he contributed to the development of Unix, working with its creators K. Thompson and D. Ritchie
- He is also a coauthor of the widely used AWK and AMPL programming languages, and of the EQN and PIC typesetting languages
- In collaboration with Shen Lin, he devised well-known heuristics for two important NP-complete optimization problems: graph partitioning and the travelling salesman problem

CSEE 4824 Fall 2012 - Lecture 2 Page 3

Computer Architects and the Quantitative Approach
Design ideas and trade-offs are tested by using tools to estimate the impact on performance, power, and cost (an iterative process):
- analytical reasoning and fundamental design principles: equations for basic metrics (cost, performance, power)
- simulations at various levels: system level, ISA, micro-architecture, memory, RTL, gate, circuit level
- benchmark programs representing typical workloads

How to Define Performance?

Airplane          Passenger Capacity  Cruising Range (miles)  Cruising Speed (mph)  Passenger Throughput (passengers x mph)
Boeing 777               375                  4630                    610                 228,750
Boeing 747               470                  4150                    610                 286,700
Concorde                 132                  4000                   1350                 178,200
Douglas DC-8-50          146                  8720                    544                  79,424

(Throughput = passenger capacity x cruising speed.)
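As a quick sanity check, the throughput column above can be reproduced in a few lines of Python. This is just a sketch; the data is copied from the table, with the Boeing 777 capacity taken as 375, which is what makes the stated throughput of 228,750 work out.

```python
# Passenger throughput = passenger capacity x cruising speed (mph).
# Data copied from the table above; the Boeing 777 capacity is taken
# as 375, consistent with the stated throughput of 228,750.
planes = {
    "Boeing 777": (375, 610),
    "Boeing 747": (470, 610),
    "Concorde": (132, 1350),
    "Douglas DC-8-50": (146, 544),
}

for name, (capacity, speed_mph) in planes.items():
    print(f"{name}: {capacity * speed_mph:,} passengers x mph")
```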

Two Key Performance Metrics
- Time to run the task: execution time, response time, elapsed time, latency
- Tasks per time unit: execution rate, bandwidth, throughput

Airplane      DC to Paris   Speed      Passengers   Throughput (passengers x mph)
Boeing 747    6.5 hours      610 mph      470            286,700
Concorde      3 hours       1350 mph      132            178,200

Latency vs. Throughput
Latency
- real time necessary to complete a task
- important when the focus is on a single task
  - a computer user who is working with a single application
  - a critical task of a real-time embedded system
Throughput (aka Bandwidth)
- number of tasks completed per unit of time
- a metric independent of the exact number of executed tasks
- important when the focus is on running many tasks
  - a manager of a large data-processing center is interested in the total amount of work done in a given time

Latency Lags Bandwidth
Bandwidth has outpaced latency across the main computer technologies. There is an old network saying: "Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed; you can't bribe God." [Anonymous]

Latency and Throughput: The Classic 5-Stage Pipeline
Pipelining increases the instruction throughput (the number of instructions completed per unit of time), but it does not reduce (in fact, it usually slightly increases) the execution time of an individual instruction.

Performance Metrics
Machine X is n times faster than machine Y:

  n = execution_time(Y) / execution_time(X) = performance(X) / performance(Y)

Performance and execution time are reciprocal:
- improve performance = increase performance
- improve execution time = decrease execution time
Example: execution_time(Y) = 4.8, execution_time(X) = 3.6
  n = 1.33, i.e. X is 33% faster than Y

Make the Common Case Fast
The most important, pervasive, and simple principle of computer design:
- in making a design trade-off, favor the frequent case rather than the infrequent case
- when determining how to allocate resources, favor the frequent event rather than the rare event
- when optimizing the design of a module, target the average functional behavior
Besides, the frequent case is often simpler. Two questions follow:
1. How do we determine what the frequent case is?
2. How do we determine the amount of the possible performance gain from making the frequent case faster?
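The relative-performance definition above can be written as a small Python sketch (the helper name `speedup` is ours, not the lecture's):

```python
def speedup(exec_time_y, exec_time_x):
    """Return n such that machine X is n times faster than machine Y."""
    return exec_time_y / exec_time_x

# Example from the slide: execution_time(Y) = 4.8, execution_time(X) = 3.6.
n = speedup(4.8, 3.6)
print(f"X is {n:.2f} times faster than Y")  # prints "X is 1.33 times faster than Y"
```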

Simulation and Simulation Levels
- ISA (functional) simulator: execute the program and get ISA-level statistics, e.g. the frequency of instructions
- Memory simulator: the ISA simulator is run together with a model of the memory system to get cache hit/miss rates and study memory hierarchy options
- Full performance simulator: adds a detailed performance model to a functional simulator to model all interactions, stalls, and (mis-)speculations, and to generate accurate statistics

Simulation Tradeoffs
An ISA simulator is roughly 10x slower than the real processor, but 10-100x faster than a detailed performance simulator.
Key points:
- use the right level of simulation to answer a specific question, e.g. an ISA simulator to get instruction mix statistics
- use fast, idealized models for non-critical components, e.g. assume a perfect main memory for applications that present an optimal cache hit ratio
- simulation is a powerful tool for architectural explorations, but analytical reasoning should always be applied before starting long simulations

Benchmark Suites
Sets of programs to simulate typical workloads. Several types:
- real software applications (GCC, Word, ...): most accurate, but typically longer to process; portability problems (OS/compiler dependencies, GUI)
- kernels (Livermore Loops, Linpack, ...): small, key pieces taken from real programs; a limited picture, but good to isolate the performance of individual features of a machine
- synthetic benchmarks (Whetstone, Dhrystone, ...): try to match the average frequency of operations on operands of a real program; may easily mislead compiler and hardware designers

Amdahl's Law
What is the overall speedup after improving a component x of a system?

  speedup = (original execution time) / (new execution time) = (new performance) / (original performance)

If component x is improved by a factor Sx, and component x affects a fraction Fx of the overall execution time, then

  speedup = 1 / ((1 - Fx) + Fx/Sx)

where (1 - Fx) is the original execution time of the unimproved part and Fx/Sx is the new execution time of the improved part.
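The formula can be captured in a one-line Python helper (our naming, not the lecture's):

```python
def amdahl_speedup(fx, sx):
    """Overall speedup when a fraction fx of execution time is improved by a factor sx."""
    return 1.0 / ((1.0 - fx) + fx / sx)

# Improving a component that covers 20% of execution time by a factor of 2:
print(round(amdahl_speedup(0.2, 2), 3))  # 1.111
```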

Amdahl's Law - Example
If we optimize the module for the floating-point instructions by a factor of 2, but the system will normally run programs with only 20% of floating-point instructions, then the speedup is only

  speedup = 1 / ((1 - Fx) + Fx/Sx) = 1 / ((1 - 0.2) + 0.2/2) = 1 / 0.9 = 1.111

Amdahl's Law - Example
If Sx = 100, what is the overall speedup as a function of Fx?

[Figure: speedup S vs. optimized fraction Fx, for Sx = 100; the speedup grows slowly for small Fx and climbs steeply toward 100 only as Fx approaches 1.]

Amdahl's Law and the Law of Diminishing Returns
- the closer Fx is to 1, the closer the overall speedup is to Sx, i.e. [make the common case fast]
- however, as Sx grows, speedup -> 1 / (1 - Fx); i.e., once Fx/Sx is small with respect to (1 - Fx), the price/performance ratio falls rapidly as Sx is increased
- the incremental improvement in speedup gained by an additional improvement in the performance of just a portion of the computation diminishes as improvements are added

Amdahl's Law - Reference
Gene Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities", AFIPS '67.

Amdahl's Law - Special Case of Parallelization
If F is the fraction of a calculation that can be parallelized, and (1 - F) is the fraction that is sequential (i.e. cannot benefit from parallelization), then Amdahl's Law gives the maximum speedup that can be achieved by using N processors as

  speedup = 1 / ((1 - F) + F/N)

Example: if F is only 90%, the calculation can be sped up by at most a factor of 10, no matter how many processors are used. The key to parallel computing is to increase F (but see also Gustafson's Law).
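A minimal sketch of the parallel case (function name ours) makes the saturation at 1/(1 - F) visible:

```python
def parallel_speedup(f, n):
    """Amdahl's Law: max speedup with n processors when a fraction f is parallelizable."""
    return 1.0 / ((1.0 - f) + f / n)

# With f = 0.9 the speedup saturates at 10, no matter how many processors:
for n in (10, 100, 1000, 10**6):
    print(n, round(parallel_speedup(0.9, n), 2))
```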

Principle of Locality
- Temporal Locality: a resource that is referenced at one point in time will be referenced again sometime in the near future
- Spatial Locality: the likelihood of referencing a resource is higher if a resource near it was just referenced
- 90/10 Locality Rule of Thumb: a program spends 90% of its execution time in only 10% of its code
Hence, it is possible to predict with reasonable accuracy which instructions and data a program will use in the near future, based on its accesses in the recent past. This is a consequence of how we program and how we store data in memory.

Principle of Locality - Example: Cache Memory
A cache directly exploits temporal locality, providing faster access to a smaller subset of the main memory that contains copies of recently used data. Not all data in the cache are necessarily spatially close in the main memory; still, when a cache miss occurs, a fixed-size block of contiguous memory cells is retrieved from the main memory, based on the principle of spatial locality.

CPU Time
- user CPU Time: time spent in the user program
- system CPU Time: time spent in the OS performing tasks required by the program; harder to measure and to compare across architectures
- CPU performance = user CPU time on an unloaded system

  CPU Time = (Clock Cycles for a Program) x (Clock Cycle Time)
           = (Clock Cycles for a Program) / (Clock Frequency)

Most computers run with a single clock signal (strictly synchronous design), whose discrete time events are called cycles, periods, or ticks. A processor with a 1 ns clock period runs at a 1 GHz clock frequency.

CPU Time: Three Main Factors
  CPU Time = (Clock Cycles for a Program) x CCT
- IC = instruction count: the number of instructions executed for a program
- CPI = clock cycles per instruction = (Clock Cycles for a Program) / IC: the average number of clock cycles per instruction of a program; its reciprocal is IPC = instructions per clock cycle

  CPU Time = IC x CPI x CCT

CPU Time depends equally on these three factors: a 10% improvement in any of them leads to a 10% improvement in CPU Time.
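The three-factor equation is easy to play with numerically. The numbers below are hypothetical, chosen only to illustrate the equal weight of the three factors:

```python
def cpu_time(ic, cpi, cct):
    """CPU Time = instruction count x cycles per instruction x clock cycle time."""
    return ic * cpi * cct

# Hypothetical program: 10^9 instructions, CPI = 1.5, 1 ns cycle (1 GHz clock).
base = cpu_time(1e9, 1.5, 1e-9)
print(base)  # 1.5 seconds

# A 10% improvement in any single factor gives a 10% improvement in CPU Time:
print(round(cpu_time(1e9, 1.5 * 0.9, 1e-9) / base, 3))  # 0.9
```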

CPU Time - Dependencies
  CPU Time = IC x CPI x CCT
- IC depends on the program, the compiler, and the ISA
- CPI depends on the ISA and the hardware organization
- CCT depends on the hardware organization and the hardware technology
There are some interdependencies, but many techniques improve a single factor.

Improving Performance by Exploiting Parallelism
- at the system level: use multiple processors and multiple disks; scalability is key to adaptively distribute workload in server apps
- at the single-microprocessor level: exploit instruction-level parallelism (ILP)
  - e.g., pipelining overlaps the execution of instructions to reduce the overall program CPU Time; it reduces CPI by overlapping instructions in time, which is possible because many subsequent instructions are independent
  - e.g., parallel computation reduces CPI by overlapping instructions in space, duplicating hardware modules such as ALUs
- at the circuit level: carry-lookahead adders speed up sums from linear to logarithmic time

CPU Time Broken Down per Instruction
  CPU Time = IC x CPI x CCT
  CPU Time = [sum over i of (ICi x CPIi)] x CCT
  CPI = [sum over i of (ICi x CPIi)] / IC = sum over i of (IFi x CPIi), where IFi = ICi / IC
- frequent instructions have larger contributions to CPI
- CPI should be measured so as to include pipeline/memory effects; it is not sufficient to calculate it from the reference manual table
- NOTE: it is ok to compare two designs based only on CPI (or IPC) only if IC and CCT are the same!

Example: Average Instruction Execution Time
Assuming a simple un-pipelined processor with CCT = 2 ns:

Operation   IFi   CPIi   IFi x CPIi   (% Time)
ALU         0.5    4        2.0          46
Load        0.2    5        1.0          23
Store       0.1    5        0.5          12
Branch      0.2    4        0.8          19

  CPI = sum over i of (IFi x CPIi) = 4.3
  Average instruction execution time = CPI x CCT = 8.6 ns
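The weighted-CPI computation above can be sketched directly from the mix table:

```python
# Instruction mix from the table above: class -> (frequency IFi, cycles CPIi).
mix = {
    "ALU":    (0.5, 4),
    "Load":   (0.2, 5),
    "Store":  (0.1, 5),
    "Branch": (0.2, 4),
}

cpi = sum(freq * cycles for freq, cycles in mix.values())
cct_ns = 2.0  # un-pipelined processor with a 2 ns clock cycle time

print(round(cpi, 1))           # 4.3
print(round(cpi * cct_ns, 1))  # average instruction execution time: 8.6 ns
```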

Example: Speedup From 5-Stage Pipelining
Assumption: after pipelining, the slowest stage forces an effective clock period equal to (CCT + clock overhead) = (2 + 0.2) ns.
Question: what is the speedup from pipelining?

  speedup = (Average Instruction Time)unpipelined / (Average Instruction Time)pipelined = 8.6 / 2.2 = 3.9

Another Key Metric: Power Dissipation
- Energy: measured in Joules
- Power: the rate of energy consumption [Watts = Joules/sec]
- instantaneous power P = V x I: the voltage drop across a component times the current flowing through it
Example:
- system A: higher peak power, lower total energy
- system B: lower peak power, higher total energy
[Source: K. Asanovic, MIT]
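The pipelining speedup above is just the ratio of the two average instruction times (values from the two examples):

```python
avg_time_unpipelined = 4.3 * 2.0  # CPI x CCT from the previous example, in ns
avg_time_pipelined = 2.0 + 0.2    # slowest stage plus clock overhead, in ns

speedup = avg_time_unpipelined / avg_time_pipelined
print(round(speedup, 1))  # 3.9
```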

Power Consumption of CMOS Transistors
- Dynamic Power: traditionally the dominant component; dissipated when a transistor switches (i.e. data dependent)
- Static Power: becoming more important as transistors scale, due to the leakage current that flows even when there is no switching activity; proportional to the number of transistors on the chip
Challenges: power is the key limitation to chip design
- distribute power on-chip
- remove heat, prevent hot spots
- low-power design (clock gating, DVFS)

Example: Dynamic Power Consumption
Assume a 0.25 um CMOS chip with a voltage supply Vdd = 2.5 V, clock frequency F = 500 MHz, and average load capacitance CL = 15 fF/gate (assuming a fan-out of 4). What is the power consumption per gate?
Approximately, Pavg = 50 uW. For a design with 1 million gates, assuming that a transition occurs at every clock edge, this would result in an average power consumption of ~50 W! In reality, not all gates on the chip switch at the full rate of 500 MHz: the actual activity is substantially lower, and it is estimated by the switching capacitance.
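The per-gate figure can be checked with the standard dynamic-power formula P = CL x Vdd^2 x F, under the slide's assumption that every gate switches once per cycle:

```python
C_L = 15e-15   # average load capacitance: 15 fF/gate
Vdd = 2.5      # supply voltage, in volts
F = 500e6      # clock frequency: 500 MHz

p_gate = C_L * Vdd**2 * F  # dynamic power per gate, in watts
p_chip = p_gate * 1e6      # 1 million gates, all switching every cycle

print(round(p_gate * 1e6, 1))  # 46.9 (uW per gate, i.e. roughly 50 uW)
print(round(p_chip, 1))        # 46.9 (W for the chip, i.e. roughly 50 W)
```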

Dynamic Voltage and Frequency Scaling
DVFS is a low-power design technique that is becoming pervasive in modern processors.
Example: if the voltage and frequency of a processing core are both reduced by 15%, what is the impact on dynamic power?

  Pnew / Pold = [C x (V x 0.85)^2 x (F x 0.85)] / [C x V^2 x F] = 0.85^3 = 0.61

The new design dissipates only ~61% of the original dynamic power (a ~39% reduction); equivalently, the original design dissipates about 1/0.61 = 1.64x, i.e. ~64% more power than the scaled one.

Assigned Readings
Computer Architecture: A Quantitative Approach, by John Hennessy (Stanford University) and David Patterson (UC Berkeley), Fifth Edition, 2012, Morgan Kaufmann (Elsevier). Read Sections 1.8-1.12.
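As a quick numeric check of the DVFS arithmetic above:

```python
scale = 0.85                    # both Vdd and F reduced by 15%
power_ratio = scale**2 * scale  # (0.85 V)^2 x (0.85 F) / (V^2 x F) = 0.85^3

print(round(power_ratio, 2))  # 0.61: the scaled core uses ~61% of the original dynamic power
```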