CSCI-GA Multicore Processors: Architecture & Programming Lecture 9: Performance Evaluation

CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 9: Performance Evaluation Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com

What Are We Trying to Do?
Measure, report, and summarize performance.
Make intelligent choices.
Why is some hardware better than others for different programs?
What factors of system performance are hardware related? (e.g., do we need a new machine, or a new operating system?)
Does the performance measure depend on the application type?

Defining Performance
Which airplane has the best performance?
[Figure: four bar charts comparing the Boeing 777, Boeing 747, BAC/Sud Concorde, and Douglas DC-8-50 by passenger capacity, cruising range (miles), cruising speed (mph), and passengers x mph.]

Let's Start with Two Simple Metrics
Response time (aka execution time): the time between the start and completion of a task.
Throughput: the total amount of work done in a given time.
What is the relationship between execution time and throughput?

Computer Performance: TIME, TIME, TIME
Response time (latency): How long does it take for my job to run? How long does it take to execute a job? How long must I wait for the database query?
Throughput: How many jobs can the machine run at once? What is the average execution rate? How much work is getting done?

Try to Solve This
Do the following changes to a computer system increase throughput, decrease response time, or both?
Replacing the processor with a faster version.
Adding additional processors to a system that uses multiple processors for separate tasks.

Execution Time
Elapsed time: counts everything (disk and memory accesses, I/O, etc.); a useful number, but often not good for comparison purposes.
CPU time: doesn't count I/O or time spent running other programs; can be broken up into system time and user time.
Our focus: user CPU time, the time spent executing the lines of code that are "in" our program.
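As a rough illustration of the distinction between elapsed time and CPU time, the following Python sketch measures elapsed time, user CPU time, and system CPU time for a small workload. The `measure` helper and the `busy_then_sleep` workload are hypothetical names introduced only for this example; the exact numbers will differ from machine to machine.

```python
import os
import time


def measure(fn):
    """Run fn once and return elapsed, user CPU, and system CPU seconds."""
    wall_start = time.perf_counter()   # wall-clock (elapsed) timer
    cpu_start = os.times()             # process CPU times so far
    fn()
    cpu_end = os.times()
    wall_end = time.perf_counter()
    return {
        "elapsed": wall_end - wall_start,
        "user": cpu_end.user - cpu_start.user,
        "system": cpu_end.system - cpu_start.system,
    }


def busy_then_sleep():
    """Hypothetical workload: some computation, then idle waiting."""
    total = 0
    for i in range(200_000):
        total += i * i
    time.sleep(0.05)  # counts toward elapsed time, not toward CPU time


result = measure(busy_then_sleep)
print(result)
```

The sleep shows up in the elapsed figure but adds almost nothing to the user or system CPU figures, which is why elapsed time alone can be a poor basis for comparing processors.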

Execution Time (Elapsed Time)
[Figure: breakdown of elapsed time into I/O time, CPU time, and disk and memory time, with CPU time split into user CPU time and system CPU time.]

Book's Definition of Performance
For some program running on machine X: $\text{Performance}_X = 1 / \text{Execution time}_X$
"X is n times faster than Y" means $\text{Performance}_X / \text{Performance}_Y = n$
Example: a program takes 10 s to run on A and 15 s on B. $\text{Execution time}_B / \text{Execution time}_A = 15\,\text{s} / 10\,\text{s} = 1.5$, so A is 1.5 times faster than B.
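The definition above can be sketched in a few lines of Python. The function names `performance` and `times_faster` are assumptions made for this illustration; the inputs are the 10 s and 15 s figures from the example.

```python
def performance(execution_time_s):
    """Performance is the reciprocal of execution time."""
    return 1.0 / execution_time_s


def times_faster(time_x_s, time_y_s):
    """Return n such that X is n times faster than Y."""
    return performance(time_x_s) / performance(time_y_s)


# Machine A takes 10 s, machine B takes 15 s.
n = times_faster(10.0, 15.0)
print(f"A is about {n:.2f} times faster than B")
```

Note that the ratio of performances is the inverse of the ratio of execution times, so the slower machine's time ends up in the numerator.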

Clock Cycles
Instead of reporting execution time in seconds, we often use cycles: $\frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}}$
Clock ticks indicate when to start activities (one abstraction).
Cycle time = time between ticks = seconds per cycle.
Clock rate (frequency) = cycles per second (1 Hz = 1 cycle/s).
A 4 GHz clock has a cycle time of $\frac{1}{4 \times 10^{9}}\,\text{s} = 250 \times 10^{-12}\,\text{s} = 250$ picoseconds (ps).
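Since cycle time and clock rate are reciprocals, converting between them is a one-line calculation. The sketch below uses hypothetical helper names (`cycle_time_ps`, `clock_rate_ghz`) and checks the 4 GHz case from the slide.

```python
def cycle_time_ps(clock_rate_hz):
    """Cycle time in picoseconds for a clock rate given in Hz."""
    return 1e12 / clock_rate_hz


def clock_rate_ghz(cycle_time_in_ps):
    """Clock rate in GHz for a cycle time given in picoseconds."""
    return 1e3 / cycle_time_in_ps


print(cycle_time_ps(4e9))      # cycle time of a 4 GHz clock, in ps
print(clock_rate_ghz(250.0))   # clock rate for a 250 ps cycle, in GHz
```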

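How to Improve Performance
$\frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}}$
So, to improve performance (everything else being equal), you can either (increase or decrease?) the number of cycles required for a program, or (increase or decrease?) the clock cycle time or, said another way, (increase or decrease?) the clock rate.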

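$\frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}}$
$ET = IC \times CPI \times CT$
Here ET = execution time, IC = instruction count, CPI = cycles per instruction, and CT = cycle time, so cycles per program is $IC \times CPI$ and seconds per cycle is $CT$.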
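To make the relation between execution time, instruction count, CPI, and cycle time concrete, here is a minimal Python sketch. The function name `execution_time_s` and all of the input values are hypothetical, chosen only to show how each factor scales the result.

```python
def execution_time_s(instruction_count, cpi, cycle_time_s):
    """ET = IC * CPI * CT, with CT in seconds per cycle."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical baseline: 2 billion instructions, CPI of 1.5, 250 ps cycle.
baseline = execution_time_s(2e9, 1.5, 250e-12)

# Same program and clock, but a design that lowers CPI to 1.0.
lower_cpi = execution_time_s(2e9, 1.0, 250e-12)

print(f"baseline: {baseline:.3f} s, lower CPI: {lower_cpi:.3f} s")
```

Because the three factors multiply, a change to any one of them only helps if it does not make another one worse by the same proportion.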

An Interesting Question
If two machines have the same ISA, which of our quantities (e.g., clock rate, CPI, execution time, number of instructions, MIPS) will always be identical?

How many cycles are required for a program?
[Figure: timeline showing the 1st, 2nd, 3rd, 4th, 5th, 6th, ... instructions along a time axis.]
We could assume that the number of cycles equals the number of instructions. This assumption is incorrect: different instructions take different amounts of time on different machines. Why? Hint: remember that these are machine instructions, not lines of C code.

Different numbers of cycles for different instructions
Multiplication takes more time than addition.
Floating-point operations take longer than integer ones.
Accessing memory takes more time than accessing registers.
Important point: changing the cycle time often changes the number of cycles required for various instructions.

Example
Our favorite program runs in 10 seconds on computer A, which has a 4 GHz clock. We are trying to help a computer designer build a new machine B that will run this program in 6 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate, but has informed us that this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we tell the designer to target?
$ET = IC \times CPI \times CT = \text{total\_cycles} \times CT$
A: $10\,\text{s} = \text{total\_cycles} \times \frac{1}{4\,\text{GHz}}$
B: $6\,\text{s} = 1.2 \times \text{total\_cycles} \times \frac{1}{\text{clock\_rate}}$
Dividing A by B: $\frac{10}{6} = \frac{\text{clock\_rate}}{1.2 \times 4\,\text{GHz}}$, so $\text{clock\_rate} = 8\,\text{GHz}$.
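The same calculation can be checked numerically. This is a small sketch under the example's stated assumptions (10 s at 4 GHz on A, 1.2 times as many cycles on B, a 6 s target); the function name `target_clock_rate_hz` is hypothetical.

```python
def target_clock_rate_hz(time_a_s, rate_a_hz, cycle_ratio, time_b_s):
    """Clock rate B needs so the program finishes in time_b_s seconds."""
    cycles_a = time_a_s * rate_a_hz      # total cycles on machine A
    cycles_b = cycle_ratio * cycles_a    # machine B needs more cycles
    return cycles_b / time_b_s           # cycles per second required


rate_b_hz = target_clock_rate_hz(10.0, 4e9, 1.2, 6.0)
print(f"target clock rate: about {rate_b_hz / 1e9:.1f} GHz")
```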

Now that we understand cycles
A given program will require some number of instructions (machine instructions), some number of cycles, and some number of seconds.
We have a vocabulary that relates these quantities:
Cycle time (seconds per cycle).
Clock rate (cycles per second).
CPI (cycles per instruction): a floating-point-intensive application might have a higher CPI.
MIPS (millions of instructions per second): this would be higher for a program using simple instructions.

Performance
Performance is determined by execution time. Do any of the other variables equal performance?
Number of cycles to execute the program?
Number of instructions in the program?
Number of cycles per second?
Average number of cycles per instruction?
Average number of instructions per second?

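CPI Example
Suppose we have two implementations of the same instruction set architecture (ISA). For some program, machine A has a clock cycle time of 250 ps and a CPI of 2.0, and machine B has a clock cycle time of 500 ps and a CPI of 1.2. Which machine is faster for this program, and by how much?
[ $10^{-3}$ = milli, $10^{-6}$ = micro, $10^{-9}$ = nano, $10^{-12}$ = pico, $10^{-15}$ = femto ]
$ET = IC \times CPI \times CT$
$ET_A = IC \times 2.0 \times 250\,\text{ps} = 500 \times IC$ ps
$ET_B = IC \times 1.2 \times 500\,\text{ps} = 600 \times IC$ ps
Since both machines run the same instruction count, A is faster by a factor of $600 / 500 = 1.2$.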
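Because both machines implement the same ISA and run the same program, the instruction count cancels, and it is enough to compare the average time per instruction. The sketch below does that with the example's numbers; the helper name `time_per_instruction_ps` is an assumption made for illustration.

```python
def time_per_instruction_ps(cpi, cycle_time_ps):
    """Average time per instruction in picoseconds: CPI * cycle time."""
    return cpi * cycle_time_ps


machine_a = time_per_instruction_ps(2.0, 250.0)   # CPI 2.0, 250 ps cycle
machine_b = time_per_instruction_ps(1.2, 500.0)   # CPI 1.2, 500 ps cycle

faster = "A" if machine_a < machine_b else "B"
ratio = max(machine_a, machine_b) / min(machine_a, machine_b)
print(f"machine {faster} is faster by a factor of about {ratio:.2f}")
```

The lower CPI of machine B does not compensate for its cycle time being twice as long.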

#Instructions Example
A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles, respectively.
The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C.
The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.
Which sequence will be faster? By how much? What is the CPI for each sequence?
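One way to work through this exercise is to count cycles per class and divide by the instruction count. The following Python sketch does so; the dictionary layout and function names are choices made for this illustration, not part of the lecture.

```python
# Cycles required by each instruction class, as given in the example.
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}


def total_cycles(counts):
    """Total cycles for a sequence given its per-class instruction counts."""
    return sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())


def cpi(counts):
    """Average cycles per instruction for the sequence."""
    return total_cycles(counts) / sum(counts.values())


seq1 = {"A": 2, "B": 1, "C": 2}   # 5 instructions
seq2 = {"A": 4, "B": 1, "C": 1}   # 6 instructions

for name, seq in (("sequence 1", seq1), ("sequence 2", seq2)):
    print(name, "cycles:", total_cycles(seq), "CPI:", cpi(seq))
```

The second sequence executes more instructions but needs 9 cycles against 10, so it is faster by a factor of 10/9 on the same clock; its CPI is 1.5 against 2.0 for the first.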

MIPS Example
Two different compilers are being tested for a 4 GHz machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles, respectively. Both compilers are used to produce code for a large piece of software.
The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.
The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.
Which sequence will be faster according to MIPS? Which sequence will be faster according to execution time?
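The exercise can be worked through with a short calculation of execution time and MIPS for each compiler's output. This is a sketch using the example's numbers; the function name `summarize` and the data layout are assumptions introduced here.

```python
CLOCK_RATE_HZ = 4e9                          # 4 GHz machine
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}  # cycles per instruction class


def summarize(counts):
    """Return (execution time in seconds, MIPS) for the given counts."""
    instructions = sum(counts.values())
    cycles = sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())
    seconds = cycles / CLOCK_RATE_HZ
    mips = instructions / (seconds * 1e6)
    return seconds, mips


compiler1 = {"A": 5_000_000, "B": 1_000_000, "C": 1_000_000}
compiler2 = {"A": 10_000_000, "B": 1_000_000, "C": 1_000_000}

time1, mips1 = summarize(compiler1)
time2, mips2 = summarize(compiler2)
print(f"compiler 1: {time1 * 1e3:.2f} ms, about {mips1:.0f} MIPS")
print(f"compiler 2: {time2 * 1e3:.2f} ms, about {mips2:.0f} MIPS")
```

The second compiler's code earns the higher MIPS rating (roughly 3200 against 2800) yet takes longer to run (about 3.75 ms against 2.5 ms), which is why MIPS alone is a misleading measure of performance.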

For Multithreaded Programs
Shall we use execution time or throughput? Or both?
IPC is not accurate here: small timing variations may lead to different executions.
The order in which threads enter a critical section may vary.
Different interrupt timing may lead to different scheduling decisions.
The total number of instructions executed may be different across different runs!

For Multithreaded Programs
The total number of instructions executed may be different across different runs!
This effect increases with the number of cores.
System-level code accounts for a significant fraction of the total execution time.

Your Program Does Not Run in a Vacuum
System software, at least, is there.
Multiprogramming is very common in multicore settings.
Independent programs affect each other's performance (why?).

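Some Metrics About Multiprogramming
Normalized progress of program i: $NP_i = \frac{T_i^{\text{isolation}}}{T_i^{\text{shared}}}$, where $T_i^{\text{isolation}}$ is the time when running in isolation and $T_i^{\text{shared}}$ is the time when running with other programs.
System throughput: $STP = \sum_{i=1}^{n} NP_i = \sum_{i=1}^{n} \frac{T_i^{\text{isolation}}}{T_i^{\text{shared}}}$
Higher-is-better metric.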

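Some Metrics About Multiprogramming
Normalized turnaround time of program i: $NTT_i = \frac{T_i^{\text{shared}}}{T_i^{\text{isolation}}}$, where $T_i^{\text{shared}}$ is the time when running with other programs and $T_i^{\text{isolation}}$ is the time when running in isolation.
Average normalized turnaround time: $ANTT = \frac{1}{n} \sum_{i=1}^{n} NTT_i = \frac{1}{n} \sum_{i=1}^{n} \frac{T_i^{\text{shared}}}{T_i^{\text{isolation}}}$
Lower-is-better metric.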
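The multiprogramming metrics can be sketched in Python as follows. The sketch assumes the commonly used definitions: normalized progress is the isolated time divided by the shared time, system throughput is the sum of normalized progress over all programs, and average normalized turnaround time is the arithmetic mean of shared time divided by isolated time. The function names and the timing values are hypothetical.

```python
def system_throughput(isolated_s, shared_s):
    """Sum of normalized progress (isolated / shared). Higher is better."""
    return sum(t_iso / t_sh for t_iso, t_sh in zip(isolated_s, shared_s))


def avg_normalized_turnaround(isolated_s, shared_s):
    """Mean of normalized turnaround (shared / isolated). Lower is better."""
    ratios = [t_sh / t_iso for t_iso, t_sh in zip(isolated_s, shared_s)]
    return sum(ratios) / len(ratios)


# Hypothetical run times, in seconds, for two co-scheduled programs.
isolated = [10.0, 20.0]   # each program running alone
shared = [15.0, 25.0]     # the same programs running together

stp = system_throughput(isolated, shared)
antt = avg_normalized_turnaround(isolated, shared)
print(f"STP about {stp:.3f}, ANTT about {antt:.3f}")
```

With no interference at all, system throughput would equal the number of programs and the average normalized turnaround time would equal 1, so the distance from those values indicates how much the programs slow each other down.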

Other Metrics

Important
Can what we saw for multiprogramming be used in multithreading?

Conclusions
Performance evaluation is very important for assessing programming quality as well as the underlying architecture and how they interact.
IPC (or CPI) is not a good measure for multithreaded applications.