
1. Amdahl's Law

Three enhancements with the following speedups are proposed for a new architecture: Speedup1 = 30, Speedup2 = 20, Speedup3 = 10. Only one enhancement is usable at a time.

a) If enhancements 1 and 2 are each usable for 30% of the time, what fraction of the time must enhancement 3 be used to achieve an overall speedup of 10?
b) Assume that for some benchmark the fraction of use is 15% for each of enhancements 1 and 2 and 70% for enhancement 3. We want to maximize performance. If only one enhancement can be implemented, which should it be? If two enhancements can be implemented, which two should be chosen?

2. Measuring processor time

After graduating, you are asked to become the lead computer designer at Hyper Computer, Inc. Your study of the usage of high-level language constructs suggests that procedure calls are among the most expensive operations. You have invented a new architecture with an ISA that reduces the loads and stores normally associated with procedure calls and returns. The first thing you do is run some experiments with and without this optimization. Your experiments use the same state-of-the-art optimizing compiler that will be used with either version of the computer. These experiments reveal the following information:

- The clock cycle time of the optimized version is 5% lower than that of the unoptimized version.
- Thirty percent of the instructions in the unoptimized version are loads or stores.
- The optimized version executes two-thirds as many loads and stores as the unoptimized version. For all other instructions the dynamic execution counts are unchanged.
- Every instruction (including load and store) in the unoptimized version takes one clock cycle.
- Due to the optimization, the procedure call and return instructions take one extra cycle in the optimized version, and these instructions account for 5% of the total instruction count in the optimized version.

Which version is faster? Justify your decision quantitatively.

3. Amdahl's Law

A particular program P running on a single-processor system takes time T to complete. Let us assume that 40% of the program's code is associated with "data management housekeeping" (according to Amdahl) and, therefore, can only execute sequentially on a single processor. Let us further assume that the rest of the program (60%) is embarrassingly parallel, in that it can easily be divided into smaller tasks executing concurrently across multiple processors (without any interdependencies or communication among the tasks).

(a) Calculate T2, T4, and T8, the times to execute program P on a two-, four-, and eight-processor system, respectively.
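The weighted form of Amdahl's law that underlies problems 1 and 3 can be sketched in a few lines of Python (the function name and structure here are illustrative, not taken from the original problem set):

```python
def overall_speedup(fractions, speedups):
    """Amdahl's law with several enhancements: enhancement i is usable
    for fractions[i] of the original execution time and speeds that
    portion up by a factor of speedups[i]; the remaining time runs
    unenhanced."""
    enhanced = sum(f / s for f, s in zip(fractions, speedups))
    unenhanced = 1.0 - sum(fractions)
    return 1.0 / (unenhanced + enhanced)

# Problem 3 style: 60% parallel code treated as one "enhancement"
# with speedup n (the processor count), so T_n = T * (0.4 + 0.6/n).
print(overall_speedup([0.6], [2]))   # speedup of P on two processors
```

The same function can be used to check the problem 1 scenarios, e.g. `overall_speedup([0.3, 0.3], [30, 20])` for enhancements 1 and 2 alone.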

(b) Calculate T∞, the time to execute program P on a system with an infinite number of processors. Calculate the speedup of the program on this system, where speedup is defined as T / T∞. What does this correspond to?

4. Amdahl's Law II

Amdahl compares and contrasts the performance of three different machines in his paper (Machines A, B, C). For this problem, consider Machines X, Y, and Z, configured as follows:

- Machine X: a scalar processor with 1 arithmetic unit, running at frequency 2f.
- Machine Y: an array processor with 4 arithmetic units, running at frequency f.
- Machine Z: a VLIW processor with 4 arithmetic units, running at frequency f.

Additionally, we define instruction-level parallelism as the degree to which an application's instructions are independent of each other and, therefore, can be executed concurrently.

(a) If a large portion of an application's code has very low instruction-level parallelism, on which machine, if any, would it run the fastest, and why?
(b) Describe the characteristics of an application that would perform better on Machine Y than on Machine Z.
(c) Describe the characteristics of an application that would perform better on Machine Z than on Machine Y.

5. Your company has just bought a new dual-Pentium processor, and you have been tasked with optimizing your software for it. You will run two applications on this dual Pentium, but their resource requirements are not equal. The first application needs 80% of the resources, the other only 20%.

a. Given that 30% of the first application is parallelizable, how much speedup would you achieve with that application if it were run in isolation?
b. Given that 90% of the second application is parallelizable, how much speedup would that application observe if run in isolation?
c. Given that 30% of the first application is parallelizable, how much overall system speedup would you observe if you parallelized it?
d. Given that 90% of the second application is parallelizable, how much overall system speedup would you get?

6. In the load-store architecture of MIPS, the operands of arithmetic and logical instructions must come from registers. For a typical integer program, the instruction distribution and CPI of the 4 groups are given in the following table.

a. Calculate the average CPI of the integer program.
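For problem 5, the per-application speedup and the overall system speedup are two applications of the same formula, composed; a minimal sketch (the function names are ours, not from the problem set):

```python
def amdahl(parallel_frac, n_cores):
    """Speedup of a single application of which parallel_frac is
    perfectly parallelizable across n_cores."""
    return 1.0 / ((1.0 - parallel_frac) + parallel_frac / n_cores)

def system_speedup(app_share, app_speedup):
    """Overall speedup when one application, accounting for app_share
    of total system time, is accelerated by app_speedup while the rest
    of the system is unchanged."""
    return 1.0 / ((1.0 - app_share) + app_share / app_speedup)

# Part (a): the first application (30% parallelizable) alone on 2 cores.
app1 = amdahl(0.30, 2)
# Part (c): the same gain weighted by that application's 80% share.
sys1 = system_speedup(0.80, app1)
print(app1, sys1)
```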

b. Now assume that a set of new memory-register arithmetic and logical instructions is added to the ISA. Each memory-register ALU instruction combines one load and one original ALU instruction, and it takes 4 cycles to execute this new type of instruction. Assume 60% of the load instructions can be combined for this program; calculate the new CPI of the integer program.
c. Assume the modification increases the overall cycle time by 5%. Is the modification really worthwhile?

7. Your company's internal studies show that a single-core system is sufficient for the demand on your processing power. You are exploring, however, whether you could save power by using two cores.

a. Assume that your application is 80% parallelizable. By how much could you decrease the frequency and get the same performance?
b. Assume that the voltage may be decreased linearly with the frequency. Using the equation in Section 1.5, how much dynamic power would the dual-core system require compared with the single-core system?
c. Now assume that the voltage may not decrease below 30% of the original voltage. This voltage is referred to as the voltage floor; at any voltage lower than that, the processor will lose its state. Using the equation in Section 1.5, how much dynamic power would the dual-core system from part (a) require compared with the single-core system when the voltage floor is taken into account?

8. You find yourself on a game show, presented with two machines. You are supposed to pick the faster one to win an awesome prize! You are given the following information about the two machines A and B (running different compilers): Machine A has a clock rate of 2 GHz with the following measurements, and Machine B has a clock rate of 2.5 GHz with the following measurements. To make sure you don't parrot the answers given by the audience, the host asks you the following questions.

a. What is the average CPI of machines A and B?
b. On which machine is the program faster with respect to
   i. execution time?
   ii. MIPS rating?

9. 30% of a benchmark program's execution time is spent in multiply operations. Uber-cool hardware speeds up these operations 12 times! Suppose the program took 20 seconds to execute without the enhanced hardware. What overall speedup will be achieved? With the enhancement in place, what is the new execution time, and what percentage of that time do multiply operations take?

10. When making changes to optimize part of a processor, it is often the case that speeding up one type of instruction comes at the cost of slowing down something else. For example, if we put in a complicated fast floating-point unit, it takes space, and something might have to be moved farther from the middle of the chip to accommodate it, adding an extra cycle of delay to reach that unit. The basic Amdahl's law equation does not take this trade-off into account.

Let's assume that for some benchmark program, 15% of the original execution time is taken up by floating-point operations, 25% by data accesses, and 30% by I/O operations. You have 3 teams of engineers who come up with cool hardware to enhance each of these operations, but unfortunately they inadvertently end up affecting the other operations as well. Your job is to choose the hardware with the highest overall speedup and reward that team with a bag of goodies!

- Team A improves the floating-point hardware: it speeds up floating-point operations 12 times but slows down data accesses by 1.25 times and I/O operations by 1.1 times.
- Team B improves the data-access hardware: it speeds up data accesses by 2.5 times and floating-point operations by 2 times, but slows down I/O operations by 1.5 times.
- Team C improves the I/O hardware: it speeds up I/O operations by 6 times but slows down data accesses by 2.5 times, leaving floating-point operations unchanged.

11. Suppose that when Program A is run, the user CPU time is 3 seconds, the elapsed wall-clock time is 4 seconds, and the system performance is 10 MFLOP/sec. Assume that no other processes take any significant amount of time, and that the computer is either doing calculations in the CPU or doing I/O, but cannot do both at the same time. We now replace the processor with one that runs six times faster but does not affect the I/O speed. What will the user CPU time, the wall-clock time, and the MFLOP/sec performance be now?

12. You are on the design team for a new processor. The clock of the processor runs at 200 MHz. The following table gives the instruction frequencies for Benchmark B, as well as how many cycles the instructions take, for the different classes of instructions. For this problem, we assume that (unlike many of today's computers) the processor executes only one instruction at a time.

Instruction Type         Frequency   Cycles
Loads & Stores           30%         6
Arithmetic Instructions  50%         4
All Others               20%         3

a. Calculate the CPI for Benchmark B.
b. The CPU execution time on the benchmark is exactly 11 seconds. What is the "native MIPS" processor speed for the benchmark, in millions of instructions per second?
c. The hardware expert says that if you double the number of registers, the cycle time must be increased by 20%. What would the new clock speed be (in MHz)?
d. The compiler expert says that if you double the number of registers, the compiler will generate code that requires only half the number of loads and stores. What would the new CPI be on the benchmark?
e. How many CPU seconds will the benchmark take if we double the number of registers (taking into account both changes described above)?
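The CPI and native-MIPS arithmetic of problem 12 can be checked with a short script; this is a sketch with our own variable names, using only the figures given in the table above:

```python
# Instruction mix for Benchmark B: (fraction of instructions, cycles per instruction).
mix = {
    "loads_stores": (0.30, 6),
    "arithmetic":   (0.50, 4),
    "all_others":   (0.20, 3),
}

# Average CPI is the frequency-weighted sum of the per-class cycle counts.
cpi = sum(frac * cycles for frac, cycles in mix.values())

clock_mhz = 200.0
# Native MIPS = clock rate in MHz divided by CPI, since each
# instruction takes cpi cycles on average.
native_mips = clock_mhz / cpi

print(cpi, native_mips)
```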
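Problem 7's frequency and power questions can likewise be set up numerically. The sketch below assumes the usual dynamic-power relation P_dyn ∝ (number of cores) × V² × f (presumably the "equation in Section 1.5" the problem refers to) and, as part (b) states, that voltage scales linearly with frequency; the function names are illustrative:

```python
def matching_frequency_ratio(parallel_frac):
    """Frequency ratio f'/f at which a dual-core run matches
    single-core performance: by Amdahl's law the dual-core run needs
    only ((1-p) + p/2) of the cycles, so the clock can slow by that factor."""
    return (1.0 - parallel_frac) + parallel_frac / 2.0

def relative_dynamic_power(v_ratio, f_ratio, n_cores):
    """Dynamic power relative to the single-core baseline, assuming
    P_dyn is proportional to n_cores * V**2 * f."""
    return n_cores * v_ratio ** 2 * f_ratio

f_ratio = matching_frequency_ratio(0.80)              # part (a)
power = relative_dynamic_power(f_ratio, f_ratio, 2)   # part (b): V tracks f
print(f_ratio, power)
```

For part (c), the voltage ratio passed in would be clamped at the 0.3 voltage floor while the frequency ratio stays at the part (a) value.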