CSE 141 Midterm Exam

Similar documents
Computer Organization and Components

WAR: Write After Read

EE282 Computer Architecture and Organization Midterm Exam February 13, (Total Time = 120 minutes, Total Points = 100)

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

Course on Advanced Computer Architectures

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

EC 362 Problem Set #2

Pipelining Review and Its Limitations

Introducción. Diseño de sistemas digitales.1

Instruction Set Design

Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1

Pipeline Hazards. Arvind Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material prepared by Arvind and Krste Asanovic

Performance evaluation

In the Beginning The first ISA appears on the IBM System 360 In the good old days

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

on an system with an infinite number of processors. Calculate the speedup of

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

EE361: Digital Computer Organization Course Syllabus

Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.

Stack machines The MIPS assembly language A simple source language Stack-machine implementation of the simple language Readings:

Week 1 out-of-class notes, discussions and sample problems

Instruction Set Architecture (ISA) Design. Classification Categories

CS352H: Computer Systems Architecture


Computer organization

VLIW Processors. VLIW Processors

Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x

A single register, called the accumulator, stores the. operand before the operation, and stores the result. Add y # add y from memory to the acc

Instruction Set Architecture (ISA)

Design of Pipelined MIPS Processor. Sept. 24 & 26, 1997

! Metrics! Latency and throughput. ! Reporting performance! Benchmarking and averaging. ! CPU performance equation & performance trends

PROBLEMS #20,R0,R1 #$3A,R2,R4

How To Understand The Design Of A Microprocessor

Reduced Instruction Set Computer (RISC)

Intel 8086 architecture

Lecture 27 C and Assembly

Lecture 7: Machine-Level Programming I: Basics Mohamed Zahran (aka Z)

Instruction Set Architecture

EEM 486: Computer Architecture. Lecture 4. Performance

Lecture: Pipelining Extensions. Topics: control hazards, multi-cycle instructions, pipelining equations

Review: MIPS Addressing Modes/Instruction Formats

CS:APP Chapter 4 Computer Architecture Instruction Set Architecture. CS:APP2e

Intel Pentium 4 Processor on 90nm Technology

Chapter 5 Instructor's Manual

CSE 141 Introduction to Computer Architecture Summer Session I, Lecture 1 Introduction. Pramod V. Argade June 27, 2005

Solutions. Solution The values of the signals are as follows:

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

LSN 2 Computer Processors

Lecture Outline. Stack machines The MIPS assembly language. Code Generation (I)

Central Processing Unit (CPU)

Design of Digital Circuits (SS16)

ARM Microprocessor and ARM-Based Microcontrollers

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

Data Dependences. A data dependence occurs whenever one instruction needs a value produced by another.

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

The AVR Microcontroller and C Compiler Co-Design Dr. Gaute Myklebust ATMEL Corporation ATMEL Development Center, Trondheim, Norway

Computer Organization and Architecture

Instruction Set Architecture

A Lab Course on Computer Architecture

Machine-Level Programming II: Arithmetic & Control

l C-Programming l A real computer language l Data Representation l Everything goes down to bits and bytes l Machine representation Language

İSTANBUL AYDIN UNIVERSITY

Chapter 4 Lecture 5 The Microarchitecture Level Integer JAVA Virtual Machine

High-speed image processing algorithms using MMX hardware

CHAPTER 7: The CPU and Memory

Machine Programming II: Instruc8ons

Addressing The problem. When & Where do we encounter Data? The concept of addressing data' in computations. The implications for our machine design(s)

Five Families of ARM Processor IP

"JAGUAR AMD s Next Generation Low Power x86 Core. Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012

Processor Architectures

Using Graphics and Animation to Visualize Instruction Pipelining and its Hazards

Language Processing Systems

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Assembly Language: Function Calls" Jennifer Rexford!

POWER8 Performance Analysis

COMP 303 MIPS Processor Design Project 4: MIPS Processor Due Date: 11 December :59

ARM Architecture. ARM history. Why ARM? ARM Ltd developed by Acorn computers. Computer Organization and Assembly Languages Yung-Yu Chuang

CS412/CS413. Introduction to Compilers Tim Teitelbaum. Lecture 20: Stack Frames 7 March 08

The Central Processing Unit:

The Microarchitecture of Superscalar Processors

OC By Arsene Fansi T. POLIMI

Computer Architecture TDTS10

Operating System Impact on SMT Architecture

CS 61C: Great Ideas in Computer Architecture Finite State Machines. Machine Interpreta4on

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

CPU Performance Equation

An Introduction to the ARM 7 Architecture

CHAPTER 4 MARIE: An Introduction to a Simple Computer

CS521 CSE IITG 11/23/2012

Figure 1: Graphical example of a mergesort 1.

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune

CISC, RISC, and DSP Microprocessors

Transcription:

CSE 141 Midterm Exam 2011 Winter Professor Steven Swanson 1. Please write your name at the top of each page 2. This is a close book, closed notes exam. No outside material may be used. 3. You may use a calculator 4. Show your work. You will get more partial credit that way. 5. If you have any questions, please raise your hand. 6. Good luck! Name: Student ID: Problem Score Out of 1 32 2 8 3 12 4 16 5 20 6 12 7 12 Total 112

1. Short Answer Questions: 1. Give the performance equation and define each of the terms. For each term, give two system components that can effect each term Exec time = inst count * cycles/inst * sec/cyc Inst count compiler, the program, the ISA Cpi micro architecture, the compiler, the program sec/cyc microarchitecture, process technology 2. Give the equation of Amdalh s Law and define each of the terms. Stot = 1/((x/S) + (1-x)) Stot = total speedup. X = fraction of program affected, S = speedup for that portion 3. Which of SRAMs and DRAMs provide the highest density (i.e., the most bits per square centimeter)? The fastest access time? SRAM is less dense, but faster 4. Give two aspects of a good ISA that contribute to uniformity and make it easy to design a processor to implement the ISA. Few instruction formats, orthogonality, fixed inst length, constant amount of work per inst. 5. Describe the difference between a stall and flush. Give an example of when each is useful? (3%) Stalls freeze part of the pipeline, delaying the execution of some instructions. They are useful for waiting for hazards to resolve Flushing removes an incorrect instruction from the pipeline. It is useful for canceling instructions fetched because of a misprediction. 6. Describe the 5 stages of the the standard MIPS pipeline. Fetch fetch the inst; Decode decipher the inst and read from the reg file; Execute perform the operation; Memory access memory; Write back write the result into the reg file. 7. Give a metric for measuring the quality (i.e., the bigger the value, the better) of a computer system that takes power consumption and performance into account and places equal emphasis on each. Give another metric that places much more emphasis on power consumption. MIPS/W MIPS/W^3 They key is that the W has a larger exponent.

The table below shows the instruction type breakdown and the CPI of each instruction type of a given application running on a 3GHz StoneBear 438 processor. Instruction Type Percentage of Instruction Count CPI Integer 55% 1 Load/Store 30% 3 Branch 15% 4 Assume the application has a total of 2*10 9 instructions. Please answer the following questions: 8. What is the total execution time of the given benchmark? (7%) Total execution time = CPI * # of Instructions * clock cycle period = (0.55*1 + 0.30*3 + 0.15*4) * 2e9 * 0.33e-9 = 1.367 9. A revision of StoneBear 438 processor raises the clock rate to 4GHz, but also increases the CPI of Load/Store instructions to 12. What s the speedup of this revision vs the original? (7%) Total execution time = CPI * # of Instructions * clock cycle period = (0.55*1 + 0.30*12 + 0.15*4) * 2e9 * 0.25e-9 = 2.375s SpeedUp = new time/old time = 1.367/2.375 = 0.576

2. Performance calculations 1. If 20% of the program is optimized to run at 4x and 40% of the program at 2x, what is the total speedup? (Ans) 100 / ( ( 20 / 2) + (40/2) + 40)) = 100 / 65 = 1.538 2. A software engineer decides to rewrite a portion of the program that accounts for 60% of execution time so that it can run on multiple processors in parallel. What is maximum speedup she can hope to achieve? (Ans) Maximum speedup 100 / 40 = 2.5 3. Processor A is a single-cycle processor and runs at a clock rate of 1Mhz. Processor B is a pipelined version of A with 12 stages. Assuming ideal pipelining, what is the clock rate of B? Give two reasons why B will probably not attain this clock rate. (Ans) Assuming idea pipeline, clock rate of B is 12Mhz 2 reasons why B might not attain 12 Ghz a. It might not be possible to divide it into 12 equal stages b. There are additional delays due to pipeline registers 3. Scaled add: 1. On the pipeline diagram on the next page add the components necessary to execute a scaled add : adds rd, rs, rt => R[rd] = R[rs] + (R[rt] << 2). 2. Rewrite (or show how you would modify) the following code to use the scaled add to accelerate execution sll $t0, $a1, 2 add $t1, $a0, $t0 lw $v0, 0($t1) sll $t2, $v0, 2 add $v0, $v0, $t2 becomes adds $t1, $a0, $a1 lw $v0, 0($t1) adds $v0, $v0, $v0 3. What fraction of execution would the above code need to account for in order to speedup total execution by 1.31? (assume that adding the instruction does not affect the cycle time). ( 5 instructions originally became 3 instructions) So, s = 5/3 S tot = 1.31 = 1 / ( x/s + (1-x) ) 1.31 = 1 / ( x/(5/3) + (1-x)) Solving for x gives x = 0.58

Figure 1 The pipelined processor <<2 The shifter can also go before the pipeline register. Figure 1 The pipelined processor

4. Consider the following MIPS assembly code: LOOP: lw $t1, 0($a0) // 1 lw $a0, 0($t1) // 2 addi $a1, $a1, -1 // 3 bne $a1, $zero, LOOP // 4 add $v0, $a0, $zero // 5 addi $sp, $sp, 8 // 6 Assume that we are using the pipeline processor shown in Figure 1 (The processor we used in class) and $a1 is initialized to 2. Please answer the following questions: 1. List the sequence of instructions that will be executed during the first two iterations of the loop. (You can just list the numbers after the // if you wish.) (3%) 1, 2, 3, 4, 1, 2, 3, 4, 5, 6 2. Use arrows to show all the data dependencies between the executed instructions. Circle the instructions that may cause control hazards. (10%) lw $t1, 0($a0) // 1 lw $a0, 0($t1) // 2 addi $a1, $a1, -1 // 3 bne $a1, $zero, LOOP // 4 lw $t1, 0($a0) // 1 lw $a0, 0($t1) // 2 addi $a1, $a1, -1 // 3 bne $a1, $zero, LOOP // 4 add $v0, $a0, $zero // 5 addi $sp, $sp, 8 // 6

Assume that the pipeline processor implements al possible full forwarding paths and a predict not-taken branch predictor. Draw the pipeline diagram for the instructions you listed above. Include instructions, if any, that are fetched incorrectly. Use the letters F, D, E, M, and W to represent pipe stages, S for stall and X for the noop that replaces an instruction when it s squashed. (12%) cycles-> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 lw $t1, 0($a0) F D E M W lw $a0, 0($t1) F D S E M W 3 F S D E M W 4 F D E M W 5 F D X X X 6 F X X X X 1 F D E M W 2 F D S E M W 3 F S D E M W 4 F D E M W 5 F D E M W 6 F D E M W

5. Consider the following C code: int foo(int a){ int b=1; while(a!= 0) { b = b * a; a = a-1; } return b; } The code is compiled as the assembly code shown in the following table. How many memory acces 1. ses and arithmetic operations will the foo function require? Assume a = 5. Please complete the following table: (10%) foo:.l3:.l2: Static Static Arithmetic Memory Operations Operations Dynamic Arithmetic Ops Dynamic Memory Ops pushl %ebp 1 1 1 1 movl %esp, %ebp 1 1 subl $16, %esp 1 1 movl $1, -4(%ebp) 2 1 2 1 jmp.l2 1 1 movl -4(%ebp), %eax 2 1 10 5 imull 8(%ebp), %eax 2 1 10 5 movl %eax, -4(%ebp) 2 1 10 5 subl $1, 8(%ebp) 2 1 10 5 cmpl $0, 8(%ebp) 2 1 12 6 jne.l3 1 6 movl -4(%ebp), %eax 2 1 2 1 leave 2 1 2 1 ret 1 1 1 1 Total: 22 10 69 31 2. Is the above code compiled with or without optimization turned on? Why or why not? (5%) Without, since all the variable are stored on the stack rather than in registers.

6. Assume that the following is the outcome of a particular branch when it was executed within a loop: T, T, T, NT, T, T, T, NT, T, T, T, NT 1. What is the percentage of misprediction for a predicted-taken predictor (Ans) Misprediction = 3 /12 = 25% i. 2-bit predictor with initial value 00 00 - Strongly not taken, 01 - weakly not taken, 10 - weak taken, 11 - strongly taken Actual Predictor State Predicted State Correct /Incorrect T 00 NT 0 T 01 NT 0 T 10 T 1 NT 11 T 0 T 10 T 1 T 11 T 1 T 11 T 1 NT 11 T 0 T 10 T 1 T 11 T 1 NT 11 T 0 Misprediction = 5 / 12 = 41.67% 2. Assume that in a typical execution, 80% of your instructions are branches that follow the above outcome pattern. Currently there is a predicted-not-taken predictor in your machine. In order to optimize the program, you can either change the predictor to a 2-bit predictor or you can change the code and reduce the number of branch instructions by half. Compute the speedup for each alternative. Which optimization is the best choice?

(Ans) CPI-predict-not-taken = not_branches*no_branch_cpi + branches* mis_predict_rate * mispredict_cpi + branches*(1-mis_predict_rate)*correct_predict_cpi CPI-predict-not-taken = 0.2*1 + 0.8 * 0.75 * (2 + 1) + 0.8*0.25*1 = 2.2 CPI-fewer-branches = = 0.6*1 + 0.4 * 0.75 * (2 + 1) + 0.4*0.25*1 = 1.6 CPI-two-bit = 0.2*1 + 0.8 * 0.416 * (2 + 1) + 0.8*(1-.416)*1 = 1.66 Speedup-fewer-branches = 2.2/1.6 = 1.375 Speedup-2-bit = 2.2/1.66 = 1.325 Second option is better