2013 Advanced Computer Architecture Mid-term Exam


1. Amdahl's law

When making changes to optimize part of a processor, it is often the case that speeding up one type of instruction comes at the cost of slowing down something else. For example, if we put in a complicated fast floating-point unit, that takes space, and something else might have to be moved farther away from the middle to accommodate it, adding an extra cycle of delay to reach that unit. The basic Amdahl's law equation does not take this trade-off into account.

a) If the new fast floating-point unit speeds up floating-point operations by a factor of 2 on average, and floating-point operations take 20% of the original program's execution time, what is the overall speedup (ignoring the penalty to any other instructions)?

b) Now assume that speeding up the floating-point unit slowed down data cache accesses, which consume 10% of the execution time. What is the overall speedup now?

a. Speedup = 1 / (0.8 + 0.20/2) = 1.11
b. Speedup = 1 / (0.7 + 0.20/2 + 0.10 × 3/2) = 1.05 (the data cache accesses are assumed to take 3/2 as long as before)
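The arithmetic generalizes to any mix of sped-up and slowed-down fractions. Below is a minimal C sketch of that generalized formula; the helper name speedup and its parameters are our own, not part of the exam, and part (b) models the penalty as cache accesses taking 1.5x as long, matching the 0.10 × 3/2 term above.

#include <stdio.h>

/* Generalized Amdahl's law with a slowdown penalty.
   frac_fast: fraction of original time that is sped up; s: its speedup;
   frac_slow: fraction of original time that is penalized; p: its slowdown. */
static double speedup(double frac_fast, double s,
                      double frac_slow, double p)
{
    double other = 1.0 - frac_fast - frac_slow;   /* unaffected fraction */
    return 1.0 / (other + frac_fast / s + frac_slow * p);
}

int main(void)
{
    /* a) only the 2x floating-point speedup */
    printf("a) %.2f\n", speedup(0.20, 2.0, 0.0, 1.0));   /* prints 1.11 */
    /* b) FP speedup plus cache accesses taking 1.5x as long */
    printf("b) %.2f\n", speedup(0.20, 2.0, 0.10, 1.5));  /* prints 1.05 */
    return 0;
}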

2. Cache performance optimization

The transpose of a matrix interchanges its rows and columns. Here is a simple C loop to perform the transpose:

for (i = 0; i < 3; i++) {
    for (j = 0; j < 3; j++) {
        output[j][i] = input[i][j];
    }
}

Assume that both the input and output matrices are stored in row-major order (row-major order means that the column index changes fastest). Assume that you are executing a 256 × 256 double-precision transpose on a processor with a 16KB fully associative (don't worry about cache conflicts) L1 data cache with least-recently-used (LRU) replacement and 64-byte blocks. Assume that L1 misses take 16 cycles and always hit in the L2 cache.

For the simple implementation given above, the execution order is non-ideal for the input matrix; however, applying a loop-interchange optimization would create a non-ideal order for the output matrix. Because loop interchange is not sufficient to improve its performance, the loop must be blocked instead.

a) What should be the minimum size of the cache to take advantage of blocked execution?

Each element is 8B. Since a 64B cache line holds 8 elements, and each column access of the non-ideal matrix fetches a new line, we need a minimum of 8 × 8 (64 elements) for each matrix. Hence, the minimum cache size is 128 × 8B = 1KB (128 = 64 + 64 elements for the two matrices).

b) How do the relative numbers of misses in the blocked and unblocked versions compare in the minimum-sized cache above?

The blocked version only has to fetch each input and output element once. The unblocked version has one cache miss for every 64B/8B = 8 row elements. Each column requires 64B × 256 of storage, or 16KB, so column elements are evicted from the cache before they can be reused. Hence the unblocked version has 9 misses (1 row and 8 column) for every 2 misses in the blocked version.

c) Write code to perform a transpose with a block-size parameter B, using B × B blocks.

for (i = 0; i < 256; i += B)
    for (j = 0; j < 256; j += B)
        for (m = 0; m < B; m++)
            for (n = 0; n < B; n++)
                output[j + n][i + m] = input[i + m][j + n];
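For concreteness, here is a self-contained version of the solution's blocked transpose that compiles and runs as-is; the block size B = 8 (matching the 8-element cache lines), the static arrays, and the spot-check are our assumptions, not exam material.

#include <stdio.h>

#define N 256
#define B 8   /* block size; assumed to divide N evenly */

static double input[N][N], output[N][N];

int main(void)
{
    /* Fill the input with recognizable values. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            input[i][j] = i * N + j;

    /* Blocked transpose: each B x B block of the input maps to the
       transposed block of the output, so both matrices are walked in
       line-sized chunks that fit in the 1KB minimum cache. */
    for (int i = 0; i < N; i += B)
        for (int j = 0; j < N; j += B)
            for (int m = 0; m < B; m++)
                for (int n = 0; n < B; n++)
                    output[j + n][i + m] = input[i + m][j + n];

    /* Spot-check one element: output[j][i] should equal input[i][j]. */
    printf("%s\n", output[3][200] == input[200][3] ? "ok" : "wrong");
    return 0;
}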

Three C's of Cache Misses (Short Answer)

Mark whether each of the following modifications will cause each miss category to increase, decrease, or have no effect. You may assume the baseline cache is set-associative. Explain your reasoning to receive credit.

Double the associativity (capacity and line size constant; halves the number of sets):
- Compulsory misses: No effect. If the data was never in the cache, increasing associativity under these constraints won't change that.
- Conflict misses: Decrease. Higher associativity typically reduces conflict misses because there are more places to put the same element.
- Capacity misses: No effect. Capacity was given as a constant.

Adding a victim cache:
- Compulsory misses: No effect. The victim cache only holds lines previously held by the CPU.
- Conflict misses: Decrease. The victim cache holds the victim of a conflict, so it can be used again later.
- Capacity misses: Decrease, since the victim cache provides slightly larger effective capacity. (Answering "No effect" on the grounds that the victim cache doesn't count toward the capacity total lost only 0.5.)

Adding prefetching:
- Compulsory misses: Decrease. Hopefully the prefetched data is there when needed.
- Conflict misses: No effect, since prefetching doesn't affect placement; or possibly increase, since prefetched data could pollute the cache.
- Capacity misses: No effect; or possibly increase, since prefetched data could pollute the cache.

Ben is designing a 7-stage in-order pipeline and is concerned about the performance implications of branches. The baseline processor he is considering is similar to the classic 5-stage RISC pipeline, except that instruction cache access and data cache access each take two stages, as shown in Figure 2-1. Initially, the pipeline lacks any branch prediction mechanism. All branches in Ben's ISA are simple enough that they can execute without an ALU. His ISA has no branch delay slots.

[Figure 2-1: Ben's in-order pipeline]

-----------------------------------------------------------------------------------------------------

Problem 2.A

What is the earliest stage at which branches can be resolved in Ben's pipeline? How many instructions are squashed on a taken branch? Assuming 1 in 6 instructions is a branch, 3/5 of branches are taken, and a base CPI of 1, what is the CPI of Ben's pipeline?

Simple branches can be resolved in decode, so two fetches are wasted on a taken branch. The CPI is 5/6 (non-branches) + 1/6 × 2/5 (untaken branches) + 3 × 1/6 × 3/5 (taken branches), or 6/5.

Branch resolution stage: Decode
Instructions squashed: 2
CPI: 6/5
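As a quick arithmetic check, a minimal C sketch of the CPI computation (variable names are ours); a taken branch costs its own cycle plus the two squashed fetch cycles:

#include <stdio.h>

int main(void)
{
    double p_branch = 1.0 / 6.0;  /* fraction of instructions that branch */
    double p_taken  = 3.0 / 5.0;  /* fraction of branches that are taken  */
    double cpi = (1.0 - p_branch) * 1.0             /* non-branches          */
               + p_branch * (1.0 - p_taken) * 1.0   /* untaken branches      */
               + p_branch * p_taken * 3.0;          /* taken: 1 + 2 squashed */
    printf("CPI = %.2f\n", cpi);                    /* prints CPI = 1.20     */
    return 0;
}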

Problem 3.A

Describe how precise exceptions are maintained in out-of-order processors.

Exceptions are detected when an instruction executes out of order and are recorded in the ROB. Since instructions commit in order from the ROB, exceptions can still be taken in program order: an exception is not actually taken until the corresponding instruction reaches the head of the ROB, about to commit.

Problem 3.B

Consider an out-of-order processor with register renaming using a unified physical register file. A new physical register is allocated for each instruction's destination register in the decode stage, but since physical registers are a finite resource, they must be deallocated at some point. Carefully explain when it is safe to deallocate a physical register.

A physical register can be freed when the next writer of the same architectural register commits. At that point, it is guaranteed that no instruction remaining in the pipeline needs to read the old physical register.

Compute the clocks per instruction (CPI) of a machine with an average CPI for ALU operations of 1.1, a CPI for branches/jumps of 3.0, and a 60% hit rate in the cache. A cache hit takes 1 cycle (pipelined) and a cache miss takes 120 cycles. Assume 22% of instructions are loads, 12% are stores, 20% are branches/jumps, and the balance are ALU operations.

CPI = P_ALU × CPI_ALU + P_BR/JMP × CPI_BR/JMP + P_LD/ST × (P_HIT × CPI_HIT + P_MISS × CPI_MISS)
    = (1 - 0.22 - 0.12 - 0.2) × 1.1 + 0.2 × 3.0 + (0.22 + 0.12) × (0.6 × 1 + (1 - 0.6) × 120)
    = 17.63
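The same computation as a short C sketch (names are ours), which reproduces the 17.63 figure:

#include <stdio.h>

int main(void)
{
    double p_ld = 0.22, p_st = 0.12, p_br = 0.20;
    double p_alu = 1.0 - p_ld - p_st - p_br;           /* 0.46 */
    double cpi_alu = 1.1, cpi_br = 3.0;
    double hit = 0.60, cpi_hit = 1.0, cpi_miss = 120.0;
    double cpi = p_alu * cpi_alu
               + p_br * cpi_br
               + (p_ld + p_st) * (hit * cpi_hit + (1.0 - hit) * cpi_miss);
    printf("CPI = %.2f\n", cpi);   /* prints CPI = 17.63 */
    return 0;
}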

For the following code snippet, identify all of the RAW, WAW, and WAR hazards, providing a list for each hazard type. (Hint: remember that you have to check more than neighboring instructions.)

I0: LD    F4, 0(Rx)
I1: MULTD F2, F0, F2
I2: DIVD  F8, F4, F2
I3: LD    F4, 0(Ry)
I4: ADDD  F6, F0, F4
I5: SUBD  F8, F8, F6
I6: SD    F8, 0(Ry)

RAW: (I0, I2), (I0, I4), (I1, I2), (I2, I5), (I2, I6), (I3, I4), (I4, I5), (I5, I6)
WAW: (I0, I3), (I2, I5)
WAR: (I2, I3)
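Since the hint stresses checking non-adjacent pairs, a brute-force pairwise scan makes the bookkeeping mechanical. The C sketch below is our own (the struct ins encoding and helper names are assumptions, not exam material); it flags every later instruction sharing a register name with an earlier one, including pairs separated by an intervening write, which matches the lists above.

#include <stdio.h>
#include <string.h>

/* One destination register and up to two sources, as strings like "F4". */
struct ins { const char *dst, *src1, *src2; };

static const struct ins code[] = {
    { "F4", "Rx", NULL },   /* I0: LD    F4,0(Rx)                     */
    { "F2", "F0", "F2" },   /* I1: MULTD F2,F0,F2                     */
    { "F8", "F4", "F2" },   /* I2: DIVD  F8,F4,F2                     */
    { "F4", "Ry", NULL },   /* I3: LD    F4,0(Ry)                     */
    { "F6", "F0", "F4" },   /* I4: ADDD  F6,F0,F4                     */
    { "F8", "F8", "F6" },   /* I5: SUBD  F8,F8,F6                     */
    { NULL, "F8", "Ry" },   /* I6: SD    F8,0(Ry) - store, no reg dst */
};

static int eq(const char *a, const char *b)
{
    return a && b && strcmp(a, b) == 0;
}

int main(void)
{
    int n = sizeof code / sizeof code[0];
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            /* RAW: j reads a register that i writes */
            if (eq(code[i].dst, code[j].src1) || eq(code[i].dst, code[j].src2))
                printf("RAW I%d,I%d\n", i, j);
            /* WAW: j writes a register that i writes */
            if (eq(code[i].dst, code[j].dst))
                printf("WAW I%d,I%d\n", i, j);
            /* WAR: j writes a register that i reads */
            if (eq(code[j].dst, code[i].src1) || eq(code[j].dst, code[i].src2))
                printf("WAR I%d,I%d\n", i, j);
        }
    return 0;
}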

1. For the following snippet of code, select the single architectural feature that will most improve its performance. Explain your choice, including a description of why the other features will not improve performance as much, and state your assumptions about the machine design.

ADD.D F0, F1, F8
ADD.D F2, F3, F8
ADD.D F4, F5, F8
ADD.D F6, F7, F8

A. Out-of-Order Issue with Renaming
B. Branch Prediction
C. Superscalar

Answer: C

A: There are no WAR, WAW, or RAW hazards in this code, so out-of-order issue with renaming does not help performance.
B: There are no branches in this code, so branch prediction cannot improve performance.
C: The four instructions can be fetched, executed, and written back in parallel, so a superscalar machine gives the largest performance improvement.

Consider the execution of the following loop, which searches an array, on a single-issue processor with in-order issue, out-of-order execution and write-back, dynamic scheduling, and speculation:

Loop: LD    R2, 0(R1)    ; R2 = array element
      DADDI R2, R2, #1   ; increment R2
      SD    R2, 0(R1)
      DADDI R1, R1, #-4  ; decrement pointer
      BNEZ  R2, Loop     ; branch if the element != 0

Assume that there are separate integer functional units for effective-address calculation, ALU operations, and branch-condition evaluation, and that only one instruction can write back per cycle. Complete the following table for the first three iterations of this loop.

Iter | Instruction        | Issues at | Executes at | Memory access at | Writes CDB at | Comment
  1  | LD    R2, 0(R1)    |     1     |      2      |        3         |       4       | First issue
  1  | DADDI R2, R2, #1   |     2     |      5      |                  |       6       |
  1  | SD    R2, 0(R1)    |     3     |      7      |        8         |               |
  1  | DADDI R1, R1, #-4  |     4     |      8      |                  |       9       |
  1  | BNEZ  R2, Loop     |     5     |      7      |                  |               |
  2  | LD    R2, 0(R1)    |     8     |     10      |       11         |      12       |
  2  | DADDI R2, R2, #1   |     9     |     13      |                  |      14       |
  2  | SD    R2, 0(R1)    |    10     |     15      |       16         |               |
  2  | DADDI R1, R1, #-4  |    11     |     16      |                  |      17       |
  2  | BNEZ  R2, Loop     |    12     |     13      |                  |               |
  3  | LD    R2, 0(R1)    |    14     |     16      |       17         |      18       |
  3  | DADDI R2, R2, #1   |    15     |     19      |                  |      20       |
  3  | SD    R2, 0(R1)    |    16     |     21      |       22         |               |
  3  | DADDI R1, R1, #-4  |    17     |     22      |                  |      23       |
  3  | BNEZ  R2, Loop     |    18     |     19      |                  |               |

Notice that:
1. There are separate integer functional units for effective-address calculation, ALU operations, and branch-condition evaluation, so the execute stages of LD/SD, DADDI, and BNEZ can overlap.
2. There is no bypassing.
3. There is no branch prediction.
4. The memory-access stages of different LD/SD instructions cannot overlap.

Assume that you have the following pipeline, which can issue two instructions per cycle and commit one instruction per cycle. Draw the pipeline diagram for the execution of the following code sequence.

MUL R6,  R7,  R8
ADD R9,  R10, R11
ADD R11, R12, R13
ADD R13, R14, R15
ADD R19, R13, R10
LW  R2,  R3
ADD R12, R16, R19
LW  R5,  R2
ADD R15, R20, R21

Since only one instruction can commit per cycle and commits are in program order, instructions wait after writeback (W) until their commit (C) slot, so the commits form a diagonal from cycle 9 through cycle 17.

Instruction      1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17
MUL R6,R7,R8     F  D  I  Y0 Y1 Y2 Y3 W  C
ADD R9,R10,R11   F  D  I  X0 W              C
ADD R11,R12,R13     F  D  I  X0 W              C
ADD R13,R14,R15     F  D  I  I  X0 W              C
ADD R19,R13,R10        F  D  I  I  I  X0 W           C
LW  R2,R3              F  D  I  L0 L1 W                 C
ADD R12,R16,R19           F  D  I  I  I  I  X0 W           C
LW  R5,R2                 F  D  I  I  I  I  I  L0 L1 W        C
ADD R15,R20,R21              F  D  I  I  X0 W                    C