EC 362 Problem Set #2



1) Using single-precision IEEE 754, what value does the bit pattern FF28 0000 represent?

2) Suppose the fraction enhanced of a processor is 40% and the speedup of the enhancement is tenfold. What is the overall speedup? What if the fraction enhanced increases to 90%?

3) MIPS has one of the simpler instruction sets in existence. However, it is possible to imagine even simpler instruction sets. Consider a hypothetical machine called SIC, for Single Instruction Computer. As its name implies, SIC has only one instruction: subtract and branch if negative, or sbn for short. The sbn instruction has three operands, each consisting of the address of a word in memory:

    sbn a, b, c    # Mem[a] = Mem[a] - Mem[b]; if (Mem[a] < 0) go to c

The instruction subtracts the number in memory location b from the number in memory location a and places the result back in memory location a, overwriting the previous value. If the result is greater than or equal to 0, the computer takes its next instruction from the memory location just after the current instruction. If the result is less than 0, the next instruction is taken from memory location c. SIC has no registers and no instructions other than sbn.

Although it has only one instruction, SIC can imitate many of the operations of more complex instruction sets by using clever sequences of sbn instructions. For example, here is a program to copy a number from location a to location b:

    start: sbn temp, temp, .+1    # sets temp to zero
           sbn temp, a, .+1       # sets temp to -a
           sbn b, b, .+1          # sets b to zero
           sbn b, temp, .+1       # sets b to -temp, which is a

In the program above, the notation .+1 means "the address after this one," so each instruction in this program goes on to the next in sequence whether or not the result is negative. We assume temp to be the address of a spare memory word that can be used for temporary results.

Part a) Write a SIC program to add a and b, leaving the result in a and leaving b unmodified. Your SIC program should require only 3 instructions.

Part b) Write a SIC program to multiply a by b, putting the result in c. Assume that memory location "one" contains the number 1, that a and b are greater than 0, and that it is OK to modify a or b. (Hint: what does this program compute?)

    c = 0;
    while (b > 0) { b = b - 1; c = c + a; }

Your SIC program should require only 5 instructions.
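
Programs for Parts a and b can be checked by hand, but a tiny interpreter makes experimenting easier. Below is a minimal sketch in Python; the representation (a program as a list of triples, memory as a dict, and absolute instruction indices in place of the .+1 notation) is an illustrative assumption, not part of the problem.

    # Minimal SIC interpreter sketch: each instruction is a triple
    # (a, b, c) meaning Mem[a] = Mem[a] - Mem[b]; branch to
    # instruction index c if the result is negative, else fall through.
    def run_sic(program, memory, max_steps=10000):
        pc = 0
        steps = 0
        while 0 <= pc < len(program):
            a, b, c = program[pc]
            memory[a] -= memory[b]
            pc = c if memory[a] < 0 else pc + 1
            steps += 1
            if steps > max_steps:   # guard against runaway loops
                raise RuntimeError("step limit exceeded")
        return memory

    # The copy example above, with .+1 written as explicit indices:
    copy_a_to_b = [
        ("temp", "temp", 1),   # temp = 0
        ("temp", "a",    2),   # temp = -a
        ("b",    "b",    3),   # b = 0
        ("b",    "temp", 4),   # b = -temp = a
    ]
    mem = run_sic(copy_a_to_b, {"a": 7, "b": 99, "temp": 0})
    assert mem["b"] == 7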

4) This problem compares the memory efficiency of four different styles of instruction sets for two code sequences. The architecture styles are the following:

Accumulator: All operations occur in a single register called the accumulator (acc).

Memory-memory: All three operands of each instruction are in memory.

Stack: All operations occur on top of the stack. Only push and pop access memory; all other instructions remove their operands from the stack and replace them with the result. The implementation keeps the top two entries of the stack in the processor; accesses that use other stack positions are memory references.

Load-store: All operations occur in registers, and register-to-register instructions have three operands per instruction. There are 16 general-purpose registers, and register specifiers are 4 bits long. This is the architecture style of MIPS (slightly modified).

Consider the following C code:

    a = b + c;    // a, b, and c are variables in memory

This code would be translated as follows using an accumulator-style instruction set. Notice that all operations implicitly reference the acc register, and everything else is a memory reference:

    load B     # acc <- Memory[B]
    add C      # acc <- acc + Memory[C]
    store A    # Memory[A] <- acc

The same code would be translated as follows using a memory-memory instruction set. In this case there are no registers, just memory locations, so all references are to memory:

    add A, B, C    # Memory[A] <- Memory[B] + Memory[C]

The same code would be translated as follows using a stack instruction set. Here all operations act on the elements at the top of a stack; again there are no registers, and everything else is a memory reference:

    push C    # top += 4; stack[top] <- Memory[C]
    push B    # top += 4; stack[top] <- Memory[B]
    add       # stack[top-4] <- stack[top] + stack[top-4]; top -= 4
    pop A     # Memory[A] <- stack[top]; top -= 4

Finally, the load-store translation would be:

    load reg1, C            # reg1 <- Memory[C]
    load reg2, B            # reg2 <- Memory[B]
    add reg3, reg1, reg2    # reg3 <- reg1 + reg2
    store reg3, A           # Memory[A] <- reg3

These are equivalent assembly language code segments for the different styles of instruction sets. For a given code sequence, we can calculate the instruction bytes fetched and the memory data bytes transferred using the following assumptions about all four instruction sets:

- The opcode is always 1 byte (8 bits).
- All memory addresses are 2 bytes (16 bits).
- All memory data transfers are 4 bytes (32 bits).
- All instructions are an integral number of bytes in length.
- There are no optimizations to reduce memory traffic.

For example, a register load in the load-store ISA requires four instruction bytes (one for the opcode, one for the register destination, and two for a memory address) to be fetched from memory, along with four data bytes. A memory-memory add instruction requires seven instruction bytes (one for the opcode and two for each of the three memory addresses) to be fetched from memory and results in 12 data bytes being transferred (eight from memory to the processor and four from the processor back to memory). The following table summarizes this information for each of the architectural styles for the code above:

    Style           Instructions for a=b+c   Code bytes      Data bytes
    Accumulator     3                        3+3+3 = 9       4+4+4 = 12
    Memory-memory   1                        7               12
    Stack           4                        3+3+1+3 = 10    4+4+0+4 = 12
    Load-store      4                        4+4+3+4 = 15    4+4+0+4 = 12

For the following C code, write an equivalent assembly language program in each architectural style (assume all variables are initially in memory):

    a = b + c;
    b = a + c;
    d = a - b;

Assume that the stack instruction set has instructions such as push, pop, add, sub, negate (computes the unary minus of the element on top of the stack), and duplicate (makes a copy of the top of the stack), and that the accumulator instruction set has instructions such as load, store, add, sub, and negate. For each code sequence, calculate the instruction bytes fetched and the memory data bytes transferred (read or written). Write the code as efficiently as you can to minimize the memory traffic. Which architecture is most efficient as measured by code size? Which architecture is most efficient as measured by total memory bandwidth required (code + data)? If the answers are not the same, why are they different?
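
The byte accounting above can be mechanized. Here is a small Python sketch under the stated assumptions; the function names and the per-field encoding (rounding the register-specifier bits up to whole bytes) are illustrative assumptions, not a defined instruction format.

    import math

    def code_bytes(n_addrs=0, n_regs=0):
        # 1-byte opcode + 2 bytes per memory address + 4 bits per
        # register specifier, rounded up to whole bytes so every
        # instruction is an integral number of bytes long.
        return 1 + 2 * n_addrs + math.ceil(4 * n_regs / 8)

    def data_bytes(n_transfers):
        # every memory data transfer moves one 4-byte word
        return 4 * n_transfers

    # The worked examples from the text:
    assert code_bytes(n_addrs=1, n_regs=1) == 4   # load-store register load
    assert code_bytes(n_addrs=3) == 7             # memory-memory add
    assert data_bytes(3) == 12                    # two reads + one write
    assert code_bytes() == 1                      # stack add: opcode only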

5) Consider two different implementations, M1 and M2, of the same instruction set. There are three classes of instructions (A, B, and C) in the instruction set. M1 has a clock rate of 4 GHz, and M2 has a clock rate of 2 GHz. The average number of cycles for each instruction class on M1 and M2 is given in the following table, which also summarizes how three different compilers use the instruction set. C1 is a compiler produced by the makers of M1, C2 is a compiler produced by the makers of M2, and the other compiler is a third-party product.

    Class   CPI on M1   CPI on M2   C1 usage   C2 usage   Third-party usage
    A       4           2           30%        30%        50%
    B       6           4           50%        20%        30%
    C       8           3           20%        50%        20%

Assume that each compiler uses the same number of instructions for a given program but that the instruction mix is as described in the table. Using C1 on both M1 and M2, how much faster can the makers of M1 claim that M1 is compared with M2? Using C2 on both M2 and M1, how much faster can the makers of M2 claim that M2 is compared with M1? If you purchase M1, which compiler would you use? Which machine would you purchase if we assume that all other criteria are identical, including cost?

Solutions to Problem Set #2

1) FF28 0000 = 1 11111110 01010000000000000000000 (sign, exponent, fraction), so the value is -1.0101 (binary) x 2^(254-127) = -1.3125 x 2^127.

2) With 40% enhanced: Speedup = 1 / (1 - 0.4 + 0.4/10) = 1/0.64 = 1.56.
   With 90% enhanced: Speedup = 1 / (1 - 0.9 + 0.9/10) = 1/0.19 = 5.26.
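
Both results are easy to confirm numerically; here is a quick verification sketch in Python (the speedup helper is just Amdahl's law written out):

    import struct

    # Problem 1: reinterpret the bits FF28 0000 as an IEEE 754
    # single-precision value.
    value = struct.unpack(">f", bytes.fromhex("FF280000"))[0]
    assert value == -1.3125 * 2.0**127

    # Problem 2: Amdahl's law, speedup = 1 / ((1 - f) + f/s).
    def speedup(f, s):
        return 1.0 / ((1.0 - f) + f / s)

    print(round(speedup(0.40, 10), 2))   # 1.56
    print(round(speedup(0.90, 10), 2))   # 5.26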

3a)
    sbn temp, temp, .+1    # temp = 0
    sbn temp, b, .+1       # temp = -b
    sbn a, temp, .+1       # a = a - temp = a + b

3b)
    sbn temp, temp, .+1    # temp = 0
    sbn b, one, .+2        # b = b - 1; if b went negative, exit the loop
    sbn temp, a, .-1       # temp = temp - a; always negative, so loop back
    sbn c, c, .+1          # c = 0
    sbn c, temp, .+1       # c = c - temp = b*a

The second instruction decrements b and, once b goes negative, branches two instructions ahead, past the loop. The third instruction always produces a negative result (a > 0 and temp <= 0), so it always branches back one instruction to the decrement.
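
As a usage example, the multiply program can be run through the run_sic sketch given after Problem 3, with absolute instruction indices standing in for the relative .+1/.+2/.-1 targets (the test values are arbitrary):

    multiply = [
        ("temp", "temp", 1),   # temp = 0
        ("b",    "one",  3),   # b = b - 1; exit loop when b goes negative
        ("temp", "a",    1),   # temp = temp - a; always branches back
        ("c",    "c",    4),   # c = 0
        ("c",    "temp", 5),   # c = -temp = a * b
    ]
    mem = run_sic(multiply, {"a": 6, "b": 7, "c": 0, "temp": 0, "one": 1})
    assert mem["c"] == 42   # 6 * 7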

4) For the stack machine, assume that sub computes A - B, where B is on top of A on the stack.

    Accumulator             Code   Data
    load b                  3      4
    add c                   3      4
    store a                 3      4
    add c                   3      4
    store b                 3      4
    neg                     1      0
    add a                   3      4
    store d                 3      4
    Total                   22     28

Total memory traffic = 50 bytes.

    Memory-memory           Code   Data
    add a, b, c             7      12
    add b, a, c             7      12
    sub d, a, b             7      12
    Total                   21     36

Total memory traffic = 57 bytes.

    Stack                   Code   Data
    push b                  3      4
    push c                  3      4
    add                     1      0
    dup                     1      0
    pop a                   3      4
    push c                  3      4
    add                     1      0
    dup                     1      0
    pop b                   3      4
    neg                     1      0
    push a                  3      4
    add                     1      0
    pop d                   3      4
    Total                   27     28

Total memory traffic = 55 bytes.

    Load-store              Code   Data
    load regb, b            4      4
    load regc, c            4      4
    add rega, regb, regc    3      0
    store rega, a           4      4
    add regb, rega, regc    3      0
    store regb, b           4      4
    sub regd, rega, regb    3      0
    store regd, d           4      4
    Total                   29     20

Total memory traffic = 49 bytes.

The memory-memory architecture is the most efficient as measured by code size, but it has by far the worst memory traffic, since three 4-byte operands are transferred for every instruction. The load-store machine is the most efficient as measured by total memory bandwidth (code + data), since everything can be held in registers once fetched from memory; of course, it also has the largest code size, since it cannot combine memory operations with arithmetic. The answers differ because code size counts only instruction bytes, while total bandwidth also charges each architecture for its data traffic.

5) CPU time = IC * CPI / clock rate, where IC is the instruction count. Clock rate of M1 = 4 GHz = 4e9 Hz; clock rate of M2 = 2 GHz = 2e9 Hz. We are told that the instruction counts are the same for each compiler on each machine, so only the weighted-average CPI (computed from the table in Problem 5) and the clock rate matter:

    CPI(C1, M1) = 4(0.3) + 6(0.5) + 8(0.2) = 5.8
    CPI(C1, M2) = 2(0.3) + 4(0.5) + 3(0.2) = 3.2
    CPI(C2, M1) = 4(0.3) + 6(0.2) + 8(0.5) = 6.4
    CPI(C2, M2) = 2(0.3) + 4(0.2) + 3(0.5) = 2.9
    CPI(C3, M1) = 4(0.5) + 6(0.3) + 8(0.2) = 5.4
    CPI(C3, M2) = 2(0.5) + 4(0.3) + 3(0.2) = 2.8

Perf(C1, M1) / Perf(C1, M2) = Time(C1, M2) / Time(C1, M1) = (IC * 3.2 / 2e9) / (IC * 5.8 / 4e9) = (IC * 1.6 ns) / (IC * 1.45 ns) = 1.10, so the makers of M1 can claim that M1 is 1.10 times faster than M2 using compiler C1.

Perf(C2, M2) / Perf(C2, M1) = Time(C2, M1) / Time(C2, M2) = (IC * 6.4 / 4e9) / (IC * 2.9 / 2e9) = (IC * 1.6 ns) / (IC * 1.45 ns) = 1.10, so the makers of M2 can claim that M2 is 1.10 times faster than M1 using compiler C2.

Perf(C3, M1) / Perf(C3, M2) = Time(C3, M2) / Time(C3, M1) = (IC * 2.8 / 2e9) / (IC * 5.4 / 4e9) = (IC * 1.40 ns) / (IC * 1.35 ns) = 1.04, so M1 is about 1.04 times faster than M2 using C3.

If you own M1, you should use C3 (weighted CPI of 5.4 rather than 5.8 with C1). If you own M2, you should use C3 as well (2.8 rather than 2.9 with C2). This is poor news for the native compiler writers of these two machines! You should always use C3 as your compiler, and if you buy a machine with all other criteria identical, buy M1.
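
These ratios (and the compiler recommendation) can be reproduced with a short script. Here is a verification sketch in Python, where the machine and compiler names are simply dictionary keys:

    # CPI per instruction class, usage mix per compiler, clock rates in Hz.
    cpi  = {"M1": {"A": 4, "B": 6, "C": 8}, "M2": {"A": 2, "B": 4, "C": 3}}
    mix  = {"C1": {"A": 0.3, "B": 0.5, "C": 0.2},
            "C2": {"A": 0.3, "B": 0.2, "C": 0.5},
            "C3": {"A": 0.5, "B": 0.3, "C": 0.2}}
    rate = {"M1": 4e9, "M2": 2e9}

    def time_per_instruction(compiler, machine):
        # CPU time = IC * CPI / rate; with equal IC everywhere,
        # weighted CPI / clock rate is enough for comparisons.
        avg_cpi = sum(mix[compiler][k] * cpi[machine][k] for k in "ABC")
        return avg_cpi / rate[machine]

    for c in ("C1", "C2", "C3"):
        r = time_per_instruction(c, "M2") / time_per_instruction(c, "M1")
        print(f"{c}: M1 runs {r:.2f}x as fast as M2")
    # prints 1.10 for C1, 0.91 for C2 (i.e., M2 is 1.10x faster), 1.04 for C3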