612 CHAPTER 11 PROCESSOR FAMILIES (Corrisponde al cap. 12 - Famiglie di processori) PROBLEMS

Similar documents
Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

Computer Organization and Architecture

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

Computer Architecture Lecture 3: ISA Tradeoffs. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 1/18/2013

Instruction Set Architecture

PROBLEMS (Cap. 4 - Istruzioni macchina)

M A S S A C H U S E T T S I N S T I T U T E O F T E C H N O L O G Y DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Computer Architectures

THUMB Instruction Set

Central Processing Unit (CPU)

MACHINE INSTRUCTIONS AND PROGRAMS

PROBLEMS #20,R0,R1 #$3A,R2,R4

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING Question Bank Subject Name: EC Microprocessor & Microcontroller Year/Sem : II/IV

Intel 8086 architecture

LSN 2 Computer Processors

MICROPROCESSOR. Exclusive for IACE Students iacehyd.blogspot.in Ph: /422 Page 1

Instruction Set Design

IA-64 Application Developer s Architecture Guide

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University

Property of ISA vs. Uarch?

Introduction to MIPS Assembly Programming

CS412/CS413. Introduction to Compilers Tim Teitelbaum. Lecture 20: Stack Frames 7 March 08

Instruction Set Architecture

2) Write in detail the issues in the design of code generator.

Pentium vs. Power PC Computer Architecture and PCI Bus Interface

A single register, called the accumulator, stores the. operand before the operation, and stores the result. Add y # add y from memory to the acc

Faculty of Engineering Student Number:

How It All Works. Other M68000 Updates. Basic Control Signals. Basic Control Signals

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

CPU Organization and Assembly Language

Review: MIPS Addressing Modes/Instruction Formats

Instruction Set Architecture (ISA)

Lecture 7: Machine-Level Programming I: Basics Mohamed Zahran (aka Z)

MACHINE ARCHITECTURE & LANGUAGE

Chapter 7D The Java Virtual Machine

VLIW Processors. VLIW Processors

Processor Architectures

PART B QUESTIONS AND ANSWERS UNIT I

l C-Programming l A real computer language l Data Representation l Everything goes down to bits and bytes l Machine representation Language

EE361: Digital Computer Organization Course Syllabus

Outline. Lecture 3. Basics. Logical vs. physical memory physical memory. x86 byte ordering

Computer Organization and Components

An Overview of Stack Architecture and the PSC 1000 Microprocessor

High-speed image processing algorithms using MMX hardware

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

İSTANBUL AYDIN UNIVERSITY

Instruction Set Architecture (ISA) Design. Classification Categories

A Tiny Guide to Programming in 32-bit x86 Assembly Language

EE282 Computer Architecture and Organization Midterm Exam February 13, (Total Time = 120 minutes, Total Points = 100)

PROBLEMS. which was discussed in Section

The Java Virtual Machine and Mobile Devices. John Buford, Ph.D. Oct 2003 Presented to Gordon College CS 311

Basic Computer Organization

1 Classical Universal Computer 3

a storage location directly on the CPU, used for temporary storage of small amounts of data during processing.

8051 hardware summary

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

CS:APP Chapter 4 Computer Architecture Instruction Set Architecture. CS:APP2e

Chapter 4 Lecture 5 The Microarchitecture Level Integer JAVA Virtual Machine

A3 Computer Architecture

Data Cables. Schmitt TTL LABORATORY ELECTRONICS II

CHAPTER 7: The CPU and Memory

Memory Systems. Static Random Access Memory (SRAM) Cell

An Introduction to Assembly Programming with the ARM 32-bit Processor Family

COMPUTER HARDWARE. Input- Output and Communication Memory Systems

We r e going to play Final (exam) Jeopardy! "Answers:" "Questions:" - 1 -

CHAPTER 6 TASK MANAGEMENT

1 The Java Virtual Machine

Chapter 2 Topics. 2.1 Classification of Computers & Instructions 2.2 Classes of Instruction Sets 2.3 Informal Description of Simple RISC Computer, SRC

ARM Microprocessor and ARM-Based Microcontrollers

CSE 141 Introduction to Computer Architecture Summer Session I, Lecture 1 Introduction. Pramod V. Argade June 27, 2005

Chapter 11 I/O Management and Disk Scheduling

Topics. Introduction. Java History CS 146. Introduction to Programming and Algorithms Module 1. Module Objectives

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

MICROPROCESSOR AND MICROCOMPUTER BASICS

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune

Chapter 2 Heterogeneous Multicore Architecture

Machine Programming II: Instruc8ons

CHAPTER 4 MARIE: An Introduction to a Simple Computer

Exception and Interrupt Handling in ARM

CSC 2405: Computer Systems II

Stack machines The MIPS assembly language A simple source language Stack-machine implementation of the simple language Readings:

ARM Architecture. ARM history. Why ARM? ARM Ltd developed by Acorn computers. Computer Organization and Assembly Languages Yung-Yu Chuang


Operating Systems CSE 410, Spring File Management. Stephen Wagner Michigan State University

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Chapter 10: Virtual Memory. Lesson 03: Page tables and address translation process using page tables

Operating Systems. Virtual Memory

Computer Architecture TDTS10

Exceptions in MIPS. know the exception mechanism in MIPS be able to write a simple exception handler for a MIPS machine

Machine Architecture and Number Systems. Major Computer Components. Schematic Diagram of a Computer. The CPU. The Bus. Main Memory.

64-Bit NASM Notes. Invoking 64-Bit NASM

ARM. Architecture and Assembly. Modest Goal: Turn on an LED

Optimal Stack Slot Assignment in GCC

Reduced Instruction Set Computer (RISC)

Computer Organization and Assembly Language

Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x

picojava TM : A Hardware Implementation of the Java Virtual Machine

OPERATING SYSTEMS MEMORY MANAGEMENT

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit.

Transcription:

612 CHAPTER 11 PROCESSOR FAMILIES (Corrisponde al cap. 12 - Famiglie di processori) PROBLEMS 11.1 How is conditional execution of ARM instructions (see Part I of Chapter 3) related to predicated execution of IA-64 instructions? Comment on similarities and differences. 11.2 The 16-bit Thumb instruction subset of ARM instructions is intended for compact program encoding. Estimate the number of Thumb instructions needed to program the evaluation of the arithmetic expression shown in Figure 11.9. Assume that the Thumb subset has an appropriate divide instruction (which it does not) for the purposes of this question. How does the estimated Thumb instruction count compare to the HP3000 program instruction count in Figure 11.9a? 11.3 Discuss the similarities and differences between the Motorola 680X0 family and the Intel 80X86 family of processors, up to the 68040 and the 80486 versions. 11.4 The 68030 microprocessor has a 256-byte instruction cache and a 256-byte data cache. Is this better than having a 512-byte instruction cache and no data cache? What are the advantages and disadvantages of these two alternatives? Answer the same two questions for a unified 512-byte instruction and data cache. 11.5 Intel IA-32 processors have special instructions for dedicated I/O operations, as described in Part III of Chapter 3. Motorola 680X0 processors use only memory-mapped I/O. What are the advantages and disadvantages of memory-mapped I/O compared to dedicated I/O? 11.6 Discuss the relative merits of addressing modes in the Motorola 680X0 and Intel 80X86 processors. In particular, discuss how the addressing modes in each processor facilitate program relocation, implementation of a stack, accessing an operand list, and manipulating character strings.

PROBLEMS 613 11.7 Section 11.3.1 explains that an IA-32 processor can view the memory as being organized in four different ways, depending on how segmentation and paging are used. Give some examples of situations in which each of the four possibilities is beneficial. 11.8 Write an ARM, 68000, or IA-32 program to evaluate the arithmetic expression in Figure 11.9. How does your program compare to the one in the figure with respect to the number of machine instructions required? Assume that the standard ARM instruction set has an appropriate divide instruction (which it does not) for the purposes of this question. 11.9 In Alpha processors, only 32-bit and 64-bit aligned loads and stores are directly handled in the datapath between the cache and the processor. Sketch the combinational logic network that would be required in a 32-bit-wide datapath to permit loading any one of the four bytes of a 32-bit quantity into the low-order byte position at the destination side of the path. 11.10 Compare the handling of the register stack in the IA-64 processors to that of the processor (memory) stack in Chapter 2. In particular, what are the counterparts of the Chapter 2 stack pointer, SP, and frame pointer, FP, in the IA-64 scheme? 11.11 Give a general description of the hardware needed to support execution of the Alloc X,Y instruction used for managing the IA-64 register stack. Assume that small registers and adders are available. How are they used? 11.12 The Alpha 21264 processor has a much different arrangement of caches than the 21164. Why is the arrangement in the 21264 better? That is, under what circumstances do programs execute more quickly on the 21264, based only on the effects of cacheing? Include more than an observation on hit rates in your answer. 11.13 Show how the expression [ w = a (b c) + (d e) + f g ] h i can be evaluated in an HP3000 computer. 11.14 In an HP3000 computer, Procedure i generates eight words of data, DI 1,...,DI 8, which are stored in the stack. After these words are placed in the stack, but before the completion of Procedure i, a new procedure, Procedure j, is called. It generates 10 words of data, DJ 1,...,DJ 10, which are also stored in the stack. Then another procedure, Procedure k, is called, which places three words of data in the stack. Show the contents of the top words of the stack at this time. 11.15 Show how the expression w = (a + b)(c + d) + (d e) can best be evaluated by the HP3000, ARM, Motorola 68000, and IA-32 computers. The values of variables w, a, b, c, d, and e are stored in memory locations. The following assumptions are made. The addresses do not reference successive locations. Direct memory addressing in the DB+ relative mode is used in the HP3000. Absolute/Direct

614 CHAPTER 11 PROCESSOR FAMILIES memory addressing is used in the 68000 and IA-32 computers, and Relative addressing is used in the ARM computer. All products are single length. 11.16 What is the largest number of stack locations occupied during execution of the program in Figure 11.9? 11.17 Repeat Problem 11.16 for the HP3000 programs in Problems 11.13 and 11.15.

SOLUTIONS - Chapter 11 Processor Families 11.1. The main ideas of conditional execution of ARM instructions (see Sections 3.1.2 and B.1) and conditional execution of IA-64 instructions, called predication (see Section 11.7.2), are very similar. The differences occur in the way that the conditions are set and stored in the processor, and in the way that they are referenced by the conditionally executed instructions. In ARM processors, the state is stored in four conventional condition code ags N, Z, C, and V (see Section 3.1.1). These ags are optionally set by the results of instruction execution. The particular condition, which may be a function of more than one ag, is named in the condition eld of each ARM instruction (see Figure B.1 and Table B.1). In the IA-64 architecture, there are no conventional condition code ags. Instead, the result (true or false) of executing a Compare<condition> instruction is stored in one of 64 one-bit predicate registers, as described in Section 11.7.2. Each instruction can name one of these bits in its 6-bit predicate eld; and the instruction is executed only if the bit is 1 (true). 1

11.2. Assume that Thumb arithmetic instructions have a 2-operand format, expressed in assembly language as OP Rdst,Rsrc as discussed in Section 11.1.1 Also assume that a signed integer Divide instruction (DIV) is available in the Thumb instruction set with the assembly language format DIV Rdst,Rsrc This instruction performs the operation [Rdst]/[Rsrc]. It stores the quotient in Rdst and stores the remainder in Rsrc. Under these assumptions, a possible Thumb program would be: LDR R0,G LDR R1,H R0,R1 Leaves g + h in R0. LDR R1,E LDR R2,F MUL R1,R2 Leaves e f in R1. DIV R1,R0 Leaves (e f)/(g + h) in R1. LDR R0,C LDR R2,D DIV R0,R2 Leaves c/d in R0. R0,R1 Leaves denominator in R0. LDR R1,A LDR R2,B R1,R2 Leaves a + b in R1. DIV R1,R0 Leaves result in R1. STR R1,W Stores result in w. This program requires 16 instructions as compared to 13 instruction words (some combined instructions) in the HP3000. 2

11.3. The following table shows some of the important areas for similarity/difference comparisons. MOTOROLA 680X0 INTEL 80X86 8 Data registers and 8 Address 8 General registers (including registers (including a a processor stack register) processor stack register) CISC instruction set with CISC instruction set with e xible addressing modes e xible addressing modes Large instruction set with multiple-register load/store instructions Memory-mapped I/O only Flat address space Big-endian addressing Large instruction set with multiple-register push/pop instructions Separate I/O space as well as memory-mapped I/O Segmented address space Little-endian addressing There is roughly comparable capability and performance between pairs from these two families; that is 68000 vs. 8086, 68020 vs. 80286, 68030 vs. 80386, and 68040 vs. 80486. The cache and pipelining aspects for the high end of each family are summarized in Sections 11.2.2 and 11.3.3. 11.4. An instruction cache is simpler to implement, because its entries do not have to be written back to the main memory. A data cache must have a provision for writing any changed entries back to the memory, before they are overwritten by new entries. From a performance standpoint, a single larger instruction cache would be advantageous only if the frequency of memory data accesses were very low. A uni ed cache has the potential performance advantage that the proportions of instructions and data vary automatically as a program is executed. However, if separate instruction and data caches are used, they can be accessed in parallel in a pipelined machine; and this is the major performance advantage. 11.5. Memory-mapped I/O requires no specialized support in terms of either instructions or bus signals. A separate I/O space allows simpler I/O interfaces and potentially faster operation. Processors such as those in the IA-32 family, that have a separate I/O space, can also use memory-mapped I/O. 3

11.6. MOTOROLA - The Autoincrement and Autodecrement modes facilitate stack implementation and accessing successive items in a list. Signi cant e xibility in accessing structured lists and arrays of addresses and data of different sizes is provided by the displacement, offset, and scale factor features, coupled with indirection. INTEL - Relocatability in the physical address space is facilitated by the way in which base, index and displacement features are used in generating virtual addresses. As in the Motorola processors, these multiple-component address features enable e xible access to address lists and data structures. In both families of processors, byte-addressability enables handling of character strings, and the Intel IA-32 String instructions (see Sections 3.21.3 and D.4.1) facilitate movement and processing of byte and doubleword data blocks. The Motorola MOVEM and MOVEP instructions perform similar operations. 11.7. Flat address space Simplest con guration from the standpoint of a single user program and its compilation. One or more variable-length segments Ef cient allocation of available memory space to variable-length user or operating system programs. Paged memory Facilitates automated memory management between the randomaccess main memory and a sector-organized disk secondary memory (see Chapters 5 and 10). Access privileges can be controlled on a page-by-page basis to ensure protection among users, and between users and the operating system when shared data are involved. Segmentation and paging Most e xible arrangement for managing multiple user and system address spaces, including protection mechanisms. The virtual address space can be signi cantly larger than the physical main memory space. 4

11.8. ARM program: Assume that a signed integer Divide instruction is available in the ARM instruction set, and that it has the same format as the Multiply (MUL) instruction (see Figure B.4). The assembly language expression for the Divide (DIV) instruction is DIV Rd,Rm,Rs and it performs the operation [Rm]/[Rs], loading the quotient into Rm and the remainder into Rd. LDR R0,C LDR R1,D DIV R2,R0,R1 Leaves c/d in R0. LDR R1,G LDR R2,H R1,R1,R2 Leaves g + h in R1. LDR R2,F DIV R3,R2,R1 Leaves f/(g + h) in R2. LDR R3,E MLA R1,R2,R3,R0 Leaves denominator in R1. LDR R0,A LDR R2,B R0,R0,R2 Leaves a + b in R0. DIV R2,R0,R1 Leaves result in R0. STR R0,W Stores result in w. This program requires 15 instructions as compared to 13 instruction words (some combined instructions) in the HP3000. 5

68000 program (assume 16-bit operands): MOVE G,D0 H,D0 Leaves g + h in D0. MOVE E,D1 MULS F,D1 Leaves e f in D1. DIVS D0,D1 Leaves (e f)/(g + h) in D1. MOVE C,D0 EXT.L D0 See Note below. DIVS D,D0 Leaves c/d in D0. D1,D0 Leaves denominator in D0. MOVE A,D1 B,D1 EXT.L D1 See Note below. DIVS D0,D1 Leaves result in D1. MOVE D1,W Stores result in w. Note: The EXT.L instruction sign-extends the 16-bit dividend in the destination register to 32 bits, a requirement of the Divide instruction. This program contains 14 instructions, as compared to 13 instruction words (some combined instructions) in the HP3000. IA-32 program: MOV EBX,G EBX,H Leaves g + h in EBX. MOV EAX,E IMUL EAX,F Leaves e f in EAX. CDQ See Note below. IDIV EBX MOV EBX,EAX Leaves (e f)/(g + h) in EBX. MOVE EAX,C CDQ See Note below. IDIV D Leaves c/d in EAX. EBX,EAX Leaves denominator in EBX. MOVE EAX,A EAX,B Leaves a + b in EAX. CDQ See Note below. IDIV EBX Leaves result in EAX. MOV W,EAX Stores result in w. Note: The CDQ instruction sign-extends EAX into EDX (see Section 3.23.1), a requirement of the Divide instruction. This program contains 16 instructions, as compared to 13 instruction words (some combined instructions) in the HP3000. 6

11.9. A 4-way multiplexer is required, as shown in the following gure. 32-bit datapath in 8 8 8 8 4-way multiplexer MUX low-order byte datapath out 11.10. There are no direct counterparts of the memory stack pointer registers SP and FP in the IA-64 architecture. The register remapping hardware in IA-64 processors allows the main program and any sequence of nested subroutines to all use logical register addresses R32 and upward for their own local variables, with the rst part of that register space containing parameters passed from the calling routine. An example of this is shown in Figure 11.4. If the 92 registers of the stacked physical register space are used up by register allocations for a sequence of nested subroutine calls, then some of those physical registers must be spilled into memory to create physical register space for any additional nested subroutines. The memory pointer register used by the processor for that memory area could be considered as a counterpart of SP; but it is not actually used as a TOS pointer by the current routine. In fact, it is not visible to user programs. 7

11.11. Consider the example of a main program calling a subroutine, as shown in Figure 11.4. The physical register addresses of registers used by the main program are the same as the logical register addresses used in the main program instructions. However, the logical register addresses above 31 used by instructions in the subroutine must have 8 added to them to generate the correct physical register addresses. The value 8 is the rst operand in the Alloc 8,4 instruction executed by the main program. When that instruction is executed, the value 8 is stored in a processor state register associated with the main program. After the subroutine is entered, all logical register addresses above 31 issued by its instructions must be added, in a small adder, to the value (8) in that register. The output of this adder is the physical register address to be used while in the subroutine. The operand 7 in the Alloc 7,3 instruction executed by the subroutine is stored in a second processor state register associated with the subroutine. The output of that register is added in a second adder to the output of the rst adder. After the subroutine calls a second subroutine, logical register addresses above 31 issued by the second subroutine are sent into the rst adder. The output of the second adder (logical address + 8 + 7) is the physical register address used while in the second subroutine. More register/adder pairs are cascaded onto this structure as more subroutines are called. Note that logical register addresses above 31 are always applied to the rst adder; and the output of the nth adder is the physical register address to be used in the nth subroutine. All registers and adders are only 7 bits wide because the largest physical register address that needs to be generated is 127. 8

11.12. Considering cacheing effects only, the average access time over both instruction and data accesses is a function of both cache hit rates and miss penalties (see Sections 5.6.2 and 5.6.3 for general expressions for average access time). The hit rates in the 21264 L1 caches will be much higher than in the 21164 L1 caches because the 21264 caches are eight times larger. Therefore, the average access time for accesses that can be made on-chip will be larger in the 21164 because of the miss penalty in going to its on-chip L2 cache. Next, we need to consider the effect on average access time of going to the offchip caches in each system. The total on-chip cache capacity (112K bytes in the 21164 and 128K bytes in the 21264) is about the same in both the systems. Therefore, we can assume about the same hit rate for on-chip accesses; so the effect on average access time of the miss penalties in going to the off-chip caches will be about the same in each system. Finally, if the off-chip caches have about the same capacity, the effect on average access times of the miss penalties in going to the main DRAM memories will be about the same in each sytem. The net result is that average access times in the 21264 should be shorter than in the 21164, leading to faster program execution, primarily because of the different arrangements of the on-chip caches. 11.13. HP3000 program: MPYM MPYM MPYM MPYM DIV DEL MPY STOR A B C D E F G H I W Combined with previous instruction. Combined with previous instruction. 9

11.14. Procedure i generates 8 words of data, Procedure j generates 10 words of data, and Procedure k generates 3 words of data. Then, the top words in the stack have the following contents: [Indexreg.] i Return address i [SR] i Q i DI 1 DI 8 [Indexreg.] j 12 Return address j [SR] j 12 DJ 1 DJ 10 [Indexreg.] k 14 Return address k [SR] k 14 DK 1 DK 2 DK 3 TOS 10

11.15. HP3000 program: M M MPY MPYM STOR A B C D D E W ARM program: LDR LDR LDR LDR LDR MUL MLA STR R0,A R1,B R0,R0,R1 R1,C R2,D R1,R1,R2 R3,E R2,R2,R3 R0,R0,R1,R2 R0,W 68000 program (assume 16-bit operands): MOVE MOVE MULS MOVE MULS MOVE A,D0 B,D0 C,D1 D,D1 D1,D0 D,D1 E,D1 D1,D0 D0,W 11

IA-32 program: MOV MOV IMUL MOV IMUL MOV EAX,A EAX,B EBX,C EBX,D EAX,EBX EBX,D EBX,E EAX,EBX W,EAX 11.16. Four 11.17. Four and two 12