# PROBLEMS #20,R0,R1 #\$3A,R2,R4

Size: px
Start display at page:

## Transcription

1 506 CHAPTER 8 PIPELINING (Corrisponde al cap Introduzione al pipelining) PROBLEMS 8.1 Consider the following sequence of instructions Mul And #20,R0,R1 #3,R2,R3 #\$3A,R2,R4 R0,R2,R5 In all instructions, the destination operand is given last. Initially, registers R0 and R2 contain 2000 and 50, respectively. These instructions are executed in a computer that has a four-stage pipeline similar to that shown in Figure 8.2. Assume that the first instruction is fetched in clock cycle 1, and that instruction fetch requires only one clock cycle. (a) Draw a diagram similar to Figure 8.2a. Describe the operation being performed by each pipeline stage during each of clock cycles 1 through 4. (b) Give the contents of the interstage buffers, B1, B2, and B3, during clock cycles 2 to 5.

2 PROBLEMS Repeat Problem 8.1 for the following program: Mul And #20,R0,R1 #3,R2,R3 #\$3A,R1,R4 R0,R2,R5 8.3 I 2 in Figure 8.6 is delayed because it depends on the results of I 1.By occupying the Decode stage, instruction I 2 blocks I 3, which, in turn, blocks I 4. Assuming that I 3 and I 4 do not depend on either I 1 or I 2 and that the register file allows two Write steps to proceed in parallel, how would you use additional storage buffers to make it possible for I 3 and I 4 to proceed earlier than in Figure 8.6? Redraw the figure, showing the new order of steps. 8.4 The delay bubble in Figure 8.6 arises because instruction I 2 is delayed in the Decode stage. As a result, instructions I 3 and I 4 are delayed even if they do not depend on either I 1 or I 2. Assume that the Decode stage allows two Decode steps to proceed in parallel. Show that the delay bubble can be completely eliminated if the register file also allows two Write steps to proceed in parallel. 8.5 Figure 8.4 shows an instruction being delayed as a result of a cache miss. Redraw this figure for the hardware organization of Figure Assume that the instruction queue can hold up to four instructions and that the instruction fetch unit reads two instructions at a time from the cache. 8.6 A program loop ends with a conditional branch to the beginning of the loop. How would you implement this loop on a pipelined computer that uses delayed branching with one delay slot? Under what conditions would you be able to put a useful instruction in the delay slot? 8.7 The branch instruction of the UltraSPARC II processor has an Annul bit. When set by the compiler, the instruction in the delay slot is discarded if the branch is not taken. An alternative choice is to have the instruction discarded if the branch is taken. When is each of these choices advantageous? 8.8 A computer has one delay slot. The instruction in this slot is always executed, but only on a speculative basis. If a branch does not take place, the results of that instruction are discarded. Suggest a way to implement program loops efficiently on this computer. 8.9 Rewrite the sort routine shown in Figure 2.34 for the SPARC processor. Recall that the SPARC architecture has one delay slot with an associated Annul bit and uses branch prediction. Attempt to fill the delay slots with useful instructions wherever possible Consider a statement of the form IF A>B THEN action 1 ELSE action 2 Write a sequence of assembly language instructions, first using branch instructions only, then using conditional instructions such as those available on the ARM processor.

3 508 CHAPTER 8 PIPELINING Assume a simple two-stage pipeline, and draw a diagram similar to that in Figure 8.8 to compare execution times for the two approaches The feed-forward path in Figure 8.7 (blue lines) allows the content of the RSLT register to be used directly in an ALU operation. The result of that operation is stored back in the RSLT register, replacing its previous contents. What type of register is needed to make such an operation possible? Consider the two instructions I 1 : R1,R2,R3 I 2 : Shift left R3 Assume that before instruction I 1 is executed, R1, R2, R3, and RSLT contain the values 30, 100, 45, and 198, respectively. Draw a timing diagram for a 4-stage pipeline, showing the clock signal and the contents of the RSLT register during each cycle. Use your diagram to show that correct results will be obtained during the forwarding operation Write the program in Figure 2.37 for a processor in which only load and store instructions access memory. Identify all dependencies in the program and show how you would optimize it for execution on a pipelined processor Assume that 20 percent of the dynamic count of the instructions executed on a computer are branch instructions. Delayed branching is used, with one delay slot. Estimate the gain in performance if the compiler is able to use 85 percent of the delay slots A pipelined processor has two branch delay slots. An optimizing compiler can fill one of these slots 85 percent of the time and can fill the second slot only 20 percent of the time. What is the percentage improvement in performance achieved by this optimization, assuming that 20 percent of the instructions executed are branch instructions? 8.15 A pipelined processor uses the delayed branch technique. You are asked to recommend one of two possibilities for the design of this processor. In the first possibility, the processor has a 4-stage pipeline and one delay slot, and in the second possibility, it has a 6-stage pipeline with two delay slots. Compare the performance of these two alternatives, taking only the branch penalty into account. Assume that 20 percent of the instructions are branch instructions and that an optimizing compiler has an 80 percent success rate in filling the single delay slot. For the second alternative, the compiler is able to fill the second slot 25 percent of the time Consider a processor that uses the branch prediction mechanism represented in Figure 8.15b. The initial state is either LT or LNT, depending on information provided in the branch instruction. Discuss how the compiler should handle the branch instructions used to control do while and do until loops, and discuss the suitability of the branch prediction mechanism in each case Assume that the instruction queue in Figure 8.10 can hold up to six instructions. Redraw Figure 8.11 assuming that the queue is full in clock cycle 1 and that the fetch unit can read up to two instructions at a time from the cache. When will the queue become full again after instruction I k is fetched?

4 Chapter 8 Pipelining 8.1. ( ) The operation performed in each step and the operands involved are as given in the figure below. Clock cycle I 1 : 20, 2000 R I 2 : Mul 3, 50 Mul R3 150 I 3 : And \$3A, 50 And R4 50 I 4 : 2000, 50 R ( ) Clock cycle Buffer B1 Buffer B2 Buffer B3 instruction (I ) Information from a previous instruction Information from a previous instruction Mul instruction (I ) Decoded I Source operands: 20, 2000 Information from a previous instruction And instruction (I ) Decoded I Source operands: 3, 50 Result of I : 2020 Destination R1 instruction (I ) Decoded I Source operands: \$3A, 50 Result of I : 150 Destination R3 1

5 8.2. ( ) Clock cycle , 2000 R Mul 3, 50 Mul R3 150 And \$3A,? \$3A, 2020 And R , 50 R ( ) Cycles 2 to 4 are the same as in P8.1, but contents of R1 are not available until cycle 5. In cycle 5, B1 and B2 have the same contents as in cycle 4. B3 contains the result of the multiply instruction Step D may be abandoned, to be repeated in cycle 5, as shown below. But, instruction I must remain in buffer B1. For I to proceed, buffer B1 must be capable of holding two instructions. The decode step for I has to be delayed as shown, assuming that only one instruction can be decoded at a time. Clock cycle I 1 (Mul) F 1 D 1 E 1 W 1 I 2 () F 2 D 2 D 2 E 2 W 2 I 3 I 4 F 3 D 3 E 3 W 3 F 4 D 4 E 4 W 4 2

6 8.4. If all decode and execute stages can handle two instructions at a time, only instruction I is delayed, as shown below. In this case, all buffers must be capable of holding information for two instructions. Note that completing instruction I before I could cause problems. See Section Clock cycle I 1 (Mul) F 1 D 1 E 1 W 1 I 2 () F 2 D 2 E 2 W 2 I 3 F 3 D 3 E 3 W 3 I 4 F 4 D 4 E 4 W Execution proceeds as follows. Clock cycle I 1 F 1 D 1 E 1 W 1 I 2 F 2 D 2 E 2 W 2 I 3 F 3 D 3 E 3 W 3 I 4 F 4 D 4 E 4 W The instruction immediately preceding the branch should be placed after the branch. LOOP 1 LOOP 1 Conditional Branch LOOP Conditional Branch LOOP This reorganization is possible only if the branch instruction does not depend on instruction. 3

7 8.7. The UltraSPARC arrangement is advantageous when the branch instruction is at the end of the loop and it is possible to move one instruction from the body of the loop into the delay slot. The alternative arrangement is advantageous when the branch instruction is at the beginning of the loop The instruction executed on a speculative basis should be one that is likely to be the correct choice most often. Thus, the conditional branch should be placed at the end of the loop, with an instruction from the body of the loop moved to the delay slot if possible. Alternatively, a copy of the first instruction in the loop body can be placed in the delay slot and the branch address changed to that of the second instruction in the loop The first branch (BLE) has to be followed by a NOP instruction in the delay slot, because none of the instructions around it can be moved. The inner and outer loop controls can be adjusted as shown below. The first instruction in the outer loop is duplicated in the delay slot following BLE. It will be executed one more time than in the original program, changing the value left in R3. However, this should cause no difficulty provided the contents of R3 are not needed once the sort is completed. The modified program is as follows: ADD R0,LIST,R3 ADD R0,N,R1 SUB R1,1,R1 SUB R1,1,R2 OUTER LDUB [R3+R1],R5 Get LIST(j) LDUB [R3+R2],R6 Get LIST(k) INNER SUB R6,R5,R0 BLE,pt NEXT SUB R2,1,R2 k k 1 STUB R5,[R3+R2] STUB R6,[R3+R1] OR R0,R6,R5 NEXT BGE,pt,a INNER LDUB [R3+R2],R6 Get LIST(k) SUB R1,1,R1 BGT,pt OUTER SUB R1,1,R2 4

8 8.10. Without conditional instructions: Compare A,B Branch 0 Action1 Check A B Action One or more instructions Branch Next Action One or more instructions Next... If conditional instructions are available, we can use: Compare A,B Check A B Action1 instruction(s), conditional Action2 instruction(s), conditional Next... In the second case, all Action 1 and Action 2 instructions must be fetched and decoded to determine whether they are to be executed. Hence, this approach is beneficial only if each action consists of one or two instructions. Without conditional instructions Clock cycle Compare A,B F 1 E 1 Action2 Branch>0 Action1 F 2 E 2 F 3 E 3 Branch Next F 4 E 4 Action1 Next F 6 E 1 With conditional instructions Compare A,B F 1 E 1 If >0 then action1 F 2 E 2 NEXT If 0 then action2 F 3 E 3 F 4 E 4 5

10 LOOP Load RNEXT,4(RCURRENT) Load RTEMP2,(RNEWREC) Test RNEXT Load RTEMP1,(RNEXT) Branch=0 TAIL Compare RTEMP1,RTEMP2 Branch 0 INSERT Move RNEXT,RCURRENT Branch LOOP INSERT Store RNEXT,4(RNEWREC) TAIL Store RNEWREC,4(RCURRENT) Return Note that we have assumed that the Load instruction does not affect the condition code flags Because of branch instructions, 120 clock cycles are needed to execute 100 program instructions when delay slots are not used. Using the delay slots will eliminate 0.85 of the idle cycles. Thus, the improvement is given by: That is, instruction throughput will increase by 8.1% Number of cycles needed to execute 100 instructions: Without optimization 140 With optimization (!"#\$ #%\$ ) 127 Thus, throughput improvement is!'&(*)+,, or 10.2% Throughput improvement due to pipelining is, where is the number of pipeline stages. Number of cycles needed to execute one instruction: Throughput 4-stage: \$ -,! 4/ stage:!#.# \$ - / 6/ Thus, the 6-stage pipeline leads to higher performance. 7

11 8.16. For a do while loop, the termination condition is tested at the beginning of the loop. A conditional branch at that location will be taken when exiting the loop. Hence, it should be predicted not taken. That is, the state machine should be started in the state LNT, unless the loop is not likely to be executed at all. A do until loop is executed at least once, and the branch condition is tested at the end of the loop. Assuming that the loop is likely to be executed several times, the branch should be predicted taken. That is, the state machine should be started in state LT An instruction fetched in cycle 0 reaches the head of the queue and enters the decode stage in cycle Assume that the instruction preceding I is decoded and instruction I5 is fetched in cycle 1. This leads to instructions I to I5 being in the queue at the beginning of cycle 2. Execution would then proceed as shown below. Note that the queue is always full, because at most one instruction is dispatched and up to two instructions are fetched in any given cycle. Under these conditions, the queue length would drop below 6 only in the case of a cache miss. Clock cycle Time 10 Queue length I 1 I 2 D 1 E 1 E 1 E 1 W 1 D 2 E 2 W 2 I 3 D 3 E 3 W 3 I 4 D 4 E 4 W 4 I 5 (Branch) D 5 I 6 F 6 X I k F k D k E k W k I k+1 F k+1 D k+1 E k+1 8

### PIPELINING CHAPTER OBJECTIVES

CHAPTER 8 PIPELINING CHAPTER OBJECTIVES In this chapter you will learn about: Pipelining as a means for executing machine instructions concurrently Various hazards that cause performance degradation in

### Computer Organization and Architecture

Computer Organization and Architecture Chapter 14 Instruction Level Parallelism and Superscalar Processors What does Superscalar mean? Common instructions (arithmetic, load/store, conditional branch) can

### Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

Multiple-Issue Processors Pipelining can achieve CPI close to 1 Mechanisms for handling hazards Static or dynamic scheduling Static or dynamic branch handling Increase in transistor counts (Moore s Law):

### PROBLEMS. which was discussed in Section 1.6.3.

22 CHAPTER 1 BASIC STRUCTURE OF COMPUTERS (Corrisponde al cap. 1 - Introduzione al calcolatore) PROBLEMS 1.1 List the steps needed to execute the machine instruction LOCA,R0 in terms of transfers between

### UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering. EEC180B Lab 7: MISP Processor Design Spring 1995

UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering EEC180B Lab 7: MISP Processor Design Spring 1995 Objective: In this lab, you will complete the design of the MISP processor,

### Lecture: Pipelining Extensions. Topics: control hazards, multi-cycle instructions, pipelining equations

Lecture: Pipelining Extensions Topics: control hazards, multi-cycle instructions, pipelining equations 1 Problem 6 Show the instruction occupying each stage in each cycle (with bypassing) if I1 is R1+R2

### Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

CS 4 Introduction to Compilers ndrew Myers Cornell University dministration Prelim tomorrow evening No class Wednesday P due in days Optional reading: Muchnick 7 Lecture : Instruction scheduling pr 0 Modern

### Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

Lecture Handout Computer Architecture Lecture No. 2 Reading Material Vincent P. Heuring&Harry F. Jordan Chapter 2,Chapter3 Computer Systems Design and Architecture 2.1, 2.2, 3.2 Summary 1) A taxonomy of

### CS352H: Computer Systems Architecture

CS352H: Computer Systems Architecture Topic 9: MIPS Pipeline - Hazards October 1, 2009 University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell Data Hazards in ALU Instructions

### Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x

Software Pipelining for (i=1, i

More on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction

### VLIW Processors. VLIW Processors

1 VLIW Processors VLIW ( very long instruction word ) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW

### Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1

Pipeline Hazards Structure hazard Data hazard Pipeline hazard: the major hurdle A hazard is a condition that prevents an instruction in the pipe from executing its next scheduled pipe stage Taxonomy of

### INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

Course on: Advanced Computer Architectures INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Prof. Cristina Silvano Politecnico di Milano cristina.silvano@polimi.it Prof. Silvano, Politecnico di Milano

### WAR: Write After Read

WAR: Write After Read write-after-read (WAR) = artificial (name) dependence add R1, R2, R3 sub R2, R4, R1 or R1, R6, R3 problem: add could use wrong value for R2 can t happen in vanilla pipeline (reads

### EE282 Computer Architecture and Organization Midterm Exam February 13, 2001. (Total Time = 120 minutes, Total Points = 100)

EE282 Computer Architecture and Organization Midterm Exam February 13, 2001 (Total Time = 120 minutes, Total Points = 100) Name: (please print) Wolfe - Solution In recognition of and in the spirit of the

### CS4617 Computer Architecture

1/39 CS4617 Computer Architecture Lecture 20: Pipelining Reference: Appendix C, Hennessy & Patterson Reference: Hamacher et al. Dr J Vaughan November 2013 2/39 5-stage pipeline Clock cycle Instr No 1 2

### Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

Pipelining HW Q. Can a MIPS SW instruction executing in a simple 5-stage pipelined implementation have a data dependency hazard of any type resulting in a nop bubble? If so, show an example; if not, prove

### Computer Organization and Components

Computer Organization and Components IS5, fall 25 Lecture : Pipelined Processors ssociate Professor, KTH Royal Institute of Technology ssistant Research ngineer, University of California, Berkeley Slides

### Solutions for the Sample of Midterm Test

s for the Sample of Midterm Test 1 Section: Simple pipeline for integer operations For all following questions we assume that: a) Pipeline contains 5 stages: IF, ID, EX, M and W; b) Each stage requires

### Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Overview CISC Developments Over Twenty Years Classic CISC design: Digital VAX VAXÕs RISC successor: PRISM/Alpha IntelÕs ubiquitous 80x86 architecture Ð 8086 through the Pentium Pro (P6) RJS 2/3/97 Philosophy

### Branch Prediction Techniques

Course on: Advanced Computer Architectures Branch Prediction Techniques Prof. Cristina Silvano Politecnico di Milano email: cristina.silvano@polimi.it Outline Dealing with Branches in the Processor Pipeline

### William Stallings Computer Organization and Architecture

William Stallings Computer Organization and Architecture Chapter 12 CPU Structure and Function Rev. 3.3 (2009-10) by Enrico Nardelli 12-1 CPU Functions CPU must: Fetch instructions Decode instructions

### Pipelining Review and Its Limitations

Pipelining Review and Its Limitations Yuri Baida yuri.baida@gmail.com yuriy.v.baida@intel.com October 16, 2010 Moscow Institute of Physics and Technology Agenda Review Instruction set architecture Basic

### Advanced Pipelining Techniques

Advanced Pipelining Techniques 1. Dynamic Scheduling 2. Loop Unrolling 3. Software Pipelining 4. Dynamic Branch Prediction Units 5. Register Renaming 6. Superscalar Processors 7. VLIW (Very Large Instruction

### Data Dependences. A data dependence occurs whenever one instruction needs a value produced by another.

Data Hazards 1 Hazards: Key Points Hazards cause imperfect pipelining They prevent us from achieving CPI = 1 They are generally causes by counter flow data pennces in the pipeline Three kinds Structural

### PROBLEMS (Cap. 4 - Istruzioni macchina)

98 CHAPTER 2 MACHINE INSTRUCTIONS AND PROGRAMS PROBLEMS (Cap. 4 - Istruzioni macchina) 2.1 Represent the decimal values 5, 2, 14, 10, 26, 19, 51, and 43, as signed, 7-bit numbers in the following binary

### HW 5 Solutions. Manoj Mardithaya

HW 5 Solutions Manoj Mardithaya Question 1: Processor Performance The critical path latencies for the 7 major blocks in a simple processor are given below. IMem Add Mux ALU Regs DMem Control a 400ps 100ps

### CISC 360. Computer Architecture. Seth Morecraft Course Web Site:

CISC 360 Computer Architecture Seth Morecraft (morecraf@udel.edu) Course Web Site: http://www.eecis.udel.edu/~morecraf/cisc360 Static & Dynamic Scheduling Scheduling: act of finding independent instructions

### IA-64 Application Developer s Architecture Guide

IA-64 Application Developer s Architecture Guide The IA-64 architecture was designed to overcome the performance limitations of today s architectures and provide maximum headroom for the future. To achieve

### Central Processing Unit (CPU)

Central Processing Unit (CPU) CPU is the heart and brain It interprets and executes machine level instructions Controls data transfer from/to Main Memory (MM) and CPU Detects any errors In the following

### Introduction to Design of a Tiny Computer

Introduction to Design of a Tiny Computer (Background of LAB#4) Objective In this lab, we will build a tiny computer in Verilog. The execution results will be displayed in the HEX[3:0] of your board. Unlike

### The University of Nottingham

The University of Nottingham School of Computer Science A Level 1 Module, Autumn Semester 2007-2008 Computer Systems Architecture (G51CSA) Time Allowed: TWO Hours Candidates must NOT start writing their

### Computer Architecture

Wider instruction pipelines 2016. május 13. Budapest Gábor Horváth associate professor BUTE Dept. Of Networked Systems and Services ghorvath@hit.bme.hu How to make the CPU faster Option 1: Make the pipeline

### Pipelining: Its Natural!

Pipelining: Its Natural! Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes, Dryer takes 40 minutes, Folder takes 20 minutes T a s k O

### The Von Neumann Model. University of Texas at Austin CS310H - Computer Organization Spring 2010 Don Fussell

The Von Neumann Model University of Texas at Austin CS310H - Computer Organization Spring 2010 Don Fussell The Stored Program Computer 1943: ENIAC Presper Eckert and John Mauchly -- first general electronic

### Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University cwliu@twins.ee.nctu.edu.

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University cwliu@twins.ee.nctu.edu.tw Review Computers in mid 50 s Hardware was expensive

### Pipeline Hazards. Arvind Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material prepared by Arvind and Krste Asanovic

1 Pipeline Hazards Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material prepared by and Krste Asanovic 6.823 L6-2 Technology Assumptions A small amount of very fast memory

### von Neumann von Neumann vs. Harvard Harvard Architecture von Neumann vs. Harvard Computer Architecture in a nutshell Microprocessor Architecture

Microprocessor Architecture Alternative approaches Two opposite examples SHARC ARM7 Computer Architecture in a nutshell Separation of CPU and memory distinguishes programmable computer. CPU fetches instructions

### Reduced Instruction Set Computer vs Complex Instruction Set Computers

RISC vs CISC Reduced Instruction Set Computer vs Complex Instruction Set Computers for a given benchmark the performance of a particular computer: P = 1 I C 1 S where P = time to execute I = number of

### CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

CS:APP Chapter 4 Computer Architecture Wrap-Up William J. Taffe Plymouth State University using the slides of Randal E. Bryant Carnegie Mellon University Overview Wrap-Up of PIPE Design Performance analysis

### response (or execution) time -- the time between the start and the finish of a task throughput -- total amount of work done in a given time

Chapter 4: Assessing and Understanding Performance 1. Define response (execution) time. response (or execution) time -- the time between the start and the finish of a task 2. Define throughput. throughput

### How to improve (decrease) CPI

How to improve (decrease) CPI Recall: CPI = Ideal CPI + CPI contributed by stalls Ideal CPI =1 for single issue machine even with multiple execution units Ideal CPI will be less than 1 if we have several

### StrongARM** SA-110 Microprocessor Instruction Timing

StrongARM** SA-110 Microprocessor Instruction Timing Application Note September 1998 Order Number: 278194-001 Information in this document is provided in connection with Intel products. No license, express

### Unit 5 Central Processing Unit (CPU)

Unit 5 Central Processing Unit (CPU) Introduction Part of the computer that performs the bulk of data-processing operations is called the central processing unit (CPU). It consists of 3 major parts: Register

### Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

Instruction Set Architecture or How to talk to computers if you aren t in Star Trek The Instruction Set Architecture Application Compiler Instr. Set Proc. Operating System I/O system Instruction Set Architecture

### Introduction to Cloud Computing

Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

### CPU Performance Equation

CPU Performance Equation C T I T ime for task = C T I =Average # Cycles per instruction =Time per cycle =Instructions per task Pipelining e.g. 3-5 pipeline steps (ARM, SA, R3000) Attempt to get C down

### CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming

CS 152 Computer Architecture and Engineering Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming Krste Asanovic Electrical Engineering and Computer Sciences University of California at

### Computer Architecture TDTS10

why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers

### A Lab Course on Computer Architecture

A Lab Course on Computer Architecture Pedro López José Duato Depto. de Informática de Sistemas y Computadores Facultad de Informática Universidad Politécnica de Valencia Camino de Vera s/n, 46071 - Valencia,

### Quiz for Chapter 2 Instructions: Language of the Computer3.10

Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [5 points] Prior to the early 1980s, machines

### Thread level parallelism

Thread level parallelism ILP is used in straight line code or loops Cache miss (off-chip cache and main memory) is unlikely to be hidden using ILP. Thread level parallelism is used instead. Thread: process

### A s we saw in Chapter 4, a CPU contains three main sections: the register section,

6 CPU Design A s we saw in Chapter 4, a CPU contains three main sections: the register section, the arithmetic/logic unit (ALU), and the control unit. These sections work together to perform the sequences

### The Microarchitecture of Superscalar Processors

The Microarchitecture of Superscalar Processors James E. Smith Department of Electrical and Computer Engineering 1415 Johnson Drive Madison, WI 53706 ph: (608)-265-5737 fax:(608)-262-1267 email: jes@ece.wisc.edu

### Architecture and Programming of x86 Processors

Brno University of Technology Architecture and Programming of x86 Processors Microprocessor Techniques and Embedded Systems Lecture 12 Dr. Tomas Fryza December 2012 Contents A little bit of one-core Intel

### CS107 Handout 12 Spring 2008 April 18, 2008 Computer Architecture: Take I

CS107 Handout 12 Spring 2008 April 18, 2008 Computer Architecture: Take I Computer architecture Handout written by Julie Zelenski and Nick Parlante A simplified picture with the major features of a computer

### Application Note 195. ARM11 performance monitor unit. Document number: ARM DAI 195B Issued: 15th February, 2008 Copyright ARM Limited 2007

Application Note 195 ARM11 performance monitor unit Document number: ARM DAI 195B Issued: 15th February, 2008 Copyright ARM Limited 2007 Copyright 2007 ARM Limited. All rights reserved. Application Note

### CSE320 Final Exam Practice Questions

CSE320 Final Exam Practice Questions Single Cycle Datapath/ Multi Cycle Datapath Adding instructions Modify the datapath and control signals to perform the new instructions in the corresponding datapath.

### Advanced Microprocessors RISC & DSP

Advanced Microprocessors RISC & DSP RISC & DSP :: Slide 1 of 23 RISC Processors RISC stands for Reduced Instruction Set Computer Compared to CISC Simpler Faster RISC & DSP :: Slide 2 of 23 Why RISC? Complex

### Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats

Execution Cycle Pipelining CSE 410, Spring 2005 Computer Systems http://www.cs.washington.edu/410 1. Instruction Fetch 2. Instruction Decode 3. Execute 4. Memory 5. Write Back IF and ID Stages 1. Instruction

### Using Graphics and Animation to Visualize Instruction Pipelining and its Hazards

Using Graphics and Animation to Visualize Instruction Pipelining and its Hazards Per Stenström, Håkan Nilsson, and Jonas Skeppstedt Department of Computer Engineering, Lund University P.O. Box 118, S-221

### Chapter 4 Lecture 5 The Microarchitecture Level Integer JAVA Virtual Machine

Chapter 4 Lecture 5 The Microarchitecture Level Integer JAVA Virtual Machine This is a limited version of a hardware implementation to execute the JAVA programming language. 1 of 23 Structured Computer

### Chapter 4 Register Transfer and Microoperations. Section 4.1 Register Transfer Language

Chapter 4 Register Transfer and Microoperations Section 4.1 Register Transfer Language Digital systems are composed of modules that are constructed from digital components, such as registers, decoders,

### Processing Unit Design

&CHAPTER 5 Processing Unit Design In previous chapters, we studied the history of computer systems and the fundamental issues related to memory locations, addressing modes, assembly language, and computer

### Giving credit where credit is due

CSCE 230J Computer Organization Processor Architecture VI: Wrap-Up Dr. Steve Goddard goddard@cse.unl.edu http://cse.unl.edu/~goddard/courses/csce230j Giving credit where credit is due ost of slides for

### MACHINE ARCHITECTURE & LANGUAGE

in the name of God the compassionate, the merciful notes on MACHINE ARCHITECTURE & LANGUAGE compiled by Jumong Chap. 9 Microprocessor Fundamentals A system designer should consider a microprocessor-based

### what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

Inside the CPU how does the CPU work? what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? some short, boring programs to illustrate the

### Chapter 5. Processing Unit Design

Chapter 5 Processing Unit Design 5.1 CPU Basics A typical CPU has three major components: Register Set, Arithmetic Logic Unit, and Control Unit (CU). The register set is usually a combination of general-purpose

### CPU- Internal Structure

ESD-1 Elettronica dei Sistemi Digitali 1 CPU- Internal Structure Lesson 12 CPU Structure&Function Instruction Sets Addressing Modes Read Stallings s chapters: 11, 9, 10 esd-1-9:10:11-2002 1 esd-1-9:10:11-2002

### 612 CHAPTER 11 PROCESSOR FAMILIES (Corrisponde al cap. 12 - Famiglie di processori) PROBLEMS

612 CHAPTER 11 PROCESSOR FAMILIES (Corrisponde al cap. 12 - Famiglie di processori) PROBLEMS 11.1 How is conditional execution of ARM instructions (see Part I of Chapter 3) related to predicated execution

### Instruction Set Architecture (ISA)

Instruction Set Architecture (ISA) * Instruction set architecture of a machine fills the semantic gap between the user and the machine. * ISA serves as the starting point for the design of a new machine

### A Brief Review of Processor Architecture. Why are Modern Processors so Complicated? Basic Structure

A Brief Review of Processor Architecture Why are Modern Processors so Complicated? Basic Structure CPU PC IR Regs ALU Memory Fetch PC -> Mem addr [addr] > IR PC ++ Decode Select regs Execute Perform op

### Week 1 out-of-class notes, discussions and sample problems

Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types

### Decoding Instructions

Decoding s Decoding instructions involves Sending the instruction s fields to the control unit Control Unit r Register r 2 Write r Data ing two values from the Register Executing R-Type s R-type instructions

### ECE4703-B06 TMS320C6713 Architecture Overview and Assembly Language Programming

ECE4703-B06 TMS320C6713 Architecture Overview and Assembly Language Programming D. Richard Brown III Associate Professor Worcester Polytechnic Institute Electrical and Computer Engineering Department drb@ece.wpi.edu

### Lecture 7 ARM Processor Organization. Internal Organization of ARM

Lecture 7 AM Processor Organization First AM processor developed on 3 micron technology in 83-85 his course is mainly based on the AM6/7 architecture developed between 90-95. Digital quipment Corporation

### Chapter 2 : Central Processing Unit

Chapter 2 Central Processing Unit The part of the computer that performs the bulk of data processing operations is called the Central Processing Unit (CPU) and is the central component of a digital computer.

### Computer Architecture

Computer Architecture Having studied numbers, combinational and sequential logic, and assembly language programming, we begin the study of the overall computer system. The term computer architecture is

### Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

Logistics Week 1: Wednesday, Jan 27 Because of overcrowding, we will be changing to a new room on Monday (Snee 1120). Accounts on the class cluster (crocus.csuglab.cornell.edu) will be available next week.

### Design of Pipelined MIPS Processor. Sept. 24 & 26, 1997

Design of Pipelined MIPS Processor Sept. 24 & 26, 1997 Topics Instruction processing Principles of pipelining Inserting pipe registers Data Hazards Control Hazards Exceptions MIPS architecture subset R-type

### OC By Arsene Fansi T. POLIMI 2008 1

IBM POWER 6 MICROPROCESSOR OC By Arsene Fansi T. POLIMI 2008 1 WHAT S IBM POWER 6 MICROPOCESSOR The IBM POWER6 microprocessor powers the new IBM i-series* and p-series* systems. It s based on IBM POWER5

### Computer Architecture and Systems

PhD Qualifier Exam, Spring 2013 Computer Architecture and Systems 1. (6 points) Consider a virtual memor system. (1) (2 points) Explain the difference between a virtual address and a phgsical address.

### Lecture 5: MIPS Examples

Lecture 5: MIPS Examples Today s topics: the compilation process full example sort in C Reminder: 2 nd assignment will be posted later today 1 Dealing with Characters Instructions are also provided to

### (Refer Slide Time: 00:01:16 min)

Digital Computer Organization Prof. P. K. Biswas Department of Electronic & Electrical Communication Engineering Indian Institute of Technology, Kharagpur Lecture No. # 04 CPU Design: Tirning & Control

### This Unit: Code Scheduling. CIS 371 Computer Organization and Design. Pipelining Review. Readings. ! Pipelining and superscalar review

This Unit: Code Scheduling CIS 371 Computer Organization and Design App App App System software Mem CPU I/O! Pipelining and superscalar review! Code scheduling! To reduce pipeline stalls! To increase ILP

### CPU Organization and Assembly Language

COS 140 Foundations of Computer Science School of Computing and Information Science University of Maine October 2, 2015 Outline 1 2 3 4 5 6 7 8 Homework and announcements Reading: Chapter 12 Homework:

### PART B QUESTIONS AND ANSWERS UNIT I

PART B QUESTIONS AND ANSWERS UNIT I 1. Explain the architecture of 8085 microprocessor? Logic pin out of 8085 microprocessor Address bus: unidirectional bus, used as high order bus Data bus: bi-directional

### Advanced Computer Architecture-CS501

Lecture Handouts Computer Architecture Lecture No. 12 Reading Material Vincent P. Heuring&Harry F. Jordan Chapter 4 Computer Systems Design and Architecture 4.1, 4.2, 4.3 Summary 7) The design process

### COMP 303 MIPS Processor Design Project 4: MIPS Processor Due Date: 11 December 2009 23:59

COMP 303 MIPS Processor Design Project 4: MIPS Processor Due Date: 11 December 2009 23:59 Overview: In the first projects for COMP 303, you will design and implement a subset of the MIPS32 architecture

### EECS 583 Class 11 Instruction Scheduling Software Pipelining Intro

EECS 58 Class Instruction Scheduling Software Pipelining Intro University of Michigan October 8, 04 Announcements & Reading Material Reminder: HW Class project proposals» Signup sheet available next Weds

### CHAPTER 7: The CPU and Memory

CHAPTER 7: The CPU and Memory The Architecture of Computer Hardware, Systems Software & Networking: An Information Technology Approach 4th Edition, Irv Englander John Wiley and Sons 2010 PowerPoint slides

### MICROPROCESSOR AND MICROCOMPUTER BASICS

Introduction MICROPROCESSOR AND MICROCOMPUTER BASICS At present there are many types and sizes of computers available. These computers are designed and constructed based on digital and Integrated Circuit

### EC 362 Problem Set #2

EC 362 Problem Set #2 1) Using Single Precision IEEE 754, what is FF28 0000? 2) Suppose the fraction enhanced of a processor is 40% and the speedup of the enhancement was tenfold. What is the overall speedup?

### Chapter 2 Logic Gates and Introduction to Computer Architecture

Chapter 2 Logic Gates and Introduction to Computer Architecture 2.1 Introduction The basic components of an Integrated Circuit (IC) is logic gates which made of transistors, in digital system there are

### #5. Show and the AND function can be constructed from two NAND gates.

Study Questions for Digital Logic, Processors and Caches G22.0201 Fall 2009 ANSWERS Digital Logic Read: Sections 3.1 3.3 in the textbook. Handwritten digital lecture notes on the course web page. Study

### Recap & Perspective. Sequential Logic Circuits & Architecture Alternatives. Register Operation. Building a Register. Timing Diagrams.

Sequential Logic Circuits & Architecture Alternatives Recap & Perspective COMP 251 Computer Organization and Architecture Fall 2009 So Far: ALU Implementation Next: Register Implementation Register Operation

### Switch Fabric Implementation Using Shared Memory

Order this document by /D Switch Fabric Implementation Using Shared Memory Prepared by: Lakshmi Mandyam and B. Kinney INTRODUCTION Whether it be for the World Wide Web or for an intra office network, today