Computer Architecture TDTS10

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "Computer Architecture TDTS10"

Transcription

1 why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers Erik Larsson Department of Computer Science 2 Pipelining FI: Fetch Instruction DI: Decode Instruction CO: Calculate operand FO: Fetch Operand EI: Execute Instruction WO: Write Operand Superpipelining FI: Fetch Instruction DI: Decode Instruction CO: Calculate operand FO: Fetch Operand EI: Execute Instruction WO: Write Operand 3 4

2 Superscalar Architecture FI: Fetch Instruction DI: Decode Instruction CO: Calculate operand FO: Fetch Operand EI: Execute Instruction WO: Write Operand Superscalar Architectures Superscalar architectures allow several instructions to be issued and completed per clock cycle. A superscalar architecture consists of a number of pipelines that are working in parallel. Depending on the number and kind of parallel units available, a certain number of instructions can be executed in parallel. 5 6 Limitations Resource conflicts Control dependency Data conflicts True data dependency Output dependency Anti-dependency Limitations worse - Resource conflict - similar to structural hazards Control - branches Data-conflicts True Data Dependency ADD R2,R4,R5 R2 R4 + R5 MUL R4,R3,R1 ADD R2,R4,R5 Assume: R1=2, R2=0, R3=2, R4=0, R5=2 Correct: MUL makes: R4=4 ADD makes: R2=6 R4 becomes 4 FO EI WO Value of R4 is read. R4=0 at this point 7 8

3 9 Output Dependency Antidependency Two instructions are writing into the same location; if the second instruction writes before the first one, an error occurs: ADD R4,R2,R5 R4 R2 + R5 An antidependency exists if an instruction uses a location as an operand while a following one is writing into that location; if the first one is still using the location when the second one writes into it, an error occurs: MUL R4,R3,R1 ADD R4,R2,R5 ADD R3,R2,R5 R3 R2 + R5 10 Output Dependency and Antidependency Correct operation Policies for Parallel Instruction Execution Without dependencies by using additional registers: Case 1: ADD R7,R2,R5 R7 R2 + R5 R4 Case 2: ADD R6,R2,R5 R6 R2 + R5 R4 Storage conflicts The ability of a superscalar processor to execute instructions in parallel is determined by: the number and nature of parallel pipelines the mechanism used to find independent instructions. The policies are characterized by: the order in which instructions are issued for execution. the order in which instructions are completed. Execution policies: In-order issue with in-order completion. In-order issue with out-of-order completion. Out-of-order issue with out-of-order completion

4 13 Policies for Parallel Instruction Execution We consider the following instruction sequence: I1: ADDF R12,R13,R14 R12 R13 + R14 (float. pnt.) I2: ADD R1,R8,R9 R1 R8 + R9 I3: MUL R4,R2,R3 R4 R2 * R3 I4: MUL R5,R6,R7 R5 R6 * R7 I5: ADD R10,R5,R7 R10 R5 + R7 I6: ADD R11,R2,R3 R11 R2 + R3 Two instructions can be fetched and decoded at a time; Three functional units can work in parallel: floating point unit, integer adder, integer multiplier; Two instructions can be written back (completed) at a time; I1 requires two cycles to execute; I3 and I4 are in conflict for the same functional unit; I5 depends on the value produced by I4 (we have a true data dependency between I4 and I5); I2, I5 and I6 are in conflict for the same functional unit; 14 In-Order Issue with In-Order Completion In-Order Issue with In-Order Completion Decode/ Issue Execute Write back/ complete Cycle I1 I2 1 I3 I4 I1 I2 2 I5 I6 I1 STALL 3 I3 I1 I2 4 I4 I3 5 I5 I4 6 I6 I5 7 I1: ADDF R12,R13,R14 I2: ADD R1,R8,R9 I3: MUL R4,R2,R3 I4: MUL R5,R6,R7 I5: ADD R10,R5,R7 I6: ADD R11,R2,R3 Instructions are issued in the exact order that would correspond to sequential execution; results are written (completion) in the same order. An instruction cannot be issued before the previous one has been issued; An instruction completes only after the previous one has completed. To guarantee in-order completion, instruction issuing stalls when there is a conflict and when the unit requires more than one cycle to execute; The processor detects and handles (by stalling) true data dependencies and resource conflicts. I

5 17 In-Order Issue with In-Order Completion Out-of-Order Issue with Out-of-Order Completion With in-order issue&in-order completion the processor has not to bother about output-dependency and antidependency. It has only to detect true data dependencies. No one of the two dependencies will be violated if instructions are issued/completed in-order: Output dependency ADD R4,R2,R5 R4 R2 + R5 Antidependency ADD R3,R2,R5 R3 R2 + R5 Decode/ Issue Execute Write back/ complete Cycle I1 I2 1 I3 I4 I1 I2 2 I5 I4 I1 I3 I2 3 I6 I4 I1 I3 4 I5 I4 I6 5 I6 I I1: ADDF R12,R13,R14 I2: ADD R1,R8,R9 I3: MUL R4,R2,R3 I4: MUL R5,R6,R7 I5: ADD R10,R5,R7 I6: ADD R11,R2,R3 18 Comments on Superscalars Some Architectures Main techniques for superscalar processors: additional pipelined units which are working in parallel; out-of-order issue&out-of-order completion; register renaming. All of the above techniques are aimed to enhance performance. Experiments have shown: only adding additional units is not efficient out-of-order issue is extremely important register renaming can improve performance with more than 30% it is important to provide a fetching/decoding capacity so that instruction window is sufficiently large. PowerPC 604 six independent execution units: Branch execution unit Load/Store unit 3 Integer units Floating-point unit in-order issue Power PC 620 provides in addition to the 604 out-of-order issue 19 20

6 21 Some Architectures Summary Superscalars Pentium three independent execution units: 2 Integer units Floating point unit in-order issue two instructions issued per clock cycle. Pentium II to 4 provides in addition to the Pentium out-of-order issue five instructions can be issued in one cycle Good The hardware solves everything: Hardware detects potential parallelism between instructions; Hardware tries to issue as many instructions as possible in parallel. Hardware solves register renaming. Binary compatibility If functional units are added in a new version of the architecture or some other improvements have been made to the architecture (without changing the instruction sets), old programs can benefit from the additional potential of parallelism. Why? Because the new hardware will issue the old instruction sequence in a more efficient way. 22 Summary Superscalars Outline Bad Very complex Much hardware is needed for run-time detection. There is a limit in how far we can go with this technique. Power consumption can be very large! The instruction window is limited this limits the capacity to detect potentially parallel instructions Superscalar Processors Very Long Instruction Word Processors Parallel computers 23 24

7 25 The Alternative: VLIW Processors VLIW Processors VLIW architectures rely on compile-time detection of parallelism the compiler analysis the program and detects operations to be executed in parallel; such operations are packed into one large instruction. After one instruction has been fetched all the corresponding operations are issued in parallel. No hardware is needed for run-time detection of parallelism. Detection of parallelism and packaging of operations into instructions is done, by the compiler, off-line. The instruction window problem is solved: the compiler can potentially analyse the whole program in order to detect parallel operations. 26 VLIW Processors Summary VLIW Processors Advantages the number of FUs can be increased without needing additional sophisticated hardware to detect parallelism. power consumption can be reduced. Compilers can detect parallelism based on global analysis program. Problems: large number of registers needed in order to keep all FUs active. Large data transport capacity is needed between FUs and the register file register files and memory. instruction cache and fetch unit. Large code size, partially because unused operations -> wasted bits. Incompatibility of binary code 27 28

8 Outline Parallel Computers Superscalar Processors Very Long Instruction Word Processors Parallel computers One solution to the need for high performance: architectures in which several CPUs are running in order to solve a certain application. Such computers have been organized in very different ways. Some key features: number and complexity of individual CPUs availability of common (shared memory) interconnection topology performance of interconnection network I/O devices The need for high performance! Two main factors contribute to high performance of modern processors: Fast circuit technology Architectural features: large caches multiple fast buses pipelining superscalar architectures (multiple funct. units) Classification of Computer Architectures Processor organisation Flynn s classification is based on the nature of the instruction flow executed by the computer and that of the data flow on which the instructions operate. The multiplicity of instruction stream and data stream gives us four different classes: Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD Multiple instruction, multiple data stream- MIMD References Flynn, M., Some Computer Organizations and Their Effectiveness, IEEE Trans. Comput., Vol. C-21, pp. 948, Duncan, Ralph, "A Survey of Parallel Computer Architectures", IEEE Computer. February 1990, pp Single instruction, single data stream (SISD) Uniprocessor Single instruction, multiple data stream (SIMD) Vector processor Processor organizations Array processor Multiple instruction, single data stream (MISD) Shared memory Symmetric multiprocessor (SMP) Multiple instruction, multiple data stream (MIMD) Distributed memory Nonuniform memory access (NUMA) 31 32

9 33 Single Instruction stream, Single Data stream (SISD) Single Instruction stream, Multiple Data stream (SIMD) with shared memory 34 SIMD with no shared memory Multiple Instruction stream, Multiple Data stream (MIMD) with shared memory 35 36

10 37 MIMD with no shared memory Multiprocessors Shared memory MIMD computers are called multiprocessors: 38 Amdahl's law N - parallel units Multiprocessors S = (s + p)/(s + p/n)= 1 / (s + p/n) where S is speed, s is seriel and p is parallel, and N is the number of processors. (s+p) is the total execution time p is the time that can be executed in parallel s is the time that cannot be executed in parallel Example, assume that 20% must be executed serial while 80% can be executed in parallel, and that there are 4 processors. Optimal speed-up is 4 but this example gives: S= (0,2 + 0,8)/ (0,2+0,8/4) = 1/0,4 = 2,5 IBM System/370 (1970s): two IBM CPUs connected to shared memory. IBM System370/XA (1981): multiple CPUs can be connected to shared memory. IBM System/390 (1990s): similar features like S370/XA, with improved performance. Possibility to connect several multiprocessor systems together through fast fiber-optic connection. CRAY X-MP (mid 1980s): from one to four vector processors connected to shared memory (cycle time: 8.5 ns). CRAY Y-MP (1988): from one to 8 vector processors connected to shared memory; 3 times more powerful than CRAY X-MP (cycle time: 4 ns). C90 (early 1990s): further development of CRAY Y-MP; 16 vector processors. CRAY 3 (1993): maximum 16 vector processors (cycle time 2ns). Butterfly multiprocessor system, by BBN Advanced Computers (1985/87): maximum 256 Motorola processors, interconnected by a sophisticated dynamic switching network. BBN TC2000 (1990): improved version of the Butterfly using Motorola RISC processor

11 Questions What is a superscalar architecture? How does Amdahl's law work? Give examples why many parallel units will not improve performance? What is the difference between: In-order issue with in-order completion, In-order issue with out-of-order completion, and Out-of-order issue with out-of-order completion? Give examples of True data dependency, Output dependency, and Anti-dependency Give advantages and disadvantages of VLIW How did Flynn classify computers? 41

UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS

UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS Structure Page Nos. 2.0 Introduction 27 2.1 Objectives 27 2.2 Types of Classification 28 2.3 Flynn s Classification 28 2.3.1 Instruction Cycle 2.3.2 Instruction

More information

LSN 2 Computer Processors

LSN 2 Computer Processors LSN 2 Computer Processors Department of Engineering Technology LSN 2 Computer Processors Microprocessors Design Instruction set Processor organization Processor performance Bandwidth Clock speed LSN 2

More information

Advanced Pipelining Techniques

Advanced Pipelining Techniques Advanced Pipelining Techniques 1. Dynamic Scheduling 2. Loop Unrolling 3. Software Pipelining 4. Dynamic Branch Prediction Units 5. Register Renaming 6. Superscalar Processors 7. VLIW (Very Large Instruction

More information

Computer Organization and Architecture

Computer Organization and Architecture Computer Organization and Architecture Chapter 14 Instruction Level Parallelism and Superscalar Processors What does Superscalar mean? Common instructions (arithmetic, load/store, conditional branch) can

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Course on: Advanced Computer Architectures INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Prof. Cristina Silvano Politecnico di Milano cristina.silvano@polimi.it Prof. Silvano, Politecnico di Milano

More information

CMSC 611: Advanced Computer Architecture

CMSC 611: Advanced Computer Architecture CMSC 611: Advanced Computer Architecture Parallel Computation Most slides adapted from David Patterson. Some from Mohomed Younis Parallel Computers Definition: A parallel computer is a collection of processing

More information

VLIW Processors. VLIW Processors

VLIW Processors. VLIW Processors 1 VLIW Processors VLIW ( very long instruction word ) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW

More information

Lecture 23: Multiprocessors

Lecture 23: Multiprocessors Lecture 23: Multiprocessors Today s topics: RAID Multiprocessor taxonomy Snooping-based cache coherence protocol 1 RAID 0 and RAID 1 RAID 0 has no additional redundancy (misnomer) it uses an array of disks

More information

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 IBM CELL. Politecnico di Milano Como Campus

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 IBM CELL. Politecnico di Milano Como Campus Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 CELL INTRODUCTION 2 1 CELL SYNERGY Cell is not a collection of different processors, but a synergistic whole Operation paradigms,

More information

More on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction

More information

Lecture 2 Parallel Programming Platforms

Lecture 2 Parallel Programming Platforms Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple

More information

Tools Page 1 of 13 ON PROGRAM TRANSLATION. A priori, we have two translation mechanisms available:

Tools Page 1 of 13 ON PROGRAM TRANSLATION. A priori, we have two translation mechanisms available: Tools Page 1 of 13 ON PROGRAM TRANSLATION A priori, we have two translation mechanisms available: Interpretation Compilation On interpretation: Statements are translated one at a time and executed immediately.

More information

Pipelining Review and Its Limitations

Pipelining Review and Its Limitations Pipelining Review and Its Limitations Yuri Baida yuri.baida@gmail.com yuriy.v.baida@intel.com October 16, 2010 Moscow Institute of Physics and Technology Agenda Review Instruction set architecture Basic

More information

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches: Multiple-Issue Processors Pipelining can achieve CPI close to 1 Mechanisms for handling hazards Static or dynamic scheduling Static or dynamic branch handling Increase in transistor counts (Moore s Law):

More information

A Lab Course on Computer Architecture

A Lab Course on Computer Architecture A Lab Course on Computer Architecture Pedro López José Duato Depto. de Informática de Sistemas y Computadores Facultad de Informática Universidad Politécnica de Valencia Camino de Vera s/n, 46071 - Valencia,

More information

Scalability and Classifications

Scalability and Classifications Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static

More information

Hiroaki Kobayashi 8/3/2011

Hiroaki Kobayashi 8/3/2011 Hiroaki Kobayashi 8/3/2011 l l l l l l l l l Introduction The Difficulty of Creating Parallel Processing Programs Shared Memory Multiprocessors Clusters and Other Message-Passing Multiprocessors Cache

More information

Chapter 2 Logic Gates and Introduction to Computer Architecture

Chapter 2 Logic Gates and Introduction to Computer Architecture Chapter 2 Logic Gates and Introduction to Computer Architecture 2.1 Introduction The basic components of an Integrated Circuit (IC) is logic gates which made of transistors, in digital system there are

More information

Chapter 2 Parallel Computer Architecture

Chapter 2 Parallel Computer Architecture Chapter 2 Parallel Computer Architecture The possibility for a parallel execution of computations strongly depends on the architecture of the execution platform. This chapter gives an overview of the general

More information

Middleware and Distributed Systems. Introduction. Dr. Martin v. Löwis

Middleware and Distributed Systems. Introduction. Dr. Martin v. Löwis Middleware and Distributed Systems Introduction Dr. Martin v. Löwis 14 3. Software Engineering What is Middleware? Bauer et al. Software Engineering, Report on a conference sponsored by the NATO SCIENCE

More information

A Brief Review of Processor Architecture. Why are Modern Processors so Complicated? Basic Structure

A Brief Review of Processor Architecture. Why are Modern Processors so Complicated? Basic Structure A Brief Review of Processor Architecture Why are Modern Processors so Complicated? Basic Structure CPU PC IR Regs ALU Memory Fetch PC -> Mem addr [addr] > IR PC ++ Decode Select regs Execute Perform op

More information

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers CS 4 Introduction to Compilers ndrew Myers Cornell University dministration Prelim tomorrow evening No class Wednesday P due in days Optional reading: Muchnick 7 Lecture : Instruction scheduling pr 0 Modern

More information

OC By Arsene Fansi T. POLIMI 2008 1

OC By Arsene Fansi T. POLIMI 2008 1 IBM POWER 6 MICROPROCESSOR OC By Arsene Fansi T. POLIMI 2008 1 WHAT S IBM POWER 6 MICROPOCESSOR The IBM POWER6 microprocessor powers the new IBM i-series* and p-series* systems. It s based on IBM POWER5

More information

Some Computer Organizations and Their Effectiveness. Michael J Flynn. IEEE Transactions on Computers. Vol. c-21, No.

Some Computer Organizations and Their Effectiveness. Michael J Flynn. IEEE Transactions on Computers. Vol. c-21, No. Some Computer Organizations and Their Effectiveness Michael J Flynn IEEE Transactions on Computers. Vol. c-21, No.9, September 1972 Introduction Attempts to codify a computer have been from three points

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM 1 The ARM architecture processors popular in Mobile phone systems 2 ARM Features ARM has 32-bit architecture but supports 16 bit

More information

SOC architecture and design

SOC architecture and design SOC architecture and design system-on-chip (SOC) processors: become components in a system SOC covers many topics processor: pipelined, superscalar, VLIW, array, vector storage: cache, embedded and external

More information

Introduction to GPU Programming Languages

Introduction to GPU Programming Languages CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure

More information

ELE 356 Computer Engineering II. Section 1 Foundations Class 6 Architecture

ELE 356 Computer Engineering II. Section 1 Foundations Class 6 Architecture ELE 356 Computer Engineering II Section 1 Foundations Class 6 Architecture History ENIAC Video 2 tj History Mechanical Devices Abacus 3 tj History Mechanical Devices The Antikythera Mechanism Oldest known

More information

Annotation to the assignments and the solution sheet. Note the following points

Annotation to the assignments and the solution sheet. Note the following points Computer rchitecture 2 / dvanced Computer rchitecture Seite: 1 nnotation to the assignments and the solution sheet This is a multiple choice examination, that means: Solution approaches are not assessed

More information

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 05: Array Processors

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 05: Array Processors Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors Lesson 05: Array Processors Objective To learn how the array processes in multiple pipelines 2 Array Processor

More information

Instruction Set Architecture (ISA)

Instruction Set Architecture (ISA) Instruction Set Architecture (ISA) * Instruction set architecture of a machine fills the semantic gap between the user and the machine. * ISA serves as the starting point for the design of a new machine

More information

Chapter 2 Parallel Architecture, Software And Performance

Chapter 2 Parallel Architecture, Software And Performance Chapter 2 Parallel Architecture, Software And Performance UCSB CS140, T. Yang, 2014 Modified from texbook slides Roadmap Parallel hardware Parallel software Input and output Performance Parallel program

More information

INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET)

INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET) INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET) International Journal of Electronics and Communication Engineering & Technology (IJECET), ISSN 0976 ISSN 0976 6464(Print)

More information

White Paper COMPUTE CORES

White Paper COMPUTE CORES White Paper COMPUTE CORES TABLE OF CONTENTS A NEW ERA OF COMPUTING 3 3 HISTORY OF PROCESSORS 3 3 THE COMPUTE CORE NOMENCLATURE 5 3 AMD S HETEROGENEOUS PLATFORM 5 3 SUMMARY 6 4 WHITE PAPER: COMPUTE CORES

More information

This Unit: Code Scheduling. CIS 371 Computer Organization and Design. Pipelining Review. Readings. ! Pipelining and superscalar review

This Unit: Code Scheduling. CIS 371 Computer Organization and Design. Pipelining Review. Readings. ! Pipelining and superscalar review This Unit: Code Scheduling CIS 371 Computer Organization and Design App App App System software Mem CPU I/O! Pipelining and superscalar review! Code scheduling! To reduce pipeline stalls! To increase ILP

More information

How to improve (decrease) CPI

How to improve (decrease) CPI How to improve (decrease) CPI Recall: CPI = Ideal CPI + CPI contributed by stalls Ideal CPI =1 for single issue machine even with multiple execution units Ideal CPI will be less than 1 if we have several

More information

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University cwliu@twins.ee.nctu.edu.

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University cwliu@twins.ee.nctu.edu. Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University cwliu@twins.ee.nctu.edu.tw Review Computers in mid 50 s Hardware was expensive

More information

Design Cycle for Microprocessors

Design Cycle for Microprocessors Cycle for Microprocessors Raúl Martínez Intel Barcelona Research Center Cursos de Verano 2010 UCLM Intel Corporation, 2010 Agenda Introduction plan Architecture Microarchitecture Logic Silicon ramp Types

More information

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA AGENDA INTRO TO BEAGLEBONE BLACK HARDWARE & SPECS CORTEX-A8 ARMV7 PROCESSOR PROS & CONS VS RASPBERRY PI WHEN TO USE BEAGLEBONE BLACK Single

More information

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX Overview CISC Developments Over Twenty Years Classic CISC design: Digital VAX VAXÕs RISC successor: PRISM/Alpha IntelÕs ubiquitous 80x86 architecture Ð 8086 through the Pentium Pro (P6) RJS 2/3/97 Philosophy

More information

What are embedded systems? Challenges in embedded computing system design. Design methodologies.

What are embedded systems? Challenges in embedded computing system design. Design methodologies. Embedded Systems Sandip Kundu 1 ECE 354 Lecture 1 The Big Picture What are embedded systems? Challenges in embedded computing system design. Design methodologies. Sophisticated functionality. Real-time

More information

Introducción. Diseño de sistemas digitales.1

Introducción. Diseño de sistemas digitales.1 Introducción Adapted from: Mary Jane Irwin ( www.cse.psu.edu/~mji ) www.cse.psu.edu/~cg431 [Original from Computer Organization and Design, Patterson & Hennessy, 2005, UCB] Diseño de sistemas digitales.1

More information

Instruction Set Architectures

Instruction Set Architectures Instruction Set Architectures Early trend was to add more and more instructions to new CPUs to do elaborate operations (CISC) VAX architecture had an instruction to multiply polynomials! RISC philosophy

More information

High Performance Computing

High Performance Computing High Performance Computing Trey Breckenridge Computing Systems Manager Engineering Research Center Mississippi State University What is High Performance Computing? HPC is ill defined and context dependent.

More information

An Introduction to Parallel Computing/ Programming

An Introduction to Parallel Computing/ Programming An Introduction to Parallel Computing/ Programming Vicky Papadopoulou Lesta Astrophysics and High Performance Computing Research Group (http://ahpc.euc.ac.cy) Dep. of Computer Science and Engineering European

More information

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek Instruction Set Architecture or How to talk to computers if you aren t in Star Trek The Instruction Set Architecture Application Compiler Instr. Set Proc. Operating System I/O system Instruction Set Architecture

More information

Learning Outcomes. Simple CPU Operation and Buses. Composition of a CPU. A simple CPU design

Learning Outcomes. Simple CPU Operation and Buses. Composition of a CPU. A simple CPU design Learning Outcomes Simple CPU Operation and Buses Dr Eddie Edwards eddie.edwards@imperial.ac.uk At the end of this lecture you will Understand how a CPU might be put together Be able to name the basic components

More information

PROBLEMS #20,R0,R1 #$3A,R2,R4

PROBLEMS #20,R0,R1 #$3A,R2,R4 506 CHAPTER 8 PIPELINING (Corrisponde al cap. 11 - Introduzione al pipelining) PROBLEMS 8.1 Consider the following sequence of instructions Mul And #20,R0,R1 #3,R2,R3 #$3A,R2,R4 R0,R2,R5 In all instructions,

More information

18-447 Computer Architecture Lecture 3: ISA Tradeoffs. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 1/18/2013

18-447 Computer Architecture Lecture 3: ISA Tradeoffs. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 1/18/2013 18-447 Computer Architecture Lecture 3: ISA Tradeoffs Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 1/18/2013 Reminder: Homeworks for Next Two Weeks Homework 0 Due next Wednesday (Jan 23), right

More information

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27 Logistics Week 1: Wednesday, Jan 27 Because of overcrowding, we will be changing to a new room on Monday (Snee 1120). Accounts on the class cluster (crocus.csuglab.cornell.edu) will be available next week.

More information

Chapter 7. Multicores, Multiprocessors, and Clusters

Chapter 7. Multicores, Multiprocessors, and Clusters Chapter 7 Multicores, Multiprocessors, and Clusters Introduction Goal: connecting multiple computers to get higher performance Multiprocessors Scalability, availability, power efficiency Job-level (process-level)

More information

on an system with an infinite number of processors. Calculate the speedup of

on an system with an infinite number of processors. Calculate the speedup of 1. Amdahl s law Three enhancements with the following speedups are proposed for a new architecture: Speedup1 = 30 Speedup2 = 20 Speedup3 = 10 Only one enhancement is usable at a time. a) If enhancements

More information

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Quiz for Chapter 1 Computer Abstractions and Technology 3.10 Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,

More information

ARM Cortex-A8 Processor

ARM Cortex-A8 Processor ARM Cortex-A8 Processor High Performances And Low Power for Portable Applications Architectures for Multimedia Systems Prof. Cristina Silvano Gianfranco Longi Matr. 712351 ARM Partners 1 ARM Powered Products

More information

Computer Organization

Computer Organization Computer Organization and Architecture Designing for Performance Ninth Edition William Stallings International Edition contributions by R. Mohan National Institute of Technology, Tiruchirappalli PEARSON

More information

Multicore and Parallel Processing

Multicore and Parallel Processing Multicore and Parallel Processing Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University P & H Chapter 4.10 11, 7.1 6 Administrivia FlameWar Games Night Next Friday, April 27 th 5pm

More information

IA-64 Application Developer s Architecture Guide

IA-64 Application Developer s Architecture Guide IA-64 Application Developer s Architecture Guide The IA-64 architecture was designed to overcome the performance limitations of today s architectures and provide maximum headroom for the future. To achieve

More information

CS 352H: Computer Systems Architecture

CS 352H: Computer Systems Architecture CS 352H: Computer Systems Architecture Topic 14: Multicores, Multiprocessors, and Clusters University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell Introduction Goal:

More information

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune Introduction to RISC Processor ni logic Pvt. Ltd., Pune AGENDA What is RISC & its History What is meant by RISC Architecture of MIPS-R4000 Processor Difference Between RISC and CISC Pros and Cons of RISC

More information

Introduction to Microprocessors

Introduction to Microprocessors Introduction to Microprocessors Yuri Baida yuri.baida@gmail.com yuriy.v.baida@intel.com October 2, 2010 Moscow Institute of Physics and Technology Agenda Background and History What is a microprocessor?

More information

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics 22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC

More information

SUPERCOMPUTERS SECOND EDITION SEPTEMBER 1991

SUPERCOMPUTERS SECOND EDITION SEPTEMBER 1991 SUPERCOMPUTERS SECOND EDITION SEPTEMBER 1991 ARCHITECTURE TECHNOLOGY CORPORATION SPECIALISTS IN COMPUTER ARCHITECTURE P.O. BOX 24344 MINNEAPOLIS, MINNESOTA 55424 (612) 935-2035 ELSEVIER ADVANCED TECHNOLOGY

More information

EE361: Digital Computer Organization Course Syllabus

EE361: Digital Computer Organization Course Syllabus EE361: Digital Computer Organization Course Syllabus Dr. Mohammad H. Awedh Spring 2014 Course Objectives Simply, a computer is a set of components (Processor, Memory and Storage, Input/Output Devices)

More information

Parallel Programming

Parallel Programming Parallel Programming Parallel Architectures Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16 Parallel Architectures Acknowledgements Prof. Felix

More information

Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager

Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor Travis Lanier Senior Product Manager 1 Cortex-A15: Next Generation Leadership Cortex-A class multi-processor

More information

Introduction to Computer Architecture Concepts

Introduction to Computer Architecture Concepts to Computer Architecture Concepts 1. We will start at the very beginning, first with the fundamental concepts behind the modern digital computer, and then some details of their implementation. Many people,

More information

İSTANBUL AYDIN UNIVERSITY

İSTANBUL AYDIN UNIVERSITY İSTANBUL AYDIN UNIVERSITY FACULTY OF ENGİNEERİNG SOFTWARE ENGINEERING THE PROJECT OF THE INSTRUCTION SET COMPUTER ORGANIZATION GÖZDE ARAS B1205.090015 Instructor: Prof. Dr. HASAN HÜSEYİN BALIK DECEMBER

More information

Battle against Intel CPU Monopoly Moves from Lower Costs to Faster IA-32 Processors.

Battle against Intel CPU Monopoly Moves from Lower Costs to Faster IA-32 Processors. Battle against Intel CPU Monopoly Moves from Lower Costs to Faster IA-32 Processors. Giovana Mendes Dep. Informática, Universidade do Minho 4710-057 Braga, Portugal gimendes@uol.com.br Abstract.The battle

More information

Computer Organization and Components

Computer Organization and Components Computer Organization and Components IS5, fall 25 Lecture : Pipelined Processors ssociate Professor, KTH Royal Institute of Technology ssistant Research ngineer, University of California, Berkeley Slides

More information

Multithreading Lin Gao cs9244 report, 2006

Multithreading Lin Gao cs9244 report, 2006 Multithreading Lin Gao cs9244 report, 2006 2 Contents 1 Introduction 5 2 Multithreading Technology 7 2.1 Fine-grained multithreading (FGMT)............. 8 2.2 Coarse-grained multithreading (CGMT)............

More information

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007 Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

More information

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2 Lecture Handout Computer Architecture Lecture No. 2 Reading Material Vincent P. Heuring&Harry F. Jordan Chapter 2,Chapter3 Computer Systems Design and Architecture 2.1, 2.2, 3.2 Summary 1) A taxonomy of

More information

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000. ILP Execution

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000. ILP Execution EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000 Lecture #11: Wednesday, 3 May 2000 Lecturer: Ben Serebrin Scribe: Dean Liu ILP Execution

More information

AMD Opteron Quad-Core

AMD Opteron Quad-Core AMD Opteron Quad-Core a brief overview Daniele Magliozzi Politecnico di Milano Opteron Memory Architecture native quad-core design (four cores on a single die for more efficient data sharing) enhanced

More information

Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield.

Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield. Technical Report UCAM-CL-TR-707 ISSN 1476-2986 Number 707 Computer Laboratory Complexity-effective superscalar embedded processors using instruction-level distributed processing Ian Caulfield December

More information

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern: Pipelining HW Q. Can a MIPS SW instruction executing in a simple 5-stage pipelined implementation have a data dependency hazard of any type resulting in a nop bubble? If so, show an example; if not, prove

More information

ANALYSIS OF SUPERCOMPUTER DESIGN

ANALYSIS OF SUPERCOMPUTER DESIGN ANALYSIS OF SUPERCOMPUTER DESIGN CS/ECE 566 Parallel Processing Fall 2011 1 Anh Huy Bui Nilesh Malpekar Vishnu Gajendran AGENDA Brief introduction of supercomputer Supercomputer design concerns and analysis

More information

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters Interpreters and virtual machines Michel Schinz 2007 03 23 Interpreters Interpreters Why interpreters? An interpreter is a program that executes another program, represented as some kind of data-structure.

More information

Analysis (III) Low Power Design. Kai Huang

Analysis (III) Low Power Design. Kai Huang Analysis (III) Low Power Design Kai Huang Chinese new year: 1.3 billion urban exodus 1/28/2014 Kai.Huang@tum 2 The interactive map, which is updated hourly The thicker, brighter lines are the busiest routes.

More information

Software implementation of Post-Quantum Cryptography

Software implementation of Post-Quantum Cryptography Software implementation of Post-Quantum Cryptography Peter Schwabe Radboud University Nijmegen, The Netherlands October 20, 2013 ASCrypto 2013, Florianópolis, Brazil Part I Optimizing cryptographic software

More information

Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes

Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes basic pipeline: single, in-order issue first extension: multiple issue (superscalar) second extension: scheduling instructions for more ILP option #1: dynamic scheduling (by the hardware) option #2: static

More information

High Performance Processor Architecture. André Seznec IRISA/INRIA ALF project-team

High Performance Processor Architecture. André Seznec IRISA/INRIA ALF project-team High Performance Processor Architecture André Seznec IRISA/INRIA ALF project-team 1 2 Moore s «Law» Nb of transistors on a micro processor chip doubles every 18 months 1972: 2000 transistors (Intel 4004)

More information

Multi-Core Programming

Multi-Core Programming Multi-Core Programming Increasing Performance through Software Multi-threading Shameem Akhter Jason Roberts Intel PRESS Copyright 2006 Intel Corporation. All rights reserved. ISBN 0-9764832-4-6 No part

More information

Generations of the computer. processors.

Generations of the computer. processors. . Piotr Gwizdała 1 Contents 1 st Generation 2 nd Generation 3 rd Generation 4 th Generation 5 th Generation 6 th Generation 7 th Generation 8 th Generation Dual Core generation Improves and actualizations

More information

Embedded System Hardware - Processing (Part II)

Embedded System Hardware - Processing (Part II) 12 Embedded System Hardware - Processing (Part II) Jian-Jia Chen (Slides are based on Peter Marwedel) Informatik 12 TU Dortmund Germany Springer, 2010 2014 年 11 月 11 日 These slides use Microsoft clip arts.

More information

Systolic Computing. Fundamentals

Systolic Computing. Fundamentals Systolic Computing Fundamentals Motivations for Systolic Processing PARALLEL ALGORITHMS WHICH MODEL OF COMPUTATION IS THE BETTER TO USE? HOW MUCH TIME WE EXPECT TO SAVE USING A PARALLEL ALGORITHM? HOW

More information

Spring 2011 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim MIPS Architecture MIPS (Microprocessor without interlocked pipeline stages) MIPS Computer Systems Inc. Developed from Stanford MIPS architecture usages 1990 s R2000, R3000,

More information

CISC, RISC, and DSP Microprocessors

CISC, RISC, and DSP Microprocessors CISC, RISC, and DSP Microprocessors Douglas L. Jones ECE 497 Spring 2000 4/6/00 CISC, RISC, and DSP D.L. Jones 1 Outline Microprocessors circa 1984 RISC vs. CISC Microprocessors circa 1999 Perspective:

More information

WAR: Write After Read

WAR: Write After Read WAR: Write After Read write-after-read (WAR) = artificial (name) dependence add R1, R2, R3 sub R2, R4, R1 or R1, R6, R3 problem: add could use wrong value for R2 can t happen in vanilla pipeline (reads

More information

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Chapter 2 Basic Structure of Computers Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Outline Functional Units Basic Operational Concepts Bus Structures Software

More information

Current Trend of Supercomputer Architecture

Current Trend of Supercomputer Architecture Current Trend of Supercomputer Architecture Haibei Zhang Department of Computer Science and Engineering haibei.zhang@huskymail.uconn.edu Abstract As computer technology evolves at an amazingly fast pace,

More information

Sotirios G. Ziavras, Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, New Jersey 07102, U.S.A.

Sotirios G. Ziavras, Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, New Jersey 07102, U.S.A. COMPUTER SYSTEMS Sotirios G. Ziavras, Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, New Jersey 07102, U.S.A. Keywords Computer organization, processor,

More information

SPARC64 VIIIfx: CPU for the K computer

SPARC64 VIIIfx: CPU for the K computer SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd. s 45-nm CMOS

More information

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs

More information

Pentium vs. Power PC Computer Architecture and PCI Bus Interface

Pentium vs. Power PC Computer Architecture and PCI Bus Interface Pentium vs. Power PC Computer Architecture and PCI Bus Interface CSE 3322 1 Pentium vs. Power PC Computer Architecture and PCI Bus Interface Nowadays, there are two major types of microprocessors in the

More information

PROBLEMS. which was discussed in Section 1.6.3.

PROBLEMS. which was discussed in Section 1.6.3. 22 CHAPTER 1 BASIC STRUCTURE OF COMPUTERS (Corrisponde al cap. 1 - Introduzione al calcolatore) PROBLEMS 1.1 List the steps needed to execute the machine instruction LOCA,R0 in terms of transfers between

More information

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit.

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit. Objectives The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Identify the components of the central processing unit and how they work together and interact with memory Describe how

More information

Interconnection Networks

Interconnection Networks Advanced Computer Architecture (0630561) Lecture 15 Interconnection Networks Prof. Kasim M. Al-Aubidy Computer Eng. Dept. Interconnection Networks: Multiprocessors INs can be classified based on: 1. Mode

More information