Computer Architecture TDTS10

Similar documents
UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS

LSN 2 Computer Processors

Introduction to Cloud Computing

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

CMSC 611: Advanced Computer Architecture

VLIW Processors. VLIW Processors

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr Teruzzi Roberto matr IBM CELL. Politecnico di Milano Como Campus

Lecture 23: Multiprocessors

Tools Page 1 of 13 ON PROGRAM TRANSLATION. A priori, we have two translation mechanisms available:


Lecture 2 Parallel Programming Platforms

Pipelining Review and Its Limitations

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

A Lab Course on Computer Architecture

Scalability and Classifications

Chapter 2 Parallel Computer Architecture

Middleware and Distributed Systems. Introduction. Dr. Martin v. Löwis

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

Chapter 2 Logic Gates and Introduction to Computer Architecture

OC By Arsene Fansi T. POLIMI

Some Computer Organizations and Their Effectiveness. Michael J Flynn. IEEE Transactions on Computers. Vol. c-21, No.

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

SOC architecture and design

ELE 356 Computer Engineering II. Section 1 Foundations Class 6 Architecture

Introduction to GPU Programming Languages

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 05: Array Processors

Annotation to the assignments and the solution sheet. Note the following points

Chapter 2 Parallel Architecture, Software And Performance

Instruction Set Architecture (ISA)

INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET)

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Design Cycle for Microprocessors

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

Introducción. Diseño de sistemas digitales.1

An Introduction to Parallel Computing/ Programming

High Performance Computing

Learning Outcomes. Simple CPU Operation and Buses. Composition of a CPU. A simple CPU design

PROBLEMS #20,R0,R1 #$3A,R2,R4

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

Computer Architecture Lecture 3: ISA Tradeoffs. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 1/18/2013

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

on an system with an infinite number of processors. Calculate the speedup of

IA-64 Application Developer s Architecture Guide

Computer Organization

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune

Introduction to Microprocessors

EE361: Digital Computer Organization Course Syllabus

Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager

Parallel Programming

Introduction to Computer Architecture Concepts

Computer Organization and Components

Multithreading Lin Gao cs9244 report, 2006

AMD Opteron Quad-Core

İSTANBUL AYDIN UNIVERSITY

Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield.

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters

ANALYSIS OF SUPERCOMPUTER DESIGN

Software implementation of Post-Quantum Cryptography

Pentium vs. Power PC Computer Architecture and PCI Bus Interface

Multi-Core Programming

Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes

Systolic Computing. Fundamentals

High Performance Processor Architecture. André Seznec IRISA/INRIA ALF project-team

Current Trend of Supercomputer Architecture

Generations of the computer. processors.

CISC, RISC, and DSP Microprocessors

Sotirios G. Ziavras, Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, New Jersey 07102, U.S.A.

WAR: Write After Read

Embedded System Hardware - Processing (Part II)

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

SPARC64 VIIIfx: CPU for the K computer

High Performance Computing. Course Notes HPC Fundamentals

PROBLEMS. which was discussed in Section

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit.

Interconnection Networks

Architectures and Platforms

Instruction scheduling

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

How To Write A Parallel Computer Program

Next Generation GPU Architecture Code-named Fermi

Principles and characteristics of distributed systems and environments

A single register, called the accumulator, stores the. operand before the operation, and stores the result. Add y # add y from memory to the acc

Instruction Set Design

10 Gbps Line Speed Programmable Hardware for Open Source Network Applications*

Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1

Computer Architectures

ADVANCED COMPUTER ARCHITECTURE: Parallelism, Scalability, Programmability

VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU

Data Level Parallelism

THREE YEAR DEGREE (HONS.) COURSE BACHELOR OF COMPUTER APPLICATION (BCA) First Year Paper I Computer Fundamentals

Study Material for MS-07/MCA/204

Transcription:

why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers Erik Larsson Department of Computer Science 2 Pipelining FI: Fetch Instruction DI: Decode Instruction CO: Calculate operand FO: Fetch Operand EI: Execute Instruction WO: Write Operand Superpipelining FI: Fetch Instruction DI: Decode Instruction CO: Calculate operand FO: Fetch Operand EI: Execute Instruction WO: Write Operand 3 4

Superscalar Architecture FI: Fetch Instruction DI: Decode Instruction CO: Calculate operand FO: Fetch Operand EI: Execute Instruction WO: Write Operand Superscalar Architectures Superscalar architectures allow several instructions to be issued and completed per clock cycle. A superscalar architecture consists of a number of pipelines that are working in parallel. Depending on the number and kind of parallel units available, a certain number of instructions can be executed in parallel. 5 6 Limitations Resource conflicts Control dependency Data conflicts True data dependency Output dependency Anti-dependency Limitations worse - Resource conflict - similar to structural hazards Control - branches Data-conflicts True Data Dependency ADD R2,R4,R5 R2 R4 + R5 MUL R4,R3,R1 ADD R2,R4,R5 Assume: R1=2, R2=0, R3=2, R4=0, R5=2 Correct: MUL makes: R4=4 ADD makes: R2=6 R4 becomes 4 FO EI WO Value of R4 is read. R4=0 at this point 7 8

9 Output Dependency Antidependency Two instructions are writing into the same location; if the second instruction writes before the first one, an error occurs: ADD R4,R2,R5 R4 R2 + R5 An antidependency exists if an instruction uses a location as an operand while a following one is writing into that location; if the first one is still using the location when the second one writes into it, an error occurs: MUL R4,R3,R1 ADD R4,R2,R5 ADD R3,R2,R5 R3 R2 + R5 10 Output Dependency and Antidependency Correct operation Policies for Parallel Instruction Execution Without dependencies by using additional registers: Case 1: ADD R7,R2,R5 R7 R2 + R5 R4 Case 2: ADD R6,R2,R5 R6 R2 + R5 R4 Storage conflicts The ability of a superscalar processor to execute instructions in parallel is determined by: the number and nature of parallel pipelines the mechanism used to find independent instructions. The policies are characterized by: the order in which instructions are issued for execution. the order in which instructions are completed. Execution policies: In-order issue with in-order completion. In-order issue with out-of-order completion. Out-of-order issue with out-of-order completion. 11 12

13 Policies for Parallel Instruction Execution We consider the following instruction sequence: I1: ADDF R12,R13,R14 R12 R13 + R14 (float. pnt.) I2: ADD R1,R8,R9 R1 R8 + R9 I3: MUL R4,R2,R3 R4 R2 * R3 I4: MUL R5,R6,R7 R5 R6 * R7 I5: ADD R10,R5,R7 R10 R5 + R7 I6: ADD R11,R2,R3 R11 R2 + R3 Two instructions can be fetched and decoded at a time; Three functional units can work in parallel: floating point unit, integer adder, integer multiplier; Two instructions can be written back (completed) at a time; I1 requires two cycles to execute; I3 and I4 are in conflict for the same functional unit; I5 depends on the value produced by I4 (we have a true data dependency between I4 and I5); I2, I5 and I6 are in conflict for the same functional unit; 14 In-Order Issue with In-Order Completion In-Order Issue with In-Order Completion Decode/ Issue Execute Write back/ complete Cycle I1 I2 1 I3 I4 I1 I2 2 I5 I6 I1 STALL 3 I3 I1 I2 4 I4 I3 5 I5 I4 6 I6 I5 7 I1: ADDF R12,R13,R14 I2: ADD R1,R8,R9 I3: MUL R4,R2,R3 I4: MUL R5,R6,R7 I5: ADD R10,R5,R7 I6: ADD R11,R2,R3 Instructions are issued in the exact order that would correspond to sequential execution; results are written (completion) in the same order. An instruction cannot be issued before the previous one has been issued; An instruction completes only after the previous one has completed. To guarantee in-order completion, instruction issuing stalls when there is a conflict and when the unit requires more than one cycle to execute; The processor detects and handles (by stalling) true data dependencies and resource conflicts. I6 8 15 16

17 In-Order Issue with In-Order Completion Out-of-Order Issue with Out-of-Order Completion With in-order issue&in-order completion the processor has not to bother about output-dependency and antidependency. It has only to detect true data dependencies. No one of the two dependencies will be violated if instructions are issued/completed in-order: Output dependency ADD R4,R2,R5 R4 R2 + R5 Antidependency ADD R3,R2,R5 R3 R2 + R5 Decode/ Issue Execute Write back/ complete Cycle I1 I2 1 I3 I4 I1 I2 2 I5 I4 I1 I3 I2 3 I6 I4 I1 I3 4 I5 I4 I6 5 I6 I5 6 7 8 I1: ADDF R12,R13,R14 I2: ADD R1,R8,R9 I3: MUL R4,R2,R3 I4: MUL R5,R6,R7 I5: ADD R10,R5,R7 I6: ADD R11,R2,R3 18 Comments on Superscalars Some Architectures Main techniques for superscalar processors: additional pipelined units which are working in parallel; out-of-order issue&out-of-order completion; register renaming. All of the above techniques are aimed to enhance performance. Experiments have shown: only adding additional units is not efficient out-of-order issue is extremely important register renaming can improve performance with more than 30% it is important to provide a fetching/decoding capacity so that instruction window is sufficiently large. PowerPC 604 six independent execution units: Branch execution unit Load/Store unit 3 Integer units Floating-point unit in-order issue Power PC 620 provides in addition to the 604 out-of-order issue 19 20

21 Some Architectures Summary Superscalars Pentium three independent execution units: 2 Integer units Floating point unit in-order issue two instructions issued per clock cycle. Pentium II to 4 provides in addition to the Pentium out-of-order issue five instructions can be issued in one cycle Good The hardware solves everything: Hardware detects potential parallelism between instructions; Hardware tries to issue as many instructions as possible in parallel. Hardware solves register renaming. Binary compatibility If functional units are added in a new version of the architecture or some other improvements have been made to the architecture (without changing the instruction sets), old programs can benefit from the additional potential of parallelism. Why? Because the new hardware will issue the old instruction sequence in a more efficient way. 22 Summary Superscalars Outline Bad Very complex Much hardware is needed for run-time detection. There is a limit in how far we can go with this technique. Power consumption can be very large! The instruction window is limited this limits the capacity to detect potentially parallel instructions Superscalar Processors Very Long Instruction Word Processors Parallel computers 23 24

25 The Alternative: VLIW Processors VLIW Processors VLIW architectures rely on compile-time detection of parallelism the compiler analysis the program and detects operations to be executed in parallel; such operations are packed into one large instruction. After one instruction has been fetched all the corresponding operations are issued in parallel. No hardware is needed for run-time detection of parallelism. Detection of parallelism and packaging of operations into instructions is done, by the compiler, off-line. The instruction window problem is solved: the compiler can potentially analyse the whole program in order to detect parallel operations. 26 VLIW Processors Summary VLIW Processors Advantages the number of FUs can be increased without needing additional sophisticated hardware to detect parallelism. power consumption can be reduced. Compilers can detect parallelism based on global analysis program. Problems: large number of registers needed in order to keep all FUs active. Large data transport capacity is needed between FUs and the register file register files and memory. instruction cache and fetch unit. Large code size, partially because unused operations -> wasted bits. Incompatibility of binary code 27 28

Outline Parallel Computers Superscalar Processors Very Long Instruction Word Processors Parallel computers One solution to the need for high performance: architectures in which several CPUs are running in order to solve a certain application. Such computers have been organized in very different ways. Some key features: number and complexity of individual CPUs availability of common (shared memory) interconnection topology performance of interconnection network I/O devices The need for high performance! Two main factors contribute to high performance of modern processors: Fast circuit technology Architectural features: large caches multiple fast buses pipelining superscalar architectures (multiple funct. units) 29 30 Classification of Computer Architectures Processor organisation Flynn s classification is based on the nature of the instruction flow executed by the computer and that of the data flow on which the instructions operate. The multiplicity of instruction stream and data stream gives us four different classes: Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD Multiple instruction, multiple data stream- MIMD References Flynn, M., Some Computer Organizations and Their Effectiveness, IEEE Trans. Comput., Vol. C-21, pp. 948, 1972. Duncan, Ralph, "A Survey of Parallel Computer Architectures", IEEE Computer. February 1990, pp. 5-16. Single instruction, single data stream (SISD) Uniprocessor Single instruction, multiple data stream (SIMD) Vector processor Processor organizations Array processor Multiple instruction, single data stream (MISD) Shared memory Symmetric multiprocessor (SMP) Multiple instruction, multiple data stream (MIMD) Distributed memory Nonuniform memory access (NUMA) 31 32

33 Single Instruction stream, Single Data stream (SISD) Single Instruction stream, Multiple Data stream (SIMD) with shared memory 34 SIMD with no shared memory Multiple Instruction stream, Multiple Data stream (MIMD) with shared memory 35 36

37 MIMD with no shared memory Multiprocessors Shared memory MIMD computers are called multiprocessors: 38 Amdahl's law N - parallel units Multiprocessors S = (s + p)/(s + p/n)= 1 / (s + p/n) where S is speed, s is seriel and p is parallel, and N is the number of processors. (s+p) is the total execution time p is the time that can be executed in parallel s is the time that cannot be executed in parallel Example, assume that 20% must be executed serial while 80% can be executed in parallel, and that there are 4 processors. Optimal speed-up is 4 but this example gives: S= (0,2 + 0,8)/ (0,2+0,8/4) = 1/0,4 = 2,5 IBM System/370 (1970s): two IBM CPUs connected to shared memory. IBM System370/XA (1981): multiple CPUs can be connected to shared memory. IBM System/390 (1990s): similar features like S370/XA, with improved performance. Possibility to connect several multiprocessor systems together through fast fiber-optic connection. CRAY X-MP (mid 1980s): from one to four vector processors connected to shared memory (cycle time: 8.5 ns). CRAY Y-MP (1988): from one to 8 vector processors connected to shared memory; 3 times more powerful than CRAY X-MP (cycle time: 4 ns). C90 (early 1990s): further development of CRAY Y-MP; 16 vector processors. CRAY 3 (1993): maximum 16 vector processors (cycle time 2ns). Butterfly multiprocessor system, by BBN Advanced Computers (1985/87): maximum 256 Motorola 68020 processors, interconnected by a sophisticated dynamic switching network. BBN TC2000 (1990): improved version of the Butterfly using Motorola 88100 RISC processor. 39 40

Questions What is a superscalar architecture? How does Amdahl's law work? Give examples why many parallel units will not improve performance? What is the difference between: In-order issue with in-order completion, In-order issue with out-of-order completion, and Out-of-order issue with out-of-order completion? Give examples of True data dependency, Output dependency, and Anti-dependency Give advantages and disadvantages of VLIW How did Flynn classify computers? 41 www.liu.se