why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers Erik Larsson Department of Computer Science 2 Pipelining FI: Fetch Instruction DI: Decode Instruction CO: Calculate operand FO: Fetch Operand EI: Execute Instruction WO: Write Operand Superpipelining FI: Fetch Instruction DI: Decode Instruction CO: Calculate operand FO: Fetch Operand EI: Execute Instruction WO: Write Operand 3 4
Superscalar Architecture FI: Fetch Instruction DI: Decode Instruction CO: Calculate operand FO: Fetch Operand EI: Execute Instruction WO: Write Operand Superscalar Architectures Superscalar architectures allow several instructions to be issued and completed per clock cycle. A superscalar architecture consists of a number of pipelines that are working in parallel. Depending on the number and kind of parallel units available, a certain number of instructions can be executed in parallel. 5 6 Limitations Resource conflicts Control dependency Data conflicts True data dependency Output dependency Anti-dependency Limitations worse - Resource conflict - similar to structural hazards Control - branches Data-conflicts True Data Dependency ADD R2,R4,R5 R2 R4 + R5 MUL R4,R3,R1 ADD R2,R4,R5 Assume: R1=2, R2=0, R3=2, R4=0, R5=2 Correct: MUL makes: R4=4 ADD makes: R2=6 R4 becomes 4 FO EI WO Value of R4 is read. R4=0 at this point 7 8
9 Output Dependency Antidependency Two instructions are writing into the same location; if the second instruction writes before the first one, an error occurs: ADD R4,R2,R5 R4 R2 + R5 An antidependency exists if an instruction uses a location as an operand while a following one is writing into that location; if the first one is still using the location when the second one writes into it, an error occurs: MUL R4,R3,R1 ADD R4,R2,R5 ADD R3,R2,R5 R3 R2 + R5 10 Output Dependency and Antidependency Correct operation Policies for Parallel Instruction Execution Without dependencies by using additional registers: Case 1: ADD R7,R2,R5 R7 R2 + R5 R4 Case 2: ADD R6,R2,R5 R6 R2 + R5 R4 Storage conflicts The ability of a superscalar processor to execute instructions in parallel is determined by: the number and nature of parallel pipelines the mechanism used to find independent instructions. The policies are characterized by: the order in which instructions are issued for execution. the order in which instructions are completed. Execution policies: In-order issue with in-order completion. In-order issue with out-of-order completion. Out-of-order issue with out-of-order completion. 11 12
13 Policies for Parallel Instruction Execution We consider the following instruction sequence: I1: ADDF R12,R13,R14 R12 R13 + R14 (float. pnt.) I2: ADD R1,R8,R9 R1 R8 + R9 I3: MUL R4,R2,R3 R4 R2 * R3 I4: MUL R5,R6,R7 R5 R6 * R7 I5: ADD R10,R5,R7 R10 R5 + R7 I6: ADD R11,R2,R3 R11 R2 + R3 Two instructions can be fetched and decoded at a time; Three functional units can work in parallel: floating point unit, integer adder, integer multiplier; Two instructions can be written back (completed) at a time; I1 requires two cycles to execute; I3 and I4 are in conflict for the same functional unit; I5 depends on the value produced by I4 (we have a true data dependency between I4 and I5); I2, I5 and I6 are in conflict for the same functional unit; 14 In-Order Issue with In-Order Completion In-Order Issue with In-Order Completion Decode/ Issue Execute Write back/ complete Cycle I1 I2 1 I3 I4 I1 I2 2 I5 I6 I1 STALL 3 I3 I1 I2 4 I4 I3 5 I5 I4 6 I6 I5 7 I1: ADDF R12,R13,R14 I2: ADD R1,R8,R9 I3: MUL R4,R2,R3 I4: MUL R5,R6,R7 I5: ADD R10,R5,R7 I6: ADD R11,R2,R3 Instructions are issued in the exact order that would correspond to sequential execution; results are written (completion) in the same order. An instruction cannot be issued before the previous one has been issued; An instruction completes only after the previous one has completed. To guarantee in-order completion, instruction issuing stalls when there is a conflict and when the unit requires more than one cycle to execute; The processor detects and handles (by stalling) true data dependencies and resource conflicts. I6 8 15 16
17 In-Order Issue with In-Order Completion Out-of-Order Issue with Out-of-Order Completion With in-order issue&in-order completion the processor has not to bother about output-dependency and antidependency. It has only to detect true data dependencies. No one of the two dependencies will be violated if instructions are issued/completed in-order: Output dependency ADD R4,R2,R5 R4 R2 + R5 Antidependency ADD R3,R2,R5 R3 R2 + R5 Decode/ Issue Execute Write back/ complete Cycle I1 I2 1 I3 I4 I1 I2 2 I5 I4 I1 I3 I2 3 I6 I4 I1 I3 4 I5 I4 I6 5 I6 I5 6 7 8 I1: ADDF R12,R13,R14 I2: ADD R1,R8,R9 I3: MUL R4,R2,R3 I4: MUL R5,R6,R7 I5: ADD R10,R5,R7 I6: ADD R11,R2,R3 18 Comments on Superscalars Some Architectures Main techniques for superscalar processors: additional pipelined units which are working in parallel; out-of-order issue&out-of-order completion; register renaming. All of the above techniques are aimed to enhance performance. Experiments have shown: only adding additional units is not efficient out-of-order issue is extremely important register renaming can improve performance with more than 30% it is important to provide a fetching/decoding capacity so that instruction window is sufficiently large. PowerPC 604 six independent execution units: Branch execution unit Load/Store unit 3 Integer units Floating-point unit in-order issue Power PC 620 provides in addition to the 604 out-of-order issue 19 20
21 Some Architectures Summary Superscalars Pentium three independent execution units: 2 Integer units Floating point unit in-order issue two instructions issued per clock cycle. Pentium II to 4 provides in addition to the Pentium out-of-order issue five instructions can be issued in one cycle Good The hardware solves everything: Hardware detects potential parallelism between instructions; Hardware tries to issue as many instructions as possible in parallel. Hardware solves register renaming. Binary compatibility If functional units are added in a new version of the architecture or some other improvements have been made to the architecture (without changing the instruction sets), old programs can benefit from the additional potential of parallelism. Why? Because the new hardware will issue the old instruction sequence in a more efficient way. 22 Summary Superscalars Outline Bad Very complex Much hardware is needed for run-time detection. There is a limit in how far we can go with this technique. Power consumption can be very large! The instruction window is limited this limits the capacity to detect potentially parallel instructions Superscalar Processors Very Long Instruction Word Processors Parallel computers 23 24
25 The Alternative: VLIW Processors VLIW Processors VLIW architectures rely on compile-time detection of parallelism the compiler analysis the program and detects operations to be executed in parallel; such operations are packed into one large instruction. After one instruction has been fetched all the corresponding operations are issued in parallel. No hardware is needed for run-time detection of parallelism. Detection of parallelism and packaging of operations into instructions is done, by the compiler, off-line. The instruction window problem is solved: the compiler can potentially analyse the whole program in order to detect parallel operations. 26 VLIW Processors Summary VLIW Processors Advantages the number of FUs can be increased without needing additional sophisticated hardware to detect parallelism. power consumption can be reduced. Compilers can detect parallelism based on global analysis program. Problems: large number of registers needed in order to keep all FUs active. Large data transport capacity is needed between FUs and the register file register files and memory. instruction cache and fetch unit. Large code size, partially because unused operations -> wasted bits. Incompatibility of binary code 27 28
Outline Parallel Computers Superscalar Processors Very Long Instruction Word Processors Parallel computers One solution to the need for high performance: architectures in which several CPUs are running in order to solve a certain application. Such computers have been organized in very different ways. Some key features: number and complexity of individual CPUs availability of common (shared memory) interconnection topology performance of interconnection network I/O devices The need for high performance! Two main factors contribute to high performance of modern processors: Fast circuit technology Architectural features: large caches multiple fast buses pipelining superscalar architectures (multiple funct. units) 29 30 Classification of Computer Architectures Processor organisation Flynn s classification is based on the nature of the instruction flow executed by the computer and that of the data flow on which the instructions operate. The multiplicity of instruction stream and data stream gives us four different classes: Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD Multiple instruction, multiple data stream- MIMD References Flynn, M., Some Computer Organizations and Their Effectiveness, IEEE Trans. Comput., Vol. C-21, pp. 948, 1972. Duncan, Ralph, "A Survey of Parallel Computer Architectures", IEEE Computer. February 1990, pp. 5-16. Single instruction, single data stream (SISD) Uniprocessor Single instruction, multiple data stream (SIMD) Vector processor Processor organizations Array processor Multiple instruction, single data stream (MISD) Shared memory Symmetric multiprocessor (SMP) Multiple instruction, multiple data stream (MIMD) Distributed memory Nonuniform memory access (NUMA) 31 32
33 Single Instruction stream, Single Data stream (SISD) Single Instruction stream, Multiple Data stream (SIMD) with shared memory 34 SIMD with no shared memory Multiple Instruction stream, Multiple Data stream (MIMD) with shared memory 35 36
37 MIMD with no shared memory Multiprocessors Shared memory MIMD computers are called multiprocessors: 38 Amdahl's law N - parallel units Multiprocessors S = (s + p)/(s + p/n)= 1 / (s + p/n) where S is speed, s is seriel and p is parallel, and N is the number of processors. (s+p) is the total execution time p is the time that can be executed in parallel s is the time that cannot be executed in parallel Example, assume that 20% must be executed serial while 80% can be executed in parallel, and that there are 4 processors. Optimal speed-up is 4 but this example gives: S= (0,2 + 0,8)/ (0,2+0,8/4) = 1/0,4 = 2,5 IBM System/370 (1970s): two IBM CPUs connected to shared memory. IBM System370/XA (1981): multiple CPUs can be connected to shared memory. IBM System/390 (1990s): similar features like S370/XA, with improved performance. Possibility to connect several multiprocessor systems together through fast fiber-optic connection. CRAY X-MP (mid 1980s): from one to four vector processors connected to shared memory (cycle time: 8.5 ns). CRAY Y-MP (1988): from one to 8 vector processors connected to shared memory; 3 times more powerful than CRAY X-MP (cycle time: 4 ns). C90 (early 1990s): further development of CRAY Y-MP; 16 vector processors. CRAY 3 (1993): maximum 16 vector processors (cycle time 2ns). Butterfly multiprocessor system, by BBN Advanced Computers (1985/87): maximum 256 Motorola 68020 processors, interconnected by a sophisticated dynamic switching network. BBN TC2000 (1990): improved version of the Butterfly using Motorola 88100 RISC processor. 39 40
Questions What is a superscalar architecture? How does Amdahl's law work? Give examples why many parallel units will not improve performance? What is the difference between: In-order issue with in-order completion, In-order issue with out-of-order completion, and Out-of-order issue with out-of-order completion? Give examples of True data dependency, Output dependency, and Anti-dependency Give advantages and disadvantages of VLIW How did Flynn classify computers? 41 www.liu.se