Traditional processor design. Ways to speed up execution

Traditional processor design

Ways to speed up execution

To increase processor speed, increase the functionality of the processor - add more complex instructions (CISC - Complex Instruction Set Computers).

But complex instructions need more cycles to execute:
> HAC430 assumes 2 cycles:
  - fetch instruction
  - execute instruction
> How many does a complex instruction really need?
  - fetch instruction
  - decode operands
  - fetch operand registers
  - decode instruction
  - perform operation
  - decode resulting address
  - move data to store register
  - store result
  8 cycles per instruction, not 2.

Alternative - RISC

Have simple instructions, each of which executes in one cycle
> RISC - Reduced Instruction Set Computer

Speed - one cycle for each operation, but more operations. For example, A = B + C:

CISC: 3 instructions
  Load Register 1, B
  Add Register 1, C
  Store Register 1, A

RISC: 10 instructions
  Address of B in read register
  Read B
  Move data to Register 1
  Address of C in read register
  Read C
  Move data to Register 2
  Add Register 1, Register 2, Register 3
  Move Register 3 to write register
  Address of A in write register
  Write A to memory

Aspects of RISC design
- Single-cycle instructions
- Large control memory - often more than 100 registers
- Fast procedure invocation - activation record handling is part of the hardware; activation records can be kept entirely within registers

Implications:
> Cannot directly compare processor speeds of a RISC and a CISC processor:
  - CISC - perhaps 8-10 cycles per instruction
  - RISC - 1 cycle per instruction
  - CISC can do some of these operations in parallel

Pipeline architecture

CISC design:
1. Retrieve instruction from main memory.
2. Decode operation field, source data, and destination data.
3. Get source data for operation.
4. Perform operation.

Pipeline design: while Instruction 1 is executing,
  Instruction 2 is retrieving source data,
  Instruction 3 is being decoded, and
  Instruction 4 is being retrieved from memory.
Four instructions are in flight at once, with an instruction completing each cycle.

Impact on language design

With a standard CISC design, the statement E = A+B+C+D will have the postfix EAB+C+D+= and will execute as follows:
1. Add A to B, giving sum.
2. Add C to sum.
3. Add D to sum.
4. Store sum in E.
On a pipeline, however, Instruction 2 cannot retrieve sum (the result of adding A to B) until the previous instruction stores that result. This causes the processor to wait a cycle on Instruction 2 until Instruction 1 completes its execution. A more intelligent translator would produce the postfix EAB+CD++=, which allows A+B to be computed in parallel with C+D (a small sketch of the effect follows).
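To see why the second postfix is better, compare the dependence height of the two evaluation orders: the stall-free schedule cannot be shorter than the longest chain of dependent adds. The following is an illustrative sketch (not from the original slides); the tuple encoding and the 1-cycle add latency are assumptions made only for this example.

    # Illustrative sketch (not from the slides): dependence height of the two
    # evaluation orders for E = A + B + C + D, assuming every add takes 1 cycle
    # and the operands A..D are already available.

    def height(expr):
        """Length of the longest chain of dependent adds in a nested-tuple tree."""
        if isinstance(expr, str):          # a leaf operand such as "A"
            return 0
        left, right = expr                 # an add node: (left_subtree, right_subtree)
        return 1 + max(height(left), height(right))

    linear   = ((("A", "B"), "C"), "D")    # ((A+B)+C)+D  -- postfix EAB+C+D+=
    balanced = (("A", "B"), ("C", "D"))    # (A+B)+(C+D)  -- postfix EAB+CD++=

    print(height(linear))    # 3: each add must wait for the previous sum
    print(height(balanced))  # 2: A+B and C+D can proceed in parallel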

Further optimization

The pair of statements
  E = A+B+C+D
  J = F+G+H+I
has a modified (interleaved) postfix of EFAB+FG+CD+HI+++==. In this case, each statement executes with no interference from the other, and the processor runs at the full speed of the pipeline.

Conditionals

  A = B + C;
  if D then E = 1 else E = 2

Consider the above program. A pipeline architecture may even start to execute E = 1 before it has evaluated whether D is true. The options are:
> If the branch is taken, wait for the pipeline to empty. This slows the machine down considerably at every branch.
> Simply execute the pipeline as is. The compiler has to make sure that if the branch is taken, there is nothing in the pipeline that can be affected. This puts a great burden on the compiler writer.
Paradoxically, the above program can be compiled and run more efficiently as if the following had been written:
  if D (A = B + C) then E = 1 else E = 2
[Explain this strange behavior.]

Summary

New processor designs are putting more emphasis on good language processor design. There is a need for effective translation models that allow languages to take advantage of faster hardware. What's worse, the simple translation strategies of the past may even fail now.

Multiprocessor system architecture

Impact:
> Multiple processors executing independently
> Need for more synchronization
> Cache coherence: data in a local cache may not be up to date

Tightly coupled systems

The architectures on the previous slide are examples of tightly coupled systems:
> All processors have access to all memory
> Semaphores can be used to coordinate communication among processors
> Quick (nanosecond) access to all data

Networks of machines (loosely coupled machines):
> Processors have access only to local memory
> Semaphores cannot be used
> Relatively slow (millisecond) access to some data
> This is the model necessary for networks like the Internet

Instruction scheduling

Motivation
> instruction latency (pipelining) - several cycles to complete an instruction
> instruction-level parallelism (VLIW, superscalar) - execute multiple instructions per cycle

Issues
> reorder instructions to reduce execution time
> static schedule - the compiler inserts NOPs
> dynamic schedule - the pipeline stalls
> preserve correctness
> operate efficiently
> interactions with optimizations

Sources of latency (hazards)
> data - operands depend on a previous instruction
> structural - limited hardware resources
> control - targets of conditional branches

Approach

Schedule after register allocation (postpass); a machine model is used to estimate execution time.

A legal schedule
> assume each instruction i in Instr has a latency delay(i)
> a legal schedule S maps each i in Instr onto a non-negative integer representing its cycle number (i.e., the time the instruction is issued)
> if i2 depends on i1, then S(i1) + delay(i1) <= S(i2)
> no cycle contains more instructions than can be issued by the machine
> the length of schedule S, denoted L(S), is L(S) = max over i in Instr of (S(i) + delay(i))
> S is optimal if L(S) <= L(T) for all legal schedules T (a small sketch follows this slide)

Scope
> basic blocks (list scheduling)
> branches (trace scheduling, percolation)
> loops (unrolling, software pipelining)

Data dependencies

Dependences are on memory locations rather than values. Statement b depends on statement a if there exists:
> a true or flow dependence - a writes a location that b later reads (read-after-write, RAW)
> an anti-dependence - a reads a location that b later writes (write-after-read, WAR)
> an output dependence - a writes a location that b later writes (write-after-write, WAW)
Another dependence (which doesn't constrain ordering):
> an input dependence - a reads a location that b later reads (read-after-read, RAR)

Example
  true (RAW)    anti (WAR)    output (WAW)    input (RAR)
  a = ...       ... = a       a = ...         ... = a
  ... = a       a = ...       a = ...         ... = a
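The legality conditions and L(S) above can be checked mechanically. Here is a minimal sketch (mine, not course code); the dictionary representation, the `width` parameter, and the three-instruction example are illustrative assumptions.

    # Sketch (not from the slides): check the legality conditions above and
    # compute the schedule length L(S).  `deps` holds pairs (i1, i2) meaning
    # "i2 depends on i1"; `width` is the machine's issue width per cycle.

    from collections import Counter

    def is_legal(S, delay, deps, width):
        # every dependence must respect its producer's latency
        if any(S[i1] + delay[i1] > S[i2] for i1, i2 in deps):
            return False
        # no cycle may hold more instructions than the machine can issue
        return all(n <= width for n in Counter(S.values()).values())

    def length(S, delay):
        return max(S[i] + delay[i] for i in S)      # L(S)

    # example: instructions a, b, c where b and c both depend on a
    delay = {"a": 3, "b": 1, "c": 1}
    deps  = [("a", "b"), ("a", "c")]
    S     = {"a": 0, "b": 3, "c": 4}
    print(is_legal(S, delay, deps, width=1), length(S, delay))   # True 5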

Precedence graph

Construction
> instructions become nodes
> dependences become edges

Example
  1 load r1, X
  2 load r2, Y
  3 mult r3, r2, r1
  4 load r4, A
  5 mult r5, r4, r3
  6 add  r6, r2, r3
  7 mult r2, r5, r6
  8 load r8, B
  9 add  r9, r8, r2

Register renaming eliminates anti and output dependencies:
  before        after
  a = ...       a = ...
  ... = a       ... = a
  a = ...       b = ...
  ... = a       ... = b

List scheduling

Algorithm
> rename to eliminate anti/output dependencies
> construct the precedence graph
> assign priorities to instructions
> iteratively select and schedule instructions (a small sketch follows this slide):

  candidates = roots of the graph
  while candidates remain:
    pick the highest-priority candidate
    schedule the instruction
    add exposed instructions to candidates

Two flavors of list scheduling
               Forward list scheduling      Backward list scheduling
  start with   available ops                ops with no successors
  work         forward                      backward
  ready when   all operands available       latency covers uses
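The sketch referred to above: a forward list scheduler following the loop on the slide. The graph encoding (`succs`, `delay`), the priority callback, and the single-issue default are assumptions made for illustration, not the course's reference implementation.

    # Sketch of the forward list-scheduling loop above.  `succs[i]` lists the
    # instructions that depend on i, `delay[i]` is i's latency, and `prio`
    # maps an instruction to its heuristic priority (higher = schedule first).

    def list_schedule(instrs, succs, delay, prio, width=1):
        preds_left = {i: 0 for i in instrs}
        for i in instrs:
            for s in succs.get(i, ()):
                preds_left[s] += 1
        ready_at = {i: 0 for i in instrs}                      # earliest legal issue cycle
        candidates = {i for i in instrs if preds_left[i] == 0}  # roots of the graph
        schedule, cycle = {}, 0
        while candidates:
            ready = sorted((i for i in candidates if ready_at[i] <= cycle),
                           key=prio, reverse=True)
            for i in ready[:width]:                             # issue up to `width` ops
                schedule[i] = cycle
                candidates.remove(i)
                for s in succs.get(i, ()):                      # expose new candidates
                    preds_left[s] -= 1
                    ready_at[s] = max(ready_at[s], cycle + delay[i])
                    if preds_left[s] == 0:
                        candidates.add(s)
            cycle += 1
        return schedule

With `prio` taken, for instance, from the critical-path numbers sketched after the next slide, this produces one legal schedule; other priorities or a larger `width` give different, still legal, schedules.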

Scheduling heuristics

Problems
> how to choose between ready instructions?
> optimal scheduling is NP-hard even for straight-line code

Heuristics used to prioritize candidates
> ready to execute (no stalls)
> highest latency (more overlap)
> most immediate successors (creates candidates)
> most descendants (creates more candidates)
> longest weighted path (critical path; see the sketch after this slide)

Approach
> use multiple heuristics (to help break ties)
> try multiple schedules (take the best result)

Scheduling example

Machine model
> 3-cycle latency for load
> 2-cycle latency for mult
> 1-cycle latency for add

Example
  1 load r1, X
  2 load r2, Y
  3 mult r3, r2, r1
  4 load r4, A
  5 mult r5, r4, r3
  6 add  r6, r2, r3

[Worksheet: cycle-by-cycle candidate tables for cycles 1-8, to be filled in for forward scheduling, backward scheduling, and 2-instruction (2-wide) scheduling.]
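The critical-path priority mentioned above can be computed as a latency-weighted longest path through each instruction's successors. A sketch (mine, not course code), using the same `succs`/`delay` shape as the earlier list-scheduling sketch; the dependence edges are read off the register uses in the example.

    # Sketch: the "longest weighted path" priority, computed over the same
    # graph representation used in the list-scheduling sketch above.

    from functools import lru_cache

    def critical_path_priority(instrs, succs, delay):
        @lru_cache(maxsize=None)
        def cp(i):
            # latency of i plus the longest path through any dependent instruction
            return delay[i] + max((cp(s) for s in succs.get(i, ())), default=0)
        return {i: cp(i) for i in instrs}

    # example from the slide: loads take 3 cycles, mult 2, add 1;
    # 3 depends on 1 and 2, 5 on 3 and 4, 6 on 2 and 3
    instrs = [1, 2, 3, 4, 5, 6]
    succs  = {1: [3], 2: [3, 6], 3: [5, 6], 4: [5]}
    delay  = {1: 3, 2: 3, 3: 2, 4: 3, 5: 2, 6: 1}
    print(critical_path_priority(instrs, succs, delay))
    # e.g. instruction 2 gets priority 3+2+2 = 7 (its load-mult-mult chain)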

Trace scheduling

Overview - a trace is a path through the code
> examine branch probabilities
> find the trace representing the most likely path
> schedule instructions for the trace
> emit repair code around the trace
> repeat as necessary

[Figure: a control-flow graph over blocks code1-code6, shown before and after the most likely path is selected as a trace.]

Result
> better instruction schedule along the trace
> less efficient schedule off the trace
> increase in code size (from the repair code)

Trace repair code

Code moved below a split
> create a basic block B at the branch target
> replicate the code C that was moved below the split
> insert C into B in its original order

Example (code1 moved below the split, so it is replicated into both targets):
  before             after
  code1              if (...)
  if (...)           code1   code1
  code2   code3      code2   code3

Code moved above a split
> only move code which is dead off the trace
> may perform unnecessary work

Example (code2 moved above the split; code2 is dead off the trace):
  before             after
  code1              code1
  if (...)           code2      // code2 is dead off the trace
  code2              if (...)
  code3              code3

Loop unrolling

Approach
> create multiple copies of the loop body
> more candidates for scheduling

Example (a source-level sketch follows this slide):
  cycle   do i=1,n          do i=1,n,3
  1       load              load
  2       mult              mult  load
  3                         mult  load
  4       store             store mult
  5                         store
  6                         store
          [1 pass = 4 cycles]   [3 passes = 6 cycles]

Problems
> choosing the degree of unrolling k
> pipeline hiccup every k iterations
> increased compilation time
> instruction cache overflow

Software pipelining

Approach
> overlap iterations of the loop
> select a schedule for the loop body
> initiate a new iteration every k cycles, before previous iterations complete
> constant initiation interval

Example
  T      i=1       i=2      i=3      i=4
  1      load
  2      mult      load
  3                mult     load
  4  L:  store              mult     load     (cjmp L)
  5                store             mult
  6                         store
  7                                  store

Properties
> prolog and epilog code for the pipeline
> steady state within the body of the loop
> hiccups in the pipeline at entry and exit
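For a source-level view of unrolling by 3, here is a small sketch (the names and loop body are made up, not from the slides): the three copies of the body are independent, so the scheduler can overlap them, while a cleanup loop handles the iterations left over when n is not a multiple of 3.

    # Source-level sketch of unrolling by 3, plus a cleanup loop for the
    # remaining 0-2 iterations.  The payoff comes when the instruction
    # scheduler overlaps the three independent copies of the body.

    def scale(A, B, n):
        i = 0
        while i + 3 <= n:            # unrolled by 3: three independent bodies
            A[i]     = A[i]     * B[i]
            A[i + 1] = A[i + 1] * B[i + 1]
            A[i + 2] = A[i + 2] * B[i + 2]
            i += 3
        while i < n:                 # cleanup loop
            A[i] = A[i] * B[i]
            i += 1

    # A = [1.0] * 10; B = [2.0] * 10; scale(A, B, 10)  -> every A[i] is doubled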

Branch prediction

Will a conditional branch be taken?
> affects instruction scheduling
> execution penalty for an incorrect guess

Prediction approaches (a small predictor sketch follows this slide)
> hardware history (branch bit)
> history plus branch correlation
> profiling (feedback to the compiler)
> compile-time heuristics

Static branch prediction
> no run-time information
> a single prediction for all executions
> perfect static prediction is 50-100% correct

Simple (target) heuristic
> predict conditional branches taken
> catches loop back edges

Loop branch heuristic
> find forward, back, and exit edges
> predict the back edge / non-exit edge taken

Op code heuristic
> predict greater than zero (error conditions)
> predict floating-point values differ

Loop heuristic    - predict the branch leading to the loop header
Call heuristic    - predict the branch not leading to a call
Return heuristic  - predict the branch not leading to a return
Guard heuristic   - predict the branch leading to the guarded variable
Store heuristic   - predict the branch leading to a store of the variable
Pointer heuristic - predict a pointer is not NULL and that pointers differ
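As one concrete form of the "hardware history" approach above, many machines keep a small saturating counter per branch rather than a single branch bit. A minimal sketch (mine, not from the slides); the default state and the branch address are arbitrary.

    # Sketch of a 2-bit saturating-counter predictor.  States 0-1 predict
    # not-taken, states 2-3 predict taken; each outcome nudges the counter,
    # so one odd outcome does not flip the prediction.

    class TwoBitPredictor:
        def __init__(self):
            self.state = {}                       # branch address -> counter 0..3

        def predict(self, pc):
            return self.state.get(pc, 2) >= 2     # default: weakly taken

        def update(self, pc, taken):
            c = self.state.get(pc, 2)
            self.state[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

    pred, hits = TwoBitPredictor(), 0
    for taken in [True] * 9 + [False]:            # a loop branch taken 9 of 10 times
        hits += (pred.predict(0x40) == taken)
        pred.update(0x40, taken)
    print(hits, "of 10 predicted correctly")      # 9 of 10 with the default state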

Compiling for high-performance machines

Issues
> parallelism
> locality

Classical compilation
> focuses on individual operations
> scalar variables
> flow of values
> unstructured code

High-performance compilation
> focuses on aggregate operations (loops)
> array variables
> memory access patterns
> structured code

Issues
> dependence analysis
> loop transformations

Forms of parallelism

Instruction-level parallelism
> for superscalar and VLIW architectures
> examine dependences between statements
> very fine-grained parallelism
    A = 1
    B = C

Task-level parallelism
> for multiprocessors
> examine dependences between tasks
> parallelism is not scalable
    task 1:              task 2:
    do i = 1,10          do i = 1,10
      A(i) = A(i+1)        B(i) = B(i+1)

Loop-level parallelism
> for vector machines and multiprocessors
> examine dependences between loop iterations
> parallelism is scalable
    doall i = 1,10
      A(i) = B(i+1)

Loop-level parallelism

Basic approach
> execute loop iterations in parallel
> safe if there are no loop-carried data dependences (i.e., no two iterations access the same memory location)

    do i = 1,10                      doall i = 1,10
      A(i) = A(i+1)                    A(i) = A(i+10)
    (loop-carried dependence:        (no loop-carried dependence:
     must run sequentially)           iterations may run in parallel)

Several parallel architectures
> vector processors
    A[1:10] = B[2:11]
> multiprocessors
    doall i = 1,10
      A(i) = B(i+1)
> message-passing machines
    if (...) send B(1)
    if (...) recv B(11)
    do i = l,u
      A(i) = B(i+1)

Exposing parallelism

Scalar analysis
> improves the precision of dependence tests
> eliminates unnecessary scalar statements
> based on data-flow analysis

Solution techniques

Forward propagation (expose the values of scalar variables)
    do i = 1,10          do i = 1,10
      k = i+1              k = i+1
      A(k) = ...           A(i+1) = ...

Constant propagation
    k = 1                k = 1
    do i = 1,10          do i = 1,10
      A(i+k) = ...         A(i+1) = ...

Induction variable recognition
    k = 1                k = 1
    do i = 1,10          do i = 1,10
      k = k+1              A(i+1) = ...
      A(k) = ...         k = 11

Vectorization

Vector processors
> operations on vectors of data
> overlap iterations of the inner loop (a NumPy sketch follows this slide)

    doall i = 1,10         A[1:10] = 1.0
      A(i) = 1.0           B[1:10] = A[1:10]
      B(i) = A(i)

> exploit fine-grained parallelism
> expressed in vector languages (APL, Fortran 90)

Execution model
> single thread of control
> single instruction, multiple data (SIMD)
> load data into vector registers
> efficiently execute pipelined operations

Issues
> vector length - coalesce loops to reduce overhead
> control flow - convert conditions into explicit data
> not very popular today!

Parallelization

Multiprocessors
> multiple independent processors (MIMD)
> assign iterations to different processors

    doall j = 1,10         do j = l,u
      do i = 1,10            do i = 1,10
        A(i,j) = 1.0           A(i,j) = 1.0

> exploit coarse-grained parallelism

Execution model
> fork-join parallelism
> the master executes sequential code
> workers (and the master) execute parallel code
> the master continues after the workers finish

Issues
> granularity - larger computation partitions reduce overhead
> scheduling - policy for assigning iterations to processors
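The vector form above maps directly onto whole-array operations in array languages and libraries. Here is a small sketch using NumPy (NumPy is not part of the course material) that mirrors the A[1:10] = B[2:11] notation; the array sizes are chosen so the indices line up with the Fortran example.

    # Sketch: the scalar loop and the vector statement compute the same thing;
    # the slice form hands the whole operation to vectorized machine code
    # instead of a scalar interpreter loop.
    import numpy as np

    A = np.zeros(11)          # index 0 unused so indices match the Fortran example
    B = np.arange(12.0)

    for i in range(1, 11):    # scalar loop:  do i = 1,10:  A(i) = B(i+1)
        A[i] = B[i + 1]

    A2 = np.zeros(11)
    A2[1:11] = B[2:12]        # vector form:  A[1:10] = B[2:11]
    assert np.array_equal(A, A2)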

Multithreading

High-latency events
> I/O
> interprocess communication
> page miss
> cache miss

Multiple threads of execution
> switch to a new thread after such an event
> overlaps computation with the event
> requires threads and an efficient context switch

Hardware support
> switch between sets of registers
> shared caches

Software support
> uncovers parallelism for multiple threads
> reduces context-switch overhead