Instruction scheduling




Instruction scheduling
Advanced Compiler Construction
Michel Schinz, 2015-05-21

Instruction ordering

When a compiler emits the instructions corresponding to a program, it imposes a total order on them. However, that order is usually not the only valid one, in the sense that it can be changed without modifying the program's behavior. For example, if two instructions i1 and i2 appear sequentially in that order and are independent, then it is possible to swap them.

Instruction scheduling

Among all the valid permutations of the instructions composing a program, i.e. those that preserve the program's behavior, some can be more desirable than others. For example, one order might lead to a faster program on some machine, because of architectural constraints. The aim of instruction scheduling is to find a valid order that optimizes some metric, like execution speed.

Pipeline stalls

Modern, pipelined architectures can usually issue at least one instruction per clock cycle. However, an instruction can be executed only if the data it needs is ready. Otherwise, the pipeline stalls for one or several cycles. Stalls can appear because some instructions, e.g. division, require several cycles to complete, or because data has to be fetched from memory.
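To make the stall model concrete, here is a small Python sketch (my own, not part of the course) of a simple in-order issue model: an instruction issues at the earliest one cycle after its predecessor, but stalls until every register it reads is available. The delays (3 for loads and stores, 2 for multiplication, 1 for addition) anticipate the cost model used in the scheduling example that follows.

```python
# Hypothetical in-order issue simulator: an instruction issues at the
# earliest one cycle after its predecessor, but stalls until every
# register it reads is available. Anti- and output dependences are
# handled implicitly by issuing in program order.

DELAY = {"load": 3, "store": 3, "mul": 2, "add": 1}

def issue_cycles(instrs):
    """instrs: list of (kind, destination register or None, read registers)."""
    avail = {}      # register -> cycle at which its value becomes available
    cycle = 0       # issue cycle of the previous instruction
    schedule = []
    for kind, dst, srcs in instrs:
        cycle = max([cycle + 1] + [avail.get(r, 1) for r in srcs])
        schedule.append(cycle)
        if dst is not None:
            avail[dst] = cycle + DELAY[kind]
    return schedule

# The unscheduled code of the scheduling example:
code = [
    ("load",  "R1", []),            # R1 <- Mem[RSP]
    ("add",   "R1", ["R1"]),        # R1 <- R1 + R1
    ("load",  "R2", []),            # R2 <- Mem[RSP+1]
    ("mul",   "R1", ["R1", "R2"]),  # R1 <- R1 * R2
    ("load",  "R2", []),            # R2 <- Mem[RSP+2]
    ("mul",   "R1", ["R1", "R2"]),  # R1 <- R1 * R2
    ("load",  "R2", []),            # R2 <- Mem[RSP+3]
    ("mul",   "R1", ["R1", "R2"]),  # R1 <- R1 * R2
    ("store", None, ["R1"]),        # Mem[RSP+4] <- R1
]
print(issue_cycles(code))  # [1, 4, 5, 8, 9, 12, 13, 16, 18]
```

The gaps between successive issue cycles are exactly the pipeline stalls: for instance, the addition waits until cycle 4 for the first load's result.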

Scheduling example

The following example illustrates how proper scheduling can reduce the time required to execute a piece of RTL code. We assume the following delays for instructions:

  Instruction kind       RTL notation      Delay
  Memory load or store   Ra ← Mem[Rb+c]    3
                         Mem[Rb+c] ← Ra    3
  Multiplication         Ra ← Rb * Rc      2
  Addition               Ra ← Rb + Rc      1

Before scheduling:

  Cycle   Instruction
  1       R1 ← Mem[RSP]
  4       R1 ← R1 + R1
  5       R2 ← Mem[RSP+1]
  8       R1 ← R1 * R2
  9       R2 ← Mem[RSP+2]
  12      R1 ← R1 * R2
  13      R2 ← Mem[RSP+3]
  16      R1 ← R1 * R2
  18      Mem[RSP+4] ← R1

After scheduling (including renaming):

  Cycle   Instruction
  1       R1 ← Mem[RSP]
  2       R2 ← Mem[RSP+1]
  3       R3 ← Mem[RSP+2]
  4       R1 ← R1 + R1
  5       R1 ← R1 * R2
  6       R2 ← Mem[RSP+3]
  7       R1 ← R1 * R3
  9       R1 ← R1 * R2
  11      Mem[RSP+4] ← R1

After scheduling, the last instruction is issued at cycle 11 instead of 18!

Instruction dependences

An instruction i2 depends on an instruction i1 when it is not possible to execute i2 before i1 without changing the behavior of the program. The most common reason for dependence is data dependence: i2 uses a value that is computed by i1. However, as we will see, there are other kinds of dependences.

Data dependences

We distinguish three kinds of dependences between two instructions i1 and i2:
1. true dependence: i2 reads a value written by i1 (read after write, or RAW),
2. antidependence: i2 writes a value read by i1 (write after read, or WAR),
3. output dependence: i2 writes a value also written by i1 (write after write, or WAW).
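The three kinds of dependences can be read off directly from the registers each instruction reads and writes. A hypothetical helper (not from the course) classifying the dependence of an instruction i2 on an earlier instruction i1:

```python
# Classify the dependence of i2 on an earlier instruction i1, given the
# sets of registers each one reads and writes.

def dependences(i1_reads, i1_writes, i2_reads, i2_writes):
    kinds = set()
    if i1_writes & i2_reads:
        kinds.add("RAW")  # true dependence: i2 reads a value written by i1
    if i1_reads & i2_writes:
        kinds.add("WAR")  # antidependence: i2 overwrites a value i1 reads
    if i1_writes & i2_writes:
        kinds.add("WAW")  # output dependence: both write the same location
    return kinds

# R2 <- Mem[RSP+1]  followed by  R1 <- R1 * R2: the product reads the load
print(sorted(dependences(set(), {"R2"}, {"R1", "R2"}, {"R1"})))  # ['RAW']
# R1 <- R1 * R2  followed by  R2 <- Mem[RSP+2]: the load overwrites R2
print(sorted(dependences({"R1", "R2"}, {"R1"}, set(), {"R2"})))  # ['WAR']
```

This works only for register operands; as discussed below, memory operands require conservative approximations.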

Antidependences

Antidependences are not real dependences, in the sense that they do not arise from the flow of data. They are due to a single location being used to store different values. Most of the time, antidependences can be removed by renaming locations, e.g. registers.

In the example below, the program on the left contains a WAW dependence between its two memory load instructions, which can be removed by renaming the second use of R1:

  R1 ← Mem[RSP]        R1 ← Mem[RSP]
  R1 ← Mem[RSP+1]      R2 ← Mem[RSP+1]
  R4 ← R4 + R1         R4 ← R4 + R2

Computing dependences

Identifying dependences among instructions that only access registers is easy. Instructions that access memory are harder to handle: in general, it is not possible to know whether two such instructions refer to the same memory location. Conservative approximations, not examined here, therefore have to be used.

Dependence graph

The dependence graph is a directed graph representing dependences among instructions. Its nodes are the instructions to schedule, and there is an edge from node n1 to node n2 iff the instruction of n2 depends on that of n1. Any topological sort of the nodes of this graph represents a valid way to schedule the instructions.

Dependence graph example

  Name   Instruction
  a      R1 ← Mem[RSP]
  b      R1 ← R1 + R1
  c      R2 ← Mem[RSP+1]
  d      R1 ← R1 * R2
  e      R2 ← Mem[RSP+2]
  f      R1 ← R1 * R2
  g      R2 ← Mem[RSP+3]
  h      R1 ← R1 * R2
  i      Mem[RSP+4] ← R1

[Figure: dependence graph over nodes a to i, with true dependences a→b, b→d, c→d, d→f, e→f, f→h, g→h, h→i and antidependences d→e, f→g.]
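Since any topological sort of the dependence graph is a valid schedule, checking a candidate order amounts to checking that every edge goes forward in it. A small sketch over the example's graph, with the edge list derived by hand from the instructions (true dependences plus the two WAR antidependences on R2):

```python
# Dependence graph of the example, as an edge list.
EDGES = [
    ("a", "b"), ("b", "d"), ("c", "d"), ("d", "f"),  # true dependences
    ("e", "f"), ("f", "h"), ("g", "h"), ("h", "i"),
    ("d", "e"), ("f", "g"),                          # antidependences on R2
]

def is_valid_schedule(order, edges):
    """An order is a valid schedule iff every edge goes forward in it."""
    pos = {n: k for k, n in enumerate(order)}
    return all(pos[src] < pos[dst] for src, dst in edges)

print(is_valid_schedule(list("abcdefghi"), EDGES))  # True: original order
print(is_valid_schedule(list("acbdefghi"), EDGES))  # True: load c hoisted
print(is_valid_schedule(list("acebdfghi"), EDGES))  # False: violates d -> e
```

The rejected order hoists the load e above the multiplication d; it is exactly the d → e antidependence that forbids this, and renaming e's target to R3 is what makes the faster schedule of the earlier example legal.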

Difficulty of scheduling

Optimal instruction scheduling is NP-complete. As always, this implies that we use techniques based on heuristics to find a good, but sometimes not optimal, solution to the problem.

List scheduling algorithm

List scheduling is a technique to schedule the instructions of a single basic block. Its basic idea is to simulate the execution of the instructions, and to try to schedule instructions only when all their operands can be used without stalling the pipeline.

The list scheduling algorithm maintains two lists:
- ready is the list of instructions that could be scheduled without stall, ordered by priority,
- active is the list of instructions that are being executed.

At each step, the highest-priority instruction from ready is scheduled and moved to active, where it stays for a time equal to its delay. Before scheduling is performed, renaming is done to remove all antidependences that can be removed.

Prioritizing instructions

Nodes (i.e. instructions) are sorted by priority in the ready list. Several schemes exist to compute the priority of a node, which can be equal to:
- the length of the longest latency-weighted path from it to a root of the dependence graph,
- the number of its immediate successors,
- the number of its descendants,
- its latency, etc.
Unfortunately, no single scheme is better for all cases.

List scheduling example

Here, a node's priority is the length of the longest latency-weighted path from it to a root of the dependence graph: a: 13, c: 12, b: 10, e: 10, d: 9, g: 8, f: 7, h: 5, i: 3 (written as subscripts below).

  Cycle   ready                  active
  1       [a13, c12, e10, g8]    [a]
  2       [c12, e10, g8]         [a, c]
  3       [e10, g8]              [a, c, e]
  4       [b10, g8]              [b, c, e]
  5       [d9, g8]               [d, e]
  6       [g8]                   [d, g]
  7       [f7]                   [f, g]
  8       []                     [f, g]
  9       [h5]                   [h]
  10      []                     [h]
  11      [i3]                   [i]
  12      []                     [i]
  13      []                     [i]
  14      []                     []
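The trace above can be reproduced by a short Python sketch (a simplified illustration, not the course's reference implementation). It uses the dependence graph and priorities from the example; antidependences are assumed to have already been removed by renaming, so only true dependences remain:

```python
# Delays follow the example's cost model (loads/stores 3, mul 2, add 1);
# priorities are the latency-weighted path lengths from the slides.
DELAY = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}
PRIORITY = {"a": 13, "b": 10, "c": 12, "d": 9, "e": 10,
            "f": 7, "g": 8, "h": 5, "i": 3}
DEPS = {  # predecessors, true dependences only (antidependences renamed away)
    "a": [], "b": ["a"], "c": [], "d": ["b", "c"], "e": [],
    "f": ["d", "e"], "g": [], "h": ["f", "g"], "i": ["h"],
}

def list_schedule(deps, delay, priority):
    done_at = {}  # instruction -> cycle at which its result is available
    issued = {}   # instruction -> issue cycle
    cycle = 0
    while len(issued) < len(deps):
        cycle += 1
        # An instruction is ready once all its predecessors have completed.
        ready = [n for n in deps if n not in issued
                 and all(done_at.get(p, float("inf")) <= cycle
                         for p in deps[n])]
        if ready:  # issue the highest-priority ready instruction
            best = max(ready, key=lambda n: priority[n])
            issued[best] = cycle
            done_at[best] = cycle + delay[best]
    return issued

print(list_schedule(DEPS, DELAY, PRIORITY))
# {'a': 1, 'c': 2, 'e': 3, 'b': 4, 'd': 5, 'g': 6, 'f': 7, 'h': 9, 'i': 11}
```

The issue cycles match the trace: nothing issues at cycles 8 and 10 because f, g and then h are still in flight, and the last instruction i issues at cycle 11.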

Scheduling conflicts

It is hard to decide whether scheduling should be done before or after register allocation. If register allocation is done first, it can introduce antidependences when it reuses registers. If scheduling is done first, register allocation can introduce spill code, destroying the schedule. A common solution is to schedule first, then allocate registers, and schedule once more if spilling was necessary.