Module: Software Instruction Scheduling Part I




Module: Software Instruction Scheduling Part I
Sudhakar Yalamanchili, Georgia Institute of Technology

Reading for this Module
- Loop Unrolling and Instruction Scheduling: Section 2.2
- Dependence Analysis: Section 2.1
- Optimizations: Appendix G.2
ECE 4100/6100 (2)

Compiler Hardware Interface
Compiler flow: Lexical Analyzer → tokens → Parser → parse tree → Semantic Analyzer → Intermediate Code Generator → IR → Optimizer → IR → Code Generator → low-level IR → Postpass Optimizer → machine code.
This module addresses scheduling above the hardware/software interface.
Hardware/Software Interface: the ISA.
Pipeline stages: IF ID EX MEM WB
  L.D   F2, 0(R1)
  MUL.D F6, F4, F8
  S.D   F6, 4(R2)
ECE 4100/6100 (3)

Instruction Scheduling
A compiler can enhance a processor's ability to exploit instruction level parallelism:
- Program dependencies can be preserved by different sequences of instructions.
- Some sequences are more efficient than others, e.g., fewer stall cycles.
- Schedule to avoid hazards.
Instruction scheduling is a complex optimization problem.
ECE 4100/6100 (4)

Opportunity in Superscalars
- High degree of instruction level parallelism (ILP) via multiple (possibly pipelined) functional units (FUs).
- Essential to harness the promised performance.
- A clean, simple model and instruction set make compile-time optimizations feasible; therefore, performance advantages can be harnessed automatically.
ECE 4100/6100 (5)

Example of ILP
Processor components:
- 5 functional units: 2 fixed point units, 2 floating point units, and 1 branch unit.
- Pipeline depth: the floating point units are 2 deep; the others are 1 deep.
- Peak rate: 7 instructions being processed simultaneously in each cycle.
Goal: given a source program P, schedule the instructions so as to minimize the overall execution time on the functional units of the target machine.
ECE 4100/6100 (6)

Cost Functions
- Effectiveness of the optimization: how well can we optimize our objective function? The impact on the running time of the compiled code is determined by the completion time.
- Efficiency of the optimization: how fast can we optimize? The impact is on the time it takes to compile, i.e., the cost of gaining the benefit of code with a fast running time.
ECE 4100/6100 (7)

Example
Original loop:
  Loop: L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        DADDUI  R1, R1, #-8
        BNE     R1, R2, Loop

Latencies (number of intervening clock cycles needed between dependent instructions):

  Source (producer)   Consumer    Latency (clocks)
  FP ALU op           FP ALU op   3
  FP ALU op           Store       2
  Load                FP ALU op   1
  Load                Store       0

Execution of the loop with stalls (before scheduling), 9 cycles per element:
  Loop: L.D     F0, 0(R1)
        stall
        ADD.D   F4, F0, F2
        stall
        stall
        S.D     F4, 0(R1)
        DADDUI  R1, R1, #-8
        stall
        BNE     R1, R2, Loop
ECE 4100/6100 (8)

Scheduled Loop Body
  Loop: L.D     F0, 0(R1)
        DADDUI  R1, R1, #-8
        ADD.D   F4, F0, F2
        stall
        stall
        S.D     F4, 8(R1)
        BNE     R1, R2, Loop
- Schedule instructions to fill stall cycles and honor flow dependencies (the store offset becomes 8(R1) because DADDUI now executes before the store): 7 cycles per element.
- Applies to any basic block in the program.
- How can we increase the number of available instructions to schedule?
ECE 4100/6100 (9)

Enhancing Parallelism
Program Analysis → Program Transformations → Scheduling & Optimization
- Where does concurrency exist? Across instructions; across loops.
- How do we discover it? Program analysis: dependence analysis, aliasing analysis.
- How do we enhance and exploit it? Program transformations: instruction scheduling, loop unrolling, hardware trace generation (later), software pipelining (later).
ECE 4100/6100 (10)

Loop Unrolling
Loops are syntactic constructs for specifying repetitive code sequences. We would prefer longer code sequences, so we create them via unrolling: replicate the loop body multiple times.

Example, unrolled once:

  for (i = 0; i < 8; i++) {
      c[i] = a[i] * X + Y;
  }

  for (i = 0; i < 4; i++) {
      c[2*i]   = a[2*i]   * X + Y;
      c[2*i+1] = a[2*i+1] * X + Y;
  }
ECE 4100/6100 (11)

The Unrolled Loop: The Simple Case
  Loop: L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        DADDUI  R1, R1, #-8
        BNE     R1, R2, Loop
        L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        DADDUI  R1, R1, #-8
        BNE     R1, R2, Loop
        L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        DADDUI  R1, R1, #-8
        BNE     R1, R2, Loop
- What about loop termination? (need to transform)
- Output dependencies (need more registers)
ECE 4100/6100 (12)
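The unroll-once transformation above can be checked with a small runnable sketch (Python is used for illustration; the arrays `a`, `c` and constants `X`, `Y` come from the slide, while the function names are introduced here):

```python
def compute_rolled(a, X, Y):
    # Original loop: one element per iteration, one branch per element.
    c = [0] * 8
    for i in range(8):
        c[i] = a[i] * X + Y
    return c

def compute_unrolled(a, X, Y):
    # Unrolled once: two elements per iteration, so the loop
    # test and branch execute half as many times.
    c = [0] * 8
    for i in range(4):
        c[2 * i] = a[2 * i] * X + Y
        c[2 * i + 1] = a[2 * i + 1] * X + Y
    return c
```

Both versions produce identical results; only the amount of loop-control overhead per useful operation changes.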

The Unscheduled and Unrolled Loop
  Loop: L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        L.D     F6, -8(R1)
        ADD.D   F8, F6, F2
        S.D     F8, -8(R1)
        L.D     F10, -16(R1)
        ADD.D   F12, F10, F2
        S.D     F12, -16(R1)
        L.D     F14, -24(R1)
        ADD.D   F16, F14, F2
        S.D     F16, -24(R1)
        DADDUI  R1, R1, #-32
        BNE     R1, R2, Loop
- The L.D/ADD.D/S.D triples correspond to iterations i, i+1, i+2, i+3.
- Modified loop termination condition via merging (a single DADDUI) and removal (of the intermediate BNEs).
- Use of additional registers → register pressure; address compensation in the displacements.
- With stalls: 27 cycles for 4 elements = 6.75 cycles/element.
ECE 4100/6100 (13)

Unrolled and Scheduled Loop
  Loop: L.D     F0, 0(R1)
        L.D     F6, -8(R1)
        L.D     F10, -16(R1)
        L.D     F14, -24(R1)
        ADD.D   F4, F0, F2
        ADD.D   F8, F6, F2
        ADD.D   F12, F10, F2
        ADD.D   F16, F14, F2
        S.D     F4, 0(R1)
        S.D     F8, -8(R1)
        DADDUI  R1, R1, #-32
        S.D     F12, 16(R1)
        S.D     F16, 8(R1)
        BNE     R1, R2, Loop
- Use additional registers to recover concurrency and create ILP (the last two store offsets are compensated because DADDUI executes before them).
- With scheduling: no stalls; 14 cycles for 4 elements = 3.5 cycles/element.
ECE 4100/6100 (14)

Key Elements of Loop Unrolling
- The overhead of loop termination checks and branching is amortized: more useful work is performed per branch.
- There is greater demand for registers (register pressure): more concurrency means more resources!
- Symbolic optimizations: need to recognize memory address dependencies and adjust addresses symbolically.
- Instruction scheduling to hide hazards.
ECE 4100/6100 (15)

Key Elements of Loop Unrolling (cont.)
- Using program transformations to optimize a loop's dynamic behavior is a challenging problem: it is hard to optimize well without detailed knowledge of the range of the iteration.
- In practice, profiling can offer limited help in estimating loop bounds.
- Pragmas can be used to communicate information to the optimizer.
ECE 4100/6100 (16)

Some Pragmatics of Loop Unrolling
Do we always know the loop iteration count? Assume the iteration count is n and is determined only at run time. Key idea: transform the loop into two loops:
- The first loop is unrolled k times and executes n/k times.
- The second loop contains a single copy of the loop body and executes (n mod k) times.
ECE 4100/6100 (17)

Loops with Data Dependent Iteration Count
Original loop:

  for (i = 0; i < n; i++) {
      c[i] = a[i] * X + Y;
  }

Transformed, with m = n/k (the first body is replicated k times):

  for (i = 0; i < m; i++) {
      c[k*i] = a[k*i] * X + Y;
      /* ... body replicated k times ... */
  }
  for (i = 0; i < (n mod k); i++) {
      c[m*k + i] = a[m*k + i] * X + Y;
  }
ECE 4100/6100 (18)
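The two-loop transformation for a run-time trip count can be sketched as runnable code (a Python illustration; the inner `for j` loop stands in for the k textually replicated copies of the body, and the function name is introduced here):

```python
def unrolled_by_k(a, X, Y, k=4):
    # Trip count n is known only at run time.
    n = len(a)
    c = [0] * n
    m = n // k
    # First loop: unrolled k times, executes n/k times.
    for i in range(m):
        for j in range(k):  # stands in for k replicated copies of the body
            c[k * i + j] = a[k * i + j] * X + Y
    # Second loop: a single copy of the body, executes n mod k times.
    for i in range(n % k):
        c[m * k + i] = a[m * k + i] * X + Y
    return c
```

For n = 10 and k = 4, the first loop covers elements 0..7 and the cleanup loop covers 8..9.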

Loop Carried Dependencies
  /* prologue code */
  for (i = 6; i < n; i++) {
      a[i] = a[i-5] * X + Y;   /* dependence distance of 5 */
  }
  /* epilogue code */
- Iteration level parallelism refers to concurrent execution of different iterations of a loop.
- Iteration level parallelism is determined by loop carried dependencies. Consider data dependencies: data produced in one iteration is used in a later iteration, which leads to the concept of dependence distance.
ECE 4100/6100 (19)

Loop Level Parallelism
The goal is to determine those iterations that can be executed in parallel.
- Special case: loop carried dependencies may be in the form of recurrences:

  for (i = 6; i < 100; i++) {
      a[i] = a[i-5] + b[i];
  }

- The dependence distance can be used to determine the degree of unrolling: unrolling can produce independent instructions in the unrolled loop body.
ECE 4100/6100 (20)
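The effect of the dependence distance can be demonstrated with a small sketch (Python for illustration; function names are introduced here): since the distance is 5, any 5 consecutive iterations are mutually independent, so each window of 5 iterations can execute in any order, here reversed.

```python
def recurrence_sequential(a, X, Y):
    # The recurrence from the slide, executed in program order.
    a = list(a)
    for i in range(6, len(a)):
        a[i] = a[i - 5] * X + Y
    return a

def recurrence_windowed(a, X, Y):
    # Dependence distance 5: iterations i..i+4 read only values
    # written at least 5 iterations earlier, so each window of 5
    # iterations can run in any order (reversed here to demonstrate).
    a = list(a)
    i, n = 6, len(a)
    while i < n:
        hi = min(i + 5, n)
        for j in reversed(range(i, hi)):
            a[j] = a[j - 5] * X + Y
        i = hi
    return a
```

Both orders produce the same final array, which is exactly what lets an unroll-by-5 body be scheduled as independent instructions.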

Detecting Loop Carried Data Dependencies
Distinguish between references that may be to the same location and those that definitely are. How do we detect such dependences? Do two index expressions evaluate to the same index for any value of i?

  for (i = 0; i < n; i++) {
      C[i] = B[i**2 + 4] + A[2*i + K] + X;
      A[4*i + D] = B[i + G] + Y;
  }
ECE 4100/6100 (21)

Importance of Detecting Loop Carried Dependencies
- Necessary to detect and exploit parallelism at the level of loop iterations.
- Necessary for generating good instruction schedules.
- Pointers significantly complicate the analysis.
ECE 4100/6100 (22)

A Simple Test
A large class of dependence analysis algorithms assumes that array indices are affine functions of the loop index: X[a*i + b], where a and b are constants and i is the loop index variable. For two references with indices [a*i + b] and [c*i + d], a dependence may exist only if (b - d)/gcd(a, c) is an integer; if the test fails, the references are independent.
ECE 4100/6100 (23)

Increasing Instruction Concurrency
Reduce sequences of dependent instructions via copy propagation:

  X = A          X = A
  Y = Z + X  →   Y = Z + A

Constant propagation is a special case.
ECE 4100/6100 (24)
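The GCD test above is small enough to state directly as code (a Python sketch; the function name is introduced here):

```python
from math import gcd

def may_depend(a, b, c, d):
    # GCD test for affine references X[a*i + b] and X[c*i + d]:
    # a dependence is possible only if gcd(a, c) divides (b - d).
    # Failing the test proves independence; passing it does not
    # prove a dependence exists (the test is conservative).
    return (b - d) % gcd(a, c) == 0
```

For example, X[2*i] vs. X[2*i + 1] touch even vs. odd indices and never collide, so the test reports independence; for X[4*i] vs. X[2*i + 2], gcd(4, 2) = 2 divides -2, so a dependence is possible.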

Increasing Instruction Concurrency (cont.)
Tree height reduction: exploit associativity (use parentheses).
Left-to-right association ((a + b) + c) + d forms a chain of dependent adds:

  ADD.D F0, F2, F4
  ADD.D F6, F8, F0
  ADD.D F10, F12, F6

Balanced association (a + b) + (c + d) makes the first two adds independent:

  ADD.D F0, F2, F4
  ADD.D F6, F8, F10
  ADD.D F12, F0, F6

This increases ILP.
ECE 4100/6100 (25)

Increasing Instruction Concurrency (cont.)
Opportunities for increasing ILP via tree height reduction can increase with loop unrolling. For example, consider the recurrence sum = sum + x. Unrolled:

  sum = sum + x1
  sum = sum + x2
  sum = sum + x3
  sum = sum + x4
  sum = sum + x5

which is sum = sum + x1 + x2 + x3 + x4 + x5 and can be reassociated as:

  sum = ((sum + x1) + (x2 + x3)) + (x4 + x5)
ECE 4100/6100 (26)
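The reassociated recurrence can be checked concretely (a Python sketch; note that for floating point operands reassociation may change rounding, so compilers apply it only when permitted, while for integers it is exact):

```python
def sum_serial(s, x1, x2, x3, x4, x5):
    # Five adds, each depending on the previous one: chain height 5.
    s = s + x1
    s = s + x2
    s = s + x3
    s = s + x4
    s = s + x5
    return s

def sum_reassociated(s, x1, x2, x3, x4, x5):
    # ((s + x1) + (x2 + x3)) + (x4 + x5): the terms (x2 + x3) and
    # (x4 + x5) are independent of s and of each other, cutting the
    # dependence chain height from 5 to 3.
    return ((s + x1) + (x2 + x3)) + (x4 + x5)
```

Both forms compute the same value; only the shape of the dependence tree, and hence the exploitable ILP, differs.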

Study Guide: Glossary
- Constant propagation
- Dependence distance
- Instruction scheduling
- Loop unrolling
- Symbolic optimization
- Loop carried dependency
- Loop level parallelism
- Register pressure
- Tree height reduction
ECE 4100/6100 (27)

Study Guide
- Schedule simple code sequences on the basic 5-stage, multifunction pipeline.
- Be able to detect dependences and schedule to minimize stalls.
- Given a simple loop, unroll and schedule the loop body to minimize stalls.
- Unroll a loop with unknown loop bounds by a factor of k.
- Given a code block, determine if loop carried dependencies exist; be able to state when one cannot make such a determination.
ECE 4100/6100 (28)

Study Guide (cont.)
- Can you provide a code sequence where hardware scheduling will always perform better than a compiler scheduled solution?
- Can you provide a short code sequence where a compiler scheduled solution will never perform worse than a hardware scheduled (Tomasulo) implementation?
ECE 4100/6100 (29)