Software Pipelining by Modulo Scheduling. Philip Sweany, University of North Texas




Overview

- Opportunities for Loop Optimization
- Software Pipelining
- Modulo Scheduling
- Resource and Dependence Constraints
- Scheduling Pipelines
- Limitations
- Enhancements
- Plans

Why Bother?

- Instruction-level parallelism (ILP) requires scheduling.
- 90% of execution time is spent in loops.
- Loops can exhibit significant parallelism across iterations.
- Local and global instruction scheduling don't address parallelism across loop boundaries.

Example Code

    int a[4][4];
    int b[4][4];
    int c[4][4];

    int main(void)
    {
        int i, j, k;

        for (i = 0; i < 4; i++)
            for (j = 0; j < 4; j++) {
                c[i][j] = 0;
                for (k = 0; k < 4; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }
        return 0;
    }

Machine Model (for the example)

- Long instruction word (LIW)
- Separate add and multiply pipes
- Add produces a result in one machine cycle
- Multiply requires two machine cycles to produce a result
- Multiply is pipelined

Local Schedule (192 cycles)

    1. t = a[i][k] * b[k][j]
    2. nop
    3. c[i][j] = t + c[i][j]

Three cycles per innermost iteration, over 4 x 4 x 4 = 64 iterations, gives 192 cycles.

A Pipelined Schedule (144 cycles)

Prelude (executed once per inner loop):

    1. t[0] = a[i][0] * b[0][j]

Loop body (executed for k = 1 to 3; the multiply and add share one long instruction word):

    1. nop
    2. t[k] = a[i][k] * b[k][j]  #  c[i][j] += t[k-1]

Postlude (executed once per inner loop):

    1. nop
    2. c[i][j] += t[3]

Nine cycles per (i, j) pair over 16 pairs gives 144 cycles. Note: we can actually get 96 cycles by overlapping two iterations of the loop in a single 2-cycle loop body.

Software Pipelining Methods

Allan et al. (ACM Computing Surveys, 1995) define two major approaches to software pipelining:

- Kernel recognition unrolls the innermost loop an appropriate number of times and searches for a repeating pattern. This repeating pattern then becomes the loop body.
- Modulo scheduling selects a (minimum-length) schedule for one loop iteration such that, when the schedule is repeated, no constraints are violated. The length of this schedule is called the initiation interval, II.

Iterative Modulo Scheduling

Goal: schedule the loop with the smallest possible II.

- Build the data dependence graph (DDG):
  - Nodes represent operations to be performed.
  - Weighted, directed arcs represent precedence constraints between nodes.
- Compute the resource constraint resII: the smallest II that provides enough resources (e.g., functional units) to satisfy all the resources required by one loop iteration.
- Compute the recurrence constraint recII: determined by the longest dependence cycle in the DDG.
- minII = max(resII, recII); a sketch of this computation follows below.
- Try to schedule with II = minII, then II = minII + 1, minII + 2, ..., until a valid schedule is found.
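
As a concrete (if simplified) illustration of the resII side of this computation, here is a minimal C sketch. The resource classes and the `uses` and `units` arrays are assumptions chosen to match the two-pipe machine model above, and recII is taken as given rather than computed:

    #include <stdio.h>

    #define NUM_RES 2   /* 0 = add pipe, 1 = multiply pipe (assumed) */

    /* resII: each resource must fit one iteration's uses into II cycles,
       so II >= ceil(uses / units); resII is the max over all resources. */
    static int res_ii(const int uses[], const int units[]) {
        int ii = 1;
        for (int r = 0; r < NUM_RES; r++) {
            int need = (uses[r] + units[r] - 1) / units[r];   /* ceiling */
            if (need > ii)
                ii = need;
        }
        return ii;
    }

    int main(void) {
        int uses[NUM_RES]  = { 1, 1 };   /* one add, one multiply per iteration */
        int units[NUM_RES] = { 1, 1 };   /* the example machine: one pipe each */
        int resii = res_ii(uses, units);
        int recii = 2;                   /* pretend recurrence analysis said 2 */
        int minii = resii > recii ? resii : recii;
        printf("resII = %d, recII = %d, minII = %d\n", resii, recii, minii);
        return 0;
    }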

Scheduling a Pipeline

How do you "try to schedule with II = minII, minII + 1, minII + 2, ... until a valid schedule is found"?

- Use a slightly modified local scheduling algorithm: list scheduling, but with slightly modified heuristics for prioritizing DDG nodes.
- Set the schedule length to the II currently being attempted.
- Attempt to fit the operations to be scheduled into II cycles (see the reservation-table sketch below).
- If no schedule is found, backtrack and try again.
- After a sufficient number of backtracking attempts, admit failure and move on to the next II.
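
Below is a minimal sketch of the modulo reservation table such a scheduler consults when fitting operations into II cycles: placing an operation at any cycle reserves row cycle % II, so a slot granted once is honored in every iteration of the kernel. The names and sizes are illustrative assumptions, not details from the talk:

    #include <stdio.h>
    #include <string.h>

    #define MAX_II  64
    #define NUM_RES  2            /* 0 = add pipe, 1 = multiply pipe (assumed) */

    static int units[NUM_RES] = { 1, 1 };     /* one unit of each resource */
    static int mrt[MAX_II][NUM_RES];          /* modulo reservation table */

    /* An operation fits at 'cycle' iff its resource class still has a free
       unit in row cycle % ii; a reservation therefore repeats every ii
       cycles, which is what makes the kernel schedule repeatable. */
    static int fits(int cycle, int res, int ii) {
        return mrt[cycle % ii][res] < units[res];
    }

    static void reserve(int cycle, int res, int ii) {
        mrt[cycle % ii][res]++;
    }

    int main(void) {
        int ii = 2;
        memset(mrt, 0, sizeof mrt);

        if (fits(0, 1, ii)) reserve(0, 1, ii);   /* multiply issued at cycle 0 */
        if (fits(3, 0, ii)) reserve(3, 0, ii);   /* add issued at cycle 3 */

        /* A second multiply at cycle 2 collides: 2 % 2 == 0 is taken. */
        printf("multiply at cycle 2 fits? %s\n", fits(2, 1, ii) ? "yes" : "no");
        return 0;
    }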

Dependence

Operation A is dependent upon operation B iff B precedes A in execution and both operations access the same storage location.

Four basic types of dependence:

- True dependence (RAW: read after write)
- Anti-dependence (WAR: write after read)
- Output dependence (WAW: write after write)
- Input dependence (RAR: read after read)

Input dependence is usually ignored in scheduling. (An illustration in code follows below.)
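
A small illustration (not from the slides) of the four dependence types, using scalar variables as the shared locations:

    void dependence_example(void) {
        int r1, r2, r3 = 1, r4, r5 = 2, r6 = 3;

        r1 = r3 + r5;   /* S1 */
        r4 = r1 * 2;    /* S2: true dependence (RAW) - reads r1 written by S1 */
        r3 = r6 + 1;    /* S3: anti-dependence (WAR) - writes r3 read by S1 */
        r1 = r6 - 1;    /* S4: output dependence (WAW) - rewrites r1 from S1 */
        r2 = r6 + 2;    /* S5: input dependence (RAR) - reads r6, as S4 did */

        (void)r1; (void)r2; (void)r3; (void)r4;
    }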

Dependence (cont.)

- Dependences are represented in the DDG.
- Two groups of dependences:
  - Loop-independent dependences
  - Loop-carried dependences
- DDG arcs are decorated with a pair of integers, (dist, min):
  - dist is the difference in iterations between the head and tail of the dependence.
  - min is the latency required between the two nodes.
- Dependence analysis is quite complicated, but techniques do exist.
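
A note not spelled out on the slide, but standard in modulo scheduling: an arc from A to B labeled (dist, min) constrains the kernel schedule so that time(B) - time(A) >= min - dist * II, where time(X) is the cycle in which X issues within one iteration's schedule. Loop-independent arcs have dist = 0; loop-carried arcs have dist >= 1, which is why a larger II relaxes them. This quantity, min - dist * II, is exactly what the distance tables below tabulate.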

Start of Distance Table, II = 1

         1    2    3    4    5    6    7
    1    -    1    -    -    -    -    -
    2    -    -    0    -    -    1    -
    3    -    -    -    1    -    -    -
    4    -    -    -    -    1    -    -
    5    0    -    -    -    -    -    -
    6    -    -    -    -    -    -    1
    7    -    -    -    -    -   -1    -

Distance Table, II = 1

         1    2    3    4    5    6    7
    1    3    4    4    5    6    8    9
    2    2    3    3    4    5    7    8
    3    2    3    3    4    5    7    8
    4    1    2    2    3    4    6    7
    5    3    4    4    5    6    8    9
    6    -    -    -    -    -    0    1
    7    -    -    -    -    -   -1    0
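
(A reading of these tables, stated here rather than on the original slides: the start table records, for each DDG arc A -> B, the weight min - dist * II; the full distance table is its transitive closure, each entry giving the longest weighted path from the row node to the column node. A positive entry on the diagonal means a node would have to issue after itself, so the candidate II is infeasible. Here the diagonal entries 3, 3, 3, 3, and 6 rule out II = 1; the II = 2 closure below has no positive diagonal entry, so II = 2 is feasible.)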

Start of Distance Table, II = 2

         1    2    3    4    5    6    7
    1    -    1    -    -    -    -    -
    2    -    -   -1    -    -    1    -
    3    -    -    -    0    -    -    -
    4    -    -    -    -    1    -    -
    5   -1    -    -    -    -    -    -
    6    -    -    -    -    -    -    1
    7    -    -    -    -    -   -3    -

Distance Table, II = 2

         1    2    3    4    5    6    7
    1    0    1    0    0    1    2    3
    2   -1    0   -1   -1    0    1    2
    3    0    1    0    0    1    2    3
    4    0    1    0    0    1    2    3
    5   -1    0   -1   -1    0    1    2
    6    -    -    -    -    -   -2    1
    7    -    -    -    -    -   -3   -2

Start of Distance Table, II = 3

         1    2    3    4    5    6    7
    1    -    1    -    -    -    -    -
    2    -    -   -2    -    -    1    -
    3    -    -    -    0    -    -    -
    4    -    -    -    -   -1    -    -
    5   -2    -    -    -    -    -    -
    6    -    -    -    -    -    -    1
    7    -    -    -    -    -   -5    -

Distance Table, II = 3

         1    2    3    4    5    6    7
    1   -4    1   -1   -1   -2    2    3
    2   -5   -4   -2   -2   -3    1    2
    3   -3   -2   -4    0   -1   -1    0
    4   -3   -2   -4   -4   -1   -1    0
    5   -2   -1   -3   -3   -4    0    1
    6    -    -    -    -    -   -4    1
    7    -    -    -    -    -   -5   -4
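
The full distance table can be computed from the start table by an all-pairs longest-path pass (Floyd-Warshall with max in place of min), rejecting a candidate II whenever the closure puts a positive value on the diagonal. A minimal C sketch, hard-coding the example's II = 2 start table with node numbering shifted to 0-based; this is the standard formulation, not code from the talk:

    #include <stdio.h>

    #define N     7
    #define NONE -999        /* marker for "no path" */

    int main(void) {
        int d[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                d[i][j] = NONE;

        /* Start table for II = 2: weight(A->B) = min - dist * II */
        d[0][1] =  1;  d[1][2] = -1;  d[1][5] =  1;
        d[2][3] =  0;  d[3][4] =  1;  d[4][0] = -1;
        d[5][6] =  1;  d[6][5] = -3;

        /* Floyd-Warshall, maximizing path weight instead of minimizing */
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    if (d[i][k] != NONE && d[k][j] != NONE &&
                        d[i][k] + d[k][j] > d[i][j])
                        d[i][j] = d[i][k] + d[k][j];

        /* A positive diagonal entry means some node must issue after
           itself, so the candidate II is infeasible. */
        int feasible = 1;
        for (int i = 0; i < N; i++)
            if (d[i][i] != NONE && d[i][i] > 0)
                feasible = 0;
        printf("II = 2 is %s\n", feasible ? "feasible" : "infeasible");
        return 0;
    }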

Limitations of Modulo Scheduling

- Control flow makes things difficult:
  - Modulo scheduling requires a single DDG.
  - If-conversion (Warter) can be used, but the graph explodes quickly.
  - Lam's hierarchical reduction is another method, but not without problems.
- Software pipelining in general increases register needs dramatically, and modulo scheduling exacerbates this problem:
  - Techniques exist to minimize the required registers but cannot guarantee fitting in the machine's register set.
  - To address register pressure, some schedulers increase II; others spill.
- Compile time is a problem, particularly computing the distance matrix.

Improved (?) Modulo Scheduling

- Unroll-and-jam on nested loops can significantly shorten a loop's execution time, at the cost of code size and register usage.
- A cache-reuse model can give better schedules than assuming all cache accesses are hits, and can reduce register requirements compared with assuming all accesses are misses.
- A plan to software pipeline with a fixed number of registers, in effect spilling while scheduling.

Unroll-and-Jam

- Requires nested loops.
- Unroll the outer loop and jam the resulting inner loops back together.
- Introduces more parallelism into the inner loop, and it will not lengthen any recurrence cycles.
- Using unroll-and-jam on 26 FORTRAN nested loops before performing modulo scheduling:
  - Decreased loop execution time by up to 94.2%; on average, execution time decreased by 56.9%.
  - Increased register requirements significantly, often by a factor of 5.

Example Code with Unroll-and-Jam

    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j += 2) {
            c[i][j] = 0;
            c[i][j+1] = 0;
            for (k = 0; k < 4; k++) {
                c[i][j]   += a[i][k] * b[k][j];
                c[i][j+1] += a[i][k] * b[k][j+1];
            }
        }

Using Cache Reuse Information

- Caches give the compiler uncertain memory latencies.
- Local scheduling typically assumes all cache accesses are hits.
- Assuming all hits leads to stalls and slower execution when accesses miss.
- Typical modulo schedulers instead assume all cache accesses are misses, since modulo scheduling can effectively hide the long latency of a miss.
- But assuming misses stretches register lifetimes and thus leads to greater register requirements.

A sketch of a per-reference latency choice follows below.
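
One plausible shape for such a reuse-driven latency choice, sketched in C; the structure, flags, and latency constants are illustrative assumptions, not details from the talk:

    #include <stdio.h>

    enum { HIT_LATENCY = 2, MISS_LATENCY = 20 };   /* assumed cycle counts */

    /* Stand-in for a locality analysis: has_reuse would be set by checking
       for self- or group-reuse of the reference within the loop nest. */
    struct mem_ref { const char *name; int has_reuse; };

    /* Give the scheduler a per-reference latency instead of a blanket
       all-hit or all-miss assumption: short latencies keep register
       lifetimes short, long ones let modulo scheduling hide the miss. */
    static int assumed_latency(const struct mem_ref *ref) {
        return ref->has_reuse ? HIT_LATENCY : MISS_LATENCY;
    }

    int main(void) {
        struct mem_ref a = { "a[i][k]", 1 };   /* stride-1 row walk: likely hit */
        struct mem_ref b = { "b[k][j]", 0 };   /* column walk: likely miss */
        printf("%s: latency %d\n", a.name, assumed_latency(&a));
        printf("%s: latency %d\n", b.name, assumed_latency(&b));
        return 0;
    }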

Cache Reuse Experiment

Using a simple cache-reuse model, our modulo scheduler:

- Improved execution time by roughly 11% over an all-hit assumption, with little change in register usage.
- Used 17.9% fewer registers than an all-miss assumption, while generating 8% slower code.

Summary

- Modulo scheduling can lead to dramatic speed improvements for single-block loops.
- Modulo scheduling does have limitations, namely difficulty with control flow and (potentially) massive register requirements.
- At MTU we discovered enhancements that make modulo scheduling even more effective and practical, and in one case improved upon the "optimal" schedule.