Software Pipelining Showdown: Optimal vs. Heuristic Methods in a Production Compiler

John Ruttenberg*, G. R. Gao†, A. Stoutchinin†, and W. Lichtenstein*

*Silicon Graphics Inc., 2011 N. Shoreline Blvd., Mountain View, CA
†McGill University School of Computer Science, 3480 University St., McConnell Building, Room 318, Montreal, Canada H3A 2A7

Abstract

This paper is a scientific comparison of two code generation techniques with identical goals: generation of the best possible software pipelined code for computers with instruction level parallelism. Both are variants of modulo scheduling, a framework for generation of software pipelines pioneered by Rau and Glaeser [RaGl81], but are otherwise quite dissimilar. One technique was developed at Silicon Graphics and is used in the MIPSpro compiler. This is the production compiler for SGI's systems which are based on the MIPS R8000 processor [Hsu94]. It is essentially a branch-and-bound enumeration of possible schedules with extensive pruning. This method is heuristic because of the way it prunes and also because of the interaction between register allocation and scheduling.

The second technique aims to produce optimal results by formulating the scheduling and register allocation problem as an integrated integer linear programming (ILP[1]) problem. This idea has received much recent exposure in the literature [AlGoGa95, Feautrier94, GoAlGa94a, GoAlGa94b, Eichenberger95], but to our knowledge all previous implementations have been too preliminary for detailed measurement and evaluation. In particular, we believe this to be the first published measurement of runtime performance for ILP based generation of software pipelines.

A particularly valuable result of this study was evaluation of the heuristic pipelining technology in the SGI compiler. One of the motivations behind the McGill research was the hope that optimal software pipelining, while not in itself practical for use in production compilers, would be useful for their evaluation and validation. Our comparison has indeed provided a quantitative validation of the SGI compiler's pipeliner, leading us to increased confidence in both techniques.

[1] It is unfortunate that both instruction level parallelism and integer linear programming are abbreviated ILP in the literature. To clarify, we always use the abbreviation ILP to mean the latter and will suffer through describing instruction level parallelism by its full name.

1.0 Introduction

1.1 Software pipelining

    Little work has been done to evaluate and compare alternative algorithms and heuristics for modulo scheduling from the viewpoints of schedule quality as well as computational complexity. This, along with a vague and unfounded perception that modulo scheduling is computationally expensive as well as difficult to implement, have inhibited its incorporation into product compilers.
    -- B. Ramakrishna Rau [Rau94]
Software pipelining is a coding technique that overlaps operations from various loop iterations in order to exploit instruction level parallelism. In order to be effective, software pipelined code must take account of certain constraints of the target processor, namely:
1. instruction latencies,
2. resource availability, and
3. register restrictions.
Finding code sequences that satisfy constraints (1) and (2) is called scheduling. Finding an assignment of registers to program symbols and temporaries is called register allocation. The primary measure of quality of software pipelined code is called the II, short for iteration or initiation interval. In order to generate the best possible software pipelined code for a loop, we must find a schedule with the minimum possible II, and we must also have a register allocation for the values used in the loop that is valid for that schedule. Such a schedule is called rate-optimal. The problem of finding rate-optimal schedules is NP-complete [GaJo79], a fact that has led to a number of heuristic techniques [DeTo93, GaSc91, Huff93, MoEb92, Rau94, Warter92, Lam88, AiNi88] for generation of optimal or near optimal software pipelined code. [RaFi93] contains an introductory survey of these methods.
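Although the formal machinery is deferred to the cited literature, the timing model behind these constraints is compact enough to state here. The notation below is ours, following the standard modulo scheduling literature [RaGl81, RaFi93]; it is not reproduced from this paper.

```latex
% Standard modulo-scheduling timing model (notation ours).
% Operation i of iteration k issues at time
\[
  t(i,k) = \sigma_i + k \cdot II ,
\]
% so a dependence edge i -> j with latency \lambda_i carried across
% \delta_{ij} iterations is respected exactly when
\[
  \sigma_j - \sigma_i \;\ge\; \lambda_i - \delta_{ij} \cdot II .
\]
% The usual lower bound on the II, used throughout Section 2, combines
% resource and recurrence limits:
\[
  MinII = \max(ResMII,\, RecMII), \quad
  ResMII = \max_r \left\lceil \frac{N_r}{R_r} \right\rceil, \quad
  RecMII = \max_{C} \left\lceil
      \frac{\sum_{i \in C} \lambda_i}{\sum_{e \in C} \delta_e}
  \right\rceil ,
\]
% where N_r operations compete for R_r copies of resource r, and C
% ranges over the dependence cycles of the loop body.
```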
1.2 The allure of optimal techniques

Recently, motivated by the critical role of software pipelining in high performance computing, researchers have become interested in non-heuristic methods, ones that guarantee the optimality of the solutions they find. The key observation is that the problem can be formulated using integer linear programming (ILP), a well known framework for solving NP-complete problems [Altman95, AlKePoWa83]. Given such a formulation, the problem can be given to one of a number of standard ILP solving packages. Because this framework has been effective in solving other computationally difficult problems [NeWo88, Nemhauser94, Pugh91, BiKeKr94], it is hoped that it can also be useful for software pipelining.

1.3 The showdown

The rest of this paper presents a detailed comparison of two software pipelining implementations, one heuristic and one optimal. The heuristic approach is represented by the Silicon Graphics MIPSpro compiler. The optimal approach is represented by MOST, an ILP-based pipeliner developed at McGill University. We adapted the MOST scheduler to the MIPSpro compiler so that it could act as an exact functional replacement for the original heuristic scheduler. This work allowed us to perform a very fair and detailed comparison of the two approaches.

What did we discover? In short, the SGI heuristics were validated by this study. Comparison with MOST was an effective way to evaluate SGI's production quality pipeliner. In particular, we discovered that the optimal approach is only very rarely able to find a better II than the heuristic approach, and benchmarks compiled with the optimal techniques did not clearly perform better than those compiled with the heuristic pipeliner. In the remainder of this paper we describe the two approaches in detail, present our experimental results, draw some conclusions from these results, and try to shed some intuitive light on the results.

2.0 The heuristic approach: software pipelining at Silicon Graphics

2.1 MIPSpro compiler

Starting in 1990, Silicon Graphics designed a new microprocessor aimed at the supercomputing market, called the R8000 [Hsu94], which shipped with SGI's Power Challenge and Power Indigo2 computers. It is an in-order 4-issue superscalar RISC processor featuring fully pipelined floating point and memory operations. As part of this project, a new compiler was implemented that had the basic goal of generating very high quality software pipelined inner loop code. This compiler shipped under the name MIPSpro Compiler with R8000 based systems starting in late summer of 1994. At the time the system shipped, it had the best reported performance on a number of important benchmarks, in particular for the floating point SPEC92. Our measurements show the compiler's software pipelining capabilities play the central role in delivering this performance on the R8000. (See Figure 2.)

The MIPSpro compiler performs a rich set of analysis and optimization before its software pipelining phase. These fall into three distinct categories:
1. high level loop analysis and transformations, including:
   a. array dependence analysis,
   b. loop interchange, and
   c. outer loop unrolling;
2. classical intermediate code optimizations, such as:
   a. common subexpression elimination,
   b. copy propagation,
   c. constant folding, and
   d. strength reduction;
3. special inner loop optimizations and analysis in preparation for software pipelining, in particular:
   a. if-conversion to convert loops with internal branches to a form using conditional moves [AlKePoWa83, DeTo93],
   b. interleaving of register recurrences such as summations or dot products,
   c. inter-iteration common memory reference elimination, and
   d. data dependence graph construction.

2.2 Modulo scheduling

The framework for the MIPSpro compiler's pipeliner is modulo scheduling, a technique pioneered by Bob Rau and others and very well described in the literature [Lam88, RaFi93, Lam89]. Modulo schedulers search for possible schedules by first fixing an iteration interval (II) and then trying to pack the operations in the loop body into the given number of cycles. If this is unsuccessful for one of a number of reasons, a larger II may be tried.
The SGI scheduler contains a number of extensions and elaborations. Five of the most important are:
1. binary instead of linear search of the IIs,
2. branch-and-bound ordering of the search for a schedule, with a great deal of heuristic pruning,
3. use of multiple heuristic orderings in (2) to facilitate register allocation,
4. generation of spill code when required to alleviate register pressure, and
5. pairing of memory references to optimize memory system utilization.
As we describe SGI's pipeliner in the following sections, it will be useful to keep these in mind.

2.3 Modulo scheduling using binary search

The SGI modulo scheduler searches the space of possible IIs using a binary instead of linear search.[1] This has no measurable impact on output code quality, but can have a dramatic impact on compile speed. The search is bounded from below by MinII, a loose lower bound based on the resources required and any dependence cycles in the loop body [RaGl81]. The search is bounded from above by an arbitrary maximum, MaxII = 2·MinII. We set this maximum as a sort of compile speed circuit breaker, under the assumption that software pipelining has little advantage over traditional scheduling once this bound is exceeded. In practice, this has proven to be a reasonably accurate assumption. Our search also makes the heuristic assumption that if we can find a schedule at II, we will also be able to find one at II+1. Although it is possible to construct counterexamples to this in theory, we have yet to encounter one in the course of fairly extensive testing.

The search is designed to favor the situation where we can schedule at or close to MinII, as these cases are overwhelmingly common. A simple binary search would be wasteful in cases where a schedule could be found within a few cycles of the MinII. Instead, we use a search with two distinct phases:
1. Exponential backoff. Initially, the search attempts to establish an upper bound at which a schedule exists. During this phase we perform an exponential backoff from MinII, searching successively MinII, MinII+1, MinII+2, MinII+4, MinII+8, ... until we either find a schedule or exceed MaxII. If a schedule is found with II ≤ MinII+2, there are no better IIs left to search and the schedule is accepted. If no schedule is found, our modulo scheduling attempt has failed. (But we may still be able to find a software pipelined schedule by introducing spills and trying again. See Section 2.8.)
2. Binary search. If phase 1 is successful and a schedule is found, but with II > MinII+2, a binary search is used over the space of feasible IIs. Information from phase 1 allows us to tighten both the upper and lower bounds for the search. For the lower bound, we use the largest II for which phase 1 tried to schedule and failed. For the upper bound, we use the II for which phase 1 succeeded in finding a schedule. These bounds are tighter than MinII and MaxII, which are used to bound phase 1.
In special situations the two-phase approach is abandoned in favor of a simple binary search. This is done when there is reason to be pessimistic about being able to software pipeline. In particular, simple binary search is used exclusively after spills are introduced into the code. (See Section 2.8.)

[1] The use of binary search in this context has a fairly long history in the literature. Touzeau described the AP-120 and FPS-164 compilers and explains how binary search is used [Tou84]. Lam pointed out that being able to find a schedule at II does not imply being able to find a schedule at II+1 and used this to explain why her compiler used linear search [Lam88].
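The two-phase search can be summarized in a short sketch. This is our reconstruction, not SGI's source; `try_to_schedule` stands for the whole branch-and-bound scheduler of Section 2.4 and is passed in as a callback.

```python
def search_ii(try_to_schedule, loop, min_ii, max_ii):
    """Two-phase II search: exponential backoff, then binary search.

    A sketch of Section 2.3. It relies on the paper's heuristic
    assumption that schedulability is monotone in the II.
    """
    # Phase 1: try MinII, MinII+1, MinII+2, MinII+4, MinII+8, ...
    lo = min_ii - 1                 # largest II known to have failed
    best, best_ii = None, None
    offset = 0
    while min_ii + offset <= max_ii:
        ii = min_ii + offset
        s = try_to_schedule(loop, ii)
        if s is not None:
            best, best_ii = s, ii
            break
        lo = ii
        offset = 1 if offset == 0 else 2 * offset
    if best is None:
        return None                 # caller may spill and retry (Section 2.8)
    if best_ii <= min_ii + 2:
        return best                 # MinII..MinII+2 were tried consecutively

    # Phase 2: binary search between the bounds established by phase 1.
    hi = best_ii
    while hi - lo > 1:
        mid = (lo + hi) // 2
        s = try_to_schedule(loop, mid)
        if s is not None:
            best, hi = s, mid
        else:
            lo = mid
    return best
```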
2.4 Enumerating possible schedules with pruning

The process of searching for a schedule at a given II is viewed as an enumeration of the possible schedules for that II. This enumeration can be accomplished with a branch-and-bound algorithm whose outlines should be familiar. The algorithm is presented in its fully exponential form in Figure 1. This is an exponential algorithm and is not practical in its unpruned form. In order to make it useful, it is necessary to prune away much of the search. Fortunately, this can be done without losing too much, since much of the backtracking it does is extremely unlikely to accomplish anything. For example, consider two operations with unrelated resource requirements which are also unrelated by data precedence constraints. Suppose one of these has been scheduled and the second one fails to schedule. What good can possibly come of allowing a backtrack from one to the other? They have nothing to do with one another, and moving one of them in the schedule cannot make it possible to schedule the other. Or consider the example of two fully pipelined operations with identical resource requirements and unrelated data precedence. Suppose the second (on the priority list) of these fails to schedule. Can it help to move the first one? Any other place we put it is a place the second operation would have found if still available.

The assumption that two operations are unrelated by data precedence is less important than might appear at first. Because we are modulo scheduling, any two operations that are not in the same strongly connected component can occupy any two cycles in the schedule; it's just a matter of adjusting the pipestages. (See Section 2.5 below.)

Step (4) in Figure 1 guides the backtracking of the algorithm. We will say that it chooses a catch point for the backtracking. The catch point is a scheduled operation that will advance to the next cycle of its legal range after all the operations following it on the priority list have been unscheduled. Pruning is accomplished by strengthening the requirements about which operations can be catch points.

1. Make a list of the operations to be scheduled, L0..Ln-1.
2. After L0..Li-1 have been scheduled, attempt to schedule Li as follows:
   a. Calculate a legal range of cycles in which Li may be scheduled.
      If Li is not part of a nontrivial strongly connected component in the data precedence graph, its legal range is calculated by considering any of its direct predecessors and successors that have already been scheduled: Li must be scheduled late enough so that precedence arcs from its predecessors are not violated and early enough so that any arcs to its successors are not violated. For operations that are part of nontrivial strongly connected components, we need to consider all scheduled members of Li's strongly connected component. A longest-path table is kept and used to determine the number of cycles by which two members must precede or follow each other. The legal range is cut off to be no more than II cycles, since searching more than this number of cycles would just reconsider the same schedule slots pointlessly.
   b. Consider the cycles in Li's legal range in order. If a cycle is found in which Li may be scheduled without violating resource constraints, schedule Li in that cycle. If no such cycle is found, we have failed in this attempt to schedule Li.
3. If (2) is successful, resume scheduling the next element of L in step (2), or if all the operations are scheduled, terminate with success.
4. If (2) is unsuccessful, find the largest j < i such that Lj's legal range is not exhausted. Unschedule all the operations Lj..Li-1. If there is no such j, the entire scheduling attempt has failed. Otherwise, set i = j and resume scheduling at step (2) with the next cycle of Li's legal range.

FIGURE 1. Branch-and-bound enumeration of all possible schedules at a given II.

The SGI pipeliner prunes by imposing the following constraints on backtracking:
1. Only the first listed element of a strongly connected component can catch.
2. Operation j may catch the backtrack caused by the scheduling failure of operation i only if the resources required by i and j are non-identical and unscheduling j makes it possible to schedule i.
3. If no catch point is found under constraint (2), a slightly looser constraint is used. Under this constraint, operation j may catch the backtrack of operation i even if i and j have identical resources: unschedule all the operations starting with j on the priority list; if i can now be scheduled, but in a different schedule slot than j, then j may catch the backtrack.
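For concreteness, here is the unpruned enumeration of Figure 1 as executable code. This is our reconstruction, not SGI's implementation: the `legal_range` and `fits` callbacks stand for steps 2a and 2b, and the pruning constraints above would further restrict which operations may catch.

```python
def bb_schedule(ops, ii, legal_range, fits):
    """Branch-and-bound enumeration of Figure 1 (sketch, unpruned).

    ops         -- the priority list L0..Ln-1
    legal_range -- legal_range(op, placed, ii): candidate cycles derived
                   from already-placed neighbours, at most II of them (2a)
    fits        -- fits(op, cycle, placed, ii): modulo-II resource check (2b)
    Returns {op: cycle} on success, None if the whole attempt fails.
    """
    n = len(ops)
    placed = {}                      # op -> cycle for every scheduled op
    iters = [None] * n               # per-op iterator over its legal range

    def advance(k):
        # Move ops[k] to the next cycle of its legal range that fits.
        for cycle in iters[k]:
            if fits(ops[k], cycle, placed, ii):
                placed[ops[k]] = cycle
                return True
        return False                 # legal range exhausted

    i = 0
    while i < n:
        iters[i] = iter(legal_range(ops[i], placed, ii))
        if advance(i):
            i += 1
            continue
        # Step 4: backtrack to the nearest earlier operation whose range
        # is not exhausted (the catch point); SGI's pruning restricts
        # which operations are allowed to catch.
        while True:
            i -= 1
            if i < 0:
                return None          # every prefix exhausted: attempt failed
            placed.pop(ops[i], None)
            if advance(i):           # catch point moves to its next cycle
                i += 1
                break
    return placed
```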
2.5 Adjusting the pipestages

Our branch-and-bound algorithm does not require that the priority list be a strict topological ordering of the data dependence graph. The legal range calculation in Figure 1 only takes account of such predecessors and successors of an operation as are actually scheduled already, i.e., ahead of the current operation on the priority list. Why doesn't this result in schedules that violate data precedence constraints?

The answer is that it does. A simple postpass is used to compensate, ensuring validity. The postpass performs a single depth-first search of the strongly connected component tree, starting with the roots (operations with no successors, such as stores) and proceeding to predecessors. This visitation is topological: when each strongly connected component is visited, all its successors have already been visited, and their times in the schedule made legal in terms of their successors. Now it may be required to move the component to an earlier time in the schedule in order to make its results available on time to its predecessors. This can be done without changing the resource requirements of the schedule as a whole by moving the component by multiples of II. Once this has been done for every strongly connected component, we have a schedule that is valid both in terms of its resource and data precedence constraints.

Of course, this process may have the negative effect of lengthening live ranges. It is better from a register allocation perspective to use a topological ordering for the schedule priority list. But sometimes that is untenable in terms of actually finding a schedule. Fortunately the scheme of using multiple priority heuristics ensures that both topological and non-topological approaches can be applied to every piece of code, in search of what works best. (See Section 2.7.)

2.6 Register allocation and modulo renaming

Once a legal schedule is found, an attempt is made to allocate registers for it using fairly well known techniques. The R8000 has no explicit support for software pipelining; in particular, it has only conventionally accessed registers. Lam describes a technique called modulo renaming that allows effective pipelining with conventional architectures by replicating the software pipeline and rewriting the register references in each replicated instance [Lam89]. The SGI pipeliner borrows this technique. The modulo renamed live ranges so generated serve as input to a standard global register allocator that uses the Chaitin-Briggs algorithm with minor modifications [BrCoKeTo89, Briggs92].
2.7 Multiple scheduling priorities and their effect on register allocation

Early on, the SGI team discovered that the ordering of operations on the schedule priority list has a significant impact on whether the schedules found can be register allocated. On reflection this is not very surprising. For example, a priority list that is a backward topological sort of the data dependence graph has the advantage that each operation is considered for scheduling only after all of its uses have already been placed in the schedule. When this is the case, we know the latest point in the schedule at which it is valid to place the operation and still have its result available in time for its uses. We can shorten live ranges by trying to place each operation as low in the schedule as possible, given this limitation.

A more interesting discovery was that different search orders seem to work well with different loop bodies. No single search order works best with all loop bodies. Backward topological search order, as described in the previous paragraph, works well in many cases since it tends to group the operands of each operation close together, shortening live ranges from their beginnings. On the other hand, we have found cases where forward topological search order works better. The reason is exactly symmetric to the one given in the previous paragraph: forward orders allow us to group all the uses of each operation close together and as high as possible in the schedule, thus shortening live ranges from their ends. This can be particularly useful in some codes with common subexpressions of high degree. In some cases, shortening the live ranges has to take a back seat to producing code at all. In loops with many operations that are not fully pipelined, or with large strongly connected components, it is crucial to move these to the head of the scheduling list.

What scheduling heuristic could take account of so many different factors? Perhaps it would be possible to devise such a heuristic, but instead of searching for the one right heuristic, the SGI team took a different approach: a number of different scheduling list heuristics would be tried on every loop, ordered by likelihood to succeed. In common cases only one heuristic would be tried. Subsequent heuristics need only do well in cases where the others are weak. The MIPSpro pipeliner uses 4 scheduling priorities. Experimental results show that lower quality results are obtained if any one is omitted. (See Section 4.2.) Here is a brief description of the two primary ordering heuristics:
1. Folded depth-first ordering. In the simple cases, this is just a depth-first ordering starting with the roots (stores) of the calculation. But when there are difficult-to-schedule operations or large strongly connected components, they are folded and become virtual roots. Then the depth-first search proceeds outward from the fold points, backward to the leaves (loads) and forward to the roots (stores).
2. Data precedence graph heights. The operations are ordered in terms of the maximum sum of the latencies along any path to a root.
These two fundamental heuristics are modified in one or both of two ways to derive additional heuristics:
1. Reversal: the list can be reversed. Forward scheduling is particularly useful with the heights heuristic.
2. A final memory sort: pulling stores with no successors and loads with no predecessors to the end of the list.
The four heuristics actually in use in the MIPSpro compiler are:
1. FDMS: folded depth-first ordering with a final memory sort,
2. FDNMS: folded depth-first ordering with no final memory sort,
3. HMS: heights with a final memory sort, and
4. RHMS: reversed heights with a final memory sort.
See Section 4.2 for experimental results that show the complementary effect of these four.
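The resulting driver can be sketched as follows. This is our paraphrase, not SGI's implementation; every callable here is an assumed interface, and the builder names correspond to the four heuristics above.

```python
def pipeline_with_heuristics(loop, order_builders, schedule, reg_alloc):
    """Try each priority-list heuristic in turn, ordered by likelihood
    of success, keeping the best schedule that register-allocates.
    A sketch of Section 2.7."""
    best = None
    for build_order in order_builders:          # e.g. [FDMS, FDNMS, HMS, RHMS]
        s = schedule(loop, build_order(loop))   # II search of Section 2.3
        if s is not None and reg_alloc(s):
            if best is None or s.ii < best.ii:
                best = s
            if best.ii == loop.min_ii:          # cannot be beaten; stop early
                break
    return best
```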
2.8 Spilling

The SGI pipeliner has the ability to spill in order to alleviate register pressure. If a modulo scheduling pass is unsuccessful because of failure to register allocate, spills and restores are added to the loop and another attempt is made. Spills are added exponentially: the first modulo scheduling failure results in a single value being spilled; the second failure spills two additional values; the third spills 4 more, and so on. The process is capped at 8 modulo scheduling failures, implying that up to 255 values may be spilled before giving up. In practice this limit is never reached. Spill candidates are chosen by looking at the best schedule that failed to register allocate. For each live range, a ratio is calculated: the number of cycles spanned divided by the number of references. The greater this ratio, the greater the cost and the smaller the benefit of keeping the value in a register. Thus values with the largest ratios are spilled first.
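The retry policy condenses into a short sketch. This is our reconstruction; `schedule_and_allocate`, `insert_spill`, and the live-range records are assumed interfaces, not MIPSpro names.

```python
def pipeline_with_spills(loop, schedule_and_allocate, insert_spill):
    """Exponential spilling as in Section 2.8 (sketch).

    Failures spill 1, 2, 4, ... additional values; after the cap of 8
    failed passes at most 1 + 2 + ... + 128 = 255 values were spilled.
    """
    for failures in range(8):
        result = schedule_and_allocate(loop)
        if result is not None:
            return result
        # Rank live ranges of the best failed schedule by
        # (cycles spanned / number of references): a large ratio means a
        # long-lived, rarely used value, the cheapest thing to spill.
        victims = sorted(loop.failed_live_ranges,
                         key=lambda lr: lr.cycles_spanned / lr.num_refs,
                         reverse=True)
        for lr in victims[: 2 ** failures]:
            insert_spill(loop, lr)
    return None                       # give up on modulo scheduling
```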
2.9 Memory bank optimization

The MIPS R8000 is one of the simplest possible implementations of an architecture supporting more than one memory reference per cycle. The processor can issue two references per cycle, and the memory (specifically the second level cache, which is directly accessed by floating point memory references) is divided into two banks of double-words: the even address bank and the odd address bank. If two references in the same cycle address opposite banks, then both references are serviced immediately. If two references in the same cycle both address the same bank, one is serviced immediately, and the other is queued for service in a 1-element queue called the bellows. If this hardware configuration cannot keep up with the stream of references, the processor stalls. In the worst case there are two references every cycle all addressing the same bank, and the processor stalls once on each cycle, so that it ends up running at half speed.

Scheduling to avoid memory bank conflicts must address two issues, which are independent of the details of any particular banked memory system. First, at compile time we rarely have complete information on the bank assignments of references. For example, the relative banks of consecutive elements of a row of a two-dimensional Fortran array depend on the leading dimension of the array, which is often not a compile-time constant. Second, modifying a schedule to enforce rules about the relative bank assignments of nearby references may increase register pressure, and therefore increase the length of the schedule more than the savings in memory stalls shortens it.

The MIPSpro heuristic attempts to find known even-odd pairs of references to schedule in the same cycle; it does not model the bellows feature of the memory system. Before scheduling begins, but after priority orders have been calculated, for each memory reference m it forms the prioritized list L(m) of all other references m' for which (m, m') is known to be an even-odd pair. If L(m) is nonempty then we say m is pairable. Given the iteration interval, the number of memory references, and the number of pairable references, we can tell how many references should ideally be scheduled in known pairs, and how many references will have to be scheduled together even though they may turn out at runtime to reference the same bank. Until enough pairs have been scheduled together, whenever the compiler schedules a pairable memory reference m, it immediately attempts to schedule the first possible unscheduled element m' of L(m) in the same cycle as m. If this fails, it tries the following in order: try to schedule another element of L(m) in the same cycle as m, or try scheduling m in a different cycle, or backtrack and try changing the scheduling of earlier operations in scheduling priority order. Note that this process may change the priority ordering of scheduling, since memory references with higher priority than m' are passed over in favor of scheduling m' with m.

The MIPSpro scheduler has two methods of controlling the impact of memory bank scheduling on register pressure. First, it measures the amount that priorities are changed due to pairing attempts during the entire scheduling process. If this measurement is large enough, and if register allocation fails, it tries scheduling again with reduced pairing requirements. Second, in the adjusting-pipestages part of scheduling (see Section 2.5), preserving pairing may require issuing a load one or more pipestages earlier than would be necessary for simple legality of a schedule. Again, register allocation history is used to guide policy: if there has been trouble allocating registers, the scheduler is less willing to add pipestages to preserve pairing.

Finally, since the minimal II schedule found first may not be best once memory stalls are taken into account, the algorithm makes a small exploration of other schedules at the same or slightly larger II, searching for schedules with provably better stalling behavior.
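The pairing attempt itself is a small amount of control logic. The sketch below is our loose paraphrase (the `L` lists and placement callbacks are assumed interfaces, and the fallback steps are only named, not implemented):

```python
def place_pairable_ref(m, cycle, L, can_place, place, quota_met):
    """After a pairable memory reference m lands in `cycle`, try to
    co-schedule a known opposite-bank partner in the same cycle
    (a sketch of Section 2.9, not the MIPSpro source).

    On False the caller retries m in a different cycle, and finally
    falls back to ordinary backtracking over earlier operations.
    """
    if quota_met():                        # enough known pairs already
        return True
    for partner in L[m]:                   # prioritized even-odd partners
        if not partner.scheduled and can_place(partner, cycle):
            place(partner, cycle)          # a guaranteed conflict-free pair
            return True
    return False
```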
3.0 Software pipelining at McGill University: the optimal approach

3.1 Development of the ILP formulation

The interest in software pipelining at McGill stemmed from work on register allocation for loops on dataflow machines. This work culminated in a mathematical formulation of the problem in a linear periodic form [GaNi91, NiGa92]. It was soon discovered that this formulation can also be applied to software pipelining for conventional architectures. This formulation was then used to prove an interesting theoretical result: the minimum storage assignment problem for rate-optimal software pipelined schedules can be solved using an efficient polynomial-time method, provided the target machine has enough functional units that resource constraints can be ignored [NiGa93]. In this framework, FIFO buffers are used to model register requirements. A graph coloring method can be applied subsequently to the obtained schedule to further decrease the register requirements of the loop. This is referred to as the integrated formulation and results in rate-optimal schedules with minimal register usage.

Subsequently, the McGill researchers extended their framework to handle resource constraints, resulting in a unified ILP formulation of the problem for simple pipelined architectures [NiGa92, GoAlGa94a]. The work was subsequently generalized to more complex architectures [AlGoGa95]. By the spring of 1995, this work was implemented at McGill in MOST, the Modulo Scheduling Toolset, which makes use of any one of several external ILP solvers. MOST was not intended as a component of a production compiler, but rather as a standard of comparison. Being able to generate optimal pipelined loops can be useful for evaluating and improving heuristics for production pipeliners (as this study demonstrates).

Because the McGill work is well represented in recent publications, we omit the details of the ILP formulation. The interested reader should consult [AlGoGa95] and then the other cited publications.
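To give the flavor of such a formulation without reproducing the McGill papers, one representative resource-constrained form looks as follows. The notation is ours, distilled from the general style of [GoAlGa94a, AlGoGa95]; it is a sketch, not the exact formulation MOST solves.

```latex
% One representative resource-constrained formulation (notation ours).
% 0/1 variables x_{i,t} place operation i in row t of the kernel; with an
% integer stage k_i, the issue time is
% \sigma_i = k_i \cdot II + \sum_t t \, x_{i,t}.
\[
  \sum_{t=0}^{II-1} x_{i,t} = 1 \quad \forall i
  \qquad\text{(each operation gets exactly one row)}
\]
\[
  \sum_{i \in \mathrm{ops}(r)} x_{i,t} \le R_r
  \quad \forall r,\; \forall t \in \{0,\dots,II-1\}
  \qquad\text{(modulo resource constraints)}
\]
\[
  \sigma_j - \sigma_i \ge \lambda_i - \delta_{ij} \cdot II
  \quad \forall (i \to j)
  \qquad\text{(dependence constraints, as in Section 1.1)}
\]
\[
  \min \sum_{e=(i \to j)} b_e
  \quad\text{subject to}\quad
  b_e \cdot II \;\ge\; \sigma_j + \delta_{ij} \cdot II - \sigma_i ,
\]
% where b_e counts the FIFO buffers holding the value carried by edge e;
% this buffer-minimizing objective is the one discussed in Section 3.3.
```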
3.2 Integration with the MIPSpro compiler

The research at McGill left many open questions. Although the formulation of the software pipelining problem in ILP was appealing, could it also be useful? The McGill team did not believe that it could be part of a production compiler due to its exponential run time, but did expect that it could be used to evaluate such a compiler. How would it compare with a more specialized heuristic implementation? It was bound to be slower, perhaps even much slower; but how much better would its results be? Because heuristic approaches can have near linear running time, they would certainly be able to handle larger loops. How much larger? MOST was not a full software pipelining implementation; its output was a set of static quality measures, principally the II of the schedule found and the number of registers required, not a piece of runnable code. Its only targets were models that exhibited certain interesting properties, never a real commercial high performance processor. How well would it work when targeted to a real processor? Were there any unexpected problems standing in the way of a full implementation, one that would generate runnable code?

The opportunity to embed MOST in the MIPSpro compiler seemed perfect to answer these questions. In that context, MOST would enjoy a proven pipelining context: a full set of optimization and analysis before pipelining, and a robust postprocessing implementation to integrate the pipelined code correctly back into the program. It was particularly attractive to reuse the postprocessing code. Although modulo renaming, generation of pipeline fill and drain code, and other related bookkeeping tasks may seem theoretically uninteresting, they account for a large part of the job of implementing a working pipeliner.
In the MIPSpro compiler, this postprocessing accounts for 18% of the lines of code in the pipeliner, about 6,000 out of 33,000 lines total. MOST was successfully embedded in the MIPSpro compiler over the summer of 1995.

3.3 Adjustment of the McGill approach due to this study

Over the course of this study, the McGill team found it necessary to adjust their approach in order to conduct a useful evaluation of SGI's production pipeliner. Sometimes a rate-optimal schedule with minimal register usage cannot be found in a reasonable amount of time. For the purpose of this study, we used 3 minutes as a limit on searches for rate-optimal schedules with minimal register usage. Increasing this limit doesn't seem to improve the results very much. When no optimal schedule is found within the given time limit, it is necessary to derive a good but not necessarily optimal schedule. As a result, the following adjustments were made to MOST during this study:
1. Obtain a resource-constrained schedule first. Using the ILP formulation of the integrated register allocation and scheduling problem was just too slow and unacceptably limited the size of loop that could be scheduled. For example, finding a schedule for the large N3 loop in the tomcatv benchmark was far beyond the reach of the integrated formulation. The ILP formulation of resource-constrained scheduling could be solved considerably faster. This formulation finds schedules satisfying resource constraints, but does not address register allocation. When it cannot be solved within reasonable time limits, we would also not be able to solve the integrated problem, so separating the problems serves as a useful filter against unnecessary compile time usage.
2. Adjustment of the objective function. Often it was not feasible to use the register-optimal formulation (involving coloring) reported in [AlGoGa95]. Instead, the ILP formulation used minimized the number of buffers in the software pipeline.[1] This objective function directly translates into a reduction of the number of iterations overlapped in the steady state of the pipeline. The ILP solver for this phase was restricted to a fixed time limit. After that, it would accept the best suboptimal solution found, if any.
3. Multiple priority orders. One surprising result of this study was the discovery that the same multiple priority order heuristics that were used in the SGI pipeliner are also very useful in driving the McGill solver. The priority order in which the ILP solver traverses the branch-and-bound tree is by far the most important factor affecting whether it could solve the problem. MOST thus adopted SGI's policy of trying many different priority orders in turn until a solution is found.[2]
These adjustments proved to be very important for our study. They enable MOST to generate R8000 code and greatly extend the size and number of loops it can successfully schedule.

[1] This is probably at least as important a parameter to optimize as register usage, as it has a more direct impact on the size of the pipeline fill and drain code and thus on short trip count performance of the loop. Short trip count performance is the only performance impact of either minimization; only the II affects the asymptotic performance.
[2] In fact, the final implementation of MOST embedded in the MIPSpro compiler used most of the code from the SGI scheduler directly, replacing only the scheduler proper with MOST.
4.0 Experiments and results

4.1 Effectiveness of the MIPSpro pipeliner

To lend perspective to the comparative results in Section 4.4, we first present some results to demonstrate that the pipelining technology in the MIPSpro compiler is very effective. We use the 14 benchmarks in the SPEC92 floating point suite for this purpose, as SPEC has become an important standard of comparison. (We omit the SPEC92 integer suite because our software pipeliner unfortunately has little impact on those benchmarks.) Figure 2 shows the results for each of the 14 SPEC92 floating point benchmarks with the software pipeliner both enabled and disabled. Not only is the overall result very good for a machine with a 75 MHz clock, the effectiveness of the pipeliner is dramatic, resulting in more than 35% improvement in the overall SPEC number (calculated as the geometric mean of the results on each benchmark).

To be fair, this overstates the case for SGI's scheduling technology, because disabling software pipelining does not enable any very effective replacement. The MIPSpro compiler leans heavily on software pipelining for floating point performance, and there has been little work in tuning the alternative paths. In particular, with software pipelining disabled, the special inner loop optimizations and analysis described in Section 2.1 are disabled; without pipelining there is also no if-conversion and vector memory reference analysis. When software pipelining is disabled a fairly simple list scheduler is used.

We are often asked why we consider it important to be able to pipeline large loops. When a loop is large, the argument goes, there should be enough parallelism within each iteration without having to overlap iterations. Perhaps this is true, but software pipelining as we use the term is much more than just overlapping loop iterations. Rather, it is a fundamentally different approach to code generation for innermost loops. We are willing to spend much more compile time on them than on other parts of programs. We have a suite of special optimizations that we apply to them. We have a scheduler that backtracks in order to squeeze out gaps that would otherwise be left in the schedule. We have even gone so far (in this paper) as to consider an exponential scheduling technique.

FIGURE 2. SPEC92 floating point benchmark results for the MIPSpro compiler run on an SGI Power Challenge with 75 MHz R8000 processors, with software pipelining both enabled and disabled. Results are given as SPECmarks, i.e., performance multiples of VAX 780 run times. (Chart omitted; benchmarks: spice2g6, doduc, mdljdp2, wave5, tomcatv, ora, alvinn, ear, mdljsp2, swm256, su2cor, hydro2d, nasa7, fpppp, plus the geometric mean.)
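The overall number in Figure 2 is a geometric mean of the per-benchmark SPECmarks. A minimal sketch of the aggregation, with purely illustrative values rather than measured ones:

```python
import math

def geomean(xs):
    """Geometric mean, the aggregate used for the overall SPEC number."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical SPECmarks for three benchmarks, with and without the
# software pipeliner (illustrative numbers only, not measurements):
enabled  = [310.0, 140.0, 95.0]
disabled = [220.0, 100.0, 75.0]
print(geomean(enabled) / geomean(disabled))   # overall improvement factor
```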
4.2 The effect of multiple scheduling priority heuristics

FIGURE 3. Three of the four priority list heuristics help to win at least one SPEC benchmark. Each bar gives the results of an experiment where all but one heuristic was disabled. The result is reported as the ratio over the numbers reported in Figure 2. (Chart omitted; series: RHMS, FDMS, FDNMS, HMS.)

Figure 3 shows the results of an experiment testing the effectiveness of the multiple scheduling heuristics discussed in Section 2.7. We tried compiling the SPEC benchmarks with each of the heuristics alone. No single heuristic worked best with all the benchmarks. In fact, three out of the four heuristics were required in order to achieve our best time for at least one of the benchmarks:
  FDMS:  hydro2d
  FDNMS: alvinn, ear, swm256
  HMS:   tomcatv, mdljsp2
The fourth heuristic, RHMS, is useful for loops not important to any of the SPEC92 floating point benchmarks. For several of the benchmarks, such as wave5, no single heuristic achieves a time equal to the overall best time shown in Figure 2. This is because the benchmarks consist of more than one important loop, and within a single benchmark each loop may require a different heuristic. There are also a few benchmarks, e.g., ear, that do better by a few percent when limited to a single heuristic than when the best schedule is chosen from among all four heuristics. This is less strange than it may seem at first: the MIPSpro compiler uses only the II of a schedule to measure its quality and ignores some other factors that can have minor impacts on the performance of a schedule. In particular, the overhead to start up and complete a loop is ignored when looking for a schedule, but can be relevant to the performance of short trip loops. (See Section 4.6.)

4.3 Effectiveness of the MIPSpro memory bank heuristics

FIGURE 4. Effectiveness of the MIPSpro memory bank heuristics: performance improvements from enabling these heuristics. (Chart omitted.)

We measured the effectiveness of the MIPSpro compiler's memory bank heuristics discussed in Section 2.9. Figure 4 shows the performance ratio for code compiled with the heuristic enabled over the same code compiled with the heuristic disabled. Two benchmarks stand out as benefiting especially: alvinn and mdljdp2.

For alvinn, the explanation is fairly simple. This program spends nearly 100% of its time in two memory bound loops that process consecutive single precision vector elements. Because the R8000 is banked on double precision boundaries, the most natural pairings of memory references can easily cause memory bank stalls. In particular, one of the two critical loops is a dot product of two single precision vectors. Two natural memory reference patterns for this code are

    { v[i+0], u[i+0] }, { v[i+1], u[i+1] }

and

    { v[i+0], v[i+1] }, { u[i+0], u[i+1] }

(each brace pair being the two references issued in one cycle). When (as in this case) both u and v are single precision and even aligned, both of these patterns batch 4 references to the same bank within two cycles, causing a stall. The memory bank heuristic prevents this, producing a pattern such as

    { v[i+0], v[i+2] }, { u[i+0], u[i+2] }

instead. This pattern is guaranteed to reference an even and an odd bank in each cycle and thus to have no stall.
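The bank arithmetic behind this example is easy to check directly; the snippet below is our illustration, with hypothetical even-aligned base addresses:

```python
def bank(addr):
    """Bank of an address when the cache interleaves on 8-byte double
    words, as in Section 2.9: 0 = even bank, 1 = odd bank."""
    return (addr // 8) % 2

ELEM = 4                        # single-precision element size in bytes
u_base, v_base = 0, 1024        # both even-aligned (assumption)

for i in range(4):
    print(i, bank(v_base + ELEM * i), bank(u_base + ELEM * i))
# v[0], u[0], v[1], u[1] all map to bank 0: four same-bank references
# in two cycles, hence a stall.  v[i] and v[i+2] are 8 bytes apart and
# therefore always in opposite banks, which is why the pattern
# { v[i], v[i+2] }, { u[i], u[i+2] } is stall-free.
```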
For mdljdp2, the situation is different and more complicated. This loop is not memory bound; it has only 16 memory references out of 95 instructions. The pattern of memory references is complicated by the fact that there is a memory indirection, and so the exact pattern of memory references cannot be known. Inspection of the generated code reveals that without memory bank heuristics, memory references with unknowable relative offsets are grouped together unnecessarily. The memory bank heuristics prevent that grouping and thus avoid risking stalls, by preventing any pairing of memory references.

4.4 Comparison between ILP and SGI's heuristics

We wanted to answer the question whether an optimal approach could improve performance relative to the pipeliner in the MIPSpro compiler.
Remember that the primary goal of this study is to validate and improve SGI's heuristic techniques. Thus we wanted to give the ILP approach every possible advantage, to expose weaknesses in the production compiler.

Unfortunately, there is a problem that could strongly favor the heuristic pipeliner: not every loop that can be scheduled by the SGI pipeliner can also be scheduled by the MOST scheduler in reasonable time. This is a particular problem because the penalty for not pipelining can be very high. (See Section 4.1.) We addressed this by using the heuristic pipeliner as a backup for MOST. Thus instead of falling back to the single block scheduler used when the MIPSpro pipeliner fails to schedule and register allocate a loop, the compiler instead falls back to the MIPSpro pipeliner itself. In theory, this should reveal only deficiencies in the production compiler, never in the ILP pipeliner.

FIGURE 5. Relative performance of ILP scheduled code over MIPSpro results, with and without memory bank pairing heuristics. (Chart omitted; benchmarks as in Figure 2.)

FIGURE 6. Relative performance of ILP scheduled code over MIPSpro results for each Livermore kernel. (Chart omitted.)

4.5 Comparison of run time performance

The solid bars in Figure 5 show comparative results for the SPEC floating point benchmarks. This data shows the code scheduled by the SGI pipeliner outperforming the optimal code on 8 of the benchmarks. The geometric mean of the suite as a whole is 8% better for the heuristic scheduler than for the ILP method. On one benchmark, alvinn, the code scheduled by MOST ran 15% slower than the code scheduled by the MIPSpro pipeliner. How can this be? The design of the experiment should have prevented the ILP pipeliner from ever finding a worse schedule than could be found by MIPSpro. After all, the heuristics are available as a fallback when an optimal schedule cannot be found. A large part of the answer is a dynamic factor introduced by the memory system of the R8000, described briefly in Section 2.9. The MIPSpro pipeliner uses heuristics to minimize the likelihood of unbalanced memory references causing stalls. Currently, the ILP scheduling formulation does not optimize for this hardware. Thus it can generate code which has poorer dynamic performance than that generated by the MIPSpro scheduler.

In order to show the size of the dynamic effect introduced by the memory system, we report a second set of numbers. These compare the ILP results to code scheduled by the MIPSpro pipeliner with its memory bank optimization disabled. These results are also shown in Figure 5, this time as striped bars. In this comparison, the ILP code performs within a range from just a little bit worse to about 5% better than the MIPSpro code. (See Section 4.3 for the direct comparison of MIPSpro results with and without memory bank heuristics.)

For two benchmarks, tomcatv and alvinn, the ILP code does about 5% better than the MIPSpro code scheduled with memory bank heuristics disabled. Are these places where it is actually generating better code, or is it just another random factor introduced by the memory system? The latter is the case; the ILP code just generates fewer memory stalls for these programs by chance. We know this because the performance of both programs is dominated by just a few loops which:
1. are scheduled by the MIPSpro compiler at their MinII,
2. have very long trip counts (300 for tomcatv, > 1000 for alvinn), and
3. are memory bound or nearly memory bound.
In fact these two programs were part of the original motivation for the MIPSpro pipeliner's memory bank optimization and served as a benchmark for tuning it.

4.6 Short trip count performance

Both the SGI and the McGill techniques strive to minimize the number of registers used by software pipelined steady states; however, the emphasis and motivations are somewhat different. At SGI the motivation was to make schedules that could be register allocated. Certainly this was important to the McGill team, but there was the additional goal of improving performance even for pipelines scheduled at the same II. How well do the two approaches perform in this regard, and how important is it in practice?

In fact register usage is only one of a number of factors that can affect the time to enter and exit a pipelined steady state -- its overhead. This overhead is constant relative to the trip count of the loop; it thus increases in importance as the trip count decreases and asymptotically disappears in importance as the trip count increases.
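Stated compactly (our notation, not the paper's): for a loop with trip count n,

```latex
\[
  T(n) \;\approx\; O + n \cdot II ,
  \qquad
  O = t_{\mathrm{fill}} + t_{\mathrm{drain}} + t_{\mathrm{save/restore}} ,
\]
% so the overhead's share of execution time, O / (O + n * II), grows as
% the trip count n shrinks and vanishes as n grows.
```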
If we ignore dynamic factors such as the memory system effect discussed in the previous section, then different schedules of a loop with identical IIs and different register requirements differ at most in overhead, so long as they both fit in the machine's available registers. After all, register usage can be compensated by spills and restores in the loop head and tail. There are other factors besides register usage that influence pipeline overhead. Before the steady state can execute the first time, the pipeline has to be filled, and after the last execution of the steady state, the pipeline has to be drained. The number of instructions in the fill and drain code is a function of how deeply pipelined each instruction in the loop is and of whether the less deeply pipelined instructions can be executed speculatively during the final few iterations of the loop.

How do the two schedulers compare with regard to the short trip count performance of the loops they generate? The well known Livermore Loops benchmark is particularly well suited for this measurement. It measures the performance on each of 24 floating point kernels for short, medium, and long trip counts. Figure 6 shows the relative performance of the two approaches on each loop for both the short and long trip count cases. These results show better performance for the SGI scheduler in nearly all cases, with both short and long trip counts. But as we have just seen, these results can be distorted by the effects of the machine's memory system. We'd like a way to make a more direct comparison. To do this, we looked at some static performance information about the individual loops in the benchmark. For all the loops in the benchmark, the schedules produced by both pipeliners had identical IIs. Figure 7 shows the relative performance of the two pipeliners in terms of:
1. register usage, measured in total number of both floating point and integer registers used, and
2. overall pipeline overhead, measured in cycles required to enter and exit the loop.

FIGURE 7. Difference (MIPSpro minus ILP) in second order static quality measures for each of the loops in the Livermore kernels, given as absolute register numbers and instruction cycles. (Chart omitted.)

Overall pipeline overhead counts directly toward performance, while register usage is important only in so far as it impacts pipeline overhead. This chart shows two things quite clearly:
1. Neither scheduler produces consistently better schedules by either measure of overhead. The heuristic method, for example, uses fewer registers in 15 of the 26 loops.
2. There is no clear correlation between register usage and overhead. For 16 of the loops, the schedule with smaller overhead didn't use fewer registers.
(1) seems particularly surprising in light of the emphasis on register optimality at McGill, but a careful examination of Section 3.3 will shed some light. Even in the best case, the McGill schedules do not have optimal register usage, but only minimal usage of the modulo renamed registers or buffers, ignoring the within-iteration register usage. Moreover, for the purposes of this study, scheduling and buffer minimization were often performed in separate passes, so that buffer usage is not really optimized over all possible schedules, but only over a rather small subset.
(2) should not be surprising in light of the discussion above. Saving and restoring the registers used by the steady state is only one of the things that needs doing in the loop prologue and epilogue. An additional overlapped iteration in the steady state can have a far larger effect than the use of more registers.
4.7 Compile speed comparison

As expected, the ILP based pipeliner is much slower than the heuristic based one. Of the 261 seconds spent in the MIPSpro pipeliner while compiling the 14 SPEC benchmarks, 237 seconds are spent in the scheduler proper, with the rest spent in inner loop optimization and post-pipelining bookkeeping. This compares to 67,634 seconds in the ILP scheduler compiling the same code.

5.0 Conclusions and future work

Only very rarely does the optimal technique schedule and allocate a loop at a lower II than the heuristics. In the course of this study, we found only one such loop. Even in that case a very modest increase in the backtracking limits of the heuristic approach equalized the situation.

The heuristic technique is much more efficient than MOST. This is especially important for larger loop bodies, where its greater efficiency significantly extends its functionality. In our experiments, the largest loop successfully scheduled by the SGI technique had 116 operations, while the largest optimal schedule found by MOST had 61 operations.

The ILP technique was not able to guarantee register optimality for many interesting and important loops. In order to produce results for these loops, the McGill team had to devise a heuristic fallback. This led to the result that neither implementation had consistently better register usage.

So long as a software pipelined loop actually fits in the registers, the number of registers used is not an important parameter to optimize, since it is not well related to performance, even for short trip count loops. Future work in this direction should focus on the more complex task of minimizing overall loop overhead. We feel that there is much to learn in this area. Inherent in the modulo scheduling algorithm is a measure of how the results compare against a loose lower bound on optimality, the MinII (see Section 2.3). This has always given us at least a rough idea of how much room there was for improvement in the iteration intervals of generated schedules. On the other hand, we know very little about the limits on the performance of software pipelined loops when the trip count is short. Perhaps an ILP formulation can be made that
optimizes loop overhead more directly than by optimizing register usage.

We were unable to compare the MIPSpro memory heuristics with code that optimally uses the machine's memory system. This comparison would probably be useful for the evaluation and tuning of the heuristics. Frankly, we do not currently understand how well these heuristics work compared to how well they could work. A good solution to this problem could have interesting implications for the design of memory systems.

This study has not produced any convincing evidence that there is much room for improvement in the SGI scheduling heuristics. But we know there is still room for improvement in several areas, most notably performance of loops with short trip counts and large loop bodies with severe register pressure. Without very significant strides toward a more efficient implementation of the ILP approach, we do not see how it can help with the latter problem. But short trip count performance is an important problem, especially in light of loop nest transformations such as tiling. And here we feel that the ILP approaches to software pipelining may yet have a significant role to play.

Acknowledgments

We wish to express our gratitude to a number of people and institutions. John Hennessy and David Wall provided invaluable editorial assistance; without their help, this paper would be much harder to understand. Erik Altman, Fred Chow, Jim Dehnert, Suneel Jain, Earl Killian, and Dror Maydan also helped substantially with the proofreading and editing. Monica Lam was the matchmaker of this project; she brought the SGI and McGill teams together in the first place. Erik Altman, R. Govindarajan, and other people from the ACAPS lab at McGill built the MOST scheduling tool. Douglas Gilmore made crucial contributions to the design of the SGI scheduler, in particular the idea of using more than one scheduling heuristic. Ross Towle was the original instigator of software pipelining at Silicon Graphics and the father of practical software pipelining. We received financial support from Silicon Graphics and the Canadian NSERC (Natural Sciences and Engineering Research Council). Ashfaq Munshi has been primarily responsible for creating an atmosphere of great productivity and creativity at Silicon Graphics, where the bulk of this work took place during the summer of 1995. Finally, we must thank the members of our families for their loving support and forbearance.

Bibliography

[AiNi88] A. Aiken and A. Nicolau. Optimal loop parallelization. In Proc. of the '88 SIGPLAN Conf. on Programming Language Design and Implementation, Atlanta, Georgia, June 1988.
[AlKePoWa83] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Proc. of the 10th Ann. ACM Symp. on Principles of Programming Languages, January 1983.
[AlGoGa95] E. R. Altman, R. Govindarajan, and G. R. Gao. Scheduling and mapping: Software pipelining in the presence of structural hazards. In Proc. of the '95 SIGPLAN Conf. on Programming Language Design and Implementation, La Jolla, Calif., June 1995.
[Altman95] E. R. Altman. Optimal Software Pipelining with Functional Unit and Register Constraints. Ph.D. thesis, McGill University, Montreal, Quebec, 1995.
[BiKeKr94] R. Bixby, K. Kennedy, and U. Kremer. Automatic data layout using 0-1 integer linear programming. In Proc. of the Conf. on Parallel Architectures and Compilation Techniques, August 1994.
[BrCoKeTo89] P. Briggs, K. D. Cooper, K. Kennedy, and L. Torczon. Coloring heuristics for register allocation. In Proc. of the '89 SIGPLAN Conf. on Programming Language Design and Implementation, July 1989.
[Briggs92] P. Briggs. Register Allocation via Graph Coloring. Ph.D. thesis, Rice University, Houston, Texas, April 1992.
[DeTo93] J. C. Dehnert and R. A. Towle. Compiling for the Cydra 5. Journal of Supercomputing, v. 7, May 1993.
[Eichenberger95] A. E. Eichenberger, E. S. Davidson, and S. G. Abraham. Optimum modulo schedules for minimum register requirements. In Proc. of the '95 Intl. Conf. on Supercomputing, Barcelona, Spain, July 1995.
[Feautrier94] P. Feautrier. Fine-grained scheduling under resource constraints. In 7th Ann. Workshop on Languages and Compilers for Parallel Computing, Ithaca, N.Y., August 1994.
[GaNi91] G. R. Gao and Q. Ning. Loop storage optimization for dataflow machines. In Proc. of the 4th Ann. Workshop on Languages and Compilers for Parallel Computing, August 1991.
[GaJo79] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Co., New York, N.Y., 1979.
[GaSc91] F. Gasperoni and U. Schwiegelshohn. Efficient algorithms for cyclic scheduling. Res. Rep. RC 17068, IBM T. J. Watson Res. Center, Yorktown Heights, N.Y., 1991.
[GoAlGa94a] R. Govindarajan, E. R. Altman, and G. R. Gao. A framework for rate-optimal resource-constrained software pipelining. In Proc. of CONPAR 94 - VAPP VI, Lecture Notes in Computer Science no. 854, Linz, Austria, September 1994.
[GoAlGa94b] R. Govindarajan, E. R. Altman, and G. R. Gao. Minimizing register requirements under resource-constrained rate-optimal software pipelining. In Proc. of the 27th Ann. Intl. Symp. on Microarchitecture, San Jose, Calif., November-December 1994.
[Hsu94] P. Hsu. Designing the TFP microprocessor. IEEE Micro, April 1994.
[Huff93] R. A. Huff. Lifetime-sensitive modulo scheduling. In Proc. of the '93 SIGPLAN Conf. on Programming Language Design and Implementation, Albuquerque, N. Mex., June 1993.
[Lam88] M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proc. of the '88 SIGPLAN Conf. on Programming Language Design and Implementation, Atlanta, Georgia, June 1988.
[Lam89] M. Lam. A Systolic Array Optimizing Compiler. Kluwer Academic Publishers, 1989.

[MoEb92] S. Moon and K. Ebcioglu. An efficient resource-constrained global scheduling technique for superscalar and VLIW processors. In Proc. of the 25th Ann. Intl. Symp. on Microarchitecture, Portland, Ore., December 1992.

[NeWo88] G. Nemhauser and L. Wolsey. Integer and Combinatorial Optimization. John Wiley & Sons, 1988.

[Nemhauser94] G. Nemhauser. The age of optimization: Solving large-scale real-world problems. Operations Research, 42(1):5-13, January-February 1994.

[NiGa92] Q. Ning and G. Gao. Optimal loop storage allocation for argument-fetching dataflow machines. International Journal of Parallel Programming, v. 21, no. 6, December 1992.

[NiGa93] Q. Ning and G. Gao. A novel framework of register allocation for software pipelining. In Conf. Rec. of the 20th Ann. ACM SIGPLAN-SIGACT Symp. on Principles of Programming Languages, Charleston, S. Carolina, January 1993.

[Pugh91] W. Pugh. The Omega test: A fast and practical integer programming algorithm for dependence analysis. In Proc. Supercomputing '91, November 1991.

[RaGl81] B. R. Rau and C. D. Glaser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proc. of the 14th Ann. Microprogramming Workshop, Chatham, Mass., October 1981.

[RaFi93] B. R. Rau and J. A. Fisher. Instruction-level parallel processing: History, overview and perspective. Journal of Supercomputing, v. 7, pp. 9-50, May 1993.

[Rau94] B. R. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proc. of the 27th Ann. Intl. Symp. on Microarchitecture, San Jose, Calif., November-December 1994.

[Tou84] R. F. Touzeau. A FORTRAN compiler for the FPS-164 scientific computer. In Proc. of the SIGPLAN '84 Symp. on Compiler Construction, pp. 48-57, Montreal, Quebec, June 17-22, 1984.

[Warter92] N. J. Warter, J. W. Bockhaus, G. E. Haab, and K. Subramanian. Enhanced modulo scheduling for loops with conditional branches. In Proc. of the 25th Ann. Intl. Symp. on Microarchitecture, Portland, Ore., December 1992.