
COMPILER-DIRECTED FUNCTIONAL UNIT SHUTDOWN FOR MICROARCHITECTURE POWER OPTIMIZATION

BY SANTOSH B TALLI

A thesis submitted to the Graduate School in partial fulfillment of the requirements for the degree Master of Science in Electrical Engineering

New Mexico State University
Las Cruces, New Mexico
September 2006

Compiler-Directed Functional Unit Shutdown for Microarchitecture Power Optimization, a thesis prepared by Santosh B Talli in partial fulfillment of the requirements for the degree Master of Science in Electrical Engineering, has been approved and accepted by the following:

Linda Lacey, Dean of the Graduate School
Jeanine Cook, Chair of the Examining Committee

Committee in charge:
Dr. Jeanine Cook, Chair
Dr. Steve Stochaj
Dr. Mary Ballyk

To My Family: Mother, Father, Sister and Brother-in-law

ACKNOWLEDGEMENTS

I would like to thank my parents for their love and support. I thank my sister and brother-in-law for their support all through my graduate studies. I have no words to express my gratitude towards my advisor, Dr. Jeanine Cook, for her support and insightful suggestions. Her computer architecture course made me love the subject and changed the course of my career. Dr. Cook has also been of great help in deciding on my courses in the Electrical Engineering Department. I am thankful to Dr. Steve Stochaj and Dr. Mary Ballyk for serving on my committee. Dr. Stochaj's computer performance analysis course helped me analyze my results better. I thank my good friend Ram for his constant help and support during my research, and for the technical discussions and ideas. Finally, I would like to thank all my friends at NMSU who made my stay memorable.

ABSTRACT

COMPILER-DIRECTED FUNCTIONAL UNIT SHUTDOWN FOR MICROARCHITECTURE POWER OPTIMIZATION

BY SANTOSH B TALLI

Master of Science in Electrical Engineering
New Mexico State University
Las Cruces, New Mexico, 2006
Dr. Jeanine Cook, Chair

Leakage power is a major concern in current microarchitectures, as it increases exponentially with decreasing transistor feature sizes. In this study, we present a technique called functional unit shutdown that reduces the static leakage power consumption of a microprocessor by power gating functional units when they are not in use. We use profile information to identify functional unit idle periods, which the compiler then uses to issue corresponding OFF/ON instructions. The decision to power gate during an idle period is made by comparing the energy consumed by leaving the unit ON against the overhead and leakage energy involved in power cycling it. This comparison identifies short idle periods during which less power is consumed if a functional unit is left ON rather than power cycled. The results show that this technique saves up to 18% of the total energy with a performance degradation of 1%.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1. INTRODUCTION AND MOTIVATION
    1.1 CMOS Transistor and Leakage Power
    1.2 Superscalar Processors and ILP
    1.3 Functional Unit Power Consumption
    1.4 Combination of Compiler and Hardware Techniques
2. RELATED WORK
    2.1 Clock Gating
    2.2 Static Power Reduction Techniques
        2.2.1 Hardware Techniques to Reduce Static Power
            Power Gating
            Dual Threshold Voltage
            Gate Level Leakage Reduction (Input Vector Control)
        2.2.2 Compiler Directed Static Power Reduction
    2.3 Functional Unit Static Power Reduction
3. FUNCTIONAL UNIT SHUTDOWN
    CFG Generation
    Energy Estimation
        Overhead Energy and Break-Even Cycles
    3.3 FU Requirement Optimization
        Complexity of Optimization: Local and Global
        Local Optimizer
        Global Optimizer
    Processor Support
    Performance Penalty
    Variations of the Basic Algorithm
4. EXPERIMENTAL PLATFORM
    The SimpleScalar Simulator
    The Wattch Simulator
    SPEC CPU2000 Benchmarks
    Subset of SPEC CPU2000 Benchmarks
    Framework
5. RESULTS
    Effect of FU Shutdown
    Energy Breakdown
    Performance Degradation
    Sensitivity Analysis on Break-Even Cycles
    Depth of Global Optimization
    Wait if Busy vs Busy ON
    Conclusions and Future Work
6. CONCLUSIONS
References


LIST OF TABLES

1.1 Static power dissipation by Functional Units
Instruction class FU requirements
Average difference in FU requirements estimation and actual usage
Total Number of FUs Used
Number of basic blocks in each benchmark
Local optimizer energy estimation
Time for local optimizer for all benchmarks
Energy saved versus depth of CFG optimization
Global optimizer energy estimation
Optimization time for global2 CFG optimization
Optimization time for global4 CFG optimization
Subset of SPEC CPU2000 benchmarks
Simulation parameters
Dynamic benchmark instruction mix
Average number of FUs used in a BB
Total energy savings by implementing FU shutdown
Increase in energy due to performance for global

LIST OF FIGURES

1.1 CMOS Inverter
2.1 Power Gating implementation
BB instruction dependence tree
CFG with FU requirements
Illustration of Break-Even point
Short FU idle period
CFG to be optimized
(a) Exhaustive search over nodes 1, 4 and
(b) Exhaustive search on nodes 1, 2, 3, and
Local optimization
Local optimizer algorithm
Depth one sub-CFGs
Global optimizer algorithm
Global optimizer time complexity, mcf
Compiler-Inserted instructions
Original Percent Energy Breakdown
Total energy savings (%)
5.3a Different Strategies for eon
5.3b Different Strategies for facerec
5.3c Different Strategies for fma3d
5.3d Different Strategies for mesa
5.3e Different Strategies for swim
(a) Total energy breakdown for vortex
(b) Total energy breakdown for INT benchmarks
(c) Total energy breakdown for FP benchmarks
(d) Total energy breakdown over all benchmarks
Execution time in clock cycles
Energy breakup for different BE values for art
(a) BE cycles sensitivity, art
(b) BE cycles sensitivity, vpr
(c) BE cycles sensitivity, facerec
Energy consumption versus global optimizer depth

1. Introduction and Motivation

Decreasing CMOS transistor feature sizes have enabled higher processing speeds and more components on chip. However, this comes at the expense of increased static power dissipation in the form of transistor leakage current, which increases as transistor size decreases. Future technologies will have greater levels of on-chip integration and higher clock frequencies, making energy dissipation an even more critical design constraint. It is now estimated that static power dissipation accounts for about 40% of the total power of high-speed processing chips that use 65nm technology [11]. Moreover, with decreasing transistor feature sizes, static power dissipation in a microprocessor is increasing exponentially [12]. Power consumption is a crucial factor that determines the functionality and mobility of devices. The performance potential of a mobile device is limited by its power consumption, as the increasing levels of integration and clock frequencies needed for high performance escalate the power dissipation. Due to these factors, power optimization at various levels of microprocessor design becomes essential.

1.1 CMOS Transistor and Leakage Power

One of the most popular Metal Oxide Semiconductor Field Effect Transistor (MOSFET) technologies is Complementary MOS (CMOS) technology. This technology makes use of both P- and N-channel devices in the same substrate material. Such devices are extremely useful, since the signal that turns a transistor of one type ON turns a transistor of the other type OFF. This allows the design of logic devices using only simple switches, without the need for a pull-up resistor. Figure 1.1 shows a typical inverter implemented in CMOS technology. V_DD is the supply voltage, V_in and V_out are the input and output voltages respectively, and C_L is the load capacitance. In this case an input of logic 1 (V_DD volts, the transistor supply voltage) switches the N transistor on and the P transistor off. Decreasing transistor feature sizes have the advantage

of (1) reducing gate delay, resulting in increased clock frequency and faster circuit operation, and (2) increasing transistor density, making chips smaller and reducing cost.

Figure 1.1: CMOS Inverter

CMOS circuits dissipate power by charging and discharging the various load capacitances (mostly gate and wire capacitance, but also drain and some source capacitances) whenever they are switched. The charge moved, Q, is the capacitance multiplied by the voltage change, V_gain. The current used, I_used, is the product of the charge moved and the switching frequency, f. Finally, the characteristic switching power, P, dissipated by a CMOS circuit is the product of the current used and the voltage gain:

P_dynamic = I_used * V_gain = (Q * f) * V_gain = ((C * V_gain) * f) * V_gain = C * V_gain^2 * f

As opposed to dynamic power, which is due to the switching of devices, the main contributor to leakage power is the sub-threshold leakage current present in deep submicron MOS transistors operating in the weak inversion region. Sub-threshold leakage is the current that flows from drain to source even when the transistor is off (gate voltage less than threshold voltage); the transistor only begins to conduct at the threshold voltage. Sub-threshold leakage increases exponentially with decreasing threshold voltage (V_T), and the continuous reduction of V_T with technology scaling is making static (leakage) power increasingly significant. Hence, in recent

years computer architects have invented solutions to decrease power by relying on microarchitectural innovations. It is also important that the solutions developed do not lead to a significant degradation in performance. In this study, we focus primarily on leakage energy dissipation in high-performance microprocessors and develop a power-aware design that reduces the leakage energy.

1.2 Superscalar Processors and ILP

A scalar processor processes one data item at a time, whereas in a vector processor a single instruction operates simultaneously on multiple data items. A superscalar processor combines aspects of both: each instruction processes one data item, but there are multiple processing units, so multiple instructions can process separate data items at the same time. A superscalar processor has multiple functional units of the same type, along with additional circuitry to support issuing instructions to the units. The issue/scheduling unit reads instructions from memory, decides which ones can be run in parallel, and dispatches them to the units.

Instruction Level Parallelism (ILP) is a measure of the number of instructions in a straight-line piece of code that can be performed simultaneously. To exploit ILP, the instructions that can be executed in parallel have to be determined. Two instructions can be executed in parallel if they can execute simultaneously in a pipeline without causing any stalls, given sufficient resources at their disposal. The decision to issue multiple instructions at the same time is based on determining whether an instruction is dependent on another instruction. The types of dependences that can exist between two instructions are data, name, and control dependences.

1. Data Dependence: A data dependence occurs when an instruction depends on the result of an earlier instruction. If two instructions are data dependent, they cannot be executed simultaneously: the second instruction has to wait until the result of the first instruction is

produced. If both are executed in parallel, the second instruction might read an earlier value of the operand.

2. Name Dependence: A name dependence occurs when two instructions use the same register or memory location, referred to as a name, but there is no flow of data between them associated with that name. There are two types of name dependences that can occur between two instructions:

a. An anti-dependence occurs when an instruction requires the value of an operand that is later updated. The original ordering must be preserved to ensure that the first instruction reads the correct value.

b. An output dependence occurs when both instructions write the same register or memory location, so that the ordering of the instructions affects the final value of the operand. The ordering between the instructions must be preserved so that the final value corresponds to the one written by the second instruction.

3. Control Dependence: A control dependence occurs when there is a branch instruction and the next instruction to be executed depends on the direction of the branch, which is known only from the outcome of its execution. The next instruction to be executed can be either the next sequential instruction (if the branch is not taken) or the instruction specified by the branch (if the branch is taken).

A hazard occurs whenever there is a dependence between instructions and they are close enough that the overlap caused by pipelining, or other reordering of instructions, would change the order of access to the operand involved in the dependence. Due to the dependence, program order must be preserved, that is, the order in which the instructions would execute if executed sequentially one at a time. Data hazards can be classified into the following three types based on the order of read and write accesses in the instructions.

1. Read after Write (RAW): An operand is modified just before it is read. If the first instruction has not finished writing to the operand, the second instruction will read incorrect data.

2. Write after Read (WAR): An operand is written right after it is read. If the write finishes before the read, the read instruction will incorrectly get the newly written value.

3. Write after Write (WAW): Two instructions write to the same operand. If the second instruction finishes first, the operand is left with an incorrect value.

These hazards can be avoided in certain cases, if not completely eliminated. There are various software and hardware techniques to avoid hazards and exploit parallelism by preserving the program order only where it affects the outcome of the program. Among data hazards, WAR and WAW can be avoided by using techniques like register renaming, but RAW cannot be avoided and the pipeline must be stalled, as an instruction has to read the operand modified by the earlier instruction. Control hazards can be avoided at times, if not always, by speculating on whether the branch will be taken using various branch prediction techniques. Since these hazards cannot be completely eliminated, maximum ILP is not always achieved and the utilization rates of the various processor hardware components are below 100%, leading to idle periods.

1.3 Functional Unit Power Consumption

Contemporary high-performance microprocessors are some of the most highly integrated, performance-driven, state-of-the-art chips designed today. They require an extremely large number of transistors and implement high clock rates to meet high performance requirements, which leads to significant static and dynamic power dissipation. Current Intel processors have around 200 million transistors. Of the total processor power consumption, functional units contribute about 20% [17]. Dynamic power of the FUs can be reduced by using clock gating

[14, 11]. The other major static power dissipating units of a microprocessor are the cache memories, and techniques have been proposed to reduce their power [16, 17, 9], which leaves the FUs as the major static power dissipating components. The power consumption of the functional units at different technology parameters is shown in Table 1.1, taken from [7]. It can be seen that static power dissipation increases non-linearly with decreasing transistor sizes.

Table 1.1: Static power dissipation by functional units [7]

Simulation results have shown that functional unit utilization rates are typically low, with idle periods characterizing their use [5]. Superscalar processors achieve their high processing speeds by dynamically detecting and exploiting Instruction Level Parallelism (ILP), subsequently executing instructions in parallel on multiple functional units. As maximum ILP is rarely achieved, as discussed in Section 1.2, not all available functional units are fully utilized, leading to idle periods. Also, the applications being run may not utilize the functional units completely. For example, an application that is intensive in integer computations and has only a few floating-point operations will underutilize the floating-point functional units. During these idle periods static power is dissipated. Our work is aimed at reducing the static power consumption of functional units during these idle periods by turning the power OFF. While turning the FUs OFF results in static energy savings, a performance penalty may be incurred when a unit is turned OFF but is needed for instruction execution. Additionally, a certain amount of dynamic energy is consumed to turn the FUs ON. Hence, it is important that turning OFF a FU does not affect

performance and that the static energy saved by turning it OFF is greater than the dynamic energy incurred to power cycle it. In our implementation, if the idle period of a functional unit is short enough that the energy required to cycle from ON to OFF to ON (a power cycle) is greater than the energy consumed by leaving it ON during the idle period, then the unit is left ON. We use an annotated Control Flow Graph (CFG) extracted from an application to detect the idle periods. These idle periods are then translated into compiler instructions that turn FUs OFF/ON, while hardware enables the actual OFF/ON operation. The advantages of using such a hardware-software approach in lieu of a purely hardware-based approach are discussed below. Although Energy = Power * Time, we use the terms power and energy interchangeably to suit the context.

1.4 Combination of Compiler and Hardware Techniques

Most of the early techniques to reduce static microprocessor power were hardware based [16, 17, 9]: once the idle period of a microprocessor component such as memory starts, a hardware counter counts the number of idle cycles, and if the counter reaches a set threshold value the component is turned OFF. The disadvantages of such a technique are that (1) the device may need to be accessed immediately after it is turned OFF; (2) energy savings are lost while the counter waits to reach the threshold value; and (3) the hardware counter consumes additional power. To overcome these disadvantages, we use a combination of compiler and hardware techniques. Instead of depending solely on hardware counters, the compiler produces information that is used to issue the OFF/ON directives. The advantages of this approach, which overcomes the disadvantages of the hardware-based techniques described above, are given below.

Identifying Idle Regions Off-line: In order to turn the FUs OFF, their idle periods need to be detected.
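The energy comparison described above can be sketched as a simple decision rule. This is a minimal illustration of the idea rather than the thesis's actual implementation, and the energy parameters are hypothetical placeholders for profiled values:

```python
def leave_on(idle_cycles, leak_per_cycle, cycle_overhead):
    """Decide whether a functional unit should stay ON for an idle period.

    leak_per_cycle: leakage energy dissipated per cycle if the unit stays ON
    cycle_overhead: one-time energy cost of the ON -> OFF -> ON power cycle
    (both parameters are hypothetical stand-ins for circuit-level estimates)
    """
    energy_if_left_on = idle_cycles * leak_per_cycle
    # Short idle period: power cycling costs more than it saves.
    return energy_if_left_on <= cycle_overhead

leave_on(4, 5.0, 45.0)    # short idle period -> True, keep the unit ON
leave_on(100, 5.0, 45.0)  # long idle period  -> False, power gate it
```

The same comparison, applied per idle period found in the annotated CFG, is what separates profitable shutdowns from counterproductive ones.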
Primitive hardware techniques for power gating devices are based on keeping track of the idle period, as discussed above. In our technique, the compiler

examines all of the code off-line and identifies suitable regions for turning the FUs OFF. Furthermore, the compiler also identifies the types of FUs and determines the number of FUs that can be turned OFF without degrading performance or increasing the total power consumption.

Ability to Hide Latency: When turning a FU OFF prior to entering an idle period, it has to be ensured that all pending instructions have committed. Similarly, when turning a FU ON upon exiting an idle period, it has to be turned ON sufficiently ahead of the instructions accessing it, as there is a turn-ON latency. A hardware-based technique built on simple counters cannot account for these constraints.

Variable Length Idle Periods: Idle periods in a FU can be of variable length. If the idle period is long, turning the FU OFF saves power. But if the idle period is too short, the FU will have to be turned ON as soon as it is turned OFF. If the latter situation occurs frequently, there is little or no power saving while additional dynamic power is dissipated in power cycling the FU. Hence, turning a FU OFF for a short idle period could lead to more overall power consumption than leaving it ON. In compiler-directed FU shutdown, the FU is turned OFF only if the compiler is sure that turning it OFF will save power, which nullifies the effect of very short idle periods.

In this work, we identify the FU idle periods and propose architectural techniques to reduce static power consumption during those idle periods, by comparing the dynamic and static power consumed by leaving different functional units ON versus turning them OFF. The rest of the thesis is organized as follows: Chapter 2 presents prior related work on reducing the dynamic and static power of a microprocessor in general and of the functional units in particular. Chapter 3 proposes our compiler-directed FU shutdown methodology and Chapter 4 describes our experimental platform. In Chapter 5 we present the effectiveness of our technique in reducing static power, and we conclude in Chapter 6.
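The compiler-directed approach outlined in this chapter can be sketched as a small pass over profiled basic blocks. Everything here — the block records, the break-even constant, and the directive format — is a hypothetical illustration of the idea, not the thesis's actual implementation:

```python
BREAK_EVEN_CYCLES = 9  # hypothetical: shortest idle run worth power gating

def insert_directives(blocks, fu_type):
    """Emit OFF/ON directives around idle runs of one FU type.

    blocks: straight-line profiled basic blocks, each a dict with an 'id',
    a 'cycles' estimate, and the set of FU types ('fus') the block needs.
    """
    directives, idle_run = [], []
    for bb in blocks:
        if fu_type in bb["fus"]:
            # The idle run ends here; gate it only if it beats break-even.
            if sum(b["cycles"] for b in idle_run) > BREAK_EVEN_CYCLES:
                directives.append(("OFF", fu_type, idle_run[0]["id"]))
                directives.append(("ON", fu_type, bb["id"]))
            idle_run = []
        else:
            idle_run.append(bb)
    return directives

blocks = [{"id": 0, "fus": {"fpu"}, "cycles": 4},
          {"id": 1, "fus": set(), "cycles": 20},
          {"id": 2, "fus": {"fpu"}, "cycles": 4}]
insert_directives(blocks, "fpu")  # [("OFF", "fpu", 1), ("ON", "fpu", 2)]
```

Because the decision is made off-line over the whole CFG, short idle runs are simply never gated, which is the key advantage over counter-based hardware schemes.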

2. Related Work

The first efforts to reduce microprocessor power consumption focused on reducing dynamic power dissipation [14, 8, 15]. There are a number of power-aware architecture designs, many of which focus on reducing the power of various microarchitectural components [16, 17, 18, 9]. In this chapter we focus mainly on FU power reduction techniques [6, 10, 24]. Approaches for reducing the dynamic power dissipated by functional units during idle periods using clock gating have been described in [5, 8, 13]. With decreasing transistor feature sizes, static (leakage) power dissipation is a major contributor to the total power dissipation; we therefore review recent related work on both dynamic and static FU power reduction techniques.

2.1 Clock Gating

One of the first efforts to reduce the power dissipation of functional units introduced the clock gating technique [14, 15, 8]. Clock gating is implemented in synchronous circuits to disable portions of a circuit when they are not actively performing computation, thereby reducing the dynamic power dissipation of the gated portions. The clock network in a microprocessor connects the clock to sequential elements like flip-flops, latches, and dynamic logic gates, which are used in high-performance execution units and in array address decoders in cache memories. At a high level, gating the clock to a latch or a logic gate by ANDing the clock with a control signal prevents the unnecessary charging/discharging of the circuit's capacitances when the circuit is idle, and saves the circuit's clock power. Initially, clock gating was applied to a functional unit only when none of the functional unit's stages were active. Clock gating techniques have since improved on this limitation by gaining the ability to disable individual stages of a functional unit that are not active [23].

One of the first attempts at clock gating was in [14], which states that clock power is usually around 30-35% of the total microprocessor power. Clock power is a major component of microprocessor power mainly because the clock is fed to most of the circuit blocks in the processor and switches every cycle. However, effective clock gating requires a methodology to determine which circuits are gated, when, and for how long. Clock gating schemes that either (1) result in frequent toggling of the clock-gated circuit between enabled and disabled states, or (2) apply clock gating to blocks so small that the clock-gating control circuitry is as large as the block itself, incur large overhead. This overhead may result in power dissipation higher than without clock gating.

Pipeline balancing (PLB) is a technique that essentially outlines a predictive clock gating methodology [15]. PLB exploits the inherent variation of instruction level parallelism (ILP) within a program. It uses past program behavior and characteristics such as issue IPC to predict a program's ILP at the granularity of a 256-cycle window. If the degree of ILP in the next window is predicted to be lower than the width of the pipeline, PLB clock gates a cluster of pipeline components during the window, including not just the datapath but all associated control logic and clocks. Using a simulator based on an extension of the Alpha processor, [15] presents component and full-chip power and energy savings for single- and multi-threaded execution. Results show issue queue and execution unit power reductions of up to 23% and 13%, respectively, with an average performance loss of 1% to 2% on SPEC95 benchmarks.

In contrast to PLB's predictive methodology (it uses past program behavior to predict ILP), [8] proposes a deterministic methodology called Deterministic Clock Gating (DCG). DCG is based on the key observation that, for many of the stages in a modern pipeline, a circuit block's usage in a specific cycle in the near future is deterministically known a few cycles

ahead of time. DCG exploits this advance knowledge to clock gate the unused blocks. In an out-of-order pipeline, whether these blocks will be used is known at the end of issue, based on the instructions issued. The execution units, the pipeline latches of back-end stages after issue, the L1 D-cache wordline decoders, and the result bus drivers are clock gated. There is at least one cycle of register-read stage between issue and the stages that use the execution units, D-cache wordline decoders, result bus drivers, and back-end pipeline latches. DCG exploits this one cycle of advance knowledge to clock gate the unused blocks without impacting clock speed. DCG's deterministic methodology has three key advantages over PLB's predictive methodology: (1) PLB's ILP prediction is not 100% accurate; if the predicted ILP is lower than the actual ILP, PLB ends up clock-gating useful blocks and incurs performance loss, and vice versa, whereas DCG guarantees no performance loss and no lost opportunity for blocks whose usage can be known in advance; (2) DCG clock gates at finer granularities than PLB, in both circuit and time; (3) while PLB's prediction heuristics have to be fine-tuned, DCG uses no extra heuristics and is significantly simpler. Experimental results show an average 19.9% reduction in dynamic processor power with virtually no performance loss for an 8-issue, out-of-order superscalar processor when DCG is applied to execution units, pipeline latches, D-cache wordline decoders, and result bus drivers. In contrast, PLB achieves 9.9% average power savings at a 2.9% performance loss. Clock gating techniques reduce processor power consumption by reducing the dynamic power of the functional units, while in our work we focus on static power.

2.2 Static Power Reduction Techniques

With decreasing transistor technology sizes, the static power consumption of high-performance microprocessors is increasing exponentially [12]. Techniques have been proposed to reduce the static power consumption of various microprocessor components, including the Level 1

and Level 2 caches [16, 17, 18, 9] and functional/execution units [10, 6, 19]. The additional hardware support needed to reduce the static power consumption of various microarchitectural components is discussed below.

2.2.1 Hardware Techniques to Reduce Static Power

To reduce the static leakage power of microarchitectural components, they need to be put in a low leakage state. A component can be put into a low leakage state by (1) reducing the supply voltage of its transistors, (2) increasing the threshold voltage of the transistors, or (3) changing the gate inputs. The first technique is called power gating [4, 9], the second dual threshold voltage (Vt) [20, 26], and the third input vector control [25, 28].

Power Gating

In power gating, the power supply to the appropriate microarchitectural component is reduced or shut off during idle periods. Leakage power falls as the supply voltage is reduced. While lowering the supply voltage to zero eliminates static power, it also destroys the information stored in the transistors of cache memories. Hence, most of the proposed designs for reducing the leakage power of cache memories [9] put them into a low leakage state that retains the stored data by using Dynamic Voltage Scaling (DVS). For example, in 70nm transistor technology the supply voltage is 1.0V; using DVS it can be reduced to 0.3V while retaining the stored data [9]. The power-gating approach achieves ultra-low leakage power because the device is completely shut off. Sleep transistors are inserted into a logic gate to control the power supplied to the gates of the transistors, as shown in Figure 2.1 [16] for an SRAM cell. When

the signal LowVolt goes high, transistor P2 switches ON and P1 OFF, and the supply voltage becomes V_DD_Low (0.3V).

Figure 2.1: Power Gating Implementation [16]

Dual Threshold Voltage

A microarchitectural component can be put into a low leakage state by raising the threshold voltage (Vt) during idle periods. The leakage power in this case does not go all the way down to zero, as the transistor is still ON. Putting the transistor in a high-Vt state decreases its leakage power and increases its latency. In [20], the problem of optimally assigning threshold voltages to transistors in a CMOS logic circuit is defined, and an efficient algorithm for its solution is given.

Gate Level Leakage Reduction (Input Vector Control)

In [25], a new gate-level leakage reduction technique is proposed that can be used during the logic design of CMOS circuits that already use clock gating to reduce dynamic power. The

original logic design of a multi-gate logic circuit is modified with minimal additional circuitry to force the combinational logic into a low leakage state during an idle period. Based on a library of gates characterized for leakage current, a low leakage input vector is determined using a sampling of random vectors. When parts of a circuit are disabled by clock gating, they still dissipate leakage power; when a circuit is clock gated, its internal state is therefore set to a low leakage state for the idle period. Leakage power reductions of up to 54% have been achieved on the ISCAS-89 benchmark circuits.

In [27], the three techniques discussed above are compared based on their limits and benefits, leakage reduction potential, performance penalty, and area and power overhead. Power gating achieves the maximum possible leakage reduction, but at the cost of large overheads. Dual Vt has the lowest leakage energy savings, but it retains the internal state. Input vector control (IVC) performs better than dual Vt, but it changes the internal state as the gate inputs are changed. Also, IVC can be applied only to circuits that are clock gated and equipped with front-end latches. In the approaches described above there is a turn-on latency: when the device is turned back ON it cannot be used immediately, as some time is needed before the circuitry returns to its operating condition. To have a noteworthy impact on leakage energy using the dual Vt technique, Vt has to be increased significantly; since current transistor feature sizes have a low Vt, the latency of implementing dual Vt is high. Also, power gating can put the FU in an ultra-low leakage state by gating the supply all the way to zero, whereas dual Vt cannot. Hence, in this work we assume that power gating is employed for FU shutdown.
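The random-vector sampling idea from [25] can be illustrated with a toy model. The gate library, its leakage numbers, and the restriction that every gate input is a primary input are all invented simplifications; a real flow would evaluate internal nets through a characterized netlist:

```python
import random

# Hypothetical per-gate leakage, indexed by gate type and input pattern.
# Real flows use a gate library characterized for leakage current.
LEAK = {("nand", (0, 0)): 1.0, ("nand", (0, 1)): 2.3,
        ("nand", (1, 0)): 1.8, ("nand", (1, 1)): 4.1}

def circuit_leakage(vector, gates):
    """Sum leakage over gates; gates = [(type, (input_idx_a, input_idx_b))]."""
    return sum(LEAK[(g, (vector[a], vector[b]))] for g, (a, b) in gates)

def low_leakage_vector(gates, n_inputs, samples=1000, seed=0):
    """Sample random primary-input vectors; keep the lowest-leakage one."""
    rng = random.Random(seed)
    return min((tuple(rng.randint(0, 1) for _ in range(n_inputs))
                for _ in range(samples)),
               key=lambda v: circuit_leakage(v, gates))
```

The vector found this way is what the added circuitry would force onto the combinational logic whenever the block is clock gated.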

2.2.2 Compiler Directed Static Power Reduction

Most initial static power reduction techniques focused on utilizing hardware counters to monitor idle periods. These techniques transition to low leakage modes only after fixed periods of inactivity, which incurs an energy penalty. To address this problem, compiler-based approaches have been proposed which dynamically change the turn-off periods [21, 22]. In compiler-based techniques, appropriate profile information is first collected on a program, which is in turn used to generate compiler hints during the program's execution. The profile information is collected (1) statically, by examining the program code; (2) dynamically, by executing the program; or (3) based on previous behavior.

In [22] a compiler-based approach is used to reduce the static power of the Level 1 instruction cache. In this approach, the last use of instructions is identified and the corresponding cache lines that contain them are placed into a low leakage mode. Such an approach was shown to be competitive, in terms of energy and energy-delay product, with hardware-based leakage control techniques. Using compiler-directed techniques, 6.4% more L1 data cache leakage energy is saved in [21] than by using a purely hardware-based approach. Hence, in our approach we use compiler directives to issue the FU shutdown instructions.

2.3 Functional Unit Static Power Reduction

In [6] the potential to power gate functional units is evaluated based on parameterized analytical equations that estimate the break-even point of power gating techniques. There is an overhead energy associated with power gating a functional unit. The break-even point is the point at which the aggregate leakage energy saved by turning the FU OFF equals the energy overhead of switching the header device used to power gate OFF and ON. So, to save energy by turning OFF

a functional unit, its idle period has to be at least the break-even number of cycles. In this study they assume a perfect predictor that can predict the idle intervals of the functional units with no delay. They use the technique of turning OFF a functional unit after detecting a series of idle cycles. They also propose a technique to turn OFF the functional units when a mispredicted branch has been detected, since the units are going to be idle while the current instructions in the pipeline are flushed. Their results show that floating-point units can be put to sleep for up to 28% of the execution cycles at a performance loss of 2%. Using the branch misprediction guided technique, the fixed-point units can be put to sleep for up to an additional 6% of the execution cycles compared to the previous approach. This approach is purely hardware based, where hardware counters are used to detect idle periods in a FU, as compared to ours, which is compiler directed. Such an approach has disadvantages as described in Section 1.3. Also, this technique reports only the percentage of time the FUs can be put to sleep, with no reference to the percentage of energy saved. As power cycling the FU consumes energy, shutting down the FUs for 28% of the time might not necessarily lead to power savings. In our approach, we propose a technique wherein the FU shutdown is done based on the power savings, and the results are reported in terms of power savings too.

The work most similar to ours is presented in [10], where FU static power dissipation is optimized by power gating these units during idle periods. This is a compiler driven FU shutdown technique where the FU turn OFF/ON directives are based on compiler hints. Program regions with low ILP, and thus low functional unit demands, are detected. The compiler can examine all of the code off-line and, therefore, identify suitable regions for turning the FUs OFF.
Large subgraphs that represent control structures (e.g., loops) are identified in the control flow graph; these are called power blocks. These blocks are then classified into hot blocks, whose execution frequencies are greater than a certain threshold, and cold blocks, which are the remaining blocks in the program. The functional unit usage in each block is also analyzed to identify the units that are

expected to be idle in that block. OFF and ON directives are placed in cold blocks adjacent to the hot blocks in which the unit is expected to be idle. This information is then communicated to the hardware by generating special OFF and ON directives. The idle periods of the functional units may vary greatly in duration, and for this strategy to work well the idle periods should be long. This is because turning the FU OFF and ON incurs additional energy, and the energy saved by turning the FU OFF must be greater than this overhead. The greater the idle period, the greater the energy savings from turning the FU OFF. To nullify short idle periods, they turn OFF the FU only after a certain number of clock cycles have elapsed after the turn OFF directive is issued. Such a strategy incurs an energy loss by leaving the FU ON for a duration in which shutting the FU OFF would save power. Also, the FU idle periods are detected purely based on utilization rates within a basic block, and the overhead energy required for FU power cycling is not considered. In our approach, we quantify the energy consumed by the FUs and the overhead energy for power cycling, and use this information to detect short and long idle periods and to drive FU shutdown.

In [24] a compiler-based technique for optimizing leakage energy consumption in VLIW functional units is proposed. A data-flow analysis is done to detect idle functional units along the control flow graph paths, and then a leakage control mechanism inserts the FU turn OFF/ON directives. FU idleness is defined at the basic block level by detecting whether a FU will be used by an operation in a basic block. All the FUs of a type are turned ON even if there is only one operation that needs them. In our implementation, we turn ON only the required number of FUs in a basic block. Two leakage control mechanisms are evaluated: (1) power gating, and (2) input vector control [25].
Input vector control is a gate level leakage reduction technique that exploits the state dependence of the leakage current and sets the logic gate inputs to the values that produce the minimum leakage current when the units are idle. The input vector control mechanism led to about 45% savings in leakage power, while the power gating mechanism did not perform well due to the re-activation time of the FU. Compared to our implementation, the energy overhead

incurred to transition the FU into a low leakage state is not considered, and the savings are reported only in terms of the leakage energy saved, not the total energy.

3. FU Shutdown

In Chapter 2, various techniques for reducing static processor power by FU shutdown were discussed. These techniques are either purely hardware based and/or their results are based on the percentage of time the FUs spend in the OFF state. Further, none of them account for the overhead or the total energy consumed. In our implementation, the FU shutdown directives are issued by the compiler, and we consider both the extra energy (overhead) incurred to power cycle the FUs and the total energy saved by shutting down FUs during their idle periods.

To implement FU shutdown, our algorithm first generates a Control Flow Graph (CFG) from the static representation of the program code, which is subsequently annotated with initial FU requirements for each program basic block (BB). Our algorithm then analyzes the tradeoff between leaving FUs ON and turning FUs OFF when they are not utilized by consecutive BBs. Through this analysis, long and short idle periods are detected and FU requirements are optimized for minimum energy consumption. The FU requirements are then translated into compiler-generated instructions that are used during program execution to physically switch the FUs OFF/ON using the power gating mechanism described earlier. The generation and annotation of the CFG, the energy estimation, and the FU requirement optimization algorithm are described in the following sections.

3.1 CFG Generation

We first generate a CFG from the compiled static representation of the program code. A CFG is an abstract representation of a program, where each node in the graph represents a basic block, i.e., a straight-line piece of code; jump targets start a block, and jumps end a block. The assembly code of a program is fed to a functional simulator, which is a fast and less detailed

simulator with no time accounting, and the branch instructions are identified. For every branch instruction fetched, its target address(es) and the number of times the branch transitions to each of its target address(es) are captured. The target of a branch instruction starts a basic block (BB) and a branch instruction marks the end of a basic block. This information is encoded in a CFG, where each node represents a BB. The transition probabilities between the nodes (pictured on arcs in Figure 3.2) are generated from the number of times a branch instruction transitions to each of its respective target address(es). These probabilities are used to guide our FU requirement optimization process, which is discussed in Section 3.3.

Next, we identify the initial (pre-optimized) FU requirements of each node and embed this information in the CFG. To accomplish this, a second analysis of the static code is done to determine the number and class of instructions in each BB, from which we extract initial FU requirements. Table 3.1 shows the FU requirements for different instruction classes. Based on the class of an instruction, the required FU can be identified.

INTEGER ALU    INTEGER MULTIPLY    FLOATING-POINT ALU    FLOATING-POINT MULTIPLY
IntALU         IntMult             FloatADD              FloatMult
Control        IntDiv              Float Compare         FloatDiv
Memory                             Float Convert         FloatSQRT
Table 3.1. Instruction Class FU Requirements

Apart from identifying the types of FUs required by a BB, we also need to determine the required number of FUs of each type. Knowing the number of instructions within a BB is not enough information to accurately determine the number of FUs required for instruction execution. Due to dependencies between BB instructions, the raw number of instructions of a class cannot be taken as the BB's FU requirement. An accurate determination of FU requirements could only be made through dynamic analysis, which is undesirable due to the complexity and time associated with gathering this information.
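The branch-profile bookkeeping described above can be sketched as follows; `CFGNode`, `build_cfg`, and the four-entry requirement vector are illustrative names of our own, not the simulator's actual data structures.

```python
from collections import defaultdict

class CFGNode:
    """One basic block; FU requirement vector [IntAdd, IntMul, FpAdd, FpMul]."""
    def __init__(self, bb_id):
        self.bb_id = bb_id
        self.fu_req = [0, 0, 0, 0]           # filled in by the static analysis
        self.succ_counts = defaultdict(int)  # target BB id -> times taken

    def transition_probs(self):
        """Normalize the per-target branch counts into arc probabilities."""
        total = sum(self.succ_counts.values())
        return {t: c / total for t, c in self.succ_counts.items()} if total else {}

def build_cfg(branch_profile):
    """branch_profile: (source_bb, target_bb) pairs, one per branch event
    captured by the functional simulator."""
    nodes = {}
    for src, dst in branch_profile:
        nodes.setdefault(src, CFGNode(src)).succ_counts[dst] += 1
        nodes.setdefault(dst, CFGNode(dst))
    return nodes

# BB1 branched to BB4 six times, BB3 three times, BB2 once.
cfg = build_cfg([(1, 4)] * 6 + [(1, 3)] * 3 + [(1, 2)])
probs = cfg[1].transition_probs()
# Children are later traversed in decreasing probability order: 4, 3, 2.
order = sorted(probs, key=probs.get, reverse=True)
```

The same decreasing-probability ordering is what Section 3.3 uses when visiting a node's children during optimization.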
Therefore, we estimate basic block FU requirements based on a static read-after-write (RAW) dependence analysis of instructions. The instruction dependences within each BB

are represented in a tree structure, where nodes represent instructions and edges between nodes indicate dependence. Nodes that are at the same level in the tree contain independent instructions. The level, or depth, of the tree that contains the maximum number of instructions is used to estimate the FU requirement. Consider the code sequence for a basic block and its dependence tree shown in Figure 3.1. Here, instructions 3, 5 and 6 are dependent on instruction 1; instructions 3, 4 and 5 are independent. Therefore, the Integer (INT) Add unit requirement for this block is 3, which corresponds to the number of instructions in Level 2. If the maximum number of independent instructions of a particular class exceeds the number of FUs of that class, the FU requirement is set to the maximum number of FUs, which is defined by the simulator.

Figure 3.1: BB Instruction Dependence Tree

Figure 3.2 shows a CFG where each node represents a BB and the arcs represent transitions between BBs. Each node is annotated with its FU requirements and the arcs with transition probabilities. A FU requirement of [a, b, c, d] indicates that the block requires a INT Add units, b INT Multiply units, c Floating Point (FP) Add units, and d FP Multiply units for execution. When control flows from block 1 to block 2, we see that block 2 requires fewer FP FUs and one more INT FU than block 1. Therefore, three FP Adders and one FP Multiply unit can be turned OFF and one INT Multiply FU is turned ON during the execution of block 2. Since block 5 has the same requirement as block 2, no additional OFF/ON operations are required when transitioning from block 2 to block 5. Similarly, when control flows from block 4 to block 6, the FP Multiply unit is left ON, while a unit each of INT Add, INT Multiply and FP Add are

switched ON.

During our optimization process (Section 3.3), we traverse a node's children in decreasing order of their transition probability magnitude. For example, the order of traversing the children of BB1 is 4, 3, 2.

Figure 3.2: CFG with FU Requirements

To understand the accuracy of our FU requirement estimation, we captured the actual FU usage from a dynamic execution of each benchmark. During the dynamic execution of a benchmark, we capture the maximum number and type of FUs used in each cycle for every BB. These requirements are BB accurate but not cycle accurate, as the actual usage in a BB is set to the maximum number of FUs used in any cycle during the BB's execution. For example, if a BB has three INT Add instructions, of which two are executed in parallel and then the third, then the actual usage of this BB is set to 2 INT Add units. Table 3.2 shows the average difference per basic block between our estimation and the actual FU usage for each FU class, and Table 3.3 shows the total number of FUs used for each class. This difference is weighted by the number of times the basic block appears in the benchmark. It can be seen that, on average over all the benchmarks, we are 0.84 units off per BB from the actual usage for INT Add units, and for the rest of the units we are on average 0.11 units or less off per BB. The high difference in the estimation of INT Add units could be due to

the high utilization of the INT Add units, which we either fail to estimate or over-estimate. There is not much of a difference for the other FUs, as their utilization rates are lower. In our configuration, we have 4 INT Add, 4 FP Add, 1 INT Multiply, and 1 FP Multiply FUs.

Table 3.2. Average Difference in FU Requirements Estimation and Actual Usage (FP ADD, FP MUL, INT ADD, and INT MUL columns for art, eon, facerec, fma3d, gzip, mcf, mesa, swim, vortex, vpr, and their average)

Table 3.3. Total Number of FUs Used (same benchmarks and FU classes)

3.2 Energy Estimation

Our FU shutdown technique computes the total energy consumed by BB instructions as the sum of the dynamic and static (leakage) energy dissipated every execution cycle by the individual FUs, plus the overhead energy associated with power cycling a FU, if necessary. Note that the energy is an estimate, since the FU requirements are estimated based on a static dependence analysis (and are not cycle accurate), and the number of cycles that FUs are OFF/ON is

estimated based on the average IPC (instructions committed per cycle) obtained from application execution profiling, rather than the average IPC per instruction class.

Wattch computes the power consumed by a FU (excluding overhead power), P_FU, as:

    P_FU = P_ONused + P_ONunused                        (1)

or

    P_FU = N_ONused * D_FU + N_ONunused * D_FU * LF     (2)
         = Dynamic + Static Power

where D_FU is the dynamic or instantaneous power consumed by a FU; N_ONused is the number of FUs that are ON and in use; N_ONunused is the number of FUs that are ON and not in use; and LF is the Leak Factor, the ratio of static power consumption to total power. In Eqn 2, Wattch [2], the simulator used in this work, assumes that FUs that are ON but not in use consume only their leakage power; units that are OFF consume no power [14]. The values for INT and FP FU dynamic power and cycle time are taken directly from Wattch and are shown in Table 4.2. For current high performance microprocessors at the 65nm transistor technology, the static/dynamic power ratio is about 2/3 [11]. Hence, we assume:

    Leak Factor = 0.4 (40% of total power)              (3)

Power cycling the FUs incurs an energy overhead, which is discussed in more detail in Section 3.2.1. The compiler-generated instructions (see Section 3.4) that work in conjunction with the hardware to physically turn FUs OFF/ON also incur energy overhead for their execution. Therefore, these instructions must be counted and their energy (E_OH) included in the determination of total energy, which is computed as:

    E_total = P_FU * time + E_OH                        (4)

where time = clock cycles * clock cycle time.
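Eqns 1-4 can be collected into a small sketch; the power and timing numbers in the usage lines below are placeholders, not the Wattch values of Table 4.2.

```python
LEAK_FACTOR = 0.4  # Eqn 3: assumed static/(static + dynamic) ratio

def fu_power(d_fu, n_on_used, n_on_unused, leak_factor=LEAK_FACTOR):
    """Eqn 2: ON-and-busy units draw full dynamic power, ON-but-idle units
    draw only leakage (D_FU * LF), and OFF units draw nothing."""
    return n_on_used * d_fu + n_on_unused * d_fu * leak_factor

def total_energy(p_fu, cycles, cycle_time, e_oh):
    """Eqn 4: E_total = P_FU * time + E_OH, with time = cycles * cycle_time."""
    return p_fu * cycles * cycle_time + e_oh

# Illustrative numbers only: 2 busy and 1 idle INT adders at 0.5 W dynamic
# each, run for 100 cycles of a 1 ns clock, with 10 nJ of cycling overhead.
p = fu_power(d_fu=0.5, n_on_used=2, n_on_unused=1)
e = total_energy(p, cycles=100, cycle_time=1e-9, e_oh=10e-9)
```

The overhead term E_OH itself is developed in Section 3.2.1.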

3.2.1 Overhead Energy and BreakEven Cycles

Power cycling FUs incurs energy overhead, since power gating requires a header circuit to perform the physical switching. There is a time and energy cost associated with turning this circuit ON (E_OHon) and with turning it OFF (E_OHoff), where the total energy overhead is E_OH = E_OHon + E_OHoff. This time and energy is proportional to the size and capacitance of the FU. When we turn a FU OFF, the amount of leakage energy saved per cycle increases as the supply voltage, V_DD, gradually decreases. Conversely, when a FU is turned ON, the amount of leakage energy savings decreases as V_DD is charged back up. The breakeven point is the point at which the aggregate leakage energy savings equals the total overhead energy due to switching, E_SAVEDaggregate = E_OHon + E_OHoff. This is illustrated in Figure 3.3, taken from [6]. At T1 the power gating circuit makes the decision to power-gate the unit, and overhead energy is incurred during the time taken to turn OFF. Once the OFF signal is delivered to the gate of the header device at T2, the supply voltage starts going down. As the voltage is reduced, savings in leakage energy begin. At T3 the aggregate leakage energy savings equals the energy overhead of switching the device OFF and ON. At T4 the reduction in supply voltage saturates at 0 and the unit is completely turned OFF, with no leakage power dissipation. At T5 a signal to turn ON the unit is asserted, incurring overhead energy; the device starts to turn ON at T6 and is ON completely by T7. During T6-T7, as the supply voltage is charged back up, the amount of leakage energy savings per cycle gradually decreases to zero. Since specific, often proprietary, information about individual FUs is required to precisely determine the breakeven point, we assume a BreakEven value of 20 cycles based on the work done in [6].
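The break-even decision reduces to a simple threshold test; a minimal sketch, with the aggregate overhead expressed as BE cycles' worth of leakage (this mirrors Eqn 5, given below):

```python
BREAKEVEN_CYCLES = 20  # assumed value, per the discussion above

def overhead_energy(p_fu, leak_factor, be_cycles, cycle_time):
    """Aggregate OFF+ON switching overhead, expressed as be_cycles' worth
    of the FU's leakage energy."""
    return p_fu * leak_factor * be_cycles * cycle_time

def should_power_gate(idle_cycles, be_cycles=BREAKEVEN_CYCLES):
    """Gate only when the idle period exceeds the break-even point; for
    shorter idles the switching overhead exceeds the leakage saved."""
    return idle_cycles > be_cycles

gate_long = should_power_gate(50)   # long idle period: worth gating
gate_short = should_power_gate(6)   # short idle period: leave the FU ON
```

The same leakage-per-cycle quantities feed the optimizers of Section 3.3, which decide per basic block whether an idle period is long enough to gate.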
It is shown in [6] that the BreakEven point is close to 10 clock cycles for transistor technologies in which static power accounts for about 33% of the total power, while our FU energy consumption values are based on those used by Wattch, where static power accounts

for about 10%. We did a sensitivity analysis relating the aggregate energy saved to the BreakEven cycles (Section 5.3), which shows that using a value of 10 results in a large amount of FU power cycling; a BreakEven value of 30 cycles leads to very little power cycling; while a BreakEven value of 20 cycles shows appropriate power cycling activity. Based on Equation (5), E_OH is directly proportional to the BreakEven cycles. A BreakEven value of 20 cycles means that a FU must be powered OFF for more than 20 cycles for the aggregate leakage energy savings to be greater than the total energy overhead cost. Conversely, if a FU is powered OFF for fewer than 20 cycles, the overhead energy cost for power cycling is greater than the aggregate leakage energy saved. In this case, the total energy consumed is minimized when the FU is left ON during this period. We use the same BreakEven cycles for INT and FP units. Based on the FU power consumption values from Wattch (Table 4.2), the FP FU dynamic power per cycle is three times that of the INT FU, which accounts for the FP FU being more complex and having more capacitance than the INT FU. Therefore, although we use the same BreakEven values for INT and FP FUs, the difference in the energy consumed to turn them OFF is accounted for by their respective dynamic power values.

Figure 3.3: Illustration of Breakeven Point [6]

The energy overhead attributed to FU power cycling can be expressed in terms of leakage energy and BreakEven cycles as:

    E_OH = P_FU * LF * BE_cycles * Cycle_time            (5)

where BE_cycles is the BreakEven cycles and Cycle_time is the clock cycle time (Table 4.2); the other variables are defined in Eqn 2. When we use the concept of BreakEven cycles, we consider the overhead energy as an aggregate (E_OHon + E_OHoff) rather than as individual overhead energies for turning the FU OFF and ON. Hence, we assume that E_OHon = E_OHoff and compute the total overhead energy as:

    Total E_OH = N_turnon * E_OHon + N_turnoff * E_OHoff  (6)
               = (N_turnon + N_turnoff) * (E_OH / 2)

3.3 FU Requirement Optimization

To maximize the energy savings, our algorithm optimizes basic block functional unit requirements. The optimization depends on accurately detecting short FU idle periods, where the energy overhead for power cycling idle FUs is greater than the aggregate energy saved during these cycling periods. A short idle period is illustrated in Figure 3.4. The FU requirement of basic block 1 (BB1) is 2 INT Add, 1 INT Multiply, 1 FP Add and 1 FP Multiply units; BB2 requires only 2 INT Adders for its execution; BB3 has the same FU requirements as BB1. Assume that BB2 executes for 6 cycles (less than the BE cycles). In this example, the overhead energy required to power cycle the FUs (INT Multiply, FP Add, and FP Multiply) OFF for the execution of BB2 and back ON for BB3 is greater than the static leakage energy saved by turning these FUs OFF during BB2's execution, since the FUs would have to be OFF for at least BE cycles to break even. Our algorithm detects these cases in an application's CFG, determines and compares the total energy dissipation for various FU configurations, and sets the basic block FU requirements to consume minimal energy. This may result in FUs remaining ON during short idle periods, especially in cases where the idle FUs are FP units, since the switching overhead of these units can be large due to their size and complexity (i.e., large capacitance). In Figure 3.4, BB2's FU

configuration may be set to [2, 1, 1, 1] or [2, 1, 0, 0] rather than [2, 0, 0, 0], depending on the result of the energy analysis.

Figure 3.4: Short FU Idle Period

3.3.1 Complexity of Optimization: Local and Global

Optimization based on an exhaustive search of the CFG results in a least-energy FU configuration (Figure 5.7). The exhaustive CFG search tries all possible FU requirement combinations for each basic block and computes the total energy consumed for each combination using the equations of Section 3.2. The energy consumed changes based on the energy overhead, E_OH, for turning FUs OFF/ON, and the energy spent if the FU is left ON for cycle-time (N_ONunused * D_FU * LF * cycle-time). For each FU OFF/ON combination, we calculate the energy and check whether the present FU configuration consumes less total energy. If it does, this combination is set as the FU requirement of the corresponding node. For example, if the INT Add requirement for a sequence of nodes is ( ) along one path and (4-3-4) along the other path, as shown in Figure 3.5, the exhaustive search evaluates the energy consumption for all combinations of FU requirements. The search starts from node 1, evaluating all possible combinations on both paths, from ( ) to ( ) and from (4-3-4) to (4-4-4), and chooses the FU configuration that results in the least energy. The exhaustive search proceeds in decreasing order of transition probabilities. Hence, nodes 1, 4 and 5 are optimized first, followed by nodes 1, 2, 3 and 5. Figures 3.6a and 3.6b

illustrate the various combinations tried for nodes 1, 4, 5 and for nodes 1, 2, 3, 5, respectively, wherein the node being worked on is enclosed by a dashed box.

Figure 3.5: CFG to be optimized

Figure 3.6a: Exhaustive search over nodes 1, 4 and 5

Figure 3.6b: Exhaustive search over nodes 1, 2, 3, and 5
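The exhaustive search over a single path can be sketched as follows; the cost model here is a simplified stand-in (unit leakage per idle-but-ON FU per block, a fixed cost split across each OFF and ON switch), not the thesis's equation-based estimate, and `MAX_FUS = 4` mirrors the four INT Add units of our configuration.

```python
from itertools import product

MAX_FUS = 4  # number of INT Add units in our configuration

def path_energy(config, demand, leak_cost=1.0, switch_cost=12.0):
    """Leakage for units left ON but unused at each node, plus half the
    switching cost per unit turned OFF or ON between adjacent nodes."""
    leak = sum((c - d) * leak_cost for c, d in zip(config, demand))
    switch = sum(abs(a - b) * switch_cost / 2
                 for a, b in zip(config, config[1:]))
    return leak + switch

def exhaustive(demand):
    """Try every per-node FU count from the demand up to MAX_FUS and keep
    the least-energy configuration (ties broken toward fewer units ON)."""
    choices = [range(d, MAX_FUS + 1) for d in demand]
    return min(product(*choices),
               key=lambda cfg: (path_energy(cfg, demand), cfg))

# The (4-3-4) path of Figure 3.5: one short dip in demand.
best = exhaustive((4, 3, 4))
```

With these placeholder costs, the short dip is not worth power cycling, so the search leaves the fourth adder ON, choosing (4-4-4).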

The complexity of this method is O(N^B), where N is the number of FUs and B is the number of basic blocks in the CFG. Because exhaustive analysis of the CFG is computationally infeasible, we implement sub-optimal but computationally feasible solutions called the Local and Global Optimizers. Table 3.4 shows the number of basic blocks in each of the benchmarks we use.

Benchmark   # Basic Blocks
art         7722
eon
facerec
fma3d
gzip        6772
mcf         5913
mesa
swim
vortex
vpr
Table 3.4. Number of basic blocks in each benchmark

3.3.2 Local Optimizer

This optimization is performed one node at a time. For example, if the actual INT Add unit requirement in sequential nodes is ( ), the optimizer starts on node 2, as node 1 already has the maximum number of available units, as shown in Figure 3.7. The optimizer sets the requirement of the second node to 3 and calculates the energy consumed. Setting the FU requirement to 3 removes the overhead of switching a FU OFF from node 1 to node 2 and of switching a FU ON from node 2 to node 3, but consumes extra static energy to leave it ON. If the total energy consumed is less when the requirement is set to 3 rather than 2, then the requirement for node 2 is set to 3. We then optimize node 3 by setting its requirement to 4. If in the previous step the requirement of node 2 was set to 3, then setting the requirement of node 3 to 4 would increase static energy. The optimizer then picks node 4, setting its requirement to 3, which

would reduce the overhead of switching a FU OFF from node 3 to node 4 and of switching a FU ON from node 4 to node 5. Setting the requirement of node 4 to 4 would have the same overhead cost as setting it to 3, but with an increase in static energy. As node 5 already has the maximum number of units available, no optimization is done on it.

Figure 3.7: Local Optimization

Table 3.5 shows the total energy and each of its components for all INT FU configurations examined by the local optimizer, with the FU configuration with the least energy consumption in bold. The values are obtained using the equations described in Sections 3.2 and 3.2.1, and it is assumed that the actual usage of the INT Adder unit is ( ) units. Figure 3.8 shows the pseudo code representation of the local optimizer algorithm.

Table 3.5. Local Optimizer Energy Estimation (columns: INT FU Config, E_ONused, E_ONunused, E_OFF, E_ON, E_Total)

The complexity of this algorithm is O(N*B), where again N is the number of FUs and B is the number of basic blocks in the CFG. The primary advantage of the local optimizer is its relatively low complexity, which leads to a linear increase in computation/optimization time with an increase in the number of basic blocks in the CFG. The main disadvantage is that, since it only

optimizes one node at a time, higher order combinations that might result in increased energy savings are not analyzed. Table 3.6 shows the time taken to locally optimize FU requirements for all of the benchmarks.

Benchmark   Time (minutes)
art         0.1
eon         3.65
facerec     1.2
fma3d       5.75
gzip        0.2
mcf         0.06
mesa        0.4
swim        0.32
vortex
vpr         0.6
Table 3.6. Time for local optimizer for all benchmarks

Figure 3.8: Local Optimizer Algorithm
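The one-node-at-a-time pass can be sketched as below; the leakage and switching costs are illustrative placeholders of our own, not the Wattch-derived values used by the actual optimizer.

```python
def energy_at(config, demand, i, leak_cost, switch_cost):
    """Energy attributable to node i: leakage of its unused-but-ON units
    plus the switching to and from its two neighbors."""
    e = (config[i] - demand[i]) * leak_cost
    e += abs(config[i] - config[i - 1]) * switch_cost / 2
    e += abs(config[i + 1] - config[i]) * switch_cost / 2
    return e

def local_optimize(demand, leak_cost=1.0, switch_cost=12.0, max_fus=4):
    """One pass over interior nodes; raise a node's count whenever leaving
    the extra units ON costs less than power cycling them around the node."""
    config = list(demand)
    for i in range(1, len(config) - 1):
        for candidate in range(config[i] + 1, max_fus + 1):
            trial = config[:]
            trial[i] = candidate
            if (energy_at(trial, demand, i, leak_cost, switch_cost)
                    < energy_at(config, demand, i, leak_cost, switch_cost)):
                config[i] = candidate
    return config

# A short dip like (4, 2, 4) is filled in, avoiding two power-cycling events.
optimized = local_optimize((4, 2, 4))
```

With a much cheaper switch cost the same dip is left alone, which is exactly the short-versus-long idle-period tradeoff the energy comparison encodes.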

3.3.3 Global Optimizer

To take advantage of higher order optimizations in a computationally feasible manner, we divide the CFG into smaller sub-CFGs of a specified depth and perform an exhaustive search for optimal FU requirements on each sub-CFG. The CFG of Figure 3.2 is shown with sub-CFGs of depth one in dashed boxes in Figure 3.9. The depth chosen for the sub-CFGs exhibits a trade off between energy reduction and optimization time. Table 3.7 shows the percentage energy saved and the time taken to optimize locally and globally at depth 2 for some of the benchmarks. It can be seen that as the depth of optimization increases, the energy savings increase, with a corresponding increase in optimization time.

Benchmark   % Energy Saved / Time (minutes)
            Local          Global (depth 2)
art         1.364 /        / 0.3
gzip        15.9 /         / 0.4
mcf         18.56 /        / 0.4
vortex      11.9 /         / 547.5
vpr         4.3 /          / 1.5
Table 3.7. Energy saved versus depth of CFG optimization

Figure 3.9: Depth One sub-CFGs
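The sub-CFG windowing can be sketched on a linear node sequence as follows; the window-local cost model (placeholder leakage and switching costs) is a simplified stand-in for the Section 3.2 estimate, and ignoring switching across window boundaries is precisely the sub-optimality the text describes.

```python
from itertools import product

def window_energy(config, demand, leak_cost=1.0, switch_cost=12.0):
    """Window-local cost: leakage of unused-but-ON units plus switching
    between adjacent nodes inside the window."""
    leak = sum((c - d) * leak_cost for c, d in zip(config, demand))
    switch = sum(abs(a - b) * switch_cost / 2
                 for a, b in zip(config, config[1:]))
    return leak + switch

def global_optimize(demand, depth=3, max_fus=4):
    """Cut the node sequence into windows of `depth` nodes and run the
    exhaustive search inside each; work per window is O(max_fus ** depth)
    rather than exponential in the whole CFG."""
    config = list(demand)
    for start in range(0, len(demand), depth):
        window = slice(start, start + depth)
        dem = list(demand[window])
        choices = [range(d, max_fus + 1) for d in dem]
        best = min(product(*choices),
                   key=lambda cfg: window_energy(cfg, dem))
        config[window] = list(best)
    return config

# Two windows of demand, (4, 2, 3) and (2, 4); both dips are filled in.
optimized = global_optimize([4, 2, 3, 2, 4], depth=3)
```

Combinations the local pass would never try (changing several nodes of a window at once) are evaluated here, at the cost of the non-linear growth in time with depth shown in Figure 3.11.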

The optimizer works as follows: if the INT Add unit requirement in sequential nodes is ( ) and we assume a depth of 3, the optimization is performed over two sub-CFGs, ( ) and ( ). For the ( ) sub-CFG, the FU requirements of these nodes are varied from ( ) to ( ) to try out all possible combinations, and the energy is computed for each combination as shown in Table 3.8. The FU configuration with the least energy consumption is chosen and shown in bold. Comparing global with local optimization for this example, the FU configurations evaluated by the local algorithm include ( ), ( ), ( ) and ( ), which are a subset of the combinations analyzed by the global optimizer, shown in the table. Figure 3.10 shows the pseudo code representation of the global optimization algorithm.

Table 3.8. Global Optimizer Energy Estimation (columns: INT FU Config, E_ONused, E_ONunused, E_OFF, E_ON, E_Total)

The complexity of this algorithm is O(B * N^x), where B is the number of basic blocks in the CFG, N is the number of FUs, and x is proportional to the chosen sub-CFG depth and the depth of the original CFG. If the global optimization depth (sub-CFG depth) is small relative to the CFG depth, x << B; if the global optimization depth is relatively large, x ≈ B. Figure 3.11 presents the computation time versus global optimizer depth for the mcf benchmark, which has

the least number of basic blocks (nodes) in its CFG. This shows that as the depth of optimization is increased, the computation time increases non-linearly. Table 3.9 shows the global optimization times at depth 2 for all benchmarks, and Table 3.10 shows the global optimization times at depth 4. It can be seen that optimization at a depth of 4 takes much longer than at depth 2.

Figure 3.10: Global Optimizer Algorithm

Figure 3.11: Global Optimizer Time Complexity, mcf

Benchmark   Time (minutes)
art         0.4
eon
facerec     6.05
fma3d
gzip        0.45
mcf         0.44
mesa        0.88
swim        0.88
vortex
vpr         1.5
Table 3.9. Global optimization time at depth 2 CFG optimization

Benchmark   Time (minutes)
art         0.6
eon
facerec     74.0
fma3d
gzip        1.2
mcf         1.05
mesa        1.9
swim        2.15
vortex
vpr
Table 3.10. Global optimization time at depth 4 CFG optimization

3.4 Processor Support

In our implementation, we assume that the compiler inserts additional instructions to support the FU OFF/ON operations. At the start of a basic block, the current FU configuration and the new block's requirements are compared to see if any units need to be turned OFF/ON, and the compiler inserts additional instructions accordingly. Figure 3.12 shows the instructions for a sequence of two basic blocks with and without compiler-inserted FU OFF/ON instructions. In BB0, an instruction is inserted (mul.1.on) to turn ON the INT Multiply unit. In BB1, three INT Add units are turned ON (add.3.on) since there are three independent INT add instructions

(Figure 3.1), and the INT Multiply unit used in BB0 is turned OFF (mul.1.off) since it is not required for the execution of BB1.

Figure 3.12: Compiler-Inserted Instructions

Hardware support is required to power cycle the FUs when compiler-inserted OFF/ON instructions are encountered. As discussed earlier, we use the power gating technique to power cycle the FUs. Processor logic implemented in the issue stage checks ready-to-issue instructions and their FU requirements against the available ON and busy FUs. This requires implementing additional comparator circuitry and registers to store the current number of FUs that are ON. As the FU requirement of a BB can be stored in just 6 bits of data, the additional comparator circuit consumes very little power. If all ON units are busy but not all available units are ON, then either (1) an additional unit is switched ON with a Busy ON processor signal; or (2) a Wait If Busy signal causes the issuing instructions to wait for the next available FU. We have implemented both of these strategies.
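The directive generation at a basic block boundary can be sketched as follows; the `fadd`/`fmul` mnemonics extend the add/mul examples of Figure 3.12 and are our own guess for the FP units, as is the starting configuration in the example.

```python
# FU classes in requirement-vector order: [IntAdd, IntMul, FpAdd, FpMul].
FU_NAMES = ("add", "mul", "fadd", "fmul")

def directives(prev_req, next_req):
    """Compare the running FU configuration against the next BB's
    requirement and emit one OFF/ON instruction per class that changes."""
    out = []
    for name, before, after in zip(FU_NAMES, prev_req, next_req):
        if after > before:
            out.append(f"{name}.{after - before}.on")
        elif after < before:
            out.append(f"{name}.{before - after}.off")
    return out

# Assumed: BB0 used only the INT Multiply unit; BB1 needs three INT Adds.
emitted = directives([0, 1, 0, 0], [3, 0, 0, 0])
```

This reproduces the BB1 directives of Figure 3.12: add.3.on followed by mul.1.off; identical requirement vectors, as between blocks 2 and 5 of Figure 3.2, produce no instructions at all.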

FUs are turned ON after the first instruction of each basic block. If control flows into a block after its first instruction, the FU requirement of the block is unknown. In that case, if no unit of the type required by the subsequent instruction is ON, all of the available units of that type are turned ON with a Hardware ON processor signal. For example, if control flows into a block at its third instruction and that instruction is an Integer Add, then all the INT Add units are turned ON. Unneeded FUs are turned OFF at the commit stage of the first instruction of each basic block, which ensures that all instructions of the previous block have successfully committed.

3.5 Performance Penalty

Performance and energy penalties are incurred when a FU cannot be assigned to a ready-to-issue instruction because (1) it is not fully ON (i.e., it is in the process of being turned ON), (2) no FU of the required type is ON, or (3) all ON FUs of the required type are busy and one or more are OFF. In case (2), a performance penalty is incurred when FUs are turned ON using the Hardware ON processor signal described in Section 3.4. In case (3), a performance penalty is incurred if a ready-to-issue instruction waits until a FU becomes available or until an additional FU is switched ON. The FU OFF/ON instructions themselves add a further penalty, but since they execute while the BB instructions are stalled waiting for FUs, this cost is negligible. We account for the performance penalty as the increase in execution time (clock cycles) when our technique is implemented. Our results in Chapter 5 indicate that the performance degradation due to implementing FU shutdown is small.
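The mid-block-entry rule above can be sketched as a small decision function; the dictionary layout, the function name, and the unit-count table are illustrative assumptions.

```python
# Available units per type, per the configuration used in this work.
AVAILABLE = {"int_add": 4, "int_mul": 1, "fp_add": 4, "fp_mul": 1}

def units_on_entry(entered_at_top, block_requirement, next_fu_type, on_counts):
    """Units to have ON when control enters a basic block.

    Entering at the top, the compiler-known block requirement is applied.
    Entering mid-block, the requirement is unknown: if no unit of the type
    needed by the next instruction is ON, the Hardware ON signal powers ON
    all available units of that type (Section 3.4).
    """
    on = dict(on_counts)
    if entered_at_top:
        for fu, need in block_requirement.items():
            on[fu] = max(on.get(fu, 0), need)
    elif on.get(next_fu_type, 0) == 0:
        on[next_fu_type] = AVAILABLE[next_fu_type]   # Hardware ON: all units of the type
    return on

# Control enters at the third instruction (an INT add) with all units OFF:
on = units_on_entry(False, {}, "int_add",
                    {"int_add": 0, "int_mul": 0, "fp_add": 0, "fp_mul": 0})
```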
There is also an increase in power associated with the increase in clock cycles caused by implementing FU shutdown. We calculate this performance-related power by implementing FU shutdown with a FU turn-on latency of 0 cycles and comparing it to regular FU shutdown with a turn-on latency of 3 cycles; the difference in power between the two is the extra power due to the performance penalty.

3.6 Variations of the Basic Algorithm

To examine the effects of the various factors involved in our CFG generation, FU requirement gathering, CFG optimization, and energy estimation, we tried a few techniques, described below, in which we modify the gathered FU requirements during or after optimization.

1. Turning ON FUs in the previous BB: As discussed in Section 3.4, FUs are turned ON after the first instruction of the BB. This can decrease performance because the BB instructions must wait until the FUs are turned ON. By turning ON the FUs a block earlier, this performance loss can be avoided, but static energy increases since the FUs are ON for a longer duration.

2. Leaving 2 FP FUs always ON: FP FUs have a high overhead energy due to their complexity. Leaving 2 FP FUs always ON reduces overhead energy and also lessens performance degradation, since two instructions of a BB can issue without waiting for a FU to be turned ON.

3. Leaving all FP FUs always ON: Most INT benchmarks have very few FP computations, so their FP FUs can be turned OFF for long periods. FP benchmarks, in contrast, have higher FP FU utilization and hence shorter idle periods, and frequent switching of the FP FUs can hurt total energy due to the high overhead cost. By leaving all the FP FUs ON and power cycling only the INT FUs when appropriate, the tradeoff between FP FU overhead energy and static energy dissipation can be observed.
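Variation 1 above amounts to a simple transformation of the per-block requirement table; the dictionary layout and function name below are illustrative assumptions, and real CFGs have branching successors rather than the straight-line order shown.

```python
def hoist_turn_on(cfg_order, requirements):
    """Fold each block's FU requirement into its predecessor.

    `cfg_order` lists basic-block ids in execution order and
    `requirements[bb]` maps FU type -> units needed.  Every block keeps ON
    what it needs itself plus what the next block needs, so the successor's
    units are already warm when it starts (variation 1 of Section 3.6).
    """
    hoisted = {bb: dict(req) for bb, req in requirements.items()}
    for prev, nxt in zip(cfg_order, cfg_order[1:]):
        for fu, need in requirements[nxt].items():
            hoisted[prev][fu] = max(hoisted[prev].get(fu, 0), need)
    return hoisted

# BB0 now also powers the three INT Add units that BB1 will need:
reqs = {"BB0": {"int_mul": 1}, "BB1": {"int_add": 3}}
hoisted = hoist_turn_on(["BB0", "BB1"], reqs)
```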

4 Experimental Platform

To implement and validate the FU shutdown technique, we use Wattch [2], which supports the simulation of a MIPS-like superscalar, out-of-order, speculative pipeline and includes power models for microprocessor power dissipation. Our functional unit configuration consists of 4 Integer (INT) Add units, 1 INT Multiply unit, 4 Floating Point (FP) Add units, and 1 FP Multiply unit. The various tools used are described below.

4.1 The SimpleScalar Simulator

The SimpleScalar tool set [3] is a suite of tools to build and run execution-driven simulators based on the SimpleScalar architecture. The tool set includes several fast functional simulators and a detailed, cycle-accurate simulator of an out-of-order issue processor that supports non-blocking caches, speculative execution, and state-of-the-art branch prediction. The simulators collect performance statistics related to the various components of the microarchitecture. Sim-outorder is the most detailed, cycle-accurate simulator, supporting out-of-order issue and execution. Sim-fast is the fastest, least detailed functional simulator. Sim-cache simulates the cache hierarchy with support for different configurations and replacement policies. We use sim-safe, another functional simulator that improves on sim-fast by checking access permissions on memory references, to generate the CFG. To gather the FU requirements we use sim-profile, which generates profile information on instruction classes and addresses, memory accesses, branches, and data segment symbols. Using sim-profile we identify the instruction classes and then perform a static RAW dependence analysis to determine the FU requirements.
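The static RAW analysis above can be sketched as follows. The three-field instruction tuples, register names, and the sharing rule (a dependent chain reuses its producer's unit) are illustrative assumptions about how the per-block requirement is derived.

```python
def fu_requirement(block):
    """Per-type FU requirement of one basic block via a static RAW check.

    Each instruction is (fu_type, dest_reg, src_regs).  An instruction that
    reads an earlier instruction's destination RAW-depends on it and can
    share a unit with its producer; independent instructions each need
    their own unit (cf. three independent INT adds -> add.3.on).
    """
    need = {}
    for i, (fu, _dest, srcs) in enumerate(block):
        depends = any(d_prev in srcs for _, d_prev, _ in block[:i])
        if depends:
            need.setdefault(fu, 1)       # a dependent chain still needs one unit
        else:
            need[fu] = need.get(fu, 0) + 1
    return need

# Three independent INT adds need three INT Add units:
independent = [("int_add", "r1", ("r10", "r11")),
               ("int_add", "r2", ("r12", "r13")),
               ("int_add", "r3", ("r14", "r15"))]
# A two-instruction RAW chain shares a single unit:
chain = [("int_add", "r1", ("r4", "r5")),
         ("int_add", "r2", ("r1", "r6"))]
```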

4.2 The Wattch Simulator

Wattch [2] is based on the SimpleScalar tool set. It provides a power evaluation methodology within the portable and familiar SimpleScalar framework by incorporating estimates of the power consumed by the various components of the microarchitecture. Wattch is 1000X or more faster than existing layout-level power tools, yet maintains accuracy within 10% of their estimates as verified using industry tools on leading-edge designs [2]. These results are based on an analysis of the SPEC95 benchmark suites in which each program was run for 200 million instructions. We have modified the sim-outorder simulator in Wattch to implement FU shutdown as described in Chapter 3.

4.3 SPEC CPU2000 Benchmarks

A benchmark is a standardized program for performance evaluation; a collection of such programs is called a benchmark suite. The Standard Performance Evaluation Corporation (SPEC) is a non-profit corporation formed to establish, maintain, and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers. SPEC CPU2000 [28] is the industry-standard CPU-intensive benchmark suite, designed to provide a comparative measure of compute-intensive performance across the widest practical range of hardware. Its source-code benchmarks were developed from real user applications and measure the performance of the processor, memory, and compiler on the tested system. SPEC CPU2000 consists of 12 Integer and 14 Floating Point benchmarks; in this work we use a subset of the 26-benchmark suite.

4.3.1 Subset of SPEC CPU2000 Benchmarks

We use a subset of five Integer and five Floating Point benchmarks from the SPEC CPU2000 suite, chosen based on the overall program characteristics given in [1], to validate our technique. Table 4.1 shows the eight clusters obtained from SPEC CPU2000 when measuring similarity based on locality, branch predictability, and ILP [1]. For clusters with just two programs, either program can be chosen as the representative. If a cluster has more than one program and all are either INT or FP programs, we picked only one of them. However, if a cluster has more than one program and they are a mix of INT and FP programs, we picked one INT and one FP program from it.

Cluster 1: applu, mgrid
Cluster 2: gzip, bzip2
Cluster 3: equake, crafty
Cluster 4: fma3d, ammp, apsi, galgel, swim, vpr, wupwise
Cluster 5: mcf
Cluster 6: twolf, lucas, parser, vortex
Cluster 7: mesa, art, eon
Cluster 8: gcc

Table 4.1. Subset of SPEC CPU2000 benchmarks taken from [1]

4.4 Framework

In our energy estimates, we use the Wattch values for INT FU power, FP FU power, and cycle time. For technology parameters, Wattch uses the process parameters for a 0.35um process at 600MHz (cycle time of 1.6ns). We modified Wattch to include the overhead energy computations described earlier and shown in Table 4.2. We assume a FU turn-on latency of 3 cycles based on [6]. Wattch uses a leak factor of 0.1; we instead use a leak factor of 0.4, which corresponds to current transistor technologies. Our results in Chapter 5 are relative, shown as a percentage improvement over the base case with no FU shutdown.

Simulation Parameter                           Value
FU Energy per cycle for Integer Units          J
FU Energy per cycle for Floating Point Units   J
Cycle time                                     1.6 ns
Leak Factor                                    0.4 (40% of actual power)
BreakEven Cycles                               20
FU Turnon Latency                              3 cycles
Functional Units                               4 INT Add, 1 INT Multiply, 4 FP Add, 1 FP Multiply
FU Energy Overhead                             Energy per cycle * cycle time * Leak Factor * BE cycles
Energy to Switch On                            FU Energy Overhead * 0.5
Energy to Switch Off                           FU Energy Overhead * 0.5

Table 4.2. Simulation Parameters
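The overhead entries of Table 4.2 reduce to a short calculation. In the sketch below the per-cycle FU energy is a made-up placeholder of 1.0, since the actual Wattch values did not survive transcription; the formula itself follows the table.

```python
def fu_overhead_energy(fu_energy_per_cycle, cycle_time_s=1.6e-9,
                       leak_factor=0.4, be_cycles=20):
    """Overhead terms of Table 4.2 for power cycling one FU:

      FU Energy Overhead  = energy per cycle * cycle time * leak factor * BE cycles
      Energy to switch ON = Energy to switch OFF = FU Energy Overhead * 0.5
    """
    overhead = fu_energy_per_cycle * cycle_time_s * leak_factor * be_cycles
    return {"overhead": overhead, "on": overhead * 0.5, "off": overhead * 0.5}

# Placeholder per-cycle energy of 1.0 (the real Wattch value belongs here):
cost = fu_overhead_energy(fu_energy_per_cycle=1.0)
```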

5 Results

The main goal of this work is to minimize the static (leakage) power of the functional units during their idle periods. Figure 5.1 shows the percent energy breakdown into FU static energy and total dynamic processor energy for all the benchmarks in the base case, where FU shutdown is not implemented. By implementing FU shutdown we propose to reduce the total energy consumption by decreasing the static energy.

Figure 5.1: Original Percent Energy Breakdown

5.1 Effect of FU Shutdown

The effectiveness of FU shutdown is shown in Figure 5.2, which gives the percentage of power savings realized by each benchmark for various CFG optimizations: no-opt uses the original FU requirements to compute the total energy dissipation prior to any optimization; local uses the Local CFG Optimizer; global2 uses the Global CFG Optimizer with depth 2; and FP ON global2 leaves all FP FUs ON (powering OFF only INT FUs when appropriate). We chose a depth of 2 for global optimization because it demonstrated an acceptable tradeoff between optimization time and energy reduction. At this depth, optimization time ranged from a few minutes to a few days for the benchmarks with a large number of basic blocks (eon, vortex, and fma3d). Higher depths of optimization are not computationally feasible on all benchmarks (Table 3.9); global optimization at depth 4 takes much longer with little additional power savings. The power savings are computed with respect to the baseline case where FU shutdown is not implemented. FU shutdown saves a maximum of 18.7% on the mcf benchmark with global optimization of depth 2. The benchmarks that show an increase in energy consumption (negative savings) are those with high FP computation and are analyzed further below; by leaving all FP FUs ON and shutting down only INT FUs when appropriate, these benchmarks show positive results.

Figure 5.2: Total Energy Savings (%)

The negative energy savings noted for some benchmarks at the various levels of CFG optimization may be attributed primarily to three characteristics: the instruction mix, the energy overhead to power cycle FP FUs, and the inaccuracy of the FU requirements set in the CFG. The benchmarks that exhibit an increase in total energy are those that execute a relatively high percentage of FP operations. Table 5.1 shows the dynamic instruction mix for the benchmarks used. Rather than classifying instructions by class (INT, FP, LOAD/STORE, and BRANCH), we classify them by the FU they use, since memory access and branch instructions also access the FUs for memory address, branch outcome, and branch target address calculations. Eon, swim, facerec, fma3d, and mesa have a high percentage of FP ops and exhibit an increase in total energy when implementing FU shutdown. Gzip, mcf, and vortex show the largest decrease in total energy using FU shutdown and are all characterized by a large percentage of integer operations and few to no FP operations. Table 5.2 shows the dynamic FU usage per BB for all the benchmarks, with those showing negative savings highlighted. The number of FUs used in a BB is the maximum number of FUs used in any cycle during the execution of the BB, and the average is weighted by the number of times the BB is executed. The benchmarks with negative savings have the highest FP FU utilization.

Benchmark   % INT Add   % INT Multiply   % FP Add   % FP Multiply
art
eon
facerec
fma3d
gzip
mcf
mesa
swim
vortex
vpr

Table 5.1. Dynamic Benchmark Instruction Mix
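The per-BB figures reported in Table 5.2 are execution-weighted averages. The sketch below shows that weighting; the block names and sample counts are invented for illustration.

```python
def weighted_avg_fu_use(blocks):
    """Average FUs of one type used per BB, weighted by execution count.

    `blocks` maps bb id -> (max FUs used in any cycle of the BB, number of
    times the BB executed), matching the definition used for Table 5.2.
    """
    total_execs = sum(execs for _, execs in blocks.values())
    return sum(fus * execs for fus, execs in blocks.values()) / total_execs

# A hot low-usage block dominates a rarely-run high-usage one:
avg = weighted_avg_fu_use({"BB0": (1, 90), "BB1": (3, 10)})
```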

Benchmark   INT Add   INT Multiply   FP Add   FP Multiply
art
eon
facerec
fma3d
gzip
mcf
mesa
swim
vortex
vpr

Table 5.2. Average Number of FUs Used in a BB

FP FUs have a relatively large energy overhead for power cycling due to their size and capacitance. Additionally, the CFG FU requirements are determined statically, from the instruction types and dependencies within basic blocks. If a basic block's FU configuration is set incorrectly by the static analysis, and that block dynamically accounts for a large percentage of the total execution time (as is often the case for benchmarks classified as FP), a large energy overhead results, including the energy of power cycling FUs unnecessarily. To further investigate the negative savings, we implemented the techniques described in Section 3.6: (1) leaving 2 FP FUs always ON, (2) turning ON all FP FUs in the previous basic block, and (3) leaving all FP FUs always ON and powering OFF only INT FUs (when appropriate). Figure 5.3a shows the total energy breakdown for these strategies, along with the base, global2 CFG optimization, and FP ON global2 cases, for the eon benchmark, split into overhead energy, FU static energy, and total dynamic energy. Leaving the FP FUs always ON gives the best energy savings by reducing the static energy of the INT FUs. Compared to global2, leaving 2 FP Add units always ON and turning ON FP FUs in the previous basic block reduce the overhead energy (fewer units are power cycled) but increase the static energy, as more units are ON. Figures 5.3b, 5.3c, 5.3d, and 5.3e show similar plots for the other benchmarks with negative savings: facerec, fma3d, mesa, and swim, which show analogous characteristics.

Figure 5.3a: Different Strategies for eon
Figure 5.3b: Different Strategies for facerec

Figure 5.3c: Different Strategies for fma3d
Figure 5.3d: Different Strategies for mesa

Figure 5.3e: Different Strategies for swim

Table 5.3 gives the average and maximum percentage of total energy saved for each of these FU shutdown optimizations. With no optimization, FU shutdown saves a maximum of approximately 18% and an average of 0.60% of the total energy. The low average is caused by the energy increase for the FP benchmarks (e.g., the average energy increase for FP applications is 3.68%). Using the global2 CFG optimization, an average of around 4% of the total energy is saved across all benchmarks, with an average of approximately 9.5% savings for integer applications. For all of the benchmarks except swim, CFG optimization increases the energy savings. For swim, increasing levels of optimization have little effect on energy because the overhead energy is very small, leaving little scope for optimization; the small overhead energy can be attributed to swim's high FU utilization (Table 5.2), due to which most of the available FUs are always turned ON with very little power cycling activity. vortex, vpr, gzip, and mcf benefit considerably from CFG optimization due to their low FU utilization rates; mcf, which has the lowest FU utilization, exhibits the maximum energy savings.

                    Average/Maximum % Energy Savings
Optimization        INT        FP        Total (INT & FP)
no-opt              4.88/      /         /17.98
local               8.72/      /         /18.56
global2             9.57/      /         /
2 FP Add            10.02/     /         /18.66
Previous BB FP      10.04/     /         /18.66
FP ON global2       /          /         /18.66

Table 5.3. % Total Energy Savings by implementing FU Shutdown

For the benchmarks that show an increase in energy consumption under all CFG optimizations, we leave all FP units ON while implementing FU shutdown for the integer functional units. This yields a small but positive maximum energy savings of 4.81%, with an average savings of 3.66%, for these benchmarks.

5.2 Energy Breakdown

Figure 5.4a shows the total energy breakdown for the different CFG optimizations for the vortex benchmark. The energy consumption comprises dynamic energy, static (leakage) energy, and, when FU shutdown is used, overhead energy. The baseline case (no FU shutdown) consumes the most energy, and each additional level of CFG optimization further decreases the total energy dissipated. The leakage energy is largely reduced by FU shutdown at all optimization levels. Note that with increasing levels of optimization there is a decrease in overhead energy with a small rise in static energy, as the optimizer tends to leave more units ON to minimize overhead energy. Finally, the increase in dynamic energy for no-opt compared to the base case is due to the initial FU requirements: since these are based on a static dependence analysis, fewer FUs than are actually required are turned ON, which increases the number of cycles executed. Figures 5.4b, 5.4c, and 5.4d show the average energy breakdown over the INT, FP, and all benchmarks respectively. Similar trends appear in all the plots except Figure 5.4c, where energy consumption increases for the FP benchmarks.

Figure 5.4a: Total energy breakdown for vortex
Figure 5.4b: Total energy breakdown for INT benchmarks

Figure 5.4c: Total energy breakdown for FP benchmarks
Figure 5.4d: Total energy breakdown over all benchmarks

5.3 Performance Degradation

The performance degradation is presented in Figure 5.5, which compares the execution time in cycles with no FU shutdown to that with optimized and un-optimized FU shutdown, normalized to the execution time with no FU shutdown. For most benchmarks, the performance degradation of the optimized execution relative to the baseline is about 1%. Because the optimizer tends to turn ON more units than required in order to reduce overhead energy, execution is faster, since no turn-ON latency penalty is incurred for a needed unit; accordingly, the largest performance degradation occurs in the no-opt case, and once the CFG is optimized with the local and global2 CFG optimizations, the degradation drops significantly. For swim there is little performance degradation even in the no-opt case, due to its high FU utilization, which causes most of the FUs to be set ON initially. For the benchmarks that show negative savings with the global2 CFG optimization, leaving the FP FUs ON improves performance.

Figure 5.5: Execution Time in Clock Cycles

Table 5.4 shows, for the global2 optimizer, the percentage of energy due to the performance loss shown in Figure 5.5, along with the percentage energy saved and the percentage increase in cycles. The energy due to performance was estimated by setting the FU turn-on latency to 0 cycles. eon, facerec, mesa, and vpr are the benchmarks most impacted by the decrease in performance.

Benchmark   % Energy (Performance)   % Energy Saved   % Increase in Cycles
art
eon
facerec
fma3d
gzip
mcf
mesa
swim
vortex
vpr

Table 5.4. Increase in Energy Due to Performance for global2

5.4 Sensitivity Analysis on BreakEven Cycles

BreakEven (BE) cycles are used to compute the energy overhead to power cycle FUs. A smaller number of BE cycles means a lower energy cost for power cycling, so FUs can be turned OFF more frequently; a larger number of BE cycles means a greater energy cost, so FUs can be turned OFF less frequently. From Figure 5.7a, we see that the total energy savings decrease as BE cycles increase, primarily due to the increase in overhead energy. There is also an increase in static energy (Figure 5.6), as more units are left ON and power cycling activity decreases because of the high overhead cost. Based on this analysis and the work done in [6], we found a BE value of 20 cycles to be ideal for our implementation. It is shown in [6] that a BE value of 10 cycles is ideal for transistor technologies in which static power is about 25% of the total power. In our implementation, the FU power values are based on a technology where static power is about 10% of the total power, so the BE value must be greater than 10. A BE value of 20 cycles shows reasonable savings, while a BE value of 30 yields negative savings. Figures 5.7b and 5.7c show similar plots of total energy saved versus BE cycles for other benchmarks. Figures 5.7a, 5.7b, and 5.7c represent an FP benchmark with positive savings, an INT benchmark, and an FP benchmark with negative savings, respectively, at a BE value of 20 cycles and global optimization at depth 2.

Figure 5.6: Energy Breakup for different BE values for art
Figure 5.7a: BE Cycles Sensitivity, art
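The break-even tradeoff above can be reduced to a one-line test: shutting a unit down pays only when its idle period exceeds the BE threshold. A minimal sketch, with the leakage-per-cycle value as a free parameter (the function and its defaults are illustrative, not the simulator's actual bookkeeping):

```python
def shutdown_saves_energy(idle_cycles, be_cycles=20, leak_per_cycle=1.0):
    """True if powering a FU OFF for `idle_cycles` is a net energy win.

    By the break-even definition the power-cycling overhead costs
    `be_cycles` worth of leakage, so shutdown pays only when the leakage
    avoided while OFF exceeds that overhead; larger BE values therefore
    make shutdown profitable less often.
    """
    leakage_avoided = idle_cycles * leak_per_cycle
    cycling_overhead = be_cycles * leak_per_cycle
    return leakage_avoided > cycling_overhead
```

Raising `be_cycles` from 20 to 30 flips marginal idle periods from profitable to unprofitable, which matches the trend seen in Figures 5.7a through 5.7c.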

Figure 5.7b: BE Cycles Sensitivity, vpr
Figure 5.7c: BE Cycles Sensitivity, facerec

5.5 Depth of Global Optimization

As the depth of global optimization increases, more FU configurations are examined and more FUs are switched ON to reduce the power cycling of FUs. This gives a larger reduction in overhead energy at the cost of higher static energy, since the optimizer reduces overhead by turning ON more FUs than a particular basic block requires. Figure 5.8 shows that for certain benchmarks there is little change in total energy as the depth of optimization increases: higher levels of optimization tend to switch ON additional units to reduce the power cycling (overhead) energy, but at the cost of increased static energy, so a higher level of optimization does not always save more energy. The best energy savings are achieved by the global optimizer at a CFG optimization depth of 10, which supports the statement that optimization based on an exhaustive search of the CFG results in a least-energy FU configuration (Section 3.3.1). Data is shown only for one benchmark, since optimizing beyond a depth of 2 is not computationally feasible for most benchmarks. Also, because the overhead energy is a small portion of the total energy, the plot starts at 5000 J for better readability.

Figure 5.8: Energy Consumption vs Global Optimizer Depth for mcf

5.6 Wait if Busy vs Busy ON

Section 3.4 describes two techniques used when all ON units are busy but not all available units are ON: either stall the ready-to-issue instruction or turn ON an additional FU. Stalling an instruction degrades performance, while turning an additional unit ON gives faster execution at the cost of higher static energy consumption. In our simulations of both cases, the Wait If Busy strategy saves 2.8% more energy than the Busy ON strategy with only 0.25% more performance degradation, averaged over all the benchmarks. Hence, the performance degradation is insignificant compared to the additional energy saved by Wait If Busy.


More information

(Refer Slide Time: 00:01:16 min)

(Refer Slide Time: 00:01:16 min) Digital Computer Organization Prof. P. K. Biswas Department of Electronic & Electrical Communication Engineering Indian Institute of Technology, Kharagpur Lecture No. # 04 CPU Design: Tirning & Control

More information

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Chapter 2 Basic Structure of Computers Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Outline Functional Units Basic Operational Concepts Bus Structures Software

More information

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM 1 The ARM architecture processors popular in Mobile phone systems 2 ARM Features ARM has 32-bit architecture but supports 16 bit

More information

IA-64 Application Developer s Architecture Guide

IA-64 Application Developer s Architecture Guide IA-64 Application Developer s Architecture Guide The IA-64 architecture was designed to overcome the performance limitations of today s architectures and provide maximum headroom for the future. To achieve

More information

VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU

VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU Martin Straka Doctoral Degree Programme (1), FIT BUT E-mail: strakam@fit.vutbr.cz Supervised by: Zdeněk Kotásek E-mail: kotasek@fit.vutbr.cz

More information

Intel Pentium 4 Processor on 90nm Technology

Intel Pentium 4 Processor on 90nm Technology Intel Pentium 4 Processor on 90nm Technology Ronak Singhal August 24, 2004 Hot Chips 16 1 1 Agenda Netburst Microarchitecture Review Microarchitecture Features Hyper-Threading Technology SSE3 Intel Extended

More information

路 論 Chapter 15 System-Level Physical Design

路 論 Chapter 15 System-Level Physical Design Introduction to VLSI Circuits and Systems 路 論 Chapter 15 System-Level Physical Design Dept. of Electronic Engineering National Chin-Yi University of Technology Fall 2007 Outline Clocked Flip-flops CMOS

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

NTE2053 Integrated Circuit 8 Bit MPU Compatible A/D Converter

NTE2053 Integrated Circuit 8 Bit MPU Compatible A/D Converter NTE2053 Integrated Circuit 8 Bit MPU Compatible A/D Converter Description: The NTE2053 is a CMOS 8 bit successive approximation Analog to Digital converter in a 20 Lead DIP type package which uses a differential

More information

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers CS 4 Introduction to Compilers ndrew Myers Cornell University dministration Prelim tomorrow evening No class Wednesday P due in days Optional reading: Muchnick 7 Lecture : Instruction scheduling pr 0 Modern

More information

The Microarchitecture of Superscalar Processors

The Microarchitecture of Superscalar Processors The Microarchitecture of Superscalar Processors James E. Smith Department of Electrical and Computer Engineering 1415 Johnson Drive Madison, WI 53706 ph: (608)-265-5737 fax:(608)-262-1267 email: jes@ece.wisc.edu

More information

Effective Use of Performance Monitoring Counters for Run-Time Prediction of Power

Effective Use of Performance Monitoring Counters for Run-Time Prediction of Power Effective Use of Performance Monitoring Counters for Run-Time Prediction of Power W. L. Bircher, J. Law, M. Valluri, and L. K. John Laboratory for Computer Architecture Department of Electrical and Computer

More information

Chapter 4. LLC Resonant Converter

Chapter 4. LLC Resonant Converter Chapter 4 LLC Resonant Converter 4.1 Introduction In previous chapters, the trends and technical challenges for front end DC/DC converter were discussed. High power density, high efficiency and high power

More information

Interfacing 3V and 5V applications

Interfacing 3V and 5V applications Authors: Tinus van de Wouw (Nijmegen) / Todd Andersen (Albuquerque) 1.0 THE NEED FOR TERFACG BETWEEN 3V AND 5V SYSTEMS Many reasons exist to introduce 3V 1 systems, notably the lower power consumption

More information

Lecture 5: Gate Logic Logic Optimization

Lecture 5: Gate Logic Logic Optimization Lecture 5: Gate Logic Logic Optimization MAH, AEN EE271 Lecture 5 1 Overview Reading McCluskey, Logic Design Principles- or any text in boolean algebra Introduction We could design at the level of irsim

More information

Thread level parallelism

Thread level parallelism Thread level parallelism ILP is used in straight line code or loops Cache miss (off-chip cache and main memory) is unlikely to be hidden using ILP. Thread level parallelism is used instead. Thread: process

More information

CISC, RISC, and DSP Microprocessors

CISC, RISC, and DSP Microprocessors CISC, RISC, and DSP Microprocessors Douglas L. Jones ECE 497 Spring 2000 4/6/00 CISC, RISC, and DSP D.L. Jones 1 Outline Microprocessors circa 1984 RISC vs. CISC Microprocessors circa 1999 Perspective:

More information

SPARC64 VIIIfx: CPU for the K computer

SPARC64 VIIIfx: CPU for the K computer SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd. s 45-nm CMOS

More information

So far we have investigated combinational logic for which the output of the logic devices/circuits depends only on the present state of the inputs.

So far we have investigated combinational logic for which the output of the logic devices/circuits depends only on the present state of the inputs. equential Logic o far we have investigated combinational logic for which the output of the logic devices/circuits depends only on the present state of the inputs. In sequential logic the output of the

More information

EVALUATING POWER MANAGEMENT CAPABILITIES OF LOW-POWER CLOUD PLATFORMS. Jens Smeds

EVALUATING POWER MANAGEMENT CAPABILITIES OF LOW-POWER CLOUD PLATFORMS. Jens Smeds EVALUATING POWER MANAGEMENT CAPABILITIES OF LOW-POWER CLOUD PLATFORMS Jens Smeds Master of Science Thesis Supervisor: Prof. Johan Lilius Advisor: Dr. Sébastien Lafond Embedded Systems Laboratory Department

More information

Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes

Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes basic pipeline: single, in-order issue first extension: multiple issue (superscalar) second extension: scheduling instructions for more ILP option #1: dynamic scheduling (by the hardware) option #2: static

More information

COMPUTER SCIENCE AND ENGINEERING - Microprocessor Systems - Mitchell Aaron Thornton

COMPUTER SCIENCE AND ENGINEERING - Microprocessor Systems - Mitchell Aaron Thornton MICROPROCESSOR SYSTEMS Mitchell Aaron Thornton, Department of Electrical and Computer Engineering, Mississippi State University, PO Box 9571, Mississippi State, MS, 39762-9571, United States. Keywords:

More information

PROBLEMS #20,R0,R1 #$3A,R2,R4

PROBLEMS #20,R0,R1 #$3A,R2,R4 506 CHAPTER 8 PIPELINING (Corrisponde al cap. 11 - Introduzione al pipelining) PROBLEMS 8.1 Consider the following sequence of instructions Mul And #20,R0,R1 #3,R2,R3 #$3A,R2,R4 R0,R2,R5 In all instructions,

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

RAM & ROM Based Digital Design. ECE 152A Winter 2012

RAM & ROM Based Digital Design. ECE 152A Winter 2012 RAM & ROM Based Digital Design ECE 152A Winter 212 Reading Assignment Brown and Vranesic 1 Digital System Design 1.1 Building Block Circuits 1.1.3 Static Random Access Memory (SRAM) 1.1.4 SRAM Blocks in

More information

Design Cycle for Microprocessors

Design Cycle for Microprocessors Cycle for Microprocessors Raúl Martínez Intel Barcelona Research Center Cursos de Verano 2010 UCLM Intel Corporation, 2010 Agenda Introduction plan Architecture Microarchitecture Logic Silicon ramp Types

More information

Static-Noise-Margin Analysis of Conventional 6T SRAM Cell at 45nm Technology

Static-Noise-Margin Analysis of Conventional 6T SRAM Cell at 45nm Technology Static-Noise-Margin Analysis of Conventional 6T SRAM Cell at 45nm Technology Nahid Rahman Department of electronics and communication FET-MITS (Deemed university), Lakshmangarh, India B. P. Singh Department

More information

CMOS Thyristor Based Low Frequency Ring Oscillator

CMOS Thyristor Based Low Frequency Ring Oscillator CMOS Thyristor Based Low Frequency Ring Oscillator Submitted by: PIYUSH KESHRI BIPLAB DEKA 4 th year Undergraduate Student 4 th year Undergraduate Student Electrical Engineering Dept. Electrical Engineering

More information

2

2 1 2 3 4 5 For Description of these Features see http://download.intel.com/products/processor/corei7/prod_brief.pdf The following Features Greatly affect Performance Monitoring The New Performance Monitoring

More information

Instruction scheduling

Instruction scheduling Instruction ordering Instruction scheduling Advanced Compiler Construction Michel Schinz 2015 05 21 When a compiler emits the instructions corresponding to a program, it imposes a total order on them.

More information

International Journal of Electronics and Computer Science Engineering 1482

International Journal of Electronics and Computer Science Engineering 1482 International Journal of Electronics and Computer Science Engineering 1482 Available Online at www.ijecse.org ISSN- 2277-1956 Behavioral Analysis of Different ALU Architectures G.V.V.S.R.Krishna Assistant

More information

Alpha CPU and Clock Design Evolution

Alpha CPU and Clock Design Evolution Alpha CPU and Clock Design Evolution This lecture uses two papers that discuss the evolution of the Alpha CPU and clocking strategy over three CPU generations Gronowski, Paul E., et.al., High Performance

More information

INSTITUTE OF AERONAUTICAL ENGINEERING Dundigal, Hyderabad - 500 043

INSTITUTE OF AERONAUTICAL ENGINEERING Dundigal, Hyderabad - 500 043 INSTITUTE OF AERONAUTICAL ENGINEERING Dundigal, Hyderabad - 500 043 ELECTRONICS AND COMMUNICATION ENGINEERING Course Title VLSI DESIGN Course Code 57035 Regulation R09 COURSE DESCRIPTION Course Structure

More information

CHAPTER 16 MEMORY CIRCUITS

CHAPTER 16 MEMORY CIRCUITS CHPTER 6 MEMORY CIRCUITS Chapter Outline 6. atches and Flip-Flops 6. Semiconductor Memories: Types and rchitectures 6.3 Random-ccess Memory RM Cells 6.4 Sense-mplifier and ddress Decoders 6.5 Read-Only

More information

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of CS:APP Chapter 4 Computer Architecture Wrap-Up William J. Taffe Plymouth State University using the slides of Randal E. Bryant Carnegie Mellon University Overview Wrap-Up of PIPE Design Performance analysis

More information

MICROPROCESSOR AND MICROCOMPUTER BASICS

MICROPROCESSOR AND MICROCOMPUTER BASICS Introduction MICROPROCESSOR AND MICROCOMPUTER BASICS At present there are many types and sizes of computers available. These computers are designed and constructed based on digital and Integrated Circuit

More information

RUNAHEAD EXECUTION: AN EFFECTIVE ALTERNATIVE TO LARGE INSTRUCTION WINDOWS

RUNAHEAD EXECUTION: AN EFFECTIVE ALTERNATIVE TO LARGE INSTRUCTION WINDOWS RUNAHEAD EXECUTION: AN EFFECTIVE ALTERNATIVE TO LARGE INSTRUCTION WINDOWS AN INSTRUCTION WINDOW THAT CAN TOLERATE LATENCIES TO DRAM MEMORY IS PROHIBITIVELY COMPLEX AND POWER HUNGRY. TO AVOID HAVING TO

More information

Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield.

Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield. Technical Report UCAM-CL-TR-707 ISSN 1476-2986 Number 707 Computer Laboratory Complexity-effective superscalar embedded processors using instruction-level distributed processing Ian Caulfield December

More information

Combining Software and Hardware Monitoring for Improved Power and Performance Tuning

Combining Software and Hardware Monitoring for Improved Power and Performance Tuning Combining Software and Hardware Monitoring for Improved Power and Performance Tuning Eric Chi, A. Michael Salem, R. Iris Bahar Brown University, Division of Engineering Providence, RI 091 Richard Weiss

More information

PowerPC Microprocessor Clock Modes

PowerPC Microprocessor Clock Modes nc. Freescale Semiconductor AN1269 (Freescale Order Number) 1/96 Application Note PowerPC Microprocessor Clock Modes The PowerPC microprocessors offer customers numerous clocking options. An internal phase-lock

More information

Instruction Set Design

Instruction Set Design Instruction Set Design Instruction Set Architecture: to what purpose? ISA provides the level of abstraction between the software and the hardware One of the most important abstraction in CS It s narrow,

More information

Topics of Chapter 5 Sequential Machines. Memory elements. Memory element terminology. Clock terminology

Topics of Chapter 5 Sequential Machines. Memory elements. Memory element terminology. Clock terminology Topics of Chapter 5 Sequential Machines Memory elements Memory elements. Basics of sequential machines. Clocking issues. Two-phase clocking. Testing of combinational (Chapter 4) and sequential (Chapter

More information

EE 42/100 Lecture 24: Latches and Flip Flops. Rev B 4/21/2010 (2:04 PM) Prof. Ali M. Niknejad

EE 42/100 Lecture 24: Latches and Flip Flops. Rev B 4/21/2010 (2:04 PM) Prof. Ali M. Niknejad A. M. Niknejad University of California, Berkeley EE 100 / 42 Lecture 24 p. 1/20 EE 42/100 Lecture 24: Latches and Flip Flops ELECTRONICS Rev B 4/21/2010 (2:04 PM) Prof. Ali M. Niknejad University of California,

More information

Instruction Set Architecture (ISA)

Instruction Set Architecture (ISA) Instruction Set Architecture (ISA) * Instruction set architecture of a machine fills the semantic gap between the user and the machine. * ISA serves as the starting point for the design of a new machine

More information

Clocking. Figure by MIT OCW. 6.884 - Spring 2005 2/18/05 L06 Clocks 1

Clocking. Figure by MIT OCW. 6.884 - Spring 2005 2/18/05 L06 Clocks 1 ing Figure by MIT OCW. 6.884 - Spring 2005 2/18/05 L06 s 1 Why s and Storage Elements? Inputs Combinational Logic Outputs Want to reuse combinational logic from cycle to cycle 6.884 - Spring 2005 2/18/05

More information

EE361: Digital Computer Organization Course Syllabus

EE361: Digital Computer Organization Course Syllabus EE361: Digital Computer Organization Course Syllabus Dr. Mohammad H. Awedh Spring 2014 Course Objectives Simply, a computer is a set of components (Processor, Memory and Storage, Input/Output Devices)

More information

TRUE SINGLE PHASE CLOCKING BASED FLIP-FLOP DESIGN

TRUE SINGLE PHASE CLOCKING BASED FLIP-FLOP DESIGN TRUE SINGLE PHASE CLOCKING BASED FLIP-FLOP DESIGN USING DIFFERENT FOUNDRIES Priyanka Sharma 1 and Rajesh Mehra 2 1 ME student, Department of E.C.E, NITTTR, Chandigarh, India 2 Associate Professor, Department

More information

A Predictive Model for Dynamic Microarchitectural Adaptivity Control

A Predictive Model for Dynamic Microarchitectural Adaptivity Control A Predictive Model for Dynamic Microarchitectural Adaptivity Control Christophe Dubach, Timothy M. Jones Members of HiPEAC University of Edinburgh Edwin V. Bonilla NICTA & Australian National University

More information

on an system with an infinite number of processors. Calculate the speedup of

on an system with an infinite number of processors. Calculate the speedup of 1. Amdahl s law Three enhancements with the following speedups are proposed for a new architecture: Speedup1 = 30 Speedup2 = 20 Speedup3 = 10 Only one enhancement is usable at a time. a) If enhancements

More information

Counters and Decoders

Counters and Decoders Physics 3330 Experiment #10 Fall 1999 Purpose Counters and Decoders In this experiment, you will design and construct a 4-bit ripple-through decade counter with a decimal read-out display. Such a counter

More information

DIGITAL-TO-ANALOGUE AND ANALOGUE-TO-DIGITAL CONVERSION

DIGITAL-TO-ANALOGUE AND ANALOGUE-TO-DIGITAL CONVERSION DIGITAL-TO-ANALOGUE AND ANALOGUE-TO-DIGITAL CONVERSION Introduction The outputs from sensors and communications receivers are analogue signals that have continuously varying amplitudes. In many systems

More information

White Paper Utilizing Leveling Techniques in DDR3 SDRAM Memory Interfaces

White Paper Utilizing Leveling Techniques in DDR3 SDRAM Memory Interfaces White Paper Introduction The DDR3 SDRAM memory architectures support higher bandwidths with bus rates of 600 Mbps to 1.6 Gbps (300 to 800 MHz), 1.5V operation for lower power, and higher densities of 2

More information

Experiment # 9. Clock generator circuits & Counters. Eng. Waleed Y. Mousa

Experiment # 9. Clock generator circuits & Counters. Eng. Waleed Y. Mousa Experiment # 9 Clock generator circuits & Counters Eng. Waleed Y. Mousa 1. Objectives: 1. Understanding the principles and construction of Clock generator. 2. To be familiar with clock pulse generation

More information

CMOS Binary Full Adder

CMOS Binary Full Adder CMOS Binary Full Adder A Survey of Possible Implementations Group : Eren Turgay Aaron Daniels Michael Bacelieri William Berry - - Table of Contents Key Terminology...- - Introduction...- 3 - Design Architectures...-

More information

OC By Arsene Fansi T. POLIMI 2008 1

OC By Arsene Fansi T. POLIMI 2008 1 IBM POWER 6 MICROPROCESSOR OC By Arsene Fansi T. POLIMI 2008 1 WHAT S IBM POWER 6 MICROPOCESSOR The IBM POWER6 microprocessor powers the new IBM i-series* and p-series* systems. It s based on IBM POWER5

More information

Sequential Logic: Clocks, Registers, etc.

Sequential Logic: Clocks, Registers, etc. ENEE 245: igital Circuits & Systems Lab Lab 2 : Clocks, Registers, etc. ENEE 245: igital Circuits and Systems Laboratory Lab 2 Objectives The objectives of this laboratory are the following: To design

More information

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Quiz for Chapter 1 Computer Abstractions and Technology 3.10 Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,

More information

MONITORING power consumption of a microprocessor

MONITORING power consumption of a microprocessor IEEE TRANSACTIONS ON CIRCUIT AND SYSTEMS-II, VOL. X, NO. Y, JANUARY XXXX 1 A Study on the use of Performance Counters to Estimate Power in Microprocessors Rance Rodrigues, Member, IEEE, Arunachalam Annamalai,

More information

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General

More information

Modeling Sequential Elements with Verilog. Prof. Chien-Nan Liu TEL: 03-4227151 ext:34534 Email: jimmy@ee.ncu.edu.tw. Sequential Circuit

Modeling Sequential Elements with Verilog. Prof. Chien-Nan Liu TEL: 03-4227151 ext:34534 Email: jimmy@ee.ncu.edu.tw. Sequential Circuit Modeling Sequential Elements with Verilog Prof. Chien-Nan Liu TEL: 03-4227151 ext:34534 Email: jimmy@ee.ncu.edu.tw 4-1 Sequential Circuit Outputs are functions of inputs and present states of storage elements

More information

Introduction to Microprocessors

Introduction to Microprocessors Introduction to Microprocessors Yuri Baida yuri.baida@gmail.com yuriy.v.baida@intel.com October 2, 2010 Moscow Institute of Physics and Technology Agenda Background and History What is a microprocessor?

More information

CSEE W4824 Computer Architecture Fall 2012

CSEE W4824 Computer Architecture Fall 2012 CSEE W4824 Computer Architecture Fall 2012 Lecture 2 Performance Metrics and Quantitative Principles of Computer Design Luca Carloni Department of Computer Science Columbia University in the City of New

More information

Here we introduced (1) basic circuit for logic and (2)recent nano-devices, and presented (3) some practical issues on nano-devices.

Here we introduced (1) basic circuit for logic and (2)recent nano-devices, and presented (3) some practical issues on nano-devices. Outline Here we introduced () basic circuit for logic and (2)recent nano-devices, and presented (3) some practical issues on nano-devices. Circuit Logic Gate A logic gate is an elemantary building block

More information

These help quantify the quality of a design from different perspectives: Cost Functionality Robustness Performance Energy consumption

These help quantify the quality of a design from different perspectives: Cost Functionality Robustness Performance Energy consumption Basic Properties of a Digital Design These help quantify the quality of a design from different perspectives: Cost Functionality Robustness Performance Energy consumption Which of these criteria is important

More information

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Chapter 12: Multiprocessor Architectures Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Objective Be familiar with basic multiprocessor architectures and be able to

More information

Giving credit where credit is due

Giving credit where credit is due CSCE 230J Computer Organization Processor Architecture VI: Wrap-Up Dr. Steve Goddard goddard@cse.unl.edu http://cse.unl.edu/~goddard/courses/csce230j Giving credit where credit is due ost of slides for

More information

Runtime Hardware Reconfiguration using Machine Learning

Runtime Hardware Reconfiguration using Machine Learning Runtime Hardware Reconfiguration using Machine Learning Tanmay Gangwani University of Illinois, Urbana-Champaign gangwan2@illinois.edu Abstract Tailoring the machine hardware to varying needs of the software

More information

WAR: Write After Read

WAR: Write After Read WAR: Write After Read write-after-read (WAR) = artificial (name) dependence add R1, R2, R3 sub R2, R4, R1 or R1, R6, R3 problem: add could use wrong value for R2 can t happen in vanilla pipeline (reads

More information

THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES APPLICATION CONFIGURABLE PROCESSORS CHRISTOPHER J. ZIMMER

THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES APPLICATION CONFIGURABLE PROCESSORS CHRISTOPHER J. ZIMMER THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES APPLICATION CONFIGURABLE PROCESSORS By CHRISTOPHER J. ZIMMER A Thesis submitted to the Department of Computer Science In partial fulfillment of

More information

CHAPTER 2: HARDWARE BASICS: INSIDE THE BOX

CHAPTER 2: HARDWARE BASICS: INSIDE THE BOX CHAPTER 2: HARDWARE BASICS: INSIDE THE BOX Multiple Choice: 1. Processing information involves: A. accepting information from the outside world. B. communication with another computer. C. performing arithmetic

More information

A Novel Low Power, High Speed 14 Transistor CMOS Full Adder Cell with 50% Improvement in Threshold Loss Problem

A Novel Low Power, High Speed 14 Transistor CMOS Full Adder Cell with 50% Improvement in Threshold Loss Problem A Novel Low Power, High Speed 4 Transistor CMOS Full Adder Cell with 5% Improvement in Threshold Loss Problem T. Vigneswaran, B. Mukundhan, and P. Subbarami Reddy Abstract Full adders are important components

More information

LAB 7 MOSFET CHARACTERISTICS AND APPLICATIONS

LAB 7 MOSFET CHARACTERISTICS AND APPLICATIONS LAB 7 MOSFET CHARACTERISTICS AND APPLICATIONS Objective In this experiment you will study the i-v characteristics of an MOS transistor. You will use the MOSFET as a variable resistor and as a switch. BACKGROUND

More information

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings This Unit: Multithreading (MT) CIS 501 Computer Architecture Unit 10: Hardware Multithreading Application OS Compiler Firmware CU I/O Memory Digital Circuits Gates & Transistors Why multithreading (MT)?

More information