
CARNEGIE MELLON UNIVERSITY

VALUE LOCALITY AND SPECULATIVE EXECUTION

A DISSERTATION SUBMITTED TO THE GRADUATE SCHOOL
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
for the degree of
DOCTOR OF PHILOSOPHY
in
ELECTRICAL AND COMPUTER ENGINEERING

by
Mikko Herman Lipasti

Pittsburgh, PA 15213
April 1997


Abstract

This thesis introduces a program attribute called value locality and proposes speculative execution under the weak dependence model. The weak dependence model lays a theoretical foundation for exploiting value locality and other program attributes by speculatively relaxing and deferring the detection and enforcement of control- and data-flow dependences between instructions to expose more instruction-level parallelism without violating program correctness. Value locality is a program attribute that describes the likelihood of the recurrence of a previously-seen value within a storage location inside a computer system. Most modern processors already exploit value locality through the use of control speculation (i.e. branch prediction), which seeks to predict the future values of condition code bits and branch-target addresses based on previously-seen values. Experimental results indicate that value locality exists for condition codes and branch target addresses, and for general-purpose and floating-point registers as well. Furthermore, value locality exists not only in the data flow portion of a processor, but also in the control logic, where both register and memory dependences between instructions tend to remain static and are relatively easily predictable. To exploit value locality, several dynamic prediction mechanisms are proposed. These mechanisms increase instruction-level parallelism by speculatively collapsing explicit and implicit dependences between instructions and folding away execution pipeline stages under the pipeline contraction framework. Detailed evaluation of value prediction for load instructions only, as well as for all instructions that write general-purpose or floating-point registers, shows significant potential for performance improvement. Further experimental results indicate that both register and memory dependence relationships between instructions are easily predictable. These discoveries result in significant potential for further performance improvements, particularly in conjunction with wide-issue and deeply-pipelined superscalar processors that employ aggressive techniques like accurate dynamic branch prediction and instruction fetching via trace cache to overcome control- and data-flow restrictions on parallel execution of serial programs. Finally, this thesis introduces a new microarchitectural paradigm called Superflow that supersedes historical limits on instruction flow, register dataflow, and memory dataflow and demonstrates potential performance improvements of 2-4x over current state-of-the-art microprocessors.


Acknowledgments

First of all, I want to express my appreciation to my advisor, John Paul Shen, for his support, advice, and leadership throughout this project. John's ability to condense wild ideas and conjectures down to key concepts, as well as his willingness to entertain and pursue such notions, has been instrumental in evolving this thesis from vague ideas about redundancy in memory traffic to effective and powerful microarchitectural techniques for increasing instruction-level parallelism. I would also like to thank professors Daniel P. Siewiorek and Rob Rutenbar of CMU, and James Smith of the University of Wisconsin, for serving on my thesis committee. I benefited greatly from their wealth of knowledge and experience during the transition from Ph.D. proposal to completed Ph.D. thesis. I also want to express my gratitude to my outside committee members, Dr. Greg Pfister from IBM Austin and Dr. Steve Kunkel from IBM Rochester. Their insights and feedback helped steer my research toward real-world implementation and performance issues. I also benefited greatly from numerous discussions with other members of our research group, including Bryan Black, Yuan Chou, Andrew Huang, Derek Noonburg, and Chris Newburn. Special thanks go to Chris Wilkerson for his key insights in the early stages of this thesis, as well as for coining the term Superflow to describe the microarchitectural paradigm advocated in this thesis. I also want to express my appreciation to my management at IBM, whose professional and financial support greatly eased the transition from salaried employee to indentured servant (i.e. graduate student). Finally, thanks to my loving wife Erica Ann Lipasti for her patience, forbearance, and willingness to make tremendous sacrifices in not just financial security, standard of living, and quality of life, but also personal friendships, proximity to family, and psychological and emotional support systems in order to let me perform this work. Her enduring support during this time is the ultimate sign of her love, devotion, and respect. Furthermore, her selfless devotion to caring and providing for our dear daughter, Emma Kristiina, who has grown from a toddler barely out of helpless infancy into a happy, active, intelligent, and very outgoing young lady during our time here at CMU, has been instrumental in allowing me to focus my energies on my research without having to sacrifice those precious and irreplaceable times together as a family.


Contents

Abstract
Acknowledgments

CHAPTER 1  Introduction
    Historical Background and Motivation
    Taxonomy of Speculative Execution
    Theoretical Contributions
    Value Locality
    The Weak Dependence Model
    Pipeline Contraction Framework
    Microarchitectural Contributions
    Load and Register Value Prediction
    Dependence Prediction
    Alias Prediction
    Putting it all Together: The Superflow Paradigm
    Thesis Overview

CHAPTER 2  Machine Models and Workloads
    Machine Models
    PowerPC 620
    PowerPC 620+
    Infinite PowerPC Model
    Alpha AXP 21164
    Execution-Driven Idealized PowerPC Model
    Misprediction Recovery Mechanisms
    Instruction Refetch
    Instruction Reissue
    Selective Instruction Reissue
    Workloads
    SPEC92 Integer Suite
    Miscellaneous Integer Programs
    SPEC92 Floating Point Suite
    SPEC95 Integer Suite
    SPEC95 Floating Point Suite

CHAPTER 3  Load Value Prediction
    Introduction and Related Work
    Value Locality
    Exploiting Value Locality
    Load Value Prediction Table
    Dynamic Load Classification
    Constant Verification Unit
    The Load Value Prediction Unit
    LVP Unit Implementation Notes
    Microarchitectural Models
    PowerPC 620/620+ LVP Unit Operation
    Alpha AXP 21164 LVP Unit Operation
    Experimental Framework
    Experimental Results
    Base Machine Model Speedups with Realistic LVP
    Enhanced Machine and LVP Model Speedups
    Distribution of Load Verification Latencies
    Data Dependency Resolution Latencies
    Bank Conflicts
    Conclusions and Future Work

CHAPTER 4  Register Value Prediction
    Motivation
    Value Locality
    Exploiting Value Locality
    The Value Prediction Unit
    Verifying Predictions
    Microarchitectural Models
    VP Unit Operation
    Misprediction Penalty
    Experimental Framework
    Experimental Results
    PowerPC 620 Machine Model Speedups
    PowerPC 620+ Machine Model Speedups
    Infinite Machine Model Speedups
    VP Unit Implementation
    Conclusions and Future Work

CHAPTER 5  Dependence Prediction
    Motivation and Related Work
    Detecting Control and Data Dependences
    Experimental Framework
    Pipelined Dispatch Structure
    Dependence Prediction and Recovery
    Source Operand Value Prediction and Recovery
    Conclusions and Future Work

CHAPTER 6  The Superflow Paradigm
    Background
    The Superflow Paradigm
    The Weak Dependence Model
    Generalized Speculation and Pipeline Contraction
    Overview of the Superflow Paradigm
    Instruction Flow Techniques
    Conditional Branch Throughput
    Taken Branch Throughput
    Misprediction Latency
    Register Data Flow Techniques
    Dependence Detection and Prediction
    Eliminating Dependences
    Memory Data Flow Techniques
    Memory Latency
    Memory Bandwidth
    Summary and Conclusions

CHAPTER 7  Summary and Conclusions
    Thesis Summary
    Key Contributions
    Theoretical Contributions
    Microarchitectural Contributions
    Future Work

APPENDIX A  Additional Data on Load Value Prediction
    Miscellaneous PowerPC Data
    PowerPC Load Value Prediction Data
    Miscellaneous Alpha AXP 21164 Data

APPENDIX B  Additional Data on Register Value Prediction
    Miscellaneous PowerPC Data
    PowerPC Register Value Prediction Data

APPENDIX C  Additional Data on Superflow Machine Models
    Additional Results For Instruction Flow
    Additional Results For Register Data Flow
    Additional Results For Memory Data Flow

List of Figures

Figure 1-1. Taxonomy of Speculative Execution
Figure 1-2. Load Value Locality
Figure 1-3. Register Value Locality
Figure 1-4. Pipeline Contraction
Figure 1-5. Block Diagram of Value Prediction Unit
Figure 1-6. Example use of Value Prediction Mechanism
Figure 1-7. Branch Misprediction Penalty
Figure 1-8. Dependence Prediction Mechanism
Figure 1-9. Superflow Overview
Figure 2-1. PPC 620 and 620+ Block Diagram
Figure 2-2. Alpha AXP 21164 Block Diagram
Figure 3-1. Load Value Locality
Figure 3-2. PowerPC Value Locality by Data Type
Figure 3-3. Block Diagram of the LVP Mechanism
Figure 3-4. Base Machine Model Speedups
Figure 3-5. Load Verification Latency Distribution
Figure 3-6. Data Dependency Resolution Latencies
Figure 3-7. Percentage of Cycles with Bank Conflicts
Figure 4-1. Register Value Locality
Figure 4-2. Register Value Locality by Instruction Type
Figure 4-3. Value Prediction Unit
Figure 4-4. VPT Hit Sensitivity to Size
Figure 4-5. CT Hit Rates
Figure 4-6. Example use of Value Prediction Mechanism
Figure 4-7. 620 Speedups
Figure 4-8. 620+ Speedups
Figure 4-9. Infinite Machine Model Speedups
Figure 4-10. Doubling Data Cache vs. VP
Figure 5-1. Branch Misprediction Penalty
Figure 5-2. Pipelined Dispatch Structure
Figure 5-3. Dependence Prediction Mechanism
Figure 5-4. Source Operand Value Prediction Mechanism
Figure 5-5. Effect of Dependence and Value Prediction
Figure 5-6. Reduced Branch Misprediction Penalty
Figure 6-1. Pipeline Contraction
Figure 6-2. Superflow Overview
Figure 6-3. Superflow Instruction Fetch Unit
Figure 6-4. Instruction Fetch Unit Performance
Figure 6-5. Superflow Instruction Fetch Unit Performance
Figure 6-6. Source Operand Value Predictability
Figure 6-7. Dependence Predictability
Figure 6-8. Effect of Deep Pipelining
Figure 6-9. Effect of Finite Reorder Buffer
Figure 6-10. Alias Predictability
Figure 6-11. Load Stream Partitioning
Figure 6-12. Load Value Predictability
Figure 6-13. Effect of Constrained Memory Bandwidth

List of Tables

Table 2-1. PowerPC Machine Model Specifications
Table 2-2. Alpha AXP 21164 Instruction Latencies
Table 2-3. Idealized PowerPC Model Instruction Latencies
Table 2-4. SPEC92 Integer Benchmark Descriptions
Table 2-5. Miscellaneous Integer Benchmark Descriptions
Table 2-6. SPEC92 Floating Point Benchmark Descriptions
Table 2-7. SPEC95 Integer Benchmark Set
Table 2-8. SPEC95 Floating Point Benchmark Set
Table 3-1. LVP Unit Configurations
Table 3-2. LCT Hit Rates
Table 3-3. Successful Constant Identification Rates
Table 3-4. PowerPC 620+ Speedups
Table 4-1. Instruction Types
Table 4-2. Classification Table Configurations
Table 4-3. Baseline Performance (IPC)
Table 4-4. VP Unit Configurations
Table 5-1. Machine Model Parameters
Table 5-2. Benchmark Characteristics
Table 5-3. Dependence Prediction Results
Table 5-4. Source Operand Value Prediction Results
Table 6-1. Evolution of Microprocessors
Table 6-2. Benchmark Characteristics
Table A-1. PowerPC 620 Model Miscellaneous Data
Table A-2. PowerPC 620+ Model Miscellaneous Data
Table A-3. PowerPC 620 LVP Data
Table A-4. PowerPC 620+ LVP Data
Table A-5. Alpha AXP 21164 Data
Table B-1. PowerPC 620 Model Miscellaneous Data
Table B-2. PowerPC 620+ Model Miscellaneous Data
Table B-3. PowerPC 620 VP Data
Table B-4. PowerPC 620+ VP Data
Table B-5. Infinite PowerPC VP Data
Table C-1. SPECInt95 Results for Instruction Flow
Table C-2. SPECFP95 Results for Instruction Flow
Table C-3. SPECInt95 Results for Register Data Flow
Table C-4. SPECFP95 Results for Register Data Flow
Table C-5. SPECInt95 Results for ROB Size 128 and Fetch Width 16
Table C-6. SPECFP95 Results for ROB Size 128 and Fetch Width 16
Table C-7. SPECFP95 Results for ROB Size 256 and Fetch Width 16
Table C-8. SPECInt95 Results for Memory Data Flow for ROB Size 128
Table C-9. SPECFP95 Results for Memory Data Flow for ROB Size 128
Table C-10. SPECFP95 Results for Memory Data Flow for ROB Size 256

CHAPTER 1  Introduction

This thesis introduces a ubiquitous program attribute called value locality and proposes speculative execution under the weak dependence model. The weak dependence model lays a theoretical foundation for exploiting value locality and other program attributes by speculatively relaxing and deferring the detection and enforcement of control- and data-flow dependences between instructions to expose more instruction-level parallelism. To exploit value locality, several dynamic prediction mechanisms are proposed. These mechanisms increase instruction-level parallelism (also known as ILP or IPC, instructions per cycle) by speculatively collapsing explicit and implicit dependences between instructions and folding away execution pipeline stages under the pipeline contraction framework.

1.1 Historical Background and Motivation

There are two fundamental restrictions that limit the amount of instruction-level parallelism (ILP) that can be extracted from sequential programs: control flow and data flow. Control flow limits ILP by imposing serialization constraints at forks and joins in a program's control flow graph [1]. Data flow limits ILP by imposing serialization constraints on pairs of instructions that are data dependent (i.e. one needs the result of another to compute its own result, and hence must wait for the other to complete before beginning to execute). Examining the extent and effect of these limits has been a popular and important area of research, particularly in the case of control flow [2,3,4,5]. Continuing advances in the development of accurate branch predictors (e.g. [6]) have led to increasingly aggressive control-speculative microarchitectures (e.g. the Intel Pentium Pro [7]), which undertake aggressive measures to overcome control-flow restrictions by using branch prediction and speculative execution to bypass control dependences and expose additional instruction-level parallelism to the microarchitecture. Meanwhile, numerous mechanisms have been proposed and implemented to eliminate false data dependences and tolerate the latencies induced by true data dependences by allowing instructions to execute out of program order (see [8] for an overview).

Surprisingly, in light of the extensive energies focused on eliminating control-flow restrictions on parallel instruction issue, less attention has been paid to eliminating data-flow restrictions on parallel issue. Recent work has focused primarily on reducing the latency of specific types of instructions (usually loads from memory) by rearranging pipeline stages [9, 10], initiating memory accesses earlier [11], or speculating that dependences to earlier stores do not exist [12, 13, 14, 15]. The most relevant prior work in the area of eliminating data-flow dependences is the Tree Machine [16,17], which uses a value cache to store and look up the results of recurring arithmetic expressions to eliminate redundant computation (the value cache, in effect, performs common subexpression elimination [1] in hardware). Richardson follows up on this concept in [18] by introducing the concepts of trivial computation, which is defined as the trivialization of potentially complex operations by the occurrence of simple operands; and redundant computation, where an operation repeatedly performs the same computation because it sees the same operands. He proposes a hardware mechanism (the result cache) which reduces the latency of such trivial or redundant complex arithmetic operations by storing and looking up their results in the result cache.

In this thesis, we introduce the concept of value locality, which is similar to redundant computation, along with a proposed technique--Value Prediction, or VP--for predicting the results of instructions at dispatch by exploiting the affinity between instruction addresses and the values these instructions produce. VP differs from Harbison's value cache and Richardson's result cache in two important ways: first, the VP table is indexed by instruction address, and hence value lookups can occur very early in the pipeline; second, it is speculative in nature, and relies on a verification mechanism to guarantee correctness. In contrast, both Harbison and Richardson use table indices that are only available later in the pipeline (Harbison uses data addresses, while Richardson uses actual operand values), and both require their predictions to be correct, hence requiring mechanisms for keeping their tables coherent with all other computation.

1.1.1 Taxonomy of Speculative Execution

In order to place our work on prediction-based speculative execution into a meaningful historical context, we introduce a taxonomy of speculative execution. This taxonomy, summarized in Figure 1-1, categorizes our work as well as previously introduced techniques based on which types of dependences are being bypassed (control vs. data), whether the speculation relates to storage location or value, and what type of decision must be made to enable the speculation (binary vs. multi-valued).

Figure 1-1. Taxonomy of Speculative Execution. The taxonomy divides speculative execution into control speculation, comprising branch direction (binary) and branch target (multi-valued), and data speculation, comprising data location, which may concern aliasing (binary) or address (multi-valued), and data value (multi-valued).

Control Speculation

There are essentially two types of control speculation: speculating on the direction of a branch, which requires a binary decision (taken vs. not-taken); and speculating on the target of a branch, which requires a multi-valued decision (the target can potentially be anywhere in the program's address space). Examples of the former are any of the many branch prediction schemes explored in the literature (e.g. [19,6,20]), while examples of the latter are the Branch Target Buffer (BTB) or Branch Target Address Cache (BTAC) units included on most modern microprocessors (e.g. the PowerPC 620 [15] or the Intel Pentium Pro [7]). A novel mechanism for performing both branch direction and branch target prediction is proposed as part of the Superflow microarchitecture paradigm in Chapter 6.
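To make the binary versus multi-valued distinction concrete, the sketch below (hypothetical C++ written for this transcription, based on well-known generic schemes rather than any mechanism proposed in this thesis) contrasts the two: a direction predictor can get by with a saturating counter per entry, whereas a target predictor such as a BTB must cache an entire address.

#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Binary decision: predict taken/not-taken with a 2-bit saturating counter per entry.
class DirectionPredictor {
public:
    explicit DirectionPredictor(std::size_t entries) : counters_(entries, 2) {}  // start weakly taken
    bool predict(uint64_t pc) const { return counters_[pc % counters_.size()] >= 2; }
    void update(uint64_t pc, bool taken) {
        uint8_t& c = counters_[pc % counters_.size()];
        if (taken && c < 3) ++c;
        if (!taken && c > 0) --c;
    }
private:
    std::vector<uint8_t> counters_;
};

// Multi-valued decision: a branch target buffer must produce an arbitrary address,
// so each entry caches the most recently observed target, tagged by the branch PC.
class BranchTargetBuffer {
public:
    explicit BranchTargetBuffer(std::size_t entries) : entries_(entries) {}
    std::optional<uint64_t> predict(uint64_t pc) const {
        const Entry& e = entries_[pc % entries_.size()];
        if (e.valid && e.tag == pc) return e.target;
        return std::nullopt;   // no prediction: fall through or wait for decode
    }
    void update(uint64_t pc, uint64_t target) {
        entries_[pc % entries_.size()] = Entry{true, pc, target};
    }
private:
    struct Entry { bool valid = false; uint64_t tag = 0; uint64_t target = 0; };
    std::vector<Entry> entries_;
};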

Data Speculation

Data speculation techniques break down logically into two categories: those that speculate on the storage location of the data, and those that speculate on the actual value. Furthermore, techniques that speculate on the location come in two fundamentally different flavors: those that speculate on a specific attribute of the storage location (e.g. whether it is aliased with an earlier definition), and those that speculate on the address of the storage location. An example of the former is speculative disambiguation, which optimistically assumes that an earlier definition does not alias with a current use, and provides a mechanism for checking the accuracy of that assumption. Speculative disambiguation has been implemented both in software [13] and in hardware [12, 14, 15]. Another example of this type of speculation occurs implicitly in most control-speculative processors, whenever execution proceeds speculatively past a join in the control-flow graph where multiple reaching definitions for a storage location are live [1]. By speculating past that join, the processor hardware is implicitly speculating that the definition on the predicted path to the join in question is in fact the correct one (as opposed to the definition on an alternate path).

There are a large number of techniques that speculate on data address. Most prefetching techniques, for example, are speculative in nature and rely on some heuristic for generating addresses of future memory references (e.g. [21, 22, 23, 24, 25]). Of course, since prefetching has no architected side effects, no mechanism is needed for verifying the accuracy of the prediction or for recovering from mispredictions. Another example of a technique that speculates on data address is fast address calculation [26, 11], which enables early initiation of memory loads by speculatively generating addresses early in the pipeline. Dependence prediction, proposed in Chapter 5, and alias prediction, proposed in Chapter 6, are speculative techniques that predict the current storage location of register input operands (i.e. rename buffer number) and memory operands (e.g. store queue entry), respectively.

The final category in our taxonomy, techniques that speculate on data value, has received little attention in the literature. The only work we are aware of is that proposed in this thesis (preliminary results have been published in [27] and [28]). Note that neither the Tree Machine [16,17] nor Richardson's work [18] qualifies, since they are not speculative.

1.2 Theoretical Contributions

1.2.1 Value Locality

In this thesis, we introduce the concept of value locality, which we define as the likelihood of a previously-seen value recurring repeatedly within a storage location. Although the concept is general and can be applied to any storage location within a computer system, we have limited our study to examine only the value locality of general-purpose or floating point registers immediately following instructions that write to those registers, as well as the value locality exhibited in dependence relationships between instructions. A plethora of previous work on static and dynamic branch prediction (e.g. [19,6,20]) has focused on an even more restricted application of value locality, namely the prediction of a single condition bit based on its past behavior.

Intuitively, it seems that it would be a very difficult task to discover any useful amount of value locality in a general-purpose register. After all, a 32-bit register can contain any one of over four billion values--how could one possibly predict which of those is even somewhat likely to occur next? As it turns out, if we narrow the scope of our prediction mechanism by considering each static instruction individually, the task becomes much easier, and we are able to accurately predict a significant fraction of the register values being written by machine instructions. We examine the phenomenon of value locality more closely in Section 3.2 on page 37 and Section 4.2 on page 58.

The initial benchmark set that we use to explore value locality and quantify its performance impact consists of the SPEC92 integer suite (described in Section 2.3.1 on page 31) and miscellaneous integer benchmarks (described in Section 2.3.2 on page 31). In later experiments, we augment this initial benchmark set with integer benchmarks from the more recent SPEC95 suite and floating point benchmarks from both SPEC92 and SPEC95 (all of these benchmarks are described in Section 2.3 on page 31).

Load Value Locality

Figure 1-2 shows the average value locality for load instructions in each of the benchmarks. The value locality of each static load is measured by counting the number of times that load instruction retrieves a value from memory that matches a previously seen value for that static load and dividing by the total number of dynamic occurrences of that load. The average load value locality of a benchmark is the dynamically-weighted average of the value localities of all the static loads in that benchmark. Two sets of numbers are shown: one (light bars) for a history depth of one (i.e. we check for matches against only the most recently retrieved value), while the second set (dark bars) has a history depth of sixteen (i.e. we check against the last sixteen unique values). We see that even with a history depth of one, most of the integer programs exhibit load value locality in the 50% range, while extending the history depth to sixteen can improve that to better than 80%. This means that the vast majority of static loads exhibit very little variation in the values that they load during the course of a program's execution. Unfortunately, one of our benchmarks--cjpeg--demonstrates poor load value locality.

Figure 1-2. Load Value Locality. The light bars show value locality for a history depth of one, while the dark bars show it for a history depth of sixteen, for the benchmarks cc1-271, cjpeg, compress, eqntott, gawk, gperf, grep, mpeg, perl, quick, sc, and xlisp.

Register Value Locality

Figure 1-3 shows the average value locality for all instructions that write an integer or floating point register in each of the benchmarks. The value locality of each static instruction is measured by counting the number of times that instruction writes a value that matches a previously seen value for that static instruction and dividing by the total number of dynamic occurrences of that instruction. The average value locality of a benchmark is the dynamically-weighted average of the value localities of all the static instructions in that benchmark. Two sets of numbers are shown: one (light bars) for a history depth of one (i.e. we check for matches against only the most-recently-written value), while the second set (dark bars) has a history depth of four (i.e. we check against the last four unique values). We see that even with a history depth of one, most of the programs exhibit value locality in the 40-50% range (average 51%), while extending the history depth to four (along with a perfect mechanism for choosing the right one of the four values) can improve that to the 60-70% range (average 66%). This means that a majority of static instructions exhibit very little variation in the values that they write into registers during the course of a program's execution. Unfortunately, three of our benchmarks--cjpeg, compress, and quick--demonstrate poor register value locality.

Figure 1-3. Register Value Locality. The light bars show value locality for a history depth of one, while the dark bars show it for a history depth of four, for the same benchmarks as Figure 1-2.
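Both of these measurements follow the same procedure. As a concrete illustration, the sketch below (hypothetical C++ written for this transcription, not the instrumentation actually used to produce Figures 1-2 and 1-3) computes value locality at a given history depth from a trace of (static instruction PC, produced value) pairs:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

// One trace record per dynamic instruction: the static instruction's PC and the
// value it produced (the value loaded, for load value locality, or the value
// written to a register, for register value locality).
struct TraceRecord {
    uint64_t pc;
    uint64_t value;
};

// Value locality at history depth k: a dynamic instance is a hit if its value
// matches one of the last k unique values produced by the same static instruction.
// The dynamically-weighted average over all static instructions reduces to total
// hits divided by total dynamic instances, reported here as a percentage.
double value_locality(const std::vector<TraceRecord>& trace, std::size_t k) {
    std::unordered_map<uint64_t, std::deque<uint64_t>> history;  // PC -> last k unique values
    uint64_t hits = 0;
    for (const TraceRecord& r : trace) {
        std::deque<uint64_t>& h = history[r.pc];
        if (std::find(h.begin(), h.end(), r.value) != h.end()) {
            ++hits;                            // value recurred: a value locality hit
        } else {
            h.push_back(r.value);              // remember a newly seen unique value
            if (h.size() > k) h.pop_front();   // retain only the k most recent ones
        }
    }
    return trace.empty() ? 0.0 : 100.0 * static_cast<double>(hits) / static_cast<double>(trace.size());
}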

1.2.2 The Weak Dependence Model

The implied inter-instruction precedences of a sequential program are an overspecification and need not be rigorously enforced to meet the requirements of the sequential execution model. The actual program semantics and inter-instruction dependences are specified by the control-flow graph (CFG) and the data-flow graph (DFG). As long as the serialization constraints imposed by the CFG and the DFG are not violated, the execution of instructions can be overlapped and reordered (e.g. via out-of-order execution) to achieve better performance by avoiding the enforcement of implied but unnecessary precedences. However, true inter-instruction dependences must still be enforced. To date, all machines enforce such dependences in a rigorous fashion that involves the following two requirements:

- Dependences are determined in an absolute and exact way, i.e. two instructions are identified as either dependent or independent, and when in doubt dependences are pessimistically assumed to exist.
- Dependences are enforced throughout instruction execution, i.e. the dependences are never allowed to be violated, and are enforced continuously while the instructions are in flight.

We classify such a traditional and conservative approach as adhering to the strong dependence model for program execution. We believe that the traditional strong dependence model is overly rigorous and unnecessarily restricts available parallelism. This thesis proposes the weak dependence model, which specifies that:

- Dependences need not be determined exactly or assumed pessimistically, but can instead be optimistically approximated or even temporarily ignored.
- Dependences can be temporarily violated during instruction execution as long as recovery can be performed prior to affecting the permanent machine state.

The advantage of adopting the weak dependence model is that the program semantics as specified by the CFG and DFG need not be completely determined before the machine can process instructions. Furthermore, the machine can now speculate aggressively and temporarily violate the dependences as long as corrective measures are in place to recover from misspeculation. If a significant percentage of the speculations are correct, the machine can effectively exceed the performance limit imposed by the traditional strong dependence model.

Conceptually speaking, a machine that exploits the weak dependence model has two interacting engines. The front-end engine assumes the weak dependence model and is highly speculative: it makes predictions about instructions in order to process them aggressively. When the predictions are correct, these speculative instructions will effectively have skipped over or folded out certain pipeline stages. The back-end engine still uses the strong dependence model to validate the speculations, to recover from misspeculation, and to provide history and guidance information to the speculative engine. By combining these two interacting engines, an unprecedented level of instruction-level parallelism can be harvested without violating the program semantics. The edges in the DFG that represent inter-instruction dependences are now enforced in the critical path only when misspeculations occur. Essentially, these dependence edges have become probabilistic, and the serialization penalties incurred due to enforcing these dependences are eliminated or masked whenever correct speculations occur. Hence, the traditional data-flow limit based on the length of the critical path in the DFG is no longer a hard limit that cannot be exceeded [28].

1.2.3 Pipeline Contraction Framework

In this section, we introduce a generalized framework called pipeline contraction that captures all forms of speculation, both in the control-flow and data-flow domains. Control-flow speculation, already ubiquitous in high-performance processors, consists of speculating on both the direction (taken vs. not taken) and the target (if taken) of branch instructions. Data-flow speculation, which is less common, consists of speculating on the specific attributes or even values of instruction inputs and outputs. Both types of speculation can be described as attempts to contract the instruction execution pipeline by probabilistically obtaining the semantic outcome of an instruction as early as possible. For example, in Figure 1-4, we see the semantics of a branch instruction, which without speculation would require three pipeline stages, contracted down to one stage whenever both the target and the direction of the branch can be correctly predicted during the fetch stage. Similarly, data speculation techniques such as value prediction [28] can be used to contract execution pipelines and allow dependent instructions to execute in parallel.

Figure 1-4. Pipeline Contraction. Branch prediction is used to fold the dispatch and execute pipeline stages into the fetch stage, and value prediction is used to fold the execute stage into the dispatch stage.

The pipeline contraction framework is a useful tool for assessing the potential benefit of speculative techniques by considering the following metrics: the degree of contraction that can be obtained with the proposed technique (i.e. how many pipeline stages can be folded away), the relative frequency and accuracy of these contractions, and the delays incurred while recovering from incorrect contractions. For example, branch prediction is a very powerful technique because it measures up well against all three factors: it folds away a large number of pipeline stages, branches occur frequently and are very predictable, and recovery from mispredictions costs little or no additional delay relative to not predicting the branches. Within this framework, value prediction can be generalized to include: 1) predicting the direction and target of a branch instruction; 2) predicting the source and/or destination operands of an ALU instruction; and 3) predicting the memory address and/or operand of a load/store instruction. The number of stages folded away is determined by the distance (in pipe stages) between where the prediction is made and where the value is normally produced.

1.3 Microarchitectural Contributions

1.3.1 Load and Register Value Prediction

The fact that the register writes in many programs demonstrate a significant degree of value locality opens up exciting new possibilities for the microarchitect. Since the results of many instructions can be accurately predicted before they are issued or executed, dependent instructions are no longer bound by the serialization constraints imposed by operand data flow. Instructions can now be scheduled speculatively with additional degrees of freedom to better utilize existing functional units and hardware buffers, and are frequently able to complete execution sooner since the critical paths through the data dependence graph have been collapsed. We propose two approaches to exploiting value locality: Load Value Prediction and the more general Register Value Prediction. Both of these share two basic mechanisms: one for accurately predicting values--the VP (value prediction) unit--and one for verifying these predictions.

The Value Prediction Unit

Value prediction is useful only if it can be done accurately, since incorrect predictions can lead to increased structural hazards and longer latency (the misprediction penalty is described in greater detail on page 14). Hence, we propose a two-level prediction structure for the VP Unit: the first level is used to generate the prediction values, and the second level is used to decide whether or not the predictions are likely to be accurate. The internal structure of the VP Unit is illustrated in Figure 1-5. The VP Unit consists of two tables: the Classification Table (CT) and the Value Prediction Table (VPT), both of which are direct-mapped and indexed by the instruction address (PC) of the instruction being predicted.

Figure 1-5. Block Diagram of Value Prediction Unit. The PC of the instruction being predicted is used to index into the Value Prediction Table to find a value to predict. At the same time, the Classification Table is also indexed with the PC to determine whether or not a prediction should be made. When the instruction completes, both the prediction history and value history are updated.

Entries in the CT contain two fields: the valid field, which consists of either a single bit that indicates a valid entry or a partial or complete tag field that is matched against the upper bits of the PC to indicate a valid entry; and the prediction history field, which is a saturating counter of one or more bits that tracks the correctness of recent predictions. The prediction history is incremented or decremented whenever a prediction is correct or incorrect, respectively, and is used to classify instructions as either predictable or unpredictable. This classification is used to decide whether or not the result of a particular instruction should be predicted. Increasing the number of bits in the saturating counter adds hysteresis to the classification process and can help avoid erroneous classifications by ignoring anomalous values and/or destructive interference caused by multiple static instructions mapping to the same CT entry. The relatively simple CT configurations described in Chapters 2-4 (as well as [27] and [28]) achieved classification hit rates between 70% and 95%.

The VPT entries also consist of two fields: a valid field, which, again, can consist of a single valid bit or a full or partial tag; and a value history field, which contains one or more 32- or 64-bit values that are maintained with an LRU policy. The value history fields are written when an instruction is first encountered (by its result) or whenever a prediction is incorrect (by the actual result). The VPT replacement policy is also governed by the CT prediction history to introduce hysteresis and avoid replacing useful values with less useful ones.
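A condensed sketch of these two tables (hypothetical C++ written for this transcription; the sizes, full-PC tags, 2-bit prediction history, and a value history of depth one are illustrative simplifications rather than the configurations evaluated in Chapters 3 and 4) might look as follows:

#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Two-level, PC-indexed value predictor: the Classification Table (CT) decides
// whether to predict, and the Value Prediction Table (VPT) supplies the value.
class ValuePredictionUnit {
public:
    ValuePredictionUnit(std::size_t ct_entries, std::size_t vpt_entries)
        : ct_(ct_entries), vpt_(vpt_entries) {}

    // Lookup at fetch/dispatch time: predict only if the CT's saturating counter
    // classifies this static instruction as predictable.
    std::optional<uint64_t> predict(uint64_t pc) const {
        const CtEntry& c = ct_[pc % ct_.size()];
        const VptEntry& v = vpt_[pc % vpt_.size()];
        bool predictable = c.valid && c.tag == pc && c.history >= 2;   // 2-bit counter threshold
        if (predictable && v.valid && v.tag == pc) return v.value;
        return std::nullopt;   // classified unpredictable, or table miss: do not predict
    }

    // Update at completion time, once the actual result is known.
    void update(uint64_t pc, uint64_t actual) {
        CtEntry& c = ct_[pc % ct_.size()];
        VptEntry& v = vpt_[pc % vpt_.size()];
        bool value_matched = v.valid && v.tag == pc && v.value == actual;
        if (c.valid && c.tag == pc) {
            if (value_matched && c.history < 3) ++c.history;   // hysteresis via saturating counter
            if (!value_matched && c.history > 0) --c.history;
        } else {
            c = CtEntry{true, pc, 1};                          // newly tracked, not yet confident
        }
        if (!value_matched) v = VptEntry{true, pc, actual};    // first encounter or misprediction: (re)train value
    }

private:
    struct CtEntry  { bool valid = false; uint64_t tag = 0; uint8_t history = 0; };  // <valid> <pred history>
    struct VptEntry { bool valid = false; uint64_t tag = 0; uint64_t value = 0; };   // <valid> <value history> (depth one)
    std::vector<CtEntry> ct_;
    std::vector<VptEntry> vpt_;
};

The update path mirrors the verification outcome described next: correct predictions strengthen the classification, while incorrect ones weaken it and retrain the value history.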

Verifying Predictions

Since value prediction is by nature speculative, we need a mechanism for verifying the correctness of the predictions and efficiently recovering from mispredictions. This mechanism is summarized in the example of Figure 1-6, which shows the parallel execution of two data-dependent instructions. The producer instruction, shown on the left, has its value predicted and written to its rename buffer during the fetch and dispatch cycles. The consumer instruction, shown on the right, reads the predicted value from the rename buffer at the beginning of the execute cycle, and is able to issue and execute normally, but is forced to retain its reservation station. Meanwhile, the predicted instruction also executes, and its computed result is compared with the predicted result during its completion stage. If the values match, the consumer instruction releases its reservation station. If not, completion of the first instance of the consumer instruction is invalidated, and a second instance reissues with the correct value.

Figure 1-6. Example use of Value Prediction Mechanism. The dependent instruction shown on the right uses the predicted result of the instruction on the left, and is able to issue and execute in the same cycle.

Verifying Constant Loads

In our experiments with Load Value Prediction, we discovered that certain loads exhibit constant behavior; that is, they load the same constant value repeatedly. To exploit this behavior and avoid accessing the conventional memory hierarchy for these loads, we propose the constant verification unit (CVU), which is described in further detail in Chapter 3 (and [27]). To verify predictable loads, we simply retrieve the value from the conventional memory hierarchy and compare the predicted value to the actual value, just as we do in the more generalized value prediction scheme (see Figure 1-6). However, for highly-predictable or constant loads, we use the CVU, which allows us to avoid accessing the conventional memory system completely by forcing the VPT entries that correspond to constant loads to remain coherent with main memory (loads are classified as constant if the saturating counter at their VPT entry has reached its maximum value).

For the VPT entries that are classified as constants by the CT, the data address and the index of the VPT entry are placed in a separate, fully-associative table inside the CVU. This table is kept coherent with main memory by invalidating any entries whose data address matches a subsequent store instruction. Meanwhile, when the constant load executes, its data address is concatenated with the VPT index (the lower bits of the instruction address) and the CVU's content-addressable memory (CAM) is searched for a matching entry. If a matching entry exists, we are guaranteed that the value at that VPT entry is coherent with main memory, since any updates (stores) since the last retrieval would have invalidated the CVU entry. If one does not exist, the constant load is demoted from constant to just predictable status, and the predicted value is instead verified by retrieving the actual value from the conventional memory hierarchy. We find that an average of 6% (and up to 33% for some benchmarks) of loads from memory can be verified with the CVU, resulting in a proportional reduction in L1 cache bandwidth requirements.
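A minimal sketch of that table and its three interactions (an assumption-laden illustration written for this transcription, not the CVU design evaluated in Chapter 3; the real unit is a small CAM searched in parallel, modeled here as a linear scan):

#include <cstddef>
#include <cstdint>
#include <vector>

// Fully-associative table of (data address, VPT index) pairs for constant loads.
class ConstantVerificationUnit {
public:
    explicit ConstantVerificationUnit(std::size_t entries) : table_(entries) {}

    // Called when a load's VPT entry is promoted to "constant" status.
    void insert(uint64_t data_addr, uint32_t vpt_index) {
        table_[next_victim_++ % table_.size()] = Entry{true, data_addr, vpt_index};
    }

    // Every store invalidates matching entries, keeping constant VPT entries
    // coherent with main memory.
    void on_store(uint64_t data_addr) {
        for (Entry& e : table_)
            if (e.valid && e.addr == data_addr) e.valid = false;
    }

    // When a constant load executes: a hit guarantees the VPT value is still
    // coherent with memory, so the cache/memory access can be skipped entirely;
    // a miss demotes the load back to ordinary "predictable" status, and its
    // prediction is verified against the memory hierarchy as usual.
    bool verify(uint64_t data_addr, uint32_t vpt_index) const {
        for (const Entry& e : table_)
            if (e.valid && e.addr == data_addr && e.vpt_index == vpt_index) return true;
        return false;
    }

private:
    struct Entry { bool valid = false; uint64_t addr = 0; uint32_t vpt_index = 0; };
    std::vector<Entry> table_;
    std::size_t next_victim_ = 0;   // naive round-robin replacement for the sketch
};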

VP Unit Operation

The VP Unit predicts the values during fetch and dispatch, then forwards them speculatively to subsequent dependent instructions via the processor's standard result forwarding mechanism. Dependent instructions are able to issue and execute immediately, but are prevented from completing architecturally and are forced to retain possession of their reservation stations until their inputs are no longer speculative. Speculatively forwarded values are tagged with a bit vector representing the uncommitted register writes they depend on, and these tags are propagated to the results of any subsequent dependent instructions. Meanwhile, uncommitted instructions execute in their respective functional units, and the predicted values are verified either by a comparison against the actual values computed by the instructions or, in the case of constant loads, by an address match in the CVU. Once a prediction is verified, all the dependent instructions can either release their reservation stations and proceed into the completion unit (in the case of a correct prediction) or restart execution with the correct register values (if the prediction was incorrect). Since a large number of instructions can be in flight at the same time, the time between predicting and verifying a value can be dozens of cycles or more, allowing the processor to speculate multiple levels down the dependence chain beyond the write, executing instructions and resolving branches that would otherwise be blocked by data-flow dependences.

Misprediction Penalty

The worst-case penalty for an incorrect value prediction in this scheme, as compared to not predicting the value in question, is one additional cycle of latency along with structural hazards that might not have occurred otherwise. The penalty occurs only when a dependent instruction has already executed speculatively but is waiting in its reservation station for one of its predicted inputs to be verified. Since the value comparison takes an extra cycle beyond the pipeline result latency, the dependent instruction will reissue and execute with the correct value one cycle later than it would have had there been no prediction. In addition, the earlier incorrect speculative issue may have caused a structural hazard that prevented other useful instructions from dispatching or executing. In those cases where the dependent instruction has not yet executed (due to structural or other unresolved data dependences), there is no penalty, since the dependent instruction can issue as soon as the actual computed value is available, in parallel with the value comparison that verifies the prediction. In any case, because the CT accurately prevents incorrect predictions from occurring, the misprediction penalty does not significantly affect performance. There can also be a structural hazard penalty even in the case of a correct prediction. Since speculative values are not verified until one cycle after the actual values become available, speculatively issued dependent instructions end up occupying their reservation stations for one cycle longer than they would have had there been no prediction.
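Returning to the bit-vector tagging described under VP Unit Operation above, the fragment below (hypothetical C++ for this transcription; the mask width and helper names are illustrative assumptions) sketches how such speculative tags might propagate and be resolved:

#include <cstdint>

// Each speculatively forwarded value carries a mask with one bit per outstanding
// (uncommitted, unverified) predicted write it depends on.
struct SpecValue {
    uint64_t value;
    uint64_t spec_mask;   // bit i set => depends on unverified prediction i
};

// A dependent instruction's result inherits the union of its sources' masks, so
// the tags propagate down the dependence chain.
inline SpecValue combine(const SpecValue& a, const SpecValue& b, uint64_t result) {
    return SpecValue{result, a.spec_mask | b.spec_mask};
}

// When prediction i is verified correct, bit i is cleared; an instruction whose
// mask becomes zero is no longer speculative and may release its reservation station.
inline void clear_verified(SpecValue& v, unsigned i) {
    v.spec_mask &= ~(uint64_t{1} << i);
}

// When prediction i is verified incorrect, anything whose mask has bit i set must
// reissue with the corrected value.
inline bool must_reissue(const SpecValue& v, unsigned i) {
    return ((v.spec_mask >> i) & 1u) != 0;
}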

1.3.2 Dependence Prediction

Detecting data dependences among multiple instructions in flight is an inherently sequential task that becomes combinatorially very expensive as the number of concurrent in-flight instructions increases. Olukotun et al. argue convincingly against wide-dispatch superscalars because of this very fact [30]. Wide (i.e. greater than four instructions per cycle) dispatch is difficult to implement and has an adverse impact on cycle time because all instructions in a dispatch group must be simultaneously cross-checked. Even current microprocessor implementations with dispatch windows of four or fewer (e.g. the Alpha AXP 21164 and Pentium Pro) require multiple instruction decode and dependence-checking pipeline stages. One obvious solution to the complexity of dependence detection is to pipeline it into two or more stages to minimize the impact on cycle time. In Chapter 5, Section 5.4, we propose a pipelined approach to dependence detection that facilitates the implementation of wide-dispatch microarchitectures. However, pipelined dependence checking aggravates the cost of branch mispredictions by delaying the resolution of mispredicted branches. In Figure 1-7, we see the IPC impact of pipelining dependence checking on a 16-dispatch machine with an advanced branch predictor and no other structural resource limitations (refer to Section 2.3.4 on page 32 and Section 2.1.5 on page 24 in Chapter 2 for further details on the benchmarks and machine model). We see that lengthening dispatch to two or three pipeline stages (vs. the baseline case of one) severely increases the number of cycles during which no useful instructions are dispatched and increases CPI (decreases IPC) dramatically, to the point where sustaining even 2-3 IPC becomes very difficult.

Figure 1-7. Branch Misprediction Penalty. The approximate contribution of RAS, BTB, and BHT mispredictions to overall CPI is shown for single-cycle dispatch (left bar), 2-cycle (middle bar), and 3-cycle (right bar) pipelined dispatch, for the benchmarks go, m88ksim, gcc, compress, li, ijpeg, perl, and vortex.

We alleviate these problems in two ways: by introducing a scalable, pipelined, and speculative approach to dependence detection called dependence prediction, and by exploiting a modified approach to value prediction called source operand value prediction [28]. Fundamental to these is the notion that maintaining semantic correctness does not require that we rigorously enforce source-to-sink data-flow relationships, or even that we exactly detect these relationships before we start executing. Rather, we use dynamically adaptive techniques for predicting values as well as dependences and speculatively issue instructions early, before their dependences are resolved or even known. As shown in Figure 1-8, dependence prediction is implemented with a dependence prediction table (DPT) with 8K entries, which is direct-mapped and indexed by hashing together the instruction address bits, the gshare branch predictor's branch history register (BHR), and the relative position of the operand (i.e. first, second, or third) being looked up. Each DPT entry contains a numeric value which reflects the relative index of that input operand's location in the rename buffers. This relative index is used to check the value silo to see if the operand is already available. If all of the instruction's predicted input operands are available, the instruction is permitted to dispatch early, after the first dispatch cycle. In the second (or third, in the three-cycle dispatch pipeline) dispatch cycle, exact dependence information becomes available, and the earlier prediction is verified against the actual information. In case of a mismatch, the DPT entry is replaced with the correct relative position, and the early dispatch is cancelled.

1.3.3 Alias Prediction

As described in the previous section, detecting and enforcing dependences between multiple instructions in flight presents a serious scalability bottleneck for wide-issue superscalar processors. To a lesser extent, the detection and enforcement of dependences that occur through aliased memory locations also causes difficulties. In this case, however, the problems are caused by the latency involved in computing and comparing the addresses of all loads with all previous unretired stores. Data shown in Chapter 6 indicates that a significant portion of all loads are aliased to earlier stores (15% on average for the integer benchmarks, and 6% on average for the floating point benchmarks--see Figure 6-10 on page 111). In order to resolve these dependences as early as possible, before