CARNEGIE MELLON UNIVERSITY


CARNEGIE MELLON UNIVERSITY

VALUE LOCALITY AND SPECULATIVE EXECUTION

A DISSERTATION SUBMITTED TO THE GRADUATE SCHOOL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS for the degree of DOCTOR OF PHILOSOPHY in ELECTRICAL AND COMPUTER ENGINEERING

by Mikko Herman Lipasti

Pittsburgh, PA
April 1997


Abstract

This thesis introduces a program attribute called value locality and proposes speculative execution under the weak dependence model. The weak dependence model lays a theoretical foundation for exploiting value locality and other program attributes by speculatively relaxing and deferring the detection and enforcement of control- and data-flow dependences between instructions to expose more instruction-level parallelism without violating program correctness.

Value locality is a program attribute that describes the likelihood of the recurrence of a previously-seen value within a storage location inside a computer system. Most modern processors already exploit value locality through the use of control speculation (i.e. branch prediction), which seeks to predict the future values of condition code bits and branch-target addresses based on previously-seen values. Experimental results indicate that value locality exists for condition codes and branch target addresses, and for general-purpose and floating-point registers as well. Furthermore, value locality exists not only in the data flow portion of a processor, but also in the control logic, where both register and memory dependences between instructions tend to remain static and are relatively easily predictable.

To exploit value locality, several dynamic prediction mechanisms are proposed. These mechanisms increase instruction-level parallelism by speculatively collapsing explicit and implicit dependences between instructions and folding away execution pipeline stages under the pipeline contraction framework. Detailed evaluation of value prediction for load instructions only, as well as for all instructions that write general-purpose or floating-point registers, shows significant potential for performance improvements. Further experimental results indicate that both register and memory dependence relationships between instructions are easily predictable.

These discoveries result in significant potential for further performance improvements, particularly in conjunction with wide-issue and deeply-pipelined superscalar processors that employ aggressive techniques like accurate dynamic branch prediction and instruction fetching via trace cache to overcome control- and data-flow restrictions on parallel execution of serial programs. Finally, this thesis introduces a new microarchitectural paradigm called Superflow that supersedes historical limits on instruction flow, register dataflow, and memory dataflow and demonstrates potential performance improvements of 2-4x over the current state-of-the-art microprocessors.


Acknowledgments

First of all, I want to express my appreciation to my advisor, John Paul Shen, for his support, advice, and leadership throughout this project. John's ability to condense wild ideas and conjectures down to key concepts, as well as his willingness to entertain and pursue such notions, has been instrumental in evolving this thesis from vague ideas about redundancy in memory traffic to effective and powerful microarchitectural techniques for increasing instruction-level parallelism.

I would also like to thank professors Daniel P. Siewiorek and Rob Rutenbar of CMU, and James Smith of the University of Wisconsin, for serving on my thesis committee. I benefited greatly from their wealth of knowledge and experience during the transition from Ph.D. proposal to completed Ph.D. thesis. I also want to express my gratitude to my outside committee members, Dr. Greg Pfister from IBM Austin, and Dr. Steve Kunkel from IBM Rochester. Their insights and feedback helped to steer my research toward real-world implementation and performance issues.

I also benefited greatly from numerous discussions with other members of our research group, including Bryan Black, Yuan Chou, Andrew Huang, Derek Noonburg, and Chris Newburn. Special thanks go to Chris Wilkerson for his key insights in the early stages of this thesis as well as for coining the term Superflow to describe the microarchitectural paradigm advocated in this thesis. I also want to express my appreciation to my management at IBM, whose professional and financial support greatly eased the transition from salaried employee to indentured servant (i.e. graduate student).

Finally, thanks to my loving wife Erica Ann Lipasti for her patience, forbearance, and willingness to make tremendous sacrifices in not just financial security, standard of living, and quality of life, but also personal friendships, proximity to family, and psychological and emotional support systems in order to let me perform this work. Her enduring support during this time is the ultimate sign of her love, devotion, and respect. Furthermore, her selfless devotion to caring and providing for our dear daughter, Emma Kristiina (who has evolved from a toddler barely out of helpless infancy to a happy, active, intelligent, and very outgoing young lady during our time here at CMU), has been instrumental in allowing me to focus my energies on my research without having to sacrifice those precious and irreplaceable times together as a family.


Contents

Abstract
Acknowledgments

CHAPTER 1 Introduction
  Historical Background and Motivation
  Taxonomy of Speculative Execution
  Theoretical Contributions
  Value Locality
  The Weak Dependence Model
  Pipeline Contraction Framework
  Microarchitectural Contributions
  Load and Register Value Prediction
  Dependence Prediction
  Alias Prediction
  Putting it all Together: The Superflow Paradigm
  Thesis Overview

CHAPTER 2 Machine Models and Workloads
  Machine Models
  PowerPC
  PowerPC
  Infinite PowerPC Model
  Alpha AXP
  Execution-Driven Idealized PowerPC Model
  Misprediction Recovery Mechanisms
  Instruction Refetch
  Instruction Reissue
  Selective Instruction Reissue
  Workloads
  SPEC92 Integer Suite
  Miscellaneous Integer Programs
  SPEC92 Floating Point Suite
  SPEC95 Integer Suite
  SPEC95 Floating Point Suite

CHAPTER 3 Load Value Prediction
  Introduction and Related Work
  Value Locality
  Exploiting Value Locality
  Load Value Prediction Table
  Dynamic Load Classification
  Constant Verification Unit
  The Load Value Prediction Unit
  LVP Unit Implementation Notes
  Microarchitectural Models
  PowerPC 620/620+ LVP Unit Operation
  Alpha AXP LVP Unit Operation
  Experimental Framework
  Experimental Results
  Base Machine Model Speedups with Realistic LVP
  Enhanced Machine and LVP Model Speedups
  Distribution of Load Verification Latencies
  Data Dependency Resolution Latencies
  Bank Conflicts
  Conclusions and Future Work

CHAPTER 4 Register Value Prediction
  Motivation
  Value Locality
  Exploiting Value Locality
  The Value Prediction Unit
  Verifying Predictions
  Microarchitectural Models
  VP Unit Operation
  Misprediction Penalty
  Experimental Framework
  Experimental Results
  PowerPC 620 Machine Model Speedups
  PowerPC 620+ Machine Model Speedups
  Infinite Machine Model Speedups
  VP Unit Implementation
  Conclusions and Future Work

CHAPTER 5 Dependence Prediction
  Motivation and Related Work
  Detecting Control and Data Dependences
  Experimental Framework
  Pipelined Dispatch Structure
  Dependence Prediction and Recovery
  Source Operand Value Prediction and Recovery
  Conclusions and Future Work

CHAPTER 6 The Superflow Paradigm
  Background
  The Superflow Paradigm
  The Weak Dependence Model
  Generalized Speculation and Pipeline Contraction
  Overview of the Superflow Paradigm
  Instruction Flow Techniques
  Conditional Branch Throughput
  Taken Branch Throughput
  Misprediction Latency
  Register Data Flow Techniques
  Dependence Detection and Prediction
  Eliminating Dependences
  Memory Data Flow Techniques
  Memory Latency
  Memory Bandwidth
  Summary and Conclusions

CHAPTER 7 Summary and Conclusions
  Thesis Summary
  Key Contributions
  Theoretical Contributions
  Microarchitectural Contributions
  Future Work

APPENDIX A Additional Data on Load Value Prediction
  Miscellaneous PowerPC Data
  PowerPC Load Value Prediction Data
  Miscellaneous Alpha AXP Data

APPENDIX B Additional Data on Register Value Prediction
  Miscellaneous PowerPC Data
  PowerPC Register Value Prediction Data

APPENDIX C Additional Data on Superflow Machine Models
  Additional Results For Instruction Flow
  Additional Results For Register Data Flow
  Additional Results For Memory Data Flow

List of Figures

Figure 1-1. Taxonomy of Speculative Execution
Figure 1-2. Load Value Locality
Figure 1-3. Register Value Locality
Figure 1-4. Pipeline Contraction
Figure 1-5. Block Diagram of Value Prediction Unit
Figure 1-6. Example use of Value Prediction Mechanism
Figure 1-7. Branch Misprediction Penalty
Figure 1-8. Dependence Prediction Mechanism
Figure 1-9. Superflow Overview
Figure 2-1. PPC 620 and 620+ Block Diagram
Figure 2-2. Alpha AXP Block Diagram
Figure 3-1. Load Value Locality
Figure 3-2. PowerPC Value Locality by Data Type
Figure 3-3. Block Diagram of the LVP Mechanism
Figure 3-4. Base Machine Model Speedups
Figure 3-5. Load Verification Latency Distribution
Figure 3-6. Data Dependency Resolution Latencies
Figure 3-7. Percentage of Cycles with Bank Conflicts
Figure 4-1. Register Value Locality
Figure 4-2. Register Value Locality by Instruction Type
Figure 4-3. Value Prediction Unit
Figure 4-4. VPT Hit Sensitivity to Size
Figure 4-5. CT Hit s
Figure 4-6. Example use of Value Prediction Mechanism
Figure Speedups
Figure Speedups
Figure 4-9. Infinite Machine Model Speedups
Figure Doubling Data Cache vs. VP
Figure 5-1. Branch Misprediction Penalty
Figure 5-2. Pipelined Dispatch Structure
Figure 5-3. Dependence Prediction Mechanism
Figure 5-4. Source Operand Value Prediction Mechanism
Figure 5-5. Effect of Dependence and Value Prediction
Figure 5-6. Reduced Branch Misprediction Penalty
Figure 6-1. Pipeline Contraction
Figure 6-2. Superflow Overview
Figure 6-3. Superflow Instruction Fetch Unit
Figure 6-4. Instruction Fetch Unit Performance
Figure 6-5. Superflow Instruction Fetch Unit Performance
Figure 6-6. Source Operand Value Predictability
Figure 6-7. Dependence Predictability
Figure 6-8. Effect of Deep Pipelining
Figure 6-9. Effect of Finite Reorder Buffer
Figure Alias Predictability
Figure Load Stream Partitioning
Figure Load Value Predictability
Figure Effect of Constrained Memory Bandwidth

List of Tables

Table 2-1. PowerPC Machine Model Specifications
Table 2-2. Alpha AXP 21164 Instruction Latencies
Table 2-3. Idealized PowerPC Model Instruction Latencies
Table 2-4. SPEC92 Integer Benchmark Descriptions
Table 2-5. Miscellaneous Integer Benchmark Descriptions
Table 2-6. SPEC92 Floating Point Benchmark Descriptions
Table 2-7. SPEC95 Integer Benchmark set
Table 2-8. SPEC95 Floating Point Benchmark set
Table 3-1. LVP Unit Configurations
Table 3-2. LCT Hit s
Table 3-3. Successful Constant Identification s
Table 3-4. PowerPC 620+ Speedups
Table 4-1. Instruction Types
Table 4-2. Classification Table Configurations
Table 4-3. Baseline Performance (IPC)
Table 4-4. VP Unit Configurations
Table 5-1. Machine Model Parameters
Table 5-2. Benchmark Characteristics
Table 5-3. Dependence Prediction Results
Table 5-4. Source Operand Value Prediction Results
Table 6-1. Evolution of Microprocessors
Table 6-2. Benchmark Characteristics
Table A-1. PowerPC 620 Model Miscellaneous Data
Table A-2. PowerPC 620+ Model Miscellaneous Data
Table A-3. PowerPC 620 LVP Data
Table A-4. PowerPC 620+ LVP Data
Table A-5. Alpha AXP Data
Table B-1. PowerPC 620 Model Miscellaneous Data
Table B-2. PowerPC 620+ Model Miscellaneous Data
Table B-3. PowerPC 620 VP Data
Table B-4. PowerPC 620+ VP Data
Table B-5. Infinite PowerPC VP Data
Table C-1. SPECInt95 Results for Instruction Flow
Table C-2. SPECFP95 Results for Instruction Flow
Table C-3. SPECInt95 Results for Register Data Flow
Table C-4. SPECFP95 Results for Register Data Flow
Table C-5. SPECInt95 Results for ROB Size 128 and Fetch Width
Table C-6. SPECFP95 Results for ROB Size 128 and Fetch Width
Table C-7. SPECFP95 Results for ROB Size 256 and Fetch Width
Table C-8. SPECInt95 Results for Memory Data Flow for ROB Size
Table C-9. SPECFP95 Results for Memory Data Flow for ROB Size
Table C-10. SPECFP95 Results for Memory Data Flow for ROB Size

CHAPTER 1 Introduction

This thesis introduces a ubiquitous program attribute called value locality and proposes speculative execution under the weak dependence model. The weak dependence model lays a theoretical foundation for exploiting value locality and other program attributes by speculatively relaxing and deferring the detection and enforcement of control- and data-flow dependences between instructions to expose more instruction-level parallelism. To exploit value locality, several dynamic prediction mechanisms are proposed. These mechanisms increase instruction-level parallelism (ILP, commonly measured in instructions per cycle, or IPC) by speculatively collapsing explicit and implicit dependences between instructions and folding away execution pipeline stages under the pipeline contraction framework.

1.1 Historical Background and Motivation

There are two fundamental restrictions that limit the amount of instruction-level parallelism (ILP) that can be extracted from sequential programs: control flow and data flow. Control flow limits ILP by imposing serialization constraints at forks and joins in a program's control flow graph [1]. Data flow limits ILP by imposing serialization constraints on pairs of instructions that are data dependent (i.e. one needs the result of another to compute its own result, and hence must wait for the other to complete before beginning to execute). Examining the extent and effect of these limits has been a popular and important area of research, particularly in the case of control flow [2,3,4,5]. Continuing advances in the development of accurate branch predictors (e.g. [6]) have led to increasingly aggressive control-speculative microarchitectures (e.g. the Intel Pentium Pro [7]), which use branch prediction and speculative execution to bypass control dependences and expose additional instruction-level parallelism to the microarchitecture. Meanwhile, numerous mechanisms have been proposed and implemented to eliminate false data dependences and tolerate the latencies induced by true data dependences by allowing instructions to execute out of program order (see [8] for an overview).

Surprisingly, in light of the extensive energies focused on eliminating control-flow restrictions on parallel instruction issue, less attention has been paid to eliminating data-flow restrictions on parallel issue. Recent work has focused primarily on reducing the latency of specific types of instructions (usually loads from memory) by rearranging pipeline stages [9, 10], initiating memory accesses earlier [11], or speculating that dependences to earlier stores do not exist [12, 13, 14, 15]. The most relevant prior work in the area of eliminating data-flow dependences consists of the Tree Machine [16,17], which uses a value cache to store and look up the results of recurring arithmetic expressions to eliminate redundant computation (the value cache, in effect, performs common subexpression elimination [1] in hardware). Richardson follows up on this concept in [18] by introducing two concepts: trivial computation, which is defined as the trivialization of potentially complex operations by the occurrence of simple operands; and redundant computation, where an operation repeatedly performs the same computation because it sees the same operands. He proposes a hardware mechanism (the result cache) which reduces the latency of such trivial or redundant complex arithmetic operations by storing and looking up their results in the result cache.

In this thesis, we introduce the concept of value locality, which is similar to redundant computation, along with a proposed technique, Value Prediction (VP), for predicting the results of instructions at dispatch by exploiting the affinity between instruction addresses and the values these instructions produce. VP differs from Harbison's value cache and Richardson's result cache in two important ways: first, the VP table is indexed by instruction address, and hence value lookups can occur very early in the pipeline; second, it is speculative in nature, and relies on a verification mechanism to guarantee correctness. In contrast, both Harbison and Richardson use table indices that are only available later in the pipeline (Harbison uses data addresses, while Richardson uses actual operand values), and both require their predictions to be correct, hence requiring mechanisms for keeping their tables coherent with all other computation.
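The two properties that distinguish VP (a table indexed by instruction address, and speculation backed by verification) can be illustrated with a minimal Python sketch. This is not the VP unit evaluated later in the thesis: the class name `ValuePredictionTable`, the table size, the assumption of 4-byte instructions, and the last-value policy are all hypothetical choices made for illustration.

```python
class ValuePredictionTable:
    """Hypothetical direct-mapped table of last-seen results, indexed by instruction address."""

    def __init__(self, entries=1024):
        self.entries = entries   # illustrative size, not taken from the thesis
        self.table = {}          # index -> most recently produced value

    def _index(self, pc):
        # Assumes 4-byte instructions; the index needs only the instruction
        # address, so a lookup can happen before any operands are available.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Return the predicted result for the instruction at pc, or None."""
        return self.table.get(self._index(pc))

    def verify_and_update(self, pc, predicted, actual):
        """Compare a prediction with the computed result, then train the table."""
        self.table[self._index(pc)] = actual
        return predicted is not None and predicted == actual


vpt = ValuePredictionTable(entries=16)
outcomes = []
for actual in [7, 7, 7, 9, 9]:      # results produced by one static instruction
    guess = vpt.predict(0x1000)     # early lookup by instruction address
    outcomes.append(vpt.verify_and_update(0x1000, guess, actual))
print(outcomes)                     # [False, True, True, False, True]
```

Because the table is untagged in this sketch, two instructions whose addresses map to the same index share an entry. That is tolerable only because every prediction is verified before it can affect architected state, which is the contrast with the value cache and result cache drawn above.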

Speculative Execution
  Control Speculation
    Branch Direction (binary)
    Branch Target (multi-valued)
  Data Speculation
    Data Location
      Aliased (binary)
      Address (multi-valued)
    Data Value (multi-valued)

Figure 1-1. Taxonomy of Speculative Execution

Taxonomy of Speculative Execution

In order to place our work on prediction-based speculative execution into a meaningful historical context, we introduce a taxonomy of speculative execution. This taxonomy, summarized in Figure 1-1, categorizes our work as well as previously introduced techniques based on which types of dependences are being bypassed (control vs. data), whether the speculation relates to storage location or value, and what type of decision must be made to enable the speculation (binary vs. multi-valued).

Control Speculation

There are essentially two types of control speculation: speculating on the direction of a branch, which requires a binary decision (taken vs. not-taken); and speculating on the target of a branch, which requires a multi-valued decision (the target can potentially be anywhere in the program's address space). Examples of the former are any of the many branch prediction schemes explored in the literature (e.g. [19,6,20]), while examples of the latter are the Branch Target Buffer (BTB) or Branch Target Address Cache (BTAC) units included on most modern microprocessors (e.g. the PowerPC 620 [15] or the Intel Pentium Pro [7]). A novel mechanism for performing both branch direction and branch target prediction is proposed as part of the Superflow microarchitecture paradigm in Chapter 6.

Data Speculation

Data speculation techniques break down logically into two categories: those that speculate on the storage location of the data, and those that speculate on the actual value. Furthermore, techniques that speculate on the location come in two fundamentally different flavors: those that speculate on a specific attribute of the storage location (e.g. is it aliased with an earlier definition), and those that speculate on the address of the storage location. An example of the former is speculative disambiguation, which optimistically assumes that an earlier definition does not alias with a current use, and provides a mechanism for checking the accuracy of that assumption. Speculative disambiguation has been implemented both in software [13] and in hardware [12, 14, 15]. Another example of this type of speculation occurs implicitly in most control-speculative processors, whenever execution proceeds speculatively past a join in the control-flow graph where multiple reaching definitions for a storage location are live [1]. By speculating past that join, the processor hardware is implicitly speculating that the definition on the predicted path to the join in question is in fact the correct one (as opposed to the definition on an alternate path).

There are a large number of techniques that speculate on data address. Most prefetching techniques, for example, are speculative in nature and rely on some heuristic for generating addresses of future memory references (e.g. [21, 22, 23, 24, 25]). Of course, since prefetching has no architected side effects, no mechanism is needed for verifying the accuracy of the prediction or for recovering from mispredictions. Another example of a technique that speculates on data address is fast address calculation [26, 11], which enables early initiation of memory loads by speculatively generating addresses early in the pipeline. Dependence prediction, proposed in Chapter 5, and alias prediction, proposed in Chapter 6, are speculative techniques that predict the current storage location of register input operands (i.e. rename buffer number) and memory operands (e.g. store queue entry), respectively.

The final category in our taxonomy, techniques that speculate on data value, has received little attention in the literature. The only work we are aware of is that proposed in this thesis (preliminary results have been published in [27] and [28]). Note that neither the Tree Machine [16,17] nor Richardson's work [18] qualifies, since they are not speculative.

1.2 Theoretical Contributions

Value Locality

In this thesis, we introduce the concept of value locality, which we define as the likelihood of a previously-seen value recurring repeatedly within a storage location. Although the concept is general and can be applied to any storage location within a computer system, we have limited our study to examine only the value locality of general-purpose or floating point registers immediately following instructions that write to those registers, as well as the value locality exhibited in dependence relationships between instructions. A plethora of previous work on static and dynamic branch prediction (e.g. [19,6,20]) has focused on an even more restricted application of value locality, namely the prediction of a single condition bit based on its past behavior.

Intuitively, it seems that it would be a very difficult task to discover any useful amount of value locality in a general purpose register. After all, a 32-bit register can contain any one of over four billion values; how could one possibly predict which of those is even somewhat likely to occur next? As it turns out, if we narrow the scope of our prediction mechanism by considering each static instruction individually, the task becomes much easier and we are able to accurately predict a significant fraction of register values being written by machine instructions. We examine the phenomenon of value locality more closely in Section 3.2 on page 37 and Section 4.2 on page 58.

The initial benchmark set that we use to explore value locality and quantify its performance impact consists of the SPEC92 integer suite (described in Section on page 31) and miscellaneous integer benchmarks (described in Section on page 31).
In later experiments, we augment this initial benchmark set with integer benchmarks from the more recent SPEC95 suite and floating point benchmarks from both SPEC92 and SPEC95 (all of these benchmarks are described in Section 2.3 on page 31).

Load Value Locality

Figure 1-2 shows the average value locality for load instructions in each of the benchmarks. The value locality of each static load is measured by counting the number of times that load instruction retrieves a value from memory that matches a previously seen value for that static load and dividing by the total number of dynamic occurrences of that load. The average load value locality of a benchmark is the dynamically-weighted average of the value localities of all the static loads in that benchmark. Two sets of numbers are shown, one (light bars) for a history depth of one (i.e. we check for matches against only the most recently retrieved value), while the second set (dark bars) has a history depth of sixteen (i.e. we check against the last sixteen unique values). We see that even with a history depth of one, most of the integer programs exhibit load value locality in the 50% range, while extending the history depth to sixteen can improve that to better than 80%. What that means is that the vast majority of static loads exhibit very little variation in the values that they load during the course of a program's execution. Unfortunately, one of our benchmarks, cjpeg, demonstrates poor load value locality.

Figure 1-2. Load Value Locality. The light bars show value locality for a history depth of one, while the dark bars show it for a history depth of sixteen. Benchmarks shown: cc1-271, cjpeg, compress, eqntott, gawk, gperf, grep, mpeg, perl, quick, sc, xlisp.

Register Value Locality

Figure 1-3 shows the average value locality for all instructions that write an integer or floating point register in each of the benchmarks. The value locality of each static instruction is measured by counting the number of times that instruction writes a value that matches a previously seen value for that static instruction and dividing by the total number of dynamic occurrences of that instruction. The average value locality of a benchmark is the dynamically-weighted average of the value localities of all the static instructions in that benchmark. Two sets of numbers are shown, one (light bars) for a history depth of one (i.e. we check for matches against only the most recently written value), while the second set (dark bars) has a history depth of four (i.e. we check against the last four unique values). We see that even with a history depth of one, most of the programs exhibit value locality in the 40-50% range (average 51%), while extending the history depth to four (along with a perfect mechanism for choosing the right one of the four values) can improve that to the 60-70% range (average 66%). What that means is that a majority of static instructions exhibit very little variation in the values that they write into registers during the course of a program's execution. Unfortunately, three of our benchmarks (cjpeg, compress, and quick) demonstrate poor register value locality.

Figure 1-3. Register Value Locality. The light bars show value locality for a history depth of one, while the dark bars show it for a history depth of four. Benchmarks shown: cc1-271, cjpeg, compress, eqntott, gawk, gperf, grep, mpeg, perl, quick, sc, xlisp.

The Weak Dependence Model

The implied inter-instruction precedences of a sequential program are an overspecification and need not be rigorously enforced to meet the requirements of the sequential execution model. The actual program semantics and inter-instruction dependences are specified by the control-flow graph (CFG) and the data-flow graph (DFG). As long as the serialization constraints imposed by the CFG and the DFG are not violated, the execution of instructions can be overlapped and reordered (e.g. via out-of-order execution) to achieve better performance by avoiding the enforcement of implied but unnecessary precedences. However, true inter-instruction dependences must still be enforced. To date, all machines enforce such dependences in a rigorous fashion that involves the following two requirements:

  • Dependences are determined in an absolute and exact way, i.e. two instructions are identified as either dependent or independent, and when in doubt dependences are pessimistically assumed to exist.
  • Dependences are enforced throughout instruction execution, i.e. the dependences are never allowed to be violated, and are enforced continuously while the instructions are in flight.

We classify such a traditional and conservative approach as adhering to the strong dependence model for program execution. We believe that the traditional strong dependence model is overly rigorous and unnecessarily restricts available parallelism. This thesis proposes the weak dependence model, which specifies that:

  • Dependences need not be determined exactly or assumed pessimistically, but can instead be optimistically approximated or even temporarily ignored.
  • Dependences can be temporarily violated during instruction execution as long as recovery can be performed prior to affecting the permanent machine state.

The advantage of adopting the weak dependence model is that the program semantics as specified by the CFG and DFG need not be completely determined before the machine can process instructions. Furthermore, the machine can now speculate aggressively and temporarily violate the dependences as long as corrective measures are in place to recover from misspeculation. If a significant percentage of the speculations are correct, the machine can effectively exceed the performance limit imposed by the traditional strong dependence model.

Conceptually speaking, a machine that exploits the weak dependence model has two interacting engines. The front-end engine assumes the weak dependence model and is highly speculative. It tries to make predictions about instructions in order to aggressively process instructions. When the predictions are correct, these speculative instructions will effectively have skipped over or folded out certain pipeline stages. The back-end engine still uses the strong dependence model to validate the speculations, to recover from misspeculation, and to provide history and guidance information to the speculative engine. In combining these two interacting engines, an unprecedented level of instruction-level parallelism can be harvested without violating the program semantics.

The edges in the DFG that represent inter-instruction dependences are now enforced in the critical path only when misspeculations occur. Essentially, these dependence edges have become probabilistic and the serialization penalties incurred due to enforcing these dependences are eliminated or masked whenever correct speculations occur. Hence, the traditional data-flow limit based on the length of the critical path in the DFG is no longer a hard limit that cannot be exceeded [28].
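The claim that the data-flow limit stops being a hard limit can be made concrete with a small scheduling sketch in Python. It is a toy model under stated assumptions, not the simulator used in this thesis: the function name `finish_times`, the `"none"`/`"hit"`/`"miss"` outcome labels, and the one-cycle recovery penalty are hypothetical, and the model ignores resource limits and in-order commit.

```python
def finish_times(producers, latency, outcome, penalty=1):
    """Completion cycle of each instruction in an idealized dataflow schedule.

    producers[i]: indices (all < i) of instructions whose results i consumes.
    latency[i]:   execution latency of instruction i, in cycles.
    outcome[j]:   "none" if the result of j is not predicted (strong model),
                  "hit"  if it is predicted correctly,
                  "miss" if it is predicted incorrectly.
    penalty:      assumed extra cycles a consumer loses after a misprediction.
    """
    finish = []
    for i, deps in enumerate(producers):
        ready = 0
        for j in deps:
            if outcome[j] == "hit":
                continue  # edge is off the critical path; it is only verified
            extra = penalty if outcome[j] == "miss" else 0
            ready = max(ready, finish[j] + extra)
        finish.append(ready + latency[i])
    return finish


# A chain of four dependent single-cycle instructions.
chain = [[], [0], [1], [2]]
lat = [1, 1, 1, 1]
print(max(finish_times(chain, lat, ["none"] * 4)))              # 4
print(max(finish_times(chain, lat, ["hit"] * 4)))               # 1
print(finish_times(chain, lat, ["hit", "miss", "hit", "none"])) # [1, 1, 3, 1]
```

With no prediction the chain takes four cycles, which is the length of the critical path. When every edge is predicted correctly the chain collapses to one cycle, and a misprediction re-serializes only the affected edge at the cost of the assumed penalty.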

Figure 1-4. Pipeline Contraction. Branch prediction is used to fold the dispatch and execute pipeline stages into the fetch stage, and value prediction is used to fold the execute stage into the dispatch stage. Panels: Branch Prediction, Value Prediction. Stage labels shown: Fetch, Dispatch, Execute, Rename, Commit, Op Read.

Pipeline Contraction Framework

In this section, we introduce a generalized framework called pipeline contraction that captures all forms of speculation, both in the control-flow and data-flow domains. Control-flow speculation, already ubiquitous in high-performance processors, consists of speculating on both the direction (taken vs. not taken) and the target (if taken) of branch instructions. Data-flow speculation, which is less common, consists of speculating on the specific attributes or even values of instruction inputs and outputs. Both types of speculation can be described as attempts to contract the instruction execution pipeline by probabilistically obtaining the semantic outcome of an instruction as early as possible. For example, in Figure 1-4, we see the semantics of a branch instruction, which without speculation would require three pipeline stages, contracted down to one stage whenever both the target and the direction of the branch can be correctly predicted during the fetch stage. Similarly, data speculation techniques such as value prediction [28] can be used to contract execution pipelines and allow dependent instructions to execute in parallel.
The pipeline contraction framework is a useful tool for assessing the potential benefit of speculative techniques by considering the following metrics: the degree of contraction that can be obtained with the proposed technique (i.e. how many pipeline stages can be folded away), the relative frequency and accuracy of these contractions, and the delays incurred while recovering from incorrect contractions. For example, branch prediction is a very powerful technique because it measures up well against all three factors: it folds away a large number of pipeline stages, branches occur frequently and are very predictable, and recovery from mispredictions costs little or no additional delay relative to not predicting the branches.

Within this framework, value prediction can be generalized to include: 1) predicting direction and target of a branch instruction; 2) predicting source and/or destination operands of an ALU instruction; and 3) predicting the memory address and/or operand of a load/store instruction. The number of stages folded away is determined by the distance (in pipe stages) between where the prediction is made and where the value is normally produced.

1.3 Microarchitectural Contributions

Load and Register Value Prediction

The fact that the register writes in many programs demonstrate a significant degree of value locality opens up exciting new possibilities for the microarchitect. Since the results of many instructions can be accurately predicted before they are issued or executed, dependent instructions are no longer bound by the serialization constraints imposed by operand data flow. Instructions can now be scheduled speculatively with additional degrees of freedom to better utilize existing functional units and hardware buffers, and are frequently able to complete execution sooner since the critical paths through the data dependence graph have been collapsed.

We propose two approaches to exploiting value locality: Load Value Prediction and the more general Register Value Prediction. Both of these share two basic mechanisms: one for accurately predicting values (the value prediction, or VP, unit) and one for verifying these predictions.
The Value Prediction Unit

Value prediction is useful only if it can be done accurately, since incorrect predictions can lead to increased structural hazards and longer latency (the misprediction penalty is described in greater detail on page 14). Hence, we propose a two-level prediction structure for the VP Unit: the first level is used to generate the prediction values, and the second level is used to decide whether or not the predictions are likely to be accurate. The internal structure of the VP Unit is illustrated in Figure 1-5. The VP Unit consists of two tables: the Classification Table (CT) and the Value Prediction Table (VPT), both of which are direct-mapped and indexed by the instruction address (PC) of the instruction being predicted.

[Figure 1-5 labels: PC of predicted instruction; Classification Table (CT) with <valid> and <pred history> fields; Value Prediction Table (VPT) with <valid> and <value history> fields; Prediction Result; Predicted Value; Updated Value]

Figure 1-5. Block Diagram of Value Prediction Unit. The PC of the instruction being predicted is used to index into the Value Prediction Table to find a value to predict. At the same time, the Classification Table is also indexed with the PC to determine whether or not a prediction should be made. When the instruction completes, both the prediction history and value history are updated.

Entries in the CT contain two fields: the valid field, which consists of either a single bit that indicates a valid entry or a partial or complete tag field that is matched against the upper bits of the PC to indicate a valid entry; and the prediction history field, which is a saturating counter of 1 or more bits that tracks the correctness of recent predictions. The prediction history is incremented whenever a prediction is correct and decremented whenever it is incorrect, and is used to classify instructions as either predictable or unpredictable. This classification is used to decide whether or not the result of a particular instruction should be predicted. Increasing the number of bits in the saturating counter adds hysteresis to the classification process and can help avoid erroneous classifications by ignoring anomalous values and/or destructive interference caused by multiple static instructions mapping to the same CT entry. The relatively simple CT configurations described in Chapters 2-4 (as well as [27] and [28]) achieved classification hit rates between 70% and 95%.

The VPT entries also consist of two fields: a valid field, which, again, can consist of a single valid bit or a full or partial tag; and a value history field, which contains one or more 32- or 64-bit values that are maintained with an LRU policy.
The value history fields are written when an instruction is first encountered (by its result) or whenever a prediction is incorrect (by the actual result). The VPT replacement policy is also governed by the CT prediction history to introduce hysteresis and avoid replacing useful values with less useful ones.

[Figure 1-6: pipeline diagram of a predicted instruction and a dependent instruction across the Fetch, Dispatch, Execute, and Complete/Verify stages]

Figure 1-6. Example use of Value Prediction Mechanism. The dependent instruction shown on the right uses the predicted result of the instruction on the left, and is able to issue and execute in the same cycle.

Verifying Predictions

Since value prediction is by nature speculative, we need a mechanism for verifying the correctness of the predictions and efficiently recovering from mispredictions. This mechanism is summarized in the example of Figure 1-6, which shows the parallel execution of two data-dependent instructions. The producer instruction, shown on the left, has its value predicted and written to its rename buffer during the fetch and dispatch cycles. The consumer instruction, shown on the right, reads the predicted value from the rename buffer at the beginning of the execute cycle, and is able to issue and execute normally, but is forced to retain its reservation station. Meanwhile, the predicted instruction also executes, and its computed result is compared with the predicted result during its completion stage. If the values match, the consumer instruction releases its reservation station. If not, completion of the first instance of the consumer instruction is invalidated, and a second instance reissues with the correct value.
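The two-level VP Unit described above can be sketched in a few lines of Python. This is a minimal behavioral model under stated assumptions, not the evaluated hardware configuration: the table size, the 2-bit counter, the prediction threshold, the word-aligned PC indexing, and the class and method names are all hypothetical. It keeps a single value of history per entry, omits tags, replaces the stored value on every mismatch (ignoring the CT-governed replacement hysteresis), and updates the counter at completion even when no prediction was issued.

```python
from dataclasses import dataclass


@dataclass
class VPUnit:
    """Direct-mapped CT + VPT sketch; sizes and thresholds are assumptions."""
    entries: int = 1024      # assumed number of entries in each table
    counter_max: int = 3     # assumed 2-bit saturating counter
    threshold: int = 2       # assumed: predict when counter >= threshold

    def __post_init__(self):
        self.ct = [0] * self.entries       # prediction history counters
        self.vpt = [None] * self.entries   # one value of history per entry

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries    # assumes 4-byte instructions

    def predict(self, pc: int):
        """Return a predicted value, or None if classified unpredictable."""
        i = self._index(pc)
        if self.vpt[i] is not None and self.ct[i] >= self.threshold:
            return self.vpt[i]
        return None

    def update(self, pc: int, actual: int) -> None:
        """At completion, update the prediction history and value history."""
        i = self._index(pc)
        if self.vpt[i] == actual:
            self.ct[i] = min(self.ct[i] + 1, self.counter_max)
        else:
            self.ct[i] = max(self.ct[i] - 1, 0)
            self.vpt[i] = actual           # first encounter or misprediction


vp = VPUnit()
results = []
for _ in range(4):                         # same static instruction, same result
    results.append(vp.predict(0x1000))
    vp.update(0x1000, 42)
print(results)                             # [None, None, None, 42]
```

In this sketch the instruction is predicted only on its fourth execution, after the counter has reached the threshold, which illustrates how the classification step trades coverage for accuracy.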

Verifying Constant Loads

In our experiments with Load Value Prediction, we discovered that certain loads exhibit constant behavior; that is, they load the same constant value repeatedly. To exploit this behavior and avoid accessing the conventional memory hierarchy for these loads, we propose the constant verification unit (CVU), which is described in further detail in Chapter 3 (and [27]). To verify predictable loads, we simply retrieve the value from the conventional memory hierarchy and compare the predicted value to the actual value, just as we do in the more generalized value prediction scheme (see Figure 1-6). However, for highly-predictable or constant loads, we use the CVU, which allows us to avoid accessing the conventional memory system completely by forcing the VPT entries that correspond to constant loads to remain coherent with main memory (loads are classified as constant if the saturating counter at their CT entry has reached its maximum value).

For the VPT entries that are classified as constants by the CT, the data address and the index of the VPT entry are placed in a separate, fully-associative table inside the CVU. This table is kept coherent with main memory by invalidating any entries where the data address matches a subsequent store instruction. Meanwhile, when the constant load executes, its data address is concatenated with the VPT index (the lower bits of the instruction address) and the CVU's content-addressable memory (CAM) is searched for a matching entry. If a matching entry exists, we are guaranteed that the value at that VPT entry is coherent with main memory, since any updates (stores) since the last retrieval would have invalidated the CVU entry. If one does not exist, the constant load is demoted from constant to just predictable status, and the predicted value is now verified by retrieving the actual value from the conventional memory hierarchy.
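The CVU lookup and invalidation steps above can be sketched as follows. This is an illustrative model only: the class and method names are hypothetical, the table is treated as unbounded, and stores are assumed to match loads at exact-address granularity.

```python
class CVU:
    """Fully-associative table of (data address, VPT index) pairs."""

    def __init__(self):
        self.entries = set()

    def insert(self, data_addr: int, vpt_index: int) -> None:
        # Called when a load classified as constant retrieves its value.
        self.entries.add((data_addr, vpt_index))

    def store(self, data_addr: int) -> None:
        # A store invalidates every entry whose data address matches.
        self.entries = {e for e in self.entries if e[0] != data_addr}

    def verify_constant_load(self, data_addr: int, vpt_index: int) -> bool:
        # CAM search on the concatenated (data address, VPT index) key.
        return (data_addr, vpt_index) in self.entries


cvu = CVU()
cvu.insert(0x8000, 5)
hit_before = cvu.verify_constant_load(0x8000, 5)   # no memory access needed
cvu.store(0x8000)                                  # intervening store
hit_after = cvu.verify_constant_load(0x8000, 5)    # load is demoted
print(hit_before, hit_after)                       # True False
```

A miss in this model corresponds to the demotion case in the text: the load falls back to verification through the conventional memory hierarchy.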
We find that an average of 6% (and up to 33% for some benchmarks) of loads from memory can be verified with the CVU, resulting in a proportional reduction of L1 cache bandwidth requirement.

VP Unit Operation

The VP Unit predicts the values during fetch and dispatch, then forwards them speculatively to subsequent dependent instructions via the processor's standard result forwarding mechanism. Dependent instructions are able to issue and execute immediately, but are prevented from completing architecturally and are forced to retain possession of their reservation stations until their inputs are no longer speculative. Speculatively forwarded values are tagged with a bit vector representing the uncommitted register writes they depend on, and these tags are propagated to the results of any subsequent dependent instructions. Meanwhile, uncommitted instructions execute in their respective functional units, and the predicted values are verified either by a comparison against the actual values computed by the instructions, or in the case of constant loads, by an address match in the CVU. Once a prediction is verified, all the dependent instructions can either release their reservation stations and proceed into the completion unit (in the case of a correct prediction), or restart execution with the correct register values (if the prediction was incorrect). Since a large number of instructions can be in flight at the same time, the time between predicting and verifying a value can be dozens of cycles or more, allowing the processor to speculate multiple levels down the dependence chain beyond the write, executing instructions and resolving branches that would otherwise be blocked by data-flow dependences.

Misprediction Penalty

The worst-case penalty for an incorrect value prediction in this scheme, as compared to not predicting the value in question, is one additional cycle of latency along with structural hazards that might not have occurred otherwise. The penalty occurs only when a dependent instruction has already executed speculatively, but is waiting in its reservation station for one of its predicted inputs to be verified. Since the value comparison takes an extra cycle beyond the pipeline result latency, the dependent instruction will reissue and execute with the correct value one cycle later than it would have had there been no prediction. In addition, the earlier incorrect speculative issue may have caused a structural hazard that prevented other useful instructions from dispatching or executing.
In those cases where the dependent instruction has not yet executed (due to structural hazards or other unresolved data dependences), there is no penalty, since the dependent instruction can issue as soon as the actual computed value is available, in parallel with the value comparison that verifies the prediction. In any case, because the CT prevents most incorrect predictions from being made, the misprediction penalty does not significantly affect performance.

There can also be a structural hazard penalty even in the case of a correct prediction. Since speculative values are not verified until one cycle after the actual values become available, speculatively issued dependent instructions end up occupying their reservation stations for one cycle longer than they would have had there been no prediction.
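The latency cases discussed above can be summarized in a small timing sketch. The function name, its parameters, and the cycle numbers are hypothetical and serve only to make the one-cycle worst case concrete; structural hazard effects are not modeled.

```python
def final_issue_cycle(result_cycle, predicted, correct=False, spec_issue_cycle=None):
    """Cycle in which the dependent instruction issues with a valid input.

    result_cycle: cycle in which the producer's computed value becomes available.
    spec_issue_cycle: cycle of a speculative issue, or None if the dependent
    had not yet issued when the computed value arrived.
    """
    if not predicted or spec_issue_cycle is None:
        return result_cycle          # no penalty: issue on the computed value
    if correct:
        return spec_issue_cycle      # the early speculative issue stands
    return result_cycle + 1          # comparison takes a cycle, then reissue


print(final_issue_cycle(10, predicted=False))                                    # 10
print(final_issue_cycle(10, predicted=True, correct=True, spec_issue_cycle=6))   # 6
print(final_issue_cycle(10, predicted=True, correct=False, spec_issue_cycle=6))  # 11
print(final_issue_cycle(10, predicted=True, correct=False))                      # 10
```

Only the third case pays the extra cycle, matching the statement that the penalty applies when the dependent instruction has already executed speculatively on a wrong value.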

[Figure 1-7: bar chart; y-axis: Cycles Per Instruction (CPI); components: RAS Mispred, BTB Mispred, BHT Mispred, Other; benchmarks: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex]

Figure 1-7. Branch Misprediction Penalty. The approximate contribution of RAS, BTB, and BHT mispredictions to overall CPI is shown for single-cycle dispatch (left bar), 2-cycle (middle bar) and 3-cycle (right bar) pipelined dispatch.

Dependence Prediction

Detecting data dependences among multiple instructions in flight is an inherently sequential task that becomes very expensive combinatorially as the number of concurrent in-flight instructions increases. Olukotun et al. argue convincingly against wide-dispatch superscalars because of this very fact [30]. Wide (i.e. greater than four instructions per cycle) dispatch is difficult to implement and has adverse impact on cycle time because all instructions in a dispatch group must be simultaneously cross-checked. Even current microprocessor implementations with dispatch windows of four or less (e.g. Alpha AXP and Pentium Pro) require multiple instruction decode and dependence-checking pipeline stages.

One obvious solution to the problem of the complexity of dependence detection is to pipeline it into two or more stages to minimize impact on cycle time. In Chapter 5, Section 5.4 we propose a pipelined approach to dependence detection that facilitates the implementation of wide-dispatch microarchitectures. However, pipelined dependence checking aggravates the cost of branch mispredictions by delaying resolution of mispredicted branches. In Figure 1-7, we see the IPC impact of pipelining dependence checking on a 16-dispatch machine with an advanced branch predictor and no other structural resource limitations (refer to Section on page 32 and Section on page 24 in Chapter 2 for further details on the benchmarks and machine model). We see that lengthening dispatch to two or three pipeline stages (vs. the baseline case of one) severely increases the number of cycles during which no useful instructions are dispatched and increases CPI (decreases IPC) dramatically, to the point where sustaining even 2-3 IPC becomes very difficult.

We alleviate these problems in two ways: by introducing a scalable, pipelined, and speculative approach to dependence detection called dependence prediction and also by exploiting a modified approach to value prediction called source operand value prediction [28]. Fundamental to these is the notion that maintaining semantic correctness does not require that we rigorously enforce source-to-sink data-flow relationships or that we even exactly detect these relationships before we start executing. Rather, we use dynamically adaptive techniques for predicting values as well as dependences and speculatively issue instructions early, before their dependences are resolved or even known.

As shown in Figure 1-8, dependence prediction is implemented with a dependence prediction table (DPT) with 8K entries, which is direct-mapped and indexed by hashing together the instruction address bits, the gshare branch predictor's branch history register (BHR), and the relative position of the operand (i.e. first, second, or third) being looked up. Each DPT entry contains a numeric value which reflects the relative index of that input operand's location in the rename buffers. This relative index is used to check the value silo to see if the operand is already available. If all of the instruction's predicted input operands are available, the instruction is permitted to dispatch early, after the first dispatch cycle. In the second (or third, in the three-cycle dispatch pipeline) dispatch cycle, exact dependence information becomes available, and the earlier prediction is verified against the actual information.
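The DPT lookup and verification steps can be sketched as below, including the replace-on-mismatch update. Only the 8K-entry, direct-mapped organization and the three hash inputs come from the text; the XOR-based hash, the bit shifts, and the class and method names are assumptions made for illustration.

```python
DPT_ENTRIES = 8192  # 8K entries, direct-mapped


class DependencePredictor:
    def __init__(self):
        # Each entry holds a predicted relative rename-buffer index.
        self.table = [0] * DPT_ENTRIES

    @staticmethod
    def index(pc: int, bhr: int, operand_pos: int) -> int:
        # Assumed hash: XOR of PC bits, branch history, and operand position.
        return ((pc >> 2) ^ bhr ^ (operand_pos << 11)) % DPT_ENTRIES

    def predict(self, pc: int, bhr: int, operand_pos: int) -> int:
        return self.table[self.index(pc, bhr, operand_pos)]

    def verify(self, pc: int, bhr: int, operand_pos: int, actual: int) -> bool:
        """Compare against exact dependence info from the later dispatch cycle."""
        i = self.index(pc, bhr, operand_pos)
        correct = self.table[i] == actual
        if not correct:
            self.table[i] = actual   # replace entry; early dispatch is cancelled
        return correct


dpt = DependencePredictor()
first = dpt.verify(0x400, 0b1011, 0, actual=3)    # cold entry mispredicts
second = dpt.verify(0x400, 0b1011, 0, actual=3)   # same context now matches
print(first, second)                              # False True
```

Hashing the operand position into the index lets the first, second, and third source operands of one instruction occupy separate entries, at the cost of possible interference with other instructions.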
In case of a mismatch, the DPT entry is replaced with the correct relative position, and the early dispatch is cancelled.

Alias Prediction

As described in the previous section, detecting and enforcing dependences between multiple instructions in flight presents a serious scalability bottleneck for wide-issue superscalar processors. To a lesser extent, the detection and enforcement of dependences that occur through aliased memory locations also causes difficulties. In this case, however, the problems are caused by the latency involved in computing and comparing the addresses of all loads with all previous unretired stores. Data shown in Chapter 6 indicates that a significant portion of all loads are aliased to earlier stores (15% on average for integer benchmarks, and 6% on average for floating-point benchmarks; see Figure 6-10 on page 111). In order to resolve these dependences as early as possible, before


More information

EEM 486: Computer Architecture. Lecture 4. Performance

EEM 486: Computer Architecture. Lecture 4. Performance EEM 486: Computer Architecture Lecture 4 Performance EEM 486 Performance Purchasing perspective Given a collection of machines, which has the» Best performance?» Least cost?» Best performance / cost? Design

More information

Hardware/Software Co-Design of a Java Virtual Machine

Hardware/Software Co-Design of a Java Virtual Machine Hardware/Software Co-Design of a Java Virtual Machine Kenneth B. Kent University of Victoria Dept. of Computer Science Victoria, British Columbia, Canada ken@csc.uvic.ca Micaela Serra University of Victoria

More information

Key Components of WAN Optimization Controller Functionality

Key Components of WAN Optimization Controller Functionality Key Components of WAN Optimization Controller Functionality Introduction and Goals One of the key challenges facing IT organizations relative to application and service delivery is ensuring that the applications

More information

Energy-Efficient, High-Performance Heterogeneous Core Design

Energy-Efficient, High-Performance Heterogeneous Core Design Energy-Efficient, High-Performance Heterogeneous Core Design Raj Parihar Core Design Session, MICRO - 2012 Advanced Computer Architecture Lab, UofR, Rochester April 18, 2013 Raj Parihar Energy-Efficient,

More information

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek Instruction Set Architecture or How to talk to computers if you aren t in Star Trek The Instruction Set Architecture Application Compiler Instr. Set Proc. Operating System I/O system Instruction Set Architecture

More information

Historically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately.

Historically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately. Historically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately. Hardware Solution Evolution of Computer Architectures Micro-Scopic View Clock Rate Limits Have Been Reached

More information
