Superscalar Processors: Branch Prediction and Dynamic Scheduling
- Clifton Pitts
- 10 years ago
1 Superscalar Processors

Superscalar: A Sequential Architecture
A superscalar processor is a representative ILP implementation of a sequential architecture.
- For every instruction issued by a superscalar processor, the hardware must check whether its operands interfere with the operands of any other instruction that is (1) already in execution, (2) issued but waiting for completion of interfering instructions that would have executed earlier in a sequential program, or (3) being issued concurrently but would have executed earlier in the sequential execution of the program.
- A superscalar processor issues multiple instructions per cycle.

Superscalar Terminology
Basic:
- Superscalar: able to issue > 1 instruction / cycle
- Superpipelined: deep, but not superscalar, pipeline; e.g., the MIPS R5000 has 8 stages
- Branch prediction: logic to guess whether or not a branch will be taken, and possibly the branch target
Advanced:
- Out-of-order: able to issue instructions out of program order
- Speculation: execute instructions beyond branch points, possibly nullifying them later
- Register renaming: able to dynamically assign physical registers to instructions
- Retire unit: logic to keep track of instructions as they complete
2 Superscalar Execution Example

In-Order Issue, Single Adder, Data Dependence. Assumptions:
- Single FP adder takes 2 cycles
- Single FP multiplier takes 5 cycles
- Can issue an add and a multiply together
- Must issue in order
- Format: <op> in, in, out

v: addt $f2, $f4, $f10
w: mult $f10, $f6, $f10
x: addt $f10, $f8, $f12
y: addt $f4, $f6, $f4
z: addt $f4, $f8, $f10

In order (single adder, data dependence): critical path = 9 cycles.
[Figure: data flow graph — $f2, $f4 feed v (+); v feeds w (*); w and $f8 feed x (+) producing $f12; $f4, $f6 feed y (+); y and $f8 feed z (+) producing $f10.]

Adding Advanced Features: Out-of-Order Issue
- Can start y as soon as the adder is available
- Must hold back z until $f10 is not busy and the adder is available

Adding Advanced Features: With Register Renaming
v: addt $f2, $f4, $f10a
w: mult $f10a, $f6, $f10a
x: addt $f10a, $f8, $f12
y: addt $f4, $f6, $f4
z: addt $f4, $f8, $f14

[Figure: flow path model of superscalars — branch predictor, I-cache, FETCH, DECODE, instruction buffer (instruction flow); integer, floating-point, media, memory pipelines; reorder buffer (ROB), store queue (register data flow); EXECUTE, COMMIT, D-cache (memory data flow).]
3 Superscalar Pipeline Design

In-order Pipelines
[Figure: instruction flow/data flow through fetch, decode, dispatch, execute, complete, retire, with instruction buffer, dispatch buffer, issuing buffer, completion buffer, store buffer. Examples: Intel i486 (IF D1 D2 EX WB) and Intel Pentium (dual U-pipe / V-pipe).]
An in-order pipeline has no WAW and no WAR hazards (almost always true): IF ID RD EX WB.

Out-of-order Pipelining 101
Functional units: Fadd1, Fadd2, Fmult1, Fmult2, Fmult3, LD/ST.
Program order:
  Ia: F1 <- F2 x F3
  Ib: F1 <- F4 + F5
Out-of-order writeback:
  Ib: F1 <- F4 + F5
  Ia: F1 <- F2 x F3
What is the value of F1? WAW!

In-order Issue into Diversified Pipelines
At RD: Fn (RS, RT) — destination register, functional unit, source registers.
The issue stage needs to check:
1. Structural dependence
2. RAW hazard
3. WAW hazard
4. WAR hazard
4 [Figure: contrasting decoding and instruction issue in a scalar and a 4-way superscalar processor — I-cache, instruction buffer, decode/issue; scalar issue vs. superscalar issue; typical FX pipeline layouts F D/I ... and F D I ...]

Superscalar Processors: Tasks
Superscalar issues to be considered:
- Parallel decoding
- Superscalar instruction issue
- Parallel instruction execution
- Preserving sequential consistency of exception processing
- Preserving sequential consistency of execution

Parallel decoding is a more complex task than in scalar processors.
- A high issue rate can lengthen the decoding cycle; therefore use predecoding.
- Partial decoding is performed while instructions are loaded into the instruction cache.

Superscalar instruction issue: a higher issue rate gives rise to higher processor performance, but amplifies the restrictive effects of control and data dependencies on processor performance as well.
- To overcome these problems designers use advanced techniques such as shelving, register renaming, and speculative branch processing.
5 Parallel instruction execution task
Also called preservation of the sequential consistency of instruction execution: while instructions are executed in parallel, they are usually completed out of order with respect to sequential operation.
Preservation of sequential consistency of exception processing task: there are more EUs than in scalar processors, therefore a higher number of instructions in execution — more dependency-check comparisons are needed.

Predecoding
As the I-cache is being loaded, a predecode unit performs a partial decoding and appends a number of decode bits to each instruction. These bits usually indicate:
- the instruction class
- the type of resources required for execution
- whether branch target addresses have been calculated
Predecoding is used in the PowerPC 601, MIPS R8000, and SuperSparc.
[Figure: the principle of predecoding — second-level cache (or memory) feeds the predecode unit (typically 128 bits/cycle in, e.g. 148 bits/cycle out), which fills the I-cache. When instructions are written into the I-cache, the predecode unit appends 4-7 bits to each RISC instruction. In the AMD K5, an x86-compatible CISC processor, the predecode unit appends 5 bits to each byte.]

Superscalar instruction issue specifies how false data dependencies and unresolved control dependencies are coped with during instruction issue.
- The design options are either to avoid them during instruction issue — by using register renaming and speculative branch processing, respectively — or not.
- False data dependencies between register data may be removed by register renaming.
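The predecode idea can be sketched as a toy model: as instructions are written into the I-cache, a few extra bits describing the instruction class are appended so the decoder can steer each instruction quickly. The class encoding below is invented for illustration; real encodings are machine-specific.

```python
# Hypothetical predecode bit assignments (real encodings are machine-specific).
CLASS_BITS = {"add": 0b001, "mult": 0b010, "lw": 0b100, "bge": 0b110, "blt": 0b110}

def predecode(cache_line):
    """Append a class tag to each instruction as the line is filled into the I-cache."""
    return [(inst, CLASS_BITS.get(inst.split()[0], 0b000)) for inst in cache_line]

tagged = predecode(["lw r20, 0(r2)", "bge r20, r21, 1"])
```

The decoder then reads the tag instead of re-examining the opcode, which is how predecoding shortens the decode cycle at the cost of a wider I-cache.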
6 Hardware Features to Support ILP

Instruction issue unit:
- Care must be taken not to issue an instruction if another instruction upon which it depends is not complete
- Requires complex control logic in superscalar processors
- Virtually trivial control logic in VLIW processors

Parallel execution:
- When instructions execute in parallel they will finish out of program order (unequal execution times)
- Specific means are needed to preserve logical consistency: preservation of sequential consistency of execution
- Exceptions during execution: preservation of sequential consistency of exception processing

Speculative execution:
- Little ILP is typically found in basic blocks — straight-line sequences of operations with no intervening control flow
- Multiple basic blocks must be executed in parallel: execution may continue along multiple paths before it is known which path will be executed

Requirements for speculative execution:
- Terminate unnecessary speculative computation once the branch has been resolved
- Undo the effects of the speculatively executed operations that should not have been executed
- Ensure that no exceptions are reported until it is known that the excepting operation should have been executed
- Preserve enough execution state at each speculative branch point to enable execution to resume down the correct path if speculative execution happened to proceed down the wrong one
7 Hardware Features to Support ILP (cont.)

Speculative execution:
- Expensive in hardware
- The alternative is to perform speculative code motion at compile time: move operations from subsequent blocks up past branch operations into preceding blocks
- This requires less demanding hardware:
  - A mechanism to ensure that exceptions caused by speculatively scheduled operations are reported if and only if the flow of control is such that they would have been executed in the non-speculative version of the code
  - Additional registers to hold the speculative execution state

Superscalar Processor Design
- How to deal with instruction flow: dynamic branch prediction
- How to deal with register/data flow: register renaming
Solutions studied:
- Dynamic branch prediction algorithms
- Dynamic scheduling using Tomasulo's method

Summary of discussions
- ILP processors: VLIW/EPIC, superscalar
- Superscalar has hardware logic for extracting parallelism: solutions for stalls etc. must be provided in hardware
- Stalls play an even greater role in ILP processors
- Software solutions, such as code scheduling through code movement, can lead to improved execution times
- More sophisticated techniques are needed; providing hardware support to help the compiler leads to EPIC/VLIW
[Figure: superscalar pipeline design — instruction flow and data flow through fetch, decode, dispatch, execute, complete, retire; instruction buffer, dispatch buffer, issuing buffer, completion buffer, store buffer.]
8 Instruction Fetch Bandwidth Solutions
[Figure: flow path model of superscalars — branch predictor, I-cache, FETCH, DECODE, instruction buffer; integer, floating-point, media, memory pipelines; reorder buffer (ROB), store queue; EXECUTE, COMMIT, D-cache.]
The ability to fetch a number of instructions from the cache is crucial to superscalar performance:
- Use an instruction fetch buffer to prefetch instructions
- Fetch multiple instructions in one cycle to support the s-wide issue of superscalar processors
- Design the instruction cache (I-cache) to support this; we shall discuss solutions when memory design is covered

Instruction Decoding Issues
Primary tasks:
- Identify individual instructions
- Determine instruction types
- Detect inter-instruction dependences
Two important factors:
- Instruction set architecture
- Width of the parallel pipeline
Predecoding:
- Identify instruction classes
- Add more bits to each instruction after fetching
[Figure: the principle of predecoding, as on page 5 — second-level cache (or memory), predecode unit (typically 128 bits/cycle in, e.g. 148 bits/cycle out), I-cache; 4-7 bits appended per RISC instruction; in the AMD K5, 5 bits per byte.]
9 Instruction Flow: Control Flow

Control Dependence and Branch Prediction
The throughput of the early stages places an upper bound on the performance of subsequent stages.
Program control flow is represented by a Control Flow Graph (CFG):
- Nodes represent basic blocks of code: sequences of instructions with no incoming or outgoing branches
- Edges represent transfers of control flow from one block to another

IBM's Experience on Pipelined Processors [Agerwala and Cocke 1987]
Code characteristics (dynamic):
- loads: 25%
- stores: 15%
- ALU/RR: 40%
- branches: 20%
  - 1/3 unconditional (always taken); unconditional branches: 100% schedulable
  - 1/3 conditional taken, 1/3 conditional not taken; conditional branches: 50% schedulable

Control Flow Graph: shows possible paths of control flow through basic blocks (BB1-BB5).

main:  addi r2, r0, A       ; BB1
       addi r3, r0, B
       addi r4, r0, C
       addi r5, r0,
       add  r10, r0, r0
       bge  r10, r5, end
loop:  lw   r20, 0(r2)      ; BB2
       lw   r21, 0(r3)
       bge  r20, r21, 1
       sw   r21, 0(r4)      ; BB3
       b    2
1:     sw   r20, 0(r4)      ; BB4
2:     addi r10, r10, 1     ; BB5
       addi r2, r2, 4
       addi r3, r3, 4
       addi r4, r4, 4
       blt  r10, r5, loop
end:

Control dependence: node X is control dependent on node Y if the computation in Y determines whether X executes.
10 Why Branches: CFG and Branches
Basic blocks and their constituent instructions must be stored in sequential locations in memory.
- In mapping a CFG to linear consecutive memory locations, additional unconditional branches must be added.
Encountering branches (conditional and unconditional) at runtime induces deviations from the implied sequential control flow, and consequent disruptions to the sequential fetching of instructions.
- These disruptions cause stalls in the instruction fetch (IF) stage and reduce overall IF bandwidth.
[Figure: mapping a CFG with blocks A, B, C, D to a linear instruction sequence, showing the conditional branches and the unconditional branch this requires.]

Branch Types and Implementation
Types of branches:
- Conditional or unconditional?
- Subroutine call (aka link): needs to save the PC?
- How is the branch target computed? Static targets (e.g., immediate, PC-relative) vs. dynamic targets (e.g., register indirect)

What's So Bad About Branches?
Performance penalties:
- Use up execution resources
- Fragmentation of I-cache lines
- Disruption of sequential control flow:
  - Need to determine branch direction (conditional branches)
  - Need to determine branch target
- Rob instruction fetch bandwidth and ILP
11 Branch Actions
When branches occur, disruption to IF occurs.
For unconditional branches:
- Subsequent instructions cannot be fetched until the target address is determined.
For conditional branches:
- The machine must wait for resolution of the branch condition
- And if the branch is taken, it must then wait until the target address is computed
The branch instruction is executed by the branch functional unit.
Note: cost in superscalar/ILP processors = width (parallelism) x stall cycles — 3 stall cycles on a 4-wide machine = 12 lost issue slots.

CPU Performance
Recall: CPU time = IC x CPI x Clk
- CPI = ideal CPI + stall cycles/instruction
- Minimizing CPI implies minimizing stall cycles
- Stall cycles come from branch instructions: how do we determine the number of stall cycles (branch penalties)?

Condition Resolution
When a branch occurs, two things are needed:
- The branch target address (BTA) has to be computed
- The branch condition has to be resolved
Addressing modes affect the BTA delay:
- For PC-relative targets, the BTA can be generated during the fetch stage: 1-cycle penalty
- For register indirect targets, the BTA is generated after the decode stage (to access the register): 2-cycle penalty
- For register indirect with offset: 3-cycle penalty
For branch condition resolution, the penalty depends on the method:
- If condition code registers are used, the penalty is 2 cycles (CC register available after decode)
- If the ISA permits comparison of two registers, the condition comes from the ALU output: 3-cycle penalty (GP register value comparison)
The overall penalty is the maximum of the penalties for condition resolution and BTA generation.
[Figure: pipeline stages — fetch, decode, dispatch, issue, branch execute, finish, complete, retire; decode buffer, dispatch buffer, reservation stations, completion buffer, store buffer — annotated with stall=2 (CC register) and stall=3 (GP register value comparison).]
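The CPI relation above can be made concrete with a small helper. This is a minimal sketch: the 20% branch frequency matches the Agerwala and Cocke numbers quoted earlier, while the 10% misprediction rate and 3-cycle penalty are assumed values for illustration.

```python
def effective_cpi(ideal_cpi, branch_frac, mispredict_rate, penalty_cycles):
    """Effective CPI = ideal CPI + branch stall cycles per instruction.

    Stall cycles/instruction = (branches per instruction)
                             * (fraction mispredicted)
                             * (penalty per misprediction).
    """
    return ideal_cpi + branch_frac * mispredict_rate * penalty_cycles

# 20% branches, an assumed 10% misprediction rate, 3-cycle penalty, ideal CPI of 1:
cpi = effective_cpi(1.0, 0.20, 0.10, 3)   # 1.0 + 0.06 = 1.06
```

Note the multiplicative structure: halving either the misprediction rate or the penalty halves the branch contribution to CPI, which is exactly the lever branch prediction pulls.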
12 Target Address Generation
- Register indirect with offset: stall = 3
- Register indirect: stall = 2
- PC-relative: stall = 1
[Figure: pipeline stages — fetch, decode, dispatch, issue, branch execute, finish, complete, retire; decode buffer, dispatch buffer, reservation stations, completion buffer, store buffer — annotated with the stall points above.]

What to do with branches
To maximize sustained instruction fetch bandwidth, the number of stall cycles in the fetch stage must be minimized.
The primary aim of instruction flow techniques (branch prediction) is to minimize stall cycles and/or make use of those cycles to do useful work:
- Predict the branch outcome and do work during the potential stall cycles
- Note that there must be a mechanism to validate the prediction and to safely recover from misprediction

Doing Away with Branches: Riseman and Foster's Study
- 7 benchmark programs on the CDC-3600
- Assume an infinite machine: infinite memory and instruction stack, register file, functional units
- Consider only true dependencies, at the data-flow limit
- If bounded to a single basic block, i.e. no bypassing of branches, the maximum speedup is 1.72
- Suppose one can bypass conditional branches and jumps (i.e., assume the actual branch path is always known, so that branches do not impede instruction execution): the study tabulates the maximum speedup as a function of the number of branches bypassed.

Determining Branch Direction
Problem: cannot fetch subsequent instructions until the branch direction is determined.
Minimize the penalty:
- Move the instruction that computes the branch condition away from the branch (ISA & compiler)
Make use of the penalty:
- Bias for not-taken
- Fill delay slots with useful/safe instructions (ISA & compiler)
- Follow both paths of execution (hardware)
- Predict the branch direction (hardware)
13 Determining Branch Target
Problem: cannot fetch subsequent instructions until the branch target is determined.
Minimize the delay:
- Generate the branch target early in the pipeline
Make use of the delay:
- Bias for not-taken
- Predict the branch target (PC-relative vs. register indirect targets)

Keys to Branch Prediction
Target address generation (target speculation):
- Access a register: PC, GP register, link register
- Perform a calculation: +/- offset, auto incrementing/decrementing
Condition resolution (condition speculation):
- Access a register: condition code register, data register, count register
- Perform a calculation: comparison of data register(s)

History-Based Branch Target Speculation: the Branch Target Buffer
If you have seen this branch instruction before, can you figure out the target address faster?
- Create a history table. How should the history table be organized?
Use a branch target buffer (BTB) to store previous branch target addresses.
The BTB is a small fully associative cache, accessed during instruction fetch using the PC. A BTB entry can have three fields:
- Branch instruction address (BIA)
- Branch target address (BTA)
- History bits
When the PC matches a BIA, a hit occurs in the BTB, implying that the instruction being fetched is a branch.
- The BTA field can be used to fetch the next instruction if this branch is predicted taken.
- Note: the branch instruction is still fetched and executed, for validation/recovery.
14 Branch Target Buffer (BTB)
A small cache-like memory in the instruction fetch stage, looked up with the current PC. Fields: branch instruction address (tag), branch history, branch target (most recent).
It remembers previously executed branches: their addresses, information to aid prediction, and their most recent target addresses.
The instruction fetch stage compares the current PC against those in the BTB to guess npc:
- If matched, a prediction is made; else npc = pc + 4
- If predicted taken, npc = the target address in the BTB; else npc = pc + 4
When the branch is actually resolved, the BTB is updated.

Branch Condition Speculation
Biased for not-taken:
- Does not affect the instruction set architecture
- Not effective in loops
Software prediction:
- Encode an extra bit in the branch instruction: predict not taken = 0, predict taken = 1
- The bit is set by the compiler or user, possibly using profiling
- Static prediction: same behavior every time
Prediction based on branch offsets:
- Positive offset: predict not taken
- Negative offset: predict taken
Prediction based on history: npc = BP(pc)
[Figure: branch instruction speculation — an FA-mux chooses between the speculative target npc and npc(seq.) = PC + 4; the predictor (using a BTB) produces the speculative condition during fetch; the BTB is updated with the target address and history when the branch executes and finishes.]

Branch Prediction Function
- Based on opcode only: accuracies measured for the IBM1-IBM4, DEC, and CDC workloads (the percentage values were lost in transcription)
- Based on branch history: prediction function F(X1, X2, ...), using up to 5 previous branches as history
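The BTB lookup/update protocol described above can be sketched as a tiny simulation. This is a hedged model, not any specific machine's design: it assumes 4-byte instructions (hence pc + 4 for sequential fetch) and folds the history bits into a 2-bit saturating counter.

```python
class BranchTargetBuffer:
    """Toy fully associative BTB: branch PC -> [target, 2-bit history counter]."""

    def __init__(self):
        self.entries = {}                 # branch instruction address -> [BTA, counter]

    def predict(self, pc):
        """Return the guessed next PC: the stored target on a predicted-taken hit,
        otherwise the sequential address pc + 4."""
        e = self.entries.get(pc)
        if e is not None and e[1] >= 2:   # counter states 2,3 = predict taken
            return e[0]
        return pc + 4

    def update(self, pc, taken, target):
        """On branch resolution: allocate/refresh the entry and nudge its counter."""
        _, ctr = self.entries.get(pc, [target, 1])
        ctr = min(ctr + 1, 3) if taken else max(ctr - 1, 0)
        self.entries[pc] = [target, ctr]
```

A miss (or a not-taken prediction) falls through to sequential fetch, which is why a BTB never has to be correct — only the validation/recovery path described above has to be.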
15 History-Based Prediction: Finite State Machine Predictors
Make a prediction based on previous observations: historical information on the direction taken by a branch in previous executions can hint at the direction taken in the future.
- How much history? What prediction? (History means: how many branches, and how much history per branch?) Where is the history stored?
FSMs capture history:
- Easy and fast to design and implement
- Transition from one state to another on each input
FSM branch prediction algorithm:
- State variables encode the direction taken by the last n executions of the branch; each state represents a particular history pattern in terms of taken (T) / not-taken (N)
- Output logic generates the prediction based on the history
- When the predicted branch is finally executed, the actual outcome drives the transition to the next state
- Next-state logic chains the state variables into a shift register

n-bit predictors
Store the outcome of the last n occurrences of the branch; a 1-bit predictor stores the history of the last execution only. The value of the bits is the state of the branch predictor.
- Use the history of the last n executions to predict the next: the value of the n-bit state drives the prediction — this is the prediction algorithm, implemented with logic gates
- How much time does the algorithm have to compute the outcome?
- The larger n is, the more hardware is needed to implement the n-bit predictor
How many branch instructions? This determines the size of (number of entries in) the branch history table / BTB: 1024, 2048, etc.

2-bit predictors
Use 2 history bits to track the outcome of the 2 previous executions of the branch:
- The 2 bits are the state of an FSM; each FSM represents a prediction algorithm
- To support history-based prediction, the BTB includes a history field for each branch: retrieve the target address plus the history bits, and feed the history bits to the logic that generates the next state and the prediction
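The classic 2-bit FSM can be sketched directly. This models the saturating-counter variant (one of the FSMs mentioned on the next page); the initial weakly-not-taken state is an assumption for illustration.

```python
class TwoBitPredictor:
    """2-bit saturating counter FSM.

    States: 0 = strongly not-taken, 1 = weakly not-taken,
            2 = weakly taken,      3 = strongly taken.
    """

    def __init__(self):
        self.state = 1                    # assumed initial state: weakly not-taken

    def predict(self):
        return self.state >= 2            # True = predict taken

    def update(self, taken):
        """Actual outcome drives the state transition (saturating at 0 and 3)."""
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)
```

The saturation gives hysteresis: a single anomalous outcome moves a strong state to the weak state on the same side, so the prediction itself does not flip — which is precisely why 2 bits beat 1 bit on loop branches.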
16 Example Prediction Algorithm
Prediction accuracy approaches its maximum with as few as 2 preceding branch occurrences used as history.
[Table: last two branch outcomes -> next prediction, with resulting accuracies (%) for the IBM1-IBM4, DEC, and CDC workloads; the values were lost in transcription.]

How does the prediction algorithm work?

  i = 100; x = 30; y = 50;
  while (i > 0) do        /* Branch 1 */
  {
      if (x > y) then     /* Branch 2 */
          { then part }   /* no changes to x, y in this code */
      else
          { else part }
      i = i - 1;
  }

There are two branches in this code: B1, B2. How many times is each executed?
Assume given initial history bits for B1 and for B2, and that the same 2-bit predictor is used for all branches.
- Prediction for B1: ?
- Prediction for B2: ?
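The loop example can be worked through by simulation. A sketch under stated assumptions: the loop-closing branch B1 is taken once per iteration (100 times) and then not taken at exit; B2's condition x > y is loop-invariant and false (30 > 50), so B2 resolves the same way every time. The initial counter states below (strongly taken for B1, strongly not-taken for B2) are assumptions standing in for the history bits elided in the slide.

```python
def run_counter(outcomes, state):
    """Count mispredictions of a 2-bit saturating counter (states 0-3, >=2 = taken)."""
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses

# B1: loop branch, taken on each of 100 iterations, not taken once at loop exit.
b1_misses = run_counter([True] * 100 + [False], state=3)
# B2: x > y is always false, so the branch goes the same way on every execution.
b2_misses = run_counter([False] * 100, state=0)
```

Under these assumptions B1 mispredicts only the final exit and B2 never mispredicts — illustrating why history-based predictors do so well on loop-dominated code.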
17 Other Prediction Algorithms
[Figure: state diagrams of the saturation counter and the hysteresis counter, with T/N transitions between taken and not-taken states.]
Combining prediction accuracy with the BTB hit rate (86.5% for 128 sets of 4 entries each), branch prediction can provide a net prediction accuracy of approximately 80%. This implies a 5-20% performance enhancement.

IBM RS/6000 Study [Nair, 1992]
Five different branch types:
- b: unconditional branch
- bl: branch and link (subroutine calls)
- bc: conditional branch
- bcr: conditional branch using the link register (subroutine returns)
- bcc: conditional branch using the count register (system calls)
A separate branch functional unit overlaps branch instructions with other instructions.
Two causes of branch stalls:
- Unresolved conditions
- Branches downstream too close to unresolved branches

Number of Counter Bits Needed: prediction accuracy (overall CPI overhead)

Benchmark   3-bit          2-bit          1-bit          0-bit
spice2g6    97.0 (0.009)   97.0 (0.009)   96.2 (0.013)   76.6 (0.031)
doduc       94.2 (0.003)   94.3 (0.003)   90.2 (0.004)   69.2 (0.022)
gcc         89.7 (0.025)   89.1 (0.026)   86.0 (0.033)   50.0 (0.128)
espresso    89.5 (0.045)   89.1 (0.047)   87.2 (0.054)   58.5 (0.176)
li          88.3 (0.042)   86.8 (0.048)   82.5 (0.063)   62.4 (0.142)
eqntott     89.3 (0.028)   87.2 (0.033)   82.9 (0.046)   78.4 (0.049)

Branch history table size: a direct-mapped array of 2^k entries. Programs like gcc can have over 7000 conditional branches; on collisions, multiple branches share the same predictor. The variation of branch penalty with branch history table size levels out at about 1024 entries.

[Figure: accuracy of different schemes — frequency of mispredictions (0-18%) across nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, li, comparing 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024 entries of a (2,2) correlating predictor.]
18 Advanced Branch Prediction: Multi-level Branch Prediction
So far, the prediction for each static branch instruction:
- is based solely on its own past behavior, not the behavior of other neighboring static branch instructions
- does not take into account the dynamic context within which the branch is executed — e.g., it does not use any information on the particular control-flow path taken in arriving at that branch
- uses the same algorithm for prediction regardless of dynamic context
Experimental observations reveal that the behavior of some branches is strongly correlated with that of other branches.
- More accurate prediction can be achieved by taking into account the branch history of other related branches, and adapting the algorithm to the dynamic branching context.
We will cover some advanced branch prediction schemes later, after dataflow and scheduling.

BTB for Superscalar Fetch
[Figure: PC (+16) feeds the branch history table and branch target address cache, which steer the I-cache; fetched instructions flow through the decode buffer, dispatch buffer, and reservation stations (SFX, SFX, CFX, FPU, LS, BR) to the completion buffer, with feedback from branch execution.]
- BTB input: the current cache line address
- BTB output: the next cache line address, and how many words of the current line are useful

Branch Misprediction Recovery
Branch speculation involves predicting the direction of a branch and then proceeding to fetch along that path.
- Fetching on that path may encounter more branch instructions
- Validation and recovery mechanisms must be provided
To identify speculated instructions, tagging is used:
- A tagged instruction is a speculative instruction
- There is a tag value for each basic block (branch)
Validation occurs when the branch is executed and the outcome is known:
- Correct prediction: deallocate the speculation tag
- Incorrect prediction: terminate the incorrect path and fetch from the correct path

Control Flow Speculation (tag1, tag2, tag3)
Leading speculation:
- Tag speculative instructions
- Advance the branch and following instructions
- Buffer the addresses of speculated branch instructions
19 Mis-speculation Recovery (tag1; tag2, tag2; tag3, tag3, tag3)
Eliminate the incorrect path:
- Must ensure that the mis-speculated instructions produce no side effects
- Use branch tag(s) to deallocate completion buffer entries occupied by speculative instructions (now determined to be mis-speculated)
- Invalidate all instructions in the decode and dispatch buffers, as well as those in the reservation stations
- How expensive is a misprediction?
Start the new correct path:
- Must have remembered the alternate (non-predicted) path
- Update the PC with the computed branch target (if the branch was predicted not-taken)
- Update the PC with the sequential instruction address (if it was predicted taken)
- Can begin speculation once again upon encountering a new branch
- How soon can you restart?

Trailing Confirmation
- When the branch is resolved, remove/deallocate the speculation tag
- Permit completion of the branch and following instructions

Impediments to Parallel/Wide Fetching
Average basic block size:
- integer code: 4-6 instructions
- floating-point code: 6-10 instructions
Branch prediction mechanisms:
- must make multiple branch predictions per cycle
- potentially multiple predicted-taken branches
Conventional I-cache organization (discussed later):
- must fetch from multiple predicted-taken targets per cycle
- must align and collapse multiple fetch groups per cycle
Trace caching!
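The tag-based squash mechanism can be sketched in a few lines. This is a simplified model: in-flight instructions are dicts carrying the tag of the branch they were fetched under (None for non-speculative work), and a misprediction invalidates every instruction carrying the bad branch's tag or any tag allocated after it.

```python
def squash(in_flight, bad_tags):
    """Drop all speculative instructions fetched under a mispredicted branch.

    in_flight: list of {"pc": ..., "tag": ...} entries (tag None = non-speculative).
    bad_tags:  the mispredicted branch's tag plus all younger tags.
    """
    return [ins for ins in in_flight if ins["tag"] not in bad_tags]

window = [
    {"pc": 100, "tag": None},    # committed-path work: survives
    {"pc": 104, "tag": "t1"},    # speculated past branch t1
    {"pc": 108, "tag": "t2"},    # speculated past the nested branch t2
]
survivors = squash(window, {"t1", "t2"})
```

On a correct prediction the same tag is simply deallocated (the trailing-confirmation step above) and the instructions become non-speculative; only the misprediction path pays the cost of flushing.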
20 Recap
CPU time = IC * CPI * Clk
- CPI = ideal CPI + stall cycles/instruction
- Stall cycles are due to (1) control hazards and (2) data hazards
What did branch prediction do? It tries to reduce the number of stall cycles from control hazards.
What about stall cycles from data hazards? That is next.

Next: Register Dataflow and Dynamic Scheduling
Branch prediction provides a solution to the control flow problem and increases instruction flow bandwidth; stalls due to control flow changes can decrease performance.
The next step is flow in the execute stage: register data flow.
- Parallel execution of instructions, keeping dependencies in mind
- Remove false dependencies, honor true dependencies; an infinite register set can remove false dependencies
- Go back and look at the nature of true dependencies using the data flow diagram of a computation
[Figure: superscalar pipeline design — instruction flow and data flow through fetch, decode, dispatch, execute, complete, retire; instruction buffer, dispatch buffer, issuing buffer, completion buffer, store buffer.]
21 HW Schemes: Instruction Parallelism
Why in hardware, at run time?
- Works when real dependences can't be known at compile time
- Simpler compiler
- Code for one machine runs well on another
Key idea: allow instructions behind a stall to proceed.

  DIVD F0, F2, F4
  ADDD F10, F0, F8
  SUBD F12, F8, F14

- Enables out-of-order execution => out-of-order completion
- The ID stage checked for both structural and data hazards; scoreboarding dates to the CDC 6600 in 1963

Superscalar Execution Example (revisited)
Assumptions as before: single FP adder (2 cycles), single FP multiplier (5 cycles), can issue an add and a multiply together, must issue in order; format <op> in, in, out.

v: addt $f2, $f4, $f10
w: mult $f10, $f6, $f10
x: addt $f10, $f8, $f12
y: addt $f4, $f6, $f4
z: addt $f4, $f8, $f10

In order (single adder, data dependence): critical path = 9 cycles.
Out-of-order issue: can start y as soon as the adder is available; must hold back z until $f10 is not busy and the adder is available.
With register renaming:

v: addt $f2, $f4, $f10a
w: mult $f10a, $f6, $f10a
x: addt $f10a, $f8, $f12
y: addt $f4, $f6, $f4
z: addt $f4, $f8, $f14

Where do false dependencies come from — i.e., who messed up?
ALU ops are Rd <- Fi(Rj, Rk):
- If the functional unit is not available: structural hazard
- If a source operand is not available: data hazard due to a true dependency
- If the destination register Rd is not available: hazard due to anti- and output dependencies, i.e., false dependencies
Recycling/reuse of the destination register leads to dependencies:
- Static recycling is due to register allocation in the compilation process: code generation targets an infinite set of symbolic registers, and register allocation maps this onto the finite set of architected registers — i.e., register recycling/reuse
- A dynamic form of register recycling occurs during the execution of loops
22 Register Renaming
Dynamically assign different names to the multiple definitions of the same register, thereby removing false dependencies.
Use a separate rename register file (RRF) in addition to the architected register file (ARF):
- One way: duplicate the ARF and use the RRF as a shadow version — not an efficient way to use the RRF
Where to place the RRF:
- Implement it as a stand-alone structure similar to the ARF
- Incorporate the RRF as part of the reorder buffer — this can be inefficient, since a rename entry is needed for every instruction in flight even though not every instruction defines a register
- Use a busy bit to indicate whether renaming has occurred for an ARF entry and, if so, a map to indicate the name
What renaming involves:
- Source read at the decode/dispatch stage: access the source register, check whether the operand is ready, and determine which actual register should be accessed if renamed
- Destination allocate at the decode/dispatch stage: the destination register indexes into the ARF; the register now has a pending write; set the mapping to specify which RRF entry is used (the renaming map)
- Register update: update the RRF when the instruction completes, then copy from the RRF into the ARF
Do we really need separate ARF and RRF? They can be pooled together, which saves data-transfer interconnect but makes context switching more complex.

Back to the register dataflow problem
Register renaming can eliminate false dependencies; false dependencies can introduce stalls into the superscalar.
Register dataflow problem: issue instructions in parallel if there are no true dependencies.
How much parallelism is there in the code?
- The data flow limit to program execution is the critical path in the program
- The data flow execution model stipulates that every instruction begins execution in the cycle immediately following the one in which all its operands become ready
- All register data flow techniques attempt to approach this limit
What are the obstacles to this execution model?
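The renaming map can be sketched as a tiny software model: every definition of an architected register gets a fresh physical name, and sources read whatever name is currently mapped. This is a simplified illustration (an unbounded physical register pool, no reclamation), not the RRF/ARF mechanism in full. The program below is the v-z example from the slides, written as (destination, sources) pairs.

```python
def rename(instructions):
    """Rename (dst, srcs) pairs: each write gets a fresh physical register,
    which removes WAR and WAW (false) dependences while preserving RAW."""
    mapping = {}                               # architected name -> current physical name
    next_id = 0
    renamed = []
    for dst, srcs in instructions:
        new_srcs = [mapping.get(s, s) for s in srcs]   # regs never written keep their name
        mapping[dst] = f"p{next_id}"                   # fresh name for every definition
        next_id += 1
        renamed.append((mapping[dst], new_srcs))
    return renamed

prog = [("f10", ["f2", "f4"]),    # v
        ("f10", ["f10", "f6"]),   # w
        ("f12", ["f10", "f8"]),   # x
        ("f4",  ["f4", "f6"]),    # y
        ("f10", ["f4", "f8"])]    # z
renamed = rename(prog)
```

After renaming, v, w, and z all write different physical registers (the slide's $f10a / $f14 effect), so z no longer has to wait for $f10 to be free — only its true dependence on y remains.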
What we will study
Today: the basic dynamic scheduling method
- Register renaming
- Tomasulo's method, v1.0
Next week: modify the basic scheduler to handle speculation, out-of-order completion, etc.
23 Register Dataflow: Key Concepts
Simulating the data flow graph eliminates false dependencies and allows maximum parallelism, subject to resource constraints.
How can we have infinite registers?
- Remove references to registers and replace them with the data flow graph information
- Note that we should not actually construct the data flow graph
Work with non-speculative instructions to provide a solution; then add branch prediction speculation support to modify the solution and get a speculative out-of-order execution unit.

Example (latencies: add = 2, mult = 3):

A: R4 <- R0 + R8
B: R2 <- R0 * R4
C: R4 <- R4 + R8
D: R8 <- R4 * R2

What is the (true) data flow? How does execution work?

HW Schemes: Instruction Parallelism
Out-of-order execution divides the ID stage:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until there are no data hazards, then read operands
Two major schemes:
- Scoreboarding
- Reservation stations (Tomasulo's algorithm)
They allow an instruction to execute whenever 1 and 2 hold, without waiting for prior instructions: in-order issue, out-of-order execution, out-of-order commit (also called completion).

Advantages of Dynamic Scheduling
- Handles cases where dependences are unknown at compile time (e.g., because they may involve a memory reference)
- Simplifies the compiler
- Allows code compiled for one pipeline to run efficiently on a different pipeline
- Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling
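The A-D example's true data flow can be computed mechanically. A minimal sketch under data-flow-limit assumptions: renaming has removed all false dependences, functional units are unlimited, and each op starts the moment its latest source value is ready.

```python
def dataflow_schedule(prog, latency):
    """Earliest finish time of each op at the data-flow limit (true RAW deps only).

    prog: list of (dst, srcs, op) in program order; latency: op -> cycles.
    """
    ready = {}                                     # register -> time its latest value is ready
    finish = []
    for dst, srcs, op in prog:
        start = max((ready.get(s, 0) for s in srcs), default=0)
        done = start + latency[op]
        ready[dst] = done                          # renaming: this write shadows older ones
        finish.append(done)
    return finish

prog = [("R4", ["R0", "R8"], "add"),   # A
        ("R2", ["R0", "R4"], "mul"),   # B: reads A's R4
        ("R4", ["R4", "R8"], "add"),   # C: reads A's R4, redefines R4
        ("R8", ["R4", "R2"], "mul")]   # D: reads C's R4 and B's R2
times = dataflow_schedule(prog, {"add": 2, "mul": 3})
```

A finishes at cycle 2; B and C both start at 2 (finishing at 5 and 4); D starts at 5 and finishes at 8 — so the critical path A->B->D is 8 cycles, and C's redefinition of R4 (a WAW with A, WAR with B) costs nothing once renamed.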
A Dynamic Algorithm: Tomasulo's Algorithm
For the IBM 360/91 (before caches!)
Goal: high performance without special compilers
The small number of floating-point registers (4 in the 360) prevented interesting compiler scheduling of operations
- This led Tomasulo to try to figure out how to get more effective registers - renaming in hardware!
Why study a 1966 computer? The descendants of this have flourished!
- Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, ...

Tomasulo Method: Approach
Tracks when operands for instructions are available
- Minimizes RAW hazards
Register renaming
- Minimizes WAR and WAW hazards
Many variations are in use today, but the key remains
- Tracking instruction dependencies to allow execution as soon as operands are available, and renaming registers to avoid WAR and WAW
- Basically, it tries to follow data-flow execution

Diversified Pipelined - In-order Issue, Out-of-order Complete
Multiple functional units (FUs)
- Floating-point add
- Floating-point multiply/divide
Three register files (pseudo reg-reg machine in the FP unit)
- (4) floating-point registers (FLR)
- (6) floating-point buffers (FLB)
- (3) store data buffers (SDB)
Out-of-order instruction execution:
- After decode, the instruction unit passes all floating-point instructions (in order) to the floating-point operation stack (FLOS).
- In the floating-point unit, instructions are then further decoded and issued from the FLOS to the two FUs
Variable operation latencies (not pipelined):
- Floating-point add: 2 cycles
- Floating-point multiply: 3 cycles
- Floating-point divide: 12 cycles

Tomasulo Algorithm
Control & buffers are distributed with the functional units (FUs)
- FU buffers are called "reservation stations"; they hold pending operands
If an FU is busy, then instead of stalling, issue to a reservation station, which is a set of buffers for the FU - i.e., a virtual functional unit
Registers in instructions are replaced by values or by pointers to reservation stations (RS); this is called "register renaming"
- Avoids WAR, WAW hazards
- More reservation stations than registers, so can do optimizations compilers can't
Results go to the FUs from the RSs, not through registers, over a Common Data Bus that broadcasts results to all FUs
- The CDB connects the outputs of the FUs to the reservation stations and the store buffers
Loads and stores are treated as FUs with RSs as well
Integer instructions can go past branches, allowing ops beyond the basic block
Reservation Station
Buffers where instructions can wait for RAW hazard resolution and execution
Associate more than one set of buffering registers (control, source, sink) with each FU - virtual FUs
- Add unit: three reservation stations
- Multiply/divide unit: two reservation stations
Pending (not yet executing) instructions can have either value operands or pseudo operands (aka tags)

Reservation Station Components
Op: operation to perform in the unit (e.g., + or -)
Vj, Vk: values of the source operands
- Store buffers have a V field: the result to be stored
Qj, Qk: reservation stations producing the source registers (value to be written)
- Note: Qj,Qk = 0 => ready
- Store buffers only have Qi, for the RS producing the result
Busy: indicates the reservation station or FU is busy
Register result status: indicates which functional unit will write each register, if one exists. Blank when there are no pending instructions that will write that register.

Rename Tags
Register names are normally bound to FLR registers
When an FLR register is stale, the register name is bound to the pending-update instruction
Tags are names that refer to these pending-update instructions
In Tomasulo, a tag is statically bound to the buffer where a pending-update instruction waits
Instructions can be dispatched to RSs with either value operands or just tags
- Tag operand = unfulfilled RAW dependence
- The instruction in the RS corresponding to the tag will eventually produce the actual value

Common Data Bus (CDB)
The CDB is driven by all units that can update the FLR
- When an instruction finishes, it broadcasts both its tag and its result on the CDB
- Facilitates forwarding results directly from producer to consumer
- Why don't we need the destination register name?
Sources of the CDB:
- Floating-point buffers (FLB)
- The two FUs (add unit and multiply/divide unit)
The CDB is monitored by every unit that was left holding a tag instead of a value operand
- Listens for tag broadcasts on the CDB
- If a tag matches, grab the value
Destinations of the CDB:
- Reservation stations
- Store data buffers (SDB) - hold data to be written into memory by store operations
- Floating-point registers (FLR)
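The tag-match capture off the CDB can be sketched in a few lines. This is an illustrative software model, not the hardware; the Vj/Vk/Qj/Qk field names follow the reservation-station slide, while the class and function names are assumptions.

```python
# Sketch of CDB "snarfing": every waiting reservation-station operand
# that holds a tag instead of a value watches the bus; on a matching
# (tag, value) broadcast it captures the data and clears the tag.

class RSEntry:
    def __init__(self, op, vj=None, qj=None, vk=None, qk=None):
        self.op = op
        self.vj, self.qj = vj, qj   # left operand: value or producer tag
        self.vk, self.qk = vk, qk   # right operand: value or producer tag

    def ready(self):
        return self.qj is None and self.qk is None   # Qj,Qk empty => ready

def cdb_broadcast(stations, tag, value):
    """Producer drives (tag, value); every consumer compares and captures.
    No destination register name is needed - only the producer's tag."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
```

This also answers the slide's question: one broadcast can release several waiting instructions at once, because every station compares the tag in parallel.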
Tomasulo Organization
[Figure: the FP Op Queue and Load Buffers (Load1-Load6) feed the reservation stations (Add1-Add3 for the FP adders, Mult1-Mult2 for the FP multipliers) and the FP registers; Store Buffers go to memory; the Common Data Bus (CDB) carries every result back to the reservation stations, FP registers, and store buffers.]

Superscalar Execution Check List
INSTRUCTION PROCESSING CONSTRAINTS
- Resource contention (structural dependences)
- Code dependences
  - Control dependences
  - Data dependences (RAW) - true dependences
  - Storage conflicts
    - (WAR) anti-dependences
    - (WAW) output dependences

Three Stages of Tomasulo Algorithm
1. Issue - get instruction from the FP Op Queue
   If a reservation station is free (no structural hazard), control issues the instruction & sends operands (renames registers).
2. Execute - operate on operands (EX)
   When both operands are ready, then execute; wake up the instruction when all ready bits are set; if not ready, watch the Common Data Bus for the result
3. Write result - finish execution (WB)
   Write on the Common Data Bus to all awaiting units; mark the reservation station available
Normal data bus: data + destination ("go to" bus)
Common data bus: data + source ("come from" bus)
- 64 bits of data + 4 bits of functional-unit source address
- Write if it matches the expected functional unit (the one producing the result)
- Does the broadcast
Example speed: 3 clocks for FP +,-; 10 for *; 40 clocks for /

Dependence Resolution
Structural dependence: virtual FUs
- Can dispatch more instructions than the number of FUs
True dependence: tags + CDB
- If an operand is available in the FLR, it is copied to the RS
- If an operand is not available, then a tag is copied to the RS instead. This tag identifies the source (RS/instruction) of the pending write
- RAW does not block subsequent instructions or the FU
Anti-dependence: operand copying
- If an operand is available in the FLR, it is copied to the RS with the issuing instruction
Output dependence: register renaming + result forwarding
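The three stages above can be walked end-to-end for two dependent FP ops. This is a toy, purely illustrative model (the function names, the tag strings, and the register values are all assumptions for the sketch), showing issue-time renaming, execute blocking on a tag, and the write-result broadcast.

```python
# Toy of the three stages for: Mult1: F0 <- F2*F4 ; Add1: F6 <- F0+F8

regs = {'F2': 2.0, 'F4': 3.0, 'F8': 1.0, 'F0': None, 'F6': None}
result_status = {}                # register -> tag of the producing station

def issue(tag, op, dst, src1, src2):
    """Stage 1: rename sources to values or tags, claim the destination."""
    def operand(r):
        return ('Q', result_status[r]) if r in result_status else ('V', regs[r])
    rs = {'op': op, 's1': operand(src1), 's2': operand(src2), 'dst': dst}
    result_status[dst] = tag
    return rs

def execute(rs):
    """Stage 2: fire only when both operands are values, not tags."""
    assert rs['s1'][0] == 'V' and rs['s2'][0] == 'V', "still waiting on CDB"
    f = {'+': lambda a, b: a + b, '*': lambda a, b: a * b}[rs['op']]
    return f(rs['s1'][1], rs['s2'][1])

def write_result(tag, value, waiting):
    """Stage 3: broadcast tag+value; waiting stations and registers grab it."""
    for rs in waiting:
        for k in ('s1', 's2'):
            if rs[k] == ('Q', tag):
                rs[k] = ('V', value)
    for r, t in list(result_status.items()):
        if t == tag:
            regs[r] = value
            del result_status[r]

m = issue('Mult1', '*', 'F0', 'F2', 'F4')
a = issue('Add1', '+', 'F6', 'F0', 'F8')    # F0 not ready: renamed to Mult1
write_result('Mult1', execute(m), [a])      # Add1 captures F0 off the CDB
write_result('Add1', execute(a), [])
```

After the two broadcasts, F0 holds 6.0 and F6 holds 7.0; the dependent add never read F0 from the register file at all.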
Tomasulo Example (Cycle 0)
Instruction stream:            Issue  ExecComp  WriteResult
  LD    F6  34+ R2
  LD    F2  45+ R3
  MULTD F0  F2  F4
  SUBD  F8  F6  F2
  DIVD  F10 F0  F6
  ADDD  F6  F8  F2
3 load buffers (Load1-Load3), all free; 3 FP adder RSs (Add1-Add3) and 2 FP multiplier RSs (Mult1-Mult2), all free. The Time field is an FU countdown of remaining execution cycles.
Register result status (F0 F2 F4 F6 F8 F10 F12 ... F30): all empty
Clock cycle counter: 0

Tomasulo Example (Cycle 1)
  LD F6 34+ R2 issued (cycle 1); Load1 busy, address 34+R2
Register result status: F6 <- Load1

Tomasulo Example (Cycle 2)
  LD F2 45+ R3 issued (cycle 2); Load1 busy (34+R2), Load2 busy (45+R3)
Register result status: F2 <- Load2, F6 <- Load1
Note: can have multiple loads outstanding

Tomasulo Example (Cycle 3)
Instruction status:            Issue  ExecComp  WriteResult
  LD    F6  34+ R2               1      3
  LD    F2  45+ R3               2
  MULTD F0  F2  F4               3
Reservation stations:
  Mult1 busy: MULTD, Vk = R(F4), Qj = Load2
Register result status: F0 <- Mult1, F2 <- Load2, F6 <- Load1
Note: register names are removed ("renamed") in the reservation stations; MULTD issued
Load1 completing; what is waiting for Load1?
Tomasulo Example (Cycle 4)
Instruction status:            Issue  ExecComp  WriteResult
  LD    F6  34+ R2               1      3          4
  LD    F2  45+ R3               2      4
  MULTD F0  F2  F4               3
  SUBD  F8  F6  F2               4
Load1 free; Load2 busy (45+R3).
Reservation stations:
  Add1  busy: SUBD,  Vj = M(A1), Qk = Load2
  Mult1 busy: MULTD, Vk = R(F4), Qj = Load2
Register result status: F0 <- Mult1, F2 <- Load2, F6 = M(A1), F8 <- Add1
Load2 completing; what is waiting for Load2?

Tomasulo Example (Cycle 5)
Instruction status:            Issue  ExecComp  WriteResult
  LD    F6  34+ R2               1      3          4
  LD    F2  45+ R3               2      4          5
  MULTD F0  F2  F4               3
  SUBD  F8  F6  F2               4
  DIVD  F10 F0  F6               5
Reservation stations:
  Add1  busy (Time  2): SUBD,  Vj = M(A1), Vk = M(A2)
  Mult1 busy (Time 10): MULTD, Vj = M(A2), Vk = R(F4)
  Mult2 busy:           DIVD,  Vk = M(A1), Qj = Mult1
Register result status: F0 <- Mult1, F2 = M(A2), F6 = M(A1), F8 <- Add1, F10 <- Mult2
Timer starts counting down for Add1, Mult1

Tomasulo Example (Cycle 6)
Instruction status: as above, plus ADDD F6 F8 F2 issued (cycle 6)
Reservation stations:
  Add1  busy (Time 1): SUBD,  Vj = M(A1), Vk = M(A2)
  Add2  busy:          ADDD,  Vk = M(A2), Qj = Add1
  Mult1 busy (Time 9): MULTD, Vj = M(A2), Vk = R(F4)
  Mult2 busy:          DIVD,  Vk = M(A1), Qj = Mult1
Register result status: F0 <- Mult1, F2 = M(A2), F6 <- Add2, F8 <- Add1, F10 <- Mult2
Issue ADDD here despite the name dependency on F6?

Tomasulo Example (Cycle 7)
Instruction status: as above, plus SUBD exec complete (cycle 7)
Reservation stations:
  Add1  busy (Time 0): SUBD,  Vj = M(A1), Vk = M(A2)
  Add2  busy:          ADDD,  Vk = M(A2), Qj = Add1
  Mult1 busy (Time 8): MULTD, Vj = M(A2), Vk = R(F4)
  Mult2 busy:          DIVD,  Vk = M(A1), Qj = Mult1
Register result status: F0 <- Mult1, F2 = M(A2), F6 <- Add2, F8 <- Add1, F10 <- Mult2
Add1 (SUBD) completing; what is waiting for it?
Tomasulo Example (Cycle 8)
Instruction status: as above, plus SUBD write result (cycle 8)
Reservation stations:
  Add1  free
  Add2  busy (Time 2): ADDD,  Vj = (M-M), Vk = M(A2)
  Mult1 busy (Time 7): MULTD, Vj = M(A2), Vk = R(F4)
  Mult2 busy:          DIVD,  Vk = M(A1), Qj = Mult1
Register result status: F0 <- Mult1, F2 = M(A2), F6 <- Add2, F8 = (M-M), F10 <- Mult2

Tomasulo Drawbacks
Complexity
- Delays of the 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CA:AQA 2/e, but not in silicon!
Many associative stores (CDB) at high speed
Performance limited by the Common Data Bus
- Each CDB must go to multiple functional units => high capacitance, high wiring density
- The number of functional units that can complete per cycle is limited to one!
  Multiple CDBs => more FU logic for parallel associative stores
Non-precise interrupts!

Why can Tomasulo overlap iterations of loops?
Register renaming
- Multiple iterations use different physical destinations for registers (dynamic loop unrolling)
Reservation stations
- Permit instruction issue to advance past integer control flow operations
- Also buffer old values of registers - totally avoiding the WAR stall that we saw in the scoreboard
Another perspective: Tomasulo builds the data-flow dependency graph on the fly

Tomasulo's scheme offers 2 major advantages
(1) The distribution of the hazard detection logic
- Distributed reservation stations and the CDB
- If multiple instructions are waiting on a single result, and each has its other operand, then the instructions can be released simultaneously by the broadcast on the CDB
- If a centralized register file were used, the units would have to read their results from the registers when the register buses are available
(2) The elimination of stalls for WAW and WAR hazards
Relationship between precise interrupts and speculation
Speculation is a form of guessing
Important for branch prediction:
- Need to take our best shot at predicting branch direction
If we speculate and are wrong, we need to back up and restart execution at the point where we predicted incorrectly:
- This is exactly the same as precise exceptions!
Technique for both precise interrupts/exceptions and speculation: in-order completion, or commit

Multiple Issue ILP Processors
In a statically scheduled superscalar, instructions issue in order, and all pipeline hazards are checked at issue time
- An instruction causing a hazard will force subsequent instructions to be stalled
In a statically scheduled VLIW, the compiler generates multiple-issue packets of instructions
During instruction fetch, the pipeline receives a number of instructions from the IF stage - an issue packet
- Examine each instruction in the packet: if no hazard, then issue, else wait
- The issue unit examines all instructions in the packet
Complexity implies further splitting of the issue stage

Multiple Issue
To issue multiple instructions per clock, the key is assigning reservation stations and updating the pipeline control tables
Two approaches:
- Run this step in half a clock cycle, so two instructions can be processed in one clock cycle
- Build the logic needed to handle two instructions at once

Dynamic Scheduling - Dynamic Execution Core
Current superscalar processors provide out-of-order execution
- The architecture is an out-of-order core sandwiched between an in-order front-end and an in-order back-end
- In-order front-end: fetches and dispatches instructions in program order
- In-order back-end: completes and retires instructions in program order
- Dynamic execution core: a refinement of the Tomasulo method
Has three parts:
- Instruction dispatch: rename registers, allocate reservation stations
- Instruction execution: execute, forward results
- Instruction completion
Extending Tomasulo
Have to allow for speculative instructions
- Separate the bypassing of results among instructions from the actual completion of instructions
Tag instructions as speculative until they are validated, then commit the instruction
Key idea: allow instructions to execute out of order but commit in order
- Requires a reorder buffer (ROB) to hold and pass results among speculated instructions
- The ROB is a source of operands
- The register file is not updated until the instruction commits
- Therefore the ROB supplies operands in the interval between completion of an instruction and commit of that instruction

Reorder Buffer (ROB)
The reorder buffer contains all instructions in flight, i.e., all instructions dispatched but not completed:
- Instructions waiting in reservation stations
- Instructions executing in functional units
- Instructions finished executing but waiting to be completed in program order
The status of each instruction can be tracked using several bits in each ROB entry
- An additional bit indicates whether the instruction is speculative
- Only finished, non-speculative instructions can be committed
- An instruction marked invalid is not architecturally completed when exiting the reorder buffer

HW support
Need a HW buffer for the results of uncommitted instructions: the reorder buffer
- 3 fields: instruction, destination, value
- Use the reorder buffer number instead of the reservation station when execution completes
- Supplies operands between execution complete & commit
- (The reorder buffer can be an operand source => more registers, like the RSs)
- Instructions commit
- Once an instruction commits, its result is put into the register
- As a result, it is easy to undo speculated instructions on mispredicted branches or exceptions
[Figure: the FP Op Queue feeds the reservation stations and FP adders; the reorder buffer sits between the FP registers and the functional units, with Dest Reg, Result, Exceptions?, Valid, and Program Counter fields per entry.]

What are the hardware complexities of the reorder buffer (ROB)?
[Figure: as above, with a compare network on the reorder buffer for associative operand lookup.]
How do you find the latest version of a register?
- (As specified by the Smith paper) need an associative comparison network
- Could use a future file, or just use the register result status buffer to track which specific reorder buffer entry has received the value
Need as many ports on the ROB as on the register file
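The execute-out-of-order / commit-in-order discipline above can be sketched with a simple FIFO. This is an illustrative software model only (class and field names are assumptions); it omits the speculative/valid bits and the associative operand lookup just discussed.

```python
# Sketch of a reorder buffer: instructions enter at the tail at dispatch
# and leave from the head at commit, so the architected register file is
# updated strictly in program order even when results arrive out of order.

class ReorderBuffer:
    def __init__(self, size):
        self.entries = []          # FIFO of {dest, value, done}
        self.size = size

    def dispatch(self, dest):
        assert len(self.entries) < self.size, "ROB full: stall dispatch"
        entry = {'dest': dest, 'value': None, 'done': False}
        self.entries.append(entry)
        return entry               # acts as the tag for the destination

    def finish(self, entry, value):
        entry['value'], entry['done'] = value, True   # may be out of order

    def commit(self, regfile):
        """Retire finished instructions from the head only, in order."""
        while self.entries and self.entries[0]['done']:
            e = self.entries.pop(0)
            regfile[e['dest']] = e['value']
```

A younger instruction that finishes early simply waits in the buffer; squashing a mispredicted path amounts to discarding entries behind the branch before they reach the head.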
Four Steps of Speculative Tomasulo Algorithm
1. Issue - get instruction from the FP Op Queue
   If a reservation station and a reorder buffer slot are free, issue the instruction & send the operands & the reorder buffer number for the destination (this stage is sometimes called "dispatch")
2. Execution - operate on operands (EX)
   When both operands are ready, then execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute; checks RAW (sometimes called "issue")
3. Write result - finish execution (WB)
   Write on the Common Data Bus to all awaiting FUs & the reorder buffer; mark the reservation station available
4. Commit - update the register with the reorder result
   When the instruction is at the head of the reorder buffer & its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer (sometimes called "graduation")

Summary
Reservation stations: implicit register renaming to a larger set of registers + buffering of source operands
- Prevents registers from becoming the bottleneck
- Avoids the WAR, WAW hazards of the scoreboard
- Allows loop unrolling in HW
Not limited to basic blocks (the integer unit gets ahead, beyond branches)
Lasting contributions:
- Dynamic scheduling
- Register renaming
- Load/store disambiguation
The 360/91's descendants are the Pentium III, PowerPC 604, MIPS R10000, HP-PA 8000, Alpha 21264, ...

More branch prediction
Inter-relating Branches
So far, we have not considered inter-dependent branches
- Branches whose outcome depends on other branches
If (a <= 0) then {s1}; /* branch b1 */
If (a > 0)  then {s2}; /* branch b2 */
Relation between b1 and b2?
2-level predictors - Advanced Branch Prediction: Global Branch Prediction
Relate branches
- Globally (G)
- Individually (P) (also known as per-branch)
Global: a single BHSR of k bits tracks the branch directions of the last k dynamic branches
Individual: employ a set of k-bit BHSRs, one of which is selected for each branch
- The global scheme is shared by all branches, whereas the individual scheme has a BHSR dedicated to each branch (or subset)
The PHT has options: global (g), individual (p), shared (s)
- Global has a single table for all static branches
- Individual has a PHT for each static branch
- Or a subset of PHTs shared among branches
[Figure: the Branch History Register (shifted left on update) indexes the Pattern History Table (PHT); the indexed PHT bits feed FSM logic producing the prediction and the new PHT state on branch resolution.]

2-Level Adaptive Prediction [Yeh & Patt] - Global BHSR Scheme (GAs)
Two-level adaptive branch prediction
- 1st level: the history of the last k (dynamic) branches encountered
- 2nd level: the branch behavior of the last s occurrences of that specific pattern of these k branches
- Uses a Branch History Register (BHR) in conjunction with a Pattern History Table (PHT)
Example (k = 8, s = 6):
- Last k branches with the behavior ( ... )
- The s-bit history at that entry ( ... ) is [101010]
- Using this history, the branch prediction algorithm predicts the direction of the branch
Effectiveness:
- Average 97% accuracy for SPEC
- Used in the Intel P6 and AMD K6
[Figure: k bits of the global BHSR concatenated with j bits of the branch address index a table of 2-bit entries, 2^(j+k) of them.]
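A GAs-style predictor can be sketched with a global history register indexing a table of 2-bit saturating counters. This is an illustrative model only; the history length k = 4 and all names are assumptions for the sketch (real designs also mix in branch-address bits, omitted here).

```python
# Sketch of a two-level global predictor: a k-bit branch history shift
# register (BHSR) indexes a pattern history table (PHT) of 2-bit
# saturating counters.

K = 4                                   # assumed history length
pht = [1] * (1 << K)                    # counters start weakly not-taken
bhsr = 0                                # global history of the last K branches

def predict():
    return pht[bhsr] >= 2               # counter's high bit: predict taken?

def update(taken):
    """Train the indexed counter, then shift the outcome into the history."""
    global bhsr
    c = pht[bhsr]
    pht[bhsr] = min(c + 1, 3) if taken else max(c - 1, 0)
    bhsr = ((bhsr << 1) | int(taken)) & ((1 << K) - 1)
```

Run on a perfectly alternating taken / not-taken stream, like the correlated b1/b2 pair earlier (a <= 0 versus a > 0), the history patterns 0101 and 1010 train separate counters, so after a short warm-up every prediction is correct, something no single per-branch counter can achieve on that stream.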
Per-Branch BHSR Scheme (PAs)
[Figure: i bits of the branch address select one of 2^i k-bit BHSRs; the selected k-bit history, together with j bits of the branch address, indexes a standard BHT of 2^(j+k) entries, whose output drives the prediction function.]

Other Schemes
Return Stack
- Register-indirect targets are hard to predict from branch history
- Register-indirect branches are mostly used for function returns
1. Push the return address onto a stack on each function call
2. On a register-indirect branch, pop and return the top address as the prediction
Combining Branch Predictors
- Each type of branch prediction scheme tries to capture a particular program behavior
- May want to include multiple prediction schemes in hardware
- Use another history-based prediction scheme to predict which predictor should be used for a particular branch
  You get the best of all worlds. This works quite well

Next... VLIW/EPIC
Static ILP processors: Very Long Instruction Word (VLIW), Explicitly Parallel Instruction Computing (EPIC)
- The compiler determines dependencies and resolves hazards
- What hardware support is needed for this?
Do standard compiler techniques work, or do we need new compiler optimization methods to extract more ILP?
- Overview of compiler optimization
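The return-stack scheme under "Other Schemes" above can be sketched directly from its two steps. This is an illustrative model; the fixed depth and all names are assumptions (real return-address stacks are small hardware structures that can overflow on deep recursion, modeled here by dropping the oldest entry).

```python
# Sketch of a return-address stack (RAS) predictor: calls push the
# fall-through PC; register-indirect returns pop it as the predicted
# target.

class ReturnStack:
    def __init__(self, depth=8):
        self.stack = []
        self.depth = depth

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # overflow: lose the oldest return
        self.stack.append(return_pc)

    def predict_return(self):
        """Pop the top entry as the predicted target (None if empty)."""
        return self.stack.pop() if self.stack else None
```

Because calls and returns nest, the LIFO order matches the program's return targets exactly as long as the stack does not overflow, which is why this beats history-based prediction for register-indirect returns.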
Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield.
Technical Report UCAM-CL-TR-707 ISSN 1476-2986 Number 707 Computer Laboratory Complexity-effective superscalar embedded processors using instruction-level distributed processing Ian Caulfield December
Design Cycle for Microprocessors
Cycle for Microprocessors Raúl Martínez Intel Barcelona Research Center Cursos de Verano 2010 UCLM Intel Corporation, 2010 Agenda Introduction plan Architecture Microarchitecture Logic Silicon ramp Types
ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM
ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM 1 The ARM architecture processors popular in Mobile phone systems 2 ARM Features ARM has 32-bit architecture but supports 16 bit
LOCKUP-FREE INSTRUCTION FETCH/PREFETCH CACHE ORGANIZATION
LOCKUP-FREE INSTRUCTION FETCH/PREFETCH CACHE ORGANIZATION DAVID KROFT Control Data Canada, Ltd. Canadian Development Division Mississauga, Ontario, Canada ABSTRACT In the past decade, there has been much
LSN 2 Computer Processors
LSN 2 Computer Processors Department of Engineering Technology LSN 2 Computer Processors Microprocessors Design Instruction set Processor organization Processor performance Bandwidth Clock speed LSN 2
COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING
COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING 2013/2014 1 st Semester Sample Exam January 2014 Duration: 2h00 - No extra material allowed. This includes notes, scratch paper, calculator, etc.
Lecture 17: Virtual Memory II. Goals of virtual memory
Lecture 17: Virtual Memory II Last Lecture: Introduction to virtual memory Today Review and continue virtual memory discussion Lecture 17 1 Goals of virtual memory Make it appear as if each process has:
Software Pipelining - Modulo Scheduling
EECS 583 Class 12 Software Pipelining - Modulo Scheduling University of Michigan October 15, 2014 Announcements + Reading Material HW 2 Due this Thursday Today s class reading» Iterative Modulo Scheduling:
Quiz for Chapter 1 Computer Abstractions and Technology 3.10
Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,
CPU Organization and Assembly Language
COS 140 Foundations of Computer Science School of Computing and Information Science University of Maine October 2, 2015 Outline 1 2 3 4 5 6 7 8 Homework and announcements Reading: Chapter 12 Homework:
1 The Java Virtual Machine
1 The Java Virtual Machine About the Spec Format This document describes the Java virtual machine and the instruction set. In this introduction, each component of the machine is briefly described. This
Introducción. Diseño de sistemas digitales.1
Introducción Adapted from: Mary Jane Irwin ( www.cse.psu.edu/~mji ) www.cse.psu.edu/~cg431 [Original from Computer Organization and Design, Patterson & Hennessy, 2005, UCB] Diseño de sistemas digitales.1
Application Note 195. ARM11 performance monitor unit. Document number: ARM DAI 195B Issued: 15th February, 2008 Copyright ARM Limited 2007
Application Note 195 ARM11 performance monitor unit Document number: ARM DAI 195B Issued: 15th February, 2008 Copyright ARM Limited 2007 Copyright 2007 ARM Limited. All rights reserved. Application Note
EEM 486: Computer Architecture. Lecture 4. Performance
EEM 486: Computer Architecture Lecture 4 Performance EEM 486 Performance Purchasing perspective Given a collection of machines, which has the» Best performance?» Least cost?» Best performance / cost? Design
High Performance Processor Architecture. André Seznec IRISA/INRIA ALF project-team
High Performance Processor Architecture André Seznec IRISA/INRIA ALF project-team 1 2 Moore s «Law» Nb of transistors on a micro processor chip doubles every 18 months 1972: 2000 transistors (Intel 4004)
StrongARM** SA-110 Microprocessor Instruction Timing
StrongARM** SA-110 Microprocessor Instruction Timing Application Note September 1998 Order Number: 278194-001 Information in this document is provided in connection with Intel products. No license, express
Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors
Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors Haitham Akkary Ravi Rajwar Srikanth T. Srinivasan Microprocessor Research Labs, Intel Corporation Hillsboro, Oregon
Operating System Impact on SMT Architecture
Operating System Impact on SMT Architecture The work published in An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture, Josh Redstone et al., in Proceedings of the 9th
Chapter 4 Lecture 5 The Microarchitecture Level Integer JAVA Virtual Machine
Chapter 4 Lecture 5 The Microarchitecture Level Integer JAVA Virtual Machine This is a limited version of a hardware implementation to execute the JAVA programming language. 1 of 23 Structured Computer
POWER8 Performance Analysis
POWER8 Performance Analysis Satish Kumar Sadasivam Senior Performance Engineer, Master Inventor IBM Systems and Technology Labs [email protected] #OpenPOWERSummit Join the conversation at #OpenPOWERSummit
Instruction scheduling
Instruction ordering Instruction scheduling Advanced Compiler Construction Michel Schinz 2015 05 21 When a compiler emits the instructions corresponding to a program, it imposes a total order on them.
In the Beginning... 1964 -- The first ISA appears on the IBM System 360 In the good old days
RISC vs CISC 66 In the Beginning... 1964 -- The first ISA appears on the IBM System 360 In the good old days Initially, the focus was on usability by humans. Lots of user-friendly instructions (remember
Design of Pipelined MIPS Processor. Sept. 24 & 26, 1997
Design of Pipelined MIPS Processor Sept. 24 & 26, 1997 Topics Instruction processing Principles of pipelining Inserting pipe registers Data Hazards Control Hazards Exceptions MIPS architecture subset R-type
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada [email protected] Micaela Serra
find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1
Monitors Monitor: A tool used to observe the activities on a system. Usage: A system programmer may use a monitor to improve software performance. Find frequently used segments of the software. A systems
Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
Binary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
Chapter 5 Instructor's Manual
The Essentials of Computer Organization and Architecture Linda Null and Julia Lobur Jones and Bartlett Publishers, 2003 Chapter 5 Instructor's Manual Chapter Objectives Chapter 5, A Closer Look at Instruction
SOC architecture and design
SOC architecture and design system-on-chip (SOC) processors: become components in a system SOC covers many topics processor: pipelined, superscalar, VLIW, array, vector storage: cache, embedded and external
COMPUTER HARDWARE. Input- Output and Communication Memory Systems
COMPUTER HARDWARE Input- Output and Communication Memory Systems Computer I/O I/O devices commonly found in Computer systems Keyboards Displays Printers Magnetic Drives Compact disk read only memory (CD-ROM)
Reduced Instruction Set Computer (RISC)
Reduced Instruction Set Computer (RISC) Focuses on reducing the number and complexity of instructions of the ISA. RISC Goals RISC: Simplify ISA Simplify CPU Design Better CPU Performance Motivated by simplifying
what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?
Inside the CPU how does the CPU work? what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? some short, boring programs to illustrate the
Review: MIPS Addressing Modes/Instruction Formats
Review: Addressing Modes Addressing mode Example Meaning Register Add R4,R3 R4 R4+R3 Immediate Add R4,#3 R4 R4+3 Displacement Add R4,1(R1) R4 R4+Mem[1+R1] Register indirect Add R4,(R1) R4 R4+Mem[R1] Indexed
Computer Organization and Components
Computer Organization and Components IS1500, fall 2015 Lecture 5: I/O Systems, part I Associate Professor, KTH Royal Institute of Technology Assistant Research Engineer, University of California, Berkeley
Boosting Beyond Static Scheduling in a Superscalar Processor
Boosting Beyond Static Scheduling in a Superscalar Processor Michael D. Smith, Monica S. Lam, and Mark A. Horowitz Computer Systems Laboratory Stanford University, Stanford CA 94305-4070 May 1990 1 Introduction
EE361: Digital Computer Organization Course Syllabus
EE361: Digital Computer Organization Course Syllabus Dr. Mohammad H. Awedh Spring 2014 Course Objectives Simply, a computer is a set of components (Processor, Memory and Storage, Input/Output Devices)
IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 IBM CELL. Politecnico di Milano Como Campus
Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 CELL INTRODUCTION 2 1 CELL SYNERGY Cell is not a collection of different processors, but a synergistic whole Operation paradigms,
OPENSPARC T1 OVERVIEW
Chapter Four OPENSPARC T1 OVERVIEW Denis Sheahan Distinguished Engineer Niagara Architecture Group Sun Microsystems Creative Commons 3.0United United States License Creative CommonsAttribution-Share Attribution-Share
COMP 303 MIPS Processor Design Project 4: MIPS Processor Due Date: 11 December 2009 23:59
COMP 303 MIPS Processor Design Project 4: MIPS Processor Due Date: 11 December 2009 23:59 Overview: In the first projects for COMP 303, you will design and implement a subset of the MIPS32 architecture
picojava TM : A Hardware Implementation of the Java Virtual Machine
picojava TM : A Hardware Implementation of the Java Virtual Machine Marc Tremblay and Michael O Connor Sun Microelectronics Slide 1 The Java picojava Synergy Java s origins lie in improving the consumer
Multithreading Lin Gao cs9244 report, 2006
Multithreading Lin Gao cs9244 report, 2006 2 Contents 1 Introduction 5 2 Multithreading Technology 7 2.1 Fine-grained multithreading (FGMT)............. 8 2.2 Coarse-grained multithreading (CGMT)............
CPU Organisation and Operation
CPU Organisation and Operation The Fetch-Execute Cycle The operation of the CPU 1 is usually described in terms of the Fetch-Execute cycle. 2 Fetch-Execute Cycle Fetch the Instruction Increment the Program
Intel 8086 architecture
Intel 8086 architecture Today we ll take a look at Intel s 8086, which is one of the oldest and yet most prevalent processor architectures around. We ll make many comparisons between the MIPS and 8086
Energy-Efficient, High-Performance Heterogeneous Core Design
Energy-Efficient, High-Performance Heterogeneous Core Design Raj Parihar Core Design Session, MICRO - 2012 Advanced Computer Architecture Lab, UofR, Rochester April 18, 2013 Raj Parihar Energy-Efficient,
