PIPELINING: POWERPC VS PENTIUM II


Abstract: This paper discusses the basic concepts of pipelining in Chapter 1, gives detailed information about the pipelining of the PowerPC 750 in Chapter 2, and describes the pipelining architecture of the Intel Pentium II in Chapter 3.

1. OVERVIEW OF PIPELINE

1.1 Definition

In everyday life, people do many tasks in stages. For instance, when we do the laundry, we place a load in the washing machine. When it is done, it is transferred to the dryer and another load is placed in the washing machine. When the first load is dry, we pull it out for folding or ironing, move the second load to the dryer, and start the third load in the washing machine. We proceed with folding or ironing the first load while the second and third loads are being dried and washed, respectively. We may never have thought of it this way, but we do laundry by pipeline processing.

A pipeline is a series of stages, where some work is done at each stage. The work is not finished until it has passed through all stages. Let us review the laundry example. The washing machine is the first stage. The second is the dryer. The third is the folding or ironing stage. Partial processing takes place in each stage. We certainly aren't done when the clothes leave the washer, nor when they leave the dryer, although we're getting close. We must take the third step and fold or iron the clothes. The "fully processed result" is obtained only after the operand (the load of clothes) has passed through the entire pipeline.

We are often taught to take a large task and divide it into smaller pieces. This may turn an unmanageably complex task into a series of more tractable smaller steps. In the case of manageable tasks such as the laundry example, it allows us to speed up the task by doing it in overlapping steps. This is the key to pipelining: division of a larger task into smaller overlapping tasks.
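The gain from overlapping the laundry stages can be made concrete with a small calculation. The sketch below is only an illustration: the stage durations (30, 40, and 20 minutes) and the function name finish_times are assumptions, not figures from this paper, and a load is assumed to be able to wait between stages until the next machine is free.

```python
def finish_times(n_loads, stage_minutes):
    """Return the time at which each load leaves the last stage.

    A load enters a stage as soon as it has left the previous stage
    and the stage has been vacated by the load ahead of it.
    """
    stage_free_at = [0] * len(stage_minutes)  # when each stage next becomes free
    finished = []
    for _ in range(n_loads):
        t = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(t, stage_free_at[s])
            t = start + minutes
            stage_free_at[s] = t
        finished.append(t)
    return finished


# Hypothetical durations in minutes: wash, dry, fold.
STAGE_MINUTES = [30, 40, 20]
N_LOADS = 3

# Without overlap, each load runs through all stages before the next starts.
sequential_total = N_LOADS * sum(STAGE_MINUTES)
# With overlap, the total is the time at which the last load is folded.
pipelined_total = finish_times(N_LOADS, STAGE_MINUTES)[-1]

print(sequential_total, pipelined_total)
```

With these assumed durations the three loads take 270 minutes one after another and 170 minutes when overlapped. The slowest stage (the dryer) sets the pace once the pipeline is full.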
Overlap and pipelining are essentially operation management techniques based on job sub-divisions under a precedence constraint. Some people separate pipelines into two categories:

Instructional pipeline: where the different stages of instruction fetch and execution are handled in a pipeline.
Arithmetic pipeline: where the different stages of an arithmetic operation are handled along the stages of a pipeline.

The above definitions are correct but rest on a narrow perspective: they consider only the central processor. There are also other types of computing pipelines, for instance pipelines used to compress and transfer video data.

1.2 Advantage Of Pipeline

Speed: Pipelining is used to obtain improvements in processing time that would be unobtainable with existing non-pipelined technology. The development goal for the IBM 7030 was an overall performance of 100 times that of the IBM 704, the fastest computer in production at that time, whereas circuit improvements would give only a factor-of-10 improvement. This goal could only be met by overlapping instructions, i.e. pipelining. Similarly, the goal for the IBM 360/91 was an improvement of one to two orders of magnitude over an earlier IBM machine, while technology advances could bring about only a fourfold improvement. Another example is the MOS Technology 6502 microprocessor, which had a throughput similar to that of an Intel 8080 processor running at a clock rate four times faster. This was due to the pipelined architecture of the 6502 versus the nonpipelined 8080. Nowadays, almost all microprocessors use a pipelined architecture to improve their performance.

1.3 Disadvantage Of Pipeline

There are also two major disadvantages of pipeline architecture. The first is complexity. The second is the inability to run the pipeline continuously at full speed, i.e. the pipeline stalls. Why can the pipeline not run at full speed? There are phenomena called pipeline hazards that disrupt the smooth execution of the pipeline. The resulting delays in the pipeline flow are called bubbles. These pipeline hazards include:

Structural hazards from hardware conflicts.
Data hazards arising from data dependencies.
Control hazards that come about from branches, jumps, and other control flow changes.

These issues can be, and are, successfully dealt with. But detecting and avoiding the hazards leads to a considerable increase in hardware complexity. The control paths controlling the gating between stages can contain more circuit levels than the data paths being controlled. In 1970, this complexity was one reason that led some people to call pipelining "still-controversial". But with the rapid progress of electronics technology, this worry soon disappeared. At present, we can see that most new CPUs belong to the RISC microprocessor class, a class with a fully pipelined architecture to obtain better performance.

1.4 Pipelining Model

In the pipelining model, a task is divided into steps. The steps must be performed in sequence to produce a single instance of the desired output, and the work done in each step (except for the first and last) is based on the preceding step and is a prerequisite for the work in the next step. However, the program is designed to produce multiple instances of the desired output, and the steps are designed to operate in a parallel time frame so that each step is kept busy.

An example of the pipelining model is an automobile assembly line. Each step or stage in the assembly line is continually busy receiving the product of the previous stage's work, performing its assigned work, and passing the product along to the next stage. A car needs a body before it can be painted, but at any one time numerous cars are receiving bodies, and numerous cars are being painted.

In a multithreaded program using the pipelining model, each thread represents one step of the task, so the task as a whole is performed by several threads.

2. PIPELINING OF POWERPC

2.1 Instruction Execution Of PowerPC 750

PowerPC 750 is a pipelined, superscalar processor. A pipelined processor is one in which instruction processing is divided into discrete stages, allowing work to be done on different instructions in each stage. For example, after an instruction completes one stage, it can pass on to the next stage, leaving the previous stage available to the subsequent instruction. This improves overall instruction throughput.
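The way a stage is handed from one instruction to the next can be sketched as a cycle-by-cycle occupancy table. This is an idealized illustration only: it assumes one instruction enters per cycle, every stage takes exactly one cycle, and nothing ever stalls, which is not a claim about the real timing of PowerPC 750. The function name occupancy and the stage labels are chosen for this example.

```python
STAGES = ["fetch", "dispatch", "execute", "complete"]


def occupancy(n_instructions, stages=STAGES):
    """Return one row per clock cycle.

    row[s] is the index of the instruction that occupies stage s in
    that cycle, or None if the stage is empty.
    """
    total_cycles = len(stages) + n_instructions - 1
    table = []
    for cycle in range(total_cycles):
        row = []
        for s in range(len(stages)):
            instr = cycle - s  # instruction i reaches stage s in cycle i + s
            row.append(instr if 0 <= instr < n_instructions else None)
        table.append(row)
    return table


if __name__ == "__main__":
    print("cycle  " + "  ".join(f"{name:>8}" for name in STAGES))
    for cycle, row in enumerate(occupancy(5)):
        cells = "  ".join(f"{'-' if i is None else 'I' + str(i):>8}" for i in row)
        print(f"{cycle:>5}  {cells}")
```

In this idealized model, five instructions finish in 4 + 5 - 1 = 8 cycles rather than the 20 cycles needed if each instruction had to leave the pipeline before the next one entered.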
The entire path that instructions take through the fetch, decode/dispatch, execute, complete, and write-back stages is considered PowerPC 750's master pipeline, and two of PowerPC 750's execution units (the FPU and LSU) are also multiple-stage pipelines. PowerPC 750 contains the following execution units, which operate independently and in parallel:

Branch processing unit (BPU)
Integer unit 1 (IU1), which executes all integer instructions
Integer unit 2 (IU2), which executes all integer instructions except multiplies and divides
64-bit floating-point unit (FPU)
Load/store unit (LSU)
System register unit (SRU)

PowerPC 750 can retire two instructions on every clock cycle. In general, it processes instructions in four stages: fetch, decode/dispatch, execute, and complete, as shown in Figure 2-1.

Figure 2-1. Superscalar/Pipeline Diagram

A superscalar processor is one that issues multiple independent instructions into separate execution units, allowing instructions to execute in parallel. PowerPC 750 has six independent execution units: two for integer instructions, and one each for floating-point instructions, branch instructions, load/store instructions, and system register instructions. Having separate GPRs and FPRs allows integer calculations, floating-point calculations, and load and store operations to occur simultaneously without interference. Additionally, rename buffers are provided to allow operations to post execution results for use by subsequent instructions without committing them to the architected FPRs and GPRs.

2.2 Instruction Timing

In the context of instruction timing, the term pipeline refers to the interconnection of the stages. The events necessary to process an instruction are broken into several cycle-length tasks to allow work to be performed on several instructions simultaneously, analogous to an assembly line. As an instruction is processed, it passes from one stage to the next. When it does, the stage becomes available for the next instruction. Although an individual instruction may take many cycles to complete (the number of cycles is called instruction latency), pipelining makes it possible to overlap the processing so that the throughput (number of instructions completed per cycle) is greater than if pipelining were not implemented.

The PowerPC 750 design minimizes average instruction execution latency, that is, the number of clock cycles it takes to fetch, decode, dispatch, and execute instructions and make the results available for a subsequent instruction. Some instructions, such as loads and stores, access memory and require additional clock cycles between the execute phase and the write-back phase. These latencies vary depending on whether the access is to cacheable or noncacheable memory, whether it hits in the L1 or L2 cache, whether the cache access generates a write-back to memory, whether the access causes a snoop hit from another device that generates additional activity, and other conditions that affect memory accesses.

PowerPC 750 implements many features to improve throughput, such as pipelining, superscalar instruction issue, branch folding, removal of fall-through branches, two-level speculative branch handling, and multiple execution units that operate independently and in parallel. As an instruction passes from stage to stage in a pipelined system, the following instruction can follow through the stages as the former instruction vacates them, allowing several instructions to be processed simultaneously. While it may take several cycles for an instruction to pass through all the stages, once the pipeline has been filled, one instruction can complete its work on every clock cycle.

Figure 2-2. Pipelined Execution Unit

Figure 2-2 represents a generic pipelined execution unit.

2.3 Pipelining in Instruction Timing

As shown in Figure 2-1, the common pipeline of PowerPC 750 has four stages through which all instructions must pass: fetch, decode/dispatch, execute, and complete/write-back. Some instructions occupy multiple stages simultaneously, and some individual execution units have additional stages. For example, the floating-point pipeline consists of three stages through which all floating-point instructions must pass. Note that Figure 2-1 does not show features such as reservation stations and rename buffers, which reduce stalls and improve instruction throughput.

The instruction pipeline in PowerPC 750 has four major pipeline stages, described as follows:

The fetch pipeline stage primarily involves retrieving instructions from the memory system and determining the location of the next instruction fetch. The BPU decodes branches during the fetch stage and removes those that do not update CTR or LR from the instruction stream.

The dispatch stage is responsible for decoding the instructions supplied by the instruction fetch stage and determining which instructions can be dispatched in the current cycle. If source operands for the instruction are available, they are read from the appropriate register file or rename register to the execute pipeline stage. If a source operand is not available, dispatch provides a tag that indicates which rename register will supply the operand when it becomes available. At the end of the dispatch stage, the dispatched instructions and their operands are latched by the appropriate execution unit. Instructions executed by the IUs, FPU, SRU, and LSU are dispatched from the bottom two positions in the instruction queue. In a single clock cycle, a maximum of two instructions can be dispatched to these execution units in any combination. When an instruction is dispatched, it is assigned a position in the six-entry completion queue.
A branch instruction can be issued on the same clock cycle, for a maximum three-instruction dispatch.

During the execute pipeline stage, each execution unit that has an executable instruction executes the selected instruction (perhaps over multiple cycles), writes the instruction's result into the appropriate rename register, and notifies the completion stage that the instruction has finished execution. In the case of an internal exception, the execution unit reports the exception to the completion pipeline stage and (except for the FPU) discontinues instruction execution until the exception is handled. The exception is not signaled until that instruction is the next to be completed. Execution of most floating-point instructions is pipelined within the FPU, allowing up to three instructions to be executing in the FPU concurrently. The FPU stages are multiply, add, and round-convert. Execution of most load/store instructions is also pipelined. The load/store unit has two pipeline stages. The first stage is for effective address calculation and MMU translation, and the second stage is for accessing the data in the cache.

The complete pipeline stage maintains the correct architectural machine state and transfers execution results from the rename registers to the GPRs and FPRs (and CTR and LR, for some instructions) as instructions are retired. As with dispatching instructions from the instruction queue, instructions are retired from the two bottom positions in the completion queue. If completion logic detects an instruction causing an exception, all following instructions are cancelled, their execution results in rename registers are discarded, and instructions are fetched from the appropriate exception vector.

Because the PowerPC architecture can be applied to such a wide variety of implementations, instruction timing varies among PowerPC processors.

2.4 Pipelining In Bus Interface Operation

Another pipelined operation in PowerPC 750 is the bus interface operation. A conceptual block diagram of the bus interface is shown in Figure 2-3. The address register queues in the figure hold transaction requests that the bus interface may issue on the bus independently of the other requests. The bus interface may have up to two transactions operating on the bus at any given time through the use of address pipelining.

Bus Interface

Figure 2-3. Bus Interface Address Buffers

The bus interface prioritizes requests for bus operations from the instruction and data caches, and performs bus operations. It includes address register queues, prioritization logic, and a bus control unit. The bus interface latches snoop addresses for snooping in the data cache and in the address register queues, and for reservations controlled by the Load Word and Reserve Indexed and Store Word Conditional Indexed instructions, and maintains the touch load address for the cache. The interface allows one level of pipelining; that is, with certain restrictions, there can be two outstanding transactions at any given time. Accesses are prioritized with load operations preceding store operations.

Memory accesses can occur in single-beat (1, 2, 3, 4, and 8 bytes) and four-beat (32 bytes) burst data transfers. The address and data buses are independent for memory accesses to support pipelining and split transactions. PowerPC 750 can pipeline as many as two transactions and has limited support for out-of-order split-bus transactions.

Access to the bus interface is granted through an external arbitration mechanism that allows devices to compete for bus mastership. This arbitration mechanism is flexible, allowing PowerPC 750 to be integrated into systems that implement various fairness and bus-parking procedures to avoid arbitration overhead.

Address Pipelining and Split-Bus Transactions

The PowerPC 750 protocol provides independent address and data bus capability to support pipelined and split-bus transaction system organizations. Address pipelining allows the address tenure of a new bus transaction to begin before the data tenure of the current transaction has finished. Split-bus transaction capability allows other bus activity to occur (either from the same master or from different masters) between the address and data tenures of a transaction. While this capability does not inherently reduce memory latency, support for address pipelining and split-bus transactions can greatly improve effective bus/memory throughput.
For this reason, these techniques are most effective in shared-memory multimaster implementations where bus bandwidth is an important measurement of system performance. PowerPC 750 can pipeline its own transactions to a depth of one level (intraprocessor pipelining); however, the PowerPC 750 bus protocol does not constrain the maximum number of levels of pipelining that can occur on the bus between multiple masters (interprocessor pipelining). The external arbiter must control the pipeline depth and synchronization between masters and slaves.

In a pipelined implementation, data bus tenures are kept in strict order with respect to address tenures. However, external hardware can further decouple the address and data buses, allowing the data tenures to occur out of order with respect to the address tenures. This requires some form of system tag to associate the out-of-order data transaction with the proper originating address transaction (not defined for the PowerPC 750 interface). Individual bus requests and data bus grants from each processor can be used by the system to implement tags to support interprocessor, out-of-order transactions.

PowerPC 750 supports a limited intraprocessor out-of-order, split-transaction capability via the data bus write only (DBWO) signal. Note that PowerPC 750 drops out of pipeline mode between consecutive burst data reads and between consecutive burst instruction fetches. No other sequences of operations cause this effect. In this case, the address tenure of the second transaction will not begin until one to three bus clocks after the end of the data tenure of the first transaction.

3. PIPELINING OF PENTIUM II PROCESSOR

3.1 Overview of Pentium II Pipelining

To give a closer look at how the P6 family micro-architecture implements Dynamic Execution, Figure 3-1 shows a block diagram of the Pentium II processor with cache and memory interfaces. The units shown in Figure 3-1 represent stages of the Pentium II processor pipeline.

Figure 3-1. The Three Core Engines Interface with Memory via Unified Caches

The FETCH/DECODE unit: An in-order unit that takes as input the user program instruction stream from the instruction cache, and decodes it into a series of µoperations (µops) that represent the dataflow of that instruction stream. The pre-fetch is speculative.

The DISPATCH/EXECUTE unit: An out-of-order unit that accepts the dataflow stream, schedules execution of the µops subject to data dependencies and resource availability, and temporarily stores the results of these speculative executions.

The RETIRE unit: An in-order unit that knows how and when to commit ("retire") the temporary, speculative results to permanent architectural state.

The BUS INTERFACE unit: A partially ordered unit responsible for connecting the three internal units to the real world. The bus interface unit communicates directly with the L2 (second level) cache, supporting up to four concurrent cache accesses. The bus interface unit also controls a transaction bus, with MESI snooping protocol, to system memory.

3.2 The Fetch/Decode Unit

Figure 3-2 shows a more detailed view of the Fetch/Decode unit.

Figure 3-2. Inside the Fetch/Decode Unit

The L1 Instruction Cache is a local instruction cache. The Next_IP unit provides the L1 Instruction Cache index, based on inputs from the Branch Target Buffer (BTB), trap/interrupt status, and branch-misprediction indications from the integer execution section.

The L1 Instruction Cache fetches the cache line corresponding to the index from the Next_IP, and the next line, and presents 16 aligned bytes to the decoder. The prefetched bytes are rotated so that they are justified for the instruction decoders (ID). The beginning and end of the Intel Architecture instructions are marked.

Three parallel decoders accept this stream of marked bytes, and proceed to find and decode the Intel Architecture instructions contained therein. The decoder converts the Intel Architecture instructions into triadic µops (two logical sources, one logical destination per µop). Most Intel Architecture instructions are converted directly into single µops, some instructions are decoded into one to four µops, and the complex instructions require microcode. This microcode is just a set of preprogrammed sequences of normal µops. The µops are queued and sent to the Register Alias Table (RAT) unit, where the logical Intel Architecture-based register references are converted into P6 family processor physical register references, and then to the Allocator stage, which adds status information to the µops and enters them into the instruction pool. The instruction pool is implemented as an array of Content Addressable Memory called the ReOrder Buffer (ROB).

3.3 The Dispatch/Execute Unit

The Dispatch unit selects µops from the instruction pool depending upon their status. If the status indicates that a µop has all of its operands, then the dispatch unit checks to see if the execution resource needed by that µop is also available. If both are true, the Reservation Station removes that µop and sends it to the resource where it is executed. The results of the µop are later returned to the pool. There are five ports on the Reservation Station, and the multiple resources are accessed as shown in Figure 3-3.

Figure 3-3. Inside the Dispatch/Execute Unit

The Pentium II processor can schedule at a peak rate of 5 µops per clock, one to each resource port, but a sustained rate of 3 µops per clock is more typical. The activity of this scheduling process is the out-of-order process; µops are dispatched to the execution resources strictly according to dataflow constraints and resource availability, without regard to the original ordering of the program.
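The dispatch rule just described (a µop leaves the pool once its operands and its port are both available) can be illustrated with a toy scheduler. This is a minimal sketch under stated assumptions, not the Pentium II's actual scheduling logic: every µop is assumed to take one cycle, each port accepts one µop per cycle, the oldest ready µop wins a port, and the Uop class, the schedule function, and the port numbers in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uop:
    name: str     # name of the result this µop produces
    srcs: tuple   # names of the results it consumes
    port: int     # execution port it needs (hypothetical numbering)


def schedule(uops, latency=1):
    """Return {µop name: cycle in which it is dispatched}.

    `uops` is given in program order; each source must be produced by
    an earlier µop in the list (checked below so the loop terminates).
    """
    seen = set()
    for u in uops:
        for s in u.srcs:
            if s not in seen:
                raise ValueError(f"{u.name} reads {s} before it is produced")
        seen.add(u.name)

    result_ready_at = {}   # name -> first cycle in which the result can be read
    dispatched_at = {}
    pending = list(uops)   # oldest first
    cycle = 0
    while pending:
        busy_ports = set()
        for u in list(pending):
            operands_ready = all(result_ready_at.get(s, cycle + 1) <= cycle
                                 for s in u.srcs)
            if operands_ready and u.port not in busy_ports:
                busy_ports.add(u.port)
                dispatched_at[u.name] = cycle
                result_ready_at[u.name] = cycle + latency
                pending.remove(u)
        cycle += 1
    return dispatched_at


# Program order is a, b, c, d; b depends on a, and d depends on b and c.
program = [
    Uop("a", (), 2),
    Uop("b", ("a",), 0),
    Uop("c", (), 0),
    Uop("d", ("b", "c"), 1),
]
print(schedule(program))
```

In this example c is dispatched in the same cycle as a, ahead of the older b, because b is still waiting for a's result. That reordering is driven purely by data readiness and port availability, which is the point the paragraph above makes.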

The actual algorithm employed by this execution-scheduling process is vitally important to performance. If only one µop per resource becomes data-ready per clock cycle, then there is no choice. But if several are available, the scheduler must choose. The P6 family micro-architecture uses a pseudo FIFO scheduling algorithm favoring back-to-back µops.

Also, many of the µops are branches. The Branch Target Buffer will correctly predict most of these branches, but it can't correctly predict them all. Consider a BTB that is correctly predicting the backward branch at the bottom of a loop; eventually that loop is going to terminate, and when it does, that branch will be mispredicted. Branch µops are tagged (in the in-order pipeline) with their fall-through address and the destination that was predicted for them. When the branch executes, what the branch actually did is compared against what the prediction hardware said it would do. If those coincide, then the branch eventually retires and the speculatively executed work between it and the next branch instruction in the instruction pool is good. But if they do not coincide, then the Jump Execution Unit (JEU) changes the status of all of the µops behind the branch to remove them from the instruction pool. In that case the proper branch destination is provided to the BTB, which restarts the whole pipeline from the new target address.

3.4 The Retire Unit

Figure 3-4. Inside the Retire Unit

Figure 3-4 shows a more detailed view of the Retire Unit. The Retire Unit also checks the status of µops in the instruction pool. It looks for µops that have executed and can be removed from the pool. Once they are removed, the original architectural target of the µops is written as per the original Intel Architecture instruction. The Retire Unit must not only notice which µops are complete, it must also re-impose the original program order on them. It must also do this in the face of interrupts, traps, faults, breakpoints, and mispredictions.

The Retire Unit must first read the instruction pool to find the potential candidates for retirement and determine which of these candidates are next in the original program order. Then it writes the results of this cycle's retirements to the Retirement Register File (RRF). The Retire Unit is capable of retiring 3 µops per clock.

3.5 The Bus Interface Unit

Figure 3-5 shows a more detailed view of the Bus Interface Unit.

Figure 3-5. Inside the Bus Interface Unit

There are two types of memory access: loads and stores. Loads only need to specify the memory address to be accessed, the width of the data being retrieved, and the destination register. Loads are encoded into a single µop.

Stores need to provide a memory address, a data width, and the data to be written. Stores therefore require two µops, one to generate the address and one to generate the data. These µops must later re-combine for the store to complete.

Stores are never performed speculatively, since there is no transparent way to undo them. Stores are also never re-ordered among themselves. A store is dispatched only when both the address and the data are available and there are no older stores awaiting dispatch.

A study of the importance of memory access reordering concluded:

Stores must be constrained from passing other stores, for only a small impact on performance.
Stores can be constrained from passing loads, for an inconsequential performance loss.

Constraining loads from passing other loads or stores has a significant impact on performance.

The Memory Order Buffer (MOB) allows loads to pass other loads and stores by acting like a reservation station and re-order buffer. It holds suspended loads and stores and re-dispatches them when a blocking condition (dependency or resource) disappears.

4. Summary

Dynamically scheduled pipelines are used in both microprocessors, and the two have similar pipeline organizations. Both use pipelining for instruction operation and for bus operation. The main stages of PowerPC 750 instruction processing are Fetch, Dispatch, Execute, and Complete. In Pentium II, the corresponding units are Fetch/Decode, Dispatch/Execute, and Retire. There are also some differences in how the two pipeline their bus operations. These two microprocessors have different architectures, but from the point of view of pipelining, the operation structures of both processors are almost the same, with only some minor differences.

References:

1. PPC740/750 User Manual
2. Pentium II Processor Developer's Manual, ftp://download.intel.com/design/pentiumii/manuals/ pdf
3. Computer Organization and Design, John L. Hennessy, Morgan Kaufmann Publishers, Inc., 2nd Edition, 1997