Pipeline Depth Tradeoffs and the Intel Pentium 4 Processor

Similar documents
Intel Pentium 4 Processor on 90nm Technology


Intel 965 Express Chipset Family Memory Technology and Configuration Guide

Intel X38 Express Chipset Memory Technology and Configuration Guide

Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x

Intel Q35/Q33, G35/G33/G31, P35/P31 Express Chipset Memory Technology and Configuration Guide

Intel Media SDK Library Distribution and Dispatching Process

"JAGUAR AMD s Next Generation Low Power x86 Core. Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

The Transition to PCI Express* for Client SSDs

2013 Intel Corporation

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

21152 PCI-to-PCI Bridge

Intel SSD 520 Series Specification Update

Intel Data Direct I/O Technology (Intel DDIO): A Primer >

COLO: COarse-grain LOck-stepping Virtual Machine for Non-stop Service

Intel Desktop Board DG41WV

The Microarchitecture of the Pentium 4 Processor

Intel Desktop Board DG43RK

Intel Desktop Board DG41BI

StrongARM** SA-110 Microprocessor Instruction Timing

Fiber Channel Over Ethernet (FCoE)

Intel Core TM i7-660ue, i7-620le/ue, i7-610e, i5-520e, i3-330e and Intel Celeron Processor P4505, U3405 Series

Benefits of Intel Matrix Storage Technology

New Dimensions in Configurable Computing at runtime simultaneously allows Big Data and fine Grain HPC

DDR2 x16 Hardware Implementation Utilizing the Intel EP80579 Integrated Processor Product Line

Intel Desktop Board DG41TY

Intel Desktop Board DG31PR

Intel Desktop Board D945GCPE Specification Update

Scaling Networking Solutions for IoT Challenges and Opportunities

Intel Desktop Board DP43BF

Intel Desktop Board D945GCPE

Bandwidth Calculations for SA-1100 Processor LCD Displays

Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano

COLO: COarse-grain LOck-stepping Virtual Machine for Non-stop Service. Eddie Dong, Tao Hong, Xiaowei Yang

Intel Desktop Board DQ43AP

Accelerating High-Speed Networking with Intel I/O Acceleration Technology

Intel StrongARM SA-110 Microprocessor

Intel 810 and 815 Chipset Family Dynamic Video Memory Technology

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual

Intel Extreme Memory Profile (Intel XMP) DDR3 Technology

Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

The Pentium 4 and the G4e: An Architectural Comparison

Giving credit where credit is due

Intel Solid-State Drive Pro 2500 Series Opal* Compatibility Guide

Family 10h AMD Phenom II Processor Product Data Sheet

Intel Identity Protection Technology (IPT)

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune

High Performance Computing and Big Data: The coming wave.

Intel 815 Chipset Platform for Use with Universal Socket 370

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

Intel Desktop Board DQ35JO

Intel Ethernet and Configuring Single Root I/O Virtualization (SR-IOV) on Microsoft* Windows* Server 2012 Hyper-V. Technical Brief v1.

NVIDIA GeForce GTX 580 GPU Datasheet

Intel Desktop Board DG33TL

Specification Update. January 2014

Intel Ethernet Switch Converged Enhanced Ethernet (CEE) and Datacenter Bridging (DCB) Using Intel Ethernet Switch Family Switches

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

VLIW Processors. VLIW Processors

AMD Opteron Quad-Core

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

Intel Desktop Board DG31GL

Intel Core i5 processor 520E CPU Embedded Application Power Guideline Addendum January 2011

Intel Desktop Board D945GCL

PROBLEMS #20,R0,R1 #$3A,R2,R4

Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager

Contributed Article Program and Intel DPD Search Optimization Training. John McHugh and Steve Moore January 2012

Intel Desktop Board DP55WB

Intel Desktop Board DQ965GF

Addendum Intel Architecture Software Developer s Manual

Intel 865G, Intel 865P, Intel 865PE Chipset Memory Configuration Guide

Breakthrough AES Performance with. Intel AES New Instructions

Pipelining Review and Its Limitations

Introduction to Microprocessors

Introduction to PCI Express Positioning Information

Intel Platform and Big Data: Making big data work for you.

Evaluating Intel Virtualization Technology FlexMigration with Multi-generation Intel Multi-core and Intel Dual-core Xeon Processors.

with PKI Use Case Guide

Family 12h AMD Athlon II Processor Product Data Sheet

Power Benefits Using Intel Quick Sync Video H.264 Codec With Sorenson Squeeze

Accelerating Business Intelligence with Large-Scale System Memory

Performance Analysis and Software Optimization on Systems Using the LAN91C111

2

Intelligent Business Operations

Five Families of ARM Processor IP

Intel Identity Protection Technology Enabling improved user-friendly strong authentication in VASCO's latest generation solutions

Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture

Intel Desktop Board DG45FC

Intel Desktop Board DQ45CB

Intel RAID RS25 Series Performance

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Intel 845G/845GL/845GV Chipset

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Intel IXP42X Product Line of Network Processors and IXC1100 Control Plane Processor: Spread-Spectrum Clocking to Reduce EMI

Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 processors By Dr David Levinthal PhD. Version 1.0

Putting it all together: Intel Nehalem.

Transcription:

Pipeline Depth Tradeoffs and the Intel Pentium 4 Processor Doug Carmean Principal Architect Intel Architecture Group August 21, 2001

Agenda Review Pipeline Depth Execution Trace Cache L1 Data Cache Summary

Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel s Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. This document contains information on products in the design phase of development. Do not finalize a design with this information. Revised information will be published when the product is available. Verify with your local sales office that you have the latest datasheet before finalizing a design. Intel processors may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel, Pentium, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and foreign countries. Copyright (2001) Intel Corporation.

Intel Netburst TM Micro-architecture vs P6 1 2 3 4 5 6 7 8 9 10 Fetch Fetch Decode Basic P6 Pipeline Decode Decode Rename Intro at 733MHz.18µ ROB Rd Rdy/Sch Dispatch Exec Basic Pentium 4 Processor Pipeline 1 2 3 4 5 6 7 8 9 10 11 12 TC Nxt IP TC Fetch Drive Alloc Rename Que Sch Sch Sch 13 14 Disp Disp 15 16 17 18 19 20 RF RF Intro at 1.5GHz Ex.18µ Flgs Br Ck Drive Hyper pipelined Technology enables industry leading performance and clock rate

Hyper Pipelined Technology Frequency Intro Today 1.5GHz 2.0GHz 1.8GHz 20 Intel Netburst Micro-Architecture 1.2GHz 10 P6 Micro-Architecture 166MHz 60MHz Introduction Time 233MHz P5 Micro-Architecture 5

Deeper Pipelines are Better 120% Performance Improvement 100% 80% 60% 40% 20% 2 MB 1 MB 512 KB 256 KB 0% 10 15 20 25 30 Pipeline Depth Source: Average of 2000 application segments from performance simulations

Why not deeper pipelines? Increases complexity Harder to balance More challenges to architect around More algorithms Greater validation effort Need to pipeline the wires Overall Engineering Effort Increases Quickly as Pipeline depth increases

Performance High bandwidth front end Low latency core Lower memory latency High Bandwidth Front End

Higher Frequency increases requirements of front end Branch prediction is more important So we improved it Need greater uop bandwidth Branches constantly change the flow Need to decode more instructions in parallel

Block Diagram System Bus 64-bit wide Instruction TLB Instruction Decoder Dynamic BTB 4K Entries Bus Interface Unit Quad Pumped 3.2 GB/s Micro Instruction Sequencer Memory µop Queue Memory Execution Trace Cache 12K µops uops Integer Schedulers Slow Fast Int Allocator / Register Renamer Trace Cache BTB 512 Entries Integer/Floating Point µop Queue Fast Int Floating Point Schedulers FP Move FP Gen Integer Register File / Bypass Network FP Register / Bypass L2 Cache 64GB/s Slow ALU Complex Instr. 2xAGU Ld/St Address unit 2xALU Simple Instr. 2xALU Simple Instr. FP Move Fmul FAdd Fmul FAdd Fmul FAdd 256-bit wide L1 Data Cache 8Kbyte 4-way4

Execution Trace Cache 1 cmp 2 br -> > T1..... (unused code) T1: 3 sub 4 br -> > T2..... (unused code) T2: 5 mov 6 sub 7 br -> > T3..... (unused code) T3: 8 add 9 sub 10 mul 11 cmp 12 br -> > T4 Trace Cache Delivery 1 cmp 2 br T1 3 T1: sub 4 br T2 5 mov 6 sub 7 br T3 8 T3:add 9 sub 10 mul 11 cmp 12 br T4

Execution Trace Cache P6 Microarchitecture 1 cmp 2 br T1 3 T1: sub 4 br T2 5 mov 6 sub 7 br T3 Trace Cache Delivery 1 cmp 2 br T1 3 T1: sub 4 br T2 5 mov 6 sub 7 br T3 8 T3:add 9 sub 10 mul 11 cmp 12 br T4 8 T3:add 9 sub 10 mul 11 cmp 12 br T4 BW = 1.5 uops/ns BW = 6 uops/ns

Inside the Execution Trace Cache Instruction Pointer 0x0900 Set 0 Set 1 Set 2 Set 3 Set 4 Way 0 Way 1 Way 2 Way 3 head body 1 body 2 tail head body 1 body 2 tail cmp, br T1, T1:sub, br T2, mov, sub br T3, T3:add, sub, mul, cmp, br T4 T4:add, sub, mov, add, add, mov add, sub, mov, add, add, mov

Self Modifying Code Programs that modify the instruction stream that is being executed Very common in Java* code from JITs Requires hardware mechanisms to maintain consistency *Other names and brands may be claimed as the property of others.

Self Modifying Code The hardware needs to handle two basic cases: Stores that write to instructions in the Trace Cache Instruction fetches that hit pending stores Speculative Committed

Case 1: Stores to cached instructions addr addr in use bits Execution Core Instruction TLB (128 entries) addr Trace Cache Data TLB Store s Physical Address

Case 2: Fetches to pending stores addr addr Instruction Pointer Instruction TLB (128 entries) addr Please Re-Fetch Please Re-Fetch Committed Store Buffer Speculative Execution Core Write Combining Buffer Please Flush Pipeline

Execution Trace Cache Provides higher bandwidth for higher frequency core Reduces fetch latency Requires new fundamentally new algorithms

Performance High bandwidth front end Low latency core Lower memory latency Low Latency Core

L1 Data Cache System Bus 64-bit wide Instruction TLB Instruction Decoder Dynamic BTB 4K Entries Bus Interface Unit Quad Pumped 3.2 GB/s Micro Instruction Sequencer Memory µop Queue Memory Execution Trace Cache 12K µops Integer Schedulers Slow Fast Int Allocator / Register Renamer Trace Cache BTB 512 Entries Integer/Floating Point µop Queue Fast Int Floating Point Schedulers FP Move FP Gen Integer Register File / Bypass Network FP Register / Bypass L2 Cache 64GB/s Slow ALU Complex Instr. 2xAGU Ld/St Address unit 2xALU Simple Instr. 2xALU Simple Instr. FP Move Fmul FAdd Fmul FAdd Fmul FAdd 256-bit wide L1 Data Cache 8Kbyte 4-way4 L1 Data Cache 8Kbyte 4-way

L1 Cache is 3x Faster P6: 33 clocks @ 1GHz Pentium 4 Processor : 22 clocks @ 2GHz 3ns 1ns Lower Latency is is Higher Performance

L1 Data Cache Pipeline Stages VA 15:0 VA 31:16 2x Clock tag data Fast SB Data VA 31:0 TLB TAG Slow SB replay replay replay

L1 Data Cache VA 15:0 VA 31:16 2x Clock tag data Fast SB Data VA 31:0 TLB TAG Slow SB Mini Tag Way Predictor (Tag array) Data array Way select Hit (Replay) 10:6 15:11 19:16

Performance High bandwidth front end Low latency core Lower memory latency Lower Memory Latency

Reducing Latency As frequency increases, it is important to improve the performance of the memory subsystem Data Prefetch Logic Watches processor memory traffic Looks for patterns Initiates accesses

Data Prefetch Logic System Bus 64-bit wide Instruction TLB Instruction Decoder Dynamic BTB 4K Entries Bus Interface Unit Quad Data Pumped Prefetch 3.2 Logic GB/s Micro Instruction Sequencer Memory µop Queue Memory Execution Trace Cache 12K µops Integer Schedulers Slow Fast Int Allocator / Register Renamer Trace Cache BTB 512 Entries Integer/Floating Point µop Queue Fast Int Floating Point Schedulers FP Move FP Gen Integer Register File / Bypass Network FP Register / Bypass L2 Cache 64GB/s Slow ALU Complex Instr. 2xAGU Ld/St Address unit 2xALU Simple Instr. 2xALU Simple Instr. FP Move Fmul FAdd Fmul FAdd Fmul FAdd 256-bit wide L1 Data Cache 8Kbyte 4-way4

Data Prefetch Logic Instruction Buffers 64 Instruction Fetch Data Prefetch Logic L2 Advanced Transfer Cache Bus Queue L1 Data Cache 256 Prefetch logic first checks L2 cache and then fetches lines from memory that miss L2 cache.

Data Prefetch Logic Watches for streaming memory access patterns Can track 8 multiple independent streams Loads, Stores or Instruction Forward or Backward Analysis on 32 byte cache line granularity Looks for mostly complete streams: Access to cache lines 1,2,3,4,5,6 will prefetch Access to cache lines 1,2, 4,5,6 will prefetch 1,,3,,,6,,,9 will not prefetch

Performance High bandwidth front end Low latency core Lower memory latency

Summary The Pentium 4 Processor s deep pipelines provide high performance by enabling high frequency Deep pipelines are more difficult to engineer Larger caches further improve benefits of Pentium 4 Processor s deep pipeline Compilers further increase benefits of deep pipelines by removing hazards