Doug Carmean Principal Architect Intel Architecture Group


Inside the Pentium 4 Processor Micro-architecture: Next Generation IA-32 Micro-architecture. Doug Carmean, Principal Architect, Intel Architecture Group. August 24, 2000. Copyright 2000 Intel Corporation. Intel Labs.

Agenda: IA-32 Processor Roadmap; Design Goals; Frequency; Instructions Per Cycle; Summary.

Pentium 4 Processor: NetBurst Micro-Architecture. (Chart: performance over time, rising from the 486 micro-architecture through the P5 and P6 micro-architectures to the NetBurst micro-architecture today.)

Pentium 4 Processor Design Goals: Deliver world-class performance across both existing and emerging applications. Deliver performance headroom and scalability for the future. A micro-architecture that will drive performance leadership for the next several years.

NetBurst Micro-architecture: 400 MHz System Bus, Advanced Dynamic Execution, Rapid Execution Engine, Advanced Transfer Cache, Hyper Pipelined Technology, Streaming SIMD Extensions 2, Execution Trace Cache, Enhanced Floating Point / Multi-Media.

Pentium 4 Processor Block Diagram: 3.2 GB/s System Interface; BTB & I-TLB; Decoder; Trace Cache with BTB and ucode ROM; Rename/Alloc; 3 uop Queues; Schedulers; Integer and FP register files; execution units (Load, Store, FP move, FP store, FMul, FAdd, MMX, SSE); L2 Cache and Control; L1 D-Cache and D-TLB.

CPU Architecture 101. Delivered Performance = Frequency * Instructions Per Cycle. First factor: frequency.
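
The performance equation on this slide can be checked with a small sketch; the frequencies and IPC values below are hypothetical, chosen only to illustrate the trade-off between the two factors:

```python
def delivered_performance(frequency_hz, ipc):
    # Instructions retired per second = clock frequency * instructions per cycle.
    return frequency_hz * ipc

# Hypothetical parts: a higher-clocked design with lower IPC can still win.
fast_clock = delivered_performance(1.4e9, 1.0)  # 1.4 billion instructions/sec
high_ipc = delivered_performance(1.0e9, 1.2)    # 1.2 billion instructions/sec
```

The rest of the talk covers the two factors in turn: what limits frequency, then how to raise instructions per cycle.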

Frequency. What limits frequency? Process technology and micro-architecture. On a given process technology, fewer gates per pipeline stage will deliver higher frequency. Frequency is driven by micro-architecture.

NetBurst Micro-architecture Pipeline vs. P6. Basic P6 pipeline (10 stages): 1-2 Fetch, 3-5 Decode, 6 Rename, 7 ROB Rd, 8 Rdy/Sch, 9 Dispatch, 10 Exec; introduced at 733 MHz on 0.18µ. Basic Pentium 4 processor pipeline (20 stages): 1-2 TC Nxt IP, 3-4 TC Fetch, 5 Drive, 6 Alloc, 7-8 Rename, 9 Que, 10-12 Sch, 13-14 Disp, 15-16 RF, 17 Ex, 18 Flgs, 19 Br Ck, 20 Drive; introduced at 1.4 GHz on 0.18µ. Hyper Pipelined Technology enables industry-leading performance and clock rate.

Hyper Pipelined Technology. (Chart: frequency over time by pipeline depth. P5 Micro-Architecture, 5 stages: 60 MHz at introduction to 233 MHz. P6 Micro-Architecture, 10 stages: 166 MHz to 1.13 GHz. NetBurst Micro-Architecture, 20 stages: 1.4 GHz today.)

Hyper pipelined Technology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 TC Nxt IP TC Fetch Drive Alloc Rename Que Disp Disp Ex Flgs Br Ck Drive TC Nxt IP: Trace cache next instruction pointer Pointer from the BTB, indicating location of next instruction. Copyright 2000 Corporation. System Interface BTB & I-TLB Decoder L2 Cache and Control BTB Trace Cache ROM Rename/Alloc 3 3 uop Queues edulers Integer FP Fms Fop L1 D-Cache and D-TLB

Hyper Pipelined Technology. TC Fetch: trace cache fetch. Read the decoded instructions (uops) out of the Execution Trace Cache.

Hyper Pipelined Technology. Drive: wire delay. Drive the uops to the allocator.

Hyper Pipelined Technology. Alloc: allocate. Allocate the resources required for execution, including load buffers, store buffers, etc.

Hyper Pipelined Technology. Rename: register renaming. Rename the logical registers (e.g., EAX) onto the physical register space (128 physical registers are implemented).
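
Register renaming can be sketched as a mapping from architectural register names onto a pool of physical registers. The tiny allocator below is illustrative only (a real renamer also reclaims physical registers at retirement, which is omitted here):

```python
def rename(uops, num_physical=128):
    # Each uop is (dest_reg, [source_regs]); every write gets a fresh
    # physical register, removing false WAW/WAR dependences.
    free = list(range(num_physical))
    mapping, renamed = {}, []
    for dst, srcs in uops:
        phys_srcs = [mapping.setdefault(s, free.pop(0)) for s in srcs]
        phys_dst = free.pop(0)
        mapping[dst] = phys_dst
        renamed.append((phys_dst, phys_srcs))
    return renamed

# Two back-to-back writes to EAX land in different physical registers,
# so the second write no longer has to wait on readers of the first.
out = rename([("EAX", []), ("EAX", ["EAX"])])
```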

Hyper Pipelined Technology. Que: write into the uop queue. uops are placed into the queues, where they are held until there is room in the schedulers.

Hyper Pipelined Technology. Sch: schedule. Write into the schedulers and compute dependencies; watch for the dependencies to resolve.

Hyper Pipelined Technology. Disp: dispatch. Send the uops to the appropriate execution unit.

Hyper Pipelined Technology. RF: register file. Read the register file; these are the source(s) for the pending operation.

Hyper Pipelined Technology. Ex: execute. Execute the uops on the appropriate execution port.

Hyper Pipelined Technology. Flgs: flags. Compute the flags (zero, negative, etc.); these are typically the input to a branch instruction.

Hyper Pipelined Technology. Br Ck: branch check. The branch operation compares the actual branch direction with the prediction.

Hyper Pipelined Technology. Drive: wire delay. Drive the result of the branch check to the front end of the machine.

CPU Architecture 101. Delivered Performance = Frequency * Instructions Per Cycle. Second factor: instructions per cycle.

Improving Instructions Per Cycle. Improve efficiency, doing more things in a clock: branch prediction. Reduce the time it takes to do something: reducing latency.

Branch Prediction. Accurate branch prediction is key to enabling longer pipelines. Dramatic improvement over the P6 branch predictor: 8x the size (4K entries); eliminates 1/3 of the mispredictions; proven to be better than all other publicly disclosed predictors (g-share, hybrid, etc.).
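
Intel did not disclose the Pentium 4 predictor's algorithm, so as a stand-in here is the classic 2-bit saturating-counter scheme that the publicly known predictors mentioned on the slide (g-share, hybrid) build upon. This is purely illustrative, not the actual hardware:

```python
class TwoBitPredictor:
    # One 2-bit saturating counter per table entry: values 0-1 predict
    # not-taken, 2-3 predict taken; each update nudges the counter
    # toward the branch's actual outcome.
    def __init__(self, entries=4096):
        self.counters = [2] * entries  # start weakly taken

    def predict(self, pc):
        return self.counters[pc % len(self.counters)] >= 2

    def update(self, pc, taken):
        i = pc % len(self.counters)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

The hysteresis of the 2-bit counter means a single anomalous outcome (e.g. a loop exit) does not flip a strongly established prediction.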

The Execution Trace Cache (highlighted in the processor block diagram).

Execution Trace Cache: an advanced L1 instruction cache. Caches decoded IA-32 instructions (uops), removing decoder pipeline latency. Capacity is ~12K uops. Integrates branches into a single line, following the predicted path of program execution. The Execution Trace Cache feeds the fast execution engine.

Execution Trace Cache example. Original code layout: 1 cmp; 2 br -> T1; ... (unused code); T1: 3 sub; 4 br -> T2; ... (unused code); T2: 5 mov; 6 sub; 7 br -> T3; ... (unused code); T3: 8 add; 9 sub; 10 mul; 11 cmp; 12 br -> T4. Trace cache delivery: 1 cmp, 2 br T1, 3 T1: sub, 4 br T2, 5 mov, 6 sub, 7 br T3, 8 T3: add, 9 sub, 10 mul, 11 cmp, 12 br T4.
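
The idea of packing uops from several predicted-taken basic blocks into one contiguous trace line can be sketched as follows. The block layout mirrors the slide's example, and the 12-uop line length is an assumption chosen to match it:

```python
# Hypothetical program: each block maps to (its uops, predicted next block).
blocks = {
    "A":  (["cmp", "br"], "T1"),
    "T1": (["sub", "br"], "T2"),
    "T2": (["mov", "sub", "br"], "T3"),
    "T3": (["add", "sub", "mul", "cmp", "br"], None),
}

def build_trace(start, blocks, max_uops=12):
    # Follow the predicted path, packing uops from several basic blocks
    # into one trace line; the unused fall-through code never enters it.
    trace, cur = [], start
    while cur is not None and len(trace) < max_uops:
        uops, nxt = blocks[cur]
        trace.extend(uops)
        cur = nxt
    return trace[:max_uops]
```

Fetching from such a trace delivers the taken-branch path at full bandwidth, with no decode latency and no holes at branch targets.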

Advanced Dynamic Execution. Extends the basic features found in the P6 core. Very deep speculative execution: 126 instructions in flight (3x P6); 48 loads (3x P6) and 24 stores (2x P6). Provides a larger window of visibility and better use of execution resources. Deep speculation improves parallelism.

Improving Instructions Per Cycle. Improve efficiency, doing more things in a clock: branch prediction. Reduce the time it takes to do something: reducing latency.

Rapid Execution Engine: dramatically lower latency. P6: 1 clock @ 1 GHz (1 ns). NetBurst micro-architecture: ½ clock @ >1.4 GHz (<0.36 ns).
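
The latency claim is simple arithmetic: the double-pumped integer ALU completes a simple operation in half a core clock, so at the higher NetBurst clock the absolute latency drops well below the P6 figure:

```python
def alu_latency_ns(frequency_ghz, cycles):
    # Latency in nanoseconds = cycles / (clocks per nanosecond).
    return cycles / frequency_ghz

p6_add = alu_latency_ns(1.0, 1.0)        # 1 clock @ 1 GHz   -> 1.0 ns
netburst_add = alu_latency_ns(1.4, 0.5)  # 1/2 clock @ 1.4 GHz -> ~0.357 ns
```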

L1 Data Cache (highlighted in the processor block diagram).

High Performance L1 Data Cache: 8 KB, 4-way set associative, 64-byte lines. Very high bandwidth: 1 load and 1 store per clock. New access algorithms give very low latency: a 2-clock read. The new algorithm enables a faster cache.

Data Speculation. Observation: almost all memory accesses hit in the cache. Optimize for the common case: assume that the access will hit the cache, and use a low-cost mechanism to fix the rare cases that miss. Benefit: reduced latency and significantly higher performance.
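
The payoff of speculating that loads hit can be seen in an expected-latency sketch. The 2-cycle hit time comes from the L1 slide above; the 95% hit rate and 10-cycle miss penalty are assumptions invented for illustration:

```python
def expected_load_latency(hit_rate, hit_cycles, miss_penalty_cycles):
    # Average cycles per load when the common (hit) case is optimized
    # and misses are repaired by a slower fix-up mechanism.
    return hit_rate * hit_cycles + (1 - hit_rate) * miss_penalty_cycles

# Assumed numbers: 95% hits at 2 cycles, misses cost 10 cycles to repair.
avg = expected_load_latency(0.95, 2, 10)
```

Because hits dominate, the average stays close to the fast 2-cycle case even though the occasional miss is expensive.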

Replay. Repairs incorrect speculation: re-execute until correct. Replay is uop-specific: replay the uop that mis-speculated and its dependent uops; independent uops are not replayed. An efficient mechanism to reduce latency.
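
Selective replay, re-executing only the mis-speculated uop and everything that transitively consumed its result, can be sketched as a small dependence walk. The uop names and dependence map below are hypothetical:

```python
def uops_to_replay(deps, misspeculated):
    # deps maps each uop to the set of uops whose results it reads.
    # Start from the mis-speculated uop and close over its consumers.
    replay = {misspeculated}
    changed = True
    while changed:
        changed = False
        for uop, srcs in deps.items():
            if uop not in replay and srcs & replay:
                replay.add(uop)
                changed = True
    return replay

# A load misses: the two adds that consume it replay; the independent mul does not.
deps = {"ld": set(), "add1": {"ld"}, "add2": {"add1"}, "mul": set()}
replayed = uops_to_replay(deps, "ld")
```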

L1 Cache is >2x Faster. P6: 3 clocks @ 1 GHz (3 ns). NetBurst micro-architecture: 2 clocks @ 1.4 GHz (<1.43 ns). Lower latency is higher performance.

Example with higher IPC and a faster clock. Code sequence: Ld, Add, Add, Ld, Add, Add. P6 @ 1 GHz: 10 clocks, 10 ns, IPC = 0.6. NetBurst micro-architecture @ 1.4 GHz: 6 clocks, 4.3 ns, IPC = 1.0.
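
The slide's numbers check out arithmetically:

```python
def ipc(instructions, cycles):
    # Instructions per cycle for a code sequence.
    return instructions / cycles

def elapsed_ns(cycles, frequency_ghz):
    # Wall-clock time for the sequence in nanoseconds.
    return cycles / frequency_ghz

# The 6-instruction Ld/Add/Add/Ld/Add/Add sequence from the slide.
p6_ipc = ipc(6, 10)                 # 0.6
p6_time = elapsed_ns(10, 1.0)       # 10 ns
netburst_ipc = ipc(6, 6)            # 1.0
netburst_time = elapsed_ns(6, 1.4)  # ~4.3 ns
```

The point of the example: lower operation latencies raise IPC on the same code, so the deeper pipeline wins on both factors of the performance equation at once.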

L2 Advanced Transfer Cache (highlighted in the processor block diagram).

L2 ATC Organization: 256 KB, 8-way set associative, 128-byte lines (two 64-byte pieces per line). Holds both data and instructions. High bandwidth: 45 GB/sec @ 1.4 GHz, 2.8x the P6 @ 1 GHz.
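
The bandwidth figure is consistent with a transfer of 32 bytes per core clock (a 256-bit path; that width is an inference from the quoted numbers, not stated on the slide):

```python
def bandwidth_gb_per_sec(bytes_per_clock, frequency_ghz):
    # GB/s = bytes per clock * clocks per nanosecond.
    return bytes_per_clock * frequency_ghz

l2_bw = bandwidth_gb_per_sec(32, 1.4)  # 44.8 GB/s, the quoted ~45 GB/s
p6_bw = bandwidth_gb_per_sec(16, 1.0)  # 16 GB/s for P6 @ 1 GHz
ratio = l2_bw / p6_bw                  # 2.8x, matching the slide
```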

Aggregate Cache Latency: a function of all caches in a processor. Overall effective latency = L1 latency + L1 miss rate * L2 latency + L2 miss rate * DRAM latency. Average cache speed is >1.8x better than the Pentium III processor (average on desktop applications; Pentium III processor @ 1 GHz, Pentium 4 processor @ 1.4 GHz).
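
The formula can be sketched numerically. The latencies and miss rates below are invented for illustration (the slide gives no numbers), and each miss rate is read as a global miss rate, i.e. the fraction of all accesses that reach that level and miss, so the terms simply add:

```python
def effective_latency(l1_lat, l1_miss_rate, l2_lat, l2_miss_rate, dram_lat):
    # Slide's formula: each deeper level contributes its latency weighted
    # by the fraction of all accesses that fall through to it.
    return l1_lat + l1_miss_rate * l2_lat + l2_miss_rate * dram_lat

# Invented example: 2-cycle L1, 10% of accesses miss L1 into a 7-cycle L2,
# and 1% of all accesses fall through to DRAM at 100 cycles.
avg = effective_latency(2, 0.10, 7, 0.01, 100)  # 2 + 0.7 + 1.0 = 3.7 cycles
```

Note how the DRAM term dominates sensitivity: halving the L2 miss rate saves more average latency than shaving a cycle off the L1.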

Comparison (Pentium III Processor vs. Pentium 4 Processor, with relative improvement):
Frequency: 1 GHz vs. 1.4 GHz (1.4x)
Adder speed: 1 ns vs. <0.36 ns (2.8x)
Adder bandwidth: 2 billion/sec vs. 5.6 billion/sec (>2.8x)
L1 cache speed: 3 ns vs. <1.42 ns (2.1x)
L1 cache size: 16 KB vs. 8 KB (0.5x)
L1 cache bandwidth: 16 GB/sec vs. 44.8 GB/sec (2.8x)
L2 cache bandwidth: 16 GB/sec vs. 44.8 GB/sec (2.8x)
Instructions in flight: 40 vs. 126 (3.15x)
Loads in flight: 16 vs. 48 (3x)
Stores in flight: 12 vs. 24 (2x)
Branch targets: 512 vs. 4096 (8x)
Uop fetch bandwidth: 3 billion/sec vs. 4.2 billion/sec (1.4x)

Pentium 4 Processor Summary. A revolutionary new micro-architecture from Intel, designed for the evolving Internet. Design features for a balanced, high-performance platform with scalability and headroom. The world's highest performance desktop processor.