Intel Pentium 4 Processor on 90nm Technology



Ronak Singhal
August 24, 2004
Hot Chips 16

Agenda
- Netburst Microarchitecture Review
- Microarchitecture Features
- Hyper-Threading Technology
- SSE3
- Intel Extended Memory 64 Technology
- Performance Results

Vital Statistics
- 125 million transistors
- 112 mm² die size
- 90nm manufacturing process
- Introduced in Feb. 2004 at >3GHz

Intel Pentium 4 Processor Block Diagram
[Block diagram. Labels: System Bus; Bus Interface Unit (Quad Pumped, 6.4 GB/s, 64-bit wide); L2 Cache (1 MB 8-way), 256-bit wide; Instruction TLB; Instruction Decoder; Micro Instruction Sequencer; Execution Trace Cache (12K µops); Dynamic BTB (4K Entries); Trace Cache BTB (512 Entries); Allocator / Register Renamer; Memory µop Queue; Integer/Floating Point µop Queue; Memory, Fast Int, Fast Int, Slow, FP Move, FP Gen; Integer Register File / Bypass Network; FP Register / Bypass; 2xAGU (Ld/St Address unit); 2xALU Simple; 2xALU Simple; Slow ALU Complex; FP Move; FP MMX SSE SSE2 SSE3; L1 Data Cache 8Kbyte 4-way; L1 Data Cache 16Kbyte 8-way.]
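The 6.4 GB/s figure on the system bus can be checked with a short calculation. This is a sketch only: the slide gives the 64-bit width and "quad pumped" (four transfers per bus clock), while the 200 MHz base bus clock used here is an assumption that is not stated on the slide.

```python
# Peak system-bus bandwidth = bytes per transfer * transfers per clock * clock rate.
BUS_WIDTH_BITS = 64          # from the slide
TRANSFERS_PER_CLOCK = 4      # "quad pumped", from the slide
BASE_CLOCK_HZ = 200_000_000  # ASSUMPTION: 200 MHz base bus clock, not on the slide


def peak_bandwidth_bytes_per_s(width_bits, transfers_per_clock, clock_hz):
    """Return the peak transfer rate of the bus in bytes per second."""
    return (width_bits // 8) * transfers_per_clock * clock_hz


peak = peak_bandwidth_bytes_per_s(BUS_WIDTH_BITS, TRANSFERS_PER_CLOCK, BASE_CLOCK_HZ)
print(f"{peak / 1e9:.1f} GB/s")
```

Under that assumed clock the result matches the 6.4 GB/s shown in the diagram.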

Key Characteristics
[Block diagram repeated from the previous slide, with two callouts:]
- Trace Cache instead of conventional I-Cache
- 2x Frequency Execution Core for High Throughput

New Microarchitecture Features
- Larger Caches
- Deeper Buffers
- Faster Execution Units
- Algorithmic Enhancements

Cache Comparison
                        130nm                          90nm
1st level data cache    8KB, 4-way, write-through      16KB, 8-way, write-through
2nd level data cache    512KB, 8-way, write-back       1MB, 8-way, write-back
Trace Cache             12K µops                       12K µops
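One way to read the 1st level data cache row of the comparison is to compute the number of sets from size and associativity. This is a rough sketch: the 64-byte line size is an assumption (the slide does not state it), and `num_sets` is a hypothetical helper name.

```python
# sets = cache size / (ways * line size)
LINE_BYTES = 64  # ASSUMPTION: 64-byte L1 data cache lines, not stated on the slide


def num_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)


# Sizes and associativity taken from the Cache Comparison table.
l1d_130nm = num_sets(8 * 1024, 4)
l1d_90nm = num_sets(16 * 1024, 8)

print("130nm L1D sets:", l1d_130nm)
print("90nm  L1D sets:", l1d_90nm)
```

With that line size, both generations have the same number of sets, which suggests the capacity was doubled by doubling associativity rather than by adding index bits.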

Larger Buffers
                                           130nm    90nm
ROB Size                                   126      126
Load Buffers                               48       48
Store Buffers                              24       32
Write Combining Buffers                    6        8
Outstanding 1st level Data Cache Misses    4        8
FP Schedulers                              10/12    14/16

Faster Execution Units
- Shifts
    - Typical shifts now handled inside the fast execution core with single-cycle latency
    - Previously handled in the complex integer unit with 6-cycle latency
- Integer Multiply
    - Adds a dedicated integer multiplier
    - Previously handled by the FP multiplier

Algorithmic Enhancements
- Branch Prediction
- Hardware Prefetching

Branch Prediction
- Continued improvement of existing algorithms
- Improved static prediction algorithm
    - Displacement check
    - Condition check
- Added indirect branch predictor
    - Idea first introduced on Intel Pentium M processor

Branch Predictor Comparison
# of Branch Mispredicts Per 100 Instructions on SPECint*_base2000
Benchmark       130nm    90nm
164.gzip        1.03     1.01
175.vpr         1.32     1.21
176.gcc         0.85     0.70
181.mcf         1.35     1.22
186.crafty      0.72     0.68
197.parser      1.06     0.87
252.eon         0.44     0.39
253.perlbmk     0.62     0.28
254.gap         0.33     0.24
255.vortex      0.08     0.09
256.bzip2       1.19     1.12
300.twolf       1.32     1.23
* Other names and brands are the property of their respective owners

Hardware Prefetching
- Primary mechanism to hide DRAM latency
- Processor predicts what data will be needed in the future and proactively fetches it from DRAM
- Exists on all Intel Pentium 4 Processor implementations
- 90nm version improves on what data to get and when to get it
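Returning to the branch predictor comparison, the per-benchmark change can be expressed as a percentage reduction in mispredicts. The sketch below assumes the flattened values pair up as twelve 130nm figures followed by twelve 90nm figures, in benchmark order; `percent_reduction` is a hypothetical helper.

```python
# Mispredicts per 100 instructions: (130nm, 90nm), as read from the slide.
MISPREDICTS = {
    "164.gzip": (1.03, 1.01),
    "175.vpr": (1.32, 1.21),
    "176.gcc": (0.85, 0.70),
    "181.mcf": (1.35, 1.22),
    "186.crafty": (0.72, 0.68),
    "197.parser": (1.06, 0.87),
    "252.eon": (0.44, 0.39),
    "253.perlbmk": (0.62, 0.28),
    "254.gap": (0.33, 0.24),
    "255.vortex": (0.08, 0.09),
    "256.bzip2": (1.19, 1.12),
    "300.twolf": (1.32, 1.23),
}


def percent_reduction(old, new):
    """Percentage drop from old to new (negative means it got worse)."""
    return 100.0 * (old - new) / old


reductions = {name: percent_reduction(old, new) for name, (old, new) in MISPREDICTS.items()}

for name, pct in reductions.items():
    print(f"{name:12s} {pct:6.1f}%")

best = max(reductions, key=reductions.get)
print("largest reduction:", best)
```

Under that pairing, 253.perlbmk shows the largest reduction and 255.vortex is the only benchmark whose mispredict rate rises slightly.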

Hardware Prefetching
[Bar chart: Impact of HW prefetcher on most sensitive benchmarks in SPEC CPU2000. Series: HWP Disabled, HWP Enabled. Benchmarks: 254.gap, 191.fma3d, 178.galgel, 187.facerec, 171.swim, 168.wupwise, 173.applu, 189.lucas, 181.mcf, 172.mgrid, 183.equake. Data labels range from 1.16 to 1.97.]

Hyper-Threading Technology
- Makes a single processor look like two processors to software
- Takes advantage of underutilized resources when running a single thread through the processor

Hyper-Threading Technology requires a computer system with an Intel Pentium 4 processor supporting HT Technology and a Hyper-Threading Technology enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. See http://www.intel.com/info/hyperthreading/ for more information including details on which processors support HT Technology.

Hyper-Threading Technology Improvements
- 1st level data cache
    - Uses partial virtual address index
    - Aliasing can occur due to stacks of two threads being offset by a fixed amount
    - Use context identifier to differentiate between data from different threads
    - Better than thread identifier to allow data sharing between threads
    - Introduced on later steppings of 130nm version
- Parallel Operations
    - Allow page walks and split memory access handling in parallel
    - Allow multiple page walks if one goes to DRAM
- Buffer sizes
    - Motivated increase in # of outstanding 1st level cache misses

SSE3
- 13 new instructions
    - x87 to integer conversion
    - Graphics (Horizontal Add/Subtract)
    - Complex arithmetic
    - Video Encoding
    - Thread Synchronization
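The 1st level data cache aliasing point under Hyper-Threading Technology Improvements can be illustrated with a toy model. Everything specific here is an assumption made for illustration: the 16-bit partial address width, the 1 MiB stack offset, the example addresses, and how a context identifier is assigned. The slides give none of these details.

```python
# Toy model: the cache compares only the low bits of the virtual address.
PARTIAL_BITS = 16        # ASSUMPTION: hypothetical partial-address width
STACK_OFFSET = 1 << 20   # ASSUMPTION: hypothetical fixed offset between stacks


def partial_tag(vaddr):
    """Low-order bits of the virtual address used for the early match."""
    return vaddr & ((1 << PARTIAL_BITS) - 1)


def tags_match(a, b, use_context):
    """a and b are (virtual_address, context_id) pairs."""
    same_partial = partial_tag(a[0]) == partial_tag(b[0])
    if not use_context:
        return same_partial
    return same_partial and a[1] == b[1]


# Two threads from different address spaces, stacks offset by a fixed amount.
thread0 = (0x7FF0_1234, 0)
thread1 = (0x7FF0_1234 + STACK_OFFSET, 1)

# Two threads sharing one address space (same context) and the same data.
sharer0 = (0x0040_2000, 0)
sharer1 = (0x0040_2000, 0)

print("alias without context id:", tags_match(thread0, thread1, use_context=False))
print("alias with context id:   ", tags_match(thread0, thread1, use_context=True))
print("sharing still matches:   ", tags_match(sharer0, sharer1, use_context=True))
```

In this model the offset stacks collide when only partial address bits are compared, the context identifier removes that false match, and threads in the same context can still hit on shared data, which a per-thread identifier would prevent.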

Complex Arithmetic
- MOVDDUP, MOVSHDUP, MOVSLDUP
    - Instructions to load and duplicate data implicitly
- ADDSUBPS, ADDSUBPD
    - Perform a mix of addition and subtraction simultaneously
- 10-20% gain on 168.wupwise from these instructions (complex matrix multiply)

Video Encoding
- Motion Estimation compares previous frame to current frame
    - Loads from the previous frame are unaligned
    - Leads to costly cache line split memory accesses
- LDDQU instruction loads 128 bits at an arbitrary alignment with no cache line split
- Speedups of > 10% on MPEG-4 encoders
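To show why the complex arithmetic instructions help, here is a small Python model of the lane-level behaviour of MOVSLDUP, MOVSHDUP, and ADDSUBPS applied to a packed complex multiply. It is a sketch of the data flow only, not real SIMD code. The list-based "registers", the `swap_pairs` helper (standing in for an ordinary shuffle), and the function names are all illustrative assumptions.

```python
# Each "register" is a list of four floats: [re0, im0, re1, im1].

def movsldup(v):
    """Duplicate the even (low) element of each pair."""
    return [v[0], v[0], v[2], v[2]]


def movshdup(v):
    """Duplicate the odd (high) element of each pair."""
    return [v[1], v[1], v[3], v[3]]


def addsubps(a, b):
    """Subtract in even lanes, add in odd lanes."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]


def mulps(a, b):
    """Lane-wise multiply."""
    return [x * y for x, y in zip(a, b)]


def swap_pairs(v):
    """Swap real and imaginary lanes (models a plain shuffle, not an SSE3 op)."""
    return [v[1], v[0], v[3], v[2]]


def complex_mul_packed(a, b):
    """Multiply two pairs of complex numbers held as [re, im, re, im]."""
    t_re = mulps(movsldup(a), b)              # [ar*br, ar*bi, ...]
    t_im = mulps(movshdup(a), swap_pairs(b))  # [ai*bi, ai*br, ...]
    return addsubps(t_re, t_im)               # [ar*br - ai*bi, ar*bi + ai*br, ...]


a = [1.0, 2.0, 5.0, -1.0]   # (1+2j), (5-1j)
b = [3.0, 4.0, 2.0, 3.0]    # (3+4j), (2+3j)
result = complex_mul_packed(a, b)
print(result)
```

The mixed subtract/add in a single step is what produces the real part (a difference) and the imaginary part (a sum) side by side without extra sign-flipping operations.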

Thread Synchronization
- Used to indicate that a thread is spinning and waiting for work
- Allows processor to go into an optimized state
- MONITOR
    - Sets up address monitoring hardware
- MWAIT
    - Sets processor into optimized state
    - Will wake up when the monitored address is written to

Intel Extended Memory 64 Technology
- Additional capability in today's Intel Xeon processors on top of:
    - Netburst Microarchitecture
    - Hyper-Threading Technology
    - SSE3
- Provides 8 more integer and SSE registers
- Larger addressing capability
    - 48 bits of virtual address on this implementation
    - 36-40 bits of physical address on this implementation
- Full 64-bit support carefully engineered into the 90nm design
    - Limited differences between 32-bit and 64-bit operations
    - Similar optimizations for 32-bit and 64-bit code
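The address widths listed for Intel Extended Memory 64 Technology translate into address-space sizes as follows. This is plain arithmetic on the bit counts from the slide. The 32-bit line is added only for comparison, and `human` is a hypothetical formatting helper.

```python
def addressable_bytes(bits):
    """Number of distinct byte addresses for a given address width."""
    return 1 << bits


def human(n):
    """Format a power-of-two byte count using binary units."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB", "PiB"):
        if n < 1024:
            return f"{n} {unit}"
        n //= 1024
    return f"{n} EiB"


for label, bits in [
    ("32-bit address (for comparison)", 32),
    ("physical, low end", 36),
    ("physical, high end", 40),
    ("virtual", 48),
]:
    print(f"{label:32s} {bits} bits -> {human(addressable_bytes(bits))}")
```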

32-bit vs. 64-bit Comparison
                     32-bit                          64-bit
ALU Latency          1 cycle                         1 cycle
ALU Throughput       4 operations/cycle              4 operations/cycle
Memory Throughput    1 load + 1 store per cycle      1 load + 1 store per cycle
Page walks           2 levels; use PDE cache to      4 levels; use PDE cache to
                     reduce to 1 level in the        reduce to 1 level in the
                     common case                     common case
Enable strong 64-bit performance without compromising 32-bit performance

Optimizing 64-bit code
- Rule #1: Follow 32-bit optimizations
- Rule #2: Compile with Pentium 4 specific optimizations enabled
- Few additional new rules:
    - When data size is 32 bits, typically use 32-bit instructions
        - Example: XOR EAX, EAX instead of XOR RAX, RAX
    - But sign extend to full 64 bits instead of only 32 bits (even for 32-bit data size)
        - Example: Load 16 bits, sign extend to RAX not EAX
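The two examples under "Optimizing 64-bit code" can be modelled with integer arithmetic. The sketch relies on two x86-64 facts that are not on the slide: a write to a 32-bit register zero-extends into the full 64-bit register, and the 64-bit form of XOR carries an extra REX prefix byte. The function names are hypothetical, and the slide itself does not say which of these effects motivates the rules.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def write_eax(value32):
    """Writing EAX clears the upper 32 bits of RAX; returns the new RAX."""
    return value32 & MASK32


def xor_eax_eax(rax):
    """XOR EAX, EAX: the result is written through EAX."""
    low = rax & MASK32
    return write_eax(low ^ low)


def xor_rax_rax(rax):
    """XOR RAX, RAX: full 64-bit operation."""
    return (rax ^ rax) & MASK64


def sign_extend(value, from_bits, to_bits):
    """Sign-extend a from_bits-wide value to to_bits, as an unsigned pattern."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):
        value -= 1 << from_bits
    return value & ((1 << to_bits) - 1)


# Machine-code bytes for the two XOR forms (the 64-bit one adds REX.W = 0x48).
ENCODING = {
    "xor eax, eax": bytes([0x31, 0xC0]),
    "xor rax, rax": bytes([0x48, 0x31, 0xC0]),
}

rax = 0xDEADBEEF_CAFEBABE
print(hex(xor_eax_eax(rax)), hex(xor_rax_rax(rax)))   # both clear all of RAX

word = 0xFFFE                                          # 16-bit value -2
to_rax = sign_extend(word, 16, 64)                     # sign extend to RAX
to_eax = write_eax(sign_extend(word, 16, 32))          # sign extend to EAX only
print(hex(to_rax), hex(to_eax))
```

Both XOR forms leave RAX at zero, so the shorter 32-bit encoding loses nothing. Sign-extending only to EAX, by contrast, leaves the upper half of RAX zero, so the register no longer holds -2 when it is later read as a 64-bit value.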

Performance Results
[Bar chart. Series: HT Technology Disabled, HT Technology Enabled. Applications: Adobe* Photoshop* CS, Windows Media* Encoder 9.0, Maxon Cinema* 4D, MainConcept* 1.4, MAGIX* mp3 maker 2004 diamond. Data labels: 1.25, 1.28, 1.19, 1.20, 1.12. Y-axis from 0.00 to 1.50.]

Source: Intel
Configuration: Intel Pentium 4 processor with HT Technology 3.40E GHz, Intel D875PBZ Desktop Board (AA-301); All Platforms 1GB DDR400 CL3-3-3, ATI* Radeon* 9800 Pro AGP graphics, ATI* Catalyst* 3.5 Driver Suite: display driver 6.14.10.6360, Intel Application Accelerator RAID Edition 3.5 with RAID ready, Intel Chipset Software Installation Utility 5.01.1015, Seagate* Barracuda* 7200 Serial ATA 160GB Hard Drive - ST3160023AS, Intel C & Fortran compilers 8.0, DirectX* 9.0b, Windows* XP Build 2600 SP1, Intel PRO/1000 MT Desktop Adapter.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

Performance Results
[Bar chart. Series: Intel Pentium 4 Processor with HT Technology 3.40 GHz, Intel Pentium 4 Processor with HT Technology 3.40E GHz. Benchmarks: SPECint*_base2000, SPECfp*_base2000. Data labels: 1.14, 1.07. Y-axis from 0.00 to 1.50.]

Source: Intel
Configuration: Intel Pentium 4 processor with HT Technology 3.40 GHz, Intel D875PBZ Desktop Board (AA-204); Intel Pentium 4 processor with HT Technology 3.40E GHz, Intel D875PBZ Desktop Board (AA-301); All Platforms 1GB DDR400 CL3-3-3, ATI* Radeon* 9800 Pro AGP graphics, ATI* Catalyst* 3.5 Driver Suite: display driver 6.14.10.6360, Intel Application Accelerator RAID Edition 3.5 with RAID ready, Intel Chipset Software Installation Utility 5.01.1015, Seagate* Barracuda* 7200 Serial ATA 160GB Hard Drive - ST3160023AS, Intel C & Fortran compilers 8.0, DirectX* 9.0b, Windows* XP Build 2600 SP1, Intel PRO/1000 MT Desktop Adapter.

The End