Intel Pentium 4 Processor on 90nm Technology
Ronak Singhal
August 24, 2004
Hot Chips 16

Agenda
- Netburst Microarchitecture Review
- Microarchitecture Features
- Hyper-Threading Technology
- SSE3
- Intel Extended Memory 64 Technology
- Performance Results

Vital Statistics
- 125 million transistors
- 112 mm² die size
- 90nm manufacturing process
- Introduced in Feb. 2004 at >3 GHz

Intel Pentium 4 Processor Block Diagram
[Figure: block diagram. Labels recoverable from the diagram: System Bus (quad pumped, 6.4 GB/s, 64-bit wide); Bus Interface Unit; L2 Cache (1 MB, 8-way); 256-bit wide; Instruction TLB; Instruction Decoder; Dynamic BTB (4K entries); Execution Trace Cache (12K µops); Trace Cache BTB (512 entries); Micro Instruction Sequencer; Allocator / Register Renamer; Memory µop Queue; Integer/Floating Point µop Queue; Memory, Fast Int, Fast Int, Slow, FP Move, FP Gen; Integer Register File / Bypass Network; FP Register / Bypass; 2x AGU (Ld/St address unit); 2x ALU (simple), twice; Slow ALU (complex); FP Move; FP MMX SSE SSE2 SSE3; L1 Data Cache, labeled both 8 Kbyte 4-way and 16 Kbyte 8-way.]

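The three system bus figures in the block diagram (64-bit wide, quad pumped, 6.4 GB/s) can be cross-checked with a little arithmetic. The sketch below derives the base bus clock they imply. It assumes that "GB" means 10**9 bytes and that "quad pumped" means four transfers per bus clock; the function and constant names are illustrative only.

```python
# Figures taken from the block diagram: 64-bit wide, quad pumped, 6.4 GB/s.
BUS_WIDTH_BITS = 64
TRANSFERS_PER_CLOCK = 4          # "quad pumped": four transfers per bus clock
PEAK_BYTES_PER_SECOND = 6.4e9    # assumption: GB means 10**9 bytes


def implied_bus_clock_hz(peak_bytes_per_second, width_bits, transfers_per_clock):
    """Return the base bus clock implied by a peak bandwidth figure."""
    bytes_per_transfer = width_bits // 8
    transfers_per_second = peak_bytes_per_second / bytes_per_transfer
    return transfers_per_second / transfers_per_clock


bus_clock_hz = implied_bus_clock_hz(
    PEAK_BYTES_PER_SECOND, BUS_WIDTH_BITS, TRANSFERS_PER_CLOCK
)
print(f"Implied base bus clock: {bus_clock_hz / 1e6:.0f} MHz")
```

Under these assumptions the bus performs 800 million transfers per second on a 200 MHz base clock.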
Key Characteristics
[Figure: the processor block diagram, repeated on two consecutive slides, each highlighting one characteristic.]
- Trace Cache instead of conventional I-Cache
- 2x Frequency Execution Core for High Throughput

New Microarchitecture Features
- Larger Caches
- Deeper Buffers
- Faster Execution Units
- Algorithmic Enhancements

Cache Comparison

                       130nm                         90nm
1st level data cache   8KB, 4-way, write-through     16KB, 8-way, write-through
2nd level data cache   512KB, 8-way, write-back      1MB, 8-way, write-back
Trace Cache            12K µops                      12K µops

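One way to read the cache comparison is in terms of sets: the number of sets is the capacity divided by (ways × line size). The sketch below applies that to the table. The 64-byte line size is an assumption (the slides do not state it), but the 130nm-to-90nm ratios do not depend on it. All names are illustrative.

```python
KB = 1024
LINE_BYTES = 64  # assumption: line size is not given on the slide

# (capacity in bytes, associativity) taken from the Cache Comparison table.
CACHES = {
    "1st level data cache": {"130nm": (8 * KB, 4), "90nm": (16 * KB, 8)},
    "2nd level data cache": {"130nm": (512 * KB, 8), "90nm": (1024 * KB, 8)},
}


def num_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)


def set_ratio(name):
    """Ratio of 90nm sets to 130nm sets; independent of the assumed line size."""
    old = num_sets(*CACHES[name]["130nm"])
    new = num_sets(*CACHES[name]["90nm"])
    return new / old


for name in CACHES:
    print(f"{name}: sets grow by a factor of {set_ratio(name):g}")
```

Read this way, the first-level cache doubles its capacity by doubling associativity with the set count unchanged, while the second-level cache keeps its associativity and doubles its set count.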
Larger Buffers

                                          130nm    90nm
ROB Size                                  126      126
Load Buffers                              48       48
Store Buffers                             24       32
Write Combining Buffers                   6        8
Outstanding 1st level Data Cache Misses   4        8
FP Schedulers                             10/12    14/16

Faster Execution Units
- Shifts
  - Typical shifts now handled inside the fast execution core with single-cycle latency
  - Previously handled in the complex integer unit with 6-cycle latency
- Integer Multiply
  - Adds a dedicated integer multiplier
  - Previously handled by the FP multiplier

Algorithmic Enhancements
- Branch Prediction
- Hardware Prefetching

Branch Prediction
- Continued improvement of existing algorithms
- Improved static prediction algorithm
  - Displacement check
  - Condition check
- Added indirect branch predictor
  - Idea first introduced on the Intel Pentium M processor

Branch Predictor Comparison

               130nm    90nm
164.gzip       1.03     1.01
175.vpr        1.32     1.21
176.gcc        0.85     0.70
181.mcf        1.35     1.22
186.crafty     0.72     0.68
197.parser     1.06     0.87
252.eon        0.44     0.39
253.perlbmk    0.62     0.28
254.gap        0.33     0.24
255.vortex     0.08     0.09
256.bzip2      1.19     1.12
300.twolf      1.32     1.23

# of Branch Mispredicts Per 100 Instructions on SPECint*_base2000
* Other names and brands are the property of their respective owners

Hardware Prefetching
- Primary mechanism to hide DRAM latency
- Processor predicts what data will be needed in the future and proactively fetches it from DRAM
- Exists on all Intel Pentium 4 Processor implementations
- 90nm version improves on what data to get and when to get it

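Returning to the Branch Predictor Comparison table, the per-benchmark change in mispredicts can be worked out directly. The sketch below computes the percentage reduction from 130nm to 90nm for each benchmark. It assumes the flattened slide lists the whole 130nm column first and the 90nm column second; the variable and function names are illustrative.

```python
# Mispredicts per 100 instructions: (130nm, 90nm).
# Assumption: the source lists the full 130nm column, then the 90nm column.
MISPREDICTS = {
    "164.gzip": (1.03, 1.01),
    "175.vpr": (1.32, 1.21),
    "176.gcc": (0.85, 0.70),
    "181.mcf": (1.35, 1.22),
    "186.crafty": (0.72, 0.68),
    "197.parser": (1.06, 0.87),
    "252.eon": (0.44, 0.39),
    "253.perlbmk": (0.62, 0.28),
    "254.gap": (0.33, 0.24),
    "255.vortex": (0.08, 0.09),
    "256.bzip2": (1.19, 1.12),
    "300.twolf": (1.32, 1.23),
}


def percent_reduction(old, new):
    """Percentage by which `new` is lower than `old` (negative if it rose)."""
    return (old - new) / old * 100.0


reductions = {
    name: percent_reduction(old, new) for name, (old, new) in MISPREDICTS.items()
}
largest = max(reductions, key=reductions.get)

for name, value in sorted(reductions.items(), key=lambda item: -item[1]):
    print(f"{name:12s} {value:6.1f}%")
print(f"Largest reduction: {largest}")
```

Under that column assumption, 253.perlbmk shows by far the largest reduction and 255.vortex is the only benchmark whose mispredict rate rises slightly.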
Hardware Prefetching
[Figure: bar chart comparing HWP Disabled and HWP Enabled on 254.gap, 191.fma3d, 178.galgel, 187.facerec, 171.swim, 168.wupwise, 173.applu, 189.lucas, 181.mcf, 172.mgrid and 183.equake. Data labels recoverable from the chart: 1.16, 1.18, 1.21, 1.26, 1.29, 1.30, 1.32, 1.40, 1.45, 1.49, 1.97; vertical axis from 0.00 to 2.00.]
Impact of HW prefetcher on most sensitive benchmarks in SPEC CPU2000

Hyper-Threading Technology
- Makes a single processor look like two processors to software
- Takes advantage of underutilized resources when running a single thread through the processor

Hyper-Threading Technology requires a computer system with an Intel Pentium 4 processor supporting HT Technology and a Hyper-Threading Technology enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. See http://www.intel.com/info/hyperthreading/ for more information including details on which processors support HT Technology.

Hyper-Threading Technology Improvements
- 1st level data cache
  - Uses partial virtual address index
  - Aliasing can occur due to stacks of two threads being offset by a fixed amount
  - Use context identifier to differentiate between data from different threads. Better than thread identifier because it allows data sharing between threads. Introduced on later steppings of the 130nm version.
- Parallel Operations
  - Allow page walks and split memory access handling in parallel
  - Allow multiple page walks if one goes to DRAM
- Buffer sizes
  - Motivated increase in # of outstanding 1st level cache misses

SSE3
- 13 new instructions
  - x87 to integer conversion
  - Graphics (Horizontal Add/Subtract)
  - Complex arithmetic
  - Video Encoding
  - Thread Synchronization

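The first-level data cache aliasing described on the Hyper-Threading Technology Improvements slide can be illustrated with a toy model. This is a sketch under stated assumptions, not the actual hardware lookup: the number of compared address bits (`PARTIAL_BITS`), the stack offset, the addresses and all function names are hypothetical. It shows why a partial virtual-address comparison confuses two stacks that sit a fixed distance apart, and why tagging with a context identifier, rather than a thread identifier, still lets two threads in the same context share data.

```python
PARTIAL_BITS = 16          # hypothetical: number of low address bits compared
STACK_OFFSET = 0x100000    # hypothetical fixed offset between two threads' stacks


def partial_match(vaddr_a, vaddr_b, bits=PARTIAL_BITS):
    """True if the two addresses agree in their low `bits` bits."""
    mask = (1 << bits) - 1
    return (vaddr_a & mask) == (vaddr_b & mask)


def tagged_match(id_a, vaddr_a, id_b, vaddr_b, bits=PARTIAL_BITS):
    """Partial match that also requires the identifier tags to agree."""
    return id_a == id_b and partial_match(vaddr_a, vaddr_b, bits)


# Case 1: two threads in different contexts, stacks offset by a fixed amount.
stack_t0 = 0x0012F3C0
stack_t1 = stack_t0 + STACK_OFFSET
print("partial only, different data:", partial_match(stack_t0, stack_t1))
print("with context id, different data:",
      tagged_match(1, stack_t0, 2, stack_t1))

# Case 2: two threads in the SAME context touching the same shared address.
shared = 0x00403000
print("thread-id tag, shared data:", tagged_match(0, shared, 1, shared))
print("context-id tag, shared data:", tagged_match(7, shared, 7, shared))
```

In this model the plain partial comparison reports a false match for the two stacks, the context-tagged comparison rejects it, and only the context-tagged comparison still matches shared data accessed by two threads of one context.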
Complex Arithmetic
- MOVDDUP, MOVSHDUP, MOVSLDUP
  - Instructions to load and duplicate data implicitly
- ADDSUBPS, ADDSUBPD
  - Perform a mix of addition and subtraction simultaneously
- 10-20% gain on 168.wupwise from these instructions (complex matrix multiply)

Video Encoding
- Motion Estimation compares previous frame to current frame
- Loads from the previous frame are unaligned
  - Leads to costly cache line split memory accesses
- LDDQU instruction loads 128 bits at an arbitrary alignment with no cache line split
- Speedups of > 10% on MPEG-4 encoders

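To see why the complex-arithmetic instructions listed above (MOVSLDUP, MOVSHDUP, ADDSUBPS) suit complex multiplication, the sketch below models their lane semantics in plain Python and multiplies two pairs of complex numbers. Registers are modeled as 4-element lists with element 0 as the lowest lane, holding (re0, im0, re1, im1). The helper `swap_pairs` stands in for an ordinary SSE shuffle; this instruction sequence is one common formulation, not necessarily the one used in the benchmark mentioned on the slide.

```python
def movsldup(src):
    """Duplicate the even (low) elements: [s0, s0, s2, s2]."""
    return [src[0], src[0], src[2], src[2]]


def movshdup(src):
    """Duplicate the odd (high) elements: [s1, s1, s3, s3]."""
    return [src[1], src[1], src[3], src[3]]


def addsubps(a, b):
    """Subtract in even lanes, add in odd lanes."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]


def mulps(a, b):
    """Element-wise multiply (plain SSE, modeled here for completeness)."""
    return [x * y for x, y in zip(a, b)]


def swap_pairs(src):
    """Swap real and imaginary parts within each pair (a shuffle)."""
    return [src[1], src[0], src[3], src[2]]


def complex_multiply(a, b):
    """Multiply two packed pairs of complex numbers laid out as (re, im, re, im)."""
    real_terms = mulps(movsldup(a), b)              # [ar*br, ar*bi, ...]
    imag_terms = mulps(movshdup(a), swap_pairs(b))  # [ai*bi, ai*br, ...]
    return addsubps(real_terms, imag_terms)         # [ar*br-ai*bi, ar*bi+ai*br, ...]


a = [1.0, 2.0, 3.0, -1.0]   # (1+2j), (3-1j)
b = [4.0, -3.0, 0.5, 2.0]   # (4-3j), (0.5+2j)
product = complex_multiply(a, b)
print(product)
```

The mixed subtract/add of ADDSUBPS produces the real part (a difference) and the imaginary part (a sum) in a single step, which is the pattern the slide describes.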
Thread Synchronization
- Used to indicate that a thread is spinning and waiting for work
- Allows processor to go into an optimized state
- MONITOR
  - Sets up address monitoring hardware
- MWAIT
  - Sets processor into optimized state. Will wake up when the monitored address is written to

Intel Extended Memory 64 Technology
- Additional capability in today's Intel Xeon processors on top of:
  - Netburst Microarchitecture
  - Hyper-Threading Technology
  - SSE3
- Provides 8 more integer and SSE registers
- Larger addressing capability
  - 48 bits of virtual address on this implementation
  - 36-40 bits of physical address on this implementation
- Full 64-bit support carefully engineered into the 90nm design
  - Limited differences between 32-bit and 64-bit operations
  - Similar optimizations for 32-bit and 64-bit code

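The address widths quoted above translate into address-space sizes by simple powers of two. The sketch below converts the slide's 48-bit virtual and 36-40-bit physical widths into byte counts, with a 32-bit address space added for comparison; the helper names are illustrative.

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses for a given address width."""
    return 1 << address_bits


def human_readable(num_bytes):
    """Format a power-of-two byte count using binary units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    value = num_bytes
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value} {unit}"
        value //= 1024


for label, bits in [
    ("32-bit address (for comparison)", 32),
    ("physical, lower bound", 36),
    ("physical, upper bound", 40),
    ("virtual", 48),
]:
    print(f"{label}: {bits} bits = {human_readable(addressable_bytes(bits))}")
```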
32-bit vs. 64-bit Comparison

                    32-bit                           64-bit
ALU Latency         1 cycle                          1 cycle
ALU Throughput      4 operations/cycle               4 operations/cycle
Memory Throughput   1 load + 1 store per cycle       1 load + 1 store per cycle
Page walks          2 levels; use PDE cache to       4 levels; use PDE cache to
                    reduce to 1 level in common      reduce to 1 level in common
                    case                             case

Enable strong 64-bit performance without compromising 32-bit performance

Optimizing 64-bit code
- Rule #1: Follow 32-bit optimizations
- Rule #2: Compile with Pentium 4 specific optimizations enabled
- Few additional new rules:
  - When data size is 32 bits, typically use 32-bit instructions
    - Example: XOR EAX, EAX instead of XOR RAX, RAX
  - But sign extend to full 64 bits instead of only 32 bits (even for 32-bit data size)
    - Example: Load 16 bits, sign extend to RAX not EAX

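The two examples on the Optimizing 64-bit code slide rest on how register writes behave in 64-bit mode: a write to a 32-bit register zero-extends into the full 64-bit register. The sketch below models that rule in Python to show what each choice leaves in RAX. The function names are illustrative; the byte sequences are the standard encodings of the two XOR forms, where the 64-bit form needs an extra REX.W prefix byte.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

# Standard encodings: the 64-bit form carries an extra REX.W prefix (0x48).
XOR_EAX_EAX = bytes([0x31, 0xC0])
XOR_RAX_RAX = bytes([0x48, 0x31, 0xC0])


def write_r32(value):
    """Model a 32-bit register write in 64-bit mode: upper 32 bits become zero."""
    return value & MASK32


def xor_eax_eax(rax):
    """XOR EAX, EAX: operates on the low 32 bits, then zero-extends into RAX."""
    low = rax & MASK32
    return write_r32(low ^ low)


def xor_rax_rax(rax):
    """XOR RAX, RAX: operates on all 64 bits."""
    return (rax ^ rax) & MASK64


def sign_extend(value, from_bits, to_bits):
    """Sign-extend the low `from_bits` of `value` to a `to_bits`-wide result."""
    value &= (1 << from_bits) - 1
    sign_bit = 1 << (from_bits - 1)
    signed = (value ^ sign_bit) - sign_bit
    return signed & ((1 << to_bits) - 1)


rax = 0xDEADBEEFCAFEBABE
print(hex(xor_eax_eax(rax)), hex(xor_rax_rax(rax)))   # both clear all of RAX
print(len(XOR_EAX_EAX), len(XOR_RAX_RAX))             # encoding sizes in bytes

loaded = 0x8001  # a negative 16-bit value
to_rax = sign_extend(loaded, 16, 64)             # sign extend to RAX
to_eax = write_r32(sign_extend(loaded, 16, 32))  # sign extend to EAX only
print(hex(to_rax), hex(to_eax))
```

Both XOR forms leave RAX equal to zero, but the 32-bit form is one byte shorter. For the sign-extension rule, the model shows that extending only to EAX leaves the upper half of RAX zero, so the register no longer holds the negative value if it is later used at 64-bit width.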
Performance Results
[Figure: bar chart comparing HT Technology Disabled and HT Technology Enabled on Adobe* Photoshop* CS, Windows Media* Encoder 9.0, Maxon Cinema* 4D, MainConcept* 1.4 and MAGIX* mp3 maker 2004 diamond. Values recoverable from the chart area: 1.25, 1.28, 1.19, 1.20, 1.12; vertical axis from 0.00 to 1.50. The pairing of values with applications is not recoverable.]

Source: Intel
Configuration: Intel Pentium 4 processor with HT Technology 3.40E GHz Intel D875PBZ Desktop Board (AA-301); All Platforms 1GB DDR400 CL3-3-3, ATI* Radeon* 9800 Pro AGP graphics, ATI* Catalyst* 3.5 Driver Suite: display driver 6.14.10.6360, Intel Application Accelerator RAID Edition 3.5 with RAID ready, Intel Chipset Software Installation Utility 5.01.1015, Seagate* Barracuda* 7200 Serial ATA 160GB Hard Drive - ST3160023AS, Intel C & Fortran compilers 8.0, DirectX* 9.0b, Windows* XP Build 2600 SP1, Intel PRO/1000 MT Desktop Adapter.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

Performance Results
[Figure: bar chart comparing the Intel Pentium 4 Processor with HT Technology 3.40 GHz and the Intel Pentium 4 Processor with HT Technology 3.40E GHz on SPECint*_base2000 and SPECfp*_base2000. Data labels recoverable from the chart: 1.14, 1.07; vertical axis from 0.00 to 1.50. The pairing of values with benchmarks is not recoverable.]

Source: Intel
Configuration: Intel Pentium 4 processor with HT Technology 3.40 GHz Intel D875PBZ Desktop Board (AA-204); Intel Pentium 4 processor with HT Technology 3.40E GHz Intel D875PBZ Desktop Board (AA-301); All Platforms 1GB DDR400 CL3-3-3, ATI* Radeon* 9800 Pro AGP graphics, ATI* Catalyst* 3.5 Driver Suite: display driver 6.14.10.6360, Intel Application Accelerator RAID Edition 3.5 with RAID ready, Intel Chipset Software Installation Utility 5.01.1015, Seagate* Barracuda* 7200 Serial ATA 160GB Hard Drive - ST3160023AS, Intel C & Fortran compilers 8.0, DirectX* 9.0b, Windows* XP Build 2600 SP1, Intel PRO/1000 MT Desktop Adapter.

The End