The Orca Chip... Heart of IBM's RISC System/6000 Value Servers
Ravi Arimilli
IBM RISC System/6000 Division
Agenda
- Server Background
- Cache Hierarchy Performance Study
- RS/6000 Value Server System Structure
- Orca Overview
- Orca Design Points
- System Comparisons
- Summary
Background
- 2H94: IBM announced the RISC System/6000 Model J30/R30
[Figure: J30/R30 block diagram. Labels: two 601 processors, each with a CC and "DIR & 1MB"; two MCA IOBBs; address (A) and data (D) connections, labeled 32, to the PowerPC 6XX System Bus; MC with CNTL and DATA SWITCH; 256 to SYSTEM MEMORY.]
Background
- IBM's First PowerPC SMP System
- 1-8 Way SMP Scalable Structure
- Introduction of the PowerPC 6XX System Bus
- Symmetric and Coherent I/O Subsystems
- New and Efficient OS for SMP (AIX Ver. 4)
- Upgradable Processor Cards
  - Higher Frequency Processors
  - New PowerPC Processor Designs (604/604e/620/etc.)
- Baseline Platform to Begin Performance Analysis
Server Workloads
- TPC-C
  - Mix of database transactions with a high multiprogramming level
  - Multiple implementations using different database products
  - Large code and data footprint
  - Not OS intensive (12% OS, 88% DB)
- SPEC SFS 097.laddis
  - NFS file server benchmark
  - Mix of file server transactions with a high multiprogramming level
  - Modest code and data footprint
  - Mostly OS intensive (100% OS)
- SPEC SDM 057.sdet
  - Batch-oriented software development benchmark
  - Modest multiprogramming level
  - Large code and data footprint
  - Mostly OS intensive (50% OS, 50% commands/libraries)
Server Workloads
- SPEC SDM 061.kenbus
  - Interactive-oriented multiuser benchmark
  - High multiprogramming level
  - Large code and data footprint
  - Mostly OS intensive (50% OS, 50% commands/libraries)
- Netperf
  - TCP/IP performance test
  - Mostly sends/receives variable data lengths in loop format
  - Small code footprint, mostly runs in L1 cache
  - Sensitive to processor MHz
- Other
  - SPECint95
  - SPECfp95
  - G92 (Computational Chemistry)
  - Les (Aircraft surface turbulence simulation)
Performance Study
- Performance Tools
  - Software Instruction Trace Tools
  - Hardware Bus Trace Tools
  - Processor Simulators
  - Cache Simulators
  - Memory Simulators
  - System Topology/Interconnect Behaviorals
  - Validation Tools
Performance Study
- L2 Cache Parameters Varied
  - Processor to L2 Bus Ratios (1:1, 3:2, 2:1, etc.)
  - Associativity
  - Cache Line Size
  - Sectors/Cache Line
  - Cache Replacement Algorithms
  - L2 Access Latency
  - L2 Intervention Latency
  - Lookaside vs Inline L2 Caches
  - Unified vs Split I/D Caches
  - Shared L2 Cache vs Dedicated L2 Cache
- System Parameters Varied
  - 601, 604, 604e, and 620 Processor Cores
  - Memory Access Latency
  - System Bus Width
  - System Bus Ratios
  - Memory Bus Width
  - Memory Bus Ratios
  - Switch vs Bus for System Address Bus and Data Bus
  - Switch vs Bus for Memory Bus
  - Number of Processors
  - System Bus Protocols
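A sweep over associativity and line size, two of the L2 parameters listed above, can be illustrated with a toy trace-driven cache model. The Python below is a minimal sketch, not IBM's simulator: the function name `miss_rate`, the LRU-only replacement policy, the 1KB cache, and the two synthetic traces are all assumptions made for this example. It reports misses per access, whereas the study plots misses per instruction.

```python
from collections import OrderedDict


def miss_rate(addresses, cache_bytes, line_bytes, ways):
    """Miss fraction of a set-associative LRU cache on a byte-address trace.

    Hypothetical helper for illustration only; every access is treated
    as a read and there is no sectoring or write policy.
    """
    num_sets = cache_bytes // (line_bytes * ways)
    if num_sets < 1:
        raise ValueError("cache too small for this line size and associativity")
    # One ordered dict per set; insertion order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    total = 0
    for addr in addresses:
        total += 1
        line = addr // line_bytes
        cache_set = sets[line % num_sets]
        tag = line // num_sets
        if tag in cache_set:
            cache_set.move_to_end(tag)          # hit: mark most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)   # evict least recently used
            cache_set[tag] = True
    return misses / total if total else 0.0


# Synthetic trace 1: two addresses that collide in a 1KB direct-mapped cache.
conflict_trace = [0, 1024] * 50
# Synthetic trace 2: sequential 4-byte accesses over 4KB.
sequential_trace = list(range(0, 4096, 4))

for ways in (1, 2):
    print("ways", ways, miss_rate(conflict_trace, 1024, 32, ways))
for line_bytes in (32, 64):
    print("line", line_bytes, miss_rate(sequential_trace, 1024, line_bytes, 2))
```

On these synthetic traces, adding a second way removes the conflict misses, and doubling the line size halves the miss fraction of the sequential scan. Real workload traces, such as those from the tools listed on the previous slide, would be substituted for the synthetic ones.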
Performance Study
[Figure: Effect of Increasing Line Length on Miss Rates (256KB, 8-Way SA). Y axis: L2 cache miss rate per instruction (0 to 3). X axis: line length (32 to 128). Curves: TPC-C, Laddis, Kenbus, Sdet, Netperf.]
Performance Study
[Figure: Effects of Increasing Associativity on Miss Rates (256KB B Line). Y axis: L2 cache miss rate per instruction (0 to 3). X axis: associativity (1 to 8). Curves: TPC-C, Kenbus, Laddis, Sdet, Netperf.]
Performance Study
[Figure: Effect of L2 Latency on Performance of TPC-C. Y axis: relative performance (0.75 to 1.10). X axis: processor-to-L2 critical-word latency in Pclocks (2 to 11). Curves: 1:1 bus, 2:1 bus.]
HW Development Strategy
[Figure: the Performance Study feeds three chip sets (Entry Chip Set, Value Chip Set, Enterprise Chip Set) and Specials; the chip sets correspond to Entry Servers, Value Servers, and Enterprise Servers.]
Structure
RS/6000 Value Server Structure
[Figure: 604e processors (200 MHz), each attached through its own processor bus (1:1 bus frequency, 200 MHz; labeled 32) to an ORCA chip; each ORCA has a 128 connection to the PowerPC 6XX Bus (100 MHz; A 128, D 128); a Memory & I/O Controller connects the bus to SYSTEM MEMORY (128), a FEATURE SLOT (Graphics, Clustering, Additional I/O, etc.) (100 MHz), and PCI buses.]
Orca Overview
Fully integrated Cache, Tags, Status, LRU and Controller
- Cache Controller Features
  - Fully Integrated L2 Cache, Directory, and Controller
  - 256KB or 512KB Cache Sizes
  - 8-Way Set Associative
  - Low Latency L1 Miss, L2 Hit Access
  - 200 MHz Operation (1:1 with CPU Frequency)
  - 32B/Cycle Internal Cache Access (6.4 GB/sec Cache Bandwidth)
  - B L2 Cache Lines (32B L1 Cache Lines)
  - Non-Blocking L2 Cache for Processor Bus Reads and Writes
  - Non-Blocking L2 Directory for Processor Bus Snoops
  - Non-Blocking L2 Directory for System Bus Snoops
  - Single-Cycle Snoop Coherency Resolution
  - High Speed System Bus Intervention Supported
  - Queue Depths Support Processor Bus and System Bus Saturation Limits (Improves Technical/fp Performance)
  - RISC-Style Micro-Architecture (Shallow Logic/Heavily Pipelined)
  - Improved SMP Locks Performance
  - High Speed Cache-Inhibited Stores (Graphics Performance)
  - Single-Bit Error Recovery (ECC) for Internal Cache & Directory
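The 6.4 GB/sec figure follows directly from the 32B/cycle internal access width and the 200 MHz clock. A small Python check, offered as a sketch: the function name `peak_bandwidth_gb_per_s` is hypothetical, and it assumes one full-width access every cycle with 1 GB taken as 10^9 bytes.

```python
def peak_bandwidth_gb_per_s(bytes_per_cycle, clock_mhz):
    """Peak bandwidth in GB/s, assuming one full-width transfer per cycle.

    Hypothetical helper; 1 GB is taken as 10**9 bytes.
    """
    bytes_per_second = bytes_per_cycle * clock_mhz * 1e6
    return bytes_per_second / 1e9


# Values from the slide: 32 bytes per cycle at 200 MHz.
orca_internal = peak_bandwidth_gb_per_s(32, 200)
print(f"Orca internal cache bandwidth: {orca_internal:.1f} GB/s")
```

This is a peak number; sustained bandwidth depends on how often the cache arrays are actually accessed each cycle.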
Orca Overview
- PowerPC 6XX System Bus Features
  - Separate Address and Data Buses (Fully Tagged)
  - Efficient Address and Data Bus Arbitration Protocols
  - Bit Addressing Support
  - Bit and 128-Bit Data Bus Support
  - High Speed Intervention Support
  - Split Flow Control and Coherency Responses
  - Efficient Protocols for SMP Clustering
  - Efficient Protocols for Switched Address and Data Buses
  - High Frequency Capable (Latch-to-Latch Protocols, etc.)
  - System-Centric Bus Architecture (The System Directly Controls All OCD Enables, When Snoopers Sample, etc.)
  - Heavily Pipelined Address/Response Buses
  - Other (Bus Parking, Prefetch Hints, Large Bursts, etc.)
  - Robust Error Recovery
Block Diagram
[Figure: ORCA block diagram. Labels: PowerPC 60X Processor Bus (Add(32), Cntl, Data()); PowerPC 60X Processor Bus Interface Controller; PowerPC Architectural Controller; L2 Cache Controller; Tags/Status; LRU; 256KB or 512KB Cache; Generic Internal Interface (two instances); PowerPC 6XX System Bus Interface Controller; Performance Monitor; Debug Assist; On-Chip Regs; Configuration; Error Handler; JTAG/COP; PowerPC 6XX System Bus (Add(), Cntl, Data(128)).]
ORCA Technology
- 256KB, 512KB L2 Cache
- 0.5um N-well CMOS
- Five-Level Metal
- Local Interconnect
- Leff = 0.25um
- 2.5V Core Voltage
- 3.3V Drivers/Receivers
- 7.5W Typical Power at 200 MHz (estimated)
- 430 I/O Signals, CMOS/TTL Compatible
- LSSD Design, JTAG Compliant
- 32mm Ball Grid Array
Structure
- Alternative Option via RS/6000 J30/R30 Parts
[Figure: block diagram with two 604e processors (200 MHz), each with a CC and "DIR & 2MB"; two MCA IOBBs; address (A) and data (D) connections, labeled 32, to the PowerPC 6XX System Bus (100 MHz); MC with CNTL and DATA SWITCH; 256 to SYSTEM MEMORY.]
System Comparisons

Parameters                   | System w/R30 Parts | System w/Value Parts
-----------------------------|--------------------|---------------------
Processor/Frequency          | 604e @ 200 MHz     | 604e @ 200 MHz
Processor I/D, Associativity | 32K/32K, 4-Way     | 32K/32K, 4-Way
Processor Bus Frequency      | 100 MHz (2:1)      | 200 MHz (1:1)
L2 Access Latency (Pclks)    | 9-2-2-2            | 3-1-1-1
L2 Cache Size                | 512K, 1MB, 2MB     | 256K, 512K
L2 Associativity             | 1-Way              | 8-Way
L2 Cache Line Size           | 32 Byte            | Byte
L2 Sectors/Cache Line        | 1 Sector           | 2 Sectors
L2 Inline or Lookaside       | Inline             | Inline
L1 Inclusivity               | Imprecise          | Precise
L2 Unified or Split I/D      | Unified            | Unified
L2 Shared or Dedicated       | Dedicated          | Dedicated
SB Address Switch or Bus     | Bus                | Bus
SB Data Switch or Bus        | Switch             | Bus
SB Data Bus Width            | 8 Bytes            | 16 Bytes
SB Intervention Latency      | Slow               | Fast
Memory Bus Width             | 32 Bytes           | 16 Bytes
Memory Bus Frequency         | 50 MHz             | 100 MHz
I/O Subsystem                | Micro Channel      | PCI
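Some of the rows above can be turned into derived numbers. The Python below is a rough sketch under stated assumptions, not a measurement: it reads the L2 access latency entries as processor clocks for the four data beats of a burst (critical word first), assumes one transfer per bus clock for peak bandwidth, and takes the 100 MHz system bus frequency for both configurations from the structure diagrams. The helper names are hypothetical.

```python
def burst_pclks(beats):
    """Total processor clocks to deliver every beat of a burst."""
    return sum(beats)


def pclks_to_ns(pclks, cpu_mhz):
    """Convert processor clocks to nanoseconds at the given CPU frequency."""
    return pclks * 1000.0 / cpu_mhz


def peak_mb_per_s(width_bytes, bus_mhz):
    """Peak bandwidth in MB/s, assuming one transfer per bus clock."""
    return width_bytes * bus_mhz


CPU_MHZ = 200  # 604e frequency in both configurations

# Values copied from the comparison table; sb_mhz comes from the diagrams.
systems = {
    "R30 parts": {"l2_beats": (9, 2, 2, 2), "sb_bytes": 8, "sb_mhz": 100,
                  "mem_bytes": 32, "mem_mhz": 50},
    "Value parts": {"l2_beats": (3, 1, 1, 1), "sb_bytes": 16, "sb_mhz": 100,
                    "mem_bytes": 16, "mem_mhz": 100},
}

for name, s in systems.items():
    total = burst_pclks(s["l2_beats"])
    print(name)
    print("  L2 burst:", total, "pclks =", pclks_to_ns(total, CPU_MHZ), "ns")
    print("  system data bus peak:", peak_mb_per_s(s["sb_bytes"], s["sb_mhz"]), "MB/s")
    print("  memory bus peak:", peak_mb_per_s(s["mem_bytes"], s["mem_mhz"]), "MB/s")
```

Under these assumptions the Value parts deliver a full L2 burst in 6 processor clocks instead of 15, and double the peak system data bus bandwidth, while the peak memory bus bandwidth is the same in both configurations (a narrower bus at twice the frequency).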
Summary
- IBM has performed extensive UNIX server performance studies.
- These studies led to the development of three server chip sets within the IBM RS/6000 Division:
  - Entry Servers
  - Value Servers
  - Enterprise Servers
- The heart of the Value Servers is the Orca Chip:
  - Fully Integrated L2 Cache (256KB/512KB), Directory, & Controller
  - Drastic departure from traditional L2 Cache Controller designs
  - Highly associative (8-way), low latency, high bandwidth design
  - Aggressive, fully non-blocking, heavily pipelined design points
  - Initial offering at 200 MHz with future frequency increases
  - Supports the 128-bit PowerPC 6XX System Bus
  - Robust Performance Monitor Support
- The ORCA Chip provides high commercial performance, and scalability with respect to the number and frequency of processors.
- The ORCA Chip enables cost-effective desktop PowerPC microprocessors (604e) to be used in SMP server environments.