The Orca Chip... Heart of IBM s RISC System/6000 Value Servers



Similar documents
Chapter 1 Computer System Overview

ECLIPSE Performance Benchmarks and Profiling. January 2009

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

What is a bus? A Bus is: Advantages of Buses. Disadvantage of Buses. Master versus Slave. The General Organization of a Bus

OC By Arsene Fansi T. POLIMI

Pentium vs. Power PC Computer Architecture and PCI Bus Interface

TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance

SUPERTALENT PCI EXPRESS RAIDDRIVE PERFORMANCE WHITEPAPER

Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano

Understanding PCI Bus, PCI-Express and In finiband Architecture

21152 PCI-to-PCI Bridge

Rackspace Cloud Databases and Container-based Virtualization

A Scalable VISC Processor Platform for Modern Client and Cloud Workloads

<Insert Picture Here> T4: A Highly Threaded Server-on-a-Chip with Native Support for Heterogeneous Computing

RAID. RAID 0 No redundancy ( AID?) Just stripe data over multiple disks But it does improve performance. Chapter 6 Storage and Other I/O Topics 29

AMD Opteron Quad-Core

Maximizing Server Storage Performance with PCI Express and Serial Attached SCSI. Article for InfoStor November 2003 Paul Griffith Adaptec, Inc.

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune

Putting it all together: Intel Nehalem.

DS1104 R&D Controller Board

Setting a new standard

LSI SAS inside 60% of servers. 21 million LSI SAS & MegaRAID solutions shipped over last 3 years. 9 out of 10 top server vendors use MegaRAID

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

INTRODUCTION ADVANTAGES OF RUNNING ORACLE 11G ON WINDOWS. Edward Whalen, Performance Tuning Corporation

Chapter Introduction. Storage and Other I/O Topics. p. 570( 頁 585) Fig I/O devices can be characterized by. I/O bus connections

Trace Port Analysis for ARM7-ETM and ARM9-ETM Microprocessors

OpenSPARC T1 Processor

Petascale Software Challenges. Piyush Chaudhary High Performance Computing

PCI Express IO Virtualization Overview

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

LogiCORE IP AXI Performance Monitor v2.00.a

Configuring Memory on the HP Business Desktop dx5150

Vorlesung Rechnerarchitektur 2 Seite 178 DASH

Microsoft SQL Server: MS Performance Tuning and Optimization Digital

Computer Organization and Architecture. Characteristics of Memory Systems. Chapter 4 Cache Memory. Location CPU Registers and control unit memory

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Memory Architecture and Management in a NoC Platform

Models Smart Array 6402A/128 Controller 3X-KZPEC-BF Smart Array 6404A/256 two 2 channel Controllers

Gigabit Ethernet Design

Avid ISIS

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

Priority Pro v17: Hardware and Supporting Systems

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai Jens Onno Krah

Chapter 02: Computer Organization. Lesson 04: Functional units and components in a computer organization Part 3 Bus Structures

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

LS DYNA Performance Benchmarks and Profiling. January 2009

High performance ETL Benchmark

SQL Server Performance Tuning and Optimization

Notes and terms of conditions. Vendor shall note the following terms and conditions/ information before they submit their quote.

361 Computer Architecture Lecture 14: Cache Memory

Microsoft Exchange Server 2003 Deployment Considerations

MPC603/MPC604 Evaluation System

The i860 XP Second Generation of the i860 Supercomputing Microprocessor Family. Presentation Outline

Lecture 23: Multiprocessors

As enterprise data requirements continue

D1.2 Network Load Balancing

Achieving Mainframe-Class Performance on Intel Servers Using InfiniBand Building Blocks. An Oracle White Paper April 2003

PCI Express Impact on Storage Architectures and Future Data Centers

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Low Power AMD Athlon 64 and AMD Opteron Processors

Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors. NoCArc 09

Performance Analysis and Testing of Storage Area Network

Using Synology SSD Technology to Enhance System Performance Synology Inc.

CPS104 Computer Organization and Programming Lecture 18: Input-Output. Robert Wagner

SUN SPARC ENTERPRISE M4000 SERVER

Architecture of Hitachi SR-8000

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

PCI Express and Storage. Ron Emerick, Sun Microsystems

VTrak SATA RAID Storage System

Storage Performance Testing

Switch Fabric Implementation Using Shared Memory

Infrastructure Matters: POWER8 vs. Xeon x86

Performance and scalability of a large OLTP workload

The Motherboard Chapter #5

Configuration Maximums VMware Infrastructure 3

Power Reduction Techniques in the SoC Clock Network. Clock Power

Configuring a U170 Shared Computing Environment

Understanding the Performance of an X User Environment

AIX NFS Client Performance Improvements for Databases on NAS

HP Smart Array Controllers and basic RAID performance factors

IOS110. Virtualization 5/27/2014 1

High Performance Computing. Course Notes HPC Fundamentals

Performance Tuning and Optimizing SQL Databases 2016

Intel X38 Express Chipset Memory Technology and Configuration Guide

Router Architectures

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

Computer Systems Structure Input/Output

Windows 2003 Performance Monitor. System Monitor. Adding a counter

Power Efficiency Comparison: Cisco UCS 5108 Blade Server Chassis and IBM FlexSystem Enterprise Chassis

Chapter 13. PIC Family Microcontroller

Intel s SL Enhanced Intel486(TM) Microprocessor Family

Introduction to Cloud Computing

Introducción. Diseño de sistemas digitales.1

Microsoft Exchange Solutions on VMware

Transcription:

The Orca Chip... Heart of IBM s RISC System/6000 Value Servers Ravi Arimilli IBM RISC System/6000 Division 1

Agenda. Server Background. Cache Heirarchy Performance Study. RS/6000 Value Server System Structure. Orca Overview. Orca Design Points. System Comparisons. Summary 2

Background. 2H 94 IBM Announced the RISC System/6000 Model J30/R30: 601 601 A D A D DIR & 1MB CC CC DIR & 1MB MCA IOBB MCA IOBB A D A D A D A D 32 32 PowerPC 6XX System Bus MC CNTL DATA SWITCH 256 SYSTEM MEMORY 3

Background. IBM s First PowerPC SMP System. 1 8 Way SMP Scalable Structure. Introduction of the PowerPC 6XX System Bus. Symmetric and Coherent I/O Subsystems. New and Efficient OS for SMP (AIX Ver. 4). Upgradable Processor Cards Higher Frequency Processors New PowerPC Processor Designs (604/604e/620/etc). Baseline Platform to Begin Performance Analysis 4

Server Workloads. TPC C Mix of database transactions w/high multiprogramming level Multiple implementations using different database products Large code and data footprint Not OS intensive (12% OS, 88% DB). SPEC SFS 097.laddis NFS file server benchmark Mix of file server transactions w/high multiprogramming level Modest code and data footprint Mostly OS intensive (100% OS). SPEC SDM 057.sdet Batch oriented software development benchmark Modest multiprogramming level Large code and data footprint Mostly OS intensive (50% OS, 50% commands/libraries) 5

Server Workloads. SPEC SDM 061.kenbus Interactive oriented multiuser benchmark High multiprogramming level Large code and data footprint Mostly OS intensive (50% OS, 50% commands/libraries). Netperf TCP IP Performance Test Mostly sends/receives variable data lengths in loop format Small code footprint, mostly runs in L1 Cache Sensitive to processor Mhz. Other SPECint 95 SPECfp 95 G92 (Computational Chemistry) Les (Aircraft surface turbulence simulation) 6

Performance Study. Performance Tools Software Instruction Trace Tools Hardware Bus Trace Tools Processor Simulators Cache Simulators Memory Simulators System Topology/Interconnect Behaviorals Validation Tools 7

Performance Study. L2 Cache Parameters Varied Processor to L2 Bus Ratios (1:1, 3:2, 2:1, etc) Associativity Cache Line Size Sectors/Cache Line Cache Replacement Algorithms L2 Access Latency L2 Intervention Latency Lookaside vs Inline L2 Caches Unified vs Split I/D Caches Shared L2 Cache vs Dedicated L2 Cache. System Parameters Varied 601, 604, 604e, and 620 Processor Cores Memory Access Latency System Bus Width System Bus Ratios Memory Bus Width Memory Bus Ratios Switch vs Bus for System Address Bus and Data Bus Switch vs Bus for Memory Bus Number of Processors System Bus Protocols 8

Performance Study 3 2 TPC C L2 Cache Miss Rate/ Instruction 1 Laddis Kenbus Sdet Netperf 0 32 96 128 Line Length Effect of Increasing Line Length on Miss Rates (256KB 8 Way SA) 9

Performance Study 3 TPC C L2 Cache Miss Rate/ Instruction 2 Kenbus Laddis Sdet 1 Netperf 0 1 2 3 4 5 6 7 8 Associativity Effects of Increasing Associativity on Miss Rates (256KB B Line) 10

Performance Study 1.10 1.05 1.00 1:1 Bus Relative Performance 0.95 0.90 0.85 2:1 Bus 0.80 0.75 2 3 4 5 6 7 8 9 10 11 Pclocks Latency Processor to L2 Critical Word Effect of L2 Latency on Performance of TPC C 11

HW Development Strategy Performance Study Entry Chip Set Value Chip Set Enterprise Chip Set Specials Entry Servers Value Servers Enterprise Servers 12

Structure RS/6000 Value Server Structure (200 Mhz) Processor Bus (1:1 Bus Freq) (200 Mhz) 604e 32 ORCA 604e 604e 604e 32... 32 32 ORCA ORCA ORCA 128 128 128 128 PowerPC 6XX Bus (100 Mhz) A 128 D 128 Memory & I/O Controller 128 SYSTEM MEMORY FEATURE SLOT (Graphics, Clustering, Additional I/O, etc.) (100 Mhz)... PCI Bus PCI Bus 13

Orca Overview Fully integrated Cache, Tags, Status, LRU and Controller. Cache Controller Features Fully Integrated L2 Cache, Directory, and Controller 256KB or 512KB Cache Sizes 8 Way Set Associative Low Latency L1 Miss, L2 Hit Access 200 Mhz Operation (1:1 with CPU Frequency) 32B/Cycle Internal Cache Access (6.4 GB/sec Cache Bandwidth) B L2 Cache Lines (32B L1 Cache Lines) Non Blocking L2 Cache for Processor Bus Reads and Writes Non Blocking L2 Directory for Processor Bus Snoops Non Blocking L2 Directory for System Bus Snoops Single Cycle Snoop Coherency Resolution High Speed System Bus Intervention Supported Queue Depths Support Processor Bus and System Bus Saturation Limits (Improves Technical/fp Performance) RISC Style Micro Architecture (Shallow Logic/Heavily Pipelined) Improved SMP Locks Performance High Speed Cache Inhibited Stores (Graphics Performance) Single Bit Error Recovery (ECC) for Internal Cache & Directory 14

Orca Overview. PowerPC 6XX System Bus Features Seperate Address and Data Buses (Fully Tagged) Efficient Address and Data Bus Arbitration Protocols Bit Addressing Support Bit and 128 Bit Data Bus Support High Speed Intervention Support Split Flow Control and Coherency Responses Efficient Protocols for SMP Clustering Efficient Protocols for Switched Address and Data Buses High Frequency Capable (Latch to Latch Protocols, etc) System Centric Bus Architecture (The System Directly Controls All OCD Enables, When Snoopers Sample, etc) Heavily Pipelined Address/Response Buses Other (Bus Parking, Prefetch Hints, Large Bursts, etc) Robust Error Recovery 15

Block Diagram PowerPC 60X Processor Bus Add(32) Cntl Data() ORCA Generic Internal Interface Generic Internal Interface PowerPC 60X Processor Bus Interface Controller PowerPC Architectural Controller Tags/Status L2 Cache Controller Tags/Status LRU 256KB or 512KB Cache PowerPC 6XX System Bus Interface Controller Performance Monitor Debug Assist On Chip Regs Configuration Error Handler JTAG/COP Add() Cntl Data(128) PowerPC 6XX System Bus 16

ORCA Technology. 256KB, 512KB L2 Cache. 0.5u Nwell CMOS. Five Level Metal. Local Interconnect. L = 0.25um eff. 2.5V Core Voltage. 3.3V Drivers/Receivers. 7.5W Typical Power at 200 Mhz (estimated). 430 I/O Signals, CMOS/TTL Compatible. LSSD Design, JTAG Compliant. 32mm Ball Grid Array 17

Structure RS/6000 Value Server Structure (200 Mhz) Processor Bus (1:1 Bus Freq) (200 Mhz) 604e 32 ORCA 604e 604e 604e 32... 32 32 ORCA ORCA ORCA 128 128 128 128 PowerPC 6XX Bus (100 Mhz) A 128 D 128 Memory & I/O Controller 128 SYSTEM MEMORY FEATURE SLOT (Graphics, Clustering, Additional I/O, etc.) (100 Mhz)... PCI Bus PCI Bus 18

Structure. Alternative Option via RS/6000 J30/R30 Parts (200 Mhz) 604e 604e A D A D DIR & 2MB CC CC DIR & 2MB MCA IOBB MCA IOBB A D A D A D A D 32 32 PowerPC 6XX System Bus (100 Mhz) CNTL DATA SWITCH MC 256 SYSTEM MEMORY 19

System Comparisons Parameters System w/r30 Parts System w/value Parts Processor/Frequency 604e @ 200Mhz 604e @ 200Mhz Processor I/D, Associativity 32K/32K, 4Way 32K/32K, 4Way Processor Bus Frequency L2 Access Latency (Pclks) 100 Mhz (2:1) 9 2 2 2 200 Mhz (1:1) 3 1 1 1 L2 Cache Size L2 Associativity L2 Cache Line Size L2 Sectors/Cache Line L2 Inline or Lookaside L1 Inclusivity L2 Unified or Split I/D L2 Shared or Dedicated SB Address Switch or Bus SB Data Switch or Bus SB Data Bus Width SB Intervention Latency Memory Bus Width Memory Bus Frequency 512K, 1MB, 2MB 1 Way 32 Byte 1 Sector Inline Imprecise Unified Dedicated Bus Switch 8 Bytes Slow 32 Bytes 50 Mhz 256K, 512K 8 Way Byte 2 Sectors Inline Precise Unified Dedicated Bus Bus 16 Bytes Fast 16 Bytes 100 Mhz I/O Subsystem Micro Channel PCI 20

Summary. IBM has performed extensive UNIX server performace studies.. The result of these studies has led to the development of three server chip sets within the IBM RS/6000 Division: Entry Servers Value Servers Enterprise Servers. The heart of the Value Servers is the Orca Chip: Fully Integrated L2 Cache (256KB/512KB), Directory, & Controller Drastic departure from traditional L2 Cache Controller designs Highly associative (8 way), low latency, high bandwidth design Aggressive, fully non blocking, heavily pipelined design points Initial offerring at 200Mhz with future frequency increases Supports the 128 bit PowerPC 6XX System Bus Robust Performance Monitor Support. The ORCA Chip provides high commercial performance, and scalability w.r.t. the number and frequency of processors.. The ORCA Chip enables cost effective desktop PowerPC Microprocessors (604e) to be used in SMP Server enviroments. 21