5.5 A Wire-Speed Power™ Processor: 2.3GHz 45nm SOI with 16 Cores and 64 Threads
Charles Johnson, David H. Allen, Jeff Brown, Steve Vanderwiel, Russ Hoover, Heather Achilles, Chen-Yong Cher, George A. May, Hubertus Franke, Jimi Xenidis, Claude Basso
IBM Research & System Technology Group
Wire-Speed Processor: Goals
- Address a new, evolving market
- Address performance issues arising from new semiconductor scaling properties
- Reduce application power requirements:
  - Achieve >50% through architecture
  - Achieve >50% through implementation
Wire-Speed Processor Technology
- Technology: IBM 45nm SOI
- Main voltage (VDD): 0.7V to 1.1V
- Metal layers: 11 Cu (3-1x, 2-1.3x, 3-2x, 1-4x, 2-10x)
- Latch count: 3.2M
- Transistor count: 1.43B
- Core frequency: 2.3GHz @ 0.97V (worst-case process)
- Chip size: 428 mm2 (including kerf)
- Chip power (4-AT node): 65W @ 2.0GHz, 0.85V (max single chip)
- Chip power (1-AT node): 20W @ 1.4GHz, 0.77V (min single chip)
- Cores / threads: 16 / 64
- L1 I & D cache: 16 x (16KB + 16KB) SRAM
- L2 cache: 4 x 2MB eDRAM
- Hardware accelerators: Crypto, Compression, RegX, XML
- Intelligent network interface: Host Ethernet Adapter / Packet Processor, 2 modes (endpoint & network)
- Memory bandwidth: 2x DDR3 controllers, 4 channels @ 800-1600MHz
- System I/O bandwidth: 4x 10G Ethernet, 2x PCIe Gen2
- Chip-to-chip bandwidth: 3 links, 20GB/s per link
- Chip scaling: 4-chip SMP
- Package: 50mm FCPBGA (4 or 6 layers)
Wire-Speed Processor Attributes
A blurring of the network and server worlds:
- Highly multi-threaded, low-power cores with full ISA
- Standard programming models with OSes & hypervisors
- Smaller memory line sizes
- Accelerators for both networking & application tiers
- Integrated network system & memory I/O
- Virtualization support
- Server RAS & infrastructure
- Low total-power solution based on throughput optimization
Tiered IT data center functions: network; web serving; application & business processing; database; file server
Spectrum: network processors -> wire-speed processing -> server processing
Power/Performance Scaling
- Compute scaling: choose the right processor, and the right mix of processors & accelerators
- Accelerators have the best power/performance efficiency, but each performs only one function
- SMP scaling trades power/performance efficiency against SW & application complexity
- Power/performance efficiency vs a general-purpose processor: unique accelerator, 10-100X improvement; massively multi-threaded (MMT) processor, 2-4X improvement
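The accelerator trade-off above can be sketched with an Amdahl-style model (the model choice and offload fractions are our own illustration; only the 10-100X and 2-4X efficiency figures come from the slide): a single-function accelerator only helps the fraction of the workload it covers.

```python
# Illustrative sketch (not from the slides): Amdahl-style estimate of the
# system-level gain when only a fraction of the work can use an accelerator.
def overall_gain(offloaded_fraction: float, accel_gain: float,
                 base_gain: float = 1.0) -> float:
    """Overall efficiency gain when `offloaded_fraction` of the work runs
    at `accel_gain` and the remainder at `base_gain` (Amdahl's law)."""
    return 1.0 / ((1.0 - offloaded_fraction) / base_gain
                  + offloaded_fraction / accel_gain)

# Even a 50x-efficient accelerator is limited by its coverage:
print(overall_gain(0.5, 50.0))   # ~1.96x: the non-offloaded half dominates
print(overall_gain(0.9, 50.0))   # ~8.47x
```

This is why the chip pairs accelerators with many general-purpose threads: the threads carry the work the fixed-function units cannot.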
Wire-Speed Computing: Threads vs IPC
[Chart: minimum total thread count (0-80, lower is better) vs projected core IPC (0.0-2.5, wire-speed benchmark) for composite packet sizes of 64B, 576B, and 1500B; markers compare 4 general-purpose cores against 16 wire-speed cores.]
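The shape of the chart follows from a simple throughput balance (a sketch under our own assumptions; the function, instruction counts, and rates below are illustrative, not slide data): the chip must retire enough instructions per second to keep up with the wire, so lower per-thread IPC must be made up with more threads.

```python
# Minimum thread count to sustain a packet rate, assuming each thread
# contributes ipc_per_thread * freq_hz instructions/s when latency is hidden.
import math

def min_threads(link_gbps: float, packet_bytes: int, insns_per_packet: int,
                ipc_per_thread: float, freq_hz: float) -> int:
    packets_per_s = link_gbps * 1e9 / 8 / packet_bytes
    required_ips = packets_per_s * insns_per_packet
    return math.ceil(required_ips / (ipc_per_thread * freq_hz))

# Small packets need far more threads than large ones at the same IPC:
print(min_threads(10, 64, 500, 0.5, 2.3e9))    # 64B packets  -> 9 threads
print(min_threads(10, 1500, 500, 0.5, 2.3e9))  # 1500B packets -> 1 thread
```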
Core / AT Node
Core:
- Full 64b Power ISA embedded architecture
- Virtualization / hypervisor support
- DP floating point
- Power-efficient, throughput-optimized microarchitecture: 4-way SMT, in-order, 2-way concurrent issue
- Enhanced instruction set for low-latency accelerator communication
AT node:
- Fully coherent, shared 2MB L2 cache
- High-speed interconnect for many cores
- Power-efficient eDRAM
- Rich L2 cache controls: line locking, streaming, slave memory, injection
Wire-Speed Accelerator Bandwidths

Accelerator   Algorithm                          Engines   Typical BW    Peak BW
HEA           network node mode                  4         40 Gbps       40 Gbps
              endpoint mode                      4         40 Gbps       40 Gbps
Compression   gzip (input bandwidth)             1         8 Gbps        16 Gbps
              gunzip (output bandwidth)          1         8 Gbps        16 Gbps
Encryption    AES                                3         41 Gbps
              TDES                               8         19 Gbps
              ARC4                               1         5.1 Gbps
              Kasumi                             1         5.9 Gbps
              SHA                                6         23-37 Gbps    60 Gbps
              MD5                                6         31 Gbps
              AES/SHA                            3         19-31 Gbps
              RSA/ECC (RSA, 1024/2048-bit key)   3         45000/7260
XML           customer workload                  4         10 Gbps
              benchmark workload                 4         20 Gbps       30 Gbps
RegX          typical pattern sets               8         20-40 Gbps    70 Gbps
Wire-Speed Processor
[Block diagram: four AT nodes (AT 0-3), each with a 2MB L2 and PLLs; two memory controllers (MC) with memory PHYs; accelerators (Comp/Decomp, Crypto, XML, Pattern Engine) attached via PBICs to the PBus; external PBus controller; PCI Express Gen 2 root/endpoint engines; Host Ethernet Controller / Packet Processor with 4x 10GE or 4x 1GE MACs; PBus access macro and pervasive logic; EI3 chip-to-chip links (x8/x4 PHYs); misc I/O.]
Wire-Speed Processor: SoC Design
- Enabled a distributed development team: 19 worldwide locations participating (Burlington, Yorktown, Endicott, Rochester, San Jose, Austin, Raleigh; Hursley, England; La Gaude, France; Boeblingen & Mainz, Germany; Beijing & Shanghai, China; Yamato & Kyoto, Japan; Zurich, Switzerland; Haifa & Tel Aviv, Israel; Bangalore, India)
- Verification is key to a complex SoC: 1.43 billion transistors; 1.5 teracycles of chip/unit simulation; 22 teracycles of system simulation
Evolution of a Wire-Speed Processor
Power savings from architecture / integration:
- Low-power integrated design: accelerators, I/O (Ethernet/HEA), cores/threads, in 25-75W sockets
- Eliminates I/O bottlenecks
- Improves latency & state replication
- Scalable solution: 10 => 40 => 100 GbE
Steps:
0: 8 GP cores [general-purpose processor chip]
1: 8 GP cores + 4 accelerators
2: 8 GP cores + 4 accelerators + integrated I/O (DDR3, 10GbE, HEA)
3: 16 cores + 4 accelerators + integrated I/O (DDR3, 10GbE, HEA) [WireSpeed chip]
Double-Pumped Register File
- 64 total register file instances required across the 16 cores
- Double pumping a 2R1W cell provided 2R2W functionality at 55% of the area and 85% of the power
- Additional power savings from the smaller cell and reduced performance requirements on critical read-access logic
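A behavioral sketch of the double-pumping idea (our own model, not the actual circuit; intra-cycle write-before-read ordering is an assumed convention): the storage array runs at twice the core clock, so two sub-cycles of a 2R1W-style access look like 2R2W to the core.

```python
# Illustrative model of a double-pumped register file: each core cycle is
# split into two pump phases, each performing <=1 read and <=1 write.
class DoublePumpedRF:
    def __init__(self, entries: int):
        self.regs = [0] * entries

    def cycle(self, r0, r1, w0, w1):
        """One core cycle: reads r0, r1 and writes w0, w1 (addr, value)."""
        # Pump phase A: first read port + first write port
        a = self.regs[r0]
        addr, val = w0
        self.regs[addr] = val
        # Pump phase B: second read port + second write port
        b = self.regs[r1]
        addr, val = w1
        self.regs[addr] = val
        return a, b

rf = DoublePumpedRF(32)
print(rf.cycle(r0=0, r1=1, w0=(1, 7), w1=(2, 9)))  # (0, 7)
```

The area/power win cited on the slide comes from needing only the smaller 2R1W cell and one physical read/write path, at the cost of running that path at 2x frequency.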
45nm SOI eDRAM Cells Compared to SRAM
- Density ratio: 6.8x at the cell level; 3.1x at the array level
- Power ratio: 0.2x per mm2 at the array level
- Performance ratio: cycle 3.2x (2.0x application); latency 2.5x (1.3x application)
- eDRAM: cycle = 1.7ns, latency = 1.35ns; SRAM: cycle = 0.53ns, latency = 0.53ns
- Application net result: similar performance in <0.5x area and <0.2x power
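The "net result" line follows from composing the slide's ratios (the ratios are the slide's; this composition is our own first-order reading of them):

```python
# Composing the slide's eDRAM-vs-SRAM ratios.
sram_cycle, edram_cycle = 0.53, 1.70   # ns
array_density_ratio = 3.1              # eDRAM bits per area vs SRAM, array level
power_per_mm2_ratio = 0.2              # eDRAM vs SRAM power density, array level

area_ratio = 1 / array_density_ratio                  # ~0.32x area, same capacity
total_power_ratio = power_per_mm2_ratio * area_ratio  # ~0.065x total power

print(round(edram_cycle / sram_cycle, 2))             # ~3.2x raw cycle ratio
print(round(area_ratio, 2), round(total_power_ratio, 3))
```

Both composed figures land comfortably inside the slide's "<0.5x area and <0.2x power" claim.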
Asynchronous Domain-Crossing Latch Optimization
- Soft error rate (SER): asynchronous latches used a triple-stack feedback inverter to reduce diffusion cross-section
- Latch metastability: the large number of asynchronous clock-domain crossings increased the failure rate, and the SER was orders of magnitude lower than the metastability failure rate
- Latches were redesigned to remove the triple stack, improving the metastability rate by >10X
- The modification required no size, pin-location, or blockage changes on the latches
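The >10X sensitivity to latch design can be seen in the textbook synchronizer MTBF model (the formula is standard; every parameter value below is a made-up example, not slide data):

```python
# MTBF = exp(t_res / tau) / (T_w * f_clk * f_data): tau is the latch's
# regeneration time constant, T_w its metastability window, t_res the
# resolution time allowed before the output is sampled.
import math

def mtbf_seconds(t_res, tau, t_w, f_clk, f_data):
    return math.exp(t_res / tau) / (t_w * f_clk * f_data)

# A modestly faster latch (smaller tau, e.g. after removing a slow
# triple-stacked feedback inverter) improves MTBF exponentially:
slow = mtbf_seconds(t_res=200e-12, tau=15e-12, t_w=20e-12, f_clk=2.3e9, f_data=1e8)
fast = mtbf_seconds(t_res=200e-12, tau=12e-12, t_w=20e-12, f_clk=2.3e9, f_data=1e8)
print(fast / slow)  # well over 10x for these assumed numbers
```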
Wire-Speed: Transistor Characteristics
- Device scaling at 0.7V: 1.50x performance, 5.65x leakage power
- Device scaling at 0.9V: 1.37x performance, 4.61x leakage power

             Performance (GHz/stage)      Leakage (nA/stage)
Transistor   TT,85C,0.7V   TT,85C,0.9V   TT,85C,0.9V   TT,125C,0.9V
RVT          48            73            1300          2300
HVT          37            61            550           950
SVT          32            53            230           500

Normalized:
RVT          1.50          2.28          5.65          10.00
HVT          1.16          1.91          2.39          4.13
SVT          1.00          1.66          1.00          2.17
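The normalized rows can be reproduced from the raw numbers (the baseline is inferred: SVT performance at 0.7V and SVT leakage at 85C are taken as 1.00):

```python
# Reproducing the slide's normalized table from its raw values.
perf = {"RVT": (48, 73), "HVT": (37, 61), "SVT": (32, 53)}          # GHz/stage @ 0.7V, 0.9V
leak = {"RVT": (1300, 2300), "HVT": (550, 950), "SVT": (230, 500)}  # nA @ 85C, 125C

for vt in ("RVT", "HVT", "SVT"):
    p07, p09 = perf[vt]
    l85, l125 = leak[vt]
    print(vt,
          round(p07 / perf["SVT"][0], 2), round(p09 / perf["SVT"][0], 2),
          round(l85 / leak["SVT"][0], 2), round(l125 / leak["SVT"][0], 2))
# RVT 1.5 2.28 5.65 10.0 -- matches the slide's normalized row
```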
Power Design Parameters (Pass 1)
[Bar charts per unit (ATNode, Pbus, MemCtrl, RegX, XML, Crypto, CompDec, Pbic, PP, HEA+, PCIe, Pervasive), each on a 0-100% scale: VT device usage (SVT/HVT/RVT mix) and average fraction of local clocks gated off.]
- VT usage targets: 75% SVT, 20% HVT, 5% RVT
- Clock gating target: 90% of local clocks gated off
Voltage Domains

Supply   Domain                       Nom @BGA (V)   @C4 Vmin (V)   @Cct Vmin (V)   @Cct Vmax (V)
VDD      Logic (operating points)     1.01           0.96           0.90            n/a
                                      0.89           0.85           0.79            n/a
                                      0.70           0.67           0.63            n/a
VDD      Combo PHY                    1.01           0.93           0.86            1.03
                                      0.89           0.82           0.76            n/a
VCS      Array                        1.19           1.08           1.05            1.18
                                      1.11           1.02           0.99            1.11
VMEM     Memory                       1.50           1.37           1.29            1.56
VIO      EI3 I/O                      1.20           1.11           1.04            1.25
VTT      Combo PHY termination        1.20           1.09           1.01            1.23
VTTA     Combo PHY drivers            1.20           1.09           1.03            1.21
DVDD     Low-speed device I/O         3.30           3.05           2.87            3.44
DVDD2    Low-speed device I/O bias    1.80           1.67           1.57            1.87
AVDD     PLL analog (various)         1.80           1.72           1.62            1.87
GND      Logic & array ground         0.00
AGND     Analog PLL ground (various)  0.00
[Two additional values in the VDD/VCS block, 1.20 and 1.22, could not be assigned to a row.]
DT Decoupling Capacitors
- Deep-trench (DT) decoupling capacitors occupy 10.2% of chip area (DT shapes themselves ~4% of chip area)
- ~6 uF of decoupling capacitance on VDD
- 30mV noise reduction; 5W power savings
[Die map of DT density; legend bins: 1%, 2%, 3%, 4%, 5%, 6%, 8%, 14% (eDRAM)]
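A first-order feel for why on-die decap reduces supply noise, using the ΔV = ΔQ/C relation (only the ~6 uF figure is from the slide; the current step and response time are our own assumed example values, not the design's analysis):

```python
# If package/board capacitance cannot respond within t_resp, the on-die
# decap must source the switched charge, and the supply droops by dQ/C.
C_decap = 6e-6          # F, on-die deep-trench decap on VDD (slide figure)
I_step  = 40.0          # A, assumed sudden load-current step
t_resp  = 5e-9          # s, assumed delay before off-die caps respond

delta_q = I_step * t_resp           # charge drawn from the on-die decap
droop = delta_q / C_decap           # resulting supply droop
print(round(droop * 1e3, 1), "mV")  # ~33 mV for these assumed numbers
```

For these illustrative inputs the droop lands at a few tens of millivolts, the same order as the slide's quoted 30mV noise reduction.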
AT Node: Vdd-Frequency-Power Scaling
- 11% Vdd reduction + 22% frequency reduction => 41% power reduction (42% AC, 39% DC)

Per AT node; sub-columns: cores (4 instances, totals) / L2 + nest (1 instance)

                       2.3 GHz          2.0 GHz          1.8 GHz
Temperature (C)        85               85               85
Frequency (GHz)        2.30             2.00             1.81
Voltage @ C4 (V)       0.97             0.89             0.86
Voltage @ circuit (V)  0.89             0.82             0.79
Logic AC (W)           3.612 / 2.294    2.517 / 1.598    2.058 / 1.307
SRAM AC (W)            0.959 / 0.196    0.702 / 0.143    0.587 / 0.120
DRAM AC (W)            0.259            0.194            0.164
Logic DC (W)           1.181 / 0.727    0.835 / 0.520    0.715 / 0.448
SRAM DC (W)            0.129 / 0.121    0.090 / 0.085    0.077 / 0.073
DRAM DC (W)            0.079            0.055            0.047
Clock grid (W)         0.772            0.551            0.455
AT power (W)           10.329           7.290            6.052
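The headline AC reduction is close to what the classic dynamic-power model predicts (the P ∝ C·V²·f model choice is ours; the voltages and frequencies are the slide's circuit-level numbers for the 2.3 GHz and 1.8 GHz points):

```python
# First-order CV^2f check of the Vdd/frequency scaling claim.
v_hi, v_lo = 0.89, 0.79   # V @ circuit, at 2.30 GHz and 1.81 GHz
f_hi, f_lo = 2.30, 1.81   # GHz

ac_ratio = (v_lo / v_hi) ** 2 * (f_lo / f_hi)
print(round(1 - ac_ratio, 2))  # ~0.38 predicted AC power reduction
```

The simple model gives ~38% versus the quoted 42% AC reduction; the gap is plausibly from voltage-dependent effects (short-circuit current, clock power) the CV²f model omits.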
Wire-Speed Processor: Projected Chip Power at Various Frequencies / Functionality

Power     AT Nodes   Frequency   Comments
85W       4          2.3 GHz     Full function / frequency
65W       4          2.0 GHz     Power limiter in operation
56W       2          2.0 GHz     Disabled EI3, PBEC, XML
50W       4          1.8 GHz     Disabled EI3, PBEC, XML, PPI; PCI-E @ 50% frequency; DDR3 @ 1333MHz
41W       2          1.8 GHz     1 memory controller & IOM PHY @ 800MHz
36W       1          1.8 GHz     No accelerators enabled; PBus @ 700MHz; HEA/PP grid @ 312.5MHz; (2) 1Gb Ethernet ports
20-22W    1          1.4 GHz     PCI-E @ 250MHz; I/O is Gen 1 @ 2.5Gbps, 4 lanes; PBEC & EI3 links disabled
Wire-Speed Processor: Projected Power Breakdown by Function @ 2.0GHz
- Cores/caches: 40% (AT 0-3: 10% each)
- PHY: 27% (x8 COMBO PHY 0: 8%; x8 COMBO PHY 1: 2%; x8 COMBO PHY 2: 2%; x4 COMBO PHY: 1%; IOM PHY 0: 4%; IOM PHY 1: 4%; EI3: 6%)
- I/O & bus logic: 19% (PP 5%; PBIC 3%; HEA+ 3%; Pbus 2%; MemCtrl 0: 2%; MemCtrl 1: 2%; PCIe 1%; Pbec 1%; Pervasive 1%; PLL & I/O ~0%)
- Accelerators: 11% (Crypto 4%; RegX 3%; XML 3%; CompDec 1%)
- Clock grid: 3%
Power Reduction Big Hitters

Technique                          Impact                                                     Power Reduction
Static voltage scaling             Lower voltage & 50% reduced DC leakage on fast parts       10-12W
Clock skew reduction               0.4W/ps                                                    4-6W
VT optimization                    Improved timing on critical paths, lower VDD @ frequency   8-12W
Deep trench decoupling capacitors  30mV lower AC noise                                        5W
Clock gating                       50% lower AC power                                         18-22W
Net                                                                                           45-57W
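The "Net 45-57W" row is just the sum of the per-technique ranges, which is quick to verify:

```python
# Summing the slide's per-technique savings ranges (the 5W deep-trench
# figure is treated as a degenerate range).
savings = {
    "static voltage scaling": (10, 12),
    "clock skew reduction": (4, 6),
    "VT optimization": (8, 12),
    "deep trench decap": (5, 5),
    "clock gating": (18, 22),
}
lo = sum(r[0] for r in savings.values())
hi = sum(r[1] for r in savings.values())
print(lo, hi)  # 45 57 -- matches the slide's net figure
```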
Wire-Speed: Power/Current Maps
[Vdd current-density maps over the floorplan (AT0-3, PBus, MC0/1, PCI, HEA/PP, CompDec, Crypto, RegX, XML), color-coded 0-0.5 W/mm2, for two operating points: multi-chip endpoint mode at 2.3GHz, 0.925V (total power 87.3W) and single-chip packet mode at 2.0GHz (total power 65.4W).]
Sensor Summary
- PSRO (32): Performance Screen Ring Oscillator
- MHCRO (1): Model Hardware Correlation Ring Oscillator
- Skitter (10)
- DTS2 (18): Digital Thermal Sensor
- Thermal diodes (4)
- CPM (18): Critical Path Monitor
- Kelvin probe (2)
Wire-Speed Power™ Processor: Summary
- New architecture addresses the converging market between networks and servers
- Low-power, highly scalable design: 25-75W chip at full frequency (2.3GHz)
- Significant power reduction through architecture & implementation:
  - Architecture: simple highly threaded cores, accelerators & integrated I/O
  - Implementation: static voltage scaling, clock skew reduction, VT optimization, deep-trench decoupling capacitors, clock gating
[Block diagram: AT 0-3 with L2s; PBEC; PBus; MCs with DDR3; PCI/Ethernet (PCI, HEA/PP); PBICs; accelerators (Comp, RegX, Crypto, XML); EI3 RX/TX links]
Special thanks to the STG Development Team