SPARC64 X: Fujitsu's New Generation 16-Core Processor for the next generation UNIX servers



SPARC64 X: Fujitsu's New Generation 16-Core Processor for the next generation UNIX servers
August 29, 2012
Takumi Maruyama, Processor Development Division, Enterprise Server Business Unit, Fujitsu Limited
All Rights Reserved, Copyright FUJITSU LIMITED 2012

Agenda
- Fujitsu Processor Development History
- SPARC64 X
  - Design concept
  - SWoC (Software on Chip)
  - Processor chip overview
  - u-architecture
  - Performance
- Summary

Fujitsu Processor Development
[Figure: processor development timeline by technology generation (~1999, 2000~2003, 2004~2007, 2008~2011, 2012~), covering the SPARC64 processors and the GS mainframe processors. Transistor counts and process nodes run from Tr=10M (CMOS Al, 350nm) to SPARC64 X at Tr=2.95B (28nm).]
- High Performance Technology: Virtual Machine Architecture, Software On Chip, HPC-ACE, System On Chip, Hardware Barrier, Multi-core / Multi-thread, L2$ on Die, Non-Blocking $, O-O-O Execution, Super-Scalar, Single-chip CPU, Store Ahead, Branch History, Prefetch
- High Reliability Technology: $ ECC, Register/ALU Parity, Instruction Retry, $ Dynamic Degradation, RC/RT/History

SPARC64 X Design Concept
Combine UNIX and HPC FJ processor features to realize an extremely high throughput UNIX processor.
- SPARC64 VII/VII+ (UNIX processor) features
  - High CPU frequency (up to 3GHz)
  - Multicore/Multithread
  - Scalability: up to 64 sockets
- SPARC64 VIIIfx (HPC processor) features
  - HPC-ACE: innovative ISA extensions to SPARC-V9
  - High memory B/W: peak 64GB/s, embedded memory controller
- Add new features vital to current and future UNIX servers
  - Virtual Machine Architecture
  - Software On Chip
  - Embedded IOC (PCI-GEN3 controller)
  - Direct CPU-CPU interconnect

Software on Chip 1/2
HW for SW: accelerate specific software functions with HW.
- The targets
  - Decimal operation (IEEE754 decimal and NUMBER)
  - Cypher operation (AES/DES)
  - Database acceleration
- HW implementation
  - The HW engines for SWoC are implemented in the FPU, to fully utilize the 128 FP registers and software pipelining.
  - They are implemented as instructions rather than as a dedicated co-processor, to maximize flexibility for SW.
  - To avoid the complications of CISC-type instructions, various RISC-type instructions are newly defined instead: 18 instructions for Decimal and 10 instructions for Cypher operation.

Software on Chip 2/2
Decimal Instructions
- Supported data types
  - IEEE754 DPD (Densely Packed Decimal): 8B fixed length
  - NUMBER: variable length (max 21 bytes)
- Instructions
  - Both DPD and NUMBER instructions are defined as 8B operations (add/sub/mul/div/cmp) on FP registers, to maximize performance at a reasonable HW cost.
  - When the data length is > 8 bytes, multiple such instructions are used.
  - An instruction for a special byte-shift on FP registers is newly added to support unaligned NUMBER data.
[Figure: byte-shift example in which masked parts of Fd[rs1] and Fd[rs2] are combined into Fd[rd].]
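The point that data longer than 8 bytes is handled by issuing several 8B instructions can be illustrated with a small software analogy. The sketch below is not the SPARC64 X instruction semantics and does not model DPD or NUMBER encodings; it only shows a long decimal value being split into fixed-width words that are added one word at a time with a carry. The word width of 16 decimal digits and all function names (split_words, add_word, add_long) are assumptions made for illustration.

```python
DIGITS_PER_WORD = 16  # assumption: decimal digits held by one fixed-width operand


def split_words(value: int, digits: int = DIGITS_PER_WORD) -> list[int]:
    """Split a non-negative integer into little-endian fixed-width decimal words."""
    base = 10 ** digits
    words = []
    while True:
        value, low = divmod(value, base)
        words.append(low)
        if value == 0:
            return words


def add_word(a: int, b: int, carry_in: int, digits: int = DIGITS_PER_WORD) -> tuple[int, int]:
    """Stand-in for one fixed-width decimal add: returns (sum word, carry out)."""
    base = 10 ** digits
    total = a + b + carry_in
    return total % base, total // base


def add_long(x: int, y: int) -> tuple[int, int]:
    """Add two long decimals word by word; returns (result, number of word adds)."""
    xs, ys = split_words(x), split_words(y)
    n = max(len(xs), len(ys))
    xs += [0] * (n - len(xs))  # pad the shorter operand with zero words
    ys += [0] * (n - len(ys))
    out, carry = [], 0
    for a, b in zip(xs, ys):
        s, carry = add_word(a, b, carry)
        out.append(s)
    if carry:
        out.append(carry)
    base = 10 ** DIGITS_PER_WORD
    result = sum(w * base ** i for i, w in enumerate(out))
    return result, n
```

In this analogy a 21-digit value needs two word-level adds, which mirrors the slide's statement that multiple 8B instructions are used for longer data.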

SPARC64 X Chip Overview
[Figure: die photo with labeled blocks: DDR3 Interface (x2), MAC (x2), L2 Cache Data (x2), L2 Cache Control, SERDES PCI GEN3, SERDES Inter-CPU.]
- Architecture Features
  - 16 cores x 2 threads
  - SWoC (Software on Chip)
  - Shared 24 MB L2$
  - Embedded Memory and IO Controller
- 28nm CMOS
  - 23.5mm x 25.0mm
  - 2,950M transistors
  - 1,500 signal pins
  - 3GHz
- Performance (peak)
  - 288GIPS / 382GFlops
  - 102GB/s memory throughput

SPARC64 X Core spec
[Figure: core floorplan with labeled blocks: L1I$, L1$ control, Instruction Control, L1D$, Execution Unit, Register File.]
- Instruction Set Architecture: SPARC-V9/JPS, HPC-ACE, VM, SWoC
- Branch Prediction: 4K BRHIS, 16K PHT
- Integer Execution Units: 156 GPR x 2 + 64 GUB; ALU/SHIFT x2, ALU/AGEN x2, MULT/DIVIDE x1
- FP Execution Units: 128 FPR x 2 + 64 FUB; FMA x4, FDIV x2, IMA/Logic x4, Decimal x1 / Cypher x2
- L1$: L1I$ 64KB/4way, L1D$ 64KB/4way

CPU u-architecture enhancements from SPARC64 VII+
- Core
  - Deeper pipeline to increase frequency
  - Better branch prediction scheme
  - Larger queue sizes and more floating-point registers
  - Richer execution units
    - 2EX + 2EAG -> 2EX + 2EX/EAG
    - 2FMA -> 4FMA, to support 2way-SIMD
    - SWoC engine (Decimal and Cypher)
  - More aggressive O-O-O execution of loads and stores
  - Multi-banked 2port L1-Cache
- System On Chip
  - #core and L2$ size (4core/12MB -> 16core/24MB)
  - Memory Controller, IO Controller, and CPU-CPU I/F are all embedded to increase performance and reduce cost.

SPARC64 VII/VII+ Pipeline
[Figure: pipeline block diagram of the 4-core SPARC64 VII/VII+.]
- Stages: Fetch (4 stages), Issue (2 stages), Dispatch / Reg.-Read (4 stages), Execute, Memory (L1$: 3 stages), Commit (2 stages)
- Instruction fetch: L1 I$ 64KB 2Way, Branch Target Address 8K entries
- Decode & Issue
- Reservation stations: RSA 10 entries, RSE 8x2 entries, RSF 8x2 entries, RSBR 10 entries
- Registers: GPR 156 registers x2, GUB 32 registers, FPR 64 registers x2, FUB 48 registers
- Execution units: EAGA, EAGB, EXA, EXB, FLA, FLB
- Load/store: Fetch Port 16 entries, Store Port 16 entries, Store Buffer 16 entries, L1 D$ 64KB 2Way
- Commit: CSE 64 entries, PC x2, Control Registers x2
- L2$ 6MB/12MB 12Way, System Bus Interface

SPARC64 X Pipeline
[Figure: pipeline block diagram of the 16-core SPARC64 X.]
- Stages: Fetch (4 stages), Issue (4 stages), Dispatch / Reg.-Read (5 stages), Execute, Memory (L1$: 3 stages), Commit (2 stages)
- Instruction fetch: L1 I$ 64KB 4Way, Branch Target Address 4K entries, Pattern History Table 16K entries
- Decode & Issue
- Reservation stations: RSA 24 entries, RSE 24 entries, RSF 20 entries, RSBR 16 entries
- Registers: GPR 156 registers x2, GUB 64 registers, FPR 128 registers x2, FUB 64 registers
- Execution units: EAGA/EXC, EAGB/EXD, EXA, EXB; FLA, FLB, FLC, FLD, with the Decimal and Cypher engines attached to the FL pipes
- Load/store: Fetch Port 32 entries, Store Port 24 entries, Write Buffer 10 entries, L1 D$ 64KB 4Way
- Commit: CSE 96 entries, PC x2, Control Registers x2
- L2$ 24MB 24Way, Router, CPU-CPU I/F, Memory Controller (DIMM), IO Controller (PCI-GEN3)

Execution units enhancements (Ex.)
- Integer Execution Unit
  - 2EX + 2EAG -> 2EX + 2EX/EAG
  - GPR write ports: 2 -> 4W
  - 4 integer instructions can be executed per cycle (sustained)
  [Figure: EXA, EXB, EXC, and EXD update the GUB; results are committed to the GPR.]
- Load Store Unit
  - Aggressive load/store O-O-O execution: a load executes without waiting for the address calculation of a preceding store.
  - Multi-banked 2port L1-cache executes 2 loads, or 1 load + 1 store, in parallel
  - Doubled L1$ bus size
  - Doubled L1$ associativity (2 -> 4way)
  - Result: increased L1-cache throughput and hit rate
  [Figure: banked L1 cache, 2R/1W, with one 16B store path and two 16B load paths.]
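To make the idea of a multi-banked L1 cache serving two accesses per cycle more concrete, here is a minimal sketch of one common banking scheme: consecutive 16B chunks are interleaved across banks, and two accesses can proceed together when they fall in different banks. The slides do not give the bank count, the interleaving granularity, or the conflict rules of SPARC64 X, so num_banks=4, bank_width=16, and the functions bank_of and can_issue_together are all hypothetical.

```python
def bank_of(addr: int, bank_width: int = 16, num_banks: int = 4) -> int:
    """Map a byte address to a bank, interleaving bank_width-byte chunks (assumed scheme)."""
    return (addr // bank_width) % num_banks


def can_issue_together(addr_a: int, addr_b: int,
                       bank_width: int = 16, num_banks: int = 4) -> bool:
    """Two accesses avoid a bank conflict in this model when they hit different banks."""
    return bank_of(addr_a, bank_width, num_banks) != bank_of(addr_b, bank_width, num_banks)
```

Under these assumed parameters, accesses to 0x00 and 0x10 land in different banks and could be paired, while 0x00 and 0x40 map to the same bank and would be serialized in this simplified model.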

SPARC64 X interconnects
- SPARC64 VII/VII+ interconnects (SPARC Enterprise M8000)
  [Figure: each CPU connects through an SC and a MAC to DDR2 DIMMs.]
  - 4 CPUs require 8 additional LSIs to be connected with DIMMs
  - DIMM i/f: 4.35GB/s (STREAM triad)
- SPARC64 X interconnects
  [Figure: four CPUs directly connected to each other at 14.5GB/s per link, each with DDR3 DIMMs attached at 102GB/s.]
  - No additional LSIs are needed to connect with DIMMs
  - DIMM i/f: 65.6GB/s (STREAM triad)
  - CPU i/f: 14.5GB/s x 5 ports (peak)
    - 3 ports: glueless 4way CPU interconnect
    - 2 ports: > 4way CPU

High Speed Transceivers (SerDes)
- CPU-CPU glue-less communication links
  - 14.5Gb/s x 8 lanes bi-directional serial interface, 5 ports
  - Embedded equalizer circuit enables long distance signal transmission
  - Embedded adaptive control logic optimizes equalizer parameters automatically depending on the system configuration
- PCI Express ports
  - 8Gb/s x 8 lanes (Gen 3), 2 ports
- Built-in SerDes provides peak 88.5GB/s x2 (up/down) total throughput
[Figure: SerDes block with RX, PLL, and TX, 14.5Gb/s x 8 lanes.]
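The 88.5GB/s figure can be reproduced from the lane rates and port counts on this slide. The sketch below assumes raw line rates (no encoding or protocol overhead) and 8 bits per byte; the helper name link_gbytes_per_s is hypothetical.

```python
def link_gbytes_per_s(gbits_per_lane: float, lanes: int, ports: int) -> float:
    """Raw one-direction bandwidth in GB/s, ignoring encoding overhead (assumption)."""
    return gbits_per_lane * lanes * ports / 8


cpu_port = link_gbytes_per_s(14.5, 8, 1)    # one CPU-CPU port
cpu_links = link_gbytes_per_s(14.5, 8, 5)   # all five CPU-CPU ports
pcie_links = link_gbytes_per_s(8.0, 8, 2)   # two PCI Express Gen 3 ports
total = cpu_links + pcie_links

print(cpu_port, cpu_links, pcie_links, total)  # 14.5 72.5 16.0 88.5
```

The per-port value of 14.5GB/s matches the "CPU i/f: 14.5GB/s x 5 ports" figure on the interconnect slide, and the sum of the CPU-CPU and PCI Express links matches the 88.5GB/s per direction quoted here.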

Reliability, Availability, Serviceability
Units and their error detection and correction scheme:
- Cache (Tag): ECC; Duplicate & Parity
- Cache (Data): ECC; Parity
- Register: ECC (INT/FP); Parity (Others)
- ALU: Parity/Residue
- Cache dynamic degradation: Yes
- HW Instruction Retry: Yes
- History: Yes
[Figure: SPARC64 X RAS diagram. Green: 1bit error correctable. Yellow: 1bit error detectable. Gray: 1bit error harmless.]
New RAS features from SPARC64 VII/VII+
- Floating-point registers are ECC protected
- #checkers increased to ~53,000 to identify a failure point more precisely
Guarantees Data Integrity

Hardware Instruction Retry
When an error is detected, the hardware re-executes the instruction automatically, so a transient error is removed without software involvement.
1. Error
2. Flush
3. Single step execution
4. Update of SW visible resources
5. Back to normal execution after the re-executed instruction gets committed without an error.
[Figure: Fetch / Execute / Commit pipeline shown before and after the retry, with IBF, IWR, CSE, PC, ALU, EAGA/B, EXA/B, FLA/B, RSE, RSF, RSA, RSBR, GUB, FUB, and the SW visible resources (GPR, FPR, Memory).]
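The five steps above can be mimicked in a toy software model. This is only an analogy, not a description of the hardware: instructions are modeled as Python callables that take and return a state dictionary, a transient fault is injected once at chosen instruction indices, and all names (TransientError, execute, run_with_retry, fault_at) are hypothetical.

```python
class TransientError(Exception):
    """Models a detected one-time hardware error (toy model)."""


def execute(instr, state, pc, pending_faults):
    """Compute a speculative result; fail once if a fault is pending at pc."""
    if pc in pending_faults:
        pending_faults.discard(pc)   # transient: the fault does not recur
        raise TransientError(pc)
    return instr(dict(state))        # work on a copy, not on visible state


def run_with_retry(program, state, fault_at=()):
    """Run program; returns (final state, number of retries performed)."""
    pending_faults = set(fault_at)
    retries = 0
    pc = 0
    while pc < len(program):
        try:
            result = execute(program[pc], state, pc, pending_faults)
        except TransientError:       # 1. error detected
            retries += 1             # 2. flush: the speculative result is dropped
            # 3. single-step re-execution of the same instruction
            result = execute(program[pc], state, pc, pending_faults)
        state = result               # 4. update software-visible state at commit
        pc += 1                      # 5. continue normal execution
    return state, retries
```

For example, with program = [lambda s: {**s, "r1": 5}, lambda s: {**s, "r2": s["r1"] + 1}], calling run_with_retry(program, {}, fault_at={1}) yields the same final state as a fault-free run, with one retry counted. The key design point the model preserves is that visible state is only updated at commit, so a flushed attempt leaves nothing behind.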

SPARC64 X Performance @3GHz
[Figure: bar chart of hardware measured results, relative to SPARC64 VII+ @2.86GHz; bars are labeled 7x, 8x, 15x, and 98x (SWoC).]
- SPARC64 X realizes 7x the INT/FP/JVM throughput and 15x the memory throughput of SPARC64 VII+.
- The INT/FP/JVM results are with an un-tuned compiler/JVM.
- SWoC on SPARC64 X gives up to 98x throughput.
- The NUMBER score is for scalar data. It is expected to be much better for vector data.
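The 15x memory throughput figure is consistent with the STREAM triad numbers quoted on the interconnect slide (4.35GB/s for SPARC64 VII+ and 65.6GB/s for SPARC64 X). The check below assumes that both figures are measured on a comparable per-CPU basis and that the 15x bar was derived from them; the slides do not state this explicitly.

```python
stream_vii_plus_gbs = 4.35   # STREAM triad, SPARC64 VII/VII+ DIMM i/f (from the slides)
stream_x_gbs = 65.6          # STREAM triad, SPARC64 X DIMM i/f (from the slides)

ratio = stream_x_gbs / stream_vii_plus_gbs
print(f"memory throughput ratio: {ratio:.1f}x")  # roughly 15x
```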

SPARC64 X CPI (Cycles Per Instruction) Example
SPARC64 VII+ vs. SPARC64 X, INT (single thread), hardware measured results
[Figure: CPI breakdown from lower performance (SPARC64 VII+) to higher performance (SPARC64 X), with contributions labeled: shorter memory latency; large L2$; improved branch prediction; 2 -> 4way L1$; increased throughput of L1$; 2EX+2EAG -> 2EX+2EX/EAG; 2 -> 4W GPR.]
- The 4 integer execution units and the increased number of GPR (integer register) write ports improve overall performance.
- Memory latency reduction, the large L2$, branch prediction, and the L1$ improvements also contribute substantially to the higher performance.

Summary
- SPARC64 X is Fujitsu's 10th SPARC processor, designed for Fujitsu's next generation UNIX server.
- SPARC64 X integrates 16 cores + 24MB L2 cache with over 100GB/s (peak) memory B/W.
- SPARC64 X keeps strong RAS features.
- The SPARC64 X chip is up and running in the lab. It has shown 7 times the throughput of SPARC64 VII+ without compiler tuning.
- SWoC is effective in accelerating specific software functions.
- Fujitsu will continue to develop the SPARC64 series.

SPARC64 X Abbreviations
- IB: Instruction Buffer
- RSA: Reservation Station for Address generation
- RSE: Reservation Station for Execution
- RSF: Reservation Station for Floating-point
- RSBR: Reservation Station for Branch
- GUB: General Update Buffer
- FUB: Floating point Update Buffer
- GPR: General Purpose Register
- FPR: Floating Point Register
- CSE: Commit Stack Entry