SPARC64 VII Fujitsu s Next Generation Quad-Core Processor

Similar documents

SPARC64 X: Fujitsu s New Generation 16 Core Processor for the next generation UNIX servers

SPARC64 VIIIfx: CPU for the K computer

<Insert Picture Here> T4: A Highly Threaded Server-on-a-Chip with Native Support for Heterogeneous Computing

Achieving QoS in Server Virtualization

Intel Pentium 4 Processor on 90nm Technology

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

Next Generation GPU Architecture Code-named Fermi

Chapter 1 Computer System Overview

OC By Arsene Fansi T. POLIMI

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Multi-Threading Performance on Commodity Multi-Core Processors

Introduction to Cloud Computing

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano

Thread level parallelism

Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager

AMD Opteron Quad-Core

Generations of the computer. processors.

The Orca Chip... Heart of IBM s RISC System/6000 Value Servers

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings

1. Memory technology & Hierarchy

OpenSPARC T1 Processor

Architecture of Hitachi SR-8000

Operating System Impact on SMT Architecture

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Introduction to Microprocessors

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

"JAGUAR AMD s Next Generation Low Power x86 Core. Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012

LS DYNA Performance Benchmarks and Profiling. January 2009

CPU Session 1. Praktikum Parallele Rechnerarchtitekturen. Praktikum Parallele Rechnerarchitekturen / Johannes Hofmann April 14,

Operating Systems. 05. Threads. Paul Krzyzanowski. Rutgers University. Spring 2015

Avoid Paying The Virtualization Tax: Deploying Virtualized BI 4.0 The Right Way. Ashish C. Morzaria, SAP

Parallel Algorithm Engineering

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Radeon HD 2900 and Geometry Generation. Michael Doggett

OPENSPARC T1 OVERVIEW

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

LSN 2 Computer Processors

The K computer: Project overview

Putting it all together: Intel Nehalem.

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

Parallel Computing 37 (2011) Contents lists available at ScienceDirect. Parallel Computing. journal homepage:

Unit A451: Computer systems and programming. Section 2: Computing Hardware 1/5: Central Processing Unit

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

Rethinking SIMD Vectorization for In-Memory Databases

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

Current Status of FEFS for the K computer

~ Greetings from WSU CAPPLab ~

WHITE PAPER FUJITSU PRIMERGY SERVERS PERFORMANCE REPORT PRIMERGY BX620 S6

Testing Database Performance with HelperCore on Multi-Core Processors

Intel Itanium Architecture

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

İSTANBUL AYDIN UNIVERSITY

ECLIPSE Performance Benchmarks and Profiling. January 2009

Chapter 2 Parallel Computer Architecture

Enterprise Applications

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

An Implementation Of Multiprocessor Linux

Pentium vs. Power PC Computer Architecture and PCI Bus Interface

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

Chapter 6. Inside the System Unit. What You Will Learn... Computers Are Your Future. What You Will Learn... Describing Hardware Performance

Performance evaluation

High-Performance Modular Multiplication on the Cell Processor

Performance Impacts of Non-blocking Caches in Out-of-order Processors

IMAGE SIGNAL PROCESSING PERFORMANCE ON 2 ND GENERATION INTEL CORE MICROARCHITECTURE PRESENTATION PETER CARLSTON, EMBEDDED & COMMUNICATIONS GROUP

! Metrics! Latency and throughput. ! Reporting performance! Benchmarking and averaging. ! CPU performance equation & performance trends

Family 10h AMD Phenom II Processor Product Data Sheet

Thread Level Parallelism II: Multithreading

picojava TM : A Hardware Implementation of the Java Virtual Machine

Central Processing Unit (CPU)

Informatica Ultra Messaging SMX Shared-Memory Transport

FPGA-based Multithreading for In-Memory Hash Joins

Fusionstor NAS Enterprise Server and Microsoft Windows Storage Server 2003 competitive performance comparison

An OS-oriented performance monitoring tool for multicore systems

Multithreading Lin Gao cs9244 report, 2006

Configuring and using DDR3 memory with HP ProLiant Gen8 Servers

Introduction to GPU Architecture

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr Teruzzi Roberto matr IBM CELL. Politecnico di Milano Como Campus

PROBLEMS #20,R0,R1 #$3A,R2,R4

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

COMPUTER HARDWARE. Input- Output and Communication Memory Systems

Copyright 2013, Oracle and/or its affiliates. All rights reserved.

Application Performance Analysis of the Cortex-A9 MPCore

High Performance Processor Architecture. André Seznec IRISA/INRIA ALF project-team

Performance monitoring at CERN openlab. July 20 th 2012 Andrzej Nowak, CERN openlab

SOC architecture and design

Transcription:

SPARC64 VII Fujitsu s Next Generation Quad-Core Processor August 26, 2008 Takumi Maruyama LSI Development Division Next Generation Technical Computing Unit Fujitsu Limited

High Performance Technology High Reliability Technology Fujitsu Processor Development Hardware Barrier Multi-core / Multi-thread L2$ on Die Non-Blocking $ O-O-O Execution Super-Scalar Single-chip CPU Store Ahead Branch History Prefetch Cache ECC Register/ALU Parity Instruction Retry Cache Dynamic Degradation RC/RT/History ~1995 SPARC64 II SPARC64 GS8600 1996 ~1997 Tr=10M CMOS Al 350nm 1998 ~1999 :Technology generation SPARC64 Processor Tr=190M CMOS Cu 130nm Tr=30M CMOS Cu 180nm / 150nm SPARC64 GP GS8800 GS8800B GS8900 Tr=30M CMOS Al 250nm / 220nm Tr=46M CMOS Cu 180nm 2000 ~2003 Tr=400M CMOS Cu 90nm SPARC64 V GS21 600 Tr=540M CMOS Cu 90nm SPARC64 V + Tr=600M CMOS Cu 65nm SPARC64 VI GS21 900 Tr=500M CMOS Cu 90nm Tr=190M CMOS Cu 130nm Mainframe 2004~ SPARC64 VII HPC Venus UNIX GS21

Chip L2 Cache Core 1 Tag Core 3 Execution L1D Cache Unit L1 Cache Instruction Control Control L1I Cache L2 Cache Data L2 Cache Control System Interface System I/O Core 2 Architecture Features 4core x 2threads (SMT) Embedded 6MB L2$ 2.5GHz Jupiter Bus Fujitsu 65nm CMOS 21.31mm x 20.86mm 600M transistors 456 signal pins 135 W (max) 44% power reduction per core from SPARC64 TM VI

Design Target Upgradeable from current SPARC64 VI on SPARC Enterprise Servers Keep single thread performance & high reliability by reusing SPARC64 VI core as much as possible Same system I/F: Jupiter-Bus More Throughput Performance Dual core to Quad-core VMT (Vertical Multithreading) to SMT (Simultaneous Multithreading) Technical Computing Shared L2$ & Hardware Barrier Increased Bus frequency (on Fujitsu FX1 server)

Pipeline Overview Fetch (4stages) Issue Dispatch (2stages) Reg.-Read (4stages) Execute Memory (L1$: 3stages) Commit (2stages) CSE 64Entry L1 I$ 64KB 2Way Branch Target Address 8Kentry 4Way Decode & Issue RSA 10Entry RSE 8x2Entry RSF 8x2Entry RSBR 10Entry GPR 156Registers x2 GUB 32Registers FPR 64Registers x2 FUB 48Registers EAGA EAGB EXA EXB FLA FLB Fetch Port 16Entry Store Port 16Entry Store Buffer 16Entry L2$ 6MB 12Way L1 D$ 64KB 2Way PC x2 Control Registers x2 System Bus Interface

Simultaneous Multi-Threading SPARC64 VI Core0 t0 t1 t0 t1 Core1 t2 t3 t2 t3 SPARC64 VII Core0 t0 t1 Core1 t3 Core2 Core3 t7 t2 t3 t4 t5 t6 t7 The other thread is idle t1 t5 time Fine Grain Multi-thread Fetch/Issue/Commit stage: Alternatively select one of the two thread each cycle Execute stage: Select instructions based on oldest-ready independently from threads. SMT throughput increase x 1.2 INT/FP x 1.3 Java x 1.3 OLTP Split HW Queues #Active threads Two IB 4+4 Reservation Station. RSE 8x2 RSF 8x2 Rename Registers GUB 24+24 FUB 24+24 Port FP 8+8 CSE SP 8+8 32+32 To avoid interaction between threads. Automatically combined if the other thread is idle. One 8 8x2 8x2 48 48 16 8 64

Integrated Multi-core Parallel Architecture Hardware Barrier CPU Core LBSY IU EU L1$ Shared L2$ (6MB 12way) L2 cache Shared by 4 Cores to avoid false-sharing B/W between L2 cache and L1cache (2x of SPARC64 TM VI) L2$ L1$: 32byte/cycle/2core x 4core L2$ L1$: 16byte/cycle/core x 4core Hardware Barrier High speed synchronization mechanism between cores in a CPU chip. Reduce overhead for parallel execution DO J=1,N P DO I=1,M P A(I,J)=A(I,J+1)*B(I,J) P END END Handles the multi-core CPU as one equivalent fast CPU with Compiler Technology (Fujitsu s Parallelnavi Language Package 3.0) Automatic parallelization Optimize the innermost loop

Hardware Barrier Barrier resources BST: Barrier STatus BST_mask: Select Bits in BST LBSY: Last Barrier synchronization status Synchronization is established when all BST bits selected by BST_mask are set to the same value. Sample Code of Barrier Synchronization /* * %r1: VA of a window ASI, %r2:, %r3: work */ ldxa [%r1]asi_lbsy, %r2! read current LBSY not %r2! inverse LBSY and %r2, 1, %r2! mask out reserved bits stxa %r2, [%r1]asi_bst! update BST membar #storeload! to make sure stxa is complete loop: ldxa [%r1]asi_lbsy, %r3! read LBSY and %r3, 1, %r3! mask out reserved bits subcc %r3, %r2, %g0! check if status changed bne,a loop sleep! if not changed, sleep for a while Usage Each core updates BST Wait until LBSY gets flipped Synchronization time: 60ns 10 times faster than SW

Other performance enhancements Faster FMA (Fused Multiply-Add) 7cycles 6cycles DGEMM efficiency: 93% on 4 cores SPARC64 TM VII Relative Performance LINPACK=2,023Gflops with 64CPU (SPARC64 TM VI=1.0) Prefetch Improvement Hardware measured results 2.0 HW prefetch algorithm further refined SW is now able to specify strong 1.5 prefetch to avoid prefetch lost. Shared context 1.0 Virtual address space shared by two or more processes 0.5 Effective to save TLB entries New Instruction 0.0 Integer Multiply-Add with FP registers INT FP INT FP JAVA Single thread Multi-thread etc throughput Performance relative to SPARC64 TM VI Single thread performance : x1.0 - x1.2 Throughput: x1.7 - x1.9

Reliability, Availability, Serviceability Existing RAS features Cache Tag ECC (L2$) Protection Dupicate & Parity (L1$) Data ECC (L2$, L1D$) Parity (L1I$) Cache Dynamic Degradation Yes ALU/Register ECC (INT Regs) Parity (FP Regs, etc) HW Instruction Retry Yes History Yes Fetch Execute Commit Error flush New RAS features of SPARC64 VII Integer registers are ECC protected Number of checkers increased to ~3,400 IBF IWR ALU EAGA/B EXA/B FLA/B CSE RSE,RSF RSA RSBR GUB,FUB X PC GPR,FPR Memory SW visible resources Instruction Retry Fetch Execute Commit Single step execution IBF IWR ALU EAGA/B EXA/B FLA/B CSE RSE,RSF RSA RSBR GUB,FUB Update of SW visible resources PC GPR,FPR Memory SW visible resources Back to normal execution after the reexecuted Instruction gets committed without error.

RAS coverage RAM (71.5Mbits) Others L1$ L2$(tag) RF/Latch (0.9Mbits) Arch Reg (INT) Arch Reg (FP) etc Green: 1bit error Correctable Yellow: 1bit error Detectable Gray: 1bit error harmless L2$(data) Transient States More than 70% of Transient States are 1bit error detectable. Recoverable through HW Instruction Retry All RAMs are ECC protected or Duplicated Most latches are parity protected Guaranteed Data Integrity

Servers based on SPARC64 TM VII SPARC Enterprise M8000 SPARC64 TM VI or VII 46GB/s (for 4CPUs) CPU CPU SC SC MC MC DDR2 DDR2 SPARC Enterprise: UNIX Max 64CPUs / SMP Upgradeable from SPARC64 TM VI Possible to mix SPARC64 TM VI and SPARC64 TM VII within the same SB CPU SC MC DDR2 CPU SC MC DDR2 FX1: Technical Computing Single CPU node Increased Jupiter Bus freq = 1.26GHz FX1 40GB/s 40GB/s System chip is newly designed to realize Higher memory throughput SPARC64 TM VII JSC 32B DDR2 Lower Memory latency Smaller footprint 20B(in)+12B(out) JSC 32B DDR2 Performance FP throughput/socket: x1.5 STREAM benchmark: >13.5GB/s

Performance Analysis Performance Analyzer About 100 performance events can be monitored Available through cputrack() and cpustat() 8 performance events can be gathered at the same time. Commit base performance events Number of commit instructions each cycle. The cause of no commit L2$miss L1D$miss Fetch miss Execution Unit busy

SPARC64 TM VII CPI (Cycle Per Instruction) Example 3 Lower Performance Hardware measured results SE: SPARC Enterprise Server M8000 FX1: FX1 Technical Computing Server 2 CPI Memory access cost has been reduced on FX1. 1 0 Commit due to system 0 Commit due to $ 0 Commit due to core Higher Performance 0 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 SE FX1 1 Commit 2-3 Commit 4 Commit bwaves gamess milc zeusmp gromacs cactus leslie3d namd dealii soplex povray calculix Gems tonto lbm wrf sphinx3 Average

SPARC64 Future 10.0 SPARC64 TM Relative Performance (SPARC64 TM V/1.35GHz = 1.0) Hardware measured results Design History: Evolution rather than revolution Multi-Thread Througput 9.0 8.0 7.0 6.0 5.0 4.0 3.0 2.0 1.0 SPARC64 TM VII w/ FX1 SPARC64 TM VII SPARC64 TM VI SPARC64 TM V 1.0 1.5 2.0 2.5 Single-Thread (Integer) Single thread JAVA FP Throughput SPARC64 TM V (1core) RAS Single thread performance SPARC64 TM VI (2core x 2VMT) Throughput SPARC64 TM VII (4core x 2SMT) More Throughput High Performance Computing What s next?

SPARC64 TM VII Summary The same system I/F with existing SPARC64 TM VI to protect customer s investment 4core x 2SMT design realizes high throughput without sacrificing single thread performance Shared L2$ and HW barrier makes 4 cores behave as a single processor with compiler technology. The new system design has fully exploited potential of SPARC64 TM VII. Fujitsu will continue to develop SPARC64 TM series to meet the needs of a new era.

Venus Venus IBUF Dec Issue BR predict RSE RSF RSBR RSA FP/SP EX FL AGEN x8 For PETA-scale Computing Server 8core Embedded Memory Controller L1I$ L1D$ SPARC-V9 + extension (HPC-ACE) SIMD L2$ MC Fujitsu 45nm CMOS DDR3 DDR3 DDR3 128GF@socket

Abbreviations SPARC64 TM VII RSA: Reservation Station for Address generation RSE: Reservation Station for Execution RSF: Reservation Station for Floating-point RSBR: Reservation Station for Branch GUB: General Update Buffer FUB: Floating point Update Buffer GPR: General Purpose Register FPR: Floating Point Register CSE: Commit Stack Entry FP: Fetch Port SP: Store Port Chip-sets SC: System Controller MC: Memory Controller JSC: Jupiter System Controller SB: System Board Others DGEMM: Double precision matrix multiply and add