SPARC64 X: Fujitsu's New Generation 16 Core Processor for the next generation UNIX servers

1 SPARC64 X: Fujitsu's New Generation 16 Core Processor for the next generation UNIX servers
August 29, 2012
Takumi Maruyama
Processor Development Division, Enterprise Server Business Unit, Fujitsu Limited
All Rights Reserved, Copyright FUJITSU LIMITED 2012

2 Agenda
- Fujitsu Processor Development History
- SPARC64 X design concept
- SWoC (Software on Chip)
- Processor chip overview
- u-architecture
- Performance
- Summary

3 Fujitsu Processor Development
[Figure: timeline of Fujitsu mainframe (GS series) and SPARC64 processor generations by technology generation, over the periods ~1999, 2000~2003, 2004~, ~2011, and 2012~. It ends with SPARC64 X at Tr=2.95B, 28nm, in 2012~.]
Technologies named on the chart, under the headings High Performance Technology and High Reliability Technology: Virtual Machine Architecture, Software On Chip, HPC-ACE, System On Chip, Hardware Barrier, Multi-core Multi-thread, L2$ on Die, Non-Blocking $, O-O-O Execution, Super-Scalar, Single-chip CPU, Store Ahead, Branch History, Prefetch, $ ECC, Register/ALU Parity, Instruction Retry, $ Dynamic Degradation, RC/RT/History.
Mainframe labels on the chart: GS8600, GS8800, GS8800B, GS8900, GS21.
Processor generation labels on the chart: II, GP, V, V+, VI, VII, VIIIfx, IXfx, X.
Transistor counts and process nodes on the chart: Tr=10M (CMOS Al 350nm), Tr=30M (CMOS Al 250nm / 220nm), Tr=30M (180nm / 150nm), Tr=46M (180nm), Tr=190M (130nm), Tr=400M (90nm), Tr=500M (90nm), Tr=540M (90nm), Tr=600M (65nm), Tr=760M (45nm), Tr=1B (40nm), Tr=2.95B (28nm).

4 SPARC64 X Design Concept
Combine UNIX and HPC FJ processor features to realize an extremely high throughput UNIX processor.
- SPARC64 VII/VII+ (UNIX processor) features
  - High CPU frequency (up to 3GHz)
  - Multicore/Multithread
  - Scalability: up to 64 sockets
- SPARC64 VIIIfx (HPC processor) features
  - HPC-ACE: innovative ISA extensions to SPARC-V9
  - High memory B/W: peak 64GB/s, embedded memory controller
- Add new features vital to current and future UNIX servers
  - Virtual Machine Architecture
  - Software On Chip
  - Embedded IOC (PCI-GEN3 controller)
  - Direct CPU-CPU interconnect

5 Software on Chip 1/2
HW for SW: accelerates specific software functions with HW.
- The targets
  - Decimal operation (IEEE754 decimal and NUMBER)
  - Cypher operation (AES/DES)
  - Database acceleration
- HW implementation
  - The HW engines for SWoC are implemented in the FPU, to fully utilize the 128 FP registers and software pipelining.
  - They are implemented as instructions rather than as a dedicated co-processor, to maximize flexibility of SW.
  - To avoid the complication of CISC type instructions, various RISC type instructions are newly defined instead: 18 instructions for Decimal and 10 instructions for Cypher operation.

6 Software on Chip 2/2: Decimal Instructions
- Supported data types
  - IEEE754 DPD (Densely Packed Decimal): 8B fixed length
  - NUMBER: variable length (max 21 bytes)
- Instructions
  - Both DPD and NUMBER instructions are defined as 8B operations (add/sub/mul/div/cmp) on FP registers, to maximize performance with reasonable HW cost.
  - When the data length is > 8 bytes, multiple such instructions are used.
  - An instruction for a special byte-shift on FP registers is newly added to support unaligned NUMBER data.
[Figure: byte-shift on FP registers, with source operands Fd[rs1] and Fd[rs2] and result Fd[rd]]
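
The point that operands longer than 8 bytes are handled by several 8B instructions can be illustrated with a small model. This is only a sketch under stated assumptions: `add8` is a hypothetical stand-in for a single 8-byte decimal add, each 8-byte operand is assumed to hold 16 decimal digits, and sign, exponent, and the actual DPD and NUMBER encodings are ignored. None of the names below come from the slides.

```python
DIGITS_PER_LIMB = 16                 # assumption: one 8-byte operand holds 16 decimal digits
LIMB_BASE = 10 ** DIGITS_PER_LIMB


def add8(a, b, carry_in=0):
    """Hypothetical stand-in for one 8-byte decimal add: returns (sum_limb, carry_out)."""
    total = a + b + carry_in
    return total % LIMB_BASE, total // LIMB_BASE


def to_limbs(value):
    """Split a non-negative integer into little-endian 16-digit limbs."""
    limbs = []
    while True:
        limbs.append(value % LIMB_BASE)
        value //= LIMB_BASE
        if value == 0:
            return limbs


def from_limbs(limbs):
    """Rebuild the integer from little-endian limbs."""
    return sum(limb * LIMB_BASE ** i for i, limb in enumerate(limbs))


def add_long(x_limbs, y_limbs):
    """Add two multi-limb decimals using one add8 step per limb, propagating the carry."""
    result, carry = [], 0
    for i in range(max(len(x_limbs), len(y_limbs))):
        a = x_limbs[i] if i < len(x_limbs) else 0
        b = y_limbs[i] if i < len(y_limbs) else 0
        limb, carry = add8(a, b, carry)
        result.append(limb)
    if carry:
        result.append(carry)
    return result


x = 12345678901234567890123456789012345   # 35 digits, needs 3 limbs
y = 99999999999999999999999999999999999
total = from_limbs(add_long(to_limbs(x), to_limbs(y)))
```

In this model a 35-digit operand takes three `add8` steps, which mirrors the slide's statement that longer data uses multiple 8B instructions.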

7 SPARC64 X Chip Overview
- Architecture features
  - 16 cores x 2 threads
  - SWoC (Software on Chip)
  - Shared 24 MB L2$
  - Embedded memory and IO controller
- 28nm CMOS
- 23.5mm x 25.0mm
- 2,950M transistors
- 1,500 signal pins
- 3GHz
- Performance (peak)
  - 288GIPS / 382GFlops
  - 102GB/s memory throughput
[Figure: die photo labeled with DDR3 Interface (x2), MAC (x2), L2 Cache Data (x2), L2 Cache Control, SERDES PCI GEN3, and SERDES Inter-CPU]
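
The peak floating-point figure can be cross-checked with simple arithmetic. The sketch below assumes that each FMA unit contributes 2 floating-point operations per cycle and uses the 4 FMA units per core listed on the core spec slide. The function name and constants are illustrative and do not come from the slides.

```python
CORES = 16               # from the chip overview slide
FMA_UNITS_PER_CORE = 4   # "FMA x4" on the core spec slide
FLOPS_PER_FMA = 2        # assumption: a fused multiply-add counts as 2 operations
CLOCK_GHZ = 3.0          # rounded clock from the chip overview slide


def peak_gflops(cores, fma_units, flops_per_fma, clock_ghz):
    """Peak GFlops = cores x FMA units x operations per FMA x clock in GHz."""
    return cores * fma_units * flops_per_fma * clock_ghz


estimate = peak_gflops(CORES, FMA_UNITS_PER_CORE, FLOPS_PER_FMA, CLOCK_GHZ)

# Clock that would reproduce the slide's 382 GFlops under the same assumptions.
implied_clock_ghz = 382 / (CORES * FMA_UNITS_PER_CORE * FLOPS_PER_FMA)
```

At a rounded 3.0GHz this estimate gives 384 GFlops, slightly above the 382 GFlops on the slide. Under the same assumptions the slide's figure corresponds to a clock of about 2.98GHz.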

8 SPARC64 X core spec
Instruction Set Architecture: SPARC-V9/JPS, HPC-ACE, VM, SWoC
Branch Prediction: 4K BRHIS, 16K PHT
Integer Execution Units: 156 GPR x 2 + 64 GUB; ALU/SHIFT x2; ALU/AGEN x2; MULT/DIVIDE x1
FP Execution Units: 128 FPR x 2 + 64 FUB; FMA x4, FDIV x2; IMA/Logic x4; Decimal x1 / Cypher x2
L1$: L1I$ 64KB/4way; L1D$ 64KB/4way
[Figure: core layout labeled with L1I$, L1$ control, Instruction Control, L1D$, Execution Unit, and Register File]

9 CPU u-architecture enhancements from SPARC64 VII+
- Deeper pipeline to increase frequency
- Better branch prediction scheme
- Various queue sizes and #floating point registers increased
- Richer execution units, including
  - 2EX + 2EAG -> 2EX + 2EX/EAG
  - 2FMA -> 4FMA to support 2way-SIMD
  - SWoC engine (Decimal and Cypher)
- More aggressive O-O-O execution of load and store
- Multi-banked 2port L1-Cache
- System On Chip
  - #core and L2$ size (4core/12MB -> 16core/24MB)
  - Memory Controller, IO Controller, and CPU-CPU I/F are all embedded to increase performance and reduce cost.

10 SPARC64 VII/VII+ Pipeline
[Figure: pipeline block diagram]
Stages: Fetch (4 stages), Issue (2 stages), Dispatch / Reg.-Read (4 stages), Execute, Memory (L1$: 3 stages), Commit (2 stages)
- Fetch: L1 I$ 64KB 2Way; Branch Target Address 8Kentry
- Decode & Issue: RSA 10Entry; RSE 8x2Entry; RSF 8x2Entry; RSBR 10Entry
- Registers: GPR 156Registers x2; GUB 32Registers; FPR 64Registers x2; FUB 48Registers
- Execution units: EAGA, EAGB, EXA, EXB, FLA, FLB
- Memory: Fetch Port 16Entry; Store Port 16Entry; Store Buffer 16Entry; L1 D$ 64KB 2Way
- Commit: CSE 64Entry; PC x2; Control Registers x2
- Chip level: L2$ 6MB/12MB 12Way; 4-core; System Bus Interface

11 SPARC64 X Pipeline
[Figure: pipeline block diagram]
Stages: Fetch (4 stages), Issue (4 stages), Dispatch / Reg.-Read (5 stages), Execute, Memory (L1$: 3 stages), Commit (2 stages)
- Fetch: L1 I$ 64KB 4Way; Branch Target Address 4Kentry; Pattern History Table 16Kentry
- Decode & Issue: RSA 24Entry; RSE 24Entry; RSF 20Entry; RSBR 16Entry
- Registers: GPR 156Registers x2; GUB 64Registers; FPR 128Registers x2; FUB 64Registers
- Execution units: EAGA/EXC, EAGB/EXD, EXA, EXB; FLA, FLB, FLC, FLD, with the Decimal and Cypher engines attached to the floating-point units
- Memory: Fetch Port 32Entry; Store Port 24Entry; Write Buffer 10Entry; L1 D$ 64KB 4Way
- Commit: CSE 96Entry; PC x2; Control Registers x2
- Chip level: L2$ 24MB 24Way; 16-core; Router; CPU-CPU I/F; Memory Controller (DIMM); IO Controller (PCI-GEN3)

12 Execution units enhancements (Ex.)
- Integer Execution Unit
  - 2EX + 2EAG -> 2EX + 2EX/EAG
  - 2 -> 4W GPR
  - 4 integer instructions can be executed per cycle (sustained)
  [Figure: EXA, EXB, EXC, and EXD update the GUB, which commits to the GPR]
- Load Store Unit
  - Aggressive load/store O-O-O execution: a load executes without waiting for the address calculation of preceding stores.
  - Multi-banked 2port L1-cache to execute 2 loads or 1 load + 1 store in parallel
  - Doubled L1$ bus size
  - Doubled L1$ associativity (2 -> 4way)
  - Increased L1-cache throughput and hit-rate
  [Figure: banked L1 cache, 2R/1W, with a 16B store path and 16B load x2]

13 SPARC64 X interconnects
- SPARC64 VII/VII+ interconnects (SPARC Enterprise M8000)
  - 4 CPUs require 8 additional LSIs to be connected with DIMM
  - DIMM i/f: 4.35GB/s (STREAM triad)
  [Figure: each CPU connects through an SC and a MAC to DDR2 DIMMs]
- SPARC64 X interconnects
  - No additional LSIs are needed to connect with DIMM
  - DIMM i/f: 65.6GB/s (STREAM triad)
  - CPU i/f: 14.5GB/s x 5 ports (peak)
    - 3 ports: glueless 4way CPU interconnect
    - 2 ports: > 4way CPU
  [Figure: four CPUs linked directly at 14.5GB/s, each attached to DDR3 DIMMs at 102GB/s]
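
The two STREAM triad figures on this slide can be related to the other bandwidth numbers in the deck with a short calculation. The sketch below only divides figures quoted on the slides; the variable names are illustrative, and reading the second ratio as a "fraction of peak" is an interpretation rather than something the slides state.

```python
STREAM_VII_GBS = 4.35    # SPARC64 VII/VII+ DIMM i/f, STREAM triad
STREAM_X_GBS = 65.6      # SPARC64 X DIMM i/f, STREAM triad
PEAK_X_GBS = 102.0       # SPARC64 X peak memory throughput

# Measured improvement over the previous generation.
speedup = STREAM_X_GBS / STREAM_VII_GBS

# Share of the quoted peak that the STREAM triad measurement reaches.
fraction_of_peak = STREAM_X_GBS / PEAK_X_GBS

print(f"speedup: {speedup:.1f}x, fraction of peak: {fraction_of_peak:.0%}")
```

The speedup of roughly 15x is consistent with the 15x memory throughput quoted on the performance slide.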

14 High Speed Transceivers (SerDes)
- CPU-CPU glue-less communication links
  - 14.5Gb/s x 8 lanes bi-directional serial interface, 5 ports
  - Embedded equalizer circuit enables long distance signal transmission
  - Embedded adaptive control logic optimizes equalizer parameters automatically depending on the various system configurations
- PCI Express ports
  - 8Gb/s x 8 lanes (Gen 3), 2 ports
- Built-in SerDes provides peak 88.5GB/s x2 (up/down) total throughput
[Figure: SerDes block with RX, PLL, and TX, 14.5Gb/s x 8 lanes]
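
The 88.5GB/s total can be reproduced from the per-lane rates on this slide. The sketch below assumes raw line rates with 8 bits per byte and no encoding or protocol overhead, which is an assumption made here for the arithmetic; the helper name is illustrative.

```python
BITS_PER_BYTE = 8


def port_gbytes_per_s(lane_gbits_per_s, lanes):
    """Raw one-direction bandwidth of one port in GB/s (no encoding overhead assumed)."""
    return lane_gbits_per_s * lanes / BITS_PER_BYTE


cpu_link = port_gbytes_per_s(14.5, 8)      # one CPU-CPU port
pcie_port = port_gbytes_per_s(8.0, 8)      # one PCI Express Gen 3 port

cpu_total = cpu_link * 5                   # 5 CPU-CPU ports
pcie_total = pcie_port * 2                 # 2 PCI Express ports
serdes_total = cpu_total + pcie_total      # per direction

print(cpu_link, cpu_total, pcie_total, serdes_total)
```

Under these assumptions each CPU-CPU port carries 14.5GB/s per direction, matching the CPU i/f figure on the interconnect slide, and the per-direction sum over all ports is 88.5GB/s.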

15 Reliability, Availability, Serviceability
Units                        Error detection and correction scheme
Cache (Tag)                  ECC; Duplicate & Parity
Cache (Data)                 ECC; Parity
Register                     ECC (INT/FP); Parity (Others)
ALU                          Parity/Residue
Cache dynamic degradation    Yes
HW Instruction Retry         Yes
History                      Yes
[Figure: SPARC64 X RAS diagram. Green: 1bit error correctable. Yellow: 1bit error detectable. Gray: 1bit error harmless.]
New RAS features from SPARC64 VII/VII+
- Floating-point registers are ECC protected
- #checkers increased to ~53,000 to identify a failure point more precisely
Guarantees data integrity

16 Hardware Instruction Retry
When an error is detected, the hardware re-executes the instruction automatically to remove the transient error by itself.
1. Error
2. Flush
3. Single step execution
4. Update of SW visible resources
5. Back to normal execution after the re-executed instruction gets committed without an error.
[Figure: Fetch / Execute / Commit pipeline shown at error time and during retry, with IBF, IWR, CSE, PC, ALU (EAGA/B, EXA/B, FLA/B), RSE, RSF, RSA, RSBR, GUB, FUB, and the SW visible resources GPR, FPR, and Memory]
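
The five steps above can be sketched as a toy software model. This is not a description of the actual hardware: the program representation, the fault injection by index, and the `run_with_retry` function are all hypothetical and exist only to show the order of error, flush, single-step re-execution, and commit.

```python
class TransientError(Exception):
    """Models a detected transient error during execution."""


def run_with_retry(program, fault_at=None, max_retries=3):
    """Run (dest, fn) instructions; a fault at index fault_at occurs once, then clears."""
    regs = {}                    # SW visible resources (committed state only)
    log = []
    pending_faults = set() if fault_at is None else {fault_at}
    for pc, (dest, fn) in enumerate(program):
        retries = 0
        while True:
            update_buffer = {}   # uncommitted result, like an update buffer
            try:
                if pc in pending_faults:
                    pending_faults.discard(pc)       # transient: gone on retry
                    raise TransientError(pc)
                update_buffer[dest] = fn(regs)
                break
            except TransientError:
                log.append(("error", pc))            # 1. Error
                update_buffer.clear()
                log.append(("flush", pc))            # 2. Flush
                retries += 1
                if retries > max_retries:
                    raise
                log.append(("single-step", pc))      # 3. Single step execution
        regs.update(update_buffer)                   # 4. Update of SW visible resources
        log.append(("commit", pc))                   # 5. Normal execution resumes
    return regs, log


program = [("a", lambda r: 2), ("b", lambda r: r["a"] + 3)]
regs, log = run_with_retry(program, fault_at=1)
```

The design point the model tries to capture is that committed state is updated only after an instruction completes without error, so a flush never has to undo SW visible changes.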

17 SPARC64 X: Hardware measured results
[Figure: bar chart of relative throughput, with bars labeled 7x, 8x, 15x, and 98x (SWoC)]
- SPARC64 X realizes 7x INT/FP/JVM throughput and 15x memory throughput of SPARC64 VII+.
- The INT/FP/JVM result is with an un-tuned compiler/JVM.
- SWoC of SPARC64 X results in max 98x throughput.
- The NUMBER score is for scalar data. It is expected to be much better for vector data.

18 SPARC64 X CPI (Cycle Per Instruction) Example
SPARC64 VII+ vs. SPARC64 X, INT (single thread), hardware measured results
[Figure: CPI breakdown from lower performance (SPARC64 VII+) to higher performance (SPARC64 X), with steps labeled: Shorter memory latency; Large L2$; Improved branch prediction; 2 -> 4way L1$; Increased throughput of L1$; 2EX+2EAG -> 2EX+2EX/EAG; 2 -> 4W GPR]
- The 4 integer execution units and the increased number of GPR (integer register) write ports improve overall performance.
- Memory latency reduction, the large L2$, branch prediction, and the L1$ improvements also contribute substantially to the high performance.

19 Summary
- SPARC64 X is Fujitsu's 10th SPARC processor, designed for Fujitsu's next generation UNIX servers.
- SPARC64 X integrates 16 cores + 24MB L2 cache with over 100GB/s (peak) memory B/W.
- SPARC64 X keeps strong RAS features.
- The SPARC64 X chip is up and running in the lab. It has shown 7 times the throughput of SPARC64 VII+ without compiler tuning.
- SWoC is effective in accelerating specific software functions.
- Fujitsu will continue to develop the SPARC64 series.

20 SPARC64 X Abbreviations
- IB: Instruction Buffer
- RSA: Reservation Station for Address generation
- RSE: Reservation Station for Execution
- RSF: Reservation Station for Floating-point
- RSBR: Reservation Station for Branch
- GUB: General Update Buffer
- FUB: Floating point Update Buffer
- GPR: General Purpose Register
- FPR: Floating Point Register
- CSE: Commit Stack Entry
