SPARC64 X: Fujitsu's New Generation 16 Core Processor for the next generation UNIX servers
August 29, 2012
Takumi Maruyama
Processor Development Division, Enterprise Server Business Unit, Fujitsu Limited
All Rights Reserved, Copyright FUJITSU LIMITED 2012
Agenda
- Fujitsu Processor Development History
- SPARC64 X
  - Design concept
  - SWoC (Software on Chip)
  - Processor chip overview
  - u-architecture
  - Performance
- Summary
Fujitsu Processor Development
[Figure: timeline of Fujitsu processor generations in four periods (~1999, 2000~2003, 2004~2007, 2008~2011, 2012~), covering mainframe processors (GS8600, GS8800, GS8800B, GS8900, GS21 600, GS21 900, GS21) and SPARC64 processors (II, GP, V, V+, VI, VII, VIIIfx, IXfx, X). Each box marks a technology generation.]
Technology labels in the figure, grouped in the original under "High Performance Technology" and "High Reliability Technology": Virtual Machine Architecture, Software On Chip, HPC-ACE, System On Chip, Hardware Barrier, Multi-core / Multi-thread, L2$ on Die, Non-Blocking $, O-O-O Execution, Super-Scalar, Single-chip CPU, Store Ahead, Branch History, Prefetch, $ ECC, Register/ALU Parity, Instruction Retry, $ Dynamic Degradation, RC/RT/History.
Transistor count and process labels in the figure: Tr=10M (CMOS Al, 350nm); Tr=30M (CMOS Al, 250nm / 220nm); Tr=30M (180nm / 150nm); Tr=46M (180nm); Tr=190M (130nm, shown twice); Tr=400M (90nm); Tr=500M (90nm); Tr=540M (90nm); Tr=600M (65nm); Tr=760M (45nm); Tr=1B (40nm); Tr=2.95B (28nm, SPARC64 X).
SPARC64 X Design Concept
Combine UNIX and HPC FJ processor features to realize an extremely high throughput UNIX processor.
- SPARC64 VII/VII+ (UNIX processor) features
  - High CPU frequency (up to 3GHz)
  - Multicore/Multithread
  - Scalability: up to 64 sockets
- SPARC64 VIIIfx (HPC processor) features
  - HPC-ACE: innovative ISA extensions to SPARC-V9
  - High memory B/W: peak 64GB/s, embedded memory controller
- Add new features vital to current and future UNIX servers
  - Virtual Machine Architecture
  - Software On Chip
  - Embedded IOC (PCI-GEN3 controller)
  - Direct CPU-CPU interconnect
Software on Chip 1/2: HW for SW
- Accelerates specific software functions with HW
- The targets
  - Decimal operation (IEEE754 decimal and NUMBER)
  - Cypher operation (AES/DES)
  - Database acceleration
- HW implementation
  - The HW engines for SWoC are implemented in the FPU, to fully utilize the 128 FP registers and software pipelining.
  - Implemented as instructions rather than as a dedicated co-processor, to maximize flexibility for SW.
  - To avoid the complication of CISC type instructions, various RISC type instructions are newly defined instead.
  - 18 instructions for Decimal and 10 instructions for Cypher operation
Software on Chip 2/2: Decimal Instructions
- Supported data types
  - IEEE754 DPD (Densely Packed Decimal): 8B fixed length
  - NUMBER: variable length (max 21 bytes)
- Instructions
  - Both DPD and NUMBER instructions are defined as 8B operations (add/sub/mul/div/cmp) on FP registers, to maximize performance at reasonable HW cost.
  - When the data length is > 8 bytes, multiple such instructions are used.
  - An instruction for a special byte-shift on FP registers is newly added to support unaligned NUMBER.
[Figure: byte-shift diagram with source operands Fd[rs1] and Fd[rs2], mask patterns, and result Fd[rd].]
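The idea of handling data longer than 8 bytes with several 8B operations can be sketched in software as a chunked add with carry propagation. This is only an illustration under stated assumptions: the chunk is modeled as 16 decimal digits (as if 8 bytes held packed BCD), which is neither the DPD encoding nor the NUMBER format, and `add_chunk`, `split_chunks`, `add_long` and `join_chunks` are hypothetical helper names, not the actual instructions.

```python
# Assumption: one 8-byte chunk is modeled as 16 decimal digits.
DIGITS_PER_CHUNK = 16
CHUNK_MOD = 10 ** DIGITS_PER_CHUNK


def add_chunk(a, b, carry_in=0):
    """Hypothetical 8-byte decimal add: returns (sum_chunk, carry_out)."""
    total = a + b + carry_in
    return total % CHUNK_MOD, total // CHUNK_MOD


def split_chunks(value):
    """Split a non-negative integer into chunks, least significant first."""
    chunks = []
    while True:
        chunks.append(value % CHUNK_MOD)
        value //= CHUNK_MOD
        if value == 0:
            return chunks


def add_long(a_chunks, b_chunks):
    """Add two multi-chunk decimals using one chunk-sized add per step."""
    n = max(len(a_chunks), len(b_chunks))
    a = a_chunks + [0] * (n - len(a_chunks))
    b = b_chunks + [0] * (n - len(b_chunks))
    result, carry = [], 0
    for x, y in zip(a, b):
        s, carry = add_chunk(x, y, carry)
        result.append(s)
    if carry:
        result.append(carry)
    return result


def join_chunks(chunks):
    """Reassemble chunks (least significant first) into one integer."""
    value = 0
    for c in reversed(chunks):
        value = value * CHUNK_MOD + c
    return value
```

A 20-digit operand needs two chunks here, so its addition takes two chunk-sized adds with a carry passed between them.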
SPARC64 X Chip Overview
[Die photo labels: DDR3 Interface (x2), MAC (x2), L2 Cache Data (x2), L2 Cache Control, SERDES PCI GEN3, SERDES Inter-CPU]
Architecture features
- 16 cores x 2 threads
- SWoC (Software on Chip)
- Shared 24 MB L2$
- Embedded memory and IO controller
28nm CMOS
- 23.5mm x 25.0mm
- 2,950M transistors
- 1,500 signal pins
- 3GHz
Performance (peak)
- 288 GIPS / 382 GFlops
- 102GB/s memory throughput
SPARC64 X spec
[Core floorplan labels: L1I$, L1$ control, Instruction Control, L1D$, Execution Unit, Register File]
Instruction Set Architecture: SPARC-V9/JPS, HPC-ACE, VM, SWoC
Branch Prediction: 4K BRHIS, 16K PHT
Integer Execution Units: 156 GPR x 2 + 64 GUB; ALU/SHIFT x2, ALU/AGEN x2, MULT/DIVIDE x1
FP Execution Units: 128 FPR x 2 + 64 FUB; FMA x4, FDIV x2, IMA/Logic x4, Decimal x1 / Cypher x2
L1$: L1I$ 64KB/4way, L1D$ 64KB/4way
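As a rough cross-check of the peak floating-point figure, the FMA count in the spec can be multiplied out. This is a back-of-the-envelope sketch, assuming each FMA unit counts as 2 flops per cycle (the usual convention, not stated in the slides) and a nominal 3GHz clock. It gives 384 GFlops, close to but not exactly the quoted 382 GFlops; the slides do not explain the small difference.

```python
CORES = 16               # from the chip overview
FMA_UNITS_PER_CORE = 4   # "FMA x4" in the spec
FLOPS_PER_FMA = 2        # assumption: one multiply plus one add per FMA
CLOCK_GHZ = 3.0          # nominal clock; the exact value is an assumption


def peak_gflops(cores=CORES, fma_units=FMA_UNITS_PER_CORE,
                flops_per_fma=FLOPS_PER_FMA, clock_ghz=CLOCK_GHZ):
    """Peak GFlops = cores x FMA units x flops per FMA x clock (GHz)."""
    return cores * fma_units * flops_per_fma * clock_ghz


if __name__ == "__main__":
    print(peak_gflops())
```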
CPU u-architecture enhancements from SPARC64 VII+
- Deeper pipeline to increase frequency
- Better branch prediction scheme
- Increased queue sizes and number of floating-point registers
- Richer execution units, including
  - 2EX + 2EAG -> 2EX + 2EX/EAG
  - 2FMA -> 4FMA, to support 2way-SIMD
  - SWoC engine (Decimal and Cypher)
- More aggressive O-O-O execution of load and store
- Multi-banked 2port L1-Cache
System On Chip
- Increased #core and L2$ size (4core/12MB -> 16core/24MB)
- Memory Controller, IO Controller, and CPU-CPU I/F are all embedded to increase performance and reduce cost.
SPARC64 VII/VII+ Pipeline
[Figure: pipeline block diagram. Recoverable labels follow.]
Stages: Fetch (4 stages), Issue (2 stages), Dispatch / Reg.-Read (4 stages), Execute, Memory (L1$: 3 stages), Commit (2 stages)
- Fetch: L1 I$ 64KB 2Way, Branch Target Address 8K entries
- Decode & Issue
- Reservation stations: RSA 10 entries, RSE 8x2 entries, RSF 8x2 entries, RSBR 10 entries
- Registers: GPR 156 registers x2, GUB 32 registers, FPR 64 registers x2, FUB 48 registers
- Execution units: EAGA, EAGB, EXA, EXB, FLA, FLB
- Memory: Fetch Port 16 entries, Store Port 16 entries, Store Buffer 16 entries, L1 D$ 64KB 2Way
- Commit: CSE 64 entries, PC x2, Control Registers x2
- L2$ 6MB/12MB 12Way, 4-core, System Bus Interface
SPARC64 X Pipeline
[Figure: pipeline block diagram. Recoverable labels follow.]
Stages: Fetch (4 stages), Issue (4 stages), Dispatch / Reg.-Read (5 stages), Execute, Memory (L1$: 3 stages), Commit (2 stages)
- Fetch: L1 I$ 64KB 4Way, Branch Target Address 4K entries, Pattern History Table 16K entries
- Decode & Issue
- Reservation stations: RSA 24 entries, RSE 24 entries, RSF 20 entries, RSBR 16 entries
- Registers: GPR 156 registers x2, GUB 64 registers, FPR 128 registers x2, FUB 64 registers
- Execution units: EAGA/EXC, EAGB/EXD, EXA, EXB, FLA, FLB, FLC, FLD, Decimal, Cypher x2
- Memory: Fetch Port 32 entries, Store Port 24 entries, Write Buffer 10 entries, L1 D$ 64KB 4Way
- Commit: CSE 96 entries, PC x2, Control Registers x2
- L2$ 24MB 24Way, 16-core
- Router, CPU-CPU I/F, Memory Controller (DIMM), IO Controller (PCI-GEN3)
Execution units enhancements (Ex.)
Integer Execution Unit
- 2EX + 2EAG -> 2EX + 2EX/EAG
- 2 -> 4W GPR
- 4 integer instructions can be executed per cycle (sustained)
[Figure: EXA, EXB, EXC and EXD update the GUB; the GPR is updated at commit.]
Load Store Unit
- Aggressive load/store O-O-O execution: a load executes without waiting for the address calculation of preceding stores.
- Multi-banked 2port L1-cache to execute 2 loads or 1 load + 1 store in parallel
- Doubled L1$ bus size
- Doubled L1$ associativity (2 -> 4way)
- Increased L1-cache throughput and hit rate
[Figure: banked L1 cache, 2R/1W, with a 16B store path and 16B load x2 paths.]
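Executing a load before the addresses of older stores are known is a form of speculation: the result is only safe if no older store turns out to write the same address. The toy model below illustrates that general idea. It is a sketch under assumptions, not the actual hardware mechanism: the slides do not say how a conflict is handled, so the replay on an address match, and the names `run_in_order` and `run_speculative`, are hypothetical.

```python
def run_in_order(memory, program):
    """Reference model: every operation executes in program order.

    program is a list of ("store", addr, value) or ("load", addr) tuples.
    Returns {load index: loaded value}.
    """
    mem = dict(memory)
    loads = {}
    for i, op in enumerate(program):
        if op[0] == "store":
            mem[op[1]] = op[2]
        else:
            loads[i] = mem.get(op[1], 0)
    return loads


def run_speculative(memory, program):
    """Loads run first; stores resolve later and trigger replays on a match."""
    mem = dict(memory)
    loads = {}
    replays = 0
    # Phase 1: loads issue early, before any older store address is known.
    for i, op in enumerate(program):
        if op[0] == "load":
            loads[i] = mem.get(op[1], 0)
    # Phase 2: store addresses resolve in program order. A younger load
    # to the same address read stale data, so it is replayed (assumption).
    for i, op in enumerate(program):
        if op[0] != "store":
            continue
        _, addr, value = op
        mem[addr] = value
        for j in range(i + 1, len(program)):
            younger = program[j]
            if younger[0] == "load" and younger[1] == addr:
                loads[j] = value
                replays += 1
    return loads, replays
```

In this model the speculative run always ends with the same load values as the in-order run; the replay count shows how often the early guess was wrong.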
SPARC64 X interconnects
SPARC64 VII/VII+ interconnects (SPARC Enterprise M8000)
[Figure: CPUs connected through SC and MAC chips to DDR2 DIMMs.]
- 4 CPUs require 8 additional LSIs to be connected with DIMMs
- DIMM i/f: 4.35GB/s (STREAM triad)
SPARC64 X interconnects
[Figure: four CPUs connected directly to each other at 14.5GB/s, each attached to DDR3 DIMMs at 102GB/s.]
- No additional LSIs are needed to connect with DIMMs
- DIMM i/f: 65.6GB/s (STREAM triad)
- CPU i/f: 14.5GB/s x 5 ports (peak)
  - 3 ports: glueless 4way CPU interconnect
  - 2 ports: > 4way CPU
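The two STREAM triad figures above can be compared directly. The short calculation below is a sketch of that arithmetic only; it gives a ratio of about 15.1, which appears consistent with the 15x memory throughput quoted in the performance results, although the slides do not state that the 15x figure was derived this way.

```python
OLD_STREAM_GBPS = 4.35   # DIMM i/f, previous generation (STREAM triad)
NEW_STREAM_GBPS = 65.6   # DIMM i/f, new generation (STREAM triad)


def stream_ratio(new=NEW_STREAM_GBPS, old=OLD_STREAM_GBPS):
    """Ratio of measured STREAM triad memory throughput, new over old."""
    return new / old


if __name__ == "__main__":
    print(f"{stream_ratio():.1f}x")
```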
High Speed Transceivers (SerDes)
- CPU-CPU glue-less communication links
  - 14.5Gb/s x 8 lanes bi-directional serial interface, 5 ports
  - Embedded equalizer circuit enables long distance signal transmission
  - Embedded adaptive control logic optimizes equalizer parameters automatically depending on the system configuration
- PCI Express ports
  - 8Gb/s x 8 lanes (Gen 3), 2 ports
[Figure: SerDes block with RX, TX and PLL; 14.5Gb/s x 8 lanes.]
Built-in SerDes provides a peak total throughput of 88.5GB/s x 2 (up/down).
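The 88.5GB/s total can be reproduced from the lane rates and port counts listed above. The sketch below assumes raw line rates with no encoding or protocol overhead, which is the only reading under which the quoted numbers add up exactly.

```python
def port_gbytes_per_s(lane_gbits_per_s, lanes):
    """Raw one-direction throughput of one port in GB/s (8 bits per byte)."""
    return lane_gbits_per_s * lanes / 8


def serdes_total_gbytes_per_s():
    """Peak one-direction total over all CPU-CPU and PCI Express ports."""
    cpu_links = 5 * port_gbytes_per_s(14.5, 8)  # 5 ports x 14.5 GB/s
    pcie_links = 2 * port_gbytes_per_s(8.0, 8)  # 2 ports x 8 GB/s
    return cpu_links + pcie_links


if __name__ == "__main__":
    print(serdes_total_gbytes_per_s())
```

Each CPU-CPU port comes out at 14.5GB/s, matching the per-port CPU i/f figure on the interconnect slide.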
Reliability, Availability, Serviceability
Units and their error detection and correction scheme:
- Cache (Tag): ECC, Duplicate & Parity
- Cache (Data): ECC, Parity
- Register: ECC (INT/FP), Parity (Others)
- ALU: Parity/Residue
- Cache dynamic degradation: Yes
- HW Instruction Retry: Yes
- History: Yes
[Figure: SPARC64 X RAS diagram. Green: 1bit error correctable. Yellow: 1bit error detectable. Gray: 1bit error harmless.]
New RAS features from SPARC64 VII/VII+
- Floating-point registers are ECC protected
- Number of checkers increased to ~53,000 to identify a failure point more precisely
Guarantees data integrity
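Parity and residue checks are standard ways to detect single-bit errors in storage and arithmetic. The sketch below shows the general principle only; the modulus 3 and the function names are assumptions for illustration, since the slides do not describe how the checkers are built. A residue check recomputes the result modulo a small number from the operand residues and compares it with the residue of the actual result.

```python
MODULUS = 3  # assumption: a small odd modulus, common in residue checkers


def parity(word):
    """Even/odd parity bit of a non-negative integer (1 if odd bit count)."""
    return bin(word).count("1") & 1


def residue(value):
    """Residue of a non-negative integer under the check modulus."""
    return value % MODULUS


def checked_add(a, b, fault_mask=0):
    """Add two non-negative integers and verify the result by residue.

    fault_mask is XORed into the sum to model a transient bit flip.
    Returns (result, ok) where ok is False if the check detects an error.
    """
    result = (a + b) ^ fault_mask
    expected = (residue(a) + residue(b)) % MODULUS
    return result, residue(result) == expected
```

A single flipped bit changes the result by a power of two, which is never a multiple of 3, so this check catches every single-bit error in the sum.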
Hardware Instruction Retry
When an error is detected, the hardware automatically re-executes the instruction to remove the transient error by itself.
1. Error
2. Flush
3. Single step execution
4. Update of SW visible resources
5. Back to normal execution after the re-executed instruction is committed without an error.
[Figure: Fetch / Execute / Commit pipeline shown at the error and during retry, with IBF, IWR, CSE and PC; ALU (EAGA/B, EXA/B, FLA/B); RSE, RSF, RSA, RSBR; GUB, FUB; and the SW visible resources GPR, FPR and Memory.]
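The five retry steps can be modeled as a small loop that buffers each result and only updates software-visible state after an error-free execution. This is a simplified sketch, not the hardware design: the `error_on` set used to inject transient errors, the `max_retries` limit and the behavior when it is exceeded are all assumptions, since the slides do not cover persistent errors.

```python
def run_with_retry(program, registers, error_on, max_retries=3):
    """Execute instructions one by one, retrying on detected errors.

    program: list of functions mapping a register dict to a dict of updates.
    error_on: set of (pc, attempt) pairs where a transient error is detected.
    Returns (final registers, event log).
    """
    regs = dict(registers)
    log = []
    pc = 0
    while pc < len(program):
        attempt = 0
        while True:
            # Execute into a buffer; visible registers are untouched so far.
            updates = program[pc](regs)
            if (pc, attempt) not in error_on:
                break
            log.append(("flush", pc))        # steps 1-2: error, then flush
            attempt += 1
            if attempt > max_retries:
                raise RuntimeError(f"persistent error at pc={pc}")
            log.append(("single_step", pc))  # step 3: re-execute alone
        regs.update(updates)                 # step 4: update visible state
        pc += 1                              # step 5: back to normal flow
    return regs, log
```

Because the failed attempt never reaches `regs.update`, the retried instruction sees exactly the state it would have seen without the error.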
SPARC64 X Performance @3GHz
Hardware measured results, relative to SPARC64 VII+ @2.86GHz
[Chart labels: 7x, 8x, 15x, and 98x (SWoC)]
- SPARC64 X realizes 7x the INT/FP/JVM throughput and 15x the memory throughput of SPARC64 VII+.
- The INT/FP/JVM result is with an un-tuned compiler/JVM.
- SWoC on SPARC64 X results in up to 98x throughput.
- The NUMBER score is for scalar data. It is expected to be much better for vector data.
SPARC64 X CPI (Cycles Per Instruction) Example
SPARC64 VII+ vs. SPARC64 X, INT (single thread), hardware measured results
[Chart: CPI steps on an axis from "Lower Performance" to "Higher Performance", labeled: shorter memory latency; large L2$; improved branch prediction; 2 -> 4way L1$; increased throughput of L1$; 2EX+2EAG -> 2EX+2EX/EAG; 2 -> 4W GPR.]
- The 4 integer execution units and the increase in GPR (integer register) write ports improve overall performance.
- Memory latency reduction, the large L2$, branch prediction, and L1$ improvements also contribute substantially to the higher performance.
Summary
- SPARC64 X is Fujitsu's 10th SPARC processor, designed for Fujitsu's next generation UNIX servers.
- SPARC64 X integrates 16 cores + 24MB L2 cache with over 100GB/s (peak) memory B/W.
- SPARC64 X keeps strong RAS features.
- The SPARC64 X chip is up and running in the lab. It has shown 7 times the throughput of SPARC64 VII+ without compiler tuning.
- SWoC is effective in accelerating specific software functions.
- Fujitsu will continue to develop the SPARC64 series.
SPARC64 X Abbreviations
- IB: Instruction Buffer
- RSA: Reservation Station for Address generation
- RSE: Reservation Station for Execution
- RSF: Reservation Station for Floating-point
- RSBR: Reservation Station for Branch
- GUB: General Update Buffer
- FUB: Floating point Update Buffer
- GPR: General Purpose Register
- FPR: Floating Point Register
- CSE: Commit Stack Entry