SPARC64 VIIIfx: CPU for the K computer



Similar documents
SPARC64 X: Fujitsu s New Generation 16 Core Processor for the next generation UNIX servers

The K computer: Project overview

Operating System for the K computer

SPARC64 VII Fujitsu s Next Generation Quad-Core Processor

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr Teruzzi Roberto matr IBM CELL. Politecnico di Milano Como Campus

Current Status of FEFS for the K computer

AMD Opteron Quad-Core

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

OC By Arsene Fansi T. POLIMI

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey


Introduction to Cloud Computing

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

A New, High-Performance, Low-Power, Floating-Point Embedded Processor for Scientific Computing and DSP Applications

FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015

İSTANBUL AYDIN UNIVERSITY

Next Generation GPU Architecture Code-named Fermi

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit.

The Central Processing Unit:

OpenSPARC T1 Processor

Chapter 6. Inside the System Unit. What You Will Learn... Computers Are Your Future. What You Will Learn... Describing Hardware Performance

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Introduction to Microprocessors

1. Memory technology & Hierarchy

GPU System Architecture. Alan Gray EPCC The University of Edinburgh


Computer Architecture TDTS10

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Intel Pentium 4 Processor on 90nm Technology

Building a Top500-class Supercomputing Cluster at LNS-BUAP

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

Unit A451: Computer systems and programming. Section 2: Computing Hardware 1/5: Central Processing Unit

Multi-Threading Performance on Commodity Multi-Core Processors

Query Acceleration of Oracle Database 12c In-Memory using Software on Chip Technology with Fujitsu M10 SPARC Servers

<Insert Picture Here> T4: A Highly Threaded Server-on-a-Chip with Native Support for Heterogeneous Computing

High-Speed Thin Client Technology for Mobile Environment: Mobile RVEC

Chapter 2 Heterogeneous Multicore Architecture

Parallel Computing 37 (2011) Contents lists available at ScienceDirect. Parallel Computing. journal homepage:

Optical Interconnect Technology for High-bandwidth Data Connection in Next-generation Servers

Communicating with devices

The ARM Cortex-A9 Processors

CPU Session 1. Praktikum Parallele Rechnerarchtitekturen. Praktikum Parallele Rechnerarchitekturen / Johannes Hofmann April 14,

Trends in High-Performance Computing for Power Grid Applications

Parallel Algorithm Engineering

ARM Microprocessor and ARM-Based Microcontrollers

Chapter Introduction. Storage and Other I/O Topics. p. 570( 頁 585) Fig I/O devices can be characterized by. I/O bus connections

GPUs for Scientific Computing

The Bus (PCI and PCI-Express)

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune

Vector Architectures

Lecture 1: the anatomy of a supercomputer

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

Reduced Precision Hardware for Ray Tracing. Sean Keely University of Texas, Austin

Symfoware Server Reliable and Scalable Data Management

Family 12h AMD Athlon II Processor Product Data Sheet

OBJECTIVE ANALYSIS WHITE PAPER MATCH FLASH. TO THE PROCESSOR Why Multithreading Requires Parallelized Flash ATCHING

COMP/CS 605: Intro to Parallel Computing Lecture 01: Parallel Computing Overview (Part 1)

HUAWEI Tecal E6000 Blade Server

StreamStorage: High-throughput and Scalable Storage Technology for Streaming Data

AMD PhenomII. Architecture for Multimedia System Prof. Cristina Silvano. Group Member: Nazanin Vahabi Kosar Tayebani

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

Performance Metrics and Scalability Analysis. Performance Metrics and Scalability Analysis

Discovering Computers Living in a Digital World

Low Power AMD Athlon 64 and AMD Opteron Processors

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

Chapter 2 Parallel Computer Architecture

DS1104 R&D Controller Board

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/ CAE Associates

How To Understand The Design Of A Microprocessor

Binary search tree with SIMD bandwidth optimization using SSE

Energy efficient computing on Embedded and Mobile devices. Nikola Rajovic, Nikola Puzovic, Lluis Vilanova, Carlos Villavieja, Alex Ramirez

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

Putting it all together: Intel Nehalem.

PCI Express* Ethernet Networking

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

Chapter 4 System Unit Components. Discovering Computers Your Interactive Guide to the Digital World

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008

Network Virtualization for Large-Scale Data Centers

Parallel Programming Survey

Intel Xeon Processor E5-2600

Family 10h AMD Phenom II Processor Product Data Sheet

Operating Systems 4 th Class

Advanced Core Operating System (ACOS): Experience the Performance

Generations of the computer. processors.

Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager

PROBLEMS #20,R0,R1 #$3A,R2,R4

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Testing Database Performance with HelperCore on Multi-Core Processors

Transcription:

SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd. s 45-nm CMOS process for semiconductors and is composed of eight cores, a 6 MB shared level 2 cache, and memory controllers. Peak performance of 128 GFLOPS at an operating frequency of 2 GHz is achieved with power consumption as low as 58 W. The performance per unit of power is more than six times that of the SPARC processor, our previous model. To achieve this performance per unit of power, we extended the SPARC-V9 architecture to develop high performance computing-arithmetic computational extensions (HPC-ACE), the optimum instruction set for scientific computations. In addition, we successfully reduced the leakage power by water cooling and dynamic power by clock gating to achieve a lower power consumption. Furthermore, high-reliability technology for mainframes and UNIX servers is used to ensure stable operation of a system connecting more than 80 000 processors. This paper outlines the technologies used to achieve the high performance, low power consumption and high reliability of SPARC64 VIIIfx. 1. Introduction Fujitsu developed SPARC64 VIIIfx 1) (Figure 1) as a processor for the supercomputer ( K computer ). note)i The K computer has more than 80 000 processors installed to give it a computational performance in excess of 10 PFLOPS. These processors need to have high performance, low power consumption, and high reliability. This paper outlines the technologies used to achieve these goals. 2. Goals in development of SPARC64 VIIIfx The goals in the development of SPARC64 VIIIfx include: note)i K computer is the English name that RIKEN has been using for the supercomputer of this project since July 2010. K comes from the Japanese word Kei, which means ten peta or 10 to the 16th power. 1) High performance DDR3 Interface SPARC64 VIIIfx is a multicore processor Core 5 Core 4 Core 1 Core 0 Figure 1 SPARC64 VIIIfx chip. high-speed serial I/O (HSIO) L2 cache data L2 cache control L2 cache data Core 7 Core 6 Core 3 Core 2 DDR3 Interface 274 FUJITSU Sci. Tech. J., Vol. 48, No. 3, pp. 274 279 (July 2012)

integrating eight cores, a shared level 2 cache, memory access controllers (s), and a highspeed serial I/O (HSIO). For each core to exhibit high-execution performance in a real application, we extended the SPARC-V9 architecture 2) 4) and developed High Performance Computing-Arithmetic Computational Extensions (HPC-ACE), 5) an instruction set capable of efficiently executing scientific computations. To achieve a higher speed of parallel processing by having eight cores on the chip, the architecture must also have a function to share the level 2 cache across all cores and synchronize cores by means of hardware. Combining this with Fujitsu s automatic parallel compiler allows the user to handle multiple cores when programming as if it were one high-speed CPU without having to be aware of the multiple cores. At Fujitsu, this is called Virtual Single Processor by Integrated Multicore Parallel Architecture (VISIMPACT). 2) Low power consumption Due to the limited amount of power available to the entire system, the power consumption of the processor needed to be reduced to 58 W or less. To that end, Fujitsu used low-leakage transistors and water cooling to lower the junction temperature to 30 C so as to reduce the leakage power. In addition, the processor must make full use of clock gating to reduce its dynamic power. 3) High reliability The processor requires high-reliability technology used for mainframes and UNIX servers 6),7) to ensure it operates stably. 3. Microarchitecture of SPARC64 VIIIfx The pipeline of SPARC64 VIIIfx is shown in Figure 2 and the specifications are shown in Table 1. A core is composed of an instruction control unit, execution unit and level 1 cache. The instruction control unit is responsible for instruction fetch, instruction decode, out-of-order instruction control, and instruction commit control. The execution unit is equipped with Fetch Decode Issue Out-of-order execution Register read Execution Memory access Commit Instruction level 1 cache Branch history Instruction decode Address Fixed-point Fixed-point Fixed-point renaming EAGA EAGB EXA EXB FLA FLB Fetch port Store port Data level 1 cache Commit control Program counter Control Branch renaming FLC FLD Level 2 cache 8 cores Figure 2 SPARC64 VIIIfx pipeline. FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) 275

Table 1 SPARC64 VIIIfx specifications. Item No. of cores 8 Level 2 cache Operating frequency Process technology Die size No. of transistors Peak performance Memory bandwidth Power consumption 6 MB 2 GHz two fixed-point functional units (EXA/B), two functional units for load/store address computation (EAGA/B) and four floating-point multiply-and-add (FMA) units (FLA/B/C/D). The FMA units have a single instruction multiple data (SIMD) architecture and execute two parallel operations with one instruction. One FMA unit is capable of conducting floating-point multiplication and addition for each cycle and each core can execute eight double-precision floating-point operations per cycle. Hence the chip is capable of making 64 double-precision floating-point operations per cycle. The operating frequency is 2 GHz and the peak performance is 128 GFLOPS. There are 192 fixed-point s and 256 floating-point s. The level 1 cache processes load/store instructions. Each core has a 32 Kbyte two-way instruction cache and data cache. The data cache has a dual-port structure capable of two simultaneous load accesses and executes two 16-byte SIMD loads or one 16-byte SIMD store. The level 2 cache is shared by the eight cores and cache coherence is ensured for each core. An inter-core hardware barrier is provided to allow high-speed synchronization between the cores, as will be described later. FSL 45-nm CMOS Specification 22.7 mm 22.6 mm Approx. 760 million 128 GFLOPS 64 GB/s (theoretical peak value) 58 W (process condition: TYP) FSL: Fujitsu Semiconductor Ltd. SPARC64 VIIIfx incorporates memory controllers to reduce its latency and improve the throughput of its memory access. The memory bandwidth is 64 GB/s as a theoretical peak value. In addition, a K computer exclusive InterConnect Controller chip and HSIO are used for connections to ensure a high inter-chip communication throughput. 4. Instruction extensions HPC- ACE HPC-ACE is an extended instruction set intended for scientific computation for the SPARC-V9 architecture. It was developed based on the analysis of many HPC applications jointly with Fujitsu software development department. 1) Expansion of number of s The number of floating-point s of SPARC-V9 is 32, which is not sufficient for HPC applications. Increasing the number of s, however, is not possible with the 32-bit SPARC architecture because there is an insufficient instruction length. As a solution to this, a new prefix instruction called the set extended arithmetic (SXAR) has been defined for HPC-ACE. An SXAR instruction extends addressing for up to two following instructions. The address length is extended by 3 bits, which allows for 256 floating-point s, eight times that of SPARC-V9 (Figure 3). The compiler uses this high-capacity set of s for optimization including software pipelining and maximizes the instruction-level parallelism of an application. In terms of the Himeno Benchmark, a representative HPC benchmark, the performance has improved by 1.65 times. 2) SIMD operations and load/store instructions SIMD is a technology to allow parallel execution of more than one data process with one instruction. HPC-ACE uses SIMD technology to execute two FMA operations with one instruction. SIMD operations for more quickly multiplying complex numbers are also supported. In addition, SIMD execution is possible with load and store instructions. SIMD processing of load instructions is executed without penalty in an 8-byte alignment for double precision and 4-byte 276 FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012)

SPARC-V9 32 Extended address (3 bits) Address (5 bits) SXAR Subsequent instruction Register address 8 bits in total Extended 224 Figure 3 Register address extension by SXAR instruction. alignment for single precision. 3) Sector cache mechanism For HPC-ACE, a cache mechanism (sector cache) that can be software-controlled has been developed. The conventional caches cannot be controlled by software. Even if the user is aware of the high frequency with which data is reused, the hardware evicts the data from the cache when ing other data in the cache, which might hinder improvements in performance. To address this issue, the sector cache mechanism splits the cache into two sectors and allows software to be used to frequently reused data in a sector separately from other data. Implementing control to have the user keep frequently reused data in the cache contributes to a better performance. 4) Acceleration instructions for trigonometric functions sine and cosine Instructions to accelerate the trigonometric functions sine and cosine have been added. They have traditionally been processed by combining many instructions, but providing dedicated instructions has reduced the number of instructions, leading to an increase in speed of more than five times. 5) Conditional execution To efficiently process loops containing if statements, it is necessary to eliminate conditional branch instructions. For that purpose, conditional execution instructions have been added for HPC-ACE. Specifically, a new compare instruction is used to write the result of comparison in a floating-point and a conditional execution instruction is used based on the result of such comparison. As the conditional execution instructions, data transfer between floating-point s and store from a floating-point to memory have been defined. Combining these instructions to eliminate conditional branching instructions allows the compiler to optimize loops containing if statements by software pipelining or other means. 6) Division and square root approximation Instructions for finding reciprocal approximations have been added. This has allowed division and square root pipeline processing, resulting in a throughput improvement of about four times as a combined effect with the greater number of s. Functions 1) to 6) all make it possible to have a better performance without increasing the frequency and significantly contribute to FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) 277

the higher performance per unit of power of SPARC64 VIIIfx. 5. VISIMPACT This section describes the hardware mechanisms used for VISIMPACT. 1) Shared level 2 cache SPARC64 VIIIfx is provided with a 6 MB level 2 cache shared by all of the eight cores. Making it easier to share data between cores allows efficient parallel processing of one process with multiple cores. 2) Hardware barrier SPARC64 VIIIfx is equipped with a hardware barrier for high-speed synchronization between cores. When one process is executed by multiple cores in parallel, wait (synchronization) may be implemented between cores. While ordinary processors use software for synchronization, SPARC64 VIIIfx uses dedicated hardware to increase the computational speed by more than ten times. The significant reduction in synchronization overhead allows small loops to be processed in parallel by using multiple cores for higher speeds. 6. Low power consumption SPARC64 VIIIfx uses long-gate transistors and water cooling as a cooling method to lower the junction temperature to 30 C, thereby decreasing the leakage power to 10% of the power of the entire chip. In addition, complete clock gating is provided for each latch so that it is more effective in power reduction, which has successfully decreased the dynamic power consumed in operation. As a result, the average power consumption of the SPARC64 VIIIfx is as low as 58 W and it has a high computing performance of 128 GFLOPS. This is more than six times that of the SPARC processor, our previous model, in terms of performance per unit of power. 7. High reliability SPARC64 VIIIfx is provided with highreliability technology that Fujitsu has nurtured through the development of mainframes and UNIX servers. A processor is composed of very finestructured transistors, and signals may be affected by cosmic rays or other factors. To ensure continued processing without any malfunction in spite of such intermittent and transient faults, SPARC64 VIIIfx has an instruction retry mechanism in which hardware automatically re-executes any instruction affected by faults. In addition, 1-bit errors of all RAM and floatingpoint and fixed-point s in the processor are corrected by hardware. Sections relating to program execution are protected by error detection codes so as to ensure data integrity. By making use of these technologies, Fujitsu has achieved stable operation of a system connecting more than 80 000 processors. 8. Conclusion The development of SPARC64 VIIIfx was a really challenging project for us developers. For the development, members from the software development department, laboratories and other departments in addition to the processor development team gathered together to combine all of Fujitsu s strengths. We believe that developing new technologies and inheriting the processor technologies that Fujitsu has nurtured over many years have led to the successful development of a supercomputer. We anticipate that the K computer, which uses this processor, will help solve problems in various fields in the future. References 1) T. Maruyama et al.: SPARC64 VIIIfx: A New- Generation Octocore Processor for Petascale Computing. IEEE Micro, Vol. 30, Issue 2, pp. 30 40 (2010). 2) SPARC International: The SPARC Architecture Manual (Version 9). http://www.sparc.org/standards/sparcv9.pdf 278 FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012)

3) Fujitsu: SPARC Joint Programming Specification (JPS1) Commonality. (in Japanese). http://jp.fujitsu.com/solutions/hpc/brochures/ 4) Fujitsu: SPARC JPS1 Implementation Specification SPARC64 V. (in Japanese). http://jp.fujitsu.com/solutions/hpc/brochures/ 5) Fujitsu: SPARC64 VIIIfx Extensions. (in Japanese). http://jp.fujitsu.com/solutions/hpc/brochures/ 6) A. Inoue: SPARC64 V Processor for UNIX Servers. (in Japanese), FUJITSU, Vol. 53, No. 6, pp. 450 455 (2002). 7) T. Maruyama et al.: Past, Present, and Future of SPARC64 Processors. Fujitsu Sci. Tech. J., Vol. 47, No. 2, pp. 130 135 (2011). Toshio Yoshida Mr. Yoshida is currently engaged in development of processor cores. Ryuji Kan Mr. Kan is currently engaged in development of processor cores. Mikio Hondo Mr. Hondo is currently engaged in performance evaluation of processors and systems. Go Sugizaki Mr. Sugizaki is currently engaged in development of processors. FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) 279