Next-Generation PRIMEHPC. Copyright 2014 FUJITSU LIMITED

The K computer and the evolution of PRIMEHPC

                K computer           PRIMEHPC FX10        Post-FX10
CPU             SPARC64 VIIIfx       SPARC64 IXfx         SPARC64 XIfx
Peak perf.      128 GFLOPS           236.5 GFLOPS         1 TFLOPS ~
# of cores      8                    16                   32 + 2
Memory          DDR3 SDRAM           DDR3 SDRAM           HMC
Interconnect    Tofu Interconnect    Tofu Interconnect    Tofu Interconnect 2
System size     11 PFLOPS            Max. 23 PFLOPS       Max. 100 PFLOPS
Link BW         5 GB/s               5 GB/s               12.5 GB/s
                (bidirectional)      (bidirectional)      (bidirectional)
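As a rough cross-check of the figures above, dividing each quoted system size by the per-CPU peak gives the approximate node count it implies. This is only a sketch: it assumes one CPU per node, that system size is the plain sum of CPU peaks, and that "1 TFLOPS ~" means exactly 1 TFLOPS. The function name `implied_nodes` is illustrative and does not come from the slides.

```python
# Sketch: node counts implied by the slide's figures.
# Assumption: system peak = number of nodes x per-CPU peak, one CPU per node.

def implied_nodes(system_pflops, cpu_gflops):
    """Return the approximate number of CPUs needed to reach system_pflops."""
    return system_pflops * 1e6 / cpu_gflops  # 1 PFLOPS = 1e6 GFLOPS

# (system size in PFLOPS, per-CPU peak in GFLOPS), taken from the table
systems = {
    "K computer": (11, 128.0),
    "PRIMEHPC FX10": (23, 236.5),
    "Post-FX10": (100, 1000.0),  # "1 TFLOPS ~" taken as 1000 GFLOPS (assumption)
}

for name, (pflops, gflops) in systems.items():
    print(f"{name}: about {implied_nodes(pflops, gflops):,.0f} nodes")
```

Under these assumptions, a 100 PFLOPS Post-FX10 system would need on the order of 100,000 nodes.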

Smaller, faster, more efficient

Highly integrated components with high-density packaging.
Performance of 1 chassis corresponds to approx. 1 cabinet of the K computer.
Efficient in space, time, and power.

[Figure: packaging hierarchy]
  CPU: SPARC64 XIfx, Fujitsu designed, 32 + 2 core CPU, Tofu2 integrated
  CPU Memory Board: three CPUs, 3 x 8 HMCs (Hybrid Memory Cube), 8 optical modules (25 Gbps x 12 ch)
  Chassis: 1 CPU/1 node, 12 nodes/2U chassis (12 CPUs), water cooled
  Cabinet: over 200 nodes/cabinet
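The chassis-versus-cabinet claim can be checked with simple arithmetic. The sketch below is hedged: the 12 nodes per chassis and the per-node peaks come from the slides, but the figure of 96 compute nodes per K computer cabinet is an assumption that this deck does not state, and "1 TFLOPS ~" is again taken as exactly 1 TFLOPS.

```python
# Sketch: compare one Post-FX10 chassis with one K computer cabinet.
NODES_PER_CHASSIS = 12           # from the slide
POST_FX10_NODE_GFLOPS = 1000.0   # "1 TFLOPS ~" taken as exactly 1 TFLOPS (assumption)
K_NODE_GFLOPS = 128.0            # from the slides
K_NODES_PER_CABINET = 96         # assumption, not stated in this deck

def chassis_to_cabinet_ratio():
    """Return Post-FX10 chassis peak divided by K computer cabinet peak."""
    chassis_gflops = NODES_PER_CHASSIS * POST_FX10_NODE_GFLOPS
    cabinet_gflops = K_NODES_PER_CABINET * K_NODE_GFLOPS
    return chassis_gflops / cabinet_gflops

print(f"chassis / cabinet peak ratio: {chassis_to_cabinet_ratio():.2f}")
```

A ratio close to 1 is consistent with the slide's "approx. 1 cabinet" statement, provided the assumed cabinet size holds.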

Architecture continuity for compatibility

Upper compatible CPU:
  Binary-compatible with the K computer and PRIMEHPC FX10
  Good byte/flop balance
New features:
  New instructions (stride load/store, indirect load/store, permutation, concatenation)
  Improved micro architecture (out-of-order, branch prediction, etc.)
For distributed parallel executions:
  Compatible interconnect architecture
  Improved interconnect bandwidth

32 + 2 core SPARC64 XIfx

Rich micro architecture improves single thread performance.
HMC fulfills required bandwidth for multi-core high performance CPU.
2 additional assistant cores for avoiding OS jitter and for non-blocking MPI functions.

                          K          FX10        Post-FX10
Peak FP performance       128 GF     236.5 GF    1-TF class
Core config.
  Execution unit          FMA 2      FMA 2       FMA 2
  SIMD                    128 bit    128 bit     256 bit wide
  Dual SP mode            NA         NA          Double of DP
  Integer SIMD            NA         NA          Support
Single thread             -          -           Increase OOO resources, better
performance enhancement                          branch prediction, larger cache

Notes:
  Execution unit: maintains similar architecture for compatibility with applications.
  SIMD: wider SIMD for better performance with small-sized additional hardware.
  Dual SP mode: accelerates SP-rich apps.
  Integer SIMD: accelerates INT-rich apps; assists SIMDization with list vector.
  Single thread performance enhancement: application performance is often limited by single thread performance and non-FP calculation.
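The relationship between SIMD width, execution units, and peak performance in the table can be sketched as a small calculation. It rests on assumptions not spelled out in the slides: that "FMA 2" means two SIMD FMA pipelines per core, that each FMA counts as two floating-point operations, and that only the compute cores (not the assistant cores) contribute to peak. The clock frequencies it prints are implied values derived from the table, not figures quoted in the deck.

```python
# Sketch: flops per cycle per core and the clock implied by the quoted peak.
# Assumptions: "FMA 2" = two SIMD FMA pipelines per core; one FMA = 2 flops;
# assistant cores are excluded from the peak figure.

def flops_per_cycle(simd_bits, fma_pipes, element_bits=64):
    """Floating-point operations one core can issue per cycle."""
    lanes = simd_bits // element_bits   # elements per SIMD register
    return lanes * fma_pipes * 2        # multiply + add per lane

def implied_clock_ghz(peak_gflops, cores, simd_bits, fma_pipes):
    """Clock frequency (GHz) that would yield peak_gflops in double precision."""
    return peak_gflops / (cores * flops_per_cycle(simd_bits, fma_pipes))

print("K:        ", implied_clock_ghz(128.0, 8, 128, 2), "GHz")
print("FX10:     ", implied_clock_ghz(236.5, 16, 128, 2), "GHz")
print("Post-FX10:", implied_clock_ghz(1000.0, 32, 256, 2), "GHz (for exactly 1 TF)")

# Dual SP mode: 32-bit elements double the lane count, hence "Double of DP".
print("DP flops/cycle:", flops_per_cycle(256, 2))
print("SP flops/cycle:", flops_per_cycle(256, 2, element_bits=32))
```

Under these assumptions, doubling the SIMD width from 128 to 256 bits doubles the per-core flops per cycle, which together with the larger core count accounts for most of the step from 236.5 GF to the 1-TF class.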

Flexible SIMD operations

New 256-bit wide SIMD functions enable versatile operations:
  Four double-precision calculations
  Stride load/store, indirect (list) load/store, permutation, concatenation

[Figure: four panels]
  SIMD: 128 registers; double-precision reg S1 and reg S2 are calculated simultaneously into reg D
  Stride load: memory elements at a specified stride are loaded into reg D
  Indirect load: memory elements selected by reg S are loaded into reg D
  Permutation: elements of reg S are arbitrarily shuffled into reg D
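The three data-movement operations shown in the figure can be illustrated with a scalar emulation. This is a minimal sketch of the general semantics of strided loads, gathers, and permutations on a four-lane register; it is not the SPARC64 XIfx instruction set, and the function names are hypothetical.

```python
# Sketch: scalar emulation of 4-lane SIMD data movement (hypothetical names).
LANES = 4  # 256-bit register / 64-bit double-precision elements

def stride_load(memory, base, stride):
    """Load LANES elements starting at base, stepping by stride."""
    return [memory[base + lane * stride] for lane in range(LANES)]

def indirect_load(memory, indices):
    """Load one element per lane from the positions listed in indices (gather)."""
    return [memory[i] for i in indices]

def permute(reg, pattern):
    """Build a register whose lane k holds reg[pattern[k]]."""
    return [reg[p] for p in pattern]

memory = [float(i) for i in range(16)]

print(stride_load(memory, 1, 3))             # elements 1, 4, 7, 10
print(indirect_load(memory, [9, 2, 2, 15]))  # arbitrary, possibly repeated, positions
print(permute([10.0, 20.0, 30.0, 40.0], [3, 0, 2, 1]))
```

Operations like these let a compiler vectorize loops with non-unit strides or index arrays, which is the "list vector" case mentioned in the CPU comparison.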

Tofu Interconnect 2

Successor to Tofu Interconnect:
  Highly scalable, 6-dimensional mesh/torus topology
  Increased link bandwidth by 2.5 times to 12.5 GB/s
Interconnect integrated into CPU:
  System-on-chip (SoC) removes off-chip I/O
  Improved packaging density and energy efficiency
  Optical cable connection between chassis

Flexible interconnect topology

Tofu: six-dimensional mesh/torus direct network
Logical 3D, 2D or 1D torus network from the user's point of view

[Figure: scalable three-dimensional torus; three-dimensional torus unit of 2 x 3 x 2; well-balanced shape available]
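To make the six-dimensional direct network concrete, the sketch below enumerates the neighbors of a node addressed by six coordinates. It is a simplification under stated assumptions: every axis is treated as a wraparound (torus) axis, the inner unit uses the 2 x 3 x 2 shape from the figure, and the outer 4 x 4 x 4 size is arbitrary. It does not model Tofu's actual routing or which axes are mesh versus torus; the function name is illustrative.

```python
# Sketch: neighbors of a node in a 6-dimensional torus (simplified model).
# Assumptions: all axes wrap around; outer sizes (4, 4, 4) are arbitrary;
# inner unit (2, 3, 2) follows the figure.

def torus_neighbors(coord, shape):
    """Return the distinct nodes one hop away from coord."""
    result = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            candidate = list(coord)
            candidate[axis] = (coord[axis] + step) % size  # wrap around
            candidate = tuple(candidate)
            # On an axis of size 2, +1 and -1 reach the same node: keep it once.
            if candidate != coord and candidate not in result:
                result.append(candidate)
    return result

shape = (4, 4, 4, 2, 3, 2)
node = (0, 0, 0, 0, 0, 0)
for n in torus_neighbors(node, shape):
    print(n)
print("links:", len(torus_neighbors(node, shape)))
```

With this shape the two size-2 axes contribute one neighbor each and the other four axes contribute two each, so every node has ten distinct neighbors in the model.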

Entire software stack is enhanced for Post-FX10

Applications
HPC Portal / System Management Portal
Technical Computing Suite
  System Management: system management, system control, system monitoring, system operation support
  Job Management: job manager, job scheduler, resource management
  Parallel High Performance File System: FEFS
    Lustre-based high performance distributed file system
    High scalability, high reliability and availability
  Automatic parallelization compiler: Fortran, C, C++
  Tools and math libraries: programming support tools, mathematical libraries
  Parallel languages and libraries: OpenMP, MPI, XPFortran
Linux-based OS (enhanced for FX series)
PRIMEHPC FX series

Summary

Successor to the PRIMEHPC FX10:
  Fujitsu-developed, compatible SPARC64 CPU and Tofu Interconnect
  High-density packaging
SPARC64 XIfx:
  Rich micro architecture (32 computing cores + 2 assistant cores)
  Richer SIMD operations supported
  High memory bandwidth (with HMC)
Interconnect:
  Tofu Interconnect 2 is integrated into the CPU
  Optical connections between chassis
100 petaflops-capable system

Copyright 2013 FUJITSU LIMITED