Highly Productive HPC on Modern Vector Supercomputers: Current and Future




Highly Productive HPC on Modern Vector Supercomputers: Current and Future
Hiroaki Kobayashi
Director and Professor, Cyberscience Center, Tohoku University
koba@cc.tohoku.ac.jp
Moscow, Russia

About Tohoku University
Founded in 1907 as the third Imperial University, located in Sendai City, Japan.
3,116 faculty members; 18,000 graduate and undergraduate students.

Missions of Cyberscience Center
As a national supercomputer center (high-performance computing center founded in 1969):
- Offering leading-edge high-performance computing environments to academic users nationwide in Japan: 24/7 operation of large-scale vector-parallel and scalar-parallel systems; 1,500 users registered in AY 2014.
- User support: benchmarking, analyzing, and tuning users' programs; holding seminars and lectures.
- Supercomputing R&D in collaboration with NEC: designing next-generation high-performance computing systems and their applications for highly productive supercomputing; a 57-year history of collaboration between Tohoku University and NEC on high-performance vector computing.
- Education: teaching and supervising B.S., M.S., and Ph.D. students as a cooperative laboratory of the Graduate School of Information Sciences, Tohoku University.
[Timeline of installed systems: 1969, 1982, SX-1 in 1985, SX-2 in 1989, SX-3 in 1994, SX-4 in 1998, SX-7 in 2003, SX-9 in 2008.]

New HPC Building Construction and System Installation (2014.7~2015.2)

Cooling Facility of the HPC Building
[Diagram: evaporative cooling units on the roof, fresh-air intake and coolant loops feeding air-handling units (AHU) on the 2nd floor and heat exchangers (HE) on the 1st floor, with cold-aisle capping over Clusters 1-5; cold air is supplied to the capped aisles and hot air is returned and exhausted.]

Organization of the Tohoku Univ. SX-ACE System
Five clusters (Cluster 0 to Cluster 4) connected by the IXS interconnect.
- Size: 4 cores per CPU (socket), 1 CPU per node, 512 nodes per cluster, 5 clusters in total
- Performance (VPU+SPU): 69 GFlop/s per core (68 GF + 1 GF); 276 GFlop/s per CPU/node (272 GF + 4 GF); 141 TFlop/s per cluster (139 TF + 2 TF); 707 TFlop/s total (697 TF + 10 TF)
- Memory bandwidth: 256 GB/s per node; 131 TB/s per cluster; 655 TB/s total
- Memory capacity: 64 GB per node; 32 TB per cluster; 160 TB total
- IXS node bandwidth: 4 GB/s x 2 per node

Features of the SX-ACE Vector Processor
- 4-core configuration, each core with a high-performance vector processing unit (VPU) and a scalar processing unit (SPU): 272 Gflop/s of VPU + 4 Gflop/s of SPU per socket; 68 Gflop/s + 1 Gflop/s per core.
- 1 MB private ADB per core (4 MB per socket): software-controlled on-chip memory for vector load/store, 4x the capacity of the SX-9; 4-way set-associative; MSHR with 512 entries (address + data); 256 GB/s to/from the vector registers, i.e. 4 B/F for multiply-add operations.
- 256 GB/s memory bandwidth, shared by the 4 cores: 1 B/F for 4-core multiply-add operations, up to 4 B/F for 1-core multiply-add operations; 128 memory banks per socket.
- Other improvements and new mechanisms to enhance vector processing capability, especially for efficient handling of short-vector operations and indirect memory accesses: out-of-order execution of vector load/store operations; advanced data forwarding in vector-pipe chaining; shorter memory latency than the SX-9.
[Diagram (source: NEC): SX-ACE processor architecture with four cores (SPU + VPU + 1 MB ADB with MSHR each), a crossbar to the memory controllers (MC) at 256 GB/s, and an RCU with 2 x 4 GB/s IXS links.]
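To make the byte-per-flop figures above concrete, here is the arithmetic behind them, using the 64 Gflop/s multiply-add rate per core quoted in the specification table later in the talk:

\[
\frac{256\ \mathrm{GB/s\ (ADB\ to\ vector\ registers)}}{64\ \mathrm{Gflop/s\ per\ core}} = 4\ \mathrm{B/F},
\qquad
\frac{256\ \mathrm{GB/s\ (shared\ memory)}}{4 \times 64\ \mathrm{Gflop/s}} = 1\ \mathrm{B/F}.
\]

A single core that has the shared 256 GB/s memory bandwidth to itself therefore sees up to 4 B/F, matching the "1 B/F (4-core) to 4 B/F (1-core)" range quoted above.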

Features of the Tohoku Univ. SX-ACE System

SX-9 (2008) vs. SX-ACE (2014), CPU performance and total performance/footprint/power:
- Number of cores per CPU: 1 -> 4 (4x)
- CPU peak: 118.4 Gflop/s -> 276 Gflop/s (2.3x)
- CPU memory bandwidth: 256 GB/s -> 256 GB/s (1x)
- ADB capacity: 256 KB -> 4 MB (16x)
- Total peak: 34.1 Tflop/s -> 706.6 Tflop/s (20.7x)
- Total memory bandwidth: 73.7 TB/s -> 655 TB/s (8.9x)
- Total memory capacity: 18 TB -> 160 TB (8.9x)
- Power consumption: 590 kVA -> 1,080 kVA (1.8x)
- Footprint: 293 m² -> 430 m² (1.5x)

SX-ACE (2014) vs. K computer (2011), CPU (node) performance ratio:
- Clock frequency: 1 GHz vs. 2 GHz (0.5x)
- Flop/s per core: 64 Gflop/s vs. 16 Gflop/s (4x)
- Cores per CPU: 4 vs. 8 (0.5x)
- Flop/s per CPU: 256 Gflop/s vs. 128 Gflop/s (2x)
- Memory bandwidth: 256 GB/s vs. 64 GB/s (4x)
- Bytes per flop (B/F): 1 vs. 0.5 (2x)
- Memory capacity: 64 GB vs. 16 GB (4x)

Floating-Point Operations Serviced (x10^15), 1992-2014
[Chart: floating-point operations serviced per year by the Center's vector systems (SX-2R, SX-3N, SX-4, SX-7, SX-7/SX-7C, SX-9/SX-7C, SX-9, SX-9/SX-ACE), growing up to the upper limit of 24/7 service. The high demand for vector systems comes from memory-intensive science and engineering applications from academia (inside and outside Tohoku Univ.) and industry: advanced CFD models for digital flight, turbine design, atmosphere analysis and weather forecasting, combustion, earthquake analysis, electromagnetic field analysis, catalyzer design, and magnetic recording device design.]

Expanding Industrial Use
Featured in the TV program "Close-up GENDAI" by NHK (January 8, 2013).
Examples of industrial use: advanced perpendicular magnetic recording hard drives, highly efficient turbines for power plants, exhaust gas catalysts, base materials for PCBs, and a regional jet.

Performance Evaluations of SX-ACE

Specifications of the Evaluated Systems
- SX-ACE: 1 socket/node; 256 Gflop/s per socket; 4 cores (64 Gflop/s per core); 256 GB/s memory BW; 1 MB ADB per core; network 2 x 4 GB/s IXS; system B/F 1.0
- SX-9: 16 sockets/node; 102.4 Gflop/s per socket; 1 core (102.4 Gflop/s per core); 256 GB/s memory BW; 256 KB ADB per core; network 2 x 128 GB/s IXS; system B/F 2.5
- ES2: 8 sockets/node; 102.4 Gflop/s per socket; 1 core (102.4 Gflop/s per core); 256 GB/s memory BW; 256 KB ADB per core; network 2 x 64 GB/s IXS; system B/F 2.5
- LX 406 (Ivy Bridge): 2 sockets/node; 230.4 Gflop/s per socket; 12 cores (19.2 Gflop/s per core); 59.7 GB/s memory BW; 256 KB L2 per core + 30 MB shared L3; network 5 GB/s IB; system B/F 0.26
- FX10 (SPARC64 IXfx): 1 socket/node; 236.5 Gflop/s per socket; 16 cores (14.78 Gflop/s per core); 85 GB/s memory BW; 12 MB shared L2; network 5-50 GB/s Tofu; system B/F 0.36
- K (SPARC64 VIIIfx): 1 socket/node; 128 Gflop/s per socket; 8 cores (16 Gflop/s per core); 64 GB/s memory BW; 6 MB shared L2; network 5-50 GB/s Tofu; system B/F 0.5
- SR16K M1 (Power7): 4 sockets/node; 245.1 Gflop/s per socket; 8 cores (30.6 Gflop/s per core); 128 GB/s memory BW; 256 KB L2 per core + 32 MB shared L3; network 2 x 24-96 GB/s custom; system B/F 0.52
Remark: the listed performances are based on the total multiply-add performance of each system.

Sustained Memory Bandwidth: STREAM (TRIAD)
[Chart: STREAM TRIAD memory bandwidth (GB/s, up to ~300) versus the number of activated cores (threads, 1-16) for SX-ACE, SR16K M1 (Power7), LX406 (Ivy Bridge), and FX10 (SPARC64 IXfx).]
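For reference, the TRIAD kernel measured by STREAM is a single bandwidth-bound loop; the following is a minimal OpenMP sketch in C++ (array size, scalar, and timing are illustrative choices, not the official STREAM benchmark source):

```cpp
// Minimal sketch of the STREAM TRIAD kernel: a[i] = b[i] + q * c[i].
// Array length and timing are illustrative; the official STREAM benchmark
// fixes its own sizes, repetition counts, and verification rules.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1u << 25;          // 32M doubles per array (256 MiB each)
    const double q = 3.0;
    std::vector<double> a(n), b(n, 1.0), c(n, 2.0);

    auto t0 = std::chrono::steady_clock::now();
    #pragma omp parallel for
    for (long long i = 0; i < static_cast<long long>(n); ++i)
        a[i] = b[i] + q * c[i];              // 2 flops, 24 bytes moved per iteration
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    double bytes = 3.0 * n * sizeof(double); // read b, read c, write a
    std::printf("TRIAD bandwidth: %.1f GB/s\n", bytes / sec / 1e9);
    return 0;
}
```

Compiled with OpenMP enabled, the reported bandwidth is limited almost entirely by the memory subsystem rather than by the floating-point units, which is exactly the quantity this chart compares across systems.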

Sustained Single-CPU Performance
[Chart: sustained performance (Gflop/s) and efficiency (%) of SX-ACE, SX-9, LX406 (Ivy Bridge), and SR16K M1 (Power7) on seven memory-intensive application kernels, with their code B/F (memory intensity): QSFDM 2.16, Barotropic 1.97, MHD (FDM) 3.04, Seism3D 2.15, MHD (Spectral) 2.21, TURBINE 1.78, BCM 7.01. The vector systems reach efficiencies of roughly 30-47% on most codes, while the scalar systems mostly stay below about 10%.]

Sustained Performance of the Barotropic Ocean Model on Multi-Node Systems
[Chart: sustained performance (Gflop/s, up to ~1,200) versus the number of MPI processes (nodes, up to 96) for ES2, SX-ACE, LX406 (Ivy Bridge), SR16K M1 (Power7), and FX10 (SPARC64 IXfx), run as an SMP-MPI hybrid.]

Performance Evaluation of SX-ACE Using HPCG
HPCG (High Performance Conjugate Gradients) is designed to exercise computational and data-access patterns that more closely match a broad set of important applications, and to give computer-system designers an incentive to invest in capabilities that will have an impact on the collective performance of these applications. HPL, used for the TOP500, is increasingly unreliable as a true measure of system performance for a growing collection of important science and engineering applications.
HPCG is a complete, stand-alone code that measures the performance of basic operations in a unified code:
- Sparse matrix-vector multiplication
- Sparse triangular solve
- Vector updates
- Global dot products
- Local symmetric Gauss-Seidel smoother
These kernels are driven by a multigrid-preconditioned conjugate gradient algorithm that exercises them on a nested set of coarse grids. The reference implementation is written in C++ with MPI and OpenMP support.
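To illustrate the kind of kernel HPCG stresses, here is a minimal CSR sparse matrix-vector product in C++ with OpenMP; this is a generic sketch (the struct and function names are my own), not code taken from the HPCG reference implementation:

```cpp
// Minimal CSR sparse matrix-vector product y = A*x, the kind of
// memory-bound, indirect-access kernel that dominates HPCG.
#include <cstdio>
#include <vector>

struct CsrMatrix {
    int nrows;
    std::vector<int>    row_ptr;  // size nrows + 1
    std::vector<int>    col_idx;  // column index of each nonzero
    std::vector<double> values;   // nonzero values
};

void spmv(const CsrMatrix& A, const std::vector<double>& x,
          std::vector<double>& y) {
    #pragma omp parallel for
    for (int i = 0; i < A.nrows; ++i) {
        double sum = 0.0;
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum += A.values[k] * x[A.col_idx[k]];   // indirect load of x
        y[i] = sum;
    }
}

int main() {
    // Tiny 1-D Laplacian (tridiagonal matrix) as a smoke test.
    const int n = 5;
    CsrMatrix A{n, {0}, {}, {}};
    for (int i = 0; i < n; ++i) {
        if (i > 0)     { A.col_idx.push_back(i - 1); A.values.push_back(-1.0); }
                         A.col_idx.push_back(i);     A.values.push_back( 2.0);
        if (i < n - 1) { A.col_idx.push_back(i + 1); A.values.push_back(-1.0); }
        A.row_ptr.push_back(static_cast<int>(A.col_idx.size()));
    }
    std::vector<double> x(n, 1.0), y(n, 0.0);
    spmv(A, x, y);   // interior rows give 0, boundary rows give 1
    for (double v : y) std::printf("%g ", v);
    std::printf("\n");
    return 0;
}
```

The indirect load of x through col_idx is the kind of access pattern that the SX-ACE features described earlier (ADB with MSHR, out-of-order vector load/store) are meant to handle efficiently.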

Efficiency Evaluation of HPCG Performance
[Two charts over the HPCG list entries (CINECA, Cambridge, NSC/Linkoping, TACC, Oak Ridge, TSUBAME 2.5, K computer, Stanford, KAUST, and others), grouped by architecture (SX-ACE, Xeon, K/FX, GPU, Xeon Phi, Blue Gene): (1) efficiency of sustained HPCG performance relative to peak (%, up to ~12%), with the SX-ACE sites (Tohoku U., Osaka U., HLRS, CEIST/JAMSTEC) at the high end; (2) power efficiency (Mflop/s/W, up to ~140).]

Efficiency Evaluation of HPCG Performance (continued)
[Two charts over the same HPCG list entries, normalized to SX-ACE: (1) the peak performance needed to obtain the same sustained HPCG performance as SX-ACE (up to roughly 12x for the least efficient systems); (2) the number of cores needed to obtain the same sustained HPCG performance (up to roughly 60x).]

A Real-Time Tsunami Inundation Forecasting System on SX-ACE for Tsunami Disaster Prevention and Mitigation

Motivation: Serious Damage to the Sendai Area Due to the 2011 Tsunami Inundation
A huge tsunami hit the coastal area of the Tohoku region; the inundation area reached Sendai Airport.
(Courtesy of Prof. Koshimura, Tohoku University)

It's Not the End: High Probability of Big Earthquakes in Japan
Japan may be hit by severe earthquakes and large tsunamis within the next 30 years.
[Map of seismogenic zones: a 70% probability within the next 30 years off the Tohoku coast near Ishinomaki and Sendai (the region of the 2011 Great Tohoku Earthquake), and an 88% probability within the next 30 years along the coast between Shizuoka and Kochi.]
(Courtesy of Prof. Koshimura, Tohoku University)

Design and Development of a Real-Time Tsunami Inundation Forecasting System
- GPS observation and fault estimation based on GPS data: < 8 min
- Tsunami inundation simulation on SX-ACE, using 10-m mesh models of coastal cities: < 8 min
- Information delivery (inundation depth, etc.): just-in-time access to visualized information by local governments: < 4 min
- Total: < 20 min
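Purely as an illustration of the stage ordering and time budgets on this slide (the stage functions below are hypothetical placeholders, not the operational forecasting software), the pipeline can be sketched as:

```cpp
// Illustrative sketch of the three-stage real-time forecasting pipeline and
// its per-stage time budgets; the stage bodies are empty placeholders.
#include <chrono>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct Stage {
    std::string name;
    std::chrono::minutes budget;
    std::function<void()> run;   // placeholder for the real work
};

int main() {
    using std::chrono::minutes;
    std::vector<Stage> pipeline = {
        {"Fault estimation from GPS observation",             minutes(8), [] {}},
        {"Tsunami inundation simulation (10-m mesh, SX-ACE)",  minutes(8), [] {}},
        {"Visualization and delivery to local governments",    minutes(4), [] {}},
    };

    minutes total(0);
    for (const auto& s : pipeline) {
        const auto t0 = std::chrono::steady_clock::now();
        s.run();                                     // would block until the stage finishes
        const auto used = std::chrono::duration_cast<minutes>(
            std::chrono::steady_clock::now() - t0);
        total += used;
        std::printf("%-52s %3lld min (budget %lld min)\n", s.name.c_str(),
                    static_cast<long long>(used.count()),
                    static_cast<long long>(s.budget.count()));
    }
    std::printf("Total: %lld min (target < 20 min)\n",
                static_cast<long long>(total.count()));
    return 0;
}
```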

Demo: Visualization of Simulation Results

Demo: Visualization of Simulation Results (Miyagi)

Scalability of the TUNAMI Code
[Chart: execution time (seconds, log scale from 1 to 1,000,000) versus number of cores/MPI processes (64 to 13K) on the K computer (SPARC64 VIIIfx), LX 406 (Ivy Bridge), and SX-ACE, measured against the 8-minute target.]

Future Vector Systems R&D*
*This work is partially conducted with NEC, but the contents do not reflect any future products of NEC.

Big Problem in HPC System Design: "Brain Infarction" of Modern HEC Systems
The imbalance between the peak flop/s rate and the memory bandwidth of HEC systems results in inefficient computing: only a small portion of the peak flop/s contributes to the execution of practical applications in many important science and engineering fields. A large amount of "sleeping" flop/s power is wasted.
So far this has been acceptable because Moore's law worked, but it no longer makes sense as the end of Moore's law approaches. We have to become much smarter about the design of future HEC systems, because it is very hard to obtain cost reductions from Moore's law, given in particular the sky-rocketing cost of semiconductor fabrication at 20 nm and finer technologies. The silicon budget for computing is not free any more.
Therefore: exploit the sleeping flop/s efficiently by redesigning/reinventing the memory subsystem to protect HEC systems from serious "brain infarction", and use the precious silicon budget (plus advanced device technologies) to design mechanisms that can supply data to the computing units smoothly.

Toward the Next-Generation Vector System Design
Focus much more on sustained performance by increasing processing efficiency, rather than depending heavily on peak performance, in the design of the next-generation vector systems.
Architecture design of high-performance vector cores connected to an advanced memory subsystem at a high sustained bandwidth:
- Find a good balance between computing performance and memory performance of future HEC systems, to maximize computing efficiency over a wide variety of science and engineering applications.
- Achieve a certain level of sustained performance with a moderate number of high-performance cores/nodes and a lower power budget; shorter-diameter, lower-latency, high-bandwidth networks also become adoptable.
[Diagram: a 2.5D/3D-integrated vector chip combining vector-core layers, on-chip memory layers, and I/O layers, connected by an intra-socket network (electrical/optical).]

Feasibility Study of the Next-Generation Vector System (Targeting Around 2018+): System Specification
- Node: 4 TFlop/s, with 4 CPUs of 1 TFlop/s each (1 TF = 256 GFlop/s x 4 cores), connected by a hierarchical network
- Per CPU: a 32 MB VLSB at ~1 TB/s (4 B/F), backed by 2.5D/3D die-stacked shared memory of ~256 GB at ~2 TB/s (2 B/F)
- 4 threads per processor; 1 TF peak per processor
- Per 100,000 processors: 100 PF peak, 25,000 nodes, 200 PB/s total memory bandwidth, 3.2 TB total cache capacity
- Hierarchical distributed storage system
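The aggregate numbers on this slide are mutually consistent; a quick check (interpreting the ~2 TB/s and 32 MB VLSB figures as per-CPU values):

\[
\begin{aligned}
25{,}000\ \text{nodes} \times 4\ \mathrm{TFlop/s} &= 100\ \mathrm{PFlop/s}\ \text{peak},\\
100{,}000\ \text{CPUs} \times 2\ \mathrm{TB/s} &= 200\ \mathrm{PB/s}\ \text{total memory bandwidth},\\
100{,}000\ \text{CPUs} \times 32\ \mathrm{MB} &= 3.2\ \mathrm{TB}\ \text{total cache capacity}.
\end{aligned}
\]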

Feasibility Study of the Next-Generation Vector System (Targeting Around 2018+): Packaging
- Node (1U blade, roughly 30 cm x 40 cm, with cooling unit): 4 TF, 512 GB-1 TB of memory, 4-8 TB/s, 800-1,000 W
- Rack (roughly 60 cm x 100 cm x 200 cm; 64 vector nodes plus NW switches and PSU): 256 TF, 32-64 TB, 256-512 TB/s, 52-64 kW
- System (400+ racks on a roughly 40 m x 40 m floor): 100 PF (~1 EF), 12-25 PB of memory, 100-200 PB/s, 20-30 MW

Performance Comparison of Our Target System with a Commodity-Based Future System
- Our system: SMP (UMA) node architecture; 1 TF sockets with 1-2 TB/s (2-4 B/F) each; node totals of 4 TF, ~8 TB/s (~2 B/F), ~512 GB, 1-1.6 kW; nodes connected by a high-dimensional torus (40 GB/s x 2 links).
- Commodity-based system: NUMA node architecture; 1.6 TF sockets (90 GF cores) with 128 GB of memory each; node totals of 3.3 TF, 0.2 TB/s (0.1 B/F), 128 GB x 2, 0.4 kW; nodes connected by a fat-tree network (10 GB/s x 2 links).

Performance Estimation (1/2)
Our system vs. a future Xeon-based architecture (per 100,000 processes):
- Threads per process: 4 vs. 4
- Peak performance per process: 1 TF vs. 0.36 TF (2.79x)
- Peak performance per 100,000 processes: 100 PF vs. 35.84 PF (2.79x)
- Nodes per 100,000 processes: 25,000 vs. 11,111 (2.25x)
- Total memory bandwidth: 200 PB/s vs. 3.4 PB/s (58.6x)
- Total cache capacity: 3.2 TB vs. 0.8 TB (4x)
[Chart: estimated speedup and speedup per watt, normalized to the Xeon-based system.] Our system achieves 4x-8x the performance per watt on Seism3D and RSGDX because of their high B/F requirements, and comparable performance per watt on TURBINE.

Performance Estimation (2/2)
[Chart: execution time (seconds) versus number of processes for our system and the future commodity-based (Xeon-based) system; the commodity-based system's scalability stalls.]
The Xeon-based system needs 6.4M processes, corresponding to a peak of 3.2 EF, to achieve the same sustained performance as our 100 PF system.

Summary
- SX-ACE shows high sustained performance compared with SX-9, in particular for short-vector processing and indirect memory accesses.
- Well-balanced HEC systems with respect to memory performance are the key to realizing high productivity in science and engineering simulations in the post-peta-scale era.
- We are exploring the great potential of a new generation of vector architectures for future HPC systems, built on new device technologies such as 2.5D/3D die-stacking: high sustained memory bandwidth to fuel the vector function units at lower power/energy, and an on-chip vector load/store unit that can boost the sustained memory bandwidth energy-efficiently.
- When will such new technologies become available in production systems? Design tools, fabrication, and markets steer the future of these technologies.

Final Remarks: Now It's Time to Think Different!
~ Make HPC Systems Much More Comfortable and Friendly ~
- Target HPC system design at the entry and middle classes of the HPC community in daily use, not at top, flop/s-oriented, special-purpose use.
- Spend much more effort and resources on exploiting the potential of the system, even with a moderate number of nodes and/or cores, for daily use that requires high productivity in simulation, even though this approach sacrifices exa-flop/s-level peak performance.
- Seeking exa-flop/s with accelerators and then downsizing that deployment to the entry and middle classes is not a smart solution in the post-peta-scale era.
- Let's make "The Supercomputer for the Rest of Us" happen!