Fujitsu's Architectures and Collaborations for Weather Prediction and Climate Research




Fujitsu's Architectures and Collaborations for Weather Prediction and Climate Research. Ross Nobes, Fujitsu Laboratories of Europe

Fujitsu HPC - Past, Present and Future

Fujitsu has been developing advanced supercomputers since 1977. [Timeline slide] Milestones include:
- F230-75APU: Japan's first vector (array) supercomputer (1977)
- NWT (Numerical Wind Tunnel), developed with NAL: No. 1 in Top500 (Nov. 1993); Gordon Bell Prize (1994, 95, 96)
- VP Series, VPP500; VPP5000, the world's fastest vector processor (1999); VPP300/700
- PRIMEPOWER HPC2500: the world's most scalable supercomputer (2003); Japan's largest cluster in Top500 (July 2004)
- AP1000, AP3000, FX1: most efficient performance in Top500 (Nov. 2008)
- K computer (national project): No. 1 in Top500 (June and Nov. 2011)
- x86 cluster nodes: HX600, PRIMERGY RX200, PRIMERGY CX400, PRIMERGY BX900, and the next x86 generation
- Roadmap: K computer, FX10, Post-FX10, towards exascale

Fujitsu's Approach to the HPC Market

SPARC64-based supercomputers at the high end; x86 PRIMERGY HPC solutions spanning divisional, departmental and workgroup systems; and powerful workstations.

The K computer
- June 2011: No. 1 in TOP500 list (ISC11)
- November 2011: consecutive No. 1 in TOP500 list (SC11); ACM Gordon Bell Prize, Peak-Performance (SC11)
- November 2012: No. 1 in three HPC Challenge Award benchmarks (SC12); ACM Gordon Bell Prize (SC12)
- November 2013: No. 1 in Class 1 and 2 of the HPC Challenge Awards (SC13)
- June 2014: No. 1 in Graph 500 big-data supercomputer ranking (ISC14)

Fujitsu is building on the success of the K computer.

Evolution of PRIMEHPC

               K computer          PRIMEHPC FX10       Post-FX10
CPU            SPARC64 VIIIfx      SPARC64 IXfx        SPARC64 XIfx
Peak perf.     128 GFLOPS          236.5 GFLOPS        1 TFLOPS~
# of cores     8                   16                  32 + 2
Memory         DDR3 SDRAM          DDR3 SDRAM          HMC
Interconnect   Tofu                Tofu                Tofu2
System size    11 PFLOPS           Max. 23 PFLOPS      Max. 100 PFLOPS
Link BW        5 GB/s x2 (bidir)   5 GB/s x2 (bidir)   12.5 GB/s x2 (bidir)

Continuity in Architecture for Compatibility
- Upwards-compatible CPU: binary-compatible with the K computer and PRIMEHPC FX10
- Good byte/flop balance
- New features: new instructions and an improved micro-architecture
- For distributed parallel execution: compatible interconnect architecture with improved interconnect bandwidth

32 + 2 Core SPARC64 XIfx

A rich micro-architecture improves single-thread performance. Two additional assistant cores handle OS jitter avoidance and non-blocking MPI functions.

                               K          FX10       Post-FX10
Peak FP performance            128 GF     236.5 GF   1-TF class
Execution unit config.         2x FMA     2x FMA     2x FMA
SIMD width                     128 bit    128 bit    256 bit
Dual SP mode                   N/A        N/A        2x DP performance
Integer SIMD                   N/A        N/A        Supported
Single-thread enhancements     -          -          Improved OOO execution, better branch prediction, larger cache

Flexible SIMD Operations

New 256-bit wide SIMD functions enable versatile operations:
- Four simultaneous double-precision calculations
- Strided load/store (elements a specified stride apart in memory)
- Indirect (list) load/store (elements at arbitrary addresses)
- Permutation (arbitrary shuffle of register elements; 128 registers)
- Concatenation

[Slide diagram: examples of SIMD double-precision calculation, strided load, indirect load and permutation.]
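To make the memory-access patterns concrete, here is a scalar Python sketch of what the strided, indirect and permutation operations compute. The helper names are hypothetical illustrations, not Fujitsu's API; on the real hardware each gather or shuffle is a single wide-SIMD instruction rather than a loop.

```python
# Scalar sketch of the SIMD access patterns described above.
# Hypothetical helper names; real hardware does each in one instruction.

def strided_load(memory, base, stride, width=4):
    """Strided load: gather `width` elements starting at `base`, `stride` apart."""
    return [memory[base + i * stride] for i in range(width)]

def indirect_load(memory, indices):
    """Indirect (list) load: gather elements at arbitrary indices."""
    return [memory[i] for i in indices]

def permute(reg, pattern):
    """Permutation: arbitrary shuffle of register elements."""
    return [reg[p] for p in pattern]

mem = list(range(100))
print(strided_load(mem, base=2, stride=10))          # [2, 12, 22, 32]
print(indirect_load(mem, [7, 3, 3, 42]))             # [7, 3, 3, 42]
print(permute([1.0, 2.0, 3.0, 4.0], [3, 2, 1, 0]))   # [4.0, 3.0, 2.0, 1.0]
```

Strided and indirect access patterns are common in stencil and unstructured-mesh weather codes, which is why they matter for the applications discussed later.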

Hybrid Memory Cube (HMC)

The increased arithmetic performance of the processor needs higher memory bandwidth, and the requirement can almost be met by HMC (480 GB/s). The interconnect is boosted to 12.5 GB/s x2 (bi-directional) with optical links.

Peak performance per node:
                                  K      FX10     Post-FX10
DP performance (GFLOPS)           128    236.5    Over 1000
Memory bandwidth (GB/s)           64     85       480
Interconnect link BW (GB/s)       5      5        12.5

HMC (2)

Amount per processor:
                 Capacity     Memory BW
HMC x8           32 GB        480 GB/s
DDR4-DIMM x8     32~128 GB    154 GB/s
GDDR5 x16        8 GB         320 GB/s

HMC was adopted for main memory since its bandwidth is roughly three times that of DDR4; it can deliver the required bandwidth for high-performance multi-core processors.
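The "good byte/flop balance" claim made earlier can be checked from the figures above. A quick calculation (taking the Post-FX10 "over 1 TF" peak as 1000 GFLOPS for illustration) shows that HMC restores the bytes-per-flop ratio that the DDR3-based FX10 had lost:

```python
# Bytes-per-flop ratio per generation, from the per-node figures above.
# Post-FX10 peak is approximated as 1000 GFLOPS ("over 1 TF").
systems = {
    "K":         {"gflops": 128.0,  "mem_bw_gbs": 64.0},
    "FX10":      {"gflops": 236.5,  "mem_bw_gbs": 85.0},
    "post-FX10": {"gflops": 1000.0, "mem_bw_gbs": 480.0},
}
for name, s in systems.items():
    ratio = s["mem_bw_gbs"] / s["gflops"]
    print(f"{name}: {ratio:.2f} bytes/flop")
# K: 0.50, FX10: 0.36, post-FX10: 0.48
```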

Tofu2 Interconnect
- Successor to the Tofu interconnect
- Highly scalable, 6-dimensional mesh/torus topology
- Logical 3D, 2D or 1D torus network from the user's point of view
- Link bandwidth increased 2.5 times to 100 Gbps
- Built from 2x3x2 three-dimensional torus units; well-balanced shapes available

A Next Generation Interconnect

Tofu2 is optical-dominant: 2/3 of its network links are optical. [Chart: gross raw bitrate per node (up to 1.2 Tbps), split into optical and electrical network links, for IBM Blue Gene/Q, IB-FDR fat-tree, Cray Cascade, Tofu1 and Tofu2, with per-link signalling rates ranging from 5 to 25 Gbps.]

Reduction in the Chip Area Size
- Process technology shrinks from 65 nm to 20 nm
- System-on-chip integration eliminates the host bus interface
- Chip area shrinks to 1/3: the Tofu1 interconnect controller (65 nm) was under 300 mm2, while Tofu2, integrated into the 34-core SPARC64 XIfx processor with 8 HMC links (20 nm), is under 100 mm2

[Diagram: each chip comprises the Tofu network router (TNR), Tofu network interface (TNI), Tofu barrier interface (TBI) and a PCI Express root complex.]

Throughput of Single Put Transfer

Tofu2 achieved 11.46 GB/s of Put transfer throughput, which is 92% efficiency, compared with 4.66 GB/s for Tofu1. [Chart: throughput versus message size, 4 bytes to 4096 bytes.]
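The 92% figure is simply the measured throughput divided by the 12.5 GB/s peak link bandwidth quoted earlier:

```python
# Efficiency of the single Put transfer: measured / peak link bandwidth.
peak_gbs = 12.5      # Tofu2 link bandwidth (GB/s)
measured_gbs = 11.46 # achieved Put throughput (GB/s)
print(f"{measured_gbs / peak_gbs:.0%}")  # 92%
```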

Throughput of Concurrent Put Transfers

Without the host bus bottleneck, throughput increases linearly with concurrency: Tofu2 reaches 45.82 GB/s at 4-way, whereas Tofu1 saturates at 17.63 GB/s because of the host bus. [Chart: aggregate Put throughput for 1-way to 4-way concurrent transfers.]

Communication Latencies

Pattern      Method                            Latency
One-way      Put 8-byte to memory              0.87 μsec
One-way      Put 8-byte to cache               0.71 μsec
Round-trip   Put 8-byte ping-pong by CPU       1.42 μsec
Round-trip   Put 8-byte ping-pong by session   1.41 μsec
Round-trip   Atomic RMW 8-byte                 1.53 μsec

Communication Intensive Applications
- Fujitsu's math library provides 3D-FFT functionality with improved scalability for massively parallel execution
- Significantly improved interconnect bandwidth
- Optimised process configuration, enabled by the Tofu interconnect and job manager
- Optimised MPI library provides high-performance collective communications
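The reason a 3D-FFT stresses the interconnect is that it factorises into 1D FFTs along each axis, with global data transposes between the passes; parallel 3D-FFT libraries distribute those passes across nodes. A small NumPy sketch (illustrative, not Fujitsu's library) shows the factorisation that such libraries exploit:

```python
import numpy as np

# A 3D FFT is separable: 1D FFTs along each axis in turn give the same
# result as the full 3D transform. Distributed 3D-FFT libraries perform
# each axis pass locally and transpose (all-to-all) the data in between.
rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8, 8))

step = np.fft.fft(a, axis=0)     # 1D FFTs along x
step = np.fft.fft(step, axis=1)  # (transpose, then) 1D FFTs along y
step = np.fft.fft(step, axis=2)  # (transpose, then) 1D FFTs along z

assert np.allclose(step, np.fft.fftn(a))
```

The all-to-all transposes between passes are exactly the collective pattern that benefits from Tofu2's higher link bandwidth.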

Enhanced Software Stack for Post-FX10 [slide]

Enduring Programming Model [slide]

Smaller, Faster, More Efficient

Highly integrated components with high-density packaging: the performance of one chassis corresponds to approximately one cabinet of the K computer, making the design efficient in space, time and power.
- CPU: 32 + 2 core SPARC64 XIfx with Tofu2 integrated; 1 CPU per node
- Memory: 8 HMCs (Hybrid Memory Cube) per CPU
- Board: Fujitsu-designed CPU memory board with three CPUs, 3 x 8 HMCs and 8 optical modules (25 Gbps x 12 ch)
- Chassis: 12 nodes per 2U chassis, water cooled
- Cabinet: over 200 nodes per cabinet

Scale-Out Smart for HPC and Cloud Computing: Fujitsu Server PRIMERGY CX400 M1 (CX2550 M1 / CX2570 M1)

PRIMERGY x86 Cluster Series

Fujitsu has been developing advanced supercomputers since 1977. [Slide repeats the HPC history timeline shown earlier, from the F230-75APU (1977) through the K computer, FX10 and Post-FX10 to the next x86 generation.]

Fujitsu PRIMERGY Portfolio

Product families: TX Family, RX Family, BX Family and CX Family. The CX family comprises the CX400 chassis with CX420, CX2550 and CX2570 server nodes.

Fujitsu PRIMERGY CX400 M1 Feature Overview
- 4 dual-socket nodes in 2U density
- Up to 24 storage drives
- Choice of server nodes: dual-socket servers with the Intel Xeon processor E5-2600 v3 product family (Haswell)
- Optionally up to two GPGPU or co-processor cards
- DDR4 memory technology
- Cool-safe Advanced Thermal Design enables operation at higher ambient temperatures

Fujitsu PRIMERGY CX2550 M1 Feature Overview

Highest performance and density:
- Condensed half-width 1U server; up to 4x CX2550 M1 in a CX400 M1 2U chassis
- Intel Xeon E5-2600 v3 product family; 16 DIMMs per server node with up to 1,024 GB DDR4 memory

High reliability and low complexity:
- Variable local storage: 6x 2.5" drives per node, 24 in total
- Support for up to 2x PCIe SSDs per node for fast caching
- Hot-plug server nodes, power supplies and disk drives enable enhanced availability and easy serviceability

Fujitsu PRIMERGY CX2570 M1 Feature Overview

HPC optimisation:
- Intel Xeon E5-2600 v3 product family; 16 DIMMs per server node with up to 1,024 GB DDR4 memory
- Optionally two high-end GPGPU/co-processor cards (Nvidia Tesla, Grid or Intel Xeon Phi)

High reliability and low complexity:
- Variable local storage: 6x 2.5" drives per node, 12 in total
- Support for up to 2x PCIe SSDs per node for fast caching
- Hot-plug server nodes, power supplies and disk drives enable enhanced availability and easy serviceability

Extreme Weather

There is strong interest within Wales in the impacts of extreme weather events, and Fujitsu has established collaborations in this area: HPC Wales studentships and a Knowledge Transfer Partnership with Swansea University.

HPC Wales Fujitsu PhD Studentships

20 PhD studentships in Welsh Government priority sectors, including Energy and Environment:
- Cardiff University (Dr Michaela Bray): Extreme weather events in Wales, from weather modelling to flood inundation modelling; Unified Model linked to a river-estuary numerical flow model
- Bangor University (Dr Reza Hashemi): Simulating the impacts of climate change on estuarine dynamics; integrated catchment-to-coast model
- Bangor University (Dr Simon Creer): Biogeochemical analysis of the effects of drought on CO2 release from peat bogs and fens

Knowledge Transfer Partnership
- UK Government programme to promote skills uptake
- Partnership between Swansea University and Fujitsu, with funding from the Welsh Government including overseas visits
- Focus: Met Office Unified Model performance; modelling of extreme weather and flood risk
- Two Associates, working on UM optimisation and on coupling with hydrology codes
- Benefits to the company: improved application performance, new solutions and services
- Structure: company supervisor (FLE), academic supervisor (Swansea), knowledge base in Engineering at Swansea

Collaboration with NCI

Planned team of 10: 4 funded positions, plus 2 in-kind positions each from NCI, the Bureau of Meteorology (BoM) and Fujitsu. The team will contribute to the development of NG-ACCESS, as set out in "Next Generation Australian Community Climate and Earth-System Simulator (NG-ACCESS) - A Roadmap 2014-2019".

[Diagram: ACCESS model components - the UM atmosphere with UKCA chemistry, the MOM ocean model, the CICE sea-ice model, WW3 waves and the CABLE land-surface scheme, coupled via OASIS-MCT, with 4DVAR and EnOI data assimilation.]

Collaborative Activities

Activities within the ACCESS Optimisation Project:
- General: understanding of scalability bottlenecks; IO server optimisation; improved use of thread (OpenMP) parallelism
- Atmosphere: scaling and optimisation of 2-3 global configurations of the UM, using benchmark configurations provided by the Bureau
- Ocean: scaling and optimisation of two global-resolution configurations of the MOM ocean model; coupled MOM-CICE utilising the OASIS3-MCT coupler
- Coupled climate: performance characteristics of ACCESS-CM 1.3, then introducing a 0.25-degree global ocean into ACCESS-CM 1.4
- Data assimilation: collect performance and scalability data for the various data assimilation codes used in ACCESS parallel suites; initial work will focus on the scalability of 4DVAR

Summary
- Fujitsu is committed to developing high-end HPC platforms. Fujitsu has been selected as RIKEN's partner in the FLAGSHIP 2020 project to develop an exascale-class supercomputer, and Post-FX10 is a step on the way to exascale computing. (Press release, October 1, 2014: "RIKEN Selects Fujitsu to Develop New Supercomputer" - following an open bidding process, Fujitsu Ltd. has been selected to work with RIKEN to develop the basic design for Japan's next-generation supercomputer.)
- Fujitsu also offers x86 cluster solutions across the range from departmental to petascale
- Fujitsu is promoting collaborative research and development programmes with key partners and customers