Pseudo-Vectorization and RISC Optimization Techniques for the Hitachi SR8000 Architecture




Transcription:

KONWIHR Project: Centre of Excellence for High Performance Computing
Pseudo-Vectorization and RISC Optimization Techniques for the Hitachi SR8000 Architecture
F. Deserno, G. Hager, F. Brechtefeld, G. Wellein (Regionales Rechenzentrum Erlangen)
L. Palm, M. Brehm (LRZ München)

Centre of Excellence for High Performance Computing (cxhpc)
Ensure efficient use of supercomputers by providing top-quality HPC project support for Physics, Chemistry, Engineering, etc.:
- Architecture-specific optimizations
- Appropriate programming models
- Efficient algorithms and solvers
- Finding the appropriate (super)computer
- Hot-line --- large projects
- HPC training & lectures
- Information / PR

HPC Support Projects
Supported projects: Breuer/Durst (h001v, h0011), Brenner/Durst (h001y), Hofmann (h008z), Fehske (h0441), Heß (h023z), Rüde (h0671)
Application areas: Mathematics, Fluid Dynamics, Material Science, Theoretical Physics, Theoretical Chemistry, Computer Sciences, etc.
Methods: simulation of complex flows, finite-volume methods (SIP solver), Lattice-Boltzmann methods, quantum-mechanical many-body problems, exact diagonalization (sparse/dense), DMRG
Goals:
- Support for large-scale HLRB projects
- Find the appropriate (super)computer for each problem/scientist
- Competence & consulting for methods used and developed by local (FAU) scientists

Performance evaluation: Benchmark systems

Platform                           Peak [GFlop/s]  MemBW [GB/s]           L1 cache [kB]  L2 cache [MB]  Remarks
Intel P4 1.5 GHz                   3.0             3.2                    8              0.256          RD-RAM
MIPS R14k 0.5 GHz                  1.0             1.6                    32             8              O3400
IBM Power4 1.3 GHz (32-way node)   5.2 / 166.0     13 / 110               32 / 1024      0.73 / 23      p690, 512 MB L3
NEC SX5e                           4.0             32.0 (LD) / 16.0 (ST)  ---            ---            ---
HSR8k 0.375 GHz (8-way node)       1.5 / 12.0      4.0 / 32.0             128 / 1024     ---            PVP + COMPAS
(values given as per CPU / per node where applicable)

What is different between the Hitachi and the others?
Pseudo-Vector Processing (PVP, CPU level):
- Large register set (160 FP registers)
- 16 outstanding PREFETCH or 128 outstanding PRELOAD
- Extensive software pipelining
- Hides memory latency
COMPAS (SMP-node level):
- High peak performance & aggregate memory bandwidth
- 512-way memory interleaving
- Collective thread operations
Result: vector-processor-like performance with RISC technology and the compiler.

Performance Evaluation: Vector Triad
Kernel: A(1:N) = B(1:N) + C(1:N)*D(1:N)
Memory efficiency, single processors: HSR8k 1 CPU: 92%; Intel P4/RD-RAM: 55%
Memory efficiency, vector processors / HSR node: HSR8k 1 node: 75%; NEC SX5e: 75% (max. BW), 97% (effective BW)
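As a point of reference, a minimal free-form Fortran sketch of such a triad measurement is shown below; the array length, repetition count, initial values and the use of cpu_time are illustrative choices and not taken from the benchmark runs reported here.

! Minimal vector-triad sketch (illustrative, not the original benchmark code).
program vector_triad
  implicit none
  integer, parameter :: n = 4000000, niter = 10
  double precision, allocatable :: a(:), b(:), c(:), d(:)
  double precision :: t0, t1, mflops
  integer :: i, it

  allocate(a(n), b(n), c(n), d(n))
  a = 0.0d0; b = 1.0d0; c = 2.0d0; d = 3.0d0

  call cpu_time(t0)
  do it = 1, niter
     do i = 1, n
        a(i) = b(i) + c(i)*d(i)            ! 2 flops, 4 memory streams
     end do
     if (a(n/2) < 0.0d0) print *, a(n/2)   ! prevent dead-code elimination
  end do
  call cpu_time(t1)

  mflops = 2.0d0 * dble(n) * dble(niter) / (t1 - t0) / 1.0d6
  print *, 'vector triad: ', mflops, ' MFlop/s'
end program vector_triad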

Performance Evaluation: Sparse Matrix-Vector Multiplication
Sparse MVM is the numerical core of exact diagonalization algorithms (Davidson, Lanczos, etc.) that are widely used in theoretical physics and theoretical chemistry. Several storage formats are available, e.g. CRS and JDS.
Jagged Diagonals Storage (JDS) format:
- Best performance for Hitachi and vector systems
- Only minor performance drawbacks on RISC systems
- Shared-memory parallelization of the inner loop
The outer loop runs over the jagged diagonals, i.e. the max. number of non-zeros per row (10-100); the inner loop length is of the order of the matrix dimension (10^3-10^9):

DO j = 1, max_nonz
  DO i = 1, (jd_ptr(j+1) - jd_ptr(j))
    Y(i) = Y(i) + VALUE(jd_ptr(j)+i-1) * X( COL_IND(jd_ptr(j)+i-1) )
  ENDDO
ENDDO

Indirect / non-contiguous memory access: performance is limited by memory bandwidth & latency!
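For reference, the same JDS kernel as a self-contained free-form subroutine is sketched below; the argument declarations and the OpenMP directive on the inner loop are illustrative additions, not part of the original code.

! Hedged sketch of the JDS sparse MVM kernel shown above, with the inner
! loop distributed over the threads of an SMP node; declarations and the
! OpenMP directive are illustrative additions.
subroutine jds_spmv(n, max_nonz, jd_ptr, col_ind, value, x, y)
  implicit none
  integer, intent(in)             :: n, max_nonz
  integer, intent(in)             :: jd_ptr(max_nonz+1), col_ind(*)
  double precision, intent(in)    :: value(*), x(n)
  double precision, intent(inout) :: y(n)
  integer :: i, j

  do j = 1, max_nonz                    ! jagged diagonals (short loop)
!$omp parallel do
     do i = 1, jd_ptr(j+1) - jd_ptr(j)  ! long loop, ~ matrix dimension
        y(i) = y(i) + value(jd_ptr(j)+i-1) * x(col_ind(jd_ptr(j)+i-1))
     end do
!$omp end parallel do
  end do
end subroutine jds_spmv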

Performance Evaluation: Sparse MVM
Pseudo-vector processing of the sparse MVM (JDS format), per inner-loop iteration:
- Prefetch the index array COL_IND
- Load the index from cache to a register
- Preload the single data item X(index)
The innermost loop is unrolled 48 times by the HSR compiler!
PVP requires:
- intermediate to long loop lengths (unrolling / pipelining)
- no data dependencies (PREFETCH/PRELOAD)
- a small to intermediate loop body (register spill!)

Performance Evaluation: Sparse MVM
Memory efficiency, single processor: HSR8k 1 CPU: 70%; Intel P4/RD-RAM: 48%
SMP scalability, vector processor / SMP nodes: HSR8k 1 node: 89% (8 proc.); IBM Power4: 56% (16 proc.), 39% (32 proc.)

Use PVP with care!
Simple kernel from nuclear physics: FORTRAN, approx. 200 KByte (scattering problems with three-nucleon forces, Prof. H. Hofmann, FAU)

DO M = 1, IQM
  DO K = KZHX(M), KZAHL
    F(K) = F(K) * S(MVK(K,M))
  ENDDO
ENDDO

S(): short, approx. 100-200 elements
IQM: small, typically 9
KZAHL: much larger than 1000
HSR compiler: preload streams for S() lead to poor performance. *voption nopreload improves performance by a factor of 2.9; blocking of the M loop & unrolling of the inner loop gives an additional 12% (see the sketch below).

            HSR8k-F1     MIPS R14k     Intel P4 (1.5 GHz)
Original    26 MFlop/s   97 MFlop/s    195 MFlop/s
Optimized   90 MFlop/s   149 MFlop/s   257 MFlop/s
Speed-up    3.46         1.54          1.32
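One way the combination of M-loop blocking and inner-loop unrolling might look is sketched below; the pairwise blocking factor and the assumption that KZHX(M) is non-decreasing in M are illustrative and not taken from the original code, of which only the short kernel above is shown.

! Hedged sketch of "blocking of the M loop & unrolling of the inner loop".
! Assumes KZHX(M) is non-decreasing in M; blocking factor 2 is illustrative.
subroutine scale_blocked(F, S, MVK, KZHX, IQM, KZAHL, LDK, NS)
  implicit none
  integer, intent(in)             :: IQM, KZAHL, LDK, NS
  integer, intent(in)             :: KZHX(IQM), MVK(LDK,IQM)
  double precision, intent(in)    :: S(NS)
  double precision, intent(inout) :: F(KZAHL)
  integer :: M, K

  do M = 1, IQM - 1, 2
     do K = KZHX(M), KZHX(M+1) - 1   ! only iteration M contributes here
        F(K) = F(K) * S(MVK(K,M))
     end do
     do K = KZHX(M+1), KZAHL         ! F(K) loaded/stored once for two updates
        F(K) = F(K) * ( S(MVK(K,M)) * S(MVK(K,M+1)) )
     end do
  end do
  if (mod(IQM,2) == 1) then          ! cleanup pass for odd IQM
     do K = KZHX(IQM), KZAHL
        F(K) = F(K) * S(MVK(K,IQM))
     end do
  end if
end subroutine scale_blocked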

CFD applications: Strongly Implicit Solver
In CFD, the linear systems A x = b arising from finite-volume discretizations can be solved with the Strongly Implicit Procedure (SIP) according to Stone.
The SIP solver is widely used:
- LESOCC, FASTEST, FLOWSI (Institute of Fluid Mechanics, Erlangen)
- STHAMAS3D (Crystal Growth Laboratory, Erlangen)
- CADiP (Theoretical Thermodynamics and Transport Processes, Bayreuth)
SIP solver:
1) Incomplete LU factorization
2) Series of forward/backward substitutions
A toy program is available at ftp.springer.de in /pub/technik/peric (M. Peric).

SIP solver: Data dependencies & implementations
Basic data dependency: (i,j,k) depends on {(i-1,j,k); (i,j-1,k); (i,j,k-1)}
- 3-fold nested loop (3D) over (i,j,k): data locality, but no shared-memory parallelization (Hitachi: pipeline parallel processing)
- Hyperplane (i+j+k = const): non-contiguous memory access; shared-memory parallelization / vectorization of the innermost loop
- Hyperline (i, j+k = const): shared-memory parallelization of the (j+k = const) loop; contiguous memory access for the innermost (i) loop
A sketch of the hyperline variant follows below.
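The following free-form sketch shows one possible shape of the hyperline variant of the SIP forward substitution; the coefficient and residual array names (lw, ls, lb, lp, rhs, res), the interior-point loop bounds and the OpenMP directive are illustrative assumptions, only the dependency pattern and the loop ordering follow the slide. Boundary values of res are assumed to be set by the caller.

! Hedged sketch of the hyperline variant: all lines (j,k) with j + k = l are
! mutually independent and can be distributed over the threads of an SMP node,
! while the innermost i loop stays recursive but accesses memory with stride 1.
subroutine sip_forward_hyperline(ni, nj, nk, lw, ls, lb, lp, rhs, res)
  implicit none
  integer, intent(in)             :: ni, nj, nk
  double precision, intent(in)    :: lw(ni,nj,nk), ls(ni,nj,nk), lb(ni,nj,nk)
  double precision, intent(in)    :: lp(ni,nj,nk), rhs(ni,nj,nk)
  double precision, intent(inout) :: res(ni,nj,nk)
  integer :: i, j, k, l

  do l = 4, (nj - 1) + (nk - 1)             ! hyperline index l = j + k
!$omp parallel do private(k, i)
     do j = max(2, l - (nk - 1)), min(nj - 1, l - 2)
        k = l - j
        do i = 2, ni - 1                    ! recursive in i, stride-1 access
           res(i,j,k) = ( rhs(i,j,k)                   &
                        - lw(i,j,k) * res(i-1,j,k)     &
                        - ls(i,j,k) * res(i,j-1,k)     &
                        - lb(i,j,k) * res(i,j,k-1) ) * lp(i,j,k)
        end do
     end do
!$omp end parallel do
  end do
end subroutine sip_forward_hyperline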

SIP solver: Implementations & single-processor performance
Benchmark: 91^3 lattice (approx. 100 MB), 1 ILU factorization + 500 iterations.
Notes: HSR8k-F1, 3D variant: unrolled 32 times; IBM Power4: 128 MB L3 cache accessible for 1 processor.
[Bar chart, 0-300 MFlop/s: single-processor performance of the 3D, hyperplane and hyperline implementations on HSR8k, MIPS R14k, Intel P4 and IBM Power4.]

SIP solver: Implementations & shared-memory scalability
[Charts: hyperplane and hyperline performance in MFlop/s vs. number of processors (1, 4, 8, 16) at fixed problem size 91^3; and performance vs. problem size (31^3, 91^3, 201^3, i.e. approx. 4 MB, 100 MB, 1 GB of memory) for HSR8k-F1, HSR8k-F1 (3D, 8 proc.), IBM Power4 (hyperline, 8 proc.) and NEC SX5e.]

Summary & Outlook
Summary: efficient use of the Hitachi SR8000 requires
- vector-like codes
- a high level of loop unrolling
- Pseudo-Vectorization + COMPAS
- use of the large register set
The Hitachi SR8000 techniques are forward-looking: a large register set / many outstanding memory references, and high memory bandwidth for both the single processor and the SMP node.
Outlook: optimization techniques for new architectures:
- IBM Power4: shared (large) caches
- Intel Itanium2/3: EPIC, large register set
Parallel programming techniques for SMP clusters:
- MPP model (pure MPI)
- hybrid model (MPI + OpenMP/automatic)

Acknowledgement