Pseudo-Vectorization and RISC Optimization Techniques for the Hitachi SR8000 Architecture



1 KONWIHR Project: Centre of Excellence for High Performance Computing
Pseudo-Vectorization and RISC Optimization Techniques for the Hitachi SR8000 Architecture
F. Deserno, G. Hager, F. Brechtefeld, G. Wellein (Regionales Rechenzentrum Erlangen)
L. Palm, M. Brehm (LRZ München)


2 Centre of Excellence for High Performance Computing
Ensure efficient use of supercomputers by providing top-quality HPC project support.
cxhpc links the application fields (Physics, Chemistry, Engineering, ...) with the supercomputers:
  Architecture-specific optimizations
  Appropriate programming models
  Efficient algorithms and solvers
  Find appropriate (super)computer
  Hot-line --- large projects
  HPC training & lectures
  Information / PR


3 HPC Support Projects
Projects: Breuer/Durst (h001v, h0011), Brenner/Durst (h001y), Hofmann (h008z), Fehske (h0441), Heß (h023z), Rüde (h0671)
Fields: Applied Mathematics, Fluid Dynamics, Material Science, etc.; Theoretical Physics, Theoretical Chemistry, Computer Sciences
Methods:
  Simulation of complex flows: Finite-Volume (SIP solver), Lattice-Boltzmann methods
  Quantum-mechanical many-body problems: Exact diagonalization (sparse/dense), DMRG
Goals:
  Support for large-scale HLRB projects
  Find appropriate (super)computer for each problem/scientist
  Competence & consulting for methods used and developed by local (FAU) scientists


4 Performance evaluation: Benchmark systems
[Table: Platform, Peak [GFlop/s], MemBW [GB/s], L1 cache [kB], L2 cache [MB]; most numeric entries not recoverable]
  Intel P4 1.5 GHz, RD-RAM
  MIPS R14k 0.5 GHz, O3400
  IBM Power4 1.3 GHz (32-way node), L3 cache
  NEC SX5e: (LD) 16.0 (ST)
  HSR8k (8-way node), PVP + COMPAS
What is different between Hitachi and others?
Pseudo-Vector-Processing (PVP, CPU level):
  Large register set (160 FP registers)
  16 outstanding PREFETCH or 128 outstanding PRELOAD
  Extensive software pipelining
  Hide memory latency
COMPAS (SMP-node level):
  High peak performance & aggregate memory bandwidth
  512-way memory interleaving
  Collective thread operations
Vector-processor-like performance with RISC technology
Compiler


5 Performance Evaluation: Vector-Triad
A(1:N)=B(1:N)+C(1:N)*D(1:N)
Memory efficiency, single processors:
  HSR8k 1 CPU: 92%
  Intel P4/RD-RAM: 55%
Memory efficiency, vector processors / HSR node:
  HSR8k 1 node: 75%
  NEC SX5e: 75% (max. BW), 97% (effective BW)
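The triad moves four words per iteration (three loads, one store) for two floating-point operations, so its speed is bounded by memory bandwidth rather than by peak flops. The following is a minimal Python sketch of that bound. It assumes double precision and ignores write-allocate traffic; the function names are illustrative, and relating measured performance to this bound is only one plausible reading of "memory efficiency", which the slides do not define.

```python
def vector_triad(b, c, d):
    """Return A with A(i) = B(i) + C(i) * D(i), the kernel from the slide."""
    return [bi + ci * di for bi, ci, di in zip(b, c, d)]


BYTES_PER_WORD = 8   # assumption: double precision
WORDS_PER_ITER = 4   # 3 loads (B, C, D) + 1 store (A); write-allocate ignored
FLOPS_PER_ITER = 2   # one multiply, one add


def bandwidth_bound_mflops(mem_bw_gb_per_s):
    """Upper bound on triad MFlop/s for a given memory bandwidth in GB/s."""
    bytes_per_flop = BYTES_PER_WORD * WORDS_PER_ITER / FLOPS_PER_ITER  # 16 B/flop
    return mem_bw_gb_per_s * 1e9 / bytes_per_flop / 1e6


def memory_efficiency(measured_mflops, mem_bw_gb_per_s):
    """Measured performance as a fraction of the bandwidth bound."""
    return measured_mflops / bandwidth_bound_mflops(mem_bw_gb_per_s)
```

With a hypothetical 4 GB/s memory system, `bandwidth_bound_mflops(4.0)` gives 250 MFlop/s, so a measured 125 MFlop/s would correspond to 50% efficiency under this definition.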

6 Performance Evaluation: Sparse Matrix-Vector Multiplication
Sparse MVM is the numerical core of exact diagonalization algorithms (Davidson, Lanczos, etc.) widely used in theoretical physics and theoretical chemistry.
Several storage formats are available: JDS, CRS, ...
Jagged Diagonals Storage (JDS) format:
  Best performance for Hitachi and vector systems
  Only minor performance drawbacks on RISC systems
  Shared-memory parallelization of inner loop

    DO j = 1,max_nonz                       ! max. number of non-zeros per row (10-100)
      DO i = 1,(jd_ptr(j+1)-jd_ptr(j))      ! up to the matrix dimension
        Y(i)=Y(i)+VALUE(jd_ptr(j)+i-1)*X(COL_IND(jd_ptr(j)+i-1))
      ENDDO
    ENDDO

Indirect / non-contiguous memory access
Performance limited by memory bandwidth & latency!
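To make the JDS loop nest concrete, here is a minimal Python sketch that builds the three arrays from a small dense matrix and then applies the same two loops with 0-based indices. The construction routine `dense_to_jds`, the permutation array `perm`, and the wrapper `spmv` are illustrative assumptions; the slides only show the multiplication loop.

```python
def dense_to_jds(a):
    """Convert a dense matrix (list of rows) to Jagged Diagonals Storage.

    Returns (perm, value, col_ind, jd_ptr) with 0-based indices.
    perm[i] is the original row stored at permuted position i.
    """
    n = len(a)
    # Nonzeros of each row as (column, value) pairs.
    rows = [[(j, v) for j, v in enumerate(row) if v != 0] for row in a]
    # Sort rows by decreasing number of nonzeros (stable sort).
    perm = sorted(range(n), key=lambda r: -len(rows[r]))
    max_nonz = len(rows[perm[0]]) if n else 0
    value, col_ind, jd_ptr = [], [], [0]
    for j in range(max_nonz):
        # Jagged diagonal j holds the j-th nonzero of every row that has one.
        for r in perm:
            if len(rows[r]) <= j:
                break  # rows are sorted, so all later rows are shorter too
            col, v = rows[r][j]
            col_ind.append(col)
            value.append(v)
        jd_ptr.append(len(value))
    return perm, value, col_ind, jd_ptr


def jds_mvm(n, value, col_ind, jd_ptr, x):
    """y = A x in permuted row order, same loop nest as the slide (0-based)."""
    y = [0.0] * n
    for j in range(len(jd_ptr) - 1):
        start = jd_ptr[j]
        for i in range(jd_ptr[j + 1] - start):
            # Indirect access to x through col_ind, as in the Fortran loop.
            y[i] += value[start + i] * x[col_ind[start + i]]
    return y


def spmv(a, x):
    """Multiply via JDS and undo the row permutation."""
    perm, value, col_ind, jd_ptr = dense_to_jds(a)
    y_perm = jds_mvm(len(a), value, col_ind, jd_ptr, x)
    y = [0.0] * len(a)
    for i, r in enumerate(perm):
        y[r] = y_perm[i]
    return y
```

Because the rows are sorted by length, every jagged diagonal is a dense run starting at row 0 of the permuted matrix, which is what gives the inner loop its long, dependency-free trip count.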


7 Performance Evaluation: Sparse MVM
Pseudo-Vector-Processing of sparse MVM (JDS format):
[Diagram: pipeline stages per iteration, plotted over time]
  Prefetch index array col_ind
  Load index from cache to register
  Preload single data item x(index)
Innermost loop is unrolled 48 times by the HSR compiler!
PVP:
  intermediate to long loop lengths (unrolling / pipelining)
  no data dependencies (PREFETCH/PRELOAD)
  small to intermediate loop body (register spill!!)


8 Performance Evaluation: Sparse MVM
Memory efficiency (single processor):
  HSR8k 1 CPU: 70%
  Intel P4/RD-RAM: 48%
SMP scalability (vector processor / SMP nodes):
  HSR8k 1 node: 89% (8 p.)
  IBM Power4: 56% (16 p.), 39% (32 p.)


9 Use PVP with care!
Simple kernel from nuclear physics: FORTRAN, approx. 200 KByte
(Scattering problems with three-nucleon forces, Prof. H. Hofmann, FAU)

    DO M=1,IQM
      DO K=KZHX(M),KZAHL
        F(K)=F(K) * S(MVK(K,M))
      ENDDO
    ENDDO

S(): short; approx. 100-200
IQM: small; typically 9
KZAHL: much larger than 1000
The HSR compiler generates preload streams for S(): poor performance
*voption nopreload improves performance by a factor of 2.9
Blocking of the M loop & unrolling of the inner loop: additional 12%

             HSR8k-F1      MIPS-R14k     Intel P4 (1.5 GHz)
Original     26 MFlop/s    97 MFlop/s    195 MFlop/s
Optimized    90 MFlop/s    149 MFlop/s   257 MFlop/s
Speed-up     [values not recoverable]
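The slides do not show the optimized source. One way to read "blocking of the M loop" is to process two values of M per sweep over K, so that each F(K) is loaded and stored once for two multiplications. The Python sketch below is a hypothetical reconstruction under that assumption, with 0-based indices and MVK stored as a list of rows; it only demonstrates that the blocked order gives the same result and does not show the inner-loop unrolling.

```python
def kernel_original(f, s, mvk, kzhx, kzahl):
    """Loop nest from the slide, 0-based: for each M, sweep K = KZHX(M)..KZAHL-1."""
    f = list(f)
    for m in range(len(kzhx)):
        for k in range(kzhx[m], kzahl):
            f[k] = f[k] * s[mvk[k][m]]
    return f


def kernel_blocked(f, s, mvk, kzhx, kzahl):
    """Same result, with the M loop blocked by two (hypothetical variant)."""
    f = list(f)
    iqm = len(kzhx)
    m = 0
    while m + 1 < iqm:
        lo0, lo1 = kzhx[m], kzhx[m + 1]
        common = max(lo0, lo1)
        # Elements touched by only one of the two M values.
        for k in range(lo0, min(common, kzahl)):
            f[k] = f[k] * s[mvk[k][m]]
        for k in range(lo1, min(common, kzahl)):
            f[k] = f[k] * s[mvk[k][m + 1]]
        # Shared range: F(K) is loaded and stored once for two multiplies.
        # Left-to-right evaluation keeps the original multiplication order.
        for k in range(common, kzahl):
            f[k] = f[k] * s[mvk[k][m]] * s[mvk[k][m + 1]]
        m += 2
    if m < iqm:  # leftover M when IQM is odd
        for k in range(kzhx[m], kzahl):
            f[k] = f[k] * s[mvk[k][m]]
    return f
```

Since S() is short and fits in cache, the remaining memory traffic is dominated by F and MVK; halving the number of passes over F is the intended effect of the blocking in this sketch.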


10 CFD applications: Strongly Implicit Solver
CFD: Solving A x = b for finite volume methods can be done by the Strongly Implicit Procedure (SIP) according to Stone.
The SIP solver is widely used:
  LESOCC, FASTEST, FLOWSI (Institute of Fluid Mechanics, Erlangen)
  STHAMAS3D (Crystal Growth Laboratory, Erlangen)
  CADiP (Theoretical Thermodynamics and Transport Processes, Bayreuth)
SIP solver:
  1) Incomplete LU factorization
  2) Series of forward/backward substitutions
Toy program available at: ftp.springer.de in /pub/technik/peric (M. Peric)


11 SIP-solver: Data dependencies & Implementations
Basic data dependency: (i,j,k) depends on {(i-1,j,k); (i,j-1,k); (i,j,k-1)}
3-fold nested loop (3D): (i,j,k)
  Data locality
  No shared-memory parallelization (Hitachi: pipeline parallel processing)
Hyperplane: (i+j+k=const)
  Non-contiguous memory access
  Shared-memory parallelization / vectorization of innermost loop
Hyperline: (i, j+k=const)
  Shared-memory parallelization of the (j+k=const) loop
  Contiguous memory access for the innermost (i) loop
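The hyperplane ordering works because all three neighbours of a point with i+j+k = l lie on the hyperplane l-1. The Python sketch below checks this on a toy recurrence with the same dependency pattern; the recurrence itself (one plus the sum of the three neighbours) and the function names are placeholders, not the SIP update.

```python
def neighbour_sum(v, i, j, k):
    """Sum of the three upwind neighbours; points outside the grid count as 0."""
    return (v.get((i - 1, j, k), 0)
            + v.get((i, j - 1, k), 0)
            + v.get((i, j, k - 1), 0))


def sweep_lexicographic(n):
    """3-fold nested loop: the straightforward, strictly sequential order."""
    v = {}
    for k in range(n):
        for j in range(n):
            for i in range(n):
                v[i, j, k] = 1 + neighbour_sum(v, i, j, k)
    return v


def hyperplanes(n):
    """Yield, for each l = i+j+k, the list of grid points on that hyperplane."""
    for l in range(3 * (n - 1) + 1):
        yield [(i, j, l - i - j)
               for i in range(n) for j in range(n)
               if 0 <= l - i - j < n]


def sweep_hyperplane(n):
    """Same recurrence ordered by hyperplane; points in a plane are independent."""
    v = {}
    for plane in hyperplanes(n):
        # All three neighbours lie on the previous hyperplane, so this inner
        # loop could run in parallel or be vectorized.
        for (i, j, k) in plane:
            v[i, j, k] = 1 + neighbour_sum(v, i, j, k)
    return v
```

Both sweeps produce identical values. The price of the hyperplane order is visible in `hyperplanes`: consecutive points of one plane are not adjacent in a lexicographic array layout, which matches the non-contiguous memory access noted on the slide.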


12 SIP-solver: Implementations & Single Processor Performance
Benchmark: lattice [size and memory in MB not recoverable], 1 ILU, 500 iterations
[Chart: MFlop/s of the 3D, hyperplane and hyperline implementations on HSR8k, MIPS R14k, Intel P4 and IBM Power4]
Performance:
  HSR8k-F1: 3D: unrolling 32 times
  IBM Power4: 128 MB L3 cache accessible for 1 proc


13 SIP-solver: Implementations & Shared-memory scalability
[Plots, fixed problem size: MFlop/s vs. number of processors for the hyperplane and hyperline implementations; HSR8k-F1, HSR8k-F1 (3D), IBM Power4, NEC SX5e]
[Plot, varying problem size: MFlop/s vs. memory (4 MB, 100 MB, 1 GB) for problem sizes 31^3, 91^3, 201^3; HSR8k-F1 (3D) (8p), IBM Power4 (hl) (8p)]


14 Summary & Outlook
Summary
Efficient use of Hitachi SR8000:
  Vector codes
  High level of loop unrolling
  Pseudo-Vectorization + COMPAS
  Large register set
Hitachi SR8000 techniques are forward-looking:
  Large register set / many outstanding memory references
  High memory bandwidth: single processor and SMP node
Outlook
Optimization techniques for new architectures:
  IBM Power4: shared (large) caches
  Intel Itanium2/3: EPIC; large register set
Parallel programming techniques for SMP clusters:
  MPP model (pure MPI)
  Hybrid model (MPI+OpenMP/automatic)


15 Acknowledgement
