KONWIHR Project: Centre of Excellence for High Performance Computing
Pseudo-Vectorization and RISC Optimization Techniques for the Hitachi SR8000 Architecture
F. Deserno, G. Hager, F. Brechtefeld, G. Wellein (Regionales Rechenzentrum Erlangen)
L. Palm, M. Brehm (LRZ München)
Centre of Excellence for High Performance Computing (cxhpc)
Ensure efficient use of supercomputers by providing top-quality HPC project support for Physics, Chemistry, Engineering, ...
- Architecture-specific optimizations
- Appropriate programming models
- Efficient algorithms and solvers
- Finding the appropriate (super)computer
- Hot-line support for large projects
- HPC training & lectures
- Information / PR
HPC Support Projects
Application areas: Fluid Dynamics, Material Science, Theoretical Physics, Theoretical Chemistry, Computer Sciences, Applied Mathematics, etc.
- Breuer/Durst (h001v, h0011), Brenner/Durst (h001y): simulation of complex flows; finite-volume methods (SIP solver), Lattice-Boltzmann methods
- Hofmann (h008z), Fehske (h0441), Heß (h023z): quantum-mechanical many-body problems; exact diagonalization (sparse/dense), DMRG
- Rüde (h0671)
Goals:
- Support for large-scale HLRB projects
- Find the appropriate (super)computer for each problem/scientist
- Competence & consulting for methods used and developed by local (FAU) scientists
Performance evaluation: Benchmark systems

Platform                                  Peak [GFlop/s]   MemBW [GB/s]           L1 cache [kB]   L2 cache [MB]
Intel P4 1.5 GHz (RD-RAM)                 3.0              3.2                    8               0.256
MIPS R14k 0.5 GHz (O3400)                 1.0              1.6                    32              8
IBM Power4 1.3 GHz (32-way p690 node)     5.2 / 166.0      13 / 110               32 / 1024       0.73 / 23 (+ 512 MB L3)
NEC SX5e                                  4.0              32.0 (LD), 16.0 (ST)   ---             ---
Hitachi SR8000 0.375 GHz (8-way node)     1.5 / 12.0       4.0 / 32.0             128 / 1024      ---
(For the SMP nodes the entries are given per CPU / per node.)

What is different between the Hitachi SR8000 and the others?
- Pseudo-Vector Processing (PVP, CPU level): large register set (160 FP registers), 16 outstanding PREFETCH and 128 outstanding PRELOAD operations, extensive software pipelining by the compiler -> memory latency is hidden -> vector-processor-like performance with RISC technology
- COMPAS (SMP-node level): collective thread operations, high aggregate peak performance & memory bandwidth, 512-way memory interleaving
Performance Evaluation: Vector Triad
Kernel: A(1:N) = B(1:N) + C(1:N)*D(1:N)
[Plots: performance vs. loop length for single processors and for vector processors / HSR8000 node]
Memory efficiency:
- HSR8k, 1 CPU: 92%; Intel P4/RD-RAM: 55%
- HSR8k, 1 node: 75%; NEC SX5e: 75% (max. BW) / 97% (effective BW)
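For reference, a minimal sketch of how such a vector-triad measurement is typically coded; the array length, repetition count and the use of CPU_TIME below are illustrative assumptions, not the benchmark code behind the numbers above.

  PROGRAM triad
    IMPLICIT NONE
    INTEGER, PARAMETER :: N = 4000000, NITER = 50
    DOUBLE PRECISION, ALLOCATABLE :: A(:), B(:), C(:), D(:)
    DOUBLE PRECISION :: t0, t1
    INTEGER :: i, it

    ALLOCATE(A(N), B(N), C(N), D(N))
    B = 1.0D0;  C = 2.0D0;  D = 0.5D0

    CALL CPU_TIME(t0)
    DO it = 1, NITER
       DO i = 1, N
          A(i) = B(i) + C(i)*D(i)               ! 2 flops, 3 loads + 1 store
       ENDDO
       IF (A(N/2) .LT. 0.0D0) PRINT *, A(N/2)   ! keep the loop from being optimized away
    ENDDO
    CALL CPU_TIME(t1)

    ! 2*N*NITER floating-point operations in (t1-t0) seconds
    PRINT *, 'MFlop/s:', 2.0D0*N*NITER/((t1-t0)*1.0D6)
  END PROGRAM triad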
Performance Evaluation: Sparse Matrix-Vector Multiplication
- Sparse MVM is the numerical core of exact diagonalization algorithms (Davidson, Lanczos, etc.) widely used in theoretical physics and theoretical chemistry.
- Several storage formats are available (e.g. JDS, CRS).
- Jagged Diagonals Storage (JDS) format:
  - best performance on Hitachi and vector systems
  - only minor performance drawbacks on RISC systems
  - shared-memory parallelization of the inner loop

  DO j = 1, max_nonz                       ! max. #non-zeros per row (10-100)
     DO i = 1, jd_ptr(j+1)-jd_ptr(j)       ! loop length ~ matrix dimension (10^3 - 10^9)
        Y(i) = Y(i) + VALUE(jd_ptr(j)+i-1) * X(COL_IND(jd_ptr(j)+i-1))
     ENDDO
  ENDDO

- Indirect / non-contiguous memory access
- Performance limited by memory bandwidth & latency!
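The poster states that the long inner loop is parallelized for shared memory; a minimal sketch of such a parallelization with OpenMP is shown below (the directive placement is an illustration, the actual parallel code is not shown on the poster).

  DO j = 1, max_nonz
!$OMP PARALLEL DO
     DO i = 1, jd_ptr(j+1)-jd_ptr(j)
        ! every thread updates a disjoint range of Y(i); the indirect
        ! accesses to X() are read-only, so no race conditions occur
        Y(i) = Y(i) + VALUE(jd_ptr(j)+i-1) * X(COL_IND(jd_ptr(j)+i-1))
     ENDDO
!$OMP END PARALLEL DO
  ENDDO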
Performance Evaluation: Sparse MVM
Pseudo-Vector Processing of the sparse MVM (JDS format):
- PREFETCH the index array col_ind into the cache
- Load the index from cache into a register
- PRELOAD the single data item x(index) into a register
[Diagram: these operations are overlapped across iterations over time]
- The innermost loop is unrolled 48 times by the HSR compiler!
PVP works well for:
- intermediate to long loop lengths (unrolling / pipelining)
- no data dependencies (PREFETCH/PRELOAD)
- small to intermediate loop bodies (register spill!)
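To connect this schedule with the code, the inner JDS loop is repeated below with comments marking which memory accesses the SR8000 compiler turns into PREFETCH and PRELOAD operations; this is a conceptual annotation only, not compiler output, and the temporary idx is introduced here purely for clarity.

  DO i = 1, jd_ptr(j+1)-jd_ptr(j)
     ! PREFETCH: COL_IND(jd_ptr(j)+i-1) for later iterations is brought
     !           into the cache ahead of time (contiguous access)
     idx = COL_IND(jd_ptr(j)+i-1)             ! load index from cache to register
     ! PRELOAD:  X(idx) is fetched directly into a floating-point register,
     !           bypassing the cache (non-contiguous access)
     Y(i) = Y(i) + VALUE(jd_ptr(j)+i-1) * X(idx)
  ENDDO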
Performance Evaluation: Sparse MVM
[Plots: performance for single processors and for vector processors / SMP nodes]
Memory efficiency: HSR8k, 1 CPU: 70%; Intel P4/RD-RAM: 48%
SMP scalability: HSR8k, 1 node: 89% (8 proc.); IBM Power4: 56% (16 proc.), 39% (32 proc.)
Use PVP with care!
Simple kernel from nuclear physics (scattering problems with three-nucleon forces, Prof. H. Hofmann, FAU): FORTRAN, approx. 200 KByte

  DO M = 1, IQM
     DO K = KZHX(M), KZAHL
        F(K) = F(K) * S(MVK(K,M))
     ENDDO
  ENDDO

- S(): short array; approx. 100-200 elements
- IQM: small; typically 9
- KZAHL: much larger than 1000
The HSR compiler generates preload streams for the short array S() -> poor performance.
- *voption nopreload improves performance by a factor of 2.9
- Blocking of the M loop & unrolling of the inner loop: additional 12%

              HSR8k-F1      MIPS R14k     Intel P4 (1.5 GHz)
  Original    26 MFlop/s    97 MFlop/s    195 MFlop/s
  Optimized   90 MFlop/s    149 MFlop/s   257 MFlop/s
  Speed-up    3.46          1.54          1.32
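A sketch of how these optimizations might look in the source is given below. The *voption nopreload directive is quoted from the poster; its placement, and the 4-way unrolling of the K loop with a remainder loop, are illustrative choices only (the blocking of the M loop used on the poster is not shown here).

      DO M = 1, IQM
*voption nopreload
         DO K = KZHX(M), KZAHL - MOD(KZAHL-KZHX(M)+1,4), 4
            F(K)   = F(K)   * S(MVK(K  ,M))
            F(K+1) = F(K+1) * S(MVK(K+1,M))
            F(K+2) = F(K+2) * S(MVK(K+2,M))
            F(K+3) = F(K+3) * S(MVK(K+3,M))
         ENDDO
         DO K = KZAHL - MOD(KZAHL-KZHX(M)+1,4) + 1, KZAHL
            F(K) = F(K) * S(MVK(K,M))
         ENDDO
      ENDDO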
CFD applications: Strongly Implicit Solver
- The linear systems A x = b arising from finite-volume methods can be solved with the Strongly Implicit Procedure (SIP) according to Stone.
- The SIP solver is widely used:
  - LESOCC, FASTEST, FLOWSI (Institute of Fluid Mechanics, Erlangen)
  - STHAMAS3D (Crystal Growth Laboratory, Erlangen)
  - CADiP (Theoretical Thermodynamics and Transport Processes, Bayreuth)
- SIP solver:
  1) incomplete LU factorization
  2) series of forward/backward substitutions
- A toy program is available at ftp.springer.de in /pub/technik/peric (M. Peric)
SIP-solver: Data dependencies & implementations
Basic data dependency: (i,j,k) depends on {(i-1,j,k); (i,j-1,k); (i,j,k-1)}
Implementation variants (see the code sketch below):
- 3-fold nested loop (3D) over (i,j,k): good data locality, but no shared-memory parallelization (Hitachi: pipeline parallel processing)
- Hyperplane (i+j+k = const): non-contiguous memory access; shared-memory parallelization / vectorization of the innermost loop
- Hyperline (i, j+k = const): shared-memory parallelization of the (j+k = const) loop; contiguous memory access in the innermost (i) loop
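A minimal sketch of a forward-substitution-type sweep in the 3D and hyperplane variants, under the data dependency stated above; the array and variable names (lw, ls, lb, lp, rhs, res) and the OpenMP directive are placeholders for illustration, not taken from the poster or from Peric's toy program.

  SUBROUTINE sweep_3d(ni, nj, nk, lw, ls, lb, lp, rhs, res)
    ! 3D variant: contiguous access in i, good data locality, but the
    ! recursion in all three directions prevents simple shared-memory
    ! parallelization.  res on the i=1, j=1, k=1 planes is assumed to
    ! hold boundary values.
    IMPLICIT NONE
    INTEGER, INTENT(IN) :: ni, nj, nk
    DOUBLE PRECISION, DIMENSION(ni,nj,nk), INTENT(IN)    :: lw, ls, lb, lp, rhs
    DOUBLE PRECISION, DIMENSION(ni,nj,nk), INTENT(INOUT) :: res
    INTEGER :: i, j, k
    DO k = 2, nk
       DO j = 2, nj
          DO i = 2, ni
             res(i,j,k) = ( rhs(i,j,k) - lw(i,j,k)*res(i-1,j,k)   &
                            - ls(i,j,k)*res(i,j-1,k)              &
                            - lb(i,j,k)*res(i,j,k-1) ) * lp(i,j,k)
          ENDDO
       ENDDO
    ENDDO
  END SUBROUTINE sweep_3d

  SUBROUTINE sweep_hyperplane(ni, nj, nk, lw, ls, lb, lp, rhs, res)
    ! Hyperplane variant: all points with i+j+k = l depend only on the
    ! previous hyperplane l-1, so they can be distributed over threads
    ! (or vectorized), at the price of non-contiguous memory access.
    IMPLICIT NONE
    INTEGER, INTENT(IN) :: ni, nj, nk
    DOUBLE PRECISION, DIMENSION(ni,nj,nk), INTENT(IN)    :: lw, ls, lb, lp, rhs
    DOUBLE PRECISION, DIMENSION(ni,nj,nk), INTENT(INOUT) :: res
    INTEGER :: i, j, k, l
    DO l = 6, ni+nj+nk                  ! smallest interior hyperplane is 2+2+2
!$OMP PARALLEL DO PRIVATE(j,i)
       DO k = 2, nk
          DO j = 2, nj
             i = l - j - k
             IF (i >= 2 .AND. i <= ni) THEN
                res(i,j,k) = ( rhs(i,j,k) - lw(i,j,k)*res(i-1,j,k)   &
                               - ls(i,j,k)*res(i,j-1,k)              &
                               - lb(i,j,k)*res(i,j,k-1) ) * lp(i,j,k)
             ENDIF
          ENDDO
       ENDDO
!$OMP END PARALLEL DO
    ENDDO
  END SUBROUTINE sweep_hyperplane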
SIP-solver: Implementations & Single Processor Performance
Benchmark: 91^3 lattice (approx. 100 MB), 1 ILU factorization, 500 iterations
- HSR8k-F1: 3D variant with 32-fold unrolling
- IBM Power4: 128 MB of L3 cache accessible for 1 processor
[Bar chart: MFlop/s of the 3D, hyperplane and hyperline variants on HSR8k, MIPS R14k, Intel P4 and IBM Power4]
SIP-solver: Implementations & Shared-memory Scalability
[Plots, fixed problem size 91^3: MFlop/s vs. number of processors (1-16) for the hyperplane and hyperline variants on HSR8k-F1, HSR8k-F1 (3D), IBM Power4 and NEC SX5e]
[Plot, varying problem size 31^3 (4 MB), 91^3 (100 MB), 201^3 (1 GB): MFlop/s of HSR8k-F1 (3D, 8 proc.) vs. IBM Power4 (hyperline, 8 proc.)]
Summary & Outlook
Summary:
- Efficient use of the Hitachi SR8000 requires vector-style codes, a high level of loop unrolling, Pseudo-Vectorization + COMPAS, and exploitation of the large register set.
- The Hitachi SR8000 techniques are forward-looking: large register set / many outstanding memory references; high memory bandwidth for both the single processor and the SMP node.
Outlook:
- Optimization techniques for new architectures: IBM Power4 (shared, large caches), Intel Itanium2/3 (EPIC; large register set)
- Parallel programming techniques for SMP clusters: MPP model (pure MPI), hybrid model (MPI + OpenMP/automatic parallelization)
Acknowledgement