Algorithms of Scientific Computing II

Technische Universität München, WS 2010/2011
Institut für Informatik
Prof. Dr. Hans-Joachim Bungartz
Alexander Heinecke, M.Sc., M.Sc.w.H.

Exercise 4 - Hardware-aware implementations and Barnes-Hut-Method

1) Today's Systems - Compute Power

Today's supercomputers are highly parallel systems. Please read LRZ's web page and identify the different levels of parallelism in SuperMUC!

Vector register; hardware thread; core; processor; node; island; whole system.

Which levels of parallelism are not documented there, but can be found in the Intel Architecture Manual?

Instruction-level parallelism: a core is able to decode several instructions at a time (e.g. four-way decode). Since it features several execution units, the so-called out-of-order unit is able to run different instructions on different units, out of order and in parallel. Please refer to Fig. 1 and Fig. 2 for examples.

In total we are now able to calculate the peak performance (double precision) of a node in SuperMUC:

  2    out-of-order execution: mul and add in parallel
  4    vectorization (four doubles per AVX register)
  8    cores per package
  2.7  clock speed [GHz]
  2    sockets per node

2 * 4 * 8 * 2.7 * 2 = 345.6 GFLOPS per node, i.e. 3.19 PFLOPS for the whole system.
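The peak-performance arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the constant and function names are made up for illustration, and the node count of 18 * 512 is an assumption taken from the island description in the network part of this sheet.

```python
# Factors of the per-node peak performance, as listed in the text.
FLOP_PER_LANE_PER_CYCLE = 2   # one multiply and one add issued in parallel
VECTOR_LANES = 4              # vectorization factor for double precision
CORES_PER_PACKAGE = 8
CLOCK_GHZ = 2.7
SOCKETS_PER_NODE = 2

# Assumption: 18 islands with 512 nodes each (see the network description).
NODES = 18 * 512


def node_peak_gflops() -> float:
    """Peak double-precision performance of one node in GFLOPS."""
    return (FLOP_PER_LANE_PER_CYCLE * VECTOR_LANES * CORES_PER_PACKAGE
            * CLOCK_GHZ * SOCKETS_PER_NODE)


def system_peak_pflops(nodes: int = NODES) -> float:
    """Peak performance of `nodes` identical nodes in PFLOPS."""
    return node_peak_gflops() * nodes / 1e6


if __name__ == "__main__":
    print(f"node:   {node_peak_gflops():.1f} GFLOPS")
    print(f"system: {system_peak_pflops():.2f} PFLOPS")
```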

Figure 1: Intel Sandy Bridge Architecture; changes in comparison to Intel Nehalem Architecture are highlighted in orange. Based on descriptions from Intel. [Block diagram: front end with branch prediction, instruction cache, decoded uop cache (1.5k uops) and 4-way decode; reorder buffer with 168 entries; scheduler issuing to ports 0-5 (ALU, SSE/AVX FP MUL, ADD, shuffle, load, store address, store data); 256 KB cache (MLC).]

Figure 2: AMD Bulldozer Architecture. Based on descriptions from AMD. [Block diagram: shared instruction fetch, branch prediction and 4-way decode feeding two integer schedulers (ALU and AGU ports) and one floating point scheduler (60 entries) with two 128-bit FMAC units; 2 MB shared cache.]

2) Today's Systems - Memory Subsystem

Please sketch the memory subsystem / network structure of SuperMUC. Is there a different network topology available on another supercomputer?

A single node features a state-of-the-art NUMA architecture as displayed in Fig. 3, which implements shared memory with two different access times: local and remote. A remote access takes roughly 10-20% longer than a local access. Operating systems like Linux employ a so-called first-touch policy: the NUMA node that touches a page first becomes the owner of this page. On each socket we are able to achieve a streaming bandwidth of more than 40 GByte/s.

Figure 3: Intel Sandy Bridge NUMA-Architecture. [Two sockets, each with 8 cores (16 hardware threads), a shared 20 MB L3 cache and 4 channels of DDR3-1600; the sockets are linked via QPI (8.0 GT/s).]

For inter-node communication, SuperMUC features an Infiniband network. Infiniband offers low latencies (1.2 us) at a bandwidth of more than 6 GByte/s. The overall network is organized as a pruned fat tree: 512 nodes form a so-called island with one high-end switch, and 18 islands are connected via a 1-level tree. Other systems, like IBM's BlueGene or Cray's XK6, feature custom interconnects (which are routerless) in order to form a 2D/3D mesh. If a scientific application (e.g. one based on regular grids or matrices) matches this network, only "low" overhead for communication is measured.

What is the peak performance of the system if a memory-bound application is executed?

SuperMUC features four channels of DDR3-1600 RAM per socket. One channel is able to deliver 12.8 GByte/s, which results in a total bandwidth of 102.4 GByte/s per node (2 sockets * 4 channels * 12.8 GByte/s). Let's execute a matrix-vector multiplication: for an N x N matrix we have to calculate N scalar products. For each scalar product we have to read 2N double-precision values, which amounts to 16N bytes, and we execute N FLOP. Therefore we are able to achieve 102.4/16 GFLOPS = 6.4 GFLOPS (1.8 % of peak performance!).

Which hardware feature may help to improve this performance?

Caches!

Figure 4: cache hierarchy

Let's assume we can keep the vector b of our matrix-vector multiplication in the cache, so that we just have to stream the matrix A. This means that for a scalar product we have to read only 8N bytes, which results in a performance of 12.8 GFLOPS. In order to unleash the performance of today's chips, cache awareness is a must.
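The bandwidth-bound estimate above can be written as a small Python sketch. The function and parameter names are hypothetical. The default `flop_per_element=1` follows this sheet's accounting of N FLOP per scalar product; counting the multiply and the add separately (`flop_per_element=2`) would double both results.

```python
BYTES_PER_DOUBLE = 8
CHANNELS_PER_SOCKET = 4
GBYTE_PER_S_PER_CHANNEL = 12.8
SOCKETS_PER_NODE = 2
NODE_PEAK_GFLOPS = 345.6  # 2 * 4 * 8 * 2.7 * 2, from the compute-power part


def node_bandwidth_gbyte_s() -> float:
    """Aggregate memory bandwidth of one node in GByte/s."""
    return CHANNELS_PER_SOCKET * GBYTE_PER_S_PER_CHANNEL * SOCKETS_PER_NODE


def matvec_gflops(vector_in_cache: bool, flop_per_element: int = 1) -> float:
    """Bandwidth-limited GFLOPS of a double-precision matrix-vector product.

    Without caching, every matrix element costs two loads from memory
    (the element of A and the element of b); with b held in cache only
    the element of A has to be streamed.
    """
    values_streamed = 1 if vector_in_cache else 2
    bytes_per_element = values_streamed * BYTES_PER_DOUBLE
    return node_bandwidth_gbyte_s() * flop_per_element / bytes_per_element


if __name__ == "__main__":
    uncached = matvec_gflops(vector_in_cache=False)
    cached = matvec_gflops(vector_in_cache=True)
    print(f"b streamed from memory: {uncached:.1f} GFLOPS "
          f"({100 * uncached / NODE_PEAK_GFLOPS:.2f} % of peak)")
    print(f"b kept in cache:        {cached:.1f} GFLOPS")
```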

3) Vectorization of Linked-Cells particle simulations

Recently, C++ has established itself as one of the standard languages for new development of HPC codes, alongside FORTRAN and C. Please sketch the software layout of a linked-cells C++ implementation.

Classes for: Particle, ParticleContainer, Cell, ForceCalculation.

Which problems do you face when it comes to low-level kernels and data layout?

Interactions take place on a per-particle basis. The ForceCalculation class gets two particle references and computes and updates velocities and forces for each particle. This leads to a so-called Array of Structures (AoS) layout, which is performance-killing. In order to speed up calculations it is useful to flip this layout into a Structure of Arrays (SoA), which leads to a purely cell-based method, see Fig. 5.

Figure 5: AoS to SoA conversion. [AoS: all components of particle 1 (x_1, y_1, z_1, f_x_1, f_y_1, f_z_1, v_x_1, v_y_1, v_z_1, ...) are stored consecutively, followed by those of particle 2; SoA: x_1, x_2, x_3, ... are stored contiguously, then y_1, y_2, y_3, ..., and so on. The figure marks a 64 byte (512 bit) unit.]

Can you use vectorization in order to speed up calculations?

Given SoA storage, we are able to vectorize the LJ-12-6 kernel across particle pairs. Please note that this approach scales with the number of particles per cell; it neither requires zero-padding nor is it limited to a vector length of four, as other methods are. In our case the only limiting factors are the number of particles per cell, the size of the cutoff radius r_cutoff and the cell length r_c, as these factors influence the vector load.
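The layout flip and a pairwise LJ-12-6 kernel over SoA storage can be sketched in Python as follows. This is an illustration under assumptions, not the implementation measured in this sheet: all class and function names are hypothetical, the SIMD register is only emulated by processing partners in chunks of `width` lanes, velocities are omitted, and `eps` and `sigma` are the usual Lennard-Jones parameters with assumed default values of 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Particle:
    """AoS element: all data of one particle stored together."""
    x: float
    y: float
    z: float


@dataclass
class CellSoA:
    """SoA storage for one cell: one contiguous array per component."""
    x: List[float]
    y: List[float]
    z: List[float]
    fx: List[float]
    fy: List[float]
    fz: List[float]


def to_soa(particles: List[Particle]) -> CellSoA:
    """Convert an array of structures into a structure of arrays."""
    n = len(particles)
    return CellSoA(
        x=[p.x for p in particles],
        y=[p.y for p in particles],
        z=[p.z for p in particles],
        fx=[0.0] * n, fy=[0.0] * n, fz=[0.0] * n,
    )


def accumulate_lj_forces(cell: CellSoA, r_cutoff: float, eps: float = 1.0,
                         sigma: float = 1.0, width: int = 2) -> None:
    """Add the LJ-12-6 forces between all particles of one cell.

    Particle a plays the role of the broadcast operand; its partners are
    processed in chunks of `width` lanes, emulating one SIMD register.
    """
    n = len(cell.x)
    rc2 = r_cutoff * r_cutoff
    s2 = sigma * sigma
    for a in range(n):
        for start in range(0, n, width):
            lanes = list(range(start, min(start + width, n)))
            dx = [cell.x[a] - cell.x[j] for j in lanes]
            dy = [cell.y[a] - cell.y[j] for j in lanes]
            dz = [cell.z[a] - cell.z[j] for j in lanes]
            d2 = [dx[k] ** 2 + dy[k] ** 2 + dz[k] ** 2
                  for k in range(len(lanes))]
            # Lane mask: 1.0 inside the cutoff (and no self-interaction).
            mask = [1.0 if (j != a and d2[k] < rc2) else 0.0
                    for k, j in enumerate(lanes)]
            if not any(mask):
                continue  # no pair of this chunk is within the cutoff
            for k in range(len(lanes)):
                # Masked lanes get a dummy distance to avoid dividing by 0.
                safe_d2 = d2[k] if mask[k] else 1.0
                sr2 = s2 / safe_d2
                sr6 = sr2 ** 3
                scale = mask[k] * 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / safe_d2
                cell.fx[a] += scale * dx[k]
                cell.fy[a] += scale * dy[k]
                cell.fz[a] += scale * dz[k]
```

With the SoA arrays, each chunk reads `width` neighbouring entries of `x`, `y` and `z`, which is the access pattern a vector load needs; in the AoS layout the same values would be strided by the size of a whole particle record.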

Figure 6: Vectorization of the LJ-12-6 force calculation kernel. [One register holds the broadcast coordinates of particle a (x_a, y_a, z_a), the other the coordinates of particles 1 and 2. First the distances d_a_1 and d_a_2 are calculated and compared against r_cutoff, then the forces (LJ-12-6) are calculated.]

Figure 6 sketches the applied vectorization of the LJ-12-6 calculations. As we need to deal with double-precision floating point numbers, we can store a maximum of two elements per SSE register. The calculation is performed on particle pairs; therefore we broadcast-load the required data of one particle (a) into the first register (duplicating it across the vector length), and the second register is filled with the data of two different particles (1 and 2). With this approach we can theoretically reduce the number of operations by a factor of two. Since we are calculating two interactions within one step, we need to apply some pre- and post-processing, which can be performed by regular logical operations directly in hardware: the SSE/AVX instruction sets offer compare, logical and sign-bit operations that can be used to decide whether the force calculation should be initiated at all (pre-processing: at least one particle pair distance is smaller than r_cutoff) and whether calculated results need to be zeroed by a mask (post-processing: a particle pair distance is greater than r_cutoff). In the case of SSE this should already gain a factor of two for small particle numbers. The lower performance bound is the scalar performance, as these algorithms will not execute more instructions than necessary in the scalar case.

Additional:

4) Complexity of the Barnes-Hut-Method

In the lecture the costs for the d-dimensional Barnes-Hut-Method were given as $O(\theta^{-d} N \log N)$. First derive the costs for the two-dimensional Barnes-Hut-Method. Then explain descriptively (without proof) why the formula is also correct for d = 3.

The outer loop of the Barnes-Hut algorithm iterates over all particles and calculates the force acting on every particle. As there are N particles in total, the costs for each of those particles have to be $O(\theta^{-d} \log N)$. For the given complexity of $O(\theta^{-d} N \log N)$ we assume that the tree is not degenerate, thus the tree has $O(\log N)$ levels. If we can show that the costs at each level are $O(\theta^{-d})$, the formula is proven valid.

We start out from the two-dimensional case. First we consider an arbitrary level and determine the number of nodes (i.e. particles or pseudo-particles) we have to consider. Those are exactly the nodes whose parent node didn't fulfill the $\theta$-criterion. For the parent node it has to hold:

  $\frac{diam}{|distance|} > \theta$.

Thus it follows for the distance:

  $|distance| < \frac{diam}{\theta}$.   (1)

We will now try to determine an upper bound for the number of parent cells which fulfill that condition. Therefore we consider only one coordinate axis first. Without the absolute value, (1) can be rewritten as

  $-\frac{diam}{\theta} < distance < \frac{diam}{\theta}$.

Thus the range of admissible distances has the length $2\,\frac{diam}{\theta}$. As one node covers a region with the diameter $diam$, there can be only $\frac{2\,diam/\theta}{diam} = \frac{2}{\theta}$ nodes next to each other for which the $\theta$-criterion does not hold. Along the second axis there are just as many. Thus there are

  $\frac{2}{\theta} \cdot \frac{2}{\theta} = \left(\frac{2}{\theta}\right)^2 = \frac{4}{\theta^2}$

nodes in total which don't fulfill the $\theta$-criterion. Those are the parent nodes we have to consider at the current level. That number still has to be multiplied by 4 (the number of children per parent node), which doesn't change the order. Thus the maximum computational costs on each level are $O(\theta^{-2})$. As there are $\log N$ levels in total and N particles, the total computational costs sum up to $O(\theta^{-2} N \log N)$.

For three dimensions it is easy to see that there are no longer $c\,\theta^{-2}$ nodes for which the $\theta$-criterion is not fulfilled, but $c\,\theta^{-3}$, as we are now dealing with a cubic region.
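The per-level bound from this derivation can be evaluated numerically with a short Python sketch. Here `theta` denotes the parameter of the acceptance criterion, the function names are hypothetical, and the number of levels is taken as log N to the base 2^d, which is an assumption for a balanced tree; only the order of the result matters.

```python
import math


def nodes_per_level(theta: float, d: int) -> float:
    """Upper bound on the nodes evaluated per tree level for one particle.

    (2 / theta)**d parent nodes can violate the theta-criterion, and each
    of them has 2**d children (4 in a quadtree, 8 in an octree).
    """
    parents = (2.0 / theta) ** d
    children_per_parent = 2 ** d
    return children_per_parent * parents


def total_cost_bound(n: int, theta: float, d: int) -> float:
    """Bound on the total work: N particles * levels * nodes per level."""
    levels = math.log(n, 2 ** d)  # assumption: balanced tree
    return n * levels * nodes_per_level(theta, d)


if __name__ == "__main__":
    for theta in (1.0, 0.5, 0.25):
        print(f"theta={theta}: 2D {nodes_per_level(theta, 2):.0f} nodes/level, "
              f"3D {nodes_per_level(theta, 3):.0f} nodes/level")
```

Halving `theta` multiplies the bound by 2^d, i.e. by 4 in two and by 8 in three dimensions, which is the theta^(-d) dependence of the formula.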

Figure 7: Runtimes in seconds for a scenario with 100k particles on both clusters, one MPI rank per core. [Two bar charts, y-axis: runtime per timestep [s] (logarithmic). Panel "SGI ICE, 100K particles": 1, 2, 4, 8, 16, 32 and 64 ranks. Panel "MiniMUC, 100K particles": 1, 2, 4, 8, 16, 32, 64, 128 and 256 ranks. Each panel compares AoS, SoA and SoA,SSE for cell-length 1.5 / cutoff 3.0, cell-length 3.0 / cutoff 3.0 and cell-length 5.0 / cutoff 5.0.]