Implementation of NAMD molecular dynamics nonbonded force-field on the Cell Broadband Engine processor

Size: px

Start display at page:

Download "Implementation of NAMD molecular dynamics nonbonded force-field on the Cell Broadband Engine processor"

Junior George
7 years ago
Views:

Guochun Shi Volodymyr Kindratenko National Center for

1 Implementation of NAMD molecular dynamics nbonded force-field on the Cell Broadband Engine processor Guochun Shi Volodymyr Kindratenko National Center for Supercomputing Applications University of Illiis at Urbana-Champaign

2 Presentation outline Introduction Cell Broadband Engine NA Molecule Dynamics Simulation (NAMD) Implementation Task library and task dispatch system SIMD and code optimizations for SPU Performance Static analysis Performance on hardware Conclusions 2

Cell Broadband Engine One Power Processor Element (PPE) and eight Synergistic Processing Elements (SPE), each SPE has 256 KB local storage 3.2 GHz processor 25.

3 Cell Broadband Engine One Power Processor Element (PPE) and eight Synergistic Processing Elements (SPE), each SPE has 256 KB local storage 3.2 GHz processor 25.6 GB/s processor-tomemory bandwidth > 200 GB/s EIB sustained aggregate bandwidth Theoretical peak performance of GFLOPS (SP) and GFLOPS (DP) Memory Interface Controller (MIC) Element Interconnect Bus (EIB) PowerPC Processor Storage Subsystem (PPSS) L1 instruction cache L2 cache L1 data cache PowerPC Processor Unit (PPU) PowerPC Processor Element (PPE) I/O Interface (IOIF0) Memory Flow Controller (MFC) DMA Controller Local Store (LS) Synergistic Processor Unit (SPU) Synergistic Processor Element (SPE) 0 I/O Interface (IOIF1) SPE 7 3

dimensions > cutoff radius for n-bonded interaction Each patch only

4 NAMD Compute forces and update positions repeatedly The simulation space is divided into rectangular regions called patches Patch dimensions > cutoff radius for n-bonded interaction Each patch only needs to be checked against nearby patches Self-compute and paircompute 4

5 NAMD kernel NAMD SPEC 2006 CPU benchmark kernel 1: for each atom i in patch p k 2: for each atom j in patch p l 3: if atoms i and j are bonded, compute bonded forces 4: otherwise, if atoms i and j are within the cutoff distance, add atom j to the i s atom pair list 5: end 6: for each atom k in the i s atom pair list 7: compute n-bonded forces (L-J potential and PME direct sum, both vial lookup tables) 8: end 9: end We implemented a simplified version of the kernel that excludes pairlists and bonded forces 1: for each atom i in patch p k 2: for each atom j in patch p l 3: if atoms i and j are within the cutoff distance 4: compute n-bonded forces (L-J potential and PME direct sum, both via lookup tables) 5: end 6: end 5

6 Implementation: task library and dispatch system Compute task struct typedef struct task_s { int cmd; // operand int size; // the size of task structure } task_t; API for PPE and SPE int ppu_task_init(int argc, char **argv,spe_program_handle_t ); // initialization int ppu_task_run(volatile task_t * task); // start a task in all SPEs int ppu_task_spu_run(volatile task_t* task, int spe); // start a task in one SPE int ppu_task_spu_wait(void); // wait for any SPE to finish, blocking call void ppu_task_spu_waitall(void); // wait for all SPEs to finish, blocking all typedef struct compute_task_s { task_t common; <user_type1> <user_var_name1> <user_type2> <user_var_name2> } compute_task_t; int spu_task_init(unsigned long long); int spu_task_register(dotask_t,int); // register a task int spu_task_run(void); // start the infinite loop, wait for tasks Task dispatch system start pool of idle SPEs find idle SPE pool of patch pairs to be processed find a dependency-free patch pair remove patch pair from the pool run pair-compute task on the idle SPE task pool is empty wait until all SPU tasks exit ppu_task_spe_wait() stop 6

7 Implementation: SPU SIMD: each component is kept in a separate vector Data movement dominates the time Buffer size carefully chosen to fit into the local store Data in memory Shuffle operation s vectors Data movement for distance computation (SP case) x0 y0 z0 c0 id0 x1 y1 z1 c0 i0 x2 y2 z2 c0 i0 x3 y3 z3 c0 i0 x0 x1 x2 x3 y0 y1 y2 y3 z0 z1 z2 z3 c0 c1 c2 c3 Local store usage Computation Code (KB) L-J table (KB) Table _four (KB) Atom_ buffer (KB) Force buffer (KB) Stack others dx0 dx1 dx2 dx3 dy0 dy1 dy2 dy3 dz0 dz1 dz2 dz3 SP DP distance vector r0 r1 r2 r3 7

8 Implementation: optimizations task start Different vectorization schemes are applied in order to get best performance compute distances for 4 pairs of atoms DMA in atoms and forces for the patch Yes Selfcompute? No compute distances for 4 pairs of atoms Self-compute: do redundant computations and fill zeros to unused slots Pair-compute: save eugh pairs of atoms, then do calculations any of them within cutoff? compute 4 n-bonded forces nullify forces for those outside of cutoff distance all pairs processed? Save pairs within the cutoff distance Number of save pairs >= 4? compute 4 n-bonded forces all pairs processed? DMA forces back to main memory task exit 8

9 number of clock cycles number of clock cycles Performance: static analysis 4E E+10 3E E+10 scalar force computation scalar distance computation SIMD force computation merge SIMD distance computation 4E E+10 3E E+10 data manupulation cycles compute cycles 2E+10 2E E E+10 1E+10 1E+10 5E+09 5E single-precision self-compute single-precision pair-compute double-precision self-compute double-precision pair-compute single-precision self-compute single-precision pair-compute double-precision self-compute double-precision pair-compute Distance computation code takes most of the time Data manipulation time is significant 9

execution time (seconds) speedup execution time (seconds) execution time (seconds) speedup Performance 40 35 30 25 20 15 10 5 0 28.6 Cell/B.E. PPE (3.

10 execution time (seconds) speedup execution time (seconds) execution time (seconds) speedup Performance Cell/B.E. PPE (3.2GHz) NAMD kernel performance on different architectures Intel Itanium 2 (900MGz) IBM Power4 (1.2GHz) Intel Xeon (3.0GHz) single-precision kernel double-precision kernel AMD Opteron (2.2GHz) Cell/B.E. 8 SPEs (3.2GHz) 13.4x speedup for SP and 11.6x speedup for DP compared to a 3.0 GHz Intel Xeon processor SP performance is < 2x better than DP Scaling and speedup of the forcefield kernel as compared to a 3.0 GHz Intel Xeon processor Intel Xeon (3.0GHz) Intel Xeon (3.0GHz) speedup execution time SPE 2 SPEs 4 SPEs 8 SPEs 16 SPEs speedup execution time SPE 2 SPEs 4 SPEs 8 SPEs 16 SPEs

11 Conclusions Linear speedup when using multiple synergistic processing units Performance of the double-precision floating-point kernel differs by less than a factor of two from the performance of the single-precision floating-point kernel Even though the peak performance of the Cell s single-precision floating point SIMD engine is 14 times the peak performance of the double-precision floating-point SIMD engine The biggest challenge in using the Cell/B.E. processor in scientific computing applications, such as NAMD, is the software development complexity due to the underlying hardware architecture 11

12 Ackwledgment NSF grant SCI IBM Linux Techlogy Center Dr. James Phillips from the Theoretical and Computational Biophysics Group Access to Cell blades was provided by Prof. Paul Woodward from the University of Minnesota Prof. Dave Bader from the Georgia Institute of Techlogy Trish Barker, Jeremy Es, and Dr. John Larson from NCSA 12

Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine

Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine Ashwin Aji, Wu Feng, Filip Blagojevic and Dimitris Nikolopoulos Forecast Efficient mapping of wavefront algorithms