A New Scalable Parallel Method for Molecular Dynamics Based on Cell-Block Data Structure


Cao Xiaolin, Mo Zeyao

High Performance Computing Center, State Key Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics, P. O. Box 8009, Beijing 100088, P. R. China
{xiaolincao,

Abstract. A scalable parallel algorithm is presented for large-scale three-dimensional simulations with severely non-uniform particle distributions. Based on a cell-block data structure, the algorithm uses a Hilbert space-filling curve (HSFC) to convert the three-dimensional domain decomposition for load distribution across processors into a one-dimensional load balancing problem, to which the measurement-based multilevel averaging weights (MAW) method can be applied. In contrast to inverse space-filling partitioning (ISP), MAW redistributes blocks by monitoring the change in total load on each processor. Numerical experiments show that MAW balances load better than ISP for large-scale multi-medium MD simulation in high-temperature and high-pressure physics. Excellent scalability was demonstrated, with a speedup larger than 200 on 240 processors of one MPP. The largest run, with 1.1 × 10^9 particles on 500 processors, took 80 seconds per time step.

1 Introduction

Molecular dynamics (MD) simulation is an important tool for studying the properties of condensed matter and their dynamic interactions, which can be difficult to obtain by other means. To make a reasonable comparison with experiment, it is often necessary to simulate features on a micron scale. Realistic MD simulations of this size require at least 10^8 to 10^9 particles, preferably more. A non-uniform distribution of various kinds of particles in space produces a highly irregular computational load. Moreover, as the simulation evolves, the movement of particles changes the load distribution across the processors used. These factors make it difficult to achieve high scalability, so a good load balancing scheme is necessary.

Research supported by Chinese NSF( ), Chinese 863 program(2002aa04570) and CAEP Funds.

Mo [1] presented a robust iterative 1-D dynamic load balancing (DLB) algorithm, MAW, suitable for 2-D parallel link-cell MD simulation. Hayashi [2] generalized the cellular automaton diffusion scheme to 3-D simulation by introducing the concept of permanent cells to minimize inter-processor communication overhead. Pilkington [3] compared ORB, ORB-MM and Inverse Space-filling Partitioning (ISP) and showed that, of the three strategies, only ISP renders highly balanced workloads without elaborate bookkeeping on a uniform mesh for N-body problems. NAMD [4] relies on a measurement-based DLB scheme to achieve high scalability for biomolecular systems. However, these DLB schemes are not suitable for our application.

Building on the research described above, we present a measurement-based multilevel averaging weight DLB scheme based on the Hilbert space-filling curve (HSFC) [5] for our large-scale multi-medium MD simulation in high-temperature and high-pressure physics, in which the computational load is unpredictable and non-uniform in both position and time. A new cell-block data structure describing the MD simulation was then constructed to help the DLB scheme move data. After data are moved between processors, the data structures must be rebuilt and the inter-processor communication patterns updated. The DLB scheme and the cell-block data structure, along with some auxiliary functions, were integrated into a new parallel MD algorithm aimed at using large parallel machines in a scalable manner.

The new parallel MD algorithm is described in Section 2. Performance results are discussed in Section 3. In contrast to ISP, our DLB scheme obtains better load balance by monitoring the change in total load on each processor instead of the change in workload of each block. Two numerical experiments show that the new parallel MD algorithm achieves high scalability for large-scale multi-medium MD simulation in high-temperature and high-pressure physics. Finally, we give some conclusions in Section 4.

2 Parallelization strategy

Let P be the number of processors. The traditional link-cell domain decomposition method (DDM) partitions space into P sub-spaces, one per processor. It is highly scalable when particle densities are uniform, but it has some disadvantages: (1) it is hard to use if the number of processors cannot be factored into three roughly equal factors; (2) a non-uniform distribution of particles can result in load imbalance; (3) its data structure is not suitable for most DLB methods. To solve these problems, our method first partitions space into Q (Q >> P) fixed-size blocks and creates the cell-block data structure. Second, a Hilbert space-filling curve (HSFC) imposes a linear order (the HSFC index) on the blocks in the high-dimensional space, which is the foundation of our DDM and DLB scheme. Third, a cell-block DDM based on the HSFC is constructed, mapping multiple blocks to each processor. Finally, a measurement-based multilevel averaging weight DLB scheme based on the HSFC improves load balance by redistributing blocks.
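The first three steps above can be illustrated with a small 2-D sketch in Python. It is not the authors' code: the Hilbert index uses the well-known bitwise construction rather than any particular library routine, the 8 × 8 grid with a dense lower-left quadrant is made-up data, and the names `hilbert_index` and `recursive_bisection` are hypothetical.

```python
def hilbert_index(n, x, y):
    """Position of block (x, y) along a Hilbert curve on an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def recursive_bisection(loads, nparts):
    """Cut a 1-D load sequence into nparts contiguous (start, end) ranges, end exclusive."""
    def split(lo, hi, parts):
        if parts == 1:
            return [(lo, hi)]
        left_parts = parts // 2
        target = sum(loads[lo:hi]) * left_parts / parts
        best_cut, best_err, acc = None, None, 0.0
        for cut in range(lo + 1, hi):
            acc += loads[cut - 1]
            # Keep at least one block for every processor on each side.
            if cut - lo < left_parts or hi - cut < parts - left_parts:
                continue
            err = abs(acc - target)
            if best_err is None or err < best_err:
                best_cut, best_err = cut, err
        return split(lo, best_cut, left_parts) + split(best_cut, hi, parts - left_parts)
    return split(0, len(loads), nparts)


# Illustrative (assumed) data: Q = 64 blocks, P = 4 processors,
# with the lower-left quadrant four times denser than the rest.
N_SIDE = 8
P = 4
load = {(x, y): 4 if (x < 4 and y < 4) else 1
        for x in range(N_SIDE) for y in range(N_SIDE)}

# Step 2: order the blocks along the curve. Step 3: cut the 1-D sequence.
blocks = sorted(load, key=lambda b: hilbert_index(N_SIDE, b[0], b[1]))
loads_1d = [load[b] for b in blocks]
segments = recursive_bisection(loads_1d, P)
segment_loads = [sum(loads_1d[a:b]) for a, b in segments]

for p, ((a, b), w) in enumerate(zip(segments, segment_loads)):
    print(f"processor {p}: HSFC positions {a}..{b - 1}, load {w}")
```

Because consecutive positions on the curve are always neighbouring blocks, each contiguous range corresponds to a connected, though irregular, region of space.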

2.1 Cell-Block data structure

Our method first partitions physical space into Q blocks, and each block is then subdivided into smaller volumes named cells with side length R_L = r_c + δ, where r_c is the cut-off radius and δ is a small positive number. In Fig. 1, each block includes 4 × 4 computing cells (white). A layer of empty cells (shaded) is padded around each block. We call these extra cells auxiliary cells; they temporarily store the indices of particles moving to neighbor blocks. Because the link-cell data structure can result in irregular memory access, we adopt compact memory management in the cell-block data structure. Particles in the same cell are always stored sequentially in a list. We therefore use a cell head pointer giving the initial memory address of the particles in the cell and an integer giving the number of particles in the cell.

Fig. 1. Cell-block structure, including compute cells (white) and auxiliary cells (shaded); the number in each cell is the cell index.

2.2 Cell-block DDM based on HSFC

Given a non-uniform model as shown in Fig. 2(a), the simulation space is divided into 8 × 8 blocks, so the total number of blocks Q is 64. The load of each block is first estimated by counting the particles in the block. We then adopt the Zoltan method [6] to generate the HSFC, which is valid for a space of any shape. As shown in Fig. 2(b), the HSFC visits every block of the 2-D space, and we number all blocks from 1 to 64 along the HSFC. The mapping of the hyperspace to the line is done only once and is therefore a pre-processing step. Moreover, we construct a fast transform table to manage the mapping information. Finally, we apply 1-D recursive bisection to cut the HSFC into 4
logically contiguous segments containing almost equal loads, which correspond to physically irregular partitions, as shown in Fig. 2(b). The loads of the 4 sub-domains are 511 (dotted: 1-18), 508 (up diagonal lines: 19-33), 525 (white: 34-49) and 543 (down diagonal lines: 50-64), respectively. Each segment is a collection of blocks that can be assigned to the corresponding processor. We can thus implicitly partition the hyperspace by simply and effectively partitioning a 1-D line, which transforms the hyperspace load balance problem into a 1-D load balance problem. This achieves better load balance but produces irregular hyperspace partitions, which require unstructured communication to be managed. Bookkeeping information is therefore required for each block. In real large-scale simulations, the number of blocks is generally far smaller than the number of cells, which reduces the overhead of managing communication and the memory for bookkeeping information to O(Q). Moreover, the cost of 1-D recursive bisection partitioning is O(Q). We therefore use the block instead of the cell as the allocable and mobile unit in order to reduce these overheads.

Fig. 2. HSFC-based domain decomposition. (a) Load of the 8 × 8 blocks; the number in each block is the load of that block. (b) Partitioning results; the number in each block is the HSFC index of that block.

2.3 Measurement-based MAW DLB based on HSFC

In our MD application, the movement of particles causes load fluctuation, and this becomes more and more critical as time evolves. It may cause severe load imbalance, so it is often necessary to rebalance the loads very quickly. Reference [3] adopts the ISP method to solve this imbalance in N-body problems. ISP exhibits several disadvantages in our large-scale multi-medium MD simulation in high-temperature and high-pressure physics. The main difficulty is determining the computational load of a single block. ISP evaluates the load of each block by a simple time-dependent function, such as the number of particles in the block and possibly the number of particles in neighboring blocks. This is not suitable for our MD simulation because the load of each block depends on the number of particles, the geometric distribution of particles and the type of particles. Moreover, processor speed and cache effects also affect this load. A new DLB scheme therefore needs to be designed for this MD simulation.

Once the hyperspace has been mapped to the line by the HSFC, 1-D MAW [7] can be used to solve the load balance problem arising from our MD application. We therefore present a measurement-based MAW DLB scheme based on the HSFC and the cell-block data structure. It redistributes blocks, sorted by HSFC index, by monitoring the change in total load on a single processor instead of the change in workload of a single block.

The simulation performs DLB with the following procedure. First, the simulation runs for a small number of steps, typically lasting a few minutes, during which the actual computing time of each local processor is measured. After a particular simulation step, the main processor collects the load data from each processor and computes a new block distribution by calling the MAW method. Since the partitioning time is linear in the number of blocks, that is O(Q), partitioning overhead on a single processor remains modest for large problems so long as Q is scaled accordingly. A table describing the complete partitioning, with O(P) storage, may be broadcast to all processors, where P is the number of processors. Each processor can maintain the change in its own load. Then, assuming that the particles do not move quickly, only the blocks near the boundaries of the contiguous HSFC segments need to be considered for exchange with the neighbor processor to balance the load. If the processors are ordered along one dimension, a processor typically communicates with only its two neighbors on the line. Adjusting load by migrating blocks is therefore inexpensive in communication. After the migrating blocks are moved between processors, the data structures must be rebuilt and the inter-processor communication patterns updated.
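The measure, repartition and migrate cycle described above can be sketched as follows. This is a deliberately simplified, single-level stand-in and not the MAW method of [7]: it assumes that each processor's measured time is spread evenly over the blocks it currently owns, and then moves the cut points along the HSFC so that the estimated time per processor is as equal as possible. The function names and the example timings are hypothetical.

```python
def rebalance(cuts, measured_time):
    """Return new cut points along the HSFC.

    cuts          : P + 1 boundaries; processor p owns blocks cuts[p] .. cuts[p + 1] - 1
    measured_time : measured computing time of each processor over the last interval
    """
    nproc = len(measured_time)
    nblocks = cuts[-1]

    # Assumption: a processor's time is shared equally by the blocks it owns.
    weight = []
    for p in range(nproc):
        owned = cuts[p + 1] - cuts[p]
        weight += [measured_time[p] / owned] * owned

    # Prefix sums of the estimated block weights along the curve.
    prefix = [0.0]
    for w in weight:
        prefix.append(prefix[-1] + w)
    total = prefix[-1]

    new_cuts = [0]
    for p in range(1, nproc):
        target = total * p / nproc
        lo = new_cuts[-1] + 1          # at least one block for processor p - 1
        hi = nblocks - (nproc - p)     # at least one block for each later processor
        new_cuts.append(min(range(lo, hi + 1), key=lambda c: abs(prefix[c] - target)))
    new_cuts.append(nblocks)
    return new_cuts


def owners(cuts):
    """Expand cut points into a per-block owner list."""
    result = []
    for p in range(len(cuts) - 1):
        result += [p] * (cuts[p + 1] - cuts[p])
    return result


# Illustrative (assumed) state: 12 blocks on 3 processors, processor 0 is the slowest.
old_cuts = [0, 4, 8, 12]
times = [6.0, 4.0, 2.0]

new_cuts = rebalance(old_cuts, times)
moves = [(block, src, dst)
         for block, (src, dst) in enumerate(zip(owners(old_cuts), owners(new_cuts)))
         if src != dst]

print("new cut points:", new_cuts)
for block, src, dst in moves:
    print(f"block {block}: processor {src} -> processor {dst}")
```

In this small case only blocks next to a segment boundary change owner, and each one moves to the adjacent processor on the line, which is the behaviour the text relies on to keep migration cheap.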
2.4 Force calculation schemes

The calculation of the force on each particle is the most expensive step in our MD simulation, so it must be done both efficiently and in a manner that can be readily parallelized and load-balanced. We enable parallelization within our MD algorithm by dividing the force computation into two classes of compute function: self-block and pair-block. The self-block function calculates pair interactions between particles within a particular block. The pair-block function calculates pair interactions between pairs of particles residing in neighboring blocks. If a neighbor block lies on another processor, the system triggers the pair-block function once the data of that neighbor block have been received. To manage these calculations, we create two index tables that exploit Newton's third law. One lists the cell pairs within a single block; the other lists the cell pairs across neighbor blocks, sorted by the neighbor relationship between blocks. These tables, together with the cell head pointer and the number of particles in each cell, improve execution efficiency by avoiding some jump instructions, which makes the computation much better suited to instruction-level parallelism on advanced computer architectures.
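A minimal 2-D sketch of the self-block function is given below. It assumes a single block of 4 × 4 compute cells without auxiliary cells, a placeholder linear repulsive pair force (the paper does not specify the potential), and hypothetical names throughout. Particles are sorted so that each cell's particles are contiguous, `head` and `count` play the role of the cell head pointer and particle count, and the pair table lists every neighbouring cell pair only once so that Newton's third law halves the work.

```python
import math
import random

RC = 1.0            # cut-off radius r_c (assumed value)
DELTA = 0.05        # small positive margin delta (assumed value)
CELL = RC + DELTA   # cell side length R_L = r_c + delta
NC = 4              # the block holds NC x NC compute cells


def cell_of(p):
    """Index of the cell containing position p = (x, y)."""
    cx = min(int(p[0] / CELL), NC - 1)
    cy = min(int(p[1] / CELL), NC - 1)
    return cy * NC + cx


def build_cells(pos):
    """Sort particles by cell; return sorted positions, head[] and count[]."""
    spos = sorted(pos, key=cell_of)
    count = [0] * (NC * NC)
    for p in spos:
        count[cell_of(p)] += 1
    head = [0] * (NC * NC)
    for c in range(1, NC * NC):
        head[c] = head[c - 1] + count[c - 1]
    return spos, head, count


def build_pair_table():
    """List each pair of neighbouring cells once (half stencil)."""
    half_stencil = [(1, 0), (-1, 1), (0, 1), (1, 1)]
    pairs = []
    for cy in range(NC):
        for cx in range(NC):
            for dx, dy in half_stencil:
                nx, ny = cx + dx, cy + dy
                if 0 <= nx < NC and 0 <= ny < NC:
                    pairs.append((cy * NC + cx, ny * NC + nx))
    return pairs


def pair_force(dx, dy):
    """Placeholder force on particle i from j, with (dx, dy) = r_i - r_j."""
    r = math.hypot(dx, dy)
    if r >= RC or r == 0.0:
        return 0.0, 0.0
    s = (RC - r) / r
    return s * dx, s * dy


def self_block_forces(spos, head, count, pairs):
    force = [[0.0, 0.0] for _ in spos]

    def interact(i, j):
        fx, fy = pair_force(spos[i][0] - spos[j][0], spos[i][1] - spos[j][1])
        force[i][0] += fx
        force[i][1] += fy
        force[j][0] -= fx   # Newton's third law: equal and opposite
        force[j][1] -= fy

    # Pairs inside one cell.
    for c in range(NC * NC):
        end = head[c] + count[c]
        for i in range(head[c], end):
            for j in range(i + 1, end):
                interact(i, j)
    # Pairs across neighbouring cells, driven by the index table.
    for c1, c2 in pairs:
        for i in range(head[c1], head[c1] + count[c1]):
            for j in range(head[c2], head[c2] + count[c2]):
                interact(i, j)
    return force


random.seed(0)
side = NC * CELL
positions = [(random.uniform(0, side), random.uniform(0, side)) for _ in range(200)]
spos, head, count = build_cells(positions)
forces = self_block_forces(spos, head, count, build_pair_table())
```

Because the cell side is larger than the cut-off radius, every interacting pair lies in the same cell or in adjacent cells, so the two loops cover all pairs. A pair-block function would follow the same pattern with a second table of cell pairs that straddle a block boundary.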

3 Parallel performance and scalability

We have implemented an MD code based on the algorithms described above on a distributed-memory MPP with MPI. It is suitable for our large-scale multi-medium MD simulation in high-temperature and high-pressure physics. For simplicity, we call this code PMD2D/3D. In this section, we examine the parallel efficiency and scalability of our algorithm. All runs were performed on an MPP with hundreds of processors. All units are given in dimensionless form. We define the following variables: N = number of particles; Q = number of blocks; P = number of processors used; PE = parallel efficiency; LBE = load balancing efficiency.

3.1 Parallel performance

The first model is: N = 1,560,000, Q = 2000, P = 1~64. Each block includes cells in x, y, z direction. The MD simulation lasted 1,000 time steps. We adopt three parallel strategies: (1) regular geometric partitioning (RGP); (2) ISP; (3) our method. RGP is often used by classical link-cell MD and does not adjust load balance, so the PE of RGP decreases quickly as P increases. ISP and our DLB method are much better than RGP at rendering balanced loads because they adjust the load distribution in time. Our DLB method is superior to ISP, improving LBE by 10%. The main reason is that the load of a single block changes quickly and is very difficult to evaluate accurately. By comparison, our DLB relies on actual measurement of the time spent by each processor to achieve a much more efficient load distribution, as shown in Fig. 3(b). Its PE therefore decreases slowly as P increases; at P = 64, PE is about 60%. Part of the efficiency loss is inevitable and due to communication overhead, because the ratio of communication to computation grows as P increases. The rest of the loss is idle time and partitioning time. We believe that improvements to our DLB method will allow us to decrease the idle time further.

Fig. 3. Parallel efficiency PE versus P (a) and load balancing efficiency LBE versus time step at P = 64 (b) of PMD2D/3D for N = 1,560,000, using RGP (line 1), ISP (line 2) and our method (line 3).

3.2 Scalability

To make a reasonable comparison with experimental data, it is often necessary to simulate features of a micron-scale system with at least hundreds of millions of particles, preferably more. We therefore increase the system size and the number of processors simultaneously, such that N/P = const. Table 1 gives the corresponding parallel efficiency of the 2-D simulation with N/P = 1,600,000 and of the 3-D simulation with N/P = 1,100,000, where t_step is the time per integration step in seconds. PMD2D/3D performs very well for all numbers of processors from 2 to 240. Both the 2-D and the 3-D simulation scale well for sufficiently large problems, with a speedup of over 200 on 240 processors. On 240 processors, it takes about 10 seconds per step to simulate 380,000,000 particles in 2-D and about 37 seconds per step to simulate 276,000,000 particles in 3-D. These results show that our algorithm is very effective in modeling hundreds of millions of particles in both 2-D and 3-D.

Table 1. Parallel efficiency with N/P = constant
P    2-D t_step (s)    2-D E (%)    3-D t_step (s)    3-D E (%)

3.3 Comparable performances

In the last few years, many research groups [8-10] have reported their records for MD simulation. For comparison, we list these results together with our current record in Table 2. However, we have not yet run experiments comparing the performance of our MD code with other programs for identical molecules, identical potential parameters and identical machines. Table 2 shows that our MD code can simulate a number of particles of the same order of magnitude, in a time of the same order of magnitude, as the reported world records.

Table 2. Comparable performance in 3-D MD simulation
Groups         Machine   P     N               t_step (s)
Lomdahl [8]    CM              ,000,
Plimpton [9]   Paragon         ,000,
Stadler [9]    T3E       512   1,213,857,
Roth [10]      T3E       512   5,180,116,000
Our team       One MPP   500   1,100,000,000   80

4 Conclusion

We have described the design of a new scalable parallel MD algorithm for large-scale multi-medium MD simulation in high-temperature and high-pressure physics. It uses a cell-block DDM based on the HSFC to attain practical scalability, and a measurement-based MAW DLB scheme based on the HSFC to attain high parallel efficiency even when simulating non-uniform MD systems. Our DLB scheme is superior to ISP in our real application. Excellent scalability was obtained, with a speedup above 200 on 240 processors of one MPP in both 2-D and 3-D. Although the parallel performance of our algorithm is quite good, it still leaves room for improvement. We believe that improvements to our DLB scheme, combined with overlapping computation and communication, will allow us to decrease the idle time further. Moreover, parallel I/O and corresponding parallel visualization will be developed to help physicists analyze results and gain deeper insight into the motion of particles.

References

Mo Zeyao, Zhang Jinglin: Dynamic Load Balancing for Short-Range Parallel Molecular Dynamics Simulations. Intern. J. Computer Math. 79 (2002)
Hayashi R., Horiguchi S.: Efficiency of Dynamic Load Balancing Based on Permanent Cells for Parallel Molecular Dynamics Simulation. Proc. of IPDPS, Cancun, Mexico (2000)
Pilkington R., Baden B.: Dynamic Partitioning of Non-Uniform Structured Workloads with Spacefilling Curves. IEEE Trans. on Parallel and Distributed Systems. 7 (1996)
Kale L., Skeel R., Bhandarkar M.: NAMD2: Greater Scalability for Parallel Molecular Dynamics. J. Computational Physics, 151 (1999)
Sagan H.: Space-Filling Curves. Springer, New York (1994)
Mo Zeyao, Zhang Baolin: Multilayer Averaged Weight Method for Dynamic Load Imbalance Problems. Intern. J. Computer Math. 76 (2001)
Plimpton S.: Fast Parallel Algorithms for Short-range Molecular Dynamics. J. of Computational Physics, 117 (1995)
Stadler J., Mikulla R., Trebin H. R.: IMD: A Software Package for Molecular Dynamics Studies on Parallel Computers. Intern. J. Modern Physics. 8 (1997)
Roth J., Gahler F., Trebin H.: A Molecular Dynamics Run with 5,180,116,000 Particles. Int. J. Modern Physics C. 11 (2000)