Fast Multipole Method for particle interactions: an open source parallel library component

F. A. Cruz (1), M. G. Knepley (2), and L. A. Barba (1)

(1) Department of Mathematics, University of Bristol, University Walk, Bristol BS8 1TW, United Kingdom
(2) Mathematics and Computer Science Division, Argonne National Laboratory, 9700 S. Cass Avenue, Argonne, IL 60439, U.S.A.

Abstract. The fast multipole method is used in many scientific computing applications such as astrophysics, fluid dynamics, electrostatics and others. It is capable of greatly accelerating calculations involving pair-wise interactions, but one impediment to more widespread use is the algorithmic complexity and programming effort required to implement this method. We are developing an open source, parallel implementation of the fast multipole method, to be made available as a component of the PETSc library. In this process, we also contribute to the understanding of how the accuracy of the multipole approximation depends on the parameter choices available to the user. Moreover, the proposed parallelization strategy provides optimizations for automatic data decomposition and load balancing.

1 INTRODUCTION

The advantages of the fast multipole method (FMM) for accelerating pair-wise interactions or N-body problems are well known. In theory, one can reduce an O(N^2) calculation to O(N), which has a huge impact in simulations using particle methods. Considering this impact, it is perhaps surprising that the adoption of the FMM algorithm has not been more widespread. There are two main reasons for its seemingly slow adoption. First, the favorable scaling of the FMM is only realized in simulations involving very large numbers of particles, say, larger than 10^3-10^4, so only those researchers interested in solving large problems will see an advantage with the method.
[In: D. Tromeur-Dervout et al. (eds.), Parallel Computational Fluid Dynamics 2008, Lecture Notes in Computational Science and Engineering 74, DOI: 10.1007/978-3-642-14438-7_30, (c) Springer-Verlag Berlin Heidelberg 2010]

More important, perhaps, is the fact that the FMM requires considerable extra programming effort when compared with other algorithms, such as particle-mesh methods or treecodes providing O(N log N) complexity. One could argue that a similar concern arises with most advanced algorithms. For example, when faced with a problem resulting in a large system of algebraic equations to be solved, most scientists would be hard pressed to program a modern iterative solution method, such as the generalized minimum residual (GMRES) method. Their choice, rather than programming from scratch, will most likely be direct Gaussian elimination or, if attempting an iterative method,
the simplest to implement, but slow to converge, Jacobi method. Fortunately, there is no need to make this choice, as we nowadays have available a wealth of libraries for solving linear systems with a variety of advanced methods. What is more, parallel implementations of many of these libraries are available for solving large problems on distributed computational resources. One of these tools is the PETSc library for large-scale scientific computing [1]. This library has been under development for more than 15 years and offers distributed arrays, parallel vector and matrix operations, a complete suite of solvers, and much more. We propose that a parallel implementation of the FMM, provided as a library component in PETSc, is a welcome contribution for the diverse scientific applications that will benefit from the acceleration of this algorithm. In this paper, we present an ongoing project that is developing such a library component, offering an open source, parallel FMM implementation, which furthermore will be supported and maintained via the PETSc project. In the development of this software component, we have also investigated the features of the FMM approximation, to offer a deeper understanding of how the accuracy depends on the parameters. Moreover, the parallelization strategy involves an optimization approach for data decomposition among processors and load balancing, which should make this a very useful library for computational scientists. This paper is organized as follows: in section 2 we present an overview of the original FMM [2], in section 3 we present our proposed parallelization strategy, and in section 4 scaling results of the parallel implementation are presented.
2 CHARACTERIZATION OF THE MULTIPOLE APPROXIMATION

2.1 Overview of the algorithm

The FMM is a fast summation method that accelerates the multiple evaluation of functions of the form:

f(y_j) = Σ_{i=1}^{N} c_i K(y_j, x_i)    (1)

where the function f(.) is evaluated at a set of locations {y_j}. In a single evaluation of (1), the sum over a set of sources {(c_i, x_i)} is carried out, where each source is characterized by its weight c_i and position x_i. The relation between the evaluation and source points is given by the kernel K; in general, the kernel function is required to decay monotonically. In order to accelerate the computations, the FMM uses the idea that the influence of a cluster of sources can be approximated by an agglomerated quantity, when such influence is evaluated far enough away from the cluster itself. The method works by dividing the computational domain into a near-domain and a far-domain:

Near domain: contains all the particles that are near the evaluation point, and is usually a minor fraction of all the N particles. The influence of the near-domain is
computed by directly evaluating the pair-wise particle interactions with (1). The computational cost of directly evaluating the near domain is not dominant, as the near-domain remains small.

Far domain: contains all the particles that are far away from the evaluation point, and ideally contains most of the N particles of the domain. The evaluation of the far domain is sped up by evaluating the approximated influence of clusters of particles, rather than computing the interaction with every particle of the system.

The approximation of the influence of a cluster is represented as Multipole Expansions (MEs) and as Local Expansions (LEs); these two different representations of the cluster are the key ideas behind the FMM. The MEs and LEs are Taylor series (or other series) that converge in different subdomains of space. The center of the series for an ME is the center of the cluster of source particles, and the series only converges outside the cluster of particles. In the case of an LE, the series is centered near an evaluation point and converges locally.

The first step of the FMM is to hierarchically subdivide space in order to form the clusters of particles; this is accomplished by using a tree structure, illustrated in Figure 1, to represent each subdivision. In a one-dimensional example: level 0 is the whole domain, which is split in two halves at level 1, and so on up to level l. The spatial decomposition for higher dimensions follows the same idea, but changes the number of subdivisions: in two dimensions, each domain is divided in four, to obtain a quadtree, while in three dimensions, domains are split in eight, to obtain an oct-tree. Independently of the dimension of the problem, we can always make a flat drawing of the tree as in Fig. 1, with the only difference between dimensions being the number of branches coming out of each node of the flat tree.

Fig. 1.
Sketch of a one-dimensional domain (right), divided hierarchically using a binary tree (left), to illustrate the meaning of levels in a tree and the idea of a final leaf holding a set of particles at the deepest level.
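To make the far-field approximation concrete, the following sketch (our illustration, not part of the paper's library) compares the direct evaluation of (1) with a p-term multipole expansion for the two-dimensional Laplace kernel K(y, x) = log(y - x), written in complex coordinates as in the original FMM [2]. The source distribution, cluster radius, evaluation point, and truncation order p are assumptions chosen for the demonstration.

```python
import cmath
import random

def direct_sum(y, sources):
    # Direct evaluation of (1): f(y) = sum_i c_i * K(y, x_i), with K = log
    return sum(c * cmath.log(y - x) for c, x in sources)

def multipole_coefficients(sources, center, p):
    # ME about `center`: f(y) ~ Q log(y - c) + sum_{k=1}^{p} b_k / (y - c)^k
    # with Q = sum_i c_i and b_k = -(1/k) sum_i c_i (x_i - c)^k
    Q = sum(c for c, _ in sources)
    b = [-sum(c * (x - center) ** k for c, x in sources) / k
         for k in range(1, p + 1)]
    return Q, b

def multipole_eval(y, center, Q, b):
    # Evaluate the truncated expansion far from the cluster
    z = y - center
    return Q * cmath.log(z) + sum(bk / z ** k for k, bk in enumerate(b, 1))

# A cluster of 50 sources near the origin, evaluated well outside the cluster
random.seed(7)
sources = [(random.uniform(0.5, 1.5),
            complex(random.uniform(-0.2, 0.2), random.uniform(-0.2, 0.2)))
           for _ in range(50)]
y = 2.0 + 1.0j
exact = direct_sum(y, sources)
Q, b = multipole_coefficients(sources, center=0.0, p=10)
approx = multipole_eval(y, 0.0, Q, b)
rel_error = abs(approx - exact) / abs(exact)
```

Because the evaluation point lies several cluster radii from the expansion center, the truncated series converges geometrically in the ratio (cluster radius)/(distance); an evaluation point close to the cluster would instead belong to the near domain and be handled by the direct sum.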
Fig. 2. Illustration of the upward sweep and the downward sweep of the tree. The multipole expansions (MEs) are created at the deepest level, then translated upwards to the center of the parent cells. The MEs are then translated to local expansions (LEs) for the siblings at all levels deeper than level 2, and then translated downward to children cells. Finally, the LEs are created at the deepest levels.

After the space decomposition, the FMM builds the MEs for each node of the tree in a recursive manner. The MEs are built first at the deepest level, level l, and then translated to the center of the parent cell, recursively creating the MEs of the whole tree. This is referred to as the upward sweep of the tree. Then, in the downward sweep, the MEs are first translated into LEs for all the boxes in the interaction list. At each level, the interaction list corresponds to the cells of the same level that are in the far field for a given cell; this list is defined by simple relations between nodes of the tree structure. Finally, the LEs of upper levels are added up to obtain the complete far-domain influence for each box at the leaf level of the tree. The result is added to the local influence, calculated directly with Equation (1). These ideas are better visualized with an illustration, as provided in Figure 2. For more details of the algorithm, we cite the original reference [2].

3 PARALLELIZATION STRATEGY

In the implementation of the FMM for parallel systems, the contribution of this work lies in automatically performing load balancing, while minimizing the communication between processes.
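The "simple relations between nodes of the tree structure" that define the interaction list, and that later determine which sub-trees must communicate, reduce to index arithmetic. A minimal sketch for a one-dimensional binary tree (our illustration; the indexing convention and two-cell near field are assumptions, not the library's actual data structure):

```python
def parent(i):
    # Index of the parent cell, one level up
    return i // 2

def children(i):
    # Indices of the two child cells, one level down
    return [2 * i, 2 * i + 1]

def neighbors(i, ncells):
    # Near field at this level: the cell itself and its adjacent cells
    return {j for j in (i - 1, i, i + 1) if 0 <= j < ncells}

def interaction_list(i, ncells):
    # Children of the parent's near neighbors that are NOT near neighbors
    # of cell i: well separated at this level, but not at the parent level.
    candidates = set()
    for p in neighbors(parent(i), ncells // 2):
        candidates.update(children(p))
    return candidates - neighbors(i, ncells)

# At a level with 8 cells, cell 4 receives ME-to-LE translations from
# cells 2, 6 and 7, while cells 3 and 5 remain in its near field.
far = interaction_list(4, 8)
```

The quadtree and oct-tree cases follow the same pattern, with four or eight children per cell and two-dimensional or three-dimensional neighbor stencils.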
To perform this task automatically, the FMM is first decomposed into more basic algorithmic elements, and then we optimize the distribution of these elements across the available processes, using load balance and minimal communication as the criteria. In our parallel strategy, we use sub-trees of the FMM as the basic algorithmic element to distribute across processes. In order to partition work among processes, we cut the tree at a certain level k, dividing it into a root tree and 2^{dk} local trees (where d is the spatial dimension),
as seen in Figure 3.

Fig. 3. Sketch to illustrate the parallelization strategy. The image depicts the FMM algorithm as a tree; this tree is then cut at a chosen level k, and all the resulting sub-trees are distributed among the available processes.

By assigning multiple sub-trees to any given process, we can achieve both load balance and minimal communication. A basic requirement of our approach is that, when decomposing the FMM, we produce more sub-trees than available processes. To optimally distribute the sub-trees among the processes, we recast the distribution of sub-trees as a graph partitioning problem, and later partition the graph into as many parts as there are processes available. First, we assemble a graph whose vertices are the sub-trees, with an edge (i, j) indicating that a cell in sub-tree j is in the interaction list of a cell in sub-tree i, as seen in Figure 4. Then weights are assigned to each vertex i, indicating the amount of computational work performed by sub-tree i, and to each edge (i, j), indicating the communication cost between sub-trees i and j. One of the advantages of using the sub-trees as the basic algorithmic element is that, due to their well-defined structure, we can rapidly estimate the amount of work and communication performed by each sub-tree. In order to produce these estimates, the only information that we require is the depth of the sub-trees and neighbor sub-tree information. By using the work and communication estimates, we have a weighted graph representation of the FMM. The graph is then partitioned, for instance using ParMetis [3], and the tree information can be distributed to the relevant processes. Advantages of this approach, over a space-filling curve partitioning for example, include its simplicity, the reuse of the serial tree data structure, and the reuse of existing partitioning tools.
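As a sketch of how such a weighted graph could be assembled from depth and neighbor information alone, consider the following. The cost model here (cell count times p^2 work per cell, and p coefficients shipped per ME-to-LE translation) is our own illustrative assumption, not the estimates used in the library; the sub-tree depths and interaction counts are likewise hypothetical.

```python
def subtree_work(depth, p):
    # Illustrative work estimate: a binary sub-tree of the given depth has
    # 2^(depth+1) - 1 cells, each performing O(p^2) translation work.
    ncells = 2 ** (depth + 1) - 1
    return ncells * p * p

def build_subtree_graph(depths, interactions, p):
    # Vertex weights: estimated work per sub-tree, indexed by sub-tree id.
    # Edge weights: estimated traffic, assuming p coefficients per
    # ME-to-LE translation that crosses a sub-tree boundary.
    vertex_weights = {i: subtree_work(d, p) for i, d in enumerate(depths)}
    edge_weights = {}
    for i, j, n_translations in interactions:
        edge_weights[tuple(sorted((i, j)))] = n_translations * p
    return vertex_weights, edge_weights

# Four sub-trees of uneven depth; (i, j, n) means n cross-boundary
# ME-to-LE translations between sub-trees i and j.
depths = [3, 3, 4, 2]
interactions = [(0, 1, 6), (1, 2, 6), (2, 3, 4)]
vweights, eweights = build_subtree_graph(depths, interactions, p=17)
```

A weighted graph of this form is exactly the input expected by a partitioner such as ParMetis, which returns one part per process; deeper sub-trees receive larger vertex weights and are therefore balanced against several shallow ones.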
Moreover, purely local data and data communicated from other processes are handled using the same parallel structure, known as Sieve [4]. The Sieve framework provides support for parallel, scalable, unstructured data structures, and the operations defined over them. In our parallel implementation, the use of
Sieve reduces the parallelism to a single operation for neighbor and interaction list exchange.

Fig. 4. Sketch that illustrates the graph representation of the sub-trees. The FMM sub-trees are represented as vertices in the graph, and the communication between sub-trees is represented by the edges between corresponding vertices. Each vertex has an associated weight v_i, which represents an estimate of the total work performed by the sub-tree. Each edge has an associated weight e_ij, which represents an estimate of the communication between sub-trees v_i and v_j. By finding an optimal partitioning of the graph into as many sections as there are processes available, we achieve load balance while minimizing communication. In the figure, the graph has been partitioned in three sections; the boundaries between sections are represented by the dashed lines, and all the sub-trees that belong to the same section are assigned to the same process.

4 RESULTS

In this section we present results of the parallel implementation on a multicore architecture: a Mac Pro machine equipped with two quad-core processors and 8 GB of RAM. The implementation is currently being integrated into the open source PETSc library. In Figure 5, we present experimental timings obtained from an optimized version of the PETSc library, using its built-in profiling tools. The experiments were run on a multicore setup with up to four cores working in parallel, while varying the number of particles in the system. We performed experiments from three thousand particles up to systems with more than two million particles. All the experiments were run for a fixed set of parameters of the FMM algorithm; in all cases we used parameters that ensure high accuracy and performance (FMM expansion terms p = 17, and FMM tree level l = 8). In the experimental results, we were able to evaluate a problem with up to two million particles in under 20 seconds.
To put this in context, if we consider a problem
with one million particles (N = 10^6), the direct computation of the N^2 pair-wise interactions would require on the order of 1 Tera floating point operations (10^12 operations) to compute all the particle interactions of the system. The FMM instead accelerates the computation and achieves an approximate result with high accuracy in only about 10 Giga floating point operations, that is, 100 times fewer floating point operations.

Fig. 5. Log-log plot of the total execution time of the FMM parallel algorithm for fixed parameters (expansion terms p = 17, and FMM tree level l = 8) vs. problem size (number of particles in the system). Each line represents the total time used by the parallel FMM to compute the interactions in the system when using 1, 2, 3, and 4 processors.

5 CONCLUSIONS AND FUTURE WORK

In this paper we have presented the first open source project for a parallel Fast Multipole Method library. As a second contribution, we also introduced a new parallelization strategy for the Fast Multipole Method algorithm that automatically achieves load balance and minimizes communication between processes. The current implementation is capable of solving N-body problems with millions of unknowns on a multicore machine with good results. We are currently tuning the library and running experiments on a cluster setup. With this paper we hope to contribute to more widespread use of fast algorithms by the scientific community, and the open source character of the FMM library
is the first step towards this goal. The open source project is currently being incorporated into the PETSc library [1] and a first version of the software is going to be released by October 2008.

REFERENCES

[1] S. Balay, K. Buschelman, W. D. Gropp, D. Kaushik, M. Knepley, L. Curfman McInnes, B. F. Smith, and H. Zhang. PETSc Users Manual. Technical Report ANL-95/11 - Revision 2.1.5, Argonne National Laboratory, 2002.
[2] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. J. Comput. Phys., 73(2):325-348, 1987.
[3] George Karypis and Vipin Kumar. A parallel algorithm for multilevel graph partitioning and sparse matrix ordering. Journal of Parallel and Distributed Computing, 48:71-85, 1998.
[4] Matthew G. Knepley and Dmitry A. Karpeev. Mesh algorithms for PDE with Sieve I: Mesh distribution. Scientific Programming, 2008. To appear.