P013 INTRODUCING A NEW GENERATION OF RESERVOIR SIMULATION SOFTWARE


JEAN-MARC GRATIEN, JEAN-FRANÇOIS MAGRAS, PHILIPPE QUANDALLE, OLIVIER RICOIS
IFP, 1 & 4, av. Bois-Préau, Rueil-Malmaison Cedex, France
9th European Conference on the Mathematics of Oil Recovery, Cannes, France, 30 August - 2 September 2004

Abstract

As full field reservoir simulations require large amounts of computing resources, the trend is to use parallel computing to overcome hardware limitations. With a price/performance ratio often better than that of traditional machines and with services in constant progress, Linux clusters have become a very attractive alternative to traditional high performance facilities. Oil & gas companies have therefore moved very quickly to these new architectures. This paper presents the new reservoir simulation software developed at IFP, which has been specially designed for Linux clusters. The parallel model, implemented with the MPI protocol, is described. The linear solvers and the preconditioning techniques applied to solve very large scale problems are presented. We discuss some technical choices well suited to our specific numerical schemes and to Linux cluster architectures. This new reservoir simulation software has been tested on different platforms on several already published large scale benchmark problems and on real study cases. This paper reports the scalability of the parallel simulator and the numerical stability of the underlying algorithms observed during this test campaign.

Introduction

Like vector computers and traditional parallel machines before them, Linux PC clusters represent a major breakthrough in High Performance Computing (HPC). In order to take advantage of the various architectures, simulation codes must be adapted or sometimes completely rewritten. In 1983, IFP developed a vector version of its reservoir simulator [1]. In 2001, the first parallel reservoir simulator [2], dedicated to shared memory platforms, was introduced. In 2003, IFP began the development of a new parallel multipurpose reservoir simulator based on a parallel scheme optimized for distributed architectures and especially for Linux clusters. This paper describes the MPI (Message Passing Interface [3]) parallel model based on a domain decomposition using an expert grid partitioning algorithm. The reservoir simulator is based on a system of equations discretized with a first order finite volume scheme in space and linearized with the Newton iterative method. Simulations of reservoirs discretized with large numbers of cells and hydrocarbon components generate large sparse linear systems to be solved at each time step and each Newton iteration. Preconditioned iterative methods based on Bi-CGSTAB [4] or GMRES [4] are known to be well suited to this kind of problem. Nevertheless, the complexity and the size of reservoir models keep increasing, which requires designing new algorithms for solving such large linear systems. The advanced methods that have been implemented in our code are described, and ongoing research on linear solvers is also presented. We discuss the results obtained on the Tenth SPE Comparative Solution Project, Model 2 [5], and some bottlenecks due to Linux clusters.

A multipurpose reservoir simulator

The parallel reservoir simulator presented in this paper is a multipurpose simulator providing most of the physical options required by reservoir engineers, such as black-oil, multi-component, thermal, dual permeability, dual porosity, polymers and steam injection. The classical fully implicit numerical formulation was used for the tests, but the AIM, IMPEX, AIMPEX [6] and IMPES methods, mixing implicit and explicit time discretizations, are also available. The simulation grid may use either 3D Cartesian or Corner Point geometries integrating multiple refinement levels. This geometry is transformed into an unstructured internal geometry scheme for the resolution of the mass conservation equations.

Linux clusters

Until recently the HPC (High Performance Computing) market was completely divided in two parts, vector computers and traditional proprietary parallel machines. During the last ten years, however, Linux clusters have quickly gained in maturity and are now widely used, in particular for geo-modeling and seismic interpretation algorithms. The main reason for this success is their price/performance ratio, which is much better than for other classical platforms. Linux clusters are built on basic bricks which contain components similar to those found in personal PCs, and the volume of the personal PC market pushes prices down drastically. In a Linux cluster, the nodes are linked to each other by dedicated networks for services, input/output and communications. A global parallel file system is often used to give all nodes simultaneous access to a unique storage area. The networks and the parallel file system are the critical parts of the architecture and can represent half the price of the whole machine. The Open Source community has made great efforts on the Linux operating system, which now offers the same level of quality and robustness as proprietary systems. Most of the largest clusters [7] in the world are now Linux clusters.

Parallelism approach

In our simulator, the finite volume space discretization can use either structured or unstructured meshes. Unstructured meshes are better adapted to simulating very complex geometries including faults, layer pinch-outs, production areas and complex horizontal wells. They also avoid storing inactive cells and therefore require less memory. Our parallel simulator is based on a general unstructured mesh with cells and links between any two cells, one upstream cell and one downstream cell. The parallelization consists in:
1. partitioning the mesh into sub-domains;
2. distributing grid data on the different domains;
3. distributing computations on the different processors managing each sub-domain.

Data distribution

Data distribution is based on an overlapping decomposition of the grid. Each grid cell is assigned to a unique sub-domain as an interior cell. The overlapping decomposition is constructed by adding to each sub-domain the cells of other domains (the ghost cells) that are connected to at least one interior cell of the sub-domain. Finally the mesh is divided into several sub-domains composed of:
- interior cells;
- ghost cells, connected to at least one interior cell.
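As an illustration, a minimal sketch of this overlapping decomposition is given below. The data structures and names are assumptions for the example, not the simulator's actual API.

```cpp
// Sketch (not the simulator's code): building the overlapping decomposition
// from a cell partition. Assumed inputs: owner[c] gives the sub-domain of
// cell c, and links lists the (upstream, downstream) cell pairs of the mesh.
#include <set>
#include <utility>
#include <vector>

struct Decomposition {
  std::vector<std::set<int>> interior;  // interior cells of each sub-domain
  std::vector<std::set<int>> ghost;     // ghost cells of each sub-domain
};

Decomposition buildOverlap(int nDomains,
                           const std::vector<int>& owner,
                           const std::vector<std::pair<int, int>>& links) {
  Decomposition d;
  d.interior.resize(nDomains);
  d.ghost.resize(nDomains);

  // Each cell is assigned to exactly one sub-domain as an interior cell.
  for (int c = 0; c < static_cast<int>(owner.size()); ++c)
    d.interior[owner[c]].insert(c);

  // A cell of another domain becomes a ghost cell of a sub-domain as soon as
  // it is linked to at least one interior cell of that sub-domain.
  for (const auto& [i, j] : links) {
    if (owner[i] != owner[j]) {
      d.ghost[owner[i]].insert(j);
      d.ghost[owner[j]].insert(i);
    }
  }
  return d;
}
```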

Renumbering

To improve the performance of solving algorithms which are not naturally parallel, we use a special cell numbering for the domain decomposition. Cells are separated into:
- cells only connected to other interior cells, composing the interior of the sub-domain;
- cells having at least one connection with a ghost cell, which we call the interface cells of the sub-domain.

They are then sorted in the following way:
1. the cells of the interior of the sub-domain;
2. the interface cells;
3. the ghost cells.

Such a renumbering helps to parallelize algorithms which are not naturally parallel and which can otherwise be very expensive in communication costs. This is the case for the linear solvers and for many of the efficient preconditioning techniques described further below.

Figure 1: Domain decomposition example (interior cells, interface cells and ghost cells).
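A minimal sketch of this renumbering for one sub-domain is shown below; the local arrays (ghost flags and neighbour lists) are assumptions for the example.

```cpp
// Sketch (assumed local data layout): renumber the cells of one sub-domain as
// 1) cells of the interior of the sub-domain, 2) interface cells, 3) ghost cells.
#include <vector>

std::vector<int> renumber(const std::vector<bool>& isGhost,
                          const std::vector<std::vector<int>>& neighbours) {
  const int n = static_cast<int>(isGhost.size());
  std::vector<int> inner, interface, ghost;

  for (int c = 0; c < n; ++c) {
    if (isGhost[c]) { ghost.push_back(c); continue; }
    // An interior cell touching at least one ghost cell is an interface cell.
    bool touchesGhost = false;
    for (int nb : neighbours[c])
      if (isGhost[nb]) { touchesGhost = true; break; }
    (touchesGhost ? interface : inner).push_back(c);
  }

  // New ordering: interior of the sub-domain first, then interface, then ghosts.
  std::vector<int> newOrder;
  newOrder.reserve(n);
  newOrder.insert(newOrder.end(), inner.begin(), inner.end());
  newOrder.insert(newOrder.end(), interface.begin(), interface.end());
  newOrder.insert(newOrder.end(), ghost.begin(), ghost.end());
  return newOrder;
}
```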

Quality of the distribution and limit of the number of sub-domains

The communication cost depends on the number of interface cells. To achieve good performance, it is important to keep the interface rate at a minimum value, at worst no more than 10% to 20% depending on the network. On a test case of fixed size, when the number of domains increases, the interface rate grows because the number of interface cells increases while the global size stays constant. It is therefore clear that, for a fixed problem size, the parallel scalability reaches a maximum when the interface rate grows beyond a critical value, after which performance drops.

Distribution of cells, links, composite objects, global objects and their data

All data corresponding to cell properties, such as porosity, permeability, pressure or average velocity, are distributed according to the cell distribution: every property of a cell is distributed to all the processors managing the domains where the cell is either an interior or a ghost cell. Composite objects, such as wells composed of perforated cells, borders or limit regions, are assigned to the domain that contains the major part of their cells, and their data are distributed accordingly. Links between two cells i and j are distributed to all domains containing i or j as interior cells, but not to domains where i and j are both interface cells. Link data such as permeability are then distributed according to the link distribution. General global objects, such as PVT laws, KrPc laws, curves, recurrent well data or numerical options, are replicated on every processor.

Distribution of computation

All local computations using cell or link data are distributed according to the cell and link distributions. Global values, such as global balances, average or maximum values, or any kind of global step control (time steps, Newton steps and solver steps), result from computations over all cells or all links. They are built by global MPI reductions of the results obtained locally on the interior cells of each sub-domain. The most expensive phase of a simulation is the resolution of the non-linear problem using Newton iterations. At each Newton iteration, a large sparse linear system is solved with a parallel implementation of the Bi-Conjugate Gradient Stabilized method (Bi-CGSTAB) using a parallel preconditioner. This implementation is based on the parallel distribution of the matrix rows and of the vector components. As each matrix row corresponds to the equation of one cell unknown, each vector component to one cell unknown, and each non-zero matrix entry to one mesh link, the matrix and vector distribution follows the way cells and links are distributed. Some equations link well properties to perforated cell unknowns. These well properties may also be solved implicitly, like the other implicit cell unknowns. The corresponding rows and columns are distributed according to the well and cell distributions.

New solutions implemented in our new parallel reservoir simulator

In our new simulator, in order to overcome the difficulties of our parallel approach, we have:
- worked on a numerical scheme to ensure enhanced numerical stability and high parallel performance;
- implemented a communication strategy that uses overlapping communication as often as possible, limits the synchronization phases and reduces the communication cost.

Numerical stability and convergence criteria

During the simulation we use the Newton method to linearize the non-linearities. The linearized problem is solved using the Bi-CGSTAB algorithm. The cost of these iterative algorithms depends on the number of steps required to reach their stopping criteria. Even though we only parallelize independent operations, so that results should be independent of the number of domains, global results can nevertheless be sensitive to how the reduction of local operations is done. In a parallel paradigm, we cannot avoid small differences due to the global parallel reductions. During a simulation, if the Newton stopping criterion is close to the Newton precision, small differences due to parallel synchronization can affect the number of Newton iterations and the value of the time steps. For example, a small difference in the resolution of the linear system can make the Newton loop fail; the time step is then reduced and the Newton loop has to be restarted. In the end, a small difference in the solution of the linear system can have very important consequences on the number of Newton steps and, even worse, on the values of the time step, which gradually decreases. In that case the simulation becomes very unstable, and the cost of a single run depends strongly on the number of domains and on the hardware environment. In such cases, the accuracy of the solver is critical to ensure parallel performance, as the cumulative number of steps (solver iterations, Newton loops, time steps) can depend strongly on the number of domains. This shows the importance of choosing a stable numerical scheme, good solver options and a robust parallel preconditioner to obtain a numerically stable simulation.
This difficulty emphasises the particular importance of using very stable numerical schemes in parallel simulations. Solver options can have an important influence on the convergence criteria, and the choice of a good parallel preconditioner helps to obtain stable convergence iteration counts.
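To make the role of the global reductions concrete, the following minimal sketch computes the global residual norm used in a Newton stopping test. It assumes the renumbering above (interior-cell entries stored first in each local residual vector); names are hypothetical. The MPI reduction makes the test identical on all processors, but the floating-point summation order may still differ from a sequential run, which is the source of the small differences discussed above.

```cpp
// Sketch: global residual norm over interior cells only (ghost entries are
// owned, and therefore summed, by another processor).
#include <mpi.h>
#include <cmath>
#include <vector>

double globalResidualNorm(const std::vector<double>& residual,
                          int nInteriorCells,  // interior entries come first
                          MPI_Comm comm) {
  double local = 0.0;
  for (int i = 0; i < nInteriorCells; ++i)
    local += residual[i] * residual[i];

  double global = 0.0;
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
  return std::sqrt(global);
}

// Typical use in the Newton loop (epsilon being the Newton stopping criterion):
//   if (globalResidualNorm(res, nInterior, comm) < epsilon) break;
```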

In the standard sequential simulator, linear systems are efficiently solved by the Bi-Conjugate Gradient Stabilized method with an incomplete ILU0 preconditioner. This preconditioner, well known to be efficient for standard cases, is not naturally parallel as its algorithm is recursive. Tests on several cases show that the simple block ILU0 preconditioner is not numerically stable: the number of solver iterations is very sensitive to the number of domains. We have therefore parallelized it taking into account all the links between adjacent domains. The parallel renumbering makes the ILU0 algorithm parallel on interior cells, as they are independent from one domain to another. To optimize the treatment of the interface cells, we have separated and sorted them into three kinds of cells:
1. cells only connected to domains with a higher rank than the current domain;
2. cells connected both to domains with a higher rank and to domains with a lower rank than the current domain;
3. cells only connected to domains with a lower rank than the current domain.

A sketch of this classification is given at the end of this section. Even though the algorithm is recursive, the rows corresponding to the first and third kinds of interface cells can be computed independently in each domain, and therefore in parallel. The second kind of cells introduces recursion between processors, but fortunately, with most types of partitioners, there are no such cells. With all these renumbering techniques, the domain decomposition renumbering and the sorting of interface cells, the ILU0 preconditioner has been parallelized without neglecting the interface interactions between domains and while avoiding recursion between processors. In that way, the overhead of the parallelism is only due to communication costs. Our new parallel ILU0 preconditioner turns out to be a scalable preconditioner. However, it can encounter difficulties on industrial cases with very complex geometries and physical models. Recent research has shown that multigrid methods, which are more expensive than traditional ILU0 methods, are more robust for elliptic problems and very stable in terms of the number of iterations, independently of the problem size. In reservoir modelling, linear systems are composed of unknowns such as the pressure, coming from an elliptic equation, and other unknowns coming from transport equations. In some industrial cases, we have noticed that the saturation variations are very sensitive to the pressure gradient, so that having a robust preconditioner for the pressure is very important. We have developed a new kind of preconditioner, the two-level AMG, combining an algebraic multigrid method on the pressure unknowns with a more traditional parallel one on the others (ILU0, block ILU0, polynomial). In that way, we have developed a method which is robust, parallel and efficient. We have tested these methods (parallel ILU0, two-level AMG) and studied their influence on the performance and on the numerical stability of the reservoir simulator.
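The following minimal sketch shows the interface-cell classification used for the parallel ILU0, deduced from the ranks of the neighbouring domains that own the ghost neighbours of each interface cell. It is a hypothetical helper written for this paper's description, not the simulator's code.

```cpp
// Sketch: sort an interface cell into one of the three kinds described above.
#include <vector>

enum class InterfaceKind { HigherOnly, Mixed, LowerOnly };

InterfaceKind classify(int myRank, const std::vector<int>& neighbourDomainRanks) {
  bool higher = false, lower = false;
  for (int r : neighbourDomainRanks) {
    if (r > myRank) higher = true;
    if (r < myRank) lower = true;
  }
  if (higher && lower) return InterfaceKind::Mixed;  // introduces recursion between processors
  return higher ? InterfaceKind::HigherOnly          // computed in parallel
                : InterfaceKind::LowerOnly;          // computed in parallel
}
```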
Communication cost

The overhead of the computations is mainly due to redundant computations on ghost cells and to communication costs, and great attention has to be paid so that the communication cost does not reduce the scalability. With an efficient partitioner ensuring a low interface rate, the cost of the computations on ghost cells is small compared to the other computations, so the overhead of parallelism is mainly due to communication costs. The most communicating task is the solver.

During that phase, we take advantage of the domain decomposition renumbering of the cells. By sorting the interior cells, we can use overlapping communication during the linear resolution: with asynchronous communications, we overlap the communications through the network with the computations on the interior cells. For the synchronization steps, when overlapping is not possible, we use blocking communications, as they turn out to be more efficient than asynchronous ones.
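A minimal sketch of this overlap pattern with non-blocking MPI calls follows; the buffer layout and the compute routines are assumptions for the example, not the simulator's API.

```cpp
// Sketch: post non-blocking exchanges of interface/ghost values, compute on
// interior cells while the messages travel, then finish the interface part.
#include <mpi.h>
#include <vector>

void exchangeAndCompute(const std::vector<int>& neighbourRanks,
                        std::vector<std::vector<double>>& sendBuf,
                        std::vector<std::vector<double>>& recvBuf,
                        MPI_Comm comm) {
  std::vector<MPI_Request> requests;
  for (std::size_t k = 0; k < neighbourRanks.size(); ++k) {
    MPI_Request rq;
    MPI_Irecv(recvBuf[k].data(), static_cast<int>(recvBuf[k].size()), MPI_DOUBLE,
              neighbourRanks[k], 0, comm, &rq);
    requests.push_back(rq);
    MPI_Isend(sendBuf[k].data(), static_cast<int>(sendBuf[k].size()), MPI_DOUBLE,
              neighbourRanks[k], 0, comm, &rq);
    requests.push_back(rq);
  }

  // computeInteriorCells();   // hypothetical: work that needs no ghost data

  MPI_Waitall(static_cast<int>(requests.size()), requests.data(),
              MPI_STATUSES_IGNORE);

  // computeInterfaceCells();  // hypothetical: work that needs the ghost data
}
```

When overlapping is not possible, the same exchange can be performed with blocking calls, as noted above for the synchronization steps.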

We only need a few synchronization phases to ensure that the ghost cell data are correct. Most of the computations are done on every processor that manages a cell, either as an interior or as a ghost cell. Synchronization is therefore only needed when the computation on a ghost cell may be wrong because of missing neighbour data. We group these synchronizations into a few phases: after the linear resolution and before updating the cell data with their variations, and when we compute the various balances used to determine time steps or Newton steps.

IO strategy, parallel HDF5 format

As the amount of data transferred during the IO phases can be very high, a parallel IO strategy has been adopted. This strategy is a very difficult issue in parallel industrial software because it depends on the performance of the parallel file system, the way users want to post-process the result files, and the number of processors currently used. The HDF5 [8] (Hierarchical Data Format version 5) API is used for the management of the parallel IO in our simulator. HDF5 is a library and a file format designed for scientific data storage. HDF5 provides a complete, portable, high level API which allows the developer to define efficient data structures that optimize the read/write operations. All the low level calls to MPI-IO [3] are implicitly done by HDF5; this allows good portability and important gains in development and maintenance time. The parallel version of the HDF5 API enables all processors to read and write in one binary file, managing all kinds of contention between processors. The use of HDF5 gives a unique file which is independent of the number of processors, which allows restart jobs and post-processing to be run with various numbers of processors. The performance of the HDF5 library is completely dependent on the performance of the parallel file system. We noticed that HDF5 is not very scalable for a large number of processors on the GPFS file system. To improve the scalability of the IO in our simulator, we developed a strategy which consists in reducing the number of processes doing the IO on the disk.
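As an illustration of the parallel HDF5 usage described above, the following sketch opens one shared file through the MPI-IO driver. It assumes a parallel HDF5 build; dataset names and layout are not taken from the paper.

```cpp
// Sketch: create one HDF5 file shared by all processors via MPI-IO.
#include <hdf5.h>
#include <mpi.h>

hid_t openSharedFile(const char* path, MPI_Comm comm) {
  // File-access property list that routes HDF5 I/O through MPI-IO
  // (available only when HDF5 is built with parallel support).
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);

  // Every processor opens the same file collectively; the result is a single
  // binary file independent of the number of processors.
  hid_t file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
  H5Pclose(fapl);
  return file;
}
```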

Results

The parallel reservoir simulator has been tested on a 32-node IBM Linux cluster of dual Intel Xeon 3.06 GHz processors with 2 GB of memory per node, a high bandwidth Myrinet 2000 network and an IBM GPFS parallel file system. The results presented in this paper were obtained on the Tenth SPE Comparative Solution Project, Model 2 [5]. This model is built on a regular Cartesian geometry and simulates a 3D waterflood of a geostatistical model with more than one million cells. Only the fine grid, with dimensions 60x220x85 and 1,094,721 active cells, has been simulated on the cluster. The simulated problem is incompressible and a water-oil thermodynamic system is used to model the fluid behaviour. The simulations were carried out on 2, 4, 8, 16 and 32 processors, filling the cluster in the standard way with 2 processes per node. We have observed that performance would have been better with one process per node, because of a serious memory-access bottleneck on the dual Xeon nodes. Nevertheless, we decided to present the results in the usual running conditions of normal users. The physical results (fig. 2) are very similar to the results presented in [5] and are independent of the number of processors. This confirms the good stability of our algorithms.

Figure 2: Well P3 water cut and standard surface oil rate.

Interface rate

In the following table, we compare the interface rate of the SPE10 test case versus the number of domains, with a classical band partitioner.

Table 1: Domain decomposition statistics (interface size, interior size and interface rate in % versus the number of CPUs).

This evolution shows that the SPE10 test case can be efficiently simulated on up to 32 processors; with more than 32 processors, very poor performance is expected.

Stability

In the following table, obtained with the Combinative AMG preconditioner, we compare, versus the number of processors: the number of time steps needed for the whole simulation, the number of solver iterations, and the time step evolution.

Table 2: Numerical results (elapsed time, speedup, number of solver iterations and number of time steps versus the number of CPUs).

The results show that the simulator is numerically stable and that the number of steps is almost independent of the number of processors.

Scalability

The following figure (fig. 3) shows the elapsed time and the speed-up relative to 2 processors of the different runs, versus the number of processors.

Figure 3: Elapsed times and speedups relative to 2 processors, versus the number of processors.

The graph shows that up to 32 processors we obtain very good speedups. One run has been performed on 64 processors and the performance was poor, as the SPE10 case is not large enough to allow a low interface rate on 64 sub-domains.
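As an aid to reading Figure 3, the relative speedup is assumed here to follow the usual definition against the 2-processor baseline (the smallest run reported), S_p = T_2 / T_p, where T_p is the elapsed time measured with p processors; the ideal value on 32 processors is then 16.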

Conclusions

The parallel version of our simulator, designed for distributed architectures, achieves good performance on a Linux cluster. The results presented in this paper and the other test cases we have run show that:
- the domain decomposition and the parallel algorithms used ensure good stability of our code;
- the Combinative method is well suited to the resolution of complex reservoir models and is very insensitive to the number of processors;
- the choice of a parallel IO strategy has a very important impact on performance.

The scalability of the code has been demonstrated up to 32 processors; larger models will be simulated to assess the scalability on larger clusters.

References

1. P. Quandalle, Comparaison de Quelques Algorithmes d'Inversion Matricielle sur le Calculateur CRAY1, Revue de l'Institut Français du Pétrole, vol. 38, no. 2, March-April 1983.
2. J-F. Magras, P. Quandalle, High Performance Reservoir Simulation With Parallel ATHOS, SPE 66342, Feb.
3. Message Passing Interface.
4. Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Second Edition, January.
5. M. A. Christie, Tenth SPE Comparative Solution Project: A Comparison of Upscaling Techniques, SPE 66599, Feb.
6. Y. Caillabet, J-F. Magras, Large compositional reservoir simulations with parallelized adaptive implicit methods, SPE 81501, June.
7. Top 500 Supercomputer Sites.
8. HDF5 Home Page.
