Dell High-Performance Computing Clusters and Reservoir Simulation Research at UT Austin
Reza Rooholamini, Ph.D.
Director, Enterprise Solutions, Dell Computer Corp.
Reza_Rooholamini@dell.com
http://www.dell.com/clustering
Product Maturity Life Cycle in the Open Systems Market
[Chart: an axis running from proprietary, high cost/complexity offerings (heterogeneous SANs, RISC systems, grids, project-based SANs, 8P servers) through HPC clusters, network attached storage, and 4P servers, toward fully standardized, high simplicity/volume/choice offerings (direct attached storage, 1/2P servers, appliance servers, workstations, desktops)]
Enterprise Solutions
Our Vision
- Customers define our success: begin with the customer, end with the customer
- Provide the best price/performance solutions to our customers in HPC
- Promote standardization to provide choice, lower cost of ownership, and simplicity in HPC solutions
- Evangelize new HPC technologies and selectively adopt the relevant ones for productization
- Derive product requirements by focusing on applications
- Provide a total solution: hardware, software, and services
- Partner with best of class in HPC
Building Block Approach
- Benchmark: parallel benchmarks (NAS, HINT, Linpack) and parallel applications
- Middleware: MPI/Pro, MPICH, MVICH, PVM
- OS: Linux, Windows
- Protocol: TCP, VIA, GM, Elan
- Interconnect: Fast Ethernet, Gigabit Ethernet, Myrinet, Quadrics, Infiniband
- Platform: Dell PowerEdge servers (IA32 & IA64)
Dell and UT Austin
- Dell is sponsoring research in reservoir simulation at the Department of Petroleum and Geosystems Engineering
- Dr. Kamy Sepehrnoori is collaborating with Dell's HPCC team on performance studies, paper publications, and parallel simulator development
- The Dell HPCC team includes graduates of Dr. Sepehrnoori's group who specialized in petroleum engineering
- Dell has participated in the Reservoir Simulation JIP (Joint Industry Project) in the past, and is planning to attend the upcoming meeting
- Dr. Sepehrnoori has access to the Dell HPC lab for running large simulations, and is provided with hardware for development, testing, and performance studies of his program
A Performance Study of Parallel Reservoir Simulation on HPC Clusters
Baris Guler, Tau Leng, Victor Mashayekhi, Reza Rooholamini (Dell Computer Corporation)
Kamy Sepehrnoori (Center for Petroleum and Geosystems Engineering, The University of Texas at Austin)
Outline
- Background
- Software/Hardware Description
- Compositional Reservoir Simulation on HPCs
- Results
- Summary
- Future Work
Reservoir Simulation Application
- Reservoir forecasting
- Reservoir performance optimization
- Sensitivity analysis
- History matching
- Risk assessment through stochastic simulation
- Assessment of uncertainty in forecasting
- Value-of-information studies
- Reservoir management
Reservoir Simulation Steps
- Data input / model initialization
- For each time step:
  - Solution of nonlinear partial differential equations: discretization, linearization, and Newtonian iteration
  - Solution using direct or iterative solvers
  - Test for convergence of solution
  - Data output / graphics
  - Time-step increment
- End of simulation study
- Results processing / interpretation
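The time-stepping loop above can be sketched in a few lines of Python. A scalar toy equation stands in for the discretized PDE system; every name below is illustrative, not taken from GPAS.

```python
# Minimal sketch of the simulation loop: at each time step, solve the
# nonlinear residual F(u_new) = 0 by Newton iteration, then advance time.

def simulate(u0, dt, t_end, tol=1e-10, max_newton=20):
    """Backward Euler on the toy equation du/dt = -u**3."""
    u, t, history = u0, 0.0, [u0]
    while t < t_end:
        u_new = u                                  # initial guess: old solution
        for _ in range(max_newton):
            F = u_new - u + dt * u_new**3          # residual of the implicit step
            if abs(F) < tol:                       # convergence test
                break
            J = 1.0 + 3.0 * dt * u_new**2          # Jacobian (linearization)
            u_new -= F / J                         # "direct solver" Newton step
        u, t = u_new, t + dt                       # time-step increment
        history.append(u)                          # data output
    return history
```

In a real compositional simulator the scalar Newton update becomes a large sparse linear solve per iteration, which is where direct or iterative solvers enter.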
Reservoir Simulation Hardware
[Timeline, 1960 to 2000: mainframes, supercomputers, RISC workstations, PCs/workstations, MPPs, HPCs]
Benefits of Parallel Processing
- Turn-around time
- Large-scale simulations
- Cost

Parallel Processing
- Massively parallel computers
- High-performance computing clusters

Benefits of Clusters
- Scalability
- High-performance computing
- Low cost
- Availability

Computational Mode
- Distributed processing
- Parallel processing
Distributed Processing
[Diagram: a user submits n input data sets (D_1 ... D_n) from an input generator to a batch queuing system, which dispatches them to the simulation program on m processors (P_1 ... P_m, with n >> m); results flow to a database for post-processing]

Cluster Simulation System
[Diagram: user input passes through a data generator and cluster scheduler to file servers (FS_1 ... FS_m) and data servers (DS_1 ... DS_n); a project advisor, archiver, and post-processor handle the output data]

CPGE Parallel Processing
[Diagram: a finite-difference (FD) reservoir model partitioned across six CPUs, contrasting FD alone with combined FD and domain decomposition (FD & DD)]
Domain Decomposition
- Ghost layer creation
- Communication
- Fundamental strategy for grid-based parallel simulation
- Example: 10 x 15 grid, 6 processors
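The 10 x 15 / 6-processor example can be sketched as a 1-D decomposition along the 15-cell axis, with each rank padding its owned block by one ghost row per interior boundary (a minimal sketch; the function names are illustrative):

```python
# 1-D domain decomposition with ghost layers, as in the slide's example.

def decompose_1d(n_cells, n_procs):
    """Split n_cells among n_procs ranks as evenly as possible;
    returns half-open (start, stop) ranges, extras going to low ranks."""
    base, extra = divmod(n_cells, n_procs)
    ranges, start = [], 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def with_ghosts(ranges, n_cells):
    """Extend each owned range by one ghost row per interior boundary;
    a neighboring rank fills the ghost rows via communication."""
    return [(max(start - 1, 0), min(stop + 1, n_cells))
            for start, stop in ranges]

# 10 x 15 grid on 6 processors, decomposed along the 15-cell axis
owned = decompose_1d(15, 6)
padded = with_ghosts(owned, 15)
```

Each rank computes only on its owned rows; before every solver sweep, the ghost rows are refreshed from the neighbors, which is the communication step named on the slide.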
Performance Issues in Parallel Processing: Software
- Design
- Algorithm
- Parallelization
- Programming practice
- Load balancing

Performance Issues in Parallel Processing: Hardware
- Configuration
- CPU
- Cache
- Memory subsystem
- Front side bus
- I/O bandwidth
- Interconnect
Hardware - Interconnect

Type               Speed (MB/s)   Latency (us)
Fast Ethernet        9.0          170
Gigabit Ethernet    80            170
Giganet            110              7.5
Myrinet            225              6-7
Infiniband 4x      500              6-8
Quadrics           330              4.5
Dolphin            385              4
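A first-order latency/bandwidth model makes the table concrete: transfer time is roughly latency + message size / bandwidth. The bandwidth and latency figures below are the table's values (with 6.5 us taken as the midpoint of Myrinet's 6-7 us range); the model itself is the standard approximation, not a measurement from this study.

```python
# Estimate one-way message cost: t = latency + size / bandwidth.

def transfer_time_us(size_bytes, bandwidth_mb_s, latency_us):
    """Estimated one-way transfer time in microseconds."""
    return latency_us + size_bytes / (bandwidth_mb_s * 1e6) * 1e6

interconnects = {"Fast Ethernet": (9.0, 170.0), "Myrinet": (225.0, 6.5)}
for name, (bw, lat) in interconnects.items():
    small = transfer_time_us(1024, bw, lat)    # 1 KB: latency-dominated
    large = transfer_time_us(2**20, bw, lat)   # 1 MB: bandwidth-dominated
    print(f"{name}: 1 KB ~ {small:.1f} us, 1 MB ~ {large:.1f} us")
```

The two regimes explain why low-latency interconnects matter most for the many small ghost-layer exchanges in domain-decomposed simulation.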
CPGE-1 (Ararat)
- 12 nodes / 16 processors
- 1.0 GHz Intel Pentium III Xeon processors
- 256 MB of memory
- Diskless configuration
- 100 Mbps switched Fast Ethernet and Giganet interconnects

TACC-1 (Tejas)
- 32 nodes / 64 processors
- 1.0 GHz Intel Pentium III processors
- 1 GB of memory per processor
- 225 MBps Myrinet-2000 interconnect
Parallel Reservoir Simulators
- Chevron-Texaco
- Conoco-Phillips
- Exxon-Mobil
- IFP and Beicip-Franlab
- Landmark Graphics Corporation
- Schlumberger-Geoquest
- Saudi Aramco
- UT CPGE, UT CSM
Note: 93 clusters in the Top500 supercomputer sites, 23 in the oil and gas sector.
Compositional Reservoir Simulation on HPCs
Project Objectives
Develop a general-purpose adaptive simulator (GPAS) capable of:
- modeling complex physical processes, including EOS compositional, chemical, black-oil, and thermal
- high-resolution studies on supercomputers and high-performance clusters
HPC Initiatives
- Evaluate and compare performance of different cluster systems
- Test and analyze performance of different parallel simulators
- Identify areas of improvement in parallel algorithm design and cluster setup for optimal parallel reservoir simulation
Summary of Clusters

Cluster             CPU Type           CPU Speed (MHz)   CPUs         Memory/CPU   Interconnect
CPGE-1 (Fuji)       Pentium II          300              16x1=16      384 MB       Fast Ethernet
CPGE-1 (Rocky)      Pentium II Xeon     400              8x2=16       256 MB       Fast Ethernet
CPGE-1 (Ararat)     Pentium III Xeon   1000              8x1+4x2=16   256 MB       Fast Ethernet
DELL-1 (PE 1550)    Pentium III        1000              16x2=32      512 MB       Myrinet, Gigabit, Fast Ethernet
DELL-2 (PE 2650)    Intel Xeon DP      2400              64x2=128     1 GB         Myrinet, Gigabit, Fast Ethernet
TACC-1 (Tejas)      Pentium III        1000              32x2=64      512 MB       Myrinet
TACC-2 (Longhorn)   Power4             1300              4x16=64      2 GB         IBM SP Switch2
Parallel Simulators Tested
- GPAS
- VIP (2003r4)
CPGE Simulator (GPAS)
- EOS compositional (Peng-Robinson EOS)
- Fully implicit
- PETSc linear solvers
- Parallel (IPARS framework)
Performance Results
Base Benchmark Problem
- Compositional model, 3-component Peng-Robinson EOS
- Dry-gas cycling process
- Reservoir size: 800 x 11200 x 160 ft, homogeneous
- 2 wells: 1 injector, 1 producer
- Grid: 16 x 224 x 8 (28,672 cells)
- Unknowns: 229,376
- 100 days of gas injection
- One-dimensional domain decomposition
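A quick consistency check on the grid and unknown counts; the 8-unknowns-per-cell ratio is inferred from the slide's totals rather than stated, and the same ratio reproduces the modified benchmark's ~1.57 million unknowns given later in the deck.

```python
# Sanity check: cells and unknowns for the base and modified benchmarks.
base_cells = 16 * 224 * 8                      # 28,672 cells
base_unknowns = 229_376                        # as stated on the slide
per_cell = base_unknowns // base_cells         # 8 unknowns per cell (inferred)

modified_cells = 77 * 256 * 10                 # 197,120 cells
modified_unknowns = modified_cells * per_cell  # 1,576,960 (~1.57 million)
```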
Single-Processor Execution Times (GPAS), Base Benchmark Problem
- Fuji, Pentium II 300 MHz: 1030.3 sec
- Rocky, Pentium II Xeon 400 MHz: 615.2 sec
- PowerEdge 1550, Pentium III 1.0 GHz: 313.3 sec
- Ararat, Pentium III Xeon 1.0 GHz: 309.3 sec
- TACC Tejas, Pentium III 1.0 GHz: 306.38 sec
- Dell PE 2650, Intel Xeon DP 2.4 GHz: 180.186 sec
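The speedup charts that follow use the standard ratios S_p = T_1 / T_p and E_p = S_p / p. As a minimal sketch, only the serial times are recoverable from the chart, so the 16-processor time below is a hypothetical illustration:

```python
# Strong-scaling speedup and parallel efficiency from measured run times.

def speedup(t1, tp):
    """Speedup S_p = T_1 / T_p."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Parallel efficiency E_p = S_p / p; 1.0 is ideal."""
    return speedup(t1, tp) / p

# Hypothetical: the 313.3 s serial PE 1550 run finishing in 25 s on 16 CPUs
s16 = speedup(313.3, 25.0)         # ~12.5x
e16 = efficiency(313.3, 25.0, 16)  # ~0.78
```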
Multi-Processor Execution Times (GPAS), Base Benchmark Problem
[Chart: execution time (seconds, log scale) vs. number of processors (up to 16) for Fuji, Rocky, Ararat, PE 1550, PE 2650, Tejas, and Longhorn]
Multi-Processor Speedups (GPAS), Base Benchmark Problem
[Chart: speedup vs. number of processors (up to 32) for Fuji (FE), Rocky (FE), Ararat (FE), PE 1550 (FE), PE 2650 (FE), Tejas (Myrinet), and Longhorn, against the ideal line]
Comparison of MPI Libraries and Interconnects (GPAS), Base Benchmark Problem
[Chart: speedup vs. number of processors (up to 32) on the Dell PE 2650 (single processor per node) for MPICH-Gigabit, MPICH-GM (Myrinet), MPI/Pro-Gigabit, and MPICH-Fast Ethernet, against the ideal line]
Constant Problem Size per Processor (GPAS)
[Chart: execution time (seconds) vs. grid size and CPU count, from 19,200 cells on 1 CPU to 614,400 cells on 32 CPUs, for Fuji, Rocky, Ararat, and Tejas]
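The constant-problem-size-per-processor runs are a weak-scaling test: the grid grows in proportion to the CPU count, so ideal behavior is a flat execution time, and weak-scaling efficiency is T_1 / T_p. The times below are hypothetical, for illustration only:

```python
# Weak-scaling efficiency: problem size grows with p, so ideally T_p == T_1.

def weak_scaling_efficiency(t1, tp):
    """Efficiency T_1 / T_p; 1.0 means the execution time stayed flat."""
    return t1 / tp

# Hypothetical times (s) as grid size and CPU count double together
times = {1: 200.0, 2: 210.0, 4: 225.0, 8: 250.0, 16: 290.0, 32: 350.0}
eff = {p: weak_scaling_efficiency(times[1], t) for p, t in times.items()}
```

Rising times at fixed per-CPU load point to growing communication cost, which is why the interconnect comparisons matter for these runs.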
Modified Benchmark Problem
- Compositional model, 3-component Peng-Robinson EOS
- Dry-gas cycling process
- Reservoir size: 7.3 x 24.2 x 0.1 miles
- Grid: 77 x 256 x 10 (197,120 cells)
- Unknowns: 1.57 million
- Anisotropic, layered permeability with Kv/Kh = 0.1
- 88 wells: 54 injectors, 24 producers, staggered line drive
- Injectors and producers are completed fully
- 100 days of gas injection
- One-dimensional domain decomposition
Multi-Processor Execution Times (GPAS), Modified Benchmark Problem
[Chart: execution time (seconds, log scale) vs. number of processors (up to 64) on the Dell PE 2650 for Gigabit single, Myrinet single, Fast Ethernet single, and Myrinet dual configurations]
Multi-Processor Speedups (GPAS), Modified Benchmark Problem
[Chart: speedup vs. number of processors (up to 64) on the Dell PE 2650 for Gigabit single, Myrinet single, Fast Ethernet single, and Myrinet dual configurations, against the ideal line]
Commercial Parallel Simulator
Remarks
- Our goal was to run the simulators in parallel mode and evaluate their performance for typical cases
- Our goal was to analyze the issues involved in using the simulators in parallel, and approaches to improving performance and design
- We did not tune the simulators for optimum performance
- We did not compare or match material balance errors of the simulator runs
Benchmark Problem for VIP
- Compositional model, modified SPE3 comparison project
- 9-component Peng-Robinson EOS
- Gas condensate with gas cycling process
- Reservoir size: 10 miles x 4 miles x 160 ft
- Grid: 180 x 72 x 4 (51,840 cells)
- 1 million unknowns
- Flow barriers present (using transmissibility modifiers)
- 20 wells: 10 injectors, 10 producers
- 10 years of cycling followed by 5 years of production
Multi-Processor Performance (VIP)
Multi-Processor Execution Times (VIP), Modified SPE3 Comparison Problem
[Chart: elapsed time (seconds) vs. number of processors (up to 16) for Fuji and Rocky]
Multi-Processor Speedups (VIP), Modified SPE3 Comparison Problem
[Chart: speedup vs. number of processors (up to 16) for Fuji and Rocky, against the ideal line]
Constant Problem Size per Processor (VIP), Modified SPE3 Comparison Problem
[Chart: execution time (seconds) vs. cell count and CPU count, from 25,920 cells on 1 CPU to 414,720 cells on 16 CPUs, for Fuji and Rocky]
Million-Cell Commercial Benchmark Problem for VIP
- IMPES scheme
- 7-component Peng-Robinson EOS
- Grid: 100 x 100 x 100 (1 million cells)
- 16 million unknowns
- Stochastically characterized data field
- 11 wells
- 49-year run
Performance Speedups (VIP), Million-Gridblock Problem
[Chart: speedup vs. number of processors (up to 64) on the Dell PE 2650 for VIP, against the ideal line]
Summary
- Tested GPAS and analyzed performance on new hardware
- Benchmarked performance of new clusters
- Compared performance of different interconnects and MPI libraries
- Tested the commercial reservoir simulator VIP in parallel mode
Acknowledgements
- US Department of Energy
- Reservoir Simulation Joint Industry Project members
- Dell Computer Corporation