Dell High-Performance Computing Clusters and Reservoir Simulation Research at UT Austin. http://www.dell.com/clustering




Dell High-Performance Computing Clusters and Reservoir Simulation Research at UT Austin
Reza Rooholamini, Ph.D., Director, Enterprise Solutions, Dell Computer Corp.
Reza_Rooholamini@dell.com
http://www.dell.com/clustering

Product Maturity Life Cycle in the Open Systems Market
[Diagram: products arranged along a maturity axis running from proprietary (high cost/complexity) to fully standardized (simplicity/volume/choice). Proprietary end: heterogeneous SANs, RISC systems, grids, project-based SANs, 8P servers. Middle: HPC clusters, network attached storage, 4P servers. Fully standardized end: direct attached storage, 1/2P servers, appliance servers, workstations, and desktops.]

Our Vision
- Customers define our success: begin with the customer, end with the customer
- Provide the best price/performance solutions to our customers in HPC
- Promote standardization to provide choice, lower cost of ownership, and simplicity in HPC solutions
- Evangelize new HPC technologies and selectively adopt the relevant ones for productization
- Derive product requirements by focusing on applications
- Provide a total solution: hardware, software, and services
- Partner with best-of-class providers in HPC

Building Block Approach
- Benchmark: parallel benchmarks (NAS, HINT, Linpack) and parallel applications
- Middleware: MPI/Pro, MPICH, MVICH, PVM
- OS: Linux, Windows
- Protocol: TCP, VIA, GM, Elan
- Interconnect: Fast Ethernet, Gigabit Ethernet, Myrinet, Quadrics, InfiniBand
- Platform: Dell PowerEdge servers (IA32 & IA64)

Dell and UT Austin
- Dell is sponsoring research in reservoir simulation at the Department of Petroleum and Geosystems Engineering
- Dr. Kamy Sepehrnoori is collaborating with Dell's HPCC team on performance studies, paper publications, and parallel simulator development
- The Dell HPCC team includes graduates of Dr. Sepehrnoori's group who specialized in petroleum engineering
- Dell has participated in the Reservoir Simulation JIP (Joint Industry Project) in the past and plans to attend the upcoming meeting
- Dr. Sepehrnoori has access to the Dell HPC lab for running large simulations, and is provided with hardware for development, testing, and performance studies of his program

A Performance Study of Parallel Reservoir Simulation on HPC Clusters
Baris Guler, Tau Leng, Victor Mashayekhi, Reza Rooholamini (Dell Computer Corporation)
Kamy Sepehrnoori (Center for Petroleum and Geosystems Engineering, The University of Texas at Austin)

Outline
- Background
- Software/Hardware Description
- Compositional Reservoir Simulation on HPCs
- Results
- Summary
- Future Work

Reservoir Simulation Applications
- Reservoir forecasting
- Reservoir performance optimization
- Sensitivity analysis
- History matching
- Risk assessment through stochastic simulation
- Assessment of uncertainty in forecasting
- Value-of-information studies
- Reservoir management

Reservoir Simulation Steps
1. Data input / model initialization
2. Time-step computation loop:
   - Solve the non-linear partial differential equations: discretization, linearization, and Newton iteration
   - Solve the linearized system using direct or iterative solvers
   - Test for convergence of the solution
   - Data output / graphics
   - Time-step increment
3. End of simulation
4. Results processing / interpretation
(See the sketch of this loop below.)
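To make the control flow concrete, here is a minimal schematic sketch of the time-stepping loop described above. This is not GPAS code: the `residual` and `jacobian` functions are hypothetical stand-ins for the simulator's discretized, linearized flow equations, and a production simulator would hand the linear solve to an iterative solver (e.g., from PETSc) rather than a dense direct solve.

```python
import numpy as np

def newton_time_step(u_old, residual, jacobian, dt, tol=1e-8, max_iter=20):
    """One implicit time step: solve R(u; u_old, dt) = 0 by Newton iteration.
    `residual` and `jacobian` are hypothetical stand-ins for the
    discretized, linearized flow equations."""
    u = u_old.copy()
    for _ in range(max_iter):
        r = residual(u, u_old, dt)
        if np.linalg.norm(r) < tol:          # convergence test
            return u
        J = jacobian(u, u_old, dt)           # linearization
        u += np.linalg.solve(J, -r)          # direct solve (iterative in practice)
    raise RuntimeError("Newton iteration did not converge; cut the time step")

def run_simulation(u0, residual, jacobian, t_end, dt):
    """Outer loop: initialize, march in time, emit output each step."""
    u, t = u0.copy(), 0.0
    while t < t_end:
        u = newton_time_step(u, residual, jacobian, dt)
        t += dt                              # time-step increment
        # data output / graphics would go here
    return u
```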

Reservoir Simulation Hardware
[Timeline, 1960 to 2000: mainframes, supercomputers, RISC workstations, PCs/workstations, MPPs, HPC clusters]

Benefits of Parallel Processing
- Turnaround time
- Large-scale simulations
- Cost

Parallel Processing
- Massively parallel computers
- High-performance computing clusters

Benefits of Clusters
- Scalability
- High-performance computing
- Low cost
- Availability

Computational Modes
- Distributed processing
- Parallel processing

Distributed Processing
[Diagram: an input generator produces datasets D1..Dn; a batch queuing system schedules them onto processors P1..Pm running the simulation program, with n >> m; results flow into a database for post-processing]
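As a sketch of the farming pattern in the figure, assuming nothing beyond the Python standard library: n input decks are queued and dispatched to a pool of m workers, with results collected for post-processing. Names like `run_simulation_case` are illustrative, not part of any simulator's API.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def run_simulation_case(deck):
    """Hypothetical stand-in for launching one simulation run."""
    return {"deck": deck, "result": f"simulated({deck})"}

def farm(decks, m_workers):
    """Dispatch n input decks (n >> m) onto m workers, batch-queue style."""
    results = []
    with ProcessPoolExecutor(max_workers=m_workers) as pool:
        futures = [pool.submit(run_simulation_case, d) for d in decks]
        for fut in as_completed(futures):
            results.append(fut.result())  # store for post-processing
    return results

if __name__ == "__main__":
    decks = [f"case_{i:03d}.dat" for i in range(100)]  # n = 100 datasets
    print(len(farm(decks, m_workers=8)), "runs completed")  # m = 8 workers
```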

Cluster Simulation System
[Diagram: user input flows through a data generator to a cluster scheduler, which distributes datasets DS1..DSn across servers FS1..FSm; output passes through an archiver and post-processor back to the user and project advisor]

CPGE Parallel Processing
[Diagram: a reservoir model partitioned across CPUs 1-6, comparing FD and FD & DD decompositions]

Domain Decomposition
- Fundamental strategy for grid-based parallel simulation
- Ghost (halo) layer creation and communication between neighboring subdomains
- Example: a 10 x 15 grid split across 6 processors (see the sketch below)
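A minimal sketch of the ghost-layer exchange for a 1-D decomposition, using mpi4py (an assumption for illustration; the simulators in this study call MPI directly). Each rank owns a strip of the grid plus one ghost column on each side and swaps boundary columns with its neighbors:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# The slide's example: a 10 x 15 grid strip-decomposed along the
# 15-cell axis across `size` ranks (6 in the slide).
NX, NY = 10, 15
counts = [NY // size + (1 if r < NY % size else 0) for r in range(size)]
local_ny = counts[rank]

# Local block plus one ghost column on each side.
u = np.zeros((NX, local_ny + 2))
u[:, 1:-1] = rank  # fill owned cells with a recognizable value

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Exchange ghost layers with both neighbors. Column slices of a
# C-ordered array are strided, so copy through contiguous buffers.
send_left = np.ascontiguousarray(u[:, 1])
send_right = np.ascontiguousarray(u[:, -2])
recv_left = np.empty(NX)
recv_right = np.empty(NX)

comm.Sendrecv(send_left, dest=left, recvbuf=recv_right, source=right)
comm.Sendrecv(send_right, dest=right, recvbuf=recv_left, source=left)

if right != MPI.PROC_NULL:
    u[:, -1] = recv_right
if left != MPI.PROC_NULL:
    u[:, 0] = recv_left
```

Run with, for example, `mpirun -np 6 python ghost_exchange.py`; after the exchange each rank's ghost columns hold its neighbors' boundary values. The same pattern extends to 2-D decompositions, where each subdomain exchanges ghost layers with up to four neighbors.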

Performance Issues in Parallel Processing: Software
- Design
- Algorithm parallelization
- Programming practice
- Load balancing

Performance Issues in Parallel Processing: Hardware
- Configuration
- CPU
- Cache
- Memory subsystem
- Front-side bus
- I/O bandwidth
- Interconnect

Hardware: Interconnects

Type               Bandwidth (MB/s)   Latency (µs)
Fast Ethernet      9.0                170
Gigabit Ethernet   80                 170
Giganet            110                7.5
Myrinet            225                6-7
InfiniBand 4x      500                6-8
Quadrics           330                4.5
Dolphin            385                4

(The original slide labeled latency in ms; given the values, these are microseconds.)
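These two numbers combine in the usual first-order cost model for a message of n bytes, t(n) = latency + n / bandwidth. The sketch below, with the table's figures hard-coded as assumptions, shows why latency dominates the small ghost-layer messages of a domain-decomposed simulator while bandwidth dominates bulk transfers:

```python
# First-order message cost model: t(n) = latency + n / bandwidth.
# Figures from the table above; note 1 MB/s == 1 byte/µs, so the
# division below yields microseconds directly.
INTERCONNECTS = {
    "Fast Ethernet":    (9.0, 170.0),   # (MB/s, µs)
    "Gigabit Ethernet": (80.0, 170.0),
    "Myrinet":          (225.0, 6.5),   # midpoint of the quoted 6-7 µs
    "InfiniBand 4x":    (500.0, 7.0),   # midpoint of the quoted 6-8 µs
}

def message_time_us(name, nbytes):
    """Estimated one-way time in microseconds for an nbytes message."""
    bw_mb_s, lat_us = INTERCONNECTS[name]
    return lat_us + nbytes / bw_mb_s

for name in INTERCONNECTS:
    small = message_time_us(name, 1_000)        # a thin ghost-layer message
    large = message_time_us(name, 10_000_000)   # a bulk array transfer
    print(f"{name:17s} 1 KB: {small:7.1f} µs   10 MB: {large:9.0f} µs")
```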

CPGE-1 (Ararat)
- 12 nodes / 16 processors
- 1.0 GHz Intel Pentium III Xeon processors
- 256 MB of memory
- Diskless configuration
- 100 Mbps switched Fast Ethernet and Giganet interconnects

TACC-1 (Tejas)
- 32 nodes / 64 processors
- 1.0 GHz Intel Pentium III processors
- 1 GB of memory per processor
- 225 MB/s Myrinet-2000 interconnect

Parallel Reservoir Simulators
- Chevron-Texaco
- Conoco-Phillips
- Exxon-Mobil
- IFP and Beicip-Franlab
- Landmark Graphics Corporation
- Schlumberger-GeoQuest
- Saudi Aramco
- UT CPGE, UT CSM
Note: 93 clusters in the Top500 supercomputer sites; 23 in the oil and gas sector.

Compositional Reservoir Simulation on HPCs

Project Objectives
Develop a general-purpose adaptive simulator (GPAS) capable of:
- modeling complex physical processes, including EOS compositional, chemical, black-oil, and thermal
- high-resolution studies on supercomputers and high-performance clusters

HPC Initiatives
- Evaluate and compare the performance of different cluster systems
- Test and analyze the performance of different parallel simulators
- Identify areas of improvement in parallel algorithm design and cluster setup for optimal parallel reservoir simulation

Summary of Clusters

Cluster            CPU Type          CPU Speed (MHz)  CPUs (nodes x CPUs/node)  Memory per CPU  Interconnect
CPGE-1 (Fuji)      Pentium II        300              16x1=16                   384 MB          Fast Ethernet
CPGE-1 (Rocky)     Pentium II Xeon   400              8x2=16                    256 MB          Fast Ethernet
CPGE-1 (Ararat)    Pentium III Xeon  1000             8x1+4x2=16                256 MB          Fast Ethernet
DELL-1 (PE 1550)   Pentium III       1000             16x2=32                   512 MB          Myrinet, Gigabit, Fast Ethernet
DELL-2 (PE 2650)   Intel Xeon DP     2400             64x2=128                  1 GB            Myrinet, Gigabit, Fast Ethernet
TACC-1 (Tejas)     Pentium III       1000             32x2=64                   512 MB          Myrinet
TACC-2 (Longhorn)  Power4            1300             4x16=64                   2 GB            IBM SP Switch2

Parallel Simulators Tested
- GPAS
- VIP (2003r4)

CPGE Simulator (GPAS)
- EOS compositional (Peng-Robinson EOS)
- Fully implicit
- PETSc linear solvers
- Parallel (IPARS framework)

Performance Results

Base Benchmark Problem
- Compositional model, 3-component Peng-Robinson EOS
- Dry-gas cycling process
- Reservoir size: 800 x 11,200 x 160 ft, homogeneous
- 2 wells: 1 injector, 1 producer
- Grid: 16 x 224 x 8 (28,672 cells)
- Unknowns: 229,376 (eight per cell)
- 100 days of gas injection
- One-dimensional domain decomposition

Single-Processor Execution Times (GPAS), Base Benchmark Problem

System                                    Execution Time (sec)
Fuji (Pentium II, 300 MHz)                1030.3
Rocky (Pentium II Xeon, 400 MHz)          615.2
PowerEdge 1550 (Pentium III, 1.0 GHz)     313.3
Ararat (Pentium III Xeon, 1.0 GHz)        309.3
TACC Tejas (Pentium III, 1.0 GHz)         306.38
Dell PE 2650 (Intel Xeon DP, 2.4 GHz)     180.186

Multi-Processor Execution Times (GPAS), Base Benchmark Problem
[Chart: execution time in seconds (log scale, 1 to 10,000) versus number of processors (0 to 16) for Fuji, Rocky, Ararat, PE 1550, PE 2650, Tejas, and Longhorn]

Multi-Processor Speedups (GPAS), Base Benchmark Problem
[Chart: speedup versus number of processors (up to 32) for Fuji (FE), Rocky (FE), Ararat (FE), PE 1550 (FE), PE 2650 (FE), Tejas (Myrinet), and Longhorn, against the ideal linear speedup]
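Speedup here is the usual strong-scaling metric S(p) = T(1)/T(p), plotted against the ideal S(p) = p. A small sketch of how such curves are computed from measured wall-clock times; the timing values below are placeholders, not the paper's measurements:

```python
def speedup_and_efficiency(times_by_procs):
    """Strong-scaling metrics from wall-clock times: S(p) = T(1)/T(p),
    E(p) = S(p)/p. `times_by_procs` maps processor count -> seconds."""
    t1 = times_by_procs[1]
    return {p: (t1 / tp, t1 / tp / p) for p, tp in times_by_procs.items()}

# Hypothetical measurements, not the paper's data.
measured = {1: 313.3, 2: 165.0, 4: 88.0, 8: 49.0, 16: 30.0}
for p, (s, e) in sorted(speedup_and_efficiency(measured).items()):
    print(f"{p:2d} procs: speedup {s:5.2f}, efficiency {e:4.0%}")
```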

Comparison of MPI Libraries and Interconnects (GPAS), Base Benchmark Problem
[Chart: speedup versus number of processors (up to 32) on the Dell PE 2650, single processor per node, for MPICH over Gigabit Ethernet, MPICH-GM over Myrinet, MPI/Pro over Gigabit Ethernet, and MPICH over Fast Ethernet, against the ideal linear speedup]

Constant Problem Size per Processor (GPAS)
[Chart: execution time in seconds (0 to 800) for Fuji, Rocky, Ararat, and Tejas as grid size and processor count scale together: 19,200 cells on 1 CPU; 38,400 on 2; 76,800 on 4; 153,600 on 8; 307,200 on 16; 614,400 on 32]
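This is a weak-scaling experiment: the per-processor workload is held fixed at 19,200 cells per CPU, so a perfectly scalable code would show a flat execution-time curve. A minimal sketch of the corresponding efficiency metric, with placeholder timings rather than the paper's data:

```python
def weak_scaling_efficiency(times_by_procs):
    """Weak-scaling efficiency E(p) = T(1)/T(p) for a fixed
    per-processor problem size; E(p) = 1.0 means a flat curve."""
    t1 = times_by_procs[1]
    return {p: t1 / tp for p, tp in times_by_procs.items()}

# Hypothetical timings for 19,200 cells/CPU, not the paper's data.
measured = {1: 200.0, 2: 210.0, 4: 225.0, 8: 250.0, 16: 290.0, 32: 360.0}
for p, e in sorted(weak_scaling_efficiency(measured).items()):
    print(f"{p:2d} CPUs ({19200 * p:7d} cells): efficiency {e:4.0%}")
```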

Modified Benchmark Problem
- Compositional model, 3-component Peng-Robinson EOS
- Dry-gas cycling process
- Reservoir size: 7.3 x 24.2 x 0.1 miles
- Grid: 77 x 256 x 10 (197,120 cells)
- Unknowns: 1.57 million (eight per cell)
- Anisotropic, layered permeability with Kv/Kh = 0.1
- 88 wells: 54 injectors, 24 producers, staggered line drive
- Injectors and producers are fully completed
- 100 days of gas injection
- One-dimensional domain decomposition

Multi-Processor Execution Times (GPAS), Modified Benchmark Problem
[Chart: execution time in seconds (log scale, 10 to 10,000) versus number of processors (0 to 64) on the Dell PE 2650 for Gigabit single, Myrinet single, Fast Ethernet single, and Myrinet dual processor-per-node configurations]

Multi-Processor Speedups (GPAS), Modified Benchmark Problem
[Chart: speedup versus number of processors (up to 64) on the Dell PE 2650 for Gigabit single, Myrinet single, Fast Ethernet single, and Myrinet dual configurations, against the ideal linear speedup]

Commercial Parallel Simulator

Remarks
- Our goal was to run the simulators in parallel mode and evaluate their performance for typical cases
- Our goal was to analyze the issues involved in using the simulators in parallel, and approaches to improving performance and design
- We did not tune the simulators for optimum performance
- We did not compare or match the material-balance errors of the simulator runs

Benchmark Problem for VIP
- Compositional model, modified SPE3 comparison project
- 9-component Peng-Robinson EOS
- Gas condensate with gas-cycling process
- Reservoir size: 10 miles x 4 miles x 160 ft
- Grid: 180 x 72 x 4 (51,840 cells)
- 1 million unknowns
- Flow barriers present (using transmissibility modifiers)
- 20 wells: 10 injectors, 10 producers
- 10 years of cycling followed by 5 years of production

Multi-Processor Performance (VIP)

Multi-Processor Execution Times (VIP), Modified SPE3 Comparison Problem
[Chart: elapsed time in seconds (0 to 12,000) versus number of processors (0 to 16) for Fuji and Rocky]

Multi-Processor Speedups (VIP), Modified SPE3 Comparison Problem
[Chart: speedup versus number of processors (up to 16) for Fuji and Rocky, against the ideal linear speedup]

Constant Problem Size per Processor (VIP), Modified SPE3 Comparison Problem
[Chart: execution time in seconds (0 to 10,000) for Fuji and Rocky as cell count and processor count scale together: 25,920 cells on 1 CPU; 51,840 on 2; 103,680 on 4; 207,360 on 8; 414,720 on 16]

Million-Cell Commercial Benchmark Problem for VIP
- IMPES scheme
- 7-component Peng-Robinson EOS
- Grid: 100 x 100 x 100 (1 million cells)
- 16 million unknowns
- Stochastically characterized data field
- 11 wells
- 49-year run

Performance Speedups (VIP), Million-Gridblock Problem
[Chart: speedup versus number of processors (up to 64) on the Dell PE 2650 for VIP, against the ideal linear speedup]

Summary
- Tested GPAS and analyzed its performance on new hardware
- Benchmarked the performance of new clusters
- Compared the performance of different interconnects and MPI libraries
- Tested the commercial reservoir simulator VIP in parallel mode

Acknowledgements
- US Department of Energy
- Reservoir Simulation Joint Industry Project members
- Dell Computer Corporation