Comparing the performance of the Landmark Nexus reservoir simulator on HP servers



WHITE PAPER
Comparing the performance of the Landmark Nexus reservoir simulator on HP servers
Landmark Software & Services

SOFTWARE AND ASSET SOLUTIONS
Comparing the performance of the Landmark Nexus reservoir simulator on HP servers
Including a comparison with the VIP reservoir simulator
S. Crockett, Landmark, and S. Devere, HP

Introduction
This paper discusses the results of a benchmarking study conducted with Landmark Nexus reservoir simulation software to determine its performance characteristics on a variety of HP ProLiant x86_64 servers, and to provide recommendations for server configurations appropriate for optimal Nexus software performance. The tests and configurations were chosen to show likely performance results for a wide variety of server hardware. If specific datasets or configurations are of interest, see the For more information section.

Configurations tested
The computer systems used in this study were clusters of server nodes connected by a high-speed network or by Gigabit Ethernet. Each server node was configured with two processors, which is the most common server configuration used for high-performance computing. Servers with quad-core processors from both Intel and AMD were tested, as were servers with dual-core Intel processors. HP does not currently manufacture servers with dual-core AMD processors, as these are no longer part of AMD's current processor product line. Both rack-mount and blade servers were tested. The HP ProLiant DL server line was used for rack-mount servers; the HP BladeSystem c-Class portfolio was used for blade servers, with HP ProLiant BL blades.

Processor speed
Both Intel and AMD provide ranges of processors with similar architectures but different clock speeds. Rather than test every possible processor, a representative was chosen from each processor class: not the one with the highest clock speed, but one with a reasonable price and power utilization relative to its performance. For example, for testing the HP ProLiant BL460c, the 3.0 GHz Intel Xeon 5450 processor was chosen over the 3.16 GHz Intel Xeon 5460 processor, since the processor performance difference was not significant relative to the price difference or the power consumption difference.

Network
The Nexus reservoir simulator is a parallel application that uses HP-MPI to communicate between processes on multiple servers. Parallel runs of the Nexus simulator require the input model data to be decomposed into sub-domains. The Nexus model chosen for this testing had 64 sub-domains, which allowed performance to be measured using up to 64 cores. Most of the clusters we evaluated had 32 cores, meaning we used eight servers when testing dual-core processors and four servers when testing quad-core processors. We performed some runs on clusters with 64 cores; these clusters had eight servers, each with two quad-core processors. Gigabit Ethernet, InfiniBand (IB) DDR (double data rate) ConnectX, and 10 Gigabit Ethernet were tested and compared.
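The node-count arithmetic above (64 sub-domains spread across dual-core or quad-core two-processor nodes) can be sketched as a small sizing helper. The function name and defaults below are purely illustrative; they are not part of Nexus or HP-MPI:

```python
import math

def servers_needed(num_subdomains, cores_per_processor, processors_per_server=2):
    """Illustrative sizing helper (not a Nexus/HP-MPI API): how many
    two-processor server nodes are needed so that each sub-domain
    can be assigned its own core."""
    cores_per_server = processors_per_server * cores_per_processor
    return math.ceil(num_subdomains / cores_per_server)

# 64 sub-domains on quad-core (2p8c) nodes: 64 / 8 = 8 servers
print(servers_needed(64, 4))  # 8
# A 32-process run on dual-core (2p4c) nodes: 32 / 4 = 8 servers
print(servers_needed(32, 2))  # 8
# ...and on quad-core (2p8c) nodes: 32 / 8 = 4 servers
print(servers_needed(32, 4))  # 4
```

This matches the cluster shapes used in the study: eight dual-core servers or four quad-core servers for the 32-core runs, and eight quad-core servers for the 64-core runs.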

Memory
Tested configurations had 2 GB/core, which was suitable for these tests. Most Nexus data sets will be able to run on a similarly configured cluster with no swapping to disk. The servers with dual-core processors were configured with eight 1 GB memory DIMMs; those with quad-core processors used eight 2 GB memory DIMMs. All used 667 MHz DIMMs (DDR PC5300).

File system
Since I/O generally is not a determining factor in the performance of Nexus simulation runs, two striped 15K-RPM disks attached locally on each server were used to store all files used and generated by the tests.

Software
Nexus R500.1 and HP-MPI v2.2.5.1 were used. Landmark Nexus software enables fully implicit, fully coupled surface-to-subsurface simulation for a comprehensive look at an oil reservoir. The Nexus reservoir simulator couples the surface network model with the subsurface model in a way that allows a simultaneous solution of the two models, which is computed both faster and more accurately than a loosely coupled solution. Furthermore, multiple reservoirs can be modeled simultaneously with a shared surface network, which is necessary to accurately model many of today's off-shore oil developments. HP-MPI is the only message-passing interface supported by Landmark for the R5000 release of Nexus software and VIP software, and is provided for both Linux and Windows operating systems.

Specific test configurations
Servers with dual-core processors are annotated as 2p4c, signifying two processors with a total of four cores. Servers with quad-core processors are marked as 2p8c. HP BladeSystem server names begin with BL, while rack-mount model names begin with DL.

HP ProLiant model                 Processor model (code name)    Speed    Config  Front-side bus (MHz)
DL160 G5                          Intel Xeon 5272 (Wolfdale)     3.4 GHz  2p4c    1,600
BL460c G1, DL360 G5 or DL380 G5   Intel Xeon 5160 (Woodcrest)    3.0 GHz  2p4c    1,333
BL465c G5 and DL165 G5            AMD Opteron 2384 (Shanghai)    2.7 GHz  2p8c    n/a
BL465c G5 and DL165 G5            AMD Opteron 2356 (Barcelona)   2.3 GHz  2p8c    n/a
DL160 G5                          Intel Xeon 5472 (Harpertown)   3.0 GHz  2p8c    1,600
BL460c G5                         Intel Xeon 5450 (Harpertown)   3.0 GHz  2p8c    1,333

Landmark data set
The spe10_64grids data set was used for these benchmarks. This data is derived from Model 2 of the Tenth Society of Petroleum Engineers (SPE) Comparative Solution Project, in which a waterflood of a large geostatistical model was modeled. The reservoir modeled had the characteristics of a Brent sequence, in which the upper 70 feet represent the Tarbert formation and the lower 100 feet represent the Upper Ness formation. The model had slightly more than 1.1 million cells, of which 766,000 were active. The grid was decomposed into 64 subgrids, which allowed the model to be run in parallel using up to 64 cores. The model was run to simulate 2,000 days of production from four wells, with a fifth well as a water injector; all five wells were vertical. Conditions in the reservoir were maintained such that no free gas was present in the reservoir throughout the entire run.

Performance summary
People involved in purchasing decisions for HPC systems often use different criteria to determine the best solution for their needs. Some find that the solution with the shortest runtime is best. Others prefer a solution in which the ratio of run speed to system cost is maximized. Still others are most concerned with parallel scalability. Since these are all valid ways to characterize the performance of a solution, all will be presented below, first in summary, then in detail. As newer servers become available, this report will be updated, but these fundamental methods of looking at the data will always apply.

Nexus software: Breakthrough performance
The Nexus simulator was designed with performance and parallel scalability as important features. This differentiates it from most commercially available reservoir simulators, in which performance was often sacrificed as features were added, or in which parallel capabilities were added long after the original commercial release. This focus on performance has led Nexus software to be faster than other simulators by a wide margin; for some models, Nexus simulation runs require less than one-fifth the time of other simulators. Nexus models with a single reservoir and simple wells with a minimal surface network generally do not show as much performance improvement as models with multiple reservoirs and more complex surface networks. However, even for simpler cases, such as the benchmark data chosen for this study, Nexus simulation software can provide a significant performance benefit. This can be seen in Figure 1, which shows Nexus software to be more than three times faster on this benchmark model than Landmark's VIP reservoir simulator, for the tested range of process counts.

Figure 1: Comparison of Landmark's two reservoir simulators (Nexus vs. VIP) at 4-, 8-, 16- and 32-way parallel, IB DDR ConnectX, HP-MPI; test: spe10_64grids. Relative performance (bigger is better for Nexus); systems shown: HP DL160 G5 Xeon 5272 3.4 GHz 2p4c and HP DL160 G5 Xeon 5472 3.0 GHz 2p8c.

Absolute performance comparison
Looking at absolute performance is useful to those requiring the best throughput. Of the servers using Xeon processors, the HP ProLiant DL160 G5 with the dual-core Xeon 5272 processor performed best. Of the servers using AMD Opteron processors, the ProLiant BL465c G5 and DL165 G5 with the Opteron 2384 performed best.

Price-performance ratio
The price-performance ratio shows which solution delivers the best performance when scaled by the U.S. list price of the server cluster. This analysis does not include the cost of the software licenses, so it exaggerates the importance of the price difference between clusters. The BL465c G5 and DL165 G5 with AMD Opteron 2356 processors using Gigabit Ethernet (GigE) had the best price-performance ratio, followed by the BL465c G5 and DL165 G5 with AMD Opteron 2384 (Shanghai) processors using both IB DDR and GigE.

Scaling and network
Here we compare the relative performance of multiple servers to a single server, to see the benefit of using the cluster for large parallel runs. Scalability is most important when the user's workflow gives them the option of running multiple serial jobs, a smaller number of parallel jobs, or a mix. Better scalability improves the relative efficiency of including a higher fraction of parallel jobs in the workflow. High-speed interconnects such as InfiniBand are recommended for optimal scalability when running Nexus software. The benefit of a high-speed interconnect can be seen in the comparisons between GigE, InfiniBand, and 10 GbE.

Benchmark results and discussion

General overview
There is no performance difference between blade systems and rack-mount servers if the internal components are identical. There are some HP blade products that do not have equivalent rack-mount servers; the converse is true as well. For example, the ProLiant DL160 G5 rack-mount servers with Xeon 5472 and 5272 processors have front-side bus speeds of 1,600 MHz, which are not offered in the blade BL460c counterpart. When an application is demanding on the memory subsystem, as Nexus software is, there is a benefit to the faster front-side bus. This can be seen when comparing the results of the Xeon 5472 processor to the Xeon 5450 processor, both of which are 3.0 GHz quad-core processors; the front-side bus of the blade system (1,333 MHz) is slower than that of the rack-mount server (1,600 MHz), and although the processor clock and the other components are identical, the server with the faster front-side bus performs much better on the Nexus simulator benchmark run.

Multi-core processor suitability for Nexus simulation
The benchmark runs presented here have been performed with all cores on the blades or servers in use, except for cases where the number of Nexus processes is less than the number of cores available in a single node. Runs using all cores in a node, however, also must share access to memory across the cores. All the cores in a single processor share the same path to memory; a quad-core processor, having more cores, gives each core a smaller fraction of that path than a dual-core processor does. Thus, when all cores are in use, the older quad-core Harpertown processor had poorer performance for Nexus software than two Wolfdale dual-core processors with the same processor speed. Because Nexus simulator performance is so dependent upon access to memory, as noted above, servers with the highest memory bandwidth per core perform best.

One way to increase per-core performance on a server built with quad-core processors is to reduce the load on the server by using only half the cores in the system, effectively treating each quad-core processor as though it were dual-core. Timing results from runs done in this manner are shown in Figure 2. The left chart shows the relative performance of runs using a fixed number of servers, either half loaded or fully loaded; the number of cores used in the fully loaded case is double the number used in the half-loaded case. Under these conditions, the runs with more cores are always faster, but doubling the number of cores used results in only minimal performance gain. The right chart shows the relative performance of two sets of runs using the same number of active cores. For one set of runs, the minimal number of servers was used, with all cores of each processor in use, while, for the other, each processor had only half its cores used, which requires doubling the number of servers. The set of runs using more servers but only partially loading them had much better performance than the set of runs using fully loaded servers.

Figure 2: Comparison of half-loaded (2 cores per processor) to fully loaded (4 cores per processor) runs on the DL160 G5 Xeon 5472 3 GHz using one, two, four and eight servers, IB DDR ConnectX, HP-MPI; test: spe10_64grids. The time to run a job is always reduced by adding more cores (left chart), but there is a benefit to spreading those cores across additional servers if they are available (right chart).
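The half-loading argument above is essentially bandwidth arithmetic: the cores of one processor share one path to memory, so idling half of them doubles each active core's share. A rough sketch of that reasoning; the peak-bandwidth formula and numbers here are illustrative assumptions, not measured values from the study:

```python
def fsb_bandwidth_per_core(fsb_mhz, active_cores_per_processor, bus_width_bytes=8):
    """Rough peak front-side-bus bandwidth (MB/s) per active core,
    assuming an 8-byte-wide bus shared evenly among the cores in use.
    Illustrative only; real memory-subsystem sharing is more complex."""
    return fsb_mhz * bus_width_bytes / active_cores_per_processor

# A 1,600 MHz FSB quad-core processor, fully loaded vs. half loaded:
full = fsb_bandwidth_per_core(1600, 4)  # all 4 cores active
half = fsb_bandwidth_per_core(1600, 2)  # treated as dual-core
print(full, half)  # 3200.0 6400.0 -> half-loading doubles each core's share
```

This is why, for a bandwidth-bound code like the Nexus simulator, the same core count spread across twice as many half-loaded servers runs faster.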
Absolute performance results
Figures 3 and 4 show the absolute performance of the servers tested, relative to the BL460c Xeon 5450 3.0 GHz cluster. For clusters using InfiniBand (specifically IB DDR ConnectX), at all tested process counts, the best performing server node is the DL160 G5 (or BL460 G5) with 3.4 GHz Xeon 5272 dual-core processors. When Gigabit Ethernet is the private network for MPI communications, this same server node is the fastest for all tested cases except for runs with eight parallel processes. The reason for this exception is that an eight-way parallel job fits in a single node for all the other systems tested, but not for this cluster with dual-core processors. The cluster interconnect is thus being used in the eight-way run only on the cluster built from dual-core processors. The off-node communications take relatively more time for this cluster than do intra-node communications on the other clusters, which decreases the relative performance of this cluster. Note that for higher process counts, where all clusters are using multiple nodes, the cluster built from dual-core processors again leads in performance.

Figure 3: Comparison of HP server relative performance using IB DDR ConnectX, with the BL460c Xeon 5450 3 GHz as a baseline, for 8-, 16- and 32-way parallel, HP-MPI; test: spe10_64grids. Systems compared: BL460c G5 Xeon 5450 3.0 GHz 2p8c, DL160 G5 Xeon 5472 3.0 GHz 2p8c, DL165 G5 Opteron 2356 2.3 GHz 2p8c, BL465 G5 Opteron 2384 2.7 GHz 2p8c, and DL160 G5 Xeon 5272 3.4 GHz 2p4c.

Figure 4: Comparison of HP server relative performance using Gigabit Ethernet, with the BL460c Xeon 5450 3 GHz as a baseline, for 8-, 16- and 32-way parallel, HP-MPI; test: spe10_64grids.

Price-performance ratio
The price data used for this calculation is the U.S. list price of the cluster hardware as of November 2008. The price for blade configurations includes the entire enclosure, while, for rack-mount servers, it includes one 42U rack. Red Hat Linux with 9x5 one-year tech support is also included in the price. This analysis does not include the cost of Nexus software licenses, which leads to more weight being given to hardware pricing differences than an actual user would see. The head node for Xeon-based clusters is a ProLiant DL380 G5 with the Xeon 5450 processor, and for Opteron-based clusters it is a DL385 G5 with the AMD Opteron 2356 processor, regardless of the configuration used in the compute nodes. The clusters with the best price-performance are built from DL165 G5 server nodes using either Opteron 2384 or 2356 processors, and use a Gigabit Ethernet interconnect for MPI traffic. The DL165 G5 Opteron 2356 is used as the baseline in Figure 5, with other servers shown relative to it.

Figure 5: Price-performance ratio (lower is better) of a range of 32-core clusters of servers, 32-way parallel, HP-MPI; test: spe10_64grids.

Scaling
In a multi-user environment where the users have jobs of widely varying size and duration, the ability to add processors to a job in an efficient manner allows a fixed hardware resource to be used effectively. Large jobs can be submitted in parallel to work around memory limitations; additionally, jobs of long duration can be run more quickly in parallel. If the overall solution (hardware, software and data) does not scale well, the overall effectiveness of the solution will be limited when large or long-duration jobs must be run. When reviewing scalability information, the absolute performance results shown in Figures 3 and 4 must also be considered. A high-performing server may scale less effectively than a lower-performing server, and thereby be less effective as a part of a cluster.

Figure 6: Comparison using high-speed interconnects (IB DDR ConnectX and 10 GbE), showing scaling from one to eight servers to 32 or 64 cores, where all cores in each server are used; HP-MPI; test: spe10_64grids.
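The price-performance metric plotted in Figure 5 is simply the cluster's U.S. list price scaled by its measured performance, with lower values better. A minimal sketch; the prices and performance figures below are hypothetical stand-ins, since the paper's actual list prices are not reproduced here:

```python
def price_performance(cluster_list_price_usd, relative_performance):
    """Price-performance ratio as used in Figure 5: cluster list price
    divided by measured relative performance (lower is better).
    Inputs below are hypothetical, for illustration only."""
    return cluster_list_price_usd / relative_performance

# Two hypothetical 32-core clusters: a cheaper GigE configuration and a
# faster but pricier InfiniBand configuration.
gige = price_performance(80_000, 1.0)
ib = price_performance(95_000, 1.3)
print(gige, ib)
```

In this made-up example the InfiniBand cluster ends up with the better (lower) ratio despite its higher price, which mirrors how a faster interconnect can pay for itself in this metric.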

Figure 7: Comparison using the Gigabit Ethernet interconnect, showing scaling from one to eight servers to 32 or 64 cores, where all cores in each server are used; HP-MPI; test: spe10_64grids.

Figure 8: Relative performance of high-speed interconnects (IB DDR ConnectX, except where noted, and 10 GbE) to Gigabit Ethernet at 16 and 32 cores, showing the performance advantage of the high-speed interconnects (bigger is better for IB); HP-MPI; test: spe10_64grids.

Scalability is a useful way to determine the benefit of running on a cluster. In Figures 6 and 7 we compare the performance of a single server with that of a cluster of multiple servers to see the benefit of the cluster for runs of up to 32 processes (in some cases, to 64 processes). In all cases, the servers are fully loaded. When using servers with dual-core processors, we compare a single server to two, four, and eight servers; when using servers with quad-core processors, we compare the single server to two and four servers. This provides comparisons of 32-process runs in all cases. Figure 6 shows results for high-speed interconnects (IB DDR ConnectX and 10 GbE), while Figure 7 shows results for Gigabit Ethernet. Note that most of the clusters configured with high-speed interconnects reduce the runtime of the model by a factor of more than 3 when four servers are used rather than only one, and some reduce the runtime by a factor approaching 7 when eight servers are used. When Gigabit Ethernet is used instead, most clusters of four servers cannot achieve a speed-up of 3, and none of the eight-node clusters can provide a speed-up of 5.

Network: High-speed interconnects vs. Gigabit Ethernet
The equations used by the Nexus simulator to describe the physical behavior of the reservoir being modeled have a strong dependence on the pressure field throughout the model, because pressure differences are usually the major driver of the flow of fluids in the reservoir. Changes in pressure in one area of the reservoir affect all areas of the reservoir. During a parallel simulation, reservoir pressure data in each subdomain must be communicated to other subdomains. This places a significant load on the private network connecting the cluster nodes. Thus, the performance of the interconnect network can play a significant role in determining the overall performance of parallel Nexus simulation runs.

The server systems tested for this paper, whether blades or rack-mount systems, have two built-in Gigabit Ethernet ports. One of these ports can be used for the private network. Optional add-on hardware can be used to provide a network that has lower latency and higher bandwidth than the built-in Gigabit Ethernet. The two available options that were tested were InfiniBand (specifically, DDR ConnectX IB) and 10 Gigabit Ethernet (10 GbE). Both options showed significant improvements in Nexus parallel performance when compared to the built-in Gigabit Ethernet. Figure 8 shows the performance of the high-speed interconnects relative to the built-in Gigabit Ethernet. The 10 Gigabit Ethernet was available only on one cluster, which was built from servers that used dual-core Xeon 5160 (3.0 GHz) processors. Note that for the clusters built from servers using quad-core processors, the 32-core run requires only four servers; even at this small server count, the high-speed interconnects can provide a benefit as large as 50 percent. The 10-gigabit performance is comparable to, but slightly slower than, the InfiniBand performance in the one case where a direct comparison can be made.

Figure 9: Comparison of IB DDR, 10 Gigabit Ethernet, and Gigabit Ethernet scaling to 32 cores for a specific server type (DL380 G5 Xeon 5160 3.0 GHz with IB DDR and GigE, and BL460c G1 Xeon 5160 3.0 GHz with Chelsio and BLADE 10 GbE uDAPL), HP-MPI; test: spe10_64grids.

Looking at the three interconnect types in isolation, as shown in Figure 9, it is evident that both high-speed interconnects provide a significant speed-up over Gigabit Ethernet, with InfiniBand slightly faster than 10 Gigabit Ethernet. Because the two high-speed interconnects perform in a similar manner for Nexus software, non-performance characteristics may be a deciding factor for a customer choosing the most appropriate technology for a site. Selecting a high-speed interconnect that exists elsewhere in your facility is a reasonable choice, if there is one. If not, select based on price, and be sure to include cable costs in the analysis.
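The speed-up figures quoted above (more than 3x at four servers, approaching 7x at eight on InfiniBand) use the standard definition of scaling relative to a single server. A minimal sketch, with hypothetical runtimes rather than the paper's measured values:

```python
def speedup(t_one_server, t_n_servers):
    """Scaling relative to one server, as plotted in Figures 6, 7 and 9."""
    return t_one_server / t_n_servers

def efficiency(t_one_server, t_n_servers, n_servers):
    """Fraction of ideal linear scaling achieved."""
    return speedup(t_one_server, t_n_servers) / n_servers

# Hypothetical runtimes (seconds) for one cluster at 1, 4, and 8 servers:
t1, t4, t8 = 4000, 1250, 600
print(speedup(t1, t4), efficiency(t1, t4, 4))  # 3.2x at four servers
print(speedup(t1, t8), efficiency(t1, t8, 8))  # roughly 6.7x at eight servers
```

Reading the figures this way makes the interconnect comparison concrete: a GigE cluster whose four-server speed-up stays below 3 is running at under 75 percent parallel efficiency, while the InfiniBand clusters stay well above that.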

For more information
http://www.hp.com/go/hpc
http://www.halliburton.com/landmark
http://h18004.www1.hp.com/products/blades/components/c-class-components.html
http://h18004.www1.hp.com/products/blades/components/ethernet/10gb-bl-c/index.html
http://www.bladenetwork.net/hpc/oil-gas
http://www.chelsio.com/hpc/oil-gas

Further assistance
Contact Michael Mott (Michael.Mott@hp.com) or Ken Glass (Ken.Glass@halliburton.com) for assistance with hardware specifications, such as power and cooling information, and/or custom benchmarks. Contact John Davidson (John.Davidson2@halliburton.com) with Nexus software requests.

Server and pricing info
The online HP Product Bulletin can be used to see which processors are available for each server. The HP Product Bulletin tool is available at http://www.hp.com/go/productbulletin/. QuickSpecs, search tools, and a rudimentary mechanism for list price lookup are all available there.

10 Gb Ethernet switches and adapters
The following 10 Gb Ethernet products were used to achieve the performance results of the Landmark Nexus application on the HP platform:

HP c-Class BladeSystem
- One HP 10 Gb Ethernet BL-c Switch. This 10 Gb HP switch is designed to deliver high-performance 10 Gb networking throughput at a breakthrough price. With 200 Gb (400 Gb full duplex) of aggregate switching capacity, it allows full utilization of the HP BladeSystem multi-core capabilities. For more information, see http://www.bladenetwork.net/hpc/oil-gas.html.
- Eight Chelsio S320EM HP BladeSystem 10 Gb Ethernet adapters. Chelsio's S320 family of 10 GbE adapters, using Unified Wire technology, delivers high throughput with low latency. Interprocess communication and storage networking traffic can be concurrently handled on a single S320E-class adapter. For more information, see http://www.chelsio.com/products.html.

HP ProLiant DL server cluster
- One RackSwitch G8100. The RackSwitch G8100 from BLADE Network Technologies is a top-of-rack switch that offers 480 Gb of network throughput, optimal for high-performance computing applications requiring the highest bandwidth and lowest latency. With flexible airflow options and variable-speed fans, it allows for variable mounting options in the server rack. For more information, see http://www.bladenetwork.net/hpc/oil-gas.html.
- Eight Chelsio S320E 10 Gb Ethernet adapters. As with the blade adapters above, interprocess communication and storage networking traffic can be concurrently handled on a single S320E-class adapter. For more information, see http://www.chelsio.com/products.html.

Nexus and VIP are trademarks of Halliburton. Intel and Xeon are trademarks of Intel. AMD and Opteron are trademarks of AMD. HP BladeSystem and ProLiant are trademarks of HP. All other marks are marks of their respective owners.

2009 Halliburton. All rights reserved. Sales of Halliburton products and services will be in accord solely with the terms and conditions contained in the contract between Halliburton and the customer that is applicable to the sale. H06907 5/09 www.halliburton.com